
Stat 135 : Regression

Shobhana Stoyanov
Department of Statistics
UC Berkeley

December 5, 2013


The Regression Line


We have a set of points to which we want to fit a straight line $y = \beta_0 + \beta_1 x$.
There are n subjects indexed by i, where i = 1, . . . , n.
Example: Can we use the height of the father to predict the height of the son?
There are 2 data variables, x and y:
$x_i$ is the height of the father in the i-th family
$y_i$ is the height of the son in the i-th family


Sir Francis Galton: 1822-1911

From galton.org: Victorian polymath: geographer, meteorologist, tropical explorer, founder of differential psychology, inventor of fingerprint identification, pioneer of statistical correlation and regression, convinced hereditarian, eugenicist, proto-geneticist, half-cousin of Charles Darwin...

Karl Pearson: 1857-1936

Figure: Galton, aged 87, with Pearson

Figure: Scatter plot for heights of 1078 fathers and sons (Pearson data); father's height (inches) on the horizontal axis, son's height (inches) on the vertical axis.

Figure: Pearson father-son heights; father's height on the horizontal axis, son's height on the vertical axis.
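This scatter plot can be reproduced in R. The sketch below is one way to do it, assuming the Pearson father-son data are available as the father.son data frame (columns fheight and sheight) from the UsingR package; any data frame with those two columns would work the same way.

# A minimal sketch, assuming the UsingR package and its father.son data frame
library(UsingR)   # provides father.son with columns fheight (father) and sheight (son)

plot(father.son$fheight, father.son$sheight,
     xlab = "Father's height (inches)",
     ylab = "Son's height (inches)",
     main = "Heights of 1078 fathers and sons (Pearson data)",
     pch = 20, col = "grey40")

# add the least squares line through the scatter
abline(lm(sheight ~ fheight, data = father.son), col = "red", lwd = 2)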

The Regression Line


There are 2 data variables, x and y.
x is called the predictor and y is called the response.
In the Pearson study, the father's height was the predictor and the son's height was the response.
The regression line approximates the average heights of the sons, given the heights of their fathers.
The line goes through the centers of the vertical strips.


Simple Linear Regression

y=

y=

0e

1x

1x

2x

is a linear model

is a nonlinear model

What makes a good fit ?


Model and assumptions: Simple Linear Regression


$Y_i = \beta_0 + \beta_1 x_i + e_i$, for $i = 1, \dots, n$.

Our assumptions:

$e_i \sim N(0, \sigma^2)$, and the $e_i$ are iid.
$x_i$ not random, and known.
$\beta_0$, $\beta_1$, $\sigma^2$ unknown, model parameters.

The true regression line is $y = \beta_0 + \beta_1 x$.

$Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$, and the $Y_i$ are independent.

$\bar{Y} \sim N(\beta_0 + \beta_1 \bar{x}, \sigma^2/n)$
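To make the model concrete, the sketch below simulates one data set from it in R. The sample size and parameter values are arbitrary choices for illustration, not values from the lecture; the simulated objects x and Y are reused in the later sketches.

# Simulate one data set from Y_i = beta0 + beta1 * x_i + e_i, with e_i iid N(0, sigma^2)
set.seed(135)                         # arbitrary seed, for reproducibility
n     <- 50
beta0 <- 2                            # illustrative parameter values
beta1 <- 0.5
sigma <- 1
x <- seq(1, 10, length.out = n)       # fixed, known design points
e <- rnorm(n, mean = 0, sd = sigma)   # iid normal errors
Y <- beta0 + beta1 * x + e

plot(x, Y)
abline(beta0, beta1, lty = 2)         # the true regression line, dashed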

Estimating $\beta_0$ and $\beta_1$

Let $S(\beta_0, \beta_1) = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 x_i)^2$.

Differentiating with respect to $\beta_0$ and then solving for $\hat{\beta}_0$:

$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$

Doing the same with $\beta_1$, we get (using the horizontal shift $x_i' = x_i - \bar{x}$):

$\hat{\beta}_1 = \dfrac{\sum_{i=1}^n (x_i - \bar{x}) Y_i}{\sum_{i=1}^n (x_i - \bar{x})^2} = \dfrac{\sum_{i=1}^n (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$  (Why?)
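Continuing the simulated example above, the closed-form estimates can be computed directly and checked against R's lm(); the object names x and Y carry over from that sketch and are purely illustrative.

# Least squares estimates from the closed-form expressions
beta1.hat <- sum((x - mean(x)) * Y) / sum((x - mean(x))^2)
beta0.hat <- mean(Y) - beta1.hat * mean(x)

# The same fit via R's built-in least squares routine
fit <- lm(Y ~ x)
coef(fit)                  # intercept and slope from lm()
c(beta0.hat, beta1.hat)    # should agree with coef(fit)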

Statistical properties of $\hat{\beta}_0$ and $\hat{\beta}_1$

Note that both $\hat{\beta}_0$ and $\hat{\beta}_1$ are linear functions of the $Y_i$, and so
$\hat{\beta}_j \sim N(E(\hat{\beta}_j), Var(\hat{\beta}_j))$, $j = 0, 1$.

Theorem A, page 548: $E(\hat{\beta}_0) = \beta_0$, and $E(\hat{\beta}_1) = \beta_1$.

Statistical properties of $\hat{\beta}_0$ and $\hat{\beta}_1$

Theorem B, pages 548-549:

$Var(\hat{\beta}_0) = \dfrac{\sigma^2 \sum_{i=1}^n x_i^2}{n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}$

$Var(\hat{\beta}_1) = \dfrac{n \sigma^2}{n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}$

$Cov(\hat{\beta}_0, \hat{\beta}_1) = \dfrac{-\sigma^2 \sum_{i=1}^n x_i}{n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}$

Statistical properties of $\hat{\beta}_0$ and $\hat{\beta}_1$

Proof:

$\hat{\beta}_1 = \dfrac{\sum_{i=1}^n (x_i - \bar{x}) Y_i}{\sum_{i=1}^n (x_i - \bar{x})^2}$

Let $A = \sum_{i=1}^n (x_i - \bar{x})^2$, and $a_i = x_i - \bar{x}$. (Note that $\sum_{i=1}^n a_i^2 = A$.)

$\Rightarrow \hat{\beta}_1 = \dfrac{1}{A}\sum_{i=1}^n a_i Y_i$

$\Rightarrow Var(\hat{\beta}_1) = \dfrac{\sigma^2}{A^2}\sum_{i=1}^n a_i^2 = \dfrac{\sigma^2}{A} = \dfrac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}$
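One way to see Theorem B in action is a small Monte Carlo check: keep the design fixed, simulate Y repeatedly, and compare the empirical variance of $\hat{\beta}_1$ with $\sigma^2/A$. A sketch with arbitrary illustrative values (separate object names, so it does not disturb the running example):

# Monte Carlo check that Var(beta1.hat) is roughly sigma^2 / A
set.seed(1)
nn <- 30
xx <- runif(nn, 0, 10)                 # fixed design, reused in every replication
b0 <- 1; b1 <- 2; sig <- 3             # arbitrary illustrative parameter values
A  <- sum((xx - mean(xx))^2)

b1.hat <- replicate(10000, {
  Yy <- b0 + b1 * xx + rnorm(nn, 0, sig)
  sum((xx - mean(xx)) * Yy) / A        # closed-form slope estimate
})

var(b1.hat)    # empirical variance across replications
sig^2 / A      # theoretical variance from Theorem B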

Statistical properties of $\hat{\beta}_0$ and $\hat{\beta}_1$

Use the variance of $\hat{\beta}_1$ to get the expressions for the variance of $\hat{\beta}_0$ and the covariance of $\hat{\beta}_0$ and $\hat{\beta}_1$.

Note that $Cov(\bar{Y}, \hat{\beta}_1) = 0$. We will use this fact later, so keep it in mind.

In which we learn more about $\hat{\beta}_0$ and $\hat{\beta}_1$

$\hat{\beta}_0$ and $\hat{\beta}_1$ are called linear unbiased estimators of $\beta_0$ and $\beta_1$.
The Gauss-Markov theorem gives us the major justification for using the least squares estimator. (It states that among all linear unbiased estimators, the least squares estimator has minimum variance.)
If we assume that the errors are normally distributed, then $\hat{\beta}_0$ and $\hat{\beta}_1$ are the MLEs of $\beta_0$ and $\beta_1$.
We usually care more about the slope than the intercept of the regression line, so let's focus on $\hat{\beta}_1$.


Regression assumptions

From Introductory Statistics by Wonnacott & Wonnacott



More about $\hat{\beta}_1$

If $Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$, then $\hat{\beta}_1 \sim N(\beta_1, \sigma^2/A)$ where $A = \sum_{i=1}^n (x_i - \bar{x})^2$.

From Introductory Statistics by Wonnacott & Wonnacott

Inference about $\beta_1$

If we assume normality (or if n is large enough), let $z = \dfrac{\hat{\beta}_1 - \beta_1}{\sigma/\sqrt{A}}$.

We need to estimate $\sigma$.

Define the Residual Sum of Squares by $RSS = \sum_{i=1}^n \hat{e}_i^2$, where $\hat{e}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$ is the estimated error (the residual).

Use $s^2 = RSS/(n-2)$ as an estimate for $\sigma^2$; we need $n-2$ since we use 2 estimates in the computation of $s$.

Inference about $\beta_1$

Then we can define $s_{\hat{\beta}_1}$, where

$s_{\hat{\beta}_1} = \dfrac{s}{\sqrt{A}} = \dfrac{\sqrt{RSS/(n-2)}}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}}$

Then $t = \dfrac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}}$ has the $t_{n-2}$ distribution.

We can build a $(1-\alpha)100\%$ confidence interval for the slope of the true regression line, $\beta_1$: $\hat{\beta}_1 \pm t_{\alpha/2,\, n-2}\, s_{\hat{\beta}_1}$


Inference about $\beta_1$

Or we could test (for instance) $H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$ using $t = \dfrac{\hat{\beta}_1}{s_{\hat{\beta}_1}}$.

If we reject the null, then we can conclude that x has some predictive value. That is, there exists a linear relationship between x and Y.
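Here is how these quantities map onto R output, continuing with the simulated fit from the earlier sketches (the objects fit, x, and Y are the illustrative ones introduced there):

# Residual sum of squares and the estimate s of sigma
RSS <- sum(resid(fit)^2)
s   <- sqrt(RSS / (length(Y) - 2))

# Standard error of the slope, t statistic for H0: beta1 = 0, and a 95% CI
A        <- sum((x - mean(x))^2)
se.beta1 <- s / sqrt(A)
t.stat   <- coef(fit)["x"] / se.beta1
ci       <- coef(fit)["x"] + c(-1, 1) * qt(0.975, df = length(Y) - 2) * se.beta1

# These agree with R's own output
summary(fit)$coefficients
confint(fit, "x")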


Confidence intervals and Prediction intervals


So far, we have focused on the whole line (slope, intercept), but now we will narrow our focus to 2 questions:

Say we are given $x = x_0$. What is $E(Y_0) = \mu_0 = \beta_0 + \beta_1 x_0$? Can we find a confidence interval for $\mu_0$?

Can we compute an interval to predict a single observed value of $Y_0$?

Note that the first question is about a mean value, and the second about a single value.


Confidence intervals for means


Now let the point estimate of $\mu_0$ be given by $\hat{\mu}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$.

Then we have that $\hat{\mu}_0 = \bar{Y} + \hat{\beta}_1 (x_0 - \bar{x})$

$\Rightarrow E(\hat{\mu}_0) = \beta_0 + \beta_1 x_0 = \mu_0$,

and $Var(\hat{\mu}_0) = Var\!\left(\bar{Y} + \hat{\beta}_1 (x_0 - \bar{x})\right)$

$\Rightarrow Var(\hat{\mu}_0) = \sigma^2 \left( \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right)$

If $x_0$ is far from $\bar{x}$, then this variance is large, and the accuracy of the estimate decreases.


Confidence intervals for means

A $100(1-\alpha)\%$ confidence interval for $\mu_0$ is given by:

$\hat{\mu}_0 \pm z_{\alpha/2}\, \sigma \sqrt{\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{A}}$

or, if we have to estimate $\sigma$, by:

$\hat{\mu}_0 \pm t_{\alpha/2,\, n-2}\, s \sqrt{\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{A}}$
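A sketch of this interval computed by hand for one value $x_0$, again with the illustrative objects fit, x, Y, s, and A from the earlier sketches; the built-in predict() call gives the same answer.

# 95% confidence interval for the mean response at an illustrative x0
x0     <- 5
mu.hat <- coef(fit)[1] + coef(fit)[2] * x0
se.mu  <- s * sqrt(1 / length(Y) + (x0 - mean(x))^2 / A)
mu.hat + c(-1, 1) * qt(0.975, df = length(Y) - 2) * se.mu

# Built-in equivalent
predict(fit, newdata = data.frame(x = x0), interval = "confidence")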

Prediction intervals
We now want to predict a single observed value $Y_0$, given a value $x_0$.

We know that the fitted value $\hat{Y}_0$ is given by $\hat{\beta}_0 + \hat{\beta}_1 x_0 = \hat{\mu}_0$.

The value of the response is the sum of the value of the true line and the error: $Y_0 = \beta_0 + \beta_1 x_0 + $ error, where the error is assumed to be (say) $N(0, \sigma^2)$.

Since we are estimating an individual value, we need to take the error into account, which means that we have two independent sources of error: one from $\hat{\mu}_0$, and the other from the error.


Prediction intervals
$Var(Y_0 - \hat{\mu}_0) = Var(\hat{\mu}_0) + \sigma^2$

$\Rightarrow Var(Y_0 - \hat{\mu}_0) = \sigma^2 \left( \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2} + 1 \right)$

A $100(1-\alpha)\%$ prediction interval is then given by:

$\hat{\mu}_0 \pm z_{\alpha/2}\, \sigma \sqrt{\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{A} + 1}$

or, if we have to estimate $\sigma$, by:

$\hat{\mu}_0 \pm t_{\alpha/2,\, n-2}\, s \sqrt{\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{A} + 1}$
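The prediction interval differs only by the extra +1 under the square root; in R, predict() distinguishes the two cases through its interval argument. A sketch continuing the same illustrative objects:

# 95% prediction interval for a single new observation at x0
se.pred <- s * sqrt(1 + 1 / length(Y) + (x0 - mean(x))^2 / A)
mu.hat + c(-1, 1) * qt(0.975, df = length(Y) - 2) * se.pred

# Built-in equivalent: "prediction" rather than "confidence"
predict(fit, newdata = data.frame(x = x0), interval = "prediction")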

Confidence intervals and Prediction intervals

Here is a picture of the father-son height data (that we saw at the beginning of this chapter), where we can see the prediction band marked by the dashed outer green lines, and the confidence band marked by the solid inner blue lines. The estimated regression line is in red.

Figure: father's height (horizontal axis) vs son's height (vertical axis), with the fitted line, confidence band, and prediction band.

Correlation Analysis and the Regression effect

Define $s_{xx} = \dfrac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$ and $s_{yy} = \dfrac{1}{n}\sum_{i=1}^n (Y_i - \bar{Y})^2$.

Further, define $s_{xy} = \dfrac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(Y_i - \bar{Y})$.

The correlation coefficient between the predictor and response data is

$r = \dfrac{s_{xy}}{\sqrt{s_{xx}\, s_{yy}}} = \dfrac{1}{n}\sum_{i=1}^n \dfrac{(x_i - \bar{x})}{\sqrt{s_{xx}}} \cdot \dfrac{(Y_i - \bar{Y})}{\sqrt{s_{yy}}}$

Note that $\hat{\beta}_1 = \dfrac{\sum_{i=1}^n (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \dfrac{s_{xy}}{s_{xx}} = \dfrac{r\sqrt{s_{xx}\, s_{yy}}}{s_{xx}} = r\sqrt{\dfrac{s_{yy}}{s_{xx}}}$

Regression to mediocrity

The regression line, revisited


Let us look at the fitted line again: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x$,

where $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$ and $\hat{\beta}_1 = r\sqrt{\dfrac{s_{yy}}{s_{xx}}}$.

We see that we can write the line as

$\hat{Y} - \bar{Y} = r\sqrt{\dfrac{s_{yy}}{s_{xx}}}\,(x - \bar{x})$

(This is covered in 14.2.3.)
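This relationship between the slope and the correlation coefficient is easy to verify numerically; a sketch with the illustrative simulated x and Y from the earlier sketches:

# Slope expressed through the correlation coefficient: beta1.hat = r * sqrt(syy / sxx)
r   <- cor(x, Y)
sxx <- mean((x - mean(x))^2)
syy <- mean((Y - mean(Y))^2)
r * sqrt(syy / sxx)    # should match the least squares slope
coef(fit)["x"]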

Notes and cautions


Since one of the assumptions of simple linear regression that we made was that Y is normal for fixed x, we can make statements about the probabilities of Y, given x.
That is, we can estimate the probabilities $P(Y \le y \mid x)$ using the estimates of $E(Y \mid x)$ and $Var(Y)$.
$r^2$ or $R^2$ is called the squared (multiple) correlation coefficient or the coefficient of determination. It is the proportion of variability of the dependent variable that is explained by the model.
Beware the regression fallacy - imputing important causes to the regression effect (e.g. the SI cover jinx, test-retest).
Don't extrapolate beyond the data.
We saw that outliers will affect the regression. It is best to do the regression with and without the outliers to see how it is affected.
Don't fit a straight line to a nonlinear relationship.

Notes and cautions


One of our big assumptions: the data are homoscedastic, that is, the variance of the errors is constant (not changing with x). If the data are heteroscedastic, that is, the errors do not have constant variance, then our estimated standard errors and CIs will not be reliable.
Note that there are two regression lines. (Example to follow.)
Think about the assumptions of simple linear regression, and where they were used.
Always look at the residual plots to make sure that your regression is valid.
If the data have a nonlinear relationship, have outliers, or are heteroscedastic, then regression is not an appropriate procedure. These properties are more easily seen in a residual plot.

Regression Diagnostics

Please refer to Philip Stark's online text, SticiGui, if you would like more details.


Regression in R: problem 14.36

From the summary, we can see that the least squares estimates are about:
$\hat{\beta}_1 = 39.87$ and $\hat{\beta}_0 = 26068$
$SE(\hat{\beta}_1) = 0.5042$ and $SE(\hat{\beta}_0) = 16.17$. What about estimating $\sigma^2$?
The Residual standard error in the summary gives us $s$, where $s^2 = \dfrac{RSS}{n-2}$ is our estimate of $\sigma^2$. Our estimate of $\sigma$ is 25.31.

Regression in R
We now have the equation of the regression line:

$\hat{y} = 26068\,(16.17) + 39.87\,(0.5042) \times \text{temperature}$

(estimated standard errors in parentheses).

When we look at the residual plot, though, it appears to show a violation of homoscedasticity.
Next, look at the normal Q-Q plot to check the assumption of normally distributed errors.
The next plot checks to see if there is a trend in the residuals, and looks at standardized residuals.
The next plot checks for influential points and points with high leverage. Cook's distance is a measure of how influential an observation is and how much leverage it has.
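The data for problem 14.36 are not reproduced here, so the sketch below is hypothetical: it assumes a data frame dat with a response column y and a predictor column temperature, and shows the usual calls behind the summary and these diagnostic plots.

# Hypothetical sketch: dat, y, and temperature are assumed names, not from the text
fit36 <- lm(y ~ temperature, data = dat)
summary(fit36)          # estimates, standard errors, and the residual standard error s

par(mfrow = c(2, 2))
plot(fit36)             # residuals vs fitted, normal Q-Q, scale-location,
                        # and residuals vs leverage (with Cook's distance contours)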
