
Stat 135 : Regression

Shobhana Stoyanov
Department of Statistics
UC Berkeley

December 5, 2013


The Regression Line


We have a set of points to which we want to fit a straight line $y = \beta_0 + \beta_1 x$.
There are n subjects indexed by i, where i = 1, . . . , n.
Example: Can we use the height of the father to predict the height of the son?
There are 2 data variables, x and y:
$x_i$ is the height of the father in the i-th family
$y_i$ is the height of the son in the i-th family


Sir Francis Galton: 1822-1911

From galton.org: Victorian polymath: geographer, meteorologist, tropical explorer, founder of differential psychology, inventor of fingerprint identification, pioneer of statistical correlation and regression, convinced hereditarian, eugenicist, proto-geneticist, half-cousin of Charles Darwin...

Karl Pearson: 1857-1936

Figure: Galton, aged 87, with Pearson

Figure: Scatter plot for heights of 1078 fathers and sons (Pearson data); father's height (inches) on the horizontal axis, son's height (inches) on the vertical axis.

Figure: Pearson father-son heights; father's height on the horizontal axis, son's height on the vertical axis.
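This scatter plot can be reproduced in R. The sketch below is one way to do it, assuming the Pearson father-son data are available as the father.son data frame (columns fheight and sheight) from the UsingR package; any data frame with those two columns would work the same way.

# A minimal sketch, assuming the UsingR package and its father.son data frame
library(UsingR)   # provides father.son with columns fheight (father) and sheight (son)

plot(father.son$fheight, father.son$sheight,
     xlab = "Father's height (inches)",
     ylab = "Son's height (inches)",
     main = "Heights of 1078 fathers and sons (Pearson data)",
     pch = 20, col = "grey40")

# add the least squares line through the scatter
abline(lm(sheight ~ fheight, data = father.son), col = "red", lwd = 2)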

The Regression Line


There are 2 data variables, x and y.
x is called the predictor and y is called the response.
In the Pearson study, the father's height was the predictor and the son's height was the response.
The regression line approximates the average heights of the sons, given the heights of their fathers.
The line goes through the centers of the vertical strips.


Simple Linear Regression

y=

y=

0e

1x

1x

2x

is a linear model

is a nonlinear model

What makes a good fit ?


Model and assumptions: Simple Linear Regression


$Y_i = \beta_0 + \beta_1 x_i + e_i$, for $i = 1, \dots, n$.

Our assumptions:

$e_i \sim N(0, \sigma^2)$, and the $e_i$ are iid.
$x_i$ not random, and known.
$\beta_0$, $\beta_1$, $\sigma^2$ unknown, model parameters.

The true regression line is $y = \beta_0 + \beta_1 x$.

$Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$, and the $Y_i$ are independent.

$\bar{Y} \sim N(\beta_0 + \beta_1 \bar{x}, \sigma^2/n)$
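To make the model concrete, the sketch below simulates one data set from it in R. The sample size and parameter values are arbitrary choices for illustration, not values from the lecture; the simulated objects x and Y are reused in the later sketches.

# Simulate one data set from Y_i = beta0 + beta1 * x_i + e_i, with e_i iid N(0, sigma^2)
set.seed(135)                         # arbitrary seed, for reproducibility
n     <- 50
beta0 <- 2                            # illustrative parameter values
beta1 <- 0.5
sigma <- 1
x <- seq(1, 10, length.out = n)       # fixed, known design points
e <- rnorm(n, mean = 0, sd = sigma)   # iid normal errors
Y <- beta0 + beta1 * x + e

plot(x, Y)
abline(beta0, beta1, lty = 2)         # the true regression line, dashed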

Estimating $\beta_0$ and $\beta_1$

Let $S(\beta_0, \beta_1) = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 x_i)^2$.

Differentiating with respect to $\beta_0$ and then solving for $\hat{\beta}_0$:

$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$

Doing the same with $\beta_1$, we get (using the horizontal shift $x_i' = x_i - \bar{x}$):

$\hat{\beta}_1 = \dfrac{\sum_{i=1}^n (x_i - \bar{x}) Y_i}{\sum_{i=1}^n (x_i - \bar{x})^2} = \dfrac{\sum_{i=1}^n (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$  (Why?)
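Continuing the simulated example above, the closed-form estimates can be computed directly and checked against R's lm(); the object names x and Y carry over from that sketch and are purely illustrative.

# Least squares estimates from the closed-form expressions
beta1.hat <- sum((x - mean(x)) * Y) / sum((x - mean(x))^2)
beta0.hat <- mean(Y) - beta1.hat * mean(x)

# The same fit via R's built-in least squares routine
fit <- lm(Y ~ x)
coef(fit)                  # intercept and slope from lm()
c(beta0.hat, beta1.hat)    # should agree with coef(fit)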

Statistical properties of $\hat{\beta}_0$ and $\hat{\beta}_1$

Note that both $\hat{\beta}_0$ and $\hat{\beta}_1$ are linear functions of the $Y_i$, and so
$\hat{\beta}_j \sim N(E(\hat{\beta}_j), Var(\hat{\beta}_j))$, $j = 0, 1$.

Theorem A, page 548: $E(\hat{\beta}_0) = \beta_0$, and $E(\hat{\beta}_1) = \beta_1$.

Statistical properties of $\hat{\beta}_0$ and $\hat{\beta}_1$

Theorem B, pages 548-549:

$Var(\hat{\beta}_0) = \dfrac{\sigma^2 \sum_{i=1}^n x_i^2}{n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}$

$Var(\hat{\beta}_1) = \dfrac{n \sigma^2}{n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}$

$Cov(\hat{\beta}_0, \hat{\beta}_1) = \dfrac{-\sigma^2 \sum_{i=1}^n x_i}{n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}$

Statistical properties of $\hat{\beta}_0$ and $\hat{\beta}_1$

Proof:

$\hat{\beta}_1 = \dfrac{\sum_{i=1}^n (x_i - \bar{x}) Y_i}{\sum_{i=1}^n (x_i - \bar{x})^2}$

Let $A = \sum_{i=1}^n (x_i - \bar{x})^2$, and $a_i = x_i - \bar{x}$. (Note that $\sum_{i=1}^n a_i^2 = A$.)

$\Rightarrow \hat{\beta}_1 = \dfrac{1}{A}\sum_{i=1}^n a_i Y_i$

$\Rightarrow Var(\hat{\beta}_1) = \dfrac{\sigma^2}{A^2}\sum_{i=1}^n a_i^2 = \dfrac{\sigma^2}{A} = \dfrac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}$
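One way to see Theorem B in action is a small Monte Carlo check: keep the design fixed, simulate Y repeatedly, and compare the empirical variance of $\hat{\beta}_1$ with $\sigma^2/A$. A sketch with arbitrary illustrative values (separate object names, so it does not disturb the running example):

# Monte Carlo check that Var(beta1.hat) is roughly sigma^2 / A
set.seed(1)
nn <- 30
xx <- runif(nn, 0, 10)                 # fixed design, reused in every replication
b0 <- 1; b1 <- 2; sig <- 3             # arbitrary illustrative parameter values
A  <- sum((xx - mean(xx))^2)

b1.hat <- replicate(10000, {
  Yy <- b0 + b1 * xx + rnorm(nn, 0, sig)
  sum((xx - mean(xx)) * Yy) / A        # closed-form slope estimate
})

var(b1.hat)    # empirical variance across replications
sig^2 / A      # theoretical variance from Theorem B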

Statistical properties of $\hat{\beta}_0$ and $\hat{\beta}_1$

Use the variance of $\hat{\beta}_1$ to get the expressions for the variance of $\hat{\beta}_0$ and the covariance of $\hat{\beta}_0$ and $\hat{\beta}_1$.

Note that $Cov(\bar{Y}, \hat{\beta}_1) = 0$. We will use this fact later, so keep it in mind.

In which we learn more about $\hat{\beta}_0$ and $\hat{\beta}_1$

$\hat{\beta}_0$ and $\hat{\beta}_1$ are called linear unbiased estimators of $\beta_0$ and $\beta_1$.
The Gauss-Markov theorem gives us the major justification for using the least squares estimator. (It states that among all linear unbiased estimators, the least squares estimator has minimum variance.)
If we assume that the errors are normally distributed, then $\hat{\beta}_0$ and $\hat{\beta}_1$ are the MLEs of $\beta_0$ and $\beta_1$.
We usually care more about the slope than the intercept of the regression line, so let's focus on $\hat{\beta}_1$.


Regression assumptions

From Introductory Statistics by Wonnacott & Wonnacott



More about $\hat{\beta}_1$

If $Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$, then $\hat{\beta}_1 \sim N(\beta_1, \sigma^2/A)$ where $A = \sum_{i=1}^n (x_i - \bar{x})^2$.

From Introductory Statistics by Wonnacott & Wonnacott

Inference about $\beta_1$

If we assume normality (or if n is large enough), let $z = \dfrac{\hat{\beta}_1 - \beta_1}{\sigma/\sqrt{A}}$.

We need to estimate $\sigma$.

Define the Residual Sum of Squares by $RSS = \sum_{i=1}^n \hat{e}_i^2$, where $\hat{e}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$ is the estimated error (the residual).

Use $s^2 = RSS/(n-2)$ as an estimate for $\sigma^2$; we need $n-2$ since we use 2 estimates in the computation of $s$.

Inference about $\beta_1$

Then we can define $s_{\hat{\beta}_1}$, where

$s_{\hat{\beta}_1} = \dfrac{s}{\sqrt{A}} = \dfrac{\sqrt{RSS/(n-2)}}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}}$

Then $t = \dfrac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}}$ has the $t_{n-2}$ distribution.

We can build a $(1-\alpha)100\%$ confidence interval for the slope of the true regression line, $\beta_1$: $\hat{\beta}_1 \pm t_{\alpha/2,\, n-2}\, s_{\hat{\beta}_1}$


Inference about $\beta_1$

Or we could test (for instance) $H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$ using $t = \dfrac{\hat{\beta}_1}{s_{\hat{\beta}_1}}$.

If we reject the null, then we can conclude that x has some predictive value. That is, there exists a linear relationship between x and Y.
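Here is how these quantities map onto R output, continuing with the simulated fit from the earlier sketches (the objects fit, x, and Y are the illustrative ones introduced there):

# Residual sum of squares and the estimate s of sigma
RSS <- sum(resid(fit)^2)
s   <- sqrt(RSS / (length(Y) - 2))

# Standard error of the slope, t statistic for H0: beta1 = 0, and a 95% CI
A        <- sum((x - mean(x))^2)
se.beta1 <- s / sqrt(A)
t.stat   <- coef(fit)["x"] / se.beta1
ci       <- coef(fit)["x"] + c(-1, 1) * qt(0.975, df = length(Y) - 2) * se.beta1

# These agree with R's own output
summary(fit)$coefficients
confint(fit, "x")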


Confidence intervals and Prediction intervals


So far, we have focused on the whole line (slope, intercept), but now we will narrow our focus to 2 questions:

Say we are given $x = x_0$. What is $E(Y_0) = \mu_0 = \beta_0 + \beta_1 x_0$? Can we find a confidence interval for $\mu_0$?

Can we compute an interval to predict a single observed value of $Y_0$?

Note that the first question is about a mean value, and the second about a single value.


Confidence intervals for means


Now let the point estimate of $\mu_0$ be given by $\hat{\mu}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$.

Then we have that $\hat{\mu}_0 = \bar{Y} + \hat{\beta}_1 (x_0 - \bar{x})$

$\Rightarrow E(\hat{\mu}_0) = \beta_0 + \beta_1 x_0 = \mu_0$,

and $Var(\hat{\mu}_0) = Var\!\left(\bar{Y} + \hat{\beta}_1 (x_0 - \bar{x})\right)$

$\Rightarrow Var(\hat{\mu}_0) = \sigma^2 \left( \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right)$

If $x_0$ is far from $\bar{x}$, then this variance is large, and the accuracy of the estimate decreases.


Confidence intervals for means

A $100(1-\alpha)\%$ confidence interval for $\mu_0$ is given by:

$\hat{\mu}_0 \pm z_{\alpha/2}\, \sigma \sqrt{\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{A}}$

or, if we have to estimate $\sigma$, by:

$\hat{\mu}_0 \pm t_{\alpha/2,\, n-2}\, s \sqrt{\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{A}}$
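A sketch of this interval computed by hand for one value $x_0$, again with the illustrative objects fit, x, Y, s, and A from the earlier sketches; the built-in predict() call gives the same answer.

# 95% confidence interval for the mean response at an illustrative x0
x0     <- 5
mu.hat <- coef(fit)[1] + coef(fit)[2] * x0
se.mu  <- s * sqrt(1 / length(Y) + (x0 - mean(x))^2 / A)
mu.hat + c(-1, 1) * qt(0.975, df = length(Y) - 2) * se.mu

# Built-in equivalent
predict(fit, newdata = data.frame(x = x0), interval = "confidence")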

Prediction intervals
We now want to predict a single observed value $Y_0$, given a value $x_0$.

We know that the fitted value $\hat{Y}_0$ is given by $\hat{\beta}_0 + \hat{\beta}_1 x_0 = \hat{\mu}_0$.

The value of the response is the sum of the value of the true line and the error: $Y_0 = \beta_0 + \beta_1 x_0 + $ error, where the error is assumed to be (say) $N(0, \sigma^2)$.

Since we are estimating an individual value, we need to take the error into account, which means that we have two independent sources of error: one from $\hat{\mu}_0$, and the other from the error.


Prediction intervals
$Var(Y_0 - \hat{\mu}_0) = Var(\hat{\mu}_0) + \sigma^2$

$\Rightarrow Var(Y_0 - \hat{\mu}_0) = \sigma^2 \left( \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2} + 1 \right)$

A $100(1-\alpha)\%$ prediction interval is then given by:

$\hat{\mu}_0 \pm z_{\alpha/2}\, \sigma \sqrt{\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{A} + 1}$

or, if we have to estimate $\sigma$, by:

$\hat{\mu}_0 \pm t_{\alpha/2,\, n-2}\, s \sqrt{\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{A} + 1}$
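The prediction interval differs only by the extra +1 under the square root; in R, predict() distinguishes the two cases through its interval argument. A sketch continuing the same illustrative objects:

# 95% prediction interval for a single new observation at x0
se.pred <- s * sqrt(1 + 1 / length(Y) + (x0 - mean(x))^2 / A)
mu.hat + c(-1, 1) * qt(0.975, df = length(Y) - 2) * se.pred

# Built-in equivalent: "prediction" rather than "confidence"
predict(fit, newdata = data.frame(x = x0), interval = "prediction")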

Confidence intervals and Prediction intervals

Here is a picture of the father-son height data (that we saw at the beginning of this chapter), where we can see the prediction band marked by the dashed outer green lines, and the confidence band marked by the solid inner blue lines. The estimated regression line is in red.

Figure: father's height (horizontal axis) vs son's height (vertical axis), with the fitted line, confidence band, and prediction band.

Correlation Analysis and the Regression effect

Define $s_{xx} = \dfrac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$ and $s_{yy} = \dfrac{1}{n}\sum_{i=1}^n (Y_i - \bar{Y})^2$.

Further, define $s_{xy} = \dfrac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(Y_i - \bar{Y})$.

The correlation coefficient between the predictor and response data is

$r = \dfrac{s_{xy}}{\sqrt{s_{xx}\, s_{yy}}} = \dfrac{1}{n}\sum_{i=1}^n \dfrac{(x_i - \bar{x})}{\sqrt{s_{xx}}} \cdot \dfrac{(Y_i - \bar{Y})}{\sqrt{s_{yy}}}$

Note that $\hat{\beta}_1 = \dfrac{\sum_{i=1}^n (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \dfrac{s_{xy}}{s_{xx}} = \dfrac{r\sqrt{s_{xx}\, s_{yy}}}{s_{xx}} = r\sqrt{\dfrac{s_{yy}}{s_{xx}}}$

Regression to mediocrity

The regression line, revisited


Let us look at the fitted line again: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x$,

where $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$ and $\hat{\beta}_1 = r\sqrt{\dfrac{s_{yy}}{s_{xx}}}$.

We see that we can write the line as

$\hat{Y} - \bar{Y} = r\sqrt{\dfrac{s_{yy}}{s_{xx}}}\,(x - \bar{x})$

(This is covered in 14.2.3.)
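This relationship between the slope and the correlation coefficient is easy to verify numerically; a sketch with the illustrative simulated x and Y from the earlier sketches:

# Slope expressed through the correlation coefficient: beta1.hat = r * sqrt(syy / sxx)
r   <- cor(x, Y)
sxx <- mean((x - mean(x))^2)
syy <- mean((Y - mean(Y))^2)
r * sqrt(syy / sxx)    # should match the least squares slope
coef(fit)["x"]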

Notes and cautions


Since one of the assumptions of simple linear regression that we made was that Y is normal for fixed x, we can make statements about the probabilities of Y, given x.
That is, we can estimate the probabilities $P(Y \le y \mid x)$ using the estimates of $E(Y \mid x)$ and $Var(Y)$.
$r^2$ or $R^2$ is called the squared (multiple) correlation coefficient or the coefficient of determination. It is the proportion of variability of the dependent variable that is explained by the model.
Beware the regression fallacy - imputing important causes to the regression effect (e.g. the SI cover jinx, test-retest).
Don't extrapolate beyond the data.
We saw that outliers will affect the regression. It is best to do the regression with and without the outliers to see how it is affected.
Don't fit a straight line to a nonlinear relationship.

Notes and cautions


One of our big assumptions: the data are homoscedastic, that is, the variance of the errors is constant (not changing with x). If the data are heteroscedastic, that is, the errors do not have constant variance, then our estimated standard errors and CIs will not be reliable.
Note that there are two regression lines. (Example to follow.)
Think about the assumptions of simple linear regression, and where they were used.
Always look at the residual plots to make sure that your regression is valid.
If the data have a nonlinear relationship, have outliers, or are heteroscedastic, then regression is not an appropriate procedure. These properties are more easily seen in a residual plot.

Regression Diagnostics

Please refer to Philip Stark's online text, SticiGui, if you would like more details.


Regression in R: problem 14.36

From the summary, we can see that the least squares estimates are about:
$\hat{\beta}_1 = 39.87$ and $\hat{\beta}_0 = 26068$
$SE(\hat{\beta}_1) = 0.5042$ and $SE(\hat{\beta}_0) = 16.17$. What about estimating $\sigma^2$?
The Residual standard error in the summary gives us $s$, where $s^2 = \dfrac{RSS}{n-2}$ is our estimate of $\sigma^2$. Our estimate of $\sigma$ is 25.31.

Regression in R
We now have the equation of the regression line:

$\hat{y} = 26068\,(16.17) + 39.87\,(0.5042) \times \text{temperature}$

(estimated standard errors in parentheses).

When we look at the residual plot, though, it appears to show a violation of homoscedasticity.
Next, look at the normal Q-Q plot to check the assumption of normally distributed errors.
The next plot checks to see if there is a trend in the residuals, and looks at standardized residuals.
The next plot checks for influential points and points with high leverage. Cook's distance is a measure of how influential an observation is and how much leverage it has.
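The data for problem 14.36 are not reproduced here, so the sketch below is hypothetical: it assumes a data frame dat with a response column y and a predictor column temperature, and shows the usual calls behind the summary and these diagnostic plots.

# Hypothetical sketch: dat, y, and temperature are assumed names, not from the text
fit36 <- lm(y ~ temperature, data = dat)
summary(fit36)          # estimates, standard errors, and the residual standard error s

par(mfrow = c(2, 2))
plot(fit36)             # residuals vs fitted, normal Q-Q, scale-location,
                        # and residuals vs leverage (with Cook's distance contours)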
