
ST 102: Elementary Statistical Theory Lecture 36 – Linear Regression: Prediction and Diagnostics

Piotr Fryzlewicz p.fryzlewicz@lse.ac.uk

Department of Statistics, LSE


Objectives:

Confidence intervals for E(y)

Predictive intervals for y

Regression diagnostics: a summary

Based on the observations $\{(x_i, y_i): i = 1, \dots, n\}$, we fit a regression model
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x.$$

Goal. Predict the (unobserved) $y$ corresponding to a (known) $x$.

Point prediction: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$.
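To make the point prediction concrete, here is a minimal Python sketch (not part of the original Minitab-based slides); the helper name fit_ols and the arrays x and y are illustrative:

import numpy as np

def fit_ols(x, y):
    """Least-squares estimates (b0, b1) for the simple linear model y = b0 + b1*x + eps."""
    xbar, ybar = x.mean(), y.mean()
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    b0 = ybar - b1 * xbar
    return b0, b1

# Point prediction at a known x, e.g. x = 40 (hypothetical data):
# b0, b1 = fit_ols(x, y)
# y_hat = b0 + b1 * 40.0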

For the analysis to be more informative, we would like to have some ‘error bars’ for our prediction. We introduce two methods:

Confidence interval for $\mu(x) \equiv E(y) = \beta_0 + \beta_1 x$

Predictive interval for $y$

Remark. A confidence interval is an interval estimator for an unknown parameter (i.e. for a constant), while a predictive interval is for a random variable. They are different and serve different purposes.

We assume the model is normal, i.e. $\varepsilon = y - \beta_0 - \beta_1 x \sim N(0, \sigma^2)$.

Confidence interval for $\mu(x) = E(y)$

Let $\hat{\mu}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$. Then $\hat{\mu}(x)$ is an unbiased estimator for $\mu(x)$.

Theorem. $\hat{\mu}(x)$ is normally distributed with mean $\mu(x)$ and variance
$$\mathrm{Var}\{\hat{\mu}(x)\} = \frac{\sigma^2}{n} \cdot \frac{\sum_{i=1}^n (x_i - x)^2}{\sum_{j=1}^n (x_j - \bar{x})^2}.$$

Proof. Note both $\hat{\beta}_0$ and $\hat{\beta}_1$ are linear estimators. Therefore $\hat{\mu}(x)$ may be written in the form $\hat{\mu}(x) = \sum_{i=1}^n b_i y_i$, where $b_1, \dots, b_n$ are some constants. Hence $\hat{\mu}(x)$ is normally distributed. To determine its distribution entirely, we only need to find its mean and variance.

$$E\{\hat{\mu}(x)\} = E(\hat{\beta}_0) + E(\hat{\beta}_1)x = \beta_0 + \beta_1 x = \mu(x),$$
$$\mathrm{Var}\{\hat{\mu}(x)\} = E\big[\{(\hat{\beta}_0 - \beta_0) + (\hat{\beta}_1 - \beta_1)x\}^2\big] = \mathrm{Var}(\hat{\beta}_0) + x^2\,\mathrm{Var}(\hat{\beta}_1) + 2x\,\mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1).$$

In Lecture 33 we derived
$$\mathrm{Var}(\hat{\beta}_0) = \frac{\sigma^2}{n} \cdot \frac{\sum_{i=1}^n x_i^2}{\sum_{j=1}^n (x_j - \bar{x})^2}, \qquad \mathrm{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{j=1}^n (x_j - \bar{x})^2}.$$

In Workshop 18 we showed $\mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) = -\sigma^2 \bar{x} \big/ \sum_{j=1}^n (x_j - \bar{x})^2$.

Hence
$$\mathrm{Var}\{\hat{\mu}(x)\} = \frac{\sigma^2}{\sum_{j=1}^n (x_j - \bar{x})^2} \cdot \frac{1}{n}\left(\sum_{i=1}^n x_i^2 + nx^2 - 2x\sum_{i=1}^n x_i\right) = \frac{\sigma^2}{\sum_{j=1}^n (x_j - \bar{x})^2}\left(\frac{1}{n}\sum_{i=1}^n x_i^2 + x^2 - 2x\bar{x}\right),$$
i.e.
$$\mathrm{Var}\{\hat{\mu}(x)\} = \frac{\sigma^2}{n} \cdot \frac{\sum_{i=1}^n (x_i - x)^2}{\sum_{j=1}^n (x_j - \bar{x})^2}.$$
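A quick numerical sanity check of this variance formula (not in the original slides): simulate many datasets from a normal linear model and compare the empirical variance of $\hat{\mu}(x)$ with the formula. All names and parameter values below are illustrative.

import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 1.0, 2.0, 1.5          # assumed true parameters
x_design = np.linspace(0.0, 10.0, 20)        # fixed design points
x_new = 7.0                                  # the x at which mu(x) is estimated

sxx = np.sum((x_design - x_design.mean()) ** 2)
mu_hats = []
for _ in range(20000):
    y = beta0 + beta1 * x_design + rng.normal(0.0, sigma, x_design.size)
    b1 = np.sum((x_design - x_design.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x_design.mean()
    mu_hats.append(b0 + b1 * x_new)

empirical = np.var(mu_hats)
theoretical = (sigma ** 2 / x_design.size) * np.sum((x_design - x_new) ** 2) / sxx
print(empirical, theoretical)                # should agree up to Monte Carlo error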

Now
$$\frac{\hat{\mu}(x) - \mu(x)}{\left\{\dfrac{\sigma^2}{n} \cdot \dfrac{\sum_{i=1}^n (x_i - x)^2}{\sum_{j=1}^n (x_j - \bar{x})^2}\right\}^{1/2}} \sim N(0, 1),$$
and $(n-2)\hat{\sigma}^2/\sigma^2 \sim \chi^2_{n-2}$, where
$$\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2.$$

Furthermore, $\hat{\mu}(x)$ and $\hat{\sigma}^2$ are independent. Hence
$$\frac{\hat{\mu}(x) - \mu(x)}{\left\{\dfrac{\hat{\sigma}^2}{n} \cdot \dfrac{\sum_{i=1}^n (x_i - x)^2}{\sum_{j=1}^n (x_j - \bar{x})^2}\right\}^{1/2}} \sim t_{n-2}.$$

A $(1 - \alpha)$ confidence interval for $\mu(x)$ is
$$\hat{\mu}(x) \pm t_{\alpha/2,\,n-2}\,\hat{\sigma}\left\{\frac{1}{n} \cdot \frac{\sum_{i=1}^n (x_i - x)^2}{\sum_{j=1}^n (x_j - \bar{x})^2}\right\}^{1/2}.$$

Recall: the above interval contains the true expectation $E(y) = \mu(x)$ with probability $1 - \alpha$. It does not cover $y$ with probability $1 - \alpha$.
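A from-scratch Python sketch of this interval (illustrative, not from the slides; fit_ols is the hypothetical helper defined earlier):

import numpy as np
from scipy import stats

def mean_ci(x, y, x_new, alpha=0.05):
    """(1 - alpha) confidence interval for mu(x_new) = E(y) at x_new."""
    n = x.size
    b0, b1 = fit_ols(x, y)
    resid = y - b0 - b1 * x
    sigma2_hat = np.sum(resid ** 2) / (n - 2)
    var_mu = (sigma2_hat / n) * np.sum((x - x_new) ** 2) / np.sum((x - x.mean()) ** 2)
    half = stats.t.ppf(1 - alpha / 2, n - 2) * np.sqrt(var_mu)
    mu_hat = b0 + b1 * x_new
    return mu_hat - half, mu_hat + half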

Predictive interval – an interval that contains $y$ with probability $1 - \alpha$

We may assume that the $y$ to be predicted is independent of $y_1, \dots, y_n$ used in estimation.

Hence $y - \hat{\mu}(x)$ is normal with mean 0 and variance
$$\mathrm{Var}(y) + \mathrm{Var}\{\hat{\mu}(x)\} = \sigma^2 + \frac{\sigma^2}{n} \cdot \frac{\sum_{i=1}^n (x_i - x)^2}{\sum_{j=1}^n (x_j - \bar{x})^2}.$$

Therefore
$$\frac{y - \hat{\mu}(x)}{\left\{\hat{\sigma}^2\left(1 + \frac{1}{n} \cdot \frac{\sum_{i=1}^n (x_i - x)^2}{\sum_{j=1}^n (x_j - \bar{x})^2}\right)\right\}^{1/2}} \sim t_{n-2}.$$

An interval covering $y$ with probability $1 - \alpha$ is
$$\hat{\mu}(x) \pm t_{\alpha/2,\,n-2}\,\hat{\sigma}\left\{1 + \frac{1}{n} \cdot \frac{\sum_{i=1}^n (x_i - x)^2}{\sum_{j=1}^n (x_j - \bar{x})^2}\right\}^{1/2}.$$

Remark. (i) It holds that
$$P\left(y \in \hat{\mu}(x) \pm t_{\alpha/2,\,n-2}\,\hat{\sigma}\left\{1 + \frac{1}{n} \cdot \frac{\sum_{i=1}^n (x_i - x)^2}{\sum_{j=1}^n (x_j - \bar{x})^2}\right\}^{1/2}\right) = 1 - \alpha.$$
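The corresponding predictive-interval sketch (again illustrative; fit_ols is the hypothetical helper above) differs from mean_ci only by the extra "1 +" under the square root:

import numpy as np
from scipy import stats

def predictive_interval(x, y, x_new, alpha=0.05):
    """(1 - alpha) predictive interval for a new observation y at x_new."""
    n = x.size
    b0, b1 = fit_ols(x, y)
    resid = y - b0 - b1 * x
    sigma2_hat = np.sum(resid ** 2) / (n - 2)
    # extra '1 +' relative to the confidence interval: the new y carries its own noise
    var_pred = sigma2_hat * (1 + np.sum((x - x_new) ** 2) / (n * np.sum((x - x.mean()) ** 2)))
    half = stats.t.ppf(1 - alpha / 2, n - 2) * np.sqrt(var_pred)
    mu_hat = b0 + b1 * x_new
    return mu_hat - half, mu_hat + half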

(ii) The predictive interval for $y$ is longer than the confidence interval for $E(y)$. The former contains the unobserved random variable $y$ with probability $1 - \alpha$; the latter contains the unknown constant $E(y)$ with probability $1 - \alpha$.

Example. The data set 'usedFord.mtw' contains the prices ($y$, in \$1,000) of 100 three-year-old Ford Tauruses together with their mileages ($x$, in 1,000 miles) when they were sold at auction. Based on those data, a car dealer needs to make two decisions:

1. to prepare cash for bidding on a three-year-old Ford Taurus with a mileage of $x = 40$;

2. to prepare for buying several three-year-old Ford Tauruses with mileages close to $x = 40$ from a rental company.

For the first task, a predictive interval would be more appropriate. For the second task, he needs to know the average price and, therefore, a confidence interval.

This can be done easily using Minitab.

MTB > regr c1 1 c2;
SUBC> predict 40.

Price = 17.2 - 0.0669 Mileage

Predictor  Coef       SE Coef   T       P
Constant   17.2487    0.1821    94.73   0.000
Mileage    -0.066861  0.004975  -13.44  0.000

S = 0.326489   R-Sq = 64.8%   R-Sq(adj) = 64.5%

Analysis of Variance
Source          DF  SS      MS      F       P
Regression       1  19.256  19.256  180.64  0.000
Residual Error  98  10.446   0.107
Total           99  29.702

Predicted Values for New Observations
New Obs  Fit      SE Fit  95% CI              95% PI
      1  14.5743  0.0382  (14.4985, 14.6501)  (13.9220, 15.2266)

New Obs  Mileage
      1     40.0

We predict that a Ford Taurus will sell for between \$13,922 and \$15,227. The average selling price of several three-year-old Ford Tauruses is estimated to be between \$14,499 and \$14,650. Because predicting the selling price for one car is more difficult, the corresponding interval is wider.
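For readers without Minitab, the same fit, CI and PI can be reproduced, for example, with Python's statsmodels (a sketch assuming the data have been loaded into arrays mileage and price):

import statsmodels.api as sm

# mileage, price: numpy arrays holding the 100 observations (assumed loaded)
X = sm.add_constant(mileage)
fit = sm.OLS(price, X).fit()

pred = fit.get_prediction([1.0, 40.0])        # new observation at x = 40
print(pred.conf_int(alpha=0.05))              # 95% CI for E(y), cf. (14.4985, 14.6501)
print(pred.conf_int(obs=True, alpha=0.05))    # 95% PI for y,   cf. (13.9220, 15.2266)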

To produce the plots with both confidence intervals for $E(y)$ and predictive intervals for $y$:

MTB > Fitline c1 c2;
SUBC> Confidence 95;
SUBC> Ci;
SUBC> Pi.

Regression Diagnostics

The usefulness of a fitted regression model rests on a basic assumption:
$$E(y) = \beta_0 + \beta_1 x.$$
Furthermore, inference such as the tests, the confidence intervals and the predictive intervals only makes sense if $\varepsilon_1, \dots, \varepsilon_n$ are (approximately) independent and normal with constant variance $\sigma^2$.

Therefore it is important to check that those conditions are met in practice; this task is called Regression Diagnostics.

Basic idea: look at the residuals $\hat{\varepsilon}_i$ or the normalized residuals $\hat{\varepsilon}_i / \hat{\sigma}$.

What to look for?

Do the residuals manifest i.i.d. normal behaviour?

Is the scatter plot of $\hat{\varepsilon}_i$ versus $x_i$ patternless?

Is the scatter plot of $\hat{\varepsilon}_i$ versus $\hat{y}_i$ patternless?

Is the scatter plot of $\hat{\varepsilon}_i$ versus $i$ patternless?

If you see trends, periodic patterns, or increasing variation in any one of the above scatter plots, it is very likely that at least one assumption is not met.

The various residual plots can be obtained in Minitab as follows (using the same example):

MTB > Fitline c1 c2;
SUBC> gfourpack;
SUBC> gvars c2.
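A rough Python/matplotlib analogue of Minitab's four-panel residual display (illustrative only; x, fitted and resid denote the regressor values, fitted values and residuals from an earlier fit):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def residual_four_pack(x, fitted, resid):
    """Normal Q-Q plot plus residuals against x, fitted values and observation order."""
    fig, ax = plt.subplots(2, 2, figsize=(9, 7))
    stats.probplot(resid, dist="norm", plot=ax[0, 0])    # normality check
    ax[0, 1].scatter(x, resid);       ax[0, 1].set_xlabel("x")
    ax[1, 0].scatter(fitted, resid);  ax[1, 0].set_xlabel("fitted values")
    ax[1, 1].plot(resid, marker="o"); ax[1, 1].set_xlabel("observation order")
    for a in ax.ravel()[1:]:
        a.axhline(0, color="grey")
        a.set_ylabel("residual")
    plt.tight_layout()
    plt.show()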

Two other issues in regression diagnostics: outliers and influential observations.

Outlier: an unusually small or unusually large $y_i$ which lies outside of the majority of observations.

An outlier is often caused by an error in either sampling or recording data. If so, we should correct it before proceeding with the regression analysis.

If an observation which looks like an outlier indeed belongs to the sample and no errors in sampling or recording were discovered, we may use a more complex model or distribution to accommodate this 'outlier'. For example, stock returns often exhibit extreme values and they often cannot be modelled satisfactorily by a normal regression model.

Remark. Strictly speaking, outliers are defined with respect to the model: under the normal regression model, $y$ is very unlikely to be more than $2\sigma$ away from $E(y) = \beta_0 + \beta_1 x$. This is how Minitab identifies potential outliers.
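A hedged sketch of this flagging rule in Python (the ±2 cut-off follows the slide; the helper is illustrative and uses the crude normalization $\hat{\varepsilon}_i/\hat{\sigma}$, whereas Minitab's standardized residuals also adjust for leverage):

import numpy as np

def flag_outliers(x, y, b0, b1):
    """Indices of observations whose normalized residual exceeds 2 in absolute value."""
    resid = y - b0 - b1 * x
    sigma_hat = np.sqrt(np.sum(resid ** 2) / (x.size - 2))
    std_resid = resid / sigma_hat      # crude version; Minitab also adjusts for leverage
    return np.where(np.abs(std_resid) > 2)[0]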

Influential observation: an $x_i$ which is far away from the other $x$'s. Such an observation may have a large influence on the fitted regression line.

Remark. (i) Minitab output marks both outliers and influential observations.

MTB > regr c1 1 c2;
SUBC> predict 40.

Price = 17.2 - 0.0669 Mileage

Unusual Observations
Obs  Mileage  Price    Fit      SE Fit  Residual  St Resid
  8     19.1  15.7000  15.9717  0.0902   -0.2717  -0.87 X
 14     34.5  15.6000  14.9420  0.0335    0.6580   2.03R
 19     48.6  14.7000  13.9993  0.0706    0.7007   2.20R
 63     21.2  15.4000  15.8313  0.0806   -0.4313  -1.36 X
 74     21.0  16.4000  15.8446  0.0815    0.5554   1.76 X
 78     44.3  13.6000  14.2868  0.0526   -0.6868  -2.13R

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

(ii) To mitigate the impact of both outliers and influential observations, we could use robust regression, i.e. estimate $\beta_0$ and $\beta_1$ by minimising the sum of absolute deviations:
$$\mathrm{SAD}(\beta_0, \beta_1) = \sum_{i=1}^n |y_i - \beta_0 - \beta_1 x_i|.$$

However, note that since the function $f(x) = |x|$ is not differentiable where it attains its minimum, we would not be able to find $\hat{\beta}_0$ and $\hat{\beta}_1$ by differentiating $\mathrm{SAD}(\beta_0, \beta_1)$ w.r.t. $\beta_0$ and $\beta_1$ and equating the partial derivatives to zero. More complex minimisation techniques would have to be used. This may be viewed as a drawback of this approach.
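One way to carry out such a minimisation numerically (an illustrative sketch, not the course's prescribed method) is a derivative-free optimiser such as Nelder-Mead from scipy, started from the least-squares estimates:

import numpy as np
from scipy.optimize import minimize

def fit_sad(x, y):
    """Estimate (b0, b1) by minimising the sum of absolute deviations."""
    sad = lambda b: np.sum(np.abs(y - b[0] - b[1] * x))
    b_init = fit_ols(x, y)              # least-squares starting point (helper above)
    res = minimize(sad, x0=np.asarray(b_init), method="Nelder-Mead")
    return res.x                        # robust estimates of beta0, beta1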

Workshop 19

In this workshop we apply the simple linear regression method to study the relationship between two financial return series: a regression of Cisco Systems stock returns $y$ on S&P500 Index returns $x$. This regression model is an example of the CAPM (Capital Asset Pricing Model).

Stock returns:
$$\text{return} = \frac{\text{current price} - \text{previous price}}{\text{previous price}} \approx \log\left(\frac{\text{current price}}{\text{previous price}}\right)$$
when the difference between the two prices is small.

Dataset: "return4.mtw" (on Moodle). Daily returns, 3 January – 29 December 2000 ($n = 252$ observations). The dataset has 5 columns: c1 – date, c2 – 100 × (S&P500 return), c3 – 100 × (Cisco return); c4 and c5 are two other stock returns.
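A small sketch of the two return definitions and their closeness for small price moves (illustrative; prices is an assumed array of daily prices):

import numpy as np

prices = np.array([100.0, 101.2, 100.7, 102.3])   # hypothetical daily prices

simple_returns = np.diff(prices) / prices[:-1]    # (current - previous) / previous
log_returns = np.diff(np.log(prices))             # log(current / previous)

print(simple_returns)
print(log_returns)   # nearly identical when day-to-day changes are small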

Workshop 19

Remark. Daily prices are definitely not independent. However, daily returns may be seen as a sequence of uncorrelated random variables.

MTB > describe c2 c3.

Descriptive Statistics: S&P500, Cisco

Variable  N    N*  Mean     SE Mean  StDev   Minimum  Q1       Median   Q3
S&P500    252   0  -0.0424  0.0882   1.4002  -6.0045  -0.8543  -0.0379  0.8021
Cisco     252   0  -0.134   0.267    4.234   -13.439  -3.104   -0.115   2.724

Variable  Maximum
S&P500     4.6546
Cisco     15.415

For the S&P500, the average daily return is -0.04%, the maximum daily return is 4.65%, the minimum daily return is -6.00%, and the standard deviation is 1.40.

For Cisco, the average daily return is -0.13%, the maximum daily return is 15.42%, the minimum daily return is -13.44%, and the standard deviation is 4.23.

Workshop 19

Remark. Cisco is much more volatile than the S&P500.

MTB > tsplot c2 c3;
SUBC> overlay.

Workshop 19

There is clear synchronisation between the movements of the two return series.

MTB > corr c2 c3
Pearson correlation of S&P500 and Cisco = 0.687
P-Value = 0.000

Workshop 19

We fit a regression model: Cisco = $\beta_0$ + $\beta_1$ S&P500 + $\varepsilon$.

Rationale: part of the fluctuation in the Cisco returns was driven by the fluctuation of the S&P500 return.

MTB > regr c3 1 c2

The regression equation is
Cisco = - 0.045 + 2.08 S&P500

Predictor  Coef     SE Coef  T      P
Constant   -0.0455  0.1943   -0.23  0.815
S&P500     2.0771   0.1390   14.94  0.000

S = 3.08344   R-Sq = 47.2%   R-Sq(adj) = 47.0%

Analysis of Variance
Source          DF   SS      MS      F       P
Regression        1  2123.1  2123.1  223.31  0.000
Residual Error  250  2376.9     9.5
Total           251  4500.0

Workshop 19

Unusual Observations
Obs  S&P500    Cisco     Fit     SE Fit  Residual  St Resid
  2   -3.91    -5.771   -8.167   0.572     2.396    0.79 X
 27   -2.10     2.357   -4.415   0.346     6.772    2.21R
 36    0.63    11.208    1.259   0.215     9.949    3.23R
 51    2.40    -2.396    4.936   0.391    -7.332   -2.40R
 52    4.65     2.321    9.623   0.681    -7.302   -2.43RX
210    1.37    -5.328    2.808   0.277    -8.135   -2.65R
211    2.17    11.431    4.470   0.364     6.961    2.27R
234    0.74    -5.706    1.487   0.222    -7.193   -2.34R
235    3.82    12.924    7.886   0.571     5.038    1.66 X
244    0.80   -11.493    1.624   0.227   -13.117   -4.27R
246   -3.18   -13.439   -6.650   0.477    -6.789   -2.23RX

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

Workshop 19

The estimated slope: $\hat{\beta}_1 = 2.077$. The null hypothesis $H_0: \beta_1 = 0$ is rejected with $p$-value 0.000: extremely significant.

Attempted interpretation: when the market index goes up by 1%, the Cisco stock goes up by 2.077% on average. However, the error term $\varepsilon$ in the model is large, with the estimated $\hat{\sigma} = 3.08\%$.

The $p$-value for testing $H_0: \beta_0 = 0$ is 0.815, so we cannot reject the hypothesis $\beta_0 = 0$. Recall $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, and both $\bar{y}$ and $\bar{x}$ are very close to 0.

There are many standardised residual values $\geq 2$ or $\leq -2$, indicating a non-normal error distribution.

$R^2$ = 47.2% of the variation of the Cisco stock may be explained by the variation of the S&P500 index; in other words, 47.2% of the risk in the Cisco stock is market-related risk (see CAPM below).

Workshop 19

CAPM — a simple asset pricing model in finance:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,$$
where $y_i$ is a stock return and $x_i$ is a market return at time $i$.

Total risk of the stock:
$$\frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2 = \frac{1}{n}\sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2.$$

Market-related (or systematic) risk:
$$\frac{1}{n}\sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = \hat{\beta}_1^2 \cdot \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2.$$

Firm-specific risk:
$$\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2.$$

Remark. (i) $\hat{\beta}_1$ measures the market-related (or systematic) risk of the stock.
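A short numerical check of this risk decomposition (illustrative; it reuses the hypothetical fit_ols helper from earlier and simulated data):

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1.4, 252)                 # simulated market returns
y = 2.0 * x + rng.normal(0, 3.0, 252)       # simulated stock returns

b0, b1 = fit_ols(x, y)
y_hat = b0 + b1 * x

total = np.mean((y - y.mean()) ** 2)
systematic = b1 ** 2 * np.mean((x - x.mean()) ** 2)
firm_specific = np.mean((y - y_hat) ** 2)
print(total, systematic + firm_specific)    # the decomposition holds exactly for OLS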

Workshop 19

(ii) Market-related risk is unavoidable, while firm-specific risk may be "diversified away" through hedging.

(iii) Variance is a simple and one of the most frequently used measures of risk in finance.

To plot the data with the fitted regression line together with confidence bounds for $E(y)$ and predictive bounds for $y$:

MTB > Fitline c3 c2;
SUBC> gfourpack;
SUBC> confidence 95;
SUBC> ci;
SUBC> pi.

Workshop 19

[Figure: scatter plot of the Cisco returns against the S&P500 returns with the fitted line, confidence bounds and predictive bounds.]

There are more than 5% of the data points lying outside the predictive bounds, as those bounds are derived under the assumption of $\varepsilon_i \sim N(0, \sigma^2)$. The bounds can be misleading in practice!
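This failure can be quantified directly (an illustrative check using the hypothetical predictive_interval helper above): count how many observed returns fall outside their own 95% predictive bounds.

def empirical_pi_coverage(x, y, alpha=0.05):
    """Fraction of observations falling outside the (1 - alpha) predictive bounds."""
    outside = 0
    for xi, yi in zip(x, y):
        lo, hi = predictive_interval(x, y, xi, alpha=alpha)
        outside += (yi < lo) or (yi > hi)
    # in-sample, so slightly optimistic; a value well above alpha still
    # points at non-normal (heavy-tailed) errors
    return outside / x.size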

Workshop 19

[Figure: Minitab four-in-one residual plots for the Cisco regression.]

Top-left panel: points below the line in the top-right corner, above the line in the bottom-left corner: the residual distribution has heavier tails than $N(0, \sigma^2)$.