
ST 102: Elementary Statistical Theory Lecture 36 – Linear Regression: Prediction and Diagnostics

Piotr Fryzlewicz p.fryzlewicz@lse.ac.uk

Department of Statistics, LSE

Objectives:

Conﬁdence intervals for E (y )

Predictive intervals for y

Regression diagnostics: a summary

Based on the observations {(x_i, y_i), i = 1, …, n}, we fit a regression model

ŷ = β̂_0 + β̂_1 x.

Goal. Predict (unobserved) y corresponding to (known) x.

Point prediction: ŷ = β̂_0 + β̂_1 x. For the analysis to be more informative, we would like to have some ‘error bars’ for our prediction. We introduce two methods:

Confidence interval for µ(x) ≡ E(y) = β_0 + β_1 x

Predictive interval for y

Remark. Conﬁdence interval is an interval estimator for an unknown parameter (i.e. for a constant) while predictive interval is for a random variable. They are diﬀerent and serve diﬀerent purposes.
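Before turning to the two interval methods, the point prediction itself can be computed directly from the least-squares formulas. The sketch below is an illustrative Python counterpart to the Minitab workflow used later; the data are invented.

```python
# Hedged sketch: compute the least-squares estimates b0, b1 and the
# point prediction y_hat = b0 + b1*x at a new x. Data are invented.

def ols_fit(xs, ys):
    """Return (b0, b1), the least-squares estimates for y = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]      # roughly y = 2x

b0, b1 = ols_fit(xs, ys)             # b1 comes out close to 2
y_hat = b0 + b1 * 6.0                # point prediction at x = 6
```

The intervals introduced next put ‘error bars’ around exactly this `y_hat`.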

We assume the model is normal, i.e. ε = y − β_0 − β_1 x ~ N(0, σ²).

Confidence interval for µ(x) = E(y)

Let µ̂(x) = β̂_0 + β̂_1 x. Then µ̂(x) is an unbiased estimator for µ(x).

Theorem. µ̂(x) is normally distributed with mean µ(x) and variance

Var{µ̂(x)} = (σ²/n) Σ_{i=1}^n (x_i − x)² / Σ_{j=1}^n (x_j − x̄)².

Proof. Note both β̂_0 and β̂_1 are linear estimators. Therefore µ̂(x) may be written in the form µ̂(x) = Σ_{i=1}^n b_i y_i, where b_1, …, b_n are some constants. Hence µ̂(x) is normally distributed. To determine its distribution entirely, we only need to find its mean and variance.

E{µ̂(x)} = E(β̂_0) + E(β̂_1)x = β_0 + β_1 x = µ(x),

Var{µ̂(x)} = E[{(β̂_0 − β_0) + (β̂_1 − β_1)x}²]
= Var(β̂_0) + x² Var(β̂_1) + 2x Cov(β̂_0, β̂_1).

In Lecture 33 we derived

Var(β̂_0) = (σ²/n) Σ_{i=1}^n x_i² / Σ_{j=1}^n (x_j − x̄)²,    Var(β̂_1) = σ² / Σ_{j=1}^n (x_j − x̄)².

In Workshop 18 we showed Cov(β̂_0, β̂_1) = −σ² x̄ / Σ_{j=1}^n (x_j − x̄)².

Hence

Var{µ̂(x)} = σ² / Σ_{j=1}^n (x_j − x̄)² · { (1/n) Σ_{i=1}^n x_i² + x² − 2x x̄ }
= σ² / Σ_{j=1}^n (x_j − x̄)² · (1/n) { Σ_{i=1}^n x_i² + n x² − 2x Σ_{i=1}^n x_i }
= (σ²/n) Σ_{i=1}^n (x_i − x)² / Σ_{j=1}^n (x_j − x̄)².
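A quick numerical check, with made-up numbers, that this expression agrees with the perhaps more familiar form σ²{1/n + (x − x̄)² / Σ_j (x_j − x̄)²}:

```python
# Numerical check (arbitrary invented numbers) that
# (sigma^2/n) * sum_i (x_i - x)^2 / sum_j (x_j - xbar)^2
# equals sigma^2 * (1/n + (x - xbar)^2 / Sxx).

xs = [2.0, 3.5, 5.0, 7.5, 9.0]
x = 6.0          # prediction point
sigma2 = 1.3     # arbitrary error variance
n = len(xs)
x_bar = sum(xs) / n
sxx = sum((xi - x_bar) ** 2 for xi in xs)

lecture_form = sigma2 / n * sum((xi - x) ** 2 for xi in xs) / sxx
familiar_form = sigma2 * (1.0 / n + (x - x_bar) ** 2 / sxx)
```

The two agree because Σ_i (x_i − x)² = Σ_i (x_i − x̄)² + n(x̄ − x)².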

Now

{µ̂(x) − µ(x)} / [ (σ²/n) Σ_{i=1}^n (x_i − x)² / Σ_{j=1}^n (x_j − x̄)² ]^{1/2} ~ N(0, 1),

and (n − 2)σ̂²/σ² ~ χ²_{n−2}, where

σ̂² = (1/(n − 2)) Σ_{i=1}^n (y_i − β̂_0 − β̂_1 x_i)².

Furthermore, µ̂(x) and σ̂² are independent. Hence

{µ̂(x) − µ(x)} / [ (σ̂²/n) Σ_{i=1}^n (x_i − x)² / Σ_{j=1}^n (x_j − x̄)² ]^{1/2} ~ t_{n−2}.

A (1 − α) confidence interval for µ(x) is

µ̂(x) ± t_{α/2, n−2} σ̂ { (1/n) Σ_{i=1}^n (x_i − x)² / Σ_{j=1}^n (x_j − x̄)² }^{1/2}.
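As a concrete illustration, this interval can be evaluated in a few lines of Python; this is not part of the original Minitab workflow, the data are invented, and the quantile t_{0.025, 3} ≈ 3.182 is taken from standard t tables.

```python
import math

# Hedged sketch: a 95% confidence interval for E(y) at x_new,
# following the formula above. Data invented; t quantile from tables.

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
x_new = 3.5
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((xi - x_bar) ** 2 for xi in xs)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar

# sigma_hat^2 = RSS / (n - 2)
rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(xs, ys))
sigma_hat = math.sqrt(rss / (n - 2))

mu_hat = b0 + b1 * x_new
# standard error of mu_hat: the familiar form of the variance above
se_mu = sigma_hat * math.sqrt(1.0 / n + (x_new - x_bar) ** 2 / sxx)
t_crit = 3.182                    # t_{0.025, 3}, from tables
ci = (mu_hat - t_crit * se_mu, mu_hat + t_crit * se_mu)
```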

Recall: the above interval contains the true expectation E(y) = µ(x) with probability 1 − α. It does not, however, contain y with probability 1 − α.

Predictive interval – an interval that contains y with probability 1 − α

We may assume that the y to be predicted is independent of y_1, …, y_n used in the estimation.

Hence y − µ̂(x) is normal with mean 0 and variance

Var(y) + Var{µ̂(x)} = σ² + (σ²/n) Σ_{i=1}^n (x_i − x)² / Σ_{j=1}^n (x_j − x̄)²
= σ² { 1 + (1/n) Σ_{i=1}^n (x_i − x)² / Σ_{j=1}^n (x_j − x̄)² }.

Therefore

{y − µ̂(x)} / [ σ̂² { 1 + (1/n) Σ_{i=1}^n (x_i − x)² / Σ_{j=1}^n (x_j − x̄)² } ]^{1/2} ~ t_{n−2}.

An interval covering y with probability 1 − α is

µ̂(x) ± t_{α/2, n−2} σ̂ { 1 + (1/n) Σ_{i=1}^n (x_i − x)² / Σ_{j=1}^n (x_j − x̄)² }^{1/2}.
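The only change from the confidence interval is the extra “1 +” term in the standard error, so the predictive interval is always wider. A hedged Python sketch on invented data (re-deriving the fit so the snippet stands alone; t_{0.025, 3} ≈ 3.182 from tables):

```python
import math

# Compare the predictive-interval and confidence-interval half-widths
# at the same x_new: the PI adds the "1 +" term, so it is wider.

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
x_new = 3.5
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((xi - x_bar) ** 2 for xi in xs)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar
rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(xs, ys))
sigma_hat = math.sqrt(rss / (n - 2))
t_crit = 3.182                       # t_{0.025, n-2} for n = 5

se_ci = sigma_hat * math.sqrt(1.0 / n + (x_new - x_bar) ** 2 / sxx)
se_pi = sigma_hat * math.sqrt(1.0 + 1.0 / n + (x_new - x_bar) ** 2 / sxx)

half_ci = t_crit * se_ci
half_pi = t_crit * se_pi             # strictly larger than half_ci
```

Note that se_pi² − se_ci² = σ̂² exactly: the extra width reflects the variability of the single future observation y itself.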

Remark. (i) It holds that

P( y ∈ µ̂(x) ± t_{α/2, n−2} σ̂ { 1 + (1/n) Σ_{i=1}^n (x_i − x)² / Σ_{j=1}^n (x_j − x̄)² }^{1/2} ) = 1 − α.

(ii) The predictive interval for y is longer than the confidence interval for E(y). The former contains the unobserved random variable y with probability 1 − α; the latter contains the unknown constant E(y) with probability 1 − α.

Example. The data set ‘usedFord.mtw’ contains the prices (y, in $1,000) of 100 three-year-old Ford Tauruses together with their mileages (x, in 1,000 miles) when they were sold at auction. Based on those data, a car dealer needs to make two decisions:

1 to prepare cash for bidding on a three-year-old Ford Taurus with a mileage of x = 40;

2 to prepare for buying several three-year-old Ford Tauruses with mileages close to x = 40 from a rental company.

For the first task, a predictive interval would be more appropriate. For the second task, he needs to know the average price and, therefore, a confidence interval.

This can be done easily using Minitab.

MTB > regr c1 1 c2;
SUBC> predict 40.

Price = 17.2 - 0.0669 Mileage

Predictor   Coef        SE Coef     T        P
Constant    17.2487     0.1821      94.73    0.000
Mileage     -0.066861   0.004975    -13.44   0.000

S = 0.326489   R-Sq = 64.8%

Analysis of Variance

Source           DF   SS       MS       F        P
Regression        1   19.256   19.256   180.64   0.000
Residual Error   98   10.446   0.107
Total            99   29.702

Predicted Values for New Observations

New Obs   Fit       SE Fit   95% CI                 95% PI
1         14.5743   0.0382   (14.4985, 14.6501)     (13.9220, 15.2266)

New Obs   Mileage
1         40.0

We predict that a Ford Taurus will sell for between $13,922 and $15,227. The average selling price of several three-year-old Ford Tauruses is estimated to be between $14,499 and $14,650. Because predicting the selling price for one car is more difficult, the corresponding interval is wider.

To produce the plots with both confidence intervals for E(y) and predictive intervals for y:

MTB > Fitline c1 c2;
SUBC> Confidence 95;
SUBC> Ci;
SUBC> Pi.

Regression Diagnostics

The usefulness of a ﬁtted regression model rests on a basic assumption:

E(y) = β_0 + β_1 x.

Furthermore, inference such as the tests, the confidence intervals and the predictive intervals only makes sense if ε_1, …, ε_n are (approximately) independent and normal with constant variance σ².

Therefore it is important to check those conditions are met in practice — this task is called Regression Diagnostics .

Basic idea: look into the residuals ε̂_i, or the normalized residuals ε̂_i/σ̂.

What to look for?

Do the residuals manifest i.i.d. normal behaviour?

Is the scatter plot of ε̂_i versus x_i patternless?

Is the scatter plot of ε̂_i versus ŷ_i patternless?

Is the scatter plot of ε̂_i versus i patternless?
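As a rough numerical counterpart to the first check, one can standardize the residuals by σ̂ and flag values outside ±2. This is approximately what Minitab’s ‘R’ flag does (Minitab divides by each residual’s exact standard error rather than σ̂); the data below are invented, with a deliberate outlier.

```python
import math

# Crude residual diagnostic: standardize residuals by sigma_hat and
# flag |z| > 2. Data are invented; the last y is a planted outlier.

xs = [float(i) for i in range(1, 11)]
ys = [2.1, 4.0, 6.2, 7.9, 10.1, 11.8, 14.2, 16.0, 18.1, 30.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((xi - x_bar) ** 2 for xi in xs)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar

residuals = [yi - b0 - b1 * xi for xi, yi in zip(xs, ys)]
sigma_hat = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
standardized = [e / sigma_hat for e in residuals]
flagged = [i for i, z in enumerate(standardized) if abs(z) > 2]
```

On this toy data only the planted outlier (the last observation) is flagged.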

If you see trends, periodic patterns, or increasing variation in any of the above scatter plots, it is very likely that at least one assumption is not met. The various residual plots can be obtained in Minitab as follows (using the same example):

MTB > Fitline c1 c2;
SUBC> gfourpack;
SUBC> gvars c2.

Two other issues in regression diagnostics: outliers and influential observations.

Outlier: an unusually small or unusually large y_i which lies outside the majority of the observations.

An outlier is often caused by an error in either sampling or recording data. If so, we should correct it before proceeding with the regression analysis.

If an observation which looks like an outlier indeed belongs to the sample and no errors in sampling or recording were discovered, we may use a more complex model or distribution to accommodate this ‘outlier’. For example, stock returns often exhibit extreme values and often cannot be modelled satisfactorily by a normal regression model.

Remark. Strictly speaking, outliers are defined with respect to the model: under the normal regression model, y is very unlikely to be more than 2σ away from E(y) = β_0 + β_1 x. This is how Minitab identifies potential outliers.

Influential observation: an x_i which is far away from the other x’s.

Such an observation may have a large influence on the fitted regression line.

Remark. (i) Minitab output marks both outliers and influential observations.

MTB > regr c1 1 c2;
SUBC> predict 40.

Price = 17.2 - 0.0669 Mileage

Unusual Observations

Obs   Mileage   Price     Fit       SE Fit   Residual   St Resid
8     19.1      15.7000   15.9717   0.0902   -0.2717    -0.87 X
14    34.5      15.6000   14.9420   0.0335   0.6580     2.03R
19    48.6      14.7000   13.9993   0.0706   0.7007     2.20R
63    21.2      15.4000   15.8313   0.0806   -0.4313    -1.36 X
74    21.0      16.4000   15.8446   0.0815   0.5554     1.76 X
78    44.3      13.6000   14.2868   0.0526   -0.6868    -2.13R

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

(ii) To mitigate the impact of both outliers and influential observations, we could use robust regression, i.e. estimate β_0 and β_1 by minimising the sum of absolute deviations:

SAD(β_0, β_1) = Σ_{i=1}^n |y_i − β_0 − β_1 x_i|.
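Minimising this criterion requires numerical methods. A crude but transparent option is a brute-force grid search; the sketch below (invented data, arbitrary grid ranges) also computes the least-squares slope to show how much more an outlier distorts it.

```python
# Hedged sketch: minimise SAD(b0, b1) by brute-force grid search and
# compare with the least-squares slope. Data and grid are invented;
# real applications would use a dedicated optimiser.

def sad(b0, b1, xs, ys):
    return sum(abs(y - b0 - b1 * x) for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 30.0]      # y = 2x with one gross outlier

# grid over b0 in [-5, 5] and b1 in [0, 6], step 0.1
best = min(
    ((b0 / 10.0, b1 / 10.0) for b0 in range(-50, 51) for b1 in range(0, 61)),
    key=lambda p: sad(p[0], p[1], xs, ys),
)

# least-squares slope for comparison
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
ols_b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
```

On this toy data the SAD fit recovers the line y = 2x through the four clean points, while the least-squares slope is dragged up to 6 by the single outlier.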

However, note that since the function f(x) = |x| is not differentiable at the point where it attains its minimum, we would not be able to find β̂_0 and β̂_1 by differentiating SAD(β_0, β_1) w.r.t. β_0 and β_1 and equating the partial derivatives to zero. More complex minimisation techniques would have to be used. This may be viewed as a drawback of this approach.

Workshop 19

In this workshop we apply the simple linear regression method to study the relationship between two ﬁnancial returns series: a regression of Cisco Systems stock returns y on S&P500 Index returns x . This regression model is an example of the CAPM (Capital Asset Pricing Model).

Stock returns:

return = (current price − previous price) / previous price ≈ log(current price / previous price)

when the difference between the two prices is small.
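The approximation can be checked numerically; the prices below are invented, with one small move and one large move.

```python
import math

# Check: for a small price change, the simple return (p1 - p0)/p0 is
# close to the log return log(p1/p0); for a large change they diverge.

p0, p1 = 100.0, 101.0                     # a 1% move
simple = (p1 - p0) / p0                   # 0.01
log_ret = math.log(p1 / p0)               # about 0.00995

p0_big, p1_big = 100.0, 150.0             # a 50% move
simple_big = (p1_big - p0_big) / p0_big   # 0.5
log_big = math.log(p1_big / p0_big)       # about 0.405
```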

Dataset: “return4.mtw” (on Moodle). Daily returns, 3 January – 29 December 2000 (n = 252 observations). The dataset has 5 columns: c1 – date, c2 – 100 × (S&P500 return), c3 – 100 × (Cisco return), and c4 and c5 are two other stock returns.

Remark. Daily prices are deﬁnitely not independent. However daily returns may be seen as a sequence of uncorrelated random variables.

MTB > describe c2 c3.

Descriptive Statistics: S&P500, Cisco

Variable   N     N*   Mean      SE Mean   StDev    Minimum    Q1        Median    Q3
S&P500     252   0    -0.0424   0.0882    1.4002   -6.0045    -0.8543   -0.0379   0.8021
Cisco      252   0    -0.134    0.267     4.234    -13.439    -3.104    -0.115    2.724

Variable   Maximum
S&P500     4.6546
Cisco      15.415

For S&P500, the average daily return is -0.04%, the maximum daily return is 4.65%, the minimum daily return is -6.01%, and the standard deviation is 1.40.

For Cisco, the average daily return is -0.13%, the maximum daily return is 15.42%, the minimum daily return is -13.44%, and the standard deviation is 4.23.

Remark. Cisco is much more volatile than S&P500.

MTB > tsplot c2 c3;
SUBC> overlay.

There is clear synchronisation between the movements of the two return series.

MTB > corr c2 c3

Pearson correlation of S&P500 and Cisco = 0.687
P-Value = 0.000

We ﬁt a regression model: Cisco = β 0 + β 1 S&P500 + ε

Rationale: part of the fluctuation in the Cisco returns was driven by the fluctuation of the S&P500 return.

MTB > regr c3 1 c2

The regression equation is Cisco = - 0.045 + 2.08 S&P500

Predictor   Coef      SE Coef   T        P
Constant    -0.0455   0.1943    -0.23    0.815
S&P500      2.0771    0.1390    14.94    0.000

S = 3.08344   R-Sq = 47.2%   R-Sq(adj) = 47.0%

Analysis of Variance

Source           DF    SS       MS       F        P
Regression        1    2123.1   2123.1   223.31   0.000
Residual Error   250   2376.9   9.5
Total            251   4500.0

Unusual Observations

Obs   S&P500   Cisco     Fit      SE Fit   Residual   St Resid
2     -3.91    -5.771    -8.167   0.572    2.396      0.79 X
27    -2.10    2.357     -4.415   0.346    6.772      2.21R
36    0.63     11.208    1.259    0.215    9.949      3.23R
51    2.40     -2.396    4.936    0.391    -7.332     -2.40R
52    4.65     2.321     9.623    0.681    -7.302     -2.43RX
210   1.37     -5.328    2.808    0.277    -8.135     -2.65R
211   2.17     11.431    4.470    0.364    6.961      2.27R
234   0.74     -5.706    1.487    0.222    -7.193     -2.34R
235   3.82     12.924    7.886    0.571    5.038      1.66 X
244   0.80     -11.493   1.624    0.227    -13.117    -4.27R
246   -3.18    -13.439   -6.650   0.477    -6.789     -2.23RX

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

The estimated slope: β̂_1 = 2.077. The null hypothesis H_0: β_1 = 0 is rejected with p-value 0.000: extremely significant.

Attempted interpretation: when the market index goes up by 1%, the Cisco stock goes up by 2.077% on average. However, the error term ε in the model is large, with estimated σ̂ = 3.08%.

The p-value for testing H_0: β_0 = 0 is 0.815, so we cannot reject the hypothesis β_0 = 0. Recall β̂_0 = ȳ − β̂_1 x̄, and both ȳ and x̄ are very close to 0.

There are many standardised residual values ≥ 2 or ≤ −2, indicating a non-normal error distribution.

R² = 47.2%: 47.2% of the variation of the Cisco stock may be explained by the variation of the S&P500 index; in other words, 47.2% of the risk in the Cisco stock is the market-related risk — see CAPM below.

CAPM — a simple asset pricing model in ﬁnance:

y i = β 0 + β 1 x i + ε i

where y i is a stock return and x i is a market return at time i .

Total risk of the stock:

(1/n) Σ_{i=1}^n (y_i − ȳ)² = (1/n) Σ_{i=1}^n (ŷ_i − ȳ)² + (1/n) Σ_{i=1}^n (y_i − ŷ_i)².

Market-related (or systematic) risk:

(1/n) Σ_{i=1}^n (ŷ_i − ȳ)² = β̂_1² · (1/n) Σ_{i=1}^n (x_i − x̄)².

Firm-specific risk:

(1/n) Σ_{i=1}^n (y_i − ŷ_i)².
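This decomposition is an algebraic identity for the least-squares fit (the cross term vanishes). A small numerical check with invented return data:

```python
# Check: total risk splits exactly into the systematic part
# b1^2 * (1/n) * sum (x_i - xbar)^2 and the firm-specific part.
# The return data below are invented.

xs = [0.5, -1.0, 2.0, 0.0, -0.5, 1.0]     # 'market' returns
ys = [1.2, -2.5, 3.8, 0.3, -0.9, 2.2]     # 'stock' returns

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((xi - x_bar) ** 2 for xi in xs)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in xs]

total = sum((yi - y_bar) ** 2 for yi in ys) / n
systematic = sum((fi - y_bar) ** 2 for fi in fitted) / n
specific = sum((yi - fi) ** 2 for yi, fi in zip(ys, fitted)) / n
```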

Remark. (i) β̂_1 measures the market-related (or systematic) risk of the stock.

(ii) Market-related risk is unavoidable, while firm-specific risk may be “diversified away” through hedging.

(iii) Variance is a simple and one of the most frequently used measures of risk in finance.

To plot the data with the fitted regression line together with confidence bounds for E(y) and predictive bounds for y:

MTB > Fitline c3 c2;
SUBC> gfourpack;
SUBC> confidence 95;
SUBC> ci;
SUBC> pi.

More than 5% of the data points lie outside the predictive bounds, as those bounds are derived under the assumption ε_i ~ N(0, σ²).

The bounds can be misleading in practice!

Top-left panel: points lie below the line in the top-right corner and above the line in the bottom-left corner — the residual distribution has heavier tails than N(0, σ²).