
ST 102: Elementary Statistical Theory

Lecture 36: Linear Regression, Prediction and Diagnostics
Piotr Fryzlewicz
p.fryzlewicz@lse.ac.uk
Department of Statistics, LSE
1 / 30
Objectives:
- Confidence intervals for E(y)
- Predictive intervals for y
- Regression diagnostics: a summary
Based on the observations $\{(x_i, y_i): i = 1, \ldots, n\}$, we fit a regression model
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x.$$
Goal. Predict the (unobserved) $y$ corresponding to a (known) $x$.
Point prediction: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$.
2 / 30
For the analysis to be more informative, we would like to have some error bars for our prediction. We introduce two methods:
- Confidence interval for $\mu(x) \equiv E(y) = \beta_0 + \beta_1 x$
- Predictive interval for $y$

Remark. A confidence interval is an interval estimator for an unknown parameter (i.e. for a constant), while a predictive interval is for a random variable. They are different and serve different purposes.

We assume the model is normal, i.e. $\varepsilon = y - \beta_0 - \beta_1 x \sim N(0, \sigma^2)$.
3 / 30
Confidence interval for $\mu(x) = E(y)$

Let $\hat{\mu}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$. Then $\hat{\mu}(x)$ is an unbiased estimator for $\mu(x)$.

Theorem. $\hat{\mu}(x)$ is normally distributed with mean $\mu(x)$ and variance
$$\mathrm{Var}\{\hat{\mu}(x)\} = \frac{\sigma^2}{n} \cdot \frac{\sum_{i=1}^n (x_i - x)^2}{\sum_{j=1}^n (x_j - \bar{x})^2}$$
(here $x$ is the point of prediction and $\bar{x} = n^{-1}\sum_i x_i$ is the sample mean).

Proof. Note both $\hat{\beta}_0$ and $\hat{\beta}_1$ are linear estimators. Therefore $\hat{\mu}(x)$ may be written in the form $\hat{\mu}(x) = \sum_{i=1}^n b_i y_i$, where $b_1, \ldots, b_n$ are some constants. Hence $\hat{\mu}(x)$ is normally distributed. To determine its distribution entirely, we only need to find its mean and variance.
$$E\{\hat{\mu}(x)\} = E(\hat{\beta}_0) + E(\hat{\beta}_1)\,x = \beta_0 + \beta_1 x = \mu(x)$$
4 / 30
$$\mathrm{Var}\{\hat{\mu}(x)\} = E\big[\{(\hat{\beta}_0 - \beta_0) + (\hat{\beta}_1 - \beta_1)x\}^2\big] = \mathrm{Var}(\hat{\beta}_0) + x^2\,\mathrm{Var}(\hat{\beta}_1) + 2x\,\mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1).$$

In Lecture 33 we derived
$$\mathrm{Var}(\hat{\beta}_0) = \frac{\sigma^2}{n} \cdot \frac{\sum_{i=1}^n x_i^2}{\sum_{j=1}^n (x_j - \bar{x})^2}, \qquad \mathrm{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{j=1}^n (x_j - \bar{x})^2}.$$

In Workshop 18 we showed $\mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) = -\sigma^2 \bar{x} \big/ \sum_{j=1}^n (x_j - \bar{x})^2$.

Hence
$$\frac{\sum_{j=1}^n (x_j - \bar{x})^2}{\sigma^2}\,\mathrm{Var}\{\hat{\mu}(x)\} = \frac{1}{n}\sum_{i=1}^n x_i^2 + x^2 - 2x\bar{x} = \frac{1}{n}\Big(\sum_{i=1}^n x_i^2 + nx^2 - 2x\sum_{i=1}^n x_i\Big) = \frac{1}{n}\sum_{i=1}^n (x_i - x)^2,$$

i.e. $\mathrm{Var}\{\hat{\mu}(x)\} = (\sigma^2/n) \sum_{i=1}^n (x_i - x)^2 \big/ \sum_{j=1}^n (x_j - \bar{x})^2$.
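As a quick numerical check of this variance formula (not part of the original derivation), one can simulate repeated samples from the model and compare the empirical variance of $\hat{\mu}(x)$ with the theoretical value. A minimal Python sketch, with a fixed design and purely illustrative parameter values:

import numpy as np

rng = np.random.default_rng(1)
n, beta0, beta1, sigma = 50, 1.0, 2.0, 0.5   # illustrative values
x_obs = rng.uniform(0, 10, n)                 # fixed design points
x_new = 4.0                                   # point at which mu(x) is estimated

mu_hats = []
for _ in range(20000):
    y = beta0 + beta1 * x_obs + rng.normal(0, sigma, n)
    b1 = np.sum((x_obs - x_obs.mean()) * y) / np.sum((x_obs - x_obs.mean()) ** 2)
    b0 = y.mean() - b1 * x_obs.mean()
    mu_hats.append(b0 + b1 * x_new)

# theoretical variance: (sigma^2/n) * sum_i (x_i - x)^2 / sum_j (x_j - x_bar)^2
theory = sigma**2 / n * np.sum((x_obs - x_new) ** 2) / np.sum((x_obs - x_obs.mean()) ** 2)
print(np.var(mu_hats), theory)   # the two values should nearly agree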
5 / 30
Now
$$\frac{\hat{\mu}(x) - \mu(x)}{\Big[\frac{\sigma^2}{n}\sum_{i=1}^n (x_i - x)^2 \big/ \sum_{j=1}^n (x_j - \bar{x})^2\Big]^{1/2}} \sim N(0, 1),$$
and
$$\frac{(n-2)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-2}, \quad \text{where} \quad \hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2.$$
Furthermore, $\hat{\mu}(x)$ and $\hat{\sigma}^2$ are independent. Hence
$$\frac{\hat{\mu}(x) - \mu(x)}{\Big[\frac{\hat{\sigma}^2}{n}\sum_{i=1}^n (x_i - x)^2 \big/ \sum_{j=1}^n (x_j - \bar{x})^2\Big]^{1/2}} \sim t_{n-2}.$$

A $(1 - \alpha)$ confidence interval for $\mu(x)$ is
$$\hat{\mu}(x) \pm t_{\alpha/2,\,n-2}\,\hat{\sigma}\left[\frac{\sum_{i=1}^n (x_i - x)^2}{n\sum_{j=1}^n (x_j - \bar{x})^2}\right]^{1/2}.$$

Recall: the above interval contains the true expectation $E(y) = \mu(x)$ with probability $1 - \alpha$. It does not cover $y$ with probability $1 - \alpha$.
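The interval above translates directly into code. A minimal Python sketch (not from the slides; the function name and inputs are illustrative), assuming x and y are numpy arrays:

import numpy as np
from scipy import stats

def mean_ci(x, y, x_new, alpha=0.05):
    """(1 - alpha) confidence interval for mu(x_new) = E(y) at x_new."""
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * y) / sxx
    b0 = y.mean() - b1 * x.mean()
    sigma2_hat = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)
    # standard error of mu-hat(x_new), using the variance formula derived above
    se = np.sqrt(sigma2_hat * np.sum((x - x_new) ** 2) / (n * sxx))
    t = stats.t.ppf(1 - alpha / 2, n - 2)
    fit = b0 + b1 * x_new
    return fit - t * se, fit + t * se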
6 / 30
Predictive interval: an interval that contains $y$ with probability $1 - \alpha$

We may assume that the $y$ to be predicted is independent of the $y_1, \ldots, y_n$ used in the estimation.
Hence $y - \hat{\mu}(x)$ is normal with mean 0 and variance
$$\mathrm{Var}(y) + \mathrm{Var}\{\hat{\mu}(x)\} = \sigma^2 + \frac{\sigma^2}{n}\sum_{i=1}^n (x_i - x)^2 \big/ \sum_{j=1}^n (x_j - \bar{x})^2.$$
Therefore
$$\frac{y - \hat{\mu}(x)}{\Big[\hat{\sigma}^2\Big(1 + \sum_{i=1}^n (x_i - x)^2 \big/ n\sum_{j=1}^n (x_j - \bar{x})^2\Big)\Big]^{1/2}} \sim t_{n-2}.$$
An interval covering $y$ with probability $1 - \alpha$ is
$$\hat{\mu}(x) \pm t_{\alpha/2,\,n-2}\,\hat{\sigma}\left[1 + \frac{\sum_{i=1}^n (x_i - x)^2}{n\sum_{j=1}^n (x_j - \bar{x})^2}\right]^{1/2}.$$
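In code, the predictive interval differs from the confidence-interval sketch above only in the extra "1 +" term inside the square root; a hypothetical companion function:

def predictive_interval(x, y, x_new, alpha=0.05):
    """(1 - alpha) predictive interval for a new observation y at x_new."""
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * y) / sxx
    b0 = y.mean() - b1 * x.mean()
    sigma2_hat = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)
    # the leading 1 accounts for the variance of the new observation itself
    se = np.sqrt(sigma2_hat * (1 + np.sum((x - x_new) ** 2) / (n * sxx)))
    t = stats.t.ppf(1 - alpha / 2, n - 2)
    fit = b0 + b1 * x_new
    return fit - t * se, fit + t * se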
7 / 30
Remark. (i) It holds that
$$P\left(y \in \hat{\mu}(x) \pm t_{\alpha/2,\,n-2}\,\hat{\sigma}\left[1 + \frac{\sum_{i=1}^n (x_i - x)^2}{n\sum_{j=1}^n (x_j - \bar{x})^2}\right]^{1/2}\right) = 1 - \alpha.$$
(ii) The predictive interval for $y$ is longer than the confidence interval for $E(y)$. The former contains the unobserved random variable $y$ with probability $1 - \alpha$; the latter contains the unknown constant $E(y)$ with probability $1 - \alpha$.
8 / 30
Example. The data set usedFord.mtw contains the prices (y, in $1,000) of 100 three-year-old Ford Tauruses together with their mileages (x, in 1,000 miles) when they were sold at auction. Based on those data, a car dealer needs to make two decisions:
1. to prepare cash for bidding on a three-year-old Ford Taurus with a mileage of x = 40;
2. to prepare for buying several three-year-old Ford Tauruses with mileages close to x = 40 from a rental company.
For the first task, a predictive interval would be more appropriate. For the second task, he needs to know the average price and, therefore, a confidence interval.
This can be done easily using Minitab.
9 / 30
MTB > regr c1 1 c2;
SUBC> predict 40.
Price = 17.2 - 0.0669 Mileage
Predictor Coef SE Coef T P
Constant 17.2487 0.1821 94.73 0.000
Mileage -0.066861 0.004975 -13.44 0.000
S = 0.326489 R-Sq = 64.8% R-Sq(adj) = 64.5%
Analysis of Variance
Source DF SS MS F P
Regression 1 19.256 19.256 180.64 0.000
Residual Error 98 10.446 0.107
Total 99 29.702
... ...
Predicted Values for New Observations
New
Obs Fit SE Fit 95% CI 95% PI
1 14.5743 0.0382 (14.4985, 14.6501) (13.9220, 15.2266)
New
Obs Mileage
1 40.0
We predict that a Ford Taurus will sell for between $13,922 and $15,227. The average selling price of several three-year-old Ford Tauruses is estimated to be between $14,499 and $14,650. Because predicting the selling price of one car is more difficult, the corresponding interval is wider.
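For readers without Minitab, the same fit and intervals can be reproduced in Python with statsmodels. A sketch assuming the worksheet has been exported to a CSV file with columns Price and Mileage (the file name usedFord.csv and the export step are hypothetical):

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("usedFord.csv")
X = sm.add_constant(df["Mileage"])
res = sm.OLS(df["Price"], X).fit()

new = sm.add_constant(pd.DataFrame({"Mileage": [40.0]}), has_constant="add")
pred = res.get_prediction(new)
# summary_frame reports both the 95% CI for E(y) and the 95% PI for y
print(pred.summary_frame(alpha=0.05))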
10 / 30
To produce the plots with both confidence intervals for E(y) and predictive intervals for y:
MTB > Fitline c1 c2;
SUBC> Confidence 95;
SUBC> Ci;
SUBC> Pi.
[Figure: fitted line plot with 95% confidence and prediction bands]
11 / 30
Regression Diagnostics

The usefulness of a fitted regression model rests on a basic assumption:
$$E(y) = \beta_0 + \beta_1 x.$$
Furthermore, inference such as the tests, the confidence intervals and the predictive intervals only makes sense if $\varepsilon_1, \ldots, \varepsilon_n$ are (approximately) independent and normal with constant variance $\sigma^2$.
Therefore it is important to check that those conditions are met in practice; this task is called regression diagnostics.
Basic idea: look at the residuals $\hat{\varepsilon}_i$ or the normalized residuals $\hat{\varepsilon}_i / \hat{\sigma}$.
12 / 30
What to look for?
- Do the residuals manifest i.i.d. normal behaviour?
- Is the scatter plot of $\hat{\varepsilon}_i$ versus $x_i$ patternless?
- Is the scatter plot of $\hat{\varepsilon}_i$ versus $\hat{y}_i$ patternless?
- Is the scatter plot of $\hat{\varepsilon}_i$ versus $i$ patternless?
If you see trends, periodic patterns or increasing variation in any one of the above scatter plots, it is very likely that at least one assumption is not met (see the plotting sketch below).
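These plots are also easy to produce by hand; a minimal matplotlib sketch, reusing the statsmodels fit res and data frame df from the earlier sketch:

import matplotlib.pyplot as plt
import statsmodels.api as sm

resid = res.resid                    # residuals epsilon-hat_i
fitted = res.fittedvalues            # fitted values y-hat_i

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].scatter(df["Mileage"], resid)
axes[0, 0].set_title("residuals vs x")
axes[0, 1].scatter(fitted, resid)
axes[0, 1].set_title("residuals vs fitted")
axes[1, 0].plot(resid.values, "o")
axes[1, 0].set_title("residuals vs observation order")
sm.qqplot(resid, line="s", ax=axes[1, 1])    # normality check
axes[1, 1].set_title("normal Q-Q")
plt.tight_layout()
plt.show()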
13 / 30
The various residual plots can be obtained in Minitab as follows
(using the same example):
MTB > Fitline c1 c2;
SUBC> gfourpack;
SUBC> gvars c2.
[Figure: Minitab four-in-one residual plots]
14 / 30
[Figure: residuals versus Mileage]
15 / 30
Two other issues in regression diagnostics: outliers and influential observations.

Outlier: an unusually small or unusually large $y_i$ which lies outside the majority of observations.

An outlier is often caused by an error in either sampling or recording data. If so, we should correct it before proceeding with the regression analysis.

If an observation which looks like an outlier indeed belongs to the sample and no errors in sampling or recording were discovered, we may use a more complex model or distribution to accommodate this outlier. For example, stock returns often exhibit extreme values, and they often cannot be modelled satisfactorily by a normal regression model.

Remark. Strictly speaking, outliers are defined with respect to the model: $y$ is very unlikely to be more than a $2\sigma$ distance away from $E(y) = \beta_0 + \beta_1 x$ under the normal regression model. This is how Minitab identifies potential outliers.
16 / 30
Influential observation: an $x_i$ which is far away from the other $x$'s. Such an observation may have a large influence on the fitted regression line.
[Figure: illustration of an influential observation pulling the fitted line]
17 / 30
Remark. (i) Minitab output marks both outliers and influential observations.
MTB > regr c1 1 c2;
SUBC> predict 40.
Price = 17.2 - 0.0669 Mileage
... ...
Unusual Observations
Obs Mileage Price Fit SE Fit Residual St Resid
8 19.1 15.7000 15.9717 0.0902 -0.2717 -0.87 X
14 34.5 15.6000 14.9420 0.0335 0.6580 2.03R
19 48.6 14.7000 13.9993 0.0706 0.7007 2.20R
63 21.2 15.4000 15.8313 0.0806 -0.4313 -1.36 X
74 21.0 16.4000 15.8446 0.0815 0.5554 1.76 X
78 44.3 13.6000 14.2868 0.0526 -0.6868 -2.13R
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.
... ...
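A rough Python analogue of these flags, computed from first principles for the simple-regression case (a sketch; Minitab's exact cutoffs may differ, and the 3p/n leverage rule below is one common convention, not necessarily Minitab's):

import numpy as np

x = df["Mileage"].to_numpy()
resid = res.resid.to_numpy()
n = len(x)
sigma_hat = np.sqrt(res.mse_resid)

# leverage of observation i in simple linear regression
h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
st_resid = resid / (sigma_hat * np.sqrt(1 - h))   # standardized residuals

flag_R = np.abs(st_resid) > 2      # large standardized residual ("R")
flag_X = h > 3 * 2 / n             # large leverage ("X"), 3p/n rule with p = 2
print(np.flatnonzero(flag_R), np.flatnonzero(flag_X))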
18 / 30
(ii) To mitigate the impact of both outliers and influential observations, we could use robust regression, i.e. estimate $\beta_0$ and $\beta_1$ by minimising the sum of absolute deviations:
$$\mathrm{SAD}(\beta_0, \beta_1) = \sum_{i=1}^n |y_i - \beta_0 - \beta_1 x_i|.$$
However, note that since the function $f(x) = |x|$ is not differentiable where it attains its minimum, we would not be able to find $\hat{\beta}_0$ and $\hat{\beta}_1$ by differentiating $\mathrm{SAD}(\beta_0, \beta_1)$ with respect to $\beta_0$ and $\beta_1$ and equating the partial derivatives to zero. More complex minimisation techniques would have to be used (see the sketch below). This may be viewed as a drawback of this approach.
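To make the point concrete, here is a minimal, illustrative sketch of such a numerical minimisation, using a derivative-free method since SAD is not differentiable everywhere; x and y are assumed to be numpy arrays:

import numpy as np
from scipy.optimize import minimize

def sad(beta, x, y):
    """Sum of absolute deviations for candidate (beta0, beta1)."""
    return np.sum(np.abs(y - beta[0] - beta[1] * x))

# start from the least-squares fit, then refine with Nelder-Mead
sxx = np.sum((x - x.mean()) ** 2)
b1_ls = np.sum((x - x.mean()) * y) / sxx
b0_ls = y.mean() - b1_ls * x.mean()
result = minimize(sad, x0=[b0_ls, b1_ls], args=(x, y), method="Nelder-Mead")
b0_lad, b1_lad = result.x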
19 / 30
Workshop 19
In this workshop we apply the simple linear regression method to study the relationship between two financial return series: a regression of Cisco Systems stock returns y on S&P500 Index returns x. This regression model is an example of the CAPM (Capital Asset Pricing Model).

Stock returns:
$$\text{return} = \frac{\text{current price} - \text{previous price}}{\text{previous price}} \approx \log\left(\frac{\text{current price}}{\text{previous price}}\right)$$
when the difference between the two prices is small.

Dataset: return4.mtw (on Moodle). Daily returns, 3 January to 29 December 2000 (n = 252 observations). The dataset has 5 columns: c1 date, c2 100×(S&P500 return), c3 100×(Cisco return), and c4 and c5 are two other stock returns.
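A quick numeric illustration of the approximation (not from the slides; the prices are made up):

import numpy as np

prev, curr = 100.0, 101.5                 # illustrative prices
simple = (curr - prev) / prev             # 0.015000
log_ret = np.log(curr / prev)             # 0.014889
print(simple, log_ret)                    # close when the price move is small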
20 / 30
Workshop 19
Remark. Daily prices are definitely not independent. However, daily returns may be seen as a sequence of uncorrelated random variables.
MTB > describe c2 c3.
Descriptive Statistics: S&P500, Cisco
Variable N N* Mean SE Mean StDev Minimum Q1 Median Q3
S&P500 252 0 -0.0424 0.0882 1.4002 -6.0045 -0.8543 -0.0379 0.8021
Cisco 252 0 -0.134 0.267 4.234 -13.439 -3.104 -0.115 2.724
Variable Maximum
S&P500 4.6546
Cisco 15.415
For the S&P500, the average daily return is -0.04%, the maximum daily return is 4.65%, the minimum daily return is -6.00%, and the standard deviation is 1.40.
For Cisco, the average daily return is -0.13%, the maximum daily return is 15.42%, the minimum daily return is -13.44%, and the standard deviation is 4.23.
21 / 30
Workshop 19
Remark. Cisco is much more volatile than S&P500.
MTB > tsplot c2 c3;
SUBC> overlay.
[Figure: overlaid time series plot of S&P500 and Cisco daily returns]
22 / 30
Workshop 19
There is clear synchronisation between the movements of the two
return series.
MTB > corr c2 c3
Pearson correlation of S&P500 and Cisco = 0.687
P-Value = 0.000
23 / 30
Workshop 19
We fit a regression model: Cisco $= \beta_0 + \beta_1\,$S&P500 $+\ \varepsilon$.
Rationale: part of the fluctuation in Cisco returns was driven by the fluctuation of the S&P500 return.
MTB > regr c3 1 c2
The regression equation is Cisco = - 0.045 + 2.08 S&P500
Predictor Coef SE Coef T P
Constant -0.0455 0.1943 -0.23 0.815
S&P500 2.0771 0.1390 14.94 0.000
S = 3.08344 R-Sq = 47.2% R-Sq(adj) = 47.0%
Analysis of Variance
Source DF SS MS F P
Regression 1 2123.1 2123.1 223.31 0.000
Residual Error 250 2376.9 9.5
Total 251 4500.0
24 / 30
Workshop 19
Unusual Observations
Obs S&P500 Cisco Fit SE Fit Residual St Resid
2 -3.91 -5.771 -8.167 0.572 2.396 0.79 X
27 -2.10 2.357 -4.415 0.346 6.772 2.21R
36 0.63 11.208 1.259 0.215 9.949 3.23R
51 2.40 -2.396 4.936 0.391 -7.332 -2.40R
52 4.65 2.321 9.623 0.681 -7.302 -2.43RX
... ...
210 1.37 -5.328 2.808 0.277 -8.135 -2.65R
211 2.17 11.431 4.470 0.364 6.961 2.27R
234 0.74 -5.706 1.487 0.222 -7.193 -2.34R
235 3.82 12.924 7.886 0.571 5.038 1.66 X
244 0.80 -11.493 1.624 0.227 -13.117 -4.27R
246 -3.18 -13.439 -6.650 0.477 -6.789 -2.23RX
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.
25 / 30
Workshop 19
The estimated slope: $\hat{\beta}_1 = 2.077$. The null hypothesis $H_0: \beta_1 = 0$ is rejected with p-value 0.000: extremely significant.

Attempted interpretation: when the market index goes up by 1%, the Cisco stock goes up by 2.077% on average. However, the error term in the model is large, with the estimated $\hat{\sigma} = 3.08\%$.

The p-value for testing $H_0: \beta_0 = 0$ is 0.815, so we cannot reject the hypothesis $\beta_0 = 0$. Recall $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, and both $\bar{y}$ and $\bar{x}$ are very close to 0.

There are many standardised residual values $\geq 2$ or $\leq -2$, indicating a non-normal error distribution.

$R^2$ = 47.2%: 47.2% of the variation of the Cisco stock may be explained by the variation of the S&P500 index, or in other words 47.2% of the risk in the Cisco stock is market-related risk; see CAPM below.
26 / 30
Workshop 19
CAPM, a simple asset pricing model in finance:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
where $y_i$ is a stock return and $x_i$ is a market return at time $i$.

Total risk of the stock:
$$\frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2 = \frac{1}{n}\sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$$

Market-related (or systematic) risk:
$$\frac{1}{n}\sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = \hat{\beta}_1^2 \cdot \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2.$$

Firm-specific risk: $\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$

Remark. (i) $\hat{\beta}_1$ measures the market-related (or systematic) risk of the stock.
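The decomposition holds exactly for least-squares fits and is easy to verify numerically; a sketch assuming numpy arrays x (the S&P500 returns) and y (the Cisco returns):

import numpy as np

sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * y) / sxx
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

total_risk = np.mean((y - y.mean()) ** 2)
systematic = b1 ** 2 * np.mean((x - x.mean()) ** 2)   # = mean((y_hat - mean(y))^2)
firm_specific = np.mean((y - y_hat) ** 2)
print(np.isclose(total_risk, systematic + firm_specific))   # True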
27 / 30
Workshop 19
(ii) Market-related risk is unavoidable, while firm-specific risk may be diversified away through hedging.
(iii) Variance is a simple and one of the most frequently used measures of risk in finance.
To plot the data with the fitted regression line together with confidence bounds for E(y) and predictive bounds for y:
MTB > Fitline c3 c2;
SUBC> gfourpack;
SUBC> confidence 95;
SUBC> ci;
SUBC> pi.
28 / 30
Workshop 19
[Figure: fitted line plot of Cisco on S&P500 with 95% confidence and prediction bands]
More than 5% of the data points lie outside the predictive bounds; those bounds are derived under the assumption $\varepsilon_i \sim N(0, \sigma^2)$, which appears not to hold for these data.
The bounds can be misleading in practice!
29 / 30
Workshop 19
[Figure: four-in-one residual plots for the CAPM regression]
Top-left panel: points lie below the line in the top-right corner and above the line in the bottom-left corner, indicating that the residual distribution has heavier tails than $N(0, \sigma^2)$.
30 / 30