Author(s): Kerby Shedden, Ph.D., 2010

License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution Share Alike 3.0 License: http://creativecommons.org/licenses/by-sa/3.0/
We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your ability to use, share, and adapt it. The citation key on the following slide provides information about how you may share and adapt this material. Copyright holders of content included in this material should contact open.michigan@umich.edu with any questions, corrections, or clarification regarding the use of content. For more information about how to cite these materials visit http://open.umich.edu/privacy-and-terms-use. Any medical information in this material is intended to inform and educate and is not a tool for self-diagnosis or a replacement for medical evaluation, advice, diagnosis or treatment by a healthcare professional. Please speak to your physician if you have questions about your medical condition. Viewer discretion is advised: Some medical content is graphic and may not be suitable for all viewers.

Regression diagnostics
Kerby Shedden
Department of Statistics, University of Michigan

May 9, 2012

Motivation
When working with a linear model with design matrix $X$, the conventional linear modeling assumptions can be expressed as $E[Y|X] \in \mathrm{col}(X)$ and $\mathrm{var}[Y|X] = \sigma^2 I$.

Least squares point estimates and inferences depend on these assumptions approximately holding. Inferences for small sample sizes may also depend on the distribution of $Y - E[Y|X]$ being approximately multivariate Gaussian, but for moderate or large sample sizes this is not critical.

Regression diagnostics are approaches for assessing how well the key linear modeling assumptions hold in a particular data set.

Residuals
Linear models can be expressed in two equivalent ways:

Expression based on moments: $E[Y|X] \in \mathrm{col}(X)$ and $\mathrm{var}[Y|X] = \sigma^2 I$.

Expression based on an additive error model: $Y = X\beta + \epsilon$, where $\epsilon$ is random with $E[\epsilon|X] = 0$ and $\mathrm{cov}[\epsilon|X] \propto I$.

Since the residuals can be viewed as predictions of the errors, it turns out that regression model diagnostics can often be developed using the residuals. Recall that the residuals can be expressed as $R \equiv (I - P)Y$, where $P$ is the projection onto $\mathrm{col}(X)$.

The residuals have two key mathematical properties regardless of the correctness of the model specification:

The residuals sum to zero, since $(I - P)1 = 0$ and hence $1'R = 1'(I - P)Y = 0$.

The residuals and fitted values are orthogonal (they have zero sample covariance):

$$\widehat{\mathrm{cov}}(R, \hat{Y}) \propto (R - \bar{R})'\hat{Y} = R'\hat{Y} = Y'(I - P)PY = 0.$$

These properties hold as long as an intercept is included in the model (so $P1 = 1$, where $1$ is a vector of 1s).
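The two algebraic properties above can be checked numerically. The following is a minimal sketch on simulated data; the sample size, the seed, and the deliberately misspecified mean function are illustrative assumptions, not part of the original slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3

# Design matrix: intercept column followed by p simulated covariates.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

# The mean is deliberately nonlinear in the covariates: the two properties
# do not require the linear model to be correctly specified.
Y = np.exp(X[:, 1]) + rng.normal(size=n)

# Projection onto col(X), fitted values, and residuals R = (I - P)Y.
P = X @ np.linalg.solve(X.T @ X, X.T)
Y_hat = P @ Y
R = Y - Y_hat

resid_sum = R.sum()           # 1'R, zero up to rounding error
resid_fit_inner = R @ Y_hat   # R'Y_hat, zero up to rounding error
print(resid_sum, resid_fit_inner)
```

Dropping the column of ones from `X` in this sketch generally makes `resid_sum` nonzero, which illustrates why the intercept condition matters.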
If the basic linear model assumptions hold, these two properties have population counterparts:

The expected value of each residual is zero:

$$ER = (I - P)EY = 0.$$

The population covariance between any residual and any fitted value is zero:

$$\mathrm{cov}(R, \hat{Y}) = E R\hat{Y}' = (I - P)\,\mathrm{cov}(Y)\,P = \sigma^2 (I - P)P = 0 \in \mathbb{R}^{n \times n}.$$

If the model is correctly specified, there is a simple formula for the variances and covariances of the residuals:

$$\mathrm{cov}(R) = (I - P)(EYY')(I - P) = (I - P)(X\beta\beta'X' + \sigma^2 I)(I - P) = \sigma^2 (I - P).$$

If the model is correctly specified, the standardized residuals

$$(Y_i - \hat{Y}_i)/\hat\sigma$$

and the Studentized residuals

$$(Y_i - \hat{Y}_i)/\left(\hat\sigma (1 - P_{ii})^{1/2}\right)$$

approximately have mean zero and variance one.
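As a rough illustration of the covariance formula, one can hold X fixed, simulate many response vectors from a correctly specified model, and compare the empirical covariance of the residuals with $\sigma^2(I - P)$. This is only a Monte Carlo sketch; the dimensions, coefficients, seed, and number of replications are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 12, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 2.0, -1.0])     # hypothetical true coefficients
P = X @ np.linalg.solve(X.T @ X, X.T)
I = np.eye(n)

# Simulate many data sets from the correctly specified model (one per row).
reps = 40000
Y = X @ beta + sigma * rng.normal(size=(reps, n))
R = Y @ (I - P)                       # residuals; I - P is symmetric

emp_cov = np.cov(R, rowvar=False)     # empirical n x n covariance of R
theory_cov = sigma**2 * (I - P)
max_abs_err = np.abs(emp_cov - theory_cov).max()

# Studentized residuals for every simulated data set.
df_resid = n - X.shape[1]
sigma_hat = np.sqrt((R**2).sum(axis=1) / df_resid)
studentized = R / (sigma_hat[:, None] * np.sqrt(1 - np.diag(P)))
print(max_abs_err, studentized.mean(), studentized.var())
```

The printed mean and variance of the Studentized residuals should be close to 0 and 1, in line with the statement above.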
External standardization of residuals


Let $\hat\sigma^2_{(i)}$ be the estimate of $\sigma^2$ obtained by fitting a regression model omitting the $i$th case. It turns out that we can calculate this value without actually refitting the model:

$$\hat\sigma^2_{(i)} = \frac{(n - p - 1)\hat\sigma^2 - r_i^2/(1 - P_{ii})}{n - p - 2},$$

where $r_i$ is the residual for the model fit to all data.

The externally standardized residuals are

$$(Y_i - \hat{Y}_i)/\hat\sigma_{(i)}.$$

The externally Studentized residuals are

$$(Y_i - \hat{Y}_i)/\left(\hat\sigma_{(i)} (1 - P_{ii})^{1/2}\right).$$
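The shortcut for the deleted variance estimate can be compared with brute-force refitting. The sketch below uses simulated data with an intercept and p covariates (so p + 1 parameters); all data-generating choices are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)
lev = np.diag(P)                       # leverages P_ii
r = Y - P @ Y                          # residuals from the full fit
sigma2_hat = (r**2).sum() / (n - p - 1)

# Shortcut formula for the deleted variance estimates, all cases at once.
sigma2_del = ((n - p - 1) * sigma2_hat - r**2 / (1 - lev)) / (n - p - 2)

# Brute-force check: refit the model with case i removed.
sigma2_refit = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    coef = np.linalg.lstsq(X[keep], Y[keep], rcond=None)[0]
    resid = Y[keep] - X[keep] @ coef
    sigma2_refit[i] = (resid**2).sum() / (n - p - 2)

# Externally Studentized residuals.
ext_studentized = r / np.sqrt(sigma2_del * (1 - lev))
print(np.abs(sigma2_del - sigma2_refit).max())
```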
Outliers and masking

In principle, residuals should be useful for identifying outliers. However, in a small data set, a large outlier will increase the value of $\hat\sigma$, and hence may mask itself.

Externally Studentized residuals solve the problem of a single large outlier masking itself. But masking may still occur if multiple large outliers are present.

If multiple large outliers may be present we may use alternate estimates of the scale parameter $\sigma$:

Interquartile range (IQR): this is the difference between the 75th percentile and the 25th percentile of the distribution or data. The IQR of the standard normal distribution is approximately 1.35, so IQR/1.35 can be used to estimate $\sigma$.

Median Absolute Deviation (MAD): this is the median value of the absolute deviations from the median of the distribution or data, i.e. $\mathrm{median}(|Z - \mathrm{median}(Z)|)$. The MAD of the standard normal distribution is approximately 0.67 (half the IQR), so MAD/0.67 can be used to estimate $\sigma$.

These alternative estimates of $\sigma$ can be used in place of the usual $\hat\sigma$ for standardizing or Studentizing residuals.
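A small sketch of the two robust scale estimates on contaminated Gaussian data follows. The contamination scheme, sample size, and seed are assumptions; the constants 1.349 and 0.6745 are the standard normal IQR and MAD written to more digits.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 2.0
z = sigma * rng.normal(size=5000)
z[:25] = 100.0    # contaminate 0.5% of the values with gross outliers

# IQR-based estimate of sigma.
q25, q75 = np.percentile(z, [25, 75])
sigma_iqr = (q75 - q25) / 1.349

# MAD-based estimate of sigma.
mad = np.median(np.abs(z - np.median(z)))
sigma_mad = mad / 0.6745

# The usual standard deviation is inflated by the outliers.
sigma_sd = z.std(ddof=1)
print(sigma_iqr, sigma_mad, sigma_sd)
```

In this setup the two robust estimates stay near the true value of 2, while the ordinary standard deviation is pulled far above it.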

Leverage
Leverage is a measure of how strongly the data for case $i$ determine the fitted value $\hat{Y}_i$. Since $\hat{Y} = PY$, and

$$\hat{Y}_i = \sum_j P_{ij} Y_j,$$

it is natural to define the leverage for case $i$ as $P_{ii}$, where $P$ is the projection matrix onto $\mathrm{col}(X)$.

This is related to the fact that the variance of the $i$th residual is $\sigma^2 (1 - P_{ii})$. Since the residuals have mean zero, when $P_{ii}$ is close to 1, the residual will likely be close to zero. This means that the fitted line will usually pass close to $(X_i, Y_i)$ if it is a high leverage point.

What is a big leverage? The average leverage is $\mathrm{trace}(P)/n = (p + 1)/n$. If the leverage for a particular case is two or more times greater than the average leverage, it may be considered to have high leverage.

In simple linear regression, it is easy to show that

$$\mathrm{var}(Y_i - \hat\alpha - \hat\beta X_i) = (n - 1)\sigma^2/n - \sigma^2 (X_i - \bar{X})^2 \Big/ \sum_j (X_j - \bar{X})^2.$$

This implies that when $p = 1$,

$$P_{ii} = 1/n + (X_i - \bar{X})^2 \Big/ \sum_j (X_j - \bar{X})^2.$$
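The simple-regression leverage formula and the average-leverage rule can be checked directly. This is a minimal sketch on simulated covariate values; the sample size and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])   # simple linear regression, p = 1

# Leverages as the diagonal of the projection matrix.
P = X @ np.linalg.solve(X.T @ X, X.T)
lev = np.diag(P)

# Closed-form leverages for p = 1.
d2 = (x - x.mean())**2
lev_formula = 1 / n + d2 / d2.sum()

avg_leverage = lev.mean()              # trace(P)/n = (p + 1)/n = 2/n
high = np.flatnonzero(lev > 2 * avg_leverage)   # cases flagged by the 2x rule
print(avg_leverage, high)
```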

In general,

$$P_{ii} = X_i (X'X)^{-1} X_i' = X_i (X'X/n)^{-1} X_i'/n,$$

where $X_i$ is the $i$th row of $X$ (including the intercept).

Let $\tilde{X}_i$ be row $i$ of $X$ without the intercept, let $\tilde\mu_X$ be the sample mean of the $\tilde{X}_i$, and let $\tilde\Sigma_X$ be the sample covariance matrix of the $\tilde{X}_i$ (scaled by $n$ rather than $n - 1$). It is a fact that

$$X_i (X'X/n)^{-1} X_i' = (\tilde{X}_i - \tilde\mu_X)\,\tilde\Sigma_X^{-1}\,(\tilde{X}_i - \tilde\mu_X)' + 1$$

and therefore

$$P_{ii} = \left((\tilde{X}_i - \tilde\mu_X)\,\tilde\Sigma_X^{-1}\,(\tilde{X}_i - \tilde\mu_X)' + 1\right)/n.$$

Note that this implies that $P_{ii} \ge 1/n$.

The expression

$$(\tilde{X}_i - \tilde\mu_X)\,\tilde\Sigma_X^{-1}\,(\tilde{X}_i - \tilde\mu_X)'$$

is the (squared) Mahalanobis distance between $\tilde{X}_i$ and $\tilde\mu_X$. Thus there is a direct relationship between the Mahalanobis distance of a point relative to the center of the covariate set, and its leverage.
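The relationship between leverage and Mahalanobis distance can be verified numerically. The sketch below assumes simulated correlated covariates; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 40, 3
# Correlated covariates (without the intercept column).
Xt = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
X = np.column_stack([np.ones(n), Xt])

# Leverages from the projection matrix.
lev = np.diag(X @ np.linalg.solve(X.T @ X, X.T))

# Squared Mahalanobis distance of each row from the covariate mean,
# using the covariance scaled by n (bias=True) as in the text.
mu = Xt.mean(axis=0)
Sigma = np.cov(Xt, rowvar=False, bias=True)
centered = Xt - mu
maha2 = np.einsum("ij,jk,ik->i", centered, np.linalg.inv(Sigma), centered)

lev_from_maha = (maha2 + 1) / n
print(np.abs(lev - lev_from_maha).max())
```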

Influence
Influence measures the degree to which deletion of a case changes the fitted model. We will see that this is different from leverage: a high leverage point has the potential to be influential, but is not always influential.

The deleted slope for case $i$ is the fitted slope vector obtained upon deleting case $i$. The following identity allows the deleted slopes to be calculated efficiently:

$$\hat\beta_{(i)} = \hat\beta - \frac{R_i}{1 - P_{ii}} (X'X)^{-1} X_{i:}',$$

where $R_i$ is the $i$th residual, and $X_{i:}$ is row $i$ of the design matrix.
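The deleted-slope identity can be checked against explicit refits. This is a sketch on simulated data with assumed coefficients; it computes all n deleted coefficient vectors from a single full-data fit.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
R = Y - X @ beta_hat
lev = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # P_ii

# Row i of beta_del is the coefficient vector with case i deleted,
# obtained from the identity without refitting.
beta_del = beta_hat - (R / (1 - lev))[:, None] * (X @ XtX_inv)

# Brute-force refits for comparison.
beta_refit = np.array([
    np.linalg.lstsq(np.delete(X, i, axis=0), np.delete(Y, i), rcond=None)[0]
    for i in range(n)
])
print(np.abs(beta_del - beta_refit).max())
```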

The deleted fitted values $\hat{Y}_{(i)}$ are

$$\hat{Y}_{(i)} = X\hat\beta_{(i)} = \hat{Y} - \frac{R_i}{1 - P_{ii}}\, X (X'X)^{-1} X_{i:}'.$$

Influence can be measured by Cook's distance:

$$\begin{aligned}
D_i &\equiv \frac{1}{(p + 1)\hat\sigma^2} (\hat{Y} - \hat{Y}_{(i)})'(\hat{Y} - \hat{Y}_{(i)}) \\
&= \frac{R_i^2}{(1 - P_{ii})^2 (p + 1)\hat\sigma^2}\, X_{i:} (X'X)^{-1} X_{i:}' \\
&= \frac{P_{ii} (R_i^s)^2}{(1 - P_{ii})(p + 1)},
\end{aligned}$$

where $R_i$ is the residual and $R_i^s$ is the Studentized residual.

Cook's distance approximately captures the average squared change in fitted values due to deleting case $i$, in error variance units.

Cook's distance is large only if both the leverage $P_{ii}$ is high, and the Studentized residual for the $i$th case is large.

As a general rule, $D_i$ values from 1/2 to 1 are high, and values greater than 1 are considered to be a possible problem.
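Cook's distance can be computed from leverages and Studentized residuals, and compared with its definition in terms of deleted fitted values. The sketch below uses simulated data with one planted outlier; the data-generating choices are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=n)
Y[0] += 8.0    # hypothetical outlier planted in case 0

P = X @ np.linalg.solve(X.T @ X, X.T)
lev = np.diag(P)
Y_hat = P @ Y
R = Y - Y_hat
sigma2_hat = (R**2).sum() / (n - p - 1)
R_s = R / np.sqrt(sigma2_hat * (1 - lev))   # Studentized residuals

# Closed form: D_i = P_ii (R_i^s)^2 / ((1 - P_ii)(p + 1)).
cooks = lev * R_s**2 / ((1 - lev) * (p + 1))

# Direct definition: squared change in fitted values after deleting case i.
cooks_direct = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    coef = np.linalg.lstsq(X[keep], Y[keep], rcond=None)[0]
    diff = Y_hat - X @ coef
    cooks_direct[i] = diff @ diff / ((p + 1) * sigma2_hat)

print(np.abs(cooks - cooks_direct).max(), cooks[:3])
```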

PRESS residuals
If case $i$ is deleted and a prediction of $Y_i$ is made from the remaining data, we can compare the observed and predicted values to get the prediction residual:

$$R_{(i)} \equiv Y_i - \hat{Y}_{(i)i}.$$

A simple formula for the prediction residual is given by

$$\begin{aligned}
R_{(i)} &= Y_i - X_{i:}\hat\beta_{(i)} \\
&= Y_i - X_{i:}\left(\hat\beta - R_i (X'X)^{-1} X_{i:}'/(1 - P_{ii})\right) \\
&= R_i/(1 - P_{ii}).
\end{aligned}$$

The sum of squares of the prediction residuals is called PRESS (prediction error sum of squares). It is equivalent to using leave-one-out cross validation to estimate the generalization error rate.
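The equivalence between PRESS residuals and leave-one-out prediction errors can be seen in a short sketch. The simulated data and coefficients are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)
R = Y - P @ Y

# Prediction residuals from a single fit: R_i / (1 - P_ii).
press_resid = R / (1 - np.diag(P))
press = (press_resid**2).sum()

# Explicit leave-one-out cross validation for comparison.
loo_resid = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    coef = np.linalg.lstsq(X[keep], Y[keep], rcond=None)[0]
    loo_resid[i] = Y[i] - X[i] @ coef

print(press, (loo_resid**2).sum())
```

Since every $1 - P_{ii}$ is at most 1, PRESS is never smaller than the ordinary residual sum of squares.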
Regression graphics

Quite a few graphical techniques have been proposed to aid in visualizing regression relationships. We will discuss the following plots:
1. Scatterplots of Y against individual X variables.
2. Scatterplots of X variables against each other.
3. Residuals versus fitted values plot.
4. Added variable plots.
5. Partial residual plots.
6. Residual quantile plots.

Scatterplots of Y against individual X variables


$E[Y|X] = X_1 - X_2 + X_3$, $\mathrm{var}[Y|X] = 1$, $\mathrm{var}(X_j) = 1$, $\mathrm{cor}(X_j, X_k) = 0.3$

[Figure: four scatterplot panels with horizontal axes $X_1$, $X_2$, $X_3$, and $X_1 - X_2 + X_3$.]

Scatterplots of X variables against each other


$E[Y|X] = X_1 - X_2 + X_3$, $\mathrm{var}[Y|X] = 1$, $\mathrm{var}(X_j) = 1$, $\mathrm{cor}(X_j, X_k) = 0.3$

[Figure: pairwise scatterplots of the covariates $X_1$, $X_2$, and $X_3$.]

Residuals against fitted values plot

$E[Y|X] = X_1 - X_2 + X_3$, $\mathrm{var}[Y|X] = 1$, $\mathrm{var}(X_j) = 1$, $\mathrm{cor}(X_j, X_k) = 0.3$

[Figure: residuals (vertical axis) against fitted values (horizontal axis).]

Heteroscedastic errors: $E[Y|X] = X_1 + X_3$, $\mathrm{var}[Y|X] = 4 + X_1 + X_3$, $\mathrm{var}(X_j) = 1$, $\mathrm{cor}(X_j, X_k) = 0.3$

[Figure: residuals (vertical axis) against fitted values (horizontal axis) with heteroscedastic errors.]

Nonlinear mean structure: $E[Y|X] = X_1^2$, $\mathrm{var}[Y|X] = 1$, $\mathrm{var}(X_j) = 1$, $\mathrm{cor}(X_j, X_k) = 0.3$

[Figure: residuals (vertical axis) against fitted values (horizontal axis) with a nonlinear mean structure.]

Added variable plots

Suppose $P_{-j}$ is the projection onto the span of all covariates except $X_j$, and define $\hat{Y}_{-j} = P_{-j} Y$, $X_j^* = P_{-j} X_j$. The added variable plot is a scatterplot of $Y - \hat{Y}_{-j}$ against $X_j - X_j^*$.

The squared correlation coefficient of the points in the added variable plot is the partial $R^2$ for variable $j$.

Added variable plots are also called partial regression plots.
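The coordinates of an added variable plot can be computed with two projections. The sketch below uses simulated data and a hypothetical choice of covariate index `j`. It also checks two standard facts: the least squares slope through the plotted points equals the full-model coefficient for covariate j (the Frisch-Waugh-Lovell result, which is not stated in the slides), and the squared correlation equals the partial R-squared computed from residual sums of squares.

```python
import numpy as np

def projection(A):
    """Projection matrix onto the column space of A (assumed full rank)."""
    return A @ np.linalg.solve(A.T @ A, A.T)

rng = np.random.default_rng(9)
n = 200
Z = rng.normal(size=(n, 3))
Z[:, 1] += 0.5 * Z[:, 0]                  # induce some correlation
X = np.column_stack([np.ones(n), Z])
Y = X @ np.array([0.0, 1.0, -1.0, 1.0]) + rng.normal(size=n)

j = 2                                      # hypothetical covariate of interest
others = np.delete(np.arange(X.shape[1]), j)
P_minus_j = projection(X[:, others])

y_res = Y - P_minus_j @ Y                  # vertical axis of the plot
x_res = X[:, j] - P_minus_j @ X[:, j]      # horizontal axis of the plot

av_slope = (x_res @ y_res) / (x_res @ x_res)
beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
partial_r2 = np.corrcoef(x_res, y_res)[0, 1] ** 2

# Partial R^2 from residual sums of squares with and without covariate j.
rss_full = ((Y - X @ beta_hat)**2).sum()
rss_reduced = (y_res**2).sum()
partial_r2_rss = (rss_reduced - rss_full) / rss_reduced
print(av_slope, beta_hat[j], partial_r2)
```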

$E[Y|X] = X_1 - X_2 + X_3$, $\mathrm{var}[Y|X] = 1$, $\mathrm{var}(X_j) = 1$, $\mathrm{cor}(X_j, X_k) = 0.3$

[Figure: added variable plots for $X_1$, $X_2$, and $X_3$.]

Partial residual plot

Suppose we fit the model

$$\hat{Y}_i = \hat\beta' X_i = \hat\beta_0 + \hat\beta_1 X_{i1} + \cdots + \hat\beta_p X_{ip}.$$

The partial residual plot for covariate $j$ is a plot of $\hat\beta_j X_{ij} + R_i$ against $X_{ij}$, where $R_i$ is the residual.

The partial residual plot attempts to show how covariate $j$ is related to $Y$, if we control for the effects of all other covariates.
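Partial residuals are simple to compute once the model is fit. The following sketch assumes simulated data whose true mean is quadratic in the first covariate, so the partial residual plot for that covariate would show curvature. As a sanity check, it uses the fact that the least squares slope of the partial residuals on covariate j reproduces the fitted coefficient.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 200
Z = rng.normal(size=(n, 3))                 # covariates X_1, X_2, X_3
X = np.column_stack([np.ones(n), Z])
Y = Z[:, 0]**2 + rng.normal(size=n)         # mean is quadratic in X_1

beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
R = Y - X @ beta_hat

# Column k holds the partial residuals beta_hat_j * X_ij + R_i for j = k + 1.
partial_resid = Z * beta_hat[1:] + R[:, None]

# Slope of the partial residuals regressed on the corresponding covariate.
slopes = np.array([
    np.polyfit(Z[:, k], partial_resid[:, k], 1)[0] for k in range(3)
])
print(slopes, beta_hat[1:])
```

Plotting `partial_resid[:, 0]` against `Z[:, 0]` would give the partial residual plot for the first covariate.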

$E[Y|X] = X_1^2$, $\mathrm{var}[Y|X] = 1$, $\mathrm{var}(X_j) = 1$, $\mathrm{cor}(X_j, X_k) = 0.3$

[Figure: partial residual plots of $\hat\beta_j X_j + R$ against $X_j$ for $j = 1, 2, 3$.]

Residual quantile plots

$E[Y|X] = X_1^2$, $\mathrm{var}[Y|X] = 1$, $\mathrm{var}(X_j) = 1$, $\mathrm{cor}(X_j, X_k) = 0.3$

$t_4$ distributed errors

[Figure: standardized residual quantiles (vertical axis) against standard normal quantiles (horizontal axis).]

Transformations

If the residual diagnostics suggest that the linear model assumptions do not hold, it may be possible to continuously transform either Y or X so that the linear model becomes more consistent with the data.

Variance stabilizing transformations


A common violation of the linear model assumptions is a mean/variance relationship, where $EY_i$ and $\mathrm{var}(Y_i)$ are related. Suppose that

$$\mathrm{var}\, Y_i = g(EY_i)\,\sigma^2,$$

and let $f(\cdot)$ be a transform to be applied to the $Y_i$. The goal is to find a transform such that the variances of the transformed responses are constant. Using a Taylor expansion,

$$f(Y_i) \approx f(EY_i) + f'(EY_i)(Y_i - EY_i).$$

Therefore

$$\mathrm{var}\, f(Y_i) \approx f'(EY_i)^2\, \mathrm{var}(Y_i) = f'(EY_i)^2\, g(EY_i)\,\sigma^2.$$

The goal is to find $f$ such that $f' = 1/\sqrt{g}$.

Example: Suppose $g(z) = z^\lambda$. This includes the Poisson regression case $\lambda = 1$, where the variance is proportional to the mean, and the case $\lambda = 2$ where the standard deviation is proportional to the mean.

When $\lambda = 1$, $f$ solves $f'(z) = 1/\sqrt{z}$, so $f$ is the square root function. When $\lambda = 2$, $f$ solves $f'(z) = 1/z$, so $f$ is the logarithm function.
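The Poisson case can be illustrated by simulation: the raw variance grows with the mean, while the variance of the square root stays roughly constant near 1/4, which is the value the Taylor argument gives when $f(z) = \sqrt{z}$ and $f'(z)^2 g(z) = 1/4$. The means, sample size, and seed below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(11)
means = np.array([10.0, 50.0, 200.0])

# One column of Poisson draws per mean value.
Y = rng.poisson(lam=means, size=(200000, 3))

raw_var = Y.var(axis=0)             # roughly equal to the means
sqrt_var = np.sqrt(Y).var(axis=0)   # roughly constant, near 1/4
print(raw_var, sqrt_var)
```

The approximation is least accurate for the smallest mean, as expected for a first-order Taylor argument.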

Log/log regression
Suppose we fit a simple linear regression of the form

$$E(\log(Y) \mid \log(X)) = \alpha + \beta \log(X).$$

Suppose the logarithms are base 10. Let $X_z = X \cdot 10^z$. Under the model,

$$E(\log(Y) \mid X_z) - E(\log(Y) \mid X) = \beta z.$$

Using the crude approximation $\log E(Y|X) \approx E(\log(Y)|X)$, we conclude that $E(Y|X)$ is approximately scaled by a factor of $10^{\beta z}$ when $X$ is scaled by a factor of $10^z$. This holds for relatively small values of $z$ where the crude approximation holds.

Thus in a log/log model, we may say that an $f\%$ change in $X$ is approximately associated with a $\beta f\%$ change in the expected response.
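As a worked illustration of the percentage interpretation, take a hypothetical slope $\beta = 0.5$ (an assumed value, not from the slides). Scaling $X$ by $1 + f/100$ scales the approximate expected response by $(1 + f/100)^{\beta}$, which is close to $1 + \beta f/100$ only when $f$ is small:

```latex
\begin{aligned}
f = 10:&\quad (1.10)^{0.5} \approx 1.049
  &&\text{(about a 4.9\% increase, close to } \beta f = 5\%\text{)} \\
f = 100:&\quad (2.00)^{0.5} \approx 1.414
  &&\text{(about a 41\% increase, well short of } \beta f = 50\%\text{)}
\end{aligned}
```

So the "$\beta f\%$" reading is best reserved for modest percentage changes in $X$.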
Maximum likelihood estimation of a data transformation

The Box-Cox family of transforms is

$$y \mapsto \frac{y^\lambda - 1}{\lambda},$$

which makes sense only when all $Y_i$ are positive.

The Box-Cox family includes the identity ($\lambda = 1$), all power transformations such as the square root ($\lambda = 1/2$) and reciprocal ($\lambda = -1$), and the logarithm in the limiting case $\lambda \to 0$.

Suppose we assume that for some value of $\lambda$, the transformed data follow a linear model with Gaussian errors. We can then set out to estimate $\lambda$.

Write $g_\lambda(y) = (y^\lambda - 1)/\lambda$ for the Box-Cox transform and $Y_i^{(\lambda)} = g_\lambda(Y_i)$ for the transformed data. The joint log-likelihood of the transformed data is

$$-\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_i \left(Y_i^{(\lambda)} - \beta' X_i\right)^2.$$

Next we transform this back to a likelihood in terms of $Y_i = g_\lambda^{-1}(Y_i^{(\lambda)})$. This joint log-likelihood is

$$-\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_i \left(g_\lambda(Y_i) - \beta' X_i\right)^2 + \sum_i \log J_i,$$

where the Jacobian is

$$\log J_i = \log g_\lambda'(Y_i) = (\lambda - 1)\log Y_i.$$

The joint log likelihood for the $Y_i$ is therefore

$$-\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_i \left(g_\lambda(Y_i) - \beta' X_i\right)^2 + (\lambda - 1)\sum_i \log Y_i.$$

This likelihood is maximized with respect to $\lambda$, $\beta$, and $\sigma^2$ to identify the MLE.

To do the maximization, let $Y^{(\lambda)} \equiv g_\lambda(Y)$ denote the transformed observed responses, and let $\hat{Y}^{(\lambda)}$ denote the fitted values from regressing $Y^{(\lambda)}$ on $X$. Since $\sigma^2$ does not appear in the Jacobian,

$$\hat\sigma^2_\lambda \equiv n^{-1}\left\|Y^{(\lambda)} - \hat{Y}^{(\lambda)}\right\|^2$$

will be the maximizing value of $\sigma^2$. Therefore the MLE of $\beta$ and $\lambda$ will maximize

$$-\frac{n}{2}\log\hat\sigma^2_\lambda + (\lambda - 1)\sum_i \log Y_i.$$

Collinearity Diagnostics
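Collinearity inflates the sampling variances of covariate effect estimates.

To understand the effect of collinearity on $\mathrm{var}\,\hat\beta_j$, reorder the columns and partition the design matrix $X$ as

$$X = \begin{pmatrix} X_j & X_0 \end{pmatrix} = \begin{pmatrix} X_j - X_j^\perp + X_j^\perp & X_0 \end{pmatrix},$$

where $X_0$ is the $n \times p$ matrix consisting of all columns in $X$ except $X_j$, and $X_j^\perp$ is the projection of $X_j$ onto $\mathrm{col}(X_0)^\perp$. Therefore

$$H \equiv X'X = \begin{pmatrix} X_j'X_j & (X_j - X_j^\perp)'X_0 \\ X_0'(X_j - X_j^\perp) & X_0'X_0 \end{pmatrix}.$$

Since $\mathrm{var}\,\hat\beta_j = \sigma^2 H^{-1}_{11}$, we want a simple expression for $H^{-1}_{11}$.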


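A symmetric block matrix can be inverted using:

$$\begin{pmatrix} A & B \\ B' & C \end{pmatrix}^{-1} = \begin{pmatrix} S^{-1} & -S^{-1}BC^{-1} \\ -C^{-1}B'S^{-1} & C^{-1} + C^{-1}B'S^{-1}BC^{-1} \end{pmatrix},$$

where $S = A - BC^{-1}B'$. Therefore

$$H^{-1}_{11} = \left(\|X_j\|^2 - (X_j - X_j^\perp)'P_0(X_j - X_j^\perp)\right)^{-1},$$

where $P_0 = X_0(X_0'X_0)^{-1}X_0'$ is the projection matrix onto $\mathrm{col}(X_0)$.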


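Since $X_j - X_j^\perp \in \mathrm{col}(X_0)$, we can write

$$H^{-1}_{11} = \left(\|X_j\|^2 - \|X_j - X_j^\perp\|^2\right)^{-1},$$

and since $X_j^{\perp\prime}(X_j - X_j^\perp) = 0$, it follows that

$$\|X_j\|^2 = \|X_j - X_j^\perp + X_j^\perp\|^2 = \|X_j - X_j^\perp\|^2 + \|X_j^\perp\|^2,$$

so

$$H^{-1}_{11} = \frac{1}{\|X_j^\perp\|^2}.$$

This makes sense, since smaller values of $\|X_j^\perp\|^2$ correspond to greater collinearity.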

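Let $R^2_{jx}$ be the coefficient of determination (multiple $R^2$) for the regression of $X_j$ on the other covariates. The residual sum of squares of this regression is $X_j'(I - P_0)X_j = \|X_j^\perp\|^2$, where $P_0$ is the projection matrix onto $\mathrm{col}(X_0)$, so

$$R^2_{jx} = 1 - \frac{X_j'(I - P_0)X_j}{\|X_j - \bar{X}_j\|^2} = 1 - \frac{\|X_j^\perp\|^2}{\|X_j - \bar{X}_j\|^2}.$$

Combining the two equations yields

$$H^{-1}_{11} = \frac{1}{\|X_j - \bar{X}_j\|^2} \cdot \frac{1}{1 - R^2_{jx}}.$$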

The two factors in the expression

$$H^{-1}_{11} = \frac{1}{\|X_j - \bar{X}_j\|^2} \cdot \frac{1}{1 - R^2_{jx}}$$

reflect two different sources of variance of $\hat\beta_j$:

$1/\|X_j - \bar{X}_j\|^2 = 1/\left((n - 1)\,\widehat{\mathrm{var}}(X_j)\right)$ reflects the scaling of $X_j$.

The variance inflation factor (VIF) $1/(1 - R^2_{jx})$ is scale-free. It is always greater than or equal to 1, and is equal to 1 only if $X_j$ is orthogonal to the other covariates. Large values of the VIF indicate that parameter estimation is strongly affected by collinearity.
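The factorization of $H^{-1}_{11}$ into a scale term and the VIF can be checked numerically. The sketch below uses simulated covariates with one nearly collinear column; the data-generating choices and variable names are assumptions. Rather than reordering columns, it reads the diagonal entry of $(X'X)^{-1}$ at each covariate's own position.

```python
import numpy as np

rng = np.random.default_rng(13)
n = 100
Z = rng.normal(size=(n, 3))
# Make the third covariate nearly a linear combination of the first two.
Z[:, 2] = Z[:, 0] + Z[:, 1] + 0.3 * rng.normal(size=n)
X = np.column_stack([np.ones(n), Z])
H_inv = np.linalg.inv(X.T @ X)

vif = np.empty(3)
scale = np.empty(3)
for k in range(3):
    j = k + 1                                   # column index in X
    X0 = np.delete(X, j, axis=1)                # all columns except X_j
    xj = X[:, j]
    # Residual of X_j regressed on the other columns: X_j^perp.
    x_perp = xj - X0 @ np.linalg.lstsq(X0, xj, rcond=None)[0]
    tss = ((xj - xj.mean())**2).sum()           # ||X_j - mean(X_j)||^2
    r2 = 1 - (x_perp @ x_perp) / tss
    vif[k] = 1 / (1 - r2)
    scale[k] = 1 / tss

print(vif)
print(np.diag(H_inv)[1:], scale * vif)
```

In this setup all three covariates have elevated VIFs, since each is well predicted by the other two.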
