
Stat 331: Applied Linear Models

Notes by: Darrell Aucoin (daucoin@uwaterloo.ca)
Instructor: Dr. Leilei Zeng (lzeng@uwaterloo.ca)
Office: M3-4223; Office Hours: T & Th, 2:30-3:30 PM
University of Waterloo
Undergraduate Advisor: Diana Skrzydlo
TAs: Saad Khan (M3-3108, Space 1; Office Hours: Wednesday 4:00-5:00, Friday 1:30-2:30, M3-3111), Yu Nakajima, Zi Tian, Jianfeng Zhang

October 8, 2013
Contents

1 Introduction  4
  1.1 Definitions  4

2 Review of Simple Linear Regression Model  6
  2.1 The Model  6
    2.1.1 Assumptions about $\varepsilon$ (Gauss-Markov Assumptions)  6
      2.1.1.1 Assumption Implications  7
    2.1.2 Regression Parameters  7
  2.2 The Least-Squares Estimator (LSE)  7
  2.3 The Properties of $\hat\beta_0$ and $\hat\beta_1$  8
    2.3.1 Consequence of LS Fitting  10
  2.4 The Estimation of $\sigma^2$  11
  2.5 Confidence Intervals and Hypothesis Testing  14
    2.5.1 The t-test Statistic  14
  2.6 Value Prediction for Future Values  15
    2.6.1 Some Properties of $\hat y_p$  15
  2.7 Mean Prediction for Future Values  16
  2.8 Analysis of Variance (ANOVA) for Testing $H_0: \beta_1 = 0$  16
    2.8.1 F-Distribution  18
    2.8.2 Terminologies of ANOVA  18
    2.8.3 Coefficient of Determination $R^2$  18

3 Review of Random Vectors and Matrix Algebra  20
  3.1 Definitions  20
  3.2 Basic Properties  20
  3.3 Differentiating Over Linear and Quadratic Forms  21
  3.4 Some Useful Results on a Matrix  21
  3.5 Multivariate Normal Distribution  23

4 Multiple Linear Regression Model  24
  4.1 The Model  24
    4.1.1 Assumptions of Model  24
    4.1.2 Regression Coefficients $\beta_1, \ldots, \beta_p$  24
  4.2 LSE of $\beta$  25
    4.2.1 Properties of LSE $\hat\beta$  26
    4.2.2 Some Useful Results  26
  4.3 Residuals Relationship with the Hat Matrix  27
  4.4 An Estimation of $\sigma^2$  28
  4.5 Sampling Distribution of $\hat\beta$, $\hat\sigma^2$ under Normality  29
  4.6 Prediction  33
  4.7 ANOVA Table  34

5 Model and Model Assumptions  37
  5.1 Model and Model Assumptions  37
    5.1.1 Basic Model Assumptions  37
  5.2 Relationship Between Residuals and Random Errors  38
    5.2.1 Statistical Properties of r  39
  5.3 Residual Plot for Checking $E[\varepsilon_i] = 0$  39
    5.3.1 Residuals Versus $x_j$  40
    5.3.2 Partial Residuals Versus $x_j$  40
      5.3.2.1 Example  40
    5.3.3 Added-Variable Plots  41
  5.4 Residual Plots for Checking Constant Variance $\mathrm{Var}(\varepsilon_i) = \sigma^2$  42
  5.5 Residual Plots for Checking Normality of the $\varepsilon_i$'s  42
    5.5.1 Standardized Residual  43
  5.6 Residual Plots for Detecting Correlation in the $\varepsilon_i$'s  43
    5.6.1 Consequence of Correlation in the $\varepsilon_i$'s  44
    5.6.2 The Durbin-Watson Test  44

6 Model Evaluation: Data Transformation  46
  6.1 Box-Cox Transformation  46
    6.1.1 Remarks on Data Transformation  46
  6.2 Logarithmic Transformation  49
    6.2.1 Logarithmic Transformation of y Only  49
      6.2.1.1 Interpretation of $\beta_j$  49
    6.2.2 Logarithmic Transformation of All Variables  50
      6.2.2.1 Interpretation of $\beta_j$  50
    6.2.3 Logarithmic Transformation of y and Some $x_i$'s  50
    6.2.4 95% CI for Transformed Estimate  50
      6.2.4.1 There are Two Ways to Get a CI  50
  6.3 Transformation for Stabilizing Variance  51
  6.4 Some Remedies for Non-Linearity: Polynomial Regression  51

7 Model Evaluation: Outliers and Influential Cases  53
  7.1 Outliers  53
    7.1.1 How to Detect Outliers?  53
  7.2 Hat Matrix and Leverage  54
  7.3 Cook's Distance  55
    7.3.1 Cook's D Statistic  56
  7.4 Outliers and Influential Cases: Remove or Keep?  57

8 Model Building and Selection  58
  8.1 More Hypothesis Testing  58
    8.1.1 Testing Some But Not All $\beta$'s  58
      8.1.1.1 Extra Sum of Squares Principle  59
      8.1.1.2 Alternative Formulas for $F_0$  59
      8.1.1.3 ANOVA (Version 1)  60
      8.1.1.4 ANOVA (Version 2) (not including $\beta_0$)  60
    8.1.2 The General Linear Hypothesis  60
      8.1.2.1 The Test  61
  8.2 Categorical Predictors and Interaction Terms  61
    8.2.1 Binary Predictor  61
    8.2.2 Interaction Terms  62
    8.2.3 Categorical Predictor with More Than 2 Levels  62
      8.2.3.1 Dummy Variables  62
      8.2.3.2 Testing the Overall Effect of a Categorical Predictor  63
  8.3 Modeling Interactions With Categorical Predictors  63
  8.4 The Principle of Marginality  64
  8.5 Variable Selection  64
    8.5.1 Backward Elimination  65
    8.5.2 Forward Selection  68
    8.5.3 Stepwise Regression  69
    8.5.4 All Subsets Regressions  70
      8.5.4.1 $R^2$ Comparison  70
      8.5.4.2 $R^2_{\mathrm{adj}}$ Comparison  70
      8.5.4.3 Mallows $C_k$ Comparison  70
      8.5.4.4 AIC (Akaike's Information Criterion)  70

9 Multicollinearity in Regression Models  75
  9.1 Multicollinearity  75
  9.2 Consequence of Multicollinearity  76
  9.3 Detection of Multicollinearity Among $x_1, \ldots, x_p$  76
    9.3.1 Formal Check of Multicollinearity: Variance Inflation Factors (VIF)  77
  9.4 Ridge Regression  77
    9.4.1 Minimize Subject to Constraints (Lagrange Multiplier Method)  77
Recommended text: Linear Models with R.
References: Oxford Dictionary of Statistics; Regression Modeling.
Chapter 1
Introduction
1.1 Definitions

Definition 1.1. Response Variable ($y$): the dependent or outcome variable in a study. This is the primary variable of interest in a study.

Example 1.1. Yield of a crop, performance of a stock, etc.

Definition 1.2. Explanatory Variable(s) ($x_i$): also called the independent, antecedent, background, predictor, or controlled variable(s); these help predict the response variable.

Example 1.2. Type of fertilizer, temperature, average rainfall, quarterly returns, etc.

Definition 1.3. Regression: regression deals with the functional relationship between a response (or outcome) variable $y$ and one or more explanatory variables (or predictor variables) $x_1, x_2, \ldots, x_p$.

A general expression for a regression model is
$$y = f(x_1, x_2, \ldots, x_p) + \varepsilon,$$
where
- the function $f(x_1, x_2, \ldots, x_p)$ represents the deterministic relationship between $y$ and $x_1, x_2, \ldots, x_p$;
- $\varepsilon$ represents unexplained variation in $y$ due to other factors.

Remark 1.1. $y$ and $\varepsilon$ are considered the only random variables in this model, with
$$\mathrm{Var}(y) = \mathrm{Var}(\varepsilon) = \sigma^2.$$
The $x_1, x_2, \ldots, x_p$ are considered deterministic (non-random).

Example. Examples of applications:
- Linking climate change to man-made activities: $y$ = global surface temperature; $x$'s = greenhouse gasses.
- Finance: $y$ = stock price index; $x$'s = unemployment rate, money supply, etc.
- Economics: $y$ = unemployment rate; $x$'s = interest rate.
Regression modeling can be used for:
- identifying important factors (or explanatory variables),
- estimation,
- prediction.

In Stat 231 you saw only the simplest form of the regression model,
$$y = \beta_0 + \beta_1 x + \varepsilon,$$
where we have only one explanatory variable $x$, and the form of $f(x)$ is assumed to be a known linear function.

Example 1.3. Linear function:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2.$$

Example 1.4. Non-linear function:
$$y = \beta_0 + \beta_1 \exp(\beta_2 x).$$
If the derivative with respect to some $\beta$ still contains that $\beta$, then the function is non-linear (in the parameters).

Stat 331 extends the discussion to $p$ explanatory variables:
$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon.$$

Note. In this course we will use the model
$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon,$$
where $\beta_0, \beta_1, \ldots, \beta_p$ are constants in the linear function; we normally call them regression parameters (or coefficients). Note that the $\beta$'s are unknown and are estimated from the data.
Chapter 2
Review of Simple Linear Regression Model
2.1 The Model

Let $y$ be the response variable and $x$ the only explanatory variable. The simple linear regression model is given by
$$y = \beta_0 + \beta_1 x + \varepsilon,$$
where $\beta_0 + \beta_1 x$ represents the systematic relationship and $\varepsilon$ is the random error. $\beta_0$ and $\beta_1$ are unknown regression parameters. $y$ and $\varepsilon$ are considered random variables; $x$ is considered (in this course) a non-random variable.

Suppose we observe $n$ pairs of values $\{(y_i, x_i) \mid i = 1, 2, \ldots, n\}$ on $y$ and $x$ from a random sample of subjects. Then for the $i$th observation we have
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i.$$

[Figure: scatter plot of a response ($Y$: Response) against an explanatory variable ($X$: Medical Treatment), with observed points $(x_i, y_i)$, $i = 1, 2, 3, \ldots$, and the line $y = \beta_0 + \beta_1 x$.]
2.1.1 Assumptions about $\varepsilon$ (Gauss-Markov Assumptions)

Formally, we make a number of assumptions about $\varepsilon_1, \ldots, \varepsilon_n$ (conditional on the $x_i$):

1. $E[\varepsilon_i] = 0$
2. $\varepsilon_1, \ldots, \varepsilon_n$ are statistically independent
3. $\mathrm{Var}(\varepsilon_i) = \sigma^2$
4. $\varepsilon_i$ is normally distributed, for $i = 1, \ldots, n$.

These four assumptions are often summarized by saying that $\varepsilon_1, \ldots, \varepsilon_n$ are independent and identically distributed (iid) $N(0, \sigma^2)$.

In particular, assumption 1 is needed to ensure that a linear relationship between $y$ and $x$ is appropriate.
2.1.1.1 Assumption Implications

Assumption 1 implies that $E[y_i] = E[y_i \mid x_i] = \beta_0 + \beta_1 x_i$.
Assumption 2 implies that $y_1, \ldots, y_n$ are independent.
Assumption 3 implies that $\mathrm{Var}(y_i) = \sigma^2$ (constant over $x_i$).
Assumption 4 implies that $y_i$ is normally distributed.

So, equivalently, we can summarize that $y_1, \ldots, y_n$ are independent and normally distributed with
$$y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2).$$
2.1.2 Regression Parameters

The two unknown regression parameters are $\beta_0$ and $\beta_1$:
- $\beta_0$ is the intercept.
- $\beta_1$ is the slope and is of primary interest:
$$\beta_1 = E[y \mid x = a + 1] - E[y \mid x = a].$$
- If $\beta_1 = 0$, then $E[y \mid x] = \beta_0$.
2.2 The Least-Squares Estimator (LSE)

Suppose we let $\hat\beta_0$ and $\hat\beta_1$ be the chosen estimators for $\beta_0$ and $\beta_1$, respectively. The fitted value for $y_i$ from the regression line is
$$\hat y_i = \hat\beta_0 + \hat\beta_1 x_i.$$
The least-squares criterion chooses $\hat\beta_0$ and $\hat\beta_1$ to make the residuals
$$r_i = y_i - \hat y_i$$
small. Specifically, the LSE of $\beta_0$ and $\beta_1$ are chosen to minimize the sum of squared residuals:
$$\min_{\beta_0,\beta_1} S(\beta_0, \beta_1) = \sum_{i=1}^n r_i^2 = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2.$$
The LSE of $\beta_0$ and $\beta_1$ are
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x, \qquad \hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad (2.2.1)$$
or equivalently
$$\hat\beta_1 = \frac{s_{x,y}}{s_{x,x}}.$$

Note. In this course we occasionally use $y_i$ to denote the random variable from the $i$th subject of a sample, and sometimes the value (number) actually observed. Similarly, $\hat\beta_0, \hat\beta_1$ will be used both as estimators (random) and as particular estimates calculated from some data.
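A minimal R sketch (not from the original notes, using made-up simulated data) of how formula (2.2.1) can be evaluated directly and checked against R's built-in lm():

# Simulate a small data set (values invented for illustration only)
set.seed(331)
n <- 30
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)    # true beta0 = 2, beta1 = 0.5

# LSE from formula (2.2.1)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
beta1.hat <- Sxy / Sxx
beta0.hat <- mean(y) - beta1.hat * mean(x)
c(beta0.hat, beta1.hat)

# Same estimates from R's least-squares fit
coef(lm(y ~ x))

Both lines should print (essentially) identical estimates; lm() uses a numerically more stable QR decomposition rather than the explicit formula.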
2.3 The Properties of $\hat\beta_0$ and $\hat\beta_1$

We have the following properties of the LSE.

1. Unbiasedness:
$$E[\hat\beta_0] = \beta_0, \qquad E[\hat\beta_1] = \beta_1.$$

2. The theoretical variances of $\hat\beta_0$ and $\hat\beta_1$:
$$\mathrm{Var}(\hat\beta_0) = \sigma^2 \left(\frac{1}{n} + \frac{\bar x^2}{\sum (x_i - \bar x)^2}\right), \qquad \mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum (x_i - \bar x)^2}.$$

3. Covariance:
$$\mathrm{Cov}(\hat\beta_0, \hat\beta_1) = \frac{-\sigma^2 \bar x}{\sum (x_i - \bar x)^2}.$$

Proof. To prove the results related to $\hat\beta_1$ we write
$$\hat\beta_1 = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2}
= \sum \frac{(x_i - \bar x)}{\sum (x_i - \bar x)^2}\, y_i \;-\; \bar y \,\frac{\sum (x_i - \bar x)}{\sum (x_i - \bar x)^2}.$$
Since $\sum (x_i - \bar x) = 0$, the second term vanishes and
$$\hat\beta_1 = \sum c_i y_i, \qquad \text{where } c_i = \frac{x_i - \bar x}{\sum (x_i - \bar x)^2}.$$
Hence, because $y_1, \ldots, y_n$ are independent variables, the expectation of $\hat\beta_1$ is
$$E[\hat\beta_1] = \sum c_i E[y_i] = \sum c_i (\beta_0 + \beta_1 x_i) = \beta_0 \sum c_i + \beta_1 \sum c_i x_i.$$
Since $\sum (x_i - \bar x) = 0$ we have $\sum c_i = 0$, so $E[\hat\beta_1] = \beta_1 \sum c_i x_i$. Moreover,
$$\sum c_i x_i = \frac{\sum (x_i - \bar x) x_i}{\sum (x_i - \bar x)^2}
= \frac{\sum (x_i - \bar x)\big((x_i - \bar x) + \bar x\big)}{\sum (x_i - \bar x)^2}
= 1 + \bar x\, \frac{\sum (x_i - \bar x)}{\sum (x_i - \bar x)^2} = 1,$$
again using $\sum (x_i - \bar x) = 0$. Therefore $E[\hat\beta_1] = \beta_1$.

Similarly, because the $y_i$ are independent,
$$\mathrm{Var}(\hat\beta_1) = \mathrm{Var}\Big(\sum c_i y_i\Big) = \sum c_i^2\, \mathrm{Var}(y_i)
= \frac{\sum (x_i - \bar x)^2}{\big(\sum (x_i - \bar x)^2\big)^2}\,\sigma^2
= \frac{\sigma^2}{\sum (x_i - \bar x)^2}.$$

Result:
$$\hat\beta_1 \sim N\!\left(\beta_1,\; \frac{\sigma^2}{\sum (x_i - \bar x)^2}\right).$$
2.3.1 Consequence of LS Fitting

1. $\sum r_i = 0$
2. $\sum r_i x_i = 0$
3. $\sum r_i \hat y_i = 0$
4. The point $(\bar x, \bar y)$ is always on the fitted regression line.

Proof. Writing the fit in matrix form (anticipating Chapter 4), $r = y - \hat y = (I - H) y$, so
$$X^T r = X^T (I - H) y = X^T \big(I - X (X^T X)^{-1} X^T\big) y = X^T y - X^T y = 0,$$
and the rows of $X^T r = 0$ are exactly $\sum r_i = 0$ and $\sum r_i x_i = 0$.

For property 3,
$$\hat y^T r = \sum_{i=1}^n \hat y_i r_i = \sum_{i=1}^n (\hat\beta_0 + \hat\beta_1 x_i)\, r_i
= \hat\beta_0 \sum_{i=1}^n r_i + \hat\beta_1 \sum_{i=1}^n x_i r_i = 0,$$
using $\sum r_i = 0$ and $\sum r_i x_i = 0$.

For property 4, since $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$, evaluating the fitted line at $x = \bar x$ gives
$$\hat y = \hat\beta_0 + \hat\beta_1 \bar x = \bar y - \hat\beta_1 \bar x + \hat\beta_1 \bar x = \bar y.$$
2.4 The Estimation of $\sigma^2$

Note 1. We can re-write the model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ as
$$\varepsilon_i = y_i - \beta_0 - \beta_1 x_i$$
to emphasize the analogy with the residuals
$$r_i = y_i - \hat\beta_0 - \hat\beta_1 x_i.$$
We could say that the $r_i$ (which can be calculated) estimate the unobservable $\varepsilon_i$. The basic idea is then to use the sample variance of $r_1, \ldots, r_n$ to estimate the unknown $\mathrm{Var}(\varepsilon_i) = \sigma^2$. The sample variance of $r_1, \ldots, r_n$,
$$\frac{1}{n-1} \sum_{i=1}^n (r_i - \bar r)^2,$$
is actually not unbiased:
$$E\left[\frac{1}{n-1} \sum_{i=1}^n (r_i - \bar r)^2\right] \neq \sigma^2.$$
The unbiased estimator of $\sigma^2$ is defined as
$$\hat\sigma^2 = S^2 = \frac{1}{n-2} \sum_{i=1}^n (r_i - \bar r)^2.$$

Proof. (See also the assignment solutions.) Since $r_i = y_i - \hat y_i$ and $\bar r = \frac{1}{n}\sum r_i = 0$,
$$\sum (r_i - \bar r)^2 = \sum (y_i - \hat y_i)^2.$$
Substituting $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$ and then $\hat\beta_1 = S_{x,y}/S_{x,x}$,
$$E\Big[\sum (r_i - \bar r)^2\Big] = \sum E\big[(y_i - \hat\beta_0 - \hat\beta_1 x_i)^2\big]
= \sum E\big[\big((y_i - \bar y) - \hat\beta_1 (x_i - \bar x)\big)^2\big],$$
and expanding the square,
$$E\Big[\sum (r_i - \bar r)^2\Big]
= E\Big[\sum (y_i - \bar y)^2\Big]
- 2E\left[\frac{S_{x,y}}{S_{x,x}} \sum (x_i - \bar x)(y_i - \bar y)\right]
+ E\left[\Big(\frac{S_{x,y}}{S_{x,x}}\Big)^{\!2} \sum (x_i - \bar x)^2\right]
= E[S_{y,y}] - \frac{E[S_{x,y}^2]}{S_{x,x}},$$
where $S_{y,y} = \sum (y_i - \bar y)^2$, $S_{x,y} = \sum (x_i - \bar x)(y_i - \bar y)$, and $S_{x,x}$ is a constant.

First,
$$E[S_{y,y}] = E\Big[\sum y_i^2 - n\bar y^2\Big]
= \sum \big(\mathrm{Var}(y_i) + E^2[y_i]\big) - n\big(\mathrm{Var}(\bar y) + E^2[\bar y]\big)
= n\sigma^2 + \sum (\beta_0 + \beta_1 x_i)^2 - n\Big(\frac{\sigma^2}{n} + (\beta_0 + \beta_1 \bar x)^2\Big),$$
and since $\sum (\beta_0 + \beta_1 x_i)^2 - n(\beta_0 + \beta_1 \bar x)^2 = \beta_1^2 \big(\sum x_i^2 - n\bar x^2\big) = \beta_1^2 S_{x,x}$,
$$E[S_{y,y}] = (n-1)\sigma^2 + \beta_1^2 S_{x,x}.$$

Second, $E[S_{x,y}^2] = \mathrm{Var}(S_{x,y}) + E^2[S_{x,y}]$. Since $\sum (x_i - \bar x) = 0$ we may write $S_{x,y} = \sum (x_i - \bar x) y_i$, so
$$\mathrm{Var}(S_{x,y}) = \sum (x_i - \bar x)^2\, \mathrm{Var}(y_i) = \sigma^2 S_{x,x},
\qquad E[S_{x,y}] = \sum (x_i - \bar x)(\beta_0 + \beta_1 x_i) = \beta_1 S_{x,x},$$
and therefore
$$E[S_{x,y}^2] = \sigma^2 S_{x,x} + \beta_1^2 S_{x,x}^2 = S_{x,x}\big(\sigma^2 + \beta_1^2 S_{x,x}\big).$$

Putting the pieces together,
$$E\Big[\sum (r_i - \bar r)^2\Big] = E[S_{y,y}] - \frac{E[S_{x,y}^2]}{S_{x,x}}
= (n-1)\sigma^2 + \beta_1^2 S_{x,x} - \big(\sigma^2 + \beta_1^2 S_{x,x}\big) = (n-2)\sigma^2,$$
so $E[S^2] = \sigma^2$, as claimed.
2.5 Confidence Intervals and Hypothesis Testing

Recall that $\hat\beta_1 \sim N(\beta_1, \sigma^2 / S_{xx})$, so
$$\frac{\hat\beta_1 - \beta_1}{\sqrt{\sigma^2 / S_{xx}}} \sim N(0, 1).$$
By definition,
$$P\left(-1.96 < \frac{\hat\beta_1 - \beta_1}{\sqrt{\sigma^2/S_{xx}}} < 1.96\right) = 0.95,
\qquad
P\left(\hat\beta_1 - 1.96\sqrt{\frac{\sigma^2}{S_{xx}}} < \beta_1 < \hat\beta_1 + 1.96\sqrt{\frac{\sigma^2}{S_{xx}}}\right) = 0.95,$$
so a 95% CI for $\beta_1$ is $\hat\beta_1 \pm 1.96\sqrt{\sigma^2/S_{xx}}$ when $\sigma^2$ is known.

In most practice $\sigma^2$ is unknown. When we replace $\sigma^2$ by $S^2$, the unknown standard deviation of $\hat\beta_1$, $\sqrt{\sigma^2/S_{xx}}$, is replaced by the standard error
$$SE(\hat\beta_1) = \sqrt{\frac{S^2}{S_{xx}}}, \qquad \text{where } S^2 = \frac{1}{n-2}\sum r_i^2.$$
The standardized $\hat\beta_1$ random variable becomes
$$\frac{\hat\beta_1 - \beta_1}{SE(\hat\beta_1)} \sim t_{n-2},$$
which is no longer standard normal but has a t-distribution with $n-2$ degrees of freedom.

A $100(1-\alpha)\%$ confidence interval for $\beta_1$ is
$$\hat\beta_1 \pm t_{n-2,\alpha/2}\, SE(\hat\beta_1).$$
Hypothesis tests are derived and computed in a similar way. To test
$$H_0: \beta_1 = \beta_1^* \qquad \text{versus} \qquad H_a: \beta_1 \neq \beta_1^*,$$
we use the t-statistic
$$t = \frac{\hat\beta_1 - \beta_1^*}{SE(\hat\beta_1)},$$
which has a $t_{n-2}$ distribution when $H_0$ is true.

2.5.1 The t-test Statistic

$$t = \frac{\hat\beta_1 - \beta_1^*}{SE(\hat\beta_1)} \sim t_{n-2} \quad \text{if } H_0: \beta_1 = \beta_1^* \text{ holds.}$$
[Figure: $t_{n-2}$ density with rejection regions of total area $\alpha$ in the two tails, beyond $\pm t_{n-2,\alpha/2}$.]
Formally, if
$$|t| = \left|\frac{\hat\beta_1 - \beta_1^*}{SE(\hat\beta_1)}\right| > t_{n-2,\alpha/2},$$
there is evidence to reject $H_0: \beta_1 = \beta_1^*$ at significance level $\alpha$. Otherwise, we cannot reject $H_0$.
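A hedged R sketch (continuing the simulated x, y, Sxx, beta1.hat from the earlier sketch, which are illustration-only objects) of how the standard error, t-statistic, and 95% CI above can be computed by hand and matched against summary() and confint():

fit <- lm(y ~ x)
S2  <- sum(residuals(fit)^2) / (length(y) - 2)   # S^2 = (1/(n-2)) sum r_i^2
se.beta1 <- sqrt(S2 / Sxx)                       # SE(beta1.hat)
t.stat   <- beta1.hat / se.beta1                 # test of H0: beta1 = 0
p.value  <- 2 * pt(-abs(t.stat), df = length(y) - 2)
c(se.beta1, t.stat, p.value)

# 95% CI: beta1.hat +/- t_{n-2, 0.025} * SE(beta1.hat)
beta1.hat + c(-1, 1) * qt(0.975, df = length(y) - 2) * se.beta1
confint(fit, "x", level = 0.95)     # same interval from R
summary(fit)$coefficients           # estimate, SE, t value, p-value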
2.6 Value Prediction for Future Values

The fitted value
$$\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$$
refers to an $x$ which is part of the sample data. Suppose instead we wish to predict a single future value at a given $x = x_p$. The future value is
$$y_p = \beta_0 + \beta_1 x_p + \varepsilon_p,$$
where $\varepsilon_p$ is the future error. Naturally, we replace $\varepsilon_p$ by its expectation (zero) and use
$$\hat y_p = \hat\beta_0 + \hat\beta_1 x_p$$
to predict $y_p$.

2.6.1 Some Properties of $\hat y_p$

1. $E[\hat y_p - y_p] = 0$ (an unbiased prediction).

2. $\mathrm{Var}(\hat y_p - y_p) = \left(1 + \dfrac{1}{n} + \dfrac{(x_p - \bar x)^2}{S_{xx}}\right)\sigma^2$. To see this, write
$$\hat y_p - y_p = \hat\beta_0 + \hat\beta_1 x_p - (\beta_0 + \beta_1 x_p + \varepsilon_p).$$
Note that $\varepsilon_p$ is independent of $\hat\beta_0$ and $\hat\beta_1$, since it is a future error unrelated to the data from which $\hat\beta_0$ and $\hat\beta_1$ are calculated. Hence
$$\mathrm{Var}(\hat y_p - y_p) = \mathrm{Var}(\varepsilon_p) + \mathrm{Var}(\hat\beta_0 + \hat\beta_1 x_p).$$

3. It can be shown that
$$\frac{\hat y_p - y_p}{SE(\hat y_p - y_p)} \sim t_{n-2},
\qquad \text{where } SE(\hat y_p - y_p) = \sqrt{\left(1 + \frac{1}{n} + \frac{(x_p - \bar x)^2}{S_{xx}}\right) S^2}$$
and $S_{xx} = \sum (x_i - \bar x)^2$.
2.7 Mean Prediction for Future Values

Suppose instead we wish to predict the mean of future response values at a given $x = x_p$,
$$\mu_p = \beta_0 + \beta_1 x_p.$$
We still use
$$\hat\mu_p = \hat\beta_0 + \hat\beta_1 x_p$$
as the predicted future mean. The variance of the prediction error $\mathrm{Var}(\hat\mu_p - \mu_p)$ is smaller than the variance of the prediction error of $\hat y_p$:
$$SE(\hat\mu_p - \mu_p) = \sqrt{\left(\frac{1}{n} + \frac{(x_p - \bar x)^2}{S_{xx}}\right) S^2}.$$

Note. Notice that $\mathrm{Var}(\hat\mu_p - \mu_p) < \mathrm{Var}(\hat y_p - y_p)$.
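A brief R sketch (continuing the illustration-only simulated fit from above; the new point $x_p = 5$ is made up) contrasting the interval for a single future value $y_p$ with the narrower interval for the mean $\mu_p$:

new.x <- data.frame(x = 5)    # a hypothetical future value x_p = 5

# Interval for a single future y_p: uses Var = sigma^2 (1 + 1/n + (x_p - xbar)^2 / Sxx)
predict(fit, newdata = new.x, interval = "prediction", level = 0.95)

# Interval for the mean mu_p: drops the leading "1 +" term, so it is narrower
predict(fit, newdata = new.x, interval = "confidence", level = 0.95)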
2.8 Analysis of Variance (ANOVA) for Testing $H_0: \beta_1 = 0$

The total variation among the $y_i$'s is measured by
$$SST = \sum_{i=1}^n (y_i - \bar y)^2.$$
If there is no variation (all $y_i$'s are the same), then $SST = 0$; the bigger the SST, the more variation. If we re-write SST as
$$SST = \sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n (y_i - \hat y_i + \hat y_i - \bar y)^2
= \underbrace{\sum (y_i - \hat y_i)^2}_{\sum r_i^2} + \sum (\hat y_i - \bar y)^2 + \underbrace{2\sum (y_i - \hat y_i)(\hat y_i - \bar y)}_{=0},$$
then
$$SST = SSE + SSR,$$
where
- SSE is the sum of squares of residuals; it measures the variability of the $y_i$'s that is unexplained by the regression model.
- SSR is the sum of squares of regression; it measures the variability of the response that is accounted for by the regression model.

If $H_0: \beta_1 = 0$ is true, SSR should be relatively small compared to SSE. Our decision is to reject $H_0$ if the ratio of SSR to SSE is large.

Some distributional results (when $H_0$ is true):
$$\frac{SST}{\sigma^2} \sim \chi^2_{(n-1)}.$$
To show this, recall that under $H_0$, $y_1, \ldots, y_n$ are independent $N(\beta_0, \sigma^2)$, so
$$\sum \left(\frac{y_i - \beta_0}{\sigma}\right)^2 \sim \chi^2_{(n)}.$$
By re-arrangement of SST,
$$SST = \sum_{i=1}^n (y_i - \beta_0 + \beta_0 - \bar y)^2 = \sum_{i=1}^n (y_i - \beta_0)^2 - n(\bar y - \beta_0)^2,$$
so
$$\underbrace{\frac{SST}{\sigma^2}}_{\chi^2_{(n-1)}}
= \underbrace{\frac{\sum (y_i - \beta_0)^2}{\sigma^2}}_{\chi^2_{(n)}}
- \underbrace{\frac{n(\bar y - \beta_0)^2}{\sigma^2}}_{\chi^2_{(1)}}.$$

Theorem 2.1 (Cochran's Theorem).
$$\frac{\sum_{i=1}^n (y_i - \beta_0)^2}{\sigma^2} = \frac{SST}{\sigma^2} + \frac{n(\bar y - \beta_0)^2}{\sigma^2}.$$

1. From Cochran's Theorem, $SST/\sigma^2$ is independent of $n(\bar y - \beta_0)^2/\sigma^2$, and $SST/\sigma^2 \sim \chi^2_{(n-1)}$.

2. $SSR/\sigma^2 \sim \chi^2_{(1)}$. Indeed, using $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$,
$$SSR = \sum (\hat y_i - \bar y)^2 = \sum (\hat\beta_0 + \hat\beta_1 x_i - \bar y)^2
= \sum (\bar y - \hat\beta_1 \bar x + \hat\beta_1 x_i - \bar y)^2
= \hat\beta_1^2 \sum (x_i - \bar x)^2 = \hat\beta_1^2 S_{xx}.$$
Recall
$$\hat\beta_1 \sim N\!\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right)
\;\Rightarrow\; \left(\frac{\hat\beta_1 - \beta_1}{\sigma/\sqrt{S_{xx}}}\right)^2 \sim \chi^2_{(1)}.$$
Under $H_0: \beta_1 = 0$,
$$\frac{\hat\beta_1^2}{\sigma^2/S_{xx}} = \frac{\hat\beta_1^2 S_{xx}}{\sigma^2} = \frac{SSR}{\sigma^2} \sim \chi^2_{(1)}.$$

3. $SSE/\sigma^2 \sim \chi^2_{(n-2)}$. Since
$$\underbrace{\frac{SST}{\sigma^2}}_{\chi^2_{(n-1)}} = \frac{SSE}{\sigma^2} + \underbrace{\frac{SSR}{\sigma^2}}_{\chi^2_{(1)}},$$
it follows from Cochran's Theorem that $SSE/\sigma^2 \sim \chi^2_{(n-2)}$.
2.8.1 F-Distribution

Based on these results, we derive the F-statistic
$$F = \frac{(SSR/\sigma^2)/1}{(SSE/\sigma^2)/(n-2)} = \frac{SSR/1}{SSE/(n-2)} \sim F_{(1, n-2)}.$$
It can be used for testing $H_0: \beta_1 = 0$; we reject $H_0$ at level $\alpha$ if
$$F > F_{(1, n-2), \alpha}.$$
Recall
$$t = \frac{\hat\beta_1}{SE(\hat\beta_1)} = \frac{\hat\beta_1}{\sqrt{s^2/S_{xx}}}
\;\Rightarrow\; t^2 = \frac{\hat\beta_1^2 S_{xx}}{s^2} = F,$$
and $t^2_{n-2} = F_{(1, n-2)}$ (this holds only for a single slope parameter). So the t-test and F-test for $H_0: \beta_1 = 0$ are equivalent for SLR.
2.8.2 Terminologies of ANOVA

Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F | p-value
Regression | SSR | 1 | MSR = SSR/1 | F = MSR/MSE |
Residual | SSE | n - 2 | MSE = SSE/(n - 2) | |
Total | SST | n - 1 | | |

For p explanatory variables:

Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F | p-value
Regression | SSR = (yhat - ybar)^T (yhat - ybar) | (p + 1) - 1 | MSR = SSR/p | F = MSR/MSE |
Residual | SSE | n - p - 1 | MSE = SSE/(n - p - 1) | |
Total | SST | n - 1 | | |
2.8.3 Coefficient of Determination $R^2$

$$R^2 = \frac{SSR}{SST}, \qquad 0 \le R^2 \le 1.$$
It is a measure of goodness-of-fit of the regression model to the data. In the case of SLR, using $\hat\beta_1 = S_{xy}/S_{xx}$,
$$R^2 = \frac{SSR}{SST} = \frac{\hat\beta_1^2 S_{xx}}{S_{yy}} = \frac{S_{xy}^2}{S_{xx} S_{yy}} = r^2,$$
where $r$ is the sample correlation coefficient. $R^2$ is applicable to multiple regression, but the interpretation as the squared correlation $r^2$ with a single $x$ is not.
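A short R sketch (continuing the illustration-only simulated simple regression fit from Section 2.5) of the SST = SSE + SSR decomposition, the F-statistic, $R^2$, and the $F = t^2$ equivalence:

SST <- sum((y - mean(y))^2)
SSE <- sum(residuals(fit)^2)
SSR <- SST - SSE
F0  <- (SSR / 1) / (SSE / (length(y) - 2))
c(SSR = SSR, SSE = SSE, SST = SST, F = F0)

anova(fit)                        # R's ANOVA table: same SSR, SSE and F
summary(fit)$r.squared            # R^2 = SSR / SST
(summary(fit)$coefficients["x", "t value"])^2   # equals F for simple linear regression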
Chapter 3
Review of Random Vectors and Matrix Algebra
3.1 Definitions

Definition 3.1. Vector of variables: $Y = (y_1, \ldots, y_n)^T$, with
$$E[Y] = \begin{pmatrix} E[y_1] \\ \vdots \\ E[y_n] \end{pmatrix}
= \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_n \end{pmatrix} = \mu,$$
$$\mathrm{Var}(Y) = \Sigma = \begin{pmatrix}
\mathrm{Var}(y_1) & \mathrm{Cov}(y_1, y_2) & \cdots & \mathrm{Cov}(y_1, y_n) \\
\mathrm{Cov}(y_2, y_1) & \mathrm{Var}(y_2) & \cdots & \mathrm{Cov}(y_2, y_n) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}(y_n, y_1) & \mathrm{Cov}(y_n, y_2) & \cdots & \mathrm{Var}(y_n)
\end{pmatrix}_{n \times n} = [\sigma_{i,j}]_{n \times n},$$
$$\mathrm{Var}(Y) = E\big[(Y - E[Y])(Y - E[Y])^T\big] = \big[E[(y_i - \mu_i)(y_j - \mu_j)]\big]_{n \times n}.$$
If $y_1, \ldots, y_n$ are independent and identically distributed, then
$$\mathrm{Var}(Y) = \sigma^2 I.$$
3.2 Basic Properties

Let $A = (a_{i,j})_{m \times n}$, $b = (b_1, b_2, \ldots, b_m)^T$, and $c = (c_1, c_2, \ldots, c_n)^T$ be constants.

1. $E[Ay + b] = A\,E[y] + b$
2. $\mathrm{Var}(y + c) = \mathrm{Var}(y)$
3. $\mathrm{Var}(Ay) = A\,\mathrm{Var}(y)\,A^T$
4. $\mathrm{Var}(Ay + b) = A\,\mathrm{Var}(y)\,A^T$
3.3 Differentiating Over Linear and Quadratic Forms

1. For $f(y) = f(y_1, \ldots, y_n)$,
$$\frac{d}{dy} f = \left(\frac{\partial f}{\partial y_1}, \ldots, \frac{\partial f}{\partial y_n}\right)^T.$$
2. For $f = c^T y = \sum_{i=1}^n c_i y_i$,
$$\frac{d}{dy} f = c.$$
3. For $f = y^T A y$, where $A$ is a symmetric matrix,
$$\frac{d}{dy} f = 2Ay.$$

Example. For
$$f = y^T A y = \sum_i \sum_j a_{i,j}\, y_i y_j = \sum_i a_{i,i}\, y_i^2 + 2\sum_{i<j} a_{i,j}\, y_i y_j,$$
we have
$$\frac{\partial f}{\partial y_1} = 2a_{1,1} y_1 + 2\sum_{j \neq 1} a_{1,j} y_j = 2\sum_{j=1}^n a_{1,j} y_j,$$
which is the first element of $2Ay$.
3.4 Some Useful Results on a Matrix

1. Trace:
$$\mathrm{tr}(A_{m \times m}) = \sum_{i=1}^m a_{i,i}, \qquad \mathrm{tr}(B_{m \times n} C_{n \times m}) = \mathrm{tr}(C_{n \times m} B_{m \times n}).$$

2. Rank of a matrix: $\mathrm{rank}(A_{m \times m})$ = number of linearly independent columns.

3. Vectors $(y_1, \ldots, y_m)$ are linearly independent iff
$$c_1 y_1 + \cdots + c_m y_m = 0 \;\Rightarrow\; c_1 = \cdots = c_m = 0.$$

4. Orthogonal vectors and matrices:
(a) Two vectors are orthogonal if $y^T x = 0$.
(b) $A_{m \times m}$ is orthogonal (orthonormal) if $A^T A = A A^T = I$, i.e. $A^T = A^{-1}$.

5. Eigenvalues and eigenvectors: a vector $v_i$ is called an eigenvector of $A_{m \times m}$ if there exists $\lambda_i$ such that
$$A v_i = \lambda_i v_i, \qquad i = 1, 2, \ldots, k,$$
where $\lambda_i$ is the corresponding eigenvalue.

6. Decomposition of a symmetric matrix ($A^T = A$): for a symmetric matrix $A_{m \times m}$, the eigenvalues $\lambda_1, \ldots, \lambda_m$ are real and there exists an orthogonal matrix $P$ such that
$$A = P \Lambda P^T,$$
where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_m)$ is a diagonal matrix with the eigenvalues on the diagonal, and $P = (v_1 \;\cdots\; v_m)$ is a matrix with the eigenvectors as its columns.

7. Idempotent matrix: $A_{m \times m}$ is idempotent if $A^2 = A$.

Result: if $A_{m \times m}$ is idempotent, then all its eigenvalues are either 0 or 1.
Proof. $A v_i = \lambda_i v_i \Rightarrow A^2 v_i = \lambda_i A v_i \Rightarrow \lambda_i v_i = \lambda_i^2 v_i \Rightarrow \lambda_i \in \{0, 1\}$.

Result: if $A_{m \times m}$ is idempotent (and symmetric), there exists an orthogonal matrix $P$ such that
$$A = P \Lambda P^T, \qquad \Lambda = \mathrm{diag}(1, \ldots, 1, 0, \ldots, 0).$$
Proof. Since $\lambda_i \in \{0, 1\}$, $\Lambda$ is a diagonal matrix of 0's and 1's, and
$$\mathrm{tr}(A) = \mathrm{rank}(A) = \mathrm{tr}(\Lambda) = \text{number of eigenvalues equal to } 1.$$
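A small numerical check in R (the design matrix X below is made up for illustration) that the hat matrix $H = X(X^T X)^{-1}X^T$, which is used heavily in Chapter 4, is symmetric and idempotent with eigenvalues 0 or 1 and trace $p + 1$:

set.seed(1)
n <- 8; p <- 2
X <- cbind(1, matrix(rnorm(n * p), n, p))   # an arbitrary full-rank n x (p+1) design matrix
H <- X %*% solve(t(X) %*% X) %*% t(X)       # hat matrix H = X (X^T X)^{-1} X^T

max(abs(H %*% H - H))        # ~ 0: H is idempotent
max(abs(H - t(H)))           # ~ 0: H is symmetric
round(eigen(H)$values, 10)   # eigenvalues are 0 or 1
sum(diag(H))                 # trace = rank = p + 1 = 3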
3.5 Multivariate Normal Distribution

The random vector $y = (y_1, \ldots, y_n)^T$ follows a multivariate normal distribution with joint pdf
$$f(y) = \left(\frac{1}{2\pi}\right)^{n/2} |\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(y - \mu)^T \Sigma^{-1} (y - \mu)\right),$$
where
$$\mu = E[y]_{n \times 1} = (E[y_1], E[y_2], \ldots, E[y_n])^T
\qquad \text{and} \qquad \Sigma = \mathrm{Var}(y)_{n \times n} = (\sigma_{i,j})_{n \times n}.$$
We write $y \sim MVN(\mu, \Sigma)$.

1. Marginal normality: if $y \sim MVN(\mu, \Sigma)$, then $y_i \sim N(\mu_i, \sigma_{i,i})$, where $\sigma_{i,i}$ is the $(i,i)$th element of $\Sigma$.

2. $y_1, \ldots, y_n$ are independent iff $\Sigma$ is diagonal. (In general, if $y_i, y_j$ are independent then $\mathrm{Cov}(y_i, y_j) = 0$, but $\mathrm{Cov}(y_i, y_j) = 0$ alone does not imply that $y_i$ and $y_j$ are independent; under joint normality it does.)

3. If $y \sim MVN(\mu, \Sigma)$ and $Z = Ay$, then $Z \sim MVN(A\mu, A\Sigma A^T)$.

4. If $U \sim MVN(\mu, \Sigma)$, $y_1 = AU$, and $y_2 = BU$, then $y_1$ and $y_2$ are independent if
$$\mathrm{Cov}(y_1, y_2) = \mathrm{Cov}(AU, BU) = A\,\mathrm{Var}(U)\,B^T = A\Sigma B^T = 0$$
(which reduces to $AB^T = 0$ when $\Sigma = \sigma^2 I$).

5. If $y_1, \ldots, y_n$ are iid $N(\mu, \sigma^2)$, then $y \sim MVN(\mu \mathbf{1}_n, \sigma^2 I)$.

6. If $y \sim MVN(0, \sigma^2 I)$, then
$$\frac{y^T y}{\sigma^2} \sim \chi^2_{(n)}.$$
Chapter 4
Multiple Linear Regression Model

Suppose we are interested in the relationship between a type of air pollutant and lung function:
y: FEV1 (forced expiratory volume)
x1: level of a type of air pollutant
x2: age
x3: gender
4.1 The Model

The general model is of the form
$$y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p} + \varepsilon_i,$$
where $x_{i,1}, \ldots, x_{i,p}$ are $p$ explanatory variables and $\beta_1, \ldots, \beta_p$ are the regression coefficients associated with these explanatory variables, respectively, for $i = 1, \ldots, n$.
4.1.1 Assumptions of Model

1. $E[\varepsilon_i] = 0 \;\Rightarrow\; E[y_i] = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p}$
2. $\mathrm{Var}(\varepsilon_i) = \sigma^2 \;\Rightarrow\; \mathrm{Var}(y_i) = \sigma^2$
3. $\varepsilon_1, \ldots, \varepsilon_n$ are independent $\;\Rightarrow\; y_1, \ldots, y_n$ are independent
4. A stronger assumption: $\varepsilon_i \sim N(0, \sigma^2) \;\Rightarrow\; y_i \sim N(\beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p},\; \sigma^2)$
4.1.2 Regression Coefficients $\beta_1, \ldots, \beta_p$

$\beta_j$: the average amount of increase (or decrease) in the response when the $j$th covariate $x_j$ increases (or decreases) by 1 unit, while holding all other covariates fixed.

If we have two covariates, the fitted mean surface is a two-dimensional plane.

$H_0: \beta_j = 0$ means that $x_j$ is not linearly related to $y$, given all the other explanatory variables in the model.

In matrix form:
$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
= \beta_0 \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}
+ \beta_1 \begin{pmatrix} x_{1,1} \\ x_{2,1} \\ \vdots \\ x_{n,1} \end{pmatrix}
+ \beta_2 \begin{pmatrix} x_{1,2} \\ x_{2,2} \\ \vdots \\ x_{n,2} \end{pmatrix}
+ \cdots
+ \beta_p \begin{pmatrix} x_{1,p} \\ x_{2,p} \\ \vdots \\ x_{n,p} \end{pmatrix}
+ \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix},$$
i.e.
$$y_{n \times 1}
= \underbrace{\begin{pmatrix}
1 & x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\
1 & x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n,1} & x_{n,2} & \cdots & x_{n,p}
\end{pmatrix}}_{X,\; n \times (p+1)}
\underbrace{\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}}_{\beta,\; (p+1) \times 1}
+ \varepsilon_{n \times 1},$$
or compactly
$$y = X\beta + \varepsilon, \qquad \varepsilon \sim MVN(0, \sigma^2 I), \qquad y \sim MVN(X\beta, \sigma^2 I).$$
4.2 LSE of $\beta$

Least squares chooses $\hat\beta$ to make the $n \times 1$ vector $\hat y = X\hat\beta$ close to $y$ (equivalently, to make the residual vector $r = y - \hat y$ small).

Specifically, we minimize
$$S(\beta) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i,1} - \cdots - \beta_p x_{i,p})^2
= (y - X\beta)^T (y - X\beta)
= y^T y - y^T X\beta - \beta^T X^T y + \beta^T X^T X \beta.$$
Since $\beta^T X^T y$ is a scalar, $y^T X\beta = \beta^T X^T y$, so
$$S(\beta) = y^T y - 2 y^T X\beta + \beta^T X^T X \beta.$$
Setting $\frac{\partial}{\partial\beta} S(\beta) = 0$ (using $\frac{\partial}{\partial\beta}\,\beta^T A \beta = 2A\beta$ for symmetric $A$, $\frac{\partial}{\partial\beta}\, y^T X\beta = X^T y$, and that constants differentiate to zero):
$$0 = -2X^T y + 2X^T X \beta
\;\Rightarrow\; X^T X \beta = X^T y
\;\Rightarrow\; \hat\beta = (X^T X)^{-1} X^T y.$$
(We require $X^T X$ to be of full rank.)
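A hedged R sketch (the simulated data below are invented for illustration) computing $\hat\beta = (X^T X)^{-1} X^T y$ from the normal equations and comparing with lm():

set.seed(2)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                        # n x (p+1) design matrix, p = 2
beta.hat <- solve(t(X) %*% X, t(X) %*% y)    # solves (X^T X) beta = X^T y
drop(beta.hat)

coef(lm(y ~ x1 + x2))                        # same estimates (lm uses a QR decomposition, which is more stable)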
4.2.1 Properties of LSE $\hat\beta$

1. $\hat\beta$ is unbiased:
$$E[\hat\beta] = E\big[(X^T X)^{-1} X^T y\big] = (X^T X)^{-1} X^T E[y] = (X^T X)^{-1} X^T X\beta = \beta.$$

2. $\mathrm{Var}(\hat\beta) = \sigma^2 (X^T X)^{-1}$:
$$\mathrm{Var}(\hat\beta) = \mathrm{Var}\big((X^T X)^{-1} X^T y\big)
= (X^T X)^{-1} X^T\,\mathrm{Var}(y)\,X (X^T X)^{-1}
= \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1}
= \sigma^2 (X^T X)^{-1}.$$
Written out, $\mathrm{Var}(\hat\beta)$ is the $(p+1) \times (p+1)$ matrix
$$\sigma^2 (X^T X)^{-1} = \begin{pmatrix}
\mathrm{Var}(\hat\beta_0) & \mathrm{Cov}(\hat\beta_0, \hat\beta_1) & \cdots & \mathrm{Cov}(\hat\beta_0, \hat\beta_p) \\
\mathrm{Cov}(\hat\beta_1, \hat\beta_0) & \mathrm{Var}(\hat\beta_1) & \cdots & \mathrm{Cov}(\hat\beta_1, \hat\beta_p) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}(\hat\beta_p, \hat\beta_0) & \mathrm{Cov}(\hat\beta_p, \hat\beta_1) & \cdots & \mathrm{Var}(\hat\beta_p)
\end{pmatrix}.$$
4.2.2 Some Useful Results

Fitted values:
$$\hat y = X\hat\beta = X (X^T X)^{-1} X^T y.$$
Let $H = X (X^T X)^{-1} X^T$, the hat matrix, so $\hat y = Hy$.

The matrix $H$ is idempotent and symmetric, so $H$ is a projection matrix which projects $y$ onto $\mathrm{Col}(X)$, the $(p+1)$-dimensional subspace spanned by linear combinations of the $p+1$ columns of $X$:
$$\hat y = \mathrm{proj}_{\mathrm{Col}(H)}\, y, \qquad \mathrm{Col}(X) = \mathrm{Col}(H), \qquad r = \mathrm{proj}_{\mathrm{Col}(I-H)}\, y.$$
[Figure: a graphical representation of $y$ projected onto the column space of $X$ with $n = 3$, $p = 1$; $\hat y$ lies in $\mathrm{Col}(X)$ and the residual $r$ lies in $\mathrm{Col}(I - H)$, orthogonal to it.]
4.3 Residuals Relationship with the Hat Matrix

Theorem. $r = (I - H) y$.
Proof.
$$r = y - \hat y = y - X (X^T X)^{-1} X^T y = \big(I - X (X^T X)^{-1} X^T\big) y = (I - H) y.$$

Theorem. $(I - H)$ is idempotent.
Proof.
$$(I - H)(I - H) = I - 2H + H^2 = I - 2H + H = I - H.$$

Theorem. $\sum r_i = 0$, $\sum r_i x_{i,1} = 0$, ..., $\sum r_i x_{i,p} = 0$.
Proof.
$$X^T r = X^T \big(I - X (X^T X)^{-1} X^T\big) y = X^T y - X^T y = 0,$$
and the rows of $X^T r = 0$ are exactly $\sum r_i = 0$, $\sum r_i x_{i,1} = 0$, ..., $\sum r_i x_{i,p} = 0$.

Theorem. $\sum r_i \hat y_i = \hat y^T r = 0$.
Proof.
$$\hat y^T r = \sum_{i=1}^n \hat y_i r_i
= \sum_{i=1}^n \big(\hat\beta_0 + \hat\beta_1 x_{i,1} + \cdots + \hat\beta_p x_{i,p}\big) r_i
= \hat\beta_0 \sum_{i=1}^n r_i + \hat\beta_1 \sum_{i=1}^n x_{i,1} r_i + \cdots + \hat\beta_p \sum_{i=1}^n x_{i,p} r_i = 0,$$
using the previous theorem.

Theorem. $E[r] = 0$.
Proof.
$$E[r] = E[(I - H) y] = E[y] - X (X^T X)^{-1} X^T E[y]
= X\beta - X (X^T X)^{-1} X^T X\beta = X\beta - X\beta = 0.$$

Theorem. $\mathrm{Var}(r) = \sigma^2 (I - H)$.
Proof.
$$\mathrm{Var}(r) = \mathrm{Var}\big((I - H) y\big) = (I - H)\,\sigma^2 I\,(I - H)^T = \sigma^2 (I - H).$$
4.4 An Estimation of $\sigma^2$

Theorem. An unbiased estimator of $\sigma^2$ is
$$\hat\sigma^2 = \frac{1}{n - (p + 1)} \sum_{i=1}^n r_i^2 = MSE.$$
Proof.
$$E\Big[\sum_{i=1}^n r_i^2\Big] = E[r^T r] = E[\mathrm{tr}(r^T r)]
\qquad \text{(the trace of a scalar is the scalar itself)}$$
$$= E[\mathrm{tr}(r r^T)] = \mathrm{tr}\big(E[r r^T]\big) = \mathrm{tr}(\mathrm{Var}(r))
\qquad \text{(since } E[r] = 0\text{)}$$
$$= \mathrm{tr}\big(\mathrm{Var}((I - H) y)\big) = \mathrm{tr}\big((I - H)\,\mathrm{Var}(y)\,(I - H)^T\big)
= \mathrm{tr}\big(\sigma^2 (I - H)\big) = \big(n - (p + 1)\big)\sigma^2,$$
using $\mathrm{tr}(I - H) = n - \mathrm{tr}(H) = n - (p + 1)$.
4.5 Sampling Distribution of $\hat\beta$, $\hat\sigma^2$ under Normality

We assume $y \sim MVN(X\beta, \sigma^2 I)$.

Theorem (Result 1). $\hat\beta \sim MVN\big(\beta, \sigma^2 (X^T X)^{-1}\big)$; in particular, each component satisfies $\hat\beta_i \sim N(\beta_i, \sigma^2 v_{i,i})$, where $v_{i,i}$ is the $(i,i)$th element of $(X^T X)^{-1}$.
Proof. $\hat\beta = (X^T X)^{-1} X^T y$ is a linear transformation of the multivariate normal vector $y$, so it is multivariate normal. Its mean is $E[\hat\beta] = \beta$ (shown earlier), and
$$\mathrm{Var}(\hat\beta) = (X^T X)^{-1} X^T\,\mathrm{Var}(y)\,\big((X^T X)^{-1} X^T\big)^T
= \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}.$$

Theorem (Result 2). $\hat\beta$ and $\hat\sigma^2$ are independent.
Proof. Since $\hat\sigma^2 = \frac{1}{n - (p+1)}\, r^T r$, it is enough to show that $r$ and $\hat\beta$ are independent. Both are linear transformations of $y$:
$$r = \underbrace{\big(I - X (X^T X)^{-1} X^T\big)}_{A}\, y, \qquad \hat\beta = \underbrace{(X^T X)^{-1} X^T}_{B}\, y.$$
Using $\mathrm{Cov}(Ay, By) = A\,\mathrm{Cov}(y, y)\,B^T$,
$$\mathrm{Cov}(r, \hat\beta) = \big(I - X (X^T X)^{-1} X^T\big)\,\sigma^2 I\,X (X^T X)^{-1}
= \sigma^2 \big(X (X^T X)^{-1} - X (X^T X)^{-1} X^T X (X^T X)^{-1}\big) = 0.$$
Under joint normality, zero covariance implies independence, so $r$ and $\hat\beta$ are independent.

Theorem (Result 3). $\dfrac{\big(n - (p+1)\big)\hat\sigma^2}{\sigma^2} \sim \chi^2_{(n-(p+1))}$.
Proof. Note that we can re-write
$$\frac{\big(n - (p+1)\big)\hat\sigma^2}{\sigma^2} = \frac{\sum r_i^2}{\sigma^2} = \frac{r^T r}{\sigma^2}
= \Big(\frac{r}{\sigma}\Big)^T \Big(\frac{r}{\sigma}\Big).$$
Recall $y \sim MVN(X\beta, \sigma^2 I)$, so
$$\frac{r}{\sigma} = \frac{(I - H) y}{\sigma} \sim MVN(0, I - H).$$
Since $I - H$ is symmetric and idempotent, there exists an orthogonal matrix $P$ such that
$$I - H = P \Lambda P^T, \qquad \Lambda = \mathrm{diag}(1, \ldots, 1, 0, \ldots, 0),$$
where the number of 1's is $n - (p + 1)$.
Now define the new random variable $z = P^T (r/\sigma)$. Then $E[z] = 0$ and
$$\mathrm{Var}(z) = P^T\,\mathrm{Var}(r/\sigma)\,P = P^T (I - H) P = P^T (P \Lambda P^T) P = \Lambda,$$
so $z \sim MVN(0, \Lambda)$: the first $n - (p + 1)$ components $z_i$ are iid $N(0, 1)$ and the rest are degenerate at 0. Therefore
$$\frac{\big(n - (p+1)\big)\hat\sigma^2}{\sigma^2}
= \Big(\frac{r}{\sigma}\Big)^T \Big(\frac{r}{\sigma}\Big)
= (Pz)^T (Pz) = z^T P^T P z = z^T z = \sum_{i=1}^{n-(p+1)} z_i^2 \sim \chi^2_{(n-(p+1))}.$$
Theorem (Result 4). We can use a t-distribution to test each $\beta_i$:
$$\frac{\hat\beta_i - \beta_i}{\sqrt{\hat\sigma^2 v_{i,i}}} \sim t_{n-p-1},
\qquad SE(\hat\beta_i) = \sqrt{v_{i,i}\,\hat\sigma^2},$$
where $v_{i,i}$ is the $(i,i)$th element of $(X^T X)^{-1}$, i.e.
$$\mathrm{Var}(\hat\beta) = \sigma^2 (X^T X)^{-1} = \sigma^2 \begin{pmatrix}
v_{0,0} & v_{0,1} & \cdots & v_{0,p} \\
v_{1,0} & v_{1,1} & \cdots & v_{1,p} \\
\vdots & \vdots & \ddots & \vdots \\
v_{p,0} & v_{p,1} & \cdots & v_{p,p}
\end{pmatrix}.$$
Proof. From Result 1, $\hat\beta \sim MVN\big(\beta, \sigma^2 (X^T X)^{-1}\big)$, so $\hat\beta_i \sim N(\beta_i, \sigma^2 v_{i,i})$. When $\sigma^2$ is unknown, we use
$$\hat\sigma^2 = \frac{1}{n - (p + 1)} \sum r_i^2$$
to estimate $\sigma^2$. From Results 2 and 3, $\hat\beta$ and $\hat\sigma^2$ are independent, and
$$\frac{\big(n - (p+1)\big)\hat\sigma^2}{\sigma^2} \sim \chi^2_{(n-(p+1))}.$$
Note that if $x \sim N(0, 1)$ and $y \sim \chi^2_{(\nu)}$ are independent, then $x / \sqrt{y/\nu} \sim t_\nu$. Therefore
$$\frac{\dfrac{\hat\beta_i - \beta_i}{\sqrt{\sigma^2 v_{i,i}}}}{\sqrt{\dfrac{(n-p-1)\,\hat\sigma^2/\sigma^2}{n-p-1}}}
= \frac{\hat\beta_i - \beta_i}{\sqrt{\hat\sigma^2 v_{i,i}}} \sim t_{n-p-1}. \qquad (*)$$
This also implies that we use the standard error $SE(\hat\beta_i) = \sqrt{v_{i,i}\,\hat\sigma^2}$ to estimate the standard deviation $\sqrt{v_{i,i}\,\sigma^2}$. The quantity $(*)$ can be used to construct $100(1-\alpha)\%$ CIs and to test hypotheses $H_0: \beta_i = \beta_i^*$.
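A hedged R sketch (continuing the illustration-only x1, x2, y, X, n simulated in Section 4.2) reproducing the estimate, SE, t, and p-value columns of summary(lm()) from $\hat\sigma^2$ and $(X^T X)^{-1}$, as in Result 4:

fit  <- lm(y ~ x1 + x2)
p    <- 2
s2   <- sum(residuals(fit)^2) / (n - p - 1)    # sigma^2 hat = MSE
V    <- s2 * solve(t(X) %*% X)                 # estimated Var(beta.hat)
se   <- sqrt(diag(V))                          # SE(beta_i hat) = sqrt(v_ii * sigma^2 hat)
tval <- coef(fit) / se                         # t statistics for H0: beta_i = 0
pval <- 2 * pt(-abs(tval), df = n - p - 1)

cbind(Estimate = coef(fit), SE = se, t = tval, p = pval)
summary(fit)$coefficients                      # same table from R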
4.6 Prediction

Suppose we are interested in predicting $y$ for a given set of values of the explanatory variables $x_1, \ldots, x_p$. For example, consider the multiple regression model
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon,$$
with $y$ = FEV1, $x_1$ = level of a certain air pollutant, $x_2$ = age, $x_3$ = weight.

We want to predict FEV1 for a new case with an arbitrary vector of explanatory variable values $a_p$, e.g. $a_p = (1, 10, 52, 170)^T$.

Note. Be cautious when extrapolating outside the ranges of the explanatory variables in the fitting data.

The future value is
$$y_p = \beta_0 + \beta_1 \cdot 10 + \beta_2 \cdot 52 + \beta_3 \cdot 170 + \varepsilon_p.$$
We can estimate $y_p$ by using $\hat\beta$ (the LSE) to replace $\beta$ and setting $\varepsilon_p = 0$:
$$\hat y_p = \hat\beta_0 + \hat\beta_1 \cdot 10 + \hat\beta_2 \cdot 52 + \hat\beta_3 \cdot 170 = a_p^T \hat\beta.$$
To place a confidence interval around the single predicted value $\hat y_p$, we need to know
$$\mathrm{Var}(\hat y_p - y_p) = \mathrm{Var}\big(a_p^T \hat\beta - a_p^T \beta - \varepsilon_p\big)
= \mathrm{Var}(\varepsilon_p) + \mathrm{Var}\big(a_p^T \hat\beta\big)
= \sigma^2 + a_p^T\,\mathrm{Var}(\hat\beta)\,a_p
= \sigma^2 \big(1 + a_p^T (X^T X)^{-1} a_p\big).$$
As usual, we have to replace $\sigma^2$ by
$$\hat\sigma^2 = \frac{1}{n - p - 1} \sum r_i^2,$$
which leads to the result that
$$\frac{\hat y_p - y_p}{\sqrt{\hat\sigma^2\big(1 + a_p^T (X^T X)^{-1} a_p\big)}} \sim t_{n-p-1},$$
and thus:

Theorem 4.1. The $100(1-\alpha)\%$ CI (prediction interval) for $y_p$ is
$$\hat y_p \pm t_{n-p-1,\alpha/2}\, \sqrt{\hat\sigma^2\big(1 + a_p^T (X^T X)^{-1} a_p\big)}.$$
(The proof of the t result is a homework exercise.)
What if we want to predict the mean of the response at a given vector of values of the explanatory variables, $a_p$? Then the target is
$$\mu_p = E[y_p] = a_p^T \beta,$$
and its estimate is
$$\hat\mu_p = a_p^T \hat\beta = \hat y_p.$$
However,
$$\mathrm{Var}(\hat\mu_p - \mu_p) = \mathrm{Var}(\hat\mu_p) < \mathrm{Var}(\hat y_p - y_p).$$
Why? Because the future individual error $\varepsilon_p$ does not enter the mean prediction.

Theorem 4.2. The $100(1-\alpha)\%$ CI for $\mu_p$ is
$$\hat\mu_p \pm t_{n-p-1,\alpha/2}\, \sqrt{\hat\sigma^2\, a_p^T (X^T X)^{-1} a_p}.$$
Proof.
$$\mathrm{Var}(\hat\mu_p - \mu_p) = \mathrm{Var}\big(a_p^T \hat\beta - a_p^T \beta\big)
= \mathrm{Var}\big(a_p^T \hat\beta\big)
= a_p^T\,\mathrm{Var}(\hat\beta)\,a_p
= \sigma^2\, a_p^T (X^T X)^{-1} a_p,$$
which is indeed smaller than $\mathrm{Var}(\hat y_p - y_p) = \sigma^2\big(1 + a_p^T (X^T X)^{-1} a_p\big)$.
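A hedged R sketch (continuing the illustration-only simulated fit, X, s2 from above; the new covariate values are made up) computing both standard errors from $a_p^T (X^T X)^{-1} a_p$ and checking against predict():

ap <- c(1, 0.5, -1)                     # hypothetical new covariate vector (intercept, x1, x2)
yp <- sum(ap * coef(fit))               # point prediction a_p^T beta.hat

se.mean <- sqrt(s2 * t(ap) %*% solve(t(X) %*% X) %*% ap)         # SE for the mean mu_p
se.pred <- sqrt(s2 * (1 + t(ap) %*% solve(t(X) %*% X) %*% ap))   # SE for a single future y_p
c(yp, se.mean, se.pred)

newd <- data.frame(x1 = 0.5, x2 = -1)
predict(fit, newd, interval = "confidence", se.fit = TRUE)$se.fit   # matches se.mean
predict(fit, newd, interval = "prediction")                         # interval based on se.pred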
4.7 ANOVA Table

Consider the general model
$$y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} + \varepsilon_i \qquad (*)$$
with LSE $\hat\beta$, and
$$SSE = \sum_{i=1}^n r_i^2 = r^T r = (y - X\hat\beta)^T (y - X\hat\beta).$$
Now consider the hypothesis
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0.$$
Under $H_0$, the general model $(*)$ reduces to
$$y_i = \beta_0 + \varepsilon_i \qquad \text{(reduced model)},$$
and the LSE of $\beta_0$ is $\hat\beta_0 = \bar y$, so
$$SSE(\hat\beta_0) = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \hat\beta_0)^2 = \sum_{i=1}^n (y_i - \bar y)^2 = SST.$$
The difference between the reduced and full models,
$$SSE(\hat\beta_0) - SSE(\hat\beta) = SST - SSE(\hat\beta) = SSR,$$
is the additional sum of squares from the $p$ explanatory variables. SSR tells us how much variability in the response is explained by the full model over and above the simple mean model:
$$SSR = \sum (y_i - \bar y)^2 - (y - X\hat\beta)^T (y - X\hat\beta)
= y^T y - n\bar y^2 - \big[(I - H) y\big]^T \big[(I - H) y\big]
= y^T y - n\bar y^2 - y^T (I - H) y
= \hat\beta^T X^T X \hat\beta - n\bar y^2. \qquad (4.7.1)$$

Theorem 4.3. The F-test statistic
$$F = \frac{SSR / p}{SSE / (n - p - 1)}
= \frac{\text{additional sum of squares} / p}{\text{sum of squares from full model} / (n - p - 1)}$$
is used to test
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0 \qquad \text{versus} \qquad H_a: \text{at least one } \beta_j \neq 0.$$

What does a statistically significant F-ratio imply? It indicates that there is strong evidence against the claim that none of the explanatory variables have an influence on the response.

Source | df | Sum of Squares | Mean Squares | F
Regression | p | SSR = $\hat\beta^T X^T X \hat\beta - n\bar y^2$ | MSR = SSR/p | F = MSR/MSE
Residual | n - p - 1 | SSE = $(y - X\hat\beta)^T (y - X\hat\beta)$ | MSE = SSE/(n - p - 1) |
Total | n - 1 | SST = $(y - \bar y)^T (y - \bar y)$ | |

The $R^2 = SSR/SST$ is an overall measurement of the goodness of fit of the model.

Problem. Does a large $R^2$ always mean that a significant relationship has been discovered?
$R^2$ usually goes up as we add more explanatory variables to the model (even if they are not relevant). For instance, if $p + 1 = n$, then $R^2 = 1$.

Theorem 4.4 (Adjusted $R^2$). This is used to penalize a large number of parameters:
$$R^2_{\mathrm{Adj}} = 1 - \frac{n - 1}{n - p - 1}\big(1 - R^2\big).$$
(See also the lecture handout.)
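A hedged R sketch (continuing the illustration-only simulated multiple-regression fit) of the full-versus-intercept-only comparison behind the ANOVA table, together with $R^2$ and $R^2_{\mathrm{Adj}}$:

fit0 <- lm(y ~ 1)                      # reduced model: intercept only, so SSE(beta0.hat) = SST
SST  <- sum(residuals(fit0)^2)
SSE  <- sum(residuals(fit)^2)
SSR  <- SST - SSE
F0   <- (SSR / p) / (SSE / (n - p - 1))
c(F = F0, p.value = pf(F0, p, n - p - 1, lower.tail = FALSE))

anova(fit0, fit)                       # same F via the extra-sum-of-squares comparison

R2     <- SSR / SST
R2.adj <- 1 - (n - 1) / (n - p - 1) * (1 - R2)
c(R2, R2.adj)
summary(fit)[c("r.squared", "adj.r.squared")]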
Chapter 5
Model and Model Assumptions

Model evaluation and residual analysis: given a particular data set and a specific model (with a set of assumptions), we can
- compute the least squares fit,
- construct hypothesis tests and CIs,
- carry out estimation and prediction.
In practice, a more difficult task is to find a reasonable model for a set of data. We will focus on techniques based on analysis of residuals for model checking.

5.1 Model and Model Assumptions

Problem. What is a good model?
A good model is one which is complex enough to provide a good fit to the data and yet simple enough to use (i.e. make predictions) well beyond the data.
5.1.1 Basic Model Assumptions

1. $E[\varepsilon_i] = 0 \;\Rightarrow\; E[y_i] = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}$ (linearity)
2. $\mathrm{Var}(\varepsilon_i) = \sigma^2$, constant variance (homoscedasticity)
3. $\varepsilon_1, \ldots, \varepsilon_n$ are independent
4. $\varepsilon_i \sim N(0, \sigma^2)$

Of course, we cannot observe or compute the errors in practice, so their properties cannot be evaluated directly. Rather, we look at the residuals $r_1, \ldots, r_n$ from the fitted model,
$$r_i = y_i - \hat y_i.$$
If the residuals estimate the errors well, any pattern found in the residuals suggests that a similar relationship exists in the random errors.
5.2 Relationship Between Residuals and Random Errors

We can write
$$r = y - X\hat\beta = y - \underbrace{X (X^T X)^{-1} X^T}_{H} y = (I - H) y
= (I - H)(X\beta + \varepsilon) = (I - H) X\beta + (I - H)\varepsilon = (I - H)\varepsilon,$$
since
$$(I - H) X\beta = (X - HX)\beta = (X - X)\beta = 0.$$
The residuals will approximately equal the errors if $H$ is small relative to $I$.

Since $H = [h_{i,j}]_{n \times n}$ is a projection matrix, idempotent ($H = HH$) and symmetric ($h_{i,j} = h_{j,i}$), the $i$th diagonal element can be written as
$$h_{i,i} = (HH)_{i,i} = \sum_{j=1}^n h_{i,j} h_{j,i} = h_{i,i}^2 + \sum_{j \neq i} h_{i,j}^2,$$
so
$$h_{i,i}(1 - h_{i,i}) = \sum_{j \neq i} h_{i,j}^2.$$
The right-hand side is a sum of squares, hence non-negative, and we see that
$$0 \le h_{i,i} \le 1.$$
If the diagonal elements $h_{i,i}$ are small, then the off-diagonal elements are also small.

Note that
$$\mathrm{tr}(H) = \sum h_{i,i} = p + 1$$
(the trace of a matrix is equal to the sum of its eigenvalues). Therefore the average of the diagonal elements is $(p+1)/n$. If we try to fit nearly as many parameters as there are observations, the $h_{i,i}$'s cannot all be small relative to 1, and the residuals are poor estimates of the errors.

$\mathrm{Var}(r) \neq \mathrm{Var}(\varepsilon)$, but if $H$ is small they are close; if $H$ is not small, there could be substantial correlations among the residuals, and patterns will be apparent even if the error assumptions hold.
5.2.1 Statistical Properties of r

$$E[r] = (I - H)\,E[\varepsilon] = 0 \qquad \text{since } E[\varepsilon] = 0,$$
$$\mathrm{Var}(r) = \mathrm{Var}\big((I - H)\varepsilon\big) = (I - H)\,\mathrm{Var}(\varepsilon)\,(I - H)^T = \sigma^2 (I - H) \neq \mathrm{Var}(\varepsilon).$$

Summary. If the assumptions about $\varepsilon$ hold and $H$ is small relative to $I$, then
$$r = (I - H)\varepsilon, \qquad E[r] = 0, \qquad \mathrm{Var}(r) = \sigma^2 (I - H) \approx \sigma^2 I,$$
so $r \approx MVN(0, \sigma^2 I)$: the residuals should look approximately like a sample from an uncorrelated, mean-zero, constant-variance normal distribution.
5.3 Residual Plot for Checking $E[\varepsilon_i] = 0$

Potentially the most important assumption for linear regression models is $E[\varepsilon_i] = 0$. The likely causes for violation of this assumption are:
1. The effect of the explanatory variables on the response variable is not in fact linear (e.g. fitting a relationship linear in $x$ when $E[y]$ is in fact linear in $x^2$).
2. Omission of some important explanatory variables.

We shall consider three types of plot for checking this assumption:
1. residuals versus $x_j$, $j = 1, \ldots, p$;
2. partial residuals versus $x_j$, $j = 1, \ldots, p$;
3. added-variable plots.
5.3.1 Residuals Versus $x_j$

Suppose we fit a multiple regression model and compute
$$r_i = y_i - \hat y_i = y_i - \big(\hat\beta_0 + \hat\beta_1 x_{i,1} + \cdots + \hat\beta_p x_{i,p}\big);$$
the residuals have the linear effect of the $x$'s removed from $y$. If $x_j$ does have a linear effect on $y$ (in other words, the model assumption $E[\varepsilon_i] = 0$ is not violated), then when we plot the raw residuals $r_1, \ldots, r_n$ against the $n$ values $x_{1,j}, \ldots, x_{n,j}$ we expect to see a random scatter, for $j = 1, \ldots, p$. [Figure: random scatter of residuals versus $x_j$; see handout.]

On the other hand, if we see an obvious non-random pattern, it suggests non-linearity and we could adapt the way $x_j$ is modeled. [Figure: examples of non-random residual patterns; see handout.] For example, the model may require higher-order terms (e.g. $x_k^2, x_k^3, \ldots$), leading to polynomial regression.
5.3.2 Partial Residuals Versus $x_j$

Plots of the raw residuals are sometimes difficult to interpret because we have to decide whether the scatter looks random or not. For the next type of plot considered, based on partial residuals, we have to judge whether the plot looks linear; this is often easier.

For each $x_j$, the partial residual $r_i^{(j)}$ is defined as
$$r_i^{(j)} = r_i + \hat\beta_j x_{i,j}, \qquad i = 1, \ldots, n;$$
the estimated linear effect of $x_j$ is added back into the residuals.

For each $x_j$, when we plot $(r_1^{(j)}, \ldots, r_n^{(j)})$ versus $(x_{1,j}, \ldots, x_{n,j})$, we expect a linear trend if the model with a linear term in $x_j$ is adequate. Hence the typical pattern of a partial residual plot when the assumption is not violated is a linear trend, for $j = 1, \ldots, p$. [Figure: typical partial residual plot; see handout.]

The partial residuals for $x_j$ attempt to correct $y$ for all the other explanatory variables, so that the plot of $r_i^{(j)}$ against $x_{i,j}$ ($i = 1, \ldots, n$) shows the marginal effect of $x_j$. To see this, note that
$$r_i^{(j)} = r_i + \hat\beta_j x_{i,j}
= y_i - \big(\hat\beta_0 + \hat\beta_1 x_{i,1} + \cdots + \hat\beta_p x_{i,p}\big) + \hat\beta_j x_{i,j}
= y_i - \big(\hat\beta_0 + \hat\beta_1 x_{i,1} + \cdots + \hat\beta_{j-1} x_{i,j-1} + \hat\beta_{j+1} x_{i,j+1} + \cdots + \hat\beta_p x_{i,p}\big).$$
In simple linear regression, we can see the relationship between $y$ and $x$ simply by plotting the two variables. In a multiple regression situation, plotting $y$ versus each $x_j$ does not show the marginal effect of $x_j$: the $y$ values are also affected by the remaining explanatory variables. In the partial residual plot, we remove the estimated effect of the remaining explanatory variables, hence attempting to uncover the marginal effect of $x_j$ on $y$. We can then judge whether this marginal effect is linear or not.

In R, the function crPlots() in the car package can be used to produce partial residual plots. See the example and attached R code.
5.3.2.1 Example

The value of a tree of a particular species is largely determined by the volume of timber in the trunk. Foresters want to value stands of trees without having to cut them down to determine their volumes. Here we explore the relationship between volume and two measures that can be made without felling a tree: the girth (the tree diameter in inches at 4.5 feet above ground) and the height (in feet). The data are collected on 31 felled black cherry trees, and the three variables measured for each tree are Volume, Girth, and Height.

The initial examination of the data by plotting volume against girth and volume against height (see Figure 5.2.1) shows a close linear relationship between volume and girth, and a less strong linear relationship between volume and height. Volume seems to become more variable as height increases. However, remember that these plots do not necessarily show the marginal effect of a particular explanatory variable (Girth or Height) on the response variable Volume.

The plots of residuals versus the explanatory variable Girth and residuals versus Height are presented in Figure 5.2.2. There is obviously a non-random pattern in the plot of residuals versus Girth (a quadratic trend?), which implies that the marginal effect of Girth on Volume may not be linear. Hence the assumption that $E[\varepsilon_i] = 0$ is violated. How about the plot of residuals versus Height? Does the scatter look random or not?

The partial residual plots for Girth and Height in Figure 5.2.3 both show nonlinear trends. They are produced using the R function crPlots() from the car package. The dashed straight line is the fitted simple linear regression line using the partial residual as the response variable and Girth (or Height) as the explanatory variable. The smooth curve through the data points is constructed using the non-parametric smoothing method called locally weighted scatterplot smoothing (LOWESS). LOWESS is defined by a complex algorithm, where a local linear polynomial fit is used at each data point, depending on the points that fall within a specified neighbourhood, so that it attempts to follow the points fairly smoothly, but not necessarily linearly. This makes it easier to spot nonlinear behaviour. Both plots indicate deviation from the straight line, hence violation of the assumption that $E[\varepsilon] = 0$.

Note that a LOWESS curve can be computed using the R function lowess(x, y, f), where
- x, y are vectors giving the coordinates of the points in the scatterplot;
- f is the smoother span; it gives the proportion of points in the plot which influence the smooth at each value. Larger values give more smoothness.
lowess() returns a list containing components x and y which give the coordinates of the smooth. The smooth can be added to a plot of the original points with the function lines(). See the attached R code for how to add a LOWESS curve to the scatter plot in Figure 5.2.1, for example (a sketch of these commands also follows below).
5.3.3 Added-Variable Plots

Sometimes we might suspect that an important explanatory variable has not been included in the model. Consider the plot of residuals versus a new explanatory variable $x$ (that currently is not in the model); [Figure: residuals versus a candidate new variable showing a systematic trend; see handout] such a trend suggests that the addition of $x$ may improve the model.

When deciding whether a new explanatory variable (that currently is not in the model) should be included, an added-variable plot turns out to be a more powerful graph. To produce the added-variable plot for a new explanatory variable $x$, we
- regress $y$ on all the current explanatory variables $x_1, \ldots, x_p$ and denote the residual vector $r$;
- regress $x$ on all of $x_1, \ldots, x_p$ and denote the residual vector $t$;
- plot the residual vector $r$ versus $t$. Systematic patterns in the plot indicate that the variable $x$ should be included.

In R, the car package provides a function avPlots(model, variable, ...) to produce added-variable plots, where
- model is the model object produced by lm(); you can directly use avPlots(fit <- lm(y ~ x1 + x2 + ..., data), ...);
- variable is the name of the new explanatory variable that is not included in the lm() fitting. The observations of this new variable must be included in your data set, though.
5.4 Residual Plots for Checking Constant Variance $\mathrm{Var}(\varepsilon_i) = \sigma^2$

Once we are satisfied with the assumption that $E[\varepsilon_i] = 0$, which is equivalent to saying that we are modeling $E[y]$ adequately as a linear function of the explanatory variables, we can move on to the assumption that $\mathrm{Var}(\varepsilon_i) = \sigma^2$ for $i = 1, \ldots, n$. This assumption states that the error variance, or equivalently $\mathrm{Var}(y_i)$, is constant.

To detect non-constant variance (heteroscedasticity), a standard diagnostic is to plot the residuals against the fitted values: $r_i$ versus $\hat y_i$. We examine this plot to see if the residuals appear to have constant variability with respect to the fitted values. A pattern such as a funnel shape [Figure: residuals versus fitted values with spread changing in $\hat y_i$; see handout] suggests that the constant variance assumption is violated. A random scatter, on the other hand, supports the assumption.

Now back to the tree data. Figure 5.3.1 shows first the residuals-versus-fitted plot and then the absolute values of the residuals versus the fitted values. The latter folds over the bottom half of the first plot to increase the resolution for detecting non-constant variance. A pattern is evident in the first plot: the residuals appear to have a quadratic trend in the fitted values. This problem with the model goes back to the assumption that $E[\varepsilon_i] = 0$ for $i = 1, \ldots, n$, as mentioned in Section 5.2.3. The plot cannot diagnose any problems with the assumptions about the variance of $y$ while there are still problems with the assumptions relating to modeling the mean of $y$.
5.5 Residual Plots for Checking Normality of
i
s
We probably do not want to worry about the normality assumption until the other, more serious assumptions
have been checked and xed.
If all assumptions are valid, including the normality assumptions, and we have sucient degrees of freedom
for the residuals, then the residuals should look approximately like a sample from a normal distribution.
Consider n = 5 residuals for simplicity, even though this is far too few for a useful plot. What would
we expect a sample of n = 5 independent standard normals to look like? It seems intuitively reasonable
that the third largest (i.e., the middle one) has expectation zero, the mean of the standard normal. That
is, the middle observation is expected to cut the normal distribution into two equal halves. Carrying this
idea further, we might expect the sample of ve normals, once ordered, to divide the normal distribution
into six approximately equal areas. Thus for n = 5, the choice for the area to the left of the i
th
ordered
observation are a
i
=
i
n+1
for i = 1, . . . , 5 (e.g. a
1
=
1
6
, a
2
=
2
6
, a
3
=
3
6
, a
4
=
4
6
, a
5
=
5
6
, a
6
=
6
6
), this gives the
equal areas we argued above. The values diving the standard normal distribution into these equal areas are
z
i
=
1
(a
i
) for i = 1, . . . , n, where is the cumulative distribution function for the standard normal. We
call z
i
expected standard normal order statistics because z
1
< z
2
< < z
n
.
This is the basis for a Q-Q (quantile-quantile) plot to check normality. The ordered residuals (the ordered
r_i's) are plotted against the expected standard normal order statistics (the z_i's). If the normality assumption
is correct, the Q-Q plot should show an approximately straight line.

The first plot in Figure 5.4.1 shows the Q-Q plot of a random sample of size n = 100 from a normal
distribution, which shows an approximately straight line. The other three plots are based on random samples
simulated from a Lognormal distribution (an example of a skewed distribution), a Cauchy distribution (an
example of a heavy-tailed distribution) and a Uniform distribution (an example of a light-tailed distribution).

10 Insert residuals versus fitted values scatterplot from handout
11 Insert residuals versus fitted values scatterplot from handout (tree data)

When non-normality is found, the resolution depends on the type of problem found:
- For a light-tailed distribution, the consequences of non-normality are not serious and can reasonably be ignored.
- For skewed errors, a transformation of the response may solve the problem.
- For long-tailed errors, we might just accept the non-normality.
In summary: starting from the residuals r_1, r_2, r_3, r_4, r_5, form the ordered residuals
r_(1) < r_(2) < r_(3) < r_(4) < r_(5).12 The corresponding expected normal order statistics satisfy
Φ(z_i) = Pr(Z ≤ z_i) = a_i, i.e. z_i = Φ^{-1}(a_i).13
In R, the function qqnorm() can be used to produce the Q-Q plot of the residuals (see the attached R code14).
For example, for the tree data you could use the commands

fit <- lm(Volume ~ Girth + Height, data = tree)
qqnorm(residuals(fit))
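The same plot can be built by hand, following the construction above; a minimal sketch assuming the fit
object just created.

R Code (sketch):
r <- residuals(fit)
n <- length(r)
a <- (1:n) / (n + 1)        # areas a_i = i/(n+1)
z <- qnorm(a)               # expected standard normal order statistics z_i = Phi^{-1}(a_i)
plot(z, sort(r), xlab = "Expected normal order statistics", ylab = "Ordered residuals")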
5.5.1 Standardized Residuals

d_i = r_i / sqrt( σ̂² (1 - h_{i,i}) ),   i = 1, 2, . . . , n

If the model assumptions hold, d_1, d_2, . . . , d_n are approximately iid N(0, 1).15
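A minimal sketch of computing the standardized residuals, assuming a fitted lm object fit; rstandard() is
base R's built-in version of the same quantity.

R Code (sketch):
h <- hatvalues(fit)                        # leverages h_{i,i}
sigma2_hat <- summary(fit)$sigma^2         # sigma-hat squared (the MSE)
d <- residuals(fit) / sqrt(sigma2_hat * (1 - h))
# all.equal(unname(d), unname(rstandard(fit)))   # should agree with the built-in function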
5.6 Residual Plots for Detecting Correlation in the ε_i's16

None of the diagnostic plots discussed so far has questioned the assumption that the random errors are
uncorrelated. In general, checking this assumption is very difficult, if not impossible, by inspection of the
data. Scrutiny of the data collection method is often all that one can do. For example, if the tree data
include adjacent trees, one tree might shade its neighbour, leading to correlation. Care in selecting trees
that are widely separated would make the assumption of uncorrelated errors more credible.

Only if there is some structure to the correlations that might exist do we have some basis for checking.
For example, with temporally (or spatially) related data it is often reasonable to suspect that observations
close together in time (or in space) are the most likely to be correlated (e.g. daily stock price data,
temperature data, etc.). It is then wise to check the uncorrelated-errors assumption.

If this assumption is violated, the first-order property of the least squares estimate β̂ will not be affected
(i.e. E[β̂] = β), but the second-order, variance properties will be. As a matter of fact, a fairly small
correlation between the errors may lead to the estimated variance of β̂ being an order of magnitude wrong,
hence there is potential for standard errors to be very wrong, with corresponding effects on confidence
intervals, etc. The reason for this is that, although the correlation is small, there are many pairwise
correlations contributing to the true variance.

12 Insert pictures from notes, plus Oct 18, 2012
13 Insert Q-Q plot from handout
14 Insert R code from handout, Oct 16
15 Oct 18, 2012
16 Insert Oct 18 handout
Graphical checks for correlation in the ε_i's include plots of the residuals r_i against time and of r_i
against r_{i-1}. For example, consider the case where any two observations t time units apart have correlation

ρ_t = corr(ε_i, ε_{i-t}) = ρ^t,   t ≥ 0,

where i now indexes time and -1 < ρ < 1. This is the so-called autocorrelation structure from Time Series
Analysis (STAT 443). The plots in Figure 5.5.1 are generated from normal random variables r_1, . . . , r_n with
an autocorrelation as above and ρ = -0.9, 0, +0.9.17 These plots give some idea of what we are looking
for. Positive correlation is probably the most harmful because the computed standard errors will likely be
too small, leading to confidence intervals that do not capture the true value with the stated probability.
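A minimal sketch of how such plots can be generated (the simulation below is an assumption about how
Figure 5.5.1 was produced, not taken from the handout).

R Code (sketch):
set.seed(331)
n <- 100
sim_ar1 <- function(n, rho) {            # series with corr(r_i, r_{i-t}) = rho^t and variance 1
  r <- numeric(n)
  r[1] <- rnorm(1)
  for (i in 2:n) r[i] <- rho * r[i - 1] + rnorm(1, sd = sqrt(1 - rho^2))
  r
}
par(mfrow = c(1, 3))
for (rho in c(-0.9, 0, 0.9)) {
  r <- sim_ar1(n, rho)
  plot(r[-n], r[-1], xlab = "r[i-1]", ylab = "r[i]", main = paste("rho =", rho))
}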
5.6.1 Consequence of Correlation in the ε_i's

The least squares estimator β̂ = (X^T X)^{-1} X^T y is still unbiased,

E[β̂] = β,

but its variance is affected:

Var(β̂) = Var( (X^T X)^{-1} X^T y )
        = (X^T X)^{-1} X^T Var(y) [ (X^T X)^{-1} X^T ]^T
        = (X^T X)^{-1} X^T Var(ε) X (X^T X)^{-1}

Under correlated errors, Var(ε) ≠ σ² I; Var(ε) is not diagonal anymore:

Var(ε) = [ σ²_1     σ_{1,2}   σ_{1,3}   ...   σ_{1,n}
           σ_{2,1}  σ²_2      σ_{2,3}   ...   σ_{2,n}
           σ_{3,1}  σ_{3,2}   σ²_3      ...   σ_{3,n}
           ...
           σ_{n,1}  σ_{n,2}   σ_{n,3}   ...   σ²_n    ]

where σ_{i,j} = Cov(ε_i, ε_j). As a result, Var(β̂_j) may be underestimated, and so may SE(β̂_j).
5.6.2 The Durbin-Watson Test

The Durbin-Watson test is a formal statistical test for the correlation structure mentioned above. It tests

H_0: ρ = 0   versus   H_a: ρ ≠ 0   (or the one-sided alternatives H_a: ρ > 0 or H_a: ρ < 0).

The Durbin-Watson test statistic is

d = Σ_{i=2}^{n} (r_i - r_{i-1})² / Σ_{i=1}^{n} r_i²

If the null hypothesis is true (i.e. the random errors are not correlated), d follows a linear combination of
χ² distributions. Because of the difficulties with the distribution of d, using tables is complicated. The test
can be implemented with the dwtest() function in the lmtest package in R.
Example. Consider the low birth weight infant data, where we fit a multiple linear regression model using
head circumference as the response variable and gestational age, birth weight and the mother's toxemia
status as explanatory variables. Suppose that all the other important assumptions have been satisfied, and
we want to test whether an autocorrelation structure exists in the random errors. We conduct the D-W test:

> library(lmtest)
> dwtest(headcirc ~ gestage + birthwt + toxemia, data=lowbwt)

Durbin-Watson test

data: headcirc ~ gestage + birthwt + toxemia
DW = 1.9726, p-value = 0.4267
alternative hypothesis: true autocorrelation is greater than 0

where the p-value indicates no evidence of correlation.
Recall the test statistic

d = Σ (r_i - r_{i-1})² / Σ r_i²

Under H_0, d is close to 2; values of d well below 2 point towards positive autocorrelation (a small p-value
for H_a: ρ > 0), and values well above 2 point towards negative autocorrelation.

If the p-value < the significance level α, then we reject H_0 and conclude that there is strong evidence
that the random errors are correlated. (We can also specify whether they are negatively or positively
correlated.)

R Code:
dwtest(fit, alternative = "greater")   # alternative is one of "two.sided", "greater" or "less"

If the p-value > α, we cannot reject H_0, and we conclude that there is not enough evidence of
autocorrelation among the random errors.
If autocorrelation is detected there are several remedies.

- Add a missing explanatory variable. For example, if we model beer sales on a daily basis and omit daily
  maximum temperature as an explanatory variable, we may well see strings of positive residuals during
  spells of hot weather and negative residuals during poor weather systems.
- Differencing. It is often the case that the differences D_i = y_i - y_{i-1} show less correlation. For
  instance, modeling the differences for a stock price amounts to building a model for how much the price
  changes from one period to the next.
- Take STAT 443 (Forecasting / Time Series Modeling) or STAT 936 (Longitudinal Data Analysis).

17 Insert Figure 5.5.1: Normal random variables with autocorrelation structure
Chapter 6

Model Evaluation: Data Transformation1

We now know the techniques (based on analysis of the residuals) to diagnose problems with the
assumptions on the random errors. In this chapter, we concentrate on techniques where the response variable
and/or the explanatory variables are transformed so that the usual assumptions look more reasonable.

6.1 Box-Cox Transformation

6.1.1 Remarks on Data Transformation

Box and Cox proposed a family of transformations that can be used with nonnegative responses y, and
suggested that a transformation of y can have several advantages:

1. The model in the original x variables fits better (reducing the need for quadratic terms, etc.).
2. The error variance is more constant.
3. The errors are more normal.

(Box and Cox, Journal of the Royal Statistical Society B, 1964)

Suppose y_i is always positive for i = 1, . . . , n. The Box-Cox approach is to transform y to y^λ. The
procedure for choosing λ is:

1 Oct 23, 2012, use handouts
1. Choose several values for λ, typically in the range [-1, 1].

2. For each λ, transform y_i to

   Z_i = y_i^λ      if λ ≠ 0
   Z_i = ln(y_i)    if λ = 0

3. Fit the regression

   Z_i = β_0 + β_1 x_{i,1} + ... + β_p x_{i,p} + ε_i

   and calculate

   MSE_adj = ( 1 / (λ ỹ^{λ-1}) )² MSE    if λ ≠ 0
   MSE_adj = ỹ² MSE                      if λ = 0

   where ỹ = ( Π_{i=1}^{n} y_i )^{1/n} is the geometric mean. The (unadjusted) mean square error (MSE)
   obtained from fitting the model on the transformed scale (Z_i) is adjusted so that regressions with
   different scales for the response variable can be compared. The adjustment is related to the Jacobian
   arising in a density after a change of variable.

4. Choose a λ that minimizes MSE_adj.
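A minimal sketch of this hand search for the tree data; the grid of λ values and the use of summary()$sigma^2
as the MSE are assumptions made for illustration.

R Code (sketch):
lambdas <- seq(-1, 1, by = 0.1)
gm <- exp(mean(log(tree$Volume)))                 # geometric mean of y
mse_adj <- sapply(lambdas, function(lam) {
  z <- if (lam == 0) log(tree$Volume) else tree$Volume^lam
  mse <- summary(lm(z ~ Girth + Height, data = tree))$sigma^2
  if (lam == 0) gm^2 * mse else mse / (lam * gm^(lam - 1))^2   # scale-adjusted MSE
})
lambdas[which.min(mse_adj)]                       # lambda with the smallest adjusted MSE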
Note. The Box-Cox transformation leads to the following, depending on the value we choose for λ:

1. λ = 1 ⟹ no transformation
2. λ = -1 ⟹ the reciprocal transformation (1/y_i)
3. λ = 0.5 ⟹ the square root transformation (√y_i)
4. λ = 0 ⟹ the log transformation with natural base e (ln(y_i))

In practice, there will be a range of λ values that give reasonably small values of the scale-adjusted MSE_adj.
From this range, we will want to choose a transformation that is convenient and provides a meaningful scale.
Scientists and engineers often work on logarithmic scales (λ = 0), for example. In other applications,
reciprocals (λ = -1) make sense.

However, notice:

- Once the transformation is selected, all subsequent estimation and tests are performed in terms of the
  transformed values.
- Transformations complicate the interpretation. Some transformations are easier to explain than others
  in some contexts.
- The graphical diagnostics do not provide a clear-cut decision rule. A natural criterion for assessing the
  necessity of a transformation is whether important substantive results differ qualitatively before and after.
- In multiple regression, the best solution may require transforming the x's.
- In this course, we focus on the Box-Cox transformation of the response variable. If the ln transformation
  is chosen, then we may consider the same ln transformation of all explanatory variables (the ln-ln model)
  if the improvement is substantial.

Box and Cox also showed how to generate a confidence interval for λ, and hence provide a range of reasonable
values from which we may pick a convenient one. If the confidence interval contains λ = 1, no transformation
is usually required. Otherwise, a transformation convenient for the context is chosen from the values in the
confidence interval (e.g. λ = -1, 0, 0.5). The method is based on the log-likelihood for the original response
values as a function of λ, and we seek large values of the log-likelihood.
In R, we can use the function boxcox(<object>) from the MASS package to do the Box-Cox transformation
analysis, where <object> is a model object created by an lm() fit. We now illustrate how to identify the
appropriate transformation for the tree data to resolve the problem with non-linearity. The following R code

> library(MASS)
> fit1 <- lm(Volume ~ Girth + Height, data = tree)
> boxcox(fit1, lambda = seq(-1, 1, by = 0.1))

will produce Figure 6.1. The boxcox function computes the log-likelihood for a number of values of λ and
plots the curve in the figure. The values of λ above the horizontal dotted line comprise an approximate
95% confidence interval. Here we see that values from about -0.1 to about 0.5 seem reasonable. For
convenience, we will pick λ = 0, which implies a log transformation of the response variable, the Volume.2

Now if we fit the model using the transformed response y = ln(Volume) and check the plot of residuals
versus Girth (see Figure 6.3), we will see less of a quadratic pattern and more random scatter compared to
Figure 5.2.2.3
For many applications, transformation of the explanatory variables is also useful, for example transforming
x_j to x_j^λ. We consider applying the same transformation (the ln transformation) to all the explanatory
variables. We call this the log-log model, and write

y = β_0 + β_1 x_1 + β_2 x_2 + ε

where now

y = ln(Volume),   x_1 = ln(Girth),   x_2 = ln(Height)

The fitted log-log model is

> tree$y <- log(tree$Volume)
> tree$x1 <- log(tree$Girth)
> tree$x2 <- log(tree$Height)
> fit2 <- lm(y ~ x1 + x2, data = tree)
> summary(fit2)

Call:
lm(formula = y ~ x1 + x2, data = tree)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.63162    0.79979  -8.292 5.06e-09
x1           1.98265    0.07501  26.432  < 2e-16
x2           1.11712    0.20444   5.464 7.81e-06
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08139 on 28 degrees of freedom
Multiple R-squared: 0.9777, Adjusted R-squared: 0.9761
F-statistic: 613.2 on 2 and 28 DF, p-value: < 2.2e-16
Figure 6.4 shows the plots of residuals against the log-transformed predictors; they show random scatter
patterns.4 The partial residual plots are displayed in Figure 6.5; they are linear to a very good approximation,
and the LOWESS curve wraps around the straight line fairly closely. In summary, all of these are consistent
with the assumption that E[ε_i] = 0 for i = 1, . . . , n.

2 Insert Figure 6.2: Box-Cox transformation for the tree data
3 Insert Figure 6.3: Tree data with log-transformed Volume - residual versus predictor plots
4 Redo later
Since the first-order assumption E[ε_i] = 0 appears to be reasonable for the log-log model fitted to the
tree data, it is appropriate to plot the residuals versus the fitted values to check the constant-variance
assumption.5 Figure 6.6(b) indicates no problems with this assumption.6

We also see that R² has increased slightly, from the original model's 0.948 to 0.978, with the same number
of fitted explanatory variables. Furthermore, the t statistics for the slopes are now larger. The most
compelling reason for favouring the log-log model, however, is that this model cannot predict negative
volumes and gives much more sensible predictions than the original model.7
6.2 Logarithmic Transformation

6.2.1 Logarithmic Transformation of y Only

In general, suppose we fit the model

ln(y) = β_0 + β_1 x_1 + ... + β_p x_p + ε

On the original scale, this model becomes

y = e^{β_0 + β_1 x_1 + ... + β_p x_p + ε} = e^{β_0} e^{β_1 x_1} ... e^{β_p x_p} e^{ε}

where the explanatory variables have multiplicative effects on the response variable, and each appears in an
exponential relationship. The multiplicative error e^{ε} has a log-normal distribution.
6.2.1.1 Interpretation of β_j

Suppose x_j = a. Then, holding the other explanatory variables fixed,

E[y | x_j = a] = e^{β_0} e^{β_1 x_1} ... e^{β_j a} ... e^{β_p x_p} E[e^{ε}]

Now if x_j = a + 1,

E[y | x_j = a + 1] = e^{β_0} e^{β_1 x_1} ... e^{β_j (a+1)} ... e^{β_p x_p} E[e^{ε}]

⟹ E[y | x_j = a + 1] / E[y | x_j = a] = e^{β_j}

⟹ E[y | x_j = a + 1] / E[y | x_j = a] - 1 = e^{β_j} - 1

⟹ ( E[y | x_j = a + 1] - E[y | x_j = a] ) / E[y | x_j = a] = e^{β_j} - 1

Thus 100% × (e^{β_j} - 1) is interpreted as the percentage change in the average value of the response
variable per unit increase in the explanatory variable x_j, while holding all the other explanatory variables
fixed:

100% × (e^{β̂_j} - 1) ≈ average percentage change in y

5 Insert Figure 6.4: Tree data with all variables log-transformed - residual versus predictor plots
6 Insert Figure 6.5: Tree data with all variables log-transformed - partial residual versus predictor plots
7 Insert Figure 6.6: Tree data - residual versus fitted value plots: (a) original data; (b) log-transformed data
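A minimal sketch of this interpretation in R; the data frame dat and its columns are hypothetical.

R Code (sketch):
fit_logy <- lm(log(y) ~ x1 + x2, data = dat)   # model with log response only
b1 <- coef(fit_logy)["x1"]
100 * (exp(b1) - 1)   # % change in the average y per one-unit increase in x1, holding x2 fixed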
6.2.2 Logarithmic Transformation of All Variables

Suppose, in general, we fit the model

ln(y) = β_0 + β_1 ln(x_1) + ... + β_p ln(x_p) + ε

On the original scale of y:

y = e^{β_0} e^{β_1 ln x_1} ... e^{β_p ln x_p} e^{ε} = e^{β_0} x_1^{β_1} ... x_p^{β_p} e^{ε}

Essentially, the explanatory variables now have multiplicative rather than additive effects on y, and each
appears in a power relationship.

6.2.2.1 Interpretation of β_j

100% × (e^{β_j ln(1.01)} - 1) is the percentage change in the average value of the response variable per 1%
increase in x_j (holding the other explanatory variables fixed):

100% × (e^{β̂_j ln(1.01)} - 1) ≈ average percentage change in y per 1% change in x_j
6.2.3 Logarithmic Transformation of y and Some x_j's

Consider the model with two explanatory variables

ln(y) = β_0 + β_1 ln(x_1) + β_2 x_2 + ε

where x_1 is transformed but x_2 is not. On the original scale of y,

y = e^{β_0} x_1^{β_1} e^{β_2 x_2} e^{ε}

Thus x_1 has a power relationship, while x_2 has an exponential effect. In general we can obtain a mixture
of power and exponential multiplicative effects.
6.2.4 95% CI for a Transformed Estimate

Consider the log model and a 95% CI for y_p at a given vector of values a_p of the explanatory variables:

ln(ŷ_p) = a_p^T β̂   ⟹   ŷ_p = e^{a_p^T β̂}

6.2.4.1 Two Ways to Get the CI

Method 1: Find a 95% CI [L, U] for a_p^T β = ln(y_p); then a 95% CI for y_p = e^{a_p^T β} is (e^L, e^U).

Method 2: Find SE(e^{a_p^T β̂}) based on the delta method; then a 95% CI for y_p is

e^{a_p^T β̂} ± t_{n-p-1, α/2} SE(e^{a_p^T β̂})

The second method is more correct, but the first is easier.
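A minimal sketch of Method 1; the objects fit_logy and newx (a one-row data frame of explanatory-variable
values) are hypothetical.

R Code (sketch):
ci_log <- predict(fit_logy, newdata = newx, interval = "confidence", level = 0.95)  # (fit, L, U) on log scale
exp(ci_log)   # back-transform: (e^L, e^U) is the 95% CI for y_p on the original scale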
6.3 Transformation for Stabilizing Variance

Consider the general model

y_i = β_0 + β_1 x_{i,1} + ... + β_p x_{i,p} + ε_i = μ_i + ε_i

where μ_i is the mean of the response. Furthermore, suppose that y_i has non-constant variance

Var(y_i) = μ_i^α σ²

where σ² is a constant of proportionality between the variance of y_i and the mean of y_i.

- If α > 0, the variance increases with the mean.
- If α < 0, the variance decreases with the mean.

(A residual-versus-fitted plot with a systematic change in spread ⟹ non-constant variance.)

Now we want to find a transformation g(y_i) of y_i such that g(y_i) has constant variance. For this, we
approximate g(y_i) by a first-order Taylor series about μ_i:

g(y_i) ≈ g(μ_i) + (y_i - μ_i) g'(μ_i),   where g'(μ_i) = [ d g(y)/dy ]_{y = μ_i}

Then

Var(g(y_i)) ≈ Var( (y_i - μ_i) g'(μ_i) ) = [g'(μ_i)]² Var(y_i) = [g'(μ_i)]² μ_i^α σ²

To stabilize the variance, we may choose the transformation g( ) such that

[g'(μ_i)]² = 1/μ_i^α   ⟹   g'(μ_i) = 1/μ_i^{α/2}

Then choosing

g(y_i) = y_i^{1-α/2}    if α ≠ 2
g(y_i) = ln(y_i)        if α = 2

does the trick and leads to Var(g(y_i)) being (approximately) constant, proportional to σ². For example, if
α = 1 (variance proportional to the mean), this gives the square-root transformation g(y_i) = √y_i.

This analysis does not tell us which function g( ) to choose, as we do not know α or the true form of
Var(y_i). It does, however, explain why Box-Cox often chooses a transformation y_i^λ with λ < 0, or ln(y).
6.4 Some Remedies for Non-Linearity - Polynomial Regression

Fit y = β_0 + β_1 x_1 + ε. If a plot of r versus x shows non-linearity, include higher-order terms:

y = β_0 + β_1 x + β_2 x² + ε
y = β_0 + β_1 x + β_2 x² + β_3 x³ + ε
...

Rule 1: If x^n is in the expression, then x^{n-1} should be in as well. In general, if a higher-order term is in,
all lower-order terms should also be in.

Rule 2: We include a higher-order term only if the new model is much better.
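A minimal sketch of adding a quadratic term; the data frame dat with columns y and x is hypothetical.

R Code (sketch):
fit1 <- lm(y ~ x, data = dat)
fit2 <- lm(y ~ x + I(x^2), data = dat)   # keep the lower-order term x as well (Rule 1)
anova(fit1, fit2)                        # add x^2 only if the fit improves substantially (Rule 2)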
Chapter 7

Model Evaluation: Outliers and Influential Cases1

7.1 Outliers

Definition 7.1. Outlier: an outlier is a particular case with an unusual (extreme) value in y and/or the x's.

Consider the following cases (sketched scatterplots of y versus x from the notes):

- Case A is outlying in the covariate x, but not in y: the response is right on the model trajectory.
- Case B is not unusual with regard to x, but it is an outlier in y.
- Case C represents an outlier in x as well as in y.

7.1.1 How to Detect Outliers?

A simple diagnostic tool is a graph of the studentized residuals

d_i = r_i / sqrt( σ̂² (1 - h_{i,i}) )

where h_{i,i} is the (i, i) entry of the hat matrix H = X (X^T X)^{-1} X^T, and approximately d_i ~ N(0, 1).

Large values of d_i (e.g. |d_i| > 2.5) ⟹ outlier in y.

(Plot from the notes: d_i versus case index i, with horizontal reference lines.)

The real issue is not whether a case is an outlier or not: it is whether a case has a major influence on
a given statistical procedure; in other words, whether keeping or removing the case results in dramatically
different results for the regression model

- on the fitted line ŷ, and
- on the estimate β̂.

Problem. How do we detect an influential case?

1 Nov 1, 2012
7.2 Hat Matrix and Leverage

Recall

H = X (X^T X)^{-1} X^T = ( h_{i,j} )_{n×n},   ŷ = H y

so that

ŷ = [ h_{1,1}  h_{1,2}  ...  h_{1,n} ]   [ y_1 ]
    [ h_{2,1}  h_{2,2}  ...  h_{2,n} ]   [ y_2 ]
    [   ...                    ...   ] × [ ... ]
    [ h_{n,1}  h_{n,2}  ...  h_{n,n} ]   [ y_n ]

and the i-th fitted value is

ŷ_i = Σ_{j=1}^{n} h_{i,j} y_j = h_{i,i} y_i + Σ_{j≠i} h_{i,j} y_j

The weight h_{i,i} indicates the influence of y_i on ŷ_i:2

- if h_{i,i} is large, the term h_{i,i} y_i dominates ŷ_i;
- 0 ≤ h_{i,i} ≤ 1, and if h_{i,i} = 1 then ŷ_i ≈ y_i.

This implies that when h_{i,i} is large, the fitted line is forced to pass very close to the i-th observation
(y_i, x_{i,1}, . . . , x_{i,p}). We say that case i exerts high leverage on the fitted line.

Definition 7.2. Leverage: h_{i,i} is called the leverage value of case i.

large h_{i,i} ⟺ high leverage ⟺ influential on the fitted line

- The leverage h_{i,i} is a function of the x's but not of y.
- The leverage h_{i,i} is small for cases with (x_{i,1}, . . . , x_{i,p}) near the centroid (x̄_1, . . . , x̄_p)
  determined by all cases, and large if (x_{i,1}, . . . , x_{i,p}) is far away from the centroid.
- (h_{i,i} is used to assess whether a case is unusual with regard to its covariates - the x dimension.)

2 Recall from Section 5.2 that h_{i,i} (1 - h_{i,i}) = Σ_{j≠i} h²_{i,j} ≥ 0, which implies 0 ≤ h_{i,i} ≤ 1.
Example 7.1. Simple Linear Regression

X^T X = [ n      n x̄
          n x̄    Σ x_i² ]

(X^T X)^{-1} = 1/(n S_{xx}) [ Σ x_i²   -n x̄
                              -n x̄      n    ]

so the leverage of case i is

h_{i,i} = (1, x_i) (X^T X)^{-1} (1, x_i)^T
        = (1/S_{xx}) (1, x_i) [ (1/n) Σ x_j² - x̄ x_i
                                 x_i - x̄             ]
        = (1/S_{xx}) [ (1/n) Σ x_j² - x̄ x_i + x_i (x_i - x̄) ]
        = (1/S_{xx}) [ (1/n) (Σ x_j² - n x̄²) + (x_i - x̄)² ]
        = (1/S_{xx}) [ (1/n) S_{xx} + (x_i - x̄)² ]
        = 1/n + (x_i - x̄)² / S_{xx}

The leverage is smallest when x_i = x̄, and it is large if x_i is far from x̄.

Rule: the average leverage in a model with (p + 1) regression parameters is

h̄ = (p + 1)/n

If for a case

h_{i,i} > 2 h̄ = 2(p + 1)/n

then it is considered a high-leverage case.
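A minimal sketch of the leverage rule in R, assuming a fitted lm object fit.

R Code (sketch):
h <- hatvalues(fit)                    # leverages h_{i,i} (diagonal of the hat matrix)
p <- length(coef(fit)) - 1             # number of explanatory variables
which(h > 2 * (p + 1) / length(h))     # cases flagged as high leverage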
7.3 Cook's Distance

Definition 7.3. Cook's Distance: a measure of influence on β̂.

Consider the model y = Xβ + ε with β̂ = (X^T X)^{-1} X^T y. Suppose we delete the i-th case and fit the
model

y_(i) = X_(i) β + ε_(i)

where y_(i) is y with the i-th element removed, of dimension (n-1) × 1,

y_(i) = ( y_1, . . . , y_{i-1}, y_{i+1}, . . . , y_n )^T

and X_(i) is X with the i-th row removed, of dimension (n-1) × (p+1):

X_(i) = [ 1  x_{1,1}    ...  x_{1,p}
          ...
          1  x_{i-1,1}  ...  x_{i-1,p}
          1  x_{i+1,1}  ...  x_{i+1,p}
          ...
          1  x_{n,1}    ...  x_{n,p}  ]
and

β̂_(i) = ( X_(i)^T X_(i) )^{-1} X_(i)^T y_(i)

If the i-th case is influential, we expect a big change in the estimate of β. The change β̂ - β̂_(i) is then a
good measure of the influence of the i-th case.

Note.
- β̂ - β̂_(i) is a vector; a large value in any component implies that the i-th case is influential. A single
  summary is the quadratic form (β̂ - β̂_(i))^T (β̂ - β̂_(i)).
- The magnitude of β̂ - β̂_(i) should be adjusted by the variance of β̂, Var(β̂) = σ² (X^T X)^{-1}.

7.3.1 Cook's D Statistic

D_i = (β̂ - β̂_(i))^T [ σ̂² (X^T X)^{-1} ]^{-1} (β̂ - β̂_(i)) / (p + 1)
    = (β̂ - β̂_(i))^T (X^T X) (β̂ - β̂_(i)) / ( σ̂² (p + 1) )

An identity:

β̂ - β̂_(i) = ( r_i / (1 - h_{i,i}) ) (X^T X)^{-1} x_i

where x_i = (1, x_{i,1}, . . . , x_{i,p})^T is the i-th row of X. Substituting this into the expression,

D_i = r_i² x_i^T (X^T X)^{-1} x_i / ( (1 - h_{i,i})² (p + 1) σ̂² )
    = d_i² x_i^T (X^T X)^{-1} x_i / ( (1 - h_{i,i}) (p + 1) )
    = d_i² h_{i,i} / ( (1 - h_{i,i}) (p + 1) )

since d_i² = r_i² / ( σ̂² (1 - h_{i,i}) ) and x_i^T (X^T X)^{-1} x_i = h_{i,i}.

D_i measures the influence of the i-th case on all fitted values and on the estimated β.

- If h_{i,i} is large and d_i is small ⟹ D_i is small.
- If h_{i,i} is small and d_i is large ⟹ D_i is small.
- D_i is an overall measure of influence.

How large is large enough? The cut-off: if D_i > 1 (and sometimes D_i > 0.5) we should be concerned.
7.4 Outliers and Influential Cases: Remove or Keep?

- Correct obvious errors due to data processing; it could be a data-entry problem.
- Make a careful decision on whether to keep or remove them (before/after the analysis). The target
  population may change due to the inclusion/exclusion of certain cases.
- Most investigators would hesitate to report rejecting H_0 if the removal of a case results in H_0 not
  being rejected.
- Robust methods: weighted least squares.

In R, suppose we fit a model

fit <- lm(y ~ x1 + x2 + ... + xp, data = <data frame>)

To get Cook's distance D_i:
cookD <- cooks.distance(fit)

To get the leverages h_{i,i}:
fitinf <- influence(fit)
fitinf$hat                              # a vector containing the diagonal of the hat matrix H

To get the studentized residuals d_i:
fitsummary <- summary(fit)
s <- fitsummary$sigma                   # s = sqrt(MSE), i.e. sigma-hat
studr <- residuals(fit)/(sqrt(1-fitinf$hat) * s)
Chapter 8

Model Building and Selection

8.1 More Hypothesis Testing

8.1.1 Testing Some But Not All β's

Consider the general model

y = β_0 + β_1 x_1 + ... + β_p x_p + ε

Partition the design matrix into the first p_A + 1 columns (the intercept column and the first p_A explanatory
variables) and the remaining p_B columns, and partition β accordingly:

X = [ X_A  X_B ],   β = ( β_A^T, β_B^T )^T

where X_A is n × (p_A + 1), X_B is n × p_B, β_A is (p_A + 1) × 1 and β_B is p_B × 1.

Example 8.1.

y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3 + β_4 x_4 + ε

with p + 1 = 5 parameters partitioned as

β_A = (β_0, β_1, β_2)^T,   β_B = (β_3, β_4)^T

and

X_A = [ 1  x_{1,1}  x_{1,2}          X_B = [ x_{1,3}  x_{1,4}
        ...                                  ...
        1  x_{n,1}  x_{n,2} ],               x_{n,3}  x_{n,4} ]

Suppose we want to test

H_0: β_B = 0 (i.e. β_3 = β_4 = 0)   versus   H_a: β_B ≠ 0 (at least one ≠ 0)

We are not restricted to the last p_B elements; these ideas apply to any p_B elements.
8.1.1.1 Extra Sum of Squares Principle

A test follows from the change in the regression sum of squares between fitting model (1), with just β_A (the
reduced model), and model (2), with both β_A and β_B.

ANOVA Table for Testing Some β's

Source                                df              SS
Regression fitting β_A                (p_A + 1) - 1    SSR(β̂_A)
Regression fitting β_B extra to β_A    p_B             SSR(β̂) - SSR(β̂_A)
Residuals                             n - p - 1        SSE(β̂) = r^T r
Total                                 n - 1            SST = y^T y - n ȳ²

with p + 1 = (p_A + 1) + p_B.

The idea: if H_0: β_B = 0 is not true, the extra regression sum of squares contributed by including β_B in
the model should be large (relative to MSE).

Formally, if all model assumptions hold,

F = [ ( SSR(β̂) - SSR(β̂_A) ) / p_B ] / MSE   ~   F(p_B, n - p - 1)   under H_0

If F > F_{α, (p_B, n-p-1)} then we reject H_0 at significance level α. Otherwise, H_0 is not rejected.

Note that

SST - SSR(β̂) = SSE
SST - SSR(β̂_A) = SSE_0

where SSE_0 is the residual sum of squares leaving out X_B (fitting the model subject to H_0). Then the
difference is

SSE_0 - SSE = SSR(β̂) - SSR(β̂_A)

Thus, if the extra regression sum of squares is small:

- the two models have similar residual sums of squares;
- the two models fit about the same;
- we choose the simpler model ⟹ we do not reject H_0.

Mathematically, we can rearrange the formula for F:

8.1.1.2 Alternative Formulas for F

F = [ (SSE_0 - SSE) / p_B ] / MSE = ( SSE_0/SSE - 1 ) (n - p - 1)/p_B

where SSE_0 is the SSE of the reduced model.1
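A minimal sketch of this test with anova() for nested models, using the low birth weight data from the
earlier example; the particular split into reduced and full models is for illustration only.

R Code (sketch):
fit_A  <- lm(headcirc ~ gestage, data = lowbwt)                      # reduced model (beta_A only)
fit_AB <- lm(headcirc ~ gestage + birthwt + toxemia, data = lowbwt)  # full model (beta_A and beta_B)
anova(fit_A, fit_AB)    # F = ((SSE_0 - SSE)/p_B) / MSE with p_B = 2 here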
Note that there are two versions of the ANOVA table:

1 Nov 8, 2012
8.1.1.3 ANOVA (Version 1)

Source of Variation   Degrees of Freedom   Sum of Squares
Regression            (p + 1) - 1          SSR = (ŷ - ȳ1)^T (ŷ - ȳ1) = ŷ^T ŷ - n ȳ²
Residual              n - p - 1            SSE = y^T y - ŷ^T ŷ = r^T r
Total                 n - 1                SST = (y - ȳ1)^T (y - ȳ1) = y^T y - n ȳ²

where ȳ1 = (ȳ, ȳ, . . . , ȳ)^T, and

SST = Σ (y_i - ȳ)² = (y - ȳ1)^T (y - ȳ1)
SSR = Σ (ŷ_i - ȳ)² = (ŷ - ȳ1)^T (ŷ - ȳ1)
SSE = Σ r_i² = r^T r

8.1.1.4 ANOVA (Version 2) (not adjusting for the intercept β_0)

Source of Variation   Degrees of Freedom   Sum of Squares
Regression            p + 1                SSR = Σ ŷ_i² = ŷ^T ŷ
Residual              n - p - 1            SSE = Σ r_i² = r^T r
Total                 n                    SST = y^T y
8.1.2 The General Linear Hypothesis

To test a very general hypothesis concerning the regression coefficients:

H_0: T_{c×(p+1)} β = b_{c×1}

where T is a c × (p + 1) matrix of constants and b is a c × 1 vector of constants.

Example 8.2.

y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3 + ε

The null hypothesis H_0: β_0 = 0 and β_1 = β_2 can be written as

[ 1  0   0  0 ]  (β_0, β_1, β_2, β_3)^T  =  (0, 0)^T
[ 0  1  -1  0 ]
       T                   β                   b

Thus H_0: Tβ = b.

Example 8.3. To test H_0: β_2 = β_3 = 0, take

T = [ 0  0  1  0
      0  0  0  1 ]

8.1.2.1 The Test

To test H_0: Tβ = b in general:

1. Fit the regression with no constraints.
2. Compute SSE.
3. Fit the regression model subject to the constraints.
4. Compute the new SSE_0.
5. Compute the F-ratio

   F = [ (SSE_0 - SSE) / c ] / [ SSE / (n - p - 1) ]

   where c is the number of rows of the T matrix.
6. If F > F_{α, (c, n-p-1)}, reject H_0: Tβ = b; otherwise do not reject H_0.

Example. Consider H_0: β_2 = β_3 = 0:

[ 0  0  1  0 ]  (β_0, β_1, β_2, β_3)^T  =  (0, 0)^T
[ 0  0  0  1 ]
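A minimal sketch of steps 1-6 for Example 8.3; the data frame dat and its columns are hypothetical, and the
linearHypothesis() call from the car package is noted only as an alternative.

R Code (sketch):
fit_full <- lm(y ~ x1 + x2 + x3, data = dat)   # unconstrained fit
fit_0    <- lm(y ~ x1, data = dat)             # fit under H0: beta_2 = beta_3 = 0
anova(fit_0, fit_full)                         # F = ((SSE_0 - SSE)/c) / (SSE/(n-p-1)), c = 2
# library(car); linearHypothesis(fit_full, c("x2 = 0", "x3 = 0"))   # general T and b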
8.2 Categorical Predictors and Interaction Terms

8.2.1 Binary Predictor

Recall the low birth weight infant example:

y: head circumference
x_1: gestational age
x_2: toxemia (1 = yes, 0 = no)

Consider the model

y = β_0 + β_1 x_1 + β_2 x_2 + ε

The fitted model is

ŷ = 1.496 + 0.874 x_1 - 1.412 x_2

Testing β_2 = 0 asks whether the two groups share the same line:

y = β_0 + β_1 x_1 + ε            when x_2 = 0
y = (β_0 + β_2) + β_1 x_1 + ε    when x_2 = 1

i.e. β_2 is the constant vertical shift between two parallel lines. It is often not reasonable to assume that
the effect of the other explanatory variables is the same across different groups.
8.2.2 Interaction Terms

y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_1 x_2 + ε

⟹   y = β_0 + β_1 x_1 + ε                       when x_2 = 0
     y = (β_0 + β_2) + (β_1 + β_3) x_1 + ε       when x_2 = 1

By adding the interaction term, we allow x_1 to have a different effect on y depending on the value of x_2.

Hypothesis testing of the interaction term,

H_0: β_3 = 0   versus   H_a: β_3 ≠ 0

tells us whether the effect is different between the groups or not.
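A minimal sketch for the low birth weight example; the interaction model itself is for illustration.

R Code (sketch):
fit_int <- lm(headcirc ~ gestage * toxemia, data = lowbwt)  # expands to gestage + toxemia + gestage:toxemia
summary(fit_int)    # the t-test on the gestage:toxemia row tests H0: beta_3 = 0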
8.2.3 Categorical Predictor with More Than 2 Levels

Example. y: prestige score of occupations. Explanatory variables:

(x_1) education (in years)
(x_2) income
(x_3) type of occupation: blue collar, white collar, professional

8.2.3.1 Dummy Variables

Dummy variables are basically binary indicators:

D_1 = 1 if professional, 0 otherwise
D_2 = 1 if white collar, 0 otherwise

Type of Occupation   D_1   D_2
professional          1     0
white collar          0     1
blue collar           0     0

A categorical explanatory variable with k levels can be represented by k - 1 dummy variables.

The regression model

y = β_0 + β_1 x_1 + β_2 x_2 + β_3 D_1 + β_4 D_2 + ε

⟹   y = (β_0 + β_3) + β_1 x_1 + β_2 x_2 + ε    professional
     y = (β_0 + β_4) + β_1 x_1 + β_2 x_2 + ε    white collar
     y = β_0 + β_1 x_1 + β_2 x_2 + ε            blue collar

where

- β_3 represents the constant vertical distance between the parallel regression planes for professional and
  blue collar occupations;
- β_4 represents the constant vertical distance between the parallel regression planes for white collar and
  blue collar occupations.2

To make a variable into a vector of categorical indicators:

R Code:
<variable vector> <- factor(<variable vector>)

This makes the variable into a factor, meaning that the linear regression code will create the necessary
dummy variables as needed.

Be careful with p-values for dummy variables: the p-value for each β only compares that level against the
baseline level.

Testing Individual Hypotheses (t-test)

H_0: β_3 = 0 versus H_a: β_3 ≠ 0,   or   H_0: β_4 = 0 versus H_a: β_4 ≠ 0

tests the difference between an experimental group (e.g. professional or white collar) and the reference
(blue collar) group.
8.2.3.2 Testing the Overall Effect of a Categorical Predictor

H_0: β_3 = β_4 = 0   versus   H_a: at least one ≠ 0

Model   Terms                  df    SSE
1 (F)   x_1, x_2, D_1, D_2     93    4681.28
2 (R)   x_1, x_2               95    5272.44

F_0 = [ (SSE_0 - SSE) / 2 ] / [ SSE / 93 ]   ~ F(2, 93)

F_0 = 5.95 > F_{0.05}(2, 93) = 3.07

Therefore, we reject the null hypothesis and conclude that occupational type is overall significantly related
to prestige score.

To change which level is used as the baseline in R:

R Code:
contrasts(<factor vector>) <- contr.treatment(<# levels>, base=<level as base>)
8.3 Modeling Interactions With Categorical Predictors

y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_3 D_{i,1} + β_4 D_{i,2}       (main effects)
      + β_5 x_{i,1} D_{i,1} + β_6 x_{i,1} D_{i,2}                       (education × type)
      + β_7 x_{i,2} D_{i,1} + β_8 x_{i,2} D_{i,2}                       (income × type)
      + ε_i

This model can also be written as

y = (β_0 + β_3) + (β_1 + β_5) x_1 + (β_2 + β_7) x_2 + ε    professional
y = (β_0 + β_4) + (β_1 + β_6) x_1 + (β_2 + β_8) x_2 + ε    white collar
y = β_0 + β_1 x_1 + β_2 x_2 + ε                            blue collar

where β_5, β_6 represent the effect of the interaction between education and occupation type, and β_7, β_8
represent the effect of the interaction between income and occupation type.

To test the significance of the interaction, e.g.

H_0: β_7 = β_8 = 0   versus   H_a: at least one ≠ 0

Model   Terms                                                     df    SSE
1 (F)   x_1, x_2, D_1, D_2, x_1 D_1, x_1 D_2, x_2 D_1, x_2 D_2    89    3552.624
2 (R)   x_1, x_2, D_1, D_2, x_1 D_1, x_1 D_2                      91    4504.982

we use the F-test:

F = [ (SSE_0 - SSE) / 2 ] / [ SSE / 89 ]   ~ F(2, 89)

F = 11.929 > F_{0.05}(2, 89) = 3.099

Hence we reject H_0 and conclude that there is significant evidence that the relationship between income
and prestige score is different across occupation types.

2 Nov 13, 2012
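A minimal sketch of this comparison; the data frame and column names (prestige, education, income, type)
are hypothetical.

R Code (sketch):
fit_R <- lm(prestige ~ education + income + type + education:type, data = prestige_dat)
fit_F <- lm(prestige ~ education + income + type + education:type + income:type, data = prestige_dat)
anova(fit_R, fit_F)   # F-test of H0: beta_7 = beta_8 = 0 (no income-by-type interaction)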
8.4 The Principle of Marginality3

y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_3 D_{i,1} + β_4 D_{i,2}
      + β_5 x_{i,1} D_{i,1} + β_6 x_{i,1} D_{i,2} + β_7 x_{i,2} D_{i,1} + β_8 x_{i,2} D_{i,2} + ε_i

- D_{i,1}, D_{i,2} represent the categorical variable and must be tested as a group (F-test).
- If a model includes a higher-order term, then the lower-order terms should also be included.
- Examine the higher-order terms (interactions) first, then proceed to test, estimate and interpret the main
  effects.
8.5 Variable Selection4

Often, many explanatory variables are available. Investigators may have little idea of the driving factors
and so will cast a wide net in data collection, hoping that the analysis will identify the important variables.
There are several reasons why we would like to include only the important variables:

- The model becomes simpler and easier to understand (unimportant factors are eliminated).
- The cost of prediction is reduced - fewer variables to measure.
- The accuracy of predicting new y's may improve. In general, including unnecessary explanatory variables
  inflates the variances of predictions.

In this section, we look at some of the more popular algorithms for selecting explanatory variables:

- forward selection
- backward elimination
- stepwise regression
- criterion-based all-subsets regression

Note. Italicized methods can be automated.

We will use an example to illustrate how to implement these methods. We will also discuss under what
circumstances these methods are appropriate to use.

Example 8.4. We illustrate the variable selection methods on some data on the 50 states of the U.S.A. from
the 1970s. We take life expectancy as the response and the remaining variables as predictors:

State        state name
Population   population estimate of the state
Income       per capita income
Illiteracy   illiteracy, percent of population
Life_Exp     life expectancy in years
Murder       murder and non-negligent manslaughter rate per 100,000 population
HS_Grad      percent high-school graduates
Frost        mean number of days with minimum temperature < 32 degrees in the capital city
Area         land area in square miles

3 Nov 22, 2012
4 Nov 20, 2012, from handout
8.5.1 Backward Elimination

1. Start with all p potential explanatory variables in the model

   y = β_0 + β_1 x_1 + ... + β_p x_p + ε

2. For each explanatory variable x_j, calculate the p-value (based on either a t-test or an F-test5) for
   testing

   H_0: β_j = 0,   j = 1, . . . , p

3. If the largest p-value is greater than α, drop the predictor with the largest p-value. If the largest
   p-value is smaller than α, then you cannot simplify the model further and you stop the algorithm.

4. Repeat steps 2 and 3 with the simplified model until all p-values for the remaining variables are less
   than the preset significance level α.

Note: α does not have to be 0.05 (a 0.05 to 0.2 cut-off may work best if prediction is the goal). One refers
to it as the "alpha to drop".

Example. Life expectancy data

5 The F-test should be used when the explanatory variable is categorical.
R code:
> data(state)
> statedata<-data.frame(state.x77)
> g<-lm(Life.Exp~., data=statedata)
> summary(g)
Call:
lm(formula = Life.Exp ~ ., data = statedata)
Residuals:
Min 1Q Median 3Q Max
-1.48895 -0.51232 -0.02747 0.57002 1.49447
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.094e+01 1.748e+00 40.586 < 2e-16 ***
Population 5.180e-05 2.919e-05 1.775 0.0832 .
Income -2.180e-05 2.444e-04 -0.089 0.9293
Illiteracy 3.382e-02 3.663e-01 0.092 0.9269
Murder -3.011e-01 4.662e-02 -6.459 8.68e-08 ***
HS.Grad 4.893e-02 2.332e-02 2.098 0.0420 *
Frost -5.735e-03 3.143e-03 -1.825 0.0752 .
Area -7.383e-08 1.668e-06 -0.044 0.9649
Residual standard error: 0.7448 on 42 degrees of freedom
Multiple R-squared: 0.7362, Adjusted R-squared: 0.6922
F-statistic: 16.74 on 7 and 42 DF, p-value: 2.534e-10
We illustrate the backward method. At each stage we remove the predictor with the largest p-value over
0.05:
R Code:
> g<-update(g, .~.-Area)
> summary(g)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)
Population
Income
Illiteracy
Murder
HS.Grad
Frost
> g<-update(g, .~.-Illiteracy)
> summary(g)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)
Population
Income
Murder
HS.Grad
Frost
> g<-update(g, .~.-Income)
> summary(g)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)
Population
Murder
HS.Grad
Frost
> g<-update(g, .~.-Population)
> summary(g)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)
Murder
HS.Grad
Frost
Residual standard error: 0.7427 on 46 degrees of freedom
Multiple R-squared: 0.7127,Adjusted R-squared: 0.6939
F-statistic: 38.03 on 3 and 46 DF, p-value: 1.634e-12
Notice that the final removal of Population is a close call. The R² = 0.736 for the full model is only
reduced slightly in the final model (R² = 0.713). Thus the removal of four predictors causes only a minor
reduction in fit.

Note. The final model depends on the significance level α: the larger α is, the bigger the final model is.
Issue with backward elimination:

Once a predictor has been eliminated from the model, it never has a chance to re-enter the model, even if it
becomes significant after other predictors are dropped. For example,

R Code:
> summary(lm(Life.Exp~Illiteracy+Murder+Frost, data=statedata))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)
Illiteracy
Murder
Frost
Residual standard error: 0.7911 on 46 degrees of freedom
Multiple R-squared: 0.6739, Adjusted R-squared: 0.6527
F-statistic: 31.69 on 3 and 46 DF, p-value: 2.915e-11

We see that Illiteracy does have some association with Life.Exp. It is true that replacing Illiteracy with
HS.Grad gives a somewhat better-fitting model, but it would be wrong to conclude that Illiteracy is not a
variable of interest.
8.5.2 Forward Selection

1. Fit p simple linear models, each with only a single explanatory variable v_j, j = 1, . . . , p. There are p
   t-test statistics and p-values for testing H_0: β_j = 0, j = 1, . . . , p. The most significant predictor is
   the one with the smallest p-value, denoted by v_k. If the smallest p-value > α, the algorithm stops and
   no variable is included. Otherwise, set x_1 = v_k and fit the model.

2. Start from the model

   y = β_0 + β_1 x_1 + ε    (*)

   Enter the remaining p - 1 predictors, one at a time, to fit p - 1 models

   y = β_0 + β_1 x_1 + β_2 v_j + ε,   j = 1, . . . , p - 1

   and let p_k denote the smallest p-value and v_k the most significant explanatory variable.

   (a) If p_k > α: stop; model (*) is the final model.
   (b) If p_k < α: set x_2 = v_k and enter the corresponding explanatory variable into model (*) to update
       it as

       y = β_0 + β_1 x_1 + β_2 x_2 + ε

3. Continue this algorithm until no new explanatory variables can be added.

The preset significance level α is called the "alpha to enter".
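A minimal sketch of one forward step with add1(), using the state data from the example below; starting
from the intercept-only model follows step 1 of the algorithm above.

R Code (sketch):
fit0 <- lm(Life.Exp ~ 1, data = statedata)     # intercept-only starting model
scope <- ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area
add1(fit0, scope = scope, test = "F")          # p-values for each candidate; enter the smallest if < alpha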
Example. Life expectancy data.

The first variable to enter is found from the results of fitting 7 simple linear models (R code in handout),

y_i = β_0 + β_1 x_{i,j} + ε_i,   j = 1, . . . , 7

The second, third and fourth candidate variables are examined in the same way (R code in handout). No
further explanatory variable can be added at the preset significance level α = 0.05, so we stop.

Summary of the forward selection steps:

Iteration   Variable to Enter   p-value (F-test)
1           Murder              2.260 × 10^{-11}
2           HS.Grad             0.009088
3           Frost               0.006988

The final model selected at significance level α = 0.05 includes the explanatory variables Murder, HS.Grad
and Frost - the same final model as from the backward elimination method.

Issue with forward selection:

Once a predictor has entered the model, it remains in the model forever, even if it becomes non-significant
after other predictors have been selected.
8.5.3 Stepwise Regression

Stepwise regression is a combination of the backward and forward methods. It addresses the situation where
variables are added or removed early in the process and we want to change our mind about them later. The
procedure depends on two alphas:

- α_1: alpha to enter
- α_2: alpha to drop

At each stage a variable may be added or removed, and there are several variations on exactly how this is
done. For example:

1. Start as in forward selection, using significance level α_1.
2. At each stage, once a predictor enters the model, check all other predictors previously in the model for
   their significance. Drop the least significant predictor (the one with the largest p-value) if its p-value
   is greater than the preset significance level α_2.
3. Continue until no predictors can be added and no predictors can be removed.
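A related automated search is available in base R; note that step() decides using AIC (Section 8.5.4.4)
rather than preset alpha levels, so it is a close cousin of, not an exact implementation of, the procedure
above.

R Code (sketch):
fit_full <- lm(Life.Exp ~ ., data = statedata)
step(fit_full, direction = "both")    # "both" allows variables to enter and to leave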
Remark. With automatic methods (forward/backward/stepwise):

- Because of the one-at-a-time nature of adding/removing variables, it is possible to miss the optimal model.
- The procedures are not directly linked to the final objectives of prediction or explanation and so may not
  really help solve the problem of interest. It is important to keep in mind that model selection cannot be
  divorced from the underlying purpose of the investigation. Variable selection tends to amplify the
  statistical significance of the variables that stay in the model. Variables that are dropped can still be
  correlated with the response. It would be wrong to say these variables are unrelated to the response;
  it is just that they provide no additional explanatory effect beyond the variables already included in
  the model.
- All "automatic" algorithms should be used with caution. When there is an appreciable degree of
  multicollinearity among the explanatory variables (as in most observational studies), the three methods
  may lead to quite different final models.

Some practical advice on the t-test and F-test in linear regression models:

- To test hypotheses about a single coefficient, use the t-test.
- To test hypotheses about several coefficients (e.g. testing the coefficients of several dummy variables), or
  more generally to compare nested models, use the F-test based on a comparison of SSEs (or SSRs).
8.5.4 All Subsets Regression

Suppose we start with a regression model with p explanatory variables,

y_i = β_0 + β_1 x_{i,1} + ... + β_p x_{i,p} + ε_i

where each x_j may be included or left out. Thus there are 2^p possible regressions (e.g. p = 10 gives
2^10 = 1024 regressions). In principle, we can fit each regression and choose the "best" model based on
some "fit" criterion.

Numerical criteria for model comparison:

8.5.4.1 R² Comparison

R-square (multiple correlation coefficient):

R² = SSR / SST

It is always in favour of a larger model.

8.5.4.2 R²_adj Comparison

Adjusted R-square:

R²_adj = 1 - [ (n - 1)/(n - p - 1) ] (1 - R²)

where p is the number of explanatory variables in the model. A larger model may have a smaller R²_adj.

8.5.4.3 Mallows C_k Comparison

Consider a smaller candidate model with k explanatory variables (k < p), and let SSE_k be the sum of
squares of errors from fitting this model. Then

C_k = SSE_k / MSE_full - (n - 2(k + 1))

The idea is to compare the sum of squares of errors from the smaller candidate model with the one from the
full model.

- A candidate model is good if C_k ≈ k + 1.
- Look for the simplest model (with smallest k) for which C_k is close to k + 1.
8.5.4.4 AIC (Akaike's Information Criterion)

Under the linear regression model

y_i = β_0 + β_1 x_{i,1} + ... + β_p x_{i,p} + ε_i

we know that

y_i ~ N( β_0 + β_1 x_{i,1} + ... + β_p x_{i,p}, σ² )

and the y_i's are independent.

The likelihood function is

L(β, σ²) = Π_{i=1}^{n} f(y_i) = f(y_1, . . . , y_n)
         = Π_{i=1}^{n} (1/√(2πσ²)) exp{ -( y_i - β_0 - β_1 x_{i,1} - ... - β_p x_{i,p} )² / (2σ²) }

and the log-likelihood is

l(β, σ²) = ln L(β, σ²)
         = Σ_{i=1}^{n} [ -(1/2) ln(2πσ²) - ( y_i - β_0 - β_1 x_{i,1} - ... - β_p x_{i,p} )² / (2σ²) ]
         = -(n/2) ln(2πσ²) - (1/(2σ²)) Σ_{i=1}^{n} ( y_i - β_0 - β_1 x_{i,1} - ... - β_p x_{i,p} )²

The LSE β̂ is the same as the MLE β̂, so

l(β̂, σ²) = -(n/2) ln(2πσ²) - SSE/(2σ²)

Maximizing over σ²:

∂l(β̂, σ²)/∂σ² = -(n/2)(1/σ²) + SSE/(2(σ²)²) = 0   ⟹   σ̂² = SSE/n

so the maximized log-likelihood is

l(β̂, σ̂²) = -(n/2) ln(2π) - (n/2) ln(SSE/n) - n/2 = constant - (n/2) ln(SSE/n)

AIC (Akaike's Information Criterion)

AIC = -2 (max log-likelihood) + 2 (p + 1) = n ln(SSE/n) + 2(p + 1)   (up to an additive constant)

For a linear regression model, the maximum log-likelihood is

l(β̂, σ̂²) = -(n/2) ln(SSE/n) + constant

- AIC is a penalized maximum log-likelihood.
- A small AIC means a better model: for a given number of parameters, a smaller AIC corresponds to a
  larger maximum log-likelihood, while the 2(p + 1) term penalizes larger models.

Note that for a model of a given size (here size refers to the number of explanatory variables included in the
model), all the criteria above will select the model with the smallest sum of squares of residuals, SSE.
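A minimal sketch of an AIC comparison with the state data used below; extractAIC() computes
n ln(SSE/n) + 2(p + 1) up to a constant.

R Code (sketch):
fit3 <- lm(Life.Exp ~ Murder + HS.Grad + Frost, data = statedata)
fit4 <- lm(Life.Exp ~ Murder + HS.Grad + Frost + Population, data = statedata)
extractAIC(fit3)   # returns (equivalent df, AIC); the smaller AIC is preferred
extractAIC(fit4)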
Example. Life expectancy data, all subsets regression:
R Code:
> library(leaps)
> data(state)
> statedata<-data.frame(state.x77)
> tmp<-regsubsets(Life.Exp~., data=statedata)
> summary(tmp)
Subset selection object
Call: regsubsets.formula(Life.Exp ~ ., data = statedata)
7 Variables (and intercept)
Forced in Forced Out
Population FALSE FALSE
Income FALSE FALSE
Illiteracy FALSE FALSE
Murder FALSE FALSE
HS.Grad FALSE FALSE
Frost FALSE FALSE
Area FALSE FALSE
1 subsets of each size up to 7
Selection Algorithm: exhaustive
Population Income Illiteracy Murder HS.Grad Frost Area
1 ( 1 ) " " " " " " "*" " " " " " "
2 ( 1 ) " " " " " " "*" "*" " " " "
3 ( 1 ) " " " " " " "*" "*" "*" " "
4 ( 1 ) "*" " " " " "*" "*" "*" " "
5 ( 1 ) "*" "*" " " "*" "*" "*" " "
6 ( 1 ) "*" "*" "*" "*" "*" "*" " "
7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
The * means that the variable is included for that model
R Code:
> summary(tmp)$cp
[1] 16.126760  9.669894  3.739878  2.019659  4.008737  6.001959
[7]  8.000000
> summary(tmp)$adjr2
[1] 0.6015893 0.6484991 0.6939230 0.7125690 0.7061129 0.6993268
[7] 0.6921823
> par(mfrow=c(1,2))
> plot(2:8, summary(tmp)$cp, xlab="No. of Parameters", ylab="Ck statistic")
> abline(0,1)
> plot(2:8, summary(tmp)$adjr2, xlab="No. of Parameters", ylab="Adjusted R-square")

(Insert plots from the teacher's notes.)

Recall that a candidate model is good when C_k ≈ k + 1. Here C_3 = 3.739878 < 4 and C_4 = 2.019659 < 5,
so the choice is between these two models. Also notice that R²_adj = 0.7125690 (the four-predictor model)
is the largest R²_adj.
According to the C_k criterion, the competition is between the three-predictor model (Murder, HS.Grad,
Frost) and the four-predictor model that also includes Population. The choice is between the smaller model
and the larger model, which fits a little better.

If the subset (candidate) model is adequate, then we expect

E[ SSE_k / (n - k - 1) ] ≈ σ²,   i.e.   E[SSE_k] ≈ (n - k - 1) σ²

We also know that

E[ SSE / (n - p - 1) ] = σ²

Therefore

E[C_k] = E[ SSE_k / MSE - (n - 2(k + 1)) ] ≈ k + 1

According to the adjusted R² criterion, the four-predictor model (Population, Murder, HS.Grad, Frost) has
the largest R²_adj.

Problem. Is the four-predictor model (Population, Frost, HS.Grad and Murder) the optimal model?

Model selection methods are sensitive to outliers/influential points:

- Based on diagnostic statistics from fitting the full model, "Alaska" can be an influential point.
- When "Alaska" is excluded from the analysis, Area now makes it into the model based on the R²_adj
  criterion.
R Code:
> tmp<-regsubsets(Life.Exp~., data=statedata[-2,])
> summary(tmp)
Selection Algorithm: exhaustive
Population Income Illiteracy Murder HS.Grad Frost Area
1 ( 1 ) " " " " " " "*" " " " " " "
2 ( 1 ) " " " " " " "*" "*" " " " "
3 ( 1 ) " " " " " " "*" "*" "*" " "
4 ( 1 ) "*" " " " " "*" "*" "*" " "
5 ( 1 ) "*" " " " " "*" "*" "*" "*"
6 ( 1 ) "*" "*" " " "*" "*" "*" "*"
7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
> summary(tmp)$adjr2
[1] 0.5923260 0.6603281 0.6948855 0.7086703 0.7104405 0.7073027
[7] 0.7008899

The * means that the variable is included in that model. Without "Alaska", the five-predictor model
(largest R²_adj = 0.7104405) looks best.
Remark. Some final remarks:

- Automatic variable selection methods should be used with caution.
- Criterion-based best-subsets methods typically involve a wider search and compare models in a preferable
  manner. We recommend this approach in general.
- There may be several suggested models which fit equally well. If they lead to quite different conclusions,
  then it is clear that the data cannot answer the question of interest unambiguously.
Chapter 9

Multicollinearity in Regression Models

9.1 Multicollinearity

Example. Pizza sales data:

y: sales ($1000s)
x_1: number of advertisements
x_2: cost of advertisements ($100s)

Suppose we fit the model

y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + ε_i

and get the following results:

            β̂_j    SE(β̂_j)   t_0     p-value
Intercept   24.82   5.46      4.39    0.0007
x_1         0.66    0.54      1.23    0.2404
x_2         1.23    0.70      1.77    0.1000

R² = 0.7789, F-statistic: 22.899 on 2 and 13 df, p-value = 0.0001

The t-tests say that β_1 and β_2 are not significant, but the F-test says at least one is significant.

What do we find?

- R² = 0.7789: x_1 and x_2 together explain a large part (78%) of the variability in sales.
- The F-statistic and p-value indicate that at least one of them is important.
- We cannot reject H_0: β_1 = 0 when x_2 is in the model. Similarly, we cannot reject H_0: β_2 = 0 when
  x_1 is in the model.
- In other words, if one of x_1 or x_2 is in the model, then the extra contribution of the other variable
  toward the regression is not important. The individual t-test indicates that you do not need one
  variable if you have already included the other.

This is because the variables x_1 and x_2 are highly correlated. The two variables appear to express the
same information, so there is no point in including both.

Definition. Collinearity: a linear relationship between two variables x_i and x_j, i ≠ j.

Definition. Multicollinearity: a linear relationship involving more than two x variables, e.g. x_1 ≈ x_2 + x_3.
9.2 Consequences of Multicollinearity

To understand what happens if there is an exact linear dependence, consider the design matrix

X = [ 1  x_1  ...  x_k  ...  x_p ]

where x_k = (x_{1,k}, . . . , x_{n,k})^T is the (k + 1)-th column of X.

If one of the x_k is a linear combination of the other columns, say

x_1 = c_1 1 + c_2 x_2 + ... + c_p x_p

then

rank(X) < p + 1   ⟹   rank(X^T X) < p + 1

hence |X^T X| = 0 and (X^T X)^{-1} does not exist; we are not able to solve

β̂ = (X^T X)^{-1} X^T y

Under multicollinearity, |X^T X| ≈ 0 (small). It is computationally unstable to evaluate
β̂ = (X^T X)^{-1} X^T y, sometimes resulting in:

- insignificance of important predictors;
- the opposite sign of β̂_j from the expected relationship;
- large standard errors and wide confidence intervals.
9.3 Detection of Multicollinearity Among x_1, . . . , x_p

First look at the pairwise sample correlations

r_{l,m} = Σ_{i=1}^{n} (x_{i,l} - x̄_l)(x_{i,m} - x̄_m) / sqrt( Σ_{i=1}^{n} (x_{i,l} - x̄_l)² Σ_{i=1}^{n} (x_{i,m} - x̄_m)² )

r_{l,m} measures the linear association between any two x variables, x_l and x_m, with -1 ≤ r_{l,m} ≤ 1:

- r_{l,m} = ±1: perfect linear relationship
- r_{l,m} = 0: not linearly related

The matrix of pairwise correlations is

[ 1        r_{1,2}   r_{1,3}   ...   r_{1,p}
  r_{2,1}  1         r_{2,3}   ...   r_{2,p}
  r_{3,1}  r_{3,2}   1         ...   r_{3,p}
  ...
  r_{p,1}  r_{p,2}   r_{p,3}   ...   1       ]

|r_{l,m}| ≈ 1 ⟹ x_l and x_m are strongly linearly related.

1 Nov 29, 2012
9.3.1 Formal Check of Multicollinearity: Variance Inflation Factors (VIF)

x_k is regressed (i.e. x_k is used as the response) on the remaining p - 1 x's:

x_{i,k} = α_0 + α_1 x_{i,1} + ... + α_{k-1} x_{i,k-1} + α_{k+1} x_{i,k+1} + ... + α_p x_{i,p} + ε_i

for k = 1, . . . , p. The resulting

R²_k = SSR_k / SST_k

is a measure of how strongly x_k is linearly related to the rest of the x's:

- R²_k = 1 ⟹ perfectly linearly related
- R²_k = 0 ⟹ not linearly related

VIF_k = 1 / (1 - R²_k)   (≥ 1),   k = 1, . . . , p

The general consensus is that:

- VIF_k > 10: strong evidence of multicollinearity
- VIF_k ∈ [5, 10]: some evidence of multicollinearity
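A minimal sketch of the VIF computation; the data frame dat with columns x1, x2, x3 is hypothetical, and
the vif() call from the car package is noted only as a shortcut.

R Code (sketch):
r2_1 <- summary(lm(x1 ~ x2 + x3, data = dat))$r.squared   # regress x1 on the remaining x's
1 / (1 - r2_1)                                            # VIF for x1
# library(car); vif(fit)                                  # all VIFs for a fitted model at once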
9.4 Ridge Regression

Ridge regression is used when the design matrix X is multicollinear and the usual least squares estimate of
β appears to be unstable.

The LSE β̂ minimizes (y - Xβ)^T (y - Xβ).

When |X^T X| ≈ 0, ridge regression makes the assumption that the regression coefficients are not likely to
be very large. Suppose we place some upper bound on β:

Σ_{j=1}^{p} β_j² = β^T β < c

9.4.1 Minimize Subject to Constraints (Lagrange Multiplier Method)

Minimize

(y - Xβ)^T (y - Xβ) + c* Σ_{j=1}^{p} β_j²

The second term is a penalty that depends on Σ_{j=1}^{p} β_j².

Note. Since the multiplier c* is just a constant, we rename it λ.

Ridge regression: minimize

(y - Xβ)^T (y - Xβ) + λ Σ_{j=1}^{p} β_j²

In statistics this is called shrinkage: you are shrinking Σ_{j=1}^{p} β_j² towards 0. λ is a shrinkage
parameter that you have to choose.
The ridge regression solution: setting the derivative with respect to β to zero,

∂/∂β [ (y - Xβ)^T (y - Xβ) + λ β^T β ] = 0
⟹   2 X^T X β - 2 X^T y + 2λβ = 0
⟹   X^T X β - X^T y + λβ = 0
⟹   (X^T X + λI) β = X^T y
⟹   β̂_λ = (X^T X + λI)^{-1} X^T y

Note. β̂_λ is biased for β (the LSE β̂ is unbiased). Choose λ such that

- the bias is small;
- |X^T X + λI| ≠ 0;
- the variance is not large.
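A minimal sketch of a ridge fit in R; the data frame pizza and its column names are hypothetical, and the
grid of λ values is an assumption.

R Code (sketch):
library(MASS)
fit_ridge <- lm.ridge(sales ~ ads + cost, data = pizza, lambda = seq(0, 10, by = 0.1))
select(fit_ridge)    # suggested choices of lambda (HKB, L-W and GCV-based)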