β = 0 · Splitting · Corrected SS · Sequential testing
The model is y = Xβ + ε, where the errors ε have mean 0 and variance σ²I.
The first thing we want to test is for model relevance: does our
model contribute anything at all?
If none of the x variables have any relevance for predicting y, then
all the parameters will be 0. We test for this using the null
hypothesis
H0: β = 0.
Alternatively, if at least some of the x variables are relevant to
predicting y, then the corresponding parameters will be nonzero.
So our alternative hypothesis is
H1: β ≠ 0.
To test these hypotheses, we assume throughout the section that
the errors are normally distributed.
Linear Statistical Models: Inference for the full rank model
ANOVA
The method used to test the hypotheses is ANOVA.
If β = 0, then y = ε consists entirely of errors. In this case, y^T y,
the sum of squares of the errors, measures the variability of the
errors.
However, if β ≠ 0, then y = Xβ + ε. In this case, y^T y is not
made up solely of the errors but also of the model predictions.
Some of yT y will come from the errors and some from the model
predictions.
By separating yT y into the two parts, measuring variation due to
the model and variation due to the errors, we can compare them to
see how well the model is doing.
SSRes = (y − Xb)^T (y − Xb)
      = (y − Hy)^T (y − Hy)
      = y^T y − 2 y^T H y + y^T H² y
      = y^T y − y^T H y
      = y^T y − y^T X (X^T X)^{-1} X^T y,

which means that

y^T y = y^T X (X^T X)^{-1} X^T y + SSRes.

We call y^T X (X^T X)^{-1} X^T y = ŷ^T y the regression sum of squares
and denote it by SSReg. It reflects the variation in the response
variable that is accounted for by the model. If we call the total
variation in the response variable SSTotal = y^T y, then we have
divided it into:

SSTotal = SSReg + SSRes.
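This decomposition is easy to verify numerically. A minimal sketch in R, using an arbitrary simulated design (not data from these slides):

```r
# Check SSTotal = SSReg + SSRes for a small simulated example.
set.seed(1)
n <- 10
X <- cbind(1, 1:n)                      # intercept plus one predictor
y <- 2 + 0.5 * (1:n) + rnorm(n)
H <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
SSTotal <- sum(y^2)                     # y'y (uncorrected)
SSReg   <- as.numeric(t(y) %*% H %*% y)
SSRes   <- sum((y - H %*% y)^2)
all.equal(SSTotal, SSReg + SSRes)       # TRUE up to rounding
```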
If the errors were all 0, then y = Xβ exactly, and

SSReg = y^T X (X^T X)^{-1} X^T y
      = β^T X^T X (X^T X)^{-1} X^T X β
      = β^T X^T X β
      = y^T y = SSTotal,

and SSRes = 0.
Conversely, if the model predictions Xb were all 0, then

SSRes = (y − Xb)^T (y − Xb) = y^T y = SSTotal,

and SSReg = 0.
[Scatterplot of the six data points: y (ranging from about 2 to 5) against x (2 to 7).]
y = (1.9, 2.7, 4.2, 4.8, 4.8, 5.1)^T,   X =
[1 2
 1 3
 1 4
 1 5
 1 6
 1 7]
Since

SSTotal = y^T y = 1.9² + 2.7² + 4.2² + 4.8² + 4.8² + 5.1² = 100.63,

and SSRes = 1.1, we get

SSReg = SSTotal − SSRes = 99.53.

Since 99.53 > 1.1, informally we would say that there is some
linear signal in the data.
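The slide's numbers can be reproduced directly in R (the data are the six points above):

```r
y <- c(1.9, 2.7, 4.2, 4.8, 4.8, 5.1)
X <- cbind(1, 2:7)
SSTotal <- sum(y^2)
SSReg <- as.numeric(t(y) %*% X %*% solve(t(X) %*% X) %*% t(X) %*% y)
SSRes <- SSTotal - SSReg
round(c(SSTotal = SSTotal, SSReg = SSReg, SSRes = SSRes), 2)
# 100.63   99.53    1.10
```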
Theorem
In the full rank linear model, SSRes/σ² has a χ² distribution with
n − p degrees of freedom, SSReg/σ² has a noncentral χ²
distribution with p degrees of freedom and noncentrality parameter

λ = (1/(2σ²)) β^T X^T X β,

and the two are independent.
The test for β = 0 comes about when we observe that if the null
hypothesis is true, the noncentrality parameter for SSReg/σ² must
be 0.
Thus, under H0,

[SSReg/(σ² p)] / [SSRes/(σ² (n − p))] = (SSReg/p) / (SSRes/(n − p)) = MSReg/MSRes

has an F distribution with p and n − p degrees of freedom.
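Under H0 we can evaluate this ratio and its p-value; a sketch using the sums of squares quoted above for the six-point example (n = 6, p = 2):

```r
n <- 6; p <- 2
SSReg <- 99.53; SSRes <- 1.10
Fstat <- (SSReg / p) / (SSRes / (n - p))   # about 181
pf(Fstat, p, n - p, lower.tail = FALSE)    # a very small p-value
```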
So if β = 0, then E[SSReg/p] = σ², which is also what
E[SSRes/(n − p)] estimates; under H0 the two mean squares estimate
the same quantity and their ratio should be near 1, while a nonzero β
inflates SSReg.
Source of variation   Sum of squares                          degrees of freedom   Mean square       F ratio
Regression            y^T X (X^T X)^{-1} X^T y                p                    SSReg/p           MSReg/MSRes
Residual              y^T y − y^T X (X^T X)^{-1} X^T y        n − p                SSRes/(n − p)
Total                 y^T y                                   n
Files (x1)   Flows (x2)   Processes (x3)
 4            44           18
 2            33           15
20            80           80
 6            24           21
 6           227           50
 3            20           18
 4            41           13
16           187          137
 4            19           15
 6            50           21
 5            48           17
SSReg = y^T X (X^T X)^{-1} X^T y = 38978
y^T y = 39667
SSRes = y^T y − SSReg = 689
MSReg = SSReg/4 = 9745
MSRes = SSRes/(11 − 4) = 98
Variation    SS      d.f.   MS     F
Regression   38978   4      9745   99
Residual     689     7      98
Total        39667   11
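The table entries follow directly from the quoted sums of squares; as a check:

```r
SSReg <- 38978; SSTotal <- 39667
n <- 11; p <- 4
SSRes <- SSTotal - SSReg                 # 689
MSReg <- SSReg / p
MSRes <- SSRes / (n - p)
Fstat <- MSReg / MSRes                   # about 99
pf(Fstat, p, n - p, lower.tail = FALSE)  # far below 0.05
```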
We test H0: β = 0.
> (SS <- sum(y^2))
[1] 381.1864
> (SSRes <- sum((y - X %*% b)^2))
[1] 4.498653
> (SSReg <- SS - SSRes)
[1] 376.6877
> (SSReg <- t(y) %*% X %*% solve(t(X) %*% X) %*% t(X) %*% y)
[,1]
[1,] 376.6877
> (Fstat <- as.vector((SSReg/p)/(SSRes/(n-p))))
[1] 3768.005
> pf(Fstat, p, n-p, lower.tail=FALSE)
[1] 6.656806e-130
[anova() output comparing area ~ 0 with area ~ midrib + estim (3 degrees of freedom); the F test is highly significant, matching the p-value computed above.]
C =
[0 1 1 0
 0 0 1 1],   d = (0, 0)^T.
Test statistic
(Cb − d)^T [C (X^T X)^{-1} C^T]^{-1} (Cb − d) / σ²,

which under H0: Cβ = d has a χ² distribution with r degrees of freedom
(r the number of rows of C); in general the noncentrality parameter is

λ = (Cβ − d)^T [C (X^T X)^{-1} C^T]^{-1} (Cβ − d) / (2σ²).
b = (X^T X)^{-1} X^T y = (1.96, −0.12, 0.18, −0.8)^T,

so, writing β₀ for the hypothesised parameter vector,

b − β₀ = (0.04, 0.12, 0.18, 0.2)^T.
Under H0 the statistic has an F4,7 distribution (p = 4, n − p = 7), and here

F = (1110.18/4) / (668.63/7) ≈ 2.9.
b = (X^T X)^{-1} X^T y = (1.96, −0.12, 0.18, −0.8)^T.
Therefore

Cb = (0.06, −0.62)^T,

C (X^T X)^{-1} C^T =
[0.013    0.0024
 0.0024   0.00077],

and

(Cb)^T [C (X^T X)^{-1} C^T]^{-1} (Cb) = 1138.35.
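The whole computation can be wrapped in a small R function. A sketch of the general-linear-hypothesis F test (the data below are simulated, and `glh_test` is a name invented here, not one from the slides):

```r
glh_test <- function(X, y, C, d) {
  n <- nrow(X); p <- ncol(X); r <- nrow(C)
  XtXi <- solve(t(X) %*% X)
  b <- XtXi %*% t(X) %*% y
  s2 <- sum((y - X %*% b)^2) / (n - p)        # estimate of sigma^2
  diff <- C %*% b - d
  Fstat <- as.numeric(t(diff) %*% solve(C %*% XtXi %*% t(C)) %*% diff) /
    (r * s2)
  c(F = Fstat, p = pf(Fstat, r, n - p, lower.tail = FALSE))
}
set.seed(42)
X <- cbind(1, matrix(rnorm(60), 20, 3))
y <- X %*% c(1, 2, 0, 0) + rnorm(20)
# Test H0: beta2 = beta3 = 0 (true here by construction).
res <- glh_test(X, y, C = rbind(c(0, 0, 1, 0), c(0, 0, 0, 1)), d = c(0, 0))
res
```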
33/123
=0
Splitting
Corrected SS
Sequential testing
34/123
=0
Splitting
Corrected SS
Sequential testing
35/123
=0
>
>
>
>
Splitting
Corrected SS
Sequential testing
0
midrib + estim
Df Sum of Sq
3
Pr(>F)
36/123
=0
Splitting
Corrected SS
Sequential testing
H0: β0 = 1, β1 = β2

> ( C <- matrix(c(1,0,0,1,0,-1),2,3) )
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1   -1
Testing if part of β is 0
β = (β0, …, β_{r−1}, β_r, …, β_k)^T is split into

γ1 = (β0, …, β_{r−1})^T   and   γ2 = (β_r, …, β_k)^T,

so that γ1 holds the r parameters being tested and γ2 the remaining p − r.
[R(γ1 | γ2)/r] / [SSRes/(n − p)].
Theorem
R(γ1 | γ2) = R(β) − R(γ2),
where R(β) is the regression sum of squares for the full model

y = Xβ + ε = [X1 | X2] (γ1, γ2)^T + ε

and R(γ2) is the regression sum of squares for the reduced model

y = X2 γ2 + ε.
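A numerical illustration of the theorem on simulated data; note that when X2 is a column of ones, R(γ2) reduces to (Σ yi)²/n:

```r
set.seed(7)
n <- 30
X1 <- matrix(rnorm(2 * n), n, 2)        # the columns being tested
X2 <- matrix(1, n, 1)                   # reduced model: intercept only
X  <- cbind(X1, X2)
y  <- X %*% c(1, -1, 3) + rnorm(n)
reg_ss <- function(X, y)
  as.numeric(t(y) %*% X %*% solve(t(X) %*% X) %*% t(X) %*% y)
R_full    <- reg_ss(X, y)
R_reduced <- reg_ss(X2, y)
R_diff    <- R_full - R_reduced         # R(gamma1 | gamma2)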
Lemma
Suppose that

A = [A11  A12      A^{-1} = B = [B11  B12
     A21  A22],                  B21  B22],

and B22^{-1} exists. Then

A11^{-1} = B11 − B12 B22^{-1} B21.
We have

E[R(γ1 | γ2)] = E[y^T (H − H2) y]
  = σ² tr(H − H2) + (Xβ)^T (H − H2)(Xβ)
  = σ² (p − (p − r)) + β^T [X^T X − X^T X2 (X2^T X2)^{-1} X2^T X] β.

Writing X = [X1 | X2] and β = (γ1, γ2)^T, the bracketed matrix is

[X1^T X1   X1^T X2     [X1^T X2 (X2^T X2)^{-1} X2^T X1   X1^T X2
 X2^T X1   X2^T X2]  −  X2^T X1                           X2^T X2],

so every block involving X2 cancels, and

E[R(γ1 | γ2)] = r σ² + γ1^T [X1^T X1 − X1^T X2 (X2^T X2)^{-1} X2^T X1] γ1
             = r σ² + γ1^T A11^{-1} γ1.

The last step is the lemma applied with

B = X^T X = [X1^T X1   X1^T X2
             X2^T X1   X2^T X2],

so that A = (X^T X)^{-1}, A11 is its leading r × r block, and
A11^{-1} = B11 − B12 B22^{-1} B21 = X1^T X1 − X1^T X2 (X2^T X2)^{-1} X2^T X1
(note tr(A11^{-1} A11) = r).
Source of variation    Sum of squares   degrees of freedom   Mean square      F ratio
Full model             R(β)             p
Reduced model          R(γ2)            p − r
γ1 in presence of γ2   R(γ1 | γ2)       r                    R(γ1 | γ2)/r     [R(γ1 | γ2)/r]/MSRes
Residual               y^T y − R(β)     n − p                SSRes/(n − p)
Total                  y^T y            n
Here

γ1 = (β1, β2, β3)^T,   γ2 = (β0),

so r = 3 and p − r = 1.
X =
[4  44  18  1
 2  33  15  1
 ⋮   ⋮   ⋮  ⋮
 5  48  17  1]  = [X1 | X2].
R(γ2) = 21800.
From before,
R(β) = SSReg = 38978 and MSRes = 98,
so
R(γ1 | γ2) = R(β) − R(γ2) = 38978 − 21800 = 17178.
In other words, the intercept alone does not explain the variation in
the response variable adequately, and we are (reasonably) certain
that we need at least one of the terms in the model.
Variation              SS      d.f.   MS     F
Full model             38978   4
Reduced model          21800   1
γ1 in presence of γ2   17178   3      5726   58.2
Residual               689     7      98
Total                  39667   11
Source of variation    Sum of squares      degrees of freedom   Mean square         F ratio
Full model             R(β) = y^T H y      k + 1
Reduced model          (Σ_{i=1}^n yi)²/n   1
γ1 in presence of γ2   R(γ1 | γ2)          k                    R(γ1 | γ2)/k        [R(γ1 | γ2)/k]/MSRes
Residual               y^T y − R(β)        n − k − 1            SSRes/(n − k − 1)
Total                  y^T y               n
Σ_{i=1}^n (yi − ȳ)² = Σ_{i=1}^n yi² − (Σ_{i=1}^n yi)²/n = y^T y − R(γ2).
Source of variation   Sum of squares             degrees of freedom   Mean square         F ratio
Regression            SSReg − (Σ_{i=1}^n yi)²/n  k                    R(γ1 | γ2)/k        [R(γ1 | γ2)/k]/MSRes
Residual              SSRes                      n − k − 1            SSRes/(n − k − 1)
Total                 y^T y − (Σ_{i=1}^n yi)²/n  n − 1
Variation    SS      d.f.   MS     F
Regression   17178   3      5726   58.2
Residual     689     7      98
Total        17867   10

The actual test does not change: the F statistic and degrees of
freedom are the same.
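For the six-point example from earlier, the corrected quantities work out as follows (a sketch; the numbers match the earlier slides up to rounding):

```r
y <- c(1.9, 2.7, 4.2, 4.8, 4.8, 5.1)
X <- cbind(1, 2:7)
correction <- sum(y)^2 / length(y)      # about 92.04
SSReg <- as.numeric(t(y) %*% X %*% solve(t(X) %*% X) %*% t(X) %*% y)
SSRes <- sum(y^2) - SSReg
c(RegCorrected   = SSReg - correction,     # about 7.49
  Res            = SSRes,                  # about 1.10
  TotalCorrected = sum(y^2) - correction)  # about 8.59
```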
Clover example
H0: β0 = 0
> X2 <- X[,-1]
> b2 <- solve(t(X2) %*% X2, t(X2) %*% y)
> (SSRes2 <- sum((y - X2 %*% b2)^2))
[1] 6.296183
> (Rg2 <- SS - SSRes2)
[1] 374.8902
> (Rg2 <- t(y) %*% X2 %*% solve(t(X2) %*% X2) %*% t(X2) %*% y)
[,1]
[1,] 374.8902
> (Rg1g2 <- as.vector(SSReg - Rg2))
[1] 1.79753
H0: β0 = 0
> r <- 1
> (Fstat <- (Rg1g2/r)/(SSRes/(n-p)))
[1] 53.94204
> pf(Fstat, r, n-p, lower.tail=FALSE)
[1] 1.761625e-11
H0: β0 = 0

[anova() output comparing area ~ 0 + midrib + estim with area ~ midrib + estim (1 degree of freedom); the F test is highly significant, in agreement with the calculation above.]
H0: β1 = 0

[anova() output comparing area ~ estim with area ~ midrib + estim (1 degree of freedom); the F test for midrib is highly significant.]
H0: β2 = 0

[anova() output comparing area ~ midrib with area ~ midrib + estim (1 degree of freedom); the F test for estim is highly significant.]
H0: β1 = β2 = 0
[anova() output comparing area ~ 1 with area ~ midrib + estim (2 degrees of freedom); the F test is highly significant.]
> summary(model)

Call:
lm(formula = area ~ midrib + estim, data = clover)

Residuals:
     Min       1Q   Median       3Q      Max
-0.57603 -0.09824  0.01173  0.11355  0.51957

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.58458    0.21575  -7.345 1.76e-11 ***
midrib       0.76731    0.11295   6.793 3.19e-10 ***
estim        0.62183    0.06435   9.663  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Sequential testing

Suppose we have a number of explanatory variables and would like a
parsimonious model: one which explains the variation in the response
y using a minimal number of explanatory variables. A parsimonious
model is less likely to suffer from overfitting.
For such a model, if we were to test whether parameter βi is 0 in the
presence of the other model parameters, we should always reject the
null.
How do we find such a minimal set of parameters?
That is, we can start with a simple model and sequentially add
parameters until adding another parameter no longer significantly
improves the fit; at that point we have a parsimonious model.
y = β0 + ε(0)
y = β0 + β1 x1 + ε(1)
⋮
y = β0 + β1 x1 + … + βk xk + ε(k).
Note that these are full regression sums of squares, i.e. we are
looking at the total variation explained by the model in the
presence of no other parameters.
Now by taking the difference between the sums of squares, we can
get the extra variation explained as we add variables to the model
one at a time:
Theorem
Suppose y = Xβ + ε where X is full rank and ε ∼ N(0, σ²I).
Let Xj be the first j + 1 columns of X (the first column is all ones),
and put

Hj = Xj (Xj^T Xj)^{-1} Xj^T,
Rj = y^T Hj y = R(β0, …, βj).
Lemma
Suppose X = [X1 | X2] is full rank, of size n × p with n ≥ p. Then

X2 = X (X^T X)^{-1} X^T X2.

Lemma
For X as above, with X1 of size n × r and X2 of size n × (p − r), we have that

A2 := X (X^T X)^{-1} X^T − X2 (X2^T X2)^{-1} X2^T = H − H2

is symmetric and idempotent, of rank r.
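Both lemmas are easy to check numerically on a random full-rank design (simulated, not from the slides):

```r
set.seed(3)
n <- 15
X1 <- matrix(rnorm(n * 2), n, 2)        # r = 2 columns
X2 <- cbind(1, rnorm(n))                # p - r = 2 columns
X  <- cbind(X1, X2)
H  <- X %*% solve(t(X) %*% X) %*% t(X)
H2 <- X2 %*% solve(t(X2) %*% X2) %*% t(X2)
A2 <- H - H2
all.equal(H %*% X2, X2)                 # first lemma: H X2 = X2
all.equal(A2 %*% A2, A2)                # idempotent
all.equal(A2, t(A2))                    # symmetric
sum(diag(A2))                           # trace = rank = r = 2
```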
Proof of lemmas
Proof of theorem
Note that this is still not entirely satisfactory, because the result
will depend heavily on the order of the parameters considered.
Different orderings can result in different sets of parameters being
included in the final model.
x1 : Beak length
x2 : Wing length
x5 : Width
Variation    SS       d.f.   MS      F
Regression   595.16   6      99.19   200.47
Residual     7.92     16     0.49
Total        603.08   22
R(β0) = 387.16
R(β1 | β0) = 199.15
R(β2 | β0, β1) = 0.127
R(β3 | β0, β1, β2) = 4.12
R(β4 | β0, β1, β2, β3) = 0.263
R(β5 | β0, β1, β2, β3, β4) = 4.35

Note that these sum to the regression sum of squares for the full
model, 595.16.
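In R, anova() on an lm fit reports exactly these sequential sums of squares, and the term rows add up to the (corrected) regression sum of squares. A sketch on simulated data:

```r
set.seed(11)
n <- 40
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x3 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3, data = d)
tab <- anova(fit)                       # sequential sums of squares
seq_sum <- sum(tab[c("x1", "x2", "x3"), "Sum Sq"])
reg_ss  <- sum((fitted(fit) - mean(d$y))^2)
all.equal(seq_sum, reg_ss)              # TRUE up to rounding
```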
Forward selection
1. Start with the intercept-only model.
2. Calculate the F values for the tests H0: βi = 0 for each variable
not yet in the model, in the presence of the terms already included.
3. If none of the tests is significant, stop.
4. Otherwise, add the most significant variable (the one with the
largest F value).
5. Return to step 2.
Backward elimination
A method which is conceptually very similar to forward selection is
backward elimination:
1. Start with the full model.
2. Calculate the F values for the tests H0: βi = 0, for all
parameters in the model, in the presence of the other
parameters in the model.
3. If all of the tests are significant (we reject all null hypotheses),
then stop.
4. Otherwise, remove the least significant parameter (the
parameter with the smallest F value).
5. Return to step 2.
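These steps translate directly into a small R loop built on drop1(); the data here are simulated and alpha is an arbitrary choice:

```r
set.seed(5)
n <- 50
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                x4 = rnorm(n))
d$y <- 2 + 3 * d$x1 - 2 * d$x2 + rnorm(n)   # x3, x4 are irrelevant
model <- lm(y ~ x1 + x2 + x3 + x4, data = d)
alpha <- 0.05
repeat {
  tests <- drop1(model, test = "F")
  pvals <- tests[["Pr(>F)"]][-1]            # first row is <none>
  if (all(pvals < alpha)) break             # every term significant
  worst <- rownames(tests)[-1][which.max(pvals)]
  model <- update(model, as.formula(paste(". ~ . -", worst)))
}
formula(model)
```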
> pairs(heat)

[Scatterplot matrix produced by pairs(heat), showing x1, x2, x3 and x4 against each other.]
F value   Pr(>F)
12.6025   0.0045520 **
21.9606   0.0006648 ***
 4.4034   0.0597623 .
22.7985   0.0005762 ***
Backward elimination

> fullmodel <- lm(y ~ x1+x2+x3+x4, data=heat)
> drop1(fullmodel, scope= ~ ., test="F")
Single term deletions

Model:
y ~ x1 + x2 + x3 + x4
       Df Sum of Sq    RSS    AIC F value  Pr(>F)
<none>              47.864 26.944
x1      1   25.9509 73.815 30.576  4.3375 0.07082 .
x2      1    2.9725 50.836 25.728  0.4968 0.50090
x3      1    0.1091 47.973 24.974  0.0182 0.89592
x4      1    0.2470 48.111 25.011  0.0413 0.84407
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> model2 <- lm(y ~ x1+x2+x4, data=heat)
Stepwise selection

Stepwise selection functions similarly to forward or backward
selection, but with the possibility of either adding or eliminating a
variable at each step. We give a procedure using a goodness-of-fit
measure called the AIC, though it is easy to adjust it to use any
other goodness-of-fit statistic.
1. Start with any model.
2. Compute the AIC of all models which have either one extra
variable or one fewer variable than the current model.
3. If the AIC of all such models is higher than the AIC of the
current model, stop.
4. Otherwise, change to the model with the lowest AIC.
5. Return to step 2.
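R's step() function implements this AIC-driven search; a sketch on simulated data, starting from the intercept-only model:

```r
set.seed(9)
n <- 60
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 + rnorm(n)              # only x1 matters
null_model <- lm(y ~ 1, data = d)
chosen <- step(null_model, scope = ~ x1 + x2 + x3,
               direction = "both", trace = 0)
formula(chosen)                             # should involve x1
```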
Goodness-of-fit measures
The F test is used to compare nested models, that is, it requires
the variable set of one model to be fully contained in the variable
set of the other model. Thus we cannot use an F test to compare
models which, for example, have replaced one variable with
another variable.
Also, use of the F test requires the somewhat arbitrary choice of a
significance level.
To overcome these problems many authors have proposed
goodness-of-fit measures, which try to give a measure of how good
a model is, independently of other models (though still dependent
on the data in question).
R²

A commonly reported goodness-of-fit statistic is the proportion of
(corrected) total sum of squares that is explained by the model:

R² = 1 − SSRes / (SSTotal − (Σ_i yi)²/n).

R² lies between 0 and 1, and the larger it is, the more variation in
y is explained by the model. (We are assuming that β0 is always in
the model.)
However, R² can never decrease when we add a variable to a
model, as even an irrelevant variable will explain a small extra
amount of variation. We would like to remove irrelevant variables,
so, like the SSRes, R² is not appropriate for model selection.
Adjusted R²

The adjusted R² tries to account for model complexity by
introducing a penalty based on the number of parameters in the
model:

adj R² = 1 − [(n − 1)/(n − 1 − k)] (1 − R²).

Here we are assuming that β0 is in the model, and k is the number
of other parameters in the model.
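A quick simulated illustration: adding a pure-noise column never lowers R², while the adjusted R² is free to drop:

```r
set.seed(2)
n <- 30
d <- data.frame(x1 = rnorm(n))
d$y <- 1 + d$x1 + rnorm(n)
d$junk <- rnorm(n)                          # an irrelevant variable
small <- summary(lm(y ~ x1, data = d))
big   <- summary(lm(y ~ x1 + junk, data = d))
c(small$r.squared, big$r.squared)           # second never smaller
c(small$adj.r.squared, big$adj.r.squared)
```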
AIC

A very popular goodness-of-fit statistic is Akaike's information
criterion, or AIC. This is based on the likelihood of the observed
values of the response:

AIC = −2 ln(likelihood) + 2p
    = n ln(SSRes/n) + 2p + const.
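This is the form R itself uses during model selection: extractAIC() on an lm fit drops the constant and returns n ln(SSRes/n) + 2p, which we can reproduce by hand on simulated data:

```r
set.seed(4)
n <- 25
x <- rnorm(n)
y <- 1 + x + rnorm(n)
fit <- lm(y ~ x)
p <- length(coef(fit))                      # parameters in the mean
SSRes <- sum(resid(fit)^2)
by_hand <- n * log(SSRes / n) + 2 * p
all.equal(as.numeric(extractAIC(fit)[2]), by_hand)   # TRUE
```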
Start: AIC=71.44
y ~ 1

       Df Sum of Sq     RSS    AIC
+ x4    1   1831.90  883.87 58.852
+ x2    1   1809.43  906.34 59.178
+ x1    1   1450.08 1265.69 63.519
+ x3    1    776.36 1939.40 69.067
<none>              2715.76 71.444

Step: AIC=58.85
y ~ x4
       Df Sum of Sq     RSS    AIC
+ x1    1    809.10   74.76 28.742
+ x3    1    708.13  175.74 39.853
<none>               883.87 58.852
+ x2    1     14.99  868.88 60.629
- x4    1   1831.90 2715.76 71.444

Step: AIC=28.74
y ~ x4 + x1
       Df Sum of Sq     RSS    AIC
+ x2    1     26.79   47.97 24.974
+ x3    1     23.93   50.84 25.728
<none>                74.76 28.742
- x1    1    809.10  883.87 58.852
- x4    1   1190.92 1265.69 63.519

Step: AIC=24.97
y ~ x4 + x1 + x2
       Df Sum of Sq    RSS    AIC
<none>               47.97 24.974
- x4    1      9.93  57.90 25.420
+ x3    1      0.11  47.86 26.944
- x2    1     26.79  74.76 28.742
- x1    1    820.91 868.88 60.629

Call:
lm(formula = y ~ x4 + x1 + x2, data = heat)

Coefficients:
(Intercept)           x4           x1           x2
    71.6483      -0.2365       1.4519       0.4161
       Df Sum of Sq    RSS    AIC
<none>               47.97 24.974
- x4    1      9.93  57.90 25.420
+ x3    1      0.11  47.86 26.944
- x2    1     26.79  74.76 28.742
- x1    1    820.91 868.88 60.629

Call:
lm(formula = y ~ x1 + x2 + x4, data = heat)

Coefficients:
(Intercept)           x1           x2           x4
    71.6483       1.4519       0.4161      -0.2365
t tests

We can also use a t test for a partial test of one parameter, that
is, to test H0: βi = 0 against H1: βi ≠ 0 in the presence of all the
other parameters. The corresponding confidence interval is

bi ± t_{α/2} s √(cii),

where cii is the (i, i)th entry of (X^T X)^{-1}, and we use a t
distribution with n − p degrees of freedom. If this confidence
interval includes 0, we do not reject H0; otherwise, we reject it.
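The by-hand t statistic bi/(s√cii) agrees with the one printed by summary(); a sketch on simulated data:

```r
set.seed(6)
n <- 20
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
X <- cbind(1, x)
XtXi <- solve(t(X) %*% X)
b <- XtXi %*% t(X) %*% y
s <- sqrt(sum((y - X %*% b)^2) / (n - 2))   # residual standard error
t_by_hand <- as.numeric(b[2] / (s * sqrt(XtXi[2, 2])))
t_summary <- summary(lm(y ~ x))$coefficients["x", "t value"]
all.equal(t_by_hand, t_summary)             # TRUE
```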
Let us compare this with our existing partial F test. The statistic
we use for this is

R(βi | β0, …, β_{i−1}, β_{i+1}, …, βk) / [SSRes/(n − p)].

The denominator is of course s².
One can show that the numerator equals

bi² / cii.
This means that the t test and the F test are (nearly) identical;
the t test is actually slightly more useful, because it also gives an
indication of the sign of the parameter.
Here b1 = 0.86 and c11 = 0.00057, giving

t = b1 / (s √c11) = 14.89,

which, using a t distribution with n − p = 6 − 2 = 4 degrees of
freedom, would reject the hypothesis β1 = 0 at the 0.05 level
(critical value 2.78). We can also say that β1 is almost certainly
positive.
Variation              SS       d.f.   MS       F
Full model             663.77   2
Reduced model          498.68   1
β1 in presence of β0   165.09   1      165.09   221.7
Residual               2.98     4      0.74
Total                  666.75   6
Shrinkage
Ridge regression shrinks the coefficients by minimising the penalised
sum of squares

Σ_{i=1}^n ei² + λ Σ_{j=0}^k bj².
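For the squared penalty the minimiser has a closed form, b(λ) = (X^T X + λI)^{-1} X^T y, and increasing λ shrinks the coefficients toward 0. A sketch on simulated data (in practice the predictors are usually standardised first):

```r
set.seed(8)
n <- 40
X <- cbind(1, matrix(rnorm(2 * n), n, 2))
y <- X %*% c(1, 2, -1) + rnorm(n)
ridge <- function(lambda)
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
cbind(ols = ridge(0), shrunk = ridge(100))  # shrunk toward zero
```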
The lasso instead penalises the absolute values of the coefficients,
minimising

Σ_{i=1}^n ei² + λ Σ_{j=0}^k |bj|.