
2 Classical Linear Regression Models

In this section we study the classical linear regression models.


2.1 Assumptions for the Ordinary Least Squares Regression
Assume that the population model is
Assume that the population model is
\[
y = \beta_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_k x_k + \varepsilon = \beta'x + \varepsilon, \tag{2.1}
\]
where $x = (1, x_2, \ldots, x_k)'$ and $\beta = (\beta_1, \beta_2, \ldots, \beta_k)'$. A random sample of $n$ observations is drawn from the population:
\[
y_i = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + \cdots + \beta_k x_{ik} + \varepsilon_i = \beta' X_i + \varepsilon_i \quad \text{for } i = 1, 2, \ldots, n, \tag{2.2}
\]
where $X_i = (1, x_{i2}, \ldots, x_{ik})'$ is a $k \times 1$ vector. Frequently, we write the above model in matrix form
\[
\mathbf{Y} = \mathbf{X}\beta + \varepsilon, \tag{2.3}
\]
where
\[
\mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}, \quad
\mathbf{X} = \begin{pmatrix} X_1' \\ \vdots \\ X_n' \end{pmatrix}
= \begin{pmatrix} 1 & x_{12} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n2} & \cdots & x_{nk} \end{pmatrix}.
\]
We make the following classical assumptions on the model.
Assumption A.1 (Linearity) $\{y_i, X_i\}_{i=1}^n$ satisfies the linear relationship
\[
y_i = \beta' X_i + \varepsilon_i,
\]
where $\beta$ is a $k \times 1$ unknown parameter vector,$^3$ $X_i$ is a $k \times 1$ vector of independent variables (regressors, explanatory variables), $\varepsilon_i$ is an unobservable disturbance/error term, and $y_i$ is the dependent variable (regressand).
The key notion of linearity is that the regression model is linear in $\beta$ rather than in $X_i$. If the above LRM is correctly specified for $E(y_i|X_i)$, then
\[
\beta = \frac{\partial E(y_i|X_i)}{\partial X_i}.
\]
In matrix notation, we can write the above LRM as
\[
\mathbf{Y} = \mathbf{X}\beta + \varepsilon,
\]
where $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)'$.
Assumption A.2 (Strict exogeneity)
\[
E(\varepsilon_i|\mathbf{X}) = E(\varepsilon_i|X_1, \ldots, X_n) = 0 \quad \text{for } i = 1, 2, \ldots, n.
\]
We make a few remarks on the above assumption.
$^3$We use $\beta$ instead of $\beta^0$ to denote the true value because elements of $\beta$ will be denoted as $\beta_j$ for $j = 1, 2, \ldots, k$.
1. $E(\varepsilon_i) = 0$ by the law of iterated expectations.

2. $E(\varepsilon_i X_j) = 0$ for all $i, j = 1, \ldots, n$ by the law of iterated expectations.

3. In the case of time series data, Assumption A.2 requires that $\varepsilon_t$ does not depend on the past, current, or future values of the regressors. This rules out dynamic time series models. Example:
\[
y_t = \beta_1 + \beta_2 y_{t-1} + \varepsilon_t, \quad t = 1, \ldots, n,
\]
where $\varepsilon_t \sim \mathrm{iid}(0, \sigma^2)$. Let $X_t = (1, y_{t-1})'$. It is easy to verify that $E(X_t\varepsilon_t) = 0$ but $E(X_{t+1}\varepsilon_t) \neq 0$ because
\[
E(X_{t+1}\varepsilon_t) = E\begin{pmatrix} \varepsilon_t \\ y_t\varepsilon_t \end{pmatrix}
= E\begin{pmatrix} \varepsilon_t \\ (\beta_1 + \beta_2 y_{t-1} + \varepsilon_t)\varepsilon_t \end{pmatrix}
= \begin{pmatrix} E(\varepsilon_t) \\ \beta_1 E(\varepsilon_t) + \beta_2 E(y_{t-1}\varepsilon_t) + E(\varepsilon_t^2) \end{pmatrix}
= \begin{pmatrix} 0 \\ \sigma^2 \end{pmatrix}.
\]
This implies that $E(\varepsilon_t|\mathbf{X}) \neq 0$.

4. Assumption A.2 says nothing about higher order conditional moments. It may allow for conditional heteroskedasticity: $E(\varepsilon_i^2|X_i) = \sigma^2(X_i)$, which is not a constant.

5. If $X_i$, $i = 1, \ldots, n$, are non-stochastic, then Assumption A.2 becomes $E(\varepsilon_i) = 0$.

6. If $\{y_i, X_i\}_{i=1}^n$ is an independent sample, then $E(\varepsilon_i|\mathbf{X}) = E(\varepsilon_i|X_i) = 0$.
Assumption A.3 (Nonsingularity) The rank of $\mathbf{X}'\mathbf{X}$ is $k$ with probability 1.

Assumption A.3* (Nonsingularity) The minimum eigenvalue of $\mathbf{X}'\mathbf{X} = \sum_{i=1}^n X_iX_i'$ satisfies $\lambda_{\min}(\mathbf{X}'\mathbf{X}) \to \infty$ as $n \to \infty$ with probability 1.

Recall that the eigenvalues of a $k \times k$ matrix $A$ are defined as the solutions to
\[
|A - \lambda I_k| = 0,
\]
where $I_k$ is a $k \times k$ identity matrix. Let $\lambda_1, \ldots, \lambda_k$ denote the $k$ eigenvalues with possible multiplicity (say, two eigenvalues can be equal). Assumption A.3 rules out perfect collinearity among regressors in finite samples, whereas Assumption A.3* rules out asymptotic multicollinearity in large samples. Noting that $\mathbf{X}'\mathbf{X}$ is positive semi-definite (p.s.d.), Assumption A.3 also implies that $\lambda_{\min}(\mathbf{X}'\mathbf{X}) > 0$ in finite samples.
Assumption A.4 (Spherical error variance)
\[
E(\varepsilon_i^2|\mathbf{X}) = \sigma^2 > 0 \text{ for all } i = 1, \ldots, n \quad \text{(conditional homoskedasticity)},
\]
\[
E(\varepsilon_i\varepsilon_j|\mathbf{X}) = 0 \text{ for all } i \neq j \quad \text{(conditional spatial/serial uncorrelatedness)}.
\]

We make a few remarks on Assumption A.4.
1. Together with Assumption A.2, A.4 implies that
\[
\mathrm{Var}(\varepsilon_i) = \mathrm{Var}[E(\varepsilon_i|\mathbf{X})] + E[\mathrm{Var}(\varepsilon_i|\mathbf{X})] = 0 + \sigma^2 = \sigma^2
\]
and
\[
\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = \mathrm{Cov}[E(\varepsilon_i|\mathbf{X}), E(\varepsilon_j|\mathbf{X})] + E[\mathrm{Cov}(\varepsilon_i, \varepsilon_j|\mathbf{X})] = 0 + 0 = 0.
\]
The last result applies the covariance decomposition formula, which can be proved in exactly the same way as the variance decomposition formula.

2. In matrix notation, we can express Assumptions A.2 and A.4 as follows:
\[
E(\varepsilon|\mathbf{X}) = 0 \quad \text{and} \quad E(\varepsilon\varepsilon'|\mathbf{X}) = \sigma^2 I_n,
\]
where
\[
E(\varepsilon|\mathbf{X}) = \begin{pmatrix} E(\varepsilon_1|\mathbf{X}) \\ \vdots \\ E(\varepsilon_n|\mathbf{X}) \end{pmatrix}
\quad \text{and} \quad
E(\varepsilon\varepsilon'|\mathbf{X}) = \begin{pmatrix} E(\varepsilon_1^2|\mathbf{X}) & \cdots & E(\varepsilon_1\varepsilon_n|\mathbf{X}) \\ \vdots & \ddots & \vdots \\ E(\varepsilon_n\varepsilon_1|\mathbf{X}) & \cdots & E(\varepsilon_n^2|\mathbf{X}) \end{pmatrix}.
\]
2.2 Ordinary Least Squares Estimation
2.2.1 Estimation of β

Definition 2.1 (Ordinary least squares (OLS) estimator) Define the residual sum of squares (RSS) of the LRM $y_i = \beta'X_i + \varepsilon_i$ as
\[
RSS(\beta) = (\mathbf{Y} - \mathbf{X}\beta)'(\mathbf{Y} - \mathbf{X}\beta) = \sum_{i=1}^n (y_i - \beta'X_i)^2.
\]
Then the ordinary least squares (OLS) estimator $\hat\beta$ of $\beta$ is given by
\[
\hat\beta \equiv \hat\beta_{OLS} \equiv \arg\min_{\beta \in \mathbb{R}^k} RSS(\beta).
\]
The following theorem gives a closed form solution to the above minimization problem.
Theorem 2.2 (OLS estimator) Under Assumptions A.1 and A.3 the OLS estimator $\hat\beta$ of $\beta$ exists and is given by
\[
\hat\beta = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \left(\sum_{i=1}^n X_iX_i'\right)^{-1}\left(\sum_{i=1}^n X_iy_i\right).
\]
Proof. Noting that $RSS(\beta) = \sum_{i=1}^n (y_i - \beta'X_i)^2$, the FOC is given by
\[
\frac{\partial RSS(\beta)}{\partial\beta} = \sum_{i=1}^n \frac{\partial}{\partial\beta}(y_i - \beta'X_i)^2 = -2\sum_{i=1}^n X_i(y_i - \beta'X_i)
= -2\sum_{i=1}^n X_iy_i + 2\sum_{i=1}^n X_iX_i'\beta = -2\mathbf{X}'\mathbf{Y} + 2\mathbf{X}'\mathbf{X}\beta = 0 \text{ when } \beta = \hat\beta. \tag{2.4}
\]
It follows that $\mathbf{X}'\mathbf{Y} = \mathbf{X}'\mathbf{X}\hat\beta$ and $\hat\beta = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \left(\sum_{i=1}^n X_iX_i'\right)^{-1}\left(\sum_{i=1}^n X_iy_i\right)$ by Assumption A.3.

We now check the SOC. Noting that
\[
\frac{\partial^2 RSS(\beta)}{\partial\beta\,\partial\beta'} = \sum_{i=1}^n \frac{\partial\left[-2X_iy_i + 2X_iX_i'\beta\right]}{\partial\beta'} = \sum_{i=1}^n 2X_iX_i' = 2\mathbf{X}'\mathbf{X}
\]
is positive definite under Assumption A.3, the SOC is satisfied and $\hat\beta$ is the global minimizer.
We make a few remarks.
1. $\hat y_i \equiv \hat\beta'X_i$ is called the (in-sample) fitted value or predicted value of $y_i$. $\hat\varepsilon_i \equiv y_i - \hat y_i$ is called the (estimated) residual for $y_i$. Let $\hat{\mathbf{Y}} \equiv (\hat y_1, \ldots, \hat y_n)' = \mathbf{X}\hat\beta$ and $\hat\varepsilon \equiv (\hat\varepsilon_1, \ldots, \hat\varepsilon_n)'$. Then we can write $\mathbf{Y} = \hat{\mathbf{Y}} + \hat\varepsilon$, which is an orthogonal decomposition of $\mathbf{Y}$.

2. The FOC (2.4) implies that a very important equation, i.e., the normal equation, holds:
\[
\mathbf{X}'\mathbf{Y} - \mathbf{X}'\mathbf{X}\hat\beta = \mathbf{X}'(\mathbf{Y} - \mathbf{X}\hat\beta) = 0, \quad \text{i.e.,} \quad \mathbf{X}'\hat\varepsilon = \sum_{i=1}^n X_i\hat\varepsilon_i = 0.
\]
The normal equation always holds no matter whether $E(\varepsilon_i|\mathbf{X}) = 0$ or not. If an intercept is included in the LRM, say, $x_{i1} \equiv 1$ and $X_i = (1, x_{i2}, \ldots, x_{ik})'$, then the above equation implies that
\[
\sum_{i=1}^n \hat\varepsilon_i = 0.
\]

Exercise. Let $\bar{\hat y} = \frac{1}{n}\sum \hat y_i$ and $\bar y = \frac{1}{n}\sum y_i$. Demonstrate that $\bar{\hat y} = \bar y$ if an intercept is included in the LRM.
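The following numerical sketch illustrates Theorem 2.2 and the normal equation. It is not part of the original notes: the simulated data, seed, and parameter values are illustrative assumptions only.

```python
# Minimal OLS sketch on simulated data (all values are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3                                                     # sample size, number of regressors (incl. intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])    # rows X_i' = (1, x_i2, x_i3)
beta_true = np.array([1.0, 2.0, -0.5])                            # assumed true beta
y = X @ beta_true + rng.normal(size=n)                            # Y = X beta + eps

# OLS: beta_hat = (X'X)^{-1} X'Y, computed by solving the normal equations X'X b = X'Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_fit = X @ beta_hat          # fitted values  Y_hat = X beta_hat
resid = y - y_fit             # residuals      eps_hat = Y - Y_hat

print("beta_hat:", beta_hat)
print("X' eps_hat (normal equation, ~0):", X.T @ resid)
print("sum of residuals (intercept included, ~0):", resid.sum())
```

The printed vector $\mathbf{X}'\hat\varepsilon$ is numerically zero, which is the normal equation; since an intercept is included, the residuals also sum to zero, in line with the exercise above.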
2.2.2 Estimation of σ²

Recall $\sigma^2 = E(\varepsilon_i^2)$ under Assumption A.4. We can estimate it by the method of moments (MOM):
\[
\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n \hat\varepsilon_i^2 = \frac{1}{n}\hat\varepsilon'\hat\varepsilon.
\]
In finite samples, the above estimator is biased for $\sigma^2$. An unbiased estimator is given by
\[
s^2 = \frac{1}{n-k}\sum_{i=1}^n \hat\varepsilon_i^2.
\]
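Continuing the illustrative sketch above, the two estimators differ only in the degrees-of-freedom correction; `resid`, `n`, and `k` are the quantities defined in the previous code block.

```python
# Two estimators of sigma^2 from the residuals of the previous (illustrative) sketch.
sigma2_mom = (resid @ resid) / n        # biased MOM estimator: eps_hat'eps_hat / n
s2 = (resid @ resid) / (n - k)          # unbiased estimator:   eps_hat'eps_hat / (n - k)
print(sigma2_mom, s2)
```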
2.2.3 Alternative Interpretation of OLS Estimator
If we assume that $\varepsilon_i|X_i \sim N(0, \sigma^2)$ in the LRM $y_i = \beta'X_i + \varepsilon_i$ and the $\varepsilon_i$'s are independent given $\mathbf{X}$, we can consider the maximum likelihood estimator of $\beta$ and $\sigma^2$. Note that the likelihood of $\varepsilon_1, \ldots, \varepsilon_n$, conditional on $\mathbf{X}$, is given by
\[
f(\varepsilon_1, \ldots, \varepsilon_n|\mathbf{X}; \beta, \sigma^2)
= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i - \beta'X_i)^2}{2\sigma^2}\right)
= (2\pi\sigma^2)^{-n/2}\prod_{i=1}^n \exp\left(-\frac{(y_i - \beta'X_i)^2}{2\sigma^2}\right)
= (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \beta'X_i)^2\right).
\]
The log-likelihood function is
\[
\ell_n(\beta, \sigma^2) \equiv \log f(\varepsilon_1, \ldots, \varepsilon_n|\mathbf{X}; \beta, \sigma^2)
= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \beta'X_i)^2.
\]
To maximize the above log-likelihood function, we can obtain the FOCs
\[
\begin{cases}
\dfrac{\partial\ell_n(\beta, \sigma^2)}{\partial\beta} = 0 \\[2mm]
\dfrac{\partial\ell_n(\beta, \sigma^2)}{\partial\sigma^2} = 0
\end{cases}
\;\Rightarrow\;
\begin{cases}
\hat\beta_{ML} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \hat\beta_{OLS}, \\[2mm]
\hat\sigma^2_{ML} = \dfrac{1}{n}\hat\varepsilon'\hat\varepsilon = \hat\sigma^2.
\end{cases}
\]
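As a numerical check of this equivalence, the sketch below maximizes the Gaussian conditional log-likelihood directly and compares the result with the OLS quantities. It reuses `X`, `y`, `beta_hat`, `resid`, and `n` from the earlier illustrative blocks; the optimizer and tolerances are assumptions, not part of the notes.

```python
# Numerical check that the conditional Gaussian MLE reproduces the OLS estimates (a sketch).
import numpy as np
from scipy.optimize import minimize

def neg_loglik(theta):
    beta, log_s2 = theta[:-1], theta[-1]        # parameterize sigma^2 = exp(log_s2) > 0
    s2_val = np.exp(log_s2)
    u = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * s2_val) + 0.5 * (u @ u) / s2_val

theta0 = np.zeros(X.shape[1] + 1)
opt = minimize(neg_loglik, theta0, method="BFGS")
beta_ml, sigma2_ml = opt.x[:-1], np.exp(opt.x[-1])

print(np.allclose(beta_ml, beta_hat, atol=1e-3))                  # beta_ML ~= beta_OLS
print(np.isclose(sigma2_ml, (resid @ resid) / n, atol=1e-3))      # sigma^2_ML ~= eps_hat'eps_hat / n
```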
2.2.4 Projection Matrices
Recall
\[
\hat\beta = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y},
\]
\[
\hat{\mathbf{Y}} = \mathbf{X}\hat\beta = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} \equiv P\mathbf{Y},
\]
\[
\hat\varepsilon = \mathbf{Y} - \hat{\mathbf{Y}} = \mathbf{Y} - P\mathbf{Y} = (I_n - P)\mathbf{Y} \equiv M\mathbf{Y},
\]
where
\[
P = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \quad \text{and} \quad M = I_n - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = I_n - P.
\]
Below we study some of the important properties of $P$ and $M$.

1. $P$ and $M$ are symmetric and idempotent, so they are projection matrices. The symmetry is obvious. We now check the idempotence:
\[
P^2 = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = P,
\]
\[
M^2 = (I_n - P)(I_n - P) = I_n - P - P + P^2 = I_n - P = M.
\]

2. $P\mathbf{X} = \mathbf{X}$, $M\mathbf{X} = 0$, and $PM = 0$.

3.
\[
\mathrm{tr}(P) = \mathrm{tr}\left[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right] = \mathrm{tr}\left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\right] = \mathrm{tr}(I_k) = k,
\]
\[
\mathrm{tr}(M) = \mathrm{tr}(I_n - P) = \mathrm{tr}(I_n) - \mathrm{tr}(P) = n - k.
\]

4.
\[
\hat\varepsilon = M\mathbf{Y} = M(\mathbf{X}\beta + \varepsilon) = M\varepsilon,
\]
\[
\mathbf{Y} = (P + M)\mathbf{Y} = P\mathbf{Y} + M\mathbf{Y} = \hat{\mathbf{Y}} + \hat\varepsilon, \quad \text{an orthogonal decomposition.}
\]
Note that $\hat{\mathbf{Y}}'\hat\varepsilon = (P\mathbf{Y})'M\mathbf{Y} = \mathbf{Y}'PM\mathbf{Y} = 0$. Noting that $\hat{\mathbf{Y}} = P\mathbf{Y}$ and $M\mathbf{X} = 0$, $P$ is known as the hat matrix and $M$ is called an orthogonal projection matrix or an annihilator matrix.

5.
\[
\hat\varepsilon'\hat\varepsilon = (M\varepsilon)'(M\varepsilon) = \varepsilon'M\varepsilon = \mathbf{Y}'M\mathbf{Y}.
\]
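The properties above are easy to verify numerically. The sketch below does so for the illustrative data generated earlier (it reuses `X`, `y`, `resid`, `n`, and `k`); it is an illustration, not part of the notes.

```python
# Numerical check of the projection-matrix properties (reusing the illustrative X, y, resid, n, k).
import numpy as np

P = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix  P = X (X'X)^{-1} X'
M = np.eye(n) - P                       # annihilator M = I_n - P

print(np.allclose(P @ P, P), np.allclose(M @ M, M))                 # idempotent
print(np.allclose(P @ X, X), np.allclose(M @ X, 0))                 # PX = X, MX = 0
print(np.isclose(np.trace(P), k), np.isclose(np.trace(M), n - k))   # tr(P) = k, tr(M) = n - k
print(np.allclose(M @ y, resid))                                    # M Y = eps_hat
```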
2.2.5 Goodness of Fit: $R^2$ and $\bar R^2$

Define
\[
TSS \equiv \sum_{i=1}^n (y_i - \bar y)^2 \quad \text{(total sum of squares)},
\]
\[
ESS \equiv \sum_{i=1}^n (\hat y_i - \bar{\hat y})^2 \quad \text{(explained sum of squares)},
\]
\[
RSS \equiv \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n \hat\varepsilon_i^2 \quad \text{(residual sum of squares)},
\]
where $\bar y = \frac{1}{n}\sum_{i=1}^n y_i$ and $\bar{\hat y} = \frac{1}{n}\sum_{i=1}^n \hat y_i$. We consider two measures of the coefficient of determination ($R^2$):
\[
R_1^2 = 1 - \frac{RSS}{TSS} \quad \text{and} \quad R_2^2 = \frac{ESS}{TSS}.
\]
Theorem 2.3 If an intercept is included in the regression, then
(a) $TSS = ESS + RSS$;
(b) $0 \le R_1^2 = R_2^2 \le 1$.

Proof. (a) When an intercept is included in the regression, we have $\sum_{i=1}^n \hat\varepsilon_i = 0$, which implies that $\bar{\hat y} = \bar y$. It follows that
\[
TSS = \sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n \left[y_i - \hat y_i + \hat y_i - \bar y\right]^2
= \sum_{i=1}^n (y_i - \hat y_i)^2 + \sum_{i=1}^n (\hat y_i - \bar y)^2 + 2\sum_{i=1}^n (y_i - \hat y_i)(\hat y_i - \bar y)
\]
\[
= RSS + \sum_{i=1}^n (\hat y_i - \bar{\hat y})^2 + 2\sum_{i=1}^n \hat\varepsilon_i(\hat y_i - \bar y)
= RSS + ESS,
\]
because the normal equation implies that
\[
\sum_{i=1}^n \hat\varepsilon_i(\hat y_i - \bar y) = \sum_{i=1}^n \hat\varepsilon_i\hat y_i - \sum_{i=1}^n \hat\varepsilon_i\bar y
= \sum_{i=1}^n \hat\varepsilon_i X_i'\hat\beta - \bar y\sum_{i=1}^n \hat\varepsilon_i = 0 - 0 = 0.
\]
(b) This follows from (a).
Remarks.
1. Without an intercept, $R_1^2 \le 1$ but can be negative, and $R_2^2 \ge 0$ but can be greater than 1. With an intercept, we can write
\[
R_1^2 = R_2^2 = R^2.
\]
In this case,
\[
R^2 = \frac{ESS}{TSS} = \frac{\sum_{i=1}^n (\hat y_i - \bar{\hat y})^2}{\sum_{i=1}^n (y_i - \bar y)^2} = \frac{\hat{\mathbf{Y}}'M_0\hat{\mathbf{Y}}}{\mathbf{Y}'M_0\mathbf{Y}},
\]
where $M_0 = I_n - \frac{1}{n}\mathbf{ii}' = I_n - \mathbf{i}(\mathbf{i}'\mathbf{i})^{-1}\mathbf{i}'$ is the demeaning matrix and $\mathbf{i}$ is an $n \times 1$ vector of ones. It is easy to verify that $M_0$ is a symmetric idempotent matrix and thus a projection matrix. In addition,
\[
\hat{\mathbf{Y}} - \mathbf{i}\bar{\hat y} = \begin{pmatrix} \hat y_1 - \bar{\hat y} \\ \vdots \\ \hat y_n - \bar{\hat y} \end{pmatrix}
= \hat{\mathbf{Y}} - \mathbf{i}\frac{\mathbf{i}'\hat{\mathbf{Y}}}{n}
= \left(I_n - \frac{1}{n}\mathbf{ii}'\right)\hat{\mathbf{Y}} = M_0\hat{\mathbf{Y}},
\]
\[
\hat{\mathbf{Y}}'M_0\hat{\mathbf{Y}} = (M_0\hat{\mathbf{Y}})'(M_0\hat{\mathbf{Y}}) = \sum_{i=1}^n (\hat y_i - \bar{\hat y})^2.
\]
Similarly,
\[
\mathbf{Y}'M_0\mathbf{Y} = (M_0\mathbf{Y})'(M_0\mathbf{Y}) = \sum_{i=1}^n (y_i - \bar y)^2,
\]
\[
\mathbf{X}'M_0\mathbf{Y} = (M_0\mathbf{X})'(M_0\mathbf{Y}) = \sum_{i=1}^n (X_i - \bar X)(y_i - \bar y),
\]
where $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$.
2. $R^2$ never decreases when we include additional regressors in the LRM. A high $R^2$ does not necessarily imply that the model is good, and a low $R^2$ does not necessarily imply that the model is bad. In macroeconomics, $R^2$ can be as high as 0.99, while in microeconomics or finance, $R^2$ can be as low as 0.1 or 0.2 and the model may still be fine.
3. When an intercept is included, $R^2$ equals the squared sample correlation between $y_i$ and $\hat y_i$:
\[
R^2 = \left[\widehat{\mathrm{Corr}}(y, \hat y)\right]^2
= \frac{\left[\sum_{i=1}^n (\hat y_i - \bar{\hat y})(y_i - \bar y)\right]^2}{\sum_{i=1}^n (\hat y_i - \bar{\hat y})^2\sum_{i=1}^n (y_i - \bar y)^2}
= \frac{(\hat{\mathbf{Y}}'M_0\mathbf{Y})^2}{(\hat{\mathbf{Y}}'M_0\hat{\mathbf{Y}})(\mathbf{Y}'M_0\mathbf{Y})}.
\]
Proof: Note that when an intercept is included, $\sum_{i=1}^n \hat\varepsilon_i = 0$ and $\bar{\hat y} = \bar y$. It follows that
\[
ESS = \sum_{i=1}^n (\hat y_i - \bar{\hat y})^2 = \sum_{i=1}^n (\hat y_i - \bar{\hat y})(\hat y_i - \bar{\hat y})
= \sum_{i=1}^n (\hat y_i - \bar{\hat y})(y_i - \hat\varepsilon_i - \bar y)
= \sum_{i=1}^n (\hat y_i - \bar{\hat y})(y_i - \bar y) - \sum_{i=1}^n (\hat y_i - \bar{\hat y})\hat\varepsilon_i
= \hat{\mathbf{Y}}'M_0\mathbf{Y},
\]
where the last equality follows from the fact that $\sum_{i=1}^n (\hat y_i - \bar{\hat y})\hat\varepsilon_i = \sum_{i=1}^n \hat y_i\hat\varepsilon_i - \bar{\hat y}\sum_{i=1}^n \hat\varepsilon_i = 0 - 0 = 0$. Hence
\[
R^2 = \frac{ESS}{TSS} = \frac{ESS^2}{ESS \cdot TSS}
= \frac{(\hat{\mathbf{Y}}'M_0\mathbf{Y})^2}{(\hat{\mathbf{Y}}'M_0\hat{\mathbf{Y}})(\mathbf{Y}'M_0\mathbf{Y})}
= \left[\widehat{\mathrm{Corr}}(y, \hat y)\right]^2.
\]
4. $\bar R^2$ (R-bar-squared). A better measure of goodness of fit is given by the adjusted coefficient of determination:
\[
\bar R^2 = 1 - \frac{RSS/(n-k)}{TSS/(n-1)} = 1 - \frac{n-1}{n-k}\left(1 - R^2\right).
\]
Note that $\bar R^2$ is not a monotone function of $k$. It may rise or fall as one adds an additional regressor to the regression model.
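The quantities of this subsection can be computed directly from the fitted values and residuals of the illustrative regression above (reusing `y`, `y_fit`, `resid`, `n`, and `k`); the sketch is an illustration, not part of the notes.

```python
# R^2 and adjusted R^2 for the illustrative regression (an intercept is included).
import numpy as np

tss = np.sum((y - y.mean()) ** 2)
rss = np.sum(resid ** 2)
ess = np.sum((y_fit - y_fit.mean()) ** 2)

r2 = 1.0 - rss / tss                                    # equals ess / tss when an intercept is included
r2_adj = 1.0 - (n - 1) / (n - k) * (1.0 - r2)

print(np.isclose(tss, ess + rss))                       # TSS = ESS + RSS (Theorem 2.3)
print(np.isclose(r2, np.corrcoef(y, y_fit)[0, 1] ** 2)) # R^2 = squared sample correlation (Remark 3)
print(r2, r2_adj)
```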
2.3 Finite Sample Properties of the OLS Estimators
Definition 2.4 An unbiased estimator $\hat\beta$ is more efficient than another unbiased estimator $\tilde\beta$ if $\mathrm{Var}(\tilde\beta) - \mathrm{Var}(\hat\beta)$ is p.s.d.

Remarks.

1. An important implication of the above definition is: for any $k \times 1$ vector $C$ s.t. $C'C = 1$, we have $C'[\mathrm{Var}(\tilde\beta) - \mathrm{Var}(\hat\beta)]C \ge 0$. For example, taking $C = (1, 0, \ldots, 0)'$ yields $\mathrm{Var}(\tilde\beta_1) - \mathrm{Var}(\hat\beta_1) \ge 0$.

2. Sufficient condition. In the case where $E(\hat\beta|\mathbf{X}) = E(\tilde\beta|\mathbf{X}) = \beta$, it is sufficient to have $\mathrm{Var}(\tilde\beta|\mathbf{X}) - \mathrm{Var}(\hat\beta|\mathbf{X})$ p.s.d. with probability 1. To see this, noting that $\mathrm{Var}(Y) = E[\mathrm{Var}(Y|X)] + \mathrm{Var}[E(Y|X)]$, we have
\[
\mathrm{Var}(\hat\beta) = E[\mathrm{Var}(\hat\beta|\mathbf{X})] + \mathrm{Var}[E(\hat\beta|\mathbf{X})] = E[\mathrm{Var}(\hat\beta|\mathbf{X})],
\]
and similarly
\[
\mathrm{Var}(\tilde\beta) = E[\mathrm{Var}(\tilde\beta|\mathbf{X})].
\]
Thus $\mathrm{Var}(\tilde\beta) - \mathrm{Var}(\hat\beta) = E[\mathrm{Var}(\tilde\beta|\mathbf{X}) - \mathrm{Var}(\hat\beta|\mathbf{X})]$, which is p.s.d. provided $\mathrm{Var}(\tilde\beta|\mathbf{X}) - \mathrm{Var}(\hat\beta|\mathbf{X})$ is p.s.d. with probability 1.
Theorem 2.5 Assume that the classical Assumptions A.1-A.4 hold. Then:
(a) (Unbiasedness) $E(\hat\beta|\mathbf{X}) = \beta$ and $E(\hat\beta) = \beta$;
(b) (Variance-covariance matrix) $\mathrm{Var}(\hat\beta|\mathbf{X}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}$;
(c) (Gauss-Markov Theorem) $\hat\beta$ is the best linear unbiased estimator (BLUE) of $\beta$. That is, for any unbiased estimator $\tilde\beta$ that is linear in $\mathbf{Y}$, $\mathrm{Var}(\tilde\beta|\mathbf{X}) \ge \mathrm{Var}(\hat\beta|\mathbf{X})$;
(d) (Unbiased estimator of variance) $E(s^2|\mathbf{X}) = \sigma^2$;
(e) (Orthogonality between $\hat\beta$ and $\hat\varepsilon$) $\mathrm{Cov}(\hat\beta, \hat\varepsilon|\mathbf{X}) = E[(\hat\beta - \beta)\hat\varepsilon'|\mathbf{X}] = 0$.
Proof. (a) By Assumptions A.1 and A.3, we have
\[
\hat\beta = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X}\beta + \varepsilon) = \beta + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\varepsilon.
\]
By Assumption A.2,
\[
E(\hat\beta|\mathbf{X}) = \beta + E[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\varepsilon|\mathbf{X}] = \beta + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\varepsilon|\mathbf{X}) = \beta + 0 = \beta.
\]
Then $E(\hat\beta) = \beta$ by the law of iterated expectations.

(b) By Assumption A.4,
\[
\mathrm{Var}(\hat\beta|\mathbf{X}) = \mathrm{Var}\left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\varepsilon|\mathbf{X}\right]
= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,\mathrm{Var}(\varepsilon|\mathbf{X})\left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right]'
= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\sigma^2 I_n\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}.
\]

(c) Let $\tilde\beta = A\mathbf{Y}$ be a linear estimator of $\beta$, where $A$ is a $k \times n$ matrix. It is unbiased iff
\[
E(\tilde\beta|\mathbf{X}) = E[A(\mathbf{X}\beta + \varepsilon)|\mathbf{X}] = E[(A\mathbf{X}\beta + A\varepsilon)|\mathbf{X}] = A\mathbf{X}\beta + AE(\varepsilon|\mathbf{X}) = A\mathbf{X}\beta + 0 = \beta.
\]
This holds iff $A\mathbf{X} = I_k$. Then
\[
\mathrm{Var}(\tilde\beta|\mathbf{X}) = \mathrm{Var}(A\mathbf{X}\beta + A\varepsilon|\mathbf{X}) = \mathrm{Var}(A\varepsilon|\mathbf{X}) = A\,\mathrm{Var}(\varepsilon|\mathbf{X})A' = \sigma^2 AA'
\]
and
\[
\mathrm{Var}(\tilde\beta|\mathbf{X}) - \mathrm{Var}(\hat\beta|\mathbf{X}) = \sigma^2 AA' - \sigma^2(\mathbf{X}'\mathbf{X})^{-1}
= \sigma^2\left[AA' - A\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'A'\right]
= \sigma^2 A\left[I_n - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right]A'
= \sigma^2 AMA' = \sigma^2 AM(AM)' \ge 0.
\]
Note that
\[
AMA' = 0 \iff AM(AM)' = 0 \iff AM = 0 \iff A\left(I_n - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right) = 0
\iff A = A\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}',
\]
i.e., $\tilde\beta = \hat\beta$. This suggests the uniqueness of the BLUE.

(d)
\[
E(s^2|\mathbf{X}) = \frac{1}{n-k}E(\hat\varepsilon'\hat\varepsilon|\mathbf{X}) = \frac{1}{n-k}E(\varepsilon'M\varepsilon|\mathbf{X}) = \frac{1}{n-k}E[\mathrm{tr}(\varepsilon'M\varepsilon)|\mathbf{X}]
= \frac{1}{n-k}\mathrm{tr}\left[E(M\varepsilon\varepsilon'|\mathbf{X})\right] = \frac{1}{n-k}\mathrm{tr}\left[ME(\varepsilon\varepsilon'|\mathbf{X})\right]
= \frac{1}{n-k}\mathrm{tr}\left(M\sigma^2 I_n\right) = \frac{\sigma^2}{n-k}\mathrm{tr}(M) = \sigma^2.
\]

(e) Noting that $\hat\beta = \beta + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\varepsilon$ and $\hat\varepsilon = \mathbf{Y} - \hat{\mathbf{Y}} = (I_n - P)\mathbf{Y} = M\mathbf{Y} = M\varepsilon$, we have
\[
E(\hat\varepsilon|\mathbf{X}) = ME(\varepsilon|\mathbf{X}) = 0
\]
and
\[
\mathrm{Cov}(\hat\beta, \hat\varepsilon|\mathbf{X}) = E\left\{[\hat\beta - E(\hat\beta|\mathbf{X})][\hat\varepsilon - E(\hat\varepsilon|\mathbf{X})]'\,\big|\,\mathbf{X}\right\}
= E\left\{(\hat\beta - \beta)(M\varepsilon - 0)'\,\big|\,\mathbf{X}\right\}
= E\left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\varepsilon\varepsilon'M\,\big|\,\mathbf{X}\right]
\]
\[
= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\varepsilon\varepsilon'|\mathbf{X})M
= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\sigma^2 I_n M = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'M = 0.
\]
Remarks.
1. (a) and (b) imply that the conditional MSE of $\hat\beta$ given $\mathbf{X}$ is
\[
\mathrm{MSE}(\hat\beta|\mathbf{X}) \equiv E\left[(\hat\beta - \beta)(\hat\beta - \beta)'\,\big|\,\mathbf{X}\right]
= \mathrm{Var}(\hat\beta|\mathbf{X}) + \left[\mathrm{Bias}(\hat\beta|\mathbf{X})\right]^2
= \sigma^2(\mathbf{X}'\mathbf{X})^{-1} + 0 = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}.
\]
As we shall prove, $\sigma^2(\mathbf{X}'\mathbf{X})^{-1} \to 0$ as $n \to \infty$ under Assumption A.3*, so that $\mathrm{MSE}(\hat\beta|\mathbf{X}) \to 0$ as $n \to \infty$ in this case, which implies the consistency of $\hat\beta$, a large sample property of $\hat\beta$.

2. The Gauss-Markov theorem makes no assumption on the distribution of the error term except that
\[
E(\varepsilon_i|\mathbf{X}) = 0 \quad \text{and} \quad \mathrm{Var}(\varepsilon_i|\mathbf{X}) = \sigma^2.
\]
It says nothing about nonlinear estimators and compares only linear unbiased estimators. We can have biased or nonlinear estimators that have smaller MSE than the OLS estimator.
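Unbiasedness and the variance formula of Theorem 2.5 are easy to see in a small Monte Carlo experiment. The sketch below is an illustration only; the design, seed, and number of replications are assumptions, and separate variable names are used so that it does not overwrite the earlier illustrative objects.

```python
# Monte Carlo sketch of Theorem 2.5(a)-(b): with the design held fixed, the average of beta_hat
# across replications should be close to beta, and its sample covariance close to sigma^2 (X'X)^{-1}.
import numpy as np

rng = np.random.default_rng(1)
n_mc, k_mc, sigma_mc, n_rep = 100, 3, 1.0, 5000
X_mc = np.column_stack([np.ones(n_mc), rng.normal(size=(n_mc, k_mc - 1))])   # fixed design
beta_mc = np.array([1.0, 2.0, -0.5])
XtX_inv_mc = np.linalg.inv(X_mc.T @ X_mc)

draws = np.empty((n_rep, k_mc))
for r in range(n_rep):
    y_mc = X_mc @ beta_mc + sigma_mc * rng.normal(size=n_mc)
    draws[r] = XtX_inv_mc @ (X_mc.T @ y_mc)

print("mean of beta_hat:", draws.mean(axis=0))                 # ~ beta (unbiasedness)
print("max |cov - sigma^2 (X'X)^{-1}|:",
      np.abs(np.cov(draws.T) - sigma_mc**2 * XtX_inv_mc).max())  # ~ 0 up to simulation noise
```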
2.4 Sampling Distribution
To obtain the finite sample distribution of $\hat\beta$, we impose the following assumption.

Assumption A.5 (Normality) $\varepsilon|\mathbf{X} \sim N\left(0, \sigma^2 I_n\right)$.$^4$

Remarks.

1. Because the conditional pdf of $\varepsilon$ given $\mathbf{X}$ is
\[
f(\varepsilon|\mathbf{X}) = (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{\varepsilon'\varepsilon}{2\sigma^2}\right),
\]
which has nothing to do with $\mathbf{X}$, the above assumption implies that $\varepsilon$ is independent of $\mathbf{X}$: $\varepsilon \perp \mathbf{X}$.

2. Assumption A.5 implies A.2 and A.4.

The following lemma is very useful and will be used frequently.

$^4$If $Z$ is $N(0,1)$ conditional on $X$, then it is also $N(0,1)$ unconditionally, as we can easily tell that $f(z|x) = f(z) = (2\pi)^{-1/2}\exp(-\frac{z^2}{2})$. Such an implication seems trivial but will be used repeatedly throughout the course.
Lemma 2.6 (i) If $\xi \sim N(0, \Sigma)$ where $\Sigma$ is an $n \times n$ nonsingular matrix, then $\xi'\Sigma^{-1}\xi \sim \chi^2(n)$.
(ii) If $\xi \sim N(0, \sigma^2 I_n)$ and $A$ is an $n \times n$ projection matrix, then $\xi'A\xi/\sigma^2 \sim \chi^2(\mathrm{rank}(A))$.
(iii) If $\xi \sim N(0, \sigma^2 I_n)$, $A$ is an $n \times n$ projection matrix, and $AB' = 0$, then $\xi'A\xi$ and $B\xi$ are independent.
(iv) If $\xi \sim N(0, \sigma^2 I_n)$ and $A$ and $B$ are both symmetric, then $\xi'A\xi$ and $\xi'B\xi$ are independent iff $AB = 0$.
In fact, if $\xi \sim N(0, \sigma^2 I_n)$ and $A$ is symmetric, then $\xi'A\xi/\sigma^2 \sim \chi^2(\mathrm{rank}(A))$ iff $A$ is idempotent.
The following theorem states the sampling distributions of $\hat\beta$ and $s^2$.

Theorem 2.7 Suppose that Assumptions A.1, A.3 and A.5 hold. Then
(a) $\hat\beta - \beta \,|\, \mathbf{X} \sim N\left(0, \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\right)$;
(b) $\dfrac{(n-k)s^2}{\sigma^2}\,\Big|\,\mathbf{X} \sim \chi^2(n-k)$;
(c) Conditional on $\mathbf{X}$, $s^2 \perp \hat\beta$.

Proof. (a) $\hat\beta - \beta = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\varepsilon = C\varepsilon = \sum_{i=1}^n C_i\varepsilon_i$ is a linear combination of the $\varepsilon_i$'s, where $C \equiv (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ is a $k \times n$ matrix and $C_i = (\mathbf{X}'\mathbf{X})^{-1}X_i$. Conditional on $\mathbf{X}$, $\hat\beta - \beta$ is therefore also normally distributed, with mean
\[
E(\hat\beta - \beta|\mathbf{X}) = \sum_{i=1}^n C_iE(\varepsilon_i|\mathbf{X}) = 0
\]
and variance
\[
\mathrm{Var}(\hat\beta - \beta|\mathbf{X}) = \mathrm{Var}\left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\varepsilon|\mathbf{X}\right]
= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,\mathrm{Var}(\varepsilon|\mathbf{X})\left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right]'
= \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'I_n\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}.
\]
That is, $\hat\beta - \beta\,|\,\mathbf{X} \sim N\left(0, \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\right)$.

(b) By Lemma 2.6(ii),
\[
\frac{(n-k)s^2}{\sigma^2}\Big|\mathbf{X} = \frac{\varepsilon'M\varepsilon}{\sigma^2}\Big|\mathbf{X} \sim \chi^2(\mathrm{rank}(M)) = \chi^2(\mathrm{tr}(M)) = \chi^2(n-k).
\]

(c) Note that $s^2 = \frac{1}{n-k}\varepsilon'M\varepsilon$ and $\hat\beta - \beta = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\varepsilon$. By Lemma 2.6(iii), $s^2 \perp \hat\beta$ conditional on $\mathbf{X}$ because
\[
M\left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right]' = M\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} = 0.
\]
2.5 Hypothesis Testing
Hypothesis testing is frequently needed when we conduct statistical inference in the regression frame-
work. It can be used to evaluate the validity of economic theory, to detect absence of structure, among
many other things.
Example 2.8 (Production Function) Given the Cobb-Douglas production function $Y = AK^{\beta_2}L^{\beta_3}$, we want to test
\[
H_0: \beta_2 + \beta_3 = 1 \text{ (constant returns to scale)} \quad \text{versus} \quad H_1: \beta_2 + \beta_3 < 1 \text{ (decreasing returns to scale)}.
\]
To conduct the test, we can consider the following log-linear model
\[
\ln(Y) = \beta_1 + \beta_2\ln(K) + \beta_3\ln(L) + \varepsilon.
\]
Example 2.9 (Structural Change) Let $GDP_i$ stand for the gross domestic product of China in year $i$. We are interested in whether there is a structural change in GDP around the year 1979. For this purpose, define a dummy variable $D_i = 1(i \ge 1979)$, and consider the following regression model
\[
\ln(GDP_i) = (\beta_1 + \beta_3 D_i) + (\beta_2 + \beta_4 D_i)\,i + \varepsilon_i.
\]
The null of interest is $H_0: \beta_3 = \beta_4 = 0$ (no structural change) versus $H_1: \beta_3 \neq 0$ or $\beta_4 \neq 0$ (structural change).
Even though both Examples 2.8 and 2.9 are about hypothesis testing in the regression framework, they differ in that the null hypothesis in Example 2.8 has only one restriction while that in Example 2.9 has two restrictions. We will discuss tests with a single linear restriction and multiple linear restrictions separately.

2.5.1 Single Linear Restriction: t-test

For clarity, we assume Assumptions A.1-A.5 hold. Note that the normality assumption A.5 is crucial in deriving the exact distributions of the t and F tests defined below. For asymptotic tests, which we do not discuss in this section, the normality assumption can be avoided.
We first consider testing a single linear restriction
\[
H_0: c'\beta = r \quad \text{versus} \quad H_1: c'\beta \neq r, \tag{2.5}
\]
where $c$ is a $k \times 1$ vector and $r$ is a scalar.

Since under Assumption A.5, $\hat\beta\,|\,\mathbf{X} \sim N\left(\beta, \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\right)$, we have
\[
c'\hat\beta\,|\,\mathbf{X} \sim N\left(c'\beta, \sigma^2 c'(\mathbf{X}'\mathbf{X})^{-1}c\right).
\]
Therefore under $H_0$,
\[
Z_n \equiv \frac{c'\hat\beta - r}{\sqrt{\sigma^2 c'(\mathbf{X}'\mathbf{X})^{-1}c}} \sim N(0, 1) \text{ conditional on } \mathbf{X}. \tag{2.6}
\]
Replacing $\sigma^2$ by its OLS estimator $s^2$, we get the test statistic
\[
T_n \equiv \frac{c'\hat\beta - r}{\sqrt{s^2 c'(\mathbf{X}'\mathbf{X})^{-1}c}}. \tag{2.7}
\]

Definition 2.10 (Student t distribution) A random variable $T$ follows the Student t distribution with $q$ degrees of freedom, written as $T \sim t(q)$, if $T = \frac{U}{\sqrt{V/q}}$ where $U \sim N(0,1)$, $V \sim \chi^2(q)$, and $U \perp V$.

Theorem 2.11 Suppose Assumptions A.1-A.5 hold. Then under $H_0$, $T_n \sim t(n-k)$.
Proof. Let $Q_n \equiv (n-k)s^2/\sigma^2$. Then under $H_0: c'\beta = r$,
\[
T_n = \frac{c'\hat\beta - r}{\sqrt{s^2 c'(\mathbf{X}'\mathbf{X})^{-1}c}}
= \frac{c'(\hat\beta - \beta)}{\sqrt{s^2 c'(\mathbf{X}'\mathbf{X})^{-1}c}}
= \frac{c'(\hat\beta - \beta)\big/\sqrt{\sigma^2 c'(\mathbf{X}'\mathbf{X})^{-1}c}}{\sqrt{\dfrac{(n-k)s^2/\sigma^2}{n-k}}}
= \frac{Z_n}{\sqrt{Q_n/(n-k)}}.
\]
Then the result follows from Theorem 2.7 and the definition of the t distribution.
Remarks.

1. Suppose we are interested in testing $H_0: \beta_j = \beta_j^0$ versus $H_1: \beta_j \neq \beta_j^0$. In this case, $c = e_j$, a $k \times 1$ vector with 1 in its $j$th place and 0 elsewhere. The test statistic is
\[
T_n = \frac{\hat\beta_j - \beta_j^0}{\sqrt{s^2\left[(\mathbf{X}'\mathbf{X})^{-1}\right]_{jj}}} \sim t(n-k) \text{ under } H_0,
\]
where $[A]_{jj}$ denotes the $j$th diagonal element of the matrix $A$. We reject $H_0$ if $|T_n| > t_{\alpha/2}(n-k)$, where $t_{\alpha/2}(n-k)$ denotes the upper $\alpha/2$-percentile of the $t(n-k)$ distribution. Let $se(\hat\beta_j) = \sqrt{s^2\left[(\mathbf{X}'\mathbf{X})^{-1}\right]_{jj}}$. A two-sided $1-\alpha$ confidence interval (CI) for $\beta_j$ is given by
\[
\left[\hat\beta_j - t_{\alpha/2}(n-k)\,se(\hat\beta_j),\; \hat\beta_j + t_{\alpha/2}(n-k)\,se(\hat\beta_j)\right] \equiv CI(\alpha).
\]
By the duality between hypothesis tests and confidence intervals, we also reject the null at the significance level $\alpha$ if $\beta_j^0 \notin CI(\alpha)$.

2. In modern econometrics, more attention has been given to the use of the p-value, which is the smallest significance level at which we can reject the null hypothesis. Note that the p-value for a one-sided test is different from that for a two-sided test. For example, in the above two-sided test, if the t statistic takes value $t_n$ (a fixed number), then its p-value is defined by
\[
\text{p-value} = 2P\left(t(n-k) > |t_n|\right),
\]
where $t(n-k)$ is a Student t random variable with $n-k$ degrees of freedom. We reject the null if p-value $< \alpha$, the prescribed level of significance.

3. In testing $H_0: \beta_j = \beta_j^0$ versus $H_1: \beta_j > \beta_j^0$, we can obtain the t statistic value $t_n$ as above, but the p-value in this case is defined as
\[
\text{p-value} = P\left(t(n-k) > t_n\right).
\]
Again, we reject the null if p-value $< \alpha$.
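The sketch below carries out the coefficient t-tests, two-sided p-values, and 95% confidence intervals for the illustrative regression computed earlier (reusing `X`, `beta_hat`, `resid`, `n`, and `k`); it is an illustration of Remarks 1-2, not part of the notes.

```python
# t-test of H0: beta_j = 0 for each coefficient in the illustrative regression,
# with two-sided p-values and 95% confidence intervals.
import numpy as np
from scipy import stats

XtX_inv = np.linalg.inv(X.T @ X)
s2 = resid @ resid / (n - k)
se = np.sqrt(s2 * np.diag(XtX_inv))                 # se(beta_hat_j)

t_stat = beta_hat / se                              # T_n for H0: beta_j = 0
p_val = 2 * stats.t.sf(np.abs(t_stat), df=n - k)    # two-sided p-value
crit = stats.t.ppf(0.975, df=n - k)                 # t_{alpha/2}(n-k) with alpha = 0.05
ci = np.column_stack([beta_hat - crit * se, beta_hat + crit * se])

print(t_stat, p_val)
print(ci)
```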
2.5.2 Multiple Linear Restrictions: F-test

Now we consider testing the linear restrictions on $\beta$:
\[
H_0: R\beta = r \quad \text{versus} \quad H_1: R\beta \neq r, \tag{2.8}
\]
where $R$ is a known matrix of order $q \times k$ with $q < k$ and $r$ is a known $q \times 1$ vector. We assume $\mathrm{rank}(R) = q$.
Example 2.12 (a) $R = [1, 0, \ldots, 0]$, $r = 0$, $q = 1$. This is equivalent to testing $H_0: \beta_1 = 0$.

(b) $R = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix} = [0 \;\; I_{k-1}]$, $r = \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}$, $q = k - 1$. This is equivalent to testing $H_0: \beta_2 = \cdots = \beta_k = 0$, i.e., to testing the joint (overall) significance of the regression model.

(c) $R = \begin{pmatrix} 1 & -1 & 0 & \cdots & 0 \\ 0 & 1 & 2 & \cdots & 0 \end{pmatrix}$, $r = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$, $q = 2$. This is equivalent to testing $H_0: \beta_1 = \beta_2$ and $\beta_2 + 2\beta_3 = 1$.

Definition 2.13 (F distribution) A random variable $F$ follows the F distribution with $(p, q)$ degrees of freedom, written as $F \sim F(p, q)$, if $F = \frac{U/p}{V/q}$ where $U \sim \chi^2(p)$, $V \sim \chi^2(q)$, and $U \perp V$.
Theorem 2.14 Suppose Assumptions A.1-A.5 hold. Then under $H_0$,
\[
F_n \equiv \frac{1}{q}\left(R\hat\beta - r\right)'\left[s^2 R(\mathbf{X}'\mathbf{X})^{-1}R'\right]^{-1}\left(R\hat\beta - r\right) \sim F(q, n-k) \text{ conditional on } \mathbf{X}.
\]

Proof. Since $\hat\beta|\mathbf{X} \sim N\left(\beta, \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\right)$, we have $R\hat\beta - r\,|\,\mathbf{X} \sim N\left(0, \sigma^2 R(\mathbf{X}'\mathbf{X})^{-1}R'\right)$ under $H_0$. Therefore by Lemma 2.6(i),
\[
W_n \equiv \left(R\hat\beta - r\right)'\left[\sigma^2 R(\mathbf{X}'\mathbf{X})^{-1}R'\right]^{-1}\left(R\hat\beta - r\right) \sim \chi^2(q) \text{ conditional on } \mathbf{X},
\]
since $R(\mathbf{X}'\mathbf{X})^{-1}R'$ is a p.d. matrix with rank $q$. Also, we have seen that
\[
Q_n \equiv \frac{(n-k)s^2}{\sigma^2} \sim \chi^2(n-k) \text{ conditional on } \mathbf{X}.
\]
By Theorem 2.7, $Q_n$ and $W_n$ are independent conditional on $\mathbf{X}$ because $s^2 \perp \hat\beta\,|\,\mathbf{X}$. The result then follows by writing
\[
F_n = \frac{W_n/q}{Q_n/(n-k)} \text{ conditional on } \mathbf{X}.
\]
Remarks.
1. The above theorem implies that $F_n \sim F(q, n-k)$ unconditionally.

2. Suppose we are still interested in testing $H_0: \beta_j = \beta_j^0$. In this case, $q = 1$, $r = \beta_j^0$, and $R = e_j'$, where $e_j$ is a $k \times 1$ vector with 1 in its $j$th place and 0 elsewhere. Then $R\hat\beta - r = \hat\beta_j - \beta_j^0$ and $R(\mathbf{X}'\mathbf{X})^{-1}R' = \left[(\mathbf{X}'\mathbf{X})^{-1}\right]_{jj}$. The test statistic becomes
\[
F_n = \left\{\frac{\hat\beta_j - \beta_j^0}{\sqrt{s^2\left[(\mathbf{X}'\mathbf{X})^{-1}\right]_{jj}}}\right\}^2 \sim F(1, n-k) \text{ under } H_0.
\]
Note that the expression inside the curly brackets is just the t-statistic. The result is not surprising since $t(n-k)^2 = F(1, n-k)$.

3. The F test is usually used to test multiple restrictions, and one rejects the null only when the F statistic takes a sufficiently large value. We reject the null if $F_n > F_\alpha(q, n-k)$, the upper $\alpha$-percentile of the $F(q, n-k)$ distribution. Alternatively, we reject the null at the prescribed $\alpha$ level of significance if
\[
\text{p-value} = P\left(F(q, n-k) > f_n\right) < \alpha,
\]
where $f_n$ is the value of the $F_n$ test statistic (a fixed number).
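The quadratic-form F-statistic of Theorem 2.14 can be computed directly. The sketch below tests the joint significance $H_0: \beta_2 = \cdots = \beta_k = 0$ in the illustrative regression (Example 2.12(b)), reusing `beta_hat`, `XtX_inv`, `s2`, `n`, and `k` from the earlier blocks; it is an illustration, not part of the notes.

```python
# F-test of H0: R beta = r via the quadratic form of Theorem 2.14,
# with R = [0  I_{k-1}] and r = 0 (joint significance of the slopes).
import numpy as np
from scipy import stats

R = np.hstack([np.zeros((k - 1, 1)), np.eye(k - 1)])
r = np.zeros(k - 1)
q = R.shape[0]

diff = R @ beta_hat - r
V = s2 * R @ XtX_inv @ R.T                      # s^2 R (X'X)^{-1} R'
F_stat = diff @ np.linalg.solve(V, diff) / q
p_val = stats.f.sf(F_stat, q, n - k)            # P(F(q, n-k) > F_stat)

print(F_stat, p_val)
```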
2.6 Constrained Least Squares
Consider testing the linear restrictions
\[
H_0: R\beta = r \quad \text{versus} \quad H_1: R\beta \neq r.
\]
We use $RSS_{ur}$ to denote the unrestricted sum of squared residuals and $RSS_r$ to denote the restricted sum of squared residuals under the restriction $H_0$. The following theorem shows an alternative expression for the F test statistic.

Theorem 2.15 Suppose Assumptions A.1-A.5 hold. Then
\[
F_n = \frac{(RSS_r - RSS_{ur})/q}{RSS_{ur}/(n-k)} \sim F(q, n-k).
\]
Proof. Consider the following minimization problem under the null restrictions:
\[
\min_\beta\,(\mathbf{Y} - \mathbf{X}\beta)'(\mathbf{Y} - \mathbf{X}\beta) \quad \text{s.t. } R\beta = r.
\]
Form the Lagrangian
\[
\mathcal{L}(\beta, \lambda) = (\mathbf{Y} - \mathbf{X}\beta)'(\mathbf{Y} - \mathbf{X}\beta) + \lambda'(R\beta - r),
\]
where $\lambda$ denotes the Lagrangian multiplier. Let $(\tilde\beta, \tilde\lambda)$ denote the solution to the above problem. Then it should satisfy the FOCs:
\[
\frac{\partial\mathcal{L}(\tilde\beta, \tilde\lambda)}{\partial\beta} = -2\mathbf{X}'\mathbf{Y} + 2\mathbf{X}'\mathbf{X}\tilde\beta + R'\tilde\lambda = 0, \tag{2.9}
\]
\[
\frac{\partial\mathcal{L}(\tilde\beta, \tilde\lambda)}{\partial\lambda} = R\tilde\beta - r = 0. \tag{2.10}
\]
From (2.9), we have
\[
\tilde\beta = \hat\beta - \frac{1}{2}(\mathbf{X}'\mathbf{X})^{-1}R'\tilde\lambda. \tag{2.11}
\]
From (2.10) and (2.11), we have $r = R\tilde\beta = R\hat\beta - \frac{1}{2}\left[R(\mathbf{X}'\mathbf{X})^{-1}R'\right]\tilde\lambda$ and hence$^{5,6}$
\[
\tilde\lambda = 2\left[R(\mathbf{X}'\mathbf{X})^{-1}R'\right]^{-1}\left(R\hat\beta - r\right). \tag{2.12}
\]
Equations (2.11) and (2.12) combine to give
\[
\tilde\beta = \hat\beta - (\mathbf{X}'\mathbf{X})^{-1}R'\left[R(\mathbf{X}'\mathbf{X})^{-1}R'\right]^{-1}\left(R\hat\beta - r\right). \tag{2.13}
\]
Let $\tilde\varepsilon \equiv \mathbf{Y} - \mathbf{X}\tilde\beta$. Then using equation (2.13) and the normal equation for the unrestricted regression, we have
\[
RSS_r = \tilde\varepsilon'\tilde\varepsilon
= \left[\hat\varepsilon + \mathbf{X}(\hat\beta - \tilde\beta)\right]'\left[\hat\varepsilon + \mathbf{X}(\hat\beta - \tilde\beta)\right]
= \hat\varepsilon'\hat\varepsilon + (\hat\beta - \tilde\beta)'\mathbf{X}'\mathbf{X}(\hat\beta - \tilde\beta)
= RSS_{ur} + \left(R\hat\beta - r\right)'\left[R(\mathbf{X}'\mathbf{X})^{-1}R'\right]^{-1}\left(R\hat\beta - r\right). \tag{2.14}
\]
Therefore,
\[
F_n = \frac{(RSS_r - RSS_{ur})/q}{RSS_{ur}/(n-k)}
= \frac{1}{q}\left(R\hat\beta - r\right)'\left[s^2 R(\mathbf{X}'\mathbf{X})^{-1}R'\right]^{-1}\left(R\hat\beta - r\right) \sim F(q, n-k),
\]
as desired.

$^5$$\tilde\lambda$ is an indicator of the departure of $R\hat\beta - r$ from 0. If the null hypothesis is true, we expect $R\hat\beta - r$ to be close to 0. Otherwise, it may deviate far from 0.

$^6$Alternatively, we can premultiply (2.9) by $R(\mathbf{X}'\mathbf{X})^{-1}$ to obtain $\tilde\lambda = 2\left[R(\mathbf{X}'\mathbf{X})^{-1}R'\right]^{-1}(R\hat\beta - r)$.
Example 2.16 Consider testing $H_0: \beta_2 = \beta_3$ in the linear regression model
\[
y_i = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i \quad \text{for } i = 1, \ldots, n. \tag{2.15}
\]
The restricted model is
\[
y_i = \beta_1 + \beta_2 x_i^* + \varepsilon_i, \tag{2.16}
\]
where $x_i^* = x_{i2} + x_{i3}$. One can run the regressions (2.15) and (2.16) respectively, obtain the RSS from each model, and construct the F-test.
Example 2.17 Consider testing
\[
H_0: \beta_2 = \cdots = \beta_k = 0 \quad \text{against} \quad H_1: \beta_j \neq 0 \text{ for some } 2 \le j \le k
\]
in the linear regression model
\[
y_i = \beta_1 + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i \quad \text{for } i = 1, \ldots, n. \tag{2.17}
\]
The restricted model is
\[
y_i = \beta_1 + \varepsilon_i. \tag{2.18}
\]
In this case, the restricted least squares estimator is given by $\tilde\beta_1 = \bar y$ and hence $RSS_r = \sum_{i=1}^n (y_i - \bar y)^2 = TSS$. $RSS_{ur}$ can be obtained from (2.17) easily. As remarked earlier on, $R^2$ (in the unrestricted model) is closely related to the F-test statistic for the above null. The F-statistic in this case is
\[
F_n = \frac{(RSS_r - RSS_{ur})/(k-1)}{RSS_{ur}/(n-k)}
= \frac{\dfrac{TSS - RSS_{ur}}{TSS}\Big/(k-1)}{\dfrac{RSS_{ur}}{TSS}\Big/(n-k)}
= \frac{R^2/(k-1)}{(1 - R^2)/(n-k)}.
\]
Therefore, we conclude that the F-statistic for testing the significance of the regression is an increasing function of $R^2$.
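For the illustrative regression used throughout, the same joint-significance F-statistic can be recovered from restricted vs. unrestricted RSS (Theorem 2.15) and from $R^2$ (Example 2.17); the sketch below checks that both agree with the quadratic-form value `F_stat` computed earlier (it also reuses `y`, `resid`, `r2`, `n`, and `k`). This is an illustration, not part of the notes.

```python
# The joint-significance F-statistic computed two more ways.
import numpy as np

rss_ur = resid @ resid                   # unrestricted RSS from the full regression
rss_r = np.sum((y - y.mean()) ** 2)      # restricted model y_i = beta_1 + eps_i, so RSS_r = TSS
F_rss = ((rss_r - rss_ur) / (k - 1)) / (rss_ur / (n - k))
F_r2 = (r2 / (k - 1)) / ((1 - r2) / (n - k))

print(np.isclose(F_rss, F_r2), np.isclose(F_rss, F_stat))
```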
2.7 Generalized Least Squares
What may go wrong if Assumptions A.1-A.5 do not hold? Here we relax Assumption A.5 a little bit.

Assumption A.5* (Normality) $\varepsilon|\mathbf{X} \sim N\left(0, \sigma^2\Omega\right)$, where $\Omega = \Omega(\mathbf{X})$ is a known finite p.d. matrix.

Note that the above assumption means that $\mathrm{Var}(\varepsilon|\mathbf{X}) = \sigma^2\Omega(\mathbf{X})$ is known up to a finite constant $\sigma^2$, and it allows for conditional heteroskedasticity of known form. Written explicitly, this assumption indicates that
\[
E(\varepsilon_i|\mathbf{X}) = 0, \quad
E(\varepsilon_i^2|\mathbf{X}) = \sigma^2\Omega_{ii}(\mathbf{X}), \quad
E(\varepsilon_i\varepsilon_j|\mathbf{X}) = \sigma^2\Omega_{ij}(\mathbf{X}),
\]
where $\Omega_{ij}(\mathbf{X})$ denotes the $(i, j)$ element of $\Omega(\mathbf{X})$.
Theorem 2.18 Suppose Assumptions A.1, A.3 and A.5* hold. Then
(a) (Unbiasedness) $E(\hat\beta|\mathbf{X}) = \beta$;
(b) (Variance-covariance matrix) $\mathrm{Var}(\hat\beta|\mathbf{X}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\Omega\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}$;
(c) (Normality) $\hat\beta - \beta\,|\,\mathbf{X} \sim N\left(0, \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\Omega\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\right)$;
(d) (Orthogonality) $\mathrm{Cov}(\hat\beta, \hat\varepsilon|\mathbf{X}) = 0$.

The proof of the above theorem is analogous to that of Theorem 2.7 and is thus omitted. Part (a) implies that the OLS estimator is still unbiased. But it is no longer BLUE unless $\Omega$ is proportional to an identity matrix. The classical t and F tests are no longer valid because they are based on an incorrect variance-covariance estimator. To obtain valid t and F tests, one has to estimate $\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\Omega\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}$ instead, say, by $\hat\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\Omega\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}$, where $\hat\sigma^2$ is a consistent estimator of $\sigma^2$.
Recall that for any symmetric p.d. matrix $\Omega$, we can write $\Omega^{-1} = C'C$ where $C$ is a nonsingular matrix. This fact helps us to find the estimator which has the BLUE property. Pre-multiplying both sides of the equation
\[
\mathbf{Y} = \mathbf{X}\beta + \varepsilon \tag{2.19}
\]
by $C$ yields
\[
C\mathbf{Y} = C\mathbf{X}\beta + C\varepsilon \quad \text{or} \quad \mathbf{Y}^* = \mathbf{X}^*\beta + \varepsilon^*, \tag{2.20}
\]
where $\mathbf{Y}^* = C\mathbf{Y}$, $\mathbf{X}^* = C\mathbf{X}$, and $\varepsilon^* = C\varepsilon$. Noting that $\varepsilon^*|\mathbf{X} \sim N\left(0, \sigma^2 I_n\right)$ by construction under Assumption A.5*,$^7$ the OLS estimator of $\beta$ in (2.20) is BLUE. Note that
\[
(\mathbf{X}^{*\prime}\mathbf{X}^*)^{-1}\mathbf{X}^{*\prime}\mathbf{Y}^* = (\mathbf{X}'C'C\mathbf{X})^{-1}\mathbf{X}'C'C\mathbf{Y}
= \left(\mathbf{X}'\Omega^{-1}\mathbf{X}\right)^{-1}\mathbf{X}'\Omega^{-1}\mathbf{Y} \equiv \hat\beta_{GLS}
\]
is called the generalized least squares (GLS) estimator of $\beta$ in (2.19).

$^7$$\mathrm{Var}(\varepsilon^*|\mathbf{X}) = \mathrm{Var}(C\varepsilon|\mathbf{X}) = C\,\mathrm{Var}(\varepsilon|\mathbf{X})\,C' = \sigma^2 C\Omega C' = \sigma^2 CC^{-1}(C')^{-1}C' = \sigma^2 I_n$.
Theorem 2.19 Suppose Assumptions A.1, A.3 and A.5* hold. Then
(a) (Unbiasedness) $E(\hat\beta_{GLS}|\mathbf{X}) = \beta$;
(b) (Variance-covariance matrix) $\mathrm{Var}(\hat\beta_{GLS}|\mathbf{X}) = \sigma^2\left(\mathbf{X}'\Omega^{-1}\mathbf{X}\right)^{-1}$;
(c) (Normality) $\hat\beta_{GLS} - \beta\,|\,\mathbf{X} \sim N\left(0, \sigma^2\left(\mathbf{X}'\Omega^{-1}\mathbf{X}\right)^{-1}\right)$;
(d) (Unbiasedness of $s^{*2}$) $E(s^{*2}|\mathbf{X}) = \sigma^2$;
(e) (Orthogonality) $\mathrm{Cov}(\hat\beta_{GLS}, \hat\varepsilon^*|\mathbf{X}) = 0$,
where $s^{*2} = \frac{1}{n-k}\hat\varepsilon^{*\prime}\hat\varepsilon^*$ and $\hat\varepsilon^* = \mathbf{Y}^* - \mathbf{X}^*\hat\beta_{GLS}$.

The proof of the above theorem is straightforward. Note that $\hat\beta_{GLS}$ is the OLS estimator of $\beta$ in the transformed model (2.20), which satisfies Assumptions A.1, A.3, and A.5 with $\varepsilon^*|\mathbf{X} \sim N\left(0, \sigma^2 I_n\right)$, so we must have by Theorem 2.7:
\[
E(\hat\beta_{GLS}|\mathbf{X}) = \beta,
\]
\[
\mathrm{Var}(\hat\beta_{GLS}|\mathbf{X}) = \sigma^2(\mathbf{X}^{*\prime}\mathbf{X}^*)^{-1} = \sigma^2\left(\mathbf{X}'\Omega^{-1}\mathbf{X}\right)^{-1},
\]
\[
\hat\beta_{GLS} \text{ is BLUE},
\]
\[
E(s^{*2}|\mathbf{X}) = \sigma^2,
\]
\[
\mathrm{Cov}(\hat\beta_{GLS}, \hat\varepsilon^*|\mathbf{X}) = 0.
\]
Remarks.
1. The classical t and F tests are applicable for inference procedures based on $\hat\beta_{GLS}$. But in practice, $\Omega$ is generally unknown, so $\hat\beta_{GLS}$ is usually infeasible. One has to estimate $\Omega$ in order to obtain a feasible GLS estimator of $\beta$. If we can estimate $\Omega$ consistently by $\hat\Omega$, then we can use the feasible GLS (FGLS) estimator $\hat\beta_{FGLS} \equiv \left(\mathbf{X}'\hat\Omega^{-1}\mathbf{X}\right)^{-1}\mathbf{X}'\hat\Omega^{-1}\mathbf{Y}$. Then one has to rely on large sample theory for justification.

2. Alternatively, we can continue to use $\hat\beta_{OLS}$, but obtain the correct variance-covariance formula
\[
\mathrm{Var}(\hat\beta|\mathbf{X}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\Omega\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}
\]
as well as a consistent estimator for it. In this case, the classical t and F tests cannot be used because they are based on an incorrect formula for $\mathrm{Var}(\hat\beta|\mathbf{X})$. Nevertheless, modified t and F tests (or Wald tests, to be introduced in the next section) are valid when based on a correct estimator of $\mathrm{Var}(\hat\beta|\mathbf{X})$. Again, one has to rely on large sample theory for justification.
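The equivalence between the GLS formula and OLS on the transformed model (2.20) is easy to verify numerically when $\Omega$ is known. The sketch below assumes a diagonal $\Omega$ with known weights purely for illustration (the weight function, seed, and parameter values are assumptions, not part of the notes), and uses $C = \Omega^{-1/2}$ so that $\Omega^{-1} = C'C$.

```python
# GLS sketch under a known heteroskedastic Omega (illustrative: Var(eps_i|X) = sigma^2 * w_i).
# Checks that OLS on the transformed data (CY, CX), with Omega^{-1} = C'C, reproduces the
# direct GLS formula (X' Omega^{-1} X)^{-1} X' Omega^{-1} Y.
import numpy as np

rng = np.random.default_rng(2)
n_g, k_g = 200, 3
X_g = np.column_stack([np.ones(n_g), rng.normal(size=(n_g, k_g - 1))])
w = np.exp(X_g[:, 1])                          # assumed known variance weights: Omega = diag(w)
beta_g = np.array([1.0, 2.0, -0.5])
y_g = X_g @ beta_g + np.sqrt(w) * rng.normal(size=n_g)

Omega_inv = np.diag(1.0 / w)
# Direct GLS formula
b_gls = np.linalg.solve(X_g.T @ Omega_inv @ X_g, X_g.T @ Omega_inv @ y_g)
# Equivalent: OLS on the transformed model with C = diag(1/sqrt(w))
C = np.diag(1.0 / np.sqrt(w))
Xs, ys = C @ X_g, C @ y_g
b_trans = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)

print(np.allclose(b_gls, b_trans))
print("OLS:", np.linalg.solve(X_g.T @ X_g, X_g.T @ y_g), "GLS:", b_gls)
```

In this weighted special case the transformation simply divides each observation by its error standard deviation, which is the familiar weighted least squares interpretation of GLS.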