Martin Wittenberg
School of Economics and SALDRU
University of Cape Town
2011
Contents

I

1 Probability  3

4 Asymptotic Theory  55
  4.1 Introduction  55
  4.2 Sequences, limits and convergence  55
    4.2.1 The limit of a mathematical sequence  55
    4.2.2 The probability limit of a sequence of random variables  57
    4.2.3 Rules for probability limits  58
    4.2.4 Convergence in distribution  59
    4.2.5 Rates of convergence  60
  4.3 Sampling, consistency and laws of large numbers  62
    4.3.1 Consistency  62
    4.3.2 Consistency of the sample CDF  63
    4.3.3 Consistency of method of moments estimation  64
  4.4 Asymptotic normality and central limit theorems  64
  4.5 Properties of Maximum Likelihood Estimators  67
  4.6 Appendix  68
    4.6.1 Chebyshev's Inequality  68

5 Statistical Inference  69
  5.1 Hypothesis Testing  69
    5.1.1 Type I and Type II errors  69
    5.1.2 Power of a test  69
  5.2 Types of tests  72
    5.2.1 The Wald Test  72
    5.2.2 The likelihood ratio test  73
    5.2.3 The Lagrange Multiplier test  73
  5.3 Worked example: The Pareto distribution  74
    5.3.1 Wald test  75
    5.3.2 Likelihood ratio test  75
    5.3.3 Lagrange multiplier test  76
  5.4 Worked example: The bivariate normal  76
    5.4.1 Wald Test of a single hypothesis  77
    5.4.2 Wald Test of the joint hypothesis  78
    5.4.3 Likelihood Ratio test  78
  5.5 Appendix: ML estimation of the bivariate normal distribution  80
    5.5.1 Maximum likelihood estimators  81
    5.5.2 Information matrix  83
    5.5.3 Asymptotic covariance matrix  84
    5.5.4 Log-likelihood  85
  5.6 Restricted Maximum Likelihood estimation  86
    5.6.1 Restricted Maximum Likelihood Estimators  86
    5.6.2 Restricted log-likelihood  88

II  89

7 Least Squares  107
  7.1 Introduction  107
  7.2 The Least Squares criterion  107
    7.2.1 The solution to the OLS problem  108
  7.3 The geometry of Least Squares  110
    7.3.1 Projection: the P matrix and the M matrix  112
    7.3.2 Algebraic properties of the Least Squares Solution  114
  7.4 Partitioned regression  116
    7.4.1 The Frisch-Waugh-Lovell Theorem  116
    7.4.2 Interpretation of the FWL theorem  117
    7.4.3 Alternative proof  117
    7.4.4 Applications of the FWL theorem  119
    7.4.5 Omitted variable bias  121
  7.5 Goodness of Fit  121
  7.6 Exercises  123
  7.7 Appendix: A worked example  124

8 Properties of the OLS estimators in finite samples  129
  8.1 Introduction  129
  8.2 Motivations for OLS  130
    8.2.1 Method of moments  130
    8.2.2 Minimum Variance Linear Unbiased Estimation  130
    8.2.3 Maximum likelihood estimation  130
  8.3 The mean and covariance matrix of the OLS estimator  131
    8.3.1 Unbiased Estimation  131
    8.3.2 The covariance matrix of β̂  131
    8.3.3 Estimating σ²  131
  8.4 Gauss-Markov Theorem  132
  8.5 Stochastic, but exogenous regressors  133
    8.5.1 Lack of bias  133
    8.5.2 The covariance matrix of β̂  134
    8.5.3 The estimator of σ²  134
    8.5.4 Gauss-Markov Theorem  134
  8.6 The normal linear regression model  134
    8.6.1 Finite sample distribution of the OLS estimators  135
    8.6.2 Maximum likelihood estimation  136
    8.6.3 The information matrix  137
  8.7 Data Issues  137
    8.7.1 Multicollinearity  138
    8.7.2 Influential data points  139
    8.7.3 Missing information  141
  8.8 Appendix  142
    8.8.1 The trace of a matrix  142
    8.8.2 Results on the multivariate normal distribution  143

9
    9.4.4 Consistency of σ̂²(X′X)⁻¹ as an estimator for var(β̂)  150
  9.5 Appendix: Alternative proof of consistency of σ̂²  150

III  189

14 Instrumental Variables  191
  14.1 Introduction  191
    14.1.1 The model  191
    14.1.2 Least squares bias and inconsistency  191
    14.1.3 Examples  192
    14.1.4 The problem of non-experimental data  193
  14.2 The instrumental variables solution  193
    14.2.1 Rationale  194
    14.2.2 Consistency  195
    14.2.3 Asymptotic normality  196
  14.3 The overidentified case  196
    14.3.1 Two stage least squares  197
    14.3.2 Test of the overidentifying restrictions  198
  14.4 IV and Ordinary Least Squares  199
    14.4.1 OLS as a special case of IV estimation  199
    14.4.2 Hausman specification test  199
    14.4.3 Hausman's test by means of an artificial regression  200
  14.5 Problems with IV estimation  200
    14.5.1 Finite sample properties  200
    14.5.2 Weak instruments  201
  14.6 Omitted variables  201
  14.7 Measurement error  202
    14.7.1 Attenuation bias  203

IV Systems of Equations

Solutions
  Solutions to Chapter 14  245
Part I
Chapter 1
Probability
The notion of probability is fundamental to everything that we will be doing in this course, but a proper (axiomatic) treatment of it is beyond our scope. At the core of the theory of probability is the concept of the sample space Ω, i.e. the set of all possible outcomes of the random experiment. An event A ⊆ Ω is said to occur if, and only if, the outcome ω of the experiment is such that ω ∈ A.

Example 1.1 Consider throwing a die. The sample space is Ω = {1, 2, 3, 4, 5, 6} and a possible event A is throwing an even number, i.e. A = {2, 4, 6}. Note that we have to be able to define all possible outcomes of the experiment. We have excluded certain outcomes, e.g. the die shattering or standing on edge or disappearing down a drain hole!
The fundamental defining axioms of probability theory are given by (Mittelhammer, Judge and Miller 2000, Appendix E1, pp. 4-5):

P(A) ≥ 0 for any event A   (1.1)

P(Ω) = 1   (1.2)

Let {A₁, A₂, A₃, ...} be a set of disjoint events contained in Ω; then

P(A₁ ∪ A₂ ∪ A₃ ∪ ···) = P(A₁) + P(A₂) + P(A₃) + ···   (1.3)

1.1.1

From these axioms it follows, for any events A and B, that

P(A) = P(A ∩ B) + P(A ∩ Bᶜ)   (1.4)

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)   (1.5)

1.1.2 Conditional probability
In certain situations we know that event B has definitely occurred. In this case we want to recalibrate the probabilities of other events occurring.

Definition 1.2 If P(B) ≠ 0, then the conditional probability of event A given event B is given by P(A|B) = P(A ∩ B)/P(B).

Note that using this definition it is trivial to show that P(B|B) = 1. In other words we are effectively redefining the sample space to include only outcomes that belong to B.

It follows from the definition that

Pr(A ∩ B) = Pr(A|B) Pr(B)

By applying this rule repeatedly we can extend this to any countable number of events:

Pr(A₁ ∩ A₂ ∩ ··· ∩ Aₙ) = Pr(A₁ | A₂ ∩ ··· ∩ Aₙ) Pr(A₂ | A₃ ∩ ··· ∩ Aₙ) ··· Pr(Aₙ₋₁ | Aₙ) Pr(Aₙ)
Theorem 1.3 Total probability
If the events Bᵢ are such that P(A|Bᵢ) is defined for all i and the events Bᵢ are mutually disjoint, i.e. Bᵢ ∩ Bⱼ = ∅ for i ≠ j, and ∪ᵢ Bᵢ = Ω, then

P(A) = Σᵢ P(A|Bᵢ) P(Bᵢ)

Theorem 1.4 Bayes's Rule
If P(B) > 0,

P(A|B) = P(B|A) P(A) / P(B)

Or more generally, if the conditions enumerated in the previous theorem hold, then

P(Bⱼ|A) = P(A|Bⱼ) P(Bⱼ) / Σᵢ P(A|Bᵢ) P(Bᵢ)   for all j
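As a small numerical sketch of Bayes's rule together with the total-probability denominator: the example below is a standard diagnostic-test illustration, and the prevalence and test accuracies are made-up numbers, not taken from the text.

```python
def bayes_posterior(prior, likelihood, likelihood_complement):
    """Pr(B | A) via Bayes's rule, with Pr(A) from the theorem of total probability."""
    # Pr(A) = Pr(A|B) Pr(B) + Pr(A|not B) Pr(not B)
    marginal = likelihood * prior + likelihood_complement * (1 - prior)
    return likelihood * prior / marginal

# Illustrative numbers: Pr(disease) = 0.01, Pr(positive | disease) = 0.95,
# Pr(positive | healthy) = 0.05.
posterior = bayes_posterior(0.01, 0.95, 0.05)  # Pr(disease | positive), about 0.16
```

Even with a fairly accurate test, the posterior probability is modest because the prior Pr(B) is small; this is exactly the recalibration of probabilities that conditioning describes.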
We say that two events are independent if knowledge that one event occurred does not change the probability that we would assign to the other event occurring, i.e.

Pr(A|B) = Pr(A)

It then follows immediately that Pr(A ∩ B) = Pr(A) Pr(B). This is, in fact, how we will define the statistical independence of events:

Definition 1.5 A and B are pairwise independent events if, and only if,

Pr(A ∩ B) = Pr(A) Pr(B)
1.2 Random variables

A random variable is a mapping from the sample space to the real numbers. In other words it is the outcome of a random experiment with real number values. We can define how probable certain of these values are. A useful way of summarising this information is by means of the concept of a probability distribution.

A random variable is discrete if the set of outcomes is either finite or countable. The random variable is continuous if the set of outcomes is not countable.

In the case of a discrete random variable, we can enumerate the probabilities associated with the outcomes. This gives the discrete probability distribution¹

f(x) = Pr(X = x)
This will have the properties

0 ≤ f(x) ≤ 1
Σₓ f(x) = 1

For the (absolutely) continuous case we can define a probability density function f which has the properties

f(x) ≥ 0
Pr(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx
∫₋∞^∞ f(x) dx = 1

Note that ∫ₐᵃ f(x) dx = 0, so Pr(X = a) = 0. Nevertheless there are situations where we want to combine continuous and discrete distributions, i.e. there may be particular points in the distribution (e.g. where X = 0) where there is a spike in the distribution. Such a concentration of probability at a single point is called a point mass. For such mixed distributions we need to define separately the density function for continuous points and for the discrete points (for more details see Mittelhammer et al. 2000, Chapter E1). In this case ∫ f(x) dx + Σᵢ f(xᵢ) = 1, where the sum is taken over all points xᵢ at which the distribution has a point mass.
All types of distribution can be uniquely described by the cumulative distribution function (cdf)

F(x) = Pr(X ≤ x)

This function must satisfy the following properties:

1. 0 ≤ F(x) ≤ 1
2. If a ≤ b, then F(a) ≤ F(b)
3. limₓ→∞ F(x) = 1
4. limₓ→₋∞ F(x) = 0
5. F must be right continuous, i.e. limₓ↓ₐ F(x) = F(a)

¹ In some statistical texts this is referred to as a probability mass function, to distinguish it from a probability density function (pdf), which applies to continuous variables. We will refer to both of them as probability density functions.
It is easy to see that the cdf of a discrete distribution must have jumps upwards at all the values where it has a point mass. In fact we must have

f(a) = F(a) − F(a−)

where F(a−) = limₓ↑ₐ F(x) is the left limit at a.

In the case of continuous random variables (and mixed distributions at points where there is no jump discontinuity) we will have

f(x) = dF(x)/dx

1.2.1 Exercises
1. Consider the function

f(x) = 1/2 if x = −2; 1/4 if x = 0; 1/4 if x = 1; 0 elsewhere

(a) Is f a valid pdf? If not, find an appropriate way to turn it into a proper pdf.
(b) What is Pr(X < 1)?

2. Consider the function

f(x) = 1/4 if x = 1/2; 1/8 if x = 3/4; 1/16 if x = 7/8; ...; 1/2^(k+1) if x = 1 − 1/2^k; ...; 1/2 if x = 1; 0 elsewhere

(a) Is f a valid pdf? If not, find an appropriate way to turn it into a proper pdf.
(b) What is Pr(X < 1)?
(c) Sketch the cdf of the distribution.
(d) Verify that f(1) = F(1) − F(1−).
3. Consider the function

f(x) = 5 if 0 ≤ x ≤ 0.1; 0 elsewhere

(a) Is f a valid pdf? If not, find an appropriate way to turn it into a proper pdf. Then sketch the pdf.

4. Consider the function

f(x) = 1 − x if 0 ≤ x ≤ 1; 0 elsewhere

(a) Is f a valid pdf? If not, find an appropriate way to turn it into a proper pdf. Then sketch the pdf.
(b) What is Pr(X ≤ 1/2)?
(c) Sketch the cdf of the distribution.

5. Consider the function

f(x) = 1 − x² if 0 ≤ x ≤ 1; 0 elsewhere

(a) Is f a valid pdf? If not, find an appropriate way to turn it into a proper pdf. Then sketch the pdf.
(b) What is Pr(X ≤ 1/2)?
(c) Sketch the cdf of the distribution.
6. Consider the function

F(x) = 0 if x < 0; x if 0 ≤ x < 1; 1 if x ≥ 1

(a) Is this a valid cdf? If not, find the most appropriate way to turn it into a valid cdf.
(b) Is it the cdf of a discrete, continuous or mixed distribution?
(c) Describe the pdf of the distribution.

7. Consider the function

F(x) = 0 if x < 0; x/2 if 0 ≤ x < 1; 1 if x ≥ 1

(a) Is this a valid cdf? If not, find the most appropriate way to turn it into a valid cdf.
(b) Is it the cdf of a discrete, continuous or mixed distribution?
(c) Describe the pdf of the distribution.

8. Consider the function

F(x) = 0 if x < 0; (1 + x)/2 if 0 ≤ x < 1; 1 if x ≥ 1

(a) Is this a valid cdf? If not, find the most appropriate way to turn it into a valid cdf.
(b) Is it the cdf of a discrete, continuous or mixed distribution?
(c) Describe the pdf of the distribution.
1.3 Expected values

The expected value (or mean) of a random variable X is E(X) = Σₓ x f(x) in the discrete case and E(X) = ∫ x f(x) dx in the continuous case. Usually it is denoted μ.

Let g(X) be a function of X. We can define a new random variable Y = g(X), defined such that y = g(x). We can calculate the expected value of Y. It is given by

E[g(X)] = Σₓ g(x) f(x)   if X is discrete
E[g(X)] = ∫ g(x) f(x) dx   if X is continuous

The variance of X is

var(X) = E[(X − μ)²]
       = Σₓ (x − μ)² f(x)   if X is discrete
       = ∫ (x − μ)² f(x) dx   if X is continuous

It is usually denoted σ².

Note that according to this definition, μ = μ′₁ (the first raw moment) and σ² = μ₂ (the second central moment). There is an interesting interrelationship between these types of moments. In the case of the variance it is easy to show that

σ² = μ′₂ − μ²
Some other useful statistics:

coefficient of skewness = E[(X − μ)³]/σ³ = μ₃/μ₂^(3/2)

coefficient of kurtosis = E[(X − μ)⁴]/σ⁴ = μ₄/μ₂²

Note that these are standardised measures, because we have normalised them relative to the variance of the distribution. This means that they are independent of the particular units in which the outcomes of the random variable are measured. A symmetric distribution will have a skewness of zero. A distribution with a longer right tail (skewed to the right) will have a positive skewness. The kurtosis measures how peaked the distribution is. The normal distribution has a kurtosis of 3. Any distribution which has a kurtosis higher than 3 is said to be leptokurtic (thin peaked) while if it has a kurtosis less than 3 it is said to be platykurtic (flat peaked).

It turns out that in certain cases it may be possible to calculate the moments in a different way by means of the moment generating function. We define this in the appendix to this chapter.
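The definitions above translate directly into code. A minimal sketch (the two-point distribution used here is an arbitrary illustration, not an example from the text):

```python
def moments(pdf):
    """Mean, variance, skewness and kurtosis of a discrete pdf {x: f(x)}."""
    mu = sum(x * p for x, p in pdf.items())
    var = sum((x - mu) ** 2 * p for x, p in pdf.items())
    skew = sum((x - mu) ** 3 * p for x, p in pdf.items()) / var ** 1.5
    kurt = sum((x - mu) ** 4 * p for x, p in pdf.items()) / var ** 2
    return mu, var, skew, kurt

# Symmetric two-point distribution: equal mass at -1 and +1.
mu, var, skew, kurt = moments({-1: 0.5, 1: 0.5})
```

As the text suggests, the symmetric distribution has skewness 0, and its kurtosis of 1 (below 3) marks it as platykurtic.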
1.4 Some important discrete distributions

1.4.1 Bernoulli

This is the simplest random variable. It can take on only two values, zero and one. The full description of the pdf is:

f(1) = Pr(X = 1) = p
f(0) = Pr(X = 0) = (1 − p)

where 0 ≤ p ≤ 1. This can be put more elegantly as follows:

f(x) = pˣ (1 − p)^(1−x),   x ∈ {0, 1}, p ∈ [0, 1]

Applying the definitions, it is straightforward to show that the mean of this random variable is p and its variance is p(1 − p).

1.4.2 Binomial

The binomial distribution is used to model the number of successes x in n experiments, where each trial is independent of the next and the outcome of each trial is a Bernoulli random variable with the same parameter p. The pdf of this random variable is given by

f(x) = [n!/(x!(n − x)!)] pˣ (1 − p)^(n−x),   x ∈ {0, 1, ..., n}, p ∈ [0, 1]

The mean is np and the variance is np(1 − p).
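The binomial pdf can be evaluated directly from the formula above. A small sketch (n and p are arbitrary illustrative values, not taken from the text):

```python
from math import comb

def binomial_pdf(x, n, p):
    # f(x) = C(n, x) * p^x * (1 - p)^(n - x)
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.3
total = sum(binomial_pdf(x, n, p) for x in range(n + 1))     # probabilities sum to 1
mean = sum(x * binomial_pdf(x, n, p) for x in range(n + 1))  # equals n * p = 3
```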
Exercises
1. Assuming that the probability of passing a certain econometrics test is 8 for every member
of the class. There are six people in the class. What is
(a) The probability that at least one person passes?
(b) The probability that three or more people pass?
10
1.5 Some important continuous distributions

1.5.1 Normal

The pdf of the normal distribution with mean μ and variance σ² is given by

f(x; μ, σ²) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)),   x ∈ (−∞, ∞), μ ∈ (−∞, ∞), σ² ∈ (0, ∞)   (1.6)

1.5.2 Chi-squared

The pdf of the chi-squared distribution is

f(x; v) = x^((v−2)/2) e^(−x/2) / (2^(v/2) Γ(v/2)),   x ∈ (0, ∞), v ∈ {1, 2, 3, ...}   (1.7)

The parameter v is called the degrees of freedom of the distribution. The function Γ(·) is the gamma function, Γ(α) = ∫₀^∞ t^(α−1) e^(−t) dt.

1.5.3 Student's t
f(t; v) = [Γ((v+1)/2) / (√(vπ) Γ(v/2))] (1 + t²/v)^(−(v+1)/2),   t ∈ (−∞, ∞), v ∈ {1, 2, 3, ...}   (1.8)

Its variance is v/(v − 2), if v > 2.

The distribution has a symmetrical shape similar in appearance to the normal distribution, but it has thicker tails. Indeed we see from the formula above that, for sufficiently low degrees of freedom, the variance does not even exist, indicating that there is too much mass in the tails. As v increases, the t(v) distribution approaches the N(0, 1) distribution.
Exercises

1. Graph the following pdfs on the same set of axes: t(1), t(5), t(25), N(0, 1).
2. Calculate the value of the cdf at t = 3 in these cases.
1.5.4 F

The pdf of the F distribution is

f(x; v₁, v₂) = [Γ((v₁+v₂)/2) / (Γ(v₁/2) Γ(v₂/2))] (v₁/v₂)^(v₁/2) x^((v₁−2)/2) (1 + (v₁/v₂)x)^(−(v₁+v₂)/2),   x ∈ (0, ∞), v₁, v₂ ∈ {1, 2, 3, ...}   (1.9)

The mean of an F distribution is

E(X) = v₂/(v₂ − 2),   provided v₂ > 2

The variance is

2v₂²(v₁ + v₂ − 2) / [v₁(v₂ − 2)²(v₂ − 4)],   provided v₂ > 4

Exercises

1. Graph the following distributions: F(5, 10), F(5, 30), F(5, 100).
2. Calculate the value of the cdf at x = 2.5 for these three distributions.
3. Let the random variable Y = v₁X, where X ~ F(v₁, v₂). The pdf of Y will be given by

f(y; v₁, v₂) = [Γ((v₁+v₂)/2) / (Γ(v₁/2) Γ(v₂/2))] (1/v₂)^(v₁/2) y^((v₁−2)/2) (1 + y/v₂)^(−(v₁+v₂)/2),   y ∈ (0, ∞), v₁, v₂ ∈ {1, 2, 3, ...}   (1.10)

Graph the distributions of 5·F(5, 10), 5·F(5, 30), 5·F(5, 100) and χ²(5) on the same set of axes.
1.5.5 Noncentral distributions

The pdf of the χ² distribution given above was for the central chi-square distribution. Most hypothesis tests are based on this distribution. If we square a normal variable that has a non-zero mean, we get the noncentral chi-square. Correspondingly there are noncentral versions of the t and F distributions. These become important particularly if we want to investigate the distribution of test statistics under the alternative hypothesis.
1.5.6 Gamma

The pdf of the gamma distribution is

f(x; α, β) = (1/(β^α Γ(α))) x^(α−1) e^(−x/β),   x ∈ (0, ∞), α > 0, β > 0

1.5.7 Exponential

This is a particularly simple distribution which crops up frequently in applied work. Its pdf is given by

f(x; θ) = (1/θ) e^(−x/θ),   x ∈ (0, ∞), θ ∈ (0, ∞)

Exercises

1. Graph the exponential distributions with θ = 0.2, θ = 1, and θ = 5.
2. Evaluate the cdfs at x = 2.
1.5.8 Beta

Like the gamma distribution, this is a flexible distribution that can capture many particular shapes. It is defined on a bounded interval only, so it is particularly appropriate for contexts where the range of the random variable is bounded. Its pdf is:

f(x; α, β) = [Γ(α + β)/(Γ(α) Γ(β))] x^(α−1) (1 − x)^(β−1)   if 0 ≤ x ≤ 1; 0 otherwise

It has mean α/(α + β).

Exercises

1. Graph the following distributions: Beta(2, 4), Beta(4, 2), Beta(1, 2).

1.5.9 Logistic

The logistic distribution Λ(α, β) is sometimes used in applications where one requires thicker tails than the normal distribution has. Otherwise it is a bell-shaped, symmetric curve quite similar to the normal. Its pdf is given by

f(x; α, β) = exp(−(x − α)/β) / (β [1 + exp(−(x − α)/β)]²),   −∞ < x < ∞, β ∈ (0, ∞)

In this case it is also possible to give a closed form for the cdf:

F(x; α, β) = 1 / (1 + exp(−(x − α)/β))

Its mean is α and its variance is (βπ)²/3.
Exercises

1. Find α and β such that Λ(α, β) has the same mean and variance as N(0, 1). Plot these two distributions on top of each other. If you can, zoom into the tail to verify that the logistic distribution has fatter tails.
1.5.10 Cauchy

The Cauchy distribution has pdf

f(x) = 1/(π(1 + x²)),   x ∈ (−∞, ∞)   (1.11)

Exercises

1. Verify that equation 1.11 does define a legitimate pdf.
2. Verify that the Cauchy distribution does not have a mean.
3. Graph the Cauchy distribution.
4. What, if any, is the connection between the Cauchy distribution and the t distribution?
1.5.11 Uniform

To leave the simplest distribution to last: the uniform distribution U(a, b) states that every outcome in the interval [a, b] is equally probable. Its pdf is given by

f(x; a, b) = 1/(b − a),   x ∈ [a, b]

Its mean is (a + b)/2 and its variance is (b − a)²/12.

Exercises

1. Assume that X ~ U(0, 1). Calculate Pr(0.2 ≤ X ≤ 0.7).
1.6 Transformations of random variables

It frequently happens that we want to define a new random variable as some function of an existing random variable, e.g. Y = e^X; more generally:

Y = g(X)

Provided that this is a one-to-one transformation (i.e. g is monotonically increasing or decreasing), so that there is an inverse transformation X = g⁻¹(Y), it is straightforward to calculate probabilities in this new distribution:

Pr(a ≤ Y ≤ b) = Pr(g⁻¹(a) ≤ X ≤ g⁻¹(b))

where we have assumed for the moment that g is increasing. If we let the pdf of X be f and the pdf of Y be h, it follows that

∫ₐᵇ h(y) dy = ∫ from g⁻¹(a) to g⁻¹(b) of f(x) dx

If we know what f is, we can simply use the facts that x = g⁻¹(y) and dx = (g⁻¹)′(y) dy, and change the variable of integration on the right-hand side, i.e.

∫ₐᵇ h(y) dy = ∫ₐᵇ f(g⁻¹(y)) (g⁻¹)′(y) dy

Since this is true for any values of a and b, it is easy to see that we must have h(y) = f(g⁻¹(y)) (g⁻¹)′(y).

If instead g is decreasing, then

Pr(a ≤ Y ≤ b) = Pr(g⁻¹(b) ≤ X ≤ g⁻¹(a))

so that

∫ₐᵇ h(y) dy = ∫ from g⁻¹(b) to g⁻¹(a) of f(x) dx

We can again rewrite the right-hand side using x = g⁻¹(y) and dx = (g⁻¹)′(y) dy. Note that this expression will now be negative, since (g⁻¹)′(y) is negative. Changing variables on the right-hand side again, we see that

∫ₐᵇ h(y) dy = ∫ from b to a of f(g⁻¹(y)) (g⁻¹)′(y) dy = −∫ₐᵇ f(g⁻¹(y)) (g⁻¹)′(y) dy

In this case we must have h(y) = −f(g⁻¹(y)) (g⁻¹)′(y). We can combine both cases by writing

h(y) = f(g⁻¹(y)) |dg⁻¹(y)/dy|

Exercises

1. Let X have a distribution with pdf f; what is the pdf of the new random variable Y = a + bX, where a and b are both nonzero?
2. Let X have an exponential distribution with parameter θ. Find the distribution of Y = X².
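The change-of-variables formula is easy to check numerically. The sketch below assumes X ~ exponential with θ = 1 and the increasing transform Y = e^X; both choices are purely illustrative (for this pair the formula simplifies to h(y) = 1/y² on y ≥ 1).

```python
import math

def f_X(x):
    # exponential pdf with theta = 1
    return math.exp(-x) if x >= 0 else 0.0

def h_Y(y):
    # h(y) = f(g^{-1}(y)) * |d g^{-1}(y)/dy| with g^{-1}(y) = ln(y)
    return f_X(math.log(y)) * (1.0 / y)

check = h_Y(3.0)  # should equal 1 / 3^2
```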
1.6.1 The lognormal distribution

Assume that Y = e^X, where X ~ N(μ, σ²); then we say that Y is lognormal, i.e. ln Y ~ N(μ, σ²). Applying the formula above, we see that the pdf of this variable is

f(y; μ, σ²) = (1/(y σ√(2π))) exp(−(ln y − μ)²/(2σ²)),   y ∈ (0, ∞)

The mean of Y is

E(Y) = ∫₀^∞ y f(y) dy

Substituting y = e^x, this becomes

E(Y) = ∫₋∞^∞ (1/(σ√(2π))) exp(x − (x − μ)²/(2σ²)) dx

Completing the square in the exponent,

x − (x − μ)²/(2σ²) = −[(x − (μ + σ²))² − 2μσ² − σ⁴]/(2σ²) = −(x − (μ + σ²))²/(2σ²) + μ + σ²/2

so that

E(Y) = e^(μ + σ²/2) ∫₋∞^∞ (1/(σ√(2π))) exp(−(x − (μ + σ²))²/(2σ²)) dx

The term that is being integrated is just the pdf of a variable that is N(μ + σ², σ²), so the integral must evaluate to one. The fundamental result is that

E(Y) = e^(μ + σ²/2)   (1.12)

We note that we cannot simply take the mean of ln Y and antilog it. We need to add in the correction σ²/2. The reason for this correction is that the lognormal is no longer a symmetric distribution. The right tail of the distribution is much longer than the left tail and this shifts the mean up.
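Equation (1.12) can be checked by simulation; the sketch below uses Python's standard library, and the seed, sample size and parameters are arbitrary choices.

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is reproducible
mu, sigma = 0.0, 1.0
n = 200_000

# random.lognormvariate draws Y with ln(Y) ~ N(mu, sigma^2)
sample_mean = sum(random.lognormvariate(mu, sigma) for _ in range(n)) / n

theoretical = math.exp(mu + sigma ** 2 / 2)  # about 1.649
naive = math.exp(mu)                         # antilog of E(ln Y): too small
```

The sample mean lands near exp(μ + σ²/2), not near the naive antilogged mean of ln Y, which is what the σ²/2 correction captures.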
Exercises

1. Graph the following lognormal distributions: LN(0, 1), LN(0, 2) and LN(0, 4) on the same set of axes. What do you observe?
2. What are the means for these distributions? The medians? And the modes? Comment on what you observe.
1.7 Appendix: The moment generating function

The moment generating function (MGF) of a random variable X is defined as

M_X(t) = E(e^(tX))

Moment generating functions are extremely useful (when they exist) because they uniquely identify a distribution. In a sense they act as fingerprints of that distribution (Mittelhammer et al. 2000, Chapter E1, p. 44). A useful feature in this regard is that if X and Y are independent, then the MGF of X + Y is M_X(t) M_Y(t). If we can identify this MGF, we can deduce the distribution of the random variable (see the exercises below).

Distribution        MGF
Bernoulli(p)        M(t) = pe^t + (1 − p)
Binomial(n, p)      M(t) = (1 − p + pe^t)^n
N(μ, σ²)            M(t) = exp(μt + σ²t²/2)
χ²(v)               M(t) = (1 − 2t)^(−v/2), t < 1/2
Gamma(α, β)         M(t) = (1 − βt)^(−α), t < 1/β
Λ(α, β)             M(t) = e^(αt) πβt csc(πβt)
Student's t         MGF does not exist
F                   MGF does not exist
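For a discrete distribution the MGF can be computed directly from its definition. The sketch below checks the Bernoulli entry in the table above and the product rule for independent sums; p and t are arbitrary illustrative values.

```python
import math

def mgf_discrete(pdf, t):
    # M(t) = E[exp(t * X)] for a discrete pdf given as {x: f(x)}
    return sum(math.exp(t * x) * prob for x, prob in pdf.items())

p, t = 0.3, 0.7
bernoulli = {0: 1 - p, 1: p}
direct = mgf_discrete(bernoulli, t)
formula = p * math.exp(t) + (1 - p)   # the Bernoulli entry in the table

# The sum of two independent Bernoulli(p)'s is Binomial(2, p); its MGF is
# the product of the two individual MGFs, i.e. direct ** 2.
binomial2 = {0: (1 - p) ** 2, 1: 2 * p * (1 - p), 2: p ** 2}
product = mgf_discrete(binomial2, t)
```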
1.7.1 Exercises

4. Using the formula for the MGF of a χ²(v) random variable, show that μ′₁ = v and σ² = 2v.

5. Using the MGF, show that if X₁ ~ N(μ₁, σ₁²) and X₂ ~ N(μ₂, σ₂²), with X₁ and X₂ independent of each other, then X₁ + X₂ ~ N(μ₁ + μ₂, σ₁² + σ₂²). Find a simple example where X₁ and X₂ are not independent of each other and X₁ + X₂ is not distributed as N(μ₁ + μ₂, σ₁² + σ₂²).

6. Using the MGF, show that if X ~ χ²(v₁) and Y ~ χ²(v₂), with X and Y independent, then X + Y ~ χ²(v₁ + v₂).

7. Derive the MGF for the exponential distribution from first principles. Check, using the fact that the exponential distribution is Gamma(1, θ), that your answer is correct.
Chapter 2

2.1 Joint distributions

In the previous chapter we considered only univariate distributions. In most situations of interest to the economist, however, the outcome of the random experiment can most usefully be thought of as a vector. For instance, when we collect survey information we tend to collect information from the same individual on more than one variable. If the outcome of the experiment can be captured by the variables X₁, X₂, ..., Xₙ, then we can think of the outcome in terms of the random vector X = (X₁, X₂, ..., Xₙ).

Many of the definitions can be extended very easily to this case.
Definition 2.1 Cumulative density function
The cdf $F(\mathbf{x})$ of the random vector $\mathbf{X}$ is defined as

$F(\mathbf{x}) = \Pr(\mathbf{X} \le \mathbf{x})$

Note that the vector inequality holds only if the inequality holds for every one of the components of the vector, so

$\Pr(\mathbf{X} \le \mathbf{x}) = \Pr\left((X_1 \le x_1) \text{ and } (X_2 \le x_2) \text{ and } \dots \text{ and } (X_n \le x_n)\right)$
The properties of the cdf are as before. In particular the function must be nondecreasing
and must be continuous from the right (where this applies to each dimension). Furthermore
there will again be two types of probability distributions that can be defined in terms of the
cumulative density functions: discrete and continuous.
2.1.1 Discrete distributions
As in the univariate case, a joint distribution of discrete variables will show jumps in the cdf
at the points (which are now vectors) where there is positive probability. The size of the jump
will again be equal to the probability attached to that precise outcome. It turns out, however, that there will now be many more points at which there are jumps, but where there is zero probability (see Figure 2.1 and exercise 2 below). This means that it is more difficult to recover the corresponding probability distribution function.
We can define the joint probability distribution as

$f(x_1, x_2, \dots, x_n) = \Pr(X_1 = x_1 \text{ and } X_2 = x_2 \text{ and } \dots \text{ and } X_n = x_n)$

As before we must have

$0 \le f(\mathbf{x}) \le 1$

$\sum_{\mathbf{x}} f(\mathbf{x}) = 1$
Example 2.2 Assume that we have enumerated a population of ten individuals and have ascertained that the following combinations $(x_1, x_2)$ of measurements are possible:
(1, 1)   (0, 2)   (2, 3)   (1, 3)   (0, 1)
(2, 1)   (3, 4)   (2, 4)   (2, 3)   (3, 3)
The outcome of a random draw from this population defines the random vector $\mathbf{X} = (X_1, X_2)$ with the following joint distribution:
Joint probabilities

              $x_1 = 0$   $x_1 = 1$   $x_1 = 2$   $x_1 = 3$   Marginal $x_2$
$x_2 = 1$        0.1         0.1         0.1         0            0.3
$x_2 = 2$        0.1         0           0           0            0.1
$x_2 = 3$        0           0.1         0.2         0.1          0.4
$x_2 = 4$        0           0           0.1         0.1          0.2
Marginal $x_1$   0.2         0.2         0.4         0.2
Based on these outcomes we can graph the cumulative distribution function as in Figure 2.1.
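The bookkeeping in a table like this is easy to automate. The following is a hedged sketch (assuming NumPy; the array layout, with rows indexed by $x_1 = 0,\dots,3$ and columns by $x_2 = 1,\dots,4$, is a choice made here for illustration) that recovers the marginal distributions by summing the joint table along each axis.

```python
import numpy as np

# Joint probabilities of Example 2.2: rows are x1 = 0..3, columns x2 = 1..4.
joint = np.array([
    [0.1, 0.1, 0.0, 0.0],   # x1 = 0
    [0.1, 0.0, 0.1, 0.0],   # x1 = 1
    [0.1, 0.0, 0.2, 0.1],   # x1 = 2
    [0.0, 0.0, 0.1, 0.1],   # x1 = 3
])
marginal_x1 = joint.sum(axis=1)   # sum out x2
marginal_x2 = joint.sum(axis=0)   # sum out x1
print(marginal_x1)  # [0.2 0.2 0.4 0.2]
print(marginal_x2)  # [0.3 0.1 0.4 0.2]
```

The two vectors printed agree with the margins of the table above, and the entries of the joint table sum to one as a proper pmf must.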
Exercises
1. For the example given above find $F(1, 2)$, $F(3, 0)$ and $F(2, 5)$.
2. How do you explain the jump in Figure 2.1 at the point $(1, 2)$ even though $\Pr(\mathbf{X} = (1, 2)) = 0$?
3. Assume that you are given the following definition of a function:

$F(x, y) = \begin{cases} 0 & \text{if } (x < 1) \text{ or } (y < 2) \\ 0.3 & \text{if } (1 \le x < 3) \text{ and } (y \ge 2) \\ 0.7 & \text{if } (x \ge 3) \text{ and } (2 \le y < 5) \\ 1 & \text{if } (x \ge 3) \text{ and } (y \ge 5) \end{cases}$

Generate a contour plot (bands in which the probability is the same) for this function. Is this a valid cdf? If yes, derive the corresponding joint distribution. If no, find some way of turning it into a proper cdf and then provide the joint distribution.
Figure 2.1: The cdf of a joint distribution is a nondecreasing function with jumps at all points
where there is a point mass but some additional jumps as well.
2.1.2 Continuous distributions
In the case of continuous distributions the relationship to the joint density is given by

$f(x_1, x_2, \dots, x_n) = \frac{\partial^n F(x_1, x_2, \dots, x_n)}{\partial x_1\, \partial x_2 \cdots \partial x_n}$

$F(x_1, x_2) = \int_{-\infty}^{x_1}\int_{-\infty}^{x_2} f(t_1, t_2)\, dt_2\, dt_1$

As before we must have

$f(\mathbf{x}) \ge 0$

$\int f(\mathbf{x})\, d\mathbf{x} = 1$

where it is understood that the integral is taken over the entire domain of the random vector.
Example 2.3 Assume that the function $f$ is defined as

$f(x, y) = \begin{cases} 2 & \text{if } 0 \le x \le y \text{ and } 0 \le y \le 1 \\ 0 & \text{elsewhere} \end{cases}$

A three-dimensional plot of this function looks as follows:

[Figure: three-dimensional plot of $f(x, y)$.]

We can verify that the density integrates to one:

$\int_0^1 \int_0^y 2\, dx\, dy = \int_0^1 2y\, dy = \left[y^2\right]_0^1 = 1$
2.2 Marginal distributions

Frequently we are interested in the behaviour of one of the components of the random vector while ignoring the rest. We define the marginal pdf of the random variable $X_i$ as

$f_i(x_i) = \begin{cases} \sum_{\mathbf{x}_{(i)}} f\left(x_i, \mathbf{x}_{(i)}\right) & \text{if } \mathbf{X} \text{ is discrete} \\ \int f\left(x_i, \mathbf{x}_{(i)}\right) d\mathbf{x}_{(i)} & \text{if } \mathbf{X} \text{ is continuous} \end{cases}$

where the vector $\mathbf{x}_{(i)}$ is the vector of all the other random variables in the random vector $\mathbf{X}$.
Example 2.4 The marginal distributions of the discrete distribution considered in Example 2.2 are given in the margins. They are:

$f_1(x_1) = \begin{cases} 0.2 & \text{if } x_1 = 0 \\ 0.2 & \text{if } x_1 = 1 \\ 0.4 & \text{if } x_1 = 2 \\ 0.2 & \text{if } x_1 = 3 \\ 0 & \text{elsewhere} \end{cases}$

$f_2(x_2) = \begin{cases} 0.3 & \text{if } x_2 = 1 \\ 0.1 & \text{if } x_2 = 2 \\ 0.4 & \text{if } x_2 = 3 \\ 0.2 & \text{if } x_2 = 4 \\ 0 & \text{elsewhere} \end{cases}$

It is easy to see that both of these are valid univariate discrete distributions.
Example 2.5 The marginal distributions of the continuous distribution considered in Example 2.3 can be worked out as follows:

$f_2(y) = \int_0^y 2\, dx = 2y \quad \text{where } 0 \le y \le 1$

$f_1(x) = \int_x^1 2\, dy = 2 - 2x \quad \text{where } 0 \le x \le 1$

Note that in the second case we rewrote the domain of the function as $0 \le x \le 1$ and $x \le y \le 1$, which is equivalent to the domain definition that we started out with. We needed to do this to ensure that we had no $y$ variable left in the definition of the marginal distribution.
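These integrations can be checked numerically. The following is a hedged sketch (assuming NumPy; plain Riemann sums are used to stay self-contained, and the evaluation points 0.3 and 0.7 are arbitrary choices) that integrates the joint density of Example 2.3 over one argument and compares the result with the marginals $f_1(x) = 2 - 2x$ and $f_2(y) = 2y$.

```python
import numpy as np

# Joint density of Example 2.3: f(x, y) = 2 on 0 <= x <= y <= 1, else 0.
def f(x, y):
    return np.where((0 <= x) & (x <= y) & (y <= 1), 2.0, 0.0)

grid = np.linspace(0, 1, 2001)
step = grid[1] - grid[0]

x0 = 0.3
f1_numeric = np.sum(f(x0, grid)) * step    # integrate over y at x = 0.3
f2_numeric = np.sum(f(grid, 0.7)) * step   # integrate over x at y = 0.7
print(f1_numeric, 2 - 2 * x0)   # both close to 1.4
print(f2_numeric, 2 * 0.7)      # both close to 1.4
```

The small discrepancy comes from the crude Riemann approximation at the boundary of the support and shrinks as the grid is refined.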
2.3 Conditional distributions

2.3.1 Discrete conditional distributions

Recall that the conditional probability of the event $A$ given the event $B$ is

$\Pr(A|B) = \frac{\Pr(A \cap B)}{\Pr(B)}$

We can use this notion to define the conditional distribution of a random variable given that one variable takes on a particular value:

$f(x_2|x_1) = \frac{f(x_1, x_2)}{f_1(x_1)} \quad (2.1)$
For instance, using the joint distribution in Example 2.2, the conditional distribution of $X_2$ given $X_1 = 1$ is

$f(1|1) = \frac{f(1, 1)}{f_1(1)} = \frac{0.1}{0.2} = 0.5$

$f(2|1) = \frac{f(1, 2)}{f_1(1)} = \frac{0}{0.2} = 0$

$f(3|1) = \frac{f(1, 3)}{f_1(1)} = \frac{0.1}{0.2} = 0.5$

$f(4|1) = \frac{f(1, 4)}{f_1(1)} = \frac{0}{0.2} = 0$

In short the conditional pdf is given by $f(1) = 0.5$, $f(3) = 0.5$ and $f(x_2) = 0$ everywhere else. This function meets all the conditions of a proper pdf.
Exercises
1. Find the conditional distribution $f(x_1|x_2 = 1)$, using the joint distribution in Example 2.2. Verify that it is a proper distribution.
2. Find the conditional distribution $f(x|y = 2)$ using the joint distribution in Exercise 3.
2.3.2 Continuous conditional distributions

In the case of continuous random variables the probability that a random variable takes on a particular value is always zero. Nevertheless we can still define a conditional pdf in exactly the same way as we have done for the discrete case, i.e.

$f(x_2|x_1) = \frac{f(x_1, x_2)}{f_1(x_1)}$

For instance, using the joint distribution in Example 2.3,

$f(y|x = 0.5) = \frac{f(0.5, y)}{f_1(0.5)} = \begin{cases} 2 & \text{if } 0.5 \le y \le 1 \\ 0 & \text{elsewhere} \end{cases}$

Observe that $f(y|x = 0.5)$ is the $U(0.5, 1)$ distribution.
Exercises

1. Find the conditional distribution $f(x|y = 0.5)$ using the joint distribution given in Example 2.3.

2. Consider the following function:

$f(x, y) = \begin{cases} 8xy & \text{if } 0 \le x \le y \text{ and } 0 \le y \le 1 \\ 0 & \text{elsewhere} \end{cases}$

Find the marginal and conditional distributions.
2.3.3 Independence

Observe that (as with conditional probability) we can rewrite the definition of a conditional distribution (equation 2.1) as

$f(x_1, x_2, \dots, x_n) = f(x_2, \dots, x_n|x_1)\, f_1(x_1)$

Note that we can iterate this definition in just the same way as we did in the case of probabilities, i.e.

$f(x_1, x_2, \dots, x_n) = f(x_n|x_1, \dots, x_{n-1}) \cdots f(x_2|x_1)\, f_1(x_1)$

Intuitively, the joint distribution of the random variables $X_2, \dots, X_n$ is independent of $X_1$ if knowledge of $X_1$ does not change our assessment of the probability of particular joint outcomes of the variables $X_2, \dots, X_n$, i.e. if $f(x_2, \dots, x_n|x_1) = f(x_2, \dots, x_n)$.

Definition 2.8 We will say that the variables $X_1, X_2, \dots, X_n$ are statistically independent if the joint distribution $f(x_1, x_2, \dots, x_n)$ can be written as the product of the marginal distributions, i.e.

$f(x_1, x_2, \dots, x_n) = f_1(x_1)\, f_2(x_2) \cdots f_n(x_n)$

Note that this implies that the probability of the joint event $X_1 \le x_1$ and $X_2 \le x_2$ ... and $X_n \le x_n$ will be just the product of the probability that $X_1 \le x_1$ and the probability that $X_2 \le x_2$ ... and the probability that $X_n \le x_n$. In the case of continuous variables we can show this as follows:
$\Pr(X_1 \le x_1, \dots, X_n \le x_n) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f(t_1, \dots, t_n)\, dt_n \cdots dt_1$

$= \int_{-\infty}^{x_1} f_1(t_1)\, dt_1 \int_{-\infty}^{x_2} f_2(t_2)\, dt_2 \cdots \int_{-\infty}^{x_n} f_n(t_n)\, dt_n$

$= \Pr(X_1 \le x_1)\, \Pr(X_2 \le x_2) \cdots \Pr(X_n \le x_n)$
Exercises
1. Are the variables $X_1$ and $X_2$ in Example 2.2 statistically independent? Explain.

2. Are the variables $X$ and $Y$ in Example 2.3 statistically independent? Explain.
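For a discrete distribution given as a table, Definition 2.8 can be checked mechanically: the variables are independent if and only if every joint probability equals the product of the corresponding marginals. The following is a hedged sketch (assuming NumPy; the array layout is an illustrative choice) applied to Example 2.2.

```python
import numpy as np

# Joint table of Example 2.2: rows x1 = 0..3, columns x2 = 1..4.
joint = np.array([
    [0.1, 0.1, 0.0, 0.0],
    [0.1, 0.0, 0.1, 0.0],
    [0.1, 0.0, 0.2, 0.1],
    [0.0, 0.0, 0.1, 0.1],
])
# Outer product of the marginals: what the table would be under independence.
product = np.outer(joint.sum(axis=1), joint.sum(axis=0))
independent = np.allclose(joint, product)
print(independent)  # False: e.g. f(0, 3) = 0 but f1(0) * f2(3) = 0.08
```

A single mismatching cell is enough to rule out independence, which is what happens here.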
2.4 Expectations

The definition of the expected value of a variable is a simple extension of the univariate case:

$E(X_i) = \int x_i f(x_1, \dots, x_n)\, dx_1 \cdots dx_n$

where the integral is taken over the entire domain of the joint distribution. In the expression above we can integrate out all the variables except for $x_i$, so it is relatively easy to see that this will be equivalent to evaluating the expectation on the marginal distribution, i.e.

$E(X_i) = \int x_i f_i(x_i)\, dx_i$
2.4.1 Conditional expectations
2.4.2 Covariance

One of the most commonly used expectations involving two variables is the covariance, defined as

$\operatorname{Cov}(X, Y) = E\left[(X - E(X))(Y - E(Y))\right]$
If $X$ and $Y$ are statistically independent, so that the joint distribution factors into the product of the marginal distributions, it follows from the definition that the covariance and the correlation coefficient are of necessity zero. The converse result does not always hold. It does, however, hold in the case of the multivariate normal distribution, as we will show below.
Exercises
1. Calculate the covariance and the correlation coefficient for $X_1$ and $X_2$ in Example 2.2. (Hint: let $U = X - E(X)$ and $V = Y - E(Y)$.)

2. Calculate the covariance and the correlation coefficient for $X$ and $Y$ in Example 2.3.
2.4.3 Expectation of a random matrix

The expectation of a random matrix is defined element by element. If

$\mathbf{X} = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1n} \\ X_{21} & X_{22} & \cdots & X_{2n} \\ \vdots & & & \vdots \\ X_{m1} & X_{m2} & \cdots & X_{mn} \end{bmatrix}$

then

$E(\mathbf{X}) = \begin{bmatrix} E(X_{11}) & E(X_{12}) & \cdots & E(X_{1n}) \\ E(X_{21}) & E(X_{22}) & \cdots & E(X_{2n}) \\ \vdots & & & \vdots \\ E(X_{m1}) & E(X_{m2}) & \cdots & E(X_{mn}) \end{bmatrix}$

Several useful properties of the expectations operator follow. Since the expectations operator is linear in random variables, i.e.

$E(aX + bY) = aE(X) + bE(Y)$

(where $a$ and $b$ are real constants) it immediately follows that this will be true for random matrices $\mathbf{X}$ and $\mathbf{Y}$ too, i.e.

$E(\mathbf{X} + \mathbf{Y}) = E(\mathbf{X}) + E(\mathbf{Y})$

Furthermore it follows that if $\mathbf{A}$ is a fixed matrix, i.e. a matrix of constants, then

$E(\mathbf{AX}) = \mathbf{A}E(\mathbf{X})$
2.4.4 Covariance matrix

We can generalise the notion of covariance for the case of a whole vector of random variables.

Definition 2.12 We define the variance-covariance matrix $\operatorname{Var}(\mathbf{z})$ of the (column) vector of random variables $\mathbf{z}$, with $E[\mathbf{z}] = (\mu_1, \mu_2, \dots, \mu_n)'$, as

$\operatorname{Var}(\mathbf{z}) = E\left[(\mathbf{z} - E[\mathbf{z}])(\mathbf{z} - E[\mathbf{z}])'\right]$

Here

$\mathbf{z} - E[\mathbf{z}] = \begin{bmatrix} z_1 - \mu_1 \\ z_2 - \mu_2 \\ \vdots \\ z_n - \mu_n \end{bmatrix}$

and

$(\mathbf{z} - E[\mathbf{z}])(\mathbf{z} - E[\mathbf{z}])' = \begin{bmatrix} (z_1 - \mu_1)^2 & (z_1 - \mu_1)(z_2 - \mu_2) & \cdots & (z_1 - \mu_1)(z_n - \mu_n) \\ (z_2 - \mu_2)(z_1 - \mu_1) & (z_2 - \mu_2)^2 & \cdots & (z_2 - \mu_2)(z_n - \mu_n) \\ \vdots & & & \vdots \\ (z_n - \mu_n)(z_1 - \mu_1) & (z_n - \mu_n)(z_2 - \mu_2) & \cdots & (z_n - \mu_n)^2 \end{bmatrix}$

so that

$\operatorname{Var}(\mathbf{z}) = \begin{bmatrix} \operatorname{Var}(z_1) & \operatorname{Cov}(z_1, z_2) & \cdots & \operatorname{Cov}(z_1, z_n) \\ \operatorname{Cov}(z_2, z_1) & \operatorname{Var}(z_2) & \cdots & \operatorname{Cov}(z_2, z_n) \\ \vdots & & & \vdots \\ \operatorname{Cov}(z_n, z_1) & \operatorname{Cov}(z_n, z_2) & \cdots & \operatorname{Var}(z_n) \end{bmatrix}$

It is now obvious why this matrix should be called the variance-covariance matrix.

Note that a variance-covariance matrix by definition must be symmetric. We show below that it must also be positive semidefinite. This simply means that if we take any nonzero column vector of constants $\mathbf{c}$ and form the product $\mathbf{c}'\mathbf{V}\mathbf{c}$ where $\mathbf{V}$ is a covariance matrix then $\mathbf{c}'\mathbf{V}\mathbf{c} \ge 0$. This is simply the extension of the condition that variances must be nonnegative to the context of more than one variable.
Remark 2.14 We will often refer to it simply as the covariance matrix and write $\operatorname{Var}(\mathbf{z})$ simply as $\Sigma_{\mathbf{z}}$ or as $V(\mathbf{z})$.

Observe that if the random vector $\mathbf{z}$ has the covariance matrix $\operatorname{Var}(\mathbf{z}) = \Sigma_{\mathbf{z}}$ then the random vector $\mathbf{Az}$ (where $\mathbf{A}$ is a matrix of constants) has covariance matrix $\operatorname{Var}(\mathbf{Az}) = \mathbf{A}\Sigma_{\mathbf{z}}\mathbf{A}'$. This follows simply by expanding out the definitions:

$E(\mathbf{Az}) = \mathbf{A}E(\mathbf{z})$

$\mathbf{Az} - E(\mathbf{Az}) = \mathbf{A}(\mathbf{z} - E[\mathbf{z}])$

$\operatorname{Var}(\mathbf{Az}) = E\left[\{\mathbf{A}(\mathbf{z} - E[\mathbf{z}])\}\{\mathbf{A}(\mathbf{z} - E[\mathbf{z}])\}'\right] = \mathbf{A}\,E\left[(\mathbf{z} - E[\mathbf{z}])(\mathbf{z} - E[\mathbf{z}])'\right]\mathbf{A}' = \mathbf{A}\Sigma_{\mathbf{z}}\mathbf{A}'$

In the case where $\mathbf{A}$ is the row vector $\mathbf{a}'$, the random vector $\mathbf{a}'\mathbf{z}$ is just a scalar variable. In this case $E\left[\mathbf{a}'(\mathbf{z} - E[\mathbf{z}])(\mathbf{z} - E[\mathbf{z}])'\mathbf{a}\right]$ is just the variance of this new scalar variable. Since variances are always nonnegative it follows that $\mathbf{a}'\Sigma_{\mathbf{z}}\mathbf{a}$ is nonnegative regardless of the choice of $\mathbf{a}$. This shows that $\Sigma_{\mathbf{z}}$ must be positive semidefinite. Note that if $\Sigma_{\mathbf{z}}$ is positive semidefinite, then $\mathbf{A}\Sigma_{\mathbf{z}}\mathbf{A}'$ is also positive semidefinite.
Exercises

1. Find the covariance matrix of $X_1$ and $X_2$ in Example 2.2 and hence find the covariance matrix of the new variables $Y_1 = X_1 + X_2$ and $Y_2 = X_1 - X_2$.

2. Find the covariance matrix of $X$ and $Y$ in Example 2.3 and hence find the covariance matrix of the new variables $Z_1 = X + Y$ and $Z_2 = X - Y$.
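The rule $\operatorname{Var}(\mathbf{Az}) = \mathbf{A}\Sigma_{\mathbf{z}}\mathbf{A}'$ is easy to verify by simulation. The following is a hedged sketch (assuming NumPy; the covariance matrix, the transformation matrix and the simulation size are illustrative choices, not taken from the text) comparing the sample covariance of the transformed vector with the theoretical result.

```python
import numpy as np

# Simulate z with a known covariance matrix and transform it by A.
rng = np.random.default_rng(1)
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0,  1.0],    # first new variable: z1 + z2
              [1.0, -1.0]])   # second new variable: z1 - z2
z = rng.multivariate_normal([0, 0], sigma, size=200_000)
sample_cov = np.cov((z @ A.T).T)   # sample covariance of Az
print(sample_cov)
print(A @ sigma @ A.T)             # [[4., 1.], [1., 2.]]
```

With a large number of draws the sample covariance matrix settles down on $\mathbf{A}\Sigma\mathbf{A}'$, as the algebra above predicts.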
2.5 Some multivariate distributions

2.5.1 Multivariate uniform

$f(x_1, x_2, \dots, x_n) = \prod_{i=1}^n \frac{1}{b_i - a_i}, \quad \text{if } (x_1, \dots, x_n) \in [a_1, b_1] \times [a_2, b_2] \times \cdots \times [a_n, b_n]$

$= 0 \quad \text{elsewhere}$

It is easy to see that this expression is just the product of $n$ separate uniform pdfs.
2.5.2 Bivariate normal

The bivariate normal pdf is given by

$f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}}\exp\left\{-\frac{1}{2(1-\rho^2)}\left[\frac{(x-\mu_x)^2}{\sigma_x^2} - 2\rho\frac{(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y} + \frac{(y-\mu_y)^2}{\sigma_y^2}\right]\right\}$
Marginal distributions

Despite the fact that the distribution looks rather complicated, it is fairly easy to obtain the marginal and conditional distributions. We can rewrite the term in braces by completing the square, i.e.

$\frac{1}{2(1-\rho^2)}\left[\frac{(x-\mu_x)^2}{\sigma_x^2} - 2\rho\frac{(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y} + \frac{(y-\mu_y)^2}{\sigma_y^2}\right] = \frac{(y-\alpha)^2}{2\sigma_y^2(1-\rho^2)} + \frac{(x-\mu_x)^2}{2\sigma_x^2}$

where $\alpha = \mu_y + \rho\frac{\sigma_y}{\sigma_x}(x-\mu_x)$. So consequently

$f_1(x) = \frac{1}{\sqrt{2\pi\sigma_x^2}}\exp\left(-\frac{(x-\mu_x)^2}{2\sigma_x^2}\right)\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi\sigma_y^2(1-\rho^2)}}\exp\left(-\frac{(y-\alpha)^2}{2\sigma_y^2(1-\rho^2)}\right)dy$

The expression that is being integrated is the pdf of a normal variable with mean $\mu_y + \rho\frac{\sigma_y}{\sigma_x}(x-\mu_x)$ and variance $\sigma_y^2\left(1-\rho^2\right)$, so the area under the curve must be equal to one. Consequently

$f_1(x) = \frac{1}{\sqrt{2\pi\sigma_x^2}}\exp\left(-\frac{(x-\mu_x)^2}{2\sigma_x^2}\right)$

i.e. the marginal distribution of $X$ is $N(\mu_x, \sigma_x^2)$.
Similarly the marginal distribution of $Y$ is

$f_2(y) = \frac{1}{\sqrt{2\pi\sigma_y^2}}\exp\left(-\frac{(y-\mu_y)^2}{2\sigma_y^2}\right)$

i.e. $Y \sim N(\mu_y, \sigma_y^2)$. Observe now that if $\rho = 0$ the joint pdf factors into the product of the two marginals:

$f(x, y) = \frac{1}{\sqrt{2\pi\sigma_x^2}}\exp\left(-\frac{(x-\mu_x)^2}{2\sigma_x^2}\right)\frac{1}{\sqrt{2\pi\sigma_y^2}}\exp\left(-\frac{(y-\mu_y)^2}{2\sigma_y^2}\right)$

Consequently in this case a zero covariance or correlation implies that $X$ and $Y$ are statistically independent.
Conditional distributions

We have, in fact, already derived the conditional distributions. We showed above that

$f(x, y) = \frac{1}{\sqrt{2\pi\sigma_x^2}}\exp\left(-\frac{(x-\mu_x)^2}{2\sigma_x^2}\right)\frac{1}{\sqrt{2\pi\sigma_y^2(1-\rho^2)}}\exp\left(-\frac{(y-\alpha)^2}{2\sigma_y^2(1-\rho^2)}\right)$

The first term is the marginal distribution of $X$, i.e. it is $f_1(x)$. So by definition we must have

$f(y|x) = \frac{1}{\sqrt{2\pi\sigma_y^2(1-\rho^2)}}\exp\left(-\frac{(y-\alpha)^2}{2\sigma_y^2(1-\rho^2)}\right)$

where $\alpha = \mu_y + \rho\frac{\sigma_y}{\sigma_x}(x-\mu_x)$, so the conditional mean of $Y$ changes with $x$. The slope of this relationship is given by $\rho\frac{\sigma_y}{\sigma_x}$, which we could also write as $\frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)}$. Note that changes in $\rho$ will affect both the slope of this relationship as well as the conditional variance. Observe that if $\rho = \pm 1$ then the conditional variance of $Y$ would be zero! This would be the case if the probability that $Y = \mu_y + \rho\frac{\sigma_y}{\sigma_x}(X-\mu_x)$ is equal to one. We say that in this case the random vector $(X, Y)$ has a degenerate distribution. The entire probability mass is concentrated on a line.
2.5.3 Multivariate normal

The random (column) vector $\mathbf{x} = (X_1, X_2, \dots, X_n)'$ with mean $\boldsymbol{\mu}$ and (nonsingular) covariance matrix $\Sigma$ is multivariate normal if its pdf is given by:

$f(\mathbf{x}) = (2\pi)^{-n/2}\,|\Sigma|^{-1/2}\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$

We write this as $\mathbf{x} \sim N(\boldsymbol{\mu}, \Sigma)$.
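The matrix form of the density can be compared numerically with the bivariate formula written out in terms of $\mu_x$, $\mu_y$, $\sigma_x$, $\sigma_y$ and $\rho$. The following is a hedged sketch (assuming NumPy; the parameter values and the evaluation point are arbitrary illustrations) that evaluates both expressions at one point.

```python
import numpy as np

def mvn_pdf(x, mu, sigma):
    # General multivariate normal pdf in matrix form.
    n = len(mu)
    d = x - mu
    return ((2 * np.pi) ** (-n / 2) * np.linalg.det(sigma) ** (-0.5)
            * np.exp(-0.5 * d @ np.linalg.inv(sigma) @ d))

mx, my, sx, sy, rho = 0.0, 1.0, 1.0, 2.0, 0.5
mu = np.array([mx, my])
sigma = np.array([[sx**2, rho * sx * sy],
                  [rho * sx * sy, sy**2]])

# Bivariate formula, with the quadratic form written out explicitly.
x, y = 0.3, 0.8
q = ((x - mx)**2 / sx**2 - 2 * rho * (x - mx) * (y - my) / (sx * sy)
     + (y - my)**2 / sy**2) / (1 - rho**2)
bivariate = np.exp(-q / 2) / (2 * np.pi * sx * sy * np.sqrt(1 - rho**2))
print(bivariate, mvn_pdf(np.array([x, y]), mu, sigma))  # identical values
```

Agreement at any point (up to floating-point error) reflects the algebraic identity verified in the special case below.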
Special case: the bivariate normal

We can check that this definition gives the same formula for the bivariate case. In this case we have $n = 2$,

$\boldsymbol{\mu} = \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix} \quad \text{and} \quad \Sigma = \begin{bmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{bmatrix}$

It follows that $|\Sigma| = \sigma_x^2\sigma_y^2\left(1-\rho^2\right)$, so $|\Sigma|^{-1/2} = \frac{1}{\sigma_x\sigma_y\sqrt{1-\rho^2}}$. Furthermore

$\Sigma^{-1} = \frac{1}{1-\rho^2}\begin{bmatrix} \frac{1}{\sigma_x^2} & -\frac{\rho}{\sigma_x\sigma_y} \\ -\frac{\rho}{\sigma_x\sigma_y} & \frac{1}{\sigma_y^2} \end{bmatrix}$
Figure 2.2: Changes in the bivariate normal distribution with $\rho$. In all cases $\mu_x = \mu_y = 0$, $\sigma_x = \sigma_y = 1$. Top panel: $\rho = 0.8$. Middle panel: $\rho = 0.5$. Bottom panel: $\rho = 0$.
Now let

$\mathbf{x} - \boldsymbol{\mu} = \begin{bmatrix} x - \mu_x \\ y - \mu_y \end{bmatrix}$

as before. Then

$(\mathbf{x}-\boldsymbol{\mu})'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}) = \frac{1}{1-\rho^2}\left[\frac{(x-\mu_x)^2}{\sigma_x^2} - 2\rho\frac{(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y} + \frac{(y-\mu_y)^2}{\sigma_y^2}\right]$

It is now easy to verify that the two expressions are mathematically identical.
Special case: diagonal covariance matrix

If $\Sigma = \operatorname{diag}\left(\sigma_1^2, \sigma_2^2, \dots, \sigma_n^2\right)$, then it is easy to see that $|\Sigma|^{1/2} = \sigma_1\sigma_2\cdots\sigma_n$ and $\Sigma^{-1} = \operatorname{diag}\left(\frac{1}{\sigma_1^2}, \frac{1}{\sigma_2^2}, \dots, \frac{1}{\sigma_n^2}\right)$. In this case

$f(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}\sigma_1\sigma_2\cdots\sigma_n}\exp\left(-\frac{(x_1-\mu_1)^2}{2\sigma_1^2} - \frac{(x_2-\mu_2)^2}{2\sigma_2^2} - \cdots - \frac{(x_n-\mu_n)^2}{2\sigma_n^2}\right)$

$= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma_i^2}}\exp\left(-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}\right)$

$= \prod_{i=1}^n f_i(x_i)$
Theorem 2.15 Let the vector $\mathbf{x} = (X_1, X_2, \dots, X_n)'$ have multivariate normal distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$. Assume that

$\mathbf{z} = \mathbf{Ax}$

where $\mathbf{A}$ is a real $m \times n$ matrix with rank $m$. Then

$\mathbf{z} \sim N\left(\mathbf{A}\boldsymbol{\mu}, \mathbf{A}\Sigma\mathbf{A}'\right)$

We can use this result to show that every one of the variables in $\mathbf{x}$ must itself be normally distributed. For instance, setting $\mathbf{A} = \mathbf{e}_i'$, the $i$-th unit row vector, shows that $X_i \sim N(\mu_i, \sigma_{ii})$.
Chapter 3

Estimation

The basic approach that we will be using can be explained by means of the diagram given in Figure 3.1. There are several components to this diagram:
1. We begin with the underlying social/economic processes which are happening in the real world. We assume that these can be represented as random variables $Y_1, \dots, Y_n$. Implicitly we assume that the outcomes of these processes can be measured and well defined. This may not be true of all processes.

2. The processes in the real world, together with the measurement process (e.g. a survey questionnaire administered by a field team which is coded up in a back room) result in the delivery of real data on our desk top. These data (even if they are a macroeconomic time series or a population census) can be thought of as the outcome of a sampling from that social reality. We will call this (after Mittelhammer et al. (2000)) the Data Sampling Process. Many authors refer to it as the Data Generating Process. I prefer the Mittelhammer et al. (2000) usage, because it emphasises the fact that data are intrinsically incomplete. Crucially we will assume that the DSP can be fully characterised by some joint probability distribution function over the outcomes $y_1, \dots, y_n$. In particular we will assume that we can characterise the DSP as belonging to a given family of distributions although we will not know the precise one. This means that we assume that the distribution of the sample observations is given by the joint distribution function $f(y_1, \dots, y_n; \theta)$, where $\theta$ is a parameter (or vector of parameters) which uniquely identifies the DSP. For instance, $y_1, \dots, y_n$ might be multivariate normal, in which case $\theta$ is the vector of means and the covariance matrix.
3. Once we have the data in front of us, we can manipulate them. In particular we can calculate various statistics. These are simply functions of the observations $y_1, \dots, y_n$. Some examples:

The sample mean $\bar{y} = \frac{1}{n}\sum_i y_i$

The sample variance $s^2 = \frac{1}{n-1}\sum_i (y_i - \bar{y})^2$

The sample covariance $s_{xy} = \frac{1}{n-1}\sum_i (x_i - \bar{x})(y_i - \bar{y})$
[Figure 3.1: Estimation. Real-world processes (globalisation, income, BMI, financial policy, education, emotions) are represented by random variables $Y_1, Y_2, \dots$ with joint density $f(Y; \theta)$; a sample $y_1, y_2, \dots, y_n$ is drawn from them, and statistics ($\bar{y}$, $s^2$, max, min, median) are calculated on the sample.]
genetic component then two observations extracted from the same household will be more alike than two observations extracted from the population at random.

Despite the fact that simple random sampling processes are the exception rather than the rule, we will build up the theory on the basis of this assumption and then complicate it for other sampling processes.

Note that just as the individual observations can be thought of as random variables, so statistics are random variables. (Functions of random variables are themselves random variables.) We can therefore talk about the distribution of a particular statistic. The distribution depends of course on the DSP and on the sample size $n$. This implies that in different samples an estimator will lead to different estimates. Consequently we will be concerned with what sort of rules are desirable or even optimal. Below we will consider two types of rules that have been frequently used in practice:

estimation by maximum likelihood

estimation by method of moments
3.2 Maximum likelihood

The principle of maximum likelihood is relatively easy to grasp in the context of discrete random variables. The idea is explained diagrammatically in Figure 3.2. In this diagram we are considering an experiment in which a sample of ten observations is extracted from a Bernoulli distribution with parameter $p$. Assuming that we have simple random sampling, the joint pdf of the sample will be given by $f(\mathbf{y}) = p^{\sum y_i}(1-p)^{n-\sum y_i}$ since each random variable has pdf $p^{y_i}(1-p)^{1-y_i}$, with $y_i \in \{0, 1\}$. This pdf tells us how probable different kinds of samples will be. In the left panel of Figure 3.2 we have shown several possible outcomes and the associated probabilities if $p = 0.6$.

We assume that our actual sample is given by $\mathbf{y} = (0, 1, 1, 0, 1, 1, 1, 0, 1, 1)$, i.e. there were seven successes and three failures. The question that we now want to solve is what would be a reasonable estimate for $p$? Given the outcome, we know that the joint density in this case will be given by $p^7(1-p)^3$. We can now consider how likely the actual sample would have been if the DSP had taken on some particular value, say $p$. For instance, if $p = 0.5$ we could deduce that the probability that we would have observed this particular sample was $0.5^7 \times 0.5^3 = 0.00097656$. On the other hand, if $p$ was really $0.3$, the probability that we would have observed this sample would only have been $0.3^7 \times 0.7^3 = 0.000075014$.

The maximum likelihood criterion stipulates that we use that estimate of $p$ which maximises the probability that we would have observed this particular sample, i.e. in this case

$\hat{p} = \arg\max_p\, p^7(1-p)^3$

In the right panel of Figure 3.2 we see that $p = 0.7$ gives a higher probability than any of the other values that we have displayed. Nonetheless we need to consider all possible values.
If we let $L = p^7(1-p)^3$, we find that $\frac{dL}{dp} = 7p^6(1-p)^3 - 3p^7(1-p)^2$. If we set this equal to zero we get $7(1-p) = 3p$, i.e. $\hat{p} = 0.7$.
[Figure 3.2, left panel: the true DSP, Bernoulli $p = 0.6$, generates possible samples such as $(1,1,1,1,1,1,1,1,0,1)$ or $(0,0,0,0,0,0,0,0,0,0)$, each with its own probability. Right panel: the actual sample $(0,1,1,0,1,1,1,0,1,1)$ is evaluated against possible DSPs: for Bernoulli $p = 0.5$, $L(p|\mathbf{y}) = 0.5^7 \times 0.5^3 = 0.000977$; for $p = 1$, $L(p|\mathbf{y}) = 1^7 \times 0^3 = 0$.]
Figure 3.2: The joint density $f(\mathbf{y}; \theta)$ represents how likely a particular sample is, given the DSP (left panel). In maximum likelihood estimation we ask how likely the given sample would be if the DSP had been represented by some value $p$ (right panel).
Figure 3.3: The probability of observing the given sample changes with $p$. It reaches its maximum at $\hat{p} = 0.7$.
It is clear that this is an intuitively attractive way of estimating the population parameter in the case of discrete distributions, where $f(\mathbf{y}; \theta)$ is really a probability. In the case of continuous distributions, the probability of observing any particular sample will always be zero, since the probability of obtaining particular values is always zero. Nevertheless the value $f(\mathbf{y}; \theta)$ still captures how likely certain outcomes are relative to others. One might think of the probability that $\mathbf{y}$ falls near a particular value as approximately $f(\mathbf{y}; \theta)\, d\mathbf{y}$, so that higher values of $f(\mathbf{y}; \theta)$ certainly represent more likely outcomes.
Note that there is no guarantee that the estimation procedure will give us the true value of
the parameter. What we can hope for, however, is that our estimates will be close to the truth,
in a sense which we will try to make more precise later.
Example 3.1 Estimating $p$ in a Bernoulli distribution

If $Y_i \sim \text{Bernoulli}(p)$, then $f(y_i) = p^{y_i}(1-p)^{1-y_i}$. The joint pdf is given by

$f(\mathbf{y}) = p^{\sum y_i}(1-p)^{n-\sum y_i}$

Consequently $L(p|\mathbf{y}) = p^{\sum y_i}(1-p)^{n-\sum y_i}$ and

$\ln L(p|\mathbf{y}) = \left(\sum_i y_i\right)\ln p + \left(n - \sum_i y_i\right)\ln(1-p)$

Setting the derivative with respect to $p$ equal to zero and solving, we get

$\hat{p} = \frac{1}{n}\sum_i y_i \quad (3.1)$
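The Bernoulli likelihood can also be maximised numerically, which is a useful sanity check on the closed-form answer. The following is a hedged sketch (assuming NumPy; a simple grid search stands in for a proper optimiser) using the ten-observation sample from the discussion above.

```python
import numpy as np

# The actual sample from the text: seven successes, three failures.
y = np.array([0, 1, 1, 0, 1, 1, 1, 0, 1, 1])

# Evaluate the log-likelihood ln L(p|y) on a fine grid of p values
# (endpoints avoid log(0)); the grid resolution is an arbitrary choice.
grid = np.linspace(0.001, 0.999, 9991)
loglik = y.sum() * np.log(grid) + (len(y) - y.sum()) * np.log(1 - grid)
p_hat = grid[np.argmax(loglik)]
print(p_hat)  # close to 0.7, the sample proportion
```

The grid maximiser lands (to grid precision) on the sample proportion, agreeing with equation (3.1).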
Example 3.2 Estimating $\mu$ and $\sigma^2$ in a normal distribution

We assume that $Y_i \sim N(\mu, \sigma^2)$. Consequently $f(y_i) = \left(2\pi\sigma^2\right)^{-1/2}\exp\left(-\frac{(y_i-\mu)^2}{2\sigma^2}\right)$. The joint pdf is

$f(\mathbf{y}) = \left(2\pi\sigma^2\right)^{-n/2}\exp\left(-\frac{\sum_i (y_i-\mu)^2}{2\sigma^2}\right)$
This becomes the likelihood function, i.e. $L(\mu, \sigma^2|\mathbf{y}) = \left(2\pi\sigma^2\right)^{-n/2}\exp\left(-\frac{\sum_i (y_i-\mu)^2}{2\sigma^2}\right)$. Taking logs we get

$\ln L(\mu, \sigma^2|\mathbf{y}) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{\sum_i (y_i-\mu)^2}{2\sigma^2} \quad (3.2)$

Differentiating this with respect to $\mu$ and $\sigma^2$ we get:

$\frac{\partial \ln L}{\partial \mu} = \frac{\sum_i (y_i-\mu)}{\sigma^2} \quad (3.3)$

$\frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{\sum_i (y_i-\mu)^2}{2\sigma^4} \quad (3.4)$

Setting the derivatives equal to zero, we get the likelihood equations:

$\frac{\sum_i (y_i-\hat{\mu})}{\hat{\sigma}^2} = 0 \quad (3.5a)$

$-\frac{n}{2\hat{\sigma}^2} + \frac{\sum_i (y_i-\hat{\mu})^2}{2\hat{\sigma}^4} = 0 \quad (3.5b)$

We have replaced the true parameters $\mu$ and $\sigma^2$ with their estimates in these equations, since there is no guarantee that the gradient will be precisely zero at the true parameter value. Instead these two equations implicitly define the maximum likelihood estimators. We can explicitly solve out for them. From the first equation we find that

$\hat{\mu} = \bar{y} = \frac{1}{n}\sum_i y_i \quad (3.6)$

while the second then gives

$\hat{\sigma}^2 = \frac{1}{n}\sum_i (y_i - \hat{\mu})^2 \quad (3.7)$
In both these examples the likelihood function was well behaved and we could get the optimum through differentiating the function and setting the derivative equal to zero. This will not always be the case (although it will be for most of the applications that we will consider). One case where it does not hold is in estimating the parameters of a uniform distribution:

$f(y_1, y_2, \dots, y_n; a, b) = \begin{cases} \frac{1}{(b-a)^n} & \text{if } a \le y_i \le b \text{, for all } i \in \{1, 2, \dots, n\} \\ 0 & \text{if } y_i < a \text{ or } y_i > b \text{, for any } i \in \{1, 2, \dots, n\} \end{cases}$

The likelihood function $L(a, b|\mathbf{y})$ is therefore

$L(a, b|\mathbf{y}) = \begin{cases} \frac{1}{(b-a)^n} & \text{if } a \le \min\{y_1, y_2, \dots, y_n\} \text{ and } b \ge \max\{y_1, y_2, \dots, y_n\} \\ 0 & \text{if } a > \min\{y_1, y_2, \dots, y_n\} \text{ or } b < \max\{y_1, y_2, \dots, y_n\} \end{cases}$

This function has discontinuities at $\min\{y_1, y_2, \dots, y_n\}$ and $\max\{y_1, y_2, \dots, y_n\}$ respectively, so it cannot be differentiated there. It is obvious, however, that $L(a, b|\mathbf{y})$ can be maximised by setting

$\hat{a} = \min\{y_1, y_2, \dots, y_n\} \quad \text{and} \quad \hat{b} = \max\{y_1, y_2, \dots, y_n\}$

The minimum and maximum sample values are therefore the MLE estimators of the range of the uniform distribution.

It is intuitively obvious that both of these must be biased estimators: we must have $\hat{a} \ge a$ and $\hat{b} \le b$ in every sample. It also seems obvious that the larger the sample the smaller this bias is likely to be.
3.2.1 Exercises

1. Derive the ML estimator of $\lambda$ from a sample of $n$ independent draws from the exponential distribution.

2. Consider the pdf of the discrete random variable $Y$ given by

$f(y) = \frac{\lambda^y e^{-\lambda}}{y!}, \quad y \in \{0, 1, 2, \dots\}, \quad \lambda \in (0, \infty)$

Assume that you have an independent random sample of size $n$ from this distribution. Show that the maximum likelihood estimator of $\lambda$ is given by the sample mean.

3. Let $\ln Y \sim N(\mu, \sigma^2)$. Assume that you have $n$ independent draws from this distribution. Estimate $\mu$ and $\sigma^2$ by means of maximum likelihood.
3.3 Method of moments

In the method of moments we equate sample moments to their population counterparts and solve for the parameters. For the uniform distribution, for instance, equating the first two moments gives $\hat{a} + \hat{b} = 2\bar{y}$ and $(\hat{b} - \hat{a})^2 = 12s^2$. Consequently $\hat{b} = \bar{y} + \sqrt{3}\,s$ and $\hat{a} = \bar{y} - \sqrt{3}\,s$.
Example 3.5 Estimating the parameters of a gamma$(\alpha, \beta)$ distribution

The pdf of the gamma distribution is given by $f(y) = \frac{1}{\Gamma(\alpha)\beta^\alpha}y^{\alpha-1}e^{-y/\beta}$. Consequently if we have a sample $(y_1, y_2, \dots, y_n)$ then the likelihood function will be given by

$L(\alpha, \beta|y_1, \dots, y_n) = \frac{1}{\Gamma(\alpha)^n\beta^{n\alpha}}(y_1 y_2 \cdots y_n)^{\alpha-1}\exp\left(-\frac{\sum_i y_i}{\beta}\right)$

It is difficult to maximise this expression with respect to $\alpha$ and $\beta$, because of the gamma function in the denominator. Certainly there is no convenient analytical expression for the solution. If we use the method of moments instead,¹ we know that $\mu = \alpha\beta$ and $\sigma^2 = \alpha\beta^2$. Equating the sample moments to the population moments we get the two equations

$\bar{y} = \alpha\beta$

$s^2 = \alpha\beta^2$

The estimators are

$\hat{\alpha} = \frac{\bar{y}^2}{s^2}, \quad \hat{\beta} = \frac{s^2}{\bar{y}}$

¹My presentation here is not entirely rigorous: I mix up centred and uncentred moments. Provided the corresponding sample moments are consistent estimators, the results will hold.
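These moment estimators can be tried out on simulated data. The following is a hedged sketch (assuming NumPy; the true parameter values and the sample size are arbitrary illustrations) of the gamma method-of-moments estimators just derived.

```python
import numpy as np

# Simulate a gamma sample with known alpha and beta (shape and scale).
rng = np.random.default_rng(3)
alpha, beta = 2.0, 3.0
y = rng.gamma(shape=alpha, scale=beta, size=100_000)

# Method of moments: ybar = alpha*beta and s^2 = alpha*beta^2.
ybar, s2 = y.mean(), y.var()
alpha_hat = ybar**2 / s2
beta_hat = s2 / ybar
print(alpha_hat, beta_hat)  # close to the true values 2 and 3
```

With a sample this large the estimates sit very close to the true parameters; in small samples they are noisier, which is the sampling-variability theme taken up in section 3.5.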
Example 3.6 Estimating the parameters of a lognormal distribution

The obvious way to estimate the parameters of the lognormal distribution would be via the variable $Z = \ln Y$, since $Z \sim N(\mu, \sigma^2)$, so we know what the appropriate MLE would be. If, for any reason, we cannot get access to the individual level data, but we have access to the moments of the distribution of $Y$, we could still estimate the parameters $\mu$ and $\sigma^2$ by the method of moments. We have

$E(Y) = \exp\left(\mu + \frac{\sigma^2}{2}\right)$

$E\left(Y^2\right) = \exp\left(2\mu + 2\sigma^2\right)$

Equating these to the sample moments we get

$\bar{y} = \exp\left(\mu + \frac{\sigma^2}{2}\right)$

$s^2 + \bar{y}^2 = \exp\left(2\mu + 2\sigma^2\right)$

So

$\hat{\sigma}^2 = \ln\left(s^2 + \bar{y}^2\right) - 2\ln\bar{y}$

$\hat{\mu} = 2\ln\bar{y} - \frac{\ln\left(s^2 + \bar{y}^2\right)}{2}$
The basic idea should be very clear by now. One thing which may not be clear is what to do if we have more than the required moments. For instance the normal distribution is fully identified by just two moments. Empirically, however, there would be no problem in calculating the third or even the fourth sample moments. At the population level this additional information would be redundant: any two equations would give the same parameter values. In a random sample this will, however, not be the case. Different subsets of the equations would give different results. The simplest solution might be to simply throw away the extra information and just use the first two moments to estimate the two parameters. On the other hand, this seems a waste of good information. The question of how to deal with the extra information is tackled by the Generalised Method of Moments (GMM).
3.3.1 Exercises
3.4 Other rules
It is important to understand that ML estimation and MoM estimation are not the only approaches to estimation.
3.4.1 Rules of thumb
There are a number of approaches which are based on intuition or heuristic rules. When business
economists predict inflation or growth they frequently seem to do so on the basis of instinct.
When chartists extrapolate trends they do so by analogy with previous patterns in the data.
Even in more scientific parts of economics, analysts frequently have strong intuitions about
what sort of results one should expect. For instance when Card and Krueger suggested that
employment did not decline with increases in the minimum wage there were many economists
that simply did not believe their results.
3.4.2 Bayesian estimation
In Bayesian estimation these prior beliefs are explicitly modelled. The analyst specifies how likely different parameters might be in the form of a prior distribution. This distribution is then updated in the light of the empirical evidence (by Bayes's law) to give the posterior distribution. This does not yield a point estimate, but a range of estimates. This range, however, builds in other information, unlike traditional confidence intervals. We will not discuss Bayesian approaches in this course.
3.4.3 Pretest estimators
3.4.4
We will see below that in a number of contexts analysts make adjustments to a ML or MoM
estimator in order to remove a particular source of bias.
3.5 Sampling distribution

We have seen that there may be more than one way of estimating a set of parameters. This raises the question as to how we might decide between different estimators. In order to assess this we will generally be concerned in the first instance with the sampling distribution of the estimates, i.e. how the estimator would behave if we had the luxury of repeating the experiment very many times. We will also in due course consider the asymptotic properties of the estimator, i.e. how the estimator would behave if we had the luxury of enlarging our sample indefinitely.
Example 3.7 Sampling distribution of $\hat{p}$, the sample proportion from $n$ independent draws from a Bernoulli($p$) distribution

The total number of successes $\sum_i y_i$ has the binomial distribution

$f\left(\textstyle\sum_i y_i\right) = \binom{n}{\sum_i y_i}\, p^{\sum_i y_i}(1-p)^{n-\sum_i y_i}, \quad \textstyle\sum_i y_i \in \{0, 1, \dots, n\}, \quad p \in [0, 1]$

We can use the change of variable technique to get the distribution of $\hat{p}$. We know that $\hat{p} = \frac{1}{n}\sum_i y_i$, so $\sum_i y_i = n\hat{p}$ and the sampling distribution of this estimator (statistic) is given by:

$f(\hat{p}; p) = \binom{n}{n\hat{p}}\, p^{n\hat{p}}(1-p)^{n-n\hat{p}}, \quad \hat{p} \in \left\{0, \frac{1}{n}, \frac{2}{n}, \dots, 1\right\}, \quad p \in [0, 1]$

We observe that $\hat{p}$ can only take on $(n+1)$ discrete values, so this is a discrete pdf. In Figure 3.4 we graph some examples of what the true sampling distribution would look like.

We observe that in each case the distribution of the estimator is centred on the true population parameter. This need not be the case, in general.
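This sampling distribution can be traced out by simulation rather than algebra. The following is a hedged sketch (assuming NumPy; $p$, $n$ and the number of replications are illustrative) that draws many samples and tabulates the resulting sample proportions.

```python
import numpy as np

# Simulate the sampling distribution of p_hat from n Bernoulli(p) draws.
rng = np.random.default_rng(4)
p, n, reps = 0.5, 10, 50_000
p_hat = rng.binomial(n, p, size=reps) / n   # one sample proportion per replication
values, counts = np.unique(p_hat, return_counts=True)
print(values)          # the discrete support 0, 0.1, ..., 1
print(counts / reps)   # relative frequencies approximating the binomial pmf
print(p_hat.mean())    # close to p: the distribution is centred on p
```

The relative frequencies approximate the binomial probabilities above, and the average of the simulated proportions sits on the true $p$, illustrating that the estimator is centred on the population parameter.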
Example Sampling distribution of $\bar{y}$ from $n$ independent draws from a $N(\mu, \sigma^2)$ distribution

We can write $\bar{y} = \frac{1}{n}\boldsymbol{\iota}'\mathbf{y}$ where $\boldsymbol{\iota}' = \begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix}$ and $\mathbf{y} = (y_1, y_2, \dots, y_n)'$. Since each $y_i$ is independently distributed as $N(\mu, \sigma^2)$, their joint distribution is multivariate normal with mean $\boldsymbol{\mu} = \mu\boldsymbol{\iota}$ and diagonal covariance matrix $\Sigma = \sigma^2\mathbf{I}$. By Theorem 2.15 of Chapter 2 we have $\frac{1}{n}\boldsymbol{\iota}'\mathbf{y} \sim N\left(\frac{1}{n}\boldsymbol{\iota}'\boldsymbol{\mu}, \frac{1}{n^2}\boldsymbol{\iota}'\Sigma\boldsymbol{\iota}\right)$, i.e.

$\bar{y} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$

Note that the variance of the sample mean is considerably smaller than the variance of the original distribution.
[Figure 3.4: Sampling distributions of the sample proportion. Panels: $p = 0.5, n = 10$; $p = 0.8, n = 10$; $p = 0.5, n = 100$; $p = 0.8, n = 100$.]
Example Sampling distribution of $\hat{\sigma}^2$ from $n$ independent draws from a $N(\mu, \sigma^2)$ distribution

It can be shown that

$\frac{n\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n-1)$

The sampling distribution of $\hat{\sigma}^2$ can therefore be derived by change of variable techniques from the $\chi^2$ distribution. In Figure 3.5 we graph some examples. Note that the mode of the sample estimates is below the true value. Since the mean of a $\chi^2(n-1)$ variable is $n-1$ it follows that

$E\left(\hat{\sigma}^2\right) = \frac{n-1}{n}\sigma^2$

so the mean of the estimator also undershoots the true parameter value.
Example Sampling distribution of the sample minimum

The probability that the minimum exceeds $y$ is the probability that every one of the $n$ observations exceeds $y$, i.e. $(1 - F(y))^n$, where $F$ is the cdf of the population from which the sample is drawn, i.e. the cdf of the minimum is

$1 - (1 - F(y))^n$

So the pdf of the minimum is

$f_{(1)}(y) = n(1 - F(y))^{n-1}f(y)$

where $f$ is the pdf of the original population. Note that this holds true in general!

In the case of the uniform distribution $U(a, b)$ we have

$f_{(1)}(y) = \frac{n}{b-a}\left(1 - \frac{y-a}{b-a}\right)^{n-1}, \quad \text{if } a \le y \le b$

Figure 3.6 gives some examples of this sampling distribution from a $U(0, 1)$ distribution. We observe that as $n$ increases, the distribution becomes increasingly concentrated around zero.
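The cdf of the minimum can be checked directly by simulation. The following is a hedged sketch (assuming NumPy; the evaluation point $y = 0.1$, the sample size and the replication count are arbitrary choices) comparing the empirical cdf of the minimum of $U(0,1)$ samples with $1 - (1-y)^n$.

```python
import numpy as np

# Simulate the minimum of n U(0, 1) draws, many times over.
rng = np.random.default_rng(5)
n, reps = 10, 100_000
mins = rng.uniform(0, 1, size=(reps, n)).min(axis=1)

empirical = (mins <= 0.1).mean()     # empirical Pr(min <= 0.1)
theory = 1 - (1 - 0.1) ** n          # cdf of the minimum at y = 0.1
print(empirical, theory)             # both close to 0.651
```

The agreement illustrates the general formula: it does not depend on any special feature of the uniform other than its cdf $F(y) = y$ on $[0, 1]$.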
Figure 3.6: Distribution of the sample minimum from a $U(0, 1)$ distribution, with sample sizes $n = 10$, $n = 25$ and $n = 100$ respectively.
3.5.1 Exercises

1. Derive the sampling distribution of the sample maximum in $n$ independent draws from a $U(a, b)$ distribution.

2. Derive the sampling distribution of the ML estimator $\hat{\lambda}$ of the exponential distribution. You may want to first show that $\sum_i Y_i \sim \text{Gamma}(n, 1/\lambda)$ by using the MGF of the exponential distribution. Then plot this distribution for $\lambda = 1$ and $n = 10$, $n = 25$ and $n = 100$.
3.6 Properties of estimators

3.6.1 Bias

The ML estimator of $\lambda$ in the exponential case is

$\hat{\lambda} = \frac{n}{y_1 + y_2 + \cdots + y_n}$

It can be shown that

$E\left(\hat{\lambda}\right) = \lambda\left(1 + \frac{1}{n-1}\right) = \lambda + \frac{\lambda}{n-1}$

Consequently the bias is $\frac{\lambda}{n-1}$.
Exercise
1. Discuss the bias of the sample proportion $\hat{p}$ from $n$ independent draws from a Bernoulli distribution.
3.6.2 Minimum variance

Theorem (Cramér-Rao lower bound) The variance of an unbiased estimator of the scalar parameter $\theta$ will always be at least as large as

$[I_n(\theta)]^{-1} = \left(E\left[\left(\frac{\partial \ln L(\theta)}{\partial \theta}\right)^2\right]\right)^{-1} = \left(-E\left[\frac{\partial^2 \ln L(\theta)}{\partial \theta^2}\right]\right)^{-1}$

The multivariate analogue of this is that the difference between the covariance matrix of any unbiased estimator and the inverse of the information matrix

$[\mathbf{I}_n(\theta)]^{-1} = \left(-E\left[\frac{\partial^2 \ln L(\theta)}{\partial \theta\, \partial \theta'}\right]\right)^{-1} = \left(E\left[\frac{\partial \ln L(\theta)}{\partial \theta}\frac{\partial \ln L(\theta)}{\partial \theta'}\right]\right)^{-1}$

will be a nonnegative definite matrix (Greene 2003, p. 889-90).
This theorem is important because it states that there is a point beyond which unbiased estimators cannot get more precise. There will always be some intrinsic sampling variance in any such estimator. On the positive side, if we can establish that an estimator has the variance described in this theorem, then we can be sure that it must be an efficient estimator.
Example 3.17 Obtaining the Information Matrix $\mathbf{I}_n(\theta)$ of the ML estimators of the parameters of the $N(\mu, \sigma^2)$ distribution

The information matrix is based on the Hessian matrix (matrix of second derivatives) of the log-likelihood with respect to the parameters. In Example 3.2 we derived the gradient of the likelihood function (equations 3.3 and 3.4). Differentiating these again we get:

$\frac{\partial^2 \ln L}{\partial \mu^2} = -\frac{n}{\sigma^2}$

$\frac{\partial^2 \ln L}{\partial \mu\, \partial \sigma^2} = -\frac{\sum_i (y_i - \mu)}{\sigma^4}$

$\frac{\partial^2 \ln L}{\partial (\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{\sum_i (y_i - \mu)^2}{\sigma^6}$

Now $\mathbf{I}_n(\theta) = -E\left[\frac{\partial^2 \ln L(\theta)}{\partial \theta\, \partial \theta'}\right]$. Since $E(y_i - \mu) = 0$ and $E(y_i - \mu)^2 = \sigma^2$, it is evident that

$\mathbf{I}_n(\theta) = \begin{bmatrix} \frac{n}{\sigma^2} & 0 \\ 0 & \frac{n}{2\sigma^4} \end{bmatrix}$

Consequently

$[\mathbf{I}_n(\theta)]^{-1} = \begin{bmatrix} \frac{\sigma^2}{n} & 0 \\ 0 & \frac{2\sigma^4}{n} \end{bmatrix}$

We know that $\frac{(n-1)s^2}{\sigma^2} \sim \chi^2(n-1)$. The variance of this random variable is $2(n-1)$, so the variance of $s^2$ is $\left(\frac{\sigma^2}{n-1}\right)^2 \cdot 2(n-1) = \frac{2\sigma^4}{n-1}$. It follows that $s^2$ does not reach the Cramér-Rao lower bound $\frac{2\sigma^4}{n}$. Nevertheless it can be shown that there is no unbiased estimator that has a lower variance than $s^2$.
Exercises

1. Find the information for the MLE $\hat{p}$ derived in Example 3.1. Hence discuss the efficiency of this ML estimator.
3.6.3 Mean squared error

The mean squared error of an estimator combines its variance and its bias:

$MSE\left(\hat{\theta}\right) = \operatorname{Var}\left(\hat{\theta}\right) + \left[\operatorname{bias}\left(\hat{\theta}\right)\right]^2, \quad \text{if } \theta \text{ is a scalar}$

$MSE\left(\hat{\theta}\right) = \operatorname{Var}\left(\hat{\theta}\right) + \operatorname{bias}\left(\hat{\theta}\right)\operatorname{bias}\left(\hat{\theta}\right)', \quad \text{if } \theta \text{ is a vector}$

Exercises
3.6.4
Invariance
Another highly desirable property of an estimator is that it yields the same estimate regardless of how the problem is parameterised. For instance, in Example 3.2 we derived the ML estimators on the assumption that we were estimating $\mu$ and $\sigma^2$. It would have been somewhat disturbing if the estimates had changed if we had wanted to estimate $\mu$ and $\sigma$ instead. One of the properties of ML estimation is that it is invariant in this sort of way. If we initially set the problem up in terms of the parameter vector $\theta$ and get the maximum likelihood estimator $\hat{\theta}$, then if we remap the problem in terms of the parameters

$$\gamma = c(\theta)$$

where the function $c(\cdot)$ is a mapping from the old parameters to the new ones, then

$$\hat{\gamma} = c\left(\hat{\theta}\right)$$

i.e. applying the mapping to the ML estimators gives the ML estimators of the new parameters. In short, to obtain the ML estimator of $\sigma$ we can just take the square root of the ML estimator of $\sigma^2$.
This result is of some practical importance, because there may be ways of parameterising a problem so that it is easier to obtain estimates. One can then apply the appropriate transformations to get estimates for the particular problem one started off with.

One setting in which regression packages (like Stata) do this routinely is where the parameters need to obey certain constraints. For instance, a variance has to be positive. One could set the maximisation problem up as a constrained maximisation problem with a nonnegativity constraint. In practice it is frequently easier to reparameterise the problem. For instance, one can set the parameter that has to be positive, say $\sigma^2$, equal to a positive function of another parameter, say $\exp(\gamma)$, and then optimise with respect to $\gamma$. When the estimates have been obtained, one transforms them back into the form that one was interested in.
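This reparameterisation trick can be sketched numerically. The example below (an illustration, not from the text, and assuming SciPy is available) maximises a normal loglikelihood over $(\mu, \gamma)$ with $\sigma^2 = \exp(\gamma)$, so the optimiser works on an unconstrained space; invariance then gives $\hat{\sigma}^2 = \exp(\hat{\gamma})$.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(params, x):
    """Normal negative loglikelihood, parameterised as (mu, gamma) with
    sigma^2 = exp(gamma) so the optimisation is unconstrained."""
    mu, gamma = params
    sigma2 = np.exp(gamma)  # guaranteed positive for any real gamma
    n = x.size
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((x - mu) ** 2) / (2 * sigma2)

rng = np.random.default_rng(42)
x = rng.normal(loc=2.0, scale=3.0, size=5_000)

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]), args=(x,), method="BFGS")
mu_hat, gamma_hat = res.x
sigma2_hat = np.exp(gamma_hat)  # transform back: ML estimator of sigma^2

# By invariance these equal the closed-form MLEs: the sample mean and
# the (divisor-n) sample variance.
print(mu_hat, sigma2_hat)
```

The optimiser never sees the positivity constraint, yet the transformed estimate is exactly the constrained MLE.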
3.7 Monte Carlo simulation
In many situations it is very difficult to derive the precise sampling distribution of a particular estimator. One of the tools that has become available to the econometrician is the ability to simulate the sampling distribution. In Monte Carlo studies one specifies the distribution and then simulates the process of extracting samples and calculating statistics on them. With modern computing power it is possible to do this thousands of times. The distributions obtained in this way should approximate the real sampling distributions very closely. The theory underpinning this statement will be explored in the next chapter.
There are several issues that have to be confronted in any Monte Carlo simulation:

- How can one ensure that the results reported are reproducible, given that they are supposed to be the outcome of a set of random experiments?
  The fact of the matter is that the random numbers generated by the computer are, in fact, not really random. They come out of a deterministic process, even though they behave precisely like random numbers. Provided that one specifies where one starts the process off (the random number seed) and which package one is using, the results are completely deterministic.
- How many samples are sufficient?
  That depends a bit on the process being simulated, but around 10 000 replications should generally be good enough to persuade most sceptics.
- How large should each simulated sample be?
  One hopes that the qualitative results are not too dependent on the precise sample size. For many empirical problems it is sensible to pick sample sizes similar to those encountered in actual research. As we will see in the next chapter, large samples tend to be much better behaved than small ones, so it is probably a good idea to pick intermediate sample sizes.
- Which parameter values should one use?
  Again one hopes that the results are not dependent on the precise parameters. That said, it is probably sensible to pick several combinations of parameters to explore the sensitivity of the results to these.
There are also practical issues. The most immediate one is how to simulate a draw from an arbitrary distribution. Most random number generators spit out numbers between zero and one that are uniformly distributed, i.e. any fraction (up to about eight decimal places) is equally likely. The trick then is to convert these random numbers into a draw from the appropriate distribution:

- In the case of the Bernoulli distribution with parameter $p$ we set our random variable equal to one whenever our random number is less than (or equal to) $p$ and we set it equal to zero if it is greater than $p$.
- Other discrete distributions can be handled analogously.
- In the case of absolutely continuous distributions we make use of the fact that the cumulative distribution function $F(x)$ is a monotonically increasing function that gives values between zero and one. We can therefore use the inverse function $F^{-1}$ to convert random numbers between zero and one into values of the random variable $x$. It is useful to see why this works:
Suppose, for the sake of the argument, that the generator could only produce the values $0.01, 0.02, \ldots, 1.00$, each with equal probability. Now take the inverse function $F^{-1}$ and assume that these points map to $x_1 \le x_2 \le \cdots \le x_{100}$, i.e. $x_i$ is the $\frac{i}{100}$-th quantile of the distribution. The random number generator spits out numbers so that the probability of getting a number between 0.1 and 0.7 is 60%. Correspondingly, the probability of drawing a value of the random variable between $x_{10}$ and $x_{70}$ is also exactly 60%. The draws therefore happen in each part of the distribution precisely according to the cumulative distribution $F$.
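The inverse-CDF argument can be sketched in code. For the exponential distribution with rate $\lambda$ the cdf is $F(x) = 1 - e^{-\lambda x}$, so $F^{-1}(u) = -\ln(1-u)/\lambda$; this closed form is standard, though the snippet itself is an illustration rather than part of the text.

```python
import numpy as np

def exponential_via_inverse_cdf(lam, size, rng):
    """Draw from Exp(lam) by pushing U(0,1) draws through F^{-1}(u) = -ln(1-u)/lam."""
    u = rng.uniform(0.0, 1.0, size=size)   # uniform random numbers in (0,1)
    return -np.log(1.0 - u) / lam          # inverse of F(x) = 1 - exp(-lam*x)

rng = np.random.default_rng(0)
draws = exponential_via_inverse_cdf(lam=2.0, size=100_000, rng=rng)

# The sample mean should be close to the theoretical mean 1/lam = 0.5,
# and the sample cdf at x = 0.5 should track F(0.5) = 1 - exp(-1).
print(draws.mean())
print((draws <= 0.5).mean(), 1 - np.exp(-1.0))
```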
Example 3.19 The sampling distributions of different estimators of the parameter $\theta$ from a uniform $(0, \theta)$ distribution

In Example 3.10 we derived the theoretical distribution of the MLE of $\theta$. It is much more intractable to derive the distribution of the MoM estimator. As Example 3.14 showed, however, we might be interested also in adjusting the MLE for bias. In Figure 3.7 we show the sampling distributions (with $\theta = 3$ and $n = 50$) for three estimators:

1. The MoM estimator: $\tilde{\theta} = 2\bar{x}$
2. The MLE: $\hat{\theta} = \max\{x_1, \ldots, x_n\}$. This estimator is biased. The theoretical bias is $-\frac{\theta}{n+1}$, which in this case is $-\frac{3}{51} = -0.05882$, which accords well with the bias derived from the simulations of $-0.059143$. We can estimate the Mean Square Error as the squared bias plus the variance:
   MSE of MLE: $(0.059143)^2 + (0.057856)^2 = 0.0068452$
3. The adjusted (bias-corrected) MLE: $\frac{n+1}{n}\hat{\theta}$
[Figure 3.7: Simulated sampling distributions of the MoM estimator, the MLE and the adjusted MLE.]
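A Monte Carlo experiment along these lines can be sketched as follows (the seed and replication count are illustrative); with $\theta = 3$ and $n = 50$ the simulated bias of the MLE should come out near $-3/51 \approx -0.0588$.

```python
import numpy as np

rng = np.random.default_rng(123)
theta, n, reps = 3.0, 50, 10_000

mle = np.empty(reps)
mom = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0.0, theta, size=n)
    mle[r] = x.max()          # MLE for U(0, theta): the sample maximum
    mom[r] = 2.0 * x.mean()   # MoM: E(x) = theta/2, so theta = 2*E(x)
adj = (n + 1) / n * mle       # bias-corrected MLE

print("MLE bias:", mle.mean() - theta)        # theory: -theta/(n+1) = -0.0588
print("adjusted MLE bias:", adj.mean() - theta)
print("MoM bias:", mom.mean() - theta)
```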
Chapter 4
Asymptotic Theory
(Compare with Wooldridge (2002, Chapter 3, Sections 3.1-3.4).)
4.1
Introduction
The purpose of asymptotic theory is to investigate the properties of random variables as the
sample size tends to infinity. It turns out that there are a number of very powerful results which
describe these properties. They fall into broadly two classes: laws of large numbers and
central limit theorems.
4.2 Sequences and limits

4.2.1 Limits of sequences of numbers

The crucial concept that we will be concerned with is that of the limit of a sequence of numbers or random variables. It describes what happens to that sequence as the number of terms in it gets large. Before we define the limit, however, we need to define the sequence itself. Fundamentally a sequence is defined by a rule. Examples of sequences are:

$$\{n\}_{n=1}^{\infty}, \qquad \left\{\frac{1}{n}\right\}_{n=1}^{\infty}, \qquad \left\{ n\left(x^{1/n} - 1\right) \right\}_{n=1}^{\infty}$$
Example 4.2 The sequence $\frac{1}{n}$ converges to zero. We can prove this quite easily. Assume that $\epsilon > 0$ is given. We now pick $N > \frac{1}{\epsilon}$. This will always be possible, because the natural numbers are not bounded. For all $n > N$ we have that $\frac{1}{n} < \frac{1}{N} < \epsilon$. Consequently $\left|\frac{1}{n} - 0\right| < \epsilon$, i.e. $\frac{1}{n} \to 0$.

It turns out that limits have attractive properties: the limit of a sum is the sum of the limits, the limit of a product is the product of the limits, and the limit of a quotient is the quotient of the limits. In the last case we also have to be careful that the sequence is defined, i.e. that the denominator is not equal to zero.
In some cases these rules do not help us to evaluate the limit. A very useful result in such
cases is given by
Proposition 4.4 (L'Hôpital's rule) If the functions $f$ and $g$ are both differentiable in an interval around $a$, except possibly at $a$, and $f(x)$ and $g(x)$ both tend to zero as $x$ tends to $a$, then if $g'(x) \ne 0$ for all $x$ in this interval,

$$\lim_{x \to a} \frac{f(x)}{g(x)} = \lim_{x \to a} \frac{f'(x)}{g'(x)}$$

The same rule can be applied if $f(x)$ and $g(x)$ both tend to $\infty$, i.e. the formula can be applied to limits of the form $\frac{0}{0}$ or $\frac{\infty}{\infty}$.
Example 4.5 The sequence $n\left(x^{1/n} - 1\right)$ converges to $\ln(x)$. To show this, we note that

$$\lim_{n \to \infty} n\left(x^{1/n} - 1\right) = \lim_{n \to \infty} \frac{x^{1/n} - 1}{\frac{1}{n}}$$

where both numerator and denominator tend to zero. Applying L'Hôpital's rule (differentiating numerator and denominator with respect to $n$):

$$\lim_{n \to \infty} \frac{x^{1/n} - 1}{\frac{1}{n}} = \lim_{n \to \infty} \frac{x^{1/n} \ln x \left(-\frac{1}{n^2}\right)}{-\frac{1}{n^2}} = \lim_{n \to \infty} x^{1/n} \ln x = \ln x$$
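A quick numerical sanity check of this limit (an illustration, not part of the text):

```python
import math

x = 5.0
for n in [10, 1_000, 100_000]:
    # n*(x**(1/n) - 1) should approach ln(x) as n grows
    print(n, n * (x ** (1.0 / n) - 1.0), math.log(x))
```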
If we have a sequence of vectors we can extend the definition given in Definition 4.1:

Definition 4.6 The sequence $\{\mathbf{a}_n\}$ of real vectors has limit $\mathbf{a}$ if for any given positive number $\epsilon$ it is possible to find an integer $N$ such that if $n > N$, $\|\mathbf{a}_n - \mathbf{a}\| < \epsilon$. We say that $\{\mathbf{a}_n\}$ converges to $\mathbf{a}$.

Note that the norm $\|\cdot\|$ is just the normal definition of the length of a vector. It has the property that $\|\mathbf{x}\| > 0$ for every nonzero vector $\mathbf{x}$.
4.2.2 Limits of sequences of random variables
We want to investigate the behaviour of a sequence of random variables. The limit of such a sequence has to be defined somewhat differently, because the terms are no longer numbers, but the outcomes of a random variable. There are different ways of defining convergence for random variables. The simplest of these is convergence in probability.

Definition 4.7 The sequence $\{\mathbf{a}_n\}$ of real or vector-valued random variables tends in probability to the limiting random variable $\mathbf{a}$ if for all $\epsilon > 0$

$$\lim_{n \to \infty} \Pr\left(\|\mathbf{a}_n - \mathbf{a}\| > \epsilon\right) = 0 \quad (4.1)$$

We write this as

$$\text{plim}\ \mathbf{a}_n = \mathbf{a} \quad \text{or} \quad \mathbf{a}_n \xrightarrow{p} \mathbf{a}$$

Note that the limit in front of the probability (in equation 4.1) is the ordinary mathematical limit. We could rewrite this condition equivalently to say that

$$\text{plim}\ \mathbf{a}_n = \mathbf{a} \text{ if } \forall \epsilon > 0 \text{ and } \delta > 0,\ \exists N \text{ s.t. } \Pr\left(\|\mathbf{a}_n - \mathbf{a}\| > \epsilon\right) < \delta \text{ for all } n > N \quad (4.2)$$

Note also that this definition makes sense if $\mathbf{a}$ is just a constant (which is a degenerate random variable).

Intuitively the sequence of random variables $\mathbf{a}_n$ converges to $\mathbf{a}$ if in a large enough sample it is highly improbable to find $\mathbf{a}_n$ far away from $\mathbf{a}$.

Later we will want to consider the probability limit of a sequence $\{\mathbf{A}_n\}$ of random matrices. We can adapt this definition, provided that we find some way of defining the norm of a matrix. This is, in fact, possible.¹ Note that convergence of the random vector $\mathbf{a}_n$ or the random matrix $\mathbf{A}_n$ will always imply convergence of the elements of the vector or matrix.
Example 4.8 Consider the case of tossing a fair coin. Let $x_i$ be the (Bernoulli) random variable equal to one if the outcome is heads and zero if it is tails. This means that $\{x_i\}$ is a sequence of random variables. We can define a new sequence as follows:

$$p_n = \frac{1}{n}\sum_{i=1}^{n} x_i \quad (4.3)$$

It is straightforward to show that $E(p_n) = \frac{1}{2}$ and $Var(p_n) = \frac{\frac{1}{2}\left(1-\frac{1}{2}\right)}{n} = \frac{1}{4n}$. This indicates that $\lim_{n\to\infty} Var(p_n) = 0$. By Chebyshev's inequality

$$\Pr\left(\left|p_n - \frac{1}{2}\right| > \epsilon\right) \le \frac{1}{4n\epsilon^2}$$
¹One possible definition is $\|\mathbf{C}\| = \left[tr(\mathbf{C}'\mathbf{C})\right]^{1/2}$, where $tr(\mathbf{C})$ is the trace of the square matrix $\mathbf{C}$. We will work more with the trace of a matrix in due course.
So the condition above will be satisfied provided that we pick $n$ large enough that $\frac{1}{4n\epsilon^2} < \delta$, i.e. we pick $n$ larger than $\frac{1}{4\delta\epsilon^2}$.

For instance: if $\epsilon = 0.001$ and $\delta = 0.01$, we could set $N$ at 25 000 001 and we would be guaranteed that for every sample where $n > N$ we would have $\Pr\left(\left|p_n - \frac{1}{2}\right| > 0.001\right) < 0.01$, i.e. the probability that the sample proportion of heads deviates from the true value of 0.5 by more than one in a thousand is less than 1%.
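The shrinking tail probability can be checked by simulation; the sketch below (illustrative parameters) compares the empirical $\Pr\left(\left|p_n - \frac{1}{2}\right| > \epsilon\right)$ with the Chebyshev bound $\frac{1}{4n\epsilon^2}$.

```python
import numpy as np

rng = np.random.default_rng(7)
eps, reps = 0.05, 2_000

for n in [100, 400, 1600]:
    # each row is one sample of n coin tosses; p_n is the proportion of heads
    p_n = rng.integers(0, 2, size=(reps, n)).mean(axis=1)
    empirical = np.mean(np.abs(p_n - 0.5) > eps)
    bound = 1.0 / (4 * n * eps ** 2)  # Chebyshev bound on the same probability
    print(n, empirical, bound)
```

The bound is loose (the true tail probability is far smaller), but it is enough to establish convergence in probability.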
This example exemplifies a more general case. In fact any sequence $\{x_n\}$ which has a finite mean $\mu_n$ and finite variance $\sigma_n^2$, where the variance tends to zero, will converge in probability to $\mu$:

Theorem 4.9 Convergence in mean square
If $\{x_n\}$ is a sequence of random variables such that the ordinary limits of $\mu_n$ and $\sigma_n^2$ are $\mu$ and 0 respectively, then $x_n$ converges in probability to $\mu$, i.e.

$$\text{plim}\ x_n = \mu$$

It is clear that mean square convergence is stronger than convergence in probability. This is an example of a law of large numbers. In this case it is a weak law of large numbers. There are strong laws based on a stricter form of convergence than convergence in probability. This form of convergence is termed almost sure convergence:
Definition 4.10 The sequence $\{\mathbf{a}_n\}$ of real or vector-valued random variables is said to converge almost surely (a.s.) to a limiting random variable $\mathbf{a}$ if

$$\Pr\left(\lim_{n \to \infty} \mathbf{a}_n = \mathbf{a}\right) = 1$$

We write

$$\mathbf{a}_n \xrightarrow{a.s.} \mathbf{a}$$

Note that the limit and the probability are interchanged from the previous definition. We will not attempt to explain the subtleties of the difference between the two forms of convergence, except to note that almost sure convergence implies convergence in probability, but not the other way round.
4.2.3 Rules for probability limits

Probability limits obey rules analogous to those for ordinary limits, e.g. for the ratio of two sequences

$$\text{plim}\ \frac{x_n}{y_n} = \frac{\text{plim}\ x_n}{\text{plim}\ y_n}, \quad \text{if plim}\ y_n \ne 0$$
4.2.4 Convergence in distribution

The sequence $\{\mathbf{a}_n\}$ of random variables with distribution functions $F_{a_n}$ converges in distribution to the random variable $\mathbf{a}$ with distribution function $F_a$ if

$$\lim_{n \to \infty} F_{a_n}(\mathbf{b}) = F_a(\mathbf{b})$$

for all real numbers or vectors $\mathbf{b}$ such that the limiting distribution function $F_a(\mathbf{x})$ is continuous in $\mathbf{x}$ at $\mathbf{b}$. One writes

$$\mathbf{a}_n \xrightarrow{d} \mathbf{a}$$

An equivalent way of writing the condition is as

$$\lim_{n \to \infty} \Pr(\mathbf{a}_n \le \mathbf{b}) = F_a(\mathbf{b}) \text{ at all points } \mathbf{b} \text{ where } F_a(\mathbf{x}) \text{ is continuous}$$

Example 4.14 Consider the sequence of random variables $\{x_n\}$ where $x_n \sim N\left(0, \frac{\sigma^2}{n}\right)$. It is intuitively obvious that as $n \to \infty$ this variable collapses to zero. We can show that, indeed, it converges in distribution. We have

$$\Pr(x_n \le b) = \Pr\left(\frac{\sqrt{n}\,x_n}{\sigma} \le \frac{\sqrt{n}\,b}{\sigma}\right) = \Phi\left(\frac{\sqrt{n}\,b}{\sigma}\right)$$

so

$$\lim_{n\to\infty} \Pr(x_n \le b) = F(b) = \begin{cases} 0 & \text{if } b < 0 \\ 1 & \text{if } b \ge 0 \end{cases}$$

So we note that $\lim F_{x_n}(b) = F(b)$ except at $b = 0$, which is the point at which $F(b)$ is discontinuous.

We observe that in this case a sequence of continuous distributions converges to a discrete distribution, and in particular a degenerate distribution.
Rules analogous to those for probability limits apply. If $x_n \xrightarrow{d} x$ and $\text{plim}\ y_n = c$, then

$$x_n + y_n \xrightarrow{d} x + c, \qquad \frac{x_n}{y_n} \xrightarrow{d} \frac{x}{c}, \text{ if } c \ne 0$$

Furthermore, for a continuous function $g$,

$$g(x_n) \xrightarrow{d} g(x)$$

If $x_n$ has a limiting distribution and $\text{plim}(x_n - y_n) = 0$, then $y_n$ has the same limiting distribution as $x_n$.

Theorem 4.16 Convergence in distribution via MGF convergence
(Mittelhammer et al. 2000, Appendix E1, p.65) Let the random variables $x_n$ in the sequence $\{x_n\}$ have MGF $M_{x_n}(t)$ and let $x$ have MGF $M_x(t)$. If $\lim_{n\to\infty} M_{x_n}(t) = M_x(t)$ then $x_n \xrightarrow{d} x$.
4.2.5 Rates of convergence

(This discussion is based on Davidson and MacKinnon 1993, pp. 108-113.) One very useful device in assessing the asymptotic behaviour of a sequence of random variables is given by the O, o notation (big-O, little-o). Here O and o stand for "order", so they are also referred to as order symbols. When we say a quantity is $O(n)$ we mean roughly that it is of the same order as $n$, while if we say it is $o(n)$, we mean that it is of lower order.

Definition 4.17 If $f(n)$ and $g(n)$ are two real-valued functions of the positive integer variable $n$, then the notation

$$f(n) = o(g(n))$$

means that

$$\lim_{n\to\infty} \frac{f(n)}{g(n)} = 0$$

We might say that $f(n)$ is of smaller order than $g(n)$ as $n$ tends to infinity. Note that $f(n)$ does not itself need to have a limit; it is only the comparison which matters. Most often we consider functions $g(n)$ that are powers of $n$, e.g. $n^2$, $n^{-1}$, or $n^0$. In the latter case we would say that $f(n)$ is $o(1)$, since $n^0 = 1$. If a sequence is $o(1)$ we know that $\lim f(n) = 0$.
Definition 4.18 If $f(n)$ and $g(n)$ are two real-valued functions of the positive integer variable $n$, then the notation

$$f(n) = O(g(n))$$

means that there exists a constant $K > 0$, independent of $n$, and a positive integer $N$ such that

$$\left|\frac{f(n)}{g(n)}\right| < K$$

for all $n > N$.

Normally this notation is used to express the "same order" relation, i.e. to tell us the greatest rate at which $f(n)$ changes with $n$. Note, however, that in terms of the definition the ratio could be zero, so the expression "of the same order" can be misleading.
Definition 4.19 If $f(n)$ and $g(n)$ are two real-valued functions of the positive integer variable $n$, then they are asymptotically equal if

$$\lim_{n\to\infty} \frac{f(n)}{g(n)} = 1$$

These definitions have stochastic analogues. The notation $f(n) = o_p(g(n))$ means that

$$\text{plim}\ \frac{f(n)}{g(n)} = 0$$

Similarly, the notation $f(n) = O_p(g(n))$ means that there is a constant $K$ such that for all $\delta > 0$, there is a positive integer $N$ such that

$$\Pr\left(\left|\frac{f(n)}{g(n)}\right| < K\right) > 1 - \delta$$

for all $n > N$. Since there should not be any confusion between the mathematical and the stochastic order symbols, we will drop the p subscript.
Proposition 4.21 Rules for operations with order symbols
Rules for addition and subtraction:

$$O(n^p) \pm O(n^q) = O\left(n^{\max(p,q)}\right)$$
$$o(n^p) \pm o(n^q) = o\left(n^{\max(p,q)}\right)$$
$$O(n^p) \pm o(n^q) = O(n^p) \text{ if } p \ge q$$
$$O(n^p) \pm o(n^q) = o(n^q) \text{ if } p < q$$

Rules for multiplication:

$$O(n^p)\,O(n^q) = O\left(n^{p+q}\right)$$
$$o(n^p)\,o(n^q) = o\left(n^{p+q}\right)$$
$$O(n^p)\,o(n^q) = o\left(n^{p+q}\right)$$

In many cases we will be considering sums of $n$ terms. If these terms are all $O(1)$, then the sum is $O(n)$ unless the terms all have zero means and a central limit theorem can be applied (see below). In that case the order of the sum is $O\left(n^{\frac{1}{2}}\right)$.
Example 4.22 The variable $p_n$ defined in equation 4.3 is such that $\text{plim}\ p_n = \frac{1}{2}$. Consequently $p_n = O(1)$. Consider now

$$w_n = p_n - \frac{1}{2}$$

We have $\text{plim}\ w_n = 0$, hence $w_n = o(1)$. If we define the new sequence

$$z_n = n^{\frac{1}{2}} w_n$$

then, as we will see below, $z_n$ converges in distribution to a normal random variable with mean zero and variance $\frac{1}{4}$, i.e. $z_n \xrightarrow{d} z$ where $z \sim N\left(0, \frac{1}{4}\right)$.
4.3 Asymptotic properties of estimators

We are interested in applying the concepts developed above to samples generated by some DSP. One of the tricky points to consider in this context is how we understand the concept of "enlarging the sample". In the case of simple random sampling from a given distribution this is very straightforward: we simply run the DSP on and on and on.

In practice there may be some tricky issues here. For instance, if we are sampling from a finite population (like people living in South Africa) then there comes a point where we cannot enlarge the sample any more. In the case of cross-country analyses these constraints bind much earlier. Once you have every country in your data set, you are done! Every sample of size $N$ (where $N$ is the total population size) will be identical.

In order to get around this, statisticians think of the finite population itself as a realisation of a DSP which could theoretically have led to different people, GDP outcomes and so on. This "superpopulation" approach allows us to think about getting more draws from the social process even if in practice we could never do so.
4.3.1 Consistency

The first asymptotic property that we want to define is that of consistency. An estimator $\hat{\theta}$ of a vector of parameters $\theta$ is said to be consistent if it converges to its true value as the sample size tends to infinity. This statement is not all that precise. In particular we haven't defined how an estimator can be said to converge.

Let $\hat{\theta}_n$ be the estimator that results from a sample of size $n$. Then we define the estimator $\hat{\theta}$ as the sequence

$$\hat{\theta} = \left\{\hat{\theta}_n\right\}_{n=n_0}^{\infty}$$

where we start the sequence at the minimum sample size $n_0$ at which the statistic $\hat{\theta}_n$ can be computed. In the case of the linear regression model with $k$ parameters, this would require $n_0 = k$.
$$Var(\bar{x}_n) = \frac{\sigma^2}{n} \to 0$$

Consequently $\text{plim}\ \bar{x}_n = \mu$.
Corollary 4.25 In random sampling, for any function $g(x)$, if $E[g(x)]$ and $Var[g(x)]$ are finite constants, then

$$\text{plim}\ \frac{1}{n}\sum_{i=1}^{n} g(x_i) = E[g(x)]$$
4.3.2 Consistency of the sample cumulative distribution function

One of the implications of Theorem 4.24 is that the sample cumulative distribution function from a random sample will be a consistent estimator of the cdf of the distribution. We define the sample cdf $\hat{F}(x)$ as the proportion of the sample that is smaller than or equal to $x$, i.e.

Definition 4.28 Sample cumulative distribution function
The sample cumulative distribution function $\hat{F}(x)$ is defined as

$$\hat{F}(x) = \frac{1}{n}\sum_{i=1}^{n} 1(x_i \le x)$$

where $1(\cdot)$ is the indicator function, which takes on the value of 1 if the condition is true and zero otherwise.

Now define the random variable $z_i$ as $z_i = 1(x_i \le x)$, so $z_i$ is a Bernoulli random variable with parameter $p = \Pr(x_i \le x)$. It follows that $E(z_i) = \Pr(x_i \le x) = F(x)$ (by definition of the cumulative distribution function). Now note that $\hat{F}(x)$ is just the sample mean of the sample outcomes of the Bernoulli random variable $z$, i.e. $\hat{F}(x) = \frac{1}{n}\sum_{i=1}^{n} z_i$. The variance of a Bernoulli random variable is finite, so by Theorem 4.24

$$\text{plim}\ \hat{F}(x) = F(x)$$

Since $x$ is just an arbitrary point, it is clear that the sample cumulative distribution function is a consistent estimator of the population cumulative distribution function.
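This consistency can be illustrated numerically. The sketch below (illustrative seed and sample sizes) compares the sample cdf of standard normal draws with the true cdf $\Phi$, computed here via `math.erf`; the maximum discrepancy over a grid shrinks as $n$ grows.

```python
import math
import numpy as np

def phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

rng = np.random.default_rng(1)
grid = np.linspace(-3, 3, 61)
true_cdf = np.array([phi(g) for g in grid])

errors = {}
for n in [100, 20_000]:
    x = rng.normal(size=n)
    # sample cdf: proportion of observations <= g, evaluated on the grid
    ecdf = np.array([(x <= g).mean() for g in grid])
    errors[n] = np.abs(ecdf - true_cdf).max()

print(errors)
```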
4.3.3 Consistency of method of moments estimators

Another implication of Theorem 4.24 and its corollary is that in many situations method of moments estimation will yield consistent estimators. We will not prove this in general, but sketch out the intuition in the case of a one-parameter distribution. Suppose that $x$ is a random variable with pdf $f(x; \theta)$. Suppose also that it has finite mean $E(x)$ and finite variance. Now $E(x)$ will typically depend on the parameter $\theta$. Let us assume that we can write

$$E(x) = g(\theta)$$

where $g$ is some continuous monotonic function, so that it has a continuous inverse $g^{-1}$. We can then solve out for $\theta$ as

$$\theta = g^{-1}\left(E[x]\right)$$

Our method of moments estimator would be given by

$$\hat{\theta} = g^{-1}\left(\bar{x}\right)$$

where we have used the sample mean $\bar{x}$ in place of the population mean $E[x]$. By Theorem 4.24 we know that $\text{plim}\ \bar{x} = E[x]$. It now follows by Slutsky's Theorem (Theorem 4.11) that

$$\text{plim}\ \hat{\theta} = g^{-1}\left(\text{plim}\ \bar{x}\right) = \theta$$
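As a concrete sketch (my example, not the text's): for an exponential distribution with rate $\lambda$, $E(x) = 1/\lambda = g(\lambda)$, so the MoM estimator is $\hat{\lambda} = g^{-1}(\bar{x}) = 1/\bar{x}$, and it should home in on the true $\lambda$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 2.0  # true rate parameter

for n in [100, 10_000, 1_000_000]:
    x = rng.exponential(scale=1.0 / lam, size=n)
    lam_hat = 1.0 / x.mean()  # MoM: invert E(x) = 1/lambda at the sample mean
    print(n, lam_hat)
```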
4.4 Central limit theorems

If an estimator is consistent, its distribution collapses to a spike as $n \to \infty$. This does not make it suitable for statistical inference. Nevertheless we noted in Example 4.22 above that although the random variable $w_n$ collapsed to a spike, the variable $n^{\frac{1}{2}} w_n$ did not. In fact $n^{\frac{1}{2}} w_n \xrightarrow{d} z$ where $z \sim N\left(0, \frac{1}{4}\right)$, so that the distribution of this variable is nondegenerate.
Theorem 4.29 Simple Central Limit Theorem (Lyapunov)
Let $\{x_t\}$ be a sequence of independent, centered random variables with variances $\sigma_t^2$ such that $\underline{\sigma}^2 \le \sigma_t^2 \le \bar{\sigma}^2$, where the lower and upper bounds are finite positive constants, and absolute third moments $\mu_{3,t}$ such that $\mu_{3,t} \le \bar{\mu}_3$ for a finite constant $\bar{\mu}_3$. Further let

$$\sigma_0^2 \equiv \lim_{n\to\infty} \frac{1}{n}\sum_{t=1}^{n} \sigma_t^2$$

exist. Then the sequence

$$n^{-\frac{1}{2}} \sum_{t=1}^{n} x_t$$

tends in distribution to a limit characterised by the normal distribution with mean zero and variance $\sigma_0^2$. (Davidson and MacKinnon 1993, p.126)
We can apply this theorem directly to the case of the Bernoulli random variable considered in Examples 4.8 and 4.22. Define the centered version of the random Bernoulli variable as

$$y_i = x_i - \frac{1}{2}$$

Each $y_i$ has mean zero and variance $\frac{1}{4}$, so $\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \frac{1}{4} = \frac{1}{4}$. By the central limit theorem the sequence $n^{-\frac{1}{2}} \sum_{i=1}^{n} y_i$ will converge to a $N\left(0, \frac{1}{4}\right)$ random variable. Note that

$$n^{-\frac{1}{2}} \sum_{i=1}^{n} y_i = n^{\frac{1}{2}} \left(\frac{1}{n}\sum_{i=1}^{n} y_i\right) = n^{\frac{1}{2}} w_n$$

which confirms our previous observation that this variable has a nondegenerate normal limiting distribution.

The implication of the central limit theorem is quite astonishing: it does not matter what the nature of the original distribution is, the outcome is always a normal distribution!
Example 4.30 The limiting distribution when $x_i$ is normal
Let $x_i \sim N(0, 1)$. Then $x_i$ meets all the requirements of the theorem. Furthermore in this case we know that $\sum_{i=1}^{n} x_i \sim N(0, n)$, so it is easy to see that $n^{-\frac{1}{2}} \sum_{i=1}^{n} x_i \sim N(0, 1)$.
The remarkable thing about the central limit theorem is that when we sum up the individual variables, the original properties of the distribution somehow get lost. Davidson and MacKinnon (1993, pp. 126-7) sketch out why this should be the case. Observe that in the particular case where each of the variables $x_t$ has identical distribution with mean zero and variance $\sigma^2$, the variable $z_n = n^{-\frac{1}{2}} \sum_{t=1}^{n} x_t$ is such that $E(z_n) = 0$ and $Var(z_n) = \sigma^2$. Now consider a higher moment:

$$E\left(z_n^4\right) = E\left(n^{-\frac{1}{2}} \sum_{t=1}^{n} x_t\right)^4 = \frac{1}{n^2} \sum_{t=1}^{n}\sum_{s=1}^{n}\sum_{r=1}^{n}\sum_{q=1}^{n} E(x_t x_s x_r x_q)$$

If any one of the indices is different from the others, say $t \ne s$, $t \ne r$, $t \ne q$, then $E(x_t x_s x_r x_q) = E(x_t)\,E(x_s x_r x_q) = 0$, by the independence of the variables. The only nonzero expectations will involve terms where $t = s = r = q$ or where the indices fall into pairs, e.g. $t = s$ and $r = q$. The former terms are of the type $E\left(x_t^4\right)$, i.e. involve the fourth moments of the variables $x_t$. There are, however, only $n$ of these and with the factor of $\frac{1}{n^2}$ they contribute to $E\left(z_n^4\right)$ only at order $\frac{1}{n}$. The terms of the second type are of the sort $E\left(x_t^2\right)E\left(x_r^2\right) = \sigma^4$. There are $3n(n-1)$ such pairs, which is $O(n^2)$, so these terms contribute at order unity. Thus to leading order the fourth moment of $z_n$ depends only on $\sigma^2$; it does not depend on the fourth moment of the random variables $x_t$.

It is instructive to also consider an odd moment higher than two:

$$E\left(z_n^3\right) = E\left(n^{-\frac{1}{2}} \sum_{t=1}^{n} x_t\right)^3 = \frac{1}{n^{\frac{3}{2}}} \sum_{t=1}^{n}\sum_{s=1}^{n}\sum_{r=1}^{n} E(x_t x_s x_r)$$

The only nonzero terms in this case are terms of the type $E\left(x_t^3\right)$. There are only $n$ of these and since we have assumed that the third moment is finite, we have $\frac{1}{n^{3/2}}\sum_t E\left(x_t^3\right) = \frac{1}{n^{1/2}}\mu_3^0$, which is $O\left(n^{-\frac{1}{2}}\right)$ and converges to zero. Similar arguments will show that the odd moments will all vanish asymptotically, i.e. the limiting distribution is symmetric, while the higher order even moments only depend on $\sigma^2$, i.e. the higher order moments of the random variables $x_t$ do not influence the asymptotic distribution.
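This loss of distribution-specific information is easy to see in a small simulation (illustrative parameters). Starting from a heavily skewed distribution, a centered exponential with $\sigma^2 = 1$, the third moment of $z_n$ drains away at rate $n^{-1/2}$ while the fourth moment approaches the Gaussian value $3\sigma^4 = 3$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 100, 20_000

# x_t: centered exponential draws (mean 0, variance 1, third central moment 2)
x = rng.exponential(size=(reps, n)) - 1.0
z = x.sum(axis=1) / np.sqrt(n)  # z_n = n^{-1/2} * sum of x_t

print("E z^3 ~", (z ** 3).mean(), "(theory: 2/sqrt(n) =", 2 / np.sqrt(n), ")")
print("E z^4 ~", (z ** 4).mean(), "(Gaussian value: 3)")
```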
Theorem 4.31 Lindeberg-Lévy Central Limit Theorem
If $x_1, \ldots, x_n$ are a random sample from a probability distribution with finite mean $\mu$ and finite variance $\sigma^2$, and $\bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i$, then

$$\sqrt{n}\left(\bar{x}_n - \mu\right) \xrightarrow{d} N\left(0, \sigma^2\right)$$

Theorem 4.33 Lindeberg-Feller Central Limit Theorem
Suppose $x_1, \ldots, x_n$ are independent random variables with means $\mu_i$ and finite variances $\sigma_i^2$. Let

$$\bar{\mu}_n = \frac{1}{n}\sum_{i=1}^{n} \mu_i \quad \text{and} \quad \bar{\sigma}_n^2 = \frac{1}{n}\sum_{i=1}^{n} \sigma_i^2$$

If $\lim_{n\to\infty} \frac{\max(\sigma_i^2)}{n\bar{\sigma}_n^2} = 0$ and $\lim_{n\to\infty} \bar{\sigma}_n^2 = \bar{\sigma}^2 < \infty$, then

$$\sqrt{n}\left(\bar{x}_n - \bar{\mu}_n\right) \xrightarrow{d} N\left(0, \bar{\sigma}^2\right)$$
Theorem 4.34 Multivariate Lindeberg-Feller Central Limit Theorem
Suppose that $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are a sample of random vectors such that $E(\mathbf{x}_i) = \boldsymbol{\mu}_i$, $Var(\mathbf{x}_i) = \mathbf{Q}_i$, and all mixed third moments of the multivariate distribution are finite. Let

$$\bar{\boldsymbol{\mu}}_n = \frac{1}{n}\sum_{i=1}^{n} \boldsymbol{\mu}_i, \qquad \bar{\mathbf{Q}}_n = \frac{1}{n}\sum_{i=1}^{n} \mathbf{Q}_i$$

We assume that

$$\lim_{n\to\infty} \bar{\mathbf{Q}}_n = \mathbf{Q}$$

and that no single term dominates the average, i.e. for every $i$

$$\lim_{n\to\infty} \left(n\bar{\mathbf{Q}}_n\right)^{-1} \mathbf{Q}_i = \mathbf{0}$$

In this case

$$\sqrt{n}\left(\bar{\mathbf{x}}_n - \bar{\boldsymbol{\mu}}_n\right) \xrightarrow{d} N\left(\mathbf{0}, \mathbf{Q}\right)$$

Definition 4.35 If $\sqrt{n}\left(\hat{\theta}_n - \theta\right) \xrightarrow{d} N\left(\mathbf{0}, \mathbf{V}\right)$, then we say that $\hat{\theta}_n$ is asymptotically normally distributed with asymptotic covariance matrix $\frac{1}{n}\mathbf{V}$.
4.5 Asymptotic properties of maximum likelihood estimators
It turns out that, given suitable regularity conditions, maximum likelihood estimators have a lot of very attractive asymptotic properties:

Theorem 4.36 Properties of an MLE
Under regularity conditions, the maximum likelihood estimator (MLE) has the following asymptotic properties:

1. Consistency: $\text{plim}\ \hat{\theta} = \theta_0$

2. Asymptotic normality: $\hat{\theta} \xrightarrow{a} N\left(\theta_0, \left[\mathbf{I}(\theta_0)\right]^{-1}\right)$, where

$$\mathbf{I}(\theta_0) = -E_0\left[\frac{\partial^2 \ln L(\theta)}{\partial\theta_0\,\partial\theta_0'}\right]$$

The information matrix can equivalently be written in terms of the outer product of the score:

$$\mathbf{I}(\theta_0) = E_0\left[\frac{\partial \ln L(\theta)}{\partial\theta_0}\,\frac{\partial \ln L(\theta)}{\partial\theta_0'}\right]$$

which suggests the estimate $\hat{\mathbf{I}}(\theta_0) = \sum_i \hat{\mathbf{g}}_i \hat{\mathbf{g}}_i'$, the sum of the outer products of the individual score contributions, so that we just need to have an estimate of the gradient vector $\frac{\partial \ln L}{\partial\theta}$ evaluated at $\hat{\theta}$.
4.6 Appendix

4.6.1 Chebyshev's Inequality

Theorem 4.37 If the random variable $x$ has zero mean and a finite variance $\sigma^2$, then

$$\Pr(|x| \ge \delta) \le \frac{\sigma^2}{\delta^2}$$

Proof. Write the variance as

$$\sigma^2 = \int_{-\infty}^{\infty} x^2 f(x)\,dx = \int_{-\delta}^{\delta} x^2 f(x)\,dx + \int_{-\infty}^{-\delta} x^2 f(x)\,dx + \int_{\delta}^{\infty} x^2 f(x)\,dx$$

Each of these terms is nonnegative. Considering the last two terms, we note that in both cases $x^2 \ge \delta^2$ over the entire domain of integration, so

$$\int_{-\infty}^{-\delta} x^2 f(x)\,dx + \int_{\delta}^{\infty} x^2 f(x)\,dx \ge \delta^2 \Pr(|x| \ge \delta)$$

Consequently

$$\sigma^2 \ge \delta^2 \Pr(|x| \ge \delta)$$

which proves the result.
Alternative proof. (Greene 2003, p.898) The alternative proof proceeds first by proving Markov's inequality, i.e.

$$\Pr(y \ge c) \le \frac{E(y)}{c}$$

if $y$ is a nonnegative random variable and $c$ is a positive constant. The proof follows from the fact that

$$E(y) = \Pr(y < c)\,E(y \mid y < c) + \Pr(y \ge c)\,E(y \mid y \ge c)$$

The first term on the right hand side is nonnegative, and $E(y \mid y \ge c) \ge c$, so

$$E(y) \ge c \Pr(y \ge c)$$

from which the inequality follows. Substituting in $y = x^2$ and $c = \delta^2$ we get Chebyshev's inequality.
Corollary 4.38 If the random variable $x$ has mean $\mu$ and a finite variance $\sigma^2$, then

$$\Pr(|x - \mu| \ge \delta) \le \frac{\sigma^2}{\delta^2}$$
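A quick empirical check of the corollary (an illustration): for normal draws the bound is quite loose, the empirical tail probability sits well under $\sigma^2/\delta^2$.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

delta = 2 * sigma  # look two standard deviations out
empirical = np.mean(np.abs(x - mu) >= delta)
bound = sigma ** 2 / delta ** 2  # Chebyshev: sigma^2/delta^2 = 0.25

print(empirical, bound)  # for the normal, empirical is near 0.0455
```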
Chapter 5
Statistical Inference
In this chapter we consider the question of how we can use the sample information to answer questions about what the state of the world (as represented by the DSP) might actually be.
5.1
Hypothesis Testing
The basic mechanics of hypothesis testing should be familiar by now. The general principle is that we formulate a null hypothesis $H_0$ about the parameter vector $\theta$ as well as an alternative hypothesis $H_1$. On the assumption that $H_0$ is true we can derive the sampling distribution (or asymptotic distribution) of a given estimator $\hat{\theta}$. Typically we will use this estimator as the basis for constructing a test statistic $T$. This test statistic is a scalar, so it lends itself to making simple decisions of the type "accept" or "reject". Under the assumption that $H_0$ is true, these test statistics will have their own sampling distribution or asymptotic distribution. In order to adjudicate between $H_0$ and $H_1$ we form a decision rule. This will take the form of specifying a rejection region $C$. The complement of this will be the acceptance region, i.e. if our test statistic falls into the acceptance region we accept $H_0$. If it falls into the rejection region, we reject $H_0$ in favour of $H_1$. In essence we calculate the probability of observing the test statistic (or an outcome more extreme from the point of view of the comparison between $H_0$ and $H_1$), given the hypothesis $H_0$. In other words, we assume that $H_0$ is true for the purposes of calculating the distribution of our test statistics.
5.1.1 Types of errors

We can summarise the possible outcomes of the test in the form of the following table:

                                 State of the world (DSP)
                                 $H_0$ is true    $H_1$ is true
  Test decision   Accept $H_0$   correct          Type II error
                  Reject $H_0$   Type I error     correct
5.1.2 Power of a test

The power function $\pi(\theta)$ of a test gives, for each value of $\theta$, the probability that the test rejects $H_0$. If the power function is evaluated at a $\theta$ that is contained in $\Theta_0$, then the power of the test is equal to the probability of making a Type I error. If the power function is evaluated at a $\theta$ at which $H_1$ is true, then the power of the test is equal to $1 - \Pr(\text{Type II error})$.

We will say that the test of the hypothesis $H_0$ versus $H_1$ is of size $\alpha$ if

$$\sup_{\theta \in \Theta_0} \pi(\theta) = \alpha$$

We will say that the test is conducted at the significance level $\alpha$ if $\sup_{\theta \in \Theta_0} \pi(\theta) \le \alpha$.

Example 5.1 Consider the case of sampling from a distribution that is known to be $N(\mu, 4)$, i.e. we know the distribution is normal and it has a variance of 4. We want to set up the test

$$H_0: \mu = 0 \qquad H_1: \mu \ne 0$$

Assume initially that we have a sample of size 4 from this distribution. We know that $\bar{x} \sim N(\mu, 1)$. We will use the test statistic $z = \frac{\bar{x}}{1}$ together with the rejection region $C = \{z < -1.96\} \cup \{z > 1.96\}$ to implement the test. Note that under $H_0$ the test statistic is distributed as $N(0, 1)$, so $\Pr(z \in C \mid \mu = 0) = 0.05$. We can graph the power function for this case:
[Figure: power functions of the test for three different sample sizes; each curve attains its minimum of 0.05 at $\mu = 0$ and rises towards 1 as $|\mu|$ increases, more steeply for larger samples.]
The three tests considered above obviously form part of a sequence of tests where the test statistic $z_n$ is given by

$$z_n = \frac{\bar{x}_n}{\sqrt{4/n}}$$

and the rejection region is given by $C = \{z_n < -1.96\} \cup \{z_n > 1.96\}$. This sequence of tests fixes the probability of making a Type I error at 0.05, i.e. they are all of the same size.

A consistent sequence of tests (of size $\alpha$) is such that if $H_1$ is true then $\Pr(z_n \in C) \to 1$ as $n \to \infty$. In other words, in large samples the probability of making a Type II error goes to zero.
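The power function of this z-test has a closed form, $\pi(\mu; n) = 1 - \Phi\left(1.96 - \frac{\mu\sqrt{n}}{2}\right) + \Phi\left(-1.96 - \frac{\mu\sqrt{n}}{2}\right)$, where $\Phi$ is the standard normal cdf and 2 is the known standard deviation. The sketch below (illustrative) evaluates it, showing size 0.05 at $\mu = 0$ and power rising with $n$, which is the consistency of the test sequence.

```python
import math

def phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power(mu, n, sigma=2.0, crit=1.96):
    """Probability that |xbar / sqrt(sigma^2/n)| exceeds crit when the true mean is mu."""
    shift = mu * math.sqrt(n) / sigma
    return 1.0 - phi(crit - shift) + phi(-crit - shift)

size = power(0.0, 4)                       # the size of the test at the null
powers = [power(0.5, n) for n in (4, 16, 100)]
print(size, powers)
```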
5.2
Types of tests
In general we will consider tests of the form $H_0: c(\theta) = 0$, where $c$ is some linear or nonlinear function of $\theta$. These functions can be regarded as restrictions imposed on the parameter space. These tests can be constructed on the basis of three principles:

1. The Wald principle states that we should estimate the unrestricted model and obtain our estimate $c\left(\hat{\theta}\right)$ accordingly. We should then investigate how close $c\left(\hat{\theta}\right)$ is to zero. If it is close, then we would accept $H_0$. Otherwise we would reject it. Typical examples of Wald-like tests are t-tests and F-tests run on unrestricted regressions.

2. The likelihood ratio principle states that we should compare the fit (the maximised likelihood) of the unrestricted estimates $\hat{\theta}$ with that achieved when the restrictions are imposed.

3. The Lagrange multiplier principle states that we should estimate the restricted model and investigate whether the restrictions are binding, i.e. whether the score of the unrestricted loglikelihood is close to zero at the restricted estimates.
5.2.1 Wald tests

Suppose that asymptotically $\hat{\theta} \sim N(\theta, \mathbf{V})$ and that we wish to test the linear hypothesis $H_0: \mathbf{R}\theta = \mathbf{c}$, where $\mathbf{R}$ is some $J \times k$ matrix of constants. We can show that

$$\mathbf{R}\hat{\theta} \sim N\left(\mathbf{R}\theta, \mathbf{R}\mathbf{V}\mathbf{R}'\right)$$

The Wald statistic is the quadratic form

$$W = \left(\mathbf{R}\hat{\theta} - \mathbf{c}\right)'\left[\mathbf{R}\mathbf{V}\mathbf{R}'\right]^{-1}\left(\mathbf{R}\hat{\theta} - \mathbf{c}\right) \quad (5.1)$$

which under $H_0$ is distributed as $\chi^2(J)$, where $J$ is the number of restrictions.

As a special case, consider the hypothesis $H_0: \theta_j = c$, a restriction on a single coefficient. Here $\mathbf{R}$ is a row vector with a one in the $j$-th position and zeros elsewhere, so $\mathbf{R}\mathbf{V}\mathbf{R}'$ in this case will simply extract the $j$-th element on the diagonal, which is $Var\left(\hat{\theta}_j\right)$. Consequently

$$W = \frac{\left(\hat{\theta}_j - c\right)^2}{Var\left(\hat{\theta}_j\right)}$$

which is the square of a $N(0, 1)$ variable. This will be $\chi^2(1)$. So our Wald test in this case is equivalent to doing a normal test of the hypothesis $H_0: \theta_j = c$.
5.2.2 Likelihood ratio tests

In our discussion of the principle of maximum likelihood, we argued that the likelihood $L(\theta|\mathbf{y})$ represented how likely the given sample values $\mathbf{y}$ were if the true parameter vector was $\theta$. Consider now the case where we restrict the parameter space that we can consider. Beforehand we were free to consider any $\theta \in \Theta$. Now we will do our optimisation over the restricted parameter space $\Theta_0 \subset \Theta$. The likelihood value that we manage to obtain on the unrestricted parameter space is $L\left(\hat{\theta}|\mathbf{y}\right)$. We denote this as $\hat{L}$. The likelihood value on the restricted parameter space, i.e. $L\left(\hat{\theta}_0|\mathbf{y}\right)$, we denote as $\hat{L}_0$. The ratio of these is a measure of how reasonable the restriction is. We have

$$0 \le \frac{\hat{L}_0}{\hat{L}} \le 1$$

If we get values close to one we would accept the validity of the restrictions, while values close to zero should lead to rejection of the null hypothesis.

The actual LR test is based on the statistic

$$LR = -2\ln\frac{\hat{L}_0}{\hat{L}} = 2\left[\ln\hat{L} - \ln\hat{L}_0\right] \quad (5.2)$$

which under $H_0$ is asymptotically distributed as $\chi^2(J)$, where $J$ is the number of restrictions.
5.2.3 Lagrange multiplier tests

The Lagrange multiplier (or score) test is based on the restricted model. Imposing the restrictions $\mathbf{R}\theta = \mathbf{c}$ by means of the Lagrangian

$$\mathcal{L} = \ln L(\theta) - \boldsymbol{\lambda}'\left(\mathbf{R}\theta - \mathbf{c}\right)$$

yields the first-order conditions

$$\frac{\partial \ln L}{\partial\theta} = \mathbf{R}'\hat{\boldsymbol{\lambda}}, \qquad \mathbf{R}\hat{\theta}_0 = \mathbf{c} \quad (5.3)$$

If the constraints are not binding, the Lagrange multipliers would be zero. Another way of putting this is that if $\hat{\boldsymbol{\lambda}} = \mathbf{0}$, the restricted estimates would also maximise the unrestricted loglikelihood. A test based on $\hat{\boldsymbol{\lambda}}$ would therefore seem appropriate. The LM test statistic is

$$LM = \hat{\boldsymbol{\lambda}}'\mathbf{R}\left[\mathbf{I}(\theta_0)\right]^{-1}\mathbf{R}'\hat{\boldsymbol{\lambda}}$$

which has a $\chi^2(J)$ distribution. In this form we would need to have estimated the Lagrange multipliers. We can derive a more tractable version of the test by using equation 5.3, i.e. the test statistic becomes

$$LM = \left(\frac{\partial \ln L}{\partial\theta}\right)'\left[\mathbf{I}(\theta_0)\right]^{-1}\left(\frac{\partial \ln L}{\partial\theta}\right) \quad (5.4)$$

Note that the expression $\frac{\partial \ln L}{\partial\theta}$ should more correctly be written as $\left.\frac{\partial \ln L}{\partial\theta}\right|_{\hat{\theta}_0}$, i.e. it is the gradient of the unrestricted loglikelihood evaluated at the restricted estimates.
5.3 An example: testing the shape parameter of a Pareto distribution

Consider a random sample $x_1, \ldots, x_n$ from a Pareto distribution with density $f(x) = \theta k^{\theta} x^{-(\theta+1)}$ for $x \ge k$, where the threshold $k$ is known. The likelihood is

$$L(\theta) = \prod_{i=1}^{n} \theta k^{\theta} x_i^{-(\theta+1)} = \theta^n k^{n\theta}\left(x_1 x_2 \cdots x_n\right)^{-(\theta+1)}$$

Consequently

$$\ln L(\theta) = n\ln\theta + n\theta\ln k - (\theta+1)\sum_{i=1}^{n}\ln x_i$$

We differentiate this with respect to $\theta$:

$$\frac{\partial \ln L}{\partial\theta} = \frac{n}{\theta} + n\ln k - \sum_{i=1}^{n}\ln x_i \quad (5.5)$$

And our MLE is the value that sets this gradient equal to zero, i.e.

$$\hat{\theta} = \frac{n}{\sum_{i=1}^{n}\ln x_i - n\ln k}$$

The information is $I(\theta) = \frac{n}{\theta^2}$, so the asymptotic variance of $\hat{\theta}$ is $\frac{\theta^2}{n}$. We wish to test

$$H_0: \theta = 2 \quad \text{against} \quad H_1: \theta \ne 2$$
5.3.1 Wald test

Since $\hat{\theta}$ is a maximum likelihood estimator we know that $\hat{\theta} \sim N\left(\theta, \frac{\theta^2}{n}\right)$ asymptotically. Given the null hypothesis $\theta = 2$ we presume that $\hat{\theta} \sim N\left(2, \frac{4}{n}\right)$. In the empirical work we get an estimate of $\hat{\theta} = 3.1518495$, with $n = 2582$. Consequently our Wald statistic is

$$W = \left(\hat{\theta} - 2\right)'\left[\frac{4}{n}\right]^{-1}\left(\hat{\theta} - 2\right) = (3.1518495 - 2)\,\frac{2582}{4}\,(3.1518495 - 2) = 856.42$$

This is distributed as $\chi^2(1)$. We can safely reject the null hypothesis.
5.3.2 Likelihood ratio test

We can use the empirical estimates to obtain the value of the unrestricted loglikelihood. We note that $\sum_{i=1}^{n}\ln x_i = 8.834467 \times 2582 = 22811$. Consequently

$$\ln\hat{L} = n\ln\hat{\theta} + n\hat{\theta}\ln k - \left(\hat{\theta}+1\right)\sum_{i=1}^{n}\ln x_i = -22430.0$$

while imposing $\theta = 2$ gives the restricted loglikelihood

$$\ln\hat{L}_0 = n\ln 2 + 2n\ln k - 3\sum_{i=1}^{n}\ln x_i = -22661$$

Hence

$$LR = 2\left[\ln\hat{L} - \ln\hat{L}_0\right] = 2\left(-22430.0 - (-22661)\right) = 462.0$$
5.3.3 Lagrange multiplier test

Finally we calculate the score version of the LM test. We substitute the restricted value of $\theta$ into the equation of the gradient (equation 5.5), to get

$$\left.\frac{\partial \ln L}{\partial\theta}\right|_{\theta=2} = \frac{n}{2} + n\ln k - \sum_{i=1}^{n}\ln x_i = \frac{2582}{2} + 2582\ln(5000) - 22811 = 471.39$$

Consequently, since $I(2) = \frac{n}{\theta_0^2} = \frac{2582}{4}$,

$$LM = 471.39 \times \frac{4}{2582} \times 471.39 = 344.24$$

In this case the LM test is the most conservative, while the Wald test will be most likely to reject. Of course asymptotically they are all equal, but this is not true in the finite samples to which these tests are applied.
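The three statistics can be reproduced from the summary quantities reported above ($n = 2582$, mean of $\ln x$ equal to 8.834467, $k = 5000$, $\theta_0 = 2$); the arithmetic below is a sketch of that calculation, and small discrepancies from the reported values reflect rounding in the inputs.

```python
import math

n, k, theta0 = 2582, 5000.0, 2.0
sum_lnx = 8.834467 * n          # reported mean of ln(x) times n

theta_hat = n / (sum_lnx - n * math.log(k))  # MLE of the Pareto shape

def loglik(theta):
    """Pareto loglikelihood: n*ln(theta) + n*theta*ln(k) - (theta+1)*sum(ln x)."""
    return n * math.log(theta) + n * theta * math.log(k) - (theta + 1) * sum_lnx

wald = (theta_hat - theta0) ** 2 * n / theta0 ** 2   # uses Var = theta0^2/n
lr = 2 * (loglik(theta_hat) - loglik(theta0))
score = n / theta0 + n * math.log(k) - sum_lnx       # gradient at theta0
lm = score ** 2 * theta0 ** 2 / n                    # score' I(theta0)^{-1} score

print(theta_hat, wald, lr, lm)
```

The ordering $LM < LR < W$ that the text comments on shows up directly in the computed values.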
5.4 An example: the bivariate normal distribution

In the appendix to this chapter we show that the MLEs of the parameters of the bivariate normal are given by

$$\hat{\mu}_x = \frac{\sum x_i}{n}, \qquad \hat{\mu}_y = \frac{\sum y_i}{n}$$

$$\hat{\sigma}_x^2 = \frac{\sum\left(x_i - \hat{\mu}_x\right)^2}{n}, \qquad \hat{\sigma}_y^2 = \frac{\sum\left(y_i - \hat{\mu}_y\right)^2}{n}$$

$$\hat{\rho} = \frac{\sum\left(x_i - \hat{\mu}_x\right)\left(y_i - \hat{\mu}_y\right)}{n\,\hat{\sigma}_x\hat{\sigma}_y}$$

which (except for the divisors) is what we might have expected. We also show in the appendix that the asymptotic covariance matrix of $\left(\hat{\mu}_x, \hat{\mu}_y, \hat{\sigma}_x^2, \hat{\sigma}_y^2, \hat{\rho}\right)'$ is given by

$$\frac{1}{n}\begin{bmatrix}
\sigma_x^2 & \rho\sigma_x\sigma_y & 0 & 0 & 0 \\
\rho\sigma_x\sigma_y & \sigma_y^2 & 0 & 0 & 0 \\
0 & 0 & 2\sigma_x^4 & 2\rho^2\sigma_x^2\sigma_y^2 & \rho\left(1-\rho^2\right)\sigma_x^2 \\
0 & 0 & 2\rho^2\sigma_x^2\sigma_y^2 & 2\sigma_y^4 & \rho\left(1-\rho^2\right)\sigma_y^2 \\
0 & 0 & \rho\left(1-\rho^2\right)\sigma_x^2 & \rho\left(1-\rho^2\right)\sigma_y^2 & \left(1-\rho^2\right)^2
\end{bmatrix}$$
We wish to test the hypothesis that the random variables $x$ and $y$ come from the same underlying distribution, i.e. that $\mu_x = \mu_y$ and $\sigma_x^2 = \sigma_y^2$. We can formulate the null hypothesis as

$$H_0: \mathbf{R}\theta = \begin{bmatrix} 1 & -1 & 0 & 0 & 0 \\ 0 & 0 & 1 & -1 & 0 \end{bmatrix}\begin{bmatrix} \mu_x \\ \mu_y \\ \sigma_x^2 \\ \sigma_y^2 \\ \rho \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$
5.4.1 Testing a single restriction

In order to work up to the joint test it is useful first to consider what the test of the single hypothesis

$$H_0: \mu_x = \mu_y$$

would look like. The Wald statistic is given by

$$W = \left(\mathbf{R}\hat{\theta} - \mathbf{c}\right)'\left[\mathbf{R}\mathbf{V}\mathbf{R}'\right]^{-1}\left(\mathbf{R}\hat{\theta} - \mathbf{c}\right)$$

where now $\mathbf{R} = \begin{bmatrix} 1 & -1 & 0 & 0 & 0 \end{bmatrix}$ and $\mathbf{c} = 0$, so that $\mathbf{R}\hat{\theta} = \hat{\mu}_x - \hat{\mu}_y$. Since $\mathbf{R}\hat{\theta}$ is a scalar in this case, $\mathbf{R}\mathbf{V}\mathbf{R}'$ reduces to the variance of $\mathbf{R}\hat{\theta}$:

$$\mathbf{R}\mathbf{V}\mathbf{R}' = \frac{\sigma_x^2 - 2\rho\sigma_x\sigma_y + \sigma_y^2}{n}$$

Consequently, replacing the unknown parameters by their ML estimates, the Wald statistic is

$$W = \frac{n\left(\hat{\mu}_x - \hat{\mu}_y\right)^2}{\hat{\sigma}_x^2 - 2\hat{\rho}\hat{\sigma}_x\hat{\sigma}_y + \hat{\sigma}_y^2}$$

which is distributed as $\chi^2(1)$ under the null hypothesis.
5.4.2 Testing the joint hypothesis

The test of the joint hypothesis is a little bit more complicated, but not much so. In this case we have

$$\mathbf{R}\hat{\theta} = \begin{bmatrix} 1 & -1 & 0 & 0 & 0 \\ 0 & 0 & 1 & -1 & 0 \end{bmatrix}\begin{bmatrix} \hat{\mu}_x \\ \hat{\mu}_y \\ \hat{\sigma}_x^2 \\ \hat{\sigma}_y^2 \\ \hat{\rho} \end{bmatrix} = \begin{bmatrix} \hat{\mu}_x - \hat{\mu}_y \\ \hat{\sigma}_x^2 - \hat{\sigma}_y^2 \end{bmatrix}$$

Furthermore

$$\mathbf{R}\mathbf{V}\mathbf{R}' = \frac{1}{n}\begin{bmatrix} \sigma_x^2 - 2\rho\sigma_x\sigma_y + \sigma_y^2 & 0 \\ 0 & 2\sigma_x^4 + 2\sigma_y^4 - 4\rho^2\sigma_x^2\sigma_y^2 \end{bmatrix}$$

The off-diagonal terms are zero because the mean estimators are asymptotically uncorrelated with the variance estimators. The Wald statistic is therefore the sum of two terms:

$$W = \frac{n\left(\hat{\mu}_x - \hat{\mu}_y\right)^2}{\hat{\sigma}_x^2 - 2\hat{\rho}\hat{\sigma}_x\hat{\sigma}_y + \hat{\sigma}_y^2} + \frac{n\left(\hat{\sigma}_x^2 - \hat{\sigma}_y^2\right)^2}{2\hat{\sigma}_x^4 + 2\hat{\sigma}_y^4 - 4\hat{\rho}^2\hat{\sigma}_x^2\hat{\sigma}_y^2}$$

which is distributed as $\chi^2(2)$ under the null hypothesis.
5.4.3
2 + 2 2
0 "
b
b
=
b2
b2
=
b
b
2
b
b2
We show in the appendix to this chapter that the unrestricted loglikelihood is given by
$$\ln L\left(\hat{\mu}_x, \hat{\mu}_y, \hat{\sigma}_x^2, \hat{\sigma}_y^2, \hat{\rho}; \mathbf{x}, \mathbf{y}\right) = -n\ln(2\pi) - \frac{n}{2}\ln\hat{\sigma}_x^2 - \frac{n}{2}\ln\hat{\sigma}_y^2 - \frac{n}{2}\ln\left(1-\hat{\rho}^2\right) - n$$
The restricted estimates (also derived in the appendix) are
$$\hat{\mu}_R = \frac{1}{2n}\left(\sum_i x_i + \sum_i y_i\right), \qquad \hat{\sigma}_R^2 = \frac{\sum_i\left(x_i-\hat{\mu}_R\right)^2 + \sum_i\left(y_i-\hat{\mu}_R\right)^2}{2n}, \qquad \hat{\rho}_R = \frac{\sum_i\left(x_i-\hat{\mu}_R\right)\left(y_i-\hat{\mu}_R\right)}{n\hat{\sigma}_R^2}$$
and the restricted loglikelihood is
$$\ln L\left(\hat{\mu}_R, \hat{\sigma}_R^2, \hat{\rho}_R; \mathbf{x}, \mathbf{y}\right) = -n\ln(2\pi) - n\ln\hat{\sigma}_R^2 - \frac{n}{2}\ln\left(1-\hat{\rho}_R^2\right) - n$$
Consequently
$$LR = 2\left(\ln L\left(\hat{\mu}_x, \hat{\mu}_y, \hat{\sigma}_x^2, \hat{\sigma}_y^2, \hat{\rho}\right) - \ln L\left(\hat{\mu}_R, \hat{\sigma}_R^2, \hat{\rho}_R\right)\right) = 2n\ln\hat{\sigma}_R^2 - n\ln\hat{\sigma}_x^2 - n\ln\hat{\sigma}_y^2 + n\ln\left(1-\hat{\rho}_R^2\right) - n\ln\left(1-\hat{\rho}^2\right)$$
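The joint Wald and LR statistics can be put side by side in code. This is again only a sketch on simulated data; under the null both statistics are asymptotically $\chi^2(2)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data generated under the null: common mean and variance
n, mu, sigma, rho = 1000, 0.0, 1.0, 0.3
cov = [[sigma**2, rho * sigma**2], [rho * sigma**2, sigma**2]]
x, y = rng.multivariate_normal([mu, mu], cov, size=n).T

# Unrestricted MLEs
mx, my = x.mean(), y.mean()
s2x, s2y = ((x - mx) ** 2).mean(), ((y - my) ** 2).mean()
sx, sy = np.sqrt(s2x), np.sqrt(s2y)
r = ((x - mx) * (y - my)).mean() / (sx * sy)

# Joint Wald statistic: the two quadratic forms add because RVR' is diagonal
W = (n * (mx - my) ** 2 / (s2x - 2 * r * sx * sy + s2y)
     + n * (s2x - s2y) ** 2 / (2 * s2x**2 + 2 * s2y**2 - 4 * r**2 * s2x * s2y))

# Restricted (pooled) estimates
mu_R = 0.5 * (mx + my)
s2_R = (((x - mu_R) ** 2).sum() + ((y - mu_R) ** 2).sum()) / (2 * n)
r_R = ((x - mu_R) * (y - mu_R)).sum() / (n * s2_R)

# LR statistic from the two maximised loglikelihoods
ll_u = -n * np.log(2 * np.pi) - n / 2 * (np.log(s2x) + np.log(s2y) + np.log(1 - r**2)) - n
ll_r = -n * np.log(2 * np.pi) - n * np.log(s2_R) - n / 2 * np.log(1 - r_R**2) - n
LR = 2 * (ll_u - ll_r)
print(W, LR)
```

Because the restricted model is nested in the unrestricted one, LR is non-negative by construction.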
5.5
In this appendix we will derive the maximum likelihood estimators of the bivariate normal distribution. We will also derive the information matrix and hence the asymptotic covariance matrix. To begin with we need to start with the joint density of one observation $(x_i, y_i)$, which we can write as:
$$f(x_i, y_i) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}}\exp\left\{-\frac{1}{2(1-\rho^2)}\left[\frac{(x_i-\mu_x)^2}{\sigma_x^2} - \frac{2\rho(x_i-\mu_x)(y_i-\mu_y)}{\sigma_x\sigma_y} + \frac{(y_i-\mu_y)^2}{\sigma_y^2}\right]\right\}$$
This means that the joint density of the sample $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ will be
$$f\left(\mathbf{x}, \mathbf{y}; \mu_x, \mu_y, \sigma_x^2, \sigma_y^2, \rho\right) = \left(2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}\right)^{-n}\exp\left\{-\frac{1}{2(1-\rho^2)}\left[\frac{\sum_i(x_i-\mu_x)^2}{\sigma_x^2} - \frac{2\rho\sum_i(x_i-\mu_x)(y_i-\mu_y)}{\sigma_x\sigma_y} + \frac{\sum_i(y_i-\mu_y)^2}{\sigma_y^2}\right]\right\}$$
This of course gives the likelihood, from which we can derive the loglikelihood:
$$\ln L = -n\ln(2\pi) - \frac{n}{2}\ln\sigma_x^2 - \frac{n}{2}\ln\sigma_y^2 - \frac{n}{2}\ln\left(1-\rho^2\right) - \frac{\sum_i(x_i-\mu_x)^2}{2(1-\rho^2)\sigma_x^2} + \frac{\rho\sum_i(x_i-\mu_x)(y_i-\mu_y)}{(1-\rho^2)\sigma_x\sigma_y} - \frac{\sum_i(y_i-\mu_y)^2}{2(1-\rho^2)\sigma_y^2} \quad (5.6)$$
Differentiating the loglikelihood we get the gradient:
$$\frac{\partial \ln L}{\partial \mu_x} = \frac{\sum_i(x_i-\mu_x)}{(1-\rho^2)\sigma_x^2} - \frac{\rho\sum_i(y_i-\mu_y)}{(1-\rho^2)\sigma_x\sigma_y}$$
$$\frac{\partial \ln L}{\partial \mu_y} = \frac{\sum_i(y_i-\mu_y)}{(1-\rho^2)\sigma_y^2} - \frac{\rho\sum_i(x_i-\mu_x)}{(1-\rho^2)\sigma_x\sigma_y}$$
$$\frac{\partial \ln L}{\partial \sigma_x^2} = -\frac{n}{2\sigma_x^2} + \frac{\sum_i(x_i-\mu_x)^2}{2(1-\rho^2)\sigma_x^4} - \frac{\rho\sum_i(x_i-\mu_x)(y_i-\mu_y)}{2(1-\rho^2)\sigma_x^3\sigma_y}$$
$$\frac{\partial \ln L}{\partial \sigma_y^2} = -\frac{n}{2\sigma_y^2} + \frac{\sum_i(y_i-\mu_y)^2}{2(1-\rho^2)\sigma_y^4} - \frac{\rho\sum_i(x_i-\mu_x)(y_i-\mu_y)}{2(1-\rho^2)\sigma_x\sigma_y^3}$$
$$\frac{\partial \ln L}{\partial \rho} = \frac{n\rho}{1-\rho^2} + \frac{(1+\rho^2)\sum_i(x_i-\mu_x)(y_i-\mu_y)}{(1-\rho^2)^2\sigma_x\sigma_y} - \frac{\rho}{(1-\rho^2)^2}\left[\frac{\sum_i(x_i-\mu_x)^2}{\sigma_x^2} + \frac{\sum_i(y_i-\mu_y)^2}{\sigma_y^2}\right]$$
Evaluated at the maximum likelihood estimates $\hat{\mu}_x, \hat{\mu}_y, \hat{\sigma}_x^2, \hat{\sigma}_y^2, \hat{\rho}$, each of these derivatives must equal zero.
5.5.1
Setting the gradient equal to zero gives a system of five equations in five unknowns:
$$\frac{\partial \ln L}{\partial \mu_x} = 0 \quad (5.7)$$
$$\frac{\partial \ln L}{\partial \mu_y} = 0 \quad (5.8)$$
$$\frac{\partial \ln L}{\partial \sigma_x^2} = 0 \quad (5.9)$$
$$\frac{\partial \ln L}{\partial \sigma_y^2} = 0 \quad (5.10)$$
$$\frac{\partial \ln L}{\partial \rho} = 0 \quad (5.11)$$
The solution to this will be the MLE. From equation 5.7 we get
$$\frac{\sum_i\left(x_i-\hat{\mu}_x\right)}{\hat{\sigma}_x^2} = \frac{\hat{\rho}\sum_i\left(y_i-\hat{\mu}_y\right)}{\hat{\sigma}_x\hat{\sigma}_y}$$
and from equation 5.8
$$\frac{\sum_i\left(y_i-\hat{\mu}_y\right)}{\hat{\sigma}_y^2} = \frac{\hat{\rho}\sum_i\left(x_i-\hat{\mu}_x\right)}{\hat{\sigma}_x\hat{\sigma}_y}$$
from which it follows that we must have
$$\sum_i\left(x_i-\hat{\mu}_x\right) = \hat{\rho}^2\sum_i\left(x_i-\hat{\mu}_x\right)$$
$$\left(1-\hat{\rho}^2\right)\sum_i\left(x_i-\hat{\mu}_x\right) = 0$$
We require $1-\hat{\rho}^2 \neq 0$ (otherwise the likelihood is not defined), hence $\sum_i\left(x_i-\hat{\mu}_x\right) = 0$, i.e.
$$\hat{\mu}_x = \frac{\sum_i x_i}{n} \quad (5.12)$$
and hence
$$\hat{\mu}_y = \frac{\sum_i y_i}{n} \quad (5.13)$$
Given this, the remaining first-order conditions 5.9 and 5.10 become
$$-\frac{n}{2\hat{\sigma}_x^2} + \frac{\sum_i\left(x_i-\hat{\mu}_x\right)^2}{2\left(1-\hat{\rho}^2\right)\hat{\sigma}_x^4} - \frac{\hat{\rho}\sum_i\left(x_i-\hat{\mu}_x\right)\left(y_i-\hat{\mu}_y\right)}{2\left(1-\hat{\rho}^2\right)\hat{\sigma}_x^3\hat{\sigma}_y} = 0 \quad (5.14)$$
$$-\frac{n}{2\hat{\sigma}_y^2} + \frac{\sum_i\left(y_i-\hat{\mu}_y\right)^2}{2\left(1-\hat{\rho}^2\right)\hat{\sigma}_y^4} - \frac{\hat{\rho}\sum_i\left(x_i-\hat{\mu}_x\right)\left(y_i-\hat{\mu}_y\right)}{2\left(1-\hat{\rho}^2\right)\hat{\sigma}_x\hat{\sigma}_y^3} = 0$$
Multiplying equation 5.14 through by $2\hat{\sigma}_x^2$ and rearranging gives
$$n = \frac{\sum_i\left(x_i-\hat{\mu}_x\right)^2}{\left(1-\hat{\rho}^2\right)\hat{\sigma}_x^2} - \frac{\hat{\rho}\sum_i\left(x_i-\hat{\mu}_x\right)\left(y_i-\hat{\mu}_y\right)}{\left(1-\hat{\rho}^2\right)\hat{\sigma}_x\hat{\sigma}_y} \quad (5.15)$$
with a corresponding expression holding for $\hat{\sigma}_y^2$. Consequently, substituting both of these into the first-order condition for $\rho$ (equation 5.11) and collecting terms,
$$-\frac{n\hat{\rho}}{1-\hat{\rho}^2} + \frac{\sum_i\left(x_i-\hat{\mu}_x\right)\left(y_i-\hat{\mu}_y\right)}{\left(1-\hat{\rho}^2\right)\hat{\sigma}_x\hat{\sigma}_y} = 0$$
i.e.
$$\sum_i\left(x_i-\hat{\mu}_x\right)\left(y_i-\hat{\mu}_y\right) = n\hat{\rho}\hat{\sigma}_x\hat{\sigma}_y$$
Substituting this back into equation 5.15:
$$n = \frac{\sum_i\left(x_i-\hat{\mu}_x\right)^2}{\left(1-\hat{\rho}^2\right)\hat{\sigma}_x^2} - \frac{n\hat{\rho}^2}{1-\hat{\rho}^2}$$
so that $\sum_i\left(x_i-\hat{\mu}_x\right)^2 = n\hat{\sigma}_x^2\left(1-\hat{\rho}^2\right) + n\hat{\rho}^2\hat{\sigma}_x^2 = n\hat{\sigma}_x^2$, giving
$$\hat{\sigma}_x^2 = \frac{\sum_i\left(x_i-\hat{\mu}_x\right)^2}{n} \quad (5.16)$$
Similarly
$$\hat{\sigma}_y^2 = \frac{\sum_i\left(y_i-\hat{\mu}_y\right)^2}{n} \quad (5.17)$$
and finally
$$\hat{\rho} = \frac{\sum_i\left(x_i-\hat{\mu}_x\right)\left(y_i-\hat{\mu}_y\right)}{n\hat{\sigma}_x\hat{\sigma}_y} \quad (5.18)$$
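The closed-form estimators just derived can be checked against numpy's moment routines, which use the same divisor-$n$ formulas once `ddof=0` is set. A small sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400
x, y = rng.multivariate_normal([2.0, -1.0], [[1.0, 0.6], [0.6, 4.0]], size=n).T

# MLE formulas from equations 5.12-5.18 (all with divisor n)
mu_x, mu_y = x.sum() / n, y.sum() / n
s2x = ((x - mu_x) ** 2).sum() / n
s2y = ((y - mu_y) ** 2).sum() / n
rho = ((x - mu_x) * (y - mu_y)).sum() / (n * np.sqrt(s2x) * np.sqrt(s2y))

# Cross-check against numpy (ddof=0 gives the same divisor-n convention;
# the correlation coefficient is unaffected by the choice of divisor)
assert np.isclose(mu_x, x.mean())
assert np.isclose(s2x, np.var(x, ddof=0))
assert np.isclose(s2y, np.var(y, ddof=0))
assert np.isclose(rho, np.corrcoef(x, y)[0, 1])
print(mu_x, mu_y, s2x, s2y, rho)
```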
5.5.2 Information matrix
To get the information matrix, we need to get the matrix of second derivatives.
The distinct second derivatives are:
$$\frac{\partial^2 \ln L}{\partial \mu_x^2} = -\frac{n}{(1-\rho^2)\sigma_x^2}, \qquad \frac{\partial^2 \ln L}{\partial \mu_x \partial \mu_y} = \frac{n\rho}{(1-\rho^2)\sigma_x\sigma_y}$$
$$\frac{\partial^2 \ln L}{\partial \left(\sigma_x^2\right)^2} = \frac{n}{2\sigma_x^4} - \frac{\sum_i(x_i-\mu_x)^2}{(1-\rho^2)\sigma_x^6} + \frac{3\rho\sum_i(x_i-\mu_x)(y_i-\mu_y)}{4(1-\rho^2)\sigma_x^5\sigma_y}$$
$$\frac{\partial^2 \ln L}{\partial \rho^2} = \frac{n(1+\rho^2)}{(1-\rho^2)^2} + \frac{2\rho(3+\rho^2)\sum_i(x_i-\mu_x)(y_i-\mu_y)}{(1-\rho^2)^3\sigma_x\sigma_y} - \frac{1+3\rho^2}{(1-\rho^2)^3}\left[\frac{\sum_i(x_i-\mu_x)^2}{\sigma_x^2} + \frac{\sum_i(y_i-\mu_y)^2}{\sigma_y^2}\right]$$
$$\frac{\partial^2 \ln L}{\partial \mu_x \partial \sigma_x^2} = -\frac{\sum_i(x_i-\mu_x)}{(1-\rho^2)\sigma_x^4} + \frac{\rho\sum_i(y_i-\mu_y)}{2(1-\rho^2)\sigma_x^3\sigma_y}, \qquad \frac{\partial^2 \ln L}{\partial \mu_x \partial \sigma_y^2} = \frac{\rho\sum_i(y_i-\mu_y)}{2(1-\rho^2)\sigma_x\sigma_y^3}$$
$$\frac{\partial^2 \ln L}{\partial \mu_x \partial \rho} = \frac{2\rho\sum_i(x_i-\mu_x)}{(1-\rho^2)^2\sigma_x^2} - \frac{(1+\rho^2)\sum_i(y_i-\mu_y)}{(1-\rho^2)^2\sigma_x\sigma_y}$$
$$\frac{\partial^2 \ln L}{\partial \sigma_x^2 \partial \sigma_y^2} = \frac{\rho\sum_i(x_i-\mu_x)(y_i-\mu_y)}{4(1-\rho^2)\sigma_x^3\sigma_y^3}, \qquad \frac{\partial^2 \ln L}{\partial \sigma_x^2 \partial \rho} = \frac{\rho\sum_i(x_i-\mu_x)^2}{(1-\rho^2)^2\sigma_x^4} - \frac{(1+\rho^2)\sum_i(x_i-\mu_x)(y_i-\mu_y)}{2(1-\rho^2)^2\sigma_x^3\sigma_y}$$
with the corresponding expressions holding when the roles of $x$ and $y$ are interchanged. When we take expectations of these terms we note that $E\left[\sum_i(x_i-\mu_x)\right] = 0$, $E\left[\sum_i(x_i-\mu_x)^2\right] = n\sigma_x^2$ and $E\left[\sum_i(x_i-\mu_x)(y_i-\mu_y)\right] = n\rho\sigma_x\sigma_y$.
5.5.3
Taking expectations and changing signs, the information matrix is
$$I(\theta) = \begin{bmatrix} \frac{n}{(1-\rho^2)\sigma_x^2} & -\frac{n\rho}{(1-\rho^2)\sigma_x\sigma_y} & 0 & 0 & 0 \\ -\frac{n\rho}{(1-\rho^2)\sigma_x\sigma_y} & \frac{n}{(1-\rho^2)\sigma_y^2} & 0 & 0 & 0 \\ 0 & 0 & \frac{n(2-\rho^2)}{4(1-\rho^2)\sigma_x^4} & -\frac{n\rho^2}{4(1-\rho^2)\sigma_x^2\sigma_y^2} & -\frac{n\rho}{2(1-\rho^2)\sigma_x^2} \\ 0 & 0 & -\frac{n\rho^2}{4(1-\rho^2)\sigma_x^2\sigma_y^2} & \frac{n(2-\rho^2)}{4(1-\rho^2)\sigma_y^4} & -\frac{n\rho}{2(1-\rho^2)\sigma_y^2} \\ 0 & 0 & -\frac{n\rho}{2(1-\rho^2)\sigma_x^2} & -\frac{n\rho}{2(1-\rho^2)\sigma_y^2} & \frac{n(1+\rho^2)}{(1-\rho^2)^2} \end{bmatrix}$$
Inverting this matrix we get the asymptotic covariance matrix of $\hat{\theta} = \left(\hat{\mu}_x, \hat{\mu}_y, \hat{\sigma}_x^2, \hat{\sigma}_y^2, \hat{\rho}\right)'$:
$$I(\theta)^{-1} = \frac{1}{n}\begin{bmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y & 0 & 0 & 0 \\ \rho\sigma_x\sigma_y & \sigma_y^2 & 0 & 0 & 0 \\ 0 & 0 & 2\sigma_x^4 & 2\rho^2\sigma_x^2\sigma_y^2 & \rho(1-\rho^2)\sigma_x^2 \\ 0 & 0 & 2\rho^2\sigma_x^2\sigma_y^2 & 2\sigma_y^4 & \rho(1-\rho^2)\sigma_y^2 \\ 0 & 0 & \rho(1-\rho^2)\sigma_x^2 & \rho(1-\rho^2)\sigma_y^2 & (1-\rho^2)^2 \end{bmatrix}$$
It is important to understand what this matrix is saying. For instance it shows that asymptotically $\mathrm{Var}\left(\hat{\mu}_x\right) = \frac{\sigma_x^2}{n}$. This will, of course, hold also in a small sample. This quantity is the variance of the sampling distribution of $\hat{\mu}_x$. It captures how variable the estimates would be if we reran the DSP many times. In practice, since we don't know $\sigma_x^2$, we will need to estimate it from the data. Using either the MLE $\hat{\sigma}_x^2$ or the bias-adjusted $s_x^2$ we can use our data to give us an estimate of the true $\mathrm{Var}\left(\hat{\mu}_x\right)$. To capture the fact that it is an estimate we will write it as $\widehat{\mathrm{Var}}\left(\hat{\mu}_x\right)$.
One interesting fact about $I(\theta)^{-1}$ is that it is block-diagonal in nature. In particular we see that the estimators of $\mu_x$ and $\mu_y$ are uncorrelated with the estimators of $\sigma_x^2$, $\sigma_y^2$ and $\rho$. Given the fact that these estimators are multivariate normal (asymptotically) this shows that they are at least asymptotically independent of each other. In fact it can be shown that they are independent even in small samples. This is very convenient. It means that if we are testing hypotheses only on the means, we need to consider only their covariance matrix, i.e.
$$\frac{1}{n}\begin{bmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{bmatrix}$$
Looking at this, we note that $\mathrm{Cov}\left(\hat{\mu}_x, \hat{\mu}_y\right) = \frac{\rho\sigma_x\sigma_y}{n}$. This indicates that if the two variables are positively correlated, then the sample estimates $\hat{\mu}_x$ and $\hat{\mu}_y$ will also be positively correlated. This makes sense. If $\hat{\mu}_x$ overshoots the true mean in a particular sample, then given the fact that the $x$ values are positively correlated with the $y$ values we would expect $\hat{\mu}_y$ to overshoot its mean too.
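The claim that the inverse of the information matrix has this block-diagonal form can be checked numerically for any particular parameter values. A sketch:

```python
import numpy as np

def information(sx2, sy2, rho, n=1):
    """Information matrix of the bivariate normal, order (mu_x, mu_y, sx2, sy2, rho)."""
    sx, sy = np.sqrt(sx2), np.sqrt(sy2)
    c = 1.0 - rho**2
    I = np.zeros((5, 5))
    I[0, 0] = 1 / (c * sx2); I[1, 1] = 1 / (c * sy2)
    I[0, 1] = I[1, 0] = -rho / (c * sx * sy)
    I[2, 2] = (2 - rho**2) / (4 * c * sx2**2)
    I[3, 3] = (2 - rho**2) / (4 * c * sy2**2)
    I[2, 3] = I[3, 2] = -rho**2 / (4 * c * sx2 * sy2)
    I[2, 4] = I[4, 2] = -rho / (2 * c * sx2)
    I[3, 4] = I[4, 3] = -rho / (2 * c * sy2)
    I[4, 4] = (1 + rho**2) / c**2
    return n * I

sx2, sy2, rho, n = 1.5, 0.8, 0.4, 200
V = np.linalg.inv(information(sx2, sy2, rho, n))

sx, sy = np.sqrt(sx2), np.sqrt(sy2)
c = 1 - rho**2
# Compare with the stated asymptotic covariance matrix, entry by entry
expected = np.array([
    [sx2, rho * sx * sy, 0, 0, 0],
    [rho * sx * sy, sy2, 0, 0, 0],
    [0, 0, 2 * sx2**2, 2 * rho**2 * sx2 * sy2, rho * c * sx2],
    [0, 0, 2 * rho**2 * sx2 * sy2, 2 * sy2**2, rho * c * sy2],
    [0, 0, rho * c * sx2, rho * c * sy2, c**2],
]) / n
assert np.allclose(V, expected)
print("covariance matrix verified")
```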
We can show that the formula $\mathrm{Cov}\left(\hat{\mu}_x, \hat{\mu}_y\right) = \frac{\rho\sigma_x\sigma_y}{n}$ holds in small samples too. We have
$$\mathrm{Cov}\left(\hat{\mu}_x, \hat{\mu}_y\right) = E\left[\left(\hat{\mu}_x - \mu_x\right)\left(\hat{\mu}_y - \mu_y\right)\right]$$
Here we are making use of the fact that $E\left(\hat{\mu}_x\right) = \frac{1}{n}\sum_{i=1}^{n}E(x_i) = \mu_x$ and $E\left(\hat{\mu}_y\right) = \mu_y$. Now
$$\left(\hat{\mu}_x - \mu_x\right)\left(\hat{\mu}_y - \mu_y\right) = \left(\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu_x\right)\right)\left(\frac{1}{n}\sum_{j=1}^{n}\left(y_j - \mu_y\right)\right) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(x_i - \mu_x\right)\left(y_j - \mu_y\right)$$
Taking expectations, the cross-terms with $i \neq j$ vanish, since observations are sampled independently of each other, while each of the $n$ terms with $i = j$ has expectation $\rho\sigma_x\sigma_y$. Hence
$$E\left[\left(\hat{\mu}_x - \mu_x\right)\left(\hat{\mu}_y - \mu_y\right)\right] = \frac{1}{n^2}\cdot n\rho\sigma_x\sigma_y = \frac{\rho\sigma_x\sigma_y}{n}$$
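The small-sample result can also be seen in a quick Monte Carlo sketch: simulate many samples, compute the two sample means each time, and compare their empirical covariance with $\rho\sigma_x\sigma_y/n$.

```python
import numpy as np

rng = np.random.default_rng(123)
n, reps = 20, 20000
sx, sy, rho = 1.0, 2.0, 0.6
cov = [[sx**2, rho * sx * sy], [rho * sx * sy, sy**2]]

# reps independent samples of size n; record the pair of sample means each time
draws = rng.multivariate_normal([0.0, 0.0], cov, size=(reps, n))
means = draws.mean(axis=1)                     # shape (reps, 2)

emp_cov = np.cov(means[:, 0], means[:, 1])[0, 1]
theory = rho * sx * sy / n
print(emp_cov, theory)
```

With these values the theoretical covariance is $0.6 \times 1 \times 2 / 20 = 0.06$, and the empirical covariance across replications should be close to it.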
5.5.4 Loglikelihood
Substituting the maximum likelihood estimates into the loglikelihood:
$$\ln L\left(\hat{\mu}_x, \hat{\mu}_y, \hat{\sigma}_x^2, \hat{\sigma}_y^2, \hat{\rho}\right) = -n\ln(2\pi) - \frac{n}{2}\ln\hat{\sigma}_x^2 - \frac{n}{2}\ln\hat{\sigma}_y^2 - \frac{n}{2}\ln\left(1-\hat{\rho}^2\right) - \frac{\sum_i\left(x_i-\hat{\mu}_x\right)^2}{2\left(1-\hat{\rho}^2\right)\hat{\sigma}_x^2} + \frac{\hat{\rho}\sum_i\left(x_i-\hat{\mu}_x\right)\left(y_i-\hat{\mu}_y\right)}{\left(1-\hat{\rho}^2\right)\hat{\sigma}_x\hat{\sigma}_y} - \frac{\sum_i\left(y_i-\hat{\mu}_y\right)^2}{2\left(1-\hat{\rho}^2\right)\hat{\sigma}_y^2}$$
Using $\sum_i\left(x_i-\hat{\mu}_x\right)^2 = n\hat{\sigma}_x^2$, $\sum_i\left(y_i-\hat{\mu}_y\right)^2 = n\hat{\sigma}_y^2$ and $\sum_i\left(x_i-\hat{\mu}_x\right)\left(y_i-\hat{\mu}_y\right) = n\hat{\rho}\hat{\sigma}_x\hat{\sigma}_y$, the last three terms become
$$-\frac{n}{2\left(1-\hat{\rho}^2\right)} + \frac{n\hat{\rho}^2}{1-\hat{\rho}^2} - \frac{n}{2\left(1-\hat{\rho}^2\right)} = -\frac{n\left(1-\hat{\rho}^2\right)}{1-\hat{\rho}^2} = -n$$
so that
$$\ln L\left(\hat{\mu}_x, \hat{\mu}_y, \hat{\sigma}_x^2, \hat{\sigma}_y^2, \hat{\rho}\right) = -n\ln(2\pi) - \frac{n}{2}\ln\hat{\sigma}_x^2 - \frac{n}{2}\ln\hat{\sigma}_y^2 - \frac{n}{2}\ln\left(1-\hat{\rho}^2\right) - n \quad (5.19)$$
5.6
We now impose the restrictions
$$\mu_x = \mu_y = \mu, \qquad \sigma_x^2 = \sigma_y^2 = \sigma^2$$
Under these restrictions the likelihood becomes
$$L\left(\mu, \sigma^2, \rho; \mathbf{x}, \mathbf{y}\right) = \left(4\pi^2\sigma^4\left(1-\rho^2\right)\right)^{-n/2}\exp\left\{-\frac{\sum_i(x_i-\mu)^2 + \sum_i(y_i-\mu)^2 - 2\rho\sum_i(x_i-\mu)(y_i-\mu)}{2(1-\rho^2)\sigma^2}\right\}$$
and the loglikelihood is
$$\ln L\left(\mu, \sigma^2, \rho; \mathbf{x}, \mathbf{y}\right) = -n\ln(2\pi) - n\ln\sigma^2 - \frac{n}{2}\ln\left(1-\rho^2\right) - \frac{\sum_i(x_i-\mu)^2 + \sum_i(y_i-\mu)^2 - 2\rho\sum_i(x_i-\mu)(y_i-\mu)}{2(1-\rho^2)\sigma^2}$$
In order to maximise this we first get the derivatives of the loglikelihood function:
$$\frac{\partial \ln L}{\partial \mu} = \frac{\sum_i(x_i-\mu) + \sum_i(y_i-\mu) - \rho\sum_i(x_i-\mu) - \rho\sum_i(y_i-\mu)}{(1-\rho^2)\sigma^2}$$
$$\frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{\sigma^2} + \frac{\sum_i(x_i-\mu)^2 + \sum_i(y_i-\mu)^2 - 2\rho\sum_i(x_i-\mu)(y_i-\mu)}{2(1-\rho^2)\sigma^4}$$
$$\frac{\partial \ln L}{\partial \rho} = \frac{n\rho}{1-\rho^2} + \frac{\left(1+\rho^2\right)\sum_i(x_i-\mu)(y_i-\mu)}{(1-\rho^2)^2\sigma^2} - \frac{\rho\left[\sum_i(x_i-\mu)^2 + \sum_i(y_i-\mu)^2\right]}{(1-\rho^2)^2\sigma^2}$$
5.6.1
Setting these derivatives to zero gives the first-order conditions:
$$\frac{\partial \ln L}{\partial \mu} = 0 \quad (5.20)$$
$$\frac{\partial \ln L}{\partial \sigma^2} = 0 \quad (5.21)$$
$$\frac{\partial \ln L}{\partial \rho} = 0 \quad (5.22)$$
To distinguish these from the unrestricted MLE, we should really subscript the resulting estimates with $R$, to make it clear that the restricted estimates will, in general, be different. This clutters up the notation, so we use the subscripts only when reporting the final results.
The first equation can be rewritten as
$$\frac{(1-\hat{\rho})\left(\sum_i(x_i-\hat{\mu}) + \sum_i(y_i-\hat{\mu})\right)}{\left(1-\hat{\rho}^2\right)\hat{\sigma}^2} = 0$$
This will hold only if
$$\sum_i(x_i-\hat{\mu}) + \sum_i(y_i-\hat{\mu}) = 0$$
i.e.
$$\hat{\mu} = \frac{1}{2n}\left(\sum_i x_i + \sum_i y_i\right) \quad (5.23)$$
So the restricted estimate of the mean will be the average of the unrestricted estimates, which is equivalent to calculating the mean over the pooled sample.
Equation 5.21 can be rewritten as
$$\sum_i(x_i-\hat{\mu})^2 + \sum_i(y_i-\hat{\mu})^2 = 2n\left(1-\hat{\rho}^2\right)\hat{\sigma}^2 + 2\hat{\rho}\sum_i(x_i-\hat{\mu})(y_i-\hat{\mu}) \quad (5.24)$$
Substituting this into equation 5.22 gives
$$\frac{n\hat{\rho}}{1-\hat{\rho}^2} + \frac{\left(1+\hat{\rho}^2\right)\sum_i(x_i-\hat{\mu})(y_i-\hat{\mu})}{\left(1-\hat{\rho}^2\right)^2\hat{\sigma}^2} - \frac{\hat{\rho}\left[2n\left(1-\hat{\rho}^2\right)\hat{\sigma}^2 + 2\hat{\rho}\sum_i(x_i-\hat{\mu})(y_i-\hat{\mu})\right]}{\left(1-\hat{\rho}^2\right)^2\hat{\sigma}^2} = 0$$
Collecting terms,
$$-\frac{n\hat{\rho}}{1-\hat{\rho}^2} + \frac{\left(1-\hat{\rho}^2\right)\sum_i(x_i-\hat{\mu})(y_i-\hat{\mu})}{\left(1-\hat{\rho}^2\right)^2\hat{\sigma}^2} = 0$$
Consequently
$$\hat{\rho} = \frac{\sum_i\left(x_i-\hat{\mu}\right)\left(y_i-\hat{\mu}\right)}{n\hat{\sigma}^2} \quad (5.25)$$
Substituting this back into equation 5.24:
$$\sum_i\left(x_i-\hat{\mu}\right)^2 + \sum_i\left(y_i-\hat{\mu}\right)^2 = 2n\left(1-\hat{\rho}^2\right)\hat{\sigma}^2 + 2\hat{\rho}\cdot n\hat{\rho}\hat{\sigma}^2 = 2n\hat{\sigma}^2$$
so that
$$\hat{\sigma}^2 = \frac{\sum_i\left(x_i-\hat{\mu}\right)^2 + \sum_i\left(y_i-\hat{\mu}\right)^2}{2n} \quad (5.26)$$
These results are intuitively obvious: if $x$ and $y$ are drawn from the same distribution, then it would be most efficient to estimate the mean and the variance by pooling the observations on $x$ and $y$. The correlation coefficient then looks at the deviations from the pooled mean, normalised against the pooled standard deviations.
5.6.2 Restricted loglikelihood
As before we can substitute the maximum likelihood estimates into the restricted loglikelihood to evaluate what this maximum value actually is:
$$\ln L\left(\hat{\mu}, \hat{\sigma}^2, \hat{\rho}; \mathbf{x}, \mathbf{y}\right) = -n\ln(2\pi) - n\ln\hat{\sigma}^2 - \frac{n}{2}\ln\left(1-\hat{\rho}^2\right) - \frac{\sum_i\left(x_i-\hat{\mu}\right)^2 + \sum_i\left(y_i-\hat{\mu}\right)^2 - 2\hat{\rho}\sum_i\left(x_i-\hat{\mu}\right)\left(y_i-\hat{\mu}\right)}{2\left(1-\hat{\rho}^2\right)\hat{\sigma}^2}$$
Substituting in equations 5.25 and 5.26 we get
$$\ln L\left(\hat{\mu}, \hat{\sigma}^2, \hat{\rho}; \mathbf{x}, \mathbf{y}\right) = -n\ln(2\pi) - n\ln\hat{\sigma}^2 - \frac{n}{2}\ln\left(1-\hat{\rho}^2\right) - \frac{2n\hat{\sigma}^2 - 2\hat{\rho}^2 n\hat{\sigma}^2}{2\left(1-\hat{\rho}^2\right)\hat{\sigma}^2}$$
$$= -n\ln(2\pi) - n\ln\hat{\sigma}^2 - \frac{n}{2}\ln\left(1-\hat{\rho}^2\right) - n \quad (5.27)$$
Part II
Chapter 6
Why study econometrics? Most of the time applied econometricians think it is quite obvious
what they do and how they should do it. Underpinning this is an implicit model of how the
world works. Sometimes it is quite useful to make this methodology explicit. The purpose of
this chapter is to provide you with some tools which may come in useful if you come up against
non-standard problems: situations in which it may no longer be obvious what you should do or how you should do it. In fact an understanding of the methodology is useful even as the broad backdrop to the most well-known of models, the classical linear regression model. We will spend
some time setting up this model against the backdrop of the broader framework within which it
fits.
Our point of departure is to marry a typology derived from Mittelhammer et al. (2000,
Chapter 1) and a framework given by Angrist and Pischke. The former suggest that the process
of econometric research may be crudely categorised into three parts:
1. A process of abstraction or model-building. In this process we want to capture the essential relationships and characteristics of the real world in simplified form. The resultant mathematical/econometric model can be used to make deductions about what we should and should not be able to observe in the world. At its core an econometric model can be thought of as depicting a Data Sampling Process (DSP). This characterises both what we know about the world and how the information available to the analyst is ultimately derived from it.
2. A process of information recovery which can take both the form of estimation and inference.
The purpose of this step is to use the available information to extract more information
about the DSP, i.e. the world.
3. A final step is to reflect on the meaning of this additional information. The fundamental
problem (too often forgotten) is that the process of estimation and inference is conditional
on the econometric model. It is therefore always advisable to reflect on how plausible the
model is, given the results obtained. This process of analysis is, of course, often the prelude
Angrist and Pischke (2009, Chapter 1), by contrast, suggest that the key questions which can
be used to characterise most econometric research are:
1. What is the causal relationship of interest? This focus on causality is not self-evident, but
much of the time economists are interested in the determinants of social and economic
processes. If we understand what drives the observed outcomes, we will be more confident
about policy interventions that seek to modify these outcomes.
2. What ideal experiment could reveal the causal relationship? For the moment it is sufficient
to note that thinking about the possibility of an experiment forces the analyst to be clear
about what could, in principle, be manipulated. Questions which could not in principle
ever be settled by an experiment are fundamentally unidentified questions (FUQs). If
you have one of these, you are FUQed, and you will need to change your research question.
3. In the absence of an ideal experiment, what identification strategy will reveal it? In order
to think about this we will need to know a lot more about how the process works in the
real world, i.e. outside experimental control.
4. What mode of statistical inference is appropriate?
6.2
come from a certain family of distributions (such as the multivariate normal), which can
be indexed by a set of parameters . If we knew , we could then completely characterise
the Data Sampling Process (DSP). The point of the probability model is that it specifies
how likely certain outcomes are, compared to others.
Once the DSP has been fully specified, it should, in principle, be possible to simulate the
process of data generation. The analyst could play God and recreate many possible outcomes
of the underlying economic process. Such simulations may be useful in answering questions about
the characteristics of this process.
6.2.1
As noted above, the process of abstraction involves isolating the processes of interest from
the surrounding jumble of cross-cutting events and processes. To make this more concrete,
consider the relationship between the log of wages received and taking an advanced econometrics
course. We might have considered many facts about individuals other than their educational
trajectories, for instance their astrological star signs, their pain thresholds or their blood type.
Which factors we choose to focus on will be guided by economic theory. We are interested mainly
in relationships that are not purely coincidental, but reflect stable underlying social processes.
Causal processes are the best examples of such stable relationships. Many economic models are
underpinned by causal stories. For instance human capital theory maintains that education
causes people to become more productive and hence earn higher wages. A human capital account
would therefore posit a link between taking econometrics courses and earning higher wages. It
would rule out links between astrological star signs and wages, except of a completely accidental
nature.
An alternative link is provided by signalling theory. This suggests that employers find it difficult to measure the ability of candidates accurately. Consequently high-ability candidates need to acquire a signal (such as an econometrics qualification) which low-ability candidates find hard or impossible to obtain. Gaining such a qualification therefore causes a change in the employer's belief about the applicant's ability and hence the wage which will be paid. In this case the causal link is indirect and dependent on employers and candidates both understanding that econometrics is difficult and hence a good signal of ability. There is nothing intrinsic about econometrics which gives it that function. Studying Latin could do just as well. As such the link between the econometrics course and higher wages is actually contingent on the prevailing norms and beliefs. It could shift over time or could function differently in different labour markets (locations).
Even human capital theory would allow for changes in the relationship between wages and
taking econometrics courses. If there was a glut of econometrics graduates, this would depress
their earnings, even if the causal relationship between econometrics and higher productivity
remains unchanged. This points to an interesting relationship between the economic notion
of equilibrium and causality. Implicit in any equilibrium account are causal stories about the
impact of supply and demand: increases in supply cause drops in prices as long as demand
remains unchanged. The causal relationship between acquiring econometric skills and higher
wages is implicitly predicated on everything else staying the same.
Hence causes will invariably lead to particular effects ceteris paribus. The origin of this
notion of causality can be traced back at least to Hume, who defined causality in terms of the
constant conjunction of events. If we say "A causes B" we mean that whenever we observe A,
we also observe B. Of course this is not strictly speaking true: it is not the case that whenever we switch on a light we get illumination: the bulb could have blown, there could be an electricity outage or the circuit could have shorted. This means that the constant conjunction
has to be defined more carefully: in terms of equipment in working order, not subject to external
interruptions etc. Indeed controlling the interference of external forces is one of the key issues
for scientific laboratory experiments. Constant conjunctions occur only in closed systems,
i.e. systems isolated from their context; from the jumble of cross-cutting events and processes!
This suggests that sensible abstractions are those which pinpoint relationships which might
be isolated under laboratory conditions. Of course we do not expect the mechanisms that we
manage to isolate experimentally to stop working the moment that we leave the laboratory.
Science enables us to make sense of everyday processes ranging from medicine to mechanical
engineering, precisely because the same causal mechanisms operate, even if they are sometimes
confounded by other processes.
6.2.2
Where laboratory experiments are designed to isolate causal mechanisms in the physical sciences,
it is much harder to achieve such experimental closure in the social sciences or indeed even in the
complex interactions inside biological systems. In these contexts we may not see the constant
conjunctions posited by the causal story. Instead we may need to look for statistical regularities
rather than deterministic patterns. An influential model for thinking about causality in these
contexts is provided by Rubin (Holland 1986). To make things more definite let us assume
that we are considering a treatment, such as administering a drug (non-recreational) or an econometrics course (definitely non-recreational). The variable $D_i$ captures whether individual $i$ receives the treatment ($D_i = 1$) or the control ($D_i = 0$). The outcome of interest (or response variable) is $Y_i$, which might be "recovers from illness" ($Y_i = 1$ or $Y_i = 0$) in the former example or log of wages in the latter. The key idea in the Rubin framework is that for each individual we can think of two possible outcomes: $Y_{1i}$ and $Y_{0i}$, i.e. the outcome if individual $i$ is treated ($D_i = 1$) or not ($D_i = 0$). Only one of these potential outcomes can ever be observed. Indeed Holland (1986) makes the point that even if we could repeat an experiment on the same individual (e.g. rerun a lab experiment) we can still not observe what would have happened at that time if we had applied the cause differently. The causal effect for individual $i$ of the treatment is defined as
$$\delta_i = Y_{1i} - Y_{0i}$$
Because biological and social systems are such that we cannot control for all sources of heterogeneity (different individuals have slightly different genetics and so might respond to drugs slightly differently) we cannot assume in general that $\delta_i$ is a constant across all individuals. What we might measure, though, is the average treatment effect (ATE), defined as
$$ATE = E\left(Y_{1i} - Y_{0i}\right) \quad (6.1)$$
Note that this is not the observed difference in outcomes between those who are treated and those who are not. This naive difference (the prima facie causal effect in the terminology of Holland (1986)) is
$$E\left(Y_{1i} \mid D_i = 1\right) - E\left(Y_{0i} \mid D_i = 0\right)$$
The two will be equal if $E\left(Y_{1i}\right) = E\left(Y_{1i} \mid D_i = 1\right)$ and $E\left(Y_{0i}\right) = E\left(Y_{0i} \mid D_i = 0\right)$. This means that the potential outcomes are unrelated to the treatment status. In many cases this is likely to be violated: people who are smart enough to take econometrics courses are likely to have earned more than people who do not, even if they had not carried on with their education. We can decompose the naive difference as follows:
$$E\left(Y_{1i} \mid D_i = 1\right) - E\left(Y_{0i} \mid D_i = 0\right) = E\left(Y_{1i} \mid D_i = 1\right) - E\left(Y_{0i} \mid D_i = 1\right) + E\left(Y_{0i} \mid D_i = 1\right) - E\left(Y_{0i} \mid D_i = 0\right)$$
$$= \underbrace{E\left(Y_{1i} - Y_{0i} \mid D_i = 1\right)}_{\text{average treatment effect on the treated}} + \underbrace{E\left(Y_{0i} \mid D_i = 1\right) - E\left(Y_{0i} \mid D_i = 0\right)}_{\text{selection bias}}$$
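The decomposition can be made concrete with a small simulation. In the sketch below, ability raises the untreated outcome and also makes treatment more likely, so the naive comparison overstates the true effect; all numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

ability = rng.normal(0, 1, n)
y0 = 1.0 + 0.8 * ability + rng.normal(0, 1, n)   # untreated potential outcome
y1 = y0 + 0.5                                     # true causal effect of 0.5 for everyone

# Treatment is more likely for high-ability individuals -> selection bias
d = (ability + rng.normal(0, 1, n) > 0).astype(int)
y = np.where(d == 1, y1, y0)                      # only one potential outcome is observed

ate = (y1 - y0).mean()                            # 0.5 by construction
naive = y[d == 1].mean() - y[d == 0].mean()
selection_bias = y0[d == 1].mean() - y0[d == 0].mean()

print(ate, naive, selection_bias)
```

Here the treatment effect is the same for everyone, so the effect on the treated equals the ATE and the naive difference equals ATE plus the (positive) selection bias.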
6.2.3
6.3
Experimentation
In practice, of course, the econometrician does not fully know the DSP. Depending on how much
information the analyst has up front, the econometric specification will be more or less complete.
A general specification can be written as follows (Mittelhammer et al. 2000, p.9):
$$\mathbf{Y} = f\left(\mathbf{X}, \boldsymbol{\beta}, \boldsymbol{\varepsilon}\right) \quad (6.2)$$
where

$\mathbf{Y}$ is the set of random variables characterising the outcomes on the dependent variables.

$\mathbf{X}$ is a set of random variables characterising the additional observable information.

$\boldsymbol{\varepsilon}$ is a set of unobserved random variables.

$\boldsymbol{\beta}$ is a vector of parameters characterising the joint distribution of the $\mathbf{Y}$ variables.

$f$ is the function that relates the dependent variable to the explanatory variables and the unobservables.

Any specification will capture how much the analyst is willing to assume about the underlying DSP. For instance it is possible to make stronger or weaker assumptions about $\boldsymbol{\varepsilon}$ and $f$ (the functional form). In many of our applications we will subdivide $\boldsymbol{\beta}$ into parameters which affect the mean value of $\mathbf{Y}$ and parameters which affect its variance.
Once the analyst has indicated how much she is willing to assume, the problem becomes how to retrieve information about the unobservables $\boldsymbol{\beta}$ and $\boldsymbol{\varepsilon}$ from the observed data $(\mathbf{y}, \mathbf{x})$. Note that we have written these in lower case to indicate that they are outcomes and not the random variables themselves!

In short, the problem is how to go from $(\mathbf{y}, \mathbf{x})$ to $(\boldsymbol{\beta}, \boldsymbol{\varepsilon})$. Mittelhammer et al. (2000, p.9) call this the inverse problem of econometrics.
There are different ways in which we may go about trying to get information about $(\boldsymbol{\beta}, \boldsymbol{\varepsilon})$:

1. The simplest is point estimation. In this we try to get estimated vectors $\hat{\boldsymbol{\beta}}$, $\hat{\boldsymbol{\varepsilon}}$.
2. A little bit more complicated is interval estimation. Here we try to find ranges within
which the unobservables are likely to lie with, say, 95% confidence.
3. Related to interval estimation is the process of inference. Here we try to ask questions along the lines of: Is it plausible that the DSP could be characterised by $\boldsymbol{\beta}$? Typically we will separate the possible DSPs into two groups (based on $H_0$ and $H_1$) and decide whether the DSP we are considering belongs to group 1 or 2. For instance we may ask the question whether a particular production function is constant returns to scale, or not.
6.3.1
Consistency: A consistent estimator is one that in large samples will give estimates close to the true value, i.e. $\mathrm{plim}\; \hat{\boldsymbol{\beta}} = \boldsymbol{\beta}$.

Efficiency: An efficient estimator $\hat{\boldsymbol{\beta}}$ within a particular class of estimators will have a smaller variance than any other estimator $\tilde{\boldsymbol{\beta}}$ within that class, i.e. $\mathrm{Var}\left(\hat{\boldsymbol{\beta}}\right) \leq \mathrm{Var}\left(\tilde{\boldsymbol{\beta}}\right)$.
Asymptotic normality: Many of the estimators that we will consider will have the property
that in large samples, the distribution of the estimator tends towards the normal distribution. We will be concerned to characterise the mean and the covariance matrix of these
estimators.
Similarly the properties of the rules of inference will depend on the nature of the DSP. We
will have much less to say on this subject, although more advanced texts will consider properties
such as the power of different tests.
An important point to bear in mind is that all our analyses will be done within the standard
framework, i.e. the properties should be interpreted either
in a repeated sampling sense, i.e. how the estimator would behave if we had the opportunity
to rerun the analysis very many times; or
in an asymptotic sense, i.e. how the estimator would behave if we had an infinite sized
sample.
In an empirical problem we, of course, will never have infinite sized samples. Furthermore we
will hardly ever have the luxury to repeat the experiment. The fact that the estimator performs
well on average does not guarantee that we will get estimates that are at all close to the true
values in any particular analysis! Hence even after the analysis has been performed there is still
some judgement involved as to whether we choose to believe our estimates or not. Because of this
difficulty, Bayesian analysts argue that our a priori judgements about the nature of the DSP
should be explicitly incorporated into the process of estimation and inference. In this course we
will not introduce Bayesian approaches.
6.4
6.4.1
One of the favourite examples of an econometric model discussed in textbooks (see for instance
Greene 2003, Gujarati 2003) is the Keynesian consumption function. Keynes describes this as
follows:
We will therefore define what we shall call the propensity to consume as the functional relationship $\chi$ between $Y_w$, a given level of income in terms of wage-units, and $C_w$, the expenditure on consumption out of that level of income, so that
$$C_w = \chi\left(Y_w\right) \quad \text{or} \quad C = W\cdot\chi\left(Y_w\right).$$
The amount that the community spends on consumption obviously depends (i) partly
on the amount of its income, (ii) partly on the other objective attendant circumstances, and (iii) partly on the subjective needs and the psychological propensities
and habits of the individuals composing it and the principles on which the income is
divided between them (which may suffer modification as output is increased). ...
Granted, then, that the propensity to consume is a fairly stable function so that,
as a rule, the amount of aggregate consumption mainly depends on the amount of
aggregate income (both measured in terms of wage units), changes in the propensity
itself being treated as secondary influences, what is the normal shape of this function?
The fundamental psychological law, upon which we are entitled to depend with
great confidence both a priori from our knowledge of human nature and from the
detailed facts of experience, is that men are disposed, as a rule and on the average,
to increase their consumption as their income increases, but not by as much as the
increase in their income. That is to say, if is the amount of consumption and
is income (both measured in wageunits) has the same sign as but is
= + +
0 2
Comparing this to the general specification in equation 6.2 we see several things:
The dependent variable Y is a particular macroeconomic consumption series
The explanatory variable X is a particular macroeconomic GDP series
The functional form is linear
(6.3)
98
The vector = 2 .
Note that as it stands this model cannot be literally true. If $\varepsilon_t$ really is a normal variable, it can theoretically assume arbitrarily large positive and negative values. This means that $C_t$ could be negative. Admittedly the probability of this might be vanishingly small, but it is not zero. In short, the real world DSP cannot be a member of the family of DSPs represented by equation 6.3.
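The point that a normal error makes negative consumption possible, if very unlikely, is easy to illustrate numerically. The parameter values below are invented purely for the sketch:

```python
import math

# Hypothetical consumption function C = alpha + beta*Y + eps, eps ~ N(0, sigma^2)
alpha, beta, sigma = 10.0, 0.8, 30.0
Y = 100.0

# P(C < 0) = P(eps < -(alpha + beta*Y)) = Phi(-(alpha + beta*Y) / sigma)
z = -(alpha + beta * Y) / sigma
p_negative = 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(p_negative)   # small, but strictly positive
```

Here the mean of $C$ is 90 and the implied $z$-value is $-3$, so the probability of negative consumption is about 0.00135: small, but not zero, exactly as the text argues.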
One of the questions arising in econometrics is how robust processes of estimation and inference are to misspecifications of this sort. Unsurprisingly, it depends on the nature of the
misspecification. Many of the techniques we will discuss are reasonably robust to small departures from their underlying assumptions. We will also see, however, that in certain cases our
processes of inference can become very misleading.
6.4.2
A very dierent problem is provided by the problem of estimating the unemployment rate. This
seems a dierent problem because it looks like a pure measurement issue, i.e. we are not looking
at the relationship between two or more variables. Nevertheless all of the same issues crop up.
In order to even measure unemployment we need to have a clear concept of what it is. This
requires us to depart from some economic model of the process. This turns out to be more tricky
than it might seem at first. In terms of standard neoclassical labour economics one wants to
distinguish between voluntary unemployment (when the wage one can command is below one's reservation wage) and involuntary unemployment. The economically interesting measurement
is that of involuntary unemployment. The simplest economic model of unemployment will therefore state that attached to each individual $i$ within the economy there is a vector $\left(w_i, r_i, E_i\right)$ which contains the information on that individual's attainable wage $w_i$, reservation wage $r_i$ and employment status $E_i$, which might be coded as follows:
$$E_i = \begin{cases} 0 & \text{if employed} \\ 1 & \text{if } w_i < r_i \text{ (voluntarily unemployed)} \\ 2 & \text{if } w_i \geq r_i \text{ (involuntarily unemployed)} \end{cases}$$
The sampling model in this case will describe how the theoretical vector $\left(w_i, r_i, E_i\right)$ gets converted into an actual observation on individual $i$. In many surveys questions are not asked about reservation wages. This information may therefore simply not be available. Even when questions are asked, they may only be asked of people who are unemployed. More problematic still, the analyst cannot observe the attainable wage $w_i$ for someone who is unemployed. It is not clear that asking the unemployed to tell us how much they think they could command in the market place would give us any approximation to $w_i$ either. Another problem is that we do not have direct observations on $E_i$. Instead we usually have a battery of questions along the lines of "Did you do any work during the past week?", "How many hours did you spend on casual work last week?", "If you were offered employment tomorrow, would you be willing to take it?" and so on. It is not clear whether the persons answering the questions understand these in the way that the analyst intended. For instance a number of people who undertake casual work may not regard this as proper work. Similarly the question about willingness to work does not specify under what conditions. Different analysts will take responses to these questions and on the basis of these code someone as being unemployed, employed or out of the labour force. The measured variable
In short, the typical data available to the analyst will be the vector $\left(w_i, E_i\right)$ if the person is employed, and $\left(\cdot, E_i\right)$ if the person is unemployed or not economically active. The dot here indicates that the information is missing. From this the analyst might create a dummy variable $U_i$, equal to one if person $i$ is coded as unemployed and zero otherwise. Implicit in this is a particular view of the probability model. The assumption is that observations are essentially independent of each other. If this is not the case (and frequently in cross-sectional surveys it is not) then the appropriate method of estimation needs to take that into account also (see Deaton 1997, Chapter 1). The implicit econometric model might be
$$U_i \sim \mathrm{Bernoulli}(p)$$
where $\mathrm{Bernoulli}(p)$ is the Bernoulli distribution with parameter $p$. It would be instructive to compare this to the general specification given in equation 6.2.
In short even what looks like a simple measurement issue involves at least implicitly a view of
the underlying DSP. Analysts can disagree (sometimes vehemently) on the appropriate methods
of measuring such phenomena because they either have different views about the appropriate economic model that should inform the measurement, the data sampling process or the appropriate
probability model.
6.5
In the previous section we already saw two very different econometric problems. In the one
case the dependent variable was continuous, in the other it was discrete. In the first there were
covariates, in the second we did not use any. The distribution of the errors was assumed to
be normal in the one and Bernoulli in the other. Mittelhammer et al. (2000, p.27) provide a
typology of different types of probability models which is given in Table 6.1.
In this course we will not be examining all the possibilities contained in this table. Nevertheless it is important to know that econometric techniques exist for all sorts of conditions.
Identifying which circumstances apply to your problem and picking the right tools for the job,
is absolutely vital if you want to perform high quality research.
In applied work it is important to beware of two kinds of errors:
Table 6.1. A typology of probability models

Dependent variable Y:
- RV type: discrete, continuous, mixed
- Range: unlimited, limited
- Dimension: univariate, multivariate

Function f(X, β):
- Functional form in X: linear, transformable to linear, nonlinear
- Functional form in β: linear, transformable to linear, nonlinear
- In ε: additive, nonadditive

Noise ε:
- RV type: iid; independent but non-identical; dependent
- Dimension: finite, unspecified
- Moments: E[ε|X] = 0; Var(ε|X) specified or unspecified
- Relation to X: uncorrelated with X, dependent on X

Explanatory variables X:
- RV type: fixed; random: independent of ε, uncorrelated with ε, dependent on ε
- Genesis: endogenous, exogenous

Parameter space:
- PDF family: normal, non-normal, unspecified
- Prior information: unconstrained, equality constrained, inequality constrained, stochastic prior info

From Mittelhammer et al. (2000, p.27)
6.6
The baseline standard of econometric analysis is the classical linear regression model. At its simplest, the model assumes that each observation in the sample is generated by a process that can be represented as
$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_K x_{iK} + \varepsilon_i \quad (6.4)$$
where the $x$ variables are assumed to be independent of the error terms, the $\beta$s are fixed and each $\varepsilon_i$ is distributed independently and identically with a mean of 0 and variance $\sigma^2$ (i.e. there is no heteroscedasticity and no autocorrelation). Note that in this form we have already made very particular choices in each of the dimensions represented in Table 6.1. If in addition we make the assumption that the errors are normally distributed, then the model is known as the classical normal linear regression model. We can make all of this more precise.
6.6.1
Matrix representation
Since equation 6.4 is true of every observation $i$, we can stack the observations and get the
equivalent expression

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{pmatrix}\beta_1 + \begin{pmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2} \end{pmatrix}\beta_2 + \cdots + \begin{pmatrix} x_{1K} \\ x_{2K} \\ \vdots \\ x_{nK} \end{pmatrix}\beta_K + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

i.e.

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1K} \\ x_{21} & x_{22} & \cdots & x_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nK} \end{pmatrix}\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_K \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
or in short

$$\mathbf{y} = X\beta + \varepsilon \qquad (6.5)$$
This is the fundamental equation of the linear regression model: $\mathbf{y}$ is the $(n \times 1)$ column vector
of the observations on the dependent variable, $X$ is an $(n \times K)$ matrix in which each column
represents the observations on one of the explanatory variables, $\beta$ is a $(K \times 1)$ vector of
parameters and $\varepsilon$ is the $(n \times 1)$ vector of stochastic error terms (or disturbance terms). Typically
the first column of $X$ will be a column of 1s, so that $\beta_1$ is the intercept in the model.
Using the notation introduced in Section 2.4.3, the assumptions about the mean of the error
term can be represented as follows:
$$E[\varepsilon|X] = \begin{pmatrix} E[\varepsilon_1|X] \\ E[\varepsilon_2|X] \\ \vdots \\ E[\varepsilon_n|X] \end{pmatrix} = \mathbf{0}$$
The assumptions about the variance of the errors are captured in the following:
$$\mathrm{Var}(\varepsilon|X) = E[\varepsilon\varepsilon'|X] = \begin{pmatrix} E[\varepsilon_1^2|X] & E[\varepsilon_1\varepsilon_2|X] & \cdots & E[\varepsilon_1\varepsilon_n|X] \\ E[\varepsilon_2\varepsilon_1|X] & E[\varepsilon_2^2|X] & \cdots & E[\varepsilon_2\varepsilon_n|X] \\ \vdots & \vdots & \ddots & \vdots \\ E[\varepsilon_n\varepsilon_1|X] & E[\varepsilon_n\varepsilon_2|X] & \cdots & E[\varepsilon_n^2|X] \end{pmatrix} = \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix} = \sigma^2 I$$

6.6.2
Assumptions
In summary, under the assumptions of the Classical Linear Regression Model the DSP can
be described as follows:
$$\mathbf{y} = X\beta + \varepsilon \qquad \text{(Assumption 1)}$$
$$E[\varepsilon|X] = \mathbf{0} \qquad \text{(Assumption 2)}$$
$$\mathrm{Var}[\varepsilon|X] = \sigma^2 I \qquad \text{(Assumption 3)}$$
$$X \text{ nonstochastic (fixed regressors)} \qquad \text{(Assumption 4a)}$$
$$X \text{ stochastic, but generated independently of } \varepsilon \qquad \text{(Assumption 4b)}$$
$$\varepsilon|X \sim N\left(\mathbf{0}, \sigma^2 I\right) \qquad \text{(Assumption 5)}$$

To these we add the requirement that

$$\mathrm{rank}(X) = K \qquad (6.6)$$
This is an important identification condition. It doesn't describe the DSP, but stipulates
under what conditions we can estimate the parameter vector $\beta$. If the condition is violated, we
cannot solve the inverse problem.
We will briefly discuss these assumptions further.
Assumption 1: Linearity in X and β and additivity in ε

The model assumes both linearity in $X$ and linearity in $\beta$. Linearity in $X$ is not as restrictive as
it may look at first. Many nonlinear relationships in variables can be accommodated by this
model:
$$\ln y_i = \mathbf{x}_i'\beta + \varepsilon_i$$

also meets the assumptions of the classical linear regression model, when suitably interpreted. (Note $\mathbf{x}_i$ is the column vector corresponding to observation $i$, i.e. $\mathbf{x}_i'$ is row $i$ of the
matrix $X$.)
As we noted in relation to the loglinear model, nonlinearities in $\beta$ can also be accommodated,
if we can reparameterise the model appropriately. In that case we do so by taking the new intercept $\beta_1$ to be the log of the multiplicative constant in the original specification.
Finally, the multiplicative error in that specification becomes an additive error in the logarithmic
version.
Assumption 2: Regression
The assumption that the error terms have a conditional mean of zero implies that

$$E[\mathbf{y}|X] = X\beta$$

Any function of the form $E[\mathbf{y}|X] = g(X)$ is called a regression function, i.e. a regression
function describes how the conditional mean of $\mathbf{y}$ (the dependent variable) changes with $X$ (the
explanatory variables).
Assumption 3: Spherical disturbances
This assumption states that the error process operates essentially constantly from observation
to observation. Furthermore, errors that happen in one observation have no influence on what
happens in the next observation. This independence property will frequently be violated in
practice. Particularly in macroeconomic data, processes from one period spill over into the next.
Even in microeconomic interactions individuals can influence each other. For instance, if people
imitate the behaviour of their neighbours, this will induce (for example) correlations in their
consumption patterns which may go beyond the observables (i.e. $X$) that we can control for in
the standard regressions.
Assumption 4: Exogeneity of X
Version (a) of this assumption (fixed regressors) is unlikely ever to be met in economic research.
Indeed, econometricians have thought long and hard about how to analyse data that are essentially
nonexperimental. We do not have the luxury of being able to predetermine the levels of our
explanatory variables and then to measure the outcomes. In practice, the crucial assumption will
therefore be version (b), i.e. that the regressors are generated independently of $\varepsilon$.
Assumption 5: Normality of ε
As noted above, this assumption is not a core assumption of the classical linear regression model.
It is, however, very useful for providing various optimality results and allowing us to deduce the
sampling properties of our estimators in small samples.
The identification condition: Full rank of X
It is important to understand what this condition says. Essentially it requires two things:

- None of the columns of the $X$ matrix should be able to be written as a linear combination
of the other columns. We can see what can go wrong if this condition is violated. If, for
instance, we had $\mathbf{x}_4 = \mathbf{x}_3 + \mathbf{x}_2$ and our DSP was given by

$$\text{DSP1: } \mathbf{y} = \mathbf{x}_1\beta_1 + \mathbf{x}_2\beta_2 + \mathbf{x}_3\beta_3 + \mathbf{x}_4\beta_4 + \varepsilon$$

this DSP would generate precisely the same data as the different DSP

$$\text{DSP2: } \mathbf{y} = \mathbf{x}_1\beta_1 + \mathbf{x}_2(\beta_2 - 2) + \mathbf{x}_3(\beta_3 - 2) + \mathbf{x}_4(\beta_4 + 2) + \varepsilon$$

In fact, every possible data set generated by DSP1 would also be generated by DSP2. The
observed data could therefore never adjudicate which of these DSPs was really generating
the observations. In short, the data could never identify the underlying process.

- The number of observations $n$ must be at least as large as the number of variables $K$. Again,
if we had too few observations, we could never hope to identify the $K$ unknowns.
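The failure of identification under exact collinearity is easy to verify numerically. The following sketch (Python/numpy; the data are simulated and purely illustrative) constructs an X matrix in which the fourth column is the sum of the second and third, and confirms that the rank condition fails:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = np.ones(n)                  # column of 1s (the intercept)
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x4 = x2 + x3                     # exact linear combination of other columns
X = np.column_stack([x1, x2, x3, x4])

# rank(X) = 3 < K = 4: the identification condition rank(X) = K fails,
# so X'X is singular and the normal equations have no unique solution
print(np.linalg.matrix_rank(X))  # prints 3
```

Any attempt to invert X'X here will fail (or produce numerical garbage), which is the computational counterpart of the DSP1/DSP2 argument above.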
6.7
Exercises
1. Consider the DSP given by

$$f(\varepsilon_i) = \begin{cases} \tfrac{1}{2} & \text{if } |\varepsilon_i| \le 1 \\ 0 & \text{if } |\varepsilon_i| > 1 \end{cases}$$
$$\mathbf{y} = \mathbf{x}\beta + \varepsilon$$
$$E(\varepsilon_i\varepsilon_j|\mathbf{x}) = 0 \text{ if } i \ne j$$

where $x$ takes on only positive values.

(a) What is $E(\varepsilon|\mathbf{x})$?
(b) What is $\mathrm{Var}(\varepsilon|\mathbf{x})$?
(c) Does this DSP satisfy the assumptions of the classical linear regression model? Explain.
2. Consider the DSP given by

$$f(\varepsilon_i|\mathbf{x}) = \begin{cases} 1 - |\varepsilon_i| & \text{if } |\varepsilon_i| \le 1 \\ 0 & \text{if } |\varepsilon_i| > 1 \end{cases}$$
$$E(\varepsilon_i\varepsilon_j|\mathbf{x}) = 0 \text{ if } i \ne j$$

with $\mathbf{y} = \mathbf{x}\beta + \varepsilon$ and $x \sim N(5, \sigma_x^2)$.

(a) What is $E(\varepsilon|\mathbf{x})$?
(b) What is $\mathrm{Var}(\varepsilon)$?
(c) What is $E(\varepsilon_1\varepsilon_2)$?
(d) What is $E(\mathbf{y})$?
(e) Does this DSP satisfy the assumptions of the classical linear regression model? Explain.
(f) (More difficult) What is the conditional distribution of $y$ given $x$?
Chapter 7
Least Squares
7.1
Introduction
7.2
The Ordinary Least Squares Criterion stipulates that we pick that estimate which minimises
the Residual Sum of Squares, i.e.
$$\hat{\beta} = \arg\min_{b} \sum_{i=1}^{n} \left(y_i - \mathbf{x}_i'b\right)^2$$
The Residual Sum of Squares can be written as

$$\sum_{i=1}^{n} e_i^2 = e_1^2 + e_2^2 + \cdots + e_n^2 = \mathbf{e}'\mathbf{e}$$

where $\mathbf{e} = (e_1, e_2, \ldots, e_n)'$ is the vector of residuals.
7.2.1
The residuals are given by

$$\mathbf{e} = \mathbf{y} - X\hat{\beta}$$

so

$$\mathbf{e}'\mathbf{e} = \left(\mathbf{y} - X\hat{\beta}\right)'\left(\mathbf{y} - X\hat{\beta}\right) = \mathbf{y}'\mathbf{y} - \mathbf{y}'X\hat{\beta} - \hat{\beta}'X'\mathbf{y} + \hat{\beta}'X'X\hat{\beta}$$
We can simplify this if we note that each of these terms is a scalar (a $1 \times 1$ matrix). The
transpose of a scalar is of course just that number again. Now note that $\left(\mathbf{y}'X\hat{\beta}\right)' = \hat{\beta}'X'\mathbf{y}$. The
two middle terms are therefore equal to each other and so

$$\mathbf{e}'\mathbf{e} = \mathbf{y}'\mathbf{y} - 2\hat{\beta}'X'\mathbf{y} + \hat{\beta}'X'X\hat{\beta} \qquad (7.1)$$
To minimise this we require the partial derivative with respect to each element of $\hat{\beta}$ to be zero:

$$\frac{\partial\, \mathbf{e}'\mathbf{e}}{\partial \hat{\beta}_1} = 0, \quad \frac{\partial\, \mathbf{e}'\mathbf{e}}{\partial \hat{\beta}_2} = 0, \quad \ldots, \quad \frac{\partial\, \mathbf{e}'\mathbf{e}}{\partial \hat{\beta}_K} = 0$$

We need to simultaneously solve these equations. We can write this system of equations in vector
form:

$$\frac{\partial\, \mathbf{e}'\mathbf{e}}{\partial \hat{\beta}} = \mathbf{0}$$

where we let $\frac{\partial}{\partial \hat{\beta}}$ be the vector of partial derivatives $\left(\frac{\partial}{\partial \hat{\beta}_1}, \ldots, \frac{\partial}{\partial \hat{\beta}_K}\right)'$.
It helps to be able to do the differentiation directly on the vector expression (equation 7.1).
A short diversion on matrix differentiation

To differentiate this equation we make use of the following rules:

1. If $z = c$, where $c$ is a constant, then

$$\frac{\partial z}{\partial \hat{\beta}} = \mathbf{0}$$

i.e. vector differentiation of a constant gives the zero vector.

2. If $z = \hat{\beta}'\mathbf{c}$, where $\mathbf{c}$ is a vector of constants, then

$$\frac{\partial z}{\partial \hat{\beta}} = \mathbf{c}$$

3. If $z = \hat{\beta}'A\hat{\beta}$, where $A$ is a symmetric matrix of constants, then

$$\frac{\partial z}{\partial \hat{\beta}} = 2A\hat{\beta}$$

Applying these rules to equation 7.1 we get

$$\frac{\partial\, \mathbf{e}'\mathbf{e}}{\partial \hat{\beta}} = -2X'\mathbf{y} + 2X'X\hat{\beta}$$
Setting this derivative to zero yields

$$-2X'\mathbf{y} + 2X'X\hat{\beta} = \mathbf{0}$$
$$X'X\hat{\beta} = X'\mathbf{y} \qquad (7.2)$$

These are the normal equations. Provided that $X'X$ has rank $K$ (which it will do, by the
identification condition that we imposed), we can solve out for $\hat{\beta}$:

$$\hat{\beta} = (X'X)^{-1}X'\mathbf{y} \qquad (7.3)$$
The second derivative is

$$\frac{\partial^2\, \mathbf{e}'\mathbf{e}}{\partial \hat{\beta}\,\partial \hat{\beta}'} = 2X'X$$

which is positive definite (given the full rank condition), confirming that $\hat{\beta}$ minimises the sum of squares.
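The whole derivation can be mirrored in a few lines of code. This sketch (numpy; the data and names are simulated and illustrative) solves the normal equations 7.2 directly and checks the result against a standard least squares routine:

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.1, size=n)

# Solve the normal equations X'X beta_hat = X'y (equations 7.2 and 7.3)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against numpy's built-in least squares solver
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_ls))  # True
```

In practice one solves the linear system rather than explicitly forming the inverse $(X'X)^{-1}$, which is both faster and numerically more stable.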
7.3
At this stage the obvious question seems to be: why should we want to square the residuals?
Why don't we minimise the sum of the absolute deviations instead? In fact, there is an estimator
(the Least Absolute Deviations or LAD estimator) that does precisely that.
In order to get some sense of the rationale for squaring, let us consider the simplest possible
regression problem, given by the model
$$\mathbf{y} = \mathbf{x}\beta + \varepsilon$$

where there is precisely one explanatory variable. Let us consider the particularly simple case in
which we have only two observations, i.e. our model is

$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\beta + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix}$$
We can plot these points in the ordinary Cartesian plane where the axes correspond to the two
observations. Geometrically this is shown in Figure 7.1. We have plotted the points $(y_1, y_2)$ and
$(x_1, x_2)$. In this context it is useful to identify these vectors not only with a particular point in
the two-dimensional space, but with the directed line segment from the origin to that point. These are
indicated in the figure by the darker arrows.
By choosing different values for $\hat{\beta}$, we trace out the line through the point $\mathbf{x}$. In the diagram
we have indicated two possible fitted values, i.e. $\mathbf{x}\hat{\beta}_1$ and $\mathbf{x}\hat{\beta}_2$. The residual vectors $\mathbf{e}_1$ and
$\mathbf{e}_2$ corresponding to these fitted values are simply the vectors starting at the points $\mathbf{x}\hat{\beta}_1$ and $\mathbf{x}\hat{\beta}_2$
respectively and going to $\mathbf{y}$. This has to be the case, since by definition

$$\mathbf{y} = \mathbf{x}\hat{\beta} + \mathbf{e}$$

for any choice of $\hat{\beta}$.
Minimising the Residual Sum of Squares in this particular context means minimising $e_1^2 + e_2^2$.
This is just the square of the length of the vector $\mathbf{e}$. Minimising the RSS therefore amounts to
picking a residual vector $\mathbf{e}$ that is as short as possible! The fitted values $\hat{\mathbf{y}}$ represent that point
on the line through $\mathbf{x}$ that is closest to $\mathbf{y}$.
This insight generalises to the case where $\mathbf{y}$ and $\mathbf{x}$ are arbitrary vectors in $n$-dimensional space.
The residual vector $\mathbf{e} = (e_1, e_2, \ldots, e_n)'$ has length $\|\mathbf{e}\| = \sqrt{e_1^2 + e_2^2 + \cdots + e_n^2}$, so minimising the
RSS is equivalent to minimising $\|\mathbf{e}\|^2$. This, of course, is equivalent to simply minimising the
length of $\mathbf{e}$.
Mathematically it is therefore obvious why one might want to minimise the residual sum of
squares. The reason why we square (and don't take absolute values) is that the usual distance
measures in $n$-dimensional space all involve squares, through Pythagoras's theorem.
Going back to Figure 7.1 it is clear that the vector e which will minimise the length of e has
to be given by the perpendicular dropped from y onto the line that passes through the point x.
In other words, the vector e has to be at right angles to the line through x. This implies that
the inner product (dot product) of the vectors e and x has to be zero, i.e.
$$\mathbf{x}'\mathbf{e} = 0$$
$$\mathbf{x}'\left(\mathbf{y} - \mathbf{x}\hat{\beta}\right) = 0$$
$$\mathbf{x}'\mathbf{y} = \mathbf{x}'\mathbf{x}\hat{\beta}$$
We have therefore derived the normal equations through a geometric argument! Note that we
can rerun this derivation in reverse order to show that the OLS solution has to be such that the
residual vector is at right angles to the explanatory variable.
Figure 7.1: The OLS residual vector e is at right angles to the x vector and any vector lying on
the line through x.
It turns out that this argument generalises to the case where there is more than one explanatory variable (for an extended geometric treatment of Least Squares see Davidson and
MacKinnon 1993). Provided that the vectors of explanatory variables $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_K$ are all independent of each other (which by our identification assumption will be the case), we can trace out
a $K$-dimensional subspace of $\mathbb{R}^n$ by considering all possible linear combinations of these variables. Different fitted values will be given by different choices of $\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_K$. An arbitrary
fitted value is

$$\mathbf{x}_1\hat{\beta}_1 + \mathbf{x}_2\hat{\beta}_2 + \cdots + \mathbf{x}_K\hat{\beta}_K = X\hat{\beta}$$
The residual vector $\mathbf{e} = \mathbf{y} - X\hat{\beta}$ is the vector from this space to the point $\mathbf{y}$. Again the problem
is to minimise the length of this vector, i.e. to find the point inside the space spanned by $\mathbf{x}_1$, $\mathbf{x}_2$,
\ldots, $\mathbf{x}_K$ that is as close as possible to $\mathbf{y}$. Again the solution is to drop the perpendicular from $\mathbf{y}$
to this space, i.e. to make $\mathbf{e}$ orthogonal to $X$, which in this case implies that $X'\mathbf{e} = \mathbf{0}$, i.e.

$$X'\left(\mathbf{y} - X\hat{\beta}\right) = \mathbf{0}$$
$$X'\mathbf{y} = X'X\hat{\beta}$$

7.3.1
The process of dropping the perpendicular from $\mathbf{y}$ into the space spanned by $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_K$ is
an example of a mathematical operation called projection. It is a mapping from any arbitrary
vector $\mathbf{y}$ in $\mathbb{R}^n$ onto its fitted values $\hat{\mathbf{y}}$, which reside within the $K$-dimensional subspace generated
by the columns of $X$. The fitted values, of course, are given by $\hat{\mathbf{y}} = X\hat{\beta}$. Substituting in from
equation 7.3 we get

$$\hat{\mathbf{y}} = X(X'X)^{-1}X'\mathbf{y}$$
This makes explicit how the fitted values are generated from the y vector.
The matrix $X(X'X)^{-1}X'$ is called a projection matrix since it accomplishes the projection
from $\mathbf{y}$ into the space spanned by $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_K$. It is sufficiently important that it is frequently
given its own name:

$$P_X = X(X'X)^{-1}X' \qquad (7.4)$$

If there is no ambiguity about the regressors it is referred to simply as $P$. This is sometimes also
called the hat matrix because it puts a hat on the original $\mathbf{y}$ values, i.e.

$$\hat{\mathbf{y}} = P_X\mathbf{y} \qquad (7.5)$$
Note that $P_XP_X = P_X$, i.e. the matrix is idempotent. This makes a lot of sense. If we start
with a point that has already been projected into the space and project it again, it will simply
stay where it is. This is obvious if we look at the example given in Figure 7.1. If we try to drop
the perpendicular from the point $\hat{\mathbf{y}}$ onto the line through $\mathbf{x}$, the point will simply stay where
it is. More generally, if we regress our fitted values on the $X$ matrix, we will just get the fitted
values back. Note also that the $P$ matrix is symmetric.
Having shown how to obtain the fitted values from $\mathbf{y}$, it is equally possible to show how we
get the residuals. The OLS residuals are, of course, given by

$$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - P_X\mathbf{y} = (I - P_X)\,\mathbf{y}$$
$$M_X = I - X(X'X)^{-1}X' \qquad (7.6)$$
This matrix is called the residual maker by Greene (2003) since it shows how the residuals are
created from the original $\mathbf{y}$ vector:

$$\mathbf{e} = M_X\mathbf{y} \qquad (7.7)$$

Again, if there is no chance of confusion, we will drop the subscript and talk about the $M$ matrix.
$M_X$ is also idempotent, i.e. $M_XM_X = M_X$, and symmetric. In fact $M_X$ is also a projection
matrix. In this case the projection is onto the space that is orthogonal (i.e. at right angles) to
the space spanned by $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_K$ (for more information see Davidson and MacKinnon 1993).
In Figure 7.1 this space is again a line, from the origin through the point $(e_1, e_2)$. In the more
general case, this space will have $n - K$ dimensions.
In short there are two operations associated with ordinary least squares:

- The projection $P$ which creates fitted values $\hat{\mathbf{y}}$ from $\mathbf{y}$
- The projection $M$ which creates residuals $\mathbf{e}$ from $\mathbf{y}$

Between them these two completely (and uniquely) decompose the $\mathbf{y}$ vector as $\mathbf{y} = \hat{\mathbf{y}} + \mathbf{e}$.
This decomposition is such that these vectors will be at right angles to each other. Indeed there
are two key relationships between the $P$ and the $M$ matrix:

$$P + M = I$$

$P$ and $M$ annihilate each other, i.e. $PM = \mathbf{0} = MP$.
These two relationships have an interesting interpretation. The first says that the residuals
that one would get from regressing the fitted values $\hat{\mathbf{y}}$ on $X$ will be zero, i.e. the fitted
values can be perfectly explained by the $X$ variables. The second says that the fitted values
one would get by regressing the residuals on $X$ will also be zero, i.e. the $X$ variables cannot
explain anything additional about the residuals.
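All of these algebraic facts can be confirmed numerically. A minimal sketch (numpy; simulated, illustrative data) builds $P$ and $M$ and checks each identity:

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)       # projection ("hat") matrix, eq. 7.4
M = np.eye(n) - P                           # residual maker, eq. 7.6

assert np.allclose(P @ P, P)                # P is idempotent
assert np.allclose(P, P.T)                  # P is symmetric
assert np.allclose(P + M, np.eye(n))        # P + M = I
assert np.allclose(P @ M, np.zeros((n, n))) # P and M annihilate each other
assert np.allclose(M @ X, 0)                # M_X X = 0
y_hat, e = P @ y, M @ y
assert np.allclose(y_hat + e, y)            # unique decomposition y = y_hat + e
print("all projection identities hold")
```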
We can verify the last point directly:

$$M_XX = \left[I - X(X'X)^{-1}X'\right]X = X - X = \mathbf{0}$$
We can understand what this means more intuitively if we write the matrix $X$ in terms of its
column vectors as $X = [\mathbf{x}_1\; \mathbf{x}_2\; \cdots\; \mathbf{x}_K]$. The condition above simply means that $M_X\mathbf{x}_k = \mathbf{0}$ for
every $k$. This means that the residuals that we get from regressing $\mathbf{x}_k$ on the $X$ variables are zero.
This is as it should be, since we can obviously retrieve any combination of the $x$ variables from
those variables themselves!
One implication of this is that

$$\mathbf{e} = M_X\mathbf{y} = M_X\left(X\beta + \varepsilon\right) = M_XX\beta + M_X\varepsilon = M_X\varepsilon \qquad (7.8)$$

So there is an immediate connection between the true errors and the residuals through the
$M_X$ matrix.
7.3.2
We can summarise the numerical properties of the least squares estimator as follows:
1. The OLS estimator is a linear function of the dependent variable
From equation 7.3 it is clear that $\hat{\beta}$ is a function of the sample values $\mathbf{y}$. More particularly,
if we define the matrix $A = (X'X)^{-1}X'$, then $\hat{\beta} = A\mathbf{y}$. But this means that $\hat{\beta}$ is a linear
function of the $\mathbf{y}$ vector.
2. The fitted values are a linear function of the dependent variable
This follows immediately from equation 7.5.
3. The residuals are uncorrelated with the explanatory variables
We have shown this in the context of our geometric interpretation of least squares.
4. The residuals are uncorrelated with the fitted values
This follows from the previous point, since the fitted values are just linear combinations of
the x variables. We can show it formally as follows
$$\mathbf{e}'\hat{\mathbf{y}} = (M_X\mathbf{y})'P_X\mathbf{y} = \mathbf{y}'M_XP_X\mathbf{y} = 0$$

using the symmetry of $M_X$ and the fact that $M_XP_X = \mathbf{0}$.
5. If the regression includes a constant, the residuals sum to zero

If the first column of $X$ is the column of ones $\boldsymbol{\iota}$, then since the residuals are uncorrelated with
the explanatory variables,

$$\mathbf{e}'X = \left[\mathbf{e}'\boldsymbol{\iota}\;\; \mathbf{e}'\mathbf{x}_2\;\; \cdots\;\; \mathbf{e}'\mathbf{x}_K\right] = \mathbf{0}'$$

The first entry of this $1 \times K$ row vector must be zero, i.e.

$$\mathbf{e}'\boldsymbol{\iota} = 0$$

This however implies that the mean of
the residuals is zero.
It turns out that this property generalises: if we transform the $X$ matrix with any linear
transformation that can be undone (i.e. any nonsingular transformation), then the fitted values will not be affected. We can
show this formally as follows.

Let us assume that we transform the $X$ matrix linearly to the matrix $Z$ where

$$Z = XC$$

with $C$ a nonsingular $K \times K$ matrix. The model $\mathbf{y} = X\beta + \varepsilon$ can then be written as

$$\mathbf{y} = ZC^{-1}\beta + \varepsilon = Z\gamma + \varepsilon$$

The parameter vector of the new model is therefore given by $\gamma = C^{-1}\beta$. We will show
that the OLS estimator of $\gamma$ on the transformed data will satisfy $\hat{\gamma} = C^{-1}\hat{\beta}$.
We have

$$\hat{\gamma} = (Z'Z)^{-1}Z'\mathbf{y} = (C'X'XC)^{-1}C'X'\mathbf{y} = C^{-1}(X'X)^{-1}(C')^{-1}C'X'\mathbf{y} = C^{-1}(X'X)^{-1}X'\mathbf{y} = C^{-1}\hat{\beta}$$
Furthermore, consider the fate of the fitted values. Prior to the transformation they were
given by

$$\hat{\mathbf{y}} = X(X'X)^{-1}X'\mathbf{y}$$

After the transformation they are given by

$$Z(Z'Z)^{-1}Z'\mathbf{y} = XC(C'X'XC)^{-1}C'X'\mathbf{y} = XCC^{-1}(X'X)^{-1}(C')^{-1}C'X'\mathbf{y} = X(X'X)^{-1}X'\mathbf{y}$$

so the fitted values are unchanged.
By comparison, it is easy to show that a rescaling of the $\mathbf{y}$ vector will rescale the fitted
values. But again the rescaling will happen in such a way that the underlying interpretation
is preserved. If we replace $\mathbf{y}$ by $\lambda\mathbf{y}$ for some scalar $\lambda$, then

$$\hat{\beta}^* = (X'X)^{-1}X'(\lambda\mathbf{y}) = \lambda(X'X)^{-1}X'\mathbf{y} = \lambda\hat{\beta}$$

Consequently $\hat{\mathbf{y}}^* = \lambda\hat{\mathbf{y}}$.

7.4
Partitioned regression
In many cases we are interested in analysing the role of subsets of variables. In particular,
suppose that the regression involves two sets of variables X1 and X2 so that
y = X1 1 + X2 2 +
We will be interested in investigating the properties of the OLS estimates of 1 and 2 .
7.4.1
One extremely important result is contained in the Frisch-Waugh-Lovell Theorem. An easy proof
is provided by the results on projections that we derived above (for the full details see Davidson
and MacKinnon 1993, p.19).
Suppose we partition the regressors as $X = [X_1\; X_2]$ and let $M_1 = I - X_1(X_1'X_1)^{-1}X_1'$ be the
residual maker associated with $X_1$. The theorem states that the OLS estimate $\hat{\beta}_2$ is such that

$$\hat{\beta}_2 = \left(X_2'M_1'M_1X_2\right)^{-1}X_2'M_1'M_1\mathbf{y} = \left(X_2'M_1X_2\right)^{-1}X_2'M_1\mathbf{y}$$

To see this, write the fitted regression as

$$\mathbf{y} = X_1\hat{\beta}_1 + X_2\hat{\beta}_2 + \mathbf{e}$$

then by the argument above $\mathbf{e}$ will be orthogonal to both $X_1$ and $X_2$. It follows that $M_1\mathbf{e} = \mathbf{e}$.
Multiplying through by $M_1$ we get

$$M_1\mathbf{y} = M_1X_2\hat{\beta}_2 + \mathbf{e}$$

Premultiplying by $X_2'M_1'$, and since $X_2'\mathbf{e} = \mathbf{0}$, solving out for $\hat{\beta}_2$ we get the result. This means that $\hat{\beta}_2$ is the vector of
coefficients that succeeds in minimising the distance between the residual vector $M_1\mathbf{y}$ and the
space spanned by the columns of $M_1X_2$, i.e. it is the vector of OLS coefficients in the regression
of $M_1\mathbf{y}$ on $M_1X_2$.
7.4.2
In essence the theorem says that we can think of a multiple regression coefficient as giving us the
impact of a variable after we have fully taken the impacts of all the other variables into account.
More specifically, it states that if we have more than one explanatory variable, we can get the
multiple regression coefficient on any variable (or group of variables) by the simple expedient of
regressing that variable(s) on all the other explanatory variables and obtaining the residuals $\mathbf{e}_2$.
(This notation is a bit awkward, because $\mathbf{e}_2$ may be a matrix rather than a vector.) Similarly we
regress $\mathbf{y}$ on those other variables and obtain the residuals $\mathbf{e}_1$. The coefficient in the regression of
$\mathbf{e}_1$ on $\mathbf{e}_2$ will be numerically equal to the coefficient in the multiple regression. Figure 7.2 gives
a pictorial representation.
The overall variation of $\mathbf{y}$ is represented by the circle (areas 1, 2, 4, 5). After taking the impact
of $\mathbf{x}_1$ fully into consideration we are left with the residual variation of the shaded areas 1 and
2. This is the variation of the residuals $\mathbf{e}_1$ around their mean. After fully accounting for the
impact of $\mathbf{x}_1$, the variable $\mathbf{x}_2$ will only contribute the additional information about $\mathbf{y}$ given by
area 2. This is the part of the variation in the residuals $\mathbf{e}_1$ explained by the residuals $\mathbf{e}_2$, and
it coincides precisely with the impact of $\mathbf{x}_2$ on $\mathbf{y}$ in the multiple regression, with $\mathbf{x}_2$ included as
an additional variable.

In other words, if our regression is

$$\mathbf{y} = \beta_1\mathbf{x}_1 + \beta_2\mathbf{x}_2 + \varepsilon$$

then $\beta_2$ is the impact of $\mathbf{x}_2$ on $\mathbf{y}$, once we have purged all of the effects that depend on $\mathbf{x}_1$, i.e.
the direct effect of $\mathbf{x}_1$ on $\mathbf{y}$ and any indirect effects which may work through the impact that $\mathbf{x}_1$
has on $\mathbf{x}_2$.
Of course the result cuts in the opposite direction too: the coefficient on $\mathbf{x}_1$ in the multiple
regression is also the coefficient in the relationship between the residuals of $\mathbf{y}$, after accounting
for $\mathbf{x}_2$, and the residuals of $\mathbf{x}_1$, after controlling for $\mathbf{x}_2$.

This picture may also make it clear why, if $\mathbf{x}_1$ and $\mathbf{x}_2$ are highly correlated (so that the area
labelled 2 in Figure 7.2 is very small), it will be extremely hard to estimate the differential impact
of $\mathbf{x}_2$ on $\mathbf{y}$ with any degree of accuracy. There will simply be too little information on which to
base our estimates. This is referred to as the problem of collinearity, which we will discuss later
in this course.
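The theorem is easy to verify on simulated data. In this sketch (numpy; the data and names are illustrative) the coefficient on x2 from the full multiple regression coincides with the coefficient from regressing residuals on residuals:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x1 = np.column_stack([np.ones(n), rng.normal(size=n)])  # X1: intercept plus one regressor
x2 = 0.5 * x1[:, 1] + rng.normal(size=n)                # X2: correlated with X1
y = x1 @ np.array([1.0, 2.0]) + 3.0 * x2 + rng.normal(size=n)

def ols(X, y):
    """OLS coefficients via the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Full multiple regression of y on [X1 X2]
b_full = ols(np.column_stack([x1, x2]), y)

# FWL: residualise both y and x2 on X1, then regress residual on residual
e1 = y - x1 @ ols(x1, y)
e2 = x2 - x1 @ ols(x1, x2)
b_fwl = ols(e2.reshape(-1, 1), e1)[0]

print(np.allclose(b_full[2], b_fwl))  # True: identical coefficient on x2
```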
7.4.3
Alternative proof
We have given a proof which uses the properties of the projection matrices $P$ and $M$. Greene
(2003, p.26) provides an alternative proof involving a consideration of the normal equations.
Using the partitioned form of the matrix $X$, these can be written as
$$\begin{pmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{pmatrix}\begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix} = \begin{pmatrix} X_1'\mathbf{y} \\ X_2'\mathbf{y} \end{pmatrix} \qquad (7.9)$$
Figure 7.2: The shaded areas labelled 1 and 2 represent the residual variation in $\mathbf{y}$ after $\mathbf{x}_1$
has been taken into account. It is a pictorial representation of $\mathbf{e}_1$. The areas labelled 2 and
3 represent that portion of the variation in $\mathbf{x}_2$ which remains after controlling for $\mathbf{x}_1$, which
represents $\mathbf{e}_2$. The overlap area 2 is the variation in $\mathbf{y}$ explained by $\mathbf{x}_2$ holding $\mathbf{x}_1$ constant. The
Frisch-Waugh-Lovell theorem says that the OLS coefficient $\hat{\beta}_2$ in the multiple regression of $\mathbf{y}$ on
$\mathbf{x}_1$ and $\mathbf{x}_2$ is identical to the coefficient obtained in the regression of $\mathbf{e}_1$ on $\mathbf{e}_2$.
From the first block of equations in 7.9 we get

$$\hat{\beta}_1 = (X_1'X_1)^{-1}X_1'\left(\mathbf{y} - X_2\hat{\beta}_2\right) \qquad (7.10)$$
Substituting this into equation 7.9 and then considering the second set of equations, we will get

$$X_2'X_1(X_1'X_1)^{-1}X_1'\left(\mathbf{y} - X_2\hat{\beta}_2\right) + X_2'X_2\hat{\beta}_2 = X_2'\mathbf{y}$$
Rearranging and solving for $\hat{\beta}_2$ we get

$$\hat{\beta}_2 = \left[X_2'\left(I - X_1(X_1'X_1)^{-1}X_1'\right)X_2\right]^{-1}\left[X_2'\left(I - X_1(X_1'X_1)^{-1}X_1'\right)\mathbf{y}\right] = \left[X_2'M_1X_2\right]^{-1}\left[X_2'M_1\mathbf{y}\right]$$
7.4.4
In terms of the residuals defined earlier, the theorem says that the multiple regression coefficient can be computed as

$$\hat{\beta}_2 = \left[\mathbf{e}_2'\mathbf{e}_2\right]^{-1}\left[\mathbf{e}_2'\mathbf{e}_1\right]$$

An important special case arises when $X_1 = \boldsymbol{\iota}$, the column of ones. In that case

$$M_\iota\mathbf{y} = \left[I - \boldsymbol{\iota}(\boldsymbol{\iota}'\boldsymbol{\iota})^{-1}\boldsymbol{\iota}'\right]\mathbf{y} = \mathbf{y} - \boldsymbol{\iota}(\boldsymbol{\iota}'\boldsymbol{\iota})^{-1}\boldsymbol{\iota}'\mathbf{y}$$

Now $\boldsymbol{\iota}'\boldsymbol{\iota} = n$ and $\boldsymbol{\iota}'\mathbf{y} = \sum_i y_i = n\bar{y}$. The second term on the right hand side is therefore just $\boldsymbol{\iota}\bar{y}$.
Consequently the vector $M_\iota\mathbf{y}$ is just the vector of deviations of the $y$ values from their mean, i.e.

$$\mathbf{e}_1 = M_\iota\mathbf{y} = \begin{pmatrix} y_1 - \bar{y} \\ y_2 - \bar{y} \\ \vdots \\ y_n - \bar{y} \end{pmatrix}$$
It is clear that there is nothing special about $\mathbf{y}$ here. Any vector when premultiplied by $M_\iota$ will
have its mean removed. The matrix $M_\iota X_2$ will therefore contain in its columns the deviations
of each of the $x$ variables from their mean. In short, the slope coefficients can all be estimated
from the deviations form of the regression model. The remaining coefficient, i.e. the intercept, can
be retrieved from equation 7.10. Substituting the slope estimates into this equation we find that

$$\hat{\beta}_1 = \bar{y} - \bar{\mathbf{x}}_2'\hat{\beta}_2$$

where $\bar{\mathbf{x}}_2$ is the vector of means of the explanatory variables. This provides an additional neat numerical result: if there is an intercept in the regression
model, then the fitted OLS regression line will go through the point of means $(\bar{x}_2, \ldots, \bar{x}_K, \bar{y})$.
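Both results, that the slopes can be computed from demeaned data and that the fitted line passes through the point of means, can be checked directly (numpy sketch, simulated illustrative data):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.normal(size=(n, 2))
y = 1.5 + x @ np.array([2.0, -1.0]) + rng.normal(size=n)

# Full regression with an intercept
X = np.column_stack([np.ones(n), x])
b = np.linalg.solve(X.T @ X, X.T @ y)

# Slopes from the demeaned ("deviations form") regression
xc = x - x.mean(axis=0)
yc = y - y.mean()
slopes = np.linalg.solve(xc.T @ xc, xc.T @ yc)
assert np.allclose(b[1:], slopes)

# The fitted line passes through the point of means
intercept = y.mean() - x.mean(axis=0) @ slopes
assert np.allclose(b[0], intercept)
print("deviation-form slopes and intercept recovered")
```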
Consider the effects of including the very particular dummy variable $\mathbf{i}_1 = (1, 0, 0, \ldots, 0)'$, i.e. a
variable which is 1 in the first observation and otherwise 0. Our regression model is

$$\mathbf{y} = \mathbf{i}_1\beta_1 + X_2\beta_2 + \varepsilon$$
In this case $M_1 = I - \mathbf{i}_1(\mathbf{i}_1'\mathbf{i}_1)^{-1}\mathbf{i}_1'$. Since $\mathbf{i}_1'\mathbf{i}_1 = 1$ and $\mathbf{i}_1'\mathbf{y} = y_1$,

$$M_1\mathbf{y} = \mathbf{y} - \mathbf{i}_1y_1 = \begin{pmatrix} 0 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}$$

i.e. premultiplying by $M_1$ simply zeroes out the first observation. Similarly

$$M_1X_2 = \begin{pmatrix} 0 & 0 & \cdots & 0 \\ x_{22} & x_{23} & \cdots & x_{2K} \\ x_{32} & x_{33} & \cdots & x_{3K} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n2} & x_{n3} & \cdots & x_{nK} \end{pmatrix}$$

By the Frisch-Waugh-Lovell theorem,

$$\hat{\beta}_2 = \left(X_2'M_1'M_1X_2\right)^{-1}X_2'M_1'M_1\mathbf{y} = \left(\tilde{X}_2'\tilde{X}_2\right)^{-1}\tilde{X}_2'\tilde{\mathbf{y}}$$

where $\tilde{X}_2$ and $\tilde{\mathbf{y}}$ denote the data with the first observation zeroed out.
This means that the regression estimate of $\beta_2$ is determined as though the first observation did
not exist. Instead, the first observation only determines $\hat{\beta}_1$ through equation 7.10. In fact, it is
easy to see that in this case $\hat{\beta}_1$ will be set so that the first observation fits perfectly!
One way of thinking about this result is that by allowing the first observation to have its
own coefficient, we are in effect allowing it to have an arbitrarily large residual. Note that
the argument is perfectly general: it applies to any dummy variable $\mathbf{i}_j$ which has value 1 for
observation $j$ and zeros otherwise.

This trick of eliminating an observation by including a specific dummy for that observation
is sometimes used in time series analyses, if it is thought that one observation is atypical (e.g.
if it was the year of some major upheaval).
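A small simulation (numpy; illustrative) confirms both claims: the dummied observation gets a zero residual, and the remaining coefficients are what one obtains by simply dropping that observation:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
x = rng.normal(size=(n, 2))
y = x @ np.array([1.0, 2.0]) + rng.normal(size=n)

d1 = np.zeros(n)
d1[0] = 1.0                                  # dummy for the first observation
X = np.column_stack([d1, np.ones(n), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b

# The dummied observation fits perfectly ...
assert np.isclose(e[0], 0.0)

# ... and the remaining coefficients equal those from dropping observation 1
Xd = np.column_stack([np.ones(n - 1), x[1:]])
bd = np.linalg.solve(Xd.T @ Xd, Xd.T @ y[1:])
assert np.allclose(b[1:], bd)
print("dummy variable removes the observation from the fit")
```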
7.4.5
Before leaving the topic of partitioned regression it is useful to note what happens to the OLS
estimates if we don't estimate the full model, but only estimate some of the coefficients. In
particular, let us assume that the DSP is given by

$$\mathbf{y} = X_1\beta_1 + X_2\beta_2 + \varepsilon$$

and we estimate instead

$$\mathbf{y} = X_1\beta_1 + \mathbf{v}$$

The OLS estimate $\hat{\mathbf{b}}$ of this misspecified regression will have the property that

$$\hat{\mathbf{b}} = \hat{\beta}_1 + A\hat{\beta}_2 \qquad (7.11)$$

where $\hat{\beta}_1$ and $\hat{\beta}_2$ are the OLS coefficients that we would have obtained in the multiple regression
and $A$ is the matrix of coefficients obtained by regressing each of the columns of $X_2$ on $X_1$.
Proof. Write the full regression as

$$\mathbf{y} = X_1\hat{\beta}_1 + X_2\hat{\beta}_2 + \mathbf{e}$$

hence, premultiplying by $(X_1'X_1)^{-1}X_1'$ and using $X_1'\mathbf{e} = \mathbf{0}$,

$$\hat{\mathbf{b}} = (X_1'X_1)^{-1}X_1'\mathbf{y} = \hat{\beta}_1 + (X_1'X_1)^{-1}X_1'X_2\hat{\beta}_2 = \hat{\beta}_1 + A\hat{\beta}_2$$
Equation 7.11 highlights the simple fact that unless $\hat{\beta}_2 = \mathbf{0}$, or $X_2$ is orthogonal to $X_1$, the
OLS estimates of the coefficients of $X_1$ in the restricted regression will be different from those
in the multiple regression.
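Equation 7.11 can be verified numerically. In this sketch (numpy, simulated illustrative data) the short-regression coefficients equal $\hat{\beta}_1$ plus $A$ times $\hat{\beta}_2$, where $A$ comes from regressing X2 on X1:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = (0.8 * X1[:, 1] + rng.normal(size=n)).reshape(-1, 1)
y = X1 @ np.array([1.0, 2.0]) + 1.5 * X2[:, 0] + rng.normal(size=n)

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

b_full = ols(np.column_stack([X1, X2]), y)  # beta1_hat (first 2), beta2_hat (last)
g = ols(X1, y)                              # short (misspecified) regression
A = ols(X1, X2)                             # regress each column of X2 on X1

# Equation 7.11: short coefficients = beta1_hat + A beta2_hat
assert np.allclose(g, b_full[:2] + (A @ b_full[2:]).ravel())
print("omitted variable identity (7.11) holds")
```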
7.5
Goodness of Fit
We have noted above that the fitted values are orthogonal to the residuals. This allows us to
decompose the sum of the squares of the y values into two components: the Residual Sum of
Squares and the Regression Sum of Squares:
$$\mathbf{y}'\mathbf{y} = \left(\hat{\mathbf{y}} + \mathbf{e}\right)'\left(\hat{\mathbf{y}} + \mathbf{e}\right) = \hat{\mathbf{y}}'\hat{\mathbf{y}} + \mathbf{e}'\mathbf{e} \qquad (7.12)$$
This particular decomposition is not used that often, because for many data series, the biggest
contribution to the sum of squares on the left hand side is the mean of the y values (think of
economic series like GDP!).
A better measure of how much the explanatory variables have contributed to understanding the
behaviour of $\mathbf{y}$ is to exclude the intercept from consideration. So if our fitted model is

$$\mathbf{y} = \boldsymbol{\iota}\hat{\beta}_1 + X_2\hat{\beta}_2 + \mathbf{e}$$

we have noted above (it is implied by section 7.3.2) that $M_\iota\mathbf{e} = \mathbf{e}$, so we can premultiply by $M_\iota$ and write this as

$$M_\iota\mathbf{y} = M_\iota X_2\hat{\beta}_2 + \mathbf{e}$$

i.e.

$$\mathbf{y}^* = \hat{\mathbf{y}}^* + \mathbf{e}$$

where the superscript $*$ indicates that we have centered the variables. The fitted values $\hat{\mathbf{y}}^*$ will
still be orthogonal to the residual vector $\mathbf{e}$ (since they derive from the multiple regression of $M_\iota\mathbf{y}$
on $M_\iota X_2$). Consequently we can write the decomposition in the form

$$\mathbf{y}^{*\prime}\mathbf{y}^* = \hat{\mathbf{y}}^{*\prime}\hat{\mathbf{y}}^* + \mathbf{e}'\mathbf{e} \qquad (7.13)$$
The left hand side is the sum of squares of the deviations of the $y$ values from their mean. This
is sometimes referred to as the variation in $\mathbf{y}$. The first term on the right hand side is the
explained sum of squares, sometimes also called the regression sum of squares or model
sum of squares. The final term is the residual sum of squares, sometimes also called the
error sum of squares.

Regrettably, nomenclature in this area is not uniform. What makes this particularly unfortunate is that sometimes the abbreviations have diametrically opposite meanings, i.e. ESS
and RSS could refer to "error sum of squares" and "regression sum of squares" or "explained
sum of squares" and "residual sum of squares"! If you need to use one of these terms it is always
advisable to specify first what you intend it to refer to.
The decomposition in equation 7.13 is the basis for defining the coefficient of determination or $R^2$:

$$R^2 = \frac{\hat{\mathbf{y}}^{*\prime}\hat{\mathbf{y}}^*}{\mathbf{y}^{*\prime}\mathbf{y}^*} = 1 - \frac{\mathbf{e}'\mathbf{e}}{\mathbf{y}^{*\prime}\mathbf{y}^*} \qquad (7.14)$$

This is sometimes also called the centered $R^2$. The uncentred version would be based on the
decomposition given in equation 7.12.
The $R^2$ ranges from zero (when the model explains nothing about $\mathbf{y}$) to one, when it fits
perfectly. Consequently the $R^2$ is frequently used to assess how well a regression seems to fit.
There are several problems with this particular measure:

Firstly, it is always possible to improve the fit of the regression by including more variables. Indeed, it is always possible to get a perfectly fitting regression if one were to use $n$
regressors! In order to get around this problem several other measures have been suggested, such as
the adjusted $R^2$. Greene (2003) has a discussion of some of the options.

Secondly, the size of the $R^2$ depends on the nature of the $\mathbf{y}$ variable. If the $\mathbf{y}$ variable
is transformed (e.g. by taking logarithms), the total variation in $\mathbf{y}$ changes and with it
the $R^2$. The $R^2$ cannot really be used to compare models that have different dependent
variables.
Thirdly, there are some domains in research in which it is almost impossible to reduce the
intrinsic noise that is coming from $\varepsilon$. A regression with an $R^2$ of 0.5 does not necessarily
fit badly if it is estimating certain kinds of labour market outcomes. In fact, regressions
with too high an $R^2$ often need to be treated with extreme caution.
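The two equivalent forms of the centred $R^2$ in equation 7.14 can be computed as follows (numpy sketch, simulated illustrative data):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=(n, 2))
y = 1.0 + x @ np.array([2.0, -1.0]) + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b

yc = y - y.mean()              # centred y
tss = yc @ yc                  # total variation
rss = e @ e                    # residual sum of squares
r2 = 1 - rss / tss             # centred R^2, equation 7.14

# Equivalent form via the explained sum of squares (valid because the
# regression contains an intercept, so the fitted values have mean y-bar)
yhat_c = X @ b - y.mean()
assert np.isclose(r2, (yhat_c @ yhat_c) / tss)
assert 0 <= r2 <= 1
print(round(r2, 3))
```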
7.6
Exercises
1. Consider the formulae given in the appendix, equations 7.15 and 7.16.
(a) Verify that these expressions do, indeed, represent the OLS estimators.
(b) Prove that these values uniquely minimise the sum of squares.
2. Consider the data given in the appendix, table 7.1. Rewrite the information on the explanatory variable(s) in standard matrix form as the $X$ matrix. Calculate $X'X$ and $(X'X)^{-1}$
and $(X'X)^{-1}X'\mathbf{y}$. Verify that this provides the same set of estimates as supplied in the
appendix.
3. Regress the residuals obtained from this expression on X. Verify that the OLS coecients
are all zero.
4. (Greene 2003, p.39, Exercise 2) Suppose that $\mathbf{b}$ is the least squares coefficient vector in the
regression of $\mathbf{y}$ on $X$ and that $\mathbf{c}$ is any other $K \times 1$ vector. Prove that the difference in the
two sums of squared residuals is

$$\left(\mathbf{y} - X\mathbf{c}\right)'\left(\mathbf{y} - X\mathbf{c}\right) - \left(\mathbf{y} - X\mathbf{b}\right)'\left(\mathbf{y} - X\mathbf{b}\right) = \left(\mathbf{c} - \mathbf{b}\right)'X'X\left(\mathbf{c} - \mathbf{b}\right)$$

Prove that this difference is positive.
Table 7.1: A hypothetical data set

  x     y
  7    12.4
  8    13.2
 19    20.2
 12    14.0
 16    19.2
  2     7.0
 16    20.0
  5     8.4
  9    13.6
  3     6.8
  5    11.8
  1     5.6
  3     7.4
  6     9.4
 13    17.4
7.7
The principle of ordinary least squares is fairly easy to explain. At one level one can view OLS
as simply a method for trying to fit a straight line to a set of points. To make these points more
concrete, consider the hypothetical data set contained in Table 7.1.
Figure 7.3 presents a scatterplot of the $y$ variable against the $x$ variable. As the plotted points
show, there seems to be a linear relationship between these variables. On the diagram we have
arbitrarily drawn a line through these points, with the equation of the line given by $\hat{y} = 8 + 0.6x$.

Given any such line, we can use it to predict the value of $y$ that we would expect, given
a particular value of $x$. Such predictions are called fitted values and they are indicated by the
hat over the variable, i.e. $\hat{y}_i$ is the predicted value for $y_i$ corresponding to the equation given
above. In the case indicated on the diagram $i = 8$, and so $x_8 = 5$ and $y_8 = 8.4$. According to the
equation we therefore have $\hat{y}_8 = 11$.

The difference between the actual value and the fitted value is known as the residual and is
indicated in the diagram above by $e_i$, i.e. $e_i = y_i - \hat{y}_i$.
In the specific case above, we have $e_8 = 8.4 - 11 = -2.6$.
Of course, with a different line we would get very different fitted values and residuals. In
Figure 7.4 we have used the equation $\hat{y} = 4 + x$ and, as is immediately obvious, both the fitted
value and the residual (or error) have changed. We now have $\hat{y}_8 = 9$ and $e_8 = -0.6$. The absolute
value of the error in this case is much smaller. From this perspective we might be tempted to
conclude that the second line is better than the first one. Note, however, that for observation
$i = 11$ we have $x_{11} = 5$ and $y_{11} = 11.8$. The errors for the two cases are therefore given by
$e_{11} = 11.8 - 11 = 0.8$ for the first line and $e_{11} = 11.8 - 9 = 2.8$ for the second line.

In general we will not be concerned with trying to fit the line to a particular point on the
scatter diagram, but we would like to find the line that in some sense minimises the aggregate
error over all the observations.
The OLS criterion is a rule which specifies how we should measure the aggregate error. It
stipulates that the best fitting line is the one that minimises the Residual Sum of Squares or

$$\mathrm{RSS} = \sum_{i=1}^{n} e_i^2$$
Figure 7.3: Given a particular line (here $\hat{y}_i = 8 + 0.6x_i$) we can use it to define a predicted (or fitted) value for $y$. The
difference between the actual and the fitted value is known as the residual.
Figure 7.4: With a different line (here $\hat{y}_i = 4 + x_i$), we get different fitted values and different residuals.
Table 7.2: Calculating the RSS for the two candidate lines

                 Line 1 (ŷ = 8 + 0.6x)      Line 2 (ŷ = 4 + x)
  x     y       ŷ       e      e²           ŷ      e      e²
  7    12.4    12.2    0.2    0.04         11     1.4    1.96
  8    13.2    12.8    0.4    0.16         12     1.2    1.44
 19    20.2    19.4    0.8    0.64         23    -2.8    7.84
 12    14.0    15.2   -1.2    1.44         16    -2.0    4.00
 16    19.2    17.6    1.6    2.56         20    -0.8    0.64
  2     7.0     9.2   -2.2    4.84          6     1.0    1.00
 16    20.0    17.6    2.4    5.76         20     0.0    0.00
  5     8.4    11.0   -2.6    6.76          9    -0.6    0.36
  9    13.6    13.4    0.2    0.04         13     0.6    0.36
  3     6.8     9.8   -3.0    9.00          7    -0.2    0.04
  5    11.8    11.0    0.8    0.64          9     2.8    7.84
  1     5.6     8.6   -3.0    9.00          5     0.6    0.36
  3     7.4     9.8   -2.4    5.76          7     0.4    0.16
  6     9.4    11.6   -2.2    4.84         10    -0.6    0.36
 13    17.4    15.8    1.6    2.56         17     0.4    0.16
                       RSS = 54.04                RSS = 26.52
Note that this criterion does not in itself tell us how to find that line. It only gives us the
criterion according to which we can decide which one of many different possible lines should
count as the best fitting one.
As an example of the application of the OLS criterion, let us investigate which of the two lines
that we considered above has the smallest RSS. In table 7.2 we have summarised the original
pairs of observations, together with the fitted values corresponding to the two lines, the residuals
and the square of the residuals. When the squares of the residuals are added up, we get RSS
values of 54.04 and 26.52 respectively. According to the OLS criterion, therefore, the second line
gives a better fit than the first line.
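The comparison of the two RSS values can be checked directly; a minimal sketch in Python, using the 15 data pairs from table 7.2:

```python
import numpy as np

# The 15 (x, y) pairs summarised in table 7.2
x = np.array([7, 8, 19, 12, 16, 2, 16, 5, 9, 3, 5, 1, 3, 6, 13], dtype=float)
y = np.array([12.4, 13.2, 20.2, 14, 19.2, 7, 20, 8.4, 13.6,
              6.8, 11.8, 5.6, 7.4, 9.4, 17.4])

def rss(a, b):
    """Residual sum of squares for the line yhat = a + b*x."""
    e = y - (a + b * x)
    return float(np.sum(e ** 2))

rss1 = rss(8, 0.6)   # line 1: y = 8 + 0.6x
rss2 = rss(4, 1.0)   # line 2: y = 4 + x
print(rss1, rss2)    # 54.04 and 26.52: line 2 fits better
```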
The problem with proceeding in this way is that there are infinitely many lines that we might
try to fit to the data. Fortunately for the linear case (and, indeed, for the polynomial case more
generally) it is possible to derive a formula for the equation of the line that is guaranteed to
produce a lower RSS than any other line.
The OLS formulae for the optimal values of a and b in the equation y = a + bx are given by

b̂ = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)²    (7.15)

â = ȳ − b̂x̄    (7.16)

where x̄ is the sample mean of the x values and ȳ is the sample mean of the y values. Applied
to the data above, we can calculate the best fitting line according to the OLS formula as in
table 7.3 below. The middle panel provides the necessary calculations which lead to the OLS
estimates of b̂ = 0.863 and â = 5.231, as given in the bottommost panel. The rightmost panel
then calculates the fitted values and residuals based on the line ŷ = 5.231 + 0.863x. Note that
Table 7.3: Calculating the OLS estimates

  x      y    (x−x̄)    (y−ȳ)   (x−x̄)(y−ȳ)  (x−x̄)²       ŷ        e       e²
  7    12.4   -1.333   -0.027      0.036     1.778    11.275   1.125   1.265
  8    13.2   -0.333    0.773     -0.258     0.111    12.139   1.061   1.126
 19    20.2   10.667    7.773     82.916   113.778    21.637  -1.437   2.066
 12    14.0    3.667    1.573      5.769    13.444    15.593  -1.593   2.537
 16    19.2    7.667    6.773     51.929    58.778    19.047   0.153   0.023
  2     7.0   -6.333   -5.427     34.369    40.111     6.958   0.042   0.002
 16    20.0    7.667    7.573     58.062    58.778    19.047   0.953   0.909
  5     8.4   -3.333   -4.027     13.422    11.111     9.548  -1.148   1.319
  9    13.6    0.667    1.173      0.782     0.444    13.002   0.598   0.357
  3     6.8   -5.333   -5.627     30.009    28.444     7.821  -1.021   1.043
  5    11.8   -3.333   -0.627      2.089    11.111     9.548   2.252   5.070
  1     5.6   -7.333   -6.827     50.062    53.778     6.094  -0.494   0.244
  3     7.4   -5.333   -5.027     26.809    28.444     7.821  -0.421   0.178
  6     9.4   -2.333   -3.027      7.062     5.444    10.412  -1.012   1.024
 13    17.4    4.667    4.973     23.209    21.778    16.456   0.944   0.891
x̄ = 8.333  ȳ = 12.427        Σ = 386.267  Σ = 447.333               RSS = 18.053

b̂ = 386.267/447.333 = 0.863489
â = 12.427 − (0.863489)(8.333) ≈ 5.231
the RSS associated with this line is, indeed, lower than the RSS of the other two lines considered
above.
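Formulae 7.15 and 7.16 are easy to verify numerically; a sketch using the same data:

```python
import numpy as np

x = np.array([7, 8, 19, 12, 16, 2, 16, 5, 9, 3, 5, 1, 3, 6, 13], dtype=float)
y = np.array([12.4, 13.2, 20.2, 14, 19.2, 7, 20, 8.4, 13.6,
              6.8, 11.8, 5.6, 7.4, 9.4, 17.4])

# OLS formulae (7.15) and (7.16)
b_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a_hat = y.mean() - b_hat * x.mean()

e = y - (a_hat + b_hat * x)
rss = float(np.sum(e ** 2))
print(round(b_hat, 3), round(a_hat, 3), round(rss, 3))  # 0.863, 5.231, 18.053
```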
The case above is the simplest one, where there is one independent variable x and one
dependent variable y and we fit a straight line. It is possible, however, to provide formulae for
the solutions to the least squares problem for whole classes of more complex functions. What
is required, however, is that the parameters of the function (in the case above a and b) enter it
linearly. Examples of functions that are linear in the parameters are:

polynomials: y = β₀ + β₁x + β₂x² + ⋯ + β_k x^k

hyperplanes: y = β₁ + β₂x₂ + β₃x₃ + ⋯ + β_k x_k (where x₂, …, x_k are different variables)

Even functions that on the surface appear to be nonlinear can often be written in a form
where they become linear in the parameters:

hyperbola: y = β₁ + β₂/x, i.e. y = β₁ + β₂z where z = 1/x

exponential: y = αx^β, i.e. log y = log α + β log x. This is linear if we let w = log y, γ₁ = log α
and z = log x.

The equation of the hyperplane is the most general formulation of the linear model. It can,
for example, encompass the polynomial model provided that we let x₂ = x, x₃ = x², …, x_k = x^{k−1}.
Chapter 8
Introduction
In the last chapter we saw one motivation for OLS: the procedure amounts to minimising the
length of the residual vector, i.e. it makes the fitted values ŷ as close to the actual values
y as possible. This is, however, a purely geometric consideration. In this chapter we will be
considering the statistical motivations behind OLS. In order to do this we need to specify the
type of DSPs that we assume have generated the data at our disposal. For the moment we will
make the following assumptions (compare with table 6.1):
1. Y: We assume that Y is a univariate variable with continuous, unlimited range.
2. f: The function is linear in X and β, additive in ε.
3. X: The X variables are nonstochastic. In section 8.5 we allow the regressors to be stochastic
as long as they are exogenous.
4. β: The parameters are fixed.
5. ε: The disturbances are independent and identically distributed, with E(ε|X) = 0,
Var(ε|X) = σ²I.
6. f(ε|X): The distribution of the error terms is left unspecified. In section 8.6 below we
will restrict the analysis to normally distributed errors.
7. Ω: The parameter space is unrestricted.
The model can therefore be written as

Y = Xβ + ε,   E(ε|X) = 0,   Var(ε|X) = σ²I    (8.1)

Note that we will need to assume that the model is identified, i.e. X has full column rank.
8.2
In this section we will consider some of the statistical reasons that make OLS attractive. These
reasons fall into two types:
Reasons why minimising the residual sum of squares may seem like the logical thing to
do. Many of these turn out to be variants of letting the sample mimic the population
relationships.
Reasons why minimising the residual sum of squares is the optimal thing to do, particularly
when compared to other types of estimators. Prominent among these reasons is that OLS
will turn out to be efficient in certain categories of estimators. Establishing those optimality
properties will take up the bulk of the chapter.
8.2.1
Method of moments
The model specifies that the X variables must be uncorrelated with the vector of disturbances
ε. We can write the condition E(ε|X) = 0 equivalently as

E(X′ε) = 0

Now we saw that one of the fundamental characteristics of OLS estimation is that

X′e = 0

We can write this equivalently in the form

(1/n) X′e = 0

We can think of this as a sample analogue of the population moment equation. It says that
the average sample correlation must be zero. However, as we saw, the sample equation X′e = 0
leads directly to the normal equations, once we substitute in e = y − Xβ̂. As is shown in the
section of the course dealing with GMM estimation, such equations generally lead to consistent
estimators.
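The orthogonality condition X′e = 0 holds exactly in any OLS fit; a sketch with invented data:

```python
import numpy as np

# Sketch: the OLS residuals are exactly orthogonal to the regressors,
# the sample analogue of the population moment condition E(X'eps) = 0.
rng = np.random.default_rng(42)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat

moment = X.T @ e / n           # (1/n) X'e
print(np.max(np.abs(moment)))  # numerically zero
```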
8.2.2
We will show in section 8.4 that the least squares estimator is the minimum variance linear
unbiased estimator.
8.2.3
With the addition of the normality assumption, we will show in Section 8.6 that the least squares
estimator is also the maximum likelihood estimator. Furthermore it will be the minimum variance
unbiased estimator.
8.3

8.3.1

β̂ = (X′X)⁻¹X′y
  = (X′X)⁻¹X′(Xβ + ε)
  = β + (X′X)⁻¹X′ε    (8.2)

Consequently

E[β̂|X] = β + (X′X)⁻¹X′ E[ε|X] = β

And

E[β̂] = E_X{E[β̂|X]} = β

8.3.2
The covariance matrix of β̂

Since

β̂ − β = (X′X)⁻¹X′ε

it follows that the covariance matrix of β̂ (conditional on X) is given by

Var(β̂|X) = E[(β̂ − β)(β̂ − β)′|X]
         = E[{(X′X)⁻¹X′ε}{(X′X)⁻¹X′ε}′|X]
         = (X′X)⁻¹X′ E(εε′|X) X(X′X)⁻¹
         = (X′X)⁻¹X′ (σ²I) X(X′X)⁻¹
         = σ²(X′X)⁻¹

Note that we have used the assumption of homoscedasticity and zero autocorrelation to derive
this result.
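Both the lack of bias and the formula σ²(X′X)⁻¹ can be checked in a small Monte Carlo experiment; the design matrix, β and σ below are illustrative choices, not from the text:

```python
import numpy as np

# Monte Carlo sketch: with a fixed X, the OLS estimator is unbiased and
# its covariance matrix is sigma^2 (X'X)^{-1}.
rng = np.random.default_rng(0)
n, reps = 50, 20000
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
beta = np.array([1.0, 0.5])
sigma = 2.0

XtX_inv = np.linalg.inv(X.T @ X)
draws = np.empty((reps, 2))
for r in range(reps):
    y = X @ beta + sigma * rng.normal(size=n)
    draws[r] = XtX_inv @ X.T @ y     # OLS estimate for this sample

print(draws.mean(axis=0))            # close to (1.0, 0.5)
emp_cov = np.cov(draws.T)
theo_cov = sigma ** 2 * XtX_inv
print(emp_cov / theo_cov)            # elementwise ratios close to 1
```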
8.3.3 Estimating σ²

For the purposes of estimating Var(β̂), we require an estimator of σ². Since σ² is the common
variance of the ε_i, it would seem sensible to base the estimator on its sample analogue: the
variance of the e_i. We could simply take the sample variance of the e_i, i.e. σ̂² = (1/n)Σ_i e_i², but this would
give us a biased estimator, as we can show:

E(Σ_i e_i²) = E(e′e) = E(ε′M_X ε)

where I have substituted in a result from the last chapter and made use of the fact that M_X is
symmetric and idempotent.
This does not seem to have got us much further. At this point we use a trick involving the
trace of a symmetric matrix (a discussion of the trace is in the appendix). In particular we use
the fact that the trace of a scalar is just that scalar, as well as the properties tr(AB) = tr(BA)
(where both products are defined) and E(tr(·)) = tr(E(·)).
Applying these we have

E(ε′M_X ε) = E(tr(ε′M_X ε))
           = E(tr(M_X εε′))
           = tr(M_X E(εε′))
           = tr(M_X σ²I)
           = σ² tr(M_X)
           = σ² tr[I_n − X(X′X)⁻¹X′]
           = σ² {tr(I_n) − tr[X(X′X)⁻¹X′]}
           = σ² {n − tr[(X′X)⁻¹X′X]}
           = σ² {n − tr(I_k)}
           = σ² (n − k)

In short we have shown that

E(e′e) = (n − k)σ²

An unbiased estimator of σ² is therefore given by
σ̂² = e′e / (n − k)    (8.3)

8.4 Gauss–Markov Theorem
Consider any other estimator β̃ = Cy that is linear in y. We can write

β̃ − β̂ = Cy − (X′X)⁻¹X′y
      = [C − (X′X)⁻¹X′] y
      = Dy,   where D = C − (X′X)⁻¹X′

so that

β̃ = β̂ + Dy = (X′X)⁻¹X′y + Dy    (8.4)
We assume that the true model is given by a special case of the DSP given in equation 8.1, with
β = β₀ and σ² = σ₀². Substituting in y = Xβ₀ + ε we get

β̃ = β₀ + DXβ₀ + (X′X)⁻¹X′ε + Dε

For β̃ to be unbiased whatever the value of β₀ we require DX = 0. Furthermore

Cov(β̂, Dy) = E[(X′X)⁻¹X′εε′D′]
           = σ₀²(X′X)⁻¹X′D′
           = 0

since DX = 0 implies X′D′ = 0. Note this result again uses the assumption of homoscedasticity
and zero autocorrelation.
Equation 8.4 therefore says that the unbiased linear estimator β̃ is the sum of the OLS
estimator plus a random component Dy with which it is uncorrelated (Davidson and MacKinnon
1993, p.159). It turns out that this is true more generally:

Asymptotically, an inefficient estimator is always equal to an efficient estimator
plus an independent random noise term. (Davidson and MacKinnon 1993, p.159)

It follows that

Var(β̃) = E[(β̃ − β₀)(β̃ − β₀)′] = σ₀²(X′X)⁻¹ + σ₀²DD′

Since DD′ is positive semidefinite, Var(β̃) exceeds Var(β̂) = σ₀²(X′X)⁻¹ by a positive semidefinite
matrix, so Var(a′β̃) ≥ Var(a′β̂) for any linear combination a′β̃ of the estimates.
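The theorem can be illustrated by comparing OLS with another linear unbiased estimator of the slope. Any such estimator has the form c′y with weights satisfying c′1 = 0 and c′x = 1, and its variance is σ²c′c. The two-point estimator below is an invented example for illustration, not from the text:

```python
import numpy as np

# Compare the variance factors c'c of two linear unbiased slope estimators.
x = np.arange(10, dtype=float)

# OLS weights for the slope in a simple regression
w_ols = (x - x.mean()) / np.sum((x - x.mean()) ** 2)

# An alternative linear unbiased estimator: the slope through the two endpoints
w_alt = np.zeros_like(x)
w_alt[0] = -1.0 / (x[-1] - x[0])
w_alt[-1] = 1.0 / (x[-1] - x[0])

# Both satisfy the unbiasedness conditions c'1 = 0 and c'x = 1
for w in (w_ols, w_alt):
    assert abs(w.sum()) < 1e-12 and abs(w @ x - 1) < 1e-12

print(w_ols @ w_ols, w_alt @ w_alt)  # the OLS factor is smaller
```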
8.5
In the discussion thus far we have made the assumption that the X matrix was nonstochastic,
i.e. we would be able to fix it in repeated sampling. In practice this is almost never the case. We
therefore want to consider how the results above change if we allow X to be stochastic, but
independent of ε. It turns out that most of the results go through, provided that we condition on
X first. In fact in many of the derivations given above we have already done this conditioning,
which would be unnecessary if X were nonstochastic.
8.5.1 Lack of bias

Conditional on X, the derivation of section 8.3.1 goes through unchanged, so E(β̂|X) = β. Taking
expectations over X then gives E(β̂) = E_X[E(β̂|X)] = β.
8.5.2 The covariance matrix of β̂

We showed above that

Var(β̂|X) = σ²(X′X)⁻¹

If X was nonstochastic, this would be the unconditional covariance matrix too. If X is stochastic
we can make use of the variance decomposition formula (Theorem 2.10):

Var(β̂) = Var_X[E(β̂|X)] + E_X[Var(β̂|X)]

The first term on the right hand side is zero, since E(β̂|X) = β for all values of X. Consequently

Var(β̂) = E_X[σ²(X′X)⁻¹] = σ² E_X[(X′X)⁻¹]

8.5.3 The estimator of σ²

Our previous derivation will go through, provided that we condition on X initially. Since

E(e′e|X) = (n − k)σ²

it turns out that the unconditional expectation must be (n − k)σ² also.
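Both the trace identity tr(M_X) = n − k and the unbiasedness of e′e/(n − k) are easy to confirm numerically; n, k, σ² and the design matrix below are illustrative choices:

```python
import numpy as np

# Sketch: tr(M_X) = n - k exactly, and e'e/(n-k) averages to sigma^2.
rng = np.random.default_rng(1)
n, k, sigma2 = 30, 3, 4.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

P = X @ np.linalg.inv(X.T @ X) @ X.T     # projection (hat) matrix
M = np.eye(n) - P                        # residual-maker matrix M_X
print(np.trace(M))                       # n - k = 27

s2 = np.empty(20000)
for r in range(20000):
    eps = np.sqrt(sigma2) * rng.normal(size=n)
    e = M @ eps                          # residuals (beta drops out, since MX = 0)
    s2[r] = e @ e / (n - k)
print(s2.mean())                         # close to sigma^2 = 4
```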
8.5.4
GaussMarkov Theorem
As Greene (2003) notes, the Gauss–Markov theorem also goes through in this case. For any given
X we will have that the OLS estimator will be the BLUE of β. It follows that the OLS estimator
must be more efficient than any other linear unbiased estimator for all X.
8.6

We will now explicitly consider the case where ε is distributed normally, i.e. we now make the
assumption that

ε|X ~ N(0, σ²I)    (8.5)

It follows from this that

y|X ~ N(Xβ, σ²I)    (8.6)

Here we have used the fundamental result from section 8.8.2 that a linear function of a vector of
normal random variables is itself normal. The fact that the random variables are uncorrelated
with each other implies that they are statistically independent of each other and each y_i is
distributed normally, i.e.

y_i|x_i ~ N(x_i β, σ²)    (8.7)

where x_i is the i-th row of the matrix X, i.e. it is the row vector of explanatory variables
corresponding to observation i.
8.6.1 The distribution of β̂

Since β̂ = β + (X′X)⁻¹X′ε is a linear function of the normal vector ε, the
distribution of β̂ conditional on X is

β̂|X ~ N(β, σ²(X′X)⁻¹)

The distribution of σ̂²

Since (1/σ)ε ~ N(0, I_n) and M_X is symmetric and idempotent with rank n − k, the quadratic
form (1/σ²)e′e = (1/σ²)ε′M_X ε has a χ²(n − k) distribution, i.e.

(n − k)σ̂²/σ² ~ χ²(n − k)

Since the variance of a χ²(n − k) distribution is 2(n − k), it follows from this that
Var[(n − k)σ̂²/σ²] = 2(n − k), i.e. Var(σ̂²) = 2σ⁴/(n − k).

The independence of β̂ and σ̂²

We have

β̂ − β = (X′X)⁻¹X′ε   and   e = M_X ε

We will show that (1/σ)(β̂ − β) is independent of (1/σ²)e′e, from which the result will follow.¹ Since

[(X′X)⁻¹X′] M_X = (X′X)⁻¹X′[I − X(X′X)⁻¹X′] = (X′X)⁻¹X′ − (X′X)⁻¹X′ = 0

the result of section 8.8.2 then guarantees that (1/σ)(β̂ − β) is independent of (1/σ²)e′e.

¹ One slight complication is that the distribution of e is singular multivariate normal, i.e. there are only
n − k proper random variables in the e vector. The remaining k are perfect linear combinations of the others.
8.6.2

With the addition of the normality assumption, we have fully identified a family of DSPs, and
consequently we can estimate the parameters by means of maximum likelihood. The pdf of the
random variable defined in equation 8.7 is given by

f(y_i|x_i) = (2πσ²)^(−1/2) exp{ −(y_i − x_i β)² / (2σ²) }

Since the y_i are independent of each other, the joint pdf is²

f(y|X) = (2πσ²)^(−n/2) exp{ −Σ_i (y_i − x_i β)² / (2σ²) }
       = (2πσ²)^(−n/2) exp{ −(y − Xβ)′(y − Xβ) / (2σ²) }    (8.8)

so that the log-likelihood is

ℓ(β, σ²|y, X) = −(n/2) ln(2π) − (n/2) ln σ² − (y − Xβ)′(y − Xβ) / (2σ²)    (8.9)

The maximum likelihood estimators β̂ and σ̂² are those values that maximise this log-likelihood.
You may recognise the term (y − Xβ)′(y − Xβ). This is the residual sum of squares.
Since β does not feature anywhere else in ℓ, picking β̂ to maximise ℓ is equivalent to picking
β̂ to minimise (y − Xβ)′(y − Xβ), i.e. minimising the residual sum of squares! It turns out
that β̂ must be equal to the least squares estimator! We can show this more formally. The first
order conditions are

∂ℓ/∂β = (X′y − X′Xβ)/σ² = 0
∂ℓ/∂σ² = −n/(2σ²) + (y − Xβ)′(y − Xβ)/(2σ⁴) = 0

From the first set of equations we retrieve the normal equations X′y = X′Xβ̂, which confirms
that the MLE is equal to the OLS estimator

β̂ = (X′X)⁻¹X′y    (8.10)

Relabelling the term (y − Xβ̂)′(y − Xβ̂) = e′e in the second equation gives

σ̂² = e′e / n    (8.11)

² We could have derived this more simply by using the fact that the joint pdf of the multivariate normal is
given by

(2π)^(−n/2) |Σ|^(−1/2) exp{ −(y − μ)′Σ⁻¹(y − μ)/2 }

where Σ is the covariance matrix of y and |Σ| its determinant. Substituting in Σ = σ²I gives the same result.
It is perhaps more instructive to derive the joint pdf from the individual pdfs, since this corresponds to how we
have derived ML estimators previously.
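Since β enters the log-likelihood 8.9 only through the residual sum of squares, the OLS coefficients must maximise it; a sketch, with invented data, evaluating ℓ at the OLS solution and at perturbed coefficients:

```python
import numpy as np

# Sketch: the log-likelihood (8.9), evaluated at the OLS coefficients,
# beats any perturbed coefficient vector (sigma^2 held at its MLE e'e/n).
rng = np.random.default_rng(7)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, -1.0]) + rng.normal(size=n)

def loglik(beta, s2):
    r = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(s2) - r @ r / (2 * s2)

b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b_ols
s2_ml = e @ e / n                 # the MLE of sigma^2, equation (8.11)

ll_ols = loglik(b_ols, s2_ml)
for delta in ([0.1, 0.0], [0.0, 0.1], [-0.05, 0.05]):
    assert loglik(b_ols + np.array(delta), s2_ml) < ll_ols
print("OLS maximises the likelihood over beta")
```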
8.6.3

To obtain the Cramér–Rao bound we need the second derivatives of the log-likelihood:

∂²ℓ/∂β∂β′ = −X′X/σ²
∂²ℓ/∂β∂σ² = −(X′y − X′Xβ)/σ⁴
∂²ℓ/∂(σ²)² = n/(2σ⁴) − (y − Xβ)′(y − Xβ)/σ⁶

Taking expectations of this, we note that E[(y − Xβ)′(y − Xβ)] = E[Σ_i ε_i²] = nσ² and E[X′y] =
E[X′(Xβ + ε)] = X′Xβ. Consequently the information matrix will be given by

I(β, σ²) = [ X′X/σ²    0
             0′         n/(2σ⁴) ]

(Note that one zero vector is a k × 1 column vector while the other is a 1 × k row vector.) The
Cramér–Rao Lower Bound (CRLB) is therefore given by

I(β, σ²)⁻¹ = [ σ²(X′X)⁻¹    0
               0′            2σ⁴/n ]

It follows that β̂ (and hence the least squares estimator) is the minimum variance unbiased
estimator since we have already shown that Var(β̂) = σ²(X′X)⁻¹. The unbiased estimator
e′e/(n − k) has a variance, 2σ⁴/(n − k), which exceeds the CRLB. It turns out, however, that there is no unbiased
estimator which has a lower variance, i.e. e′e/(n − k) is also the minimum variance unbiased estimator
(Mittelhammer et al. 2000, p.44). In short, with the assumption of normality the least squares
estimators are efficient.
8.7
Data Issues
The DSP describes how the data arrive on the analyst's desktop. It is clear that the matrix of
explanatory variables X plays a substantial role in this process and is therefore an important
factor in how effectively we can solve the inverse problem. We have seen one example of this
already: The identification condition

ρ(X) = k

determines whether or not we are able to get unique estimates. There are two ways in which
this particular condition might fail:
The problem may be rooted in the DSP itself, e.g. it may be the case that whatever sample
we may get, it will always be the case that x2 = 2x1 . In this case the problem is intrinsic
to the model and we will never be able to estimate all the structural parameters of the
DSP.
The problem may be just a sample problem, i.e. we may just have the rotten luck that in
our particular sample we have x2 = 2x1. Alternatively we may simply not have enough
observations to estimate the structural parameters. Theoretically there is no problem
with estimating the model, but practically there will be, since we are hardly ever in the
situation of having the luxury of regenerating the sample or extending the data run.
Since all our estimation procedures are only as good as the data on which they are based (i.e.
all our estimates are conditional on X), it is worthwhile to spend some time to look at a few
common data problems:
8.7.1
Multicollinearity
It is clear that if ρ(X) < k we cannot estimate the model, i.e. (X′X) does not have an inverse.
A more common problem, however, is that our explanatory variables are highly correlated, but
not perfectly so. In this case X′X is almost singular. It is clear why this might cause a problem:
in a sense we are multiplying through by the inverse of something that is almost zero. This will
lead to very unstable and imprecise estimates. In fact we can show (Greene 2003, p.57) that

Var(b̂_k) = σ² / [ (1 − R_k²) Σ_i (x_ik − x̄_k)² ]    (8.12)

where R_k² is the R² obtained in the regression of x_k on all other variables (including the constant).
This formula shows that a lack of precision in estimating b̂_k can be due to three sources:
the intrinsic noise in the data sampling process. The more noise, i.e. the higher σ², the
less precisely we will be able to estimate the parameter.

the variation in x_k. The more variation we have at our disposal, the more accurately we
are able to measure how changes in x_k affect y.

the correlation between x_k and the other explanatory variables. The higher the correlation,
the less accurately we are able to isolate the independent effect of x_k. Intuitively, the
regression estimates try to purge the effects of all the other variables first (this is the FWL
theorem). If x_k is highly correlated with the other variables, there is very little variation
left on which to assess the separate impact of x_k on y.
There are various diagnostics that are available for assessing whether multicollinearity is
likely to pose a problem. One of these is the variance inflation factor

VIF_k = 1 / (1 − R_k²)
[Figure 8.1 omitted: scatter of the exchange rate against the Big Mac measure of purchasing power parity, with two regression lines.]
Figure 8.1: The point at x = 13478 has high leverage. It pulls the regression line towards
itself. The top line is the regression line with the observation included, while the lower one is
the regression line without it.
This shows how much the estimated variance has been affected by the correlation between x_k
and the other variables.
Various fixes for multicollinearity have been suggested in the literature (for discussions see
Greene 2003, Gujarati 2003). At the end of the day many of these may create new problems
or try to force relationships on to the data that the data simply do not want to accept. As
Gujarati (2003, Chapter 10) points out, multicollinearity can perhaps most usefully be thought
of as being akin to micronumerosity, i.e. the problem of having too few observations. In a sense
one is asking questions of the data that the data are not equipped to answer.
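The variance inflation diagnostic is simple to compute: regress one explanatory variable on the others and convert the resulting R² into 1/(1 − R²). A sketch with a deliberately collinear, invented pair of regressors:

```python
import numpy as np

# Sketch: compute 1/(1 - R_k^2) for a near-collinear pair of regressors.
rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.2 * rng.normal(size=n)   # x2 is almost a copy of x1

# Auxiliary regression of x2 on a constant and x1
Z = np.column_stack([np.ones(n), x1])
g, *_ = np.linalg.lstsq(Z, x2, rcond=None)
resid = x2 - Z @ g
r2 = 1 - resid @ resid / np.sum((x2 - x2.mean()) ** 2)

vif = 1 / (1 - r2)
print(r2, vif)   # R^2 near 0.96, so the variance is inflated roughly 25-fold
```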
8.7.2
Another problem is exemplified by Figure 8.1. In this particular example we have regressed the
exchange rate on a measure of purchasing power parity (the Big Mac Index). The data are for
1994. It is clear that the observation at x = 13478 (Poland) has a disproportionate influence
on the regression line. The lower line in the diagram is the regression line that would have been
obtained if that observation had been deleted from the data set.
We can think about this somewhat more rigorously by noting that each element of the β̂
vector is a linear combination of the elements of the vector y. An influential observation
is one whose y_t has a disproportionate influence on one or more elements of β̂. The fundamental
result (proved in Davidson and MacKinnon 1993, pp.32–39) is that

β̂ − β̂⁽ᵗ⁾ = (1/(1 − h_t)) (X′X)⁻¹X_t′ e_t    (8.13)

where β̂⁽ᵗ⁾ is the vector of parameter estimates without observation t, X_t is the t-th row of the
data matrix and e_t is the t-th residual when the model is run on all observations. A particularly
important quantity in this formula is h_t. It is defined as

h_t = X_t (X′X)⁻¹ X_t′
i.e. it is the t-th diagonal element of the projection matrix P_X. It is intuitively clear why the
projection (or hat) matrix P_X may be important, since it shows the impact of y on ŷ. The
diagonal element h_t measures the impact that y_t has on its own fitted value ŷ_t.
Looking at equation 8.13, it is clear that the impact of observation t on the estimates β̂ will
be great if:

e_t is large or

h_t is large

Interestingly enough h_t only depends on the X matrix, i.e. it is a feature of the structure of
the explanatory variables. We can show that

0 ≤ h_t ≤ 1

If h_t is close to one, then the observation is said to have high leverage. Indeed any point which
has a value of h_t greater than the average k/n has potentially more influence than the others. In the example
given in Figure 8.1 the leverage associated with the point at x = 13478 is 0.967, i.e. it is very high. Points
with high leverage potentially have a great influence on the regression estimates. Whether this
potential is translated into actual exercise of influence depends on e_t. If the y value for Poland
had been 13316, it would have been right on the lower line and so the regression line would
not have changed with the deletion of the observation. In short if e_t = 0 in equation 8.13 the
point may have high leverage, but will not actually affect the estimates.
It is therefore desirable to investigate not only the leverage of the observations, but also the
associated residuals. We note that e = M_X ε, i.e.

E(ee′) = σ²M_X = σ²(I − P_X)

The diagonal elements of this will give the respective variances of the associated residuals, i.e.
Var(e_t) = σ²(1 − h_t). Note that although we have assumed that the error process is homoscedastic, this is not true of the residuals: the residuals associated with points of high leverage will be
smaller on average than the residuals elsewhere. We can standardise the residuals by dividing
through by an estimate of the standard error:

e_t / (σ̂ √(1 − h_t))

Standardised residuals that are large (say bigger than two) are an indication of points that will
have significant impacts on the regression estimates.
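Leverage values can be read straight off the diagonal of the hat matrix; the sketch below, with an invented x series containing one isolated point, shows that 0 ≤ h_t ≤ 1, that the h_t sum to k, and that the isolated observation has by far the largest leverage:

```python
import numpy as np

# Sketch: leverage = diagonal of the hat matrix X(X'X)^{-1}X'.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30], dtype=float)  # last point isolated
X = np.column_stack([np.ones_like(x), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

print(h.round(3))
print(h.sum())       # sums to k = 2
print(h.argmax())    # the isolated point (index 10) has the highest leverage
```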
We can see the combined effect of e_t and h_t by observing that the predicted value for y_t based
on the estimates β̂⁽ᵗ⁾ will be given by

ŷ_t⁽ᵗ⁾ = X_t β̂⁽ᵗ⁾
      = X_t β̂ − (1/(1 − h_t)) X_t (X′X)⁻¹X_t′ e_t
      = ŷ_t − (h_t/(1 − h_t)) e_t

So the quantity (h_t/(1 − h_t)) e_t measures the impact on its own fitted value of the deletion of observation
t.
The key question for empirical work is what to do about influential data points. There are
at least three ways to think about it:
It is possible that the influential point simply represents bad data: the recorded data
may be in error or we may have mixed in observations that really belong to a different
regime or DSP. In the example cited above, it is possible that the exchange rates of
countries undergoing drastic political change (such as Poland) may work to a logic that
is not described adequately by purchasing power parity. The appropriate response in this
case would be to delete the observation.
It is possible that the observation represents disproportionately informative data. What
gives the observation on Poland such a high leverage is that it is far from the other x values.
In short it gives us information about what happens outside the normal range of x. It
therefore helps us to fix the regression line much more accurately than would otherwise be
the case. In this case the last thing that we should do is to delete the observation.
It is possible that the observation represents neither completely bad data nor hundred
percent good data. Instead we may have misspecified the model. In the PPP example,
it is plausible that the error process is heteroscedastic: at high levels of x the exchange
rate perhaps becomes more volatile, i.e. PPP still has some influence, but other random
factors become more important. In this case the appropriate response would be to either
respecify the model, reweight the data or perhaps transform the data.
It is important to understand this last point properly: the influence of an observation is
always with reference to the particular model that we are trying to estimate.
8.7.3
Missing information
The problems of multicollinearity and influential data points are both a result of the fact that
we hardly ever have the luxury of controlling the variables in our studies, i.e. we hardly ever
have experimental data. We are therefore constrained by what the DSP happens to throw up for
us  and this may involve both highly correlated variables as well as too few observations with
high values. For instance, we generally have too few really rich individuals in our household
data sets. Correspondingly the few high income observations will have a high leverage on any
estimates where income features as an explanatory variable.
What can exacerbate this problem is nonresponse by sampled individuals. If the pattern of
nonresponse is random (e.g. if high income individuals have the same propensity to refuse as low
income individuals) this will not materially affect any estimates. If, however, it turns out that
there are systematic patterns then our analyses may be subject to sample selection bias. A
discussion of this problem is given in Wooldridge (2002, Chapter 17).
8.8 Appendix

8.8.1 The trace of a matrix
If

A = [ a_11  a_12  ⋯  a_1n
      a_21  a_22  ⋯  a_2n
      ⋮     ⋮         ⋮
      a_n1  a_n2  ⋯  a_nn ]

then

tr(A) = a_11 + a_22 + ⋯ + a_nn
Remark 8.3 It follows immediately from the definition that

tr(A + B) = tr(A) + tr(B)

whenever the matrix addition makes sense.

Remark 8.4 It also follows that

tr(cA) = c tr(A)

where c is a scalar.

Proposition 8.5 If the matrix A is m × n and the matrix B is n × m, so that both the matrix
product AB and the product BA are defined and both product matrices are square (one is m × m
and the other n × n), then

tr(AB) = tr(BA)
Proof. We will show this first for the case where A = a is n × 1 and B = b′ is 1 × n, i.e.

a = (a_1, a_2, …, a_n)′   and   b′ = (b_1, b_2, …, b_n)

so that

ab′ = [ a_1b_1  a_1b_2  ⋯  a_1b_n
        a_2b_1  a_2b_2  ⋯  a_2b_n
        ⋮                     ⋮
        a_nb_1  a_nb_2  ⋯  a_nb_n ]

and

tr(ab′) = a_1b_1 + a_2b_2 + ⋯ + a_nb_n

But

b′a = [a_1b_1 + a_2b_2 + ⋯ + a_nb_n]

which is a 1 × 1 matrix and so trivially

tr(b′a) = a_1b_1 + a_2b_2 + ⋯ + a_nb_n
For the general case, where A is m × n with typical element a_ij and B is n × m with typical
element b_ij, the i-th diagonal element of AB is Σ_j a_ij b_ji, so that

tr(AB) = Σ_i Σ_j a_ij b_ji

Similarly the j-th diagonal element of BA is Σ_i b_ji a_ij, so that

tr(BA) = Σ_j Σ_i b_ji a_ij

which is the same sum.
8.8.2
If x ~ N(μ, Σ) then Ax + b ~ N(Aμ + b, AΣA′)

If x ~ N(μ, Σ) then Σ^(−1/2)(x − μ) ~ N(0, I)

Note that Σ^(−1/2) will exist, provided that Σ is positive definite (i.e. provided that Σ is of full
rank).
Distribution of an idempotent quadratic form in a standard normal vector
If x ~ N(0, I) and A is idempotent, then x′Ax ~ χ²(r), where r = ρ(A)
Independence of idempotent quadratic forms
If x N (0 I ) and x0 Ax and x0 Bx are two idempotent quadratic forms in x, then x0 Ax and
x0 Bx are independent if AB = 0
Independence of a linear and a quadratic form
A linear function Lx and a symmetric idempotent quadratic form x0 Ax in a standard normal
vector are statistically independent if LA = 0.
Chapter 9
9.1
Introduction
In this chapter we will consider the asymptotic properties of the Least Squares estimator in the
context of the classical linear regression model. The reasons for doing this include:
We can say something about the accuracy of our estimation procedures in large samples.
The appropriate concept is that of consistency.
We can say something about the distribution of our estimators in large samples. The
justification for using the normal distribution for inference will turn on this.
The approach that we develop in this chapter has more general application. We will make
use of similar forms of argument in contexts outside the linear regression model. Frequently
it will be impossible to derive finite sample properties of the estimators while the asymptotic
properties might be relatively straightforward to derive.
In order to contextualise the discussion, we need to remind ourselves of what we are assuming
about the DSP:
1. Y: We assume that Y is a univariate variable with continuous, unlimited range.
2. f: The function is linear in X and β, additive in ε.
3. X: The X variables are exogenous.
4. β: The parameters are fixed.
5. ε: The disturbances are independent and identically distributed, with E(ε|X) = 0,
Var(ε|X) = σ²I.
6. f(ε|X): The distribution of the error terms is left unspecified.
7. Ω: The parameter space is unrestricted.
The model can therefore be written as

Y = Xβ + ε    (9.1)

Note that we will need to assume that the model is identified, i.e. X has full column rank.
9.2

Note that when we are talking about the consistency of the OLS estimator we are really talking
about the behaviour of the series of estimators

{β̂_n} = {(X_n′X_n)⁻¹ X_n′ y_n}

where X_n and y_n are the data matrices and dependent variables from a sample of size n. A key
question is how different (or otherwise) the additional rows of the data matrix are when compared
to the previous ones. The upwardly trending data characteristic of many macroeconomic time
series require some care in this regard, because it is clear that additional observations are generally
not from precisely the same distribution as earlier ones. Indeed in the previous chapter on
asymptotic theory we ruled out trending processes of this kind.
If we are dealing with cross-sectional data we can simply make the assumption that each
observation is a separate draw from the same underlying distribution. In this case, the only
assumption that we need to prove consistency is that

E(x′x) = Σ_xx, a finite positive definite matrix    (9.2)

where x is the row vector of explanatory variables, i.e. each row of the X matrix can be thought
of as a separate draw from the same multivariate distribution as x. We will make the weaker
assumption that

lim (1/n) X_n′X_n = A, where A is a positive definite matrix    (9.3)
Note that Assumption 9.2 implies assumption 9.3, but not necessarily vice versa. It is possible
to prove consistency with yet weaker conditions on the data sampling process (Mittelhammer
et al. 2000, pp.44):
(X′X)⁻¹ must exist for all n (so that the OLS estimator exists)

tr(X′X) = Σ_{i=1}^{n} Σ_{k=1}^{K} x_ik² → ∞ as n → ∞, and the ratio of the largest to the smallest
eigenvalues of X′X is upper bounded.
The latter condition rules out that one (or more) of the eigenvalues of X′X goes to zero.
It would do this if one of the columns in the X matrix became more and more like a linear
combination of some of the other columns or if one of the columns became more and more like
a column of zeros. We give a proof of the consistency of OLS based on these
assumptions in the appendix.
assumptions in the appendix. Note that these conditions are fairly unrestrictive.
9.3 Asymptotic properties of β̂

9.3.1 Consistency of β̂

Theorem 9.1 Assume that the DSP meets the conditions listed in Section 9.1 as well as the
assumption that (1/n)X_n′X_n → A where A is positive definite. Then β̂_n →p β.

By definition:

β̂_n = (X_n′X_n)⁻¹ X_n′ y_n
    = β + (X_n′X_n)⁻¹ X_n′ ε
    = β + [(1/n)X_n′X_n]⁻¹ (1/n)X_n′ε    (9.4)

Consider the k-th element of (1/n)X_n′ε, i.e. (1/n)Σ_t x_tk ε_t, where x_tk is the t-th observation on variable x_k (i.e. the k-th column of X). Now E[(1/n)Σ_t x_tk ε_t] = 0
(by our assumptions about ε and X) and Var[(1/n)Σ_t x_tk ε_t] = (σ²/n²)Σ_t x_tk² (since all cross terms involving ε_t ε_s
have zero expectation), so if the terms (1/n)Σ_t x_tk² remain finite (if they don't, the matrix (1/n)X′X is unlikely to
converge to A) this variance goes to zero as n → ∞. It follows that (1/n)X_n′ε →p 0 and hence
β̂_n →p β + A⁻¹ · 0 = β.
9.3.2 Asymptotic normality of β̂

Theorem 9.2 (Mittelhammer et al. 2000, p.96) Assume that the DSP meets the conditions
listed in Section 9.1 and assume further that (1/n)X_n′X_n → A where A is a finite positive definite
symmetric matrix, the elements in X are bounded in absolute value and E|ε_t|^(2+δ) ≤ m for some
finite constants δ > 0 and m. Then

√n (β̂_n − β) →d N(0, σ²A⁻¹)

In order to prove this we write

√n (β̂_n − β) = √n (X_n′X_n)⁻¹ X_n′ ε
            = [(1/n)X_n′X_n]⁻¹ (1/√n)X_n′ε

Now as before

[(1/n)X_n′X_n]⁻¹ → A⁻¹

and it follows from a central limit theorem that

(1/√n)X_n′ε →d N(0, σ²A)

so that

√n (β̂_n − β) →d N(0, A⁻¹(σ²A)A⁻¹) = N(0, σ²A⁻¹)

We can rewrite the result of the theorem as

β̂_n ~a N(β, (σ²/n)A⁻¹)

The matrix A⁻¹ is somewhat awkward in here, but we can replace it with something more
tractable. Consider the sequence of random variables A^(1/2)[(1/n)X_n′X_n]^(−1/2). Since A and (1/n)X_n′X_n are symmetric positive definite, these
matrices are well defined. We have [(1/n)X_n′X_n]^(1/2) → A^(1/2), so A^(1/2)[(1/n)X_n′X_n]^(−1/2) → I. Furthermore
√n(β̂_n − β) →d N(0, σ²A⁻¹). It follows now that

A^(1/2)[(1/n)X_n′X_n]^(−1/2) √n (β̂_n − β) →d N(0, σ²A⁻¹)

so that approximately

β̂_n ~a N(β, σ²(X_n′X_n)⁻¹)

This holds by one of the properties of normal variables: if X ~ N(μ, V), then BX + c ~ N(Bμ + c, BVB′).
9.4 Asymptotic properties of e, σ̂² and v̂ar(β̂)

9.4.1 Consistency of e as an estimator of ε

Theorem 9.3 (Mittelhammer et al. 2000, p.98) Under the conditions of the DSP set out in
Section 9.1 and on the additional assumption that x_t(X′X)⁻¹x_t′ → 0 as n → ∞, it follows that
e_t − ε_t →p 0.
To see this, write out the projection matrix

X(X′X)⁻¹X′ = [ x_1(X′X)⁻¹x_1′   x_1(X′X)⁻¹x_2′   ⋯   x_1(X′X)⁻¹x_n′
               x_2(X′X)⁻¹x_1′   x_2(X′X)⁻¹x_2′   ⋯   x_2(X′X)⁻¹x_n′
               ⋮                                       ⋮
               x_n(X′X)⁻¹x_1′   x_n(X′X)⁻¹x_2′   ⋯   x_n(X′X)⁻¹x_n′ ]

As n → ∞ we see that this matrix gets larger and (X′X)⁻¹ → 0, but the elements of
each row x_t should remain bounded (and there will only be k of these), so that the product term
x_t(X′X)⁻¹x_t′ will tend to zero.
The implication of this theorem is that the asymptotic distribution of the residuals will be
the same as the distribution of the stochastic errors, i.e. it will be normal only if the errors are
normally distributed.
9.4.2  Consistency of σ̂² as an estimator of σ²

We will write

    σ̂² = ε'ε/n − ε'X(X'X)⁻¹X'ε/n

Now we use Markov's Inequality (see the Appendix to Chapter 4). This Inequality states that Pr(X ≥ ε) ≤ E(X)/ε. We will turn this around as Pr(X < ε) ≥ 1 − E(X)/ε. Consequently

    Pr( ε'X(X'X)⁻¹X'ε/n < δ ) ≥ 1 − E[ε'X(X'X)⁻¹X'ε]/(nδ)

But E[ε'X(X'X)⁻¹X'ε] = σ²k. Taking limits

    lim Pr( ε'X(X'X)⁻¹X'ε/n < δ ) ≥ 1 − lim σ²k/(nδ) = 1

But this proves that plim ε'X(X'X)⁻¹X'ε/n = 0. The term ε'ε/n converges to σ², since it is the mean of n i.i.d. random variables εᵢ² having expected value E(εᵢ²) = σ², and so σ̂²/σ² → 1 as n → ∞.
9.4.3  Asymptotic normality of σ̂²

9.4.4  Consistency of σ̂²(X'X)⁻¹ as an estimator for var(β̂)

It is consistent, since σ̂² − σ² →p 0, and

    σ̂²(X'X)⁻¹ − σ²(X'X)⁻¹ = (σ̂² − σ²)(X'X)⁻¹

This will converge to zero, provided that (X'X)⁻¹ does.
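The consistency results above can be illustrated by simulation: across repeated samples the empirical variance of β̂ is close to σ²(X'X)⁻¹, and the average of σ̂²(X'X)⁻¹ estimates it well even when the errors are not normal. A minimal sketch (the design matrix, uniform error distribution and sample sizes are arbitrary choices of our own):

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 500, 2000
beta = np.array([1.0, 2.0])
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])  # fixed design
sigma2 = 4.0
half_width = np.sqrt(3 * sigma2)      # uniform errors with variance sigma2
XtX_inv = np.linalg.inv(X.T @ X)

slopes = np.empty(reps)
shat2 = np.empty(reps)
for r in range(reps):
    eps = rng.uniform(-half_width, half_width, n)   # non-normal errors
    y = X @ beta + eps
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    slopes[r] = b[1]
    shat2[r] = e @ e / (n - 2)        # sigma^2-hat in each sample

emp_var = slopes.var()                 # Monte Carlo variance of the slope
theo_var = sigma2 * XtX_inv[1, 1]      # slope element of sigma^2 (X'X)^{-1}
est_var = shat2.mean() * XtX_inv[1, 1] # average estimated variance
print(emp_var, theo_var, est_var)
```

The three numbers printed should agree closely, which is the practical content of Theorems 9.2 and the consistency results of this section.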
9.5  Appendix: Alternative proof of consistency of β̂

Theorem 9.4 (Mittelhammer et al. 2000, p.45) Assume that the DSP meets the conditions listed in Section 9.1 and assume further that tr(X'X) = Σᵢ₌₁ⁿ Σⱼ₌₁ᵏ x²ᵢⱼ → ∞ and that the ratio of the largest to the smallest eigenvalue of X'X is bounded above, then β̂ →p β.

Under these conditions var(β̂) → 0, so each element of the β̂ − β vector goes to zero. Furthermore we know that E(β̂) = β, so β̂ converges in mean square, and hence in probability, to β.
Chapter 10

10.1  Introduction
In this chapter we will consider in more detail how to test assumptions about the DSP based on the least squares estimates, i.e. we continue to make the following assumptions:
1. Y: We assume that Y is a univariate variable with continuous, unlimited range.
2. f: The function f is linear in X and β, additive in ε.
3. X: The X variables are exogenous.
4. β: The parameters are fixed.
5. ε: The disturbances are independent and identically distributed, with E(ε|X) = 0, var(ε|X) = σ²I.
6. f(ε|X): The distribution of the error terms is left unspecified. We will also consider the special case where we know that the error terms are normally distributed.
7. Ω: The parameter space is unrestricted. Below we will consider the specific case where we impose a set of linear restrictions on the parameter space.
As noted in Chapter 5, there are broadly three approaches to testing:

- We can base the test on the unrestricted model and investigate how different the unrestricted estimates are from the values given by the null hypothesis
- We can base the test on how much the fit of the regression changes from the unrestricted to the restricted model
- We can estimate the restricted model and investigate whether the restrictions appear to be binding, i.e. whether we would get very different estimates if we relaxed the restrictions.
In all cases we need to make some distributional assumptions about the estimator. We
saw in the last chapter that under fairly broad conditions the OLS estimators will be normally
distributed, provided that the assumptions of the classical linear regression model hold. If we
assume that the error term is normally distributed, we can give precise results even in small
samples.
In all cases we will be concerned with testing a set of linear restrictions stated in the null hypothesis

    H₀ : Rβ = c

against the alternative

    H₁ : Rβ ≠ c
10.2

10.2.1  A Wald test
Under the stated assumptions we have noted that the Wald statistic

    W = (Rβ̂ − c)' [R var(β̂) R']⁻¹ (Rβ̂ − c)

will be distributed as χ²(q). We know that var(β̂) = σ²(X'X)⁻¹, so our test statistic becomes

    W = (1/σ²) (Rβ̂ − c)' [R(X'X)⁻¹R']⁻¹ (Rβ̂ − c)     (10.1)

The only unknown quantity in this expression is σ². We know that σ̂² = e'e/(n − k) is a consistent estimator of σ², so in large samples we could base our Wald statistic on

    Ŵ = (1/σ̂²) (Rβ̂ − c)' [R(X'X)⁻¹R']⁻¹ (Rβ̂ − c)     (10.2)

10.2.2  F test
We can do better than this if we know that the errors are normally distributed. In this case we know that

    (n − k) σ̂²/σ² ∼ χ²(n − k)     (10.3)

Furthermore we showed that β̂ and σ̂² are statistically independent of each other. So the Wald statistic given in equation 10.1 will be independent of σ̂²/σ². So we can form an F statistic by dividing each chi-square variable by its degrees of freedom, i.e.

    F = [W/q] / [((n − k)σ̂²/σ²)/(n − k)]
      = (1/(q σ̂²)) (Rβ̂ − c)' [R(X'X)⁻¹R']⁻¹ (Rβ̂ − c)     (10.4)

which is distributed as F(q, n − k).
10.2.3  t tests
In the particular case where our test involves only one restriction, the test statistic can equivalently be formulated as a t-test. In these cases the R matrix is a row vector and the matrix Rσ²(X'X)⁻¹R' is a 1×1 matrix, i.e. a scalar. In fact this scalar is just the variance of the linear combination Rβ̂. We can therefore rewrite the F(1, n−k) statistic given in formula 10.4 equivalently as

    F = (Rβ̂ − c)² / var̂(Rβ̂)
      = t²

where

    t = (Rβ̂ − c) / se(Rβ̂)

is the t-statistic associated with the test

    H₀ : Rβ = c

against

    H₁ : Rβ ≠ c

Since the distribution of a t variable with n−k degrees of freedom is exactly equal to the distribution of the square root of an F(1, n−k) variable, these two tests are statistically and numerically equivalent.
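The numerical identity F = t² for a single restriction is easy to verify directly. A sketch on made-up data (the dataset and variable layout are our own; the restriction tested is that the last coefficient is zero):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2 = e @ e / (n - k)                       # sigma^2-hat

R = np.array([[0.0, 0.0, 1.0]])            # one restriction: last coefficient = 0
c = np.array([0.0])
mid = np.linalg.inv(R @ XtX_inv @ R.T)
F = ((R @ b - c) @ mid @ (R @ b - c)) / (1 * s2)   # equation 10.4 with q = 1
t = (R @ b - c).item() / np.sqrt(s2 * (R @ XtX_inv @ R.T).item())
print(F, t ** 2)
```

The two printed numbers coincide to machine precision.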
10.3

10.3.1  Asymptotic LR test
In order to implement this, we will initially consider the case where we assume normality of the errors. Assume also that we know σ² but need to estimate β. Under the assumption of normality, the maximum likelihood estimator will be (as before) the least squares estimator β̂. Furthermore the log likelihood in this case will be given by

    l(β̂) = −(n/2) ln(2π) − (n/2) ln σ² − (y − Xβ̂)'(y − Xβ̂)/(2σ²)

The log likelihood at the restricted estimator β̃ will be

    l(β̃) = −(n/2) ln(2π) − (n/2) ln σ² − (y − Xβ̃)'(y − Xβ̃)/(2σ²)

so that

    LR = 2[l(β̂) − l(β̃)]
       = [(y − Xβ̃)'(y − Xβ̃) − (y − Xβ̂)'(y − Xβ̂)] / σ²
       = (ẽ'ẽ − e'e) / σ²     (10.5)

This is asymptotically distributed as χ²(q). In fact we can show that under the assumption of normality, it will be precisely distributed as χ²(q). We could operationalise this as a test statistic by substituting in a consistent estimator of σ².
10.3.2

We will generate the precise distribution of the LR statistic above, for the special case where the restrictions are null restrictions, i.e. where q of the parameters have been set equal to zero (Davidson and MacKinnon 1993, pp.82-87). In section 10.3.3 we show that this is not, in fact, a restrictive assumption. In this special case our unrestricted model can be written as

    y = X₁β₁ + X₂β₂ + ε     (10.6)

the restricted model as

    y = X₁β₁ + ε     (10.7)

and the null hypothesis as

    β₂ = 0     (10.8)

By the FWL Theorem, we know that the residuals that we get from estimating model 10.6 are identical to the residuals that we would get if we first created the residuals e₁ = M₁y and the residuals e₂ = M₁X₂ and then regressed e₁ on e₂. The latter regression can be written as

    M₁y = M₁X₂β₂ + M₁ε     (10.9)

The residual sum of squares is

    e'e = (M₁y)' [I − M₁X₂(X₂'M₁X₂)⁻¹X₂'M₁] (M₁y)
        = y'M₁y − y'M₁X₂(X₂'M₁X₂)⁻¹X₂'M₁y     (10.10)

Since the restricted residual sum of squares is ẽ'ẽ = y'M₁y, it follows that

    ẽ'ẽ − e'e = y'M₁X₂(X₂'M₁X₂)⁻¹X₂'M₁y
              = ε'M₁X₂(X₂'M₁X₂)⁻¹X₂'M₁ε     (10.11)

where the last step is valid provided that the null hypothesis is true.
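These identities can be checked numerically. In the sketch below (arbitrary simulated data of our own construction) the RSS from the full regression equals the RSS from regressing M₁y on M₁X₂, and ẽ'ẽ − e'e equals the quadratic form in equation 10.11:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, 2))
y = X1 @ np.array([1.0, 0.5]) + rng.normal(size=n)   # beta_2 = 0 holds

def rss(y, Z):
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    e = y - Z @ b
    return e @ e

M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
ee_full = rss(y, np.column_stack([X1, X2]))        # e'e from the full model
ee_fwl = rss(M1 @ y, M1 @ X2)                      # FWL regression of e1 on e2
ee_restr = rss(y, X1)                              # restricted RSS = y'M1 y

quad = y @ M1 @ X2 @ np.linalg.inv(X2.T @ M1 @ X2) @ X2.T @ M1 @ y
print(ee_full, ee_fwl, ee_restr - ee_full, quad)
```

The first two numbers agree (the FWL Theorem), and the last two agree (equation 10.11).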
This expression is valid whether or not normality holds. Under the assumption of normal errors, the random vector

    v = X₂'M₁ε

is normally distributed with a mean of zero and covariance matrix

    Σᵥ = E(vv') = E[X₂'M₁εε'M₁X₂] = σ² X₂'M₁X₂

The right hand side of equation 10.11 is therefore of the form

    σ² v'Σᵥ⁻¹v

So

    (ẽ'ẽ − e'e)/σ² = v'Σᵥ⁻¹v

By one of the previous results we can conclude that

    (ẽ'ẽ − e'e)/σ² ∼ χ²(q)

where q is the number of elements in v.

We can turn this into an F statistic by using a consistent estimator of σ². As above (equation 10.3) we will use the fact that

    (n − k)σ̂²/σ² = e'e/σ² = (1/σ²) ε'M_X ε

is distributed as χ²(n − k), so

    [(ẽ'ẽ − e'e)/σ²]/q  ÷  [e'e/σ²]/(n − k)

has an F(q, n−k) distribution provided that the chi-squared variables in the numerator and denominator are independent of each other. By a result on quadratic forms (given in the appendix to Chapter 8) they will be, provided that the product of M₁X₂(X₂'M₁X₂)⁻¹X₂'M₁ and M_X is 0. Now M₁M_X = M_X, since the X₁ variables are just a subset of the X variables, but X₂'M_X = 0 since X₂ is also a subset of X. Consequently M₁X₂(X₂'M₁X₂)⁻¹X₂'M₁M_X = 0.

In short we find that

    [(ẽ'ẽ − e'e)/q] / [e'e/(n − k)] ∼ F(q, n − k)     (10.12)

This result depends on the normality of the error terms. Nevertheless from equation 10.11 it looks as though the result should hold up asymptotically, provided that we can apply a central limit theorem to (1/√n)X₂'M₁ε. Under the null hypothesis E[(1/√n)X₂'M₁ε] = 0 and var[(1/√n)X₂'M₁ε] = (σ²/n) X₂'M₁X₂. Provided this is bounded above we can apply the Lindeberg-Feller central limit theorem to establish asymptotic normality.

It is interesting to note that the term y'M₁X₂(X₂'M₁X₂)⁻¹X₂'M₁y (see equation 10.11) can be written as ‖P_{M₁X₂}y‖², i.e. it is an explained sum of squares ŷ'ŷ from the regression of e₁ on e₂. It is the additional sum of squares that can be ascribed to X₂ once all the effects of X₁ have been stripped out. In short our F statistic can also be written as

    ‖P_{M₁X₂}y‖² / (q σ̂²)     (10.13)
10.3.3

Above we made the claim that all linear restrictions could be subsumed by the case of zero restrictions, provided that we reparameterise the model appropriately. The argument is very simple (Davidson and MacKinnon 1993, pp.16-19). The matrix R is q×k and of rank q. By a suitable reordering of the variables we can ensure that R = [R₁ R₂] where R₁ is a q×q matrix of full rank. Our null hypothesis

    Rβ = c

can therefore be reformulated as

    [R₁ R₂] [β₁; β₂] = c
    R₁β₁ + R₂β₂ = c

i.e.

    β₁ = R₁⁻¹c − R₁⁻¹R₂β₂     (10.14)

Substituting this into the model gives

    y − X₁R₁⁻¹c = (X₂ − X₁R₁⁻¹R₂)β₂ + ε

Let y* = y − X₁R₁⁻¹c and Z₂ = X₂ − X₁R₁⁻¹R₂, then we could estimate the restricted model as:

    y* = Z₂β₂ + ε     (10.15)

The corresponding unrestricted model is

    y* = X₁γ₁ + Z₂γ₂ + ε     (10.16)

The residuals from the last model will be equal to the residuals from the original model. Furthermore γ₂ = β₂ and the parameter γ₁ will be equal to zero, if the null hypothesis is true. Consequently the model given in equation 10.16 has also precisely the same residual sum of squares and is in a form where the restriction can be tested by means of a zero restriction on γ₁.

To show all this, in essence we just apply the fact that a linear transformation of the data transforms the estimates appropriately (see Chapter 7). In this case the transformation matrix is given by

    A = [ I   −R₁⁻¹R₂ ]
        [ 0      I    ]

i.e. Z = XA. The parameters of the transformed model are given by γ = A⁻¹β. But

    A⁻¹ = [ I   R₁⁻¹R₂ ]
          [ 0      I   ]

which gives γ₁ + R₁⁻¹c = β₁ + R₁⁻¹R₂β₂ and γ₂ = β₂. Now R₁γ₁ + c = R₁β₁ + R₂β₂. Under the null hypothesis this is c, which is possible only if R₁γ₁ = 0, i.e. γ₁ = 0.

The fact that the residual sum of squares is identical, whether we base our estimates on the original model (10.14) or the reparameterised one (10.16) proves that our discussion in the previous section carries through. Even in cases where linear restrictions are not zero restrictions, the formula given in equation 10.12 will still apply. Note this is not true of some other forms of the F test, based on the R² from the restricted and the unrestricted regressions. This is a good reason for not using those forms of the test.
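A quick numerical sketch of the reparameterisation, with an arbitrary restriction β₁ + β₂ = 1 on simulated data (the design and the particular R, c are our own choices): the reparameterised unrestricted model reproduces the original RSS exactly, while the restricted RSS is at least as large.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 120
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.4, 0.6, 1.0]) + rng.normal(size=n)   # beta_1 + beta_2 = 1 holds

def rss(y, Z):
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    e = y - Z @ b
    return e @ e

# restriction R beta = c with R = [1 1 0], c = 1, partitioned as R = [R1 R2]
R1 = np.array([[1.0]])          # block multiplying beta_1
R2 = np.array([[1.0, 0.0]])     # block multiplying (beta_2, beta_3)
c = np.array([1.0])
X1, X2 = X[:, :1], X[:, 1:]

y_star = y - X1 @ np.linalg.inv(R1) @ c
Z2 = X2 - X1 @ np.linalg.inv(R1) @ R2

rss_unres = rss(y, X)                                  # original unrestricted model
rss_repar = rss(y_star, np.column_stack([X1, Z2]))     # reparameterised unrestricted model
rss_restr = rss(y_star, Z2)                            # restricted model (gamma_1 = 0)
print(rss_unres, rss_repar, rss_restr)
```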
10.4  LM type tests
Lagrange multiplier (or score) tests are based on estimating the restricted model. In other words the DSP is now characterised by a restriction on the parameter space given by Rβ = c.

The problem therefore is how to minimise the residual sum of squares, subject to the restrictions Rβ = c. We can set up the Lagrangian

    L = (1/2)(y − Xβ)'(y − Xβ) + (Rβ − c)'λ

where we have multiplied by 1/2 in order to simplify the algebra later on. The first order conditions are

    −X'(y − Xβ̃) + R'λ̃ = 0     (10.17)
    Rβ̃ − c = 0     (10.18)

From equation 10.17 we get

    R'λ̃ = X'(y − Xβ̃) = X'ẽ

It is clear that if the restriction is valid, the term on the right hand side should asymptotically converge to zero. It is also plausible that we should be able to apply some central limit theorem to this vector to show that it is asymptotically normal with covariance matrix σ²X'X. We should therefore be able to base a test of the hypothesis that λ is zero on the statistic

    LM = λ̃' R(X'X)⁻¹R' λ̃ / σ̃²

where σ̃² = ẽ'ẽ/n is the estimate of the common variance given the restricted model. Equivalently

    LM = (y − Xβ̃)' X(X'X)⁻¹X' (y − Xβ̃) / σ̃²     (10.19)
This is the score form of the LM statistic. This is the more common form in which this test is implemented, since we will generally not be estimating the restricted model by means of Lagrange multipliers. Instead we will often estimate it by imposing the constraints in the way that we sketched out in section 10.3.3. Note that the score form is equivalent to

    LM = (y − Xβ̃)' P_X (y − Xβ̃) / σ̃²

where P_X is the projection matrix X(X'X)⁻¹X', i.e. (1/σ̃²)(y − Xβ̃)'P_X(y − Xβ̃) is the explained sum of squares from the artificial linear regression

    (1/σ̃)(y − Xβ̃) = Xb + u     (10.20)
The residuals from the restricted regression are standardised (by dividing through by the estimated standard error of the restricted regression) and then regressed on the full set of explanatory variables.¹ If the restriction is valid the explained sum of squares should be small. As Davidson and MacKinnon (1993) note, LM tests can almost always be calculated by means of artificial regressions.
The argument that we have produced here did not formally invoke the gradient of the log-likelihood, although that is how we defined LM tests in Chapter 5. In the case of the normal linear regression model the two will coincide if we fix σ² initially. The gradient vector will then just be given by (1/σ²)X'e and the information matrix by (1/σ²)X'X. The statistic will be identical. Given normality the score form of the LM statistic will be distributed as χ² even in small samples.
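The score form as an artificial regression can be sketched as follows (simulated data; under H₀ the coefficients on X₂ are zero, and the statistic is computed from the restricted residuals alone). The identity LM = n × (uncentred R²) of the artificial regression is verified at the end:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, 2))
X = np.column_stack([X1, X2])
y = X1 @ np.array([1.0, 0.5]) + rng.normal(size=n)      # H0: coefficients on X2 are zero

# restricted estimate: regress y on X1 only
b_r = np.linalg.lstsq(X1, y, rcond=None)[0]
e_r = y - X1 @ b_r
s2_r = e_r @ e_r / n                                    # sigma-tilde^2

# score form, equation 10.19
P = X @ np.linalg.inv(X.T @ X) @ X.T
LM = e_r @ P @ e_r / s2_r

# equivalently n times the uncentred R^2 from regressing e_r on the full X
fit = X @ np.linalg.lstsq(X, e_r, rcond=None)[0]
R2_u = (fit @ fit) / (e_r @ e_r)
print(LM, n * R2_u)
```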
10.5

In Chapter 5 we noted that the Wald, LR and LM tests are asymptotically equivalent. We will show this for the current model in the context of the hypothesis β₂ = 0 in the model 10.6 (Davidson and MacKinnon 1993, pp.93-4). We saw in section 10.3.2 equation 10.13 that the LR F statistic is equivalent to

    ‖P_{M₁X₂}y‖² / (q σ̂²)

If we test the hypothesis by means of a Wald test, the test statistic will be based on Rβ̂ − c. For the particular hypothesis that we are considering this turns out to be particularly simple, i.e. β̂₂. By the FWL theorem we know that this is

    β̂₂ = (X₂'M₁X₂)⁻¹ X₂'M₁y     (10.21)
Furthermore the R matrix is equal to [0 I]. Thus the term R(X'X)⁻¹R' which goes into the Wald statistic and gives the covariance matrix of the test statistic is just the lower right block of the (X'X)⁻¹ matrix. We can easily find this (e.g. directly by looking at formula 10.21). It is given by (X₂'M₁X₂)⁻¹. The Wald statistic is

    Ŵ = (1/σ̂²) β̂₂' [R(X'X)⁻¹R']⁻¹ β̂₂
      = (1/σ̂²) y'M₁X₂(X₂'M₁X₂)⁻¹ (X₂'M₁X₂) (X₂'M₁X₂)⁻¹ X₂'M₁y
      = (1/σ̂²) y'M₁X₂(X₂'M₁X₂)⁻¹ X₂'M₁y
      = ‖P_{M₁X₂}y‖² / σ̂²

So if we use the Wald form of the test (equation 10.2) then the only difference between the LR formulation and the Wald is that the former uses an F test whereas the latter does so by means of a χ² test. In the F form of the Wald type test the two statistics are precisely equivalent.
¹Alternatively,

    (1/σ̃²)(y − Xβ̃)'P_X(y − Xβ̃) = n (y − Xβ̃)'P_X(y − Xβ̃) / ẽ'ẽ = nR²

where R² is the usual R² in the artificial regression of ẽ on X (Wooldridge 2002, p.58), provided that the restricted regression includes a constant. If it does not, then we would use the uncentred R².
Since M₁X₁ = 0, the explained sum of squares from this regression must be precisely equal to the explained sum of squares from the regression

    (1/σ̃) M₁y = M₁X₂ b₂ + u

This is because the residuals are identical, i.e. the residual sum of squares is identical, and the total sum of squares is identical. The explained sum of squares from the latter regression is:

    (1/σ̃²) y'M₁X₂(X₂'M₁X₂)⁻¹X₂'M₁y = ‖P_{M₁X₂}y‖² / σ̃²

So the only difference between the LM statistic and the other two statistics is that the former uses σ̃², i.e. the estimate of the variance is based on the restricted regression, whereas in both the other two cases it is based on the unrestricted regression. Of course if the null hypothesis is true then in large samples these should give very similar quantities.
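The relationships among the three statistics can be checked numerically (simulated data, H₀: β₂ = 0; the design is an arbitrary choice of our own). All three are built from the same explained sum of squares ẽ'ẽ − e'e and differ only in the scaling:

```python
import numpy as np

rng = np.random.default_rng(4)
n, q = 250, 2
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, q))
X = np.column_stack([X1, X2])
k = X.shape[1]
y = X1 @ np.array([1.0, -0.5]) + rng.normal(size=n)

def fit(y, Z):
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    return b, y - Z @ b

_, e_u = fit(y, X)            # unrestricted residuals
_, e_r = fit(y, X1)           # restricted residuals
s2_hat = e_u @ e_u / (n - k)  # unrestricted variance estimate
s2_til = e_r @ e_r / n        # restricted variance estimate

ess = e_r @ e_r - e_u @ e_u   # = ||P_{M1 X2} y||^2
F = ess / (q * s2_hat)        # F form, equation 10.12
W = ess / s2_hat              # Wald form: q * F
LM = ess / s2_til             # LM form: uses the restricted variance
print(F, W, LM)
```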
10.6

Thus far we have looked at tests of linear restrictions on the data. We may, however, wish to investigate also nonlinear functions of the estimators. This turns out to be fairly straightforward. In this case we will write our set of hypotheses in the form

    R(β) = c

where R is a set of q functions. An example might be

    R(β) = β₁/(β₂ + β₃)

Taking a first order Taylor series expansion around the true parameter vector β we get

    R(β̂) ≈ R(β) + [∂R(β)/∂β'] (β̂ − β)     (10.22)

Provided that ∂R(β)/∂β' is well behaved near β, it is obvious that R(β̂) →p R(β). Substituting in the asymptotic distribution of β̂ we get

    R(β̂) ∼ᵃ N( R(β), [∂R(β)/∂β'] var(β̂) [∂R(β)/∂β']' )     (10.23)

A test of the hypothesis

    H₀ : R(β) = c

can therefore be carried out with the Wald statistic

    W = (R(β̂) − c)' [ (∂R(β̂)/∂β') var̂(β̂) (∂R(β̂)/∂β')' ]⁻¹ (R(β̂) − c)

which is asymptotically χ²(q). For the example above the row vector of derivatives is

    ∂R(β)/∂β' = [ 1/(β₂+β₃)   −β₁/(β₂+β₃)²   −β₁/(β₂+β₃)² ]

10.7
Nonlinear relationships
We have already seen that many nonlinear relationships can be turned into linear ones. The Cobb-Douglas production function

    Q = A L^β₂ K^β₃ e^ε

becomes linear when we take logs

    ln Q = β₁ + β₂ ln L + β₃ ln K + ε

Note that in this transformation we have lost the original parameter A. The transformed model is linear in the parameters β₁, β₂, β₃ where β₁ = ln A.
Definition 10.1 In the classical linear regression model, if the parameters β₁, β₂, ..., βₖ can be written as one-to-one, possibly nonlinear functions of a set of underlying parameters θ₁, θ₂, ..., θₖ, then the model is intrinsically linear in θ.

The important point here is that the functions have to be one-to-one. In this case we can retrieve the original parameters θ₁, ..., θₖ after estimating the regression coefficients through the appropriate inverse transformation. Since our parameter estimates θ̂₁, ..., θ̂ₖ will be (possibly) nonlinear functions of our estimates β̂₁, ..., β̂ₖ we may need to use the delta method to test hypotheses on the original parameter values.
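The delta method can be sketched in code. For a quadratic specification the turning point −β₂/(2β₃) is a nonlinear function of the estimates, and its approximate standard error is √(g'V̂g) with g the gradient vector (the dataset and parameter values below are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
x = rng.uniform(0, 40, n)
X = np.column_stack([np.ones(n), x, x ** 2])
y = X @ np.array([1.0, 0.06, -0.001]) + 0.5 * rng.normal(size=n)  # turning point at 30

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
V = (e @ e / (n - 3)) * XtX_inv            # estimated covariance matrix of b

turn = -b[1] / (2 * b[2])                  # R(b): the estimated turning point
grad = np.array([0.0,
                 -1 / (2 * b[2]),
                 b[1] / (2 * b[2] ** 2)])  # dR/db', evaluated at b
se_turn = np.sqrt(grad @ V @ grad)         # delta-method standard error
print(turn, se_turn)
```

The same recipe applies to any differentiable function R(β̂), e.g. the retrieved parameters of an intrinsically linear model.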
Other examples of intrinsically linear regressions are polynomial regression models and the semilog model (such as the Mincerian wage regression above). Not all models of the form

    y = β₁ f₁(x) + ⋯ + βₖ fₖ(x) + ε

are intrinsically linear. The relationship

    y = α + βx₁ + γx₂ + βγx₃ + ε

is not intrinsically linear, since the relationship between the parameter vector (α β γ)' and the regression coefficients (β₁ β₂ β₃ β₄)' is not one-to-one.
10.8  Prediction

There are two kinds of predictions that we might make: we might wish to estimate E(y₀) = x₀β or we might wish to predict the observation y₀ itself. By the Gauss-Markov theorem

    ŷ₀ = x₀β̂

where x₀ is a row vector of observations on the explanatory variables, is the minimum variance linear unbiased estimator of x₀β. We have

    E(ŷ₀) = x₀β
    var(ŷ₀) = σ² x₀(X'X)⁻¹x₀'

For predicting the observation itself, consider the prediction error

    e₀ = y₀ − ŷ₀
       = x₀β + ε₀ − x₀β̂
       = ε₀ − x₀(β̂ − β)

so that

    E(e₀) = 0
    var(e₀) = σ² + σ² x₀(X'X)⁻¹x₀'
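In code the two prediction variances differ only by the additive σ², a point worth seeing numerically (simulated data; the prediction point x₀ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 150
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([2.0, 1.0])
y = X @ beta + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2 = e @ e / (n - 2)

x0 = np.array([1.0, 0.7])                  # point at which we predict
y0_hat = x0 @ b
var_mean = s2 * x0 @ XtX_inv @ x0          # variance of the estimator of x0*beta
var_pred = s2 + var_mean                   # variance of the prediction error y0 - y0_hat
print(y0_hat, var_mean, var_pred)
```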
10.9  Exercises
1. Consider the following Stata output:

      Source |       SS       df       MS              Number of obs =    6072
-------------+------------------------------           F(  4,  6067) = 1300.33
       Model |   3652.7455     4  913.186376           Prob > F      =  0.0000
    Residual |  4260.70023  6067  .702274638           R-squared     =  0.4616
-------------+------------------------------           Adj R-squared =  0.4612
       Total |  7913.44573  6071  1.30348307           Root MSE      =  .83802

------------------------------------------------------------------------------
      logpay |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      highed |   .1435481   .0030851    46.53   0.000     .1375003    .1495959
       years |   .0545642   .0033909    16.09   0.000     .0479169    .0612115
      years2 |  -.0007096   .0000626   -11.33   0.000    -.0008324   -.0005868
     African |  -.8052136   .0244181   -32.98   0.000    -.8530819   -.7573454
       _cons |   5.504426   .0580638    94.80   0.000       5.3906    5.618252
------------------------------------------------------------------------------
(a) Interpret the coefficients.
(b) You would like to predict the turning point of the relationship between experience and (expected log) wages. Generate a consistent estimate of this turning point.
(c) Assume that the covariance matrix of the estimators (ordered as (β̂₂, β̂₃, β̂₄, β̂₅, β̂₁)) is given by:

    [ 9.5178×10⁻⁶       ·           3.8625×10⁻⁸   2.26×10⁻⁵     8.9566×10⁻⁶ ]
    [      ·        1.1498×10⁻⁵    1.2736×10⁻⁷   4.1399×10⁻⁶   9.8444×10⁻⁶ ]
    [ 3.8625×10⁻⁸   1.2736×10⁻⁷    3.9188×10⁻⁹   1.2229×10⁻⁷   3.6348×10⁻⁷ ]
    [ 2.26×10⁻⁵     4.1399×10⁻⁶    1.2229×10⁻⁷   5.9624×10⁻⁴   1.4178×10⁻⁴ ]
    [ 8.9566×10⁻⁶   9.8444×10⁻⁶    3.6348×10⁻⁷   1.4178×10⁻⁴   3.3714×10⁻³ ]

Now generate standard errors for the turning point by means of the delta method.
2. The variables are defined as follows: lncons is the natural log of chocolate consumption, lnm the natural log of income, lp1 the natural log of the price of chocolate and lp2 the natural log of the price of sweets. You have run several regressions, together with some diagnostics. The (edited) Stata output is as follows:
Regression A

. regress lncons lnm lp1 lp2

      Source |       SS       df       MS              Number of obs =      35
-------------+------------------------------           F(  3,    31) =
       Model |  6.93935951     3                       Prob > F      =
    Residual |                31                       R-squared     =
-------------+------------------------------           Adj R-squared =
       Total |  7.02610983    34  .206650289           Root MSE      =

------------------------------------------------------------------------------
      lncons |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lnm |   1.273924   .0744083
         lp1 |   .6665821   .1901708
         lp2 |  -1.614593   .2645366
       _cons |   .4053666   .6017865
------------------------------------------------------------------------------

. vif

    Variable |      VIF       1/VIF
-------------+----------------------
         lp1 |     47.35    0.021118
Regression B

. regress lncons lnmp2 lnp1p2

      Source |       SS       df       MS              Number of obs =      35
-------------+------------------------------           F(  2,    32) =  925.47
       Model |  6.90670345     2  3.45335172           Prob > F      =  0.0000
    Residual |  .119406383    32  .003731449           R-squared     =  0.9830
-------------+------------------------------           Adj R-squared =  0.9819
       Total |  7.02610983    34  .206650289           Root MSE      =  .06109

------------------------------------------------------------------------------
      lncons |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       lnmp2 |   1.397402   .0751029   18.606   0.000     1.244422    1.550381
      lnp1p2 |   1.044132   .1787048    5.843   0.000     .6801216    1.408141
       _cons |   2.030167   .4257199    4.769   0.000     1.163004     2.89733
------------------------------------------------------------------------------

. vif

    Variable |      VIF       1/VIF
-------------+----------------------
       lnmp2 |      3.29    0.303518
      lnp1p2 |      3.29    0.303518
-------------+----------------------
    Mean VIF |      3.29
(a) Interpret the coefficients in Regression A.
(b) Which of the coefficients in Regression A are statistically significant?
(c) Test the significance of the regression as a whole.
(d) Test the following hypotheses, using Regression A (at the 5% level)
    i. chocolate is a luxury good
    ii. sweets are a substitute good for chocolate
(e) Test the following hypothesis (in the main regression):

    H₀ : β₂ + β₃ + β₄ = 0

where model A is written as

    ln cons = β₁ + β₂ ln m + β₃ ln p₁ + β₄ ln p₂ + ε
(f) Do you detect evidence of multicollinearity in these data? If yes, what might be the
cause and what corrective measures might you take?
(g) Comment on all the results. How might you improve this research?
3. You are given the model

    y = Xβ + ε

where

    f(εᵢ) = 1/(2θ)  if |εᵢ| ≤ θ
          = 0       elsewhere

and where θ is some positive constant. You are also given the following matrices:

    (X'X)⁻¹ = [ 0.22191  −0.0186 ]
              [ −0.0186   0.0024 ]

    X'y = [  186.4 ]
          [ 1939.6 ]

and told that n = 15 and e'e = 18.053
(a) Do the assumptions of the classical linear regression model hold in this case? Explain.
(b) What is E(ε|X)?
(c) What is var(ε|X)?
(d) Estimate β by OLS
(e) Estimate σ²
(f) Test the joint hypothesis β₁ = 0 and β₂ = 0 by means of the appropriate test.
(b) What is (  )?
(c) What would the (X'X) matrix look like in this instance?
5. You think that the process by which wages are set is given by

    log wᵢ = β₀ + β₁ educᵢ + β₂ experᵢ + β₃ exper²ᵢ + εᵢ

where w is the wage rate, educ is the highest level of education obtained and exper is potential experience (in years).
You have the following Stata output:

. reg logw educ exper exper2

      Source |       SS       df       MS              Number of obs =   22486
-------------+------------------------------           F(  3, 22482) = 4401.49
       Model |  12296.5615     3  4098.85383           Prob > F      =  0.0000
    Residual |  20936.2047 22482  .931242981           R-squared     =  0.3700
-------------+------------------------------           Adj R-squared =  0.3699
       Total |  33232.7662 22485  1.47799716           Root MSE      =  .96501

------------------------------------------------------------------------------
        logw |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .2002941
       exper |   .0480962
      exper2 |  -.0004388
       _cons |   4.622752
------------------------------------------------------------------------------

. matrix list e(V)

symmetric e(V)[4,4]
                 educ      exper     exper2      _cons
  educ      3.309e-06
 exper      4.131e-07  2.918e-06
exper2      3.425e-09  4.787e-08  8.913e-10
 _cons       .0000414  .00003776  4.509e-07  .00097137
(f) Test the joint hypothesis that the coefficients of experience and experience² are both zero, by means of a Wald Test.
(g) You know that your regression has an omitted variable. You expect ability to have an independent effect on (log) wages. You suspect that the true coefficient of ability (as measured by IQ) in a multiple regression (controlling for education, experience, experience² and a constant) is 0.05. Additionally you assume that the true relationship between ability and all the other variables is given by:

    E(IQ | educ, exper, exper²) = 90 + 2 educ

(note: this is a statistical relationship, not a causal one.) Discuss how this omitted variable affects the estimates obtained from the empirical regression.
(h) Assume that you have a proxy variable for ability in the form of an IQ test conducted when the individual was still at school. Under which circumstances would this proxy variable deal with the omitted variable bias?
Chapter 11
11.1
Thus far we have assumed that the error process was i.i.d. In this chapter we will relax this
assumption. We will first consider the case where we know the structure of the covariance
matrix of the error terms. Although this is hardly ever the case it will enable us to discuss the
implications for OLS estimation and establish a set of useful benchmarks. In the next chapter
we will turn to the case where the covariance matrix is unknown.
11.1.1  The model
11.2
What happens if we ignore the fact that the covariance matrix of the errors is no longer σ²I and we continue to use the OLS estimators β̂ = (X'X)⁻¹X'y and σ̂² = (y − Xβ̂)'(y − Xβ̂)/(n − k)?

11.2.1  Point estimation of β
b will continue to
We will show that under fairly general conditions the OLS slope estimators
be:
Unbiased:
= + (X0 X)
Consequently
X0
(11.1)
b =
(11.2)
0
b
b
b
var =
h
i
1
1
= (X0 X) X0 0 X (X0 X)
= 2 (X0 X)
X0 X (X0 X)
b is unbiased, consistency will follow if var
b
Since
0 as (since then
1
b
we can invoke mean square convergence). It turns out that var
(X0 X)
where is the largest eigenvalue of . So if remains finite (bounded) as the
OLS slope estimators will be consistent under precisely the same conditions that we used
to establish consistency when we had = I, i.e. if (X0 X)1 0.
In the case where we have only heteroscedasticity so that is diagonal, it is easy to see
that the condition that remains finite is the condition that none of the error variances
should become arbitrarily large as .
The OLS estimator will also be asymptotically normal. We can write

    (1/√n) X'ε = (1/√n) X'Ω^(1/2) Ω^(−1/2) ε = (1/√n) W'v

where v = Ω^(−1/2)ε is a vector of transformed error terms and W = Ω^(1/2)X. This transformation will turn out to be important later on so it is important to note that it will always exist. This follows from the spectral decomposition theorem for symmetric matrices. We can always write

    Ω = TΛT'     (11.3)

where T is orthogonal and Λ is the diagonal matrix of eigenvalues of Ω, so that Ω^(1/2) = TΛ^(1/2)T'. Furthermore

    E(vv') = Ω^(−1/2) E(εε') Ω^(−1/2) = σ² Ω^(−1/2) Ω Ω^(−1/2) = σ² I

It follows that E[(1/√n)W'v] = 0 and var[(1/√n)W'v] = (σ²/n)W'W = (σ²/n)X'ΩX. If (1/n)X'ΩX → h, where h is a symmetric positive definite matrix, then we can use the same reasoning as in Chapter 9 to show that

    √n (β̂ − β) →d N(0, σ² A⁻¹hA⁻¹)

where we are assuming, as in Chapter 9, that (1/n)X'X → A. We can use the same procedure to get a tractable version of this, i.e.

    β̂ ∼ᵃ N(β, σ² (X'X)⁻¹ X'ΩX (X'X)⁻¹)     (11.4)

This will be the exact distribution if the errors are normally distributed.
11.2.2  Point estimation of σ²

While the OLS slope estimates retain many of their desirable properties, the standard estimator of σ² is biased and inconsistent. We have that

    e'e = ε'Mε

where M = I − X(X'X)⁻¹X'. Consequently

    E(e'e) = E[tr(Mεε')]
           = tr[M E(εε')]
           = σ² tr(MΩ)

But in general tr(MΩ) ≠ (n − k). So E(σ̂²) ≠ σ². Furthermore there is no reason to believe that this bias would disappear as n → ∞.
11.2.3  Point estimation of var(β̂)

11.2.4  Hypothesis testing

The standard estimator of var(β̂) will produce biased and inconsistent results. This includes the standard t tests and F tests discussed in Chapter 10.
11.3

11.3.1

We observed above that it was possible to transform the error term ε to get v, which had a much better behaved covariance matrix. We will use the same transformation to transform the data into a form in which they obey the assumptions of the classical linear regression model. The existing model is (in linear form)

    y = Xβ + ε

Premultiplying by Ω^(−1/2) gives

    Ω^(−1/2)y = Ω^(−1/2)Xβ + Ω^(−1/2)ε

i.e.

    y* = X*β + ε*     (11.5)

11.3.2

Applying OLS to the transformed model yields the GLS estimator

    β̃ = (X*'X*)⁻¹ X*'y*
      = (X'Ω⁻¹X)⁻¹ X'Ω⁻¹y     (11.6)
(11.6)
We can now use the full armoury of results from the previous chapters applied to model 11.5 to
establish the unbiasedness, consistency and asymptotic normality of the GLS estimator. It is,
however, useful to derive some of them directly from equation 11.6. Note in particular that
    β̃ = β + (X'Ω⁻¹X)⁻¹ X'Ω⁻¹ε

So that

    E(β̃) = β
    var(β̃) = σ² (X'Ω⁻¹X)⁻¹
We therefore have

    var(β̂) − var(β̃) = σ²(X'X)⁻¹X'ΩX(X'X)⁻¹ − σ²(X'Ω⁻¹X)⁻¹
                      = σ² C'ΩC

where C' = (X'X)⁻¹X' − (X'Ω⁻¹X)⁻¹X'Ω⁻¹. To see this we just multiply the expression out:

    C'ΩC = [(X'X)⁻¹X' − (X'Ω⁻¹X)⁻¹X'Ω⁻¹] Ω [X(X'X)⁻¹ − Ω⁻¹X(X'Ω⁻¹X)⁻¹]
         = (X'X)⁻¹X'ΩX(X'X)⁻¹ − (X'X)⁻¹X'X(X'Ω⁻¹X)⁻¹
           − (X'Ω⁻¹X)⁻¹X'X(X'X)⁻¹ + (X'Ω⁻¹X)⁻¹X'Ω⁻¹X(X'Ω⁻¹X)⁻¹
         = (X'X)⁻¹X'ΩX(X'X)⁻¹ − (X'Ω⁻¹X)⁻¹

The matrix C'ΩC is at least positive semidefinite since Ω is positive definite. This follows since if we take any vector x and let z = Cx, then z'Ωz ≥ 0, i.e. x'C'ΩCx ≥ 0. It will be strictly positive whenever z is not the zero vector. Note that if Ω = I then C = 0, so we cannot guarantee that if x is nonzero then z = Cx will be nonzero. Nevertheless we are assured that C'ΩC is positive semidefinite, which is enough to prove that the GLS estimator is more efficient than the OLS estimator, a claim that we made in section 11.2.1.
The proof that we have just given works directly off the two covariance matrices. We can, of course, also appeal to the Gauss-Markov theorem on the transformed model given in equation 11.5. We know that β̃ will be more efficient than any other linear unbiased estimator of β in that model. We know that β̂ is an unbiased estimator of β, so we only need to show that it is a linear estimator, i.e. it is linear in the dependent variable y*. Now

    β̂ = (X'X)⁻¹X'y
      = (X'X)⁻¹X'Ω^(1/2)Ω^(−1/2)y
      = (X'X)⁻¹X'Ω^(1/2)y*
      = Ay*

where A is just a matrix of constants, i.e. β̂ is linear in y*, so β̃ is more efficient than the OLS estimator.
11.3.3

Let P = Λ^(−1/2)T', where T and Λ are defined as in equation 11.3. Note that PΩP' = Λ^(−1/2)T' TΛT' TΛ^(−1/2) = I. Then it is easy to show that the transformed model

    Py = PXβ + Pε

also obeys the assumptions of the classical linear regression model. Estimating this model by OLS yields the same GLS estimator!
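A numerical sketch of this equivalence: the GLS formula (11.6) and OLS on the P-transformed data give the same coefficients, and PΩP' is indeed the identity (the positive definite Ω below is an arbitrary construction of our own):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])

# build an arbitrary symmetric positive definite Omega
A = rng.normal(size=(n, n))
Omega = A @ A.T + n * np.eye(n)
y = X @ beta + np.linalg.cholesky(Omega) @ rng.normal(size=n)

Omega_inv = np.linalg.inv(Omega)
b_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)  # equation 11.6

# transformation P = Lambda^{-1/2} T' from the spectral decomposition Omega = T Lambda T'
lam, T = np.linalg.eigh(Omega)
P = np.diag(lam ** -0.5) @ T.T
b_transformed = np.linalg.lstsq(P @ X, P @ y, rcond=None)[0]
print(b_gls, b_transformed)
```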
Example 11.1 Estimation with autocorrelation. Suppose

    yₜ = xₜβ + εₜ
    εₜ = ρεₜ₋₁ + uₜ

The covariance matrix E(εε') in this case can be written as

    E(εε') = σ²ε [ 1        ρ        ρ²       ⋯   ρⁿ⁻¹ ]
                 [ ρ        1        ρ        ⋯   ρⁿ⁻² ]
                 [ ρ²       ρ        1        ⋯   ρⁿ⁻³ ]
                 [ ⋮        ⋮        ⋮        ⋱   ⋮    ]
                 [ ρⁿ⁻¹     ρⁿ⁻²     ρⁿ⁻³     ⋯   1    ]

The transformation matrix is

    P = [ √(1−ρ²)   0    0   ⋯   0    0 ]
        [ −ρ        1    0   ⋯   0    0 ]
        [ 0        −ρ    1   ⋯   0    0 ]
        [ ⋮                  ⋱           ]
        [ 0         0    0   ⋯  −ρ    1 ]

so that the transformed observations are

    √(1−ρ²) y₁ = √(1−ρ²) x₁β + √(1−ρ²) ε₁
    yₜ − ρyₜ₋₁ = (xₜ − ρxₜ₋₁)β + uₜ,   t = 2, ..., n

The first observation is transformed by multiplying through by √(1−ρ²). This is known as the Prais-Winsten transformation. This transformation ensures that the variance of the first error term is (1−ρ²)σ²ε, which is identical to the variance of the other transformed error terms. The other observations are transformed by taking the generalised differences. Note that in each case uₜ = εₜ − ρεₜ₋₁, so the error process has been transformed to be i.i.d.
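The Prais-Winsten matrix can be checked directly: with Ω the AR(1) correlation matrix (unit error variance), PΩP' comes out proportional to the identity. A sketch with arbitrary ρ and n:

```python
import numpy as np

n, rho = 8, 0.6

# AR(1) correlation matrix: Omega[t, s] = rho^|t - s|
idx = np.arange(n)
Omega = rho ** np.abs(idx[:, None] - idx[None, :])

# Prais-Winsten transformation matrix: -rho on the subdiagonal,
# ones on the diagonal, sqrt(1 - rho^2) in the top-left corner
P = np.eye(n) - rho * np.eye(n, k=-1)
P[0, 0] = np.sqrt(1 - rho ** 2)

out = P @ Omega @ P.T
print(np.round(out, 6))
```

The result is (1 − ρ²) I, confirming that the transformed errors are homoscedastic and uncorrelated.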
Example 11.2 Estimation with heteroscedasticity. The case of heteroscedasticity is fairly simple. The noise covariance matrix is diagonal, i.e.

    E(εε') = [ σ₁²   0    ⋯   0   ]
             [ 0     σ₂²  ⋯   0   ]
             [ ⋮     ⋮    ⋱   ⋮   ]
             [ 0     0    ⋯   σₙ² ]

so writing

    P = [ 1/σ₁   0     ⋯   0    ]
        [ 0      1/σ₂  ⋯   0    ]
        [ ⋮      ⋮     ⋱   ⋮    ]
        [ 0      0     ⋯   1/σₙ ]

the transformed model becomes

    yₜ/σₜ = (xₜ/σₜ)β + εₜ/σₜ

This model is also referred to as weighted least squares, because it amounts to a reweighting of the different observations.
11.3.4  Estimation of σ²

If we estimate the transformed linear model (equation 11.5) by OLS, then the residual sum of squares from that estimation can be used to estimate σ². We have

    e*'e* = (y* − X*β̃)'(y* − X*β̃)
          = (Ω^(−1/2)y − Ω^(−1/2)Xβ̃)'(Ω^(−1/2)y − Ω^(−1/2)Xβ̃)
          = (y − Xβ̃)' Ω⁻¹ (y − Xβ̃)

so that

    σ̃² = (y − Xβ̃)' Ω⁻¹ (y − Xβ̃) / (n − k)
11.4  Exercises

    (X'X)⁻¹ = [ 0.22191  −0.0186 ]
              [ −0.0186   0.0024 ]

    X'y = [  186.4 ]
          [ 1939.6 ]

    (X'WX)⁻¹ = [ 0.08484  −0.0084  ]
               [ −0.0084   0.00123 ]

    X'Wy = [  402.4 ]
           [ 3476.2 ]

    X'W⁻¹X = [ 97.5     9.5   ]
             [  9.5   1241.5  ]

(e) Now find the true variance-covariance matrix of the OLS estimators β̂.
(f) Test the null hypothesis β₂ = 1 on this assumption.
(h) Estimate the variance-covariance matrix of the GLS estimators β̃.
(i) Test the null hypothesis β₂ = 1 using the GLS estimators.
Chapter 12

12.1

In general we will not know σ²Ω. One approach would be to estimate Ω and then replace the unknown Ω with a consistent estimate Ω̂. This estimated GLS estimator (EGLS) or feasible GLS estimator (FGLS) is given by

    β̂_FGLS = (X'Ω̂⁻¹X)⁻¹ X'Ω̂⁻¹y     (12.1)

Note that if we let Σ = σ²Ω then an equivalent expression is given by

    β̂_FGLS = (X'Σ̂⁻¹X)⁻¹ X'Σ̂⁻¹y     (12.2)

In fact any scalar multiple of Ω̂ will also work.
12.1.1

The problem of estimating Ω or Σ is nontrivial once one realises that there are n(n + 1)/2 distinct elements to be estimated, many more than the n observations available. In practice we need to impose some restrictions on the shape of the covariance matrix. This is generally done by imposing some parametric structure on the matrix, i.e. we assume that Ω = Ω(θ) where θ is a vector of parameters of dimension considerably smaller than n.

Example 12.1 If we think there is autocorrelation of the form of an AR(1) process, i.e. that

    Ω = [ 1       ρ       ρ²      ⋯   ρⁿ⁻¹ ]
        [ ρ       1       ρ       ⋯   ρⁿ⁻² ]
        [ ⋮       ⋮       ⋮       ⋱   ⋮    ]
        [ ρⁿ⁻¹    ρⁿ⁻²    ρⁿ⁻³    ⋯   1    ]

Note that Ω = Ω(ρ), i.e. we need to estimate only one parameter in order to fully characterise the matrix.
Example 12.2 In many cross-sectional data sets there is correlation within households, neighbourhoods, schools etc. One simple way of capturing this is with the hierarchical effects model

    y_ij = x_ij β + u_j + ε_ij

In this model the subscript i refers to the individual and the subscript j to the household. The random variable u_j is assumed to be common within the household while the error term ε_ij is assumed to be uncorrelated between individuals. In short we would add the assumptions:

    E(u_j) = 0, E(ε_ij) = 0
    E(u_j²) = σ²_{uj}
    E(ε_ij²) = σ²_ε
    E(u_j u_k) = 0 if j ≠ k
    E(ε_ij ε_kl) = 0 if (i,j) ≠ (k,l)

The error term in the regression is ν_ij = u_j + ε_ij, so the covariance matrix of the error terms is block diagonal once we order the observations so that individuals within the same group are next to each other. A typical matrix might look like:

    σ²Ω = [ σ²_{u1}+σ²_ε  σ²_{u1}       σ²_{u1}       0             0             0             ⋯ ]
          [ σ²_{u1}       σ²_{u1}+σ²_ε  σ²_{u1}       0             0             0             ⋯ ]
          [ σ²_{u1}       σ²_{u1}       σ²_{u1}+σ²_ε  0             0             0             ⋯ ]
          [ 0             0             0             σ²_{u2}+σ²_ε  σ²_{u2}       0             ⋯ ]
          [ 0             0             0             σ²_{u2}       σ²_{u2}+σ²_ε  0             ⋯ ]
          [ 0             0             0             0             0             σ²_{u3}+σ²_ε  ⋯ ]
          [ ⋮             ⋮             ⋮             ⋮             ⋮             ⋮             ⋱ ]

Provided that the number of groups is less than the number of observations n, it should be possible, in principle, to estimate the within group covariances σ²_{uj}.
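The block-diagonal covariance matrix of this example is easy to construct in code. The sketch below is illustrative and not part of the text: the function name, the group sizes and the variance values are all made-up assumptions.

```python
import numpy as np

def hierarchical_cov(group_sizes, sigma_u2, sigma_e2):
    """Covariance matrix of nu_ij = u_j + eps_ij, observations ordered by group."""
    n = sum(group_sizes)
    Sigma = np.zeros((n, n))
    start = 0
    for size, s2u in zip(group_sizes, sigma_u2):
        block = np.full((size, size), s2u)   # common household component u_j
        block += sigma_e2 * np.eye(size)     # idiosyncratic component on the diagonal
        Sigma[start:start + size, start:start + size] = block
        start += size
    return Sigma

# households of sizes 3, 2 and 1 with hypothetical variances
Sigma = hierarchical_cov([3, 2, 1], [0.5, 0.8, 0.3], 1.0)
```

Off-diagonal entries are non-zero only within a household block, matching the block-diagonal structure described above.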
Example 12.3 Even if we assume that the Ω matrix is diagonal there would still be n separate variances to be estimated. In this case we might suppose that the variances might be constant within groups (e.g. households) so that the matrix might be:

    Σ = [ σ²₁  0    0    0    0    0    ⋯ ]
        [ 0    σ²₁  0    0    0    0    ⋯ ]
        [ 0    0    σ²₁  0    0    0    ⋯ ]
        [ 0    0    0    σ²₂  0    0    ⋯ ]
        [ 0    0    0    0    σ²₂  0    ⋯ ]
        [ 0    0    0    0    0    σ²₃  ⋯ ]
        [ ⋮    ⋮    ⋮    ⋮    ⋮    ⋮    ⋱ ]

Alternatively we might suspect that the heteroscedasticity is driven by some explanatory variables, so that

    V(ε_i | x_i) = x_i α
12.1.2 Approach to estimating θ
Since the true error terms are unobserved it is unclear how we might estimate their covariances or correlations. Note, however, that the OLS estimators β̂ remain consistent, so that the OLS residuals e also remain consistent estimates of the true error vector ε.
The general approach is therefore:

1. Estimate the original model using OLS and obtain the vector of OLS residuals e.
2. Estimate the parameters θ using the residuals e as estimates of ε.
3. Form the Ω̂ matrix, letting Ω̂ = Ω(θ̂).
4. Transform the data using Ω̂ and estimate the transformed model y* = X*β + ε* by OLS. In the AR(1) case, for instance, the transformation is y*_t = y_t − ρ̂y_{t−1} and x*_t = x_t − ρ̂x_{t−1}, for t > 1.
5. If desired, iterate steps 2 to 4 until convergence is achieved.
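As an illustration of these steps, here is a minimal sketch of the AR(1) case on simulated data. None of this comes from the text: the data-generating values (β = (1, 2), ρ = 0.6) and the fixed number of iterations are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
eps = np.zeros(n)
for t in range(1, n):                      # AR(1) errors with rho = 0.6
    eps[t] = 0.6 * eps[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + eps
X = np.column_stack([np.ones(n), x])

beta = np.linalg.lstsq(X, y, rcond=None)[0]        # step 1: OLS
for _ in range(10):                                # step 5: iterate
    e = y - X @ beta                               # residuals as estimates of eps
    rho = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])     # step 2: estimate theta = rho
    ys = y[1:] - rho * y[:-1]                      # step 4: quasi-difference
    Xs = X[1:] - rho * X[:-1]
    beta = np.linalg.lstsq(Xs, ys, rcond=None)[0]  # re-estimate by OLS
```

The slope estimate and the estimate of ρ should both settle near their true values after a few iterations.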
Remark 12.5 Please note, however, that the state of the art in time series econometrics has progressed far beyond this particular technique. If you have serious autocorrelation problems, this should be seen as an indicator that you probably have nonstationary data, in which case different approaches to estimation are probably called for.
12.1.3

The FGLS estimator will be consistent, asymptotically efficient and asymptotically normal provided that the Ω matrix can, in fact, be characterised as Ω = Ω(θ).
12.2
We observed in the last chapter that the OLS estimators β̂ remain unbiased and consistent. Their covariance matrix, however, is given by

    σ²(X′X)⁻¹X′ΩX(X′X)⁻¹

An alternative approach would be to obtain consistent estimates of this covariance matrix and use that for purposes of inference instead.
12.2.1
One particular application of this is in the context of heteroscedasticity. In this case the matrix X′σ²ΩX can be written as Σᵢ σᵢ² xᵢ′xᵢ, where xᵢ is the i-th row of the X matrix. It is possible to show that

    (1/n) Σᵢ eᵢ² xᵢ′xᵢ

is a consistent estimator of

    lim (1/n) X′σ²ΩX

provided that the latter exists. In this case the OLS covariance matrix can be estimated as

    (X′X)⁻¹ (Σᵢ eᵢ² xᵢ′xᵢ) (X′X)⁻¹    (12.3)

Note that in place of each unobserved σᵢ² we have used the corresponding squared OLS residual eᵢ².
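Estimator 12.3 can be sketched in a few lines. This is an illustration only; the simulated design, sample size and coefficient values are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.uniform(1, 3, size=n)
y = 0.5 + 1.0 * x + rng.normal(size=n) * x     # error sd grows with x
X = np.column_stack([np.ones(n), x])

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)
meat = (X * e[:, None] ** 2).T @ X             # sum_i e_i^2 x_i' x_i
V_robust = XtX_inv @ meat @ XtX_inv            # the sandwich in equation 12.3
robust_se = np.sqrt(np.diag(V_robust))
```

The square roots of the diagonal of the sandwich matrix are the heteroscedasticity-robust standard errors.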
12.2.2
If there is autocorrelation in the data as well as heteroscedasticity we can again use the relationships of the OLS residuals in the sample to approximate the asymptotic correlations given in σ²Ω. One problem we immediately encounter, however, is that certain correlations (e.g. the correlation between ε₁ and ε_n) will only ever be estimated by one observation, and so e₁e_n cannot possibly yield a consistent estimator of the population covariance cov(ε₁, ε_n). In order to get decent asymptotic behaviour one therefore needs to impose a maximum lag length p above which we assume that there is no or negligible correlation. We could then estimate X′σ²ΩX as

    Σ_{t=1}^{n} e_t² x_t′x_t + Σ_{j=1}^{p} Σ_{t=j+1}^{n} e_t e_{t−j} (x_t′x_{t−j} + x_{t−j}′x_t)

We can then allow p to increase as n increases (but at a slower rate!). It turns out that these empirical estimates need not be positive definite, however. This is a major drawback. Newey and West have shown that one can get a positive definite set of estimates if we downweight the estimates coming from correlations over longer periods. The Newey-West estimator is given by

    Σ_{t=1}^{n} e_t² x_t′x_t + Σ_{j=1}^{p} (1 − j/(p+1)) Σ_{t=j+1}^{n} e_t e_{t−j} (x_t′x_{t−j} + x_{t−j}′x_t)
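The Newey-West sum can be coded directly. The sketch below is illustrative only; the simulated data, the lag length p = 4 and the function name are assumptions.

```python
import numpy as np

def newey_west_meat(X, e, p):
    """Estimate X' sigma^2 Omega X with Bartlett weights (1 - j/(p+1))."""
    S = (X * e[:, None] ** 2).T @ X                  # j = 0 term
    for j in range(1, p + 1):
        w = 1.0 - j / (p + 1.0)                      # downweight longer lags
        Gamma = (X[j:] * (e[j:] * e[:-j])[:, None]).T @ X[:-j]
        S += w * (Gamma + Gamma.T)                   # lag-j term plus its transpose
    return S

rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
e = rng.normal(size=n)                               # stand-in for OLS residuals
S = newey_west_meat(X, e, p=4)
```

With the Bartlett weights the resulting matrix is positive semi-definite, which is the point of the downweighting.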
12.3 Summing up
Feasible Generalised Least Squares will be asymptotically efficient provided that we have parameterised the structure of the covariance matrix correctly. If there is doubt about this, the FGLS estimates could be more or less efficient than OLS. In this context using OLS with robust standard errors has much to recommend it.
Chapter 13
Heteroscedasticity and
Autocorrelation
13.1
Introduction
In this chapter we will introduce the issue of how we might diagnose the presence of heteroscedasticity or autocorrelation. The fundamental approach in both cases involves analysis of the residuals from an OLS regression. Indeed, visual inspection of these residuals can frequently be very instructive. The formal tests tend to be of the LM type, in which the homoscedastic/zero autocorrelation model is the restricted model. The test statistics can generally be obtained by an auxiliary regression of the sort discussed previously, i.e. we will regress the OLS residuals on a broader set of explanatory variables and then use a chi-square test with nR² as our test statistic, where R² is the R² of the auxiliary regression.
13.2
13.2.1
Breusch-Pagan-Godfrey test

Note that the original BPG test is more involved than the procedure outlined below (see, for instance, Gujarati (2003, p.411)). This discussion follows Mittelhammer et al. (2000, pp.536-539).

The null and alternative hypotheses underlying the BPG test are given by

    H₀: σᵢ² = σ² for all i versus H₁: σᵢ² = α₀ + zᵢα

where zᵢ is a 1 × p row vector of variables that are thought to explain the level of the variance for observation i (note that these may include a set of dummies) while α is a p × 1 parameter vector. Note that the hypothesis of homoscedasticity is now equivalent to the hypothesis

    H₀: α = 0

The auxiliary regression in this case will be given by

    eᵢ² = α₀ + zᵢα + vᵢ    (13.1)

and the nR² test statistic will be distributed asymptotically as χ²(p).
We can write

    εᵢ² = α₀ + zᵢα + (εᵢ² − σᵢ²)
        = α₀ + zᵢα + vᵢ    (13.2)

We have

    E(vᵢ) = 0
    σᵢ² = α₀ + zᵢα

If we could observe the εᵢ² terms we could estimate this model by OLS and test the hypothesis α = 0 by any of the tests discussed in a previous chapter. In particular, we could test it by the LM approach. In this case the restricted model would be given by the intercept-only regression:

    εᵢ² = α₀ + vᵢ    (13.3)

The residuals from this model would be regressed on the full model, which includes the intercept and the additional explanatory variables. nR² from this regression will then be a valid test of the restriction α = 0. Since the residuals from the model 13.3 are identical to the εᵢ² values minus a constant, the nR² in the regression 13.2 will be identical to this statistic.

Of course the εᵢ² terms are not observed. Under our assumptions, however, the OLS residuals eᵢ remain consistent estimates of the εᵢ, so we use the squared residuals eᵢ² in their place, as in the auxiliary regression 13.1.
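The whole procedure (OLS residuals, auxiliary regression, nR²) can be sketched as follows. The data are simulated with variance depending on z, so the statistic should be large; all numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
z = rng.uniform(size=n)
x = rng.normal(size=n)
y = 1.0 + x + rng.normal(size=n) * np.sqrt(0.5 + 2.0 * z)   # var = 0.5 + 2z
X = np.column_stack([np.ones(n), x])

e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]    # OLS residuals
e2 = e ** 2
Z = np.column_stack([np.ones(n), z])                # constant plus z_i
fit = Z @ np.linalg.lstsq(Z, e2, rcond=None)[0]     # auxiliary regression 13.1
R2 = 1 - ((e2 - fit) ** 2).sum() / ((e2 - e2.mean()) ** 2).sum()
lm_stat = n * R2                                    # compare with chi2(1)
```

Under the null of homoscedasticity lm_stat would be a χ²(1) draw; here it should comfortably exceed the 5% critical value of 3.84.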
13.2.2
White test
A similar logic underlies White's general test for heteroscedasticity. In this case the hypotheses are:

    H₀: σᵢ² = σ² for all i versus H₁: σᵢ² = xᵢAxᵢ′

where the row vector xᵢ is the vector of explanatory variables used in the regression and A is some symmetric matrix. The auxiliary regression in this case is

    eᵢ² = xᵢAxᵢ′ + vᵢ

If xᵢ does not include an intercept term, then a constant is added into this regression. Note that the auxiliary regression involves all squares and cross product terms of the explanatory variables. Some of these variables may need to be dropped (e.g. the square of a dummy variable is perfectly collinear with the variable itself).
13.2.3
Other tests
There are a number of other tests available. For instance we could assume that σᵢ² = α₀ + α₁E(yᵢ), in which case our auxiliary regression would involve regressing the square of the residuals on a constant and the fitted values ŷᵢ. nR² from the auxiliary regression would in this case be distributed as χ²(1).
Another test that is sometimes encountered is the Goldfeld-Quandt test (discussed in Gujarati (2003, p.408)). We could split the sample up into two groups and estimate the regression separately on the subsamples, i.e. our model is

    y₁ = X₁β + ε₁
    y₂ = X₂β + ε₂

If the errors are normally distributed, then our OLS estimates of the error variances on each subsample should be distributed as chi-square, with

    (n₁ − k) σ̂₁²/σ₁² ~ χ²(n₁ − k) and (n₂ − k) σ̂₂²/σ₂² ~ χ²(n₂ − k)

where the subscripts indicate which sample they are taken from. The ratio of these two, divided by their respective degrees of freedom, should be distributed as an F statistic, since these χ² statistics will obviously be independent of each other, i.e.

    (σ̂₁²/σ₁²) / (σ̂₂²/σ₂²) ~ F(n₁ − k, n₂ − k)

Under the null hypothesis σ₁² = σ₂² we get that

    σ̂₁²/σ̂₂² ~ F(n₁ − k, n₂ − k)
13.3
13.3.1
Breusch-Godfrey test

The Breusch-Godfrey test for autocorrelation of up to order p is based on the auxiliary regression

    e_t = x_tγ + Σ_{j=1}^{p} ρⱼ e*_{t−j} + v_t

where

    e*_{t−j} = e_{t−j} if t > j
             = 0      otherwise

i.e. the OLS residuals are regressed on the original explanatory variables and p lags of the residuals, with the unavailable initial lags set to zero. The nR² statistic from this regression is asymptotically distributed as χ²(p) under the null hypothesis of no autocorrelation.
13.3.2 Durbin-Watson d test

The most famous test for first order autocorrelation is the Durbin-Watson d test. The test statistic is given by

    d = Σ_{t=2}^{n} (e_t − e_{t−1})² / Σ_{t=1}^{n} e_t² ≈ 2(1 − ρ̂)

where ρ̂ is the sample autocorrelation coefficient. The critical values for this test are somewhat awkward, because they divide into regions in which the null hypothesis is rejected, where it is accepted and a region where the test is inconclusive! The procedure is adequately described in the undergraduate econometrics text books (e.g. Gujarati 2003, pp.467-471).

Mittelhammer et al. (2000, p.550) observe that an asymptotically equivalent test can be derived through the inverse auxiliary regression model

    e_{t−1} = ρe_t + v_t

and testing the null hypothesis that ρ = 0 (this could be done by means of a t-test). They suggest that unlike with the DW test, this test would be valid even if the error process is not normal.
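The d statistic and its link to ρ̂ can be checked numerically. This is an illustrative sketch; the AR(1) residual series with ρ = 0.5 is simulated.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
e = np.zeros(n)
for t in range(1, n):                  # residuals with first order autocorrelation
    e[t] = 0.5 * e[t - 1] + rng.normal()

d = ((e[1:] - e[:-1]) ** 2).sum() / (e ** 2).sum()
rho_hat = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])     # sample autocorrelation
```

With positive autocorrelation d falls below 2; here it should be close to 2(1 − 0.5) = 1.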
13.4
Pretest estimation
Some caution is appropriate if these diagnostic tests are used for model selection. In that case the process of estimation-testing-reestimation can be thought of as a different algorithm for arriving at estimates. For instance our pretest estimator might look as follows:

    β̂_pretest = β̂_FGLS if the diagnostic test rejects the null
              = β̂_OLS  otherwise

A typical example of this is where the analyst switches to using the Cochrane-Orcutt procedure once the DW test has failed.

One should note that the properties of the pretest estimator cannot be determined from the properties of the OLS and the FGLS estimators taken separately. When we analysed those properties we assumed that the analyst estimated the model once and once only. If the analyst engages in a serial specification search, the properties of the resulting estimator are likely to be very different from the theoretical properties that we outlined. Indeed there are some Monte Carlo results which suggest that the pretest estimator will fare badly particularly in the cases where its application is most likely to bind!
13.5
A warning note
Misspecification of the regression can frequently result in OLS residuals that look heteroscedastic
or autocorrelated. In particular omission of a relevant variable or the choice of an inappropriate
functional form can lead to such problems. Failure of a specification test may therefore be
grounds for rethinking your specification as much as for worrying about the error process.
13.6
Exercises
1. You are trying to estimate a PPP type of relationship on time series data. In particular your theoretical model can be represented as

    ln E_t = β₁ + β₂ ln P_t + β₃ ln P*_t + ε_t

where E_t is the exchange rate (in this case South African cents per dollar), P_t is the domestic price level (in this case given by the South African Producer Price Index) and P*_t is the foreign price level (given by US producer prices). You have estimated this on quarterly data from the first quarter of 1970 to the third quarter of 1997. Your empirical results are given in the (slightly edited) Stata output given below. You may find it useful to know that D is a dummy variable equal to one for the period from June 1984 to April 1994, i.e. the period of peak political conflict in South Africa. The operator L. in front of any variable is the lag operator, i.e. it refers to the previous period, so L.x would be equivalent to x_{t−1}. The Prais-Winsten regression is equivalent to a Cochrane-Orcutt regression with the Prais-Winsten correction.
                                                Number of obs =     109
                                                F(  3,   105) =    5.77
                                                Prob > F      =  0.0011
                                                R-squared     =  0.1414
                                                Adj R-squared =  0.1169
                                                Root MSE      =  .11144

------------------------------------------------------------------------------
       error |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     lnsappi |  -.0387676   .0343342    -1.13   0.261     -.106846    .0293108
     lnusppi |    .016065   .0820243     0.20   0.845    -.1465741    .1787042
           D |   .1131396   .0272041     4.16   0.000     .0591989    .1670802
       _cons |   .0157327    .251981     0.06   0.950    -.4838991    .5153645
------------------------------------------------------------------------------
(Prais-Winsten regression)

                                                Number of obs =     109
                                                F(  3,   105) =  267.37
                                                Prob > F      =  0.0000
                                                R-squared     =  0.8842
                                                Adj R-squared =  0.8809
                                                Root MSE      =  .05494

------------------------------------------------------------------------------
    lnexrate |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     lnsappi |   1.000519   .1079209     9.27   0.000     .7865319    1.214506
     lnusppi |  -1.005464   .2548018    -3.95   0.000    -1.510689   -.5002393
           D |    .003786   .0385567     0.10   0.922    -.0726648    .0802368
       _cons |   6.039999   .7748212     7.80   0.000     4.503672    7.576326
-------------+----------------------------------------------------------------
         rho |    .892923
------------------------------------------------------------------------------
. matrix list e(V)

symmetric e(V)[4,4]
               lnsappi    lnusppi          D      _cons
   lnsappi   .01164692
   lnusppi   .02526347  .06492398
         D   .00022855  .00017145  .00148662
     _cons   .06969469  .19399087  .00099607  .60034786
(a) What might be your theoretical expectations about the coefficients β₁, β₂ and β₃?
(b) Interpret the regression output of the first regression.
(c) Test the following hypotheses on the first regression by the appropriate F or t-test:
    i. H₀: β₂ = −β₃
    ii. H₀: β₂ = β₃ = 0
(d) Test to see if there is first-order autocorrelation.
(e) Use an LM type of test to see if the coefficient on the dummy variable in the first model is zero.
(f) Compare the Prais-Winsten regression to the OLS regression.
Part III
Chapter 14
Instrumental Variables
Comment: include testing with IV (incl heteroscedasticity), natural experiments (some more stuff on weak instruments?)
14.1
Introduction
In this chapter we will be considering violations of the assumption that the X variables are
exogenous. In practice this is an enormously important topic and one that is still very much the
subject of ongoing theoretical work.
14.1.1
The model
In this case we are making the following assumptions about the DSP:
1. Y: We assume that Y is a univariate variable with continuous, unlimited range.
2. : The function is linear in X and , additive in
3. X: The X variables are endogenous/correlated with .
4. : The parameters are fixed.
5. : The disturbances are independent and identically distributed with (X) =
6. (eX ): The distribution of the error terms is left unspecified.
7. : The parameter space is unrestricted.
14.1.2

Consequently

    β̂ = (X′X)⁻¹X′y = β + (X′X)⁻¹X′ε

and hence

    E(β̂) = β + E[(X′X)⁻¹X′ε] ≠ β

except in very particular circumstances. Furthermore if E(xᵢ′εᵢ) = m (where xᵢ is a row of X) then we can apply a central limit theorem to show that for well-behaved cases

    plim (1/n)X′ε = m

so that

    plim β̂ = β + Q⁻¹m ≠ β
14.1.3 Examples

Omitted variables

One of the cases that we have already looked at is omitted variable bias. In some cases there is a straightforward solution: include the omitted variable! In many contexts, however, the variable may not have been measured in the data set, or it may even be unmeasurable. One example that has been extensively analysed in the labour economics literature is that of the relationship between schooling and wages. Consider the DSP given by

    ln wᵢ = β₂sᵢ + β₃aᵢ + uᵢ

where s is schooling (in years) and a is innate ability and everything is expressed in deviations from the respective means.¹ We assume that E(sᵢaᵢ) > 0, i.e. someone of higher ability will find it easier to attain more schooling, everything else held constant. Generally, however, it is very difficult to measure innate ability, so we will estimate the model

    ln wᵢ = β₂sᵢ + εᵢ

where εᵢ = β₃aᵢ + uᵢ. We note that E((1/n)s′ε) = β₃E((1/n)s′a) ≠ 0. We have already seen that OLS will lead to biased results in this case:

    β̂₂ = (s′s)⁻¹s′(β₂s + β₃a + u)
       = β₂ + β₃π̂ + (s′s)⁻¹s′u

where π̂ = (s′s)⁻¹s′a is the coefficient that we would have obtained in a regression of ability on schooling. It is straightforward to see that the OLS estimator will be inconsistent, with

    plim β̂₂ = β₂ + β₃ σ_{sa}/σ_s²

where σ_s² = V(sᵢ) = plim (1/n)s′s and σ_{sa} = cov(sᵢ, aᵢ).

In this particular context we would expect β₃ > 0 and σ_{sa} > 0, so the estimated returns to schooling from the typical regression will be overestimated, since part of the measured effect of schooling is due to the fact that schooling is correlated with ability, but we have not been able to control for ability.

¹ This is justified in terms of the FWL theorem. Note that the deviations model does not include the constant β₁.
Systems of equations

Another context in which the explanatory variables might become correlated with the error term is if the relationship that we are trying to estimate is in fact part of a system of equations. One of the simplest text book examples of this is given by the macroeconomic consumption function

    C_t = α + βY_t + ε_t    (14.1)
    Y_t = C_t + I_t    (14.2)

Solving for Y_t gives the reduced form

    Y_t = α/(1−β) + I_t/(1−β) + ε_t/(1−β)

From this last equation it is immediately obvious that E(Y_tε_t) ≠ 0. Consequently (as before) OLS estimates of equation 14.1 will produce biased and inconsistent coefficients.
Measurement error

This is an interesting topic which we will explore in more detail below. For the moment let us assume that the DSP is given by

    yᵢ = βx*ᵢ + εᵢ

But we do not measure x* accurately. Instead, we measure

    xᵢ = x*ᵢ + νᵢ

where νᵢ is a random error, which we assume to be uncorrelated with x*ᵢ. The model that we are able to estimate is given by

    yᵢ = β(xᵢ − νᵢ) + εᵢ
       = βxᵢ + (εᵢ − βνᵢ)
14.1.4
The fundamental problem in all cases is that we generally do not have the luxury of controlling
the level of explanatory variables. Unlike the settings on a machine in a laboratory which can
be preset to specified levels, we cannot control the level of schooling that our research subjects
have; we cannot easily loosen or tighten their budget constraints or force them to reveal their
private information truthfully. As such the issue that we are addressing here goes to the heart
of the estimation problems facing applied economists.
14.2
In all these cases the theoretical solution is given by the instrumental variables estimator. The simple IV estimator is based on the assumption that there is a matrix of instruments W such that:

    plim (1/n)W′ε = lim (1/n)E(W′ε) = 0    (IV Assumption 1)
    plim (1/n)W′W = lim (1/n)E(W′W) = Q_{WW}, where Q_{WW} is positive definite    (IV Assumption 2)
    plim (1/n)W′X = Q_{WX} and has full rank    (IV Assumption 3)

It is defined as

    β̂_IV = (W′X)⁻¹W′y    (14.3)

14.2.1 Rationale
We will show in a moment that this estimator is consistent. Before we do so, it is useful to consider how we might arrive at even thinking about an estimator of this sort. The first point to note is that the fundamental assumption is the first one given above, viz.

    E(W′ε) = 0    (14.4)

i.e. we need a set of variables that are uncorrelated with (orthogonal to) the error term in the regression equation. If we can find such variables, they are instrumental in solving our estimation problem. Equation 14.4 is a population moment condition, so we might think of applying a method of moments logic to the estimation. In this case we would get

    (1/n)W′(y − Xβ̂) = 0    (14.5)

These could hold trivially if W′y = W′X = 0. Provided, however, that W′X has full rank (which we have assumed at least asymptotically) we can solve for β̂, which will give us the equation of the instrumental variables estimator given above. In short we require the set of instruments to be uncorrelated with the errors but sufficiently correlated with the explanatory variables.
Intuitively we can think of the situation as sketched out in Figure 14.1. We have E(εx) ≠ 0 so that as ε changes, so does x. This means that it is difficult for OLS to decompose the observed changes in y into changes which occur due to changes in x and changes in ε. The instrumental variable w in essence acts like a seismometer: it moves when x moves, but it does not change when ε changes. By observing w we can decompose the changes in x into real changes (independent of ε) and changes which are correlated with ε. It is intuitively obvious that we should be able to retrieve an estimate of β by observing the relationship between w and y and the induced relationship between w and x.

Figure 14.1: A schematic representation of the model y = xβ + ε, x = wπ + v, E(wε) = 0, E(εx) ≠ 0.

If we assume that

    y = xβ + ε,  x = wπ + v

it follows that

    y = wπβ + (vβ + ε)    (14.6)

Let γ = πβ. We can estimate the relationship between y and w through OLS and get an estimate of γ. Similarly we can estimate the relationship between x and w and get an estimate of π. It is now easy to get an estimate of β as

    β̂ = γ̂/π̂

It is obvious that this should give consistent estimates, since γ̂ and π̂ give consistent estimates of γ and π respectively. This is effectively what the instrumental variables estimator does. In this particular case

    γ̂ = (w′w)⁻¹w′y,  π̂ = (w′w)⁻¹w′x

so

    β̂ = (w′x)⁻¹w′y

because the (w′w) terms divide out. In order for this estimation strategy to work, we require in particular the assumption that w is uncorrelated with ε (because otherwise we cannot estimate the relationship between y and w consistently). It is also evident that we potentially run into trouble if our estimate of π is very small. Unfortunately even if w and x are strongly correlated within the population, it is always possible to get a pathological sample in which π̂ happens to be too small. This means that instrumental variables estimation should be employed with suitable caution! We will discuss this in more detail below.
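The ratio logic can be checked on simulated data. The sketch below is illustrative: the coefficient β = 2 and the strength of the instrument are assumptions, and the endogeneity is created through a shared shock u.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000
w = rng.normal(size=n)
u = rng.normal(size=n)                  # shock shared by x and the error
x = 0.8 * w + u + rng.normal(size=n)
eps = u + rng.normal(size=n)            # correlated with x through u
y = 2.0 * x + eps                       # deviations-from-means model

gamma_hat = (w @ y) / (w @ w)           # regress y on w
pi_hat = (w @ x) / (w @ w)              # first stage: x on w
beta_iv = gamma_hat / pi_hat            # equals (w'x)^{-1} w'y
beta_ols = (x @ y) / (x @ x)            # biased upwards since E(x eps) > 0
```

The (w′w) terms cancel in the ratio, so beta_iv coincides with the IV formula, while the OLS estimate is pulled above 2 by the shared shock.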
14.2.2 Consistency

    β̂_IV = (W′X)⁻¹W′y = β + (W′X)⁻¹W′ε

so

    plim β̂_IV = β + plim[(1/n)W′X]⁻¹ plim (1/n)W′ε
              = β + Q_{WX}⁻¹ · 0
              = β
Note that in general the IV estimator will not be unbiased, since the random variables (W′X)⁻¹W′ and ε are not independent of each other. So even if we could condition on W, we will not be able to break the expression E[(W′X)⁻¹W′ε] up, i.e.

    E[(W′X)⁻¹W′ε | W] ≠ E[(W′X)⁻¹W′ | W] E[ε | W]
The error variance σ² can be estimated from the IV residuals. Writing

    e = y − Xβ̂_IV = ε − X(W′X)⁻¹W′ε

we have

    (1/n)e′e = (1/n)ε′ε − (2/n)ε′X(W′X)⁻¹W′ε + (1/n)ε′W(X′W)⁻¹(X′X)(W′X)⁻¹W′ε

so

    plim (1/n)e′e = σ²

since plim (1/n)W′ε = 0 while the remaining factors converge to finite limits.

14.2.3 Asymptotic normality
We observe that

    √n(β̂_IV − β) = [(1/n)W′X]⁻¹ (1/√n)W′ε

So provided that we can apply a central limit theorem to (1/√n)W′ε (which we should be able to), we will have

    √n(β̂_IV − β) →d N(0, σ²Q_{WX}⁻¹Q_{WW}Q_{XW}⁻¹)

i.e. asymptotically

    β̂_IV ≈ N(β, (σ²/n)Q_{WX}⁻¹Q_{WW}Q_{XW}⁻¹)

14.3
In the discussion thus far we have assumed that we have exactly as many instruments as we have explanatory variables. If we have more instruments we potentially have different ways of estimating our coefficients. Which of these would be best?
One possibility is suggested by Figure 14.1. Another way of understanding what is happening in that case is to write x as

    x = E(x|w) + v

so that we are in effect breaking x up into two components: one that reflects the correlation with w and one that is independent of it. The sample analogue of this is

    x = x̂ + v̂

where v̂ is the set of residuals from the regression of x on w. The fitted values and these residuals are guaranteed to be uncorrelated with each other. If we write our model now in the form

    y = (x̂ + v̂)β + ε
      = x̂β + u    (14.7)
14.3.1
This case suggests a simple way in which we can generalise our approach to the situation in which we have more than one instrument for x. Suppose that we have two instruments w₁ and w₂. Obviously we could use either one to obtain consistent estimates, but this is obviously not the best use of the information available. Indeed, if w₁ is correlated with x and w₂ is also correlated with x and both are uncorrelated with ε, then any linear combination w₁π₁ + w₂π₂ will also be correlated with x and uncorrelated with ε.

Our approach in this case is to regress x on the instruments w₁ and w₂ and then use the fitted values x̂ = w₁π̂₁ + w₂π̂₂ instead of x in the regression. Note that as in equation 14.7 the fitted values will be uncorrelated with both v̂ and ε and so the coefficient β can be consistently estimated by OLS. This procedure is referred to as two stage least squares, because in the first stage we create the fitted values x̂ and in the second stage we run the regression y = x̂β + u. Observe that the first stage creates the optimal linear combination w₁π̂₁ + w₂π̂₂; optimal in the sense that the correlation between the linear combination and x will be maximised.
The estimate obtained by the 2SLS procedure is

    β̂ = (x̂′x̂)⁻¹x̂′y

But x̂ = P_W x, where P_W = W(W′W)⁻¹W′, so

    β̂ = (x′P_W x)⁻¹x′P_W y

This result generalises if there is more than one endogenous variable. Indeed if the DSP is given by

    y = Xβ + ε
we can think of the matrix of instruments W for X, where some of the columns of W may simply be identical to some of the columns of X (if those particular variables are uncorrelated with ε). The generalised IV estimator or two stage least squares estimator is defined as:

    β̂_IV = (X′P_W X)⁻¹X′P_W y
         = [X′W(W′W)⁻¹W′X]⁻¹ X′W(W′W)⁻¹W′y

Note that this estimator is defined only if the W matrix has at least the same rank as the X matrix. In the latter form it is clear that there is no need to do the estimation in two stages. Indeed in general we would not want to do the estimation in two stages, since the estimate of the error variance σ² would then be incorrect: instead of being based on e′e/n where e = y − Xβ̂_IV, it would be based on ẽ′ẽ/n where ẽ = y − X̂β̂_IV.

Writing

    β̂_IV = (X′P_W X)⁻¹X′P_W y = β + (X′P_W X)⁻¹X′P_W ε

it can be shown, along the same lines as in the simple IV case, that

    plim (1/n)e′e = σ²

i.e. σ̂² = e′e/n is a consistent estimator of σ². We will estimate the covariance matrix as σ̂²(X′P_W X)⁻¹. Note that in this case there is no compelling case for dividing by n − k, since it is not the case that E(e′e) = (n − k)σ². Indeed there is no intrinsic reason why e′e/n should underestimate σ², since the instrumental variables procedure is not based on minimising e′e. Of course asymptotically it makes little difference whether one divides by n or n − k.
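A one-step sketch of the generalised IV estimator on simulated data (two instruments, one endogenous regressor; all parameter values are invented):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4000
w1, w2 = rng.normal(size=n), rng.normal(size=n)
u = rng.normal(size=n)
x = 0.5 * w1 + 0.5 * w2 + u + rng.normal(size=n)   # endogenous through u
y = 1.0 + 2.0 * x + (u + rng.normal(size=n))

X = np.column_stack([np.ones(n), x])
W = np.column_stack([np.ones(n), w1, w2])          # exogenous column instruments itself
PW_X = W @ np.linalg.lstsq(W, X, rcond=None)[0]    # P_W X: first stage fitted values
beta_iv = np.linalg.solve(PW_X.T @ X, PW_X.T @ y)  # (X'P_W X)^{-1} X'P_W y
e = y - X @ beta_iv                                # residuals use X, not X-hat
sigma2 = (e @ e) / n
```

Note that the residuals for σ̂² are computed with the actual X, as the text cautions, not with the first stage fitted values.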
14.3.2
One of the key conditions in order for instrumental variables estimation to be valid is that

    plim (1/n)W′ε = 0

We used this population moment condition to derive the IV estimator in equation 14.5. Note, however, that the sample moment conditions

    (1/n)W′y = (1/n)W′Xβ̂

have a unique solution only if there are exactly k equations. If there are more columns in the W matrix than in the X matrix we have more equations than unknowns. This is why we refer to this situation as the overidentified situation. In general there will be no value of β̂ that will solve all of these equations. In essence we could take any k equations and get a different set of sample estimates. We would assume, however, that if the population condition is really true then these estimates should all be approximately equal. We can test for the validity of these overidentifying restrictions by means of a simple test (Davidson and MacKinnon 1993, pp.232-237). The procedure is as follows: we regress the IV residuals e on the set of instruments W. From this regression we calculate n times the uncentered R². This will be distributed approximately as χ²(l − k) where l is the rank of the W matrix and k is the rank of the X matrix. Note that we can obviously not use this test if l = k.
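The overidentification test described here can be sketched as follows (simulated data with l = 3 instrument columns and k = 2 regressors; since the instruments are valid by construction, the statistic should be a small χ²(1) draw):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4000
w1, w2 = rng.normal(size=n), rng.normal(size=n)
u = rng.normal(size=n)
x = 0.5 * w1 + 0.5 * w2 + u + rng.normal(size=n)
y = 1.0 + 2.0 * x + (u + rng.normal(size=n))

X = np.column_stack([np.ones(n), x])               # k = 2
W = np.column_stack([np.ones(n), w1, w2])          # l = 3: overidentified
PW_X = W @ np.linalg.lstsq(W, X, rcond=None)[0]
beta_iv = np.linalg.solve(PW_X.T @ X, PW_X.T @ y)
e = y - X @ beta_iv                                # IV residuals

fitted = W @ np.linalg.lstsq(W, e, rcond=None)[0]  # regress e on the instruments
R2u = (fitted @ fitted) / (e @ e)                  # uncentered R^2
overid_stat = n * R2u                              # compare with chi2(l - k)
```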
14.4
14.4.1
If we let W = X we see that OLS is just a special case of IV estimation, with the X variables as instruments for themselves!

There is an additional relationship between them. We note that the IV estimator is a linear estimator which will be unbiased in the special case where X is independent of ε. So if the assumptions of the Gauss-Markov theorem hold, we come to the conclusion that the IV estimator would be a linear unbiased estimator and hence by the Gauss-Markov theorem less efficient than the OLS estimator. In short, if X is uncorrelated with the errors we would prefer to run OLS rather than IV estimation with some instruments W ≠ X.
14.4.2
This suggests that it is in general an interesting question to see if X is uncorrelated with the error vector ε. In particular we wish to test the hypothesis

    H₀: y = Xβ + ε, ε ~ (0, σ²Iₙ), E(X′ε) = 0
    H₁: y = Xβ + ε, ε ~ (0, σ²Iₙ), E(W′ε) = 0

Note that our test supposes that β̂_IV is definitely consistent, but inefficient under H₀. By contrast β̂_OLS is efficient and consistent under H₀, but inconsistent under H₁. Tests of inefficient but consistent estimators against possibly efficient estimators can be carried out by means of a Hausman test (this discussion is based on Davidson and MacKinnon 2004, pp.341-342). The intuition for these tests follows from the fact that the inefficient estimator can be written asymptotically as the sum of the efficient estimator plus an independent noise variable, i.e.

    n^{1/2}(β̂_IV − β₀) = n^{1/2}(β̂_OLS − β₀) + v

Since the two terms on the right hand side are independent of each other, we have asymptotically

    Var[n^{1/2}(β̂_IV − β₀)] = Var[n^{1/2}(β̂_OLS − β₀)] + Var(v)

Furthermore

    v = n^{1/2}(β̂_IV − β̂_OLS)

and hence has zero mean under H₀. A suitable test statistic is therefore given by

    (β̂_IV − β̂_OLS)′ [V̂(β̂_IV) − V̂(β̂_OLS)]⁻¹ (β̂_IV − β̂_OLS)

If there are explanatory variables that are instruments for themselves then the matrix V̂(β̂_IV) − V̂(β̂_OLS) may have rank less than k (in general it should be of the order k₂, where k₂ is the number of endogenous variables). Furthermore there is no guarantee that in finite samples the matrix V̂(β̂_IV) − V̂(β̂_OLS) is of full rank or that it is positive definite. As Davidson and MacKinnon (2004, pp.341-342) note, one can base a valid test on a subvector. In this context one might want to create the test statistic from only the coefficients on the endogenous elements of X. The χ² would have k₂ degrees of freedom. Indeed this would be the preferable version of the test.
14.4.3
It turns out that one can implement a test of the hypothesis given above by means of an artificial regression. The idea is straightforward. Assume that the model is given by

    y = X₁β₁ + Z₂β₂ + ε

Here we assume that the k₂ variables Z₂ are endogenous and that we have a matrix W = [X₁, W₂] of instruments, where the number of elements in W₂ is at least k₂. We now run the auxiliary regression

    y = X₁β₁ + Z₂β₂ + M_W Z₂δ + ε    (14.8)

Note that M_W Z₂ is the set of residuals from the first stage regression of Z₂ on all the instruments W. A test of the hypothesis that δ = 0 amounts to a test of the hypothesis that the OLS and IV coefficients are equal. One useful feature of the auxiliary regression 14.8 is that the OLS estimates of β₁ and β₂ are numerically identical to the IV estimates! This is fairly easy to show (see Exercises).
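The artificial regression can be sketched as follows, with X₁ a constant, one endogenous column Z₂ = x and one outside instrument w (all simulated; x is endogenous by construction, so the t statistic on the first stage residuals should be large):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 3000
w = rng.normal(size=n)
u = rng.normal(size=n)                              # source of endogeneity
x = 0.7 * w + u + rng.normal(size=n)
y = 1.0 + 2.0 * x + (u + rng.normal(size=n))

W = np.column_stack([np.ones(n), w])                # instruments [X1, W2]
v_hat = x - W @ np.linalg.lstsq(W, x, rcond=None)[0]   # M_W Z2: first stage residuals
A = np.column_stack([np.ones(n), x, v_hat])         # auxiliary regression 14.8
coef = np.linalg.lstsq(A, y, rcond=None)[0]
e = y - A @ coef
s2 = (e @ e) / (n - A.shape[1])
se = np.sqrt(s2 * np.diag(np.linalg.inv(A.T @ A)))
t_delta = coef[2] / se[2]                           # t test of delta = 0
```

The coefficient on x here reproduces the IV estimate, while a significant t_delta signals that the OLS and IV coefficients differ.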
14.5
14.5.1
The logic underpinning instrumental variable estimation is asymptotic. It turns out, however, that the finite sample properties of IV are problematic. In general the sampling distributions are very difficult to derive, but in some typical cases the sample distribution of β̂_IV will have l − k moments, where l is the number of instruments and k is the number of regressors. In particular if l = k the sample distribution of β̂_IV will have no mean! This means that the tails of the distribution are very fat, i.e. extreme outcomes will occur fairly often. Even if we have an extra instrument the distribution will have no variance, which again points to the possibility of extreme outcomes.

It may seem strange that it is possible for the IV estimator to be asymptotically well-behaved, while so badly behaved in small samples. As Davidson and MacKinnon (2004, p.327) note, it is not the case that if a sequence of random variables converges to a limiting distribution then the sequence of moments will converge. In this case the limiting distribution has all moments, whereas in the exactly identified case, none of the random variables in the sequence has any moments at all!

Convergence, of course, implies that the CDFs converge to the CDF of the limiting distribution. To that extent the asymptotic distribution can yield valid p values and confidence intervals.

Where β̂_IV has a mean, this mean is typically biased. Indeed this bias will tend to increase with the degree of overidentification. This arises from the fact that the better the first stage regression fits, i.e. the closer the fitted values are to the endogenous variables themselves, the closer the IV results will be to the OLS ones.
14.5.2
Weak instruments
Asymptotically it does not matter how weak the correlation between the instrument and the endogenous variable is: any correlation can be good enough to identify the structural relationship. In practice, however, weak instruments can create many problems. In the first place, with weak instruments there can be substantial departures from the asymptotic distributions even with hundreds of thousands of observations. This means that standard inference procedures can be very unreliable.
Secondly, weak instruments will lead to large standard errors, so that even correctly estimated coefficients may turn out to be nonsignificant. It is quite easy to see this in relation to the standard formula for the variance of an OLS estimate, given in equation 8.12, i.e.

    var(β̂_k) = σ² / [(1 − R_k²) Σ_i (x_ik − x̄_k)²]

where R_k² is the R² from regressing x_k on the other explanatory variables. Consider the case where the variable x_K is the only endogenous variable and where the instruments not included in the structural equation have weak explanatory power for x_K (after controlling for x_1, ..., x_{K−1}). We know that the 2SLS estimates use the fitted values x̂_K, but these will now be highly correlated with the other explanatory variables, so R_K² will be close to one. Perforce the IV estimates will have higher standard errors.
A key issue that arises in this context is how to detect weak instruments and what to do about them. A basic precondition is that the instruments that are not in the main regression should be jointly significant in the first stage regression, i.e. they should have explanatory power in addition to the variables that are included in the regression. As Stock, Wright and Yogo (2002, p.522) point out, however, in many cases the first-stage F statistic will have to be large, typically above 10, for inference to be reliable. It is these days regarded as unacceptable to publish IV and 2SLS results without reporting these diagnostic statistics.
Nevertheless even where the instruments are weak, there are now techniques available for
providing more reliable inference. An accessible discussion is provided by Murray (2006).
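As a concrete illustration of this diagnostic, the sketch below computes the first-stage F statistic for a single, deliberately weak, instrument on simulated data. The DGP and all variable names are assumptions for the example, not part of the text.

```python
import numpy as np

# Simulated weak-instrument design (assumed DGP, for illustration only)
rng = np.random.default_rng(42)
n = 1000
z = rng.normal(size=n)                   # instrument
x = 0.1 * z + rng.normal(size=n)         # endogenous regressor, weakly related to z

# First-stage regression x = a + b z + v
Z = np.column_stack([np.ones(n), z])
b = np.linalg.lstsq(Z, x, rcond=None)[0]
resid = x - Z @ b

# F statistic for H0: coefficient on z is zero (one restriction)
rss_r = np.sum((x - x.mean()) ** 2)      # restricted model: intercept only
rss_u = np.sum(resid ** 2)
F = (rss_r - rss_u) / (rss_u / (n - 2))
print(F)   # with a design this weak, F will often fall below the rule-of-thumb 10
```

In an applied setting the same statistic is reported by the first-stage diagnostics of standard IV routines.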
14.6
Omitted variables
Above we showed that the omission of a relevant variable can lead to regressors that are correlated with the error term. Let us write the model in the form

    y = x_1β_1 + ⋯ + x_Kβ_K + γz + u                                        (14.9)

where z is a variable that is not measured (or measurable). If z is correlated with the x variables, then we have the standard case of omitted variable bias, with plim β̂_j ≠ β_j in general.
One response is to look for an indicator variable z_1 for z, i.e. a variable obeying

    z_1 = δ_1 + δ_2 z + e_1                                                 (14.10)

Substituting z out of equation 14.9 yields the modified regression

    y = α_1 + x_1β_1 + ⋯ + x_Kβ_K + γ_1 z_1 + u_1                           (14.11)

with γ_1 = γ/δ_2 and u_1 = u − γ_1 e_1. Since z_1 is correlated with e_1 (and hence with u_1), OLS estimation of 14.11 is still inconsistent. Suppose, however, that we have a second indicator

    z_2 = θ_1 + θ_2 z + e_2

and θ_2 ≠ 0, E(x′e_2) = 0, E(ue_2) = 0 and E(e_1e_2) = 0. With these assumptions z_2 is a valid instrument for z_1, and the variables x are valid instruments for themselves in the modified regression 14.11. The parameters of that regression can therefore be estimated consistently.
As Wooldridge (2002, pp.63–67) discusses, under certain conditions the omitted variable problem can also be addressed by means of proxy variables, using OLS. The key difference between a proxy variable and an indicator variable is that we assume that we can write

    z = λ_1 + z_1λ_2 + r

(compare to regression 14.10) where we now assume that E(r) = 0 and E(z_1 r) = 0. With these assumptions the main regression can be written in the form 14.11, but with u_1 = u + γr. With the assumptions that we made this regression can be consistently estimated by OLS.
14.7
Measurement error
The indicator variable model given in equation 14.10 can be thought of as an errors in variables model if δ_1 = 0 and δ_2 = 1. If, in addition, Cov(z, e_1) = 0, then we have the case of classical measurement error.
14.7.1
Attenuation bias
Suppose the true model is y = zβ_2 + u, but we only observe z_1 = z + e_1. The OLS estimator obtained by regressing y on z_1 is

    β̂_2 = Σ_i z_1i y_i / Σ_i z_1i²

where both z_1 and y are written as deviations from their respective means. Consequently

    plim β̂_2 = plim (1/n) Σ_i z_1i (z_i β_2 + u_i) / plim (1/n) Σ_i z_1i²
             = β_2 + [plim (1/n) Σ_i z_1i u_i − β_2 plim (1/n) Σ_i z_1i e_1i] / plim (1/n) Σ_i z_1i²
             = β_2 + [0 − β_2 Var(e_1)] / Var(z_1)
             = β_2 [1 − Var(e_1)/Var(z_1)]
             = β_2 Var(z) / [Var(z) + Var(e_1)]                             (14.12)

where the second line uses the substitution z_i = z_1i − e_1i. This formula shows that in the case of classical measurement error the OLS coefficient estimate will be attenuated, i.e. biased towards zero.
We can invoke the Frisch–Waugh–Lovell theorem to work out what would happen to the bias if we add some correctly measured covariates, i.e. if our model is now given by equation 14.9. In this case the OLS coefficient β̂_2 is identical to the coefficient in the regression

    ỹ = z̃_1 β_2 + error

where ỹ and z̃_1 are the residuals obtained by regressing y and z_1 respectively on the covariates. If the covariates are uncorrelated with the measurement error, then the regression error u is unaffected. It is easy to show now that

    plim β̂_2 = β_2 Var(r) / [Var(r) + Var(e_1)]

where r is the residual that we would obtain if we projected the correctly measured variable z on the covariates. Since Var(r) ≤ Var(z), it is easy to see that the addition of correctly measured covariates can increase the problem of attenuation. The more collinear the other explanatory variables are with z, the worse the problem is likely to be.
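The attenuation result in equation 14.12 is easy to verify by simulation. The following sketch uses an assumed DGP (Var(z) = 9, Var(e_1) = 1, so the theoretical attenuation factor is 0.9); all names and numbers are illustrative.

```python
import numpy as np

# Simulation sketch of attenuation bias (assumed DGP, not from the text)
rng = np.random.default_rng(0)
n = 200_000
beta2 = 2.0
z = rng.normal(scale=3.0, size=n)     # true regressor, Var(z) = 9
e1 = rng.normal(scale=1.0, size=n)    # classical measurement error, Var(e1) = 1
z1 = z + e1                           # observed, error-ridden regressor
y = beta2 * z + rng.normal(size=n)

b_ols = np.sum(z1 * y) / np.sum(z1 ** 2)   # OLS slope on the mismeasured variable
attenuation = 9 / (9 + 1)                  # Var(z)/(Var(z) + Var(e1)) = 0.9
print(b_ols, beta2 * attenuation)          # the two numbers should be very close
```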
14.7.2
Correcting for attenuation bias
The attenuation bias formula 14.12 can be used to correct the OLS estimates, provided that we have a consistent estimator either of Var(z) or of Var(e_1). This may be available from other sources. For instance, if we have access to administrative records, we may know precisely what Var(z) is in the population, even though we do not have z measured accurately in our sample. Sometimes detailed validation studies on subsamples can provide estimates of the variance of the measurement error, i.e. Var(e_1). In these circumstances the errors in variables estimator is given by

    β̂_EIV = [Var(z_1)/Var(z)] β̂_OLS
14.7.3
If we have another indicator variable for z we can use the second indicator as an instrument for
z1 , provided that the error component in z2 is uncorrelated with the measurement error in z1 .
14.8
Exercises
1. Show that the OLS coefficients β̂_1 and β̂_2 in the artificial regression 14.8 are identical to the IV coefficients

    β̂ = (X′P_W X)⁻¹ X′P_W y

where X = [X_1 Z_2].
Hint: Show that the IV coefficients are identical to the coefficients in the OLS regression

    y = X_1β_1 + P_W Z_2 β_2 + ε

Show that these in turn give identical coefficients β̂_1 and β̂_2 in the artificial regression

    y = X_1β_1 + P_W Z_2 β_2 + M_W Z_2 ρ + ε

2. Show that the IV residuals e = y − Xβ̂ have a sample mean of zero, provided that the intercept features in the list of instruments. Show that this implies that the usual (centred) R² can be used in the overidentification test.
3. Acemoglu, Johnson and Robinson (2001) have suggested that malaria deaths in the seventeen- and eighteen-hundreds, i.e. at the beginning of the process of colonisation, might provide a useful instrument for the quality of governance institutions in a cross-sectional regression.
(a) Sketch out the argument for why malaria deaths may be a good instrument. (Read the article!)
(b) What might be some of the problems with this instrument?
4. You are given the regression model

    y = β_1 + β_2 x* + ε

where y is the vector of the log of wages and x* is the vector of the (true) level of schooling. We assume that this model obeys the standard assumptions of the Classical Linear Regression Model. Unfortunately schooling is measured badly in your data set. Indeed you have reason to believe that measured schooling x is given by

    x = x* + u

where Cov(x*, u) = 0 and Cov(ε, u) = 0. On your data set you observe that Var(x) = 96. You also have a study available which suggests that Var(u) = 15. On top of this you have data available for a subset of your observations on the schooling of a sibling. This variable z is also badly measured, i.e.

    z = z* + v

where Cov(z*, v) = 0 and Cov(u, v) = 0.
(a) Derive an expression for the asymptotic value of the OLS estimator.
(b) What would be the appropriate estimator of β_2?
(c) Under what circumstances could you use z as an instrument for x? Explain.
5. You are given the following model:

    logpay_i = β_1 + β_2 highed_i + β_3 exper_i + u_1i
    highed_i = α_1 + α_2 parent_ed_i + u_2i

where logpay is the log of wages, highed is education, exper is experience and parent_ed is the education of the individual's parents. The estimated versions of both equations below also include metro and race dummies. You obtain the following Stata output:
                                                Number of obs =     816
                                                F(  7,   808) =   59.25
                                                Prob > F      =  0.0000
                                                R-squared     =  0.3392
                                                Adj R-squared =  0.3335
                                                Root MSE      =  .76818
------------------------------------------------------------------------------
      logpay |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      highed |   .1333759   .0091304    14.61   0.000     .1154538     .151298
       exper |   .0312511   .0051414     6.08   0.000      .021159    .0413433
   _Imetro_2 |   .1612915   .0754299     2.14   0.033     .0132297    .3093532
   _Imetro_3 |   .4565405   .0724531     6.30   0.000      .314322     .598759
    _Irace_2 |   .0398263   .0883555     0.45   0.652    -.1336071    .2132597
    _Irace_3 |   .2365166   .1069858     2.21   0.027     .0265137    .4465195
    _Irace_4 |    .477008   .1072434     4.45   0.000     .2664995    .6875166
       _cons |   4.850398   .1244376    38.98   0.000     4.606139    5.094657
------------------------------------------------------------------------------
                                                Number of obs =     816
                                                F(  7,   808) =   80.15
                                                Prob > F      =  0.0000
                                                R-squared     =  0.4098
                                                Adj R-squared =  0.4047
                                                Root MSE      =  2.8652
------------------------------------------------------------------------------
      highed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   parent_ed |   .2288393   .0310682     7.37   0.000     .1678555    .2898232
       exper |  -.2888812   .0162631   -17.76   0.000     -.320804   -.2569584
   _Imetro_2 |    .551927   .2826728     1.95   0.051    -.0029326    1.106787
   _Imetro_3 |   .4677897   .2761861     1.69   0.091    -.0743371    1.009917
    _Irace_2 |  -.6611566    .331181    -2.00   0.046    -1.311233   -.0110801
    _Irace_3 |   .1372697   .4034849     0.34   0.734    -.6547326    .9292719
    _Irace_4 |  -.1738147   .4272961    -0.41   0.684    -1.012556    .6649266
       _cons |   10.73762   .2738222    39.21   0.000     10.20014    11.27511
------------------------------------------------------------------------------
. predict u_ed, res
(1507 missing values generated)
. reg logpay highed exper _I* u_ed
      Source |       SS       df       MS              Number of obs =     816
-------------+------------------------------           F(  8,   807) =   55.94
       Model |  257.400261     8  32.1750326           Prob > F      =  0.0000
    Residual |  464.168688   807  .575178052           R-squared     =  0.3567
-------------+------------------------------           Adj R-squared =  0.3503
       Total |  721.568948   815  .885360673           Root MSE      =   .7584
------------------------------------------------------------------------------
      logpay |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      highed |   .2964423   .0359358     8.25   0.000     .2259036    .3669809
       exper |   .0816785   .0118951     6.87   0.000     .0583296    .1050274
   _Imetro_2 |   .0236001   .0800533     0.29   0.768    -.1335372    .1807374
   _Imetro_3 |   .3015268   .0788048     3.83   0.000     .1468403    .4562134
    _Irace_2 |   .1050021   .0883318     1.19   0.235     -.068385    .2783893
    _Irace_3 |   .1383059   .1076816     1.28   0.199    -.0730632    .3496751
    _Irace_4 |   .3206584   .1110075     2.89   0.004      .102761    .5385558
        u_ed |  -.1740156   .0371227    -4.69   0.000     -.246884   -.1011472
       _cons |   2.923025   .4291273     6.81   0.000     2.080688    3.765362
------------------------------------------------------------------------------
The regression is estimated over individuals where the parents' education could be determined.
(a) Given the regression output, what would be the estimate of the returns to education if you were to estimate the first equation by instrumental variables, using parents' education as an instrument for own education?
(b) Perform a Hausman test for the difference between the OLS and IV estimates. How might you explain the results?
(c) Do the results suggest that you might have the problem of weak instruments?
(d) Interpret both the OLS and the IV estimates of the first equation.
(e) Discuss the empirical results in relation to the following two possible reasons for the use of instrumental variables:
  - omitted variable bias in the main regression
  - measurement error in the schooling variable.
(f) What assumptions would you need to make for the OLS estimates to be valid? And
what assumptions are required in order for the IV estimates to be valid? Do you think
that any of these assumptions hold in this case?
Chapter 15
Estimation by Generalised
Method of Moments (GMM)
This chapter introduces a powerful generalisation of the Method of Moments considered earlier, applicable if we have more equations than unknown parameters. The techniques introduced in
this chapter can be applied to individual equations as well as to systems of equations. In this
chapter we will outline the basis of the approach and show some applications. In the next chapter
we will apply it specifically to the analysis of systems of equations.
15.1
In earlier work we considered estimating the parameter θ of a Pareto distribution (with known minimum x_m) by the method of moments, using the fact that E(X) = θx_m/(θ − 1), i.e. solving

    (1/n) Σ_i [x_i − θ̂x_m/(θ̂ − 1)] = 0                                    (15.1)

We noted in passing that there are potentially other moment conditions that we might have used. In this context, for example, we know also that

    E(X²) = θx_m²/(θ − 2)

provided that θ > 2. We could therefore potentially get a different MoM estimator θ̂_2 defined by

    (1/n) Σ_i [x_i² − θ̂_2 x_m²/(θ̂_2 − 2)] = 0                             (15.2)

If we try to impose both conditions simultaneously, i.e.

    (1/n) Σ_i [x_i − θ̂x_m/(θ̂ − 1)] = 0                                    (15.3a)
    (1/n) Σ_i [x_i² − θ̂x_m²/(θ̂ − 2)] = 0                                  (15.3b)

there will, in general, be no solution for θ̂.
Instead of trying to do the impossible and solve both equations 15.3a and 15.3b, we can think of any candidate solution θ̂ for these moment equations as defining an error vector:

    g_1 = (1/n) Σ_i [x_i − θ̂x_m/(θ̂ − 1)]
    g_2 = (1/n) Σ_i [x_i² − θ̂x_m²/(θ̂ − 2)]

We can now reframe the problem as one of picking an estimate θ̂ so as to minimise the error vector g = (g_1, g_2)′. One natural way of doing this is to minimise the quadratic loss

    Q(θ̂) = g′g = g_1² + g_2²

After some thought, it is clear that this is unlikely to be the best way of combining the information in the two conditions, since it assumes that the information in the two conditions is of a comparable quality. Instead we should use a weighted quadratic loss function

    Q(θ, W) = g′Wg                                                          (15.5)

Minimising this with respect to θ gives the first order condition

    (∂g′/∂θ) W g = 0                                                        (15.7)

Remark 15.1 The matrix ∂g′/∂θ is the same matrix of derivatives encountered when we considered the delta method. We can check that the formula gives the right results in the simple 2 × 2 case where g = (g_1, g_2)′ and W = [w_11 w_12; w_12 w_22] (where we have stipulated that w_21 = w_12 to ensure symmetry). In this case

    Q = g_1²w_11 + 2g_1g_2w_12 + g_2²w_22
    ∂Q/∂θ = 2(∂g_1/∂θ)g_1w_11 + 2[(∂g_1/∂θ)g_2 + g_1(∂g_2/∂θ)]w_12 + 2(∂g_2/∂θ)g_2w_22
          = 2 [∂g_1/∂θ  ∂g_2/∂θ] [w_11 w_12; w_12 w_22] [g_1; g_2]
          = 2 (∂g′/∂θ) W g
To see how this works, consider a sample drawn from a Pareto distribution with x_m = 5000, the first few values of which are

    6850.222  11302.23  5283.836  5472.997  6202.225  8839.387
    7536.145  5029.76   7809.99   5538.132  6486.269  5302.856

The sample mean is 7408.207, so the first moment condition 15.1 yields

    θ̂ = x̄/(x̄ − x_m) = 7408.207/(7408.207 − 5000) = 3.0762

while the sample second moment is m_2 = 62410417, so condition 15.2 yields

    θ̂_2 = 2m_2/(m_2 − x_m²) = 2 × 62410417/(62410417 − 25000000) = 3.3365

The two moment conditions point to somewhat different estimates. Note also that the second moments are several orders of magnitude larger than the first, so in an unweighted criterion the second moment condition will dominate and the first will have almost no impact.
With the weighting matrix W = I_2 = [1 0; 0 1],

    Q(θ, I_2) = g_1² + g_2²
              = [(1/n)Σ_i x_i − θx_m/(θ − 1)]² + [(1/n)Σ_i x_i² − θx_m²/(θ − 2)]²

and the first order condition is

    [(1/n)Σ_i x_i − θ̂x_m/(θ̂ − 1)] x_m/(θ̂ − 1)² + [(1/n)Σ_i x_i² − θ̂x_m²/(θ̂ − 2)] 2x_m²/(θ̂ − 2)² = 0

This does not have a neat closed form solution, but a numerical solution for the data given above is

    θ̂(I_2) = 3.3362
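A crude way to obtain θ̂(I_2) numerically is to evaluate the criterion over a fine grid. The sketch below does this using the summary moments quoted above (x_m = 5000, sample mean 7408.207, sample second moment 62410417); the grid search is only a stand-in for a proper optimiser.

```python
import numpy as np

# Identity-weighted GMM criterion for the Pareto example, minimised by grid search
xm, m1, m2 = 5000.0, 7408.207, 62410417.0

def Q(theta):
    g1 = m1 - theta * xm / (theta - 1)        # first moment condition
    g2 = m2 - theta * xm**2 / (theta - 2)     # second moment condition
    return g1**2 + g2**2

grid = np.linspace(2.5, 4.0, 150001)
theta_hat = grid[np.argmin([Q(t) for t in grid])]
print(round(theta_hat, 4))
```

Because the second moments dwarf the first, the minimiser sits essentially on top of θ̂_2, echoing the comparison above.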
15.2
More generally, suppose that there is a DSP identified by the q-dimensional parameter vector θ and that there is an r-dimensional vector-valued function h such that

    E_0[h(w_i, θ_0)] = 0                                                    (15.8)

where θ_0 is the true value of the parameter vector, w_i contains all the data relevant to the i-th observation (e.g. a dependent variable y_i, explanatory variables x_i and instruments z_i), and r ≥ q. Let us further assume that

    E h(w_i, θ_1) = E h(w_i, θ_2) if, and only if, θ_1 = θ_2                (15.9)

Then the GMM estimator θ̂(W) is given by

    θ̂(W) = arg min_θ g′Wg,  where  g = (1/n) Σ_i h(w_i, θ)

The first order conditions for this minimisation are

    (∂g′/∂θ)|_θ̂ W g(θ̂) = 0                                                (15.10)

Remark 15.3 If r = q, then ∂g/∂θ′|_θ̂ is a matrix with full rank, i.e. it has an inverse. In that case the first order conditions can be satisfied by setting g(θ̂) = 0 exactly, so the weighting matrix becomes irrelevant and the GMM estimator coincides with the simple method of moments estimator.
15.2.1
Assumptions
Besides the assumptions given in equations 15.8 and 15.9 (an identification assumption), we need to make a number of additional assumptions. Let us write the objective function for a sample of size n as

    Q_n(θ, W_n) = g_n′ W_n g_n                                              (15.11)

1. We will assume that the matrix G_0 exists, where

    G_0 = plim (∂g_n/∂θ′)|_θ_0                                              (15.12)

and G_0 is finite with rank q.

2. We will assume that

    √n g_n(θ_0) →d N(0, S_0)                                               (15.13)

where

    S_0 = lim (1/n) Σ_i E[h_i h_i′ | θ_0]                                   (15.14)

Here we have assumed independent draws between observations and h_i is an abbreviation for h(w_i, θ). Cameron and Trivedi (2005, p.174) provide the appropriate formula if observations are not independent. Note that for the case of independent observations drawn from the same distribution this takes the easy to remember form

    S_0 = E(hh′)                                                            (15.15)
15.2.2
Consistency
We will sketch out why this estimator is likely to be consistent (here we follow Cameron and Trivedi (2005, pp.182–183)). Note that

    ∂Q_n/∂θ |_θ_0 = 2 (∂g_n′/∂θ)|_θ_0 W_n g_n(θ_0)

Now ∂g_n/∂θ′|_θ_0 →p G_0, W_n →p W_0 and g_n(θ_0) →p 0, hence ∂Q_n/∂θ|_θ_0 →p 0: in the limit the first order condition for a minimum is satisfied at θ_0.
15.2.3
Asymptotic normality
By a mean value expansion,

    g_n(θ̂) = g_n(θ_0) + (∂g_n/∂θ′)|_θ̄ (θ̂ − θ_0)

where θ̄ lies between θ_0 and θ̂. Substituting this into the first order conditions (equation 15.10) and multiplying by √n we get

    (∂g_n′/∂θ)|_θ̂ W_n √n g_n(θ̂) = 0
    (∂g_n′/∂θ)|_θ̂ W_n [√n g_n(θ_0) + (∂g_n/∂θ′)|_θ̄ √n(θ̂ − θ_0)] = 0

We can solve this for √n(θ̂ − θ_0), i.e.

    √n(θ̂ − θ_0) = −[(∂g_n′/∂θ)|_θ̂ W_n (∂g_n/∂θ′)|_θ̄]⁻¹ (∂g_n′/∂θ)|_θ̂ W_n √n g_n(θ_0)

By our assumptions √n g_n(θ_0) →d N(0, S_0) and since θ̂ →p θ_0, we must have θ̄ →p θ_0. Consequently the first square bracket has probability limit [G_0′W_0G_0]⁻¹ and

    √n(θ̂ − θ_0) →d [G_0′W_0G_0]⁻¹ G_0′W_0 N(0, S_0)

i.e.

    √n(θ̂ − θ_0) →d N(0, [G_0′W_0G_0]⁻¹ G_0′W_0S_0W_0G_0 [G_0′W_0G_0]⁻¹)    (15.16)

Remark 15.4 Note that if r = q, so that G_0 and W_0 are both square, this simplifies. In particular [G_0′W_0G_0]⁻¹ = G_0⁻¹W_0⁻¹(G_0′)⁻¹, so that

    √n(θ̂ − θ_0) →d N(0, G_0⁻¹S_0(G_0′)⁻¹)
15.2.4
We can do inference on these estimators, provided that we can obtain estimates of G_0, W_0 and S_0. For W_0 we simply use the sample weighting matrix W_n. For G_0 we use

    Ĝ = (∂g_n/∂θ′)|_θ̂                                                      (15.17)

and for S_0

    Ŝ = (1/n) Σ_i h_i h_i′ |_θ̂                                             (15.18)

Consequently

    V̂(θ̂) = (1/n) [Ĝ′W_nĜ]⁻¹ Ĝ′W_nŜW_nĜ [Ĝ′W_nĜ]⁻¹                      (15.19)

15.3
We return to the point made earlier that the GMM estimator is a function of the weighting matrix W. Considering the variance of the GMM estimator, we can show that the variance is minimised if we pick

    W = S_0⁻¹                                                               (15.20)

In that case the asymptotic distribution of this optimal GMM estimator (OGMM) simplifies to

    √n(θ̂ − θ_0) →d N(0, [G_0′S_0⁻¹G_0]⁻¹)                                  (15.21)

In practice knowledge of S_0 will require knowledge of θ_0, which we are trying to estimate. Consequently this estimator is generally not feasible. Instead we can use a two-step procedure to get estimates of S_0. In the first step use any GMM estimator (e.g. with W = I_r). By the argument above this should lead to a consistent set of parameter estimates θ̂. Use these estimates to estimate Ŝ (by equation 15.18). Then set

    W_n = Ŝ⁻¹

and compute

    θ̂_EOGMM = arg min_θ g′Ŝ⁻¹g

This estimated optimal GMM (EOGMM) estimator has the same asymptotic distribution given in equation 15.21. For purposes of inference we estimate the covariance matrix as

    V̂(θ̂) = (1/n) [Ĝ′Ŝ⁻¹Ĝ]⁻¹

where Ĝ and Ŝ are estimated at θ̂.
Note that the relationship between EOGMM and OGMM is similar to that between FGLS and GLS. As in the case of FGLS we use a consistent (but inefficient) estimator to get estimates of the covariance matrix and then use that estimated covariance matrix to approximate the efficient estimator.
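The two-step procedure can be sketched for the Pareto example as follows; the DGP (θ_0 = 3, x_m = 5000, n = 2000) is an assumption chosen to mirror the Monte Carlo design of the next section, and the grid search is again only a stand-in for a real optimiser.

```python
import numpy as np

# Two-step (estimated optimal) GMM sketch for the Pareto example (assumed DGP)
rng = np.random.default_rng(7)
xm, theta0, n = 5000.0, 3.0, 2000
x = (rng.pareto(theta0, size=n) + 1.0) * xm   # classical Pareto with minimum xm

def h(theta):
    # the two moment functions evaluated at each observation (an n x 2 array)
    return np.column_stack([x - theta * xm / (theta - 1),
                            x**2 - theta * xm**2 / (theta - 2)])

def gmm(W):
    grid = np.linspace(2.3, 4.0, 5001)
    crit = [h(t).mean(axis=0) @ W @ h(t).mean(axis=0) for t in grid]
    return grid[np.argmin(crit)]

theta1 = gmm(np.eye(2))                  # step 1: identity weighting
S = h(theta1).T @ h(theta1) / n          # step 2: estimate S0 = E[h h']
theta2 = gmm(np.linalg.inv(S))           # re-minimise with W = S^{-1}
print(theta1, theta2)
```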
15.4
Let us return to the case of the Pareto distribution considered earlier and see these estimators
in action. Figures 15.1 and 15.2 show our results. The summary statistics for this Monte Carlo
simulation are given in the following table:
                 n = 2000                 n = 100000
              Mean       s.d.          Mean       s.d.
θ̂           3.004076   .0756765      3.000169   .0107534
θ̂_2         3.036968   .1456797      3.004737   .0519052
θ̂(I_2)      3.036968   .1456797      3.004737   .0519052
θ̂_EOGMM     3.003228   .066415       3.000174   .0092302
replications    2000                     1294
Several features deserve comment:
1. All the estimators look approximately unbiased — to be more precise, a 95% confidence interval for E(θ̂) will be given by the Monte Carlo sample average (e.g. 3.004076) plus or minus twice the standard error, where the standard error is the standard deviation divided by the square root of the number of replications.
[Figure 15.1: The performance of different GMM estimators with a small sample and problematic distribution. Density plots compare the MOM1, MOM2, GMM, EOGMM, OGMM and MLE estimates of theta.]

[Figure 15.2: Density plots of the same six estimators (MOM1, MOM2, GMM, EOGMM, OGMM, MLE) over the range 2.4 to 3.2.]
6. In all cases maximum likelihood outperforms these GMM estimators. The optimality of
GMM therefore has to be understood within the context of the given moment conditions.
15.5
A GMM approach to the linear model would start with the underlying moment condition, which in this case can be written

    E[x_i′(y_i − x_iβ)] = 0

where x_i is a row vector from the X matrix, β the vector of unknown parameters and y_i = x_iβ + ε_i (by assumption). In this case therefore

    h_i = x_i′(y_i − x_iβ)
    g = (1/n) Σ_i x_i′(y_i − x_iβ) = (1/n) X′(y − Xβ)

The matrix ∂g/∂β′ is given by −(1/n)X′X, so the first order conditions for the GMM estimator become

    [−(1/n)X′X] W [(1/n)X′(y − Xβ̂)] = 0

As we noted above, in the case where we have q equations in q unknowns (as here), this simplifies to the simple method of moments condition

    (1/n)X′(y − Xβ̂) = 0
    β̂ = (X′X)⁻¹X′y

By remark 15.4,

    √n(β̂ − β_0) →d N(0, G_0⁻¹S_0(G_0′)⁻¹)

In this case G_0 is the probability limit of −(1/n)X′X, which we would approximate with −(1/n)X′X. Furthermore S_0 = lim (1/n) Σ_i E(h_ih_i′), which will be approximated by

    Ŝ = (1/n) Σ_i x_i′(y_i − x_iβ̂)(y_i − x_iβ̂)x_i = (1/n) Σ_i ê_i² x_i′x_i

Consequently

    V̂(β̂) = (1/n) [(1/n)X′X]⁻¹ [(1/n) Σ_i ê_i² x_i′x_i] [(1/n)X′X]⁻¹
          = (X′X)⁻¹ [Σ_i ê_i² x_i′x_i] (X′X)⁻¹

This, of course, is identical to the formula for the robust covariance matrix of the OLS estimator given in equation 12.3.
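The equivalence with the robust (White) covariance matrix is easy to check directly; in this sketch the heteroscedastic DGP is an assumption for illustration.

```python
import numpy as np

# GMM covariance of OLS = the heteroscedasticity-robust sandwich (assumed DGP)
rng = np.random.default_rng(3)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
u = rng.normal(size=n) * (1 + np.abs(X[:, 1]))   # heteroscedastic errors
y = X @ np.array([1.0, 2.0]) + u

beta = np.linalg.solve(X.T @ X, X.T @ y)         # the MoM/OLS solution
e = y - X @ beta
meat = (X * e[:, None] ** 2).T @ X               # sum_i e_i^2 x_i' x_i
XtX_inv = np.linalg.inv(X.T @ X)
V_robust = XtX_inv @ meat @ XtX_inv              # sandwich form of equation 12.3
print(np.sqrt(np.diag(V_robust)))                # robust standard errors
```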
Note that the GLS estimator cannot be derived from the moment condition above. We need a different set of moment conditions, those emanating from the transformed model (i.e. equation 11.5). The moment condition is

    E[x_i*′(y_i* − x_i*β)] = 0

where the starred variables are the transformed ones. The sample analogue leads to

    (1/n) X′Ω⁻¹(y − Xβ̂) = 0
15.6
Consider now the case where we have an endogenous X matrix and assume that we have a
matrix of valid instruments Z. The moment conditions in this case can be written as
[z0 ( x )] = 0
where z is the th row of Z. In this case therefore
h
g
= z0 ( x )
1X 0
z ( x )
=
=
The matrix
g
0
is given by
1 0
Z X,
1 0
Z (y X)
1 0
1 0
b
=0
ZX W
Z y X
b
X0 ZWZ0 y = X0 ZWZ0 X
This has a solution provided that X0 Z has rank and W has rank . The solution is
0
1 0
0
b
X ZWZ0 y
(W) = X ZWZ X
We could set W = I and use this as the initial estimate for the EOGMM estimator. In this
case, however, we can do somewhat better.
Consider the case of independent draws from the same distribution. In this case S0 = hh0 .
We have
S0
since the errors will be homoscedastic in this case. The sample estimate of [z0 z] will be
So our OGMM estimator in this case will be given by setting
W
This means that
1 0
Z Z.
1
2 0
ZZ
h
i1
1 0
1 0
0
0
b
ZX
X0 Z (Z0 Z) Z y
= X Z (Z Z)
This, however, is the 2SLS estimator! This shows that the 2SLS estimator is equivalent to the
optimal GMM estimator provided that the assumption of homoscedasticity and zero autocorrelation holds. If the errors are independent but heteroscedastic, then we need to estimate S. In
this case
X
b = 1
S
h h0 
i
0
1 Xh 0
b
b z
=
z x
x
b Any GMM estimator would
To implement this we obviously need an initial GMM estimate of .
do, but it is customary to use the 2SLS estimator for this first stage. This means that
X
b= 1
S
2 z0 z
= X ZS
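A compact numerical sketch of both estimators, on an assumed over-identified design (the DGP and all names are illustrative):

```python
import numpy as np

# 2SLS as GMM with W = (Z'Z)^{-1}, then the robust two-step GMM re-weighting
rng = np.random.default_rng(5)
n = 4000
z1, z2 = rng.normal(size=n), rng.normal(size=n)
v = rng.normal(size=n)
x = 0.8 * z1 + 0.5 * z2 + v               # endogenous regressor
u = 0.6 * v + rng.normal(size=n)          # correlated with x through v
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z1, z2])

W = np.linalg.inv(Z.T @ Z)
b_2sls = np.linalg.solve(X.T @ Z @ W @ Z.T @ X, X.T @ Z @ W @ Z.T @ y)

e = y - X @ b_2sls
S = (Z * e[:, None] ** 2).T @ Z / n       # S-hat = (1/n) sum e_i^2 z_i' z_i
Wo = np.linalg.inv(S)
b_gmm = np.linalg.solve(X.T @ Z @ Wo @ Z.T @ X, X.T @ Z @ Wo @ Z.T @ y)
print(b_2sls, b_gmm)                      # both should be near (1, 2)
```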
Part IV
Systems of Equations
Chapter 16
Introduction
In this chapter we will begin our analysis of situations in which we estimate multiple relationships simultaneously. The typical model can be written in the form

    y_i1 = x_i1 β_1 + u_i1
    y_i2 = x_i2 β_2 + u_i2
      ⋮
    y_iG = x_iG β_G + u_iG

We assume that there are G equations and we have N observations (subscripted i) for each. The variables appearing on the right hand side can be the same, but need not be. In general we assume that the row vector x_ig has dimension k_g. We typically will need to allow β_g ≠ β_h if g ≠ h. Furthermore there may be cross-equation correlation in the errors.

Example 16.1 A good example is if we estimate a system of demand equations. In this case the demands are the dependent variables and the explanatory variables are own prices, cross prices, income and a set of variables that are likely to shift tastes (e.g. age) or that might impact on the efficiency with which resources can be spent (e.g. household size). In this case it would stand to reason that whatever was left out of the regressions (and thus goes into the error term) would be correlated across equations.
16.2
Zellner had the fundamental insight that we could simply stack these equations to produce one mega-regression. There are two ways of stacking (by equation or by individual) and it is worthwhile noting that different authors adopt different approaches. The standard approach is to stack by equation. Wooldridge makes the cogent case that for asymptotic analysis it makes more sense to stack by individual: if we want to analyse what happens as N → ∞, we just think of additional individual blocks being added to the stacked system. For each individual i we define

    y_i = [y_i1; y_i2; …; y_iG],
    X_i = [x_i1 0 ⋯ 0; 0 x_i2 ⋯ 0; ⋮ ⋱ ⋮; 0 0 ⋯ x_iG],
    β = [β_1; β_2; …; β_G],
    u_i = [u_i1; u_i2; …; u_iG]                                             (16.1)

The subscript notation is supposed to remind us that we are stacking data on the same individuals, i.e. we are stacking on the first index. Stacking the data vectors and matrices we get

    y = [y_1; y_2; …; y_N],  X = [X_1; X_2; …; X_N],  u = [u_1; u_2; …; u_N]   (16.2)

where we have let K = Σ_g k_g. We can now rewrite the G equations as one big equation (the dimension of each matrix is given for clarity):

    y_{NG×1} = X_{NG×K} β_{K×1} + u_{NG×1}                                  (16.3)

In this form it looks just like a single-equation regression model. The key difference is that it is bound to be heteroscedastic and autocorrelated, for reasons that we alluded to earlier.
bound to be heteroscedastic and autocorrelated, for reasons that we alluded to earlier.
16.3
Assumptions
We assume that

    E(uu′ | X) = [Ω 0 ⋯ 0; 0 Ω ⋯ 0; ⋮ ⋱ ⋮; 0 0 ⋯ Ω]

i.e. E(u_iu_i′ | X) = Ω for each individual i and E(u_iu_j′) = 0 for i ≠ j. The last assumption will automatically hold if observations are drawn independently of each other from the same distribution. We will make the additional assumption that

    E(X_i′X_i) = Q

exists and has rank K. This assumption is also easy to justify if we are drawing X_i from the same distribution for each observation i.
16.4
Estimation by OLS
It should be easy to see that with the given assumptions we have a regression model that is heteroscedastic and autocorrelated, unless Ω = σ²I. Nevertheless OLS estimation of this system should provide consistent estimates.
We can develop a method of moments logic for OLS estimation of the system from the moment condition

    E(X_i′u_i) = 0
    E[X_i′(y_i − X_iβ)] = 0
    E(X_i′y_i) = E(X_i′X_i)β

We observe that the last equation identifies β provided that E(X_i′X_i) has an inverse, which it does by the additional assumption that we made above.
Our sample analogue of the moment equations will be given by

    (1/N) Σ_{i=1}^N X_i′y_i = [(1/N) Σ_{i=1}^N X_i′X_i] β̂

so that

    β̂ = [(1/N) Σ_i X_i′X_i]⁻¹ [(1/N) Σ_i X_i′y_i]                          (16.4)
       = (X′X)⁻¹ X′y

Looking at equation 16.4 it is easy to see that this should lead to a consistent estimator of β. As N → ∞, we expect (by normal weak law of large numbers arguments) that (1/N) Σ_i X_i′y_i →p E(X_i′y_i) and (1/N) Σ_i X_i′X_i →p E(X_i′X_i). Alternatively, write

    β̂ = β + (X′X)⁻¹X′u = β + [(1/N) Σ_i X_i′X_i]⁻¹ [(1/N) Σ_i X_i′u_i]

Since (1/N) Σ_i X_i′u_i →p 0, it follows that β̂ →p β.
16.5
Estimation by GLS
It should be evident that if we knew the structure of the Ω matrix we would be able to estimate the relationship more efficiently. The logic (as in the case of single equations) is that we can transform the data to make the model conform to the assumptions of the classical linear regression model. If we have

    y_i = X_iβ + u_i

with var(u_i | X_i) = Ω, then we can transform the data so that

    Ω^{−1/2} y_i = Ω^{−1/2} X_i β + Ω^{−1/2} u_i

The GLS estimator is then

    β̂_GLS = [(1/N) Σ_i X_i′Ω⁻¹X_i]⁻¹ [(1/N) Σ_i X_i′Ω⁻¹y_i]
           = (X′Ω̃⁻¹X)⁻¹ X′Ω̃⁻¹y

where

    Ω̃⁻¹ = [Ω⁻¹ 0 ⋯ 0; 0 Ω⁻¹ ⋯ 0; ⋮ ⋱ ⋮; 0 0 ⋯ Ω⁻¹]

16.5.1
Notation
It becomes quite tedious to write out these stacked matrices the long way. A convenient notation for these cases is provided by the Kronecker product, which is discussed further in the appendix to this chapter. By definition the Kronecker product of two matrices A and B is given by:

    A ⊗ B = [a_11B a_12B ⋯ a_1mB; a_21B a_22B ⋯ a_2mB; ⋮ ⋱ ⋮; a_n1B a_n2B ⋯ a_nmB]

In this notation

    Ω̃⁻¹ = I_N ⊗ Ω⁻¹

The mathematical properties of the Kronecker product allow one to derive many useful results about these stacked matrices quickly and easily.
16.5.2
Some caution
In order for the GLS estimator to be consistent we need a stronger condition than E(X_i′u_i) = 0, since we now need the transformed X variables to be orthogonal to the transformed errors. The problem is that (in general) the transformed variables Ω^{−1/2}X_i will be some linear combination of the explanatory variables from different equations (for the same individual). It is therefore now necessary that the explanatory variables x_ig from any equation be uncorrelated with the error term u_ih even when g ≠ h. If the error terms are uncorrelated with each other (i.e. Ω is diagonal) then this doesn't apply. But to show consistency in general we now need the condition

    E(x_ig′ u_ih) = 0  for all g, h

which is stronger.
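Before turning to the feasible version, here is a numerical sketch of systems GLS via the stacked representation, with an assumed (known) 2 × 2 Ω; the design and numbers are illustrative.

```python
import numpy as np

# Systems GLS sketch: stack by individual, weight by I_N kron Omega^{-1}
rng = np.random.default_rng(9)
N = 400
Omega = np.array([[1.0, 0.5], [0.5, 2.0]])
L = np.linalg.cholesky(Omega)
U = rng.normal(size=(N, 2)) @ L.T                # errors with covariance Omega
x1, x2 = rng.normal(size=N), rng.normal(size=N)
y1 = 1.0 + 2.0 * x1 + U[:, 0]
y2 = -1.0 + 0.5 * x2 + U[:, 1]

# Stack by individual: rows (i, eq1) then (i, eq2); each X_i is block diagonal
X = np.zeros((2 * N, 4))
X[0::2, 0], X[0::2, 1] = 1.0, x1
X[1::2, 2], X[1::2, 3] = 1.0, x2
y = np.empty(2 * N)
y[0::2], y[1::2] = y1, y2

Oinv_big = np.kron(np.eye(N), np.linalg.inv(Omega))   # I_N kron Omega^{-1}
b_gls = np.linalg.solve(X.T @ Oinv_big @ X, X.T @ Oinv_big @ y)
print(b_gls)   # should be close to (1, 2, -1, 0.5)
```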
16.6
Estimation by FGLS
In practice Ω is unknown and must be estimated. Recall that

    Ω = [Var(u_1) Cov(u_1,u_2) ⋯ Cov(u_1,u_G); Cov(u_1,u_2) Var(u_2) ⋯ Cov(u_2,u_G); ⋮ ⋱ ⋮; Cov(u_1,u_G) Cov(u_2,u_G) ⋯ Var(u_G)]

If it is reasonable to assume that the variance of the error term in the first equation, i.e. Var(u_i1), is homoscedastic, so that

    Var(u_i1) = σ_1²  for all i

then we have N equations (i.e. individuals) over which we can estimate this variance. Similarly we will have N equations over which to estimate any of the other variances and the covariances. Consistent estimators of these are

    σ̂_1² = (1/N) Σ_{i=1}^N e_i1²

where e_i1 is the residual from the Systems OLS estimator. Note that we are exploiting the fact here that Systems OLS is consistent, so the OLS residuals are consistent for the true errors. Similarly

    σ̂_12 = Ĉov(u_1, u_2) = (1/N) Σ_{i=1}^N e_i1 e_i2

We can summarise this as:

    Ω̂ = (1/N) Σ_{i=1}^N e_i e_i′

where e_i is the stacked vector of OLS residuals for the i-th individual. Our Systems FGLS estimator is therefore:

    β̂_FGLS = [(1/N) Σ_i X_i′Ω̂⁻¹X_i]⁻¹ [(1/N) Σ_i X_i′Ω̂⁻¹y_i]
            = (X′(I_N ⊗ Ω̂⁻¹)X)⁻¹ X′(I_N ⊗ Ω̂⁻¹)y

16.7
Exercises
1. You are given the following theoretical model:

    y_i1 = β_1 + β_2 x_i1 + u_i1
    y_i2 = β_3 x_i2 + u_i2

with

    E(u_iu_i′) = Ω = [σ_1² σ_12; σ_12 σ_2²]

You have the following empirical information on this model:

    individual   y_i1   y_i2   x_i1   x_i2
        1          1      2      0      2
        2          1      3      1      4
        3          2      5      2      4
(a) Rewrite the theoretical model in stacked matrix form, paying attention also to the assumptions being made about the error term.
(b) Rewrite the empirical information in stacked matrix form.
(c) Let Ω̃ = E(uu′), where u is the error vector of the stacked model. Calculate Ω̃⁻¹.
(d) Assume that you want to estimate this model by GLS. Write down the appropriate formula with the appropriate empirical information. You do not need to simplify/calculate the final solution.
16.8
Consider the following model:

    y_i1 = β_1 + β_2 x_i1 + u_i1
    y_i2 = β_3 + β_4 x_i2 + u_i2

with the empirical information:

    individual   y_i1   y_i2   x_i1   x_i2
        1          1      2      0      2
        2          1      5      3      4
        3          2      7      4      4

Stacked by individual, the model is

    [y_11; y_12; y_21; y_22; y_31; y_32]
      = [1 x_11 0 0; 0 0 1 x_12; 1 x_21 0 0; 0 0 1 x_22; 1 x_31 0 0; 0 0 1 x_32] [β_1; β_2; β_3; β_4]
        + [u_11; u_12; u_21; u_22; u_31; u_32]

Observe that the first column in the X matrices is for the intercept in equation 1, the second column is x_i1, the third column is the intercept in equation 2 and the fourth column is x_i2.
We should also specify the assumptions about the errors in matrix form. We have E(X′u) = 0 and

    E(uu′) = [E(u_11²) E(u_11u_12) ⋯ E(u_11u_32); E(u_12u_11) E(u_12²) ⋯ E(u_12u_32); ⋮ ⋱ ⋮; E(u_32u_11) ⋯ E(u_32²)]

With independence across individuals all cross-individual terms vanish, so that

    E(uu′) = I_3 ⊗ Λ,   where  Λ = [σ_1² σ_12; σ_12 σ_2²]

In this example we take Λ = [1 1; 1 4], so that

    E(uu′) = [1 1 0 0 0 0; 1 4 0 0 0 0; 0 0 1 1 0 0; 0 0 1 4 0 0; 0 0 0 0 1 1; 0 0 0 0 1 4]
For the given data,

    y_1 = [1; 2],  y_2 = [1; 5],  y_3 = [2; 7]
    X_1 = [1 0 0 0; 0 0 1 2],  X_2 = [1 3 0 0; 0 0 1 4],  X_3 = [1 4 0 0; 0 0 1 4]

and stacking these:

    y = [1; 2; 1; 5; 2; 7],  X = [1 0 0 0; 0 0 1 2; 1 3 0 0; 0 0 1 4; 1 4 0 0; 0 0 1 4]
(a) The formula for the OLS estimator is the same as before:

    β̂ = (X′X)⁻¹ X′y

Here

    X′X = [3 7 0 0; 7 25 0 0; 0 0 3 10; 0 0 10 36],   X′y = [4; 11; 14; 52]

Since X′X is block diagonal, its inverse is

    (X′X)⁻¹ = [25/26 −7/26 0 0; −7/26 3/26 0 0; 0 0 36/8 −10/8; 0 0 −10/8 3/8]

so that

    β̂ = [23/26; 5/26; −2; 2] ≈ [0.88462; 0.19231; −2; 2]

The estimated equations are therefore

    ŷ_i1 = 23/26 + (5/26) x_i1
    ŷ_i2 = −2 + 2 x_i2
6. Impose the restriction β_1 = β_3 and β_2 = β_4. Re-estimate the model by OLS with these restrictions.
Answer: Write the unrestricted model as

    y = ι_1β_1 + x_1β_2 + ι_2β_3 + x_2β_4 + u

where ι_1 and ι_2 are the two intercept columns and x_1, x_2 the two slope columns of the stacked X. Imposing the restrictions,

    y = (ι_1 + ι_2)β_1 + (x_1 + x_2)β_2 + u

Observe that ι_1 + ι_2 is now just a column of ones, and x_1 + x_2 is a column with all the x values, i.e. our transformed X matrix X* is

    X* = [1 0; 1 2; 1 3; 1 4; 1 4; 1 4]

Consequently

    β̂* = (X*′X*)⁻¹ X*′y = [6 17; 17 61]⁻¹ [18; 63]
        = [61/77 −17/77; −17/77 6/77] [18; 63]
        = [27/77; 72/77] ≈ [0.35065; 0.93506]
7. Test the restrictions β_1 = β_3 and β_2 = β_4 by means of a Wald test.
Answer: Since the errors are not spherical, the covariance matrix of the OLS estimator takes the sandwich form

    V̂(β̂) = (X′X)⁻¹ X′E(uu′)X (X′X)⁻¹

With E(uu′) = I_3 ⊗ [1 1; 1 4] as given above,

    X′E(uu′)X = [3 7 3 10; 7 25 7 28; 3 7 12 40; 10 28 40 144]

and hence

    V̂(β̂) = [25/26 −7/26 99/52 −49/104; −7/26 3/26 −35/52 21/104; 99/52 −35/52 18 −5; −49/104 21/104 −5 3/2]

The hypothesis can be written

    H_0: Rβ = 0,   R = [1 0 −1 0; 0 1 0 −1]

The Wald statistic is W = (Rβ̂)′ [RV̂(β̂)R′]⁻¹ (Rβ̂). Now

    Rβ̂ = [23/26 + 2; 5/26 − 2] = [75/26; −47/26] ≈ [2.8846; −1.8077]
    RV̂(β̂)R′ = [197/13 −33/8; −33/8 63/52],   [RV̂(β̂)R′]⁻¹ ≈ [0.90155 3.0696; 3.0696 11.277]

so that

    W = [2.8846 −1.8077] [0.90155 3.0696; 3.0696 11.277] [2.8846; −1.8077] = 12.34

This is distributed as χ²(2). The critical value at the 5% level is 5.991 and at the 1% level is 9.210. We reject the null hypothesis, i.e. the two sets of regression coefficients are different.
8. Re-estimate the original model (the one without restrictions) by GLS.
Answer:
(a) In order to do this we need to invert Ω̃ = E(uu′). Since the matrix is block diagonal we only need to invert the two-by-two matrix Λ = E(u_iu_i′), where

    Λ = [1 1; 1 4],   Λ⁻¹ = [4/3 −1/3; −1/3 1/3]

Consequently Ω̃⁻¹ = I_3 ⊗ Λ⁻¹, and the GLS estimator is

    β̂_GLS = (X′Ω̃⁻¹X)⁻¹ X′Ω̃⁻¹y

Working this out,

    X′Ω̃⁻¹X = [4 28/3 −1 −10/3; 28/3 100/3 −7/3 −28/3; −1 −7/3 1 10/3; −10/3 −28/3 10/3 12]
    X′Ω̃⁻¹y = [2/3; 1/3; 10/3; 38/3]

so that

    β̂_GLS ≈ [0.9369; 0.1698; −2.1584; 2.0477]
9. Assume now that you dont know the exact distribution of the error terms. You do know,
however, that ( 0 ) = . What would be the most appropriate estimator in general?
Apply it in this instance.
Answer:
(a) The FGLS estimator would be better than OLS in general. Indeed the FGLS estimator
would be asymptotically as ecient as the GLS estimator. Unfortunately in this case
the sample size is tiny, so asymptotic arguments are dubious. Nevertheless we will go
through the FGLS routine to show how it would work.
The FGLS estimator begins with the OLS residuals. We have calculated the OLS
0
2
0
4
0
4
236
b
e = y X
1
1
2 0
1 1
=
5 0
2 1
0
7
3
26
6
13
=
1
9
26
1
0
0
3
0
4
0
0
1
0
1
0
1
0
2
0
4
0
4
23
26
5
26
2
b21
=
=
=
b12
=
=
=
b22
=
=
=
P
1
3
21
3
26
P
3
26
2 2 !
6
9
+
+
13
26
1 2
1
3
6
9
0+
(1) +
1
3
26
13
26
7
26
P 2
2
1 2
0 + (1)2 + 12
3
2
3
237
Consequently

    Σ̂ = [ σ̂₁₁  σ̂₁₂
          σ̂₁₂  σ̂₂₂ ]

and the feasible analogue of Ω⁻¹ is again block diagonal:

    Ω̂⁻¹ = I₃ ⊗ Σ̂⁻¹

Consequently

    β̂_FGLS = (X′Ω̂⁻¹X)⁻¹ X′Ω̂⁻¹y

which is computed exactly as the GLS estimator above, with Σ̂⁻¹ in place of Σ⁻¹.
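The full FGLS routine can be sketched in a few lines. The data are again hypothetical (and the sample deliberately larger than the exercise's, so that Σ̂ is a reasonable estimate); the steps are the ones described above:

```python
import numpy as np

# 1. run OLS; 2. estimate Sigma from the residuals block by block;
# 3. build Omega-hat-inverse = I kron Sigma-hat-inverse; 4. re-estimate.
rng = np.random.default_rng(1)
m = 200                                      # number of 2-observation blocks
X = np.column_stack([np.ones(2 * m), rng.normal(size=2 * m)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=2 * m)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
e = (y - X @ beta_ols).reshape(m, 2)         # one row of residuals per block

Sigma_hat = e.T @ e / m                      # sigma_jk = (1/m) sum_i e_ij e_ik
Omega_inv_hat = np.kron(np.eye(m), np.linalg.inv(Sigma_hat))

beta_fgls = np.linalg.solve(X.T @ Omega_inv_hat @ X,
                            X.T @ Omega_inv_hat @ y)
```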
16.9 The Kronecker product
Definition 16.2 For an m × n matrix A and a matrix B the Kronecker product is defined as:

    A ⊗ B = [ a11·B  a12·B  …  a1n·B
              a21·B  a22·B  …  a2n·B
                ⋮      ⋮    ⋱    ⋮
              am1·B  am2·B  …  amn·B ]

Example 16.3 Let

    A = [ a11  a12  a13          [ 1
          a21  a22  a23 ]   B =    2
                                   3 ]

then

    A ⊗ B = [ a11·B  a12·B  a13·B
              a21·B  a22·B  a23·B ]

          = [ a11·1  a12·1  a13·1
              a11·2  a12·2  a13·2
              a11·3  a12·3  a13·3
              a21·1  a22·1  a23·1
              a21·2  a22·2  a23·2
              a21·3  a22·3  a23·3 ]
Proposition 16.4

    A ⊗ B ⊗ C = (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C)

Proposition 16.5 If A and B are both m × n and C and D are both p × q matrices, then

    (A + B) ⊗ (C + D) = A ⊗ C + A ⊗ D + B ⊗ C + B ⊗ D

Proposition 16.6 If the products AC and BD are defined, then

    (A ⊗ B)(C ⊗ D) = AC ⊗ BD

Remark 16.7 It follows that if B is a column vector, then

    (A ⊗ B)C = AC ⊗ B

since the product AC will then be defined and C = C ⊗ I₁ (the 1 × 1 identity matrix). Similarly, if A is a column vector, then the product BC will be defined, so that

    (A ⊗ B)C = A ⊗ BC

using the fact that C = I₁ ⊗ C.

Proposition 16.8 Assume that A and B are square nonsingular matrices; then

    (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹

Proposition 16.9

    (A ⊗ B)′ = A′ ⊗ B′

Proposition 16.10 Assume that A is an n × n and B an m × m matrix; then

    |A ⊗ B| = |A|^m |B|^n

Proposition 16.11

    tr(A ⊗ B) = tr(A) tr(B)
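The propositions above are easy to check numerically with NumPy's `np.kron`; a quick sketch with arbitrary random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 2))
B = rng.normal(size=(3, 3))
C = rng.normal(size=(2, 2))
D = rng.normal(size=(3, 3))

# Proposition 16.6: (A kron B)(C kron D) = AC kron BD
lhs = np.kron(A, B) @ np.kron(C, D)
rhs = np.kron(A @ C, B @ D)

# Proposition 16.8: (A kron B)^{-1} = A^{-1} kron B^{-1}
inv_ok = np.allclose(np.linalg.inv(np.kron(A, B)),
                     np.kron(np.linalg.inv(A), np.linalg.inv(B)))

# Proposition 16.9: (A kron B)' = A' kron B'
t_ok = np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))

# Proposition 16.10: |A kron B| = |A|^m |B|^n for A n x n, B m x m
det_ok = np.allclose(np.linalg.det(np.kron(A, B)),
                     np.linalg.det(A) ** 3 * np.linalg.det(B) ** 2)

# Proposition 16.11: tr(A kron B) = tr(A) tr(B)
tr_ok = np.allclose(np.trace(np.kron(A, B)), np.trace(A) * np.trace(B))
```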
Chapter 17
System estimation by
Instrumental Variables and GMM
Chapter 18
Part V
Solutions
Solutions to Chapter 14
1. We know that the IV estimator can be written either as

    β̂ = (X′P_W X)⁻¹ X′P_W y

or as

    β̂ = (X̂′X̂)⁻¹ X̂′y

where X̂ are the fitted values from the first stage. These two are equivalent, since

    X̂ = P_W X

and

    (X̂′X̂)⁻¹ X̂′y = (X′P_W P_W X)⁻¹ X′P_W y
                 = (X′P_W X)⁻¹ X′P_W y

So we could estimate β by using OLS and the fitted values, i.e. writing the model as

    y = X̂β + residual

But

    X̂ = P_W X
       = P_W [ X₁  Z₂ ]
       = [ P_W X₁  P_W Z₂ ]
       = [ X₁  P_W Z₂ ]

The last step follows since X₁ is among the instruments, so the fitted values are equal to the values themselves. Consequently the coefficients in the OLS regression

    y = X₁β₁ + P_W Z₂ β₂ + residual      (1)

are the IV estimates.
Furthermore

    P_W Z₂ = Z₂ − M_W Z₂

Now observe that our structural model is

    y = X₁β₁ + Z₂β₂ + ε

Writing

    Z₂ = P_W Z₂ + M_W Z₂

this model becomes

    y = X₁β₁ + P_W Z₂ β₂ + M_W Z₂ β₂ + ε      (2)

If ε is uncorrelated with Z₂ it will certainly be uncorrelated with P_W Z₂ and M_W Z₂. Estimating this last equation by OLS will therefore give us unbiased and consistent coefficients. We would therefore expect the OLS coefficient on the variables M_W Z₂ in regression (2) to be β₂. This implies that the coefficients on X₁ and P_W Z₂ in regression (1) are unaffected by the omission of M_W Z₂, since P_W Z₂ and M_W Z₂ are orthogonal, i.e. the omitted-variable bias is zero.
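The equivalence between the projection formula and OLS on first-stage fitted values can be verified numerically. The data below are simulated (not from any exercise), with one endogenous regressor and two excluded instruments:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
w = rng.normal(size=(n, 2))                              # excluded instruments
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])   # exogenous regressors
u = rng.normal(size=n)
z2 = w @ np.array([1.0, -1.0]) + u + rng.normal(size=n)  # endogenous regressor
y = X1 @ np.array([0.5, 1.0]) + 2.0 * z2 + u             # u makes z2 endogenous

X = np.column_stack([X1, z2])
W = np.column_stack([X1, w])                 # instruments include X1
P_W = W @ np.linalg.solve(W.T @ W, W.T)      # projection onto the instruments

# IV estimator: (X' P_W X)^{-1} X' P_W y
beta_iv = np.linalg.solve(X.T @ P_W @ X, X.T @ P_W @ y)

# Equivalent: OLS of y on the first-stage fitted values X-hat = P_W X
X_hat = P_W @ X
beta_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
```

Because P_W is symmetric and idempotent, the two estimates agree to machine precision.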
2. Show that the IV residuals e = y − Xβ̂ have a sample mean of zero, provided that the intercept features in the list of instruments. Show that this implies that the usual (centred) R² can be used in the overidentification test.
This is, in fact, a difficult question! It is fairly easy to show that the IV residuals should have a mean of zero asymptotically. This follows from the fact that

    E(W′ε) = 0

and we are therefore sure that

    plim (1/n) W′e = 0

The first column of W is a column of ones, so the first element of the vector (1/n)W′e is just (1/n)ι′e, and it follows that

    plim (1/n) ι′e = 0

which just states that the sample mean of the residuals converges to zero.
It is also easy to show that the sample mean of the IV residuals in the exactly identified case will be zero. In that case we can derive the IV estimator from the sample moment condition

    W′(y − Xβ̂) = 0

i.e.

    W′e = 0

This will be a set of k equations in k unknowns and hence will have a unique solution. Since the first row of W′ will be a row of ones, the first equation will be

    ι′e = 0

from which it follows that the IV residuals have a mean of zero.

In the general overidentified case the sample moment condition does not have a unique solution. Indeed the l equations in k unknowns (with l > k) will give an inconsistent set of equations, unless the row rank of the W′ matrix is k and not l. We cannot be sure therefore that at the IV solution the first equation will be exactly satisfied.
In order to show this more generally we need to adopt a slightly different tack. We want to show that

    (1/n) ι′e = 0

i.e.

    (1/n) ι′(y − Xβ̂) = 0

i.e.

    (1/n) ι′y = (1/n) ι′Xβ̂
We therefore have to show that the sample mean of the y values is equal to the sample mean of the fitted values. Assume that the X matrix is partitioned (as before) as

    X = [ X₁  Z₂ ]

Let

    W = [ X₁  W₂ ]

The IV fitted values can be written

    ŷ = X₁β̂₁ + Z₂β̂₂
      = X₁β̂₁ + Ẑ₂β̂₂ + (Z₂ − Ẑ₂)β̂₂

while the fitted values of the second-stage regression are

    ŷ₂ = X₁β̂₁ + Ẑ₂β̂₂

where the coefficient vectors β̂₁ and β̂₂ are identical in the two expressions (by question 1). Note that (1/n)ι′X₁ and (1/n)ι′Z₂ are row vectors, so what follows is a vector equation. Now

    (1/n) ι′ŷ = (1/n) ι′X₁ β̂₁ + (1/n) ι′Ẑ₂ β̂₂ + (1/n) ι′M_W Z₂ β̂₂

The columns of Z₂ − Ẑ₂ = M_W Z₂ are first-stage residuals, and since the constant is among the instruments, ι lies in the column space of W, so ι′M_W = 0′ and the last term vanishes:

    (1/n) ι′ŷ = (1/n) ι′X₁ β̂₁ + (1/n) ι′Ẑ₂ β̂₂ = (1/n) ι′ŷ₂

So although it is not the case that ŷ will be equal to ŷ₂, their means will be equal.
Now if there is an intercept in the second stage regression, then the mean of the fitted values from that regression will be equal to the mean of the y values, i.e.

    (1/n) ι′ŷ₂ = ȳ = (1/n) ι′ŷ

from which it will in turn follow that the IV residuals will have a sample mean of zero.

If the sample mean of the IV residuals is zero, then

    e′e = ẽ′ẽ

where ẽ is the vector of centred residuals (i.e. with their sample mean subtracted). The R² of the auxiliary regression is equal to ê′ê / e′e, where ê is the set of fitted values from the auxiliary regression. Since these will also have a mean of zero, they will also be equal to the uncentred values. Consequently the uncentred and centred R² will be equal.
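The claim is easy to check numerically: with the constant among the instruments, the IV residuals have a zero sample mean even in the overidentified case. Simulated data, same layout as before:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
w = rng.normal(size=(n, 3))                    # three excluded instruments
u = rng.normal(size=n)
z2 = w @ np.array([1.0, 0.5, -0.5]) + u + rng.normal(size=n)
y = 1.0 + 2.0 * z2 + u                         # u makes z2 endogenous

X = np.column_stack([np.ones(n), z2])
W = np.column_stack([np.ones(n), w])           # intercept among the instruments
P_W = W @ np.linalg.solve(W.T @ W, W.T)

beta_iv = np.linalg.solve(X.T @ P_W @ X, X.T @ P_W @ y)
e = y - X @ beta_iv
mean_resid = e.mean()                          # zero up to rounding error
```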
3. Acemoglu et al. (2001) have suggested that malaria deaths in the seventeen- and eighteen-hundreds, i.e. at the beginning of the process of colonisation, might provide a useful instrument for the quality of governance institutions in a cross-sectional regression.

(a) Sketch out the argument for why malaria deaths may be a good instrument. (Read the article!)

The key argument is summarised on the second page of the article (p. 1370). It is that settler mortality affected the extent of European settlement. This in turn differentiated colonies which became settler colonies from those that had largely an extractive function. These early institutions shaped the evolution of the society and the current institutions. Current institutions in turn affect current economic performance.

(b) What might be some of the problems with this instrument?

The fact that malaria deaths occurred before the current institutions does not guarantee that malaria deaths may not be correlated with the error term in the regression. Malaria (and yellow fever) deaths in an earlier century may be correlated with some other feature of the country that might be persistent and affect growth. This is why the authors spend so much effort at dealing with other potential channels through which early deaths might be correlated with current growth. You may want to note how many different channels they consider and the variety of evidence that they bring to bear.

One potential channel that they do not consider is the development of trade and endogenous industries. Foreign companies may be deterred from investing in local capacity (other than extractive capacity) if doing so requires sending out skilled people who might die in the process. Note that the "malaria in 1994" variable does not adequately control for this since it is again the mortality of the expatriates that is at issue. Particularly if building up domestic industry occurs incrementally over a long period of time, the low presence of expatriates may have a very similar impact to the one that Acemoglu et al. focus on, except through a different channel.
4. You are given the regression model

    y = β₁ + β₂x* + ε

where y is the vector of the log of wages and x* is the vector of the (true) level of schooling. We assume that this model obeys the standard assumptions of the Classical Linear Regression Model. Unfortunately schooling is measured badly in your data set. Indeed you have reason to believe that measured schooling x is given by

    x = x* + u

where cov(x*, u) = 0 and cov(ε, u) = 0. On your data set you observe that var(x) = 96. You also have a study available which suggests that var(u) = 15. On top of this you have data available for a subset of your observations on the schooling of a sibling. This variable z is also badly measured, i.e.

    z = z* + v

where cov(z*, v) = 0 and cov(ε, v) = 0.

(a) Derive an expression for the asymptotic value of the OLS estimator.

    β̂₂ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

    plim β̂₂ = cov(x, y) / var(x)
            = cov(x* + u, β₁ + β₂x* + ε) / var(x)
            = β₂ var(x*) / var(x)
            = β₂ (1 − var(u)/var(x))
(b) With var(x) = 96 and var(u) = 15 this gives

    plim β̂₂ = β₂ (96 − 15)/96 = β₂ · 81/96

so that a consistent estimate is obtained by rescaling:

    β̃₂ = (96/81) β̂₂

This amounts to scaling up the OLS estimates by 18.52%.
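A small simulation illustrates both the attenuation derived in (a) and the rescaling in (b), using the exercise's variance numbers (the true β₂ below is an arbitrary illustrative value):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
beta2 = 0.10                                     # illustrative true coefficient
x_star = rng.normal(scale=np.sqrt(81), size=n)   # true schooling, var 81
u = rng.normal(scale=np.sqrt(15), size=n)        # measurement error, var 15
x = x_star + u                                   # observed schooling, var 96
y = 1.0 + beta2 * x_star + rng.normal(scale=0.5, size=n)

b_ols = np.cov(x, y)[0, 1] / np.var(x)           # attenuated: ~ beta2 * 81/96
b_corrected = b_ols * 96 / 81                    # scaled-up estimate ~ beta2
```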
(c) Under what circumstances could you use z as an instrument for x? Explain.

We require z to be correlated with x but not correlated with the error in the regression. The regression model is

    y = β₁ + β₂x + (ε − β₂u)

so the regression error consists both of the measurement error u and the term ε. We therefore require z to be correlated with x but uncorrelated with u and ε. This requires the true variable z* to be uncorrelated with either of these error terms. Since we hypothesised that x* was uncorrelated with ε and u, this is plausible. However, we also require the measurement error v to be uncorrelated with u. This is much less plausible.

There is an additional, more subtle point. Since we only have education on a sibling for a subset of our observations, we must be sure that there is no correlation between the process of having a sibling (as measured in the data set) and the error term in the main regression. If, for instance, people that have more siblings develop important skills that lead to higher wages, then estimating the main regression only over individuals with siblings will lead to biased coefficients for reasons not at all related to the measurement error.
                                                       Number of obs =     816
                                                       F(  7,   808) =   59.25
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.3392
                                                       Adj R-squared =  0.3335
                                                       Root MSE      =  .76818

------------------------------------------------------------------------------
      logpay |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      highed |   .1333759   .0091304    14.61   0.000     .1154538     .151298
       exper |   .0312511   .0051414     6.08   0.000      .021159    .0413433
   _Imetro_2 |   .1612915   .0754299     2.14   0.033     .0132297    .3093532
   _Imetro_3 |   .4565405   .0724531     6.30   0.000      .314322     .598759
    _Irace_2 |   .0398263   .0883555     0.45   0.652    -.1336071    .2132597
    _Irace_3 |   .2365166   .1069858     2.21   0.027     .0265137    .4465195
    _Irace_4 |    .477008   .1072434     4.45   0.000     .2664995    .6875166
       _cons |   4.850398   .1244376    38.98   0.000     4.606139    5.094657
------------------------------------------------------------------------------
. reg highed parent_ed exper _I* if logpay~=.

      Source |       SS       df       MS              Number of obs =     816
-------------+------------------------------           F(  7,   808) =   80.15
       Model |   4605.8463     7  657.978043           Prob > F      =  0.0000
    Residual |  6633.29953   808  8.20952912           R-squared     =  0.4098
-------------+------------------------------           Adj R-squared =  0.4047
       Total |  11239.1458   815   13.790363           Root MSE      =  2.8652

------------------------------------------------------------------------------
      highed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   parent_ed |   .2288393   .0310682     7.37   0.000     .1678555    .2898232
       exper |  -.2888812   .0162631   -17.76   0.000     -.320804   -.2569584
   _Imetro_2 |    .551927   .2826728     1.95   0.051    -.0029326    1.106787
   _Imetro_3 |   .4677897   .2761861     1.69   0.091    -.0743371    1.009917
    _Irace_2 |  -.6611566    .331181    -2.00   0.046    -1.311233   -.0110801
    _Irace_3 |   .1372697   .4034849     0.34   0.734    -.6547326    .9292719
    _Irace_4 |  -.1738147   .4272961    -0.41   0.684    -1.012556    .6649266
       _cons |   10.73762   .2738222    39.21   0.000     10.20014    11.27511
------------------------------------------------------------------------------
. predict u_ed, res
(1507 missing values generated)

. reg logpay highed exper _I* u_ed

      Source |       SS       df       MS              Number of obs =     816
-------------+------------------------------           F(  8,   807) =   55.94
       Model |  257.400261     8  32.1750326           Prob > F      =  0.0000
    Residual |  464.168688   807  .575178052           R-squared     =  0.3567
-------------+------------------------------           Adj R-squared =  0.3503
       Total |  721.568948   815  .885360673           Root MSE      =   .7584

------------------------------------------------------------------------------
      logpay |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      highed |   .2964423   .0359358     8.25   0.000     .2259036    .3669809
       exper |   .0816785   .0118951     6.87   0.000     .0583296    .1050274
   _Imetro_2 |   .0236001   .0800533     0.29   0.768    -.1335372    .1807374
   _Imetro_3 |   .3015268   .0788048     3.83   0.000     .1468403    .4562134
    _Irace_2 |   .1050021   .0883318     1.19   0.235     -.068385    .2783893
    _Irace_3 |   .1383059   .1076816     1.28   0.199    -.0730632    .3496751
    _Irace_4 |   .3206584   .1110075     2.89   0.004      .102761    .5385558
        u_ed |  -.1740156   .0371227    -4.69   0.000     -.246884   -.1011472
       _cons |   2.923025   .4291273     6.81   0.000     2.080688    3.765362
------------------------------------------------------------------------------
The regression is estimated over individuals where the parents' education could be determined.

(a) Given the regression output, what would be the estimate of the returns to education if you were to estimate the first equation by instrumental variables, using parents' education as an instrument for own education?

We can retrieve the IV coefficients from the auxiliary regression used to perform the Hausman test. All those coefficients are identical to the IV coefficients. In this case we need to look at the coefficient on the highed variable. It looks as though the returns to education are 0.2964423.
(b) Perform a Hausman test for the difference between the OLS and IV estimates. How might you explain the results?

We test the significance of the residuals term in the auxiliary regression. We see (from the regression output) that the p-value on u_ed is less than 0.001, i.e. we reject the hypothesis that the OLS and the IV coefficients are identical. We conclude that the education variable and the errors in the wage regression must be correlated. This could be due to measurement error or the omission of a common variable in the equations determining how much schooling someone gets and how much they earn.
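The control-function version of the Hausman test used here is mechanical enough to sketch. The data below are simulated stand-ins for the schooling example (an unobserved "ability" factor creates the endogeneity); the procedure mirrors the Stata steps: first-stage regression, save the residual, add it to the main equation, and inspect its t-statistic:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 1_000
parent_ed = rng.normal(size=n)                    # instrument
ability = rng.normal(size=n)                      # unobserved common factor
highed = 1.0 * parent_ed + ability + rng.normal(size=n)
logpay = 1.0 + 0.1 * highed + 0.5 * ability + rng.normal(size=n)

# First stage: highed on the instrument (with a constant)
Z = np.column_stack([np.ones(n), parent_ed])
g, *_ = np.linalg.lstsq(Z, highed, rcond=None)
u_ed = highed - Z @ g                             # first-stage residual

# Auxiliary regression: logpay on highed plus the residual;
# the coefficient on highed equals the IV estimate.
Xa = np.column_stack([np.ones(n), highed, u_ed])
b, *_ = np.linalg.lstsq(Xa, logpay, rcond=None)

s2 = np.sum((logpay - Xa @ b) ** 2) / (n - 3)
se = np.sqrt(s2 * np.linalg.inv(Xa.T @ Xa).diagonal())
t_u_ed = b[2] / se[2]                             # large |t| => reject exogeneity
```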
(c) Do the results suggest that you might have the problem of weak instruments?

We look at the first-stage regression in which we regress the education variable on all the instruments. We see that parents' education is highly significant, with a t-statistic in excess of seven. This translates into an F-statistic of over 49. Consequently we do not have the problem of weak instruments.
(d) Interpret both the OLS and the IV estimates of the first equation.

The OLS regression suggests that the returns to education are 0.1333759, i.e. each additional year of schooling raises earnings by about 13%. The experience variable indicates that each additional year of experience raises earnings by about 3%. We note that race and location dummies are significant and have the structure that we would expect: metropolitan areas pay better than urban areas, which in turn pay better than the rural areas. Whites earn more than Indians, who earn more than Coloureds, who earn more than Africans.

The IV regression suggests that the returns to education are 0.2964423, i.e. each additional year of schooling raises earnings by about 30% (to be precise, it raises earnings by a factor of e^0.2964, i.e. by about 34.5%).

One possible explanation for the difference between the OLS and the IV estimates is measurement error in the schooling variable.
The omitted variable bias formula suggests that the biased OLS results would be the sum of the true education coefficient plus the coefficient on the omitted variable (say γ) multiplied by the regression coefficient (say δ) of that variable on education, i.e.

    0.1333759 = 0.2964423 + γδ

For this to make any sense we therefore require a variable that is either negatively correlated with earnings or with education. We might have suspected the omission of an "ability" variable: higher-ability individuals are likely to earn more (at any level of schooling), but they are also more likely to get additional schooling. It is quite clear, however, that these coefficients cannot arise from the omission of an ability variable. In that case our OLS results would have overestimated the true returns to schooling.

Measurement error would, of course, lead to an underestimate (as shown by the relationship between the IV and the OLS coefficients). The attenuation bias formula, however, is

    plim β̂ = β (1 − var(u)/(var(x*) + var(u)))

To get attenuation in excess of 50% we would need to assume that the error process is on a par with the true signal, i.e. that about half of the observed variation in education levels is spurious. This just does not seem plausible.

In short, neither of these two reasons coheres very well with the empirical results.
(f) What assumptions would you need to make for the OLS estimates to be valid? And what assumptions are required in order for the IV estimates to be valid? Do you think that any of these assumptions hold in this case?

The OLS estimates would be valid if the regressors are independent of the error term. In particular we would need to assume that the process that determines education is independent of the wage received, i.e. that the error terms in the schooling and wage equations are uncorrelated.

Instrumental variables estimation is valid only under the following conditions:

i. The instrument must be correlated with the endogenous variable
ii. It must not be correlated with the error term in the primary regression

It is clear from the first-stage regression that the instruments are highly significant. It is not clear, however, whether parents' education is a valid instrument. One particularly troubling factor in this case (not made explicit in the question!) is that we have data on parents' education only for individuals who are still living with their parents. These individuals, however, are more likely to be low earners. Sample selection therefore induces a relationship between parents' education and the wage. This will contaminate the IV results!
Bibliography

Acemoglu, D., Johnson, S. and Robinson, J. A.: 2001, The colonial origins of comparative development: an empirical investigation, American Economic Review 91(5), 1369–1401.

Angrist, J. D. and Pischke, J.-S.: 2009, Mostly Harmless Econometrics: An Empiricist's Companion, Princeton University Press, Princeton, NJ.

Cameron, A. C. and Trivedi, P. K.: 2005, Microeconometrics: Methods and Applications, Cambridge University Press, New York.

Davidson, R. and MacKinnon, J. G.: 1993, Estimation and Inference in Econometrics, Oxford University Press, New York.

Davidson, R. and MacKinnon, J. G.: 2004, Econometric Theory and Methods, Oxford University Press, New York.

Deaton, A.: 1997, The Analysis of Household Surveys: A Microeconometric Approach to Development Policy, Johns Hopkins University Press, Baltimore.

Greene, W. H.: 2003, Econometric Analysis, 5 edn, Prentice-Hall.

Gujarati, D.: 2003, Basic Econometrics, 4 edn, McGraw-Hill, Boston.

Holland, P. W.: 1986, Statistics and causal inference, Journal of the American Statistical Association 81(396), 945–960.

Keynes, J. M.: 1936, The General Theory of Employment Interest and Money, Macmillan, London.

Mittelhammer, R. C., Judge, G. G. and Miller, D. J.: 2000, Econometric Foundations, CUP, Cambridge.

Murray, M. P.: 2006, Avoiding invalid instruments and coping with weak instruments, Journal of Economic Perspectives 20(4), 111–132.

Simon, C. P. and Blume, L.: 1994, Mathematics for Economists, Norton, New York.

Stock, J. H., Wright, J. H. and Yogo, M.: 2002, A survey of weak instruments and weak identification in generalized method of moments, Journal of Business and Economic Statistics 20(4), 518–529.

Sydsaeter, K., Strom, A. and Berck, P.: 1999, Economists' Mathematical Manual, 3 edn, Springer, Berlin.

Wooldridge, J. M.: 2002, Econometric Analysis of Cross Section and Panel Data, MIT Press, Cambridge, Mass.