
UNIVERSITY OF CAMBRIDGE

FACULTY OF ECONOMICS
M.Phil. in Economics
M.Phil. in Economic Research
Subject M300 Econometric Methods
Exercise Sheet 1
1. Consider the following joint probability density:
\[
f_{X,Y}(x,y) = \begin{cases} (x + xy + y)/4 & \text{if } x \in [0,1],\; y \in [0,2] \\ 0 & \text{otherwise.} \end{cases}
\]

(a) What is $E(Y)$? What is $E(Y \mid X)$?

Solution: The marginal density of $Y$ is zero for $y \notin [0,2]$, and
\[
f_Y(y) = \int_0^1 \frac{x + xy + y}{4}\,dx = \left[ (1+y)\frac{x^2}{8} + \frac{yx}{4} \right]_{x=0}^{x=1} = \frac{1+3y}{8}
\]
for $y \in [0,2]$. The marginal density of $X$ is zero for $x \notin [0,1]$, and
\[
f_X(x) = \int_0^2 \frac{x + xy + y}{4}\,dy = \left[ \frac{xy}{4} + (x+1)\frac{y^2}{8} \right]_{y=0}^{y=2} = \frac{1+2x}{2}.
\]
The conditional density of $Y$ given $X$ is
\[
f_{Y|X}(y \mid x) = \frac{f_{X,Y}(x,y)}{f_X(x)} = \frac{x + xy + y}{2+4x}.
\]

The marginal expectation of $Y$ is
\[
E(Y) = \int_0^2 \frac{y + 3y^2}{8}\,dy = \frac{4}{16} + \frac{8}{8} = \frac{5}{4}.
\]
The marginal expectation of $X$ is
\[
E(X) = \int_0^1 \frac{x + 2x^2}{2}\,dx = \frac{1}{4} + \frac{2}{6} = \frac{7}{12}.
\]
The conditional expectation of $Y$ is
\[
E(Y \mid X) = \int_0^2 y\,\frac{X + Xy + y}{2 + 4X}\,dy = \frac{2X + \frac{8}{3}(X+1)}{2 + 4X} = \frac{7X + 4}{3 + 6X}.
\]
(b) What is the BLP of Y given X? Graph the results from (a) and (b).

Solution: Recall that the best linear predictor of $Y$ given $X$ equals $\alpha + \beta X$, where
\[
\alpha = E(Y) - \beta E(X), \qquad \beta = \frac{\mathrm{Cov}(Y,X)}{\mathrm{Var}(X)}.
\]

To compute $\mathrm{Cov}(Y,X)$:
\[
E(XY) = E[X\,E(Y \mid X)] = E\left[\frac{4X + 7X^2}{3 + 6X}\right] = \int_0^1 \frac{4x + 7x^2}{3 + 6x}\cdot\frac{1 + 2x}{2}\,dx = \int_0^1 \frac{4x + 7x^2}{6}\,dx = \frac{2}{6} + \frac{7}{18} = \frac{13}{18} \approx 0.7222.
\]
Therefore,
\[
\mathrm{Cov}(Y,X) = E(XY) - E(X)E(Y) = \frac{13}{18} - \frac{7}{12}\cdot\frac{5}{4} = -\frac{1}{144} \approx -0.0069.
\]

To compute $\mathrm{Var}(X)$:
\[
E(X^2) = \int_0^1 \frac{x^2 + 2x^3}{2}\,dx = \frac{1}{6} + \frac{1}{4} = \frac{5}{12},
\]
so that
\[
\mathrm{Var}(X) = E(X^2) - E(X)^2 = \frac{5}{12} - \frac{7^2}{12^2} = \frac{11}{144} \approx 0.0764.
\]
Therefore,
\[
\beta = \frac{-1/144}{11/144} = -\frac{1}{11} \approx -0.0909,
\]
and
\[
\alpha = \frac{5}{4} + \frac{1}{11}\cdot\frac{7}{12} = \frac{43}{33} \approx 1.3030.
\]

(c) What would OLS of Y on a constant and X estimate? What would OLS of Y on X (no constant) estimate?

Solution: OLS of Y on a constant and X estimates the BLP:
\[
\mathrm{BLP} = 1.3030 - 0.0909\,X.
\]
Note that
\[
\hat\beta_{OLS} = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2} = \frac{\widehat{\mathrm{Cov}}(X,Y)}{\widehat{\mathrm{Var}}(X)}, \qquad \hat\alpha_{OLS} = \bar y - \hat\beta_{OLS}\,\bar x.
\]
[Figure: CEF and BLP of Y given X, plotted for $x \in [0,1]$.]

These equations are the sample analogs of the equations defining the coefficients of the BLP. OLS of Y on X (no constant) estimates the BLP of Y when the intercept is restricted to be zero. Note that the mean squared error of such a constrained linear predictor is
\[
MSE = E(Y - bX)^2.
\]
The first-order condition for the minimization is
\[
E[(Y - bX)X] = 0.
\]
Hence,
\[
b = \frac{E(XY)}{E(X^2)} = \frac{13/18}{5/12} = \frac{26}{15} \approx 1.7333.
\]
When no constant is included in the OLS regression of Y on X, we have
\[
\hat b_{OLS} = \frac{\sum x_i y_i}{\sum x_i^2},
\]
which is the sample analog of $b$.
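The moments derived above are easy to verify numerically. The following sketch (not part of the original solution; it uses a simple midpoint-rule grid) recomputes $E(Y)$, $E(X)$, and the BLP coefficients from the joint density:

```python
# Midpoint-rule check of the Problem 1 moments (a numerical sketch,
# not part of the original solution).
def f(x, y):
    return (x + x * y + y) / 4.0  # joint density on [0,1] x [0,2]

m = 500
dx, dy = 1.0 / m, 2.0 / m
xs = [(i + 0.5) * dx for i in range(m)]
ys = [(j + 0.5) * dy for j in range(m)]

mass = EY = EX = EXY = EX2 = 0.0
for x in xs:
    for y in ys:
        w = f(x, y) * dx * dy
        mass += w
        EY += y * w
        EX += x * w
        EXY += x * y * w
        EX2 += x * x * w

beta = (EXY - EX * EY) / (EX2 - EX * EX)  # Cov(Y,X)/Var(X)
alpha = EY - beta * EX
print(round(mass, 3), round(EY, 3), round(EX, 3))  # 1.0 1.25 0.583
print(round(beta, 3), round(alpha, 3))             # -0.091 1.303
```

The printed values agree with $E(Y) = 5/4$, $E(X) = 7/12$, $\beta = -1/11$, and $\alpha = 43/33$.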

2. Suppose that $Y$ and $X$ are $n \times 1$ vectors of data, and the following conditions hold: (1) $Y = X\beta + e$; (2) $\mathrm{rank}(X) = 1$; (3) $E(e \mid X) = 0$; and (4) $\mathrm{Var}(e \mid X) = \sigma^2 I_n$. Let $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$ and $\bar Y = \frac{1}{n}\sum_{i=1}^n Y_i$.

(a) Consider the estimator $\tilde\beta = \bar Y / \bar X$. Show that $\tilde\beta$ is linear and conditionally unbiased. Calculate its conditional variance and compare it to the conditional variance of the OLS estimator.

Solution: $\tilde\beta = \dfrac{\bar Y}{\bar X} = \dfrac{1'Y}{1'X}$. Therefore $\tilde\beta$ is linear in $Y$. Further,
\[
E(\tilde\beta \mid X) = \frac{1'E(Y \mid X)}{1'X} = \frac{1'X\beta}{1'X} = \beta.
\]
Hence, $\tilde\beta$ is conditionally (and unconditionally) unbiased. For the conditional variance, we have
\[
\mathrm{Var}(\tilde\beta \mid X) = \frac{1'\,\mathrm{Var}(Y \mid X)\,1}{(1'X)^2} = \frac{\sigma^2 n}{(1'X)^2}.
\]

Let $\hat\beta_{OLS} = \dfrac{X'Y}{X'X}$ be the OLS estimator. Then
\[
\mathrm{Var}(\hat\beta_{OLS} \mid X) = \frac{X'\,\mathrm{Var}(Y \mid X)\,X}{(X'X)^2} = \frac{\sigma^2}{X'X}.
\]
Since
\[
X'X = \sum_{i=1}^n x_i^2 = \sum_{i=1}^n (x_i - \bar x)^2 + n\bar x^2 \ge n\bar x^2 = \frac{(1'X)^2}{n},
\]
we have
\[
\frac{\sigma^2}{X'X} \le \frac{\sigma^2 n}{(1'X)^2},
\]
and therefore
\[
\mathrm{Var}(\hat\beta_{OLS} \mid X) \le \mathrm{Var}(\tilde\beta \mid X).
\]

This is consistent with the Gauss-Markov theorem, which says that OLS is BLUE (note that the conditions of the Gauss-Markov theorem are satisfied in this example).
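A quick Monte Carlo sketch can illustrate the comparison; the design below (uniform $x$ values, $\sigma = 1$) is hypothetical and only meant to check the two conditional-variance formulas:

```python
# Monte Carlo sketch (hypothetical design: x_i ~ U[0.5, 1.5], sigma = 1) comparing
# the conditional variances of beta_tilde = Ybar/Xbar and the OLS estimator.
import random

random.seed(0)
n, sigma, beta, reps = 50, 1.0, 2.0, 4000
x = [random.uniform(0.5, 1.5) for _ in range(n)]  # design held fixed across draws

sum_x = sum(x)
sum_x2 = sum(v * v for v in x)

# The conditional variances derived above:
var_tilde_theory = sigma**2 * n / sum_x**2   # sigma^2 n / (1'X)^2
var_ols_theory = sigma**2 / sum_x2           # sigma^2 / X'X

tilde, ols = [], []
for _ in range(reps):
    y = [beta * xi + random.gauss(0.0, sigma) for xi in x]
    tilde.append(sum(y) / sum_x)                               # 1'Y / 1'X
    ols.append(sum(xi * yi for xi, yi in zip(x, y)) / sum_x2)  # X'Y / X'X

def var(a):
    m = sum(a) / len(a)
    return sum((v - m) ** 2 for v in a) / len(a)

# (1'X)^2 <= n X'X (Cauchy-Schwarz), so OLS has the weakly smaller variance:
print(var_ols_theory <= var_tilde_theory)  # -> True
print(abs(var(ols) - var_ols_theory) < 0.2 * var_ols_theory)
```

The simulated variances track the theoretical ones, and the ordering matches the Gauss-Markov conclusion.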
(b) Suppose that you decide to use the first $m\,(<n)$ observations and do OLS. Show that this estimator $\hat\beta_m$ is linear and conditionally unbiased, but not minimum conditional variance.
Solution: $\hat\beta_m = \dfrac{X_m'Y_m}{X_m'X_m}$; therefore $\hat\beta_m$ is linear in $Y$. Further,
\[
E(\hat\beta_m \mid X) = \frac{X_m'E(Y_m \mid X)}{X_m'X_m} = \frac{X_m'X_m\,\beta}{X_m'X_m} = \beta.
\]
Hence, $\hat\beta_m$ is conditionally unbiased. Also,
\[
\mathrm{Var}(\hat\beta_m \mid X) = \frac{X_m'\,\mathrm{Var}(Y_m \mid X)\,X_m}{(X_m'X_m)^2} = \frac{\sigma^2}{X_m'X_m}.
\]

Since
\[
X'X = \sum_{i=1}^n x_i^2 \ge \sum_{i=1}^m x_i^2 = X_m'X_m,
\]
we have
\[
\frac{\sigma^2}{X'X} \le \frac{\sigma^2}{X_m'X_m},
\]
and therefore
\[
\mathrm{Var}(\hat\beta_{OLS} \mid X) \le \mathrm{Var}(\hat\beta_m \mid X),
\]
so that $\hat\beta_m$ is not of minimum variance.


(c) Could you suggest a minimum conditional variance, linear estimator
(not necessarily unbiased)?
Solution: Any constant has zero variance and is a minimum variance
estimator. It is biased (and rather silly) though.
3. The STATA file problemset1.dta contains data from the 1990 cross-section of the NLSY (National Longitudinal Survey of Youth). The file contains wage (variable w0), education (variable ed0), and age (variable a0) variables for 392 individuals. Create the variables lwage=log(w0), educ=ed0, and age=a0.
(a) Regress lwage on the dummies for all possible combinations of values of educ and age using the command xi: regress lwage i.educ*i.age (executing this command will automatically create the dummies). This is an example of a saturated regression. Why do you think STATA omits many dummy variables from the regression?
Solution: See the attached log file. STATA omits many dummy variables because there are no observations where these dummies equal 1. Hence, the dummy variables are identically equal to zero, which creates perfect multicollinearity.
(b) Consider the hypothesis that the conditional expectation function $E(\text{lwage} \mid \text{educ}, \text{age})$ is linear in educ, age and $(\text{age})^2$. That is, $E(\text{lwage} \mid \text{educ}, \text{age}) = \beta_0 + \beta_1\,\text{educ} + \beta_2\,\text{age} + \beta_3\,(\text{age})^2$. Since the linearity imposes a constraint on the CEF, we can call the regression of lwage on a constant, educ, age and $(\text{age})^2$ the "restricted" regression. What is the "unrestricted" regression? Explain.
Solution: The unrestricted version is the saturated regression. The CEF of the dependent variable given the explanatory variables in the saturated regression is always linear. In this regression, the coefficient on the dummy variable that equals one iff educ$\,=v_1$ and age$\,=v_2$ (where $v_1$ and $v_2$ are any possible values of education and age) is simply $E(\text{lwage} \mid \text{educ}=v_1, \text{age}=v_2)$. The relationship between $E(\text{lwage} \mid \text{educ}=v_1, \text{age}=v_2)$ and $E(\text{lwage} \mid \text{educ}=v_3, \text{age}=v_4)$ (where $v_3$ and $v_4$ are some other possible values of education and age) may be arbitrary. No constraint is imposed.
(c) Using results from the "restricted" and "unrestricted" regressions, compute the (homoskedasticity-only) F statistic to test the hypothesis from (b). Do you accept or reject the null?

Solution: Recall that
\[
F = \frac{(SSR_r - SSR_u)/q}{SSR_u/(n-k)},
\]
where $q$ is the number of constraints, $n$ is the number of observations, and $k$ is the number of variables in the unconstrained regression. Let us count the number of constraints. In our data, there are 14 different values of education (only 13 dummies are included, to avoid multicollinearity). The constrained regression $E(\text{lwage} \mid \text{educ}, \text{age}) = \beta_0 + \beta_1\,\text{educ} + \beta_2\,\text{age} + \beta_3\,(\text{age})^2$ "sets" all coefficients on the 13 dummies equal to the same unspecified value, which introduces 12 constraints. Next, there are two different values of age (only one dummy is included, to avoid multicollinearity). In the constrained regression, age is omitted to avoid multicollinearity with $(\text{age})^2$. All in all, there is only one coefficient corresponding to the age dummy in the saturated regression, and only one coefficient corresponding to $(\text{age})^2$ in the restricted regression. Hence, no additional restriction results. Finally, there are 11 included interaction dummies in the saturated regression. No interaction effects are specified in the constrained model. Effectively, this imposes 11 constraints (the coefficients on the interaction dummies are set to zero). Hence, the total number of constraints is $q = 12 + 11 = 23$. (We could have computed this directly by subtracting the "model degrees of freedom" (given in the ANOVA output of the regression command) of the constrained regression from that of the unconstrained one.) Further, $n = 392$ and $k = 26$. Hence,
\[
F = \frac{(221.16 - 210.37)/23}{210.37/(392 - 26)} \approx 0.82.
\]
The 95% critical value of $F(23, 366)$ is 1.5588. We do not reject the null.
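For reference, the arithmetic of the F statistic can be reproduced directly from the two SSR values reported above:

```python
# Homoskedasticity-only F statistic from the restricted/unrestricted SSRs.
ssr_r, ssr_u = 221.16, 210.37  # from the attached log file
n, k, q = 392, 26, 23          # observations, unrestricted regressors, constraints

F = ((ssr_r - ssr_u) / q) / (ssr_u / (n - k))
print(round(F, 2))  # -> 0.82
```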
4. Consider the regression model
\[
y_i = x_i'\beta + e_i, \quad i = 1,\dots,n,
\]
where (1) $(y_i, x_i)$ are i.i.d., (2) $E(x_i x_i')$ is non-degenerate, (3) $E(e_i \mid x_i) = 0$, and (4) $\mathrm{Var}(e_i \mid x_i) = \sigma^2$. Assume that $x_i$ does not contain a constant term. The corresponding uncentered coefficient of determination is defined by
\[
R^2 = \frac{\sum_{i=1}^n \hat y_i^2}{\sum_{i=1}^n y_i^2},
\]
where $\hat y_i = x_i'\hat\beta$, $i = 1,\dots,n$, with $\hat\beta$ the OLS estimator of $\beta$.


(a) Show that

i. $\sum_{i=1}^n \hat y_i^2/n = \hat\beta'\left(\sum_{i=1}^n x_i x_i'/n\right)\hat\beta$.

Solution:
\[
\sum_{i=1}^n \hat y_i^2/n = \sum_{i=1}^n (x_i'\hat\beta)^2/n = \sum_{i=1}^n (\hat\beta' x_i)(x_i'\hat\beta)/n = \hat\beta'\left(\sum_{i=1}^n x_i x_i'/n\right)\hat\beta.
\]

ii. $\sum_{i=1}^n \hat y_i^2/n \xrightarrow{p} \beta' E[x_i x_i']\,\beta$.

Solution: By the Law of Large Numbers,
\[
\sum_{i=1}^n x_i x_i'/n \xrightarrow{p} E[xx'].
\]
Since OLS is consistent, $\hat\beta \xrightarrow{p} \beta$. By the Continuous Mapping Theorem,
\[
\sum_{i=1}^n \hat y_i^2/n = \hat\beta'\left(\sum_{i=1}^n x_i x_i'/n\right)\hat\beta \xrightarrow{p} \beta' E[x_i x_i']\,\beta.
\]

iii. $\sum_{i=1}^n y_i^2/n \xrightarrow{p} \sigma^2 + \beta' E[x_i x_i']\,\beta$.

Solution: By the Law of Large Numbers,
\[
\sum_{i=1}^n y_i^2/n \xrightarrow{p} E[y^2] = \mathrm{var}[y] + (E[y])^2 = E[\mathrm{var}[y \mid x]] + E[(E[y \mid x])^2] = \sigma^2 + \beta' E[xx']\,\beta.
\]

(b) Hence, or otherwise, conclude that
\[
R^2 \xrightarrow{p} \frac{\beta' E[x_i x_i']\,\beta}{\sigma^2 + \beta' E[x_i x_i']\,\beta}.
\]

Solution: This convergence follows from (ii), (iii) and the Continuous Mapping Theorem (actually, it is sufficient to use Slutsky's theorem).
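A simulation sketch (with a hypothetical scalar design $x_i \sim N(0,1)$, $\beta = 1.5$, $\sigma = 2$, so that the limit is $\beta^2/(\sigma^2 + \beta^2) = 0.36$) illustrates the convergence in (b):

```python
# Simulation sketch (hypothetical design: scalar x ~ N(0,1), e ~ N(0, sigma^2))
# illustrating the probability limit of the uncentered R^2 derived in (b).
import random

random.seed(1)
n, beta, sigma = 200_000, 1.5, 2.0
x = [random.gauss(0, 1) for _ in range(n)]
y = [beta * v + random.gauss(0, sigma) for v in x]

b_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(v * v for v in x)  # no-constant OLS
r2 = sum((b_hat * v) ** 2 for v in x) / sum(v * v for v in y)         # uncentered R^2

limit = beta**2 / (sigma**2 + beta**2)  # since E[x^2] = 1, the limit is 0.36
print(abs(r2 - limit) < 0.02)
```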
5. Consider the regression
\[
y_{ig} = \beta_0 + \beta_1 x_g + e_{ig}, \tag{1}
\]
where the data have a group structure, so that
\[
E(e_{ig} e_{jg}) = \rho_e \sigma_e^2.
\]
Assume extreme intra-cluster dependence, $\rho_e = 1$. Further, let the CEF be linear, the size of each cluster be $n$, and the number of clusters in the sample be $G$.
(a) A colleague writes (1) using matrix notation as
\[
y = X\beta + e.
\]
Explain what $y$, $X$, $\beta$, and $e$ are.

Solution:
\[
y = (y_{11}, y_{21}, \dots, y_{n1}, y_{12}, \dots, y_{n2}, \dots, y_{1G}, \dots, y_{nG})',
\]
\[
X = \begin{pmatrix}
1 & x_1 \\
\vdots & \vdots \\
1 & x_1 \\
\vdots & \vdots \\
1 & x_G \\
\vdots & \vdots \\
1 & x_G
\end{pmatrix} \quad \text{(each value $x_g$ repeated $n$ times)},
\]
\[
\beta = (\beta_0, \beta_1)',
\]
and
\[
e = (e_{11}, e_{21}, \dots, e_{n1}, e_{12}, \dots, e_{n2}, \dots, e_{1G}, \dots, e_{nG})'.
\]

(b) Let $1_n = (1,1,\dots,1)'$. Show that
\[
E(ee' \mid X) = \sigma_e^2\,\mathrm{diag}(\underbrace{1_n 1_n', \dots, 1_n 1_n'}_{G \text{ blocks}})
\]
(the latter block-diagonal matrix is also denoted $I_G \otimes 1_n 1_n'$, where $\otimes$ is the Kronecker product).

Solution: The elements of $E(ee' \mid X)$ outside the $n \times n$ diagonal blocks are zero, assuming $e_{ig_1}$ and $e_{jg_2}$ are uncorrelated for $g_1 \ne g_2$. The elements in the $g$-th $n \times n$ block are equal to $E(e_{ig} e_{jg}) = \rho_e \sigma_e^2 = \sigma_e^2$. Hence, the block looks as follows:
\[
\text{$g$-th block} = \begin{pmatrix}
\sigma_e^2 & \sigma_e^2 & \cdots & \sigma_e^2 \\
\sigma_e^2 & \sigma_e^2 & \cdots & \sigma_e^2 \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_e^2 & \sigma_e^2 & \cdots & \sigma_e^2
\end{pmatrix} = \sigma_e^2\, 1_n 1_n'.
\]
(c) Show that $E(X'ee'X \mid X) = \sigma_e^2\, n\, X'X$.

Solution:
\[
E(X'ee'X \mid X) = X'\,\sigma_e^2\,\mathrm{diag}(1_n 1_n', \dots, 1_n 1_n')\,X.
\]
On the other hand,
\[
X' = \begin{pmatrix} 1_n' & \cdots & 1_n' \\ x_1 1_n' & \cdots & x_G 1_n' \end{pmatrix}.
\]
Hence,
\[
E(X'ee'X \mid X) = \sigma_e^2 \begin{pmatrix}
\sum_{g=1}^G 1_n'1_n 1_n'1_n & \sum_{g=1}^G x_g\, 1_n'1_n 1_n'1_n \\
\sum_{g=1}^G x_g\, 1_n'1_n 1_n'1_n & \sum_{g=1}^G x_g^2\, 1_n'1_n 1_n'1_n
\end{pmatrix} = \sigma_e^2\, n^2 \begin{pmatrix}
G & \sum_g x_g \\
\sum_g x_g & \sum_g x_g^2
\end{pmatrix} = \sigma_e^2\, n\, X'X.
\]

(d) Prove that the square of the Moulton factor, $\mathrm{Var}(\hat\beta_1)/\mathrm{Var}_{hom}(\hat\beta_1)$, equals $n$, which is consistent with the general formula $\mathrm{Var}(\hat\beta_1)/\mathrm{Var}_{hom}(\hat\beta_1) = 1 + (n-1)\rho_e$.

Solution: In this problem, we are conditioning on $X$; I omit the conditioning notation and write $\mathrm{Var}(\hat\beta_1)$ instead of $\mathrm{Var}(\hat\beta_1 \mid X)$. We have
\[
\mathrm{Var}(\hat\beta) = \mathrm{Var}\left((X'X)^{-1}X'e\right) = (X'X)^{-1} E(X'ee'X)(X'X)^{-1} = \sigma_e^2\, n\,(X'X)^{-1}.
\]
Now,
\[
\mathrm{Var}_{hom}(\hat\beta) = \sigma_e^2 (X'X)^{-1}.
\]
Comparing the corresponding diagonal elements, we get
\[
\mathrm{Var}(\hat\beta_1)/\mathrm{Var}_{hom}(\hat\beta_1) = n.
\]
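The result in (d) can be checked exactly with the sandwich formula; the cluster sizes and $x_g$ values below are hypothetical:

```python
# Exact check (hypothetical G, n, x_g values) of the Moulton result
# Var(beta_hat)/Var_hom(beta_hat) = n when rho_e = 1, via the sandwich formula.
G, n, sigma2_e = 5, 7, 1.0
xg = [0.3, 1.1, 2.0, 2.7, 3.5]   # one regressor value per cluster

# X'X for the nG x 2 design with rows (1, x_g), each repeated n times:
S0, S1, S2 = float(G), sum(xg), sum(v * v for v in xg)
XtX = [[n * S0, n * S1], [n * S1, n * S2]]

# X' Omega X with Omega = sigma_e^2 (I_G kron 1_n 1_n'): each block contributes n^2.
XtOX = [[sigma2_e * n * n * S0, sigma2_e * n * n * S1],
        [sigma2_e * n * n * S1, sigma2_e * n * n * S2]]

det = XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0]
inv = [[XtX[1][1] / det, -XtX[0][1] / det],
       [-XtX[1][0] / det, XtX[0][0] / det]]

def mm(A, B):  # 2x2 matrix product
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

V = mm(mm(inv, XtOX), inv)                      # (X'X)^-1 X'OmegaX (X'X)^-1
V_hom = [[sigma2_e * inv[i][j] for j in range(2)] for i in range(2)]
print(round(V[1][1] / V_hom[1][1], 6))  # -> 7.0 (= n)
```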
6. Consider a poor elderly population that uses emergency rooms for primary care. Let $y_i$ measure the health of the $i$-th randomly selected person, and let $d_i$ be the dummy that equals 1 iff the person is admitted to the hospital. Let the potential outcomes $y_{1i}$ and $y_{0i}$ be defined by
\[
y_i = \begin{cases} y_{1i} & \text{if } d_i = 1 \\ y_{0i} & \text{if } d_i = 0. \end{cases}
\]

(a) Explain in words the meaning of $E(y_{0i} \mid d_i = 1)$ and of $E(y_{1i} \mid d_i = 0)$.

Solution: $E(y_{0i} \mid d_i = 1)$ is the average of what the health of all those admitted to the hospital would have been, had they not been admitted. Similarly, $E(y_{1i} \mid d_i = 0)$ is the average of what the health of all those not admitted to the hospital would have been, had they been admitted.
(b) Why may it be the case that $E(y_{0i} \mid d_i = 1) \ne E(y_{0i} \mid d_i = 0)$?

Solution: $d_i$ is not independent of the potential health outcomes. Those who are more likely to have problems if not admitted (low $y_{0i}$) have a higher chance of being admitted.

(c) What is the likely sign of $E(y_{0i} \mid d_i = 1) - E(y_{0i} \mid d_i = 0)$?

Solution: Negative (see (b)).

(d) Is the slope of the population regression of $y_i$ on $d_i$ larger or smaller than the causal effect of hospitalization, assuming this effect is the same for everybody?

Solution: Smaller. There is negative selection bias.
7. In section 3.2.3 of Angrist and Pischke's textbook "Mostly Harmless Econometrics", there is a discussion of the causal effect of schooling on earnings. The textbook claims that including a white-collar occupational dummy, $w_i$, into the regression of earnings $y_i$ on schooling $s_i$ is an example of using a "bad control" variable.

(a) Suppose that schooling was randomly assigned to people, and let $E(y_i \mid s_i) = \beta_1 + \beta_2 s_i$. Further, assume that the causal effect of the change of $s_i$ from $s$ to $s+1$ is the same for everybody, and equals $\beta$. Show that $\beta_2 = \beta$.
Solution: We have
\[
E(y_i \mid s_i = s+1) = \beta_1 + \beta_2 (s+1),
\]
and
\[
E(y_i \mid s_i = s) = \beta_1 + \beta_2 s.
\]
Therefore,
\begin{align*}
\beta_2 &= E(y_i \mid s_i = s+1) - E(y_i \mid s_i = s) \\
&= E(y_{s+1,i} \mid s_i = s+1) - E(y_{s,i} \mid s_i = s) \\
&= E(y_{s+1,i} \mid s_i = s+1) - E(y_{s,i} \mid s_i = s+1) + \left[E(y_{s,i} \mid s_i = s+1) - E(y_{s,i} \mid s_i = s)\right] \\
&= \beta + \left[E(y_{s,i} \mid s_i = s+1) - E(y_{s,i} \mid s_i = s)\right].
\end{align*}
Since schooling is randomly assigned (by assumption), it is independent of the potential outcomes. Therefore,
\[
E(y_{s,i} \mid s_i = s+1) - E(y_{s,i} \mid s_i = s) = 0,
\]
and
\[
\beta_2 = \beta.
\]

(b) In addition to the assumptions made in (a), assume that $E(y_i \mid s_i, w_i) = \gamma_1 + \gamma_2 s_i + \gamma_3 w_i$. Show that
\[
\gamma_2 = \beta + E(y_{s,i} \mid s_i = s+1, w_i) - E(y_{s,i} \mid s_i = s, w_i).
\]

Solution: We have
\[
E(y_i \mid s_i = s+1, w_i) = \gamma_1 + \gamma_2 (s+1) + \gamma_3 w_i
\]
and
\[
E(y_i \mid s_i = s, w_i) = \gamma_1 + \gamma_2 s + \gamma_3 w_i.
\]
Subtracting the second equation from the first, we get
\begin{align*}
\gamma_2 &= E(y_i \mid s_i = s+1, w_i) - E(y_i \mid s_i = s, w_i) \\
&= E(y_{s+1,i} \mid s_i = s+1, w_i) - E(y_{s,i} \mid s_i = s, w_i) \\
&= E(y_{s+1,i} \mid s_i = s+1, w_i) - E(y_{s,i} \mid s_i = s+1, w_i) \\
&\quad + E(y_{s,i} \mid s_i = s+1, w_i) - E(y_{s,i} \mid s_i = s, w_i) \\
&= \beta + E(y_{s,i} \mid s_i = s+1, w_i) - E(y_{s,i} \mid s_i = s, w_i).
\end{align*}

(c) Since $s_i$ is independent of $y_{s,i}$, we have $E(y_{s,i} \mid s_i) = E(y_{s,i})$. Explain in words how it is possible that, despite the latter equality, $E(y_{s,i} \mid s_i, w_i) \ne E(y_{s,i} \mid w_i)$ in general, so that the selection bias $E(y_{s,i} \mid s_i = s+1, w_i) - E(y_{s,i} \mid s_i = s, w_i)$ is not equal to zero.

Solution: Even though $s_i$ is independent of $y_{s,i}$, it is statistically dependent on $w_i$ (because schooling affects the future choice of occupation). The occupational dummy, in its turn, may depend on the potential outcomes. Hence, even though $E(y_{s,i} \mid s_i) = E(y_{s,i})$, we may have $E(y_{s,i} \mid s_i, w_i) \ne E(y_{s,i} \mid w_i)$.

(d) Based on your analysis in (c), explain which variables can be called "bad controls" in an experimental setting.

Solution: Those variables that are affected by the treatment, and cannot be thought of as fixed at the time of the treatment.
8. In the setting of problem 7, let $g_i$ be a variable that equals 1 if the $i$-th randomly chosen person is a woman, and $-1$ if the person is a man. Suppose that $E(y_i \mid s_i, g_i) = \delta_1 + \delta_2 s_i + \delta_3 g_i$.

(a) Argue that, if $s_i$ is randomly assigned, then $E(y_{s,i} \mid s_i = s+1, g_i) - E(y_{s,i} \mid s_i = s, g_i) = 0$, so that there is no selection bias, and $\delta_2 = \beta$.

Solution: Note that if $s_i$ is randomly assigned, it is independent of $g_i$ and $y_{s,i}$. Therefore,
\[
E(y_{s,i} \mid s_i = s+1, g_i) = E(y_{s,i} \mid g_i),
\]
and similarly
\[
E(y_{s,i} \mid s_i = s, g_i) = E(y_{s,i} \mid g_i).
\]
Hence, there is no selection bias.
(b) Suppose that you have a sample of $(y_i, s_i, g_i)$ of size $n$. Assume that $\sum_{i=1}^n g_i = 0$ (the same number of men and women in the sample) and $\sum_{i=1}^n g_i s_i = 0$ (the average education level is the same for men and women in the sample). Show that the OLS estimate of the coefficient on $s_i$ in the "short regression" of $y_i$ on a constant and $s_i$ is the same as the OLS estimate of the coefficient on $s_i$ in the "long regression" of $y_i$ on a constant, $s_i$, and $g_i$.
Solution: Consider the short regression $y_i = \beta_1 + \beta_2 s_i + e_i$. We have
\[
\hat\beta_{OLS} = \begin{pmatrix} n & \sum s_i \\ \sum s_i & \sum s_i^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum y_i \\ \sum s_i y_i \end{pmatrix}.
\]
Now, consider the long regression $y_i = \delta_1 + \delta_2 s_i + \delta_3 g_i + u_i$. We have
\begin{align*}
\hat\delta_{OLS} &= (X'X)^{-1} X'Y
= \begin{pmatrix} n & \sum s_i & \sum g_i \\ \sum s_i & \sum s_i^2 & \sum s_i g_i \\ \sum g_i & \sum s_i g_i & \sum g_i^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum y_i \\ \sum s_i y_i \\ \sum g_i y_i \end{pmatrix} \\
&= \begin{pmatrix} n & \sum s_i & 0 \\ \sum s_i & \sum s_i^2 & 0 \\ 0 & 0 & \sum g_i^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum y_i \\ \sum s_i y_i \\ \sum g_i y_i \end{pmatrix}
= \begin{pmatrix} \hat\beta_{OLS} \\ \left(\sum g_i^2\right)^{-1} \sum g_i y_i \end{pmatrix}.
\end{align*}
In particular, $\hat\delta_{2,OLS} = \hat\beta_{2,OLS}$.
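A small numeric sketch (made-up data constructed so that $\sum g_i = 0$ and $\sum g_i s_i = 0$) confirms that the two coefficients on $s_i$ coincide:

```python
# A numeric sketch (made-up data built so that sum(g) = 0 and sum(g*s) = 0)
# checking that the short and long regressions give the same coefficient on s.
import random

random.seed(2)
half = [random.randint(10, 16) for _ in range(50)]
s = half + half                   # each s value appears with g = +1 and g = -1
g = [1] * 50 + [-1] * 50
y = [1.0 + 0.8 * s[i] + 0.3 * g[i] + random.gauss(0, 1) for i in range(100)]
assert sum(g) == 0 and sum(g[i] * s[i] for i in range(100)) == 0

def ols(cols, y):
    # Solve (X'X) b = X'y by Gauss-Jordan elimination (X'X is positive
    # definite here, so no pivoting is needed).
    k, N = len(cols), len(y)
    A = []
    for r in range(k):
        row = [sum(cols[r][i] * cols[c][i] for i in range(N)) for c in range(k)]
        row.append(sum(cols[r][i] * y[i] for i in range(N)))
        A.append(row)
    for p in range(k):
        A[p] = [v / A[p][p] for v in A[p]]
        for r in range(k):
            if r != p:
                A[r] = [A[r][c] - A[r][p] * A[p][c] for c in range(k + 1)]
    return [A[r][k] for r in range(k)]

ones = [1.0] * 100
b_short = ols([ones, s], y)      # y on constant and s
b_long = ols([ones, s, g], y)    # y on constant, s, and g
print(abs(b_short[1] - b_long[1]) < 1e-8)  # -> True
```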


(c) Suppose that the adjusted $R^2$ from the "long regression" is $\bar R^2_{long} = 0.2$ and that from the short regression is $\bar R^2_{short} = 0.1$. Prove that, under the assumptions made in (b), the homoskedastic standard error estimate for the OLS coefficient on $s_i$ from the "long regression" equals $\sqrt{8/9}$ times the homoskedastic standard error estimate for the OLS coefficient on $s_i$ from the "short regression". Hence, the precision of the estimate in the "long regression" is higher than that in the "short regression".

Solution: The estimate of $SE(\hat\beta_{2,OLS})$ equals
\[
\widehat{SE}(\hat\beta_{2,OLS}) = \sqrt{\frac{SSR_{short}}{n - k_{short}} \left( (X_{short}'X_{short})^{-1} \right)_{22}},
\]
where $\left((X_{short}'X_{short})^{-1}\right)_{22}$ is the element in the second row and second column of the matrix $(X_{short}'X_{short})^{-1}$. Similarly,
\[
\widehat{SE}(\hat\delta_{2,OLS}) = \sqrt{\frac{SSR_{long}}{n - k_{long}} \left( (X_{long}'X_{long})^{-1} \right)_{22}}.
\]
But, as follows from the solution to (b),
\[
\left((X_{short}'X_{short})^{-1}\right)_{22} = \left((X_{long}'X_{long})^{-1}\right)_{22}.
\]
Hence,
\[
\frac{\widehat{SE}(\hat\delta_{2,OLS})}{\widehat{SE}(\hat\beta_{2,OLS})} = \sqrt{\frac{SSR_{long}}{n - k_{long}} \cdot \frac{n - k_{short}}{SSR_{short}}}.
\]

Now, recall that the adjusted $R^2$ equals
\[
\bar R^2 = 1 - \frac{n}{n-k} \cdot \frac{SSR}{TSS}.
\]
Therefore,
\[
\frac{SSR_{long}}{n - k_{long}} = \frac{TSS_{long}}{n}\left(1 - \bar R^2_{long}\right) = \frac{TSS_{long}}{n} \cdot 0.8
\]
and
\[
\frac{SSR_{short}}{n - k_{short}} = \frac{TSS_{short}}{n}\left(1 - \bar R^2_{short}\right) = \frac{TSS_{short}}{n} \cdot 0.9.
\]
On the other hand, $TSS_{long} = TSS_{short}$ because the dependent variable in both regressions is the same. Hence,
\[
\sqrt{\frac{SSR_{long}}{n - k_{long}} \cdot \frac{n - k_{short}}{SSR_{short}}} = \sqrt{\frac{8}{9}}.
\]
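The implied standard-error ratio is just arithmetic on the two adjusted $R^2$ values (using the sheet's convention $\bar R^2 = 1 - \frac{n}{n-k}\frac{SSR}{TSS}$):

```python
# Standard-error ratio implied by the two adjusted R^2 values.
import math

r2_long, r2_short = 0.2, 0.1
ratio = math.sqrt((1 - r2_long) / (1 - r2_short))  # sqrt(0.8/0.9) = sqrt(8/9)
print(round(ratio, 4))  # -> 0.9428
```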

(d) In light of your answers to (a), (b), and (c), explain in words why $g_i$ is a "good control".

Solution: Including $g_i$ in the regression does not introduce selection bias, because $g_i$ is fixed at the time of the treatment assignment. On the other hand, including $g_i$ in the regression reduces uncertainty and leads to lower standard errors of the OLS estimates.
