
# Imbens, Lecture Notes 1, EC2140 Spring 12 1

## EC2140, Econometric Methods

Department of Economics, Harvard University
Spring 2012

The Normal Linear Model I:
Robust Variances, The Bootstrap, the Delta Method,
and Finite versus Super Population Inference (W 4.2.1-4)
1. Introduction
Let us review the basics of the linear model. We have $N$ units (individuals, firms, or other economic agents) drawn randomly from a large population. The units are indexed by $i = 1, 2, \ldots, N$. For unit $i$ we observe a scalar outcome $Y_i$ and a $(K+1)$-dimensional column vector of explanatory variables $X_i = (X_{i0}, X_{i1}, \ldots, X_{iK})'$ (where, unless we explicitly say otherwise, the first covariate is a constant, $X_{i0} = 1$ for all $i = 1, \ldots, N$). We are interested in explaining the distribution of $Y_i$ in terms of the explanatory variables $X_i$ using a linear model:

$$Y_i = X_i'\beta + \varepsilon_i. \qquad (1)$$

In this equation $\beta$ is a $(K+1)$-dimensional column vector, with one element for each component of $X_i$. In matrix notation,

$$Y = X\beta + \varepsilon,$$

where $Y$ and $\varepsilon$ are $N$-dimensional column vectors, and $X$ is an $N \times (K+1)$ dimensional matrix with $i$th row equal to $X_i'$. Avoiding vector and matrix notation completely we can write this as:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \ldots + \beta_K X_{iK} + \varepsilon_i = \sum_{k=0}^{K} \beta_k X_{ik} + \varepsilon_i.$$
We assume that observations form a random sample from a large (infinite) population:

Assumption 1 (Independence) $(X_i, Y_i)_{i=1}^{N}$ are independent and identically distributed, with the first four moments of $X_i$ and $Y_i$ finite, and the expected value $E[X_i X_i']$ full rank.

In addition we consider some assumptions on the relation between the error terms $\varepsilon_i$ and the regressors. The first version assumes normality and independence of the error terms:

Assumption 2 $\varepsilon_i \mid X_i \sim N(0, \sigma^2)$.

For some results we can weaken this assumption considerably. First, we could relax normality and only assume independence of the errors $\varepsilon_i$ and the regressors:

Assumption 3 $\varepsilon_i \perp X_i$,

combined with the normalization that $E[\varepsilon_i] = 0$. We can even weaken this assumption further by requiring only mean-independence,

Assumption 4 $E[\varepsilon_i \mid X_i] = 0$,

or even further, requiring only zero correlation:

Assumption 5 $E[\varepsilon_i X_i] = 0$.
The (ordinary) least squares estimator for $\beta$ solves

$$\min_{\beta} \sum_{i=1}^{N} (Y_i - X_i'\beta)^2.$$

The solution for the least squares estimator is

$$\hat\beta_{ols} = \left(\sum_{i=1}^{N} X_i X_i'\right)^{-1} \left(\sum_{i=1}^{N} X_i Y_i\right) = (X'X)^{-1}(X'Y).$$
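As an illustration of the closed-form solution above, the following is a minimal sketch that solves the normal equations $(X'X)b = X'Y$ by Gauss-Jordan elimination using only the Python standard library. The function name `ols` and the simulated data are assumptions made for this example; they are not part of the original notes.

```python
import random

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination.

    X is a list of rows (each a list of K+1 regressors), y a list of outcomes.
    """
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[a] * row[b] for row in X) for b in range(k)]
         + [sum(row[a] * yi for row, yi in zip(X, y))] for a in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability.
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        pivot = A[col][col]
        A[col] = [v / pivot for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[r][k] for r in range(k)]

# Hypothetical data: Y = 1 + 2*x + noise, with a constant as the first regressor.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
X = [[1.0, x] for x in xs]
y = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]
beta_hat = ols(X, y)
print(beta_hat)  # estimates of the intercept and slope
```

In practice one would use a linear algebra library rather than hand-coded elimination; the sketch is only meant to make the formula concrete.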
The (exact) distribution of the ols estimator under the normality assumption in Assumption 2, conditional on $X$, is

$$\hat\beta_{ols} \mid X \sim N\left(\beta, \sigma^2 (X'X)^{-1}\right).$$

Without the normality of the $\varepsilon$, we do not have exact results on the moments or distribution of $\hat\beta_{ols}$. However, under the independence Assumption 3, and a second moment condition on $\varepsilon$ (variance finite and equal to $\sigma^2$) implied by Assumption 1, we can establish asymptotic (meaning, in large samples) normality:

$$\sqrt{N}(\hat\beta_{ols} - \beta) \xrightarrow{d} N\left(0, \sigma^2 E[X_i X_i']^{-1}\right).$$

Typically we do not know $\sigma^2$. We can consistently estimate it as

$$\hat\sigma^2 = \frac{1}{N-K-1} \sum_{i=1}^{N} \left(Y_i - X_i'\hat\beta_{ols}\right)^2.$$

Dividing by $N-K-1$ rather than $N$ corrects for the fact that $K+1$ parameters are estimated before calculating the residuals $\hat\varepsilon_i = Y_i - X_i'\hat\beta_{ols}$. This degrees-of-freedom correction does not matter in large samples, and in fact the maximum likelihood estimator,

$$\hat\sigma^2_{ml} = \frac{1}{N} \sum_{i=1}^{N} \left(Y_i - X_i'\hat\beta_{ols}\right)^2,$$

is a perfectly reasonable alternative. The first estimator $\hat\sigma^2$ is unbiased, but the maximum likelihood estimator $\hat\sigma^2_{ml}$ has lower expected squared error. So in practice, whether we have exact normality for the error terms or not, we will use the following distribution for $\hat\beta_{ols}$:

$$\hat\beta_{ols} \approx N(\beta, V), \quad \text{where} \quad V = \frac{\sigma^2}{N} \left(E[X_i X_i']\right)^{-1} = \sigma^2 \left(E[X'X]\right)^{-1},$$

estimated as

$$\hat V = \frac{\hat\sigma^2}{N} \left(\frac{1}{N}\sum_{i=1}^{N} X_i X_i'\right)^{-1} = \hat\sigma^2 \left(\sum_{i=1}^{N} X_i X_i'\right)^{-1}. \qquad (2)$$
Often we are interested in one particular coefficient. Suppose for example we are interested in $\beta_k$. In that case we have

$$\hat\beta_k \approx N(\beta_k, \hat V_{kk}),$$

where $\hat V_{ij}$ is the $(i,j)$ element of the matrix $\hat V$. We can use this for constructing confidence intervals for a particular coefficient. For example, a 95% confidence interval for $\beta_k$ would be

$$CI^{0.95}(\beta_k) = \left(\hat\beta_k - 1.96\sqrt{\hat V_{kk}},\ \hat\beta_k + 1.96\sqrt{\hat V_{kk}}\right).$$

Here 1.96 is the 0.975 quantile of the standard normal distribution. We can also use this to test whether a particular coefficient is equal to some preset number. For example, if we want to test whether $\beta_k$ is equal to 0.1, we construct the t-statistic

$$t = \frac{\hat\beta_k - 0.1}{\sqrt{\hat V_{kk}}},$$

and compare it to a standard normal distribution (a normal distribution with mean zero and variance equal to one). If we want to do a two-sided test at the 10% level, we would compare the absolute value of this t-statistic to 1.645, the 0.95 quantile of the standard normal distribution (meaning that if $Z$ has a standard normal $N(0,1)$ distribution, the probability that $Z$ is less than 1.645 is equal to 0.95, $\Pr(Z \le 1.645) = 0.95$).
2. Robust Variances
Next we consider the distribution of $\hat\beta$ under much weaker assumptions. Instead of independence and normality of the $\varepsilon$, we use Assumption 5, $E[\varepsilon_i X_i] = 0$. This essentially defines the true value of $\beta$ to be the best linear predictor:

$$\beta = \arg\min_{b} E\left[(Y_i - X_i'b)^2\right] = \left(E[X_i X_i']\right)^{-1} E[X_i Y_i].$$

Under this assumption and independent sampling (Assumption 1), we still have asymptotic normality for the least squares estimator, but now with a different variance:

$$\sqrt{N}\left(\hat\beta_{ols} - \beta\right) \xrightarrow{d} N\left(0, \left(E[X_i X_i']\right)^{-1} \left(E[\varepsilon_i^2 X_i X_i']\right) \left(E[X_i X_i']\right)^{-1}\right).$$
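Let the asymptotic variance be denoted by

$$V_{robust} = \left(E[X_i X_i']\right)^{-1} \left(E\left[\varepsilon_i^2 X_i X_i'\right]\right) \left(E[X_i X_i']\right)^{-1}.$$

This is known as the heteroskedasticity-consistent variance, or the robust variance, due to Eicker (1967), Huber (1967), and White (1980). To see where this variance comes from, write the least squares estimator $\hat\beta_{ols}$ minus the true value as

$$\begin{aligned}
\hat\beta_{ols} - \beta
&= \left(\frac{1}{N}\sum_{i=1}^{N} X_i X_i'\right)^{-1} \left(\frac{1}{N}\sum_{i=1}^{N} X_i Y_i\right) - \beta \\
&= \left(\frac{1}{N}\sum_{i=1}^{N} X_i X_i'\right)^{-1} \left(\frac{1}{N}\sum_{i=1}^{N} X_i X_i'\beta\right) + \left(\frac{1}{N}\sum_{i=1}^{N} X_i X_i'\right)^{-1} \left(\frac{1}{N}\sum_{i=1}^{N} X_i \varepsilon_i\right) - \beta \\
&= \left(\frac{1}{N}\sum_{i=1}^{N} X_i X_i'\right)^{-1} \left(\frac{1}{N}\sum_{i=1}^{N} X_i \varepsilon_i\right).
\end{aligned}$$

The second factor satisfies a central limit theorem:

$$\frac{1}{\sqrt{N}} \sum_{i=1}^{N} X_i \varepsilon_i \xrightarrow{d} N\left(0, E\left[\varepsilon_i^2 X_i X_i'\right]\right).$$

Note that along the way we assume that the covariates are not perfectly collinear, so that the matrix $\sum_{i=1}^{N} X_i X_i'$ has full rank.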
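We can estimate the heteroskedasticity-consistent variance consistently as

$$\hat V_{robust} = \left(\frac{1}{N}\sum_{i=1}^{N} X_i X_i'\right)^{-1} \left(\frac{1}{N}\sum_{i=1}^{N} \hat\varepsilon_i^2 X_i X_i'\right) \left(\frac{1}{N}\sum_{i=1}^{N} X_i X_i'\right)^{-1} \qquad (3)$$

$$= N\,(X'X)^{-1} \left(X' \mathrm{diag}(\hat\varepsilon_i^2) X\right) (X'X)^{-1},$$

where $\hat\varepsilon_i = Y_i - X_i'\hat\beta_{ols}$ is the (estimated) residual. Because $V_{robust}$ is the variance of the limiting distribution of $\sqrt{N}(\hat\beta_{ols} - \beta)$, the corresponding estimate of the variance of $\hat\beta_{ols}$ itself is $\hat V_{robust}/N = (X'X)^{-1}\left(X' \mathrm{diag}(\hat\varepsilon_i^2) X\right)(X'X)^{-1}$.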
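The sandwich form of the robust variance can be sketched for a regression on a constant and a single covariate, where the $2 \times 2$ inverse is available in closed form. This is a minimal illustration, not the author's code: the helper names (`inv2`, `matmul`, `ols_with_variances`) and the simulated heteroskedastic data are assumptions. Both returned matrices estimate the variance of $\hat\beta_{ols}$ itself, that is, $\hat\sigma^2 (X'X)^{-1}$ and $(X'X)^{-1}(X'\mathrm{diag}(\hat\varepsilon_i^2)X)(X'X)^{-1}$.

```python
import random

def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def ols_with_variances(x, y):
    """OLS of y on (1, x); returns beta-hat, conventional and robust variances."""
    n = len(x)
    X = [[1.0, xi] for xi in x]
    XtX = [[sum(r[a] * r[b] for r in X) for b in range(2)] for a in range(2)]
    Xty = [[sum(r[a] * yi for r, yi in zip(X, y))] for a in range(2)]
    XtX_inv = inv2(XtX)
    beta = [row[0] for row in matmul(XtX_inv, Xty)]
    resid = [yi - (beta[0] + beta[1] * xi) for xi, yi in zip(x, y)]
    # Conventional: sigma^2-hat (X'X)^{-1}, dividing by N - K - 1 with K = 1.
    s2 = sum(e * e for e in resid) / (n - 2)
    v_conv = [[s2 * v for v in row] for row in XtX_inv]
    # Robust: (X'X)^{-1} [sum_i e_i^2 x_i x_i'] (X'X)^{-1}.
    meat = [[sum(e * e * r[a] * r[b] for e, r in zip(resid, X))
             for b in range(2)] for a in range(2)]
    v_rob = matmul(matmul(XtX_inv, meat), XtX_inv)
    return beta, v_conv, v_rob

# Hypothetical heteroskedastic data: the error standard deviation grows with x.
rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(500)]
y = [1.0 + 0.5 * xi + rng.gauss(0, 0.2 + 0.3 * xi) for xi in x]
beta, v_conv, v_rob = ols_with_variances(x, y)
print("slope:", beta[1])
print("conventional se:", v_conv[1][1] ** 0.5)
print("robust se:      ", v_rob[1][1] ** 0.5)
```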

3. The Bootstrap
An alternative method for estimating the variance of least squares estimators, and in fact of many other estimators as well, is bootstrapping, or more generally, resampling, originating with Efron (1979, 1982). For textbook treatments see Efron and Tibshirani (1993), Davison and Hinkley (1998) and Hall (1992). The first two are highly recommended; the third one is more technical.
3.A Theory of the Bootstrap
Consider the following scenario. We have a random sample of size $N$ from some distribution with cumulative distribution function $F_X(x) = \Pr(X_i \le x)$. We are interested in estimating the expected value of $X$, $\mu_X = E[X_i]$, using the sample average $\bar X = \sum_{i=1}^{N} X_i / N$. Let $\sigma_X^2$ be the variance of $X$. The exact variance of the sample average $\bar X$ is

$$V\left(\bar X\right) = \sigma_X^2 / N = E\left[(X_i - E[X_i])^2\right] / N.$$

How do we estimate the variance? The standard variance estimator is

$$\hat V\left(\bar X\right) = S_X^2 / N,$$

where the sample variance $S_X^2$ is

$$S_X^2 = \frac{1}{N-1} \sum_{i=1}^{N} \left(X_i - \bar X\right)^2.$$

Now consider an alternative approach. Suppose we actually knew the entire cdf $F_X(x)$. In that case we could calculate the variance by replacing all expectations by integrals:

$$\mu_X = E[X_i] = \int x \, dF_X(x), \qquad (4)$$

$$\sigma_X^2 = V(X_i) = \int (x - \mu_X)^2 \, dF_X(x), \qquad (5)$$
and thus

$$V(\bar X) = \int \left(z - \int x \, dF_X(x)\right)^2 dF_X(z) / N.$$

Now obviously we do not know the cdf. (If we did, we would not have to estimate the expected value of $X$ in the first place.) However, we can replace it in these calculations with an estimate based on the empirical distribution function:

$$\hat F_X(x) = \sum_{i=1}^{N} 1_{x_i \le x} / N,$$

where $1_A$ is the indicator function for the event $A$, equal to 1 if $A$ is true and 0 otherwise. The empirical distribution function is a pretty good estimate, in fact the maximum likelihood estimate, with $\sup_x |\hat F_X(x) - F_X(x)| = O_p(N^{-1/2})$.

If we use the empirical distribution function $\hat F_X(x)$ instead of the actual distribution function $F_X(x)$ in expressions (4) and (5), the expected value is

$$\hat\mu_X = \int x \, d\hat F_X(x) = \frac{1}{N} \sum_{i=1}^{N} x_i = \bar X.$$

The variance is

$$\hat\sigma_X^2 = \int (x - \hat\mu_X)^2 \, d\hat F_X(x) = \int (x - \bar X)^2 \, d\hat F_X(x) = \sum_{i=1}^{N} (x_i - \bar X)^2 / N = S_X^2 \cdot (N-1)/N.$$

Hence we would end up estimating the variance of the sample average as

$$\hat V(\bar X) = S_X^2 \cdot (N-1)/N^2,$$

which is pretty close to the standard estimate of $S_X^2/N$ (it differs by a factor $(N-1)/N$, which goes to one fast).
3.B Computational Issues of the Bootstrap
Now this calculation is more complicated than it need be. In practice we do not need the exact bootstrap variance $\hat V(\bar X)$, which is not equal to the exact variance of $\bar X$ anyway. If all we are interested in is an approximation, we can make the calculation a lot simpler. If we want the distribution of some statistic $W(\mathbf{X})$ (e.g., the sample average $W(\mathbf{X}) = \bar X$) according to the empirical distribution function, we can draw random samples from the empirical distribution. So, let the original sample be $X_1 = x_1, \ldots, X_N = x_N$. Now consider the random variable $X$ with a discrete distribution with support $\{x_1, x_2, \ldots, x_N\}$ and probabilities $\Pr(X = x_i) = 1/N$ for all $i$, and thus with cumulative distribution function $\hat F_X(x)$. Draw a random sample from this distribution, of size $N$. Denote this random sample by $\mathbf{X}_b = (X_{b1}, \ldots, X_{bN})$. Then calculate the statistic $W(\mathbf{X}_b)$. Repeat this $B$ times and calculate the average and sample variance of $W(\mathbf{X}_b)$ over these $B$ random samples. That will give us, by the law of large numbers (in $B$), the population mean and variance of $W(\mathbf{X})$ according to the empirical distribution function.

Let us make this a little more specific. Suppose our sample is $X_1 = 0$, $X_2 = 3$ and $X_3 = 1$. The sample average is $\bar X = 4/3$. We are interested in the variance of the random variable $W(\mathbf{X}) = \bar X = (X_1 + X_2 + X_3)/3$. The empirical distribution function is a discrete distribution with probability mass function

$$\hat f(x) = 1/3,$$

for $x = 0, 1, 3$ and zero elsewhere. One random sample from this distribution could be

$$(0, 1, 0).$$

The value of the statistic for this sample is $w_1 = 1/3$. The next sample could be

$$(0, 3, 3),$$

with a sample average of $w_2 = 2$. After doing this many times, say $B$ times, we can use the statistics $w_1, w_2, \ldots, w_B$ to approximate the expected value and variance of $W$ under the empirical distribution function as

$$\hat E[W] \approx \bar w = \frac{1}{B} \sum_{b=1}^{B} w_b,$$

and

$$\hat V[W] \approx \frac{1}{B} \sum_{b=1}^{B} (w_b - \bar w)^2.$$

We then use this variance estimator $\hat V(W)$ as an estimate of the variance of $\bar X$.
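For the three-point sample in this example the bootstrap distribution can be enumerated exactly, because there are only $3^3 = 27$ equally likely resamples. The sketch below compares the exact bootstrap variance of the sample average, which equals $S_X^2 (N-1)/N^2 = 14/27$ here, with a Monte Carlo approximation based on $B$ random resamples. The value of `B` and the seed are arbitrary choices for illustration.

```python
import itertools
import random
import statistics

sample = [0, 3, 1]
n = len(sample)

# Exact bootstrap distribution: all N^N equally likely resamples.
exact_means = [sum(draw) / n for draw in itertools.product(sample, repeat=n)]
exact_var = statistics.pvariance(exact_means)

# Monte Carlo approximation with B resamples drawn with replacement.
rng = random.Random(0)
B = 20000
boot_means = [sum(rng.choices(sample, k=n)) / n for _ in range(B)]
mc_var = statistics.pvariance(boot_means)

s2 = statistics.variance(sample)          # S_X^2, which divides by N - 1
print(exact_var, s2 * (n - 1) / n ** 2)   # both equal 14/27 up to rounding
print(mc_var)                             # close to 14/27 for large B
```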
We can do this in much more complex settings. Suppose we are interested in some best linear predictor parameters $\beta$, defined as

$$\beta = E[X_i X_i']^{-1} E[X_i Y_i].$$

Given a sample of size $N$, $\{(Y_i, X_i)\}_{i=1}^{N}$, the least squares estimator is

$$\hat\beta_{ols} = \left(\sum_{i=1}^{N} X_i X_i'\right)^{-1} \left(\sum_{i=1}^{N} X_i Y_i\right).$$

To get its variance, we resample, at random, with replacement, the pairs $(Y_i, X_i)$, to get a new sample $\{(X_{bj}, Y_{bj})\}_{j=1}^{N}$. Then calculate for each such bootstrap sample the regression estimate

$$\hat\beta_{ols,b} = \left(\sum_{j=1}^{N} X_{bj} X_{bj}'\right)^{-1} \left(\sum_{j=1}^{N} X_{bj} Y_{bj}\right).$$

Next, given $B$ replications we estimate the variance of the least squares estimator $\hat\beta_{ols}$ as

$$\hat V\left(\hat\beta_{ols}\right) = \frac{1}{B} \sum_{b=1}^{B} \left(\hat\beta_{ols,b} - \overline{\beta}_{ols}\right) \left(\hat\beta_{ols,b} - \overline{\beta}_{ols}\right)',$$

where

$$\overline{\beta}_{ols} = \frac{1}{B} \sum_{b=1}^{B} \hat\beta_{ols,b}.$$
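The pairs (nonparametric) bootstrap for a regression can be sketched as follows for the case of a constant and one covariate. The function names and the simulated data are assumptions made for illustration, and only the diagonal of the bootstrap variance matrix (the two standard errors) is computed.

```python
import random
import statistics

def ols_simple(x, y):
    """OLS of y on a constant and x; returns (intercept, slope)."""
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def pairs_bootstrap(x, y, B, rng):
    """Resample (Y_i, X_i) pairs with replacement and re-estimate B times."""
    n = len(x)
    draws = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        draws.append(ols_simple([x[i] for i in idx], [y[i] for i in idx]))
    return draws

# Hypothetical data loosely resembling a log-earnings regression.
rng = random.Random(0)
x = [rng.uniform(8, 18) for _ in range(300)]
y = [5.0 + 0.07 * xi + rng.gauss(0, 0.4) for xi in x]

draws = pairs_bootstrap(x, y, B=1000, rng=rng)
se_intercept = statistics.pstdev([d[0] for d in draws])
se_slope = statistics.pstdev([d[1] for d in draws])
print(se_intercept, se_slope)
```

Using `pstdev` (dividing by $B$) matches the $1/B$ normalization in the variance formula above; with $B$ in the hundreds or more the choice between $B$ and $B-1$ is immaterial.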
We can do this for many, in fact almost all, estimators used in econometrics. The key is that the estimator should be relatively easy to calculate, because you will have to do this many times to get an estimate of the variance. There are also regularity conditions required. For example, if we are interested in the maximum of the support, bootstrapping is unlikely to work. See Efron (1982), Efron and Tibshirani (1993), Davison and Hinkley (1998), and Hall (1992) for details and generalizations.
3.C The Parametric Bootstrap
We now discuss three additional issues. First is that of the parametric bootstrap. Consider the regression model

$$Y_i = X_i'\beta + \varepsilon_i.$$

Instead of bootstrapping the pairs $(Y_i, X_i)$, which is also known as the nonparametric bootstrap, we can bootstrap the residuals. First estimate $\beta$ by least squares. Then calculate the residuals

$$\hat\varepsilon_i = Y_i - X_i'\hat\beta_{ols}.$$

For $b = 1, \ldots, B$, resample $N$ residuals $\hat\varepsilon_{b,j}$ with replacement from the $\hat\varepsilon_i$. Then construct the $b$-th bootstrap sample $(Y_{b,j}, X_j)$ using

$$Y_{b,j} = X_j'\hat\beta_{ols} + \hat\varepsilon_{b,j}.$$

Then proceed as before. This is the parametric bootstrap. If the disturbances are really independent of the $X$s, the parametric bootstrap works better than the nonparametric bootstrap in the sense that the estimator of the variance is more precise, but if not, the nonparametric bootstrap may be preferred.
3.D The Jackknife
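The second issue is an alternative (in fact, historically, a precursor) to the bootstrap, the jackknife. Consider the original example of estimating the population mean. The standard estimator is $\bar X$, and we are interested in its variance. The jackknife estimate of the variance calculates for each $i$ the estimator based on leaving out the $i$th observation:

$$\bar X_{(i)} = \frac{1}{N-1} \sum_{j \ne i} X_j.$$

Given these $N$ estimates of the mean, which clearly average out to $\bar X$ (i.e., $\sum_i \bar X_{(i)}/N = \bar X$), the variance of $\bar X$ is estimated as

$$\hat V(\bar X) = \frac{N-1}{N} \sum_{i=1}^{N} \left(\bar X_{(i)} - \bar X\right)^2.$$

To see why this works, consider the difference

$$\bar X_{(i)} - \bar X = \sum_{j \ne i} \frac{1}{N(N-1)} X_j - \frac{1}{N} X_i.$$

The expectation of this difference is obviously zero. The variance is

$$\begin{aligned}
V\left(\bar X_{(i)} - \bar X\right)
&= E\left[\left(\sum_{j \ne i} \frac{1}{N(N-1)} (X_j - \mu_X) - \frac{1}{N}(X_i - \mu_X)\right)^2\right] \\
&= \sum_{j \ne i} \frac{1}{N^2 (N-1)^2} \sigma_X^2 + \frac{1}{N^2} \sigma_X^2 \\
&= \sigma_X^2 \left(\frac{1}{N^2 (N-1)} + \frac{1}{N^2}\right) = \frac{\sigma_X^2}{N(N-1)}.
\end{aligned}$$

Summing this over all observations and applying the factor $(N-1)/N$ gives

$$E\left[\hat V\left(\bar X\right)\right] = E\left[\frac{N-1}{N} \sum_{i=1}^{N} \left(\bar X_{(i)} - \bar X\right)^2\right] = (N-1)\, E\left[\left(\bar X_{(i)} - \bar X\right)^2\right] = \sigma_X^2 / N,$$

which is the variance of $\bar X$.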
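A minimal sketch of the jackknife for the sample mean follows, using the standard jackknife scaling $(N-1)/N$ on the sum of squared deviations of the leave-one-out means. For the sample mean, $\bar X_{(i)} - \bar X = (\bar X - X_i)/(N-1)$, so with this scaling the jackknife variance coincides exactly with $S_X^2/N$. The function name and the data are hypothetical.

```python
import statistics

def jackknife_variance_of_mean(x):
    """Jackknife variance estimate for the sample mean of x."""
    n = len(x)
    total = sum(x)
    xbar = total / n
    # Leave-one-out means: drop observation i and average the rest.
    loo_means = [(total - xi) / (n - 1) for xi in x]
    return (n - 1) / n * sum((m - xbar) ** 2 for m in loo_means)

x = [0.0, 3.0, 1.0, 4.0, 2.5]
v_jack = jackknife_variance_of_mean(x)
v_std = statistics.variance(x) / len(x)   # S_X^2 / N
print(v_jack, v_std)                      # identical up to floating-point error
```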
3.E Subsampling
A more recent important variation on the bootstrap is subsampling. This is valid under much weaker conditions than the standard bootstrap. See Politis, Romano and Wolf (1999). In fact it does not rely on the convergence of the estimator to its limit distribution being at the standard $\sqrt{N}$ rate, nor does it require that the limiting distribution is normal. Here we just look at the implementation in a simple case. We are interested in a functional of the cumulative distribution function $F$, $\theta = \theta(F)$. This may simply be the mean $\theta = E[X]$, but it may also be something like the maximum of the support (and this is the type of example where the standard bootstrap does not work). Let's focus on the mean, $\theta = E[X]$, with estimator $\hat\theta = \bar X$. In this case $\sqrt{N}(\hat\theta - \theta)$ converges to a normal distribution with mean zero and variance $\sigma_X^2$. We focus on constructing a 95% confidence interval for $\theta$.

The idea behind subsampling is to take repeated bootstrap samples of size $M < N$ (subsamples). Key is that as the sample size increases the ratio of the subsample size to the total sample size converges to zero, or $M/N \to 0$. As a result the distinction between sampling with and sampling without replacement disappears. Suppose we take $B$ of the subsamples of size $M$. In each case we calculate the estimator as $\hat\theta_b$. Let $c_\alpha$ be the $\alpha$ quantile of the distribution of $\sqrt{M}(\hat\theta_b - \hat\theta)$ (over the $B$ subsamples). Calculate $c_{0.975}$ and $c_{0.025}$ (because we are interested in a 95% confidence interval). In large samples these quantiles will converge to $1.96\,\sigma_X$ and $-1.96\,\sigma_X$ respectively. Then we construct the 95% confidence interval for $\theta$ as

$$CI^{0.95}(\theta) = \left(\hat\theta + c_{0.025}/\sqrt{N},\ \hat\theta + c_{0.975}/\sqrt{N}\right).$$

By taking a subsample of size $M$, with $M/N$ going to zero as the sample size gets large, we avoid, at least in large samples, getting duplicates in our bootstrap samples. As a result the subsampling bootstrap is valid in a much larger class of problems than the standard bootstrap. In fact it requires extremely little in terms of regularity conditions. In practice it can be difficult to use, and requires more tuning parameters (e.g., the choice of $M$). Politis, Romano and Wolf recommend $M = \sqrt{N}$.
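The subsampling recipe for the mean can be sketched as below, following the steps in the text: draw $B$ subsamples of size $M$ without replacement, form $\sqrt{M}(\hat\theta_b - \hat\theta)$, take the 0.025 and 0.975 quantiles, and rescale by $\sqrt{N}$. The function name, the simple order-statistic quantile rule, the simulated data, and the values of $B$ and the seed are all assumptions made for illustration.

```python
import math
import random
import statistics

def subsampling_ci(x, M, B, rng):
    """95% subsampling interval for the mean, following the recipe in the text."""
    n = len(x)
    theta_hat = statistics.fmean(x)
    # sqrt(M) * (theta_b - theta_hat) over B subsamples drawn without replacement.
    roots = sorted(
        math.sqrt(M) * (statistics.fmean(rng.sample(x, M)) - theta_hat)
        for _ in range(B)
    )
    c_lo = roots[int(0.025 * (B - 1))]   # approximate 0.025 quantile
    c_hi = roots[int(0.975 * (B - 1))]   # approximate 0.975 quantile
    return theta_hat + c_lo / math.sqrt(n), theta_hat + c_hi / math.sqrt(n)

rng = random.Random(0)
x = [rng.expovariate(1.0) for _ in range(2500)]   # hypothetical data, true mean 1
M = int(math.sqrt(len(x)))                        # M = sqrt(N) = 50
lo, hi = subsampling_ci(x, M, B=2000, rng=rng)
print(lo, hi)
```

When the limiting distribution is asymmetric, a commonly used variant reflects the quantiles, giving $(\hat\theta - c_{0.975}/\sqrt{N},\ \hat\theta - c_{0.025}/\sqrt{N})$; the two forms agree in large samples when the limit is symmetric, as it is for the mean.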
4. Confidence Intervals in Small Samples
The confidence intervals we have discussed so far are valid in large samples, that is, they are valid asymptotically. In finite samples they do not always work very well, in the sense that the actual coverage rates may deviate substantially from the nominal coverage rates. There are some improvements possible that lead to confidence intervals with better properties in small samples. Here we discuss three issues: first, improved estimates of the variance; second, the use of t-distributions to get better coverage; and third, the use of the bootstrap to get improvements. The latter are only formally shown to lead to improvements in case of normal error distributions, but tend to lead to improvements in other settings as well. What a small sample is depends on the error variance, but also on the distribution of the covariates, so that it is a little difficult to give general guidance on when these adjustments are important.
4.A Improved Variance Estimates
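The standard ols variance estimator is

$$\hat V = \frac{1}{N} \sum_{i=1}^{N} \hat\varepsilon_i^2 \cdot \left(\sum_{i=1}^{N} X_i X_i'\right)^{-1}.$$

As discussed before, this estimator is biased, because $\hat\sigma^2_{ml} = \frac{1}{N}\sum_{i=1}^{N} \hat\varepsilon_i^2$ is biased for $\sigma^2$. An unbiased estimator for $\sigma^2$ exists, namely $\hat\sigma^2 = \frac{1}{N-K-1}\sum_{i=1}^{N} \hat\varepsilon_i^2$, which leads to

$$\hat V_{unbiased} = \frac{1}{N-K-1} \sum_{i=1}^{N} \hat\varepsilon_i^2 \cdot \left(\sum_{i=1}^{N} X_i X_i'\right)^{-1}.$$

Now consider the (standard) robust variance estimator:

$$\hat V_{robust} = \left(\frac{1}{N}\sum_{i=1}^{N} X_i X_i'\right)^{-1} \left(\frac{1}{N}\sum_{i=1}^{N} \hat\varepsilon_i^2 X_i X_i'\right) \left(\frac{1}{N}\sum_{i=1}^{N} X_i X_i'\right)^{-1}.$$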
This has the same issues as $\hat V$: it is biased in finite samples. However, there is no modification that removes all the bias. There are some modifications that remove the bias in special cases and that in general substantially reduce the bias. See MacKinnon and White (1985) for various suggestions. Here we focus on what they call HC2. Let $P_X = X(X'X)^{-1}X'$ be the $N \times N$ projection matrix, with $(i,j)$th element $P_{ij}$. The modified robust variance estimator is

$$\hat V_{modified,robust} = \left(\frac{1}{N}\sum_{i=1}^{N} X_i X_i'\right)^{-1} \left(\frac{1}{N}\sum_{i=1}^{N} \frac{\hat\varepsilon_i^2}{1 - P_{ii}} X_i X_i'\right) \left(\frac{1}{N}\sum_{i=1}^{N} X_i X_i'\right)^{-1}.$$

That is, instead of using the squared residuals $\hat\varepsilon_i^2$, it uses $\hat\varepsilon_i^2/(1 - P_{ii})$. If the errors are homoskedastic, then $E[\hat\varepsilon_i^2/(1 - P_{ii})] = \sigma^2$. If not, they are still unbiased in some cases, e.g., when $X_i$ is a binary indicator, but in general they are not.
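The HC2 adjustment can be illustrated for a regression on a constant and one covariate, where the leverage has the closed form $P_{ii} = 1/N + (X_i - \bar X)^2 / \sum_j (X_j - \bar X)^2$ and the slope element of the sandwich is $\sum_i w_i (X_i - \bar X)^2 / \left(\sum_j (X_j - \bar X)^2\right)^2$ with $w_i$ the (adjusted) squared residual. The sketch below computes the variance of the slope estimate itself, both unadjusted and with HC2. The function name and the data are hypothetical. With a binary covariate, the HC2 slope variance reduces to $s_0^2/N_0 + s_1^2/N_1$, the usual unbiased variance of a difference in two means.

```python
import statistics

def slope_variances(x, y):
    """Slope variance from OLS of y on (1, x): unadjusted robust and HC2."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Leverage P_ii for a regression on a constant and one covariate.
    lev = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
    v_robust = sum(e * e * (xi - xbar) ** 2
                   for e, xi in zip(resid, x)) / sxx ** 2
    v_hc2 = sum(e * e / (1.0 - p) * (xi - xbar) ** 2
                for e, p, xi in zip(resid, lev, x)) / sxx ** 2
    return v_robust, v_hc2, lev

# Hypothetical data with a binary covariate: 3 controls and 7 treated units.
x = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y = [1.0, 2.5, 1.5, 3.0, 4.5, 2.0, 5.0, 3.5, 4.0, 2.5]
v_robust, v_hc2, lev = slope_variances(x, y)

y0 = [yi for xi, yi in zip(x, y) if xi == 0]
y1 = [yi for xi, yi in zip(x, y) if xi == 1]
two_sample = statistics.variance(y0) / len(y0) + statistics.variance(y1) / len(y1)
print(v_robust, v_hc2, two_sample)   # v_hc2 matches two_sample
```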
4.B Degrees of Freedom Corrections
The second issue in obtaining improved confidence intervals is the degrees of freedom correction. Suppose

$$Y_i \sim N(\mu, \sigma^2),$$

and we want to construct a 95% confidence interval for $\mu$. Suppose we use the unbiased variance estimator,

$$\hat\sigma^2 = \frac{1}{N-1} \sum_{i=1}^{N} \left(Y_i - \bar Y\right)^2.$$

Then

$$\frac{\bar Y - \mu}{\sqrt{\hat\sigma^2/N}} \sim t(N-1).$$

Let $q^{N}_{\alpha}$ be the $\alpha$ quantile of the t-distribution with degrees of freedom equal to $N$. We can use this exact result to construct exact confidence intervals:

$$CI^{0.95} = \left(\bar Y - q^{N-1}_{0.975} \cdot \hat\sigma/\sqrt{N},\ \bar Y + q^{N-1}_{0.975} \cdot \hat\sigma/\sqrt{N}\right).$$

Here the adjustment is simple, and directly related to the sample size. Because the 0.975 quantile of a t-distribution with 20 degrees of freedom is 2.086, compared with 1.96 for the normal distribution, it is clear that with 20 or more observations this adjustment is of little importance.
This does not directly extend to the regression case when we allow for heteroskedasticity. If we use the conventional variance estimator assuming homoskedasticity, the degrees of freedom is simply $N - K - 1$, where $K$ is the number of covariates. When we allow for heteroskedasticity things get more complicated. The exact distribution of the t-statistic $\hat\beta_k / \widehat{se}_k$ is not a t-distribution. However, there is a large literature, especially for the special case where $X_i$ is a scalar binary regressor, where this problem is widely studied as the Behrens-Fisher problem (see Scheffé, 1970). The idea is to approximate the scaled distribution of the variance of the least squares estimator by a chi-squared distribution (and thus approximate the distribution of the t-statistic by a t-distribution). In the homoskedastic variance estimator case this approximation is exact. Here in general we look for the chi-squared distribution that matches the first two moments. Beyond the specifics of the calculations, the key point is that the adjustment depends on the distribution of the covariates. Even with many observations, the implied degrees of freedom may be quite small. This is most easily seen in the case with a single binary covariate. In that case the regression estimator is the difference in two means. If the sample size for one of the two means is small, then the degrees of freedom for the approximating t-distribution is small, irrespective of the number of observations used in calculating the other mean. Thus, looking at the sample size is not sufficient to determine whether the degrees of freedom adjustment is important or not.
Satterthwaite (1946) suggests the following general degrees of freedom correction. We are interested in $\beta_k$, the $k$th element of the vector of parameters $\beta$. The degrees of freedom adjustment will generally be different for different elements of the parameter. Let $e_k$ be the vector, of the same dimension as $\beta$, with all elements other than the $k$th equal to zero and the $k$th element equal to 1. As before, let $P_X = X(X'X)^{-1}X'$ be the $N \times N$ projection matrix, with $(i,j)$th element $P_{ij}$, and let $M_X = I_N - P_X$. Also, let $\mathbf{V}$ be the $N \times N$ diagonal matrix with $i$th diagonal element equal to the $i$th element of the $N$-vector $X(X'X)^{-1}e_k$, and let $D$ be an $N \times N$ diagonal matrix with $i$th diagonal element equal to $1 - P_{ii}$. Then let $\lambda_1, \ldots, \lambda_N$ be the eigenvalues of

$$D^{-1}\mathbf{V}\mathbf{V}M_X.$$

Then the degrees of freedom is

$$\nu = \frac{\left(\sum_{i=1}^{N} \lambda_i\right)^2}{\sum_{i=1}^{N} \lambda_i^2}.$$

Let $[\nu]$ be the largest integer less than or equal to $\nu$. Then the 95% confidence interval is

$$CI^{0.95} = \left(\hat\beta_k - q^{[\nu]}_{0.975} \cdot \widehat{se}_k,\ \hat\beta_k + q^{[\nu]}_{0.975} \cdot \widehat{se}_k\right).$$

For details see Imbens and Kolesar (2010).
4.C Improved Variance Estimates Based on the Bootstrap
Bootstrapping can also be used to obtain improved confidence intervals. Suppose we wish to get a confidence interval for $\mu = E[X]$. A simple way to do this is to calculate the sample average $\bar X$, the sample variance $S_X^2$, and, relying on a central limit theorem, construct a 95% confidence interval as

$$\left(\bar X - 1.96 \cdot S_X/\sqrt{N},\ \bar X + 1.96 \cdot S_X/\sqrt{N}\right).$$

Earlier we discussed using the bootstrap to estimate the variance of $\bar X$. A more subtle bootstrapping version works as follows. Draw for $b = 1, \ldots, B$ a bootstrap sample of size $N$ from the empirical distribution function, $X_{b,j}$, $j = 1, \ldots, N$. For each bootstrap sample $b$ calculate the sample mean, variance and t-statistic:

$$\bar X_b = \frac{1}{N} \sum_{j=1}^{N} X_{b,j},$$

$$S^2_{X,b} = \frac{1}{N-1} \sum_{j=1}^{N} \left(X_{b,j} - \bar X_b\right)^2,$$

and

$$t_b = \left(\bar X_b - \bar X\right) / \left(S_{X,b}/\sqrt{N}\right).$$

Calculate the 0.025 and 0.975 sample quantiles from the $B$ t-statistics and denote them by $c_{0.025}$ and $c_{0.975}$. In large samples $c_{0.025} \approx -1.96$ and $c_{0.975} \approx 1.96$. The difference between the quantiles and the limits is what leads to the improvement in coverage rates. The technical term for statistics like $t_b$ whose large sample distribution is free of nuisance parameters is pivotal. They play a large role in the theory of the bootstrap. The 95% confidence interval is now the set of all values of $x$ such that

$$c_{0.025} < \left(\bar X - x\right) / \left(S_X/\sqrt{N}\right) < c_{0.975},$$

that is,

$$CI^{0.95} = \left(\bar X - c_{0.975} \cdot S_X/\sqrt{N},\ \bar X - c_{0.025} \cdot S_X/\sqrt{N}\right).$$

This leads to confidence intervals with better coverage properties.
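The bootstrap-t interval for a mean described above can be sketched as follows. The function name, the order-statistic quantile rule, the simulated skewed data, and the values of $B$ and the seed are assumptions made for illustration.

```python
import math
import random
import statistics

def bootstrap_t_ci(x, B, rng):
    """95% bootstrap-t confidence interval for the mean of x."""
    n = len(x)
    xbar = statistics.fmean(x)
    s = statistics.stdev(x)                  # S_X, divides by N - 1
    t_stats = []
    for _ in range(B):
        xb = rng.choices(x, k=n)             # resample with replacement
        sb = statistics.stdev(xb)
        if sb > 0:                           # skip degenerate resamples
            t_stats.append((statistics.fmean(xb) - xbar) / (sb / math.sqrt(n)))
    t_stats.sort()
    c_lo = t_stats[int(0.025 * (len(t_stats) - 1))]
    c_hi = t_stats[int(0.975 * (len(t_stats) - 1))]
    # The upper quantile sets the lower limit and vice versa.
    return xbar - c_hi * s / math.sqrt(n), xbar - c_lo * s / math.sqrt(n)

rng = random.Random(0)
x = [rng.expovariate(1.0) for _ in range(30)]   # hypothetical skewed sample
lo, hi = bootstrap_t_ci(x, B=5000, rng=rng)
print(lo, hi)
```

For right-skewed data such as this, the resulting interval is typically asymmetric around $\bar X$, which is the source of the coverage improvement relative to the $\pm 1.96$ interval.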
The same idea works for regression estimates. Let $(Y_1, X_1), \ldots, (Y_N, X_N)$ be the original sample, with least squares estimator

$$\hat\beta_{ols} = \left(\sum_{i=1}^{N} X_i X_i'\right)^{-1} \left(\sum_{i=1}^{N} X_i Y_i\right),$$

residual

$$\hat\varepsilon_i = Y_i - X_i'\hat\beta_{ols},$$

and robust variance estimator

$$\hat V_{robust} = \left(\frac{1}{N}\sum_{i=1}^{N} X_i X_i'\right)^{-1} \left(\frac{1}{N}\sum_{i=1}^{N} \hat\varepsilon_i^2 X_i X_i'\right) \left(\frac{1}{N}\sum_{i=1}^{N} X_i X_i'\right)^{-1}.$$
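Let $\widehat{se}_k$ be the estimated standard error of $\hat\beta_k$, the square root of the $(k,k)$ element of $\hat V_{robust}/N$. Let $(Y_{1b}, X_{1b}), \ldots, (Y_{Nb}, X_{Nb})$ be the $b$-th bootstrap sample. Let $\hat\beta_{k,b}$ be the least squares estimate for $\beta_k$ in this sample, the $k$th element of

$$\hat\beta_{ols,b} = \left(\sum_{i=1}^{N} X_{ib} X_{ib}'\right)^{-1} \left(\sum_{i=1}^{N} X_{ib} Y_{ib}\right),$$

with bootstrap residual $\hat\varepsilon_{ib} = Y_{ib} - X_{ib}'\hat\beta_{ols,b}$ and robust variance estimator

$$\hat V_{robust,b} = \left(\frac{1}{N}\sum_{i=1}^{N} X_{ib} X_{ib}'\right)^{-1} \left(\frac{1}{N}\sum_{i=1}^{N} \hat\varepsilon_{ib}^2 X_{ib} X_{ib}'\right) \left(\frac{1}{N}\sum_{i=1}^{N} X_{ib} X_{ib}'\right)^{-1},$$

and let $\widehat{se}_{b,k}$ be the square root of the $(k,k)$ element of $\hat V_{robust,b}/N$. Finally, define

$$t = \frac{\hat\beta_k - \beta_k}{\widehat{se}_k} \quad \text{and} \quad t_b = \frac{\hat\beta_{k,b} - \hat\beta_k}{\widehat{se}_{b,k}}.$$

In large samples $t$ and therefore $t_b$ should have standard normal distributions, and so $t$ is again pivotal. Find the 0.975 and 0.025 quantiles, $c_{0.975}$ and $c_{0.025}$, of the distribution of the t-statistic $t_b$ over the bootstrap distribution. Finally, construct the 95% confidence interval as

$$CI^{0.95} = \left(\hat\beta_k - c_{0.975} \cdot \widehat{se}_k,\ \hat\beta_k - c_{0.025} \cdot \widehat{se}_k\right).$$

See Hall (1992) for details.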
5. Some Illustrations
Let us look at some real data. The following regressions are estimated on data from the National Longitudinal Survey of Youth (NLSY). The data set used here consists of 935 observations on usual weekly earnings, years of education, and experience (calculated as age minus education minus six). Table 1 presents some summary statistics for these 935 observations. The particular data set consists of men between 28 and 38 years of age at the time the wages were measured.

Table 1: Summary Statistics NLS Data

| Variable | Mean | Median | Min | Max | Standard Dev. |
|---|---|---|---|---|---|
| Weekly Wage (in \$) | 421 | 388 | 58 | 2001 | 199 |
| Log Weekly Earnings | 5.94 | 5.96 | 4.06 | 7.60 | 0.44 |
| Years of Education | 13.5 | 12 | 9 | 18 | 2.2 |
| Age | 33.1 | 33 | 28 | 38 | 3.1 |
| Years of Experience | 13.6 | 13 | 5 | 23 | 3.8 |

5.A Earnings Regressions
We will use these data to look at the returns to education. Mincer developed a model that leads to the following relation between log earnings, education and experience for individual $i$:

$$\log(\text{earnings})_i = \beta_1 + \beta_2 \cdot \text{educ}_i + \beta_3 \cdot \text{exper}_i + \beta_4 \cdot \text{exper}_i^2 + \varepsilon_i.$$

Estimating this on the NLSY data leads to

$$\widehat{\log(\text{earnings})}_i = \underset{(0.222)}{4.016} + \underset{(0.008)}{0.092} \cdot \text{educ}_i + \underset{(0.025)}{0.079} \cdot \text{exper}_i - \underset{(0.001)}{0.002} \cdot \text{exper}_i^2.$$

The estimated standard deviation of the residuals is $\hat\sigma = 0.41$. In parentheses are the standard errors for the parameter estimates, based on the square roots of the diagonal elements of the variance estimate (2).
Using the estimates and the standard errors we can construct confidence intervals. For example, a 95% confidence interval for the returns to education, measured by the parameter $\beta_2$, is

$$CI^{0.95} = (0.0923 - 1.96 \cdot 0.008,\ 0.0923 + 1.96 \cdot 0.008) = (0.0775, 0.1071).$$

The t-statistic for testing $\beta_2 = 0.1$ is

$$t = \frac{0.0923 - 0.1}{0.008} = -1.0172,$$

so at the 10% level we do not reject the hypothesis that $\beta_2$ is equal to 0.1, because $|t| < 1.645$. The reported interval endpoints and t-statistic appear to be computed with the unrounded standard error (approximately 0.0076) rather than the displayed value of 0.008.
5.B Variance Estimates
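Now let us look at some variance estimates based on the various methods discussed here. We focus on a slightly simpler regression of log weekly earnings on a constant and just a single covariate, years of education. The estimated regression function with conventional standard errors is

$$\widehat{\log(\text{earnings})}_i = \underset{(0.0849)}{5.0455} + \underset{(0.0062)}{0.0667} \cdot \text{educ}_i.$$

The estimate for $\sigma^2$ is 0.1744. The estimated covariance matrix for $\hat\beta$ is

$$\hat V = \hat\sigma^2 \cdot (X'X)^{-1} = \begin{pmatrix} 0.0072066 & -0.0005212 \\ -0.0005212 & 0.0000387 \end{pmatrix}.$$

With robust standard errors (based on (3)) the estimated regression function is

$$\widehat{\log(\text{earnings})}_i = \underset{(0.0858)}{5.0455} + \underset{(0.0063)}{0.0667} \cdot \text{educ}_i.$$

The two sets of standard errors are obviously very similar. To see why, let us compare the matrix $\sum_{i=1}^{N} \hat\varepsilon_i^2 X_i X_i'/N$ to $\hat\sigma^2 \sum_{i} X_i X_i'/N$. The first is

$$\sum_{i=1}^{N} \hat\varepsilon_i^2 \cdot X_i X_i'/N = \begin{pmatrix} 0.1740 & 2.3632 \\ 2.3632 & 32.9614 \end{pmatrix},$$

and the second

$$\hat\sigma^2 \cdot \sum_{i=1}^{N} X_i X_i'/N = \begin{pmatrix} 0.1744 & 2.3491 \\ 2.3491 & 32.4789 \end{pmatrix}.$$

(The upper-left elements are the average squared residual and $\hat\sigma^2 = 0.1744$ respectively.)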
It is not always the case that using robust standard errors makes no difference. Let us look at the same regression in levels rather than logs. The conventional standard errors are in round brackets and the robust standard errors are in square brackets:

$$\widehat{\text{earnings}}_i = \underset{\substack{(38.2351) \\ [38.2989]}}{19.5348} + \underset{\substack{(2.8019) \\ [2.9504]}}{29.7745} \cdot \text{educ}_i,$$

with now a slightly bigger difference (about 5% difference in standard errors).

Let us go back to the model in logs and consider bootstrap standard errors. We consider two versions. First, we bootstrap the residuals, keeping the covariates the same (the parametric bootstrap). Second, we do the nonparametric bootstrap. The results for the parametric bootstrap are in round brackets, for the nonparametric bootstrap in square brackets, both based on 100,000 bootstrap replications:

$$\widehat{\log(\text{earnings})}_i = \underset{\substack{(0.0850) \\ [0.0861]}}{5.0455} + \underset{\substack{(0.0062) \\ [0.0064]}}{0.0667} \cdot \text{educ}_i.$$

Again the standard errors are very similar to those based on the conventional calculations.
5.C Simulations
The differences between the standard errors don't tell us which ones are actually better. Let us therefore now see how well the various confidence intervals work in practice. I carried out the following experiment. I took the census data used by Angrist and Krueger in their returns to schooling paper (QJE, 1991). This has observations for 329,509 individuals on (among other things) wages and education. I ran a linear regression of log wages on a constant and years of education, with the following result:

$$\widehat{\log(\text{earnings})}_i = \underset{(0.00045)}{4.9952} + \underset{(0.0003)}{0.0709} \cdot \text{educ}_i.$$

Here the standard errors are very small, due to the very large sample.

Next, I take this sample of 329,509 individuals as the population. 5,000 times I draw a random sample (with replacement, although this does not matter at all given the size of the population), of size $n$ (for $n = 20$ and $n = 100$). In each case I estimate the same linear regression and calculate the standard errors in six different ways: (i) conventional ols standard errors, (ii) robust ols standard errors, (iii) parametric bootstrap, (iv) nonparametric bootstrap, (v) the bias-adjusted robust variance estimator, and (vi) the bias-adjusted robust variance estimator with degrees of freedom correction. Given a 95% confidence interval for the coefficient on years of education I check whether the true value (0.0709) is in there. I calculate how often that happens over the 5,000 replications. The results are as follows. In the first row of each part of the table I report the coverage probabilities, and in the second row the t-statistic for the null hypothesis that the actual coverage rate is equal to the nominal one (0.95).

Table 2: Actual versus Nominal Coverage Rates, 95% Confidence Intervals

| | convent | robust | par boot | nonpar boot | robust-unbiased | dof correction |
|---|---|---|---|---|---|---|
| n = 20, coverage | 0.906 | 0.883 | 0.894 | 0.941 | 0.921 | 0.974 |
| n = 20, t-statistic | -14.2 | -21.7 | -18.2 | -2.9 | -9.5 | 7.8 |
| n = 100, coverage | 0.916 | 0.934 | 0.914 | 0.942 | 0.941 | 0.953 |
| n = 100, t-statistic | -11.0 | -5.2 | -11.8 | -2.7 | -2.7 | 1.1 |
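The t-statistics in Table 2 can be approximately reproduced from the coverage rates, assuming they are computed with the binomial standard error under the null, $\sqrt{0.95 \cdot 0.05 / 5000}$. That formula is an assumption, since the notes do not spell it out, and small discrepancies arise because the coverage rates in the table are rounded to three decimals.

```python
import math

def coverage_t_stat(coverage, nominal=0.95, replications=5000):
    """t-statistic for H0: actual coverage equals nominal coverage."""
    se = math.sqrt(nominal * (1.0 - nominal) / replications)
    return (coverage - nominal) / se

# Coverage rates for n = 20, taken from Table 2.
for label, cov in [("convent", 0.906), ("robust", 0.883),
                   ("nonpar boot", 0.941), ("dof correction", 0.974)]:
    print(label, round(coverage_t_stat(cov), 1))
```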
Even with 100 observations the small sample corrections make a difference. For that sample size the heteroskedasticity corrections also make a difference: the conventional and parametric bootstrap have substantially lower than nominal coverage. With the smaller sample size the corrections make a substantial difference, and the bias of the standard robust variance estimator is such that its coverage is worse than that of the conventional least squares variance estimator.
6. The Delta Method
Often we are not interested in the parameters of the estimated model per se. Instead
we are interested in possibly complicated functions of the parameters. In that case we can
use the Delta method for getting the variance. Suppose we have a model with parameters
$\gamma$, with true value $\gamma_0$. Suppose we are interested in $\theta_0 = g(\gamma_0)$, where $g(\cdot)$ is smooth. We
have an estimator for $\gamma$, $\hat\gamma$. In many cases we have an approximate normal distribution for
$\hat\gamma$, after normalizing by $\sqrt{N}$:
$$\sqrt{N}(\hat\gamma - \gamma_0) \xrightarrow{d} N(0, \Omega),$$
for some $\Omega$, which we can estimate.
However, we are interested in $\theta_0 = g(\gamma_0)$. The natural estimator is $\hat\theta = g(\hat\gamma)$, and
consistency for $\hat\gamma$ implies consistency for $\hat\theta$ given continuity of $g(\cdot)$. Then the Delta method
provides a method for obtaining an approximate normal distribution for $\hat\theta$,
$$\sqrt{N}\left(g(\hat\gamma) - g(\gamma_0)\right) \xrightarrow{d} N\left(0, \frac{\partial g}{\partial \gamma'}(\gamma_0)\, \Omega\, \frac{\partial g}{\partial \gamma}(\gamma_0)\right).$$
The variance comes from linearizing $g(\cdot)$ around the true value $\gamma_0$.
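As a rough illustration of the general recipe, the following sketch computes a Delta-method standard error for a scalar function of the parameters, using a central-difference gradient in place of an analytic derivative. The function names, the step size, and the numbers in the ratio example are all hypothetical and chosen only for illustration; `omega_hat` is assumed to be the estimated variance of the estimator after normalizing by $\sqrt{N}$.

```python
import numpy as np


def numerical_gradient(g, gamma, rel_step=1e-6):
    """Central-difference approximation to the gradient of a scalar function g."""
    gamma = np.asarray(gamma, dtype=float)
    grad = np.zeros_like(gamma)
    for k in range(gamma.size):
        h = rel_step * max(1.0, abs(gamma[k]))
        step = np.zeros_like(gamma)
        step[k] = h
        grad[k] = (g(gamma + step) - g(gamma - step)) / (2.0 * h)
    return grad


def delta_method_se(g, gamma_hat, omega_hat, n):
    """Standard error of g(gamma_hat) when sqrt(n)(gamma_hat - gamma) ~ N(0, omega)."""
    grad = numerical_gradient(g, gamma_hat)
    avar = grad @ np.asarray(omega_hat, dtype=float) @ grad
    return float(np.sqrt(avar / n))


# Hypothetical illustration: the ratio of two parameters.
gamma_hat = np.array([2.0, 4.0])
omega_hat = np.array([[1.0, 0.2],
                      [0.2, 0.5]])
se_ratio = delta_method_se(lambda t: t[0] / t[1], gamma_hat, omega_hat, n=100)
print(se_ratio)
```

When $g$ is available in closed form, as in the examples in these notes, an analytic gradient is preferable; the numerical version is mainly useful as a check.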
6.A The Linear Case
Let us go back to the linear regression of log earnings on education, experience and
experience squared. Suppose we want to see what the estimated effect is on the log of weekly
earnings of increasing a person's education by one year. Because changing an individual's
education also changes their experience (in this case it automatically reduces it by one year),
this effect depends not just on $\beta_1$. To make this specific, let us focus on an individual with
twelve years of education (high school), and ten years of experience (so that $\text{exper}^2$ is equal
to 100). The expected value of this person's log earnings is
$$\widehat{\log(\text{earnings})}_i = 4.016 + 0.092 \cdot 12 + 0.079 \cdot 10 - 0.002 \cdot 100 = 5.7191.$$
Now change this person's education to 13. This person's experience will go down to 9 and
$\text{exper}^2$ will go down to 81. Hence the expected log earnings is
$$\widehat{\log(\text{earnings})}_i = 4.016 + 0.092 \cdot 13 + 0.079 \cdot 9 - 0.002 \cdot 81 = 5.7696.$$
(The coefficients are displayed to three decimals, which is presumably why the displayed
products do not sum exactly to the reported predictions.)
The difference is $\hat\beta_1 - \hat\beta_2 - 19 \hat\beta_3 = 0.0505$. Hence the expected gain of an additional
year of education, taking into account the effect on experience and experience squared, is
the difference between these two predictions, which is equal to 0.051. Now the question is
what the standard error for this prediction is. The general way to answer this question is
through the Delta method. The vector of estimated coefficients $\hat\beta$ is approximately normal
with mean $\beta$ and variance $V$. We are interested in a linear combination of the $\beta$'s. In
this case the specific linear combination is $\theta = g(\beta) = \beta_1 - \beta_2 - 19 \beta_3 = \lambda'\beta$, where
$$\lambda = \frac{\partial g}{\partial \beta} = (0, 1, -1, -19)'.$$
Therefore, because a linear combination of normally distributed
random variables has a normal distribution,
$$\hat\theta \approx N(\lambda'\beta, \lambda' V \lambda),$$
where $V$ is the variance of $\hat\beta$. In the above example, we have the following values for the
covariance matrix $V$:
$$V = \begin{pmatrix}
0.0494 & 0.0011 & 0.0047 & 0.0001 \\
 & 0.0001 & 0.0000 & 0.0000 \\
 & & 0.0006 & 0.0000 \\
 & & & 0.0000
\end{pmatrix}.$$
Hence the standard error of $\hat\theta$ is 0.0096, and the 95% confidence interval for $\theta$ is
(0.0317, 0.0693).
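The interval can be reproduced from the reported estimate and standard error with the usual normal critical value of 1.96. The sketch below also includes a small helper for $\sqrt{\lambda' V \lambda}$; it is not applied to the matrix $V$ printed above, because those entries are rounded to four decimals and are too coarse to recover a standard error of this size. The helper names are introduced here for illustration.

```python
import numpy as np


def linear_combination_se(lam, V):
    """Standard error of lam' beta_hat when beta_hat has covariance matrix V."""
    lam = np.asarray(lam, dtype=float)
    V = np.asarray(V, dtype=float)
    return float(np.sqrt(lam @ V @ lam))


def normal_ci(estimate, se, z=1.96):
    """Two-sided confidence interval based on the normal approximation."""
    return (estimate - z * se, estimate + z * se)


# Weights on (constant, educ, exper, exper^2) for the one-year education effect.
lam = np.array([0.0, 1.0, -1.0, -19.0])

# Estimate and standard error as reported in the text.
theta_hat = 0.0505
se_theta = 0.0096

lower, upper = normal_ci(theta_hat, se_theta)
print(round(lower, 4), round(upper, 4))  # 0.0317 0.0693
```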
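There is a second, more direct method for obtaining an estimate and standard error
for $\theta$ in the linear case. We are interested in an estimator for $\theta$. To analyze this we
reparametrize from $(\beta_0, \beta_1, \beta_2, \beta_3)$ to $(\beta_0, \theta, \beta_2, \beta_3)$:
$$\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix}
\quad \text{to} \quad
\begin{pmatrix} \beta_0 \\ \theta = \beta_1 - \beta_2 - 19 \beta_3 \\ \beta_2 \\ \beta_3 \end{pmatrix}.$$
The inverse of the transformation is
$$\begin{pmatrix} \beta_0 \\ \beta_1 = \theta + \beta_2 + 19 \beta_3 \\ \beta_2 \\ \beta_3 \end{pmatrix}.$$
Hence we can write the regression function as
$$\begin{aligned}
\log(\text{earnings})_i &= \beta_0 + (\theta + \beta_2 + 19 \beta_3) \cdot \text{educ}_i + \beta_2 \cdot \text{exper}_i + \beta_3 \cdot \text{exper}_i^2 + \varepsilon_i \\
&= \beta_0 + \theta \cdot \text{educ}_i + \beta_2 \cdot (\text{exper}_i + \text{educ}_i) + \beta_3 \cdot (\text{exper}_i^2 + 19 \cdot \text{educ}_i) + \varepsilon_i.
\end{aligned}$$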
Hence to get an estimate for $\theta$ we can regress log earnings on a constant, education,
experience plus education, and experience-squared plus 19 times education. This leads to the
estimated regression function
$$\widehat{\log(\text{earnings})}_i = \underset{(0.222)}{4.016} + \underset{(0.010)}{0.051} \cdot \text{educ}_i + \underset{(0.025)}{0.079} \cdot (\text{exper}_i + \text{educ}_i) - \underset{(0.001)}{0.002} \cdot (\text{exper}_i^2 + 19 \cdot \text{educ}_i).$$
Now we obtain both the estimate and standard error directly from the regression output.
They are the same as the estimate and standard error from the delta method.
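The equivalence of the two approaches can be checked numerically. The sketch below uses synthetic data, not the earnings data from these notes: the sample size, the data-generating coefficients, and the helper `ols` are all assumptions made for illustration. It computes $\hat\theta$ and its standard error once through $\lambda'\hat\beta$ and $\lambda' \hat V \lambda$, and once from the regression on the transformed regressors.

```python
import numpy as np


def ols(X, y):
    """OLS coefficients and the conventional (homoskedastic) covariance matrix."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)
    return beta, sigma2 * xtx_inv


# Synthetic data (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
n = 500
educ = rng.integers(8, 19, size=n).astype(float)
exper = rng.uniform(0.0, 30.0, size=n)
y = 4.0 + 0.09 * educ + 0.08 * exper - 0.002 * exper**2 + rng.normal(0.0, 0.4, size=n)

# Approach 1: original regression plus the linear-combination formula.
X = np.column_stack([np.ones(n), educ, exper, exper**2])
beta_hat, V_hat = ols(X, y)
lam = np.array([0.0, 1.0, -1.0, -19.0])
theta_delta = lam @ beta_hat
se_delta = np.sqrt(lam @ V_hat @ lam)

# Approach 2: reparametrized regression; theta is the coefficient on educ.
X_tilde = np.column_stack([np.ones(n), educ, exper + educ, exper**2 + 19.0 * educ])
coef, V_tilde = ols(X_tilde, y)
theta_direct = coef[1]
se_direct = np.sqrt(V_tilde[1, 1])

print(theta_delta, theta_direct)
print(se_delta, se_direct)
```

The two regressions span the same column space, so the fitted values and residuals coincide and the agreement holds up to floating-point error, not just approximately.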
6.B The Nonlinear Case
Let us also look at a nonlinear version of this. Suppose we are using a regression of log
earnings on education. The estimated regression function is
$$\widehat{\log(\text{earnings})}_i = \underset{(0.0849)}{5.0455} + \underset{(0.0062)}{0.0667} \cdot \text{educ}_i.$$
The estimate for $\sigma^2$ is $\hat\sigma^2 = 0.1744$. Now suppose we are interested in the average effect of
increasing education by one year for an individual with currently eight years of education,
not on the log of earnings, but on the level of earnings. (We could have regressed the level of
earnings on years of education, but that would not have delivered as good a statistical fit.)
At $x$ years of education the expected level of earnings is
$$E[\text{earnings} \mid \text{educ} = x] = \exp(\beta_0 + \beta_1 \cdot x + \sigma^2/2),$$
using the fact that if $Z \sim N(\mu, \sigma^2)$, then $E[\exp(Z)] = \exp(\mu + \sigma^2/2)$. It is crucial here that
we assume that $\varepsilon \mid X \sim N(0, \sigma^2)$, not just zero correlation with the covariates, otherwise we
could not calculate this expectation.
Hence the parameter of interest is
$$\theta = g(\beta, \sigma^2) = \exp(\beta_0 + \beta_1 \cdot 9 + \sigma^2/2) - \exp(\beta_0 + \beta_1 \cdot 8 + \sigma^2/2).$$
Getting a point estimate for $\theta$ is easy. Just plug in the estimates for $\beta$ and $\sigma^2$ to get:
$$\hat\theta = \exp(\hat\beta_0 + \hat\beta_1 \cdot 9 + \hat\sigma^2/2) - \exp(\hat\beta_0 + \hat\beta_1 \cdot 8 + \hat\sigma^2/2) = 19.9484.$$
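The plug-in calculation can be sketched as follows, using the coefficient and variance estimates as displayed in the text. Because those inputs are rounded to four decimals, the result matches the reported 19.9484 only approximately. The function name `expected_earnings` is introduced here for illustration.

```python
import math

# Estimates as displayed in the text (rounded to four decimals).
b0_hat = 5.0455
b1_hat = 0.0667
sigma2_hat = 0.1744


def expected_earnings(educ, b0=b0_hat, b1=b1_hat, sigma2=sigma2_hat):
    """E[earnings | educ] under normal errors: exp(b0 + b1*educ + sigma2/2)."""
    return math.exp(b0 + b1 * educ + sigma2 / 2.0)


# Effect on the level of earnings of moving from 8 to 9 years of education.
theta_hat = expected_earnings(9) - expected_earnings(8)
print(theta_hat)
```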
However, we also may want a standard error for this estimate. Let us write this more
generally as $\theta = g(\gamma)$, where $\gamma = (\beta', \sigma^2)'$. We have an approximate distribution for $\hat\gamma$:
$$\sqrt{N}(\hat\gamma - \gamma) \approx N(0, \Omega).$$
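Then by the Delta method,
$$\sqrt{N}\left(g(\hat\gamma) - g(\gamma)\right) \approx N\left(0, \frac{\partial g}{\partial \gamma'}(\gamma)\, \Omega\, \frac{\partial g}{\partial \gamma}(\gamma)\right).$$
In this case,
$$\frac{\partial g}{\partial \gamma}(\gamma) = \begin{pmatrix}
\exp(\beta_0 + \beta_1 \cdot 9 + \sigma^2/2) - \exp(\beta_0 + \beta_1 \cdot 8 + \sigma^2/2) \\
9 \exp(\beta_0 + \beta_1 \cdot 9 + \sigma^2/2) - 8 \exp(\beta_0 + \beta_1 \cdot 8 + \sigma^2/2) \\
\frac{1}{2} \exp(\beta_0 + \beta_1 \cdot 9 + \sigma^2/2) - \frac{1}{2} \exp(\beta_0 + \beta_1 \cdot 8 + \sigma^2/2)
\end{pmatrix}.$$
We estimate this by substituting estimated values for the parameters, so we get
$$\frac{\partial g}{\partial \gamma}(\hat\gamma) = \begin{pmatrix}
\exp(\hat\beta_0 + \hat\beta_1 \cdot 9 + \hat\sigma^2/2) - \exp(\hat\beta_0 + \hat\beta_1 \cdot 8 + \hat\sigma^2/2) \\
9 \exp(\hat\beta_0 + \hat\beta_1 \cdot 9 + \hat\sigma^2/2) - 8 \exp(\hat\beta_0 + \hat\beta_1 \cdot 8 + \hat\sigma^2/2) \\
\frac{1}{2} \exp(\hat\beta_0 + \hat\beta_1 \cdot 9 + \hat\sigma^2/2) - \frac{1}{2} \exp(\hat\beta_0 + \hat\beta_1 \cdot 8 + \hat\sigma^2/2)
\end{pmatrix}
= \begin{pmatrix} 19.9484 \\ 468.5779 \\ 9.9742 \end{pmatrix}.$$
To get the variance for $\hat\theta = g(\hat\gamma)$, we also need the full covariance matrix, including for the
parameter $\sigma^2$. Using the fact that because of the normal distribution the estimator for $\sigma^2$ is
independent of the estimators of the other parameters, and that it has asymptotic variance
equal to $2\sigma^4$, we have
$$\hat\Omega = \widehat{AV}\left(\sqrt{N}(\hat\gamma - \gamma)\right)
= \begin{pmatrix}
\widehat{AV}\left(\sqrt{N}(\hat\beta - \beta)\right) & \widehat{ACV}\left(\sqrt{N}(\hat\beta - \beta), \sqrt{N}(\hat\sigma^2 - \sigma^2)\right) \\
 & \widehat{AV}\left(\sqrt{N}(\hat\sigma^2 - \sigma^2)\right)
\end{pmatrix}
= \begin{pmatrix}
6.7382 & -0.4873 & 0.0000 \\
 & 0.0362 & 0.0000 \\
 & & 0.0608
\end{pmatrix}.$$
Here AV stands for Asymptotic Variance and ACV for Asymptotic Covariance. Now the asymptotic
variance for the parameter of interest normalized by the square root of N is
$$\widehat{AV}(\hat\theta) = \frac{\partial g}{\partial \gamma'}(\hat\gamma)\, \hat\Omega\, \frac{\partial g}{\partial \gamma}(\hat\gamma)
= \begin{pmatrix} 19.9484 \\ 468.5779 \\ 9.9742 \end{pmatrix}'
\begin{pmatrix}
6.7382 & -0.4873 & 0.0000 \\
-0.4873 & 0.0362 & 0.0000 \\
0.0000 & 0.0000 & 0.0608
\end{pmatrix}
\begin{pmatrix} 19.9484 \\ 468.5779 \\ 9.9742 \end{pmatrix} = 1521.4,$$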
and the estimated variance for $\hat\theta$ is $1521.4/N = 1.2756^2$.
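The quadratic form can be checked with a few lines of code. Two assumptions are made in this sketch and should be read as such: the covariance between the intercept and slope estimates is taken to be negative ($-0.4873$), since only that sign gives a value close to the reported 1521.4, and the sample size is taken to be $N = 935$, which is the value implied by $1521.4/N = 1.2756^2$ rather than a number stated in the text. Because all inputs are rounded, the result agrees with the reported figures only up to rounding error.

```python
import numpy as np

# Rounded estimates from the text.
b0_hat, b1_hat, sigma2_hat = 5.0455, 0.0667, 0.1744

# Assumption: N = 935 is inferred from 1521.4 / 1.2756**2, not stated in the text.
n_obs = 935


def gradient(b0, b1, sigma2):
    """Analytic gradient of g = exp(b0 + 9*b1 + s2/2) - exp(b0 + 8*b1 + s2/2)."""
    e9 = np.exp(b0 + 9.0 * b1 + sigma2 / 2.0)
    e8 = np.exp(b0 + 8.0 * b1 + sigma2 / 2.0)
    return np.array([e9 - e8, 9.0 * e9 - 8.0 * e8, 0.5 * (e9 - e8)])


# Estimated asymptotic variance of sqrt(N) * (gamma_hat - gamma).
# Assumption: the (1,2) entry is negative; the last entry equals 2 * sigma2_hat**2.
omega_hat = np.array([
    [6.7382, -0.4873, 0.0],
    [-0.4873, 0.0362, 0.0],
    [0.0, 0.0, 0.0608],
])

grad = gradient(b0_hat, b1_hat, sigma2_hat)
avar_theta = grad @ omega_hat @ grad
se_theta = np.sqrt(avar_theta / n_obs)

print(grad)
print(avar_theta, se_theta)
```

The large positive contribution from the slope variance is mostly offset by the negative covariance term, which is why the result is sensitive to the fourth decimal of the entries of $\hat\Omega$.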
7. Conditional versus Unconditional Inference
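Finally, suppose we are interested in the average effect on the log of earnings of increasing
education levels by one year. For individual $i$, with level of education $\text{educ}_i$ and level of
experience $\text{exper}_i$, the expected value of log earnings is
$$\widehat{\log(\text{earnings})}_i = \hat\beta_0 + \hat\beta_1 \cdot \text{educ}_i + \hat\beta_2 \cdot \text{exper}_i + \hat\beta_3 \cdot \text{exper}_i^2.$$
With an additional year of education the expected value for log earnings would be
$$\widehat{\log(\text{earnings})}_i = \hat\beta_0 + \hat\beta_1 \cdot (\text{educ}_i + 1) + \hat\beta_2 \cdot (\text{exper}_i - 1) + \hat\beta_3 \cdot (\text{exper}_i - 1)^2.$$
The expected gain for individual $i$ is the difference between these two expressions,
$$\hat\beta_1 - \hat\beta_2 - \hat\beta_3 \cdot (2 \cdot \text{exper}_i - 1) = \hat\beta_1 - \hat\beta_2 + \hat\beta_3 - 2 \cdot \hat\beta_3 \cdot \text{exper}_i.$$
The expected gain in the finite population (conditional on the covariates) is
$$\theta_S = \beta_1 - \beta_2 + \beta_3 - 2 \cdot \beta_3 \cdot \overline{\text{exper}}, \tag{6}$$
where $\overline{\text{exper}}$ is the sample average of experience.
The expected gain in the population, or unconditional gain, is
$$\theta_P = \beta_1 - \beta_2 + \beta_3 - 2 \cdot \beta_3 \cdot \mu_{\text{exper}}, \tag{7}$$
where $\mu_{\text{exper}}$ is the population mean of experience.
Both are estimated as
$$\hat\beta_1 - \hat\beta_2 + \hat\beta_3 - 2 \cdot \hat\beta_3 \cdot \overline{\text{exper}}, \tag{8}$$
but their variances are different.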
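If we are interested in (6) we can use the Delta method with
$$g_1(\beta_1, \beta_2, \beta_3) = \beta_1 - \beta_2 + \beta_3 - 2 \cdot \beta_3 \cdot \overline{\text{exper}}.$$
Given that the normalized variance of $\hat\beta$ is estimated as $\hat\sigma^2 \left(\sum_{i=1}^N X_i X_i'/N\right)^{-1}$, this is
straightforward.
If we are interested in (7) we can use the Delta method with
$$g_2(\beta_1, \beta_2, \beta_3, \mu_{\text{exper}}) = \beta_1 - \beta_2 + \beta_3 - 2 \cdot \beta_3 \cdot \mu_{\text{exper}}.$$
Now we need the full variance/covariance matrix of $\hat\beta$ and $\overline{\text{exper}}$. Under Assumption 2, $\overline{\text{exper}}$
is independent of $\hat\beta$, and so we have
$$\sqrt{N} \begin{pmatrix} \hat\beta - \beta \\ \overline{\text{exper}} - \mu_{\text{exper}} \end{pmatrix}
\xrightarrow{d} N\left(0, \begin{pmatrix} \sigma^2 \left(E[XX']\right)^{-1} & 0 \\ 0 & \sigma^2_{\text{exper}} \end{pmatrix}\right).$$
As a result the asymptotic variance of $\sqrt{N}(\hat\theta - \theta_P)$ contains, in addition to the term
reflecting the uncertainty in $\hat\beta$, the term
$$\frac{\partial g_2}{\partial \mu_{\text{exper}}} \cdot V(\text{exper}_i) \cdot \frac{\partial g_2}{\partial \mu_{\text{exper}}} = 4 \cdot \beta_3^2 \cdot \sigma^2_{\text{exper}}.$$
If we want to use the nonparametric bootstrap to estimate the variance of the estimator for $\theta_S$, we have
to be careful. You can do it as follows. Create a bootstrap sample. Using this bootstrap
sample estimate $\hat\beta_b$, and substitute $\hat\beta_b$ into (8) using $\overline{\text{exper}}$ based on the original sample. If
we want to use the nonparametric bootstrap to estimate the variance of the estimator for $\theta_P$, we use $\overline{\text{exper}}_b$
based on the bootstrap sample.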
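The two bootstrap procedures differ only in which average of experience is plugged into (8). The sketch below illustrates this on synthetic data. Everything about the data-generating process is hypothetical; in particular the coefficient on experience squared is deliberately exaggerated (and the error variance kept small) so that the extra term $4 \beta_3^2 \sigma^2_{\text{exper}}/N$ in the unconditional variance is clearly visible. The helper names are introduced here for illustration.

```python
import numpy as np


def ols_coef(X, y):
    """Least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def gain(beta, mean_exper):
    """Estimated average gain: b1 - b2 + b3 - 2 * b3 * mean(exper), as in (8)."""
    return beta[1] - beta[2] + beta[3] - 2.0 * beta[3] * mean_exper


# Synthetic data (hypothetical, with exaggerated curvature in experience).
rng = np.random.default_rng(1)
n = 200
educ = rng.integers(8, 19, size=n).astype(float)
exper = rng.uniform(0.0, 30.0, size=n)
y = 4.0 + 0.09 * educ + 0.08 * exper - 0.05 * exper**2 + rng.normal(0.0, 0.1, size=n)
X = np.column_stack([np.ones(n), educ, exper, exper**2])

n_boot = 500
draws_S = np.empty(n_boot)  # conditional (finite population) target
draws_P = np.empty(n_boot)  # unconditional (super population) target
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)          # resample units with replacement
    beta_b = ols_coef(X[idx], y[idx])
    draws_S[b] = gain(beta_b, exper.mean())       # original-sample average
    draws_P[b] = gain(beta_b, exper[idx].mean())  # bootstrap-sample average

var_S = draws_S.var(ddof=1)
var_P = draws_P.var(ddof=1)
print(var_S, var_P)
```

With a curvature coefficient as small as in the earnings regression the two variances would be much closer, so whether the distinction matters in practice depends on the size of $\beta_3$ and on the dispersion of experience.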
Appendix: Big $O_p$ and Little $o_p$ Notation
Definition 1 A sequence of numbers $K_n$ is $o(M_n)$ (little o) as $n \to \infty$ if $K_n/M_n \to 0$.
A sequence of numbers $K_n$ is $O(M_n)$ (big O) as $n \to \infty$ if there exists a $C > 0$ and an
$n_0$ such that $|K_n/M_n| \le C$ for all $n \ge n_0$.
For example, $K_n = \ln(n)$ satisfies $K_n = o(n^\delta)$ for all $\delta > 0$, because $\lim_{n \to \infty} \ln(n)/n^\delta = 0$.
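A quick numerical look at the example: the ratio $\ln(n)/n^\delta$ does go to zero for any $\delta > 0$, but for small $\delta$ it does so very slowly, which is worth keeping in mind when reading rates as practical guidance. The values of $n$ and $\delta$ below are arbitrary choices for illustration.

```python
import math


def log_ratio(n, delta):
    """The ratio ln(n) / n**delta from the little-o example."""
    return math.log(n) / n**delta


sample_sizes = [1e5, 1e10, 1e20, 1e40]
ratios_small_delta = [log_ratio(n, 0.1) for n in sample_sizes]
ratios_half = [log_ratio(n, 0.5) for n in sample_sizes]

for n, r_small, r_half in zip(sample_sizes, ratios_small_delta, ratios_half):
    print(f"n={n:.0e}  delta=0.1: {r_small:.4g}  delta=0.5: {r_half:.4g}")
```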
Definition 2 A sequence of random variables $X_n$ is bounded in probability if for any $\epsilon > 0$,
there is a $C$ and an $n_0$ such that for $n \ge n_0$, $\text{pr}(|X_n| > C) < \epsilon$.
For example, if $X_1, \ldots, X_N$ are iid with mean $\mu_X$ and variance $\sigma^2_X$, then
$$\frac{1}{\sqrt{N}} \sum_{i=1}^N (X_i - \mu_X)$$
is bounded in probability.
Definition 3 A sequence of random variables $X_n$ is $o_p(K_n)$ as $n \to \infty$ if $X_n/K_n \xrightarrow{p} 0$. A
sequence of random variables $X_n$ is $O_p(K_n)$ as $n \to \infty$ if $X_n/K_n$ is bounded in probability.
For example, if $X_1, \ldots, X_N$ are iid with mean 0 and variance $\sigma^2_X$, then
$$\bar{X} = O_p\left(N^{-1/2}\right),$$
and
$$\bar{X} = o_p\left(N^{-\delta}\right),$$
for $\delta < 1/2$.
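The last example can be illustrated by simulation. In the sketch below, which assumes standard normal draws and arbitrary choices for the sample sizes and number of replications, the spread of $\sqrt{N} \bar{X}$ stays roughly constant as $N$ grows (bounded in probability), while the spread of $N^{1/4} \bar{X}$ shrinks toward zero.

```python
import numpy as np

rng = np.random.default_rng(2)
reps = 500  # number of simulated samples per sample size (arbitrary)


def spread_of_scaled_mean(n, power):
    """Standard deviation across replications of n**power times the sample mean."""
    x = rng.normal(0.0, 1.0, size=(reps, n))
    return float(np.std(n**power * x.mean(axis=1)))


sizes = [100, 400, 2500]
root_n_spread = [spread_of_scaled_mean(n, 0.5) for n in sizes]
quarter_spread = [spread_of_scaled_mean(n, 0.25) for n in sizes]

for n, s_half, s_quarter in zip(sizes, root_n_spread, quarter_spread):
    print(f"N={n:5d}  sd of N^(1/2)*mean: {s_half:.3f}  sd of N^(1/4)*mean: {s_quarter:.3f}")
```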
References
Davison, A. C., and D. V. Hinkley, (1998), Bootstrap Methods and their Applications,
Cambridge University Press, Cambridge.
Efron, B., (1979), Bootstrap Methods: Another Look at the Jackknife, Annals of
Statistics, Vol. 7, 1-26.
Efron, B., (1982), The Jackknife, the Bootstrap and Other Resampling Plans, SIAM.
Efron, B., and R. Tibshirani, (1993), An Introduction to the Bootstrap, Chapman and
Hall, New York.
Eicker, F., (1967), Limit Theorems for Regressions with Unequal and Dependent
Errors, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and
Probability, Vol. 1, 59-82, Berkeley, University of California Press.
Hall, P., (1992), The Bootstrap and Edgeworth Expansions, Springer Verlag, New York.
Huber, P., (1967), The Behavior of Maximum Likelihood Estimates Under Nonstandard
Conditions, in Proceedings of the Fifth Berkeley Symposium in Mathematical Statistics,
Vol. 1, Berkeley, University of California Press, 221-233.
Imbens, G., and M. Kolesar, (2010), The Behrens-Fisher Problem and Robust
Standard Errors in Small Samples With and Without Clustering, Unpublished Manuscript.
MacKinnon, J., and H. White, (1985), Some Heteroskedasticity-Consistent Covariance
Matrix Estimators with Improved Finite Sample Properties, Journal of Econometrics,
Vol. 29, 305-325.
Politis, D. N., J. Romano, and M. Wolf, (1999), Subsampling, Springer Verlag, New
York.
Satterthwaite, F., (1946), An Approximate Distribution of Estimates of Variance
Components, Biometrics, Vol. 2, 110-114.
Scheffe, H., (1970), Practical Solutions of the Behrens-Fisher Problem, Journal of the
American Statistical Association, Vol. 65, No. 332, 1501-1508.
White, H., (1980), A Heteroskedasticity-Consistent Covariance Matrix Estimator and
a Direct Test for Heteroskedasticity, Econometrica, Vol. 48, 817-838.