
Econometrics

ECON 550

January 31, 2018

Model Specification + Interaction Effects


W Chapters 6-7

Professor Sebastien Bradley


Drexel University, LeBow College of Business
Outline for Today
1) Administrative Business:
• Midterm Exam – February 7
• Problem Set #4 - Practice

2) Problem Set #2 comments


3) Model Specification
• Rescaling
• Polynomials
• Logarithms
• Interactions
Model Specification and Estimation
• The challenge in doing empirical analysis lies in choosing
(1) The appropriate model specification (i.e. the set of variables to
include in the regression), AND
(2) The corresponding estimation technique
which yields causal effects of the regressor(s) of interest
on the dependent variable.

• For most purposes, OLS is a suitable estimation method.


• The choice of dependent variable and regressors,
however, depends entirely on context and should be
driven by a desire to avoid bias.
The Role of Theory in Model Specification
• Ideally, regression models should be guided by theory.
• E.g. Labor Supply and Income Taxes
• Under basic theory, individuals choose how much labor (L) to
supply in order to maximize their consumption of leisure (1-L) and
everything else (C):
max_{C,L} U(C, 1-L)
• Consumption must be limited to after-tax wage income, wL(1-t),
and non-labor income, Y(1-t):
C = wL(1-t) + Y(1-t)
• Working through the maximization problem suggests that we
estimate
L = β0 + β1·w(1-t) + β2·Y(1-t) + u
Theory in Model Specification (cont.)
• For many interesting questions, the theory is undeveloped
or underdeveloped.
• It can be helpful to try to develop a simple theory upon
which to base the regression model.
• In the alternative, intuition can be a powerful guide and
theory may be unnecessary.

• Ultimately, we need to think critically about what belongs


in our regression model, whether we start from a known
theoretical model or not.
Avoiding Omitted Variable Bias
• Deciding what belongs in our regression model depends
fundamentally on thinking about (potential) omitted
variable bias.
• This is the same whether we have one or multiple
regressors already in our model.

• Namely, we wish to avoid violations of the conditional


mean independence assumption arising through
correlation between an included regressor and the
regression residual.
Control Variables
• As we have discussed, the direct solution to omitted
variables is their inclusion in the model, even if we are not
interested in their effects on the dependent variable per
se.
• We often refer to these additional regressors as control
variables, to control for or hold constant these other
factors, which, if ignored, would prevent us from obtaining
unbiased estimates of the causal effect(s) of interest.
• So long as we do not care about the causal effects of
the control variables themselves, we might tolerate
our control variables being correlated with the
regression residual (through other possible omitted
variables), but we must be careful.
Base vs. Alternative Specifications
• Having applied any relevant economic theory and
exercised careful judgment (most especially with omitted
variable bias in mind), we should arrive at a base
specification.
• Beyond the base specification, we might consider
additional alternative specifications involving alternative
sets of regressors whose relevance to the model is less
certain.
• If an alternative specification yields wildly different
coefficient estimates for the regressors of interest,
this suggests that the base specification may be
subject to omitted variable bias.
Using Measures of Model Fit
• If uncertain about whether to include a particular variable
as a regressor, measures of goodness of fit such as the
adjusted R² or an F-test of joint significance of multiple
regressors can be instructive.

• If the addition of a variable substantially improves the


predictive ability of the model (i.e. increases the adjusted
R²), it is likely worth including.
Rescaling Data
• Occasionally, we may want to rescale our regressors
and/or dependent variable in some way:
• This may be for purposes of interpretation (e.g.
corporate income taxes and reported foreign earnings
versus corporate tax rates and rates of return).
• It may be for simple ease of reading coefficients.
• Or it may be because scale can be a source of spurious
correlation (e.g. “size” is an omitted variable).
Non-Linear Regression Functions

• Not all relationships that we might be interested in are


necessarily linear.

• Ignoring such non-linearities may lead to omitted variable


bias.
E.g. Average Hourly Earnings and Age
What would you expect to be the relationship between an
individual’s age and their wage?
• Workers likely become more productive as they accrue
experience, so we would predict (inflation-adjusted)
wages to rise with age.
• At some point, productivity likely plateaus or even
begins to decline.

How can we characterize both of these features


graphically and mathematically?
Non-Linear Regression Functions (cont.)
• We might want to model wages as a non-linear
regression function of age:

E.g. Wage = β0 + β1·Age + β2·Age² + u

• This allows the effect of age on wage to depend on the


value of age itself rather than impose that this effect be
constant for all ages.
• Generically, the effect of one regressor, X1, on Y depends
on the value of X1 itself.
Non-Linear Regression Functions (cont.)
• A second group of non-linear regression functions involve
effects of one regressor, X1, on Y which depend on the
value of another regressor, X2.

E.g. Wage = f(Age, Female),

where the effect of Age on Wage may vary according to
whether an individual is male or female.
Modeling Non-Linearities
The following suggests a general approach to modeling
non-linearities:
1) Just as for model specification more generally,
consideration of economic theory, omitted variables,
and data should first guide your thinking about whether
to specify the model to estimate as a non-linear
regression function.
2) Next, provided the model is linear in its parameters,
OLS can be used to estimate the coefficients of interest.
Modeling Non-Linearities (cont.)
3) You may then test whether the non-linear model
provides an improvement over the linear model by
A. Assessing the statistical significance of the coefficient
estimate(s) for the non-linear regression term(s)
B. Assessing whether the non-linear model implies the
existence of omitted variables bias in the linear model
C. Assessing model fit
Interpreting Coefficient Estimates
• The interpretation of coefficient estimates from a non-
linear model is somewhat changed.
• This is because, for example, it makes little sense to talk
about the effect of age on wages, holding age squared
constant.

• Since the effect of a regressor on the dependent variable


is non-constant in its own values or those of another
regressor, it makes sense to calculate effects on the
dependent variable for different values of the regressor.
Interpreting Coefficient Estimates (cont.)
E.g. Average Hourly Earnings and Age
• Suppose we estimate a regression of wages as a
quadratic function of age:

Wage-hat = 10.725 + 1.431·Age - 0.014·Age²

What is the effect of an additional year of age on average
hourly earnings?

∆Wage-hat/∆Age = 1.431 - 2(0.014)·Age
Interpreting Coefficient Estimates (cont.)
• Applied to our example, the marginal effect of a change in
age starting from age 20 is

β1-hat + 2·β2-hat·(20) = 0.851

• In contrast, the effect of a change in age from 50 is

β1-hat + 2·β2-hat·(50) = 0.183
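These marginal effects are simple arithmetic on the estimated coefficients. A minimal check in Python, using the rounded coefficients printed on the previous slide (so the results differ slightly from the slide's 0.851 and 0.183, presumably because those were computed from unrounded estimates):

```python
# Marginal effect of age in the quadratic model
# wage = b0 + b1*age + b2*age^2, using the rounded slide coefficients.
b1, b2 = 1.431, -0.014

def marginal_effect(age):
    """d(wage)/d(age) = b1 + 2*b2*age for the quadratic model."""
    return b1 + 2 * b2 * age

print(round(marginal_effect(20), 3))  # large positive effect at age 20
print(round(marginal_effect(50), 3))  # much smaller effect at age 50
```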


Standard Errors of Estimated Effects
• In general, we compute confidence intervals around
estimated effects as

95% C.I.: ∆Y-hat ± 1.96·SE(∆Y-hat)

• In the case of linear regression functions, this is
straightforward:

95% C.I.: β1-hat·∆X ± 1.96·SE(β1-hat)·∆X

since ∆Y-hat = β1-hat·∆X (and ∆X = 1, typically).
S.E. of Estimated Effects (cont.)
• With non-linear regression functions, it becomes more
complicated to compute standard errors around the
estimated effects.

E.g. For the quadratic model, the estimated effect of age on
wage at age 20 is

∆Wage-hat|Age=20 = β1-hat + 2·β2-hat·(20)

⇒ we need SE(β1-hat + 2·β2-hat·(20)).
S.E. of Estimated Effects (cont.)
SE(β1-hat + 2·β2-hat·(20)) can be backed out from a
calculation of the F-statistic for the test that the true effect of
age on wage at age 20 is zero under the null:

H0: β1 + 2·β2·(20) = 0

√F = |β1-hat + 2·β2-hat·(20) - 0| / SE(β1-hat + 2·β2-hat·(20))

⇒ SE(β1-hat + 2·β2-hat·(20)) = |β1-hat + 2·β2-hat·(20)| / √F

Alternatively, SE(β1-hat + 2·β2-hat·(20)) can be computed
directly through transformation of the regression model.
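Numerically, the transformation route amounts to computing the standard error of a linear combination of coefficients, g'β-hat, from the estimated coefficient covariance matrix as √(g'Vg). A minimal sketch in Python on simulated data (all parameter values are invented for illustration, not the course dataset):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
age = rng.uniform(20, 60, n)
# Simulated quadratic wage data (illustrative numbers only)
wage = 10 + 1.4 * age - 0.014 * age**2 + rng.normal(0, 5, n)

X = np.column_stack([np.ones(n), age, age**2])
beta = np.linalg.lstsq(X, wage, rcond=None)[0]
resid = wage - X @ beta
V = (resid @ resid / (n - 3)) * np.linalg.inv(X.T @ X)  # homoskedastic vcov

g = np.array([0.0, 1.0, 2 * 20.0])  # gradient of b1 + 2*b2*20 w.r.t. (b0, b1, b2)
effect = g @ beta                   # estimated marginal effect at age 20
se = np.sqrt(g @ V @ g)             # SE of the linear combination
print(effect, se)
```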
Polynomial Regression Models
• The quadratic regression model, with

Y = β0 + β1·X + β2·X² + u,

is just a special case of the polynomial regression model,

Y = β0 + β1·X + β2·X² + ... + βr·X^r + u,

where each higher power allows for an additional bend
(inflection point) in the graphed relationship between Y
and X.
Polynomial Regression Models (cont.)
• In some cases, we might wish to consider cubic or even
higher-degree polynomials.

How do we determine how many powers of X should be


included in the model specification?
Beyond the quadratic, intuition is likely to be a poor
guide.
This model specification problem is analogous to the
more general problem of which regressors to include.
Polynomial Regression Models (cont.)
• A logical method for determining how many powers of X
to include is to iteratively eliminate the highest-powered
term whenever its effect on Y is not statistically
distinguishable from 0 (i.e. we fail to reject that the true
coefficient estimate is 0).
• If the null hypothesis is rejected, and the effect is
statistically significant, then we retain the regressor.

• This is referred to as sequential hypothesis testing.


• It is rarely necessary to include more than cubic terms (X³)
in any regression.
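The sequential procedure can be sketched in Python on simulated data (coefficients and sample size are invented for illustration; a real analysis would use the course dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
age = rng.uniform(20, 60, n)
a = age - 40  # center age to reduce collinearity among its powers
# Simulated data with a genuinely quadratic relationship
wage = 24 + 0.28 * a - 0.014 * a**2 + rng.normal(0, 5, n)

def t_stats(X, y):
    """OLS coefficients and t-statistics (homoskedastic SEs)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, beta / se

# Step 1: start from a cubic and test its highest power
X3 = np.column_stack([np.ones(n), a, a**2, a**3])
_, t3 = t_stats(X3, wage)
print("cubic term t-stat:", round(t3[3], 2))

# Step 2: if insignificant, drop a^3 and test the quadratic term
X2 = X3[:, :3]
_, t2 = t_stats(X2, wage)
print("quadratic term t-stat:", round(t2[2], 2))
```

Because the simulated truth is quadratic, the quadratic term's t-statistic is far beyond the 5% critical value, while the cubic term's is expected to be insignificant.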
Logarithmic Regression Models
• An alternative to polynomial regression models is to
specify Y and/or X as natural logarithms.
• This implies a non-linear relationship involving a steep
curve for low values of X followed by gradual flattening.

Note: recall that ln(x) → -∞ as x → 0 and ln(1) = 0; the log
of zero or a negative number is undefined. This creates
problems if you try to take logs of data which include zero or
negative values.
Logarithmic Regression Models (cont.)
• A fundamental virtue of specifying a non-linear regression
model in logs is that changes in log variables have the
interpretation of measuring percent changes.
• This follows from the fact that, for small changes in Y,

ln(Y + ∆Y) - ln(Y) ≅ ∆Y/Y

⇔ ∆ln(Y) ≅ ∆Y/Y (the proportional change in Y)

• A related virtue is that taking logs of Y and/or the independent
variables performs a rescaling of the data and can eliminate
spurious correlation due to “size” effects in data.
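The percent-change interpretation rests on the approximation ln(1 + p) ≈ p, which is accurate only for small p, as a quick numerical check shows:

```python
import math

# Log difference vs. the exact proportional change: close for small
# changes, increasingly off for large ones.
Y = 100.0
for pct in (0.01, 0.05, 0.25, 1.00):
    log_change = math.log(Y * (1 + pct)) - math.log(Y)
    print(f"{pct:.2f} -> {log_change:.4f}")
# 0.01 -> 0.0100
# 0.05 -> 0.0488
# 0.25 -> 0.2231
# 1.00 -> 0.6931  (a 100% increase is only a 69% log change)
```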
Logarithmic Regression Models (cont.)
• Three forms of logarithmic regression models are
possible, each having different interpretations:
1) Linear-log
2) Log-linear
3) Log-log
The Linear-Log Model
• In the linear-log regression model, the dependent variable
enters linearly while X is in logarithms.

Y = β0 + β1·ln(X) + u

• β1 measures the effect on Y of a 100% increase in X.
• Worded differently, a 1% increase in X is associated with
a change in Y of 0.01·β1.
The Linear-Log Model (cont.)
E.g. Average Hourly Earnings vs. Log Age

Wage-hat = 12.881 + 9.448·ln(Age)

• A 100% increase in age is therefore associated with a
$9.45 increase in average hourly earnings.
• Equivalently, a 1% increase in age translates into a $0.09
increase in average hourly earnings.

How does this model compare to the cubic polynomial?


[Figure: Linear-Log Model and Cubic Model fits, Avg. Hourly Earnings (0-100) vs. Age (20-60)]
The Log-Linear Model
• In the log-linear regression model, Y is in logarithms but X
is not.
ln(Y) = β0 + β1·X + u

• β1 measures the effect of a one unit increase in X on Y in
terms of percent changes.
• In particular, a one unit increase in X is associated with a
100·β1 percent increase in Y.
The Log-Linear Model (cont.)
E.g. Log Average Hourly Earnings vs. Age

ln(Wage-hat) = 2.500 + 0.010·Age

• A one year increase in age is therefore associated with a
1 percent increase in the wage, on average.
• This represents the semi-elasticity of wages w.r.t. age
(imposed to be constant regardless of age).
The Log-Linear Model (cont.)
E.g. Log Average Hourly Earnings vs. Age
ln(Wage-hat) = 0.845 + 0.241·Age - 0.005·Age² + 0.000·Age³

• At age 20, the marginal effect of age is 8.29 percent.
• At age 50, the marginal effect of age is -0.37 percent.
[Figure: Log-Linear Model fit, ln(Avg. Hourly Earnings) (1-5) vs. Age (20-60)]
The Log-Log Model
• In the log-log regression model both Y and X are in
logarithms

ln(Y) = β0 + β1·ln(X) + u

• β1 measures the effect of a one percent change in X on Y
in terms of percent changes.
• In other words, β1 measures the elasticity of Y with
respect to X.
The Log-Log Model (cont.)
E.g. Log Average Hourly Earnings vs. Log Age

ln(Wage-hat) = 1.285 + 0.443·ln(Age)

• A 1% change in age is associated with a 0.443% change
in average hourly earnings.
• The elasticity of the hourly wage w.r.t. age is 0.443.
[Figure: Log-Log Model fit, ln(Avg. Hourly Earnings) (1-5) vs. ln(Age) (3-4.5)]
Comparing Logarithmic Models
• If we want to compare models in terms of fit, the
dependent variable must be equivalent across
specifications.
• Consequently, we cannot compare the linear-log model to
either the log-linear or log-log models.
• We can compare the linear-log model to the polynomial
model.
• We can compare the log-linear and log-log models.
• Ultimately, we should be guided by our desired
interpretation for the data.
Interactions Between Regressors
• Recall that a second group of non-linear regression
functions involve effects of one regressor, , on Y which
depend on the value of another regressor, .

• E.g. Wage = f(College, Female) (returns to education)
• E.g. Evaluation = f(Beauty, Female) (teacher evaluations)
Interactions of Binary Regressors
• If D1 and D2 are two binary dummy variables, we might
want to allow the effect of D1 on Y to depend on the value
of D2.
• This effect is estimated through an interaction term or
interacted regressor:

Y = β0 + β1·D1 + β2·D2 + β3·(D1×D2) + u
Interactions of Binary Regressors (cont.)
E.g. Returns to Education
Y = average hourly earnings
D1 = College ∈ {0,1} (no college degree, degree)
D2 = Female ∈ {0,1} (male, female)
Interaction Effects – Binary Regressors
• The difference in wage associated with going to college is:

E[Wage | College = 1, Female] - E[Wage | College = 0, Female] =
β1 if Female = 0 (male)
β1 + β3 if Female = 1 (female)

• The difference in wage associated with being female is:

E[Wage | College, Female = 1] - E[Wage | College, Female = 0] =
β2 if College = 0 (no degree)
β2 + β3 if College = 1 (degree)
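A small simulation makes reading these coefficients concrete (all numbers are invented for illustration, not estimates from the course data):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
college = rng.integers(0, 2, n)
female = rng.integers(0, 2, n)
# True effects: a college premium of 5 for men, a -2 female gap, and a
# college premium that is 1 lower for women (the interaction term).
wage = 12 + 5*college - 2*female - 1*college*female + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), college, female, college*female])
b = np.linalg.lstsq(X, wage, rcond=None)[0]

print("college premium, men:  ", round(b[1], 2))         # ~ b1
print("college premium, women:", round(b[1] + b[3], 2))  # ~ b1 + b3
```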
Interaction Effects – Binary Regressors
Interactions of Binary and Continuous Regressors

• We can also construct interaction terms involving a binary
dummy variable, D, and a continuous regressor, X.

E.g. Teacher Evaluations

Y = course evaluation scores
X = beauty score
D = Female ∈ {0,1} (male, female)
Interactions of Binary and Continuous Regressors

• Consider the following three models

(1) Y = β0 + β1·X + β2·D + u
(2) Y = β0 + β1·X + β3·(X×D) + u
(3) Y = β0 + β1·X + β2·D + β3·(X×D) + u

where D is a binary indicator variable (e.g. gender) and X
is a continuous regressor.
• (1) allows for only the intercept to depend on D.
• (2) allows for only the slope to depend on D.
• (3) allows for both intercept and slope to depend on D.
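In practice the three specifications differ only in which columns enter the design matrix. A sketch with stand-in values for the beauty score X and gender dummy D (illustrative data only):

```python
import numpy as np

# Stand-in data: X = beauty score, D = gender dummy
X = np.array([1.2, -0.3, 0.5, 2.0])
D = np.array([0.0, 1.0, 0.0, 1.0])
ones = np.ones_like(X)

M1 = np.column_stack([ones, X, D])          # (1) intercept shift only
M2 = np.column_stack([ones, X, X * D])      # (2) slope shift only
M3 = np.column_stack([ones, X, D, X * D])   # (3) both intercept and slope
print(M3)
```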
Interactions of Binary and Continuous Regressors
E.g. Teacher Evaluations
• Estimating (1), Y = β0 + β1·X + β2·D + u, allows for males
and females to have different evaluation scores, on
average, but implicitly imposes that the effect of beauty be
the same for males and females alike.
• Estimating (2), Y = β0 + β1·X + β3·(X×D) + u, imposes
that males and females have the same average
evaluation scores but allows for the effect of beauty to
differ by gender.
• Estimating (3), Y = β0 + β1·X + β2·D + β3·(X×D) + u,
allows for males and females to differ by average
evaluation score and allows for the effect of beauty to
differ by gender.
Interactions of Binary and Continuous Regressors

• Note that estimating the fully-specified non-linear model
(3) nests both specifications (1) and (2).

• A test of β3 = 0 is equivalent to a test of whether the non-
linear model improves upon the linear model (i.e. by
allowing for non-constant slopes as a function of gender).
• A test of β2 = 0 is equivalent to a test of whether the non-
linear model should allow for both groups (males and
females) to have different average evaluation scores.
Interactions of Continuous Regressors
• Finally, we can also construct interaction terms involving
two continuous regressors, X1 and X2:

Y = β0 + β1·X1 + β2·X2 + β3·(X1×X2) + u

E.g. Teacher Evaluations

Y = course evaluation scores
X1 = beauty score
X2 = age
Interpreting Interaction Effects
• If Y = β0 + β1·X1 + β2·X2 + β3·(X1×X2) + u, then the
effect of X1 (X2) on Y depends on the value of X2 (X1):

∆Y/∆X1 = β1 + β3·X2
∆Y/∆X2 = β2 + β3·X1

• So, holding X2 constant, a one unit increase in X1 raises Y
by β1 across the board (e.g. when X2 = 0), while further
increasing Y by β3·X2, which differs according to the level
of X2.
Interpreting Interaction Effects
E.g. Teacher Evaluations
• If β3 > 0, this says that the effect of beauty on teacher
evaluations is raised by the amount β3 for each additional
year of teacher age.
• Equivalently, this says that the effect of age on teacher
evaluations is raised by the amount β3 for each unit
increase in beauty.
Interpreting Interaction Effects
• Putting the last two observations together, the coefficient
β3 tells us about the effect of a simultaneous increase in
X1 and X2 above and beyond the individual effects of X1
and X2 alone:

∆Y = β1·∆X1 + β2·∆X2 + β3·∆X1·∆X2

• β1·∆X1 measures the effect of X1, holding X2 constant.
• β2·∆X2 measures the effect of X2, holding X1 constant.
• β3·∆X1·∆X2 measures the additional effect of changing
both X1 and X2.

Note that this interpretation holds for all types of
regressors, whether continuous or binary.
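Because the interacted regression function is bilinear in X1 and X2, the decomposition of ∆Y can be verified exactly for a change starting from (0, 0), using arbitrary made-up coefficients:

```python
# Check dY = b1*dX1 + b2*dX2 + b3*dX1*dX2 for a change from (0, 0),
# using arbitrary illustrative coefficients.
b0, b1, b2, b3 = 1.0, 0.5, -0.2, 0.3

def y(x1, x2):
    # Interacted regression function (error term omitted)
    return b0 + b1*x1 + b2*x2 + b3*x1*x2

d1, d2 = 2.0, 3.0
dY = y(d1, d2) - y(0.0, 0.0)
# Exact up to floating-point rounding:
print(abs(dY - (b1*d1 + b2*d2 + b3*d1*d2)) < 1e-12)  # prints True
```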
Multiple Interactions
Why stop at interactions of only two regressors?
• In some scenarios, we may want to consider multiple
interactions (e.g. a triple interaction):

Y = β0 + β1·X1 + β2·X2 + β3·X3 + β4·(X1×X2)
+ β5·(X1×X3) + β6·(X2×X3) + β7·(X1×X2×X3) + u
Assignment

MIDTERM NEXT CLASS!

Problem Set #4 – Practice


Econometrics
ECON 550

February 14, 2018

Interaction Effects; Sources of Bias


W Chapters 7; 9, 16.1-16.2

Professor Sebastien Bradley


Drexel University, LeBow College of Business
Outline for Today
Administrative Business:
• Project Rough Draft Due 2/21
• Problem Set #5 Due 2/21
• Readings
1) Interaction Effects
2) External Validity
3) Internal Validity - Endogeneity Bias
• Omitted Variables
• Functional Form Misspecification
• Measurement Error (Errors in Variables)
• Simultaneity (Simultaneous Causality)
• Sample Selection
4) Return Midterm
Interactions Between Regressors
• Recall that an important class of non-linear regression
functions involve effects of one regressor, , on Y which
depend on the value of another regressor, .

• E.g. Wage = f(College, Female) (returns to education)
• E.g. Evaluation = f(Beauty, Female) (teacher evaluations)
Interactions of Binary Regressors
E.g. Returns to Education
Y = average hourly earnings
D1 = College ∈ {0,1} (no college degree, degree)
D2 = Female ∈ {0,1} (male, female)
Interaction Effects – Binary Regressors
Interactions of Binary and Continuous Regressors

• We can also construct interaction terms involving a binary
dummy variable, D, and a continuous regressor, X.

E.g. Teacher Evaluations

Y = course evaluation scores
X = beauty score
D = Female ∈ {0,1} (male, female)
Interactions of Continuous Regressors
• Finally, we can also construct interaction terms involving
two continuous regressors, X1 and X2:

Y = β0 + β1·X1 + β2·X2 + β3·(X1×X2) + u

E.g. Teacher Evaluations

Y = course evaluation scores
X1 = beauty score
X2 = age
Interpreting Interaction Effects
• If Y = β0 + β1·X1 + β2·X2 + β3·(X1×X2) + u, then the
effect of X1 (X2) on Y depends on the value of X2 (X1):

∆Y/∆X1 = β1 + β3·X2
∆Y/∆X2 = β2 + β3·X1

• Holding X2 constant, a one unit increase in X1 raises Y by
β1 across the board (e.g. when X2 = 0), while further
increasing Y by β3·X2, which differs according to the level
of X2.
Interpreting Interaction Effects
E.g. Teacher Evaluations
• If β3 > 0, this says that the effect of beauty on teacher
evaluations is raised by the amount β3 for each additional
year of teacher age.
• Equivalently, this says that the effect of age on teacher
evaluations is raised by the amount β3 for each unit
increase in beauty.
Interpreting Interaction Effects
• Putting the last two observations together, the coefficient
β3 tells us about the effect of a simultaneous increase in
X1 and X2 above and beyond the individual effects of X1
and X2 alone:

∆Y = β1·∆X1 + β2·∆X2 + β3·∆X1·∆X2

• β1·∆X1 measures the effect of X1, holding X2 constant.
• β2·∆X2 measures the effect of X2, holding X1 constant.
• β3·∆X1·∆X2 measures the additional effect of changing
both X1 and X2.

Note that this interpretation holds for all types of
regressors, whether continuous or binary.
Multiple Interactions
Why stop at interactions of only two regressors?
• In some scenarios, we may want to consider multiple
interactions (e.g. a triple interaction):

Y = β0 + β1·X1 + β2·X2 + β3·X3 + β4·(X1×X2)
+ β5·(X1×X3) + β6·(X2×X3) + β7·(X1×X2×X3) + u
Assessing Empirical Studies
• The reliability and usefulness of an empirical study should
be assessed on the basis of internal and external validity.

• External validity refers to the applicability or


generalizability of study results to populations and
settings that are different than the one(s) used in the
empirical analysis.

• Internal validity refers to the validity of the statistical


inferences/conclusions about causal effects for the
population and setting studied.
Threats to External Validity
• External validity is threatened when the population or
setting used in the empirical analysis is unusual and
therefore unlike other populations and settings.
1) Differences in Populations
E.g. Maternal Leave and Early Reading
• If we study the effect of mothers taking an extra week of maternal
leave on early reading outcomes in the U.S., this effect is not likely
generalizable to European countries, where there exists a
preference/norm for more maternal leave.
2) Differences in Settings
E.g. Maternal Leave and Early Reading
• ~50 weeks paid leave is common in many European countries.
Threats to Internal Validity
• Internal validity is threatened when either
1) The coefficient estimate(s) of the causal effect(s) of
interest is (are) biased or inconsistent, or
2) Tests and confidence intervals yield rejection
probabilities that are inconsistent with the desired
significance or confidence levels (i.e. standard errors
are invalid).
Sources of (Endogeneity) Bias
• Violation of the conditional mean independence
assumption (i.e. correlation between your variable of
interest and the error term) may arise through several
channels:
1) Omitted Variables
2) Functional Form Misspecification
3) Measurement Error (Errors-in-Variables)
4) Simultaneity (Simultaneous Causality)
5) Sample Selection

• In each case, biased (non-causal) and inconsistent


coefficient estimates will result.
Omitted Variable Bias
• When data for an omitted variable is unavailable and
there are no suitable control variables, alternative
solutions to omitted variable bias may exist.
1) Panel data: by evaluating the same cross-sectional
units at multiple points in time, panel regression makes
it possible to control for unobserved (omitted) time-
invariant effects.
2) Instrumental variables: instrumental variables
regression introduces a new variable (i.e. the
instrument) which is exogenous to the dependent
variable.
3) Controlled experiments.
Functional Form Misspecification
• This source of bias is effectively a special form of omitted
variable bias (e.g. excluding a polynomial regression term
when the population regression model has such a form).
• Careful thought about potential non-linearities and plotting
data around the fitted regression line can help eliminate
functional form misspecification.
• Likewise, we might err on the side of caution and include
polynomials in key regressors and perform F-tests of joint
significance.
Misspecification Tests
• The Regression Specification Error Test (RESET)
includes polynomials in OLS fitted values to test whether
the base regression model neglects non-linearities in the
variables already in the model.

E.g. Y = β0 + β1·X1 + ... + βk·Xk + u

1) Estimate the regression model without polynomial terms
and compute fitted values Y-hat
2) Estimate Y = β0 + β1·X1 + ... + βk·Xk + δ1·Y-hat² + δ2·Y-hat³ + u
3) Perform an F-test of H0: δ1 = 0, δ2 = 0
Under the null, the base model is correctly specified (i.e.
polynomials of existing regressors are unnecessary).
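A sketch of the three RESET steps in Python on simulated data where the truth is quadratic but the base model is linear (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
x = rng.uniform(0, 4, n)
y = 1 + 2 * x**2 + rng.normal(0, 1, n)   # true relationship is quadratic

def fit(X, y):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return b, y - X @ b

# Step 1: estimate the linear base model and keep fitted values
X1 = np.column_stack([np.ones(n), x])
b1, r1 = fit(X1, y)
yhat = X1 @ b1

# Step 2: re-estimate with yhat^2 and yhat^3 added
X2 = np.column_stack([X1, yhat**2, yhat**3])
b2, r2 = fit(X2, y)

# Step 3: F-test that both added terms are zero
q = 2
F = ((r1 @ r1 - r2 @ r2) / q) / (r2 @ r2 / (n - X2.shape[1]))
print("RESET F-statistic:", round(float(F), 1))  # large -> misspecified
```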
Misspecification Tests (cont.)
Note that the RESET test is silent on whether your model
should include additional variables.
Rejection of the null hypothesis does not indicate which
regressors should be specified in quadratic and/or
cubic terms.

The RESET test is NOT a general test for omitted


variables bias!
Misspecification Tests (cont.)
• The Davidson-MacKinnon test allows comparison of non-
nested models.

• E.g. Linear vs. linear-log models:

(1) Y = β0 + β1·X1 + ... + βk·Xk + u
(2) Y = β0 + β1·log(X1) + ... + βk·log(Xk) + u

1) Estimate (2) and compute fitted values Y-hat
2) Estimate Y = β0 + β1·X1 + ... + βk·Xk + θ1·Y-hat + u
3) Perform a t-test of H0: θ1 = 0
Under the null, (1) is correctly specified.
4) Repeat steps 1-3, switching the two models.
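One direction of the Davidson-MacKinnon test can be sketched on simulated data where the linear-log model is the true process (illustrative numbers only):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x = rng.uniform(1, 10, n)
y = 2 + 3 * np.log(x) + rng.normal(0, 0.5, n)   # truth is linear-log

ones = np.ones(n)
# Step 1: estimate the competing model (2) and keep its fitted values
g = np.linalg.lstsq(np.column_stack([ones, np.log(x)]), y, rcond=None)[0]
yhat2 = g[0] + g[1] * np.log(x)

# Step 2: add those fitted values to the linear model (1)
X = np.column_stack([ones, x, yhat2])
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b
V = (resid @ resid / (n - 3)) * np.linalg.inv(X.T @ X)

# Step 3: t-test of the coefficient on yhat2
t = b[2] / np.sqrt(V[2, 2])
print("t-statistic:", round(float(t), 1))  # |t| > 1.96 -> reject model (1)
```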
Measurement Error
• Mismeasurement of either the dependent variable or a
regressor can likewise yield biased and inconsistent
coefficient estimates.

• Mismeasurement of an explanatory variable is referred to


as errors in variables.

• Importantly, we must distinguish classical
measurement error (i.e. mean-zero random noise, or
white noise) from non-classical measurement error.
Errors in Variables
• Formally, suppose we would like to estimate

Y = β0 + β1·X* + u

but we do not observe X* directly and instead observe
only an imprecisely measured X = X* + e, where e
captures the measurement error.

• We assume E[Y | X*, X] = E[Y | X*], s.t. X is uninformative
about Y once X* has been accounted for.
• We assume E[e] = 0.
Errors in Variables (cont.)
• If we instead estimate

Y = β0 + β1·X + v, where v = u - β1·e,

then OLS uses

β1-hat = Σ(Xi - X̄)(Yi - Ȳ) / Σ(Xi - X̄)²

such that:

plim β1-hat = β1 + cov(X, v)/var(X)
Errors in Variables (cont.)

Y = β0 + β1·X + v, where v = u - β1·e

Consistency and unbiasedness of β1-hat depend on cov(X, v):

cov(X, v) = cov(X* + e, u - β1·e)
= cov(X*, u) + cov(e, u) - β1·cov(X*, e) - β1·var(e)
Errors in Variables (cont.)
• We consider three possible cases:

(1) cov(X, e) = 0

(2) cov(X*, e) = 0

(3) cov(X, e) ≠ 0 and cov(X*, e) ≠ 0

(1) If cov(X, e) = 0,
⇒ cov(X, v) = 0 and β1-hat will be consistent.

Errors in Variables (cont.)

(2) If cov(X*, e) = 0, then:

cov(X, v) = cov(X* + e, u - β1·e) = -β1·var(e)
⇒ cov(X, v) ≠ 0

⇒ β1-hat will be biased and inconsistent.

⇒ This is referred to as classical errors in variables.
Classical Measurement Error
• Under classical errors in variables (CEV), the
inconsistency of β1-hat implies a very specific form of bias:

plim β1-hat = β1 + cov(X, v)/var(X) = β1 · var(X*)/(var(X*) + var(e))

Attenuation bias!
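Attenuation bias is easy to see in a simulation (all parameter values are illustrative): with var(X*) = var(e) = 1, OLS converges to half the true coefficient.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
beta = 2.0
x_star = rng.normal(0, 1, n)        # true regressor, var(X*) = 1
e = rng.normal(0, 1, n)             # classical measurement error, var(e) = 1
x = x_star + e                      # observed, mismeasured regressor
y = 1 + beta * x_star + rng.normal(0, 1, n)

b_ols = np.cov(x, y)[0, 1] / np.var(x)
print(round(float(b_ols), 2))  # ~ beta * var(X*)/(var(X*)+var(e)) = 1.0
```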
Classical Measurement Error Solutions
• Under classical measurement error, the direction of bias is
always toward zero.
• For some types of analyses, this is not too problematic.

• If the nature of the classical measurement error is known,


biased coefficient estimates can be rescaled accordingly.
• This requires knowledge about the variance of the
unobserved X* and the variance of the measurement
error.
• More realistically (and more generally for all types of
measurement error), instrumental variables regression
may be applied to recover causal effects.
Errors in Variables (cont.)

(3) More generally, if cov(X, e) ≠ 0 and cov(X*, e) ≠ 0:

plim β1-hat = β1 - β1·cov(X, e)/var(X)

where cov(X, e) = cov(X*, e) + var(e) ≠ 0

⇒ β1-hat will be biased and inconsistent.

⇒ The direction of bias will depend on the sign of β1 and
cov(X, e).
E.g. Labor Supply and Wages
• Suppose that we wish to estimate the following model of
the effect of wages on labor supply (i.e. hours worked):

Hours = β0 + β1·Wage + u

• When asked about their wages on surveys, however,
workers may tend to underreport wages (by correctly
reporting hours but underreporting income).

⇒ How would this bias our estimate of the response of
labor supply to the wage?
E.g. Labor Supply and Wages (cont.)
Suppose individuals with higher wages (income) tend to
underreport wages to a greater extent:

Wage* = the true (unobserved) wage
Wage = Wage* + e = the reported wage

where e is the measurement error in the true wage.
Systematic underreporting implies e < 0.

• We expect cov(Wage*, e) < 0 because a higher wage is
associated with greater underreporting (e more negative) of
that wage.

• We expect β1 > 0 if we think the substitution effect will
outweigh the income effect as the wage increases (i.e. a
higher wage tends to induce additional hours of work).
E.g. Labor Supply and Wages (cont.)
Recall,

plim β1-hat = β1 - β1·cov(Wage, e)/var(Wage)

So, with cov(Wage, e) < 0, we expect plim β1-hat > β1,
and we will tend to overestimate the true magnitude of the
effect of a change in wage on labor supply.

(Suppose, for instance, that wage was 20% higher for a group
of people and they worked 2 more hours per week but they
only reported that their wage was 10% higher. We would
attribute the additional 2 hours to only a 10% rise in the wage
when in fact it was due to a 20% rise).
Measurement Error in Y
• Formally, suppose now that we would like to estimate

Y* = β0 + β1·X + u

but we do not observe Y* directly and instead observe
only an imprecisely measured Y = Y* + e, where e
captures the measurement error.
Measurement Error in Y (cont.)
• If we instead estimate:

Y = β0 + β1·X + v, where v = u + e

• As usual,

plim β1-hat = β1 + cov(X, v)/var(X),

where cov(X, v) = cov(X, u) + cov(X, e) = cov(X, e).

• Just as in the case of errors in variables, unbiasedness of
β1-hat depends on cov(X, e).
Measurement Error in Y (cont.)
• Unlike errors in variables, it seems plausible that
cov(X, e) = 0 in most cases (i.e. measurement error in Y
is independent of all explanatory variables).

⇒ If cov(X, e) = 0,
then cov(X, v) = 0 and β1-hat will be consistent.

Measurement Error in Y (cont.)
• Mismeasurement of the dependent variable that has
mean zero and is uncorrelated with any regressor (i.e.
classical measurement error) will not yield biased
coefficient estimates, but the estimated standard errors
will be larger than otherwise.
• (Random measurement error with non-zero mean will
merely bias the intercept coefficient estimate).

• In contrast, mismeasurement of the dependent variable


that is correlated with the regressors will produce biased
and inconsistent estimates.
Simultaneity (Simultaneous Causality)
• Simultaneity or simultaneous causality bias arises when
there is reverse causality running from Y to X such that
the dependent variable partly determines the value of the
regressor(s).

• E.g. Traffic Fatalities and Minimum Drinking Age Laws


• E.g. GDP and Exports
• E.g. Federal Tax Revenue and Tax Rates
Simultaneity (cont.)
• Simultaneity induces spurious correlation between Y and
X through the regression error term and will consequently
cause the conditional mean independence assumption to
fail, thereby yielding biased, inconsistent estimates.
• This is often more apparent by framing simultaneity bias
as reflecting an omitted variables problem.

• E.g. Traffic Fatalities and Minimum Drinking Age Laws


• E.g. GDP and Exports
• E.g. Federal Tax Revenue and Tax Rates
Simultaneity (cont.)
• Formally, simultaneity bias can be illustrated by
considering two simultaneous equations:

Y = β0 + β1·X + u
X = γ0 + γ1·Y + v

• Intuitively, since u captures all of the “other factors” which
affect Y, and Y in turn affects X, then X and u must be
correlated.
Simultaneity (cont.)
• Mathematically,

cov(X, u) = cov(γ0 + γ1·Y + v, u) = γ1·cov(Y, u) + cov(v, u)

⇒ cov(X, u) = γ1·cov(β0 + β1·X + u, u)
= γ1·β1·cov(X, u) + γ1·var(u)

(assuming cov(v, u) = 0)

⇒ cov(X, u) = γ1·var(u) / (1 - γ1·β1) ≠ 0
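The resulting inconsistency can be checked by simulating the two-equation system and comparing the OLS slope to β1 plus the bias term derived above (parameter values are illustrative, with zero intercepts for simplicity):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
b1, g1 = 0.5, 0.4            # Y = b1*X + u  and  X = g1*Y + v
u = rng.normal(0, 1, n)
v = rng.normal(0, 1, n)

# Reduced form: solve the two equations simultaneously
Y = (b1 * v + u) / (1 - g1 * b1)
X = (v + g1 * u) / (1 - g1 * b1)

ols = np.cov(X, Y)[0, 1] / np.var(X)
bias = g1 * np.var(u) / ((1 - g1 * b1) * np.var(X))  # cov(X,u)/var(X)
print(round(float(ols), 3), round(float(b1 + bias), 3))  # OLS != 0.5
```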
Simultaneity Solutions
• Just as for omitted variable bias or errors-in-variables
bias, the best option for addressing simultaneity bias is
instrumental variables regression.

• If feasible, instrumental variables regression holds the


possibility of being able to separate the simultaneous
effects in order to focus exclusively on the causal effect
running from X to Y.
• In other cases, reverse causality may be especially
problematic contemporaneously. A partial solution may
involve using pre-determined values of X if panel data are
available.
Sample Selection
• Sample selection bias arises when data are missing
because of a selection process related to the value of the
dependent variable (even after X is accounted for).
• In other words, sample selection is a concern when the
data are not randomly sampled with respect to Y.

• Sample selection is not a concern when data are missing


purely at random, or when selection is on X directly (e.g.
you are missing half the data for women but missing none
for men), but it may affect external validity.
Sample Selection (cont.)
• Some data are inherently selected (e.g. home sale prices,
wages, survey data, online product reviews, etc.).
• This is not precisely selection on the value of Y, but may
rather be selection on some unobserved determinant of Y.

• This is generally only problematic for estimation if this


unobserved determinant of Y is also correlated with a
regressor of interest.
Sample Selection – Online Reviews
Above the Law, UNC Law Prof Sends a ‘Rather Embarrassing’ Request, Asks Former Students
to Help His Online Rating (February 23, 2012):

“Rating sites apparently even have the power to bring a well-known UNC Law professor to
his electronic knees. It’s not every day that a torts professor sends his former students a “rather
embarrassing request” to repair his online reputation. It’s also certainly not every day that the
students respond en masse….
On Tuesday, Professor Michael Corrado sent the following email to 2Ls who took his torts
class last year, basically pleading for their help. ...

‘I have a rather embarrassing request of you. An undergraduate brought something to my


attention that needs to be fixed. It seems that there is a website, something like Rate My
Professors, where my rating is so bad that he was uncertain about whether to take my
course or not. I was puzzled, because my evaluations are generally not bad. It turns out that
there are just a couple of responses on the site, and they are apparently from people who
have a real grievance against me for some reason.
They are certainly entitled to their opinions, but it isn’t really a fair reflection of my teaching
(I hope).
What I would like to ask of you is whether, if you are so inclined, you would go onto that
site and write your own review of my teaching. I’m not asking you to write a favorable review,
just to write an honest review. I think that overall I would get much better ratings if a number
of people did this and just gave their honest views.’”
Sample Selection – Charitable Giving
• E.g. Charitable Giving Surveys
• We might expect less generous individuals to be less
likely to complete the survey.
• If we are interested in estimating the effect on giving of
something wholly-unrelated to a person’s propensity to
complete the survey, selection bias is not a real concern
(e.g. paycheck frequency).
• If instead we are interested in estimating the effect on
giving of something that is related to the selection
process, selection bias may be a problem (e.g.
volunteering hours).
Sample Selection Solutions
• Unfortunately, sample selection bias is largely intractable,
at least using methods considered in this class.

• This also implies that we need to exercise caution in


omitting data outliers, as doing so may introduce a form of
non-random selection.
Outliers
• See Wooldridge Ch. 9.5-9.6 for a discussion of outliers.
Bottom Line:
• The determination of “outliers” is somewhat subjective and requires
careful consideration.
• It is always advisable to perform sensitivity analyses around the
definition of outliers.
• OLS is sensitive to outliers because it fits the regression line by
minimizing the sum of squared residuals
• Alternatively, one can use Least Absolute Deviation (LAD) also
known as quantile regression
• LAD is less sensitive to outliers than OLS because it minimizes the
sum of the absolute value of the residuals
Assignment

For next class, please read:


1) W Ch. 13-14
2) AP Ch. 5 (Diff-in-Diff)

PAPER ROUGH DRAFT


Problem Set #5
due Weds. 2/21, in class.
Econometrics
ECON 550

February 21, 2018

Sources of Bias; Panel Estimation


W Chapters 9, 16.1-16.2; 13-14

Professor Sebastien Bradley


Drexel University, LeBow College of Business
Outline for Today
Administrative Business:
• Problem Set #6 Due 2/21
1) Internal Validity - Endogeneity Bias
• Sample Selection
2) Pooled Cross Sectional Estimation
3) Difference in Differences (DiD or DD)
4) Panel Data Estimation
• First Differencing
• Fixed Effects, Dummy Variables, Demeaning
• Random Effects
• Cluster Robust Standard Errors
Recall: Sources of (Endogeneity) Bias
• Violation of the conditional mean independence
assumption (i.e. correlation between your variable of
interest and the error term) may arise through several
channels:
1) Omitted Variables
2) Functional Form Misspecification
3) Measurement Error (Errors-in-Variables)
4) Simultaneity (Simultaneous Causality)
5) Sample Selection

• In each case, biased (non-causal) and inconsistent


coefficient estimates will result.
Sample Selection
• Sample selection bias arises when data are missing
because of a selection process related to the value of the
dependent variable (even after X is accounted for).
Pooled Cross-Sectional Data
• A pooled cross-sectional dataset consists of multiple
different (i.e. independent) cross-sectional samples
pooled across T time periods (e.g. home sales; many
(most?) surveys)

• Since all observations are still presumed independent,


this poses few statistical challenges.
• Still, observations from different years may not be
identically distributed.
 Estimate by pooled OLS, allowing for different
intercepts by year (include year dummies).
Panel Data
• A panel or longitudinal dataset consists of n cross-
sectional entities (firms, individuals, states, etc.) observed
repeatedly over multiple time periods, s.t. i = 1, …, n and
t = 1, …, T.

• Observations are indexed by entity i and time t (e.g. Y_it).

• Time periods in the data should be evenly spaced.
• The panel is said to be balanced if data for all entities are
available for all of the same set of time periods and
unbalanced otherwise.
Differences-in-Differences (DiD or DD)
• With repeated observations on the same cross-sectional
entities, estimation of differential effects of key regressors
across time has the potential to identify differences-in-
differences causal treatment effects of experiments or
policy changes (i.e. “natural” or “quasi” experiments).

• This is most easily seen when a policy change affects a


discrete subset of the population (i.e. the treatment group)
but not another (the control group), and we have data
from before and after the policy change.
 The change in the treatment–control gap in outcomes from the pre-
period to the post-period constitutes a differences-in-differences
estimator of the relevant "treatment" effect.
Differences-in-Differences (DiD) (cont.)
E.g. Financial Incentives for Fitness
• Suppose we take a sample of gym members and split them
into a treatment group (Treat_i = 1) and a control group.
• Treated individuals receive $10 per gym visit during the month
of March, while the control group receives nothing.
• We observe gym attendance for both groups in February
(Post_t = 0) and March (Post_t = 1).

• Estimate the DiD effect of the policy with a simple regression:
Visits_it = β0 + β1 Treat_i + β2 Post_t + β3 (Treat_i × Post_t) + u_it
Differences-in-Differences (DiD) (cont.)
E.g. Financial Incentives for Fitness

• Sample consists of two observations per gym member on
Visits_it (i.e. for February and March).
• Treat_i = 0 if a member is in the control group, and
Treat_i = 1 if in the treatment group (in both months)
• Treat_i × Post_t = 1 only for observations for the
treated group during the treatment month (March)
Differences-in-Differences (DiD) (cont.)
Consider first a simplified version of our regression model (and
build up to the full DiD specification):
Visits_it = β0 + u_it
What is β0?

Next consider:
Visits_it = β0 + β1 Treat_i + u_it
What is β0?
What is β1?
Differences-in-Differences (DiD) (cont.)
Next consider:
Visits_it = β0 + β2 Post_t + u_it
What is β0?
What is β2?
Differences-in-Differences (DiD) (cont.)
• Now consider again the full model:
Visits_it = β0 + β1 Treat_i + β2 Post_t + β3 (Treat_i × Post_t) + u_it
What is β0?
What is β1?
What is β2?
What is β3?
Differences-in-Differences (DiD) (cont.)

• As in our discussion of binary interaction effects, we can view


this regression as allowing us to estimate 4 group means.

• β0 = sample mean of control group in February
• β0 + β1 = sample mean of treatment group in February
• β0 + β2 = sample mean of control group in March
• β0 + β1 + β2 + β3 = sample mean of treatment group in March

 β3 is the differences-in-differences estimator of the effect
of treatment on gym visits. It captures the change in the
outcome for the treatment group net of the change in the
outcome for the control group.
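The group-mean arithmetic above can be checked numerically. Below is a minimal numpy sketch of the gym example (simulated data with an assumed true treatment effect of 2 extra visits; Treat/Post/visits are hypothetical names, not the course's data). The interaction coefficient from the saturated OLS regression reproduces the difference of group-mean differences exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated gym-visit data: 200 members observed in February (post = 0)
# and March (post = 1); assumed true treatment effect = 2 extra visits.
n = 200
treat = np.repeat(rng.integers(0, 2, n), 2)    # member-level treatment dummy
post = np.tile([0, 1], n)                      # month dummy
visits = (5.0 + 1.0 * treat + 0.5 * post
          + 2.0 * treat * post
          + rng.normal(0, 1, 2 * n))

def cell_mean(t, p):
    """Sample mean of visits for one treatment-group x month cell."""
    return visits[(treat == t) & (post == p)].mean()

# DiD as the difference of group-mean differences:
did = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

# The interaction coefficient from the saturated OLS regression is identical:
X = np.column_stack([np.ones(2 * n), treat, post, treat * post])
beta, *_ = np.linalg.lstsq(X, visits, rcond=None)
```

Because the regression is fully saturated in the two dummies, beta[3] equals the group-mean DiD to machine precision.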
Differences-in-Differences (DiD) (cont.)
 Under what assumptions will β3 capture the causal effect
of the policy change or experimental treatment?

 A key assumption of the DiD strategy is that of


parallel trends.
 If not for the policy change or experimental intervention,
outcomes would have evolved similarly over time for both
treatment and control groups.
 This is typically reasonable in the case of experiments given
randomization into treatment, but still a potential concern in small
samples.
Continuous DiD
E.g. Airlines Full-Fare Advertising Regulations (FFAR)
Price_it = β0 + β1 Post_t + β2 Tax_it + β3 (Tax_it × Post_t) + u_it

• In a continuous DiD set-up, all observations are "treated,"
albeit to varying degrees (depending on the size of Tax_it).
• β0 measures the average pre-FFAR price where Tax_it = 0
• β1 measures average price differences pre- versus post-FFAR
• β2 measures the baseline rate of tax pass-through
• β3 captures changes in the rate of tax pass-through post-FFAR
Continuous DiD (cont.)

 β3 is the (continuous) differences-in-differences
estimator of the effect of FFAR on the rate of tax
pass-through.

 Under what conditions will β3 capture the causal
effect of the policy change?
Appeal of Panel Data
• A broader virtue of using panel data is that time-invariant
entity characteristics (observed and unobserved) can be
controlled for through panel regression.

• This may substantially reduce the set of possible omitted


variables to worry about.
Omitted Variable Bias
• Suppose that we have a pure cross-sectional dataset with
T = 1 observations per entity, and we estimate
Y_i = β0 + β1 X_i + u_i

 β̂1 will be biased if there exists any omitted variable Z
such that Cov(X, Z) ≠ 0 and Cov(Z, Y) ≠ 0.
Omitted Variable Bias (cont.)
• Equivalently, if the true model is
Y_i = β0 + β1 X_i + β2 Z_i + u_i
but we estimate
Y_i = β0 + β1 X_i + v_i,

then E[β̂1] ≠ β1 if Cov(X, Z) ≠ 0 and β2 ≠ 0.

Without data to control for Z, we are in trouble.


First-Differencing
• A simple but elegant solution to this problem exists if we
have data for at least two time periods, T = 2, and Z does
not vary over time.

• By translating the dependent and independent variables
into changes between time periods and focusing on the
effect of changes in X between periods on changes in Y,
we can strip out the correlation between the first-
differenced regressor and the time-invariant omitted
variable:
ΔY_i = β1 ΔX_i + Δu_i
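A small simulation illustrates why differencing helps. The data-generating process below (an entity effect a_i correlated with X, true slope 1.5) is an assumption for illustration only, using numpy:

```python
import numpy as np

rng = np.random.default_rng(1)

# n entities, T = 2 periods; a_i is an unobserved time-invariant confounder
# that is correlated with X. Assumed true slope beta1 = 1.5.
n, beta1 = 500, 1.5
a = rng.normal(0, 2, n)                        # unobserved entity effect
x = np.column_stack([a + rng.normal(0, 1, n),  # X correlated with a_i
                     a + rng.normal(0, 1, n)])
y = beta1 * x + a[:, None] + rng.normal(0, 1, (n, 2))

# Pooled OLS on levels absorbs the confounder into the slope (biased up):
xs, ys = x.ravel(), y.ravel()
b_pooled = np.cov(xs, ys)[0, 1] / np.var(xs, ddof=1)

# First differencing: dY = beta1 * dX + du, and a_i drops out entirely.
dx, dy = x[:, 1] - x[:, 0], y[:, 1] - y[:, 0]
b_fd = dx @ dy / (dx @ dx)                     # slope through the origin
```

Here the pooled levels estimate lands well above 1.5 because a_i loads on X, while the first-differenced slope is close to the truth.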
First Differencing (cont.)
• β̂1 will be an unbiased and consistent estimator of β1
provided that
E[Δu_i | ΔX_i] = 0
⇒ Cov(ΔX_i, Δu_i) = 0

• This requires that conditional mean independence
holds across all periods in the data.
• This assumption is referred to as strict exogeneity.
First Differencing (cont.)
• Differencing eliminates concerns associated with all
unobserved time-invariant determinants of Y.
• Z_i is no longer problematic.

• (Note that if Cov(X_it, Z_i) = 0, we have no omitted
variable problem with respect to unobserved time-
invariant characteristics, and pooled OLS will yield
unbiased estimates.)
E.g. State Traffic Fatalities
• If states have different unobserved time-invariant
determinants of traffic fatalities (e.g. attitudes toward
drinking, political will to curb drunk driving, etc.) which are
correlated with the level of the beer tax, differencing will
allow these unobserved state fixed effects to drop out.
• Assuming no other time-varying omitted variables, β̂1 will
be an unbiased estimate of the effect of alcohol taxes:
ΔFatalityRate_s = β1 ΔBeerTax_s + Δu_s
Limitations - First Differencing
• A possible downside with first differencing, however, is
that there may be relatively little variation over time in the
changes in X to explain changes in Y, even if there exists
a substantial degree of cross-sectional variation to exploit.

• Computing differences over longer time spans may


provide a partial solution.
Limitations - First Differencing (cont.)
• Differencing can also be performed when T > 2 to
likewise account for time-invariant determinants of Y.
• (Note that this implies the use of n(T − 1) observations
across T − 1 pooled first-differenced cross sections)

• This is still subject to the concern that variation over time


in the changes in X to explain changes in Y—even over
multiple consecutive periods—may still be too modest to
obtain precise results.
• The first-differenced estimator may also perform worse
than a simple pooled OLS estimator if X suffers from
classical errors-in-variables measurement error.
Limitations - First Differencing (cont.)
• First differencing with T > 2 periods of data also requires
that Δu_it be serially uncorrelated over time in order to
obtain statistically-valid standard errors.
• Unfortunately, it is not sufficient for u_it to be serially
uncorrelated (which is already unlikely relative to the
usual cross-sectional OLS i.i.d. assumption).
• This requires more sophisticated empirical solutions
involving cluster-robust standard errors.

• Accounting for time-invariant fixed effects can also be


accomplished in alternative ways.
Entity Fixed Effects (FE)
• If the true linear regression model specifies both time-
varying and time-invariant (unobserved and observed)
determinants of Y, such that
Y_it = β0 + β1 X_it + β2 Z_i + u_it,

then we can equivalently re-write the model as a fixed
effects regression model:
Y_it = β1 X_it + α_i + u_it,

where α_i = β0 + β2 Z_i.
• α_i is referred to as an entity fixed effect and allows for
each entity to have its own intercept (i.e. time-average
effect on Y).
Accounting for Fixed Effects
Where have we previously encountered situations where
we wanted to allow different groups within our data to
have different intercepts?
How did we allow for this in our regression models?
Indicator (dummy) variables!
The Dummy Variable Approach
• The dummy variable approach to fixed effects regression
consists of including separate binary indicator variables to
flag each individual entity in the dataset.
• For example, D2_i = 1 if i = 2 and 0 otherwise.
• To avoid perfect multicollinearity, only n − 1 dummy
variable regressors may be included, so we arbitrarily drop
the first:
Y_it = β0 + β1 X_it + γ2 D2_i + γ3 D3_i + ⋯ + γn Dn_i + u_it

 What is the interpretation of the constant and dummy
variable coefficient estimates?
Interpreting Fixed Effects
• Given our motivation for estimating fixed effects—avoiding
omitted variable bias—we are not typically interested in
the fixed effects coefficients themselves.
• Nevertheless, these can be informative.

• In practice, the dummy variable coefficient estimates tell


us about the average value of Y over time for a particular
entity relative to the left out entity.
Interpreting Fixed Effects
• If i = 1 is the left out entity, then β0 estimates α_1, and
each γ_j estimates α_j − α_1, the difference between entity
j's intercept and that of the left-out entity.
Limitations - Dummy Variable Approach
• While statistically-valid, the dummy variable approach has
the practical downside of requiring estimation of n − 1
additional coefficients.
• This can be computationally-slow and clutters the
regression output if we do not care about the magnitude
of each of the separate fixed effects.
Entity Demeaning
• Yet a third method for accounting for time-invariant entity
fixed effects consists of time-demeaning (subtracting the
mean of each variable over all T time periods) for each of
the regressors and the dependent variable.
• Thus, instead of estimating
Y_it = β1 X_it + α_i + u_it,

we estimate
(Y_it − Ȳ_i) = β1 (X_it − X̄_i) + (u_it − ū_i)
Entity Demeaning (cont.)
• Since α_i is time-invariant,
ᾱ_i = (1/T) Σ_t α_i = α_i,

so that demeaning yields:
Ỹ_it = β1 X̃_it + ũ_it

• It can be shown that the β̂1 estimator for the demeaned
model is identical to the β̂1 obtained using the dummy
variable approach (for all T).
Equivalence of Fixed Effects Methods
• When T = 2, it can furthermore be shown that
β̂1,FD = β̂1,DV = β̂1,DM,

where
• β̂1,FD is the slope coefficient estimate from the first-
differenced model (estimated without an intercept),
• β̂1,DV corresponds to the model with n − 1 dummy
variable indicators, and
• β̂1,DM represents the coefficient estimate from the entity-
demeaned model.
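This equivalence is easy to verify numerically for T = 2. The sketch below (a simulated balanced panel with an assumed slope of 0.7, numpy only) computes all three estimators:

```python
import numpy as np

rng = np.random.default_rng(2)

# Balanced panel: n entities, T = 2 periods, assumed true slope 0.7.
n = 50
a = rng.normal(0, 1, n)                           # entity fixed effects
x = rng.normal(0, 1, (n, 2)) + a[:, None]
y = 0.7 * x + a[:, None] + rng.normal(0, 0.5, (n, 2))

# (1) First differences, slope estimated without an intercept:
dx, dy = x[:, 1] - x[:, 0], y[:, 1] - y[:, 0]
b_fd = dx @ dy / (dx @ dx)

# (2) Dummy-variable (LSDV) regression: one dummy per entity, no constant
#     (an equivalent parameterization to a constant plus n - 1 dummies).
D = np.kron(np.eye(n), np.ones((2, 1)))           # 2n x n entity dummies
XD = np.column_stack([x.ravel(), D])              # rows: (i=0,t=0), (i=0,t=1), ...
b_lsdv = np.linalg.lstsq(XD, y.ravel(), rcond=None)[0][0]

# (3) Entity demeaning (the "within" transformation):
xd = (x - x.mean(axis=1, keepdims=True)).ravel()
yd = (y - y.mean(axis=1, keepdims=True)).ravel()
b_dm = xd @ yd / (xd @ xd)
```

All three slope estimates agree to machine precision, as the slide asserts.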
Time Fixed Effects
• Just as we may worry about omitted variable bias arising
through time-invariant determinants of Y, we might
also/instead worry about unobserved time-varying effects
that are the same across entities.
• The fully-specified (time and entity) fixed effects
regression model is hence
Y_it = β1 X_it + α_i + λ_t + u_it

• E.g. Traffic Fatalities: evolving paternalistic views
affecting national vehicle safety standards and "sin" taxes.
Time Fixed Effects (cont.)
• Accounting for time fixed effects proceeds in much the
same way as for entity fixed effects.
1) Entity and time dummies:
Y_it = β0 + β1 X_it + γ2 D2_i + ⋯ + γn Dn_i + δ2 B2_t + ⋯ + δT BT_t + u_it

2) Entity- and time-demeaning:
• Requires transforming all variables by subtracting entity and time-
period means from each observation.

3) Entity-demeaning with time dummies:
• Requires transforming all variables by subtracting entity means
from each observation and including T − 1 time dummies.
Time Fixed Effects (cont.)
• If 2, it can also be shown that the first differencing
approach with an intercept yields equivalent coefficient
estimates to the other approaches for accounting for
entity and time fixed effects.
• (Implicitly, the intercept term serves to capture the time
effect from the second period (i.e. the change in Y
between periods 1 and 2), and the entity fixed effects are
accounted for in the differencing).
Random Effects (RE)
• A common source of confusion arises surrounding the use
of fixed effects or random effects in panel estimation.

• Random effects are only warranted when the α_i are
believed to be uncorrelated with the explanatory variables
(i.e. the time-invariant unobserved factors are not a
source of omitted variable bias).
Random Effects (RE) (cont.)
• If the α_i are uncorrelated with the regressors (yet
nevertheless non-zero), ignoring them and running pooled
OLS will lead to serial correlation in the composite error
term and incorrect SE's.

• Random effects are hence only useful to deal with serial
correlation in the error term.

• Even though the assumption that the α_i are uncorrelated
with the regressors may be violated, people sometimes
erroneously try to use RE because RE usually yields
smaller SE's. Beware!
Fixed versus Random Effects
• A Hausman Test can in principle be used to determine
whether it is acceptable to use RE instead of FE.

• A maintained assumption of this test is that both the FE
and RE estimates are consistent under the null (i.e. the α_i
are uncorrelated with the regressors, such that neither
estimator is inconsistent), but that the RE estimates are
more efficient.
• The Hausman test evaluates whether differences between
the RE and FE coefficients are statistically significant.
Rejection of the null implies that the RE estimates are
inconsistent, and FE should be used.
Fixed versus Random Effects (cont.)
• However, failure to reject under the Hausman test may
arise because the FE estimates had large standard errors
so that even though the two sets of coefficients were far
apart, the test was too imprecise to reject equality.

• It is also possible that the maintained assumption is


violated so that the FE estimator is (also) inconsistent. In
this case, the test is meaningless.
Failure to reject in this case would suggest using the RE
estimates because they were sufficiently close to the
wrong FE estimates!
E.g. State Traffic Fatalities
Having estimated
FatalityRate_st = β1 BeerTax_st + α_s + λ_t + u_st

with both state and year fixed effects, what can be said
about the internal validity of our regression results?

 Valid standard errors?


 Unbiased and consistent coefficient estimates?
Fixed Effects Regression Assumptions
(1) Conditional mean independence (over all periods):
E[u_it | X_i1, X_i2, …, X_iT, α_i] = 0

(2) (X_i1, …, X_iT, u_i1, …, u_iT), i = 1, …, n are i.i.d. draws
(entities are randomly sampled)

(3) Large outliers are unlikely:
(X_it, u_it) have non-zero finite fourth moments

(4) No perfect multicollinearity
FE Assumptions – Key Differences
• In contrast to the standard OLS cross-sectional data
assumptions,
• (1) now requires conditional mean independence across
all time periods such that u_it may not be correlated with
the past, present, or future values of the regressors.
• (2) only requires that the sample of entities be randomly
drawn. Within entity, observations need not be
independent (this would never happen).
• (2) leaves open that the X_it (and u_it) are likely to be
autocorrelated or serially correlated across periods
(persistence).
Heteroskedasticity
• As previously discussed, inconsistent standard error
estimates due to Var(u_it | X_it) ≠ σ²_u can readily be dealt
with by computing heteroskedasticity-robust standard
errors.
• If heteroskedasticity results from other underlying
problems, however, such as model misspecification or
omitted variable bias, computing robust standard errors
for biased coefficient estimates is pointless.
Correlated Errors
• Correlated errors across observations typically arise in
panel datasets or time series data where the same cross-
sectional unit is repeatedly included in the sample.
• Errors in this context are said to be serially correlated as
they persist over time for observations on the same
individual/firm/country/etc.
• Such data are not independently distributed, thereby
violating the OLS i.i.d. assumption.
Correlated Errors (cont.)
• Errors may also be correlated in space or along other
similar such dimensions for observations from different
cross-sectional units (i.e. errors are clustered).

• E.g. Firms within a geographic area


In trying to explain firm profitability, for example, firms
operating in the same geographic region may be hit by
the same unobserved shocks, which will be captured in
their regression errors.

• E.g. Firms within an industry


• E.g. Individuals within a household
Clustered Standard Errors
• Clustered standard errors allow for heteroskedasticity and
arbitrary within-cluster (e.g. within-entity) autocorrelation,
while assuming errors across clusters (entities) to be
uncorrelated.
• This is consistent with the fixed effects i.i.d. assumption
and allows calculation of valid standard errors.

• In Stata, you can compute clustered standard errors by


typing “reg Y X i.entity, vce(cluster entity)”
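For readers not using Stata, the mechanics behind vce(cluster) can be sketched in numpy. This is a bare-bones version of the cluster-robust "sandwich" variance estimator (Stata's finite-sample corrections are omitted, and the data are simulated with a cluster-level error component):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated panel: 40 entities x 5 periods, with a cluster-level shock that
# makes errors correlated within entity.
G, T = 40, 5
entity = np.repeat(np.arange(G), T)
shock = rng.normal(0, 1, G)                       # within-cluster error component
x = rng.normal(0, 1, G * T)
y = 1.0 + 2.0 * x + shock[entity] + rng.normal(0, 1, G * T)

X = np.column_stack([np.ones(G * T), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
u = y - X @ beta

XtX_inv = np.linalg.inv(X.T @ X)

# Naive homoskedastic SEs (wrong here: they ignore within-entity correlation):
sigma2 = u @ u / (G * T - X.shape[1])
se_naive = np.sqrt(np.diag(sigma2 * XtX_inv))

# Cluster-robust sandwich: meat = sum over entities of (X_g' u_g)(X_g' u_g)'
meat = np.zeros((2, 2))
for g in range(G):
    idx = entity == g
    s = X[idx].T @ u[idx]
    meat += np.outer(s, s)
se_cluster = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
```

With positively correlated errors within entities, the clustered intercept SE comes out noticeably larger than the naive one, which is exactly the correction the slide describes.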
Panel Estimation – Virtues and Limitations

Even if the use of panel data allows controlling for


potentially unobserved entity and time fixed effects,
determinants of Y which vary across both entities and
time while being correlated with the included regressors
continue to present an omitted variables problem and
preclude estimation of unbiased causal effects.

E.g. State Traffic Fatalities


• What if states are more likely to raise alcohol taxes
following periods of large increases in traffic fatality rates?
• What if the legal blood alcohol limit (BAC) while driving is
lowered at the same time alcohol taxes are raised?
Other possibilities?
Applicability to Other Data Structures
• All of the methods and motivation discussed in the context
of panel estimation are equally appropriate in related data
structures with multiple data dimensions.

• E.g. Home sales by season or neighborhood


• E.g. Firm profitability by industry
• E.g. Airline ticket pricing by origin x destination x season

 Many layers of FE are possible.


DiD as First-Differencing or Fixed Effects

• In terms of FE, the intercepts for the two entities, control and treated,
are the entity fixed effects, written as α_C and α_T, while λ_t is the time
fixed effect. β3 captures the DiD treatment effect.

• In terms of first differences,
ΔVisits_i = β2 ΔPost_t + β3 Treat_i × ΔPost_t + Δu_i

• Or, recognizing that Post_2 = 1 and Post_1 = 0:
ΔVisits_i = β2 + β3 Treat_i + Δu_i

• β2 captures the average change in gym visits for the control group
• β3 captures the DiD treatment effect.
DiD and Parallel Trends Assumption
• A DiD regression compares the trend in the outcome in the treatment group to the
trend in the outcome in the control group
• In order for this comparison to yield a good estimate of the treatment effect, we
must rule out any differences in pre-existing trends between the two groups
• If the pre-existing trends differ, then any difference in differences may simply
reflect a continuation of these pre-existing trends rather than a causal effect.
• Using data from the pre-period, create a linear time trend: a variable that equals 1
in period 1, 2 in period 2, … and T in period T, the last untreated time period.
• Then, interact it with a treatment-group dummy and run the model below on pre-
treatment data:
Y_it = β0 + β1 t + β2 Treat_i + β3 (t × Treat_i) + u_it

• If the coefficient on this interaction is different from zero (β3 ≠ 0), the data flunk
the parallel trends assumption and the DiD estimate is likely to be biased.
• Researchers typically plot the trends in both the pre period and the treatment
period with a vertical line at the time the treatment is applied
• If you have too few pre-period observations to run the regression above, you can
simply plot the time trend.
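A minimal numpy version of this pre-trend check, run on simulated pre-period data in which parallel trends fails by construction (the treatment group is assumed to trend 0.3 per period faster):

```python
import numpy as np

rng = np.random.default_rng(4)

# Pre-period data only: 100 units per group observed over T = 4 pre-treatment
# periods. By construction the treatment group trends 0.3/period faster, so
# the parallel-trends check should fail here.
n_per_group, T = 100, 4
t = np.tile(np.arange(1, T + 1), 2 * n_per_group)     # linear time trend 1..T
treat = np.repeat([0, 1], n_per_group * T)            # treatment-group dummy
y = 1.0 + 0.5 * t + 0.2 * treat + 0.3 * treat * t + rng.normal(0, 1, t.size)

# Regress Y on the trend, the group dummy, and their interaction:
X = np.column_stack([np.ones(t.size), t, treat, treat * t])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
pretrend_gap = beta[3]   # interaction coefficient; nonzero => trends diverge
```

A t-test on the interaction coefficient (not shown) would reject parallel trends here, so a DiD estimate on these groups would be suspect.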
Differences in Differences in Differences
(DDD)
• Conceptually, the DDD captures the difference between two
DD results in one regression.
• The first result is the one we’ve already studied.
• The second result is for a group that is exposed to the
treatment but should not be affected by it.
• For example, Philadelphia might enact a tutoring program for
high school juniors while Pittsburgh does not.
• We might be interested in the effect of the program on test
scores, measured once at the start of the junior year and again
at the end of the junior year.
• The first DD subtracts the change in test scores for juniors in
Pittsburgh from the change in test scores for juniors in
Philadelphia:
DD_juniors = ΔTestScore_Philly,juniors − ΔTestScore_Pittsburgh,juniors
DDD (cont.)
• The DDD basically runs the DD for sophomores and
subtracts the result for them from the result for juniors.
• The DDD accounts for whether any time varying omitted
variable might be causing the change in test scores
instead of the tutoring program.
• For example, it may be that at the same time that
Philadelphia got money for the tutoring program, it also
got money for better labs, new textbooks and better
teachers.
• These changes should also affect sophomores, but since
the tutoring program is only for juniors, sophomores would
not be affected by it.
DDD (cont.)
• The DDD can be captured in a single regression as follows:
Y = β0 + β1 Post + β2 Philly + β3 Junior + β4 (Post × Philly)
+ β5 (Post × Junior) + β6 (Philly × Junior) + β7 (Post × Philly × Junior) + u

• β7 is the coefficient of interest and represents the causal effect
of the program.
• We have eight parameters because we're describing eight
different averages: an average pre-period test score and an
average post-period test score for each of four groups.
DDD (cont.)
• It’s also possible to first-difference the data and write:

ΔY = δ0 + δ1 Philly + δ2 Junior + δ3 (Philly × Junior) + Δu

• Notice that every regressor in the longer regression from the previous
slide that lacks Post disappears when it’s 1st differenced.
• Once again, we have 4 parameters to describe 4 groups, but this time
we’re describing changes instead of levels.

• δ0 is the change in test scores for Pittsburgh sophomores,
• δ0 + δ1 is the change for Philadelphia sophomores
• δ0 + δ2 is the change for Pittsburgh juniors
• δ0 + δ1 + δ2 + δ3 captures the change for Philadelphia juniors
• δ3 is the causal effect of the program since it captures the extent to which
the change for Philadelphia juniors differs from the changes of the other
groups
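The DDD arithmetic can be illustrated with simulated data. The effect sizes below (tutoring effect 5, citywide boost 2) are assumptions for illustration; the triple difference recovers the tutoring effect while the juniors-only DD also picks up the boost:

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated DDD: city (0 = Pittsburgh, 1 = Philadelphia), grade (0 = sophomore,
# 1 = junior), period (0 = pre, 1 = post). Philadelphia juniors get tutoring
# (assumed effect = 5); Philadelphia as a whole also gets a citywide boost
# (= 2) that hits BOTH grades, contaminating a juniors-only DD.
n = 400
city = np.repeat(rng.integers(0, 2, n), 2)
junior = np.repeat(rng.integers(0, 2, n), 2)
post = np.tile([0, 1], n)
score = (70.0 + 1.0 * post + 2.0 * city * post
         + 5.0 * city * junior * post
         + rng.normal(0, 3, 2 * n))

def change(c, j):
    """Average pre-to-post score change for one city x grade cell."""
    cell = (city == c) & (junior == j)
    return score[cell & (post == 1)].mean() - score[cell & (post == 0)].mean()

dd_juniors = change(1, 1) - change(0, 1)   # tutoring effect + citywide boost
dd_sophs = change(1, 0) - change(0, 0)     # citywide boost only
ddd = dd_juniors - dd_sophs                # nets out the boost
```

Here dd_juniors overstates the program effect (it also absorbs the citywide boost), while ddd lands near the assumed tutoring effect of 5.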
Assignment

Please read W Ch. 13-14

Problem Set #6
due Weds. 2/28, in class.
Econometrics
ECON 550

February 28, 2018

Instrumental Variables
W Chapter 15

Drexel University, LeBow College of Business


Outline for Today
Administrative Business:
• Homework Comments
• Problem Set #7 Due 3/6

1) Instrumental Variables
• Instrument Relevance, Exogeneity, and Monotonicity
• IV Estimation
• 2SLS Estimation
• Testing
• Endogeneity
• Overidentifying Restrictions
• Weak Instruments

2) Return Rough Drafts


Why Instrumental Variables?
• We have discussed many scenarios in which different
sources of bias will prevent us from identifying causal
effects of X on Y through violation of the OLS zero
conditional mean independence assumption (assumption
that error term is uncorrelated with the X’s of interest):

1) Omitted Variables
2) Functional Form Misspecification
3) Measurement Error (Errors-in-Variables)
4) Simultaneity (Simultaneous Causality)
5) Sample Selection
Why Instrumental Variables? (cont.)
• When more direct solutions are not available (e.g. explicit
controls or fixed effects), instrumental variables (IV)
regression offers a possible method for mitigating bias
due to omitted variables, simultaneity, or measurement
error.

• IV thereby allows us to estimate causal effects.


Intuition for Instrumental Variables
• The problem we wish to avoid is having our regressor(s)
of interest, X, correlated with the error term, u.

• When Cov(X, u) ≠ 0, one can think of separating the
variation in X into two parts: variation that is correlated
with the error term (endogenous) and variation that is
uncorrelated with the error term (exogenous).
Intuition for Instrumental Variables (cont.)
• In other words, the effect of X on Y can be decomposed
into a causal and a non-causal component.

IV allows us to decompose this variation in X and


measure only the effect of the variation in X that is
uncorrelated with the error term, i.e. the causal effect.
E.g. Concealed Gun Laws and Crime
• Even after accounting for various sources of bias through the
inclusion of fixed effects, one might still worry that our
estimates of the effect of “shall issue” laws on violent crime will
be biased due to the fact that states can choose if and when to
implement these laws (simultaneity).
• Variation in shall therefore consists of two parts:
1) Endogenous variation in the timing and geographic distribution of
shall issue laws due to state’s intentional responses to crime rates.
2) Exogenous variation due to factors having nothing to do with crime
rates.
• We would like to be able to discard the endogenous variation in
shall and extract only the variation that is truly exogenous to
crime (e.g. as if these laws were randomly assigned) to
measure the causal effect of the laws on crime.
Implementing Instrumental Variables
• In order to implement this desired decomposition of the
variation in X, we need at least one additional variable, Z,
which helps to explain the exogenous variation in X
without having any direct effect on Y.
• This additional variable, Z, then serves as an instrument
for our X of interest.
• E.g. Concealed Gun Laws and Crime
• In our example, a potential instrument for shall issue laws would be
a variable which helps to predict where and when these laws are
implemented without having any direct relationship to crime rates
(i.e. where the only relationship is through the implementation of
shall issue laws).
Requirements for a Valid Instrument
(1) Instrument relevance:
• The instrument, Z, successfully explains variation in the
endogenous regressor, X.
• i.e. , 0
(2) Instrument exogeneity:
• The instrument, Z, is uncorrelated with the error term
from the regression relating X and Y (i.e. Z does not
directly influence Y, except through X).
• i.e. cov(Z, u) = 0
Instrumental Variables Estimation
• Consider the following basic regression:
Y = β₀ + β₁X + u
• If cov(X, u) ≠ 0, β̂₁ will be biased and inconsistent and
OLS will be uninformative, or worse, misleading.
Instrumental Variables Estimation (cont.)
• Now, suppose there exists a valid instrument, Z, for our
endogenous regressor such that cov(Z, X) ≠ 0 and cov(Z, u) = 0.
 How can we test for instrument relevance?
(Regress X on Z and test whether the coefficient on Z is significant.)
 How can we test for instrument exogeneity?
(It cannot be tested directly in the just-identified case; with extra
instruments, the overidentification tests below provide partial evidence.)
Instrumental Variables Estimation (cont.)
• Given that Y = β₀ + β₁X + u,
cov(Z, Y) = β₁cov(Z, X) + cov(Z, u)
• Hence, provided that cov(Z, X) ≠ 0 and cov(Z, u) = 0,
β₁ = cov(Z, Y) / cov(Z, X)
β̂₁,IV = Σᵢ(Zᵢ − Z̄)(Yᵢ − Ȳ) / Σᵢ(Zᵢ − Z̄)(Xᵢ − X̄)
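A quick simulation, here in Python for illustration (all variable names and coefficient values are invented), shows the sample-analog estimator at work: OLS is biased when X is correlated with u, while the ratio of covariances recovers the true coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)                 # instrument: relevant and exogenous
u = rng.normal(size=n)                 # structural error
x = 0.8 * z + u + rng.normal(size=n)   # endogenous: cov(x, u) = 1 != 0
y = 1.0 + 2.0 * x + u                  # true beta_1 = 2

# OLS slope converges to 2 + cov(x, u)/var(x) = 2 + 1/2.64, not 2
beta_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
# IV slope: sample analog of cov(Z, Y)/cov(Z, X)
beta_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
```

With this design the OLS estimate sits near 2.38 while the IV estimate sits near the true value of 2.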
Properties of β̂₁,IV - Consistency
β̂₁,IV = Σᵢ(Zᵢ − Z̄)(Yᵢ − Ȳ) / Σᵢ(Zᵢ − Z̄)(Xᵢ − X̄)
→ᵖ cov(Z, Y) / cov(Z, X)
= [β₁cov(Z, X) + cov(Z, u)] / cov(Z, X) = β₁
Properties of β̂₁,IV - Unbiasedness
• β̂₁,IV is a ratio of sample covariances, so E[β̂₁,IV] ≠ β₁ in general.
 In finite samples, β̂₁,IV remains generally biased (hence the
importance of large samples and the consistency result).
Properties of β̂₁,IV - Efficiency
• Assuming homoskedasticity,
var(β̂₁,IV) = σᵤ² / (n · σₓ² · ρ²(X,Z))
where ρ²(X,Z) is the R² from the regression of
X on Z (including constant).
 Unless ρ(X,Z) = 1 (i.e. Z = X), var(β̂₁,IV) > var(β̂₁,OLS).
(Note that this comparison is only sensible if cov(X, u) = 0.)
Weak Instruments
 Weak correlation between X and the instrument, Z,
implies a small ρ(X,Z), and hence, large standard errors.
 Worse, even a very modest failure of instrument
exogeneity (i.e. cov(Z, u) = 0 does not hold precisely)
can lead to severe asymptotic bias and inconsistency if
corr(Z, X) is weak:
plim β̂₁,IV = β₁ + [corr(Z, u) / corr(Z, X)] · (σᵤ / σₓ)
Weak Instruments (cont.)
• Asymptotic bias for the IV estimator will be more severe
than for the OLS estimator if:
|corr(Z, u) / corr(Z, X)| > |corr(X, u)|
(since plim β̂₁,OLS = β₁ + corr(X, u) · (σᵤ / σₓ))
 Successful application of IV methods depends
critically on having a valid (and strong) instrument
that satisfies both instrument relevance and
instrument exogeneity.
E.g. Instrument Validity
• Suppose that we want to estimate
ln(wage) = β₀ + β₁educ + u
For the many reasons discussed before, educ is likely
endogenous to wages through unobserved ability, etc.
 Which of the following is likely to serve as a valid
instrument for educ?
 Father's educational attainment?
 Number of siblings?
 College proximity?
 Quarter of birth?
 Social security numbers?
Two-Stage Least Squares (2SLS)
• Thus far, it is not altogether transparent how the
introduction of Z enables the decomposition of X into
endogenous and exogenous components to estimate β₁.
 Two-stage least squares estimation (2SLS) makes this
explicit.
2SLS (cont.)
• Recalling our expression for testing instrument relevance,
X = π₀ + π₁Z + v
 Estimating this last relationship, we can decompose
variation in X into exogenous and endogenous parts:
1) X̂ = π̂₀ + π̂₁Z (exogenous)
2) v̂ = X − X̂ (endogenous)
2SLS Estimation (cont.)
• 2SLS regression thus proceeds in two stages:
1) In the first stage, we regress the endogenous regressor, X, on
the instrument(s) and obtain predicted values of the component
of X which is uncorrelated with the error term u from the
regression of Y on X:
X = π₀ + π₁Z + v ⇒ X̂ = π̂₀ + π̂₁Z
2) In the second stage (i.e. the main or "structural" equation), we
regress Y on these predicted values:
Y = β₀ + β₁X̂ + u
 Provided Z is a valid instrument, β̂₁,2SLS will be a
consistent estimate of the true causal effect of X on Y.
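The two stages can be sketched by hand on simulated data (a Python illustration with invented numbers). Note that this reproduces only the point estimate; as discussed on the next slide, the naive second-stage standard errors are invalid.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.5 + 0.8 * z + u + rng.normal(size=n)   # endogenous regressor
y = 1.0 + 2.0 * x + u                        # true beta_1 = 2

# Stage 1: regress X on Z (with constant), keep fitted values X-hat
Z = np.column_stack([np.ones(n), z])
x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]

# Stage 2: regress Y on X-hat (point estimate only)
Xh = np.column_stack([np.ones(n), x_hat])
b0, b1 = np.linalg.lstsq(Xh, y, rcond=None)[0]
```

Here b1 lands near the true value of 2 despite the endogeneity of x.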
2SLS Estimation (cont.)
• Note that X̂ in the second-stage 2SLS regression is a
generated regressor, and is therefore measured with
some sampling error that depends on π̂.
• Performed separately as one-stage-at-a-time OLS
regressions, the second-stage standard errors will be
invalid in that they fail to account for variation in π̂.
• Computing valid standard errors therefore requires more
sophisticated adjustments, which ivregress 2sls will
perform automatically in Stata.
Multivariate IV vs. 2SLS Estimation
• In a multivariate regression model with a single
endogenous regressor, X₁, β̂₁,IV and β̂₁,2SLS will each still
consistently estimate the effect of X₁ on Y, provided that a
valid instrument exists, and IV and 2SLS are
synonymous.
• With multiple valid instruments, or exclusion
restrictions (i.e. variables that do not appear directly in
the 2nd stage equation and satisfy instrument exogeneity),
2SLS estimation is required.
Multivariate IV vs. 2SLS (cont.)
Proof that β̂₁,IV = β̂₁,2SLS:
X = π₀ + π₁Z + v (First Stage)
Y = β₀ + β₁X̂ + e (Second Stage)
β̂₁,2SLS = cov(X̂, Y) / var(X̂)
= cov(π̂₀ + π̂₁Z, Y) / var(π̂₀ + π̂₁Z)
= π̂₁cov(Z, Y) / π̂₁²var(Z)
= cov(Z, Y) / [π̂₁var(Z)]
= cov(Z, Y) / cov(Z, X) = β̂₁,IV
(since π̂₁ = cov(Z, X) / var(Z))
Structural (IV/2SLS), Reduced Form, and
First-Stage Equations
• The reduced form equation evaluates the effect of the
instrument directly on the outcome.
X = π₀ + π₁Z + v (First Stage)
Y = β₀ + β₁X + u (Second Stage Structural Equation)
Y = γ₀ + γ₁Z + e (Reduced Form Equation)
• Under the assumption that the exclusion restriction is
valid, the reduced form effect of the instrument on Y
must necessarily operate through X (only).
• Hence, γ₁ = β₁ · π₁
Structural (IV/2SLS), Reduced Form, and
First-Stage Equations (cont.)
• By implication,
β₁ = γ₁ / π₁ ≡ γ₁ · (1 / π₁)
 The causal effect of X on Y is equal to the reduced form
effect of the instrument scaled by the first stage
coefficient.
E.g. Returns to education and college proximity.
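The identity β₁ = γ₁/π₁ can be checked on simulated data (a Python sketch; numbers are illustrative): the ratio of the reduced-form slope to the first-stage slope matches the IV estimate.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.8 * z + u + rng.normal(size=n)   # first-stage slope pi_1 = 0.8
y = 1.0 + 2.0 * x + u                  # true beta_1 = 2

def slope(a, b):
    """OLS slope from regressing b on a (with a constant)."""
    return np.cov(a, b)[0, 1] / np.var(a, ddof=1)

pi1 = slope(z, x)         # first stage
gamma1 = slope(z, y)      # reduced form: converges to beta_1 * pi_1 = 1.6
beta_iv = gamma1 / pi1    # equals cov(z, y) / cov(z, x)
```
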
Local Average Treatment Effects (LATE)
• β̂₁,IV captures a local average treatment effect (LATE).
• To see this, note that you can think of the portion of the
variation in X that is explained by Z as capturing the subset of
the sample that is induced to "comply" with X, the "treatment."
E.g. Returns to education and college proximity:
• Z = distance to the nearest college
• X = college attendance
2SLS (IV) compares those who attended college (treated) due
to their proximity to a college to those who chose not to attend
(untreated) due to being far away.
LATE (cont.)
• Students who respond to college proximity are called
“compliers.”
• Those who would go to college regardless of how far they
live from a college are called “always takers.”
• Those who would never go to college, regardless of how
close they live to one, are called “never takers.”
 IV estimates are based on a comparison of outcomes
within a subset of the pool of potential college students
(the compliers).
LATE (cont.)
• β̂₁,IV is the local effect averaged across the subset of
compliers, hence the name, "local average treatment
effect."
• β̂₁,IV does not address the effect of college attendance on
always takers or what might happen if you forced never
takers to attend college.
• IV estimates are likely to be externally valid for those who
are similar to the compliers but may not apply more
generally.
Multicollinearity and 2SLS Estimation
• In a multivariate model, imperfect multicollinearity can be
even more serious for 2SLS estimation than OLS.
• This comes from the fact that
1) The second stage regressor, X̂₁, necessarily has less
variation than the original endogenous regressor, X₁.
2) The correlation between X̂₁ and the remaining
exogenous regressors (used in the first stage as well)
is generally higher than between X₁ and these
covariates.
2SLS w/ Multiple Endogenous Regressors
• With multiple endogenous regressors, the order condition
requires the existence of at least as many valid
instruments as endogenous regressors.
• Each endogenous regressor will require a separate first
stage regression, involving all instruments and exogenous
regressors.
Tests of Endogeneity
(i.e. Do we need IV?)
• Suppose that we wish to estimate
Y = β₀ + β₁X₁ + β₂X₂ + u
where we suspect cov(X₁, u) ≠ 0, and X₂ is an
exogenous control variable.
• Assuming that we have a valid instrument for X₁, Z, we
can test whether IV estimation is necessary by comparing
OLS and 2SLS estimates.
Tests of Endogeneity (cont.)
 Under the null hypothesis that X₁ is exogenous, β̂OLS and
β̂2SLS are both consistent,
whereas only β̂2SLS is consistent under the alternative.
 Moreover, assuming homoskedasticity,
V(β̂2SLS − β̂OLS) = V(β̂2SLS) − V(β̂OLS)
Durbin-Wu-Hausman Test:
H = (β̂2SLS − β̂OLS)′ [V(β̂2SLS) − V(β̂OLS)]⁻¹ (β̂2SLS − β̂OLS)
~ χ²(q)
Tests of Endogeneity (cont.)
Regression-Based Test:
 Under the null (X₁ is exogenous), the residual from the
first stage regression should have no statistically
significant effect if included as an extra regressor in the
OLS regression.
1) Estimate X₁ = π₀ + π₁X₂ + π₂Z + v ⇒ v̂
• v̂ captures variation in X₁ that is orthogonal to X₂
and Z and therefore potentially correlated with u
2) Estimate Y = β₀ + β₁X₁ + β₂X₂ + δv̂ + e
3) Test H₀: δ = 0
⇒ cov(X₁, u) ≠ 0 ⇔ cov(v, u) ≠ 0 ⇒ δ ≠ 0
Tests of Endogeneity (cont.)
• Rejection of δ = 0 implies that X₁ is endogenous (through
correlation between v and u).
Use IV!
• Note that the regression-based test of endogeneity
delivers identical point estimates in the second step
regression as 2SLS.
• This shows that instead of the usual IV or 2SLS routine,
you could instead include v̂ in a second stage regression
alongside X₁ to control explicitly for that part of X₁ that is
endogenous.
• This is known as the control function approach to IV
estimation.
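The three steps of the regression-based test can be sketched on simulated data (Python illustration; all coefficient values are invented). The fitted coefficient on v̂ is far from zero when X₁ is endogenous, and the coefficient on X₁ matches the 2SLS estimate.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
z = rng.normal(size=n)
x2 = rng.normal(size=n)                            # exogenous control
u = rng.normal(size=n)
x1 = 0.8 * z + 0.5 * x2 + u + rng.normal(size=n)   # endogenous regressor
y = 1.0 + 2.0 * x1 + 1.0 * x2 + u                  # true beta_1 = 2

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Step 1: first stage, save the residual v-hat
Z = np.column_stack([np.ones(n), x2, z])
v_hat = x1 - Z @ ols(Z, x1)

# Step 2: OLS of y on x1, x2, and v-hat (the control function)
X = np.column_stack([np.ones(n), x1, x2, v_hat])
b = ols(X, y)   # b[1] matches 2SLS; b[3] is delta, nonzero under endogeneity
```
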
Overidentification (OID) Tests
• An IV regression is said to be just identified if there are as
many instruments as endogenous regressors.
• If you have multiple candidate instruments, you can test
whether a subset of these are uncorrelated with the
structural error term in the true regression model (i.e. you
can test whether instrument exogeneity is satisfied for a
subset of instruments).
• Note: For any of these tests to be convincing, you must
assert that at least one of your instruments is valid.
• This is an important shortcoming that limits the usefulness
of these tests. Nevertheless, they are commonly used.
OID Tests (cont.)
• Suppose we have two candidate instruments, Z₁ and Z₂,
for X₁ in Y = β₀ + β₁X₁ + u.
• Intuitively, we can obtain 2SLS estimates of β₁ using
either instrument singly. Under the null that both
instruments are exogenous, both 2SLS estimators will be
consistent and approximately equal (with differences due
only to sampling error).
• We reject this null if β̂₁,Z₁ − β̂₁,Z₂ is statistically
significant, and conclude that one or both instruments are
invalid.
OID Tests (cont.)
• Note that rejection of the null for the Hausman OID test
gives no guidance as to which instrument is invalid.
• Moreover, the OID test might fail to reject if both
instruments are invalid but nevertheless yield similar
2SLS coefficient estimates.
OID Tests (cont.)
• Furthermore, the Hausman OID test might also falsely
reject due to heterogeneous treatment effects.
• In this case, instruments might isolate different sources of
variation in the endogenous X.
E.g. Returns to Education
• One instrument might explain variation in high school education
(e.g. quarter of birth) and another might be for college education
(e.g. distance to the nearest 4-year college).
• If the effects of high school and college education on the outcome
are different (i.e. different LATEs), the OID test could falsely reject
validity of the instruments.
OID Tests (cont.)
• Assuming homoskedasticity, an alternative OID test with q
overidentifying restrictions can be implemented as
follows:
1) Estimate the model by 2SLS using all instruments and
obtain the residuals, û
2) Regress û on all exogenous regressors and instruments
and compute the regression R²
3) Under the null that all exogenous regressors and
instruments are uncorrelated with u, nR² ~ χ²(q)
 If nR² is large, we reject this null and conclude that at
least one instrument is not exogenous.
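The three steps above, sketched in Python with two valid instruments (simulated, illustrative numbers): under the null the statistic is approximately a χ²(1) draw, so it typically falls well below the 5% critical value of 3.84.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000
z1, z2 = rng.normal(size=(2, n))       # two candidate instruments, both valid
u = rng.normal(size=n)
x = 0.7 * z1 + 0.7 * z2 + u + rng.normal(size=n)
y = 1.0 + 2.0 * x + u                  # true beta_1 = 2

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# 2SLS using both instruments
Z = np.column_stack([np.ones(n), z1, z2])
x_hat = Z @ ols(Z, x)
b = ols(np.column_stack([np.ones(n), x_hat]), y)

# OID steps: 2SLS residuals use the *actual* x, then regress on instruments
u_hat = y - b[0] - b[1] * x
fit = Z @ ols(Z, u_hat)
r2 = 1 - np.sum((u_hat - fit) ** 2) / np.sum((u_hat - u_hat.mean()) ** 2)
stat = n * r2      # ~ chi2(q) under the null, here q = 2 - 1 = 1
```
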
Weak Instruments Tests
• In the simplest case, testing whether an instrument or
collection of instruments for a single endogenous
regressor is "weak" can be accomplished as an F test of
the exclusion restrictions in the first stage.
• Staiger and Stock (1997) suggest as a rule of thumb
needing F ≥ 10 to reject instrument weakness.
• For situations involving multiple endogenous regressors
and adjusted (e.g. robust) errors, Kleibergen-Paap
statistics apply, with critical values drawn from Stock and
Yogo (2005).
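For a single endogenous regressor and one instrument, the first-stage F is just the squared t-statistic on Z; a Python sketch on simulated data (illustrative numbers):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
z = rng.normal(size=n)
x = 0.3 * z + rng.normal(size=n)       # moderately relevant instrument

# First stage: regress x on z (with constant) and get the R-squared
Z = np.column_stack([np.ones(n), z])
fit = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
r2 = 1 - np.sum((x - fit) ** 2) / np.sum((x - x.mean()) ** 2)

q = 1                                   # one excluded instrument
F = (r2 / q) / ((1 - r2) / (n - 2))     # first-stage F statistic
strong = F >= 10                        # Staiger-Stock rule of thumb
```

With this design the first-stage F comfortably clears the rule-of-thumb threshold of 10.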
Assignment
 Please read W Ch. 15
 Problem Set #7, due Weds. 3/6, in class.
Econometrics
ECON 550
March 7, 2018
Instrumental Variables,
Limited Dependent Variables
W Chapters 15-17
Professor Sebastien Bradley
Drexel University, LeBow College of Business
Outline for Today
Administrative Business:
• Homework #6 Comments
• Presentation Guidelines

1) IV Tests
• Overidentification
• Weak Instruments
2) Simultaneous Equations
3) Limited Dependent Variables
• LPM/Probit/Logit
• Tobit
• Maximum Likelihood Estimation
Presentation Guidelines
• Plan concise, 15-18 minute presentations
• Presentation should cover
• Research question, motivation, and background
• Research design – modeling, data, etc.
• Results (w/ careful interpretation)
• Discussion – validity of results; extensions, etc.
• Group members should participate equally
• Participation in ~5 minute Q+A following others’
presentations will also factor into grades
Overidentification (OID) Tests
• An IV regression is said to be just identified if there are as
many instruments as endogenous regressors.
• If you have multiple candidate instruments, you can test
whether a subset of these are uncorrelated with the
structural error term in the true regression model (i.e. you
can test whether instrument exogeneity is satisfied for a
subset of instruments).
• Note: For any of these tests to be convincing, you must
assert that at least one of your instruments is valid.
• This is an important shortcoming that limits the usefulness
of these tests. Nevertheless, they are commonly used.
OID Tests (cont.)
• Suppose we have two candidate instruments, Z₁ and Z₂,
for X₁ in Y = β₀ + β₁X₁ + u.
• Intuitively, we can obtain 2SLS estimates of β₁ using
either instrument singly. Under the null that both
instruments are exogenous, both 2SLS estimators will be
consistent and approximately equal (with differences due
only to sampling error).
• We reject this null if β̂₁,Z₁ − β̂₁,Z₂ is statistically
significant, and conclude that one or both instruments are
invalid.
OID Tests (cont.)
• Note that rejection of the null for the Hausman OID test
gives no guidance as to which instrument is invalid.
• Moreover, the OID test might fail to reject if both
instruments are invalid but nevertheless yield similar
2SLS coefficient estimates.
OID Tests (cont.)
• Furthermore, the Hausman OID test might also falsely
reject due to heterogeneous treatment effects.
• In this case, instruments might isolate different sources of
variation in the endogenous X.
E.g. Returns to Education
• One instrument might explain variation in high school education
(e.g. quarter of birth) and another might be for college education
(e.g. distance to the nearest 4-year college).
• If the effects of high school and college education on the outcome
are different (i.e. different LATEs), the OID test could falsely reject
validity of the instruments.
OID Tests (cont.)
• Assuming homoskedasticity, an alternative OID test with q
overidentifying restrictions can be implemented as
follows:
1) Estimate the model by 2SLS using all instruments and
obtain the residuals, û
2) Regress û on all exogenous regressors and instruments
and compute the regression R²
3) Under the null that all exogenous regressors and
instruments are uncorrelated with u, nR² ~ χ²(q)
 If nR² is large, we reject this null and conclude that at
least one instrument is not exogenous.
Weak Instruments Tests
• In the simplest case, testing whether an instrument or
collection of instruments for a single endogenous
regressor is "weak" can be accomplished as an F test of
the exclusion restrictions in the first stage.
• Staiger and Stock (1997) suggest as a rule of thumb
needing F ≥ 10 to reject instrument weakness.
• For situations involving multiple endogenous regressors
and adjusted (e.g. robust) errors, Kleibergen-Paap
statistics apply, with critical values drawn from Stock and
Yogo (2005).
Simultaneous Equations
• Simultaneity or reverse causality arises frequently in
economic applications (e.g. equilibrium market outcomes;
policy analyses, etc.) and leads to simultaneity bias.
• IV methods are well-suited for tackling simultaneity bias
by isolating the causal variation in X.
• Of course, this requires a valid instrument…
E.g. Air Travel
• Prices and quantities of airline tickets are jointly
determined as an equilibrium outcome based on the
intersection of supply and demand.
• We can consequently write a model consisting of two
structural equations characterizing airline and consumer
behavior as a simultaneous equations model (SEM):
Qˢ = α₀ + α₁P + α₂Xˢ + uˢ (Supply)
Qᵈ = β₀ + β₁P + β₂Xᵈ + uᵈ (Demand)
E.g. Air Travel (cont.)
• In equilibrium, clearly, Qˢ = Qᵈ = Q, and observed prices represent the
corresponding market-clearing prices.
• The expression for Qˢ is identified only if there exists at least one
regressor in Xᵈ which is exogenous to uˢ and can serve as an
observed demand shifter and pin down the position of the supply
curve.
• The expression for Qᵈ is identified only if there exists at least one
regressor in Xˢ which is exogenous to uᵈ and can serve as an
observed supply shifter and pin down the position of the demand
curve.
E.g. Air Travel (cont.)
• To apply 2SLS, we might equivalently write our SEM as:
P = γ₀ + γ₁Q + γ₂Xˢ + uˢ (Inverse Supply)
Q = β₀ + β₁P + β₂Xᵈ + uᵈ (Demand)
 An IV solution is feasible if we have an exogenous supply
shifter that can be used as an instrument in a first stage
regression to compute predicted values of P.
 E.g. jet fuel prices
Limited Dependent Variables
• So far, we have focused exclusively on continuous
dependent variables (or at least treated them as such).
• However, many outcomes of interest necessarily involve
values of Y that are limited in a certain way.
• These outcomes are referred to as limited dependent
variables and often warrant special treatment.

E.g.
1) Binary outcomes
2) Corner solutions/censoring
3) Counts
(1) Examples of Binary Outcomes
• Smoking: How do cigarette taxes affect whether or not
an individual smokes at a particular point in time?
• ER visits: How do medical co-payments affect whether or
not an individual uses the emergency room (over the
course of a year)?
• Poverty: How does an individual’s poverty status as a
child affect whether or not they live in poverty as adults?
• Sovereign default: How does the use of a pegged/fixed
exchange rate affect whether or not a country defaults on
its debt in a period of economic turmoil?
(2) Examples of Corner Solutions
• Smoking: How do cigarette taxes affect how many
cigarettes an individual smokes?
• Dividend payouts: How does the fraction of executive
compensation coming from stock options impact dividend
payments to shareholders?
• Capital expenditures: How do bonus depreciation rules
affect business expenditures on new industrial
machinery?
(3) Examples of Count Outcomes
• Number of children: How do government expenditures
on pre-K “schooling” affect the number of children per
household?
• Number of ER visits: How do medical co-payments affect
the number of ER visits made over the course of a year?
• Number of exported products: How do trade barriers
(e.g. tariff rates) affect the number of products produced
for export?
Why the Special Treatment?
• Binary explanatory variables posed no special problems
for estimation by OLS, so why the special treatment for
binary dependent variables?
• OLS estimation will yield continuous fitted values of Y that
almost all differ from 0 or 1.
 What does it mean to estimate predicted values below 0
or greater than 1 when these are the only two true
outcomes possible?
E.g. Property Taxes and Delinquency
• Hypothesis: if homebuyers are confused or ignorant of
the property tax implications of their home purchases,
then those buyers hit hardest by the reality of their true
property tax obligations might be expected to struggle the
most to make payments in a timely manner.
• Consider a basic regression of whether or not a property
tax payment is late on changes in annual taxes owed:
Late = β₀ + β₁ΔTaxes + u
where Late = 1 if a payment is received after the due
date and Late = 0 otherwise.
[Figure: Property Taxes and Payment Delinquency - scatter of Late
Payment (1 = Yes, 0 = No) against Change in Annual Property Taxes,
with the OLS regression line overlaid.]
The Linear Probability Model (LPM)
 What is the interpretation of the coefficient on ΔTaxes?
• Recall that the population regression line describes the
expected value of Y conditional on all of the Xs:
E[Y|X₁, …, Xₖ] = β₀ + β₁X₁ + ⋯ + βₖXₖ
• With Y being binary,
E[Y|X] = 0 · Pr(Y = 0|X) + 1 · Pr(Y = 1|X)
= Pr(Y = 1|X)
The LPM (cont.)
• E[Y|X] = Pr(Y = 1|X) implies that β₁ has the interpretation of
measuring the partial effect of X₁ on the probability of the
dependent variable being equal to 1.
• Linear regression models in which the dependent variable
is binary are known as linear probability models (LPM).
Weaknesses of the LPM
 What does it mean to estimate predicted values in
excess of 1 or less than 0?
• By assuming a constant linear relationship between Y and
ΔTaxes, the LPM counterfactually predicts that the probability of
making late property tax payments falls below 0 for large
reductions in tax obligations and exceeds 1 for very large
increases in tax obligations.
• Probabilities can never be less than 0 or greater than 1!!!
Weaknesses of the LPM (cont.)
• The LPM additionally imposes that the partial effect of X₁
on Pr(Y = 1|X) be constant, regardless of the value of X₁.
• The effect of an additional $1000 increase in annual
property tax obligations is the same whether starting from
$0 (i.e. from $0 to $1000) or $10000 (i.e. the difference
between $10000 and $11000).
• Presumably, if ΔTaxes has an effect on tax delinquency,
this effect should be diminishing beyond some point.
• The errors in an LPM are heteroskedastic, so you
should always use White (heteroskedasticity-robust)
standard errors.
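A simulation in the spirit of the property-tax example (a Python sketch; all numbers invented) makes the boundary problem concrete: an OLS fit to a binary outcome produces fitted "probabilities" outside [0, 1].

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10_000
dtax = rng.normal(0, 10, size=n)                   # stand-in for the tax change
p_true = 1 / (1 + np.exp(-(-1.4 + 0.4 * dtax)))    # true nonlinear probability
late = (rng.uniform(size=n) < p_true).astype(float)

# Fit the LPM by OLS and inspect the fitted values
X = np.column_stack([np.ones(n), dtax])
fitted = X @ np.linalg.lstsq(X, late, rcond=None)[0]
n_outside = int(np.sum((fitted < 0) | (fitted > 1)))   # strays out of bounds
```
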
Advantages of the LPM
• The marginal effects are easy to calculate, particularly in
the case of interaction terms, which can be very tricky in
the case of other binary choice models.
• If you are interested in marginal effects, rather than
predicted probabilities, then the shortcomings of the LPM
with regard to predicted probabilities may not be very
important.
Non-Linear Binary Response Models
• The probit and logit regression models provide different
non-linear methods for constraining predicted values of Y
to lie between 0 and 1.
• These models both rely on cumulative distribution
functions to generate probabilities over the interval 0 to 1:
Pr(Y = 1|X) = G(β₀ + β₁X₁ + ⋯ + βₖXₖ)
where 0 < G(z) < 1 ∀z.
N-L Binary Response Models (cont.)
• Consider the following latent variable model determining
the observed values of Y, where Y* is unobserved:
Y* = β₀ + β₁X₁ + ⋯ + βₖXₖ + e,
Y = 1[Y* > 0]
(i.e. Y = 1 whenever Y* > 0 and Y = 0 otherwise.)
⇒ Pr(Y = 1|X) = Pr(Y* > 0|X)
= Pr(e > −(β₀ + β₁X₁ + ⋯ + βₖXₖ)|X)
N-L Binary Response Models (cont.)
• For any real number z, provided that e is
symmetrically distributed about 0 and independent of X,
Pr(e > −z) = Pr(e < z) = 1 − Pr(e > z)
⇒ Pr(Y = 1|X) = Pr(e < β₀ + β₁X₁ + ⋯ + βₖXₖ|X)
⇒ Pr(Y = 1|X) = G(β₀ + β₁X₁ + ⋯ + βₖXₖ)
• The probit and logit models differ only in their
assumptions about G(·) (i.e. the c.d.f. characterizing the
distribution of e).
The Probit Model
• The probit model assumes a standard normal c.d.f., Φ(z),
for G(·):
Pr(Y = 1|X) = Φ(β₀ + β₁X₁ + ⋯ + βₖXₖ)
Φ(z) = ∫₋∞ᶻ φ(v)dv, where φ(v) = (2π)⁻¹ᐟ² · exp(−v²/2)
• β₁ therefore captures the effect of a one unit change in X₁
on the z-value from the standard normal c.d.f., not
Pr(Y = 1|X) directly.
• The magnitude of this effect on Pr(Y = 1|X) depends on the
value of X.
Interpreting Probit Coefficients
E.g. Property Taxes and Delinquency
Pr(Late = 1|ΔTaxes) = Φ(β₀ + β₁ΔTaxes)
= Φ(−1.42 + 0.071ΔTaxes)
• If ΔTaxes = 1 ($1000), this implies that
Φ(−1.42 + 0.071) = Φ(−1.349) = 0.089.
• In contrast, if ΔTaxes = 0, Φ(−1.42) = 0.078.
• As such, a one unit ($1000) change in ΔTaxes (starting
from zero) increases the probability of making a late
payment by 1.1 percentage points (0.089 − 0.078).
Interpreting Probit Coefficients (cont.)
• For the same one unit difference between ΔTaxes = 3
and ΔTaxes = 4,
Φ(−1.42 + 4 · 0.071) − Φ(−1.42 + 3 · 0.071) = 0.014.
• The probability of making a late tax payment then rises by
1.4 percentage points.
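The slide's arithmetic can be verified directly with the standard normal c.d.f. (the coefficients −1.42 and 0.071 are the ones from the example above):

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

b0, b1 = -1.42, 0.071    # probit estimates from the slide's example

p_base = norm_cdf(b0)                                          # ~ 0.078
effect_0_to_1 = norm_cdf(b0 + b1) - norm_cdf(b0)               # ~ 0.011
effect_3_to_4 = norm_cdf(b0 + 4 * b1) - norm_cdf(b0 + 3 * b1)  # ~ 0.014
```

The same one-unit change produces different probability changes depending on the starting value, which is exactly the nonlinearity the slide describes.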
Interpreting Probit Coefficients (cont.)
• Mathematically, we have used the fact that
ΔPr(Y = 1|X) = Φ(β₀ + β₁(X₁ + ΔX₁) + ⋯ + βₖXₖ)
− Φ(β₀ + β₁X₁ + ⋯ + βₖXₖ)
• Equivalently, Pr(Y = 1|X) = G(Xβ) implies that
∂Pr(Y = 1|X)/∂X₁ = g(Xβ) · β₁, where g(z) = dG(z)/dz
 (Note, however, that computation of standard errors of
partial effects is complicated.)
The Logit Model
• The logit model instead assumes a logistic c.d.f. for G(·):
Pr(Y = 1|X) = Λ(β₀ + β₁X₁ + ⋯ + βₖXₖ),
Λ(z) = exp(z) / [1 + exp(z)]
• β₁ therefore captures the effect on z (the argument of the
logistic c.d.f.) of a one unit change in X₁.
Calculation of Marginal Effects
For continuous variables,
∂Pr(Y = 1|X)/∂Xⱼ = g(Xβ) · βⱼ, where g(z) = dG(z)/dz
If Xβ = β₀ + β₁X₁ + β₂X₁² + ⋯,
∂Pr(Y = 1|X)/∂X₁ = g(Xβ) · (β₁ + 2β₂X₁)
Calculation of Marginal Effects (cont.)
For discrete variables, if Xₖ ∈ X is a dummy variable,
ΔPr(Y = 1|X) = G(β₀ + ⋯ + βₖ · 1) − G(β₀ + ⋯ + βₖ · 0)
For either discrete or continuous regressors,
ΔPr(Y = 1|X) = Pr(Y = 1|X = x¹) − Pr(Y = 1|X = x⁰)
Calculation of Marginal Effects (cont.)
• Xβ̂ and g(Xβ̂) will differ across observations because each
observation has its own values of X.
• Since the calculation of the marginal effect depends on these
values, they too will differ across observations.
Two approaches to reporting marginal effects:
1) Average marginal effect (better): calculate the marginal
effect for each observation and then find the average.
2) Marginal effect at the average (worse): use the average
value of each regressor and plug these into the marginal
effect formula.
• WARNING: You must use factor variables in order for Stata to
know if you have a variable that is a function of another
variable (e.g. X₁ and X₁²).
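For a logit index (a Python sketch with hypothetical coefficients), the two reporting approaches can be computed directly; they generally differ because g(·) is nonlinear:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(1.0, 2.0, size=5_000)   # a single continuous regressor
b0, b1 = -0.5, 0.8                     # hypothetical logit coefficients

# Average marginal effect: evaluate g(z) = L(z)(1 - L(z)) per observation
lam = 1 / (1 + np.exp(-(b0 + b1 * x)))
ame = np.mean(lam * (1 - lam) * b1)

# Marginal effect at the average: plug in the mean of x first
lam_bar = 1 / (1 + np.exp(-(b0 + b1 * x.mean())))
mea = lam_bar * (1 - lam_bar) * b1
```

Because the mean of x sits near the steepest part of the logistic curve here, the effect at the average overstates the average effect.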
Probit vs. Logit
• The logistic distribution has slightly thicker tails than the
standard normal.
• Historically, logit estimation was computationally faster
than probit estimation, but this advantage has largely
disappeared.
• The general appeal of normality assumptions has led
economists to favor the probit.
• Despite their different distributional assumptions, logit and
probit regressions generally yield very similar results.
Maximum Likelihood Estimation
• Given the non-linearity of Pr(Y = 1|X) = G(Xβ) under
either the probit or logit models, alternative numerical
estimation techniques are required to obtain β̂.
 Maximum likelihood estimation (MLE) yields an estimator,
β̂MLE, which maximizes the likelihood of obtaining the
observed values of Y (conditional on X).
 Intuitively, MLE selects the parameter value(s) that seem
most likely to produce the data that we observe.
MLE (cont.)
• Assuming that we have a random sample of (Xᵢ, Yᵢ) pairs,
with each Yᵢ drawn from the (conditional) population
distribution f(Y; β, X), we can write the joint distribution as
the product of densities (i.e. the likelihood function):
L(β; Y, X) = f(Y₁; β, X₁) · f(Y₂; β, X₂) ⋯ f(Yₙ; β, Xₙ)
• For convenience, the corresponding log likelihood
function can be written as
log L(β; Y, X) = Σᵢ log f(Yᵢ; β, Xᵢ)
⇒ β̂MLE = argmax_β Σᵢ log f(Yᵢ; β, Xᵢ)
Probit/Logit MLE
• For binary response models, where Y ∈ {0, 1}, the density of
Y|X is given by
f(Y; β, X) = [G(Xβ)]^Y · [1 − G(Xβ)]^(1−Y)
⇒ log f(Y; β, X) = Y · log G(Xβ) + (1 − Y) · log[1 − G(Xβ)]
 Finding the β̂ that maximizes Σᵢ log f(Yᵢ; β, Xᵢ) typically
requires numerical optimization methods.
 Other times (rarely), it is possible to obtain a closed form
solution by solving ∂log L/∂β = 0 directly.
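A minimal sketch of numerical MLE for the logit model, using Newton-Raphson on the log likelihood (Python illustration with simulated data and known coefficients; Stata does this internally):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 20_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
b_true = np.array([-0.5, 1.0])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ b_true))).astype(float)

b = np.zeros(2)
for _ in range(25):                              # Newton-Raphson iterations
    lam = 1 / (1 + np.exp(-X @ b))               # fitted probabilities L(Xb)
    score = X.T @ (y - lam)                      # gradient of log likelihood
    hess = -(X * (lam * (1 - lam))[:, None]).T @ X  # Hessian
    b = b - np.linalg.solve(hess, score)
```

The iterations converge quickly because the logit log likelihood is globally concave, and the estimates land near the true coefficients.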
Model Fit
• Note that the standard measure of R² is inappropriate for
evaluating the goodness of fit of probit or logit models.
• Instead, the fraction correctly predicted represents a
better metric and is generally computed as the share of
observations for which
(Yᵢ = 1 and P̂r(Yᵢ = 1) ≥ 0.5) or (Yᵢ = 0 and P̂r(Yᵢ = 1) < 0.5)
 If Y = 1 is rare in the data, a threshold of 0.5 may yield
virtually no correct predicted "successes" and we might
select Ȳ as an alternate threshold.
Model Fit (cont.)
• The pseudo-R² offers an alternative measure of model fit
that is intended to be more analogous to the standard
measure of R² after OLS estimation.
• Concretely,
pseudo-R² = 1 − L_ur / L₀
where L_ur is the log-likelihood function estimated for the
unrestricted model, and L₀ is the log-likelihood for an
intercept-only model.
Model Fit (cont.)
• Given that pseudo-R² = 1 − L_ur / L₀,
 If the included regressors have no explanatory power,
L_ur = L₀ and pseudo-R² = 0.
 Usually, |L_ur| < |L₀| ⇒ pseudo-R² > 0 (i.e. since MLE
involves maximizing the log likelihood, adding
regressors cannot reduce the value of the estimated log
likelihood function).
Multiple Hypothesis Testing
• For the same reason that the standard measure of R² is
inappropriate under probit or logit estimation, F-tests are
likewise invalid and testing involves values of the
maximized likelihood function.
• The likelihood ratio (LR) test is computed as
LR = 2(L_ur − L_r) ~ χ²(q), LR ≥ 0
where L_ur is the log-likelihood function estimated for the
unrestricted model, and L_r is the log-likelihood for the
restricted model, wherein q restrictions under the null
hypothesis have been imposed.
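Both the LR statistic and the pseudo-R² use the same maximized log likelihoods; a Python sketch comparing an unrestricted logit with an intercept-only model (simulated data in which the excluded regressor truly matters, so LR should far exceed the χ²(1) 5% critical value of 3.84):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 5_000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(0.2 + 0.8 * x)))
y = (rng.uniform(size=n) < p).astype(float)

def logit_loglik(X, y, iters=30):
    """Fit a logit by Newton-Raphson; return the maximized log likelihood."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        lam = 1 / (1 + np.exp(-X @ b))
        hess = -(X * (lam * (1 - lam))[:, None]).T @ X
        b = b - np.linalg.solve(hess, X.T @ (y - lam))
    lam = 1 / (1 + np.exp(-X @ b))
    return np.sum(y * np.log(lam) + (1 - y) * np.log(1 - lam))

ll_ur = logit_loglik(np.column_stack([np.ones(n), x]), y)  # unrestricted
ll_0 = logit_loglik(np.ones((n, 1)), y)                    # intercept only
LR = 2 * (ll_ur - ll_0)              # ~ chi2(1) under H0: beta_x = 0
pseudo_r2 = 1 - ll_ur / ll_0         # McFadden pseudo-R-squared
```
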
Corner Solutions
• For corner solution models, we consider a modified latent
variable model determining the observed values of Y,
where Y* is unobserved:
Y* = β₀ + Xβ + u,
Y = max(Y*, 0)
(E.g. firms might wish to pay negative dividends, if it were
possible, such that Y* < 0, but we observe Y = 0.)
Corner Solutions – Tobit Model
• Under the Tobit model, u|X ~ N(0, σ²), such that Y* is
normally distributed.
• Consequently, Y = max(Y*, 0) also has a continuous,
normal distribution over strictly positive values of Y*, and
the density of Y|X is identical to the density of Y*|X over
Y > 0.
• Moreover,
Pr(Y = 0|X) = Pr(Y* < 0|X) = Pr(u < −Xβ|X)
= Φ(−Xβ/σ) = 1 − Φ(Xβ/σ)
Corner Solutions – Tobit Model (cont.)
• Taken together, we can write the log likelihood for
observation i as
ℓᵢ(β, σ) = 1[Yᵢ = 0] · log[1 − Φ(Xᵢβ/σ)]
+ 1[Yᵢ > 0] · log[(1/σ) · φ((Yᵢ − Xᵢβ)/σ)]
 Computation of β̂MLE follows the same principle as for
probit or logit models.
Interpreting Tobit Coefficients
• Depending on the nature of the problem, we may be
interested in two different types of partial effects/predicted
probabilities following estimation of a Tobit model:
1) ∂E[Y|X] / ∂Xⱼ
2) ∂E[Y|Y > 0, X] / ∂Xⱼ
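Why the Tobit machinery matters can be seen in a quick simulation (Python illustration with invented numbers): OLS on the censored outcome Y understates the latent slope, while regressing the latent Y*, which is observable only in a simulation, recovers it.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 50_000
x = rng.normal(size=n)
y_star = 1.0 * x + rng.normal(size=n)   # latent outcome, true slope 1
y = np.maximum(y_star, 0.0)             # corner solution: Y = max(Y*, 0)

X = np.column_stack([np.ones(n), x])
b_cens = np.linalg.lstsq(X, y, rcond=None)[0]         # attenuated slope
b_latent = np.linalg.lstsq(X, y_star, rcond=None)[0]  # recovers the truth
```

In this design the slope on the censored outcome is pulled roughly halfway toward zero, which is exactly the attenuation the Tobit likelihood corrects.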
Assignment
 Please read W Ch. 16-17
 PRESENTATIONS NEXT WEEK
 Final Drafts, due Friday 3/16 @ 5:00 p.m.