Introduction to Econometrics

Professor Dr. Horst Entorf


Winter Semester 2013/14
Goethe University Frankfurt

Chapter 6: Regression Analysis with
Qualitative Information
Dummy variables and the
dummy variable trap
Until now all variables had quantitative
information (e.g. wages, years of experience).
Now we will also look at variables which have
qualitative information (e.g. sex or race of an
individual, geographical location).
Dummy Variables
A dummy variable is a variable that takes on
the value 1 or 0
Examples:
female (= 1 if female, 0 otherwise),
south (= 1 if in the south, 0 otherwise)
Dummy variables are also called binary
variables, for obvious reasons

A Dummy Independent Variable
Consider a simple model with one continuous
variable (educ) and one dummy (female)
$wage = \beta_0 + \delta_0\, female + \beta_1\, educ + u$
This can be interpreted as an intercept shift
If female = 0 (male), then
$wage = \beta_0 + \beta_1\, educ + u$
If female = 1 (female), then
$wage = (\beta_0 + \delta_0) + \beta_1\, educ + u$
The case of female = 0 is the base group
A Dummy Independent Variable
The dummy variable trap
Consider the model from the previous slide
$wage = \beta_0 + \delta_0\, female + \beta_1\, educ + u$
It is tempting to estimate the following model
$wage = \beta_0 + \delta_0\, female + \delta_1\, male + \beta_1\, educ + u$
The only difference is that we also include a
dummy variable for male in the regression
If we try to estimate this model we will run
into what is called the dummy variable trap
The dummy variable trap (cont.)
Why can't we estimate the model
$wage = \beta_0 + \delta_0\, female + \delta_1\, male + \beta_1\, educ + u$ ?
Remember the third Gauss-Markov
Assumption:
No exact linear relationships
among the independent variables
But the sum of female and male
reproduces the intercept, since it takes the
value 1 for all individuals in our regression,
so the regressors are perfectly collinear
The dummy variable trap (cont.)
So if we include qualitative information, we
cannot include dummies for all categories
The category we leave out is called the base
group (or base category)
The effect on the base group is then
hidden in the intercept of the regression
The coefficient on the dummy variable
gives the effect relative to the base group
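The dummy variable trap can be checked numerically. The following is a minimal sketch on made-up data (the six observations and the helper `matrix_rank` are assumptions for illustration, not part of the lecture): with dummies for both female and male plus an intercept, the regressor matrix has fewer independent columns than coefficients, so OLS has no unique solution.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Look for a non-zero pivot in this column, at or below row `rank`.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # column is a linear combination of earlier columns
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical data for six individuals.
female = [1, 0, 1, 0, 1, 0]
educ = [12, 16, 10, 14, 12, 18]

# Columns: intercept, female, male, educ (the "trap" specification).
X_trap = [[1, f, 1 - f, e] for f, e in zip(female, educ)]
# Columns: intercept, female, educ (male is the base group).
X_ok = [[1, f, e] for f, e in zip(female, educ)]

rank_trap = matrix_rank(X_trap)
rank_ok = matrix_rank(X_ok)
print("trap:", rank_trap, "independent columns out of", len(X_trap[0]))
print("ok:  ", rank_ok, "independent columns out of", len(X_ok[0]))
```

Dropping one dummy (here male) restores full column rank, which is why one category must serve as the base group.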
Example: wage equation
$\widehat{wage} = \underset{(0.72)}{-1.57} - \underset{(0.26)}{1.81}\, female + \underset{(0.049)}{0.572}\, educ + \underset{(0.012)}{0.025}\, exper + \underset{(0.021)}{0.141}\, tenure$
$n = 526, \quad R^2 = 0.364$
The coefficient on female tells us that women earn \$1.81
less per hour relative to men (base group) with the same
levels of education, experience and tenure.
Dummy variables when the
dependent variable is in logs
In a lot of specifications the dependent
variable appears as log(y) (e.g. in the wage
equation in chapter 4). How do we interpret
dummy variables in this context?
Digression: percentage change
It is well known that the percentage change ($pc$) of a variable
$y$ changing from $y_0$ to $y_1$ can be calculated the following way:
$pc = \dfrac{y_1 - y_0}{y_0} \cdot 100 = \left(\dfrac{y_1}{y_0} - 1\right) \cdot 100$
Example: If a t-shirt cost \$20 last year and costs \$21 this
year, then the percentage change of the price is
$\dfrac{21 - 20}{20} \cdot 100 = 0.05 \cdot 100 = 5\%$
Digression: percentage change
An approximation to this exact formula is given by the
difference in logs of the respective values:
$pc \approx \left(\log(y_1) - \log(y_0)\right) \cdot 100$
This approximation works well for small changes.
Using the numbers from the previous slide, the price of the t-shirt,
using the formula above, has risen by approximately
$\left(\log(21) - \log(20)\right) \cdot 100 = 4.879\% \approx 5\%$
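To see how the quality of the log approximation depends on the size of the change, here is a small sketch comparing both formulas. The function names are illustrative, and the second price pair (20 to 30) is a made-up example of a large change.

```python
import math


def pct_change_exact(y0, y1):
    """Exact percentage change from y0 to y1."""
    return 100.0 * (y1 - y0) / y0


def pct_change_log(y0, y1):
    """Approximate percentage change: 100 times the difference in logs."""
    return 100.0 * (math.log(y1) - math.log(y0))


# T-shirt example from the slides: price rises from 20 to 21.
exact_small = pct_change_exact(20.0, 21.0)
approx_small = pct_change_log(20.0, 21.0)

# Hypothetical large change: price rises from 20 to 30.
exact_large = pct_change_exact(20.0, 30.0)
approx_large = pct_change_log(20.0, 30.0)

print(f"small change: exact {exact_small:.3f}%, log approx {approx_small:.3f}%")
print(f"large change: exact {exact_large:.3f}%, log approx {approx_large:.3f}%")
```

For the small change the two numbers are close; for the large change the gap between them widens noticeably.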
Example: log(wage) equation
$\widehat{\log(wage)} = \underset{(0.099)}{0.417} - \underset{(0.036)}{0.297}\, female + \underset{(0.007)}{0.08}\, educ + \underset{(0.0059)}{0.029}\, exper - \underset{(0.0001)}{0.0006}\, exper^2 + \underset{(0.007)}{0.032}\, tenure - \underset{(0.00023)}{0.0006}\, tenure^2$
$n = 526, \quad R^2 = 0.441$
The coefficient on female tells us that women earn about 29.7%
less than men with the same values of educ, exper and tenure,
since the difference in log(wage) of women relative to men is
-0.297:
$\log(wage_{female}) - \log(wage_{male}) = -0.297 \quad (\approx -29.7\%)$
Example: log(wage) equation
The exact percentage change $\left(\dfrac{wage_{female}}{wage_{male}} - 1\right)$ can be calculated as
follows, starting from the approximation:
$\log(wage_{female}) - \log(wage_{male}) = -0.297$
$\log\left(\dfrac{wage_{female}}{wage_{male}}\right) = -0.297$ (rewrite)
$\dfrac{wage_{female}}{wage_{male}} = \exp(-0.297)$ (apply exponential function)
$\dfrac{wage_{female}}{wage_{male}} - 1 = \exp(-0.297) - 1 = -0.257$ (subtract 1), i.e. $-25.7\%$
Example: log(wage) equation
The coefficient on a dummy variable when
the dependent variable is in logs has a
percentage interpretation
This percentage change is only an
approximation that works well for small
changes
The exact percentage change can be
calculated the way described on the
previous slide
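The conversion from a dummy coefficient in a log equation to an exact percentage effect can be written as a one-line function. This is a sketch; the name `exact_pct_effect` is an assumption chosen for illustration.

```python
import math


def exact_pct_effect(coef):
    """Exact percentage effect implied by a dummy coefficient in a log(y) equation."""
    return 100.0 * (math.exp(coef) - 1.0)


female_coef = -0.297                        # coefficient on female in the log(wage) equation
approx_pct = 100.0 * female_coef            # approximate reading of the coefficient
exact_pct = exact_pct_effect(female_coef)   # exact reading

print(f"approximate: {approx_pct:.1f}%, exact: {exact_pct:.1f}%")
```

The gap between the two readings grows with the absolute size of the coefficient, so the exact formula matters most for large estimated effects.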
Using Dummy Variables for
Multiple Categories
Up to now we were focusing on dummy
variables based on two categories (e.g. male vs.
female). Now we will turn to the problem of
dummy variables for multiple categories
Multiple Categories
Suppose we want to estimate the effect of credit
ratings (CR) on bond interest rates (BIR)
Several companies (Moody's, Standard & Poor's)
rate the quality of debt for governments, where
the rating depends on the probability of default
Suppose for simplicity that ratings range from 0
to 4, with 0 being the worst credit rating and 4
being the best credit rating
This is an example of an ordinal variable with
multiple categories
Multiple Categories
How can we incorporate the variable CR in our
model? One way is to estimate the following
$BIR = \beta_0 + \beta_1\, CR + \text{other factors}$
$\beta_1$ would give us the change in BIR if CR
increases by one unit (e.g. from 0 to 1, or from
2 to 3)
But is the effect on BIR when CR changes
from 0 to 1 the same as when CR changes from
2 to 3? Probably not!
Multiple Categories
Better alternative: define 4 dummy variables
$C_1$, $C_2$, $C_3$ and $C_4$, where $C_j$ equals 1 if CR = j
and 0 otherwise (for j = 1,...,4) and run the
regression
$BIR = \beta_0 + \delta_1 C_1 + \delta_2 C_2 + \delta_3 C_3 + \delta_4 C_4 + \text{other factors}$
The interpretation of, e.g., $\delta_2$ is the following:
how does BIR change if CR changes from 0
(the base group) to 2?
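Constructing the dummies from the ordinal rating is mechanical. Below is a minimal sketch; the sample ratings and the helper `make_dummies` are hypothetical and only illustrate the idea of leaving out the base group.

```python
def make_dummies(values, categories, base):
    """Return {category: list of 0/1} for every category except the base group."""
    return {
        c: [1 if v == c else 0 for v in values]
        for c in categories
        if c != base
    }


# Hypothetical credit ratings (0 = worst, 4 = best) for eight issuers.
credit_rating = [0, 3, 1, 4, 2, 2, 0, 4]

# CR = 0 is the base group, so only C1..C4 are created.
dummies = make_dummies(credit_rating, categories=range(5), base=0)
for j, column in dummies.items():
    print(f"C{j}:", column)
```

Each observation has at most one dummy equal to 1, and observations in the base group have all dummies equal to 0.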
Multiple Categories
Any categorical variable can be turned into
a set of dummy variables
Because the base group is represented by
the intercept, if there are n categories there
should be n - 1 dummy variables
If there are a lot of categories, it may make
sense to group some together
Example: top 10 ranking, 11-25, etc.
Example: wage equation
Remember the model of wage
determination from slide 15:
$\log(wage) = \beta_0 + \delta_1\, female + \text{other factors}$
We have seen that women earn less than
men after controlling for other factors like
experience, tenure and education ($\delta_1 < 0$)
Now we want to know whether marital
status also affects the wage


Example: wage equation
One way would be to estimate:
$\log(wage) = \beta_0 + \delta_1\, female + \delta_2\, married + \text{other factors}$,
where married is 1 if the person is married and
0 otherwise
Drawback: Estimating this regression we
implicitly assume that the effect of married is
the same for men and women!
Solution: multiple categories
Example: wage equation
In this case we get 4 different categories:
married men, married women, single men and
single women
Denote these by marrmale, marrfem, singmale
and singfem
As the base group we choose single men, so
singmale will not be included in the regression
Example: wage equation
The model that allows the effect of marital
status on wages to vary between men and
women is:
$\log(wage) = \beta_0 + \delta_1\, marrmale + \delta_2\, marrfem + \delta_3\, singfem + \text{other factors}$
Interpretation: $\delta_1$ measures the difference in
wages between married men and single men
(base group)
$\delta_2$ measures the difference in wages between
married women and single men
Example: wage equation
We can also calculate the difference in wages
between single women and married women
from the previous regression
This difference is given by $\delta_3 - \delta_2$
Example: wage equation
$\widehat{\log(wage)} = \underset{(0.100)}{0.321} + \underset{(0.055)}{0.213}\, marrmale - \underset{(0.058)}{0.198}\, marrfem - \underset{(0.056)}{0.110}\, singfem + \text{other factors}$
Examples:
Married men earn 21.3% more than single men
Single women earn 11% less than single men
Single women earn 8.8% (= -11% - (-19.8%)) more than
married women
Married men earn 32.3% (= 21.3% - (-11%)) more than
single women
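Any pairwise comparison between the four groups is a difference of two coefficients, with the base group's coefficient set to zero. A short sketch using the estimates above (the dictionary layout and the helper `log_wage_gap` are assumptions for illustration):

```python
# Coefficients from the estimated equation; single men are the base group.
coef = {"singmale": 0.0, "marrmale": 0.213, "marrfem": -0.198, "singfem": -0.110}


def log_wage_gap(group, reference):
    """Approximate proportional wage gap of `group` relative to `reference`."""
    return coef[group] - coef[reference]


singfem_vs_marrfem = log_wage_gap("singfem", "marrfem")
marrmale_vs_singfem = log_wage_gap("marrmale", "singfem")

print(f"single vs married women: {100 * singfem_vs_marrfem:.1f}%")
print(f"married men vs single women: {100 * marrmale_vs_singfem:.1f}%")
```

These are point estimates only; testing whether such a gap is statistically significant needs the standard error of the difference, which cannot be read off the individual standard errors.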
Interactions Among Dummies
We have seen that using multiple
categories allows us to estimate various
differential effects. Another approach is to
interact dummy variables...
Interactions Among Dummies
Interacting dummy variables is like
subdividing the group
Example:
Have dummies for female, as well as married
Add female*married for a total of 3 dummy
variables, which gives 4 categories
Base group is single men, female is for single
women, married is for married men.
The interaction reflects married women
More on Dummy Interactions
Formally, the model is
$y = \beta_0 + \delta_1\, female + \delta_2\, married + \delta_3\, female*married + \beta_1 x + u$, then:
If female = 0 and married = 0
$y = \beta_0 + \beta_1 x + u$
If female = 0 and married = 1
$y = \beta_0 + \delta_2\, married + \beta_1 x + u$
If female = 1 and married = 1
$y = \beta_0 + \delta_1\, female + \delta_2\, married + \delta_3\, female*married + \beta_1 x + u$
Example: wage equation
$\widehat{\log(wage)} = \underset{(0.10)}{0.321} - \underset{(0.056)}{0.11}\, female + \underset{(0.055)}{0.213}\, married - \underset{(0.072)}{0.301}\, female \cdot married + \text{other factors}$
Examples:
Married men earn 21.3% more than single men.
Single women earn 11% less than single men.
Single women earn 8.8% more than married women, because:
Single women (since married = 0): 0.321 - 0.11 = 0.211
Married women: 0.321 - 0.11 + 0.213 - 0.301 = 0.123
So the difference is 0.211 - 0.123 = 0.088 (8.8%)
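The interaction specification and the four-category specification describe the same four groups. As a sketch, the intercept shift of each group relative to single men can be computed from the interaction coefficients (the function `intercept_shift` and the dictionary of groups are illustrative names, not from the lecture):

```python
# Coefficients from the interaction specification.
d_female, d_married, d_interaction = -0.110, 0.213, -0.301


def intercept_shift(female, married):
    """Shift of the intercept relative to single men (female = married = 0)."""
    return d_female * female + d_married * married + d_interaction * female * married


groups = {
    "single men": (0, 0),
    "married men": (0, 1),
    "single women": (1, 0),
    "married women": (1, 1),
}
shifts = {name: intercept_shift(f, m) for name, (f, m) in groups.items()}
for name, value in shifts.items():
    print(f"{name}: {value:+.3f}")
```

The shift for married women works out to -0.198, the same value as the marrfem coefficient in the multiple-category regression, so the two approaches are reparametrizations of one another.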
Other Interactions with Dummies
Can also consider interacting a dummy
variable, d, with a continuous variable, x
$y = \beta_0 + \delta_1 d + \beta_1 x + \delta_2\, d*x + u$
If d = 0, then $y = \beta_0 + \beta_1 x + u$
If d = 1, then $y = (\beta_0 + \delta_1) + (\beta_1 + \delta_2)\, x + u$
This is interpreted as a change in the slope
in addition to a change in the intercept

Testing for Differences Across
Groups
Testing whether a regression function is
different for one group versus another can be
thought of as simply testing for the joint
significance of the dummy and its interactions
with all other x variables
So, you can estimate the model with all the
interactions and without and form an F
statistic, but this could be unwieldy
The Chow Test
Turns out you can compute the proper F
statistic without running the unrestricted
model with interactions with all k continuous
variables
Run the restricted model for group one and
get $SSR_1$, then for group two and get $SSR_2$
Run the restricted model for all observations to get SSR,
then
$F = \dfrac{SSR - (SSR_1 + SSR_2)}{SSR_1 + SSR_2} \cdot \dfrac{n - 2(k+1)}{k+1}$
The Chow Test (continued)
The Chow test is really just a simple F test
for exclusion restrictions, but we've
realized that $SSR_{ur} = SSR_1 + SSR_2$
Note, we have k + 1 restrictions (each of the
slope coefficients and the intercept)
Note the unrestricted model would estimate
2 different intercepts and 2 different sets of k slope
coefficients, so the df is n - 2k - 2
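The Chow statistic only needs three sums of squared residuals. The following sketch implements it for the simplest case of one regressor (k = 1) using closed-form simple OLS; the data for the two groups are invented and the function names are assumptions for illustration.

```python
def ssr_simple_ols(x, y):
    """Sum of squared residuals from regressing y on a constant and x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))


def chow_f(x1, y1, x2, y2, k=1):
    """Chow F statistic for equal intercept and slope in two groups (k = 1 regressor)."""
    ssr_1 = ssr_simple_ols(x1, y1)
    ssr_2 = ssr_simple_ols(x2, y2)
    ssr_pooled = ssr_simple_ols(x1 + x2, y1 + y2)
    n = len(x1) + len(x2)
    ssr_ur = ssr_1 + ssr_2  # unrestricted SSR: separate regressions per group
    return ((ssr_pooled - ssr_ur) / ssr_ur) * ((n - 2 * (k + 1)) / (k + 1))


# Hypothetical data: the two groups clearly follow different lines.
x_group1 = [1, 2, 3, 4, 5]
y_group1 = [2.1, 2.9, 4.2, 4.8, 6.1]
x_group2 = [1, 2, 3, 4, 5]
y_group2 = [1.0, 1.4, 2.1, 2.4, 3.1]

f_stat = chow_f(x_group1, y_group1, x_group2, y_group2)
print(f"Chow F statistic: {f_stat:.1f} with (2, 6) degrees of freedom")
```

With more than one regressor the same formula applies, but the SSRs would come from a multiple regression routine rather than the closed-form simple regression used here.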

Example:
college grade point averages
$cgpa = \beta_0 + \beta_1\, sat + \beta_2\, hsperc + \beta_3\, tothrs + u$
cgpa: College grade point average
sat: Scholastic Assessment Test score
hsperc: High school rank percentile
tothrs: Total hours of college courses
Question: Is the regression line different for
men and women?
Example:
college grade point averages
Variant 1:
Use the usual F test
Run the following regression
$cgpa = \beta_0 + \delta_0\, female + \beta_1\, sat + \delta_1\, female*sat + \beta_2\, hsperc + \delta_2\, female*hsperc + \beta_3\, tothrs + \delta_3\, female*tothrs + u$
Test the hypothesis $H_0$: $\delta_0 = 0$, $\delta_1 = 0$, $\delta_2 = 0$, $\delta_3 = 0$
This would be the usual F test
Example:
college grade point averages
Variant 2:
Estimate the restricted model for both groups
(males and females) and get $SSR_{female}$ and
$SSR_{male}$:
$cgpa_{sex} = \beta_0 + \beta_1\, sat_{sex} + \beta_2\, hsperc_{sex} + \beta_3\, tothrs_{sex} + u_{sex}$ for $sex \in \{male, female\}$
Then estimate the restricted model with males and
females pooled together to get SSR:
$cgpa = \beta_0 + \beta_1\, sat + \beta_2\, hsperc + \beta_3\, tothrs + u$
Example:
college grade point averages
Given $SSR_{female}$, $SSR_{male}$ and SSR, we can
plug the values into the formula
$F = \dfrac{SSR - (SSR_{female} + SSR_{male})}{SSR_{female} + SSR_{male}} \cdot \dfrac{n - 2(k+1)}{k+1}$
This would be the Chow test
The Linear Probability Model
(LPM)
A special case of regression analysis
occurs if the dependent variable y is
binary (0 or 1). This issue will be
discussed on the next slides
Linear Probability Model
P(y=1|x) = E(y|x), when y is a binary variable,
so we can write our model as
$P(y = 1 \mid x) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$
So, the interpretation of $\beta_j$ is the change in the
probability of success when $x_j$ increases by one unit,
holding the other regressors fixed
The predicted y is the predicted probability of
success
Potential problem: predictions can lie outside
[0,1], which makes no sense for a probability!
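The out-of-range problem is easy to reproduce. This sketch fits a linear probability model by simple OLS to a tiny invented data set (ten observations, y switching from 0 to 1 halfway through the range of x) and inspects the fitted probabilities.

```python
# Hypothetical data: y is binary and switches from 0 to 1 as x grows.
x = list(range(10))
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Simple OLS of y on a constant and x: this is the linear probability model.
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar

# Fitted values are interpreted as predicted probabilities of y = 1.
fitted = [intercept + slope * xi for xi in x]
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
print(f"smallest fitted probability: {min(fitted):.4f}")
print(f"largest fitted probability:  {max(fitted):.4f}")
```

In this example the fitted line dips below 0 at the smallest x and rises above 1 at the largest x, even though the slope itself has a sensible interpretation near the middle of the data.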
Linear Probability Model (cont)
Irrespective of potential predictions outside of
[0,1], parameter estimates of the LPM are
consistent
The LPM violates the homoskedasticity assumption, since
Var(y|x) = P(y=1|x)[1 - P(y=1|x)] depends on x, so the
usual standard errors and test statistics are affected
Despite drawbacks, it's usually a good tool to
start with when y is binary
Linear Probability Model (cont)
Alternatives to LPM: Nonlinear or nonparametric curve fitting;
Example: P(treatment under adult criminal law| age when crime is
committed)
[Figure: P(treatment = adult), on a scale from 0 to 1, plotted against age at offense (16 to 24); kernel = epanechnikov, degree = 0, bandwidth = .56, pwidth = .83]
Caveats on Program Evaluation
A typical use of a dummy variable is when
we are looking for a program effect
For example, we may have individuals that
received job training, or welfare, etc
We need to remember that usually
individuals choose whether to participate in
a program, which may lead to a self-
selection problem
Self-selection Problems
If we can control for everything that is
correlated with both participation and the
outcome of interest then it's not a problem
Often, though, there are unobservables that
are correlated with participation
In this case, the estimate of the program
effect is biased, and we don't want to set
policy based on it!
Self-selection Problems
One situation in which the problem of self-
selection does not occur is when program
participation is randomized
If all individuals have the same probability
of ending up in the program then self-
selection is not a problem
Randomization solves the selection
problem
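A small simulation can illustrate the contrast between self-selection and randomization. Everything in this sketch is an assumption made for illustration: the data generating process, the unobservable called `ability`, its coefficient of 2.0 and the true program effect of 1.0. The estimator is the difference in mean outcomes between participants and non-participants, which equals the OLS coefficient on the program dummy in a regression on a constant and the dummy.

```python
import random


def difference_in_means(outcomes, treated):
    """Mean outcome of participants minus mean outcome of non-participants."""
    y1 = [y for y, d in zip(outcomes, treated) if d == 1]
    y0 = [y for y, d in zip(outcomes, treated) if d == 0]
    return sum(y1) / len(y1) - sum(y0) / len(y0)


def simulate(randomized, n=20000, true_effect=1.0, seed=0):
    """Estimate the program effect under randomized or self-selected participation."""
    rng = random.Random(seed)
    outcomes, treated = [], []
    for _ in range(n):
        ability = rng.gauss(0.0, 1.0)  # unobservable that raises the outcome
        if randomized:
            # Everyone has the same participation probability.
            d = 1 if rng.random() < 0.5 else 0
        else:
            # High-ability individuals are more likely to participate.
            d = 1 if ability + rng.gauss(0.0, 1.0) > 0 else 0
        y = true_effect * d + 2.0 * ability + rng.gauss(0.0, 1.0)
        outcomes.append(y)
        treated.append(d)
    return difference_in_means(outcomes, treated)


self_selected_estimate = simulate(randomized=False)
randomized_estimate = simulate(randomized=True)
print(f"true effect: 1.0")
print(f"estimate with self-selection: {self_selected_estimate:.2f}")
print(f"estimate with randomization:  {randomized_estimate:.2f}")
```

Under these assumptions the self-selected estimate is far above the true effect because it also picks up the ability difference between the groups, while the randomized estimate stays close to 1.0.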