INTRODUCTION TO LOGISTIC REGRESSION
OUTLINE
Introduction and Description
Some Potential Problems and Solutions
Writing Up the Results
INTRODUCTION AND DESCRIPTION
Why use logistic regression?
Estimation by maximum likelihood
Interpreting coefficients
Hypothesis testing
Evaluating the performance of the model
THE LINEAR PROBABILITY MODEL
In the OLS regression:
Y = α + βX + e, where Y = (0, 1)
The error terms are heteroskedastic
e is not normally distributed because Y takes on only two values
The predicted probabilities can be greater than 1 or less than 0
AN EXAMPLE: HURRICANE EVACUATIONS
Q: EVAC
Did you evacuate your home to go someplace
safer before Hurricane Dennis (Floyd) hit?
1 YES
2 NO
3 DON'T KNOW
4 REFUSED
THE DATA

EVAC  PETS  MOBLHOME  TENURE  EDUC
0     1     0         16      16
0     1     0         26      12
0     1     1         11      13
1     1     1         1       10
1     0     0         5       12
0     0     0         34      12
0     0     0         3       14
0     1     0         3       16
0     1     0         10      12
0     0     0         2       18
0     0     0         2       12
0     1     0         25      16
1     1     1         20      12
OLS RESULTS
Dependent Variable: EVAC

Variable     B       t-value
(Constant)   0.190   2.121
PETS         -0.137  -5.296
MOBLHOME     0.337   8.963
TENURE       -0.003  -2.973
EDUC         0.003   0.424
FLOYD        0.198   8.147

R²           0.145
F-stat       36.010
PROBLEMS: PREDICTED VALUES OUTSIDE THE (0, 1) RANGE

Descriptive Statistics
                                 N     Minimum   Maximum   Mean       Std. Deviation
Unstandardized Predicted Value   1070  -.08498   .76027    .2429907   .1632
Valid N (listwise)               1070
HETEROSKEDASTICITY
[Scatter plot: Unstandardized Residual (y-axis, roughly -20 to 10) against TENURE (x-axis, 0 to 100). The spread of the residuals varies with TENURE; a Park test confirms the heteroskedasticity.]
THE LOGISTIC REGRESSION MODEL
The "logit" model solves these problems:
ln[p/(1-p)] = α + βX + e
p is the probability that the event Y occurs, p(Y = 1)
p/(1-p) is the "odds ratio"
ln[p/(1-p)] is the log odds ratio, or "logit"
More:
The logistic distribution constrains the estimated probabilities to lie between 0 and 1.
The estimated probability is:
p = 1/[1 + exp(-α - βX)]
COMPARING LP AND LOGIT MODELS
[Figure: predicted probability against X. The LP model is a straight line that can cross 0 and 1; the logit model is an S-shaped curve bounded between 0 and 1.]
The maximum-likelihood first-order condition is:
Σ {Yi - p(Yi = 1)} Xi = 0, summed over all observations, i = 1, ..., n
INTERPRETING COEFFICIENTS
Since:
ln[p/(1-p)] = α + βX + e
The slope coefficient (β) is interpreted as the rate of change in the "log odds" as X changes, which is not very useful.
Since:
p = 1/[1 + exp(-α - βX)]
The marginal effect of a change in X on the probability is:
∂p/∂X = β p(1 - p)
An interpretation of the logit coefficient which is usually more intuitive is the "odds ratio".
Since:
[p/(1-p)] = exp(α + βX)
exp(β) is the multiplicative effect of a one-unit change in X on the odds.
Variable   B        Exp(B)   1/Exp(B)
PETS       -0.6593  0.5172   1.933
MOBLHOME   1.5583   4.7508
TENURE     -0.0198  0.9804   1.020
EDUC       0.0501   1.0514
Constant   -0.916
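The Exp(B) column is just the exponential of each coefficient. A short Python sketch (not from the slides) reproducing it from the reported B values:

```python
import math

# Logit coefficients from the evacuation example
coeffs = {"PETS": -0.6593, "MOBLHOME": 1.5583,
          "TENURE": -0.0198, "EDUC": 0.0501}

# Exp(B) is the multiplicative change in the odds p/(1-p) for a one-unit
# increase in X; 1/Exp(B) re-expresses a negative effect as a positive factor.
odds_ratios = {name: math.exp(b) for name, b in coeffs.items()}
for name, orat in odds_ratios.items():
    print(f"{name}: Exp(B) = {orat:.4f}, 1/Exp(B) = {1 / orat:.4f}")
```

So owning a pet multiplies the odds of evacuating by about 0.52 (equivalently, cuts them roughly in half), while living in a mobile home multiplies them by about 4.75.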
INTERPRETATION OF COEFFICIENTS
Let's run the more complete model:

Iteration 0: log likelihood = -205.30803
Iteration 1: log likelihood = -167.71312
Iteration 2: log likelihood = -163.57746
Iteration 3: log likelihood = -163.45376
Iteration 4: log likelihood = -163.45352

Logistic regression                  Number of obs = 431
                                     LR chi2(4)    = 83.71
                                     Prob > chi2   = 0.0000
                                     Pseudo R2     = 0.2039

           Coef.      Std. Err.   z      P>|z|   95% CI (upper)
lrdi       .7527497   .2110683    3.57   0.000   1.166436
lassets    .997085    .1368534    7.29   0.000   1.265313
spe        .4252844   .4204924    1.01   0.312   1.249434
biotech    3.799953   .577509     6.58   0.000   4.93185
_cons      -11.63447  1.937191    -6.01  0.000   -7.837643
HYPOTHESIS TESTING
The Wald statistic for the coefficient is:
Wald = [B/s.e.B]²
which is distributed chi-square with 1 degree of freedom.
The "Partial R" (in SPSS output) is:
R = {(Wald - 2)/[-2LL(α)]}^(1/2)
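The Wald statistic needs no statistics library: for a chi-square with 1 df, the p-value is erfc(sqrt(W/2)). A minimal Python sketch (not from the slides), using the PETS coefficient and standard error from the evacuation example for illustration:

```python
import math

def wald_test(b, se):
    """Wald = (B / s.e.)^2, chi-square with 1 df under H0: B = 0."""
    wald = (b / se) ** 2
    # Chi-square(1 df) survival function reduces to erfc(sqrt(w/2)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value

# PETS: B = -0.6593, s.e. = 0.2012
w, p = wald_test(-0.6593, 0.2012)
print(f"Wald = {w:.2f}, p = {p:.4g}")  # Wald ≈ 10.74, p well below .01
```

Note that the square root of the Wald statistic is just the familiar z-ratio (here about -3.28), so the two tests are equivalent.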
AN EXAMPLE:
Variable   B        S.E.    t-value
PETS       -0.6593  0.2012  -3.28
MOBLHOME   1.5583   0.2874  5.42
TENURE     -0.0198  0.008   -2.48
EDUC       0.0501   0.0468  1.07
Constant   -0.916   0.69    -1.33
MODEL CHI-SQUARE
The model likelihood ratio (LR) statistic is:
LR[i] = -2[LL(α) - LL(α, β)]
{Or, as you read it off the SPSS printout:
LR[i] = [-2LL (of beginning model)] - [-2LL (of ending model)]}
AN EXAMPLE:
Beginning Block Number 1. Method: Enter
-2 Log Likelihood   687.35714

Variable(s) Entered on Step Number 1:
PETS, MOBLHOME, TENURE, EDUC

Estimation terminated at iteration number 3 because
Log Likelihood decreased by less than .01 percent.
-2 Log Likelihood   641.842

        Chi-Square   df   Sign.
Model   45.515       4    0.0000
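Read off the printout, the model chi-square is just the drop in -2 log likelihood. A one-line check in Python (not from the slides), using the -2LL values above:

```python
# Model chi-square = [-2LL of beginning model] - [-2LL of ending model]
begin_2ll = 687.35714   # constant-only model
end_2ll = 641.842       # model with PETS, MOBLHOME, TENURE, EDUC
model_chi2 = begin_2ll - end_2ll
print(round(model_chi2, 3))  # 45.515
```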
PERCENT CORRECT PREDICTIONS
The "Percent Correct Predictions" statistic
assumes that if the estimated p is greater
than or equal to .5 then the event is
expected to occur and not occur otherwise.
By assigning these probabilities 0s and 1s
and comparing these to the actual 0s and
1s, the % correct Yes, % correct No, and
overall % correct scores are calculated.
AN EXAMPLE:
            Predicted
Observed    0     1     % Correct
0           328   24    93.18%
1           139   44    24.04%
Overall                 69.53%
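The three percentages follow directly from the four cell counts. A short Python sketch (not from the slides) reproducing them:

```python
# Cell counts from the classification table: (observed, predicted) -> count
table = {(0, 0): 328, (0, 1): 24, (1, 0): 139, (1, 1): 44}

correct_no = table[(0, 0)] / (table[(0, 0)] + table[(0, 1)])
correct_yes = table[(1, 1)] / (table[(1, 0)] + table[(1, 1)])
overall = (table[(0, 0)] + table[(1, 1)]) / sum(table.values())

print(f"% correct No:  {correct_no:.2%}")   # 93.18%
print(f"% correct Yes: {correct_yes:.2%}")  # 24.04%
print(f"Overall:       {overall:.2%}")      # 69.53%
```

Note how a model can look good overall (69.53%) while doing poorly on the rarer outcome (only 24.04% of evacuators correctly predicted).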
PSEUDO-R²
One pseudo-R² statistic is McFadden's R²:
McFadden's R² = 1 - [LL(α, β)/LL(α)]
{= 1 - [-2LL(α, β)/-2LL(α)] (from SPSS printout)}
where this R² is a scalar measure which varies between 0 and (somewhat close to) 1, much like the R² in an LP model.
AN EXAMPLE:
Beginning -2 LL         687.36
Ending -2 LL            641.84
Ending/Beginning        0.9338
McF. R² = 1 - E./B.     0.0662
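The same table as arithmetic, in a Python sketch (not from the slides):

```python
# McFadden's R2 from the -2LL values on the SPSS printout:
# R2 = 1 - [-2LL(ending model) / -2LL(beginning model)]
mcfadden_r2 = 1 - 641.84 / 687.36
print(round(mcfadden_r2, 4))  # 0.0662
```

Pseudo-R² values this low are common for logit models and should not be read on the same scale as an OLS R².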
SOME POTENTIAL PROBLEMS AND SOLUTIONS
Functional Form
Multicollinearity
Structural Breaks
OMITTED VARIABLE BIAS
Omitted variable(s) can result in bias in the coefficient estimates. To test for omitted variables you can conduct a likelihood ratio test:
LR[q] = [-2LL(constrained model, i = k - q)] - [-2LL(unconstrained model, i = k)]
where LR is distributed chi-square with q degrees of freedom, q being the number of omitted variables.
{This test is conducted automatically by SPSS if you specify "blocks" of independent variables.}
AN EXAMPLE:
Variable   B       Wald    Sig
PETS       -0.699  10.968  0.001
MOBLHOME   1.570   29.412  0.000
TENURE     -0.020  5.993   0.014
EDUC       0.049   1.079   0.299
CHILD      0.009   0.011   0.917
WHITE      0.186   0.422   0.516
FEMALE     0.018   0.008   0.928
Constant   -1.049  2.073   0.150

Beginning -2 LL   687.36
Ending -2 LL      641.41
CONSTRUCTING THE LR TEST
Ending -2 LL, Partial Model   641.84
Ending -2 LL, Full Model      641.41
Block Chi-Square              0.43
DF                            3
Critical Value                11.345
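The block test above is simple subtraction followed by a comparison against the chi-square critical value. A Python sketch (not from the slides), using the -2LL values from the example:

```python
# LR = [-2LL(partial model)] - [-2LL(full model)], chi-square with
# df = number of added variables (CHILD, WHITE, FEMALE -> 3 df)
lr = 641.84 - 641.41
critical_chi2_3df = 11.345  # chi-square(3 df) critical value at the 1% level

print(f"Block chi-square = {lr:.2f}")
if lr > critical_chi2_3df:
    print("Reject H0: the added variables matter")
else:
    print("Fail to reject H0: the added variables add nothing")
```

Here 0.43 is far below 11.345, so the three extra demographic variables can be dropped.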
FUNCTIONAL FORM
Errors in functional form can result in
biased coefficient estimates and poor
model fit.
You should try different functional
forms by logging the independent
variables, adding squared terms, etc.
Then consult the Wald statistics and
model chi-square statistics to
determine which model performs best.
MULTICOLLINEARITY
The presence of multicollinearity will not
lead to biased coefficients.
But the standard errors of the coefficients
will be inflated.
If a variable which you think should be
statistically significant is not, consult the
correlation coefficients.
If two variables are correlated at a rate greater than .6, .7, .8, etc., then try dropping the less theoretically important of the two.
STRUCTURAL BREAKS
You may have structural breaks in your data. Pooling the data imposes the restriction that an independent variable has the same effect on the dependent variable for different groups of data when the opposite may be true.
You can conduct a likelihood ratio test:
LR = [-2LL(pooled model)] - {[-2LL(group 1)] + [-2LL(group 2)]}
which is distributed chi-square with degrees of freedom equal to the number of restricted coefficients.
AN EXAMPLE
                   Floyd B   Dennis B   Pooled B
PETS               -0.66     -1.20      -0.79
MOBLHOME           1.56      2.00       1.62
TENURE             -0.02     -0.02      -0.02
EDUC               0.05      -0.04      0.02
Constant           -0.92     -0.78      -0.97
Beginning -2 LL    687.36    440.87     1186.64
Ending -2 LL       641.84    382.84     1095.26
Model Chi-Square   45.52     58.02      91.37
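The structural-break LR statistic from these numbers, as a Python sketch (not from the slides; the test compares the pooled -2LL with the sum of the group -2LLs):

```python
# Pooled -2LL vs. the sum of the separate Floyd and Dennis ending -2LLs.
pooled_2ll = 1095.26
separate_2ll = 641.84 + 382.84   # Floyd + Dennis
lr = pooled_2ll - separate_2ll   # chi-square, df = number of coefficients
                                 # restricted to be equal across groups
print(round(lr, 2))  # 70.58
```

A statistic this large relative to the chi-square critical value suggests the two hurricanes should not be pooled without allowing some coefficients to differ.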
CONSTRUCTING THE LR TEST
Variable   B      Wald    Sig
PETS       -0.85  27.20   0.000
MOBLHOME   1.75   65.67   0.000
TENURE     -0.02  8.34    0.004
EDUC       0.02   0.27    0.606
FLOYD      1.26   59.08   0.000
Constant   -1.68  8.71    0.003
WRITING UP RESULTS
Present descriptive statistics in a table
Make it clear that the dependent variable is
discrete (0, 1) and not continuous and that you
will use logistic regression.
Logistic regression is a standard statistical
procedure so you don't (necessarily) need to
write out the formula for it. You also (usually)
don't need to justify that you are using Logit
instead of the LP model or Probit (similar to
logit but based on the normal distribution [the
tails are less fat]).
AN EXAMPLE:
"The dependent variable which
measures the willingness to evacuate
is EVAC. EVAC is equal to 1 if the
respondent evacuated their home
during Hurricanes Floyd and Dennis
and 0 otherwise. The logistic
regression model is used to estimate
the factors which influence
evacuation behavior."
AN EXAMPLE:
Table 2. Logistic Regression Results
Dependent Variable = EVAC

Variable            B        B/S.E.
PETS                -0.6593  -3.28
MOBLHOME            1.5583   5.42
TENURE              -0.0198  -2.48
EDUC                0.0501   1.07
Constant            -0.916   -1.33
Model Chi-Squared   45.515
To set the stage, consider the data given in Table 15.7. Letting Y = 1 if a student's final grade in an intermediate microeconomics course was an A and Y = 0 if the final grade was a B or a C, Spector and Mazzeo used grade point average (GPA), TUCE, and Personalized System of Instruction (PSI) as explanatory variables.
DESCRIPTION OF VARIABLES IN DATA

. desc

variable name   storage type   display format   value label   variable label
-----------------------------------------------------------------------
smoker          byte           %9.0g                          is current smoking
worka           byte           %9.0g                          has workplace smoking bans
age             byte           %9.0g                          age in years
male            byte           %9.0g                          male
black           byte           %9.0g                          black
hispanic        byte           %9.0g                          hispanic
incomel         float          %9.0g                          log income
hsgrad          byte           %9.0g                          is hs graduate
somecol         byte           %9.0g                          has some college
college         float          %9.0g
-----------------------------------------------------------------------
SUMMARY STATISTICS

. sum

Variable   Obs     Mean       Std. Dev.   Min        Max
smoker     16258   .25163     .433963
worka      16258   .6851396   .4644745
age        16258   38.54742   11.96189    18         87
male       16258   .3947595   .488814
black      16258   .1119449   .3153083
hispanic   16258   .0607086   .2388023
incomel    16258   10.42097   .7624525    6.214608   11.22524
hsgrad     16258   .3355271   .4721889
somecol    16258   .2685447   .4432161
college    16258   .3293763   .4700012
Heteroskedastic-consistent standard errors.
Since this is OLS, t-statistics are reported.

                    Number of obs = 16258
                    F(9, 16248)   = 99.26
                    Prob > F      = 0.0000
                    R-squared     = 0.0488
                    Root MSE      = .42336

                      Robust
smoker     Coef.      Std. Err.   t       P>|t|   [95% Conf. Interval]
age        -.0004776  .0002806    -1.70   0.089   -.0010276   .0000725
incomel    -.0287361  .0047823    -6.01   0.000   -.03811     -.0193621
male       .0168615   .0069542    2.42    0.015   .0032305    .0304926
black      -.0356723  .0110203    -3.24   0.001   -.0572732   -.0140714
hispanic   -.070582   .0136691    -5.16   0.000   -.097375    -.043789
hsgrad     -.0661429  .0162279    -4.08   0.000   -.0979514   -.0343345
somecol    -.1312175  .0164726    -7.97   0.000   -.1635056   -.0989293
college    -.2406109  .0162568    -14.80  0.000   -.272476    -.2087459
worka      -.066076   .0074879    -8.82   0.000   -.080753    -.051399
_cons      .7530714   .0494255    15.24   0.000   .6561919    .8499509
Iteration 0: log likelihood = -9171.443
Iteration 1: log likelihood = -8764.068
Iteration 2: log likelihood = -8761.7211
Iteration 3: log likelihood = -8761.7208

Probit estimates              Number of obs = 16258
                              LR chi2(9)    = 819.44
                              Prob > chi2   = 0.0000
                              Pseudo R2     = 0.0447

Report z-statistics instead of t-stats.

smoker     Coef.      Std. Err.   z       P>|z|   [95% Conf. Interval]
age        -.0012684  .0009316    -1.36   0.173   -.0030943   .0005574
incomel    -.092812   .0151496    -6.13   0.000   -.1225047   -.0631193
male       .0533213   .0229297    2.33    0.020   .0083799    .0982627
black      -.1060518  .034918     -3.04   0.002   -.17449     -.0376137
hispanic   -.2281468  .0475128    -4.80   0.000   -.3212701   -.1350235
hsgrad     -.1748765  .0436392    -4.01   0.000   -.2604078   -.0893453
somecol    -.363869   .0451757    -8.05   0.000   -.4524118   -.2753262
college    -.7689528  .0466418    -16.49  0.000   -.860369    -.6775366
worka      -.2093287  .0231425    -9.05   0.000   -.2546873   -.1639702
_cons      .870543    .154056     5.65    0.000   .5685989    1.172487
. mfx compute

smoker      dF/dx      Std. Err.   z       P>|z|   x-bar     [95% C.I.]
age         -.0003951  .0002902    -1.36   0.173   38.5474   -.000964   .000174
incomel     -.0289139  .0047173    -6.13   0.000   10.421    -.03816    -.019668
male*       .0166757   .0071979    2.33    0.020   .39476    .002568    .030783
black*      -.0320621  .0102295    -3.04   0.002   .111945   -.052111   -.012013
hispanic*   -.0658551  .0125926    -4.80   0.000   .060709   -.090536   -.041174
hsgrad*     -.053335   .013018     -4.01   0.000   .335527   -.07885    -.02782
somecol*    -.1062358  .0122819    -8.05   0.000   .268545   -.130308   -.082164
college*    -.2149199  .0114584    -16.49  0.000   .329376   -.237378   -.192462
worka*      -.0668959  .0075634    -9.05   0.000   .68514    -.08172    -.052072

obs. P      .25163
pred. P     .2409344 (at x-bar)

(*) dF/dx is for discrete change of dummy variable from 0 to 1
z and P>|z| correspond to the test of the underlying coefficient being 0
THE MCFADDEN PSEUDO-R²
The likelihood ratio statistic is:
LR = 2[ln L_unc - ln L_c] ~ chi-square
We also use the McFadden pseudo-R² (1973). Its interpretation is analogous to the OLS R²; however, it is biased downward and remains generally low.
The pseudo-R² also compares the unconstrained model with the constrained model, and it lies between 0 and 1:
Pseudo-R²_MF = [ln L_c - ln L_unc]/ln L_c = 1 - [ln L_unc/ln L_c]
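The formula can be verified against the Stata iteration log of the innovation example. A Python sketch (not from the slides), using the constant-only and full-model log likelihoods reported there:

```python
# McFadden pseudo-R2 = 1 - ln L_unc / ln L_c
ll_c = -205.30803     # constant-only (constrained) log likelihood
ll_unc = -163.45352   # full (unconstrained) log likelihood
pseudo_r2 = 1 - ll_unc / ll_c
print(round(pseudo_r2, 4))  # 0.2039
```

This matches the Pseudo R2 = 0.2039 that Stata prints in the header of the logit output.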
GOODNESS OF FIT MEASURES
Constrained model (constant only):

. logit inno

Iteration 0: log likelihood = -205.30803

Logistic regression              Number of obs = 431
                                 LR chi2(0)    = 0.00
                                 Prob > chi2   = .
                                 Pseudo R2     = 0.0000

        Coef.      Std. Err.   z       P>|z|   [95% Conf. Interval]
_cons   1.494183   .1244955    12.00   0.000   1.250177   1.73819

LR = 2[ln L_unc - ln L_c] = 2[-163.5 - (-205.3)] ≈ 83.7

Unconstrained model:

Logistic regression              Number of obs = 431
                                 LR chi2(4)    = 83.71
                                 Prob > chi2   = 0.0000
                                 Pseudo R2     = 0.2039

           Coef.      Std. Err.   z      P>|z|   95% CI (upper)
lrdi       .7527497   .2110683    3.57   0.000   1.166436
lassets    .997085    .1368534    7.29   0.000   1.265313
spe        .4252844   .4204924    1.01   0.312   1.249434
biotech    3.799953   .577509     6.58   0.000   4.93185
_cons      -11.63447  1.937191    -6.01  0.000   -7.837643
GOODNESS OF FIT MEASURES
Constrained model (without biotech):

. logit

Logistic regression              Number of obs = 431
                                 LR chi2(3)    = 26.93
                                 Prob > chi2   = 0.0000
                                 Pseudo R2     = 0.0656

           Coef.      Std. Err.   z      P>|z|   95% CI (upper)
lrdi       .9275668   .1979951    4.68   0.000   1.31563
lassets    .3032756   .0792032    3.83   0.000   .4585111
spe        .3739987   .3800765    0.98   0.325   1.118935
_cons      -.4703812  .9313494    -0.51  0.614   1.35503

Unconstrained model (with biotech):

Logistic regression              Number of obs = 431
                                 LR chi2(4)    = 83.71
                                 Prob > chi2   = 0.0000
                                 Pseudo R2     = 0.2039

           Coef.      Std. Err.   z      P>|z|   95% CI (upper)
lrdi       .7527497   .2110683    3.57   0.000   1.166436
lassets    .997085    .1368534    7.29   0.000   1.265313
spe        .4252844   .4204924    1.01   0.312   1.249434
biotech    3.799953   .577509     6.58   0.000   4.93185
_cons      -11.63447  1.937191    -6.01  0.000   -7.837643

LR chi2(1) = 56.78
Prob > chi2 = 0.0000
LR = 2[ln L_unc - ln L_c] = 2[-163.5 - (-191.8)] ≈ 56.8
QUALITY OF PREDICTIONS
Lastly, one can compare the quality of the predictions with the observed outcome variable (a dummy variable).
One must assume that when the predicted probability is higher than 0.5, the prediction is that the event will occur (the most likely outcome).
Then one can compare how good the predictions are against the actual outcome variable. Stata does this for us:
estat class

              True pos   True neg   Total
Classify pos  a          b          a+b
Classify neg  c          d          c+d
Total         a+c        b+d
QUALITY OF PREDICTIONS

. estat class

Logistic model for inno
                  True
Classified   D     ~D    Total
+            337   51    388
-            15    28    43
Total        352   79    431

Classified + if predicted Pr(D) >= .5

Sensitivity               Pr( +| D)   95.74%
Specificity               Pr( -|~D)   35.44%
Positive predictive value Pr( D| +)   86.86%
Negative predictive value Pr(~D| -)   65.12%

False + rate for true ~D      Pr( +|~D)   64.56%
False - rate for true D       Pr( -| D)   4.26%
False + rate for classified + Pr(~D| +)   13.14%
False - rate for classified - Pr( D| -)   34.88%

Correctly classified   84.69%
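The estat class rates all come from the four cells of the classification table. A Python sketch (not from the slides) reproducing the main ones from the counts above:

```python
# 2x2 counts: rows = classified (+/-), columns = true (D / ~D)
tp, fp = 337, 51   # classified +
fn, tn = 15, 28    # classified -
total = tp + fp + fn + tn

sensitivity = tp / (tp + fn)   # Pr(+|D): true events caught
specificity = tn / (fp + tn)   # Pr(-|~D): non-events caught
correct = (tp + tn) / total    # overall correctly classified

print(f"Pr(+|D)  = {sensitivity:.2%}")              # 95.74%
print(f"Pr(-|~D) = {specificity:.2%}")              # 35.44%
print(f"Correctly classified = {correct:.2%}")      # 84.69%
```

As in the hurricane example, the overall rate (84.69%) masks the weak specificity (35.44%): the model classifies most firms as innovators because most of them are.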
OTHER BINARY CHOICE MODELS
The logit model is only one way of modeling binary choices.
The probit model is another; it is actually used more than the logit model and assumes a normal distribution (not a logistic one) for the z values:
Pr(Y=1|X) = Φ(Xβ) = ∫ from -∞ to Xβ of (1/√(2π)) exp(-z²/2) dz
The complementary log-log model is used where the occurrence of the event is very rare, with the distribution of z being asymmetric:
Pr(Y=1|X) = 1 - exp[-exp(Xβ)]
REFERENCES
http://personal.ecu.edu/whiteheadj/data/logit/
http://personal.ecu.edu/whiteheadj/data/logit/logitpap.htm
E-mail: WhiteheadJ@mail.ecu.edu