
AN INTRODUCTION TO LOGISTIC REGRESSION

OUTLINE
Introduction and Description
Some Potential Problems and Solutions
Writing Up the Results

INTRODUCTION AND DESCRIPTION
Why use logistic regression?
Estimation by maximum likelihood
Interpreting coefficients
Hypothesis testing
Evaluating the performance of the model

WHY USE LOGISTIC REGRESSION?
There are many important research topics for which the dependent variable is "limited."
For example: voting, morbidity or mortality, and participation data are not continuous or distributed normally.
Binary logistic regression is a type of regression analysis where the dependent variable is a dummy variable: coded 0 (did not vote) or 1 (did vote).

THE LINEAR PROBABILITY MODEL
In the OLS regression:
  Y = α + βX + e ;  where Y = {0, 1}
The error terms are heteroskedastic.
e is not normally distributed because Y takes on only two values.
The predicted probabilities can be greater than 1 or less than 0.

AN EXAMPLE: HURRICANE EVACUATIONS

Q: EVAC
Did you evacuate your home to go someplace safer before Hurricane Dennis (Floyd) hit?
  1 YES
  2 NO
  3 DON'T KNOW
  4 REFUSED

THE DATA

EVAC  PETS  MOBLHOME  TENURE  EDUC
  0     1      0        16     16
  0     1      0        26     12
  0     1      1        11     13
  1     1      1         1     10
  1     0      0         5     12
  0     0      0        34     12
  0     0      0         3     14
  0     1      0         3     16
  0     1      0        10     12
  0     0      0         2     18
  0     0      0         2     12
  0     1      0        25     16
  1     1      1        20     12

OLS RESULTS
Dependent Variable: EVAC

Variable        B       t-value
(Constant)    0.190      2.121
PETS         -0.137     -5.296
MOBLHOME      0.337      8.963
TENURE       -0.003     -2.973
EDUC          0.003      0.424
FLOYD         0.198      8.147

R2            0.145
F-stat       36.010

PROBLEMS:
Predicted values outside the (0, 1) range

Descriptive Statistics
                                    N     Minimum   Maximum     Mean      Std. Deviation
Unstandardized Predicted Value    1070    -.08498    .76027   .2429907        .1632
Valid N (listwise)                1070

HETEROSKEDASTICITY

[Scatterplot: unstandardized residuals (y-axis, -20 to 10) plotted against TENURE (x-axis, 0 to 100)]

Park Test
Dependent Variable: LNESQ
Variable        B      t-stat
(Constant)   -2.34    -15.99
LNTNSQ       -0.20     -6.19

THE LOGISTIC REGRESSION MODEL
The "logit" model solves these problems:
  ln[p/(1-p)] = α + βX + e
p is the probability that the event Y occurs, p(Y=1)
p/(1-p) is the "odds ratio"
ln[p/(1-p)] is the log odds ratio, or "logit"

More:
The logistic distribution constrains the estimated probabilities to lie between 0 and 1.
The estimated probability is:
  p = 1/[1 + exp(-α - βX)]
If you let α + βX = 0, then p = .50.
As α + βX gets really big, p approaches 1.
As α + βX gets really small (large and negative), p approaches 0.
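These three properties are easy to verify numerically. A minimal sketch (in Python here; the function name and test values are illustrative, not part of the slides):

```python
import math

def logistic_p(a, b, x):
    """Estimated probability p = 1/[1 + exp(-(a + b*x))] from the logit model."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

# if a + b*x = 0, then p = .50
print(logistic_p(0.0, 1.0, 0.0))     # 0.5
# as a + b*x gets really big, p approaches 1
print(logistic_p(0.0, 1.0, 20.0))
# as a + b*x gets really small, p approaches 0
print(logistic_p(0.0, 1.0, -20.0))
```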

COMPARING LP AND LOGIT MODELS

[Figure: predicted probabilities from the LP model (a straight line) and the logit model (an S-shaped curve bounded by 0 and 1), plotted against X]

MAXIMUM LIKELIHOOD ESTIMATION (MLE)
MLE is a statistical method for estimating the coefficients of a model.
The likelihood function (L) measures the probability of observing the particular set of dependent variable values (p1, p2, ..., pn) that occur in the sample:
  L = Prob(p1 · p2 · ... · pn)
The higher the L, the higher the probability of observing the ps in the sample.

MLE involves finding the coefficients (α, β) that make the log of the likelihood function (LL < 0) as large as possible.
Or, equivalently, it finds the coefficients that make -2 times the log of the likelihood function (-2LL) as small as possible.
The maximum likelihood estimates solve the following condition:
  Σ {Yi - p(Yi=1)}Xi = 0
summed over all observations, i = 1, ..., n.

INTERPRETING COEFFICIENTS
Since:
  ln[p/(1-p)] = α + βX + e
the slope coefficient (β) is interpreted as the rate of change in the "log odds" as X changes, which is not very useful.
Since:
  p = 1/[1 + exp(-α - βX)]
the marginal effect of a change in X on the probability is:
  ∂p/∂X = β p(1-p)

An interpretation of the logit coefficient which is usually more intuitive is the "odds ratio".
Since:
  [p/(1-p)] = exp(α + βX)
exp(β) is the effect of the independent variable on the "odds ratio".

FROM SPSS OUTPUT:

Variable       B        Exp(B)   1/Exp(B)
PETS        -0.6593    0.5172     1.933
MOBLHOME     1.5583    4.7508
TENURE      -0.0198    0.9804     1.020
EDUC         0.0501    1.0514
Constant    -0.916

Households without pets are 1.933 times more likely to evacuate than those with pets (Exp(B) = 0.5172 < 1, so having pets lowers the odds of evacuating).
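The Exp(B) column is just the exponential of each coefficient. A quick Python check (the dictionary is only a convenience for this illustration):

```python
import math

# Odds ratios from the logit coefficients: Exp(B) = e^B.  A coefficient
# below zero gives Exp(B) < 1, i.e. the variable lowers the odds.
coefs = {"PETS": -0.6593, "MOBLHOME": 1.5583, "TENURE": -0.0198, "EDUC": 0.0501}
for name, b in coefs.items():
    odds = math.exp(b)
    print(f"{name}: Exp(B) = {odds:.4f}, 1/Exp(B) = {1 / odds:.3f}")
```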

INTERPRETATION OF COEFFICIENTS
Let's run the more complete model:

. logit inno lrdi lassets spe biotech

Iteration 0:  log likelihood = -205.30803
Iteration 1:  log likelihood = -167.71312
Iteration 2:  log likelihood = -163.57746
Iteration 3:  log likelihood = -163.45376
Iteration 4:  log likelihood = -163.45352

Logistic regression                         Number of obs =    431
                                            LR chi2(4)    =  83.71
                                            Prob > chi2   = 0.0000
Log likelihood = -163.45352                 Pseudo R2     = 0.2039

------------------------------------------------------------------------
    inno |     Coef.   Std. Err.      z   P>|z|    [95% Conf. Interval]
---------+--------------------------------------------------------------
    lrdi |  .7527497   .2110683    3.57   0.000    .3390634    1.166436
 lassets |   .997085   .1368534    7.29   0.000    .7288574    1.265313
     spe |  .4252844   .4204924    1.01   0.312   -.3988654    1.249434
 biotech |  3.799953    .577509    6.58   0.000    2.668056     4.93185
   _cons | -11.63447   1.937191   -6.01   0.000   -15.43129   -7.837643
------------------------------------------------------------------------

HYPOTHESIS TESTING
The Wald statistic for the coefficient β is:
  Wald = [β/s.e.(β)]²
which is distributed chi-square with 1 degree of freedom.
The "Partial R" (in SPSS output) is:
  R = {(Wald - 2)/[-2LL(α)]}^(1/2)

AN EXAMPLE:

Variable       B        S.E.     Wald      R        Sig     t-value
PETS        -0.6593   0.2012   10.732   -0.1127   0.0011    -3.28
MOBLHOME     1.5583   0.2874   29.39     0.1996   0.0000     5.42
TENURE      -0.0198   0.008     6.1238  -0.0775   0.0133    -2.48
EDUC         0.0501   0.0468    1.1483   0.0000   0.2839     1.07
Constant    -0.916    0.69      1.7624            0.1843    -1.33
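The Wald column can be reproduced from B and S.E., and Wald = t², which is why the t-values above square to the Wald statistics. A minimal Python check (small differences from the table come from the rounded B and S.E. values):

```python
# The Wald statistic is (B / s.e.)**2, the square of the t-value;
# coefficient and standard-error values copied from the table above.
rows = [("PETS", -0.6593, 0.2012), ("MOBLHOME", 1.5583, 0.2874),
        ("TENURE", -0.0198, 0.008), ("EDUC", 0.0501, 0.0468)]
for name, b, se in rows:
    t = b / se
    print(f"{name}: t = {t:.2f}, Wald = {t * t:.2f}")
```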

EVALUATING THE PERFORMANCE OF THE MODEL
There are several statistics which can be used for comparing alternative models or evaluating the performance of a single model:
Model Chi-Square
Percent Correct Predictions
Pseudo-R2

MODEL CHI-SQUARE
The model likelihood ratio (LR) statistic is:
  LR[i] = -2[LL(α) - LL(α, β)]
{Or, as you are reading SPSS printout:
  LR[i] = [-2LL (of beginning model)] - [-2LL (of ending model)]}
The LR statistic is distributed chi-square with i degrees of freedom, where i is the number of independent variables.
Use the Model Chi-Square statistic to determine if the overall model is statistically significant.

AN EXAMPLE:
Beginning Block Number 1. Method: Enter
-2 Log Likelihood   687.35714

Variable(s) Entered on Step Number 1:
  PETS, MOBLHOME, TENURE, EDUC

Estimation terminated at iteration number 3 because Log Likelihood decreased by less than .01 percent.
-2 Log Likelihood   641.842

         Chi-Square     df     Sign.
Model      45.515        4    0.0000
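The model chi-square here is just the difference of the two -2LL values on the printout, a sketch in Python:

```python
# Model chi-square = (beginning -2LL) - (ending -2LL), with the
# values from the SPSS printout above.
begin_2ll = 687.35714   # constant-only model
end_2ll = 641.842       # model with PETS, MOBLHOME, TENURE, EDUC
print(round(begin_2ll - end_2ll, 3))   # 45.515
```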

PERCENT CORRECT PREDICTIONS
The "Percent Correct Predictions" statistic assumes that if the estimated p is greater than or equal to .5 then the event is expected to occur, and not to occur otherwise.
By assigning these probabilities 0s and 1s and comparing them to the actual 0s and 1s, the % correct Yes, % correct No, and overall % correct scores are calculated.

AN EXAMPLE:

                Predicted
Observed       0       1      % Correct
   0         328      24        93.18%
   1         139      44        24.04%
Overall                         69.53%
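The three percent-correct scores follow directly from the counts in the table, sketched in Python:

```python
# Percent-correct scores recomputed from the classification counts above.
obs0_pred0, obs0_pred1 = 328, 24    # observed 0: predicted 0, predicted 1
obs1_pred0, obs1_pred1 = 139, 44    # observed 1: predicted 0, predicted 1
total = obs0_pred0 + obs0_pred1 + obs1_pred0 + obs1_pred1
pct_no = 100 * obs0_pred0 / (obs0_pred0 + obs0_pred1)
pct_yes = 100 * obs1_pred1 / (obs1_pred0 + obs1_pred1)
overall = 100 * (obs0_pred0 + obs1_pred1) / total
print(f"{pct_no:.2f}% {pct_yes:.2f}% {overall:.2f}%")   # 93.18% 24.04% 69.53%
```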

PSEUDO-R2
One pseudo-R2 statistic is McFadden's-R2:
  McFadden's-R2 = 1 - [LL(α, β)/LL(α)]
  {= 1 - [-2LL(α, β)/-2LL(α)] (from SPSS printout)}
where the R2 is a scalar measure which varies between 0 and (somewhat close to) 1, much like the R2 in a LP model.

AN EXAMPLE:

Beginning -2 LL          687.36
Ending -2 LL             641.84
Ending/Beginning         0.9338
McF. R2 = 1 - E./B.      0.0662
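The same arithmetic, as a one-line Python sketch:

```python
# McFadden's R2 from the -2LL values above: 1 - ending/beginning.
begin_2ll, end_2ll = 687.36, 641.84
mcfadden_r2 = 1 - end_2ll / begin_2ll
print(round(mcfadden_r2, 4))   # 0.0662
```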

SOME POTENTIAL PROBLEMS AND SOLUTIONS

Omitted Variable Bias

Irrelevant Variable Bias

Functional Form

Multicollinearity

Structural Breaks

OMITTED VARIABLE BIAS
Omitted variable(s) can result in bias in the coefficient estimates. To test for omitted variables you can conduct a likelihood ratio test:
  LR[q] = [-2LL(constrained model, i = k - q)] - [-2LL(unconstrained model, i = k)]
where LR is distributed chi-square with q degrees of freedom, and q is the number of omitted variables (1 or more).
{This test is conducted automatically by SPSS if you specify "blocks" of independent variables.}

AN EXAMPLE:

Variable        B        Wald     Sig
PETS          -0.699    10.968   0.001
MOBLHOME       1.570    29.412   0.000
TENURE        -0.020     5.993   0.014
EDUC           0.049     1.079   0.299
CHILD          0.009     0.011   0.917
WHITE          0.186     0.422   0.516
FEMALE         0.018     0.008   0.928
Constant      -1.049     2.073   0.150

Beginning -2 LL   687.36
Ending -2 LL      641.41

CONSTRUCTING THE LR TEST

Ending -2 LL, Partial Model    641.84
Ending -2 LL, Full Model       641.41
Block Chi-Square                 0.43
DF                                  3
Critical Value                 11.345

Since the chi-squared value is less than the critical value, the set of coefficients is not statistically significant. The full model is not an improvement over the partial model.
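The block chi-square is again a difference of ending -2LL values, compared against a tabulated chi-square critical value; a minimal Python sketch:

```python
# Block LR test for the q = 3 added variables (CHILD, WHITE, FEMALE),
# using the ending -2LL values from the tables above.
partial_2ll = 641.84   # PETS, MOBLHOME, TENURE, EDUC only
full_2ll = 641.41      # plus CHILD, WHITE, FEMALE
block_chi2 = partial_2ll - full_2ll
crit_01_df3 = 11.345   # chi-square critical value, .01 level, 3 df
print(round(block_chi2, 2), block_chi2 < crit_01_df3)   # 0.43 True
```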

IRRELEVANT VARIABLE BIAS
The inclusion of irrelevant variable(s) can result in poor model fit.
You can consult your Wald statistics or conduct a likelihood ratio test.

FUNCTIONAL FORM
Errors in functional form can result in
biased coefficient estimates and poor
model fit.
You should try different functional
forms by logging the independent
variables, adding squared terms, etc.
Then consult the Wald statistics and
model chi-square statistics to
determine which model performs best.

MULTICOLLINEARITY
The presence of multicollinearity will not
lead to biased coefficients.
But the standard errors of the coefficients
will be inflated.
If a variable which you think should be
statistically significant is not, consult the
correlation coefficients.
If two variables are correlated at a rate
greater than .6, .7, .8, etc. then try
dropping the least theoretically important
of the two.

STRUCTURAL
BREAKS
You may have structural breaks in your data.
Pooling the data imposes the restriction that
an independent variable has the same effect
on the dependent variable for different groups
of data when the opposite may be true.
You can conduct a likelihood ratio test:
  LR[i+1] = -2LL(pooled model) - [-2LL(sample 1) + -2LL(sample 2)]
where samples 1 and 2 are pooled, and i is the number of independent variables.

AN EXAMPLE

Is the evacuation behavior from Hurricanes Dennis and Floyd statistically equivalent?

                    Floyd B   Dennis B   Pooled B
PETS                 -0.66     -1.20      -0.79
MOBLHOME              1.56      2.00       1.62
TENURE               -0.02     -0.02      -0.02
EDUC                  0.05     -0.04       0.02
Constant             -0.92     -0.78      -0.97
Beginning -2 LL     687.36    440.87    1186.64
Ending -2 LL        641.84    382.84    1095.26
Model Chi-Square     45.52     58.02      91.37
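Plugging the ending -2LL values into the structural-break formula, as a Python sketch:

```python
# LR test for a structural break: pooled -2LL vs. the sum from the two
# separate-sample models (values from the table above).
pooled = 1095.26
floyd, dennis = 641.84, 382.84
lr = pooled - (floyd + dennis)
print(round(lr, 2))   # 70.58
```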

CONSTRUCTING THE LR TEST

LR = 1095.26 - (641.84 + 382.84) = 70.58, with i + 1 = 5 degrees of freedom (critical value at the .01 level = 15.09).
Since the chi-squared value is greater than the critical value, the sets of coefficients are statistically different. The pooled model is inappropriate.

WHAT SHOULD YOU DO?
Try adding a dummy variable: FLOYD = 1 if Floyd, 0 if Dennis

Variable       B       Wald     Sig
PETS         -0.85    27.20    0.000
MOBLHOME      1.75    65.67    0.000
TENURE       -0.02     8.34    0.004
EDUC          0.02     0.27    0.606
FLOYD         1.26    59.08    0.000
Constant     -1.68     8.71    0.003

WRITING UP RESULTS
Present descriptive statistics in a table
Make it clear that the dependent variable is
discrete (0, 1) and not continuous and that you
will use logistic regression.
Logistic regression is a standard statistical
procedure so you don't (necessarily) need to
write out the formula for it. You also (usually)
don't need to justify that you are using Logit
instead of the LP model or Probit (similar to
logit but based on the normal distribution [the
tails are less fat]).

AN EXAMPLE:
"The dependent variable which
measures the willingness to evacuate
is EVAC. EVAC is equal to 1 if the
respondent evacuated their home
during Hurricanes Floyd and Dennis
and 0 otherwise. The logistic
regression model is used to estimate
the factors which influence
evacuation behavior."

Organize your regression results in a table:
In the heading, state your dependent variable (dependent variable = EVAC) and that these are "logistic regression results."
Present coefficient estimates, t-statistics (or Wald, whichever you prefer), and (at least) the model chi-square statistic for overall model fit.
If you are comparing several model specifications you should also present the % correct predictions and/or Pseudo-R2 statistics to evaluate model performance.
If you are comparing models with hypotheses about different blocks of coefficients or testing for structural breaks in the data, you could present the ending log-likelihood values.

AN EXAMPLE:
Table 2. Logistic Regression Results
Dependent Variable = EVAC

Variable        B        B/S.E.
PETS          -0.6593    -3.28
MOBLHOME       1.5583     5.42
TENURE        -0.0198    -2.48
EDUC           0.0501     1.07
Constant      -0.916     -1.33
Model Chi-Squared        45.515

When describing the statistics in the tables, point out the highlights for the reader.
What are the statistically significant variables?
"The results from Model 1 indicate that coastal residents behave according to risk theory. The coefficient on the MOBLHOME variable is positive and statistically significant at the p < .01 level (t-value = 5.42). Mobile home residents are 4.75 times more likely to evacuate."

Is the overall model statistically significant?
"The overall model is significant at the .01 level according to the model chi-square statistic. The model predicts 69.5% of the responses correctly. The McFadden's R2 is .066."

Which model is preferred?
"Model 2 includes three additional independent variables. According to the likelihood ratio test statistic, the partial model is superior to the full model in terms of overall model fit. The block chi-square statistic is not statistically significant at the .01 level (critical value = 11.35 [df = 3]). The coefficients on the children, gender, and race variables are not statistically significant at standard levels."

To set the stage, consider the data given in Table 15.7. Letting Y = 1 if a student's final grade in an intermediate microeconomics course was A and Y = 0 if the final grade was a B or a C, Spector and Mazzeo used grade point average (GPA), TUCE, and Personalized System of Instruction (PSI) as predictors.

DESCRIPTION OF VARIABLES IN DATA

. desc;

              storage  display    value
variable name   type   format     label     variable label
------------------------------------------------------------------------
smoker          byte   %9.0g                is current smoking
worka           byte   %9.0g                has workplace smoking bans
age             byte   %9.0g                age in years
male            byte   %9.0g                male
black           byte   %9.0g                black
hispanic        byte   %9.0g                hispanic
incomel         float  %9.0g                log income
hsgrad          byte   %9.0g                is hs graduate
somecol         byte   %9.0g                has some college
college         float  %9.0g
------------------------------------------------------------------------

SUMMARY STATISTICS

. sum;

    Variable |      Obs        Mean    Std. Dev.        Min        Max
-------------+---------------------------------------------------------
      smoker |    16258      .25163     .433963
       worka |    16258    .6851396    .4644745
         age |    16258    38.54742    11.96189         18         87
        male |    16258    .3947595     .488814
       black |    16258    .1119449    .3153083
    hispanic |    16258    .0607086    .2388023
     incomel |    16258    10.42097    .7624525   6.214608   11.22524
      hsgrad |    16258    .3355271    .4721889
     somecol |    16258    .2685447    .4432161
     college |    16258    .3293763    .4700012
-------------+---------------------------------------------------------

HETEROSKEDASTICITY-CONSISTENT STANDARD ERRORS

. * run a linear probability model for comparison purposes;
. * estimate white standard errors to control for heteroskedasticity;
. reg smoker age incomel male black hispanic hsgrad somecol college worka, robust;

Regression with robust standard errors      Number of obs =  16258
                                            F(9, 16248)   =  99.26
                                            Prob > F      = 0.0000
                                            R-squared     = 0.0488
                                            Root MSE      = .42336

[Notes: the very low R-squared is typical in LP models; since this is OLS, t-statistics are reported.]

------------------------------------------------------------------------------
             |               Robust
      smoker |      Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0004776   .0002806    -1.70   0.089   -.0010276    .0000725
     incomel |  -.0287361   .0047823    -6.01   0.000     -.03811   -.0193621
        male |   .0168615   .0069542     2.42   0.015    .0032305    .0304926
       black |  -.0356723   .0110203    -3.24   0.001   -.0572732   -.0140714
    hispanic |   -.070582   .0136691    -5.16   0.000    -.097375    -.043789
      hsgrad |  -.0661429   .0162279    -4.08   0.000   -.0979514   -.0343345
     somecol |  -.1312175   .0164726    -7.97   0.000   -.1635056   -.0989293
     college |  -.2406109   .0162568   -14.80   0.000    -.272476   -.2087459
       worka |   -.066076   .0074879    -8.82   0.000    -.080753    -.051399
       _cons |   .7530714   .0494255    15.24   0.000    .6561919    .8499509
------------------------------------------------------------------------------

SAME SYNTAX AS REG, BUT WITH PROBIT

. * run probit model;
. probit smoker age incomel male black hispanic hsgrad somecol college worka;

Iteration 0:  log likelihood =  -9171.443
Iteration 1:  log likelihood =  -8764.068
Iteration 2:  log likelihood = -8761.7211
Iteration 3:  log likelihood = -8761.7208

Probit estimates                            Number of obs =  16258
                                            LR chi2(9)    = 819.44
                                            Prob > chi2   = 0.0000
Log likelihood = -8761.7208                 Pseudo R2     = 0.0447

[Notes: the LR chi2(9) statistic tests that all non-constant terms are 0; estimation converges rapidly for most problems; z-statistics are reported instead of t-stats.]

------------------------------------------------------------------------------
      smoker |      Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0012684   .0009316    -1.36   0.173   -.0030943    .0005574
     incomel |   -.092812   .0151496    -6.13   0.000   -.1225047   -.0631193
        male |   .0533213   .0229297     2.33   0.020    .0083799    .0982627
       black |  -.1060518    .034918    -3.04   0.002     -.17449   -.0376137
    hispanic |  -.2281468   .0475128    -4.80   0.000   -.3212701   -.1350235
      hsgrad |  -.1748765   .0436392    -4.01   0.000   -.2604078   -.0893453
     somecol |   -.363869   .0451757    -8.05   0.000   -.4524118   -.2753262
     college |  -.7689528   .0466418   -16.49   0.000    -.860369   -.6775366
       worka |  -.2093287   .0231425    -9.05   0.000   -.2546873   -.1639702
       _cons |    .870543    .154056     5.65   0.000    .5685989    1.172487
------------------------------------------------------------------------------

. dprobit smoker age incomel male black hispanic hsgrad somecol college worka;

Probit regression, reporting marginal effects    Number of obs =  16258
                                                 LR chi2(9)    = 819.44
                                                 Prob > chi2   = 0.0000
Log likelihood = -8761.7208                      Pseudo R2     = 0.0447

------------------------------------------------------------------------------
      smoker |      dF/dx   Std. Err.      z    P>|z|     x-bar  [   95% C.I.   ]
-------------+----------------------------------------------------------------
         age |  -.0003951   .0002902    -1.36   0.173   38.5474  -.000964  .000174
     incomel |  -.0289139   .0047173    -6.13   0.000    10.421   -.03816 -.019668
        male*|   .0166757   .0071979     2.33   0.020    .39476   .002568  .030783
       black*|  -.0320621   .0102295    -3.04   0.002   .111945  -.052111 -.012013
    hispanic*|  -.0658551   .0125926    -4.80   0.000   .060709  -.090536 -.041174
      hsgrad*|   -.053335    .013018    -4.01   0.000   .335527   -.07885  -.02782
     somecol*|  -.1062358   .0122819    -8.05   0.000   .268545  -.130308 -.082164
     college*|  -.2149199   .0114584   -16.49   0.000   .329376  -.237378 -.192462
       worka*|  -.0668959   .0075634    -9.05   0.000    .68514   -.08172 -.052072
-------------+----------------------------------------------------------------
      obs. P |     .25163
     pred. P |   .2409344  (at x-bar)
------------------------------------------------------------------------------
(*) dF/dx is for discrete change of dummy variable from 0 to 1
    z and P>|z| correspond to the test of the underlying coefficient being 0

[Note: males are 1.7 percentage points more likely to smoke.]

. mfx compute;

Marginal effects after probit
  y = Pr(smoker) (predict)
    = .24093439

[Note: those with a college degree are 21.5 percentage points less likely to smoke.]

------------------------------------------------------------------------------
    variable |      dy/dx   Std. Err.      z    P>|z|  [   95% C.I.   ]        X
-------------+----------------------------------------------------------------
         age |  -.0003951     .00029    -1.36   0.173  -.000964  .000174  38.5474
     incomel |  -.0289139     .00472    -6.13   0.000   -.03816 -.019668   10.421
        male*|   .0166757      .0072     2.32   0.021   .002568  .030783   .39476
       black*|  -.0320621     .01023    -3.13   0.002  -.052111 -.012013  .111945
    hispanic*|  -.0658551     .01259    -5.23   0.000  -.090536 -.041174  .060709
      hsgrad*|   -.053335     .01302    -4.10   0.000   -.07885  -.02782  .335527
     somecol*|  -.1062358     .01228    -8.65   0.000  -.130308 -.082164  .268545
     college*|  -.2149199     .01146   -18.76   0.000  -.237378 -.192462  .329376
       worka*|  -.0668959     .00756    -8.84   0.000   -.08172 -.052072   .68514
------------------------------------------------------------------------------
(*) dy/dx is for discrete change of dummy variable from 0 to 1

Ten additional years of age reduce the smoking rate by four tenths of a percentage point.
A 10 percent increase in income will reduce smoking by .29 percentage points.
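Both statements are arithmetic on the dF/dx column; a quick Python check (values copied from the dprobit output):

```python
# Reading marginal effects off the dF/dx column of the dprobit output.
age_dfdx = -0.0003951       # change in Pr(smoke) per year of age
incomel_dfdx = -0.0289139   # change in Pr(smoke) per unit of log income

print(round(10 * age_dfdx, 4))        # ten years of age: about -0.004
print(round(0.1 * incomel_dfdx, 5))   # a 10% income rise is ~0.1 in logs
```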

GOODNESS OF FIT MEASURES
In ML estimations, there is no such measure as the R2.
But the log likelihood measure can be used to assess the goodness of fit. Note the following:
The higher the number of observations, the lower the joint probability, and the more the LL measure goes towards minus infinity.
Given the number of observations, the better the fit, the higher the LL measure (since it is always negative, the closer to zero it is).
The philosophy is to compare two models by looking at their LL values. One is the constrained model, the other is the unconstrained model.

GOODNESS OF FIT MEASURES
A model is said to be constrained when the observer sets the parameters associated with some variable to zero.
A model is said to be unconstrained when the observer releases this assumption and allows the parameters associated with some variable to differ from zero.
For example, we can compare two models: one with no explanatory variables and one with all our explanatory variables. The one with no explanatory variables implicitly assumes that all parameters are equal to zero. Hence it is the constrained model, because we (implicitly) constrain the parameters to be zero.

THE LIKELIHOOD RATIO TEST (LR TEST)
The most used measure of goodness of fit in ML estimations is the likelihood ratio, which compares the unconstrained model with the constrained model. This difference is distributed chi-square.
If the difference in the LL values is (not) important, it is because the set of explanatory variables brings in (in)significant information. The null hypothesis H0 is that the explanatory variables bring no significant information:
  LR = 2[ln L(unc) - ln L(c)]
High LR values will lead the observer to reject hypothesis H0 and accept the alternative hypothesis Ha that the set of explanatory variables does significantly explain the outcome.

THE MCFADDEN PSEUDO R2
We also use the McFadden (1973) pseudo R2. Its interpretation is analogous to the OLS R2, but it is biased downward and remains generally low.
The pseudo-R2 also compares the unconstrained model with the constrained model, and it lies between 0 and 1:
  Pseudo R2(MF) = [ln L(c) - ln L(unc)]/ln L(c) = 1 - ln L(unc)/ln L(c)

GOODNESS OF FIT MEASURES

Constrained model:

. logit inno

Iteration 0:  log likelihood = -205.30803

Logistic regression                         Number of obs =    431
                                            LR chi2(0)    =   0.00
                                            Prob > chi2   =      .
Log likelihood = -205.30803                 Pseudo R2     = 0.0000

------------------------------------------------------------------------
    inno |     Coef.   Std. Err.      z   P>|z|    [95% Conf. Interval]
---------+--------------------------------------------------------------
   _cons |  1.494183   .1244955   12.00   0.000    1.250177     1.73819
------------------------------------------------------------------------

LR = 2[ln L(unc) - ln L(c)] = 2[-163.45 - (-205.31)] = 83.71

Unconstrained model:

. logit inno lrdi lassets spe biotech, nolog

Logistic regression                         Number of obs =    431
                                            LR chi2(4)    =  83.71
                                            Prob > chi2   = 0.0000
Log likelihood = -163.45352                 Pseudo R2     = 0.2039

------------------------------------------------------------------------
    inno |     Coef.   Std. Err.      z   P>|z|    [95% Conf. Interval]
---------+--------------------------------------------------------------
    lrdi |  .7527497   .2110683    3.57   0.000    .3390634    1.166436
 lassets |   .997085   .1368534    7.29   0.000    .7288574    1.265313
     spe |  .4252844   .4204924    1.01   0.312   -.3988654    1.249434
 biotech |  3.799953    .577509    6.58   0.000    2.668056     4.93185
   _cons | -11.63447   1.937191   -6.01   0.000   -15.43129   -7.837643
------------------------------------------------------------------------

Ps.R2(MF) = 1 - ln L(unc)/ln L(c) = 1 - 163.45/205.31 = 0.204

OTHER USAGE OF THE LR TEST
The LR test can also be generalized to compare any two models, the constrained one being nested in the unconstrained one.
Any variable which is added to a model can be tested for its explanatory power as follows:
  logit [constrained model]
  est store [name1]
  logit [unconstrained model]
  est store [name2]
  lrtest name2 name1

GOODNESS OF FIT MEASURES

. logit inno lrdi lassets spe, nolog

Logistic regression                         Number of obs =    431
                                            LR chi2(3)    =  26.93
                                            Prob > chi2   = 0.0000
Log likelihood = -191.84522                 Pseudo R2     = 0.0656

------------------------------------------------------------------------
    inno |     Coef.   Std. Err.      z   P>|z|    [95% Conf. Interval]
---------+--------------------------------------------------------------
    lrdi |  .9275668   .1979951    4.68   0.000    .5395037     1.31563
 lassets |  .3032756   .0792032    3.83   0.000    .1480402    .4585111
     spe |  .3739987   .3800765    0.98   0.325   -.3709376    1.118935
   _cons | -.4703812   .9313494   -0.51   0.614   -2.295793     1.35503
------------------------------------------------------------------------

. est store model1


. logit inno lrdi lassets spe biotech, nolog

Logistic regression                         Number of obs =    431
                                            LR chi2(4)    =  83.71
                                            Prob > chi2   = 0.0000
Log likelihood = -163.45352                 Pseudo R2     = 0.2039

------------------------------------------------------------------------
    inno |     Coef.   Std. Err.      z   P>|z|    [95% Conf. Interval]
---------+--------------------------------------------------------------
    lrdi |  .7527497   .2110683    3.57   0.000    .3390634    1.166436
 lassets |   .997085   .1368534    7.29   0.000    .7288574    1.265313
     spe |  .4252844   .4204924    1.01   0.312   -.3988654    1.249434
 biotech |  3.799953    .577509    6.58   0.000    2.668056     4.93185
   _cons | -11.63447   1.937191   -6.01   0.000   -15.43129   -7.837643
------------------------------------------------------------------------

. est store model2

. lrtest model2 model1

Likelihood-ratio test                       LR chi2(1)  =  56.78
(Assumption: model1 nested in model2)       Prob > chi2 = 0.0000

LR test on the added variable (biotech):
LR = 2[ln L(unc) - ln L(c)] = 2[-163.45 - (-191.85)] = 56.78

. * using a wald test, test the null hypothesis that;
. * all the education coefficients are zero;
. test hsgrad somecol college;

 ( 1)  hsgrad = 0
 ( 2)  somecol = 0
 ( 3)  college = 0

          chi2(  3) =  504.78
        Prob > chi2 =  0.0000

QUALITY OF PREDICTIONS
Lastly, one can compare the quality of the predictions with the observed outcome variable (a dummy variable).
One must assume that when the predicted probability is higher than 0.5, the prediction is that the event will (most likely) occur.
Then one can compare how good the predictions are against the actual outcome variable. STATA does this for us:
  estat class

SENSITIVITY & SPECIFICITY

                True pos   True neg
Classify pos       a          b
Classify neg       c          d
Total             a+c        b+d

Sensitivity = a/(a+c), false neg = c/(a+c)
Specificity = d/(b+d), false pos = b/(b+d)
Accuracy = W sensitivity + (1-W) specificity, for a chosen weight W
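These definitions can be sketched in Python; the counts are taken from the estat class example that follows, and weighting by the share of true positives reproduces Stata's "correctly classified" figure:

```python
# Sensitivity, specificity, and weighted accuracy from a 2x2 table
# (letters follow the table above: rows = classified +/-, columns = true +/-).
a, b, c, d = 337, 51, 15, 28   # counts from the estat class example
sensitivity = a / (a + c)      # Pr(classified + | true +)
specificity = d / (b + d)      # Pr(classified - | true -)
# choosing W = share of true positives gives overall accuracy
w = (a + c) / (a + b + c + d)
accuracy = w * sensitivity + (1 - w) * specificity
print(f"{sensitivity:.2%} {specificity:.2%} {accuracy:.2%}")   # 95.74% 35.44% 84.69%
```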

QUALITY OF PREDICTIONS

. estat class

Logistic model for inno

              -------- True --------
Classified |         D           ~D  |      Total
-----------+-------------------------+-----------
     +     |       337           51  |        388
     -     |        15           28  |         43
-----------+-------------------------+-----------
   Total   |       352           79  |        431

Classified + if predicted Pr(D) >= .5
True D defined as inno != 0
--------------------------------------------------
Sensitivity                    Pr( +| D)    95.74%
Specificity                    Pr( -|~D)    35.44%
Positive predictive value      Pr( D| +)    86.86%
Negative predictive value      Pr(~D| -)    65.12%
--------------------------------------------------
False + rate for true ~D       Pr( +|~D)    64.56%
False - rate for true D        Pr( -| D)     4.26%
False + rate for classified +  Pr(~D| +)    13.14%
False - rate for classified -  Pr( D| -)    34.88%
--------------------------------------------------
Correctly classified                        84.69%
--------------------------------------------------

OTHER BINARY CHOICE MODELS
The Logit model is only one way of modeling binary choice.
The Probit model is another way of modeling binary choice. It is actually more used than the logit model and assumes a normal distribution (not a logistic one) for the z values.
The complementary log-log model is used where the occurrence of the event is very rare, with the distribution of z being asymmetric.

OTHER BINARY CHOICE MODELS

Probit model:
  Pr(Y=1|X) = Φ(Xβ) = ∫ from -∞ to Xβ of (1/√(2π)) exp(-z²/2) dz

Complementary log-log model:
  Pr(Y=1|X) = 1 - exp[-exp(Xβ)]
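The three link functions can be compared side by side with a short Python sketch (math.erf gives the standard normal CDF for the probit without needing extra libraries; the function names are illustrative):

```python
import math

# The three binary-choice link functions side by side.
def logit_p(xb):
    return 1 / (1 + math.exp(-xb))

def probit_p(xb):
    return 0.5 * (1 + math.erf(xb / math.sqrt(2)))

def cloglog_p(xb):
    return 1 - math.exp(-math.exp(xb))

for xb in (-2.0, 0.0, 2.0):
    print(xb, round(logit_p(xb), 3), round(probit_p(xb), 3), round(cloglog_p(xb), 3))
# note the asymmetry of the cloglog: at xb = 0 it gives 1 - exp(-1), not 0.5
```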

REFERENCES
http://personal.ecu.edu/whiteheadj/data/logit/
http://personal.ecu.edu/whiteheadj/data/logit/logitpap.htm
E-mail: WhiteheadJ@mail.ecu.edu
