
Logistic regression

Université d’Ottawa - Bio 4518 - Biostatistiques appliquées


© Antoine Morin et Scott Findlay 1
11-02-12 01:20
Logistic regression

• Member of the GLM family


• Unlike standard linear regression, the
dependent variable is binary (0, 1), so that each
case's value is either 0 or 1.
• Normally, 0 is taken to mean the absence of
some attribute, 1 its presence.
• Logistic regression can be extended to the
case where there are more than two possible
values for the dependent variable (e.g. low,
medium, high – multinomial regression)

Example: incidence of heart attacks in relation to age

Linear regression is inappropriate because:
• Residuals are not normal
• Residuals are heteroscedastic
• Predicted values can be nonsense (e.g. what does a predicted value of 0.3 mean?)

[Figure: cardiaque (0/1; y-axis from −0.2 to 1.0) plotted against age (10 to 90).]
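One way to see the problem with predicted values is to fit a straight line to 0/1 data and evaluate it at the ends of the age axis. The sketch below does this with a small invented data set (the ages and outcomes are illustrative assumptions, not the course data); the helper name `ols_line` is also hypothetical.

```python
# Illustrative data only: ages and 0/1 outcomes invented for this sketch.
ages = [20, 30, 40, 50, 60, 70, 80, 90]
attack = [0, 0, 0, 0, 1, 0, 1, 1]

def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

intercept, slope = ols_line(ages, attack)

# A straight line is unbounded, so it can leave the interval [0, 1].
pred_young = intercept + slope * 10    # fitted value at age 10
pred_old = intercept + slope * 100     # fitted value at age 100
print(pred_young, pred_old)
```

With these invented data the fitted value at age 10 is negative and the one at age 100 exceeds 1, so neither can be read as a probability.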

Logistic regression: dependent variable
• Variable of interest is the probability p of obtaining a one as a function of the predictor variables
• The magnitude of the regression coefficients in the model depends on the distribution of the predictor variables in the two groups, Y = 0 and Y = 1.
[Figure: two panels of Y (0 to 1) plotted against X.]
Dependent variable: logit(p)
$$\mathrm{logit}(p) = y = \ln\left(\frac{p}{1-p}\right)$$
$$p = \frac{e^{y}}{1+e^{y}} = \frac{e^{\mathrm{logit}(p)}}{1+e^{\mathrm{logit}(p)}}$$
[Figure: p (in %, axis from 20 to 100) plotted against logit (−4 to 4).]
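The logit and its inverse can be checked numerically. This is a minimal sketch; the function names `logit` and `inv_logit` are chosen here for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(y):
    """Probability corresponding to a log-odds value y."""
    return math.exp(y) / (1.0 + math.exp(y))

# Tabulate p over the logit range shown on the slide's axis.
for y in (-4, -2, 0, 2, 4):
    print(y, round(inv_logit(y), 3))
```

A logit of 0 corresponds to p = 0.5, and applying `inv_logit` to `logit(p)` returns p, which is the sense in which the two formulas are inverses of each other.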

Logistic regression: model coefficients
• Negative regression coefficient (β < 0) means probability of success decreases with increasing value of predictor.
• Positive regression coefficient (β > 0) means probability of success increases with increasing value of predictor.
[Figure: two panels of Y (0 to 1) plotted against X, one for β > 0 and one for β < 0.]
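The effect of the sign of the coefficient can be illustrated with a short sketch. The intercepts and slopes below are arbitrary values chosen for illustration, and `p_success` is a hypothetical helper name; it uses 1 / (1 + e^(−y)), which is algebraically the same as e^y / (1 + e^y).

```python
import math

def p_success(x, b0, b1):
    """Logistic model: probability of Y = 1 at predictor value x."""
    y = b0 + b1 * x
    return 1.0 / (1.0 + math.exp(-y))

xs = [0, 1, 2, 3, 4]
rising = [p_success(x, -2.0, 1.0) for x in xs]    # slope > 0
falling = [p_success(x, 2.0, -1.0) for x in xs]   # slope < 0

print([round(p, 3) for p in rising])
print([round(p, 3) for p in falling])
```

The first list increases with x and the second decreases, matching the direction given by the sign of the slope.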
Logistic regression: model coefficients
• The magnitude of the regression coefficient depends on how abruptly p changes with X, with large values indicating abrupt change.
[Figure: two panels of Y (0 to 1) plotted against X, one for β > 0, small and one for β > 0, large.]
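How the size of the coefficient controls the abruptness of the change can be sketched numerically. For a logistic curve the slope dp/dX equals β·p·(1 − p), which is largest at p = 0.5, where it is β/4. The code below checks this with a numerical derivative; the two slope values (0.5 and 5) and the helper names are illustrative assumptions.

```python
import math

def p_success(x, b0, b1):
    """Logistic model: probability of Y = 1 at predictor value x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def slope_at_midpoint(b1, h=1e-6):
    """Numerical dp/dx where p = 0.5 (x = 0, because the intercept is 0)."""
    return (p_success(h, 0.0, b1) - p_success(-h, 0.0, b1)) / (2 * h)

small = slope_at_midpoint(0.5)   # gentle transition from 0 to 1
large = slope_at_midpoint(5.0)   # abrupt transition from 0 to 1
print(small, large)
```

A tenfold larger coefficient gives a tenfold steeper curve at its midpoint.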
Least squares estimation (LSE)
• An ordinary least squares (OLS) estimate of a model parameter Θ is that which minimizes the sum of squared differences between observed and predicted values:
$$SS_R = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
• Predicted values are derived from some model whose parameters we wish to estimate:
$$\hat{y} = f(x, \Theta)$$
[Figure: SSR as a function of Θ; the minimum marks the OLS estimate.]
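The idea of choosing Θ to minimize the residual sum of squares can be sketched with a one-parameter model. Here the model is assumed to be ŷ = Θ·x, the data are invented for illustration, and the minimum is located by a simple grid scan and compared with the closed-form solution for this model.

```python
# Illustrative data, invented for this sketch.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

def ssr(theta):
    """Residual sum of squares for the assumed model y_hat = theta * x."""
    return sum((yi - theta * xi) ** 2 for xi, yi in zip(x, y))

# Scan candidate values of theta and keep the one with the smallest SSR.
grid = [i * 0.01 for i in range(401)]   # 0.00, 0.01, ..., 4.00
theta_grid = min(grid, key=ssr)

# Closed-form OLS estimate for a line through the origin.
theta_exact = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)

print(theta_grid, theta_exact)
```

The grid minimum agrees with the closed-form estimate to within the grid spacing, which is what the bowl-shaped SSR curve implies.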

Maximum likelihood estimation (MLE)
• A maximum likelihood estimate (MLE) of a model parameter Θ for a given distribution is that which maximizes the probability of generating the observed sample data.
• MLEs are obtained by maximizing the likelihood function
$$L = \prod_{i=1}^{n} \phi(x_i; \Theta)$$
• …or equivalently, by minimizing the negative log-likelihood function
$$-\log L = -\sum_{i=1}^{n} \ln \phi(x_i; \Theta)$$
[Figure: L and −log L as functions of Θ; the maximum of L and the minimum of −log L both occur at the MLE.]
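The equivalence between maximizing L and minimizing −log L can be checked on a small example. The sketch below assumes a Bernoulli distribution with a single parameter p and an invented sample of 10 observations containing 3 ones; the grid search is only for illustration.

```python
import math

# Illustrative sample, invented for this sketch: 3 ones out of 10.
sample = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

def likelihood(p):
    """L = product of the Bernoulli probabilities of each observation."""
    result = 1.0
    for xi in sample:
        result *= p if xi == 1 else (1.0 - p)
    return result

def neg_log_likelihood(p):
    """-log L = minus the sum of the log probabilities."""
    return -sum(math.log(p if xi == 1 else 1.0 - p) for xi in sample)

grid = [i / 100 for i in range(1, 100)]     # 0.01, 0.02, ..., 0.99
p_max_l = max(grid, key=likelihood)
p_min_nll = min(grid, key=neg_log_likelihood)

print(p_max_l, p_min_nll, sum(sample) / len(sample))
```

Both searches select the same value of p, and for the Bernoulli case it coincides with the sample proportion of ones.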

How are the model parameters
estimated?
• Estimated not by least squares, but rather by maximum likelihood
– Based on the likelihood of obtaining the observed results under different values of the model parameters
– In principle, parameter estimates should converge to those maximizing the log-likelihood, i.e. minimizing −log L
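The convergence of parameter estimates toward the maximum of the log-likelihood can be sketched with a small iterative fit. The code below uses Newton-Raphson steps for a logistic model with one predictor; this is an illustrative choice, not necessarily the exact algorithm of any particular statistics package, and the data and function names are invented for the sketch.

```python
import math

# Illustrative data, invented for this sketch (the two groups overlap).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]

def prob(b0, b1, xi):
    """Logistic model: probability of Y = 1 at predictor value xi."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))

def neg_log_lik(b0, b1):
    """-log L for binary data under the logistic model."""
    return -sum(
        yi * math.log(prob(b0, b1, xi)) + (1 - yi) * math.log(1.0 - prob(b0, b1, xi))
        for xi, yi in zip(x, y)
    )

def fit_logistic(max_iter=50, tol=1e-10):
    """Newton-Raphson iterations starting from b0 = b1 = 0."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = prob(b0, b1, xi)
            w = p * (1.0 - p)
            g0 += yi - p              # gradient of log L w.r.t. b0
            g1 += (yi - p) * xi       # gradient of log L w.r.t. b1
            h00 += w                  # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (information matrix) * step = gradient.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) < tol and abs(step1) < tol:
            break
    return b0, b1

b0_hat, b1_hat = fit_logistic()
print(b0_hat, b1_hat, neg_log_lik(b0_hat, b1_hat))
```

At the fitted values −log L is lower than at the starting values, and the gradient of the log-likelihood is essentially zero, which is what convergence means here.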

Hypothesis testing

• Likelihood
– Deviance = −2 log L
– Is approximately distributed as chi-square
– Measures the variation unexplained by the fitted model, analogous to the residual sum of squares.
• Model comparison
– Change in deviance when model terms are added (or deleted) is also approximately distributed as chi-square, so hypotheses about individual model terms can be tested.
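Written out, the test can be sketched as follows. The symbols are introduced here for illustration: D_0 and L_0 refer to the simpler model, D_1 and L_1 to the model with q additional terms. For binary (0/1) data the saturated model has a log-likelihood of zero, so the deviance reduces to minus twice the log-likelihood of the fitted model.

```latex
\begin{aligned}
D &= -2 \log L \\
\Delta D &= D_{0} - D_{1} = -2\left(\log L_{0} - \log L_{1}\right) \\
\Delta D &\sim \chi^{2}_{q} \quad \text{(approximately, under the null hypothesis)}
\end{aligned}
```

A large drop in deviance relative to the chi-square distribution with q degrees of freedom is evidence that the added terms improve the fit.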

Model assumptions

• Observations are independent


• Dependent variable has a binomial
distribution
• Little error in measurement of dependent
variables.

Logistic regression in S-Plus
*** Generalized Linear Model ***

Call: glm(formula = cardiaque ~ age, family = binomial(link = logit), data = SDF12,
    na.action = na.exclude, control = list(epsilon = 0.0001, maxit = 50, trace = F))

Deviance Residuals:
       Min         1Q    Median         3Q      Max
 -1.545637 -0.5732664 -0.272312 -0.1404323 2.679875

Coefficients:
                  Value  Std. Error   t value
(Intercept) -7.76838060 0.376403465 -20.63844
age          0.09557905 0.005097055  18.75182

(Dispersion Parameter for Binomial family taken to be 1)
Null Deviance: 2050.515 on 1999 degrees of freedom
Residual Deviance: 1490.001 on 1998 degrees of freedom
Number of Fisher Scoring Iterations: 4
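The deviances in this output can be used to test the effect of age. The sketch below takes the null and residual deviances and the age coefficient straight from the output above; the helper `chi2_sf_1df` is a hypothetical name, and it relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable.

```python
import math

# Values copied from the S-Plus output.
null_deviance, null_df = 2050.515, 1999
resid_deviance, resid_df = 1490.001, 1998

# Change in deviance when age is added, and the degrees of freedom used.
delta_dev = null_deviance - resid_deviance
delta_df = null_df - resid_df

def chi2_sf_1df(value):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(value / 2.0))

p_value = chi2_sf_1df(delta_dev)

# Ratio of the age coefficient to its standard error (the "t value" column).
ratio = 0.09557905 / 0.005097055

print(delta_dev, delta_df, p_value, ratio)
```

The change in deviance is about 560.5 on 1 degree of freedom, far beyond what a chi-square distribution with 1 degree of freedom would produce by chance, so age is a highly significant predictor in this model.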

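Incidence of heart attack in relation to age

$$y = \mathrm{logit}(p) = -7.77 + 0.0956\,\mathrm{Age}$$
$$p = \frac{e^{y}}{1+e^{y}} = \frac{e^{\mathrm{logit}(p)}}{1+e^{\mathrm{logit}(p)}} = \frac{e^{-7.77+0.0956\,\mathrm{Age}}}{1+e^{-7.77+0.0956\,\mathrm{Age}}}$$

[Figure: cardiaque (y-axis from −0.1 to 0.9) plotted against age (30 to 90), with the fitted logistic curve.]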
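Predicted probabilities can be computed directly from the coefficients reported in the S-Plus output (intercept −7.76838060, slope 0.09557905 per year of age). This is a minimal sketch; the function and variable names are chosen here for illustration.

```python
import math

# Coefficients copied from the S-Plus output.
B0 = -7.76838060   # intercept
B1 = 0.09557905    # change in logit(p) per year of age

def p_heart_attack(age):
    """Predicted probability of a heart attack at a given age."""
    y = B0 + B1 * age
    return math.exp(y) / (1.0 + math.exp(y))

# Age at which the predicted probability reaches 0.5 (logit = 0).
age_at_half = -B0 / B1

# Multiplicative change in the odds for each additional year of age.
odds_ratio_per_year = math.exp(B1)

for age in (30, 50, 70, 90):
    print(age, round(p_heart_attack(age), 3))
print(age_at_half, odds_ratio_per_year)
```

The predicted probability stays between 0 and 1 at every age, crosses 0.5 a little after age 81, and the odds of a heart attack rise by roughly 10% per year of age.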

Presence of post-operative kyphosis using
logistic regression

• Kyphosis: a binary variable indicating the presence/absence of a postoperative spinal deformity called kyphosis.
• Age: the age of the child in months.
• Number: the number of vertebrae involved in the spinal operation.
• Start: the beginning of the range of the vertebrae involved in the operation.

Evidence that the distribution of predictor
variables differs among levels of response
variable

The model

Testing hypotheses
