
Logistic Regression with Low Event Rate (or Rare Events)

1/28/15

Tejamoy Ghosh Data Science ATG - New Delhi, India

Contents:

Problem with logistic regression with low event rate

Way out

How to do them in SAS?

How to do them in R?


A typical conversation
Analyst 1: I'm in some trouble. My manager wants me to build a logistic
regression model, but I have only a 2% event rate in my data. Logistic
regression won't be a good choice here: the ML estimate will be biased.
Analyst 2: Not necessarily. It's the total count, rather than the percentage, of
events that matters. How many cases do you have for the rarer event, and
how big is your dataset?
Analyst 1: We've got about 1,800 events in a dataset of about 100,000
cases, a less-than-2% scenario.
Analyst 2: Hmm. With that many cases of the rarer event, you can very well
use logistic regression. There are methods to address such skewed, or sparse,
data situations.
Analyst 1: Wow, really? Please tell me more!
Analyst 2: There are a couple of alternatives. For one, you can use exact
logistic regression, which is meant for cases where the sample size is too small for
your usual logistic regression with regular maximum-likelihood-based
estimation. Another option in your scenario is the penalized-likelihood
estimation method. This second one has the advantage of being
computationally less demanding than the exact logistic method.

What's wrong with my regular logistic regression when the event rate is low?
Low event rate/Rare event:

In the current context, this refers to the scenario where, in a binary
outcome space (response/no-response, good/bad, default/no-default,
purchase/no-purchase, etc.), one of the two outcomes occurs far less often than the other.

Some real-life examples:

Suppose that in a sample of 1,000 applicants for a position only 20 are selected. Here the
event of being selected is the rare event, with a low event rate of 2%.
Suppose that in a sample of 100,000 purchases from an online retailer about 1,800 are
returned by the customer. Here the event of goods being returned is the rare event,
with a low event rate of 1.8%.
Chargebacks in credit card transactions
Goods returned in online retailing

Why is this a problem for logistic regression? It's still binary anyway.

The problem here is with the estimation method: the usual maximum-likelihood
method is susceptible to small-sample bias, and this bias depends
strongly on the count (as opposed to the percentage) of the rarer of
the two events.
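
To make the count-versus-percentage point concrete, here is a minimal Python sketch (Python is used only to keep the example self-contained; it is not the SAS or R code covered in this deck). It assumes the simplest possible case, an intercept-only logistic model, where the ML estimate of the log-odds is log(k/(n-k)) for k events in n cases, and it computes the small-sample bias by summing over the binomial distribution, conditioning on 0 < k < n so that the estimate is finite. The function names and the example sample sizes are illustrative assumptions, not part of the original slides.

```python
import math

def binom_logpmf(k, n, p):
    """Log of the binomial probability P(K = k) for n trials with event probability p."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def ml_logodds_bias(n, p):
    """Bias of the ML log-odds estimate log(k/(n-k)) in an intercept-only
    logistic model, conditional on 0 < k < n (so the estimate is finite)."""
    total_prob = 0.0
    expected = 0.0
    for k in range(1, n):
        prob = math.exp(binom_logpmf(k, n, p))
        total_prob += prob
        expected += prob * math.log(k / (n - k))
    true_logodds = math.log(p / (1 - p))
    return expected / total_prob - true_logodds

# Same 2% event rate, very different expected event counts (about 20 vs. about 2,000).
bias_small = ml_logodds_bias(1_000, 0.02)
bias_large = ml_logodds_bias(100_000, 0.02)
# A 20% event rate, but again only about 20 expected events.
bias_common = ml_logodds_bias(100, 0.20)

print(f"n=1,000   rate=2%  : bias = {bias_small:+.5f}")
print(f"n=100,000 rate=2%  : bias = {bias_large:+.5f}")
print(f"n=100     rate=20% : bias = {bias_common:+.5f}")
```

Under these assumptions, the bias at a 2% event rate shrinks sharply once the expected event count rises from about 20 to about 2,000, while a 20% event rate with only about 20 expected events still shows a bias of the same order of magnitude as the 2% case with about 20 events.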


What's the way out then?

In case of a small sample and/or very unbalanced binary data (say, when you
have just 20 cases in a sample of 1,000), exact logistic regression is to
be used.

Exact logistic regression provides an alternative to the maximum-likelihood
method for making inferences about the parameters of the logistic
regression model.
The method is based on the exact conditional distributions of the sufficient statistics for
the parameters of interest, so the estimates it gives do not
depend on asymptotic results.
It is useful for analyzing small or unbalanced binary data with covariates.
This method is usually very computationally intensive.
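
As a small illustration of what "exact" means here, consider the simplest case of one binary covariate. Conditioning on the total number of events (the sufficient statistic for the intercept), the number of events among cases with x = 1 follows a hypergeometric distribution under the null hypothesis of no effect, so the exact conditional test of the slope coincides with Fisher's exact test. The Python sketch below enumerates that distribution directly. Python is used only for self-containment, the function name is hypothetical, and the counts are made up for illustration.

```python
from math import comb

def exact_slope_pvalue(events_x1, n_x1, events_x0, n_x0):
    """Two-sided exact conditional p-value for H0: slope = 0 in a logistic
    model with a single binary covariate x."""
    total_events = events_x1 + events_x0
    n = n_x1 + n_x0
    denom = comb(n, total_events)

    def prob(t):
        # Hypergeometric probability that t of the events fall in the x = 1 group.
        return comb(n_x1, t) * comb(n_x0, total_events - t) / denom

    t_min = max(0, total_events - n_x0)
    t_max = min(n_x1, total_events)
    p_obs = prob(events_x1)
    # Sum the probabilities of all outcomes no more likely than the observed one
    # (the small relative tolerance guards against floating-point ties).
    return sum(prob(t) for t in range(t_min, t_max + 1)
               if prob(t) <= p_obs * (1 + 1e-9))

# Hypothetical sparse data: 5 events out of 10 when x = 1, 0 out of 10 when x = 0.
p_value = exact_slope_pvalue(events_x1=5, n_x1=10, events_x0=0, n_x0=10)
print(f"exact two-sided p-value: {p_value:.4f}")
```

With several covariates the conditional distribution has to be enumerated (or sampled) over many more configurations, which is where the computational cost mentioned above comes from.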

If, however, you have a larger count of the rarer of the two events, say
1,000 (even better if it's 2,000), in a sample of 100,000 with the same low
event rate (1% to 2%), you can use logistic regression, but the estimation should
be done using the penalized likelihood method (also called Firth's
penalized likelihood approach, after its inventor).


While we mentioned this method only in the context of the small-sample/rare-event
scenario, it is a general method for addressing separation, small
sample sizes, and bias in the parameter estimates.
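
The idea behind the penalized likelihood can be sketched in a few lines. The Python code below is a minimal illustration, not the SAS or R implementation: it assumes a model with an intercept and a single covariate, and it iterates on Firth's modified score, which adds the correction h * (0.5 - pi) to the usual residual y - pi, where h is the diagonal of the weighted hat matrix. The function name, the data, and the convergence settings are illustrative assumptions.

```python
import math

def firth_logistic(xs, ys, max_iter=100, tol=1e-10):
    """Firth-penalized fit of logit P(y=1) = b0 + b1*x for a single covariate.
    Returns (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        pis = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]
        ws = [p * (1.0 - p) for p in pis]
        # Fisher information matrix X'WX for the 2-parameter model.
        i00 = sum(ws)
        i01 = sum(w * x for w, x in zip(ws, xs))
        i11 = sum(w * x * x for w, x in zip(ws, xs))
        det = i00 * i11 - i01 * i01
        v00, v01, v11 = i11 / det, -i01 / det, i00 / det  # inverse information
        # Modified score: the usual (y - pi) plus the correction h * (0.5 - pi),
        # where h is the diagonal of the weighted hat matrix.
        u0 = u1 = 0.0
        for x, y, p, w in zip(xs, ys, pis, ws):
            h = w * (v00 + 2.0 * v01 * x + v11 * x * x)
            r = y - p + h * (0.5 - p)
            u0 += r
            u1 += r * x
        # Scoring step using the inverse information matrix.
        d0 = v00 * u0 + v01 * u1
        d1 = v01 * u0 + v11 * u1
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1

# Hypothetical separated data: no events at x = 0, 5 events out of 10 at x = 1.
xs = [0] * 10 + [1] * 10
ys = [0] * 10 + [1] * 5 + [0] * 5
b0, b1 = firth_logistic(xs, ys)
print(f"intercept = {b0:.4f}, slope = {b1:.4f}")
```

Ordinary maximum likelihood has no finite intercept for these data because one group contains no events, whereas the penalized fit stays finite. For a 2x2 table like this one, the Firth estimate coincides with adding 0.5 to each cell before computing the log odds ratio, which gives a quick sanity check of the sketch.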

How to do them in SAS


Exact Logistic SAS code


Proc Logistic Data = YourRareEventData descending;
   Freq CellCount; /* CellCount is the frequency (count) variable here */
   Model RareEvent = X1 X2;
   Exact X1 / estimate = both;
Run;

You can add other options for what you want to have in your
output.
The EXACT statement after the MODEL statement and the FREQ
statement are the key differences here.
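An alternative events/trials syntax (here RareEvent must hold the number of events and CellCount the number of trials for each covariate pattern):
Proc Logistic Data = YourRareEventData;
   Model RareEvent / CellCount = X1 X2;
   Exact X1 / estimate = both;
Run;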


Penalized Logistic SAS code

Proc Logistic Data = YourRareEventData;
   Class CategoricalVbl1 CategoricalVbl2 / param=ref;
   Model Y = CategoricalVbl1 CategoricalVbl2 X1 X2 / firth;
Run;

You can add other options for what you want to have in your output.
The FIRTH option in the MODEL statement is the key here.


How to do them in R


Exact Logistic in R

Package required: elrm
This package implements (approximate) exact inference for binomial logistic
regression models in R.


Penalized Logistic in R

Package: logistf
This package runs Firth's bias-reduced logistic regression
approach, with confidence intervals for the parameter estimates
based on the penalized profile likelihood.

Another package, penalized, fits penalized
generalized linear models and other penalized regression
models.


Arup Guha - Indian Institute of Foreign Trade - New Delhi, India

Data Sciences ATG

Free Solutions to Challenging Data Problems

What we do
FREE analytics help to stuck analysts and consultants
FREE snapshot to companies considering entering analytics
FREE data analysis help to students, researchers and faculty
Customized analytics solutions to institutes and companies
Apply analytics in non-traditional areas including films & education

EDUCATION
Econometrics, Statistics, Economics
Vanderbilt, Cincinnati, Indian Statistical Institute, Jawaharlal Nehru University
Research Scholars
Journal Articles

EXPERTISE
Predictive modelling, Segmentation, Market research, Clickstream data analysis,
Forecasting, Financial Time Series, Simulation, Bayesian econometrics,
Machine Learning Techniques, Decision Trees
SAS, SPSS, R, Octave, Stata, Eviews, Matlab, Maxima, Netlogo

EXPERIENCE
18 years combined
Marketing analytics, Risk analytics, Financial analytics
Analytic Solution & Tools development, Analytics CoE setup, Advanced Analytics Training

EXISTING/SERVED CLIENTS
A large Global Beverage company
A small insurance company
A renowned business school
A large Global HR & Compensation Consulting Group
A large Global IT Research group
A third party analytics vendor
A mid sized analytics consulting

What we don't do
Quick and dirty back of the envelope calculation
Use jargon presentations with little impact on your problem
Hide that we are stumped
