BUILDING PREDICTIVE MODELS OF ELECTION RESULTS – an application of Logistic regression

2009
Abstract

This document attempts to teach students logistic regression with the help of a simple real-world example. The 2009 Lok Sabha election results for Karnataka were analyzed to assess whether there was any dependence between the available information about the candidates and the final election results. The only significant variables were the "Political Party" and "Movable Assets" of the politician in question.
Table of Contents

Introduction
Tips to Getting Started
    Objective
    Data Understanding
    Data Preparation
    Modeling
    Evaluation
Conclusion
Further Discussion
INTRODUCTION:

The compulsory disclosure of information about the backgrounds of candidates in an election ensures that voters have sufficient information about the candidates to make an informed choice while casting their votes. This information includes assets and liabilities as well as criminal antecedents, if any. Thus, a fairly large amount of data on candidate backgrounds has become available. It is interesting to see whether the information thus made available made any difference to the outcome of the elections. It is also important to see whether this information could be used to predict, or forecast, the results of the elections.
TIPS TO GETTING STARTED:

Business Understanding: This is the initial phase, focusing on understanding the project objectives and requirements from a business perspective. The available information about the candidates can be used as data to fit a model that determines which candidate will win the election. Alternatively, the voters can use it to make a fair choice in selecting their leader. Thus, the objective of this research paper is to develop predictive models that could be used for predicting the outcomes of the election.
Data Understanding: The data on the profiles of the candidates in the 2009 Karnataka Lok Sabha election were taken from myneta.com for this research paper. There are a total of 28 constituencies for which elections were held in 2009, with 421 candidates altogether. The election result for each candidate (win or loss) is used as the dependent variable for the predictive models. This is treated as a binary categorical variable. In addition, a number of variables on which information was available are used as independent variables. These variables included:

• Age of the candidate
• Gender
• Educational qualification
• Number of candidates in the constituency
• Type of political party
• Win or loss of the candidate
• Movable assets
• Immovable assets
• Total assets of the candidate
• Whether the candidate has any liabilities to the government
• Whether the candidate has any liabilities to financial institutions
• Whether the candidate had committed any crimes or not
Data Preparation: Here, the dependent variable and several of the independent variables are categorical. So we need to transform the categorical variables using dummy variables. The statistical software will automatically transform the categorical variables into dummy variables at the start of the analysis. Given below is a description of the variables used in this paper to build the model.
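The dummy coding described above can also be sketched by hand, which may help students see what the software does automatically. This is a minimal illustration, not the software's own routine; the function name `dummy_code` and the six party levels used in the example are our own choices.

```python
def dummy_code(value, levels):
    """One-hot (dummy) encode a categorical value against its levels.

    The last level is treated as the reference group, so a k-level
    variable yields k-1 indicator columns, as logistic regression
    software typically does.
    """
    return [1 if value == lvl else 0 for lvl in levels[:-1]]

# Hypothetical example: a party variable with 6 levels, reference = level 6.
party_levels = [1, 2, 3, 4, 5, 6]
print(dummy_code(1, party_levels))  # -> [1, 0, 0, 0, 0]
print(dummy_code(6, party_levels))  # reference group -> [0, 0, 0, 0, 0]
```

Note that the reference level gets all zeros; its effect is absorbed into the model's constant.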
DEPENDENT VARIABLE

winloss — win = 1, loss = 0

INDEPENDENT VARIABLES

(DEMOGRAPHIC CHARACTERISTICS)
ID — Candidate ID
Age — Age of the candidate
Gen — Male = 1, Female = 0 (dichotomous variable)
Edu — Educational level of the candidate, divided into 5 categories: primary = 1, high school = 2, pre-university = 3, graduate = 4, postgraduate = 5 (categorical variable)

(POLITICAL FACTORS)
No. of candidates — Number of candidates in the constituency, binned into 4 categories: <=10, 11 to 15, 16 to 20, above 20
PolParty — Political party of the candidate: BJP = 1, Congress = 2, independent = 3, JD = 4, other national party = 5, other regional party = 6 (categorical variable)

(OWNERSHIP)
Movasset — Whether the candidate owns any movable assets: yes = 1, no = 0
Immovasset — Whether the candidate owns any immovable assets: yes = 1, no = 0
Totlasset — Total assets of the candidate (continuous variable)

(LIABILITIES)
GovtDues — Whether the candidate has any government dues: yes = 1, no = 0
BankDues — Whether the candidate has any bank dues: yes = 1, no = 0

(OTHER FACTORS)
Crime — Whether the candidate had committed any crime: yes = 1, no = 0
Modeling: Since the dependent variable is categorical in nature, predictive models that revolve around regression techniques can be used for this specific case; logistic regression is ideal when we have a mixture of numerical and categorical regressors. A brief description of the technique is given below.
Logistic regression is a multiple regression with an outcome (or dependent) variable that is a categorical dichotomy and explanatory variables that can be either continuous or categorical. In other words, the interest is in predicting which of two possible events is going to happen, given certain other information. The dependent variable in logistic regression is usually dichotomous: it can take the value 1 with probability of success θ, or the value 0 with probability of failure 1−θ. This type of variable is called a Bernoulli (or binary) variable.

The independent or predictor variables in logistic regression can take any form. That is, logistic regression makes no assumption about the distribution of the independent variables. They do not have to be normally distributed, linearly related, or of equal variance within each group.

The relationship between the predictor and response variables is not a linear function in logistic regression; instead, the logistic regression function is used, which is the logit transformation of θ:
logit(θ) = log( θ / (1 − θ) ) = α + β₁x₁ + β₂x₂ + … + βₖxₖ

where α = the constant of the equation and βᵢ = the coefficients of the predictor variables xᵢ.

An alternative form of the logistic regression equation is:

θ = e^(α + β₁x₁ + … + βₖxₖ) / (1 + e^(α + β₁x₁ + … + βₖxₖ))
The estimation of the parameters in logistic regression analysis is done through maximum likelihood techniques. The idea behind the method is to find the parameter values that make the observed values most likely to have occurred; i.e., the method maximises the probability of obtaining the sample we got.
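The maximum-likelihood idea can be illustrated with a toy fit of a one-predictor logistic regression by gradient ascent on the log-likelihood. This is a pedagogical sketch with made-up data, not the routine statistical software actually uses (software typically applies Newton–Raphson / iteratively reweighted least squares):

```python
import math

def fit_logistic(xs, ys, lr=0.05, steps=2000):
    """Fit P(y=1) = 1/(1+exp(-(a + b*x))) by gradient ascent
    on the log-likelihood, returning the estimates (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(a + b * x)))
            ga += y - p          # d(log-likelihood)/da
            gb += (y - p) * x    # d(log-likelihood)/db
        a += lr * ga
        b += lr * gb
    return a, b

# Hypothetical toy data: larger x tends to go with y = 1.
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 1, 0, 1, 1]
a, b = fit_logistic(xs, ys)
# b comes out positive: higher x raises the odds of y = 1.
```

The returned (a, b) are the values that make this observed sample most probable, which is exactly the maximum-likelihood criterion described above.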
The process by which coefficients are tested for significance, for inclusion in or elimination from the model, involves several different techniques. Some of them are the Wald test, the likelihood‐ratio test, and the Hosmer‐Lemeshow goodness-of-fit test.
The Hosmer and Lemeshow chi‐square test of goodness of fit is the recommended test for the overall fit of a binary logistic regression model. If the p‐value of the H‐L goodness‐of‐fit test is greater than .05, as we want for well‐fitting models, we fail to reject the null hypothesis that there is no difference between observed and model‐predicted values, implying that the model's estimates fit the data at an acceptable level. That is, well‐fitting models show nonsignificance on the H‐L goodness‐of‐fit test, indicating that model predictions are not significantly different from the observed values.
Evaluation: The data were analyzed using statistical software. A sample of the candidate data is shown below.
ID | polparty | crime | edu | age | Ttlasset | liabilities | gen | winloss
1 | 4 | 0 | 2 | 2 | 54406000 | 0 | 1 | 0
2 | 4 | 0 | 2 | 2 | 100000 | 0 | 1 | 0
3 | 4 | 0 | 4 | 1 | 1000000 | 0 | 1 | 0
4 | 5 | 0 | 5 | 3 | 406400 | 0 | 1 | 0
5 | 4 | 0 | 4 | 1 | 0 | 1 | 1 | 0
6 | 5 | 0 | 5 | 2 | 6371000 | 1 | 1 | 0
7 | 4 | 1 | 1 | 3 | 7578500 | 1 | 1 | 0
8 | 1 | 0 | 5 | 4 | 16763000 | 1 | 1 | 1
9 | 4 | 0 | 1 | 1 | 600000 | 1 | 1 | 0
10 | 2 | 0 | 1 | 4 | 30411328 | 1 | 1 | 0
11 | 4 | 0 | 5 | 4 | 3380000 | 0 | 1 | 0
12 | 4 | 0 | 1 | 2 | 0 | 0 | 1 | 0
13 | 6 | 0 | 4 | 3 | 11415000 | 1 | 1 | 0
14 | 4 | 0 | 4 | 1 | 630000 | 1 | 1 | 0
15 | 4 | 0 | 2 | 1 | 20000 | 0 | 1 | 0
16 | 4 | 0 | 5 | 2 | 8000000 | 0 | 1 | 0
17 | 4 | 0 | 3 | 3 | 195000 | 0 | 1 | 0
18 | 6 | 0 | 3 | 4 | 1173000 | 1 | 1 | 0
19 | 4 | 0 | 4 | 3 | 865000 | 0 | 1 | 0
This part of the output describes a "null model", which is a model with no predictors and just the intercept.

Classification Table (a, b) — Step 0

Observed | Predicted 0 | Predicted 1 | Percentage Correct
WINLOSS = 0 | 390 | 0 | 100.0
WINLOSS = 1 | 28 | 0 | .0
Overall Percentage | | | 93.3

a. Constant is included in the model.
b. The cut value is .500
This gives the percentage of cases for which the dependent variable was correctly predicted given the model; here it is 93.3%.
This is the Wald chi‐square test, which tests the null hypothesis that the constant equals 0. This hypothesis is rejected because the p‐value (listed in the column called "Sig.") is smaller than the critical p‐value of .05 (or .01). Hence, we conclude that the constant is not 0.
This section contains the overall test of the model (in the "Hosmer‐Lemeshow Test" table) and the coefficients and odds ratios (in the "Variables in the Equation" table).
Cox & Snell R Square and Nagelkerke R Square are pseudo R‐squares. Logistic regression does not have an equivalent to the R‐squared found in OLS regression. Here the Cox & Snell R Square is 0.276 and the Nagelkerke R Square is 0.712; both lie between 0 and 1 and indicate an improvement from the null model to the fitted model.
Here the null hypothesis is that there is no difference between observed and model‐predicted values. Since the p‐value of the H‐L goodness‐of‐fit test is greater than .05, we fail to reject this hypothesis. This implies that the model fits the data at an acceptable level.
Classification Table (a) — Step 1

Observed | Predicted 0 | Predicted 1 | Percentage Correct
WINLOSS = 0 | 383 | 7 | 98.2
WINLOSS = 1 | 10 | 18 | 64.3
Overall Percentage | | | 95.9

a. The cut value is .500
This table shows how many cases are correctly predicted (383 cases are observed to be 0 and are correctly predicted to be 0; 18 cases are observed to be 1 and are correctly predicted to be 1), and how many cases are not correctly predicted (7 cases are observed to be 0 but are predicted to be 1; 10 cases are observed to be 1 but are predicted to be 0). The overall percentage of cases correctly predicted by the model (in this case, the full model that we specified) is 95.9%. This percentage has increased from 93.3% for the null model to 95.9% for the full model.
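The percent-correct figures in the classification table can be reproduced directly from the four cell counts. A small sketch (the function name and argument labels are our own):

```python
def classification_summary(tn, fp, fn, tp):
    """Percent-correct figures from a 2x2 classification table.

    tn: observed 0 predicted 0; fp: observed 0 predicted 1;
    fn: observed 1 predicted 0; tp: observed 1 predicted 1.
    """
    total = tn + fp + fn + tp
    return {
        "correct_0": round(100 * tn / (tn + fp), 1),
        "correct_1": round(100 * tp / (fn + tp), 1),
        "overall":   round(100 * (tn + tp) / total, 1),
    }

# Cell counts from the full-model classification table.
print(classification_summary(tn=383, fp=7, fn=10, tp=18))
# -> {'correct_0': 98.2, 'correct_1': 64.3, 'overall': 95.9}
```

This makes explicit that the overall 95.9% is (383 + 18) out of 418 classified cases.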
From the above table we see that the variables POLPARTY (political party) and MOVASSET(1) (movable assets) are statistically significant, since their p‐values are less than the critical p‐value of 0.05. There is no coefficient listed for POLPARTY itself, because it is not a variable in the model; rather, the dummy variables that code for POLPARTY have coefficients. However, the coefficients of the individual dummies are not statistically significant. The statistic given on the POLPARTY row tells you whether the dummies that represent POLPARTY, taken together, are statistically significant. Thus, the type of political party and movable assets are important in explaining the winning of a candidate. The other variables do not seem to have any effect at all.
Both effects can be interpreted as follows:

‐ Here the reference group of the variable POLPARTY is level 6. So, moving a candidate from the reference level 6 to levels 1, 2, 3, 4, or 5 increases the probability of winning the election.

‐ The ownership of movable assets increases the probability of winning the election.
Now the predicted model is given by:

logit(θ) = ‐23.477 + 23.583 POLPARTY(1) + 21.317 POLPARTY(2) + 19.112 POLPARTY(3) − 0.087 POLPARTY(4) + 3.958 MOVASSET(1)
Suppose we want to compare the probability of winning of candidate A, who has movable assets and whose political party changes from level 6 to level 1 (here level 1 indicates BJP and level 6 indicates other regional parties), with the probability of winning of candidate B, who has no movable assets and whose political party likewise changes from level 6 to level 1.
Predicted logit for candidate A:

logit(θ) = ‐23.477 + 23.583 + 3.958(1) = 4.064

Thus, Prob(win) = e^4.064 / (1 + e^4.064) = 0.9831
Predicted logit for candidate B:

logit(θ) = ‐23.477 + 23.583 + 3.958(0) = 0.106

Therefore, Prob(win) = e^0.106 / (1 + e^0.106) = 0.5265
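The two predicted probabilities can be reproduced with a few lines of Python; the variable names below are our own labels for the fitted coefficients:

```python
import math

def win_probability(logit_value):
    """Convert a fitted logit into a predicted win probability."""
    return math.exp(logit_value) / (1 + math.exp(logit_value))

# Coefficients from the fitted model (intercept, POLPARTY level 1 = BJP,
# and the movable-assets dummy).
intercept, b_polparty1, b_movasset = -23.477, 23.583, 3.958

logit_a = intercept + b_polparty1 + b_movasset * 1  # candidate A: has movable assets
logit_b = intercept + b_polparty1 + b_movasset * 0  # candidate B: no movable assets

print(round(win_probability(logit_a), 4))  # -> 0.9831
print(round(win_probability(logit_b), 4))  # -> 0.5265
```

The only difference between the two candidates is the MOVASSET dummy, so the gap between 0.9831 and 0.5265 is entirely the movable-assets effect.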
From this, we can conclude that a candidate with movable assets has a higher probability of winning the election than a candidate without movable assets.
CONCLUSIONS:

The disclosure of the backgrounds of candidates for elections in India has resulted in providing voters with sufficient information. While this information was primarily meant to enable the voters to make a well‐informed choice, its availability also made it possible to build effective predictive models for forecasting election results. The technique of logistic regression was used to build the predictive models for the Karnataka Lok Sabha elections. The important variables in predicting election outcomes are the type of political party and movable assets.
Questions for Further Discussion

1. What will happen to the predicted log odds if the coefficients of the predictor variables are negative?
2. Will there be any change in the model if we consider more independent variables, such as the number of crimes committed by the candidate, whether the candidate belongs to the ruling party, whether the constituency was reserved for scheduled caste and scheduled tribe candidates, whether the candidate belongs to the incumbent party in the specific constituency, etc.?
3. Compare the model with other data mining techniques, such as artificial neural networks and classification trees.
4. How many categorical independent variables are there in the model?
5. Is there any significance test for model fit other than the Hosmer‐Lemeshow test? If so, explain briefly.
6. Can we use simple linear regression instead of logistic regression? Why or why not?
7. What is the Wald chi‐square test?
8. How many continuous independent variables are there in the model? What are their names?
9. How are the parameters estimated in a logistic regression model?
10. What is the dependent variable in the model? Is it a binary or a continuous variable?
11. How many regressors are significant in the model?
12. What are the coefficients of the significant variables?
13. Give the interpretation of the significant predictors.
14. What are the Cox & Snell R Square and the Nagelkerke R Square?