BUILDING PREDICTIVE MODELS OF ELECTION RESULTS – an application of Logistic regression

2009
Abstract

This document attempts to teach students logistic regression with the help of a simple real-world example. The 2009 Lok Sabha election results for Karnataka were analyzed to assess whether there was any dependence between the available information about the candidates and the final election results. The only significant variables were the "Political Party" and "Movable Assets" of the politician in question.
Table of Contents

Introduction
Tips to Getting Started
    Objective
    Data Understanding
    Data Preparation
    Modeling
    Evaluation
Conclusion
Further Discussion
INTRODUCTION:

The compulsory disclosure of information about the backgrounds of candidates in an election ensures that voters have sufficient information about the candidates to make an informed choice while casting their votes. This information includes assets and liabilities as well as criminal antecedents, if any. Thus, a fairly large amount of data on candidate backgrounds has become available. It is interesting to see whether the information thus made available made any difference to the outcome of the elections. It is also important to see whether this information could be used to predict, or forecast, the results of the elections.
TIPS TO GETTING STARTED:

Business Understanding: This is the initial phase, focusing on understanding the project objectives and requirements from a business perspective. The available information about the candidates can be used as data to fit a model that determines which candidate will win the election. Alternatively, the voters can use it to make a fair choice in selecting their leader. Thus, the objective of this research paper is to develop predictive models that could be used for predicting the outcomes of the election.
Data Understanding: The data on the profiles of the candidates in the 2009 Karnataka Lok Sabha election were taken from myneta.com for this research paper. There are a total of 28 constituencies for which elections were held in 2009, with 421 candidates altogether. The election result for each candidate (win or loss) is used as the dependent variable for the predictive models. This is treated as a binary categorical variable. In addition, a number of variables on which information was available are used as independent variables. These variables included:

• Age of the candidate
• Gender
• Educational qualification
• Number of candidates in the constituency
• Type of political party
• Win or loss of the candidate
• Movable assets
• Immovable assets
• Total assets of the candidate
• Whether the candidate has any liabilities to the government
• Whether the candidate has any liabilities to financial institutions
• Whether the candidate had committed any crimes or not
Data Preparation: Here, the dependent variable and several of the independent variables are categorical. So we need to transform the categorical variables using dummy variables. The statistical software will automatically transform the categorical variables into dummy variables at the start of the analysis. Given below is a description of the variables used in this paper to build the model.
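The dummy coding described above can also be sketched by hand, which may help students see what the software does automatically. This is a minimal illustration, not the software's own routine; the function name `dummy_code` and the six party levels used in the example are our own choices.

```python
def dummy_code(value, levels):
    """One-hot (dummy) encode a categorical value against its levels.

    The last level is treated as the reference group, so a k-level
    variable yields k-1 indicator columns, as logistic regression
    software typically does.
    """
    return [1 if value == lvl else 0 for lvl in levels[:-1]]

# Hypothetical example: a party variable with 6 levels, reference = level 6.
party_levels = [1, 2, 3, 4, 5, 6]
print(dummy_code(1, party_levels))  # -> [1, 0, 0, 0, 0]
print(dummy_code(6, party_levels))  # reference group -> [0, 0, 0, 0, 0]
```

Note that the reference level gets all zeros; its effect is absorbed into the model's constant.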
DEPENDENT VARIABLE

winloss — win = 1, loss = 0

INDEPENDENT VARIABLES

(DEMOGRAPHIC CHARACTERISTICS)
ID — Candidate ID
Age — Age of the candidate
Gen — Male = 1, Female = 0 (dichotomous variable)
Edu — Educational level of the candidate, divided into 5 categories: primary = 1, high school = 2, pre-university = 3, graduate = 4, postgraduate = 5 (categorical variable)

(POLITICAL FACTORS)
No. of candidates — Number of candidates in the constituency, binned into 4 categories: <=10, 11 to 15, 16 to 20, above 20
PolParty — Political party of the candidate: BJP = 1, Congress = 2, independent = 3, JD = 4, other national party = 5, other regional party = 6 (categorical variable)

(OWNERSHIP)
Movasset — Whether the candidate owns any movable assets: yes = 1, no = 0
Immovasset — Whether the candidate owns any immovable assets: yes = 1, no = 0
Totlasset — Total assets of the candidate (continuous variable)

(LIABILITIES)
GovtDues — Whether the candidate has any government dues: yes = 1, no = 0
BankDues — Whether the candidate has any bank dues: yes = 1, no = 0

(OTHER FACTORS)
Crime — Whether the candidate had committed any crime: yes = 1, no = 0
Modeling: Since the dependent variable is categorical in nature, predictive models that revolve around regression techniques can be used for this specific case; logistic regression is ideal when we have a mixture of numerical and categorical regressors. A brief description of the technique is given below.
Logistic regression is a multiple regression with an outcome (or dependent) variable that is a categorical dichotomy and explanatory variables that can be either continuous or categorical. In other words, the interest is in predicting which of two possible events is going to happen, given certain other information. The dependent variable in logistic regression is usually dichotomous: it can take the value 1 with probability of success θ, or the value 0 with probability of failure 1−θ. This type of variable is called a Bernoulli (or binary) variable.

The independent or predictor variables in logistic regression can take any form. That is, logistic regression makes no assumption about the distribution of the independent variables. They do not have to be normally distributed, linearly related, or of equal variance within each group.

The relationship between the predictor and response variables is not a linear function in logistic regression; instead, the logistic regression function is used, which is the logit transformation of θ:
logit(θ) = log( θ / (1 − θ) ) = α + β₁x₁ + β₂x₂ + … + βₖxₖ

where α = the constant of the equation and βᵢ = the coefficients of the predictor variables xᵢ.

An alternative form of the logistic regression equation is:

θ = e^(α + β₁x₁ + … + βₖxₖ) / (1 + e^(α + β₁x₁ + … + βₖxₖ))
The estimation of the parameters in logistic regression analysis is done through maximum likelihood techniques. The idea behind the method is to find the parameter values that make the observed values most likely to have occurred; i.e., the method maximises the probability of obtaining the sample we got.
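The maximum-likelihood idea can be illustrated with a toy fit of a one-predictor logistic regression by gradient ascent on the log-likelihood. This is a pedagogical sketch with made-up data, not the routine statistical software actually uses (software typically applies Newton–Raphson / iteratively reweighted least squares):

```python
import math

def fit_logistic(xs, ys, lr=0.05, steps=2000):
    """Fit P(y=1) = 1/(1+exp(-(a + b*x))) by gradient ascent
    on the log-likelihood, returning the estimates (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(a + b * x)))
            ga += y - p          # d(log-likelihood)/da
            gb += (y - p) * x    # d(log-likelihood)/db
        a += lr * ga
        b += lr * gb
    return a, b

# Hypothetical toy data: larger x tends to go with y = 1.
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 1, 0, 1, 1]
a, b = fit_logistic(xs, ys)
# b comes out positive: higher x raises the odds of y = 1.
```

The returned (a, b) are the values that make this observed sample most probable, which is exactly the maximum-likelihood criterion described above.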
The process by which coefficients are tested for significance, for inclusion in or elimination from the model, involves several different techniques. Some of them are the Wald test, the likelihood‐ratio test, and the Hosmer‐Lemeshow goodness-of-fit test.
The Hosmer and Lemeshow chi‐square test of goodness of fit is the recommended test for the overall fit of a binary logistic regression model. If the p‐value of the H‐L goodness‐of‐fit test is greater than .05, as we want for well‐fitting models, we fail to reject the null hypothesis that there is no difference between observed and model‐predicted values, implying that the model's estimates fit the data at an acceptable level. That is, well‐fitting models show nonsignificance on the H‐L goodness‐of‐fit test, indicating that model predictions are not significantly different from the observed values.
Evaluation: The data were analyzed using statistical software. A sample of the candidate data is shown below.
ID | polparty | crime | edu | age | Ttlasset | liabilities | gen | winloss
1 | 4 | 0 | 2 | 2 | 54406000 | 0 | 1 | 0
2 | 4 | 0 | 2 | 2 | 100000 | 0 | 1 | 0
3 | 4 | 0 | 4 | 1 | 1000000 | 0 | 1 | 0
4 | 5 | 0 | 5 | 3 | 406400 | 0 | 1 | 0
5 | 4 | 0 | 4 | 1 | 0 | 1 | 1 | 0
6 | 5 | 0 | 5 | 2 | 6371000 | 1 | 1 | 0
7 | 4 | 1 | 1 | 3 | 7578500 | 1 | 1 | 0
8 | 1 | 0 | 5 | 4 | 16763000 | 1 | 1 | 1
9 | 4 | 0 | 1 | 1 | 600000 | 1 | 1 | 0
10 | 2 | 0 | 1 | 4 | 30411328 | 1 | 1 | 0
11 | 4 | 0 | 5 | 4 | 3380000 | 0 | 1 | 0
12 | 4 | 0 | 1 | 2 | 0 | 0 | 1 | 0
13 | 6 | 0 | 4 | 3 | 11415000 | 1 | 1 | 0
14 | 4 | 0 | 4 | 1 | 630000 | 1 | 1 | 0
15 | 4 | 0 | 2 | 1 | 20000 | 0 | 1 | 0
16 | 4 | 0 | 5 | 2 | 8000000 | 0 | 1 | 0
17 | 4 | 0 | 3 | 3 | 195000 | 0 | 1 | 0
18 | 6 | 0 | 3 | 4 | 1173000 | 1 | 1 | 0
19 | 4 | 0 | 4 | 3 | 865000 | 0 | 1 | 0
This part of the output describes a "null model", which is a model with no predictors and just the intercept.

Classification Table (a, b) — Step 0

Observed | Predicted 0 | Predicted 1 | Percentage Correct
WINLOSS = 0 | 390 | 0 | 100.0
WINLOSS = 1 | 28 | 0 | .0
Overall Percentage | | | 93.3

a. Constant is included in the model.
b. The cut value is .500
This gives the percentage of cases for which the dependent variable was correctly predicted given the model; here it is 93.3%.
This is the Wald chi‐square test, which tests the null hypothesis that the constant equals 0. This hypothesis is rejected because the p‐value (listed in the column called "Sig.") is smaller than the critical p‐value of .05 (or .01). Hence, we conclude that the constant is not 0.
This section contains the overall test of the model (in the "Hosmer‐Lemeshow Test" table) and the coefficients and odds ratios (in the "Variables in the Equation" table).
Cox & Snell R Square and Nagelkerke R Square are pseudo R‐squares. Logistic regression does not have an equivalent to the R‐squared found in OLS regression. Here the Cox & Snell R Square is 0.276 and the Nagelkerke R Square is 0.712; both lie between 0 and 1 and indicate an improvement from the null model to the fitted model.
Here the null hypothesis is that there is no difference between observed and model‐predicted values. Since the p‐value of the H‐L goodness‐of‐fit test is greater than .05, we fail to reject this hypothesis. This implies that the model fits the data at an acceptable level.
Classification Table (a) — Step 1

Observed | Predicted 0 | Predicted 1 | Percentage Correct
WINLOSS = 0 | 383 | 7 | 98.2
WINLOSS = 1 | 10 | 18 | 64.3
Overall Percentage | | | 95.9

a. The cut value is .500
This table shows how many cases are correctly predicted (383 cases are observed to be 0 and are correctly predicted to be 0; 18 cases are observed to be 1 and are correctly predicted to be 1), and how many cases are not correctly predicted (7 cases are observed to be 0 but are predicted to be 1; 10 cases are observed to be 1 but are predicted to be 0). The overall percentage of cases correctly predicted by the model (in this case, the full model that we specified) is 95.9%. This percentage has increased from 93.3% for the null model to 95.9% for the full model.
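The percent-correct figures in the classification table can be reproduced directly from the four cell counts. A small sketch (the function name and argument labels are our own):

```python
def classification_summary(tn, fp, fn, tp):
    """Percent-correct figures from a 2x2 classification table.

    tn: observed 0 predicted 0; fp: observed 0 predicted 1;
    fn: observed 1 predicted 0; tp: observed 1 predicted 1.
    """
    total = tn + fp + fn + tp
    return {
        "correct_0": round(100 * tn / (tn + fp), 1),
        "correct_1": round(100 * tp / (fn + tp), 1),
        "overall":   round(100 * (tn + tp) / total, 1),
    }

# Cell counts from the full-model classification table.
print(classification_summary(tn=383, fp=7, fn=10, tp=18))
# -> {'correct_0': 98.2, 'correct_1': 64.3, 'overall': 95.9}
```

This makes explicit that the overall 95.9% is (383 + 18) out of 418 classified cases.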
From the above table we see that the variables POLPARTY (political party) and MOVASSET(1) (movable assets) are statistically significant, since their p‐values are less than the critical p‐value of 0.05. There is no coefficient listed for POLPARTY itself, because it is not a variable in the model; rather, the dummy variables that code for POLPARTY have coefficients. However, the coefficients of the individual dummies are not statistically significant. The statistic given on the POLPARTY row tells you whether the dummies that represent POLPARTY, taken together, are statistically significant. Thus, the type of political party and movable assets are important in explaining the winning of a candidate. The other variables do not seem to have any effect at all.
Both effects can be interpreted as follows:

‐ Here the reference group of the variable POLPARTY is level 6. So, moving a candidate from the reference level 6 to levels 1, 2, 3, 4, or 5 increases the probability of winning the election.

‐ The ownership of movable assets increases the probability of winning the election.
Now the predicted model is given by:

logit(θ) = ‐23.477 + 23.583 POLPARTY(1) + 21.317 POLPARTY(2) + 19.112 POLPARTY(3) − 0.087 POLPARTY(4) + 3.958 MOVASSET(1)
Suppose we want to compare the probability of winning of candidate A, who has movable assets and whose political party changes from level 6 to level 1 (here level 1 indicates BJP and level 6 indicates other regional parties), with the probability of winning of candidate B, who has no movable assets and whose political party likewise changes from level 6 to level 1.
Predicted logit for candidate A:

logit(θ) = ‐23.477 + 23.583 + 3.958(1) = 4.064

Thus, Prob(win) = e^4.064 / (1 + e^4.064) = 0.9831
Predicted logit for candidate B:

logit(θ) = ‐23.477 + 23.583 + 3.958(0) = 0.106

Therefore, Prob(win) = e^0.106 / (1 + e^0.106) = 0.5265
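The two predicted probabilities can be reproduced with a few lines of Python; the variable names below are our own labels for the fitted coefficients:

```python
import math

def win_probability(logit_value):
    """Convert a fitted logit into a predicted win probability."""
    return math.exp(logit_value) / (1 + math.exp(logit_value))

# Coefficients from the fitted model (intercept, POLPARTY level 1 = BJP,
# and the movable-assets dummy).
intercept, b_polparty1, b_movasset = -23.477, 23.583, 3.958

logit_a = intercept + b_polparty1 + b_movasset * 1  # candidate A: has movable assets
logit_b = intercept + b_polparty1 + b_movasset * 0  # candidate B: no movable assets

print(round(win_probability(logit_a), 4))  # -> 0.9831
print(round(win_probability(logit_b), 4))  # -> 0.5265
```

The only difference between the two candidates is the MOVASSET dummy, so the gap between 0.9831 and 0.5265 is entirely the movable-assets effect.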
From this, we can conclude that a candidate with movable assets has a higher probability of winning the election than a candidate without movable assets.
CONCLUSIONS:

The disclosure of the backgrounds of candidates for elections in India has resulted in providing voters with sufficient information. While this information was primarily meant to enable the voters to make a well‐informed choice, its availability also made it possible to build effective predictive models for forecasting election results. The technique of logistic regression was used to build the predictive models for the Karnataka Lok Sabha elections. The important variables in predicting election outcomes are the type of political party and movable assets.
Questions for Further Discussion

1. What will happen to the predicted log odds if the coefficients of the predictor variables are negative?
2. Will there be any change in the model if we consider more independent variables, such as the number of crimes committed by the candidate, whether the candidate belongs to the ruling party, whether the constituency was reserved for scheduled caste and scheduled tribe candidates, whether the candidate belongs to the incumbent party in the specific constituency, etc.?
3. Compare the model with other data mining techniques, such as artificial neural networks and classification trees.
4. How many categorical independent variables are there in the model?
5. Is there any significance test for model fit other than the Hosmer‐Lemeshow test? If so, explain briefly.
6. Can we use simple linear regression instead of logistic regression? Why or why not?
7. What is the Wald chi‐square test?
8. How many continuous independent variables are there in the model? What are their names?
9. How are the parameters estimated in a logistic regression model?
10. What is the dependent variable in the model? Is it a binary or a continuous variable?
11. How many regressors are significant in the model?
12. What are the coefficients of the significant variables?
13. Give the interpretation of the significant predictors.
14. What are the Cox & Snell R Square and the Nagelkerke R Square?