
BEGINNER'S GUIDE TO LOGISTIC REGRESSION USING R AND EXCEL
Logistic regression is one of the most widely used predictive modelling techniques. In this
book we will learn how to use logistic regression to aid decision making.

We will use data from our favourite sport, Cricket, to illustrate the application of logistic regression in decision-making situations.

HOW WILL THIS GUIDE HELP ME?

The purpose of this guide is to demonstrate a step-by-step approach to data analysis using data from the sport of Cricket. You will learn how to handle a data set, how to become intimate with it, how to run descriptive analytics and build predictive models using logistic regression on it, and how to draw insights from the results to guide your decisions.

HOW DO I USE THIS GUIDE?

The data set analyzed in this guide is available for free download. In order to get the full
benefit from this guide, you should download this data set and perform the steps
illustrated in each chapter before moving on to the next one.
Table of Contents

How will this guide help me?
How do I use this guide?
Introduction
Who is the greatest ODI batsman India has ever produced?
Problem Definition
Sachin, Sourav, Rahul
Data Exploration
What is the available information?
What kind of questions can I answer using this data?
Business application of Data exploration
Data Exploration Step 2
How much data is there?
What does the data represent?
Examining all variables
EXERCISE
Data Preparation
Cleaning the Opposition field
Cleaning the Runs field
Cleaning up the Results field
Data preparation in business analytics
EXERCISE
Descriptive Analytics
EXERCISE
Predictive Modelling
An introduction to Regression
Types of regression
Logistic Regression
Building a logistic regression model
Reading data into R
Running a Logistic Model
EXERCISE
Interpreting the output
What about the batting average?
Lifetime Contribution
EXERCISE
Model Validation
EXERCISE
Conclusion
Problem definition
Data Exploration
Data Preparation
Descriptive Analytics
Predictive Modeling
Interpreting the results
Model Validation


INTRODUCTION

We chose Cricket as our analytics case study for two reasons. The first reason is that a majority of the readers of this e-book will be Cricket fans. You will be able to relate to the problems we attempt to solve in this book. In many cases you will already have gut-based opinions on the topics we discuss. You will find it interesting to see whether analytics confirms your gut or diverges from it.

For the purpose of this book, we will be analysing the performance of some of India's top ODI batsmen, with a focus on the batting genius Sachin Tendulkar.

WHO IS THE GREATEST ODI BATSMAN INDIA HAS EVER PRODUCED?

This is a debate that has raged many a time across India, from water coolers to drawing
rooms to canteens to social media, and is unlikely to have a conclusive or decisive end.

There are many reasons why this debate is often inconclusive, not least being the completely different and arbitrary sets of criteria used by people to back the player they rate supreme. "Greatest" as a term is open to many interpretations and, having been witness to and often been a part of many such debates, I figured this needed an objective approach.

Being data scientists, we thought of using a purely statistical and data-driven approach to
answer this question.

And like any statistical research, step one involved clearly defining the research objective.
STAGE 1: PROBLEM DEFINITION

Given that "greatest" is a term used in many contexts, the first task was to restate the question under argument as one which would provide conclusive, objective answers. I came up with:

Which batsman has had the most impact on India's win-rate through the runs they have scored in ODIs?

The restatement of the problem immediately narrows the discussion to batting performances only and their impact on wins. To some it's a cruel elimination of factors like the elegance of a particular cover drive or the ability to pace an innings. To the data scientist, it is moving the argument to a turf where the conversation stops going round and round and instead lurches towards facts that should shape opinions.

SACHIN, SOURAV, RAHUL

Remember, this is still a discussion on who is the greatest of them all. India has produced a number of ODI cricketers (in fact, many think that far too many have worn the cap without merit), but the discussion of "greatest" needs to be limited to a select few.

The first elimination criterion used was the total number of career runs scored. For
further analysis, I zeroed in on the top 3.

Sachin Tendulkar, Sourav Ganguly and Rahul Dravid are India's all-time highest ODI run-getters. Sachin, at 17,742 runs, is still going strong, while Sourav and Rahul have both retired.

Statistic   Sachin   Sourav   Dravid
Innings     431      292      307
Runs        17,742   11,255   10,536

Of course, for each I found plenty of backers willing to argue their case:

"I think Dada is the best because of the way he ripped apart the bowlers before they started to bowl short at him."

"I think Dravid is the best because he is such a joy to watch. Every innings of his is pure class."

"Sachin has scored 49 ODI centuries and was the first player ever to hit a double hundred in ODIs. Of course he is the best. No question about it."

There are others who have quoted the names of Sehwag and Dhoni, and even the name Virat Kohli has already started creeping in, but none of them is near 10,000 ODI runs in overall contribution, and that is the first statistic that eliminated them from this research.

So now we have re-stated the objective and defined the scope of our analysis as well: Amongst those who have scored more than 10,000 runs in ODIs, which batsman has had the most impact on India's win-rate through the runs they have scored?

Now that we have defined the scope of our analysis in very precise terms, we will explore
the data that is available to us.
STAGE 2: DATA EXPLORATION

Data exploration is an important part of any analysis. It becomes even more important
when dealing with a data set for the first time.

In our case, we first need to identify the data to be used for this analysis. We used the site
www.espncricinfo.com to download the available data.

There is a lot of information available on Cricket players on this website. For the purpose
of our example, we will consider a small sample of the available information.

Our analysis table contains 10 fields. Here is a snippet of the data set.

Match Id    Opposition      Ground       Start Date   Runs   Result   Margin      BR   Toss   Bat
ODI # 593   v Pakistan      Gujranwala   18-Dec-89    0      lost     7 runs           won    2nd
ODI # 612   v New Zealand   Dunedin      01-Mar-90    0      lost     108 runs         won    2nd
ODI # 616   v New Zealand   Wellington   06-Mar-90    36     won      1 runs           won    1st
ODI # 623   v Sri Lanka     Sharjah      25-Apr-90    10     lost     3 wickets   4    lost   1st
ODI # 625   v Pakistan      Sharjah      27-Apr-90    20     lost     26 runs          won    2nd
ODI # 634   v England       Leeds        18-Jul-90    19     won      6 wickets   12   won    2nd
ODI # 635   v England       Nottingham   20-Jul-90    31     won      5 wickets   12   won    2nd

For our analysis, we will need to download the data for all 3 batsmen under consideration, i.e. Sachin, Sourav and Rahul. We will illustrate the data exploration and preparation steps for Sachin's data only. The same process can then be repeated for the other two as well.

WHAT IS THE AVAILABLE INFORMATION?

The first step in data exploration is to understand the information available to us. Let us
spend some time on our data set.

The first field, Match Id, is a unique identifier for each ODI game. We can see that each row in the data has a unique Match Id. This means that each row in our data corresponds to one game. The first row in the data corresponds to ODI # 593. You can see that it is referring to Sachin's debut game against Pakistan.

The second field, Opposition, is self-explanatory. The opposition in this match was Pakistan. The third field, Ground, tells us where the match was held. The field Start Date gives us the date of the match. Runs is the number of runs scored by the batsman (Sachin Tendulkar) in that game. Next we have the result of the game. Margin gives us the margin of victory. If the team batting first won the game, this field gives us the number of runs they won the game by. If the team batting second won the game, this field tells us the number of wickets they won by. The field BR is populated only in cases where the team batting second won the game; it gives the number of balls remaining when the victory was achieved. Toss tells us whether India won or lost the toss. The final field, Bat, tells us whether India batted first or second.

In all, this is pretty good information. If we look at the first row of the data, it tells us about the game with Match Id 593.

India played against Pakistan at Gujranwala on 18th Dec 1989. India won the toss, decided to field and, while chasing, fell short of the target by 7 runs. Sachin got out for a duck in this game.

WHAT KIND OF QUESTIONS CAN I ANSWER USING THIS DATA?

Let us examine each of the fields and understand the kind of insights this information can provide. The first field is what is called the primary key in data mining parlance. It is a unique number assigned to each game in order to identify the game and distinguish it from others. This key is useful for data manipulation but not for analysis itself.

The second field, Opposition, tells us who Sachin was playing against. We can analyse Sachin's performance by opposition. Think of any statistic that will help us analyse Sachin's performance; the field Opposition helps us add this dimension to the analysis.

Example questions:

What is Sachin's average against each of the teams?

What is the win rate by opposition?

At what rate has he scored half-centuries and centuries against different opposition?

Similarly, the field Ground helps us add the venue dimension to analysing Sachin's performance.

Example questions:

What is Sachin's average at different venues?

Where has he scored most centuries?

Where has he scored the most half-centuries?

Where does he have the highest win rate?

Start Date tells us when the game was played. It provides the time dimension to the analysis.

Example questions:
What is Sachin's average in each of the last 20 years?

When did he score most centuries?

When did he score the most half-centuries?

How many years has he scored more than 1000 runs in?

The field Runs is important for obvious reasons. This variable is a measure of Sachin's performance in a game.

Note that all the other variables are used as dimensions, i.e. they are a means to slice and dice the data for the measure Runs. For example, we can look at Sachin's total runs scored or average runs scored by opposition. Opposition here is the dimension and we are slicing the data along this dimension. Runs, on the other hand, is a measure.

The field Result gives the result of that particular game. We use this field as another
dimension in the analysis.

Example questions:

What is Sachin's average when team India wins a game vs. when they lose it?

How many centuries has Sachin scored in India's victories vs. losses?

The field Margin is a slightly tricky one. It gives the margin of victory in runs when the team batting first wins, and in wickets when the team batting second wins. This field will need some transformation before it can be used effectively. If required, we will come to that in the data preparation stage.

The same applies to the field BR.

The fields Toss and Bat also add dimensions to our analysis. We can analyse Sachin's performance when India wins the toss vs. when they lose it, and when they bat first vs. when they bat second.

Note one thing here. We mentioned that the field Runs is a measure and all other fields are dimensions. Well, that's not entirely correct. Even the field Result can be used as a measure, depending on what we are analysing. For example, consider the question: what is India's win rate when they win the toss vs. when they lose it? In this case, the field Result is the measure and the field Toss is the dimension.
In this section, we have completed the first step in data exploration. We have identified
the information contained in the data set. We have looked at each field and understood its
definition. We have also looked at several examples of questions that we can answer with
this data.

BUSINESS APPLICATION OF DATA EXPLORATION

This is a simplified scenario that we have taken for the purpose of this guide. Business
situations can be far more complex.

The data set that we have contains 10 fields. Business data sets can have many more fields. Data sets in financial services can have up to 1000 fields. Most business data sets tend to have anywhere between 10 and 100 variables.

Further, our data set has very intuitive fields. They are easy to understand and are not
vague in definition. In business situations, variables may not be this easy to understand.

In such a situation, there is something called a data dictionary which comes in very handy for the analyst. The data dictionary is a document (usually an Excel sheet) which has the names and definitions of all the fields in the data set.

A snippet from a data dictionary

It is advisable to spend plenty of time on the data dictionary. The analyst needs to be
comfortable with the definition of all the variables before proceeding any further with the
analysis.
DATA EXPLORATION STEP 2

So far, we have explored what information is available to us. We have looked at all the different fields in our data set and understood exactly what they mean.

The next step is to explore the data itself. How much information do we have? What is the
quality of the available data? How do we need to prepare the data?

For this step, we will need to look at each of the fields in the data individually.

HOW MUCH DATA IS THERE?

We can simply scroll down in Excel to see how many rows of data there are. In our case, we find that there is data up to the 464th row.

Since this is a fairly small dataset, we are going to perform the data exploration and preparation steps in Excel. However, when we come to the predictive modelling stage, we will use R.

Since the first row contains the headers, this means there are 463 rows of data. Each row represents one match, so we have data on 463 matches.

WHAT DOES THE DATA REPRESENT?

We now need to find the time period this data pertains to. The field Start Date (the date the match was played on) can provide us that information. We sort the data on Start Date; in fact, the data is already sorted on it. We can see that the first game in the data was played on 18-Dec-1989 and the last one on 18-Mar-2012. We know that 18-Dec-1989 was Sachin's debut game. We can confirm that Sachin played 463 games from then till 18-Mar-2012.

This implies that we have data on all of Sachin's games from his debut till 18-Mar-2012.

We have now established that our data set contains 463 games. This represents all the games that Sachin played for India from his debut till 18-Mar-2012.

EXAMINING ALL VARIABLES

Now let us examine all the values in all the fields individually. Since this is a small data set, we can scan the values manually. The easiest way to do this is to apply filters on all the fields and examine each filter one by one.

The first field is the Match Id. We scan all the unique values by scrolling within the filter. All the values are in order.

Figure 1: Match Id

We do the same with Opposition and find all the values in order.

There are multiple things we are looking for when we scan these values. The first is to detect something we do not expect to see. For example, if we see China in the opposition, that's an unexpected value that needs to be investigated further. A more likely error is that we have 2 different values representing the same thing. For example, U.A.E. could also be written as UAE (without the dots) in some rows, and we would then need to change some entries to make them consistent. We could easily do a Replace All in Excel to change all UAE values to U.A.E.

In this manner, I scan all the fields and make a note of all the points that need to be
worked on. Let us now move to the next stage, i.e. the data preparation stage. This is
where we will manipulate and transform the data into the format we want.

EXERCISE

Download the data by clicking on this link: Cricket data for Sachin, Sourav and Rahul

Perform the following steps on the data for Sourav and Rahul

1. Open the data in Excel

2. Examine the data. How many games' worth of data are there for each of these
players?

3. Examine all the variables independently using the filter option and make a note of
the changes you would need to make on the data in the data prep stage.
STAGE 3: DATA PREPARATION

It is a good practice to create a copy of the data set at this stage. Now we will start
making modifications to the data. Some of them may be irreversible. Creating a copy of
the data set at this stage gives us the option to go back to the original data set at any
stage later on.

CLEANING THE OPPOSITION FIELD

One thing that bugs me here is the presence of a "v" before the team names. For example, the entry for a game where the opposition is Pakistan is "v Pakistan". The "v" is short for "versus", but I feel it is pretty redundant. While it is not essential for this analysis to remove the "v", I will do it for aesthetic reasons.

There are many ways to remove the "v" here. I will use the Text to Columns function in Excel. First, I insert a column to the right of the Opposition column.

Then I simply select the cells where the data is located, click on the Text to Columns function under the Data tab and choose the Fixed width option. Then I click Next.

On the next screen, I simply click on the space between the "v" and the opposition name, and a line appears between the two signifying a break.

I click Finish and I now have the data broken into 2 columns. The original column contains all the "v"s and the column on the right now contains all the opposition names without the "v".

With a little bit of cleaning, I now have my Opposition field in the format I want.
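For readers who prefer R (which we use later in this guide for the modelling stage), the same cleanup is a one-liner. This is just a sketch; it assumes the data has already been read into a data frame called data.frame.sachin with an Opposition column:

# Strip the leading "v " from every opposition name
data.frame.sachin$Opposition <- sub("^v ", "", data.frame.sachin$Opposition)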

CLEANING THE RUNS FIELD

When I examined the Runs field, I found a couple of things I will need to correct before I
can use this field for mathematical analysis.

First, there are a couple of text entries in this field. You can see the values DNB and TDNB in the adjoining figure. Both of these refer to situations where Tendulkar did not get to bat.

We now think back to the goal of our analysis. The goal is: Which batsman has had the most impact on India's win-rate through the runs they have scored in ODIs?

With this goal in mind, we can safely exclude all matches where Sachin did not bat. If he did not bat, he could not have had any impact on the team's win-rate through his runs.

Note that he could still have had an impact through his bowling and fielding, but we are not trying to measure that impact.

We can simply remove this data from our analysis data set by filtering and deleting the
rows.
It is a good idea to make a note of all the changes we are making. I have noted that we
have deleted data on 11 games here. In these 11 games, Sachin did not bat and hence this
data was not useful for our analysis.

The next thing I noticed in the Runs field is the presence of a number of entries where the score is followed by an asterisk (*). This is common convention to denote a not-out score. In all these innings, Sachin remained not out at the end. There are 41 such innings in our data set.

What should we do with this issue? Removing the asterisk at the end is fairly simple in Excel, but before we do that we need to carefully understand the implications for our analysis. Converting a score of 40* to just 40 means that we are saying the impact of Sachin's runs remains the same whether he scores 40* or gets out at 40.

I think this is a fair assumption. Since we are measuring a batsman's impact solely through the runs they have scored, it is OK to discard the information on whether the batsman got out or not.

Having gone through this exercise mentally, I think it is fine to go ahead with this
approach. I now proceed to remove the asterisk at the end.

I will again insert a column to the right of this field and use the Text to Columns function. This time I choose the Delimited option.

When I click Next, I am asked to choose the character that I want to use as a delimiter.
I choose the Other option and enter the character * and click on Finish.

What this does is tell Excel to treat every asterisk as a delimiter, keep the content to its left in the original cell and move the content to its right into the cell on the right. In our case, this simply eliminates the * from the field.

We have now cleaned up the Runs field in our data set and made it amenable to
mathematical operations that we will perform in the next step.
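The same cleanup can also be done in R. A sketch, assuming the raw Runs column was read in as text and the DNB/TDNB rows have already been removed:

# Remove every asterisk, then convert the scores to numbers
data.frame.sachin$Runs <- as.numeric(gsub("\\*", "", data.frame.sachin$Runs))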

CLEANING UP THE RESULTS FIELD

The next thing on my list is to clean up the Result field. There are 4 kinds of results in our data set: won, lost, n/r and tied, where n/r stands for no result. Since we want to measure the impact on the team's win-rate, we can exclude the matches where the result is n/r or tied. You could argue that tied means India did not win and hence can be counted as lost, but by the same logic, tied does not mean lost either. Hence we decide to exclude all games where the result is n/r or tied.

We note that we have deleted another 21 games due to this criterion.

This brings us to the end of data preparation. Before we proceed any further, it is
important to summarize what we have done here.

1. We cleaned up the Opposition field by removing the v before each team name

2. We cleaned up the Runs field by removing all games where Sachin did not bat.
We deleted 11 games this way.

3. We removed the * at the end of scores where Sachin did not get out. Our data now
does not differentiate between innings where Sachin was out and innings where he
wasn't.

4. We removed all games where the result was not a straight win or loss. We removed
an additional 21 games this way.

5. We started with 463 games and now we are considering only 431 games for our
actual analysis.

DATA PREPARATION IN BUSINESS ANALYTICS

When dealing with business data, data preparation can be a long and exhausting process.
What we have discussed here can be considered more as data cleaning. We have not really
touched upon certain other important aspects of data preparation.

Anomaly detection or outlier correction is used extensively when dealing with business
data. The idea here is to remove unusual occurrences from the data before building a
predictive model. This is because outliers can have undue influence on our models. In our
case, we have limited data and there is nothing in the data that justifies outlier
correction.

Missing data treatment is another crucial step in data preparation. In our dataset, we
have no missing values (Thank you espncricinfo!) but if, for example, we had some innings
for which we had no values in the Runs field, we would have to do something about it.
Typically missing data treatment involves either imputing or estimating the missing values
or removing the data with missing values from our analysis.
Deriving variables is also a part of data preparation. Sometimes we need to create new variables from the existing ones for the purpose of our analysis. For example, if we need a Year variable, we can derive it from the Start Date variable, as in the sketch below. We could also derive the country where the match was played from the Ground field; this would involve creating a separate lookup table which maps grounds to countries.
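As a sketch of what deriving a variable looks like in R: assuming the Start Date column was read in as Start.Date in the dd-Mmm-yy format shown in our data snippet (and an English locale for month names), a Year variable can be derived like this:

# Parse the date text, then extract the four-digit year
dates <- as.Date(data.frame.sachin$Start.Date, format = "%d-%b-%y")
data.frame.sachin$Year <- as.integer(format(dates, "%Y"))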

Data preparation is an important part of any analysis but it becomes even more important
when dealing with complex business data. Effective data preparation increases the
strength of the predictive models by harnessing the power of the available data in the
most efficient manner.

Now that we have prepared the data, we are ready for the next stage, i.e. predictive modelling. But before we get into that, this is a good time to perform some descriptive analytics on the data.

EXERCISE

Perform the following steps on the data for Sourav and Rahul

1. Clean up the Opposition field by removing the v before each team name

2. Clean up the Runs field by removing all games where the batsman did not bat.

3. Remove the * at the end of scores where the batsman did not get out.

4. Remove all games where the result was not a straight win or loss.

5. Make a note of the total number of games you started with and what you are left
with for further analysis.
STAGE 4: DESCRIPTIVE ANALYTICS

In the data exploration stage, we had compiled a long list of questions that could be answered from this data. Here are some interesting charts, starting with a graph of the distribution of Sachin's innings scores.

Descriptive analytics like this helps an analyst understand the data better. It also helps her spot anything unusual: anything that requires further investigation.

Descriptive analytics is a useful tool to understand the data, generate insights and spot
unusual occurrences that require further investigation.

EXERCISE

Descriptive analytics offers unlimited ways of analysing any kind of data. You are only
limited by your imagination. Here are some things you can do with your data at this stage.

1. Analyze the batsman's performance over time: total runs scored by calendar year
and average runs scored by calendar year
2. Analyze the batsman's performance by opposition, by venue (home and away), etc.
3. Create and examine the distribution of scores (see the sketch after this list)
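While we are doing this stage in Excel, each of these charts takes roughly one line in R. For instance, here is a sketch of item 3, assuming the cleaned data is loaded as data.frame.sachin with a numeric Runs column:

# Histogram of innings scores
hist(data.frame.sachin$Runs, breaks = 20,
     main = "Distribution of innings scores", xlab = "Runs")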

What you will find is that in most cases descriptive analytics will confirm your belief or
intuition. But in a few cases, every once in a while, you will find patterns or insights that
you did not know or that run counter to your intuition. These counter-intuitive or hidden
insights are what make descriptive analytics such a valuable tool.
STAGE 5: PREDICTIVE MODELLING

Now that we have explored the data, prepared it for analysis and run descriptive analytics
on it, the next stage is predictive modelling.

We again refer back to the goal of our analysis: Which batsman has had the most impact on India's win-rate through the runs they have scored in ODIs?

We need to establish a relationship between India's win-rate and the number of runs scored by Sachin in a particular game. Let us first examine a graph with India's win-rate on the vertical axis and Sachin's scores (in buckets of 20) on the horizontal axis.

When Sachin scores fewer than twenty-one runs, India's win-rate is 42%. It climbs to 56% when he scores between 21 and 40 runs, and goes up to a whopping 83% when Sachin scores between 121 and 140 runs. The win-rate does come down for scores greater than 140, but this aberration can be attributed to sparse data at such high scores. Since Sachin has scored more than 140 in only 11 games, one or two unusual results can make a big impact on this win-rate.

There does seem to be a general trend of improvement in India's win-rate as Sachin's scores get higher.
What if we could quantify this relationship? What if we could create a mathematical formula that calculates India's win-rate for any given Sachin score? For example, if Sachin scores 25 runs in an innings, we could just plug his score into the formula and, bam, it gives us the probability of India winning that game.

We will now attempt to do exactly this via a regression model. We will estimate the relationship between Sachin's score and India's win-rate. In other words, we will build a model that will help us predict, for a given number of runs scored by Sachin, the probability of India winning the game. This model will also be able to estimate the increase in the probability of India winning with each additional run scored by Sachin.

AN INTRODUCTION TO REGRESSION

Regression is one of the most popular predictive techniques. In simple terms, regression helps us understand how the typical value of one variable, called the dependent variable (in this case, India's win-rate), changes when another variable, called the independent variable (here, Sachin's score), varies.

This is a simplified case of regression. In many situations, regression models are used to understand the effect of several variables on one variable. For example, India's win-rate could also be influenced by factors like whether India batted first or second, whether India was playing at home or away, or even the toss. We could theoretically build a model which accounts for the effect of all of these variables on India's win-rate.
TYPES OF REGRESSION

There are many types of regression techniques that are applied by Statisticians depending
on the nature of the problem and the variables involved. Linear and logistic are two of the
most popular ones.

Linear regression assumes a linear relationship between the dependent and the independent variable. If the relationship between Sachin's score and India's win-rate could be quantified with a straight line, then linear regression would be a suitable modelling technique.

In our problem, we have seen in the previous graphs that the relationship between our dependent and independent variable is not exactly linear.

Further, the variable that we are trying to predict, i.e. the outcome of the game, is a binary variable (win/loss). In our case, a technique called logistic regression is more suitable. Logistic regression does not need a linear relationship between the dependent and independent variables; instead, it models the log-odds of the outcome as a linear function of the independent variable, which can capture the kind of curved relationship we see here.

LOGISTIC REGRESSION

In this book, we will not go into the mathematical details of logistic regression. Instead we
are going to focus on its application for a given problem.

The result of a logistic regression model is an equation in this format:

Log [p/(1-p)] = a + bX

Let us interpret this equation in the context of our problem.

The model generates two values: a and b.

Using the equation above, we can calculate the value of p for any given value of X.

We first calculate the value of Log[p/(1-p)] by plugging in the values of a, b and X. Let us call this value Y.

Log[p/(1-p)] = Y

We can then use the antilog or exponent function to calculate the value of p/(1-p).

p/(1-p) = exp(Y)

From there we can easily calculate the value of p as well.

p = exp (Y)/(1 + exp(Y))

p, if you remember, is the probability of India winning the game. Thus, using a logistic model, for any given value of X, we can calculate p, India's predicted win-rate.

This is how we interpret the results of the logistic regression model.

Now we need to find the values of a and b so that we can calculate the probability p for
any given X (Runs scored).
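To make the algebra above concrete, here is a minimal R sketch of the same calculation. The names inv_logit and p_win are made up for illustration; a and b stand for the coefficients we are about to estimate:

# Convert Y = a + bX into the probability p = exp(Y)/(1 + exp(Y))
inv_logit <- function(y) exp(y) / (1 + exp(y))
p_win <- function(runs, a, b) inv_logit(a + b * runs)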

BUILDING A LOGISTIC REGRESSION MODEL

We are going to use a combination of R and Excel to build this model. We will calculate the coefficients (a and b) using R. We will perform all other calculations in Excel.

R is an open source tool that is available as a free download. Anyone can download R on
their machine and start working with it.

Download and install R

READING DATA INTO R

It is a lot simpler to load CSV files into R than Excel files, so we will copy-paste our data into another Excel sheet and save it as a .csv file.

We use the read.table command in R to read in the data.

data.frame.sachin = read.table("E:\\jigsaw\\Blog\\sachin.csv",
                               header = T,
                               sep = ",")

This command creates a new table (or data frame) called data.frame.sachin by reading in data from the file sachin.csv. We have specified the location of the file as well. Note that R requires you to double the backslashes when specifying a Windows pathname. The header = T argument tells R to treat the first row of the data as headers. The sep = "," argument tells R that the values are separated by commas (since this is a CSV, or comma-separated, file).
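Since the file is a CSV, the read.csv shortcut gives the same result; it presets header = TRUE and sep = ",":

data.frame.sachin <- read.csv("E:\\jigsaw\\Blog\\sachin.csv")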
Once we read in the data, we can quickly run summary statistics on the data using a
simple command.

summary(data.frame.sachin)

As you can see, this command produces 6 measures for each numeric field: the minimum, the 25th percentile, the median (50th percentile), the 75th percentile, the maximum and the mean.
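Two other quick checks worth running at this point, both in base R:

head(data.frame.sachin)   # prints the first six rows
str(data.frame.sachin)    # shows each field's type and a few sample values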

RUNNING A LOGISTIC MODEL

Once we have read in the data and run summary statistics on it, the next step is to build the model. We are building a simple two-variable model. The variable outcome is the dependent variable; this is what we will try to predict. The variable Runs is the independent variable; this is what we will use for prediction. In other words, we will quantify the relationship between runs scored and the outcome (or, more precisely, the probability of the outcome being a win).

We use the glm command in R.

smodel <- glm(outcome ~ Runs, data = data.frame.sachin, family = "binomial")

smodel is the name we have given to our model.

glm stands for generalized linear model. This is the broad family of techniques under which logistic regression falls; the glm command allows us to build various kinds of regression models.

outcome ~ Runs tells R to use the variable Runs to predict the variable outcome. Suppose we want to use 2 variables instead of 1 for prediction: along with Runs, we also want to use the variable Toss. In that case, this part of the command changes from outcome ~ Runs to outcome ~ Runs + Toss, as in the sketch below.
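As a sketch, that two-variable model would look like this (it assumes the data frame has a Toss column):

smodel2 <- glm(outcome ~ Runs + Toss, data = data.frame.sachin, family = "binomial")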

data = data.frame.sachin tells R to use the table data.frame.sachin as the data on which to do the analysis.
family = "binomial" is important. There are multiple model families available under the glm command in R; "binomial" tells R to use the logistic regression technique.

Since we have assigned a name to our model (smodel), R will not print the results of the
model. For this we again use the summary command.

summary(smodel)

For a full understanding of the model output, click here.

For our purpose, we are interested in knowing the values of a and b. If you remember, the logistic regression equation is as follows:

Log [p/(1-p)] = a + bX

a is the constant value calculated by the model and is called the intercept. b is another constant, the coefficient of X (runs scored).

In the model output, we can find the values of a and b under the heading Coefficients. a is the estimate of the intercept and has a value of -0.258734 as per our model.

b is the coefficient estimate for Runs and its value is 0.010062.

We now have the information needed to solve the regression equation and can calculate the value of p for each value of X.
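Rather than copying the numbers by hand, we can also pull a and b straight out of the fitted model; "(Intercept)" and "Runs" are the names R assigns by default:

coefs <- coef(smodel)
a <- coefs[["(Intercept)"]]   # about -0.258734 for this data
b <- coefs[["Runs"]]          # about 0.010062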

EXERCISE

For this exercise you should have first prepared the data as described in the previous
section.

Save the newly prepared data for both Sourav and Rahul into separate csv files.
Read the data sets into R
Get summary statistics for the data using the summary function
Perform logistic regression on the data using the glm function.
Summarize the results of the model using the summary command
Note the estimates for the intercept and the variable Runs
STAGE 6: INTERPRETING THE OUTPUT

We have run our model and obtained the results. Let us now understand and interpret the
results.

Case 1: When Runs = 0

For X = 0,

a + bX = a = -0.258734

Thus Log[p/(1-p)] = -0.258734. Or,

p/(1-p) = exp(-0.258734) = 0.772. Or,

p = 0.772/(1 + 0.772) = 0.44, or 44%.

This means that if Sachin scores a duck (0 runs), India's predicted probability of winning that game is 44%.

Case 2: When Runs = 50

For X = 50,

a + bX = -0.258734 + 0.010062 * 50 = 0.244366

Thus Log[p/(1-p)] = 0.244366. Or,

p/(1-p) = exp(0.244366) = 1.28. Or,

p = 1.28/(1 + 1.28) = 0.56, or 56%.

If Sachin scores a half century (50 runs), India's predicted probability of winning that game is 56%.

As per the model, India's chances of winning a game increase by over 12 percentage points if Sachin scores exactly 50 runs in a game vs. if he scores 0.

In other words, the contribution of Sachin's 50 runs is an increment of 12% in India's chances of winning.
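We can check both cases in R. plogis is base R's logistic function, i.e. exactly the exp(Y)/(1 + exp(Y)) transformation used above:

p_win <- function(runs) plogis(-0.258734 + 0.010062 * runs)
p_win(0)              # ~0.44
p_win(50)             # ~0.56
p_win(50) - p_win(0)  # ~0.125, the 12-point increment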

In this way, we can calculate Sachin's contribution (in the form of an increase in India's chances of winning) for every score that he has ever made in these 431 games.
X = Runs   Y = a + bX   p      Increment   Cumulative Increment
0          -0.258734    0.44               0
1          -0.248672    0.44   0.25%       0.25%
2          -0.238610    0.44   0.25%       0.50%
3          -0.228548    0.44   0.25%       0.74%
4          -0.218486    0.45   0.25%       0.99%
5          -0.208424    0.45   0.25%       1.24%
6          -0.198362    0.45   0.25%       1.49%
7          -0.188300    0.45   0.25%       1.74%
8          -0.178238    0.46   0.25%       1.99%
9          -0.168176    0.46   0.25%       2.24%
10         -0.158114    0.46   0.25%       2.49%
11         -0.148052    0.46   0.25%       2.74%

A snippet of the table used to calculate the increment in Excel.
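The whole table can be reproduced in a few lines of R instead of Excel. A sketch, using the coefficients estimated above:

runs <- 0:150
p <- plogis(-0.258734 + 0.010062 * runs)
increment <- c(0, diff(p))   # gain in win probability per extra run
cumulative <- p - p[1]       # total gain over scoring a duck
round(head(cbind(runs, p, increment, cumulative), 12), 4)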

When Sachin scores a 0 in a game, his contribution is 0 for that game.

When he scores 1 run, his contribution is a 0.25% increase in India's chances of winning.

When he scores 10 runs, his contribution is a 2.49% increase in India's chances of winning.

And so on.

Using this method, we can calculate Sachin's total (lifetime) and average (per game) contribution towards the team's win-rate.

We can use the exact same method to calculate the same measures for the other 2 players included in our analysis: Sourav Ganguly and Rahul Dravid.

Here is a comparison of the effects of the runs scored by each of the three batsmen. Ganguly's line (the green line) is at the top, which means Ganguly's runs have the highest impact on India's win-rate.

In other words, if everything else remains constant and Ganguly scores 60 runs, it is likely to improve India's chances of winning more than if either Sachin or Rahul had scored the same number of runs (60).

Putting it another way, if India were in the world cup finals and you had to pray for one batsman's success (out of these 3), you should be praying for a high score from Ganguly, as that will improve India's chances of winning more than if Sachin or Rahul were to hit the same high score.

WHAT ABOUT THE BATTING AVERAGE?

Ok, so we have proved that each run from Ganguly's bat was more useful than one from Sachin's or Rahul's bat. But what about the actual runs scored in every innings?

Sachin scores his runs at an average of 45 per innings, Ganguly scored at 41 and Dravid at 39. Even though Ganguly's runs are more valuable, Sachin has scored more in each innings on average. Could his contribution be more than Ganguly's?

To find out each player's average contribution, we need to look at their contribution for each of the innings played. As an example, Rahul Dravid scored 69 runs in his last innings. As per our model, this innings improved India's chances of winning from 35% to 57%, an improvement of 22 percentage points. Thus Rahul's contribution through that innings was 22%. If we repeat this process for all his innings and calculate the average contribution per innings, this represents Rahul's average contribution to India's victories over his entire career.

Here is a comparison of the 3 Indian stalwarts.

Again, Ganguly has the highest average contribution per innings. For every innings that he played, the runs scored by him improved India's chances of winning by 13% on average. In comparison, Rahul's average contribution is 11% and Sachin's is 10%.

Thus, as per our statistical analysis, Ganguly is the player with the highest average contribution per innings.

From a purely statistical point of view, if we take the average contribution per innings as the defining measure, Ganguly comes out as the most important contributor and is, therefore, the best ODI player for India amongst the 3 highest run-getters.

LIFETIME CONTRIBUTION

Is average contribution the best way to measure a player's contribution?

For example, Sachin has played in over 430 games (more than any other batsman in the world), while Sourav has played only 292 and Dravid 307.

Statistic   Sachin   Sourav   Dravid
Innings     431      292      307
Runs        17,742   11,255   10,536

Is it not better to look at their total contribution over the entire career rather than the
average contribution per game?

Better or not, it is surely a different way to measure the players' contribution. If we compare the lifetime contributions of the 3 players, we get a different picture.

Sachin leads the comparison by a fair bit. Ganguly is second and Dravid is a distant third.
Of course, a big thing to consider is that Sachin is still playing while the other two have
retired. Sachin will surely improve his lifetime contribution by the time he retires.

If we take the lifetime contribution of a player as the defining measure, Sachin emerges as the best ODI player for India amongst the 3 highest run-getters.

EXERCISE

Use the coefficients obtained from the models to generate the average and lifetime contributions for both Sourav and Rahul. Compare with the results here.
STAGE 7: MODEL VALIDATION

We have used logistic regression to create a model that measures the impact of a player's runs scored on the team's chances of winning that game.

We have found that Sourav Ganguly has the highest average contribution per innings while Sachin has the highest lifetime contribution.

Both of these conclusions are based on the models that we have built. It is therefore important to verify how good, if at all, these models are.

There are many ways of measuring the quality of a model. We will look at a very intuitive
way of doing this. We have already discussed the Chi-square test previously. We will use
this versatile test once again, this time to measure the quality of our model.

If you remember, the Chi-square test is commonly used to compare observed data with
data we would expect to obtain according to a specific hypothesis.

Let us understand how we can use it to assess the quality of our model.

There are 431 games in our data (for Sachin). Since we have removed all games with no result or a tied result, we know that all of the remaining 431 games had a win/loss result. Now if we try to predict the results of these 431 games with no additional information, we would expect to get it right in 50% of the cases. Thus we would expect our guess to be right in 215 or 216 of the games and wrong in the remaining ones.

Now we can make the same prediction about the game outcome using our model as well.
If the model predicts a probability higher than .5, we can say that it is predicting a win for
India. If the model predicts a probability lower than .5, we can assume that the model is
predicting a loss for India.

If the model is not really any good, we would expect it to be right in half the cases.

Out of the 431 games, India has lost 200. If the model were no better than chance, we would expect it to predict 100 of these as losses and 100 as wins. Similarly, of the 231 games that India won, we would expect the model to predict 115.5 correctly as wins and the rest incorrectly as losses.

This is what we would expect to see.

                 Model Prediction
Actual result    Loss     Win      Grand Total
Loss             100      100      200
Win              115.5    115.5    231
Grand Total      215.5    215.5    431

And this is what we actually see.

                 Model Prediction
Actual result    Loss     Win      Grand Total
Loss             114      86       200
Win              87       144      231
Grand Total      201      230      431

We expect the model, if it is no good, to correctly predict 215.5 of the 431 games. The model actually predicts (114 + 144) = 258 games correctly.

We can thus see that the model seems to be good. It has a (258/431), i.e. 60%, accuracy when predicting the outcome of the game based on Sachin's score.
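The actual-vs-predicted table can be produced directly in R. A sketch, assuming the outcome column is coded so that 1 means an India win:

# Predicted win probability for every game, then a 0.5 cut-off
pred_win <- predict(smodel, type = "response") > 0.5
table(Actual = data.frame.sachin$outcome, Predicted = pred_win)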

We can use the Chi-square test to determine if this difference between what we expected
to see and what we actually see is statistically significant or not.

Using Excel again to run this test, we find the difference is significant at well beyond the 99.99% confidence level; in other words, what we are seeing is almost certainly because the model is actually good, and not because of chance.
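For reference, here is an R equivalent of that Excel test, comparing the observed cells against the 50/50 expectation (a sketch):

observed <- c(114, 86, 87, 144)         # loss/loss, loss/win, win/loss, win/win
expected <- c(100, 100, 115.5, 115.5)   # what a coin-flip model would give
chisq_stat <- sum((observed - expected)^2 / expected)
pchisq(chisq_stat, df = 1, lower.tail = FALSE)   # p-value, far below 0.0001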

We can thus conclude that our model is indeed a good model and we can be fairly
confident about our findings.

EXERCISE

Create the tables with actual and expected values for both players.
Run Chi-square tests to check the validity of the models.
CONCLUSION

In this book, we have looked at an application of logistic regression to generate statistical insights from data that help us make a decision on who is the best ODI batsman in India.

PROBLEM DEFINITION

We started with a vaguely defined question: Who is the greatest ODI batsman India has ever produced?

In the first step, we defined the problem in more certain terms. If we want to apply any kind of analytics to the data, the problem definition needs to be unambiguous and precise. We changed the definition to: Which batsman has had the most impact on India's win-rate through the runs they have scored in ODIs?

Then, in order to limit the scope of the analysis, we added another constraint. We modified the problem definition to this form: Amongst those who have scored more than 10,000 runs in ODIs, which batsman has had the most impact on India's win-rate through the runs they have scored?

DATA EXPLORATION

Data exploration was done in 2 steps.

In the first step we identified the information contained in the data set. We looked at
each field and understood its definition. We also looked at several examples of questions
that we can answer with this data.

In the second step, we dug a little deeper to understand the data even better.

How much information do we have?

What is the quality of the available data?

How do we need to prepare the data?

DATA PREPARATION

This is the stage where we prepare the data for extensive further analysis. This involves cleaning up variables and removing data that is not required for analysis. For Sachin's analysis, we started with 463 games. We removed the games where the player did not bat or where the result was a tie or no result. We ended with 431 games' worth of data.
Data preparation often involves other important activities like outlier removal, missing
value treatment and variable transformation as well.

DESCRIPTIVE ANALYTICS

Having prepared the data for analysis, we first ran a series of descriptive analytics on the data. Descriptive analytics like this helps an analyst understand the data better. It also helps her spot anything unusual: anything that requires further investigation.

PREDICTIVE MODELING

The next step is to build a predictive model that will help us answer the original question.

We used logistic regression to estimate the relationship between Sachin's score and India's win-rate. In other words, we built a model that helps us predict, for a given number of runs scored by Sachin, the probability of India winning the game. This model is also able to estimate the increase in the probability of India winning with each additional run scored by Sachin.

INTERPRETING THE RESULTS

Having built the model, the next step is to interpret the results of the model and generate
insights from it. We used 2 separate parameters to measure who is the greatest ODI
batsman.

The first parameter measures the average contribution of the batsman per innings. Sourav Ganguly emerged as the most valuable contributor using this method.

The second parameter measures the lifetime contribution of the batsman over his entire career. Sachin Tendulkar emerged as the top contributor using this method.

MODEL VALIDATION

We also spent some time validating the model. We used a Chi-square test to determine if
the model is statistically significant or not. We found that our model was highly significant
and thus concluded that the insights generated from the model are indeed valid.

This brings us to the end of this book. We hope that you will find it a useful tutorial for gaining a beginner's understanding of how logistic regression can be applied to solve problems and make decisions.
