
Linear Discriminant Analysis

Narendra Kumar, Santadyuti Samanta, Vishal Vij and Shikhar Parashar

April 22, 2018

Problem Statement - Books By Mail from Paul Green


The ‘Books By Mail’ company is interested in offering a new title, The Art History of Florence, to its customers. It sent a test mailing to 1,000 existing customers; of these, 83 actually purchased the book, a response rate of 8.3 percent. The company also sent an identical mailing to another 1,000 customers to serve as a holdout sample. The scope of the study is confined to predicting whether a customer will buy the new book based on two input variables: months since last purchase and number of art books purchased. The data for the existing customers and the holdout sample are given in the datasets PaulBooks2.csv and PaulBooks1.csv.
Perform discriminant analysis using R and interpret the results using the training data. You should incorporate all statistical tests associated with discriminant analysis and test your accuracy on the test data. Critique the cut-off probability and suggest what the right cutoff should be.

Identifying the Data


The existing customers’ data is provided in PaulBooks2.csv; its first and last six rows, structure, and summary are shown below.
## # A tibble: 6 x 4
## ID Months NoBought Purchase
## <int> <int> <int> <int>
## 1 2001 30 0 0
## 2 2002 12 0 0
## 3 2003 18 0 0
## 4 2004 27 1 0
## 5 2005 4 1 0
## 6 2006 35 0 0

## # A tibble: 6 x 4
## ID Months NoBought Purchase
## <int> <int> <int> <int>
## 1 2995 1 0 0
## 2 2996 9 1 1
## 3 2997 9 0 0
## 4 2998 28 1 0
## 5 2999 6 1 0
## 6 3000 10 0 0
## Classes 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of 4 variables:
## $ ID : int 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 ...
## $ Months : int 30 12 18 27 4 35 4 23 10 21 ...
## $ NoBought: int 0 0 0 1 1 0 0 0 0 0 ...
## $ Purchase: int 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 4
## .. ..$ ID : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Months : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ NoBought: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Purchase: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"

## ID Months NoBought Purchase
## Min. :2001 Min. : 1.00 Min. :0.000 Min. :0.000
## 1st Qu.:2251 1st Qu.: 7.00 1st Qu.:0.000 1st Qu.:0.000
## Median :2500 Median :12.00 Median :0.000 Median :0.000
## Mean :2500 Mean :12.91 Mean :0.373 Mean :0.081
## 3rd Qu.:2750 3rd Qu.:16.00 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :3000 Max. :35.00 Max. :3.000 Max. :1.000
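
The listings above can be reproduced with a short R sketch; it assumes the two CSV files are in the working directory and uses the readr package, consistent with the tibble and column-collector output shown:

```r
library(readr)

# Training data (existing customers) and holdout sample
PaulBooks2 <- read_csv("PaulBooks2.csv")
PaulBooks1 <- read_csv("PaulBooks1.csv")

head(PaulBooks2)    # first six rows
tail(PaulBooks2)    # last six rows
str(PaulBooks2)     # structure, including the readr column spec
summary(PaulBooks2) # five-number summaries and means
```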

Target and Independent Variables


From the problem statement, it is evident that ‘Purchase’ is the target variable and ‘Months’ and ‘NoBought’ are the independent variables.
Target variable - ‘Purchase’

Independent variables - ‘Months’ & ‘NoBought’


Correlation amongst the variables.
Relationships between the target and the independent variables.

Creating the Model


Linear Discriminant Model
## Call:
## lda(Purchase ~ Months + NoBought, data = PaulBooks2)
##
## Prior probabilities of groups:
## 0 1
## 0.919 0.081
##
## Group means:
## Months NoBought
## 0 13.257889 0.3155604
## 1 8.925926 1.0246914
##
## Coefficients of linear discriminants:
## LD1
## Months -0.05119557
## NoBought 1.50670235

For the linear discriminant model created using the lda() function from the MASS package, we obtain -0.0512 as the coefficient of Months and 1.5067 as the coefficient of NoBought. The signs are consistent with the group means: buyers have bought more art books and purchased more recently, so a higher NoBought and a lower Months push the discriminant score toward the buying class.
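The model above can be reproduced with a minimal sketch, assuming PaulBooks2.csv has been read into a data frame named PaulBooks2:

```r
library(MASS)

# Fit the linear discriminant model on the training data
lda_model <- lda(Purchase ~ Months + NoBought, data = PaulBooks2)

lda_model$prior    # prior probabilities of the groups
lda_model$scaling  # coefficients of the linear discriminant LD1
```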
Fisher’s Discriminant Model
##
## Descriptive Discriminant Analysis
## ---------------------------------
## $power discriminant power
## $values table of eigenvalues
## $discrivar discriminant variables
## $discor correlations
## $scores discriminant scores
## ---------------------------------
##
## $power
## cor_ratio wilks_lamb F_statistic p_values
## Months 0.020792889 0.979207111 21.191944558 0.000004691
## NoBought 0.093146548 0.906853452 102.508574564 0.000000000
##
##
## $values
## value proportion accumulated
## DF1 0.115 100.000 100.000
##
##
## $discrivar
## DF1
## constant 0.09878
## Months -0.05120
## NoBought 1.50670
##
##
## $discor
## DF1
## Months -0.4339
## NoBought 0.9183
##
##
## $scores
## z1
## 1 -1.4371
## 2 -0.5156
## 3 -0.8227
## 4 0.2232
## 5 1.4007
## 6 -1.6931
## ...

Fisher’s model gives the coefficients of the variables directly, and the output also indicates the importance of each variable through its correlation with the discriminant function.
In the output, the p-values for both variables are very small. Hence we conclude that both are statistically significant and will be able to separate the customers into those buying and those not buying the book.
The correlation ratios in $power indicate that NoBought is four to five times more important than Months (0.0931 / 0.0208 ≈ 4.5) in statistically separating and classifying the customers into buying and not buying the book.
Along with the coefficients there is a constant term, analogous to the intercept in regression. Hence the discriminant equation is
Z = 0.09878 - 0.05120(Months) + 1.50670(NoBought)

From this equation a z score is calculated for each customer and their class is predicted. The structure correlations in $discor show that Months is correlated at -0.43 with the discriminant function, while NoBought is correlated at 0.92. From this we infer that the number of art books purchased is far more important than the number of months since the last purchase in predicting which customers will buy.
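
The layout of this output ($power, $values, $discrivar, $discor, $scores) matches the desDA() function from the DiscriMiner package; under that assumption, a sketch of the fit together with a worked check of the first discriminant score:

```r
library(DiscriMiner)  # assumed source of the output above

# Descriptive (Fisher) discriminant analysis
fisher <- desDA(PaulBooks2[, c("Months", "NoBought")], PaulBooks2$Purchase)

# Worked check: z score of the first record (Months = 30, NoBought = 0)
z1 <- 0.09878 - 0.05120 * 30 + 1.50670 * 0
z1  # approximately -1.437, matching the first entry of $scores
```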

Mahalanobis Discriminant Model


##
## Linear Discriminant Analysis
## -------------------------------------------
## $functions discrimination functions
## $confusion confusion matrix
## $scores discriminant scores
## $classification assigned class
## $error_rate error rate
## -------------------------------------------
##
## $functions
## 0 1
## constant -1.552 -4.551
## Months 0.201 0.135
## NoBought 0.858 2.802
##
##
## $confusion
## predicted
## original 0 1
## 0 894 25
## 1 63 18
##
##
## $error_rate
## [1] 0.088
##
##
## $scores
## 0 1
## 1 4.4773133 -0.5031929
## 2 0.8596540 -2.9318823
## 3 2.0655405 -2.1223191
## 4 4.7322553 1.8938982
## 5 0.1096907 -1.2094271
## 6 5.4822187 0.1714431
## ...
##
## $classification
## [1] 0 0 0 0 0 0
## Levels: 0 1
## ...

The Mahalanobis discriminant model gives its output as two separate classification functions, one per class: the first column of $functions corresponds to class 0 (not buying) and the second to class 1 (buying). From these equations we can compute, for each record, a score for each class.
For not buying (class 0): -1.552 + 0.201(Months) + 0.858(NoBought)
For buying (class 1): -4.551 + 0.135(Months) + 2.802(NoBought)

Thus, two scores are calculated for each record, and the record is classified into the class with the highest score.
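
This output layout matches the linDA() function from the DiscriMiner package; under that assumption, a sketch of the fit and a worked check of the scores for the first training record (Months = 30, NoBought = 0):

```r
library(DiscriMiner)  # assumed source of the output above

mahal <- linDA(PaulBooks2[, c("Months", "NoBought")], PaulBooks2$Purchase)

# Worked check for the first record (Months = 30, NoBought = 0)
score_0 <- -1.552 + 0.201 * 30 + 0.858 * 0  # 4.478, matches $scores column 0
score_1 <- -4.551 + 0.135 * 30 + 2.802 * 0  # -0.501, matches column 1
# score_0 > score_1, so the record is assigned to class 0 (not buying)
```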

Testing the Model


##
## 0 1
## 0 889 28
## 1 63 20

The model also performs well on the test data, with an overall accuracy of (889 + 20)/1000 = 90.9 percent. However, the model correctly identifies very few actual buyers: only 20 of the 83 customers who bought, a sensitivity of about 24 percent. Since the class distribution is unbalanced, a proper cut-off value should be chosen based on business intuition rather than the default of 0.5.
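
One way to act on this is to lower the cut-off applied to the posterior probability of buying. A minimal sketch, assuming the fitted lda() model is named lda_model and the holdout data is loaded as PaulBooks1; the cutoff of 0.1, near the 8.3 percent base response rate, is purely illustrative:

```r
# Score the holdout sample and classify with a lower cut-off
pred     <- predict(lda_model, newdata = PaulBooks1)
prob_buy <- pred$posterior[, "1"]   # posterior probability of class 1

cutoff     <- 0.1                   # illustrative, near the base rate
pred_class <- ifelse(prob_buy > cutoff, 1, 0)

table(original = PaulBooks1$Purchase, predicted = pred_class)
```

Lowering the cutoff trades some overall accuracy for a higher proportion of correctly identified buyers, which is usually the economically relevant class in a direct-mail setting.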
