
BIA Assignment

Roll Nos.:
Anurag Pareek – 2018PGP066
Navneet Kaur – 2018PGP084
Prachi Vats – 2018PGP092

Problem Statement:
Based on a number of predictor variables (numerical and categorical), we need to predict churn in the current
and future score documents by training a model on an oversampled training dataset (calibration.csv). The
actual churn rate of 1.8% is accurately reflected in the prediction datasets (currentscore.csv,
futurescore.csv), while the training dataset has an approximately 50-50 split of churners and non-churners.

We need to calculate the top-decile lift (for the top 10% of customers predicted most likely to churn) and the
Gini coefficient (based on the area under the lift curve) as criteria for prediction and model selection.

Data Available:
The available data has three documents: “calibration.csv” to train the model, and “currentscores.csv” and
“futurescores.csv” on which churn is predicted with the trained model.

The data has a mix of categorical and continuous variables which need to be handled separately for pre-
processing.

Data Preprocessing:
The available calibration dataset has multiple missing values, which need to be handled before creating the
training model.

Checking the number of missing values in our dataset:

sum(is.na(d))
table(is.na(d))

Observation: there are a significant number of missing values that need to be handled before creating the model.

Since the data has both categorical and continuous variables, missing-data imputation cannot be done in a
single pass for both data types. So, we edit the dataset (in Excel) to manually convert all categorical data to
numeric (continuous) values before running the pre-processing, so that the entire preprocessing can be done
together for all variables. In this case, we have converted categorical data to values between -1 and 1.

E.g. a variable with 5 factor levels is coded as -1, -0.5, 0, 0.5, 1. Similarly, every other categorical variable is
divided into equally spaced values between -1 and 1 to obtain numerical values. This is also convenient because,
when we scale the values after missing-data imputation, all variables end up on a comparable scale, so the
converted categorical values fit in naturally.
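This manual Excel conversion can also be sketched in R (the function name `to_numeric_levels` is ours, for illustration only):

```r
# A minimal sketch of the manual Excel step: map a factor's k levels
# to k equally spaced values in [-1, 1].
to_numeric_levels <- function(x) {
  f <- as.factor(x)
  vals <- seq(-1, 1, length.out = nlevels(f))  # k = 5 gives -1, -0.5, 0, 0.5, 1
  vals[as.integer(f)]
}

to_numeric_levels(c("A", "B", "C", "D", "E"))  # -1.0 -0.5 0.0 0.5 1.0
```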

(The attached Excel file shows the edited values; this is the dataset used for imputing missing values.) We
initially removed the Customer ID column since it is irrelevant to missing-data imputation. We also removed
variables with more than 30% missing values, plus a few extra columns that either had a high percentage of
missing values or were irrelevant to the churn variable. The final dataset has 89 variables. This column
removal was done manually.

Checking the percentage of missing values in each variable:


pmiss = function(x){sum(is.na(x))/length(x)*100}
apply(d,2,pmiss)

The results below show that some variables like “plcd_vce_Mean” and “plcd_dat_Mean” have 0% missing values,
while others such as “ctrcount” and “tot_ret” have a very high percentage of missing values. We removed all
columns with a high percentage of missing values in Excel and ran the imputation on the reduced dataset.
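For reference, the same clean-up done manually in Excel could equally be scripted in R (`drop_high_missing` is our illustrative helper; the 30% threshold is the one stated above):

```r
# Percentage of missing values per column, as defined above.
pmiss <- function(x) sum(is.na(x)) / length(x) * 100

# Drop every column whose share of missing values exceeds the threshold.
drop_high_missing <- function(df, threshold = 30) {
  df[, sapply(df, pmiss) <= threshold, drop = FALSE]
}
```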

Now, loading the edited dataset into R and running the mice algorithm for missing-data imputation:

library(mice)
dat1<-read.csv("calibration_edit.csv")
imp3<- mice(dat1, method = "pmm", m = 1)

We then recheck a few continuous and categorical variables to confirm that the NA values are imputed and that
the categorical variables receive no in-between values (relative to the -1 to 1 range applied to them).

To check the missing-data imputation on one of the categorical variables, consider “ethnic”. It has 17 levels,
edited in numerical form as (-1, -0.875, -0.75, -0.625, -0.5, -0.375, -0.25, -0.125, 0, 0.125, 0.25, 0.375,
0.5, 0.625, 0.75, 0.875, 1). Checking this imputation on the new dataset:


imp3$imp$ethnic

The missing-data imputation output shows that all missing values have been assigned one of the existing numeric
levels, so the method is working correctly for categorical variables.

Now, similarly checking the imputed data for one of the continuous variables:

The missing-data imputation for continuous values is completed. Some of the imputed row values are shown.

Now, creating a complete dataset with the imputed values:

completeDataF<- complete(imp3,1)

Scaling the data:


scaled.datF<-scale(completeDataF)
Checking for the missing values again:

sum(is.na(scaled.datF))

The output indicates that all missing data has now been handled.

Partitioning the data into training and test:


set.seed(1000)
r=sample(seq_len(nrow(scaled.datF)),0.7*nrow(scaled.datF),replace=F)
dt=scaled.datF[r,]
dv=scaled.datF[-r,]

Logistic Regression Model:

model=glm(churn~.,as.data.frame(dt), family = "gaussian")

We use the Gaussian family instead of binomial because, after scaling, the churn variable is continuous rather than a 0/1 factor.
Summary of the model shows the important variables:
Selecting the important variables whose p-values are less than 0.05:
#selecting important coefficients
toselect.x <- summary(model)$coeff[,"Pr(>|t|)"] < 0.05 # credit to kith
# names of significant variables (dropping the intercept)
relevant.x <- setdiff(names(toselect.x)[toselect.x == TRUE], "(Intercept)")
# formula with only significant variables
sig.formula <- as.formula(paste("churn ~", paste(relevant.x, collapse = "+")))

sig.formula shows the final variables, indicated below:


Re-running the model with the selected variables:
model1=glm(churn~rev_Mean + mou_Mean + totmrc_Mean + da_Mean + ovrmou_Mean +
ovrrev_Mean + vceovr_Mean + datovr_Mean + change_mou + recv_sms_Mean +
custcare_Mean + ccrndmou_Mean + cc_mou_Mean + threeway_Mean +
mou_rvce_Mean + peak_vce_Mean + mou_peav_Mean + opk_vce_Mean +
adjrev + avgmou + avgqty + avg3rev + avg6mou + avg6qty +
uniqsubs + actvsubs + age1 + age2 + adults + models + prizm_social_one +
dualband + refurb_new + hnd_price + hnd_webcap + kid16_17 +
asl_flag, as.data.frame(dt), family = "gaussian")

Running the summary of this model to see the important variables:

There are a few other variables which are not important in the new model and can be further removed.

Removing the other variables that have p-values above 0.05, and re-running the final regression model:

model2=glm(churn~rev_Mean + mou_Mean + totmrc_Mean + da_Mean + ovrmou_Mean +
ovrrev_Mean + vceovr_Mean + datovr_Mean + change_mou + recv_sms_Mean +
custcare_Mean + ccrndmou_Mean + cc_mou_Mean + threeway_Mean +
mou_rvce_Mean + peak_vce_Mean + avgmou + avgqty + avg3rev + avg6mou + avg6qty +
uniqsubs + actvsubs + age1 + age2 + adults + models + prizm_social_one +
dualband + refurb_new + hnd_price + hnd_webcap + kid16_17 +
asl_flag, as.data.frame(dt), family = "gaussian")
Viewing the final summary output of these variables:

The final variables are chosen based on the final logistic regression model.

The chosen variables are:

churn ~ rev_Mean + mou_Mean + totmrc_Mean + da_Mean + ovrmou_Mean + ovrrev_Mean + vceovr_Mean +
datovr_Mean + change_mou + recv_sms_Mean + custcare_Mean + ccrndmou_Mean + cc_mou_Mean +
threeway_Mean + mou_rvce_Mean + peak_vce_Mean + avgmou + avgqty + avg3rev + avg6mou + avg6qty +
uniqsubs + actvsubs + age1 + age2 + adults + models + prizm_social_one + dualband + refurb_new +
hnd_price + hnd_webcap + kid16_17 + asl_flag

Plotting the ROC Curve:

Converting the validation data to a data frame before plotting the curve:

library(ROCR)
dv1=as.data.frame(dv)
m.val=predict(model2,dv1,type="response")
pred=prediction(m.val,dv1$churn)
perf=performance(pred,"tpr","fpr")
perf
plot(perf,main="ROC curve",colorize=T)

Due to scaling, the actual values are indicated as approximately -1 (churners) and 1 (non-churners), while the
predicted values lie on a 0 to 1 scale.

We see that the results are better at a cutoff probability of 0.4.
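The cutoff comparison can be sketched as follows (`confusion_at` is an illustrative helper of ours; labels are assumed coded 0/1, so it would be applied after mapping the scaled churn values back to 0/1):

```r
# Classify predicted scores at a given cutoff and tabulate them
# against the actual labels (assumed coded 0/1).
confusion_at <- function(prob, actual, cutoff = 0.4) {
  pred <- ifelse(prob > cutoff, 1, 0)
  table(actual = actual, predicted = pred)
}
```

On the validation data above this would be called as `confusion_at(m.val, dv1$churn, 0.4)`.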

Plotting the cumulative Lift Chart:
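A base-R sketch of the cumulative lift computation behind the chart (`cumulative_lift` is our illustrative helper): sort customers by predicted score, then compare the cumulative share of churners captured with the share of customers contacted.

```r
# Cumulative lift at each depth of the score-sorted customer list.
cumulative_lift <- function(score, actual) {
  ord <- order(score, decreasing = TRUE)
  captured <- cumsum(actual[ord]) / sum(actual)  # % of churners captured so far
  depth <- seq_along(actual) / length(actual)    # % of customers contacted
  captured / depth                               # lift at each depth
}
```

Plotting depth against the returned lift values draws the chart; with ROCR, `performance(pred, "lift", "rpp")` gives the equivalent curve.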

Gini coefficient:

The Gini coefficient is equal to the area below the line of perfect equality (0.5 by definition) minus the area
below the Lorenz curve, divided by the area below the line of perfect equality. In other words, it is double the
area between the Lorenz curve and the line of perfect equality.
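For a classifier, this definition is equivalent to Gini = 2·AUC − 1, which can be computed in base R via the rank-sum form of the AUC (function names are ours, for illustration):

```r
# AUC via the Wilcoxon rank-sum identity: probability that a random
# positive (churner, coded 1) is scored above a random negative.
auc_rank <- function(score, actual) {
  r <- rank(score)
  n1 <- sum(actual == 1)
  n0 <- sum(actual == 0)
  (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Gini coefficient from the AUC.
gini_coef <- function(score, actual) 2 * auc_rank(score, actual) - 1
```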

The ROC curve, lift chart and Gini coefficient indicate that the logistic regression model is not the best fit
for the data.

Decision Tree:

library(rpart)
dt1=as.data.frame(dt)
modeltr=rpart(churn~.,dt1,method="class",control=rpart.control(cp=0))
pm<-prune(modeltr,cp=modeltr$cptable[which.min(modeltr$cptable[,"xerror"]),"CP"])
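The pruning rule above picks the complexity parameter with the lowest cross-validated error. A self-contained sketch of the same pattern (using the built-in iris data purely as stand-in data for illustration):

```r
library(rpart)

# Grow a full tree, inspect the cross-validated error table, and
# prune at the cp value minimising "xerror".
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0))
printcp(fit)  # cp table: CP, nsplit, rel error, xerror, xstd
best.cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best.cp)
```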

With the output obtained from the entire dataset, the decision tree does not give a clear idea of the variables
to be selected, and the ROC curve with all variables is not significant.

Re-running the model with significant variables (since the tree is unclear, we use the same variables as in the
logistic regression to see a change in the ROC curve):

modeltr1=rpart(churn~rev_Mean + mou_Mean + totmrc_Mean + da_Mean + ovrmou_Mean +
ovrrev_Mean + vceovr_Mean + datovr_Mean + change_mou + recv_sms_Mean +
custcare_Mean + ccrndmou_Mean + cc_mou_Mean + threeway_Mean +
mou_rvce_Mean + peak_vce_Mean + avgmou + avgqty + avg3rev + avg6mou + avg6qty +
uniqsubs + actvsubs + age1 + age2 + adults + models + prizm_social_one +
dualband + refurb_new + hnd_price + hnd_webcap + kid16_17 +
asl_flag,dt1,method="class",control=rpart.control(cp=0))
The ROC curve here is a better indicator of the model. The decision tree can be considered as a model for the data.

Due to scaling, the actual values are indicated as approximately -1 (churners) and 1 (non-churners), while the
predicted values are 0 and 1.

For p = 0.40:

Misclassification Error: 43%

False Positive Rate: 3897 / (3897 + 11057) = 0.2606

This indicates that 26% of the people who will churn are classified as Not Churn.

False Negative Rate: 9065 / (5891 + 9065) = 60%

The model classifies 60% of the not-churn people as churn.

We focus on the low false positive rate, as this is the number of people we might actually lose.
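The rate calculations above can be sketched from a 2x2 confusion table (`error_rates` is our illustrative helper; rows are actual 0/1, columns predicted 0/1):

```r
# Error rates from a 2x2 confusion table (rows = actual, cols = predicted).
error_rates <- function(tab) {
  c(FPR = tab["0", "1"] / sum(tab["0", ]),   # actual 0 predicted 1
    FNR = tab["1", "0"] / sum(tab["1", ]),   # actual 1 predicted 0
    Misclassification = (tab["0", "1"] + tab["1", "0"]) / sum(tab))
}
```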

The lift chart indicates that the model works well for the top decile, so this model can be used for targeting
the top 10% of people.
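The top-decile lift criterion from the problem statement can be computed directly (`top_decile_lift` is our illustrative helper): the churn rate among the top 10% highest-scored customers divided by the overall churn rate.

```r
# Top-decile lift: churn rate in the top `decile` fraction of customers
# (ranked by predicted score) relative to the overall churn rate.
top_decile_lift <- function(score, actual, decile = 0.10) {
  n.top <- ceiling(decile * length(score))
  top <- order(score, decreasing = TRUE)[seq_len(n.top)]
  mean(actual[top]) / mean(actual)
}
```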
We also tried to run Naïve Bayes and KNN, but their run times were much longer, so they are not ideal methods
for model creation here.

Also, the neural network returns an error and does not give any indicative result.

With the above outputs, we see that the decision tree with the selected predictors is the better model. We could
also try ensemble approaches such as boosting to combine the outputs of both and obtain a model.

Prediction on Current and Future Scores:

We perform the same scaling process to adjust values before making the predictions.

cs<-read.csv("current_score_edit.csv")
scaled.cs1<-as.data.frame(scale(cs))
vnew=predict(modeltr1,newdata = scaled.cs1)
vnew1<-ifelse(vnew[,2]>0.50,1,0)
vnew1
write.csv(vnew1, file = "current_score_pred.csv")

Output Data:
“current_score_pred.csv”
Based on the scaling performed, in the data:

0 – indicates churn
1 – indicates not churn

Percentage of churners: 4666/51306 = 9.09%

Actual percentage (per the case): 1.8%

Similarly, for future score data:

Output Data:
“final_score_pred.csv”

Percentage of churners: 4666/46440 = 10.05%

Actual percentage (per the case): 1.8%

Attaching all the scaled files and the output files separately for reference.
