Contents
Problem Statement
EDA
SMOTE
Logistic Regression
KNN
Naïve Bayes
Model Validation
Bagging
Boosting
Actionable & Recommendations
Note: The following R code was used in the assignment.
This project requires us to understand which mode of transport employees prefer for
commuting to their office. The attached data includes information about each employee's mode of
transport as well as personal and professional details such as age, salary and work experience.
Problem Statement:
We need to predict whether or not an employee will use a car as a mode of transport, and to
identify which variables are significant predictors behind this decision.
There are 444 observations and 9 variables (refer Fig 1a, 1b & 1c).
Univariate analysis: the variable types present in the dataset are (refer Fig 1d):
1) Numerical (Age, Engineer, MBA, Work Exp, Salary, Distance & license).
2) Factor (Gender & Transport).
Missing values: there was only one missing value in the dataset, in the variable MBA, at
row 146 (refer Fig 1e).
Missing value treatment and conversion of the factor variables to binary are done. The original
dataset had 444 observations; after missing value treatment it has 443 observations
(refer Fig 1f).
The outliers present are identified (refer Fig 1g (i), (ii), (iii) & (iv)).
Visual representation of the univariate analysis: Fig 1h (i) & (ii).
Visual representation of the bivariate analysis: Fig 1i & 1j.
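The missing value treatment and the factor-to-binary conversion described above can be sketched as follows. This is a minimal Python illustration, not the R code used in the assignment; the rows, the "Public Transport" level, the CarUsage column name and the 0/1 coding of Gender are made-up assumptions for the example.

```python
# Toy rows standing in for the dataset; all values are illustrative only.
rows = [
    {"Age": 28, "Gender": "Male", "MBA": 0, "Transport": "Public Transport"},
    {"Age": 34, "Gender": "Female", "MBA": None, "Transport": "Car"},  # missing MBA
    {"Age": 40, "Gender": "Female", "MBA": 1, "Transport": "Car"},
]

def drop_missing(records):
    """Keep only the records in which no field is missing (None)."""
    return [r for r in records if all(v is not None for v in r.values())]

def to_binary(records):
    """Recode the two factor variables as 0/1 indicators."""
    out = []
    for r in records:
        r = dict(r)  # copy so the input rows are left untouched
        r["Gender"] = 1 if r["Gender"] == "Male" else 0      # assumed coding
        r["CarUsage"] = 1 if r["Transport"] == "Car" else 0  # target: car vs. not
        del r["Transport"]
        out.append(r)
    return out

clean = to_binary(drop_missing(rows))
print(len(rows), "->", len(clean))  # the row with the missing MBA value is dropped
```

Dropping the single incomplete row mirrors the 444 to 443 reduction reported above; imputing the value would be an alternative, but the report does not describe doing so.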
Fig: 1a
Fig: 1b
Fig: 1c
Fig:1d
Fig: 1e
Fig 1f
Fig 1g (i)
Fig 1g (ii)
Fig 1g (iii)
Fig 1g (iv)
Fig 1h. (i)
Bivariate Analysis and Graphical Visualization of the Dataset.
For the bivariate analysis we compare the following:
Age with Qualification (Engineer and MBA): Fig 1i, (i) & (ii)
Salary distribution with Qualification (Engineer and MBA): Fig 1i, (i) & (ii)
Work experience with Gender: Fig 1j (i) & (ii)
Fig 1i
Fig 1i(i)
Fig 1i(ii)
Work experience with Gender.
Outliers are again visible. The mean work experience does not differ much between the genders,
and the data is divided evenly between both genders.
Fig 1j (i)
Fig 1j (ii)
Fig 1k (i): panel (a) Age, Work.Exp, Distance; panel (b) Work.Exp, Distance, Salary, Age
Fig 1k (ii)
Multicollinearity present in the dataset is treated by excluding the collinear variables.
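The report does not state how the collinear variables were identified. One common approach, shown here purely as an assumed illustration, is to flag pairs of numeric variables whose absolute Pearson correlation exceeds a chosen cut-off. The 0.8 threshold, the helper names and the toy values below are hypothetical and are not taken from the dataset.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def collinear_pairs(columns, threshold=0.8):
    """Return variable pairs whose absolute correlation exceeds the threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) > threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs

# Made-up values, not taken from the dataset.
toy = {
    "Age":      [24, 27, 30, 35, 40, 45],
    "Work.Exp": [1, 4, 7, 12, 17, 22],
    "Distance": [9.0, 14.2, 7.5, 12.1, 8.3, 13.0],
}
flagged = collinear_pairs(toy)
print(flagged)  # only the Age / Work.Exp pair is flagged in this toy data
```

One variable from each flagged pair would then be a candidate for exclusion before modelling.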
2) Data Preparation (SMOTE).
Solution:
In SMOTE we need to define perc.over: for every 100 units of perc.over, one synthetic
minority-class case is added per original minority-class case.
Using SMOTE we will analyze the dataset and understand the influencing factors. But first we need
to split the dataset into Train and Test sets before starting the SMOTE activity.
The dataset is divided into Train and Test; refer to the tables in Fig: 2.2(i) & 2.2(ii).
As the tables show, Train and Test have the same percentage of car users and non-car users. We
then apply SMOTE, after which the data is divided equally between car users and non-car users
(Fig: 2.3(i) & 2.3(ii)).
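To make the role of perc.over concrete, here is a minimal sketch of the over-sampling step of SMOTE in Python. It is an illustration under stated assumptions, not the R function used in the assignment: it handles numeric features only, ignores the under-sampling of the majority class, and the function name, the parameter k and the sample points are hypothetical.

```python
import math
import random

def smote(minority, perc_over=100, k=2, seed=0):
    """Create synthetic minority points by interpolating towards near neighbours.

    perc_over / 100 synthetic points are generated per original minority point.
    """
    rng = random.Random(seed)
    per_point = perc_over // 100
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority neighbours of x (excluding x itself)
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(x, p))
        neighbours = others[:k]
        for _ in range(per_point):
            nb = rng.choice(neighbours)
            gap = rng.random()  # position along the segment from x to nb
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Made-up (Age, Salary) points for the minority class.
car_users = [(38.0, 36.0), (40.0, 45.0), (42.0, 51.0), (39.0, 40.0)]
new_points = smote(car_users, perc_over=200)
print(len(new_points))  # 2 synthetic points per original point -> 8
```

Because each synthetic point lies on the line segment between a minority case and one of its neighbours, the new cases stay inside the region already occupied by the minority class.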
Fig: 2
Fig:2.3(i)
Fig:2.3(ii)
3A) Applying Logistic Regression [LR] & Interpreting Results.
Solution:
Using the SMOTE dataset, the Logistic Regression [LR] method is applied to understand car
usage and the factors influencing it.
From the snapshots below (Fig3A (a), (b) & (c)) we can interpret that:
Age and License are the most significant variables.
With an increase in age by 1 year, there is a 98% probability of a person using a car.
There is a 99% probability that a person who has a license will use a car.
There is a 72% probability of car use with an increase in salary.
The null deviance is 357.664.
The residual deviance is 17.959.
McFadden R-squared yields 0.94.
Predictions are made based on this model (Fig3A (d) & (e)).
The model predicts car users with 94% accuracy and non-car users with 95% accuracy.
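The McFadden R-squared can be checked from the two deviances reported above. For a binary logistic model fitted to individual observations, the deviance equals -2 times the log-likelihood, so the statistic reduces to 1 minus the ratio of residual to null deviance. The short Python check below uses only the two reported numbers; the function name is our own.

```python
null_deviance = 357.664      # reported above
residual_deviance = 17.959   # reported above

def mcfadden_r2(residual, null):
    """McFadden pseudo R-squared computed from the two deviances."""
    return 1 - residual / null

r2 = mcfadden_r2(residual_deviance, null_deviance)
print(round(r2, 4))  # 0.9498
```

The result is about 0.95, which agrees with the reported 0.94 if that figure was truncated rather than rounded.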
Fig3A (a)
Fig3A (b)
Fig3A (c)
Fig3A (d)
Fig3A (e)
3B) KNN
Solution:
KNN is a useful algorithm that classifies a point by matching it to its K closest neighbors in a
multidimensional space. We apply the KNN algorithm to the current dataset and interpret the findings.
The prediction outcome and confusion matrix for KNN are shown in Fig3B (iii): the accuracy is
94%, with sensitivity at 94.52% and specificity at 94.44%. The kappa value is 0.8875. The
summary of the test dataset in Fig3B (iv) shows a 73:54 class split.
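The idea behind KNN can be sketched in a few lines of Python. This is a minimal illustration assuming Euclidean distance and a simple majority vote; the function name, the value k = 3 and the scaled sample points are hypothetical and do not come from the assignment.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Made-up, already scaled (Age, Salary) points; 1 = car user, 0 = non-car user.
train_x = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.8, 0.9), (0.9, 0.8), (0.7, 0.7)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_x, train_y, (0.85, 0.85)))  # 1
print(knn_predict(train_x, train_y, (0.15, 0.25)))  # 0
```

Since the method relies on distances, variables on very different scales (for example Age and Salary) are normally scaled before it is applied.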
Fig3B (i)
Fig3B (ii)
Fig3B (iii)
Fig3B (iv)
3C) Naïve Bayes Model
Solution:
The Naïve Bayes model is applied to the dataset and the findings are as follows (refer Fig 3C
(i), (ii), (iii) & (iv)):
Accuracy is 92.91%, with sensitivity at 95.89% and specificity at 88.89%.
The kappa value is 0.854.
Fig:3C(i)
Fig:3C(ii)
Fig:3C(iii)
Fig:3C(iv)
3E) Model Validation
Solution:
Results from the table below suggest that the KNN model has better accuracy than Naïve
Bayes; therefore we can use the inferences of the KNN model for decision making.
Method        Accuracy (%)   Sensitivity (%)   Specificity (%)   Kappa Value
KNN           94             94.52             94.44             0.8875
Naïve Bayes   92.91          95.89             88.89             0.854
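The figures in the table can be reproduced from confusion-matrix counts. The counts used below are not read from the figures; they are inferred from the 73:54 test split and the reported sensitivity and specificity, so they should be treated as an assumption. The Python sketch shows how accuracy, sensitivity, specificity and Cohen's kappa follow from such counts.

```python
def metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity and Cohen's kappa from confusion counts."""
    n = tp + fn + tn + fp
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Agreement expected by chance, from the actual and predicted class totals
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return accuracy, sensitivity, specificity, kappa

# Counts inferred from the 73:54 test split and the reported rates (assumption).
knn = metrics(tp=69, fn=4, tn=51, fp=3)
naive_bayes = metrics(tp=70, fn=3, tn=48, fp=6)

for name, m in [("KNN", knn), ("Naive Bayes", naive_bayes)]:
    print(name, [round(v, 4) for v in m])
```

With these counts the kappa values come out at about 0.8875 for KNN and 0.854 for Naïve Bayes, matching the table, and the KNN accuracy is about 94.49%, which the table reports as 94%.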
3F) Bagging
Solution:
The outcome of the Bagging model can be seen in Fig3F (i) & (ii); the details are given in the
table below.
Fig3F(i)
Fig3F (ii)
3G) Boosting
Solution:
The outcome of the Boosting model can be seen in Fig3G (i), (ii), (iii), (iv) & (v); the details
are given in the table below.
Fig3G(i)
Fig3G(ii)
Fig3G(iii)
Fig3G(iv)
Fig3G(v)
4 Actionable & Recommendations
Method        Accuracy (%)   Sensitivity (%)   Specificity (%)
KNN           94             94.52             94.44
Naïve Bayes   92.91          95.89             88.89
Bagging       94.49          93.15             96.3
Boosting      97.64          98.63             96.3
With reference to the above table and the overall analysis of the dataset under different models, we can
conclude that the Boosting method is preferred, as it has the highest accuracy and sensitivity and the
joint-highest specificity. From the overall analysis we can therefore conclude that:
Age is a significant factor influencing car usage: the higher the age, the higher the
number of car users.
Along with age, work experience and salary play a significant role. Users with higher work
experience (and therefore higher age) and a higher salary prefer to commute by car.
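As a consistency check on the comparison table, accuracy can be recomputed as the class-weighted average of sensitivity and specificity. The Python sketch below assumes that all four models were evaluated on the same test set with the 73:54 split mentioned in the KNN section; that assumption, and the helper name, are ours.

```python
POSITIVES, NEGATIVES = 73, 54  # test split reported in the KNN section

# (sensitivity %, specificity %) as listed in the table
table = {
    "KNN": (94.52, 94.44),
    "Naive Bayes": (95.89, 88.89),
    "Bagging": (93.15, 96.3),
    "Boosting": (98.63, 96.3),
}

def implied_accuracy(sensitivity, specificity):
    """Accuracy (%) implied by the class-wise rates and the class sizes."""
    total = POSITIVES + NEGATIVES
    return (sensitivity * POSITIVES + specificity * NEGATIVES) / total

implied = {name: round(implied_accuracy(*rates), 2) for name, rates in table.items()}
best = max(implied, key=implied.get)
print(implied, best)
```

Under that assumption the implied accuracies agree with the table, Boosting remains the best model, and KNN and Bagging turn out to have practically the same accuracy (about 94.49%), so the choice between those two rests on sensitivity versus specificity.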