
Machine Learning- Project 5

Predicting mode of Transport

Presented By: Sanan Sahadevan Olachery

Submission Date: March 29th 2020

Content
• Problem Statement
• EDA
• SMOTE
• Logistic Regression
• KNN
• Naïve Bayes
• Model Validation
• Bagging
• Boosting
• Actionable & Recommendations
Note: the following R packages were used in the assignment:

1. readr
2. ggplot2
3. usdm
4. VIF
5. DMwR
6. caret
7. e1071
8. class
9. gbm
10. xgboost
11. ipred
12. plyr
13. rpart
14. ROCR

This project requires you to understand which mode of transport employees prefer for
commuting to their office. The attached data includes employee information about their mode of
transport as well as personal and professional details such as age, salary, and work experience.

Problem Statement:
We need to predict whether or not an employee will use a car as their mode of transport, and
identify which variables are significant predictors of this decision.

1) EDA on the Data


Solution:
From the structure and summary of the dataset we can observe the following:

• There are 444 observations and 9 variables (refer Fig 1a, 1b & 1c).
• The variables in the dataset are (refer Fig 1d):
1) Numerical: Age, Engineer, MBA, Work.Exp, Salary, Distance & license.
2) Factor: Gender & Transport.
• Missing values: there is a single missing value, in the variable MBA, at row 146 (refer Fig 1e).
• After missing-value treatment and conversion of the factor variables to binary, the dataset
has 443 observations, down from the original 444 (refer Fig 1f).
• Outliers are identified in Fig 1g (i)-(iv).
• Univariate visualizations are shown in Fig 1h (i) & (ii).
• Bivariate visualizations are shown in Fig 1i & 1j.

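The EDA steps above (structure, summary, locating the single missing value, dropping it, and an outlier check) can be sketched in base R as follows. The toy data frame and its values are assumptions standing in for the actual dataset; only the column names are taken from the report.

```r
# Toy frame mirroring the report's columns (Age, MBA, Salary, Transport);
# the values are illustrative, not the real data.
cars <- data.frame(
  Age       = c(25, 28, 32, 40, 23),
  MBA       = c(0, 1, NA, 0, 1),      # one missing value, as in the report
  Salary    = c(9.5, 13.4, 28.8, 45.0, 8.1),
  Transport = factor(c("Public Transport", "2Wheeler", "Car", "Car",
                       "Public Transport"))
)
str(cars)                    # structure: class of each variable
summary(cars)                # five-number summaries; NA counts appear here
which(is.na(cars$MBA))       # locate the missing value (row 146 in the real data)
cars <- cars[complete.cases(cars), ]   # drop the incomplete row (444 -> 443)
boxplot(cars$Salary, main = "Salary outliers")   # outlier check, as in Fig 1g
```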
Fig: 1a

Fig: 1b

Fig: 1c

Fig:1d

Fig: 1e

Fig 1f

Fig 1g (i)

Fig 1g (ii)

Fig 1g (iii)

Fig 1g (iv)

Visual Representation of Univariate


The images below support the following interpretation:
• Age, Work.Exp, Salary & Distance are continuous variables.
• Work.Exp and Salary are right-skewed; the Work.Exp histogram shows more juniors than
seniors.
• Engineer, MBA & license are not continuous; they are binary, categorical variables.

Fig 1h. (i)

Fig 1h. (ii)

Bivariate Analysis and Graphical Visualization of the Dataset
For the bivariate analysis we compare the following pairs:
• Age with qualification (Engineer and MBA): Fig 1i (i) & (ii)
• Salary distribution with qualification (Engineer and MBA): Fig 1i (i) & (ii)
• Work experience with Gender: Fig 1j (i) & (ii)

Qualification with Age and Salary (Engineer and MBA)

There are visible outliers here. Comparing the two outputs, the mean salary is more or less the
same: there is hardly any difference between the salary drawn by Engineers and non-Engineers,
or by MBAs and non-MBAs (refer Fig 1i).

Fig 1i

Fig 1i(i)

Fig 1i(ii)

Work Experience with Gender
We can again see visible outliers. The mean work experience differs little between the genders,
and the data is split fairly evenly between them.
Fig 1j (i)

Fig 1j (ii)

Multicollinearity Check and Treatment

From the image below we can see that the variables Age and Work.Exp are highly correlated
with Distance and Salary (refer Fig 1k (i) & (ii)).
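The multicollinearity check can be sketched in base R with a correlation matrix; the report's usdm package offers vifstep() for the same job. The toy numbers below are illustrative and constructed so that Work.Exp tracks Age closely.

```r
# Toy numeric predictors; Work.Exp deliberately tracks Age -> collinear.
num <- data.frame(
  Age      = c(25, 28, 32, 40, 23, 36),
  Work.Exp = c(3, 5, 9, 17, 1, 13),
  Salary   = c(9.5, 13.4, 28.8, 45.0, 8.1, 35.2),
  Distance = c(5.1, 7.3, 9.4, 14.2, 4.8, 12.0)
)
round(cor(num), 2)        # the Age vs Work.Exp correlation is close to 1
# library(usdm); vifstep(num, th = 5)  # package route: drop variables with VIF > 5
num$Work.Exp <- NULL      # exclude the collinear variable, as done in the report
```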

Collinear groups identified: (a) Age, Work.Exp, Distance; (b) Work.Exp, Distance, Salary, Age.

Fig 1k (i)

Fig 1k (ii)

Treating the multicollinearity present in the dataset by excluding the collinear variables.

The dataset after excluding the variable Work.Exp:

2) Data Preparation: SMOTE
Solution:
In SMOTE we need to set the perc.over parameter: for every existing minority-class case,
perc.over/100 synthetic minority cases are generated.

Using SMOTE we will balance the dataset and understand the factors influencing car usage. But
first we need to split the dataset into train and test sets.

The dataset is divided into train and test sets (refer to the tables in Fig 2.2 (i) & (ii)). Both
retain the same percentage of car users and non-car users. After applying SMOTE, the training
set is divided roughly equally between car users and non-car users (Fig 2.3 (i) & (ii)).
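The split and SMOTE steps can be sketched as below, assuming the (now CRAN-archived) DMwR package is installed and that Transport has been recoded to a binary CarUsage factor (1 = Car, 0 = other). The data, names and parameter values are illustrative, not the report's.

```r
library(DMwR)   # provides SMOTE(); archived on CRAN, assumed installed
set.seed(123)
cars <- data.frame(
  Age      = round(runif(60, 22, 43)),
  Salary   = round(runif(60, 8, 50), 1),
  CarUsage = factor(c(rep(1, 20), rep(0, 40)))   # imbalanced: fewer car users
)
idx   <- sample(nrow(cars), 0.7 * nrow(cars))    # 70/30 train-test split
train <- cars[idx, ]
test  <- cars[-idx, ]
prop.table(table(train$CarUsage))    # minority share is well below 50%
# perc.over = 200: add 2 synthetic minority cases per original minority case;
# perc.under = 150: keep 1.5 majority cases per synthetic case generated.
smote.train <- SMOTE(CarUsage ~ ., data = train, perc.over = 200, perc.under = 150)
table(smote.train$CarUsage)          # classes are now roughly balanced
```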

Fig: 2

Fig: 2.2(i) – Table Train Dataset

Fig: 2.2(ii) Table Test Dataset

Fig:2.3(i)

Fig:2.3(ii)

3A) Applying Logistic Regression (LR) & Interpreting Results
Solution:
Logistic regression is applied to the SMOTEd dataset to understand car usage and the factors
influencing it.
From the snapshots below (Fig 3A (a), (b) & (c)) we can interpret that:
• Age and license are the most significant predictors.
• A one-year increase in age is associated with a 98% probability of the person using a car.
• A person holding a license has a 99% probability of using a car.
• An increase in salary is associated with a 72% probability of car usage.
• The null deviance is 357.664 and the residual deviance is 17.959.
• McFadden's R-squared is 0.94.

Predictions are made with this model (Fig 3A (d) & (e)): 94% of car users and 95% of non-car
users are predicted correctly.
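A logistic-regression fit of this kind can be sketched with base R's glm(). The synthetic data below stands in for the SMOTEd training set, so the coefficients, deviances and accuracy are illustrative only.

```r
set.seed(42)
n        <- 200
Age      <- round(runif(n, 22, 43))
license  <- rbinom(n, 1, 0.4)
Salary   <- round(runif(n, 8, 50), 1)
# Simulated response: car usage driven mainly by Age and license.
CarUsage <- rbinom(n, 1, plogis(-12 + 0.3 * Age + 2 * license))
fit <- glm(CarUsage ~ Age + license + Salary, family = binomial)
summary(fit)                             # coefficient significance (cf. Fig 3A)
1 - fit$deviance / fit$null.deviance     # McFadden pseudo R-squared
pred <- ifelse(predict(fit, type = "response") > 0.5, 1, 0)
table(Predicted = pred, Actual = CarUsage)   # in-sample confusion matrix
```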

Fig3A (a)

Fig3A (b)

Fig3A (c)

Fig3A (d)

Fig3A (e)

3B) KNN
Solution:
KNN is a useful algorithm that classifies a point by matching it to its K nearest neighbours in a
multidimensional feature space. We apply the KNN algorithm to the current dataset and
interpret the findings.

The predictions and confusion matrix for KNN are shown in Fig 3B (iii): accuracy is 94%, with
sensitivity 94.52, specificity 94.44 and a Kappa value of 0.8875. The summary of the test dataset
in Fig 3B (iv) shows a 73:54 split.
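The KNN step can be sketched with class::knn() (the class package ships with R). Predictors are scaled first because KNN is distance-based; the synthetic data and resulting accuracy are illustrative only.

```r
library(class)
set.seed(7)
n <- 150
x <- data.frame(Age = runif(n, 22, 43), Salary = runif(n, 8, 50))
y <- factor(ifelse(x$Age + rnorm(n, sd = 2) > 33, 1, 0))  # Age drives the class
x.sc <- scale(x)                 # put both features on one scale
idx  <- sample(n, 100)           # 100 train / 50 test split
pred <- knn(train = x.sc[idx, ], test = x.sc[-idx, ], cl = y[idx], k = 5)
mean(pred == y[-idx])            # hold-out accuracy (cf. Fig 3B (iii))
```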

Fig3B (i)

Fig3B (ii)

Fig3B (iii)

Fig3B (iv)

3C) Naïve Bayes Model
Solution:
The Naïve Bayes model is applied to the dataset; the findings are as follows (refer Fig 3C
(i), (ii), (iii) & (iv)):
• Accuracy is 92.91%.
• Sensitivity is 95.89 and specificity is 88.89.
• The Kappa value is 0.854.

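The Naïve Bayes fit can be sketched as below, assuming the e1071 package used in the report is installed. The data is synthetic, so the confusion matrix and accuracy are illustrative only.

```r
library(e1071)   # provides naiveBayes(); assumed installed
set.seed(11)
n   <- 150
dat <- data.frame(
  Age     = runif(n, 22, 43),
  license = factor(rbinom(n, 1, 0.4))
)
# Simulated response: driven by Age and license, with some noise.
dat$CarUsage <- factor(ifelse(dat$Age + 4 * (dat$license == 1) +
                              rnorm(n, sd = 3) > 36, 1, 0))
nb   <- naiveBayes(CarUsage ~ ., data = dat)
pred <- predict(nb, dat)
table(Predicted = pred, Actual = dat$CarUsage)   # confusion matrix (cf. Fig 3C)
mean(pred == dat$CarUsage)                       # in-sample accuracy
```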
Fig:3C(i)

Fig:3C(ii)

Fig:3C(iii)

Fig:3C(iv)

3E) Model Validation
Solution:
The results in the table below suggest that the KNN model has better accuracy than Naïve
Bayes, so we can use the KNN model's inferences for decision making.

Method        Accuracy   Sensitivity   Specificity   Kappa
KNN           94%        94.52         94.44         0.8875
Naïve Bayes   92.91%     95.89         88.89         0.854

3F) Bagging
Solution:
The outcome of the bagging model is shown in Fig 3F (i) & (ii); the details are in the table
below.

Method    Accuracy   Sensitivity   Specificity   P value
Bagging   94.49      93.15         96.3          0.4497
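The bagging step can be sketched as below, assuming the ipred package used in the report is installed; bagging() fits rpart trees on bootstrap resamples and aggregates their votes. The synthetic data makes the figures illustrative only.

```r
library(ipred)   # provides bagging(); assumed installed (pulls in rpart)
set.seed(5)
n   <- 200
dat <- data.frame(Age = runif(n, 22, 43), Salary = runif(n, 8, 50))
dat$CarUsage <- factor(ifelse(dat$Age + rnorm(n, sd = 3) > 33, 1, 0))
bag  <- bagging(CarUsage ~ ., data = dat, nbagg = 25)  # 25 bootstrap trees
pred <- predict(bag, newdata = dat)
table(Predicted = pred, Actual = dat$CarUsage)   # confusion matrix (cf. Fig 3F)
mean(pred == dat$CarUsage)                       # in-sample accuracy
```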

Fig3F(i)

Fig3F (ii)

3G) Boosting
Solution:
The outcome of the boosting model is shown in Fig 3G (i), (ii), (iii), (iv) & (v); the details are in
the table below.

Method     Accuracy   Sensitivity   Specificity   P value
Boosting   97.64      98.63         96.3          1
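The boosting step can be sketched with the gbm package (listed among the report's packages), assuming it is installed. Note that gbm's distribution = "bernoulli" expects a numeric 0/1 response rather than a factor; the synthetic data makes the accuracy illustrative only.

```r
library(gbm)   # gradient boosting machine; assumed installed
set.seed(9)
n   <- 200
dat <- data.frame(Age = runif(n, 22, 43), Salary = runif(n, 8, 50))
dat$CarUsage <- rbinom(n, 1, plogis(-12 + 0.35 * dat$Age))  # numeric 0/1
boost <- gbm(CarUsage ~ ., data = dat, distribution = "bernoulli",
             n.trees = 200, interaction.depth = 2, shrinkage = 0.05)
p    <- predict(boost, newdata = dat, n.trees = 200, type = "response")
pred <- ifelse(p > 0.5, 1, 0)
table(Predicted = pred, Actual = dat$CarUsage)   # confusion matrix (cf. Fig 3G)
mean(pred == dat$CarUsage)                       # in-sample accuracy
```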

Fig3G(i)

Fig3G(ii)

Fig3G(iii)

Fig3G(iv)

Fig3G(v)

4) Actionable & Recommendations

Method        Accuracy   Sensitivity   Specificity
KNN           94%        94.52         94.44
Naïve Bayes   92.91%     95.89         88.89
Bagging       94.49      93.15         96.3
Boosting      97.64      98.63         96.3

With reference to the table above and the overall analysis of the dataset under the different
models, the boosting method is preferred, as it has the highest accuracy along with high
sensitivity and specificity. From the overall analysis we can therefore conclude that:

• Age is a significant factor influencing car usage: the higher the age, the higher the number
of car users.
• Along with age, work experience and salary play a significant role: users with more work
experience (and correspondingly higher age) and a higher salary prefer to commute by car.

