
Project 3 – Data Mining

Contents
1 Objective of Project

2 Assumptions

3 Exploratory Data Analysis

4 Check for Appropriate Clustering

5 Models to be Built on CART & Random Forest

5.1 Check for Necessary Modifications Such as Pruning

6 Check for Model Performances

7 Conclusion
1 Objective of Project
We want to build a model to help Thera Bank, which has more liability customers than asset customers, identify potential customers with a higher probability of purchasing a personal loan, based on last year's campaign data, which had a 9.6% success rate across 5000 customers.
You are brought in as a consultant, and your job is to build the best model to classify the
customers who have a higher probability of purchasing the loan. You are expected to do the
following:

● EDA of the data available. Showcase the results using appropriate graphs - (10 Marks)
● Apply appropriate clustering on the data and interpret the output - (10 Marks)
● Build appropriate models on both the train and test data (CART & Random Forest). Interpret all the model outputs and make the necessary modifications wherever applicable (such as pruning) - (20 Marks)
● Check the performance of all the models you have built (train and test), using all the model performance measures you have learned so far. Share your remarks on which model performs best. - (20 Marks)

2 Assumptions
No specific assumptions are made about the data.

3 Exploratory Data Analysis


The first step is to set up the environment in RStudio for this analysis:

● Create R notebook file

● Set up the working directory

● Libraries loaded – readxl, corrplot, ggplot2, caTools, rpart, rpart.plot, randomForest, lattice, etc.

We explored the TheraBank data using str(), which showed that:

● Personal Loan is the dependent variable; all other attributes are treated as independent variables.
● ID should not be considered, since it is unique for each customer and does not help in model building.
● The data includes the customers' demographic information (Age, Income, Experience, Family members, ZIP code, Education), which represents customer behavior, so these columns need to be taken into consideration.

Summary of the data showed:

● PersonalLoan, SecuritiesAccount, CD.Account, Online and CreditCard are binary columns containing only 0 or 1 as values.
● Personal Loan has a mean of 0.096, which reflects the 9.6% success rate of last year's campaign.
Boxplots were used to check for outliers. Outliers are present for the variables Income (in
K/month), CCAvg and Mortgage.

The correlation plot shows:

● Age (in years) is correlated with Experience (in years) (0.99),
● Income (in K/month) is correlated with CCAvg (0.65),
● Income (in K/month) is correlated with Personal Loan (0.50),
● CCAvg is correlated with Personal Loan (0.37),
● Personal Loan is correlated with Securities Account and CD Account (0.32),
● CD Account is correlated with CreditCard (0.28).

qplot was used to check whether Personal Loan, Education, Securities Account, CreditCard, CD
Account and Online have an imbalanced class ratio relative to the overall data.
library(readxl)

library(ggplot2)

library(corrplot)

# Load the data with read_excel(); the file name below is assumed
# mydata <- read_excel("TheraBank.xlsx")

str(mydata)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 5000 obs. of 14 variables:
$ ID : num 1 2 3 4 5 6 7 8 9 10 ...
$ Age (in years) : num 25 45 39 35 35 37 53 50 35 34 ...
$ Experience (in years): num 1 19 15 9 8 13 27 24 10 9 ...
$ Income (in K/month) : num 49 34 11 100 45 29 72 22 81 180 ...
$ ZIP Code : num 91107 90089 94720 94112 91330 ...
$ Family members : num 4 3 1 1 4 4 2 1 3 1 ...
$ CCAvg : num 1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
$ Education : num 1 1 1 2 2 2 2 3 2 3 ...
$ Mortgage : num 0 0 0 0 0 155 0 0 104 0 ...
$ Personal Loan : num 0 0 0 0 0 0 0 0 0 1 ...
$ Securities Account : num 1 1 0 0 0 0 0 0 0 0 ...
$ CD Account : num 0 0 0 0 0 0 0 0 0 0 ...
$ Online : num 0 0 0 0 0 1 1 0 1 0 ...
$ CreditCard : num 0 0 0 0 1 0 0 1 0 0 ...

summary(mydata)
ID Age (in years) Experience (in years) Income (in K/month) ZIP Code Family members
Min. : 1 Min. :23.00 Min. :-3.0 Min. : 8.00 Min. : 9307 Min. :1.000
1st Qu.:1251 1st Qu.:35.00 1st Qu.:10.0 1st Qu.: 39.00 1st Qu.:91911 1st Qu.:1.000
Median :2500 Median :45.00 Median :20.0 Median : 64.00 Median :93437 Median :2.000
Mean :2500 Mean :45.34 Mean :20.1 Mean : 73.77 Mean :93153 Mean :2.397
3rd Qu.:3750 3rd Qu.:55.00 3rd Qu.:30.0 3rd Qu.: 98.00 3rd Qu.:94608 3rd Qu.:3.000
Max. :5000 Max. :67.00 Max. :43.0 Max. :224.00 Max. :96651 Max. :4.000
NA's :18
CCAvg Education Mortgage Personal Loan Securities Account CD Account
Min. : 0.000 Min. :1.000 Min. : 0.0 Min. :0.000 Min. :0.0000 Min. :0.0000
1st Qu.: 0.700 1st Qu.:1.000 1st Qu.: 0.0 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000
Median : 1.500 Median :2.000 Median : 0.0 Median :0.000 Median :0.0000 Median :0.0000
Mean : 1.938 Mean :1.881 Mean : 56.5 Mean :0.096 Mean :0.1044 Mean :0.0604
3rd Qu.: 2.500 3rd Qu.:3.000 3rd Qu.:101.0 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:0.0000
Max. :10.000 Max. :3.000 Max. :635.0 Max. :1.000 Max. :1.0000 Max. :1.0000

Online CreditCard
Min. :0.0000 Min. :0.000
1st Qu.:0.0000 1st Qu.:0.000
Median :1.0000 Median :0.000
Mean :0.5968 Mean :0.294
3rd Qu.:1.0000 3rd Qu.:1.000
Max. :1.0000 Max. :1.000

anyNA(mydata)
TRUE

sum(is.na(mydata))

18

mydata=na.omit(mydata)
correlationMatrix=cor(mydata)

corrplot(correlationMatrix, method = "number")


colnames(mydata)=make.names(colnames(mydata))

mydata$Personal.Loan=as.factor(mydata$Personal.Loan)

mydata$Securities.Account=as.factor(mydata$Securities.Account)

mydata$CD.Account=as.factor(mydata$CD.Account)

mydata$Online=as.factor(mydata$Online)

mydata$CreditCard=as.factor(mydata$CreditCard)

mydata$Education=as.factor(mydata$Education)

mydata$ZIP.Code=as.factor(mydata$ZIP.Code)

boxplot(mydata$Age..in.years.)

boxplot(mydata$Experience..in.years.)
boxplot(mydata$Income..in.K.month.)
boxplot(mydata$Family.members)

boxplot(mydata$CCAvg)
boxplot(mydata$Mortgage)

qplot(mydata$Personal.Loan)
qplot(mydata$Education)

qplot(mydata$Securities.Account)
qplot(mydata$CD.Account)

qplot(mydata$Online)
qplot(mydata$CreditCard)
4 Check for Appropriate Clustering
?dist

distMatrix = dist(x=mydata[,1:14], method = "euclidean")

print(distMatrix, digits = 3)
(output omitted: with 5000 rows the full distance matrix exceeds getOption("max.print"), so only row/column indices were printed)

##Scale function to standardize the values

clustering=scale(mydata[,1:14])

print(clustering)
ID Age (in years) Experience (in years) Income (in K/month) ZIP Code
[1,] -1.7315312529 -1.77423939 -1.665911856 -0.538174951 -0.96401766
[2,] -1.7308385019 -0.02952064 -0.096320584 -0.864022980 -1.44378718
[3,] -1.7301457508 -0.55293627 -0.445118645 -1.363656626 0.73873996
[4,] -1.7294529998 -0.90188002 -0.968315735 0.569708351 0.45219785
[5,] -1.7287602487 -0.90188002 -1.055515250 -0.625067758 -0.85892081
[6,] -1.7280674977 -0.72740814 -0.619517675 -0.972638990 -0.48613329
[7,] -1.7273747466 0.66836686 0.601275536 -0.038541305 -0.67936070
[8,] -1.7266819956 0.40665905 0.339676991 -1.124701404 0.37255045
[9,] -1.7259892445 -0.90188002 -0.881116220 0.156967513 -1.44378718
[10,] -1.7252964935 -0.98911595 -0.968315735 2.307564509 -0.06103300
[11,] -1.7246037424 1.71519811 1.647669717 0.678324360 0.73402709
[12,] -1.7239109914 -1.42529564 -1.317113796 -0.625067758 -1.35518534
[13,] -1.7232182403 0.23218717 0.252477476 0.873833178 -0.02191623
[14,] -1.7225254893 1.19178248 1.037273112 -0.733683768 0.83299723
[15,] -1.7218327382 1.88966998 1.822068747 0.830386774 -0.66522211
[16,] -1.7211399872 1.27901842 0.862874082 -1.124701404 0.89614960
[17,] -1.7204472361 -0.64017220 -0.532318160 1.221404410 0.87541300
[18,] -1.7197544851 -0.29122845 -0.183520099 0.156967513 0.54315612
[19,] -1.7190617340 0.05771530 0.078078446 2.589966135 -0.72978834
[20,] -1.7183689830 0.84283873 0.688475051 -1.146424606 0.73873996
[21,] -1.7176762319 0.93007467 0.950073597 -1.059531798 0.40648307
[22,] -1.7169834809 1.01731061 0.601275536 -0.234050123 -1.44095946
[23,] -1.7162907298 -1.42529564 -1.317113796 -0.255773325 -1.35518534
[24,] -1.7155979788 -0.11675658 -0.183520099 -0.668514162 -0.86363367
[25,] -1.7149052277 -0.81464408 -0.793916705 1.699314854 1.11624033
[26,] -1.7142124767 -0.20399252 -0.096320584 -0.972638990 0.54315612
[27,] -1.7135197256 -0.46570033 -0.357919130 0.200413917 0.90086246
[28,] -1.7128269746 0.05771530 -0.009121069 1.829654065 -1.45556934
[29,] -1.7121342235 0.93007467 0.862874082 -0.559898153 0.65343713
[30,] -1.7114414724 -0.64017220 -0.619517675 0.982449188 0.44842756
[31,] -1.7107487214 1.19178248 1.298871657 -0.842299778 -0.02191623
[32,] -1.7100559703 -0.46570033 -0.357919130 -0.972638990 0.45455428
[33,] -1.7093632193 0.66836686 0.688475051 -0.711960566 0.77691415
[34,] -1.7086704682 -1.33805970 -1.229914280 -1.211594212 -0.85892081
[35,] -1.7079777172 -1.25082377 -1.317113796 -0.516451749 0.41590880
[36,] -1.7072849661 0.23218717 0.339676991 0.156967513 -0.23823667
[37,] -1.7065922151 1.19178248 1.298871657 1.025895592 0.73873996
[38,] -1.7058994640 0.49389498 0.426876506 -0.060264507 1.25432724
[39,] -1.7052067130 -0.29122845 -0.183520099 1.460359632 0.45314042
[40,] -1.7045139619 -0.64017220 -0.619517675 0.135244311 0.45361171
[41,] -1.7038212109 1.01731061 1.037273112 0.222137119 -0.22645451
[42,] -1.7031284598 -0.98911595 -0.968315735 -0.299219729 0.45691071
[43,] -1.7024357088 -1.16358783 -1.142714765 1.264850814 -1.47677723
[44,] -1.7017429577 -0.55293627 -0.445118645 -0.625067758 1.16101254
[45,] -1.7010502067 0.05771530 -0.009121069 0.656601158 0.43004739
[46,] -1.7003574556 1.01731061 0.950073597 -0.473005345 0.73873996
[47,] -1.6996647046 -0.55293627 -0.532318160 -0.668514162 0.87729815
[48,] -1.6989719535 -0.72740814 -0.706717190 2.611689337 -0.83535649
[49,] -1.6982792025 0.93007467 0.514076021 0.156967513 1.22275105
[50,] -1.6975864514 -0.46570033 -0.357919130 -0.538174951 -0.36736913
[51,] -1.6968937004 -1.16358783 -1.055515250 -1.428826232 -0.49932931
[52,] -1.6962009493 1.36625436 1.473270687 1.243127612 0.73873996
[53,] -1.6955081983 -1.33805970 -1.229914280 -0.038541305 0.40177021
[54,] -1.6948154472 0.40665905 0.514076021 2.524796529 -1.37026651
[55,] -1.6941226962 -1.42529564 -1.317113796 -0.646790960 1.25668367
[56,] -1.6934299451 -0.37846439 -0.270719615 1.416913228 0.40978208
[57,] -1.6927371941 0.84283873 0.862874082 -0.972638990 0.40177021
[58,] -1.6920444430 0.93007467 0.950073597 1.243127612 1.16101254
[59,] -1.6913516920 -1.51253158 -1.578712341 0.417645937 0.43004739
[60,] -1.6906589409 -1.25082377 -1.317113796 2.481350125 -0.86363367
[61,] -1.6899661899 0.31942311 0.339676991 -0.755406970 -1.29533198
[62,] -1.6892734388 0.14495123 0.078078446 1.112788400 0.11994096
[63,] -1.6885806878 -0.29122845 -0.183520099 -1.124701404 -1.44378718
[64,] -1.6878879367 -0.29122845 -0.270719615 -0.907469384 0.64589654
[65,] -1.6871951857 0.14495123 0.252477476 0.678324360 -1.47442079
[66,] -1.6865024346 1.19178248 1.298871657 1.243127612 -0.84478222
[67,] -1.6858096835 1.45349030 1.386071172 0.678324360 1.18646200
[68,] -1.6851169325 0.66836686 0.252477476 -0.625067758 0.92866836
[69,] -1.6844241814 0.14495123 0.078078446 -0.299219729 0.11994096
[70,] -1.6837314304 0.66836686 0.775674566 -1.168147808 -1.46452378
[71,] -1.6830386793 -0.29122845 -0.183520099 0.895556380 -0.85656437
Family members CCAvg Education Mortgage Personal Loan Securities Account CD Account
[1,] 1.3971629 -0.19336610 -1.0489730 -0.5554684 -0.3258427 2.9286223 -0.2535149
[2,] 0.5254452 -0.25058550 -1.0489730 -0.5554684 -0.3258427 2.9286223 -0.2535149
[3,] -1.2179901 -0.53668251 -1.0489730 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[4,] -1.2179901 0.43604731 0.1416887 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[5,] 1.3971629 -0.53668251 0.1416887 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[6,] 1.3971629 -0.87999891 0.1416887 0.9684153 -0.3258427 -0.3413892 -0.2535149
[7,] -0.3462724 -0.25058550 0.1416887 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[8,] -1.2179901 -0.93721831 1.3323505 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[9,] 0.5254452 -0.76556011 0.1416887 0.4670084 -0.3258427 -0.3413892 -0.2535149
[10,] -1.2179901 3.98365017 1.3323505 -0.5554684 3.0683519 -0.3413892 -0.2535149
[11,] 1.3971629 0.26438911 1.3323505 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[12,] 0.5254452 -1.05165711 0.1416887 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[13,] -0.3462724 1.06546072 1.3323505 -0.5554684 -0.3258427 2.9286223 -0.2535149
[14,] 1.3971629 0.32160851 0.1416887 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[15,] -1.2179901 0.03551150 -1.0489730 -0.5554684 -0.3258427 2.9286223 -0.2535149
[16,] -1.2179901 -0.25058550 1.3323505 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[17,] 1.3971629 1.58043533 1.3323505 0.7619536 3.0683519 -0.3413892 -0.2535149
[18,] 1.3971629 0.26438911 -1.0489730 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[19,] -0.3462724 3.52589497 1.3323505 -0.5554684 3.0683519 -0.3413892 -0.2535149
[20,] -1.2179901 -0.82277951 0.1416887 -0.5554684 -0.3258427 2.9286223 -0.2535149
[21,] NA -0.59390191 0.1416887 0.5358290 -0.3258427 -0.3413892 -0.2535149
[22,] 0.5254452 0.03551150 1.3323505 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[23,] -1.2179901 -0.42224370 -1.0489730 2.0007236 -0.3258427 -0.3413892 -0.2535149
[24,] -0.3462724 -0.70834071 -1.0489730 1.0470673 -0.3258427 2.9286223 -0.2535149
[25,] -0.3462724 1.12268012 -1.0489730 1.0077413 -0.3258427 -0.3413892 -0.2535149
[26,] 0.5254452 -0.82277951 -1.0489730 0.3981878 -0.3258427 -0.3413892 -0.2535149
[27,] 1.3971629 -0.99443771 1.3323505 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[28,] -1.2179901 0.26438911 -1.0489730 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[29,] -1.2179901 0.14995031 1.3323505 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[30,] -1.2179901 0.77936372 0.1416887 -0.5554684 3.0683519 -0.3413892 3.9437520
[31,] -1.2179901 -0.42224370 1.3323505 0.6439755 -0.3258427 -0.3413892 -0.2535149
[32,] -1.2179901 0.03551150 0.1416887 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[33,] -0.3462724 -0.76556011 1.3323505 1.3420126 -0.3258427 -0.3413892 -0.2535149
[34,] 0.5254452 -0.59390191 1.3323505 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[35,] 1.3971629 -0.07892730 1.3323505 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[36,] 0.5254452 -0.70834071 -1.0489730 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[37,] -1.2179901 0.55048611 -1.0489730 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[38,] -1.2179901 -0.30780490 1.3323505 1.3911701 -0.3258427 -0.3413892 -0.2535149
[39,] 0.5254452 1.75209353 1.3323505 -0.5554684 3.0683519 2.9286223 3.9437520
[40,] 1.3971629 -0.70834071 1.3323505 2.2465112 -0.3258427 -0.3413892 -0.2535149
[41,] 0.5254452 -0.19336610 1.3323505 -0.5554684 -0.3258427 2.9286223 -0.2535149
[42,] 0.5254452 0.20716971 -1.0489730 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[43,] 1.3971629 -0.47946310 0.1416887 3.4951127 3.0683519 -0.3413892 -0.2535149
[44,] -1.2179901 -0.70834071 -1.0489730 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[45,] -1.2179901 2.15262934 -1.0489730 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[46,] 1.3971629 0.32160851 -1.0489730 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[47,] 0.5254452 -0.70834071 0.1416887 0.9487523 -0.3258427 -0.3413892 -0.2535149
[48,] 1.3971629 -0.99443771 1.3323505 1.5189797 3.0683519 2.9286223 3.9437520
[49,] -0.3462724 1.46599653 1.3323505 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[50,] -1.2179901 -0.07892730 -1.0489730 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[51,] 1.3971629 -0.70834071 0.1416887 -0.5554684 -0.3258427 2.9286223 -0.2535149
[52,] -1.2179901 0.55048611 -1.0489730 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[53,] -1.2179901 -1.05165711 -1.0489730 1.4796537 -0.3258427 -0.3413892 -0.2535149
[54,] 0.5254452 0.09273091 1.3323505 1.8040934 3.0683519 -0.3413892 -0.2535149
[55,] -1.2179901 -0.99443771 1.3323505 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[56,] -0.3462724 3.46867556 -1.0489730 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[57,] 0.5254452 -1.05165711 0.1416887 -0.5554684 -0.3258427 2.9286223 3.9437520
[58,] -0.3462724 -0.42224370 1.3323505 -0.5554684 3.0683519 -0.3413892 -0.2535149
[59,] NA -0.99443771 -1.0489730 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[60,] -0.3462724 1.46599653 -1.0489730 3.9178675 -0.3258427 -0.3413892 -0.2535149
[61,] 0.5254452 -0.13614670 0.1416887 -0.5554684 -0.3258427 2.9286223 -0.2535149
[62,] -1.2179901 2.15262934 -1.0489730 0.5456605 -0.3258427 2.9286223 -0.2535149
[63,] -1.2179901 -0.53668251 -1.0489730 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[64,] 1.3971629 -1.10887652 0.1416887 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[65,] -0.3462724 0.77936372 -1.0489730 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[66,] -1.2179901 1.06546072 -1.0489730 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[67,] -0.3462724 0.49326671 -1.0489730 2.7479181 -0.3258427 -0.3413892 -0.2535149
[68,] 1.3971629 0.03551150 1.3323505 0.7422906 -0.3258427 2.9286223 -0.2535149
[69,] 0.5254452 0.09273091 -1.0489730 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[70,] 1.3971629 -0.99443771 -1.0489730 -0.5554684 -0.3258427 -0.3413892 -0.2535149
[71,] -1.2179901 0.89380252 -1.0489730 -0.5554684 -0.3258427 -0.3413892 -0.2535149
Online CreditCard
[1,] -1.2164961 -0.6452498
[2,] -1.2164961 -0.6452498
[3,] -1.2164961 -0.6452498
[4,] -1.2164961 -0.6452498
[5,] -1.2164961 1.5494774
[6,] 0.8218687 -0.6452498
[7,] 0.8218687 -0.6452498
[8,] -1.2164961 1.5494774
[9,] 0.8218687 -0.6452498
[10,] -1.2164961 -0.6452498
[11,] -1.2164961 -0.6452498
[12,] 0.8218687 -0.6452498
[13,] -1.2164961 -0.6452498
[14,] 0.8218687 -0.6452498
[15,] -1.2164961 -0.6452498
[16,] 0.8218687 1.5494774
[17,] -1.2164961 -0.6452498
[18,] -1.2164961 -0.6452498
[19,] -1.2164961 -0.6452498
[20,] -1.2164961 1.5494774
[21,] 0.8218687 -0.6452498
[22,] 0.8218687 -0.6452498
[23,] 0.8218687 -0.6452498
[24,] -1.2164961 -0.6452498
[25,] -1.2164961 1.5494774
[26,] 0.8218687 -0.6452498
[27,] -1.2164961 -0.6452498
[28,] 0.8218687 1.5494774
[29,] 0.8218687 1.5494774
[30,] 0.8218687 1.5494774
[31,] 0.8218687 -0.6452498
[32,] 0.8218687 -0.6452498
[33,] -1.2164961 -0.6452498
[34,] -1.2164961 -0.6452498
[35,] 0.8218687 -0.6452498
[36,] -1.2164961 -0.6452498
[37,] -1.2164961 1.5494774
[38,] -1.2164961 -0.6452498
[39,] 0.8218687 -0.6452498
[40,] 0.8218687 -0.6452498
[41,] -1.2164961 -0.6452498
[42,] -1.2164961 -0.6452498
[43,] 0.8218687 -0.6452498
[44,] 0.8218687 -0.6452498
[45,] 0.8218687 1.5494774
[46,] -1.2164961 1.5494774
[47,] 0.8218687 -0.6452498
[48,] 0.8218687 1.5494774
[49,] -1.2164961 1.5494774
[50,] -1.2164961 1.5494774
[51,] 0.8218687 -0.6452498
[52,] 0.8218687 -0.6452498
[53,] -1.2164961 -0.6452498
[54,] 0.8218687 -0.6452498
[55,] 0.8218687 -0.6452498
[56,] 0.8218687 -0.6452498
[57,] 0.8218687 -0.6452498
[58,] -1.2164961 -0.6452498
[59,] -1.2164961 -0.6452498
[60,] -1.2164961 -0.6452498
[61,] 0.8218687 -0.6452498
[62,] -1.2164961 -0.6452498
[63,] -1.2164961 -0.6452498
[64,] 0.8218687 -0.6452498
[65,] -1.2164961 -0.6452498
[66,] 0.8218687 1.5494774
[67,] -1.2164961 -0.6452498
[68,] -1.2164961 -0.6452498
[69,] 0.8218687 1.5494774
[70,] 0.8218687 -0.6452498
[71,] -1.2164961 1.5494774
[ reached getOption("max.print") -- omitted 4929 rows ]
attr(,"scaled:center")
ID Age (in years) Experience (in years) Income (in K/month)
2500.500000 45.338400 20.104600 73.774200
ZIP Code Family members CCAvg Education
93152.503000 2.397230 1.937938 1.881000
Mortgage Personal Loan Securities Account CD Account
56.498800 0.096000 0.104400 0.060400
Online CreditCard
0.596800 0.294000
attr(,"scaled:scale")
ID Age (in years) Experience (in years) Income (in K/month)
1443.5200033 11.4631656 11.4679537 46.0337293
ZIP Code Family members CCAvg Education
2121.8521973 1.1471604 1.7476590 0.8398691
Mortgage Personal Loan Securities Account CD Account
101.7138021 0.2946207 0.3058093 0.2382503
Online CreditCard
0.4905893 0.4556375

distMatrix.scaled = dist(x=clustering, method = "euclidean")


print(distMatrix.scaled, digits = 3)
(output omitted: the scaled distance matrix likewise exceeds getOption("max.print"), so only row/column indices were printed)

cluster <- hclust(distMatrix.scaled, method = "average")


plot(cluster, labels = as.character(clustering[,2]))
## Plotting rectangles defining the clusters

rect.hclust(cluster, k=3, border = "yellow")

##Print cluster combining heights

cluster$height
[1] 0.1644316 0.1659243 0.1698234 0.1701177 0.2045809 0.2057245 0.2096606 0.2108413 0.2156200 0.2200490
[11] 0.2285585 0.2326489 0.2430373 0.2437600 0.2522115 0.2525235 0.2534795 0.2554537 0.2554810 0.2588159
[21] 0.2749192 0.2750783 0.2802163 0.2853737 0.2955212 0.2959120 0.3089493 0.3091822 0.3100042 0.3178606
[31] 0.3180990 0.3188373 0.3216391 0.3217560 0.3218460 0.3219622 0.3262639 0.3282431 0.3283350 0.3286615
[41] 0.3335504 0.3371445 0.3418396 0.3514692 0.3564445 0.3571165 0.3606581 0.3613403 0.3637455 0.3671325
[51] 0.3703429 0.3757498 0.3765238 0.3776570 0.3791352 0.3792368 0.3797722 0.3797880 0.3851541 0.3855276
[61] 0.3872324 0.3874331 0.3887630 0.3960005 0.3968269 0.3972734 0.3985425 0.4023943 0.4024181 0.4030343
[71] 0.4061548 0.4082258 0.4086243 0.4095620 0.4096985 0.4124044 0.4134006 0.4158446 0.4188541 0.4195476
[81] 0.4200046 0.4201225 0.4216312 0.4254159 0.4258333 0.4262333 0.4273095 0.4294341 0.4319551 0.4344206
[91] 0.4373783 0.4412600 0.4418621 0.4422429 0.4439687 0.4446968 0.4470181 0.4510722 0.4531260 0.4536369
[101] 0.4563163 0.4568637 0.4568665 0.4591126 0.4633760 0.4635948 0.4638394 0.4654513 0.4666214 0.4679173
[111] 0.4685319 0.4687902 0.4706936 0.4728982 0.4767953 0.4775498 0.4776026 0.4777108 0.4783247 0.4791086 … and so on
[ reached getOption("max.print") -- omitted 3999 entries ]
##Adding cluster number back to the dataset

mydata$cluster <- cutree(cluster, k=3)

mydata$cluster
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[51] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[101] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[151] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[201] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[251] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[301] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[351] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[401] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[451] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[501] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[551] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[601] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[651] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[701] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[751] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[801] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[851] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[901] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[951] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[ reached getOption("max.print") -- omitted 4000 entries ]

## Aggregate the remaining columns (excluding ID, Age and Education) for each cluster by their means

bank = aggregate(mydata[,-c(1,2,8)], list(mydata$cluster),FUN = "mean")

bank$Frequency = as.vector(table(mydata$cluster))

View(bank)
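One convenient way to read the aggregated table is the loan-purchase rate per cluster. A minimal sketch (assuming the mydata$cluster column created by cutree above, and Personal.Loan stored as a factor):

```r
# Fraction of loan purchasers in each cluster: convert the factor back to 0/1
# and average within each cluster label
loan_rate <- tapply(as.numeric(as.character(mydata$Personal.Loan)),
                    mydata$cluster, mean)
print(loan_rate)
```

A cluster with a loan rate well above the overall 9.6% would be the natural target for the campaign.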
5 Models to be Built on CART & Random Forest
##Splitting
table(mydata$Personal.Loan)

0 1
4520 480

sum(mydata$Personal.Loan == 1)/nrow(mydata)

0.096

library(caTools)

split <- sample.split(mydata$Personal.Loan, SplitRatio = 0.7)

TrainingData <- subset(mydata, split== TRUE)

TestingData <- subset(mydata, split == FALSE)

summary(TrainingData)
ID Age (in years) Experience (in years) Income (in K/month) ZIP Code Family members
Min. : 1 Min. :23.00 Min. :-3.00 Min. : 8.00 Min. : 9307 Min. :1.000
1st Qu.:1261 1st Qu.:36.00 1st Qu.:10.00 1st Qu.: 39.00 1st Qu.:91942 1st Qu.:1.000
Median :2495 Median :45.00 Median :20.00 Median : 64.00 Median :93437 Median :2.000
Mean :2503 Mean :45.51 Mean :20.28 Mean : 73.86 Mean :93155 Mean :2.408
3rd Qu.:3759 3rd Qu.:56.00 3rd Qu.:30.00 3rd Qu.: 98.00 3rd Qu.:94608 3rd Qu.:4.000
Max. :4994 Max. :67.00 Max. :43.00 Max. :224.00 Max. :96651 Max. :4.000
NA's :11
CCAvg Education Mortgage Personal Loan Securities Account CD Account
Min. : 0.000 Min. :1.000 Min. : 0.00 Min. :0.000 Min. :0.0000 Min. :0.00000
1st Qu.: 0.700 1st Qu.:1.000 1st Qu.: 0.00 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.00000
Median : 1.500 Median :2.000 Median : 0.00 Median :0.000 Median :0.0000 Median :0.00000
Mean : 1.931 Mean :1.875 Mean : 55.52 Mean :0.096 Mean :0.1046 Mean :0.06229
3rd Qu.: 2.500 3rd Qu.:3.000 3rd Qu.:101.00 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:0.00000
Max. :10.000 Max. :3.000 Max. :635.00 Max. :1.000 Max. :1.0000 Max. :1.00000

Online CreditCard
Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000
Median :1.0000 Median :0.0000
Mean :0.6029 Mean :0.2937
3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000

str(TrainingData)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3500 obs. of 14 variables:
$ ID : num 1 2 4 5 7 8 9 10 12 13 ...
$ Age (in years) : num 25 45 35 35 53 50 35 34 29 48 ...
$ Experience (in years): num 1 19 9 8 27 24 10 9 5 23 ...
$ Income (in K/month) : num 49 34 100 45 72 22 81 180 45 114 ...
$ ZIP Code : num 91107 90089 94112 91330 91711 ...
$ Family members : num 4 3 1 4 2 1 3 1 3 2 ...
$ CCAvg : num 1.6 1.5 2.7 1 1.5 0.3 0.6 8.9 0.1 3.8 ...
$ Education : num 1 1 2 2 2 3 2 3 2 3 ...
$ Mortgage : num 0 0 0 0 0 0 104 0 0 0 ...
$ Personal Loan : num 0 0 0 0 0 0 0 1 0 0 ...
$ Securities Account : num 1 1 0 0 0 0 0 0 0 1 ...
$ CD Account : num 0 0 0 0 0 0 0 0 0 0 ...
$ Online : num 0 0 0 0 1 0 1 0 1 0 ...
$ CreditCard : num 0 0 0 1 0 1 0 0 0 0 ...
colnames(TrainingData)=make.names(colnames(TrainingData))

print(TrainingData)
# A tibble: 3,500 x 14
ID Age..in.years. Experience..in.~ Income..in.K.mo~ ZIP.Code Family.members CCAvg Education Mortgage
* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 25 1 49 91107 4 1.6 1 0
2 2 45 19 34 90089 3 1.5 1 0
3 4 35 9 100 94112 1 2.7 2 0
4 5 35 8 45 91330 4 1 2 0
5 7 53 27 72 91711 2 1.5 2 0
6 8 50 24 22 93943 1 0.3 3 0
7 9 35 10 81 90089 3 0.6 2 104
8 10 34 9 180 93023 1 8.9 3 0
9 12 29 5 45 90277 3 0.1 2 0
10 13 48 23 114 93106 2 3.8 3 0
# ... with 3,490 more rows, and 5 more variables: Personal.Loan <dbl>, Securities.Account <dbl>,
# CD.Account <dbl>, Online <dbl>, CreditCard <dbl>

library(rpart)

library(rpart.plot)

seed=1000

set.seed(seed)

TrainingTree <- rpart(formula = Personal.Loan ~ ., data = TrainingData, method = "class",
                      cp = 0, minbucket = 5)
rpart.plot(TrainingTree)

printcp(TrainingTree)
Classification tree:
rpart(formula = Personal.Loan ~ ., data = TrainingData, method = "class",
cp = 0, minbucket = 5)

Variables actually used in tree construction:


[1] Age..in.years. CCAvg CD.Account Education
[5] Experience..in.years. Family.members ID Income..in.K.month.
[9] Online ZIP.Code

Root node error: 336/3500 = 0.096

n= 3500

CP nsplit rel error xerror xstd


1 0.32440476 0 1.000000 1.00000 0.051870
2 0.14583333 2 0.351190 0.36905 0.032549
3 0.01339286 3 0.205357 0.23512 0.026153
4 0.01041667 7 0.139881 0.19643 0.023950
5 0.00892857 9 0.119048 0.17560 0.022667
6 0.00297619 10 0.110119 0.16071 0.021701
7 0.00148810 13 0.101190 0.19345 0.023771
8 0.00099206 15 0.098214 0.19048 0.023591
9 0.00000000 18 0.095238 0.20536 0.024477

plotcp(TrainingTree)

Pruned_TrainingTree = prune(TrainingTree, cp = 0.015)

printcp(Pruned_TrainingTree)
Classification tree:
rpart(formula = Personal.Loan ~ ., data = TrainingData, method = "class",
cp = 0, minbucket = 5)

Variables actually used in tree construction:


[1] Education Family.members Income..in.K.month.

Root node error: 336/3500 = 0.096

n= 3500

CP nsplit rel error xerror xstd


1 0.32440 0 1.00000 1.00000 0.051870
2 0.14583 2 0.35119 0.36905 0.032549
3 0.01500 3 0.20536 0.23512 0.026153
rpart.plot(Pruned_TrainingTree)

TrainingData$Cart_Prediction = predict(Pruned_TrainingTree, newdata = TrainingData, type = "class")

# TrainingData$score = predict(Pruned_TrainingTree, newdata = TrainingData, type = "prob")

# table(TrainingData$Personal.Loan, TrainingData$Cart_Prediction)

##Random Forest
library(randomForest)

set.seed(seed)

# Personal.Loan is numeric here, so randomForest() fits a *regression* forest
# (see "Type of random forest: regression" below); for classification the
# response should first be converted to a factor.
RandomForestModel = randomForest(Personal.Loan ~ ., data = TrainingData[, c(-4, -14)],
                                 ntree = 501, mtry = 3, nodesize = 10,
                                 importance = TRUE, na.action = na.exclude)

print(RandomForestModel)
Call:
randomForest(formula = Personal.Loan ~ ., data = TrainingData[, c(-4, -14)], ntree = 501, mtry = 3, nodesize = 10,
importance = TRUE, na.action = na.exclude)
Type of random forest: regression
Number of trees: 501
No. of variables tried at each split: 3

Mean of squared residuals: 0.01439079


% Var explained: 83.38

RandomForestModel$err.rate
NULL
# err.rate is NULL because a regression forest was fitted; it is only
# populated (OOB and per-class error columns) for classification forests.
plot(RandomForestModel)

legend("topright", c("OOB", "0", "1"), text.col = 1:6, lty = 1:3, col = 1:3)
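The NULL `err.rate` and the regression output above follow from Personal.Loan being numeric. A sketch of the intended classification forest, assuming the same TrainingData and keeping the column selection used above (`-4`, `-14`):

```r
# Coerce the response to a factor so randomForest() runs in classification
# mode, which populates err.rate (the quantity the legend above refers to).
TrainingData$Personal.Loan <- as.factor(TrainingData$Personal.Loan)
set.seed(seed)
RFClassModel <- randomForest(Personal.Loan ~ ., data = TrainingData[, c(-4, -14)],
                             ntree = 501, mtry = 3, nodesize = 10,
                             importance = TRUE, na.action = na.exclude)
print(RFClassModel)            # now reports a confusion matrix and OOB error
head(RFClassModel$err.rate)    # columns: OOB, "0", "1"
```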
set.seed(12345)

TunedRandomForestModel = tuneRF(x = TrainingData[, -c(4, 9, 14)],
                                y = TrainingData$Personal.Loan,
                                ntreeTry = 51,
                                nodesize = 10,
                                mtryStart = 6,
                                stepFactor = 1.5,
                                improve = 0.0001,
                                trace = TRUE,
                                plot = TRUE,
                                doBest = TRUE,
                                importance = TRUE)
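Because `doBest = TRUE`, tuneRF() returns the refitted forest at the best mtry, so its chosen parameters and variable importances can be inspected directly — a sketch, assuming the `TunedRandomForestModel` object above:

```r
# mtry selected by the grid search
print(TunedRandomForestModel$mtry)
# importance measures (here a regression forest, since Personal.Loan is
# numeric: %IncMSE and IncNodePurity; for a factor response these become
# MeanDecreaseAccuracy and MeanDecreaseGini)
importance(TunedRandomForestModel)
varImpPlot(TunedRandomForestModel, main = "Variable importance")
```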

print(tuneRF)
function (x, y, mtryStart = if (is.factor(y)) floor(sqrt(ncol(x))) else floor(ncol(x)/3),
ntreeTry = 50, stepFactor = 2, improve = 0.05, trace = TRUE,
plot = TRUE, doBest = FALSE, ...)
{
if (improve < 0)
stop("improve must be non-negative.")
classRF <- is.factor(y)
errorOld <- if (classRF) {
randomForest(x, y, mtry = mtryStart, ntree = ntreeTry,
keep.forest = FALSE, ...)$err.rate[ntreeTry, 1]
}
else {
randomForest(x, y, mtry = mtryStart, ntree = ntreeTry,
keep.forest = FALSE, ...)$mse[ntreeTry]
}
if (errorOld < 0)
stop("Initial setting gave 0 error and no room for improvement.")
if (trace) {
cat("mtry =", mtryStart, " OOB error =",
if (classRF)
paste(100 * round(errorOld, 4), "%", sep = "")
else errorOld, "\n")
}
oobError <- list()
oobError[[1]] <- errorOld
names(oobError)[1] <- mtryStart
for (direction in c("left", "right")) {
if (trace)
cat("Searching", direction, "...\n")
Improve <- 1.1 * improve
mtryBest <- mtryStart
mtryCur <- mtryStart
while (Improve >= improve) {
mtryOld <- mtryCur
mtryCur <- if (direction == "left") {
max(1, ceiling(mtryCur/stepFactor))
}
else {
min(ncol(x), floor(mtryCur * stepFactor))
}
if (mtryCur == mtryOld)
break
errorCur <- if (classRF) {
randomForest(x, y, mtry = mtryCur, ntree = ntreeTry,
keep.forest = FALSE, ...)$err.rate[ntreeTry,
"OOB"]
}
else {
randomForest(x, y, mtry = mtryCur, ntree = ntreeTry,
keep.forest = FALSE, ...)$mse[ntreeTry]
}
if (trace) {
cat("mtry =", mtryCur, "\tOOB error =",
if (classRF)
paste(100 * round(errorCur, 4), "%",
sep = "")
else errorCur, "\n")
}
oobError[[as.character(mtryCur)]] <- errorCur
Improve <- 1 - errorCur/errorOld
cat(Improve, improve, "\n")
if (Improve > improve) {
errorOld <- errorCur
mtryBest <- mtryCur
}
}
}
mtry <- sort(as.numeric(names(oobError)))
res <- unlist(oobError[as.character(mtry)])
res <- cbind(mtry = mtry, OOBError = res)
if (plot) {
plot(res, xlab = expression(m[try]), ylab = "OOB Error",
type = "o", log = "x", xaxt = "n")
axis(1, at = res[, "mtry"])
}
if (doBest)
res <- randomForest(x, y, mtry = res[which.min(res[,
2]), 1], ...)
res
}
<bytecode: 0x000002107c280ae0>
<environment: namespace:randomForest>
#..................Building Model
# cartParameters must be defined before this call, e.g. (assumed values):
# cartParameters <- rpart.control(minsplit = 100, minbucket = 10, cp = 0, xval = 10)
# Caution: TrainingData now carries Cart_Prediction from the earlier tree, so
# leaving it in the formula leaks the first model's answer into this one
# (the tree below splits on it at the root).
TrainingDataModel <- rpart(formula = Personal.Loan ~ ., data = TrainingData, method = "class",
                           control = cartParameters)

TrainingDataModel
n= 3500

node), split, n, loss, yval, (yprob)


* denotes terminal node

1) root 3500 336 0 (0.904000000 0.096000000)


2) Cart_Prediction=0 3207 56 0 (0.982538198 0.017461802)
4) CCAvg< 2.95 2766 9 0 (0.996746204 0.003253796) *
5) CCAvg>=2.95 441 47 0 (0.893424036 0.106575964)
10) Income..in.K.month.>=112.5 248 0 0 (1.000000000 0.000000000) *
11) Income..in.K.month.< 112.5 193 47 0 (0.756476684 0.243523316)
22) CD.Account< 0.5 174 33 0 (0.810344828 0.189655172)
44) Income..in.K.month.< 92.5 112 9 0 (0.919642857 0.080357143) *
45) Income..in.K.month.>=92.5 62 24 0 (0.612903226 0.387096774)
90) Education< 1.5 35 4 0 (0.885714286 0.114285714) *
91) Education>=1.5 27 7 1 (0.259259259 0.740740741) *
23) CD.Account>=0.5 19 5 1 (0.263157895 0.736842105) *
3) Cart_Prediction=1 293 13 1 (0.044368601 0.955631399)
6) Income..in.K.month.< 116.5 25 12 0 (0.520000000 0.480000000)
12) Age..in.years.>=60.5 7 0 0 (1.000000000 0.000000000) *
13) Age..in.years.< 60.5 18 6 1 (0.333333333 0.666666667) *
7) Income..in.K.month.>=116.5 268 0 1 (0.000000000 1.000000000) *

## PRINTING CART MODEL PARAMETERS

install.packages('rattle', dependencies = TRUE)

library(rattle)

library(RColorBrewer)

fancyRpartPlot(TrainingDataModel)

printcp(TrainingDataModel)
Classification tree:
rpart(formula = Personal.Loan ~ ., data = TrainingData, method = "class",
control = cartParameters)

Variables actually used in tree construction:


[1] Age..in.years. Cart_Prediction CCAvg CD.Account Education
[6] Income..in.K.month.

Root node error: 336/3500 = 0.096

n= 3500

CP nsplit rel error xerror xstd


1 0.7946429 0 1.00000 1.00000 0.051870
2 0.0104167 1 0.20536 0.20536 0.024477
3 0.0089286 3 0.18452 0.19643 0.023950
4 0.0000000 8 0.11905 0.19048 0.023591
plotcp(TrainingDataModel)

bestcp <- TrainingDataModel$cptable[which.min(TrainingDataModel$cptable[,"xerror"]), "CP"]

bestcp

ptree<- prune(TrainingDataModel, cp= bestcp ,"CP")


fancyRpartPlot(ptree, uniform=TRUE, main="Pruned Classification Tree")

## Model Performance
library(ROCR)

library(gplots)

library(ineq)

library(InformationValue)

TrainingData$predict.class <- predict(ptree, TrainingData, type="class")

TrainingData$predict.score <- predict(ptree, TrainingData)

View(TrainingData)

##Confusion Matrix

colnames(TestingData)=make.names(colnames(TestingData))

# PrunedCart_Train is not defined above; ptree is the pruned tree built earlier
TestingData$Cart_PredictionClass = predict(ptree, newdata = TestingData, type = "class")

confusionMatrix_Cart_Test=table(TestingData$Personal.Loan, TestingData$Cart_PredictionClass)

ErrorRate_CartModel = print((confusionMatrix_Cart_Test[1,2] + confusionMatrix_Cart_Test[2,1]) / nrow(TestingData))

[1] 0.02133333
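Beyond the raw error rate, the same confusion matrix yields the other standard measures — a sketch, assuming rows of `confusionMatrix_Cart_Test` are actuals (0/1) and columns are predictions:

```r
cm <- confusionMatrix_Cart_Test
accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["1", "1"] / sum(cm["1", ])   # share of actual loan-takers caught
specificity <- cm["0", "0"] / sum(cm["0", ])   # share of non-takers correctly rejected
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```

Sensitivity matters most for this problem: the bank's goal is to find the customers who would take the loan, so missing a class-1 customer is costlier than a false alarm.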

##AUC

# ROCR's performance() expects a prediction object, not a data frame
PredObj_Train = prediction(TrainingData$predict.score[, 2], TrainingData$Personal.Loan)
auc_TrainingData = performance(PredObj_Train, "auc")
auc_TrainingData = as.numeric(auc_TrainingData@y.values)

TestingData$predict.score = predict(ptree, newdata = TestingData)  # class probabilities
PredObj_Test = prediction(TestingData$predict.score[, 2], TestingData$Personal.Loan)
auc_TestingData = performance(PredObj_Test, "auc")
auc_TestingData = as.numeric(auc_TestingData@y.values)

## Gini Coefficient

gini_train = ineq(TrainingData$predict.score[, 2], type = "Gini")

gini_test = ineq(TestingData$predict.score[, 2], type = "Gini")

##Concordance Ratio

Concordance_TrainingData = Concordance(actuals = TrainingData$Personal.Loan,
                                       predictedScores = TrainingData$predict.score[, 2])

Concordance_TestingData = Concordance(actuals = TestingData$Personal.Loan,
                                      predictedScores = TestingData$predict.score[, 2])
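Another measure covered in the performance toolkit is the KS statistic, which can be derived from a ROCR prediction object built on the class-1 scores — a sketch, assuming `predict.score` holds the probability matrix as in the AUC step:

```r
library(ROCR)
# prediction object: class-1 probabilities vs. actual labels
pred_train <- prediction(TrainingData$predict.score[, 2], TrainingData$Personal.Loan)
perf_train <- performance(pred_train, "tpr", "fpr")
# KS = maximum separation between the TPR and FPR curves
KS_train <- max(perf_train@y.values[[1]] - perf_train@x.values[[1]])
KS_train
```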

Conclusion
The performance measures are consistent between the train and test data (the CART model's test-set error rate is about 2.1%), so the model does not over-fit and generalizes well.
