
Data diagnosis for high dimensional multi-class machine learning

CS 6780 (Spring 2015)

Abstract
Multiclass labeling of high-dimensional data is important for industries such as online retailing. Due to the complexity of the data and the related nature of the product classes, this machine learning task is not trivial. It is of primary interest to understand whether a particular learning algorithm is best suited to this particular data set. Finding no clear winner, we endeavored to diagnose the data to identify why learning was difficult for all of the algorithms, and attempted to address the difficulty with available learning algorithms.

1. Introduction
Proper classification of products sold is key to ensuring accurate understanding of product feedback, quality control, and a number of other business operations for a typical online retailer. Now, expand this scenario to include a wide array of online retailers operating under a single parent company. Specifically, we are interested in data from the Otto Group, one of the world's largest e-commerce companies, responsible for millions of products sold worldwide each day, with a constant incoming stream of thousands of new products. Due to the large number of subsidiaries and the varying infrastructure within each one, identical products are classified differently between subsidiaries at a non-negligible rate. We aim to investigate the performance of various learning algorithms in correctly classifying these products, and to investigate why certain algorithms outperform others in specific categories. The work begins with a survey of available learning packages and concludes with an in-depth look at the differences and similarities in accuracy and performance.

2. Data
The data is from Kaggle (Kaggle, 2015), made available
through the Otto group. There are a total of 93 features and
9 classes. Each record indicates the count of each of the 93 features, along with the actual class label. The specific nature of the features and classes has not been made available; they are identified only by number (1 to 93 for features, 1 to 9 for classes).
The dataset has 61,878 records. We chose to separate this
data into a training set and test set of 75% and 25%, respectively. On the training set, a 5-fold cross validation scheme
is used to evaluate the performance and robustness of each
learning method.
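As a minimal sketch of this setup with the current scikit-learn API (the file and column names are assumptions, and whether the original split and folds were stratified is not stated):

    import pandas as pd
    from sklearn.model_selection import train_test_split, StratifiedKFold

    # Load the Kaggle training data (file and column names are assumed).
    data = pd.read_csv("train.csv")
    X = data.drop(["id", "target"], axis=1).values
    y = data["target"].values

    # 75% / 25% train-test split; stratification keeps all 9 classes in both sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)

    # 5-fold cross validation on the training portion.
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, val_idx in folds.split(X_train, y_train):
        X_tr, X_val = X_train[train_idx], X_train[val_idx]
        y_tr, y_val = y_train[train_idx], y_train[val_idx]
        # each candidate model is fit on (X_tr, y_tr) and scored on (X_val, y_val)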
We began by looking for high correlation values between pairs of the 93 features in the dataset. A correlation plot is
shown in Figure 1. Though it does appear that many of the
features are loosely correlated, it did not appear that any
features were worth excluding altogether. We also looked
at the correlation matrices for each of the nine classes to
see if there was a particular footprint in correlation for
any one of the classes that would inform feature selection,
but did not find that this was the case. Thus, we moved
forward with a complete feature set.
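Continuing the sketch above, the correlation screening can be reproduced roughly as follows (again assuming the id and target column names):

    import numpy as np

    # 93 x 93 Pearson correlation matrix over all records (Figure 1).
    features = data.drop(["id", "target"], axis=1)
    corr_all = features.corr()

    # Largest off-diagonal value, to check whether any feature pair is redundant.
    print((corr_all.values - np.eye(corr_all.shape[0])).max())

    # One correlation matrix per class, to look for a class-specific footprint.
    per_class = {label: group.drop(["id", "target"], axis=1).corr()
                 for label, group in data.groupby("target")}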

Figure 1. Plot of feature correlation matrix. High correlation values only occur on the diagonal.

Because this is a Kaggle competition, an additional test set of 18,000 records is also available through Kaggle, but for this additional test set the class labels are unavailable. To see how our models compare with others, we submitted our
best models to Kaggle on their test set to see how we scored on their log-loss grading scale. Note, however, that scoring on the leaderboard is not the main intention of this project.
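Kaggle's grading metric is the multiclass logarithmic loss, which can also be computed locally; a small illustrative example with scikit-learn (the toy labels and probabilities below are made up):

    import numpy as np
    from sklearn.metrics import log_loss

    # Toy example: 3 records, 9 classes, each probability row sums to 1.
    labels = ["Class_%d" % i for i in range(1, 10)]
    y_true = ["Class_2", "Class_3", "Class_9"]
    proba = np.full((3, 9), 0.05)
    proba[0, 1] = proba[1, 2] = proba[2, 8] = 0.60   # mass on the true class
    print(log_loss(y_true, proba, labels=labels))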

3. Learning Methods
A variety of different methods were used to begin investigating the best methodology for approaching this high-dimensional multiclass machine learning problem. An outline of our overall work flow is provided in Figure 2. In Stages 1-3, we used our training set (75% of the overall data), and after we finalized our models, we applied the trained models to the test set in Stage 4. In Stage 1, we explored a variety of learning algorithms. In Stage 2, we aggregated the results of our promising models in proportion to the performance we saw in Stage 1. In Stage 3, we re-trained three of the models from Stage 1 and the two models from Stage 2 on the complete training set. Finally, in the last stage, we tested the five trained models in order to compare performance.
Figure 2. An overview of the work flow covered in this paper. Stage 1 consisted of the basic methods, including Random Forests (RF), support vector machines (SVM), naive Bayes (NB), and neural networks (NN), optimized over appropriate grid searches when relevant. This set was five-fold cross validated on the training set. Stage 2 aggregated different combinations of promising models from Stage 1 and weighted them according to their Stage 1 cross validation performance. These were again cross validated on the same five-fold scheme. In Stage 3, we re-trained five of the models on the whole training set. In Stage 4, we ran the trained models from Stage 3 on the remaining test set.

As shown in Figure 2, the team began with basic methods from different learning algorithm families, including Random Forests (RF), support vector machines (SVM), naive Bayes (NB), and neural networks (NN). In addition to the four models shown in the figure under Stage 1, we also looked initially at decision trees (DT) before using a random forest ensemble. A survey of these various methods enabled us to gauge the potential performance of the different learning algorithms on our particular data set. We used available machine learning packages from scikit-learn (Pedregosa et al., 2011) for the first three, and Keras (Chollet, Francois, 2015; LISA lab, University of Montreal, 2015) to implement a neural network. Computing was performed in both Python and MATLAB.

After an initial attempt with the four methods, the decision tree seemed to hold promise due to its speed and easy use in ensemble models. Similarly, the neural network also appeared to have an advantage over the other methods. The linear SVM algorithm did not perform as well, but we pursued it in order to discern any possible insights about our data that the others may not provide. Naive Bayes performed the most poorly compared to the others; the independence assumption between the 93 features does not hold in this data set (Murphy, 2012; Shalev-Shwartz & Ben-David, 2014). Moving forward, we did not include naive Bayes in subsequent analysis.

3.1. Decision Trees: Ensemble Methods
The decision tree is a favorable option for creating an ensemble. It is computationally cheap to generate many shallow trees, or weak learners. We initially investigated both
the bootstrap aggregating (bagging) method (Breiman,
1996) as well as a random forest (Breiman, 2001). For bagging, we adjusted the learning rate and the number of estimators. Similarly, ranging over the maximum depth of trees (20-60) and the number of estimators (10-200) as shown in Figure
3, we chose a maximum depth of 38 and 100 estimators for
the random forest parameters. We moved forward with the
random forest since it outperformed bagging in the early
trials. The mean accuracy of 5-fold cross validation for the
random forest model is 80.4%.
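A sketch of this grid search with scikit-learn, continuing from the data split sketched in Section 2 (the exact grid points are not reported, so the values here are illustrative):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "max_depth": [20, 30, 38, 50, 60],        # explored range: 20-60
        "n_estimators": [10, 50, 100, 150, 200],  # explored range: 10-200
    }
    rf_search = GridSearchCV(RandomForestClassifier(random_state=0),
                             param_grid, cv=5, scoring="accuracy", n_jobs=-1)
    rf_search.fit(X_train, y_train)
    # Reported optimum: max_depth=38, n_estimators=100 (about 80.4% CV accuracy).
    rf_model = rf_search.best_estimator_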
3.2. Linear SVMs

Even initially, the linear SVM did not appear as promising as the other options, since the classes did not appear to be linearly separable. We decided to keep this model in order to explore tuning parameters to increase performance during cross validation (Cristianini & Shawe-Taylor, 2000). Mainly, we began by adjusting the regularization parameter C, as well as searching over the number of features used in training, where features are ranked by their chi-squared (χ²) scores. This analysis resulted in choosing a value of C = 1 and limiting the analysis to 60 features, which appeared to be an appropriate number to guard against under- and over-fitting. Figure 4 (a) shows the validation scores with respect to parameter C using all features, and (b) shows the scores with respect to the number of features when C = 1. The mean accuracy of 5-fold cross validation for the linear SVM is 73.1%.
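A sketch of this setup, selecting the top-k features by χ² score before fitting a linear SVM (whether the original runs used LinearSVC or another linear-kernel implementation is not stated, so that choice is an assumption; the data split is the one sketched in Section 2):

    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    pipe = Pipeline([
        ("select", SelectKBest(chi2)),   # counts are non-negative, so chi2 applies
        ("svm", LinearSVC()),
    ])
    param_grid = {
        "select__k": [20, 40, 60, 80, 93],         # number of features kept
        "svm__C": [0.01, 0.1, 1.0, 10.0, 100.0],   # regularization parameter C
    }
    svm_search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
    svm_search.fit(X_train, y_train)
    # Reported optimum: C = 1 with 60 features (about 73.1% CV accuracy).
    svm_model = svm_search.best_estimator_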


Figure 3. Grid search on maximum depth of trees and number of estimators for the random forest. The vertical axis represents the log accuracy of 5-fold cross validation. Optimal parameters are a maximum depth of 38 and 100 estimators.

Figure 4. (a) Validation scores with respect to the regularization parameter C, and (b) with respect to the number of features for the linear SVM. Optimal parameters are C = 1 and 60 features.

3.3. Neural Networks

In using a neural network approach to this learning problem, we explored two main approaches: one, where the neural network replicates a pyramid shape, starting with a first layer containing fewer hidden units than the input layer and narrowing with each subsequent layer; and two, where the neural network expands in width from the input of 93 features, then narrows, and is relatively shallow (Daniel Nouri, 2014). We expected to see better performance from the former (a narrow and deep network), but we found that the latter scheme (a wide and shallow network) outperformed the former during cross validation.

The final neural network model consisted of 3 hidden layers, starting from an input of 93 features, out to widths of 500, 400, then 300, and finally to an output of 9 classes. Figure 5 shows the relation between the maximum epoch and the cross validation scores. A maximum epoch of 30 is used to prevent overfitting. The mean accuracy of 5-fold cross validation for the neural network is 79.8%.

Figure 5. Relation between the number of epochs and cross validation scores for the neural network with 3 hidden layers of 500, 400, and 300 hidden units.
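A sketch of this architecture, written against the current Keras API rather than the 2015 Theano-backed Keras used for the project; the hidden-layer activation, optimizer, and batch size are not reported above, so those choices are assumptions:

    from sklearn.preprocessing import LabelEncoder
    from tensorflow import keras
    from tensorflow.keras import layers

    # Integer-encode the class labels (e.g. Class_1..Class_9 -> 0..8).
    y_train_int = LabelEncoder().fit_transform(y_train)

    # 93 inputs -> 500 -> 400 -> 300 hidden units -> 9-way softmax output.
    nn_model = keras.Sequential([
        keras.Input(shape=(93,)),
        layers.Dense(500, activation="relu"),   # activation assumed, not reported
        layers.Dense(400, activation="relu"),
        layers.Dense(300, activation="relu"),
        layers.Dense(9, activation="softmax"),
    ])
    nn_model.compile(optimizer="adam",
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])

    # Cap training at 30 epochs to avoid the overfitting seen past that point (Figure 5).
    nn_model.fit(X_train, y_train_int, epochs=30, batch_size=128, validation_split=0.2)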


4. Data Diagnosis

All of the methods reached a limit in their performance of around 73-80% accuracy during cross validation. This limit necessitated further investigation. We began by identifying which classes were the most difficult to correctly label, and whether there was any difference in this class-by-class accuracy between methods. The accuracy of each learning algorithm on the true class label, P(Ŷ = y | Y = y), where Y is the true class and Ŷ is the predicted class, is presented in Figure 6. From the figure, it can be seen that Classes 1, 3, and 4 were highly misclassified.
Within these classes, Class 1 was often mistaken for Class
9, while Classes 3 and 4 were often labeled as Class 2. This
was the case across all three learning algorithms, as presented in the confusion matrices for the three models (RF,
SVM, and NN, shown in Figures 7, 8, and 9, respectively).
When looking at the accuracy of our prediction of class y for each training record, that is, P(Y = y | Ŷ = y), the differences were somewhat more subtle, and the lower performing classes included Classes 2 and 9, the labels most often assigned in error (Figure 10). These trends held across all
learning algorithms.
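Both per-class quantities can be read directly off a confusion matrix; as a sketch (y_val and y_pred stand for the validation-fold labels and any fitted model's predictions from the earlier sketches):

    import numpy as np
    from sklearn.metrics import confusion_matrix

    cm = confusion_matrix(y_val, y_pred)   # rows: true class, columns: predicted class

    # P(Yhat = y | Y = y): of the records truly in class y, the fraction labeled y (Figure 6).
    recall_per_class = np.diag(cm) / cm.sum(axis=1)

    # P(Y = y | Yhat = y): of the records labeled y, the fraction truly in class y (Figure 10).
    precision_per_class = np.diag(cm) / cm.sum(axis=0)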
To investigate further, we extracted these classes to see if classification improved when we trained only on these five classes, instead of all nine, but the low-accuracy classes still posed issues; similarly, we trained exclusively on Classes 2 and 3, but these two classes were still difficult to separate.



Figure 6. Performance P(Ŷ = y | Y = y) of all methods in each class. Y is the true class and Ŷ is the predicted class.


Figure 7. Random forest confusion matrix during cross validation.

Figure 8. SVM confusion matrix during cross validation.

In order to have a clearer visual of what our data looks like, we used the t-distributed stochastic neighbor embedding (t-SNE) algorithm, an effective visualization tool, to reduce our 93 dimensions to a 2-D plot (Van der Maaten & Hinton, 2008). The resulting figure indicated that our difficulty in classifying Class 2 and Class 3 was not a function of our learning algorithms, but a function of the highly non-linear, difficult-to-separate nature of these two classes. While the other classes have some visual separation in the t-SNE plot (Figure 11), Classes 2 and 3, and to some extent Class 4, share almost the same space and characteristics. Running the t-SNE algorithm again on only Classes 2 and 3, the similarity is even more apparent (Figure 12).
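A sketch of the t-SNE projection behind Figures 11 and 12, using scikit-learn's implementation (the subsample size and perplexity are illustrative, since t-SNE on all 61,878 records is slow; y_train_int is the integer encoding from the neural network sketch):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # Project a random subsample of the 93-dimensional records down to 2-D.
    rng = np.random.RandomState(0)
    idx = rng.choice(len(X_train), size=5000, replace=False)
    embedding = TSNE(n_components=2, perplexity=30,
                     random_state=0).fit_transform(X_train[idx])

    plt.scatter(embedding[:, 0], embedding[:, 1], c=y_train_int[idx], s=2, cmap="tab10")
    plt.show()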

Figure 9. Neural network confusion matrix during cross validation.

In light of this information, we then moved forward by aggregating the predictions of our three models: a small, simple ensemble approach intended to make use of our existing model strategies by giving each model a vote with an assigned weight. We tried two different cases. In the first, we gave the random forest and neural network model predictions a weight of 0.4 each, and the SVM model a weight of 0.2. In the second, we left the SVM model out and split the weight evenly between the random forest and neural network models. The resulting confusion matrices for these two models using our 5-fold cross validation are presented in Figures 13 and 14.
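A sketch of this weighted vote, assuming the vote is taken over predicted class probabilities with aligned class orderings; the model variable names are placeholders for the fitted estimators from the earlier sketches, and since LinearSVC does not expose predict_proba, a probability-capable (e.g. calibrated) SVM is an assumption here:

    import numpy as np

    # Class-probability matrices (n_records x 9) from the fitted models on the same records.
    p_rf = rf_model.predict_proba(X_val)
    p_nn = nn_model.predict(X_val)           # Keras softmax outputs
    p_svm = svm_model.predict_proba(X_val)   # requires a probability-capable SVM

    # Case 1: RF and NN weighted 0.4 each, SVM weighted 0.2.
    p_three = 0.4 * p_rf + 0.4 * p_nn + 0.2 * p_svm
    # Case 2: SVM left out, weight split evenly between RF and NN.
    p_two = 0.5 * p_rf + 0.5 * p_nn

    pred_three = p_three.argmax(axis=1)
    pred_two = p_two.argmax(axis=1)

Averaging probabilities rather than hard votes lets a confident model outweigh an uncertain one on individual records.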


Figure 10. Performance P(Y = y | Ŷ = y) of all methods for each class.

Figure 11. t-SNE plot for 93 dimensions in 2-D for all 9 classes.

Figure 12. t-SNE plot for 93 dimensions in 2-D for Class 2 and Class 3.

Including the model results of the SVM, performance in cross validation (79.5%) did not improve over the previous single models, random forest (80.4%) and neural networks (79.8%).
The model with equal votes from the random forest and
neural network models, however, saw a 1% increase over
either of those models alone. At this point, given the highly linearly inseparable nature of Classes 2 and 3 in the training data, we moved on to re-train the models on the entire training set in order to run the final test results.

5. Results
On our test set, we tried the random forest, SVM, and neural network models, as well as our two aggregating models described above. In addition to comparing performance between the five models, we were also interested to see whether the chosen training set versus test set split
of 75-25 was reasonable, as well as whether our cross validation scheme was representative of the results on our test set. For this reason, we again present the confusion matrices for the two aggregate models, but this time with results from the test set, in Figures 15 and 16. These can be compared with the confusion matrices for the aggregate models on the training set. All of the results are summarized in Table 1.

Figure 13. Weighted 3-model aggregate (random forest, neural networks, and SVM) confusion matrix during cross validation.


Figure 14. Weighted 2-model aggregate (random forest and neural networks) confusion matrix during cross validation.

Table 1. Cross-validation and test accuracies for all models.

                   RF      SVM     NN      RF+NN+SVM   RF+NN
    CV acc. (%)    80.4    73.1    79.8    79.5        81.2
    Test acc. (%)  80.6    73.3    79.7    79.4        81.1

We see close agreement between the cross validation percentages and the test set percentages. Overall, the RF+NN model outperformed the other four models in the CV stage (Stages 1 and 2) and the test stage (Stage 4). SVM performance is consistently lower than that of the other models, and including it in the aggregate model appears to worsen performance relative to either RF or NN alone.

Figure 15. Confusion matrix for weighted RF+NN+SVM with results from the test set.


Figure 16. Confusion matrix for weighted RF+NN with results from the test set.

6. Conclusions and Future Work


With the use of high-dimensional visualization tools such as t-SNE and a range of machine learning algorithms, we were able to classify these products, but only up to a certain limit. It appears that Class 2 and Class 3 have very similar characteristics, and, due to the nature of this Kaggle competition, we do not know why this may be the case. For example, it is possible that these are inherently two very similar categories that share many characteristics and would be difficult even for a human to separate. It is also possible that these are in fact two very distinct categories that happen to share many features, and that it would be readily apparent to an individual that they are separate categories. Unfortunately, the features are only given numerical labels (1 to 93), the classes are likewise only numbered (1 to 9), and there is no intention of elaborating on this information, so we have no way of knowing which of these scenarios is the case.
Because separating the data through additional data preprocessing or more extensive learning was not a route we
saw coming to fruition, we instead chose to move forward
by aggregating the predictions of our best performing models. This led to a slight improvement in performance, both
on the test set and validation set. We saw that our results
were consistent between our cross-validation scheme and
our test set split. This affirms that the 5-fold cross validation and 75-25 split were reasonable to use given the size
of our data set.
This data set emphasized the importance of understanding the nature of the data. Naive Bayes, for example, did poorly

due to the high number of features and the poor fit of the assumption of independence between those features. The linear SVM struggled compared to the random forest and neural network models since the data is highly non-linear. We believe that further work using random forests and neural networks could lead to better results, especially if a larger ensemble of these types of models were used.

References

Breiman, Leo. Bagging predictors. Machine Learning, 24(2):123-140, 1996.

Breiman, Leo. Random forests. Machine Learning, 45(1):5-32, 2001.

Chollet, Francois. Keras documentation - Theano-based deep learning library, 2015. URL http://keras.io/.

Cristianini, N. and Shawe-Taylor, J. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000. ISBN 9780521780193. URL http://books.google.com/books?id=B-Y88GdO1yYC.

Daniel Nouri. Using convolutional neural nets to detect facial keypoints tutorial, 2014. URL http://danielnouri.org/notes/category/programming/.

Kaggle. Kaggle: Otto Group product classification challenge - classify products into the correct category, March 2015. URL https://www.kaggle.com/c/otto-group-product-classification-challenge.

LISA lab, University of Montreal. Theano 0.7 documentation - Python library for defining, optimizing, and evaluating mathematical expressions involving multi-dimensional arrays efficiently, 2015. URL http://deeplearning.net/software/theano/.

Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

Shalev-Shwartz, Shai and Ben-David, Shai. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Van der Maaten, Laurens and Hinton, Geoffrey. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008.
