Abstract
Multiclass labeling of high-dimensional data is important for industries such as online retailing. Because of the complexity of the data and the related nature of the product classes, this machine learning task is not trivial. Our primary interest is whether a particular learning algorithm is best suited to this data set. With no clear winner, we endeavored to diagnose the data to identify why learning was difficult for all of the algorithms, and attempted to address the difficulty with the available learning algorithms.
1. Introduction
Proper classification of products sold is key to accurate product feedback, quality control, and a number of other business operations for a typical online retailer. Now expand this scenario to a wide array of online retailers operating under a single parent company. Specifically, we are interested in data from the Otto Group, one of the world's largest e-commerce companies, responsible for millions of products sold worldwide each day, with a constant incoming stream of thousands of new products. Due to the large number of subsidiaries and the varying infrastructure within each one, identical products are classified differently between subsidiaries at a non-negligible rate. We aim to evaluate the performance of various learning algorithms at correctly classifying these products, and to investigate why certain algorithms outperform others in specific categories. The work begins with a survey of available learning packages and concludes with an in-depth look at the differences and similarities in accuracy and performance.
2. Data
The data is from Kaggle (Kaggle, 2015), made available through the Otto Group. There are a total of 93 features and 9 product classes.
Preliminary work. Under review by the International Conference
on Machine Learning (ICML). Do not distribute.
Figure 1. Plot of feature correlation matrix. High correlation values only occur on the diagonal.
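A matrix like the one in Figure 1 can be computed in a few lines of NumPy. The sketch below substitutes synthetic count data for the real 93 Otto features (an assumption); only the computation is being illustrated.

```python
import numpy as np

# Synthetic stand-in for the 93-dimensional Otto feature matrix; only the
# computation, not the data, is being illustrated here.
rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(1000, 93)).astype(float)

# np.corrcoef treats rows as variables, so pass features as rows.
corr = np.corrcoef(X.T)  # 93 x 93 Pearson correlation matrix
```

For independent features such as these, off-diagonal entries stay near zero while the diagonal is exactly one, which is the pattern Figure 1 reports for the real data.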
3. Learning Methods
A variety of methods were used to investigate the best approach to this high-dimensional multiclass learning problem. An outline of our overall workflow is provided in Figure 2. In Stages 1-3, we used our training set (75% of the overall data); after finalizing our models, we applied them to the test set in Stage 4. In Stage 1, we explored a variety of learning algorithms. In Stage 2, we aggregated the results of our promising models in proportion to the performance we saw in Stage 1. In Stage 3, we re-trained three of the models from Stage 1 and the two models from Stage 2 on the complete training set. Finally, in the last stage, we tested the five trained models in order to compare performance.
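Stage 2's performance-weighted aggregation can be sketched as a weighted average of class probabilities. This is an illustrative reconstruction, not the authors' exact code: the data are synthetic, the weights (set here to the Stage 1 CV accuracies) are one reading of "proportional to performance", and logistic regression stands in for the linear SVM because scikit-learn's LinearSVC does not expose class probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Otto data: 93 features, 9 classes.
X, y = make_classification(n_samples=2000, n_features=93, n_informative=30,
                           n_classes=9, random_state=0)
# 75% / 25% split, as in the workflow of Figure 2.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

# Hypothetical component models; LogisticRegression stands in for the
# linear SVM because LinearSVC does not expose predict_proba.
models = [RandomForestClassifier(n_estimators=50, random_state=0),
          LogisticRegression(max_iter=1000)]
# Weights "proportional to performance" -- here the Stage 1 CV accuracies.
weights = np.array([0.804, 0.731])

probs = np.zeros((len(X_te), 9))
for model, w in zip(models, weights):
    model.fit(X_tr, y_tr)
    probs += w * model.predict_proba(X_te)

pred = probs.argmax(axis=1)  # aggregated class prediction
```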
[Figure 2: workflow diagram. Stage 1: 5-fold CV over RF, SVM, NN, and NB; Stage 2: 5-fold CV of the aggregated models; Stage 3: training; Stage 4: testing. Training set = 75% of total data; test set = 25%.]
After an initial pass with the four methods, the decision tree seemed promising due to its speed and its easy use in ensemble models. Similarly, the neural network appeared to have an advantage over the other methods. The linear SVM did not perform as well, but we pursued it in order to discern any insights about our data that the others might not provide. Naive Bayes performed the most poorly; the independence assumption between the 93 features does not hold in this data set (Murphy, 2012; Shalev-Shwartz & Ben-David, 2014). Moving forward, we did not include Naive Bayes in subsequent analysis.
3.1. Decision Trees: Ensemble Methods
The decision tree is a favorable base learner for an ensemble: it is computationally cheap to generate many shallow trees, or weak learners. We initially investigated both the bootstrap aggregating (bagging) method (Breiman, 1996) and a random forest (Breiman, 2001). For bagging, we adjusted the learning rate and the number of estimators. Similarly, ranging over the maximum tree depth (20-60) and the number of estimators (10-200) as shown in Figure 3, we chose a maximum depth of 38 and 100 estimators for the random forest. We moved forward with the random forest since it outperformed bagging in the early trials. The mean accuracy of 5-fold cross validation for the random forest model is 80.4%.
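With scikit-learn (Pedregosa et al., 2011), the chosen random forest and its 5-fold cross validation can be sketched as follows; the data here are a synthetic stand-in for the real Otto features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 93 Otto features and 9 classes.
X, y = make_classification(n_samples=1500, n_features=93, n_informative=30,
                           n_classes=9, random_state=0)

# Parameters selected from the grid search of Figure 3:
# maximum depth 38, 100 estimators.
rf = RandomForestClassifier(max_depth=38, n_estimators=100, random_state=0)
scores = cross_val_score(rf, X, y, cv=5)  # 5-fold cross validation
```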
3.2. Linear SVMs
Even initially, the linear SVM did not appear as promising as the other options, since the classes did not appear to be linearly separable. We kept this model in order to explore tuning parameters during cross validation (Cristianini & Shawe-Taylor, 2000). We began by adjusting the regularization parameter C, as well as searching over the number of features used in training, with features ranked by their chi-squared scores. This analysis led us to choose C = 1 and to limit the analysis to 60 features, which appeared to be an appropriate number to guard against both under- and over-fitting. Figure 4 shows the validation scores with respect to C and to the number of features.
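The chi-squared feature ranking followed by the C = 1 linear SVM can be sketched with scikit-learn's SelectKBest and LinearSVC. The synthetic count features below are an assumption standing in for the real data; chi2 scoring requires non-negative inputs, which count features satisfy.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic non-negative count features: chi2 scoring requires
# non-negative inputs. Class-dependent Poisson rates give some features
# genuine signal. All of this is a stand-in for the real data.
rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 1200, 93, 9
y = rng.integers(0, n_classes, size=n_samples)
rates = 1.0 + 0.5 * rng.random((n_classes, n_features))
X = rng.poisson(rates[y])

# Keep the 60 top-ranked features by chi-squared score, then fit a
# linear SVM with C = 1, matching the parameters chosen in Figure 4.
clf = make_pipeline(SelectKBest(chi2, k=60), LinearSVC(C=1))
scores = cross_val_score(clf, X, y, cv=5)
```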
3.3. Neural Networks
The final neural network model consisted of three hidden layers: from an input of 93 features to widths of 500, 400, and 300, and finally to an output of 9 classes. Figure 5 shows the relation between the maximum epoch and the cross validation score. A maximum epoch of 30 is used to prevent overfitting. The mean accuracy of 5-fold cross validation for the neural network is 79.8%.
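The 93-500-400-300-9 architecture can be approximated in a self-contained way with scikit-learn's MLPClassifier; the paper's actual implementation used the Keras/Theano stack (see References), so this is a stand-in sketch on synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the 93-feature, 9-class Otto data.
X, y = make_classification(n_samples=1000, n_features=93, n_informative=30,
                           n_classes=9, random_state=0)

# Hidden layers of width 500, 400, 300 between the 93-dimensional input
# and the 9-class output; max_iter=30 caps the training epochs, as
# chosen from Figure 5 to prevent overfitting.
nn = MLPClassifier(hidden_layer_sizes=(500, 400, 300), max_iter=30,
                   random_state=0)
nn.fit(X, y)
```

Note that n_layers_ counts input and output layers, so this network reports 5 layers in total.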
[Figure 5: validation score versus maximum epoch (0 to 100) for the neural network; scores range from roughly 0.77 to 0.82.]
4. Data Diagnosis
Figure 4. (a) Validation scores with respect to regularization parameter C, and (b) number of features for linear SVM. Optimal
parameters are C = 1 and 60 features.
Figure 12. t-SNE plot for 93 dimensions in 2-D for Class 2 and
Class 3.
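Embeddings like those in Figures 11 and 12 can be produced with scikit-learn's TSNE (Van der Maaten & Hinton, 2008); the input below is synthetic stand-in data for the 93-dimensional feature vectors.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for the 93-dimensional feature vectors.
rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(300, 93)).astype(float)

# Embed the 93 dimensions into 2-D, as in Figures 11 and 12.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
```

With the real data, coloring the 2-D points by class label makes the overlap between Classes 2 and 3 visible.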
Figure 11. t-SNE plot for 93 dimensions in 2-D for all 9 classes.
The model with equal votes from the random forest and neural network models, however, saw a 1% increase over either model alone. At this point, given the highly linearly inseparable nature of Classes 2 and 3 in the training data, we moved on to re-train the models on the entire training set in order to run the final tests.
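An equal-vote combination of two classifiers, as in the RF+NN model, can be sketched with scikit-learn's soft VotingClassifier; the data are synthetic and the model sizes are scaled down for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data; model sizes are scaled down for illustration.
X, y = make_classification(n_samples=1000, n_features=93, n_informative=30,
                           n_classes=9, random_state=0)

# "Soft" voting averages the two models' predicted class probabilities
# with equal weight, as in the RF+NN model.
vote = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                ("nn", MLPClassifier(hidden_layer_sizes=(50,), max_iter=100,
                                     random_state=0))],
    voting="soft")
vote.fit(X, y)
pred = vote.predict(X)
```

Soft voting lets a confident model outvote an uncertain one, which is one plausible reason the combined model edges out its components.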
5. Results
On our test set, we tried the random forest, SVM, and neural network models, as well as the two aggregating models described above. In addition to comparing the performance of the five models, we were also interested to see if the chosen training set versus test set split
Cross validation and test accuracy for the five final models:

Model        CV acc. (%)   Test acc. (%)
RF           80.4          80.6
SVM          73.1          73.3
NN           79.8          79.7
RF+NN+SVM    79.5          79.4
RF+NN        81.2          81.1
Figure 15. Confusion matrix for weighted RF+NN+SVM with results from the test set.
References
Breiman, Leo. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
Breiman, Leo. Random forests. Machine Learning, 45(1):5-32, 2001.
Chollet, Francois. Keras documentation - Theano-based deep learning library, 2015. URL http://keras.io/.
Cristianini, N. and Shawe-Taylor, J. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000. ISBN 9780521780193. URL http://books.google.com/books?id=B-Y88GdO1yYC.
Nouri, Daniel. Using convolutional neural nets to detect facial keypoints tutorial, 2014. URL http://danielnouri.org/notes/category/programming/.
Kaggle. Kaggle: Otto Group product classification challenge - classify products into the correct category, March 2015. URL https://www.kaggle.com/c/otto-group-product-classification-challenge.
LISA lab, University of Montreal. Theano 0.7 documentation - Python library for defining, optimizing, and evaluating mathematical expressions involving multi-dimensional arrays efficiently, 2015. URL http://deeplearning.net/software/theano/.
Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.
Shalev-Shwartz, Shai and Ben-David, Shai. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
Van der Maaten, Laurens and Hinton, Geoffrey. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008.