
A COMBINATION SCHEME FOR INDUCTIVE
LEARNING FROM IMBALANCED DATA SETS

by
Andrew Estabrooks

A Thesis Submitted to the
Faculty of Computer Science
in Partial Fulfillment of the Requirements
for the degree of
MASTER OF COMPUTER SCIENCE

Major Subject: Computer Science

APPROVED:
_________________________________
Nathalie Japkowicz, Supervisor
_________________________________
Qigang Gao
_________________________________
Louise Spiteri

DALHOUSIE UNIVERSITY - DALTECH
Halifax, Nova Scotia
2000
DALTECH LIBRARY
"AUTHORITY TO DISTRIBUTE MANUSCRIPT THESIS"

TITLE:
A Combination Scheme for Learning From Imbalanced Data Sets

The above library may make available or authorize another library to make available individual photo/microfilm copies of this thesis without restrictions.

Full Name of Author: Andrew Estabrooks
Signature of Author: _________________________
Date: 7/21/2000
TABLE OF CONTENTS

1. Introduction
  1 Inductive Learning
  2 Class Imbalance
  3 Motivation
  4 Chapter Overview
  5 Learners
    5.1 Bayesian Learning
    5.2 Neural Networks
    5.3 Nearest Neighbor
    5.4 Decision Trees
  6 Decision Tree Learning Algorithms and C5.0
    6.1 Decision Trees and the ID3 Algorithm
    6.2 Information Gain and the Entropy Measure
    6.3 Overfitting and Decision Trees
    6.4 C5.0 Options
  7 Performance Measures
    7.1 Confusion Matrix
    7.2 g-Mean
    7.3 ROC Curves
  8 A Review of Current Literature
    8.1 Misclassification Costs
    8.2 Sampling Techniques
      8.2.1 Heterogeneous Uncertainty Sampling
      8.2.2 One-sided Intelligent Selection
      8.2.3 Naive Sampling Techniques
    8.3 Classifiers Which Cover One Class
      8.3.1 BRUTE
      8.3.2 FOIL
      8.3.3 SHRINK
    8.4 Recognition-Based Learning
  9 Experimental Design
    9.1 Artificial Domain
    9.2 Example Creation
    9.3 Description of Tests and Results
      9.3.1 Test #1: Varying the Target Concept's Complexity
      9.3.2 Test #2: Correcting Imbalanced Data Sets: Over-sampling vs. Downsizing
      9.3.3 Test #3: A Rule Count for Balanced Data Sets
    9.4 Characteristics of the Domain and how they Affect the Results
  10 Combination Scheme
    10.1 Motivation
    10.2 Architecture
      10.2.1 Classifier Level
      10.2.2 Expert Level
      10.2.3 Weighting Scheme
      10.2.4 Output Level
  11 Testing the Combination Scheme on the Artificial Domain
  12 Text Classification
    12.1 Text Classification as an Inductive Process
  13 Reuters-21578
    13.1 Document Formatting
    13.2 Categories
    13.3 Training and Test Sets
  14 Document Representation
    14.1 Document Processing
    14.2 Loss of Information
  15 Performance Measures
    15.1 Precision and Recall
    15.2 F-measure
    15.3 Breakeven Point
    15.4 Averaging Techniques
  16 Statistics Used in this Study
  17 Initial Results
  18 Testing the Combination Scheme
    18.1 Experimental Design
    18.2 Performance with Loss of Examples
    18.3 Applying the Combination Scheme
  19 Summary
  20 Further Research
LIST OF TABLES

Table 2.4.1: An example lift table taken from [Ling and Li, 1998].
Table 2.4.2: This table is adapted from [Kubat et al., 1998]. It gives the accuracies achieved by C4.5, 1-NN and SHRINK.
Table 2.4.3: Results for the three data sets tested. The figures represent the percent error along with their standard deviation as each algorithm was tested on the data.
Table 3.1.4: Accuracies of various expressions learned over balanced and imbalanced data sets. These figures are the percentage of correctly classified examples when tested over both positive and negative examples.
Table 3.1.5: A list of the average positive rule counts for data sets that have been balanced using downsizing and over-sampling.
Table 3.1.6: A list of the average negative rule counts for data sets that have been balanced using downsizing and over-sampling.
Table 3.3.7: This table gives the accuracies achieved with a single classifier trained on the imbalanced data set (I) and the combination of classifiers on the imbalanced data set (CS).
Table 4.2.8: A list of the top ten categories of the Reuters-21578 test collection and their document count.
Table 4.2.9: Some statistics on the ModApte Split. Note that the documents labeled as unused receive this designation because they have no category assigned to them.
Table 4.6.10: A comparison of breakeven points using a single classifier.
Table 4.6.11: Some of the rules extracted from the decision tree in Figure 4.6.1.
Table 4.7.12: This table compares F-measures of the original data set (original performance) to the reduced data set (performance with loss of examples). Twenty classifiers were combined using Adaptive-Boosting to produce the results.
Table 4.7.13: The bold numbers indicate classifiers that would be excluded from voting. Note that if a threshold of 100 were chosen, no classifiers would be allowed to vote in the system. The category trained for was ACQ.
0#ST !F F#/"RES
Number
Page
Figure ;9;9;& (iscrimination based learning on a two class problem9999999999999999999999999999999999995
Figure& 59;959 A perceptron999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999@
Figure 5959<& A decision tree that classifies whether it is a good day for a dri-e or not9999999;6
Figure 59<9=& A confusion matri49999999999999999999999999999999999999999999999999999999999999999999999999999999999999999;:
Figure 59<9>& A fictitious e4ample of two R!C cur-es9999999999999999999999999999999999999999999999999999999;A
Figure 59=9?& A cost matri4 for a poisonous mushroom application9999999999999999999999999999999999956
Figure 59=9:& This e4ample demonstrates how standard classifiers can be dominated by
negati-e e4amples9 #t is taken from ERiddle, et al9, ;AA=F9 $ote that the /ain function
defined in Section 59;95 would prefer T5 to T;999999999999999999999999999999999999999999999999999999999999999999<6
Figure <9;9@& Four instances of classified data defined o-er the e4pression GE4p9 ;H9999999999=;
Figure <9;9A& A decision tree that correctly classifies instances Gsuch as those in Figure
<9;9;H as satisfying GE4p9 ;H or not9 #f an e4ample to be classified is sorted to a positi-e GIH
leaf in the tree it is gi-en a positi-e class and satisfies e4pression G;H9 9999999999999999999999999999999=5
Figure <9;9;6& A -isual representation of a target concept becoming sparser relati-e to the
number of e4amples99999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999=>
Figure <9;9;;& A-erage error of induced decision trees measured o-er all testing e4amples9
=@
Figure <9;9;5& A-erage error of induced decision trees measured o-er positi-e testing
e4amples99999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999=@
Figure <9;9;<& A-erage error of induced decision trees measured o-er negati-e testing
e4amples9 $ote that the scale is much smaller than those in Figures <9;9< and <9;9=9999999999=@
Figure <9;9;=& Error rates of learning an e4pression of =4> comple4ity as either negati-e
e4amples are being remo-ed, or positi-e e4amples being re3sampled99999999999999999999999999999999>5
Figure <9;9;>& This graph demonstrates that the optimal le-el at which a data set should be
balanced does not always occur at the same point9 To see this, compare this graph with
Figure <9;9?999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999>=
Figure <9;9;?& These graphs demonstrate the competing factors when balancing a data set9
Points that indicate the highest error rate o-er the negati-e e4amples correspond to the
lowest error o-er the positi-e e4amples999999999999999999999999999999999999999999999999999999999999999999999999999999>>
Figure <9;9;:& This graph demonstrates the effecti-eness of balancing data sets by
downsi+ing and o-er3sampling9 $otice that o-er3sampling appears to be more effecti-e
than downsi+ing when the data sets are balanced in terms of numbers9 The only case where
downsi+ing outperforms o-er3sampling in this comparison is when attempting to learn an
e4pression of =4= comple4ity999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999>:
Figure <9;9;@& An e4ample of how C>96 adds rules to create comple4 decision surfaces9 #t
is done by summing the confidence le-el of rules that co-er o-erlapping regions9 A region
co-ered by more than one rule is assigned the class with the highest summed confidence
-ii
le-el of all the rules that co-er it9 1ere we assume Rule ; has a higher confidence le-el
than Rule 59999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999?<
Figure <959;A 1ierarchical structure of the combination scheme9 Two e4perts are
combined9 The o-er3sampling e4pert is made up of classifiers which are trained on data
samples containing re3sampled data of the under represented class9 The downsi+ing e4pert
is made up of classifiers trained on data samples where e4amples of the o-er represented
class ha-e been remo-ed9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999?A
Figure <9<956& Testing the combination scheme on an imbalanced data set with a target
concept comple4ity of =4@9 999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999:=
Figure <9<95;& Testing the combination scheme on an imbalanced data set with a target
concept comple4ity of =4;6999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999:>
Figure <9<955& Testing the combination scheme on an imbalanced data set with a target
concept comple4ity of =4;5999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999:>
Figure =9;95<& Te4t classification -iewed as a collection of binary classifiers9 A document
to be classified by this system can be assigned any of $ categories from $ independently
trained classifiers9 $ote that using a collection of classifiers allows a document to be
assigned more than one category Gin this figure up to $ categoriesH99999999999999999999999999999999999@6
Figure =9595=& A Reuters35;>:@ Article G$ote that the topic for this e4ample is EarnH9999999@5
Figure =9<95>& This is the binary -ector representation of the sentence JThe sky is blue9J as
defined o-er the set of words Kblue, cloud, sky, windL9999999999999999999999999999999999999999999999999999999@>
Figure =9=95?& The dotted line indicates the breake-en point9 #n this figure the point is
interpolated9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999A6
Figure =9=95:& The dotted line indicates the breake-en point9 #n this figure the point is
e4trapolated999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999A;
Figure =9?95@& A decision tree created using C>969 The category trained for was AC.9 The
tree has been pruned to four le-els for conciseness9 99999999999999999999999999999999999999999999999999999999999A?
Figure =9:95A& This graph gi-es a -isual representation of the micro a-eraged results of
Table =9:9;9 !ne can see a significant loss in performance when the number of positi-e
e4amples is reduced for training9 999999999999999999999999999999999999999999999999999999999999999999999999999999999999999AA
Figure =9:9<6& icro a-eraged F;3measure of each e4pert and their combination9 1ere we
are considering precision and recall to be of equal importance99999999999999999999999999999999999999999;6<
Figure =9:9<; icro a-eraged F53measure of each e4pert and their combination9 1ere we
are considering recall to be twice as important as precision9 99999999999999999999999999999999999999999999;6<
Figure =9:9<5 icro a-eraged F69>3measure of each e4pert and their combination9 1ere we
are considering precision to be twice as important as recall9 99999999999999999999999999999999999999999999;6=
Figure =9:9<< Combining the e4perts for precision9 #n this case the e4perts are required to
agree on an e4ample being positi-e in order for it to be classified as positi-e by the system9
The results are reported for the F69> measure Gprecision considered twice as important as
recallH9 ;6>
Figure =9:9<= A comparison of the o-erall results 99999999999999999999999999999999999999999999999999999999999;6?

(alhousie "ni-ersity
Abstract
A C!7#$AT#!$ SC1EE F!R 0EAR$#$/
FR! #7A0A$CE( (ATA SETS
by Andrew Estabrooks
Chairperson of the Super-isory Committee& $athalie *apkowic+
(epartment of Computer Science
This thesis explores inductive learning and its application to imbalanced data sets. Imbalanced data sets occur in two-class domains when one class contains a large number of examples while the other class contains only a few. Learners presented with imbalanced data sets typically produce biased classifiers which have a high predictive accuracy over the over-represented class, but a low predictive accuracy over the under-represented class. As a result, the under-represented class can be largely ignored by an induced classifier. This bias can be attributed to learning algorithms being designed to maximize accuracy over a data set, on the assumption that an induced classifier will encounter unseen data with the same class distribution as the training data. This limits its ability to recognize positive examples.

This thesis investigates the nature of imbalanced data sets and looks at two external methods which can increase a learner's performance on under-represented classes. Both techniques artificially balance the training data: one by randomly re-sampling examples of the under-represented class and adding them to the training set, the other by randomly removing examples of the over-represented class from the training set. Tested on an artificial domain of k-DNF expressions, both techniques are effective at increasing the predictive accuracy on the under-represented class.

A combination scheme is then presented which combines multiple classifiers in an attempt to further increase the performance of standard classifiers on imbalanced data sets. The approach arranges multiple classifiers in a hierarchical structure according to their sampling techniques. The architecture consists of two experts: one boosts performance by combining classifiers that re-sample training data at different rates, the other by combining classifiers that remove data from the training data at different rates. The combination scheme is tested on the real-world application of text classification, which is typically associated with severely imbalanced data sets. Using the F-measure, which combines precision and recall as a performance statistic, the combination scheme is shown to be effective at learning from severely imbalanced data sets. In fact, when compared to a state-of-the-art combination technique, Adaptive-Boosting, the proposed system is shown to be superior for learning on imbalanced data sets.
Ac!no"#$d%$&$nt'
# thank $athalie *apkowic+ for sparking my interest in machine learning and being a great
super-isor9 # am also appreciati-e of my e4aminers for pro-iding many useful comments,
and my parents for their support when # needed it9
A special thanks goes to arianne who made certain that # spent enough time working and
always listening to e-erything # had to say9
Chapter One

1 INTRODUCTION
1 Inductive Learning

Inductive learning is the process of learning, from examples, a set of rules or, more generally, a concept that can be used to generalize to new examples. Inductive learning can be loosely defined for a two-class problem as follows. Let c be any Boolean target concept that is being searched for. Given a learner L and a set of instances X over which c is defined, train L on X to estimate c. The instances X on which L is trained are known as training examples and are made up of ordered pairs <x, c(x)>, where x is a vector of attributes (which have values), and c(x) is the associated classification of the vector x. L's approximation of c is its hypothesis h. In an ideal situation, after training L on X, h equals c, but in reality a learner can only guarantee a hypothesis h that fits the training data. Without any other information, we assume that the hypothesis which fits the target concept on the training data will also fit the target concept on unseen examples. This assumption is typically validated by an evaluation process, such as withholding classified examples from training to test the hypothesis.

The purpose of a learning algorithm is to learn a target concept from training examples and to generalize to new instances. The only information a learner has about c is the value of c over the entire set of training examples. Inductive learners therefore assume that, given enough training data, the hypothesis observed over X will generalize correctly to unseen examples. A visual representation of what is being described is given in Figure 1.1.1.
Figure 1.1.1: Discrimination based learning on a two class problem.
Figure 1.1.1 represents a discrimination task in which each vector <x, c(x)> over the training data X is represented by its class, as being either positive (+) or negative (-). The position of each vector in the box is determined by its attribute values. In this example the data has two attribute values: one plotted on the x-axis, the other on the y-axis. The target concept c is defined by the partitions separating the positive examples from the negative examples. Note that Figure 1.1.1 is a very simple illustration; normally data contains more than two attribute values and would be represented in a higher dimensional space.
2 Class Imbalance

Typically, learners are expected to generalize over unseen instances of any class with equal accuracy. That is, in a two-class domain of positive and negative examples, the learner is expected to perform on an unseen set of examples with equal accuracy on both the positive and negative classes. This of course is the ideal situation. In many applications learners are faced with imbalanced data sets, which can cause the learner to be biased towards one class. This bias is the result of one class being heavily under-represented in the training data compared to the other classes. It can be attributed to two factors that relate to the way in which learners are designed. First, inductive learners are typically designed to minimize errors over the training examples; classes containing few examples can be largely ignored by learning algorithms because the cost of performing well on the over-represented class outweighs the cost of doing poorly on the smaller class. The second factor contributing to the bias is over-fitting. Over-fitting occurs when a learning algorithm creates a hypothesis that performs well over the training data but does not generalize well over unseen data. This can occur on an under-represented class because the learning algorithm creates a hypothesis that can easily fit a small number of examples, but fits them too specifically.
Class imbalances are encountered in many real-world applications. They include the detection of oil spills in radar images [Kubat et al., 1997], telephone fraud detection [Fawcett and Provost, 1997], and text classification [Lewis and Catlett, 1994]. In each case there can be heavy costs associated with a learner being biased towards the over-represented class. Take, for example, telephone fraud detection. By far, most telephone calls made are legitimate. There are, however, a significant number of calls made where a perpetrator fraudulently gains access to the telephone network and places calls billed to the account of a customer. Being able to detect fraudulent telephone calls, so as not to bill the customer, is vital to maintaining customer satisfaction and their confidence in the security of the network. A system designed to detect fraudulent telephone calls should, therefore, not be biased towards the heavily over-represented legitimate phone calls, as too many fraudulent calls may go undetected.¹

Imbalanced data sets have recently received attention in the machine learning community. Common solutions include:

- Introducing weighting schemes that give examples of the under-represented class a higher weight during training [Pazzani et al., 1994].
- Duplicating training examples of the under-represented class [Ling and Li, 1998]. This is in effect re-sampling the examples and will be referred to in this paper as over-sampling (sketched below).
- Removing training examples of the over-represented class [Kubat and Matwin, 1997]. This is referred to as downsizing to reflect that the overall size of the data set is smaller after this balancing technique has taken place.
- Constructing classifiers which create rules to cover only the under-represented class [Kubat, Holte, and Matwin, 1998], [Riddle, Segal, and Etzioni, 1994].
- Almost or completely ignoring one of the two classes, by using a recognition-based inductive scheme instead of a discrimination-based scheme [Japkowicz et al., 1995].

¹ Note that this discussion ignores false positives, where legitimate calls are thought to be fraudulent. This issue is discussed in [Fawcett and Provost, 1997].
3 Motivation

To date, the majority of research in the machine learning community has measured the performance of learning algorithms on data sets that are reasonably balanced. This has led to the design of many algorithms that do not adapt well to imbalanced data sets. When faced with an imbalanced data set, researchers have generally devised methods to deal with the imbalance that are specific to the application at hand. Recently, however, there has been a thrust towards generalizing techniques that deal with data imbalances.

The focus of this thesis is inductive learning on imbalanced data sets. The goal of the work presented is to introduce a combination scheme that uses two of the previously mentioned balancing techniques, downsizing and over-sampling, in an attempt to improve learning on imbalanced data sets. More specifically, I will present a system that combines classifiers in a hierarchical structure according to their sampling technique. This combination scheme will be designed using an artificial domain and tested on the real-world application of text classification. It will be shown that the combination scheme is an effective method of increasing a standard classifier's performance on imbalanced data sets.
- C.)/t$r O($r($"
The remainder of this thesis is broken down into four chapters9 Chapter 5 gi-es background
information and a re-iew of the current literature pertaining to data set imbalance9 Chapter
< is di-ided into se-eral sections9 The first section describes an artificial domain and a set
of e4periments, which lead to the moti-ation behind a general scheme to handle
imbalanced data sets9 The second section describes the architecture behind a system
=
designed to lend itself to domains that ha-e imbalanced data9 The third section tests the
de-eloped system on the artificial domain and presents the results9 Chapter = presents the
real world application of te4t classification and is di-ided into two parts9 The first part
gi-es needed background information and introduces the data set that the system will be
tested on9 The second part presents the results of testing the system on the te4t
classification task and discusses it effecti-eness9 The thesis concludes with Chapter >,
which contains a summary and suggested directions for further research9
Chapter Two

2 BACKGROUND

I will begin this chapter by giving a brief overview of some of the more common learning algorithms and explaining the underlying concepts behind the decision tree learning algorithm C5.0, which will be used for the purposes of this study. There will then be a discussion of various performance measures that are commonly used in machine learning. Following that, I will give an overview of the current literature pertaining to data imbalance.
5 Learners

There are a large number of learning algorithms, which can be divided into a broad range of categories. This section gives a brief overview of the more common learning algorithms.
5.1 Bayesian Learning

Inductive learning centers on finding the best hypothesis h, in a hypothesis space H, given a set of training data D. The best hypothesis is the most probable hypothesis given the data set D and any initial knowledge about the prior probabilities of the various hypotheses in H. Machine learning problems can therefore be viewed as attempting to determine the probabilities of various hypotheses and choosing the hypothesis which has the highest probability given D.

More formally, we define the posterior probability P(h|D) to be the probability of a hypothesis h after seeing a data set D. Bayes theorem (Eq. 1) provides a means to calculate posterior probabilities and is the basis of Bayesian learning.

    P(h|D) = P(D|h) P(h) / P(D)        (Eq. 1)
A simple method of learning based on Bayes theorem is called the naive Bayes classifier. Naive Bayes classifiers operate on data sets where each example x consists of attribute values <a_1, a_2, ..., a_i> and the target function f(x) can take on any value from a pre-defined finite set V = (v_1, v_2, ..., v_j). Classifying unseen examples involves calculating the most probable target value v_max, defined as:

    v_max = argmax_{v_j ∈ V} P(v_j | a_1, a_2, ..., a_i)

Using Bayes theorem (Eq. 1), v_max can be rewritten as:

    v_max = argmax_{v_j ∈ V} P(a_1, a_2, ..., a_i | v_j) P(v_j)

Under the assumption that attribute values are conditionally independent given the target value, the formula used by the naive Bayes classifier is:

    v = argmax_{v_j ∈ V} P(v_j) ∏_i P(a_i | v_j)

where v is the target output of the classifier, and P(a_i|v_j) and P(v_j) can be calculated based on their frequency in the training data.
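As an illustration of the final formula, the following is a minimal naive Bayes sketch over discrete attributes. The frequency counts implement P(v_j) and P(a_i|v_j); the small constant eps is my addition to keep unseen attribute values from zeroing the product, and is not part of the formula above.

    from collections import Counter, defaultdict

    def train_naive_bayes(examples):
        """examples: list of (attributes, label) pairs, attributes being a
        tuple of discrete values. Stores the frequencies behind P(v_j)
        and P(a_i | v_j)."""
        class_counts = Counter(label for _, label in examples)
        attr_counts = defaultdict(Counter)     # (i, label) -> value counts
        for attrs, label in examples:
            for i, a in enumerate(attrs):
                attr_counts[(i, label)][a] += 1
        return class_counts, attr_counts, len(examples)

    def classify(model, x, eps=1e-9):
        """Returns argmax_v P(v) * prod_i P(a_i | v)."""
        class_counts, attr_counts, n = model
        best, best_p = None, -1.0
        for v, cv in class_counts.items():
            p = cv / n                                    # P(v)
            for i, a in enumerate(x):
                p *= (attr_counts[(i, v)][a] + eps) / cv  # P(a_i | v)
            if p > best_p:
                best, best_p = v, p
        return best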
0.* N$ur)# N$t"or!'
$eural $etworks are considered -ery robust learners that perform well on a wide range of
applications such as, optical character recognition E0e Cun et al9, ;A@AF and autonomous
na-igation EPomerleau, ;AA<F9 They are modeled after the human ner-ous system, which is
a collection of neurons that communicate with each other -ia interconnections called
( ) ( )

=
i
j i j
V v
v a P v P v
j
S ma4
ma4
( )9 ,999, , S ma4
5 ; ma4 i j
V v
a a a v P v
j

=
( ) ( )9 S ,999, , ma4
5 ; ma4 j j i
V v
v P v a a a P v
j

=
:
a4ons9 The basic unit of an artificial neural network is the perceptron, which takes as input
a number of -alues and calculates the linear combination of these -alues9 The combined
-alue of the input is then transformed by a threshold unit such as the sigmoid function
5
9
Each input to a perceptron is associated with a weight that determines the contribution of
the input9 0earning for a neural network essentially in-ol-es determining -alues for the
weights9 A pictorial representation of a perceptron is gi-en in Figure 59;9;9

[Figure: a perceptron with inputs x1, x2, ..., xn weighted by w1, w2, ..., wn, plus a bias weight w0, feeding a threshold unit.]

Figure 2.1.2: A perceptron.
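The perceptron's computation is short enough to state directly. This sketch combines the weighted sum of Figure 2.1.2 with the sigmoid squashing function of footnote 2:

    import math

    def perceptron_output(x, w, w0):
        """Weighted linear combination of the inputs (w0 is the bias
        weight), squashed onto (0, 1) by the sigmoid o(y) = 1/(1 + e^-y)."""
        y = w0 + sum(wi * xi for wi, xi in zip(w, x))
        return 1.0 / (1.0 + math.exp(-y))

    print(perceptron_output([1.0, -2.0], [0.5, 0.25], w0=0.1))  # about 0.52

Training a network then amounts to adjusting w0, ..., wn so that outputs like this one match the training classifications.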
5.3 Nearest Neighbor

Nearest neighbor learning algorithms are instance-based learning methods which store examples and classify newly encountered examples by looking at the stored instances considered similar. In the simplest form, all instances correspond to points in an n-dimensional space. An unseen example is classified by choosing the majority class of the closest K examples. An advantage nearest neighbor algorithms have is that they can approximate very complex target functions by making simple local approximations based on the data that is close to the example to be classified. An excellent example of an application which uses a nearest neighbor algorithm is text retrieval, in which documents are represented as vectors and a cosine similarity metric is used to measure the distance of queries to documents.

² The sigmoid function is defined as o(y) = 1 / (1 + e^(-y)) and is referred to as a squashing function because it maps a very wide range of values onto the interval (0, 1).
@
5.4 Decision Trees

Decision trees classify examples according to the values of their attributes. They are constructed by recursively partitioning the training examples, each time on the remaining attribute that has the highest information gain. Attributes become nodes in the constructed tree and their possible values determine the paths of the tree. The process of partitioning the data continues until the data is divided into subsets that contain a single class, or until some stopping condition is met (this corresponds to a leaf in the tree). Typically, decision trees are pruned after construction by merging children of nodes and giving the parent node the majority class. Section 2.2 describes in detail how decision trees, in particular C5.0, operate and are constructed.
6 Decision Tree Learning Algorithms and C5.0

C5.0 is a decision tree learning algorithm that is a later version of the widely used C4.5 algorithm [Quinlan, 1993]. Mitchell [1997] gives an excellent description of the ID3 algorithm [Quinlan, 1986], which exemplifies its successors C4.5 and C5.0. The following section consists of two parts. The first part is a brief summary of Mitchell's description of the ID3 algorithm and the extensions leading to typical decision tree learners. A brief operational overview of C5.0 is then given as it relates to this work.

Before I begin the discussion of decision tree algorithms, it should be noted that a decision tree is not the only learning algorithm that could have been used in this study. As described in Chapter 1, there are many different learning algorithms. For the purposes of this study a decision tree algorithm was chosen for three reasons. The first is the understandability of the classifier created by the learner: by looking at the complexity of a decision tree in terms of the number and size of extracted rules, we can describe the behavior of the learner. Choosing a learner such as naive Bayes, which classifies examples based on probabilities, would make an analysis of this type nearly impossible. The second reason a decision tree learner was chosen was its computational speed. Although not as cheap to run as naive Bayes, decision tree learners have significantly shorter training times than neural networks. Finally, a decision tree was chosen because it operates well on tasks that classify examples into a discrete number of classes. This lends itself well to the real-world application of text classification, the domain on which the combination scheme designed in Chapter 3 will be tested.
6.1 Decision Trees and the ID3 Algorithm

Decision trees classify examples by sorting them based on attribute values. Each node in a decision tree represents an attribute in an example to be classified, and each branch in a decision tree represents a value that the node can take. Examples are classified starting at the root node and sorted based on their attribute values. Figure 2.2.3 is an example of a decision tree that could be used to classify whether it is a good day for a drive or not.
[Figure: a decision tree. The root node Road Conditions branches on Clear, Snow Covered, and Icy; under Clear, a Forecast node branches on Clear, Rain, and Snow, leading to a Temperature node (Warm, Freezing) and an Accumulation node (Heavy, Light); the leaves are YES or NO.]

Figure 2.2.3: A decision tree that classifies whether it is a good day for a drive or not.
"sing the decision tree depicted in Figure 5959; as an e4ample, the instance
PRoad Conditions T Clear, Forecast T Rain, Temperature T Barm, Accumulation T
1ea-yQ
;6
would sort to the nodes& Road Conditions, Forecast, and finally Temperature, which would
classify the instance as being positi-e GyesH, that is, it is a good day to dri-e9 Con-ersely an
instance containing the attribute Road Conditions assigned Snow Co-ered would be
classified as not a good day to dri-e no matter what the Forecast, Temperature, or
Accumulation are9
Decision trees are constructed using a top-down greedy search algorithm which recursively subdivides the training data based on the attribute that best classifies the training examples. The basic algorithm ID3 begins by dividing the data according to the value of the attribute that is most useful in classifying the data. The attribute that best divides the training data becomes the root node of the tree. The algorithm is then repeated on each partition of the divided data, creating subtrees until the training data is divided into subsets of the same class. At each level in the partitioning process, a statistical property known as information gain is used to determine which attribute best divides the training examples.
6.2 Information Gain and the Entropy Measure

Information gain is used to determine how well an attribute separates the training data according to the target concept. It is based on a measure commonly used in information theory known as entropy. Defined over a collection of training data S with a Boolean target concept, the entropy of S is:

    Entropy(S) = -p(+) log2 p(+) - p(-) log2 p(-)

where p(+) is the proportion of positive examples in S and p(-) the proportion of negative examples. The function of the entropy measure is easily described with an example. Assume that there is a set of data S containing ten examples. Seven of the examples have a positive class and three of the examples have a negative class [7+, 3-]. The entropy measure for this data set S would be calculated as:

    Entropy(S) = -(7/10) log2 (7/10) - (3/10) log2 (3/10) = 0.360 + 0.521 = 0.881
Note that if the number of positive and negative examples in the set were even (p(+) = p(-) = 0.5), then the entropy function would equal 1. If all the examples in the set were of the same class, then the entropy of the set would be 0. If the set being measured contains an unequal number of positive and negative examples, then the entropy measure will be between 0 and 1.

Entropy can be interpreted as the minimum number of bits needed to encode the classification of an arbitrary member of S. Consider two people passing messages back and forth that are either positive or negative. If the receiver knows that the message being sent is always going to be positive, then no message needs to be sent: there needs to be no encoding, and no bits are sent. If, on the other hand, half the messages are negative, then one bit needs to be used to indicate whether the message being sent is positive or negative. For cases where there are more examples of one class than the other, on average less than one bit needs to be sent, by assigning shorter codes to more likely collections of examples and longer codes to less likely collections of examples. In a case where p(+) = 0.9, shorter codes could be assigned to collections of positive messages being sent, with longer codes being assigned to collections of negative messages being sent.
Information gain is the expected reduction in entropy when partitioning the examples of a set S according to an attribute A. It is defined as:

    Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

where Values(A) is the set of all possible values for an attribute A, and S_v is the subset of examples in S which have the value v for attribute A. On a Boolean data set having only positive and negative examples, Values(A) would be defined over [+, -]. The first term in the equation is the entropy of the original data set. The second term describes the entropy of the data set after it is partitioned using the attribute A: it is nothing more than a sum of the entropies of each subset S_v, weighted by the fraction of examples that belong to the subset. The following is an example of how Gain(S, A) would be calculated on a fictitious data set. Given a data set S with ten examples (7 positive and 3 negative), each containing an attribute Temperature, Gain(S, A) where A = Temperature and Values(Temperature) = {Warm, Freezing} would be calculated as follows:

    S = [7+, 3-]
    S_Warm = [3+, 1-]
    S_Freezing = [4+, 2-]

    Gain(S, Temperature) = 0.881 - (4/10) Entropy(S_Warm) - (6/10) Entropy(S_Freezing)
                         = 0.881 - (4/10)(0.811) - (6/10)(0.918)
                         = 0.006

Information gain is the measure used by ID3 to select the best attribute at each step in the creation of a decision tree. Using this method, ID3 searches a hypothesis space for a hypothesis that fits the training data. In its search, shorter decision trees are preferred over longer decision trees because the algorithm places nodes with a higher information gain near the top of the tree. In its purest form, ID3 performs no backtracking, which can result in a solution that is only locally optimal; such a locally optimal solution may overfit the training data.

6.3 Overfitting and Decision Trees

Overfitting is not a problem that is inherent to decision tree learners alone. It can occur with any learning algorithm that encounters noisy data or data in which one class, or both classes, are under-represented. A decision tree, or any learned hypothesis h, is said to overfit the training data if there exists another hypothesis h' that has a larger error than h when tested on the training data, but a smaller error than h when tested on the entire data set. At this point the discussion of overfitting will focus on the extensions of ID3 that are used by decision tree algorithms such as C4.5 and C5.0 in an attempt to avoid overfitting the data.
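Both calculations above can be checked in a few lines; the entropy and gain functions below reproduce the 0.881 and 0.006 figures from the worked example.

    import math

    def entropy(pos, neg):
        """Entropy of a Boolean-labeled set with pos positive and neg
        negative examples; 0 * log(0) is taken to be 0."""
        total = pos + neg
        e = 0.0
        for k in (pos, neg):
            if k:
                p = k / total
                e -= p * math.log2(p)
        return e

    def gain(parent, partitions):
        """parent: (pos, neg) of S; partitions: (pos, neg) of each S_v."""
        n = sum(p + q for p, q in partitions)
        remainder = sum((p + q) / n * entropy(p, q) for p, q in partitions)
        return entropy(*parent) - remainder

    print(round(entropy(7, 3), 3))                   # 0.881
    print(round(gain((7, 3), [(3, 1), (4, 2)]), 3))  # 0.006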
There are two common approaches that decision tree induction algorithms can use to avoid overfitting the training data:

- Stop the training algorithm before it reaches a point at which it perfectly fits the training data, or
- Prune the induced decision tree.

The most commonly used is the latter approach [Mitchell, 1997]. Decision tree learners normally employ post-pruning techniques that evaluate the performance of decision trees as they are pruned, using a validation set of examples that are not used during training. The goal of pruning is to improve the learner's accuracy on the validation set of data.

In its simplest form, post-pruning operates by considering each node in the decision tree as a candidate for pruning. Any node can be removed and assigned the most common class of the training examples that are sorted to the node in question. A node is pruned if removing it does not make the decision tree perform any worse on the validation set than before the node was removed. By using a validation set of examples, it is hoped that regularities found only in the data used for training do not occur in the validation set; in this way, pruning nodes created on regularities occurring in the training data will not hurt the performance of the decision tree over the validation set.
Pruning techniques do not always use additional data. An example is the following pruning technique used by C4.5.

C4.5 begins pruning by taking the decision tree to be pruned and converting it into a set of rules: one for each path from the root node to a leaf. Each rule is then generalized by removing any of its conditions whose removal improves the estimated accuracy of the rule. The rules are then sorted by this estimated accuracy and are considered in the sorted sequence when classifying newly encountered examples. The estimated accuracy of each rule is calculated on the training data used to create the classifier (i.e., it is a measure of how well the rule classifies the training examples). The estimate is a pessimistic one, calculated by taking the accuracy of the rule over the training examples it covers and then calculating the standard deviation assuming a binomial distribution. For a given confidence level, the lower-bound estimate is taken as the measure of the rule's performance. A more detailed discussion of C4.5's pruning technique can be found in [Quinlan, 1993].
6.4 C5.0 Options

This section contains a description of some of the capabilities of C5.0. C5.0 was extensively used in this study to create rule sets for classifying examples on two domains: an artificial domain of k-DNF (Disjunctive Normal Form) expressions, and a real-world domain of text classification. The following has been adapted from [Quinlan, 2000].
Adaptive Boosting

C5.0 offers adaptive boosting [Schapire and Freund, 1997]. The general idea behind adaptive boosting is to generate several classifiers on the training data. When an unseen example is to be classified, the predicted class of the example is a weighted count of votes from the individually trained classifiers. C5.0 creates a number of classifiers by first constructing a single classifier. A second classifier is then constructed by re-training on the examples used to create the first classifier, but paying more attention to the cases in the training set which the first classifier classified incorrectly. As a result, the second classifier is generally different from the first. The basic algorithm behind Quinlan's implementation of adaptive boosting is as follows:

- Choose K examples from the training set of N examples, each example being assigned a probability of 1/N of being chosen to train a classifier.
- Classify the chosen examples with the trained classifier.
- Re-weight the examples by multiplying the probability of the misclassified examples by a weight B.
- Repeat the previous three steps X times with the generated probabilities.
- Combine the X classifiers, giving a weight log(B) to each trained classifier.

Adaptive boosting can be invoked by C5.0 and the number of classifiers to generate specified.
Pruning Options

C5.0 constructs decision trees in two phases. First it constructs a classifier that fits the training data, and then it prunes the classifier to avoid over-fitting the data. Two options can be used to affect the way in which the tree is pruned.

The first option specifies the degree to which the tree can initially fit the training data. It specifies the minimum number of training examples that must follow at least two of the branches at any node in the decision tree. This is a method of avoiding over-fitting by stopping the training algorithm before it over-fits the data.

The second pruning option affects the severity with which the algorithm will post-prune constructed decision trees and rule sets. Pruning is performed by removing parts of the constructed decision trees or rule sets that have a high predicted error rate on new examples.
Rule Sets

C5.0 can also convert decision trees into rule sets. For the purposes of this study, rule sets were generated using C5.0 because rule sets are easier to understand than decision trees and can easily be described in terms of complexity: a rule set can be characterized by the average size of its rules and the number of rules in the set.

The preceding description of C5.0's operation is by no means complete. It is merely an attempt to provide the reader with enough information to understand the options that were primarily used in this study. C5.0 has many other options that can be used to affect its operation, including options to invoke k-fold cross-validation, enable differential misclassification costs, and speed up training times by randomly sampling from large data sets.
7 Performance Measures

Evaluating a classifier's performance is a very important aspect of machine learning. Without an evaluation method it is impossible to compare learners, or even know whether or not a hypothesis should be used. For example, when learning to classify mushrooms as poisonous or not, one would want to be able to measure the accuracy of a learned hypothesis in this domain very precisely. The following section introduces the confusion matrix, which identifies the types of errors a classifier makes, as well as two more sophisticated evaluation methods: the g-mean, which combines the performance of a classifier over two classes, and ROC curves, which provide a visual representation of a classifier's performance.
7.1 Confusion Matrix

A classifier's performance is commonly broken down into what is known as a confusion matrix. A confusion matrix shows the types of classification errors a classifier makes. Figure 2.3.4 represents a confusion matrix.

                    Hypothesis
                    +      -
    Actual    +     a      b
    class     -     c      d

Figure 2.3.4: A confusion matrix.

The breakdown of a confusion matrix is as follows:

- a is the number of positive examples correctly classified.
- b is the number of positive examples misclassified as negative.
- c is the number of negative examples misclassified as positive.
- d is the number of negative examples correctly classified.

Accuracy (denoted acc) is most commonly defined over all the classification errors that are made and is therefore calculated as:

    acc = (a + d) / (a + b + c + d)
A classifier's performance can also be calculated separately over the positive examples (denoted a+) and over the negative examples (denoted a-). Each is calculated as:

    a+ = a / (a + b)
    a- = d / (c + d)

7.2 g-Mean

Kubat, Holte, and Matwin [1998] use the geometric mean of the accuracies measured separately on each class:

    g = sqrt(a+ × a-)

The basic idea behind this measure is to maximize the accuracy on both classes. In this study the geometric mean will be used as a check on how balanced the combination scheme is. For example, if we consider an imbalanced data set that has 240 positive examples and 6000 negative examples and stubbornly classify each example as negative, we could see, as in many imbalanced domains, a very high accuracy (acc = 96%). Using the geometric mean, however, would quickly show that this line of thinking is flawed: it would be calculated as sqrt(0 × 1) = 0.
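All four quantities are one-liners over the confusion matrix entries a, b, c, d, and the 240/6000 example above can be reproduced directly:

    import math

    def metrics(a, b, c, d):
        """a: true positives, b: false negatives, c: false positives,
        d: true negatives (the confusion matrix of Figure 2.3.4)."""
        acc = (a + d) / (a + b + c + d)
        a_pos = a / (a + b)      # accuracy over the positive examples
        a_neg = d / (c + d)      # accuracy over the negative examples
        g = math.sqrt(a_pos * a_neg)
        return acc, a_pos, a_neg, g

    # 240 positives, 6000 negatives, everything classified negative:
    print(metrics(0, 240, 0, 6000))   # acc is about 0.96, but g = 0.0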
7.3 ROC Curves

ROC (Receiver Operating Characteristic) curves provide a visual representation of the trade-off between true positives and false positives. They are plots of the percentage of correctly classified positive examples (a+) with respect to the percentage of incorrectly classified negative examples (the false positive rate, 1 - a-).
[Figure: two ROC curves, Series 1 and Series 2, plotted as True Positive (%) against False Positive (%), with both axes running from 0 to 100.]

Figure 2.3.5: A fictitious example of two ROC curves.
The point (0, 0) along a curve represents a classifier that by default classifies all examples as negative, whereas the point (0, 100) represents a classifier that correctly classifies all examples.

Many learning algorithms allow induced classifiers to move along the curve by varying their learning parameters. For example, decision tree learning algorithms provide options allowing induced classifiers to move along the curve by way of pruning parameters (pruning options for C5.0 are discussed in Section 2.2.4). Swets [1988] proposes that classifiers' performances can be compared by calculating the area under the curves generated by the algorithms on identical data sets. In Figure 2.3.5 the learner associated with Series 1 would be considered superior to the algorithm that generated Series 2.

This section has deliberately ignored performance measures derived by the information retrieval community. They will be discussed in Chapter 4.
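Swets' comparison amounts to a numeric integration of each curve. A minimal sketch using the trapezoid rule over (false positive, true positive) points follows, with made-up series standing in for those of Figure 2.3.5:

    def area_under_curve(points):
        """points: (false_positive_rate, true_positive_rate) pairs in
        [0, 1], e.g. produced by sweeping a pruning or threshold
        parameter across its range."""
        pts = sorted(points)
        return sum((x2 - x1) * (y1 + y2) / 2.0
                   for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

    series1 = [(0.0, 0.0), (0.1, 0.7), (0.4, 0.95), (1.0, 1.0)]
    series2 = [(0.0, 0.0), (0.2, 0.5), (0.6, 0.8), (1.0, 1.0)]
    assert area_under_curve(series1) > area_under_curve(series2)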
8 A R$($" o4 Curr$nt Lt$r)tur$
This section re-iews the current literature pertaining to data imbalance9 The papers
re-iewed ha-e been placed into four categories according to the approach taken by the
authors to tackle imbalanced data sets9 The first category, misclassification costs, re-iews
techniques that assign misclassification costs to training e4amples9 The second category,
;A
sampling techniques, discusses data set balancing techniques that sample training
e4amples, both in nai-e and intelligent fashions9 The third category, classifiers that co-er
one class, describes learning algorithms that create rules to co-er only one class9 The last
category, recognition based learning, discusses a learning method that ignores or makes
little use of one class all together9
8.1 Misclassification Costs

Typically a classifier's performance is evaluated using the proportion of examples that are incorrectly classified. Pazzani, Merz, Murphy, Ali, Hume, and Brunk [1994] look at the errors made by a classifier in terms of their cost. For example, take an application such as the detection of poisonous mushrooms. The cost of misclassifying a poisonous mushroom as being safe to eat may have serious consequences and should therefore be assigned a high cost; conversely, misclassifying a mushroom that is safe to eat may have no serious consequences and should be assigned a low cost. Pazzani et al. [1994] use algorithms that attempt to solve the problem of imbalanced data sets by introducing a cost matrix. The algorithm that is of interest here is called Reduced Cost Ordering (RCO), which attempts to order a decision list (set of rules) so as to minimize the cost of making incorrect classifications.

RCO is a post-processing algorithm that can complement any rule learner such as C4.5. It essentially orders a set of rules to minimize misclassification costs. The algorithm works as follows.

The algorithm takes as input a set of rules (rule list), a cost matrix, and a set of examples (example list), and returns an ordered set of rules (decision list). An example of a cost matrix (for the mushroom application) is depicted in Figure 2.4.6.

                       Hypothesis
                    Safe    Poisonous
    Actual  Safe      0         1
    class   Poisonous 10        0

Figure 2.4.6: A cost matrix for a poisonous mushroom application.
Note that the costs in the matrix are the costs associated with the prediction in light of the actual class.

The algorithm begins by initializing the decision list to the default class which yields the least expected cost if all examples were tagged as being that class. It then attempts to iteratively replace the default class with a new rule / default class pair, by choosing a rule from the rule list that covers as many examples as possible and a default class which minimizes the cost of the examples not covered by the chosen rule. Note that when an example in the example list is covered by a chosen rule, it is removed. The process continues until no new rule / default class pair can be found to replace the default class in the decision list (i.e., the default class minimizes cost over the remaining examples).
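The following sketch follows that description, with rules represented as (covers, predicted_class) pairs and a cost matrix indexed as cost[actual][predicted], as in Figure 2.4.6. The data structures and the simplified stopping test are mine, so this should be read as an outline of RCO rather than Pazzani et al.'s exact algorithm.

    def expected_cost(c, examples, cost):
        """Cost of labeling every remaining example as class c."""
        return sum(cost[label][c] for _, label in examples)

    def reduced_cost_ordering(rules, classes, examples, cost):
        """rules: list of (covers, predicted_class) pairs, covers(x) -> bool.
        Greedily builds an ordered decision list ending in a default class."""
        rules = list(rules)                  # do not mutate the caller's list
        decision_list, remaining = [], list(examples)
        while rules and remaining:
            def coverage(rule):
                return sum(1 for x, _ in remaining if rule[0](x))
            best = max(rules, key=coverage)
            if coverage(best) == 0:          # no rule covers anything further
                break
            decision_list.append(best)
            remaining = [(x, y) for x, y in remaining if not best[0](x)]
            rules.remove(best)
        default = min(classes, key=lambda c: expected_cost(c, remaining, cost))
        return decision_list, default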
An algorithm such as the one described above can be used to tackle imbalanced data sets by way of assigning high misclassification costs to the underrepresented class. Decision lists can then be biased, or ordered, to classify examples as the underrepresented class, since misclassifying examples of that class would incur the greatest expected cost.

Incorporating costs into decision tree algorithms can be done by replacing the information gain metric with a new measure that bases partitions not on information gain, but on the cost of misclassification. This was studied by Pazzani et al. [1994] by modifying ID3 to use a metric that chooses partitions that minimize misclassification cost. The results of their experimentation indicate that their greedy test selection method, attempting to minimize cost, did not perform as well as using an information gain heuristic. They attribute this to the fact that their selection technique attempts solely to fit the training data and not to minimize the complexity of the learned concept.
A more viable alternative to incorporating misclassification costs into the creation of a decision tree is to modify pruning techniques. Typically, decision trees are pruned by merging leaves of the tree to classify examples as the majority class. In effect, this is calculating the probability that an example belongs to a given class by looking at the training examples that have filtered down to the leaves being merged. By assigning the majority class to the node of the merged leaves, decision trees are assigning the class with the lowest expected error. Given a cost matrix, pruning can be modified to assign the class that has the lowest expected cost instead of the lowest expected error. Pazzani et al. [1994] state that cost pruning techniques have an advantage over replacing the information gain heuristic with a minimal cost heuristic, in that a change in the cost matrix does not affect the learned concept description. This allows different cost matrices to be used for different examples.
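As a toy illustration of the leaf labeling step in cost based pruning (the counts below are invented for the mushroom example, and the function is a sketch rather than Pazzani et al.'s code):

def leaf_label(class_counts, cost):
    # class_counts: training examples filtered down to the merged leaves.
    # cost[actual][predicted] comes from a cost matrix such as Figure 2.4.1.
    total = sum(class_counts.values())
    def exp_cost(predicted):
        return sum(n / total * cost[actual][predicted]
                   for actual, n in class_counts.items())
    return min(cost, key=exp_cost)

cost = {'safe':      {'safe': 0,  'poisonous': 1},
        'poisonous': {'safe': 10, 'poisonous': 0}}
# Error based pruning would label this leaf 'safe' (the majority class);
# cost based pruning labels it 'poisonous', since 5/45 * 10 > 40/45 * 1.
print(leaf_label({'safe': 40, 'poisonous': 5}, cost))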
8.2 Sampling Techniques
8.2.1 Heterogeneous Uncertainty Sampling
Lewis and Catlett [1994] describe a heterogeneous [3] approach to selecting training examples from a large data set by using uncertainty sampling. The algorithm they use operates under an information filtering paradigm; uncertainty sampling is used to select training examples to be presented to an expert. It can be simply described as a process where a 'cheap' classifier chooses, from a large pool, a subset of training examples for which it is unsure of the class and presents them to an expert to be classified. The classified examples are then used to help the cheap classifier choose more examples for which it is uncertain. The examples that the classifier is unsure of are used to create a more expensive classifier.

[3] Their method is considered heterogeneous because a classifier of one type chooses examples to present to a classifier of another type.

The uncertainty sampling algorithm used is an iterative process in which an inexpensive probabilistic classifier is initially trained on three randomly chosen positive examples from the training data. The classifier is based on an estimate of the probability that an instance belongs to a class C:
P(C \mid w) = \frac{\exp\left( a + b \sum_{i=1}^{d} \log \frac{P(w_i \mid C)}{P(w_i \mid \overline{C})} \right)}{1 + \exp\left( a + b \sum_{i=1}^{d} \log \frac{P(w_i \mid C)}{P(w_i \mid \overline{C})} \right)}
where C indicates class membership and w_i is the ith attribute of the d attributes in example w; a and b are calculated using logistic regression. This model is described in detail in [Lewis and Hayes, 1994]. All we are concerned with here is that the classifier returns a number P between 0 and 1 indicating its confidence in whether or not an unseen example belongs to a class. The threshold chosen to indicate a positive instance is 0.5. If the classifier returns a P higher than 0.5 for an unknown example, it is considered to belong to the class C. The classifier's confidence in its prediction is proportional to the distance its prediction is away from the threshold. For example, the classifier is less confident in a P of 0.6 belonging to C than it is in a P of 0.9.
At each iteration of the sampling loop, the probabilistic classifier chooses four examples from the training set: the two which are closest to and below the threshold, and the two which are closest to and above the threshold. The examples that are closest to the threshold are those whose class it is least sure of. The classifier is then retrained at each iteration of the uncertainty sampling and reapplied to the training data to select four more instances that it is unsure of. Note that after the four examples are chosen at each loop, their class is known for retraining purposes (this is analogous to having an expert label examples).
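A schematic of this sampling loop is given below. The classifier interface (fit and predict_proba) is an assumed stand-in for the inexpensive probabilistic classifier, and the labels list plays the role of the expert who supplies a class on request.

def uncertainty_sample(cheap, pool, labels, seeds, iterations):
    chosen = list(seeds)                     # three labeled positive seeds
    for _ in range(iterations):
        cheap.fit([pool[i] for i in chosen], [labels[i] for i in chosen])
        rest = [i for i in range(len(pool)) if i not in set(chosen)]
        p = {i: cheap.predict_proba(pool[i]) for i in rest}
        below = sorted((i for i in rest if p[i] < 0.5), key=lambda i: 0.5 - p[i])
        above = sorted((i for i in rest if p[i] >= 0.5), key=lambda i: p[i] - 0.5)
        # The two examples closest to the threshold on each side; looking up
        # labels[i] for them corresponds to querying the expert.
        chosen += below[:2] + above[:2]
    return chosen          # the pool used to train the expensive classifier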
The training set presented to the expensive classifier can essentially be described as a pool of examples that the probabilistic classifier is unsure of. The pool of examples, chosen using a threshold, will be biased towards having too many positive examples if the training data set is imbalanced. This is because the examples are chosen from a window that is centered over the borderline where the positive and negative examples meet. To correct for this, the classifier chosen to train on the pool of examples, C4.5, was modified to include a loss ratio parameter, which allows pruning to be based on expected loss instead of expected error (this is analogous to cost pruning, Section 8.1). The default rule for the classifier was also modified to be chosen based on expected loss instead of expected error.

Lewis and Catlett [1994] show, by testing their sampling technique on a text classification task, that uncertainty sampling reduces the number of training examples required by an expensive learner such as C4.5 by a factor of 10. They did this by comparing the results of decision trees induced on uncertainty samples from a large pool of training examples with those induced on pools of examples that were randomly selected, but ten times larger.
8.2.2 One Sided Intelligent Selection
Kubat and Matwin [1997] propose an intelligent one sided sampling technique that reduces the number of negative examples in an imbalanced data set. The underlying concept in their algorithm is that positive examples are considered rare and must all be kept. This is in contrast to Lewis and Catlett's technique, in that uncertainty sampling does not guarantee that a large number of positive examples will be kept. Kubat and Matwin [1997] balance data sets by removing negative examples. They categorize negative examples as belonging to one of four groups:

- Those that suffer from class label noise;
- Borderline examples (examples which are close to the boundaries of positive examples);
- Redundant examples (their part can be taken over by other examples); and
- Safe examples that are considered suitable for learning.

In their selection technique all negative examples, except those which are safe, are considered to be harmful to learning and thus have the potential of being removed from the training set. Redundant examples do not directly harm correct classification, but increase classification costs. Borderline negative examples can cause learning algorithms to overfit positive examples.
Kubat and Matwin's [1997] selection technique begins by first removing redundant examples from the training set. To do this a subset C of the training examples, S, is created by taking every positive example from S and randomly choosing one negative example. The remaining examples in S are then classified using the 1-Nearest Neighbor (1-NN) rule with C. Any misclassified example is added to C. Note that this technique does not make the smallest C possible, it just shrinks S. After redundant examples are removed, examples considered borderline or class noisy are removed.

Borderline, or class noisy, examples are detected using the concept of Tomek links [Tomek, 1976], which are defined by the distance between examples with different class labels. Take for instance two examples x and y with different classes. The pair (x, y) is considered to be a Tomek link if there exists no example z such that d(x, z) < d(x, y) or d(y, z) < d(y, x), where d(a, b) is defined as the distance between example a and example b. Examples are considered borderline or class noisy if they participate in a Tomek link.
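A sketch of both steps follows, assuming numeric attributes and Euclidean distance (the distance measure is not specified above, so that choice is an assumption):

import numpy as np

def tomek_links(X, y):
    """Indices of examples that participate in a Tomek link."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    links = set()
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if y[i] != y[j] and not (np.any(d[i] < d[i, j]) or
                                     np.any(d[j] < d[i, j])):
                links.update((i, j))
    return links

def one_sided_selection(X, y):               # y: 1 = positive, -1 = negative
    neg = np.where(y == -1)[0]
    keep = list(np.where(y == 1)[0]) + [int(np.random.choice(neg))]
    for i in range(len(X)):                   # 1-NN condensation step
        if i in keep:
            continue
        nn = min(keep, key=lambda k: np.linalg.norm(X[i] - X[k]))
        if y[nn] != y[i]:
            keep.append(i)                    # misclassified, so add to C
    links = tomek_links(X[keep], y[keep])     # drop borderline/noisy negatives
    return [k for pos, k in enumerate(keep) if y[k] == 1 or pos not in links]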
Kubat and Matwin's selection technique was shown to be successful in improving performance, as measured by the g-mean, on two of three benchmark domains: vehicles (veh1), glass (g7), and vowels (vwo). The domain in which no improvement was seen, g7, was examined, and it was found that in that particular domain the original data set did not produce disproportionate values for g+ and g−.
8.2.3 Naive Sampling Techniques
The previously described selection algorithms balance data sets by significantly reducing the number of training examples. Both are intelligent methods that filter out examples using uncertainty sampling, or by removing examples that are considered harmful to learning. Ling and Li [1998] approach the problem of data imbalance using methods that naively downsize or over-sample data sets, classifying examples with a confidence measurement. The domain of interest is data mining for direct marketing. Data sets in this field are typically two class problems and are severely imbalanced, containing only a few examples of people who have bought the product and many examples of people who have not. The three data sets studied by Ling and Li [1998] are a bank data set from a loan product promotion (Bank), an RRSP campaign from a life insurance company (Life Insurance), and a bonus point program where customers accumulate points to redeem for merchandise (Bonus). As will be explained later, all three of the data sets are imbalanced.

Direct marketing is used by the consumer industry to target customers who are likely to buy products. Typically, if mass marketing is used to promote products (e.g., including flyers in a newspaper with a large distribution), the response rate (the percent of people who buy a product after being exposed to the promotion) is very low and the cost of mass marketing very high. For the three data sets studied by Ling and Li the response rates were 1.2% of 90,000 responding in the Bank data set, 7% of 80,000 responding in the Life Insurance data set, and 1.2% of 104,000 for the Bonus Program.
Data mining here can be viewed as a two class domain: given a set of customers and their characteristics, determine a set of rules that can accurately predict a customer as being a buyer or a non-buyer, advertising only to buyers. Ling and Li [1998], however, state that a binary classification is not very useful for direct marketing. For example, a company may have a database of customers to which it wants to advertise the sale of a new product to the 30% of customers who are most likely to buy it. Using a binary classifier to predict buyers may only classify 5% of the customers in the database as responders. Ling and Li [1998] avoid this limitation of binary classification by requiring that the classifiers being used give their classifications a confidence level. The confidence level is required to be able to rank classified responders.

The two classifiers used for the data mining were Naïve Bayes, which produces a probability to rank the testing examples, and a modified version of C4.5. The modification made to C4.5 allows the algorithm to give a certainty factor to a classification. The certainty factor is created during training and given to each leaf of the decision tree. It is simply the ratio of the number of examples of the majority class over the total number of examples sorted to the leaf. An example classified by the decision tree now not only receives the classification of the leaf it sorts to, but also the certainty factor of the leaf.

C4.5 and Naïve Bayes were not applied directly to the data sets. Instead, a modified version of Adaptive-Boosting (see Section 6.4) was used to create multiple classifiers. The modification made to the Adaptive-Boosting algorithm was one in which the sampling probability is not calculated from a binary classification, but from a difference in the probability of the prediction. Essentially, examples that are classified incorrectly with a higher certainty weight are given higher sampling probability in the training of the next classifier.
The evaluation method used by Ling and Li [1998] is known as the lift index. This index has been widely used in database marketing. The motivation behind using the lift index is that it reflects the re-distribution of testing examples after a learner has ranked them. For example, in this domain the learning algorithms rank examples in order of the most likely to respond to the least likely to respond. Ling and Li [1998] divide the ranked list into 10 deciles. When evaluating the ranked list, regularities should be found in the distribution of the responders (i.e., there should be a high percentage of the responders in the first few deciles). Table 2.4.1 is a reproduction of the example that Ling and Li [1998] present to demonstrate this.

    Lift Table
    10%   10%   10%   10%   10%   10%   10%   10%   10%   10%
    410   190   130    76    42    32    35    30    29    26

Table 2.4.1: An example lift table taken from [Ling and Li, 1998].

Typically, results are reported for the top 10% decile and for the sum of the first four deciles. In Ling and Li's [1998] example, reporting for the first four deciles would be 410 + 190 + 130 + 76 = 806 (or 806 / 1000 = 80.6%).
Instead of using the top 10% decile and the top four deciles to report results, Ling and Li [1998] use the formula

S_{lift} = 1.0\,S_1 + 0.9\,S_2 + 0.8\,S_3 + \cdots + 0.1\,S_{10}

where S_1 through S_10 are the deciles from the lift table, expressed as fractions of the total number of responders.

Using ten deciles to calculate the lift index in this way results in an S_lift index of 55% if the respondents are randomly distributed throughout the table. A situation where all respondents are in the first decile results in an S_lift index of 100%.
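The lift index is straightforward to compute. The sketch below normalizes raw decile counts into fractions, so it reproduces the weighting above; for the counts of Table 2.4.1 it gives roughly 81%.

def s_lift(deciles):
    """deciles: counts of responders in each of the ten deciles."""
    total = sum(deciles)
    weights = [1.0 - 0.1 * i for i in range(10)]    # 1.0, 0.9, ..., 0.1
    return sum(w * s for w, s in zip(weights, deciles)) / total

print(s_lift([410, 190, 130, 76, 42, 32, 35, 30, 29, 26]))  # ~0.81
print(s_lift([100] * 10))        # 0.55: randomly distributed responders
print(s_lift([1000] + [0] * 9))  # 1.0: all responders in the first decile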
"sing their lift inde4 as the sole measure of performance, 0ing and 0i E;AA@F report results
for o-er3sampling and downsi+ing on the three data sets of interest G7ank, 0ife #nsurance,
and 7onusH9
0ing and 0i E;AA@F report results that show the best lift inde4 is obtained when the ratio of
positi-e and negati-e e4amples in the training data is equal9 "sing 7oosted3$aW-e 7ayes
with a downsi+ed data set resulted in a lift inde4 of :69>U for 7ank, :>95U for 0ife
#nsurance, and @;9<U for 7onus9 These results compared to S
0ift
inde4es of ?A9;U for
7ank, :>9=U for 0ife #nsurance, and @69=U for the 7onus program when the data sets were
imbalanced at a ratio of ; positi-e e4ample to e-ery @ negati-e e4amples9 1owe-er, using
7oosted37ayes with o-er3sampling did not show any significant impro-ement o-er the
imbalanced data set9 0ing and 0i E;AA@F state that one method to o-ercome this limitation
may be to retain all the negati-e e4amples in the data set and re3sample the positi-e
e4amples
=
9
Bhen tested using their boosted -ersion of C=9>, o-er3sampling saw a performance gain as
the positi-e e4amples were re3sampled at higher rates9 Bith a positi-e sampling rate of
564, 7ank saw an increase of 59AU Gfrom ?>9?U to ?@9>UH, 0ife #nsurance an increase of
59AU Gfrom :=9<U to :?95UH and the 7onus Program and increase of =9?U Gfrom :=9<U to
:@9AUH9
The different effects of over-sampling and downsizing reported by Ling and Li [1998] were systematically studied in [Japkowicz, 2000], which broadly divides balancing techniques into three categories: methods in which the small class is over-sampled to match the size of the larger class; methods by which the large class is downsized to match the smaller class; and methods that completely ignore one of the two classes. The two categories of interest in this section are downsizing and over-sampling.

In order to study the nature of imbalanced data sets, Japkowicz proposes two questions: what types of imbalances affect the performance of standard classifiers, and which techniques are appropriate in dealing with class imbalances? To investigate these questions Japkowicz created a number of artificial domains which were made to vary in concept complexity, size of the training data, and ratio of the under-represented class to the over-represented class.

The target concept to be learned in her study was a one dimensional set of continuous alternating equal sized intervals in the range [0, 1], each associated with a class value of 0 or 1. For example, a linear domain generated using her model would be the intervals [0, 0.5] and (0.5, 1]. If the first interval was given the class 1, the second interval would have class 0. Examples for the domain would be generated by randomly sampling points from each interval (e.g., a point x sampled in [0, 0.5] would be a (x, +) example, and likewise a point y sampled in (0.5, 1] would be a (y, −) example).

Japkowicz [2000] varied the complexity of the domains by varying the number of intervals in the target concept. Data set sizes and balances were easily varied by uniformly sampling different numbers of points from each interval.
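A sketch of such a domain generator follows; the uniform choice of interval for each sampled point is an assumption about details not given above.

import random

def make_domain(c, n_pos, n_neg):
    """c alternating intervals over [0, 1]; the first interval carries class 1."""
    width = 1.0 / c
    def sample(label, n):
        starts = [k * width for k in range(c) if k % 2 == (0 if label else 1)]
        return [(random.uniform(s, s + width), label)
                for s in random.choices(starts, k=n)]
    return sample(1, n_pos) + sample(0, n_neg)

data = make_domain(c=4, n_pos=50, n_neg=250)   # a 1:5 imbalance, 4 intervals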
The two balancing techniques that Japkowicz [2000] used in her study that are of interest here are over-sampling and downsizing. The over-sampling technique used was one in which the small class was randomly re-sampled and added to the training set until the number of examples of each class was equal. The downsizing technique used was one in which random examples were removed from the larger class until the sizes of the classes were equal. The domains and balancing techniques described above were implemented using various discrimination based neural networks (DLP).

Japkowicz found that both re-sampling and downsizing helped improve DLP, especially as the target concept became very complex. Downsizing, however, outperformed over-sampling as the size of the training set increased.
8.3 Classifiers Which Cover One Class
8.3.1 BRUTE
Riddle, Segal, and Etzioni [1994] propose an induction technique called BRUTE. The goal of BRUTE is not classification, but the detection of rules that predict a class. The domain of interest which led to the creation of BRUTE is the detection of manufactured airplane parts that are likely to fail. Any rule that detects anomalies, even if they are rare, is considered important. Rules which predict that a part will not fail, on the other hand, are not considered valuable, no matter how large their coverage may be.

BRUTE operates on the premise that standard decision tree test functions, such as ID3's information gain metric, can overlook rules which accurately predict the small 'failed' class in their domain. The test function in the ID3 algorithm averages the entropy at each branch, weighted by the number of examples that satisfy the test at each branch. Riddle et al. [1994] give the following example demonstrating why a common information gain metric would fail to recognize a rule that can correctly classify a significant number of positive examples. Given the following two tests on a branch of a decision tree, information gain would be calculated as follows:
              T1                            T2
      True         False            True         False
    100+, 0−    450+, 450−       175+, 325−   375+, 125−

Figure 2.4.2: This example demonstrates how standard classifiers can be dominated by negative examples. It is taken from [Riddle et al., 1994]. Note that the Gain function defined in Section 6.2 would prefer T2 to T1.
"sing &ain(S" #) G<H as a test selection we get&
<6
#t can be seen that T5 will be chosen o-er T; using #(<Rs information gain measure9 This
choice GT5H has the potential of missing a rule that would pro-ide an accurate rule for
predicting the positi-e class9 #nstead of treating positi-e and negati-e e4amples
symmetrically, 7R"TE uses the test selection function&
where n
I
is the number of positi-e e4amples at the test branch and n
3
the number of
negati-e e4amples at the branch9
#nstead of calculating the weighted a-erage for each test, max6a''ura'y ignores the
entropy of the negati-e e4amples and instead bases paths taken on the proportion of
positi-e e4amples9 #n the pre-ious e4ample 7R"TE would therefore choose T; and follow
the True path, because that path shows ;66U accuracy on the positi-e e4amples9
7R"TE performs what Riddle et al9 E;AA=F describe as a Jmassi-e brute3force search for
accurate positi-e rules9J #t is an e4hausted depth bounded search that was moti-ated by
their obser-ation that the predicti-e rules for their domain tended to be short9 Redundant
searches in their algorithm were a-oided by considering all rules that are smaller than the
depth bound in canonical order9
E4periments with 7R"TE showed that it was able to produce rules that significantly
outperformed those produced using CART E7reiman et al9, ;A@=F and C= E.uinlan, ;A@?F9
7R"TERs a-erage rule accuracy on their test domain was <@9=U, compared with 5;9?U for
;56 9 6 @;; 9 6
5
;
A<< 9 6
5
;
AA5 9 6 H T5 , G
6A5 9 6 ;
;6
A
6
;6
;
AA5 9 6 H T; , G
= =
= =
S &ain
S &ain

+
= =
+
+

n n
n
# S a''uar'y
# Va$ue% v # a H G
ma4 ma4 H , G ma4)
CART and 29.9% for C4 [5]. One drawback is that the computational complexity of BRUTE's depth bounded search is much higher than that of typical decision tree algorithms. They do report, however, that it only took 35 CPU minutes of computation on a SPARC-10.

[5] The accuracy being referred to here is not how well a rule set performs over the testing data. What is being referred to is the percentage of testing examples which are covered by a rule and correctly classified. The example Riddle et al. [1994] give is that if a rule matches 10 examples in the testing data, and 4 of them are positive, then the predictive accuracy of the rule is 40%. The figures given are averages over the entire rule set created by each algorithm. Riddle et al. [1994] use this measure of performance in their domain because their primary interest is in finding a few accurate rules that can be interpreted by factory workers in order to improve the production process. In fact, they state that they would be happy with a poor tree with one really good branch from which an accurate rule could be extracted.
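The asymmetry between the two criteria can be made concrete. The sketch below recomputes both measures for the tests of Figure 2.4.2; it is a reconstruction for illustration, not Riddle et al.'s code.

from math import log2

def entropy(p, n):
    total = p + n
    return sum(-x / total * log2(x / total) for x in (p, n) if x)

def gain(branches):                # branches: [(pos, neg), ...] per outcome
    p = sum(b[0] for b in branches)
    n = sum(b[1] for b in branches)
    return entropy(p, n) - sum((bp + bn) / (p + n) * entropy(bp, bn)
                               for bp, bn in branches)

def max_accuracy(branches):        # BRUTE's test selection function
    return max(bp / (bp + bn) for bp, bn in branches)

t1 = [(100, 0), (450, 450)]        # the tests of Figure 2.4.2
t2 = [(175, 325), (375, 125)]
print(gain(t1), gain(t2))                  # ~0.092 vs ~0.120: gain prefers T2
print(max_accuracy(t1), max_accuracy(t2))  # 1.00 vs 0.75: BRUTE prefers T1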
8.3.2 FOIL
FOIL [Quinlan, 1990] is an algorithm designed to learn a set of first order rules that predict a target predicate to be true. It differs from learners such as C5.0 in that it learns relations among attributes that are described with variables. For example, consider a set of training examples where each example is a description of people and their relations:

{ Name1 = Jack, Girlfriend1 = Jill, Name2 = Jill, Boyfriend2 = Jack, Couple12 = True }

C5.0 may learn the rule:

IF (Name1 = Jack) ∧ (Boyfriend2 = Jack) THEN Couple12 = True.

This rule is of course correct, but it will have very limited use. FOIL, on the other hand, can learn the rule:

IF Boyfriend(x, y) THEN Couple(x, y) = True

where x and y are variables which can be bound to any person described in the data set. A positive binding is one in which a predicate binds to a positive assertion in the training data. A negative binding is one in which there is no assertion found in the training data. For example, the predicate Boyfriend(x, y) has four possible bindings in the example above. The only positive assertion found in the data is for the binding Boyfriend(Jill, Jack) (read: the boyfriend of Jill is Jack). The other three possible bindings (e.g., Boyfriend(Jack, Jill)) are negative bindings, because there are no positive assertions for them in the training data.
The following is a brief description of the FOIL algorithm, adapted from [Mitchell, 1997]. FOIL takes as input a target predicate (e.g., Couple(x, y)), a list of predicates that will be used to describe the target predicate, and a set of examples. At a high level, the algorithm operates by learning a set of rules that covers the positive examples in the training set. The rules are learned using an iterative process that removes positive training examples from the training set when they are covered by a rule. The process of learning rules continues until there are enough rules to cover all the positive training examples. In this way, FOIL can be viewed as a specific to general search through a hypothesis space, which begins with an empty set of rules that covers no positive examples and ends with a set of rules general enough to cover all the positive examples in the training data (the default rule in a learned set is negative).

Creating a rule to cover positive examples is a process by which a general to specific search is performed, starting with an empty condition that covers all examples. The rule is then made specific enough to cover only positive examples by adding literals to the rule (a literal is defined as a predicate or its negation). For example, a rule predicting the predicate Female(x) may be made more specific by adding the literals long_hair(x) and ¬beard(x). The function used to evaluate which literal, L, to add to a rule, R, at each step is
\mathrm{Foil\_Gain}(L, R) = t \left( \log_2 \frac{p_1}{p_1 + n_1} - \log_2 \frac{p_0}{p_0 + n_0} \right)

where p_0 and n_0 are the numbers of positive (p) and negative (n) bindings of the rule R, p_1 and n_1 are the numbers of positive and negative bindings of the rule created by adding L to R, and t is the number of positive bindings of the rule R that are still covered after L is added.

The function Foil_Gain determines the utility of adding L to R. It prefers adding literals with more positive bindings than negative bindings. As can be seen in the equation, the measure is based on the proportion of positive bindings before and after the literal in question is added.
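As a short sketch of the formula above (the function name and the worked numbers are mine, chosen for illustration):

from math import log2

def foil_gain(p0, n0, p1, n1, t):
    """Utility of adding literal L to rule R, per the equation above."""
    return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# E.g., a rule with one positive and three negative bindings, specialized
# so that only the positive binding survives: the gain is 2 bits.
print(foil_gain(p0=1, n0=3, p1=1, n1=0, t=1))   # 2.0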
8.3.3 SHRINK
Kubat, Holte, and Matwin [1998] discuss the design of the SHRINK algorithm, which follows the same principles as BRUTE. SHRINK operates by finding rules that cover positive examples. In doing this, it learns from both positive and negative examples, using the g-mean to take into account rule accuracy over negative examples. There are three principles behind the design of SHRINK:

- Do not subdivide the positive examples when learning;
- Create a classifier that is low in complexity; and
- Focus on regions in space where positive examples occur.
A SHRINK classifier is made up of a network of tests. Each test is of the form x_i ∈ [min a_i, max a_i], where i indexes the attributes. Let h_i represent the output of the ith test: if the test suggests a positive example, the output is 1, else it is −1. An example is classified as positive if

\sum_i h_i w_i > 0

where w_i is a weight assigned to the test h_i.

SHRINK creates the tests and weights in the following way. It begins by taking, for each attribute, the interval that covers all the positive examples. The interval is then reduced in size by removing either its left or right end point, whichever produces the better g-mean. This process is repeated iteratively, and the interval found to have the best g-mean is taken as the test for the attribute. Any test that has a g-mean less than 0.50 is discarded. The weight assigned to each test is w_i = log(g_i / (1 − g_i)), where g_i is the g-mean associated with the ith attribute test.
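A sketch of how a trained SHRINK classifier would score an example follows; the list of tests is assumed to come from the interval shrinking procedure just described.

import math

def shrink_classify(x, tests):
    """tests: (attribute index i, interval [lo, hi], g-mean g) per test."""
    score = 0.0
    for i, lo, hi, g in tests:
        h = 1 if lo <= x[i] <= hi else -1     # output of the i-th test
        score += h * math.log(g / (1 - g))    # weight w_i = log(g_i / (1 - g_i))
    return 1 if score > 0 else -1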
The results reported by Kubat et al. [1998] demonstrate that the SHRINK algorithm performs better than 1-Nearest Neighbor with one sided selection [6]. Pitting SHRINK against C4.5 with one sided selection, the results become less clear. Using one sided selection resulted in a performance gain over the positive examples but a significant loss over the negative examples. This loss of performance over the negative examples results in the g-mean being lowered by about 10%.

    Accuracies Achieved by C4.5, 1-NN and SHRINK
    Classifier    a+      a−      g-mean
    C4.5          81.1    86.6    81.7
    1-NN          67.2    83.4    67.2
    SHRINK        82.5    60.9    70.9

Table 2.4.2: This table is adapted from [Kubat et al., 1998]. It gives the accuracies achieved by C4.5, 1-NN and SHRINK.

[6] One sided selection is discussed in Section 8.2.2. It is essentially a method by which negative examples considered harmful to learning are removed from the data set.
8.4 Recognition Based Learning
Discrimination based learning techniques, such as C5.0, create rules which describe both the positive (conceptual) class and the negative (counter conceptual) class. Algorithms such as BRUTE and FOIL differ from algorithms such as C5.0 in that they create rules that only cover positive examples. However, they are still discrimination based techniques, because they create positive rules using negative examples in their search through the hypothesis space. For example, FOIL creates rules to cover the positive class by adding literals until they do not cover any of the negative class examples. Other learning methods, such as back propagation applied to a feed forward neural network and k-nearest neighbor, do not explicitly create rules, but they are discrimination based techniques that learn from both positive and negative examples.

Japkowicz, Myers, and Gluck [1995] describe HIPPO, a system that learns to recognize a target concept in the absence of counter examples. More specifically, it is a neural network (called an autoencoder) that is trained to take positive examples as input, map them to a small hidden layer, and then attempt to reconstruct the examples at the output layer. Because the network has a narrow hidden layer, it is forced to compress redundancies found in the input examples.
An advantage of recognition based learners is that they can operate in environments in which negative examples are very hard or expensive to obtain. An example Japkowicz et al. [1995] give is the application of machine fault diagnosis, where a system is designed to detect the likely failure of hardware (e.g., helicopter gear boxes). In domains such as this, statistics on functioning hardware are plentiful, while statistics on failed hardware may be nearly impossible to acquire. Obtaining positive examples involves monitoring functioning hardware, while obtaining negative examples involves monitoring hardware that fails. Acquiring enough examples of failed hardware for training a discrimination based learner can be very costly if the device has to be broken a number of different ways to reflect all the conditions in which it may fail.

In learning a target concept, recognition based classifiers such as that described by Japkowicz et al. [1995] do not try to partition a hypothesis space with boundaries that separate positive and negative examples; rather, they attempt to make boundaries which surround the target concept. The following is an overview of how HIPPO, a one hidden layer autoencoder, is used for recognition based learning.
A one hidden layer autoencoder consists of three layers: the input layer, the hidden layer, and the output layer. Training an autoencoder takes place in two stages. In the first stage the system is trained on positive instances using back-propagation [7] until it is able to compress the training examples at the hidden layer and reconstruct them at the output layer. The second stage of training involves determining a threshold that can be used to separate the reconstruction errors of positive examples from those of negative examples.

The second stage of training is a semi-automated process that can take one of two forms. The first, noiseless, case is one in which a lower bound is calculated on the reconstruction error of either the negative or the positive instances. The second, noisy, case is one that uses both positive and negative training examples to calculate the threshold, ignoring the examples considered to be noisy or exceptional.

[7] Note that back propagation is not the only training function that can be used. Evans and Japkowicz [2000] report results using an autoencoder trained with the One Step Secant function.
After training and threshold determination, unseen examples can be given to the autoencoder, which compresses and then reconstructs them at the output layer, measuring the accuracy with which each example is reconstructed. For a two class domain this is very powerful. Training an autoencoder to sufficiently reconstruct the positive class means that unseen examples that can be reconstructed at the output layer contain features that were present in the examples used to train the system. Unseen examples that can be generalized with a low reconstruction error can therefore be deemed to be of the same conceptual class as the examples used for training. Any example which cannot be reconstructed with a low reconstruction error is deemed to be unrecognized by the system and can be classified as the counter conceptual class.
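The recognition step thus reduces to a threshold test on reconstruction error, as in the sketch below; the squared-error measure and the reconstruct() interface are assumptions, not HIPPO's published details.

import numpy as np

def recognize(autoencoder, x, threshold):
    """Accept x as the conceptual (positive) class if it reconstructs well."""
    error = np.sum((autoencoder.reconstruct(x) - x) ** 2)
    return 'positive' if error < threshold else 'negative'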
Japkowicz et al. [1995] compared HIPPO to two other standard classifiers that are designed to operate with both positive and negative examples: C4.5, and back propagation applied to a feed forward neural network (FF Classification). The data sets studied were:

- The CH46 Helicopter Gearbox data set [Kolesar and NRaD, 1994]. This domain consists of discriminating between faulty and non-faulty helicopter gearboxes during operation. The faulty gearboxes are the positive class.
- The Sonar Target Recognition data set. This data was obtained from the U.C. Irvine Repository of Machine Learning. This domain consists of taking sonar signals as input and determining which signals constitute rocks and which are mines (mine signals were considered the positive class in the study).
- The Promoter data set. This data consists of input segments of DNA strings. The problem consists of recognizing which strings represent promoters, which are the positive class.
Testing HIPPO showed that it performed much better than C4.5 and the FF Classifier on the Helicopters and Sonar Targets domains. It performed equally with the FF Classifier on the Promoters domain, but much better than C4.5 on the same data.

    Data Set Results
    Data Set         HIPPO         C4.5          FF Classifier
    Helicopters      3.125±0.9     15.625±1.9    10.9±1.7
    Promoters        20±0.7        35±1.4        20±1.4
    Sonar Targets    20±2.7        29±1.8        32±3.2

Table 2.4.3: Results for the three data sets tested. The figures represent the percent error along with its standard deviation as each algorithm was tested on the data.
Chapter Three

3 ARTIFICIAL DOMAIN
Chapter 3 is divided into three sections. Section 3.1 introduces an artificial domain of k-DNF expressions and describes three experiments that were performed using C5.0. The purpose of the experiments is to investigate the nature of imbalanced data sets and provide a motivation behind the design of a system intended to improve a standard classifier's performance on imbalanced data sets. Section 3.2 presents the design of the system, which takes advantage of two sampling techniques: over-sampling and downsizing. The chapter concludes with Section 3.3, which tests the system on the artificial domain and presents the results.
3.1 Experimental Design
This section begins by introducing the artificial domain of k-DNF expressions and explaining why this domain was chosen for experimentation. Three experiments are then presented which investigate the nature of imbalanced data sets. The first experiment explores concept complexity as it affects imbalanced data sets. The second experiment investigates the two sampling techniques, downsizing and over-sampling. The last experiment looks at the rule sets created by C5.0 as data sets are balanced by over-sampling and downsizing.
3.1.1 Artificial Domain
The artificial domain chosen for experimentation is known as k-DNF (Disjunctive Normal Form) expressions. A k-DNF expression is of the form

(x_1 ∧ x_2 ∧ … ∧ x_n) ∨ (x_{n+1} ∧ x_{n+2} ∧ … ∧ x_{2n}) ∨ … ∨ (x_{(k−1)n+1} ∧ x_{(k−1)n+2} ∧ … ∧ x_{kn})

where k is the number of disjuncts, n is the number of conjunctions in each disjunct, and each x is drawn from the alphabet x_1, x_2, …, x_j, ¬x_1, ¬x_2, …, ¬x_j. An example of a k-DNF expression, k being 2, is given as (Exp. 1):

(x_1 ∧ x_3 ∧ ¬x_5) ∨ (¬x_3 ∧ x_4 ∧ x_5)    (Exp. 1)

Note that if x_k is a member of a disjunct, ¬x_k cannot be. Also note that (Exp. 1) would be referred to as an expression of 3x2 complexity, because it has two disjuncts and three conjunctions in each disjunct.
Given (Exp. 1), defined over an alphabet of size 5, the following four examples would have the classes indicated by +/−:

       x1  x2  x3  x4  x5   Class
    1)  1   0   1   1   0     +
    2)  0   1   0   1   1     +
    3)  0   0   1   1   1     −
    4)  1   1   0   0   1     −

Figure 3.1.1: Four instances of classified data defined over the expression (Exp. 1).
In the artificial domain of k-DNF expressions, the task of the learning algorithm (C5.0) is to take as input a set of classified examples and learn the expression that was used to classify them. For example, given the four classified instances in Figure 3.1.1, C5.0 would take them as input and attempt to learn (Exp. 1). Figure 3.1.2 gives an example of a decision tree that can correctly classify the instances in Figure 3.1.1.
[Decision tree diagram omitted: internal nodes test attributes x5, x1, x3, and x4; each leaf is labeled + or −.]

Figure 3.1.2: A decision tree that correctly classifies instances (such as those in Figure 3.1.1) as satisfying (Exp. 1) or not. If an example to be classified is sorted to a positive (+) leaf in the tree, it is given a positive class and satisfies expression (Exp. 1).
k-DNF expressions were chosen for experimentation because of their similarity to the real world application of text classification, which is the process of placing labels on documents to indicate their content. Ultimately, the purpose of the experiments presented in this chapter is to motivate the design of a system that can be applied to imbalanced data sets with practical applications. The real world domain chosen to test the system is the task of classifying text documents. The remainder of this section describes the similarities and differences of the two domains: k-DNF expressions and text classification.

The greatest similarity between the practical application of text classification and the artificial domain of k-DNF expressions is the fact that the conceptual class (i.e., the class which contains the target concept) in the artificial domain is the under represented class. The over represented class is the counter conceptual class and therefore represents everything else. This can also be said of the text classification domain, because during the training process we label documents of the conceptual class as positive and all other documents as negative. The documents labeled as negative in a text domain represent the counter conceptual class and therefore represent everything else. This will be described in more detail in Chapter 4.
The other similarity between text classification and k-DNF expressions is the ability to control the complexity of the target expression. By varying the number of disjuncts in an expression we can vary the difficulty of the target concept to be learned. [8] This ability to control concept complexity maps onto text classification tasks, where not all classification tasks are equal in difficulty. This may not be obvious at first. Consider a text classification task where one needs to classify documents as being about a particular consumer product. The complexity of the rule set needed to distinguish documents of this type may be as simple as a single rule indicating the name of the product and the name of the company that produces it. This task would probably map itself to a very simple k-DNF expression with perhaps only one disjunct. Now consider training another classifier intended to classify documents as being computer software related or not. The number of rules needed to describe this category is probably much greater. For example, the terms "computer" and "software" in a document may be good indicators that a document is computer software related, but so might be the term "windows", if it appears in a document not containing the term "cleaner". In fact, the terms "operating" and "system" or "word" and "processor" appearing together in a document are also good indicators that it is software related. The complexity of a rule set needed to be constructed by a learner to recognize computer software related documents is, therefore, greater and would probably map onto a k-DNF expression with more disjuncts than that of the first consumer product example.

The biggest difference between the two domains is that the artificial domain was created without introducing any noise. No negative examples were created and labeled as being positive. Likewise, there were no positive examples labeled as negative. For text domains in general there is often label noise, in which documents are given labels that do not accurately indicate their content.

[8] As the number of disjuncts (k) in an expression increases, more partitions in the hypothesis space need to be realized by a learner to separate the positive examples from the negative examples.
3.1.2 Example Creation
For the described tests, training examples were always created independently of the testing examples. The training and testing examples were created in the following manner (a sketch of the procedure follows the list):

- A random k-DNF expression is created over a given alphabet size (in this study the alphabet size is 50).
- An arbitrary set of examples is generated as a random sequence of attributes equal to the size of the alphabet the k-DNF expression was created over. All the attributes are given an equal probability of being either 0 or 1.
- Each example is then classified as being either a member of the expression or not and tagged appropriately. Figure 3.1.1 demonstrates how four generated examples would be classified over the expression (Exp. 1).
- The process is repeated until a sufficient number of positive and negative examples is attained; that is, enough examples are created to provide an imbalanced data set for training and a balanced data set for testing.
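The sketch below illustrates this generation procedure, with (Exp. 1) hard-coded as the target expression for simplicity; the real tests draw a random expression of the desired complexity over an alphabet of size 50.

import random

def random_example(alphabet_size):
    return [random.randint(0, 1) for _ in range(alphabet_size)]

def satisfies_exp1(x):
    # (x1 AND x3 AND NOT x5) OR (NOT x3 AND x4 AND x5), attributes 1-indexed
    return (x[0] and x[2] and not x[4]) or (not x[2] and x[3] and x[4])

def make_data(n_pos, n_neg, alphabet_size=5):
    pos, neg = [], []
    while len(pos) < n_pos or len(neg) < n_neg:
        x = random_example(alphabet_size)
        (pos if satisfies_exp1(x) else neg).append(x)
    return ([(x, '+') for x in pos[:n_pos]] +
            [(x, '-') for x in neg[:n_neg]])

train = make_data(240, 6000)    # a 25:1 imbalanced training set
test = make_data(1200, 1200)    # a balanced test set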
For most of the tests, both the size of the alphabet used and the number of conjunctions in each disjunct were held constant. There were, however, a limited number of tests performed that varied the size of the alphabet and the number of conjunctions in each disjunct. For all the experiments, imbalanced data sets were used. For the first few tests, a data set that contained 6000 negative examples and 1200 positive examples was used. This represented a class imbalance of 5:1 in favor of the negative class. As the tests, however, led to the creation of a combination scheme, the data sets tested were further imbalanced to a 25:1 ratio (6000 negative : 240 positive) in favor of the negative class. This greater imbalance more closely resembled the real world domain of text classification on which the system was ultimately tested. In each case the exact ratio of positive and negative examples in both the training and testing set will be indicated.
3.1.3 Description of Tests and Results
The description of each test consists of several sections. The first section states the motivation behind performing the test and gives the particulars of its design. The results of the experiment are then given, followed by a discussion.
3.1.3.1 Test # 1: Varying the Target Concept's Complexity
Varying the number of disjuncts in an expression varies the complexity of the target concept. As the number of disjuncts increases, the following two things occur in a data set where the positive examples are evenly distributed over the target expression and their number is held constant:

- The target concept becomes more complex, and
- The positive examples become sparser relative to the target concept.

A visual representation of the preceding statements is given in Figure 3.1.3.

[Figure omitted: two panels, (a) and (b).]

Figure 3.1.3: A visual representation of a target concept becoming sparser relative to the number of examples.

Figures 3.1.3(a) and 3.1.3(b) give a feel for what is happening when the number of disjuncts increases. For larger expressions, more partitions need to be realized by the learning algorithm and fewer examples indicate which partitions should take place.
The motivation behind this experiment comes from Schaffer [1993], who reports on experiments showing that data reflecting a simple target concept lends itself well to decision tree learners that use pruning techniques, whereas complex target concepts lend themselves to best fit algorithms (i.e., decision trees which do not employ pruning techniques). In a similar spirit, this first experiment investigates the effect of target concept complexity on imbalanced data sets. More specifically, it is an examination of how well C5.0 learns target concepts of increasing complexity on balanced and imbalanced data sets.
Setup
In order to investigate the performance of induced decision trees on balanced and imbalanced data sets, eight sets of training and testing data of increasing target concept complexity were created. The target concepts in the data sets were made to vary in concept complexity by increasing the number of disjuncts in the expression to be learned, while keeping the number of conjunctions in each disjunct constant. The following procedure was used to produce the results given below.

Repeat x times:
  o Create a training set T(c, 6000+, 6000−)
  o Create a test set E(c, 1200+, 1200−) [9]
  o Train C on T
  o Test C on E and record its performance P_1:1
  o Randomly remove 4800 positive examples from T
  o Train C on T
  o Test C on E and record its performance P_1:5
  o Randomly remove 960 positive examples from T
  o Train C on T
  o Test C on E and record its performance P_1:25
Average each P over the x runs.

For this test, expressions of complexity c = 4x2, 4x3, …, 4x8, and 4x10 [10] were tested over an alphabet of size 50. The results for each expression were averaged over x = 10 runs.

[9] Note that throughout Chapter 3 the testing sets used to measure the performance of the induced classifiers are balanced. That is, there is an equal number of positive and negative examples used for testing. The test sets are artificially balanced in order to increase the cost of misclassifying positive examples. Using a balanced testing set to measure a classifier's performance gives each class equal weight.
Results
The results of the experiment are shown in Figures 3.1.4 to 3.1.6. The bars indicate the average error of the induced classifier on the entire test set. At first glance it may seem as though the error rate of the classifier is not very high for the imbalanced data sets. Figure 3.1.4 shows that the error rate for expressions of complexity 4x7 and under, balanced at ratios of 1+:5− and 1+:25−, remains under 30%. An error rate lower than 30% may not seem extremely poor when one considers that the accuracy of the system is still over 70%, but, as Figure 3.1.5 demonstrates, accuracy as a performance measure over both the positive and negative class can be misleading. When the error rate is measured just over the positive examples, the same 4x7 expression has an average error of almost 56%.

Figure 3.1.6 graphs the error rate as measured over the negative examples in the testing data. It can easily be seen that when the data set is balanced, the classifier has an error rate over the negative examples that contributes to the overall accuracy of the classifier. In fact, when the training data is balanced, the highest error rate measured over the negative examples is 1.7%, for an expression of complexity 4x10. However, the imbalanced data sets show almost no error over the negative examples. The highest error achieved over the negative data for the imbalanced data sets was less than one tenth of a percent, at a balance ratio of 1+:5− with an expression complexity of 4x10. When trained on a data set imbalanced at a ratio of 1+:25−, the classifier showed no error when measured over the negative examples.

[10] Throughout this work, when referring to expressions of complexity a x b (e.g., 4x2), a refers to the number of conjuncts and b refers to the number of disjuncts.
[Bar chart omitted: 'Error Over All Examples'; average error vs. degree of complexity (4x2 through 4x10) for balance ratios 1:1, 1:5, and 1:25.]

Figure 3.1.4: Average error of induced decision trees measured over all testing examples.

[Bar chart omitted: 'Error Over Positive Examples'; average error vs. degree of complexity for the same three balance ratios.]

Figure 3.1.5: Average error of induced decision trees measured over positive testing examples.

[Bar chart omitted: 'Error Over Negative Examples'; average error vs. degree of complexity for the same three balance ratios.]

Figure 3.1.6: Average error of induced decision trees measured over negative testing examples. Note that the scale is much smaller than those of Figures 3.1.4 and 3.1.5.
Discussion
As previously stated, the purpose of this experiment was to test the classifier's performance on both balanced and imbalanced data sets while varying the complexity of the target expression. It can be seen in Figure 3.1.4 that the performance of the classifier worsens as the complexity of the target concept increases. The degree to which the accuracy of the system degrades is dramatically different for balanced and imbalanced data sets. [11] To put this in perspective, the following table lists the accuracies of various expressions when learned over the balanced and imbalanced data sets. Note that the statistics in Table 3.1.1 are taken from Figure 3.1.4 and are not the errors associated with the induced classifiers, but the accuracies in terms of the percent of examples correctly classified.

    Accuracy for Balanced and Imbalanced Data Sets
                        Complexities
    Balance        4x2      4x6      4x10
    1+:1−          100%     99.9%    95.7%
    1+:5−          100%     78.2%    67.1%
    1+:25−         100%     76.3%    65.7%

Table 3.1.1: Accuracies of various expressions learned over balanced and imbalanced data sets. These figures are the percentage of correctly classified examples when tested over both positive and negative examples.

[11] Throughout this first experiment it is important to remember that a learner's poor performance when learning more complex expressions from an imbalanced training set can be caused by the imbalance in the training data, or by the fact that there are not enough positive examples to learn the difficult expression. The question then arises as to whether the problem is the imbalance or the lack of positive training examples. This can be answered by referring to Figure 3.1.10, which shows that the accuracy of the induced classifier can be increased by balancing the data sets through removing negative examples. The imbalance is therefore at least partially responsible for the poor performance of an induced classifier when attempting to learn difficult expressions.
Table 3.1.1 clearly shows that a simple expression (4x2) is not affected by the balance of the data set. The same cannot be said for the expressions of complexity 4x6 and 4x10, which are clearly affected by the balance of the data set. As fewer positive examples are made available to the learner, its performance decreases.

Moving across Table 3.1.1 gives a feel for what happens as the expression becomes more complex at each degree of imbalance. In the first row, there is a performance drop of less than 5% between learning the simplest (4x2) expression and the most difficult (4x10) expression. Moving across the bottom two rows of the table, however, reveals a performance drop of over 30% on the imbalanced data sets when trying to learn the most complex expression (4x10).

The results of this initial experiment show how the artificial domain of interest is affected by an imbalance of data. One can clearly see that the accuracy of the system suffers over the positive testing examples when the data set is imbalanced. It also shows that the complexity of the target concept hinders performance on the imbalanced data sets to a greater extent than on the balanced data sets. What this means is that even though one may be presented with an imbalanced data set in terms of the number of available positive examples, without knowing how complex the target concept is, one cannot know how the imbalance will affect the classifier.
3.1.3.2 Test # 2: Correcting Imbalanced Data Sets (Over-sampling vs. Downsizing)
The two techniques investigated here for improving performance on imbalanced data sets are: uniformly over-sampling the smaller class, in this case the under represented positive class, and randomly under sampling the class which has many examples, in this case the negative class. These two balancing techniques were chosen because of their simplicity and their opposing natures, which will be useful in our combination scheme. The simplicity of the techniques is easily understood. The opposing nature of the techniques is explained as follows.

In terms of the overall size of the data set, downsizing significantly reduces the number of examples made available for training. By leaving negative examples out of the data set, information about the negative (or counter conceptual) class is being removed.

Over-sampling has the opposite effect on the size of the data set. Adding examples by re-sampling the positive (or conceptual) class, however, does not add any additional information to the data set. It just balances the data set by increasing the number of positive examples.
Setup
This test was designed to determine whether randomly removing examples of the over represented negative class, or uniformly over-sampling examples of the under represented class, to balance the data set would improve the performance of the induced classifier over the test data. To do this, data sets imbalanced at a ratio of 1+:25− were created, varying the complexity of the target expression in terms of the number of disjuncts. The idea behind the testing procedure was to start with an imbalanced data set and measure the performance of an induced classifier as either negative examples are removed, or positive examples are re-sampled and added to the training data. The procedure given below was followed to produce the presented results.

Repeat x times:
  o Create a training set T(c, 240+, 6000−)
  o Create a test set E(c, 1200+, 1200−)
  o Train C on T
  o Test C on E and record its performance P_original
  o Repeat for n = 1 to 10:
      - Create Td(240+, (6000 − 576·n)−) by randomly removing 576·n negative examples from T
      - Train C on Td
      - Test C on E and record its performance P_downsize
  o Repeat for n = 1 to 10:
      - Create To((240 + 576·n)+, 6000−) by uniformly over-sampling the positive examples of T
      - Train C on To
      - Test C on E and record its performance P_oversample
Average the P_downsize and P_oversample values over the x runs.

This test was repeated for expressions of complexity c = 4x2, 4x3, …, 4x8 and 4x10, defined over an alphabet of size 50. The results were averaged over x = 50 runs.
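The two sampling operations themselves are simple; a sketch follows, with the step size of 576 matching the procedure above.

import random

def downsize(negatives, n, step=576):
    """Randomly remove step*n negative examples."""
    return random.sample(negatives, len(negatives) - n * step)

def oversample(positives, n, step=576):
    """Uniformly re-sample (with replacement) step*n additional positives."""
    return positives + [random.choice(positives) for _ in range(n * step)]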
Results
The majority of the results for this experiment are presented as line graphs indicating the error of the induced classifier as its performance is measured over the testing examples. Each graph is titled with the complexity of the expression that was tested, along with the class of the testing examples (positive, negative, or all), and compares the results for downsizing and over-sampling as each was carried out at different rates. Figure 3.1.7 is a typical example of the type of results that were obtained from this experiment and is presented with an explanation of how to interpret it.

[Line graph omitted: '4x5 Accuracy Over All Examples'; error vs. sampling rate (0 to 100) for downsizing and over-sampling.]

Figure 3.1.7: Error rates for learning an expression of 4x5 complexity as either negative examples are removed, or positive examples are re-sampled.

The y-axis of Figure 3.1.7 gives the error of the induced classifier over the testing set. The x-axis indicates the level at which the training data was either downsized or over-sampled. The x-axis can be read in the following way for each sampling method:
For downsizing, the numbers represent the rate at which negative examples were removed from the training data. The point 0 represents no negative examples being removed, while 100 represents the point at which enough negative examples have been removed to balance the training data at (240+, 240−). Essentially, the negative examples were removed in increments of 576.

For over-sampling, the labels on the x-axis are simply the rate at which the positive examples were re-sampled, 100 being the point at which the training data set is balanced at (6000+, 6000−). The positive examples were therefore re-sampled in increments of 576.

It can be seen from Figure 3.1.7 that the highest accuracy (lowest error rate) for downsizing is not reached when the data set is balanced in terms of numbers. This is even more apparent for over-sampling, where the lowest error rate was achieved at a rate of re-sampling 3456 (6 × 576) or 4032 (7 × 576) positive examples. That is, the lowest error rate achieved for over-sampling is around the 60 or 70 mark in Figure 3.1.7. In both cases the training data set is not balanced in terms of numbers.
Figure 3.1.8 is included to demonstrate that expressions of different complexities have different optimal balance rates. As can be seen from the plot of the 4x7 expression, the lowest error achieved using downsizing occurs at the 70 mark. This is different from the results previously reported for the expression of 4x5 complexity (Figure 3.1.7). It is also apparent that the optimal balance level is reached at the 50 mark as opposed to the 60-70 mark in Figure 3.1.7.
[Figure: line graph titled "4x7 Accuracy Over All Examples" — Error (0 to 0.4) versus Sampling Rate (0 to 100), one curve each for Downsizing and Over-Sampling.]
Figure 3.1.8: This graph demonstrates that the optimal level at which a data set should be balanced does not always occur at the same point. To see this, compare this graph with Figure 3.1.7.
The results in Figures 3.1.7 and 3.1.8 may at first appear to contradict those reported by Ling and Li [1998], who found that the optimal balance level for data sets occurs when the number of positive examples equals the number of negative examples. But this is not the case. The results reported in this experiment use accuracy as a measure of performance. Ling and Li [1998] use S_lift as a performance measure. By using S_lift as a performance index, Ling and Li [1998] are measuring the distribution of correctly classified examples after they have been ranked. In fact, they report that the error rates in their domain drop dramatically from 40% with a balance ratio of 1+:1-, to 3% with a balance ratio of 1+:8-.
The next set of graphs, in Figure 3.1.9, shows the results of trying to learn expressions of various complexities. For each expression the results are displayed in two graphs. The first graph gives the error rate of the classifier tested over negative examples. The second graph shows the error rate as tested over the positive examples. They will be used in the discussion that follows.
[Figure: line graphs; the panel shown is titled "Accuracy Over Negative Examples" — Error (0 to 0.01) versus Sampling Rate (0 to 100), one curve each for Downsizing and Over-Sampling; companion panels give the error over the positive examples.]
Figure 3.1.9: These graphs demonstrate the competing factors when balancing a data set. Points that indicate the highest error rate over the negative examples correspond to the lowest error over the positive examples.
Discussion
The results in Figure 3.1.9 demonstrate the competing factors faced when balancing a data set. Balancing the data set can increase the accuracy over the positive examples, a+, but this comes at the expense of the accuracy over the negative examples, a-. By comparing the two graphs for each expression in Figure 3.1.9, the link between a+ and a- can be seen. Downsizing results in a steady decline in the error over the positive examples, but this comes at the expense of an increase in the error over the negative examples. Looking at the curves for over-sampling, one can easily see that sections with the lowest error over the positive examples correspond to the highest error over the negative examples.
A difference in the error curves for over-sampling and downsizing can also be seen. Over-sampling appears to initially perform better than downsizing over the positive examples until the data set is balanced, at which point the two techniques become very close. Downsizing, on the other hand, initially outperforms over-sampling on the negative testing examples until a point at which its error rate rises off the scale of the graphs.
In order to demonstrate more clearly that there is an increase in performance using the naïve sampling techniques that were investigated, Figure 3.1.10 is presented. It compares the accuracy achieved on the imbalanced data sets using each of the sampling techniques studied. The balance level at which the techniques are compared is 1+:1-; that is, the data set is balanced at (240+, 240-) using downsizing and at (6000+, 6000-) using over-sampling.
[Figure: bar chart titled "Oversampling and Downsizing at Equal Rates (Error Over All Examples)" — Error (0 to 0.35) for Imbalanced, Downsized, and OverSampled training sets; one series each for the 4x4, 4x6, 4x8, and 4x10 expressions.]

Figure 3.1.10: This graph demonstrates the effectiveness of balancing data sets by downsizing and over-sampling. Notice that over-sampling appears to be more effective than downsizing when the data sets are balanced in terms of numbers. The only case where downsizing outperforms over-sampling in this comparison is when attempting to learn an expression of 4x4 complexity.
The results from the comparison in Figure 3.1.10 indicate that over-sampling is a more effective technique when balancing the data sets in terms of numbers. This result is probably due to the fact that downsizing, in effect, leaves out information about the counter-conceptual class. Although over-sampling does not introduce any new information by re-sampling positive examples, by retaining all the negative examples it has more information than downsizing about the counter-conceptual class. This effect can be seen, referring back to Figure 3.1.9, in the differences in error over the negative examples for each balancing technique. That is, the error rate on the negative examples remains relatively constant for over-sampling compared to that of downsizing.
To summarize the results of this second experiment, three things can be stated. They are:
 - Both over-sampling and downsizing in a naïve fashion (i.e., by just randomly removing and re-sampling data points) are effective techniques for balancing the data set in question.
 - The optimal level at which an imbalanced data set should be over-sampled or downsized does not necessarily occur when the data is balanced in terms of numbers.
 - There are competing factors when each balancing technique is used. Achieving a higher a+ comes at the expense of a- (this is a common point in the literature for domains such as text classification).
3.1.3.3 Test #3: A Rule Count for Balanced Data Sets
"ltimately, the goal of the e4periments described in this section is to pro-ide moti-ation
behind the design of a system that combines multiple classifiers that use different sampling
techniques9 The ad-antage of combining classifiers that use different sampling techniques
only comes if there is a -ariance in their predictions9 Combining classifiers that always
make the same predictions is of no -alue if one hopes that their combination will increase
predicti-e accuracy9 #deally, one would like to combine classifiers that agree on correct
predictions, but disagree on incorrect predictions9
ethods that combine classifiers such as Adapti-e37oosting attempt to -ary learnersR
predictions by -arying the training e4amples in which successi-e classifiers are presented
to learn on9 As we saw in Section 5959=, Adapti-e37oosting increases the sampling
probability of e4amples that are incorrectly classified by already constructed classifiers9 7y
placing this higher weight on incorrectly classified e4amples, the induction process at each
iteration is biased towards creating a classifier that performs well on pre-iously
misclassified e4amples9 This is done in an attempt to create a number of classifiers that can
be combined to increase predicti-e accuracy9 #n doing this, Adapti-e37oosting ideally
di-ersifies the large rule sets of the classifiers9
Setup

Rules can be described in terms of their complexity. Larger rule sets are considered more complex than smaller rule sets. This experiment was designed to get a feel for the complexity of the rule sets produced by C5.0 when it is induced on imbalanced data sets that have been balanced by either over-sampling or downsizing. By looking at the complexity of the rule sets created, we can get a feel for the differences between the rule sets created using each sampling technique. The following algorithm was used to produce the results given below.
Repeat x times
  o Create a training set T(c, 240+, 6000-)
  o Create To(6000+, 6000-) by uniformly re-sampling the positive
    examples from T and adding the negative examples from T
  o Train C on To
  o Record rule counts Ro+ and Ro- for the positive and negative rule sets
  o Create Td(240+, 240-) by randomly removing 5760 negative examples
    from T
  o Train C on Td
  o Record rule counts Rd+ and Rd- for the positive and negative rule sets
Average rule counts over x.
For this test, expressions of sizes c = 4x2, 4x3, …, 4x8, and 4x10 defined over an alphabet of size f = 50 were tested and averaged over x = 30 runs.
Results

The following two tables list the average characteristics of the rules associated with decision trees induced over imbalanced data sets that contain target concepts of increasing complexity. The averages are taken from rule sets created from data sets that were obtained by the procedure given above. The first table indicates the average number of positive rules associated with each expression complexity and their average size, averaged over 30 trials. The second table indicates the average number of rules created to classify examples as negative and their average size. Both tables will be used in the discussion that follows.
Positive Rule Counts

                   Down Sizing              Over Sampling
  Expression    Average    Number       Average    Number
                rule size  of rules     rule size  of rules
  4x2           4.0        2.0          4.0        2.0
  4x3           4.0        4.1          4.0        3.0
  4x4           3.8        5.6          4.0        4.0
  4x5           4.6        11.1         4.0        5.0
  4x6           4.7        13.5         4.0        6.0
  4x7           4.8        15.3         4.3        7.2
  4x8           4.9        15.4         8.3        36.2
  4x10          5.0        18.6         8.5        43.7

Table 3.1.5: A list of the average positive rule counts for data sets that have been balanced using downsizing and over-sampling.
Negative Rule Counts

                   Down Sizing              Over Sampling
  Expression    Average    Number       Average    Number
                rule size  of rules     rule size  of rules
  4x2           2.0        9.4          2.0        13.4
  4x3           3.3        17.3         3.4        33.4
  4x4           4.0        17.5         4.6        44.1
  4x5           4.6        21.7         5.2        50.4
  4x6           4.8        19.0         5.6        78.3
  4x7           4.7        17.7         5.75       75.1
  4x8           5.0        17.4         5.9        94.6
  4x10          4.9        18.1         6.0        89.5

Table 3.1.6: A list of the average negative rule counts for data sets that have been balanced using downsizing and over-sampling.
Discussion

Before I begin the discussion of these results, it should be noted that these numbers should only be used to indicate general trends in rule set complexity. When averaged for expressions of complexity 4x6 and greater, the numbers varied considerably. The discussion will be in four parts. It will begin by attempting to explain the factors involved in creating rule sets over imbalanced data sets, and then lead into an attempt to explain the characteristics of rule sets created from downsized data sets, followed by those from over-sampled data sets. I will then conclude with a general discussion about some of the characteristics of the artificial domain and how they create the results that have been presented. Throughout this section one should remember that the positive rule set contains the target concept, that is, the underrepresented class.
How does a lack of positive training examples hurt learning?

Kubat et al. [1998] give an intuitive explanation of why a lack of positive examples hurts learning. Looking at the decision surface of a two-dimensional plane, they explain the behavior of the 1-Nearest Neighbor (1-NN) rule. It is a simple explanation that is generalized as: "…as the number of negative examples in a noisy domain grows (the number of positives being constant), so does the likelihood that the nearest neighbor of any example will be negative." Therefore, as more negative examples are introduced to the data set, the more likely a positive example is to be classified as negative using the 1-NN rule. Of course, as the number of negative examples approaches infinity, the accuracy of a learner that classifies all examples as negative approaches 100% over the negative data and 0% over the positive data. This is unacceptable if one expects to be able to recognize positive examples.
They then extend the argument to decision trees, drawing a connection to the common problem of overfitting. Each leaf of a decision tree represents a decision as being positive or negative. In a noisy training set that is imbalanced in terms of the number of negative examples, it is stated that an induced decision tree will be large enough to create regions arbitrarily small enough to partition the positive regions. That is, the decision tree will have rules complex enough to cover very small regions of the decision surface. This is a result of a classifier being induced to partition positive regions of the decision surface small enough to contain only positive examples. If there are many negative examples nearby, the partitions will be made very small to exclude them from the positive regions. In this way, the tree overfits the data with a similar effect as the 1-NN rule.
Many approaches have been developed to avoid overfitting data, the most successful being post-pruning. Kubat et al. [1998], however, state that this does not address the main problem. If a region in an imbalanced data set by definition contains many more negative examples than positive examples, post-pruning is very likely to result in all of the pruned branches being classified as negative.
C5.0 and Rule Sets

C5.0 attempts to partition data sets into regions that contain only positive examples and regions that contain only negative examples. It does this by attempting to find features in the data that are 'good' to partition the training data around (i.e., have a high information gain). One can look at the partitions it creates by analyzing the rules that are generated to create the boundaries. Each rule generated creates a partition in the data. Rules can appear to overlap, but when viewed as partitions in an entire set of rules, the partitions created in the data by the rule sets do not overlap. Viewed as an entire set of rules, the partitions in the data can be seen as having highly irregular shapes. This is due to the fact that C5.0 assigns a confidence level to each rule. If a region of space is overlapped by multiple rules, the confidence levels of the rules of each class that cover the space are summed. The class with the highest summed confidence level is determined to be the correct class. The confidence level given to each rule can be viewed as the number of examples the rule covers correctly over the training data. Therefore, rule sets that contain higher numbers of rules are generally less confident in their estimated accuracy, because each rule covers fewer examples. Figure 3.1.11 is presented in an attempt to give a pictorial representation of what is being described here, given that the partition with the positive class has a higher confidence level than the partition indicating the negative class.
[Figure: two overlapping rectangular partitions labeled Rule 1 and Rule 2.]
Figure 3.1.11: An example of how C5.0 adds rules to create complex decision surfaces. It is done by summing the confidence levels of rules that cover overlapping regions. A region covered by more than one rule is assigned the class with the highest summed confidence level of all the rules that cover it. Here we assume Rule 1 has a higher confidence level than Rule 2.
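A minimal sketch of this summed-confidence resolution (an illustration of the idea, not C5.0's actual code) is:

    def classify(example, rules, default='negative'):
        # rules: list of (predicate, class_label, confidence) triples;
        # sum the confidence of every rule covering the example, per
        # class, and return the class with the largest total
        totals = {}
        for predicate, label, confidence in rules:
            if predicate(example):
                totals[label] = totals.get(label, 0.0) + confidence
        return max(totals, key=totals.get) if totals else default

    # the situation of Figure 3.1.11: Rule 1 (positive) outweighs
    # Rule 2 (negative) on the region both rules cover
    rules = [(lambda e: e['x'] < 5, 'positive', 0.9),
             (lambda e: e['y'] < 5, 'negative', 0.7)]
    print(classify({'x': 2, 'y': 2}, rules))   # -> positive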
Downsizing

Looking at the complexity of the rule sets created by downsizing, we can see a number of things happening. For expressions of complexity 4x2, on average, the classifier creates a positive rule set that perfectly fits the target expression. At this point, the classifier is very confident in being able to recognize positive examples (i.e., rules indicating a positive partition have high confidence levels), but less confident in recognizing negative examples. However, as the target concept becomes more complex, the examples become sparser relative to the number of positive examples (refer to Figure 3.1.3). This has the effect of creating rules that cover fewer positive examples, so as the target concept becomes more complex, the rule sets lose confidence in being able to recognize positive examples. For expressions of complexity 4x3-4x5, possibly even as high as 4x6, we can see that the rule sets generated to cover the positive class are still smaller than the rule sets generated to cover the negative class. For expressions of complexity 4x7-4x10, and presumably beyond, there is no distinction between the rules generated to cover the positive class and those generated to cover the negative class.
Oversampling

Over-sampling has different effects than downsizing. One obvious difference is the complexity of the rule sets indicating negative partitions. Rule sets that classify negative examples when over-sampling is used are much larger than those created using downsizing. This is because the large number of negative examples is still in the data set, resulting in a large number of rules created to classify them.
The rule sets created for the negative examples are given much less confidence than those created when downsizing is used. This effect occurs because the learning algorithm attempts to partition the data using features contained in the negative examples. Because there is no target concept contained in the negative examples[12] (i.e., no features to indicate that an example is negative), the learning algorithm is faced with the dubious task, in this domain, of attempting to find features that do not exist except by mere chance.
Over-sampling the positive class can be viewed as adding weight to the examples that are re-sampled. Under an information gain heuristic used to search through the hypothesis space, features which partition more examples correctly are favored over those that do not. Multiplying the number of examples a feature will classify correctly when it is found therefore gives the feature weight. Over-sampling the positive examples in the training data thus has the effect of giving weight to features contained in the target concept, but it also adds weight to random features which occur in the data being over-sampled.
The effect of over-sampling therefore has two competing factors. They are:
 - One that adds weight to features containing the target concept.
 - One that adds weight to features not containing the target concept.
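The "duplication acts as weight" point can be made concrete with a small sketch of weighted information gain (my own illustration; the thesis itself did not modify C5.0). Re-sampling an example k times is equivalent to giving it weight k, so any feature that happens to hold in the re-sampled positives, relevant or not, has its gain inflated:

    import math
    from collections import Counter

    def entropy(weighted_labels):
        # weighted_labels: list of (label, weight) pairs
        total = sum(w for _, w in weighted_labels)
        by_class = Counter()
        for label, w in weighted_labels:
            by_class[label] += w
        return -sum((c / total) * math.log2(c / total)
                    for c in by_class.values())

    def information_gain(data, feature):
        # data: list of (example_dict, label, weight); 'feature' is binary
        parent = entropy([(lbl, w) for _, lbl, w in data])
        total = sum(w for _, _, w in data)
        gain = parent
        for value in (True, False):
            subset = [(lbl, w) for ex, lbl, w in data if ex[feature] == value]
            if subset:
                gain -= (sum(w for _, w in subset) / total) * entropy(subset)
        return gain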
The effect of features not relevant to the target concept being given a disproportionate weight can be seen for expressions of complexity 4x8 and 4x10. This can be seen in the lower right-hand corner of Table 3.1.5, where there is a large number of positive rules created to cover the positive examples. In these cases, the features indicating the target concept are very sparse compared to the number of positive examples. When the positive data is over-sampled, irrelevant features are given enough weight relative to the features containing the target concept that the learning algorithm severely overfits the training data, creating 'garbage' rules that partition the data on features not containing the target concept, but that appear in the positive examples.

[12] What is being referred to here is that the negative examples are the counter-examples in the artificial domain and represent everything but the target concept.
3.1.4 Characteristics of the Domain and How They Affect the Results
The characteristics of the artificial domain greatly affect the way in which rule sets are created. The major determining factor in the creation of the rule sets is the fact that the target concept is hidden in the underrepresented class and that the negative examples in the domain have no relevant features. That is, the underrepresented class contains the target concept and the overrepresented class contains everything else. In fact, if over-sampling is used to balance the data sets, expressions of complexity 4x2 to 4x6 could still, on average, attain 100% accuracy on the testing set if only the positive rule sets were used to classify examples, with a default negative rule. In this respect, the artificial domain can be viewed as lending itself to being more of a recognition task than a discrimination task. One would expect classifiers such as BRUTE, which aggressively search hypothesis spaces for rules that cover only the positive examples, to work very well for this task.
3.2 Combination Scheme

This section describes the motivation and architecture behind a system that will be used to increase the performance of a standard classifier over an imbalanced data set. Essentially, the system can be described as a collection of classifiers that combine their predictive outputs to improve classification accuracy on the underrepresented class of an imbalanced data set. Combining classifiers is a technique widely used in machine learning to stabilize predictors and increase performance. One such technique, Adaptive-Boosting, was described in Section 2.2.4 of this thesis.
3.2.1 Motivation
Analyzing the results of the artificial domain showed that the complexity of the target concept can be linked to how an imbalanced data set will affect a classifier. The more complex a target concept, the more sensitive a learner is to an imbalanced data set. Experimentation with the artificial domain also showed that naïvely downsizing and over-sampling are effective methods of improving a classifier's performance on an underrepresented class. As well, it was found that, with the balancing techniques and performance measures used, the optimal level at which a data set should be balanced does not always occur when there is an equal number of positive and negative examples. In fact, we saw in Section 3.1.3.2 that target concepts of various complexities have different optimal balance levels.
In an ideal situation, a learner would be presented with enough training data to be able to divide it into two pools: one pool for training classifiers at various balance levels, and a second for testing to see which balance level is optimal. Classifiers could then be trained using the optimal balance level. This, however, is not always possible with an imbalanced data set, where a very limited number of positive examples is available for training. If one wants to achieve any degree of accuracy over the positive class, all positive examples must be used for the induction process.
Keeping in mind that an optimal balance level cannot be known before testing takes place, a scheme is proposed that combines multiple classifiers which sample data at different rates. Because each classifier in the combination scheme samples data at a different rate, not all classifiers will have an optimal performance, but there is the potential for one (and possibly more) of the classifiers in the system to be optimal. A classifier is defined as optimal if it has a high predictive accuracy over the positive examples without losing its predictive accuracy over the negative examples to an unacceptable level. This loose definition is analogous to Kubat et al. [1998] using the g-mean as a measure of performance. In their study, Kubat et al. [1998] would not consider a classifier to be optimal if it achieved a high degree of accuracy over the underrepresented class by completely losing confidence (i.e., having a low predictive accuracy) on the overrepresented class. In combining multiple classifiers, an attempt will be made to exploit classifiers that have the potential of having an optimal performance. This will be done by excluding from the system classifiers that have an estimated low predictive accuracy on the overrepresented class.
Another motivating factor in the design of the combination scheme can be found in the third test (Section 3.1.3.3). In this test, it was found that the rule sets created by downsizing and over-sampling have different characteristics. This difference in rule sets can be very useful in creating classifiers that complement each other. The greater the difference in the rule sets of combined classifiers, the better the chance for positive examples missed by one classifier to be picked up by another. Because of the difference between the rule sets created by each sampling technique, the combination scheme designed uses both over-sampling and downsizing techniques to create classifiers.
3.2.2 Architecture

The motivation behind the architecture of the combination scheme can be found in [Shimshoni and Intrator, 1998]. Shimshoni and Intrator's domain of interest is the classification of seismic signals. Their application involves classifying seismic signals as either naturally occurring events or artificial in nature (e.g., man-made explosions). In order to more accurately classify seismic signals, Shimshoni and Intrator create what they refer to as an Integrated Classification Machine (ICM). Their ICM is essentially a hierarchy of Artificial Neural Networks (ANNs) that are trained to classify seismic waveforms using different input representations. More specifically, they describe their architecture as having two levels. At the bottom level, ensembles, which are collections of classifiers, are created by combining ANNs that are trained on different bootstrapped samples of data[13] and are combined by averaging their output. At the second level in their hierarchy, the ensembles are combined using a Competing Rejection Algorithm (CRA)[14] that sequentially polls the ensembles, each of which can either classify or reject the signal at hand.

[13] Their method of combining classifiers is known as Bagging [Breiman, 1996]. Bagging essentially creates multiple classifiers by randomly drawing k examples, with replacement, from a training set of data, to train each of x classifiers. The induced classifiers are then combined by averaging their prediction values.
The motivation behind Shimshoni and Intrator's ICM is that by including many ensembles in the system, there will be some that are 'superior' and can perform globally well over the data set. There will, however, be some ensembles that are not 'superior', but perform locally well on some regions of the data. By including these locally optimal ensembles, the system can potentially perform well on regions of the data on which the global ensembles are weak.
"sing Shimshoni and #ntratorRs architecture as a model, the combination scheme presented
in this study combines two e4perts, which are collections of classifiers not based on
different input representations, but based on different sampling techniques9 As in
Shimshoni and #ntratorRs domain, regions of data which one e4pert is weak on ha-e the
potential of being corrected by the other e4pert9 For this study, the correction is an e4pert
being able to correctly classify e4amples of the underrepresented class of which the other
e4pert in the system fails to classify correctly9
The potential for regions of the data on which one sampling technique is weak to be corrected by the other is a result of the two sampling techniques being effective but different in nature. That is, by over-sampling and downsizing the training data, reasonably accurate classifiers can be induced that are different realizations over the training data. The difference was shown by looking at the differences in the rule sets created using each sampling technique. A detailed description of the combination scheme follows.
[14] The CRA algorithm operates by assigning a pre-defined threshold to each ensemble. If an ensemble's prediction confidence falls below this threshold, it is said to be rejected. The confidence score assigned to a prediction is based on the variance of the networks' prediction values (i.e., the amount of agreement among the networks in the ensemble). The threshold can be based on an ensemble's performance over the training data or subjective information.
[Figure: architecture diagram — the input (examples to be classified) feeds a set of Over-sampling Classifiers grouped under an Over-sampling Expert and a set of Downsizing Classifiers grouped under a Downsizing Expert; the two experts' votes are combined to produce the output.]
Figure 3.2.1: Hierarchical structure of the combination scheme. Two experts are combined. The over-sampling expert is made up of classifiers which are trained on data samples containing re-sampled data of the underrepresented class. The downsizing expert is made up of classifiers trained on data samples from which examples of the overrepresented class have been removed.
The combination scheme can be viewed as having three levels. Viewed from the highest level to the lowest, they are:
 - The output level,
 - The expert level, and
 - The classifier level.
3.2.2.1 Classifier Level

The classifier level of the system consists of independently trained classifiers, each associated with a confidence weight. The classifiers at this level are created as follows. First, the training data for the system is divided into two pools. The first pool consists of all of the examples of the underrepresented class that are available and a number of the negative examples (in the study of the artificial domain we were able to create an unlimited number of negative examples); it will be referred to as the training examples. The second pool contains the remainder of the training data (all remaining examples of the overrepresented class) and will be referred to as the weighting examples. The training for each classifier takes place in two stages. In the first stage, the classifiers are trained on the training examples. In the second stage, each classifier is tested on the weighting examples. Each classifier is then assigned a weight proportional to the number of examples that it misclassifies at this stage. The purpose of this weight is to be able to estimate the classifier's performance on the overrepresented class. The reasoning behind this is that although a classifier may be able to label the underrepresented class well, it could do this at the expense of misclassifying a very high proportion of the overrepresented class. This will be explained further in Section 3.2.2.3. The nature of each classifier is described at the expert level.
3.2.2.2 Expert Level

There are two experts in the system, each using a different technique in an attempt to improve the performance of the classifiers on the underrepresented class. One expert attempts to boost performance by over-sampling the positive examples in the training set, the other by downsizing the negative examples. Each expert is made up of a combination of multiple classifiers. One expert, the over-sampling expert, is made up of classifiers which learn on data sets containing re-sampled examples of the underrepresented class. The other expert, referred to as the downsizing expert, is made up of classifiers that learn on data sets from which examples of the overrepresented class have been removed.
At this level each of the two experts independently classifies examples based on the classifications assigned at the classifier level. The classifiers associated with the experts can be arranged in a number of ways. For instance, an expert could tag an example as having a particular class if the majority of the classifiers associated with it tag the example as having that class. In this study a majority vote was not required. Instead, each classifier in the system is associated with a weight that can be either 1 or 0, depending on its performance when tested over the weighting examples. At the expert level an example is classified positive if the sum of its weighted classifier predictions is at least one. In effect, this is saying that an expert will classify an example as positive if one of its associated classifiers tags the example as positive and has been assigned a weight of 1 (i.e., it is allowed to participate in voting).
Over-Sampling Expert

Each classifier associated with this expert over-samples the positive data in the training set at an increasing rate. The interval increases for each classifier such that the final classifier in the system is trained on a data set containing an equal number of positive and negative examples. Therefore, where N classifiers are combined and n+ and n- denote the numbers of positive and negative training examples, the rate at which the positive data is over-sampled at classifier K is (K/N) * (n- - n+). In this study 10 classifiers were used to make up the over-sampling expert.
Do"n'<n% E6/$rt
The downsi+ing e4pert attempts to increase the performance of the classifiers by randomly
remo-ing negati-e e4amples from the training data at increasing numbers9 Each classifier
associated with this e4pert downsi+es the negati-e e4amples in the training set at increased
inter-al, the last of which results in the number of negati-e e4amples equaling the number
of positi-e9 Combining $ classifiers therefore results in the negati-e e4amples being
downsi+ed at a rate of GC 8 $HVGn
3
3 n
I
H e4amples at classifier C9 #n this study ;6 classifiers
were used to make up the downsi+ing e4pert9
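Under these definitions the per-classifier sampling amounts form a simple schedule; a short sketch (my own illustration of the formula above):

    def sampling_schedule(n_pos, n_neg, n_classifiers=10):
        # classifier K adds (over-sampling) or removes (downsizing)
        # (K / N) * (n_neg - n_pos) examples; classifier N is balanced
        N = n_classifiers
        return [round((k / N) * (n_neg - n_pos)) for k in range(1, N + 1)]

    # with the artificial domain's 240 positives and 6000 negatives:
    print(sampling_schedule(240, 6000))   # -> [576, 1152, ..., 5760]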
3.2.2.3 Weighting Scheme

As previously stated, there is a weight associated with each classifier of each expert. For the purposes of this study the weight was used as a threshold to exclude from the system classifiers that incorrectly classify too many examples of the overrepresented class. More specifically, after training the classifiers, each was tested over the weighting examples. Any classifier that incorrectly classifies more examples than a defined threshold is assigned a weight of zero and is therefore excluded from the system. Any classifier that performs below the threshold is assigned a weight of one and participates in voting. The motivation behind this weighting scheme can be seen in Section 3.1.3.2, which shows that a classifier's performance over the positive class is linked to its performance over the negative examples. That is, a+ comes at the expense of a-. The threshold used for the weighting scheme in the artificial domain was 240. The reasoning behind this choice will be explained in Section 3.3.
3.2.2.4 Output Level

At the output level a testing example is determined to have a class based on a vote of the two experts. In its simplest form, an example is considered to have a class if one or both of the experts consider the example as having the class. This was the voting scheme chosen for the artificial domain. Since the purpose of the system is to increase the performance of a standard classifier over the underrepresented class, it was decided to allow an example to be classified as positive without agreement between the two experts. That is, only one expert is required to classify an example as positive for the example to be classified as positive at the output level. If the two experts were required to agree on their predictions, an example classified as positive by one expert, but negative by the other, would be classified as negative at the output level.
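Putting the three levels together, the decision rule reduces to a few lines. The sketch below is my own condensation of the description above, with each expert given as a list of (classifier, weight) pairs whose classifiers return True for a positive prediction:

    def system_predict(example, oversampling_expert, downsizing_expert):
        def expert_votes_positive(expert):
            # an expert votes positive if any weight-1 classifier does
            return any(w and clf(example) for clf, w in expert)

        # output level: positive if either expert votes positive
        if (expert_votes_positive(oversampling_expert)
                or expert_votes_positive(downsizing_expert)):
            return 'positive'
        return 'negative'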
3.3 Testing the Combination Scheme on the Artificial Domain

Setup
In order to test the developed system on the artificial domain the following procedure was used. Thirty sets of data were generated, each containing disjoint training and test sets. Each training set consisted of 240 positive examples and 12,000 negative examples. The negative examples were divided into two equal sets, half of which were pooled with the positive examples to train the system on, and the other half to be used by the system to calculate weights on. This provided a training pool which was severely imbalanced at a ratio of one positive example for every 25 negative examples. The testing sets each contained 2400 examples, half of which were positive and half negative. All results reported are averaged over 30 independent trials.
Threshold Determination

Different thresholds were experimented with to determine which would be optimal. The range of possible thresholds was -1 to 6000, 6000 being the number of weighting examples in the artificial domain. Choosing a threshold of -1 results in no classifier participating in voting, and therefore all examples being classified as negative. Choosing a threshold of 6000 results in all classifiers participating in voting.
Experimenting with various threshold values revealed that varying the thresholds by a small amount did not significantly alter the performance of the system when tested on what were considered fairly simple expressions (e.g., less than 4x5). This is because, for simple expressions, the system is very accurate on negative examples. In fact, if a threshold of 0 is chosen when testing expressions of complexity 4x5 and simpler, 100% accuracy is achieved on both the positive and negative examples. This 100% accuracy came as a result of the downsizing expert containing classifiers that, on average, did not misclassify any negative examples.
Although, individually, the classifiers contained in the downsizing expert were not as accurate over the positive examples as the over-sampling expert's classifiers, their combination could achieve an accuracy of 100% when a threshold of 0 was chosen. This 100% accuracy came as a result of the variance among the classifiers. In other words, although any given classifier in the downsizing expert missed a number of the positive examples, at least one of the other classifiers in the expert picked them up, and they did this without making any mistakes over the negative examples.
Choosing a threshold of 0 for the system is not effective for expressions of greater complexity, for which the classifiers of both experts make mistakes over the negative examples. Because of this, choosing a threshold of 0 would not be effective if one does not know the complexity of the target expression.
Ultimately, the threshold chosen for the weighting scheme was 240. Any classifier that misclassified more than 240 of the weighting examples was assigned a weight of 0; otherwise it was assigned a weight of 1. This threshold was chosen under the rationale that there are equal numbers of negative training examples and weighting examples. Because they are equal, any classifier that misclassifies more than 240 weighting examples has a confidence level of less than 50% over the positive class: with only 240 positive examples available, a classifier producing more than 240 false positives must be correct on fewer than half of the examples it labels positive.
One should also note that because the threshold at each level above this one was set to 1 (i.e., only one vote was necessary at both the expert and output levels), the system configuration can be viewed as essentially 20 classifiers of which only one is required to classify an example positive.
Results

The results given below compare the accuracy of the individual experts in the system to that of the combination of the two experts. They are broken down into the accuracy over the positive examples (a+), the accuracy over the negative examples (a-), and the g-mean (g).
[Figure: bar chart titled "4x8 Expression" — Accuracy (0.7 to 1) on a+, a-, and g for Downsizing, Oversampling, and the Combination.]
Figure 3.3.1: Testing the combination scheme on an imbalanced data set with a target concept complexity of 4x8.
[Figure: bar chart titled "4x10 Expression" — Accuracy (0.6 to 1) on a+, a-, and g for Downsizing, Oversampling, and the Combination.]
Figure 3.3.2: Testing the combination scheme on an imbalanced data set with a target concept complexity of 4x10.
[Figure: bar chart titled "4x12 Expression" — Accuracy (0.5 to 1) on a+, a-, and g for Downsizing, Oversampling, and the Combination.]
Figure 3.3.3: Testing the combination scheme on an imbalanced data set with a target concept complexity of 4x12.
Testing the system on the artificial domain revealed that the over-sampling expert, on average, performed much better than the downsizing expert on a+ for each of the expressions tested. The downsizing expert performed better on average than the over-sampling expert on a-. Combining the experts improved performance on a+ over the best expert for this class, which was the over-sampling expert. As expected, this came at the expense of a-.
Table 3.3.1 shows the overall improvement the combination scheme can achieve over using a single classifier.
Single Classifier (I) Results Compared to a Combination of Classifiers (CS)

               a+              a-              g
  Exp      I      CS       I      CS       I      CS
  4x8      0.338  0.958    1.0    0.925    0.581  0.942
  4x10     0.367  0.895    1.0    0.880    0.554  0.888
  4x12     0.248  0.812    1.0    0.869    0.497  0.839

Table 3.3.1: This table gives the accuracies achieved with a single classifier trained on the imbalanced data set (I) and the combination of classifiers on the imbalanced data set (CS).
Chapter Four

4 TEXT CLASSIFICATION
Chapter 3 introduced a combination scheme designed to improve the performance of a single standard classifier over imbalanced data sets. The purpose of this chapter is to test the combination scheme on the real-world application of text classification. This chapter is divided into seven sections. Section 4.1 introduces text classification and motivates its practical applications. Section 4.2 introduces the Reuters-21578 Text Categorization Test Collection. Sections 4.3 and 4.4 describe how the documents in the test collection will be represented for learning and the performance measures used to compare learners in text classification domains. Section 4.5 gives the statistics that will be used to evaluate the combination scheme, and Section 4.6 some initial results. The chapter concludes with Section 4.7, which compares the performance of the combination scheme designed in Chapter 3 to that of C5.0 invoked with Adaptive-Boosting.
4.1 Text Classification

The process by which a pre-defined label, or a number of labels, is assigned to text documents is known as text classification. Typically, labels are assigned to documents based on the document's content and are used to indicate categories to which the document belongs. In this way, when a document is referred to as belonging to a category, what is being said is that the document has been assigned a label indicating its content. Documents can be grouped into categories based on the labels assigned to them. For example, a news article written about the increasing price of crude oil may be assigned the labels COMMODITY, OPEC, and OIL, indicating that it belongs in those categories. Having accurate labels assigned to documents provides an efficient means for information filtering and information retrieval (IR) tasks.
In an IR setting, pre-defined labels assigned to documents can be used for keyword searches in large databases. Many IR systems rely on documents having accurate labels indicating their content for efficient searches based on keywords. Users of such systems input words indicating the topic of interest. The system can then match these words to the document labels in a database, retrieving those documents whose labels match the words entered by the user.
"sed as a means for information filtering, labels can be used to block documents from
reaching users for which the document would be of no interest9 An e4cellent e4ample of
this is E3mail filtering9 Take a user who has subscribed to a newsgroup that is of a broad
sub%ect such as economic news stories9 The subscriber may only be interested in a small
number of the news stories that are sent to her -ia electronic mail, and may not want to
spend a lot of time weeding through stories that are of no interest to find the few that are9 #f
incoming stories ha-e accurate sub%ect labels, a user can direct a filtering system to only
present those that ha-e a pre3defined label9 For e4ample, someone who is interested in
economic news stories about the price of crude oil may instruct a filtering system to only
accept documents with the label !PEC9 A person who is interested in the price of
commodities in general, may instruct the system to present any document labeled
C!!(#T29 This second label used for filtering would probably result in many more
documents being presented than the label !PEC, because the topic label C!!(#T2 is
much broader in an economic news setting than is !PEC9
Being able to accurately automate the process of assigning categories to documents can be very useful. It is a very time-consuming and potentially expensive task to read large numbers of documents and assign categories to them. These limitations, coupled with the proliferation of electronic material, have led to automated text classification receiving considerable attention from both the information retrieval and machine learning communities.
4.1.1 Text Classification as an Inductive Process
From a machine learning standpoint, categories are referred to as classes. For example, assigning the class CORN to a document means that the document belongs in a category of documents labeled CORN. The same document may also be assigned to the category CROP, but not TREE. With this in mind, text classification can be viewed as a multi-class domain. That is, given a set of documents there is a defined number of distinct categories, which typically can number in the hundreds, to which the documents can be assigned.

The inductive approach to text classification is as follows: given a set of classified documents, induce a classifier that can classify unseen documents. Because there are typically many classes that can be assigned to a document, viewed at a high level, text classification is a multi-class domain. If one expects the system to be able to assign multiple classes to a document (e.g., a document can be assigned the classes COMMODITY, OIL, and OPEC), then the number of possible outcomes for a system can be virtually infinite.
Because text classification is typically associated with a large number of categories that can overlap, documents are viewed as either having a class or not having a class (positive or negative examples). For the multi-class problem, text classification systems use individual classifiers to recognize individual categories. A classifier in the system is trained to recognize a document as having a class, or not having the class. For an N-class problem, a text classification system would consist of N classifiers trained to recognize N categories. A classifier trained to recognize a category, for example TREE, would, for training purposes, view documents categorized as TREE as positive examples and all other documents as negative examples. The same documents categorized as TREE would more than likely be viewed as negative examples by a classifier being trained to recognize documents belonging to the category CAR. Each classifier in the system is trained independently. Figure 4.1.1 gives a pictorial representation of text classification viewed as a collection of binary classifiers.
[Figure: an unseen document is fed to N independently trained binary predictors (Predictor 1, Predictor 2, …, Predictor N), whose combined outputs yield the classified document.]
Figure 4.1.1: Text classification viewed as a collection of binary classifiers. A document to be classified by this system can be assigned any of N categories from N independently trained classifiers. Note that using a collection of classifiers allows a document to be assigned more than one category (in this figure, up to N categories).
Viewing text classification as a collection of two-class problems results in having many times more negative examples than positive examples. Typically, the number of positive examples for a given class is on the order of just a few hundred, while the number of negative examples runs to thousands or even tens of thousands. In the Reuters-21578 data set, the average category has fewer than 250 positive examples, while the total number of negative examples exceeds 10,000.
4.2 Reuters-21578

The Reuters-21578 Text Categorization Test Collection was used to test the combination scheme designed in Chapter 3 on the real-world domain of text categorization. Reuters Ltd. makes the Reuters-21578 Text Categorization Test Collection available for free distribution for research purposes. The corpus is referred to in the literature as Reuters-21578 and can be obtained freely at http://www.research.att.com/~lewis. The Reuters-21578 collection consists of 21,578 documents that appeared on the Reuters newswire in 1987.
The documents were originally assembled and indexed with categories by Reuters Ltd., and later formatted in SGML by David D. Lewis and Stephen Harding. In 1991 and 1992 the collection was formatted further and first made available to the public by Reuters Ltd. as Reuters-22173. In 1996 it was decided that a new version of the collection should be produced with less ambiguous formatting and with documentation on how to use the collection. The opportunity was also taken at this time to correct errors in the categorization and formatting of the documents. Correct, unambiguous formatting by way of SGML markup is very important for making clear the boundaries of each document's fields, such as the title and the topics. In 1996 Steve Finch and David D. Lewis cleaned up the collection and, upon examination, removed 595 articles that were exact duplicates. The modified collection is referred to as Reuters-21578.
Figure 4.2.1 is an example of an article from the Reuters-21578 collection.
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5552" NEWID="9">
<DATE>26-FEB-1987 15:17:11.20</DATE>
<TOPICS><D>earn</D></TOPICS>
<PLACES><D>usa</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN>
&#5;&#5;&#5;F
&#22;&#22;&#1;f0762&#31;reute
r f BC-CHAMPION-PRODUCTS-&lt;CH> 02-26 0066</UNKNOWN>
<TEXT>&#2;
<TITLE>CHAMPION PRODUCTS &lt;CH> APPROVES STOCK SPLIT</TITLE>
<DATELINE> ROCHESTER, N.Y., Feb 26 - </DATELINE><BODY>Champion Products Inc said its
board of directors approved a two-for-one stock split of its common
shares for shareholders of record as of April 1, 1987.
The company also said its board voted to recommend to
shareholders at the annual meeting April 23 an increase in the
authorized capital stock from five mln to 25 mln shares.
Reuter
&#3;</BODY></TEXT>
</REUTERS>
Figure 4.2.1: A Reuters-21578 article (note that the topic for this example is Earn).
4.2.1 Document Formatting

As was previously stated, and can be seen in the example document, the collection is formatted with SGML. Each article in the collection is delimited by the opening tag <REUTERS TOPICS="…" LEWISSPLIT="…" CGISPLIT="…" OLDID="…" NEWID="…"> and by the closing tag </REUTERS>. The only fields used in this study were the <TITLE> and <BODY> fields, which were used to create the document vectors (this will be described in Section 4.3), and the <REUTERS TOPICS=…> field, which was used to determine the category of each document.
4.2.2 Categories

The TOPICS field is very important for the purposes of this study. It contains the topic categories (or class labels) that an article was assigned by the indexers. Not all of the documents were assigned topics, and many were assigned more than one. Some of the topics have no documents assigned to them. Hayes and Weinstein [1990] discuss some of the policies that were used in assigning topics to the documents. The topics assigned to the documents are economic subject areas, the total number of categories for this test collection being 135. The top ten categories, with the number of documents assigned to them, are given in Table 4.2.1.
Top Ten Reuters-21578 Categories

  Class        Document Count
  Earn         3987
  Acq          2448
  MoneyFx      801
  Grain        628
  Crude        634
  Trade        551
  Interest     513
  Ship         305
  Wheat        306
  Corn         254

Table 4.2.1: A list of the top ten categories of the Reuters-21578 test collection and their document count.
4.2.3 Training and Test Sets

In order to be able to compare experimental results on this particular data set, the Reuters-21578 corpus comes with recommended training and testing sets. Researchers who use the same documents for training and testing their systems can then compare results. For the purposes of this study the ModApte split was used. The ModApte split essentially divides the documents into three sets. They are:
 - The training set, which consists of any document in the collection that has at least one category assigned and is dated earlier than April 7th, 1987;
 - The test set, which consists of any document in the collection that has at least one category assigned and is dated April 7th, 1987 or later; and
 - Documents that have no topics assigned to them, which are therefore not used.
Table 4.2.2 lists the number of documents in each set.
ModApte Split Statistics

  Set        Document Count
  Training   9603
  Test       3299
  Unused     8676

Table 4.2.2: Some statistics on the ModApte split. Note that the documents labeled as unused receive this designation because they have no category assigned to them.
Training and testing a classifier to recognize a topic using this split can best be described using an example. Assume that the category of interest is Earn and we want to create a classifier that will be able to distinguish documents of this category from all other documents in the collection. The first step is to label all the documents in the training and testing sets. This is a simple process by which all the documents are either labeled positive, because they are categorized as Earn, or labeled negative, because they are not categorized as Earn. The category Earn has 3987 labeled documents in the collection, so 3987 documents in the combined training and test sets would be labeled as positive and the remaining 8915 documents would be labeled as negative. The labeled training set would then be used to train a classifier, which would be evaluated on the labeled testing set of examples.
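The relabeling step is mechanical; a short sketch (assuming, for illustration only, that each document is available as a (topics, text) pair) would be:

    def label_for_category(documents, category):
        # one-vs-rest labeling: positive iff the category appears in
        # the document's TOPICS field
        return [(text, 1 if category in topics else 0)
                for topics, text in documents]

    # e.g., earn_training = label_for_category(train_docs, 'earn')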
4.3 Document Representation

Before documents can be presented to the classifier for training, some form of representation must be chosen. The most common form of document representation found in the text classification literature is known as a "bag of words" representation. Most representations of this type treat each word, or a set of words, found in the corpus as a feature. Documents are represented as vectors defined over the list of words contained in the corpus. A binary vector representation can be used, having each attribute in the document vector represent the presence or absence of a word in the document. Figure 4.3.1 shows how the sentence "The sky is blue." would be represented using a binary vector representation defined over the set of words {blue, cloud, sky, wind}.
[Figure: the vector (1, 0, 1, 0), its positions labeled blue, cloud, sky, and wind.]
Figure 4.3.1: The binary vector representation of the sentence "The sky is blue." as defined over the set of words {blue, cloud, sky, wind}.
Simple variations of the binary representation include assigning attribute values from the word's frequency in the document (e.g., if the word sky appears in the document three times, a value of three would be assigned in the document vector), or assigning weights to features according to their expected information gain. The latter variation involves prior domain information. For example, in an application in which documents about machine learning are being classified, one may assign a high weight to the terms pruning, tree, and decision, because these terms indicate that the paper is about decision trees. In this study a simple binary representation was used to assign words found in the corpus to features in the document vectors. The next section describes the steps used to process the documents.
4.3.1 Document Processing

Standard techniques were used to create the binary document vectors. The text used for the representation of each document was contained in the <TITLE> and <BODY> fields of the documents. All other fields were ignored. The steps used in the creation of the document vectors are detailed below; a short sketch of the pipeline follows the list.
 - All punctuation and numbers were removed from the documents.
 - The documents were then filtered through a stop word list[15], removing any words contained in the list. Stop words, such as conjunctions and prepositions, are considered to provide no information gain. It is a widely accepted technique to remove these words from a corpus to reduce feature set size.
 - The words in each document were then stemmed using the Lovins stemmer[16]. Stemming maps words to their canonical form; for example, the words golfer, golfing, and golfed would all be mapped to the common stem golf. Stemming is another practice widely used in IR to reduce feature set sizes.
 - Finally, the 500 most frequently occurring features were placed in the feature set to create the document vectors.

[15] The stop word list was obtained at http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop-words.
[16] The Lovins stemmer can be obtained freely at ftp://n106.is.tokushima-u.ac.jp/pub/IR/Iterated-Lovins-stemmer.
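A sketch of this pipeline, under stated assumptions (the stemmer is passed in as a function, since the Lovins stemmer is an external tool, and stop_words is a set loaded from the list cited above):

    import re
    from collections import Counter

    def preprocess(raw_text, stop_words, stem):
        # steps 1-3: strip punctuation and numbers, drop stop words, stem
        tokens = re.sub(r"[^a-z\s]", " ", raw_text.lower()).split()
        return [stem(t) for t in tokens if t not in stop_words]

    def build_vocabulary(token_lists, size=500):
        # step 4: keep the 500 most frequently occurring features
        counts = Counter(t for tokens in token_lists for t in tokens)
        return [term for term, _ in counts.most_common(size)]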
4.3.2 Loss of Information

A feature set size of 500 was used due to space and computing time considerations. Using such a small feature set size, however, has the potential of losing too much information. Typically, feature set sizes used in document representation can be on the order of tens of thousands [Scott, 1999], [Joachims, 1998]. If not enough features are used, there is the potential to miss features that have a high information gain. For example, if all documents with the category CORN have the term crop in them, this feature may be lost due to the fact that there are few documents in the training data with the topic CORN. Removing the term crop from the feature set, by only using the 500 most frequently occurring terms, would remove a feature which provides high information gain when trying to classify documents with the topic CORN.
Joachims [1998] discusses the fact that in high-dimensional spaces, feature selection techniques (such as removing stop words in this study) are employed to remove features that are not considered relevant. This has the effect of reducing the size of the representation and can potentially allow a learner to generalize more accurately over the training data (i.e., avoid overfitting). Joachims [1998], however, demonstrates that there are very few irrelevant features in text classification. He does this through an experiment with the Reuters-21578 Acq category, in which he first ranks features according to their binary information gain. Joachims [1998] then orders the features according to their rank and uses the features with the lowest information gain for document representation. He then trains a naive Bayes classifier using the document representation that is considered to be the "worst", showing that the induced classifier has a performance that is much better than random. By doing this, Joachims [1998] demonstrates that even features which are ranked lowest according to their information gain still contain considerable information.
Potentially throwing away too much information by using such a small feature set size did not appear to greatly hinder the performance of initial tests on the corpus in this study. As will be reported in Section 4.6, there were three categories in which initial benchmark tests performed better than those reported in the literature, four that performed worse, and the other three about the same. For the purposes of this study it was felt that a feature set size of 500 was therefore adequate.
4.4 Performance Measures
In the machine learning community, the standard measure of a system's performance is accuracy. The accuracy of a system can be simply defined as the number of correctly classified examples divided by the total number of examples. However, this measure of performance is not appropriate for data sets of an imbalanced nature. Take, for example, the category CORN in the Reuters-21578 testing set. With 56 positive examples and 3243 negative examples in the testing set, the accuracy attained by classifying each example as negative is 3243 / (3243 + 56) = 98%. Stubbornly classifying all documents as negative for each category, as was done with CORN, and averaging the results for all the categories produces an accuracy of over 95%. While an accuracy of 95% looks impressive, classifying all documents as negative is not very useful. The IR community in particular has defined alternative measures to study a system's performance when faced with an imbalanced data set. In this section I will describe the commonly used F-measure, which combines precision and recall, and the breakeven point. Two averaging techniques known as micro averaging and macro averaging will also be discussed, along with the benefits of each.
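To make the pitfall concrete, the arithmetic for CORN can be checked directly (a small sketch):

    # The all-negative classifier on CORN's test split.
    positives, negatives = 56, 3243
    accuracy = negatives / (negatives + positives)   # correct / total
    print(f"{accuracy:.1%}")                         # 98.3%, yet no positive is ever found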
4.4.1 Precision and Recall
It was previously stated that accuracy is not a fair measure of a system's performance on an imbalanced data set. The IR community, instead, commonly bases the performance of a system on two measures known as precision and recall.

Precision

Precision (P) is the proportion of examples labeled positive by a system that are truly positive. It is defined by:

    P = a / (a + c)

where a is the number of correctly classified positive documents and c is the number of incorrectly classified negative documents. These terms are taken directly out of the confusion matrix (see Section 2.2.1).

In an IR setting, precision can be viewed as a measure of how many documents a system retrieves, relative to the number of positive documents it retrieves. In other words, it gives an indication of to what extent a system retrieves documents of a given class.

Recall

Recall (R) is the proportion of truly positive examples that are labeled by a system as being positive. It is defined by:

    R = a / (a + b)
where a is the number of correctly classified positive documents and b is the number of incorrectly classified positive documents (see Section 2.2.1).

In an IR setting, recall can be viewed as measuring the performance of a system in retrieving documents of a given class. Typically, recall in an IR system comes at the expense of precision.
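In code, with the confusion-matrix counts named as in the definitions above (a = true positives, c = false positives, b = false negatives), a minimal sketch is:

    def precision(a, c):
        # fraction of documents labeled positive that truly are positive
        return a / (a + c)

    def recall(a, b):
        # fraction of truly positive documents that the system labels positive
        return a / (a + b)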
4.4.2 F-measure

A method that is commonly used to measure the performance of a text classification system was proposed by Rijsbergen [1979] and is called the F-measure. This method of evaluation combines the trade-off between the precision and recall of a system by introducing a term B, which indicates the importance of recall relative to precision. The F-measure is defined as:

    F_B = ((B^2 + 1) * P * R) / (B^2 * P + R)

Substituting P and R with the parameters from the confusion matrix gives:

    F_B = ((B^2 + 1) * a) / ((B^2 + 1) * a + B^2 * b + c)

Using the F-measure requires results to be reported for some value of B. Typically the F-measure is reported for the values B = 1 (precision and recall are of equal importance), B = 2 (recall is twice as important as precision), and B = 0.5 (recall is half as important as precision).
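The second formula translates directly into a small helper (a sketch; the symbols follow the confusion matrix above, and the counts a = 519, b = 48, c = 10 are purely illustrative):

    def f_measure(a, b, c, beta):
        # F_B from confusion-matrix counts; beta weights recall vs. precision
        b2 = beta ** 2
        return (b2 + 1) * a / ((b2 + 1) * a + b2 * b + c)

    for beta in (1.0, 2.0, 0.5):
        print(beta, round(f_measure(519, 48, 10, beta), 3))  # 0.947, 0.928, 0.967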
4.4.3 Breakeven Point

The breakeven point is another evaluation method that appears in the literature. This performance measure treats precision and recall as equal tradeoffs. It relies on a system being able to vary a parameter in order to alter the ratio of precision and recall. For example, IR systems that use a cosine similarity test to determine documents' similarity can vary the ratio of precision and recall by adjusting the similarity threshold. As the threshold is lowered, more relevant documents are retrieved, but so, typically, are non-relevant documents. C5.0 can vary the ratio by adjusting the pre- and post-pruning levels of induced rule sets and decision trees. The options that enable this are described in Section 2.2.4.

Being able to adjust the ratio of precision and recall of a system allows various values to be plotted. The breakeven point is calculated by interpolating or extrapolating the point at which precision equals recall. Figures 4.4.1 and 4.4.2 demonstrate how breakeven points are calculated.
[Figure omitted: a recall-versus-precision plot (both axes from 0.6 to 0.9) with an interpolated breakeven point.]
Figure 4.4.1: The dotted line indicates the breakeven point. In this figure the point is interpolated.
[Figure omitted: a recall-versus-precision plot with an extrapolated breakeven point.]
Figure 4.4.2: The dotted line indicates the breakeven point. In this figure the point is extrapolated.
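A sketch of the calculation: given the two closest operating points (p1, r1) and (p2, r2), the crossing of precision and recall along the segment between them is found by solving p(t) = r(t); a solution with t outside [0, 1] corresponds to the extrapolated case:

    def breakeven_point(p1, r1, p2, r2):
        # linear inter/extrapolation of the point where precision = recall
        dp, dr = p2 - p1, r2 - r1
        if dp == dr:
            return None            # segment parallel to the p = r line
        t = (r1 - p1) / (dp - dr)
        return p1 + t * dp         # equals r1 + t * dr at the crossing

    print(breakeven_point(0.80, 0.60, 0.65, 0.75))  # 0.70 (interpolated)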
The breakeven point is a popular method of evaluating performance. Sam Scott [1999] used this method to measure the performance of the RIPPER [Cohen, 1995] system on various representations of the Reuters-21578 corpus and the DigiTrad [Greenhaus, 1996] corpus. Scott compares the evaluation method of the F-measure and the breakeven point and reports a 99.8% correlation between the statistic gathered for the F1-measure and the breakeven point on the two corpora he tested.
4.4.4 Averaging Techniques
The discussion of performance measures, thus far, has been concerned with that of a single class. In real world domains, however, such as text classification, there are often many classes on which a system is tested. Its performance should be evaluated as a whole on the entire set of classes. In order to accommodate a multiple class domain such as text classification, or more generally, systems in which each class is treated as a two class problem (refer to Section 4.1.1), two averaging techniques are commonly used in the literature [Scott 1999]. They are known as micro averaging and macro averaging. These different techniques are best described with an example.
Assume a three class domain (A, B, C) with the following confusion matrices and their associated precision, recall, and F1-measures:

                  Hypothesis            Hypothesis            Hypothesis
                   A: +    -             B: +    -             C: +    -
    Actual  +       519   48    |  +      209  110    |  +       15   31
    Class   -        10  423    |  -      114  567    |  -        3  951

    Precision  = 0.981        Precision  = 0.647        Precision  = 0.833
    Recall     = 0.915        Recall     = 0.655        Recall     = 0.326
    F1-measure = 0.947        F1-measure = 0.651        F1-measure = 0.469
Micro Averaging

Micro averaging considers each class to be of equal importance and is simply an average of all the individually calculated statistics. In this sense, one can look at the micro averaged results as being a normalized average of each class's performance. The micro averaged results for the given example are:

    Precision  = 0.820
    Recall     = 0.632
    F1-measure = 0.689
Macro Averaging

Macro averaging considers all classes to be a single group. In this sense, classes that have a lot of positive examples are given a higher weight. Macro averaged statistics are obtained by adding up the confusion matrices of every class and calculating the desired statistic on the summed matrix. Given the previous example, the summed confusion matrix treating all classes as a single group is:

               Hypothesis
    Macro       +      -
    Actual  +  743    189
    Class   -  127   1941

The macro averaged statistics for the example are therefore:

    Precision  = 0.854
    Recall     = 0.797
    F1-measure = 0.824
Notice that the macro averaged statistics are higher than the micro averaged statistics. This is because class A performs very well and contains many more examples than the other two classes. As this demonstrates, macro averaging gives more weight to easier classes that have more examples to train and test on.
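Both averages can be reproduced from the three matrices above (a sketch following the thesis's definitions, in which micro averaging is the mean of per-class statistics and macro averaging is the statistic of the summed matrix):

    # Each matrix is (a, b, c, d) = (TP, FN, FP, TN).
    matrices = [(519, 48, 10, 423), (209, 110, 114, 567), (15, 31, 3, 951)]

    def f1(a, b, c):
        return 2 * a / (2 * a + b + c)

    micro = sum(f1(a, b, c) for a, b, c, _ in matrices) / len(matrices)
    a, b, c, d = (sum(col) for col in zip(*matrices))   # summed matrix
    macro = f1(a, b, c)
    print(round(micro, 3), round(macro, 3))             # 0.689 0.824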
4.5 Statistics Used in this Study
For the purposes of this study, micro averaged breakeven point statistics are reported to compare initial results to those found in the literature. For the remainder of the chapter, results will be reported using the micro averaged F-measure for values of B = 0.5, 1.0, and 2.0. It is felt that the breakeven point could not be calculated with enough confidence to use it fairly. Reasons for this will be given as warranted.
The remainder of this chapter will be divided into three sections. The first section will give initial results of rule sets created on the Reuters-21578 test collection using C5.0 and compare them to results reported in the literature. The second section will describe the experimental design and give results that are needed in order to test the combination scheme described in Chapter 3. The last section will report the effectiveness of the combination scheme and interpret the results.
4.6 Initial Results
In order to use C5.0 as a rule learner for the Reuters-21578 corpus, the algorithm was initially tested on the data set to determine its effectiveness. It was felt that the algorithm used to test the combination scheme should perform at a level comparable to others used in the literature.
To make the comparison, binary document vectors were created as outlined in Section 4.3. The data set was then split into training and test sets using the ModApte split. By using the ModApte split, the initial results can be compared to those found in the literature that use the same split. The algorithm was then tested on the top ten categories in the corpus that would ultimately be used to test the effectiveness of the combination scheme. To compare the results to those of others, the breakeven point was chosen as a performance measure, as it is used as a performance measure in published results.
Calculating Breakeven Points

Calculating the breakeven point for each of the categories listed in Table 4.6.1 was done using the pruning options made available by C5.0. The process was one by which 10 precision/recall pairs were plotted at various pruning levels, and the two highest points which could be used to linearly interpolate the breakeven point were chosen to make the calculation.
A Comparison of Results

Table 4.6.1 is a list of published results for the top 10 categories of the Reuters-21578 corpus and the initial results obtained using C5.0. All rows are calculated breakeven points, with the bottom row indicating the micro average of the listed categories. The results reported by Joachims [1998] were obtained using C4.5 with a feature set of 9962 distinct terms. In this case Joachims [1998] used a stemmed bag of words representation. The results reported by Scott [1999] were obtained using RIPPER [17] [Cohen, 1995] with a feature set size of 18590 distinct stemmed terms. In both cases Joachims [1998] and Scott [1999] used a binary representation, as was used in this study. The results reported in Table 4.6.1 use only a single classifier.
Breakeven Point Statistics

    CLASS         C4.5 B.E.          RIPPER B.E.         C5.0 B.E.
                  [Joachims, 1998]   [Sam Scott, 1999]   (Present study)
    Earn          0.961              0.959               0.965
    ACQ           0.853              0.861               0.877
    MoneyFx       0.694              0.643               0.689
    Grain         0.891              0.906               0.880
    Crude         0.755              0.802               0.733
    Trade         0.592              0.644               0.615
    Interest      0.491              0.620               0.549
    Ship          0.809              0.777               0.725
    Wheat         0.855              0.848               0.881
    Corn          0.877              0.911               0.832
    Micro Average 0.778              0.797               0.774

Table 4.6.1: A comparison of breakeven points using a single classifier.

[17] RIPPER is based on Furnkranz and Widmer's [1994] Incremental Reduced Error Pruning (IREP) algorithm. IREP creates rules which, in a two class setting, cover the positive class. It does this using an iterative growing and pruning process. To do this, training examples are divided into two groups, one for learning rules and the other for pruning rules. During the growing phase of the process, rules are made more restrictive by adding clauses using Quinlan's [1990] information gain heuristic. Rules are then pruned over the pruning data by removing clauses which cover too many negative examples. After a rule is grown and pruned, the examples it covers are removed from the growing and pruning sets. The process is repeated until all the examples in the growing set are covered by a rule or until a stopping condition is met.
The micro averaged results in Table 4.6.1 show that the use of C5.0 with a feature set size of 500 comes in third place in terms of performance on the categories tested. It does, however, perform very close to the results reported using C4.5 by Joachims [1998].
ACQ: An Example Decision Tree and Rule Set

Figure 4.6.1 is an example of a decision tree created by C5.0 to obtain the preliminary results listed in Table 4.6.1. The category trained for was ACQ. For conciseness, the tree has been truncated to four levels. Each leaf in the tree is represented by two numbers. The first value gives the number of positive examples in the training data that sort to the leaf. The second value is the number of negative examples that sort to the leaf. One interesting observation that can be made is that if the tree were pruned to this level by assigning the classification at each leaf to the majority class (e.g., the rightmost leaf would be assigned the negative class in this tree), the resulting decision tree would only correctly classify 58% (970/1650) of the positive examples in the training data. On the other hand, 97% (7735/7945) of the negative examples in the training data would be classified correctly.
[Figure omitted: the truncated decision tree, whose internal nodes test stemmed features such as acquir, stak, qtr, merger, takeover, vs, and gulf, and whose leaves carry positive/negative training-example counts.]
Figure 4.6.1: A decision tree created using C5.0. The category trained for was ACQ. The tree has been pruned to four levels for conciseness.
Table 4.6.2 is a list of some of the rules extracted from the decision tree in Figure 4.6.1 by C5.0. The total number of rules extracted by C5.0 was 27. There were 7 rules covering positive examples and 20 rules covering negative examples. Only the first three rules are listed for each class.
Extracted Rules

    Positive Rules:
        IF (Acquir = 1 AND Forecast = 0 AND Gulf = 0 AND Qtr = 0 AND Vs = 0)   Class = +
        IF (Qtr = 0 AND Stak = 1)                                              Class = +
        IF (Takeover = 1)                                                      Class = +

    Negative Rules:
        IF (Acquir = 0 AND Debt = 0 AND Merger = 0 AND Ris = 0 AND
            Stak = 0 AND Takeover = 0)                                         Class = -
        IF (Vs = 1)                                                            Class = -
        IF (Qtr = 1)                                                           Class = -

Table 4.6.2: Some of the rules extracted from the decision tree in Figure 4.6.1.
4.7 Testing the Combination Scheme
The purpose of this experiment was to determine if the combination scheme designed in Chapter 3 would result in any significant performance gain over that of a classification scheme using Adaptive-Boosting. It was decided that the combination scheme should not be compared to a single classifier when tested on the real world application of text classification, because it would be unfair to compare a scheme that uses multiple classifiers to one that uses a single classifier. Adaptive-Boosting was chosen as the method used to combine the classifiers because it is state of the art and is provided by C5.0.
4.7.1 Experimental Design
- A process identical to that used to calculate the breakeven points in Section 4.6 was followed, except that only a single data point was needed to calculate the F-measure. The single point was obtained using the default parameters for C5.0, with Adaptive-Boosting invoked to combine 20 classifiers. The F-measures for the values B = 1, 2, and 0.5 were recorded for each of the tested categories.

- For each category the number of positive examples was then reduced to 100 for training purposes. This was done by randomly selecting 100 documents containing the class being trained for and removing all other positive documents from training (a sketch of this reduction follows the list). Note that this meant that when training for each category there were different numbers of training examples. For example, training the system on the category ACQ meant that 1629 documents were removed from the training set, because there were 1729 total positive examples of this class in the training data. Reducing the data set for INTEREST, however, only meant removing 382 documents.

- F-measures were then recorded a second time for each modified category using Boosted C5.0.

- The modified collection was then run through the system designed in Chapter 3 to determine if there was any significant performance gain on the reduced data set when compared to that of Boosted C5.0.
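A sketch of the reduction step, assuming (purely for illustration) that the training set is held as (document_vector, is_positive) pairs for the category at hand:

    import random

    def reduce_positives(training_set, keep=100, seed=0):
        positives = [ex for ex in training_set if ex[1]]
        negatives = [ex for ex in training_set if not ex[1]]
        random.Random(seed).shuffle(positives)
        # retain 100 randomly chosen positives; all negatives stay
        return positives[:keep] + negatives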
4.7.2 Performance with Loss of Examples
Table 4.7.1 [18] compares the F-measures on the original data set before and after the number of positive examples is reduced to 100 for training. It should be noted that the statistic used to measure performance for the remainder of this chapter will be the F-measure. The breakeven point was not used because it was felt that it would not produce a fair measure of performance on the reduced data set. More explicitly, not enough data points, far enough apart to extrapolate or interpolate from, could be produced by varying C5.0's pruning options.
[18] The results are reported using C5.0's Adaptive-Boosting option to combine 20 classifiers. The results using Boosted C5.0 only provided a slight improvement over using a single classifier. The micro averaged results of a single classifier created by invoking C5.0 with its default options were (F1 = 0.768, F2 = 0.745, F0.5 = 0.797) for the original data set, and (F1 = 0.503, F2 = 0.438, F0.5 = 0.619) for the reduced data set. Upon further investigation it was found that this can probably be attributed to the fact that Adaptive-Boosting adds classifiers to the system which, in this case, tended to overfit the training data. By counting the rule sets of subsequent decision trees added to the system, it was found that they steadily grow in size. As decision trees grow in size, they tend to fit the training data more closely. These larger rule set sizes therefore point towards classifiers being added to the system that overfit the data.
F-Measure Comparison

                  Original Performance -       Performance with Example
                  Adaptive Boosting            Loss - Adaptive Boosting
    CLASS         B=1     B=2     B=0.5        B=1     B=2     B=0.5
    Earn          0.972   0.971   0.974        0.931   0.895   0.969
    ACQ           0.898   0.870   0.927        0.240   0.165   0.439
    MoneyFx       0.673   0.626   0.728        0.339   0.250   0.529
    Grain         0.854   0.824   0.886        0.800   0.726   0.890
    Crude         0.813   0.788   0.840        0.582   0.487   0.724
    Trade         0.518   0.460   0.594        0.257   0.183   0.431
    Interest      0.600   0.512   0.728        0.258   0.182   0.441
    Ship          0.751   0.833   0.684        0.233   0.162   0.414
    Wheat         0.905   0.928   0.884        0.873   0.873   0.873
    Corn          0.740   0.960   0.797        0.714   0.658   0.781
    Micro Average 0.773   0.750   0.804        0.523   0.485   0.649

Table 4.7.1: This table compares F-measures of the original data set (original performance) to the reduced data set (performance with example loss). Twenty classifiers were combined using Adaptive-Boosting to produce the results.
[Figure omitted: a bar chart of the micro averaged F-measures (B = 1, 2, 0.5), comparing All Examples Ada-Boost with 100 Positive Examples Ada-Boost.]
Figure 4.7.1: This graph gives a visual representation of the micro averaged results of Table 4.7.1. One can see a significant loss in performance when the number of positive examples is reduced for training.
As Figure 4.7.1 and Table 4.7.1 indicate, reducing the number of positive examples to 100 for training purposes severely hurt the performance of C5.0. The greatest loss in performance is associated with B = 2, where recall is considered twice as important as precision. Note also that the smallest loss occurs when precision is considered twice as important as recall. This is not surprising when one considers that learners tend to overfit the under represented class when faced with an imbalanced data set. The next section gives the results of applying the combination scheme designed in Chapter 3 to the reduced data set.
4.7.3 Applying the Combination Scheme
In order to apply the combination scheme to a collection of data, the training data has to be divided into two pools: one containing examples to train the classifiers on, and one with which to calculate the weighting scheme. The Reuters-21578 test collection provides an ideal set of examples to train the weights on. Using the ModApte split to divide the data leaves 8676 unlabeled documents that are not used for training purposes; these documents were used to train the weights for the system. The next section describes the weighting scheme that was used.
Calculating Thresholds for the Weighting Scheme
Initial tests with the system indicated that a more sophisticated weighting scheme than that used for the artificial domain needed to be implemented for use on the text classification domain. When tested on the artificial domain of k-DNF expressions, a threshold of 240 (the number of positive training examples) was used. With this in mind, an initial threshold of 100 was chosen, as this was the number of positive training documents that remained in the imbalanced data set, and the number of weighting examples was sufficient. Using 100 as a threshold, however, did not allow enough classifiers to participate in voting to make the system useful.
"pon further in-estigation it was reali+ed that the lack of classifiers participating in the
-oting was due to the nature of the weighting data9 The documents used to calculate the
weights were those defined as unused in the odApte split9 #t was assumed at first that the
documents were labeled as unused because they belonged to no category9 This howe-er is
not the case9 The documents defined by the odApte as unused do not recei-e this
designation because they ha-e no category, they are designated as unused because they
;66
were simply not categori+ed by the inde4ers9 This complicates the use of the unlabeled data
to calculate weights9
In the artificial domain the weighting examples were known to be negative. In the text classification domain there may be positive examples contained in the weighting data. It is very difficult to know anything about the nature of the weighting data in the text application, because nothing is known about why the documents were not labeled by the indexers. A weighting scheme based on the assumption that the unlabeled data is negative is flawed. In fact, there may be many positive examples contained in the data labeled as unused.
Instead of making assumptions about the weighting data, the first and last [19] classifiers in each expert are used to estimate the performance of the remaining classifiers in the expert. The weighting scheme is explained as follows.

Let Positive(C) be the number of documents which classifier C classifies as being positive over the weighting data. The threshold for each expert is defined as:

    Threshold = (Positive(C1) + Positive(Cn)) / 2

where C1 is the first classifier in the expert and Cn is the last classifier in the expert.

The weighting scheme for each expert is based on two (one for each expert) independently calculated thresholds. As in the artificial domain, any classifier that performs worse than the threshold calculated for it is assigned a weight of zero; otherwise it is assigned a weight of one.
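A sketch of the weighting rule, assuming positive_counts lists Positive(C_i) for an expert's classifiers in order from first (trained on the raw imbalance) to last (trained on balanced data); the counts used in the example are those reported for the downsizing expert in Table 4.7.2 below:

    def classifier_weights(positive_counts):
        threshold = (positive_counts[0] + positive_counts[-1]) / 2
        # weight 0 (excluded from voting) for classifiers that label more
        # weighting documents positive than the threshold permits
        return [0 if n > threshold else 1 for n in positive_counts]

    downsizing = [105, 114, 191, 160, 312, 363, 275, 451, 549, 648]
    print(classifier_weights(downsizing))   # the last three classifiers drop out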
Table 4.7.2 lists the performance associated with each expert as trained to recognize ACQ and tested over the weighting data. Using the weighting scheme previously defined results in a threshold of (105 + 648) / 2 = 377 for the downsizing expert and (515 + 562) / 2 = 539 for the over-sampling expert. The starred numbers in the table indicate values that are over their calculated thresholds. The classifiers associated with these values would be excluded from voting in the system.

[19] The first classifier in an expert is the classifier that trains on the imbalanced data set without over-sampling or downsizing the data. The last classifier in an expert is the classifier that learns on a balanced data set.
Excluded Classifiers [20]

    Classifier   Downsizing Expert   Over-sampling Expert
    1            105                 515
    2            114                 469
    3            191                 470
    4            160                 465
    5            312                 485
    6            363                 487
    7            275                 521
    8            *451                 484
    9            *549                 471
    10           *648                *562

Table 4.7.2: The starred numbers indicate classifiers that would be excluded from voting. Note that if a threshold of 100 were chosen, no classifiers would be allowed to vote in the system. The category trained for was ACQ.

[20] Note that the classifiers excluded from the system are the last ones (8, 9 and 10 in the downsizing expert and 10 in the over-sampling expert). These classifiers are trained on the data sets which are most balanced. It was shown in Figure 3.1.9 that as the data sets are balanced by adding positive examples or removing negative examples, the induced classifiers have the potential of losing confidence over the negative examples. Since a classifier is included or excluded from the system based on its estimated performance over the negative examples, the classifiers most likely to be excluded from the system are those which are trained on the most balanced data sets.
Results

Figures 4.7.2 to 4.7.4 give the micro averaged results as applied to the reduced data set. The first two bars in each graph indicate the performance resulting from just using the downsizing and over-sampling experts respectively, with no weighting scheme applied. The third bar, indicating the combination without weights, shows the results for the combination of the two experts without the classifier weights being applied. The final bar indicates the results of the full combination scheme applied with the weighting scheme.
[Figure omitted: a bar chart with bars for Downsizing, Over-Sampling, Combined (no weights), and Combined (with weights); the y-axis is the F-measure.]
Figure 4.7.2: Micro averaged F1-measure of each expert and their combination. Here we are considering precision and recall to be of equal importance.
[Figure omitted: the same four bars for the F2-measure.]
Figure 4.7.3: Micro averaged F2-measure of each expert and their combination. Here we are considering recall to be twice as important as precision.
;6<
0 4 ,.!
04)
04)(
04)*
04.2
04.)
6ownsiin! 'verSamplin! Com3ined 1no
wei!hts2
Com3ined 1with
wei!hts2
F
3
m
e
a
s
u
r
e
Figure =9:9<5 icro a-eraged F69>3measure of each e4pert and
their combination9 1ere we are considering precision to be twice
as important as recall9
Discussion

The biggest gains in performance were seen individually by both the downsizing and over-sampling experts. In fact, when considering precision and recall to be of equal importance, both performed almost equally (downsizing achieved F1 = 0.686 and over-sampling achieved F1 = 0.688). The strengths of each expert can be seen in the fact that if one considers recall to be of greater importance, over-sampling outperforms downsizing (over-sampling achieved F2 = 0.713 and downsizing achieved F2 = 0.680). If precision is considered to be of greater importance, downsizing outperforms over-sampling (downsizing achieved F0.5 = 0.670 and over-sampling achieved F0.5 = 0.663).
A considerable performance gain can also be seen in the combination of the two experts with the calculated weighting scheme. Compared to downsizing and over-sampling, there is a 3.4% increase in the F1 measure over the best expert. If one considers precision and recall to be of equal importance, these are encouraging results. If one considers recall to be twice as important as precision, there is a significant performance gain in the combination of the experts, but no real gain is achieved using the weighting scheme; there is only a slight 0.4% improvement.
The only statistic that does not see an improvement in performance using the combination scheme, as opposed to a single expert, is the F0.5 measure (precision twice as important as recall). This can be attributed to the underlying bias in the design of the system, which is to improve performance on the underrepresented class. The system uses downsizing and over-sampling techniques to increase accuracy on the underrepresented class, but this comes at the expense of accuracy on the over represented class. The weighting scheme is used to prevent the system from performing exceptionally well on the underrepresented class but losing confidence overall because it performs poorly (by way of false positives) on the class which dominates the data set.
It is not difficult to bias the system towards precision. A performance gain in the F0.5 measure can be seen if the experts in the system are required to agree on their prediction of an example being positive in order for the example to be considered positive by the system. That is, at the output level of the system an example needs both experts to classify it as being positive in order to be classified as being positive. The following graph shows that adding this condition improves results for the F0.5 measure.
[Figure omitted: a bar chart with bars for Downsizing, Over-Sampling, and Combined (for precision).]
Figure 4.7.5: Combining the experts for precision. In this case the experts are required to agree on an example being positive in order for it to be classified as positive by the system. The results are reported for the F0.5 measure (precision considered twice as important as recall).
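As a sketch of this precision-biased output rule (assuming, purely for illustration, that each expert votes by a simple majority of its weight-1 classifiers; the exact internal voting rule is specified in Chapter 3 and only the final AND matters here):

    def expert_vote(votes, weights):
        # illustrative: majority vote among the classifiers with weight 1
        active = [v for v, w in zip(votes, weights) if w == 1]
        return sum(active) > len(active) / 2

    def system_label(down_votes, down_w, over_votes, over_w):
        # precision-biased rule: both experts must agree that the
        # example is positive for the system to label it positive
        return expert_vote(down_votes, down_w) and expert_vote(over_votes, over_w)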
In order to put things in perspective, a final graph shows the overall results realized by the system. The bars labeled All Examples indicate the maximum performance that could be achieved using Boosted C5.0. The bars labeled 100 Positive Examples show the performance achieved on the reduced data set using Boosted C5.0, and the bars labeled Combination Scheme 100 Positive Examples show the performance achieved using the combination scheme and the weighting scheme designed in Chapter 3 on the reduced data set. It can clearly be seen from Figure 4.7.6 that the combination scheme greatly enhanced the performance of C5.0 on the reduced data set. In fact, if one considers recall to be twice as important as precision, the combination scheme actually performs slightly better on the reduced data set. The micro averaged F2 measure for the balanced data set is 0.750, while a micro averaged measure of F2 = 0.759 is achieved on the reduced data set using the combination scheme.
[Figure omitted: a bar chart of the micro averaged F-measures (B = 1, 2, 0.5) for All Examples, 100 Positive Examples, and Combination Scheme 100 Positive Examples.]
Figure 4.7.6: A comparison of the overall results.
C h a p t e r  F i v e

5 CONCLUSION

This chapter concludes the thesis with a summary of its findings and suggests directions for further research.
5.1 Summary
This thesis began with an examination of the nature of imbalanced data sets. Through experimenting with k-DNF expressions we saw that a typical discrimination based classifier (C5.0) can prove to be an inadequate learner on data sets that are imbalanced. The learner's shortfall manifested itself on the under represented class: excellent performance was achieved on the abundant class, but poor performance was achieved on the rare class. Further experimentation showed that this inadequacy can be linked to concept complexity. That is, learners are more sensitive to concept complexity when presented with an imbalanced data set.
Two balancing techniques were investigated in an attempt to improve C5.0's performance. They were re-sampling the under represented class and downsizing the over represented class. Each method proved to be effective on the domain at hand, as an improvement in performance over the under represented class was seen. This improvement, however, came at the expense of the over represented class.
The main contribution of this thesis lies in the creation of a combination scheme that combines multiple classifiers to improve a standard classifier's performance on imbalanced data sets. The design of the system was motivated through experimentation on the artificial domain of k-DNF expressions. On the artificial domain it was shown that by combining classifiers which sample data at different rates, an improvement in a classifier's performance on an imbalanced data set can be realized in terms of its accuracy over the under represented class.
Pitting the combination scheme against Adaptive-Boosted C5.0 on the domain of text classification showed that it could perform much better on imbalanced data sets than standard learning techniques that use all examples provided. The corpus used to test the system was the Reuters-21578 Text Categorization Test Collection. In fact, when using the micro averaged F-measure as a performance statistic, and testing the combination scheme against Boosted C5.0 on the severely imbalanced data set, the combination scheme achieved results that were superior. The combination scheme designed in Chapter 3 achieved between 10% and 20% higher accuracy rates (depending on which F-measure is used) than those achieved using Boosted C5.0.
Although the results presented in Figure 4.7.6 indicate the combination scheme provides better results than standard classifiers, it should be noted that these statistics could probably be improved by using a larger feature set size. In this study, only 500 terms were used to represent the documents. Ideally, thousands of features should be used for document representation.
It is important not to lose sight of the big picture. Imbalanced data sets occur frequently in domains where data of one class is scarce or difficult to obtain. Text classification is one such domain, in which data sets are typically dominated by negative examples. Attempting to learn to recognize documents of a particular class using standard classification techniques on severely imbalanced data sets can result in a classifier that is very accurate at identifying negative documents, but not very good at identifying positive documents. By combining multiple classifiers which sample the available documents at different rates, better predictive accuracy over the under represented class can be achieved with very few labeled documents. What this means is that if one is presented with very few labeled documents with which to train a classifier, overall results in identifying positive documents will improve by combining multiple classifiers which employ variable sampling techniques.
5.2 Further Research
There are three main extensions to this work. One extension is the choice of sampling techniques used to vary the classifiers in each expert. As the literature survey in Chapter 2 points out, there are many intelligent approaches to sampling the examples on which classifiers are trained. In this thesis the two techniques of naively over-sampling and downsizing were chosen because of their opposing and simplistic natures. Both techniques proved to be sufficient for the task at hand, as the main focus was on the combination of classifiers. This, however, leaves much room for investigating the effect of using more sophisticated sampling techniques in the future.
Another extension to this work is the choice of learning algorithm used. In this study a decision tree learner was chosen for its speed of computation and rule set understandability. Decision trees, however, are not the only available classifier that can be used. The combination scheme was designed independently of the classifiers that it combines. This leaves the choice of classifiers open to the application that the scheme will be applied to. Future work should probably consist of testing the combination scheme with a range of classifiers on different applications. An interesting set of practical applications on which to test the combination scheme is the various data mining tasks, such as direct marketing [Ling and Li, 1998], that are typically associated with imbalanced data sets.
Further research should be done to test the impact of automated text classification systems upon real-life scenarios. This thesis explored text classification as a practical application for a novel combination scheme. The combination scheme was shown to be effective using the F-measure. The F-measure as a performance gauge, however, does not demonstrate that the system can actually classify documents in such a way as to enable users to organize and then search for documents relevant to their needs. Ideally, a system should meet the needs of real-world applications, and the combination scheme presented in this thesis should be tested to see if it can meet those needs, in particular for text classification.
BIBLIOGRAPHY
Breiman, L., 'Bagging Predictors', Machine Learning, vol. 24, pp. 123-140, 1996.

Breiman, L., Friedman, J., Olshen, R., and Stone, C., Classification and Regression Trees, Wadsworth, 1984.

Cohen, W., 'Fast Effective Rule Induction', in Proc. ICML-95, pp. 115-123.

Schaffer, C., 'Overfitting Avoidance as Bias', Machine Learning, 10:153-178, 1993.

Eavis, T., and Japkowicz, N., 'A Recognition-Based Alternative to Discrimination-Based Multi-Layer Perceptrons', in the Proceedings of the Thirteenth Canadian Conference on Artificial Intelligence (AI'2000).

Fawcett, T. E. and Provost, F., 'Adaptive Fraud Detection', Data Mining and Knowledge Discovery, 1(3):291-316, 1997.

Freund, Y., and Schapire, R., 'A decision-theoretic generalization of on-line learning and an application to boosting', Journal of Computer and System Sciences, 55(1):119-139, 1997.

Hayes, P., and Weinstein, S., 'A System for Content-Based Indexing of a Database of News Stories', in the Second Annual Conference on Innovative Applications of Artificial Intelligence, 1990.

Japkowicz, N., 'The Class Imbalance Problem: Significance and Strategies', in the Proceedings of the 2000 International Conference on Artificial Intelligence (IC-AI'2000): Special Track on Inductive Learning, 2000.

Japkowicz, N., Myers, C., and Gluck, M., 'A Novelty Detection Approach to Classification', in the Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI-95), pp. 518-523, 1995.

Kubat, M. and Matwin, S., 'Addressing the Curse of Imbalanced Data Sets: One Sided Sampling', in the Proceedings of the Fourteenth International Conference on Machine Learning, pp. 179-186, 1997.

Kubat, M., Holte, R. and Matwin, S., 'Machine Learning for the Detection of Oil Spills in Radar Images', Machine Learning, 30:195-215, 1998.

LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., and Jackel, L., 'Backpropagation Applied to Handwritten Zip Code Recognition', Neural Computation, 1(4), 1989.

Lewis, D., and Catlett, J., 'Heterogeneous Uncertainty Sampling for Supervised Learning', in the Proceedings of the Eleventh International Conference on Machine Learning, pp. 148-156, 1994.

Lewis, D., and Gale, W., 'Training Text Classifiers by Uncertainty Sampling', in the Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1994.

Ling, C. and Li, C., 'Data Mining for Direct Marketing: Problems and Solutions', in the Proceedings of KDD-98, 1998.

Mitchell, T., Machine Learning, McGraw-Hill, 1997.

Pazzani, M., Merz, C., Murphy, P., Ali, K., Hume, T., and Brunk, C., 'Reducing Misclassification Costs', in the Proceedings of the Eleventh International Conference on Machine Learning, pp. 217-225, 1994.

Pomerleau, D., 'Knowledge-based Training of Artificial Neural Networks for Autonomous Robot Driving', in Connel, J., and Mahadevan, S. (Eds.), Robot Learning, 1993.

Quinlan, J., 'Induction of Decision Trees', Machine Learning, 1(1):81-106, 1986.

Quinlan, J., 'Learning Logical Definitions from Relations', Machine Learning, 5, 239-266, 1990.

Quinlan, J., C4.5: Programs for Machine Learning, San Mateo, CA: Morgan Kaufmann, 1993.

Quinlan, J., 'C5.0: An Informal Tutorial', http://www.rulequest.com/see5-win.html, 2000.

Riddle, P., Segal, R., and Etzioni, O., 'Representation Design and Brute-force Induction in a Boeing Manufacturing Domain', Applied Artificial Intelligence, 8:125-147, 1994.

Rijsbergen, C., Information Retrieval, second edition, London: Butterworths, 1979.

Scott, S., 'Feature Engineering for Text Classification', in the Proceedings of the Sixteenth International Conference on Machine Learning (ICML-99), pp. 379-388, 1999.

Shimshoni, Y., and Intrator, N., 'Classification of Seismic Signals by Integrating Ensembles of Neural Networks', IEEE Transactions on Signal Processing, Special Issue on NN, 46(5), 1998.

Swets, J., 'Measuring the Accuracy of Diagnostic Systems', Science, 240, 1285-1293, 1988.

Thorsten, J., 'Text Categorization with Support Vector Machines: Learning with Many Relevant Features', in Proc. ECML-98, pp. 137-142.

Tomek, I., 'Two Modifications of CNN', IEEE Transactions on Systems, Man and Communications, SMC-6, 769-772, 1976.