
Classification: Basic Concepts and Decision Trees
A programming task
Classification: Definition

- Given a collection of records (training set)
  - Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
  - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Illustrating Classification Task

Training set (Learn Model):

  Tid | Attrib1 | Attrib2 | Attrib3 | Class
   1  |  Yes    | Large   | 125K    | No
   2  |  No     | Medium  | 100K    | No
   3  |  No     | Small   | 70K     | No
   4  |  Yes    | Medium  | 120K    | No
   5  |  No     | Large   | 95K     | Yes
   6  |  No     | Medium  | 60K     | No
   7  |  Yes    | Large   | 220K    | No
   8  |  No     | Small   | 85K     | Yes
   9  |  No     | Medium  | 75K     | No
  10  |  No     | Small   | 90K     | Yes

Test set (Apply Model):

  Tid | Attrib1 | Attrib2 | Attrib3 | Class
  11  |  No     | Small   | 55K     | ?
  12  |  Yes    | Medium  | 80K     | ?
  13  |  Yes    | Large   | 110K    | ?
  14  |  No     | Small   | 95K     | ?
  15  |  No     | Large   | 67K     | ?

A model is learned from the training set and then applied to assign class labels to the test set.
Examples of Classification Task

- Predicting tumor cells as benign or malignant
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Using Distance

- Place items in the class to which they are closest.
- Must determine the distance between an item and a class.
- Classes can be represented by:
  - Centroid: central value.
  - Medoid: representative point.
  - Individual points.
Algorithm: KNN

K Nearest Neighbor (KNN):
- The training set includes class labels.
- Examine the K items nearest to the item to be classified.
- The new item is placed in the class with the largest number of close items.
- O(q) per tuple to be classified (here q is the size of the training set).
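As a minimal sketch of the KNN rule described above (the Euclidean distance and K = 3 are illustrative choices, not fixed by the slides):

```python
import numpy as np

def knn_classify(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Distance from x to every training record: O(q) work for q records.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]                    # indices of the k closest items
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                   # class with the most close items

# Toy usage: two well-separated 2-D classes.
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)
y = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_classify(X, y, np.array([1.5, 1.5])))        # -> "A"
```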
Classification Techniques

- Decision Tree based methods
- Rule-based methods
- Memory based reasoning
- Neural Networks
- Naïve Bayes and Bayesian Belief Networks
- Support Vector Machines
Example of a Decision Tree

Training data (attribute types: categorical, categorical, continuous; class = Cheat):

  Tid | Refund | Marital Status | Taxable Income | Cheat
   1  |  Yes   | Single         | 125K           | No
   2  |  No    | Married        | 100K           | No
   3  |  No    | Single         | 70K            | No
   4  |  Yes   | Married        | 120K           | No
   5  |  No    | Divorced       | 95K            | Yes
   6  |  No    | Married        | 60K            | No
   7  |  Yes   | Divorced       | 220K           | No
   8  |  No    | Single         | 85K            | Yes
   9  |  No    | Married        | 75K            | No
  10  |  No    | Single         | 90K            | Yes

Model: Decision Tree (Refund, MarSt, and TaxInc are the splitting attributes):

  Refund?
  - Yes -> NO
  - No  -> MarSt?
           - Married -> NO
           - Single, Divorced -> TaxInc?
                                 - < 80K -> NO
                                 - > 80K -> YES
Another Example of Decision Tree

Using the same training data (Tid 1-10, shown above), a different tree also fits:

  MarSt?
  - Married -> NO
  - Single, Divorced -> Refund?
                        - Yes -> NO
                        - No  -> TaxInc?
                                 - < 80K -> NO
                                 - > 80K -> YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

A decision tree is learned from the training set (Tid 1-10, shown earlier) and then applied to the test set (Tid 11-15) to predict the unknown class labels.
Apply Model to Test Data

Test record:

  Refund | Marital Status | Taxable Income | Cheat
  No     | Married        | 80K            | ?

Start from the root of the tree:

  Refund?
  - Yes -> NO
  - No  -> MarSt?
           - Married -> NO
           - Single, Divorced -> TaxInc?
                                 - < 80K -> NO
                                 - > 80K -> YES
Tracing the test record through the tree: Refund = No, so follow the No branch to the MarSt test; MarSt = Married, so the record reaches the leaf labeled NO.

Assign Cheat to "No".
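As a sketch, the example tree can be written as a plain Python function (the function and dictionary keys below are our own naming; the thresholds and branch labels come from the slide):

```python
def classify(record):
    """Hand-coded version of the example decision tree from the slides."""
    if record["Refund"] == "Yes":
        return "No"                          # Refund = Yes        -> Cheat = No
    if record["MarSt"] == "Married":
        return "No"                          # Married             -> Cheat = No
    # Single or Divorced: split on Taxable Income at the 80K threshold
    return "No" if record["TaxInc"] < 80 else "Yes"

test = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(test))                        # -> "No"  (matches "Assign Cheat to No")
```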
Decision Tree Classification Task

As before: the decision tree is learned from the training set (Tid 1-10) and applied to the test set (Tid 11-15) to assign the missing class labels.
Decision Tree Induction

- Many algorithms:
  - Hunt's Algorithm (one of the earliest)
  - CART
  - ID3, C4.5
  - SLIQ, SPRINT
General Structure of Hunt's Algorithm

Let D_t be the set of training records that reach a node t.

General procedure:
- If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled as y_t.
- If D_t is an empty set, then t is a leaf node labeled with the default class y_d.
- If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.

(Illustrated with the training records Tid 1-10 shown earlier; at the root, D_t is the entire training set.)
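A minimal recursive sketch of the procedure above. For illustration it simply splits on the first remaining attribute; a real implementation would choose the attribute test that optimizes an impurity criterion, as discussed later. All names here are our own.

```python
from collections import Counter

def hunt(records, attributes, default="No"):
    """records: list of (attr_dict, label) pairs; attributes: list of attribute names."""
    if not records:                               # D_t is empty -> leaf with the default class
        return default
    labels = [label for _, label in records]
    if len(set(labels)) == 1:                     # all records in the same class -> leaf
        return labels[0]
    if not attributes:                            # nothing left to split on -> majority class
        return Counter(labels).most_common(1)[0][0]
    attr = attributes[0]                          # illustrative choice of the splitting attribute
    node = {"split_on": attr, "children": {}}
    for value in {r[attr] for r, _ in records}:   # one child per observed attribute value
        subset = [(r, y) for r, y in records if r[attr] == value]
        node["children"][value] = hunt(subset, attributes[1:], default)
    return node
```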
Hunt's Algorithm

Applied to the training data (Tid 1-10 shown earlier), the tree grows in stages:

1. Start with a single node predicting Don't Cheat (the majority class).
2. Split on Refund: Yes -> Don't Cheat; No -> Don't Cheat.
3. Refine the Refund = No branch by splitting on Marital Status: Single, Divorced -> Cheat; Married -> Don't Cheat.
4. Refine the Single, Divorced branch by splitting on Taxable Income: < 80K -> Don't Cheat; >= 80K -> Cheat.
Tree Induction

- Greedy strategy:
  - Split the records based on an attribute test that optimizes a certain criterion.
- Issues:
  - Determine how to split the records
    - How to specify the attribute test condition?
    - How to determine the best split?
  - Determine when to stop splitting
How to Specify Test Condition?

- Depends on attribute types:
  - Nominal
  - Ordinal
  - Continuous
- Depends on the number of ways to split:
  - 2-way split
  - Multi-way split
Splitting Based on Nominal Attributes

- Multi-way split: use as many partitions as distinct values.
  CarType -> Family | Sports | Luxury
- Binary split: divides the values into two subsets; need to find the optimal partitioning.
  CarType -> {Sports, Luxury} vs {Family}, OR {Family, Luxury} vs {Sports}
Splitting Based on Ordinal Attributes

- Multi-way split: use as many partitions as distinct values.
  Size -> Small | Medium | Large
- Binary split: divides the values into two subsets; need to find the optimal partitioning.
  Size -> {Small, Medium} vs {Large}, OR {Medium, Large} vs {Small}
- What about the split {Small, Large} vs {Medium}? (It groups non-adjacent values and so violates the order of the ordinal attribute.)
Splitting Based on Continuous Attributes

- Different ways of handling:
  - Discretization to form an ordinal categorical attribute
    - Static: discretize once at the beginning
    - Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
  - Binary decision: (A < v) or (A >= v)
    - Consider all possible splits and find the best cut
    - Can be more compute intensive
Splitting Based on Continuous Attributes (figure: binary and multi-way splits on Taxable Income)
Tree Induction (recap): the next issue is how to determine the best split.
How to determine the Best Split

Before splitting: 10 records of class 0 and 10 records of class 1. Which test condition is the best?

- Greedy approach: nodes with a homogeneous class distribution are preferred.
- Need a measure of node impurity:
  - Non-homogeneous: high degree of impurity
  - Homogeneous: low degree of impurity
Measures of Node Impurity

- Gini Index
- Entropy
- Misclassification error
How to Find the Best Split

Before splitting, the node has class counts (C0: N00, C1: N01) and impurity M0. A candidate test A? creates Node N1 (C0: N10, C1: N11) and Node N2 (C0: N20, C1: N21), whose impurities M1 and M2 combine (weighted) into M12. A candidate test B? creates Node N3 (C0: N30, C1: N31) and Node N4 (C0: N40, C1: N41), combining into M34.

Gain of A = M0 - M12, gain of B = M0 - M34: choose the test with the larger gain.
Measure of Impurity: GINI

- Gini index for a given node t:

  GINI(t) = 1 - \sum_j [p(j|t)]^2

  (NOTE: p(j|t) is the relative frequency of class j at node t.)

- Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying least interesting information
- Minimum (0.0) when all records belong to one class, implying most interesting information

Examples (two classes):
  C1 = 0, C2 = 6: Gini = 0.000
  C1 = 1, C2 = 5: Gini = 0.278
  C1 = 2, C2 = 4: Gini = 0.444
  C1 = 3, C2 = 3: Gini = 0.500
Examples for computing GINI

GINI(t) = 1 - \sum_j [p(j|t)]^2

- C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0
- C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
  Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
- C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
  Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
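The examples above can be reproduced with a small helper (a sketch; the function name is ours):

```python
def gini(counts):
    """Gini index of a node from its per-class record counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))           # 0.0   (pure node)
print(round(gini([1, 5]), 3)) # 0.278
print(round(gini([2, 4]), 3)) # 0.444
print(gini([3, 3]))           # 0.5   (maximum for two classes)
```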
Splitting Based on GINI

- Used in CART, SLIQ, SPRINT.
- When a node p is split into k partitions (children), the quality of the split is computed as

  GINI_split = \sum_{i=1}^{k} (n_i / n) GINI(i)

  where n_i = number of records at child i, and n = number of records at node p.
Binary Attributes: Computing GINI Index

- Splits into two partitions
- Effect of weighting partitions: larger and purer partitions are sought.

Parent: C1 = 6, C2 = 6, Gini = 0.500

Split B? into Node N1 and Node N2:

       N1   N2
  C1    5    1
  C2    2    4

Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
Gini(Children) = 7/12 * 0.408 + 5/12 * 0.320 = 0.371
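A sketch of the weighted GINI_split computation applied to the count matrix above (helper names are ours):

```python
def gini(counts):
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini of the children: sum_i (n_i / n) * GINI(i)."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# Binary split B? from the slide: N1 = (C1=5, C2=2), N2 = (C1=1, C2=4).
print(round(gini_split([(5, 2), (1, 4)]), 3))   # -> 0.371
```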
Categorical Attributes: Computing Gini Index

- For each distinct value, gather counts for each class in the dataset
- Use the count matrix to make decisions

Multi-way split:

  CarType | Family | Sports | Luxury
  C1      |   1    |   2    |   1
  C2      |   4    |   1    |   1
  Gini = 0.393

Two-way split (find the best partition of values):

  CarType | {Sports, Luxury} | {Family}
  C1      |        3         |    1
  C2      |        2         |    4
  Gini = 0.400

  CarType | {Sports} | {Family, Luxury}
  C1      |    2     |        2
  C2      |    1     |        5
  Gini = 0.419
Continuous Attributes: Computing Gini Index

- Use binary decisions based on one value
- Several choices for the splitting value
  - Number of possible splitting values = number of distinct values
- Each splitting value v has a count matrix associated with it
  - Class counts in each of the partitions, A < v and A >= v
- Simple method to choose the best v:
  - For each v, scan the database to gather the count matrix and compute its Gini index
  - Computationally inefficient! Repetition of work.

(Illustrated with the Taxable Income attribute of the training records Tid 1-10 shown earlier, e.g. the test "Taxable Income > 80K?".)
Continuous Attributes: Computing Gini Index...

- For efficient computation, for each attribute:
  - Sort the attribute on its values
  - Linearly scan these values, each time updating the count matrix and computing the Gini index
  - Choose the split position that has the least Gini index

Sorted values (Taxable Income) with class labels:

  Cheat:          No  No  No  Yes Yes Yes No   No   No   No
  Taxable Income: 60  70  75  85  90  95  100  120  125  220

  Split positions: 55    65    72    80    87    92    97    110   122   172   230
  Gini:            0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The minimum Gini (0.300) is obtained at split position 97, i.e. Taxable Income <= 97 vs. > 97.
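A sketch of the sort-and-scan procedure (our own helper names). It reproduces the best cut near Taxable Income = 97 with Gini 0.300; here candidate cuts are taken as midpoints between adjacent sorted values, so the reported position is 97.5.

```python
from collections import Counter

def gini(counter):
    """Gini index from a Counter of class labels."""
    n = sum(counter.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counter.values())

def best_numeric_split(values, labels):
    """Sort once, then scan linearly, updating class counts at each candidate cut."""
    data = sorted(zip(values, labels))
    total, left = Counter(labels), Counter()
    n, best = len(data), (None, 1.0)
    for i in range(n - 1):
        left[data[i][1]] += 1                      # move one record into the left partition
        cut = (data[i][0] + data[i + 1][0]) / 2    # candidate position between adjacent values
        w = (i + 1) / n
        g = w * gini(left) + (1 - w) * gini(total - left)
        if g < best[1]:
            best = (cut, round(g, 3))
    return best

income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat  = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_numeric_split(income, cheat))           # -> (97.5, 0.3)
```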
Alternative Splitting Criteria based on INFO

- Entropy at a given node t:

  Entropy(t) = - \sum_j p(j|t) \log p(j|t)

  (NOTE: p(j|t) is the relative frequency of class j at node t.)

- Measures the homogeneity of a node.
  - Maximum (log n_c) when records are equally distributed among all classes, implying least information
  - Minimum (0.0) when all records belong to one class, implying most information
- Entropy-based computations are similar to the GINI index computations.
Examples for computing Entropy

Entropy(t) = - \sum_j p(j|t) \log_2 p(j|t)

- C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Entropy = -0 log 0 - 1 log 1 = 0 - 0 = 0
- C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
  Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65
- C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
  Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
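The same counts run through a small entropy helper (a sketch; 0 log 0 is taken as 0):

```python
from math import log2

def entropy(counts):
    """Entropy of a node from per-class counts, in bits."""
    n = sum(counts)
    probs = [c / n for c in counts if c > 0]
    return -sum(p * log2(p) for p in probs)

print(round(entropy([0, 6]), 2))   # 0.0
print(round(entropy([1, 5]), 2))   # 0.65
print(round(entropy([2, 4]), 2))   # 0.92
```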
Splitting Based on INFO...

- Information Gain:

  GAIN_split = Entropy(p) - \sum_{i=1}^{k} (n_i / n) Entropy(i)

  where the parent node p is split into k partitions and n_i is the number of records in partition i.

- Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
- Used in ID3 and C4.5.
- Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
Splitting Based on INFO...

- Gain Ratio:

  GainRATIO_split = GAIN_split / SplitINFO,   where   SplitINFO = - \sum_{i=1}^{k} (n_i / n) \log (n_i / n)

  (the parent node p is split into k partitions; n_i is the number of records in partition i)

- Adjusts the Information Gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitioning (a large number of small partitions) is penalized!
- Used in C4.5.
- Designed to overcome the disadvantage of Information Gain.
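A sketch combining GAIN_split and SplitINFO into the gain ratio (the toy split at the end is our own example, not from the slides):

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gain_ratio(parent_counts, child_counts):
    """Information gain divided by SplitINFO (entropy of the partition sizes)."""
    n = sum(parent_counts)
    sizes = [sum(c) for c in child_counts]
    gain = entropy(parent_counts) - sum(s / n * entropy(c)
                                        for s, c in zip(sizes, child_counts))
    split_info = -sum(s / n * log2(s / n) for s in sizes if s > 0)
    return gain / split_info if split_info > 0 else 0.0

# Parent (C1=3, C2=3) split into two pure children of 3 records each:
print(round(gain_ratio((3, 3), [(3, 0), (0, 3)]), 3))   # 1.0 (gain 1 bit, SplitINFO 1 bit)
```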
Splitting Criteria based on Classification Error

- Classification error at a node t:

  Error(t) = 1 - \max_i P(i|t)

- Measures the misclassification error made by a node.
  - Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying least interesting information
  - Minimum (0.0) when all records belong to one class, implying most interesting information
Examples for Computing Error

Error(t) = 1 - \max_i P(i|t)

- C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Error = 1 - max(0, 1) = 1 - 1 = 0
- C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
  Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6
- C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
  Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3
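The classification-error examples, reproduced with a one-line helper (a sketch):

```python
def classification_error(counts):
    """Error(t) = 1 - max_i P(i | t), from per-class counts."""
    n = sum(counts)
    return 1.0 - max(counts) / n

print(classification_error([0, 6]))             # 0.0
print(round(classification_error([1, 5]), 3))   # 0.167  (= 1/6)
print(round(classification_error([2, 4]), 3))   # 0.333  (= 1/3)
```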
Comparison among Splitting Criteria

For a 2-class problem: (figure comparing Gini, Entropy, and Misclassification error over the range of class probabilities).

Misclassification Error vs Gini

Parent: C1 = 7, C2 = 3, Gini = 0.42

Split A? into Node N1 and Node N2:

       N1   N2
  C1    3    4
  C2    0    3

Gini(N1) = 1 - (3/3)^2 - (0/3)^2 = 0
Gini(N2) = 1 - (4/7)^2 - (3/7)^2 = 0.489
Gini(Children) = 3/10 * 0 + 7/10 * 0.489 = 0.342

Gini improves after the split, while the misclassification error stays at 3/10 both before and after.
Tree Induction (recap): the remaining issue is when to stop splitting.
Stopping Criteria for Tree Induction

- Stop expanding a node when all the records belong to the same class
- Stop expanding a node when all the records have similar attribute values
- Early termination (to be discussed later)
Decision Tree Based Classification

- Advantages:
  - Inexpensive to construct
  - Extremely fast at classifying unknown records
  - Easy to interpret for small-sized trees
  - Accuracy is comparable to other classification techniques for many simple data sets
Example: C4.5

- Simple depth-first construction.
- Uses Information Gain.
- Sorts continuous attributes at each node.
- Needs the entire data to fit in memory.
- Unsuitable for large datasets.
  - Needs out-of-core sorting.
- You can download the software from:
  http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
Practical Issues of Classification

- Underfitting and Overfitting
- Missing Values
- Costs of Classification
Underfitting and Overfitting (Example)

500 circular and 500 triangular data points.

Circular points: 0.5 <= sqrt(x_1^2 + x_2^2) <= 1
Triangular points: sqrt(x_1^2 + x_2^2) > 1 or sqrt(x_1^2 + x_2^2) < 0.5
Underfitting and Overfitting

(Plot of training and test error versus model complexity.) Underfitting: when the model is too simple, both training and test errors are large. Overfitting: as the model becomes too complex, training error keeps decreasing while test error starts to increase.
Overfitting due to Noise

The decision boundary is distorted by the noise point.
Overfitting due to Insufficient Examples

The lack of data points in the lower half of the diagram makes it difficult to predict the class labels of that region correctly.
- An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.
Notes on Overfitting

- Overfitting results in decision trees that are more complex than necessary
- Training error no longer provides a good estimate of how well the tree will perform on previously unseen records
- Need new ways of estimating errors
Estimating Generalization Errors

- Re-substitution errors: error on the training set (sum of e(t))
- Generalization errors: error on the test set (sum of e'(t))
- Methods for estimating generalization errors:
  - Optimistic approach: e'(t) = e(t)
  - Pessimistic approach:
    - For each leaf node: e'(t) = e(t) + 0.5
    - Total errors: e'(T) = e(T) + N * 0.5 (N: number of leaf nodes)
    - For a tree with 30 leaf nodes and 10 errors on training (out of 1000 instances):
      Training error = 10/1000 = 1%
      Generalization error = (10 + 30 * 0.5)/1000 = 2.5%
  - Reduced error pruning (REP):
    - Uses a validation data set to estimate the generalization error
Occam's Razor

- Given two models with similar generalization errors, one should prefer the simpler model over the more complex model
- For complex models, there is a greater chance that the model was fitted accidentally by errors in the data
- Therefore, one should include model complexity when evaluating a model
Minimum Description Length (MDL)

- Cost(Model, Data) = Cost(Data|Model) + Cost(Model)
  - Cost is the number of bits needed for encoding.
  - Search for the least costly model.
- Cost(Data|Model) encodes the misclassification errors.
- Cost(Model) uses node encoding (number of children) plus splitting-condition encoding.

(Figure: the class labels y of records X_1 ... X_n are transmitted either explicitly or by sending a decision tree over attributes A, B, C together with its misclassifications.)
How to Address Overfitting

- Pre-Pruning (Early Stopping Rule)
  - Stop the algorithm before it becomes a fully-grown tree
  - Typical stopping conditions for a node:
    - Stop if all instances belong to the same class
    - Stop if all the attribute values are the same
  - More restrictive conditions:
    - Stop if the number of instances is less than some user-specified threshold
    - Stop if the class distribution of instances is independent of the available features (e.g., using the chi-square test)
    - Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).
How to Address Overfitting...

- Post-pruning
  - Grow the decision tree to its entirety
  - Trim the nodes of the decision tree in a bottom-up fashion
  - If the generalization error improves after trimming, replace the sub-tree with a leaf node.
  - The class label of the leaf node is determined from the majority class of instances in the sub-tree
  - Can use MDL for post-pruning
Example of Post-Pruning

Node A? with Class = Yes: 20, Class = No: 10 (Error = 10/30), split into four children A1-A4:

  A1: Class = Yes: 8, Class = No: 4
  A2: Class = Yes: 3, Class = No: 4
  A3: Class = Yes: 4, Class = No: 1
  A4: Class = Yes: 5, Class = No: 1

Training error (before splitting) = 10/30
Pessimistic error (before splitting) = (10 + 0.5)/30 = 10.5/30
Training error (after splitting) = 9/30
Pessimistic error (after splitting) = (9 + 4 * 0.5)/30 = 11/30

The pessimistic error increases after splitting: PRUNE!
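The pruning decision above boils down to comparing two pessimistic estimates; a sketch (function and parameter names are ours):

```python
def pessimistic_error(errors, leaves, n, penalty=0.5):
    """Pessimistic estimate: (training errors + penalty per leaf) / n records."""
    return (errors + penalty * leaves) / n

before = pessimistic_error(errors=10, leaves=1, n=30)   # 10.5/30 = 0.350
after  = pessimistic_error(errors=9,  leaves=4, n=30)   # 11/30   = 0.367
print("PRUNE" if after >= before else "KEEP")            # -> PRUNE
```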
Examples of Post-pruning

Case 1: children with (C0: 11, C1: 3) and (C0: 2, C1: 4)
Case 2: children with (C0: 14, C1: 3) and (C0: 2, C1: 2)

- Optimistic error? Don't prune for both cases.
- Pessimistic error? Don't prune case 1, prune case 2.
- Reduced error pruning? Depends on the validation set.
Handling Missing Attribute Values

- Missing values affect decision tree construction in three different ways:
  - They affect how impurity measures are computed
  - They affect how to distribute an instance with a missing value to child nodes
  - They affect how a test instance with a missing value is classified
Computing Impurity Measure

Training data: the same 10 records as before, except that Tid 10 has a missing Refund value (Refund = ?).

Class counts by Refund value:

                 Class = Yes   Class = No
  Refund = Yes        0             3
  Refund = No         2             4
  Refund = ?          1             0

Before splitting: Entropy(Parent) = -0.3 log(0.3) - 0.7 log(0.7) = 0.8813

Split on Refund (the record with the missing value is excluded from the children):
  Entropy(Refund = Yes) = 0
  Entropy(Refund = No) = -(2/6) log(2/6) - (4/6) log(4/6) = 0.9183
  Entropy(Children) = 0.3 * 0 + 0.6 * 0.9183 = 0.551

Gain = 0.9 * (0.8813 - 0.551) ≈ 0.297
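A sketch reproducing the computation above: the record with Refund = ? is left out of the children, and the gain is scaled by the fraction (9/10) of records with a known Refund value.

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

parent = (3, 7)                       # 3 Yes, 7 No over all 10 records
children = [(0, 3), (2, 4)]           # Refund = Yes and Refund = No; the Refund = ? record excluded
n = sum(parent)                       # 10
n_known = sum(sum(c) for c in children)                              # 9 records with known Refund

children_entropy = sum(sum(c) / n * entropy(c) for c in children)    # 0.3*0 + 0.6*0.9183 = 0.551
gain = (n_known / n) * (entropy(parent) - children_entropy)          # 0.9 * (0.8813 - 0.551)
print(round(gain, 3))                                                # ~0.297
```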
Distribute Instances

The nine records with a known Refund value are sent to the corresponding child:

  Refund = Yes: Class = Yes: 0, Class = No: 3
  Refund = No:  Class = Yes: 2, Class = No: 4

For record Tid 10 (Refund = ?, Single, 90K, Class = Yes):
  Probability that Refund = Yes is 3/9
  Probability that Refund = No is 6/9
  Assign the record to the left child (Refund = Yes) with weight 3/9 and to the right child (Refund = No) with weight 6/9.

Resulting weighted counts:
  Refund = Yes: Class = Yes: 0 + 3/9, Class = No: 3
  Refund = No:  Class = Yes: 2 + 6/9, Class = No: 4
Classify Instances

New record: Tid 11, Refund = No, Marital Status = ?, Taxable Income = 85K, Class = ?

Using the decision tree from before (Refund, then MarSt, then TaxInc), the record follows the Refund = No branch and reaches the MarSt test with a missing value. The weighted training counts at that node are:

               Married   Single   Divorced   Total
  Class = No      3         1        0         4
  Class = Yes    6/9        1        1        2.67
  Total          3.67       2        1        6.67

Probability that Marital Status = Married is 3.67/6.67
Probability that Marital Status = {Single, Divorced} is 3/6.67
Scalable Decision Tree Induction Methods

- SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute list data structure
- PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning: stop growing the tree earlier
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): builds an AVC-list (attribute, value, class label)
- BOAT (PODS'99, Gehrke, Ganti, Ramakrishnan & Loh): uses bootstrapping to create several small samples