
Classification: Basic Concepts and Decision Trees
A programming task
Classification: Definition

- Given a collection of records (training set)
  - Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
  - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Illustrating Classification Task

Training set (Learn Model):

  Tid | Attrib1 | Attrib2 | Attrib3 | Class
   1  |  Yes    | Large   | 125K    | No
   2  |  No     | Medium  | 100K    | No
   3  |  No     | Small   | 70K     | No
   4  |  Yes    | Medium  | 120K    | No
   5  |  No     | Large   | 95K     | Yes
   6  |  No     | Medium  | 60K     | No
   7  |  Yes    | Large   | 220K    | No
   8  |  No     | Small   | 85K     | Yes
   9  |  No     | Medium  | 75K     | No
  10  |  No     | Small   | 90K     | Yes

Test set (Apply Model):

  Tid | Attrib1 | Attrib2 | Attrib3 | Class
  11  |  No     | Small   | 55K     | ?
  12  |  Yes    | Medium  | 80K     | ?
  13  |  Yes    | Large   | 110K    | ?
  14  |  No     | Small   | 95K     | ?
  15  |  No     | Large   | 67K     | ?

A model is learned from the training set and then applied to assign class labels to the test set.
Examples of Classification Task

- Predicting tumor cells as benign or malignant
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Using Distance

- Place items in the class to which they are closest.
- Must determine the distance between an item and a class.
- Classes can be represented by:
  - Centroid: central value.
  - Medoid: representative point.
  - Individual points.
Algorithm: KNN

K Nearest Neighbor (KNN):
- The training set includes class labels.
- Examine the K items nearest to the item to be classified.
- The new item is placed in the class with the largest number of close items.
- O(q) per tuple to be classified (here q is the size of the training set).
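As a minimal sketch of the KNN rule described above (the Euclidean distance and K = 3 are illustrative choices, not fixed by the slides):

```python
import numpy as np

def knn_classify(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Distance from x to every training record: O(q) work for q records.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]                    # indices of the k closest items
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                   # class with the most close items

# Toy usage: two well-separated 2-D classes.
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)
y = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_classify(X, y, np.array([1.5, 1.5])))        # -> "A"
```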
Classification Techniques

- Decision Tree based methods
- Rule-based methods
- Memory based reasoning
- Neural Networks
- Naïve Bayes and Bayesian Belief Networks
- Support Vector Machines
Example of a Decision Tree

Training data (attribute types: categorical, categorical, continuous; class = Cheat):

  Tid | Refund | Marital Status | Taxable Income | Cheat
   1  |  Yes   | Single         | 125K           | No
   2  |  No    | Married        | 100K           | No
   3  |  No    | Single         | 70K            | No
   4  |  Yes   | Married        | 120K           | No
   5  |  No    | Divorced       | 95K            | Yes
   6  |  No    | Married        | 60K            | No
   7  |  Yes   | Divorced       | 220K           | No
   8  |  No    | Single         | 85K            | Yes
   9  |  No    | Married        | 75K            | No
  10  |  No    | Single         | 90K            | Yes

Model: Decision Tree (Refund, MarSt, and TaxInc are the splitting attributes):

  Refund?
  - Yes -> NO
  - No  -> MarSt?
           - Married -> NO
           - Single, Divorced -> TaxInc?
                                 - < 80K -> NO
                                 - > 80K -> YES
Another Example of Decision Tree

Using the same training data (Tid 1-10, shown above), a different tree also fits:

  MarSt?
  - Married -> NO
  - Single, Divorced -> Refund?
                        - Yes -> NO
                        - No  -> TaxInc?
                                 - < 80K -> NO
                                 - > 80K -> YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

A decision tree is learned from the training set (Tid 1-10, shown earlier) and then applied to the test set (Tid 11-15) to predict the unknown class labels.
Apply Model to Test Data

Test record:

  Refund | Marital Status | Taxable Income | Cheat
  No     | Married        | 80K            | ?

Start from the root of the tree:

  Refund?
  - Yes -> NO
  - No  -> MarSt?
           - Married -> NO
           - Single, Divorced -> TaxInc?
                                 - < 80K -> NO
                                 - > 80K -> YES
Tracing the test record through the tree: Refund = No, so follow the No branch to the MarSt test; MarSt = Married, so the record reaches the leaf labeled NO.

Assign Cheat to "No".
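As a sketch, the example tree can be written as a plain Python function (the function and dictionary keys below are our own naming; the thresholds and branch labels come from the slide):

```python
def classify(record):
    """Hand-coded version of the example decision tree from the slides."""
    if record["Refund"] == "Yes":
        return "No"                          # Refund = Yes        -> Cheat = No
    if record["MarSt"] == "Married":
        return "No"                          # Married             -> Cheat = No
    # Single or Divorced: split on Taxable Income at the 80K threshold
    return "No" if record["TaxInc"] < 80 else "Yes"

test = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(test))                        # -> "No"  (matches "Assign Cheat to No")
```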
Decision Tree Classification Task

As before: the decision tree is learned from the training set (Tid 1-10) and applied to the test set (Tid 11-15) to assign the missing class labels.
Decision Tree Induction

- Many algorithms:
  - Hunt's Algorithm (one of the earliest)
  - CART
  - ID3, C4.5
  - SLIQ, SPRINT
General Structure of Hunt's Algorithm

Let D_t be the set of training records that reach a node t.

General procedure:
- If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled as y_t.
- If D_t is an empty set, then t is a leaf node labeled with the default class y_d.
- If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.

(Illustrated with the training records Tid 1-10 shown earlier; at the root, D_t is the entire training set.)
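A minimal recursive sketch of the procedure above. For illustration it simply splits on the first remaining attribute; a real implementation would choose the attribute test that optimizes an impurity criterion, as discussed later. All names here are our own.

```python
from collections import Counter

def hunt(records, attributes, default="No"):
    """records: list of (attr_dict, label) pairs; attributes: list of attribute names."""
    if not records:                               # D_t is empty -> leaf with the default class
        return default
    labels = [label for _, label in records]
    if len(set(labels)) == 1:                     # all records in the same class -> leaf
        return labels[0]
    if not attributes:                            # nothing left to split on -> majority class
        return Counter(labels).most_common(1)[0][0]
    attr = attributes[0]                          # illustrative choice of the splitting attribute
    node = {"split_on": attr, "children": {}}
    for value in {r[attr] for r, _ in records}:   # one child per observed attribute value
        subset = [(r, y) for r, y in records if r[attr] == value]
        node["children"][value] = hunt(subset, attributes[1:], default)
    return node
```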
Hunt's Algorithm

Applied to the training data (Tid 1-10 shown earlier), the tree grows in stages:

1. Start with a single node predicting Don't Cheat (the majority class).
2. Split on Refund: Yes -> Don't Cheat; No -> Don't Cheat.
3. Refine the Refund = No branch by splitting on Marital Status: Single, Divorced -> Cheat; Married -> Don't Cheat.
4. Refine the Single, Divorced branch by splitting on Taxable Income: < 80K -> Don't Cheat; >= 80K -> Cheat.
Tree Induction

- Greedy strategy:
  - Split the records based on an attribute test that optimizes a certain criterion.
- Issues:
  - Determine how to split the records
    - How to specify the attribute test condition?
    - How to determine the best split?
  - Determine when to stop splitting
How to Specify Test Condition?

- Depends on attribute types:
  - Nominal
  - Ordinal
  - Continuous
- Depends on the number of ways to split:
  - 2-way split
  - Multi-way split
Splitting Based on Nominal Attributes

- Multi-way split: use as many partitions as distinct values.
  CarType -> Family | Sports | Luxury
- Binary split: divides the values into two subsets; need to find the optimal partitioning.
  CarType -> {Sports, Luxury} vs {Family}, OR {Family, Luxury} vs {Sports}
Splitting Based on Ordinal Attributes

- Multi-way split: use as many partitions as distinct values.
  Size -> Small | Medium | Large
- Binary split: divides the values into two subsets; need to find the optimal partitioning.
  Size -> {Small, Medium} vs {Large}, OR {Medium, Large} vs {Small}
- What about the split {Small, Large} vs {Medium}? (It groups non-adjacent values and so violates the order of the ordinal attribute.)
Splitting Based on Continuous Attributes

- Different ways of handling:
  - Discretization to form an ordinal categorical attribute
    - Static: discretize once at the beginning
    - Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
  - Binary decision: (A < v) or (A >= v)
    - Consider all possible splits and find the best cut
    - Can be more compute intensive
Splitting Based on Continuous Attributes (figure: binary and multi-way splits on Taxable Income)
Tree Induction (recap): the next issue is how to determine the best split.
How to determine the Best Split

Before splitting: 10 records of class 0 and 10 records of class 1. Which test condition is the best?

- Greedy approach: nodes with a homogeneous class distribution are preferred.
- Need a measure of node impurity:
  - Non-homogeneous: high degree of impurity
  - Homogeneous: low degree of impurity
Measures of Node Impurity

- Gini Index
- Entropy
- Misclassification error
How to Find the Best Split

Before splitting, the node has class counts (C0: N00, C1: N01) and impurity M0. A candidate test A? creates Node N1 (C0: N10, C1: N11) and Node N2 (C0: N20, C1: N21), whose impurities M1 and M2 combine (weighted) into M12. A candidate test B? creates Node N3 (C0: N30, C1: N31) and Node N4 (C0: N40, C1: N41), combining into M34.

Gain of A = M0 - M12, gain of B = M0 - M34: choose the test with the larger gain.
Measure of Impurity: GINI

- Gini index for a given node t:

  GINI(t) = 1 - \sum_j [p(j|t)]^2

  (NOTE: p(j|t) is the relative frequency of class j at node t.)

- Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying least interesting information
- Minimum (0.0) when all records belong to one class, implying most interesting information

Examples (two classes):
  C1 = 0, C2 = 6: Gini = 0.000
  C1 = 1, C2 = 5: Gini = 0.278
  C1 = 2, C2 = 4: Gini = 0.444
  C1 = 3, C2 = 3: Gini = 0.500
Examples for computing GINI

GINI(t) = 1 - \sum_j [p(j|t)]^2

- C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0
- C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
  Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
- C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
  Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
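The examples above can be reproduced with a small helper (a sketch; the function name is ours):

```python
def gini(counts):
    """Gini index of a node from its per-class record counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))           # 0.0   (pure node)
print(round(gini([1, 5]), 3)) # 0.278
print(round(gini([2, 4]), 3)) # 0.444
print(gini([3, 3]))           # 0.5   (maximum for two classes)
```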
Splitting Based on GINI

- Used in CART, SLIQ, SPRINT.
- When a node p is split into k partitions (children), the quality of the split is computed as

  GINI_split = \sum_{i=1}^{k} (n_i / n) GINI(i)

  where n_i = number of records at child i, and n = number of records at node p.
Binary Attributes: Computing GINI Index

- Splits into two partitions
- Effect of weighting partitions: larger and purer partitions are sought.

Parent: C1 = 6, C2 = 6, Gini = 0.500

Split B? into Node N1 and Node N2:

       N1   N2
  C1    5    1
  C2    2    4

Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
Gini(Children) = 7/12 * 0.408 + 5/12 * 0.320 = 0.371
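A sketch of the weighted GINI_split computation applied to the count matrix above (helper names are ours):

```python
def gini(counts):
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini of the children: sum_i (n_i / n) * GINI(i)."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# Binary split B? from the slide: N1 = (C1=5, C2=2), N2 = (C1=1, C2=4).
print(round(gini_split([(5, 2), (1, 4)]), 3))   # -> 0.371
```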
Categorical Attributes: Computing Gini Index

- For each distinct value, gather counts for each class in the dataset
- Use the count matrix to make decisions

Multi-way split:

  CarType | Family | Sports | Luxury
  C1      |   1    |   2    |   1
  C2      |   4    |   1    |   1
  Gini = 0.393

Two-way split (find the best partition of values):

  CarType | {Sports, Luxury} | {Family}
  C1      |        3         |    1
  C2      |        2         |    4
  Gini = 0.400

  CarType | {Sports} | {Family, Luxury}
  C1      |    2     |        2
  C2      |    1     |        5
  Gini = 0.419
Continuous Attributes: Computing Gini Index

- Use binary decisions based on one value
- Several choices for the splitting value
  - Number of possible splitting values = number of distinct values
- Each splitting value v has a count matrix associated with it
  - Class counts in each of the partitions, A < v and A >= v
- Simple method to choose the best v:
  - For each v, scan the database to gather the count matrix and compute its Gini index
  - Computationally inefficient! Repetition of work.

(Illustrated with the Taxable Income attribute of the training records Tid 1-10 shown earlier, e.g. the test "Taxable Income > 80K?".)
Continuous Attributes: Computing Gini Index...

- For efficient computation, for each attribute:
  - Sort the attribute on its values
  - Linearly scan these values, each time updating the count matrix and computing the Gini index
  - Choose the split position that has the least Gini index

Sorted values (Taxable Income) with class labels:

  Cheat:          No  No  No  Yes Yes Yes No   No   No   No
  Taxable Income: 60  70  75  85  90  95  100  120  125  220

  Split positions: 55    65    72    80    87    92    97    110   122   172   230
  Gini:            0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The minimum Gini (0.300) is obtained at split position 97, i.e. Taxable Income <= 97 vs. > 97.
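A sketch of the sort-and-scan procedure (our own helper names). It reproduces the best cut near Taxable Income = 97 with Gini 0.300; here candidate cuts are taken as midpoints between adjacent sorted values, so the reported position is 97.5.

```python
from collections import Counter

def gini(counter):
    """Gini index from a Counter of class labels."""
    n = sum(counter.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counter.values())

def best_numeric_split(values, labels):
    """Sort once, then scan linearly, updating class counts at each candidate cut."""
    data = sorted(zip(values, labels))
    total, left = Counter(labels), Counter()
    n, best = len(data), (None, 1.0)
    for i in range(n - 1):
        left[data[i][1]] += 1                      # move one record into the left partition
        cut = (data[i][0] + data[i + 1][0]) / 2    # candidate position between adjacent values
        w = (i + 1) / n
        g = w * gini(left) + (1 - w) * gini(total - left)
        if g < best[1]:
            best = (cut, round(g, 3))
    return best

income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat  = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_numeric_split(income, cheat))           # -> (97.5, 0.3)
```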
Alternative Splitting Criteria based on INFO

- Entropy at a given node t:

  Entropy(t) = - \sum_j p(j|t) \log p(j|t)

  (NOTE: p(j|t) is the relative frequency of class j at node t.)

- Measures the homogeneity of a node.
  - Maximum (log n_c) when records are equally distributed among all classes, implying least information
  - Minimum (0.0) when all records belong to one class, implying most information
- Entropy-based computations are similar to the GINI index computations.
Examples for computing Entropy

Entropy(t) = - \sum_j p(j|t) \log_2 p(j|t)

- C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Entropy = -0 log 0 - 1 log 1 = 0 - 0 = 0
- C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
  Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65
- C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
  Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
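The same counts run through a small entropy helper (a sketch; 0 log 0 is taken as 0):

```python
from math import log2

def entropy(counts):
    """Entropy of a node from per-class counts, in bits."""
    n = sum(counts)
    probs = [c / n for c in counts if c > 0]
    return -sum(p * log2(p) for p in probs)

print(round(entropy([0, 6]), 2))   # 0.0
print(round(entropy([1, 5]), 2))   # 0.65
print(round(entropy([2, 4]), 2))   # 0.92
```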
Splitting Based on INFO...

- Information Gain:

  GAIN_split = Entropy(p) - \sum_{i=1}^{k} (n_i / n) Entropy(i)

  where the parent node p is split into k partitions and n_i is the number of records in partition i.

- Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
- Used in ID3 and C4.5.
- Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
Splitting Based on INFO...

- Gain Ratio:

  GainRATIO_split = GAIN_split / SplitINFO,   where   SplitINFO = - \sum_{i=1}^{k} (n_i / n) \log (n_i / n)

  (the parent node p is split into k partitions; n_i is the number of records in partition i)

- Adjusts the Information Gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitioning (a large number of small partitions) is penalized!
- Used in C4.5.
- Designed to overcome the disadvantage of Information Gain.
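A sketch combining GAIN_split and SplitINFO into the gain ratio (the toy split at the end is our own example, not from the slides):

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gain_ratio(parent_counts, child_counts):
    """Information gain divided by SplitINFO (entropy of the partition sizes)."""
    n = sum(parent_counts)
    sizes = [sum(c) for c in child_counts]
    gain = entropy(parent_counts) - sum(s / n * entropy(c)
                                        for s, c in zip(sizes, child_counts))
    split_info = -sum(s / n * log2(s / n) for s in sizes if s > 0)
    return gain / split_info if split_info > 0 else 0.0

# Parent (C1=3, C2=3) split into two pure children of 3 records each:
print(round(gain_ratio((3, 3), [(3, 0), (0, 3)]), 3))   # 1.0 (gain 1 bit, SplitINFO 1 bit)
```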
Splitting Criteria based on Classification Error

- Classification error at a node t:

  Error(t) = 1 - \max_i P(i|t)

- Measures the misclassification error made by a node.
  - Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying least interesting information
  - Minimum (0.0) when all records belong to one class, implying most interesting information
Examples for Computing Error

Error(t) = 1 - \max_i P(i|t)

- C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Error = 1 - max(0, 1) = 1 - 1 = 0
- C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
  Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6
- C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
  Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3
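The classification-error examples, reproduced with a one-line helper (a sketch):

```python
def classification_error(counts):
    """Error(t) = 1 - max_i P(i | t), from per-class counts."""
    n = sum(counts)
    return 1.0 - max(counts) / n

print(classification_error([0, 6]))             # 0.0
print(round(classification_error([1, 5]), 3))   # 0.167  (= 1/6)
print(round(classification_error([2, 4]), 3))   # 0.333  (= 1/3)
```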
Comparison among Splitting Criteria

For a 2-class problem: (figure comparing Gini, Entropy, and Misclassification error over the range of class probabilities).

Misclassification Error vs Gini

Parent: C1 = 7, C2 = 3, Gini = 0.42

Split A? into Node N1 and Node N2:

       N1   N2
  C1    3    4
  C2    0    3

Gini(N1) = 1 - (3/3)^2 - (0/3)^2 = 0
Gini(N2) = 1 - (4/7)^2 - (3/7)^2 = 0.489
Gini(Children) = 3/10 * 0 + 7/10 * 0.489 = 0.342

Gini improves after the split, while the misclassification error stays at 3/10 both before and after.
Tree Induction (recap): the remaining issue is when to stop splitting.
Stopping Criteria for Tree Induction

- Stop expanding a node when all the records belong to the same class
- Stop expanding a node when all the records have similar attribute values
- Early termination (to be discussed later)
Decision Tree Based Classification

- Advantages:
  - Inexpensive to construct
  - Extremely fast at classifying unknown records
  - Easy to interpret for small-sized trees
  - Accuracy is comparable to other classification techniques for many simple data sets
Example: C4.5

- Simple depth-first construction.
- Uses Information Gain.
- Sorts continuous attributes at each node.
- Needs the entire data to fit in memory.
- Unsuitable for large datasets.
  - Needs out-of-core sorting.
- You can download the software from:
  http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
Practical Issues of Classification

- Underfitting and Overfitting
- Missing Values
- Costs of Classification
Underfitting and Overfitting (Example)

500 circular and 500 triangular data points.

Circular points: 0.5 <= sqrt(x_1^2 + x_2^2) <= 1
Triangular points: sqrt(x_1^2 + x_2^2) > 1 or sqrt(x_1^2 + x_2^2) < 0.5
Underfitting and Overfitting

(Plot of training and test error versus model complexity.) Underfitting: when the model is too simple, both training and test errors are large. Overfitting: as the model becomes too complex, training error keeps decreasing while test error starts to increase.
Overfitting due to Noise

The decision boundary is distorted by the noise point.
Overfitting due to Insufficient Examples

The lack of data points in the lower half of the diagram makes it difficult to predict the class labels of that region correctly.
- An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.
Notes on Overfitting

- Overfitting results in decision trees that are more complex than necessary
- Training error no longer provides a good estimate of how well the tree will perform on previously unseen records
- Need new ways of estimating errors
Estimating Generalization Errors

- Re-substitution errors: error on the training set (sum of e(t))
- Generalization errors: error on the test set (sum of e'(t))
- Methods for estimating generalization errors:
  - Optimistic approach: e'(t) = e(t)
  - Pessimistic approach:
    - For each leaf node: e'(t) = e(t) + 0.5
    - Total errors: e'(T) = e(T) + N * 0.5 (N: number of leaf nodes)
    - For a tree with 30 leaf nodes and 10 errors on training (out of 1000 instances):
      Training error = 10/1000 = 1%
      Generalization error = (10 + 30 * 0.5)/1000 = 2.5%
  - Reduced error pruning (REP):
    - Uses a validation data set to estimate the generalization error
Occam's Razor

- Given two models with similar generalization errors, one should prefer the simpler model over the more complex model
- For complex models, there is a greater chance that the model was fitted accidentally by errors in the data
- Therefore, one should include model complexity when evaluating a model
Minimum Description Length (MDL)

- Cost(Model, Data) = Cost(Data|Model) + Cost(Model)
  - Cost is the number of bits needed for encoding.
  - Search for the least costly model.
- Cost(Data|Model) encodes the misclassification errors.
- Cost(Model) uses node encoding (number of children) plus splitting-condition encoding.

(Figure: the class labels y of records X_1 ... X_n are transmitted either explicitly or by sending a decision tree over attributes A, B, C together with its misclassifications.)
How to Address Overfitting

- Pre-Pruning (Early Stopping Rule)
  - Stop the algorithm before it becomes a fully-grown tree
  - Typical stopping conditions for a node:
    - Stop if all instances belong to the same class
    - Stop if all the attribute values are the same
  - More restrictive conditions:
    - Stop if the number of instances is less than some user-specified threshold
    - Stop if the class distribution of instances is independent of the available features (e.g., using the chi-square test)
    - Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).
How to Address Overfitting...

- Post-pruning
  - Grow the decision tree to its entirety
  - Trim the nodes of the decision tree in a bottom-up fashion
  - If the generalization error improves after trimming, replace the sub-tree with a leaf node.
  - The class label of the leaf node is determined from the majority class of instances in the sub-tree
  - Can use MDL for post-pruning
Example of Post-Pruning

Node A? with Class = Yes: 20, Class = No: 10 (Error = 10/30), split into four children A1-A4:

  A1: Class = Yes: 8, Class = No: 4
  A2: Class = Yes: 3, Class = No: 4
  A3: Class = Yes: 4, Class = No: 1
  A4: Class = Yes: 5, Class = No: 1

Training error (before splitting) = 10/30
Pessimistic error (before splitting) = (10 + 0.5)/30 = 10.5/30
Training error (after splitting) = 9/30
Pessimistic error (after splitting) = (9 + 4 * 0.5)/30 = 11/30

The pessimistic error increases after splitting: PRUNE!
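The pruning decision above boils down to comparing two pessimistic estimates; a sketch (function and parameter names are ours):

```python
def pessimistic_error(errors, leaves, n, penalty=0.5):
    """Pessimistic estimate: (training errors + penalty per leaf) / n records."""
    return (errors + penalty * leaves) / n

before = pessimistic_error(errors=10, leaves=1, n=30)   # 10.5/30 = 0.350
after  = pessimistic_error(errors=9,  leaves=4, n=30)   # 11/30   = 0.367
print("PRUNE" if after >= before else "KEEP")            # -> PRUNE
```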
Examples of Post-pruning

Case 1: children with (C0: 11, C1: 3) and (C0: 2, C1: 4)
Case 2: children with (C0: 14, C1: 3) and (C0: 2, C1: 2)

- Optimistic error? Don't prune for both cases.
- Pessimistic error? Don't prune case 1, prune case 2.
- Reduced error pruning? Depends on the validation set.
Handling Missing Attribute Values

- Missing values affect decision tree construction in three different ways:
  - They affect how impurity measures are computed
  - They affect how to distribute an instance with a missing value to child nodes
  - They affect how a test instance with a missing value is classified
Computing Impurity Measure

Training data: the same 10 records as before, except that Tid 10 has a missing Refund value (Refund = ?).

Class counts by Refund value:

                 Class = Yes   Class = No
  Refund = Yes        0             3
  Refund = No         2             4
  Refund = ?          1             0

Before splitting: Entropy(Parent) = -0.3 log(0.3) - 0.7 log(0.7) = 0.8813

Split on Refund (the record with the missing value is excluded from the children):
  Entropy(Refund = Yes) = 0
  Entropy(Refund = No) = -(2/6) log(2/6) - (4/6) log(4/6) = 0.9183
  Entropy(Children) = 0.3 * 0 + 0.6 * 0.9183 = 0.551

Gain = 0.9 * (0.8813 - 0.551) ≈ 0.297
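A sketch reproducing the computation above: the record with Refund = ? is left out of the children, and the gain is scaled by the fraction (9/10) of records with a known Refund value.

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

parent = (3, 7)                       # 3 Yes, 7 No over all 10 records
children = [(0, 3), (2, 4)]           # Refund = Yes and Refund = No; the Refund = ? record excluded
n = sum(parent)                       # 10
n_known = sum(sum(c) for c in children)                              # 9 records with known Refund

children_entropy = sum(sum(c) / n * entropy(c) for c in children)    # 0.3*0 + 0.6*0.9183 = 0.551
gain = (n_known / n) * (entropy(parent) - children_entropy)          # 0.9 * (0.8813 - 0.551)
print(round(gain, 3))                                                # ~0.297
```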
Distribute Instances

The nine records with a known Refund value are sent to the corresponding child:

  Refund = Yes: Class = Yes: 0, Class = No: 3
  Refund = No:  Class = Yes: 2, Class = No: 4

For record Tid 10 (Refund = ?, Single, 90K, Class = Yes):
  Probability that Refund = Yes is 3/9
  Probability that Refund = No is 6/9
  Assign the record to the left child (Refund = Yes) with weight 3/9 and to the right child (Refund = No) with weight 6/9.

Resulting weighted counts:
  Refund = Yes: Class = Yes: 0 + 3/9, Class = No: 3
  Refund = No:  Class = Yes: 2 + 6/9, Class = No: 4
Classify Instances

New record: Tid 11, Refund = No, Marital Status = ?, Taxable Income = 85K, Class = ?

Using the decision tree from before (Refund, then MarSt, then TaxInc), the record follows the Refund = No branch and reaches the MarSt test with a missing value. The weighted training counts at that node are:

               Married   Single   Divorced   Total
  Class = No      3         1        0         4
  Class = Yes    6/9        1        1        2.67
  Total          3.67       2        1        6.67

Probability that Marital Status = Married is 3.67/6.67
Probability that Marital Status = {Single, Divorced} is 3/6.67
Scalable Decision Tree Induction Methods

- SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute list data structure
- PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning: stop growing the tree earlier
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): builds an AVC-list (attribute, value, class label)
- BOAT (PODS'99, Gehrke, Ganti, Ramakrishnan & Loh): uses bootstrapping to create several small samples