
Decision Trees

CS 540 section 2
Louis Oliphant
oliphant@cs.wisc.edu
Inductive learning

Simplest form: learn a function from examples

f is the target function
An example is a pair (x, f(x))
Problem: find a hypothesis h such that h ≈ f, given a training set of examples
(This is a highly simplified model of real learning:
Ignores prior knowledge
Assumes examples are given)
Learning decision trees
Problem: decide whether to wait for a table at a restaurant, based
on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
Attribute-based representations

Examples described by attribute values (Boolean, discrete, continuous)

E.g., situations where I will/won't wait for a table:

Classification of examples is positive (T) or negative (F)
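As a concrete illustration (not from the original slides), one training example in the restaurant domain can be held as an attribute-value mapping plus a class label. The attribute names follow the list above; the particular values below are made up for illustration:

```python
# One hypothetical training example for the restaurant-waiting problem.
# Attribute names follow the slide; the values shown are illustrative only.
example = {
    "Alternate": True,
    "Bar": False,
    "FriSat": False,
    "Hungry": True,
    "Patrons": "Some",
    "Price": "$",
    "Raining": False,
    "Reservation": True,
    "Type": "Thai",
    "WaitEstimate": "0-10",
}
label = True  # positive class: we decided to wait
```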
Decision trees

One possible representation for hypotheses

E.g., here is the 'true' tree for deciding whether to wait:

Expressiveness

Decision trees can express any function of the input attributes.

E.g., for Boolean functions, truth table row → path to leaf:

Trivially, there is a consistent decision tree for any training set with one path to a leaf
for each example (unless f is nondeterministic in x), but it probably won't generalize to
new examples

Prefer to find more compact decision trees

Decision tree learning

Aim: find a small tree consistent with the training examples

Idea: (recursively) choose "most significant" attribute as root of (sub)tree


Choosing an attribute

Idea: a good attribute splits the examples into subsets that are
(ideally) "all positive" or "all negative"

Patrons? is a better choice


Using information theory

To implement Choose-Attribute in the DTL algorithm

Information Content (Entropy):

$I(P(v_1), \ldots, P(v_n)) = \sum_{i=1}^{n} -P(v_i)\,\log_2 P(v_i)$

For a training set containing p positive examples and n negative examples:

$I\!\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$
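A minimal sketch of this two-class entropy computation in Python (the function name binary_entropy is mine, not from the slides):

```python
import math

def binary_entropy(p, n):
    """I(p/(p+n), n/(p+n)): entropy of a set with p positive and n negative examples."""
    total = p + n
    entropy = 0.0
    for count in (p, n):
        if count > 0:                 # 0 * log2(0) is taken to be 0
            q = count / total
            entropy -= q * math.log2(q)
    return entropy

print(binary_entropy(6, 6))  # 1.0 bit for a perfectly mixed set
```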
Information function with two categories

[Plot: entropy I(P+, N-) as a function of the proportion of positives p+, rising from 0 at p+ = 0.0 to 1.0 at p+ = 0.5 and falling back to 0 at p+ = 1.0]

Entropy reflects the lack of purity of some particular set S
As the proportion of positives p+ approaches 0.5 (very impure), the entropy of S converges to 1.0
Information gain
A chosen attribute A divides the training set E into subsets E_1, ..., E_v according to their values for A, where A has v distinct values.

Information Gain (IG) or reduction in entropy from the attribute test:

$\mathrm{remainder}(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I\!\left(\frac{p_i}{p_i + n_i}, \frac{n_i}{p_i + n_i}\right)$

$IG(A) = I\!\left(\frac{p}{p+n}, \frac{n}{p+n}\right) - \mathrm{remainder}(A)$

Choose the attribute with the largest IG
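The remainder and IG definitions above can be sketched in Python as follows; the (attribute-dict, label) representation of examples and the function names are assumptions for illustration, not the slides' own code:

```python
import math
from collections import defaultdict

def binary_entropy(p, n):
    """Entropy of a set with p positive and n negative examples (0 when empty or pure)."""
    total = p + n
    return -sum((c / total) * math.log2(c / total) for c in (p, n) if c > 0) if total else 0.0

def information_gain(examples, attribute):
    """IG(A) = I(p/(p+n), n/(p+n)) - remainder(A).

    `examples` is an assumed list of (attribute_dict, label) pairs with Boolean labels.
    """
    p = sum(1 for _, label in examples if label)
    n = len(examples) - p

    # remainder(A): weighted entropy of the subsets produced by splitting on A
    subsets = defaultdict(list)
    for attrs, label in examples:
        subsets[attrs[attribute]].append(label)

    remainder = sum(
        (len(labels) / (p + n)) * binary_entropy(sum(labels), len(labels) - sum(labels))
        for labels in subsets.values()
    )
    return binary_entropy(p, n) - remainder
```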
Information gain
For the training set, p = n = 6, so I(6/12, 6/12) = 1 bit
Consider the attributes Patrons and Type (and others too):
Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root

$IG(\textit{Patrons}) = 1 - \left[\tfrac{2}{12}\, I(0,1) + \tfrac{4}{12}\, I(1,0) + \tfrac{6}{12}\, I\!\left(\tfrac{2}{6},\tfrac{4}{6}\right)\right] = .541$ bits

$IG(\textit{Type}) = 1 - \left[\tfrac{2}{12}\, I\!\left(\tfrac{1}{2},\tfrac{1}{2}\right) + \tfrac{2}{12}\, I\!\left(\tfrac{1}{2},\tfrac{1}{2}\right) + \tfrac{4}{12}\, I\!\left(\tfrac{2}{4},\tfrac{2}{4}\right) + \tfrac{4}{12}\, I\!\left(\tfrac{2}{4},\tfrac{2}{4}\right)\right] = 0$ bits
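As a quick numeric check of these two figures (a sketch using only the subset counts shown on the slide, with the illustrative binary_entropy helper from the earlier sketch inlined):

```python
import math

def binary_entropy(p, n):
    total = p + n
    return -sum((c / total) * math.log2(c / total) for c in (p, n) if c > 0)

def remainder(splits, total=12):
    """Weighted entropy over (positive, negative) counts of each subset of the split."""
    return sum((p + n) / total * binary_entropy(p, n) for p, n in splits)

ig_patrons = 1 - remainder([(0, 2), (4, 0), (2, 4)])           # None, Some, Full
ig_type    = 1 - remainder([(1, 1), (1, 1), (2, 2), (2, 2)])   # French, Italian, Thai, Burger
print(round(ig_patrons, 3), ig_type)                            # 0.541 0.0
```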
Example contd.

Decision tree learned from the 12 examples:

Substantially simpler than the 'true' tree: a more complex hypothesis isn't justified by the small amount of data
Make The Decision Tree
Pick Features that maximize IG
Size   Shape     Color  Class
small  square    red    +
large  circle    blue   +
small  triangle  blue   -
large  circle    red    -
small  circle    blue   +
small  circle    red    -

$I\!\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$

$\mathrm{remainder}(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I\!\left(\frac{p_i}{p_i + n_i}, \frac{n_i}{p_i + n_i}\right)$

$IG(A) = I\!\left(\frac{p}{p+n}, \frac{n}{p+n}\right) - \mathrm{remainder}(A)$
Make The Decision Tree
Pick Features that maximize IG
Size   Shape     Color  Class
small  square    red    +
large  circle    blue   +
small  triangle  blue   -
large  circle    red    -
small  circle    blue   +
small  circle    red    -

Solution tree: split on Shape at the root (square → +, triangle → -, circle → test Color); on the circle branch, split on Color (blue → +, red → -).
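A short check of the exercise (a sketch: it re-encodes the six rows of the table above and recomputes IG for each feature; the helper names and row representation are illustrative assumptions):

```python
import math
from collections import defaultdict

def binary_entropy(p, n):
    total = p + n
    return -sum((c / total) * math.log2(c / total) for c in (p, n) if c > 0) if total else 0.0

def information_gain(rows, attribute):
    p = sum(1 for r in rows if r["Class"] == "+")
    n = len(rows) - p
    groups = defaultdict(lambda: [0, 0])            # attribute value -> [positives, negatives]
    for r in rows:
        groups[r[attribute]][0 if r["Class"] == "+" else 1] += 1
    rem = sum((pi + ni) / (p + n) * binary_entropy(pi, ni) for pi, ni in groups.values())
    return binary_entropy(p, n) - rem

rows = [
    {"Size": "small", "Shape": "square",   "Color": "red",  "Class": "+"},
    {"Size": "large", "Shape": "circle",   "Color": "blue", "Class": "+"},
    {"Size": "small", "Shape": "triangle", "Color": "blue", "Class": "-"},
    {"Size": "large", "Shape": "circle",   "Color": "red",  "Class": "-"},
    {"Size": "small", "Shape": "circle",   "Color": "blue", "Class": "+"},
    {"Size": "small", "Shape": "circle",   "Color": "red",  "Class": "-"},
]

for a in ("Size", "Shape", "Color"):
    print(a, round(information_gain(rows, a), 3))
# Size 0.0, Shape 0.333, Color 0.082 -> Shape is picked at the root
```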
Handling Continuous Features

One way of dealing with a continuous feature F is to treat it like a Boolean feature, partitioned on a dynamically chosen threshold t:

Sort the examples in S according to F
Identify adjacent examples with differing class labels
Compute InfoGain with t equal to the average of the values of F at these boundaries

Can also be generalized to multiple thresholds
U. Fayyad and K. Irani, "Multi-interval discretization of continuous-valued attributes for classification learning," Proceedings of the 13th International Joint Conference on Artificial Intelligence, 1993
Handling Continuous Features

There are two candidates for threshold t in this example:

Temperature:  40   48   60   72   80   90
>1,000?:      No   No   Yes  Yes  Yes  No

t = (48+60)/2 = 54        t = (80+90)/2 = 85

The dynamically-created Boolean features Temp>54 and Temp>85 can now compete with the other Boolean and discrete features in the dataset
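A minimal sketch of the threshold-candidate step described above (the find_thresholds name and the parallel-lists input format are assumptions for illustration):

```python
def find_thresholds(values, labels):
    """Candidate thresholds: midpoints between adjacent sorted values
    whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [
        (pairs[i][0] + pairs[i + 1][0]) / 2
        for i in range(len(pairs) - 1)
        if pairs[i][1] != pairs[i + 1][1]
    ]

temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(find_thresholds(temps, labels))   # [54.0, 85.0]
```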
Noisy Data

Noisy data could be in the examples:

examples have the same attribute values, but different classifications (rare case: if(empty(atts)))
classification is wrong
attribute values are incorrect because of errors getting or preprocessing the data
irrelevant attributes used in the decision-making process

Use Pruning to improve performance when working with noisy data
Missing Data

Missing data:

while learning: replace with the most likely value
while learning: use NotKnown as a value
while classifying: follow the arc for all values and weight each by the frequency of examples crossing that arc
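A sketch of the classification-time option just described (the dictionary node layout, with each child stored alongside the fraction of training examples on that arc, is an assumption made for illustration, not the slides' data structure):

```python
def classify(node, example):
    """Return P(positive); when the tested attribute's value is missing,
    follow every arc and weight each by the fraction of training examples on it."""
    if node["leaf"]:
        return 1.0 if node["label"] == "+" else 0.0
    value = example.get(node["attribute"])          # None means the value is missing
    children = node["children"]                     # value -> (subtree, training fraction)
    if value in children:
        subtree, _ = children[value]
        return classify(subtree, example)
    return sum(weight * classify(subtree, example)
               for subtree, weight in children.values())
```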
Tree Induction as Search

We can think of inducing the 'best' tree as an optimization search problem:
States: possible (sub-)trees
Actions: add a feature as a node of the tree
Objective Function: increase the overall information gain of the tree

Essentially, the Decision Tree Learner is a hill-climbing search through the hypothesis space, where the heuristic picks features that are likely to lead to small trees
Pruning

Overfitting:
meaningless regularity is found in the data
irrelevant attributes confound the true, important, distinguishing features

Fix by pruning lower nodes in the decision tree:
if the gain of the best attribute is below a threshold, make this node a leaf rather than generating child nodes
Pruning
randomly partition training examples into:
    TRAIN set (~80% of training exs.)
    TUNE set (~10% of training exs.)
    TEST set (~10% of training exs.)
build decision tree as usual using TRAIN set

bestTree = decision tree produced on the TRAIN set;
bestAccuracy = accuracy of bestTree on the TUNE set;
progressMade = true;
while (progressMade) { // while accuracy on TUNE improves
    find better tree;
    // starting at root, consider various pruned versions
    // of the current tree and see if any are better than
    // the best tree found so far
}
return bestTree;
use TEST to determine performance accuracy;
Pruning
// find better tree
progressMade = false;
currentTree = bestTree;
for (each interiorNode N in currentTree) { // start at root
    prunedTree = pruned copy of currentTree;
    newAccuracy = accuracy of prunedTree on TUNE set;
    if (newAccuracy >= bestAccuracy) {
        bestAccuracy = newAccuracy;
        bestTree = prunedTree;
        progressMade = true;
    }
}
Pruning
//pruned copy of currentTree
replace interiorNode N in currentTree by a leaf node
label leaf node with the majorityClass among TRAIN set
examples that reached node N
break ties in favor of '-'
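Pulling the three pruning slides together, a runnable sketch might look like the following; the Node class, the (attribute-dict, label) example format, and the helper names are assumptions for illustration rather than the course's actual implementation:

```python
import copy

class Node:
    """A decision-tree node: interior nodes test an attribute, leaves carry a label."""
    def __init__(self, attribute=None, children=None, label=None, majority="-"):
        self.attribute = attribute        # None for a leaf
        self.children = children or {}    # attribute value -> Node
        self.label = label                # class label if this is a leaf
        self.majority = majority          # majority TRAIN-set class reaching this node

def classify(node, example):
    if node.attribute is None:
        return node.label
    child = node.children.get(example.get(node.attribute))
    return classify(child, example) if child else node.majority

def accuracy(tree, examples):
    return sum(classify(tree, x) == y for x, y in examples) / len(examples)

def interior_nodes(node):
    if node.attribute is not None:
        yield node
        for child in node.children.values():
            yield from interior_nodes(child)

def prune(tree, tune):
    """Reduced-error pruning; the loop structure follows the pseudocode above."""
    best_tree, best_acc = tree, accuracy(tree, tune)
    progress_made = True
    while progress_made:                                  # while accuracy on TUNE improves
        progress_made = False
        current = best_tree
        for i in range(sum(1 for _ in interior_nodes(current))):
            candidate = copy.deepcopy(current)
            node = list(interior_nodes(candidate))[i]
            # Replace interior node i by a leaf labelled with its majority TRAIN class
            # (ties were already broken in favor of '-' when majority was recorded).
            node.attribute, node.children, node.label = None, {}, node.majority
            acc = accuracy(candidate, tune)
            if acc >= best_acc:
                best_acc, best_tree, progress_made = acc, candidate, True
    return best_tree
```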
Case Studies

Decision trees have been shown to be at least as accurate as human experts.

Diagnosing breast cancer:
humans correct 65% of the time
decision tree classified 72% correctly

BP designed a decision tree for gas-oil separation for offshore oil platforms

Cessna designed a flight controller using 90,000 exs. and 20 attributes per ex.
Conclusions

Information (Entropy) of a set of data
Information Gain using some feature on a set of data
Handling:
    continuous features
    noisy data
    missing values
Pruning
Constructing decision trees as Search
