
Decision Trees

Introduction

• A decision tree is created by a process known as splitting on the value of attributes (or just splitting on attributes) and then creating a branch for each of its possible values.
• The splitting process continues until each branch can be labeled with just one classification.
• It works for both discrete and continuous attributes. In the case of continuous attributes the test is normally whether the value is 'less than or equal to' or 'greater than' a given value known as the split value.
• Three types of nodes in a decision tree:
– A root node
Having no incoming edges and zero or more outgoing
edges
– Internal nodes
Each node has exactly one incoming edge and two or more
outgoing edges
– Leaf (terminal) nodes
Each node has exactly one incoming edge and no outgoing
edges
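
To make the three node types concrete, here is a minimal sketch of one possible node structure in Python; the class and field names (Node, attribute, children, label) are illustrative choices, not taken from any particular library.

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Node:
    """One node of a decision tree (illustrative structure).

    A root or internal node carries a test attribute and outgoing edges;
    a leaf node carries only a class label.
    """
    attribute: Optional[str] = None                                # attribute tested here; None for a leaf
    children: Dict[object, "Node"] = field(default_factory=dict)  # one outgoing edge per attribute value
    label: Optional[str] = None                                    # class label, set only on leaf nodes

    def is_leaf(self) -> bool:
        return not self.children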
• Problem
– There are exponentially many decision trees that can
be constructed from a given set of attributes.
– Finding the optimal tree is computationally infeasible because of the exponential size of the search space.
• Solution
– Develop efficient algorithms to induce a reasonably
accurate (suboptimal) decision tree in a reasonable
amount of time.
• Many Algorithms:
– Top-Down Induction of Decision Trees (TDIDT)
– CART
– ID3, C4.5
– SLIQ, SPRINT
– etc
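
As a rough illustration of how a TDIDT/ID3-style algorithm works, here is a minimal Python sketch that splits top-down using entropy and information gain; the function names and the nested-dict tree representation are my own choices, not the API of any of the systems listed above.

from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Information gain of splitting (rows, labels) on attribute attr."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

def induce_tree(rows, labels, attributes):
    """Return a nested-dict decision tree: {attribute: {value: subtree-or-label}}."""
    if len(set(labels)) == 1:                 # all records in one class: make a leaf
        return labels[0]
    if not attributes:                        # no attributes left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        sub_rows = [rows[i] for i in idx]
        sub_labels = [labels[i] for i in idx]
        rest = [a for a in attributes if a != best]
        tree[best][value] = induce_tree(sub_rows, sub_labels, rest)
    return tree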
Example
Given that tomorrow the values of Outlook, Temperature, Humidity and Windy were sunny, 74°F, 77% and false respectively, what would the decision be?
In a village, a training program is held for residents who have dropped out of school. There are two classes they can choose from: automotive (workshop) and embroidery. Below are last year's enrollment data. Build a decision tree.

Skin Color  Age    Hair   Gender  Class
Dark        Adult  Long   Male    Automotive
Dark        Adult  Long   Male    Automotive
Dark        Adult  Short  Female  Embroidery
Dark        Teen   Long   Female  Embroidery
Dark        Teen   Long   Female  Embroidery
Dark        Teen   Long   Female  Embroidery
Dark        Teen   Long   Female  Embroidery
Dark        Teen   Short  Male    Automotive
Light       Teen   Long   Male    Automotive
Light       Teen   Short  Male    Automotive
Light       Adult  Short  Male    Automotive
Light       Teen   Long   Male    Automotive
Which one is correct?
There are 4 villagers who want to join:
1. A feminine adult woman, light-skinned, with long hair
2. A tomboyish teenage girl, dark-skinned, with short hair
3. A macho adult man, dark-skinned, muscular, with a crew cut.
4. An effeminate teenage boy, dark-skinned, with long hair.
Which one is correct?
1. Feminine adult woman ( Automotive | Embroidery )
2. Tomboyish teenage girl ( Automotive | Embroidery )
3. Macho adult man ( Embroidery | Automotive )
4. Effeminate teenage boy ( Embroidery | Automotive )
Which one is correct? >>>> Inductive Bias
Inductive Bias
• Find the next term in the sequence: 1, 4, 9, 16, ...
• The correct answer is 20.
• The nth term of the series is calculated from:

  (-5n⁴ + 50n³ - 151n² + 250n - 120) / 24

• You chose 25 because you display a most regrettable bias towards perfect squares.
Inductive Bias
• Inductive bias is a preference for one choice rather than another, which is not determined by the data itself (i.e. the previous values in the sequence) but by external factors, such as our preferences for simplicity or familiarity (i.e. perfect squares).
• Bias can be helpful or harmful, depending on the
dataset.
• We can choose a method that has a bias that we favor,
but we cannot eliminate inductive bias altogether.
• There is no neutral, unbiased method.
Issues in Tree Induction
• Determine how to split the records
– How to specify the attribute test condition?
– How to determine the best split?
• Determine when to stop splitting
How to Specify Test Condition?
• Depends on attribute types
– Binary
– Nominal
– Ordinal
– Continuous
• Depends on number of ways to split
– Binary split
– Multi-way split
• Splitting Based on Binary Attributes
– The test condition generates two potential outcomes, one for each of the pre-specified attribute values.
• Splitting Based on Nominal Attributes
– A nominal attribute can have many values
– Multi-way split: use as many partitions as distinct values.
– Binary split: divides the values into two subsets; need to find the optimal partitioning.
• Splitting Based on Ordinal Attributes
– Ordinal attribute values can be grouped as long as the grouping does not violate the order property of the attribute values.
– Multi-way split: use as many partitions as distinct values.
– Binary split: divides the values into two subsets; this needs to find the optimal partitioning.
– What about a split that groups non-adjacent values? It would violate the order property.
• Splitting Based on Continuous Attributes
Different ways of handling:
– Multi-way split: discretization to form an ordinal categorical attribute. The ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering. Adjacent intervals can also be aggregated into wider ranges as long as the order property is preserved.
– Binary decision: (A < v) or (A ≥ v); consider all possible splits and find the best cut.
Issues in Tree Induction
• Determine how to split the records
– How to specify the attribute test condition?
– How to determine the best split?
• Which attribute gives better split?
• Multi-way split vs binary split
• Splitting continuous attribute
• Determine when to stop splitting
• How to determine the Best Split
Before splitting: 10 records of class 0, 10 records of class 1.
Which test condition is the best?
– Nodes with homogeneous class distribution are preferred
– Need a measure of node impurity
Issues in Tree Induction
• Determine how to split the records
– How to specify the attribute test condition?
– How to determine the best split?
• Which attribute gives better split?
• Multi-way split vs binary split
• Splitting continuous attribute
• Determine when to stop splitting
• Measures of node impurity:
• GINI index
• Entropy
• Misclassification error
– Measures of the quality of a split:
• Information gain
• Gain ratio
• Gini Index for a given node t:

  Gini(t) = 1 - Σ_j [p(j|t)]²

  where p(j|t) is the relative frequency of class j at node t.

• Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least interesting information
• Minimum (0.0) when all records belong to one class, implying most interesting information
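
A small illustrative sketch of the Gini computation from class counts; the function name is my own, and the two calls check the maximum (equal distribution) and minimum (pure node) cases stated above.

def gini(counts):
    """Gini index of a node given its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([10, 10]))  # 0.5, the maximum 1 - 1/nc for nc = 2 classes
print(gini([20, 0]))   # 0.0, a pure node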
• Entropy for a given node t:

  Entropy(t) = - Σ_j p(j|t) log2 p(j|t)

  where p(j|t) is the relative frequency of class j at node t.

• Maximum (log2 nc) when records are equally distributed among all classes, implying least interesting information
• Minimum (0.0) when all records belong to one class, implying most interesting information
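
The same kind of illustrative sketch for entropy, again checking the maximum and minimum cases.

from math import log2

def entropy(counts):
    """Entropy of a node given its per-class record counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

print(entropy([10, 10]))  # 1.0 = log2(2), the maximum for 2 classes
print(entropy([20, 0]))   # 0.0, a pure node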
• Classification error for a given node t:

  Error(t) = 1 - max_j p(j|t)

  where p(j|t) is the relative frequency of class j at node t.

• Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least interesting information
• Minimum (0.0) when all records belong to one class, implying most interesting information
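
And a corresponding illustrative sketch for the misclassification error.

def classification_error(counts):
    """Misclassification error of a node given its per-class record counts."""
    n = sum(counts)
    return 1.0 - max(counts) / n

print(classification_error([10, 10]))  # 0.5, the maximum 1 - 1/nc for nc = 2 classes
print(classification_error([20, 0]))   # 0.0, a pure node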
• Measure the quality of a split:
– Information gain
» The impurity of the children can be calculated by a simple average or by a weighted average (weighted by the number of records in each child node)
– Gain ratio
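
A minimal sketch of information gain computed as the parent impurity minus the weighted average impurity of the children, with the weights being the fraction of records reaching each child; the function names are illustrative. The example call uses the Outlook split from the weather data worked out below.

from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the weighted average entropy of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Outlook split of the 14 weather records: Overcast (0 don't / 4 play), Rain (2/3), Sunny (3/2)
print(round(information_gain([5, 9], [[0, 4], [2, 3], [3, 2]]), 3))  # 0.247 (0.246 when intermediates are rounded)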
Example

Outlook   Temp (°F)  Humidity (%)  Humidity  Windy  Class
overcast  64         65            Low       TRUE   play
rain      65         70            Low       TRUE   don't play
rain      68         80            High      FALSE  play
sunny     69         70            Low       FALSE  play
rain      70         96            High      FALSE  play
rain      71         80            High      TRUE   don't play
sunny     72         95            High      FALSE  don't play
overcast  72         90            High      TRUE   play
rain      75         80            High      FALSE  play
sunny     75         70            Low       TRUE   play
sunny     80         90            High      TRUE   don't play
overcast  81         75            High      FALSE  play
overcast  83         78            High      FALSE  play
sunny     85         85            High      FALSE  don't play
Example

At the root node there are 9 "play" and 5 "don't play" records:

  Gini = 1 - [(5/14)² + (9/14)²] = 0.459
  Entropy = -(5/14) log2(5/14) - (9/14) log2(9/14) = 0.940
  Error = 1 - 9/14 = 0.357
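
A quick numeric check of these three values (plain Python, illustrative only):

from math import log2

counts = [5, 9]                      # don't play, play
n = sum(counts)
p = [c / n for c in counts]

gini = 1 - sum(pi ** 2 for pi in p)
entropy = -sum(pi * log2(pi) for pi in p)
error = 1 - max(p)

print(round(gini, 3), round(entropy, 3), round(error, 3))   # 0.459 0.94 0.357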
Candidate splits at the root (class counts in each branch):

            Outlook                     Windy             Humidity
            Overcast  Rain  Sunny       True  False       High  Low
Don't play  0         2     3           3     2           4     1
Play        4         3     2           3     6           6     3
Example

Splitting on Outlook:

            Overcast  Rain  Sunny
Don't       0         2     3
Play        4         3     2

Using the Gini index:
  Gini(Overcast) = 0,  Gini(Rain) = Gini(Sunny) = 1 - [(2/5)² + (3/5)²] = 0.48
  Weighted Gini of the children = (4/14)(0) + (5/14)(0.48) + (5/14)(0.48) = 0.343
  Gain = 0.459 - 0.343 = 0.116

Using entropy:
  Entropy(Overcast) = 0
  Entropy(Rain) = Entropy(Sunny) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971
  Weighted entropy of the children = (4/14)(0) + (5/14)(0.971) + (5/14)(0.971) = 0.694
  Information gain = 0.940 - 0.694 = 0.246
Example

Splitting on Windy:

            True  False
Don't       3     2
Play        3     6

  Entropy(True) = 1,  Entropy(False) = -(2/8) log2(2/8) - (6/8) log2(6/8) = 0.811
  Weighted entropy of the children = (6/14)(1) + (8/14)(0.811) = 0.892
  Information gain = 0.940 - 0.892 = 0.048
Example

Splitting on Humidity:

            High  Low
Don't       4     1
Play        6     3

  Entropy(High) = -(4/10) log2(4/10) - (6/10) log2(6/10) = 0.971
  Entropy(Low) = -(1/4) log2(1/4) - (3/4) log2(3/4) = 0.811
  Weighted entropy of the children = (10/14)(0.971) + (4/14)(0.811) = 0.925
  Information gain = 0.940 - 0.925 = 0.015
Comparing the three candidate splits by information gain:

            Outlook  Windy  Humidity
  Gain      0.246    0.048  0.015

Outlook gives the best split; its Rain and Sunny branches can then be split further on Humidity or Windy.
Issues in Tree Induction
• Determine how to split the records
– How to specify the attribute test condition?
– How to determine the best split?
• Which attribute gives better split?
• Multi-way split vs binary split
• Splitting continuous attribute
• Determine when to stop splitting
Multi-way vs binary splits on Outlook (information gain of each grouping):

  {Overcast} {Rain} {Sunny}       G = 0.246
  {Overcast, Rain} vs {Sunny}     G = 0.102
  {Overcast, Sunny} vs {Rain}     G = 0.003
  {Overcast} vs {Rain, Sunny}     G = 0.226
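
An illustrative sketch that enumerates the groupings of Outlook's values and computes the information gain of each, reproducing the four numbers above; the names are my own.

from itertools import combinations
from math import log2

# Class counts per Outlook value: (don't play, play)
counts = {"Overcast": (0, 4), "Rain": (2, 3), "Sunny": (3, 2)}

def entropy(c):
    n = sum(c)
    return -sum(x / n * log2(x / n) for x in c if x > 0)

def merged(values):
    """Combined class counts of a group of attribute values."""
    return tuple(sum(counts[v][i] for v in values) for i in range(2))

parent = merged(counts)                          # (5, 9) at the root
total = sum(parent)

def gain(groups):
    """Information gain of a partition of the attribute values into groups."""
    weighted = sum(sum(merged(g)) / total * entropy(merged(g)) for g in groups)
    return entropy(parent) - weighted

print(round(gain([("Overcast",), ("Rain",), ("Sunny",)]), 3))   # three-way split: ~0.247 (0.246 on the slide)
for left in combinations(counts, 2):                            # the three binary groupings
    right = tuple(v for v in counts if v not in left)
    print(left, "vs", right, round(gain([left, right]), 3))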


Example
ID code Outlook Temp (F) Humidity Windy Class
A overcast 64 Low TRUE play
B rain 65 Low TRUE don’t play
C rain 68 High FALSE play
D sunny 69 Low FALSE play
E rain 70 High FALSE play
F rain 71 High TRUE don’t play
G sunny 72 High FALSE don’t play
H overcast 72 High TRUE play
I rain 75 High FALSE play
J sunny 75 Low TRUE play
K sunny 80 High TRUE don’t play
L overcast 81 High FALSE play
M overcast 83 High FALSE play
N sunny 85 High FALSE don’t play

Splitting on ID code produces 14 branches, each containing a single record:

  Entropy of every branch = 0
  Weighted entropy of the children = 0
  Information gain = 0.940 - 0 = 0.940
Highly-branching attributes
• Problematic: attributes with a large number of
values (extreme case: ID code)
• Subsets are more likely to be pure if there is a
large number of values
– Information gain is biased towards choosing
attributes with a large number of values
– This may result in overfitting (selection of
an attribute that is non-optimal for
prediction)
• This is the reason for another measure: Gain Ratio
Highly-branching attributes
• Gain ratio: a modification of the information
gain that reduces its bias
• Gain ratio takes number and size of branches
into account when choosing an attribute
– It corrects the information gain by taking
the intrinsic information of a split into
account
• Intrinsic information: entropy of distribution
of instances into branches (i.e. how much info
do we need to tell which branch an instance
belongs to)
Highly-branching attributes
• Gain Ratio is defined by the formula:

  Gain Ratio = Gain / Split Info,   where   Split Info = - Σ_i (|Di|/|D|) log2(|Di|/|D|)

  and Di is the subset of records sent down the i-th branch.

• For ID code (14 branches of one record each):

  Gain Ratio = 0.940 / 3.807 = 0.247

• For Outlook (branches of 4, 5 and 5 records):

  Gain Ratio = 0.246 / 1.577 = 0.156
Highly-branching attributes
• The gain ratio for Outlook is higher than for Humidity, Temperature and Windy.
• However, it is still lower than that of the hypothetical ID code attribute, which means that splitting on ID code would be preferred to any of these four.
• However, its advantage is greatly reduced.
• In practical implementations, we have to use judgment to guard against splitting on such a useless attribute.
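
An illustrative sketch of the gain-ratio correction: the split info depends only on how many records go down each branch, so ID code's 14 singleton branches are penalised far more heavily than Outlook's three branches. Function names are my own.

from math import log2

def split_info(branch_sizes):
    """Intrinsic information: entropy of the distribution of records over branches."""
    n = sum(branch_sizes)
    return -sum(s / n * log2(s / n) for s in branch_sizes if s > 0)

def gain_ratio(gain, branch_sizes):
    return gain / split_info(branch_sizes)

print(round(split_info([1] * 14), 3))           # 3.807 for the ID code split
print(round(gain_ratio(0.940, [1] * 14), 3))    # 0.247
print(round(split_info([4, 5, 5]), 3))          # 1.577 for Outlook
print(round(gain_ratio(0.246, [4, 5, 5]), 3))   # 0.156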
Issues in Tree Induction
• Determine how to split the records
– How to specify the attribute test condition?
– How to determine the best split?
• Which attribute gives better split?
• Multi-way split vs binary split
• Splitting continuous attribute
• Determine when to stop splitting
Splitting continuous attribute
• Use Binary Decisions based on one value
• Several Choices for the splitting value
– Number of possible splitting values = Number of
distinct values
• Each splitting value has a count matrix associated with it
– Class counts in each of the partitions, A < v and A ≥ v
• Simple method to choose best v
– For each v, scan the database to gather count matrix
and compute its impurity
– Computationally Inefficient! (Repetition of work)
Splitting continuous attribute

TID  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125             No
2    No      Married         100             No
3    No      Single          70              No
4    Yes     Married         120             No
5    No      Divorced        95              Yes
6    No      Married         60              No
7    Yes     Divorced        220             No
8    No      Single          85              Yes
9    No      Married         75              No
10   No      Single          90              Yes
Splitting continuous attribute
• For efficient computation: for each attribute,
– Sort the attribute on values
– Linearly scan these values, each time updating the
count matrix and computing impurity
– Choose the split position that has the least impurity
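
A minimal sketch of this sort-and-scan approach using the Gini index and the Taxable Income column above; the function and variable names are illustrative.

def gini(counts):
    """Gini index from a {class: count} dictionary."""
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_binary_split(values, labels):
    """Return (split value v, weighted Gini) of the best A < v vs A >= v split."""
    data = sorted(zip(values, labels))
    n = len(data)
    right = {}
    for _, lab in data:                          # start with all records on the right side
        right[lab] = right.get(lab, 0) + 1
    left = {lab: 0 for lab in right}
    best_v, best_impurity = None, float("inf")
    for i in range(1, n):                        # candidate cut between data[i-1] and data[i]
        v, lab = data[i - 1]
        left[lab] += 1                           # incrementally update the count matrices
        right[lab] -= 1
        if data[i][0] == v:                      # cannot split between equal values
            continue
        cut = (v + data[i][0]) / 2
        impurity = i / n * gini(left) + (n - i) / n * gini(right)
        if impurity < best_impurity:
            best_v, best_impurity = cut, impurity
    return best_v, best_impurity

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_binary_split(income, cheat))          # best cut ~97.5 with weighted Gini ~0.3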
Issues in Tree Induction
• Determine how to split the records
– How to specify the attribute test condition?
– How to determine the best split?
• Which attribute gives better split?
• Multi-way split vs binary split
• Splitting continuous attribute
• Determine when to stop splitting
Determine when to stop splitting
• Stop expanding a node when all the records belong to the same class
• Stop expanding a node when all the records have similar attribute values
• Early termination

• Advantages of decision trees:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification techniques for many simple data sets
