Introduction
• Gini index for a given node t:
  $Gini(t) = 1 - \sum_j \big[p(j \mid t)\big]^2$
• Entropy for a given node t:
  $Entropy(t) = -\sum_j p(j \mid t)\,\log_2 p(j \mid t)$
• Classification error for a given node t:
  $Error(t) = 1 - \max_j p(j \mid t)$
where $p(j \mid t)$ is the relative frequency of class j at node t
• Measures of the quality of a split:
  – Information gain
  – Gain ratio
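Each of these measures can be computed directly from the class counts at a node. Here is a minimal Python sketch (the function names are my own choices, not from the slides):

```python
# Node impurity measures over raw class counts, e.g. counts = [9, 5].
from math import log2

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t)."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gini(counts):
    """Gini(t) = 1 - sum_j p(j|t)^2."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def classification_error(counts):
    """Error(t) = 1 - max_j p(j|t)."""
    return 1.0 - max(counts) / sum(counts)
```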
Example
Outlook    Temp (F)   Humidity (%)   Humidity (High/Low)   Windy   Class
overcast   64         65             Low                   TRUE    play
rain       65         70             Low                   TRUE    don't play
rain       68         80             High                  FALSE   play
sunny      69         70             Low                   FALSE   play
rain       70         96             High                  FALSE   play
rain       71         80             High                  TRUE    don't play
sunny      72         95             High                  FALSE   don't play
overcast   72         90             High                  TRUE    play
rain       75         80             High                  FALSE   play
sunny      75         70             Low                   TRUE    play
sunny      80         90             High                  TRUE    don't play
overcast   81         75             High                  FALSE   play
overcast   83         78             High                  FALSE   play
sunny      85         85             High                  FALSE   don't play
Example
Class counts at the root node:

Class        Count
Don't play   5
Play         9

$Gini = 1 - \left[\left(\tfrac{9}{14}\right)^2 + \left(\tfrac{5}{14}\right)^2\right] = 0.459$

$Entropy = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} = 0.940$

$Error = 1 - \tfrac{9}{14} = 0.357$
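Feeding the root counts into the sketch above reproduces all three values:

```python
# 9 "play" and 5 "don't play" instances at the root node.
root = [9, 5]
print(round(gini(root), 3))                  # 0.459
print(round(entropy(root), 3))               # 0.940
print(round(classification_error(root), 3))  # 0.357
```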
[Figure: the 14 instances grouped under each candidate split]
Outlook:  Overcast (4 play / 0 don't), Rain (3 play / 2 don't), Sunny (2 play / 3 don't)
Windy:    True (3 play / 3 don't), False (6 play / 2 don't)
Humidity: High (6 play / 4 don't), Low (3 play / 1 don't)
Example

Gini gain of splitting on Outlook:

Decision   Overcast   Rain   Sunny
Don't      0          2      3
Play       4          3      2

$Gini(Overcast) = 0$, while $Gini(Rain) = Gini(Sunny) = 1 - \left[\left(\tfrac{3}{5}\right)^2 + \left(\tfrac{2}{5}\right)^2\right] = 0.48$

$Gini_{split}(Outlook) = \tfrac{4}{14}\cdot 0 + \tfrac{5}{14}\cdot 0.48 + \tfrac{5}{14}\cdot 0.48 = 0.343$

$\Delta Gini = 0.459 - 0.343 = 0.116$
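The same numbers fall out of gini() from the earlier sketch; the branch counts below are [play, don't], read off the table:

```python
# Weighted Gini of the Outlook split and the resulting drop in impurity.
branches = {"overcast": [4, 0], "rain": [3, 2], "sunny": [2, 3]}
n = sum(sum(c) for c in branches.values())  # 14 instances in total

gini_split = sum(sum(c) / n * gini(c) for c in branches.values())
print(round(gini_split, 3))                 # 0.343
print(round(gini([9, 5]) - gini_split, 3))  # 0.116
```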
Example

Information gain of splitting on Outlook:

Decision   Overcast   Rain   Sunny
Don't      0          2      3
Play       4          3      2

$Entropy(Overcast) = 0$
$Entropy(Rain) = 0.971$
$Entropy(Sunny) = 0.971$

$Entropy_{split}(Outlook) = \tfrac{4}{14}\cdot 0 + \tfrac{5}{14}\cdot 0.971 + \tfrac{5}{14}\cdot 0.971 = 0.694$

$Gain(Outlook) = 0.940 - 0.694 = 0.246$
Example

Information gain of splitting on Windy:

Decision   True   False
Don't      3      2
Play       3      6

$Entropy(True) = 1$
$Entropy(False) = 0.811$

$Entropy_{split}(Windy) = \tfrac{6}{14}\cdot 1 + \tfrac{8}{14}\cdot 0.811 = 0.892$

$Gain(Windy) = 0.940 - 0.892 = 0.048$
Example

Information gain of splitting on Humidity:

Decision   High   Low
Don't      4      1
Play       6      3

$Entropy(High) = 0.971$
$Entropy(Low) = 0.811$

$Entropy_{split}(Humidity) = \tfrac{10}{14}\cdot 0.971 + \tfrac{4}{14}\cdot 0.811 = 0.925$

$Gain(Humidity) = 0.940 - 0.925 = 0.015$
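A single helper reproduces all three gains; it reuses entropy() from the earlier sketch, with branch counts given as [play, don't]:

```python
def information_gain(parent, branches):
    """Gain = Entropy(parent) - weighted entropy of the branches."""
    n = sum(parent)
    weighted = sum(sum(b) / n * entropy(b) for b in branches)
    return entropy(parent) - weighted

root = [9, 5]
# Outlook: 0.247 exactly; the slides' 0.246 comes from subtracting
# the rounded intermediates 0.940 - 0.694.
print(round(information_gain(root, [[4, 0], [3, 2], [2, 3]]), 3))  # Outlook:  0.247
print(round(information_gain(root, [[3, 3], [6, 2]]), 3))          # Windy:    0.048
print(round(information_gain(root, [[6, 4], [3, 1]]), 3))          # Humidity: 0.015
```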
[Figure: the three candidate root splits side by side. The gain values shown are 0.116 for Outlook (the Gini-based gain computed earlier), 0.048 for Windy, and 0.015 for Humidity; Outlook wins by either measure and becomes the root. Inside the impure branches, Humidity and Windy are tried as second-level splits: Humidity separates the Sunny branch perfectly (High → don't play, Low → play) and Windy separates the Rain branch perfectly (True → don't play, False → play).]
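Induction then recurses: the same computation runs inside each impure branch of the chosen Outlook split. Checking the second-level splits from the figure with information_gain() from above:

```python
# Counts are [play, don't] within each Outlook branch.
sunny = [2, 3]  # entropy 0.971
rain  = [3, 2]  # entropy 0.971

# Sunny branch split on Humidity: High -> [0, 3], Low -> [2, 0]
print(round(information_gain(sunny, [[0, 3], [2, 0]]), 3))  # 0.971 (pure leaves)
# Rain branch split on Windy: True -> [0, 2], False -> [3, 0]
print(round(information_gain(rain, [[0, 2], [3, 0]]), 3))   # 0.971 (pure leaves)
```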
Issues in Tree Induction
• Determine how to split the records
  – How to specify the attribute test condition?
  – How to determine the best split?
• Which attribute gives the better split?
• Multi-way split vs. binary split
• Splitting a continuous attribute (see the sketch below)
• Determine when to stop splitting
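For a continuous attribute such as Temp, one standard approach (a sketch of common practice, not something the slides work through) is to sort the values and score a binary split at each midpoint between adjacent distinct values:

```python
# Temp values and classes taken from the weather table above.
temps  = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = ["play", "dont", "play", "play", "play", "dont", "dont",
          "play", "play", "play", "dont", "play", "play", "dont"]

def class_counts(ls):
    return [ls.count("play"), ls.count("dont")]

parent = class_counts(labels)  # [9, 5]
best_gain, best_t = -1.0, None
distinct = sorted(set(temps))
for lo, hi in zip(distinct, distinct[1:]):
    t = (lo + hi) / 2          # candidate test: Temp <= t
    left  = [l for v, l in zip(temps, labels) if v <= t]
    right = [l for v, l in zip(temps, labels) if v > t]
    g = information_gain(parent, [class_counts(left), class_counts(right)])
    if g > best_gain:
        best_gain, best_t = g, t
print(best_t, round(best_gain, 3))  # 84.0 0.113
```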
[Figure: the extreme case of a highly-branching split. An ID-code-like attribute puts every instance in its own pure branch, so E = 0 for each branch and the information gain is maximal: Gain = 0.940 − 0 = 0.940.]
Highly-branching attributes
• Problematic: attributes with a large number of
values (extreme case: ID code)
• Subsets are more likely to be pure if there is a
large number of values
– Information gain is biased towards choosing
attributes with a large number of values
– This may result in overfitting (selection of
an attribute that is non-optimal for
prediction)
• This motivates another measure: gain ratio
Highly-branching attributes
• Gain ratio: a modification of the information
gain that reduces its bias
• Gain ratio takes number and size of branches
into account when choosing an attribute
– It corrects the information gain by taking
the intrinsic information of a split into
account
• Intrinsic information: entropy of distribution
of instances into branches (i.e. how much info
do we need to tell which branch an instance
belongs to)
Highly-branching attributes
• Gain ratio is defined by the formula:

$GainRatio(A) = \dfrac{Gain(A)}{SplitInfo(A)}$

where $SplitInfo(A) = -\sum_i \tfrac{|S_i|}{|S|}\log_2\tfrac{|S_i|}{|S|}$ is the intrinsic information of the split into subsets $S_i$.

• ID code (14 singleton branches): $GainRatio = \tfrac{0.940}{3.807} = 0.247$

• Outlook (branches of size 4, 5, 5): $GainRatio = \tfrac{0.246}{1.577} = 0.156$
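The split info in the denominator is just the entropy of the branch sizes, so both gain ratios can be checked in a few lines (continuing the Python sketches above):

```python
from math import log2

def split_info(sizes):
    """Entropy of the distribution of instances into branches."""
    n = sum(sizes)
    return -sum(s / n * log2(s / n) for s in sizes)

# ID code: 14 singleton branches, gain 0.940
print(round(split_info([1] * 14), 3))           # 3.807
print(round(0.940 / split_info([1] * 14), 3))   # 0.247

# Outlook: branch sizes 4 / 5 / 5, gain 0.246
print(round(split_info([4, 5, 5]), 3))          # 1.577
print(round(0.246 / split_info([4, 5, 5]), 3))  # 0.156
```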
Highly-branching attributes