Classification: Definition
Illustrating Classification Task
Examples of Classification Task
Classification Using Distance
K Nearest Neighbor (KNN):
Training set includes classes.
Examine the K items nearest to the item to be classified.
The new item is placed in the class with the largest number of close items.
O(q) for each tuple to be classified, where q is the size of the training set.
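A minimal sketch of the procedure above, assuming Euclidean distance on 2-D points; the function name knn_classify and the toy data are illustrative, not from the slides.

```python
import math
from collections import Counter

def knn_classify(training_set, new_item, k):
    """Classify new_item by majority vote among its k nearest training items."""
    # Distance from new_item to every training item: O(q) for q training tuples.
    by_distance = sorted(training_set, key=lambda pair: math.dist(pair[0], new_item))
    # Vote among the k closest items; the most common class wins.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((4.8, 5.2), "B")]
print(knn_classify(training, (1.1, 0.9), k=3))  # -> "A"
```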
Classification Techniques
Decision Tree based Methods
Rule-based Methods
Memory-based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Model: a decision tree over the splitting attributes:
Refund: Yes → NO; No → MarSt
MarSt: Married → NO; Single, Divorced → TaxInc
TaxInc: < 80K → NO; > 80K → YES
Another Example of Decision Tree
(Attribute types: categorical, categorical, continuous; class label.)
For the same training data, a different tree also fits:
MarSt: Married → NO; Single, Divorced → Refund
Refund: Yes → NO; No → TaxInc
TaxInc: < 80K → NO; > 80K → YES
There can be more than one tree that fits the same data.
Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree: Refund = No, so follow the No branch to MarSt.
MarSt = Married, so follow the Married branch to the leaf NO.
Assign Cheat to "No".
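A minimal sketch of walking this tree for the test record; the nested-tuple node layout is an assumption for illustration, not from the slides.

```python
# Internal nodes are (attribute, branches); leaves are class labels.
tree = ("Refund", {"Yes": "NO",
                   "No": ("MarSt", {"Married": "NO",
                                    "Single": ("TaxInc<80K", {True: "NO", False: "YES"}),
                                    "Divorced": ("TaxInc<80K", {True: "NO", False: "YES"})})})

def classify(node, record):
    while isinstance(node, tuple):          # descend until a leaf is reached
        attr, branches = node
        key = record["TaxInc"] < 80 if attr == "TaxInc<80K" else record[attr]
        node = branches[key]
    return node

record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(tree, record))  # -> "NO": assign Cheat to No
```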
Decision Tree Induction
Many algorithms; one of the earliest is Hunt's Algorithm.
Hunt's Algorithm: let Dt be the set of training records that reach node t.
If all records in Dt belong to the same class yt, then t is a leaf labeled yt;
if Dt is empty, t is a leaf labeled with the default class;
otherwise, split Dt on an attribute test and recurse on each partition.
Hunt's Algorithm on the training data (Cheat as the class):
Step 1: a single node predicting Don't Cheat.
Step 2: split on Refund — Yes → Don't Cheat; No → Don't Cheat.
Step 3: under Refund = No, split on Marital Status — Married → Don't Cheat; Single, Divorced → Cheat.
Step 4: under Single, Divorced, split on Taxable Income — < 80K → Don't Cheat; >= 80K → Cheat.
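A minimal sketch of Hunt's recursive strategy under these assumptions: records are dicts, candidate tests are given as a fixed ordered list (a real implementation would pick the best test greedily), and all helper names are illustrative.

```python
def hunts(records, tests, default="Don't Cheat"):
    """Grow a tree: if records are empty or pure, make a leaf; else split and recurse."""
    labels = {r["Cheat"] for r in records}
    if not records:                  # Dt is empty: default class
        return default
    if len(labels) == 1:             # Dt is pure: leaf labeled with that class
        return labels.pop()
    if not tests:                    # no tests left: majority class
        return max(labels, key=lambda c: sum(r["Cheat"] == c for r in records))
    (attr, predicate), *rest = tests
    yes = [r for r in records if predicate(r[attr])]
    no = [r for r in records if not predicate(r[attr])]
    return {attr: {"yes": hunts(yes, rest, default), "no": hunts(no, rest, default)}}

tests = [("Refund", lambda v: v == "Yes"),
         ("Marital Status", lambda v: v == "Married"),
         ("Taxable Income", lambda v: v < 80)]
records = [
    {"Refund": "Yes", "Marital Status": "Single", "Taxable Income": 125, "Cheat": "No"},
    {"Refund": "No", "Marital Status": "Married", "Taxable Income": 100, "Cheat": "No"},
    {"Refund": "No", "Marital Status": "Divorced", "Taxable Income": 95, "Cheat": "Yes"},
]
print(hunts(records, tests))
```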
Tree Induction
Greedy strategy: split the records based on an attribute test that optimizes a chosen criterion.
Issues: determine how to split the records (how to specify the attribute test condition, and how to determine the best split) and determine when to stop splitting.
How to Specify the Test Condition?
It depends on the attribute type — Nominal, Ordinal, or Continuous — and on the number of ways to split: 2-way split or multi-way split.

Splitting Based on Nominal Attributes
Multi-way split: use as many partitions as distinct values, e.g. CarType → Family, Sports, Luxury.
Binary split: divide the values into two subsets, e.g. {Sports, Luxury} vs {Family} OR {Family, Luxury} vs {Sports}.

Splitting Based on Ordinal Attributes
Multi-way split: Size → Small, Medium, Large.
Binary split: must preserve the value order, e.g. {Small, Medium} vs {Large} OR {Medium, Large} vs {Small}; a grouping like {Small, Large} vs {Medium} violates the ordering.

Splitting Based on Continuous Attributes
Different ways of handling:
Discretization to form an ordinal categorical attribute, or a binary decision of the form (A < v) or (A >= v) over candidate split points v.
How to determine the best split? Greedy approach: nodes with a homogeneous class distribution are preferred, so we need a measure of node impurity. A non-homogeneous distribution has a high degree of impurity; a homogeneous one has low impurity. Common measures: Gini Index, Entropy, Misclassification error.
To compare candidate splits, compute the impurity M0 of the parent node before splitting. Splitting on A? (Yes → Node N1, No → Node N2) gives impurities M1 and M2, which combine (weighted by node size) into M12; splitting on B? (Yes → Node N3, No → Node N4) gives M3 and M4, combining into M34. Choose the test with the larger gain, i.e. compare M0 − M12 against M0 − M34.
GINI(t) = 1 - \sum_{j} [\, p(j \mid t) \,]^2
Examples:
C1 = 0, C2 = 6: Gini = 0.000
C1 = 1, C2 = 5: Gini = 0.278
C1 = 2, C2 = 4: Gini = 0.444
C1 = 3, C2 = 3: Gini = 0.500

Examples for computing GINI:
C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1 → Gini = 1 − 0² − 1² = 0.000
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6 → Gini = 1 − (1/6)² − (5/6)² = 0.278
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6 → Gini = 1 − (2/6)² − (4/6)² = 0.444
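A short check of the GINI(t) formula against the counts above (the helper name gini is illustrative):

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2 for the class counts at a node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

for c1, c2 in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(f"C1={c1}, C2={c2}: Gini={gini([c1, c2]):.3f}")
# -> 0.000, 0.278, 0.444, 0.500
```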
When a node p is split into k partitions (children), the quality of the split is

GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n} \, GINI(i)

where n_i = number of records at child i and n = number of records at node p.
Binary Attributes: Computing the GINI Index
Split on B? (Yes → Node N1, No → Node N2).
Parent: C1 = 6, C2 = 6, Gini = 0.500.
N1: C1 = 5, C2 = 2; N2: C1 = 1, C2 = 4.
Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
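A sketch of GINI_split applied to the example above; children are given as lists of class counts, and the helper names are illustrative:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted average of child impurities: sum of (n_i / n) * GINI(i)."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

# Node N1: C1=5, C2=2; Node N2: C1=1, C2=4
print(round(gini_split([[5, 2], [1, 4]]), 3))  # -> 0.371
```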
Categorical Attributes: Computing the Gini Index
For each distinct value, gather the counts for each class in the dataset, then use the resulting count matrix to evaluate candidate partitions.
Multi-way split:
CarType   Family  Sports  Luxury
C1        1       2       1
C2        4       1       1
Gini = 0.393

Two-way split (find the best partition of values):
CarType   {Sports, Luxury}  {Family}
C1        3                 1
C2        2                 4
Gini = 0.400

CarType   {Sports}  {Family, Luxury}
C1        2         2
C2        1         5
Gini = 0.419
Continuous Attributes: Computing the Gini Index
Use a binary decision on one value, e.g. Taxable Income <= v.
For efficient computation: for each attribute, sort the records on the attribute's values, linearly scan the sorted values, updating the class counts and computing the Gini index at each candidate split position, and choose the position with the smallest Gini index.
Cheat:           No   No   No   Yes  Yes  Yes  No   No   No   No
Sorted Values:   60   70   75   85   90   95   100  120  125  220
Split Positions: 55   65   72   80   87   92   97   110  122  172  230
Gini:            .420 .400 .375 .343 .417 .400 .300 .343 .375 .400 .420

At each candidate position, records are partitioned into Taxable Income <= split and > split, and the Yes/No counts on each side give the weighted Gini. The best split is Taxable Income <= 97, with Gini = 0.300.
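A self-contained sketch of the sorted linear scan just described: sort once, then slide a candidate split between adjacent values while updating the class counts incrementally (midpoint split positions are an assumption; the slide labels the best position 97):

```python
def best_split(values, labels):
    """Linear scan over sorted values; return (position, gini) of the best <=/> split."""
    def gini_of(counts):
        n = sum(counts)
        return 1 - sum((c / n) ** 2 for c in counts) if n else 0.0

    pairs = sorted(zip(values, labels))
    n, total_yes = len(pairs), labels.count("Yes")
    left_yes = left_n = 0
    best = (None, float("inf"))
    for i in range(n - 1):
        left_n += 1
        left_yes += pairs[i][1] == "Yes"             # update counts incrementally
        right_n, right_yes = n - left_n, total_yes - left_yes
        split = (pairs[i][0] + pairs[i + 1][0]) / 2  # midpoint candidate position
        w = (left_n / n) * gini_of([left_yes, left_n - left_yes]) \
          + (right_n / n) * gini_of([right_yes, right_n - right_yes])
        if w < best[1]:
            best = (split, w)
    return best

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split(incomes, cheat))  # -> (97.5, 0.3)
```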
Entropy(t) = -\sum_{j} p(j \mid t) \, \log_2 p(j \mid t)
Examples for computing Entropy:
C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1 → Entropy = −0 log 0 − 1 log 1 = 0
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6 → Entropy = −(1/6) log₂(1/6) − (5/6) log₂(5/6) = 0.65
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6 → Entropy = −(2/6) log₂(2/6) − (4/6) log₂(4/6) = 0.92
Information Gain:

GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} \, Entropy(i)

Parent node p is split into k partitions; n_i is the number of records in partition i.
Gain Ratio:

GainRATIO_{split} = \frac{GAIN_{split}}{SplitINFO}, \qquad SplitINFO = -\sum_{i=1}^{k} \frac{n_i}{n} \log \frac{n_i}{n}
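A sketch of Entropy, GAIN_split, and GainRATIO for lists of class counts, assuming log base 2 to match the worked examples above (function names are illustrative):

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def info_gain(parent, children):
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

def gain_ratio(parent, children):
    n = sum(parent)
    split_info = -sum(sum(ch) / n * math.log2(sum(ch) / n) for ch in children if sum(ch))
    return info_gain(parent, children) / split_info

print(round(entropy([1, 5]), 2))  # -> 0.65
print(round(entropy([2, 4]), 2))  # -> 0.92
```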
Splitting criteria based on classification error: Error(t) = 1 - \max_i P(i \mid t).

Examples for computing Error:
C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1 → Error = 1 − max(0, 1) = 0
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6 → Error = 1 − 5/6 = 1/6
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6 → Error = 1 − 4/6 = 1/3
Comparison among Splitting Criteria
Misclassification Error vs Gini
Parent: C1 = 7, C2 = 3, Gini = 0.42.
Split on A?: Node N1 (Yes): C1 = 3, C2 = 0; Node N2 (No): C1 = 4, C2 = 3.
Gini(N1) = 1 − (3/3)² − (0/3)² = 0
Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342
Gini improves, while the misclassification error stays at 3/10 both before and after the split.
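A small check of that comparison, recomputing both measures on the same counts (not from the slides):

```python
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def error(counts):
    return 1 - max(counts) / sum(counts)

parent, n1, n2 = [7, 3], [3, 0], [4, 3]
w = lambda f: (sum(n1) * f(n1) + sum(n2) * f(n2)) / sum(parent)
print(round(gini(parent), 2), round(w(gini), 3))    # 0.42 -> ~0.343: Gini improves
print(round(error(parent), 2), round(w(error), 3))  # 0.3  -> 0.3:    error unchanged
```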
Decision Tree Based Classification — Advantages:
Inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Accuracy comparable to other classification techniques on many simple data sets
Example: C4.5
Simple depth-first construction.
Uses Information Gain.
Sorts continuous attributes at each node.
Needs the entire dataset to fit in memory, so it is unsuitable for large datasets.
http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
Practical Issues of Classification
Underfitting and Overfitting
Missing Values
Costs of Classification
Circular points: 0.5 \le \sqrt{x_1^2 + x_2^2} \le 1
Triangular points: \sqrt{x_1^2 + x_2^2} > 1 or \sqrt{x_1^2 + x_2^2} < 0.5
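A sketch generating synthetic data matching those definitions — points inside the annulus are "circular", the rest "triangular"; the sampling range is an assumption, not from the slides:

```python
import math
import random

def sample_point():
    x1, x2 = random.uniform(-1.5, 1.5), random.uniform(-1.5, 1.5)
    r = math.hypot(x1, x2)                      # sqrt(x1^2 + x2^2)
    label = "circular" if 0.5 <= r <= 1 else "triangular"
    return (x1, x2), label

data = [sample_point() for _ in range(500)]
```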
Underfitting: when the model is too simple, both training and test errors are large.
Overfitting due to Insufficient Examples
A lack of data points in the lower half of the diagram makes it difficult to predict the class labels of that region correctly: an insufficient number of training records there causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.
Notes on Overfitting
Estimating Generalization Errors
Occam's Razor
Minimum Description Length (MDL)
[Figure: MDL setup. Two parties both hold the attribute values X1 ... Xn; the sender also knows the class labels y (1, 0, 0, 1, ...), while the receiver's labels are unknown (?). The sender transmits a classification model plus its exceptions, and the total description cost is Cost(model) + Cost(data | model).]
How to Address Overfitting
Post-pruning
Example of Post-Pruning
Root: Class = Yes: 20, Class = No: 10.
Training error (before splitting) = 10/30
Pessimistic error (before splitting) = (10 + 0.5)/30 = 10.5/30
Split on A into children A1, A2, A3, A4 (each a leaf predicting Yes or No):
Training error (after splitting) = 9/30
Pessimistic error (after splitting) = (9 + 4 × 0.5)/30 = 11/30
Since 11/30 > 10.5/30: PRUNE!
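A check of that arithmetic, with the 0.5-per-leaf penalty as a parameter (the helper name is illustrative):

```python
def pessimistic_error(training_errors, num_leaves, n, penalty=0.5):
    """Training errors plus a per-leaf complexity penalty, as a fraction of n records."""
    return (training_errors + penalty * num_leaves) / n

before = pessimistic_error(10, 1, 30)  # (10 + 0.5)/30  = 0.35
after = pessimistic_error(9, 4, 30)    # (9 + 4*0.5)/30 ~ 0.367
print("PRUNE!" if after >= before else "keep split")  # -> PRUNE!
```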
Examples of Post-pruning
Case 1: children with C0: 11, C1: 3 and C0: 2, C1: 4.
Case 2: children with C0: 14, C1: 3 and C0: 2, C1: 2.
Optimistic error? Don't prune in either case.
Pessimistic error? Don't prune case 1, prune case 2.
Computing Impurity Measure (with a missing value)
Before splitting: Entropy(Parent) = −0.3 log(0.3) − 0.7 log(0.7) = 0.8813

Count matrix for Refund (one record has a missing Refund value):
             Class = Yes  Class = No
Refund=Yes   0            3
Refund=No    2            4
Refund=?     1            0

Split on Refund:
Entropy(Refund=Yes) = 0
Entropy(Refund=No) = −(2/6) log(2/6) − (4/6) log(4/6) = 0.9183
Entropy(Children) = 0.3 × 0 + 0.6 × 0.9183 = 0.551
Gain = 0.9 × (0.8813 − 0.551) = 0.297, where the factor 0.9 is the fraction of records with a known Refund value.
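A sketch reproducing that computation: records with a known Refund value weight the children, and the gain is scaled by the known fraction (0.9). Variable names are illustrative:

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

parent = entropy([3, 7])                                   # 0.8813, over all 10 records
children = 3/10 * entropy([0, 3]) + 6/10 * entropy([2, 4])  # 0.551
gain = 0.9 * (parent - children)                            # scale by the known fraction
print(round(gain, 3))                                       # -> 0.297
```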
Distribute Instances
The record with Refund = ? is sent down both branches with fractional weights: weight 3/9 to Refund = Yes and weight 6/9 to Refund = No, matching the distribution of the records whose Refund value is known.
Classify Instances
New record: Refund = No, Marital Status unknown.

              Married  Single  Divorced  Total
Class = No    3        1       0         4
Class = Yes   6/9      1       1         2.67
Total         3.67     2       1         6.67

Probability that Marital Status = Married is 3.67/6.67.
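A small sketch of that weighted lookup, assuming the table layout above (rows are classes, columns are MarSt values):

```python
counts = {
    "No":  {"Married": 3,     "Single": 1, "Divorced": 0},
    "Yes": {"Married": 6 / 9, "Single": 1, "Divorced": 1},
}
total = sum(v for row in counts.values() for v in row.values())   # 6.67
p_married = sum(counts[c]["Married"] for c in counts) / total     # 3.67 / 6.67
print(round(p_married, 2))  # -> 0.55
```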
[Figure: the decision tree used for classification — Refund: Yes → NO; No → MarSt. MarSt: Married → NO; Single, Divorced → TaxInc. TaxInc: < 80K → NO; > 80K → YES.]
Builds an index for each attribute; only the class list and the current attribute list reside in memory.