Classification
Rule-Based Classification
Neural Networks
Summary
Classification
- Predicts categorical class labels (discrete or nominal).
- Constructs a model from a training set whose tuples carry known values of a class-label attribute, then uses that model to classify new data.

Numeric Prediction
- Models continuous-valued functions, i.e., predicts unknown or missing numeric values.

Typical applications
- Credit/loan approval: should the application be approved?
- Medical diagnosis: is a tumor cancerous or benign?
- Fraud detection: is a transaction fraudulent?
- Web page categorization: which category does a page belong to?
Classification

Training data is fed to a classification algorithm, which constructs the classifier (model).

Training Data:

NAME     RANK            YEARS   TENURED
Mike     Assistant Prof  3       no
Mary     Assistant Prof  7       yes
Bill     Professor       2       yes
Jim      Associate Prof  7       yes
Dave     Assistant Prof  6       no
Anne     Associate Prof  3       no

Classifier (Model):

IF rank = professor OR years > 6 THEN tenured = yes
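The rule model above is simple enough to run directly; a minimal Python sketch (attribute and label spellings are assumptions taken from the table) that checks it against the training data:

```python
# Apply the learned rule to (rank, years) attribute values.
# "Professor" / "Assistant Prof" / "Associate Prof" spellings follow the table.

def tenured(rank, years):
    """IF rank = professor OR years > 6 THEN tenured = yes."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Training data from the table above: (name, rank, years, tenured).
training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

# The rule reproduces every training label.
assert all(tenured(r, y) == label for _, r, y, label in training)
```

On the unseen tuple (Jeff, Professor, 4), the rule fires on rank and predicts tenured = yes.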
Decision Tree

A decision tree classifies tuples by routing them along the tree structure:
- An internal node denotes a test on an attribute.
- A branch represents a value (outcome) of the node's attribute test.
- A leaf node represents a class label or a class distribution.
Example of Playing Tennis
The classifier (model) is then applied to testing data and to unseen data.

Testing Data:

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
12/4/2014
[Figure: decision tree with root attribute age]
- <=30 → student? (no → no, yes → yes)
- 31..40 → yes
- >40 → credit rating? (excellent → no, fair → yes)
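The age/student/credit-rating tree above can be encoded directly as nested dictionaries; a minimal sketch (key spellings are assumptions):

```python
# Internal nodes are dicts {attribute: {value: subtree}}; leaves are labels.
tree = {
    "age": {
        "<=30": {"student": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}

def classify(node, tuple_):
    """Route a tuple (a dict of attribute values) down to a leaf label."""
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[tuple_[attribute]]
    return node

print(classify(tree, {"age": ">40", "credit_rating": "fair"}))  # -> yes
```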
What is ID3?

ID3 (Iterative Dichotomiser 3) builds a decision tree top-down, at each node choosing the attribute whose split yields the highest information gain; information gain is defined in terms of entropy.

Entropy of a training set S with 9 tuples of one class and 5 of the other:

Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
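The entropy value above can be checked with a few lines of Python; a sketch:

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution, given the count of each class."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

# 9 tuples of one class, 5 of the other, as above:
print(round(entropy([9, 5]), 3))  # -> 0.94
```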
Example of ID3
SplitInfo(A) = - Σ_{j=1..v} (|D_j| / |D|) log2(|D_j| / |D|)
GainRatio(A) = InformationGain(A) / SplitInfo(A)

SplitInfo(CoO)     = -(4/10) log2(4/10) - (4/10) log2(4/10) - (2/10) log2(2/10) = 1.5219
GainRatio(CoO)     = 0.2756 / 1.5219 = 0.1811

SplitInfo(Genre)   = -(3/10) log2(3/10) - (6/10) log2(6/10) - (1/10) log2(1/10) = 1.2955
GainRatio(Genre)   = 0.1700 / 1.2955 = 0.1312

SplitInfo(BigStar) = -(7/10) log2(7/10) - (3/10) log2(3/10) = 0.8813
GainRatio(BigStar) = 0.0100 / 0.8813 = 0.0113

CoO has the highest gain ratio, so it is selected as the splitting attribute.
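The SplitInfo and GainRatio numbers above follow from two short functions; a sketch, using the partition sizes read off the slide:

```python
from math import log2

def split_info(sizes):
    """SplitInfo(A) = -sum_j |Dj|/|D| * log2(|Dj|/|D|)."""
    total = sum(sizes)
    return -sum(n / total * log2(n / total) for n in sizes if n)

def gain_ratio(information_gain, sizes):
    return information_gain / split_info(sizes)

print(round(split_info([4, 4, 2]), 4))          # CoO     -> 1.5219
print(round(split_info([3, 6, 1]), 4))          # Genre   -> 1.2955
print(round(split_info([7, 3]), 4))             # BigStar -> 0.8813
print(round(gain_ratio(0.2756, [4, 4, 2]), 4))  # CoO     -> 0.1811
```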
gini(total) = 1 - (5/10)^2 - (5/10)^2 = 0.50
gini_genre{SF, (C,R)}(D) = (3/10)(1 - (1/3)^2 - (2/3)^2) + (7/10)(1 - (4/7)^2 - (3/7)^2) = 0.4762
Reduction in Impurity
- For genre {(SF, C), R}: 0.50 - 0.4444 = 0.0556
- For genre {(SF, R), C}: 0.50 - 0.4167 = 0.0833
- For genre {SF, (C, R)}: 0.50 - 0.4762 = 0.0238

The grouping {(SF, R), C} yields the largest reduction in impurity, so it is the best binary split on genre.
gini_genre{(SF, R), C}(D) = (4/10)(1 - (1/4)^2 - (3/4)^2) + (6/10)(1 - (4/6)^2 - (2/6)^2) = 0.4167

gini_genre{(SF, C), R}(D) = (9/10)(1 - (5/9)^2 - (4/9)^2) + (1/10)(1 - (0/1)^2 - (1/1)^2) = 0.4444
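The Gini computations above generalize to two small functions; a sketch, where each group is a pair of class counts read from the equations:

```python
def gini(counts):
    """Gini impurity: 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(groups):
    """Weighted Gini impurity of a split into groups of class counts."""
    total = sum(sum(g) for g in groups)
    return sum(sum(g) / total * gini(g) for g in groups)

print(round(gini([5, 5]), 4))                  # total          -> 0.5
print(round(gini_split([(1, 2), (4, 3)]), 4))  # {SF}, (C,R)    -> 0.4762
print(round(gini_split([(1, 3), (4, 2)]), 4))  # {(SF,R), C}    -> 0.4167
print(round(gini_split([(5, 4), (0, 1)]), 4))  # {(SF,C), R}    -> 0.4444
```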
Classification

[Figure: decision trees constructed step by step on the movie data — internal nodes test CoO (US, EU, RW) and Genre (SF vs. the other genres); leaf labels BS and N]
Overfitting
Training Error

The error of a classification model on its training data is called the training error. Pruning a tree typically increases the training error, since the pruned tree no longer fits the training set as closely.

[Figure: the full tree and pruned trees evaluated on the training tuples T0-T2 and F0-F2 — splits on CoO (US, EU, RW) and Genre (SF), with BS and N leaves]
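Training error, as defined above, is just the fraction of training tuples a model misclassifies; a minimal sketch on a made-up toy dataset (not the slides' movie data), contrasting a tree that memorizes the data with a pruned majority-class stump:

```python
def training_error(model, data):
    """Fraction of (x, label) training pairs the model gets wrong."""
    return sum(model(x) != label for x, label in data) / len(data)

# Toy training set: the label is "yes" only for x == (1, 1).
data = [((0, 0), "no"), ((0, 1), "no"), ((1, 0), "no"), ((1, 1), "yes")]

full_tree = lambda x: "yes" if x == (1, 1) else "no"  # fits every tuple
stump = lambda x: "no"                                # pruned: majority class

print(training_error(full_tree, data))  # -> 0.0
print(training_error(stump, data))      # -> 0.25
```

Pruning raised the training error from 0.0 to 0.25, the behavior the slide describes; the payoff of pruning shows up on unseen data, not on the training set.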
Notes on Overfitting