Data Mining: Concepts and Techniques (3rd ed.)
Chapter 8
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
Rule-Based Classification
Summary
August 2, 2015
Prediction Problems: Classification vs. Numeric Prediction

Classification
  predicts categorical class labels (discrete or nominal)
  constructs a model based on the training set and the values (class labels) in a classifying attribute, and uses it to classify new data
Numeric prediction
  models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications
  Credit/loan approval
  Medical diagnosis: is a tumor cancerous or benign?
  Fraud detection: is a transaction fraudulent?
  Web page categorization: which category a page belongs to
Classification: A Two-Step Process
Training Data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no
Classifier (Model):

  IF rank = 'professor' OR years > 6
  THEN tenured = 'yes'
The classifier is then applied to testing data to estimate accuracy, and to unseen data.

Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4). Tenured? The learned rule predicts yes, since rank = Professor.
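The two-step process above can be sketched in a few lines of Python. This is purely illustrative: the table and the learned rule come from the slide, while the function names are made up for the sketch.

```python
# Step 1 (learning): the toy faculty training table from the slide.
train = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def model(rank, years):
    # The rule learned on the slide:
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

# Sanity check: the rule reproduces every training label.
assert all(model(rank, years) == label for _, rank, years, label in train)

# Step 2 (classification): apply the model to unseen data.
print(model("Professor", 4))   # Jeff -> yes
```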
This follows an example of Quinlan's ID3 (Playing Tennis).
age     buys_computer
<=30    no
<=30    no
31..40  yes
>40     yes
>40     yes
>40     no
31..40  yes
<=30    no
<=30    yes
>40     yes
<=30    yes
31..40  yes
31..40  yes
>40     no
Resulting decision tree (figure):

age?
  <=30   -> student?
              no  -> no
              yes -> yes
  31..40 -> yes
  >40    -> credit rating?
              excellent -> no
              fair      -> yes
Expected information (entropy) needed to classify a tuple in D:

  Info(D) = -Σ_{i=1}^{m} p_i log2(p_i)

Information needed (after using A to split D into v partitions) to classify D:

  Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)

Information gained by branching on attribute A:

  Gain(A) = Info(D) - Info_A(D)
Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)

  Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

  Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Here (5/14) I(2,3) means the branch age <= 30 has 5 out of 14 samples, with 2 yes's and 3 no's. Hence

  Gain(age) = Info(D) - Info_age(D) = 0.246
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no
Similarly,

  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048
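These Info and Gain values can be reproduced with a short script. This is a sketch: the data is the age/buys_computer projection of the table above, and the helper names are made up.

```python
from collections import Counter
from math import log2

# (age, buys_computer) pairs from the AllElectronics table above.
data = [("<=30","no"),("<=30","no"),("31..40","yes"),(">40","yes"),(">40","yes"),
        (">40","no"),("31..40","yes"),("<=30","no"),("<=30","yes"),(">40","yes"),
        ("<=30","yes"),("31..40","yes"),("31..40","yes"),(">40","no")]

def info(labels):
    """Expected information (entropy) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

labels = [y for _, y in data]
info_d = info(labels)                    # Info(D) = I(9,5)

# Info_age(D): entropy of each age partition, weighted by partition size.
by_age = {}
for a, y in data:
    by_age.setdefault(a, []).append(y)
info_age = sum(len(v) / len(data) * info(v) for v in by_age.values())

gain_age = info_d - info_age
print(round(info_d, 3), round(info_age, 3), round(gain_age, 3))  # 0.94 0.694 0.247
```

The slide's 0.246 comes from subtracting the rounded intermediate values 0.940 and 0.694; the unrounded difference is 0.247.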
  SplitInfo_A(D) = -Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)

  GainRatio(A) = Gain(A) / SplitInfo_A(D)
  gini(D) = 1 - Σ_{j=1}^{n} p_j^2

If a data set D is split on A into two subsets D1 and D2:

  gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

Reduction in impurity:

  Δgini(A) = gini(D) - gini_A(D)
Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":

  gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
May need other tools, e.g., clustering, to get the possible split values
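The gini computations can be sketched directly; the binary income split and its class counts (7 yes / 3 no in {low, medium}, 2 yes / 2 no in {high}) are read off the 14-tuple table above.

```python
# Gini index sketch for the 14-tuple data set (9 "yes", 5 "no").

def gini(counts):
    """Gini impurity from a list of per-class tuple counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

gini_d = gini([9, 5])                  # 1 - (9/14)^2 - (5/14)^2

# Binary split on income: D1 = {low, medium} (10 tuples: 7 yes / 3 no),
# D2 = {high} (4 tuples: 2 yes / 2 no).
gini_split = (10 / 14) * gini([7, 3]) + (4 / 14) * gini([2, 2])

print(round(gini_d, 3), round(gini_split, 3))   # 0.459 0.443
```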
Information gain: biased towards multivalued attributes
Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions and purity in both partitions
C-SEP: performs better than information gain and gini index in certain cases
MDL (Minimal Description Length): the best tree is the one that requires the fewest # of bits to both
(1) encode the tree, and (2) encode the exceptions to the tree
CART: finds multivariate splits based on a linear comb. of attrs.
Attribute construction: create new attributes based on existing ones that are sparsely represented
AVC-set on Age
Age     | yes | no
<=30    |  2  |  3
31..40  |  4  |  0
>40     |  3  |  2

AVC-set on income
income  | yes | no
high    |  2  |  2
medium  |  4  |  2
low     |  3  |  1

AVC-set on student
student | yes | no
yes     |  6  |  1
no      |  3  |  4

AVC-set on credit_rating
credit_rating | yes | no
fair          |  6  |  2
excellent     |  3  |  3

(The yes/no columns are the Buy_Computer class counts; the values follow from the 14-tuple table above.)
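The AVC-sets above can be built in one pass over the tuples, which is the point of the RainForest structure. A minimal sketch over the table from this chapter:

```python
from collections import defaultdict

# AllElectronics tuples: (age, income, student, credit_rating, class).
tuples = [
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31..40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31..40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31..40","medium","no","excellent","yes"),
    ("31..40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]
attrs = ["age", "income", "student", "credit_rating"]

# One AVC-set per attribute: value -> per-class counts.
avc = {a: defaultdict(lambda: {"yes": 0, "no": 0}) for a in attrs}
for row in tuples:
    label = row[-1]
    for a, v in zip(attrs, row[:-1]):
        avc[a][v][label] += 1

print(dict(avc["age"]))
# {'<=30': {'yes': 2, 'no': 3}, '31..40': {'yes': 4, 'no': 0}, '>40': {'yes': 3, 'no': 2}}
```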
Bayesian Classification: Why?
Bayes' Theorem

  P(H|X) = P(X|H) P(H) / P(X)
The posterior probability of class Ci given tuple X:

  P(Ci|X) = P(X|Ci) P(Ci) / P(X)
Assuming class-conditional independence (naive):

  P(X|Ci) = Π_{k=1}^{n} P(x_k|Ci) = P(x_1|Ci) × P(x_2|Ci) × ... × P(x_n|Ci)

For a continuous-valued attribute, the class-conditional is usually taken to be a Gaussian with mean μ and standard deviation σ:

  g(x, μ, σ) = (1 / (sqrt(2π) σ)) e^(-(x - μ)^2 / (2σ^2))

and P(x_k|Ci) = g(x_k, μ_Ci, σ_Ci)
Class:
  C1: buys_computer = "yes"
  C2: buys_computer = "no"

Data sample to classify:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(Ci):
  P(buys_computer = "yes") = 9/14 = 0.643
  P(buys_computer = "no")  = 5/14 = 0.357
  P(X|Ci) = Π_{k=1}^{n} P(x_k|Ci)
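Plugging the counts from the 14-tuple table into these formulas gives the classification of X. A sketch (the scores omit the common denominator P(X), which does not affect the comparison):

```python
# Naive Bayes scores for X = (age<=30, income=medium, student=yes,
# credit_rating=fair); all fractions are counted from the table above.

p_yes, p_no = 9 / 14, 5 / 14                 # priors P(Ci)

# Class-conditional probabilities P(x_k | Ci):
px_yes = (2/9) * (4/9) * (6/9) * (6/9)       # age<=30, medium, student, fair | yes
px_no  = (3/5) * (2/5) * (1/5) * (2/5)       # same attribute values | no

score_yes = px_yes * p_yes                   # proportional to P(yes | X)
score_no  = px_no * p_no                     # proportional to P(no | X)

print(round(score_yes, 4), round(score_no, 4))   # 0.0282 0.0069
# score_yes > score_no, so X is classified as buys_computer = "yes".
```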
Advantages
  Easy to implement
  Good results obtained in most of the cases
Disadvantages
  Assumption of class-conditional independence, hence a loss of accuracy
  In practice, dependencies exist among variables
    E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
    Dependencies among these cannot be modeled by a naive Bayesian classifier
Size ordering: assign the highest priority to the triggering rule that has the toughest requirement (i.e., with the most attribute tests)
Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
One rule is created for each path from the root to a leaf of the buys_computer decision tree:

  IF age <= 30 AND student = no                THEN buys_computer = no
  IF age <= 30 AND student = yes               THEN buys_computer = yes
  IF age = 31..40                              THEN buys_computer = yes
  IF age > 40 AND credit_rating = excellent    THEN buys_computer = no
  IF age > 40 AND credit_rating = fair         THEN buys_computer = yes
Rules are learned sequentially: each rule for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of other classes.
Steps: each time a rule is learned, the tuples covered by the rule are removed.
(Figure: sequential covering. The examples covered by Rule 1, Rule 2, and Rule 3 are successive regions of the positive examples.)
How to Learn-One-Rule?

FOIL-Gain evaluates extending a rule R (covering pos positive and neg negative tuples) into R' (covering pos' and neg'):

  FOIL_Gain = pos' × ( log2(pos' / (pos' + neg')) - log2(pos / (pos + neg)) )

It favors rules that have high accuracy and cover many positive tuples. Rule pruning is based on an independent set of test tuples, where pos/neg are the # of positive/negative tuples covered by R:

  FOIL_Prune(R) = (pos - neg) / (pos + neg)
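Both measures translate directly into code. A sketch; the counts in the usage example are hypothetical:

```python
from math import log2

def foil_gain(pos, neg, pos2, neg2):
    """FOIL_Gain for extending rule R into R'.
    pos/neg: tuples covered by R; pos2/neg2: tuples covered by R'."""
    return pos2 * (log2(pos2 / (pos2 + neg2)) - log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    """Pruning measure for rule R on an independent validation set."""
    return (pos - neg) / (pos + neg)

# Hypothetical counts: R covers 50 pos / 50 neg; adding a conjunct leaves
# 40 pos / 10 neg, so the conjunct sharply improves accuracy.
g = foil_gain(50, 50, 40, 10)
print(round(g, 3))   # 27.123
```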
Rule Generation

To generate a rule:

  while (true)
      find the best predicate p
      if foil-gain(p) > threshold then add p to the current rule
      else break
(Figure: the rule grows from A3=1, to A3=1 && A1=2, to A3=1 && A1=2 && A8=5, progressively separating the positive examples from the negative examples.)
Confusion matrix:

Actual \ Predicted | C1                   | ~C1
C1                 | True Positives (TP)  | False Negatives (FN)
~C1                | False Positives (FP) | True Negatives (TN)
Actual \ Predicted | buy_computer = yes | buy_computer = no | Total | Recognition (%)
buy_computer = yes |  6954              |   46              |  7000 | 99.34
buy_computer = no  |   412              | 2588              |  3000 | 86.27
Total              |  7366              | 2634              | 10000 | 95.42
Actual \ Predicted | cancer = yes | cancer = no | Total | Recognition (%)
cancer = yes       |   90         |  210        |   300 | 30.00  (sensitivity)
cancer = no        |  140         | 9560        |  9700 | 98.56  (specificity)
Total              |  230         | 9770        | 10000 | 96.50  (accuracy)
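The sensitivity, specificity, and accuracy entries in the cancer table can be recomputed from its four cells:

```python
# Cells of the cancer confusion matrix above.
tp, fn, fp, tn = 90, 210, 140, 9560

sensitivity = tp / (tp + fn)                 # true positive recognition rate
specificity = tn / (tn + fp)                 # true negative recognition rate
precision   = tp / (tp + fp)
accuracy    = (tp + tn) / (tp + fn + fp + tn)

print(f"{sensitivity:.2%} {specificity:.2%} {precision:.2%} {accuracy:.2%}")
# 30.00% 98.56% 39.13% 96.50%
```

Note how a model can score high accuracy (96.50%) while missing 70% of the actual cancer cases, which is why sensitivity and specificity matter for imbalanced classes.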
F measure (F-score): the harmonic mean of precision and recall:

  F = 2 × precision × recall / (precision + recall)

F_β: a weighted measure of precision and recall; it assigns β times as much weight to recall as to precision:

  F_β = (1 + β^2) × precision × recall / (β^2 × precision + recall)
Holdout method
  Given data is randomly partitioned into two independent sets: a training set (e.g., 2/3 of the data) for model construction and a test set (e.g., 1/3) for accuracy estimation
Bootstrap
  Samples the given training tuples uniformly with replacement
  .632 bootstrap: a data set with d tuples is sampled d times; about 63.2% of the original data end up in the bootstrap sample and the remaining 36.8% form the test set. Repeat the sampling procedure k times; the overall accuracy of the model is:

  Acc(M) = Σ_{i=1}^{k} ( 0.632 × Acc(M_i)_test_set + 0.368 × Acc(M_i)_train_set )
If only one test set is available: use a pairwise t-test over the same cross-validation partitions
If two test sets are available: use a non-paired t-test, where k1 & k2 are the # of cross-validation samples used for M1 & M2, respectively

The t-distribution is symmetric; a significance level of, e.g., sig = 0.05 or 5% means M1 & M2 are significantly different for 95% of the population, with confidence limit z = sig/2.
The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate
The plot also shows a diagonal line (random guessing)
A model with perfect accuracy will have an area of 1.0 under the curve
Accuracy
  classifier accuracy: predicting class label
Speed
  time to construct the model (training time)
  time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
  understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
Ensemble methods
  Use a combination of models to increase accuracy
  Combine a series of k learned models, M_1, M_2, ..., M_k, with the aim of creating an improved model M*
Popular ensemble methods
  Bagging: averaging the prediction over a collection of classifiers
  Boosting: weighted vote with a collection of classifiers
  Ensemble: combining a set of heterogeneous classifiers
Bagging: Bootstrap Aggregation
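A minimal bagging sketch: draw k bootstrap samples, train one base classifier per sample, and combine them by equal-weight majority vote. The data set and the "stump" base learner here are hypothetical toys, not the book's algorithm verbatim.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Toy (feature, class) tuples.
data = [(0, "no")] * 3 + [(1, "yes")] * 3

def train_stump(sample):
    """Base classifier: majority label among sample tuples sharing x,
    falling back to the sample's overall majority when x is unseen."""
    def predict(x):
        votes = [y for f, y in sample if f == x]
        pool = votes or [y for _, y in sample]
        return max(set(pool), key=pool.count)
    return predict

k = 25
# Each model is trained on a bootstrap sample (drawn with replacement).
models = [train_stump([random.choice(data) for _ in data]) for _ in range(k)]

def bagged_predict(x):
    votes = [m(x) for m in models]       # each classifier gets one vote
    return max(set(votes), key=votes.count)

print(bagged_predict(0), bagged_predict(1))
```

Because each base classifier sees a slightly different bootstrap sample, the majority vote smooths out the errors of individual models.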
Boosting
The weight of classifier M_i's vote is

  log( (1 - error(M_i)) / error(M_i) )
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Rule-based classification
Classification by back propagation
Lazy learners (learning from your neighbors)
Frequent-pattern-based classification
Prediction
Ensemble methods
Model selection
Summary
What Is Classification?
Tree Pruning
Bayes' Theorem
Evaluation Metrics
Cross-validation
Bootstrap
Bagging
Random Forest
Handling Different Kinds of Cases in Classification
  Multiclass Classification
  Cost-Sensitive Learning
  Active Learning
  Transfer Learning
Summary
Data cleaning
Data transformation
Accuracy
  classifier accuracy: predicting class label
  predictor accuracy: guessing value of predicted attributes
Speed
  time to construct the model (training time)
  time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
  understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
  SplitInfo_A(D) = -Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)

  GainRatio(A) = Gain(A) / SplitInfo_A(D)

Ex.: splitting on income (4 "low", 6 "medium", 4 "high" tuples):

  SplitInfo_income(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557

  GainRatio(income) = 0.029 / 1.557 = 0.019
Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":

  gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459

Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:

  gini_income in {low,medium}(D) = (10/14) gini(D1) + (4/14) gini(D2) = 0.443

Similarly, gini for the split {low, high} is 0.458 and for {medium, high} is 0.450; the split on {low, medium} (and {high}) is therefore the best, since it has the lowest gini index.

May need other tools, e.g., clustering, to get the possible split values.
Classifier Accuracy Measures

Real class \ Predicted class | C1             | ~C1
C1                           | True positive  | False negative
~C1                          | False positive | True negative

Real class \ Predicted class | buy_computer = yes | buy_computer = no | total | recognition (%)
buy_computer = yes           |  6954              |   46              |  7000 | 99.34
buy_computer = no            |   412              | 2588              |  3000 | 86.27
total                        |  7366              | 2634              | 10000 | 95.42

Accuracy acc(M) of a classifier M: the percentage of test-set tuples that are correctly classified by the model M
Error rate (misclassification rate) of M = 1 - acc(M)
Given m classes, CM_{i,j}, an entry in a confusion matrix, indicates the # of tuples in class i that are labeled by the classifier as class j
Alternative accuracy measures (e.g., for cancer diagnosis):
  sensitivity = t-pos/pos              /* true positive recognition rate */
  specificity = t-neg/neg              /* true negative recognition rate */
  precision   = t-pos/(t-pos + f-pos)
  accuracy    = sensitivity × pos/(pos + neg) + specificity × neg/(pos + neg)
This model can also be used for cost-benefit analysis
Loss function: measures the error between y_i and the predicted value y_i':
  Absolute error: |y_i - y_i'|
  Squared error: (y_i - y_i')^2

Test error (generalization error): the average loss over the test set:
  Mean absolute error:     Σ_{i=1}^{d} |y_i - y_i'| / d
  Mean squared error:      Σ_{i=1}^{d} (y_i - y_i')^2 / d
  Relative absolute error: Σ_{i=1}^{d} |y_i - y_i'| / Σ_{i=1}^{d} |y_i - ȳ|
  Relative squared error:  Σ_{i=1}^{d} (y_i - y_i')^2 / Σ_{i=1}^{d} (y_i - ȳ)^2
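These error measures are one-liners in code. A sketch over a small hypothetical test set of d = 4 numeric predictions:

```python
# Hypothetical true and predicted values for d = 4 test tuples.
y      = [3.0, 5.0, 2.0, 6.0]
y_pred = [2.5, 5.0, 3.0, 5.5]
d = len(y)
mean_y = sum(y) / d                                            # ȳ

mae = sum(abs(a - b) for a, b in zip(y, y_pred)) / d           # mean absolute error
mse = sum((a - b) ** 2 for a, b in zip(y, y_pred)) / d         # mean squared error
rae = sum(abs(a - b) for a, b in zip(y, y_pred)) / \
      sum(abs(a - mean_y) for a in y)                          # relative absolute error
rse = sum((a - b) ** 2 for a, b in zip(y, y_pred)) / \
      sum((a - mean_y) ** 2 for a in y)                        # relative squared error

print(mae, mse, round(rae, 3), round(rse, 3))   # 0.5 0.375 0.333 0.15
```

The relative measures normalize by the error of always predicting the mean ȳ, so values well below 1 indicate the predictor beats that trivial baseline.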
Summary (I)
Classification and prediction are two forms of data analysis that can
be used to extract models describing important data classes or to
predict future data trends.
Summary (II)
Significance tests and ROC curves are useful for model selection.
No single method has been found to be superior over all others for all data sets.