Principles of Knowledge Discovery in Databases
Fall 1999
Source: Dr. Jiawei Han
Dr. Osmar R. Zaïane, 1999
University of Alberta
Course Content
Chapter 7 Objectives
What is Classification?
What is Prediction?

If forecasting a discrete value → Classification
If forecasting a continuous value → Prediction
Preparing Data Before Classification

Data cleaning:
Smoothing to reduce noise
Relevance analysis:
Feature selection to eliminate irrelevant attributes
Data transformation:
Discretization of continuous data
Normalization to [-1..1] or [0..1]

Application in Data Mining

Credit approval
Target marketing
Medical diagnosis
Defective parts identification in manufacturing
Crime zoning
Treatment effectiveness analysis
Etc.
[Figure: the available Data is split into Training Data and Testing Data; the Training Data is used to derive the Classifier (Model), and the Testing Data to estimate its accuracy]

Estimating accuracy:
Holdout
Random sub-sampling
K-fold cross validation
Bootstrapping
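A minimal Python sketch of k-fold cross validation; train_fn and eval_fn are hypothetical placeholders for building a classifier and measuring its accuracy on held-out data:

```python
import random

def k_fold_cross_validation(samples, k, train_fn, eval_fn, seed=42):
    """Estimate classifier accuracy by k-fold cross validation.

    samples:  list of (attributes, class_label) pairs
    train_fn: builds a classifier from a training list
    eval_fn:  returns the accuracy of a classifier on a test list
    """
    data = samples[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]          # k disjoint folds
    accuracies = []
    for i in range(k):
        test = folds[i]                             # one fold held out
        train = [s for j, f in enumerate(folds) if j != i for s in f]
        model = train_fn(train)
        accuracies.append(eval_fn(model, test))
    return sum(accuracies) / k                      # average accuracy
```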
1. Classification Process (Learning)

The Training Data (Name: Bruce, Dave, William, Marie, Anne, Chris; attributes: Income, Age, Credit rating) are fed to the Classification Algorithms, which derive the Classifier (Model):

IF Income = High
OR Age > 30
THEN CreditRating = Good

2. Classification Process (Accuracy Evaluation)

The Testing Data are run through the Classifier (Model) to estimate its accuracy:

Name | Income | Age      | Credit rating
Tom  | Medium | <30      | bad
Jane | High   | <30      | bad
Wei  | High   | >40      | good
Hua  | Medium | [30..40] | good

3. Classification Process (Classification)

New Data → Classifier (Model) → Credit Rating?

Improving Accuracy

Data → Classifier 1, Classifier 2, Classifier 3, ..., Classifier n
Combine votes → Composite classifier
New Data → Composite classifier → Credit Rating?
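A minimal sketch of combining votes into a composite classifier; each component classifier is assumed to be a callable mapping a sample to a class label (a hypothetical interface, not from the slides):

```python
from collections import Counter

def composite_classify(classifiers, sample):
    """Each classifier casts a vote; the majority label wins."""
    votes = Counter(clf(sample) for clf in classifiers)
    return votes.most_common(1)[0][0]

# e.g. composite_classify([tree_clf, nn_clf, bayes_clf], new_record)
# (tree_clf, nn_clf, bayes_clf are hypothetical classifiers)
```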
Evaluating Classification Methods

Predictive accuracy
Robustness:
Handling noise and missing values
Scalability:
Efficiency in large databases (not memory-resident data)
Interpretability:
The level of understanding and insight provided by the model
Form of rules:
Decision tree size
The compactness of classification rules
[Figure: decision trees whose internal nodes test an attribute (Atr=?) and whose leaves carry a class label (CL)]

An Example from Quinlan's ID3
Training Dataset
Outlook  | Temperature | Humidity | Windy | Class
sunny    | hot         | high     | false | N
sunny    | hot         | high     | true  | N
overcast | hot         | high     | false | P
rain     | mild        | high     | false | P
rain     | cool        | normal   | false | P
rain     | cool        | normal   | true  | N
overcast | cool        | normal   | true  | P
sunny    | mild        | high     | false | N
sunny    | cool        | normal   | false | P
rain     | mild        | normal   | false | P
sunny    | mild        | normal   | true  | P
overcast | mild        | high     | true  | P
overcast | hot         | normal   | false | P
rain     | mild        | high     | true  | N

The decision tree obtained with ID3:

Outlook?
  sunny → Humidity?
    high → N
    normal → P
  overcast → P
  rain → Windy?
    true → N
    false → P
Decision-tree classification proceeds in two phases:
1. Tree construction
2. Tree pruning
Information Gain (ID3)

Assume there are two classes, P and N, and that the set S contains x examples of class P and y examples of class N. The amount of information needed to decide if an arbitrary example in S belongs to P or N is

$I(S_P, S_N) = -\frac{x}{x+y}\log_2\frac{x}{x+y} - \frac{y}{x+y}\log_2\frac{y}{x+y}$

Assume that using attribute A as the root in the tree will partition S in sets {S1, S2, ..., Sv}. If Si contains xi examples of P and yi examples of N, the information needed to classify objects in all subtrees Si is

$E(A) = \sum_{i=1}^{v} \frac{x_i + y_i}{x + y}\, I(S_{P_i}, S_{N_i})$

The information gained by branching on A is $gain(A) = I(S_P, S_N) - E(A)$.

In general, for n classes with si examples of class i and s examples in total,

$I(s_1, s_2, \ldots, s_n) = -\sum_{i=1}^{n} \frac{s_i}{s}\log_2\frac{s_i}{s}$ and $E(A) = \sum_{i=1}^{v} \frac{s_{1i} + s_{2i} + \cdots + s_{ni}}{s}\, I(s_{1i}, s_{2i}, \ldots, s_{ni})$

For the training dataset above:
gain(outlook) = 0.246
gain(temperature) = 0.029
gain(humidity) = 0.151
gain(windy) = 0.048
Outlook, having the largest gain, is selected as the root.

Gini Index

If a set S contains examples from n classes, with pj the relative frequency of class j in S,

$gini(S) = 1 - \sum_{j=1}^{n} p_j^2$

If S is split into two subsets S1 and S2 with N1 and N2 examples respectively,

$gini_{split}(S) = \frac{N_1}{N}\, gini(S_1) + \frac{N_2}{N}\, gini(S_2)$

The attribute that provides the smallest gini_split(S) is chosen to split the node (need to enumerate all possible splitting points for each attribute).
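A small Python sketch of both measures, written against a list of dict-shaped examples like the training dataset above (the function names are mine, not from the slides):

```python
from math import log2

def info(counts):
    """I(s1, ..., sn) = -sum (si/s) log2 (si/s): expected information."""
    s = sum(counts)
    return -sum(c / s * log2(c / s) for c in counts if c)

def class_counts(rows, class_attr="Class"):
    """Class distribution of a sample set, as a list of counts."""
    labels = [r[class_attr] for r in rows]
    return [labels.count(l) for l in set(labels)]

def gain(rows, attr, class_attr="Class"):
    """gain(A) = I(S) - E(A), weighting each partition by its size."""
    s = len(rows)
    e_a = 0.0
    for v in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == v]
        e_a += len(subset) / s * info(class_counts(subset, class_attr))
    return info(class_counts(rows, class_attr)) - e_a

def gini(counts):
    """gini(S) = 1 - sum_j pj^2 for a class distribution."""
    s = sum(counts)
    return 1 - sum((c / s) ** 2 for c in counts)
```

On the 14-example play-tennis data above, gain(rows, "Outlook") evaluates to 0.246, matching the slide's gain(outlook).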
Recursive process:
The tree starts as a single node representing all data.
If the samples are all of the same class, the node becomes a leaf labeled with that class label.
Otherwise, select the attribute that best separates the samples into individual classes.
Recursion stops when the samples at a node are all of the same class, when no attributes remain to split on, or when no samples are left.
Split criterion:
Used to select the attribute to split on at a tree node during the tree generation phase.
Different algorithms may use different goodness functions: information gain, gini index, etc.
Branching scheme:
Determining the tree branch to which a sample belongs: binary splitting (gini index) versus multi-way splitting (information gain).
Stopping decision: when to stop the further splitting of a node, e.g. based on an impurity measure.
Labeling rule: a node is labeled as the class to which most samples at the node belong.
Algorithm

A greedy algorithm: make the locally optimal choice at each step by selecting the best splitting attribute for each tree node.

Create a node N for the current sample set S.
If the samples are all of the same class C, then return N as a leaf node labeled with C.
If the attribute-list is empty, then return N as a leaf node labeled with the most common class.
Otherwise, select the best splitting attribute and partition S into subsets Si, one per attribute value.
If Si is empty, then attach a leaf labeled with the most common class; else recursively run the algorithm at node Si.
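A recursive sketch of this greedy algorithm, reusing gain() from the information-gain sketch above; the dict-based tree representation is an assumption of mine. Branching only on observed attribute values means no Si is ever empty here; with a fixed attribute domain, an empty Si would become a leaf labeled with the most common class, as stated above.

```python
def build_tree(rows, attributes, class_attr="Class"):
    """Greedy, ID3-style decision-tree construction sketch."""
    labels = [r[class_attr] for r in rows]
    # samples all of the same class C -> leaf labeled with C
    if len(set(labels)) == 1:
        return labels[0]
    # attribute list empty -> leaf labeled with the most common class
    if not attributes:
        return max(set(labels), key=labels.count)
    # greedy step: pick the attribute with the highest information gain
    best = max(attributes, key=lambda a: gain(rows, a, class_attr))
    rest = [a for a in attributes if a != best]
    tree = {"attribute": best, "branches": {}}
    for v in {r[best] for r in rows}:            # one branch per observed value
        subset = [r for r in rows if r[best] == v]
        tree["branches"][v] = build_tree(subset, rest, class_attr)
    return tree
```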
Directly:
Test the attribute values of the unknown sample against the tree.
A path is traced from the root to a leaf, which holds the label.
Indirectly:
The decision tree is converted to classification rules.
One rule is created for each path from the root to a leaf.
IF-THEN rules are easier for humans to understand.
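A sketch of both uses, written against the dict tree representation from the construction sketch above:

```python
def classify(tree, sample):
    """Directly: trace a path from the root to the leaf holding the label."""
    while isinstance(tree, dict):                    # internal node
        tree = tree["branches"][sample[tree["attribute"]]]
    return tree                                      # leaf = class label

def tree_to_rules(tree, conditions=()):
    """Indirectly: one IF-THEN rule per path from the root to a leaf."""
    if not isinstance(tree, dict):                   # reached a leaf
        cond = " AND ".join(f"{a} = {v}" for a, v in conditions)
        return [f"IF {cond} THEN Class = {tree}"]
    rules = []
    for v, subtree in tree["branches"].items():
        rules += tree_to_rules(subtree, conditions + ((tree["attribute"], v),))
    return rules
```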
Gain Ratio

$SplitInfo(S, A) = -\sum_{i=1}^{v} \frac{|S_i|}{|S|}\log_2\frac{|S_i|}{|S|}$

$GainRatio(S, A) = \frac{Gain(S, A)}{SplitInfo(S, A)}$

Tree Pruning

A decision tree constructed using the training data may have too many branches/leaf nodes.
Caused by noise, over-fitting.
May result in poor accuracy for unseen samples.
Prune the tree: merge a subtree into a leaf node.

Pruning Criterion

Using a set of data different from the training data:
At a tree node, if the accuracy without splitting is higher than the accuracy with splitting, replace the subtree with a leaf node and label it using the majority class.
Issues:
Obtaining the testing data.
Criteria other than accuracy (e.g. minimum description length).
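A sketch of both quantities, reusing gain() and the log2 import from the earlier information-gain sketch:

```python
def split_info(rows, attr):
    """SplitInfo(S, A) = -sum |Si|/|S| * log2(|Si|/|S|)"""
    s = len(rows)
    sizes = [sum(1 for r in rows if r[attr] == v)
             for v in {r[attr] for r in rows}]
    return -sum(n / s * log2(n / s) for n in sizes)

def gain_ratio(rows, attr, class_attr="Class"):
    """GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)"""
    si = split_info(rows, attr)
    return gain(rows, attr, class_attr) / si if si else 0.0
```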
Scalability
Generalization-based classification
Parallel and distributed processing
Successful examples:
SLIQ (EDBT'96, Mehta et al. 1996)
Issues in scalability
selecting the splitting attribute at each tree node
selecting splitting points for the chosen attribute
more serious for numeric attributes
fast tree pruning algorithm
SLIQ (I)

A disk-based algorithm
A decision-tree based algorithm

SLIQ (II)

Example data (attributes Age and Credit Rating; class attribute Send Ad):

rid | Age | Credit Rating | Send Ad
1   | 25  | excellent     | no
2   | 25  | good          | no
3   | 34  | excellent     | yes
4   | 39  | fair          | no
5   | 39  | fair          | yes
6   | 42  | good          | yes
7   | 45  | excellent     | no

[Figure: one pre-sorted attribute list of (value, rid) entries per attribute, plus a class list of (rid, class) entries, built from this table]
RainForest

SPRINT (I)

[Figure: attribute lists carrying (Age, Send Ad, rid) and (Credit Rating, Send Ad, rid) entries attached to a tree node]
DBMiner
Neural Networks
Advantages
prediction accuracy is generally high.
robust, works when training examples contain errors.
output may be discrete, real-valued, or a vector of
several discrete or real-valued attributes.
fast evaluation of the learned target function.
Criticism
long training time.
difficult to understand the learned function (weights).
not easy to incorporate domain knowledge.
A Neuron

[Figure: inputs x0, x1, ..., xn with weights w0, w1, ..., wn (w0 attached to the bias input x0); the weighted sum Σ wi·xi feeds an activation function that produces the output y]

[Figure: multi-layer network; the input vector xi enters the input nodes, weighted links lead to the hidden nodes and then to the output nodes, which produce the output vector]
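A one-neuron sketch of the figure: weighted sum plus bias, passed through an activation function. The sigmoid is my choice here; the slide does not fix a particular f:

```python
from math import exp

def neuron(x, w, bias):
    """y = f(sum_i wi*xi + bias), with f(v) = 1 / (1 + e^-v)."""
    v = sum(wi * xi for wi, xi in zip(w, x)) + bias   # weighted sum
    return 1.0 / (1.0 + exp(-v))                      # activation function

# e.g. neuron([0.3, 0.7], [0.5, -0.2], bias=0.1)
```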
Learning Paradigms

(1) Classification: adjust weights using Error = Desired − Actual
(2) Reinforcement: adjust weights using reinforcement

[Figure: Inputs feed the network, which produces the Actual Output]

Learning Algorithms
Constructing a Network

Network Training

Steps:

Network Pruning
Bayes Theorem

$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$

where X is a data sample and H a hypothesis, e.g. that X belongs to class C.

Play-tennis example: estimating P(xk | C) for the classes P and N. Naive Bayesian classification assumes the attributes are independent given the class, so $P(X \mid C) = \prod_{k=1}^{n} P(x_k \mid C)$.

Outlook     | P   | N
sunny       | 2/9 | 3/5
overcast    | 4/9 | 0
rain        | 3/9 | 2/5

Temperature | P   | N
hot         | 2/9 | 2/5
mild        | 4/9 | 2/5
cool        | 3/9 | 1/5

Humidity    | P   | N
high        | 3/9 | 4/5
normal      | 6/9 | 1/5

Windy       | P   | N
true        | 3/9 | 3/5
false       | 6/9 | 2/5
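A sketch of naive Bayesian classification using exactly the table above, with priors P(P) = 9/14 and P(N) = 5/14 taken from the 14-example training set:

```python
priors = {"P": 9 / 14, "N": 5 / 14}

# P(value | class) from the table above
cond = {
    ("Outlook", "sunny"):    {"P": 2/9, "N": 3/5},
    ("Outlook", "overcast"): {"P": 4/9, "N": 0.0},
    ("Outlook", "rain"):     {"P": 3/9, "N": 2/5},
    ("Temperature", "hot"):  {"P": 2/9, "N": 2/5},
    ("Temperature", "mild"): {"P": 4/9, "N": 2/5},
    ("Temperature", "cool"): {"P": 3/9, "N": 1/5},
    ("Humidity", "high"):    {"P": 3/9, "N": 4/5},
    ("Humidity", "normal"):  {"P": 6/9, "N": 1/5},
    ("Windy", "true"):       {"P": 3/9, "N": 3/5},
    ("Windy", "false"):      {"P": 6/9, "N": 2/5},
}

def naive_bayes(sample):
    """Pick the class C maximizing P(C) * prod_k P(xk | C)."""
    best, best_p = None, -1.0
    for c, prior in priors.items():
        p = prior
        for attr, value in sample.items():
            p *= cond[(attr, value)][c]
        if p > best_p:
            best, best_p = c, p
    return best

# e.g. naive_bayes({"Outlook": "sunny", "Humidity": "high",
#                   "Windy": "false"}) -> "N"
```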
Belief Network
Allows class conditional dependencies to be expressed.
It has a directed acyclic graph (DAG) and a set of
conditional probability tables (CPT).
Nodes in the graph represent variables and arcs represent
probabilistic dependencies. (child dependent on parent)
There is one table for each variable X. The table contains
the conditional distribution P(X|Parents(X)).
[Figure: belief network DAG with nodes FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, and Dyspnea; FamilyHistory and Smoker are the parents of LungCancer]

The conditional probability table (CPT) for the variable LungCancer:

    | (FH, S) | (FH, ~S) | (~FH, S) | (~FH, ~S)
LC  | 0.8     | 0.5      | 0.7      | 0.1
~LC | 0.2     | 0.5      | 0.3      | 0.9
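A minimal sketch of reading this CPT, with the parent configuration encoded as a pair of booleans (my encoding, not from the slides):

```python
# P(LC = true | FamilyHistory, Smoker), from the CPT above
cpt_lc = {
    (True, True):   0.8,   # (FH, S)
    (True, False):  0.5,   # (FH, ~S)
    (False, True):  0.7,   # (~FH, S)
    (False, False): 0.1,   # (~FH, ~S)
}

def p_lung_cancer(fh, smoker, lc=True):
    """Conditional distribution P(LC | Parents(LC)), looked up in the CPT."""
    p = cpt_lc[(fh, smoker)]
    return p if lc else 1.0 - p

# e.g. p_lung_cancer(False, True) -> 0.7
#      p_lung_cancer(False, True, lc=False) -> 0.3
```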
Prediction: Linear Regression

Linear regression: approximate the data distribution by a line $Y = \alpha + \beta X$.
Y is the response variable and X the predictor variable.
$\alpha$ and $\beta$ are regression coefficients specifying the intercept and the slope of the line. They are calculated by the least squares method:

$\beta = \frac{\sum_{i=1}^{s}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{s}(x_i - \bar{x})^2}, \qquad \alpha = \bar{y} - \beta\,\bar{x}$

Multiple regression: $Y = \alpha + \beta_1 X_1 + \beta_2 X_2$.
Many nonlinear functions can be transformed into the above.
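A direct transcription of the least squares formulas into Python (function name mine):

```python
def linear_regression(xs, ys):
    """Fit y = alpha + beta*x by least squares:
    beta  = sum (xi - x_bar)(yi - y_bar) / sum (xi - x_bar)^2
    alpha = y_bar - beta * x_bar
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    beta = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))
    alpha = y_bar - beta * x_bar
    return alpha, beta

# e.g. linear_regression([1, 2, 3], [2, 4, 6]) -> (0.0, 2.0)
```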