CS-572 Data Mining and Information Retrieval
Week 06 Classification: Decision Tree Classifier
Dr. Waqar ul Qounain

Classification

Classification: Basic Concepts
Decision Tree Induction
Bayes Classification Methods
Rule-Based Classification
Neural Networks
Model Evaluation and Selection
Techniques to Improve Classification Accuracy: Ensemble Methods
Summary

Supervised vs. Unsupervised Learning

Supervised learning (classification)
Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data are classified based on the training set

Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Prediction Problems: Classification vs. Numeric Prediction

Classification
predicts categorical class labels (discrete or nominal)
constructs a model from the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
Numeric Prediction
models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications
Credit/loan approval
Medical diagnosis: is a tumor cancerous or benign?
Fraud detection: is a transaction fraudulent?
Web page categorization: which category does a page belong to?

Classification: A Two-Step Process

Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: classifying future or unknown objects
Estimate the accuracy of the model
The known label of each test sample is compared with the classified result from the model
The accuracy rate is the percentage of test set samples that are correctly classified by the model
The test set is independent of the training set (otherwise the accuracy estimate is overly optimistic)
If the accuracy is acceptable, use the model to classify new data
Note: if the test set is used to select models, it is called a validation (test) set
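As an aside, the two-step process maps directly onto a common library workflow. The sketch below uses scikit-learn and its bundled iris data purely for illustration; none of it comes from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1, model construction: learn a classifier from the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2, model usage: estimate accuracy on an independent test set,
# then (if acceptable) classify new, unlabeled data.
print(accuracy_score(y_test, model.predict(X_test)))
```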

Process (1): Model Construction

The training data is fed to a classification algorithm, which produces the classifier (model).

Training Data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (Model):
IF rank = professor OR years > 6
THEN tenured = yes
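A tiny sketch of the learned rule applied in code; the data and rule come from the slide, while the function name is just illustrative.

```python
# Training tuples from the slide: (name, rank, years, tenured).
training_data = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def predict_tenured(rank, years):
    """IF rank = professor OR years > 6 THEN tenured = yes."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# The rule reproduces every label in the training set.
for name, rank, years, tenured in training_data:
    assert predict_tenured(rank, years) == tenured
```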

Process (2): Using the Model in Prediction

The classifier is first evaluated on testing data and then applied to unseen data.

Testing Data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data:
(Jeff, Professor, 4)  ->  Tenured? (the model predicts yes)

Decision Tree Classifier

Decision tree
Tuples flow along the tree structure
An internal node denotes a test on an attribute
A branch represents a value of the node attribute
Leaf nodes represent class labels or class distributions
Example: playing tennis


Decision Tree Induction: An Example

Training data set: Buys_computer

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

Resulting tree:
age?
  <=30:   student?
            no  -> no
            yes -> yes
  31..40: yes
  >40:    credit_rating?
            excellent -> no
            fair      -> yes

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm)
The tree is constructed in a top-down, recursive, divide-and-conquer manner
At the start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning (majority voting is employed to label the leaf)
There are no samples left
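A compact sketch of this top-down, divide-and-conquer recursion, assuming categorical attributes and a class label stored as the last field of each tuple; the names build_tree and select_attribute are illustrative, and the attribute-selection heuristic is plugged in separately (information gain is sketched after the next slides).

```python
from collections import Counter

def majority_class(rows):
    """Majority voting over the class label (assumed to be the last field)."""
    return Counter(r[-1] for r in rows).most_common(1)[0][0]

def build_tree(rows, attributes, select_attribute):
    """Greedy top-down induction over categorical attributes.

    rows: list of tuples, class label last; attributes: list of column indexes;
    select_attribute(rows, attributes): heuristic such as information gain.
    """
    classes = {r[-1] for r in rows}
    if len(classes) == 1:                      # all samples belong to one class
        return classes.pop()
    if not attributes:                         # no attributes left: majority vote
        return majority_class(rows)
    best = select_attribute(rows, attributes)
    node = {"attribute": best, "branches": {}}
    for value in {r[best] for r in rows}:      # partition on each value of best
        subset = [r for r in rows if r[best] == value]
        remaining = [a for a in attributes if a != best]
        node["branches"][value] = build_tree(subset, remaining, select_attribute)
    return node
```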

What is ID3?

An algorithm for building a decision tree.
Invented by J. Ross Quinlan in 1979.
Uses information theory, introduced by Shannon in 1948.
Builds the tree from the top down, with no backtracking.
Information gain is used to select the most useful attribute for classification.

ID3 Algorithm for DTC

Entropy

A measure of the homogeneity of a sample.
A completely homogeneous sample has entropy 0.
An equally divided sample has entropy 1.
For a sample of positive and negative elements:
Entropy(S) = -(p+) log2(p+) - (p-) log2(p-)
Example (9 positive and 5 negative elements):
Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
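A minimal sketch of this calculation; the helper name entropy is illustrative, not part of the slides.

```python
import math

def entropy(counts):
    """Entropy of a class distribution given as a list of class counts."""
    total = sum(counts)
    # 0 * log2(0) is treated as 0 by skipping empty classes
    return sum(-c / total * math.log2(c / total) for c in counts if c)

print(f"{entropy([9, 5]):.3f}")   # 0.940, matching the slide's example
print(entropy([7, 7]))            # 1.0 for an equally divided sample
print(entropy([14, 0]))           # 0.0 for a completely homogeneous sample
```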

Information Gain (IG)

Information gain is based on the decrease in entropy after a dataset is split on an attribute: which attribute creates the most homogeneous branches?
First the entropy of the total dataset is calculated.
The dataset is then split on each of the different attributes.
The entropy of each branch is calculated, then added proportionally to get the total entropy of the split.
The resulting entropy is subtracted from the entropy before the split.
The result is the information gain, i.e., the decrease in entropy.
The attribute that yields the largest IG is chosen for the decision node.

Select the attribute with the highest information gain.
Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|.
Expected information (entropy) needed to classify a tuple in D:
Info(D) = - Σ_{i=1..m} pi log2(pi)
Information needed to classify D after using A to split D into v partitions:
Info_A(D) = Σ_{j=1..v} (|Dj| / |D|) × Info(Dj)
Information gained by branching on attribute A:
Gain(A) = Info(D) - Info_A(D)
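These formulas translate into a few lines of Python; info, info_gain and select_attribute are illustrative names, the class label is assumed to be the last field of each tuple, and select_attribute can serve as the heuristic plugged into the build_tree sketch above.

```python
import math
from collections import Counter, defaultdict

def info(rows):
    """Info(D): entropy of the class label distribution (label is the last field)."""
    total = len(rows)
    return sum(-c / total * math.log2(c / total)
               for c in Counter(r[-1] for r in rows).values())

def info_gain(rows, attr):
    """Gain(A) = Info(D) - Info_A(D) for the categorical attribute at index attr."""
    partitions = defaultdict(list)
    for r in rows:
        partitions[r[attr]].append(r)
    info_a = sum(len(p) / len(rows) * info(p) for p in partitions.values())
    return info(rows) - info_a

def select_attribute(rows, attributes):
    """ID3's heuristic: choose the attribute with the highest information gain."""
    return max(attributes, key=lambda a: info_gain(rows, a))
```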


Example of ID3: Entropy and Information Gain

The example data set has 10 instances described by Country of Origin (CoO), Genre and BigStar, with class labels [True, False].

Entropy S(total) of the entire dataset:
Classes [True, False]; total instances 10; number of True and False instances 5 and 5.
S(total) = - 5/10 log2(5/10) - 5/10 log2(5/10) = 1

To select the root attribute to split on, the information gain of each attribute is calculated.

Information Gain IG(Country of Origin):
Entropy of United States:
S(US) = - 3/4 log2(3/4) - 1/4 log2(1/4) = 0.311 + 0.5 = 0.811
Entropy of Europe:
S(Europe) = - 2/4 log2(2/4) - 2/4 log2(2/4) = 0.5 + 0.5 = 1
Entropy of Rest of World:
S(RoW) = - 0/2 log2(0/2) - 2/2 log2(2/2) = 0 + 0 = 0
IG(CoO) = S(Total) - 4/10 S(US) - 4/10 S(Europe) - 2/10 S(RoW)
IG(CoO) = 1 - 0.4 × 0.811 - 0.4 × 1 - 0.2 × 0
IG(CoO) = 1 - 0.3244 - 0.4 - 0 = 0.2756

Information Gain IG(Genre):
IG(Genre) = S(Total) - 3/10 S(SF) - 6/10 S(Comedy) - 1/10 S(Romance)
IG(Genre) = 1 - 0.3 × 0.9182 - 0.6 × 0.9182 - 0.1 × 0
IG(Genre) = 1 - 0.2754 - 0.5509 - 0 = 0.17

Information Gain IG(BigStar):
IG(BigStar) = S(Total) - 7/10 S(Yes) - 3/10 S(No)
IG(BigStar) = 1 - 0.7 × 0.9852 - 0.3 × 1
IG(BigStar) = 1 - 0.6896 - 0.3 = 0.01

Country of Origin has the largest information gain, so it is chosen as the root attribute.
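As a quick check of the arithmetic above, the same numbers can be reproduced from the per-value class counts; the helper names are illustrative, and BigStar is omitted because its per-value counts are not listed on the slides.

```python
import math

def entropy(counts):
    """Entropy of a (True, False) count pair, as in the earlier sketch."""
    total = sum(counts)
    return sum(-c / total * math.log2(c / total) for c in counts if c)

def weighted_entropy(partitions):
    """Weighted entropy of a split, given (true, false) counts per attribute value."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * entropy(p) for p in partitions)

s_total = entropy([5, 5])                                        # 1.0
ig_coo = s_total - weighted_entropy([(3, 1), (2, 2), (0, 2)])    # US, Europe, RoW
ig_genre = s_total - weighted_entropy([(1, 2), (4, 2), (0, 1)])  # SF, Comedy, Romance
print(f"{ig_coo:.4f}")     # ~0.2755, which the slide rounds to 0.2756
print(f"{ig_genre:.4f}")   # ~0.1735, which the slide rounds to 0.17
```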

Gain Ratio (C4.5)

The information gain measure is biased towards attributes with a large number of values (attributes with high cardinality).
C4.5 (a successor of ID3) uses the gain ratio to overcome this problem, a normalization of the information gain:
SplitInfo_A(D) = - Σ_{j=1..v} (|Dj| / |D|) log2(|Dj| / |D|)
GainRatio(A) = InformationGain(A) / SplitInfo(A)
SplitInfo(CoO) = - 4/10 log2(4/10) - 4/10 log2(4/10) - 2/10 log2(2/10) = 1.5219
GainRatio(CoO) = 0.2756 / 1.5219 = 0.1811
SplitInfo(Genre) = - 3/10 log2(3/10) - 6/10 log2(6/10) - 1/10 log2(1/10) = 1.2954
GainRatio(Genre) = 0.17 / 1.2954 = 0.1312
SplitInfo(BigStar) = - 7/10 log2(7/10) - 3/10 log2(3/10) = 0.8812
GainRatio(BigStar) = 0.01 / 0.8812 = 0.0112
The attribute with the maximum gain ratio is selected as the splitting attribute; here Country of Origin still has the highest gain ratio.
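A small sketch of the normalization; split_info and gain_ratio are illustrative names, and the partition sizes are the per-value tuple counts from the slides.

```python
import math

def split_info(value_sizes):
    """SplitInfo(A) = - sum (|Dj|/|D|) log2(|Dj|/|D|) over the partition sizes."""
    total = sum(value_sizes)
    return sum(-s / total * math.log2(s / total) for s in value_sizes)

def gain_ratio(gain, value_sizes):
    return gain / split_info(value_sizes)

print(f"{split_info([4, 4, 2]):.4f}")           # ~1.5219 for Country of Origin
print(f"{gain_ratio(0.2756, [4, 4, 2]):.4f}")   # ~0.1811
print(f"{gain_ratio(0.17, [3, 6, 1]):.4f}")     # ~0.1312 for Genre
print(f"{gain_ratio(0.01, [7, 3]):.4f}")        # ~0.0113 (the slide rounds to 0.0112)
```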

Gini Index

If a data set D contains examples from n classes, the gini index gini(D) is defined as
gini(D) = 1 - Σ_{j=1..n} pj^2
where pj is the relative frequency of class j in D.
If a data set D is split on attribute A into two subsets D1 and D2, the gini index of the split is defined as
gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)
Reduction in impurity:
Δgini(A) = gini(D) - gini_A(D)
The attribute that provides the smallest gini_A(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (all possible splitting points must be enumerated for each attribute).

Gini Index: Finding the Best Split of an Attribute

If Genre has to be split into a binary split, should it be split as {(SF, C), R}, {(SF, R), C}, or {SF, (C, R)}?
The split with the minimum gini index is used to pick the split.
SF, C and R have 3, 6 and 1 tuples in the data set; (SF, C), (SF, R) and (C, R) have 9, 4 and 7 tuples respectively.
The (positive, negative) tuple counts of SF, C and R are (1, 2), (4, 2) and (0, 1); those of (SF, C), (SF, R) and (C, R) are (5, 4), (1, 3) and (4, 3) respectively.
Gini of the entire data set:
gini(total) = 1 - (5/10)^2 - (5/10)^2 = 0.50
Gini of each candidate binary split:
gini_Genre{(SF,C),R}(D) = 9/10 [1 - (5/9)^2 - (4/9)^2] + 1/10 [1 - (0/1)^2 - (1/1)^2] = 0.4444
gini_Genre{(SF,R),C}(D) = 4/10 [1 - (1/4)^2 - (3/4)^2] + 6/10 [1 - (4/6)^2 - (2/6)^2] = 0.4167
gini_Genre{SF,(C,R)}(D) = 3/10 [1 - (1/3)^2 - (2/3)^2] + 7/10 [1 - (4/7)^2 - (3/7)^2] = 0.4762
Reduction in impurity:
For Genre {(SF, C), R}: 0.50 - 0.4444 = 0.0556
For Genre {(SF, R), C}: 0.50 - 0.4167 = 0.0833
For Genre {SF, (C, R)}: 0.50 - 0.4762 = 0.0238
The grouping {(SF, R), C} gives the smallest gini (largest reduction in impurity), so it is the best binary split of Genre.
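A short sketch of the gini computation that reproduces the three candidate groupings of Genre; gini and gini_split are illustrative names, and the (positive, negative) counts are taken from the slide.

```python
def gini(counts):
    """gini(D) = 1 - sum of squared class probabilities."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted gini of a partition, given (positive, negative) counts per subset."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

print(f"{gini([5, 5]):.2f}")                   # 0.50 for the whole data set
print(f"{gini_split([(5, 4), (0, 1)]):.4f}")   # {(SF, C), R}  -> ~0.4444
print(f"{gini_split([(1, 3), (4, 2)]):.4f}")   # {(SF, R), C}  -> ~0.4167
print(f"{gini_split([(1, 2), (4, 3)]):.4f}")   # {SF, (C, R)}  -> ~0.4762
```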

Computing Information Gain for Continuous-Valued Attributes

Let attribute A be a continuous-valued attribute.
The best split point for A must be determined:
Sort the values of A in increasing order.
Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (ai + ai+1) / 2 is the midpoint between the values ai and ai+1.
The point with the minimum expected information requirement for A is selected as the split point for A.
Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point.
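A minimal sketch of generating candidate split points for a continuous attribute; the function name and the toy values are illustrative, not from the slides.

```python
def candidate_split_points(values):
    """Midpoints between adjacent sorted values of a continuous attribute A."""
    vs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])]

# Toy example: ages of training tuples (illustrative values only).
ages = [23, 29, 31, 35, 40, 45]
print(candidate_split_points(ages))   # [26.0, 30.0, 33.0, 37.5, 42.5]
# Each candidate point would be evaluated with Info_A(D) (or gini), and the one
# with the minimum expected information requirement chosen as the split point.
```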

Decision Tree - Example

(Figure: the decision tree induced for the example data set, with Country of Origin (CO) at the root, branches for US, EU and RoW, and further tests on Genre (Ge) and BigStar (BS).)


Classification

How to classify the following test examples with the decision tree built above?
{Europe, Yes, Science Fiction, ?}
{Rest of World, No, Comedy, ?}
{United States, Yes, Romance, ?}
Tracing each example down the tree fills in the class labels:
{Europe, Yes, Science Fiction, False}
{Rest of World, No, Comedy, False}
{United States, Yes, Romance, True}

Advantages of using ID3

Understandable prediction rules are created from the training data.
Builds the fastest tree.
Builds a short tree.
Only needs to test enough attributes until all data is classified.
Finding leaf nodes enables test data to be pruned, reducing the number of tests.
The whole dataset is searched to create the tree.

Disadvantages of using ID3

Data may be over-fitted or over-classified if a small sample is tested.
Only one attribute at a time is tested for making a decision.
Classifying continuous data may be computationally expensive, as many trees must be generated to see where to break the continuum.

Practical Issues of Classification

Underfitting and Overfitting
Missing Values
Costs of Classification

Underfitting and Overfitting

Underfitting: when the model is too simple, both training and test errors are large.
Overfitting: when the model is too complex, the training error is small while the test error is large.


Overfitting due to Noise

The decision boundary is distorted by noise points.

Overfitting due to Insufficient Examples

A lack of data points in the lower half of the diagram makes it difficult to predict the class labels of that region correctly.
An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.

Decision Tree Pruning

Two approaches to avoid overfitting:
Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold.
It is difficult to choose an appropriate threshold.
Postpruning: remove branches from a fully grown tree to get a sequence of progressively pruned trees.
Use a set of data different from the training data to decide which is the best pruned tree.
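If scikit-learn is available, both ideas map onto familiar parameters; the sketch below is illustrative and not part of the slides: pre-pruning through constraints applied while the tree is grown, post-pruning through cost-complexity pruning of a fully grown tree.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop splitting early via depth / impurity-decrease thresholds.
pre = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                             min_impurity_decrease=0.01).fit(X_train, y_train)

# Post-pruning: grow the tree, then prune with cost-complexity (ccp_alpha > 0).
post = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.02).fit(X_train, y_train)

print(pre.score(X_test, y_test), post.score(X_test, y_test))
```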

Training Error

The error of a classification model on the training data is called the training error:
Training error = incorrectly classified tuples / total number of training tuples
Tree pruning results in a higher training error.

(Figures: three decision trees for the example data set, pruned to different depths.)
Case 1, three levels: training error = 0/10 = 0.0
Case 2, two levels: training error = 1/10 = 0.1
Case 3, one level: training error = 3/10 = 0.3

Notes on Overfitting

Overfitting results in decision trees that are more complex than necessary.
The training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
New ways of estimating errors are needed.


Estimating Generalization Errors

Re-substitution errors: error on the training data, e(t).
Generalization errors: error on the test data, e'(t).
Methods for estimating generalization errors:
Optimistic approach: e'(t) = e(t)
Pessimistic approach:
For each leaf node: e'(t) = e(t) + 0.5
Total errors: e'(T) = e(T) + N × 0.5, where N is the number of leaf nodes
For a tree with 30 leaf nodes and 10 errors on training (out of 1000 instances):
Training error = 10/1000 = 1%
Generalization error = (10 + 30 × 0.5)/1000 = 2.5%
Reduced error pruning (REP):
Uses a validation data set to estimate the generalization error.
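A one-function sketch of the pessimistic estimate; the function name is illustrative.

```python
def pessimistic_generalization_error(train_errors, n_leaves, n_instances, penalty=0.5):
    """e'(T) = (e(T) + N * penalty) / number of instances."""
    return (train_errors + n_leaves * penalty) / n_instances

print(10 / 1000)                                        # training error = 0.01 (1%)
print(pessimistic_generalization_error(10, 30, 1000))   # 0.025 (2.5%), as on the slide
```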
