CS-572 Data Mining and Information Retrieval
Week 06 Classification: Decision Tree Classifier
Dr. Waqar ul Qounain

Classification

Classification: Basic Concepts
Decision Tree Induction
Bayes Classification Methods
Rule-Based Classification
Neural Networks
Model Evaluation and Selection
Techniques to Improve Classification Accuracy: Ensemble Methods
Summary

Supervised vs. Unsupervised Learning

Supervised learning (classification)
Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data are classified based on the training set

Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Prediction Problems: Classification vs. Numeric Prediction

Classification
predicts categorical class labels (discrete or nominal)
constructs a model from the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
Numeric Prediction
models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications
Credit/loan approval
Medical diagnosis: is a tumor cancerous or benign?
Fraud detection: is a transaction fraudulent?
Web page categorization: which category does a page belong to?

Classification: A Two-Step Process

Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: classifying future or unknown objects
Estimate the accuracy of the model
The known label of each test sample is compared with the classified result from the model
The accuracy rate is the percentage of test set samples that are correctly classified by the model
The test set is independent of the training set (otherwise the accuracy estimate is overly optimistic)
If the accuracy is acceptable, use the model to classify new data
Note: if the test set is used to select models, it is called a validation (test) set
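As an aside, the two-step process maps directly onto a common library workflow. The sketch below uses scikit-learn and its bundled iris data purely for illustration; none of it comes from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1, model construction: learn a classifier from the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2, model usage: estimate accuracy on an independent test set,
# then (if acceptable) classify new, unlabeled data.
print(accuracy_score(y_test, model.predict(X_test)))
```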

Process (1): Model Construction

The training data is fed to a classification algorithm, which produces the classifier (model).

Training Data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (Model):
IF rank = professor OR years > 6
THEN tenured = yes
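A tiny sketch of the learned rule applied in code; the data and rule come from the slide, while the function name is just illustrative.

```python
# Training tuples from the slide: (name, rank, years, tenured).
training_data = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def predict_tenured(rank, years):
    """IF rank = professor OR years > 6 THEN tenured = yes."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# The rule reproduces every label in the training set.
for name, rank, years, tenured in training_data:
    assert predict_tenured(rank, years) == tenured
```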

Process (2): Using the Model in Prediction

The classifier is first evaluated on testing data and then applied to unseen data.

Testing Data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data:
(Jeff, Professor, 4)  ->  Tenured? (the model predicts yes)

Decision Tree Classifier

Decision tree
Tuples flow along the tree structure
An internal node denotes a test on an attribute
A branch represents a value of the node attribute
Leaf nodes represent class labels or class distributions
Example: playing tennis


Decision Tree Induction: An Example

Training data set: Buys_computer

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

Resulting tree:
age?
  <=30:   student?
            no  -> no
            yes -> yes
  31..40: yes
  >40:    credit_rating?
            excellent -> no
            fair      -> yes

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm)
The tree is constructed in a top-down, recursive, divide-and-conquer manner
At the start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning (majority voting is employed to label the leaf)
There are no samples left
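A compact sketch of this top-down, divide-and-conquer recursion, assuming categorical attributes and a class label stored as the last field of each tuple; the names build_tree and select_attribute are illustrative, and the attribute-selection heuristic is plugged in separately (information gain is sketched after the next slides).

```python
from collections import Counter

def majority_class(rows):
    """Majority voting over the class label (assumed to be the last field)."""
    return Counter(r[-1] for r in rows).most_common(1)[0][0]

def build_tree(rows, attributes, select_attribute):
    """Greedy top-down induction over categorical attributes.

    rows: list of tuples, class label last; attributes: list of column indexes;
    select_attribute(rows, attributes): heuristic such as information gain.
    """
    classes = {r[-1] for r in rows}
    if len(classes) == 1:                      # all samples belong to one class
        return classes.pop()
    if not attributes:                         # no attributes left: majority vote
        return majority_class(rows)
    best = select_attribute(rows, attributes)
    node = {"attribute": best, "branches": {}}
    for value in {r[best] for r in rows}:      # partition on each value of best
        subset = [r for r in rows if r[best] == value]
        remaining = [a for a in attributes if a != best]
        node["branches"][value] = build_tree(subset, remaining, select_attribute)
    return node
```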

What is ID3?

An algorithm for building a decision tree.
Invented by J. Ross Quinlan in 1979.
Uses information theory, introduced by Shannon in 1948.
Builds the tree from the top down, with no backtracking.
Information gain is used to select the most useful attribute for classification.

ID3 Algorithm for DTC

Entropy

A measure of the homogeneity of a sample.
A completely homogeneous sample has entropy 0.
An equally divided sample has entropy 1.
For a sample of positive and negative elements:
Entropy(S) = -(p+) log2(p+) - (p-) log2(p-)
Example (9 positive and 5 negative elements):
Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
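A minimal sketch of this calculation; the helper name entropy is illustrative, not part of the slides.

```python
import math

def entropy(counts):
    """Entropy of a class distribution given as a list of class counts."""
    total = sum(counts)
    # 0 * log2(0) is treated as 0 by skipping empty classes
    return sum(-c / total * math.log2(c / total) for c in counts if c)

print(f"{entropy([9, 5]):.3f}")   # 0.940, matching the slide's example
print(entropy([7, 7]))            # 1.0 for an equally divided sample
print(entropy([14, 0]))           # 0.0 for a completely homogeneous sample
```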

Information Gain (IG)

Information gain is based on the decrease in entropy after a dataset is split on an attribute: which attribute creates the most homogeneous branches?
First the entropy of the total dataset is calculated.
The dataset is then split on each of the different attributes.
The entropy of each branch is calculated, then added proportionally to get the total entropy of the split.
The resulting entropy is subtracted from the entropy before the split.
The result is the information gain, i.e., the decrease in entropy.
The attribute that yields the largest IG is chosen for the decision node.

Select the attribute with the highest information gain.
Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|.
Expected information (entropy) needed to classify a tuple in D:
Info(D) = - Σ_{i=1..m} pi log2(pi)
Information needed to classify D after using A to split D into v partitions:
Info_A(D) = Σ_{j=1..v} (|Dj| / |D|) × Info(Dj)
Information gained by branching on attribute A:
Gain(A) = Info(D) - Info_A(D)
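These formulas translate into a few lines of Python; info, info_gain and select_attribute are illustrative names, the class label is assumed to be the last field of each tuple, and select_attribute can serve as the heuristic plugged into the build_tree sketch above.

```python
import math
from collections import Counter, defaultdict

def info(rows):
    """Info(D): entropy of the class label distribution (label is the last field)."""
    total = len(rows)
    return sum(-c / total * math.log2(c / total)
               for c in Counter(r[-1] for r in rows).values())

def info_gain(rows, attr):
    """Gain(A) = Info(D) - Info_A(D) for the categorical attribute at index attr."""
    partitions = defaultdict(list)
    for r in rows:
        partitions[r[attr]].append(r)
    info_a = sum(len(p) / len(rows) * info(p) for p in partitions.values())
    return info(rows) - info_a

def select_attribute(rows, attributes):
    """ID3's heuristic: choose the attribute with the highest information gain."""
    return max(attributes, key=lambda a: info_gain(rows, a))
```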


Example of ID3: Entropy and Information Gain

The example data set has 10 instances described by Country of Origin (CoO), Genre and BigStar, with class labels [True, False].

Entropy S(total) of the entire dataset:
Classes [True, False]; total instances 10; number of True and False instances 5 and 5.
S(total) = - 5/10 log2(5/10) - 5/10 log2(5/10) = 1

To select the root attribute to split on, the information gain of each attribute is calculated.

Information Gain IG(Country of Origin):
Entropy of United States:
S(US) = - 3/4 log2(3/4) - 1/4 log2(1/4) = 0.311 + 0.5 = 0.811
Entropy of Europe:
S(Europe) = - 2/4 log2(2/4) - 2/4 log2(2/4) = 0.5 + 0.5 = 1
Entropy of Rest of World:
S(RoW) = - 0/2 log2(0/2) - 2/2 log2(2/2) = 0 + 0 = 0
IG(CoO) = S(Total) - 4/10 S(US) - 4/10 S(Europe) - 2/10 S(RoW)
IG(CoO) = 1 - 0.4 × 0.811 - 0.4 × 1 - 0.2 × 0
IG(CoO) = 1 - 0.3244 - 0.4 - 0 = 0.2756

Information Gain IG(Genre):
IG(Genre) = S(Total) - 3/10 S(SF) - 6/10 S(Comedy) - 1/10 S(Romance)
IG(Genre) = 1 - 0.3 × 0.9182 - 0.6 × 0.9182 - 0.1 × 0
IG(Genre) = 1 - 0.2754 - 0.5509 - 0 = 0.17

Information Gain IG(BigStar):
IG(BigStar) = S(Total) - 7/10 S(Yes) - 3/10 S(No)
IG(BigStar) = 1 - 0.7 × 0.9852 - 0.3 × 1
IG(BigStar) = 1 - 0.6896 - 0.3 = 0.01

Country of Origin has the largest information gain, so it is chosen as the root attribute.
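As a quick check of the arithmetic above, the same numbers can be reproduced from the per-value class counts; the helper names are illustrative, and BigStar is omitted because its per-value counts are not listed on the slides.

```python
import math

def entropy(counts):
    """Entropy of a (True, False) count pair, as in the earlier sketch."""
    total = sum(counts)
    return sum(-c / total * math.log2(c / total) for c in counts if c)

def weighted_entropy(partitions):
    """Weighted entropy of a split, given (true, false) counts per attribute value."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * entropy(p) for p in partitions)

s_total = entropy([5, 5])                                        # 1.0
ig_coo = s_total - weighted_entropy([(3, 1), (2, 2), (0, 2)])    # US, Europe, RoW
ig_genre = s_total - weighted_entropy([(1, 2), (4, 2), (0, 1)])  # SF, Comedy, Romance
print(f"{ig_coo:.4f}")     # ~0.2755, which the slide rounds to 0.2756
print(f"{ig_genre:.4f}")   # ~0.1735, which the slide rounds to 0.17
```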

Gain Ratio (C4.5)

The information gain measure is biased towards attributes with a large number of values (attributes with high cardinality).
C4.5 (a successor of ID3) uses the gain ratio to overcome this problem, a normalization of the information gain:
SplitInfo_A(D) = - Σ_{j=1..v} (|Dj| / |D|) log2(|Dj| / |D|)
GainRatio(A) = InformationGain(A) / SplitInfo(A)
SplitInfo(CoO) = - 4/10 log2(4/10) - 4/10 log2(4/10) - 2/10 log2(2/10) = 1.5219
GainRatio(CoO) = 0.2756 / 1.5219 = 0.1811
SplitInfo(Genre) = - 3/10 log2(3/10) - 6/10 log2(6/10) - 1/10 log2(1/10) = 1.2954
GainRatio(Genre) = 0.17 / 1.2954 = 0.1312
SplitInfo(BigStar) = - 7/10 log2(7/10) - 3/10 log2(3/10) = 0.8812
GainRatio(BigStar) = 0.01 / 0.8812 = 0.0112
The attribute with the maximum gain ratio is selected as the splitting attribute; here Country of Origin still has the highest gain ratio.
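A small sketch of the normalization; split_info and gain_ratio are illustrative names, and the partition sizes are the per-value tuple counts from the slides.

```python
import math

def split_info(value_sizes):
    """SplitInfo(A) = - sum (|Dj|/|D|) log2(|Dj|/|D|) over the partition sizes."""
    total = sum(value_sizes)
    return sum(-s / total * math.log2(s / total) for s in value_sizes)

def gain_ratio(gain, value_sizes):
    return gain / split_info(value_sizes)

print(f"{split_info([4, 4, 2]):.4f}")           # ~1.5219 for Country of Origin
print(f"{gain_ratio(0.2756, [4, 4, 2]):.4f}")   # ~0.1811
print(f"{gain_ratio(0.17, [3, 6, 1]):.4f}")     # ~0.1312 for Genre
print(f"{gain_ratio(0.01, [7, 3]):.4f}")        # ~0.0113 (the slide rounds to 0.0112)
```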

Gini Index

If a data set D contains examples from n classes, the gini index gini(D) is defined as
gini(D) = 1 - Σ_{j=1..n} pj^2
where pj is the relative frequency of class j in D.
If a data set D is split on attribute A into two subsets D1 and D2, the gini index of the split is defined as
gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)
Reduction in impurity:
Δgini(A) = gini(D) - gini_A(D)
The attribute that provides the smallest gini_A(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (all possible splitting points must be enumerated for each attribute).

Gini Index: Finding the Best Split of an Attribute

If Genre has to be split into a binary split, should it be split as {(SF, C), R}, {(SF, R), C}, or {SF, (C, R)}?
The split with the minimum gini index is used to pick the split.
SF, C and R have 3, 6 and 1 tuples in the data set; (SF, C), (SF, R) and (C, R) have 9, 4 and 7 tuples respectively.
The (positive, negative) tuple counts of SF, C and R are (1, 2), (4, 2) and (0, 1); those of (SF, C), (SF, R) and (C, R) are (5, 4), (1, 3) and (4, 3) respectively.
Gini of the entire data set:
gini(total) = 1 - (5/10)^2 - (5/10)^2 = 0.50
Gini of each candidate binary split:
gini_Genre{(SF,C),R}(D) = 9/10 [1 - (5/9)^2 - (4/9)^2] + 1/10 [1 - (0/1)^2 - (1/1)^2] = 0.4444
gini_Genre{(SF,R),C}(D) = 4/10 [1 - (1/4)^2 - (3/4)^2] + 6/10 [1 - (4/6)^2 - (2/6)^2] = 0.4167
gini_Genre{SF,(C,R)}(D) = 3/10 [1 - (1/3)^2 - (2/3)^2] + 7/10 [1 - (4/7)^2 - (3/7)^2] = 0.4762
Reduction in impurity:
For Genre {(SF, C), R}: 0.50 - 0.4444 = 0.0556
For Genre {(SF, R), C}: 0.50 - 0.4167 = 0.0833
For Genre {SF, (C, R)}: 0.50 - 0.4762 = 0.0238
The grouping {(SF, R), C} gives the smallest gini (largest reduction in impurity), so it is the best binary split of Genre.
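A short sketch of the gini computation that reproduces the three candidate groupings of Genre; gini and gini_split are illustrative names, and the (positive, negative) counts are taken from the slide.

```python
def gini(counts):
    """gini(D) = 1 - sum of squared class probabilities."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted gini of a partition, given (positive, negative) counts per subset."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

print(f"{gini([5, 5]):.2f}")                   # 0.50 for the whole data set
print(f"{gini_split([(5, 4), (0, 1)]):.4f}")   # {(SF, C), R}  -> ~0.4444
print(f"{gini_split([(1, 3), (4, 2)]):.4f}")   # {(SF, R), C}  -> ~0.4167
print(f"{gini_split([(1, 2), (4, 3)]):.4f}")   # {SF, (C, R)}  -> ~0.4762
```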

Computing Information Gain for Continuous-Valued Attributes

Let attribute A be a continuous-valued attribute.
The best split point for A must be determined:
Sort the values of A in increasing order.
Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (ai + ai+1) / 2 is the midpoint between the values ai and ai+1.
The point with the minimum expected information requirement for A is selected as the split point for A.
Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point.
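A minimal sketch of generating candidate split points for a continuous attribute; the function name and the toy values are illustrative, not from the slides.

```python
def candidate_split_points(values):
    """Midpoints between adjacent sorted values of a continuous attribute A."""
    vs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])]

# Toy example: ages of training tuples (illustrative values only).
ages = [23, 29, 31, 35, 40, 45]
print(candidate_split_points(ages))   # [26.0, 30.0, 33.0, 37.5, 42.5]
# Each candidate point would be evaluated with Info_A(D) (or gini), and the one
# with the minimum expected information requirement chosen as the split point.
```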

Decision Tree - Example

(Figure: the decision tree induced for the example data set, with Country of Origin (CO) at the root, branches for US, EU and RoW, and further tests on Genre (Ge) and BigStar (BS).)


Classification

How to classify the following test examples with the decision tree built above?
{Europe, Yes, Science Fiction, ?}
{Rest of World, No, Comedy, ?}
{United States, Yes, Romance, ?}
Tracing each example down the tree fills in the class labels:
{Europe, Yes, Science Fiction, False}
{Rest of World, No, Comedy, False}
{United States, Yes, Romance, True}

Advantages of using ID3

Understandable prediction rules are created from the training data.
Builds the fastest tree.
Builds a short tree.
Only needs to test enough attributes until all data is classified.
Finding leaf nodes enables test data to be pruned, reducing the number of tests.
The whole dataset is searched to create the tree.

Disadvantages of using ID3

Data may be over-fitted or over-classified if a small sample is tested.
Only one attribute at a time is tested for making a decision.
Classifying continuous data may be computationally expensive, as many trees must be generated to see where to break the continuum.

Practical Issues of Classification

Underfitting and Overfitting
Missing Values
Costs of Classification

Underfitting and Overfitting

Underfitting: when the model is too simple, both training and test errors are large.
Overfitting: when the model is too complex, the training error is small while the test error is large.


Overfitting due to Noise

The decision boundary is distorted by noise points.

Overfitting due to Insufficient Examples

A lack of data points in the lower half of the diagram makes it difficult to predict the class labels of that region correctly.
An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.

Decision Tree Pruning

Two approaches to avoid overfitting:
Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold.
It is difficult to choose an appropriate threshold.
Postpruning: remove branches from a fully grown tree to get a sequence of progressively pruned trees.
Use a set of data different from the training data to decide which is the best pruned tree.
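If scikit-learn is available, both ideas map onto familiar parameters; the sketch below is illustrative and not part of the slides: pre-pruning through constraints applied while the tree is grown, post-pruning through cost-complexity pruning of a fully grown tree.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop splitting early via depth / impurity-decrease thresholds.
pre = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                             min_impurity_decrease=0.01).fit(X_train, y_train)

# Post-pruning: grow the tree, then prune with cost-complexity (ccp_alpha > 0).
post = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.02).fit(X_train, y_train)

print(pre.score(X_test, y_test), post.score(X_test, y_test))
```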

Training Error

The error of a classification model on the training data is called the training error:
Training error = incorrectly classified tuples / total number of training tuples
Tree pruning results in a higher training error.

(Figures: three decision trees for the example data set, pruned to different depths.)
Case 1, three levels: training error = 0/10 = 0.0
Case 2, two levels: training error = 1/10 = 0.1
Case 3, one level: training error = 3/10 = 0.3

Notes on Overfitting

Overfitting results in decision trees that are more complex than necessary.
The training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
New ways of estimating errors are needed.


Estimating Generalization Errors

Re-substitution errors: error on the training data, e(t).
Generalization errors: error on the test data, e'(t).
Methods for estimating generalization errors:
Optimistic approach: e'(t) = e(t)
Pessimistic approach:
For each leaf node: e'(t) = e(t) + 0.5
Total errors: e'(T) = e(T) + N × 0.5, where N is the number of leaf nodes
For a tree with 30 leaf nodes and 10 errors on training (out of 1000 instances):
Training error = 10/1000 = 1%
Generalization error = (10 + 30 × 0.5)/1000 = 2.5%
Reduced error pruning (REP):
Uses a validation data set to estimate the generalization error.
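A one-function sketch of the pessimistic estimate; the function name is illustrative.

```python
def pessimistic_generalization_error(train_errors, n_leaves, n_instances, penalty=0.5):
    """e'(T) = (e(T) + N * penalty) / number of instances."""
    return (train_errors + n_leaves * penalty) / n_instances

print(10 / 1000)                                        # training error = 0.01 (1%)
print(pessimistic_generalization_error(10, 30, 1000))   # 0.025 (2.5%), as on the slide
```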
