
Data Mining

Classification: Basic Concepts, Decision Trees, and Model Evaluation

Lecture Notes for Chapter 4

Introduction to Data Mining


by
Tan, Steinbach, Kumar

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1


Classification: Definition

 Given a collection of records (training set)


– Each record is characterized by a tuple (x, y),
where x is the attribute set and y is the
class label
 x: attribute, predictor, independent variable, input
 y: class, response, dependent variable, output

 Task:
– Learn a model that maps each attribute set x
into one of the predefined class labels y



Examples of Classification Task

Task                      Attribute set, x                      Class label, y
Categorizing email        Features extracted from email         spam or non-spam
messages                  message header and content
Identifying tumor cells   Features extracted from MRI scans     malignant or benign cells
Cataloging galaxies       Features extracted from telescope     Elliptical, spiral, or
                          images                                irregular-shaped galaxies



General Approach for Building
Classification Model
Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

(Figure: the Training Set feeds a Learning algorithm via Induction to
Learn a Model; the Model is then applied to the Test Set via Deduction.)



Classification Techniques

 Base Classifiers
– Decision Tree based Methods
– Rule-based Methods
– Nearest-neighbor
– Neural Networks
– Naïve Bayes and Bayesian Belief Networks
– Support Vector Machines

 Ensemble Classifiers
– Boosting, Bagging, Random Forests



Example of a Decision Tree

Training Data (Home Owner and Marital Status are categorical, Annual
Income is continuous, Defaulted Borrower is the class):

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

Model: Decision Tree (internal nodes are the splitting attributes)

Home Owner?
├─ Yes → NO
└─ No → MarSt?
        ├─ Married → NO
        └─ Single, Divorced → Income?
                              ├─ < 80K → NO
                              └─ > 80K → YES



Another Example of Decision Tree

Training Data (same attributes as before):

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

MarSt?
├─ Married → NO
└─ Single, Divorced → Home Owner?
                      ├─ Yes → NO
                      └─ No → Income?
                              ├─ < 80K → NO
                              └─ > 80K → YES

There could be more than one tree that fits the same data!



Decision Tree Induction

 Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT



General Structure of Hunt’s Algorithm
 Let Dt be the set of training records that reach a node t

 General Procedure:
– If Dt contains records that all belong to the same class yt,
then t is a leaf node labeled as yt
– If Dt contains records that belong to more than one class,
use an attribute test to split the data into smaller subsets.
Recursively apply the procedure to each subset.

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes
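The general procedure above can be sketched in Python. This is a minimal illustration, not the book's implementation: records are assumed to be (attribute-dict, label) pairs, and the sketch simply tests the first available attribute rather than choosing the best split (impurity-based split selection is covered on later slides).

```python
from collections import Counter

def hunt(records, attributes):
    """Grow a decision tree with Hunt's recursive procedure.

    records: list of (attrs, label) pairs, where attrs is a dict.
    Returns a class label (leaf) or {attribute: {value: subtree}}.
    """
    labels = [y for _, y in records]
    # If all records at this node belong to the same class y_t,
    # the node is a leaf labeled y_t.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to test: fall back to the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Otherwise apply an attribute test and recurse on each subset.
    attr = attributes[0]
    branches = {}
    for value in {x[attr] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[attr] == value]
        branches[value] = hunt(subset, attributes[1:])
    return {attr: branches}
```

On the loan data, the first call would split on Home Owner and recurse into the Yes and No subsets, mirroring steps (a) through (d) on the next slide.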



Hunt’s Algorithm
(a) Defaulted = No            (single leaf node)

(b) Home Owner?
    ├─ Yes → Defaulted = No
    └─ No  → Defaulted = No

(c) Home Owner?
    ├─ Yes → Defaulted = No
    └─ No  → Marital Status?
             ├─ Married → Defaulted = No
             └─ Single, Divorced → Defaulted = Yes

(d) Home Owner?
    ├─ Yes → Defaulted = No
    └─ No  → Marital Status?
             ├─ Married → Defaulted = No
             └─ Single, Divorced → Annual Income?
                                   ├─ < 80K  → Defaulted = No
                                   └─ >= 80K → Defaulted = Yes

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes



How to determine the Best Split

Before Splitting: 10 records of class 0, 10 records of class 1

Gender?            Car Type?                       Customer ID?
Yes      No        Family   Sports   Luxury        c1 ... c10   c11 ... c20
C0: 6    C0: 4     C0: 1    C0: 8    C0: 1         C0: 1 each   C0: 0 each
C1: 4    C1: 6     C1: 3    C1: 0    C1: 7         C1: 0 each   C1: 1 each

Which test condition is the best?


How to determine the Best Split

 Greedy approach:
– Nodes with purer class distribution are
preferred

 Need a measure of node impurity:

C0: 5 C0: 9
C1: 5 C1: 1

High degree of impurity Low degree of impurity



Measures of Node Impurity

 Gini Index:
GINI(t) = 1 − Σ_j [p(j|t)]^2

 Entropy:
Entropy(t) = − Σ_j p(j|t) log p(j|t)

 Misclassification error:
Error(t) = 1 − max_i P(i|t)



Comparison among Impurity Measures

For a 2-class problem (figure not reproduced: all three measures peak
when the two classes are equally represented, i.e., at p = 0.5)



Measure of Impurity: GINI

 Gini Index for a given node t :

GINI(t) = 1 − Σ_j [p(j|t)]^2

(NOTE: p(j|t) is the relative frequency of class j at node t.)

– Maximum (1 − 1/n_c, where n_c is the number of classes) when records
are equally distributed among all classes, implying least interesting
information
– Minimum (0.0) when all records belong to one class, implying most
interesting information

C1: 0         C1: 1         C1: 2         C1: 3
C2: 6         C2: 5         C2: 4         C2: 3
Gini=0.000    Gini=0.278    Gini=0.444    Gini=0.500
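The example values can be checked with a short sketch of the three impurity measures defined on the previous slide, where `counts` is the per-class record count at a node:

```python
from math import log2

def gini(counts):
    # GINI(t) = 1 - sum_j p(j|t)^2
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    # Entropy(t) = -sum_j p(j|t) log2 p(j|t); empty classes contribute 0
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def error(counts):
    # Error(t) = 1 - max_i P(i|t)
    n = sum(counts)
    return 1.0 - max(counts) / n
```

For instance, `gini([1, 5])` gives 1 − (1/6)² − (5/6)² ≈ 0.278, matching the second column above.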



Binary Attributes: Computing GINI Index

 Splits into two partitions

 Effect of weighting partitions:
– Larger and purer partitions are sought

                 Parent
        B?       C1: 6
  Yes      No    C2: 6
                 Gini = 0.500
Node N1   Node N2

Gini(N1) = 1 − (5/6)^2 − (1/6)^2 = 0.278
Gini(N2) = 1 − (2/6)^2 − (4/6)^2 = 0.444

      N1   N2
C1     5    2
C2     1    4

Gini(Children) = 6/12 × 0.278 + 6/12 × 0.444 = 0.361
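The weighted computation can be verified numerically; `gini` is repeated here so the sketch is self-contained:

```python
def gini(counts):
    # GINI(t) = 1 - sum_j p(j|t)^2
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    # Weighted average of the children's Gini values,
    # weighted by the fraction of records in each partition.
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)
```

`gini_split([[5, 1], [2, 4]])` reproduces the 0.361 above, an improvement over the parent's 0.500.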
Categorical Attributes: Computing Gini Index

 For each distinct value, gather counts for each class in


the dataset
 Use the count matrix to make decisions

Multi-way split:
          CarType
   Family   Sports   Luxury
C1    1        8        1
C2    3        0        7
Gini = 0.163

Two-way split (find best partition of values):
            CarType                           CarType
  {Sports, Luxury}   {Family}         {Sports}   {Family, Luxury}
C1       9               1          C1     8            2
C2       7               3          C2     0           10
Gini = 0.468                        Gini = 0.167



Decision Tree Based Classification

 Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification
techniques for many simple data sets



Rule-Based Classifier

 Classify records by using a collection of “if…then…” rules


Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
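Rules like R1 to R5 can be applied with a simple first-match strategy. The sketch below encodes the five rules above; the dictionary keys are illustrative names, not identifiers from the text:

```python
# Rules are tried in order; the first rule whose condition holds fires.
RULES = [
    (lambda r: r['give_birth'] == 'no' and r['can_fly'] == 'yes', 'birds'),
    (lambda r: r['give_birth'] == 'no' and r['live_in_water'] == 'yes', 'fishes'),
    (lambda r: r['give_birth'] == 'yes' and r['blood_type'] == 'warm', 'mammals'),
    (lambda r: r['give_birth'] == 'no' and r['can_fly'] == 'no', 'reptiles'),
    (lambda r: r['live_in_water'] == 'sometimes', 'amphibians'),
]

def classify(record):
    for condition, label in RULES:
        if condition(record):
            return label
    return None  # no rule covers the record
```

For the whale row (warm blood, gives birth, lives in water), R1 and R2 fail on Give Birth and R3 fires, yielding mammals.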



Nearest Neighbor Classifiers

 Basic idea:
– If it walks like a duck, quacks like a duck,
then it’s probably a duck

Given a test record: compute its distance to the training records,
then choose the k “nearest” records.
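A minimal k-nearest-neighbor sketch of the idea above, assuming numeric attribute vectors and Euclidean distance (the choice of k and of distance measure are illustrative assumptions):

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train, test_point, k=3):
    """train: list of (point, label) pairs.
    Returns the majority label among the k nearest training records."""
    neighbors = sorted(train, key=lambda rec: dist(rec[0], test_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Note that k-NN defers all work to prediction time: there is no training step beyond storing the records.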



Bayes Classifier

 A probabilistic framework for solving classification problems


 Key idea is that certain attribute values are more likely
(probable) for some classes than for others
– Example: Probability an individual is a male or female if the
individual is wearing a dress

 Conditional Probability:
P(Y | X) = P(X, Y) / P(X)
P(X | Y) = P(X, Y) / P(Y)

 Bayes theorem:
P(Y | X) = P(X | Y) P(Y) / P(X)
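Bayes theorem can be checked numerically. The numbers below are illustrative assumptions for the dress example above, not values from the slides:

```python
# Assumed numbers: prior P(female) = 0.5, likelihoods
# P(dress | female) = 0.30 and P(dress | male) = 0.01.
p_female = 0.5
p_dress_given_female = 0.30
p_dress_given_male = 0.01

# Law of total probability gives the evidence P(dress).
p_dress = (p_dress_given_female * p_female
           + p_dress_given_male * (1 - p_female))

# Bayes theorem: P(female | dress) = P(dress | female) P(female) / P(dress)
p_female_given_dress = p_dress_given_female * p_female / p_dress
```

Even with these made-up numbers the point of the slide holds: the attribute value "wearing a dress" shifts the posterior heavily toward one class (about 0.97 here versus a 0.5 prior).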
Evaluating Classifiers

 Confusion Matrix:
                  PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL  Class=Yes     a           b
CLASS   Class=No      c           d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)



Accuracy

                  PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL  Class=Yes   a (TP)      b (FN)
CLASS   Class=No    c (FP)      d (TN)

 Most widely-used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Methods for Classifier Evaluation

 Holdout
– Reserve k% for training and (100-k)% for testing
 Random subsampling
– Repeated holdout
 Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k=n
 Bootstrap
– Sampling with replacement
– .632 bootstrap:
acc_boot = (1/b) Σ_{i=1..b} (0.632 × acc_i + 0.368 × acc_s)
Problem with Accuracy

 Consider a 2-class problem


– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10

 If a model predicts everything to be class 0, accuracy is


9990/10000 = 99.9 %
– This is misleading because the model does not detect
any class 1 example
– Detecting the rare class is usually more interesting
(e.g., frauds, intrusions, defects, etc.)



Example of classification accuracy measures

                  PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL  Class=Yes   35 (TP)     5 (FN)
CLASS   Class=No     5 (FP)     5 (TN)

Accuracy = (TP + TN) / (TP + FN + FP + TN)
Precision (p) = TP / (TP + FP)
Recall (r) = TP / (TP + FN)
F-measure (F) = 2rp / (r + p) = 2TP / (2TP + FN + FP)

Accuracy = 0.8
For Yes class: precision = 0.875, recall = 0.875, F-measure = 0.875
For No class: precision = 0.5, recall = 0.5, F-measure = 0.5
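The per-class values above can be reproduced directly from the formulas; for the No class, the roles of the cells are swapped (its true positives are the actual-No, predicted-No records):

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * recall * precision / (recall + precision)
    return accuracy, precision, recall, f_measure
```

`metrics(35, 5, 5, 5)` gives the Yes-class row (0.8, 0.875, 0.875, 0.875), and `metrics(5, 5, 5, 35)` gives the No-class row.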

© Tan,Steinbach, Kumar, Introduction to Data Mining; CSCI 8980: Spring 2011: Biomedical Data Mining; 02/14/2011
Example of classification accuracy measures

PREDICTED CLASS

Class=Yes Class=No

ACTUAL Class=Yes 99 1
CLASS
(TP) (FN)

Class=No 10 90
(FP) (TN)

Accuracy = 0.9450
Sensitivity = 0.99
Specificity = 0.90

Measures of Classification Performance

PREDICTED CLASS
Yes No
ACTUAL
CLASS Yes TP FN
No FP TN

 α is the probability that we reject the null hypothesis when it is
true. This is a Type I error or a false positive (FP).

 β is the probability that we accept the null hypothesis when it is
false. This is a Type II error or a false negative (FN).
ROC (Receiver Operating Characteristic)

 A graphical approach for displaying trade-off between


detection rate and false alarm rate
 Developed in 1950s for signal detection theory to analyze
noisy signals
 ROC curve plots True Positive Rate (TPR) against False Positive
Rate (FPR)
– Performance of a model is represented as a point in the ROC curve
– Changing the threshold parameter of the classifier changes the
location of the point
– http://commonsenseatheism.com/wp-content/uploads/2011/01/Swets-Better-Decisions-Through-Science.pdf

ROC Curve

(TPR,FPR):
• (0,0): declare everything
to be negative class
• (1,1): declare everything
to be positive class
• (1,0): ideal

• Diagonal line:
– Random guessing
– Below diagonal line:
• prediction is opposite of
the true class
Using ROC for Model Comparison

• No model consistently
outperforms the other
– M1 is better for small FPR
– M2 is better for large FPR

• Area Under the ROC curve


– Ideal: Area = 1
– Random guess:
Area = 0.5

ROC (Receiver Operating Characteristic)

 To draw ROC curve, classifier must produce


continuous-valued output
– Outputs are used to rank test records, from the most
likely positive class record to the least likely positive
class record

 Many classifiers produce only discrete outputs (i.e., predicted class)


– Approaches to get ROC curve for other types of
classifiers such as decision trees
– WEKA gives you ROC curves

ROC Curve Example

 1-dimensional data set containing 2 classes (positive and negative)

 Any point located at x > t is classified as positive

At threshold t: TPR=0.5, FNR=0.5, FPR=0.12, TNR=0.88

How to Construct an ROC curve

• Use a classifier that produces a continuous-valued output
score(+|A) for each test instance
• Sort the instances according to score(+|A) in decreasing order
• Apply a threshold at each unique value of score(+|A)
• Count the number of TP, FP, TN, FN at each threshold
• TPR = TP/(TP+FN)
• FPR = FP/(FP+TN)

Instance  score(+|A)  True Class
   1         0.95         +
   2         0.93         +
   3         0.87         -
   4         0.85         -
   5         0.85         -
   6         0.85         +
   7         0.76         -
   8         0.53         +
   9         0.43         -
  10         0.25         +

How to construct an ROC curve

Class          +     -     +     -     -     -     +     -     +     +
Threshold >= 0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP             5     4     4     3     3     3     3     2     2     1     0
FP             5     5     4     4     3     2     1     1     0     0     0
TN             0     0     1     1     2     3     4     4     5     5     5
FN             0     1     1     2     2     2     2     3     3     4     5
TPR            1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR            1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

ROC Curve:
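The counting procedure above can be sketched as follows. This minimal version moves the threshold only between distinct score values, so the three instances tied at 0.85 enter together (the table above instead lists one column per instance):

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points for an ROC curve.

    scores: continuous-valued outputs score(+|A); labels: '+' or '-'.
    """
    pairs = sorted(zip(scores, labels), reverse=True)  # decreasing score
    pos = sum(1 for _, y in pairs if y == '+')
    neg = len(pairs) - pos
    points = [(0.0, 0.0)]  # threshold above max score: nothing is positive
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every instance tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == '+':
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))  # (FPR, TPR)
    return points
```

Plotting these points and connecting them yields the ROC curve; the point (0.6, 0.6) corresponds to the threshold 0.85 column in the table.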

