
Data Mining

Classification: Basic Concepts, Decision Trees, and Model Evaluation

Lecture Notes for Chapter 4

Introduction to Data Mining


by
Tan, Steinbach, Kumar

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1


Classification: Definition

 Given a collection of records (training set)


– Each record is characterized by a tuple (x, y),
where x is the attribute set and y is the
class label
 x: attribute, predictor, independent variable, input
 y: class, response, dependent variable, output

 Task:
– Learn a model that maps each attribute set x
into one of the predefined class labels y



Examples of Classification Task

Task                      Attribute set, x                      Class label, y
Categorizing email        Features extracted from email         spam or non-spam
messages                  message header and content
Identifying tumor cells   Features extracted from MRI scans     malignant or benign cells
Cataloging galaxies       Features extracted from telescope     Elliptical, spiral, or
                          images                                irregular-shaped galaxies



General Approach for Building
Classification Model
Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

(Figure: the Training Set feeds a Learning algorithm via Induction to
Learn a Model; the Model is then applied to the Test Set via Deduction.)



Classification Techniques

 Base Classifiers
– Decision Tree based Methods
– Rule-based Methods
– Nearest-neighbor
– Neural Networks
– Naïve Bayes and Bayesian Belief Networks
– Support Vector Machines

 Ensemble Classifiers
– Boosting, Bagging, Random Forests



Example of a Decision Tree

Training Data (Home Owner and Marital Status are categorical, Annual
Income is continuous, Defaulted Borrower is the class):

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

Model: Decision Tree (internal nodes are the splitting attributes)

Home Owner?
├─ Yes → NO
└─ No → MarSt?
        ├─ Married → NO
        └─ Single, Divorced → Income?
                              ├─ < 80K → NO
                              └─ > 80K → YES



Another Example of Decision Tree

Training Data (same attributes as before):

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

MarSt?
├─ Married → NO
└─ Single, Divorced → Home Owner?
                      ├─ Yes → NO
                      └─ No → Income?
                              ├─ < 80K → NO
                              └─ > 80K → YES

There could be more than one tree that fits the same data!



Decision Tree Induction

 Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT



General Structure of Hunt’s Algorithm
 Let Dt be the set of training records that reach a node t

 General Procedure:
– If Dt contains records that all belong to the same class yt,
then t is a leaf node labeled as yt
– If Dt contains records that belong to more than one class,
use an attribute test to split the data into smaller subsets.
Recursively apply the procedure to each subset.

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes
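The general procedure above can be sketched in Python. This is a minimal illustration, not the book's implementation: records are assumed to be (attribute-dict, label) pairs, and the sketch simply tests the first available attribute rather than choosing the best split (impurity-based split selection is covered on later slides).

```python
from collections import Counter

def hunt(records, attributes):
    """Grow a decision tree with Hunt's recursive procedure.

    records: list of (attrs, label) pairs, where attrs is a dict.
    Returns a class label (leaf) or {attribute: {value: subtree}}.
    """
    labels = [y for _, y in records]
    # If all records at this node belong to the same class y_t,
    # the node is a leaf labeled y_t.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to test: fall back to the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Otherwise apply an attribute test and recurse on each subset.
    attr = attributes[0]
    branches = {}
    for value in {x[attr] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[attr] == value]
        branches[value] = hunt(subset, attributes[1:])
    return {attr: branches}
```

On the loan data, the first call would split on Home Owner and recurse into the Yes and No subsets, mirroring steps (a) through (d) on the next slide.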



Hunt’s Algorithm
(a) Defaulted = No            (single leaf node)

(b) Home Owner?
    ├─ Yes → Defaulted = No
    └─ No  → Defaulted = No

(c) Home Owner?
    ├─ Yes → Defaulted = No
    └─ No  → Marital Status?
             ├─ Married → Defaulted = No
             └─ Single, Divorced → Defaulted = Yes

(d) Home Owner?
    ├─ Yes → Defaulted = No
    └─ No  → Marital Status?
             ├─ Married → Defaulted = No
             └─ Single, Divorced → Annual Income?
                                   ├─ < 80K  → Defaulted = No
                                   └─ >= 80K → Defaulted = Yes

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes



How to determine the Best Split

Before Splitting: 10 records of class 0, 10 records of class 1

Gender?            Car Type?                       Customer ID?
Yes      No        Family   Sports   Luxury        c1 ... c10   c11 ... c20
C0: 6    C0: 4     C0: 1    C0: 8    C0: 1         C0: 1 each   C0: 0 each
C1: 4    C1: 6     C1: 3    C1: 0    C1: 7         C1: 0 each   C1: 1 each

Which test condition is the best?


How to determine the Best Split

 Greedy approach:
– Nodes with purer class distribution are
preferred

 Need a measure of node impurity:

C0: 5 C0: 9
C1: 5 C1: 1

High degree of impurity Low degree of impurity



Measures of Node Impurity

 Gini Index:
GINI(t) = 1 − Σ_j [p(j|t)]^2

 Entropy:
Entropy(t) = − Σ_j p(j|t) log p(j|t)

 Misclassification error:
Error(t) = 1 − max_i P(i|t)



Comparison among Impurity Measures

For a 2-class problem (figure not reproduced: all three measures peak
when the two classes are equally represented, i.e., at p = 0.5)



Measure of Impurity: GINI

 Gini Index for a given node t :

GINI(t) = 1 − Σ_j [p(j|t)]^2

(NOTE: p(j|t) is the relative frequency of class j at node t.)

– Maximum (1 − 1/n_c, where n_c is the number of classes) when records
are equally distributed among all classes, implying least interesting
information
– Minimum (0.0) when all records belong to one class, implying most
interesting information

C1: 0         C1: 1         C1: 2         C1: 3
C2: 6         C2: 5         C2: 4         C2: 3
Gini=0.000    Gini=0.278    Gini=0.444    Gini=0.500
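The example values can be checked with a short sketch of the three impurity measures defined on the previous slide, where `counts` is the per-class record count at a node:

```python
from math import log2

def gini(counts):
    # GINI(t) = 1 - sum_j p(j|t)^2
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    # Entropy(t) = -sum_j p(j|t) log2 p(j|t); empty classes contribute 0
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def error(counts):
    # Error(t) = 1 - max_i P(i|t)
    n = sum(counts)
    return 1.0 - max(counts) / n
```

For instance, `gini([1, 5])` gives 1 − (1/6)² − (5/6)² ≈ 0.278, matching the second column above.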



Binary Attributes: Computing GINI Index

 Splits into two partitions

 Effect of weighting partitions:
– Larger and purer partitions are sought

                 Parent
        B?       C1: 6
  Yes      No    C2: 6
                 Gini = 0.500
Node N1   Node N2

Gini(N1) = 1 − (5/6)^2 − (1/6)^2 = 0.278
Gini(N2) = 1 − (2/6)^2 − (4/6)^2 = 0.444

      N1   N2
C1     5    2
C2     1    4

Gini(Children) = 6/12 × 0.278 + 6/12 × 0.444 = 0.361
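The weighted computation can be verified numerically; `gini` is repeated here so the sketch is self-contained:

```python
def gini(counts):
    # GINI(t) = 1 - sum_j p(j|t)^2
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    # Weighted average of the children's Gini values,
    # weighted by the fraction of records in each partition.
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)
```

`gini_split([[5, 1], [2, 4]])` reproduces the 0.361 above, an improvement over the parent's 0.500.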
Categorical Attributes: Computing Gini Index

 For each distinct value, gather counts for each class in


the dataset
 Use the count matrix to make decisions

Multi-way split:
          CarType
   Family   Sports   Luxury
C1    1        8        1
C2    3        0        7
Gini = 0.163

Two-way split (find best partition of values):
            CarType                           CarType
  {Sports, Luxury}   {Family}         {Sports}   {Family, Luxury}
C1       9               1          C1     8            2
C2       7               3          C2     0           10
Gini = 0.468                        Gini = 0.167



Decision Tree Based Classification

 Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification
techniques for many simple data sets



Rule-Based Classifier

 Classify records by using a collection of “if…then…” rules


Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
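Rules like R1 to R5 can be applied with a simple first-match strategy. The sketch below encodes the five rules above; the dictionary keys are illustrative names, not identifiers from the text:

```python
# Rules are tried in order; the first rule whose condition holds fires.
RULES = [
    (lambda r: r['give_birth'] == 'no' and r['can_fly'] == 'yes', 'birds'),
    (lambda r: r['give_birth'] == 'no' and r['live_in_water'] == 'yes', 'fishes'),
    (lambda r: r['give_birth'] == 'yes' and r['blood_type'] == 'warm', 'mammals'),
    (lambda r: r['give_birth'] == 'no' and r['can_fly'] == 'no', 'reptiles'),
    (lambda r: r['live_in_water'] == 'sometimes', 'amphibians'),
]

def classify(record):
    for condition, label in RULES:
        if condition(record):
            return label
    return None  # no rule covers the record
```

For the whale row (warm blood, gives birth, lives in water), R1 and R2 fail on Give Birth and R3 fires, yielding mammals.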



Nearest Neighbor Classifiers

 Basic idea:
– If it walks like a duck, quacks like a duck,
then it’s probably a duck

Given a test record: compute its distance to the training records,
then choose the k “nearest” records.
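A minimal k-nearest-neighbor sketch of the idea above, assuming numeric attribute vectors and Euclidean distance (the choice of k and of distance measure are illustrative assumptions):

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train, test_point, k=3):
    """train: list of (point, label) pairs.
    Returns the majority label among the k nearest training records."""
    neighbors = sorted(train, key=lambda rec: dist(rec[0], test_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Note that k-NN defers all work to prediction time: there is no training step beyond storing the records.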



Bayes Classifier

 A probabilistic framework for solving classification problems


 Key idea is that certain attribute values are more likely
(probable) for some classes than for others
– Example: Probability an individual is a male or female if the
individual is wearing a dress

 Conditional Probability:
P(Y | X) = P(X, Y) / P(X)
P(X | Y) = P(X, Y) / P(Y)

 Bayes theorem:
P(Y | X) = P(X | Y) P(Y) / P(X)
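Bayes theorem can be checked numerically. The numbers below are illustrative assumptions for the dress example above, not values from the slides:

```python
# Assumed numbers: prior P(female) = 0.5, likelihoods
# P(dress | female) = 0.30 and P(dress | male) = 0.01.
p_female = 0.5
p_dress_given_female = 0.30
p_dress_given_male = 0.01

# Law of total probability gives the evidence P(dress).
p_dress = (p_dress_given_female * p_female
           + p_dress_given_male * (1 - p_female))

# Bayes theorem: P(female | dress) = P(dress | female) P(female) / P(dress)
p_female_given_dress = p_dress_given_female * p_female / p_dress
```

Even with these made-up numbers the point of the slide holds: the attribute value "wearing a dress" shifts the posterior heavily toward one class (about 0.97 here versus a 0.5 prior).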
Evaluating Classifiers

 Confusion Matrix:
                  PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL  Class=Yes     a           b
CLASS   Class=No      c           d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)



Accuracy

                  PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL  Class=Yes   a (TP)      b (FN)
CLASS   Class=No    c (FP)      d (TN)

 Most widely-used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Methods for Classifier Evaluation

 Holdout
– Reserve k% for training and (100-k)% for testing
 Random subsampling
– Repeated holdout
 Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k=n
 Bootstrap
– Sampling with replacement
– .632 bootstrap:
acc_boot = (1/b) Σ_{i=1..b} (0.632 × acc_i + 0.368 × acc_s)
Problem with Accuracy

 Consider a 2-class problem


– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10

 If a model predicts everything to be class 0, accuracy is


9990/10000 = 99.9 %
– This is misleading because the model does not detect
any class 1 example
– Detecting the rare class is usually more interesting
(e.g., frauds, intrusions, defects, etc.)



Example of classification accuracy measures

                  PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL  Class=Yes   35 (TP)     5 (FN)
CLASS   Class=No     5 (FP)     5 (TN)

Accuracy = (TP + TN) / (TP + FN + FP + TN)
Precision (p) = TP / (TP + FP)
Recall (r) = TP / (TP + FN)
F-measure (F) = 2rp / (r + p) = 2TP / (2TP + FN + FP)

Accuracy = 0.8
For Yes class: precision = 0.875, recall = 0.875, F-measure = 0.875
For No class: precision = 0.5, recall = 0.5, F-measure = 0.5
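The per-class values above can be reproduced directly from the formulas; for the No class, the roles of the cells are swapped (its true positives are the actual-No, predicted-No records):

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * recall * precision / (recall + precision)
    return accuracy, precision, recall, f_measure
```

`metrics(35, 5, 5, 5)` gives the Yes-class row (0.8, 0.875, 0.875, 0.875), and `metrics(5, 5, 5, 35)` gives the No-class row.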

© Tan,Steinbach, Kumar, Introduction to Data Mining; CSCI 8980: Spring 2011: Biomedical Data Mining; 02/14/2011
Example of classification accuracy measures

PREDICTED CLASS

Class=Yes Class=No

ACTUAL Class=Yes 99 1
CLASS
(TP) (FN)

Class=No 10 90
(FP) (TN)

Accuracy = 0.9450
Sensitivity = 0.99
Specificity = 0.90

Measures of Classification Performance

PREDICTED CLASS
Yes No
ACTUAL
CLASS Yes TP FN
No FP TN

 α is the probability that we reject the null hypothesis when it is
true. This is a Type I error or a false positive (FP).

 β is the probability that we accept the null hypothesis when it is
false. This is a Type II error or a false negative (FN).
ROC (Receiver Operating Characteristic)

 A graphical approach for displaying trade-off between


detection rate and false alarm rate
 Developed in 1950s for signal detection theory to analyze
noisy signals
 ROC curve plots True Positive Rate (TPR) against False Positive
Rate (FPR)
– Performance of a model is represented as a point in the ROC curve
– Changing the threshold parameter of the classifier changes the
location of the point
– http://commonsenseatheism.com/wp-content/uploads/2011/01/Swets-Better-Decisions-Through-Science.pdf

ROC Curve

(TPR,FPR):
• (0,0): declare everything
to be negative class
• (1,1): declare everything
to be positive class
• (1,0): ideal

• Diagonal line:
– Random guessing
– Below diagonal line:
• prediction is opposite of
the true class
Using ROC for Model Comparison

• No model consistently
outperforms the other
– M1 is better for small FPR
– M2 is better for large FPR

• Area Under the ROC curve


– Ideal: Area = 1
– Random guess:
Area = 0.5

ROC (Receiver Operating Characteristic)

 To draw ROC curve, classifier must produce


continuous-valued output
– Outputs are used to rank test records, from the most
likely positive class record to the least likely positive
class record

 Many classifiers produce only discrete outputs (i.e., predicted class)


– Approaches to get ROC curve for other types of
classifiers such as decision trees
– WEKA gives you ROC curves

ROC Curve Example

 1-dimensional data set containing 2 classes (positive and negative)

 Any point located at x > t is classified as positive

At threshold t: TPR=0.5, FNR=0.5, FPR=0.12, TNR=0.88

How to Construct an ROC curve

• Use a classifier that produces a continuous-valued output
score(+|A) for each test instance
• Sort the instances according to score(+|A) in decreasing order
• Apply a threshold at each unique value of score(+|A)
• Count the number of TP, FP, TN, FN at each threshold
• TPR = TP/(TP+FN)
• FPR = FP/(FP+TN)

Instance  score(+|A)  True Class
   1         0.95         +
   2         0.93         +
   3         0.87         -
   4         0.85         -
   5         0.85         -
   6         0.85         +
   7         0.76         -
   8         0.53         +
   9         0.43         -
  10         0.25         +

How to construct an ROC curve

Class          +     -     +     -     -     -     +     -     +     +
Threshold >= 0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP             5     4     4     3     3     3     3     2     2     1     0
FP             5     5     4     4     3     2     1     1     0     0     0
TN             0     0     1     1     2     3     4     4     5     5     5
FN             0     1     1     2     2     2     2     3     3     4     5
TPR            1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR            1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

ROC Curve:
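The counting procedure above can be sketched as follows. This minimal version moves the threshold only between distinct score values, so the three instances tied at 0.85 enter together (the table above instead lists one column per instance):

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points for an ROC curve.

    scores: continuous-valued outputs score(+|A); labels: '+' or '-'.
    """
    pairs = sorted(zip(scores, labels), reverse=True)  # decreasing score
    pos = sum(1 for _, y in pairs if y == '+')
    neg = len(pairs) - pos
    points = [(0.0, 0.0)]  # threshold above max score: nothing is positive
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every instance tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == '+':
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))  # (FPR, TPR)
    return points
```

Plotting these points and connecting them yields the ROC curve; the point (0.6, 0.6) corresponds to the threshold 0.85 column in the table.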

