
Data Mining Models &

Evaluation Techniques

Mentored By Prof. K.P Agarwal

Created By
Shubham Pachori

Overview
Motivation
Metrics for Classifiers Evaluation
Methods for Classifiers Evaluation
Comparing the Performance of two Classifiers
Costs in Classification
Ensemble Methods To Improve Accuracy

Motivation
It is important to evaluate a classifier's generalization
performance in order to:
Determine whether to employ the classifier.
(For example: when learning the effectiveness of medical
treatments from limited-size data, it is important to
estimate the accuracy of the classifiers.)
Optimize the classifier.
(For example: when post-pruning decision trees we must
evaluate the accuracy of the decision trees at each
pruning step.)

Models Evaluation in the KDD Process


[Figure: the KDD process. Data -> Selection -> Target data -> Preprocessing & cleaning -> Processed data -> Transformation & feature selection -> Transformed data -> Data Mining -> Patterns -> Interpretation / Evaluation -> Knowledge]

Metrics for Classifiers Evaluation

Accuracy = (TP+TN)/(P+N)
Error = (FP+FN)/(P+N)
Precision = TP/(TP+FP)
Recall/TP rate = TP/P
FP Rate = FP/N

                 Predicted class
Actual class     Pos     Neg
Pos              TP      FN
Neg              FP      TN
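As a minimal sketch, the metrics above can be computed directly from the four confusion-matrix counts. The function name and the example counts are hypothetical; P is taken to be TP + FN (all actual positives) and N to be FP + TN (all actual negatives). The sketch does not guard against empty classes (division by zero).

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the slide's metrics from the four confusion-matrix counts."""
    p = tp + fn  # all actual positives
    n = fp + tn  # all actual negatives
    return {
        "accuracy": (tp + tn) / (p + n),
        "error": (fp + fn) / (p + n),
        "precision": tp / (tp + fp),
        "recall": tp / p,      # also called the TP rate
        "fp_rate": fp / n,
    }


# Hypothetical counts, chosen only for illustration.
metrics = classification_metrics(tp=40, fn=60, fp=30, tn=70)
print(metrics)
```

Note that accuracy and error always sum to 1, since every instance falls into exactly one of the four cells.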

How to Estimate the Metrics?


We can use:
Training data;
Independent test data;
Hold-out method;
k-fold cross-validation method;
Leave-one-out method;
Bootstrap method;
Ensemble methods

Estimation with Training Data


The accuracy/error estimates on the training data are not good indicators
of performance on future data.

[Figure: a classifier built from the training set and evaluated on the same training set.]

Q: Why?
A: Because new data will probably not be exactly the same as the training data!

The accuracy/error estimates on the training data measure the degree of
the classifier's overfitting.

Estimation with Independent Test Data


Estimation with independent test data is used when we have plenty
of data and there is a natural way of forming training and test data.

[Figure: a classifier built from the training set and evaluated on a separate test set.]

For example: Quinlan in 1987 reported experiments in a medical
domain for which the classifiers were trained on data from 1985 and
tested on data from 1986.

Hold-out Method
The hold-out method splits the data into training data and test
data (usually 2/3 for training, 1/3 for testing). Then we build a classifier
using the training data and test it using the test data.

[Figure: the data is split into a training set, used to build the classifier, and a test set, used to evaluate it.]

The hold-out method is usually used when we have thousands
of instances, including several hundred instances from each
class.
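A minimal sketch of a hold-out split, assuming the instances fit in a list and that a random shuffle is an acceptable way to pick the test third. The function name, the default fraction and the seed are illustrative choices, not part of the slides.

```python
import random


def holdout_split(instances, test_fraction=1 / 3, seed=0):
    """Shuffle the instances and reserve test_fraction of them for testing."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)


# 30 hypothetical instances, identified by their index.
train, test = holdout_split(range(30))
print(len(train), len(test))
```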

Classification: Train, Validation, Test Split


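[Figure: data with known results is split into a training set, a validation set and a final test set. The model builder learns from the training set, the validation set is used to evaluate and tune the classifier, and the final test set is used only for the final evaluation of the classifier's predictions.]

The test data can't be used for parameter tuning!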

Making the Most of the Data


Once evaluation is complete, all the data can be used
to build the final classifier.
Generally, the larger the training data the better the
classifier (but returns diminish).
The larger the test data the more accurate the error
estimate.

Stratification
The holdout method reserves a certain amount of data for testing
and uses the remainder for training.
Usually: one third for testing, the rest for training.

For unbalanced datasets, samples might not be representative.
There may be few or no instances of some classes, e.g. in fraudulent
transaction detection and medical diagnostic tests.

Stratified sample: an advanced version of balancing the data.
Make sure that each class is represented with approximately
equal proportions in both subsets.
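One way to sketch stratification is to split each class separately, so that both subsets keep roughly the original class proportions. The function name and the "fraud"/"ok" labels below are hypothetical.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=1 / 3, seed=0):
    """Return (train_idx, test_idx), splitting every class in the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical unbalanced data: 6 rare instances and 24 common ones.
labels = ["fraud"] * 6 + ["ok"] * 24
train_idx, test_idx = stratified_holdout(labels)
test_labels = [labels[i] for i in test_idx]
print(test_labels.count("fraud"), test_labels.count("ok"))
```

With a plain random split the test set could end up with no "fraud" instances at all; splitting per class rules that out whenever the class is large enough.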

Repeated Holdout Method


Holdout estimate can be made more reliable by
repeating the process with different subsamples.
In each iteration, a certain proportion is randomly
selected for training (possibly with stratification).
The error rates on the different iterations are averaged
to yield an overall error rate.

This is called the repeated holdout method.

Repeated Holdout Method, 2

Still not optimal: the different test sets overlap, but
we would like every instance in the data to be
tested at least once.

Can we prevent overlapping?

witten & eibe

k-Fold Cross-Validation
k-fold cross-validation avoids overlapping test sets:

First step: the data is split into k subsets of equal size;
Second step: each subset in turn is used for testing and the
remainder for training.

The subsets are stratified before the cross-validation.
The estimates are averaged to yield an overall estimate.

[Figure: the data is divided into three folds; in each of three rounds a different fold serves as the test set and the other two folds are used to train the classifier.]
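The two steps of k-fold cross-validation can be sketched as follows. This is an illustrative outline only: it uses contiguous folds without shuffling or stratification, and `evaluate` is a hypothetical callback that trains on the training indices and returns an error rate on the test indices.

```python
def k_fold_indices(n_instances, k):
    """Yield (train_idx, test_idx) for each of k non-overlapping folds."""
    indices = list(range(n_instances))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_instances // k + (1 if i < n_instances % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validate(n_instances, k, evaluate):
    """Average the per-fold estimates returned by the evaluate callback."""
    estimates = [evaluate(train, test) for train, test in k_fold_indices(n_instances, k)]
    return sum(estimates) / len(estimates)


folds = list(k_fold_indices(10, 3))
all_tested = sorted(i for _, test in folds for i in test)
print([len(test) for _, test in folds], all_tested)
```

Every instance appears in exactly one test fold, which is the property the repeated hold-out method lacks.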

More on Cross-Validation
Standard method for evaluation: stratified 10-fold cross-validation.
Why 10? Extensive experiments have shown that this is
the best choice to get an accurate estimate.
Stratification reduces the estimate's variance.
Even better: repeated stratified cross-validation.
E.g. ten-fold cross-validation is repeated ten times and the
results are averaged (reduces the variance).

Leave-One-Out Cross-Validation

Leave-One-Out is a particular form of cross-validation:

Set the number of folds to the number of training instances;
i.e., for n training instances, build the classifier n times.

Makes best use of the data.
Involves no random sub-sampling.
Very computationally expensive.

Leave-One-Out Cross-Validation and Stratification

A disadvantage of Leave-One-Out-CV is that stratification is not possible:

It guarantees a non-stratified sample because there
is only one instance in the test set!

Extreme example: a random dataset split equally into two classes:

The best inducer predicts the majority class;
50% accuracy on fresh data;
the Leave-One-Out-CV estimate is 100% error!
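The extreme example can be checked with a short sketch. Removing one instance from a perfectly balanced dataset makes the other class the majority in the remaining data, so a majority-class predictor is wrong on every held-out instance. The labels "a" and "b" and the function name are hypothetical.

```python
from collections import Counter


def loo_error_majority(labels):
    """Leave-one-out error of a classifier that predicts the majority class."""
    errors = 0
    for i, held_out in enumerate(labels):
        training = labels[:i] + labels[i + 1:]
        majority = Counter(training).most_common(1)[0][0]
        errors += majority != held_out
    return errors / len(labels)


# A dataset split equally into two classes.
labels = ["a"] * 50 + ["b"] * 50
loo_error = loo_error_majority(labels)
print(loo_error)
```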

Bootstrap Method
Cross-validation uses sampling without replacement:
The same instance, once selected, cannot be selected again
for a particular training/test set.

The bootstrap uses sampling with replacement to form the training set:
Sample a dataset of n instances n times with replacement to
form a new dataset of n instances;
Use this data as the training set;
Use the instances from the original dataset that don't occur
in the new training set for testing.

Bootstrap Method
The bootstrap method is also called the 0.632
bootstrap:
A particular instance has a probability of 1 - 1/n of not being
picked in a single draw;
Thus its probability of ending up in the test data is:

$$\left(1 - \frac{1}{n}\right)^{n} \approx e^{-1} \approx 0.368$$

This means the training data will contain approximately
63.2% of the instances and the test data will contain
approximately 36.8% of the instances.
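A small simulation can illustrate the 63.2% / 36.8% split. This is a sketch under the assumption that instances are identified by their index; the function name, the seed and the dataset size of 10,000 are arbitrary illustrative choices.

```python
import random


def bootstrap_split(n, seed=0):
    """Draw n indices with replacement; unused indices form the test set."""
    rng = random.Random(seed)
    train_idx = [rng.randrange(n) for _ in range(n)]
    chosen = set(train_idx)
    test_idx = [i for i in range(n) if i not in chosen]
    return train_idx, test_idx


n = 10_000
train_idx, test_idx = bootstrap_split(n)
unique_fraction = len(set(train_idx)) / n  # distinct instances in the training set
test_fraction = len(test_idx) / n
print(unique_fraction, test_fraction)
```

The training set still has n entries, but only about 63.2% of the original instances appear in it; the rest are duplicates.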

Estimating Error with the Bootstrap Method

The error estimate on the test data will be very pessimistic
because the classifier is trained on just ~63% of the instances.
Therefore, combine it with the training error:

$$err = 0.632 \cdot e_{\text{test instances}} + 0.368 \cdot e_{\text{training instances}}$$

The training error gets less weight than the error on the test
data.
Repeat the process several times with different replacement
samples; average the results.
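The weighted combination and the averaging over repetitions might be sketched like this. The error rates are made-up numbers for three hypothetical bootstrap repetitions, and the function name is illustrative.

```python
def bootstrap_632_error(test_errors, training_errors):
    """Average the 0.632-weighted error estimate over several bootstrap repetitions."""
    estimates = [
        0.632 * e_test + 0.368 * e_train
        for e_test, e_train in zip(test_errors, training_errors)
    ]
    return sum(estimates) / len(estimates)


# Hypothetical error rates from three bootstrap repetitions.
err = bootstrap_632_error(
    test_errors=[0.30, 0.26, 0.34],
    training_errors=[0.10, 0.08, 0.12],
)
print(round(err, 4))
```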

Comparing Two Classifier Models


Assume that we have two classifiers, M1 and M2, and we would
like to know which one is better for a classification problem.
We test the classifiers on n test data sets D1, D2, …, Dn, and we
receive error rate estimates e11, e12, …, e1n for classifier M1 and
error rate estimates e21, e22, …, e2n for classifier M2.
Using these error rate estimates we can compute the mean error rate ē1 for
classifier M1 and the mean error rate ē2 for classifier M2.

These mean error rates are just estimates of the error on the true
population of future data cases. What if the difference between
the two error rates is just due to chance?

Comparing Two Classifier Models


We note that the error rate estimates e11, e12, …, e1n for classifier M1 and the error
rate estimates e21, e22, …, e2n for classifier M2 are paired. Thus, we consider
the differences d1, d2, …, dn where dj = e1j - e2j.
The differences d1, d2, …, dn are instantiations of n random variables D1, D2,
…, Dn with mean μD and standard deviation σD.
We need to establish confidence intervals for μD in order to decide whether
the difference in the generalization performance of the classifiers M1 and M2
is statistically significant or not.

Comparing Two Classifier Models


Since the standard deviation σD is unknown, we approximate it
using the sample standard deviation sd:

$$s_d = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left[ (e_{1i} - e_{2i}) - (\bar{e}_1 - \bar{e}_2) \right]^2}$$

Since we approximate the true standard deviation σD, we
introduce the T statistic:

$$T = \frac{\bar{d} - \mu_D}{s_d / \sqrt{n}}$$

Comparing Two Classifier Models


The T statistic is governed by the t-distribution with n - 1 degrees
of freedom.

[Figure: t-distribution density with area 1 - α between the critical values t_{α/2} and t_{1-α/2}.]

Comparing Two Classifier Models


If d̄ and sd are the mean and standard deviation of the normally
distributed differences of n random pairs of errors, a (1 - α)100%
confidence interval for μD = μ1 - μ2 is:

$$\bar{d} - t_{\alpha/2} \frac{s_d}{\sqrt{n}} < \mu_D < \bar{d} + t_{\alpha/2} \frac{s_d}{\sqrt{n}},$$

where t_{α/2} is the t-value with v = n - 1 degrees of freedom,
leaving an area of α/2 to the right.

Thus, if the interval contains 0.0 we cannot reject, at
significance level α, the hypothesis that the difference is 0.0.
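A sketch of the paired confidence interval, under a few stated assumptions: the error rates below are invented for ten hypothetical test sets, the critical value 2.262 is the tabulated t-value for a 95% interval with 9 degrees of freedom and is passed in by hand (the Python standard library has no t-distribution), and `statistics.stdev` uses the n - 1 denominator, a common convention for the sample standard deviation.

```python
import math
import statistics


def paired_confidence_interval(errors_m1, errors_m2, t_critical):
    """Confidence interval for the mean difference of paired error rates."""
    diffs = [e1 - e2 for e1, e2 in zip(errors_m1, errors_m2)]
    n = len(diffs)
    d_bar = statistics.mean(diffs)
    s_d = statistics.stdev(diffs)  # sample standard deviation, n - 1 denominator
    half_width = t_critical * s_d / math.sqrt(n)
    return d_bar - half_width, d_bar + half_width


# Hypothetical error rates of M1 and M2 on the same 10 test sets.
e1 = [0.20, 0.22, 0.19, 0.21, 0.23, 0.20, 0.22, 0.21, 0.19, 0.23]
e2 = [0.15, 0.18, 0.16, 0.17, 0.18, 0.15, 0.17, 0.16, 0.16, 0.17]

low, high = paired_confidence_interval(e1, e2, t_critical=2.262)
significant = not (low <= 0.0 <= high)
print(low, high, significant)
```

Here the interval lies entirely above zero, so for this made-up data the difference would be judged significant at the 5% level.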

Counting the Costs


In practice, different types of classification errors
often incur different costs.
Examples:
Terrorist profiling
Predicting "not a terrorist" is correct 99.99% of the time
Loan decisions
Fault diagnosis
Promotional mailing

Cost Matrices
                 Hypothesized class
True class       Pos        Neg
Pos              TP Cost    FN Cost
Neg              FP Cost    TN Cost

Usually, TP Cost and TN Cost are set equal to 0!

Cost-Sensitive Classification
If the classifier outputs probability for each class, it can be adjusted to
minimize the expected costs of the predictions.
Expected cost is computed as dot product of vector of class
probabilities and appropriate column in cost matrix.

                 Hypothesized class
True class       Pos        Neg
Pos              TP Cost    FN Cost
Neg              FP Cost    TN Cost

Cost Sensitive Classification


Assume that the classifier returns for an instance p_pos = 0.6 and p_neg = 0.4.
Then, the expected cost if the instance is classified as positive is 0.6 * 0 +
0.4 * 10 = 4. The expected cost if the instance is classified as negative is 0.6
* 5 + 0.4 * 0 = 3. To minimize the costs the instance is classified as negative.

                 Hypothesized class
True class       Pos     Neg
Pos              0       5
Neg              10      0
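The worked example can be reproduced with a short sketch. The cost values are the ones used in the calculation above (FN cost 5, FP cost 10, correct decisions cost 0); the dictionary layout and function names are illustrative choices.

```python
# Cost matrix keyed by (true class, hypothesized class), values from the example.
COSTS = {
    ("pos", "pos"): 0,   # TP cost
    ("pos", "neg"): 5,   # FN cost
    ("neg", "pos"): 10,  # FP cost
    ("neg", "neg"): 0,   # TN cost
}


def expected_cost(probabilities, hypothesized):
    """Dot product of the class probabilities with one column of the cost matrix."""
    return sum(p * COSTS[(true_class, hypothesized)] for true_class, p in probabilities.items())


def min_cost_class(probabilities):
    """Pick the hypothesized class with the lowest expected cost."""
    return min(probabilities, key=lambda c: expected_cost(probabilities, c))


probs = {"pos": 0.6, "neg": 0.4}
cost_if_pos = expected_cost(probs, "pos")
cost_if_neg = expected_cost(probs, "neg")
decision = min_cost_class(probs)
print(cost_if_pos, cost_if_neg, decision)
```

Although the positive class is more probable, the higher cost of a false positive makes "neg" the cheaper decision.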

Cost Sensitive Learning


Simple methods for cost-sensitive learning:
Resampling of instances according to costs;
Weighting of instances according to costs.

                 Hypothesized class
True class       Pos     Neg
Pos              0       5
Neg              10      0

In Weka, cost-sensitive classification and learning can be applied to any classifier using
the meta scheme CostSensitiveClassifier.

ROC Curves and Analysis


Classifier 1
        Predicted
True    pos   neg
pos     40    60
neg     30    70
TPr = 0.4
FPr = 0.3

Classifier 2
        Predicted
True    pos   neg
pos     70    30
neg     50    50
TPr = 0.7
FPr = 0.5

Classifier 3
        Predicted
True    pos   neg
pos     60    40
neg     20    80
TPr = 0.6
FPr = 0.2

ROC Space
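[Figure: ROC space, with the false positive rate on the horizontal axis and the true positive rate on the vertical axis. Marked points: ideal classifier (top-left corner), always positive (top-right corner), always negative (bottom-left corner); the diagonal corresponds to chance.]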

Dominance in the ROC Space

Classifier A dominates classifier B if and only if TPrA > TPrB and FPrA < FPrB.
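As an illustration, the dominance test can be applied to the three classifiers from the ROC slide, using their confusion-matrix counts. The helper names are hypothetical.

```python
def rates(tp, fn, fp, tn):
    """Return (TP rate, FP rate) for one confusion matrix."""
    return tp / (tp + fn), fp / (fp + tn)


def dominates(a, b):
    """True if classifier a has a higher TP rate and a lower FP rate than b."""
    tpr_a, fpr_a = a
    tpr_b, fpr_b = b
    return tpr_a > tpr_b and fpr_a < fpr_b


c1 = rates(tp=40, fn=60, fp=30, tn=70)  # TPr = 0.4, FPr = 0.3
c2 = rates(tp=70, fn=30, fp=50, tn=50)  # TPr = 0.7, FPr = 0.5
c3 = rates(tp=60, fn=40, fp=20, tn=80)  # TPr = 0.6, FPr = 0.2

print(dominates(c3, c1), dominates(c3, c2), dominates(c2, c1))
```

Classifier 3 dominates classifier 1, but neither of classifiers 2 and 3 dominates the other: classifier 2 finds more positives at the price of more false alarms.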

Ensemble Methods
[Figure: the data is used to build model 1, model 2, …, model k, which are combined into one ensemble model.]

Combine multiple models into one!

Applications: classification, clustering, collaborative
filtering, anomaly detection

Motivations
Ensemble models improve accuracy and robustness over single-model
methods
Applications:
distributed computing
privacy-preserving applications
large-scale data with reusable models
multiple sources of data

Efficiency: a complex problem can be decomposed into multiple
sub-problems that are easier to understand and solve (divide-and-conquer approach)

Why Ensemble Works? (1)


Intuition
Combining diverse, independent opinions in human decision-making acts as a protective mechanism (e.g. a stock portfolio)

Uncorrelated error reduction
Suppose we have 5 completely independent classifiers for majority
voting
If accuracy is 70% for each:
10(.7^3)(.3^2) + 5(.7^4)(.3) + (.7^5)
= 83.7% majority vote accuracy

With 101 such classifiers:
99.9% majority vote accuracy
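The majority-vote figures follow from the binomial distribution: the vote is correct whenever more than half of the independent classifiers are correct. A minimal sketch, assuming an odd number of classifiers so that ties cannot occur (the function name is illustrative):

```python
from math import comb


def majority_vote_accuracy(n_classifiers, accuracy):
    """Probability that more than half of n independent classifiers are correct."""
    majority = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k) * accuracy**k * (1 - accuracy) ** (n_classifiers - k)
        for k in range(majority, n_classifiers + 1)
    )


acc_5 = majority_vote_accuracy(5, 0.7)
acc_101 = majority_vote_accuracy(101, 0.7)
print(round(acc_5, 3), acc_101)
```

The gain depends entirely on the independence assumption; classifiers that make correlated errors benefit far less from voting.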

Why Ensemble Works? (2)


[Figure: six models (Model 1 to Model 6), each capturing part of some unknown distribution.]

Ensemble gives the global picture!

Why Ensemble Works? (3)


Overcome limitations of a single hypothesis
The target function may not be implementable with
individual classifiers, but may be approximated by model
averaging

[Figure: a single decision tree compared with model averaging.]

Ensemble of Classifiers: Learn to Combine

[Figure: in training, labeled data is used to build classifier 1, classifier 2, …, classifier k and the ensemble model that combines them; at test time the ensemble model produces the final predictions for unlabeled data.]

Learn the combination from labeled data
Algorithms: boosting, stacked generalization, rule
ensemble, Bayesian model averaging

Ensemble of Classifiers: Consensus

[Figure: in training, labeled data is used to build classifier 1, classifier 2, …, classifier k; at test time their predictions for unlabeled data are combined by majority voting into the final predictions.]

Algorithms: bagging, random forest, random
decision tree, model averaging of probabilities

Pros and Cons

Combine by learning
Pros:
Gets useful feedback from labeled data
Can potentially improve accuracy
Cons:
Needs to keep the labeled data to train the ensemble
May overfit the labeled data
Cannot work when no labels are available

Combine by consensus
Pros:
Does not need labeled data
Can improve the generalization performance
Cons:
No feedback from the labeled data
Requires the assumption that consensus is better

Bagging
Original Data        1   2   3   4   5   6   7   8   9   10
Bagging (Round 1)    7   8   10  8   2   5   10  10  5   9
Bagging (Round 2)    1   4   9   1   2   3   2   7   3   2
Bagging (Round 3)    1   8   5   10  5   5   9   6   3   7

Also known as bootstrap aggregation
Sampling uniformly with replacement
Build a classifier on each bootstrap sample
0.632 bootstrap
Each bootstrap sample Di contains approx. 63.2% of
the original training data
The remaining 36.8% are used as the test set

Bagging
Accuracy of bagging:

$$Acc(M) = \sum_{i=1}^{k} \left( 0.632 \cdot Acc(M_i)_{test\_set} + 0.368 \cdot Acc(M_i)_{train\_set} \right)$$

Works well for small data sets

Example:
X    0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
y    1     1     1     -1    -1    -1    -1    1     1     1

Bagging
Decision Stump
A single-level binary decision tree
Entropy-based split: x <= 0.35 or x <= 0.75
Accuracy at most 70%

Bagging

Accuracy of ensemble
classifier: 100%
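The bagging example can be sketched with decision stumps. Several things here are assumptions: the labels `Y` are chosen to be consistent with the splits x <= 0.35 and x <= 0.75 (four -1 labels between them), the stump picks its split by minimizing training errors rather than entropy, and the bootstrap samples come from a seeded random generator instead of the rounds shown on the slides. Because the samples are random, the sketch makes no claim of reproducing the 100% ensemble accuracy.

```python
import random

# X as on the slide; Y is an assumed labeling (see the note above).
X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]
THRESHOLDS = [0.05 + 0.1 * i for i in range(11)]  # candidate split points


def train_stump(xs, ys):
    """Return (threshold, left_label) with the fewest training errors."""
    best = None
    for t in THRESHOLDS:
        for left in (1, -1):
            errors = sum((left if x <= t else -left) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, left)
    return best[1], best[2]


def stump_predict(stump, x):
    threshold, left = stump
    return left if x <= threshold else -left


def bagging(xs, ys, rounds, seed=0):
    """Train one stump per bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(rounds):
        sample = [rng.randrange(len(xs)) for _ in xs]
        stumps.append(train_stump([xs[i] for i in sample], [ys[i] for i in sample]))
    return stumps


def ensemble_predict(stumps, x):
    """Majority vote of the stumps; ties go to +1."""
    votes = sum(stump_predict(s, x) for s in stumps)
    return 1 if votes >= 0 else -1


def accuracy(predict, xs, ys):
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(xs)


single = train_stump(X, Y)
single_acc = accuracy(lambda x: stump_predict(single, x), X, Y)
stumps = bagging(X, Y, rounds=10)
ensemble_acc = accuracy(lambda x: ensemble_predict(stumps, x), X, Y)
print(single_acc, ensemble_acc)
```

A single stump can only cut the axis once, which is why it cannot exceed 70% on this data; a vote over stumps with different cuts can represent the middle interval.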

Bagging- Final Points

Works well if the base classifiers are unstable


Increased accuracy because it reduces the variance of
the individual classifier
Does not focus on any particular instance of the
training data
Therefore, less susceptible to model over-fitting when
applied to noisy data
What if we want to focus on particular instances of the
training data?

Boosting
Principles

Boost a set of weak learners to a strong learner


An iterative procedure to adaptively change distribution of
training data by focusing more on previously misclassified
records
Initially, all N records are assigned equal weights
Unlike bagging, weights may change at the end of a boosting
round

Boosting
Records that are wrongly classified will have their weights
increased
Records that are classified correctly will have their weights
decreased
Original Data         1   2   3   4   5   6   7   8   9   10
Boosting (Round 1)    7   3   2   8   7   9   4   10  6   3
Boosting (Round 2)    5   4   9   4   2   5   1   7   4   2
Boosting (Round 3)    4   4   8   10  4   5   4   6   3   4

Example 4 is hard to classify
Its weight is increased, therefore it is
more likely to be chosen again in
subsequent rounds

Boosting
Equal weights are assigned to each training tuple (1/d for round
1)
After a classifier Mi is learned, the weights are adjusted to allow
the subsequent classifier Mi+1 to pay more attention to tuples
that were misclassified by Mi.
The final boosted classifier M* combines the votes of each
individual classifier
The weight of each classifier's vote is a function of its accuracy
AdaBoost: a popular boosting algorithm

Adaboost
Input:
Training set D containing d tuples
k rounds
A classification learning scheme

Output:
A composite model

Adaboost
Data set D containing d class-labeled tuples (X1,y1), (X2,y2),
(X3,y3), …, (Xd,yd)
Initially assign equal weight 1/d to each tuple
To generate k base classifiers, we need k rounds or
iterations
In round i, tuples from D are sampled with replacement to
form Di (size d)
Each tuple's chance of being selected depends on its weight

Adaboost
Base classifier Mi is derived from the training tuples of Di
The error of Mi is tested using Di
Weights of training tuples are adjusted depending on how
they were classified
Correctly classified: Decrease weight
Incorrectly classified: Increase weight

The weight of a tuple indicates how hard it is to classify
(directly proportional)

Adaboost
Some classifiers may be better at classifying some
hard tuples than others
We finally have a series of classifiers that
complement each other!
Error rate of model Mi:

$$error(M_i) = \sum_{j=1}^{d} w_j \cdot err(X_j)$$

where err(Xj) is the misclassification error for Xj: 1 if Xj is misclassified, 0 otherwise

If the classifier's error exceeds 0.5, we abandon it
Try again with a new Di and a new Mi derived from
it

Adaboost
error(Mi) affects how the weights of training tuples are
updated
If a tuple is correctly classified in round i, its weight is
multiplied by

$$\frac{error(M_i)}{1 - error(M_i)}$$

Adjust weights of all correctly classified tuples
Now weights of all tuples (including the misclassified tuples)
are normalized

$$\text{Normalization factor} = \frac{\text{sum of old weights}}{\text{sum of new weights}}$$

The weight of classifier Mi's vote is

$$\log \frac{1 - error(M_i)}{error(M_i)}$$
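One round of the weight update might look like the sketch below. The ten equal starting weights and the pattern of seven correct and three misclassified tuples are made up for illustration; the function name is hypothetical.

```python
def update_weights(weights, correct, error):
    """Shrink the weights of correctly classified tuples, then renormalize."""
    factor = error / (1 - error)
    new_weights = [w * factor if ok else w for w, ok in zip(weights, correct)]
    # Rescale so the weights sum to what they summed to before the update.
    scale = sum(weights) / sum(new_weights)
    return [w * scale for w in new_weights]


# Hypothetical round: 10 tuples with equal weight, the last 3 misclassified.
weights = [0.1] * 10
correct = [True] * 7 + [False] * 3
error = sum(w for w, ok in zip(weights, correct) if not ok)  # weighted error, about 0.3

new_weights = update_weights(weights, correct, error)
print(new_weights[0], new_weights[-1])
```

After normalization the misclassified tuples carry half of the total weight, so they are much more likely to be sampled in the next round.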

Adaboost
The lower a classifier's error rate, the more accurate it is, and
therefore, the higher its weight for voting should be
The weight of classifier Mi's vote is

$$\log \frac{1 - error(M_i)}{error(M_i)}$$

For each class c, sum the weights of each classifier that
assigned class c to X (unseen tuple)
The class with the highest sum is the WINNER!
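The weighted vote can be sketched as follows, taking the vote weight to be log((1 - error) / error) so that a lower error rate gives a higher weight. The three predictions and error rates are invented for illustration, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def vote_weight(error):
    """Weight of a classifier's vote; larger for lower error rates."""
    return math.log((1 - error) / error)


def weighted_vote(predictions, errors):
    """Sum the vote weights per class and return the class with the highest sum."""
    totals = defaultdict(float)
    for label, error in zip(predictions, errors):
        totals[label] += vote_weight(error)
    return max(totals, key=totals.get)


# Hypothetical round: one accurate classifier says "yes", two weak ones say "no".
winner = weighted_vote(["yes", "no", "no"], [0.1, 0.4, 0.45])
print(winner)
```

In this example the single accurate classifier outvotes the two weak ones, which a plain majority vote would not allow.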

Metric Evaluation Summary:


Use test sets and the hold-out method for large data;
Use the cross-validation method for middle-sized data;
Use the leave-one-out and bootstrap methods for small
data;
Don't use test data for parameter tuning; use separate
validation data.

Summary
In this seminar report we have considered:

Metrics for Classifiers Evaluation


Methods for Classifiers Evaluation
Comparing Data Mining Schemes
Costs in Data Mining
Ensemble Methods

Thank You