
# Data Mining Models & Evaluation Techniques

Mentored By: Prof. K. P. Agarwal
Created By: Shubham Pachori

## Overview

Motivation
Metrics for Classifier Evaluation
Methods for Classifier Evaluation
Comparing the Performance of Two Classifiers
Costs in Classification
Ensemble Methods to Improve Accuracy

## Motivation

It is important to evaluate a classifier's generalization performance in order to:

Determine whether to employ the classifier.
(For example: when learning the effectiveness of medical treatments from limited-size data, it is important to estimate the accuracy of the classifiers.)

Optimize the classifier.
(For example: when post-pruning decision trees we must evaluate the accuracy of the decision trees at each pruning step.)

## Model Evaluation in the KDD Process

[Figure: the KDD process — Selection produces target data from raw data, Preprocessing & cleaning produces processed data, Transformation & feature selection produces transformed data, Data Mining produces patterns, and Interpretation / Evaluation turns patterns into knowledge. Model evaluation belongs to the Interpretation / Evaluation step.]

## Metrics for Classifier Evaluation

Accuracy = (TP + TN) / (P + N)
Error = (FP + FN) / (P + N)
Precision = TP / (TP + FP)
Recall / TP Rate = TP / P
FP Rate = FP / N

| Actual class \ Predicted class | Pos | Neg |
|---|---|---|
| Pos | TP | FN |
| Neg | FP | TN |
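These metrics follow directly from the confusion-matrix counts. As a quick illustration (an added sketch, not part of the original slides), a small Python function can compute them; the example numbers are taken from Classifier 1 in the ROC section later in the deck.

```python
# Illustrative helper: compute the metrics above from confusion-matrix counts.
def classification_metrics(tp, fn, fp, tn):
    p, n = tp + fn, fp + tn          # actual positives and actual negatives
    return {
        "accuracy":  (tp + tn) / (p + n),
        "error":     (fp + fn) / (p + n),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall":    tp / p if p else 0.0,   # TP rate
        "fp_rate":   fp / n if n else 0.0,
    }

# Example: Classifier 1 from the ROC slide (TP=40, FN=60, FP=30, TN=70)
print(classification_metrics(40, 60, 30, 70))
# -> accuracy 0.55, error 0.45, precision ~0.571, recall 0.4, fp_rate 0.3
```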

## How to Estimate the Metrics?

We can use:
Training data;
Independent test data;
Hold-out method;
k-fold cross-validation method;
Leave-one-out method;
Bootstrap method;
Ensemble methods

## Estimation with Training Data

The accuracy/error estimates on the training data are not good indicators of performance on future data.

[Figure: the same training set is used both to build the classifier and to test it.]

Q: Why?
A: Because new data will probably not be exactly the same as the training data!

The accuracy/error estimates on the training data measure the degree of the classifier's overfitting.

## Estimation with Independent Test Data

Estimation with independent test data is used when we have plenty of data and there is a natural way of forming training and test data.

[Figure: the classifier is built on the training set and evaluated on a separate test set.]

For example: Quinlan in 1987 reported experiments in a medical domain for which the classifiers were trained on data from 1985 and tested on data from 1986.

## Hold-out Method

The hold-out method splits the data into training data and test data (usually 2/3 for training, 1/3 for testing). Then we build a classifier using the training data and test it using the test data.

[Figure: the data is split into a training set, used to build the classifier, and a test set, used to evaluate it.]

The hold-out method is usually used when we have thousands of instances, including several hundred instances from each class.
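For concreteness, here is a minimal hold-out sketch in Python, assuming scikit-learn and synthetic data (the deck itself does not prescribe a toolkit).

```python
# A minimal hold-out evaluation sketch (assumed setup, not from the slides).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, random_state=0)   # placeholder data

# 2/3 for training, 1/3 for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=42)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Hold-out accuracy:", clf.score(X_test, y_test))
```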

## Classification: Train, Validation, Test Split

[Figure: data with known results is split into a training set and a validation set. The model builder learns from the training set, the classifier builder is evaluated and tuned on the validation set, and a held-back final test set is used only for the final evaluation of the classifier's predictions.]

## Making the Most of the Data

Once evaluation is complete, all the data can be used to build the final classifier.
Generally, the larger the training data, the better the classifier (but returns diminish).
The larger the test data, the more accurate the error estimate.

## Stratification

The holdout method reserves a certain amount of the data for testing and uses the remainder for training.
Usually: one third for testing, the rest for training.

For unbalanced datasets, the samples might not be representative:
Few or no instances of some classes, e.g. fraudulent transaction detection or medical diagnostic tests.

Stratified sampling is an advanced version of balancing the data:
Make sure that each class is represented with approximately equal proportions in both subsets.
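A small illustrative sketch of a stratified split, again assuming scikit-learn; `stratify=y` keeps the class proportions roughly equal in both subsets.

```python
# Stratified hold-out split sketch on an unbalanced synthetic dataset (assumed setup).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# ~5% positives, ~95% negatives
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, stratify=y, random_state=0)
print("positive rate overall:", y.mean())
print("positive rate in train:", y_tr.mean(), "in test:", y_te.mean())
```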

## Repeated Holdout Method

The holdout estimate can be made more reliable by repeating the process with different subsamples:
In each iteration, a certain proportion is randomly selected for training (possibly with stratification);
The error rates from the different iterations are averaged to yield an overall error rate.

Still not optimal: the different test sets overlap, but we would like every instance in the data to be tested at least once.


## k-Fold Cross-Validation

k-fold cross-validation avoids overlapping test sets:
First step: the data is split into k subsets of equal size;
Second step: each subset in turn is used for testing and the remainder for training.

The subsets are stratified before the cross-validation.
The estimates are averaged to yield an overall estimate.

[Figure: the data is divided into k folds; in each of the k runs, one fold serves as the test set and the remaining folds form the training set.]

## More on Cross-Validation

Standard method for evaluation: stratified 10-fold cross-validation.
Why 10? Extensive experiments have shown that this is the best choice to get an accurate estimate.
Stratification reduces the estimate's variance.
Even better: repeated stratified cross-validation:
e.g. ten-fold cross-validation is repeated ten times and the results are averaged (this further reduces the variance).
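As an illustration (assuming scikit-learn, which the deck does not mandate), repeated stratified 10-fold cross-validation might look like this:

```python
# Repeated stratified 10-fold cross-validation sketch (assumed Python equivalent of the
# Weka workflow discussed in the deck).
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```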

## Leave-One-Out Cross-Validation

Set the number of folds to the number of training instances;
i.e., for n training instances, build the classifier n times.
Makes the best use of the data.
Involves no random sub-sampling.
Very computationally expensive.

## Leave-One-Out-CV and Stratification

A disadvantage of Leave-One-Out-CV is that stratification is not possible:
It guarantees a non-stratified sample because there is only one instance in the test set!

Extreme example: a completely random dataset split equally between two classes:
The best inducer predicts the majority class, giving 50% accuracy on fresh data;
Yet the Leave-One-Out-CV estimate is 100% error! (Removing the single test instance makes its class the minority in the training data, so the majority-class predictor always gets it wrong.)

## Bootstrap Method

Cross-validation uses sampling without replacement:
The same instance, once selected, cannot be selected again for a particular training/test set.

The bootstrap uses sampling with replacement to form the training set:
Sample a dataset of n instances n times with replacement to form a new dataset of n instances;
Use this data as the training set;
Use the instances from the original dataset that don't occur in the new training set for testing.

## Bootstrap Method

The bootstrap method is also called the 0.632 bootstrap:
A particular instance has a probability of $1 - \frac{1}{n}$ of not being picked in a single draw;
Thus its probability of ending up in the test data (i.e., never being picked in n draws) is:

$$\left(1 - \frac{1}{n}\right)^{n} \approx e^{-1} \approx 0.368$$

This means the training data will contain approximately 63.2% of the distinct instances and the test data will contain approximately 36.8% of the instances.

## Bootstrap Method

The error estimate on the test data will be very pessimistic because the classifier is trained on only ~63% of the instances.
Therefore, combine it with the training error:

$$err = 0.632 \cdot e_{\text{test instances}} + 0.368 \cdot e_{\text{training instances}}$$

The training error gets less weight than the error on the test data.
Repeat the process several times with different replacement samples and average the results.
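A minimal sketch of the 0.632 bootstrap estimate, assuming numpy/scikit-learn and synthetic data; the variable names are illustrative only.

```python
# 0.632-bootstrap error estimation sketch (assumed setup, not from the deck).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)
n = len(X)
estimates = []

for _ in range(50):                                     # repeat with different bootstrap samples
    boot = rng.integers(0, n, size=n)                   # sample n indices with replacement
    oob = np.setdiff1d(np.arange(n), boot)              # out-of-bag instances -> test set
    clf = DecisionTreeClassifier(random_state=0).fit(X[boot], y[boot])
    e_test = 1 - clf.score(X[oob], y[oob])
    e_train = 1 - clf.score(X[boot], y[boot])
    estimates.append(0.632 * e_test + 0.368 * e_train)  # the 0.632 combination from the slide

print("0.632 bootstrap error estimate: %.3f" % np.mean(estimates))
```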

## Comparing Two Classifier Models

Assume that we have two classifiers, M1 and M2, and we would like to know which one is better for a classification problem.
We test the classifiers on n test data sets D1, D2, ..., Dn, and we obtain error rate estimates e11, e12, ..., e1n for classifier M1 and error rate estimates e21, e22, ..., e2n for classifier M2.
Using these estimates we can compute the mean error rate e1 for classifier M1 and the mean error rate e2 for classifier M2.

These mean error rates are just estimates of the error on the true population of future data cases. What if the difference between the two error rates is just attributable to chance?

## Comparing Two Classifier Models

We note that the error rate estimates e11, e12, ..., e1n for classifier M1 and the error rate estimates e21, e22, ..., e2n for classifier M2 are paired. Thus, we consider the differences d1, d2, ..., dn, where dj = e1j - e2j.
The differences d1, d2, ..., dn are instantiations of n random variables with mean $\mu_D$ and standard deviation $\sigma_D$.
We need to establish confidence intervals for $\mu_D$ in order to decide whether the difference in the generalization performance of the classifiers M1 and M2 is statistically significant or not.

## Comparing Two Classifier Models

Since the standard deviation $\sigma_D$ is unknown, we approximate it using the sample standard deviation $s_d$:

$$s_d = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl[(e_{1i} - e_{2i}) - (\bar{e}_1 - \bar{e}_2)\bigr]^{2}}$$

Since we approximate the true standard deviation $\sigma_D$, we introduce the T statistic:

$$T = \frac{\bar{D} - \mu_D}{s_d / \sqrt{n}}$$

## Comparing Two Classifier Models

The T statistic follows a t-distribution with n - 1 degrees of freedom.

[Figure: the t-distribution density, with a central area of $1 - \alpha$ between the critical values $t_{\alpha/2}$ and $t_{1-\alpha/2}$.]

## Comparing Two Classifier Models

If $\bar{d}$ and $s_d$ are the mean and standard deviation of the normally distributed differences of n random pairs of errors, a $(1 - \alpha)\cdot 100\%$ confidence interval for $\mu_D = \mu_1 - \mu_2$ is:

$$\bar{d} - t_{\alpha/2}\,\frac{s_d}{\sqrt{n}} \;\le\; \mu_D \;\le\; \bar{d} + t_{\alpha/2}\,\frac{s_d}{\sqrt{n}},$$

where $t_{\alpha/2}$ is the t-value with n - 1 degrees of freedom leaving an area of $\alpha/2$ to the right.

Thus, if the interval contains 0.0, we conclude at significance level $\alpha$ that the difference is not statistically significant.
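For illustration, a paired t-test on per-fold error rates can be run with scipy; the error values below are made-up placeholders, not results from this report.

```python
# Paired t-test sketch on the error rates of two classifiers (placeholder numbers).
import numpy as np
from scipy import stats

e1 = np.array([0.12, 0.15, 0.10, 0.14, 0.11, 0.13, 0.12, 0.16, 0.11, 0.13])  # errors of M1
e2 = np.array([0.14, 0.16, 0.13, 0.15, 0.12, 0.15, 0.14, 0.17, 0.13, 0.15])  # errors of M2

t_stat, p_value = stats.ttest_rel(e1, e2)      # paired t-test on the differences
print("T = %.3f, p = %.4f" % (t_stat, p_value))

# Equivalent confidence-interval view (95%):
d = e1 - e2
n = len(d)
ci = stats.t.interval(0.95, df=n - 1, loc=d.mean(), scale=stats.sem(d))
print("95% CI for the mean difference:", ci)
# If the interval contains 0.0, the difference is not significant at alpha = 0.05.
```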

## Counting the Costs

In practice, different types of classification errors often incur different costs.
Examples:
Terrorist profiling ("not a terrorist" is correct 99.99% of the time);
Loan decisions;
Fault diagnosis;
Promotional mailing.

## Cost Matrices

| True class \ Hypothesized class | Pos | Neg |
|---|---|---|
| Pos | TP Cost | FN Cost |
| Neg | FP Cost | TN Cost |

Usually, TP Cost and TN Cost are set equal to 0!

## Cost-Sensitive Classification

If the classifier outputs a probability for each class, it can be adjusted to minimize the expected cost of its predictions.
The expected cost is computed as the dot product of the vector of class probabilities and the appropriate column of the cost matrix (the column of the hypothesized class).

| True class \ Hypothesized class | Pos | Neg |
|---|---|---|
| Pos | TP Cost | FN Cost |
| Neg | FP Cost | TN Cost |

## Cost-Sensitive Classification

Assume the cost matrix below (TP Cost = TN Cost = 0, FN Cost = 5, FP Cost = 10) and that the classifier returns for an instance ppos = 0.6 and pneg = 0.4.
Then the expected cost if the instance is classified as positive is 0.6 * 0 + 0.4 * 10 = 4. The expected cost if the instance is classified as negative is 0.6 * 5 + 0.4 * 0 = 3. To minimize the cost, the instance is classified as negative.

| True class \ Hypothesized class | Pos | Neg |
|---|---|---|
| Pos | 0 | 5 |
| Neg | 10 | 0 |
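A small sketch of this expected-cost rule in Python (the cost values come from the example above; the function and variable names are illustrative only).

```python
# Cost-sensitive prediction sketch: pick the class with the smallest expected cost.
import numpy as np

# Rows = true class (pos, neg); columns = hypothesized class (pos, neg)
cost_matrix = np.array([[0, 5],
                        [10, 0]])

def min_cost_class(class_probs, cost_matrix, labels=("pos", "neg")):
    # expected cost of each decision = class probabilities . column of the cost matrix
    expected_costs = class_probs @ cost_matrix
    return labels[int(np.argmin(expected_costs))], expected_costs

label, costs = min_cost_class(np.array([0.6, 0.4]), cost_matrix)
print(label, costs)   # -> 'neg', expected costs [4., 3.] as in the example above
```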

## Cost-Sensitive Learning

Simple methods for cost-sensitive learning:
Resampling of instances according to costs;
Weighting of instances according to costs.

(The same cost matrix as in the previous example applies.)

In Weka, cost-sensitive classification and learning can be applied with any classifier using the meta scheme CostSensitiveClassifier.

## ROC Curves and Analysis

Classifier 1 (TPr = 0.4, FPr = 0.3):

| True \ Predicted | pos | neg |
|---|---|---|
| pos | 40 | 60 |
| neg | 30 | 70 |

Classifier 2 (TPr = 0.7, FPr = 0.5):

| True \ Predicted | pos | neg |
|---|---|---|
| pos | 70 | 30 |
| neg | 50 | 50 |

Classifier 3 (TPr = 0.6, FPr = 0.2):

| True \ Predicted | pos | neg |
|---|---|---|
| pos | 60 | 40 |
| neg | 20 | 80 |

## ROC Space

[Figure: the ROC space plots TP rate against FP rate. The ideal classifier sits in the top-left corner, the "always negative" classifier at the origin, the "always positive" classifier at the top-right corner, and random guessing lies on the diagonal chance line.]

## Dominance in the ROC Space

Classifier A dominates classifier B if and only if TPr_A > TPr_B and FPr_A < FPr_B.

## Ensemble Methods

Combine multiple models into one!

[Figure: the data is fed to k models (model 1, model 2, ..., model k), whose outputs are combined into a single ensemble model.]

Applications: classification, clustering, collaborative filtering, anomaly detection.

## Motivations

An ensemble model improves accuracy and robustness over single-model methods.
Applications:
distributed computing;
privacy-preserving applications;
large-scale data with reusable models;
multiple sources of data.

Efficiency: a complex problem can be decomposed into multiple sub-problems that are easier to understand and solve (divide-and-conquer approach).

## Why Ensembles Work (1)

Intuition: combining diverse, independent opinions in human decision-making acts as a protective mechanism (e.g. a stock portfolio).

Uncorrelated error reduction:
Suppose we have 5 completely independent classifiers combined by majority voting.
If the accuracy of each is 70%, the majority vote is correct whenever at least 3 of the 5 classifiers are correct:

$$10 \cdot (0.7)^3 (0.3)^2 + 5 \cdot (0.7)^4 (0.3) + (0.7)^5 \approx 0.837,$$

i.e. 83.7% majority-vote accuracy.
With 101 such classifiers, the majority-vote accuracy reaches 99.9%.
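The quoted figures can be reproduced with a short binomial computation; this sketch is an added illustration.

```python
# Majority-vote accuracy of k independent classifiers, each with accuracy p.
from math import comb

def majority_vote_accuracy(k, p):
    need = k // 2 + 1                      # minimum number of correct votes to win
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(need, k + 1))

print(majority_vote_accuracy(5, 0.7))      # ~0.837, as computed above
print(majority_vote_accuracy(101, 0.7))    # well above 0.999, consistent with the 99.9% quoted
```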

## Why Ensembles Work (2)

[Figure: six individual models (Model 1 through Model 6) each give a different approximation of some unknown distribution.]

## Why Ensembles Work (3)

Overcome the limitations of a single hypothesis:
The target function may not be implementable with individual classifiers, but may be approximated by model averaging.

[Figure: decision tree approximation of the target function.]

## Model Averaging

[Figure: during training, k classifiers (classifier 1 ... classifier k) are learned from labeled data; at test time their predictions on unlabeled data are fed to an ensemble model that produces the final predictions.]

Learn the combination from labeled data.
Algorithms: boosting, stacked generalization, rule ensemble, Bayesian model averaging.

## Ensemble of Classifiers: Consensus

[Figure: during training, k classifiers (classifier 1 ... classifier k) are learned from labeled data; at test time their predictions on unlabeled data are combined by majority voting to produce the final predictions.]

Algorithms: bagging, random forest, random decision tree, model averaging of probabilities.

## Pros and Cons

| | Learn to combine (model averaging) | Consensus (majority voting) |
|---|---|---|
| Pros | Gets useful feedback from the labeled data; can potentially improve accuracy | Does not need labeled data to combine; can improve the generalization performance |
| Cons | Needs to keep the labeled data to train the ensemble; may overfit the labeled data; cannot work when no labels are available | Gets no feedback from the labeled data; requires the assumption that consensus is better |

## Bagging

| Original Data | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Bagging (Round 1) | 7 | 8 | 10 | 8 | 2 | 5 | 10 | 10 | 5 | 9 |
| Bagging (Round 2) | 1 | 4 | 9 | 1 | 2 | 3 | 2 | 7 | 3 | 2 |
| Bagging (Round 3) | 1 | 8 | 5 | 10 | 5 | 5 | 9 | 6 | 3 | 7 |

Also known as bootstrap aggregation:
Sample uniformly with replacement;
Build a classifier on each bootstrap sample.
0.632 bootstrap:
Each bootstrap sample Di contains approx. 63.2% of the original training data;
The remaining instances (36.8%) are used as the test set.
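A compact bagging sketch follows (an assumed Python illustration, not the deck's Weka workflow): bootstrap-sample the training data, fit one base classifier per sample, and combine the predictions by majority vote.

```python
# Bagging sketch with decision stumps as unstable base classifiers (assumed setup).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=0)       # labels are 0/1
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

rng = np.random.default_rng(0)
n, k = len(X_tr), 25
models = []
for _ in range(k):
    idx = rng.integers(0, n, size=n)                             # uniform sampling with replacement
    models.append(DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr[idx], y_tr[idx]))

votes = np.array([m.predict(X_te) for m in models])              # k x n_test matrix of 0/1 votes
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)                 # majority vote per test instance
print("bagged stump accuracy:", (y_pred == y_te).mean())
```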

## Bagging

Accuracy of bagging: the ensemble prediction for an instance is the majority vote of the k base classifiers built on the k bootstrap samples.

Works well for small data sets. Example (one-dimensional training data):

| x | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| y | 1 | 1 | 1 | -1 | -1 | -1 | -1 | 1 | 1 | 1 |

## Bagging: Decision Stump

A decision stump is a single-level binary decision tree.
The best entropy-based split on this data is x <= 0.35 or x <= 0.75.
A single stump therefore achieves an accuracy of at most 70%.

## Bagging

Accuracy of the ensemble classifier on this example: 100%.

Bagging works well if the base classifiers are unstable.
It increases accuracy because it reduces the variance of the individual classifiers.
It does not focus on any particular instance of the training data, and is therefore less susceptible to over-fitting when applied to noisy data.
What if we want to focus on particular instances of the training data?

## Boosting

Principles:
Boost a set of weak learners into a strong learner.
An iterative procedure adaptively changes the distribution of the training data by focusing more on previously misclassified records.
Initially, all N records are assigned equal weights.
Unlike bagging, the weights may change at the end of each boosting round.

## Boosting

Records that are wrongly classified will have their weights increased.
Records that are classified correctly will have their weights decreased.

| Original Data | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Boosting (Round 1) | 7 | 3 | 2 | 8 | 7 | 9 | 4 | 10 | 6 | 3 |
| Boosting (Round 2) | 5 | 4 | 9 | 4 | 2 | 5 | 1 | 7 | 4 | 2 |
| Boosting (Round 3) | 4 | 4 | 8 | 10 | 4 | 5 | 4 | 6 | 3 | 4 |

Example 4 is hard to classify:
Its weight is increased, so it is more likely to be chosen again in subsequent rounds.

## Boosting

Equal weights are assigned to each training tuple (1/d for round 1).
After a classifier Mi is learned, the weights are adjusted to allow the subsequent classifier Mi+1 to pay more attention to the tuples that were misclassified by Mi.
The final boosted classifier M* combines the votes of the individual classifiers, where the weight of each classifier's vote is a function of its accuracy.

Input:
Training set D containing d tuples;
k rounds;
A classification learning scheme.

Output:
A composite model.

The data set D contains d class-labeled tuples (X1, y1), (X2, y2), (X3, y3), ..., (Xd, yd).
Initially, assign an equal weight of 1/d to each tuple.
To generate k base classifiers, we need k rounds or iterations.
In round i, tuples from D are sampled with replacement to form Di (of size d).
Each tuple's chance of being selected depends on its weight.

The base classifier Mi is derived from the training tuples of Di.
The error of Mi is tested using Di.
The weights of the training tuples are adjusted depending on how they were classified:
Correctly classified: decrease weight;
Incorrectly classified: increase weight.

A tuple's weight indicates how hard it is to classify (directly proportional).

Some classifiers may be better at classifying some hard tuples than others.
We finally obtain a series of classifiers that complement each other!

Error rate of model Mi:

$$error(M_i) = \sum_{j=1}^{d} w_j \cdot err(X_j)$$

where err(Xj) is the misclassification indicator for Xj (1 if misclassified, 0 otherwise).
If the classifier's error exceeds 0.5, we abandon it and try again with a new Di and a new Mi derived from it.

error(Mi) affects how the weights of the training tuples are updated:
If a tuple is correctly classified in round i, its weight is multiplied by

$$\frac{error(M_i)}{1 - error(M_i)}$$

Adjust the weights of all correctly classified tuples in this way.
Then the weights of all tuples (including the misclassified ones) are normalized:

$$\text{normalization factor} = \frac{\text{sum of old weights}}{\text{sum of new weights}}$$

The weight of classifier Mi's vote is

$$\log \frac{1 - error(M_i)}{error(M_i)}$$

The lower a classifier's error rate, the more accurate it is, and therefore the higher its voting weight should be.

For each class c, sum the weights of every classifier that assigned class c to X (an unseen tuple).
The class with the highest sum is the WINNER!
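A hedged sketch of the boosting procedure just described (resampling-based AdaBoost); the data, helper names, and base learner are assumptions for illustration only.

```python
# Boosting sketch following the weight-update and voting rules described above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)   # y in {0, 1}
d, k = len(X), 10
weights = np.full(d, 1.0 / d)                # equal weights 1/d in round 1
models, vote_weights = [], []
rng = np.random.default_rng(0)

for _ in range(k):
    while True:
        idx = rng.choice(d, size=d, replace=True, p=weights)      # sample Di according to weights
        m = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X[idx], y[idx])
        err_flags = (m.predict(X) != y).astype(float)             # err(Xj): 1 if misclassified
        error = np.sum(weights * err_flags)                       # error(Mi) = sum_j w_j * err(Xj)
        if 0 < error < 0.5:                                       # abandon Mi if error exceeds 0.5
            break
    beta = error / (1.0 - error)
    weights[err_flags == 0] *= beta          # shrink weights of correctly classified tuples
    weights /= weights.sum()                 # normalize all weights
    models.append(m)
    vote_weights.append(np.log(1.0 / beta))  # classifier's vote weight: log((1 - err) / err)

def predict(x_row):
    # for each class, sum the vote weights of the classifiers that predict it
    scores = {0: 0.0, 1: 0.0}
    for m, v in zip(models, vote_weights):
        scores[int(m.predict(x_row.reshape(1, -1))[0])] += v
    return max(scores, key=scores.get)

preds = np.array([predict(x) for x in X])
print("training accuracy of the boosted ensemble:", (preds == y).mean())
```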

## Metric Evaluation Summary

Use test sets and the hold-out method for large data;
Use the cross-validation method for medium-sized data;
Use the leave-one-out and bootstrap methods for small data;
Don't use test data for parameter tuning - use separate validation data.

## Summary

In this seminar report we have considered:

Metrics for Classifier Evaluation
Methods for Classifier Evaluation
Comparing Data Mining Schemes
Costs in Data Mining
Ensemble Methods

Thank You