
Model Accuracy Measures

Master in Bioinformatics UPF


2017-2018

Eduardo Eyras
Computational Genomics
Pompeu Fabra University - ICREA
Barcelona, Spain
Variables: what we can measure (attributes)
Hypotheses: what we want to predict (class values/labels)
Examples: training set (labeled data)

[Diagram: the training examples are used to train a model; the model is then used to predict on new cases.]


For a new case, the trained model can be used in two ways:

Prediction: does the example “belong” to this model?

Classification: what is the most probable label?
Testing the accuracy of a model

Is my method good enough? (for the specific problem)

How does my method compare to other methods?


Testing the accuracy of a model

We need a systematic way to evaluate and compare multiple methods

Methods are heterogeneous in their purposes, e.g.:

1)  Ability to classify instances accurately

2)  Predicting/scoring the class labels

3)  Methods may predict numerical or nominal values (score, class label, yes/no, posterior probability, etc.)

Thus we need a methodology that is applicable to all of them.


Training and Testing

Accuracy: the expected performance of the model on future (new) data.

It is wrong to estimate the accuracy on the same dataset used to build (train) the model. This estimation would be overly optimistic:

Overfitting → the model won’t necessarily adapt well to new, different instances.


Training and Testing

Separate the known cases into a training set and a test set.

[Diagram: the labeled cases are split into cases for model training (training step) and cases for testing (evaluation step).]

On the cases for testing we predict and compare the predictions with the known labels.

How to do the splitting?

A common splitting choice is 2/3 for training and 1/3 for testing

This approach is suitable when the entire dataset is large


Training and Testing

How to select the data for training and testing:

1)  Stratification: The size of each of the “prediction classes” should be similar
in each subset, training and testing (balanced subsets)

2)  Homogeneity: Data sets should have similar properties to have a reliable
test. E.g. GC-content, peptide lengths, species represented.

These conditions ensure that the different properties and prediction classes are well represented in both subsets (e.g. would you test a model of human transmembrane domains with yeast proteins? Think also of GC content).

Provided that the sets are balanced and homogeneous, the accuracy on the test set will be a good estimate of future performance.
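As an illustration (not part of the original slides), here is a minimal sketch of a stratified 2/3 – 1/3 split. It assumes scikit-learn and NumPy are available; the arrays X and y are made-up toy data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))        # toy attribute matrix (made up)
y = rng.integers(0, 2, size=300)     # toy binary class labels (made up)

# stratify=y keeps the class proportions similar in the training and test subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

print(np.bincount(y_train), np.bincount(y_test))  # similar class proportions
```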
Training and Testing
N-fold cross validation

[Diagram: the data set is split into N parts; one part (1/N of the data) is held out as the test set and the remaining (N-1)/N parts form the training set. A predictive model is built on the training set and evaluated on the test set, giving Accuracy1.]

…where “accuracy” is used generically: any measure of prediction performance.


Training and Testing
N-fold cross validation

[Diagram: the procedure is repeated with the next part held out as the test set, giving Accuracy2.]


Training and Testing
N-fold cross validation

[Diagram: repeating the procedure for every fold gives Accuracy1, Accuracy2, Accuracy3, …, AccuracyN.]

Average accuracy: the average accuracy over the folds reflects the performance of the model on the entire dataset.

Important: subsets must be representative of the original data (stratification and homogeneity).

The standard is to do 10-fold cross validation.
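A minimal sketch of 10-fold cross validation, assuming scikit-learn and NumPy; the data and the choice of classifier (logistic regression) are placeholders for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))        # toy attributes
y = rng.integers(0, 2, size=300)     # toy labels

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_accuracies = []
for train_idx, test_idx in skf.split(X, y):          # stratified folds
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    fold_accuracies.append(accuracy_score(y[test_idx], pred))

print("average accuracy:", np.mean(fold_accuracies))
```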


Training and Testing
Leave-one out

It is like n-fold cross validation, but where n is the size of the set
(number of instances), that is: “train in all but 1, test on this one”

Advantages:

1) The greatest possible amount of data is used for training (n-1 instances)

2) It is deterministic: no random sampling of subsets is involved.

Disadvantages:

1)  Computationally more expensive

2)  It cannot be stratified

E.g. imagine you have the same number of examples for 2 classes. A classifier that always predicts the majority class is expected to have an error rate of 50%, but under leave-one-out the majority class of the training set is always the opposite of the held-out instance’s class, which will produce a 100% error rate.
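The pitfall above can be reproduced with a small sketch (plain Python, toy labels): with a perfectly balanced two-class set, a classifier that always predicts the majority class of the training data is wrong on every held-out instance.

```python
labels = ["A"] * 50 + ["B"] * 50                       # balanced toy labels

errors = 0
for i, true_label in enumerate(labels):
    training = labels[:i] + labels[i + 1:]             # leave one instance out
    majority = max(set(training), key=training.count)  # majority class of the rest
    if majority != true_label:                         # always the opposite class
        errors += 1

print("leave-one-out error rate:", errors / len(labels))   # 1.0, i.e. 100% error
```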
Accuracy measures
Accuracy measure
Example: The model of transmembrane helices

We have two models:

(1) the loop model Mloop given by the observed frequencies of AA in loops p
(2) the helix model Mhelix given by the observed frequencies of AA in helices q

Given a peptide s = x1…xN we can predict whether it is part of a helix or a loop using the log-likelihood test (assuming uniform priors and positional independence):

S = log[ L(s | M_helix) / L(s | M_loop) ] = log[ (q_x1 · … · q_xN) / (p_x1 · … · p_xN) ]

As a default, we can use the classification rule:

•  if S > 0 then s is part of a helix
•  if S ≤ 0 then s is a loop
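A minimal sketch of this score in Python; the amino-acid frequencies q (helix) and p (loop) below are invented values just to make the example run.

```python
import math

q = {"L": 0.15, "A": 0.10, "V": 0.12, "K": 0.02}   # helix model frequencies (toy values)
p = {"L": 0.08, "A": 0.09, "V": 0.06, "K": 0.07}   # loop model frequencies (toy values)

def score(peptide):
    """S = log( prod_i q[x_i] / prod_i p[x_i] ), computed as a sum of logs."""
    return sum(math.log(q[x] / p[x]) for x in peptide)

s = "LLAVK"
S = score(s)
print(S, "-> helix" if S > 0 else "-> loop")
```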
Accuracy measure
Example: The model of transmembrane helices
[Diagram: the helix and loop models are estimated from the training set and used to compute S = log[ L(s | M_helix) / L(s | M_loop) ] = log[ (q_x1 · … · q_xN) / (p_x1 · … · p_xN) ] for each test peptide.]

A test set: a set of “labelled” (annotated) proteins (helix and loop) that we do not use for training.
Accuracy measure

The test set is divided into Real and False cases.

Our model divides the test set according to our predictions of Real and False:

[Diagram: the test set split into Real and False cases; the red area contains the predictions (helix) made by our model.]
Accuracy measure

TP (True Positives): elements predicted as real that are real
TN (True Negatives): elements predicted as false that are false
FP (False Positives): elements predicted as real that are false
FN (False Negatives): elements predicted as false that are real

[Diagram: the Real/False split of the test set overlaid with the predictions, showing the TP, FP, TN and FN regions.]
Accuracy measure

True Positive Rate (Sensitivity): proportion of true elements that is correctly predicted (a.k.a. hit rate, recall)

Sn = TPR = TP / (TP + FN)

False Positive Rate (FPR): proportion of negative cases that are mislabelled (a.k.a. fall-out)

FPR = FP / (FP + TN)

Specificity: proportion of the negatives that are correctly predicted

Sp = 1 − FPR = TN / (FP + TN)

Sn and Sp take values between 0 and 1.

A perfect classification would have Sn = 1 and Sp = 1.
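These three quantities are direct ratios of the four counts; a small sketch with arbitrary example counts:

```python
def sensitivity(TP, FN):            # true positive rate, Sn
    return TP / (TP + FN)

def false_positive_rate(FP, TN):    # FPR, fall-out
    return FP / (FP + TN)

def specificity(FP, TN):            # Sp = 1 - FPR
    return TN / (FP + TN)

# arbitrary example counts
print(sensitivity(TP=30, FN=10),          # 0.75
      false_positive_rate(FP=5, TN=55),   # ~0.083
      specificity(FP=5, TN=55))           # ~0.917
```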
Accuracy measure

Positive Predictive Value (PPV): sometimes called Precision; it gives the fraction of our predictions that are correct

PPV = TP / (TP + FP)

False Discovery Rate (FDR): the fraction of our predictions that are wrong

FDR = FP / (FP + TP) = 1 − PPV

PPV → 1 means that most of our predictions are correct

FDR → 0 means that very few of our predictions are wrong


Accuracy measure

The issue of True Negatives

Sometimes we cannot find a True Negative set (e.g. think of genomic features like genes, regulatory regions, etc.: it is very hard to find real negative cases for some biological features).

[Diagram: only the Real cases and the predictions are available, so only TP, FP and FN can be counted.]

We can still use the TPR, PPV and FDR:

TPR = TP / (TP + FN)     PPV = TP / (FP + TP)     FDR = FP / (FP + TP)
Accuracy measure

Overall success rate: the number of correct classifications divided by the total number of classifications (sometimes called “accuracy”):

Overall Success Rate = (TP + TN) / (TP + TN + FN + FP)

A value of 1 for the success rate means that the model identifies all the positive and negative cases correctly.

The error rate: 1 minus the overall success rate:

Error Rate = 1 − (TP + TN) / (TP + TN + FN + FP)
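A small sketch computing PPV, FDR and the overall success and error rates from the same four counts (the counts are arbitrary toy values):

```python
def ppv(TP, FP):                          # precision
    return TP / (TP + FP)

def fdr(TP, FP):                          # equals 1 - PPV
    return FP / (FP + TP)

def success_rate(TP, TN, FP, FN):         # sometimes called "accuracy"
    return (TP + TN) / (TP + TN + FN + FP)

TP, FP, TN, FN = 30, 5, 55, 10            # arbitrary toy counts
print(ppv(TP, FP), fdr(TP, FP),
      success_rate(TP, TN, FP, FN),       # overall success rate
      1 - success_rate(TP, TN, FP, FN))   # error rate
```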


Accuracy measure

Correlation coefficient (a.k.a. Matthews Correlation Coefficient, MCC)

CC = (TP·TN − FP·FN) / sqrt( (TP + FN)(TN + FP)(TP + FP)(TN + FN) )

This measure scores correct predictions positively and incorrect ones negatively, and takes values between -1 and 1.

The more correct the method, the closer CC is to 1.

A very bad method will have a CC closer to -1.
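A sketch of the Matthews correlation coefficient; note the square root in the denominator (toy counts again):

```python
import math

def mcc(TP, TN, FP, FN):
    numerator = TP * TN - FP * FN
    denominator = math.sqrt((TP + FN) * (TN + FP) * (TP + FP) * (TN + FN))
    return numerator / denominator if denominator else 0.0

print(mcc(TP=30, TN=55, FP=5, FN=10))   # close to 1 for a good classifier
```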


Accuracy measure

In practice, the costs of different kinds of error are rarely the same (think of diagnosis, or of promotional mailing: sending junk mail to a household that does not respond costs far less than not sending it to one that would have responded).

In the two-class case (classes yes and no), a single prediction has four possible outcomes. The true positives (TP) and true negatives (TN) are correct classifications. A false positive (FP) occurs when the outcome is incorrectly predicted as yes (positive) when it is actually no (negative). A false negative (FN) occurs when the outcome is incorrectly predicted as negative when it is actually positive.

This can also be represented by a confusion matrix for a 2-class prediction:

Table 5.3 Different outcomes of a two-class prediction.

                            Predicted class
                            yes               no
  Actual class    yes       true positive     false negative
                  no        false positive    true negative
Accuracy measure

For multiclass predictions:

Table 5.4 Different outcomes of a three-class prediction: (a) actual and (b) expected.

  (a) Observed values
                        Predicted class
                        a      b      c      Total
  Actual class    a     88     10     2      100
                  b     14     40     6      60
                  c     18     10     12     40
  Total                 120    60     20

  (b) Expected values
                        Predicted class
                        a      b      c      Total
  Actual class    a     60     30     10     100
                  b     36     18     6      60
                  c     24     12     4      40
  Total                 120    60     20

Good results correspond to large numbers on the diagonal and small numbers off the diagonal.

In the example we have 200 instances (100 + 60 + 40) and 140 of them are predicted correctly, thus the success rate is 70%.

Question: is this a good measure? How many agreements do we expect by chance?
Accuracy measure

For multiclass predictions (Table 5.4 above):

We build the matrix of expected values by using the same totals as before and sharing out the total of each class.

Totals in each actual (Real) class: a = 100, b = 60, c = 40


We split each of them into the three groups using the proportions of the predicted classes: a = 120, b = 60, c = 20 → a = 60%, b = 30%, c = 10%
Accuracy measure

In Table 5.4, the sum of the diagonal of the observed values is 140 and the sum of the diagonal of the expected values is 82.

To estimate the relative agreement between the observed and expected values we can use the kappa statistic:

κ = (P(A) − P(E)) / (1 − P(E)) = (n(A) − n(E)) / (N − n(E)) = (140 − 82) / (200 − 82) = 0.49

where P(A) is the probability of agreement and P(E) is the probability of agreement by chance. The maximum possible value is κ = 1, and for a random predictor κ = 0.
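The kappa computation for Table 5.4 can be reproduced with a short sketch: the expected matrix shares each actual-class total across the predicted classes in proportion to the predicted-class totals, and kappa compares observed vs. expected agreement.

```python
observed = [[88, 10, 2],     # Table 5.4(a): rows = actual classes a, b, c
            [14, 40, 6],     #               columns = predicted classes a, b, c
            [18, 10, 12]]

N = sum(sum(row) for row in observed)               # 200 instances
row_totals = [sum(row) for row in observed]         # 100, 60, 40
col_totals = [sum(col) for col in zip(*observed)]   # 120, 60, 20

# expected counts under chance agreement, keeping the same marginal totals
expected = [[rt * ct / N for ct in col_totals] for rt in row_totals]

n_A = sum(observed[i][i] for i in range(3))         # observed agreement: 140
n_E = sum(expected[i][i] for i in range(3))         # expected agreement: 82
kappa = (n_A - n_E) / (N - n_E)
print(round(kappa, 2))                              # 0.49
```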

Accuracy measure

What is a good accuracy?

Every measure shows a different perspective on the performance of the model. In general we will use two or more complementary measures to evaluate a model.

E.g. a method that finds almost all elements will have an Sn close to 1, but
this can be achieved with a method with very low Sp

E.g. a method that has Sp close to 1, may have very low Sn

In general, one would like to have a method that balances Sn and Sp (or
equivalent measures)
Accuracy measure

What is a good accuracy?

Which accuracy measure we want to maximize often depends on the question:

Do you want to find all the true cases? (You want higher sensitivity)

Or do you want to report only correct cases? (You want higher specificity)

Question:

Does “predicting novel genes” require high Sp, or perhaps high Sn?


Choosing a prediction threshold
Accuracy measure

Although we have one single model, in fact we have a family of predictions, defined by one or more parameters, e.g. the threshold λ in the log-likelihood test:

S = log[ L(s | M_helix) / L(s | M_loop) ] > λ

[Diagram: the Real/False split of the test set with several possible thresholds λ.]
Accuracy measure

[Diagram: for each choice of the threshold λ, the split of the test set into TP, FP, TN and FN changes.]
Receiver Operating Characteristic (ROC) curve

A ROC curve is a graphical plot of TPR (Sn) vs. FPR, built for the same prediction model by varying one or more of the model parameters.

It is quite common for binary classifiers.

For instance, it can be plotted for several values of the discrimination threshold, but other parameters of the model can be used.

[Diagram: each threshold λ on the Real/False split gives one (FPR, TPR) point.]
Receiver Operating Characteristic (ROC) curve

Example for threshold B

[Figure: distributions of the scores in the negative cases and in the positive cases; the area above the threshold criterion contains our positive predictions (the part of the negative distribution below the threshold are the true negatives, the part of the positive distribution below it are the false negatives). A high threshold (A) gives low TPR and low FPR; a low threshold (C) gives high TPR and high FPR; B is intermediate.]

TPR = TP / (TP + FN)     FPR = FP / (FP + TN)
Receiver Operating Characteristic (ROC) curve

[Figure: plotting TPR vs. FPR (both from 0 to 1) for all values of the threshold criterion traces the ROC curve of the model classification; the diagonal corresponds to random classification.]

TPR = TP / (TP + FN)     FPR = FP / (FP + TN)
Receiver Operating Characteristic (ROC) curve

Each dot on the line corresponds to a choice of parameters (usually a single parameter).

The information that is not visible in this graph is the threshold used at each point of the curve.

The x = y line corresponds to random classification, i.e. choosing positive or negative with 50% chance at every threshold.

TPR = TP / (TP + FN)     FPR = FP / (FP + TN)
Receiver Operating Characteristic (ROC) curve

Example: consider the ranking of scores S = log[ L(s | M_helix) / L(s | M_loop) ]:

  S:  10, 7, 4, 2, 1, -0.4, -2, -5, -9
Receiver Operating Characteristic (ROC) curve

The test set is labeled:

  S       Known label
  10      R
  7       R
  4       R
  2       F
  1       R
  -0.4    R
  -2      F
  -5      F
  -9      F
Receiver Operating Characteristic (ROC) curve

Let’s choose a cut-off (a λ), e.g. λ = 3: above this value we predict “R”.
Receiver Operating Characteristic (ROC) curve

Calculate TP, FP, TN, FN for this λ, and then repeat for other λ’s:

  λ      TP    FP    TN    FN    TPR    FPR
  3      3     0     4     2     3/5    0
  0      4     1     3     1     4/5    1/4
  -7     5     3     1     0     1      3/4

TPR = TP / (TP + FN)     FPR = FP / (FP + TN)

Note: I’m using arbitrary intermediate values for the cut-off.
Receiver Operating Characteristic (ROC) curve

Exercise: complete the table for the remaining cut-offs.

You should see that for smaller cut-offs the TPR (sensitivity) increases, but the FPR increases as well (i.e. the specificity drops), whereas for high cut-offs the TPR decreases but the FPR is low (the specificity is high).

The variability of the accuracy as a function of the parameters and/or cut-offs is generally described with a ROC curve.
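The table above can be reproduced by sweeping the cut-off over the ranked scores of the toy test set; a minimal sketch:

```python
from fractions import Fraction

scores = [10, 7, 4, 2, 1, -0.4, -2, -5, -9]
labels = ["R", "R", "R", "F", "R", "R", "F", "F", "F"]   # known labels

def tpr_fpr(cutoff):
    TP = sum(1 for s, l in zip(scores, labels) if s > cutoff and l == "R")
    FP = sum(1 for s, l in zip(scores, labels) if s > cutoff and l == "F")
    FN = sum(1 for s, l in zip(scores, labels) if s <= cutoff and l == "R")
    TN = sum(1 for s, l in zip(scores, labels) if s <= cutoff and l == "F")
    return Fraction(TP, TP + FN), Fraction(FP, FP + TN)

for cutoff in (3, 0, -7):
    print(cutoff, tpr_fpr(cutoff))   # (3/5, 0), (4/5, 1/4), (1, 3/4)
```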
Receiver Operating Characteristic (ROC) curve

Comparing multiple methods

[Figure: ROC curves for Method 1, Method 2 and Method 3, together with the diagonal of random classification.]

Each line corresponds to a different method.

Better models are further from the x = y line (random classification).

(see e.g. Corvelo et al. PLOS Comp. Biology 2010)


Receiver Operating Characteristic (ROC) curve

Example:

If you wish to discover at least 60% of the true elements (TPR = 0.6), the graph says that Method 1 has a lower FPR than Methods 2 and 3. We may want to choose Method 1.

We would then decide to make predictions with Method 1 and choose parameters that produce FPR = 0.2 at TPR = 0.6.

But is this the best choice?


Receiver Operating Characteristic (ROC) curve

Optimal configuration

Note that the more distant the points are from the diagonal (the line TPR = FPR), the better the classification.

An optimal choice for a point on the curve is the one that is at a maximum distance from the TPR = FPR line.

There are standard methods to calculate this point.

[Figure: ROC curves for Methods 1, 2 and 3.]

But again: this is optimal for the balance of TPR and FPR, but it might not be the most appropriate choice for the problem at hand, e.g. predicting novel genes.
Receiver Operating Characteristic (ROC) curve

[Figure: ROC curves for Methods 1, 2 and 3, and a barplot of the AUC of each model.]

A summary measure for selecting the best model is the Area Under the Curve (AUC). The best model in general will have the highest AUC.

The maximum value is AUC = 1. The closer the AUC is to one, the better the model.

There are also standard methods to estimate the AUC from the sampled points of the ROC curve.
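One such standard method is the trapezoidal rule on the sampled (FPR, TPR) points; a minimal sketch with made-up points:

```python
def auc(points):
    """Trapezoidal estimate of the area under a ROC curve.
    points: list of (FPR, TPR) pairs; (0,0) and (1,1) are added if missing."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

roc_points = [(0.0, 0.6), (0.25, 0.8), (0.75, 1.0)]   # made-up sampled points
print(auc(roc_points))                                # 0.875
```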
Receiver Operating Characteristic (ROC) curve

[Figure: ROC curves and AUC barplot for Methods 1, 2 and 3, with error bars.]

Question:

Why do you think there are error bars in the AUC barplot and in the ROC curves?
Precision recall curves

ROC curves are useful to compare predictive models.

However, they still do not provide a complete picture of the accuracy of a model.

If we predict many TPs at the cost of producing many false predictions (FP is large), the FPR might not look so bad if our testing set contains many Negatives, such that TN >> FP:

FPR = FP / (FP + TN) → 0 as TN becomes large

So we may have a situation where the TPR is high and the FPR is low, but where for the actual counts FP >> TP.

That is, TPR is not affected by FP, and FPR can be low even if FP is high (as long as TN >> FP).
Precision recall curves

For instance, consider methods to classify documents. Suppose that a first method selects 100 documents, of which 40 are correct, and that our test set is composed of 100 True instances and 10000 Negative instances.

TPR1 = TP / (TP + FN) = 40 / 100 = 0.4        FPR1 = FP / (FP + TN) = 60 / 10000 = 0.006

Now consider a second method that selects 680 documents with 80 correct, and imagine that our test set is composed now of 100 True instances and 100000 Negative instances.

TPR2 = TP / (TP + FN) = 80 / 100 = 0.8        FPR2 = FP / (FP + TN) = 600 / 100000 = 0.006

Which method is better?


Precision recall curves

The second one may seem better, because it retrieves more relevant documents, but the proportion of its predictions that are correct (precision or PPV) is smaller:

PPV = TP / (TP + FP)        Precision1 = 40 / 100 = 0.40        Precision2 = 80 / 680 ≈ 0.12

(Note: you can also use FDR = 1 − PPV)

Thus, one must also take into account the “relative cost” of the predictions, i.e. the FN and FP values that must be assumed to achieve a high TPR.

One can make TN arbitrarily large to make FPR → 0.

So other accuracy measures are needed to have a more complete picture.
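The document-classification example can be checked directly; the TPR and FPR are similar for the two methods, but the precision (PPV) reveals the difference:

```python
def rates(TP, FP, FN, TN):
    return {"TPR": TP / (TP + FN),       # recall
            "FPR": FP / (FP + TN),
            "PPV": TP / (TP + FP)}       # precision

# Method 1: 100 predictions, 40 correct; 100 True and 10000 Negative instances
print(rates(TP=40, FP=60, FN=60, TN=10000 - 60))     # TPR 0.4, FPR 0.006, PPV 0.40
# Method 2: 680 predictions, 80 correct; 100 True and 100000 Negative instances
print(rates(TP=80, FP=600, FN=20, TN=100000 - 600))  # TPR 0.8, FPR 0.006, PPV ~0.12
```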
Precision recall curves

Precision = proportion of the predictions that are correct:

precision = PPV = TP / (TP + FP)

Recall = proportion of the true instances that are correctly recovered:

recall = TPR = TP / (TP + FN)

(see e.g. Plass et al. RNA 2012)



Precision recall curves

[Figure: ROC and precision-recall curves for two models.]

Model 1 has a greater AUC, but low precision (a high cost of false positives).

Model 2 achieves a lower AUC than Model 1, but still quite good, and its precision is highly improved.
References

Data Mining: Practical Machine Learning Tools and Techniques.
Ian H. Witten, Eibe Frank, Mark A. Hall.
Morgan Kaufmann, ISBN 978-0-12-374856-0
http://www.cs.waikato.ac.nz/ml/weka/book.html

Methods for Computational Gene Prediction.
W.H. Majoros. Cambridge University Press, 2007.
