
Model Accuracy Measures

Master in Bioinformatics UPF


2017-2018

Eduardo Eyras
Computational Genomics
Pompeu Fabra University - ICREA
Barcelona, Spain
Variables: what we can measure (attributes)
Hypotheses: what we want to predict (class values/labels)
Examples: training set (labeled data)

[Diagram: the training examples are used to train a model; the model is then used to predict on new cases.]


For a new case, the trained model can be used in two ways:

Prediction: does the example “belong” to this model?

Classification: what is the most probable label?
Testing the accuracy of a model

Is my method good enough? (for the specific problem)

How does my method compare to other methods?


Testing the accuracy of a model

We need a systematic way to evaluate and compare multiple methods

Methods are heterogeneous in their purposes, e.g.:

1)  Ability to classify instances accurately

2)  Predicting/scoring the class labels

3)  Methods may predict numerical or nominal values (score, class label, yes/no, posterior probability, etc.)

Thus we need a methodology that is applicable to all of them.


Training and Testing

Accuracy: the expected performance of the model on future (new) data.

It is wrong to estimate the accuracy on the same dataset used to build (train) the model. This estimation would be overly optimistic:

Overfitting → the model won’t necessarily adapt well to new, different instances.


Training and Testing

Separate the known cases into a training set and a test set.

[Diagram: the labeled cases are split into cases for model training (training step) and cases for testing (evaluation step).]

On the cases for testing we predict and compare the predictions with the known labels.

How to do the splitting?

A common splitting choice is 2/3 for training and 1/3 for testing

This approach is suitable when the entire dataset is large


Training and Testing

How to select the data for training and testing:

1)  Stratification: The size of each of the “prediction classes” should be similar
in each subset, training and testing (balanced subsets)

2)  Homogeneity: Data sets should have similar properties to have a reliable
test. E.g. GC-content, peptide lengths, species represented.

These conditions ensure that the different properties and prediction classes are well represented in both subsets (e.g. would you test a model of human transmembrane domains with yeast proteins? Think also of GC content).

Provided that the sets are balanced and homogeneous, the accuracy on the test set will be a good estimate of future performance.
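As an illustration (not part of the original slides), here is a minimal sketch of a stratified 2/3 – 1/3 split. It assumes scikit-learn and NumPy are available; the arrays X and y are made-up toy data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))        # toy attribute matrix (made up)
y = rng.integers(0, 2, size=300)     # toy binary class labels (made up)

# stratify=y keeps the class proportions similar in the training and test subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

print(np.bincount(y_train), np.bincount(y_test))  # similar class proportions
```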
Training and Testing
N-fold cross validation

[Diagram: the data set is split into N parts; one part (1/N of the data) is held out as the test set and the remaining (N-1)/N parts form the training set. A predictive model is built on the training set and evaluated on the test set, giving Accuracy1.]

…where “accuracy” is used generically: any measure of prediction performance.


Training and Testing
N-fold cross validation

[Diagram: the procedure is repeated with the next part held out as the test set, giving Accuracy2.]


Training and Testing
N-fold cross validation

[Diagram: repeating the procedure for every fold gives Accuracy1, Accuracy2, Accuracy3, …, AccuracyN.]

Average accuracy: the average accuracy over the folds reflects the performance of the model on the entire dataset.

Important: subsets must be representative of the original data (stratification and homogeneity).

The standard is to do 10-fold cross validation.
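A minimal sketch of 10-fold cross validation, assuming scikit-learn and NumPy; the data and the choice of classifier (logistic regression) are placeholders for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))        # toy attributes
y = rng.integers(0, 2, size=300)     # toy labels

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_accuracies = []
for train_idx, test_idx in skf.split(X, y):          # stratified folds
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    fold_accuracies.append(accuracy_score(y[test_idx], pred))

print("average accuracy:", np.mean(fold_accuracies))
```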


Training and Testing
Leave-one out

It is like n-fold cross validation, but where n is the size of the set
(number of instances), that is: “train in all but 1, test on this one”

Advantages:

1) The greatest possible amount of data is used for training (n-1 instances)

2) It is deterministic: no random sampling of subsets is involved.

Disadvantages:

1)  Computationally more expensive

2)  It cannot be stratified

E.g. imagine you have the same number of examples for 2 classes. A classifier that always predicts the majority class is expected to have an error rate of 50%, but under leave-one-out the majority class of the training set is always the opposite of the held-out instance’s class, which will produce a 100% error rate.
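The pitfall above can be reproduced with a small sketch (plain Python, toy labels): with a perfectly balanced two-class set, a classifier that always predicts the majority class of the training data is wrong on every held-out instance.

```python
labels = ["A"] * 50 + ["B"] * 50                       # balanced toy labels

errors = 0
for i, true_label in enumerate(labels):
    training = labels[:i] + labels[i + 1:]             # leave one instance out
    majority = max(set(training), key=training.count)  # majority class of the rest
    if majority != true_label:                         # always the opposite class
        errors += 1

print("leave-one-out error rate:", errors / len(labels))   # 1.0, i.e. 100% error
```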
Accuracy measures
Accuracy measure
Example: The model of transmembrane helices

We have two models:

(1) the loop model Mloop given by the observed frequencies of AA in loops p
(2) the helix model Mhelix given by the observed frequencies of AA in helices q

Given a peptide s = x1…xN we can predict whether it is part of a helix or a loop using the log-likelihood test (assuming uniform priors and positional independence):

S = log[ L(s | M_helix) / L(s | M_loop) ] = log[ (q_x1 · … · q_xN) / (p_x1 · … · p_xN) ]

As a default, we can use the classification rule:

•  if S > 0 then s is part of a helix
•  if S ≤ 0 then s is a loop
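A minimal sketch of this score in Python; the amino-acid frequencies q (helix) and p (loop) below are invented values just to make the example run.

```python
import math

q = {"L": 0.15, "A": 0.10, "V": 0.12, "K": 0.02}   # helix model frequencies (toy values)
p = {"L": 0.08, "A": 0.09, "V": 0.06, "K": 0.07}   # loop model frequencies (toy values)

def score(peptide):
    """S = log( prod_i q[x_i] / prod_i p[x_i] ), computed as a sum of logs."""
    return sum(math.log(q[x] / p[x]) for x in peptide)

s = "LLAVK"
S = score(s)
print(S, "-> helix" if S > 0 else "-> loop")
```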
Accuracy measure
Example: The model of transmembrane helices
[Diagram: the helix and loop models are estimated from the training set and used to compute S = log[ L(s | M_helix) / L(s | M_loop) ] = log[ (q_x1 · … · q_xN) / (p_x1 · … · p_xN) ] for each test peptide.]

A test set: a set of “labelled” (annotated) proteins (helix and loop) that we do not use for training.
Accuracy measure

The test set is divided into Real and False cases.

Our model divides the test set according to our predictions of Real and False:

[Diagram: the test set split into Real and False cases; the red area contains the predictions (helix) made by our model.]
Accuracy measure

TP (True Positives): elements predicted as real that are real
TN (True Negatives): elements predicted as false that are false
FP (False Positives): elements predicted as real that are false
FN (False Negatives): elements predicted as false that are real

[Diagram: the Real/False split of the test set overlaid with the predictions, showing the TP, FP, TN and FN regions.]
Accuracy measure

True Positive Rate (Sensitivity): proportion of true elements that is correctly predicted (a.k.a. hit rate, recall)

Sn = TPR = TP / (TP + FN)

False Positive Rate (FPR): proportion of negative cases that are mislabelled (a.k.a. fall-out)

FPR = FP / (FP + TN)

Specificity: proportion of the negatives that are correctly predicted

Sp = 1 − FPR = TN / (FP + TN)

Sn and Sp take values between 0 and 1.

A perfect classification would have Sn = 1 and Sp = 1.
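These three quantities are direct ratios of the four counts; a small sketch with arbitrary example counts:

```python
def sensitivity(TP, FN):            # true positive rate, Sn
    return TP / (TP + FN)

def false_positive_rate(FP, TN):    # FPR, fall-out
    return FP / (FP + TN)

def specificity(FP, TN):            # Sp = 1 - FPR
    return TN / (FP + TN)

# arbitrary example counts
print(sensitivity(TP=30, FN=10),          # 0.75
      false_positive_rate(FP=5, TN=55),   # ~0.083
      specificity(FP=5, TN=55))           # ~0.917
```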
Accuracy measure

Positive Predictive Value (PPV): sometimes called Precision; it gives the fraction of our predictions that are correct

PPV = TP / (TP + FP)

False Discovery Rate (FDR): the fraction of our predictions that are wrong

FDR = FP / (FP + TP) = 1 − PPV

PPV → 1 means that most of our predictions are correct

FDR → 0 means that very few of our predictions are wrong


Accuracy measure

The issue of True Negatives

Sometimes we cannot find a True Negative set (e.g. think of genomic features like genes, regulatory regions, etc.: it is very hard to find real negative cases for some biological features).

[Diagram: only the Real cases and the predictions are available, so only TP, FP and FN can be counted.]

We can still use the TPR, PPV and FDR:

TPR = TP / (TP + FN)     PPV = TP / (FP + TP)     FDR = FP / (FP + TP)
Accuracy measure

Overall success rate: the number of correct classifications divided by the total number of classifications (sometimes called “accuracy”):

Overall Success Rate = (TP + TN) / (TP + TN + FN + FP)

A value of 1 for the success rate means that the model identifies all the positive and negative cases correctly.

The error rate: 1 minus the overall success rate:

Error Rate = 1 − (TP + TN) / (TP + TN + FN + FP)
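A small sketch computing PPV, FDR and the overall success and error rates from the same four counts (the counts are arbitrary toy values):

```python
def ppv(TP, FP):                          # precision
    return TP / (TP + FP)

def fdr(TP, FP):                          # equals 1 - PPV
    return FP / (FP + TP)

def success_rate(TP, TN, FP, FN):         # sometimes called "accuracy"
    return (TP + TN) / (TP + TN + FN + FP)

TP, FP, TN, FN = 30, 5, 55, 10            # arbitrary toy counts
print(ppv(TP, FP), fdr(TP, FP),
      success_rate(TP, TN, FP, FN),       # overall success rate
      1 - success_rate(TP, TN, FP, FN))   # error rate
```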


Accuracy measure

Correlation coefficient (a.k.a. Matthews Correlation Coefficient, MCC)

CC = (TP·TN − FP·FN) / sqrt( (TP + FN)(TN + FP)(TP + FP)(TN + FN) )

This measure scores correct predictions positively and incorrect ones negatively, and takes values between -1 and 1.

The more correct the method, the closer CC is to 1.

A very bad method will have a CC closer to -1.
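A sketch of the Matthews correlation coefficient; note the square root in the denominator (toy counts again):

```python
import math

def mcc(TP, TN, FP, FN):
    numerator = TP * TN - FP * FN
    denominator = math.sqrt((TP + FN) * (TN + FP) * (TP + FP) * (TN + FN))
    return numerator / denominator if denominator else 0.0

print(mcc(TP=30, TN=55, FP=5, FN=10))   # close to 1 for a good classifier
```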


Accuracy measure

In practice, the costs of different kinds of error are rarely the same (think of diagnosis, or of promotional mailing: sending junk mail to a household that does not respond costs far less than not sending it to one that would have responded).

In the two-class case (classes yes and no), a single prediction has four possible outcomes. The true positives (TP) and true negatives (TN) are correct classifications. A false positive (FP) occurs when the outcome is incorrectly predicted as yes (positive) when it is actually no (negative). A false negative (FN) occurs when the outcome is incorrectly predicted as negative when it is actually positive.

This can also be represented by a confusion matrix for a 2-class prediction:

Table 5.3 Different outcomes of a two-class prediction.

                            Predicted class
                            yes               no
  Actual class    yes       true positive     false negative
                  no        false positive    true negative
Accuracy measure

For multiclass predictions:

Table 5.4 Different outcomes of a three-class prediction: (a) actual and (b) expected.

  (a) Observed values
                        Predicted class
                        a      b      c      Total
  Actual class    a     88     10     2      100
                  b     14     40     6      60
                  c     18     10     12     40
  Total                 120    60     20

  (b) Expected values
                        Predicted class
                        a      b      c      Total
  Actual class    a     60     30     10     100
                  b     36     18     6      60
                  c     24     12     4      40
  Total                 120    60     20

Good results correspond to large numbers on the diagonal and small numbers off the diagonal.

In the example we have 200 instances (100 + 60 + 40) and 140 of them are predicted correctly, thus the success rate is 70%.

Question: is this a good measure? How many agreements do we expect by chance?
Accuracy measure

For multiclass predictions (Table 5.4 above):

We build the matrix of expected values by using the same totals as before and sharing out the total of each class.

Totals in each actual (Real) class: a = 100, b = 60, c = 40


We split each of them into the three groups using the proportions of the predicted classes: a = 120, b = 60, c = 20 → a = 60%, b = 30%, c = 10%
Accuracy measure

In Table 5.4, the sum of the diagonal of the observed values is 140 and the sum of the diagonal of the expected values is 82.

To estimate the relative agreement between the observed and expected values we can use the kappa statistic:

κ = (P(A) − P(E)) / (1 − P(E)) = (n(A) − n(E)) / (N − n(E)) = (140 − 82) / (200 − 82) = 0.49

where P(A) is the probability of agreement and P(E) is the probability of agreement by chance. The maximum possible value is κ = 1, and for a random predictor κ = 0.
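The kappa computation for Table 5.4 can be reproduced with a short sketch: the expected matrix shares each actual-class total across the predicted classes in proportion to the predicted-class totals, and kappa compares observed vs. expected agreement.

```python
observed = [[88, 10, 2],     # Table 5.4(a): rows = actual classes a, b, c
            [14, 40, 6],     #               columns = predicted classes a, b, c
            [18, 10, 12]]

N = sum(sum(row) for row in observed)               # 200 instances
row_totals = [sum(row) for row in observed]         # 100, 60, 40
col_totals = [sum(col) for col in zip(*observed)]   # 120, 60, 20

# expected counts under chance agreement, keeping the same marginal totals
expected = [[rt * ct / N for ct in col_totals] for rt in row_totals]

n_A = sum(observed[i][i] for i in range(3))         # observed agreement: 140
n_E = sum(expected[i][i] for i in range(3))         # expected agreement: 82
kappa = (n_A - n_E) / (N - n_E)
print(round(kappa, 2))                              # 0.49
```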

Accuracy measure

What is a good accuracy?

Every measure shows a different perspective on the performance of the model. In general we will use two or more complementary measures to evaluate a model.

E.g. a method that finds almost all elements will have an Sn close to 1, but
this can be achieved with a method with very low Sp

E.g. a method that has Sp close to 1, may have very low Sn

In general, one would like to have a method that balances Sn and Sp (or
equivalent measures)
Accuracy measure

What is a good accuracy?

Which accuracy measure we want to maximize often depends on the question:

Do you want to find all the true cases? (You want higher sensitivity)

Or do you want to report only correct cases? (You want higher specificity)

Question:

Does “predicting novel genes” require high Sp, or perhaps high Sn?


Choosing a prediction threshold
Accuracy measure

Although we have one single model, in fact we have a family of predictions, defined by one or more parameters, e.g. the threshold λ in the log-likelihood test:

S = log[ L(s | M_helix) / L(s | M_loop) ] > λ

[Diagram: the Real/False split of the test set with several possible thresholds λ.]
Accuracy measure

[Diagram: for each choice of the threshold λ, the split of the test set into TP, FP, TN and FN changes.]
Receiver Operating Characteristic (ROC) curve

A ROC curve is a graphical plot of TPR (Sn) vs. FPR, built for the same prediction model by varying one or more of the model parameters.

It is quite common for binary classifiers.

For instance, it can be plotted for several values of the discrimination threshold, but other parameters of the model can be used.

[Diagram: each threshold λ on the Real/False split gives one (FPR, TPR) point.]
Receiver Operating Characteristic (ROC) curve

Example for threshold B

[Figure: distributions of the scores in the negative cases and in the positive cases; the area above the threshold criterion contains our positive predictions (the part of the negative distribution below the threshold are the true negatives, the part of the positive distribution below it are the false negatives). A high threshold (A) gives low TPR and low FPR; a low threshold (C) gives high TPR and high FPR; B is intermediate.]

TPR = TP / (TP + FN)     FPR = FP / (FP + TN)
Receiver Operating Characteristic (ROC) curve

[Figure: plotting TPR vs. FPR (both from 0 to 1) for all values of the threshold criterion traces the ROC curve of the model classification; the diagonal corresponds to random classification.]

TPR = TP / (TP + FN)     FPR = FP / (FP + TN)
Receiver Operating Characteristic (ROC) curve

Each dot on the line corresponds to a choice of parameters (usually a single parameter).

The information that is not visible in this graph is the threshold used at each point of the curve.

The x = y line corresponds to random classification, i.e. choosing positive or negative with 50% chance at every threshold.

TPR = TP / (TP + FN)     FPR = FP / (FP + TN)
Receiver Operating Characteristic (ROC) curve

Example: consider the ranking of scores S = log[ L(s | M_helix) / L(s | M_loop) ]:

  S:  10, 7, 4, 2, 1, -0.4, -2, -5, -9
Receiver Operating Characteristic (ROC) curve

The test set is labeled:

  S       Known label
  10      R
  7       R
  4       R
  2       F
  1       R
  -0.4    R
  -2      F
  -5      F
  -9      F
Receiver Operating Characteristic (ROC) curve

Let’s choose a cut-off (a λ), e.g. λ = 3: above this value we predict “R”.
Receiver Operating Characteristic (ROC) curve

Calculate TP, FP, TN, FN for this λ, and then repeat for other λ’s:

  λ      TP    FP    TN    FN    TPR    FPR
  3      3     0     4     2     3/5    0
  0      4     1     3     1     4/5    1/4
  -7     5     3     1     0     1      3/4

TPR = TP / (TP + FN)     FPR = FP / (FP + TN)

Note: I’m using arbitrary intermediate values for the cut-off.
Receiver Operating Characteristic (ROC) curve

Exercise: complete the table for the remaining cut-offs.

You should see that for smaller cut-offs the TPR (sensitivity) increases, but the FPR increases as well (i.e. the specificity drops), whereas for high cut-offs the TPR decreases but the FPR is low (the specificity is high).

The variability of the accuracy as a function of the parameters and/or cut-offs is generally described with a ROC curve.
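The table above can be reproduced by sweeping the cut-off over the ranked scores of the toy test set; a minimal sketch:

```python
from fractions import Fraction

scores = [10, 7, 4, 2, 1, -0.4, -2, -5, -9]
labels = ["R", "R", "R", "F", "R", "R", "F", "F", "F"]   # known labels

def tpr_fpr(cutoff):
    TP = sum(1 for s, l in zip(scores, labels) if s > cutoff and l == "R")
    FP = sum(1 for s, l in zip(scores, labels) if s > cutoff and l == "F")
    FN = sum(1 for s, l in zip(scores, labels) if s <= cutoff and l == "R")
    TN = sum(1 for s, l in zip(scores, labels) if s <= cutoff and l == "F")
    return Fraction(TP, TP + FN), Fraction(FP, FP + TN)

for cutoff in (3, 0, -7):
    print(cutoff, tpr_fpr(cutoff))   # (3/5, 0), (4/5, 1/4), (1, 3/4)
```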
Receiver Operating Characteristic (ROC) curve

Comparing multiple methods

[Figure: ROC curves for Method 1, Method 2 and Method 3, together with the diagonal of random classification.]

Each line corresponds to a different method.

Better models are further from the x = y line (random classification).

(see e.g. Corvelo et al. PLOS Comp. Biology 2010)


Receiver Operating Characteristic (ROC) curve

Example:

If you wish to discover at least 60% of the true elements (TPR = 0.6), the graph says that Method 1 has a lower FPR than Methods 2 and 3. We may want to choose Method 1.

We would then decide to make predictions with Method 1 and choose parameters that produce FPR = 0.2 at TPR = 0.6.

But is this the best choice?


Receiver Operating Characteristic (ROC) curve

Optimal configuration

Note that the more distant the points are from the diagonal (the line TPR = FPR), the better the classification.

An optimal choice for a point on the curve is the one that is at a maximum distance from the TPR = FPR line.

There are standard methods to calculate this point.

[Figure: ROC curves for Methods 1, 2 and 3.]

But again: this is optimal for the balance of TPR and FPR, but it might not be the most appropriate choice for the problem at hand, e.g. predicting novel genes.
Receiver Operating Characteristic (ROC) curve

[Figure: ROC curves for Methods 1, 2 and 3, and a barplot of the AUC of each model.]

A summary measure for selecting the best model is the Area Under the Curve (AUC). The best model in general will have the highest AUC.

The maximum value is AUC = 1. The closer the AUC is to one, the better the model.

There are also standard methods to estimate the AUC from the sampled points of the ROC curve.
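One such standard method is the trapezoidal rule on the sampled (FPR, TPR) points; a minimal sketch with made-up points:

```python
def auc(points):
    """Trapezoidal estimate of the area under a ROC curve.
    points: list of (FPR, TPR) pairs; (0,0) and (1,1) are added if missing."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

roc_points = [(0.0, 0.6), (0.25, 0.8), (0.75, 1.0)]   # made-up sampled points
print(auc(roc_points))                                # 0.875
```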
Receiver Operating Characteristic (ROC) curve

[Figure: ROC curves and AUC barplot for Methods 1, 2 and 3, with error bars.]

Question:

Why do you think there are error bars in the AUC barplot and in the ROC curves?
Precision recall curves

ROC curves are useful to compare predictive models.

However, they still do not provide a complete picture of the accuracy of a model.

If we predict many TPs at the cost of producing many false predictions (FP is large), the FPR might not look so bad if our testing set contains many Negatives, such that TN >> FP:

FPR = FP / (FP + TN) → 0 as TN becomes large

So we may have a situation where the TPR is high and the FPR is low, but where for the actual counts FP >> TP.

That is, TPR is not affected by FP, and FPR can be low even if FP is high (as long as TN >> FP).
Precision recall curves

For instance, consider methods to classify documents. Suppose that a first method selects 100 documents, of which 40 are correct, and that our test set is composed of 100 True instances and 10000 Negative instances.

TPR1 = TP / (TP + FN) = 40 / 100 = 0.4        FPR1 = FP / (FP + TN) = 60 / 10000 = 0.006

Now consider a second method that selects 680 documents with 80 correct, and imagine that our test set is composed now of 100 True instances and 100000 Negative instances.

TPR2 = TP / (TP + FN) = 80 / 100 = 0.8        FPR2 = FP / (FP + TN) = 600 / 100000 = 0.006

Which method is better?


Precision recall curves

The second one may seem better, because it retrieves more relevant documents, but the proportion of its predictions that are correct (precision or PPV) is smaller:

PPV = TP / (TP + FP)        Precision1 = 40 / 100 = 0.40        Precision2 = 80 / 680 ≈ 0.12

(Note: you can also use FDR = 1 − PPV)

Thus, one must also take into account the “relative cost” of the predictions, i.e. the FN and FP values that must be assumed to achieve a high TPR.

One can make TN arbitrarily large to make FPR → 0.

So other accuracy measures are needed to have a more complete picture.
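The document-classification example can be checked directly; the TPR and FPR are similar for the two methods, but the precision (PPV) reveals the difference:

```python
def rates(TP, FP, FN, TN):
    return {"TPR": TP / (TP + FN),       # recall
            "FPR": FP / (FP + TN),
            "PPV": TP / (TP + FP)}       # precision

# Method 1: 100 predictions, 40 correct; 100 True and 10000 Negative instances
print(rates(TP=40, FP=60, FN=60, TN=10000 - 60))     # TPR 0.4, FPR 0.006, PPV 0.40
# Method 2: 680 predictions, 80 correct; 100 True and 100000 Negative instances
print(rates(TP=80, FP=600, FN=20, TN=100000 - 600))  # TPR 0.8, FPR 0.006, PPV ~0.12
```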
Precision recall curves

Precision = proportion of the predictions that are correct:

precision = PPV = TP / (TP + FP)

Recall = proportion of the true instances that are correctly recovered:

recall = TPR = TP / (TP + FN)

(see e.g. Plass et al. RNA 2012)



Precision recall curves

[Figure: ROC and precision-recall curves for two models.]

Model 1 has a greater AUC, but low precision (a high cost of false positives).

Model 2 achieves a lower AUC than Model 1, but still quite good, and its precision is highly improved.
References

Data Mining: Practical Machine Learning Tools and Techniques.
Ian H. Witten, Eibe Frank, Mark A. Hall.
Morgan Kaufmann, ISBN 978-0-12-374856-0
http://www.cs.waikato.ac.nz/ml/weka/book.html

Methods for Computational Gene Prediction.
W.H. Majoros. Cambridge University Press, 2007.
