Eduardo Eyras
Computational Genomics
Pompeu Fabra University - ICREA
Barcelona, Spain
[Diagram: variables, hypotheses and examples are used to train a predictive model.]
3) Methods may predict numerical or nominal values (score, class label,
yes/no, posterior probability, etc.)
Accuracy
A common splitting choice is 2/3 for training and 1/3 for testing
1) Stratification: The size of each of the “prediction classes” should be similar
in each subset, training and testing (balanced subsets)
2) Homogeneity: the data sets should have similar properties for the test to be
reliable, e.g. GC-content, peptide lengths, species represented.
Provided that the sets are balanced and homogeneous, the accuracy on the
test set will be a good estimate of future performance.
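A minimal sketch of a stratified 2/3 train, 1/3 test split in plain Python (the function name and arguments are illustrative, not from the original slides):

```python
import random

def stratified_split(instances, labels, train_frac=2/3, seed=0):
    """Split the data 2/3 train / 1/3 test, keeping the class
    proportions similar in both subsets (stratification)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(instances, labels):
        by_class.setdefault(y, []).append(x)
    train, test = [], []
    for y, xs in by_class.items():
        rng.shuffle(xs)
        cut = int(round(len(xs) * train_frac))
        train += [(x, y) for x in xs[:cut]]
        test += [(x, y) for x in xs[cut:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test
```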
Training and Testing
N-fold cross validation
[Diagram: the data set is repeatedly split into a training set and a test set; a predictive model is built on each training set and evaluated on the corresponding test set, giving Accuracy1, Accuracy2, ..., which are combined into an average accuracy.]
The average accuracy reflects the performance of the model on the entire
dataset.
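A minimal sketch of n-fold cross validation, assuming a user-supplied `train_and_test(train_set, test_set)` routine that builds a model and returns its accuracy (this callback is a placeholder, not part of the original slides):

```python
def n_fold_cross_validation(data, n, train_and_test):
    """Split `data` into n folds; train on n-1 folds, test on the
    held-out fold, and average the n accuracies."""
    folds = [data[i::n] for i in range(n)]  # n roughly equal folds
    accuracies = []
    for i in range(n):
        test_set = folds[i]
        train_set = [x for j, f in enumerate(folds) if j != i for x in f]
        accuracies.append(train_and_test(train_set, test_set))
    return sum(accuracies) / n
```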
Leave-one-out cross validation is like n-fold cross validation, but where n is the
size of the set (the number of instances), that is: "train on all but one, test on that one".
Advantages:
1) The greatest possible amount of data is used for training (n-1 instances)
Disadvantages:
1) It is computationally expensive, and the test sets cannot be stratified, since
each contains a single instance.
E.g. imagine you have the same number of examples for 2 classes. A classifier
that always predicts the majority class is expected to have an error rate of 50%, but
in the leave-one-out method the majority class of the training set is always the
opposite of the held-out instance's class, which produces a 100% error rate
(see the sketch below).
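A short demonstration of this pathology in plain Python (the function is illustrative):

```python
from collections import Counter

def leave_one_out_error(labels):
    """Leave-one-out error of a classifier that always predicts
    the majority class of the training set."""
    errors = 0
    for i, true_label in enumerate(labels):
        train = labels[:i] + labels[i + 1:]
        majority = Counter(train).most_common(1)[0][0]
        errors += (majority != true_label)
    return errors / len(labels)

# 50 instances per class: the held-out instance's class is always the
# minority of the remaining training set, so every prediction is wrong.
print(leave_one_out_error(["A"] * 50 + ["B"] * 50))  # 1.0
```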
Accuracy measures
Example: the model of transmembrane helices. We have two models:
(1) the loop model M_loop, given by the observed frequencies p of amino acids in loops
(2) the helix model M_helix, given by the observed frequencies q of amino acids in helices

From the training set we estimate both models, and a sequence s = x_1 ... x_N is scored with the log-likelihood ratio:

S = log [ L(s | M_helix) / L(s | M_loop) ] = log [ Π_{i=1}^{N} q_{x_i} / Π_{i=1}^{N} p_{x_i} ]

A test set: a set of "labelled" (annotated) proteins that we do not use for training,
where each residue is annotated as Helix or Loop.
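A minimal sketch of this score in plain Python; the frequency values below are toy numbers for illustration, not real training estimates:

```python
import math

def loglik_ratio_score(sequence, q_helix, p_loop):
    """S = log[ prod_i q(x_i) / prod_i p(x_i) ]
         = sum_i [ log q(x_i) - log p(x_i) ].
    `q_helix` and `p_loop` map each amino acid to its observed
    frequency in helices and loops, respectively."""
    return sum(math.log(q_helix[aa]) - math.log(p_loop[aa]) for aa in sequence)

# Hypothetical frequencies, for illustration only:
q = {"L": 0.15, "A": 0.10, "K": 0.05}
p = {"L": 0.08, "A": 0.09, "K": 0.07}
print(loglik_ratio_score("LLAK", q, p))  # > 0 suggests helix-like
```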
Accuracy measure

The test set contains Real (helix) and False (loop) cases. Our model divides the test set according to our predictions of Real and False, giving four groups (in the slides, the red area contains the helix predictions made by our model):

              Real                  False
Predicted +   TP (true positives)   FP (false positives)
Predicted -   FN (false negatives)  TN (true negatives)
Accuracy measure

True Positive Rate (Sensitivity): the proportion of true elements that is
correctly predicted (a.k.a. hit rate, recall):

Sn = TPR = TP / (TP + FN)

False Positive Rate (FPR): the proportion of negative cases that are mislabelled
(a.k.a. fall-out):

FPR = FP / (FP + TN)

Specificity: the proportion of the negatives that are correctly predicted:

Sp = 1 - FPR = TN / (FP + TN)
A perfect classification would have Sn=1 and Sp=1
Accuracy measure

Positive Predictive Value (PPV), sometimes called Precision: the fraction of our
predictions that are correct:

PPV = TP / (TP + FP)

False Discovery Rate (FDR): the fraction of our predictions that are wrong:

FDR = FP / (FP + TP) = 1 - PPV

PPV → 1 means most of our predictions are correct.

In summary: TPR = TP / (TP + FN), PPV = TP / (FP + TP), FDR = FP / (FP + TP)
Accuracy measure

Overall success rate: the number of correct classifications divided by the
total number of classifications (sometimes called "accuracy"):

Overall Success Rate = (TP + TN) / (TP + TN + FN + FP)

A value of 1 for the success rate means that the model identifies all the
positive and negative cases correctly.

Error Rate = 1 - (TP + TN) / (TP + TN + FN + FP)
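A minimal sketch computing all of the measures above from the four counts (plain Python; the counts shown are the ones from the λ = 3 example in the ROC section below):

```python
def accuracy_measures(tp, fp, tn, fn):
    """All the accuracy measures defined above, from the four counts."""
    return {
        "Sn (TPR, recall)": tp / (tp + fn),
        "FPR (fall-out)":   fp / (fp + tn),
        "Sp (1 - FPR)":     tn / (fp + tn),
        "PPV (precision)":  tp / (tp + fp),
        "FDR (1 - PPV)":    fp / (fp + tp),
        "Success rate":     (tp + tn) / (tp + tn + fp + fn),
        "Error rate":       (fp + fn) / (tp + tn + fp + fn),
    }

print(accuracy_measures(tp=3, fp=0, tn=4, fn=2))
```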
Accuracy measure

The correlation coefficient (CC, the Matthews correlation coefficient) combines all four counts:

CC = (TP·TN - FP·FN) / sqrt( (TP + FN)(TN + FP)(TP + FP)(TN + FN) )

For a two-class prediction the possible outcomes are:

                       Predicted class
                       yes               no
Actual class   yes     true positive     false negative
               no      false positive    true negative
Accuracy measure

For multiclass predictions:

Table 5.4 Different outcomes of a three-class prediction: (a) actual and (b) expected.

Good results correspond to large numbers on the diagonal and small numbers off
the diagonal. In the example we have 200 instances (100 + 60 + 40), and 140 of them
are predicted correctly, thus the success rate is 70%.

We build the matrix of expected values by using the same totals as before and
sharing out the total of each class: we split each actual total into the three groups
using the proportions of the predicted classes, a = 120, b = 60, c = 20 → a = 60%,
b = 30%, c = 10%:

           a      b      c      Total
a         60     30     10      100
b         36     18      6       60
c         24     12      4       40
Total    120     60     20      200

The expected number of chance successes is the diagonal of this matrix:
60 + 18 + 4 = 82. We take chance agreement into account by subtracting
it from the predictor's successes and expressing the result as a proportion
of the total for a perfect predictor, to yield 140 - 82 = 58 extra successes out
of a possible 200 - 82 = 118 (about 49%).

To estimate the relative agreement between observed and expected values we can
use the kappa statistic:

κ = ( P(A) - P(E) ) / ( 1 - P(E) )

where P(A) is the probability of agreement and P(E) is the probability of agreement
by chance. The maximum possible value is κ = 1, and for a random predictor κ = 0.
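A minimal sketch of the kappa computation, using the numbers from this example (plain Python):

```python
def kappa(observed_correct, expected_correct, total):
    """kappa = (P(A) - P(E)) / (1 - P(E))."""
    p_a = observed_correct / total   # probability of agreement
    p_e = expected_correct / total   # probability of agreement by chance
    return (p_a - p_e) / (1 - p_e)

# 140/200 correct observed; 60 + 18 + 4 = 82 correct expected by chance.
print(kappa(140, 82, 200))  # ~0.49, i.e. 58 / 118
```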
Accuracy measure

E.g. a method that finds almost all true elements will have an Sn close to 1, but
this can be achieved by a method with very low Sp. In general, one would like a
method that balances Sn and Sp (or equivalent measures).
Accuracy measure

Question: do you want to find all the true cases? (Then you want higher sensitivity.)

We predict "helix" whenever the score exceeds a threshold λ:

S = log [ L(s | M_helix) / L(s | M_loop) ] > λ

[Figure: the score distributions of the Real and False cases overlap; sliding the threshold λ across them changes the resulting TP, FP, TN and FN counts.]
Receiver Operating Characteristic (ROC) curve

A ROC curve is a graphical plot of TPR (Sn) vs. FPR built for the same
prediction model by varying one or more of the model parameters (here, the threshold λ).

[Figure: each position of λ on the Real/False score distributions gives one (FPR, TPR) point of the curve.]
Receiver Operating Characteristic (ROC) curve

[Figure: distributions of the scores in the positive cases (true positives vs. false negatives) and in the negative cases (true negatives vs. false positives). Moving the threshold criterion from A (low TPR, low FPR) through B to C (high TPR, high FPR) traces the ROC curve from (0, 0) to (1, 1); the diagonal corresponds to random classification.]

TPR = TP / (TP + FN)        FPR = FP / (FP + TN)
Receiver Operating Characteristic (ROC) curve

Each dot on the curve corresponds to a choice of parameters (usually a single
parameter, the classification threshold):

TPR = TP / (TP + FN)        FPR = FP / (FP + TN)
Receiver Operating Characteristic (ROC) curve

Example: consider the ranking of scores S = log [ L(s | M_helix) / L(s | M_loop) ],
where the test set is labelled (R = Real, F = False):

S       Known label
10      R
7       R
4       R
2       F
1       R
-0.4    R
-2      F
-5      F
-9      F
Receiver Operating Characteristic (ROC) curve

Let's choose a cut-off (a λ) of 3, i.e. above this value we predict "R", and
calculate TP, FP, ... for this λ:

λ = 3:  TP = 3, FP = 0, TN = 4, FN = 2  →  TPR = 3/5, FPR = 0

TPR = TP / (TP + FN)        FPR = FP / (FP + TN)
Receiver Operating Characteristic (ROC) curve

Repeat for other λ's:

λ      TP    FP    TN    FN    TPR    FPR
3       3     0     4     2    3/5    0
0       4     1     3     1    4/5    1/4
-7      5     3     1     0    1      3/4
The variability of the accuracy as a function
of the parameters and/or cut-offs is generally
described with a ROC curve.
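A minimal sketch that reproduces these (FPR, TPR) points from the ranked scores (plain Python; the function name is illustrative):

```python
def roc_points(scored_labels):
    """Given (score, label) pairs with label 'R' (real) or 'F' (false),
    compute one (FPR, TPR) point per cut-off, lowering the cut-off
    one instance at a time."""
    ranked = sorted(scored_labels, key=lambda sl: sl[0], reverse=True)
    n_real = sum(1 for _, lab in ranked if lab == "R")
    n_false = len(ranked) - n_real
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, lab in ranked:
        if lab == "R":
            tp += 1
        else:
            fp += 1
        points.append((fp / n_false, tp / n_real))
    return points

scores = [(10, "R"), (7, "R"), (4, "R"), (2, "F"), (1, "R"),
          (-0.4, "R"), (-2, "F"), (-5, "F"), (-9, "F")]
print(roc_points(scores))  # includes (0, 3/5), (1/4, 4/5) and (3/4, 1)
```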
Receiver Operating Characteristic (ROC) curve

Comparing multiple methods:

[Figure: ROC curves where each line corresponds to a different method (Method 1, Method 2, Method 3); the diagonal corresponds to random classification.]
ROC curves

Example: if you wish to discover at least 60% of the true elements (TPR = 0.6),
the graph shows that Method 1 has a lower FPR than Methods 2 and 3. We may
want to choose Method 1.
Optimal configuration

[Figure: ROC curves and an AUC barplot for Method 1, Method 2 and Method 3.]

A summary measure for choosing among models is the Area Under the Curve (AUC);
the best model will in general have the highest AUC. The maximum value is AUC = 1,
and the closer the AUC is to one, the better the model. There are also standard
methods to estimate the AUC from the sampled points of the ROC curve (see the
sketch below).
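One standard approach is the trapezoidal rule over the sampled (FPR, TPR) points; a minimal sketch, using the points from the example above:

```python
def auc_trapezoid(points):
    """Estimate the AUC from sampled (FPR, TPR) points
    by the trapezoidal rule."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

print(auc_trapezoid([(0, 0), (0, 0.6), (0.25, 0.8), (0.75, 1.0), (1, 1)]))  # 0.875
```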
Receiver Operating Characteristic (ROC) curve

[Figure: ROC curves with error bars and an AUC barplot for Method 1, Method 2 and Method 3.]

Question: why do you think there are error bars in the AUC barplot and in the ROC curves?
Precision recall curves

ROC curves are useful to compare predictive models. However, if we predict many
TPs at the cost of producing many false predictions (FP is large), the FPR might
not look so bad if our test set contains many negatives, such that TN >> FP:

FPR = FP / (FP + TN) → 0 as TN grows large

So we may have a situation where the TPR is high and the FPR is low, but where
for the actual counts FP >> TP. That is, TPR is not affected by FP, and FPR can
be low even if FP is high (as long as TN >> FP).
Precision recall curves

For instance, consider methods to classify documents, and imagine that our test
set is composed of 100 True instances and 10000 Negative instances. Suppose the
first method selects 100 documents, of which 40 are correct:

TPR1 = TP / (TP + FN) = 40/100 = 0.4        FPR1 = FP / (FP + TN) = 60/10000 = 0.006

Now consider a second method that selects 680 documents with 80 correct, and
imagine that our test set is composed now of 100 True instances and 100000
Negative instances:

TPR2 = TP / (TP + FN) = 80/100 = 0.8        FPR2 = FP / (FP + TN) = 600/100000 = 0.006
The second one may seem better, because it retrieves more relevant documents,
but the proportion of predictions that are correct (precision or PPV) is smaller:

PPV = TP / (TP + FP):   Precision1 = 40/100 = 0.40,   Precision2 = 80/680 ≈ 0.12

(Note: you can also use FDR = 1 - PPV.)
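The arithmetic of this comparison, worked through in plain Python (note that, as in the text above, FP + TN equals the number of negative instances):

```python
def rates(tp, fp, fn, tn):
    return {"TPR": tp / (tp + fn),
            "FPR": fp / (fp + tn),
            "precision": tp / (tp + fp)}

# Method 1: 100 selected, 40 correct; 100 true / 10000 negative instances.
print(rates(tp=40, fp=60, fn=60, tn=9940))    # TPR 0.4, FPR 0.006, precision 0.40
# Method 2: 680 selected, 80 correct; 100 true / 100000 negative instances.
print(rates(tp=80, fp=600, fn=20, tn=99400))  # TPR 0.8, FPR 0.006, precision ~0.12
```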
Thus, one must also take into account the "relative cost" of the predictions,
i.e. the FN and FP values that must be assumed in order to achieve a high TPR:

precision = PPV = TP / (TP + FP)
Precision recall curves

[Figure: precision-recall curves for Model 1 and Model 2. Model 2 achieves a lower AUC than Model 1, but still quite good, and its precision is highly improved.]