
ROC Analysis with Matlab

Antonin Slaby, University of Hradec Kralove, Rokitanského 62, 500 03 Hradec Kralove, Czech Republic. E-mail: antonin.slaby@uhk.cz

Abstract. The contribution is focused on the essentials of ROC and cost analysis and their support by the Matlab software. The contribution summarizes basic facts of the theory of ROC and COST curves and shows results of the Matlab-based solution, mainly samples of graphical outputs, simplified and schematised because of the size of the images in the paper. A more detailed solution of selected practical problems will be demonstrated during the presentation. Source files can be obtained from the author.

Keywords. ROC analysis, COST analysis, Matlab applications

1. Introduction
Receiver Operating Characteristic (ROC) graphs are a useful and clear way of organizing classifiers and visualizing their quality (performance). The research started in about the 1940s with the need to explain and interpret radio signals and was used by radar receiver operators to analyse radar images during World War II. From the beginning of the 1970s ROC analysis was found to be a useful tool for interpreting medical test results and other problems. ROC graphs are now commonly used in medical and other decision making, and they have recently been adopted by the machine learning and data mining research communities. ROC graphs are apparently simple. All figures in this article depict the unit square with the lower limit 0 and the upper limit 1 on both axes.

If we use a classifier and realize one instance of testing, there are four possible outcomes. If the instance is positive and it is classified as positive, it is counted as a true positive. If the instance is positive and it is classified as negative, it is counted as a false negative. If the instance is negative and it is classified as negative, it is counted as a true negative. If the instance is negative and it is classified as positive, it is counted as a false positive. Let us denote the number of true positive instances as TPi, the number of true negative instances as TNi, the number of false positive instances as FPi and finally the number of false negative instances as FNi. Then it holds:
Ti = TPi + TNi
Fi = FPi + FNi
Pi = TPi + FNi
Ni = TNi + FPi
Pi + Ni = Ti + Fi
where Ti is the number of true (correctly classified) instances, Fi is the number of false (misclassified) instances, Pi is the number of positive instances and Ni is the number of negative instances. A classifier may be represented in the form of a table called the confusion matrix.

2. ROC curves
A classifier assigns an object to one of a predefined set of categories or classes. A medical test's outcome is either positive or negative. A student's exam is either passed or failed. Let us introduce the actual class for the true states and the predicted class for the predicted states.

                      Correct (true) class
                      Positive      Negative
Predicted  Positive   TPi           FPi
           Negative   FNi           TNi

Table 1. Confusion Matrix

Let us calculate the rates TP = TPi / Pi and FP = FPi / Ni.
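The quantities introduced above can be computed in Matlab, for example, as follows. This is only a minimal sketch; the vectors labels and pred are hypothetical example data, not data from the paper.

  labels = [1 1 1 0 0 1 0 0 0 1];       % actual class (1 = positive, 0 = negative)
  pred   = [1 0 1 0 1 1 0 0 1 1];       % predicted class
  TPi = sum(pred == 1 & labels == 1);   % true positives
  FPi = sum(pred == 1 & labels == 0);   % false positives
  TNi = sum(pred == 0 & labels == 0);   % true negatives
  FNi = sum(pred == 0 & labels == 1);   % false negatives
  Pi = TPi + FNi;  Ni = TNi + FPi;      % numbers of positive and negative instances
  TP = TPi / Pi;   FP = FPi / Ni;       % rates used as coordinates in ROC space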


Then a classifier can also be represented as a single point (FP, TP) in ROC space. ROC space is the square [0,1] x [0,1] with the horizontal axis x denoting FP and the vertical axis y denoting TP. Classifiers can be ordered in the following way: the classifier [FP1, TP1] is better than the classifier [FP2, TP2] if FP1 < FP2 and at the same time TP1 > TP2, i.e. the point [FP1, TP1] lies to the north and to the west of the point [FP2, TP2]. This relation is called domination. In Figure 1 the grey point is dominated by the black one. Further remarks about ROC curves and ROC space: it is clear that the suitability of a classifier in ROC space can be expressed by FP, which should be as small as possible, and by TP, which should be as large as possible (close to one). Several points in ROC space are important to mention. The lower left point [0, 0] represents the strategy of never issuing a positive classification; such a classifier causes no false positive errors but also gains no true positives. The opposite strategy of unconditionally issuing positive classifications is represented by the upper right point [1, 1]. The point [0, 1] represents perfect classification.

Figure 1. Dominance of classifiers
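A trivial Matlab sketch of the domination test for two hypothetical classifiers given by their [FP, TP] coordinates (the numbers are made up for illustration):

  c1 = [0.1 0.8];   % classifier 1 as [FP TP]
  c2 = [0.3 0.6];   % classifier 2 as [FP TP]
  c1_dominates_c2 = (c1(1) < c2(1)) && (c1(2) > c2(2));   % true for these values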

A ROC (Receiver Operating Characteristic) curve is a graphical visualization of the true positive rate as a function of the false positive rate of a classifier system (a set of classifiers). A single classifier produces a single ROC point. The curve may be obtained in several ways, for example: if the score defining the decision threshold for classification is varied; if the classifier has a sensitivity parameter, varying it produces a series of ROC points (confusion matrices); if classifiers are produced by some learning algorithm, a series of ROC points can be generated by varying the class ratio in the training set; a procedure for generating a ROC curve can also be derived from a decision tree, etc.
Figure 2. ROC curve
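The first of these possibilities, sweeping the decision threshold over the classifier scores, can be sketched in Matlab as follows. The vectors scores and labels are hypothetical example data introduced only for illustration.

  scores = [0.9 0.8 0.7 0.55 0.5 0.4 0.3 0.2];   % hypothetical classifier scores
  labels = [1 1 0 1 0 1 0 0];                    % hypothetical true classes
  thr = [Inf sort(unique(scores), 'descend')];   % thresholds; Inf yields the point [0,0]
  P = sum(labels == 1);  N = sum(labels == 0);
  tp = zeros(size(thr)); fp = zeros(size(thr));
  for i = 1:numel(thr)
      predicted = scores >= thr(i);              % classify as positive above the threshold
      tp(i) = sum(predicted & labels == 1) / P;  % true positive rate
      fp(i) = sum(predicted & labels == 0) / N;  % false positive rate
  end
  plot(fp, tp, '-o'), xlabel('FP'), ylabel('TP')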

3. Area under ROC curve


The area under the ROC curve (AUC) is a common metric that can be used to compare different tests (indicator variables). The AUC is a measure of test accuracy. A ROC curve is a two-dimensional visualization of the performance of a set of classifiers. In order to compare two sets of classifiers it is sometimes convenient to reduce ROC performance to a single scalar value representing the expected performance. The easiest possibility is to calculate the area under the ROC curve, which is a part of the area of the unit square. Consequently the value of the AUC always satisfies the inequalities 0 ≤ AUC ≤ 1. It is clear that an AUC close to 1 (the area of the unit square) indicates a very good diagnostic test.



On the other hand, as random guessing produces the diagonal line between the points [0, 0] and [1, 1], which has an area of 0.5, reasonable tests should satisfy 0.5 ≤ AUC ≤ 1. The AUC has an important statistical property: the AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.
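This statistical property also suggests a direct way of estimating the AUC in Matlab: count the fraction of positive-negative pairs in which the positive instance obtains the higher score. The score vectors below are hypothetical example data.

  pos_scores = [0.9 0.8 0.55 0.4];       % scores of positive instances
  neg_scores = [0.7 0.5 0.4 0.3 0.2];    % scores of negative instances
  [P, N] = meshgrid(pos_scores, neg_scores);                         % all positive-negative pairs
  AUC_est = (sum(P(:) > N(:)) + 0.5*sum(P(:) == N(:))) / numel(P);   % ties counted as one half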

Figure 3. Area under ROC curve

4. Cost curves and duality between ROC space and cost space
There exists a one-to-one relation (duality) between the points of ROC curves and lines (segments) in the cost space. The duality relation is defined as follows: to each point [FP, TP] of the ROC curve there is assigned a segment in the cost space. The cost space is again a unit square. The assigned segment connects the points [0, FP] and [1, FN] and is defined by the formulas
Y = (FN - FP) p(+) + FP
X = p(+)
where p(+) is the probability of an example being from the positive class. In particular, the segment connecting the points [0,0] and [1,1] is assigned to the ROC point [0,0], which represents the always negative classification, and the segment connecting the points [0,1] and [1,0] is assigned to the ROC point [1,1], which represents the always positive classification. The points [X, Y] of the segment in the cost space can also be calculated by the formulas
Y = FN X + FP (1 - X)
X = p(+) C(-|+) / (p(+) C(-|+) + (1 - p(+)) C(+|-))
In the above formulas the used quantities have the following meaning: p(+) is the probability of an example being from the positive class, C(-|+) is the cost of misclassifying positive examples and C(+|-) is the cost of misclassifying negative examples. If we assume that C(-|+) = C(+|-), i.e. that the misclassification costs are equal, we obtain the simpler formula X = p(+). The values of Y are expected costs normalized to [0,1]. As the numerator is always less than or equal to the denominator in the formula for the calculation of X, both X and Y lie in the interval [0,1]. On the other hand, each line in ROC space with slope S and y-intercept TP0 is assigned to a point in cost space using the following equations:
X = p(+) = 1/(1 + S)
Y = error rate = (1 - TP0) p(+)
Both these operations are invertible. Their inverses map points (lines) in cost space to lines (points) in ROC space. If (X, Y) is a point in cost space, the corresponding line in ROC space has
TP = (1/X - 1) FP + (1 - Y/X).

Figure 4. Cost curves
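The duality can be used directly for drawing cost curves in Matlab. The following minimal sketch assumes equal misclassification costs, so that X = p(+); the vectors fp and tp are hypothetical ROC points.

  fp = [0 0.1 0.3 0.6 1];      % hypothetical ROC points
  tp = [0 0.5 0.8 0.95 1];
  fn = 1 - tp;                 % false negative rates
  x  = linspace(0, 1, 101);    % X = p(+) under equal misclassification costs
  figure, hold on
  for i = 1:numel(fp)
      plot(x, fn(i).*x + fp(i).*(1 - x))    % Y = FN*X + FP*(1 - X)
  end
  xlabel('X = p(+)'), ylabel('normalized expected cost Y')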

The convex hull of the ROC curve is defined and constructed as the smallest convex set containing all points of the given ROC curve represented by a set of points. The convex hull is also called the upper envelope of the ROC curve. Matlab offers the built-in function convhull(x,y) for calculating the convex hull of a curve given by a set of points. As with other procedures, it can alternatively be developed as a special m-file. The result of the calculation of the convex hull is shown in the following Figure 5.
Figure 5. ROC curve and its convex hull
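A minimal sketch of the use of convhull for a set of ROC points follows; the coordinates are hypothetical and the endpoints [0,0] and [1,1] are included so that the hull spans the whole curve.

  fp = [0 0.05 0.1 0.3 0.45 0.6 1];    % hypothetical ROC points including [0,0] and [1,1]
  tp = [0 0.3 0.5 0.8 0.82 0.95 1];
  k  = convhull(fp, tp);               % indices of the vertices of the convex hull
  plot(fp, tp, 'o', fp(k), tp(k), '-')
  % the upper part of this hull forms the ROC convex hull (upper envelope)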

The slope of the segment of the convex hull connecting the two vertices [FP1, TP1] and [FP2, TP2] is given by the left-hand side of the following equation:
[TP1 - TP2] / [FP1 - FP2] = [p(-) C(+|-)] / [p(+) C(-|+)]
The right-hand side defines the slope of the line of operating points with the same expected cost; when the equality holds, the points [FP1, TP1] and [FP2, TP2] have the same expected cost. The notion of dominance in ROC space has an exact counterpart in cost space. Cost curve C1 dominates cost curve C2 if C1 is below (or equal to) C2 for all x values, i.e. there is no operating point for which C2 outperforms C1. The related ROC concept of the upper (maximizing) convex hull defined for ROC curves also has an adequate notion in the theory of cost curves: the lower (minimizing) envelope. The lower envelope is defined at every point x as the lowest of the y values achieved at that x over the set of given cost curves. From the duality between ROC space and cost space it follows that the line segment which forms the lower envelope corresponds to a vertex of the ROC convex hull and, conversely, the vertices of the lower envelope correspond to the line segments of the ROC hull. Figure 7 shows the dual cost curves assigned to the six vertices of the convex hull of the ROC curve from Figure 5.
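One possible Matlab sketch of the lower envelope evaluates all cost lines on a grid of x values and takes the pointwise minimum; the ROC points fp and tp are hypothetical.

  fp = [0 0.1 0.3 0.6 1];   % hypothetical hull vertices
  tp = [0 0.5 0.8 0.95 1];
  fn = 1 - tp;
  x  = linspace(0, 1, 101);
  Y  = fn(:)*x + fp(:)*(1 - x);    % one cost line per row, evaluated on the grid
  lowerEnv = min(Y, [], 1);        % pointwise minimum over all cost lines
  plot(x, lowerEnv, 'k', 'LineWidth', 2)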

Figure 6. Calculating AUC

From Figure 6 it can be clearly seen that the AUC can then easily be calculated as the sum of the areas of the trapezoids in ROC space given by the formula
P(i) = [Y(i) + Y(i-1)] [X(i) - X(i-1)] / 2
P(i) = [TP(i) + TP(i-1)] [FP(i) - FP(i-1)] / 2
One such trapezoid is drawn in Figure 6 in dark grey colour. In Matlab notation the area can be written in one line of code in an m-file:
AUC = sum((fp(2:n) - fp(1:n-1)) .* (tp(2:n) + tp(1:n-1)))/2
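A slightly fuller sketch of the same computation, with hypothetical ROC points sorted by increasing FP and including the endpoints, might look as follows; the built-in function trapz gives the same result.

  fp = [0 0.1 0.3 0.6 1];    % hypothetical ROC points including the endpoints
  tp = [0 0.5 0.8 0.95 1];
  [fp, order] = sort(fp);  tp = tp(order);   % make sure the points go from [0,0] to [1,1]
  n = numel(fp);
  AUC = sum((fp(2:n) - fp(1:n-1)) .* (tp(2:n) + tp(1:n-1))) / 2
  % the built-in trapezoidal rule gives the same value: AUC = trapz(fp, tp)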

We will not discuss here the more complicated case of the calculation of the area of convex hulls of sets in dimensions greater than two.


Figure 7. Lower envelope of Cost curves

5. Linear programming in finding the optimal classifier


Linear programming may contribute to solving special problems restricting the ROC space. The set of feasible solutions is the intersection of the two apparent linear inequalities defining the ROC space, 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, and a set of other inequalities forming the convex hull. If we order the vertices of the convex hull according to growing x, denote two neighbouring points as [x1, y1] and [x2, y2] and set
difx = x2 - x1
dify = y2 - y1
we obtain the following equation of the line and line segment defined by the endpoints [x1, y1] and [x2, y2]:
dify x - difx y = x1 dify - y1 difx
which forms a part of the boundary of the convex set of feasible solutions. If we use Matlab matrix notation, the set of feasible solutions can be given in the following way:
AA = [dify; -difx]
bb = x1.*dify - y1.*difx
and the problem of linear programming can be described as
AA' [x; y] ≥ bb', 0 ≤ x ≤ 1, 0 ≤ y ≤ 1.
The objective function f which should be optimised is defined to follow the practical nature of the problem. With regard to the nature of the problem, other inequalities describing additional restrictions can be imposed too.

The optimisation procedure can be written as an m-file, or alternatively the built-in function of the Matlab Optimization Toolbox can be used:
[xx,fval] = linprog(f,A,b,[],[],lb,ub)
In the second case, when the Matlab function is used, we have to meet the following additional requirements: difx, dify and b have to be column vectors; all conditions must have the same form A x ≤ b and consequently have to be multiplied by -1; the objective function has to be multiplied by -1 as well, because a maximization is required while linprog minimizes. To satisfy these additional requirements, we make the following rearrangements: transformation of the matrix of the set of inequalities, A = -AA'; making a column vector of the right-hand sides of the inequalities, b = -bb'; creating lower and upper limits for the variables, lb = [0 0] and ub = [1 1]; and definition of the vector of coefficients of the function to be optimized, i.e. the objective function, f = [f1, f2]. The visualization of the numerical results obtained by using the Matlab function is given in Figure 8 and Figure 9.
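A minimal sketch of the whole procedure, with hypothetical convex hull vertices and a hypothetical objective function, might look as follows.

  x = [0 0.1 0.3 0.6 1];     % hypothetical x-coordinates of the hull vertices (FP)
  y = [0 0.5 0.8 0.95 1];    % hypothetical y-coordinates of the hull vertices (TP)
  difx = diff(x);  dify = diff(y);             % one entry per hull segment
  AA = [dify; -difx];
  bb = x(1:end-1).*dify - y(1:end-1).*difx;
  A  = -AA';  b = -bb';                        % rearranged to the form A*[x;y] <= b
  lb = [0 0];  ub = [1 1];
  f  = -[-2 1];              % hypothetical objective: maximize -2*FP + TP, so pass the negated coefficients
  [xx, fval] = linprog(f, A, b, [], [], lb, ub);
  % optimal operating point: FP = xx(1), TP = xx(2); the maximal objective value is -fval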

Figure 8. Optimal predictor No. 1






Figure 9. Optimal predictor No. 2

Two lines are depicted in each of these figures. The thicker line is the optimal one; it passes through the optimal point [0, 0.2] in Figure 8 and through the point [0.1, 0.5] in Figure 9. The second (thinner) line consists of the points for which the objective function has the value 0; in both figures it passes through the point [0, 0].

6. Conclusion
During the oral presentation a system of m-files will be shown which covers creating ROC curves, cost curves, the AUC and the other duality constructions, as well as the use of linear programming for the solution of some special problems. Figures 2 to 9 are results of Matlab visualization procedures. The two-column arrangement of the proceedings and the black-and-white printing would have made the tiny captions accompanying the graphs hard to read; consequently the outputs were made simpler and bolder, with only a few details. The best view can be obtained by running the source m-files, which can be obtained from the author at the e-mail address antonin.slaby@uhk.cz. During the presentation, larger test data of medical origin will also be shown. The software package consists of the file rocgui.m and a set of functions in the form of m-files.

7. Acknowledgements
The software and the contribution have arisen with the support of grant No. 402/04/1308 of the Czech grant agency GACR.

