Вы находитесь на странице: 1из 51

ID3 Algorithm & ROC Analysis

Talha KABAKU talha.kabakus@ibu.edu.tr

Agenda
Where are we now? Decision Trees What is ID3? Entropy Information Gain Pros and Cons of ID3 An Example - The Simpsons What is ROC Analysis? ROC Space ROC Space Example over predictions

Where are we now?

Decision Trees
One of the most used classification approach because of its clear model and presentation Classification by using data attributes Aim is to reaching estimating destination field value using source fields Tree Induction Create tree Apply data into tree to classify Each branch node represents a choice between a number of alternatives Each leaf node represents a classification or decision Leaf Count = Rule Count

Decision Trees (Cont.)


Leafs are inserted through top to bottom A B C

Sample Decision Tree

Creating Tree Model by Training Data

Decision Tree Classification Task

Apply Model to Test Data

Apply Model to Test Data (Cont.)

Apply Model to Test Data (Cont.)

Apply Model to Test Data (Cont.)

Apply Model to Test Data (Cont.)

Apply Model to Test Data (Cont.)

Decision Tree Algorithms


Classification and Regression Algorithms Twoig Gini Entropy-based Algorithms ID3 C4.5 Memory-based (Sample-based) Classification Algorithms

Decision Trees by Variable Type


Single Variable Decision Trees Classifications are done with asking questions over only one variable Hybrid Decision Trees Classifications are done with asking questions over both single and multiple variables Multiple Variables Decision Trees Classifications are done with asking questions over multiple variables

ID3 Algorithm
Iterative Dichotomizer 3 Developed by J. Ross Quinlan in 1979 Based on Entropy Only works for discrete data Can not work with defective data Advantage over Hunt's algorithm is choosing the right attribute while classification. (Hunt's algorithm chooses randomly)

Entropy
A formula to calculate the homogeneity of a sample; gives idea about how much information gain provides each leaf A complete homogeneous sample entropy value is 0 An equally divided sample entropy value is 1 Formula:

Information Gain (IG)


Information Gain calculates effective change in entropy after making a decision based on the value of an attribute. Which attribute creates the most homogeneous branches? First the entropy of the total dataset is calculated. The dataset is then split on the different attributes.

Information Gain (Cont.)


The entropy for each branch is calculated. Then it is added proportionally, to get total entropy for the split. The resulting entropy is subtracted from the entropy before the split. The result is the Information Gain, or decrease in entropy. The attribute that yields the largest IG is chosen for the decision node.

Information Gain (Cont.)


A branch set with entropy of 0 is a leaf node. Otherwise, the branch needs further splitting to classify its dataset. The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.

ID3 Algorithm Steps


function ID3 (R: a set of non-categorical attributes, C: the categorical attribute, S: a training set) returns a decision tree; begin If S is empty, return a single node with value Failure; If S consists of records all with the same value for the categorical attribute, return a single node with that value; If R is empty, then return a single node with as value the most frequent of the values of the categorical attribute that are found in records of S; [note that then there will be errors, that is, records that will be improperly classified]; Let D be the attribute with largest Gain( D,S) among attributes in R; Let {dj| j=1,2, .., m} be the values of attribute D; Let {Sj| j=1,2, .., m} be the subsets of S consisting respectively of records with value dj for attribute D; Return a tree with root labeled D and arcs labeled d1, d2, .., dm going respectively to the trees ID3(R-{D}, C, S1), ID3(R-{D}, C, S2), .., ID3(R-{D}, C, Sm); end ID3;

Pros of ID3 Algorithm


Builds decision tree in min. steps The most important point while tree induction is collecting enough reliable associated data over specific properties. Asking right questions determines tree induction. Each level benefits from previous level choices Whole dataset is scanned to create tree

Cons of ID3 Algorithm


Tree can not be updated when new data is classified incorrectly, instead a new tree must be generated. Only one attribute at a time is tested for making a decision. Can not work with defective data Can not work with numerical attributes

An Example - The Simpsons


Person Homer Marge Bart Lisa Maggie Abe Selma Otto Krusty Hair Length 0'' 10'' 2'' 6'' 4'' 1'' 8'' 10'' 6'' Weight 250 150 90 78 20 170 160 180 200 Age 36 34 10 8 1 70 41 38 45 Class M F M M F F F M M

Information Gain over Hair Length

E(4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911 ==> General Information Gain

Hair Length <= 5 Yes No

E(1F,3M) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.9710

E(3F,2M) = -(3/5)log2(3/5) - (2/5)log2(2/5) =0.8113

Gain(Hair Length <= 5) = 0.9911 (4/9 * 0.9710 + 5/9 * 0.8113) = 0.0911

Information Gain over Weight

E(4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911 ==> General Information Gain

Weight <= 160 Yes No

E(4F,1M) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.7219

E(0F,4M) = -(0/4)log2(0/4) - (4/4)log2(4/4) = 0

Gain(Weight<= 160) = 0.9911 (5/9 * 0.7219 + 4/9 * 0 ) = 0.5900

Information Gain over Age

E(4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911 ==> General Information Gain

Age <= 40 Yes No

E(3F,3M) = -(3/6)log2(3/6) - (3/6)log2(3/6) = 1

E(1F,2M) = -(1/3)log2(1/3) - (2/3)log2(2/3)= 0.9188

Gain(Age z= 40) = 0.9911 (6/9 * 1 + 3/9 * 0.9183 ) = 0.0183

Results
Attribute Hair Length <= 5 Weight <= 160 Age <= 40 Information Gain (IG) 0.0911 0.5900 0.0183

As seen in the results, weight is the best attribute to classify these group.

Constructed Decision Tree

Weight <= 160 Yes No

Hair Length <= 5 Yes No

Female

Male

Entropy over Nominal Values


If an attribute has nominal values:
First calculate information gain for each attribute value Then calculate attribute information gain

Example II

IG= -(5/15)log2(5/15)-(10/15)log2(10/15) = ~0.918

Example II (Cont.)
Information Gain over Engine
Engine: 6 small, 5 medium, 4 large 3 values for attribute engine, so we need 3 entropy calculations small: 5 no, 1 yes IGsmall = -(5/6)log2(5/6)-(1/6)log2(1/6) = ~0.65 medium: 3 no, 2 yes IGmedium = -(3/5)log2(3/5)-(2/5)log2(2/5) = ~0.97 large: 2 no, 2 yes IGlarge = 1 (evenly distributed subset) => IGEngine = IE(S) [(6/15)*IGsmall + (5/15)*IGmedium + (4/15)*Ilarge] = IGEngine = 0.918 0.85 = 0.068

Example II (Cont.)
Information Gain over SC/Turbo
SC/Turbo: 4 yes, 11 no 2 values for attribute SC/Turbo, so we need 2 entropy calculations yes: 2 yes, 2 no IGturbo = 1 (evenly distributed subset) no: 3 yes, 8 no IGnoturbo = -(3/11)log2(3/11)-(8/11)log2(8/11) = ~0.84 IGturbo = IE(S) [(4/15)*IGturbo + (11/15)*IGnoturbo] IGturbo = 0.918 0.886 = 0.032

Example II (Cont.)
Information Gain over Weight
Weight: 6 Average, 4 Light, 5 Heavy 3 values for attribute weight, so we need 3 entropy calculations average: 3 no, 3 yes IGaverage = 1 (evenly distributed subset) light: 3 no, 1 yes IGlight = -(3/4)log2(3/4)-(1/4)log2(1/4) = ~0.81 heavy: 4 no, 1 yes IGheavy = -(4/5)log2(4/5)-(1/5)log2(1/5) = ~0.72
IGWeight = IE(S) [(6/15)*IGaverage + (4/15)*IGlight + (5/15)*IGheavy] IGWeight = 0.918 0.856 = 0.062

Example II (Cont.)
Information Gain over Full Eco
Fuel Economy: 2 good, 3 average, 10 bad 3 values for attribute Fuel Eco, so we need 3 entropy calculations good: 0 yes, 2 no IGgood = 0 (no variability) average: 0 yes, 3 no IGaverage = 0 (no variability) bad: 5 yes, 5 no IGbad = 1 (evenly distributed subset)
We can omit calculations for good and average since they always end up not fast. IGFuelEco = IE(S) [(10/15)*IGbad] IGFuelEco = 0.918 0.667 = 0.251

Example II (Cont.)

Results:
IGEngine IGturbo IGWeight IGFuelEco 0.068 0.032 0.062 0.251

Root of the tree

Example II (Cont.)

Since we selected the Fuel Eco attribute for our Root Node, it is removed from the table for future calculations.

General Information Gain = 1 (Evenly distributed set)

Example II (Cont.)
Information Gain over Engine
Engine: 1 small, 5 medium, 4 large 3 values for attribute engine, so we need 3 entropy calculations small: 1 yes, 0 no IGsmall = 0 (no variability) medium: 2 yes, 3 no IGmedium = -(2/5)log2(2/5)-(3/5)log2(3/5) = ~0.97 large: 2 no, 2 yes IGlarge = 1 (evenly distributed subset)
IGEngine = IE(SFuelEco) (5/10)*IGmedium + (4/10)*IGlarge] IGEngine = 1 0.885 = 0.115

Example II (Cont.)
Information Gain over SC/Turbo
SC/Turbo: 3 yes, 7 no 2 values for attribute SC/Turbo, so we need 2 entropy calculations yes: 2 yes, 1 no IGturbo = -(2/3)log2(2/3)-(1/3)log2(1/3) = ~0.84 no: 3 yes, 4 no IGnoturbo = -(3/7)log2(3/7)-(4/7)log2(4/7) = ~0.84 IGturbo = IE(SFuelEco) [(3/10)*IGturbo + (7/10)*IGnoturbo] IGturbo = 1 0.965 = 0.035

Example II (Cont.)
Information Gain over Weight
Weight: 3 average, 5 heavy, 2 light 3 values for attribute weight, so we need 3 entropy calculations average: 3 yes, 0 no IGaverage = 0 (no variability) heavy: 1 yes, 4 no IGheavy = -(1/5)log2(1/5)-(4/5)log2(4/5) = ~0.72 light: 1 yes, 1 no IlGight = 1 (evenly distributed subset) IGEngine = IE(SFuel Eco) [(5/10)*IGheavy+(2/10)*IGlight] IGEngine = 1 0.561 = 0.439

Example II (Cont.)
Results:
IGEngine IGturbo IGWeight 0.115 0.035 0.439

Weight has the highest gain, and is thus the best choice.

Example II (Cont.)
Since there are only two items for SC/Turbo where Weight = Light, and the result is consistent, we can simplify the weight = Light path.

Example II (Cont.)
Updated Table: (Weight = Heavy)

All cars with large engines in this table are not fast. Due to inconsistent patterns in the data, there is no way to proceed since medium size engines may lead to either fast or not fast.

ROC Analysis
Receiver Operating Characteristic The limitations of diagnostic accuracy as a measure of decision performance require introduction of the concepts of the sensitivity and specificity of a diagnostic test. These measures and the related indices, true positive rate and false positive rate, are more meaningful than accuracy. ROC curve is shown to be a complete description of this decision threshold effect, indicating all possible combinations of the relative frequencies of the various kinds of correct and incorrect decisions.

ROC Analysis (Cont.)


Combinations of correct & incorrect decisions:
Actual Value
p p n n

Prediction Outcome
p n p n

Description
True Positive Rate (TPR) False Negative Rate (FNR) False Positive Rate (FPR) True Negative Rate (TNR)

TPR is equivalent with sensitivity. FPR is equivalent with 1 - specificity. Best possible prediction would be 100% sensitivity and 100% specificity (which means FPR = 0%).

ROC Space
A ROC space is defined by FPR and TPR as x and y axes respectively, which depicts relative trade-offs between true positive (benefits) and false positive (costs). Since TPR is equivalent with sensitivity and FPR is equal to 1 specificity, the ROC graph is sometimes called the sensitivity vs (1 specificity) plot. Each prediction result one point in the ROC space.

Calculations
Sensitivity TPR = TP / P = TP / (TP + FN) Specificity FPR = FP / N = FP / (FP + TN) Accuracy ACC = (TP + TN) / (P + N)

A ROC Space Example


Let A, B, C, D to be predictions over 100 negative and 100 positive instance:
Prediction/ Combination A B C D TP 63 77 24 76 FP 28 77 88 12 FN 37 23 76 24 TN 72 23 12 88 TPR 0.63 0.77 0.24 0.76 FPR 0.28 0.77 0.88 0.12 ACC 0.68 0.50 0.18 0.82

A ROC Space Example (Cont.)

References
1. Data Mining Course Lectures, Ass. Prof. Nilfer Yurtay 2. Quinlan, J.R. 1986, Machine Learning, 1, 81 3. http://www.cse.unsw.edu. au/~billw/cs9414/notes/ml/06prop/id3/id3.html 4. J. Han, M. Kamber, J. Pie, Data Mining Concepts and Techniques, 3rd Edition, Elsevier, 2011. 5. http://www.cise.ufl.edu/~ddd/cap6635/Fall97/Short-papers/2.htm 6. C. E. Metz, Basic Principles of ROC Analysis, Seminars in Nuclear Medicine, Volume 8, Issue 4, P 283-298