
Predictive Modeling Using Decision Trees

Introduction
Decision Trees

Powerful and popular for classification and prediction
Represent rules


Rules can be expressed in English
IF Age <=43 & Sex = Male & Credit Card Insurance = No THEN Life Insurance Promotion = No
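The rule above maps directly onto a conditional. A minimal Python sketch, in which the function name, argument names, and the `None` fallback are illustrative assumptions rather than part of the original example:

```python
def life_insurance_promotion(age, sex, credit_card_insurance):
    """Apply the single rule quoted above.

    Returns "No" when the rule fires, and None otherwise, because the
    remaining paths of the tree are not given in the text.
    """
    if age <= 43 and sex == "Male" and credit_card_insurance == "No":
        return "No"
    return None


print(life_insurance_promotion(40, "Male", "No"))
```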

Useful to explore data to gain insight into relationships of a large number of candidate input variables to a target (output) variable

Decision Tree: What is it?


A structure that can be used to divide up a large collection of records into successively smaller sets of records by applying a sequence of simple decision rules
A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous groups with respect to a particular target variable

Decision Trees: HMEQ Example

Banking marketing scenario (HMEQ):

Target:
default on a home-equity line of credit (BAD)

Inputs:
number of delinquent trade lines (DELINQ)
number of credit inquiries (NINQ)
debt-to-income ratio (DEBTINC)
possibly many other inputs

Introduction to Decision Tree modeling

Decision Trees:

Interpretation of the fitted decision tree:
The internal nodes contain rules.
Start at the root node (top) and follow the rules until a terminal node (leaf) is reached.
The leaves contain the estimate of the expected value of the target, in this case the posterior probability of BAD.
The probability can then be used to allocate cases to classes.

Decision Tree Template


Drawn top-to-bottom or left-to-right
Top (or left-most) node = Root Node
Descendent node(s) = Child Node(s)
Bottom (or right-most) node(s) = Leaf Node(s)
Unique path from root to each leaf = Rule

[Diagram: a root node branching to child nodes, which end in leaf nodes]

Divide and Conquer


The tree is fitted to the data by recursive partitioning. Partitioning refers to segmenting the data into subgroups that are as homogeneous as possible with respect to the target. In this case, the binary split (Debt-to-Income Ratio < 45) was chosen. The 5,000 cases were split into two groups, one with a 5% BAD rate and the other with a 21% BAD rate.

Root: n = 5,000, 10% BAD
Split: Debt-to-Income Ratio < 45
  yes: n = 3,350, 5% BAD
  no:  n = 1,650, 21% BAD

The method is recursive because each subgroup results from splitting a subgroup from a previous split. Thus, the 3,350 cases in the left child node and the 1,650 cases in the right child node are split again in similar fashion.
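The one-split tree on this slide can be written as a tiny scoring function. This is only a sketch: the function name is an assumption, and the leaf values are the rounded BAD rates quoted above.

```python
def predict_bad_probability(debt_to_income):
    """One-split tree from the slide; each leaf returns its observed BAD rate."""
    if debt_to_income < 45:
        return 0.05  # "yes" branch, n = 3,350
    return 0.21      # "no" branch, n = 1,650


# The size-weighted average of the leaf rates should be close to the
# 10% BAD rate at the root (the leaf rates are rounded, so it is not exact).
root_rate = (3350 * 0.05 + 1650 * 0.21) / 5000
print(round(root_rate, 4))
```

The small gap between the weighted average and 10% comes from the rounding of the two leaf percentages.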

The Cultivation of Trees

Split Search: Which splits are to be considered?
Splitting Criterion: Which split is best?
Stopping Rule: When should the splitting stop?
Pruning Rule: Should some branches be lopped off?
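The first three questions can be seen working together in a small recursive partitioner. The sketch below is illustrative only: it handles one numeric input and a 0/1 target, uses Gini impurity (1 minus the sum of squared class proportions) as the splitting criterion, and stops on depth, node size, or purity. Pruning is not implemented, and the function names, parameters, and toy data are all assumptions made for this example.

```python
def gini_impurity(labels):
    """1 - sum of squared class proportions; 0.0 means the node is pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - (p1 ** 2 + (1.0 - p1) ** 2)


def best_split(xs, ys):
    """Split search: try every midpoint between sorted distinct x values."""
    values = sorted(set(xs))
    best = None  # (weighted child impurity, threshold)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        score = (len(left) * gini_impurity(left)
                 + len(right) * gini_impurity(right)) / len(ys)
        if best is None or score < best[0]:
            best = (score, t)
    return best


def grow(xs, ys, depth=0, max_depth=2, min_leaf=2):
    """Recursive partitioning with a simple stopping rule."""
    leaf = {"p_bad": sum(ys) / len(ys), "n": len(ys)}
    if depth >= max_depth or len(ys) < 2 * min_leaf or gini_impurity(ys) == 0.0:
        return leaf
    split = best_split(xs, ys)
    if split is None or split[0] >= gini_impurity(ys):
        return leaf  # no candidate split improves purity
    t = split[1]
    left = [(x, y) for x, y in zip(xs, ys) if x < t]
    right = [(x, y) for x, y in zip(xs, ys) if x >= t]
    if len(left) < min_leaf or len(right) < min_leaf:
        return leaf  # avoid creating very small nodes
    return {
        "threshold": t,
        "yes": grow([x for x, _ in left], [y for _, y in left],
                    depth + 1, max_depth, min_leaf),
        "no": grow([x for x, _ in right], [y for _, y in right],
                   depth + 1, max_depth, min_leaf),
    }


# Toy data, made up for illustration: debt-to-income ratio and BAD flag.
ratios = [20, 25, 30, 35, 40, 50, 55, 60]
bad = [0, 0, 0, 0, 0, 1, 1, 1]
tree = grow(ratios, bad)
print(tree)
```

Stopping early with `max_depth` and `min_leaf` corresponds to limiting growth up front; a pruning rule would instead grow a larger tree and remove branches afterwards.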

Splitting Criteria
How is the best split determined? In some situations, the worth of a split is obvious. If the expected target is the same in the child nodes as in the parent node, no improvement was made, and the split is worthless! In contrast, if a split results in pure child nodes, the split is undisputedly best. For classification trees, the three most widely used splitting criteria are based on the Pearson chi-squared test, the Gini index, and entropy. All three measure the difference in class distributions across the child nodes. The three methods usually give similar results.

Splitting Criteria

Debt-to-Income Ratio < 45
            Left   Right   Total
Not Bad     3196    1304    4500
Bad          154     346     500

A Competing Three-Way Split
            Left   Center   Right   Total
Not Bad     2521     1188     791    4500
Bad          115      162     223     500

Perfect Split
            Left   Right   Total
Not Bad     4500       0    4500
Bad            0     500     500
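To compare the three candidate splits numerically, each criterion can be computed from the count tables. The following is a hedged sketch, not the procedure of any particular tool: the function names are assumptions, the perfect-split counts are taken as 4500/0 and 0/500, and Gini is computed here as an impurity (1 minus the sum of squared proportions), so that a larger reduction means a better split.

```python
import math


def chi_squared(table):
    """Pearson chi-squared statistic for a classes-by-branches count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def gini(props):
    """Gini impurity of a class distribution."""
    return 1.0 - sum(p * p for p in props)


def entropy(props):
    """Entropy in bits of a class distribution."""
    return -sum(p * math.log2(p) for p in props if p > 0)


def impurity_reduction(table, impurity):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    n = sum(sum(row) for row in table)
    parent = impurity([sum(row) / n for row in table])
    children = 0.0
    for col in zip(*table):
        size = sum(col)
        children += size / n * impurity([c / size for c in col])
    return parent - children


# Rows: Not Bad, Bad. Columns: branches of the split.
splits = {
    "two-way (Debt-to-Income Ratio < 45)": [[3196, 1304], [154, 346]],
    "three-way": [[2521, 1188, 791], [115, 162, 223]],
    "perfect": [[4500, 0], [0, 500]],
}
for name, table in splits.items():
    print(name,
          round(chi_squared(table), 1),
          round(impurity_reduction(table, gini), 4),
          round(impurity_reduction(table, entropy), 4))
```

For the perfect split the chi-squared statistic equals the sample size (5,000) and the child nodes have zero impurity, which gives an upper bound against which the other two splits can be judged.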

Decision Tree Types


Binary trees: only two choices in each split. Can be nonuniform (uneven) in depth.
N-way trees or ternary trees: three or more choices in at least one of their splits (3-way, 4-way, etc.)

Split Criteria
The best split is defined as one that does the best job of separating the data into groups where a single class predominates in each group.
The measure used to evaluate a potential split is purity.
The best split is one that increases the purity of the subsets by the greatest amount.
A good split also creates nodes of similar size, or at least does not create very small nodes.

Tests for Choosing Best Split


Purity (Diversity) Measures:

Gini (population diversity)
Entropy (information gain)

Gini (Population Diversity)


The Gini measure of a node is the sum of the squares of the proportions of the classes.

Root Node: 0.5^2 + 0.5^2 = 0.5 (even balance)

Leaf Nodes: 0.1^2 + 0.9^2 = 0.82 (close to pure)
Gini Score = 0.5*0.82 + 0.5*0.82 = 0.82 (close to pure)
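The arithmetic above can be checked in a few lines. This sketch follows the slide's definition (sum of squared class proportions, so 1.0 is a pure node); the function names and the equal 0.5 weights for the two leaves are taken from, or assumed to match, the example.

```python
def gini_purity(proportions):
    """Sum of squared class proportions, as defined above (1.0 = pure node)."""
    return sum(p ** 2 for p in proportions)


def gini_score(children):
    """Size-weighted average purity; children is a list of (weight, proportions)."""
    return sum(weight * gini_purity(props) for weight, props in children)


root = gini_purity([0.5, 0.5])
leaf = gini_purity([0.1, 0.9])
score = gini_score([(0.5, [0.1, 0.9]), (0.5, [0.9, 0.1])])
print(root, round(leaf, 2), round(score, 2))
```

The split raises the score from 0.5 at the root to about 0.82, which is the increase in purity that the splitting criterion rewards.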

Decision Tree Advantages


1. Easy to understand
2. Map nicely to a set of business rules
3. Applied to real problems
4. Make no prior assumptions about the data
5. Able to process both numerical and categorical data

Benefits of Trees

Interpretability: tree-structured presentation
Mixed Measurement Scales
Regression trees
Handling of Outliers
Handling of Missing Values


The Right-Sized Tree


Stunting

Pruning


Building and Interpreting Decision Trees

Explore the types of decision tree models available in Enterprise Miner.
Build a decision tree model.
Examine the model results and interpret these results.
Choose a decision threshold theoretically and empirically.


The Scenario

Determine who should be approved for a home equity loan. The target variable is a binary variable that indicates whether an applicant eventually defaulted on the loan. The input variables are variables such as the amount of the loan, amount due on the existing mortgage, the value of the property, and the number of recent credit inquiries.


The HMEQ data set contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates if an applicant eventually defaulted or was seriously delinquent. This adverse outcome occurred in 1,189 cases (20%). For each applicant, 12 input variables were recorded. Presume that every two dollars loaned eventually returns three dollars if the loan is paid off in full.


Accuracy Measures (Classification)

Misclassification error

Error = classifying a record as belonging to one class when it belongs to another class.

Error rate = percent of misclassified records out of the total records in the validation data

Confusion Matrix
Classification Confusion Matrix

                  Predicted Class
Actual Class        1        0
1                 201       85
0                  25     2689

201 1s correctly classified as 1
85 1s incorrectly classified as 0
25 0s incorrectly classified as 1
2689 0s correctly classified as 0

Error Rate
Classification Confusion Matrix

                  Predicted Class
Actual Class        1        0
1                 201       85
0                  25     2689

Overall error rate = (25+85)/3000 = 3.67%
Accuracy = 1 - err = (201+2689)/3000 = 96.33%
If multiple classes, error rate is: (sum of misclassified records)/(total records)
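The same calculation in code, as a minimal sketch. The dictionary layout keyed by (actual, predicted) is an assumption chosen for this example; because it only compares the two labels, it also covers the multi-class case.

```python
# (actual class, predicted class) -> number of validation records
confusion = {(1, 1): 201, (1, 0): 85, (0, 1): 25, (0, 0): 2689}

total = sum(confusion.values())
misclassified = sum(count for (actual, predicted), count in confusion.items()
                    if actual != predicted)
error_rate = misclassified / total
accuracy = 1 - error_rate

print(f"error rate = {error_rate:.2%}, accuracy = {accuracy:.2%}")
```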

Cutoff for classification


Most DM algorithms classify via a 2-step process. For each record:
1. Compute probability of belonging to class 1
2. Compare to cutoff value, and classify accordingly

Default cutoff value is 0.50
If >= 0.50, classify as 1
If < 0.50, classify as 0
Can use different cutoff values
Typically, error rate is lowest for cutoff = 0.50

Cutoff Table
Actual Class   Prob. of "1"     Actual Class   Prob. of "1"
1              0.996            1              0.506
1              0.988            0              0.471
1              0.984            0              0.337
1              0.980            1              0.218
1              0.948            0              0.199
1              0.889            0              0.149
1              0.848            0              0.048
0              0.762            0              0.038
1              0.707            0              0.025
1              0.681            0              0.022
1              0.656            0              0.016
0              0.622            0              0.004

If cutoff is 0.50: thirteen records are classified as 1 (eleven of them are actual 1s)
If cutoff is 0.80: seven records are classified as 1

Confusion Matrix for Different Cutoffs


Cut off Prob.Val. for Success (Updatable): 0.25

Classification Confusion Matrix
                  Predicted Class
Actual Class      owner    non-owner
owner                11            1
non-owner             4            8

Cut off Prob.Val. for Success (Updatable): 0.75

Classification Confusion Matrix
                  Predicted Class
Actual Class      owner    non-owner
owner                 7            5
non-owner             1           11
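The effect of the cutoff can be reproduced from the 24 records in the cutoff table. This is a sketch under one assumption: class 1 in that table is read as "owner". With that reading, the counts at 0.25 and 0.75 agree with the two matrices above. The function name and dictionary layout are illustrative.

```python
actual = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
          1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
prob_1 = [0.996, 0.988, 0.984, 0.980, 0.948, 0.889,
          0.848, 0.762, 0.707, 0.681, 0.656, 0.622,
          0.506, 0.471, 0.337, 0.218, 0.199, 0.149,
          0.048, 0.038, 0.025, 0.022, 0.016, 0.004]


def confusion_at(cutoff):
    """Return counts keyed by (actual, predicted) for the given cutoff."""
    counts = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
    for a, p in zip(actual, prob_1):
        predicted = 1 if p >= cutoff else 0
        counts[(a, predicted)] += 1
    return counts


for cutoff in (0.25, 0.50, 0.75, 0.80):
    c = confusion_at(cutoff)
    classified_as_1 = c[(1, 1)] + c[(0, 1)]
    errors = c[(1, 0)] + c[(0, 1)]
    print(cutoff, "classified as 1:", classified_as_1, "errors:", errors)
```

Raising the cutoff trades false positives for false negatives, which is why the choice of cutoff should reflect the cost of each kind of error.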

[Table: scored data set, rows 1-43. Columns: Default, Loan, Mortgage, Value, Reason, _NODE_, _LEAF_, P_DEFAULT1, P_DEFAULT0, I_DEFAULT, U_DEFAULT, F_DEFAULT, R_DEFAULT1, R_DEFAULT0, _WARN_. In the rows shown, P_DEFAULT1 is either 1.0000 or 0.8282.]

Consequences of a Decision

             Decision 1        Decision 0
Actual 1     True Positive     False Negative
Actual 0     False Positive    True Negative


Consequences of a Decision: Profit matrix (SAS EM)


             Decision 1                    Decision 0
Actual 1     True Positive (Profit = $2)   False Negative
Actual 0     False Positive (Loss = $1)    True Negative


Bayes Rule: Optimal threshold

Optimal threshold = 1 / (1 + (cost of false negative / cost of false positive))


Using the cost structure defined for the home equity example, the optimal threshold is 1/(1+(2/1)) = 1/3. That is, reject all applications whose predicted probability of default exceeds 0.3333.
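A short sketch of the threshold calculation and the decision it implies. The function names are assumptions, and the expected-profit check assumes that Decision 0 earns zero in both cells of the profit matrix, since the slides give no values there.

```python
def optimal_threshold(cost_false_negative, cost_false_positive):
    """Threshold = 1 / (1 + cost of false negative / cost of false positive)."""
    return 1.0 / (1.0 + cost_false_negative / cost_false_positive)


threshold = optimal_threshold(cost_false_negative=2.0, cost_false_positive=1.0)


def expected_profit_decision_1(p_default):
    """Expected profit of Decision 1: +$2 if actual 1, -$1 if actual 0."""
    return 2.0 * p_default - 1.0 * (1.0 - p_default)


def decide(p_default):
    """Decision 1 (reject the application) when p exceeds the threshold."""
    return 1 if p_default > threshold else 0


print(round(threshold, 4), decide(0.8282), decide(0.20))
```

Under that assumption, Decision 1 has positive expected profit exactly when 3p - 1 > 0, that is when p > 1/3, which agrees with the threshold formula.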