Decision Trees

-CHAID & CART Analyses

Pristine www.edupristine.com
Decision Trees Analyses
I. Data Mining and Decision Trees

II. Decision Tree Example

III. CHAID analyses

IV. CART

CHAID Analyses
I. Data Mining and Decision Trees

II. Decision Tree Example

III. CHAID analyses- Method and Algorithms

IV. Using R to run CHAID analysis

V. Case

VI. CHAID Demonstration in SPSS

Data Mining
Data mining is the nontrivial extraction of implicit, previously unknown, and potentially useful
information from data.
Because goals differ, the data mining techniques used vary accordingly. At a high level, data
mining techniques can be classified into:
Directed or
Undirected

Directed
Goal is to predict, estimate, classify, or characterize the behavior of some pre-identified
target variable.
- Classification
- Estimation
- Prediction

Undirected
Goal is to discover structure in the data set as a whole.
- Description & Visualization
- Association Rule or Affinity Grouping
- Clustering

Classification & Decision Tree Analysis
Classification is used to develop a model that maps a data item into one of several predefined
classes.
Decision Tree Analysis
Builds classification and regression trees
Starts with a pre-identified target variable (in other words, the dependent variable). This is the initial node.
The initial node is split into two or more child nodes.
Splitting is based on the statistical analysis used by the decision tree algorithm (CHAID is one such algorithm).

Classification & Decision Tree Analysis- Example
Case: A pizza outlet carried out a promotional campaign and reached out to 1000 individuals in a
locality. Of the 1000 individuals approached, 200 responded by visiting the outlet. The outlet also
collected the personal details of the customers. After the campaign period was over, the outlet carried out
an analysis to classify the customers into various classes:
Decision tree (response rate, sample size)
Response: 20%, 1000
    Married: 25%, 500
        Male: 40%, 300
        Female: 15%, 300
    Single: 15%, 300
    Divorced: 15%, 200
        Pet=no: 40%, 50
        Pet=yes: 6.67%, 150

Segments in order of responsiveness
1. Married Male: response rate 40%, sample size 300
2. Divorced with no pets: response rate 40%, sample size 50
3. Married Female: response rate 15%, sample size 300
4. Singles: response rate 15%, sample size 300
5. Divorced with pets: response rate 6.67%, sample size 150
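
A quick consistency check on the first split of this tree: the response rate of a parent node should equal the size-weighted average of the rates of its child nodes. The sketch below is only an illustration; the function name `pooled_rate` is an assumption and does not come from the slides.

```python
def pooled_rate(children):
    """Size-weighted average rate of (rate, sample_size) child nodes."""
    total = sum(size for _, size in children)
    responders = sum(rate * size for rate, size in children)
    return responders / total

# First-level split by marital status: (response rate, sample size)
# Married, Single, Divorced
marital_split = [(0.25, 500), (0.15, 300), (0.15, 200)]

overall = pooled_rate(marital_split)
print(f"Overall response rate: {overall:.1%}")
```

The result matches the 20% response rate of the initial node (200 responders out of 1000).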
Decision Tree Algorithms
CHAID (Chi square Automatic Interaction Detector)
C&RT (Classification and Regression Tree)

QUEST (Quick Unbiased Efficient Statistical Test)

CHAID Analyses
I. CHAID analyses- Method and Algorithms

II. Using R to run CHAID analysis

III. Case

CHAID Method and Components
CHAID was originally designed to handle categorical variables only.

Data mining tools such as SAS E-Miner, R, and SPSS have extended the algorithm to handle nominal, ordinal,
and continuous dependent variables.

One or more predictor variables. Predictor variables can be continuous, ordinal, or nominal.

One target variable. The target variable must be categorical.

CHAID Algorithms
A CHAID tree is a decision tree that is constructed by repeatedly splitting subsets of the space into two or more
child nodes, beginning with the entire data set.
To determine the best split at any node, CHAID merges any allowable pair of categories of the
predictor variable (the set of allowable pairs is determined by the type of predictor variable being
studied) if there is no statistically significant difference within the pair with respect to the target
variable.
The process is repeated until no non-significant pair is found.
The resulting set of categories of the predictor variable is the best split with respect to that predictor
variable.
This process is followed for all predictor variables.
The predictor whose split gives the best prediction is selected, and the node is split on it.
The process repeats recursively until one of the stopping rules is triggered.
The significance of a split is tested by means of a chi-square test on the contingency table of
predictor categories against the target variable.
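
The pairwise merge test can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by R or SPSS: it tests two categories at a time against a binary target with a Pearson chi-square statistic (1 degree of freedom), using the Credit_History counts from the bivariate analysis of the case. The merge threshold `ALPHA_MERGE = 0.05` and the function name are assumptions.

```python
import math

def chi_square_2x2(a_default, a_total, b_default, b_total):
    """Pearson chi-square statistic and p-value (1 d.f.) for two categories
    of a predictor against a binary target."""
    observed = [
        [a_default, a_total - a_default],
        [b_default, b_total - b_default],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

ALPHA_MERGE = 0.05  # assumed merge threshold, not stated in the slides

# Credit_History counts for the whole data set: (defaults, observations)
a32 = (840, 2650)
a33 = (140, 440)
a34 = (250, 1465)

stat_similar, p_similar = chi_square_2x2(*a32, *a33)
stat_different, p_different = chi_square_2x2(*a32, *a34)

merge_a32_a33 = p_similar > ALPHA_MERGE    # no significant difference: merge
merge_a32_a34 = p_different > ALPHA_MERGE  # significant difference: keep apart
print(merge_a32_a33, merge_a32_a34)
```

On the full data set, A32 and A33 have nearly identical default rates and would be merge candidates, while A32 and A34 differ significantly. Inside a tree the same test is applied to the records of each node, so the merged groups can differ from node to node.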

Case- Using CHAID analysis to identify customer segments who
default on payment due
Romanov, an analytics consultant, works with Credit One Bank. His manager gave him data
containing credit and personal information for a group of customers. Some of the customers had
defaulted on payment of the balance due. The manager asked him to identify the classes of
customers with a higher-than-average default rate using a decision tree. Romanov has no
knowledge of running a CHAID analysis.
Now suppose he approaches you and requests your help to complete the assignment. Let's
help Romanov solve the problem.

Case- CHAID analysis
In the course of helping Romanov complete his task, we will walk him through the following steps:
Variable identification
    Identifying the dependent (response) variable.
    Identifying the independent (explanatory) variables.
    Variable categorization (e.g. Numeric, Categorical, Discrete, Continuous etc.)
    Creation of Data Dictionary
Response variable exploration
    Distribution analysis
    Outlier treatment
        Not required for Binary variables
Running the CHAID analysis using R
    Importing data in R
    Selecting the variables
    Running the analysis
    Interpreting the results

Understanding the Data
Response variable
    Default_on_Payment
    Binary in nature
    Takes (0,1) as values corresponding to (no default, default)
Independent variables (also called attributes in the credit industry)
    20 attributes
    3 numerical and 17 categorical
For details refer to
    Data dictionary: Data Dictionary- Analysis_of_Default.docx
    Data file: Analysis_of_Default.xlsx

Response variable analysis
Frequency distribution

Default_On_Payment # Observations
0 3505
1 1495
Total 5000

Event rate (for a Binary response variable)
Proportion of 1s in the total number of observations.
Here, the Default_On_Payment rate is the event rate.
Event Rate = 1495/5000 = 29.9%
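
The same arithmetic as a short Python sketch, using the frequency distribution above (the variable names are illustrative):

```python
# Default_On_Payment frequency distribution: value -> number of observations
counts = {0: 3505, 1: 1495}

event_rate = counts[1] / sum(counts.values())
print(f"Event rate = {event_rate:.1%}")  # Event rate = 29.9%
```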

Bivariate Analysis (of Independent variables)
1. Status_Checking_Acc

Status_Checking_Acc           # Observations   Default_On_Payment   Default Rate
A11                           1370             675                  49.27%
A12                           1345             520                  38.66%
A13                           315              70                   22.22%
A14                           1970             230                  11.68%

2. Duration_in_Months

Duration_in_Months            # Observations   Default_On_Payment   Default Rate
12                            1795             380                  21.2%
24                            2055             610                  29.7%
36                            715              285                  39.9%
48                            355              180                  50.7%
60                            80               40                   50.0%
72                            5                5                    100.0%

These are transformed values. For original values refer to Analysis_of_Default.xlsx

3. Credit_History

Credit_History                # Observations   Default_On_Payment   Default Rate
A30                           200              125                  62.50%
A31                           245              140                  57.14%
A32                           2650             840                  31.70%
A33                           440              140                  31.82%
A34                           1465             250                  17.06%

Bivariate Analysis (of Independent variables)
4. Purposre_Credit_Taken

Purposre_Credit_Taken         # Observations   Default_On_Payment   Default Rate   Description
A40                           1170             445                  38.03%         car (new)
A41                           515              85                   16.50%         car (used)
A42                           905              290                  32.04%         furniture/equipment
A43                           1400             305                  21.79%         radio/television
A44                           60               20                   33.33%         domestic appliances
A45                           110              40                   36.36%         repairs
A46                           250              110                  44.00%         education
A48                           45               5                    11.11%         retraining
A49                           485              170                  35.05%         business
A410                          60               25                   41.67%         others

A47: (vacation - does not exist?)

5. Credit_Amount

Credit_Amount                 # Observations   Default_On_Payment   Default Rate
4000                          3770             975                  25.86%
11000                         1085             420                  38.71%
12000                         145              100                  68.97%

These are transformed values. For original values refer to Analysis_of_Default.xlsx

6. Savings_Acc

Savings_Acc                   # Observations   Default_On_Payment   Default Rate
A61                           3015             1080                 35.82%
A62                           515              170                  33.01%
A63                           315              55                   17.46%
A64                           240              30                   12.50%
A65                           915              160                  17.49%

Bivariate Analysis (of Independent variables)
7. Years_At_Present_Employment

Years_At_Present_Employment   # Observations   Default_On_Payment   Default Rate
A71                           310              115                  37.10%
A72                           860              350                  40.70%
A73                           1695             515                  30.38%
A74                           870              195                  22.41%
A75                           1265             320                  25.30%

8. Inst_Rt_Income

Inst_Rt_Income                # Observations   Default_On_Payment   Default Rate
1                             680              170                  25.00%
2                             1155             305                  26.41%
3                             785              225                  28.66%
4                             2380             795                  33.40%

9. Marital_Status_Gender

Marital_Status_Gender         # Observations   Default_On_Payment   Default Rate
A91                           250              100                  40.00%
A92                           1550             540                  34.84%
A93                           2740             730                  26.64%
A94                           460              125                  27.17%

10. Other_Debtors_Guarantors

Other_Debtors_Guarantors      # Observations   Default_On_Payment   Default Rate
A101                          4535             1355                 29.88%
A102                          205              90                   43.90%
A103                          260              50                   19.23%

Bivariate Analysis (of Independent variables)
11. Current_Address_Yrs

Current_Address_Yrs           # Observations   Default_On_Payment   Default Rate
1                             650              180                  27.69%
2                             1540             480                  31.17%
3                             745              215                  28.86%
4                             2065             620                  30.02%

12. Property

Property                      # Observations   Default_On_Payment   Default Rate
A121                          1410             295                  20.92%
A122                          1160             355                  30.60%
A123                          1660             510                  30.72%
A124                          770              335                  43.51%

13. Age: Refer to Analysis_of_Default.xlsx

14. Other_Inst_Plans

Other_Inst_Plans              # Observations   Default_On_Payment   Default Rate
A141                          695              285                  41.01%
A142                          235              95                   40.43%
A143                          4070             1115                 27.40%

15. Housing

Housing                       # Observations   Default_On_Payment   Default Rate
A151                          895              350                  39.11%
A152                          3565             925                  25.95%
A153                          540              220                  40.74%

Bivariate Analysis (of Independent variables)
16. Num_CC

Num_CC                        # Observations   Default_On_Payment   Default Rate
1                             3165             995                  31.44%
2                             1665             460                  27.63%
3                             140              30                   21.43%
4                             30               10                   33.33%

17. Job

Job                           # Observations   Default_On_Payment   Default Rate
A171                          110              35                   31.82%
A172                          1000             280                  28.00%
A173                          3150             925                  29.37%
A174                          740              255                  34.46%

18. Dependents

Dependents                    # Observations   Default_On_Payment   Default Rate
1                             4225             1265                 29.94%
2                             775              230                  29.68%

19. Telephone

Telephone                     # Observations   Default_On_Payment   Default Rate
A191                          2980             930                  31.21%
A192                          2020             565                  27.97%

20. Foreign_Worker

Foreign_Worker                # Observations   Default_On_Payment   Default Rate
A201                          4815             1475                 30.63%
A202                          185              20                   10.81%
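
Each bivariate table follows the same recipe: divide the number of defaults by the number of observations in every category and compare the result with the overall default rate. A minimal Python sketch using the Status_Checking_Acc counts (the helper name `default_rates` is an assumption for illustration):

```python
def default_rates(table):
    """Return {category: default rate} from {category: (observations, defaults)}."""
    return {cat: defaults / obs for cat, (obs, defaults) in table.items()}

# Status_Checking_Acc: category -> (# observations, # defaults)
status_checking_acc = {
    "A11": (1370, 675),
    "A12": (1345, 520),
    "A13": (315, 70),
    "A14": (1970, 230),
}

rates = default_rates(status_checking_acc)
total_obs = sum(obs for obs, _ in status_checking_acc.values())
total_defaults = sum(d for _, d in status_checking_acc.values())
overall = total_defaults / total_obs

# List categories from highest to lowest default rate.
for cat, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    flag = "above" if rate > overall else "below"
    print(f"{cat}: {rate:.2%} ({flag} the overall rate of {overall:.1%})")
```

Categories whose rate is far from the overall rate, such as A11 and A14 here, are the ones most likely to drive a split in the tree.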

Code for CHAID

R outputs for CHAID

For better preview see the file saved as pdf through R

CHAID Analysis Demonstration in SPSS

Preparing SPSS to run CHAID analysis
[Screenshots with numbered steps 1, 2, 5, 6, 7, 8]

Running a simple CHAID analysis- Summary

Running a simple CHAID analysis - Tree
Branches in descending order of default rate:
1. A11- (A33; A30; A31) {rate = 74.5%, sample = 235}
2. A12- (A30; A31) {rate = 63.6%, sample = 165}
3. A11- A32 {rate = 51.2%, sample = 800}
4. A12- A32 {rate = 38.4%, sample = 730}
5. A12- (A34; A33) {rate = 30%, sample = 450}
6. A11- A34 {rate = 26.9%, sample = 335}
7. A14- (A33; A30; A31) {rate = 24.1%, sample = 270}
8. A13 {rate = 22.2%, sample = 315}
9. A14- A32 {rate = 12.3%, sample = 935}
10. A14- A34 {rate = 6.5%, sample = 765}

Running a simple CHAID analysis Tree Table

CHAID Analysis Result in Decision making

Group1 (Customers having Default Rate Higher than Overall Average)
A11- (A33; A30; A31) {rate = 74.5%, sample = 235}
A12- (A30; A31) {rate = 63.6%, sample = 165}
A11- A32 {rate = 51.2%, sample = 800}
A12- A32 {rate = 38.4%, sample = 730}
A12- (A34; A33) {rate = 30%, sample = 450}

Group2 (Customers having Default Rate Lower than Overall Average)
A11- A34 {rate = 26.9%, sample = 335}
A14- (A33; A30; A31) {rate = 24.1%, sample = 270}
A13 {rate = 22.2%, sample = 315}
A14- A32 {rate = 12.3%, sample = 935}
A14- A34 {rate = 6.5%, sample = 765}

Management should focus on lending credit to customers falling in Group2.
Group1 customers have higher risk than average. Management should improve its risk
management strategy to manage these customers.
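
The grouping amounts to comparing each terminal branch with the overall default rate of 29.9%. A minimal Python sketch of that rule, using the ten branches from the tree (the variable names are illustrative, and putting a branch exactly at the average into Group2 is an assumption):

```python
OVERALL_DEFAULT_RATE = 0.299  # 1495 / 5000 from the response variable analysis

# (segment, default rate, sample size) for the ten terminal branches
branches = [
    ("A11- (A33; A30; A31)", 0.745, 235),
    ("A12- (A30; A31)", 0.636, 165),
    ("A11- A32", 0.512, 800),
    ("A12- A32", 0.384, 730),
    ("A12- (A34; A33)", 0.300, 450),
    ("A11- A34", 0.269, 335),
    ("A14- (A33; A30; A31)", 0.241, 270),
    ("A13", 0.222, 315),
    ("A14- A32", 0.123, 935),
    ("A14- A34", 0.065, 765),
]

group1 = [b for b in branches if b[1] > OVERALL_DEFAULT_RATE]   # higher risk
group2 = [b for b in branches if b[1] <= OVERALL_DEFAULT_RATE]  # lower risk

# The branches partition the data set, so their sizes should add up to 5000.
total_sample = sum(size for _, _, size in branches)
print(len(group1), len(group2), total_sample)
```

Note that the branch A12- (A34; A33) sits at 30%, only just above the overall average, so its group membership is sensitive to rounding.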

CART Analyses
I. CART analyses- Method and Algorithms

II. Using R to run CART analysis

III. Case

IV. CART Demonstration using SPSS

CART Method and Components
A non-parametric technique, using the methodology of tree building.

Data mining tools such as SAS E-Miner, R, and SPSS have extended the algorithm to handle nominal, ordinal,
and continuous dependent variables.

One or more predictor variables. Predictor variables can be continuous, ordinal, or nominal.

One target variable. The target variable can be categorical or continuous.

CART Algorithms
It makes use of recursive partitioning:
Take all of your data.
Consider all possible values of all variables.
Select the variable/value (X = t1) that produces the greatest separation in the target.
(X = t1) is called a split.
If X < t1, send the data point to the left; otherwise, send it to the right.
Now repeat the same process on these two nodes.
You get a tree.
Note: CART only uses binary splits (unlike CHAID, which produces non-binary trees as well).
Separation can be defined in many ways:
Regression trees (continuous target): use the sum of squared errors.
Classification trees (categorical target): choice of entropy, Gini measure, or twoing splitting rule.
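
One step of this split search can be sketched in Python with the Gini measure. This is a minimal illustration under stated assumptions, not the algorithm as implemented in R or SPSS: it uses the transformed Credit_Amount values from the bivariate analysis of the case, treats each value as a bin, and sends records with values up to and including the cut to the left node. The function names are hypothetical.

```python
def gini(defaults, total):
    """Gini impurity of a node with a binary target."""
    p = defaults / total
    return 2 * p * (1 - p)

def best_binary_split(bins):
    """bins: ordered list of (value, observations, defaults).
    Try a cut after each bin except the last and return (cut value,
    size-weighted Gini impurity of the two child nodes) for the best cut."""
    n_total = sum(obs for _, obs, _ in bins)
    best = None
    for k in range(1, len(bins)):
        left, right = bins[:k], bins[k:]
        n_left = sum(obs for _, obs, _ in left)
        d_left = sum(d for _, _, d in left)
        n_right = sum(obs for _, obs, _ in right)
        d_right = sum(d for _, _, d in right)
        weighted = (n_left * gini(d_left, n_left)
                    + n_right * gini(d_right, n_right)) / n_total
        if best is None or weighted < best[1]:
            best = (left[-1][0], weighted)
    return best

# Credit_Amount (transformed values): (value, # observations, # defaults)
credit_amount = [(4000, 3770, 975), (11000, 1085, 420), (12000, 145, 100)]

parent_impurity = gini(1495, 5000)
cut, child_impurity = best_binary_split(credit_amount)
print(f"Best cut: Credit_Amount <= {cut}")
print(f"Impurity reduction: {parent_impurity - child_impurity:.4f}")
```

A full CART run would repeat this search over every variable, pick the split with the largest impurity reduction, and then recurse into the two child nodes.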

Some differences between CART and CHAID
Dependent variable for CHAID must be categorical; for CART it can be metric
Different splitting algorithm (e.g., CHAID uses a Chi-squared test using contingency tables)
CHAID splits into multiple groups, CART makes binary splits
Different stopping criteria

Case- Using CART analysis to identify customer segments who
default on payment due
Romanov, an analytics consultant, works with Credit One Bank. His manager gave him data
containing credit and personal information for a group of customers. Some of the customers had
defaulted on payment of the balance due. The manager asked him to identify the classes of
customers with a higher-than-average default rate using a decision tree. Romanov has no
knowledge of running a CART analysis.
Now suppose he approaches you and requests your help to complete the assignment. Let's
help Romanov solve the problem.

Case- CART analysis
In the course of helping Romanov complete his task, we will walk him through the following steps:
Variable identification
    Identifying the dependent (response) variable.
    Identifying the independent (explanatory) variables.
    Variable categorization (e.g. Numeric, Categorical, Discrete, Continuous etc.)
    Creation of Data Dictionary
Response variable exploration
    Distribution analysis
    Outlier treatment
        Not required for Binary variables
Running the CART analysis using R
    Importing data in R
    Selecting the variables
    Running the analysis
    Interpreting the results

R Code for CART-Regression Tree

R Outputs for Regression tree
Output

R Outputs for Regression tree

For better preview see the file saved as pdf through R

R Code for CART-Classification Tree

R Outputs for Classification tree

R Outputs for Classification tree

For better preview see the file saved as pdf through R

CART Analysis Demonstration in SPSS

Preparing SPSS to run CART analysis
[Screenshots with numbered steps 1, 2, 5, 6, 7, 8]

Running a simple CART analysis- Summary

Running a simple CART analysis Tree (Compressed View)
Use this tree result along with the tabular result on the next slide.

Terminal nodes with default rate less than the overall average of 29.9%:
Node 6: 25.8%
Node 12: 20%
Node 13: 26.9%
Node 17: 6.5%
Node 18: 12.3%
Node 22: 25.7%

Terminal nodes with default rate greater than the overall average of 29.9%:
Node 14: 32.7%
Node 19: 51.2%
Node 20: 75%
Node 21: 38.4%

Running a simple CART analysis Table

Running a simple CART analysis Tree (Expanded View -1)

Running a simple CART analysis Tree (Expanded View- 2)

Running a simple CART analysis Comparison with CHAID result

CART

CHAID

For a simple analysis using only two variables (Status_Checking_Account and Credit_History), CART and
CHAID produce results with the same level of accuracy.

CART vs. CHAID Comparison- Using all the available variables
1 CHAID 2 CART

CART vs. CHAID Comparison- Using all the available variables

CART

CHAID

CHAID produces more accurate results than CART (79.3% vs 77.4%).

When to use CART and when to use CHAID
Depends on the type of response variable:
CART when the response is numeric (metric) or binary

CHAID when the response is binary (provided it produces a better result than CART)

CHAID should be used when the goal is to describe or understand the relationship between a
response variable and a set of explanatory variables.
CART is better suited for creating a model that has high prediction accuracy of new cases.
