Pristine www.edupristine.com
Decision Trees Analyses
I. Data Mining and Decision Trees
IV. CART
CHAID Analyses
I. Data Mining and Decision Trees
V. Case
Data Mining
Data mining is the nontrivial extraction of implicit, previously unknown, and potentially useful
information from data.
Because goals differ, the data mining techniques vary accordingly. At a high level, data
mining techniques can be classified as:
Directed
Undirected
Classification & Decision Tree Analysis
Classification is used to develop a model that maps a data item into one of several predefined
classes.
Decision Tree Analysis
Builds classification and regression trees
Starts with a pre-identified target variable (in other words, the dependent variable).
This is the initial node. The initial node is split into two or more child nodes.
Splitting is based on the statistical analysis used by the decision tree algorithm (CHAID is one such algorithm).
Classification & Decision Tree Analysis- Example
Case: A pizza outlet carried out a promotional campaign and reached out to 1000 individuals in a
locality. Out of the 1000 individuals approached, 200 responded by visiting the outlet. The outlet also
collected personal details of each customer. After the campaign period was over, the outlet carried out
an analysis in order to classify the customers into various classes:
Decision tree (response rate, sample size):
All individuals approached: 20%, 1000
  Married: 25%, 500
    Male: 40%, 300
    Female: 15%, 300
  Single: 15%, 300
  Divorced: 15%, 200
    Pet = no: 40%, 50
    Pet = yes: 6.67%, 150

Segments in order of responsiveness:
1. Married Male: response rate 40%, sample size 300
2. Divorced with no pets: response rate 40%, sample size 50
3. Married Female: response rate 15%, sample size 300
4. Singles: response rate 15%, sample size 300
5. Divorced with pets: response rate 6.67%, sample size 150
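The ordering of segments can be reproduced with a short sketch. The rates and sample sizes are copied from the example; the list layout, the function name, and the tie-break on sample size are illustrative assumptions, not part of the original analysis.

```python
# Leaf segments of the pizza-campaign tree: (name, response rate, sample size).
# Figures are taken from the example; the data layout is an assumption.
segments = [
    ("Married Female", 0.15, 300),
    ("Singles", 0.15, 300),
    ("Divorced with pets", 0.0667, 150),
    ("Married Male", 0.40, 300),
    ("Divorced with no pets", 0.40, 50),
]


def rank_segments(rows):
    """Sort by response rate, highest first; ties go to the larger sample."""
    return sorted(rows, key=lambda row: (-row[1], -row[2]))


ranked = rank_segments(segments)
for position, (name, rate, size) in enumerate(ranked, start=1):
    print(f"{position}. {name}: response rate {rate:.2%}, sample size {size}")
```

Ranking by rate alone ignores how many customers a segment contains, which is why the sample size is carried along with every segment.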
Decision Tree Algorithms
CHAID (Chi-squared Automatic Interaction Detection)
C&RT (Classification and Regression Trees)
CHAID Analyses
I. CHAID analyses- Method and Algorithms
III. Case
CHAID Method and Components
CHAID was originally designed to handle categorical variables only.
Data mining tools such as SAS E-Miner, R, and SPSS have extended the algorithm to handle nominal, ordinal,
and continuous dependent variables.
It takes one or more predictor variables. Predictor variables can be continuous, ordinal, or nominal.
CHAID Algorithms
A CHAID tree is a decision tree that is constructed by repeatedly splitting subsets of the space into two or more
child nodes, beginning with the entire data set.
To determine the best split at any node, CHAID merges any allowable pair of categories of the
predictor variable (the set of allowable pairs is determined by the type of predictor variable being
studied) if there is no statistically significant difference within the pair with respect to the target
variable.
The process is repeated until no non-significant pair is found.
The resulting set of categories of the predictor variable is the best split with respect to that predictor
variable.
This process is followed for all predictor variables.
The split that gives the best prediction is selected, and the node is split.
The process repeats recursively until one of the stopping rules is triggered.
The significance of a split is tested by means of a chi-square test on the corresponding
contingency table.
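The pairwise merge test described above can be sketched in plain Python. This is a minimal illustration, not the implementation used by R or SPSS: it treats the predictor as nominal (every pair of categories is allowed), handles only a binary target (so each pairwise table is 2 x 2 with one degree of freedom), and assumes a significance level of 0.05. The counts are the Status_Checking_Acc figures from the credit default case later in this deck; the function names are hypothetical.

```python
import math
from itertools import combinations

ALPHA = 0.05  # assumed significance level for merging


def chi_square_2x2(pos_a, neg_a, pos_b, neg_b):
    """Pearson chi-square statistic and p-value (1 degree of freedom)."""
    table = [[pos_a, neg_a], [pos_b, neg_b]]
    row_totals = [pos_a + neg_a, pos_b + neg_b]
    col_totals = [pos_a + pos_b, neg_a + neg_b]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def least_different_pair(counts):
    """Return (p_value, pair) for the pair of categories with the largest p-value."""
    best = None
    for a, b in combinations(counts, 2):
        _, p_value = chi_square_2x2(*counts[a], *counts[b])
        if best is None or p_value > best[0]:
            best = (p_value, (a, b))
    return best


# (defaults, non-defaults) per category of Status_Checking_Acc.
status_counts = {
    "A11": (675, 695),
    "A12": (520, 825),
    "A13": (70, 245),
    "A14": (230, 1740),
}

p_max, pair = least_different_pair(status_counts)
if p_max > ALPHA:
    print(f"Merge {pair}: no significant difference (p = {p_max:.3g})")
else:
    print(f"No merge: even the closest pair {pair} differs (p = {p_max:.3g})")
```

In a full CHAID run this step would repeat after every merge until no non-significant pair remains, and an ordinal predictor would restrict the candidate pairs to adjacent categories.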
Case- Using CHAID analysis to identify customer segments who
default on payment due
Romanov, an analytics consultant, works with Credit One Bank. His manager gave him data
containing credit and personal information for a group of customers. Some of the customers had
defaulted on payment of the balance due. The manager asked him to identify the classes of
customers with a higher-than-average default rate using a decision tree. Romanov has no
knowledge of running a CHAID analysis.
Now suppose he approaches you and requests your help to complete the assignment. Let's
help Romanov solve the problem.
Case- CHAID analysis
In the course of helping Romanov complete his task, we will walk him through the following steps:
Variable identification
  Identifying the dependent (response) variable
  Identifying the independent (explanatory) variables
  Variable categorization (e.g. numeric, categorical, discrete, continuous)
  Creation of a data dictionary
Response variable exploration
  Distribution analysis
  Outlier treatment (not required for binary variables)
Running the CHAID analysis using R
  Importing data in R
  Selecting the variables
  Running the analysis
  Interpreting the results
Understanding the Data
Response variable
  Default_On_Payment
  Binary in nature
  Takes (0, 1) as values, corresponding to (no default, default)
Independent variables (also called attributes in the credit industry)
  20 attributes
  3 numerical and 17 categorical
For details, refer to:
  Data dictionary: Data Dictionary- Analysis_of_Default.docx
  Data file: Analysis_of_Default.xlsx
Response variable analysis
Frequency distribution
Default_On_Payment # Observations
0 3505
1 1495
Total 5000
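The overall default rate follows directly from this frequency table. A minimal sketch, with the counts copied from the table and illustrative variable names:

```python
# Frequency distribution of Default_On_Payment (0 = no default, 1 = default).
counts = {0: 3505, 1: 1495}

total = sum(counts.values())
default_rate = counts[1] / total

print(f"Observations: {total}")
print(f"Overall default rate: {default_rate:.1%}")  # 29.9%
```

This 29.9% is the benchmark against which individual customer segments are later compared.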
Bivariate Analysis (of Independent variables)
1. Status_Checking_Acc
Category   # Observations   # Defaults   Default Rate
A11        1370             675          49.27%
A12        1345             520          38.66%
A13        315              70           22.22%
A14        1970             230          11.68%
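The default rate column is simply defaults divided by observations for each category. The sketch below recomputes it and flags the categories above the overall rate; the counts come from the table, while the dictionary layout and function name are assumptions made for illustration.

```python
# Status_Checking_Acc: category -> (observations, defaults), from the table.
status_checking_acc = {
    "A11": (1370, 675),
    "A12": (1345, 520),
    "A13": (315, 70),
    "A14": (1970, 230),
}


def default_rates(table):
    """Default rate per category: defaults / observations."""
    return {cat: defaults / obs for cat, (obs, defaults) in table.items()}


rates = default_rates(status_checking_acc)
overall = (sum(d for _, d in status_checking_acc.values())
           / sum(o for o, _ in status_checking_acc.values()))
above_average = [cat for cat, rate in rates.items() if rate > overall]

for cat, rate in rates.items():
    print(f"{cat}: {rate:.2%}")
print(f"Overall: {overall:.2%}; above average: {above_average}")
```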
Bivariate Analysis (of Independent variables)
4. Purposre_Credit_Taken
Code   Description           # Observations   # Defaults   Default Rate
A40    car (new)             1170             445          38.03%
A41    car (used)            515              85           16.50%
A42    furniture/equipment   905              290          32.04%
A43    radio/television      1400             305          21.79%
A44    domestic appliances   60               20           33.33%
A45    repairs               110              40           36.36%
A46    education             250              110          44.00%
A48    retraining            45               5            11.11%
A49    business              485              170          35.05%
A410   others                60               25           41.67%
A47: (vacation - does not exist?)

5. Credit_Amount
These are transformed values. For original values refer to Analysis_of_Default.xlsx.
Value   # Observations   # Defaults   Default Rate
4000    3770             975          25.86%
11000   1085             420          38.71%
12000   145              100          68.97%
Bivariate Analysis (of Independent variables)
7. Years_At_Present_Employment
Category   # Observations   # Defaults   Default Rate
A71        310              115          37.10%
A72        860              350          40.70%
A73        1695             515          30.38%
A74        870              195          22.41%
A75        1265             320          25.30%
Bivariate Analysis (of Independent variables)
11. Current_Address_Yrs
Value   # Observations   # Defaults   Default Rate
1       650              180          27.69%
2       1540             480          31.17%
3       745              215          28.86%
4       2065             620          30.02%

13. Age: refer to Analysis_of_Default.xlsx

14. Other_Inst_Plans
Category   # Observations   # Defaults   Default Rate
A141       695              285          41.01%
A142       235              95           40.43%
A143       4070             1115         27.40%
Bivariate Analysis (of Independent variables)
16. Num_CC
Value   # Observations   # Defaults   Default Rate
1       3165             995          31.44%
2       1665             460          27.63%
3       140              30           21.43%
4       30               10           33.33%
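One way to compare how strongly each predictor is associated with default is to compute a chi-square statistic on its category-by-default table and scale it to Cramér's V. This is an illustrative addition rather than a step from the original analysis: the counts are copied from the bivariate tables, while the choice of Cramér's V and the function names are assumptions.

```python
import math


def chi_square(table):
    """Pearson chi-square for a k x 2 table given as (observations, defaults) rows."""
    n = sum(obs for obs, _ in table)
    total_defaults = sum(dft for _, dft in table)
    stat = 0.0
    for obs, dft in table:
        # Two cells per row: defaults and non-defaults.
        for observed, col_total in ((dft, total_defaults),
                                    (obs - dft, n - total_defaults)):
            expected = obs * col_total / n
            stat += (observed - expected) ** 2 / expected
    return stat


def cramers_v(table):
    """Cramér's V; for a k x 2 table, min(rows - 1, cols - 1) equals 1."""
    n = sum(obs for obs, _ in table)
    return math.sqrt(chi_square(table) / n)


predictors = {
    "Status_Checking_Acc": [(1370, 675), (1345, 520), (315, 70), (1970, 230)],
    "Years_At_Present_Employment": [(310, 115), (860, 350), (1695, 515),
                                    (870, 195), (1265, 320)],
    "Current_Address_Yrs": [(650, 180), (1540, 480), (745, 215), (2065, 620)],
    "Other_Inst_Plans": [(695, 285), (235, 95), (4070, 1115)],
    "Num_CC": [(3165, 995), (1665, 460), (140, 30), (30, 10)],
}

ranking = sorted(predictors, key=lambda name: cramers_v(predictors[name]),
                 reverse=True)
for name in ranking:
    print(f"{name}: V = {cramers_v(predictors[name]):.3f}")
```

A predictor whose categories show nearly identical default rates, such as Current_Address_Yrs, ends up near zero on this scale.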
Code for CHAID
R outputs for CHAID
CHAID Analysis Demonstration in SPSS
Preparing SPSS to run CHAID analysis
Running a simple CHAID analysis- Summary
Running a simple CHAID analysis - Tree
Branches in descending order of default rate:
1. A11- (A33; A30; A31) {rate = 74.5%, sample = 235}
2. A12- (A30; A31) {rate = 63.6%, sample = 165}
3. A11- A32 {rate = 51.2%, sample = 800}
4. A12- A32 {rate = 38.4%, sample = 730}
5. A12- (A34; A33) {rate = 30%, sample = 450}
6. A11- A34 {rate = 26.9%, sample = 335}
7. A14- (A33; A30; A31) {rate = 24.1%, sample = 270}
8. A13 {rate = 22.2%, sample = 315}
9. A14- A32 {rate = 12.3%, sample = 935}
10. A14- A34 {rate = 6.5%, sample = 765}
Running a simple CHAID analysis Tree Table
CHAID Analysis Result in Decision making
Group 1 (customers with a default rate higher than the overall average):
A11- (A33; A30; A31) {rate = 74.5%, sample = 235}
A12- (A30; A31) {rate = 63.6%, sample = 165}
A11- A32 {rate = 51.2%, sample = 800}
A12- A32 {rate = 38.4%, sample = 730}
A12- (A34; A33) {rate = 30%, sample = 450}

Group 2 (customers with a default rate lower than the overall average):
A11- A34 {rate = 26.9%, sample = 335}
A14- (A33; A30; A31) {rate = 24.1%, sample = 270}
A13 {rate = 22.2%, sample = 315}
A14- A32 {rate = 12.3%, sample = 935}
A14- A34 {rate = 6.5%, sample = 765}
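The split into the two groups can be checked with a small sketch that compares each terminal branch with the overall default rate (1495 defaults out of 5000, i.e. 29.9%). The rates and sample sizes are the ones listed for the tree; the tuple layout and variable names are illustrative assumptions.

```python
# Terminal branches of the CHAID tree: (label, default rate in %, sample size).
leaves = [
    ("A11- (A33; A30; A31)", 74.5, 235),
    ("A12- (A30; A31)", 63.6, 165),
    ("A11- A32", 51.2, 800),
    ("A12- A32", 38.4, 730),
    ("A12- (A34; A33)", 30.0, 450),
    ("A11- A34", 26.9, 335),
    ("A14- (A33; A30; A31)", 24.1, 270),
    ("A13", 22.2, 315),
    ("A14- A32", 12.3, 935),
    ("A14- A34", 6.5, 765),
]

overall_rate = 1495 / 5000 * 100  # overall default rate in %

group1 = [leaf for leaf in leaves if leaf[1] > overall_rate]   # higher than average
group2 = [leaf for leaf in leaves if leaf[1] <= overall_rate]  # lower than average

# Sanity check: the sample-weighted leaf rates should recover the overall rate.
total_sample = sum(size for _, _, size in leaves)
weighted_rate = sum(rate * size for _, rate, size in leaves) / total_sample

print(f"Group 1: {[label for label, _, _ in group1]}")
print(f"Group 2: {[label for label, _, _ in group2]}")
print(f"Total sample: {total_sample}, weighted rate: {weighted_rate:.1f}%")
```

The branch A12- (A34; A33) sits at 30%, only marginally above the 29.9% average, so its membership in Group 1 is sensitive to rounding.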
CART Analyses
I. CART analyses- Method and Algorithms
III. Case
CART Method and Components
A non-parametric technique that uses the methodology of tree building.
Data mining tools such as SAS E-Miner, R, and SPSS have extended the algorithm to handle nominal, ordinal,
and continuous dependent variables.
It takes one or more predictor variables. Predictor variables can be continuous, ordinal, or nominal.
CART Algorithms
It makes use of recursive partitioning:
Take all of your data.
Consider all possible values of all variables.
Select the variable/value (X = t1) that produces the greatest separation in the target.
(X = t1) is called a split.
If X < t1, send the data point to the left; otherwise, send it to the right.
Now repeat the same process on these two nodes.
You get a tree.
Note: CART uses only binary splits (unlike CHAID, which can also produce non-binary trees).
Separation can be defined in many ways:
Regression trees (continuous target): use the sum of squared errors.
Classification trees (categorical target): choice of entropy, Gini measure, or twoing splitting rule.
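A single partitioning step for one numeric variable and a binary target can be sketched as follows, using the Gini measure as the separation criterion. The data set is a small hypothetical example, not taken from Analysis_of_Default.xlsx, and the function names are assumptions; a real CART run would evaluate every variable and then recurse on both child nodes.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_split(values, labels):
    """Return (threshold, weighted Gini) of the best binary split X < t."""
    pairs = list(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # no split: impurity of the parent node
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [y for x, y in pairs if x < t]    # X < t goes left
        right = [y for x, y in pairs if x >= t]  # otherwise goes right
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical toy data: a numeric predictor and a 0/1 default flag.
credit_amount = [1000, 2000, 3000, 4000, 9000, 11000, 12000, 13000]
defaulted = [0, 0, 0, 0, 1, 0, 1, 1]

threshold, impurity = best_split(credit_amount, defaulted)
print(f"Best split: X < {threshold} (weighted Gini = {impurity:.4f})")
```

For a regression tree the same loop applies, with the weighted Gini replaced by the sum of squared errors around each child node's mean.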
Some differences between CART and CHAID
Dependent variable for CHAID must be categorical; for CART it can be metric
Different splitting algorithm (e.g., CHAID uses a Chi-squared test using contingency tables)
CHAID splits into multiple groups, CART makes binary splits
Different stopping criteria
Case- Using CART analysis to identify customer segments who
default on payment due
Romanov, an analytics consultant, works with Credit One Bank. His manager gave him data
containing credit and personal information for a group of customers. Some of the customers had
defaulted on payment of the balance due. The manager asked him to identify the classes of
customers with a higher-than-average default rate using a decision tree. Romanov has no
knowledge of running a CART analysis.
Now suppose he approaches you and requests your help to complete the assignment. Let's
help Romanov solve the problem.
Case- CART analysis
In the course of helping Romanov complete his task, we will walk him through the following steps:
Variable identification
  Identifying the dependent (response) variable
  Identifying the independent (explanatory) variables
  Variable categorization (e.g. numeric, categorical, discrete, continuous)
  Creation of a data dictionary
Response variable exploration
  Distribution analysis
  Outlier treatment (not required for binary variables)
Running the CART analysis using R
  Importing data in R
  Selecting the variables
  Running the analysis
  Interpreting the results
R Code for CART-Regression Tree
R Outputs for Regression tree
R Code for CART-Classification Tree
R Outputs for Classification tree
CART Analysis Demonstration in SPSS
Preparing SPSS to run CART analysis
Running a simple CART analysis- Summary
Running a simple CART analysis Tree (Compressed View)
Use this tree result along with the tabular result on the next slide.
Terminal nodes with a default rate lower than the overall average of 29.9%:
Node 6: 25.8%
Node 12: 20%
Node 13: 26.9%
Node 17: 6.5%
Node 18: 12.3%
Node 22: 25.7%
Running a simple CART analysis Table
Running a simple CART analysis Tree (Expanded View -1)
Running a simple CART analysis Tree (Expanded View- 2)
Running a simple CART analysis Comparison with CHAID result
For a simple analysis using only two variables (Status_Checking_Account and Credit_History), CART and
CHAID produce results with the same level of accuracy.
CART vs. CHAID Comparison- Using all the available variables
When to use CART and when to use CHAID
The choice depends on the type of response variable.
Use CART when the response variable is numeric (metric) or binary.
CHAID should be used when the goal is to describe or understand the relationship between a
response variable and a set of explanatory variables.
CART is better suited for creating a model that has high prediction accuracy on new cases.