
Classification and regression trees

o Basic learning/mining tasks


o Recursive partitioning
o Classification rules from trees
o ID3, C4.5, C5 algorithm
o CHAID algorithm
o CART algorithm
BODY FAT DATASET Example [1]

[1] http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_BMI_Regression#References
Classification Example
Regression Example
Classification tree for county-level outcomes in the 2008 Democratic Party primary (as of April 16),
by Amanda Cox for The New York Times
Decision tree builds classification or regression models in the form of a tree structure. It breaks
down a dataset into smaller and smaller subsets while at the same time an associated decision
tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.
The goal of classification trees is to
predict (suggest) a decision
and/or
explain responses on a categorical dependent variable, providing a straightforward and
intuitive explanation of how the decision was made.
Decision tree algorithms

Algorithm              Author           Year         Splitting function
Hunt                   Hunt             1966
CART                   Breiman et al.   1977/1984    Gini index
C4.5                   Quinlan          1993/1996    Entropy
See5 / C5.0            (http://rulequest.com/see5-win.html)
CHAID                  Kass             1980         Chi-square test
Decision Tree Forest   Breiman          2001
Boosting Trees         Friedman         1999
CTREE                  Hothorn et al.   2006

https://www.stonybrook.edu/commcms/irpe/reports/presentations/DataMiningOverview_Galambos_2015_06_04.pptx
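
The splitting functions named in the table (the Gini index used by CART and the entropy used by C4.5) are both impurity measures computed from the class proportions in a node. A minimal, self-contained Python sketch (plain functions, not tied to any particular library):

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy: -sum of p * log2(p) over class proportions p."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

# A node with 4 instances of each of two classes is maximally impure:
node = ["yes"] * 4 + ["no"] * 4
print(gini(node))     # 0.5
print(entropy(node))  # 1.0
```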
An Algorithm for Building Decision Trees
1. Let T be the set of training instances.
2. Choose an attribute that best differentiates the instances contained in T.
3. Create a tree node whose value is the chosen attribute. Create child links from this
node where each link represents a unique value for the chosen attribute. Use the child
link values to further subdivide the instances into subclasses.
4. For each subclass created in step 3:
a. If the instances in the subclass satisfy predefined criteria or if the set of
remaining attribute choices for this path of the tree is null, specify the classification for
new instances following this decision path.
b. If the subclass does not satisfy the predefined criteria and there is at least one
attribute to further subdivide the path of the tree, let T be the current set of subclass
instances and return to step 2.
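
As an illustration of steps 1-4 above, here is a minimal, hedged Python sketch of the recursive procedure; the stopping criterion and attribute-scoring function are passed in as parameters, since the concrete choices (Chi-square, Gini, entropy, ...) are what distinguish the individual algorithms:

```python
def build_tree(instances, attributes, target, stop_criterion, choose_attribute):
    """Recursive tree building following steps 1-4 of the algorithm above.

    instances        -- list of dicts mapping attribute name -> value
    attributes       -- attribute names still available on this path
    target           -- name of the class attribute
    stop_criterion   -- function(labels) -> bool, the predefined criteria of step 4a
    choose_attribute -- function(instances, attributes, target) -> best attribute (step 2)
    """
    labels = [inst[target] for inst in instances]
    # Step 4a: stop when the criteria are met or no attributes remain;
    # new instances on this path are classified by the majority class.
    if stop_criterion(labels) or not attributes:
        return {"leaf": max(set(labels), key=labels.count)}
    # Step 2: choose the attribute that best differentiates the instances.
    best = choose_attribute(instances, attributes, target)
    # Step 3: one child link per unique value of the chosen attribute.
    node = {"attribute": best, "children": {}}
    for value in {inst[best] for inst in instances}:
        subset = [inst for inst in instances if inst[best] == value]
        remaining = [a for a in attributes if a != best]
        # Step 4b: let T be the subclass instances and return to step 2.
        node["children"][value] = build_tree(subset, remaining, target,
                                             stop_criterion, choose_attribute)
    return node
```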
Partitioning of search space
Univariate partitioning methods are attractive because
only one feature / attribute is analyzed at a time
they partition the search space axis-parallel, based on only one attribute at a time
the derived decision tree is relatively easy to understand

Oblique partitioning methods
provide a viable alternative to univariate methods
form oblique partition boundaries based on combinations of attributes
Ex. 1
[Figure: two classes of points in the unit square separated by axis-parallel decision boundaries; the corresponding decision tree first tests x < 0.43, then y < 0.47 on one branch and y < 0.33 on the other, and each leaf contains instances of only one class.]
Ex. 2

Decision Boundary
The border line between two neighboring regions of different classes is known as the decision
boundary.
The decision boundary is parallel to the axes because each test condition involves a single
attribute at a time.

CART starts out with the best univariate split. It then iteratively searches for perturbations in
attribute values (one attribute at a time) which maximize some goodness metric. At the end of
the procedure, the best oblique and axis-parallel splits found are compared and the better of
these is selected.
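
A hedged sketch of the first stage described above: finding the best axis-parallel (univariate) split on a single numeric attribute by a goodness metric, here the reduction in Gini impurity (re-defined locally so the sketch stays self-contained); the iterative oblique-perturbation stage is omitted:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_univariate_split(xs, ys):
    """Return (threshold, gain) for the best axis-parallel split 'x < threshold'."""
    parent_impurity = gini(ys)
    best_threshold, best_gain = None, 0.0
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        if not left or not right:
            continue  # a split must produce two non-empty partitions
        child_impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        gain = parent_impurity - child_impurity
        if gain > best_gain:
            best_threshold, best_gain = t, gain
    return best_threshold, best_gain

# Toy data: two classes perfectly separable on one attribute.
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
ys = ["red", "red", "red", "red", "blue", "blue", "blue", "blue"]
print(best_univariate_split(xs, ys))  # (0.5, 0.5)
```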
CHAID [4]

CHAID stands for Chi-squared Automatic Interaction Detector
The name derives from the basic algorithm that is used to construct (non-binary) trees: the Chi-square test

CHAID is a recursive partitioning method
It is one of the oldest tree classification methods,
originally proposed by Kass (1980).
According to Ripley (1996), the CHAID algorithm is a descendant of THAID,
developed by Morgan and Messenger (1973).

CHAID will "build" non-binary trees


trees where more than two branches can attach to a single root or node
CHAID is a relatively simple algorithm that is
particularly well suited for the analysis of larger datasets.
The CHAID algorithm will often effectively yield many multi-way frequency tables
when classifying a categorical response variable with many categories,
based on categorical predictors with many classes
for this reason it has been particularly popular in
marketing research, in the context of market segmentation studies.
Both CHAID and C&RT techniques will construct trees,

where each (non-terminal) node identifies a split condition, to yield

optimum prediction (of continuous dependent or response variables) or

classification (for categorical dependent or response variables)

Both types of algorithms can be applied to analyze:
regression-type problems or
classification-type problems.

[4] http://www.statsoft.com/Textbook/CHAID-Analysis#index
Basic Tree-Building Algorithm: CHAID and Exhaustive CHAID

Basic algorithm that is used to construct (non-binary) trees:


for classification problems (when the dependent variable is categorical in nature)
the program will compute a Chi-square test
to determine the best next split at each step;
for regression-type problems (continuous dependent variable)
the program will compute F-tests

The algorithm proceeds as follows:


Preparing predictors.
The first step is
in case of continuous predictors:
to create categorical predictors
by
dividing the respective continuous distributions
into a number of categories
with an approximately equal number of observations
in case of categorical predictors
the categories (classes) are "naturally" defined
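
For the continuous-predictor case above, a minimal sketch using pandas: pandas.qcut divides a continuous column into a chosen number of categories with approximately equal numbers of observations (the column name "income" is only illustrative):

```python
import pandas as pd

# Illustrative continuous predictor (the name "income" is made up for the example).
df = pd.DataFrame({"income": [12, 18, 25, 31, 40, 47, 55, 63, 72, 90]})

# Equal-frequency (quantile) binning into 4 categories, as in the preparation step.
df["income_cat"] = pd.qcut(df["income"], q=4)

print(df["income_cat"].value_counts())  # roughly equal counts per category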

Merging categories.
The next step is
to cycle through the predictors
to determine
for each predictor
the pair of (predictor) categories
that is least significantly different
with respect to the dependent variable;
for classification problems (where the dependent variable is categorical as well),
it will compute a Chi-square test (Pearson Chi-square), as sketched after this step;
for regression problems (where the dependent variable is continuous),
it will compute F-tests

If the respective test for a given pair of predictor categories is not statistically significant
as defined by an alpha-to-merge value,
then it will
merge the respective predictor categories and
repeat this step
(i.e., find the next pair of categories,
which now may include previously merged categories)

If the statistical significance for the respective pair of predictor categories is significant
(less than the respective alpha-to-merge value),
then (optionally) it will
compute a Bonferroni adjusted p-value
for the set of categories for the respective predictor
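
A hedged sketch of the merging step (not the exact CHAID implementation), assuming a pandas DataFrame with a categorical predictor and a categorical response; the column names "region" and "bought" in the usage comment are made up. For every pair of predictor categories it builds a two-row contingency table against the response and keeps the pair with the largest p-value, i.e. the pair that is least significantly different:

```python
from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency

def least_different_pair(df, predictor, target):
    """Return (pair, p_value) for the pair of predictor categories that is
    least significantly different with respect to the target (largest p)."""
    best_pair, best_p = None, -1.0
    for a, b in combinations(df[predictor].unique(), 2):
        sub = df[df[predictor].isin([a, b])]
        table = pd.crosstab(sub[predictor], sub[target])  # 2 x k contingency table
        _, p, _, _ = chi2_contingency(table)              # Pearson Chi-square test
        if p > best_p:
            best_pair, best_p = (a, b), p
    return best_pair, best_p

# Usage sketch: merge while the least-different pair is not significant.
# alpha_to_merge = 0.05
# pair, p = least_different_pair(df, "region", "bought")
# if p > alpha_to_merge:
#     df["region"] = df["region"].replace({pair[1]: pair[0]})
```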

Selecting the split variable


The next step is
to choose, for the split, the predictor variable with the smallest adjusted p-value,
i.e., the predictor variable that will yield the most significant split (sketched below);
If
the smallest (Bonferroni) adjusted p-value for any predictor
is greater than some alpha-to-split value,
then
no further splits will be performed,
and the respective node is a terminal node.
Continue this process until no further splits can be performed,

given the alpha-to-merge and alpha-to-split values
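
A minimal sketch of the selection step, again not any specific product's implementation: compute a p-value per predictor (a plain Pearson Chi-square test on its merged categories), apply a simple Bonferroni-style adjustment, and either return the predictor giving the most significant split or declare the node terminal. The real CHAID Bonferroni adjustment is based on the number of ways the original categories can be reduced to the merged ones; adjusting by the number of candidate predictors below is a deliberate simplification:

```python
import pandas as pd
from scipy.stats import chi2_contingency

def select_split_variable(df, predictors, target, alpha_to_split=0.05):
    """Pick the predictor with the smallest adjusted p-value, or None if even
    the best candidate is not significant (terminal node)."""
    adjusted = {}
    for pred in predictors:
        table = pd.crosstab(df[pred], df[target])
        _, p, _, _ = chi2_contingency(table)
        # Simplified Bonferroni adjustment (see the note above).
        adjusted[pred] = min(1.0, p * len(predictors))
    best = min(adjusted, key=adjusted.get)
    if adjusted[best] > alpha_to_split:
        return None  # no further split: the node becomes a terminal node
    return best
```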

CHAID and Exhaustive CHAID Algorithms


Exhaustive CHAID is a modification to the basic CHAID algorithm
Exhaustive CHAID performs a more thorough merging and testing of predictor variables,
and hence requires more computing time.
The merging of categories
continues
without reference to any alpha-to-merge value
until
only two categories remain for each predictor
The algorithm
then proceeds as described above in the Selecting the split variable step,
and selects among the predictors the one that yields the most significant split
For large datasets, and with many continuous predictor variables,
this modification of the simpler CHAID algorithm
may require significant computing time.

General Computation Issues of CHAID

Reviewing large trees: Unique analysis management tools.


A general issue that arises when applying
tree classification or regression methods
is that
the final trees can become very large
In practice,
when
the input data are complex
and, for example, contain
many different categories for classification problems, and
many possible predictors for performing the classification,
then
the resulting trees can become very large

This is not so much a computational problem


as it is a problem of presenting the trees
in a manner that is easily accessible to the data analyst, or
for presentation to the "consumers" of the research
CRT [5] (CART, C&RT)

Overview - Basic Ideas

CRT stands for Classification and Regression Tree

CRT builds classification and regression trees for
predicting continuous dependent variables (regression) and
categorical dependent variables (classification)
The CRT algorithm was popularized by
Breiman et al. (Breiman, Friedman, Olshen, & Stone, 1984;
see also Ripley, 1996)
Hunt's algorithm is the basis of many decision tree induction algorithms, including CART (and
ID3, C4.5)

[5] http://www.statsoft.com/Textbook/Classification-and-Regression-Trees
COMPARISON: CHAID, EXHAUSTIVE CHAID, CRT, QUEST
SIMILARITY

Both CHAID and C&RT techniques will construct trees,

where each (non-terminal) node identifies a split condition, to yield

optimum prediction (of continuous dependent or response variables) or

classification (for categorical dependent or response variables)

Both types of algorithms can be applied to analyze:
regression-type problems or
classification-type problems.

DIFFERENCES

For classification-type problems (categorical dependent variable),

- all three algorithms can be used to build a tree for prediction.
- QUEST is generally faster than the CHAID and CRT algorithms;
however, for very large datasets,
the memory requirements are usually larger,
so using the QUEST algorithm for classification with very large input datasets
may be impractical.
For regression-type problems (continuous dependent variable),
- the QUEST algorithm is not applicable,
- only CHAID and C&RT can be used
- CHAID will build non-binary trees that tend to be "wider".
This has made the CHAID method particularly popular in market research
applications: CHAID often yields many terminal nodes connected to a single
branch, which can be conveniently summarized in a simple two-way table with
multiple categories for each variable or dimension of the table.
This type of display matches well the requirements for research on market
segmentation, for example, it may yield a split on a variable Income, dividing
that variable into 4 categories and groups of individuals belonging to those
categories that are different with respect to some important consumer-behavior
related variable (e.g., types of cars most likely to be purchased).
- C&RT will always yield binary trees, which can sometimes not be summarized as
efficiently for interpretation and/or presentation.
For predictive accuracy problems

- it is difficult to derive general recommendations,

- this issue is still the subject of active research

- as a practical matter,

it is best to apply different algorithms,

perhaps compare them with user-defined interactively derived trees,

and decide on the most reasonable and best-performing model

based on the prediction errors

- for a discussion of various schemes for combining predictions from different models,
see, for example, Witten and Frank, 2000.

For speed and bias problems


- QUEST is fast and unbiased.
- The speed advantage of QUEST over C&RT is particularly dramatic when the
predictor variables have dozens of levels
(Loh & Shih, 1997, report an analysis completed by QUEST in 1 CPU second that
took C&RT 30.5 CPU hours to complete).
- QUEST's lack of bias in variable selection for splits is also a distinct advantage when
some predictor variables have few levels and other predictor variables have many
levels (predictors with many levels are more likely to produce "fluke theories," which
fit the data well but have low predictive accuracy; see Doyle, 1973, and Quinlan &
Cameron-Jones, 1995).
- QUEST does not sacrifice predictive accuracy for speed (Lim, Loh, & Shih, 1997).
Bibliography
https://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap4_basic_classification.ppt
http://www.saedsayad.com/decision_tree.htm
https://en.wikipedia.org/wiki/Decision_tree_learning
https://blog.bigml.com/2012/01/23/beautiful-decisions-inside-bigmls-decision-trees/
https://www.researchgate.net/publication/11205595_Decision_Trees_An_Overview_and_Their_Use_in_Medicine
http://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf
https://www.stat.cmu.edu/~cshalizi/350/lectures/22/lecture-22.pdf

Tutorial SPSS MODELER


https://www.youtube.com/watch?v=HYV2aPHhmVg
