
Overview of TreeNet Technology

Stochastic Gradient Boosting

Dan Steinberg
2012
http://www.salford-systems.com

Introduction to TreeNet: Stochastic Gradient Boosting

Powerful approach to machine learning and function approximation
developed by Jerome H. Friedman at Stanford University

Friedman is a coauthor of CART with Breiman, Olshen and Stone
Author of MARS, PRIM, Projection Pursuit, COSA, RuleFit and more

TreeNet is unusual among learning machines in being very strong for both
classification and regression problems

Stochastic gradient boosting is now widely regarded as the most powerful
methodology available for most predictive modeling contexts

Builds on the notions of committees of experts and boosting but is
substantially different in key implementation details


Aspects of TreeNet

Built on CART trees and thus:

A highly discriminatory variable selector
Effective with thousands of predictors
Detects and identifies important interactions
Immune to outliers
Handles missing values automatically
Results invariant under order-preserving transformations of variables

No need to ever consider functional form revision (log, sqrt, power)

Can be used to easily test for the presence of interactions and their degree

Resistant to overtraining; generalizes well

Can be remarkably accurate with little effort

Should easily outperform conventional models


Adapting to Major Errors in Data

TreeNet is a machine learning technology designed to recognize patterns in
historical data
Ideally the data TreeNet learns from will be accurate
In some circumstances there is a risk that the most important variable,
namely the dependent variable, is subject to error. This is known as
mislabeled data
Good examples of mislabeled data can be found in:

Medical diagnoses
Insurance claim fraud

Historical data for non-frauds actually includes undetected fraud
Some of the 0s are actually 1s, which complicates learning (possibly fatally)
TreeNet manages such data successfully


Some TreeNet Successes

2010 DMA (Direct Marketing Association)                      1st place*
2009 KDD Cup: IDAnalytics* and FEG Japan*                    1st runner-up
2008 DMA (Direct Marketing Association)                      1st runner-up
2007 PAKDD (Pacific Asia KDD): Credit Card Cross-Sell        1st place
2007 DMA (Direct Marketing Association)                      1st place
2006 PAKDD (Pacific Asia KDD): Telco Customer Model          2nd runner-up
2005 BI-Cup Latin America: Predictive E-commerce*            1st place
2004 KDD Cup: Predictive Modeling                            Most accurate*
2002 NCR/Teradata Duke University: Predictive Modeling (Churn)
     Four separate predictive modeling challenges            1st place

* Won by a Salford client or partner using TreeNet



Multi-tree methods and their single tree ancestors

Multi-tree methods have been under development since the early 1990s.
The most important variants (and dates of published articles) are:

Bagger (Breiman, 1996, Bootstrap Aggregation)
Boosting (Freund and Schapire, 1995)
Multiple Additive Regression Trees (Friedman, 1999, aka MART or TreeNet)
RandomForests (Breiman, 2001)

TreeNet version 1.0 was released in 2002 but its power took some
time to be recognized
Work on TreeNet continues with major refinements underway
(Friedman in collaboration with Salford Systems)


Multi-tree Methods: Simplest Case

Simplest example:

Grow a tree on training data
Find a way to grow another, different tree (change something in the setup)
Repeat many times, e.g. 500 replications
Average the results or create a voting scheme, e.g. relate the probability of
the event to the fraction of trees predicting the event for a given record
(see the sketch below)

The beauty of the method is that every new tree starts with the complete set
of data (before sampling). Any one tree can run out of data, but when that
happens we just start again with a new tree and all the data, and the final
prediction is made via voting
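A minimal sketch of this simplest scheme in Python, using scikit-learn trees rather than Salford's CART implementation; the data set, tree settings, and 500 replications are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative data: binary target, 20 predictors
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

rng = np.random.default_rng(0)
n_trees = 500                      # e.g. 500 replications
votes = np.zeros(len(y))

for _ in range(n_trees):
    # Every replication starts from the complete data set,
    # then draws a bootstrap resample of it
    idx = rng.integers(0, len(y), size=len(y))
    tree = DecisionTreeClassifier()
    tree.fit(X[idx], y[idx])
    votes += tree.predict(X)       # this tree's vote for each record

# Probability of the event = fraction of trees predicting the event
p_event = votes / n_trees
print(p_event[:5])
```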


Automated Multiple Tree Generation

The earliest multi-model methods recommended taking several good candidates
and averaging them. Examples considered as few as three trees.
Too difficult to generate multiple models manually; hard enough to get one
good model.
How do we generate different trees?

Bagger: random reweighting of the data via bootstrap resampling
Bootstrap resampling allows for zero and positive integer weights
RandomForests: random splits applied to randomly reweighted data
Implies that the tree itself is grown at least partly at random
Boosting: reweighting data based on prior success in correctly classifying a
case. High weights on difficult-to-classify cases. (Systematic reweighting)
TreeNet: boosting with major refinements. Each tree attempts to correct errors
made by its predecessors
Each tree is linked to its predecessors, like a series expansion where the
addition of terms progressively improves the predictions


TreeNet

We focus on TreeNet because:

It is the method used in many successful real-world studies
Stochastic gradient boosting is arguably the most popular methodology among
experienced data miners
We have found it to be consistently more accurate than the other methods
A major credit card fraud detection engine uses TreeNet
Widespread use in online e-commerce

Dramatic capabilities include:
Display of the impact of any predictor (dependency plots)
Tests for the existence of interactions
Identifying and ranking interactions
Constraining the model: allow some interactions and disallow others
Tools to recast a TreeNet model as a logistic regression or GLM


TreeNet Process

Begin with one very small tree as the initial model
Could be as small as ONE split generating 2 terminal nodes
Default model will have 4-6 terminal nodes
Final output is a predicted target (regression) or predicted probability
The first tree is an intentionally weak model

Compute residuals for this simple model (prediction error) for every record
in the data (even for a classification model)

Grow a second small tree to predict the residuals from the first tree
The new model is now:

Tree1 + Tree2

Compute residuals from this new 2-tree model and grow a 3rd tree to predict
the revised residuals (a minimal sketch of this cycle follows)
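A minimal sketch of the residual-fitting cycle for a least-squares regression target, using scikit-learn trees as a stand-in for TreeNet's CART trees; for classification the "residuals" would be gradients of the deviance rather than simple differences, and all settings below are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)

learn_rate = 0.1          # shrinkage applied to every tree's contribution
n_trees = 200
trees = []

pred = np.full(len(y), y.mean())   # initial model: a constant (intercept)

for _ in range(n_trees):
    residuals = y - pred                            # prediction error of current model
    tree = DecisionTreeRegressor(max_leaf_nodes=6)  # small tree, e.g. 4-6 terminal nodes
    tree.fit(X, residuals)                          # grow the next tree on the residuals
    pred += learn_rate * tree.predict(X)            # model is now Tree1 + Tree2 + ...
    trees.append(tree)

# Score a record by summing its terminal-node values across all trees
def score(x_row):
    return y.mean() + learn_rate * sum(t.predict(x_row.reshape(1, -1))[0] for t in trees)

print(score(X[0]), y[0])
```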


TreeNet: Trees incrementally revise predicted scores

Tree 1: first tree grown on the original target; an intentionally weak model
Tree 2: second tree grown on the residuals from the first; predictions made to
improve the first tree
Tree 3: third tree grown on the residuals from the model consisting of the
first two trees

Every tree produces at least one positive and at least one negative node. Red
reflects a relatively large positive node and deep blue a relatively negative
node. The total score for a given record is obtained by finding the relevant
terminal node in every tree in the model and summing across all trees


TreeNet: Example Individual Trees


[Figure: example individual trees. Starting from an intercept of 0.172, each small
tree splits on a single variable (splitters shown include COST2INC < 65.4,
EQ2CUST_STF < 5.04, TOTAL_DEPS < 65.5M, and EQ2TOT_AST < 2.8) and assigns a small
positive or negative score to each terminal node (values such as +0.068, -0.051,
-0.213, -0.228). The scores from Tree 1, Tree 2, Tree 3, and all other trees are
summed to give the predicted response.]

TreeNet Methodology: Key points

Trees are kept small (each tree learns only a little)
Each tree is weighted by a small weight prior to being used to compute the
current prediction or score produced by the model
Like a partial adjustment model. Update factors can be as small as 0.01,
0.001, or 0.0001. This means that the model prediction changes by very small
amounts after a new tree is added to the model (training cycle)
Use random subsets of the training data for each new tree. Never train on all
the training data in any one learning cycle
Highly problematic cases are optionally IGNORED. If the model prediction starts
to diverge substantially from the observed data for some records, those records
may not be used in further model updates
The model can be tuned to optimize any of several criteria (see the sketch below):

Least Squares, Least Absolute Deviation, Huber-M hybrid LS/LAD
Area under the ROC curve
Logistic likelihood (deviance)
Classification Accuracy
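These ingredients (small trees, a small update weight, random subsampling, and a choice of loss criterion) map onto the parameters of scikit-learn's gradient boosting estimators; the sketch below is an analogue under assumed settings and a recent scikit-learn (1.0+), not TreeNet itself, and the optional record-ignoring behavior described above has no direct equivalent here:

```python
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

# Regression: least squares, least absolute deviation, or the Huber hybrid
reg = GradientBoostingRegressor(
    loss="huber",          # alternatives: "squared_error" (LS), "absolute_error" (LAD)
    n_estimators=2000,     # many small trees
    max_leaf_nodes=6,      # each tree learns only a little
    learning_rate=0.01,    # small update weight per tree (could be 0.001, 0.0001)
    subsample=0.5,         # train each tree on a random half of the data
)

# Binary classification: default loss is the logistic likelihood (deviance)
clf = GradientBoostingClassifier(
    n_estimators=2000,
    max_leaf_nodes=6,
    learning_rate=0.01,
    subsample=0.5,
)
```

Calling `reg.fit(X, y)` or `clf.fit(X, y)` then proceeds exactly as in the residual-fitting sketch earlier, with the shrinkage and random subsampling applied at every training cycle.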


Why does TreeNet work?

Slow learning: the method peels the onion, extracting very small amounts of
information in any one learning cycle
TreeNet can leverage hints available in the data across a number of
predictors
TreeNet can successfully include more variables than traditional models
Can capture substantial nonlinearity and complex interactions of high degree

TreeNet self-protects against errors in the dependent variable
If a record is actually a 1 but is misrecorded in the data as a 0, and TreeNet
recognizes it as a 1, TreeNet may decide to ignore this record
Important for medical studies in which the diagnosis can be incorrect
Conversion in online behavior can be misrecorded because it is not recognized
(e.g. conversion occurs via another channel or is delayed)


Multiple Additive Regression Trees

Friedman originally named his methodology MART because the method generates small
trees which are summed to obtain an overall score
The model can be thought of as a series expansion approximating the true functional
relationship
We can think of each small tree as a mini-scorecard, making use of possibly different
combinations of variables. Each mini-scorecard is designed to offer a slight
improvement by correcting and refining its predecessors
Because each tree starts at the root node and can use all of the available data, a
TreeNet model can never run out of data no matter how many trees are built
The number of trees required to obtain the best possible performance can vary
considerably from data set to data set
Some models converge after just a handful of trees and others may require many
thousands


Selecting Optimal Model

TreeNet first grows a large number of trees

We evaluate performance of the ensemble as each tree is added:
Start with 1 tree, then go on to 2 trees (1st + 2nd),
then 3 trees (1st + 2nd + 3rd), etc. (see the sketch below)

The criteria used right now are:

Least Squares, Least Absolute Deviation, or Huber-M for regression
Classification Accuracy
Log-Likelihood
ROC (area under the curve)
Lift in the top P percentile (often the top decile)
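A sketch of evaluating the ensemble at every size on a held-out sample, using scikit-learn's staged predictions to play the role of TreeNet's per-tree performance tracking; the data and parameters are assumptions, and only the ROC and log-likelihood (CXE) criteria are shown:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, log_loss

X, y = make_classification(n_samples=4000, n_features=25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

gb = GradientBoostingClassifier(n_estimators=1000, max_leaf_nodes=6,
                                learning_rate=0.05, subsample=0.5, random_state=0)
gb.fit(X_tr, y_tr)

# Performance after 1 tree, 2 trees, 3 trees, ... on the test sample
roc, cxe = [], []
for proba in gb.staged_predict_proba(X_te):
    roc.append(roc_auc_score(y_te, proba[:, 1]))
    cxe.append(log_loss(y_te, proba[:, 1]))

print("Optimal N trees (ROC):", int(np.argmax(roc)) + 1, " ROC =", max(roc))
print("Optimal N trees (CXE):", int(np.argmin(cxe)) + 1, " CXE =", min(cxe))
```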


TreeNet Summary Screen


Displaying ROC (Y-axis) on train and test samples at all ensemble sizes

Each criterion is optimized using a different number of trees (size of ensemble):

Criterion                 CXE         Class Error   ROC Area    Lift
Optimal Number of Trees   739         1069          643         163
Optimal Criterion         0.4224394   0.1803279     0.8862029   1.9626555


How the different criteria select different response profiles


Artificial data: the red curve is the truth

Models fit to the artificial data (RED):
Orange is logistic regression
Green is TreeNet optimized for Classification Accuracy
Blue is TreeNet optimized for maximum log-likelihood (CXE)

TreeNet (CXE) tracks the data far more faithfully than LOGIT
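A toy reconstruction of this kind of comparison, with an assumed non-monotone truth curve: a gradient boosting model trained on the logistic deviance (CXE) tracks the true probability far more closely than a plain logistic regression can:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=5000)
p_true = 1 / (1 + np.exp(-3 * np.sin(2 * x)))     # known, non-monotone truth ("red curve")
y = rng.binomial(1, p_true)
X = x.reshape(-1, 1)

logit = LogisticRegression().fit(X, y)
gbm = GradientBoostingClassifier(max_leaf_nodes=6, n_estimators=300,
                                 learning_rate=0.05, random_state=0).fit(X, y)

grid = np.linspace(-3, 3, 200).reshape(-1, 1)
truth = 1 / (1 + np.exp(-3 * np.sin(2 * grid.ravel())))
print("logit    max error:", np.abs(logit.predict_proba(grid)[:, 1] - truth).max())
print("boosting max error:", np.abs(gbm.predict_proba(grid)[:, 1] - truth).max())
```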



Interpreting TN Models

As TN models consist of hundreds or even thousands of trees, there is no way
to represent the model via a display of one or two trees
However, the model can be summarized in a variety of ways:

Partial dependency plots: these exhibit the relationship between the target and
any predictor as captured by the model
Variable Importance Rankings: these stable rankings give an excellent
assessment of the relative importance of predictors
For regression models SPM 7.0 displays box plots for residuals and residual outlier
diagnostics
For binary classification TN also displays:
ROC curves: TN models produce scores that are typically unique for each scored
record, allowing records to be ranked from best to worst. The ROC curve and area
under the curve reveal how successful the ranking is
Confusion matrix: using an adjustable score threshold, this matrix displays the
model's false positive and false negative rates (see the sketch below)
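A minimal sketch of the binary-classification summaries (ROC area and a threshold-dependent confusion matrix) computed from a model's scores; the rare-event data and the thresholds 0.5 and 0.2 are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

gb = GradientBoostingClassifier(max_leaf_nodes=6, learning_rate=0.05,
                                n_estimators=300, random_state=1).fit(X_tr, y_tr)
score = gb.predict_proba(X_te)[:, 1]     # near-unique score per record

print("Test ROC area:", roc_auc_score(y_te, score))

# Confusion matrix at an adjustable threshold; lowering it below 0.5
# trades false negatives for false positives on a rare-event target
for threshold in (0.5, 0.2):
    cm = confusion_matrix(y_te, (score >= threshold).astype(int))
    print(f"threshold={threshold}:\n{cm}")
```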


TN Summary: Variable Importance Ranking

Based on actual use of variables in the trees on training data


TreeNet creates a sum of split improvement scores for every variable in every tree
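Scikit-learn's impurity-based importances follow the same idea of summing split improvement scores across all trees; a sketch under assumed data, with scores normalized so the top variable reads 100 as in the TN report:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           random_state=0)
gb = GradientBoostingClassifier(max_leaf_nodes=6, n_estimators=300,
                                random_state=0).fit(X, y)

# Relative importances, scaled so the most important variable scores 100
imp = 100.0 * gb.feature_importances_ / gb.feature_importances_.max()
for i in np.argsort(imp)[::-1]:
    print(f"X{i:<3d} {imp[i]:6.2f}")
```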

TN Classification Accuracy: Test Data

Threshold can be adjusted to reflect unbalanced classes, rare events



TreeNet: Partial Dependency Plot

Y-axis: One-half-log-odds

Partial Dependency Plot Explained

Partial dependency plots are extracted from the TreeNet model via
simulation

To trace out the impact of the predictor X on the target Y, we start with
a single record from the training data and generate a set of
predictions for the target Y for each of a large number of different
values of X

We repeat this process for every record in the training data and then
average the results to produce the displayed partial dependency plot

In this way the plot reflects the effects of all the other relevant
predictors

It is important to remember that the plot is based on predictions
generated by the model (a minimal sketch of the simulation follows)

Smooths of the plots might be used to create transforms of certain
variables for use in GLMs or logistic regressions
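The simulation can be written directly: hold the predictor of interest at each grid value, substitute that value into every training record, predict, and average. A sketch with an assumed regression model; scikit-learn's `partial_dependence` utility computes the same quantity:

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=1000, random_state=0)
gb = GradientBoostingRegressor(max_leaf_nodes=6, n_estimators=300,
                               random_state=0).fit(X, y)

def partial_dependence_curve(model, X, feature, n_grid=50):
    grid = np.linspace(X[:, feature].min(), X[:, feature].max(), n_grid)
    pd = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v                   # set the predictor to this value for every record
        pd.append(model.predict(Xv).mean())  # average over the other predictors' actual values
    return grid, np.array(pd)

grid, pd = partial_dependence_curve(gb, X, feature=0)
print(list(zip(grid[:3], pd[:3])))
```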


Place Knot Locations on Graph for Smooth

Smooth can be created on screen within TreeNet



Smooth Depicted in Green


Generate Programming Code for Smooth

Can obtain
SAS code
Java
C
PMML
Other languages coming:
SQL
Visual Basic


Dealing with Monotonicity

Undesired response profile (here we prefer a monotone increasing response)


Impose Constraint

A constrained solution is generated to yield a monotone increasing smooth

In this case the constraint is mild in that the differences on the Y-axis are quite
small (a sketch of imposing such a constraint follows)
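Monotone-increasing (or decreasing) constraints on individual predictors are also available in scikit-learn's histogram-based gradient boosting (0.23+); a sketch under assumed data, not the TreeNet constraint mechanism itself:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 3))
y = 2 * X[:, 0] + np.sin(6 * X[:, 1]) + rng.normal(scale=0.3, size=1000)

# +1 = monotone increasing, -1 = decreasing, 0 = unconstrained (one entry per predictor)
model = HistGradientBoostingRegressor(monotonic_cst=[1, 0, 0], random_state=0)
model.fit(X, y)

# Predictions are now non-decreasing in the first predictor,
# holding the other columns fixed
probe = np.tile(X[:1], (5, 1))
probe[:, 0] = np.linspace(0, 1, 5)
print(model.predict(probe))
```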

Interaction Detection

TreeNet models based on 2-node trees automatically EXCLUDE interactions
The model may be highly nonlinear but is by definition strictly additive
Every term in the model is based on a single variable (single split)
Use this as a baseline model for the best possible additive model (automated GAM)

Build TreeNet on larger trees (the default is 6 nodes)
Permits up to 5-way interactions, though in practice more like 3-way

Can conduct an informal likelihood ratio test: TN(2-node) vs TN(6-node)
Large differences signal important interactions (see the sketch below)
In TreeNet 2.0 interactions can be located via 3-D two-variable dependency plots
In SPM 6.6 and later, variables participating in interactions are ranked using new
methodology developed by Friedman and Salford Systems
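The informal test can be run by fitting the same boosting model twice, once with 2-node trees (strictly additive) and once with 6-node trees, and comparing held-out log-likelihoods; the data, tree counts, and learning rate below are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=4000, n_features=15, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

def test_deviance(max_leaf_nodes):
    gb = GradientBoostingClassifier(max_leaf_nodes=max_leaf_nodes, n_estimators=500,
                                    learning_rate=0.05, random_state=2).fit(X_tr, y_tr)
    return log_loss(y_te, gb.predict_proba(X_te)[:, 1])

additive = test_deviance(2)   # stumps: nonlinear but strictly additive ("automated GAM")
interact = test_deviance(6)   # allows interactions up to roughly 5-way
print("2-node CXE:", additive, " 6-node CXE:", interact)
# A large improvement from 2-node to 6-node signals important interactions
```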


Interactions Ranking Report

The interaction strength measure is the first column; the 2nd column shows
normalized scores
Variables are listed in descending order of overall interaction strength

Computation of interaction strength is based on two ways of creating a 3D plot of
the target vs Xi and Xj
One method ignores possible interactions and the other explicitly accounts for
interaction. If the two are very different, we have evidence from the model for the
existence of an interaction (a sketch of this comparison follows)
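The comparison can be sketched by computing the joint two-variable dependence surface and the surface implied by adding the two one-way plots, then measuring how far apart they are (essentially the idea behind Friedman's H-statistic); the data and the exact normalization are assumptions and do not reproduce the TN report's scores:

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=800, random_state=0)
gb = GradientBoostingRegressor(max_leaf_nodes=6, n_estimators=300,
                               random_state=0).fit(X, y)

def pd_1d(model, X, j, grid):
    out = []
    for v in grid:
        Xv = X.copy(); Xv[:, j] = v
        out.append(model.predict(Xv).mean())
    return np.array(out)

def pd_2d(model, X, i, j, grid_i, grid_j):
    out = np.empty((len(grid_i), len(grid_j)))
    for a, vi in enumerate(grid_i):
        for b, vj in enumerate(grid_j):
            Xv = X.copy(); Xv[:, i] = vi; Xv[:, j] = vj
            out[a, b] = model.predict(Xv).mean()
    return out

def interaction_strength(model, X, i, j, n_grid=10):
    gi = np.linspace(X[:, i].min(), X[:, i].max(), n_grid)
    gj = np.linspace(X[:, j].min(), X[:, j].max(), n_grid)
    joint = pd_2d(model, X, i, j, gi, gj)
    # Surface built as if there were no interaction: sum of the two one-way plots
    additive = pd_1d(model, X, i, gi)[:, None] + pd_1d(model, X, j, gj)[None, :]
    joint = joint - joint.mean(); additive = additive - additive.mean()
    return np.sqrt(((joint - additive) ** 2).mean()) / joint.std()

# Friedman #1 has a genuine x0*x1 interaction but none between x3 and x4
print(interaction_strength(gb, X, 0, 1), interaction_strength(gb, X, 3, 4))
```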

What interacts with what?


Interaction strength scores for 2-way interactions for the most important variables

Here we see that NMOS primarily interacts with PAY_FREQUENCY_2$
All other interactions with NMOS have much smaller impacts

Example of an important interaction

Slope reverses due to an interaction

In this real-world example the downward-sloping portion of the relationship was
well known, but the segment displaying an upward-sloping relationship was not

Note that the dominant pattern is downward sloping
But a key segment defined by a 3rd variable is upward sloping

Real World Examples

Corporate default (Probability of Default or PD)


Analysis of bank ratings


Corporate Default Scorecard

Mid-sized bank
Around 200 defaults
Around 200 indeterminate/slow
Classic small-sample problem

Standard financial statements available; a variety of ratios created
All key variables had some fraction of missing data, from 2% to 6%
in a set of 10 core predictors, and up to 35% missing in a set of 20
core predictors

A single CART tree involves just 6 predictors
and yields a cross-validated ROC = 0.6523


TreeNet Analysis of Corporate Default Data: Summary Display

The summary reports progress of the model as it evolves with an increasing number
of trees. Markers indicate the models optimizing entropy, misclassification rate,
and ROC

Corporate Default Model: TreeNet

1789 trees in the model with the best test sample area under the ROC curve

Criterion           CXE        Class Error   ROC Area    Lift
Optimal N Trees     2000       1930          1789        1911
Optimal Criterion   0.534914   0.3485293     0.7164921   1.9231729

TreeNet cross-validated ROC = 0.71649, far better than the single tree
Able to make use of more variables due to the many trees
A single CART tree uses 6 predictors; TreeNet uses more than 20
Some of these variables were missing for many accounts
Can extract graphical displays to describe the model


Predictive Variable Financial Ratio 1

[Partial dependency plot: Y-axis shows risk (contribution to probability of default),
X-axis shows the value of Ratio 1 = Sales / Current Assets, ranging from about 10 to
60; the contribution ranges from roughly -0.004 to +0.003.]

TreeNet can trace out the impact of a predictor on PD


Predictive Variable Financial Ratio 2

[Partial dependency plot: Y-axis shows risk (contribution to probability of default),
X-axis shows the value of Ratio 2 = Return on Assets (ROA), i.e. Operating Profit /
Total Assets, ranging from about -0.9 to 1.0; the contribution ranges from roughly
-0.0004 to +0.0006.]


Additive vs Interaction Model (2-node vs 6-node trees)

2-node trees model

Criterion           CXE       Class Error   ROC Area   Lift
Optimal N Trees     2000      1812          1978       1763
Optimal Criterion   0.54538   0.39331       0.70749    1.84881

6-node trees model

Criterion           CXE       Class Error   ROC Area   Lift
Optimal N Trees     2000      1930          1789       1911
Optimal Criterion   0.53491   0.34852       0.71649    1.92317

Two-node trees do not permit interactions of any kind, as each term in the
model involves a single predictor in isolation. Two-node tree models can be
highly nonlinear but are strictly additive
Trees with three or more nodes allow progressively higher-order interactions
Running the model with different tree sizes allows simple discovery of the precise
amount of interaction required for maximum-performance models
In this example there is no strong evidence for low-order interactions


Bank Ratings: Regression Example

Build a predictive model for the average of major bank ratings

Scaled average of S&P, Moody's, and Fitch ratings

Challenges include:

Small database (66 banks)
Missing values prevalent (up to 35 missing values in any one predictor)
Impossible to build a linear regression model because of the missings
Expect relationships to be nonlinear

66 banks
25 potential predictors
Study run in 2003!


Cross-Validated Performance: Predicting Rating Score

[Line chart: cross-validated mean absolute error (Y-axis, roughly 0.6 to 1.4)
versus model size in trees (X-axis, roughly 60 to 945 trees).]

The optimal model was achieved at around 860 trees with a cross-validated
mean absolute error of 0.87; the target variable ranges from 1 to 10


Variable Importance Ranking

Ranks variables according to their contributions to the model

Influenced by the number of times a variable is used and its
strength as a splitter when it is used

Country Contribution to Risk Score

[Bar chart: contribution to risk score by country of the bank, for AT, BE, CA, CH,
DE, DK, ES, FR, GB, GR, IE, IT, JP, LU, NL, PT, SE, and US, together with the
distribution of banks by country.]

Swiss banks (CH) tend to be rated low risk, while Greek, Portuguese,
Italian, Irish, and Japanese banks tend to be rated high risk

Bank Specialization
[Bar chart: contribution to risk score by bank specialization, with the distribution
across Specialised Governmental Credit Inst., Savings Bank, Real Estate / Mortgage
Bank, Non-banking Credit Institution, Investment Bank/Securities House, Cooperative
Bank, and Commercial Bank.]

Commercial banks and investment banks are rated higher risk


Scale: Total Deposits

[Partial dependency plot with histogram: contribution to risk score versus total
deposits, roughly 0 to 6e+08.]

A step function in the size of the bank



ROAE

[Partial dependency plot with histogram: contribution to risk score versus ROAE
(return on average equity), roughly -20 to 40.]

Cost to Income Ratio: Impact on Risk Score

[Partial dependency plot with histogram: contribution to risk score versus the cost
to income ratio, roughly 20 to 100.]

A high cost to income ratio increases the risk score



Equity to Total Assets

[Partial dependency plot with histogram: contribution to risk score versus the equity
to total assets ratio.]

Greater risk is forecast when equity is a large share of total assets


References

Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
Freund, Y. & Schapire, R. E. (1996). Experiments with a new boosting
algorithm. In L. Saitta (ed.), Machine Learning: Proceedings of the Thirteenth
International Conference, Morgan Kaufmann, pp. 148-156.
Friedman, J. H. (1999). Stochastic gradient boosting. Stanford: Statistics
Department, Stanford University.
Friedman, J. H. (1999). Greedy function approximation: a gradient boosting
machine. Stanford: Statistics Department, Stanford University.
Hastie, T., Tibshirani, R. & Friedman, J. H. (2009). The Elements of
Statistical Learning, 2nd edition. Springer.

