Ensemble Learning
Business Analytics Practice
Winter Term 2015/16
Stefan Feuerriegel
Outline
1 Decision Trees
2 Bagging
3 Random Forests
4 Boosting
5 AdaBoosting
Decision Trees
[Figure: a toy decision tree — root node Outlook with branches Sunny → Humidity, Overcast → Yes, and Rain → Wind; the Humidity and Wind subtrees end in No/Yes leaves]
[Figure: the unpruned rpart tree for the GermanCredit data — repeated splits on Amount and Duration plus splits on Age, with leaves labeled Good or Bad]
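The underlying tree can be grown with rpart; the exact call is also visible in the printcp output further below:

library(rpart)
dt <- rpart(Class ~ Duration + Amount + Age,
            data=GermanCredit[inTrain,], method="class")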
library(partykit)  # as.party() converts the rpart tree for richer plotting
plot(as.party(dt))
[Figure: plot(as.party(dt)) — the same tree rendered with partykit; inner nodes split on Amount, Duration, and Age, and each of the 13 terminal nodes (n between 7 and 34) shows its distribution of Good vs. Bad]
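The pruning table below is standard rpart output; a sketch of the call that produces it, assuming the tree object dt from above:

printcp(dt)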
##
## Classification tree:
## rpart(formula = Class ~ Duration + Amount + Age, data = GermanCredit[inTrain,
## ], method = "class")
##
## Variables actually used in tree construction:
## [1] Age Amount Duration
##
## Root node error: 58/200 = 0.29
##
## n= 200
##
## CP nsplit rel error xerror xstd
## 1 0.051724 0 1.00000 1.00000 0.11064
## 2 0.034483 2 0.89655 0.96552 0.10948
## 3 0.012931 4 0.82759 1.08621 0.11326
## 4 0.010000 12 0.72414 1.13793 0.11465
[Figure: the pruned tree — a root split on Amount, then a split on Duration (< 19 vs. ≥ 19), with leaves labeled Bad, Good, and Good]
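Predictions for the held-out data, as used in the confusion matrix below, can be computed via (a sketch; the slides do not show this call):

pred <- predict(dt, newdata=GermanCredit[-inTrain,], type="class")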
I Output: the predicted labels (1st row), together with the set of all possible labels (2nd row)
I Confusion matrix via
table(pred=pred_classes, true=true_classes)
# horizontal: true class; vertical: predicted class
table(pred=pred, true=GermanCredit[-inTrain,]$Class)
## true
## pred Bad Good
## Bad 20 28
## Good 280 671
Bagging
Reasoning
I Classifiers C1, ..., CK that are independent, i.e. cor(Ci, Cj) = 0 for i ≠ j
I Each has an error probability of Pi < 0.5 on the training data
I Then an ensemble of these classifiers should have an error probability lower than each individual Pi
[Figure: for an ensemble of 20 classifiers, the probability (y-axis, up to 0.20) of each possible number of misclassifying classifiers (x-axis, 0 to 20)]
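The figure follows a binomial model and can be reproduced as follows (a sketch; the per-classifier error rate p = 0.3 is an assumption, as the slides do not state it):

K <- 20   # number of independent classifiers
p <- 0.3  # assumed individual error probability (must be < 0.5)
barplot(dbinom(0:K, size=K, prob=p), names.arg=0:K,
        xlab="Number of Classifiers with Misclassification",
        ylab="Probability")
# Probability that a majority (11 or more of 20) misclassifies simultaneously:
sum(dbinom(11:K, size=K, prob=p))  # approx. 0.017, far below the individual 0.3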
Bagging: Bootstrap Aggregation
Step 1: Create separate datasets D1, D2, ..., DK
Step 2: Train one classifier C1, C2, ..., CK on each dataset
Step 3: Combine the classifiers into a single classifier C
Bagging: Bootstrap Aggregation
Algorithm
I Given a training set D of size N
I Bagging generates new training sets Di of size M by sampling with replacement from D
I Some observations may be repeated in each Di
I If M = N, then on average 63.2 % (Breiman, 1996) of the original training set D is represented; the rest are duplicates
I Afterwards, train a classifier Ci on each Di separately (see the sketch below)
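A minimal bagging sketch in R (an illustration under the notation above, not the slides' code; it reuses the GermanCredit data and the inTrain index):

library(rpart)
set.seed(0)
train <- GermanCredit[inTrain,]
K <- 25  # number of bootstrap replicates
models <- lapply(1:K, function(i) {
  Di <- train[sample(nrow(train), replace=TRUE),]  # bootstrap sample D_i
  rpart(Class ~ Duration + Amount + Age, data=Di, method="class")
})
# Combine the K classifiers by majority vote on the test set
votes <- sapply(models, function(m)
  as.character(predict(m, newdata=GermanCredit[-inTrain,], type="class")))
pred.bag <- apply(votes, 1, function(v) names(which.max(table(v))))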
Random Forests
[Figure: intuition behind random forests — several trees, each grown on a different bootstrap sample, vote "rain" or "no rain"; the majority vote gives the ensemble prediction]
Random Forest: Algorithm
[Figure: the same input x is fed to every tree of the forest and the individual outputs are combined into a single prediction y]
Random Forest: Algorithm
The ensemble prediction averages the individual trees:

ŷ = (1/K) ∑_{i=1}^{K} t_i(x)

where t_i(x) is the prediction of the i-th tree.
Advantages
I Simple algorithm that learns non-linearity
I Good performance in practice
I Fast training algorithm
I Resistant to overfitting

Limitations
I High memory consumption during tree construction
I Little performance gain from large training data
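The error curve below can be produced, for instance, with the randomForest package (a sketch; the slides do not show the call, and the model name rf is taken from the plot):

library(randomForest)
set.seed(0)
rf <- randomForest(Class ~ ., data=GermanCredit[inTrain,],
                   ntree=100, importance=TRUE)
plot(rf)  # error versus number of trees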
[Figure: plot(rf) — error of the random forest rf as a function of the number of trees (0 to 100)]
Variable importance of a predictor x in tree t compares the accuracy before and after permuting x:

VI^(t)(x) = (1/K) ∑_{i=1}^{K} I(y_i = ŷ_i^(t)) − (1/K) ∑_{i=1}^{K} I(y_i = ŷ_i^(t) learned from permuted x)

with y_i being the true class and ŷ_i^(t) the predicted class of tree t.
I A frequent alternative is the Gini importance index
[Figure: variable importance plot for model rf2 — MeanDecreaseAccuracy (about 5 to 11) for CheckingAccountStatus.none, CheckingAccountStatus.lt.0, Duration, CreditHistory.Critical, and Amount]
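The numeric importance values behind such a plot can be obtained with (a sketch; rf2 is assumed to be a forest fitted with importance=TRUE):

importance(rf2, type=1)  # type=1: permutation-based mean decrease in accuracy
varImpPlot(rf2, type=1)  # dot chart of the most important variables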
Boosting
I Combine multiple classifiers to improve classification accuracy
I Works together with many different types of classifiers
I No classifier needs to be extremely good, only better than chance
→ Extreme case: decision stumps

y(x) = 1 if x_i ≥ θ, 0 otherwise
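A decision stump is a one-liner in R (a minimal sketch; the names stump and theta are illustrative):

stump <- function(x, theta) as.integer(x >= theta)
stump(c(2, 5, 7), theta=4)  # returns 0 1 1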
Boosting
High-level algorithm
1 Fit a simple model to a subsample of the data
2 Identify misclassified observations, i.e. those that are hard to predict
3 Focus subsequent learners on these samples and get them right
4 Combine the weak learners to form a complex predictor
Boosting: Example
Given training data D from which we
sample without replacement
Boosting: Example
I Reasonable guess: N1 = N2 = N3 = N/3 ⇒ but problematic
I Simple problem: C1 explains most of the data, and N2 and N3 are small
I Hard problem: C1 explains only a small part and N2 is large
I Solution: run the boosting procedure several times and adjust N1
Boosting in R
I Load the required package mboost
library(mboost)
I Fit a generalized linear model via glmboost(...)
m.boost <- glmboost(Class ~ Amount + Duration
+ Personal.Female.Single,
family=Binomial(), # needed for classification
data=GermanCredit)
coef(m.boost)
## (Intercept) Amount Duration
## 4.104949e-01 -1.144369e-05 -1.703911e-02
## attr(,"offset")
## [1] 0.4236489
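Class predictions can then be obtained from the fitted model (a sketch using mboost's predict method):

pred.boost <- predict(m.boost, type="class")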
Boosting in R: Convergence Plot
I Partial effects show how estimated coefficients evolve across iterations
I Plot convergence of selected coefficients
plot(m.boost, ylim=range(coef(m.boost,
which=c("Amount", "Duration"))))
[Figure: coefficient paths of Amount and Duration over boosting iterations 0 to 100; the Duration path ends near −0.015]
Boosting in R
I The main tuning parameter is the number of iterations mstop
I Use cross-validated estimates of the empirical risk to find the optimal number
→ Default is 25-fold bootstrap cross-validation
cv.boost <- cvrisk(m.boost)
mstop(cv.boost) # optimal no. of iterations to prevent overfitting
## [1] 16
plot(cv.boost, main="Cross-validated estimates of empirical risk")
[Figure: cross-validated estimates of empirical risk (about 0.56 to 0.62) over boosting iterations 1 to 94; the minimum is attained near mstop = 16]
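Afterwards the model can be restricted to the cross-validated number of iterations via mboost's subset operator (a sketch; note that this modifies m.boost in place):

m.boost[mstop(cv.boost)]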
Boosting in R
Alternative: fit a generalized additive model via component-wise boosting
m.boost <- gamboost(Class ~ Amount + Duration,
family=Binomial(), # needed for classification
data=GermanCredit)
m.boost
##
## Model-based Boosting
##
## Call:
## gamboost(formula = Class ~ Amount + Duration, data = GermanCredit,
##     family = Binomial())
##
## Negative Binomial Likelihood
##
## Loss function: {
## f <- pmin(abs(f), 36) * sign(f)
## p <- exp(f)/(exp(f) + exp(-f))
## y <- (y + 1)/2
## -y * log(p) - (1 - y) * log(1 - p)
## }
##
##
## Number of boosting iterations: mstop = 100
AdaBoosting
Instead of resampling, reweight misclassified training examples
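For reference, the standard discrete AdaBoost reweighting with labels y_i ∈ {−1, +1}: round t fits a weak learner h_t with weighted error ε_t, sets α_t = ½ ln((1 − ε_t)/ε_t), and updates each weight via w_i ← w_i · exp(−α_t · y_i · h_t(x_i)), so misclassified examples gain weight in the next round.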
[Figure: illustration of reweighting — misclassified training examples receive higher weights in subsequent rounds]
AdaBoost
Benefits
I Simple combination of multiple classifiers
I Easy implementation
I Different types of classifiers can be used
I Commonly used across many domains

Limitations
I Sensitive to misclassified points in the training data
AdaBoost in R
I Load the required package ada
library(ada)
I Fit an AdaBoost model on the training data with ada(..., iter) for a fixed number iter of iterations
m.ada <- ada(Class ~ .,
data=GermanCredit[inTrain,],
iter=50)
I Evaluate on test data test.x with response test.y
m.ada.test <- addtest(m.ada,
test.x=GermanCredit[-inTrain,],
test.y=GermanCredit$Class[-inTrain])
AdaBoost in R
m.ada.test
## Call:
## ada(Class ~ ., data = GermanCredit[inTrain, ], iter = 50)
##
## Loss: exponential Method: discrete Iteration: 50
##
## Final Confusion Matrix for Data:
## Final Prediction
## True value Bad Good
## Bad 33 25
## Good 2 140
##
## Train Error: 0.135
##
## Out-Of-Bag Error: 0.18 iteration= 50
##
## Additional Estimates of number of iterations:
##
## train.err1 train.kap1 test.errs2 test.kaps2
## 36 36 43 37
AdaBoost in R
Plot error on training and testing data via plot(m, test=TRUE) for
model m
plot(m.ada.test, test=TRUE)
[Figure: plot(m.ada.test, test=TRUE) — training error (curve 1, around 0.15) and test error (curve 2, around 0.25) over iterations 1 to 50]
AdaBoost in R
As with random forests, varplot(...) plots the importance of the top variables
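For the model fitted above, for instance (a sketch):

varplot(m.ada.test)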
[Figure: varplot output — importance scores for Amount, Duration, Age, OtherDebtorsGuarantors.Guarantor, and NumberExistingCredits]
Summary
Decision Trees
I Highly visual tool for decision support, though with a risk of overfitting