
Benchmarking analytical techniques for churn modelling in a B2B context
Word count: 10782

Jana Van Haver

Student number: 01200292

Supervisor: Prof. dr. Dirk van den Poel

Commissioner: Steven Hoornaert

Master’s Dissertation submitted to obtain the degree of:

Master of Science in Business Engineering

Academic year 2016-2017

Confidentiality Agreement

I declare that the content of this Master’s Dissertation may be consulted and/or reproduced,
provided that the source is referenced.

Jana Van Haver

Foreword


This thesis is written as the final part of my Master in Commercial Engineering and concludes a five-year trajectory. I have always had an interest in data analytics, and by exploring the subject of churn prediction I was able to deepen my understanding of this field of study. By means of this foreword, I would like to take the opportunity to thank the people who contributed to the realization of this dissertation.

First and foremost, I want to express my gratitude to my commissioner Steven Hoornaert for giving me the opportunity to work on this topic, even though I had no affinity with the subject matter beforehand. I also want to thank him for the guidance he provided throughout the whole process. His detailed and comprehensive suggestions and remarks helped me tremendously.

Special thanks go to my uncle for providing constructive feedback on my thesis. Finally, I would like to thank my parents for giving me the opportunity to study, and my brother and sister for their continuous support and encouragement.

Jana Van Haver

Table of Contents

Confidentiality Agreement i

Foreword ii

List of Abbreviations iv

List of Tables v

List of Figures vi

1 Introduction 1

2 Literature review 3
Relationship marketing in B2B markets . . . . . . . . . . . . . . . . . . . . . . . . 3
Customer Relationship Marketing (CRM) & Data Mining . . . . . . . . . . . . . . 3
Data mining algorithms for churn prediction . . . . . . . . . . . . . . . . . . . . . . 4
Customer churn prediction in FMCG settings . . . . . . . . . . . . . . . . . . . . . 8
Churn modeling in B2B settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Churn variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3 Methodology 13
Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Analytical techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Parameter selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Model evaluation criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Cross validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Statistical tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4 Results and Discussion 20

Performance evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Variable importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

5 Conclusion 26

6 Limitations and Future Research 27

References i

Attachment A Experimental results

Attachment B Results Wilcoxon signed-rank tests

List of Abbreviations

aCRM Analytical CRM

AUC Area Under the Curve

B2B Business-to-Business

B2C Business-to-Customer

CRM Customer Relationship Management

DT Decision Tree

FMCG Fast Moving Consumer Goods

FN False Negative

FP False Positive

FPR False Positive Rate

GLM General Linear Model

LR Logistic Regression

MARS Multivariate Adaptive Regression Splines

NB Naïve Bayes

NBD Negative Binomial Distribution

NN Neural Network

PCC Percentage Correctly Classified

RF Random Forests

RFM Recency, Frequency and Monetary

RM Relationship Marketing

ROC Receiver Operating Characteristic

SVM Support Vector Machine

TN True Negative

TNR True Negative Rate

TP True Positive

TPR True Positive Rate

List of Tables

1 Overview data mining techniques . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Churn prediction applications in FMCG sector . . . . . . . . . . . . . . . . . . 8

3 B2B churn prediction applications . . . . . . . . . . . . . . . . . . . . . . . . . 10

4 Behavioural variables included in former research . . . . . . . . . . . . . . . . 11

5 Churn prediction variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

6 Parameter tuning values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

7 Confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

8 Experimental results (medians) . . . . . . . . . . . . . . . . . . . . . . . . . . 21

9 P-values Friedman test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

10 Computation time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

11 Variable importance LR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

12 Variable importance DT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

13 Variable importance NB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

14 Variable importance RF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

15 Variable importance NN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

16 Variable importance SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

17 Variable importance Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

18 Variable importance Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

19 Experimental results (averages and standard deviations) . . . . . . . . . . . . ix

20 P-values Wilcoxon test (accuracy) . . . . . . . . . . . . . . . . . . . . . . . . . x

21 P-values Wilcoxon test (AUC) . . . . . . . . . . . . . . . . . . . . . . . . . . . x

22 P-values Wilcoxon test (sensitivity) . . . . . . . . . . . . . . . . . . . . . . . . x

23 P-values Wilcoxon test (specificity) . . . . . . . . . . . . . . . . . . . . . . . . x

24 P-values Wilcoxon test (F-measure) . . . . . . . . . . . . . . . . . . . . . . . . xi

25 P-values Wilcoxon test (lift) . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

List of Figures

1 Comparison of ROC-curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

Benchmarking analytical techniques
for churn modelling in a B2B context

Jana Van Haver

Tuesday 6th June, 2017

Abstract. Despite the proven importance of churn prediction for customer retention, research on the performance of churn modelling techniques has been very limited in B2B contexts. This stands in stark contrast to the numerous applications that can be found in B2C settings. In order to address this imbalance, we perform a benchmarking exercise of commonly used analytical techniques: Logistic Regression, Decision Trees, Naïve Bayes, Random Forest, Neural Networks, Support Vector Machines, Bagging and Boosting. Empirical data from an FMCG retailer are used to predict churn in a B2B setting. The results show that Stochastic Gradient Boosting outperforms the other models in predictive power. Logistic Regression can also be recommended for B2B churn prediction due to its excellent combination of high predictive power and comprehensibility at a low computation time. When evaluating variable importance, recency variables are shown to have a very high predictive power: every prediction technique ranks recency as the most important variable. Our findings also indicate that the importance of the other variable categories depends on the applied prediction technique.

Keywords: B2B, Churn Prediction, Data Mining, Non-contractual setting, FMCG

1 Introduction

The use of data mining techniques in Customer Relationship Management (CRM), in domains such as customer churn, customer acquisition or customer up- and cross-selling, has become common practice across various industries and applications. To date, most research on this topic is situated in Business-to-Customer (B2C) settings, while the application of such techniques has been scarce in Business-to-Business (B2B) settings. This scarcity is due to differences between industrial and consumer markets in decision-making processes, relationships, type of buyers, nature of demand, communication mix and other factors [83, 124]. Moreover, the limited availability of data and of domain-relevant knowledge among researchers active in the B2B field is seen as a considerable challenge as well [71]. However, these methods hold great potential in a B2B context, since industrial companies typically face a small number of customers who generate a large percentage of revenue [51]. According to the Pareto or 80/20 rule, 20% of customers may even generate 80% of the total revenue [121]. Since B2B companies are typically characterized by a smaller customer base but a much higher transaction volume [103], losing a customer has a more significant direct effect on the company's revenues. We therefore argue that data mining techniques can have positive implications for customer retention in B2B.

In this study, we specifically focus on customer retention, more popularly known by its converse, customer churn. Customer churn is defined as the number or percentage of regular customers who abandon a relationship with a service provider [59]. A distinction can be drawn between partial and complete churn. Partial churn can be defined as the switch of some of the customer's purchases to another company; we speak of complete churn when the switch involves all purchases [14]. To manage customer churn, companies can opt for retention campaigns tailored to a small set of customers or for identical retention campaigns targeted at all customers ('one-size-fits-all' marketing actions) [68]. Given a company's limited resources, targeted retention campaigns are much more efficient. Machine learning techniques can help with the identification of future churners, enabling the company to concentrate its retention efforts on those customers with the highest probability of churning. In conclusion, effective and accurate churn prediction models are needed to reliably estimate a customer's probability to churn.

For many years, customer defection has been predicted using data mining techniques such as Logistic Regression [52, 54], Decision Trees [4, 126], Neural Networks [13, 49] and Support Vector Machines [3, 58]. The main strength of these methods lies in their predictive potential. Logistic Regression and Decision Trees are the most widely used methods because they offer a good trade-off between performance and interpretability. The performance of data mining techniques for churn prediction has mostly been evaluated on datasets of B2C companies, due to the aforementioned difficulties and complications encountered in B2B research. Extensive comparisons of churn prediction techniques in B2C have already been presented in studies such as [110], [112] and [113].

Given a small customer base with a high relative contribution to the bottom line, the cost of a wrong prediction is even higher in the B2B domain. For this reason, churn prediction models with a low risk of misprediction are needed. However, in the B2B area we do not find the same intensity of research as in B2C [71, 117], which makes it challenging for companies to select an appropriate churn prediction model: there is a visible lack of churn prediction implementations in the B2B context. Moreover, the results of B2C research are ambiguous and difficult to compare. Each study recommends a different churn prediction technique, and these recommendations are mostly based on limited benchmarking analyses, so no general consensus can be reached [113]. This study fills the need for a broad benchmarking study in support of the B2B decision-making process.

In order to address this gap in research, this paper focuses on customer churn prediction in a B2B context. We first demonstrate the gap by analysing past research on the use of machine learning techniques for churn prediction. Next, we present an empirical analysis of the most commonly used techniques on a B2B data set of a Fast Moving Consumer Goods (FMCG) company. Our goal is to analyse to what extent techniques used in B2C settings are applicable in B2B settings as well. Eight algorithms for churn prediction are benchmarked: Logistic Regression, Decision Trees, Naïve Bayes, Random Forest, Neural Networks, Support Vector Machines, Bagging and Boosting. The performance of the techniques is assessed based on accuracy, sensitivity, specificity, F-measure, area under the ROC-curve (AUC) and top decile lift. The predictive power of each model is discussed while taking into account interpretability and computation time. Furthermore, we analyse the importance of the churn prediction variables for each technique.
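For concreteness, all of the listed metrics can be derived from the confusion matrix and the ranked churn scores. The following Python sketch is illustrative only, under our own assumptions (it is not the implementation used in the thesis), with 1 denoting a churner:

```python
def churn_metrics(y_true, y_score, threshold=0.5):
    """Evaluation metrics from true labels (1 = churner) and churn scores.

    Degenerate inputs (a single class, empty lists) are not handled here.
    """
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # AUC via the rank (Mann-Whitney) formulation: the probability that a
    # randomly chosen churner is scored above a randomly chosen non-churner.
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    # Top decile lift: churn rate among the 10% highest-scored customers,
    # relative to the overall churn rate.
    ranked = [t for _, t in sorted(zip(y_score, y_true), reverse=True)]
    decile = ranked[:max(1, len(ranked) // 10)]
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),             # true positive rate
        "specificity": tn / (tn + fp),             # true negative rate
        "f_measure": 2 * tp / (2 * tp + fp + fn),
        "auc": wins / (len(pos) * len(neg)),
        "top_decile_lift": (sum(decile) / len(decile)) / (sum(y_true) / len(y_true)),
    }
```

For example, `churn_metrics([1, 1, 0, 0], [0.9, 0.4, 0.2, 0.1])` yields an accuracy of 0.75 and a top decile lift of 2.0, since the highest-scored customer is indeed a churner while the overall churn rate is 50%.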

The remainder of the paper is organized as follows. Section II presents a literature review of churn prediction models used in both B2B and B2C. Section III describes the methodology, consisting of a short description of the analytical techniques, the evaluation metrics and the general approach. The results are discussed in Section IV, followed by the conclusion in Section V. Lastly, Section VI discusses the limitations and directions for future research.

2 Literature review

Relationship marketing in B2B markets

Since the early 1990s, the focus in businesses has shifted from transactional marketing to Relationship Marketing (RM) [56]. RM holds that it is much more effective to build long-term relationships with customers than to pursue potentially unrelated exchanges [128]. The sale between buyer and seller becomes the starting point of the buyer-seller interaction, whereas it is the endpoint in the transactional approach. It has been shown that RM is more beneficial, since retained customers increase earnings [14]. Moreover, they tend to spread positive word-of-mouth [94] and to buy more [75]. Their price sensitivity tends to decrease [94] and they become less sensitive to competitors' actions as well [8]. These advantages stimulate companies to adopt a relationship approach and establish long-term relationships with their clients.

The transactional approach would be, for example, a company offering the same standardized products or services to every customer. By contrast, committing to relationship marketing paves the way for customization and the alignment of manufacturing strategies, or even goes as far as designing the product together with the client. Companies that design products or services specifically to meet a particular customer's needs will incite customers to enter into close, long-term relationships and will therefore benefit from the advantages these relationships bring.

B2B companies are typically characterized by buyer-seller interdependence and by relationships that are close and long-term oriented [27]. Such relationships are advantageous because of possible cost reductions or increased revenues [27]. The customization that can be offered in this way makes them attractive for business customers as well. Moreover, the motivation to develop close relationships stems from the complex nature of B2B offerings, which lies in their technicality, complexity and the long, formal group buying processes [104]. In view of these considerations, B2B markets have always had a tendency towards relationship marketing.

Customer Relationship Marketing (CRM) & Data Mining

CRM enables the implementation of relationship marketing within a company [98]. CRM is described as the effort to combine customer-oriented business processes and technologies in order to manage the interaction between businesses and customers [56]. According to [86], CRM consists of customer identification, customer attraction, customer development and customer retention. In this paper, we focus on the domain of customer retention. This domain holds a lot of potential, given that acquiring new customers costs considerably more than retaining existing ones [8, 94]. In addition, a small improvement in retention rate can lead to a significant increase in profit [94, 111]. Companies have therefore shifted their focus from customer acquisition to customer retention [91].

Technology and CRM have changed the way marketing has been implemented over the last few years. Analytical CRM (aCRM) in particular has become omnipresent. As one of the four categories of CRM suggested by [17], aCRM aims to analyse the data a company has stored, using analytical tools. One such analytical tool is data mining, which can be described as the combination of statistical, mathematical, artificial intelligence and machine-learning techniques used to acquire information and insights from databases [109]. Data mining can be used to support the decision-making process and has frequently been applied in CRM, e.g. [31, 32, 40]. We refer to [86] for an extensive overview of data mining techniques applied to CRM. These applications mostly concern B2C settings, but [99] stated that CRM can be even more important for business customers. Given the potential the domain of customer retention holds, the application of analytical techniques in this domain is highly relevant and has had a major impact on CRM.

Data mining algorithms for churn prediction

The impact of data mining techniques has become apparent in the area of churn prediction as well. Many techniques have been successfully used in the past to predict a customer's probability to churn, e.g. [49, 81, 85]. Table 1 lists the predominant data mining techniques that have been used to this end.

Table 1 clearly shows that Logistic Regression (LR) and Decision Tree (DT) models are the most common algorithms in academic research for predicting customer churn. Even though they are less capable of capturing complex and non-linear relationships, their popularity stems from their ease of interpretation and low computation time. In the past, empirical analyses have led to contradictory results regarding their performance relative to one another. In studies such as [4], [38] and [18], the DT model did a better job at predicting churn than a Neural Network (NN) and LR. The latter study actually showed that the DT also outperformed a more sophisticated model, namely a Support Vector Machine (SVM). Then again, other studies, such as [6, 24, 47, 87], showed that LR achieved better results than the DT. In [47], LR was also the best alternative compared to a DT and an NN. We conclude that, in spite of their simplicity, LR and DTs show a competitive performance compared to more complex models, and in some cases even outperform them. Consequently, LR and DTs are very suitable benchmarking techniques.

Bagging and boosting are ensemble methods that were constructed to reach a higher predictive performance than single classifiers. Bagging requires somewhat more computation time than LR and DT, but [69] and [95] showed that the ensemble technique performs better than a DT. Bagging reduces the variance of the predictions and is simple and easy to put into practice. In [6] it is shown that bagging in combination with classification trees outperforms LR, which in its turn outperformed the classification tree without bagging; from this study we can conclude that bagging improves the predictive performance of classification trees. In [110], boosting yielded a substantial improvement in classification performance when combined with NNs, DTs and SVMs. The authors of [69] could not conclude whether bagging or boosting is better, since this depends on the dataset to which the methods are applied.


Methods B2C B2B
Regression Algorithms
[1] [4] [6] [13] [14] [15] [16] [19] [22] [24] [25] [29]
[34] [36] [38] [42] [44] [46] [47] [50] [52] [54] [55]
Logistic Regression (LR) [57] [63] [64] [66] [67] [70] [74] [76] [78] [77] [79] [18] [37] [105]
[81] [82] [84] [87] [88] [89] [93] [95] [96] [100]
[102] [108] [110] [112] [113] [114] [118] [126]
Linear Regression [89] [39]
Probit Regression [39]
Multivariate Adaptive
Regression Splines (MARS)
Perceptron-based techniques
[2] [3] [4] [5] [13] [14] [19] [26] [36] [38] [41] [43]
Multilayer Perceptrons:
[42] [45] [46] [47] [49] [50] [53] [54] [57] [58] [63]
Artificial Neural Networks [18] [37]
[67] [66] [81] [82] [92] [97] [101] [102] [106] [107]
[110] [112] [118] [120] [123] [126] [127]
Single layered Perceptrons:
[88] [112]
Voted Perceptron
Bayesian algorithms
[13] [41] [42] [60] [66] [80] [85] [110] [112] [114]
Naïve Bayes (NB)
[118] [127]
Bayesian Network (BN) [60] [61] [112] [114]
Ensemble classifiers
[14] [15] [16] [19] [23] [24] [28] [29] [48] [63] [64]
Random Forests (RF)
[77] [112] [119] [120]
Boosting [16] [19] [36] [41] [69] [74] [82] [110] [122] [112] [88]
Bagging [6] [29] [28] [57] [69] [95] [112]
GAMens [23] [29]
Logistic Model Tree (LMT) [112]
Dynamic ensemble methods [119]
Random Subspace Method [29]
Rotation forest [28]
RotBoost [28]
Other static classifiers
[119] [122]
ensemble methods
Rule-based methods
PART [44] [112] [114]
RIPPER [112] [113]
OneR [44]
AntMiner+ [113]
Active Learning Based
Approach (ALBA)
Instance-based algorithms
K-Nearest Neighbour
[26] [44] [48] [53] [112] [114]
Classifier (kNN)
Decision Trees (DT)
[2] [4] [5] [6] [9] [13] [18] [19] [23] [26] [28] [34]
[36] [38] [41] [44] [42] [43] [45] [47] [50] [53] [54]
[60] [63] [67] [66] [69] [70] [79] [80] [82] [87] [89] [105]
[93] [95] [97] [101] [102] [106] [108] [110] [112]
[113] [114] [115] [116] [118] [120] [123] [126] [127]
Support Vector Machines (SVM)
[2] [3] [19] [24] [35] [44] [46] [43] [42] [50] [53]
[58] [63] [70] [97] [101] [80] [110] [112] [113] [114] [18] [37]
[118] [120] [123] [127]
Hybrid Models
[21] [35] [44] [41] [45] [53] [67] [72] [88] [90] [92]
[101] [107]
Other Algorithms
Evolutionary Algorithms
[2] [5] [42]
Generalized Additive
[22] [23] [29]
Models (GAM)
Sequential Pattern Mining [20] [80] [97]
Survival Models [19] [73] [111]
Discriminant Analysis [13] [89] [97]
Pareto/Negative Binomial
Distribution Model (NBD)
Partial Least Squares [66]
K* [114]
Markov Chains [15]
Z-score model [92]
Decision Table [114]

Table 1: Overview data mining techniques

They did, however, show that boosting had a better predictive performance than a DT. We can conclude that bagging and boosting have a higher predictive performance than LR and DT as single classifiers in most cases. Boosting was, however, not able to outperform more complex models when benchmarked against them, such as Random Forests in [16] and SVMs in [19].
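The bagging principle discussed above, bootstrap resampling combined with majority voting over unstable base learners, can be made concrete with a toy sketch. This is an illustration under our own assumptions (decision stumps as base learners), not the setup of any of the cited studies:

```python
import random

def stump_fit(X, y):
    """Fit a one-split decision stump: returns (feature, threshold, direction)."""
    best = (len(y) + 1, 0, 0.0, 1)
    for j in range(len(X[0])):
        for thr in {row[j] for row in X}:
            for direction in (1, 0):  # 1: predict churn when feature >= threshold
                pred = [direction if row[j] >= thr else 1 - direction for row in X]
                err = sum(p != t for p, t in zip(pred, y))
                if err < best[0]:
                    best = (err, j, thr, direction)
    return best[1:]

def stump_predict(model, row):
    j, thr, direction = model
    return direction if row[j] >= thr else 1 - direction

def bagging_fit(X, y, n_models=25, seed=0):
    """Bootstrap aggregating: each stump is fit on a resampled copy of the data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(stump_fit([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(models, row):
    """Majority vote over the bootstrap ensemble (ties broken towards churn)."""
    votes = sum(stump_predict(m, row) for m in models)
    return 1 if 2 * votes >= len(models) else 0
```

Because each stump sees a slightly different bootstrap sample, averaging their votes reduces the variance of the prediction, which is exactly why bagging helps unstable learners such as trees far more than it helps a stable learner like LR.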


To deal with the disadvantages of Decision Trees, namely their lack of robustness and their vulnerability to noise in the data, Random Forests (RF) have been proposed. Random Forests seem to be a more popular technique for partial churn prediction than for complete churn: the technique has been applied in 5 of the 9 studies in Table 1 that treat partial churn. The algorithm generally has a high predictive performance. Random Forests surpass LR and SVMs in [24], and the technique similarly performs better than DTs, NNs and SVMs in [120]. However, in [14] RF did not reach a significantly higher performance than LR and an NN, and in [77] LR even outperformed RF. A disadvantage of RF is that the technique is considered a black box.

Similar to RF, NNs and SVMs are black-box models as well. NNs are among the most popular methods, as can be seen in Table 1. The technique is generally considered to have a higher predictive performance than less complex models such as LR and DTs. [82] and [102] show that NNs outperform DTs and LR, and [5] likewise shows the superiority of NNs over DTs. In [53] we see that NNs not only have a better predictive performance than kNN and DTs, but outperform SVMs as well. However, the authors of [106] concluded that the NN was outperformed by a DT.

SVMs are seen as more sophisticated models that are computationally more intensive [70]. Multiple studies show the excellent predictive performance of SVMs compared to other churn prediction techniques [46, 70, 101, 110, 127]. However, [24] showed that SVMs only surpass LR when the right parameter-selection technique is used, and an SVM was even outperformed by a DT and an NN in [110].

Naïve Bayes (NB) is a prediction technique that has frequently been applied for churn prediction as well. Although it is a simple classifier, NB has been able to report high predictive accuracies in the past [41]. In [110], NB was shown to be less effective than NNs, SVMs and DTs. In [118], NB was similarly outperformed by NNs and SVMs, but it did attain a better performance than DTs. Nonetheless, [112] recommends Naïve Bayes as a churn prediction model due to its comprehensibility, operational efficiency and sufficient predictive power.

Many hybrid methods have been proposed as well, e.g. [35, 41, 72, 88]. A hybrid method is a combination of two or more data mining techniques that aims to increase the predictive power of standard classification techniques. Some hybrid models combine two classification techniques, while others combine a clustering technique with a classification technique. Various hybrid methods have been proposed in the literature: a hybrid model combining an SVM with a Naïve Bayes Tree is built in [35], [72] introduces a hybrid model based on Rough Set Theory and a Flow Network Graph, and a classification technique (DT) is combined with a clustering technique (Growing Hierarchical Self-Organizing Maps) in [21]. When benchmarked against other classification techniques, these hybrid models always appear to be the most effective and performant. However, their predictive power is hardly ever tested in other situations.


FMCG churn prediction applications, per study: FMCG company, churn type (partial or full) and the data mining techniques used, where an asterisk (*) denotes the best performing technique in the study:

- Buckinx et al. [13] (partial churn): Neural Network*; Logistic Regression; Linear/Quadratic Discriminant Analysis; Decision Tree; Naïve Bayes; K-Nearest Neighbours
- Buckinx and Van den Poel [14], grocery retailer (partial churn): Logistic Regression; Artificial Neural Network; Random Forests
- Gordini and Veglio [37], Italian online company (full churn): SVMauc, an SVM based on AUC parameter selection*; SVMacc, an SVM based on accuracy parameter selection; Neural Network; Logistic Regression
- Miguéis et al. [76], European food-based retailer (partial churn): Logistic Regression conducted with Stepwise Feature Selection*; MARS; Logistic Regression without a variable selection procedure
- Tamaddoni Jahromi et al. [105], Australian retailer (full churn): Boosting*; Logistic Regression; Simple Decision Tree; Cost-sensitive Decision Tree

Table 2: Churn prediction applications in the FMCG sector

Customer churn prediction in FMCG settings

Customer churn prediction has been addressed in multiple sectors like publishing [6, 22, 24], fi-
nancial services [36, 63, 111], insurance [46, 80, 102], e-commerce [57, 123], banking [35, 72, 87],
telecommunications [43, 47, 112], online gambling [23], retailing [14, 39, 76], logistics [18] and cable
services [15]. The attention given to telecommunication industry has been excessive. 70 of the
117 papers listed in Table 1 are studies about telecommunication companies. That is because this
industry is characterized by strong competitiveness and increased liberalization which makes churn
prediction indispensable [58]. In comparison, most of the other sectors like logistics, e-commerce
and retail have been underrepresented.

In this paper we treat the data of a retailer, more specifically an FMCG company. Fast-moving consumer goods are considered relatively inexpensive and frequently purchased [65]. The resulting high transaction volume makes customer retention, and accordingly churn prediction, all the more prominent. However, the fact that FMCG companies often operate in a non-contractual setting makes churn prediction more challenging as well: customers are not obliged to let companies know when they stop using their services or buying their products, so it is more difficult to determine when exactly a customer has churned. It has therefore been suggested to focus on partial churn instead of complete churn in retail settings, because customers typically defect progressively rather than through an abrupt discontinuation [77]. According to [14], partial churn has a strong possibility of turning into complete churn in the long run, so successfully predicting partial churn can prevent complete churn.
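Because defection in a non-contractual setting is progressive, partial churn has to be operationalized from transaction data rather than observed directly. The snippet below sketches one hypothetical labelling rule, a spend drop relative to the customer's own baseline; both the rule and the `drop_ratio` threshold are illustrative assumptions, not the definition used in the cited studies or in this thesis:

```python
def label_partial_churn(period_spend, drop_ratio=0.5):
    """Flag a customer as a partial churner (1) when spend in the most recent
    period falls below `drop_ratio` times the customer's average spend over
    the preceding observation periods; otherwise return 0.
    `period_spend` is a chronological list with at least two periods."""
    *history, current = period_spend
    baseline = sum(history) / len(history)
    return 1 if current < drop_ratio * baseline else 0
```

For instance, a customer whose quarterly spend evolves as `[100, 120, 110, 40]` would be flagged under this rule, while a customer at `[100, 100, 100, 90]` would not.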

An overview of churn prediction applications in the FMCG sector is displayed in Table 2. The applications that provide the most relevant results for our empirical analysis are [105] and [37], where FMCG datasets of B2B companies are likewise used to predict churn. This implies that their best performing techniques, boosted trees and an SVM, could lead to similarly satisfactory results when applied to our dataset. In [37], an SVM (89.98% PCC, 88.61% AUC) outperformed Logistic Regression (88.13% PCC, 86.04% AUC) and a Neural Network (88.25% PCC, 87.15% AUC). In [105], boosting (92% AUC) performed slightly better than Logistic Regression (91% AUC), while simple and cost-sensitive decision trees (AUC of 85% and 83%, respectively) were significantly outperformed by both techniques. These studies predicted complete churn, while the other FMCG studies focused on partial churn only, as can also be seen in Table 2.

It was demonstrated in [14] that partial churn can be successfully predicted in a non-contractual setting. No significant differences were found in this study between the analysed data mining techniques: Logistic Regression, a Neural Network and Random Forests. However, in [13] a Neural Network (76.23% PCC, 79.72% AUC) significantly outperformed Logistic Regression (75.57% PCC, 79.02% AUC) and other well-known methods. In [76], Multivariate Adaptive Regression Splines (MARS) was introduced to predict churn among the customers of a retailer and was benchmarked against Logistic Regression. The study showed that MARS was able to detect more partial churners (AUC 76.74%) than Logistic Regression (AUC 75.29%), except when Logistic Regression was combined with stepwise forward or stepwise backward feature selection (AUC of 78.43% and 78.50%, respectively).

We can conclude that the different applications situated in the FMCG sector lead to inconclusive results. Only a limited number of data mining techniques are evaluated in these studies, which does not give a comprehensive view of the performance of different churn prediction algorithms in the FMCG domain.

Churn modeling in B2B settings

A distinction is made between applications in B2B and B2C in Table 1. The vast majority of the data mining techniques are used to predict churn of a B2C company. In the B2B domain, implementations have been limited [4], as the right-hand column of the table confirms: applications of churn modelling techniques are numerous in B2C, while only a limited number are situated in a B2B context.


B2B churn prediction applications, per paper: company, best performing technique(s) and the outperformed techniques (in order of performance):

- Chen et al. [18], Taiwanese logistics company: best technique: Decision Trees; outperformed: Artificial Neural Network & Support Vector Machines, Logistic Regression
- Gordini and Veglio [37], Italian e-commerce FMCG company: best technique: SVMauc, an SVM based on AUC parameter selection; outperformed: SVMacc, an SVM based on accuracy parameter selection, Neural Network, Logistic Regression
- Hopmann and Thede [39], German retailer for electronics and computer accessories: best techniques: General Linear Model (GLM) Regression and Probit Regression; outperformed: Negative Binomial Distribution (NBD)-based model
- Tamaddoni Jahromi et al. [105], Australian FMCG company: best technique: Boosting; outperformed: Logistic Regression, Simple Decision Tree, Cost-sensitive Decision Tree

Table 3: B2B churn prediction applications

To the best of our knowledge, only 4 papers have contributed to research in the B2B churn prediction
domain: [18], [37], [39] and [105]. More details about their approaches can be found in Table 3.
The churn probability of customers of a Taiwanese logistic B2B company was predicted in [18].
The authors were interested in the effect of length of the relationship, recency, frequency, monetary
and profit (LRFMP) variables on the predictive power. A Decision Tree, Logistic Regression, an
Artificial Neural Network and a Support Vector Machine were put into practice. The results showed
that the Decision Tree model was the best-performing algorithm in terms of accuracy, recall and
F-measure. A Negative Binomial Distribution (NBD)-based model, Probit Regression and General
Linear Model (GLM) Regression were used to construct a churn prediction model in [39]. A German
retailer for electronics and computer accessories made its data available for analysis. It was found
that GLM and Probit outperformed the stochastic model. As discussed before [37] and [105] treat
data of FMCG companies. In [37] the predictive performance of SVM was found to be superior
to Logistic Regression and Neural Networks. A data mining approach to model non-contractual
churn in a B2B context was proposed in [105]. Boosting outperformed three modelling techniques
(cost-sensitive learning decision tree, simple decision tree and logistic regression).

In conclusion, each of the 4 papers proposes a different technique for B2B churn prediction and
each of these proposed techniques is benchmarked against a different set of techniques. This makes
it difficult to evaluate the performance of the techniques in the B2B domain. If we consider
Logistic Regression, for example, no general consensus can be reached. Boosting only marginally
outperformed LR in [105], which indicates a sufficient performance of LR. However, when LR was
used in [18] as a benchmarking technique, it performed the worst. A similar conclusion was found
in [37], where Logistic Regression was outperformed by Neural Networks and Support Vector
Machines as well. It is equally difficult to judge the performance of Decision Trees, which is
alternately the best and the worst performing technique in [18] and [105].

There is an ambiguity in the interpretation of method performance. This makes it challenging for
B2B companies to determine which algorithm would be best suited for the implementation of a churn
prediction model. Additionally, the lack of research on B2B churn prediction makes the challenge
even bigger. To the best of our knowledge, methods like Bagging, Random Forests and Naïve Bayes
have not been applied to B2B datasets yet. This results in a significant gap in research. Other
well-known data mining techniques like Logistic Regression, Decision Trees, Neural Networks,
Boosting and Support Vector Machines have been used in a B2B context in the past, but only to a
very limited extent. No study to date has made a thorough comparison of several algorithms on the
same B2B dataset, which makes it hard to assess the performance of these algorithms in this
setting. There is a need for more research evaluating the different churn prediction techniques in
a B2B setting. In this study we will therefore provide a broader benchmarking exercise than the
ones already available.

Churn variables

The performance of a model not only depends on the algorithm used for its construction; the
included churn prediction variables can have an important influence on predictive performance as
well. Past research already indicated the importance of Recency, Frequency and Monetary (RFM)
variables as behavioural variables for churn prediction. Recency refers to the time since the last
purchase; Frequency to the number of purchases a customer made within a certain period; Monetary
value can be described as the cumulative amount of money spent by a customer in this period
[14, 78].

In Table 4 a summary is given of behavioural churn prediction variables used in former FMCG
studies.

[Table 4: Behavioural variables included in former research. The table layout was lost in
extraction; recoverable row labels include brand purchase, behaviour across product categories,
length of relationship and inter-purchase time variables.]

The importance of RFM variables to indicate future churners was demonstrated as well on a FMCG
dataset in [14]. The length of the relationship was also shown to be an important signal for
future churning behaviour. The same goes for mode of payment, buying behaviour across categories,
usage of promotions and brand purchase behaviour, which are listed in descending order of
importance. In [13] the application of churn prediction models to the data of FMCG customers
showed that the length of relationship was an important indicator in addition to frequency and
inter-purchase time related variables, mode of payment and promotional behaviour. In [37] we find
a similar conclusion for recency, frequency and length of relationship; monetary indicators
appeared to be less important. Furthermore, it was shown in this study that variables related to
product categories and failure are important predictors as well. Given that it treats the data of
a FMCG B2B company, this study is highly relevant. In [76] no evident link could be found between
the predictors selected by the different prediction techniques.

However, the study showed that brand related variables were not relevant in any technique, just
like the total amount spent during the analysed period, which serves as a monetary variable. In
[105], only RFM variables were used. The study emphasizes recency and frequency as highly
predictive; monetary indicators contributed less significantly to predicting churn.

To conclude, recency and frequency turn out to be highly important in all studies. Monetary
indicators do not seem to live up to their expected importance. The importance of length of
relationship was affirmed by all studies that incorporated this variable.
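To make the RFM and length-of-relationship definitions above concrete, the sketch below derives them from a hypothetical transaction log with pandas. The column names and data are illustrative only, not those of our dataset.

```python
import pandas as pd

# Hypothetical transaction log; column names and values are illustrative only.
tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "invoice_date": pd.to_datetime(
        ["2016-01-05", "2016-03-01", "2016-06-01", "2015-06-01", "2016-02-01"]),
    "amount": [100.0, 250.0, 80.0, 40.0, 60.0],
})
snapshot = pd.Timestamp("2016-06-13")  # end of the observation window

rfm = tx.groupby("customer_id").agg(
    recency=("invoice_date", lambda d: (snapshot - d.max()).days),    # time since last purchase
    frequency=("invoice_date", "count"),                              # number of purchases
    monetary=("amount", "sum"),                                       # cumulative amount spent
    length=("invoice_date", lambda d: (snapshot - d.min()).days),     # length of relationship
)
print(rfm)
```

Each row of `rfm` then holds one behavioural profile per customer, ready to be joined with the churn label.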


Considering the lack of academic research on this topic, a significant difference in field of application
and a variety of methods applied in different papers, a comparison of the different B2B churn
prediction methods is difficult to realize. Given this variation, the interpretation of the results
is quite challenging. Drawing conclusions for B2B based on results acquired in a B2C context is a
considerable challenge as well, due to the differences between these markets. We can however make
some general predictions about the performances of the different churn modelling techniques.

Logistic Regression and Decision Trees are techniques that clearly dominate in Table 1. Their
popularity in the prediction of B2C customer churn leads us to believe that they will act as adequate
techniques for B2B churn prediction as well. Especially since their application is widely spread in
other domains than solely churn prediction. On the grounds that Neural Networks and Support
Vector Machines are considered generally as more advanced prediction models, they will most
probably outperform LR and DTs in predictive power. Especially since NNs and SVMs turned
out to be better performing than LR in both B2B studies [18, 37] where these techniques were
benchmarked amongst others. Bagging and Boosting tend to perform better than single classifiers,
and due to the superior predictive power of Boosting on a B2B dataset in [105], we estimate them
to be adequate as well. RF has proven its adequacy too, by frequently outperforming other techniques
in past studies such as [14, 15, 24, 77]. This gives us reason to foresee a good individual
performance of these techniques. We expect NB to achieve performances similar to LR and DT,
since it was not able to outperform more complex techniques in past research.

However, it is difficult to state expectations about the performance of churn prediction techniques
relative to one another. When consulting literature, former studies show varying results concerning
the performance of the techniques. For every technique, there appears to exist research affirming
their superiority compared to others. Nevertheless, nearly always studies can be found that claim
the contrary.

3 Methodology


The dataset used to perform the computational experiments in this study is obtained from a B2B
company offering fast moving consumer goods. The dataset contains historical sales transactions
of 10 000 business customers. The data range is situated between 1/1/2011 and 13/6/2016. The
proportion of churners is about 25%. Compared to other studies this is relatively
high, since churn rates in B2C generally lie within a 5%-15% range e.g. [6, 15, 24, 35, 44, 80, 113,
125]. However, higher churn rates can be found as well in literature e.g. [14, 23, 88]. Since [14]
treats the data of a FMCG retailer as well, although it concerns a B2C company, a churn rate of
25% in a non-contractual setting might not be exceptionally high.


The target variable of our predictive models is churn, represented as a binary variable. Each
customer in the dataset is either identified as a churner (value 1) or a non-churner (value 0). A
churner is seen as a customer who cut ties with the company, while a non-churner stays loyal to
the company.

To construct the models, we used a limited number of predictors that are highly predictive in order
to keep the models as simple and comprehensible as possible. In this way the generalisability of
our conclusions will be facilitated. Employing only a limited number of variables will keep the
computational time low as well. Given that we are treating the data of a company in a non-
contractual B2B setting, no demographic variables are available. The equivalent of these variables
in a B2B context would be, for example, the number of employees or the concerned industry. Since
this information is not available, only behavioural variables based on the transactional history of
customers are included in the models.

The original dataset consisted of 563 variables. We selected the relevant variables for our study
based on former literature written about the subject. The variables that remained are listed in
Table 5. The equipment variables in the table concern information about the equipment installed
at the customers' sites that is necessary to preserve the company's products.

Since inter-purchase time related variables were proven to be an important variable category in past
research, we use recency variables to construct our prediction models. Time since last invoice
and time since last equipment installation date both refer to the recency of customers' shop
incidences. We include several variables related to customers' frequency of purchases: the number
of products and of equipment models sold to the customer. Furthermore, sales quantity for adjacent
and CONV192 products, total sales quantity and total sales quantity in promotion are classified as
frequency variables. The following monetary indicators are included: sales in dollars (represented
by 2 variables) and the cost of goods for all orders. The length of relationship is incorporated
as well; this variable category is operationalized by including the time since first invoice and
the time since first equipment installation date.

Dependent variable
Description Churners Non-churners Churn rate
Churn 2517 7483 25.17%

Independent variables
Summary statistics
Min. Median Max. Mean SD.

Sales.Inv Dt rec Time since last invoice 0.00 24.00 1254.00 143.59 268.00
Equipment.Install Date rec Time since last equipment installation date 0.00 212.50 11287.00 767.89 1238.45

Sales.salesTotal freq Number of products sold to customer 1.00 119.00 11645.00 278.79 507.22
Equipment.Models freq Number of equipment models at customer 0.00 1.00 14.00 1.07 1.26
Sales.STD ADJ FCT mean Sales quantity for adjacent products 0.50 2.22 5.00 2.18 1.09
Sales.PKG CONV192 mean Sales quantity for CONV192 products 0.00 2.50 64.00 2.58 1.26
Sales.Qty mean Total sales quantity -165.90 1.37 217.78 1.66 6.26
Sales.PROMO QTY mean Total sales quantity in promotion 0.00 0.00 18.80 0.08 0.42

Sales.Whlsl Price Xtnd mean Sales in dollars -975.00 71.72 2129.79 81.36 68.77
Sales.WHLSL UNIT PRICE mean Sales in dollars 0.00 48.41 140.00 49.56 22.91
Sales.COST OF GOODS mean Cost of goods for all orders 0.00 23.84 73.66 23.25 14.72

Sales.Inv Dt dura Time since first invoice 0.00 1169.00 1258.00 900.35 418.93
Equipment.Install Date dura Time since first equipment installation date 0.00 317.00 23173.00 1070.23 1634.84

Table 5: Churn prediction variables

Analytical techniques

Multiple churn prediction models were constructed in order to predict the churn probability of
B2B customers in our dataset. The data mining techniques used to create those models were
selected based on their popularity and good predictive performance in past studies. The following
classification techniques were included in our benchmarking study:

Logistic Regression (LR) LR enables to predict the probability of a binary dependent variable
outcome based on the values of a set of independent variables. LR is easy to use and provides quick
and robust results [14]. Moreover, LR has a good interpretability compared to other methods [16].
This makes it an excellent benchmarking technique for the more complex and sophisticated models
applied in this study.

Decision Tree (DT) DTs are models that create a tree-like structure where instances are classified
based on their feature values. In each internal node a test is performed on a feature value. A branch
represents the outcome of the test, which eventually leads to a leaf node that stands for a class
label. In this way, decision rules for the classification of new instances are established. DTs are
widely used in many fields due to their ease of interpretability [116]. However, they are considered
unstable classifiers that change significantly when small adjustments are made to the data [11].
The only parameter to tune is the complexity parameter.

Naı̈ve Bayes (NB) NB is a classification technique that is constructed based on Bayes’ theorem.
NB assumes independence among features, which is a serious limitation of the model. The technique
constructs an algorithm with a low variance, because it is quite insensitive to data fluctuations [62].
This, however, implies that the predictions will most likely be less accurate than high-variance
models. A kernel density estimate will be used as the density function to construct the Naı̈ve
Bayes model.

Neural Network (NN) Artificial Neural Networks mimic the structure and functions of a biolog-
ical neural network. NNs consist of multiple layers that are made up of neurons. The input layer
communicates with one or more hidden layers, which in turn links to the output layer. The con-
nections between each of the layers’ neurons are made through weighted links. A popular method
to assign those weights, is the Back Propagation Method. The self-learning ability of NNs makes
that the underlying logic is not clear. Consequently, NNs are models with poor interpretability
[116]. NNs have a higher computation time than LR or DTs as well. We will construct a standard
Neural Network model with 1 hidden layer. The parameters to adjust are the decay that is added
to the weights and the number of neurons in the hidden layer.

Support Vector Machine (SVM) SVMs represent instances by points in a high-dimensional
space. SVMs search for the best separating gap between the points of different classes. New
instances are mapped in the same space and classified based on their location relative to the
separating gap. SVMs are characterized by a high predictive performance [70, 50]. Only two
parameters have to be specified, the upper bound and the kernel parameter. On the downside,
SVMs are black box models and computationally more intensive [70]. In order to capture any non-
linear relationships we make use of a Gaussian Radial Basis kernel function (SVM-RBF). The
parameters to optimize are the misclassification cost and a sigma (σ) value that is specific to the
Radial kernel.

Bagging Bagging stands for bootstrap aggregating. It improves prediction accuracy by applying a
base classifier on different bootstrap samples. These samples are randomly drawn out of the training
sample with replacement. The results are combined using majority voting. Bagging requires no
extra information, is easy to implement and reduces a classifier’s variance [69]. Bagging performs
generally better than the base classifier when the latter is unstable, but will not be able to increase
the performance when it is not [11].

Random Forests (RF) RF are ensemble classifiers that grow multiple classification trees. Each
tree is grown on a bootstrap sample of the training set by using random feature selection at each
node. RF classify an instance based on the classifications of the individual trees. The class that
receives the most votes is attributed to that instance. RF protect against overfitting, which can
sometimes happen with DT [12]. The technique is able to deliver a consistent high performance, is
very robust and has a reasonable computing time [14]. The only parameter to adjust is the number
of variables that are available for splitting at every node.

Boosting Boosting is seen as a more sophisticated version of bagging. First, the base classifier
is applied to the training sample, where each instance has an equal weight. Next, the weights are
adjusted, more importance is attributed to misclassified instances. A new classifier is constructed
based on the new weights. This process can be repeated multiple times. Boosting reduces vari-
ance as well as bias [69]. It is considered a robust technique [16]. Two commonly used boosting
algorithms are AdaBoost and Stochastic Gradient Boosting. Since Stochastic Gradient Boosting
achieved the best performance in [69], we will apply this algorithm to our B2B dataset. Stochastic
Gradient Boosting requires 4 parameters: the number of boosting iterations, the number of splits
performed on a tree, the learning rate and the minimal terminal node size.
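For reference, the eight techniques could be assembled with scikit-learn roughly as follows. This is a sketch rather than our actual implementation: the hyperparameter values shown are placeholders, not our tuned settings, and scikit-learn's GaussianNB uses a Gaussian density rather than the kernel density estimate described above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)

# One entry per benchmarked technique; all parameter values are placeholders.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(ccp_alpha=0.003),            # complexity parameter
    "NB": GaussianNB(),                                       # Gaussian, not kernel density
    "NN": MLPClassifier(hidden_layer_sizes=(5,), alpha=0.01), # 1 hidden layer, weight decay
    "SVM": SVC(kernel="rbf", C=1.0, gamma=0.01, probability=True),
    "Bagging": BaggingClassifier(DecisionTreeClassifier()),   # DT as base classifier
    "RF": RandomForestClassifier(max_features=4),             # variables per split
    "Boosting": GradientBoostingClassifier(subsample=0.5),    # stochastic variant
}
print(sorted(models))
```

Each model exposes the same `fit`/`predict_proba` interface, which makes looping over the benchmark straightforward.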

Parameter selection

The parameters of the different analytical techniques will be optimized by making use of grid
search. In Table 6 the ranges of values that are used for tuning the parameters can be found.
For every analytical technique, different models will be constructed for all possible combinations
of parameters. The optimal combination of parameters is defined based on a cross-validated AUC
value.

Technique   Parameter                   Tuning values
DT          complexity parameter        [0.0025, 0.0030, 0.0035, 0.0040, 0.0045]
NN          decay                       [0.0001, 0.001, 0.01, 0.1]
NN          # hidden neurons            [1, 3, 5, 7]
SVM         σ                           [10^-4, 10^-3, 10^-2, 10^-1]
SVM         cost                        [10^-3, 10^-2, 10^-1, 10^0, 10^1, 10^2]
RF          # variables per split       [3, 4, 5, 6, 7]
Boosting    # iterations                [500, 1000]
Boosting    # splits                    [2, 3]
Boosting    learning rate               [0.1]
Boosting    min. terminal node size     [10, 25, 50]

Table 6: Parameter tuning values
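Such a grid search over cross-validated AUC can be sketched with scikit-learn's `GridSearchCV`, shown here for the Boosting grid of Table 6. The mapping of Table 6's parameters onto scikit-learn argument names is our own approximation, and the toy data stands in for our actual dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Toy imbalanced data (~25% positives) standing in for the real dataset.
X, y = make_classification(n_samples=300, weights=[0.75], random_state=1)

# Table 6 grid for Stochastic Gradient Boosting; argument names approximate.
grid = {
    "n_estimators": [500, 1000],       # number of boosting iterations
    "max_depth": [2, 3],               # splits per tree (approximated by depth)
    "learning_rate": [0.1],
    "min_samples_leaf": [10, 25, 50],  # minimal terminal node size
}
search = GridSearchCV(GradientBoostingClassifier(subsample=0.5),
                      grid, scoring="roc_auc", cv=3)
search.fit(X, y)
print(search.best_params_)
```

`scoring="roc_auc"` makes the search pick the parameter combination with the highest cross-validated AUC, mirroring the selection criterion above.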

Model evaluation criteria

The analytical techniques described above will be assessed based on their ability to identify true
churners. Table 7 shows the confusion matrix. In the matrix, a True Positive (TP) refers to
correctly classifying an actual churner. A True Negative (TN) is a correct classification of an
actual non-churner as non-churner. A misclassified churner or non-churner is defined by a False
Negative (FN) or False Positive (FP) respectively.

Churners Non-churners
Churners TP FN
Non-churners FP TN

Table 7: Confusion matrix

Accuracy Accuracy, also known as Percentage Correctly Classified (PCC), is the number of cor-
rectly classified instances divided by the total number of classified instances. It is the most com-
monly used evaluation metric for classifiers. A downside of this metric is that it assumes equal
misclassification costs for FP and FN. In the context of churn prediction this is not appropriate,
since misclassifying a churner implies a higher cost than classifying a non-churner as a churner.
Addressing a retention campaign to non-churners implies a waste of useful resources. However,
the cost of losing a customer by incorrectly classifying him as a non-churner is much higher.
Furthermore, PCC depends heavily on the cut-off value that determines whether an instance will be
classified as a churner or non-churner based on its predicted probability.

Accuracy = (TP + TN) / (TP + FP + TN + FN)

In order to compare the accuracy for different data mining techniques more adequately, we will
report the top 10% accuracy. This can be interpreted as the accuracy based on a cutoff value

that is equal to the 90th percentile of the predicted probabilities. Since the distribution of the
probabilities varies across algorithms, choosing a relative cutoff value is more appropriate than an
absolute one.

Precision & recall Precision and recall measures can give a better insight in the performance of
classification models since these measures do not assume equal misclassification costs.

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

F-measure The F-measure (F1 ) combines both precision and recall into a single value, which is
more appropriate for the evaluation of predictive performance. Both evaluation metrics are required
to adequately assess the performance of a prediction technique.

F1 = (2 × Precision × Recall) / (Precision + Recall)

Sensitivity & specificity These measures are an alternative for accuracy as well, since they do
not assume equal misclassification costs either. Sensitivity is the True Positive Rate (TPR), the
percentage of churners correctly classified. One can remark, sensitivity is equal to recall. Specificity
or True Negative Rate (TNR) is the percentage of non-churners correctly classified.

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)

The sensitivity, specificity and F-measure are calculated based on the same cutoff value that is used
for the top 10% accuracy.
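As a sketch of how these cutoff-based measures follow from the 90th-percentile rule, the helper below computes them on simulated scores. The function and data are our own illustration, not part of the experimental pipeline.

```python
import numpy as np

def cutoff_metrics(p, y, q=0.90):
    """Accuracy, sensitivity, specificity and F1 at the q-th percentile
    of the predicted probabilities (q=0.90 -> the top 10% cutoff)."""
    cutoff = np.quantile(p, q)
    pred = (p > cutoff).astype(int)
    tp = np.sum((pred == 1) & (y == 1))
    tn = np.sum((pred == 0) & (y == 0))
    fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    acc = (tp + tn) / len(y)
    sens = tp / (tp + fn)                       # recall / true positive rate
    spec = tn / (tn + fp)                       # true negative rate
    prec = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
    return acc, sens, spec, f1

rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.25).astype(int)                   # ~25% churners, like our data
p = np.clip(0.5 * y + rng.normal(0.25, 0.15, 1000), 0, 1)   # noisy simulated scores
print(cutoff_metrics(p, y))
```

Because the cutoff is a percentile of the scores rather than a fixed value, the same fraction of customers is flagged for every model, which is what makes the measures comparable across algorithms.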

ROC-curve The Receiver Operating Characteristic (ROC)-curve is a frequently used and recom-
mended evaluation metric as well, since no precise specification of a cutoff value is needed. On the
vertical axis of this two-dimensional graph we find the TPR or sensitivity. On the horizontal axis,
the False Positive Rate (FPR) or 1-specificity is given, which is the percentage of non-churners that
was incorrectly classified as a churner.

The outcome of a predictive model is given in terms of probabilities that observations of the test
data are of class 0 (non-churner) or class 1 (churner). The definition of the probability that will serve
as a threshold to classify the observation as a future churner or not, will influence the technique’s
performance. For every threshold from 0 to 100% we are able to derive the TPR and the FPR
based on the confusion matrix. Consequently, every cut-off value will lead to one point on the
curve. The more the ROC-curve is situated in the top left corner the better, in this way it will
correspond to a TPR of 1 and a FPR of 0.

AUC Area Under the Curve (AUC) computes the area under the ROC-curve. This single value
can be used to evaluate the performance of a classifier. Since the cut-off level is disregarded, AUC
is a very suitable metric to compare the predictive performance of classifiers for churn prediction.
AUC gives the probability that a classifier will rank a randomly chosen churner higher than a
randomly chosen non-churner. A random classification model has an AUC of 0.5, which implies
that a good classifier would have an AUC that is considerably higher.
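This rank interpretation can be verified directly on toy scores: averaging the pairwise comparisons of every churner against every non-churner reproduces the area under the ROC-curve, here checked against scikit-learn's `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([1, 1, 0, 0, 0])          # two churners, three non-churners
p = np.array([0.9, 0.4, 0.5, 0.3, 0.1])

# Probability that a random churner is ranked above a random non-churner,
# counting ties as one half.
pairs = [(pi, pj) for pi, yi in zip(p, y) if yi == 1
                  for pj, yj in zip(p, y) if yj == 0]
rank_auc = np.mean([1.0 if a > b else 0.5 if a == b else 0.0 for a, b in pairs])

print(rank_auc, roc_auc_score(y, p))   # both equal 5/6 here
```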

Top decile lift Another well-known evaluation metric is the top-percentile lift. A retention cam-
paign will generally focus only on a small percentage of customers with the highest probability
to churn, given that resources are scarce. Therefore, the performance of the model on the n-th
percentile of customers with the highest probability to churn is important. Ranking the customers
by predicted probability and dividing the proportion of actual churners in the top n-th percentile
by the proportion of churners in the total population gives us the n-th percentile lift. A lift measure
of 4, for example, means there are 4 times more churners situated in the top n-th percentile than
in the total population, which indicates the performance of a model compared to a random one.
In this paper we will consider the top decile lift.
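The top-percentile lift computation is a matter of ranking, as the sketch below shows on simulated scores (the function and variable names are ours).

```python
import numpy as np

def top_percentile_lift(p, y, pct=10):
    """Lift in the top pct% of customers ranked by predicted churn probability."""
    n_top = int(len(p) * pct / 100)
    top = np.argsort(p)[::-1][:n_top]   # indices of the highest scores
    rate_top = np.mean(y[top])          # churn rate within the top group
    return rate_top / np.mean(y)        # relative to the overall churn rate

rng = np.random.default_rng(42)
y = (rng.random(10_000) < 0.25).astype(int)    # ~25% churners, like our data
p = y * 0.3 + rng.random(10_000) * 0.7         # simulated, partly informative scores
print(round(top_percentile_lift(p, y), 2))
```

With scores this informative the lift approaches its maximum of roughly 1/c̄, which is the bound discussed next.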

The maximum achievable lift measure, following the approach of [10], is given by:

Max Lift = n        if N × c̄ ≤ N/n
Max Lift = 1/c̄     if N × c̄ > N/n

with N the total number of observations, c̄ the average churn rate in the dataset and n defining
the n-th percentile. Since N = 10 000, n = 10 and we observe an average churn rate (c̄) of 25.17%,
the top decile is not large enough to include all churners. The second case of the equation above
holds, which entails that the max lift is 1/c̄ = 1/25.17% = 3.97. This should be taken into account
when evaluating the performance of the models based on the lift measure.
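Plugging in our figures confirms the bound (a small arithmetic check):

```python
N, n, c_bar = 10_000, 10, 0.2517   # observations, percentile, average churn rate

n_churners = N * c_bar             # 2517 churners in total
decile_size = N / n                # 1000 customers in the top decile
assert n_churners > decile_size    # so the second case of the formula applies

max_lift = 1 / c_bar if n_churners > decile_size else n
print(round(max_lift, 2))          # → 3.97
```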

Cross validation

5x2-fold cross validation, as recommended by [33], has been regularly applied in academic research
to evaluate the performance of churn prediction techniques [16, 28, 29].

When applying k-fold cross validation, the dataset is 'folded' k times, meaning that the data is
randomly distributed over k subsets. These k subsets alternately serve as training and test
set. The training set is used to fit the model and the test set to evaluate the model's performance.
If repeated cross validation is performed, this complete process is repeated a number of times,
each time randomly redistributing the observations in the dataset to k different subsets.

Consequently, 5x2-fold cross validation means that the dataset is split up in 2 folds, a total of
5 times. Although this is a good approach, it is not suitable to define the optimal parameters
of a model. Since we want to perform grid search in order to define those, we need an extra
fold: a validation set. Including a validation set in our approach enables us to evaluate the
performance of a model for different combinations of parameters. A model will be trained on the
training set and validated on the validation set to choose the optimal parameters. The final model
should always be tested on unseen data. Therefore, the model with optimal parameters is tested on
the test data.

In summary, we will implement 5x 3-fold cross validation in this study. The complete dataset is
split up into 3 folds by applying stratified random sampling in order to maintain the original class
distributions. Each of these subsets will alternatively serve as a training, validation and test set.
Since we repeat the cross validation 5 times and there are 6 different combinations of training,
validation and test sets, this leaves us with 30 resamples.

Evaluation measures are calculated for each resample based on the performance on the test set.
The aggregated result over all resamples serves as a robust measure of model performance, because
it is less susceptible to the randomness of splitting the data.
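A simplified sketch of this resampling scheme, assuming scikit-learn's `StratifiedKFold` for the stratified splits (our actual implementation may differ in detail): each repetition cuts the data into 3 stratified folds, and every ordered assignment of those folds to (training, validation, test) is used once.

```python
from itertools import permutations
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels with the ~25% churn rate of our dataset.
y = np.array([0] * 75 + [1] * 25)
X = np.arange(len(y)).reshape(-1, 1)

resamples = []
for rep in range(5):                                    # 5 repetitions
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=rep)
    folds = [test_idx for _, test_idx in skf.split(X, y)]
    # every ordering of the 3 folds as (training, validation, test)
    for train, val, test in permutations(folds):
        resamples.append((train, val, test))

print(len(resamples))  # 5 repetitions x 6 orderings = 30 resamples
```

Stratification keeps the 25% churn rate roughly constant in every fold, so each training, validation and test set sees the original class distribution.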

Statistical tests

In order to statistically compare the performance of the algorithms over the resamples, we will make
use of two non-parametric tests, the Friedman and Wilcoxon signed-rank test. The Friedman test is
recommended for the comparison of multiple models by [30]. The null hypothesis states that there
is no difference in performance between models. If the test shows that the null hypothesis can be
rejected based on a specified significance level, this will imply that differences can be found. To see
where these differences lie exactly, a post-hoc analysis to perform pairwise comparisons is needed.
We will use the Wilcoxon signed-rank test for post-hoc testing, as recommended by [7].
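With SciPy, the two tests can be chained as follows, shown on simulated AUC resamples for three hypothetical models (the data and model names are illustrative; `scipy.stats` provides both tests).

```python
from itertools import combinations
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(7)
# Simulated AUCs over 30 resamples for three hypothetical models.
aucs = {
    "A": rng.normal(0.94, 0.01, 30),
    "B": rng.normal(0.93, 0.01, 30),
    "C": rng.normal(0.89, 0.01, 30),
}

# Omnibus test: do any of the models differ?
stat, p = friedmanchisquare(*aucs.values())
print(f"Friedman p-value: {p:.4f}")

if p < 0.05:                                  # overall difference detected
    pairs = list(combinations(aucs, 2))
    alpha = 0.05 / len(pairs)                 # Bonferroni-corrected level
    for a, b in pairs:                        # pairwise post-hoc tests
        _, pw = wilcoxon(aucs[a], aucs[b])
        print(a, "vs", b, "differs" if pw < alpha else "no difference")
```

With 8 models and 28 comparisons, the same correction yields the 0.18% level used in the next section.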

4 Results and Discussion

The results of the cross validation are used to compare the predictive performance of the prediction
models based on the aforementioned model evaluation criteria. Furthermore, the importance of the
variables will be analysed for the different models.

Performance evaluation

Predictive power In Table 8, the performances in terms of accuracy, AUC, sensitivity, specificity,
F-measure and top decile lift of the different classification techniques can be found. These are the
median values over the different resamples. The average values and standard deviations can be
found in Table 19 in Attachment A.

First, a Friedman test is performed for each evaluation measure to check whether the medians
are equal for all models. The resulting p-values of the Friedman test are given in Table 9. If we
set the significance level at 5%, we can conclude that there are significant differences in model
performances between resamples for all evaluation measures. Next, the Wilcoxon signed-rank test
is used to indicate between which models the differences in performances lie. Since we have 8
models, 28 pairwise comparisons are needed for each evaluation measure. The resulting p-values
for all evaluation measures of the Wilcoxon signed-rank tests can be found in Tables 20-25 in


ACC AUC Sens Spec F1 Lift
LR 0.8389 0.9331 0.3790 0.9936 0.5422 3.7830
DT 0.8916 0.8973 0.7253 0.9527 0.7581 3.3940
NB 0.8269 0.8973 0.3552 0.9856 0.5081 3.5550
RF 0.8386 0.9430 0.3808 0.9924 0.5429 3.7520
NN 0.8380 0.9401 0.3772 0.9930 0.5396 3.7645
SVM 0.8389 0.9362 0.3790 0.9936 0.5422 3.7820
Bagging 0.8497 0.9331 0.4410 0.9876 0.5957 3.7105
Boosting 0.8389 0.9448 0.3790 0.9936 0.5422 3.7830

Table 8: Experimental results (medians)

ACC AUC Sens Spec F1 Lift

<0.001 <0.001 <0.001 <0.001 <0.001 <0.001

Table 9: P-values Friedman test

Attachment B. The null hypothesis of the Wilcoxon signed-rank test states that there is no difference
between the medians of the performances of both models. Since we want to test 28 hypotheses, the
significance level has to be adjusted. Applying the Bonferroni correction results in a significance
level of 5%/28 ≈ 0.18%. P-values that lie above this threshold, and consequently imply that there is
no significant difference between 2 models, are listed in bold in the tables in Attachment B.

The best performance for each evaluation measure is underlined in Table 8. Values that do not
differ from the top performance at a 0.18% significance level, based on the Wilcoxon signed-rank
test, are listed in bold. In the table we see that for ACC no model differs significantly from the
top-performing DT. For all other measures, the top performances differ significantly from the
measures of the other models.

The DT model achieves the best performance in terms of accuracy, sensitivity and F-measure.
However, we should mention that this is due to the non-continuous distribution of its predicted
probabilities, which results in overoptimistic values for the top 10% accuracy. Since the cutoff
for sensitivity, specificity and F-measure is set equal to that of accuracy, this may have resulted
in misleading values for these measures as well. The accuracy of the DT does not differ significantly
from the other models, although its median is much higher. This is caused by the high variance in
the performance of the DT, as can be seen in Table 19. Bagging faces the same problem concerning
cutoff values and therefore leads to deceiving values for accuracy, sensitivity, specificity and
F-measure as well. However, the best results in terms of specificity are not reported by the DT or
Bagging: LR, the SVM and Boosting significantly outperform the other techniques in specificity.

As discussed before, accuracy is the least suitable measure to assess the predictive power of a
classification model. Although sensitivity, specificity and F-measure are more adequate performance
measures, they still depend on a probability cutoff. Therefore, we accord more importance to
AUC and top decile lift. When evaluating AUC, we observe that Boosting significantly outperforms
all other models. RF and the NN show highly competitive AUC values as well, whereas the DT and
NB perform worst in terms of AUC. The best results for top decile lift are reported by Boosting
and LR. Although the SVM has a lower median lift value, no significant difference with Boosting
and LR can be found at the 0.18% significance level. The DT and NB are the least suitable models
when considering AUC and lift. Remarkably, no significant difference can be found between RF and
the NN for any of the evaluation metrics except AUC. We conclude that these prediction techniques
perform quite similarly.
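Top decile lift compares the churn rate among the 10% of customers with the highest predicted churn probabilities to the churn rate in the whole customer base. A minimal sketch with toy data (not the thesis dataset):

```python
def top_decile_lift(y_true, y_score):
    """Lift in the 10% of customers with the highest predicted churn probability."""
    ranked = sorted(zip(y_score, y_true), reverse=True)   # highest predicted risk first
    top_n = max(1, len(ranked) // 10)                     # size of the top decile
    churn_rate_top = sum(y for _, y in ranked[:top_n]) / top_n
    churn_rate_all = sum(y_true) / len(y_true)            # baseline churn rate
    return churn_rate_top / churn_rate_all

# Toy example: 10 customers, 3 churners; the top-scored customer is a churner,
# so the model targets churners 1/0.3 (about 3.3) times better than random targeting.
y = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.1, 0.2, 0.8, 0.3, 0.15, 0.25, 0.7, 0.12, 0.05]
print(round(top_decile_lift(y, scores), 2))  # 3.33
```

A lift of 1 corresponds to random targeting; values above 1 indicate that the model concentrates churners in the targeted decile.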

Figure 1: Comparison of ROC-curves

In Figure 1, the ROC-curves of the different classification techniques, aggregated over all resamples,
are drawn. A random classifier would result in the gray diagonal line through the origin. We
conclude that all classifiers perform better than a random classifier, since their curves are situated
above and to the left of that line. No clear difference can be distinguished between most classifiers
in the graph. However, we perceive that the ROC-curves of both NB and the DT model are situated
further to the right than the others. This corresponds with the fact that both models are outperformed
in terms of AUC.
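The link between the ROC-curve and AUC can be made concrete: AUC equals the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner, which is why a curve on the diagonal corresponds to an AUC of 0.5. A small sketch with toy data (illustrative only):

```python
def auc_by_pairs(y_true, y_score):
    """AUC as the probability that a randomly chosen churner is scored above a
    randomly chosen non-churner (ties count 0.5) - the area under the ROC-curve."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: every churner is scored above every non-churner.
y = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.1, 0.2, 0.8, 0.3, 0.15, 0.25, 0.7, 0.12, 0.05]
print(auc_by_pairs(y, scores))  # all churners ranked above all non-churners: AUC = 1.0
```

Unlike accuracy, this measure evaluates the full ranking and does not depend on a probability cutoff.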

To summarize, Stochastic Gradient Boosting is the best performing method when taking AUC and
lift into account jointly. LR and SVMs are able to compete with the Boosting technique in terms
of lift. NNs and RF are worth mentioning as well, since they both achieve highly competitive
performances in AUC and lift. We would not recommend NB and DTs for B2B churn prediction,
since they are the least suitable methods when considering AUC and lift. Furthermore, we conclude
that Bagging is able to improve the performance of Decision Trees.

Comprehensibility Whether the models will actually be accepted by the end-users depends on
their intuitiveness. Models that are difficult to understand are less likely to convince managers
to implement them. Therefore, assessing a model's comprehensibility should not be neglected [113].
Some models improve predictive accuracy at the expense of understandability. LR and DTs are
generally considered comprehensible models, and LR achieves a generally competitive performance in
our analysis. RF, NNs and SVMs are viewed as black-box models and are shown to only increase
the performance in terms of AUC when compared to LR. Ensemble methods, like Bagging and
Boosting, are difficult to understand as well. Even though both techniques use Decision Trees as base
classifiers, combining many trees into one model obstructs their interpretability. Bagging is clearly
outperformed by LR, while Boosting only improves on LR in terms of AUC. To summarize, a
trade-off should be made between performance and understandability. LR shows a good combination
of both, but there is still room for improvement in terms of AUC.

                    LR       DT       NB       RF       NN       SVM      Bagging  Boosting

Absolute (hh:mm:ss) 00:00:23 00:00:29 00:09:07 00:11:25 00:07:30 01:01:59 00:07:34 00:15:15
Relative (to LR)    1        1.26     24.02    30.10    19.77    163.45   19.97    40.22

Table 10: Computation time

Computation time In Table 10 the computation times for all prediction models can be found.
For ease of comparison, not only the absolute values but also the computation times relative to
LR are listed. The application of Logistic Regression and Decision Trees necessitates the least
time. Support Vector Machines, in contrast, require 163.45 times the computation time of LR.
Naïve Bayes, Random Forests, Neural Networks, Bagging and Boosting necessitate a more moderate
computation time. Depending on the size of a company's customer base, using computationally
more expensive methods may or may not be reasonable. Moreover, the performance of the considered
techniques needs to be taken into account when evaluating their computation time: if a prediction
model requires significantly more computation time without paying off in increased performance,
there is no need to consider it as a possibility. We definitely observe this in the case of SVMs, where
a significantly higher computation time does not lead to an increase in predictive power.

Variable importance

Apart from the algorithm, the predictors used to construct a churn prediction model influence the
quality of that model as well. The results in terms of variable importance for each of the prediction
models are listed in Tables 11 - 18. The values in these tables are the averages and the standard
deviations of the variable importances over the different resamples. The values are scaled from 0
to 100. In this way, we are able to conclude which variables are important from a general and
model-specific viewpoint.
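The thesis does not spell out the scaling formula; a common convention, used here as an assumption, is min-max scaling so that the least important variable maps to 0 and the most important one to 100:

```python
def scale_importance(raw):
    """Rescale raw importance scores to [0, 100]; assumes min-max scaling."""
    lo, hi = min(raw), max(raw)
    return [100.0 * (v - lo) / (hi - lo) for v in raw]

# Hypothetical raw scores for three predictors; in the analysis the scaled
# values would additionally be averaged over the different resamples.
print([round(v, 2) for v in scale_importance([0.02, 0.35, 0.80])])  # [0.0, 42.31, 100.0]
```

Under this convention a mean of 100 with a standard deviation of 0 means the variable was the most important one in every resample.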

As a former B2B study [37] also found, the recency category scores best among all chosen
predictors. The recency variable Sales.Inv Dt rec reaches an average variable importance of 100
with a standard deviation of 0 for all models. This is clearly the variable that contributes most
to the accurate prediction of future churners. The average importances of all other variables are
situated in a significantly lower range than the variable importance of Sales.Inv Dt rec.


A second important observation is that the other variables show strongly varying results in
importance. For example, the importance of the frequency variable Sales.salesTotal freq differs
significantly between models. This variable is the best representation of the frequency category and
is extremely important for DTs, RF, NNs, Bagging and Boosting. On the other hand, it has the
worst predictive ability in NB and SVMs, which is rather remarkable. A similar observation can be
made for the length variable Sales.Inv Dt dura, which shows a high importance for LR, RF and
Boosting but is only moderately important for the other models. Monetary variables rank higher in
variable importance in NB and SVMs than in the other models; DTs accord almost no importance
to monetary indicators at all.

However, Sales.PROMO QTY mean, which is classified as a frequency variable, does display consistent
results in variable importance: it is generally accorded a very low importance. Considering the
summary statistics in Table 5, this can be explained by its low variance.

We remark as well that for certain models only a few predictors contribute significantly to the
predictive performance. When analysing the variable importance of DTs and especially Boosting,
we assume that leaving out the least important variables would not significantly influence their
predictive performance, since many values are negligible.

We conclude that the importance of the types of predictors differs considerably between prediction
techniques. Consequently, this may imply that the predictive power of a technique can be dependent
on the variables used to construct it.

Variable                      Mean    SD
Sales.Inv Dt rec              100.00  0.00
Sales.Inv Dt dura             18.60   1.79
Sales.STD ADJ FCT mean        9.42    2.73
Sales.salesTotal freq         7.38    1.64
Equipment.Models freq         5.41    2.05
Sales.COST OF GOODS mean      4.87    2.25
Sales.PKG CONV192 mean        4.86    1.31
Sales.PROMO QTY mean          3.95    1.37
Equipment.Install Date rec    3.93    2.37
Sales.Whlsl Price Xtnd mean   2.44    1.76
Sales.Qty mean                1.81    1.23
Sales.WHLSL UNIT PRICE mean   1.72    1.49
Equipment.Install Date dura   1.02    1.03

Table 11: Variable importance LR

Variable                      Mean    SD
Sales.Inv Dt rec              100.00  0.00
Sales.salesTotal freq         32.21   2.01
Equipment.Install Date rec    13.74   1.77
Equipment.Install Date dura   13.60   1.57
Equipment.Models freq         12.43   1.14
Sales.Inv Dt dura             3.29    1.38
Sales.PKG CONV192 mean        1.69    1.19
Sales.COST OF GOODS mean      1.02    1.03
Sales.STD ADJ FCT mean        0.58    0.73
Sales.WHLSL UNIT PRICE mean   0.53    0.53
Sales.Qty mean                0.34    0.32
Sales.PROMO QTY mean          0.20    0.33
Sales.Whlsl Price Xtnd mean   0.17    0.22

Table 12: Variable importance DT


Variable                      Mean    SD
Sales.Inv Dt rec              100.00  0.00
Sales.WHLSL UNIT PRICE mean   42.09   0.00
Sales.PKG CONV192 mean        41.79   0.00
Sales.STD ADJ FCT mean        40.86   0.00
Sales.COST OF GOODS mean      40.00   0.00
Sales.Whlsl Price Xtnd mean   35.96   0.00
Sales.Qty mean                34.78   0.00
Sales.PROMO QTY mean          33.73   0.00
Sales.Inv Dt dura             30.01   0.00
Equipment.Install Date rec    17.07   0.00
Equipment.Models freq         16.37   0.00
Equipment.Install Date dura   15.86   0.00
Sales.salesTotal freq         0.00    0.00

Table 13: Variable importance NB

Variable                      Mean    SD
Sales.Inv Dt rec              100.00  0.00
Sales.salesTotal freq         20.25   2.08
Sales.Inv Dt dura             8.39    0.76
Sales.PKG CONV192 mean        6.41    0.83
Sales.Whlsl Price Xtnd mean   5.90    0.66
Sales.Qty mean                5.47    0.47
Sales.COST OF GOODS mean      5.32    0.80
Sales.WHLSL UNIT PRICE mean   5.27    0.83
Equipment.Install Date rec    4.42    0.81
Equipment.Install Date dura   4.10    0.85
Sales.STD ADJ FCT mean        4.01    0.55
Equipment.Models freq         1.35    0.62
Sales.PROMO QTY mean          0.00    0.00

Table 14: Variable importance RF

Variable                      Mean    SD
Sales.Inv Dt rec              100.00  0.00
Sales.salesTotal freq         28.44   14.21
Sales.COST OF GOODS mean      18.78   12.32
Sales.STD ADJ FCT mean        14.65   9.11
Sales.WHLSL UNIT PRICE mean   13.27   10.06
Sales.PKG CONV192 mean        10.20   7.73
Sales.Inv Dt dura             8.90    6.17
Equipment.Models freq         8.66    8.35
Sales.Qty mean                7.60    10.81
Sales.Whlsl Price Xtnd mean   7.24    7.42
Equipment.Install Date dura   7.14    5.86
Equipment.Install Date rec    6.32    4.72
Sales.PROMO QTY mean          3.10    5.02

Table 15: Variable importance NN

Variable                      Mean    SD
Sales.Inv Dt rec              100.00  0.00
Sales.WHLSL UNIT PRICE mean   42.09   0.00
Sales.PKG CONV192 mean        41.79   0.00
Sales.STD ADJ FCT mean        40.86   0.00
Sales.COST OF GOODS mean      40.00   0.00
Sales.Whlsl Price Xtnd mean   35.96   0.00
Sales.Qty mean                34.78   0.00
Sales.PROMO QTY mean          33.73   0.00
Sales.Inv Dt dura             30.01   0.00
Equipment.Install Date rec    17.07   0.00
Equipment.Models freq         16.37   0.00
Equipment.Install Date dura   15.86   0.00
Sales.salesTotal freq         0.00    0.00

Table 16: Variable importance SVM

Variable                      Mean    SD
Sales.Inv Dt rec              100.00  0.00
Sales.salesTotal freq         45.30   1.18
Equipment.Install Date rec    19.13   0.89
Sales.PKG CONV192 mean        18.35   0.60
Equipment.Install Date dura   18.15   0.91
Sales.Inv Dt dura             16.05   0.90
Sales.Whlsl Price Xtnd mean   15.59   0.69
Sales.Qty mean                15.16   0.73
Sales.WHLSL UNIT PRICE mean   14.61   0.81
Sales.COST OF GOODS mean      14.31   0.94
Sales.STD ADJ FCT mean        13.63   0.76
Equipment.Models freq         12.05   0.78
Sales.PROMO QTY mean          0.00    0.00

Table 17: Variable importance Bagging

Variable                      Mean    SD
Sales.Inv Dt rec              100.00  0.00
Sales.salesTotal freq         4.83    0.56
Sales.Inv Dt dura             1.42    0.35
Equipment.Install Date dura   1.34    0.24
Sales.PKG CONV192 mean        0.85    0.25
Equipment.Install Date rec    0.69    0.29
Sales.COST OF GOODS mean      0.48    0.23
Sales.Qty mean                0.28    0.18
Sales.Whlsl Price Xtnd mean   0.26    0.19
Sales.WHLSL UNIT PRICE mean   0.25    0.17
Sales.STD ADJ FCT mean        0.17    0.14
Sales.PROMO QTY mean          0.05    0.06
Equipment.Models freq         0.02    0.05

Table 18: Variable importance Boosting



In past literature, LR and DTs have been considered excellent benchmarking techniques. In
some studies, LR was even able to outperform more complex techniques. Our results likewise show
that LR outperforms RF and the NN in top decile lift. When evaluating AUC, more complex
techniques like RF, the NN, the SVM and Boosting outperform LR. The DT, however, does not
show the same predictive performance as LR: the DT model finds itself amongst the worst
performing techniques in terms of AUC and lift. The popularity of DTs in former B2C studies
made us expect the opposite.

Similarly, NB cannot compete with the performances of the other models either; only the DT
reports lower measures for specificity and lift. Based on former research, we expected NB to
achieve results similar to DTs and LR. This certainly holds in the case of the DT, but
LR outperforms NB significantly.

In former studies, Boosting was not able to outperform more complex methods. Our findings
suggest the contrary: in this study, Boosting outperforms RF, the NN and the SVM in AUC
and lift. We do remark that our study shows very similar results to [105], where Boosting
slightly outperforms LR and the DT model performs worst. As suggested by former research,
Bagging is able to significantly increase the performance of DTs. Nevertheless, Bagging is
outperformed by other techniques.

When reviewing literature, RF generally reported a high predictive performance. Our findings
suggest a similar tendency. RF and the NN report performance measures that are not significantly
different from each other; RF only outperforms the NN when evaluating AUC. In [14], no significant
differences could be found between these models either.

When considering variable importance, the observed importance of recency and frequency in this
study corresponds with the results of other literature. We do, however, note that frequency is not
that important for all applied prediction models. In past literature on churn prediction in FMCG
industries, monetary indicators turned out to be relatively insignificant. We observe this as well in
our study. The length of relationship was an important predictor in all studies that incorporated
it. Our results suggest that the importance varies over prediction techniques.

5 Conclusion

In this paper, we focus on churn prediction modelling in a B2B sector. Research on B2B churn
prediction is rather limited, as shown in an extended overview of churn prediction techniques
applied in past literature. To address this gap, we perform a benchmarking study of churn
prediction techniques in a B2B context. The predictive power of Logistic Regression, Decision
Trees, Naïve Bayes, Random Forest, Neural Networks, Support Vector Machines, Bagging and
Boosting is evaluated on a FMCG dataset. To evaluate the performance of the techniques,
accuracy, AUC, sensitivity, specificity, F-measure and top decile lift are calculated.

Based on our findings, we would recommend the use of Stochastic Gradient Boosting. This tech-
nique is able to give the best results in terms of top decile lift and AUC. However, if we take into
account computation time and comprehensibility as well, we want to draw attention to LR. The
power of LR lies in its combination of a high competitive performance and intuitiveness and low
computation time.

When considering variable importance, our analysis identifies recency variables as the most important
for every prediction technique. Frequency variables are generally shown to be important as well,
but significantly less so than recency. Furthermore, our findings suggest that the importance of
certain categories of variables may vary depending on the applied prediction technique. We observe
this, for example, for the monetary indicators, whose importance varies over the different models.

To summarize, the contribution of this study is twofold: (1) an analysis of classification techniques
that have been formerly used in B2B and B2C churn prediction is presented; (2) we evaluate the
performance of the most commonly used churn prediction techniques in a B2B setting.

6 Limitations and Future Research

Some limitations and opportunities for future research can be mentioned.

A first limitation is that we only include a small number of predictors in our analysis. Since we
wanted our results to be generalisable to other B2B companies, the number of predictors was kept
to a minimum. An interesting finding of our study is that the importance of the different categories
of variables depends on the prediction technique used. Further studies may examine the importance
of other variable categories for different prediction techniques.

Furthermore, we only included commonly used prediction techniques to set up our empirical
analysis. A possibility for future research is to explore the performance of less well-known techniques.

Another limitation is that the results of our analysis may not be applicable for B2B companies
that are not situated in the FMCG industry. We do, however, assume a certain generalisability to
companies in a non-contractual environment. Future studies may improve the generalisability of
our conclusions by extending the analysis to other B2B industries.

Lastly, we should mention that the outcome of our research is only relevant for a company if it is
actually willing to undertake actions to prevent churn. No decrease in churn rate will be realised
by predicting future churners alone. B2B companies should offer incentives to those customers
with the highest probabilities to churn in order to dissuade them from doing so. This will lead
to reduced churn rates and increased profits, which will, ultimately, show the real value of churn
prediction.


References [9] Bin, L., Peiji, S., and Juan, L. (2007). Cus-
tomer churn prediction based on the decision
[1] Ahn, J.-H., Han, S.-P., and Lee, Y.-S. tree in personal handyphone system service.
(2006). Customer churn analysis: Churn de-
terminants and mediation effects of partial [10] Blattberg, R. C., Kim, B.-D., and Neslin,
defection in the Korean mobile telecommuni- S. A. (2010). Database Marketing: Analyzing
cations service industry. Telecommunications and Managing Customers. Springer Science
Policy, 30(10–11):552–568. & Business Media.

[11] Breiman, L. (1996). Bagging Predictors.

[2] Amin, A., Shehzad, S., Khan, C., Ali, I.,
Machine Learning, 24(2):123–140.
and Anwar, S. (2015). Churn prediction in
telecommunication industry using Rough Set [12] Breiman, L. (2001). Random Forests. Ma-
Approach. chine Learning, 45(1):5–32.

[3] Archaux, C., Martin, A., and Khenchaf, A. [13] Buckinx, W., Baesens, B., Van den Poel,
(2004). An SVM based churn detector in D., Van Kenhove, P., and Vanthienen, J.
prepaid mobile telephony. 2004 International (2010). Using machine learning techniques to
Conference on Information and Communica- predict defection of top clients.
tion Technologies: From Theory to Applica- [14] Buckinx, W. and Van den Poel, D. (2005).
tions, 2004. Proceedings, pages 459–460. Customer base analysis: Partial defection
[4] Au, T., Ma, G., and Li, S. (2003a). Apply- of behaviourally loyal clients in a non-
ing and Evaluating Models to Predict Cus- contractual FMCG retail setting. European
tomer Attrition Using Data Mining Tech- Journal of Operational Research, 164(1):252–
niques. Journal of Comparative International 268.
Management, 6(1). [15] Burez, J. and Van den Poel, D. (2007).
CRM at a pay-TV company: Using analyt-
[5] Au, W.-H., Chan, K. C., and Yao, X.
ical models to reduce customer attrition by
(2003b). A novel evolutionary data mining
targeted marketing for subscription services.
algorithm with applications to churn predic-
Expert Systems with Applications, 32(2):277–
[6] Ballings, M. and Van den Poel, D. (2012).
[16] Burez, J. and Van den Poel, D. (2009).
Customer event history for churn prediction:
Handling class imbalance in customer churn
How long is long enough? Expert Systems
with Applications, 39(18):13517–13522.
[17] Chaudhury, A. and Kuilboer, J.-P. (2001).
[7] Benavoli, A., Corani, G., and Mangili, F. E-Business and E-Commerce Infrastructure:
(2016). Should We Really Use Post-hoc Tests Technologies Supporting the E-Business Ini-
Based on Mean-ranks? J. Mach. Learn. Res., tiative.
[18] Chen, K., Hu, Y.-H., and Hsieh, Y.-C.
[8] Bhattacharya, C. B. (1998). When cus- (2014). Predicting customer churn from valu-
tomers are members: Customer retention in able B2B customers in the logistics industry:
paid membership contexts. Journal of the A case study. Information Systems and e-
Academy of Marketing Science, 26(1):31. Business Management, 13(3):475–494.

[19] Chen, Z.-Y., Fan, Z.-P., and Sun, M. [27] David Ford (1980). The Development
(2012). A hierarchical multiple kernel support of Buyer-Seller Relationships in Industrial
vector machine for customer churn prediction Markets. European Journal of Marketing,
using longitudinal behavioral data. European 14(5/6):339–353.
Journal of Operational Research, 223(2):461–
[28] De Bock, K. W. and den Poel, D. V.
(2011). An empirical evaluation of rotation-
[20] Chiang, D.-A., Wang, Y.-F., Lee, S.-L., and based ensemble classifiers for customer churn
Lin, C.-J. (2003). Goal-oriented sequential prediction. Expert Systems with Applications,
pattern for network banking churn analysis. 38(10):12293–12301.
Expert Systems with Applications, 25(3):293–
[29] De Bock, K. W. and Van den Poel, D.
(2012). Reconciling performance and inter-
[21] Chu, B.-H., Tsai, M.-S., and Ho, C.-S. pretability in customer churn prediction us-
(2007). Toward a hybrid data mining model ing ensemble learning based on generalized
for customer retention. Knowledge-Based additive models. Expert Systems with Appli-
Systems, 20(8):703–718. cations, 39(8):6816–6826.
[22] Coussement, K., Benoit, D. F., and Van [30] Demšar, J. (2006). Statistical Comparisons
den Poel, D. (2010). Improved marketing of Classifiers over Multiple Data Sets. J.
decision making in a customer churn pre- Mach. Learn. Res., 7:1–30.
diction context using generalized additive
models. Expert Systems with Applications, [31] D’Haen, J. and Van den Poel, D. (2013).
37(3):2132–2143. Model-supported business-to-business
prospect prediction based on an iterative
[23] Coussement, K. and De Bock, K. W. customer acquisition framework. Industrial
(2013). Customer churn prediction in the on- Marketing Management, 42(4):544–551.
line gambling industry: The beneficial effect
of ensemble learning. Journal of Business Re- [32] D’Haen, J., Van den Poel, D., and Thor-
search, 66(9):1629–1636. leuchter, D. (2013). Predicting customer prof-
itability during acquisition: Finding the op-
[24] Coussement, K. and Van den Poel, D.
timal combination of data source and data
(2008a). Churn prediction in subscription
mining technique. Expert Systems with Ap-
services: An application of support vector
plications, 40(6):2007–2012.
machines while comparing two parameter-
selection techniques. Expert Systems with Ap- [33] Dietterich, T. G. (1998). Approximate Sta-
plications, 34(1):313–327. tistical Tests for Comparing Supervised Clas-
sification Learning Algorithms. Neural Com-
[25] Coussement, K. and Van den Poel, D.
put., 10(7):1895–1923.
(2008b). Integrating the voice of customers
through call center emails into a decision sup- [34] Eiben, A., Koudijs, A., and Slisser, F.
port system for churn prediction. Information (2006). Genetic modeling of customer reten-
& Management, 45(3):164–174. tion.

[26] Datta, P., Masand, B., Mani, D., and Li, [35] Farquad, M. A. H., Ravi, V., and Raju,
B. (2001). Automated cellular modeling and S. B. (2014). Churn prediction using compre-
prediction on a large scale. hensible support vector machine: An analyt-

ical CRM application. Applied Soft Comput- munication churn prediction. Expert Systems
ing, 19:31–40. with Applications, 40(14):5635–5647.

[36] Glady, N., Baesens, B., and Croux, C. [45] Hung, S.-Y., Yen, D. C., and Wang, H.-
(2009). Modeling churn using customer life- Y. (2006). Applying data mining to telecom
time value. European Journal of Operational churn management. Expert Systems with Ap-
Research, 197(1):402–411. plications, 31(3):515–524.

[37] Gordini, N. and Veglio, V. (2016). Cus- [46] Hur, Y. and Lim, S. (2005). Customer
tomers churn prediction and marketing reten- churning prediction using Support Vector
tion strategies. An application of support vec- Machines in online auto insurance service.
tor machines based on the AUC parameter- [47] Hwang, H., Jung, T., and Suh, E. (2004).
selection technique in B2B e-commerce indus- An LTV model and customer segmentation
try. Industrial Marketing Management. based on customer value: A case study on the
wireless telecommunication industry. Expert
[38] Hadden, J., Tiwari, A., Roy, R., and Ruta,
Systems with Applications, 26(2):181–188.
D. (2006). Churn prediction using complaints
data. [48] Idris, A., Rizwan, M., and Khan, A. (2012).
Churn prediction in telecom using Random
[39] Hopmann, J. and Thede, A. (2003). Appli-
Forest and PSO based data balancing in com-
cability of customer churn forecasts in a non-
bination with various feature selection strate-
contractual setting.
gies. Computers & Electrical Engineering,
[40] Hosseini, S. M. S., Maleki, A., and Gho- 38(6):1808–1819.
lamian, M. R. (2010). Cluster analysis us-
[49] Jadhav, R. and Pawar, U. (2011). Churn
ing data mining approach to develop CRM
prediction in telecommunication using data
methodology to assess the customer loy-
mining technology.
alty. Expert Systems with Applications,
37(7):5259–5264. [50] Jing, Z. and Xing-hua, D. (2008). Bank
customer churn prediction based on support
[41] Hu, X. (2005). A data mining approach for
vector machine: Taking a commercial bank’s
retailing bank customer attrition analysis.
VIP customer churn as the example.
[42] Huang, B., Kechadi, M. T., and Buck- [51] Keller, K. and Webster, F. (2004). A
ley, B. (2012). Customer churn prediction in roadmap for branding in industrial markets.
telecommunications. Expert Systems with Ap-
[52] Keramati, A. and Ardabili, S. M. S. (2011).
plications, 39(1):1414–1425.
Churn analysis for an Iranian mobile opera-
[43] Huang, B. Q., Kechadi, T. M., Buckley, tor. Telecommunications Policy, 35(4):344–
B., Kiernan, G., Keogh, E., and Rashid, T. 356.
(2010). A new feature set with new win-
[53] Keramati, A., Jafari-Marandi, R., Alianne-
dow techniques for customer churn prediction
jadi, M., Ahmadian, I., Mozaffari, M., and
in land-line telecommunications. Expert Sys-
Abbasi, U. (2014). Improved churn pre-
tems with Applications, 37(5):3657–3665.
diction in telecommunication industry using
[44] Huang, Y. and Kechadi, T. (2013). An data mining techniques. Applied Soft Com-
effective hybrid learning system for telecom- puting, 24:994–1012.

[54] Khan, A. A., Jamwal, S., and Sepehri, Conference on Emerging Artificial Intelli-
M. (2010). Applying data mining to cus- gence Applications in Computer Engineering:
tomer churn prediction in an Internet Service Real Word AI Systems with Applications in
Provider. eHealth, HCI, Information Retrieval and Per-
vasive Technologies, pages 3–24.
[55] Kim, H.-S. and Yoon, C.-H. (2004). De-
terminants of subscriber churn and cus- [63] Kumar, D. A. and Ravi, V. (2008). Pre-
tomer loyalty in the Korean mobile tele- dicting credit card customer churn in banks
phony market. Telecommunications Policy, using data mining.
[64] Larivière, B. and Van den Poel, D. (2005).
[56] Kim, J., Suh, E., and Hwang, H. (2003). A
Predicting customer retention and profitabil-
model for evaluating the effectiveness of CRM
ity by using random forests and regression
using the balanced scorecard. Journal of In-
forests techniques. Expert Systems with Ap-
teractive Marketing, 17(2):5–19.
plications, 29(2):472–484.
[57] Kim, K. and Lee, J. (2012). Sequen-
[65] Leahy, R. (2011). Relationships in fast
tial manifold learning for efficient churn pre-
moving consumer goods markets: The con-
diction. Expert Systems with Applications,
sumers’ perspective. European Journal of
Marketing, 45(4):651–672.
[58] Kim, S., Shin, K.-s., and Park, K. (2005).
[66] Lee, H., Lee, Y., Cho, H., Im, K., and
An Application of Support Vector Machines
Kim, Y. S. (2011). Mining churning behaviors
for Customer Churn Analysis: Credit Card
and developing retention strategies based on
Case. Advances in Natural Computation,
a partial least squares (PLS) model. Decision
pages 636–647.
Support Systems, 52(1):207–216.
[59] Kim, S.-Y., Jung, T.-S., Suh, E.-H., and
Hwang, H.-S. (2006). Customer segmentation [67] Lee, J. S. and Lee, J. C. (2006). Customer
and strategy development based on customer churn prediction by hybrid model.
lifetime value: A case study. Expert Systems
[68] Lejeune, M. (2011). Measuring the impact
with Applications, 31(1):101–107.
of Data Mining on Churn Management.
[60] Kirui, C., Hong, L., Cheruiyot, W., and
[69] Lemmens, A. and Croux, C. (2006). Bag-
Kirui, H. (2013). Predicting customer churn
ging and boosting classification trees to pre-
in mobile telephony industry using probabilis-
dict churn.
tic classifiers in data mining.

[61] Kisioglu, P. and Topcu, Y. I. (2011). Ap- [70] Lessmann, S. and Voß, S. (2009). A ref-
plying Bayesian Belief Network approach to erence model for customer-centric data min-
customer churn analysis: A case study on the ing with support vector machines. European
telecom industry of Turkey. Expert Systems Journal of Operational Research, 199(2):520–
with Applications, 38(6):7151–7157. 530.

[62] Kotsiantis, S. B. (2007). Supervised Ma- [71] Lilien, G. L. (2016). The B2B Knowledge
chine Learning: A Review of Classifica- Gap. International Journal of Research in
tion Techniques. Proceedings of the 2007 Marketing, 33(3):543–556.

[72] Lin, C.-S., Tzeng, G.-H., and Chin, Y.-C. [80] Morik, K. and Köpke, H. (2004). Analysing
(2011). Combined rough set theory and flow customer churn in insurance domain.
network graph to predict customer churn in
[81] Mozer, M. C., Wolniewicz, R., Grimes,
credit card accounts. Expert Systems with Ap-
D. B., Johnson, E., and Kaushansky, H.
plications, 38(1):8–15.
(1999). Churn reduction in the wireless in-
[73] Lu, J. (2002). Predicting customer churn dustry.
in the telecommunications industry –– An ap- [82] Mozer, M. C., Wolniewicz, R., Grimes,
plication of survival analysis modeling using D. B., Johnson, E., and Kaushansky, H.
SAS. (2000). Predicting subscriber dissatisfac-
[74] Lu, N., Lin, H., Lu, J., and Zhang, G. tion and improving retention in the wireless
(2014). A customer churn prediction model telecommunications industry.
in telecom industry using boosting.
[75] Paulin, M., Perrien, J., Ferguson, R. J., Alvarez Salazar, A. M., and Seruya, L. M. (1998). Relational norms and client retention: External effectiveness of commercial banking in Canada and Mexico. International Journal of Bank Marketing,
[76] Miguéis, V. L., Camanho, A., and Falcão e Cunha, J. (2013). Customer attrition in retailing: An application of Multivariate Adaptive Regression Splines. Expert Systems with Applications, 40(16):6225–6232.
[77] Miguéis, V. L., Van den Poel, D., Camanho, A., and Falcão e Cunha, J. (2012a). Predicting partial customer churn using Markov for discrimination for modeling first purchase sequences.
[78] Miguéis, V. L., Van den Poel, D., Camanho, A. S., and Falcão e Cunha, J. (2012b). Modeling partial customer churn: On the value of first product-category purchase sequences. Expert Systems with Applications,
[79] Modani, N., Dey, K., Gupta, R., and Godbole, S. (2013). CDR Analysis Based Telco Churn Prediction and Customer Behavior Insights: A Case Study.
[83] Mudambi, S. (2002). Branding importance in business-to-business markets: Three buyer clusters. Industrial Marketing Management,
[84] Mutanen, T. (2006). Customer churn analysis - a case study.
[85] Nath, S. V. and Behara, R. S. (2003). Customer churn analysis in the wireless industry: A Data Mining approach.
[86] Ngai, E. W. T., Xiu, L., and Chau, D. C. K. (2009). Application of data mining techniques in customer relationship management: A literature review and classification. Expert Systems with Applications, 36(2, Part 2):2592–2602.
[87] Nie, G., Rowe, W., Zhang, L., Tian, Y., and Shi, Y. (2011). Credit card churn forecasting by logistic regression and decision tree. Expert Systems with Applications, 38(12):15273–15285.
[88] Olle, G. D. and Cai, S. (2014). A Hybrid Churn Prediction Model in Mobile Telecommunication Industry.
[89] Owczarczuk, M. (2010). Churn models for prepaid customers in the cellular telecommunication industry using large data marts. Expert Systems with Applications, 37(6):4710–4712.

[90] Oyeniyi, A. and Adeyemo, A. (2006). Customer churn analysis in banking sector using data mining techniques.
[91] Parvatiyar, A. and Sheth, J. (2000). The domain and conceptual foundations of relationship marketing.
[92] Pendharkar, P. C. (2009). Genetic algorithm based neural network approaches for predicting churn in cellular wireless network services. Expert Systems with Applications, 36(3, Part 2):6714–6720.
[93] Radosavljevik, D., Van der Putten, P., and Kyllesbech Larsen, K. (2010). The impact of experimental setup in prepaid churn prediction for mobile telecommunications: What to predict, for whom and does the customer experience matter?
[94] Reichheld, F. F. and Sasser, W. (1990). Zero defections: Quality comes to services.
[95] Risselada, H., Verhoef, P. C., and Bijmolt, T. H. A. (2010). Staying Power of Churn Prediction Models. Journal of Interactive Marketing, 24(3):198–208.
[96] Rosset, S. and Neumann, E. (2003). Integrating Customer Value Considerations into Predictive Modeling.
[97] Ruta, D., Nauck, D., and Azvine, B. (2006). K nearest sequence method and its application to churn prediction.
[98] Ryals, L. and Knox, S. (2001). Cross-functional issues in the implementation of relationship marketing through customer relationship management. European Management Journal, 19(5):534–542.
[99] Rygielski, C., Wang, J.-C., and Yen, D. C. (2002). Data mining techniques for customer relationship management. Technology in Society, 24(4):483–502.
[100] Seo, D., Ranganathan, C., and Babad, Y. (2008). Two-level model of customer retention in the US mobile telecommunications service market. Telecommunications Policy,
[101] Shaaban, E., Helmy, Y., Khedr, A., and Nasr, M. (2012). A proposed churn prediction model.
[102] Smith, K., Willis, R., and Brooks, M. (2000). An analysis of customer retention and insurance claim patterns using data mining: A case study.
[103] Stevens, R. (2005). B-to-B customer retention: Seven strategies for keeping your customers.
[104] Swani, K., Brown, B. P., and Milne, G. R. (2014). Should tweets differ for B2B and B2C? An analysis of Fortune 500 companies' Twitter communications. Industrial Marketing Management, 43(5):873–881.
[105] Tamaddoni Jahromi, A., Stakhovych, S., and Ewing, M. (2014). Managing B2B customer churn, retention and profitability. Industrial Marketing Management, 43(7):1258–
[106] Tsai, C.-F. and Chen, M.-Y. (2010). Variable selection by association rules for customer churn prediction of multimedia on demand. Expert Systems with Applications, 37(3):2006–2015.
[107] Tsai, C.-F. and Lu, Y.-H. (2009). Customer churn prediction by hybrid neural networks. Expert Systems with Applications,
[108] Tuğba, U. and Gürsoy, Ş. (2010). Customer churn analysis in telecommunication sector.

[109] Turban, E., Sharda, R., and Delen, D. (2010). Decision Support and Business Intelligence Systems.
[110] Vafeiadis, T., Diamantaras, K. I., Sarigiannidis, G., and Chatzisavvas, K. C. (2015). A comparison of machine learning techniques for customer churn prediction. Simulation Modelling Practice and Theory, 55:1–9.
[111] Van den Poel, D. and Larivière, B. (2004). Customer attrition analysis for financial services using proportional hazard models. European Journal of Operational Research, 157(1):196–217.
[112] Verbeke, W., Dejaeger, K., Martens, D., Hur, J., and Baesens, B. (2012). New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research, 218(1):211–229.
[113] Verbeke, W., Martens, D., Mues, C., and Baesens, B. (2011). Building comprehensible customer churn prediction models with advanced rule induction techniques. Expert Systems with Applications, 38(3):2354–2364.
[114] Wang, G., Liu, L., Nie, G., Kou, G., and Shi, Y. (2010). Predicting credit card holder churn in banks of China using data mining and MCDM.
[115] Wang, Y.-F., Chiang, D.-A., Hsu, M.-H., Lin, C.-J., and Lin, I.-L. (2009). A recommender system to avoid customer churn: A case study. Expert Systems with Applications,
[116] Wei, C.-P. and Chiu, I.-T. (2002). Turning telecommunications call details to churn prediction: A data mining approach. Expert Systems with Applications, 23(2):103–112.
[117] Wiersema, F. (2013). The B2B Agenda: The current state of B2B marketing and a look ahead. Industrial Marketing Management, 42(4):470–488.
[118] Xia, G.-e. and Jin, W.-d. (2008). Model of customer churn prediction on Support Vector Machine. Systems Engineering - Theory & Practice, 28(1):71–77.
[119] Xiao, J., Xie, L., He, C., and Jiang, X. (2012). Dynamic classifier ensemble model for customer classification with imbalanced class distribution. Expert Systems with Applications, 39(3):3668–3675.
[120] Xie, Y., Li, X., Ngai, E. W. T., and Ying, W. (2009). Customer churn prediction using improved balanced random forests. Expert Systems with Applications, 36(3, Part 1):5445–5449.
[121] Xu, M. and Walton, J. (2005). Gaining customer knowledge through analytical CRM. Industrial Management & Data Systems, 105(7):955–971.
[122] Yan, L., Fassino, M., and Baldasare, P. (2005). Predicting Customer Behavior via Calling Links.
[123] Yu, X., Guo, S., Guo, J., and Huang, X. (2011). An extended support vector machine forecasting framework for customer churn in e-commerce. Expert Systems with Applications, 38(3):1425–1430.
[124] Zablah, A. R., Brown, B. P., and Donthu, N. (2010). The Relative Importance of Brands in Modified Rebuy Purchase Situations.
[125] Zhang, Y., Qi, J., Shu, H., and Cao, J. (2007). A Hybrid KNN-LR Classifier and its Application in Customer Churn Prediction.
[126] Zhang, Y., Qi, J., Shu, H., and Li, Y. (2006). Case study on CRM: Detecting likely churners with limited information of fixed-line subscriber.
[127] Zhao, Y., Li, B., Li, X., Liu, W., and Ren, S. (2005). Customer churn prediction using improved one-class support vector machine.
[128] Zinkhan, G. M. (2002). Relationship Marketing: Theory and Implementation. Journal of Market-Focused Management, 5(2):83–89.

Attachment A Experimental results

ACC AUC Sens Spec F1 Lift

LR 0.8390 (0.0015) 0.9324 (0.0055) 0.3792 (0.0031) 0.9936 (0.0010) 0.5424 (0.0044) 3.7855 (0.0312)
DT 0.8202 (0.1936) 0.8953 (0.0203) 0.6934 (0.1634) 0.8629 (0.2932) 0.6992 (0.1254) 3.4419 (0.1754)
NB 0.8261 (0.0033) 0.8967 (0.0055) 0.3537 (0.0066) 0.9851 (0.0022) 0.5059 (0.0094) 3.5314 (0.0636)
RF 0.8382 (0.0025) 0.9429 (0.0037) 0.3799 (0.0054) 0.9924 (0.0016) 0.5417 (0.0074) 3.7511 (0.0478)
NN 0.8378 (0.0019) 0.9404 (0.0045) 0.3768 (0.0038) 0.9929 (0.0013) 0.5391 (0.0055) 3.7607 (0.0378)
SVM 0.8389 (0.0015) 0.9361 (0.0045) 0.3791 (0.0029) 0.9936 (0.0010) 0.5423 (0.0042) 3.7851 (0.0279)
Bagging 0.8494 (0.0040) 0.9335 (0.0052) 0.4392 (0.0152) 0.9874 (0.0021) 0.5946 (0.0145) 3.7116 (0.0483)
Boosting 0.8391 (0.0019) 0.9445 (0.0040) 0.3793 (0.0038) 0.9937 (0.0013) 0.5427 (0.0054) 3.7858 (0.0378)

Table 19: Experimental results (averages and standard deviations)
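For reference, the threshold-based metrics reported in Table 19 follow directly from a confusion matrix, and lift from the ranking induced by the churn scores. The sketch below is illustrative only, not the evaluation code used in this study: the 0.5 cut-off and the top-decile definition of lift are assumptions, and AUC is omitted since it is threshold-free.

```python
import numpy as np

def churn_metrics(y_true, y_score, threshold=0.5, top_fraction=0.1):
    """Illustrative computation of ACC, Sens, Spec, F1 and Lift (cf. Table 19).

    Assumes y_true holds 0/1 churn labels and y_score the predicted churn
    probabilities; the 0.5 threshold and top-decile lift are assumptions,
    not necessarily the settings used in the study.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    y_pred = (y_score >= threshold).astype(int)

    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))

    acc = (tp + tn) / y_true.size
    sens = tp / (tp + fn)            # recall on the churners
    spec = tn / (tn + fp)            # recall on the non-churners
    prec = tp / (tp + fp)
    f1 = 2 * prec * sens / (prec + sens)

    # Lift: churn rate among the top-scored fraction vs. the overall rate.
    n_top = max(1, int(top_fraction * y_true.size))
    top = np.argsort(y_score)[::-1][:n_top]
    lift = y_true[top].mean() / y_true.mean()

    return {"ACC": acc, "Sens": sens, "Spec": spec, "F1": f1, "Lift": lift}
```

For example, a test set with a 20% churn rate in which the churners receive the highest scores yields a top-decile lift of 5, consistent with the magnitudes reported in Table 19.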

Attachment B Results of the Wilcoxon signed-rank tests

LR DT NB RF NN SVM Bagging Boosting

LR - 0.0029 <0.001 0.0602 <0.001 0.272 <0.001 0.935
DT - - 0.0029 0.0029 0.0029 0.0029 0.0030 0.0029
NB - - - <0.001 <0.001 <0.001 <0.001 <0.001
RF - - - - 0.4400 0.117 <0.001 0.1300
NN - - - - - 0.0012 <0.001 <0.001
SVM - - - - - - <0.001 0.5100
Bagging - - - - - - - <0.001
Boosting - - - - - - - -

Table 20: P-values Wilcoxon test (accuracy)

LR DT NB RF NN SVM Bagging Boosting

LR - <0.001 <0.001 <0.001 <0.001 <0.001 0.129 <0.001
DT - - 0.422 <0.001 <0.001 <0.001 <0.001 <0.001
NB - - - <0.001 <0.001 <0.001 <0.001 <0.001
RF - - - - <0.001 <0.001 <0.001 <0.001
NN - - - - - <0.001 <0.001 <0.001
SVM - - - - - - <0.001 <0.001
Bagging - - - - - - - <0.001
Boosting - - - - - - - -

Table 21: P-values Wilcoxon test (AUC)

LR DT NB RF NN SVM Bagging Boosting

LR - <0.001 <0.001 0.219 <0.001 0.615 <0.001 0.806
DT - - <0.001 <0.001 <0.001 <0.001 <0.001 <0.001
NB - - - <0.001 <0.001 <0.001 <0.001 <0.001
RF - - - - 0.0089 0.292 <0.001 0.326
NN - - - - - 0.0015 <0.001 <0.001
SVM - - - - - - <0.001 0.426
Bagging - - - - - - - <0.001
Boosting - - - - - - - -

Table 22: P-values Wilcoxon test (sensitivity)

LR DT NB RF NN SVM Bagging Boosting

LR - <0.001 <0.001 <0.001 <0.001 0.232 <0.001 0.882
DT - - <0.001 <0.001 <0.001 <0.001 <0.001 <0.001
NB - - - <0.001 <0.001 <0.001 <0.001 <0.001
RF - - - - 0.224 <0.001 <0.001 <0.001
NN - - - - - 0.0011 <0.001 <0.001
SVM - - - - - - <0.001 0.4300
Bagging - - - - - - - <0.001
Boosting - - - - - - - -

Table 23: P-values Wilcoxon test (specificity)

LR DT NB RF NN SVM Bagging Boosting
LR - <0.001 <0.001 0.715 <0.001 0.344 <0.001 0.914
DT - - <0.001 <0.001 <0.001 <0.001 0.0010 <0.001
NB - - - <0.001 <0.001 <0.001 <0.001 <0.001
RF - - - - 0.114 0.903 <0.001 0.727
NN - - - - - 0.0023 <0.001 <0.001
SVM - - - - - - <0.001 0.315
Bagging - - - - - - - <0.001
Boosting - - - - - - - -

Table 24: P-values Wilcoxon test (F-measure)

LR DT NB RF NN SVM Bagging Boosting

LR - <0.001 <0.001 <0.001 <0.001 0.836 <0.001 0.847
DT - - 0.0062 <0.001 <0.001 <0.001 <0.001 <0.001
NB - - - <0.001 <0.001 <0.001 <0.001 <0.001
RF - - - - 0.414 <0.001 <0.001 <0.001
NN - - - - - <0.001 <0.001 <0.001
SVM - - - - - - <0.001 0.829
Bagging - - - - - - - <0.001
Boosting - - - - - - - -

Table 25: P-values Wilcoxon test (lift)
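The p-values in Tables 20 to 25 are obtained by pairing, for each pair of classifiers, the per-run values of a metric and applying a two-sided Wilcoxon signed-rank test. As a sketch of the mechanics only, the helper below uses the normal approximation; the study's own implementation may differ (e.g. an exact-distribution version), and `wilcoxon_signed_rank` is a hypothetical name.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.

    x and y are paired per-run scores of two classifiers (same length).
    Returns (W+, p-value). Sketch only; an exact-distribution version
    is preferable for small samples.
    """
    d = [a - b for a, b in zip(x, y) if a != b]   # drop zero differences
    n = len(d)

    # Rank |d| ascending, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1                # 1-based average rank
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF
    return w_plus, 2 * (1 - phi)
```

Running such a test for every pair of models on their per-run score vectors reproduces the structure of the tables above: evenly split positive and negative differences drive the p-value toward 1, while one model winning nearly every run drives it toward 0.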