Benchmarking analytical techniques for churn modelling in a B2B context

Word count: 10782

Student number: 01200292

Commissioner: Steven Hoornaert

Confidentiality Agreement

I declare that the content of this Master’s Dissertation may be consulted and/or reproduced,

provided that the source is referenced.

Foreword

This thesis was written as the final part of my Master in Commercial Engineering and concludes a five-year trajectory. I have always had an interest in data analytics, and by exploring the subject of churn prediction I was able to deepen my understanding of this field of study. By means of this foreword, I would like to take the opportunity to thank the people who contributed to the realization of this dissertation.

First and foremost, I would like to thank my commissioner for giving me the opportunity to work on this topic, even though I had no affinity with the subject matter beforehand. I also want to thank him for the guidance he provided throughout the whole process; his detailed and comprehensive suggestions and remarks helped me tremendously.

Special thanks go to my uncle for providing constructive feedback on my thesis. I would like to end by thanking my parents for giving me the opportunity to study, and my brother and sister for their continuous support and encouragement.


Table of Contents

Confidentiality Agreement i

Foreword ii

List of Abbreviations iv

List of Tables v

List of Figures vi

1 Introduction 1

2 Literature review 3

Relationship marketing in B2B markets . . . . . . . . . . . . . . . . . . . . . . . . 3

Customer Relationship Marketing (CRM) & Data Mining . . . . . . . . . . . . . . 3

Data mining algorithms for churn prediction . . . . . . . . . . . . . . . . . . . . . . 4

Customer churn prediction in FMCG settings . . . . . . . . . . . . . . . . . . . . . 8

Churn modeling in B2B settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

Churn variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3 Methodology 13

Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

Analytical techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

Parameter selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

Model evaluation criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

Cross validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

Statistical tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4 Results 20

Performance evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

Variable importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

5 Conclusion 26

References i


List of Abbreviations

B2B Business-to-Business

B2C Business-to-Customer

DT Decision Tree

FN False Negative

FP False Positive

LR Logistic Regression

NB Naïve Bayes

NN Neural Network

RF Random Forests

RM Relationship Marketing

TN True Negative

TP True Positive


List of Tables

7 Confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

10 Computation time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

11 Variable importance LR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

12 Variable importance DT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

13 Variable importance NB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

14 Variable importance RF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

15 Variable importance NN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25


List of Figures

1 Comparison of ROC-curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22


Benchmarking analytical techniques

for churn modelling in a B2B context

Abstract. Despite the proven importance of churn prediction for customer retention, research on the performance of churn modelling techniques has been very limited in B2B contexts. This stands in sharp contrast to the numerous applications found in B2C settings. To address this imbalance, we perform a benchmarking exercise of commonly used analytical techniques: Logistic Regression, Decision Trees, Naïve Bayes, Random Forests, Neural Networks, Support Vector Machines, Bagging and Boosting. Empirical data from an FMCG retailer is used to predict churn in a B2B setting. The results show that Stochastic Gradient Boosting outperforms the other models in predictive power. Logistic Regression can be recommended for B2B churn prediction as well, owing to its excellent combination of high predictive power and comprehensibility at a low computation time. When evaluating variable importance, recency variables are shown to have a very high predictive power: every prediction technique ranks recency as the most important variable. Our findings also indicate that the importance of the other variable categories depends on the applied prediction technique.

1 Introduction

The use of data mining techniques in Customer Relationship Management (CRM), in domains such as customer churn, customer acquisition or customer up- and cross-selling, has become common practice across various industries and applications. To date, most research on this topic is situated in Business-to-Customer (B2C) settings, while applications in Business-to-Business (B2B) settings have been scarce. This is partly due to differences between industrial and consumer markets in decision-making processes, relationships, type of buyers, nature of demand, communication mix and other factors [83, 124]. Moreover, the lack of data availability and of domain-relevant knowledge among researchers active in the B2B field is seen as a considerable challenge as well [71]. These methods nevertheless hold great potential in a B2B context, since industrial companies typically face a small number of customers who generate a large percentage of revenue [51]. According to the Pareto or 80/20 rule, 20% of customers may even generate 80% of total revenue [121]. Since B2B companies are typically characterized by a smaller customer base but a much higher transaction volume [103], losing a customer has a more significant direct effect on the company's revenues. We therefore argue that data mining techniques can have positive implications for customer retention in B2B.


In this study, we focus specifically on customer retention, more popularly known through its converse, customer churn. Customer churn is defined as the number or percentage of regular customers who abandon a relationship with a service provider [59]. A distinction can be drawn between partial and complete churn: partial churn is the switch of some of the customer's purchases to another company, while complete churn concerns all purchases [14]. To manage customer churn, companies can opt for retention campaigns tailored to a small set of customers or for identical retention campaigns targeted at all customers ('one-size-fits-all' marketing actions) [68]. Given a company's limited resources, targeted retention campaigns are much more efficient. Machine learning techniques can help identify future churners, enabling the company to concentrate its retention efforts on the specific customers with the highest probability to churn. Effective and accurate churn prediction models are therefore needed to reliably estimate a customer's probability to churn.
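To make the targeting idea concrete, the following is a minimal sketch (ours, not from the thesis): rank customers by a model's predicted churn probability and select the top decile for a retention campaign. The customer ids and scores are invented.

```python
def top_decile(scores):
    """Return customer ids in the top 10% by predicted churn probability."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    cutoff = max(1, len(ranked) // 10)  # at least one customer
    return [cid for cid, _ in ranked[:cutoff]]

# Hypothetical churn probabilities for 10 customers.
scores = {f"c{i}": p for i, p in enumerate(
    [0.91, 0.15, 0.42, 0.88, 0.05, 0.33, 0.67, 0.21, 0.74, 0.09])}
print(top_decile(scores))  # ['c0']
```

In practice the decile fraction would be tuned to the campaign budget; the principle of ranking by estimated churn risk stays the same.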

For many years, customer defection has been predicted using data mining techniques such as Logistic Regression [52, 54], Decision Trees [4, 126], Neural Networks [13, 49] and Support Vector Machines [3, 58]. The main strength of these methods lies in their predictive potential. Logistic Regression and Decision Trees are the most widely used methods because they offer a good trade-off between performance and interpretability. The performance of data mining techniques for churn prediction has mostly been evaluated on datasets of B2C companies, due to the aforementioned difficulties encountered in B2B research. Extensive comparisons of churn prediction techniques in B2C have already been presented in studies such as [110], [112] and [113].

In light of a small customer base with a high relative contribution to the bottom line, the cost of a wrong prediction is even higher in the B2B domain, so churn prediction models with a low risk of misclassification are needed. However, the B2B area has not seen the same intensity of research as B2C [71, 117], which makes it challenging for companies to select an appropriate churn prediction model; there is a visible lack of churn prediction implementations in the B2B context. Moreover, the results of B2C research are ambiguous and difficult to compare: each study recommends a different churn prediction technique, and these recommendations are mostly based on limited benchmarking analyses, so no general consensus can be reached [113]. This study fills the need for a broad benchmarking study in support of B2B decision making.

To address this gap, this paper focuses on customer churn prediction in a B2B context. We first establish the gap by analysing past research on the use of machine learning techniques for churn prediction. Next, we present an empirical analysis of the most commonly used techniques on a B2B dataset of a Fast Moving Consumer Goods (FMCG) company. Our goal is to analyse to what extent techniques used in B2C settings are applicable in B2B settings as well. Eight algorithms for churn prediction are benchmarked: Logistic Regression, Decision Trees, Naïve Bayes, Random Forest, Neural Networks, Support Vector Machines, Bagging and Boosting. The performance of the techniques is assessed based on accuracy, sensitivity, specificity, F-measure, area under the ROC curve (AUC) and top decile lift. The predictive power of each model is discussed while taking interpretability and computation time into account. Furthermore, we analyse the importance of the churn prediction variables for each technique.
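The evaluation criteria listed above can all be computed from a confusion matrix plus predicted scores. The sketch below is illustrative (our own naming, not code from the study):

```python
def confusion(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = churner)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn)        # recall on churners
    specificity = tn / (tn + fp)        # recall on non-churners
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, f_measure

def top_decile_lift(y_true, scores):
    """Churn rate in the top-scored decile divided by the overall churn rate."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    k = max(1, len(order) // 10)
    top_rate = sum(y_true[i] for i in order[:k]) / k
    base_rate = sum(y_true) / len(y_true)
    return top_rate / base_rate

print(metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1]))
```

AUC is omitted here for brevity; it integrates sensitivity against (1 - specificity) over all classification thresholds rather than at a single cut-off.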


The remainder of the paper is organized as follows. Section II presents a literature review of churn prediction models used in both B2B and B2C. Section III lists the methodology in greater detail, consisting of a short description of the analytical techniques, the evaluation metrics and the general approach. The results are discussed in Section IV, followed by the conclusion in Section V. Lastly, Section VI discusses the limitations and directions for future research.

2 Literature review

Since the early 1990s, the focus in businesses has shifted from transactional marketing to Relationship Marketing (RM) [56]. RM holds that it is much more effective to build long-term relationships with customers than to pursue potentially unrelated exchanges [128]. The sale between buyer and seller becomes the starting point of the buyer-seller interaction, whereas it is the endpoint in the transactional approach. It has been shown that RM is more beneficial, since retained customers increase earnings [14]. Moreover, they tend to spread positive word-of-mouth [94] and to buy more [75]; their price sensitivity tends to decrease [94] and they become less sensitive to competitors' actions as well [8]. These advantages encourage companies to adopt a relationship approach and establish long-term relationships with their clients.

A transactional approach would be, for example, a company offering the same standardized products or services to every customer. By contrast, committing to relationship marketing paves the way for customization, alignment of manufacturing strategies, or even designing the product together with the client. Companies that design products or services especially to meet a particular customer's needs incite customers to enter into close, long-term relationships and thereby benefit from the advantages these relationships bring.

B2B companies are typically characterized by buyer-seller interdependence and by relationships that are close and long-term oriented [27]. Such relationships are advantageous because of possible cost reductions or increased revenues [27]. The customization that can be offered in this way makes them attractive for business customers as well. Moreover, the motivation to develop close relationships stems from the complex nature of B2B offerings: their technicality, complexity and the long, formal group buying processes [104]. In view of these considerations, B2B markets have always tended towards relationship marketing.

CRM enables the implementation of relationship marketing within a company [98]. CRM is described as the effort of combining customer-oriented business processes and technologies to manage the interaction between businesses and customers [56]. According to [86], CRM consists of customer identification, customer attraction, customer development and customer retention. In this paper, we focus on the customer retention domain. This domain holds a lot of potential, given that acquiring new customers costs considerably more than retaining existing ones [8, 94]. In addition, a small improvement in retention rate can lead to a significant increase in profit [94, 111]. Companies have therefore shifted their focus from customer acquisition to customer retention [91].

Technology and CRM have changed the way marketing has been implemented in recent years. Analytical CRM (aCRM) in particular has been omnipresent. As one of the four categories of CRM suggested by [17], aCRM aims to analyse the data a company has stored, using analytical tools. One of these tools is data mining, which can be described as the combination of statistical, mathematical, artificial intelligence and machine learning techniques used to acquire information and insights from databases [109]. Data mining can support the decision-making process and has frequently been used in CRM, e.g. [31, 32, 40]. We refer to [86] for an extensive overview of data mining techniques applied to CRM. These applications mostly concern B2C settings, but [99] stated that CRM can be even more important for business customers. Given the potential of the customer retention domain, the application of analytical techniques in this domain is highly relevant and has had a large impact on CRM in the past.

The impact of data mining techniques has become apparent in the area of churn prediction as well. Many techniques have been successfully used in the past to predict a customer's probability to churn, e.g. [49, 81, 85]. Table 1 lists the predominant data mining techniques that have been used to estimate a customer's probability to churn.

Table 1 clearly shows that Logistic Regression (LR) and Decision Tree (DT) models are the most common algorithms in academic research for predicting customer churn. Even though they are not well suited to capturing complex, non-linear relationships, their popularity stems from their ease of interpretation and low computation time. Past empirical analyses have led to contradictory results regarding their performance relative to one another. In studies such as [4], [38] and [18], the DT model did a better job at predicting churn than a Neural Network (NN) and LR; the latter study even showed that the DT outperformed a more sophisticated model, namely a Support Vector Machine (SVM). Other studies, such as [6, 24, 47, 87], showed that LR achieved better results than the DT; in [47], LR was also the best alternative compared to a DT and an NN. We conclude that, in spite of their simplicity, LR and DTs show a performance competitive with more complex models, in some cases even outperforming them. Consequently, LR and DTs are very suitable as benchmarking techniques.
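As an illustration of such a benchmark, the sketch below compares Logistic Regression and a Decision Tree on out-of-sample AUC. It uses synthetic, imbalanced data standing in for a churn dataset and is our own example, not the thesis's experiment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly 20% positives, mimicking churn imbalance.
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.8],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

aucs = {}
for model in (LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(max_depth=5, random_state=0)):
    model.fit(X_tr, y_tr)
    aucs[type(model).__name__] = roc_auc_score(
        y_te, model.predict_proba(X_te)[:, 1])
print(aucs)
```

Which of the two wins depends on the data, mirroring the contradictory findings in the literature above; AUC on held-out data is used because it is threshold-independent.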

Bagging and boosting are ensemble methods constructed to reach a higher predictive performance than single classifiers. Bagging takes somewhat more computation time than LR and DT, but [69] and [95] showed that this ensemble technique performs better than a DT. Bagging reduces the variance of the prediction and is simple to put into practice. [6] showed that bagging combined with classification trees outperforms LR, which in turn outperformed the classification tree without bagging; we can conclude from this study that bagging improves the predictive performance of classification trees. In [110], boosting showed a substantial improvement in classification performance when combined with NNs, DTs and SVMs. The authors of [69] could not conclude whether bagging or boosting is better, since this depends on the dataset to which the methods are applied.
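A minimal sketch of the two ensemble ideas on synthetic data (our illustration, not an experiment from the cited studies): bagging averages many trees fitted on bootstrap samples, while boosting fits trees sequentially to the errors of their predecessors.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, weights=[0.8],
                           random_state=1)

models = {
    "single tree": DecisionTreeClassifier(random_state=1),
    "bagging": BaggingClassifier(DecisionTreeClassifier(random_state=1),
                                 n_estimators=50, random_state=1),
    "boosting": GradientBoostingClassifier(random_state=1),
}
# Mean cross-validated AUC per model.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
          for name, m in models.items()}
print(scores)
```

On most datasets both ensembles beat the single tree, but, as [69] notes, whether bagging or boosting wins varies with the data.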

Table 1: Data mining techniques used for churn prediction, with the B2C and B2B studies applying each

Regression algorithms
  Logistic Regression (LR). B2C: [1] [4] [6] [13] [14] [15] [16] [19] [22] [24] [25] [29] [34] [36] [38] [42] [44] [46] [47] [50] [52] [54] [55] [57] [63] [64] [66] [67] [70] [74] [76] [77] [78] [79] [81] [82] [84] [87] [88] [89] [93] [95] [96] [100] [102] [108] [110] [112] [113] [114] [118] [126]. B2B: [18] [37] [105].
  Linear Regression. B2C: [89]. B2B: [39].
  Probit Regression. B2B: [39].
  Multivariate Adaptive Regression Splines (MARS). B2C: [76].

Perceptron-based techniques
  Multilayer perceptrons: Artificial Neural Networks (ANN). B2C: [2] [3] [4] [5] [13] [14] [19] [26] [36] [38] [41] [42] [43] [45] [46] [47] [49] [50] [53] [54] [57] [58] [63] [66] [67] [81] [82] [92] [97] [101] [102] [106] [107] [110] [112] [118] [120] [123] [126] [127]. B2B: [18] [37].
  Single-layered perceptrons: Voted Perceptron. B2C: [88] [112].

Bayesian algorithms
  Naïve Bayes (NB). B2C: [13] [41] [42] [60] [66] [80] [85] [110] [112] [114] [118] [127].
  Bayesian Network (BN). B2C: [60] [61] [112] [114].

Ensemble classifiers
  Random Forests (RF). B2C: [14] [15] [16] [19] [23] [24] [28] [29] [48] [63] [64] [77] [112] [119] [120].
  Boosting. B2C: [16] [19] [36] [41] [69] [74] [82] [88] [110] [112] [122]. B2B: [105].
  Bagging. B2C: [6] [28] [29] [57] [69] [95] [112].
  GAMens. B2C: [23] [29].
  Logistic Model Tree (LMT). B2C: [112].
  Dynamic ensemble methods. B2C: [119].
  Random Subspace Method. B2C: [29].
  Rotation Forest. B2C: [28].
  RotBoost. B2C: [28].
  Other static classifier ensemble methods. B2C: [119] [122].

Rule-based methods
  PART. B2C: [44] [112] [114].
  RIPPER. B2C: [112] [113].
  OneR. B2C: [44].
  AntMiner+. B2C: [113].
  Active Learning Based Approach (ALBA). B2C: [113].

Instance-based algorithms
  K-Nearest Neighbour classifier (kNN). B2C: [26] [44] [48] [53] [112] [114].

Decision Trees (DT). B2C: [2] [4] [5] [6] [9] [13] [19] [23] [26] [28] [34] [36] [38] [41] [42] [43] [44] [45] [47] [50] [53] [54] [60] [63] [66] [67] [69] [70] [79] [80] [82] [87] [89] [93] [95] [97] [101] [102] [106] [108] [110] [112] [113] [114] [115] [116] [118] [120] [123] [126] [127]. B2B: [18] [105].

Support Vector Machines (SVM). B2C: [2] [3] [19] [24] [35] [42] [43] [44] [46] [50] [53] [58] [63] [70] [80] [97] [101] [110] [112] [113] [114] [118] [120] [123] [127]. B2B: [18] [37].

Hybrid models. B2C: [21] [35] [41] [44] [45] [53] [67] [72] [88] [90] [92] [101] [107].

Other algorithms
  Evolutionary Algorithms (EA). B2C: [2] [5] [42].
  Generalized Additive Models (GAM). B2C: [22] [23] [29].
  Sequential Pattern Mining. B2C: [20] [80] [97].
  Survival models. B2C: [19] [73] [111].
  Discriminant Analysis. B2C: [13] [89] [97].
  Pareto/Negative Binomial Distribution (NBD) model. B2B: [39].
  Partial Least Squares. B2C: [66].
  K*. B2C: [114].
  Markov Chains. B2C: [15].
  Z-score model. B2C: [92].
  Decision Table. B2C: [114].

They did, however, show that boosting had a better predictive performance than a DT. We can conclude that bagging and boosting in most cases achieve a higher predictive performance than LR and DT as single classifiers. Boosting was, however, not able to outperform more complex models such as Random Forests in [16] and SVMs in [19].

Random Forests (RF) were proposed to deal with the disadvantages of Decision Trees, namely a lack of robustness and a vulnerability to noise in the data. Random Forests appear to be more popular for partial churn prediction than for complete churn: the technique has been applied in five of the nine studies in Table 1 that treat partial churn. The algorithm generally has a high predictive performance: Random Forests surpass LR and SVMs in [24] and similarly perform better than DTs, NNs and SVMs in [120]. However, in [14] RF did not reach a significantly higher performance than LR and an NN, and in [77] LR even outperformed RF. A disadvantage of RF is that the technique is considered a black box.
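Impurity-based feature importances offer one partial window into that black box. The sketch below uses synthetic data and hypothetical RFM-style feature names, not the thesis's variables:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, random_state=2)
# Hypothetical predictor names for illustration only.
names = ["recency", "frequency", "monetary", "length_of_relationship"]

rf = RandomForestClassifier(n_estimators=200, random_state=2).fit(X, y)
# Importances are normalized to sum to 1; higher means the feature was
# used more often (and more effectively) in the forest's splits.
ranked = sorted(zip(names, rf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
print(ranked)
```

Such rankings underpin the variable-importance comparison performed later in this thesis, though impurity-based importances can be biased towards high-cardinality features.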

Similar to RF, NNs and SVMs are black-box models as well. NNs are among the most popular methods, as can be seen in Table 1, and are generally considered to have a higher predictive performance than less complex models like LR and DTs: [82] and [102] show that NNs outperform DTs and LR, and [5] likewise shows the superiority of NNs over DTs. In [53], NNs not only have a better predictive performance than kNN and DTs, but outperform SVMs as well. On the other hand, the authors of [106] concluded that the NN was outperformed by a DT.

SVMs are seen as a more sophisticated model that is computationally more intensive [70]. Multiple studies show the excellent predictive performance of SVMs compared to other churn prediction techniques [46, 70, 101, 110, 127]. However, [24] showed that SVMs only surpass LR when the right parameter selection technique is used, and an SVM was even outperformed by a DT and an NN in [110].

Naïve Bayes (NB) is a prediction technique that has also frequently been applied to churn prediction. Although it is a simple classifier, NB has been able to report high predictive accuracies in the past [41]. In [110], NB was shown to be less effective than NNs, SVMs and DTs. In [118], NB was similarly outperformed by NNs and SVMs, but it did attain a better performance than DTs. Nevertheless, [112] recommends Naïve Bayes as a churn prediction model due to its comprehensibility, operational efficiency and sufficient predictive power.
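The comprehensibility argument can be made concrete: after fitting a Gaussian Naïve Bayes model, the per-class feature means it has learned are directly inspectable. A sketch on synthetic data (ours, for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, n_features=3, n_informative=2,
                           n_redundant=0, random_state=3)

nb = GaussianNB().fit(X, y)
print(nb.theta_)       # mean of each feature, per class (churner vs. not)
print(nb.score(X, y))  # training accuracy
```

Comparing the two rows of `theta_` shows which features separate the classes, which is exactly the kind of transparency [112] values in NB.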

Many hybrid methods have been proposed as well, e.g. [35, 41, 72, 88]. A hybrid method combines two or more data mining techniques with the aim of increasing the predictive power of standard classification techniques. Some hybrid models combine two classification techniques, while others combine a clustering and a classification technique. For example, [35] builds a hybrid model combining an SVM with a Naïve Bayes Tree, [72] introduces a hybrid model based on Rough Set Theory and a Flow Network Graph, and [21] combines a classification technique (DT) with a clustering technique (Growing Hierarchical Self-Organizing Maps). Benchmarked against other classification techniques, these hybrid models always appear to be the most effective and performant; however, their predictive power is hardly ever tested in other situations.

Table 2: Churn prediction applications in the FMCG sector

Buckinx et al. [13]. Belgian retailer; partial churn. Techniques: Neural Networks*, Logistic Regression, Linear/Quadratic Discriminant Analysis, Decision Tree, Naïve Bayes, K-Nearest Neighbours.

Buckinx and Van den Poel [14]. Grocery retailer; partial churn. Techniques: Logistic Regression, Artificial Neural Network, Random Forests.

Gordini and Veglio [37]. Italian on-line company; full churn. Techniques: SVMauc (SVM based on AUC parameter selection)*, SVMacc (SVM based on accuracy parameter selection), Neural Network, Logistic Regression.

Miguéis et al. [76]. European food-based retailer; partial churn. Techniques: Logistic Regression with Stepwise Feature Selection*, MARS, Logistic Regression without variable selection procedure.

Tamaddoni Jahromi et al. [105]. Australian retailer; full churn. Techniques: Boosting*, Logistic Regression, Simple Decision Tree, Cost-sensitive Decision Tree.

* denotes the best performing technique in the study

Customer churn prediction has been addressed in multiple sectors, such as publishing [6, 22, 24], financial services [36, 63, 111], insurance [46, 80, 102], e-commerce [57, 123], banking [35, 72, 87], telecommunications [43, 47, 112], online gambling [23], retailing [14, 39, 76], logistics [18] and cable services [15]. The attention given to the telecommunications industry has been disproportionate: 70 of the 117 papers listed in Table 1 study telecommunication companies. This is because the industry is characterized by strong competitiveness and increased liberalization, which makes churn prediction indispensable [58]. In comparison, most other sectors, like logistics, e-commerce and retail, have been underrepresented.

In this paper we treat the data of a retailer, more specifically an FMCG company. Fast-moving consumer goods are considered relatively inexpensive and frequently purchased [65]. Transactions characterized by high volume make customer retention, and accordingly churn prediction, more prominent. However, the fact that FMCG companies often operate in a non-contractual setting makes it more challenging as well: customers are not obliged to let companies know when they stop using their services or buying their products, which makes it difficult to determine when exactly a customer has churned. It has therefore been suggested to focus on partial churn instead of complete churn in retail settings, because customers typically defect progressively rather than in an abrupt discontinuation [77]. According to [14], partial churn has a strong chance of turning into complete churn in the long run; successfully predicting partial churn can thus prevent complete churn.
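One possible operationalization of partial churn in such a non-contractual setting (our own simplified rule, not necessarily the one used in this thesis) is to flag a customer whose purchase frequency in the evaluation period falls below some fraction of that in the observation period:

```python
def label_partial_churn(obs_counts, eval_counts, drop=0.5):
    """Label customers 1 (partial churner) when purchases in the
    evaluation period fall below `drop` times the observation count."""
    labels = {}
    for cid, n_obs in obs_counts.items():
        n_eval = eval_counts.get(cid, 0)  # absent customers bought nothing
        labels[cid] = 1 if n_eval < drop * n_obs else 0
    return labels

obs = {"c1": 10, "c2": 4, "c3": 8}
print(label_partial_churn(obs, {"c1": 2, "c2": 4, "c3": 0}))
# {'c1': 1, 'c2': 0, 'c3': 1}
```

The `drop` threshold is a modelling choice; stricter values move the definition closer to complete churn.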

An overview of churn prediction applications in the FMCG sector is displayed in Table 2. The applications that provide the most relevant results for our empirical analysis are [105] and [37], where FMCG datasets of B2B companies are likewise used to predict churn. This implies that their best performing techniques, boosted trees and an SVM, could lead to similarly satisfactory results when applied to our dataset. In [37], an SVM (89.98% PCC, 88.61% AUC) outperformed Logistic Regression (88.13% PCC, 86.04% AUC) and a Neural Network (88.25% PCC, 87.15% AUC). In [105], boosting (92% AUC) performed slightly better than Logistic Regression (91% AUC), while simple and cost-sensitive decision trees (AUC of 85% and 83%, respectively) were significantly outperformed by both techniques. These studies predicted complete churn, while the other FMCG studies focused on partial churn only, as can also be seen in Table 2.

It was demonstrated in [14] that partial churn can be successfully predicted in a non-contractual setting; no significant differences were found in that study between the analysed data mining techniques (Logistic Regression, a Neural Network and Random Forests). However, in [13] a Neural Network (76.23% PCC, 79.72% AUC) significantly outperformed Logistic Regression (75.57% PCC, 79.02% AUC) and other well-known methods. In [76], Multivariate Adaptive Regression Splines (MARS) was introduced to predict the churn of a retailer's customers and was benchmarked against Logistic Regression. The study showed that MARS was able to detect more partial churners (76.74% AUC) than Logistic Regression (75.29% AUC), except when Logistic Regression was combined with stepwise forward or stepwise backward feature selection (AUC of 78.43% and 78.50%, respectively).

We can conclude that the different applications situated in the FMCG sector lead to inconclusive results. Only a limited number of data mining techniques are evaluated in these studies, which does not give a comprehensive review of the performance of different churn prediction algorithms in the FMCG domain.

Table 1 distinguishes between applications in B2B and B2C. The vast majority of the data mining techniques are used to predict churn for B2C companies; in the B2B domain, implementations have been limited [4], as the right-hand column of the table confirms. We clearly see that applications of churn modelling techniques are numerous in B2C, whilst only a limited number are situated in a B2B context.

Table 3: Churn prediction applications in B2B settings

Chen et al. [18]. Taiwanese logistics company. Best performing technique: Decision Trees. Outperformed techniques (in order of performance): Support Vector Machines, Logistic Regression.

Gordini and Veglio [37]. Italian e-commerce FMCG company. Best performing technique: SVMauc (SVM based on AUC parameter selection). Outperformed techniques: SVMacc (SVM based on accuracy parameter selection), Neural Network, Logistic Regression.

Hopmann and Thede [39]. German retailer for electronics and computer accessories. Best performing technique: General Linear Model (GLM) Regression. Outperformed techniques: Probit Regression, Negative Binomial Distribution (NBD)-based model.

Tamaddoni Jahromi et al. [105]. Australian FMCG retailer. Best performing technique: Boosting. Outperformed techniques: Logistic Regression, Simple Decision Tree, Cost-sensitive Decision Tree.

To the best of our knowledge, only four papers have contributed to research in the B2B churn prediction domain: [18], [37], [39] and [105]. More details about their approaches can be found in Table 3. The churn probability of customers of a Taiwanese logistics B2B company was predicted in [18]; the authors were interested in the effect of length of relationship, recency, frequency, monetary and profit (LRFMP) variables on the predictive power. A Decision Tree, Logistic Regression, an Artificial Neural Network and a Support Vector Machine were put into practice, and the results showed that the Decision Tree was the best algorithm in terms of accuracy, recall and F-measure. A Negative Binomial Distribution (NBD)-based model, Probit Regression and General Linear Model (GLM) Regression were used to construct churn prediction models in [39], based on data from a German retailer of electronics and computer accessories; GLM and Probit Regression were found to outperform the stochastic model. As discussed before, [37] and [105] treat data of FMCG companies. In [37], the predictive performance of an SVM was found to be superior to Logistic Regression and Neural Networks. A data mining approach to model non-contractual churn in a B2B context was proposed in [105], where boosting outperformed three other modelling techniques (a cost-sensitive decision tree, a simple decision tree and logistic regression).

In conclusion, each of the four papers proposes a different technique for B2B churn prediction, and each proposed technique is benchmarked against a different set of techniques, which makes it difficult to evaluate the performance of the techniques in the B2B domain. Consider, for example, Logistic Regression: no general consensus can be reached. Boosting only marginally outperformed LR in [105], which indicates a sufficient performance of LR. However, when LR was used as a benchmarking technique in [18], it performed worst; a similar conclusion was drawn in [37], where Logistic Regression was outperformed by Neural Networks and Support Vector Machines. It is equally difficult to judge the performance of Decision Trees, which were alternately the best and the worst performing technique in [18] and [105].

2. LITERATURE REVIEW - 10 -

There is an ambiguity in the interpretation of method performance. This makes it challenging for B2B companies to determine which algorithm would be best suited for the implementation of a churn prediction model. Additionally, the lack of research on B2B churn prediction makes the challenge even bigger. To the best of our knowledge, methods like Bagging, Random Forests and Naïve Bayes have not been applied to B2B datasets yet. Moreover, techniques like Logistic Regression, Decision Trees and Neural Networks have not been benchmarked on the same B2B dataset, which makes it hard to assess the performance of these algorithms in this setting. There is a need for a more comprehensive benchmarking exercise than the ones already available.

Besides the technique used for its construction, the churn prediction variables included in a model can have an important influence on predictive performance as well. RFM-variables are commonly used for churn prediction. Recency refers to the time since the last purchase, frequency to the number of purchases in the analysed period, and monetary to the cumulative amount of money spent by a customer in this period [14, 78].

Table 4: Behavioural variables included in former research (studies: [13], [14], [37], [76], [105]; variable categories: interpurchase-time variables, frequency, monetary indicators, length of relationship, churn variables, behaviour across product categories, promotional behaviour, brand purchase behaviour, timing of shopping, products, failure)

Table 4 gives an overview of the behavioural variables used in former FMCG studies. The importance of RFM-variables was confirmed on a FMCG dataset in [14]. The length of the relationship was also found to be an important predictor, and the same goes for mode of payment, buying behaviour across categories, usage of promotions and brand purchase behaviour. Another application of churn prediction models to the data of FMCG customers showed that the length of relationship was an important indicator in addition to frequency and inter-purchase time related variables, mode of payment and promotional behaviour. In [37] we find a similar conclusion for recency, frequency and length of relationship; monetary indicators appeared to be less important.

Furthermore, it was shown in this study that variables related to product categories and failure are

important predictors as well. Given that this study treated the data of a FMCG B2B company, it is highly relevant to our research. In [76] no evident link could be found between the predictors selected by the different prediction techniques.

However, the study showed that brand-related variables were not relevant in any technique, just like the total amount spent during the analysed period, which serves as a monetary variable. In [105], only RFM-variables were used. The study emphasizes recency and frequency as highly predictive variables. Monetary indicators contributed less significantly to predicting churn.

To conclude, recency and frequency turn out to be highly important in all studies. Monetary

indicators do not seem to live up to their expected importance. The importance of length of

relationship was affirmed by all studies that incorporated this variable.

Summary

Considering the lack of academic research on this topic, a significant difference in field of application

and a variety of methods applied in different papers, a comparison of the different B2B churn

prediction methods is difficult to realize. Given this variation, the interpretation of the results is quite challenging. Drawing conclusions about B2B markets based on results acquired in a B2C context is a considerable challenge as well, due to the differences between these markets. We can however make some general predictions about the performances of the different churn modelling techniques.

Logistic Regression and Decision Trees are techniques that clearly dominate in Table 1. Their

popularity in the prediction of B2C customer churn leads us to believe that they will act as adequate

techniques for B2B churn prediction as well. Especially since their application is widely spread in

other domains than solely churn prediction. Since Neural Networks and Support Vector Machines are generally considered more advanced prediction models, they will most probably outperform LR and DTs in predictive power, especially since NNs and SVMs turned

out to be better performing than LR in both B2B studies [18, 37] where these techniques were

benchmarked amongst others. Bagging and boosting tend to perform better than single classifiers

and due to the superior predictive power of boosting on a B2B dataset in [105], we estimate them

to be adequate as well. RF has proven its adequacy too, by frequently outperforming other techniques in past studies such as [14, 15, 24, 77]. This gives us reason to foresee a good individual

performance of these techniques. We expect NB to achieve similar performances to LR and DT,

since it was not able to outperform more complex techniques in past research.

However, it is difficult to state expectations about the performance of churn prediction techniques

relative to one another. When consulting the literature, former studies show varying results concerning the performance of the techniques. For every technique, research exists affirming its superiority over others. Nevertheless, studies claiming the contrary can nearly always be found.


3 Methodology

Data

The dataset used to perform the computational experiments in this study is obtained from a B2B

company offering fast moving consumer goods. The dataset contains historical sales transactions

of 10 000 business customers. The data range is situated between 1/1/2011 and 13/6/2016. The

proportion of churners is about 25%. Compared to other studies this is relatively

high, since churn rates in B2C generally lie within a 5%-15% range e.g. [6, 15, 24, 35, 44, 80, 113,

125]. However, higher churn rates can be found as well in literature e.g. [14, 23, 88]. Since [14]

treats the data of a FMCG retailer as well, although it concerns a B2C company, a churn rate of

25% in a non-contractual setting might not be exceptionally high.

Variables

The target variable of our predictive models is churn, represented as a binary variable. Each

customer in the dataset is either identified as a churner (value 1) or a non-churner (value 0). A

churner is seen as a customer who cut ties with the company, while a non-churner stays loyal to

the company.

To construct the models, we used a limited number of predictors that are highly predictive in order

to keep the models as simple and comprehensible as possible. In this way the generalisability of

our conclusions will be facilitated. Employing only a limited number of variables will keep the

computational time low as well. Given that we are treating the data of a company in a non-

contractual B2B setting, no demographic variables are available. The equivalent of these variables

in a B2B context would be, for example, the number of employees or the concerned industry. Since

this information is not available, only behavioural variables based on the transactional history of

customers are included in the models.

The original dataset consisted of 563 variables. We selected the relevant variables for our study

based on former literature written about the subject. The variables that remained are listed in

Table 5. The equipment variables in the table concern information about the equipment installed at the customers’ premises to preserve the company’s products.

Since interpurchase-time related variables were proven to be an important variable category in past

research, we will use recency variables as well to construct our prediction models. Time since last

invoice and time since last equipment installation date both refer to the recency of customers’ shop

incidences. We include several variables related to customer’s frequency of purchases: number of

products and equipment models sold to the customer. Furthermore, the sales quantities for adjacent and CONV192 products, total sales quantity and total sales quantity in promotion are classified as frequency variables. The following monetary indicators are included: sales in dollars

(represented by 2 variables) and the cost of goods for all orders. The length of relationship is

incorporated as well. This variable category is operationalized by including the time since first

invoice and the time since first equipment installation date.

3. METHODOLOGY - 13 -

Table 5: Overview of the selected variables

Dependent variable

Description: Churn | Churners: 2517 | Non-churners: 7483 | Churn rate: 25.17%

Independent variables (summary statistics: Min. | Median | Max. | Mean | SD)

Recency
Sales.Inv Dt rec (Time since last invoice): 0.00 | 24.00 | 1254.00 | 143.59 | 268.00
Equipment.Install Date rec (Time since last equipment installation date): 0.00 | 212.50 | 11287.00 | 767.89 | 1238.45

Frequency
Sales.salesTotal freq (Number of products sold to customer): 1.00 | 119.00 | 11645.00 | 278.79 | 507.22
Equipment.Models freq (Number of equipment models at customer): 0.00 | 1.00 | 14.00 | 1.07 | 1.26
Sales.STD ADJ FCT mean (Sales quantity for adjacent products): 0.50 | 2.22 | 5.00 | 2.18 | 1.09
Sales.PKG CONV192 mean (Sales quantity for CONV192 products): 0.00 | 2.50 | 64.00 | 2.58 | 1.26
Sales.Qty mean (Total sales quantity): -165.90 | 1.37 | 217.78 | 1.66 | 6.26
Sales.PROMO QTY mean (Total sales quantity in promotion): 0.00 | 0.00 | 18.80 | 0.08 | 0.42

Monetary
Sales.Whlsl Price Xtnd mean (Sales in dollars): -975.00 | 71.72 | 2129.79 | 81.36 | 68.77
Sales.WHLSL UNIT PRICE mean (Sales in dollars): 0.00 | 48.41 | 140.00 | 49.56 | 22.91
Sales.COST OF GOODS mean (Cost of goods for all orders): 0.00 | 23.84 | 73.66 | 23.25 | 14.72

Length
Sales.Inv Dt dura (Time since first invoice): 0.00 | 1169.00 | 1258.00 | 900.35 | 418.93
Equipment.Install Date dura (Time since first equipment installation date): 0.00 | 317.00 | 23173.00 | 1070.23 | 1634.84

Analytical techniques

Multiple churn prediction models were constructed in order to predict the churn probability of

B2B customers in our dataset. The data mining techniques used to create those models were

selected based on their popularity and good predictive performance in past studies. The following

classification techniques were included in our benchmarking study:

Logistic Regression (LR) LR enables to predict the probability of a binary dependent variable

outcome based on the values of a set of independent variables. LR is easy to use and provides quick

and robust results [14]. Moreover, LR has a good interpretability compared to other methods [16].

This makes it an excellent benchmarking technique for the more complex and sophisticated models

applied in this study.

Decision Tree (DT) DTs are models that create a tree-like structure where instances are classified

based on their feature values. In each internal node a test is performed on a feature value. A branch represents the outcome of the test, which will eventually lead to a leaf node that stands for a class

label. In this way, decision rules for the classification of new instances are established. DTs are

widely used in many fields due to their ease of interpretability [116]. However, they are considered

as unstable classifiers that significantly change when small adjustments to the data are made [11].

The only parameter to tune is the complexity parameter.

Naı̈ve Bayes (NB) NB is a classification technique that is constructed based on Bayes’ theorem.

NB assumes independence among features, which is a serious limitation of the model. The technique

constructs an algorithm with a low variance, because it is quite insensitive to data fluctuations [62].

This, however, implies that the predictions will most likely be less accurate than high-variance

models. A kernel density estimate will be used as the density function to construct the Naı̈ve

Bayes model.

Neural Network (NN) Artificial Neural Networks mimic the structure and functions of a biolog-

ical neural network. NNs consist of multiple layers that are made up of neurons. The input layer

communicates with one or more hidden layers, which in turn links to the output layer. The con-

nections between each of the layers’ neurons are made through weighted links. A popular method

to assign those weights, is the Back Propagation Method. The self-learning ability of NNs makes

that the underlying logic is not clear. Consequently, NNs are models with poor interpretability

[116]. NNs have a higher computation time than LR or DTs as well. We will construct a standard

Neural Network model with 1 hidden layer. The parameters to adjust are the decay that is added

to the weights and the number of neurons in the hidden layer.

Support Vector Machine (SVM) SVMs represent instances as points in a high-dimensional feature space. SVMs search for the best separating gap between the points of different classes. New

instances will be mapped in the same space and will be classified based on their location relative


to the separating gap. SVMs are characterized by a high predictive performance [70, 50]. Only

two parameters have to be specified, the upper bound and the kernel parameter. On the downside,

SVMs are black box models and computationally more intensive [70]. In order to capture any non-linear relationships we will make use of a Gaussian Radial Basis kernel function (SVM-RBF). The

parameters to optimize are the misclassification cost and a sigma (σ) value that is specific to the

Radial kernel.

Bagging Bagging stands for bootstrap aggregating. It improves prediction accuracy by applying a

base classifier on different bootstrap samples. These samples are randomly drawn out of the training

sample with replacement. The results are combined using majority voting. Bagging requires no

extra information, is easy to implement and reduces a classifier’s variance [69]. Bagging performs

generally better than the base classifier when the latter is unstable, but will not be able to increase

the performance when it is not [11].
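To make the procedure concrete, the following minimal Python sketch applies bagging with majority voting to a toy base classifier (a one-feature threshold stump); the data, the stump learner and the parameter values are hypothetical illustrations, not the models used in this study:

```python
import random
from collections import Counter

def train_stump(sample):
    # Toy base classifier: choose the threshold on x that best fits the labels.
    best = None
    for t in sorted({x for x, _ in sample}):
        acc = sum((1 if x >= t else 0) == y for x, y in sample) / len(sample)
        if best is None or acc > best[1]:
            best = (t, acc)
    return best[0]

def bagging_fit(data, n_estimators=25, seed=42):
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_estimators):
        # Bootstrap sample: drawn from the training data with replacement.
        boot = [rng.choice(data) for _ in range(len(data))]
        stumps.append(train_stump(boot))
    return stumps

def bagging_predict(stumps, x):
    # Combine the base classifiers' predictions by majority voting.
    votes = Counter(1 if x >= t else 0 for t in stumps)
    return votes.most_common(1)[0][0]
```

On a small separable toy sample, the majority vote of the bootstrapped stumps reproduces the obvious class boundary.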

Random Forests (RF) RF are ensemble classifiers that grow multiple classification trees. Each

tree is grown on a bootstrap sample of the training set by using random feature selection at each

node. RF classify an instance based on the classifications of the individual trees. The class that

receives the most votes is attributed to that instance. RF protect against overfitting, which can

sometimes happen with DTs [12]. The technique is able to deliver a consistently high performance, is

very robust and has a reasonable computing time [14]. The only parameter to adjust is the number

of variables that are available for splitting at every node.

Boosting Boosting is seen as a more sophisticated version of bagging. First, the base classifier

is applied to the training sample, where each instance has an equal weight. Next, the weights are

adjusted, more importance is attributed to misclassified instances. A new classifier is constructed

based on the new weights. This process can be repeated multiple times. Boosting reduces vari-

ance as well as bias [69]. It is considered a robust technique [16]. Two commonly used boosting

algorithms are AdaBoost and Stochastic Gradient Boosting. Since Stochastic Gradient Boosting

achieved the best performance in [69], we will apply this algorithm to our B2B dataset. Stochastic

Gradient Boosting requires 4 parameters: the number of boosting iterations, the number of splits

performed on a tree, the learning rate and the minimal terminal node size.
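The reweighting idea behind boosting can be illustrated with an AdaBoost-style sketch in Python (note that this study applies Stochastic Gradient Boosting; the threshold-stump learner and the data below are hypothetical):

```python
import math

def adaboost_stumps(data, rounds=10):
    # data: list of (x, y) with y in {-1, +1}; base learner: threshold stump on x.
    w = [1 / len(data)] * len(data)          # start with equal instance weights
    ensemble = []
    for _ in range(rounds):
        best = None
        for t in {x for x, _ in data}:
            for d in (1, -1):
                err = sum(wi for wi, (x, y) in zip(w, data)
                          if (d if x >= t else -d) != y)
                if best is None or err < best[0]:
                    best = (err, t, d)
        err, t, d = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this weak classifier
        ensemble.append((alpha, t, d))
        # Reweight: misclassified instances gain importance in the next round.
        w = [wi * math.exp(-alpha * y * (d if x >= t else -d))
             for wi, (x, y) in zip(w, data)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(a * (d if x >= t else -d) for a, t, d in ensemble)
    return 1 if score >= 0 else -1
```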

Parameter selection

The parameters of the different analytical techniques will be optimized by making use of grid

search. In Table 6 the ranges of values that are used for tuning the parameters can be found.

For every analytical technique, different models will be constructed for all possible combinations

of parameters. The optimal combination of parameters is defined based on a cross validated AUC

measure.


Table 6: Parameter tuning values per technique

Technique | Parameter | Tuning values
DT | complexity parameter | [0.0025, 0.0030, 0.0035, 0.0040, 0.0045]
NN | decay | [0.0001, 0.001, 0.01, 0.1]
NN | # hidden neurons | [1, 3, 5, 7]
SVM | σ | [1e-4, 1e-3, 1e-2, 1e-1]
SVM | cost | [1e-3, 1e-2, 1e-1, 1, 10, 100]
RF | # variables per split | [3, 4, 5, 6, 7]
Boosting | # iterations | [500, 1000]
Boosting | # splits | [2, 3]
Boosting | learning rate | [0.1]
Boosting | min. terminal node size | [10, 25, 50]
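The grid search over such ranges can be sketched as follows in Python, where cv_auc stands for a hypothetical routine that returns the cross-validated AUC for one parameter combination:

```python
from itertools import product

def grid_search(param_grid, cv_auc):
    # param_grid: parameter name -> list of candidate values (cf. Table 6).
    names = list(param_grid)
    best_params, best_auc = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        auc = cv_auc(params)          # cross-validated AUC for this combination
        if auc > best_auc:
            best_params, best_auc = params, auc
    return best_params, best_auc
```

For Boosting, for instance, the four parameter rows above yield 2 x 2 x 1 x 3 = 12 combinations to evaluate.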

The analytical techniques described above will be assessed based on their ability to identify true

churners. Table 7 shows the confusion matrix. In the matrix, a True Positive (TP) refers to

correctly classifying an actual churner. A True Negative (TN) is a correct classification of an

actual non-churner as non-churner. A misclassified churner or non-churner is defined by a False

Negative (FN) or False Positive (FP) respectively.

Table 7: Confusion matrix

                    | Predicted churner | Predicted non-churner
Actual churner      | TP                | FN
Actual non-churner  | FP                | TN

Accuracy Accuracy, also known as Percentage Correctly Classified (PCC), is the number of cor-

rectly classified instances divided by the total amount of classified instances. It is the most com-

monly used evaluation metric for classifiers. A downside of the evaluation metric is that it assumes

equal misclassification costs for FP and FN. In the context of churn prediction this is not appropri-

ate since misclassifying a churner implies a higher cost than classifying a non-churner as a churner.

Addressing a retention campaign to non-churners implies a waste of useful resources. However,

the cost of losing a customer by incorrectly classifying him as non-churner is much higher. Fur-

thermore, PCC depends heavily on the cut-off value that determines whether an instance will be

classified as a churner or non-churner based on its predicted probability.

Accuracy = (TP + TN) / (TP + FP + TN + FN)

In order to compare the accuracy for different data mining techniques more adequately, we will

report the top 10% accuracy. This can be interpreted as the accuracy based on a cutoff value


that is equal to the 90th percentile of the predicted probabilities. Since the distribution of the

probabilities varies across algorithms, choosing a relative cutoff value is more appropriate than an

absolute one.

Precision & recall Precision and recall measures can give a better insight in the performance of

classification models since these measures do not assume equal misclassification costs.

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F-measure The F-measure (F1 ) combines both precision and recall into a single value, which is

more appropriate for the evaluation of predictive performance. Both evaluation metrics are required

to adequately assess the performance of a prediction technique.

F1 = (2 × Precision × Recall) / (Precision + Recall)

Sensitivity & specificity These measures are an alternative for accuracy as well, since they do

not assume equal misclassification costs either. Sensitivity is the True Positive Rate (TPR), the

percentage of churners correctly classified. Note that sensitivity is equal to recall. Specificity

or True Negative Rate (TNR) is the percentage of non-churners correctly classified.

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

The sensitivity, specificity and F-measure are calculated based on the same cutoff value that is used

for the top 10% accuracy.
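These cutoff-based measures can be computed jointly from the predicted probabilities. The Python sketch below uses the relative cutoff described above (the 90th percentile of the predictions) and is an illustration, not the evaluation code used in this study:

```python
def evaluate_at_cutoff(y_true, y_prob, percentile=90):
    # Relative cutoff: the given percentile of the predicted probabilities.
    cut = sorted(y_prob)[int(len(y_prob) * percentile / 100)]
    y_pred = [1 if p >= cut else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0   # recall = sensitivity
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "f1": 2 * precision * recall / (precision + recall)
              if precision + recall else 0.0,
    }
```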

ROC-curve The Receiver Operating Characteristic (ROC)-curve is a frequently used and recom-

mended evaluation metric as well, since no precise specification of a cutoff value is needed. On the

vertical axis of this two-dimensional graph we find the TPR or sensitivity. On the horizontal axis,

the False Positive Rate (FPR) or 1-specificity is given, which is the percentage of non-churners that

was incorrectly classified as a churner.

The outcome of a predictive model is given in terms of probabilities that observations of the test

data are of class 0 (non-churner) or class 1 (churner). The definition of the probability that will serve

as a threshold to classify the observation as a future churner or not, will influence the technique’s

performance. For every threshold from 0 to 100% we are able to derive the TPR and the FPR

based on the confusion matrix. Consequently, every cut-off value will lead to one point on the

curve. The more the ROC-curve is situated in the top left corner the better, in this way it will

correspond to a TPR of 1 and a FPR of 0.
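Deriving one (FPR, TPR) point per threshold can be sketched as follows (an illustrative Python fragment):

```python
def roc_points(y_true, y_prob):
    # One (FPR, TPR) point per candidate threshold, from high to low.
    pos = sum(y_true)
    neg = len(y_true) - pos
    pts = []
    for thr in sorted(set(y_prob), reverse=True):
        pred = [1 if p >= thr else 0 for p in y_prob]
        tp = sum(p and t for p, t in zip(pred, y_true))
        fp = sum(p and not t for p, t in zip(pred, y_true))
        pts.append((fp / neg, tp / pos))
    return pts
```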


AUC Area Under the Curve (AUC) computes the area under the ROC-curve. This single value

can be used to evaluate the performance of a classifier. Since the cut-off level is disregarded, AUC

is a very suitable metric to compare the predictive performance of classifiers for churn prediction.

AUC gives the probability that a classifier will rank a randomly chosen churner higher than a

randomly chosen non-churner. A random classification model has an AUC of 0.5, which implies

that a good classifier would have an AUC that is considerably higher.
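This ranking interpretation gives a direct, if quadratic-time, way to compute AUC, sketched here in Python with ties counted as one half:

```python
def auc_rank(y_true, y_prob):
    # P(score of a random churner > score of a random non-churner).
    churn = [p for y, p in zip(y_true, y_prob) if y == 1]
    non = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if c > n else 0.5 if c == n else 0.0
               for c in churn for n in non)
    return wins / (len(churn) * len(non))
```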

Top decile lift Another well-known evaluation metric is the top-percentile lift. A retention cam-

paign will generally focus only on a small percentage of customers with the highest probability

to churn, given that resources are scarce. Therefore, the performance of the model on the n-th

percentile of customers with the highest probability to churn is important. Ranking the customers

by predicted probability and dividing the proportion of actual churners in the top n-th percentile

by the proportion of churners in the total population gives us the n-th percentile lift. A lift measure

of 4, for example, means there are 4 times more churners situated in the top n-th percentile than

in the total population, which indicates the performance of a model compared to a random one.

In this paper we will consider the top decile lift.

The maximum achievable lift measure, following the approach of [10], is given by:

Max Lift = n        if N × c̄ ≤ N/n
Max Lift = 1/c̄      if N × c̄ > N/n

with N the total number of observations, c̄ the average churn rate in the dataset and n being the

n-th percentile lift. Since N = 10000, n = 10 and we observe average churn rate (c̄) of 25.17%,

the top decile is not large enough to include all churners. The second case holds for the equation

1 1

above, which entails that the max lift is c̄ = 25.17% = 3.97. This should be taken into account

when evaluating the performance of the models based on the lift measure.
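Both the observed top-percentile lift and the maximum achievable lift can be sketched as follows (illustrative Python with hypothetical inputs):

```python
def top_percentile_lift(y_true, y_prob, pct=10):
    # Rank customers by predicted probability, take the top pct percent,
    # and compare the churn rate in that segment with the overall churn rate.
    ranked = sorted(zip(y_prob, y_true), reverse=True)
    top = ranked[: len(ranked) * pct // 100]
    top_rate = sum(y for _, y in top) / len(top)
    base_rate = sum(y_true) / len(y_true)
    return top_rate / base_rate

def max_lift(N, n, churn_rate):
    # Piecewise maximum from the formula above: if all churners fit in the
    # top segment the lift is capped at n, otherwise at 1 / churn_rate.
    return n if N * churn_rate <= N / n else 1 / churn_rate
```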

Cross validation

5x2-fold cross validation, as recommended by [33], has been regularly applied in academic research

to evaluate the performance of churn prediction techniques [16, 28, 29].

When applying k-fold cross validation, the dataset is ’folded’ k times, meaning that the data is

randomly distributed in k subsets. These k subsets will alternatively serve as a training and test

set. The training set is used to fit the model and the test set to evaluate the model’s performance.

If repeated cross validation is performed this complete process will be repeated a number of times.

Each of the times the observations in the dataset will be randomly distributed to k different subsets

again.

Consequently, 5x 2-fold cross validation means that the dataset is split up in 2 folds, a total of

5 times. Although this is a good approach, it is not suitable to define the optimal parameters

of a model. Since we want to perform grid search in order to define those, we need an extra

fold: a validation set. Including a validation set in our approach will enable us to evaluate the

performance of a model for different combinations of parameters. A model will be trained on the training


set and validated on the validation set to choose the optimal parameters. The final model should

always be tested on unseen data. Therefore, the model with optimal parameters is tested on the

test data.

In summary, we will implement 5x 3-fold cross validation in this study. The complete dataset is

split up into 3 folds by applying stratified random sampling in order to maintain the original class

distributions. Each of these subsets will alternatively serve as a training, validation and test set.

Since we repeat the cross validation 5 times and there are 6 different combinations of training,

validation and test sets, this leaves us with 30 resamples.

Evaluation measures will be calculated of each resample based on the performance on the test set.

The aggregated result over all resamples serves as a robust measure of model performance, because

it is less susceptible to the randomness of splitting the data.
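The resampling scheme (a stratified split into 3 folds, the 6 role permutations per split, repeated 5 times) can be sketched as:

```python
import random
from itertools import permutations

def stratified_folds(labels, k, rng):
    # Deal the indices of each class round-robin over k folds, so that the
    # original class distribution is approximately maintained in every fold.
    folds = [[] for _ in range(k)]
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

def resamples_5x3(labels, repeats=5, seed=1):
    rng = random.Random(seed)
    out = []
    for _ in range(repeats):
        folds = stratified_folds(labels, 3, rng)
        # Each repeat yields 6 (train, validation, test) role assignments.
        for tr, va, te in permutations(range(3)):
            out.append((folds[tr], folds[va], folds[te]))
    return out
```

With 5 repeats and 6 role permutations per repeat, this yields the 30 resamples described above.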

Statistical tests

In order to statistically compare the performance of the algorithms over the resamples, we will make

use of two non-parametric tests, the Friedman and Wilcoxon signed-rank test. The Friedman test is

recommended for the comparison of multiple models by [30]. The null hypothesis states that there

is no difference in performance between models. If the test shows that the null hypothesis can be

rejected based on a specified significance level, this will imply that differences can be found. To see

where these differences lie exactly, a post-hoc analysis to perform pairwise comparisons is needed.

We will use the Wilcoxon signed-rank test for post-hoc testing, as recommended by [7].
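For reference, the Friedman test statistic itself is straightforward to compute from per-resample ranks (a sketch; in practice a statistics package would also supply the p-value):

```python
def friedman_statistic(scores):
    # scores[i][j]: performance of model j on resample i (higher is better).
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        # Rank the models within one resample (average ranks for ties).
        order = sorted(range(k), key=lambda j: row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1
            for m in range(i, j + 1):
                ranks[order[m]] = avg
            i = j + 1
        for j2, r in enumerate(ranks):
            rank_sums[j2] += r
    # Chi-square distributed with k - 1 degrees of freedom under H0.
    return 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
```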

The results of the cross validation are used to compare the predictive performance of the prediction

models based on the aforementioned model evaluation criteria. Furthermore, the importance of the

variables will be analysed for the different models.

Performance evaluation

Predictive power In Table 8, the performances in terms of accuracy, AUC, sensitivity, specificity,

F-measure and top decile lift of the different classification techniques can be found. These are the

median values over the different resamples. The average values and standard deviations can be

found in Table 19 in Attachment A.

First, a Friedman test is performed for each evaluation measure to check whether the medians

are equal for all models. The resulting p-values of the Friedman test are given in Table 9. If we

set the significance level at 5%, we can conclude that there are significant differences in model

performances between resamples for all evaluation measures. Next, the Wilcoxon signed-rank test

is used to indicate between which models the differences in performances lie. Since we have 8

models, 28 pairwise comparisons are needed for each evaluation measure. The resulting p-values

for all evaluation measures of the Wilcoxon signed-rank tests can be found in Table 20 - 25 in

Table 8: Median performances of the classification techniques over all resamples

         | ACC    | AUC    | Sens   | Spec   | F1     | Lift
LR       | 0.8389 | 0.9331 | 0.3790 | 0.9936 | 0.5422 | 3.7830
DT       | 0.8916 | 0.8973 | 0.7253 | 0.9527 | 0.7581 | 3.3940
NB       | 0.8269 | 0.8973 | 0.3552 | 0.9856 | 0.5081 | 3.5550
RF       | 0.8386 | 0.9430 | 0.3808 | 0.9924 | 0.5429 | 3.7520
NN       | 0.8380 | 0.9401 | 0.3772 | 0.9930 | 0.5396 | 3.7645
SVM      | 0.8389 | 0.9362 | 0.3790 | 0.9936 | 0.5422 | 3.7820
Bagging  | 0.8497 | 0.9331 | 0.4410 | 0.9876 | 0.5957 | 3.7105
Boosting | 0.8389 | 0.9448 | 0.3790 | 0.9936 | 0.5422 | 3.7830

Table 9: p-values of the Friedman test per evaluation measure

ACC    | AUC    | Sens   | Spec   | F1     | Lift
<0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001

Attachment B. The null hypothesis of the Wilcoxon signed-rank test states that there is no difference between the medians of the performances of both models. Since we want to test 28 hypotheses, the significance level has to be adjusted. Applying the Bonferroni correction results in a significance level of 5%/28 = 0.18%. P-values that lie above this threshold, and consequently imply that there is no significant difference between two models, are listed in bold in the tables in Attachment B.

The best performance for each evaluation measure is underlined in Table 8. Values that do not

differ from the top performance at a 0.18% significance level, based on the Wilcoxon signed-rank

test, are listed in bold. The table shows that, in terms of ACC, none of the models differ significantly from the DT. For all other measures, the top performance differs significantly from the measures of the other models.

The DT model achieves the best performance in terms of accuracy, sensitivity and F-measure.

However, we should mention that this is due to the non-continuous distribution of the probabilities.

This results in overoptimistic values for the top 10% accuracy. Since the cutoff for sensitivity,

specificity and F-measure is put equal to that of accuracy, this may have resulted in misleading

values for these measures as well. The accuracy of the DT does not differ significantly from the other

models, although the median is much higher. This is caused by the high variance in performance

of the DT as can be seen in Table 19. Bagging faces the same problem concerning cutoff values

and therefore leads to misleading values for accuracy, sensitivity, specificity and F-measure as well.

However, the best results in terms of specificity are not reported by the DT or Bagging. LR, the

SVM and Boosting significantly outperform the other techniques in specificity.

As discussed before, accuracy is the least suitable measure to assess the predictive power of a

classification model. Although sensitivity, specificity and F-measure are more adequate performance

measures, they still depend on a probability cutoff. Therefore, we will accord more importance to

AUC and top decile lift. When evaluating AUC, we observe that boosting significantly outperforms

all other models. RF and the NN show highly competitive values for AUC as well. The DT and

NB are the least performant in terms of AUC. The best results of top decile lift are reported by

Boosting and LR. Although the SVM has a lower median lift value, there can not be found a

significant difference with Boosting and LR at a 0.18% significance level. The DT and NB are

the least suitable models when considering AUC and lift. What is remarkable as well is that there

cannot be found any significant difference between RF and the NN for any of the evaluation metrics

except for AUC. We can conclude that these prediction techniques have quite similar performances.

In Figure 1, the aggregated ROC-curves over all resamples of the different classification techniques

are drawn. A random classifier would result in the gray line that passes through the origin. We

conclude that all classifiers perform better than a random classifier would, since they are situated

at the left of that line. No clear difference can be distinguished between most classifiers in the

graph. However, we perceive that the ROC-curves of both NB and the DT model are situated

further to the right from the others. This relates to the fact that both models are outperformed in

terms of AUC.

To summarize, Stochastic Gradient Boosting is the best performing method when taking into

account AUC and lift jointly. LR and SVMs are able to compete with the Boosting technique in

terms of lift measure. NNs and RF are worth mentioning as well since they both achieve highly

competitive performances in AUC and lift measure. We would not recommend NB and DTs for

B2B churn prediction since they are the least suitable methods when considering AUC and lift.

Furthermore, we conclude that Bagging is able to improve the performance of Decision Trees

significantly.

Comprehensibility Whether the models will actually be accepted by end-users depends on their intuitiveness. Models that are difficult to understand are less likely to convince managers to implement them. Therefore, assessing a model's comprehensibility should not be neglected [113]. Some models improve predictive accuracy at the expense of understandability. LR and DTs are generally considered comprehensible models, and LR achieves a competitive performance in our analysis. RF, NNs and SVMs are viewed as black-box models and are shown to increase performance only in terms of AUC when compared to LR. Ensemble methods, like Bagging and Boosting, are difficult to understand as well. Even though both techniques use Decision Trees as base classifiers, combining many trees into one model obstructs their interpretability. While Bagging is clearly outperformed by LR, Boosting does report a higher AUC than LR. To summarize, a trade-off should be made between performance and understandability. LR shows a good combination of both, but there is still room for improvement in terms of AUC.
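To illustrate why LR counts as comprehensible: each fitted coefficient translates directly into an odds ratio that can be explained to a manager. The sketch below uses invented coefficients; the feature names merely echo the thesis's variables and the numbers are not fitted values:

```python
import math

# hypothetical, illustrative LR coefficients (log-odds of churning)
coefficients = {
    "Sales.Inv_Dt_rec": 1.20,        # recency: a longer purchase gap raises churn odds
    "Sales.salesTotal_freq": -0.45,  # frequency: frequent buyers churn less
    "intercept": -2.00,
}

# each coefficient maps to an odds ratio, which is what makes LR easy
# to communicate: exp(1.20) ~ 3.3 means one unit more on the recency
# feature multiplies the churn odds by about 3.3, all else equal
odds_ratios = {k: math.exp(v) for k, v in coefficients.items() if k != "intercept"}

def churn_probability(x: dict) -> float:
    """Logistic response for a customer's (standardised) feature values."""
    z = coefficients["intercept"] + sum(coefficients[k] * x[k] for k in x)
    return 1 / (1 + math.exp(-z))
```

Black-box models such as RF, NNs and SVMs offer no comparably direct per-variable reading, which is exactly the trade-off discussed above.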

                     LR        DT        NB        RF        NN        SVM       Bagging   Boosting
Absolute (hh:mm:ss)  00:00:23  00:00:29  00:09:07  00:11:25  00:07:30  01:01:59  00:07:34  00:15:15
Relative (to LR)     1         1.26      24.02     30.10     19.77     163.45    19.97     40.22

Table 10: Computation time per prediction technique

Computation time In Table 10 the computation times for all prediction models are listed. Both the absolute values and the computation times relative to LR are given for ease of comparison. Logistic Regression and Decision Trees require the least time. Support Vector Machines, by contrast, require 163.45 times the computation time of LR. Naïve Bayes, Random Forests, Neural Networks, Bagging and Boosting require more moderate computation times. Depending on the size of a company's customer base, using computationally more expensive methods may or may not be reasonable. Moreover, the performance of the considered techniques needs to be taken into account when evaluating their computation time. If a model requires significantly more computation time without paying off in increased performance, there is no reason to consider it. We clearly observe this for SVMs, where a significantly higher computation time does not lead to an increase in predictive power.
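The relative column of a table like Table 10 can be derived from the absolute timings. A small sketch (only three of the models shown; the SVM value comes out slightly below the reported 163.45 because the absolute times are rounded to whole seconds):

```python
def to_seconds(hms: str) -> int:
    """Parse an hh:mm:ss string into a number of seconds."""
    h, m, s = map(int, hms.split(":"))
    return 3600 * h + 60 * m + s

# absolute computation times as reported in Table 10
absolute = {"LR": "00:00:23", "DT": "00:00:29", "SVM": "01:01:59"}

# express every timing relative to the fastest model, LR
base = to_seconds(absolute["LR"])
relative = {k: round(to_seconds(v) / base, 2) for k, v in absolute.items()}
print(relative)  # DT comes out at 1.26, matching the table
```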

Variable importance

Apart from the algorithm, the predictors used to construct a churn prediction model influence the

quality of that model as well. The results in terms of variable importance for each of the prediction

models are listed in Tables 11 - 18. The values in these tables are the averages and the standard

deviations of the variable importances over the different resamples. The values are scaled from 0

to 100. In this way, we are able to conclude which variables are important from a general and

model-specific viewpoint.
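The 0-100 scaling mentioned here is a simple min-max rescaling of the raw importance scores, so the least important variable maps to 0 and the most important to 100. A sketch with invented raw values:

```python
def scale_0_100(importances: dict) -> dict:
    """Min-max scale raw importance scores to the 0-100 range used in
    Tables 11-18."""
    lo, hi = min(importances.values()), max(importances.values())
    return {k: round(100 * (v - lo) / (hi - lo), 2) for k, v in importances.items()}

# hypothetical raw scores from a single resample (not values from the thesis)
raw = {
    "Sales.Inv_Dt_rec": 0.84,
    "Sales.salesTotal_freq": 0.31,
    "Sales.PROMO_QTY_mean": 0.02,
}
print(scale_0_100(raw))
# {'Sales.Inv_Dt_rec': 100.0, 'Sales.salesTotal_freq': 35.37, 'Sales.PROMO_QTY_mean': 0.0}
```

Because the scaling is applied per model, a value of 100 only means "most important for this model"; scores are comparable in rank, not in absolute magnitude, across techniques.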

In line with a former B2B study [37], the recency category scores best among all chosen predictors. The recency variable Sales.Inv Dt rec reaches an average variable importance of 100

and a standard deviation of 0 for all models. This is clearly the variable that contributes most

to the accurate prediction of future churners. When we consider the average importance of other

variables, we see that they are situated in a significantly lower range than the variable importance

of Sales.Inv Dt rec.

A second important observation is that other variables show extremely varying results in importance. For example, the importance of frequency variable Sales.salesTotal freq differs significantly between models. This variable is the best representation of the frequency category and is extremely important for DTs, RF, NNs, Bagging and Boosting. On the other hand, it has the worst

predictive ability in NB and SVMs. This is rather remarkable. A similar observation can be made

for length variable Sales.Inv Dt dura, which shows a high importance for LR, RF and Boosting.

However, the variable is only moderately important for the other models. Monetary variables rank higher in variable importance in NB and SVMs than in the other models, whereas DTs accord almost no importance to monetary indicators at all.

However, Sales.PROMO QTY mean, which is classified as a frequency variable, does display consistent results in variable importance: it is generally accorded a very low importance. Considering the summary statistics in Table 5, this can be explained by its low variance.

We also remark that for certain models only a few predictors contribute significantly to the predictive performance. When analysing the variable importance of DTs and especially Boosting, we assume that leaving out the least important variables would not significantly influence their predictive performance, since many importance values are negligible.

We conclude that the importance of the types of predictors differs considerably between prediction

techniques. Consequently, this may imply that the predictive power of a technique can be dependent

on the variables used to construct it.

Variable                       Mean     SD
Sales.Inv Dt rec             100.00   0.00
Sales.Inv Dt dura             18.60   1.79
Sales.STD ADJ FCT mean         9.42   2.73
Sales.salesTotal freq          7.38   1.64
Equipment.Models freq          5.41   2.05
Sales.COST OF GOODS mean       4.87   2.25
Sales.PKG CONV192 mean         4.86   1.31
Sales.PROMO QTY mean           3.95   1.37
Equipment.Install Date rec     3.93   2.37
Sales.Whlsl Price Xtnd mean    2.44   1.76
Sales.Qty mean                 1.81   1.23
Sales.WHLSL UNIT PRICE mean    1.72   1.49
Equipment.Install Date dura    1.02   1.03

Table 11: Variable importance LR

Variable                       Mean     SD
Sales.Inv Dt rec             100.00   0.00
Sales.salesTotal freq         32.21   2.01
Equipment.Install Date rec    13.74   1.77
Equipment.Install Date dura   13.60   1.57
Equipment.Models freq         12.43   1.14
Sales.Inv Dt dura              3.29   1.38
Sales.PKG CONV192 mean         1.69   1.19
Sales.COST OF GOODS mean       1.02   1.03
Sales.STD ADJ FCT mean         0.58   0.73
Sales.WHLSL UNIT PRICE mean    0.53   0.53
Sales.Qty mean                 0.34   0.32
Sales.PROMO QTY mean           0.20   0.33
Sales.Whlsl Price Xtnd mean    0.17   0.22

Table 12: Variable importance DT

Variable                       Mean     SD
Sales.Inv Dt rec             100.00   0.00
Sales.WHLSL UNIT PRICE mean   42.09   0.00
Sales.PKG CONV192 mean        41.79   0.00
Sales.STD ADJ FCT mean        40.86   0.00
Sales.COST OF GOODS mean      40.00   0.00
Sales.Whlsl Price Xtnd mean   35.96   0.00
Sales.Qty mean                34.78   0.00
Sales.PROMO QTY mean          33.73   0.00
Sales.Inv Dt dura             30.01   0.00
Equipment.Install Date rec    17.07   0.00
Equipment.Models freq         16.37   0.00
Equipment.Install Date dura   15.86   0.00
Sales.salesTotal freq          0.00   0.00

Table 13: Variable importance NB

Variable                       Mean     SD
Sales.Inv Dt rec             100.00   0.00
Sales.salesTotal freq         20.25   2.08
Sales.Inv Dt dura              8.39   0.76
Sales.PKG CONV192 mean         6.41   0.83
Sales.Whlsl Price Xtnd mean    5.90   0.66
Sales.Qty mean                 5.47   0.47
Sales.COST OF GOODS mean       5.32   0.80
Sales.WHLSL UNIT PRICE mean    5.27   0.83
Equipment.Install Date rec     4.42   0.81
Equipment.Install Date dura    4.10   0.85
Sales.STD ADJ FCT mean         4.01   0.55
Equipment.Models freq          1.35   0.62
Sales.PROMO QTY mean           0.00   0.00

Table 14: Variable importance RF

Variable                       Mean     SD
Sales.Inv Dt rec             100.00   0.00
Sales.salesTotal freq         28.44  14.21
Sales.COST OF GOODS mean      18.78  12.32
Sales.STD ADJ FCT mean        14.65   9.11
Sales.WHLSL UNIT PRICE mean   13.27  10.06
Sales.PKG CONV192 mean        10.20   7.73
Sales.Inv Dt dura              8.90   6.17
Equipment.Models freq          8.66   8.35
Sales.Qty mean                 7.60  10.81
Sales.Whlsl Price Xtnd mean    7.24   7.42
Equipment.Install Date dura    7.14   5.86
Equipment.Install Date rec     6.32   4.72
Sales.PROMO QTY mean           3.10   5.02

Table 15: Variable importance NN

Variable                       Mean     SD
Sales.Inv Dt rec             100.00   0
Sales.WHLSL UNIT PRICE mean   42.09   0
Sales.PKG CONV192 mean        41.79   0
Sales.STD ADJ FCT mean        40.86   0
Sales.COST OF GOODS mean      40.00   0
Sales.Whlsl Price Xtnd mean   35.96   0
Sales.Qty mean                34.78   0
Sales.PROMO QTY mean          33.73   0
Sales.Inv Dt dura             30.01   0
Equipment.Install Date rec    17.07   0
Equipment.Models freq         16.37   0
Equipment.Install Date dura   15.86   0
Sales.salesTotal freq          0.00   0

Table 16: Variable importance SVM

Variable                       Mean     SD
Sales.Inv Dt rec             100.00   0.00
Sales.salesTotal freq         45.30   1.18
Equipment.Install Date rec    19.13   0.89
Sales.PKG CONV192 mean        18.35   0.60
Equipment.Install Date dura   18.15   0.91
Sales.Inv Dt dura             16.05   0.90
Sales.Whlsl Price Xtnd mean   15.59   0.69
Sales.Qty mean                15.16   0.73
Sales.WHLSL UNIT PRICE mean   14.61   0.81
Sales.COST OF GOODS mean      14.31   0.94
Sales.STD ADJ FCT mean        13.63   0.76
Equipment.Models freq         12.05   0.78
Sales.PROMO QTY mean           0.00   0.00

Table 17: Variable importance Bagging

Variable                       Mean     SD
Sales.Inv Dt rec             100      0
Sales.salesTotal freq          4.83   0.56
Sales.Inv Dt dura              1.42   0.35
Equipment.Install Date dura    1.34   0.24
Sales.PKG CONV192 mean         0.85   0.25
Equipment.Install Date rec     0.69   0.29
Sales.COST OF GOODS mean       0.48   0.23
Sales.Qty mean                 0.28   0.18
Sales.Whlsl Price Xtnd mean    0.26   0.19
Sales.WHLSL UNIT PRICE mean    0.25   0.17
Sales.STD ADJ FCT mean         0.17   0.14
Sales.PROMO QTY mean           0.05   0.06
Equipment.Models freq          0.02   0.05

Table 18: Variable importance Boosting

Discussion

In past literature, LR and DTs have been considered excellent benchmarking techniques. In some studies, LR was even able to outperform more complex techniques. Our results likewise show that LR was able to outperform RF and a NN in top decile lift. When evaluating AUC, more complex techniques like RF, NN, SVM and Boosting outperform LR. However, the DT does not show the same predictive performance as LR. The DT model finds itself amongst the worst performing techniques in terms of AUC and lift. The popularity of DTs in former B2C studies made us expect the opposite.

Similarly, NB cannot compete with the performances of the other models either. Only the DT reports lower measures for specificity and lift. Based on former research, we expected NB to achieve a result similar to DTs and LR. This certainly holds in the case of the DT model, but LR outperforms NB significantly.

In former studies, Boosting was not able to outperform more complex methods. Our findings suggest the contrary: in this study, Boosting outperforms RF, a NN and a SVM in AUC and lift. We do remark that our study shows very similar results to [105], where Boosting slightly outperforms LR and the DT model is the least performant. As suggested by former research, Bagging is able to significantly increase the performance of DTs. Nevertheless, Bagging is outperformed by the other techniques.

In the reviewed literature, RF generally reported a high predictive performance. Our findings suggest a similar tendency. RF and the NN report performance measures that are not significantly different from each other; RF only outperforms the NN when evaluating AUC. In [14], no significant differences were found between these models either.

When considering variable importance, the observed importance of recency and frequency in this study corresponds with the results reported in other literature. We do, however, note that frequency is not equally important for all applied prediction models. In past literature on churn prediction in FMCG industries, monetary indicators turned out to be relatively insignificant; we observe this in our study as well. The length of the relationship was an important predictor in all studies that incorporated it, but our results suggest that its importance varies across prediction techniques.

5 Conclusion

In this paper, we focus on churn prediction modelling in a B2B sector. Research on B2B churn prediction is rather limited, as shown in an extended overview of churn prediction techniques applied in past literature. To address this gap, we perform a benchmarking study of churn prediction techniques in a B2B context. The predictive power of Logistic Regression, Decision Trees, Naïve Bayes, Random Forest, Neural Networks, Support Vector Machines, Bagging and Boosting is evaluated on an FMCG dataset. To evaluate the performance of the techniques, accuracy, AUC, sensitivity, specificity, F-measure and top decile lift are calculated.


Based on our findings, we would recommend the use of Stochastic Gradient Boosting. This technique gives the best results in terms of top decile lift and AUC. However, if computation time and comprehensibility are taken into account as well, we want to draw attention to LR. The power of LR lies in its combination of a highly competitive performance, intuitiveness and low computation time.

When considering variable importance, our analysis identifies recency variables as the most important for every prediction technique. Frequency variables are generally shown to be important as well, but significantly less so than recency. Furthermore, our findings suggest that the importance of certain categories of variables may vary depending on the applied prediction technique. We observe this, for example, for the monetary indicators, whose importance varies over the different models.

To summarize, the contribution of this study is twofold: (1) an analysis of classification techniques

that have been formerly used in B2B and B2C churn prediction is presented; (2) we evaluate the

performance of the most commonly used churn prediction techniques in a B2B setting.

A first limitation is that we only include a small number of predictors in our analysis. Since we wanted our results to be generalisable to other B2B companies, the number of predictors was kept to a minimum. An interesting finding of our study is that the importance of the different categories of variables depends on the prediction technique used. Further studies may examine the importance of other variable categories for different prediction techniques.

Furthermore, we only included commonly used prediction techniques in our empirical analysis. A possibility for future research is to explore the performance of less well-known techniques.

Another limitation is that the results of our analysis may not be applicable to B2B companies outside the FMCG industry. We do, however, assume a certain generalisability to companies in a non-contractual environment. Future studies may improve the generalisability of our conclusions by extending the analysis to other B2B industries.

Lastly, we should mention that the outcome of our research is only relevant for a company if it is actually willing to undertake actions to prevent churn. No decrease in churn rate will be realised by predicting future churners alone. B2B companies should offer incentives to the customers with the highest probabilities to churn in order to dissuade them from doing so. This will lead to reduced churn rates and increased profits, which will ultimately show the real value of churn prediction.
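Operationally, targeting the customers with the highest churn probabilities is a ranking exercise over the model's scores. A minimal sketch with hypothetical customer IDs and predicted probabilities:

```python
def retention_targets(churn_scores: dict, fraction: float = 0.1) -> list:
    """Select the fraction of customers with the highest predicted churn
    probability as targets for a retention campaign."""
    ranked = sorted(churn_scores, key=churn_scores.get, reverse=True)
    n = max(1, int(len(ranked) * fraction))
    return ranked[:n]

# hypothetical predicted churn probabilities for four customers
scores = {"cust_a": 0.91, "cust_b": 0.15, "cust_c": 0.72, "cust_d": 0.05}
print(retention_targets(scores, fraction=0.5))  # ['cust_a', 'cust_c']
```

The default fraction of 0.1 mirrors the top decile used in the lift measure; in practice the cut-off would be chosen by weighing incentive costs against the expected value of retained customers.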

References

[1] Ahn, J.-H., Han, S.-P., and Lee, Y.-S. (2006). Customer churn analysis: Churn determinants and mediation effects of partial defection in the Korean mobile telecommunications service industry. Telecommunications Policy, 30(10–11):552–568.
[2] Amin, A., Shehzad, S., Khan, C., Ali, I., and Anwar, S. (2015). Churn prediction in telecommunication industry using Rough Set Approach.
[3] Archaux, C., Martin, A., and Khenchaf, A. (2004). An SVM based churn detector in prepaid mobile telephony. 2004 International Conference on Information and Communication Technologies: From Theory to Applications, Proceedings, pages 459–460.
[4] Au, T., Ma, G., and Li, S. (2003a). Applying and Evaluating Models to Predict Customer Attrition Using Data Mining Techniques. Journal of Comparative International Management, 6(1).
[5] Au, W.-H., Chan, K. C., and Yao, X. (2003b). A novel evolutionary data mining algorithm with applications to churn prediction.
[6] Ballings, M. and Van den Poel, D. (2012). Customer event history for churn prediction: How long is long enough? Expert Systems with Applications, 39(18):13517–13522.
[7] Benavoli, A., Corani, G., and Mangili, F. (2016). Should We Really Use Post-hoc Tests Based on Mean-ranks? J. Mach. Learn. Res., 17(1):152–161.
[8] Bhattacharya, C. B. (1998). When customers are members: Customer retention in paid membership contexts. Journal of the Academy of Marketing Science, 26(1):31.
[9] Bin, L., Peiji, S., and Juan, L. (2007). Customer churn prediction based on the decision tree in personal handyphone system service.
[10] Blattberg, R. C., Kim, B.-D., and Neslin, S. A. (2010). Database Marketing: Analyzing and Managing Customers. Springer Science & Business Media.
[11] Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123–140.
[12] Breiman, L. (2001). Random Forests. Machine Learning, 45(1):5–32.
[13] Buckinx, W., Baesens, B., Van den Poel, D., Van Kenhove, P., and Vanthienen, J. (2010). Using machine learning techniques to predict defection of top clients.
[14] Buckinx, W. and Van den Poel, D. (2005). Customer base analysis: Partial defection of behaviourally loyal clients in a non-contractual FMCG retail setting. European Journal of Operational Research, 164(1):252–268.
[15] Burez, J. and Van den Poel, D. (2007). CRM at a pay-TV company: Using analytical models to reduce customer attrition by targeted marketing for subscription services. Expert Systems with Applications, 32(2):277–288.
[16] Burez, J. and Van den Poel, D. (2009). Handling class imbalance in customer churn prediction.
[17] Chaudhury, A. and Kuilboer, J.-P. (2001). E-Business and E-Commerce Infrastructure: Technologies Supporting the E-Business Initiative.
[18] Chen, K., Hu, Y.-H., and Hsieh, Y.-C. (2014). Predicting customer churn from valuable B2B customers in the logistics industry: A case study. Information Systems and e-Business Management, 13(3):475–494.
[19] Chen, Z.-Y., Fan, Z.-P., and Sun, M. (2012). A hierarchical multiple kernel support vector machine for customer churn prediction using longitudinal behavioral data. European Journal of Operational Research, 223(2):461–472.
[20] Chiang, D.-A., Wang, Y.-F., Lee, S.-L., and Lin, C.-J. (2003). Goal-oriented sequential pattern for network banking churn analysis. Expert Systems with Applications, 25(3):293–302.
[21] Chu, B.-H., Tsai, M.-S., and Ho, C.-S. (2007). Toward a hybrid data mining model for customer retention. Knowledge-Based Systems, 20(8):703–718.
[22] Coussement, K., Benoit, D. F., and Van den Poel, D. (2010). Improved marketing decision making in a customer churn prediction context using generalized additive models. Expert Systems with Applications, 37(3):2132–2143.
[23] Coussement, K. and De Bock, K. W. (2013). Customer churn prediction in the online gambling industry: The beneficial effect of ensemble learning. Journal of Business Research, 66(9):1629–1636.
[24] Coussement, K. and Van den Poel, D. (2008a). Churn prediction in subscription services: An application of support vector machines while comparing two parameter-selection techniques. Expert Systems with Applications, 34(1):313–327.
[25] Coussement, K. and Van den Poel, D. (2008b). Integrating the voice of customers through call center emails into a decision support system for churn prediction. Information & Management, 45(3):164–174.
[26] Datta, P., Masand, B., Mani, D., and Li, B. (2001). Automated cellular modeling and prediction on a large scale.
[27] Ford, D. (1980). The Development of Buyer-Seller Relationships in Industrial Markets. European Journal of Marketing, 14(5/6):339–353.
[28] De Bock, K. W. and Van den Poel, D. (2011). An empirical evaluation of rotation-based ensemble classifiers for customer churn prediction. Expert Systems with Applications, 38(10):12293–12301.
[29] De Bock, K. W. and Van den Poel, D. (2012). Reconciling performance and interpretability in customer churn prediction using ensemble learning based on generalized additive models. Expert Systems with Applications, 39(8):6816–6826.
[30] Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res., 7:1–30.
[31] D'Haen, J. and Van den Poel, D. (2013). Model-supported business-to-business prospect prediction based on an iterative customer acquisition framework. Industrial Marketing Management, 42(4):544–551.
[32] D'Haen, J., Van den Poel, D., and Thorleuchter, D. (2013). Predicting customer profitability during acquisition: Finding the optimal combination of data source and data mining technique. Expert Systems with Applications, 40(6):2007–2012.
[33] Dietterich, T. G. (1998). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Comput., 10(7):1895–1923.
[34] Eiben, A., Koudijs, A., and Slisser, F. (2006). Genetic modeling of customer retention.
[35] Farquad, M. A. H., Ravi, V., and Raju, S. B. (2014). Churn prediction using comprehensible support vector machine: An analytical CRM application. Applied Soft Computing, 19:31–40.
[36] Glady, N., Baesens, B., and Croux, C. (2009). Modeling churn using customer lifetime value. European Journal of Operational Research, 197(1):402–411.
[37] Gordini, N. and Veglio, V. (2016). Customers churn prediction and marketing retention strategies. An application of support vector machines based on the AUC parameter-selection technique in B2B e-commerce industry. Industrial Marketing Management.
[38] Hadden, J., Tiwari, A., Roy, R., and Ruta, D. (2006). Churn prediction using complaints data.
[39] Hopmann, J. and Thede, A. (2003). Applicability of customer churn forecasts in a non-contractual setting.
[40] Hosseini, S. M. S., Maleki, A., and Gholamian, M. R. (2010). Cluster analysis using data mining approach to develop CRM methodology to assess the customer loyalty. Expert Systems with Applications, 37(7):5259–5264.
[41] Hu, X. (2005). A data mining approach for retailing bank customer attrition analysis.
[42] Huang, B., Kechadi, M. T., and Buckley, B. (2012). Customer churn prediction in telecommunications. Expert Systems with Applications, 39(1):1414–1425.
[43] Huang, B. Q., Kechadi, T. M., Buckley, B., Kiernan, G., Keogh, E., and Rashid, T. (2010). A new feature set with new window techniques for customer churn prediction in land-line telecommunications. Expert Systems with Applications, 37(5):3657–3665.
[44] Huang, Y. and Kechadi, T. (2013). An effective hybrid learning system for telecommunication churn prediction. Expert Systems with Applications, 40(14):5635–5647.
[45] Hung, S.-Y., Yen, D. C., and Wang, H.-Y. (2006). Applying data mining to telecom churn management. Expert Systems with Applications, 31(3):515–524.
[46] Hur, Y. and Lim, S. (2005). Customer churning prediction using Support Vector Machines in online auto insurance service.
[47] Hwang, H., Jung, T., and Suh, E. (2004). An LTV model and customer segmentation based on customer value: A case study on the wireless telecommunication industry. Expert Systems with Applications, 26(2):181–188.
[48] Idris, A., Rizwan, M., and Khan, A. (2012). Churn prediction in telecom using Random Forest and PSO based data balancing in combination with various feature selection strategies. Computers & Electrical Engineering, 38(6):1808–1819.
[49] Jadhav, R. and Pawar, U. (2011). Churn prediction in telecommunication using data mining technology.
[50] Jing, Z. and Xing-hua, D. (2008). Bank customer churn prediction based on support vector machine: Taking a commercial bank's VIP customer churn as the example.
[51] Keller, K. and Webster, F. (2004). A roadmap for branding in industrial markets.
[52] Keramati, A. and Ardabili, S. M. S. (2011). Churn analysis for an Iranian mobile operator. Telecommunications Policy, 35(4):344–356.
[53] Keramati, A., Jafari-Marandi, R., Aliannejadi, M., Ahmadian, I., Mozaffari, M., and Abbasi, U. (2014). Improved churn prediction in telecommunication industry using data mining techniques. Applied Soft Computing, 24:994–1012.
[54] Khan, A. A., Jamwal, S., and Sepehri, M. (2010). Applying data mining to customer churn prediction in an Internet Service Provider.
[55] Kim, H.-S. and Yoon, C.-H. (2004). Determinants of subscriber churn and customer loyalty in the Korean mobile telephony market. Telecommunications Policy, 28(9–10):751–765.
[56] Kim, J., Suh, E., and Hwang, H. (2003). A model for evaluating the effectiveness of CRM using the balanced scorecard. Journal of Interactive Marketing, 17(2):5–19.
[57] Kim, K. and Lee, J. (2012). Sequential manifold learning for efficient churn prediction. Expert Systems with Applications, 39(18):13328–13337.
[58] Kim, S., Shin, K.-s., and Park, K. (2005). An Application of Support Vector Machines for Customer Churn Analysis: Credit Card Case. Advances in Natural Computation, pages 636–647.
[59] Kim, S.-Y., Jung, T.-S., Suh, E.-H., and Hwang, H.-S. (2006). Customer segmentation and strategy development based on customer lifetime value: A case study. Expert Systems with Applications, 31(1):101–107.
[60] Kirui, C., Hong, L., Cheruiyot, W., and Kirui, H. (2013). Predicting customer churn in mobile telephony industry using probabilistic classifiers in data mining.
[61] Kisioglu, P. and Topcu, Y. I. (2011). Applying Bayesian Belief Network approach to customer churn analysis: A case study on the telecom industry of Turkey. Expert Systems with Applications, 38(6):7151–7157.
[62] Kotsiantis, S. B. (2007). Supervised Machine Learning: A Review of Classification Techniques. Proceedings of the 2007 Conference on Emerging Artificial Intelligence Applications in Computer Engineering: Real Word AI Systems with Applications in eHealth, HCI, Information Retrieval and Pervasive Technologies, pages 3–24.
[63] Kumar, D. A. and Ravi, V. (2008). Predicting credit card customer churn in banks using data mining.
[64] Larivière, B. and Van den Poel, D. (2005). Predicting customer retention and profitability by using random forests and regression forests techniques. Expert Systems with Applications, 29(2):472–484.
[65] Leahy, R. (2011). Relationships in fast moving consumer goods markets: The consumers' perspective. European Journal of Marketing, 45(4):651–672.
[66] Lee, H., Lee, Y., Cho, H., Im, K., and Kim, Y. S. (2011). Mining churning behaviors and developing retention strategies based on a partial least squares (PLS) model. Decision Support Systems, 52(1):207–216.
[67] Lee, J. S. and Lee, J. C. (2006). Customer churn prediction by hybrid model.
[68] Lejeune, M. (2011). Measuring the impact of Data Mining on Churn Management.
[69] Lemmens, A. and Croux, C. (2006). Bagging and boosting classification trees to predict churn.
[70] Lessmann, S. and Voß, S. (2009). A reference model for customer-centric data mining with support vector machines. European Journal of Operational Research, 199(2):520–530.
[71] Lilien, G. L. (2016). The B2B Knowledge Gap. International Journal of Research in Marketing, 33(3):543–556.
[72] Lin, C.-S., Tzeng, G.-H., and Chin, Y.-C. (2011). Combined rough set theory and flow network graph to predict customer churn in credit card accounts. Expert Systems with Applications, 38(1):8–15.
[73] Lu, J. (2002). Predicting customer churn in the telecommunications industry – An application of survival analysis modeling using SAS.
[74] Lu, N., Lin, H., Lu, J., and Zhang, G. (2014). A customer churn prediction model in telecom industry using boosting.
[75] Paulin, M., Perrien, J., Ferguson, R. J., Alvarez Salazar, A. M., and Seruya, L. M. (1998). Relational norms and client retention: External effectiveness of commercial banking in Canada and Mexico. International Journal of Bank Marketing, 16(1):24–31.
[76] Miguéis, V. L., Camanho, A., and Falcão e Cunha, J. (2013). Customer attrition in retailing: An application of Multivariate Adaptive Regression Splines. Expert Systems with Applications, 40(16):6225–6232.
[77] Miguéis, V. L., Van den Poel, D., Camanho, A., and Falcão e Cunha, J. (2012a). Predicting partial customer churn using Markov for discrimination for modeling first purchase sequences.
[78] Miguéis, V. L., Van den Poel, D., Camanho, A. S., and Falcão e Cunha, J. (2012b). Modeling partial customer churn: On the value of first product-category purchase sequences. Expert Systems with Applications, 39(12):11250–11256.
[79] Modani, N., Dey, K., Gupta, R., and Godbole, S. (2013). CDR Analysis Based Telco Churn Prediction and Customer Behavior Insights: A Case Study.
[80] Morik, K. and Köpke, H. (2004). Analysing customer churn in insurance domain.
[81] Mozer, M. C., Wolniewicz, R., Grimes, D. B., Johnson, E., and Kaushansky, H. (1999). Churn reduction in the wireless industry.
[82] Mozer, M. C., Wolniewicz, R., Grimes, D. B., Johnson, E., and Kaushansky, H. (2000). Predicting subscriber dissatisfaction and improving retention in the wireless telecommunications industry.
[83] Mudambi, S. (2002). Branding importance in business-to-business markets: Three buyer clusters. Industrial Marketing Management, 31(6):525–533.
[84] Mutanen, T. (2006). Customer churn analysis – a case study.
[85] Nath, S. V. and Behara, R. S. (2003). Customer churn analysis in the wireless industry: A Data Mining approach.
[86] Ngai, E. W. T., Xiu, L., and Chau, D. C. K. (2009). Application of data mining techniques in customer relationship management: A literature review and classification. Expert Systems with Applications, 36(2, Part 2):2592–2602.
[87] Nie, G., Rowe, W., Zhang, L., Tian, Y., and Shi, Y. (2011). Credit card churn forecasting by logistic regression and decision tree. Expert Systems with Applications, 38(12):15273–15285.
[88] Olle, G. D. and Cai, S. (2014). A Hybrid Churn Prediction Model in Mobile Telecommunication Industry.
[89] Owczarczuk, M. (2010). Churn models for prepaid customers in the cellular telecommunication industry using large data marts. Expert Systems with Applications, 37(6):4710–4712.
[90] Oyeniyi, A. and Adeyemo, A. (2006). Customer churn analysis in banking sector using data mining techniques.
[91] Parvatiyar, A. and Sheth, J. (2000). The domain and conceptual foundations of relationship marketing.
[92] Pendharkar, P. C. (2009). Genetic algorithm based neural network approaches for predicting churn in cellular wireless network services. Expert Systems with Applications, 36(3, Part 2):6714–6720.
[93] Radosavljevik, D., Van der Putten, P., and Kyllesbech Larsen, K. (2010). The impact of experimental setup in prepaid churn prediction for mobile telecommunications: What to predict, for whom and does the customer experience matter?
[94] Reichheld, F. F. and Sasser, W. (1990). Zero defections: Quality comes to services.
[95] Risselada, H., Verhoef, P. C., and Bijmolt, T. H. A. (2010). Staying Power of Churn Prediction Models. Journal of Interactive Marketing, 24(3):198–208.
[96] Rosset, S. and Neumann, E. (2003). Integrating Customer Value Considerations into Predictive Modeling.
[97] Ruta, D., Nauck, D., and Azvine, B. (2006). K nearest sequence method and its application to churn prediction.
[98] Ryals, L. and Knox, S. (2001). Cross-functional issues in the implementation of relationship marketing through customer relationship management. European Management Journal, 19(5):534–542.
[99] Rygielski, C., Wang, J.-C., and Yen, D. C. (2002). Data mining techniques for customer relationship management. Technology in Society, 24(4):483–502.
[100] Seo, D., Ranganathan, C., and Babad, Y. (2008). Two-level model of customer retention in the US mobile telecommunications service market. Telecommunications Policy, 32(3–4):182–196.
[101] Shaaban, E., Helmy, Y., Khedr, A., and Nasr, M. (2012). A proposed churn prediction model.
[102] Smith, K., Willis, R., and Brooks, M. (2000). An analysis of customer retention and insurance claim patterns using data mining: A case study.
[103] Stevens, R. (2005). B-to-B customer retention: Seven strategies for keeping your customers.
[104] Swani, K., Brown, B. P., and Milne, G. R. (2014). Should tweets differ for B2B and B2C? An analysis of Fortune 500 companies' Twitter communications. Industrial Marketing Management, 43(5):873–881.
[105] Tamaddoni Jahromi, A., Stakhovych, S., and Ewing, M. (2014). Managing B2B customer churn, retention and profitability. Industrial Marketing Management, 43(7):1258–1268.
[106] Tsai, C.-F. and Chen, M.-Y. (2010). Variable selection by association rules for customer churn prediction of multimedia on demand. Expert Systems with Applications, 37(3):2006–2015.
[107] Tsai, C.-F. and Lu, Y.-H. (2009). Customer churn prediction by hybrid neural networks. Expert Systems with Applications, 36(10):12547–12553.
[108] Tuğba, U. and Gürsoy, Ş. (2010). Customer churn analysis in telecommunication sector.
[109] Turban, E., Sharda, R., and Delen, D. (2010). Decision Support and Business Intelligence Systems.
[110] Vafeiadis, T., Diamantaras, K. I., Sarigiannidis, G., and Chatzisavvas, K. C. (2015). A comparison of machine learning techniques for customer churn prediction. Simulation Modelling Practice and Theory, 55:1–9.
[111] Van den Poel, D. and Larivière, B. (2004). Customer attrition analysis for financial services using proportional hazard models. European Journal of Operational Research, 157(1):196–217.
[112] Verbeke, W., Dejaeger, K., Martens, D., Hur, J., and Baesens, B. (2012). New insights into churn prediction in the telecommunication sector: A profit driven data mining ap-
look ahead. Industrial Marketing Management, 42(4):470–488.
[118] Xia, G.-e. and Jin, W.-d. (2008). Model of customer churn prediction on Support Vector Machine. Systems Engineering – Theory & Practice, 28(1):71–77.
[119] Xiao, J., Xie, L., He, C., and Jiang, X. (2012). Dynamic classifier ensemble model for customer classification with imbalanced class distribution. Expert Systems with Applications, 39(3):3668–3675.
[120] Xie, Y., Li, X., Ngai, E. W. T., and Ying, W. (2009). Customer churn prediction using improved balanced random forests. Expert Systems with Applications, 36(3, Part 1):5445–5449.
[121] Xu, M. and Walton, J. (2005). Gain-

proach. European Journal of Operational Re- ing customer knowledge through analytical

search, 218(1):211–229. CRM. Industrial Management & Data Sys-

[113] Verbeke, W., Martens, D., Mues, C., and tems, 105(7):955–971.

Baesens, B. (2011). Building comprehensi- [122] Yan, L., Fassino, M., and Baldasare, P.

ble customer churn prediction models with (2005). Predicting Customer Behavior via

advanced rule induction techniques. Expert Calling Links.

Systems with Applications, 38(3):2354–2364.

[123] Yu, X., Guo, S., Guo, J., and Huang, X.

[114] Wang, G., Liu, L., Nie, G., Kou, G., and (2011). An extended support vector machine

Shi, Y. (2010). Predicting credit card holder forecasting framework for customer churn in

churn in banks of China using data mining e-commerce. Expert Systems with Applica-

and MCDM. tions, 38(3):1425–1430.

[115] Wang, Y.-F., Chiang, D.-A., Hsu, M.-H., [124] Zablah, A. R., Brown, B. P., and Don-

Lin, C.-J., and Lin, I.-L. (2009). A recom- thu, N. (2010). The Relative Importance of

mender system to avoid customer churn: A Brands in Modified Rebuy Purchase Situa-

case study. Expert Systems with Applications, tions.

36(4):8071–8075.

[125] Zhang, Y., Qi, J., Shu, H., and Cao, J.

[116] Wei, C.-P. and Chiu, I.-T. (2002). Turn- (2007). A Hybrid KNN-LR Classifier and its

ing telecommunications call details to churn Application in Customer Churn Prediction.

prediction: A data mining approach. Expert

[126] Zhang, Y., Qi, J., Shu, H., and Li, Y.

Systems with Applications, 23(2):103–112.

(2006). Case study on CRM: Detecting likely

[117] Wiersema, F. (2013). The B2B Agenda: churners with limited information of fixed-

The current state of B2B marketing and a line subscriber.

vii

[127] Zhao, Y., Li, B., Li, X., Liu, W., and Ren,

S. (2005). Customer churn prediction using

improved one-class support vector machine.

keting: Theory and Implementation. Journal

of Market-Focused Management, 5(2):83–89.

viii

Attachments

Attachment A Experimental results

LR        0.8390 (0.0015)  0.9324 (0.0055)  0.3792 (0.0031)  0.9936 (0.0010)  0.5424 (0.0044)  3.7855 (0.0312)
DT        0.8202 (0.1936)  0.8953 (0.0203)  0.6934 (0.1634)  0.8629 (0.2932)  0.6992 (0.1254)  3.4419 (0.1754)
NB        0.8261 (0.0033)  0.8967 (0.0055)  0.3537 (0.0066)  0.9851 (0.0022)  0.5059 (0.0094)  3.5314 (0.0636)
RF        0.8382 (0.0025)  0.9429 (0.0037)  0.3799 (0.0054)  0.9924 (0.0016)  0.5417 (0.0074)  3.7511 (0.0478)
NN        0.8378 (0.0019)  0.9404 (0.0045)  0.3768 (0.0038)  0.9929 (0.0013)  0.5391 (0.0055)  3.7607 (0.0378)
SVM       0.8389 (0.0015)  0.9361 (0.0045)  0.3791 (0.0029)  0.9936 (0.0010)  0.5423 (0.0042)  3.7851 (0.0279)
Bagging   0.8494 (0.0040)  0.9335 (0.0052)  0.4392 (0.0152)  0.9874 (0.0021)  0.5946 (0.0145)  3.7116 (0.0483)
Boosting  0.8391 (0.0019)  0.9445 (0.0040)  0.3793 (0.0038)  0.9937 (0.0013)  0.5427 (0.0054)  3.7858 (0.0378)
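Each cell in the table above pairs a mean with what is presumably a standard deviation over repeated evaluation runs. A minimal sketch of how one such cell could be produced from hypothetical per-run scores (the values below are illustrative, not taken from the experiments):

```python
from statistics import mean, stdev

# Hypothetical per-run scores for one model/metric combination;
# each table cell would summarise such a list as "mean (sd)".
runs = [0.8391, 0.8388, 0.8402, 0.8375, 0.8394]

cell = f"{mean(runs):.4f} ({stdev(runs):.4f})"
print(cell)  # → 0.8390 (0.0010)
```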

Attachment B Results of the Wilcoxon signed-rank tests

          LR  DT      NB      RF      NN      SVM     Bagging  Boosting
LR        -   0.0029  <0.001  0.0602  <0.001  0.272   <0.001   0.935
DT        -   -       0.0029  0.0029  0.0029  0.0029  0.0030   0.0029
NB        -   -       -       <0.001  <0.001  <0.001  <0.001   <0.001
RF        -   -       -       -       0.4400  0.117   <0.001   0.1300
NN        -   -       -       -       -       0.0012  <0.001   <0.001
SVM       -   -       -       -       -       -       <0.001   0.5100
Bagging   -   -       -       -       -       -       -        <0.001
Boosting  -   -       -       -       -       -       -        -

          LR  DT      NB      RF      NN      SVM     Bagging  Boosting
LR        -   <0.001  <0.001  <0.001  <0.001  <0.001  0.129    <0.001
DT        -   -       0.422   <0.001  <0.001  <0.001  <0.001   <0.001
NB        -   -       -       <0.001  <0.001  <0.001  <0.001   <0.001
RF        -   -       -       -       <0.001  <0.001  <0.001   <0.001
NN        -   -       -       -       -       <0.001  <0.001   <0.001
SVM       -   -       -       -       -       -       <0.001   <0.001
Bagging   -   -       -       -       -       -       -        <0.001
Boosting  -   -       -       -       -       -       -        -

          LR  DT      NB      RF      NN      SVM     Bagging  Boosting
LR        -   <0.001  <0.001  0.219   <0.001  0.615   <0.001   0.806
DT        -   -       <0.001  <0.001  <0.001  <0.001  <0.001   <0.001
NB        -   -       -       <0.001  <0.001  <0.001  <0.001   <0.001
RF        -   -       -       -       0.0089  0.292   <0.001   0.326
NN        -   -       -       -       -       0.0015  <0.001   <0.001
SVM       -   -       -       -       -       -       <0.001   0.426
Bagging   -   -       -       -       -       -       -        <0.001
Boosting  -   -       -       -       -       -       -        -

          LR  DT      NB      RF      NN      SVM     Bagging  Boosting
LR        -   <0.001  <0.001  <0.001  <0.001  0.232   <0.001   0.882
DT        -   -       <0.001  <0.001  <0.001  <0.001  <0.001   <0.001
NB        -   -       -       <0.001  <0.001  <0.001  <0.001   <0.001
RF        -   -       -       -       0.224   <0.001  <0.001   <0.001
NN        -   -       -       -       -       0.0011  <0.001   <0.001
SVM       -   -       -       -       -       -       <0.001   0.4300
Bagging   -   -       -       -       -       -       -        <0.001
Boosting  -   -       -       -       -       -       -        -

          LR  DT      NB      RF      NN      SVM     Bagging  Boosting
LR        -   <0.001  <0.001  0.715   <0.001  0.344   <0.001   0.914
DT        -   -       <0.001  <0.001  <0.001  <0.001  0.0010   <0.001
NB        -   -       -       <0.001  <0.001  <0.001  <0.001   <0.001
RF        -   -       -       -       0.114   0.903   <0.001   0.727
NN        -   -       -       -       -       0.0023  <0.001   <0.001
SVM       -   -       -       -       -       -       <0.001   0.315
Bagging   -   -       -       -       -       -       -        <0.001
Boosting  -   -       -       -       -       -       -        -

          LR  DT      NB      RF      NN      SVM     Bagging  Boosting
LR        -   <0.001  <0.001  <0.001  <0.001  0.836   <0.001   0.847
DT        -   -       0.0062  <0.001  <0.001  <0.001  <0.001   <0.001
NB        -   -       -       <0.001  <0.001  <0.001  <0.001   <0.001
RF        -   -       -       -       0.414   <0.001  <0.001   <0.001
NN        -   -       -       -       -       <0.001  <0.001   <0.001
SVM       -   -       -       -       -       -       <0.001   0.829
Bagging   -   -       -       -       -       -       -        <0.001
Boosting  -   -       -       -       -       -       -        -
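Pairwise matrices like those above are typically built by running a Wilcoxon signed-rank test on the per-run scores of every model pair. A minimal self-contained sketch, using the normal approximation of the test; the model names follow the tables, but the score values are hypothetical placeholders:

```python
import math
import random

def wilcoxon_signed_rank_p(x, y):
    """Two-sided Wilcoxon signed-rank p-value for paired samples,
    using the normal approximation (no tie correction; assumes at
    least one non-zero difference)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    # Rank |differences|, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average rank, 1-based
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical per-run scores (e.g. one AUC per cross-validation run).
random.seed(7)
models = ["LR", "DT", "NB", "RF", "NN", "SVM", "Bagging", "Boosting"]
scores = {m: [0.84 + random.gauss(0, 0.01) for _ in range(20)] for m in models}

# Upper-triangular p-value matrix, as in the attachment.
pvals = {(a, b): wilcoxon_signed_rank_p(scores[a], scores[b])
         for i, a in enumerate(models) for b in models[i + 1:]}
```

In practice `scipy.stats.wilcoxon` would normally be preferred, as it can also use the exact distribution of the test statistic for small samples.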
