A Bayesian Approach
ACADEMIC SUPERVISOR:
Ana Alina Tudoran
Department of Business Administration
August, 2012
MSc. Business Intelligence
Aarhus University
Business and Social Sciences
Acknowledgement
We would like to extend our sincerest thanks to those who have helped us during the
process of this thesis. The process would have been much more overwhelming and the topic
more complicated, if it had not been for the support and guidance of those people.
First of all, our thanks go to our dedicated academic supervisor, Ana Alina Tudoran, for
always being helpful with useful input and discussions during the process.
Secondly, thanks go to our contact person in the Bank, both for providing real data
Last but not least, a special thanks goes to our fellow thesis-writing students at Haslefortet,
to the recent advances in computer technology. The objective of the current thesis has been
to analyze whether a Bayesian logistic model is a more effective tool for credit and loan default
predictive modeling, by comparing its performance with an expert model and, respectively,
a frequentist logistic model. Real data on 67618 customers collected during 2002-2010 were
provided by one of the most important banks in Denmark and were used in the empirical
analysis. The Bayesian logistic regression model was estimated using Markov chain Monte
Carlo simulations in SAS. The performance of the different credit scoring models (Bayesian,
frequentist, expert) was assessed by using the ROC curve. The results of the empirical analysis
show that a Bayesian approach to credit scoring is overall able to outperform the expert
model. However, the findings show no significant differences in terms of predictive performance
between the Bayesian and the frequentist logistic approach. Overall, we conclude that a
Bayesian approach to credit scoring may be used as an alternative decision tool to the frequentist
approach.
Contents
1 Introduction
1.1 Problem Statement
1.2 Delimitation
2 Statistical Reasoning
2.1 The History
3 Bayes' Theorem
3.1 The Likelihood p(y | θ)
5 Model Validation
5.1 Validation Framework
6 Empirical Analysis
6.1 Data
7 Concluding Remarks
7.1 Conclusion
7.2 Limitations
7.3 Contributions
8 Bibliography
A Appendix
List of Figures
5.3 Power-Curve
6.4 ROC and AUC for Validation Year 2010 - RSI, Prior 1
6.5 ROC and AUC for Validation Year 2010 - Real Estate, Prior 1
1 Introduction
Over the last 70-80 years, different methods for credit scoring have been of great interest
in the banking industry. Credit scoring is a well-known assessment process, whose objective
is to distinguish the desired customers from defaulting customers based on the registered
customer information (Thomas, 2000; Isik, Deniz, & Taner, 2010). The first methods
used were entirely based on a judgmental approach, where a credit analyst approved or dis-
approved the customers' loan application forms, but since this approach was solely based on
Another important aspect in relation to the above was that through the 1960s
the number of people applying for a credit card rose significantly, which increased the need for
automated assessment processes, since the banks simply did not have the resources to cope
In 1975 and 1976, the Equal Credit Opportunity Act (ECOA hereafter) was approved
in the US, which made it illegal to discriminate against loan applicants in any way regarding sex,
marital status, race, color, religion, national origin, and age. Discrimination was, according
to ECOA, defined as treating one person less favorably than others, and the banks were
required to inform applicants of their decision in a detailed written notice. Therefore, the
banks began focusing on empirically derived credit systems, where the creditworthiness of the
customers was based on statistical methods. These methods included predictor variables such
as the number of years a person had lived at the same address, account information, other
creditor debts, etc. By implementing these systems, the banks were able to document the
same objective decisions to all applicants. This initiative was the starting point in the field of
After the mid-70s, it was still possible for a financial advisor to refuse judgments made
by the credit scoring system if he/she disagreed with its outcome. However, studies showed
that in 95% of the cases where an advisor accepted the loan even though the system had turned the
applicant down, the loan was very hard to collect (Nevin, 1979).
In the 1980s, the success of credit scoring for credit cards meant that banks started using
credit scoring for their other products, such as personal loans, and in the 1990s credit scoring
methods were also applied to private loans and small business loans (Thomas, 2000).
The recent financial crisis, which began back in 2007, was a result of greedy financial
of the bank managers was that they had completely disregarded the risks involved in making
investments in sub-prime mortgage loans¹, and were afraid to come clean about the mistakes
they had made. Instead of writing off their losses at once, they depreciated them over a longer
period of time (Hellwig, 2008). This financial crisis has shaken the entire financial system all
over the world, and the countries' economies are still suffering from the aftermath of the shock.
¹ Sub-prime lending is a loan type for people with a limited credit history who are more likely to default than
regular prime borrowers. Sub-prime loans carry a higher interest rate to compensate for the risks associated with
offering these types of loans.
Financial institutions were led to ignore the risks associated with these loans because of
increasing home prices through the mid-90s (Hellwig, 2008). The popularity of the sub-prime
mortgages meant an increase in home prices from around 9% in 2000 to above 40% in
2006. The fact that these types of loans actually were sub-prime loans should have concerned
the bank managers. One might wonder what the initial decisions about offering these types
of loans were based on. It was certainly not the result of an effective credit scoring system
(Hellwig, 2008).
In recent years it has become increasingly important for banks to develop effective
and reliable credit scoring systems to classify defaulting or profitable customers. Several
different methods and models have been utilized in doing so. Thomas (2000) briefly describes
some of the techniques that have been used for credit scoring over the last couple of decades. The
most popular of these techniques is the frequentist logistic regression approach (Steenackers
& Goovaerts, 1989; Laitinen, 1999; Alfò, Caiazza, & Trovato, 2005). However, some
statisticians have recently argued that we stand at the threshold of a new Bayesian Renais-
sance, and other proponents argue that Bayesian methods more closely reflect how humans
perceive their environment, respond to new information, and make decisions (Wylie, Muegge,
& Thomas, 2006). Bayesian techniques have rarely been utilized by researchers or financial
corporations in the past, but nowadays the increasing computational power entails that the
Studies have already explored different Bayesian approaches for credit scoring, and found
these methods to have some advantages over frequentist approaches (Mira & Tenconi, 2003;
Ziemba, 2005; Löffler, Posch, & Schöne, 2005). In relation to this fact, one of the most
important banks in Denmark has expressed a desire for exploring a Bayesian approach for
In this thesis we explore Bayesian logistic regression as a possible decision tool for credit
scoring from an academic and practical viewpoint, and it is therefore ideal that the current
thesis has been composed in cooperation with the above-mentioned bank, which has delivered
the data used in the empirical analysis. The performance of Bayesian logistic regression will
be evaluated and compared with a frequentist logistic regression and an expert model.
The financial advisors in Our Bank have the responsibility to fill in questionnaires regarding
the creditworthiness of the customers. The questionnaires may have different objectives, but
in this thesis the focus is solely on the questionnaire's ability to predict the defaulting customers.
In spite of the financial crisis, Our Bank is only losing money on a tiny proportion of its
customers (i.e. the data consist of only a small proportion of defaulters - the dependent variable),
which implies that it may take several years before Our Bank has enough data to re-estimate
Frequentist logistic regression is the current statistical method applied when enough data has been
collected, and in the intervening period an expert model is applied. However, these approaches
The accuracy ratio of the expert model is notably lower than the accuracy ratio of
Due to the disadvantages listed above, Our Bank is searching for an alternative statistical
method. Bayesian logistic regression is one possibility, since this statistical approach has the
ability to combine prior expert information and collected data, and hereby update one's prior
beliefs regarding the parameters of interest. Therefore, the research objective for the present
thesis is:
The contribution of this thesis is threefold. First, we introduce Bayesian statistics and
the Bayesian approach to logistic regression from an academic point of view, based on
existing literature related to the topic. Second, we suggest a way to integrate the
current expert knowledge from Our Bank into prior distributions used in the Bayesian logistic
regression model. Third, we apply and compare the performance of Bayesian logistic regression
with an expert model and a frequentist logistic regression model, using real data from Our
Bank. The provided data cover customer information from the years 2002 to 2010, and we
introduce a walk-forward validation framework used to analyze the differences between the
models. Additionally, the comparison will be conducted on two different segments to clarify
whether there are any differences between the conclusions in a segment with a large number of
As mentioned in Chapter 1, the present thesis is composed in cooperation with one of the
most important banks in Denmark. Due to confidentiality reasons, the name of Our Bank will
1.2 Delimitation
In this thesis it is assumed that the reader has a thorough understanding of frequentist statistics
in general. Therefore, frequentist statistics will only briefly be described in section 2.2. The
authors have, as original frequentists, spent a great amount of time on understanding Bayesian
statistics. Therefore, it has been deemed necessary to keep an acute focus in order not to lose
the thread.
As mentioned in the Introduction, the interest in credit scoring methods increased with
the entry of credit cards in the 1960s. Since then, several statistical methods, based on
different beliefs, have been applied to credit scoring (e.g. Keramati & Yousefi, 2011, for a review).
Due to the recent developments in Bayesian statistics, this thesis will focus on Bayesian logistic
regression and compare its performance with the more common frequentist logistic regression
technique and an expert model. The expert model for credit scoring, described in section
6.2, will also be utilized in the development of prior distributions for the Bayesian logistic
regression. Since the expert model and a frequentist logistic regression are the current models
applied by Our Bank for credit scoring, both of these will be used as a reference in section
6.5. In continuation hereof, it is important to clarify that the frequentist logistic regression
model used in Our Bank is not completely identical to the estimated frequentist logistic
regression model used as reference in this thesis. This is due to the way Our Bank's current
frequentist logistic regression is estimated, which will be elaborated on in section 6.3. However,
the frequentist logistic regression model estimated in the thesis will still be referred to as Our
distinguish the desired customers, who will fully repay, from defaulters (Isik, Deniz, & Taner, 2010).
Our Bank operates with three credit ratings: low risk, high risk, and full risk. This thesis,
however, only operates with two credit ratings - default and non-default - which means
that if a customer is classified as default, it implies that the customer is either a high risk
or a full risk customer according to Our Bank's current credit rating. An important aspect in
relation to the classification of the customers is the determination of a suitable time horizon.
This time horizon covers the duration from granting credit to a customer until the time
when the customer is observed as a defaulter or a non-defaulter (Isik, Deniz, & Taner, 2010).
In Our Bank the customers are evaluated the year after receiving the loan, which
is in line with standard practice (Isik, Deniz, & Taner, 2010). Analysis shows that the
default rate, as a function of the time the customer has been with the organization, builds up
initially, and it is only after twelve months or so that it starts to stabilize. Thus, any shorter
horizon underestimates the default rate and does not reflect in full the types of characteristics
The data basis for the empirical analysis consists of questionnaire data on customers, filled
out by the financial advisors during the period 2002-2010. The data cover two different
customer segments: Retail, Service and Industry (RSI hereafter) and Real Estate.
In total the dataset consists of 67618 customers, which will be used for the analysis in the
RSI (N=62866)
As it is the objective to explore the theoretical basis of Bayesian statistics, as well as to quan-
titatively evaluate the performance of Bayesian logistic regression compared to Our Bank's
current approaches, the present thesis is characterized by both an academic and a practical
conducting research, focusing on concepts, and the relations between those concepts, whereas
set of events in the real world. Due to the nature of the present thesis, the authors have,
as already mentioned in section 1.2, spent a great amount of time on understanding Bayesian
statistics. Based on that, the authors expect to face decisions requiring certain compromises
in order to satisfy both orientations in terms of a theoretical examination of Bayesian logistic regres-
sion, as well as the problem-oriented research necessary to conduct the performance evaluation
of Bayesian logistic regression against Our Bank's current approaches in a satisfactory and under-
standable manner.
All scientific investigation is affected by the methodological approach chosen by the re-
searchers. This thesis is inspired by the work of Guba (1990), which is used to explain the
The overall methodological approach of the present thesis is based on the fundamental
beliefs of postpositivism, in which the aim is prediction and control. Postpositivism can be
characterized as a modified version of positivism, which relies on the belief that a reality out
there exists, driven by immutable natural laws (Guba, 1990). The aim of this thesis is to clarify
whether Bayesian logistic regression can be used as a more effective decision tool for credit scoring
than Our Bank's current approaches. In continuation hereof, the thesis focuses solely on
default prediction.
Ontology concerns the nature of social entities. The ontological
aspect of postpositivism implies a focus on critical realism, which acknowledges that reality
exists, but emphasizes that it is impossible for humans to accurately perceive it with their
imperfect sensory and intellective mechanisms. This idea implies that postpositivism
recognizes that all observations are fallible and error-prone and that all theory is revisable
(Guba, 1990). We are aware that we need to be critical about our work due to the imperfect
Epistemological issues regard the understanding of how knowledge is created and what accept-
able knowledge is. With regard to this issue, postpositivism emphasizes modified objectivity,
which implies that objectivity is the guiding ideal, but that it cannot be achieved in an absolute sense.
and observations, each of which may possess different types of error (Guba, 1990). The cur-
rent thesis is influenced by the subjective Bayesian approach. All statistical methods that use
probability are subjective to a certain extent because they rely on mathematical idealizations
of the world. However, the Bayesian approach is sometimes perceived as being especially sub-
jective due to the reliance on a prior distribution (Gelman, 2000). To overcome this, different
priors are applied to the Bayesian logistic regression and the prior with the best performance
is used in the comparison with Our Bank's current approaches to credit scoring. We are aware
that objectivity cannot be fully met since two of the utilized priors in the thesis are estab-
lished based on subjective expert models. Furthermore, the variances of the different priors
Methodological questions concern how the researcher should go about finding knowledge.
interest of conforming to the commitment to critical realism and modified objectivity, em-
human sensory and intellective mechanisms cannot be trusted, it is essential that the findings
are based on as many sources of data, investigators, theories, and methods as possible. Sec-
ond, postpositivism allows for many imbalances necessary to achieve realistic and objective
research. Particularly important is the imbalance that has to do with the inescapable trade-
off between internal and external validity, in which the researcher must sacrifice the degree
of generalization of the findings to achieve internal validity (Guba, 1990). The methodolog-
ical approach used in this thesis starts out with a theoretical introduction to the Bayesian
approach, which is used to guide the empirical analysis. Given the knowledge acquired in the
theory sections, the thesis explores the research objective using statistical and quantitative
methods. The empirical analysis is based on a solid dataset covering 67618 customers during
2002 to 2010. Furthermore, the empirical analysis is carried out by comparing three different
credit scoring methods. We are aware that the conclusions cannot be generalized to other
banks, even though they are based on quantitative methods. This is due to the data, since they
are based on confidential questionnaires within Our Bank, filled out by the financial advisors.
However, the conclusions give indications which can be used for further research outside Our
Bank.
In order to be able to thoroughly investigate the research objective, the thesis is divided into five parts.
Part I (Chapter 1) frames the present thesis by introducing the problem statement and
Part II (Chapters 2, 3, and 4) outlines the theoretical framework for the present thesis. The
aim is to provide the reader with an understanding of the Bayesian approach and how it has
evolved through time. Chapter 2 introduces the two competing approaches within statistical
reasoning, frequentist and Bayesian, and derives Bayes' Theorem on the basis of conditional
probability. Chapter 3 gives a deeper understanding of the three components in Bayes' Theorem
Part III (Chapter 5) introduces the theory behind the validation framework and validation
Part IV (Chapter 6) presents the empirical findings and evaluates model performance based
on real data from Our Bank. Initially, Chapter 6 provides a description of the dataset used for
running the models. Chapter 6 continues with estimation of the current credit scoring models
in Our Bank and describes the procedures used for establishing a Bayesian logistic regression
model. Through Markov chain Monte Carlo (MCMC hereafter) simulations the parameters
of the Bayesian model are estimated. Lastly, Chapter 6 will compare the performance of the
Finally, Part V (Chapter 7) will conclude on the findings in Chapter 6 and reflect on possible
future research and extensions. Furthermore, the limitations and contributions of the current
PART II
Statistical Reasoning
2 Statistical Reasoning
Statistics is the study, creation, and use of methods for producing and employing data for
Statistics is a science that concerns itself with experimentation and the collection,
description and analysis of data... Statistical methods are tools for examining data.
(Barnett, 1973)
Nowadays two competing approaches to statistical reasoning exist: the Bayesian and the fre-
quentist, where the frequentist is the larger group. As mentioned in Chapter 1, the increasing
power of computers is bringing the Bayesian approach to the fore (Wylie, Muegge, & Thomas,
2006). Most statisticians have become Bayesians or frequentists as a result of their choice of
university; they were not aware of the existence of the Bayesian and frequentist approaches
until it was too late and the choice had been made (Altman & Bland, 1998). Since the different
approaches are based on different concepts, procedures and justifications, the following section
the Bayesian approach. Finally, the Chapter summarizes some advantages and disadvantages
Even though Statistics as a formal scientific discipline has a rather short history, the
reasoning behind it began about three hundred years ago, when people started to give serious
thought to the question of how to reason in situations where it was not possible to argue
with certainty. The first to formulate the problem was probably James Bernoulli (1713), who
perceived the difference between the deductive logic applicable to games of chance and the
inductive logic required for everyday life. The question for Bernoulli was how the mechanics
of the deductive logic might help to tackle the inference problems of the inductive logic (Sivia & Skilling, 2007).
through his papers published in the Philosophical Transactions in 1763 and 1764. Of these the
first, entitled An Essay Towards Solving a Problem in the Doctrine of Chances, is the one that
has earned him acknowledgment, since he proved a special case of what is nowadays called
Bayes' Theorem (Barnett, 1973). Thomas Bayes was interested in inverse probability, which
concerns inferences about probability parameters from observations of outcomes and prior beliefs.
Pierre-Simon Laplace proved, 11 years later, a more general version of Bayes' Theorem, where
he applied the results to inference on sampling and measurement error (Altman & Bland, 1998).
Both Bayes and Laplace assumed uniform prior distributions, that is, as an initial
starting point they assumed that all possible values for the unknown parameter were equally
likely, and revised their estimates as they observed the data (Wylie et al., 2006). The present
form of Bayes' Theorem is actually due to the work of Laplace, since he rediscovered Bayes'
Theorem with far more clarity than Bayes, and since he discovered its use in solving
problems in celestial mechanics, medical statistics and even jurisprudence. Despite Laplace's
numerous successes, his development of probability theory was rejected by many soon after
The problem did not have to do with the substance, but with the concept. As mentioned
or plausibility - how much they thought that something was true based on the evidence at
hand. To the 19th century scholars, however, this seemed too vague and subjective an idea
to be the basis of a rigorous mathematical theory. The essays entitled On the Mathematical
Foundations of Theoretical Statistics, published by R. A. Fisher, and On the most Efficient Test of Statistical Hypotheses, published a decade later by J. Neyman and E.
S. Pearson, can be considered as the cornerstones of what nowadays is called the frequentist
Arguably the most influential article on that subject in the twentieth century (...)
with new definitions, a new conceptual framework and enough hard mathematical
Fisher gave major attention to estimation procedures, while Neyman and Pearson largely con-
centrated on the construction of principles for testing hypotheses. Their work was not entirely
distinct either in emphasis or application. Nor was it free from internal controversy with
defined probability as the long-run relative frequency with which an event occurred, given many
repeated trials. Since frequencies can be measured, probability was now seen as an objective
tool for dealing with random phenomena (Sivia & Skilling, 2007). The only quantitative in-
formation handled by frequentists is sample data. This implies that prior information about
the parameter, θ, is of no importance, but may be expected to influence the choice of sta-
tistical procedure and the performance characteristics needed, e.g. working hypotheses and
It is clear that considerations of a priori probability may (...) need to be taken into
exact numerical form (...) but in general we are doubtful of the value of attempts
others must be taken into account in the final judgment, but cannot be introduced
² Fiducial inference can be interpreted as an attempt to perform inverse probability without having prior
probability distributions. Fiducial inference quickly attracted controversy and was never widely accepted.
Following Fisher, most influential statisticians of that period favored an objective frequentist
Even though the first half of the 20th century was accompanied by the development of the
frequentist approach, the flames of Bayesian thinking were kept alive by a few thinkers such as
Bruno de Finetti and Harold Jeffreys (Cowles, Kass, & O'Hagan, 2009).
The modern Bayesian movement began in the second half of the 20th century, but Bayesian
inference remained extremely difficult to implement until the late 1980s and early 1990s, when
powerful computers became widely accessible and new computational methods were developed.
The subsequent explosion of interest in Bayesian statistics has not only led to extensive research
in Bayesian methodology but also to the use of Bayesian methods to address pressing questions
in diverse application areas such as astrophysics, weather forecasting, health care policy, and
The following section will give a further introduction to the Bayesian approach by intro-
ducing conditional probabilities and showing how Bayes' Theorem is derived from conditional
probabilities.
techniques, but, as mentioned in section 1.2, the most challenging part is simply the fact that
It is a whole different way of viewing the world. Jaynes (2003) argued that the Bayesian
approaches to scientific questions offer a different way of viewing reality, one that actually
As mentioned in section 2.1, the frequentists view the unknown parameters as fixed constants
population. The aim of the frequentist approach is to use data to estimate the unknown value
of a parameter. The data obtained represent only one possible realization of the current ex-
periment, and the corresponding probability distribution, given the data, is called a sampling
distribution. This sampling distribution is crucial to any assessment of the behavior of the
parameter. The estimate of the parameter is seen as a typical value that is likely to arise
way to represent the information provided by the sample (Barnett, 1973). The result of the
Where the frequentists treat the unknown parameters as fixed constants, the Bayesian ap-
proach treats them as random variables, which means that the parameters can vary according
to a probability distribution. This variation can be regarded as purely stochastic for a data-
driven model, but it can also be interpreted as beliefs of uncertainty under the Bayesian ap-
proach. In a Bayesian formulation the uncertainty about the value of each parameter can be
As mentioned in section 2.1, Bayes was interested in solving the inverse probability problem: the prob-
ability of an event given the observations of other events. The results of his studies led to what
nowadays is called Bayes' Theorem (SAS Institute, 2008). Since Bayes' Theorem is the
key to Bayesian statistics, and since it relies on conditional probability, conditional probability
predictable, which means that the experiment can be repeated under the same conditions
without leading to the same result. The result of one single trial of the random experiment
is the outcome, and an event is any set of possible outcomes of a random experiment. The
sample space is the set of all possible outcomes of one single trial of the random experiment, denoted U.
Since the sample space represents everything considered, it is also called the universe (Bolstad,
2007).
A Venn diagram³ is used to illustrate the relationship between two events, which is shown
in figure 2.1. The rectangle illustrates the universe, U, and the circles illustrate the occurring
events. The relationship between two events depends on the outcomes they have in common.
If all the outcomes in one event are also in the other event, the first event is a subset of the
other. If the events have some outcomes in common, they are intersecting events, such as the
³ Venn diagrams, or set diagrams, are diagrams that show all possible logical relations between a finite
collection of sets (aggregations of things).
Consider figure 2.1. If we know that one event has occurred, does that affect the probability
of the occurrence of another event? Conditional probability is used to answer this question.
Suppose that event B has occurred; the universe of interest is then reduced so that the only
thing of interest is inside the circle B. Say event A occurs. The only part of event A that
is now relevant is the part also contained in B, that is, B ∩ A. The joint probability of events
B and A is the probability that both events occur simultaneously, on the same repetition of
Given that event B has occurred, the total probability of the reduced universe must be
equal to one.⁴ The probability of event A, given event B, is the unconditional probability of
that part of A which is also included in B, multiplied by the scale factor 1/Pr(B). Adding this
information together gives the conditional probability of event A given event B:
\[ \Pr(A \mid B) = \frac{\Pr(B \cap A)}{\Pr(B)} \qquad (2.1) \]
as long as Pr(B) ≠ 0. Equation 2.1 shows that the conditional probability Pr(A | B) is
proportional to the joint probability Pr(B ∩ A), but has been rescaled so that the probability of
the reduced universe equals 1. The marginal probability of event B is found by summing the
probabilities of its disjoint parts. Since B = (B ∩ A) ∪ (B ∩ Ã), and clearly (B ∩ A) and (B ∩ Ã)
are disjoint, this simplified two-event example gives:
\[ \Pr(B) = \Pr(B \cap A) + \Pr(B \cap \tilde{A}) \qquad (2.2) \]
where Ã is the complement of A. Equation 2.2 is now substituted into the definition of conditional probability:
⁴ An axiom of probability: Pr(U) = 1 (the total probability of the universe equals 1).
\[ \Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B \cap A) + \Pr(B \cap \tilde{A})} \qquad (2.3) \]
Now the multiplication rule can be used to find each of these joint probabilities. This gives:
\[ \Pr(A \mid B) = \frac{\Pr(B \mid A)\Pr(A)}{\Pr(B \mid A)\Pr(A) + \Pr(B \mid \tilde{A})\Pr(\tilde{A})} \qquad (2.4) \]
Using the law of total probability, the marginal probability of B, also the denominator in
\[ \Pr(B) = \sum_{j=1}^{n} \Pr(B \mid A_j)\Pr(A_j) \qquad (2.5) \]
Equation 2.5 just states that the probability of event B is the sum of the probabilities of its disjoint parts.
\[ \Pr(A_i \mid B) = \frac{\Pr(B \mid A_i)\Pr(A_i)}{\sum_{j=1}^{n} \Pr(B \mid A_j)\Pr(A_j)} \qquad (2.6) \]
Equation 2.6 is what is known as Bayes' Theorem and is a restatement of equation 2.1, where
the joint probability in the numerator is identified by the multiplication rule, and the marginal
probability contained in the denominator is found by using the law of total probability followed by the multiplication rule.
Bayes' Theorem consists of three different components, which are the cornerstones of Bayesian
statistics: the initial probability of the parameter, Pr(A_i), is the prior probability; the probability
of the parameter given the data, Pr(A_i | B), is the posterior probability; and the probability of
the data given the parameter, Pr(B | A_i), is the likelihood function known from the frequentist
approach. The final term, Σ_{j=1}^{n} Pr(B | A_j) Pr(A_j), is the probability of the data B, which
does not depend on the parameter and acts as a normalizing constant. A more convenient
way to present Bayes' Theorem is obtained by omitting the marginal distribution term, since it does
not provide any additional information about the posterior, as long as the integral is finite
(SAS Institute, 2008). For this reason, equation 2.6 is often referred to in terms of the prior,
\[ \Pr(A_i \mid B) \;\propto\; \Pr(B \mid A_i)\,\Pr(A_i) \qquad (2.7) \]
where the symbol ∝ means "proportional to". Bayes' Theorem will be elaborated further in Chapter 3.
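To make equation 2.6 concrete, the minimal sketch below applies it to a hypothetical two-event credit setting: A_1 = "customer defaults", A_2 = "customer does not default", and B = "an expert model flags the customer as high risk". All probabilities are invented purely for illustration and are not taken from Our Bank's data.

```python
# Hypothetical illustration of Bayes' Theorem (equation 2.6) with two events.
prior = {"default": 0.05, "non_default": 0.95}          # Pr(A_i)
likelihood = {"default": 0.80, "non_default": 0.10}     # Pr(B | A_i), B = "flagged as high risk"

# Law of total probability: Pr(B) = sum_j Pr(B | A_j) * Pr(A_j)   (equation 2.5)
pr_b = sum(likelihood[a] * prior[a] for a in prior)

# Bayes' Theorem: Pr(A_i | B) = Pr(B | A_i) * Pr(A_i) / Pr(B)     (equation 2.6)
posterior = {a: likelihood[a] * prior[a] / pr_b for a in prior}

print(posterior)   # {'default': 0.296..., 'non_default': 0.703...}
```

Even with a likelihood strongly favoring default, the small prior default rate keeps the posterior default probability below one third, which is exactly the prior-likelihood trade-off the theorem formalizes.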
Bayesian methods and frequentist methods both have advantages and disadvantages, and there
are some similarities. When the sample size is large, Bayesian inference often provides results
section 3.2. Some advantages of using Bayesian analysis include the following (SAS Institute, 2008):
- Bayesian methods have a single tool, Bayes' Theorem, which is used in all situations.
- Bayesian methods provide a way of combining prior information with data, within a solid decision-theoretical framework. In science there usually is some prior knowledge about the process being measured. Leaving the prior information out is a waste of knowledge, but, as mentioned, Bayesian statistics uses both sources of information and
- Since Bayesian methods provide inferences that are conditional on the data and are
- Bayesian methods provide interpretable answers, such as the true parameter has a
- Bayesian methods provide a convenient setting for a wide range of models. Markov chain Monte Carlo (MCMC hereafter), along with other numerical methods, makes computations tractable for almost all parametric models. MCMC will be introduced in section 4.1.
Some disadvantages of using Bayesian analysis include the following:
- Bayesian methods do not specify how to select a prior, which entails that there is no
- Bayesian methods can produce posterior distributions that are strongly influenced by the priors. From a practical viewpoint, this means it might be difficult to convince experts who do not agree with the validity of the chosen prior.
- Bayesian methods often come with a high computational cost, especially in models with a
3 Bayes' Theorem
The following Chapter will contain a deeper introduction to Bayes' Theorem, equation 2.6,
where the three components will be elaborated on. Section 2.2 dealt with random experiments
in terms of events and introduced probability defined on events as a tool for understanding
random experiments. The more usual form of Bayes' Theorem will be used in this Chap-
ter, which is based on random variables. A random variable describes the outcome of the
The notation used for the parameters of interest will be θ, which forms a vector of
parameters, so that θ = (θ_1, θ_2, ..., θ_k). The notation used for the collected data in this thesis
is y, and since m variables are measured for each of the n customers in Our Bank, y will form an
m × n matrix.
The conclusions from a Bayesian analysis are drawn based on the posterior probability
distribution. This posterior distribution is conditional on the observed data y, and by
utilizing Bayes' Theorem, presented in equation 2.6, we write it as p(θ | y). From
this point on, p(· | ·) and p(·) will denote probability distributions.
The core of Bayesian inference is to update one's prior beliefs, p(θ), with the new information
given by the collected data, p(y | θ), using the necessary algorithms for computational
convenience to summarize p(θ | y). Combining these notations in Bayes' Theorem, equation
2.6, gives:
\[ p(\theta \mid y) = \frac{p(\theta)\,p(y \mid \theta)}{p(y)} \qquad (3.1) \]
where p(y) = Σ_θ p(θ) p(y | θ) for discrete random variables, and p(y) = ∫ p(θ) p(y | θ) dθ in the
case of continuous random variables. Equation 3.1 can then be presented in a proportional
form:
\[ p(\theta \mid y) \;\propto\; p(\theta)\,p(y \mid \theta) \]
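As a small illustration of equation 3.1 (assuming a Bernoulli likelihood, a flat prior and invented data, none of which are prescribed by the thesis at this point), the sketch below evaluates the unnormalized product p(θ)p(y | θ) on a grid of parameter values and divides by its sum, which plays the role of p(y) for a discretized parameter:

```python
import numpy as np

# Hypothetical data: 3 defaults observed among 20 customers.
n, defaults = 20, 3

theta = np.linspace(0.001, 0.999, 999)        # grid of candidate default probabilities
prior = np.ones_like(theta)                    # flat prior p(theta) (an assumption)
likelihood = theta**defaults * (1 - theta)**(n - defaults)   # Bernoulli likelihood p(y | theta)

unnormalized = prior * likelihood              # numerator of equation 3.1
posterior = unnormalized / unnormalized.sum()  # dividing by the discretized p(y)

print("posterior mean:", (theta * posterior).sum())   # roughly (3 + 1) / (20 + 2) = 0.18
```

The normalizing constant never needs to be known in closed form; it is enough to be able to evaluate the prior-times-likelihood product, which is exactly what the proportional form expresses.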
All Bayesian inferences follow from this posterior probability distribution, since it captures all
the relevant information regarding the parameters. In the following sections the three different
parameter θ. It forms the prediction of what the data should look like, if the parameter
presented in section 2.1 and section 2.2. The likelihood function, often denoted as ℓ(θ; y), is a
function of θ with the data values serving as parameters of that function. People who follow
the frequentist approach will typically choose the value that provides the maximum likelihood
distribution for each θ. The theory of probability holds many such distributions, ranging from
simple distributions to probability models for high-dimensional random variables, involving
The chosen likelihood distribution must be appropriate to the type of data that are observed
and must be able to represent the estimated model within it. Furthermore, it should make it
possible to discredit the estimated model when it is clearly inconsistent with the evidence. The
in the inferences over a set of likelihoods, each of which embodies the theory. Furthermore, it
From the subjective perspective utilized in this thesis, a likelihood represents your beliefs
about the values of the data conditional on θ. It is your likelihood, in the same way that the
marginal distribution for θ and the prior p(θ) will represent your beliefs about the parameter.
If the aim is to convince others of the interest of your results, it is advised to choose a
likelihood that is not clearly inconsistent with the beliefs of your audience and your readers
(Lancaster, 2004).
intractable equation for the posterior. Prior distributions can be specifically chosen to be
compatible with the likelihood function to avoid this problem; such priors are called conjugate priors.⁵
⁵ A prior is said to be a conjugate prior for a family of distributions if the prior and posterior distributions
are from the same family, which means that the posterior has the same distributional form as the
prior distribution.
However, the significant advances in computing power, methodology and software over the last few
decades means that the posterior density function can be directly sampled using simulation
techniques, which will be introduced in section 4.1. The current difficulty in the Bayesian ap-
proach is the specification of a prior distribution, and selecting an appropriate prior is probably
the most important aspect of Bayesian modeling (Kynn, 2005). The prior distribution is a
key part of Bayesian inference and represents the information about an uncertain parameter,
θ, that is combined with the likelihood of the new data to yield
the posterior distribution, which in turn is used for future inference and decisions involving
θ (Gelman, 2002). Considerable care should be taken when selecting priors, and the selection should be
supported by careful documentation. This is because inappropriate choices for priors can lead
The appearance of the prior distribution in the right-hand side of Bayes' Theorem
and a weakness because the inferences inevitably depend, at least to some degree,
Gelman (2002) points out that with well-identified parameters and large sample sizes, reason-
able choices of prior distributions will have minor effects on posterior inferences. This means
that where large amounts of data are available, the influence of the prior will be negligible,
giving similar results to purely data-driven inference. This feature is commonly referred to as
likelihood dominance (Lancaster, 2004). In the absence of data, the inference will be driven by
the prior distributions. Between these two extremes, the prior will have some modifying effect
on the data. The extent to which the prior influences the resulting posterior distribution can be
2005).
There are a number of points that are usually taken into account when specifying priors.
The first point is that priors can be tentative. Since the inference necessarily depends on the
choice of prior, Lancaster (2004) suggests that alternative priors are examined to explore how
sensitive the main conclusions are to alterations in the prior. Furthermore, it is legitimate
to allow prior beliefs to be influenced by inspection of the data. The second point is that
priors should be encompassing. This means that priors should take account of the beliefs of
the readers, since prior beliefs that conflict sharply with those of the readers will make the work of
little interest to them. In line with this, it is suggested that public scientific work use priors
that are not sharply inconsistent with any reasonable belief. This requirement can sometimes
be met by using a uniform or flat distribution on some reasonable function of the parameter
(Lancaster, 2004).
An informative prior is a prior that summarizes the evidence about the parameters concerned from many sources, and
it often has considerable impact on the posterior distribution. A non-informative prior,
on the other hand, provides little information relative to the experiment (Wylie, Muegge, &
Thomas, 2006).
Figure 3.1 illustrates three different prior distributions, where prior A is relatively non-
informative. Priors B and C are both informative, but represent different prior beliefs, B
Typically, informative prior distributions are created from historical data, from expert
knowledge, or from a combination of both. The proper use of informative prior distributions
illustrates the power of Bayesian methods, since previous studies, past experience, or expert
knowledge can be combined with the current information in a natural way. However, using
informative priors can lead to problems due to the subjective beliefs (SAS Institute, 2008).
connote the lack of subjective beliefs used in formulating such a prior. Due to the objectivity
of non-informative priors many statisticians favor this type. However, it is important to keep
in mind that it is unrealistic to expect that non-informative priors represent total ignorance
A common choice of non-informative prior is the flat prior, which is a prior distribution
that assigns equal likelihood to all possible values of the parameter. However, this might not
n Bernoulli⁶ trials. The purpose is to make inferences about the unknown success probability.
A uniform prior on p,
\[ \pi(p) \propto 1 \qquad (3.3) \]
might appear to be non-informative. However, since the uniform prior is equivalent to adding
two observations to the data, one 1 and one 0, experiments with a small n and y can be very
which is called a uniform distribution on the real line and can be thought of as a rectangle
on an infinitely long base. Its integral, the area under the line, does not converge; it is infinite,
and so equation 3.4 is not, in fact, a probability distribution. However, improper priors are
One of the reasons is, at least mathematically, that it does not matter if the prior is
improper. Because the posterior distribution of θ is the object of ultimate interest, and since
this is formed by multiplying the likelihood and the prior, it is perfectly possible for the posterior
distribution to be proper even though the prior is not. Thus, improper prior distributions can
based on improper posterior distributions is obviously invalid (SAS Institute, 2008).
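To make the earlier remark about the uniform prior on a Bernoulli success probability p concrete, the following is a standard derivation (not reproduced from the thesis): with y successes observed in n trials and a flat prior, the posterior is
\[ p(p \mid y) \;\propto\; p^{\,y}(1-p)^{\,n-y}, \]
which is the kernel of a Beta(y+1, n-y+1) distribution, so that E[p | y] = (y+1)/(n+2). The flat prior therefore behaves as if one extra success and one extra failure had been observed, which for small n can pull the estimate noticeably away from the maximum likelihood estimate y/n. This is why the flat prior is not as innocuous as it may appear.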
proper prior that is intended to represent very imprecise or vague beliefs. Often the uniform
prior is thought of as a labor-saving device, since it saves the trouble of specifying exact beliefs
(Lancaster, 2004).
⁶ A Bernoulli trial is an experiment whose outcome is random and can be either of two possible outcomes,
"success" and "failure".
analysis. Reporting the results of the empirical analysis implies displaying the posterior distri-
bution to which the model and data have led, which can be done in several ways (Lancaster,
2004):
- Draw it; when θ is a scalar, the best way of presenting the content of the posterior distribution is by drawing it. This is also valid when θ is a vector, but the parameter of
- Highest posterior density region; similarly, traditional practice often reports a confidence interval for θ. The Bayesian analogue is to find, from the posterior distribution of θ, an
- Calculate the marginals; the calculation involved in forming the posterior distribution
exist. The first is the use of approximations to posterior distributions and the second is
As mentioned in section 1.1, Bayesian logistic regression will be applied in the current thesis to
analyze whether this method is a more effective tool that improves quality of service and minimizes
The logistic regression model belongs to the class of Generalized Linear Models (GLM
hereafter)⁷. Logistic regression allows prediction of a discrete outcome from a set of predictor
variables that may be continuous, discrete, dichotomous, or a mix, and it is a flexible technique
since it makes no assumptions about the distribution of the predictor variables. The response
variable in a logistic regression model is binomial, and the expectation is related to the linear
predictor through the logit function (Wilhelmsen, Dimakos, Husebø, & Fiskaaen, 2009).
⁷ In statistics, the GLM is a flexible generalization of ordinary linear regression that allows response
variables to have distributions other than the normal distribution.
\[ \eta_i = \beta_0 + \sum_{j=1}^{M} \beta_j x_{ij} \qquad (4.2) \]
\[ \mathrm{logit}(p_i) = \ln\!\left(\frac{p_i}{1 - p_i}\right) = \eta_i \qquad (4.3) \]
In this thesis, the explanatory variables, x_ij, are customer characteristics from the questionnaires.
\[ p_i = \frac{\exp\!\left(\beta_0 + \sum_{j=1}^{M} \beta_j x_{ij}\right)}{1 + \exp\!\left(\beta_0 + \sum_{j=1}^{M} \beta_j x_{ij}\right)} \qquad (4.4) \]
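To illustrate equations 4.2-4.4 numerically, the short sketch below maps a linear predictor through the logistic function to a default probability. The coefficients and the customer's characteristics are made up for the example; they are not estimates from Our Bank's data.

```python
import numpy as np

# Hypothetical coefficients beta_0, ..., beta_M and one customer's characteristics x_i1, ..., x_iM.
beta0 = -2.0
beta = np.array([0.8, -0.5, 1.2])
x_i = np.array([1.0, 2.0, 0.0])

eta_i = beta0 + beta @ x_i                     # linear predictor, equation 4.2
p_i = np.exp(eta_i) / (1.0 + np.exp(eta_i))    # default probability, equation 4.4

print(eta_i, p_i)                              # logit(p_i) equals eta_i, equation 4.3
```

Whatever value the linear predictor takes, the logistic transformation keeps the predicted probability strictly between 0 and 1, which is what makes the model suitable for a binary default indicator.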
The Bayesian model is formulated by specifying prior distributions on the regression coefficients.
The prior distribution p(β_j | θ_j) may be any proper probability density function, and θ_j may
be a scalar or a parameter vector. The interpretation of the model parameters depends on the
choice of the distribution, but may for instance include a measure of the center (e.g. mean)
and spread of the prior (e.g. standard deviation). By specifying different values of θ_j, the
regression coefficients may have very different priors even if the distribution function, p(·), is the same.
In statistical modeling, such as the frequentist logistic regression, the aim is, among other things,
to estimate the β-coefficients of the predictor variables by using the maximum likelihood
with arbitrary values of the coefficients for the set of predictors and determines the direction and
size of the change in the coefficients that will maximize the likelihood of obtaining the observed
frequencies. Then residuals for the predictive model, based on those coefficients, are tested,
and another determination of the direction and size of change in the coefficients is made, until the
coefficients change very little, which implies that convergence is reached (Tabachnick & Fidell,
2008).
However, in Bayesian statistics these β-coefficients are derived directly from the posterior
probability distribution of the unknown parameters. Any feature of the posterior distribution
is legitimate for Bayesian inference, e.g. moments, quartiles, and highest posterior density regions.
All these quantities can be expressed in terms of posterior expectations of functions of θ. The posterior expectation of a function f(θ) is:
\[ E[f(\theta) \mid y] = \frac{\int f(\theta)\,p(\theta)\,p(y \mid \theta)\,d\theta}{\int p(\theta)\,p(y \mid \theta)\,d\theta} \qquad (4.6) \]
In multidimensional Bayesian models, the objective is often to retrieve a scalar function
of the parameter vector θ with respect to a single parameter of interest, say θ_i, which could
be a regression coefficient. This involves finding the marginal posterior distribution of the
parameter, p(θ_i | y), and involves integration of the posterior distribution p(θ | y) over all other parameters:
\[ p(\theta_i \mid y) = \int_a^b p(\theta_i \mid \theta_{k \neq i}, y)\,p(\theta_{k \neq i})\,d\theta_{k \neq i} \qquad (4.7) \]
In simple models it will be possible to derive the marginal distribution by hand, which to
some extent is described by Smith (1991). However, as the number of dimensions increases, so
does the difficulty of these calculations. A major limitation to more widespread imple-
mentation of Bayesian approaches is that obtaining the posterior distribution often requires
Until recently, acknowledging the full complexity and structure of many applications was
difficult and required the development of specific methodology and purpose-built software.
Now, MCMC methods provide a unifying framework within which many complex problems
can be analyzed using generic software (Gilks, Richardson, & Spiegelhalter, 1996). In the
following section the general ideas behind MCMC simulations are presented. Many ways of
constructing the Markov chains exist, but the most commonly used is the Gibbs sampler, which
has had a major influence on the increase of Bayesian applications. Since the Gibbs sampler is
a special case of the Metropolis-Hastings algorithm, both of these will be introduced in section
4.1.2. Finally, different convergence criteria for assessing whether the Markov chains
the statistical physics literature, and has been used for a decade in spatial statistics and image
analysis. In the last few years, MCMC has had a profound effect on Bayesian statistics, and
has also found applications in frequentist statistics (Gilks, Richardson, & Spiegelhalter,
1996).
MCMC methods are a class of algorithms used for simulating samples from a posterior distribution;
they construct a Markov chain that has the desired posterior distribution as its stationary distribution.⁸
⁸ See: Walsh (2004), Roberts & Rosenthal (1998) and Gilks, Richardson, & Spiegelhalter (1996).
MCMC is Monte Carlo integration using Markov chains. As mentioned in
section 4.1, Bayesian statistics often includes integration over possibly multidimensional prob-
ability distributions to make inference about the model parameters or to make predictions.
Monte Carlo integration draws samples from the required distribution, and then forms sam-
ple averages to approximate expectations. Markov chain Monte Carlo draws these samples
by running a cleverly constructed Markov chain for a long time. In this section MCMC is
introduced as the method for evaluating the expression in equation 4.6 (Gilks, Richardson,
& Spiegelhalter, 1996). Since MCMC has two constituent parts, Monte Carlo integration and
\[ \int_a^b g(x)\,dx \qquad (4.8) \]
g(x) can be decomposed into the product of a function, f(x), and a probability
distribution, π(x), defined on the interval (a, b). Then the integral in equation
4.8 can be expressed as the expectation of f(x) over π(x) on the interval (a, b):
\[ \int_a^b g(x)\,dx = \int_a^b f(x)\,\pi(x)\,dx = E_{\pi(x)}[f(x)] \qquad (4.9) \]
Monte Carlo integration evaluates E_{π(x)}[f(x)] by drawing random samples {X_t, t = 1, ..., n}
from the probability distribution π(x). The population mean of f(x), μ, can be estimated by the sample mean:
\[ \hat{\mu} = \frac{1}{n}\sum_{t=1}^{n} f(X_t) \qquad (4.10) \]
where n is the number of samples drawn. Note that n is not the size of the fixed data sample.
Given that the samples {X_t} are independent, the law of large numbers ensures that
the approximation can be made as accurate as desired, since the central limit theorem gives:
\[ \sqrt{n}\,(\hat{\mu} - \mu) \to N(0, \sigma^2) \qquad (4.11) \]
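The following sketch illustrates equations 4.8-4.10 on a toy integral; the particular choices of g, f and π are invented for the example. With f(x) = x² and π(x) the Uniform(0, 1) density, the integral of g(x) = f(x)·π(x) over (0, 1) is exactly 1/3, and a sample average of f over draws from π approximates it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy decomposition g(x) = f(x) * pi(x) on (0, 1):
# f(x) = x**2 and pi(x) the Uniform(0, 1) density, so the exact integral is 1/3.
n = 100_000
x = rng.uniform(0.0, 1.0, size=n)   # independent draws from pi(x)
estimate = (x**2).mean()             # sample average, equation 4.10

print(estimate)   # close to 1/3; the error shrinks like 1/sqrt(n), cf. equation 4.11
```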
However, it is not always possible to draw samples {X_t} independently from π(x), since
the density can be of non-standard form. But these draws do not have to be independent,
as long as {X_t} can be drawn throughout the support of π(x) by some suitable process. One
possible process is the Markov chain, where π(x) is the stationary distribution of the chain
(Gilks, Richardson, & Spiegelhalter, 1996). The concept of Markov chains will be described
in the current time point depends only on the state of the variable in the previous time point
Let X_t denote the value of a random variable at time t, and let the state space refer to
the range of possible X-values. When applying MCMC, the state space is of such a high-
dimensional nature that direct computation about π(x) is impossible. The distribution of
generate a sequence of random variables, {X_0, X_1, X_2, ..., X_n}, in such a way that at each time t > 0
the next state, X_{t+1}, depends only on the state before, meaning that X_{t+1} is sampled from
a distribution p(X_{t+1} | X_t). This means that, given X_t, the next state X_{t+1} does not depend
further on the history of the chain {X_0, X_1, ..., X_{t-1}}, only on the current state X_t. When this
holds, the process is called a Markov process, which creates the Markov chain
of random variables. A particular chain is defined most critically by its transition probabilities
(or, more familiarly, the transition kernel), Pr(i, j) = Pr(i → j), which is the probability that a
If the chain is simulated long enough, so that t → ∞, the chain gradually forgets the initial
state p(X_t | X_0), and the distribution of X_n will eventually converge to π(x), called the stationary
distribution, and the draws thus increasingly look like dependent samples drawn from that stationary
distribution.
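The idea of a stationary distribution can be seen in a deliberately tiny example (a two-state chain with an invented transition kernel, unrelated to the thesis's models): repeatedly applying the transition probabilities makes the distribution of X_t settle down, regardless of the starting state.

```python
import numpy as np

# Invented two-state transition kernel Pr(i -> j); each row sums to one.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

dist = np.array([1.0, 0.0])   # start with certainty in state 0
for _ in range(50):           # simulate "long enough"
    dist = dist @ P           # each step forgets a bit more of the starting state

print(dist)                   # approaches the stationary distribution (0.8, 0.2)
```

Starting instead from (0.0, 1.0) gives the same limit, which is exactly the "forgetting the initial state" property described above.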
The term burn-in refers to the number of iterations, s, it takes for the chain to converge
which will be presented in section 4.1.3. The output generated from the Markov chain can now
be used to estimate E[f(X)], where X has the distribution π(x). Usually the burn-in samples
are discarded in the calculation, so that (Gilks, Richardson, & Spiegelhalter, 1996):
\[ E[f(X)] = \frac{1}{n - s}\sum_{t=s+1}^{n} f(X_t) \qquad (4.13) \]
Several algorithms for creating the Markov chains exist, but most of them are built up
from the same basis, the Metropolis-Hastings algorithm. The fundamentals of the algorithm
π(x) that has the exact same features as the distribution of interest, the posterior distribution
π(x). The earliest MCMC algorithm is the Metropolis algorithm, introduced by Metropolis
and Ulam (1949) and further described by Metropolis et al. (1953). Hastings (1970) made a generalization of it, leading to the Metropolis-Hastings
algorithm. Geman and Geman (1984) analyzed an image dataset by using what is now called
the Gibbs sampler, which is a special case of the Metropolis-Hastings algorithm (Che & Xu, 2010).
All these algorithms can draw a sequence of samples from the joint distribution of two or more
variables. The Gibbs sampler is the simplest MCMC algorithm and will briefly be presented
below. Since PROC MCMC in SAS uses the random-walk Metropolis (RWM hereafter) algorithm, which
is a special case of the Metropolis algorithm, the Metropolis algorithm will also be described
below.
The reason why the algorithms work is beyond the scope of this thesis, but more detailed
descriptions and proofs can be found in Gilks, Richardson, & Spiegelhalter (1996), Chen, Shao, & Ibrahim
The Gibbs sampler, named after the physicist Josiah W. Gibbs, is a special case of the Metropolis-Hastings sampling algorithm, where the
random value is always accepted. The task remains to specify how to construct a Markov
The key to Gibbs sampling is that it only considers univariate conditional distributions -
the distribution when all variables except the one under consideration at time t are held fixed.
Such conditional distributions are easier to simulate from than complex joint distributions and
usually have simple forms (Walsh, 2004). The sampler can be efficient when the parameters
are not highly dependent on each other and the full conditional distributions are easy to sample
To introduce the Gibbs sampler, suppose θ is the parameter vector, which can be expressed
as θ = (θ_1, θ_2, ..., θ_k)′, p(y | θ) is the likelihood, and π(θ) is the prior distribution. The full
The idea of the sampler is that it is much easier and more efficient to consider a sequence of
univariate conditional distributions than to work with the full joint probability distribution. The Gibbs sampler can be summarized as follows (SAS Institute,
2008):
1. Set t = 0, and choose an arbitrary initial value θ^(0) = {θ_1^(0), ..., θ_k^(0)}.
2. Generate each component of θ in turn by sampling from its full conditional distribution, given the data and the current values of all the other components.
3. Set t = t + 1. If t < T, the number of desired samples, return to step 2. Otherwise, stop.
As mentioned above, the power of Gibbs sampling is that the joint distribution of the parame-
ters will converge to the joint probability of the parameters given the observed data (Rouchka,
2008).
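As a minimal illustration of the scheme above, the sketch below applies it to a standard textbook target, a bivariate normal with known correlation (not one of the thesis's credit models), alternating draws from the two univariate full conditional distributions:

```python
import numpy as np

rng = np.random.default_rng(42)

# Target: (theta1, theta2) bivariate standard normal with correlation rho.
# Full conditionals: theta1 | theta2 ~ N(rho * theta2, 1 - rho**2), and symmetrically for theta2.
rho = 0.7
T, burn_in = 10_000, 1_000
theta1, theta2 = 0.0, 0.0          # arbitrary initial values (step 1)
samples = np.empty((T, 2))

for t in range(T):                 # steps 2 and 3: update each component in turn
    theta1 = rng.normal(rho * theta2, np.sqrt(1 - rho**2))
    theta2 = rng.normal(rho * theta1, np.sqrt(1 - rho**2))
    samples[t] = theta1, theta2

kept = samples[burn_in:]           # discard the burn-in draws
print(np.corrcoef(kept.T)[0, 1])   # close to rho = 0.7
```

Each draw uses only a univariate normal, yet the retained pairs reproduce the joint behavior of the target, which is precisely the point made in the paragraph above.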
The Metropolis algorithm is named after the scientist Nicholas C. Metropolis. The algorithm is simple but practical, and can be used to
obtain random samples from an arbitrarily complicated target distribution of any dimension that
T samples from a univariate distribution with probability density function f(θ | y), and θ_t as
the t'th sample from f. To use the Metropolis algorithm, we need an initial value θ_0
and a symmetric proposal density q(θ_{t+1} | θ_t). The proposal distribution should be an easy
distribution from which to sample, and it must be such that q(θ_{t+1} | θ_t) = q(θ_t | θ_{t+1}), meaning
that the likelihood of jumping to θ_{t+1} from θ_t is the same as the likelihood of jumping back to
θ_t from θ_{t+1}. The most common choice of proposal distribution is the normal distribution
N(θ_t, σ) with a fixed σ. For the (t + 1)'th iteration, the algorithm generates a sample θ_new from
q(· | θ_t) based on the current sample θ_t, and it makes a decision to either accept or reject the
new sample. If the new sample is accepted, the algorithm repeats itself by starting at the
new sample, whereas if the sample is rejected, the algorithm starts at the current point and
repeats. In theory the algorithm is self-repeating, but in practice the total number of iterations, T, is decided on in advance. The algorithm can be summarized as follows:
1. Set t = 0. Choose a starting point θ0. This can be any initial value as long as f(θ0 | y) > 0.
2. Generate a new sample, θnew, from the proposal distribution q(· | θt).
3. Calculate the acceptance probability
\[ \alpha = \min\left\{ \frac{f(\theta_{new} \mid y)}{f(\theta_t \mid y)},\ 1 \right\} \tag{4.16} \]
4. With probability α, accept the new sample and set θt+1 = θnew; otherwise set θt+1 = θt.
5. Set t = t + 1. If t < T, the number of desired samples, return to step 2. Otherwise stop.
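The following Python sketch implements the steps above for a toy univariate target; it is an illustration only (the thesis itself relies on the random-walk Metropolis implementation in PROC MCMC), and the target density and tuning values are assumptions made for the example:

```python
import numpy as np

def random_walk_metropolis(log_post, theta0, n_iter=10_000, step_sd=1.0, seed=0):
    """Random-walk Metropolis with a symmetric N(theta_t, step_sd^2) proposal.
    `log_post` is the log of f(theta | y), known up to a normalizing constant."""
    rng = np.random.default_rng(seed)
    theta = theta0
    chain = np.empty(n_iter)
    for t in range(n_iter):
        proposal = rng.normal(theta, step_sd)              # draw theta_new from q(. | theta_t)
        log_alpha = log_post(proposal) - log_post(theta)   # log of f(new | y) / f(t | y)
        if np.log(rng.uniform()) < log_alpha:              # accept with probability alpha
            theta = proposal
        chain[t] = theta                                   # if rejected, stay at theta_t
    return chain

# Toy target: posterior proportional to a N(1, 0.5^2) density
chain = random_walk_metropolis(lambda th: -0.5 * ((th - 1.0) / 0.5) ** 2, theta0=0.0)
print(chain[2_000:].mean())   # approx. 1.0 after discarding burn-in
```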
The algorithm defines a chain of random variates whose distribution will converge to the desired posterior distribution f(θ | y).
Suppose θ = (θ1, θ2, ..., θk) is the parameter vector. To begin the Metropolis algorithm, select an initial value for each θi and use a multivariate version of the proposal distribution q(· | ·), such as a multivariate normal. The remaining steps are the same as those described above, and this Markov chain eventually converges to the joint posterior distribution.
Once the chain has converged, the simulated draws can be used to summarize the posterior distribution or calculate any relevant quantities of interest. There are usually two issues regarding the treatment of the simulated draws. First, we have to decide whether the Markov chain has reached stationarity, that is, the desired posterior distribution. Secondly, we have to determine the number of iterations to keep after the Markov chain has reached stationarity. Convergence diagnostics can help to solve these issues. It is important to keep in mind that there are no conclusive tests that can tell us when the Markov chain has converged to its stationary distribution.
In the following part, four convergence criteria will briefly be presented, all of which are standard output in SAS.
A trace plot of the sampled values against the iteration number is a simple graphical check of convergence. It tells whether the chain has reached its stationary distribution, if the chain needs a longer burn-in period, or if the chain needs to be simulated over a longer period of
time. The aspects that are most identifiable from a trace plot are a relatively constant mean
and variance. A chain that mixes well traverses its posterior space rapidly, and can jump from
one remote region of the posterior to another in relatively few steps, whereas a chain is said
to be poorly mixing if it stays within small regions of the parameter space for long periods of time.
The trace plot in the upper-left displays a perfect trace plot. The trace plot indicates that the chain could have reached the right distribution, since the center of the chain appears stable with a relatively constant mean and variance.
The upper-right trace plot illustrates a chain that starts at a very remote initial value and makes its way to the target distribution. The first few hundred observations should be discarded as burn-in before the chain is used for inference.
The trace plot in the lower-left demonstrates an instance of marginal mixing. The chain is taking only small steps and does not traverse its distribution quickly. Since this type of trace plot is typically associated with high autocorrelation among the samples, it is suggested to run the chain for much longer in order to obtain a few thousand independent samples.
The lower-right trace plot shows a chain that is mixing very slowly and it offers no evidence
of convergence. This type of chain is entirely unsuitable for making parameter inferences (SAS
Institute, 2008).
The Geweke diagnostic compares the mean of an early part of the Markov chain with the mean of a later part to detect failure of convergence. If the mean values of the parameters in the two time intervals are somewhat close to each other, we can assume that the two different parts of the chain have similar locations in the state space, and it is assumed that the two samples come from the same distribution.
By default the Geweke test splits the sample, after removing a burn-in period, into two
parts: the first 10% and the last 50%. A modified z-test, referred to as the Geweke z-score, is used to compare the two sub-samples. A value larger than 2 indicates that the mean of the series is still drifting, and a longer burn-in is required before monitoring the chain can begin (Walsh, 2004).
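As a rough illustration of how such a z-score can be computed, consider the simplified Python sketch below. It compares the means of the first 10% and the last 50% of a chain; a full implementation (as in SAS) would use spectral-density estimates of the variances to account for autocorrelation, which this sketch deliberately omits:

```python
import numpy as np

def geweke_z(chain, first=0.10, last=0.50):
    """Simplified Geweke z-score comparing the first 10% with the last 50% of the chain.
    Plain sample variances are used here purely for illustration."""
    chain = np.asarray(chain)
    a = chain[: int(first * len(chain))]
    b = chain[-int(last * len(chain)):]
    return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

rng = np.random.default_rng(0)
print(abs(geweke_z(rng.normal(size=5_000))))   # typically well below 2 for a stationary chain
```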
4.1.3.3 Autocorrelation
Another way to assess convergence is to evaluate the autocorrelation between the draws of
the Markov chain, which is a measure of dependency among Markov chain samples. We would
expect the k th lag autocorrelation to be smaller as k increases, which means that our 2nd and
50th draws should be less correlated than our 2nd and 4th draws. If autocorrelation is still
relatively high for higher values of k, this indicates a high degree of correlation between our draws and poor mixing of the chain. A closely related measure of mixing is the effective sample size (ESS hereafter). The ESS is a
quantity that estimates the number of independent samples obtained from a set of samples.
\[ \mathrm{ESS} = \frac{n}{\tau} = \frac{n}{1 + 2\sum_{k=1}^{\infty} \rho_k(\theta)} \tag{4.17} \]
where n is the actual posterior sample size and ρk(θ) is the autocorrelation of lag k for θ. The quantity τ is referred to as the autocorrelation time (SAS Institute, 2008). Because the autocorrelations are typically positive, the ESS is usually less than the actual posterior sample size.
A much smaller ESS than the actual size indicates poor mixing of the Markov chain (Che &
Xu, 2010).
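For illustration, equation (4.17) can be approximated directly from the draws; the Python sketch below sums estimated autocorrelations until they first become negative, which is one common truncation rule (an assumption of this sketch, not a description of the SAS computation):

```python
import numpy as np

def effective_sample_size(chain, max_lag=200):
    """Approximate ESS = n / (1 + 2 * sum_k rho_k), cf. eq. (4.17)."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    # autocorrelation estimates for lags 0, 1, 2, ...
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
    tau = 1.0
    for k in range(1, min(max_lag, n - 1)):
        if acf[k] < 0:          # stop summing at the first negative autocorrelation
            break
        tau += 2.0 * acf[k]
    return n / tau

rng = np.random.default_rng(0)
iid = rng.normal(size=5_000)
print(effective_sample_size(iid))   # close to 5000 for independent draws
```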
PART IV
A Validation Framework
Chapter 5
5 Model Validation
Sound credit rating models are important for all financial institutions as they form the basis for calculating risk premia, pricing credits, and allocating economic capital. The importance of sound validation techniques for rating systems stems from the fact that credit rating models of poor quality could lead to suboptimal capital allocation (Satchell & Xia, 2006). This implies that the field of model validation is one of the major challenges for financial institutions. A central aspect is the models' ability to discriminate between the defaulting and non-defaulting customers ex ante. In this thesis, this ability is referred to as the discriminatory power of the credit rating model (Satchell & Xia, 2006).
The most popular validation technique used in practice is the Cumulative Accuracy Prole
(CAP hereafter) and its summary statistic, the Accuracy Ratio. A concept similar to CAP is
the Receiver Operating Characteristic (ROC hereafter) curve and its summary statistic, the
area below the ROC curve (AUC hereafter) (Engelmann, Hayden, & Tasche, 2003a). Both
measures will be reviewed in section 5.2, whereas only the ROC curve and the AUC will
be applied when comparing Our Bank's current approaches to credit scoring with the esti-
mated Bayesian logistic regression models. Before introducing the validation techniques the
validation framework for the current thesis will be presented in section 5.1.
Overall, the purpose of the validation is to:
- Determine how well the estimated models perform in terms of prediction accuracy.
- Ensure that a model has not been overfitted and that its performance is reliable and well understood.
- Confirm that the modeling approach, not just an individual model, is robust through time.
Model validation is an essential step in the development of a credit scoring model. We aim
to perform tests in a rigorous and robust manner, while also protecting against unintended
errors. The performance statistics for credit scoring models can be highly sensitive to the data
sample used for validation. To avoid embedding unwanted sample dependency, quantitative
models should be developed and validated using some type of out-of-sample, out-of-universe, and out-of-time⁹ testing approach on panel or cross-sectional data (Sobehart, Keenan, & Stein, 2001).
⁹ Out-of-sample refers to observations for customers that are not included in the sample used to build the model. Out-of-time refers to observations that are not contemporary with the training sample. Out-of-universe refers to observations whose distribution differs from the population used to build the model.
The statistical literature on model validation is quite broad. Since we do not attempt to cover this topic exhaustively, we present in the following a methodology explained by Sobehart, Keenan, & Stein (2001), which brings together several angles of the validation literature and is useful for the evaluation of quantitative credit scoring models. A schematic of the framework is shown in figure 5.1.
Figure 5.1 splits the model testing procedure along two dimensions: time (horizontal axis),
and the population of customers (vertical axis). The least restrictive validation procedure is
represented by the upper-left quadrant, and the most stringent by the lower-right quadrant.
Dark circles represent training data and white circles represent validation data. Gray circles
represent data that may or may not be used for validation (Sobehart, Keenan, & Stein, 2001).
The upper-left quadrant illustrates the approach in which the validation data is chosen completely at random from the full training data. An assumption in connection to this procedure is that the data stays stable over time. Since the data is drawn randomly, this approach validates the estimated model across the population of customers, preserving its original distribution.
The upper-right quadrant describes one of the most common validation procedures. Here, data for model training are chosen from any time period prior to a certain date, and validation data are selected from periods only after that date from the same population. Since the sample of customers is drawn from the population at random, this approach also validates the model across the population, but now out of time.
The lower-left quadrant represents the situation in which the data are segmented into training and validation sets containing no customers in common. In this general situation the validation set is out of sample. If the population of the validation set is different from that of the training set, the data is out-of-universe. Because the temporal nature of the data is not
used for constructing this type of out-of-sample test, this approach validates the model homo-
geneously in time and will not identify time dependence in the data. Thus, the assumption of
this procedure is that the relevant characteristics of the population do not vary with time.
Finally, the most stringent procedure is shown in the lower-right quadrant and should be the preferred sampling method for credit scoring models. In addition to being segmented in time, the data are also segmented across the population of customers. Non-overlapping sets can be selected according to the peculiarities of the population of customers and their importance.
Because default events are rare for credit scoring models, it is often impractical to create a model using one dataset and then validate it on a separate hold-out dataset composed of completely independent cross-sectional data. While such an out-of-sample and out-of-time test would undoubtedly be the best way to compare model performance if default data were widely available, this is usually not the case. As a result, most institutions, including Our Bank, face the following trade-off:
- If too many defaulters are left out of the in-sample dataset, estimation of the model parameters becomes unreliable.
- If too many defaulters are left out of the hold-out dataset, it becomes exceedingly difficult to evaluate the true model performance due to severe reductions in statistical power.
Sobehart, Keenan, & Stein (2001) present an effective approach called walk-forward, which
will be used to estimate and validate the stability of the estimated models in the current thesis.
A specific year, here 2002, is chosen. The model is estimated using all data available on, or before, the selected year, which is called the training data. Once the model form and parameters are established, the model performance can be validated using the data in the following year, 2003. Note that the validation dataset in 2003 is out-of-time for customers existing in the previous years, and out-of-sample for all the customers whose data become available after 2002. Next, the data in 2003 is added to the training data, which implies that all of the data through 2003 are used to fit the model, and 2004 is then used to validate it. This procedure is repeated year by year until the last validation year, 2010, is reached.
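A minimal Python sketch of this splitting scheme is shown below; it is illustrative only, and the column name `year` is an assumption about how the data might be organized rather than a description of the actual dataset:

```python
import pandas as pd

def walk_forward_splits(df, year_col="year", first_train_year=2002, last_year=2010):
    """Yield (validation year, training data, validation data) following the
    walk-forward scheme: train on all years up to t-1, validate on year t, roll forward."""
    for val_year in range(first_train_year + 1, last_year + 1):
        train = df[df[year_col] <= val_year - 1]
        valid = df[df[year_col] == val_year]
        yield val_year, train, valid

# Example with a dummy frame
df = pd.DataFrame({"year": [2002, 2003, 2003, 2004], "default": [0, 1, 0, 0]})
for val_year, train, valid in walk_forward_splits(df, last_year=2004):
    print(val_year, len(train), len(valid))
```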
Our Bank uses a validation technique called the Power-curve with its appertaining summary statistic called Powerstat. They are identical to the CAP and the Accuracy Ratio, and similar to the ROC curve and the AUC. The ROC curve and AUC are standard outputs when utilizing the LOGISTIC procedure in SAS, whereas the Power-curve and Powerstat statistics have to be computed manually. Engelmann, Hayden & Tasche (2003) demonstrate that the summary statistics of the CAP and the ROC are equivalent and that both methods are reliable even for small datasets. Both validation techniques will be presented below, but due to time limitations only the ROC curve and AUC will be reported when comparing Our Bank's current models for credit scoring with the estimated Bayesian logistic regression models.
To get an understanding of the Power-curve, consider a credit scoring model that produces a continuous rating score. A high rating score indicates a low Probability of Default (hereafter PD). By assigning scores to the customers in the data used for the validation, and checking whether the customers default over the next period or remain solvent, we can evaluate the quality of the scores.
To plot the Power-curve, the customers are first ordered by PD from highest risk to lowest risk on the x-axis, that is, from the customer with the lowest score to the customer with the highest score, and the y-axis shows the share of defaulters (see figure 5.3). For a given fraction x of the ordered customers, the Power-curve shows the percentage of all defaulters whose rating scores are equal to or lower than the maximum score within that fraction.
A perfect credit scoring model would assign the lowest scores to the defaulters. In this case the Power-curve increases linearly and then stays at one. For a random model without
any discriminative power, the fraction x of all customers with the lowest rating scores will
contain x% of all defaulters. Real credit scoring models will be somewhere in between the two
extremes.
In practice it is convenient to have a single measure that summarizes the predictive accuracy in a number. This measure is known as the Powerstat and is defined as the ratio of the area, aE, between the Power-curve of the estimated model (rating model) and the Power-curve of the non-informative model (random model), to the area, aP, between the Power-curve of the perfect model and the Power-curve of the random model:
\[ \mathrm{Powerstat} = \frac{a_E}{a_P} \tag{5.1} \]
The Powerstat is a fraction between 0 and 1. Models with a Powerstat close to 0 offer little advantage over the random model, while those with a Powerstat near 1 display almost perfect predictive power.
The construction of the ROC curve is a bit more complicated than the Power-curve. To get an understanding of the properties of the ROC curve, figure 5.4 shows possible distributions of the rating scores for defaulting and non-defaulting customers.
Figure 5.4: Distribution of Rating Scores for Defaulting and Non-defaulting Customers
For a perfect credit scoring model, the left distribution and the right distribution in figure 5.4 would be completely separated. If we want to determine from the rating score which customers will fully repay during the next period and which customers will default, one possibility is to introduce a cutoff value, C, as in figure 5.4. With a given cutoff value, C, each customer with a rating score lower than C is classified as a potential defaulter and each customer with a rating score higher than C as a non-defaulter. Four outcomes are then possible:
1. If the rating score is below the cutoff value C and the customer defaults subsequently, the classification is correct.
2. If the rating score is below the cutoff value but the customer does not default, the customer is wrongly classified as a defaulter (a false alarm).
3. If the rating score is above the cutoff value and the customer does not default, the classification is correct.
4. If the rating score is above the cutoff value but the customer defaults, the defaulter is missed by the model.
Since the cost associated with a missed defaulter often exceeds the cost associated with rejecting a non-defaulting customer, it is usually considered worse to assign a defaulter to the non-defaulting group than to incorrectly assign a non-defaulting customer to the defaulting group (Sobehart, Keenan, & Stein, 2001).
The ROC curve can be constructed using different notations for the x-axis and y-axis. Using the notation from Engelmann, Hayden, & Tasche (2003b) we define the hit rate, HR(C), as:
\[ HR(C) = \frac{H(C)}{N_D} \tag{5.2} \]
where H(C) is the number of defaulters predicted correctly with the cutoff value C, and ND is the total number of defaulters. HR(C) is equal to the light green area on the left hand side of the cutoff value C in figure 5.4. The false alarm rate, FAR(C), is defined as:
\[ FAR(C) = \frac{F(C)}{N_{ND}} \tag{5.3} \]
where F(C) is the number of false alarms, that is, the number of non-defaulters that were classified incorrectly as defaulters by using the cutoff value C. The total number of non-defaulters is denoted by NND. FAR(C) is equal to the dark green area on the left hand side of the cutoff value C in figure 5.4.
The ROC curve is then constructed as follows. For all cutoff values, C, that are contained in the range of the rating scores, the quantities HR(C) and FAR(C) are calculated. The ROC curve is the plot of HR(C) against FAR(C) for all these cutoff values.
A model's performance is better the steeper the ROC curve is and the closer the ROC
curve's position is to the point (0, 1). AUC is the summary statistic for the ROC curve and
summarizes the area under the curve. The steeper the ROC curve is, the higher the AUC is.
\[ AUC = \int_0^1 HR(FAR)\, d(FAR) \tag{5.4} \]
The AUC can be interpreted as the average power of the estimated model on the default/non-default classification over all possible cutoff values C. AUC is 0.5 for a random model without discriminative power and 1 for a perfect model. It is between 0.5 and 1 for any reasonable credit scoring model.
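Equations (5.2)-(5.4) can be illustrated directly from a vector of rating scores and default indicators. The following Python sketch is an assumption-laden toy example (simulated scores, defaulters given lower scores on average) and is not the PROC LOGISTIC computation used in the thesis:

```python
import numpy as np

def roc_curve_points(scores, defaulted):
    """Compute (FAR(C), HR(C)) for every cutoff C in the range of the rating scores.
    A customer with a score at or below the cutoff is classified as a potential defaulter."""
    scores = np.asarray(scores, dtype=float)
    defaulted = np.asarray(defaulted, dtype=bool)
    cutoffs = np.concatenate(([-np.inf], np.sort(np.unique(scores)), [np.inf]))
    hr = np.array([(defaulted & (scores <= c)).sum() / defaulted.sum() for c in cutoffs])
    far = np.array([(~defaulted & (scores <= c)).sum() / (~defaulted).sum() for c in cutoffs])
    return far, hr

def auc(far, hr):
    """Area under the ROC curve via the trapezoidal rule (eq. 5.4)."""
    return np.sum(np.diff(far) * (hr[1:] + hr[:-1]) / 2)

rng = np.random.default_rng(0)
y = rng.uniform(size=2_000) < 0.05                      # roughly 5% defaulters
score = rng.normal(loc=np.where(y, -1.0, 0.0))          # defaulters get lower scores on average
far, hr = roc_curve_points(score, y)
print(auc(far, hr))                                     # noticeably above 0.5
```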
As mentioned, Engelmann, Hayden, and Tasche (2003) analyze the statistical properties of the Power-curve and the ROC curve. They demonstrate the correspondence of the Powerstat and the AUC, which indicates that these summary statistics are equivalent and that the relationship between the two can be calculated as (Engelmann, Hayden, & Tasche, 2003a):
\[ \mathrm{Powerstat} = 2 \cdot AUC - 1 \]
Hamerle, Rauhmeier, & Rösch (2003) discuss the properties of the Powerstat and AUC and conclude that the sample space of these measures strongly depends on the structure of the true default probabilities in the underlying portfolio. This implies that, for example, a Powerstat near one might not be attainable for a given portfolio. It follows that credit scoring models cannot be compared across time and across portfolios. Therefore, Powerstat and AUC are only comparable when they are based on the same underlying portfolio.
PART III
The Empirical Analysis
Chapter 6
6 Empirical Analysis
Nowadays several approaches for credit scoring analysis exist, with frequentist logistic regression being the most utilized method (Steenackers & Goovaerts, 1989; Laitinen, 1999; Alfò, Caiazza, & Trovato, 2005). However, as mentioned in the introduction, the objective of the current thesis focuses entirely on whether a Bayesian logistic regression model is able to outperform Our Bank's current approaches in terms of predictive ability. In the following sections the different approaches will be empirically analyzed, using real data provided by Our Bank.
Initially, the data will briefly be described, followed by an estimation of the expert models and the frequentist logistic regression models, both methods already applied in Our Bank. Next, the Bayesian logistic regression models will be estimated and evaluated. In that respect several steps are carried out:
- The choice of priors will be specified, where the expert knowledge is transformed into prior information.
- The convergence criteria for the MCMC simulations will be assessed in order to ensure that the Markov chains have converged.
- Prior influence will be assessed by comparing the performance of the different Bayesian models.
- By utilizing a walk-forward estimation method, the influence of adding more data to the training sample will be evaluated.
- By evaluating the AUCs, the chosen Bayesian model will be compared with the current models in Our Bank.
The estimation and evaluation will be carried out on both the RSI and the Real Estate segments, in order to assess the models on both a large and a small customer segment. The empirical analysis will be carried out using SAS software. Details on SAS syntax and outputs can be found in the appendices.
6.1 Data
The data basis for the empirical analysis consists of questionnaire data, which has been gathered by the financial advisers in Our Bank from 2002 to 2010. It represents information regarding individual customers. The purpose of these questionnaires is to collect data on customer characteristics that help to predict defaulters in the future by using a statistical model.
The data consists of 67618 customers divided into two segments. The two segments are, as
mentioned in section 1.2, RSI and Real Estate. The size of the two segments and the number of defaulting customers within each are described below. In total there are 1794 recorded defaults in the dataset. The relative sizes of the segments make it possible to compare the performance of the models on a segment with a large amount of customers and a segment with a small amount of customers, as stated in the problem statement. The amount of data available in the different years for the two segments can be seen in Appendix A.1, table 1.
Table 2 shows the different variables available for the two customer segments:
The 16 variables for RSI can be divided into five categories: Strategy and management, Industry assessment, Industry position, Financial reporting and Risk exposure. The 18 explanatory variables for Real Estate can be divided into three groups: Strategy and management,
Due to confidentiality the original questionnaires are not enclosed in the thesis.
The original scale of the variables ranged from A to E. We have transformed these scale values into values between 1 and 5, where 1 is considered to be the best value a customer can achieve on a question. In other words, high values are associated with a better creditworthiness. Category E (or 5) refers to a "don't know" answer and is therefore irrelevant and deleted from the data. Hence, only values ranging between 1 and 4 are considered for the analysis (see Appendix A.2, algorithm 1).
Since SAS only considers cases with no missing values, the amount of useful data is significantly reduced when the cases with missing values are deleted. Several imputation methods for handling missing data exist. For the purpose of this thesis we decided to apply listwise deletion of cases containing missing values. This approach was chosen assuming that imputation of missing data would bias the results of the analysis to some extent. We are aware of the disadvantages linked to listwise deletion, which arise from the loss of information when incomplete cases are deleted. However, the amount of data remaining was deemed sufficient for the analysis.
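The listwise deletion itself is a one-line operation; the Python/pandas sketch below is purely illustrative (the thesis uses a SAS data step, Appendix A.2), and the column names and the use of the code 5 for category E are assumptions made for the example:

```python
import pandas as pd

# Hypothetical questionnaire extract; 5 codes the original "don't know" category E.
df = pd.DataFrame({"q1": [1, 2, 5, 4], "q2": [3, None, 2, 1], "default": [0, 0, 1, 0]})
df = df.replace(5, pd.NA)        # treat category E / "don't know" as missing
df = df.dropna()                 # listwise deletion: keep only complete cases
print(len(df))                   # 2 complete cases remain
```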
After all missing values have been deleted, the available data is reduced to the following
amount:
Segment       Count        2002   2003   2004   2005   2006   2007   2008   2009   2010   Total
RSI           # Total      1134   2667   2869   2971   2936   3128   2891   2655   2663   23914
RSI           # Defaults     37     82     54     42     71    118    198    139    126     867
Real Estate   # Total         -      6    220    324    417    514    651    697    857    3686
Real Estate   # Defaults      -      -      -      -      1     10     38     31     36     116
As a note, due to the low number of defaulting customers within the Real Estate segment, model estimation is only possible after the default year 2007, after data from the previous years has been merged. Thus, the estimated Real Estate model is validated on 2008, 2009, and 2010.
Furthermore, univariate descriptive investigation of the data shows that the data is rather skewed. The first step in the empirical analysis is therefore to standardize the provided data - a common procedure in the literature when estimating Bayesian generalized linear
models (Gelman, Jakulin, Pittau, & Su, 2008). The objective of standardizing data is to fit the data to the same scale and approximate the data to a normal distribution, by calculating z-scores as:
\[ z = \frac{x - \mu}{\sigma} \tag{6.1} \]
where μ is the mean of the variable, x is the observed value, and σ is the standard deviation.
However, Our Bank currently operates with the inverse cumulative normal distribution. Since the distances between the values on the measured scales from the original data are not necessarily equal, the data is standardized using the inverse cumulative normal distribution instead of the regular normal distribution.
The SAS procedure PROC RANK, together with the normal=blom option, is utilized to
achieve this (see Appendix A.3, algorithm 2). The syntax employs the following equation for
computations:
\[ y_i = \Phi^{-1}\!\left(\frac{r_i - 3/8}{n + 1/4}\right) \tag{6.2} \]
where Φ⁻¹ is the inverse cumulative normal function, ri is the rank of the i'th observation,
and n is the number of non-missing observations. The data is then centered around 0 with
a relatively small standard deviation and scored according to the relative importance of the original values.¹⁰
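For illustration, the same transformation that PROC RANK with the normal=blom option performs can be sketched in Python as follows; the skewed example data is simulated and the sketch is not the SAS code used in the thesis:

```python
import numpy as np
from scipy.stats import norm, rankdata

def blom_scores(x):
    """Inverse cumulative normal (Blom) transformation, cf. eq. (6.2):
    y_i = Phi^{-1}((r_i - 3/8) / (n + 1/4))."""
    x = np.asarray(x, dtype=float)
    r = rankdata(x)                       # ranks of the observations (average ranks for ties)
    n = len(x)
    return norm.ppf((r - 3.0 / 8.0) / (n + 1.0 / 4.0))

skewed = np.random.default_rng(0).exponential(size=1_000)
z = blom_scores(skewed)
print(round(z.mean(), 3), round(z.std(), 3))   # approximately 0 and slightly below 1
```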
Before estimating the Bayesian credit scoring models, Our Bank's current expert model will briefly be introduced and estimated in this section, followed by an estimation of the frequentist logistic regression model in section 6.3. The performance of the expert model will, together with the performance of the frequentist logistic regression model, serve as reference for the empirical analysis. Only the validation AUCs will be presented for these models.
As already mentioned, the expert models are used in Our Bank when there is not enough questionnaire data for Our Bank to utilize a frequentist logistic regression credit scoring model. The expert models were created by highly-educated department managers in Our Bank, who, after consultation, agreed upon weighting the different questions according to their relative importance. An example of the resulting setup is shown below in table 4.
¹⁰ SAS Support webpage: http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000146840.htm
By employing the expert weights together with the actual scores in the questionnaires, Our Bank is able to score a customer with an expert credit score (ES), using the following equation:
\[ ES = \sum_{i=1}^{p} w_i S_i \tag{6.3} \]
where Si is the standardized value of the i'th question, and wi is the weight assigned to
question i by the experts. To obtain the actual performance of the expert models a ROC curve
is produced using the PROC LOGISTIC procedure in SAS (see Appendix A.8.1, algorithm 9
and A.8.2, algorithm 12). In table 5 below the actual AUCs obtained by the expert models
are shown:
Train. year 2002 2002-2003 2002-2004 2002-2005 2002-2006 2002-2007 2002-2008 2002-2009
Val. year 2003 2004 2005 2006 2007 2008 2009 2010
AUC - RSI 0.6918 0.7088 0.6782 0.6729 0.7126 0.6696 0.6300 0.7150
AUC - RE - - - - - 0.6014 0.5178 0.6325
In section 5.2 it was mentioned that an AUC of 0.5 represents a model without any discriminative power. The AUC for the RSI segment seems to fluctuate around 0.69, with the exception of the validation years 2008 and 2009, where the AUC, for unknown reasons, decreases. The AUC for the Real Estate segment in the validation year 2009 gives rise to concern due to its fairly low value. In fact, the expert model for the Real Estate segment is only slightly better than a random model in the validation year 2009.
The current frequentist logistic regression used in Our Bank does not contain variables with a negative influence on the probability of default in the final model. This is the selection criterion chosen by Our Bank in order to reduce the number of parameters included in the final model, so only the variables that increase the probability of default are included. This approach already has a hint of Bayesian thinking, because the parameter selection is based on subjective
judgments from the experts in Our Bank. Since one of the objectives of the current thesis is to compare the performance of Bayesian logistic regression with frequentist logistic regression, it has been deemed necessary to estimate a clean logistic regression using a backward selection criterion with a 25% significance level specified. For SAS syntax and outputs see Appendix A.5, algorithm 3.
Table 6 below shows the AUCs from the estimated frequentist logistic regression models.
Train. year 2002 2002-2003 2002-2004 2002-2005 2002-2006 2002-2007 2002-2008 2002-2009
Val. year 2003 2004 2005 2006 2007 2008 2009 2010
AUC - RSI 0.6994 0.7405 0.6612 0.7461 0.7510 0.7150 0.7280 0.7705
AUC - RE - - - - - 0.6062 0.6754 0.7191
From table 6 we can see that the frequentist logistic regression performs better on the RSI data than it does on the Real Estate data, as was the case with the expert model. However, we keep in mind that the AUCs are only directly comparable when based on the same underlying portfolio.
Comparing the frequentist logistic regression model with the expert model, the logistic regression performs slightly better than the expert model, though only significantly better in
2006 and 2009 for the RSI data and 2009 for the Real Estate data. This is in accordance with
Next we estimate a Bayesian logistic regression for the RSI segment and the Real Estate segment, and compare the performance of the estimated models with the performance of Our Bank's current models. Before estimating the Bayesian models, two preliminary steps are important to highlight:
1. The prior distributions for the regression coefficients have to be specified.
2. The appropriate number of simulation iterations has to be determined for the Markov chains.
Following these two steps, the resulting parameter coefficients and model performance will be presented. Before the Bayesian models can be estimated in the empirical analysis, appropriate priors must be established. The parameters of interest in this
study are the regression coefficients of the Bayesian logistic regression models, and therefore the probability distributions of these coefficients are the priors that have to be specified. The mean, or mode, of the assigned prior distributions serves as the expectation for the coefficients.
As described in section 3.2, Gelman (2002) points out that when the parameters are well-defined and a relatively large sample size is used for estimation, the prior distribution is expected to have little impact on the posterior - a condition called likelihood dominance (Wylie, Muegge, & Thomas, 2006). However, no exact definition of well-identified parameters or of a large sample size exists, so in order to assess the impact of the prior distribution, the posterior results are compared across different prior specifications.
In the analysis both a non-informative prior and two informative priors are applied. A flat prior is considered as the non-informative prior, whereas for the informative priors two different variance parameters, σ1² and σ2², are selected. This approach is also chosen since there is no pre-specified variance parameter provided by Our Bank. Gelman, Jakulin, Pittau, & Su (2008) suggest a prior that has the ability to include prior information to some extent. It is not a strictly informative prior, which includes specific information regarding mean and variance for the unknown parameters, nor is it a fully non-informative prior, such as a uniform prior. The approach used in this thesis is somewhat similar to Gelman et al.'s work, and therefore normally distributed priors with different variance parameters will be utilized, so that each coefficient βi is assigned a prior of the form N(μi, σ²). The next step is then to specify the means μi for the priors, based on the expert knowledge.
As mentioned in section 6.2, Our Bank uses the expert weights to obtain an ES for every customer:
\[ ES = \sum_{i=1}^{p} w_i S_i \tag{6.5} \]
To transform the expert score into a PD, the next step is to perform a simple logistic regression. The variable containing the expert scores (ES) will serve as the independent variable in the equation, and the binary variable that records whether or not a customer defaults will be the dependent variable:
\[ \ln\!\left(\frac{Pr_{\mathrm{default}}}{1 - Pr_{\mathrm{default}}}\right) = a + b \cdot ES \tag{6.6} \]
where a and b are the maximum likelihood estimates resulting from the regression. The b-coefficient (see Appendix A.6, Table 3) serves as the last input to create the prior means. The result from equation 6.6 can then be applied to the customers to calculate the final probability of default:
\[ \ln\!\left(\frac{Pr_{\mathrm{default}}}{1 - Pr_{\mathrm{default}}}\right) = a + b w_1 S_1 + \dots + b w_p S_p \tag{6.7} \]
The term bwi can be summarized into a single coefficient, which is the coefficient assigned to question i, and these coefficients will serve as the means of the prior distributions. The resulting prior means can be found in Appendix A.6.
Though this approach is not perfect, because the prior is not strictly independent of the
available data, it is, however, an attempt to convert already existing expert knowledge within
Our Bank into prior knowledge, which will serve as needed input for the Bayesian analysis.
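For illustration, the construction of the prior means from equations (6.5)-(6.7) can be sketched as follows. The sketch uses simulated data and hypothetical weights, and scikit-learn stands in for the PROC LOGISTIC step used in the thesis; it is meant only to show the mechanics of the conversion:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical standardized answers S (n customers x p questions), expert weights w, defaults y
rng = np.random.default_rng(0)
n, p = 500, 4
S = rng.normal(size=(n, p))
w = np.array([0.4, 0.3, 0.2, 0.1])                     # illustrative expert weights
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-(-3 + S @ w)))).astype(int)

ES = S @ w                                             # expert score, eq. (6.5)
calib = LogisticRegression().fit(ES.reshape(-1, 1), y)
b = calib.coef_[0, 0]                                  # slope from eq. (6.6)

prior_means = b * w                                    # coefficient per question, eq. (6.7)
print(prior_means)
```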
As mentioned earlier, no information regarding the specification of the prior variance exists, and therefore the approach in this thesis is to examine priors with different variances in order to explore what impact changes in the prior have on the posterior results. Three different priors are therefore examined: the two informative priors and the non-informative prior. The simulations are first run with prior 1 (see SAS syntax and output in Appendix A.7.1, Algorithm 5). The results for the validation year 2010 are presented below. When evaluating
the diagnostics for the simulation, it quickly becomes apparent that the Markov chain does
not mix very well. As an example, the resulting plots for the intercept, β0, and beta1, β1, are shown below:
These two figures are quite representative of the remaining parameters, since their diagnostics follow the same pattern. The trace plots demonstrate a pattern known as marginal mixing, which was introduced in section 4.1.3. The problem is that the chain only takes small steps and is not able to traverse its stationary distribution quickly. This form of trace plot usually results from high autocorrelation between the samples, which can also be seen from the autocorrelation graph. With trace plots like the ones in figure 6.1, useful samples cannot be obtained. In order to do so the chain must run for much longer. To reduce autocorrelation the chain must also be thinned, meaning that only a portion of the samples drawn are saved for posterior inference.
In the next example the model is run with 150000 iterations and only every 25th sample is saved, so that 6000 samples are kept to draw posterior inference. Output for the same two parameters, intercept and beta1, is shown below in figure 6.2 (SAS syntax and output can be found in Appendix A.7.2, algorithm 6).
After running the simulation with 150000 iterations, the problems with autocorrelation and marginal mixing are almost solved. Since a few problems concerning the Geweke diagnostics in the validation years 2007 and 2008 remain (values larger than 2), we choose to run the simulation again with 250000 iterations. With 250000 iterations we choose to thin the chain even further, so that only every 50th sample is saved. Thereby 5000 samples are kept to draw posterior inference. Output for the same two parameters, intercept and beta1, is shown below in figure 6.3 (SAS syntax and output can be found in Appendix A.7.3, algorithm 7).
After running the simulation again with 250000 iterations, two instances where the Geweke value is above 2 remain. However, a closer look at the convergence diagnostics confirms that the Markov chains have converged. MCSE/SD is a measure of the relationship between the simulation uncertainty (the Monte Carlo standard error, MCSE) and the parameter uncertainty (the posterior standard deviation, SD). A comparison of
MCSE/SD between the three simulations shows a notable reduction in this ratio, meaning that the simulation uncertainty in the third run has been reduced. Furthermore, the ESS also suggests improvements across the three runs, since the discrepancy between the ESS and the actual sample size is lower for the third run than in the first and second runs.
Since the 250000 iterations made the Markov chains converge, the same simulation set-up will be utilized for the remaining years in the walk-forward approach. The convergence outputs for the remaining simulations in this thesis will only be elaborated on if any problems with convergence occur. To evaluate the Bayesian models obtained from the walk-forward procedure, several results must be compared. First of all, a test of whether the models are overfitting is carried out. Following this, the estimated parameters for the three different Bayesian models are presented, and finally the performance of the three Bayesian models is compared in order to identify the model with the best prediction accuracy.
To test for overfitting, the ROC curve and its summary statistic, the AUC, are produced for both the training data and the validation data. Afterwards a Chi-Square test is used to test if there are any significant differences between the two curves. An example, where 2010 is used as validation data, is shown below in figure 6.4.
Figure 6.4: ROC and AUC for Validation Year 2010 - RSI, Prior 1
Though there is a difference of 0.0314 between the two areas, the difference is not significant at a 5% level of significance, and we therefore conclude that the model does not overfit the data in the validation year 2010. This test is performed for every year of estimation, and the resulting AUCs are shown below.
Train. year 2002 2002-2003 2002-2004 2002-2005 2002-2006 2002-2007 2002-2008 2002-2009
Val. year 2003 2004 2005 2006 2007 2008 2009 2010
AUC - Training 0.7973 0.7528 0.7507 0.7381 0.7459 0.7519 0.7413 0.7395
AUC - Validation 0.7047* 0.7421 0.6681* 0.7501 0.7483 0.7168 0.7297 0.7708
No significant differences can be concluded for these tests when a 5% level of significance is utilized.¹¹ Over the eight validation years the AUC has overall improved by 0.0654, with declines in the validation years 2005 and 2008. The average validation AUC over the period is approximately 0.73.
The same simulations have been run for the two other priors, and again only significant differences in the validation years 2003 and 2005 were present at a 10% level of significance, which was also the case for prior 1 (see Appendix A.7.4, table 5). Overall, these results indicate that we do not have any problems with overfitting at a 5% significance level, which implies that the current Bayesian logistic regression models can be accepted. Therefore, the next step is to take a closer look at how the choice of prior influences the parameters, which will be clarified below.
As a note, the models are aimed at prediction, so the individual parameter estimates are of secondary interest; table 9 reports the estimated parameters for the validation year 2010. Furthermore, the obtained parameters from the frequentist logistic regression are included for comparison.
¹¹ For the validation years 2003 and 2005 there is a significant difference at a 10% level of significance.
Significant parameters (at a 5% significance level) are marked with an asterisk in table 9, which indicates that the value 0 is not contained in the confidence intervals (CI hereafter). As can be seen from Appendix A.7, algorithm 7, the number of significant parameters varies over the period.
Five parameters in the frequentist logistic regression model have been excluded by the backward selection criterion. There are no significant differences for any of the parameters given the different priors. Furthermore, the Bayesian logistic regression models deviate only very slightly from the frequentist logistic regression. This could indicate that the normally distributed priors have little influence on the posterior distributions, which in section 3.2 was referred to as likelihood dominance.
In the following section the three different priors are tested against each other in order to select the most appropriate prior for comparison with Our Bank's current credit scoring models.
Val. year 2003 2004 2005 2006 2007 2008 2009 2010
Prior 1 0.7047 0.7421 0.6681 0.7501 0.7483 0.7168 0.7297 0.7708
Prior 2 0.7070 0.7418 0.6682 0.7502 0.7483 0.7168 0.7297 0.7710
Prior 3 0.7053 0.7423 0.6685 0.7500 0.7484 0.7171 0.7296 0.7707
The comparison of the different results across the validation years did not indicate any significant differences in model performance given the different priors (see Appendix A.8.1, algorithm 8).
The highest AUC in a given year has been highlighted. In four of the eight validation years prior 2 achieves the highest (or joint highest) AUC, while prior 3 does so in the remaining four years. Since the differences between the three priors are so small, we choose to apply the Bayesian logistic regression with prior 2 when comparing the different models for the RSI segment.
The same estimation procedure is now repeated for the Real Estate segment. The objective is to compare the performance of the Bayesian logistic regression given the different priors when applied to a segment with a small number of customers. The comparison will again, as for the RSI segment, reveal the best prior for the Bayesian model, which will be used when comparing with Our Bank's current models.
Figure 6.5: ROC and AUC for Validation Year 2010 - Real Estate, Prior 1
As for the RSI segment, a marginal difference, here of 0.0130, between the two areas exists. However, at least for the example in figure 6.5, the difference is not significant at a 5% level of significance (p=0.8083). The model is therefore not overfitting the data in the validation year 2010. Again the test is performed for every year of estimation, and the resulting AUCs are shown in Table 11 below.
In the validation year 2008 an apparent problem with overfitting exists. This might be due to the very low number of defaults in the training data. However, the model seems to stabilize over time, and in 2009 and 2010 there is no significant evidence of overfitting.
The same simulations have been run for the two other priors for the Real Estate segment, and again a significant difference in the validation year 2008 exists at a 1% level of significance. Despite the overfitting in the validation year 2008, the approach is assumed to be valid, with the concluding remark that the results might depend on the number of defaulting customers in the training data. Below, the estimated parameters are shown for the three different Bayesian models in the validation year 2010, with the frequentist logistic regression as a reference point.
Eleven parameters in the frequentist logistic regression model have been excluded by the backward selection criterion. There are no significant differences for any of the parameters given the different priors, and the Bayesian parameters are almost identical to the frequentist logistic regression coefficients. Once again, this indicates that the chosen priors have little influence on the posterior distributions.
In the following section the three different priors are tested against each other in order to select the most appropriate prior for comparison with Our Bank's current models for the Real Estate segment.
Table 13: AUC for all Three Bayesian Models - Real Estate
No significant differences between the Bayesian logistic regression models with the three different priors have been identified (see Appendix A.8.2, algorithm 11). The Bayesian logistic regression model with prior 2 performs marginally better than with the other two priors, due to the difference in the validation year 2008. Therefore, the Bayesian logistic regression with prior 2 will be applied when comparing the different models for the Real Estate data.
In the following section the performance of the Bayesian logistic regression models will be compared with the performance of the estimated expert models and the frequentist logistic regression models. First, the AUCs for the RSI segment are shown:
The frequentist and Bayesian logistic regression models perform slightly better than the expert model for all years during the period, except for the validation year 2005. In the validation years 2006 and 2009 the frequentist and Bayesian logistic regression models are significantly better than the expert model.
Generally, the Bayesian logistic regression model performs marginally better than the frequentist logistic regression model. In the validation year 2007, however, the frequentist logistic
regression model has a higher AUC than the Bayesian logistic regression. It is worth men-
tioning that the Bayesian logistic regression model at no point significantly outperforms the frequentist logistic regression for the RSI data (see Appendix A.8.1, algorithm 10 for actual
results).
Thereby it can be concluded that the Bayesian logistic regression model is overall able to outperform the expert model, and that it performs slightly better than the frequentist logistic regression model for the RSI segment, without the differences being significant. The corresponding comparison for the Real Estate segment is shown below:
As can be seen from figure 6.7, the Bayesian logistic regression is the credit scoring model with the lowest AUC in the validation year 2008. In the validation year 2009 the frequentist and Bayesian logistic regressions perform significantly better than the expert model, and the Bayesian logistic regression model performs marginally better than the frequentist logistic regression model. In the last validation year, 2010, the frequentist logistic regression model is the model with the highest AUC, followed by the Bayesian logistic regression model - a difference of 0.0528 separates the performance of these two models (see Appendix A.8.2, algorithm 13 for actual results).
Overall, there are no discoverable patterns in the performance of the different models for the Real Estate segment. However, after the validation year 2008 the Bayesian logistic regression performs better than the expert model and on the same level as the frequentist logistic regression model.
PART V
Concluding Remarks
Chapter 7
7 Concluding Remarks
The following chapter sums up the current thesis and the appertaining results. The chapter will first of all contain a conclusion, which highlights the findings of the thesis. Afterwards, the limitations and contributions of the thesis are summarized. Finally, ideas for future research are presented.
7.1 Conclusion
The objective of the current thesis was to analyze whether a Bayesian logistic regression model for credit scoring is a more effective tool than Our Bank's current approaches - an expert model and a frequentist logistic regression model. In addition to this, it was also important to clarify how the different models perform when applied to both a large customer segment and a small customer segment.
Overall, the results from the empirical analysis showed that a Bayesian approach for credit scoring was not able to significantly outperform a frequentist approach, and thereby we cannot conclude that a Bayesian logistic regression is a more effective tool that improves quality of service and minimizes the risk of credit loss when compared to a frequentist logistic regression. On the other hand, the analysis confirmed that a Bayesian approach was overall able to outperform the expert model.
Since the thesis has had both an academic and a practical orientation, it was deemed necessary to equip the reader with a basic theoretical understanding of Bayesian statistics. Therefore, an introduction to Bayesian statistics and the Bayesian approach for logistic regression were initially accounted for. Given that the parameter estimation in Bayesian statistics is notably different from frequentist statistics, because all inference is drawn from the posterior distribution, an elaboration on how to obtain that posterior distribution through Markov chain Monte Carlo simulation was provided, together with the validation techniques used for comparison of the different models. A walk-forward
validation framework was chosen to be applied in the empirical analysis and the ROC curve
and its summary statistic, the AUC, were selected as the validation techniques.
For the empirical analysis a real dataset provided by Our Bank was used, containing data on 67618 bank customers spanning the nine years from 2002 to 2010. The data consisted of questionnaire data from two different customer segments - Retail, Service and Industry (RSI) and Real Estate.
Before the Bayesian models were estimated, the AUCs of the expert model and the frequen-
tist logistic regression were estimated, and these served as references for the model comparison.
An approach for converting the current expert knowledge into prior information was pro-
posed. Since no information regarding the variance parameter for the prior distribution existed,
two different priors with different variances were chosen together with a non-informative prior.
Since the aim of the thesis was to compare the predictive power of different credit scoring models, the empirical analysis focused on significant differences between the AUCs of the different models. Therefore, in order to identify what kind of impact the prior had on the Bayesian models, the differences in the AUCs for all three Bayesian models were analyzed. The analysis showed no significant differences between the models, and the conclusion was therefore that the prior did not have any remarkable influence on the posterior, indicating what is referred to as likelihood dominance. This conclusion was valid for both customer segments. Since prior 2 had a marginally better AUC for both segments, it was chosen as the prior used for comparison with Our Bank's current credit scoring models.
The results from comparing the Bayesian model with Our Bank's current models for the RSI segment showed that the Bayesian and frequentist logistic regression models were able to significantly outperform the expert model in the validation years 2006 and 2009. For the remaining years similar differences were obtained, without being significant. When comparing the Bayesian logistic regression with the frequentist logistic regression, only marginal differences were obtained. With the exception of the validation year 2007, the Bayesian logistic regression performed slightly better than the frequentist logistic regression.
When analyzing the performance of the models for the Real Estate segment, the results were quite ambiguous. An important aspect related to this was the low number of defaulting customers within the segment, which implied that estimation was only possible after the default year 2007. We were only able to identify one significant difference, namely that the expert model performed significantly worse than the other two models in the validation year 2009. No significant differences between the Bayesian and frequentist logistic regressions could be confirmed. However, after the validation year 2008 the Bayesian logistic regression model was performing on the same level as the frequentist logistic regression model.
Bayesian methods are already increasingly being applied in a diverse assortment of fields, including medicine, sociology, psychology, artificial intelligence, and philosophy (Wylie, Muegge, & Thomas, 2006). We believe that Bayesian methods hold similar promise for researchers and practitioners within credit scoring. In the future, it seems likely to us that statisticians will increasingly be tied less dogmatically to a single approach and will feel comfortable using both frequentist and Bayesian methods.
As a final remark, it is worth mentioning that even though we were not able to confirm any significant differences between a frequentist and a Bayesian logistic regression, the Bayesian approach for credit scoring should not be rejected. The conclusions in this thesis could be highly influenced by the choice of priors, the chosen sampling algorithm for the MCMC, and the handling of the missing values.
7.2 Limitations
There have been some limitations related to the process of the research, and these could have influenced the final results.
First of all, only one prior distribution, the normal distribution, was applied in the Bayesian models. The analysis indicated that the prior did not have any significant influence on the posterior.
Secondly, the random-walk Metropolis algorithm was utilized as the only sampling algorithm for the MCMC simulations, though several others exist, such as the Gibbs sampler.
Thirdly, the data contained a certain amount of missing values, which were deleted and not imputed, thereby reducing the amount of available data.
7.3 Contributions
The main contribution of this thesis has been to introduce Bayesian logistic regression as an alternative to frequentist logistic regression for credit scoring. We have compared the two methods' predictive ability (AUC) based on real data covering almost a decade. Furthermore, the two different approaches have been compared to Our Bank's current expert model.
Another contribution of the present thesis has been to integrate and convert the current expert knowledge from Our Bank into prior information. Two different set-ups for the expert knowledge have been applied as informative priors, together with an additional non-informative prior.
Furthermore, a walk-forward validation framework has been used in the empirical analysis.
This framework has the ability to clarify how a model evolves over time, when more data is
obtained.
Finally, the current thesis has contributed with a comparison of the model performance on two different segments with, respectively, a large number of customers and a small number of customers.
7.4 Future Research
It has been demonstrated throughout the current thesis that the Bayesian approach towards credit scoring requires careful attention to modeling, since the quality of the results is strongly dependent on the specified priors, which place demands on the analyst's skills, judgment, and experience (Wylie, Muegge,
& Thomas, 2006). In continuation hereof, one of the disadvantages in using a Bayesian ap-
proach is that the approach does not contain instructions on how to select a prior. There is
no correct way to choose a prior, which implies that it requires skills to translate subjective
prior beliefs into mathematically formulated priors. From our point of view, the chosen priors
in the current thesis should be perceived as points of origin for further research rather than final solutions. The priors have been developed based on the current expert models in Our Bank without having any prespecified variance or distribution. As implied in the empirical
analysis, the normally distributed priors have only had marginal influences on the posterior distributions, which probably is one of the reasons why the Bayesian approach does not differ significantly from the frequentist approach. One opportunity for future research is therefore to examine other, more informative prior specifications and their influence on the results.
As stated in section 6.1.1, the data was notably reduced due to the amount of missing values. It was chosen not to impute new values in the dataset. Several methods for handling missing data exist. For future research, a comparison of the different imputation methods and their influence on the estimated models would be relevant.
Although a lot of literature concerning MCMC exists, we have not been able to find any that describes whether the choice of sampling algorithm influences the estimated parameters in the Bayesian approach. By utilizing another algorithm than the random-walk Metropolis, this hypothesis could be tested.
Many models can be utilized for credit scoring where, from a Bayesian point of view, Bayesian networks have shown some success (Biçer, Seviş, & Bilgiç, 2010). An investigation of these networks for credit scoring would provide interesting further topics of research.
Chapter 8
8 Bibliography
Alfò, M., Caiazza, S. & Trovato, G. 2005, "Extending a Logistic Approach to Risk
Altman, D.G. & Bland, J.M. 1998, "Statistics Notes: Bayesians and Frequentists",
British Medical Journal - LA English, vol. 317, no. 7166, pp. 1151.
Berger, J.O. 2000, "Bayesian Analysis: A Look at Today and Thoughts of Tomorrow",
Journal of the American Statistical Association, vol. 95, no. 452, pp. 1269.
Biçer, I., Seviş, D. & Bilgiç, T. 2010, Bayesian Credit Scoring Model with Integration of Expert Knowledge and Customer Data. Available: http://leidykla.vgtu.lt/conferences/MEC_EurOPT_2010/pdf/324-329-Bicer_Sevis_Bilgic-57.pdf [2012, 07/16].
Bolstad, W.M. 2007, Introduction to Bayesian Statistics, 2nd edn, John Wiley, Hoboken, N.J.
Brinberg, D. & Hirschman, E.C. 1986, "Multiple Orientations for the Conduct of Mar-
Che, X. & Xu, S. 2010, Bayesian Data Analysis for Agricultural Experiments. Available:
Chen, M., Shao, Q. & Ibrahim, J.G. 2000, Monte Carlo Methods in Bayesian Computation, Springer, New York.
Cowles, K., Kass, R. & O'Hagan, T. 2009, What is Bayesian Analysis? Available:
Engelmann, B., Hayden, E. & Tasche, D. 2003a, Measuring the Discriminative Power of Rating Systems.
Engelmann, B., Hayden, E. & Tasche, D. 2003b, Testing Rating Accuracy. Available:
Gelman, A., Jakulin, A., Pittau, M.G. & Su, Y. 2008, "A Weakly Informative Default
Prior Distribution for Logistic and Other Regression Models", The Annals of Applied Statistics.
Geyer, C.J. 1992, "Practical Markov Chain Monte Carlo", Statistical Science, vol. 7,
Gilks, W.R., Richardson, S. & Spiegelhalter, D.J. 1996, Markov Chain Monte Carlo in Practice, Chapman & Hall, London.
Hamerle, A., Rauhmeier, R. & Rösch, D. 2003, Uses and Misuses of Measures for Credit Rating Accuracy.
Hellwig, M. 2008, Systemic Risk in the Financial Sector: An Analysis of the Subprime-Mortgage Financial Crisis.
Howson, C. & Urbach, P. 1993, Scientific Reasoning: The Bayesian Approach, 2nd edn.
Jaynes, E.T. & Bretthorst, G.L. 2003, Probability Theory: The Logic of Science, Cambridge University Press, Cambridge.
Kynn, M. 2005, Eliciting Expert Knowledge for Bayesian Logistic Regression in Species Habitat Modelling.
Laitinen, E.K. 1999, "Predicting a Corporate Credit Analyst's Risk Estimate by Logistic
and Linear Models", International Review of Financial Analysis, vol. 8, no. 2, pp. 97.
Malden, MA.
Lenhard, J. 2006, "Models and Statistical Inference: The Controversy between Fisher and Neyman-Pearson".
Liu, J.S. 2001, Monte Carlo Strategies in Scientic Computing, Springer, New York.
Löffler, G., Posch, P.N. & Schöne, C. 2005, Bayesian Methods for Improving Credit Scoring Models.
Mira, A. & Tenconi, P. 2003, Bayesian Estimate of Credit Risk via MCMC with
Nevin, J.R. 1979, "The Equal Credit Opportunity Act: An Evaluation", Journal of
Neyman, J. & Pearson, E.S. 1967, Joint Statistical Papers, Cambridge University Press, Cambridge.
Roberts, G.O. & Rosenthal, J.S. 1998, "Markov-Chain Monte Carlo: Some Practical
Rouchka, E.C. 2008, A Brief Overview of Gibbs Sampling. Available: http://topaz.gatech.edu/~vardges/biol7023/FALL_2006/Lab5/ROUCHKA_gibbs.pdf [2012, 06/12].
Satchell, S. & Xia, W. 2006, Analytic Models of the ROC Curve: Application to
Sivia, D.S. & Skilling, J. 2007, Data Analysis: A Bayesian Tutorial, 2nd edn, Oxford University Press, Oxford.
Smith, A.F.M. 1991, "Bayesian Computational Methods", Phil. Trans. R. Soc. Lond.,
Sobehart, J., Keenan, S. & Stein, R. 2001, Benchmarking Quantitative Default Risk Models: A Validation Methodology, Moody's Investors Service.
Steenackers, A. & Goovaerts, M.J. 1989, "A Credit Scoring Model for Personal Loans",
Stigler, S. 2005, "Fisher in 1921", Statistical Science, vol. 20, no. 1, pp. 32.
Tabachnick, B.G. & Fidell, L.S. 2008, Using Multivariate Statistics, 5th edn,
Thomas, L.C. 2000, "A Survey of Credit and Behavioural Scoring: Forecasting Financial
Risk of Lending to Consumers", International Journal of Forecasting, vol. 16, no. 2, pp.
149.
Walsh, B. 2004, Markov Chain Monte Carlo and Gibbs Sampling. Available: http://web.mit.edu/~wingated/www/introductions/mcmc-gibbs-intro.pdf [2012, 06/12].
Wilhelmsen, M., Dimakos, X.K., Husebø, T. & Fiskaaen, M. 2009, Bayesian Modelling
Wylie, J., Muegge, S. & Thomas, D.R. 2006, Bayesian Methods in Management Research: An Application to Logistic Regression. Available: http://attila.acadiau.ca/library/ASAC/v27/content/authors/t/Thomas,%20Roland/BAYESIAN%20METHODS%20IN%20MANAGEMENT%20RESEARCH.pdf [2012, February/20].
Ziemba, A. 2005, Bayesian Updating of Generic Scoring Models. Available: http://www.google.dk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CEwQFjAA&url=http%3A%2F%2Fwww.business-school.ed.ac.uk%2Fwaf%2Fschoolbiz%2Fget_file.php%3Fasset_file_id%3D1762&ei=jWEGUJXTAvSM4gTAu5WaCQ&usg=AFQjCNFrs7tKlLxx7QXkyHOkmawqRlV8wA [2012, 07/12].