
Credit Scoring

A Bayesian Approach

BANK CREDIT SCORING ANALYSIS WITH


BAYESIAN LOGISTIC REGRESSION
AS A DECISION TOOL

A MASTER THESIS BY:


Daniel Lund [288631]
Kamilla H. Sørensen [284220]

ACADEMIC SUPERVISOR:
Ana Alina Tudoran
Department of Business Administration

August, 2012
MSc. Business Intelligence
Aarhus University
Business and Social Sciences
Acknowledgement
We would like to extend our sincerest thanks to those who have helped us during the

process of this thesis. The process would have been much more overwhelming and the topic

more complicated, if it had not been for the support and guidance of those people.

First of all, our thanks goes to our dedicated academic supervisor, Ana Alina Tudoran, for

always being helpful with useful inputs and discussions during the process.

Secondly, a thanks goes to our contact person in the Bank, both for providing real data

and competent feedback every time a question arose.

Last but not least, a special thanks goes to our fellow thesis-writing students at Haslefortet,

who contributed to a fun and cozy environment.


Abstract
Bayesian methods have started to gain increasing acceptance nowadays, particularly due to the recent advances in computer technology. The objective of the current thesis has been to analyze whether a Bayesian logistic model is a more effective tool for credit and loan default predictive modeling, by comparing its performance with an expert model and, respectively, a frequentist logistic model. Real data on 67618 customers collected during 2002-2010 were provided by one of the most important banks in Denmark and were used in the empirical analysis. The Bayesian logistic regression model was estimated using Markov chain Monte Carlo simulations in SAS. The performance of the different credit scoring models (Bayesian, frequentist, expert) was assessed by using the ROC curve. The results of the empirical analysis show that a Bayesian approach for credit scoring is overall able to outperform the expert model. However, the findings show no significant differences in terms of predictive performance between the Bayesian and the frequentist logistic approach. Overall, we conclude that a Bayesian approach for credit scoring may be used as an alternative decision tool to the frequentist approach.
Contents

1 Introduction 1
1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Delimitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Research Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.4 Structure of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Statistical Reasoning 10
2.1 The History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.2 The Bayesian Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2.1 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.3 Advantages and Disadvantages of the Bayesian Approach . . . . . . . . . . . . 16

3 Bayes' Theorem 17
3.1 The Likelihood p(y | θ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.2 The Prior p(θ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.2.1 Informative and Non-informative Prior Distributions . . . . . . . . . . . 20

3.2.2 Improper Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.3 The Posterior p(θ | y) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4 Bayesian Logistic Regression 22


4.1 Parameter Estimation in Bayesian Models . . . . . . . . . . . . . . . . . . . . . 23

4.1.1 MCMC Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4.1.2 MCMC Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.1.3 Convergence Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

5 Model Validation 33
5.1 Validation Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

5.2 Validation Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

6 Empirical Analysis 42
6.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

6.1.1 Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

6.1.2 Standardizing Input Variables . . . . . . . . . . . . . . . . . . . . . . . . 44

6.2 Estimation of the Expert Models . . . . . . . . . . . . . . . . . . . . . . . . . . 45

6.3 Estimation of the Frequentist Logistic Regression Models . . . . . . . . . . . . . 46

6.4 Estimation of Bayesian Logistic Regression Models . . . . . . . . . . . . . . . . 47

6.4.1 Prior Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

6.4.2 Specifying the Simulation Method for RSI Data . . . . . . . . . . . . . . 49


6.4.3 Results for RSI Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

6.4.4 Results for Real Estate Data . . . . . . . . . . . . . . . . . . . . . . . . 55

6.5 Comparison of the Estimated Credit Scoring Models . . . . . . . . . . . . . . . 58

6.5.1 RSI Segment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

6.5.2 Real Estate Segment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

7 Concluding Remarks 61
7.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

7.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

7.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

7.4 Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

8 Bibliography 65

A Appendix 71
List of Figures

1.1 Structure of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.1 Venn Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.1 Prior Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.1 Trace Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

5.1 Schematic of Out-of-Sample Validation Techniques . . . . . . . . . . . . . . . . 34

5.2 Walk-Forward Validation - an Example . . . . . . . . . . . . . . . . . . . . . . . 35

5.3 Power-Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

5.4 Distribution of Rating Scores for Defaulting and Non-defaulting Customers . . 38

5.5 Receiver Operating Characteristic Curves . . . . . . . . . . . . . . . . . . . . . 39

6.1 Results from Initial Simulation - 10000 Iterations, Prior 1 . . . . . . . . . . . . 50

6.2 Results from Initial Simulation - 150000 Iterations, Prior 1 . . . . . . . . . . . . 51

6.3 Results from Initial Simulation - 250000 Iterations, Prior 1 . . . . . . . . . . . . 51

6.4 ROC and AUC for Validation Year 2010 - RSI, Prior 1 . . . . . . . . . . . . . . 52

6.5 ROC and AUC for Validation Year 2010 - Real Estate, Prior 1 . . . . . . . . . 55

6.6 Comparison of AUC - RSI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

6.7 Comparison of AUC - Real Estate . . . . . . . . . . . . . . . . . . . . . . . . . 59


List of Tables

1 The Two Customer Segments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

2 Overview of the Explanatory Variables . . . . . . . . . . . . . . . . . . . . . . . 43

3 Available Data - without Missing Values . . . . . . . . . . . . . . . . . . . . . . 44

4 Expert Weights - an Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

5 AUC for the Expert Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

6 AUC for the Frequentist Logistic Regression - Backward Selection . . . . . . . . 47

7 The Selected Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

8 AUC - RSI, Prior 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

9 Parameter Estimates 2010 - RSI . . . . . . . . . . . . . . . . . . . . . . . . . . 54

10 AUC for all Three Bayesian Models - RSI . . . . . . . . . . . . . . . . . . . . . 55

11 AUC - Real Estate, Prior 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

12 Parameter Estimates 2010 - Real Estate . . . . . . . . . . . . . . . . . . . . . . 57

13 AUC for all Three Bayesian Models - Real Estate . . . . . . . . . . . . . . . . . 57


PART I
Introduction
Chapter 1

1 Introduction

Through the last 70-80 years, different methods for credit scoring have been of huge interest in the banking industry. Credit scoring is a well-known assessment process, which has the objective to distinguish the desired customers from defaulting customers based on the registered customer information (Thomas, 2000) and (Isik, Deniz, & Taner, 2010). The first methods used were entirely focused on a judgmental approach, where a credit analyst approved or disapproved the customers' loan application forms, but since this approach was solely based on subjective beliefs it was not considered reliable.

Another important aspect in relation to the above-mentioned was that through the 1960s the number of people applying for a credit card rose significantly, which increased the need for automated assessment processes, since the banks simply did not have the resources to cope with the increasing demand (Thomas, 2000).

In 1975 and 1976, the Equal Credit Opportunity Act (ECOA hereafter) was approved in the US, which made it illegal to discriminate against loan applicants in any way regarding sex, marital status, race, color, religion, national origin, and age. Discrimination was, according to ECOA, defined as "treating one person less favorably than others", and the banks were required to inform applicants about their decision by a detailed written notice. Therefore, the banks began focusing on empirically derived credit systems, where the creditworthiness of the customers was based on statistical methods. These methods included predictor variables such as the number of years a person had lived at the same address, account information, other creditor debts, etc. By implementing these systems, the banks were able to document the same objective decisions to all applicants. This initiative was the starting point in the field of credit scoring, which developed into a lucrative growth profession (Nevin, 1979).

After the mid-70s, it was still possible for a financial advisor to overrule judgments made by the credit scoring system if he or she disagreed with its outcome. However, studies showed that in 95% of the cases where an advisor accepted a loan even though the system had turned the applicant down, the loan was very hard to collect (Nevin, 1979).

In the 1980s the success of credit scoring in credit cards meant that banks started using

credit scoring for their other products like personal loans, and in the 1990s the credit scoring

methods were also applied for private loans and small business loans (Thomas, 2000).

The recent financial crisis, which began back in 2007, was a result of greedy financial institutions' willingness to invest in hazardous mortgage-backed securities. The public view of the bank managers was that they had completely disregarded the risks involved in making investments in sub-prime mortgage loans¹, and were afraid to come clean about the mistakes they had made. Instead of writing off their losses at once, they depreciated them over a longer period of time (Hellwig, 2008). This financial crisis has shaken the entire financial system all over the world, and the countries' economies are still suffering from the aftermath of the shock.

¹ Sub-prime lending is a loan type for people with a limited credit history and who are more likely to default than regular prime loaners. The sub-prime loans have a higher interest rate to compensate for the risks associated with offering these types of loans.

Financial institutions were led to ignore the risks associated with these loans because of increasing home prices through the mid-90s (Hellwig, 2008). The popularity of the sub-prime mortgages meant an increase in the home prices from around 9% in 2000 to above 40% in 2006. The fact that these types of loans actually were sub-prime loans should have concerned the bank managers. One might wonder what the initial decisions about offering these types of loans were based on. It was certainly not the result of an effective credit scoring system (Hellwig, 2008).

In recent years it has become increasingly important for banks to develop effective and reliable credit scoring systems to classify defaulting or profitable customers. Several different methods and models have been utilized in doing so. Thomas (2000) briefly describes some of the techniques that have been used for credit scoring over the last couple of decades. The most popular of these techniques is the frequentist logistic regression approach (Steenackers & Goovaerts, 1989), (Laitinen, 1999), and (Alfò, Caiazza, & Trovato, 2005). However, some statisticians have recently argued that we stand at the threshold of a new Bayesian Renaissance, and other proponents argue that Bayesian methods more closely reflect how humans perceive their environment, respond to new information, and make decisions (Wylie, Muegge, & Thomas, 2006). Bayesian techniques have rarely been utilized by researchers or financial corporations in the past, but nowadays the increasing computational power entails that the computational challenges have been overcome (Wylie et al., 2006).

Studies have already explored different Bayesian approaches for credit scoring and found these methods to have some advantages over frequentist approaches (Mira & Tenconi, 2003), (Ziemba, 2005), and (Löffler, Posch, & Schöne, 2005). In relation to this fact, one of the most important banks in Denmark has expressed a desire to explore a Bayesian approach for credit scoring as an alternative to their already existing approaches.

In this thesis we explore Bayesian logistic regression as a possible decision tool for credit scoring from an academic and practical viewpoint, and it is therefore ideal that the current thesis has been composed in cooperation with the above-mentioned bank, which has delivered the data used in the empirical analysis. The performance of Bayesian logistic regression will be evaluated and compared with a frequentist logistic regression and an expert model.

1.1 Problem Statement

The financial advisors in Our Bank have the responsibility to fill in questionnaires regarding the creditworthiness of the customers. The questionnaires may have different objectives, but in this thesis the focus is solely on the questionnaires' ability to predict the defaulting customers.


In spite of the financial crisis, Our Bank is only losing money on a tiny proportion of its customers (i.e. the data consist of a small proportion of defaulters - the dependent variable), which implies that it may take several years before Our Bank has enough data to re-estimate a statistical model for default prediction.

Frequentist logistic regression is the current statistical method applied when enough data is

collected, and in the intervening period an expert model is applied. However, these approaches

contain various disadvantages for Our Bank.

Regarding the expert model:

- The accuracy ratio of the expert model is notably lower than the accuracy ratio of the logistic regression.

Regarding the frequentist logistic regression model:

- Risk of over-fitting when a small amount of data is included in the analysis.

Due to the disadvantages listed above, Our Bank is searching for an alternative statistical method. Bayesian logistic regression is one possibility, since this statistical approach has the ability to combine prior expert information and collected data, and thereby update one's prior beliefs regarding the parameters of interest. Therefore, the research objective for the present thesis is:

To analyze if Bayesian logistic regression is a more effective tool that improves quality of service and minimizes the risk of credit loss compared to an expert model and a frequentist logistic regression.
It must be emphasized that the objective of the current thesis is solely default prediction and not default explanation.

The contribution of this thesis is threefold. First, we introduce Bayesian statistics and the Bayesian approach to logistic regression from an academic point of view, based on existing literature related to the topic. Second, we suggest a way to integrate the current expert knowledge from Our Bank into prior distributions used in the Bayesian logistic regression model. Third, we apply and compare the performance of Bayesian logistic regression with an expert model and a frequentist logistic regression model, using real data from Our Bank. The provided data covers customer information from 2002 to 2010, and we introduce a walk-forward validation framework used to analyze the differences between the models. Additionally, the comparison will be conducted on two different segments to clarify whether there are any differences between the conclusions in a segment with a large number of customers compared to a segment with a small number of customers.


As mentioned in Chapter 1, the present thesis is composed in cooperation with one of the most important banks in Denmark. Due to confidentiality reasons, the name of the bank will not be disclosed; rather, it is referred to as "Our Bank".

1.2 Delimitation

In this thesis it is assumed that the reader has a thorough understanding of frequentist statistics in general. Therefore, frequentist statistics will only briefly be described in section 2.2. The authors, originally frequentists, have spent a great amount of time on understanding Bayesian statistics. Therefore, it has been deemed necessary to keep an acute focus in order not to lose the thread.

As mentioned in the Introduction, the interest in credit scoring methods increased with the entry of credit cards in the 1960s. Since then, several statistical methods, based on different beliefs, have been applied to credit scoring (e.g. Keramati & Yousefi, 2011 - a review). Due to the recent developments in Bayesian statistics, this thesis will focus on Bayesian logistic regression and compare its performance with the more common frequentist logistic regression technique and an expert model. The expert model for credit scoring, described in section 6.2, will also be utilized in the development of prior distributions for the Bayesian logistic regression. Since the expert model and a frequentist logistic regression are the current models applied by Our Bank for credit scoring, both of these will be used as a reference in section 6.5. In continuation hereof, it is important to clarify that the frequentist logistic regression model used in Our Bank is not completely identical to the estimated frequentist logistic regression model used as reference in this thesis. This is due to the way Our Bank's current frequentist logistic regression is estimated, which will be elaborated on in section 6.3. However, the frequentist logistic regression model estimated in the thesis will still be referred to as Our Bank's current model.

In finance literature, credit scoring is defined as a classification method aiming to distinguish the desired customers, who will fully repay, from defaulters (Isik, Deniz, & Taner, 2010). Our Bank operates with three credit ratings: "low risk", "high risk", and "full risk". This thesis, however, only operates with two credit ratings - default and non-default - which means that if a customer is classified as default, the customer is either a "high risk" or "full risk" customer according to Our Bank's current credit rating. An important aspect in relation to the classification of the customers is the determination of a suitable time horizon. This time horizon covers the duration from granting credit to a customer to the time when the customer is observed as a defaulter or a non-defaulter (Isik, Deniz, & Taner, 2010). In Our Bank the customers are evaluated in the year after receiving the loan, which is in line with standard practice (Isik, Deniz, & Taner, 2010). Analysis shows that the default rate, as a function of the time the customer has been with the organization, builds up initially, and it is only after twelve months or so that it starts to stabilize. Thus, any shorter horizon underestimates the default rate and does not reflect in full the types of characteristics that predict default (Thomas, 2000).

The data basis for the empirical analysis consists of questionnaire data on customers, filled out by the financial advisors during the period 2002-2010. The data covers two different customer segments: Retail, Service and Industry (RSI hereafter) and Real Estate.

In total the dataset consists of 67618 customers, which will be used in the empirical analysis (Chapter 6):

- RSI (N=62866)
- Real Estate (N=4732)

1.3 Research Methodology

As the objectives are to explore the theoretical basis of Bayesian statistics, as well as to quantitatively evaluate the performance of Bayesian logistic regression compared to Our Bank's current approaches, the present thesis is characterized by both an academic and a practical orientation. According to Brinberg and Hirschman (1986), an academic orientation refers to conducting research focusing on concepts and the relations between those concepts, whereas a practical orientation refers to conducting research focusing on a system, organization, or a set of events in the real world. Due to the nature of the present thesis, the authors have, as already mentioned in section 1.2, spent a great amount of time on understanding Bayesian statistics. Based on that, the authors expect to face decisions requiring certain compromises in order to satisfy both orientations, in terms of the theoretical treatment of Bayesian logistic regression as well as the problem-oriented research necessary to conduct the performance evaluation of Bayesian logistic regression against Our Bank's current approaches in a satisfactory and understandable manner.

All scientific investigation is affected by the methodological approach chosen by the researchers. This thesis is inspired by the work of Guba (1990), which is used to explain the current methodological approach.

The overall methodological approach to the present thesis is based on the fundamental beliefs of postpositivism, in which the aim is prediction and control. Postpositivism can be characterized as a modified version of positivism, which relies on the belief that a reality out there exists, driven by immutable natural laws (Guba, 1990). The aim of this thesis is to clarify whether Bayesian logistic regression can be used as a more effective decision tool for credit scoring than Our Bank's current approaches. In continuation hereof, the thesis focuses solely on default prediction.


Ontology concerns considerations about the nature of social entities. The ontological aspect of postpositivism implies a focus on critical realism, which acknowledges that reality exists but emphasizes that it is impossible for humans to accurately perceive it with their imperfect sensory and intellective mechanisms. This idea implies that postpositivism recognizes that all observations are fallible and error-prone and that all theory is revisable (Guba, 1990). We are aware that we need to be critical about our work due to these imperfect sensory and intellective mechanisms.

Epistemological issues regard the understanding of how knowledge is created and what acceptable knowledge is. With regard to this issue, postpositivism emphasizes modified objectivity, which implies that objectivity is the guiding ideal, but that it cannot be achieved in an absolute sense. Objectivity can be approached reasonably closely by striving to be as neutral as possible. Because all measurement is fallible, postpositivism emphasizes the importance of multiple measures and observations, each of which may possess different types of error (Guba, 1990). The current thesis is influenced by the subjective Bayesian approach. All statistical methods that use probability are subjective to a certain extent because they rely on mathematical idealizations of the world. However, the Bayesian approach is sometimes perceived as being especially subjective due to the reliance on a prior distribution (Gelman, 2000). To overcome this, different priors are applied to the Bayesian logistic regression, and the prior with the best performance is used in the comparison with Our Bank's current approaches to credit scoring. We are aware that objectivity cannot be fully met, since two of the utilized priors in the thesis are established based on subjective expert models. Furthermore, the variances of the different priors have been subjectively chosen by the authors.

Methodological questions concern how the researcher should go about finding knowledge. Methodologically, postpositivism provides two responses to emergent challenges. First, in the interest of conforming to the commitment to critical realism and modified subjectivity, emphasis is placed on critical multiplism, which can be thought of as a form of triangulation. If human sensory and intellective mechanisms cannot be trusted, it is essential that the findings are based on as many sources of data, investigators, theories, and methods as possible. Second, postpositivism allows for many imbalances necessary to achieve realistic and objective research. Particularly important is the imbalance that has to do with the inescapable trade-off between internal and external validity, in which the researcher must sacrifice the degree of generalization of the findings to achieve internal validity (Guba, 1990). The methodological approach used in this thesis starts out with a theoretical introduction to the Bayesian approach, which is used to guide the empirical analysis. Given the knowledge acquired in the theory sections, the thesis explores the research objective using statistical and quantitative methods. The empirical analysis is based on a solid dataset covering 67618 customers during 2002 to 2010. Furthermore, the empirical analysis is carried out by comparing three different credit scoring methods. We are aware that the conclusions cannot be generalized to other banks even though they are drawn on quantitative methods. This is due to the data, since they are based on confidential questionnaires within Our Bank, filled out by the financial advisors. However, the conclusions give indications which can be used for further research outside Our Bank.

1.4 Structure of the Thesis

In order to be able to thoroughly investigate the research objective, the thesis is divided into five distinct parts, as illustrated in figure 1.1.

Figure 1.1: Structure of the Thesis

Source: Own elaboration

Part I (Chapter 1) frames the present thesis by introducing the problem statement and the appertaining delimitations and research methodology.

Part II (Chapters 2, 3, and 4) outlines the theoretical framework for the present thesis. The aim is to provide the reader with an understanding of the Bayesian approach and how it has evolved through time. Chapter 2 introduces the two competing approaches within statistical reasoning, frequentist and Bayesian, and derives Bayes' Theorem on the basis of conditional probability. Chapter 3 gives a deeper understanding of the three components in Bayes' Theorem, and Chapter 4 describes Bayesian logistic regression followed by an introduction to parameter estimation in Bayesian statistics.

Part III (Chapter 5) introduces the theory behind the validation framework and validation

technique applied in the thesis.

Part IV (Chapter 6) presents the empirical findings and evaluates model performance based on real data from Our Bank. Initially, Chapter 6 provides a description of the dataset used for running the models. Chapter 6 continues with the estimation of the current credit scoring models in Our Bank and describes the procedures used for establishing a Bayesian logistic regression model. Through Markov chain Monte Carlo (MCMC hereafter) simulations the parameters for the Bayesian model are estimated. Lastly, Chapter 6 compares the performance of the different credit scoring models.

Finally, Part V (Chapter 7) concludes on the findings in Chapter 6 and reflects on possible future research and extensions. Furthermore, the limitations and contributions of the current thesis will be highlighted.

PART II
Statistical Reasoning
Chapter 2

2 Statistical Reasoning

Statistics is the study, creation, and use of methods for producing and employing data for

description, measurement, explanation, prediction, control, and decision-making (Barnett,

1973). Hultquist (1969) states:

Statistics is a science that concerns itself with experimentation and the collection,

description and analysis of data... Statistical methods are tools for examining data.

(Barnett, 1973)

Nowadays two competing approaches to statistical reasoning exist: the Bayesian and the frequentist, of which the frequentist is the larger group. As mentioned in Chapter 1, the increasing power of computers is bringing the Bayesian approach to the fore (Wylie, Muegge, & Thomas, 2006). Most statisticians have become Bayesians or frequentists as a result of their choice of university; they were not aware of the existence of the Bayesian and frequentist approaches until it was too late and the choice had been made (Altman & Bland, 1998). Since the different approaches are based on different concepts, procedures and justifications, the following section contains an introduction to the history of statistical reasoning, followed by an introduction to the Bayesian approach. Finally, the Chapter summarizes some advantages and disadvantages of the Bayesian approach.

2.1 The History

Even though statistics as a formal scientific discipline has a rather short history, the reasoning behind it began about three hundred years ago, when people started to give serious thought to the question of how to reason in situations where it was not possible to argue with certainty. The first to formulate the problem was probably James Bernoulli (1713), who perceived the difference between the deductive logic applicable to games of chance and the inductive logic required for everyday life. The question for Bernoulli was how the mechanics of the deductive logic might help to tackle the inference problems of the inductive logic (Sivia & Skilling, 2007).

Reverend Thomas Bayes is credited with providing an answer to Bernoulli's question through his papers published in the Philosophical Transactions in 1763 and 1764. Of these, the first, entitled An Essay Towards Solving a Problem in the Doctrine of Chances, is the one that has earned him acknowledgment, since he proved a special case of what is nowadays called Bayes' Theorem (Barnett, 1973). Thomas Bayes was interested in inverse probability, which concerns inferences about probability parameters from observations of outcomes and prior beliefs. Pierre-Simon Laplace proved, 11 years later, a more general version of Bayes' Theorem, and he applied the results to inference on sampling and measurement error (Altman & Bland, 1998). Both Bayes and Laplace assumed uniform prior distributions; that is, as an initial starting point, they assumed that all possible values for the unknown parameter were equally likely, and revised their estimates as they observed the data (Wylie et al., 2006). The present form of Bayes' Theorem is actually due to the work of Laplace, since he rediscovered Bayes' Theorem with far more clarity than Bayes, and since he discovered its use in solving problems in celestial mechanics, medical statistics and even jurisprudence. Despite Laplace's numerous successes, his development of probability theory was rejected by many soon after his death (Sivia & Skilling, 2007).

The problem did not have to do with the substance, but with the concept. As mentioned earlier, Bernoulli, Bayes and Laplace considered a probability to represent a degree-of-belief or plausibility - how much they thought that something was true based on the evidence at hand. To the 19th century scholars, however, this seemed too vague and subjective an idea to be the basis of a rigorous mathematical theory. The essays entitled On the Mathematical Foundations of Theoretical Statistics, published by R. A. Fisher in 1922, and On the Problem of the Most Efficient Tests of Statistical Hypotheses, published a decade later by J. Neyman and E. S. Pearson, can be considered the cornerstones of what nowadays is called the frequentist approach (Lenhard, 2006). According to Stigler (2005), Fisher's article is:

Arguably the most influential article on that subject in the twentieth century (...) An astonishing work: It announces and sketches out a new science of statistics, with new definitions, a new conceptual framework and enough hard mathematical analysis to confirm the potential and richness of this new structure.

Fisher gave major attention to estimation procedures while Neyman and Pearson largely concentrated on the construction of principles for testing hypotheses. Their work was not entirely distinct either in emphasis or application. Nor was it free from internal controversy, with Fisher's concept of fiducial² probability as the crucial element (Barnett, 1973). They redefined probability as the long-run relative frequency with which an event occurred, given many repeated trials. Since frequencies can be measured, probability was now seen as an objective tool for dealing with random phenomena (Sivia & Skilling, 2007). The only quantitative information handled by frequentists is sample data. This implies that prior information about the parameter, θ, is of no importance, but may be expected to influence the choice of statistical procedure and the performance characteristics needed, e.g. working hypotheses and significance levels in a test of significance (Barnett, 1973).

² Fiducial inference can be interpreted as an attempt to perform inverse probability without having prior probability distributions. Fiducial inference quickly attracted controversy and was never widely accepted.

It is clear that considerations of a priori probability may (...) need to be taken into account (...) Occasionally it happens that a priori probabilities can be expressed in exact numerical form (...) but in general we are doubtful of the value of attempts to combine measures of the probability of an event if a hypothesis be true, with measures of the a priori probability of that hypothesis. . . The vague a priori grounds on which we are intuitively more confident in some alternatives than in others must be taken into account in the final judgment, but cannot be introduced into the test to give a single probability measure.

(Neyman & Pearson, 1967)

Following Fisher, most influential statisticians of that period favored an objective frequentist approach (Wylie et al., 2006).

Even though the first half of the 20th century was accompanied by the development of the frequentist approach, the flames of Bayesian thinking were kept alive by a few thinkers such as Bruno de Finetti and Harold Jeffreys (Cowles, Kass, & O'Hagan, 2009).

The modern Bayesian movement began in the second half of the 20th century, but Bayesian inference remained extremely difficult to implement until the late 1980s and early 1990s, when powerful computers became widely accessible and new computational methods were developed. The subsequent explosion of interest in Bayesian statistics has not only led to extensive research in Bayesian methodology but also to the use of Bayesian methods to address pressing questions in diverse application areas such as astrophysics, weather forecasting, health care policy, and criminal justice (Cowles, Kass, & O'Hagan, 2009).

The following section will give a further introduction to the Bayesian approach by introducing conditional probabilities and showing how Bayes' Theorem is derived from conditional probabilities.

2.2 The Bayesian Approach

Bayesian inference contains complicated mathematical simulations and a variety of statistical techniques, but, as mentioned in section 1.2, the most challenging part is simply the fact that it is grounded in a fundamentally different paradigm than traditional frequentist statistics. It is a whole different way of viewing the world. Jaynes (2003) argued that the Bayesian approaches to scientific questions offer a different way of viewing reality - one that actually reflects the way humans perceive reality.

As mentioned in section 2.1, the frequentists view the unknown parameters as fixed constants, and define probability as a relative frequency or a proportion of an outcome in a population. The aim of the frequentist approach is to use data to estimate the unknown value of a parameter. The data obtained represent only one possible realization of the current experiment, and the corresponding probability distribution, given the data, is called a sampling distribution. This sampling distribution is crucial to any assessment of the behavior of the parameter. The estimate of the parameter is seen as a typical value that is likely to arise in repeated sampling. Fisher introduced the likelihood function of a parameter as another way to represent the information provided by the sample (Barnett, 1973). The result of the frequentist approach is either a "true" or "false" conclusion based on hypothesis testing or confidence intervals (Howson & Urbach, 1993).

Where the frequentists treat the unknown parameters as fixed constants, the Bayesian approach treats them as random variables, which means that the parameters can vary according to a probability distribution. This variation can be regarded as purely stochastic for a data-driven model, but it can also be interpreted as beliefs of uncertainty under the Bayesian approach. In a Bayesian formulation, the uncertainty about the value of each parameter can be represented by a probability distribution, if prior knowledge can be quantified (Kynn, 2005). As mentioned in section 2.1, Bayes was interested in solving the inverse probability - the probability of an event given the observations of other events. The result of his studies led to what nowadays is called Bayes' Theorem (SAS Institute, 2008). Since Bayes' Theorem is the key to Bayesian statistics and since it relies on conditional probability, conditional probability will initially be introduced and Bayes' Theorem will be derived.

2.2.1 Conditional Probability


To get an understanding of conditional probability, let us consider a random experiment in terms of events. A random experiment is characterized by having an outcome that is not completely predictable, which means that the experiment can be repeated under the same conditions without leading to the same result. The result of one single trial of the random experiment is the outcome, and an event is any set of possible outcomes of a random experiment. The sample space is all possible outcomes of one single trial of the random experiment, denoted U. Since the sample space represents everything considered, it is also called the universe (Bolstad, 2007).

A Venn diagram³ is used to illustrate the relationship between two events, which is shown in figure 2.1. The rectangle illustrates the universe, U, and the circles illustrate the occurring events. The relationship between two events depends on the outcomes they have in common. If all the outcomes in one event are also in the other event, the first event is a subset of the other. If the events have some outcomes in common, they are intersecting events, such as the areas A and B in the Venn diagram.

³ Venn diagrams or set diagrams are diagrams that show all possible logical relations between a finite collection of sets (aggregations of things).


Figure 2.1: Venn Diagram

Source: Bolstad (2007)

Consider figure 2.1. If we know that one event has occurred, does that affect the probability of the occurrence of another event? Conditional probability is used to answer this question. Suppose that event B has occurred; the universe of interest is then reduced so that the only thing of interest is inside the circle B. Say event A occurs. The only part of event A that is now relevant is that part also contained in B, that is, B ∩ A. The joint probability of events B and A is the probability that both events occur simultaneously, on the same repetition of the random experiment.

Given that event B has occurred, the total probability of the reduced universe must be equal to one⁴. The probability of event A, given event B, is the unconditional probability of that part of A which is also included in B, multiplied by a scale factor 1/Pr(B). Adding this information together gives the conditional probability of event A given event B:

Pr(A | B) = Pr(B ∩ A) / Pr(B)    (2.1)

⁴ An axiom of probability: Pr(U) = 1 (the total probability of the universe equals 1).

as long as Pr(B) ≠ 0. Equation 2.1 shows that the conditional probability Pr(A | B) is proportional to the joint probability Pr(B ∩ A) but has been rescaled so that the probability of the reduced universe equals 1. The marginal probability of event B is found by summing the probabilities of its disjoint parts. Since B = (B ∩ A) ∪ (B ∩ Ā), and clearly (B ∩ A) and (B ∩ Ā) are disjoint, this simplified two-event example gives:

Pr(B) = Pr(B ∩ A) + Pr(B ∩ Ā)    (2.2)

where Ā is the complement of A. Equation 2.2 is now substituted into the definition of conditional probability, equation 2.1, to get


Pr(A | B) = Pr(A ∩ B) / [Pr(B ∩ A) + Pr(B ∩ Ā)]    (2.3)

Now the multiplication rule can be used to find each of these joint probabilities. This gives Bayes' Theorem for the two-event example:

Pr(A | B) = Pr(B | A) Pr(A) / [Pr(B | A) Pr(A) + Pr(B | Ā) Pr(Ā)]    (2.4)
Usually the universe is partitioned by many different events, A1, A2, ..., An. This partition has some conditions:

- The union of A1, A2, ..., An equals U.

- Every distinct pair of the partitioning events is disjoint, so Ai ∩ Aj = ∅ for i = 1, 2, ..., n, j = 1, 2, ..., n, and i ≠ j.

Using the law of total probability, the marginal probability of B, which is also the denominator in Bayes' Theorem, is calculated as:

Pr(B) = Σ_{j=1}^{n} Pr(B | Aj) Pr(Aj)    (2.5)

Equation 2.5 just states that the probability of event B is the sum of the probabilities of its disjoint parts. Substituting equation 2.5 into equation 2.4 gives:

Pr(Ai | B) = Pr(B | Ai) Pr(Ai) / Σ_{j=1}^{n} Pr(B | Aj) Pr(Aj)    (2.6)

Equation 2.6 is what is known as Bayes' Theorem and is a restatement of equation 2.1, where the joint probability in the numerator is identified by the multiplication rule, and the marginal probability contained in the denominator is found by using the law of total probability followed by the multiplication rule - a reconfiguration of the conditional probability (Bolstad, 2007). Bayes' Theorem consists of three different components, which are the cornerstones of Bayesian statistics: the initial probability of the parameter, Pr(Ai), is the prior probability; the probability of the parameter given the data, Pr(Ai | B), is the posterior probability; and the probability of the data given the parameter, Pr(B | Ai), is the likelihood function known from the frequentist approach. The final term, Σ_{j=1}^{n} Pr(B | Aj) Pr(Aj), is the probability of the data B, which is not dependent on the parameter and acts as a normalizing constant. A more convenient way to present Bayes' Theorem is by omitting the marginal distribution term, since it does not provide any additional information about the posterior, as long as the integral is finite (SAS Institute, 2008). For this reason, equation 2.6 is often referred to in terms of the prior, likelihood and posterior:

Pr(Ai | B) ∝ Pr(B | Ai) Pr(Ai)
posterior ∝ likelihood × prior    (2.7)

where the symbol ∝ means proportional to. Bayes' Theorem will be elaborated further in Chapter 3, where all the components will be further explained.
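To make the mechanics of equation 2.6 concrete, the short sketch below applies Bayes' Theorem to a stylized credit setting with the two-event partition "default" and "non-default", where B is the event that an applicant fails a hypothetical screening test. The sketch is written in Python, and all numbers are invented for illustration; they are not taken from Our Bank's data.

    # Hypothetical illustration of Bayes' Theorem (equation 2.6) with a two-event
    # partition: A1 = "customer defaults", A2 = "customer does not default".
    # All probabilities are made up for the example.

    prior = {"default": 0.03, "non_default": 0.97}        # Pr(Ai)
    likelihood = {"default": 0.80, "non_default": 0.10}   # Pr(B | Ai), B = "fails screening"

    # Law of total probability (equation 2.5): marginal probability of B.
    pr_b = sum(likelihood[a] * prior[a] for a in prior)

    # Bayes' Theorem (equation 2.6): posterior probability of each event given B.
    posterior = {a: likelihood[a] * prior[a] / pr_b for a in prior}

    print(round(pr_b, 4))                    # 0.121
    print(round(posterior["default"], 4))    # about 0.1983

Even with a fairly informative screening event, the low prior default rate keeps the posterior probability of default below 20%; this interplay between prior and likelihood is exactly what the components of Bayes' Theorem, discussed in Chapter 3, formalize.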

2.3 Advantages and Disadvantages of the Bayesian Approach

Bayesian methods and frequentist methods both have advantages and disadvantages, and there are some similarities. When the sample size is large, Bayesian inference often provides results that are equivalent to those obtained by frequentist methods, which will be elaborated on in section 3.2. Some advantages of using Bayesian analysis include the following (SAS Institute, 2008) and (Bolstad, 2007):

- Bayesian methods have a single tool, Bayes' Theorem, which is used in all situations. This contrasts with frequentist procedures, which require many different tools.

- Bayesian methods provide a way of combining prior information with data, within a solid decision-theoretical framework. In science there usually is some prior knowledge about the process being measured. Leaving the prior information out is a waste of knowledge, but, as mentioned, Bayesian statistics uses both sources of information and combines them using Bayes' Theorem.

- Since Bayesian methods provide inferences that are conditional on the data and are exact, without reliance on asymptotic approximation, inference drawn on small samples is done in the same manner as with large samples.

- Bayesian methods provide interpretable answers, such as "the true parameter has a probability of 0.95 of falling in the 95% credible interval".

- Bayesian methods provide a convenient setting for a wide range of models. Markov chain Monte Carlo (MCMC hereafter), along with other numerical methods, makes computations tractable for almost all parametric models. MCMC will be introduced in section 4.1.

There are also disadvantages to using Bayesian analysis:

- Bayesian methods do not specify how to select a prior, which entails that there is no correct way to choose a prior.

- Bayesian methods can produce posterior distributions that are strongly influenced by the priors. From a practical viewpoint, this means that it might be difficult to convince experts who do not agree with the validity of the chosen prior.

- Bayesian methods often come with a high computational cost, especially in models with a large number of parameters. In addition, simulations provide slightly different answers unless the same random seed is chosen.

3 Bayes' Theorem

The following Chapter will contain a deeper introduction to Bayes' Theorem, equation 2.6, where the three components will be elaborated on. Section 2.2 dealt with random experiments in terms of events and introduced probability defined on events as a tool for understanding random experiments. The more usual form of Bayes' Theorem, which is based on random variables, will be used in this Chapter. A random variable describes the outcome of the experiment in terms of a number.

The notation used for the parameters of interest will be θ, which forms a vector of the parameters, so that θ = (θ1, θ2, ..., θk). The notation used for the collected data in this thesis is y, and since m variables are measured for every one of the n customers in Our Bank, y will form an m × n matrix.

The conclusions from a Bayesian analysis are drawn based on the posterior probability distribution. These posterior distributions are conditional on the observed data y, and by utilizing Bayes' Theorem, presented in equation 2.6, we write the statement as p(θ|y). From this point on, p(·|·) and p(·) will denote probability distributions.

The core of Bayesian inference is to update one's prior beliefs, p(θ), with new information given by the collected data, p(y|θ), and to use the necessary algorithms for computational convenience to summarize p(θ|y). Combining these notations into Bayes' Theorem, equation 2.6, gives:

p(θ|y) = p(θ)p(y|θ) / p(y)    (3.1)

where p(y) = Σ_θ p(θ)p(y|θ) for discrete random variables, and p(y) = ∫ p(θ)p(y|θ) dθ in the case of continuous random variables. Equation 3.1 can then be presented in proportional form:

p(θ|y) ∝ p(θ)p(y|θ)    (3.2)


All Bayesian inferences follow from this posterior probability distribution, since it captures all the relevant information regarding the parameters. In the following sections the three different components from equation 3.2 will be explained in detail.

3.1 The Likelihood p(y | θ)

The likelihood, p(y | θ), describes what is expected to be seen for every particular value of the parameter θ. It forms the prediction of what the data should look like if the parameter takes the particular value θ.

Many statisticians base their inference on likelihood functions, following the work of Fisher presented in section 2.1 and section 2.2. The likelihood function, often denoted ℓ(θ; y), is a function of θ with the data values serving as parameters of that function. People who follow the frequentist approach will typically choose the value of θ that provides the maximum likelihood (Lancaster, 2004). The choice of likelihood function amounts to a choice of a family of probability distributions for the data, one for each value of θ. The theory of probability holds many such distributions, ranging from simple distributions to probability models for high-dimensional random variables involving many parameters and complex patterns of dependence.

To construct a likelihood function, an appropriate probability distribution must be chosen. The chosen likelihood distribution must be appropriate to the type of data that are observed and must be able to represent the estimated model within it. Furthermore, it should make it possible to discredit the estimated model when it is clearly inconsistent with the evidence. The likelihood should not be perceived as unchangeable; rather, it is suggested to explore variations in the inferences over a set of likelihoods, each of which embodies the theory. Furthermore, it is suggested that it is better if the chosen likelihoods are relatively unrestricted.

From the subjective perspective utilized in this thesis, a likelihood represents your beliefs about the values of the data conditional on θ. It is your likelihood, in the same way that the marginal distribution for θ and the prior p(θ) represent your beliefs about the parameter. If the aim is to convince others about the interest of your results, it is advised to choose a likelihood that is not clearly inconsistent with the beliefs of your audience and your readers (Lancaster, 2004).
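To connect the likelihood to the models estimated later in the thesis, the sketch below writes out the Bernoulli log-likelihood of a logistic regression model, i.e. the p(y | θ) that enters equation 3.2, for a 0/1 default indicator y and a design matrix X. The code is a minimal Python illustration using simulated data, not the Bank's data.

    import numpy as np

    def log_likelihood(theta, X, y):
        """Bernoulli (logistic regression) log-likelihood l(theta; y).

        theta : coefficient vector (intercept first)
        X     : design matrix with a leading column of ones
        y     : 0/1 default indicator
        """
        eta = X @ theta                              # linear predictor
        # log p(y | theta) = sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ]
        # (a numerically naive form, sufficient for illustration)
        return np.sum(y * eta - np.log1p(np.exp(eta)))

    # Simulated toy data.
    rng = np.random.default_rng(0)
    n = 500
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    theta_true = np.array([-2.0, 1.0])
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ theta_true)))

    print(log_likelihood(theta_true, X, y))          # log-likelihood at one value of theta

Viewed as a function of θ with (X, y) held fixed, this is the likelihood; a frequentist would maximize it, whereas the Bayesian approach multiplies it by a prior p(θ), as in equation 3.2.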

3.2 The Prior p(θ)

Until the modern computing age, the problem with Bayesian formulations was the often intractable equation for the posterior. Prior distributions can be specifically chosen to be compatible with the likelihood function to avoid this problem; such priors are called conjugate priors⁵. However, the significant advances in computing power, methodology and software over the last few decades mean that the posterior density function can be directly sampled using simulation techniques, which will be introduced in section 4.1. The current difficulty in the Bayesian approach is the specification of a prior distribution, and selecting an appropriate prior is probably the most important aspect of Bayesian modeling (Kynn, 2005). The prior distribution is a key part of Bayesian inference and represents the information about an uncertain parameter, θ, that is combined with the probability distribution of the likelihood of new data to yield the posterior distribution, which in turn is used for future inference and decisions involving θ (Gelman, 2002). Considerable care should be taken when selecting priors, and the selection should be supported by careful documentation. This is because inappropriate choices of priors can lead to incorrect inference (SAS Institute, 2008).

⁵ A prior is said to be a conjugate prior for a family of distributions if the prior and posterior distributions are from the same family, which means that the form of the posterior has the same distributional form as the prior distribution.

The appearance of the prior distribution in the right-hand side of Bayes' Theorem

is at once a strength and a weakness of the Bayesian approach: a strength because

it allows information beyond the data at hand to be used in making inferences,

and a weakness because the inferences inevitably depend, at least to some degree,

on the choice of prior.

(Armitage & Colton, 2005)

The key issues in setting up a prior distribution are:

- What information is going into the prior distribution?

- The properties of the resulting posterior distribution.

Gelman (2002) points out that with well-identified parameters and large sample sizes, reasonable choices of prior distributions will have minor effects on posterior inferences. This means that where large amounts of data are available, the influence of the prior will be negligible, giving similar results to purely data-driven inference. This feature is commonly referred to as likelihood dominance (Lancaster, 2004). In the absence of data, the inference will be driven by the prior distributions. Between these two extremes, the prior will have some modifying effect on the data. The extent to which the prior influences the resulting posterior distribution can be investigated by comparing different prior formulations through a sensitivity analysis (Kynn, 2005).
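A small simulation can make likelihood dominance and prior sensitivity concrete. The sketch below, which uses a conjugate normal-normal model with invented numbers rather than anything from the thesis data, updates an unknown mean under two deliberately conflicting priors and shows that the resulting posterior means practically coincide for a large sample while they differ noticeably for a small one.

    import numpy as np

    def normal_posterior(prior_mean, prior_var, data, sigma2=1.0):
        """Conjugate normal-normal update for an unknown mean with known variance sigma2."""
        n = len(data)
        post_var = 1.0 / (1.0 / prior_var + n / sigma2)
        post_mean = post_var * (prior_mean / prior_var + data.sum() / sigma2)
        return post_mean, post_var

    rng = np.random.default_rng(1)
    true_mean = 0.5

    for n in (5, 5000):                                         # small vs. large sample
        data = rng.normal(true_mean, 1.0, size=n)
        for prior_mean, prior_var in ((0.0, 0.1), (2.0, 0.1)):  # two conflicting priors
            m, _ = normal_posterior(prior_mean, prior_var, data)
            print(f"n={n:5d}  prior mean={prior_mean:.1f}  posterior mean={m:.3f}")

With n = 5 the two priors pull the posterior means well apart, while with n = 5000 both posteriors concentrate near the sample mean. Comparing runs of this kind is, in miniature, the sensitivity analysis referred to above.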

There are a number of points that usually are taken into account when specifying the priors. The first point is that priors can be tentative. Since the inference necessarily depends on the choice of prior, Lancaster (2004) suggests that alternative priors are examined to explore how sensitive the main conclusions are to alterations in the prior. Furthermore, it is legitimate to allow prior beliefs to be influenced by inspection of the data. The second point is that priors should be encompassing. This means that priors should take account of the beliefs of the readers, since prior beliefs that conflict sharply with those of the readers will make the work of little interest to them. In light of this, it is suggested that public scientific work use priors that are not sharply inconsistent with any reasonable belief. This requirement can sometimes be met by using a uniform or flat distribution on some reasonable function of the parameter (Lancaster, 2004).

3.2.1 Informative and Non-informative Prior Distributions


When specifying Bayesian prior distributions, it is useful to distinguish between informative and non-informative prior distributions. An informative prior distribution can be defined as a prior that summarizes the evidence about the parameters concerned from many sources and which often has considerable impact on the posterior distribution. A non-informative prior, on the other hand, provides little information relative to the experiment (Wylie, Muegge, & Thomas, 2006).

Figure 3.1: Prior Distributions

Source: Own elaboration

Figure 3.1 illustrates three different prior distributions, where prior A is relatively non-informative. Priors B and C are both informative, but represent different prior beliefs, B being more precise than C.

Typically, informative prior distributions are created from historical data, from expert

knowledge, or from a combination of both. The proper use of informative prior distributions

illustrates the power of the Bayesian methods since previous studies, past experience, or expert

knowledge can be combined with the current information in a natural way. However, using

informative priors can lead to problems due to the subjective beliefs (SAS Institute, 2008).

Non-informative priors attempt to avoid subjectivity. The term non-informative is used to

connote the lack of subjective beliefs used in formulating such a prior. Due to the objectivity

of non-informative priors many statisticians favor this type. However, it is important to keep

20
3.2 The Prior p() Chapter 3

in mind that it is unrealistic to expect that non-informative priors represent total ignorance

about the parameter of interest (SAS Institute, 2008).

A common choice of non-informative prior is the flat prior, which is a prior distribution that assigns equal likelihood to all possible values of the parameter. However, this might not be truly non-informative, which can be illustrated by considering a binomial experiment with n Bernoulli trials⁶. The purpose is to make inferences about the unknown success probability p. A uniform prior on p,

π(p) ∝ 1 (3.3)

might appear to be non-informative. However, since the uniform prior is equivalent to adding two observations to the data, one 1 and one 0, experiments with a small n and y can be very influenced by the added observations (SAS Institute, 2008).
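To make the pseudo-observation interpretation concrete (a standard result, included here only as an illustration), note that with y successes in n trials the uniform prior leads to

p | y ∼ Beta(y + 1, n − y + 1), with posterior mean (y + 1)/(n + 2),

which is the success proportion one would compute after adding one success and one failure to the data. With n = 2 and y = 0, for instance, the posterior mean is 1/4, whereas the maximum likelihood estimate is 0.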

3.2.2 Improper Priors


A probability distribution for θ is called improper if its integral over the parameter space does not converge. A simple example is the expression

p(θ) ∝ 1, θ ∈ (−∞, ∞) (3.4)

which is called a uniform distribution on the real line and can be thought of as a rectangle on an infinitely long base. Its integral, the area under the line, does not converge; it is infinite, and so equation 3.4 is not, in fact, a probability distribution. However, improper priors are frequently used in applied Bayesian inference (Lancaster, 2004).

One of the reasons is, at least mathematically, that it does not matter if the prior is improper. Because the posterior distribution of θ is the object of ultimate interest, and since this is formed by multiplying the likelihood and the prior, it is perfectly possible for the posterior distribution to be proper even though the prior is not. However, improper prior distributions can also lead to posterior impropriety (Lancaster, 2004). To determine whether a posterior distribution is proper, it is suggested to make sure that the normalizing constant ∫ p(y | θ)p(θ) dθ is finite for all y's. If an improper prior distribution leads to an improper posterior distribution, inferences based on the improper posterior distribution are obviously invalid (SAS Institute, 2008).

Another reason is that an improper prior can be considered as an approximation to a proper prior that is intended to represent very imprecise or vague beliefs. Often the uniform prior is thought of as a labor-saving device, since it saves the trouble of specifying exact beliefs (Lancaster, 2004).

⁶ A Bernoulli trial is an experiment whose outcome is random and can be either of two possible outcomes, "success" and "failure".


3.3 The Posterior p(θ | y)


The posterior density represents your beliefs about θ given your prior beliefs and the beliefs encompassed in the likelihood. In many applications the posterior is the end point of the empirical analysis. Reporting the results of the empirical analysis implies displaying the posterior distribution to which the model and data have led, which can be done in several ways (Lancaster, 2004):

Draw it; when θ is a scalar, the best way of presenting the content of the posterior distribution is by drawing it. This is also valid when θ is a vector but the parameter of interest is a one-dimensional function of θ.

Report its moments; traditional practice is to report an estimate of θ together with an estimate of the standard deviation of its repeated sampling distribution.

Report a highest posterior density region; similarly, traditional practice often reports a confidence interval for θ. The Bayesian analogue is to find, from the posterior distribution of θ, an interval in which θ lies with probability 0.95.

Calculate the marginals; the calculation involved in forming the posterior distribution of the object of interest might be mathematically challenging. Fortunately two solutions exist. The first is the use of approximations to posterior distributions and the second is the method of computer assisted sampling.
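As a small illustration of these reporting options, the following Python sketch (not part of the SAS workflow used later in the thesis) summarizes a vector of simulated posterior draws by its mean, standard deviation and a 95% equal-tailed credible interval, a simpler alternative to a highest posterior density region:

import numpy as np

# hypothetical posterior draws for a scalar parameter theta
rng = np.random.default_rng(0)
draws = rng.normal(loc=3.0, scale=0.5, size=10_000)

posterior_mean = draws.mean()                        # point estimate of theta
posterior_sd = draws.std(ddof=1)                     # posterior standard deviation
ci_low, ci_high = np.percentile(draws, [2.5, 97.5])  # 95% equal-tailed credible interval

print(f"mean={posterior_mean:.3f}, sd={posterior_sd:.3f}, "
      f"95% interval=({ci_low:.3f}, {ci_high:.3f})")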

4 Bayesian Logistic Regression

As mentioned in section 1.1, Bayesian logistic regression will be applied in the current thesis to analyze if this method is a more effective tool that improves quality of service and minimizes the risk of credit loss compared to Our Bank's current approaches.

The logistic regression model belongs to the class of Generalized Linear Models (GLM hereafter)⁷. Logistic regression allows prediction of a discrete outcome from a set of predictor variables that may be continuous, discrete, dichotomous, or a mix, and is a flexible technique since it has no assumptions about the distribution of the predictor variables. The response variable in a logistic regression model is binomial and the expectation is related to the linear predictor through the logit function (Wilhelmsen, Dimakos, Husebø, & Fiskaaen, 2009).

In the present thesis we introduce an indicator of default. Consider Yi = 1 if customer i defaults (bad credit), and Yi = 0 otherwise. Then,

Yi ∼ Binomial(p_i), i = 1, ..., N (4.1)

⁷ In statistics, the GLM is a flexible generalization of ordinary linear regression that allows for response variables to have other than a normal distribution.


and the linear predictor

η_i = β_0 + Σ_{j=1}^{M} β_j x_ij (4.2)

is linked to the default probability through the logit function,

logit(p_i) = ln( p_i / (1 − p_i) ) = η_i (4.3)

In this thesis, the explanatory variables, x_ij, are customer characteristics from the questionnaire. It follows that the probability of default is given by:

p_i = exp(β_0 + Σ_{j=1}^{M} β_j x_ij) / (1 + exp(β_0 + Σ_{j=1}^{M} β_j x_ij)) (4.4)

The Bayesian model is formulated by specifying prior distributions on the regression coefficients in equation 4.2,

β_j ∼ p(β_j | θ_j), j = 0, ..., M (4.5)

The prior distribution p(· | θ_j) may be any proper probability density function and θ_j may be a scalar or a parameter vector. The interpretation of the model parameters depends on the choice of the distribution, but may for instance include a measure of the center (e.g. mean) and spread of the prior (e.g. standard deviation). By specifying different values of θ_j, the regression coefficients may have very different priors even if the distribution function, p(·), is the same (Wilhelmsen, Dimakos, Husebø, & Fiskaaen, 2009).
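As a small numerical illustration of equations 4.2-4.4, the Python sketch below (with made-up coefficients and standardized questionnaire scores, not estimates from the thesis) computes the linear predictor and the implied default probability for a single customer:

import numpy as np

# made-up coefficients: intercept beta_0 and slopes beta_1, ..., beta_M
beta0 = -3.0
beta = np.array([0.8, -0.4, 1.2])

# illustrative standardized questionnaire scores x_i1, ..., x_iM for customer i
x_i = np.array([0.5, -1.0, 0.2])

eta_i = beta0 + beta @ x_i                  # linear predictor, equation 4.2
p_i = np.exp(eta_i) / (1 + np.exp(eta_i))   # default probability, equation 4.4

print(f"eta_i = {eta_i:.3f}, p_i = {p_i:.4f}")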

4.1 Parameter Estimation in Bayesian Models

In statistical modeling, such as the frequentist logistic regression, the aim, among others, is to estimate the β-coefficients of the predictor variables by using the maximum likelihood procedure. In brief, maximum likelihood estimation is an iterative procedure that starts with arbitrary values of the coefficients for the set of predictors and determines the direction and size of change in the coefficients that will maximize the likelihood of obtaining the observed frequencies. Then residuals for the predictive model, based on those coefficients, are tested, and another determination of the direction and size of change in the coefficients is made, until the coefficients change very little, which implies that convergence is reached (Tabachnick & Fidell, 2008).

However, in Bayesian statistics these β-coefficients are derived directly from the posterior probability distribution of the unknown parameters. Any features of the posterior distribution are legitimate for Bayesian inference, e.g. moments, quartiles, and highest posterior density regions.


All these quantities can be expressed in terms of posterior expectations of functions of θ. The posterior expectation of a function f(θ) is (Gilks, Richardson, & Spiegelhalter, 1996):

E[f(θ) | y] = ∫ f(θ) p(θ) p(y | θ) dθ / ∫ p(θ) p(y | θ) dθ (4.6)

In multidimensional Bayesian models, the objective is often to retrieve a scalar function of the parameter vector θ, with respect to a single parameter of interest, say θ_i, which could be a regression coefficient. This involves finding the marginal posterior distribution of the parameter, p(θ_i | y), which requires integration of the posterior distribution p(θ | y) over all parameters other than the one of interest (Lancaster, 2004):

p(θ_i | y) = ∫ p(θ_i | θ_{k≠i}, y) p(θ_{k≠i}) dθ_{k≠i} (4.7)

In simple models it will be possible to derive the marginal distribution by hand, which to some extent is described by Smith (1991). However, as the number of dimensions increases, so does the difficulty of these calculations. A major limitation towards more widespread implementation of Bayesian approaches is that obtaining the posterior distribution often requires the integration of high-dimensional functions (Geyer, 1992).

Until recently, acknowledging the full complexity and structure in many applications was difficult and required the development of specific methodology and purpose-built software. Now, MCMC methods provide a unifying framework within which many complex problems can be analyzed using generic software (Gilks, Richardson, & Spiegelhalter, 1996). In the following section the general ideas behind MCMC simulations are presented. Many ways of constructing the Markov chains exist, but the most commonly used is the Gibbs sampler, which has had a major influence on the increase of Bayesian applications. Since the Gibbs sampler is a special case of the Metropolis-Hastings algorithm, both of these will be introduced in section 4.1.2. Finally, different convergence criteria for assessing whether the Markov chains have reached the stationary distribution are presented in section 4.1.3.

4.1.1 MCMC Simulations⁸

⁸ See: Walsh (2004), Roberts & Rosenthal (1998) and Gilks, Richardson, & Spiegelhalter (1996).


It took nearly 40 years for MCMC to penetrate mainstream statistical practice. It originated in the statistical physics literature, and has been used for a decade in spatial statistics and image analysis. In the last few years, MCMC has had a profound effect on Bayesian statistics, and has also found applications in frequentist statistics (Gilks, Richardson, & Spiegelhalter, 1996).

MCMC methods are a class of algorithms used for simulating samples from a posterior distribution by constructing a Markov chain that has the desired true posterior distribution as its stationary distribution. MCMC is Monte Carlo integration using Markov chains. As mentioned in

section 4.1, Bayesian statistics often includes integration over possibly multidimensional probability distributions to make inference about the model parameters or to make predictions. Monte Carlo integration draws samples from the required distribution, and then forms sample averages to approximate expectations. Markov chain Monte Carlo draws these samples by running a cleverly constructed Markov chain for a long time. In this section MCMC is introduced as the method for evaluating the expression in equation 4.6 (Gilks, Richardson, & Spiegelhalter, 1996). Since MCMC has two constituent parts, Monte Carlo integration and Markov chains, each part will briefly be described one by one.

4.1.1.1 Monte Carlo Integration


The term Monte Carlo in the context of statistics means computer simulation. To get an understanding of Monte Carlo integration, consider the following complex integral over a multidimensional distribution (Walsh, 2004):

∫_a^b g(x) dx (4.8)

g(x) can be decomposed into the product of a function, f(x), and a probability distribution, π(x), defined over the interval (a, b). Then the integral in equation 4.8 can be expressed as the expectation of f(x) over π(x) on the interval (a, b):

∫_a^b g(x) dx = ∫_a^b f(x) π(x) dx = E_{π(x)}[f(x)] (4.9)

Monte Carlo integration evaluates E_{π(x)}[f(x)] by drawing random samples {X_t, t = 1, ..., n} from the given probability distribution, π(x). The population mean μ of f(x) can then be approximated by the sample mean μ̄:

μ̄ = (1/n) Σ_{t=1}^{n} f(X_t) (4.10)

where n is the number of samples drawn. Note that n is not the size of the fixed data sample.

Given that the samples {X_t} are independent, the law of large numbers ensures that the approximation can be made as accurate as desired, and the central limit theorem holds as n → ∞ (Geyer, 1992):

√n (μ̄ − μ) → N(0, σ²) (4.11)
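A minimal Python sketch of plain Monte Carlo integration is shown below; the integrand f(x) = x² and the standard normal π(x) are illustrative choices (the true value of E_π[f(X)] is then 1):

import numpy as np

rng = np.random.default_rng(42)

def f(x):
    return x ** 2   # illustrative integrand

# draw independent samples from pi(x), here a standard normal distribution
n = 100_000
samples = rng.standard_normal(n)

estimate = f(samples).mean()   # Monte Carlo estimate of E_pi[f(X)], equation 4.10
print(f"estimate = {estimate:.4f} (true value 1.0)")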

However, it is not always possible to draw samples {X_t} independently from π(x), since the density can be of non-standard form. But these draws do not have to be independent, as long as {X_t} can be drawn throughout the support of π(x) by some process. One possible process is the Markov chain, where π(x) is the stationary distribution of the chain (Gilks, Richardson, & Spiegelhalter, 1996). The concept of Markov chains will be described in the following paragraph.

4.1.1.2 Markov Chains


Markov chains represent a special kind of time series, where the state of a variable at the current time point depends only on the state of the variable at the previous time point (Che & Xu, 2010).

Let X_t denote the value of a random variable at time t and let the state space refer to the range of possible X-values. When applying MCMC, the state space is of such a high-dimensional nature that direct computation about π(x) is impossible; in a Bayesian inference application, π(x) describes the posterior distribution. Suppose we generate a sequence of random variables, {X_0, X_1, X_2, ..., X_n}, in such a way that at each time t the next state, X_{t+1}, depends only on the current state, meaning that X_{t+1} is sampled from a distribution p(X_{t+1} | X_t). This means that, given X_t, the next state X_{t+1} does not depend further on the history of the chain {X_0, X_1, ..., X_{t-1}}, only on the current state X_t. When this holds, the process is called a Markov process, which creates the Markov chain of random variables. A particular chain is defined most critically by its transition probabilities (or, more familiarly, the transition kernel), Pr(i, j) = Pr(i → j), which is the probability that a process in state s_i moves to state s_j in a single step (Walsh, 2004),

Pr(i, j) = Pr(i → j) = Pr(X_{t+1} = s_j | X_t = s_i) (4.12)

If the chain is simulated long enough, so that t → ∞, the chain gradually forgets its initial state X_0, and the distribution of X_n will eventually converge to a unique distribution, called the stationary distribution, so that the draws increasingly look like dependent samples from that stationary distribution.

The term burn-in refers to the number of iterations, s, it takes for the chain to converge to the stationary distribution. Convergence can be assessed by a number of different criteria, which will be presented in section 4.1.3. The output generated from the Markov chain can now be used to estimate E[f(X)], where X has the distribution π(x). Usually the burn-in samples are discarded in the calculation, so that (Gilks, Richardson, & Spiegelhalter, 1996):

E[f(X)] ≈ (1/(n − s)) Σ_{t=s+1}^{n} f(X_t) (4.13)

Several algorithms for creating the Markov chains exist, but most of them are built up from the same basis, the Metropolis-Hastings algorithm. The fundamentals of the algorithm are described in the following section.


4.1.2 MCMC Algorithms


The goal of MCMC is to construct a Markov chain whose stationary distribution π(x) is exactly the distribution of interest, the posterior distribution. The earliest MCMC algorithm is the Metropolis algorithm introduced by Metropolis and Ulam (1949) and further described by Metropolis et al. (1953). Hastings (1970) made a generalization of the Metropolis algorithm and developed the so-called Metropolis-Hastings algorithm. Geman and Geman (1984) analyzed an image dataset by using what is now called the Gibbs sampler, which is a special case of the Metropolis-Hastings algorithm (Che & Xu, 2010). All these algorithms can draw a sequence of samples from the joint distribution of two or more variables. The Gibbs sampler is the simplest MCMC algorithm and will briefly be presented below. Since PROC MCMC in SAS uses the random-walk Metropolis (RWM hereafter), which is a special case of the Metropolis algorithm, the Metropolis algorithm will also be described below.

The reason why the algorithms work is beyond the scope of this thesis, but more detailed descriptions and proofs are given in Gilks, Richardson, & Spiegelhalter (1996), Chen, Shao, & Ibrahim (2000), and Liu (2001).

4.1.2.1 Gibbs Sampler


The Gibbs sampler, named by Geman and Geman in 1984 after the American physicist Josiah W. Gibbs, is a special case of the Metropolis-Hastings sampling algorithm in which the proposed value is always accepted. The task remains to specify how to construct a Markov chain whose values converge to the target distribution (Walsh, 2004).

The key to Gibbs sampling is that it only considers univariate conditional distributions - the distribution of the parameter under consideration when all other variables are held fixed at their current values. Such conditional distributions are easier to simulate than complex joint distributions and usually have simple forms (Walsh, 2004). The sampler can be efficient when the parameters are not highly dependent on each other and the full conditional distributions are easy to sample from (SAS Institute, 2008).

To introduce the Gibbs sampler, suppose θ is the parameter vector, which can be expressed as θ = (θ_1, θ_2, ..., θ_k)′, p(y | θ) is the likelihood, and p(θ) is the prior distribution. The full posterior conditional distribution of θ_i given the remaining parameters, π(θ_i | θ_j, i ≠ j, y), is proportional to the joint posterior density, that is (SAS Institute, 2008):

π(θ_i | θ_j, i ≠ j, y) ∝ p(y | θ) p(θ) (4.14)

For instance, the one-dimensional conditional distribution of θ_1, given fixed values θ_j = θ_j* for 2 ≤ j ≤ k, is computed as the following (SAS Institute, 2008):

π(θ_1 | θ_j = θ_j*, 2 ≤ j ≤ k, y) ∝ p(y | θ = (θ_1, θ_2*, ..., θ_k*)′) p(θ = (θ_1, θ_2*, ..., θ_k*)′) (4.15)

The idea of the sampler is that it is much easier and more efficient to consider a sequence of conditional distributions than it is to obtain a marginal distribution by integration over the joint probability distribution. The Gibbs sampler can be summarized as follows (SAS Institute, 2008):

1. Set t = 0, and choose an arbitrary initial value θ^(0) = {θ_1^(0), ..., θ_k^(0)}.

2. Generate each component of θ as follows:

(a) Draw θ_1^(t+1) from π(θ_1 | θ_2^(t), ..., θ_k^(t), y)

(b) Draw θ_2^(t+1) from π(θ_2 | θ_1^(t+1), θ_3^(t), ..., θ_k^(t), y)

(c) ...

(d) Draw θ_k^(t+1) from π(θ_k | θ_1^(t+1), ..., θ_{k−1}^(t+1), y)

3. Set t = t + 1. If t < T, the number of desired samples, return to step 2. Otherwise, stop.

As mentioned above, the power of Gibbs sampling is that the joint distribution of the parame-

ters will converge to the joint probability of the parameters given the observed data (Rouchka,

2008).

4.1.2.2 The Metropolis Algorithm


The Metropolis algorithm is named after its inventor, the American physicist and computer scientist Nicholas C. Metropolis. The algorithm is simple but practical, and can be used to obtain random samples from an arbitrarily complicated target distribution of any dimension that is known up to a normalizing constant (SAS Institute, 2008).

To get an overall understanding of the Metropolis algorithm, suppose we want to obtain T samples from a univariate distribution with probability density function f(θ | y), and let θ^t denote the t'th sample from f. To use the Metropolis algorithm, we need an initial value θ^0 and a symmetric proposal density q(θ^{t+1} | θ^t). The proposal distribution should be easy to sample from, and it must be such that q(θ^{t+1} | θ^t) = q(θ^t | θ^{t+1}), meaning that the likelihood of jumping to θ^{t+1} from θ^t is the same as the likelihood of jumping back to θ^t from θ^{t+1}. The most common choice of proposal distribution is the normal distribution N(θ^t, σ) with a fixed σ. For the (t + 1)'th iteration, the algorithm generates a sample from q(· | θ^t) based on the current sample θ^t, and it makes a decision to either accept or reject the new sample. If the new sample is accepted, the algorithm repeats itself by starting at the new sample, whereas if the sample is rejected, the algorithm starts at the current point and repeats. In theory the algorithm is self-repeating, but in practice we decide on the total number of samples needed in advance (SAS Institute, 2008).

To summarize the Metropolis algorithm, suppose q(θ_new | θ) is a symmetric proposal distribution and consider the following six steps (SAS Institute, 2008):

1. Set t = 0. Choose a starting point θ^0. This can be any initial value as long as f(θ^0 | y) > 0.

2. Generate a new sample, θ_new, by using the proposal distribution, q(· | θ^t).

3. Calculate the following quantity:

r = min{ f(θ_new | y) / f(θ^t | y), 1 } (4.16)

4. Sample u from the uniform distribution U(0, 1).

5. Set θ^{t+1} = θ_new if u < r; otherwise set θ^{t+1} = θ^t.

6. Set t = t + 1. If t < T, which is the number of desired samples, return to step 2. Otherwise stop.

The algorithm defines a chain of random variates whose distribution will converge to the desired distribution p(θ | y).
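The six steps can be written compactly in code. The Python sketch below implements a random-walk Metropolis sampler for an illustrative univariate target, a density known up to its normalizing constant; the proposal scale sigma and the starting point are arbitrary choices:

import numpy as np

rng = np.random.default_rng(7)

def f(theta):
    # target density f(theta | y), known up to a normalizing constant (illustrative)
    return np.exp(-0.5 * theta ** 2)

T = 10_000          # desired number of samples
sigma = 1.0         # scale of the normal proposal N(theta_t, sigma)
theta = 0.5         # step 1: starting point with f(theta | y) > 0
chain = np.empty(T)

for t in range(T):
    theta_new = rng.normal(theta, sigma)        # step 2: propose a new sample
    r = min(f(theta_new) / f(theta), 1.0)       # step 3: acceptance ratio
    u = rng.uniform(0.0, 1.0)                   # step 4: draw u from U(0, 1)
    if u < r:                                   # step 5: accept or reject
        theta = theta_new
    chain[t] = theta                            # step 6: store and continue

print(chain[2_000:].mean(), chain[2_000:].std())   # summaries after burn-in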


The extension of the Metropolis algorithm to a higher-dimensional θ is straightforward. Suppose θ = (θ_1, θ_2, ..., θ_k) is the parameter vector. To begin the Metropolis algorithm, select an initial value for each θ_i and use a multivariate version of the proposal distribution q(· | θ), such as a multivariate normal distribution, to select a k-dimensional new parameter vector. The other steps remain the same as those described above, and this Markov chain eventually converges to the target distribution p(θ | y) (SAS Institute, 2008).

4.1.3 Convergence Criteria


Simulation-based Bayesian inference requires using simulated draws to summarize the posterior distribution or calculate any relevant quantities of interest. There are usually two issues regarding the treatment of the simulated draws. First, we have to decide whether the Markov chain has reached stationarity, that is, the desired posterior distribution. Secondly, we have to determine the number of iterations to keep after the Markov chain has reached stationarity. Convergence diagnostics can help to solve these issues. It is important to keep in mind that there are no conclusive tests that can tell you when the Markov chain has converged to stationarity (SAS Institute, 2008).

In the following, four convergence criteria will briefly be presented, all of which are standard output in SAS.


4.1.3.1 Visual Analysis via Trace Plots


Trace plots of samples versus the simulation index can be very useful in assessing convergence. A trace plot tells whether the chain has reached its stationary distribution, whether the chain needs a longer burn-in period, or whether the chain needs to be simulated over a longer period of time. The aspects that are most identifiable from a trace plot are a relatively constant mean and variance. A chain that mixes well traverses its posterior space rapidly, and can jump from one remote region of the posterior to another in relatively few steps, whereas a chain is said to be poorly mixing if it stays within small regions of the parameter space for long periods of time (SAS Institute, 2008).

Figure 4.1 displays some typical features of trace plots.

Figure 4.1: Trace Plots

Source: Output from SAS

The upper-left trace plot displays a perfect trace plot. It indicates that the chain could have reached the right distribution, since the center of the chain seems to be around the value 3 with very small fluctuations.

The upper-right trace plot illustrates a chain that starts at a very remote initial value and makes its way to the target distribution. The first few hundred observations should be discarded, which implies that the burn-in sample should be increased.

The trace plot in the lower-left demonstrates an instance of marginal mixing. The chain is taking only small steps and does not traverse its distribution quickly. Since this type of trace plot is typically associated with high autocorrelation among the samples, it is suggested to run the chain for much longer to obtain a few thousand independent samples.

The lower-right trace plot shows a chain that is mixing very slowly, and it offers no evidence of convergence. This type of chain is entirely unsuitable for making parameter inferences (SAS Institute, 2008).

4.1.3.2 Geweke Diagnostics


The Geweke test compares values in the early part of the Markov chain to those in the latter part in order to detect failure of convergence. If the mean values of the parameters in the two time intervals are close to each other, we can assume that the two different parts of the chain have similar locations in the state space, and it is assumed that the two samples come from the same distribution (SAS Institute, 2008).

By default the Geweke test splits the sample, after removing a burn-in period, into two parts: the first 10% and the last 50%. A modified z-test, referred to as the Geweke z-score, is used to compare the two sub-samples. A value larger than 2 indicates that the mean of the series is still drifting, and a longer burn-in is required before monitoring the chain can begin (Walsh, 2004).
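A simplified version of the diagnostic can be sketched in Python as below; unlike the SAS implementation, which uses spectral-density estimates of the variances, this sketch uses plain sample variances, so it only approximates the published statistic:

import numpy as np

def geweke_z(chain, first=0.1, last=0.5):
    """Simplified Geweke z-score: compare the mean of the first 10% of the
    chain with the mean of the last 50%, using plain sample variances
    (SAS uses spectral-density estimates instead)."""
    n = len(chain)
    a = chain[: int(first * n)]
    b = chain[int((1 - last) * n):]
    return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a) +
                                           b.var(ddof=1) / len(b))

rng = np.random.default_rng(3)
print(geweke_z(rng.standard_normal(10_000)))   # stationary chain: |z| should be small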

4.1.3.3 Autocorrelation

Another way to assess convergence is to evaluate the autocorrelation between the draws of the Markov chain, which is a measure of dependency among the Markov chain samples. We would expect the k'th lag autocorrelation to become smaller as k increases; for example, our 2nd and 50th draws should be less correlated than our 2nd and 4th draws. If the autocorrelation is still relatively high for higher values of k, this indicates a high degree of correlation between the draws and slow mixing (Walsh, 2004).

4.1.3.4 Effective Sample Size


Both trace plots and autocorrelations are used to examine the mixing of a Markov chain. A closely related measure of mixing is the effective sample size (ESS hereafter). The ESS is a quantity that estimates the number of independent samples obtained from a set of samples. ESS is defined as follows (SAS Institute, 2008):

ESS = n / τ = n / (1 + 2 Σ_{k=1}^{∞} ρ_k(θ)) (4.17)

where n is the actual posterior sample size and ρ_k(θ) is the autocorrelation of lag k for θ. The quantity τ is referred to as the autocorrelation time (SAS Institute, 2008). Because the autocorrelation is typically positive, the ESS is usually smaller than the actual posterior sample size. A much smaller ESS than the actual size indicates poor mixing of the Markov chain (Che & Xu, 2010).
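The ESS can be estimated directly from a chain of draws, as in the Python sketch below. The infinite sum is truncated at the first non-positive autocorrelation, a common practical choice; SAS applies its own truncation rule, so the numbers need not match its output exactly:

import numpy as np

def effective_sample_size(chain, max_lag=1_000):
    """Estimate ESS = n / (1 + 2 * sum of lag-k autocorrelations), equation 4.17.
    The sum is truncated at the first non-positive autocorrelation."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    var = x @ x / n
    tau = 1.0                                   # autocorrelation time
    for k in range(1, min(max_lag, n - 1)):
        rho_k = (x[:-k] @ x[k:]) / (n * var)    # lag-k autocorrelation
        if rho_k <= 0:
            break
        tau += 2 * rho_k
    return n / tau

rng = np.random.default_rng(5)
print(effective_sample_size(rng.standard_normal(10_000)))   # close to n for independent draws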

PART IV
A Validation Framework
Chapter 5

5 Model Validation

Sound credit rating models are important for all financial institutions, as they form the basis for calculating risk premia, pricing credits, and allocating economic capital. The importance of sound validation techniques for rating systems stems from the fact that credit rating models of poor quality could lead to suboptimal capital allocation (Satchell & Xia, 2006). This implies that the field of model validation is one of the major challenges for financial institutions. Therefore, questions arise as to which methodologies deliver acceptable discriminatory power between the defaulting and non-defaulting customers ex ante. In this thesis, the ability to discriminate in advance between subsequently defaulting and non-defaulting customers is referred to as the discriminatory power of the credit rating model (Satchell & Xia, 2006).

The most popular validation technique used in practice is the Cumulative Accuracy Profile (CAP hereafter) and its summary statistic, the Accuracy Ratio. A concept similar to the CAP is the Receiver Operating Characteristic (ROC hereafter) curve and its summary statistic, the area under the ROC curve (AUC hereafter) (Engelmann, Hayden, & Tasche, 2003a). Both measures will be reviewed in section 5.2, whereas only the ROC curve and the AUC will be applied when comparing Our Bank's current approaches to credit scoring with the estimated Bayesian logistic regression models. Before introducing the validation techniques, the validation framework for the current thesis will be presented in section 5.1.

5.1 Validation Framework

The primary goals of validation are to:

Determine how well the estimated models perform in terms of prediction accuracy.

Ensure that a model has not been overfitted and that its performance is reliable and well understood.

Confirm that the modeling approach, not just an individual model, is robust through time.

Model validation is an essential step in the development of a credit scoring model. We aim to perform tests in a rigorous and robust manner, while also protecting against unintended errors. The performance statistics for credit scoring models can be highly sensitive to the data sample used for validation. To avoid embedding unwanted sample dependency, quantitative models should be developed and validated using some type of out-of-sample, out-of-universe, and out-of-time⁹ testing approach on panel or cross-sectional data (Sobehart, Keenan, & Stein, 2001).

⁹ Out-of-sample refers to observations for customers that are not included in the sample used to build the model. Out-of-time refers to observations that are not contemporary with the training sample. Out-of-universe refers to observations whose distribution differs from the population used to build the model.


The statistical literature on model validation is quite broad. Since we do not attempt to cover this topic exhaustively, we present in the following a methodology explained by Sobehart, Keenan, & Stein (2001), which brings together several angles of the validation literature and which is found useful in the evaluation of quantitative credit scoring models. A schematic of the framework is presented in figure 5.1 below:

Figure 5.1: Schematic of Out-of-Sample Validation Techniques

Source: Sobehart, Keenan, & Stein (2001)

Figure 5.1 splits the model testing procedure along two dimensions: time (horizontal axis) and the population of customers (vertical axis). The least restrictive validation procedure is represented by the upper-left quadrant, and the most stringent by the lower-right quadrant. Dark circles represent training data and white circles represent validation data. Gray circles represent data that may or may not be used for validation (Sobehart, Keenan, & Stein, 2001).

The upper-left quadrant illustrates the approach in which the validation data are chosen completely at random from the full training data. An assumption in connection with this procedure is that the data stay stable over time. Since the data are drawn randomly, this approach validates the estimated model across the population of customers, preserving its original distribution.

The upper-right quadrant describes one of the most common validation procedures. Here, data for model training are chosen from any time period prior to a certain date, and validation data are selected from periods only after that date from the same population. Since the sample of customers is drawn from the population at random, this approach also validates the model, preserving its original distribution.


The lower-left quadrant represents the situation in which the data are segmented into training and validation sets containing no customers in common. In this general situation the validation set is out-of-sample. If the population of the validation set is different from that of the training set, the data are out-of-universe. Because the temporal nature of the data is not used for constructing this type of out-of-sample test, this approach validates the model homogeneously in time and will not identify time dependence in the data. Thus, the assumption of this procedure is that the relevant characteristics of the population do not vary with time.

Finally, the most flexible procedure is shown in the lower-right quadrant and should be the preferred sampling method for credit scoring models. In addition to being segmented in time, the data are also segmented across the population of customers. Non-overlapping sets can be selected according to the peculiarities of the population of customers and their importance (Sobehart, Keenan, & Stein, 2001).

Because default events are rare, it is often impractical to create a credit scoring model using one dataset and then validate it on a separate hold-out dataset composed of completely independent cross-sectional data. While such an out-of-sample and out-of-time test would undoubtedly be the best way to compare model performance if default data were widely available, this is usually not the case. As a result, most institutions, including Our Bank, face the following dilemmas (Sobehart, Keenan, & Stein, 2001):

If too many defaulters are left out of the in-sample dataset, estimation of the model parameters will be seriously impaired and overfitting becomes likely.

If too many defaulters are left out of the hold-out dataset, it becomes exceedingly difficult to evaluate the true model performance due to severe reductions in statistical power.

Sobehart, Keenan, & Stein (2001) present an effective approach called walk-forward, which will be used to estimate and validate the stability of the estimated models in the current thesis. The walk-forward procedure works as follows, as illustrated in figure 5.2.

Figure 5.2: Walk-Forward Validation - an Example

Source: Own elaboration


A specific year, here 2002, is chosen. The model is estimated using all data available in, or before, the selected year, which is called the training data. Once the model form and parameters are established, the model performance can be validated using the data in the following year, 2003. Note that the validation dataset in 2003 is out-of-time for customers existing in the previous years, and out-of-sample for all the customers whose data become available after 2002. Next, the data in 2003 are added to the training data, which implies that all of the data through 2003 are used to fit the model, and 2004 is then used to validate it. The process is repeated using data for every year available.
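The walk-forward splits can be generated with a few lines of code. The Python sketch below assumes a hypothetical pandas DataFrame with a 'year' column and hypothetical helper functions, and is only meant to illustrate the procedure; the actual estimation in the thesis is carried out in SAS:

import pandas as pd

def walk_forward_splits(df, first_train_year=2002, last_year=2010):
    """Yield (t, training data, validation data) for the walk-forward procedure:
    train on all years up to and including t, validate on year t + 1.
    Assumes a hypothetical column named 'year'."""
    for t in range(first_train_year, last_year):
        train = df[df["year"] <= t]
        valid = df[df["year"] == t + 1]
        if len(valid) > 0:
            yield t, train, valid

# usage sketch with hypothetical helper functions:
# for t, train, valid in walk_forward_splits(data):
#     model = fit_model(train)              # estimate on data through year t
#     auc_t = evaluate_auc(model, valid)    # validate on year t + 1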

5.2 Validation Techniques

Our Bank uses a validation technique called the Power-curve with its appertaining summary statistic called Powerstat. They are identical to the CAP and the Accuracy Ratio, and similar to the ROC curve and the AUC. The ROC curve and the AUC are standard outputs when utilizing the LOGISTIC procedure in SAS, whereas the Power-curve and Powerstat statistics have to be computed manually. Engelmann, Hayden & Tasche (2003) demonstrate that the summary statistics of the CAP and the ROC are equivalent and that both methods are reliable even for small datasets. Both validation techniques will be presented below, but due to time limitations only the ROC curve and the AUC will be reported when comparing Our Bank's current models for credit scoring with the estimated Bayesian logistic regression models.

To get an understanding of the Power-curve, consider a credit scoring model that produces a continuous rating score. A high rating score indicates a low Probability of Default (hereafter PD). By assigning scores to the customers from the data used for the validation, and checking whether the customers default over the next period or remain solvent, we can evaluate the quality of the credit scoring model (Engelmann, Hayden & Tasche 2003b).

To plot the Power-curve, the customers are first ordered by PD from highest risk to lowest risk on the x-axis, that is, from the customer with the lowest score to the customer with the highest score, and on the y-axis is the share of defaulters (see figure 5.3). For a given fraction, x, of the total number of customers, the Power-curve is constructed by calculating the percentage of the defaulters whose rating scores are equal to or lower than the maximum score of fraction x (Engelmann, Hayden & Tasche 2003b).


Figure 5.3: Power-Curve

Source: Engelmann, Hayden & Tasche (2003)

A perfect credit scoring model would assign the lowest scores to the defaulters. In this case the Power-curve increases linearly and then stays at one. For a random model without any discriminative power, the fraction x of all customers with the lowest rating scores will contain x% of all defaulters. Real credit scoring models will be somewhere in between the two extremes.

While the Power-curve is a convenient way to visualize model performance, it is often useful to have a single statistic that summarizes the predictive accuracy in one number. This is known as the Powerstat and is defined as the ratio of the area, a_E, between the Power-curve of the estimated model (rating model) and the Power-curve of the non-informative model (random model), and the area, a_P, between the Power-curve of the perfect model and the Power-curve of the non-informative model, i.e.

Powerstat = a_E / a_P (5.1)

Powerstat takes values between 0 and 1. Models with a Powerstat close to 0 display little advantage over the random model, while those with a Powerstat near 1 display almost perfect predictive power.
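The Powerstat can be computed directly from validation data, as in the following Python sketch (an illustrative implementation, not the Bank's own). It assumes that a higher rating score means lower risk and that defaults are coded 1 for defaulters and 0 otherwise:

import numpy as np

def powerstat(scores, defaults):
    """Empirical Powerstat (accuracy ratio) from rating scores and observed defaults."""
    scores = np.asarray(scores, dtype=float)
    d = np.asarray(defaults, dtype=float)
    order = np.argsort(scores)               # riskiest (lowest score) customers first
    y = np.cumsum(d[order]) / d.sum()        # share of defaulters captured, the Power-curve
    area_model = y.mean()                    # rectangle-rule area under the Power-curve
    pi_d = d.mean()                          # overall default rate
    a_E = area_model - 0.5                   # area between model curve and random model
    a_P = (1 - pi_d / 2) - 0.5               # area between perfect curve and random model
    return a_E / a_P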

The construction of the ROC curve is a bit more complicated than the Power-curve. To get an understanding of the properties of the ROC curve, figure 5.4 shows possible distributions of rating scores for defaulting and non-defaulting customers.


Figure 5.4: Distribution of Rating Scores for Defaulting and Non-defaulting Customers

Source: Engelmann, Hayden & Tasche (2003)

For a perfect credit scoring model, the left distribution and the right distribution would be completely separated. If we want to determine from the rating score which customers will fully repay during the next period and which customers will default, one possibility is to introduce a cutoff value, C, as in figure 5.4. With a given cutoff value, C, each customer with a rating score lower than C is classified as a potential defaulter and each customer with a rating score higher than C as a non-defaulter. Four decision results would then be possible:

1. If the rating score is below the cutoff value C and the customer defaults subsequently, the decision was correct.

2. Otherwise we wrongly classified a non-defaulter as a defaulter.

3. If the rating score is above the cutoff value and the customer does not default, the classification was correct.

4. Otherwise, a defaulter was incorrectly assigned to the non-defaulters group (Engelmann, Hayden, & Tasche, 2003b).

Since the cost associated with a defaulting customer often exceeds the cost associated with a non-defaulting customer, it is more serious to incorrectly assign a defaulter to the non-defaulting group than to incorrectly assign a non-defaulter to the defaulting group (Sobehart, Keenan, & Stein, 2001).

The ROC curve can be constructed using different notations for the x-axis and y-axis. Using the notation from Engelmann, Hayden, & Tasche (2003b), we define the hit rate, HR(C), as:

HR(C) = H(C) / N_D (5.2)


where H(C) is the number of defaulters predicted correctly with the cutoff value, C, and N_D is the total number of defaulters. HR(C) is equal to the light green area on the left-hand side of the cutoff value C in figure 5.4. The false alarm rate, FAR(C), is defined as:

FAR(C) = F(C) / N_ND (5.3)

where F(C) is the number of false alarms, that is, the number of non-defaulters that were classified incorrectly as defaulters by using the cutoff value, C. The total number of non-defaulters is denoted by N_ND. FAR(C) is equal to the dark green area on the left-hand side of the cutoff value, C, in figure 5.4.

The ROC curve is then constructed as follows. For all cutoff values, C, contained in the range of the rating scores, the quantities HR(C) and FAR(C) are calculated. The ROC curve is a plot of HR(C) versus FAR(C), which is shown in figure 5.5:

Figure 5.5: Receiver Operating Characteristic Curves

Source: Engelmann, Hayden & Tasche (2003)

A model's performance is better the steeper the ROC curve is and the closer the ROC curve's position is to the point (0, 1). The AUC is the summary statistic for the ROC curve and measures the area under the curve; the steeper the ROC curve, the higher the AUC. The AUC can be calculated as:

AUC = ∫_0^1 HR(FAR) d(FAR) (5.4)

The AUC can be interpreted as the average power of the estimated model to discriminate between default and non-default over all possible cutoff values C. The AUC is 0.5 for a random model without discriminative power and 1 for a perfect model. It is between 0.5 and 1 for any reasonable rating model in practice.
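The construction of HR(C), FAR(C) and the AUC can be illustrated with the Python sketch below (the thesis itself relies on the ROC output of PROC LOGISTIC in SAS). Higher scores are again assumed to indicate lower risk:

import numpy as np

def roc_and_auc(scores, defaults):
    """Compute HR(C) and FAR(C) over all cutoff values C and the AUC of equation 5.4."""
    scores = np.asarray(scores, dtype=float)
    d = np.asarray(defaults, dtype=int)
    n_d = d.sum()                             # N_D, total number of defaulters
    n_nd = len(d) - n_d                       # N_ND, total number of non-defaulters
    cutoffs = np.concatenate((np.unique(scores), [np.inf]))
    hr = np.array([(d[scores < c] == 1).sum() / n_d for c in cutoffs])
    far = np.array([(d[scores < c] == 0).sum() / n_nd for c in cutoffs])
    # trapezoidal approximation of the integral of HR d(FAR)
    auc = np.sum((far[1:] - far[:-1]) * (hr[1:] + hr[:-1]) / 2)
    return hr, far, auc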


As mentioned, Engelmann, Hayden, and Tasche (2003) analyze the statistical properties of the Power-curve and the ROC curve. They demonstrate the correspondence of the Powerstat and the AUC, which indicates that these summary statistics are equivalent and that the relationship between the two can be calculated as (Engelmann, Hayden, & Tasche, 2003a):

Powerstat = 2(AUC − 0.5), i.e. AUC = (Powerstat + 1)/2 (5.5)

Hamerle, Rauhmeier, & Rösch (2003) discuss the properties of the Powerstat and the AUC and conclude that their values strongly depend on the structure of the true default probabilities in the underlying portfolio. This implies that, for example, a Powerstat near one might not be an indication of perfect predictive power, since it might just reflect an inhomogeneous portfolio. It follows that credit scoring models cannot be compared across time and across portfolios. Therefore, the Powerstat and the AUC are only comparable when they are based on the same underlying portfolio.

PART III
The Empirical Analysis
Chapter 6

6 Empirical Analysis

Nowadays several approaches for credit scoring analysis exist, with frequentist logistic regression being the most utilized method (Steenackers & Goovaerts, 1989; Laitinen, 1999; Alfò, Caiazza, & Trovato, 2005). However, as mentioned in the introduction, the objective of the current thesis focuses entirely on whether a Bayesian logistic regression model is able to outperform Our Bank's current approaches in terms of predictive ability. In the following sections the different approaches will be empirically analyzed, using real data provided by Our Bank.

Initially, the data will briefly be described, followed by an estimation of the expert models and the frequentist logistic regression models, both methods already applied in Our Bank. Next, the Bayesian logistic regression models will be estimated and evaluated. In that respect several key points are worth mentioning:

The choice of priors will be specified, where the expert knowledge is transformed into prior information.

The convergence criteria for the MCMC simulations will be assessed in order to ensure that the chains have converged to their stationary distributions.

Prior influence will be assessed by comparing the performance of the different Bayesian models.

By utilizing a walk-forward estimation method, the influence of adding more data to the training data will be evaluated.

By evaluating the AUCs, the chosen Bayesian model will be compared with the current credit scoring models applied by Our Bank.

The estimation and evaluation will be carried out on both the RSI and Real Estate segments, with RSI being the largest segment with 62886 customers.

The empirical analysis will be carried out using SAS software. Details on SAS syntax and output can be found in the Appendix.

6.1 Data

The data basis for the empirical analysis consists of questionnaire data, which has been gathered by the financial advisers in Our Bank from 2002 to 2010. It represents information regarding individual customers. The purpose of these questionnaires is to collect data on customer characteristics that help to predict defaulters in the future by using a statistical model. The data consists of 67618 customers divided into two segments. The two segments are, as


mentioned in section 1.2, RSI and Real Estate. The size of the two segments and the number

of defaults are as follows:

Table 1: The Two Customer Segments

Segment Total # of defaults


RSI 62886 1655
Real Estate 4732 140

In total there are 1794 recorded defaults in the dataset. The relative size of the segments makes it possible to distinguish between the performance of Bayesian logistic regression in segments with a large number of customers, compared to segments with a small number of customers, as stated in the problem statement. The amount of data available in the different years for the two segments can be seen in Appendix A.1, table 1.

Table 2 shows the different variables available for the two customer segments:

Table 2: Overview of the Explanatory Variables

(a) RSI (b) Real Estate

Name Values Category Name Values Category


Erhv_1 [-1,-4] Strategy and management Ejd_bran [-1,-4] Industry assessment
Erhv_2 [-1,-4] Strategy and management Ejd_1 [-1,-4] Strategy and management
Erhv_3 [-1,-4] Strategy and management Ejd_2 [-1,-4] Strategy and management
Erhv_bran [-1,-4] Industry assessment Ejd_3 [-1,-4] Strategy and management
Erhv_5 [-1,-4] Industry position Ejd_4 [-1,-4] Real estate related circumstances
Erhv_6 [-1,-4] Industry position Ejd_5 [-1,-4] Real estate related circumstances
Erhv_7 [-1,-3] Industry position Ejd_6 [-1,-4] Real estate related circumstances
Erhv_8 [-1,-4] Financial reporting Ejd_7 [-1,-4] Real estate related circumstances
Erhv_9 [-1,-3] Financial reporting Ejd_8 [-1,-4] Real estate related circumstances
Erhv_10 [-1,-4] Financial reporting Ejd_9 [-1,-4] Real estate related circumstances
Erhv_11 [-1,-4] Risk exposure Ejd_10 [-1,-4] Real estate related circumstances
Erhv_12 [-1,-4] Risk exposure Ejd_11 [-1,-4] Real estate related circumstances
Erhv_13 [-1,-3] Risk exposure Ejd_12 [-1,-4] Real estate related circumstances
Erhv_14 [-1,-3] Risk exposure Ejd_13 [-1,-4] Accounting related circumstances
Erhv_15 [-1,-2] Risk exposure Ejd_14 [-1,-4] Accounting related circumstances
Erhv_16 [-1,-3] Risk exposure Ejd_15 [-1,-2] Accounting related circumstances
Ejd_16 [-1,-4] Accounting related circumstances
Ejd_17 [-1,-4] Accounting related circumstances

The 16 variables for RSI can be divided into five categories: Strategy and management, Industry assessment, Industry position, Financial reporting, and Risk exposure. The 18 explanatory variables for Real Estate can be divided into four categories: Industry assessment, Strategy and management, Real estate related circumstances, and Accounting related circumstances.


Due to confidentiality the original questionnaires are not enclosed in the thesis.

The original scale of the variables ranged from A to E. We have transformed these scale values into values between −1 and −5, where −1 is considered to be the best value a customer can achieve on a question. In other words, high values are associated with a better creditworthiness. Category E (or −5) refers to a "don't know" answer and is therefore irrelevant and deleted from the data. Hence, only values ranging between −4 and −1 are considered for the analysis.

6.1.1 Missing Data


A preliminary analysis of the data reveals that the data have a fair amount of missing values (see Appendix A.2, algorithm 1). Since SAS only considers cases with no missing values, the amount of useful data is significantly reduced when the cases with missing values are deleted. Several imputation methods for handling missing data exist. For the purpose of this thesis we decided to apply listwise deletion of cases containing missing values. This approach was chosen assuming that imputation of missing data would bias the results of the analysis to some extent. We are aware of the disadvantages linked to listwise deletion, which arise from the loss of information derived from deleting incomplete cases. However, the amount of data left after deletion seemed appropriate for obtaining valid results.

After all missing values have been deleted, the available data are reduced to the following amounts:

Table 3: Available Data - without Missing Values

Segment 2002 2003 2004 2005 2006 2007 2008 2009 2010 Total
RSI, # Total 1134 2667 2869 2971 2936 3128 2891 2655 2663 23914
RSI, # Defaults 37 82 54 42 71 118 198 139 126 867
Real Estate, # Total 6 220 324 417 514 651 697 857 3686
Real Estate, # Defaults 1 10 38 31 36 116

As a note, due to the low number of defaulting customers within the Real Estate segment,

model estimation is only possible after the default year 2007, after data from the previous

years has been merged. Thus, the estimated Real Estate model is validated on 2008, 2009,

and 2010 data.

6.1.2 Standardizing Input Variables


From table 2 it can be noticed that not all the variables are measured on the same scale. Furthermore, a univariate descriptive investigation of the data shows that the data are rather skewed. The first step in the empirical analysis is therefore to standardize the provided data - a common procedure in the literature when estimating Bayesian generalized linear models (Gelman, Jakulin, Pittau, & Su, 2008). The objective of standardizing data is to bring the data onto the same scale and approximate the data to a normal distribution, by calculating z-scores as:

z = (x − μ) / σ (6.1)

where μ is the mean of the variable, x is the observed value and σ is the standard deviation.

However, Our Bank currently operates with the inverse cumulative normal distribution, since the distance between the values on the measured scales from the original data is not constant. As an example, the importance of whether a person scores −4 compared to −3 is greater than if a person scores −2 compared to −1. Therefore the data are standardized using the inverse cumulative normal distribution instead of the regular normal distribution. The SAS procedure PROC RANK, together with the normal=blom option, is utilized to achieve this (see Appendix A.3, algorithm 2). The syntax employs the following equation for the computations:

y_i = Φ⁻¹( (r_i − 3/8) / (n + 1/4) ) (6.2)

where Φ⁻¹ is the inverse cumulative normal function, r_i is the rank of the i'th observation, and n is the number of non-missing observations. The data are then centered around 0 with a relatively small standard deviation and scored according to the relative importance of the original values¹⁰.
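The transformation can be mimicked outside SAS, for example with the Python sketch below, which uses the Blom constants of equation 6.2; note that PROC RANK has its own handling of ties and missing values, so results need not match the SAS output exactly:

import numpy as np
from scipy.stats import norm, rankdata

def blom_standardize(x):
    """Inverse cumulative normal (Blom) transformation of one variable,
    mirroring PROC RANK with the normal=blom option (sketch only)."""
    x = np.asarray(x, dtype=float)
    r = rankdata(x)                                      # ranks r_i of the observations
    n = len(x)
    return norm.ppf((r - 3.0 / 8.0) / (n + 1.0 / 4.0))   # equation 6.2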

6.2 Estimation of the Expert Models

Before estimating the Bayesian credit scoring models, the performance of Our Bank's current expert model will briefly be introduced and estimated in this section, followed by an estimation of the frequentist logistic regression model in section 6.3. The performance of the expert model will, together with the performance of the frequentist logistic regression model, serve as a reference for the empirical analysis. Only the validation AUCs will be presented for this and the subsequent sections.

As already mentioned, the expert models are used in Our Bank when not enough questionnaire data exist for Our Bank to utilize a frequentist logistic regression credit scoring model. The expert models were created by highly educated department managers in Our Bank, who, after consultation, agreed upon weighting the different questions according to their relative importance. An example of the resulting setup is shown below in table 4. For the actual weights see Appendix A.4.1, table 2.

¹⁰ SAS Support webpage: http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000146840.htm


Table 4: Expert Weights - an Example

Questions Expert weights - wi


Question 1 w1 = 6.7%
Question 2 w2 = 2%
... ...
Question p wp = 10%

By employing the expert weights together with the actual scores in the questionnaires, Our Bank is able to score a customer with an expert credit score (ES), using the following equation:

ES = Σ_{i=1}^{p} w_i S_i (6.3)

where S_i is the standardized value of the i'th question, and w_i is the weight assigned to

question i by the experts. To obtain the actual performance of the expert models a ROC curve

is produced using the PROC LOGISTIC procedure in SAS (see Appendix A.8.1, algorithm 9

and A.8.2, algorithm 12). In table 5 below the actual AUCs obtained by the expert models

are shown:

Table 5: AUC for the Expert Models

Train. year 2002 2002-2003 2002-2004 2002-2005 2002-2006 2002-2007 2002-2008 2002-2009
Val. year 2003 2004 2005 2006 2007 2008 2009 2010
AUC - RSI 0.6918 0.7088 0.6782 0.6729 0.7126 0.6696 0.6300 0.7150
AUC - RE - - - - - 0.6014 0.5178 0.6325

In section 4.1.3 it was mentioned that an AUC of 0.5 represents a model without any discriminative power. The AUC for the RSI segment seems to fluctuate around 0.69, with the exception of the validation years 2008 and 2009, where the AUC, for unknown reasons, decreases. The AUC for the Real Estate segment in the validation year 2009 gives rise to concern due to the fairly low value. In fact, the expert model for the Real Estate segment is only slightly better than a random model in the validation year 2009.

6.3 Estimation of the Frequentist Logistic Regression Models

The current frequentist logistic regression used in Our Bank does not include variables with a negative influence on the predicted default probability in the final model. This is the selection criterion chosen by Our Bank in order to reduce the number of parameters included in the final model, so that only the variables that increase the probability of default are included. This approach already has a hint of Bayesian thinking, because the parameter selection is based on subjective judgments from the experts in Our Bank. Since one of the objectives of the current thesis is to compare the performance of Bayesian logistic regression with frequentist logistic regression, it has been deemed necessary to estimate a clean logistic regression using a backward selection criterion with a 25% significance level specified. For SAS syntax and outputs see Appendix A.5, algorithm 3.

Table 6 below shows the AUCs from the estimated frequentist logistic regression models.

Table 6: AUC for the Frequentist Logistic Regression - Backward Selection

Train. year 2002 2002-2003 2002-2004 2002-2005 2002-2006 2002-2007 2002-2008 2002-2009
Val. year 2003 2004 2005 2006 2007 2008 2009 2010
AUC - RSI 0.6994 0.7405 0.6612 0.7461 0.7510 0.7150 0.7280 0.7705
AUC - RE - - - - - 0.6062 0.6754 0.7191

From table 6 we can see that the frequentist logistic regression performs better on the RSI data than it does on the Real Estate data, as was the case with the expert model. However, we keep in mind that the AUCs are only directly comparable when based on the same underlying portfolio.

Comparing the frequentist logistic regression model with the expert model, the logistic regression performs slightly better than the expert model, though only significantly better in 2006 and 2009 for the RSI data and in 2009 for the Real Estate data. This is in accordance with Our Bank's current situation as mentioned in section 1.1.

6.4 Estimation of Bayesian Logistic Regression Models

Next we estimate a Bayesian logistic regression for the RSI segment and the Real Estate segment, and compare the performance of the estimated models with the performance of Our Bank's current models. Before estimating the Bayesian models, two preliminary steps are important to highlight:

1. The Bayesian priors must be specified.

2. The appropriate number of simulation iterations has to be determined for the Markov chains to converge to their stationary distributions.

Following these two steps, the resulting parameter coefficients and model performance will be presented and discussed.

6.4.1 Prior Specification


In section 3.2 the importance of specifying appropriate priors was described. To carry out the empirical analysis, appropriate priors must be established. The parameters of interest in this study are the regression coefficients of the Bayesian logistic regression models, and therefore the probability distributions of these coefficients are the priors that have to be specified. The mean, or mode, of the assigned prior distributions serves as the expectation for the coefficients, and the variance reflects the uncertainty related to the coefficients.

As described in section 3.2, Gelman (2002) points out that when the parameters are well-defined and a relatively large sample size is used for estimation, the prior distribution is expected to have little impact on the posterior - a condition called likelihood dominance (Wylie, Muegge, & Thomas, 2006). However, no exact definition of well-identified parameters or large sample size exists, so in order to assess the impact of the prior distribution, the posterior distribution will be assessed and compared under different choices of priors.

In the analysis one non-informative and two informative priors are applied. A flat prior is considered as the non-informative prior, whereas for the informative priors two different variance parameters, $\sigma_1^2$ and $\sigma_2^2$, are selected. This approach is also chosen since there is no pre-specified variance parameter provided by Our Bank. Gelman, Jakulin, Pittau, & Su (2008) suggest a prior that has the ability to include prior information to some extent. It is not a strictly informative prior, which includes specific information regarding the mean and variance of the unknown parameters, nor is it a fully non-informative prior, such as a uniform prior. The approach used in this thesis is somewhat similar to Gelman et al.'s work, and therefore normally distributed priors with different variance parameters will be utilized, so that:

$$p(\beta) \sim N(\mu_p, \sigma_k^2) \qquad (6.4)$$

6.4.1.1 Transforming Expert Knowledge into Prior Information


As the objective is to transform expert knowledge into prior information, the first step is to specify the means for the priors, based on the expert knowledge.

As mentioned in section 6.2, Our Bank uses the expert weights to obtain an ES for every customer given the following equation:

$$ES = \sum_{i=1}^{p} w_i S_i \qquad (6.5)$$

where $S_i$ is the transformed score of the $i$'th question.

To transform the expert score to a PD, the next step is to perform a simple logistic regression. The variable containing the expert scores (ES) will serve as the independent variable in the equation, and the binary variable that records whether or not a customer defaults will be the dependent variable (1=default):

$$PD = \ln\left(\frac{Pr_{default}}{1 - Pr_{default}}\right) = a + b \cdot ES \qquad (6.6)$$


where a and b are the maximum likelihood estimates resulting from the equation. The b-coefficient (see Appendix A.6, Table 3) will serve as the last input to create the prior means. The results from equation 6.6 can be applied to the customers to calculate the final probability of default by using the following equation:

$$PD = \ln\left(\frac{Pr_{default}}{1 - Pr_{default}}\right) = a + b w_1 S_1 + \ldots + b w_p S_p \qquad (6.7)$$

The term $b w_i$ can be summarized into a single coefficient, which is the coefficient that is multiplied by the score for every $i$'th question:

$$b w_i = \beta_i^{prior} \qquad (6.8)$$

and these coefficients will serve as means for the prior distributions. The resulting prior means for all variables can be seen in Appendix A.6, Table 4.
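As a sketch of how equations 6.6-6.8 could be carried out in SAS, the slope b is first estimated from a simple logistic regression of the default indicator on the expert score, and the prior means are then obtained by multiplying b with the expert weights. The dataset and variable names (rsi_train, es, expert_weights, w1-w16) are illustrative assumptions, not the code behind Appendix A.6.

/* Equation 6.6: simple logistic regression of default (1=default) on the expert score ES */
proc logistic data=rsi_train outest=es_fit;
   model default(event='1') = es;
run;

/* Equation 6.8: prior mean for question i equals b * w_i, where b is the
   estimated slope on ES and w1-w16 are the expert weights, assumed to be
   stored in a one-row dataset called expert_weights */
data prior_means;
   if _n_ = 1 then set es_fit(keep=es rename=(es=b));
   set expert_weights;
   array w{16} w1-w16;
   array beta_prior{16} bprior1-bprior16;
   do i = 1 to 16;
      beta_prior{i} = b * w{i};
   end;
   drop i;
run;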

This approach is not perfect, because the prior is not strictly independent of the available data. It is, however, an attempt to convert already existing expert knowledge within Our Bank into prior information, which will serve as the needed input for the Bayesian analysis.

As mentioned earlier, no information regarding the specification of the prior variance exists, and therefore the approach in this thesis is to examine priors with different variances in order to explore what impact changes in the prior have on the posterior results. Three different set-ups have been chosen, which can be seen in Table 7:

Table 7: The Selected Priors

Prior No. Type Distribution Mean Variance


1 Non-informative Normal 0 10000
2 Informative Normal See Appendix A.6, table 4 1
3 Informative Normal See Appendix A.6, table 4 5
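As a sketch only (not the code in the appendices), the three set-ups in Table 7 could be declared through the PRIOR statement in PROC MCMC. The example below is reduced to an intercept and two coefficients; the prior means for prior 2 and prior 3 are read from the hypothetical prior_means dataset sketched above, and keeping a non-informative prior on the intercept under the informative set-ups is an assumption of this sketch.

/* put the prior means into macro variables */
data _null_;
   set prior_means;
   call symputx('mu1', bprior1);
   call symputx('mu2', bprior2);
run;

proc mcmc data=rsi_train nmc=10000 seed=20121 outpost=rsi_post;
   parms (alpha beta1 beta2) 0;
   /* Prior 1: non-informative, identical for all parameters */
   prior alpha beta1 beta2 ~ normal(0, var=10000);
   /* Prior 2: replace the line above with
        prior alpha ~ normal(0, var=10000);
        prior beta1 ~ normal(&mu1, var=1);
        prior beta2 ~ normal(&mu2, var=1);
      Prior 3: as prior 2, but with var=5 */
   mu = alpha + beta1*q1 + beta2*q2;
   p  = 1 / (1 + exp(-mu));
   model default ~ binary(p);
run;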

6.4.2 Specifying the Simulation Method for RSI Data


As a first attempt, a simulation with 10000 iterations is run with all variables included and the non-informative prior 1 selected (see SAS syntax and output in Appendix A.7.1, Algorithm 5). The results for the validation year 2010 are presented below. When evaluating the diagnostics for the simulation, it quickly becomes apparent that the Markov chain does not mix very well. As an example, the resulting plots for the intercept, $\beta_0$, and beta1, $\beta_1$, parameters are shown below:


Figure 6.1: Results from Initial Simulation - 10000 Iterations, Prior 1

(a) Intercept (b) Beta1

Source: Output from SAS

These two figures are quite representative of the remaining parameters, since their diagnostics follow the same pattern. The trace plots demonstrate a pattern known as marginal mixing, which was introduced in section 4.1.3. The problem with the chain is that it only takes small steps and is not able to reach its stationary distribution quickly. This form of trace plot usually results from high autocorrelation between the samples, which can also be seen from the autocorrelation graph. With trace plots like the ones in figure 6.1 useful samples cannot be obtained. In order to do so the chain must run for much longer. To reduce autocorrelation the chain must also be thinned, meaning that only a portion of the samples drawn are saved for drawing inference (SAS Institute, 2008).

In the next example the model is run with 150000 iterations and only every 25th sample

is saved, so that 6000 samples are kept to draw posterior inference. Output for the same two

parameters, intercept and beta1, is shown below in figure 6.2 (SAS syntax and output can be

found in Appendix A.7.2, algorithm 6).


Figure 6.2: Results from Initial Simulation - 150000 Iterations, Prior 1

(a) Intercept (b) Beta1

Source: Output from SAS

After running the simulation with 150000 iterations, the problems with autocorrelation and marginal mixing are almost solved. Since a few problems concerning the Geweke diagnostics in the validation years 2007 and 2008 remain (values larger than 2), we choose to run the simulation again with 250000 iterations. With 250000 iterations we choose to thin the chain even further, so that only every 50th sample is saved. Thereby 5000 samples are kept to draw posterior inference. Output for the same two parameters, intercept and beta1, is shown below in figure 6.3 (SAS syntax and output can be found in Appendix A.7.3, algorithm 7).

Figure 6.3: Results from Initial Simulation - 250000 Iterations, Prior 1

(a) Intercept (b) Beta1

Source: Output from SAS

After running the simulation again with 250000 iterations, two instances where the Geweke value is above 2 remain. However, a closer look at the convergence diagnostics confirms that the Markov chains have converged. MCSE/SD is a measure of the relationship between the simulation uncertainty (MCSE) and the parameter uncertainty (SD). A comparison of


MCSE/SD between the three simulations shows a notable reduction in this ratio, meaning that the simulation uncertainty in the third run has been reduced. Furthermore, the ESS also suggests improvements between the three runs, since the discrepancy between the ESS and the actual sample size is lower for the third run than in the first and second run.

Since the 250000 iterations made the Markov chains converge, the same simulation set-up will be utilized for the remaining years in the walk-forward approach. The convergence outputs for the remaining simulations in this thesis will only be elaborated on if any problems with the convergence criteria arise.
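For reference, a sketch of the final simulation set-up is given below. The option values nmc=250000 and thin=50 and the non-informative prior follow the description above, whereas the burn-in length (nbi), the seed, and the dataset and variable names are assumptions of this sketch rather than the settings in Appendix A.7.3; the linear predictor is truncated to three questionnaire scores for readability.

proc mcmc data=rsi_train nmc=250000 thin=50 nbi=10000 seed=20121
          outpost=rsi_post diagnostics=(geweke ess mcse)
          plots=(trace autocorr density);
   /* the two earlier runs used nmc=10000 (no thinning) and nmc=150000 thin=25 */
   parms (alpha beta1 beta2 beta3) 0;            /* ... and so on up to beta16 */
   prior alpha beta1 beta2 beta3 ~ normal(0, var=10000);   /* prior 1 */
   mu = alpha + beta1*q1 + beta2*q2 + beta3*q3;  /* ... + beta16*q16 */
   p  = 1 / (1 + exp(-mu));
   model default ~ binary(p);
run;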

6.4.3 Results for RSI Data


In order to assess the performance of the estimated models during the different steps in the walk-forward procedure, several obtained results must be compared. First of all, a test of whether the models are overfitting is carried out. Following this, the estimated parameters for the three different Bayesian models are presented, and finally the performance of the three Bayesian models is compared to identify the model with the best prediction accuracy.

6.4.3.1 A Test of Overfitting - An Example, Prior 1


The approach to determine whether the models are overfitting is to obtain a ROC curve and its summary statistic, the AUC, for both the training data and the validation data. Afterwards a Chi-square test is used to test if there are any significant differences between the two curves. An example, where 2010 is used as validation data, is shown below in figure 6.4.

Figure 6.4: ROC and AUC for Validation Year 2010 - RSI, Prior 1

Source: Output from SAS


Though there is a difference of 0.0314 between the two areas, the difference is not significant (p=0.1644) at a 5% level of significance. Therefore, it cannot be concluded that the model overfits the data in the validation year 2010. This test is performed for every year of estimation and the resulting AUCs are shown in Table 8 below.

Table 8: AUC - RSI, Prior 1

Train. year 2002 2002-2003 2002-2004 2002-2005 2002-2006 2002-2007 2002-2008 2002-2009
Val. year 2003 2004 2005 2006 2007 2008 2009 2010
AUC - Training 0.7973 0.7528 0.7507 0.7381 0.7459 0.7519 0.7413 0.7395
AUC - Validation 0.7047* 0.7421 0.6681* 0.7501 0.7483 0.7168 0.7297 0.7708

Note: * - Sign. (alpha=0.1); ** - Sign. (alpha=0.05); *** - Sign. (alpha=0.01)

No significant differences can be concluded from these tests when a 5% level of significance is utilized; for the validation years 2003 and 2005 there is a significant difference at a 10% level of significance. Over the eight validation years the AUC has overall improved by 0.0654, with a decline in the validation years 2005 and 2008. The average AUC over the period is 0.7290 for the validation samples.

The same simulations have been run for the two other priors, and only significant differences in the validation years 2003 and 2005 were present at a 10% level of significance, which was also the case for prior 1 (see Appendix A.7.4, table 5). Overall, these results indicate that we do not have any problems with overfitting at a 5% significance level, which implies that the current Bayesian logistic regression models can be accepted. Therefore, the next step is to take a closer look at how the choice of prior influences the parameters, which will be clarified in the following section.

As a note, the models are intended for prediction, so the significance of the parameters (explanatory power) is considered secondary in this thesis.
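One way such a chi-square comparison of two AUCs from independent samples can be set up is sketched below. The AUC values are taken from Table 8 (validation year 2010), whereas the standard errors are purely illustrative placeholders; the test actually used in the appendices may differ from this Hanley-McNeil-style construction.

data auc_overfit_test;
   auc_train = 0.7395;  se_train = 0.015;   /* illustrative standard error */
   auc_valid = 0.7708;  se_valid = 0.022;   /* illustrative standard error */
   /* chi-square statistic for H0: equal AUCs, assuming independent samples */
   chi_sq  = (auc_train - auc_valid)**2 / (se_train**2 + se_valid**2);
   p_value = 1 - probchi(chi_sq, 1);        /* 1 degree of freedom */
run;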

6.4.3.2 The Estimated Parameters


Table 9 summarizes the estimated parameters for the three different Bayesian models in the validation year 2010. Furthermore, the obtained parameters from the frequentist logistic regression are shown as a reference point.



Table 9: Parameter Estimates 2010 - RSI

Freq. Prior 1 Prior 2 Prior 3

Alpha -3.6119 -3.6276 -3.6263 -3.62447
Beta1 -0.0656 -0.0688 -0.0701 -0.0688
Beta2 -0.2495 * 0.2467 * -0.2485 * -0.2482 *
Beta3 - -0.0380 -0.0379 -0.0388
Beta4 - 0.0406 0.0399 0.0388
Beta5 -0.0869 -0.0876 -0.0861 -0.0874
Beta6 0.0873 0.0904 0.0895 0.0905
Beta7 -0.0791 -0.0724 -0.0733 -0.0744
Beta8 -0.0891 -0.0866 -0.0860 -0.0858
Beta9 -0.1167 * -0.1177 * -0.1184 * -0.1164 *
Beta10 -0.5845 * -0.5888 * -0.5859 * -0.5876 *
Beta11 - 0.0061 0.0092 0.0083
Beta12 - -0.0325 -0.0326 -0.0335
Beta13 -0.0588 -0.0556 -0.0557 -0.0558
Beta14 -0.2060 * 0.2045 * -0.2046 * -0.2054 *
Beta15 0.1560 0.1441 0.1488 0.1454
Beta16 - 0.0444 0.0419 0.0464

Note: * implies that 0 is not included in the confidence interval; - indicates that the variable was excluded from the frequentist model by the backward selection.

Significant parameters (on a 5% significance level) are marked with a * in table 9, which indicates that the number 0 is not contained in the confidence intervals (CI hereafter). As can be seen from Appendix A.7, algorithm 7, the number of significant parameters varies during the period.

Five parameters in the frequentist logistic regression model have been sorted out by the backward selection criterion. There are no significant differences for any of the parameters given the different priors. Furthermore, the Bayesian logistic regression models deviate only very slightly from the frequentist logistic regression. This could indicate that the normally distributed prior has little influence on the posterior distribution, which in section 3.2 was referred to as likelihood dominance.

In the following section the three different priors are tested against each other in order to select the most appropriate prior for comparison with the Bank's current credit scoring models.

6.4.3.3 Comparison of the Bayesian Models with Different Priors


The performance (AUC) of the three different Bayesian models is compared using a Chi-square test. The validation AUCs are shown in table 10 below:


Table 10: AUC for all Three Bayesian Models - RSI

Val. year 2003 2004 2005 2006 2007 2008 2009 2010
Prior 1 0.7047 0.7421 0.6681 0.7501 0.7483 0.7168 0.7297 0.7708
Prior 2 0.7070 0.7418 0.6682 0.7502 0.7483 0.7168 0.7297 0.7710
Prior 3 0.7053 0.7423 0.6685 0.7500 0.7484 0.7171 0.7296 0.7707

The comparison of the different results across the validation years did not indicate any significant differences in model performance given the different priors (see Appendix A.8.1, algorithm 8).

Considering the highest AUC in each validation year, prior 2 attains the highest value in four of the eight years and prior 3 in the remaining four, though only by very small margins. Since the differences between the three priors are so small, we choose to apply the Bayesian logistic regression with prior 2 when comparing the different models for the RSI segment.
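A sketch of how such a chi-square comparison of correlated AUCs can be set up in SAS is shown below; the dataset and variable names are illustrative, with p_prior1-p_prior3 assumed to hold the three models' predicted default probabilities for the same validation-year customers.

proc logistic data=rsi_valid_2010;
   /* nofit: the variables are only used as ROC inputs, no model is fitted */
   model default(event='1') = p_prior1 p_prior2 p_prior3 / nofit;
   roc 'Prior 1' p_prior1;
   roc 'Prior 2' p_prior2;
   roc 'Prior 3' p_prior3;
   /* chi-square tests of each curve against the prior 2 curve */
   roccontrast reference('Prior 2') / estimate e;
run;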

6.4.4 Results for Real Estate Data


In this section an analysis similar to the one for the RSI segment is performed on the Real Estate segment. The objective is to compare the performance of the Bayesian logistic regression given the different priors when applied to a segment with a small number of customers. The comparison will again, as for the RSI segment, reveal the best prior for the Bayesian model, which will be used when comparing with Our Bank's current models.

6.4.4.1 A Test of Overfitting - An Example, Prior 1


2010 is again chosen as an example, and the ROC curve and summary statistics can be seen below in figure 6.5.

Figure 6.5: ROC and AUC for Validation Year 2010 - Real Estate, Prior 1

Source: Output from SAS


As for the RSI segment, a marginal difference of 0.0130 between the two areas exists. At least for the example in figure 6.5, however, the difference is not significant at a 5% level of significance (p=0.8083). The model is therefore not overfitting the data in the validation year 2010. Again the test is performed for every year of estimation and the resulting AUCs are shown in Table 11 below.

Table 11: AUC - Real Estate, Prior 1

Train. year 2002-2007 2002-2008 2002-2009


Val. year 2008 2009 2010
AUC - Training 0.8400 0.7079 0.7294
AUC - Validation 0.5696 *** 0.7031 0.7165

Note: * - Sign. (alpha=0.1); ** - Sign. (alpha=0.05); *** - Sign. (alpha=0.01)

In the validation year 2008 an apparent problem with overfitting exists. This might be due to the very low number of defaults in the training data. However, the model seems to stabilize over time, and in 2009 and 2010 there is no significant evidence of overfitting.

The same simulations have been run for the two other priors for the Real Estate segment, and again a significant difference in the validation year 2008 exists at a 1% level of significance (see Appendix A.7.5, table 6 for actual results).

Despite the overfitting in the validation year 2008 the approach is assumed to be valid, with the concluding remark that the results might depend on the number of defaulting customers in the training data.

6.4.4.2 The Estimated Parameters


As was the case with the RSI segment, table 12 summarizes the estimated parameters for the three different Bayesian models in the validation year 2010, with the frequentist logistic regression as a reference point.


Table 12: Parameter Estimates 2010 - Real Estate

Freq. Prior 1 Prior 2 Prior 3

Alpha -3.8106 -3.9535 -3.9405 -3.9332
Beta1 -0.3222 * -0.3213 * -0.3186 * -0.3198 *
Beta2 -0.5681 * -0.5948 * -0.5765 * -0.5769 *
Beta3 - -0.0430 0.0426 0.0413
Beta4 - -0.1929 -0.1877 -0.1825
Beta5 - 0.0786 0.0662 0.0695
Beta6 - 0.1168 0.1136 0.1095
Beta7 - 0.0563 0.0593 0.0572
Beta8 - -0.0847 -0.0850 -0.0863
Beta9 0.1892 0.2025 0.1896 0.1954
Beta10 -0.1688 -0.1769 -0.1719 -0.1752
Beta11 - -0.0928 -0.0974 -0.0927
Beta12 - -0.1408 -0.1438 -0.1486
Beta13 - 0.1095 0.0997 0.1037
Beta14 -0.3058 * -0.3014 * -0.2934 * -0.2928 *
Beta15 - 0.0127 0.0149 0.0108
Beta16 0.3142 * 0.2704 0.2641 0.2688
Beta17 - 0.0797 0.0792 0.0736
Beta18 0.3477 * 0.2886 0.2742 0.2846

Note: * implies that 0 is not included in the confidence interval; - indicates that the variable was excluded from the frequentist model by the backward selection.

Eleven parameters in the frequentist logistic regression model have been sorted out by the backward selection criterion. There are no significant differences for any of the parameters given the different priors, and the Bayesian parameters are almost identical to the frequentist logistic regression coefficients. Once again, this indicates that the chosen priors have little influence on the posterior distribution.

In the following section the three different priors are tested against each other in order to select the most appropriate prior for comparison with Our Bank's current models for credit scoring.

6.4.4.3 Comparison of the Bayesian Priors


The validation AUCs are shown in table 13 below.

Table 13: AUC for all Three Bayesian Models - Real Estate

Val. year 2008 2009 2010


Prior 1 0.5696 0.7031 0.7165
Prior 2 0.5789 0.7040 0.7175
Prior 3 0.5737 0.7042 0.7175


No significant differences between the Bayesian logistic regression models with the three different priors have been identified (see Appendix A.8.2, algorithm 11). The Bayesian logistic regression model with prior 2 performs marginally better than the other two priors, due to the difference in the validation year 2008. Therefore, the Bayesian logistic regression where prior 2 is utilized will be applied when comparing the different models for the Real Estate data.

6.5 Comparison of the Estimated Credit Scoring Models

In the following section the performance of the Bayesian logistic regression models will be compared with the performance of the estimated expert models and the frequentist logistic regression models for the two customer segments.

6.5.1 RSI Segment


In figure 6.6 the performance of the different credit scoring models for the RSI segment is shown:

Figure 6.6: Comparison of AUC - RSI

Source: Own elaboration

The frequentist and Bayesian logistic regression models perform slightly better than the expert model for all years during the period, except for the validation year 2005. In the validation years 2006 and 2009 the frequentist and Bayesian logistic regression models perform significantly better than the expert model at a 5% significance level.

Generally the Bayesian logistic regression model performs marginally better than the frequentist logistic regression model. Only in the validation year 2007 does the frequentist logistic regression model have a higher AUC than the Bayesian logistic regression. It is worth mentioning that the Bayesian logistic regression model at no point significantly outperforms the frequentist logistic regression for the RSI data (see Appendix A.8.1, algorithm 10 for actual results).

Thereby it can be concluded that the Bayesian logistic regression model is overall able to outperform the expert model and that it performs slightly better than the frequentist logistic regression, though without any significant differences.

6.5.2 Real Estate Segment


In figure 6.7 the performance of the different credit scoring models for the Real Estate segment is shown:

Figure 6.7: Comparison of AUC - Real Estate

Source: Own elaboration

As can be seen from figure 6.7, the Bayesian logistic regression is the credit scoring model with the lowest AUC in the validation year 2008. In the validation year 2009 the frequentist and Bayesian logistic regressions perform significantly better than the expert model, and the Bayesian logistic regression model performs marginally better than the frequentist logistic regression model. In the last validation year, 2010, the frequentist logistic regression model has the highest AUC, followed by the Bayesian logistic regression model; a difference of 0.0528 separates the performance of these two models (see Appendix A.8.2, algorithm 13 for actual results).

Overall there are no discoverable patterns in the performance of the different models for the Real Estate segment. However, after the validation year 2008 the Bayesian logistic regression performs better than the expert model and on the same level as the frequentist logistic regression model.

PART V
Concluding Remarks
Chapter 7

7 Concluding Remarks

The following chapter sums up the current thesis and its results. It first of all contains a conclusion, which highlights the findings of the thesis. Afterwards, the limitations and contributions of the thesis are summarized. Finally, ideas for future research are presented.

7.1 Conclusion

The objective of the current thesis was to analyze whether a Bayesian logistic regression model for credit scoring is a more effective tool than Our Bank's current approaches: an expert model and a frequentist logistic regression model. In addition to this, it was also important to clarify how the different models perform when applied to both a large customer segment and a small customer segment.

Overall, the results from the empirical analysis showed that a Bayesian approach for credit scoring was not able to outperform a frequentist approach, and thereby we cannot conclude that a Bayesian logistic regression is a more effective tool that improves quality of service and minimizes the risk of credit loss when compared to a frequentist logistic regression. On the other hand, the analysis confirmed that a Bayesian approach was overall able to outperform the current expert model applied in Our Bank.

Since the thesis has had both an academic and a practical orientation, it was deemed necessary to equip the reader with a basic theoretical understanding of Bayesian statistics. Therefore, an introduction to Bayesian statistics and the Bayesian approach to logistic regression was initially given. Given that parameter estimation in Bayesian statistics is notably different from frequentist statistics, due to the fact that all inference is drawn from the posterior distribution, an elaboration on how to obtain that posterior distribution through Markov chain Monte Carlo simulations was provided.

As an introduction to the empirical analysis a validation framework was proposed together with the validation techniques used for comparison of the different models. A walk-forward validation framework was chosen to be applied in the empirical analysis, and the ROC curve and its summary statistic, the AUC, were selected as the validation techniques.

For the empirical analysis a real dataset provided by Our Bank, containing data on 67618 bank customers spanning the nine years from 2002 to 2010, was used. The data contained questionnaire data from two different customer segments: Retail, Service and Industry (RSI) and Real Estate.

Before the Bayesian models were estimated, the AUCs of the expert model and the frequentist logistic regression were estimated, and these served as references for the model comparison.


An approach for converting the current expert knowledge into prior information was proposed. Since no information regarding the variance parameter for the prior distribution existed, two informative priors with different variances were chosen together with a non-informative prior.

Since the aim of the thesis was to compare the predictive power of different credit scoring models, the empirical analysis focused on significant differences between the AUCs of the different models. Therefore, in order to identify what kind of impact the prior had on the Bayesian models, the differences in the AUCs of all three Bayesian models were analyzed. The analysis showed no significant differences between the models, and the conclusion was therefore that the prior did not have any remarkable influence on the posterior, indicating what is referred to as likelihood dominance. This conclusion was valid for both customer segments. Since prior 2 had a marginally better AUC for both segments, it was chosen as the prior used for comparison with Our Bank's current credit scoring models.

The results from comparing the Bayesian model with Our Bank's current models for the RSI segment showed that the Bayesian and frequentist logistic regression models were able to significantly outperform the expert model in the validation years 2006 and 2009. For the remaining years similar differences were obtained, without being significant. When comparing the Bayesian logistic regression with the frequentist logistic regression, only marginal differences were obtained. With the exception of the validation year 2007, the Bayesian logistic regression performed slightly better than the frequentist logistic regression.

When analyzing the performance of the models for the Real Estate segment, the results were quite ambiguous. An important aspect related to this was the low number of defaulting customers within the segment, which implied that estimation was only possible after the default year 2007. We were only able to identify one significant difference, namely that the expert model performed significantly worse than the other two models in the validation year 2009. No significant differences between the Bayesian and frequentist logistic regression could be confirmed. However, after the validation year 2008 the Bayesian logistic regression model was performing on the same level as the frequentist logistic regression model.

Bayesian methods are already increasingly being applied in a diverse assortment of fields, including medicine, sociology, psychology, artificial intelligence, and philosophy (Wylie, Muegge, & Thomas, 2006). We believe that Bayesian methods hold similar promise for researchers of business and management problems in the future.

In line with this, Agresti and Hitchcock (2005) wrote:

In the future, it seems likely to us that statisticians will increasingly be tied less

dogmatically to a single approach and will feel comfortable using both frequentist

and Bayesian paradigms.

(Wylie, Muegge, & Thomas, 2006)


As a final remark it is worth mentioning that even though we were not able to confirm any significant differences between a frequentist and Bayesian logistic regression, the Bayesian approach for credit scoring should not be rejected. The conclusions in this thesis could be highly influenced by the choice of priors, the chosen sampling algorithm for the MCMC, the analyzed customer segments, and the Bayesian framework chosen.

7.2 Limitations

There have been some limitations related to the process of the research and these could have had an influence on the results of the thesis.

First of all, only one prior distribution, the normal distribution, was applied in the Bayesian models. The analysis indicated that the prior did not have any significant influence on the posterior.

Secondly, the random-walk Metropolis algorithm was utilized as the only sampling algorithm for the MCMC simulations, though several others exist, such as the Gibbs sampler.

Thirdly, the data contained a certain amount of missing values, which were deleted and therefore decreased the amount of data substantially.

In spite of these limitations, several contributions are worth mentioning.

7.3 Contributions

The main contribution of this thesis has been to introduce Bayesian logistic regression as an alternative to frequentist logistic regression for credit scoring. We have compared the two methods' predictive ability (AUC) based on real data covering almost a decade. Furthermore, the two approaches have been compared to Our Bank's current expert model.

Another contribution of the present thesis has been to integrate and convert the current expert knowledge from Our Bank into prior information. Two different set-ups for the expert knowledge have been applied as informative priors, together with an additional non-informative prior.

Furthermore, a walk-forward validation framework has been used in the empirical analysis. This framework has the ability to clarify how a model evolves over time as more data are obtained.

Finally, the current thesis has contributed a comparison of the model performance on two different segments, with a large and a small number of customers, respectively.


7.4 Future Research

It has been demonstrated throughout the current thesis that the Bayesian approach towards credit scoring requires careful attention to modeling, since the quality of the results is strongly dependent on thoughtful model-building decisions and careful specification of appropriate priors, which place demands on the analyst's skills, judgment, and experience (Wylie, Muegge, & Thomas, 2006). In line with this, one of the disadvantages of using a Bayesian approach is that the approach does not contain instructions on how to select a prior. There is no correct way to choose a prior, which implies that it requires skill to translate subjective prior beliefs into mathematically formulated priors. From our point of view, the chosen priors in the current thesis should be perceived as points of origin for further research rather than final solutions. The priors have been developed based on the current expert models in Our Bank without having any prespecified variance or distribution. As implied in the empirical analysis, the normally distributed priors have only had marginal influences on the posterior distributions, which probably is one of the reasons why the Bayesian approach does not differ significantly from the frequentist approach. One opportunity for future research is therefore to apply different variances and distributions to the priors.

As stated in section 6.1.1, the data was notably reduced due to the amount of missing values. It was chosen not to impute new values in the dataset. Several methods for missing data imputation exist, such as mean substitution, Expectation Maximization, and regression predictions. For future research, a comparison of the different imputation methods and their influence on the predictive performance of the estimated models would be a relevant area to study.

Although a lot of literature concerning MCMC exists, we have not been able to find any describing whether the choice of sampling algorithm influences the estimated parameters in the Bayesian approach. By utilizing an algorithm other than the random-walk Metropolis, this hypothesis could be tested.

Many models can be utilized for credit scoring where, from a Bayesian point of view, Bayesian networks have shown some success (Biçer, Seviş, & Bilgiç, 2010). An investigation into these networks for credit scoring would provide interesting topics for further research.

Chapter 8

8 Bibliography

Alfò, M., Caiazza, S. & Trovato, G. 2005, "Extending a Logistic Approach to Risk Modeling through Semiparametric Mixing", Journal of Financial Services Research, vol. 28, no. 1, pp. 163.

Altman, D.G. & Bland, J.M. 1998, "Statistics Notes: Bayesians and Frequentists", British Medical Journal, vol. 317, no. 7166, pp. 1151.

Barnett, V. 1973, Comparative Statistical Inference, John Wiley, London.

Berger, J.O. 2000, "Bayesian Analysis: A Look at Today and Thoughts of Tomorrow", Journal of the American Statistical Association, vol. 95, no. 452, pp. 1269.

Biçer, I., Seviş, D. & Bilgiç, T. 2010, Bayesian Credit Scoring Model with Integration of Expert Knowledge and Customer Data. Available: http://leidykla.vgtu.lt/conferences/MEC_EurOPT_2010/pdf/324-329-Bicer_Sevis_Bilgic-57.pdf [2012, 07/16].

Bolstad, W.M. 2007, Introduction to Bayesian Statistics, 2nd edn, John Wiley, Hoboken, N.J.

Brinberg, D. & Hirschman, E.C. 1986, "Multiple Orientations for the Conduct of Marketing Research: An Analysis of the Academic/Practitioner Distinction", The Journal of Marketing, vol. 50, no. 4, pp. 161.

Che, X. & Xu, S. 2010, Bayesian Data Analysis for Agricultural Experiments. Available: http://pubs.aic.ca/doi/pdfplus/10.4141/CJPS10004 [2012, 06/08].

Chen, M., Shao, Q. & Ibrahim, J.G. 2000, Monte Carlo Methods in Bayesian Computation, Springer, Berlin.

Cowles, K., Kass, R. & O'Hagan, T. 2009, What is Bayesian Analysis. Available: http://bayesian.org/Bayes-Explained [2012, 05/08].

Engelmann, B., Hayden, E. & Tasche, D. 2003a, Measuring the Discriminative Power of Rating Systems. Available: http://www.bundesbank.de/download/bankenaufsicht/dkp/200301dkp_b.pdf [2012, 05/24].

Engelmann, B., Hayden, E. & Tasche, D. 2003b, Testing Rating Accuracy. Available: http://www.german-zscore.de/docs/engelmann_2003.pdf [2012, 06/07].

Gelman, A. 2002, Prior Distribution. Available: http://www.stat.columbia.edu/~gelman/research/published/p039-_o.pdf [2012, 05/10].


Gelman, A., Jakulin, A., Pittau, M.G. & Su, Y. 2008, "A Weakly Informative Default Prior Distribution for Logistic and Other Regression Models", The Annals of Applied Statistics, vol. 2, no. 4, pp. 1360.

Geyer, C.J. 1992, "Practical Markov Chain Monte Carlo", Statistical Science, vol. 7, no. 4, pp. 473.

Gilks, W.R., Richardson, S. & Spiegelhalter, D.J. 1996, Markov Chain Monte Carlo in Practice, Reprint edn, Chapman & Hall, London.

Guba, E.G. 1990, The Paradigm Dialog, SAGE, London.

Hamerle, A., Rauhmeier, R. & Rösch, D. 2003, Uses and Misuses of Measures for Credit Rating Accuracy. Available: http://www.defaultrisk.com/_pdf6j4/Uses_n_Misuses_o_Measures_4_Cr_Rtng_Accrc.pdf [2012, 05/29].

Hellwig, M. 2008, Systemic Risk in the Financial Sector: An Analysis of the Subprime-Mortgage Financial Crisis. Available: http://www.coll.mpg.de/pdf_dat/2008_43online.pdf [2012, February/20].

Howson, C. & Urbach, P. 1993, Scientific Reasoning: the Bayesian Approach, 2nd edn, Open Court, Chicago.


Jaynes, E.T. & Bretthorst, G.L. 2003, Probability Theory: the Logic of Science, Cambridge University Press, New York.

Keramati, A. & Yousefi, N. 2011, A Proposed Classification of Data Mining Techniques in Credit Scoring. Available: http://www.iieom.org/ieom2011/pdfs/IEOM061.pdf [2012, 07/02].

Kynn, M. 2005, Eliciting Expert Knowledge for Bayesian Logistic Regression in Species Habitat Modelling. Available: http://eprints.qut.edu.au/16041/1/Mary_Kynn_Thesis.pdf [2012, 05/11].

Laitinen, E.K. 1999, "Predicting a Corporate Credit Analyst's Risk Estimate by Logistic and Linear Models", International Review of Financial Analysis, vol. 8, no. 2, pp. 97.


Lancaster, T. 2004, An Introduction to Modern Bayesian Econometrics, Blackwell Pub., Malden, MA.

Lenhard, J. 2006, Models and Statistical Inference: The Controversy between Fisher and Neyman-Pearson, Oxford University Press.

Liu, J.S. 2001, Monte Carlo Strategies in Scientific Computing, Springer, New York.

Löffler, G., Posch, P.N. & Schöne, C. 2005, Bayesian Methods for Improving Credit Scoring Models. Available: http://129.3.20.41/eps/fin/papers/0505/0505024.pdf [2012, 07/12].

Mira, A. & Tenconi, P. 2003, Bayesian Estimate of Credit Risk via MCMC with Delayed Rejection. Available: http://eco.uninsubria.it/dipeco/quaderni/files/QF2003_34.pdf [2012, 07/12].

Nevin, J.R. 1979, "The Equal Credit Opportunity Act: An Evaluation", Journal of Marketing, vol. 43, no. 2, pp. 95.

Neyman, J. & Pearson, E.S. 1967, Joint Statistical Papers, University Press, Cambridge.

Roberts, G.O. & Rosenthal, J.S. 1998, "Markov-Chain Monte Carlo: Some Practical Implications of Theoretical Results", The Canadian Journal of Statistics / La Revue Canadienne de Statistique, vol. 26, no. 1, pp. 5.

Rouchka, E.C. 2008, A Brief Overview of Gibbs Sampling. Available: http://topaz.gatech.edu/~vardges/biol7023/FALL_2006/Lab5/ROUCHKA_gibbs.pdf [2012, 06/12].

SAS Institute, Inc. 2008, "Introduction to Bayesian Analysis Procedures" in SAS/STAT 9.2 User's Guide, 2nd edn, SAS Publishing, pp. 141.

Satchell, S. & Xia, W. 2006, Analytic Models of the ROC Curve: Application to Credit Rating Model Validation. Available: http://www.qfrc.uts.edu.au/research/research_papers/rp181.pdf [2012, 06/07].

Sivia, D.S. & Skilling, J. 2007, Data Analysis: a Bayesian Tutorial, 2nd edn, Oxford University Press, Oxford.

Smith, A.F.M. 1991, "Bayesian Computational Methods", Phil. Trans. R. Soc. Lond., no. 337, pp. 369-386.

Sobehart, J., Keenan, S. & Stein, R. 2001, Benchmarking Quantitative Default Risk Models: a Validation Methodology. Available: http://www.algorithmics.com/EN/media/pdfs/Algo-RA0301-ARQ-DefaultRiskModels.pdf [2012, 05/24].


Steenackers, A. & Goovaerts, M.J. 1989, "A Credit Scoring Model for Personal Loans", Insurance Mathematics and Economics, vol. 8, no. 1, pp. 31.

Stigler, S. 2005, "Fisher in 1921", Statistical Science, vol. 20, no. 1, pp. 32.

Tabachnick, B.G. & Fidell, L.S. 2008, Using Multivariate Statistics, 5th edn, Pearson/Allyn & Bacon, Boston.

Thomas, L.C. 2000, "A Survey of Credit and Behavioural Scoring: Forecasting Financial Risk of Lending to Consumers", International Journal of Forecasting, vol. 16, no. 2, pp. 149.

Walsh, B. 2004, Markov Chain Monte Carlo and Gibbs Sampling. Available: http://web.mit.edu/~wingated/www/introductions/mcmc-gibbs-intro.pdf [2012, 06/12].

Wilhelmsen, M., Dimakos, X.K., Husebø, T. & Fiskaaen, M. 2009, Bayesian Modelling of Credit Risk using Integrated Nested Laplace Approximations. Available: http://publications.nr.no/BayesianCreditRiskUsingINLA.pdf [2012, 05/30].

Wylie, J., Muegge, S. & Thomas, D.R. 2006, Bayesian Methods in Management Research: an Application to Logistic Regression. Available: http://attila.acadiau.ca/library/ASAC/v27/content/authors/t/Thomas,%20Roland/BAYESIAN%20METHODS%20IN%20MANAGEMENT%20RESEARCH.pdf [2012, February/20].

Ziemba, A. 2005, Bayesian Updating of Generic Scoring Models. Available: http://www.google.dk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CEwQFjAA&url=http%3A%2F%2Fwww.business-school.ed.ac.uk%2Fwaf%2Fschoolbiz%2Fget_file.php%3Fasset_file_id%3D1762&ei=jWEGUJXTAvSM4gTAu5WaCQ&usg=AFQjCNFrs7tKlLxx7QXkyHOkmawqRlV8wA [2012, 07/12].
