
J Syst Sci Syst Eng (Dec 2006) 15(4): 419-435
DOI: 10.1007/s11518-006-5023-5
ISSN: 1004-3756 (Paper) 1861-9576 (Online)
CN 11-2983/N
© Systems Engineering Society of China & Springer-Verlag 2006

A COMPARATIVE STUDY OF DATA MINING METHODS IN CONSUMER LOANS CREDIT SCORING MANAGEMENT

Wenbing XIAO 1   Qian ZHAO 2   Qi FEI 3

1 Institute of Systems Engineering, Huazhong University of Science & Technology, Wuhan 430074, China
xiaowenbing11@163.com (corresponding author)
2 School of Economics, Renmin University of China, Beijing 100872, China
zqheropen@yahoo.com.cn
3 Institute of Systems Engineering, Huazhong University of Science & Technology, Wuhan 430074, China
qfei@mail.hust.edu.cn

This work was supported in part by the National Science Foundation of China under Grant No. 70171015.

Abstract
Credit scoring has become a critical and challenging management science issue as the credit
industry has been facing stiffer competition in recent years. Many classification methods have been
suggested to tackle this problem in the literature. In this paper, we investigate the performance of
various credit scoring models and the corresponding credit risk cost for three real-life credit scoring
data sets. Besides the well-known classification algorithms (e.g. linear discriminant analysis, logistic
regression, neural networks and k-nearest neighbor), we also investigate the suitability and
performance of some recently proposed, advanced data mining techniques such as support vector
machines (SVMs), classification and regression tree (CART), and multivariate adaptive regression
splines (MARS). The performance is assessed by using the classification accuracy and cost of credit
scoring errors. The experimental results show that SVM, MARS, logistic regression and neural networks yield very good performance. However, the explanatory capability of CART and MARS outperforms that of the other methods.
Keywords: Data mining, credit scoring, classification and regression tree, support vector machines,
multivariate adaptive regression splines, credit-risk evaluation

1. Introduction

Data mining (DM), sometimes referred to as knowledge discovery in databases (KDD), is a systematic approach to finding underlying patterns, trends, and relationships buried in data. Data mining has drawn much attention from both researchers and practitioners due to its wide applications in crucial business decisions. Basically, research on DM can be classified into two categories: methodologies and technologies. According to Curt (1995), the technology part of DM consists of techniques such as statistical methods, neural networks, decision trees, genetic algorithms, and non-parametric methods. Among the above-mentioned applications, classification problems, in which observations are assigned to one of several disjoint groups, have played important roles in business decision making due to their wide applications in decision support, financial forecasting, fraud detection, marketing strategy, and other related fields (Chen et al. 1996, Lee and Chen 2005, Tam and Kiang 1992).

Credit risk evaluation decisions are crucial for financial institutions due to the severe impact of loan default. This is an even more important task today, as the credit industry has been experiencing serious competition during the past few years. Credit scoring has gained more and more attention as the credit industry has realized the benefits of improving cash flow, insuring credit collections and reducing possible risks. Hence, many different useful techniques, known as credit scoring models, have been developed by banks and researchers in order to solve the problems involved in the evaluation process (Mester 1997). The objective of credit scoring models is to assign credit applicants either to a "good credit" group, who are likely to repay their financial obligation, or to a "bad credit" group, who are more likely to default on the financial obligation; the applications of the latter should be denied. Therefore, credit scoring problems basically fall within the scope of the more generally and widely discussed classification problems.

Usually, credit scoring is employed to rank credit information based on the application form details and other relevant information held by a credit reference agency. As a result, accounts with a high probability of default can be monitored and necessary actions can be taken in order to prevent the account from entering default. In response, statistical methods, non-parametric methods, and artificial intelligence approaches have been proposed to support the credit approval decision process (Desai et al. 1996, West 2000).

Generally, linear discriminant analysis and logistic regression are the two most commonly used data mining techniques for constructing credit scoring models. However, linear discriminant analysis (LDA) has often been criticized because of the categorical nature of credit data and the fact that the covariance matrices of the good and bad credit classes are not likely to be equal. In addition to the LDA approach, logistic regression is an alternative for conducting credit scoring, and a number of logistic regression models for credit scoring applications have been reported in the literature (Henley 1995). However, logistic regression is also criticized for some strong model assumptions, such as variation homogeneity, which have limited its application in handling credit scoring problems. Recently, neural networks have provided an alternative to LDA and logistic regression, particularly in situations where the dependent and independent variables exhibit complex nonlinear relationships. Even though it has been reported that neural networks have better credit scoring capability than LDA and logistic regression (Desai et al. 1996), neural networks are also criticized for the long training process required to design the optimal network topology, the difficulty of identifying the relative importance of potential input variables, and certain interpretive difficulties, all of which have limited their applicability in handling credit scoring problems. Hence, the issue of which classification technique to use for credit scoring remains a very difficult and challenging problem. In this paper, we conduct a benchmarking study of various classification techniques on three real-life credit data sets. The techniques implemented are logistic regression, linear discriminant analysis, SVMs, neural networks, KNN, CART and MARS. All techniques are evaluated in terms of the percentage of correctly classified observations and the misclassification cost.

This paper is organized as follows. We begin with a short overview of the classification techniques used in Section 2. Data sets and experimental design are presented in Section 3. Section 4 gives the empirical results and discussion for the three real credit scoring data sets, including classification performance, the costs of credit scoring errors and the explanatory ability of the credit scoring models. Section 5 addresses the conclusion and discusses possible future research areas.

2. Literature Review

2.1 Linear Discriminant Analysis and Logistic Regression Models

Linear discriminant analysis involves the linear combination of the two (or more) independent variables that best differentiate between the a priori defined groups. This is achieved by the statistical decision rule of maximizing the between-group variance relative to the within-group variance; this relationship is expressed as the ratio of between-group to within-group variance. The linear combinations for a discriminant analysis are derived from an equation that takes the form

$$Z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n \qquad (1)$$

where $Z$ is the discriminant score, $w_i\ (i = 1, 2, \ldots, n)$ are the discriminant weights, and $x_i\ (i = 1, 2, \ldots, n)$ are the independent variables (Altman 1968, Jo, Han and Lee 1997).

Logistic regression (Logistic) analysis has also been used to investigate the relationship between binary or ordinal response probability and explanatory variables. The method fits a linear logistic regression model for binary or ordinal response data by the method of maximum likelihood. The advantage of this method is that it does not assume multivariate normality and equal covariance matrices as LDA does. The logistic regression approach to classification tries to estimate the probability $P(y = 1 \mid x)$ as follows:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + \cdots + w_n x_n)}} \qquad (2)$$

where $x$ is the $n$-dimensional input vector, $w = (w_1, \ldots, w_n)$ is the parameter vector and the scalar $w_0$ is the intercept. The parameters $w_0$ and $w_i$ are typically estimated using the maximum likelihood procedure (Hosmer 2000, Thomas 2000).
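To make the two baseline classifiers concrete, the following sketch (an illustration only, not the authors' original setup; the synthetic data and all names are assumptions of this example) fits both models with scikit-learn. The fitted coefficients play the role of the discriminant weights $w_i$ of Equation (1) and of the parameters $w_0, w_i$ of Equation (2):

```python
# Sketch: LDA and logistic regression scorecards on synthetic credit-like data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))                              # four applicant attributes
y = (X @ np.array([1.0, -0.5, 0.8, 0.2])
     + rng.normal(size=n) > 0).astype(int)               # 1 = good, 0 = bad

lda = LinearDiscriminantAnalysis().fit(X, y)
print("LDA discriminant weights w_i:", lda.coef_.ravel())

logit = LogisticRegression(max_iter=1000).fit(X, y)
print("intercept w_0:", logit.intercept_[0])
print("weights w_i:  ", logit.coef_.ravel())

# P(y=1|x) of Equation (2) for a new applicant at the origin
print("P(good | x_new):", logit.predict_proba(np.zeros((1, 4)))[0, 1])
```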

2.2 Support Vector Machines Models

A simple description of the SVM algorithm is provided as follows. Given a training set $D = \{x_i, y_i\}_{i=1}^{N}$ with input vectors $x_i = (x_i^{(1)}, \ldots, x_i^{(n)})^T \in \mathbb{R}^n$ and target labels $y_i \in \{-1, +1\}$, the support vector machine (SVM) classifier, according to Vapnik's original formulation, satisfies the following conditions:

$$\begin{cases} w^T \varphi(x_i) + b \ge +1, & \text{if } y_i = +1 \\ w^T \varphi(x_i) + b \le -1, & \text{if } y_i = -1 \end{cases} \qquad (3)$$

which is equivalent to

$$y_i [w^T \varphi(x_i) + b] \ge 1, \quad i = 1, \ldots, N \qquad (4)$$

where $w$ represents the weight vector and $b$ the bias. The nonlinear function $\varphi(\cdot): \mathbb{R}^n \to \mathbb{R}^{n_k}$ maps the input or measurement space to a high-dimensional, and possibly infinite-dimensional, feature space. Equation (4) then comes down to the construction of two parallel bounding hyperplanes at opposite sides of the separating hyperplane $w^T \varphi(x) + b = 0$ in the feature space, with the margin width between both hyperplanes equal to $2/\|w\|$. In the primal weight space, the classifier then takes the decision function form

$$\operatorname{sgn}(w^T \varphi(x) + b) \qquad (5)$$

Most classification problems are, however, linearly non-separable. Therefore, it is usual to find the weight vector using slack variables $\xi_i$ to permit misclassification. One defines the primal optimization problem as

$$\min_{w, b, \xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i \qquad (6)$$

subject to

$$y_i (w^T \varphi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N \qquad (7)$$

where the $\xi_i$ are slack variables needed to allow misclassifications in the set of inequalities, and $C \in \mathbb{R}^+$ is a tuning hyperparameter weighting the importance of classification errors vis-à-vis the margin width. The solution of the primal problem is obtained after constructing the Lagrangian. From the conditions of optimality, one obtains a quadratic programming (QP) problem with Lagrange multipliers $\alpha_i$. A multiplier $\alpha_i$ exists for each training data instance, and data instances corresponding to non-zero $\alpha_i$ are called support vectors.

On the other hand, the above primal problem can be converted into the following dual problem with objective function (8) and constraints (9). Since the decision variables are the Lagrange multipliers themselves, it is easier to interpret the results of this dual problem than those of the primal one:

$$\max_{\alpha} \ -\frac{1}{2} \alpha^T Q \alpha + e^T \alpha \qquad (8)$$

subject to

$$0 \le \alpha_i \le C, \quad i = 1, \ldots, N, \qquad y^T \alpha = 0 \qquad (9)$$

In the dual problem above, $e = (1, 1, \ldots, 1)^T \in \mathbb{R}^N$, $Q$ is an $N \times N$ positive semi-definite matrix with $Q_{ij} = y_i y_j K(x_i, x_j)$, and $K(x_i, x_j) \equiv \varphi(x_i)^T \varphi(x_j)$ is the kernel. Here, the training vectors $x_i$ are mapped into a higher (maybe infinite) dimensional space by the function $\varphi$. As is typical for SVMs, we never explicitly calculate $w$ or $\varphi(x)$. This is made possible by Mercer's condition, which relates the mapping function $\varphi(x)$ to the kernel function $K(\cdot,\cdot)$ as follows:

$$K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j) \qquad (10)$$

For the kernel function $K(\cdot,\cdot)$, one typically has several design choices, such as the linear kernel $K(x_i, x_j) = x_i^T x_j$; the polynomial kernel of degree $d$, $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$, $\gamma > 0$; the radial basis function (RBF) kernel $K(x_i, x_j) = \exp\{-\gamma \|x_i - x_j\|^2\}$, $\gamma > 0$; and the sigmoid kernel $K(x_i, x_j) = \tanh\{\gamma x_i^T x_j + r\}$, where $d, r \in \mathbb{N}$ and $\gamma \in \mathbb{R}^+$ are constants. One then constructs the final SVM classifier as

$$\operatorname{sgn}\Big(\sum_i \alpha_i y_i K(x, x_i) + b\Big) \qquad (11)$$

The details of the optimization are discussed in Vapnik (1999), Gunn (1998) and Cristianini and Shawe-Taylor (2000).
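The following minimal sketch, assuming scikit-learn (whose SVC class wraps the LIBSVM solver used later in this paper) and synthetic data, trains soft-margin SVM classifiers with the four kernel choices listed above; C corresponds to the tuning hyperparameter of Equation (6), and gamma, degree and coef0 to the kernel constants $\gamma$, $d$ and $r$:

```python
# Sketch: soft-margin SVMs with the linear, polynomial, RBF and sigmoid kernels.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = np.where(X[:, 0] * X[:, 1] + 0.5 * X[:, 2] > 0, 1, -1)   # labels in {-1, +1}

kernels = {
    "linear":  dict(kernel="linear"),
    "poly":    dict(kernel="poly", degree=3, gamma=1.0, coef0=1.0),  # (g*x'x + r)^d
    "rbf":     dict(kernel="rbf", gamma=0.5),                 # exp(-g*||xi - xj||^2)
    "sigmoid": dict(kernel="sigmoid", gamma=0.1, coef0=0.0),  # tanh(g*x'x + r)
}
for name, kwargs in kernels.items():
    clf = SVC(C=1.0, **kwargs).fit(X, y)          # C weighs errors vs. margin width
    print(name, "| support vectors:", clf.n_support_.sum(),
          "| training accuracy:", round(clf.score(X, y), 3))
```

Only the data instances kept as support vectors (non-zero $\alpha_i$) enter the decision function (11), which is why their count is reported alongside the accuracy.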

2.3 Neural Networks Models (BPN, RBF and FAR)

Neural network modeling involves constructing computers with architectures and processing capabilities that mimic certain processing capabilities of the human brain. A neural network model is composed of neurons, the processing elements, which are inspired by biological nervous systems. Each neuron receives inputs and delivers a single output. Thus, a neural network model is a collection of neurons grouped in layers, such as the input layer, the hidden layer, and the output layer. Several hidden layers can be placed between the input and the output layers. We will discuss the BPN in more detail because it is the most popular NN for classification.

A simple back-propagation network (BPN) model consists of three layers: the input layer, the hidden layer, and the output layer. The input layer processes the input variables and provides the processed values to the hidden layer. The hidden layer further processes the intermediate values and transmits the processed values to the output layer. The output layer corresponds to the output variables of the back-propagation neural network model. A three-layer back-propagation neural network (BPN) is shown in Figure 1. For the details of neural networks, readers are referred to West (2000) and Bishop (1995).

Figure 1 A three-layer back-propagation neural network (input layer, hidden layer, output layer)


Radial Basis Function (RBF) networks


(Moody and Darken 1989) have a static
Gaussian function as the non-linearity for the
hidden layer processing elements. The Gaussian
function responds only to a small region of the
input space where the Gaussian is centered. The
key to a successful implementation of these
networks is to find suitable centers for the
Gaussian functions. This can be done with
supervised learning, but an unsupervised
approach usually produces better results. The
advantage of radial basis function networks is that they find the input-to-output map using
local approximators. Usually the supervised
segment is simply a linear combination of the
approximators. Since linear combiners have few
weights, these networks train extremely fast and
require fewer training samples.
The fuzzy ART (FAR) network (West 2000) is
a dynamic network which incorporates
computations from fuzzy set theory into the
adaptive resonance theory (ART). The typical
FAR network consists of two totally
interconnected layers of neurons, identified as
the complement layer and the category layer, in
addition to the input and output layers. When an
input vector is applied to the network, it creates
a short-term activation of the neurons in the
complement layer. This activity is transmitted
through the weight vector to neurons in the
category layer. Each neuron in the category layer
then calculates the inner product of the
respective weights and input values. These
calculated values are then resonated back to the
complement layer.
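As an illustration of the BPN just described, the sketch below trains a three-layer network by back-propagation with scikit-learn's MLPClassifier on synthetic data. It is a stand-in rather than the paper's own implementation, and since RBF and FAR networks have no off-the-shelf equivalent in this library, only the BPN analogue is shown:

```python
# Sketch: a three-layer BPN-style classifier (input, one hidden, output layer),
# trained by back-propagation with a logistic (sigmoid) transfer function.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 10))
y = (X[:, :3].sum(axis=1) > 0).astype(int)

bpn = MLPClassifier(hidden_layer_sizes=(8,),   # the paper varies 8 to 30 hidden nodes
                    activation="logistic",
                    solver="sgd", learning_rate_init=0.1,
                    max_iter=2000, random_state=0).fit(X, y)
print("training accuracy:", round(bpn.score(X, y), 3))
```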

2.4 Multivariate Adaptive Regression Splines

MARS was first proposed by Friedman (1991, 1995) as a flexible procedure that models relationships which are nearly additive or involve interactions among few variables. The modeling procedure is inspired by the recursive partitioning technique governing classification and regression trees (CART) (Breiman et al. 1984) and by generalized additive modeling, resulting in a model that is continuous with continuous derivatives. It excels at finding optimal variable transformations and interactions, and the complex data structure that often hides in high-dimensional data, and hence can effectively uncover important data patterns and relationships that are difficult, if not impossible, for other methods to reveal.
MARS essentially builds flexible models by
fitting piecewise linear regressions; that is, the
nonlinearity of a model is approximated through
the use of separate regression slopes in distinct
intervals of the predictor variable space.
Therefore the slope of the regression line is
allowed to change from one interval to the other
as the knot points are crossed. The variable
to use and the end points of the intervals for
each variable are found via a fast but intensive
search procedure. In addition to searching
variables one by one, MARS also searches for
interactions between variables, allowing any
degree of interaction to be considered.
The general MARS function can be represented using the following equation:

$$f(x) = a_0 + \sum_{m=1}^{M} a_m \prod_{k=1}^{K_m} \left[ s_{km} \left( x_{v(k,m)} - t_{km} \right) \right]_+ \qquad (12)$$

where $a_0$ and $a_m$ are parameters, $M$ is the number of basis functions, $K_m$ is the number of knots, $s_{km}$ takes on a value of either $+1$ or $-1$ and indicates the right/left sense of the associated step function, $v(k,m)$ is the label of the independent variable, and $t_{km}$ indicates the knot location.
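To make the basis functions of Equation (12) concrete, the short NumPy sketch below evaluates the one-sided hinge terms $[s(x - t)]_+$ and a two-hinge interaction product; the knots and signs are illustrative values, not ones fitted by the MARS search:

```python
# Sketch: MARS hinge (truncated linear) basis terms from Equation (12).
import numpy as np

def hinge(x, t, s):
    """max(0, s*(x - t)) with sign s in {+1, -1} and knot location t."""
    return np.maximum(0.0, s * (x - t))

x = np.linspace(0.0, 10.0, 6)
print(hinge(x, t=4.0, s=+1))     # active to the right of the knot
print(hinge(x, t=4.0, s=-1))     # mirror image, active to the left

# The m-th basis function is the product of its K_m hinges, e.g. a
# two-variable interaction term evaluated at x1 = 3, x2 = 6:
bf = hinge(np.array([3.0]), 2.0, +1) * hinge(np.array([6.0]), 5.0, +1)
print("interaction basis value:", bf[0])
```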
The optimal MARS model is selected in a two-stage process. Firstly, MARS constructs a very large number of basis functions to overfit the data initially, where variables are allowed to enter as continuous, categorical, or ordinal (the formal mechanism by which variable intervals are defined), and they can interact with each other or be restricted to enter only as additive components. In the second stage, basis functions are deleted in order of least contribution using the generalized cross-validation (GCV) criterion. A measure of variable importance can be assessed by observing the decrease in the calculated GCV values when a variable is removed from the model. The GCV can be expressed as follows:

$$\mathrm{LOF}(\hat{f}_M) = \mathrm{GCV}(M) = \frac{\frac{1}{N} \sum_{i=1}^{N} \left[ y_i - \hat{f}_M(x_i) \right]^2}{\left[ 1 - \frac{C(M)}{N} \right]^2} \qquad (13)$$

where there are $N$ observations and $C(M)$ is the cost-penalty measure of a model containing $M$ basis functions (the numerator therefore measures the lack of fit of the $M$ basis function model $\hat{f}_M(x_i)$, and the denominator denotes the penalty for model complexity $C(M)$). Missing values can also be handled in MARS by using dummy variables indicating the presence of the missing values. By allowing for any arbitrary shape for the function and interactions, and by using the above-mentioned two-stage model building procedure, MARS is capable of reliably tracking the very complex data structures that often hide in high-dimensional data. Please refer to Friedman (1991, 1995) for more details regarding the model building process.
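A minimal sketch of the GCV criterion of Equation (13) follows; the responses, fitted values and complexity penalty C(M) are invented for illustration, since in practice these quantities come out of the MARS backward-deletion pass:

```python
# Sketch: the GCV lack-of-fit score of Equation (13) for a candidate model.
import numpy as np

def gcv(y, y_hat, c_m):
    """Mean squared residual divided by the complexity penalty (1 - C(M)/N)^2."""
    n = len(y)
    return np.mean((y - y_hat) ** 2) / (1.0 - c_m / n) ** 2

y     = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.8, 0.7, 0.1, 0.6])
# Backward deletion keeps the subset of basis functions minimizing GCV;
# a larger C(M) penalizes model complexity more heavily.
for c_m in (2.0, 4.0):
    print(f"C(M) = {c_m}: GCV = {gcv(y, y_hat, c_m):.4f}")
```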

2.5 k-Nearest-Neighbor Classifiers and the CART Model

k-Nearest-neighbor classifiers (KNN) (Henley and Hand 1996) classify a data instance by considering only the $k$ most similar data instances in the training set. The class label is then assigned according to the class of the majority of the $k$ nearest neighbors. Ties can be avoided by choosing $k$ odd. One commonly opts for the Euclidean distance as the similarity measure:

$$d(x_i, x_j) = \|x_i - x_j\| = \left[ (x_i - x_j)^T (x_i - x_j) \right]^{1/2} \qquad (14)$$

where $x_i, x_j \in \mathbb{R}^n$ are the input vectors of data instances $i$ and $j$, respectively. Note that more advanced distance measures have also been proposed in the literature.
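A minimal KNN sketch with the Euclidean metric of Equation (14) is given below; the data are synthetic and $k = 7$ is chosen odd, as noted above, to avoid ties:

```python
# Sketch: k-nearest-neighbor credit scoring with Euclidean distance.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # 1 = good, 0 = bad (assumed coding)

knn = KNeighborsClassifier(n_neighbors=7, metric="euclidean").fit(X, y)
print("class of a new applicant:", knn.predict(np.zeros((1, 6)))[0])
```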
Classification and regression trees (CART), a statistical procedure introduced by Breiman et al. (1984), is primarily used as a classification tool, where the objective is to classify an object into two or more populations. As the name suggests, CART is a single procedure that can be used to analyze either categorical or continuous data using the same technology. The methodology outlined in Breiman et al. can be summarized in three stages. The first stage involves growing the tree using a recursive partitioning technique to select variables and split points according to a splitting criterion. Several criteria are available for determining the splits, including Gini, twoing and ordered twoing; for a detailed description of these criteria, one can refer to Breiman et al. In addition to selecting the primary variables, surrogate variables, which are closely related to the original splits and may be used in classifying observations having missing values for the primary variables, can be identified and selected.

After a large tree is identified, the second stage of the CART methodology uses a pruning procedure that incorporates a minimal cost-complexity measure. The result of the pruning procedure is a nested subset of trees, starting from the largest tree grown and continuing the process until only one node of the tree remains. Cross-validation or a testing sample is then used to provide estimates of future classification errors for each subtree. The last stage of the methodology is to select the optimal tree, which corresponds to the tree yielding the lowest error rate on the cross-validation or testing set. Please refer to Breiman et al. (1984) and Steinberg and Colla (1997) for more details regarding the model building process of CART.
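The three stages can be sketched with scikit-learn's DecisionTreeClassifier (not Salford's CART implementation, so this is an analogy under assumed data): grow with the Gini criterion, derive the minimal cost-complexity pruning sequence, and select the subtree by 10-fold cross-validation:

```python
# Sketch: grow / prune / select for a CART-style tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 8))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Stages 1 + 2: the pruning path enumerates the nested subtrees of the large tree.
path = DecisionTreeClassifier(criterion="gini",
                              random_state=0).cost_complexity_pruning_path(X, y)

# Stage 3: cross-validate each subtree (indexed by its alpha) and keep the best.
best_alpha, best_score = 0.0, -np.inf
for alpha in np.unique(path.ccp_alphas):
    tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=alpha, random_state=0)
    score = cross_val_score(tree, X, y, cv=10).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score
print("chosen alpha:", round(best_alpha, 5), "| CV accuracy:", round(best_score, 3))
```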

3. Data Sets and Experimental Design

The German and Australian credit data sets are publicly available at the UCI repository (http://kdd.ics.uci.edu). Dr. Hans Hofmann of the University of Hamburg contributed the German credit scoring data. It consists of 700 examples of creditworthy applicants and 300 examples where credit should not be extended. For each applicant, 24 variables describe credit history, account balances, loan purpose, loan amount, employment status, personal information, age, housing, and job. The Australian credit scoring data set is similar but more balanced, with 307 and 383 examples of each outcome; it contains a mixture of six continuous and eight categorical variables. The third credit data set is from a major financial institution in the US and comprises 1225 applications, including 902 examples of creditworthy applicants and 323 examples of non-creditworthy applicants; it includes 14 attributes. To protect the confidentiality of these data, attribute names and values have been changed to symbolic data.
To minimize the impact of data dependency and improve the reliability of the resulting estimates, 10-fold cross-validation is used to create random partitions of the raw data sets. Each of the 10 random partitions serves as an independent holdout test set for the credit scoring model trained with the remaining nine partitions. The training set is used to establish the credit scoring model's parameters, while the independent test sample is used to test the generalization capability of the model. The overall scoring accuracy reported is an average across all ten test set partitions.
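This protocol can be sketched as follows, assuming scikit-learn and a placeholder classifier; the synthetic class proportions mimic the German data's roughly 700/300 split:

```python
# Sketch: 10-fold cross-validation; each fold serves once as the holdout test set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 24))                # e.g. the 24 German credit attributes
y = (rng.random(1000) < 0.7).astype(int)       # about 700 good / 300 bad

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("overall accuracy (mean over 10 holdout folds):", round(scores.mean(), 3))
```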
The topic of choosing the appropriate class distribution for classifier learning has received much attention in the literature. In this study, we dealt with this problem by using a variety of class distributions, ranging from 55.5/44.5 for the Australian credit data set to 73.6/26.4 for the American credit data set. The LDA, Logistic, CART, KNN and MARS classifiers require no parameter tuning. For the SVM classifiers, we used the LIBSVM toolbox 2.8 and adopted a grid search mechanism to tune the parameters. For the BPN classifiers, we adopted the standard three-layer architecture. The numbers of input and output nodes were the numbers of input and output variables, respectively, and the hidden layer and output layer nodes use the sigmoid transfer function. Since the network with optimum generalization performance on the test set is still difficult to guarantee, the number of hidden nodes for the three data sets was varied between 8 and 30, and the network with the best training set performance was selected for test set evaluation. The NN analyses were conducted using the Neural Networks toolbox 4.0 (http://www.mathworks.com). CART 4.0 and MARS 2.0 evaluation versions (http://www.salford-systems.com), provided by Salford Systems, were used in building the CART and MARS credit scoring models. The SVM analyses were conducted using the LIBSVM toolbox 2.8 (Chang and Lin 2001).
4. Results and Discussion

The results for each credit scoring model are reported in Table 1 for the German, Australian and American credit data. These results are averages of the accuracy determined for each of the 10 independent test data set partitions used in the cross-validation methodology. Since the training of any neural network model is a stochastic process, the network accuracy determined for each data set partition is itself an average of 10 repetitions.
Table 1 10-fold cross-validation test set classification accuracy on the credit scoring data sets

          German credit data (%)    Australian credit data (%)   American credit data (%)
          Goods   Bads   Overall    Goods   Bads   Overall       Goods   Bads   Overall
RBF       86.5    48.0   74.6       86.8    87.2   87.1          88.5    24.2   71.3
BPN       86.4    42.5   73.3       84.6    86.7   85.8          88.1    22.9   70.9
FAR       60.0    51.2   57.3       74.4    76.2   75.4          N/A     N/A    N/A
LDA       72.3    73.3   72.6       81.0    92.2   85.9          65.4    56.0   62.9
LOGIT     88.1    48.7   76.3       85.9    89.0   87.2          95.9    11.2   73.5
KNN       77.5    44.7   67.6       84.7    86.7   85.8          78.4    30.1   66.1
Kernel    84.5    37.0   70.2       81.4    84.8   84.4          N/A     N/A    N/A
CART      71.2    69.4   70.5       79.9    92.5   85.5          59.3    59.4   59.3
MARS      89.0    66.0   74.9       86.3    88.3   87.4          89.7    20.2   71.4
Lin-SVM   88.9    49.1   77.0       79.9    92.5   85.5          88.9    22.0   71.3
Pol-SVM   88.5    48.6   76.5       83.8    88.6   85.5          89.9    18.3   71.0
Rbf-SVM   88.7    49.7   77.1       80.5    93.0   85.8          89.4    22.6   71.8
Sig-SVM   89.0    50.0   77.2       80.5    92.0   85.6          89.6    21.1   71.5

Neural network results are averages of 10 repetitions. N/A: not tested.

It is evident from Table 1 that Sig-SVM has the highest overall credit scoring accuracy of 77.2% for the German credit data, while the Lin-SVM, Pol-SVM and Rbf-SVM have credit scoring accuracies of 76.5% to 77.1%. Closely following the SVMs is logistic regression with an overall accuracy of 76.3%, and MARS with 74.9%. Linear discriminant analysis has an accuracy of 72.6%, which is 3.7% less accurate than logistic regression. A strength of the linear discriminant model for this data, however, is a significantly higher accuracy than any other model in identifying bad credit risks. This is likely due to the assumption of equal prior probabilities used to develop the linear discriminant model. It is also interesting to note that the most commonly used neural network architecture, BPN with an accuracy of 73.3%, is comparable to linear discriminant analysis with an accuracy of 72.6%. The K-NN, kernel density and CART overall accuracy levels are 67.6%, 70.2% and 70.5%, respectively. The least

accurate method for the German credit scoring


data is the FAR neural networks model at
57.3%.
For the Australian credit data, MARS has the top overall credit scoring accuracy of 87.4%, followed closely by logistic regression (87.2%) and NN (RBF) (87.1%). The BPN (85.8%) and LDA (85.9%) are again comparable from an overall accuracy consideration. The KNN, CART, BPN, LDA and SVM models have overall credit scoring errors that are more than 0.01 greater than those of the MARS, logistic regression and RBF neural models. The FAR neural network and kernel density models have overall accuracies of 75.4% and 84.4%, respectively.
For the American credit data, logistic regression has the top overall credit scoring accuracy of 73.5%, followed closely by the Rbf-SVM (71.8%) and Sig-SVM (71.5%). The Lin-SVM and Pol-SVM are grouped at accuracy levels from 71.0% to 71.3%. The BPN, K-NN and MARS overall accuracy levels are 70.9%, 66.1% and 71.4%, respectively. The least accurate method for the American credit scoring data is the CART model (kernel density and FAR are not tested for the American credit data). However, we note that CART has the lowest error rate (40.6%) of all models in identifying bad credit risks, followed closely by the LDA (44.0%).
To further strengthen these conclusions, we test for statistically significant differences between credit scoring models. We have used a special notational convention whereby the best three overall accuracies are underlined and denoted in bold face for each data set. For cross-validation studies of supervised learning algorithms, Dietterich (1998) recommends McNemar's test, which is used in this paper to establish statistically significant differences between credit scoring models.

McNemar's test is a chi-square statistic calculated from a $2 \times 2$ contingency table. The diagonal elements of the contingency table are the counts of the number of credit applications misclassified by both models, $n_{00}$, and the number correctly classified by both models, $n_{11}$. The off-diagonal elements are the counts of the numbers classified incorrectly by Model A and correctly by Model B, $n_{01}$, and conversely the numbers classified incorrectly by Model B and correctly by Model A, $n_{10}$. Results of McNemar's test with $p = 0.05$ are given in Table 2. All credit scoring models are tested for significant differences with the most accurate model in the data set. A model whose overall credit scoring accuracy is not significantly different from that of the most accurate model is labeled as a superior model; those that are significantly less accurate are labeled as inferior models. It is evident from Table 2 that the SVM, logistic regression, NN (RBF) and MARS models are superior ones for all three credit scoring data sets, while the LDA, KNN and CART models are superior for only the Australian credit data.
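For reference, a small sketch of McNemar's test on two models' paired predictions follows (with the common continuity correction); the prediction vectors are invented for illustration:

```python
# Sketch: McNemar's test from the off-diagonal counts n01 and n10 described above.
import numpy as np
from scipy.stats import chi2

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 1, 0, 1] * 30)
pred_a = np.array([1, 0, 0, 1, 0, 1, 1, 1, 0, 1] * 30)    # Model A
pred_b = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 1] * 30)    # Model B

n01 = np.sum((pred_a != y_true) & (pred_b == y_true))     # A wrong, B right
n10 = np.sum((pred_a == y_true) & (pred_b != y_true))     # A right, B wrong
stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)            # chi-square, 1 d.o.f.
p_value = chi2.sf(stat, df=1)
print(f"n01 = {n01}, n10 = {n10}, statistic = {stat:.3f}, p = {p_value:.4f}")
# p < 0.05 indicates a statistically significant accuracy difference.
```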

4.1 Cost of Credit Scoring Errors


This subsection considers the costs of credit
scoring errors and their impact on model
selection. It is evident that the individual group
(bad or good) accuracy of the credit scoring
model can vary widely. For the German credit
data, all models except LDA are much less
accurate at classifying bad credit risks than good
credit risks. Most pronounced is the accuracy of
logistic regression with an error of 0.1186 for
good credit and 0.5113 for bad credit. In credit

Table 2 Statistically significant differences, credit scoring models

                 German credit data      Australian credit data   American credit data
Superior models  RBF, MARS,              RBF, SVM,                RBF, BPN,
                 Logistic regression,    Logistic regression,     Logistic regression,
                 SVM                     MARS, LDA, KNN, CART     SVM, MARS
Inferior models  FAR, BPN, LDA, KNN,     FAR, Kernel density      LDA, KNN, CART
                 Kernel density, CART

Statistical significance established with McNemar's test, p = 0.05; kernel density and FAR are not tested for the American credit data.

scoring applications, it is generally believed that the cost of granting credit to a bad risk candidate, denoted by $C_{12}$, is significantly greater than the cost of denying credit to a good risk candidate, denoted by $C_{21}$. In this situation it is important to rate the credit scoring models with the cost function defined in Equation (15) rather than relying on the overall classification accuracy. To illustrate the cost function, the relative costs of misclassification suggested by Dr. Hofmann when he compiled the German credit data are used: $C_{12}$ is 5 and $C_{21}$ is 1. Evaluation of the cost function also requires estimates of the prior probabilities of good credit, $\pi_1$, and bad credit, $\pi_2$, in the application pool of the credit scoring model. These prior probabilities are estimated from reported default rates. For the year 1997, 6.48% of a total credit debt of $560 billion was charged off (West 2000), while Jensen reports a charge-off rate of 11.2% for the credit applications he investigated (Frydman et al. 1985). The error rate for the bad credit group of the German credit data (which averages about 0.45) is used to establish a low value for $\pi_2$ of 0.144 (0.0648/0.45) and a high value of 0.249 (0.112/0.45). The ratio $n_2/N_2$ in Equation (15) measures the false positive rate, the proportion of bad credit risks that are granted credit, while the ratio $n_1/N_1$ measures the false negative rate, or good credit risks denied credit by the model:

$$\text{Cost} = C_{12}\, \pi_2\, \frac{n_2}{N_2} + C_{21}\, \pi_1\, \frac{n_1}{N_1} \qquad (15)$$
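The cost function is straightforward to evaluate; the sketch below implements Equation (15) and, fed with the logistic-regression error rates for the German data quoted above (0.5113 for bad credit, 0.1186 for good credit), reproduces the corresponding Table 3 entries up to rounding:

```python
# Sketch: expected misclassification cost of Equation (15).
def scoring_cost(fp_rate, fn_rate, pi_bad, c12=5.0, c21=1.0):
    """C12*pi2*(n2/N2) + C21*pi1*(n1/N1); C12=5, C21=1 per Dr. Hofmann."""
    return c12 * pi_bad * fp_rate + c21 * (1.0 - pi_bad) * fn_rate

for pi_bad in (0.144, 0.249):
    print(f"pi2 = {pi_bad}: cost = {scoring_cost(0.5113, 0.1186, pi_bad):.3f}")
# -> 0.470 and 0.726, close to the Logistic row of Table 3 (0.471 / 0.728).
```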
Under these assumptions, the credit scoring cost is reported for each model in Table 3. For the German credit data, the MARS (0.413) model is now slightly better than the LDA (0.429) at the prior probability level of 14.4% bad credit. At the higher level of 24.9% bad credit, the LDA is clearly the best model from an overall cost perspective with a score of 0.540. Closely following LDA is MARS with an overall cost of 0.571, and CART with 0.597. For the Australian credit data, the costs of all models are nearly identical at both levels of $\pi_2$. The Logistic (0.200) model is now slightly better than the MARS (0.202) at the prior probability level of 14.4% bad credit. At the higher level of 24.9% bad credit, the Rbf-SVM is clearly the best model from an overall cost perspective with a score of 0.234. Closely following Rbf-SVM are LDA and Logistic with overall costs of 0.239 and 0.244, respectively. For the American credit data, the LDA (0.613) model is now slightly better than

Table 3 Credit scoring models' misclassification costs

          German credit data       Australian credit data    American credit data
          π2=0.144   π2=0.249      π2=0.144   π2=0.249       π2=0.144   π2=0.249
BPN       0.530      0.818         0.228      0.281          0.657      1.049
RBF       0.497      0.761         0.205      0.258          0.644      1.030
FAR       0.694      0.908         0.391      0.490          N/A        N/A
LDA       0.429      0.540         0.219      0.239          0.613      0.808
Logistic  0.471      0.728         0.200      0.243          0.673      1.140
KNN       0.592      0.858         0.227      0.281          0.688      1.033
Kernel    0.587      0.901         0.268      0.329          N/A        N/A
CART      0.467      0.597         0.226      0.244          0.641      0.811
Lin-SVM   0.462      0.717         0.226      0.244          0.657      1.055
Pol-SVM   0.469      0.726         0.221      0.264          0.675      1.093
Rbf-SVM   0.459      0.711         0.217      0.234          0.648      1.043
Sig-SVM   0.454      0.705         0.225      0.246          0.657      1.060
MARS      0.413      0.571         0.202      0.249          0.663      1.071

N/A: not tested.

Table 4 5-fold cross-validation test set classification accuracy on the balanced credit scoring data sets under the new strategy

          German credit data (%)    Australian credit data (%)   American credit data (%)
          Goods   Bads   Overall    Goods   Bads   Overall       Goods   Bads   Overall
RBF       67.2    73.7   70.4       85.7    89.3   87.5          65.2    57.3   61.3
BPN       67.0    70.3   68.7       85.2    87.6   86.4          64.7    55.7   60.2
LDA       69.0    73.0   71.0       80.3    92.5   86.4          59.5    55.5   57.5
LOGIT     74.3    74.0   74.2       84.0    92.3   88.2          64.3    62.7   63.5
CART      68.0    69.7   68.8       80.7    93.3   87.0          66.7    54.7   61.3
MARS      66.0    79.0   72.5       84.0    91.0   87.5          66.3    50.7   58.5
Rbf-SVM   69.1    73.5   71.3       81.0    93.3   87.2          63.3    59.0   61.2

Neural network results are averages of 10 repetitions.

the CART (0.641) at the prior probability level of 14.4% bad credit, followed by NN (RBF) with a score of 0.644. At the higher level of 24.9% bad credit, the LDA is clearly the best model from an overall cost perspective with a score of 0.808. Closely following LDA is CART with an overall cost of 0.811, and RBF with 1.030.
From Table 2, the relative group classification accuracies of the neural network, SVM, logistic regression and MARS models are influenced by the unbalanced design of the training data. To improve their accuracy with bad credit risks, a new strategy is tested for the above models' training sets. The strategy is to form new data sets from a balanced group of 300 good credit examples and 300 bad credit examples for the different models. Each of these models is tested with 5-fold cross-validation. The accuracy results of the new strategy are summarized in Table 4. The new strategy yields the greatest improvement in the error for bad credit identification, with a reduction of approximately 20% for the German credit data and 30% for the American credit data. The overall error rate under the new strategy increases by 5% and 10% for the two data sets, respectively. However, the overall error rate under the new strategy decreases by 0.5% to 2% for the Australian credit data.
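The resampling step of this strategy can be sketched as follows; the function and array names are hypothetical, and the class coding (1 = good, 0 = bad) is an assumption of the example:

```python
# Sketch: form a balanced training set of 300 good and 300 bad examples.
import numpy as np

def balanced_sample(X, y, per_class=300, seed=0):
    """Draw per_class examples of each credit class without replacement."""
    rng = np.random.default_rng(seed)
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=per_class, replace=False)
        for c in (0, 1)
    ])
    rng.shuffle(idx)
    return X[idx], y[idx]

# Usage on the German data (700 good / 300 bad): X_bal, y_bal = balanced_sample(X, y)
```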

4.2 A Comparison of the Explanatory Ability of Credit Scoring Models

This subsection considers the explanatory ability of the credit scoring models. Good explanatory ability of a credit scoring model is very important in explaining the rationale for the decision to deny credit. Neural network and SVM models both cannot explain how and why they identified a potential bad loan application. LDA and logistic regression models are better than SVM and neural networks in this respect. KNN and kernel density are inferior models regarding explanatory ability, while CART and MARS have the best explanatory ability. More detailed analyses of the explanatory ability of three of the methods (neural networks, CART and MARS) for the German credit data are given below.
4.2.1 Explanatory Ability of the Neural Networks Model for German Data Credit Scoring

A key deficiency of any neural network model for credit scoring applications is the difficulty of explaining the rationale for the decision to deny credit. Neural networks are usually thought of as a black-box technology devoid of any logic or rule-based explanations for the output mapping. This is a particularly sensitive issue in light of recent federal legislation regarding discrimination in lending practices. To address this problem, West (2000) developed explanatory insights for the neural network trained on the German credit data. This is accomplished by clamping 23 of the 24 input values, varying the remaining input by ±5%, and measuring the magnitude of the impact on the two output neurons. The clamping process is repeated until all network inputs have been varied. A weight can then be determined for each input that estimates its relative power in determining the resultant credit decision. Please refer to West (2000) for more details and results regarding the model building process.
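A sketch of this clamping procedure for a generic probabilistic classifier is given below; it follows the spirit of West's (2000) method, but the function itself and its reliance on a predict_proba interface are assumptions of the illustration:

```python
# Sketch: input-perturbation sensitivity weights for a trained classifier.
import numpy as np

def input_sensitivity(model, X, delta=0.05):
    """Mean |change in P(good)| when each input is varied by +/- delta,
    with all other inputs clamped at their observed values."""
    base = model.predict_proba(X)[:, 1]
    weights = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for sign in (+1.0, -1.0):
            Xp = X.copy()
            Xp[:, j] *= (1.0 + sign * delta)       # perturb only column j
            weights[j] += np.mean(np.abs(model.predict_proba(Xp)[:, 1] - base))
    return weights / weights.sum()                 # relative power of each input

# Larger weights flag the attributes that drive the network's credit decision.
```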
4.2.2 Explanatory Ability of the CART Model for German Data Credit Scoring

Figure 2 depicts the CART tree obtained on the testing sample with the popular 1-SE rule in the tree pruning procedure. It is observed from Figure 2 that A1, A3, A5 and A2 play important roles in the rule induction (Ai denotes the ith attribute for i = 1, ..., n, and has the same meaning when it appears later). It can also be observed from Figure 2 that if an observation has A1 between 1.5 and 2.5, A2 ≤ 22.5 and A5 > 3.5, it falls into terminal node 11, whose classified class is class 1 (good customer). The rules and terminal nodes derived from the built tree, unlike those of other classification techniques, are very easy to interpret, and hence marketing professionals can use the built rules in designing proper managerial decisions. Furthermore, we conclude that CART is an effective and powerful management tool which allows us to build advanced and user-friendly decision-support systems for credit scoring management.

4.2.3 Explanatory Ability of the MARS Model for Credit Scoring

In order to demonstrate the explanatory ability of MARS scoring models, the German credit data will be used as an illustrative example. The obtained basis functions and variable selection results of the illustrative example are summarized in Table 5.

Figure 2 The tree of the CART credit scoring model (root split A1 ≤ 2.5, with further splits on A2, A3, A4, A5, A10 and A18, and 12 terminal nodes)

Table 5 Variable selection results and basis functions of the MARS credit scoring model

Variable name   Relative importance (%)    Equation name   Equation
A1              100.00                     BF1             max(0, A1 - 1.000)
A2              57.31                      BF2             max(0, A2 - 4.000)
A3              51.82                      BF3             max(0, A3 - 0.180272E-06)
A5              40.44                      BF4             max(0, A5 - 1.000)
A4              36.67                      BF5             max(0, A4 - 36.000)
A16             32.33                      BF6             max(0, 36.000 - A4)
A9              30.43                      BF7             max(0, A16 + 0.180632E-07)
A17             28.86                      BF8             max(0, A15 - 1.000)
A15             27.52                      BF9             max(0, A20 + 0.182414E-07)
A20             27.34                      BF10            max(0, A17 - 0.376854E-08)
A6              29.1                       BF12            max(0, 4.000 - A6)
A8              16.95                      BF13            max(0, A9 - 1.000)
                                           BF14            max(0, A8 - 2.000)
                                           BF15            max(0, 2.000 - A8)

MARS prediction function: Y = 1.358 - 0.096*BF1 + 0.007*BF2 - 0.058*BF3 - 0.032*BF4 + 0.002*BF5 + 0.005*BF6 + 0.098*BF7 - 0.192*BF8 + 0.094*BF9 - 0.129*BF10 + 0.040*BF12 + 0.040*BF13 - 0.026*BF14 - 0.095*BF15

In the MARS credit scoring model, Y = 0 (1) is defined to be a good (bad) credit customer.

It is observed that A1, A2, A3, A4, A5, A8, A9, A15, A16, A17 and A20 play important roles in determining the MARS credit scoring model. Besides, according to the obtained basis functions and the MARS prediction function, it can be observed that high values of A2, A9, A16 and A20 tend to indicate a bad credit customer, while high values of A1, A3, A5, A15 and A17 are likely to indicate a good credit customer. These conclusions from the basis functions and the MARS prediction function have important managerial implications, since they can help managers and professionals design appropriate loan policies for acquiring good credit customers.

5. Conclusions and Areas of Future Research

Credit scoring has become more and more important as the competition between financial institutions has grown increasingly fierce. More and more companies are seeking better strategies with the help of credit scoring models, and various modeling techniques have been developed for different credit evaluation processes in pursuit of better credit approval schemes. Many modeling alternatives, such as traditional statistical methods, non-parametric methods and artificial intelligence techniques, have therefore been developed in order to handle credit scoring tasks successfully. In this paper, we have studied the performance of various classification techniques for credit scoring. The experiments were conducted on three real-life credit scoring data sets. The classification performance was assessed by the percentage of correctly classified observations and the misclassification cost.

It is found that each technique shows characteristics which may be interesting in the context of a particular data set. Firstly, the Logistic, MARS, SVM and ANN (BPN and RBF) classifiers yield very good performance in terms of the classification ratio. However, it has to be noted that LDA and CART were significantly more accurate than any other model in identifying bad credit risks for the German and American credit scoring data sets. Secondly, the experiments clearly indicated that many classification techniques yield performances which are quite competitive with each other; only a few classification techniques (e.g. FAR and kernel density) were clearly inferior to the others. Besides, CART and MARS not only have lower Type II errors, which are associated with high misclassification costs, but also offer better evaluation reasoning and can help to structure the understanding of a prediction.

Starting from the findings of this study, several interesting topics for future research can be identified. One interesting topic is to collect more important variables to improve credit scoring accuracy. Another promising avenue for future research is to investigate the power of classifier ensembles, where multiple classification algorithms are combined.

References
[1] Altman, E.I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. Journal of Finance, 23: 589-609
[2] Bishop, C.M. (1995). Neural Networks for Pattern Recognition. New York: Oxford University Press
[3] Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). Classification and Regression Trees. Pacific Grove, CA: Wadsworth
[4] Chen, M.S., Han, J. & Yu, P.S. (1996). Data mining: an overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6): 866-883
[5] Chang, C.-C. & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[6] Curt, H. (1995). The devil's in the details: techniques, tools, and applications for database mining and knowledge discovery, Part 1. Intelligent Software Strategies, 6: 1-15
[7] Cristianini, N. & Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge: Cambridge University Press
[8] Desai, V.S., Crook, J.N. & Overstreet, G.A. (1996). A comparison of neural networks and linear scoring models in the credit union environment. European Journal of Operational Research, 95(1): 24-37
[9] Dietterich, T.G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10: 1895-1923
[10] Friedman, J.H. (1991). Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19: 1-141
[11] Friedman, J.H. & Roosen, C.B. (1995). An introduction to multivariate adaptive regression splines. Statistical Methods in Medical Research, 4: 197-217
[12] Frydman, H.E., Altman, E.I. & Kao, D. (1985). Introducing recursive partitioning for financial classification: the case of financial distress. Journal of Finance, 40(1): 53-65
[13] Gunn, S.R. (1998). Support Vector Machines for Classification and Regression. Technical Report, University of Southampton
[14] Henley, W.E. (1995). Statistical aspects of credit scoring. Dissertation, The Open University, Milton Keynes, UK
[15] Henley, W.E. & Hand, D.J. (1996). A k-nearest neighbour classifier for assessing consumer credit risk. Statistician, 44: 77-95
[16] Hosmer, D.W. & Lemeshow, S. (2000). Applied Logistic Regression. New York: John Wiley & Sons
[17] Jo, H., Han, I. & Lee, H. (1997). Bankruptcy prediction using case-based reasoning, neural networks, and discriminant analysis. Expert Systems with Applications, 13: 97-108
[18] Lee, T.S. & Chen, I.F. (2005). A two-stage hybrid credit scoring model using artificial neural networks and multivariate adaptive regression splines. Expert Systems with Applications, 28: 743-752
[19] Mester, L.J. (1997). What's the point of credit scoring? Business Review, Federal Reserve Bank of Philadelphia, Sept/Oct: 3-16
[20] Moody, J. & Darken, C.J. (1989). Fast learning in networks of locally tuned processing units. Neural Computation, 1: 281-294
[21] Steinberg, D. & Colla, P. (1997). Classification and Regression Trees. San Diego, CA: Salford Systems
[22] Thomas, L.C. (2000). A survey of credit and behavioural scoring: forecasting financial risks of lending to consumers. International Journal of Forecasting, 16: 149-172
[23] Tam, K.Y. & Kiang, M.Y. (1992). Managerial applications of neural networks: the case of bank failure predictions. Management Science, 38(7): 926-947
[24] Vapnik, V.N. (1999). Statistical Learning Theory. New York: Springer-Verlag
[25] West, D. (2000). Neural network credit scoring models. Computers & Operations Research, 27: 1131-1152
Wenbing Xiao is a doctoral student at the Institute of Control Science & Systems Engineering at Huazhong University of Science and Technology, China. His research interests include financial forecasting and modeling, decision support systems, data mining and machine learning. He received the M.S. degree in Mathematics and Computer Science from Hunan Normal University (2004).

Qian Zhao is a doctoral student in the School of Economics at Renmin University of China. She received her M.S. in mathematics from Hunan Normal University in 2004. Her current research interests include financial forecasting and modeling, data mining and energy economics. She has published in Advances in Mathematics and the Chinese Journal of Management Science.

Qi Fei is a professor at the Institute of Control Science & Systems Engineering at Huazhong University of Science and Technology, China. His research interests include complexity theory, decision support systems and decision analysis. He received the B.S. degree in Control Science and Engineering from Harbin Institute of Technology (1961).
