
J Syst Sci Syst Eng (Dec 2006) 15(4): 419-435
DOI: 10.1007/s11518-006-5023-5
ISSN: 1004-3756 (Paper) 1861-9576 (Online)
CN 11-2983/N
© Systems Engineering Society of China & Springer-Verlag 2006

A COMPARATIVE STUDY OF DATA MINING METHODS IN CONSUMER LOANS CREDIT SCORING MANAGEMENT

Wenbing XIAO 1   Qian ZHAO 2   Qi FEI 3

1 Institute of Systems Engineering, Huazhong University of Science & Technology, Wuhan 430074, China
xiaowenbing11@163.com (corresponding author)
2 School of Economics, Renmin University of China, Beijing 100872, China
zqheropen@yahoo.com.cn
3 Institute of Systems Engineering, Huazhong University of Science & Technology, Wuhan 430074, China
qfei@mail.hust.edu.cn

This work was supported in part by the National Science Foundation of China under Grant No. 70171015.

Abstract
Credit scoring has become a critical and challenging management science issue as the credit
industry has been facing stiffer competition in recent years. Many classification methods have been
suggested to tackle this problem in the literature. In this paper, we investigate the performance of
various credit scoring models and the corresponding credit risk cost for three real-life credit scoring
data sets. Besides the well-known classification algorithms (e.g. linear discriminant analysis, logistic
regression, neural networks and k-nearest neighbor), we also investigate the suitability and
performance of some recently proposed, advanced data mining techniques such as support vector
machines (SVMs), classification and regression tree (CART), and multivariate adaptive regression
splines (MARS). The performance is assessed by using the classification accuracy and cost of credit
scoring errors. The experimental results show that SVM, MARS, logistic regression and neural networks yield very good performance. However, the explanatory capability of CART and MARS outperforms that of the other methods.
Keywords: Data mining, credit scoring, classification and regression tree, support vector machines,
multivariate adaptive regression splines, credit-risk evaluation

1. Introduction

Data mining (DM), sometimes referred to as knowledge discovery in databases (KDD), is a systematic approach to finding underlying patterns, trends, and relationships buried in data. Data mining has drawn much attention from both researchers and practitioners due to its wide applications in crucial business decisions. Basically, research on DM can be classified into two categories: methodologies and technologies. According to Curt (1995), the technology part of DM consists of techniques such as statistical methods, neural networks, decision trees, genetic algorithms, and non-parametric methods. Among the above-mentioned applications, classification problems, in which observations are assigned to one of several disjoint groups, have played important roles in business decision making due to their wide applications in decision support, financial forecasting, fraud detection, marketing strategy, and other related fields (Chen et al. 1996, Lee and Chen 2005, Tam and Kiang 1992).

Credit risk evaluation decisions are crucial for financial institutions due to the severe impact of loan default. This is an even more important task today, as the credit industry has been experiencing serious competition during the past few years. Credit scoring has gained more and more attention as the credit industry has realized the benefits of improving cash flow, insuring credit collections and reducing possible risks. Hence, many different useful techniques, known as credit scoring models, have been developed by banks and researchers in order to solve the problems involved in the evaluation process (Mester 1997). The objective of credit scoring models is to assign credit applicants either to a "good credit" group, who are likely to repay their financial obligation, or to a "bad credit" group, who are more likely to default on the financial obligation; the applications of the latter should be denied. Therefore, credit scoring problems basically fall within the scope of the more generally and widely discussed classification problems.

Usually, credit scoring is employed to rank credit information based on the application form details and other relevant information held by a credit reference agency. As a result, accounts with a high probability of default can be monitored and necessary actions can be taken in order to prevent the account from entering default. In response, statistical methods, non-parametric methods, and artificial intelligence approaches have been proposed to support the credit approval decision process (Desai et al. 1996, West 2000).

Generally, linear discriminant analysis and logistic regression are the two most commonly used data mining techniques for constructing credit scoring models. However, linear discriminant analysis (LDA) has often been criticized because of the categorical nature of credit data and the fact that the covariance matrices of the good and bad credit classes are not likely to be equal. In addition to the LDA approach, logistic regression is an alternative for conducting credit scoring, and a number of logistic regression models for credit scoring applications have been reported in the literature (Henley 1995). However, logistic regression is also criticized for some strong model assumptions, such as variation homogeneity, which have limited its application in handling credit scoring problems. Recently, neural networks have provided an alternative to LDA and logistic regression, particularly in situations where the dependent and independent variables exhibit complex nonlinear relationships. Even though it has been reported that neural networks have better credit scoring capability than LDA and logistic regression (Desai et al. 1996), neural networks are also criticized for the long training process required to design the optimal network topology, the difficulty of identifying the relative importance of potential input variables, and certain interpretive difficulties, all of which have limited their applicability in handling credit scoring problems. Hence, the issue of which classification technique to use for credit scoring remains a very difficult and challenging problem. In this paper, we conduct a benchmarking study of various classification techniques on three real-life credit data sets. The techniques implemented are logistic regression, linear discriminant analysis, SVMs, neural networks, KNN, CART and MARS. All techniques are evaluated in terms of the percentage of correctly classified observations and the misclassification cost.

This paper is organized as follows. We begin with a short overview of the classification techniques used in Section 2. Data sets and experimental design are presented in Section 3. Section 4 gives the empirical results and discussion for the three real credit scoring data sets, including classification performance, the costs of credit scoring errors and the explanatory ability of the credit scoring models. Section 5 addresses the conclusion and discusses possible future research areas.

2. Literature Review

2.1 Linear Discriminant Analysis and Logistic Regression Models

Linear discriminant analysis involves the linear combination of the two (or more) independent variables that best differentiate between the a priori defined groups. This is achieved by the statistical decision rule of maximizing the between-group variance relative to the within-group variance; this relationship is expressed as the ratio of between-group to within-group variance. The linear combinations for a discriminant analysis are derived from an equation that takes the form

$$Z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n \qquad (1)$$

where $Z$ is the discriminant score, $w_i\ (i = 1, 2, \ldots, n)$ are the discriminant weights, and $x_i\ (i = 1, 2, \ldots, n)$ are the independent variables (Altman 1968, Jo, Han and Lee 1997).

Logistic regression (Logistic) analysis has also been used to investigate the relationship between binary or ordinal response probability and explanatory variables. The method fits a linear logistic regression model for binary or ordinal response data by the method of maximum likelihood. The advantage of this method is that it does not assume multivariate normality and equal covariance matrices as LDA does. The logistic regression approach to classification tries to estimate the probability $P(y = 1 \mid x)$ as follows:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + \cdots + w_n x_n)}} \qquad (2)$$

where $x$ is the $n$-dimensional input vector, $w = (w_1, \ldots, w_n)$ is the parameter vector and the scalar $w_0$ is the intercept. The parameters $w_0$ and $w_i$ are typically estimated using the maximum likelihood procedure (Hosmer 2000, Thomas 2000).
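To make the two baseline classifiers concrete, the following sketch (an illustration only, not the authors' original setup; the synthetic data and all names are assumptions of this example) fits both models with scikit-learn. The fitted coefficients play the role of the discriminant weights $w_i$ of Equation (1) and of the parameters $w_0, w_i$ of Equation (2):

```python
# Sketch: LDA and logistic regression scorecards on synthetic credit-like data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))                              # four applicant attributes
y = (X @ np.array([1.0, -0.5, 0.8, 0.2])
     + rng.normal(size=n) > 0).astype(int)               # 1 = good, 0 = bad

lda = LinearDiscriminantAnalysis().fit(X, y)
print("LDA discriminant weights w_i:", lda.coef_.ravel())

logit = LogisticRegression(max_iter=1000).fit(X, y)
print("intercept w_0:", logit.intercept_[0])
print("weights w_i:  ", logit.coef_.ravel())

# P(y=1|x) of Equation (2) for a new applicant at the origin
print("P(good | x_new):", logit.predict_proba(np.zeros((1, 4)))[0, 1])
```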

2.2 Support Vector Machines Models

A simple description of the SVM algorithm is provided as follows. Given a training set $D = \{x_i, y_i\}_{i=1}^{N}$ with input vectors $x_i = (x_i^{(1)}, \ldots, x_i^{(n)})^T \in \mathbb{R}^n$ and target labels $y_i \in \{-1, +1\}$, the support vector machine (SVM) classifier, according to Vapnik's original formulation, satisfies the following conditions:

$$\begin{cases} w^T \varphi(x_i) + b \ge +1, & \text{if } y_i = +1 \\ w^T \varphi(x_i) + b \le -1, & \text{if } y_i = -1 \end{cases} \qquad (3)$$

which is equivalent to

$$y_i [w^T \varphi(x_i) + b] \ge 1, \quad i = 1, \ldots, N \qquad (4)$$

where $w$ represents the weight vector and $b$ the bias. The nonlinear function $\varphi(\cdot): \mathbb{R}^n \to \mathbb{R}^{n_k}$ maps the input or measurement space to a high-dimensional, and possibly infinite-dimensional, feature space. Equation (4) then comes down to the construction of two parallel bounding hyperplanes at opposite sides of the separating hyperplane $w^T \varphi(x) + b = 0$ in the feature space, with the margin width between both hyperplanes equal to $2/\|w\|$. In the primal weight space, the classifier then takes the decision function form

$$\operatorname{sgn}(w^T \varphi(x) + b) \qquad (5)$$

Most classification problems are, however, linearly non-separable. Therefore, it is usual to find the weight vector using slack variables $\xi_i$ to permit misclassification. One defines the primal optimization problem as

$$\min_{w, b, \xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i \qquad (6)$$

subject to

$$y_i (w^T \varphi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N \qquad (7)$$

where the $\xi_i$ are slack variables needed to allow misclassifications in the set of inequalities, and $C \in \mathbb{R}^+$ is a tuning hyperparameter weighting the importance of classification errors vis-à-vis the margin width. The solution of the primal problem is obtained after constructing the Lagrangian. From the conditions of optimality, one obtains a quadratic programming (QP) problem with Lagrange multipliers $\alpha_i$. A multiplier $\alpha_i$ exists for each training data instance, and data instances corresponding to non-zero $\alpha_i$ are called support vectors.

On the other hand, the above primal problem can be converted into the following dual problem with objective function (8) and constraints (9). Since the decision variables are the Lagrange multipliers themselves, it is easier to interpret the results of this dual problem than those of the primal one:

$$\max_{\alpha} \ -\frac{1}{2} \alpha^T Q \alpha + e^T \alpha \qquad (8)$$

subject to

$$0 \le \alpha_i \le C, \quad i = 1, \ldots, N, \qquad y^T \alpha = 0 \qquad (9)$$

In the dual problem above, $e = (1, 1, \ldots, 1)^T \in \mathbb{R}^N$, $Q$ is an $N \times N$ positive semi-definite matrix with $Q_{ij} = y_i y_j K(x_i, x_j)$, and $K(x_i, x_j) \equiv \varphi(x_i)^T \varphi(x_j)$ is the kernel. Here, the training vectors $x_i$ are mapped into a higher (maybe infinite) dimensional space by the function $\varphi$. As is typical for SVMs, we never explicitly calculate $w$ or $\varphi(x)$. This is made possible by Mercer's condition, which relates the mapping function $\varphi(x)$ to the kernel function $K(\cdot,\cdot)$ as follows:

$$K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j) \qquad (10)$$

For the kernel function $K(\cdot,\cdot)$, one typically has several design choices, such as the linear kernel $K(x_i, x_j) = x_i^T x_j$; the polynomial kernel of degree $d$, $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$, $\gamma > 0$; the radial basis function (RBF) kernel $K(x_i, x_j) = \exp\{-\gamma \|x_i - x_j\|^2\}$, $\gamma > 0$; and the sigmoid kernel $K(x_i, x_j) = \tanh\{\gamma x_i^T x_j + r\}$, where $d, r \in \mathbb{N}$ and $\gamma \in \mathbb{R}^+$ are constants. One then constructs the final SVM classifier as

$$\operatorname{sgn}\Big(\sum_i \alpha_i y_i K(x, x_i) + b\Big) \qquad (11)$$

The details of the optimization are discussed in Vapnik (1999), Gunn (1998) and Cristianini and Shawe-Taylor (2000).
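The following minimal sketch, assuming scikit-learn (whose SVC class wraps the LIBSVM solver used later in this paper) and synthetic data, trains soft-margin SVM classifiers with the four kernel choices listed above; C corresponds to the tuning hyperparameter of Equation (6), and gamma, degree and coef0 to the kernel constants $\gamma$, $d$ and $r$:

```python
# Sketch: soft-margin SVMs with the linear, polynomial, RBF and sigmoid kernels.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = np.where(X[:, 0] * X[:, 1] + 0.5 * X[:, 2] > 0, 1, -1)   # labels in {-1, +1}

kernels = {
    "linear":  dict(kernel="linear"),
    "poly":    dict(kernel="poly", degree=3, gamma=1.0, coef0=1.0),  # (g*x'x + r)^d
    "rbf":     dict(kernel="rbf", gamma=0.5),                 # exp(-g*||xi - xj||^2)
    "sigmoid": dict(kernel="sigmoid", gamma=0.1, coef0=0.0),  # tanh(g*x'x + r)
}
for name, kwargs in kernels.items():
    clf = SVC(C=1.0, **kwargs).fit(X, y)          # C weighs errors vs. margin width
    print(name, "| support vectors:", clf.n_support_.sum(),
          "| training accuracy:", round(clf.score(X, y), 3))
```

Only the data instances kept as support vectors (non-zero $\alpha_i$) enter the decision function (11), which is why their count is reported alongside the accuracy.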

2.3 Neural Networks Models (BPN, RBF and FAR)

Neural network modeling involves constructing computers with architectures and processing capabilities that mimic certain processing capabilities of the human brain. A neural network model is composed of neurons, the processing elements, which are inspired by biological nervous systems. Each neuron receives inputs and delivers a single output. Thus, a neural network model is a collection of neurons grouped in layers, such as the input layer, the hidden layer, and the output layer. Several hidden layers can be placed between the input and the output layers. We will discuss the BPN in more detail because it is the most popular NN for classification.

A simple back-propagation network (BPN) model consists of three layers: the input layer, the hidden layer, and the output layer. The input layer processes the input variables and provides the processed values to the hidden layer. The hidden layer further processes the intermediate values and transmits the processed values to the output layer. The output layer corresponds to the output variables of the back-propagation neural network model. A three-layer back-propagation neural network (BPN) is shown in Figure 1. For the details of neural networks, readers are referred to West (2000) and Bishop (1995).

Figure 1 A three-layer back-propagation neural network (input layer, hidden layer, output layer)


Radial Basis Function (RBF) networks


(Moody and Darken 1989) have a static
Gaussian function as the non-linearity for the
hidden layer processing elements. The Gaussian
function responds only to a small region of the
input space where the Gaussian is centered. The
key to a successful implementation of these
networks is to find suitable centers for the
Gaussian functions. This can be done with
supervised learning, but an unsupervised
approach usually produces better results. The
advantage of radial basis function networks is that they find the input-to-output map using
local approximators. Usually the supervised
segment is simply a linear combination of the
approximators. Since linear combiners have few
weights, these networks train extremely fast and
require fewer training samples.
The fuzzy ART (FAR) network (West 2000) is
a dynamic network which incorporates
computations from fuzzy set theory into the
adaptive resonance theory (ART). The typical
FAR network consists of two totally
interconnected layers of neurons, identified as
the complement layer and the category layer, in
addition to the input and output layers. When an
input vector is applied to the network, it creates
a short-term activation of the neurons in the
complement layer. This activity is transmitted
through the weight vector to neurons in the
category layer. Each neuron in the category layer
then calculates the inner product of the
respective weights and input values. These
calculated values are then resonated back to the
complement layer.
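As an illustration of the BPN just described, the sketch below trains a three-layer network by back-propagation with scikit-learn's MLPClassifier on synthetic data. It is a stand-in rather than the paper's own implementation, and since RBF and FAR networks have no off-the-shelf equivalent in this library, only the BPN analogue is shown:

```python
# Sketch: a three-layer BPN-style classifier (input, one hidden, output layer),
# trained by back-propagation with a logistic (sigmoid) transfer function.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 10))
y = (X[:, :3].sum(axis=1) > 0).astype(int)

bpn = MLPClassifier(hidden_layer_sizes=(8,),   # the paper varies 8 to 30 hidden nodes
                    activation="logistic",
                    solver="sgd", learning_rate_init=0.1,
                    max_iter=2000, random_state=0).fit(X, y)
print("training accuracy:", round(bpn.score(X, y), 3))
```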

2.4 Multivariate Adaptive Regression Splines

MARS was first proposed by Friedman (1991, 1995) as a flexible procedure that models relationships which are nearly additive or involve interactions among few variables. The modeling procedure is inspired by the recursive partitioning technique governing classification and regression trees (CART) (Breiman et al. 1984) and by generalized additive modeling, resulting in a model that is continuous with continuous derivatives. It excels at finding optimal variable transformations and interactions, and the complex data structure that often hides in high-dimensional data, and hence can effectively uncover important data patterns and relationships that are difficult, if not impossible, for other methods to reveal.
MARS essentially builds flexible models by
fitting piecewise linear regressions; that is, the
nonlinearity of a model is approximated through
the use of separate regression slopes in distinct
intervals of the predictor variable space.
Therefore the slope of the regression line is
allowed to change from one interval to the other
as the knot points are crossed. The variable
to use and the end points of the intervals for
each variable are found via a fast but intensive
search procedure. In addition to searching
variables one by one, MARS also searches for
interactions between variables, allowing any
degree of interaction to be considered.
The general MARS function can be represented using the following equation:

$$f(x) = a_0 + \sum_{m=1}^{M} a_m \prod_{k=1}^{K_m} \left[ s_{km} \left( x_{v(k,m)} - t_{km} \right) \right]_+ \qquad (12)$$

where $a_0$ and $a_m$ are parameters, $M$ is the number of basis functions, $K_m$ is the number of knots, $s_{km}$ takes on a value of either $+1$ or $-1$ and indicates the right/left sense of the associated step function, $v(k,m)$ is the label of the independent variable, and $t_{km}$ indicates the knot location.
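To make the basis functions of Equation (12) concrete, the short NumPy sketch below evaluates the one-sided hinge terms $[s(x - t)]_+$ and a two-hinge interaction product; the knots and signs are illustrative values, not ones fitted by the MARS search:

```python
# Sketch: MARS hinge (truncated linear) basis terms from Equation (12).
import numpy as np

def hinge(x, t, s):
    """max(0, s*(x - t)) with sign s in {+1, -1} and knot location t."""
    return np.maximum(0.0, s * (x - t))

x = np.linspace(0.0, 10.0, 6)
print(hinge(x, t=4.0, s=+1))     # active to the right of the knot
print(hinge(x, t=4.0, s=-1))     # mirror image, active to the left

# The m-th basis function is the product of its K_m hinges, e.g. a
# two-variable interaction term evaluated at x1 = 3, x2 = 6:
bf = hinge(np.array([3.0]), 2.0, +1) * hinge(np.array([6.0]), 5.0, +1)
print("interaction basis value:", bf[0])
```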
The optimal MARS model is selected in a two-stage process. Firstly, MARS constructs a very large number of basis functions to overfit the data initially, where variables are allowed to enter as continuous, categorical, or ordinal (the formal mechanism by which variable intervals are defined), and they can interact with each other or be restricted to enter only as additive components. In the second stage, basis functions are deleted in order of least contribution using the generalized cross-validation (GCV) criterion. A measure of variable importance can be assessed by observing the decrease in the calculated GCV values when a variable is removed from the model. The GCV can be expressed as follows:

$$\mathrm{LOF}(\hat{f}_M) = \mathrm{GCV}(M) = \frac{\frac{1}{N} \sum_{i=1}^{N} \left[ y_i - \hat{f}_M(x_i) \right]^2}{\left[ 1 - \frac{C(M)}{N} \right]^2} \qquad (13)$$

where there are $N$ observations and $C(M)$ is the cost-penalty measure of a model containing $M$ basis functions (the numerator therefore measures the lack of fit of the $M$ basis function model $\hat{f}_M(x_i)$, and the denominator denotes the penalty for model complexity $C(M)$). Missing values can also be handled in MARS by using dummy variables indicating the presence of the missing values. By allowing for any arbitrary shape for the function and interactions, and by using the above-mentioned two-stage model building procedure, MARS is capable of reliably tracking the very complex data structures that often hide in high-dimensional data. Please refer to Friedman (1991, 1995) for more details regarding the model building process.
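A minimal sketch of the GCV criterion of Equation (13) follows; the responses, fitted values and complexity penalty C(M) are invented for illustration, since in practice these quantities come out of the MARS backward-deletion pass:

```python
# Sketch: the GCV lack-of-fit score of Equation (13) for a candidate model.
import numpy as np

def gcv(y, y_hat, c_m):
    """Mean squared residual divided by the complexity penalty (1 - C(M)/N)^2."""
    n = len(y)
    return np.mean((y - y_hat) ** 2) / (1.0 - c_m / n) ** 2

y     = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.8, 0.7, 0.1, 0.6])
# Backward deletion keeps the subset of basis functions minimizing GCV;
# a larger C(M) penalizes model complexity more heavily.
for c_m in (2.0, 4.0):
    print(f"C(M) = {c_m}: GCV = {gcv(y, y_hat, c_m):.4f}")
```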

2.5 k-Nearest-Neighbor Classifiers and the CART Model

k-Nearest-neighbor classifiers (KNN) (Henley and Hand 1996) classify a data instance by considering only the $k$ most similar data instances in the training set. The class label is then assigned according to the class of the majority of the $k$ nearest neighbors. Ties can be avoided by choosing $k$ odd. One commonly opts for the Euclidean distance as the similarity measure:

$$d(x_i, x_j) = \|x_i - x_j\| = \left[ (x_i - x_j)^T (x_i - x_j) \right]^{1/2} \qquad (14)$$

where $x_i, x_j \in \mathbb{R}^n$ are the input vectors of data instances $i$ and $j$, respectively. Note that more advanced distance measures have also been proposed in the literature.
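A minimal KNN sketch with the Euclidean metric of Equation (14) is given below; the data are synthetic and $k = 7$ is chosen odd, as noted above, to avoid ties:

```python
# Sketch: k-nearest-neighbor credit scoring with Euclidean distance.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # 1 = good, 0 = bad (assumed coding)

knn = KNeighborsClassifier(n_neighbors=7, metric="euclidean").fit(X, y)
print("class of a new applicant:", knn.predict(np.zeros((1, 6)))[0])
```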
Classification and regression trees (CART), a statistical procedure introduced by Breiman et al. (1984), is primarily used as a classification tool, where the objective is to classify an object into two or more populations. As the name suggests, CART is a single procedure that can be used to analyze either categorical or continuous data using the same technology. The methodology outlined in Breiman et al. can be summarized in three stages. The first stage involves growing the tree using a recursive partitioning technique to select variables and split points according to a splitting criterion. Several criteria are available for determining the splits, including Gini, twoing and ordered twoing; for a detailed description of these criteria, one can refer to Breiman et al. In addition to selecting the primary variables, surrogate variables, which are closely related to the original splits and may be used in classifying observations having missing values for the primary variables, can be identified and selected.

After a large tree is identified, the second stage of the CART methodology uses a pruning procedure that incorporates a minimal cost-complexity measure. The result of the pruning procedure is a nested subset of trees, starting from the largest tree grown and continuing the process until only one node of the tree remains. Cross-validation or a testing sample is then used to provide estimates of future classification errors for each subtree. The last stage of the methodology is to select the optimal tree, which corresponds to the tree yielding the lowest error rate on the cross-validation or testing set. Please refer to Breiman et al. (1984) and Steinberg and Colla (1997) for more details regarding the model building process of CART.
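The three stages can be sketched with scikit-learn's DecisionTreeClassifier (not Salford's CART implementation, so this is an analogy under assumed data): grow with the Gini criterion, derive the minimal cost-complexity pruning sequence, and select the subtree by 10-fold cross-validation:

```python
# Sketch: grow / prune / select for a CART-style tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 8))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Stages 1 + 2: the pruning path enumerates the nested subtrees of the large tree.
path = DecisionTreeClassifier(criterion="gini",
                              random_state=0).cost_complexity_pruning_path(X, y)

# Stage 3: cross-validate each subtree (indexed by its alpha) and keep the best.
best_alpha, best_score = 0.0, -np.inf
for alpha in np.unique(path.ccp_alphas):
    tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=alpha, random_state=0)
    score = cross_val_score(tree, X, y, cv=10).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score
print("chosen alpha:", round(best_alpha, 5), "| CV accuracy:", round(best_score, 3))
```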

3. Data Sets and Experimental Design

The German and Australian credit data sets are publicly available at the UCI repository (http://kdd.ics.uci.edu). Dr. Hans Hofmann of the University of Hamburg contributed the German credit scoring data. It consists of 700 examples of creditworthy applicants and 300 examples where credit should not be extended. For each applicant, 24 variables describe credit history, account balances, loan purpose, loan amount, employment status, personal information, age, housing, and job. The Australian credit scoring data set is similar but more balanced, with 307 and 383 examples of each outcome; it contains a mixture of six continuous and eight categorical variables. The third credit data set is from a major financial institution in the US and comprises 1225 applications, including 902 examples of creditworthy applicants and 323 examples of non-creditworthy applicants; it includes 14 attributes. To protect the confidentiality of these data, attribute names and values have been changed to symbolic data.
To minimize the impact of data dependency and improve the reliability of the resulting estimates, 10-fold cross-validation is used to create random partitions of the raw data sets. Each of the 10 random partitions serves as an independent holdout test set for the credit scoring model trained with the remaining nine partitions. The training set is used to establish the credit scoring model's parameters, while the independent test sample is used to test the generalization capability of the model. The overall scoring accuracy reported is an average across all ten test set partitions.
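This protocol can be sketched as follows, assuming scikit-learn and a placeholder classifier; the synthetic class proportions mimic the German data's roughly 700/300 split:

```python
# Sketch: 10-fold cross-validation; each fold serves once as the holdout test set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 24))                # e.g. the 24 German credit attributes
y = (rng.random(1000) < 0.7).astype(int)       # about 700 good / 300 bad

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("overall accuracy (mean over 10 holdout folds):", round(scores.mean(), 3))
```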
The topic of choosing the appropriate class distribution for classifier learning has received much attention in the literature. In this study, we dealt with this problem by using a variety of class distributions, ranging from 55.5/44.5 for the Australian credit data set to 73.6/26.4 for the American credit data set. The LDA, Logistic, CART, KNN and MARS classifiers require no parameter tuning. For the SVM classifiers, we used the LIBSVM toolbox 2.8 and adopted a grid search mechanism to tune the parameters. For the BPN classifiers, we adopted the standard three-layer architecture. The numbers of input and output nodes were the numbers of input and output variables, respectively, and the hidden layer and output layer nodes use the sigmoid transfer function. Since the network with optimum generalization performance on the test set is still difficult to guarantee, the number of hidden nodes for the three data sets was varied between 8 and 30, and the network with the best training set performance was selected for test set evaluation. The NN analyses were conducted using the Neural Networks toolbox 4.0 (http://www.mathworks.com). CART 4.0 and MARS 2.0 evaluation versions (http://www.salford-systems.com), provided by Salford Systems, were used in building the CART and MARS credit scoring models. The SVM analyses were conducted using the LIBSVM toolbox 2.8 (Chang and Lin 2001).
4. Results and Discussion

The results for each credit scoring model are reported in Table 1 for the German, Australian and American credit data. These results are averages of the accuracy determined for each of the 10 independent test data set partitions used in the cross-validation methodology. Since the training of any neural network model is a stochastic process, the network accuracy determined for each data set partition is itself an average of 10 repetitions.
Table 1 10-fold cross-validation test set classification accuracy on the credit scoring data sets

          German credit data (%)    Australian credit data (%)   American credit data (%)
          Goods   Bads   Overall    Goods   Bads   Overall       Goods   Bads   Overall
RBF       86.5    48.0   74.6       86.8    87.2   87.1          88.5    24.2   71.3
BPN       86.4    42.5   73.3       84.6    86.7   85.8          88.1    22.9   70.9
FAR       60.0    51.2   57.3       74.4    76.2   75.4          N/A     N/A    N/A
LDA       72.3    73.3   72.6       81.0    92.2   85.9          65.4    56.0   62.9
LOGIT     88.1    48.7   76.3       85.9    89.0   87.2          95.9    11.2   73.5
KNN       77.5    44.7   67.6       84.7    86.7   85.8          78.4    30.1   66.1
Kernel    84.5    37.0   70.2       81.4    84.8   84.4          N/A     N/A    N/A
CART      71.2    69.4   70.5       79.9    92.5   85.5          59.3    59.4   59.3
MARS      89.0    66.0   74.9       86.3    88.3   87.4          89.7    20.2   71.4
Lin-SVM   88.9    49.1   77.0       79.9    92.5   85.5          88.9    22.0   71.3
Pol-SVM   88.5    48.6   76.5       83.8    88.6   85.5          89.9    18.3   71.0
Rbf-SVM   88.7    49.7   77.1       80.5    93.0   85.8          89.4    22.6   71.8
Sig-SVM   89.0    50.0   77.2       80.5    92.0   85.6          89.6    21.1   71.5

Neural network results are averages of 10 repetitions. N/A: not tested.

It is evident from Table 1 that Sig-SVM has the highest overall credit scoring accuracy of 77.2% for the German credit data, while the Lin-SVM, Pol-SVM and Rbf-SVM have credit scoring accuracies of 76.5% to 77.1%. Closely following the SVMs is logistic regression with an overall accuracy of 76.3%, and MARS with 74.9%. Linear discriminant analysis has an accuracy of 72.6%, which is 3.7% less accurate than logistic regression. A strength of the linear discriminant model for this data, however, is a significantly higher accuracy than any other model in identifying bad credit risks. This is likely due to the assumption of equal prior probabilities used to develop the linear discriminant model. It is also interesting to note that the most commonly used neural network architecture, BPN with an accuracy of 73.3%, is comparable to linear discriminant analysis with an accuracy of 72.6%. The K-NN, kernel density and CART overall accuracy levels are 67.6%, 70.2% and 70.5%, respectively. The least

accurate method for the German credit scoring


data is the FAR neural networks model at
57.3%.
For the Australian credit data, MARS has the top overall credit scoring accuracy of 87.4%, followed closely by logistic regression (87.2%) and NN (RBF) (87.1%). The BPN (85.8%) and LDA (85.9%) are again comparable from an overall accuracy consideration. The KNN, CART, BPN, LDA and SVM models have overall credit scoring errors that are more than 0.01 greater than those of the MARS, logistic regression and RBF neural models. The FAR neural network and kernel density models have overall accuracies of 75.4% and 84.4%, respectively.
For the American credit data, logistic regression has the top overall credit scoring accuracy of 73.5%, followed closely by the Rbf-SVM (71.8%) and Sig-SVM (71.5%). The Lin-SVM and Pol-SVM are grouped at accuracy levels from 71.0% to 71.3%. The BPN, K-NN and MARS overall accuracy levels are 70.9%, 66.1% and 71.4%, respectively. The least accurate method for the American credit scoring data is the CART model (kernel density and FAR are not tested for the American credit data). However, we note that CART has the lowest error rate (40.6%) of all models in identifying bad credit risks, followed closely by the LDA (44.0%).
To further strengthen these conclusions, we test for statistically significant differences between credit scoring models. We have used a special notational convention whereby the best three overall accuracies are underlined and denoted in bold face for each data set. For cross-validation studies of supervised learning algorithms, Dietterich (1998) recommends McNemar's test, which is used in this paper to establish statistically significant differences between credit scoring models.

McNemar's test is a chi-square statistic calculated from a $2 \times 2$ contingency table. The diagonal elements of the contingency table are the counts of the number of credit applications misclassified by both models, $n_{00}$, and the number correctly classified by both models, $n_{11}$. The off-diagonal elements are the counts of the numbers classified incorrectly by Model A and correctly by Model B, $n_{01}$, and conversely the numbers classified incorrectly by Model B and correctly by Model A, $n_{10}$. Results of McNemar's test with $p = 0.05$ are given in Table 2. All credit scoring models are tested for significant differences with the most accurate model in the data set. A model whose overall credit scoring accuracy is not significantly different from that of the most accurate model is labeled as a superior model; those that are significantly less accurate are labeled as inferior models. It is evident from Table 2 that the SVM, logistic regression, NN (RBF) and MARS models are superior ones for all three credit scoring data sets, while the LDA, KNN and CART models are superior for only the Australian credit data.
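For reference, a small sketch of McNemar's test on two models' paired predictions follows (with the common continuity correction); the prediction vectors are invented for illustration:

```python
# Sketch: McNemar's test from the off-diagonal counts n01 and n10 described above.
import numpy as np
from scipy.stats import chi2

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 1, 0, 1] * 30)
pred_a = np.array([1, 0, 0, 1, 0, 1, 1, 1, 0, 1] * 30)    # Model A
pred_b = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 1] * 30)    # Model B

n01 = np.sum((pred_a != y_true) & (pred_b == y_true))     # A wrong, B right
n10 = np.sum((pred_a == y_true) & (pred_b != y_true))     # A right, B wrong
stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)            # chi-square, 1 d.o.f.
p_value = chi2.sf(stat, df=1)
print(f"n01 = {n01}, n10 = {n10}, statistic = {stat:.3f}, p = {p_value:.4f}")
# p < 0.05 indicates a statistically significant accuracy difference.
```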

4.1 Cost of Credit Scoring Errors


This subsection considers the costs of credit
scoring errors and their impact on model
selection. It is evident that the individual group
(bad or good) accuracy of the credit scoring
model can vary widely. For the German credit
data, all models except LDA are much less
accurate at classifying bad credit risks than good
credit risks. Most pronounced is the accuracy of
logistic regression with an error of 0.1186 for
good credit and 0.5113 for bad credit. In credit

Table 2 Statistically significant differences, credit scoring models

                 German credit data      Australian credit data   American credit data
Superior models  RBF, MARS,              RBF, SVM,                RBF, BPN,
                 Logistic regression,    Logistic regression,     Logistic regression,
                 SVM                     MARS, LDA, KNN, CART     SVM, MARS
Inferior models  FAR, BPN, LDA, KNN,     FAR, Kernel density      LDA, KNN, CART
                 Kernel density, CART

Statistical significance established with McNemar's test, p = 0.05; kernel density and FAR are not tested for the American credit data.

scoring applications, it is generally believed that the cost of granting credit to a bad risk candidate, denoted by $C_{12}$, is significantly greater than the cost of denying credit to a good risk candidate, denoted by $C_{21}$. In this situation it is important to rate the credit scoring models with the cost function defined in Equation (15) rather than relying on the overall classification accuracy. To illustrate the cost function, the relative costs of misclassification suggested by Dr. Hofmann when he compiled the German credit data are used: $C_{12}$ is 5 and $C_{21}$ is 1. Evaluation of the cost function also requires estimates of the prior probabilities of good credit, $\pi_1$, and bad credit, $\pi_2$, in the application pool of the credit scoring model. These prior probabilities are estimated from reported default rates. For the year 1997, 6.48% of a total credit debt of $560 billion was charged off (West 2000), while Jensen reports a charge-off rate of 11.2% for the credit applications he investigated (Frydman et al. 1985). The error rate for the bad credit group of the German credit data (which averages about 0.45) is used to establish a low value for $\pi_2$ of 0.144 (0.0648/0.45) and a high value of 0.249 (0.112/0.45). The ratio $n_2/N_2$ in Equation (15) measures the false positive rate, the proportion of bad credit risks that are granted credit, while the ratio $n_1/N_1$ measures the false negative rate, or good credit risks denied credit by the model:

$$\text{Cost} = C_{12}\, \pi_2\, \frac{n_2}{N_2} + C_{21}\, \pi_1\, \frac{n_1}{N_1} \qquad (15)$$
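The cost function is straightforward to evaluate; the sketch below implements Equation (15) and, fed with the logistic-regression error rates for the German data quoted above (0.5113 for bad credit, 0.1186 for good credit), reproduces the corresponding Table 3 entries up to rounding:

```python
# Sketch: expected misclassification cost of Equation (15).
def scoring_cost(fp_rate, fn_rate, pi_bad, c12=5.0, c21=1.0):
    """C12*pi2*(n2/N2) + C21*pi1*(n1/N1); C12=5, C21=1 per Dr. Hofmann."""
    return c12 * pi_bad * fp_rate + c21 * (1.0 - pi_bad) * fn_rate

for pi_bad in (0.144, 0.249):
    print(f"pi2 = {pi_bad}: cost = {scoring_cost(0.5113, 0.1186, pi_bad):.3f}")
# -> 0.470 and 0.726, close to the Logistic row of Table 3 (0.471 / 0.728).
```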
Under these assumptions, the credit scoring cost is reported for each model in Table 3. For the German credit data, the MARS (0.413) model is now slightly better than the LDA (0.429) at the prior probability level of 14.4% bad credit. At the higher level of 24.9% bad credit, the LDA is clearly the best model from an overall cost perspective with a score of 0.540. Closely following LDA is MARS with an overall cost of 0.571, and CART with 0.597. For the Australian credit data, the costs of all models are nearly identical at both levels of $\pi_2$. The Logistic (0.200) model is now slightly better than the MARS (0.202) at the prior probability level of 14.4% bad credit. At the higher level of 24.9% bad credit, the Rbf-SVM is clearly the best model from an overall cost perspective with a score of 0.234. Closely following Rbf-SVM are LDA and Logistic with overall costs of 0.239 and 0.244, respectively. For the American credit data, the LDA (0.613) model is now slightly better than

Table 3 Credit scoring models' misclassification costs

          German credit data       Australian credit data    American credit data
          π2=0.144   π2=0.249      π2=0.144   π2=0.249       π2=0.144   π2=0.249
BPN       0.530      0.818         0.228      0.281          0.657      1.049
RBF       0.497      0.761         0.205      0.258          0.644      1.030
FAR       0.694      0.908         0.391      0.490          N/A        N/A
LDA       0.429      0.540         0.219      0.239          0.613      0.808
Logistic  0.471      0.728         0.200      0.243          0.673      1.140
KNN       0.592      0.858         0.227      0.281          0.688      1.033
Kernel    0.587      0.901         0.268      0.329          N/A        N/A
CART      0.467      0.597         0.226      0.244          0.641      0.811
Lin-SVM   0.462      0.717         0.226      0.244          0.657      1.055
Pol-SVM   0.469      0.726         0.221      0.264          0.675      1.093
Rbf-SVM   0.459      0.711         0.217      0.234          0.648      1.043
Sig-SVM   0.454      0.705         0.225      0.246          0.657      1.060
MARS      0.413      0.571         0.202      0.249          0.663      1.071

N/A: not tested.

Table 4 5-fold cross-validation test set classification accuracy on the balanced credit scoring data sets under the new strategy

          German credit data (%)    Australian credit data (%)   American credit data (%)
          Goods   Bads   Overall    Goods   Bads   Overall       Goods   Bads   Overall
RBF       67.2    73.7   70.4       85.7    89.3   87.5          65.2    57.3   61.3
BPN       67.0    70.3   68.7       85.2    87.6   86.4          64.7    55.7   60.2
LDA       69.0    73.0   71.0       80.3    92.5   86.4          59.5    55.5   57.5
LOGIT     74.3    74.0   74.2       84.0    92.3   88.2          64.3    62.7   63.5
CART      68.0    69.7   68.8       80.7    93.3   87.0          66.7    54.7   61.3
MARS      66.0    79.0   72.5       84.0    91.0   87.5          66.3    50.7   58.5
Rbf-SVM   69.1    73.5   71.3       81.0    93.3   87.2          63.3    59.0   61.2

Neural network results are averages of 10 repetitions.

the CART (0.641) at the prior probability level of 14.4% bad credit, followed by NN (RBF) with a score of 0.644. At the higher level of 24.9% bad credit, the LDA is clearly the best model from an overall cost perspective with a score of 0.808. Closely following LDA is CART with an overall cost of 0.811, and RBF with 1.030.
From Table 2, the relative group classification accuracies of the neural network, SVM, logistic regression and MARS models are influenced by the unbalanced design of the training data. To improve their accuracy with bad credit risks, a new strategy is tested for the above models' training sets. The strategy is to form new data sets from a balanced group of 300 good credit examples and 300 bad credit examples for the different models. Each of these models is tested with 5-fold cross-validation. The accuracy results of the new strategy are summarized in Table 4. The new strategy yields the greatest improvement in the error for bad credit identification, with a reduction of approximately 20% for the German credit data and 30% for the American credit data. The overall error rate under the new strategy increases by 5% and 10% for the two data sets, respectively. However, the overall error rate under the new strategy decreases by 0.5% to 2% for the Australian credit data.
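The resampling step of this strategy can be sketched as follows; the function and array names are hypothetical, and the class coding (1 = good, 0 = bad) is an assumption of the example:

```python
# Sketch: form a balanced training set of 300 good and 300 bad examples.
import numpy as np

def balanced_sample(X, y, per_class=300, seed=0):
    """Draw per_class examples of each credit class without replacement."""
    rng = np.random.default_rng(seed)
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=per_class, replace=False)
        for c in (0, 1)
    ])
    rng.shuffle(idx)
    return X[idx], y[idx]

# Usage on the German data (700 good / 300 bad): X_bal, y_bal = balanced_sample(X, y)
```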

4.2 A Comparison of the Explanatory Ability of Credit Scoring Models

This subsection considers the explanatory ability of the credit scoring models. Good explanatory ability of a credit scoring model is very important in explaining the rationale for the decision to deny credit. Neural network and SVM models both cannot explain how and why they identified a potential bad loan application. LDA and logistic regression models are better than SVM and neural networks in this respect. KNN and kernel density are inferior models regarding explanatory ability, while CART and MARS have the best explanatory ability. More detailed analyses of the explanatory ability of three of the methods (neural networks, CART and MARS) for the German credit data are given below.
4.2.1 Explanatory Ability of the Neural Networks Model for German Data Credit Scoring

A key deficiency of any neural network model for credit scoring applications is the difficulty of explaining the rationale for the decision to deny credit. Neural networks are usually thought of as a black-box technology devoid of any logic or rule-based explanations for the output mapping. This is a particularly sensitive issue in light of recent federal legislation regarding discrimination in lending practices. To address this problem, West (2000) developed explanatory insights for the neural network trained on the German credit data. This is accomplished by clamping 23 of the 24 input values, varying the remaining input by ±5%, and measuring the magnitude of the impact on the two output neurons. The clamping process is repeated until all network inputs have been varied. A weight can then be determined for each input that estimates its relative power in determining the resultant credit decision. Please refer to West (2000) for more details and results regarding the model building process.
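A sketch of this clamping procedure for a generic probabilistic classifier is given below; it follows the spirit of West's (2000) method, but the function itself and its reliance on a predict_proba interface are assumptions of the illustration:

```python
# Sketch: input-perturbation sensitivity weights for a trained classifier.
import numpy as np

def input_sensitivity(model, X, delta=0.05):
    """Mean |change in P(good)| when each input is varied by +/- delta,
    with all other inputs clamped at their observed values."""
    base = model.predict_proba(X)[:, 1]
    weights = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for sign in (+1.0, -1.0):
            Xp = X.copy()
            Xp[:, j] *= (1.0 + sign * delta)       # perturb only column j
            weights[j] += np.mean(np.abs(model.predict_proba(Xp)[:, 1] - base))
    return weights / weights.sum()                 # relative power of each input

# Larger weights flag the attributes that drive the network's credit decision.
```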
4.2.2 Explanatory Ability of the CART Model for German Data Credit Scoring

Figure 2 depicts the CART tree obtained on the testing sample with the popular 1-SE rule in the tree pruning procedure. It is observed from Figure 2 that A1, A3, A5 and A2 play important roles in the rule induction (Ai denotes the ith attribute for i = 1, ..., n, and has the same meaning when it appears later). It can also be observed from Figure 2 that if an observation has A1 between 1.5 and 2.5, A2 ≤ 22.5 and A5 > 3.5, it falls into terminal node 11, whose classified class is class 1 (good customer). The rules and terminal nodes derived from the built tree, unlike those of other classification techniques, are very easy to interpret, and hence marketing professionals can use the built rules in designing proper managerial decisions. Furthermore, we conclude that CART is an effective and powerful management tool which allows us to build advanced and user-friendly decision-support systems for credit scoring management.

4.2.3 Explanatory Ability of the MARS Model for Credit Scoring

In order to demonstrate the explanatory ability of MARS scoring models, the German credit data will be used as an illustrative example. The obtained basis functions and variable selection results of the illustrative example are summarized in Table 5.

Figure 2 The tree of the CART credit scoring model (root split A1 ≤ 2.5, with further splits on A2, A3, A4, A5, A10 and A18, and 12 terminal nodes)

Table 5 Variable selection results and basis functions of the MARS credit scoring model

Variable name   Relative importance (%)    Equation name   Equation
A1              100.00                     BF1             max(0, A1 - 1.000)
A2              57.31                      BF2             max(0, A2 - 4.000)
A3              51.82                      BF3             max(0, A3 - 0.180272E-06)
A5              40.44                      BF4             max(0, A5 - 1.000)
A4              36.67                      BF5             max(0, A4 - 36.000)
A16             32.33                      BF6             max(0, 36.000 - A4)
A9              30.43                      BF7             max(0, A16 + 0.180632E-07)
A17             28.86                      BF8             max(0, A15 - 1.000)
A15             27.52                      BF9             max(0, A20 + 0.182414E-07)
A20             27.34                      BF10            max(0, A17 - 0.376854E-08)
A6              29.1                       BF12            max(0, 4.000 - A6)
A8              16.95                      BF13            max(0, A9 - 1.000)
                                           BF14            max(0, A8 - 2.000)
                                           BF15            max(0, 2.000 - A8)

MARS prediction function: Y = 1.358 - 0.096*BF1 + 0.007*BF2 - 0.058*BF3 - 0.032*BF4 + 0.002*BF5 + 0.005*BF6 + 0.098*BF7 - 0.192*BF8 + 0.094*BF9 - 0.129*BF10 + 0.040*BF12 + 0.040*BF13 - 0.026*BF14 - 0.095*BF15

In the MARS credit scoring model, Y = 0 (1) is defined to be a good (bad) credit customer.

It is observed that A1, A2, A3, A4, A5, A8, A9, A15, A16, A17 and A20 play important roles in determining the MARS credit scoring model. Besides, according to the obtained basis functions and the MARS prediction function, it can be observed that high values of A2, A9, A16 and A20 tend to indicate a bad credit customer, while high values of A1, A3, A5, A15 and A17 are likely to indicate a good credit customer. These conclusions from the basis functions and the MARS prediction function have important managerial implications, since they can help managers and professionals design appropriate loan policies for acquiring good credit customers.

5. Conclusions and Areas of Future Research

Credit scoring has become more and more important as the competition between financial institutions has grown increasingly fierce. More and more companies are seeking better strategies with the help of credit scoring models, and various modeling techniques have been developed for different credit evaluation processes in pursuit of better credit approval schemes. Many modeling alternatives, such as traditional statistical methods, non-parametric methods and artificial intelligence techniques, have therefore been developed in order to handle credit scoring tasks successfully. In this paper, we have studied the performance of various classification techniques for credit scoring. The experiments were conducted on three real-life credit scoring data sets. The classification performance was assessed by the percentage of correctly classified observations and the misclassification cost.

It is found that each technique shows characteristics which may be interesting in the context of a particular data set. Firstly, the Logistic, MARS, SVM and ANN (BPN and RBF) classifiers yield very good performance in terms of the classification ratio. However, it has to be noted that LDA and CART were significantly more accurate than any other model in identifying bad credit risks for the German and American credit scoring data sets. Secondly, the experiments clearly indicated that many classification techniques yield performances which are quite competitive with each other; only a few classification techniques (e.g. FAR and kernel density) were clearly inferior to the others. Besides, CART and MARS not only have lower Type II errors, which are associated with high misclassification costs, but also offer better evaluation reasoning and can help to structure the understanding of a prediction.

Starting from the findings of this study, several interesting topics for future research can be identified. One interesting topic is to collect more important variables to improve credit scoring accuracy. Another promising avenue for future research is to investigate the power of classifier ensembles, where multiple classification algorithms are combined.

References
[1] Altman, E.I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. Journal of Finance, 23: 589-609
[2] Bishop, C.M. (1995). Neural Networks for Pattern Recognition. New York: Oxford University Press
[3] Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). Classification and Regression Trees. Pacific Grove, CA: Wadsworth
[4] Chen, M.S., Han, J. & Yu, P.S. (1996). Data mining: an overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6): 866-883
[5] Chang, C.-C. & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[6] Curt, H. (1995). The devil's in the details: techniques, tools, and applications for database mining and knowledge discovery, Part 1. Intelligent Software Strategies, 6: 1-15
[7] Cristianini, N. & Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge: Cambridge University Press
[8] Desai, V.S., Crook, J.N. & Overstreet, G.A. (1996). A comparison of neural networks and linear scoring models in the credit union environment. European Journal of Operational Research, 95(1): 24-37
[9] Dietterich, T.G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10: 1895-1923
[10] Friedman, J.H. (1991). Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19: 1-141
[11] Friedman, J.H. & Roosen, C.B. (1995). An introduction to multivariate adaptive regression splines. Statistical Methods in Medical Research, 4: 197-217
[12] Frydman, H.E., Altman, E.I. & Kao, D. (1985). Introducing recursive partitioning for financial classification: the case of financial distress. Journal of Finance, 40(1): 53-65
[13] Gunn, S.R. (1998). Support Vector Machines for Classification and Regression. Technical Report, University of Southampton
[14] Henley, W.E. (1995). Statistical aspects of credit scoring. Dissertation, The Open University, Milton Keynes, UK
[15] Henley, W.E. & Hand, D.J. (1996). A k-nearest neighbour classifier for assessing consumer credit risk. Statistician, 44: 77-95
[16] Hosmer, D.W. & Lemeshow, S. (2000). Applied Logistic Regression. New York: John Wiley & Sons
[17] Jo, H., Han, I. & Lee, H. (1997). Bankruptcy prediction using case-based reasoning, neural networks, and discriminant analysis. Expert Systems with Applications, 13: 97-108
[18] Lee, T.S. & Chen, I.F. (2005). A two-stage hybrid credit scoring model using artificial neural networks and multivariate adaptive regression splines. Expert Systems with Applications, 28: 743-752
[19] Mester, L.J. (1997). What's the point of credit scoring? Business Review, Federal Reserve Bank of Philadelphia, Sept/Oct: 3-16
[20] Moody, J. & Darken, C.J. (1989). Fast learning in networks of locally tuned processing units. Neural Computation, 1: 281-294
[21] Steinberg, D. & Colla, P. (1997). Classification and Regression Trees. San Diego, CA: Salford Systems
[22] Thomas, L.C. (2000). A survey of credit and behavioural scoring: forecasting financial risks of lending to consumers. International Journal of Forecasting, 16: 149-172
[23] Tam, K.Y. & Kiang, M.Y. (1992). Managerial applications of neural networks: the case of bank failure predictions. Management Science, 38(7): 926-947
[24] Vapnik, V.N. (1999). Statistical Learning Theory. New York: Springer-Verlag
[25] West, D. (2000). Neural network credit scoring models. Computers & Operations Research, 27: 1131-1152
Wenbing Xiao is a doctoral student at the Institute of Control Science & Systems Engineering at Huazhong University of Science and Technology, China. His research interests include financial forecasting and modeling, decision support systems, data mining and machine learning. He received the M.S. degree in Mathematics and Computer Science from Hunan Normal University (2004).

Qian Zhao is a doctoral student in the School of Economics at Renmin University of China. She received her M.S. in mathematics from Hunan Normal University in 2004. Her current research interests include financial forecasting and modeling, data mining and energy economics. She has published in Advances in Mathematics and the Chinese Journal of Management Science.

Qi Fei is a professor at the Institute of Control Science & Systems Engineering at Huazhong University of Science and Technology, China. His research interests include complexity theory, decision support systems and decision analysis. He received the B.S. degree in Control Science and Engineering from Harbin Institute of Technology (1961).
