DOI: 10.1007/s11518-006-5023-5
Qian ZHAO 2
Qi FEI 3
Institute of Systems Engineering, Huazhong University of Science & Technology, Wuhan 430074, China
xiaowenbing11@163.com
2
Institute of Systems Engineering, Huazhong University of Science & Technology, Wuhan 430074, China
qfei@mail.hust.edu.cn
Abstract
Credit scoring has become a critical and challenging management science issue as the credit
industry has been facing stiffer competition in recent years. Many classification methods have been
suggested to tackle this problem in the literature. In this paper, we investigate the performance of
various credit scoring models and the corresponding credit risk cost for three real-life credit scoring
data sets. Besides the well-known classification algorithms (e.g. linear discriminant analysis, logistic
regression, neural networks and k-nearest neighbor), we also investigate the suitability and
performance of some recently proposed, advanced data mining techniques such as support vector
machines (SVMs), classification and regression tree (CART), and multivariate adaptive regression
splines (MARS). Performance is assessed by classification accuracy and by the cost of credit
scoring errors. The experimental results show that SVM, MARS, logistic regression and neural networks
perform very well. However, the explanatory capability of CART and MARS exceeds that of the
other methods.
Keywords: Data mining, credit scoring, classification and regression tree, support vector machines,
multivariate adaptive regression splines, credit-risk evaluation
1. Introduction
into
two
categories:
methodologies
This work was supported in part by National Science Foundation of China under Grant No. 70171015
and
A Comparative Study of Data Mining Methods in Consumer Loans Credit Scoring Management
decision
and
the
non-parametric
trees,
non-parametric
genetic
methods.
algorithms,
Among
methods,
and
artificial
today
been
solve
the
problems.
as
the
the
credit
problems
industry
involved
has
during
where
and
discusses
possible
future
research areas.
2. Literature Review
2.1 Linear Discriminant Analysis and Logistic Regression Models
Linear discriminant analysis involves the
is
the
discriminant
(1)
score,
$D = \{x_i, y_i\}_{i=1}^{N}$ with input vectors $x_i = (x_i^{(1)}, \ldots, x_i^{(n)})^T \in \mathbb{R}^n$ and target labels
according
to
Vapnik's
original
(3)
which is equivalent to
$$y_i [w^T \varphi(x_i) + b] \ge 1, \quad i = 1, \ldots, N \tag{4}$$
opposite
sides
of
separating
$0 \le \alpha_i \le C, \quad i = 1, \ldots, N$
$y^T \alpha = 0$
(5)
(8)
(9)
In the dual problem above, $e = (1, 1, \ldots, 1)^T \in \mathbb{R}^N$, $Q$ is an $N \times N$ positive
$\xi_i \ge 0, \quad i = 1, \ldots, N$
(6)
(7)
(maybe infinite) dimensional space by
10
$$\operatorname{sgn}\Big(\sum_{i} \alpha_i y_i K(x, x_i) + b\Big) \tag{11}$$
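The decision rule in (11) can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the Gaussian RBF kernel, the value of `gamma`, and the toy support vectors, labels, multipliers and bias are hypothetical and are not taken from the experiments in this paper.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian RBF kernel K(x, z) = exp(-gamma * ||x - z||^2).

    The kernel choice and gamma are assumptions made for illustration.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def svm_decision(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """Evaluate sgn(sum_i alpha_i * y_i * K(x, x_i) + b).

    Returns +1 or -1; a score of exactly zero is mapped to +1 by convention.
    """
    score = b
    for alpha, y, sv in zip(alphas, labels, support_vectors):
        score += alpha * y * kernel(x, sv)
    return 1 if score >= 0 else -1


# Hypothetical toy model: one support vector per class.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]
b = 0.0

print(svm_decision((1.9, 1.9), support_vectors, labels, alphas, b))
print(svm_decision((0.1, 0.1), support_vectors, labels, alphas, b))
```

Only the training points with nonzero multipliers contribute to the sum, so a fitted model needs to store just its support vectors.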
hidden
layer
further
processes
the
involves
processing
certain
neural
networks
capabilities
that
Input layer
model
mimic
Hidden layer
Output layer
is
inspired
by
the
recursive
structure
high-dimensional
that
often
data.
And
hides
hence
in
can
general
MARS
function
can
be
Km
m =1
k =1
location.
The optimal MARS model is selected in a
two-stage process. First, MARS constructs a
very large number of basis functions to overfit
the data initially. Variables are allowed to
enter as continuous, categorical, or ordinal (the
formal mechanism by which variable intervals
are defined), and they can interact with each
other or be restricted to enter only as additive
components. In the second stage, basis functions
are deleted in order of least contribution using
the generalized cross-validation (GCV) criterion.
A measure of variable importance can be
assessed by observing the decrease in the
calculated GCV values when a variable is
removed from the model. The GCV can be
expressed as follows:
$$LOF(\hat{f}_M) = GCV(M) = \frac{1}{N} \sum_{i=1}^{N} [y_i - \hat{f}_M(x_i)]^2 \Big/ \left[1 - \frac{C(M)}{N}\right]^2 \tag{13}$$
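As a minimal sketch of how the criterion in (13) is evaluated, the Python function below computes the GCV value from observed targets, fitted values and a complexity term. The function name and the argument `effective_params`, which stands in for C(M), are illustrative assumptions; how C(M) is obtained from the number of basis functions is not shown here.

```python
def gcv(y_true, y_pred, effective_params):
    """Generalized cross-validation: mean squared error / (1 - C(M)/N)^2.

    effective_params plays the role of C(M) and must satisfy 0 <= C(M) < N.
    """
    n = len(y_true)
    if n == 0 or len(y_pred) != n:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    if not 0 <= effective_params < n:
        raise ValueError("C(M) must lie in [0, N)")
    mse = sum((y - f) ** 2 for y, f in zip(y_true, y_pred)) / n
    return mse / (1.0 - effective_params / n) ** 2


# Hypothetical values: four observations, constant fitted value 0.5.
y_true = [1, 0, 1, 0]
y_pred = [0.5, 0.5, 0.5, 0.5]
print(gcv(y_true, y_pred, effective_params=2))
```

For a fixed residual sum of squares, a larger C(M) shrinks the denominator and raises the GCV value, which is how the criterion penalizes models with many basis functions during backward deletion.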
RBF
Mars,
Logistic regression
SVM
RBF, SVM
Logistic regression
MARS, LDA
KNN, CART
RBF
BPN
Logistic regression
SVM, MARS
Inferior Models
FAR, VPN
FAR
LDA
LDA, KNN
Kernel density
CART
Kernel density
KNN
CART
Statistical significance established with McNemar's test, p = 0.05; kernel density and FAR were not tested for
the American credit data.
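The pairwise comparison behind this note can be sketched in Python as follows. This is a hedged illustration, not the procedure actually run for the paper: it assumes the chi-square form of McNemar's test with continuity correction and one degree of freedom, and the function name and toy predictions are hypothetical.

```python
import math


def mcnemar_test(y_true, pred_a, pred_b):
    """McNemar's test on the disagreements of two classifiers.

    Returns (statistic, p_value) using the continuity-corrected chi-square
    statistic (|n_a - n_b| - 1)^2 / (n_a + n_b) with one degree of freedom.
    """
    only_a_correct = 0  # cases classifier A gets right and B gets wrong
    only_b_correct = 0  # cases classifier B gets right and A gets wrong
    for truth, a, b in zip(y_true, pred_a, pred_b):
        if a == truth and b != truth:
            only_a_correct += 1
        elif a != truth and b == truth:
            only_b_correct += 1
    discordant = only_a_correct + only_b_correct
    if discordant == 0:
        return 0.0, 1.0  # the classifiers never disagree
    stat = (abs(only_a_correct - only_b_correct) - 1) ** 2 / discordant
    # Survival function of chi-square with 1 d.o.f.: erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical toy data: A is always right, B misses the last ten cases.
y_true = [1] * 20
pred_a = [1] * 20
pred_b = [1] * 10 + [0] * 10
stat, p_value = mcnemar_test(y_true, pred_a, pred_b)
print(stat, p_value < 0.05)
```

The test uses only the cases on which the two models disagree, which makes it suitable for comparing classifiers evaluated on the same test set.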
Model      π2=0.144  π2=0.249  π2=0.144  π2=0.249  π2=0.144  π2=0.249
BPN        0.530     0.818     0.228     0.281     0.657     1.049
RBF        0.497     0.761     0.205     0.258     0.644     1.030
FAR        0.694     0.908     0.391     0.490     N/A       N/A
LDA        0.429     0.540     0.219     0.239     0.613     0.808
Logist     0.471     0.728     0.200     0.243     0.673     1.140
KNN        0.592     0.858     0.227     0.281     0.688     1.033
Kernel     0.587     0.901     0.268     0.329     N/A       N/A
CART       0.467     0.597     0.226     0.244     0.641     0.811
Lin-SVM    0.462     0.717     0.226     0.244     0.657     1.055
Pol-SVM    0.469     0.726     0.221     0.264     0.675     1.093
Rbf-SVM    0.459     0.711     0.217     0.234     0.648     1.043
Sig-SVM    0.454     0.705     0.225     0.246     0.657     1.060
MARS       0.413     0.571     0.202     0.249     0.663     1.071
Table 4 5-fold cross validation test set classification accuracy on parities credit scoring data sets in new strategy
RBF
BPN
LDA
LOGIT
CART
Mars
Rbf-SVM
[Figure: classification tree for the credit scoring model]
Internal nodes (node, class, split rule, N):
Node 1     Class=1   A1<=2.500    N=1000
Node 2     Class=1   A2<=22.500   N=543
Node 3     Class=2   A3<=1.500    N=306
Node 4     Class=2   A2<=11.5     N=278
Node 5     Class=2   A4<=13.5     N=278
Node 6     Class=2   A18<=11.5    N=72
Node 7     Class=1   A10<=50.5    N=126
Node 8     Class=1   A5<=1.5      N=114
Node 9     Class=2   A1<=1.5      N=77
Node 10    Class=2   A5<=3.500    N=237
Node 11    Class=1   A1<=1.500    N=41
Terminal nodes (node, class, N):
Terminal Node 1     Class=1   N=28
Terminal Node 2     Class=1   N=80
Terminal Node 3     Class=1   N=12
Terminal Node 4     Class=2   N=60
Terminal Node 5     Class=2   N=48
Terminal Node 6     Class=1   N=29
Terminal Node 7     Class=1   N=37
Terminal Node 8     Class=2   N=12
Terminal Node 9     Class=2   N=196
Terminal Node 10    Class=2   N=17
Terminal Node 11    Class=1   N=24
Terminal Node 12    Class=1   N=457
Table 5 Variable selection results and basis functions of MARS credit scoring model

Variable name   Importance
A1              100.00
A2              57.31
A3              51.82
A5              40.44
A4              36.67
A16             32.33
A9              30.43
A17             28.86
A15             27.52
A20             27.34
A6              29.1
A8              16.95

Equation name   Equation
BF1             max(0, A1 - 1.000)
BF2             max(0, A2 - 4.000)
BF3             max(0, A3 - .180272E-06)
BF4             max(0, A5 - 1.000)
BF5             max(0, A4 - 36.000)
BF6             max(0, 36.000 - A4)
BF7             max(0, A16 + .180632E-07)
BF8             max(0, A15 - 1.000)
BF9             max(0, A20 + .182414E-07)
BF10            max(0, A17 - .376854E-08)
BF12            max(0, 4.000 - A6)
BF13            max(0, A9 - 1.000)
BF14            max(0, A8 - 2.000)
BF15            max(0, 2.000 - A8)

MARS prediction function:
Y = 1.358 - 0.096 * BF1 + 0.007 * BF2 - 0.058 * BF3 - 0.032 * BF4 + 0.002 * BF5
    + 0.005 * BF6 + 0.098 * BF7 - 0.192 * BF8 + 0.094 * BF9 - 0.129 * BF10
    + 0.040 * BF12 + 0.040 * BF13 - 0.026 * BF14 - 0.095 * BF15

In the MARS credit scoring model, Y = 0 (1) is defined to be a good (bad) credit customer.
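To show how the basis functions and the prediction function of Table 5 combine, the following Python sketch evaluates Y for one applicant. It rests on two assumptions that should be read with caution: operators missing from the extracted table are taken to be subtractions, and the exponent of the BF10 knot is taken to be negative. The function names and the example applicant are hypothetical.

```python
def hinge(value):
    """Hinge basis function max(0, value)."""
    return max(0.0, value)


# Coefficients of the prediction function; missing signs are assumed to be minus.
COEFFICIENTS = {
    1: -0.096, 2: 0.007, 3: -0.058, 4: -0.032, 5: 0.002, 6: 0.005, 7: 0.098,
    8: -0.192, 9: 0.094, 10: -0.129, 12: 0.040, 13: 0.040, 14: -0.026, 15: -0.095,
}
INTERCEPT = 1.358


def mars_score(a):
    """Evaluate Y for an applicant given as a dict of attribute values A1..A20."""
    bf = {
        1: hinge(a["A1"] - 1.000),
        2: hinge(a["A2"] - 4.000),
        3: hinge(a["A3"] - 0.180272e-06),
        4: hinge(a["A5"] - 1.000),
        5: hinge(a["A4"] - 36.000),
        6: hinge(36.000 - a["A4"]),
        7: hinge(a["A16"] + 0.180632e-07),
        8: hinge(a["A15"] - 1.000),
        9: hinge(a["A20"] + 0.182414e-07),
        10: hinge(a["A17"] - 0.376854e-08),  # exponent sign is an assumption
        12: hinge(4.000 - a["A6"]),
        13: hinge(a["A9"] - 1.000),
        14: hinge(a["A8"] - 2.000),
        15: hinge(2.000 - a["A8"]),
    }
    return INTERCEPT + sum(COEFFICIENTS[k] * bf[k] for k in COEFFICIENTS)


# Hypothetical applicant chosen so that every basis function is zero.
baseline = {"A1": 1, "A2": 4, "A3": 0, "A4": 36, "A5": 1, "A6": 4, "A8": 2,
            "A9": 1, "A15": 1, "A16": -1, "A17": 0, "A20": -1}
print(mars_score(baseline))
print(mars_score(dict(baseline, A1=2)))
```

Because each basis function is zero on one side of its knot, a coefficient only affects applicants whose attribute value lies on the active side, which is what makes the fitted model easy to read attribute by attribute.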