
Proceeding of the 3rd International Conference on Informatics and Technology, 2009

COMPARATIVE STUDIES ON DATA MINING TECHNIQUES TO PREDICT CUSTOMERS’ CREDIT CARD RISK IN
BANKS
Ling Kock Sheng¹, Teh Ying Wah²

¹ Faculty of Computer Science and Information Technology, University of Malaya, 50603 Kuala Lumpur, Malaysia. Email: ksling99@perdana.um.edu.my
² Department of Information Science, Faculty of Computer Science and Information Technology, University of Malaya, 50603 Kuala Lumpur, Malaysia. Email: tehyw@um.edu.my

ABSTRACT

It is increasingly important for banks to analyze and understand their risk portfolio, in particular that of credit cards, as it could severely affect the wellbeing of the entire establishment, especially in the current economic downturn. Insights into markets, application, performance and bad debt management provide for the maximization of profits and improvement to the wellbeing of the company. These four insights are usually obtained through the use of data mining techniques to build the needed classifiers, models and policies. Classification techniques such as Linear Regression, QUEST, C5 and Neural Networks are commonly used for the construction of such credit scoring models.

Using a subset of the bank's data, the quality of the bank's portfolio can be measured based on the tendency of clients to default on their payments. Credit risk is deemed to increase whenever such defaults occur. It is therefore critical for banks to measure and monitor such occurrences. To improve the predictive accuracy in identifying the probability of such occurrences, four data mining techniques were used to generate classifiers. C5 emerged as the technique that generated the best classifier, with an improved predictive accuracy of 0.92%. This quantum translates directly to an improvement of the bank's portfolio by the same percentage.

Keywords: Data Mining techniques, Credit Scoring, Predictive Accuracy, Credit Risk, Decision Tree Induction

1.0 INTRODUCTION

One of the primary services of a bank is to provide credit facilities, or lending, to consumers. The risk to the bank in extending the requested credit depends on how well it distinguishes between good and bad credit borrowers. The management of risks is vital to the financial soundness of the bank and is usually a very important aspect of the bank's business strategy. The management philosophy of banks would therefore emphasize primarily identifying, measuring, monitoring and managing their portfolio within a robust risk management framework. Returns have to be commensurate with the risks taken.

With increasing competition, the recent credit crunch and the worldwide recession, banks are tightening their lending policies. Banks are now not only facing highly competitive markets within their industry, they are also hard pressed to balance liquidity, market share, application in terms of higher credit provided to borrowers or investment, performance and bad debt recovery. One of the most critical credit risks has been the explosive growth of credit card usage. According to Malaysia Rating Corp. Bhd. chief economist Nor Zahidi Alias, the number of credit cards in circulation for primary holders has tripled to 9.67 million as of March 2009 from a mere 2.92 million seven years ago [1]. In addition, the credit line for card holders more than quadrupled to a whopping RM103.8 billion versus RM24.3 billion in January 2002 [1]. Such phenomenal growth suggests that Malaysia and other regions, e.g., Korea [11], are relying more on credit cards in their consumption, thereby creating a burden not only from the consumer's point of view, but also in the level of risk that the bank has to take on. It is therefore important for banks to understand their current portfolio and the risk level they are at. Such understanding provides insights into balancing their portfolio for [9]:

1. Market – Targeting the right candidates as credit card holders
2. Application – The level of credit extended to such card holders
3. Performance – Understanding and predicting future payment behavior
4. Bad Debt Management – Understanding and deriving the right collection policy for maximizing the amount recovered.

In addition to the capability of analyzing and understanding the bank's portfolio, it is also important to note that such analysis has traditionally been carried out only on a scheduled, for instance monthly or biweekly, basis. Globalization and the relaxation of banking policies cater for the volatility of the money markets and the economy. The risk level of the bank's portfolio could change very quickly. Such sophistication may increase the need to review the portfolio continuously, at a higher rate than before. Other than speed, which refers to the computational costs involved in generating and using the given classifier, predictor or model, it is also important that the system involved has the needed robustness, scalability and interpretability to ensure that the bank's portfolio can be properly analyzed and understood.


An understanding of the factors or attributes affecting a bank's risk levels, together with the measures undertaken in response, could improve both the bank's wellbeing and consumers' financial standing, and thereby help strengthen the country's economy. Therefore, the use of data mining tools, techniques and models could well be intensified and reassessed in the current operating environment to ensure increasing predictive accuracy for credit risk and improved portfolio management, so that banks will be resilient to such challenges.

2.0 CREDIT RISKS, SCORING AND APPROVAL

Credit risk is the probability that debtors will not pay their debts [6]. The management of such risk is important in banks, as it affects the wellbeing of their organization and is liable to put firms in distress if not properly managed. It is therefore critical for banks to have an in-depth understanding of their portfolio so that they can take appropriate measures, beginning with credit approval and continuing with the monitoring of each change, so that there is maximum recovery from their lending activities.

The process for credit approval is usually based on a very stringent set of policies and procedures. It is usually handled by bank officers who have the experience, seniority, product/business sector knowledge and track record to do so. The mechanism banks use to ascertain their risk level is what is known as credit scoring.

According to the literature provided by Statistica, “credit scoring is the set of decision models and their underlying techniques that aid lenders in the granting of consumer credit” [9].

The bank uses these techniques to decide whom to give credit to, how much credit they should get, and what operational strategies will enhance profitability. They also help to assess the risk portfolio in credit cards and lending.

Credit scoring is fairly dependable since it bases the assessment of a person's credit worthiness on actual data. The bank will usually use the result set to extend its credit card packages into new markets and also to monitor its existing clients.

The current procedure, according to a manager in a bank in Malaysia, is to evaluate a person's credit risk by conducting a check with the Credit Bureau in Malaysia, and possibly with CTOS Sdn Bhd (CTOS), and by reviewing the profile of the person or company. Like credit bureaus around the world, the Credit Bureau in Malaysia collects credit information on all borrowers from lending institutions and furnishes the information collected back to the institutions in the form of a credit report through the Central Credit Reference Information System (CCRIS). The Credit Bureau keeps the following types of data:

- Personal particulars of the borrower such as name, identification number, address, etc.
- Credit facility account details such as type of credit facilities, credit limit, outstanding balance, conduct of account and legal action status, if any.

It is also possible that the bank will check with another service provider of credit information, that is, CTOS, to confirm whether this private service agent has information that the Credit Bureau may not have.

The profile check revolves primarily around age and income level. A rule-based scoring system is in place to provide the level of credit scoring for the client. The Credit Bureau report, the CTOS credit check and the person's profile scoring are merged to form the final score and the categorization of the person's credentials to obtain a certain loan from the bank, as shown in Fig. 1.


[Fig. 1: Bank's current process to check credit worthiness. The CCRIS credit report, the CTOS check, the personal profile and collateral feed into a centralized credit scoring system, which determines credit worthiness and assigns a credit rating category of High, Medium or Low.]

The credit score from the centralized credit rating system will then be used to categorize the credit worthiness as high, medium or low. In the high category, the loan can be approved by just the officer in charge of the account. For those in the medium category or lower, a manager and a senior manager will have to analyze the case to provide the approval. In such a case, the credit limit may be lowered, the application may be rejected entirely, or a guarantor with a good credit report or collateral may be required.
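To make the merging step described above concrete, the following minimal sketch shows how a rule-based system of this kind might combine a CCRIS default count, a CTOS adverse-record flag and a simple age/income profile score into a High/Medium/Low category. The field names, weights and thresholds are hypothetical illustrations and are not taken from the bank's actual scoring rules.

```python
def profile_score(age: int, monthly_income: float) -> int:
    """Hypothetical rule-based profile score built from age and income level."""
    score = 0
    if 25 <= age <= 55:
        score += 2               # assumed preferred age band
    if monthly_income >= 5000:
        score += 3
    elif monthly_income >= 3000:
        score += 2
    elif monthly_income >= 1500:
        score += 1
    return score

def credit_category(ccris_defaults: int, ctos_adverse: bool,
                    age: int, monthly_income: float) -> str:
    """Merge bureau, CTOS and profile checks into a High/Medium/Low category."""
    score = profile_score(age, monthly_income)
    score -= 2 * ccris_defaults  # penalize defaults reported in the CCRIS report
    if ctos_adverse:             # adverse record held by CTOS
        score -= 3
    if score >= 4:
        return "High"
    if score >= 1:
        return "Medium"
    return "Low"

print(credit_category(ccris_defaults=0, ctos_adverse=False, age=35, monthly_income=6000))  # High
print(credit_category(ccris_defaults=2, ctos_adverse=True, age=23, monthly_income=1200))   # Low
```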

The bank's internal credit rating system would incorporate both statistical models and expert systems to establish the right rating or scorecards in its decision making system. As per one of the bank's procedures [10], “a borrower is assigned a Customer Risk Rating (CRR) and a Facility Risk Rating (FRR). The CRR is a borrower's standalone credit rating and is derived after a comprehensive assessment of its financial condition, the quality of its management, business risks and the industry it operates in. The FRR incorporates transaction-specific dimensions such as availability and types of collateral, seniority of the exposures, facility structures, etc”.

The determination of the bank's Customer Risk Rating (CRR) and FRR is inevitably dependent on various aspects and factors influencing the financial wellbeing of the bank in many ways. There are, however, still many aspects of the impact that the bank may not have taken into consideration and which may affect its position very quickly. While such scenarios may not be very prevalent in Malaysia, many banks in the West have not been spared such effects. The level of default in repayment, e.g., on credit cards, can act as a quick yardstick or measure of how healthy the bank's customer portfolio is. It may therefore be very critical to look at the accuracy of the model or classifier used as such a gauge. The class label of any default in payment can be used to ascertain the quality of the portfolio.

3.0 PREDICTIVE ACCURACY OF A CLASSIFIER AND DATA MINING TECHNIQUES

The accuracy of a classifier refers to the ability of a given classifier to correctly predict the class label of new or previously unseen data. This can be estimated using one or more test sets [2].

Some of the data mining techniques used in credit or financial risk detection, otherwise known as classification techniques, in which a model or classifier is constructed to predict categorical labels, are Bayesian networks, naïve Bayes, Support Vector Machines (SVM), linear logistic regression, K-nearest-neighbour, C4.5, Repeated Incremental Pruning to Produce Error Reduction (RIPPER) rule induction and radial basis function (RBF) networks [6]. Other decision tree induction techniques comprise ID3 and CART [4]. The techniques selected for this comparison are Linear Regression, QUEST, C5 and Neural Networks.


3.1 Linear Regression

One of the more basic technique used is linear regression. Regression analysis involves a response variable, y, and
a single predictor variable, x. In it’s simplest form, it is represented by the formula,

y= b + wx

where the variance of y is assumed to be constant, that is, b and w are regression coefficients. These coefficients
can be solved for the method of least squares. An extension to this is the multiple linear regression of straight-line
regression. This can be represented with

y = w0 + w1x1 + w2x2

where x1 and x2 are the values of the first and second attributes, respectively. Other regression models are nonlinear, such as a parabola or some higher-order polynomial [2].
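As an illustration of the above, the following minimal sketch fits the coefficients of a multiple linear regression by ordinary least squares using NumPy. The two predictor attributes and the response values are invented for demonstration and are not the paper's data.

```python
import numpy as np

# Hypothetical training data: two predictor attributes (x1, x2) and a response y.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 8.1, 10.0])

# Prepend a column of ones so the first coefficient acts as the intercept w0.
X_design = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem min ||X_design @ w - y||^2 for w = (w0, w1, w2).
w, *_ = np.linalg.lstsq(X_design, y, rcond=None)
w0, w1, w2 = w
print(f"y = {w0:.3f} + {w1:.3f}*x1 + {w2:.3f}*x2")

# Predict the response for a new, previously unseen tuple (x1 = 6, x2 = 4).
print(w0 + w1 * 6 + w2 * 4)
```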

3.2 QUEST and C5

For classification techniques involving decision tree induction, Quinlan developed a decision tree algorithm known as
ID3 (Iterative Dichotomiser). This was later followed by C4.5, which became the benchmark to which all other newer
supervised learning are compared to. Another technique that was developed at about the same time is Regression
Tree (CART). Most algorithm for decision tree induction follow a top-down approach. They start with a training a set
of data and their associated class labels. The training set is recursively partitioned into smaller subsets as the tree is
being build[2].

“The decision tree is made up of internal nodes, leaf nodes and branches. Each internal node represents a decision
on a data attribute or a function of data attributes. Each outgoing branch corresponds to a possible outcome of the
instance. Each leaf node represents a class. To classify an un-labeled data sample, the classifier tests the attribute
values of the sample against the decision tree. A path is traced from the root to a leaf node, which holds the class
prediction for that sample”. The efficiency of existing decision tree algorithms has been well established for relatively
small data sets[4].
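The quoted classification procedure can be illustrated with a toy tree. In the sketch below, internal nodes are encoded as nested dictionaries and a path is traced from the root to a leaf for an unlabeled sample; the attributes, thresholds and class labels are invented for illustration.

```python
# A toy decision tree: internal nodes test one attribute; leaves hold a class label.
tree = {
    "attribute": "accrued_value",
    "threshold": 2000,
    "left": {"label": "no default"},                   # accrued_value <= 2000
    "right": {                                         # accrued_value > 2000
        "attribute": "age",
        "threshold": 30,
        "left": {"label": "default"},                  # age <= 30
        "right": {"label": "no default"},              # age > 30
    },
}

def classify(node, sample):
    """Trace a path from the root to a leaf and return that leaf's class label."""
    while "label" not in node:
        branch = "left" if sample[node["attribute"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

print(classify(tree, {"accrued_value": 2500, "age": 24}))   # default
print(classify(tree, {"accrued_value": 1500, "age": 40}))   # no default
```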

Three popular attribute selection measures for decision tree induction are information gain, gain ratio and Gini index [2]. An attribute selection measure is a heuristic for selecting the splitting criterion that “best” separates a given data partition of class-labelled training tuples into individual classes; it determines how the tuples at a given node are to be split. QUEST, which stands for Quick, Unbiased and Efficient Statistical Tree, is another decision tree algorithm for classification. Developed by Loh and Shih, the objective of QUEST is similar to that of CART. The major differences are that QUEST performs unbiased variable selection by default and treats missing values differently [5]: QUEST uses imputation instead of surrogate splits to deal with missing values.
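As a small illustration of the information gain measure mentioned above, the sketch below computes the gain obtained by splitting a set of class-labelled tuples on each candidate attribute. The tiny default/non-default data set is invented for demonstration and does not come from the paper's experiment.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, label):
    """Expected reduction in entropy from partitioning `rows` on `attribute`."""
    labels = [r[label] for r in rows]
    base = entropy(labels)
    expected = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[label] for r in rows if r[attribute] == value]
        expected += len(subset) / len(rows) * entropy(subset)
    return base - expected

# Hypothetical class-labelled training tuples (default = 1 means payment default).
rows = [
    {"age_band": "young",  "high_limit": True,  "default": 1},
    {"age_band": "young",  "high_limit": False, "default": 1},
    {"age_band": "middle", "high_limit": True,  "default": 0},
    {"age_band": "old",    "high_limit": False, "default": 0},
    {"age_band": "old",    "high_limit": True,  "default": 0},
    {"age_band": "middle", "high_limit": False, "default": 1},
]

# The attribute with the highest gain would be chosen as the splitting criterion.
for attr in ("age_band", "high_limit"):
    print(attr, round(information_gain(rows, attr, "default"), 3))
```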

3.3 Neural Network

Another widely used technique for classification is the neural network learning algorithm [8]. Neural networks are highly interconnected networks that learn by adjusting the weights of the connections between nodes on different layers. A neural network has an input layer, one or more hidden layers and an output layer [7]. “Information processing occurs at many simple elements called neurons or nodes, or units or cells. Signals are passed between neurons over connection links. Each connection link has an associated weight, which in a typical neural net, multiplies the signal transmitted. Each node applies an activation function (usually nonlinear) to its net input (sum of weighted input signals) to determine the output signal” [3]. Commonly used activation functions are linear or identity, hyperbolic tangent, logistic, threshold or Gaussian. As for fitting the data, Rumelhart and McClelland (1986) proposed fitting the perceptron weights in the learning rule using the least squares method [3].
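To connect the description above to working code, the following sketch trains a small feed-forward network with one hidden layer and a logistic activation on a synthetic binary default data set, using scikit-learn's MLPClassifier. The feature values, layer size and other parameters are illustrative assumptions, not the configuration used in the paper's experiment.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Hypothetical features [age, credit_limit, accrued_value]; label 1 = payment default.
X = rng.normal(loc=[40, 10000, 2000], scale=[10, 3000, 800], size=(200, 3))
y = (X[:, 2] > 2200).astype(int)   # synthetic rule standing in for real default labels

# One hidden layer of 8 units with a logistic (sigmoid) activation function.
clf = MLPClassifier(hidden_layer_sizes=(8,), activation="logistic",
                    max_iter=2000, random_state=0)
clf.fit(X, y)

# Predict the class label of a previously unseen applicant.
print(clf.predict([[35, 12000, 2500]]))
print("training accuracy:", clf.score(X, y))
```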

4.0 RESEARCH DESIGN AND MISCLASSIFICATION COST

There are several discussions involving predictive accuracy. These may take the form of an estimation of how accurately the classifier can predict credit risk, based on the tendency of the borrower to default on their payments. Sinha and Zhao pointed out that the performance measure most frequently used in the literature for comparing data mining methods has been error rate or accuracy [8]. The error rate of a classifier f is defined as:

Error(f) = P(f(x) ≠ y) = P(y=−1)P(f(x)=1|y=−1) + P(y=1)P(f(x)=−1|y=1)

Accuracy is defined as accuracy(f) = 1 − error(f). The method generating the minimum error rate (or the maximum accuracy) has usually been considered the best. However, for many real-world problems, the costs of different types of misclassification errors are not equal. In a binary classification problem, a prediction can be either positive or negative; a false positive results when an instance that does not belong to the positive class is classified as positive, and vice versa for a false negative. For such problems, the focus should be on minimizing the overall misclassification cost [8]. When the costs of making classification errors are known, the performance of a classifier f can be measured using misclassification cost, defined as:
classification cost, defined as:

Cost(f) = P(y=−1)P(f(x)=1|y=−1)C1|−1 + P(y=1)P(f(x)=−1|y=1)C−1|1

where C1|−1 is the cost of a false positive mistake and C−1|1 is the cost of a false negative mistake. Error rate can
be viewed as a special case of misclassification cost, when the two types of errors are weighted equally (i.e.,
C1|−1=C−1|1).
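To make the two measures above concrete, the sketch below computes the error rate and the cost-weighted misclassification cost of a classifier from its predictions. The labels, predictions and cost values are illustrative assumptions; with equal unit costs the cost reduces to the error rate, as stated above.

```python
def error_rate(y_true, y_pred):
    """Fraction of instances whose predicted label differs from the true label."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def misclassification_cost(y_true, y_pred, cost_fp, cost_fn):
    """Average cost, weighting false positives and false negatives differently.

    Labels follow the paper's convention: +1 (positive) and -1 (negative).
    """
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if t == -1 and p == 1:
            total += cost_fp     # C(1|-1): cost of a false positive
        elif t == 1 and p == -1:
            total += cost_fn     # C(-1|1): cost of a false negative
    return total / len(y_true)

# Hypothetical ground truth and predictions (+1 = default, -1 = no default).
y_true = [1, -1, -1, 1, -1, 1, -1, -1]
y_pred = [1, -1, 1, -1, -1, 1, -1, -1]

print("error rate:", error_rate(y_true, y_pred))                                 # 0.25
print("cost:", misclassification_cost(y_true, y_pred, cost_fp=1, cost_fn=5))     # 0.75
```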

A subset of the bank's data comprising 5000 customer payment records was used for this test. A limited number of attributes was selected due to the confidentiality of the information. The attributes used include sex, credit limit, postcode, age and accrued value. The data were loaded into the system and preprocessed accordingly. Four techniques were used to generate an appropriate classifier, and the accuracy of each classifier was measured against the current rate of payment default.
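A rough outline of how such a comparison could be organized is sketched below. It assumes the records are available in a CSV file named credit_records.csv with numerically encoded attribute columns and a binary defaulted column, all of which are hypothetical names. Since C5.0 and QUEST implementations are not available in scikit-learn, the sketch substitutes scikit-learn's CART-style decision tree, logistic regression and multilayer perceptron as stand-ins, and a majority-class score stands in for the paper's "from dataset" baseline.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Assumed file and column names; the real attribute names are confidential.
df = pd.read_csv("credit_records.csv")
X = df[["sex", "crlimit", "postcode", "age", "accrued_value"]]   # assumed numeric encoding
y = df["defaulted"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Baseline: accuracy of always predicting the majority class in the test set.
baseline = max(y_test.mean(), 1 - y_test.mean())

models = {
    "Decision tree (CART stand-in for C5.0/QUEST)": DecisionTreeClassifier(max_depth=5, random_state=1),
    "Logistic regression (stand-in for linear regression scoring)": LogisticRegression(max_iter=1000),
    "Neural network": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=1),
}

print(f"baseline accuracy: {baseline:.4f}")
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {acc:.4f} (improvement {acc - baseline:+.4f})")
```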

5.0 DATA ANALYSIS

Table 1: Predictive accuracy of the classifier of each data mining technique

No   Model               Classifier                                                                      Accuracy (%)
0    From dataset        –                                                                               92.20
1    C5.0                Refer to Fig. 2                                                                 93.12
2    Linear Regression   0.5996*(sex=0) − 0.0006182*ACRDCUBA + 0.0001739*crlimit − 0.02441*age + 2.818   92.58
3    Neural Networks     –                                                                               92.78
4    QUEST               –                                                                               92.30

Fig. 2 Classifier C5

The predictive accuracy shown in Table 1 indicates C5.0 as having the highest level of accuracy (fewest errors). The improvement in accuracy is 93.12% − 92.20% = 0.92%.

6.0 CONCLUSION

The findings conclude that it is most advantageous for the bank to use the C5.0 data mining technique and the classifier that it churns out as the formula and rule set to predict a loan applicant's credit worthiness, that is, their tendency to default on their payments.

The findings also conclude that all the data mining techniques provide an improvement in predictive accuracy, and that C5.0 has the highest improvement rate of 0.92%. Predictive accuracy is thus 0.92% better, at 93.12%. The bank's portfolio, and subsequently its wellbeing, could improve by the same quantum if steps are taken to reflect this in its marketing, application and performance programs as well as its debt recovery management.


REFERENCES

[1] D. Dhesi, “Credit Card Trap”, StarBizWeek, 13th June 2009, pp. SBW20-21.

[2] J. Han et al., Data Mining: Concepts and Techniques, Elsevier Inc., 2006.

[3] C. Krieger, “Neural Networks in Data Mining”, 1996, http://www.cs.uml.edu/~ckrieger/user/Neural_Networks.pdf

[4] J. Lee, “A new approach of top-down induction of decision trees for knowledge discovery”, Iowa State
University, 2008

[5] T. S. Lim et al., “A comparison of prediction accuracy, complexity, and training time of thirty-three old and
new classification algorithms”, Machine Learning Journal, 2000, vol. 40, 203-228.

[6] Y. Peng et al., “A comparative Study of Classification Methods in Financial Risk Detection”, Fourth
International Conference on Networked Computing and Advanced Information Management, 2008.

[7] S. Ozekes et al., “Classification and Prediction in Data Mining with Neural Networks”, Journal of Electrical
and Electronics, 2003, pp. 707-712.

[8] A. P. Sinha and H. Zhao, “Incorporating domain knowledge into data mining classifiers: An application in indirect
lending”, Decision Support Systems, 2008.

[9] Statistica, “Financial Institutions and Statistica, Case Study: Credit Scoring”,
http://www.statsoft.com/datamining/pdf/CreditScoring.pdf

[10] UOB, “Risks Management”, http://www1.uob.com.my/webpages/cor_risk_management.htm#2

[11] J. Yoon, “Performance Improvement of Bankruptcy Prediction using Credit Card Sales Information of Small
& Micro Business”, Fifth International Conference on Software Engineering Research, Management and
Applications, 2007.

BIOGRAPHY

Ling Kock Sheng has 18 years of working experience with several multinationals in various industries and in various capacities. He is currently pursuing the Master of Computer Science degree at the University of Malaya and is researching topics in data mining.

Teh Ying Wah is a senior lecturer at the Faculty of Computer Science and Information Technology, University of Malaya. He holds a PhD in computer science from the University of Malaya, Malaysia. He is currently carrying out research in data mining, data warehouses and e-commerce.
