Increasing demand for consumer credit has intensified competition in the credit industry.
Credit managers therefore have to develop and apply machine learning methods to analyze credit
data, both to save time and to reduce errors. Credit scoring can be defined as a technique
that helps lenders decide whether to grant credit to applicants on the basis of the applicants'
characteristics, such as age, income and marital status. In recent years, several quantitative
methods have been proposed for credit risk evaluation. Among the existing approaches, data
mining methods have become the most popular because of their ability to discover practical
knowledge in a database and transform it into useful information. The first research into
credit scoring was done by Fisher and Durand, who applied linear and quadratic discriminant
analysis, respectively, to categorize credit applications as good or bad. This study aims to
survey the literature on data mining techniques applied to the credit risk evaluation problem
from 2000 to 2010. Its main purpose is to help researchers become aware of the present
methods, identify their limitations and suggest more efficient methods.
Synopsis
This study examines the case of customers' default payments in Taiwan and compares the
predictive accuracy of the probability of default among six data mining methods. From the
perspective of risk management, the predictive accuracy of the estimated probability of default
is more valuable than the binary classification result (credible or not credible clients).
Because the real probability of default is unknown, the study presents the novel Sorting
Smoothing Method to estimate it. With the real probability of default as the response variable
(Y) and the predicted probability of default as the independent variable (X), the simple linear
regression result (Y = A + BX) shows that the forecasting model produced by the artificial
neural network has the highest coefficient of determination; its regression intercept (A) is
close to zero and its regression coefficient (B) is close to one. Therefore, among the six data
mining techniques, the artificial neural network is the only one that can accurately estimate
the real probability of default.
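The evaluation described above can be sketched in R on simulated data. This is only an illustration under stated assumptions: the text does not spell out the Sorting Smoothing Method, so the window form used here (sort by predicted probability, then average the binary outcome over a centred window of 2k + 1 neighbours) is an assumption, as are the names ssm, true_p and pred_p and the window size k = 50.

```r
# Sketch only: simulated data stand in for the Taiwan credit card data set.
set.seed(1)
n <- 2000
true_p <- runif(n)                     # hypothetical "real" default probabilities
y <- rbinom(n, 1, true_p)              # observed binary default indicator
pred_p <- pmin(pmax(true_p + rnorm(n, sd = 0.05), 0), 1)  # hypothetical model scores

# Assumed form of the Sorting Smoothing Method: sort by predicted probability,
# then take a centred moving average of the 0/1 outcome over 2k + 1 neighbours.
ssm <- function(y, pred, k = 50) {
  ord <- order(pred)
  w <- rep(1 / (2 * k + 1), 2 * k + 1)
  smoothed <- stats::filter(y[ord], w, sides = 2)
  data.frame(pred = pred[ord], real = as.numeric(smoothed))
}

est <- na.omit(ssm(y, pred_p))         # the window is undefined at both ends
fit <- lm(real ~ pred, data = est)     # Y = A + B X
coef(fit)                              # intercept A and slope B
summary(fit)$r.squared                 # coefficient of determination
```

A model whose scores track the smoothed default rate well gives A near zero, B near one and a high coefficient of determination, which is the criterion the synopsis applies to the six methods.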
Attributes
The data set has 25 numerical attributes, named X, X1 to X23, and Y.
Dim
Retrieve or set the dimension of an object.
>dim(DefCreditCard)
[1] 30000    25
Head
head is used to obtain the first several rows of a matrix or data frame.
>head(DefCreditCard)
Tail
tail is used to obtain the last several rows of a matrix or data frame.
>tail(DefCreditCard)
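As a small illustration of dim, head and tail, the sketch below uses a made-up data frame called toy in place of DefCreditCard; the column values are invented and only the function calls matter.

```r
# Toy data frame standing in for DefCreditCard (values are made up).
toy <- data.frame(X  = 1:10,
                  X1 = seq(10000, 100000, by = 10000),
                  Y  = rep(c(0, 1), 5))

dim(toy)          # number of rows, then number of columns
nrow(toy)         # rows only
ncol(toy)         # columns only
head(toy, n = 3)  # first three rows
tail(toy, n = 3)  # last three rows
```

Both head and tail show six rows by default; the n argument changes that.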
Mean
Generic function for the (trimmed) arithmetic mean.
> mean(DefCreditCard$X)
[1] 15000.5
> mean(DefCreditCard$X1)
[1] 167484.3
> mean(DefCreditCard$X2)
[1] 1.603733
> mean(DefCreditCard$X3)
[1] 1.853133
> mean(DefCreditCard$X4)
[1] 1.551867
> mean(DefCreditCard$X5)
[1] 35.4855
> mean(DefCreditCard$X6)
[1] -0.0167
> mean(DefCreditCard$X7)
[1] -0.1337667
> mean(DefCreditCard$X8)
[1] -0.1662
> mean(DefCreditCard$X9)
[1] -0.2206667
> mean(DefCreditCard$X10)
[1] -0.2662
> mean(DefCreditCard$X11)
[1] -0.2911
> mean(DefCreditCard$X12)
[1] 51223.33
> mean(DefCreditCard$X13)
[1] 49179.08
> mean(DefCreditCard$X14)
[1] 47013.15
> mean(DefCreditCard$X15)
[1] 43262.95
> mean(DefCreditCard$X16)
[1] 40311.4
> mean(DefCreditCard$X17)
[1] 38871.76
> mean(DefCreditCard$X18)
[1] 5663.581
> mean(DefCreditCard$X19)
[1] 5921.163
> mean(DefCreditCard$X20)
[1] 5225.681
> mean(DefCreditCard$X21)
[1] 4826.077
> mean(DefCreditCard$X22)
[1] 4799.388
> mean(DefCreditCard$X23)
[1] 5215.503
> mean(DefCreditCard$Y)
[1] 0.2212
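Calling mean once per column can be shortened. The sketch below, on a made-up data frame toy, shows two ways to get every column mean in one call and what the trim argument does. Note that for a 0/1 column such as Y the mean is the share of ones, so the value 0.2212 above is the proportion of clients recorded as defaulting.

```r
# Made-up values; only the column names echo the data set.
toy <- data.frame(X1 = c(20000, 50000, 120000, 90000),
                  X5 = c(24, 37, 29, 41),
                  Y  = c(1, 0, 0, 1))

colMeans(toy)        # one mean per column
sapply(toy, mean)    # same result via sapply
mean(toy$Y)          # for a 0/1 column: share of ones

# Trimmed mean: drop the lowest and highest 20% of values before averaging.
trimmed <- mean(c(toy$X1, 1000000), trim = 0.2)
trimmed
```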
Var
The variance is a numerical measure of how the data values are dispersed around
the mean. In particular, the sample variance of n values x_1, ..., x_n with mean m is defined as:
s^2 = ((x_1 - m)^2 + ... + (x_n - m)^2) / (n - 1)
>var(DefCreditCard$X)
[1] 75002500
>var(DefCreditCard$X1)
[1] 16834455682
>var(DefCreditCard$X2)
[1] 0.2392474
>var(DefCreditCard$X3)
[1] 0.624651
>var(DefCreditCard$X4)
[1] 0.2724523
>var(DefCreditCard$X5)
[1] 84.96976
>var(DefCreditCard$X6)
[1] 1.26293
>var(DefCreditCard$X7)
[1] 1.433254
>var(DefCreditCard$X8)
[1] 1.432492
>var(DefCreditCard$X9)
[1] 1.366885
>var(DefCreditCard$X10)
[1] 1.284114
>var(DefCreditCard$X11)
[1] 1.322472
>var(DefCreditCard$X12)
[1] 5422239963
>var(DefCreditCard$X13)
[1] 5065705363
>var(DefCreditCard$X14)
[1] 4809337537
>var(DefCreditCard$X15)
[1] 4138716378
>var(DefCreditCard$X16)
[1] 3696294150
>var(DefCreditCard$X17)
[1] 3546691724
>var(DefCreditCard$X18)
[1] 274342256
>var(DefCreditCard$X19)
[1] 530881709
>var(DefCreditCard$X20)
[1] 310005092
>var(DefCreditCard$X21)
[1] 245428561
>var(DefCreditCard$X22)
[1] 233426624
>var(DefCreditCard$X23)
[1] 316038289
>var(DefCreditCard$Y)
[1] 0.1722763
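To make the definition concrete, the sample variance can be computed by hand and compared with var. The vector x below is made up for illustration.

```r
x <- c(21, 35, 29, 44, 52)                     # made-up ages
n <- length(x)

# Sum of squared deviations from the mean, divided by n - 1.
manual_var <- sum((x - mean(x))^2) / (n - 1)
manual_var
var(x)                                         # built-in, uses the same divisor
```

The divisor n - 1 rather than n is what makes this the sample variance; var always uses n - 1.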
Standard Deviation(sd)
The standard deviation of an observation variable is the square root of its variance.
>sd(DefCreditCard$X)
[1] 8660.398
>sd(DefCreditCard$X1)
[1] 129747.7
>sd(DefCreditCard$X2)
[1] 0.4891292
>sd(DefCreditCard$X3)
[1] 0.7903487
>sd(DefCreditCard$X4)
[1] 0.5219696
>sd(DefCreditCard$X5)
[1] 9.217904
>sd(DefCreditCard$X6)
[1] 1.123802
>sd(DefCreditCard$X7)
[1] 1.197186
>sd(DefCreditCard$X8)
[1] 1.196868
>sd(DefCreditCard$X9)
[1] 1.169139
>sd(DefCreditCard$X10)
[1] 1.133187
>sd(DefCreditCard$X11)
[1] 1.149988
>sd(DefCreditCard$X12)
[1] 73635.86
>sd(DefCreditCard$X13)
[1] 71173.77
>sd(DefCreditCard$X14)
[1] 69349.39
>sd(DefCreditCard$X15)
[1] 64332.86
>sd(DefCreditCard$X16)
[1] 60797.16
>sd(DefCreditCard$X17)
[1] 59554.11
>sd(DefCreditCard$X18)
[1] 16563.28
>sd(DefCreditCard$X19)
[1] 23040.87
>sd(DefCreditCard$X20)
[1] 17606.96
>sd(DefCreditCard$X21)
[1] 15666.16
>sd(DefCreditCard$X22)
[1] 15278.31
>sd(DefCreditCard$X23)
[1] 17777.47
>sd(DefCreditCard$Y)
[1] 0.4150618
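The figures reported for Y can be cross-checked by hand. For a 0/1 variable with mean p over n rows, the sample variance equals n p (1 - p) / (n - 1), and the standard deviation is its square root. The sketch below only reuses the numbers printed above (mean 0.2212, 30000 rows); it does not need the data set.

```r
# Values reported above for Y.
p <- 0.2212
n <- 30000

var_y <- p * (1 - p) * n / (n - 1)  # sample variance of a 0/1 variable
sd_y  <- sqrt(var_y)                # standard deviation is the square root

round(var_y, 7)                     # compare with var(DefCreditCard$Y)
round(sd_y, 7)                      # compare with sd(DefCreditCard$Y)
```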
Length
Get or set the length of vectors
> length(DefCreditCard$X)
[1] 30000
Sum
sum returns the sum of all the values present in its arguments.
>sum(DefCreditCard$X)
[1] 450015000
>sum(DefCreditCard$X2)
[1] 48112
>sum(DefCreditCard$X3)
[1] 55594
>sum(DefCreditCard$X4)
[1] 46556
>sum(DefCreditCard$X5)
[1] 1064565
>sum(DefCreditCard$X6)
[1] -501
>sum(DefCreditCard$X7)
[1] -4013
>sum(DefCreditCard$X8)
[1] -4986
>sum(DefCreditCard$X9)
[1] -6620
>sum(DefCreditCard$X10)
[1] -7986
>sum(DefCreditCard$X11)
[1] -8733
>sum(DefCreditCard$X12)
[1] 1536699927
>sum(DefCreditCard$X13)
[1] 1475372255
>sum(DefCreditCard$X14)
[1] 1410394644
>sum(DefCreditCard$X15)
[1] 1297888469
>sum(DefCreditCard$X16)
[1] 1209342029
>sum(DefCreditCard$X17)
[1] 1166152812
>sum(DefCreditCard$X18)
[1] 169907415
>sum(DefCreditCard$X19)
[1] 177634905
>sum(DefCreditCard$X20)
[1] 156770445
>sum(DefCreditCard$X21)
[1] 144782306
>sum(DefCreditCard$X22)
[1] 143981629
>sum(DefCreditCard$X23)
[1] 156465077
>sum(DefCreditCard$Y)
[1] 6636
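Two of the totals above can be read directly. The sketch below assumes that X is simply a row identifier running from 1 to 30000, which is consistent with its reported range, mean and sum but is not stated in the text. It also shows that sum(Y) divided by the number of rows reproduces mean(Y).

```r
n <- 30000               # number of rows reported by dim()
ids <- seq_len(n)        # assumption: X is simply 1, 2, ..., n

sum(ids)                 # compare with sum(DefCreditCard$X)
n * (n + 1) / 2          # closed form for the same total

6636 / n                 # sum(Y) / length(Y), i.e. mean(Y)
```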
Range
A range is a vector containing the minimum and maximum of all the given arguments.
>range(DefCreditCard$X)
[1]     1 30000
>range(DefCreditCard$X1)
[1] 10000 1000000
>range(DefCreditCard$X2)
[1] 1 2
>range(DefCreditCard$X3)
[1] 0 6
>range(DefCreditCard$X4)
[1] 0 3
>range(DefCreditCard$X5)
[1] 21 79
>range(DefCreditCard$X6)
[1] -2 8
>range(DefCreditCard$X12)
[1] -165580 964511
>range(DefCreditCard$X13)
[1] -69777 983931
>range(DefCreditCard$X14)
[1] -157264 1664089
>range(DefCreditCard$X15)
>range(DefCreditCard$X18)
[1]      0 873552
>range(DefCreditCard$X19)
[1]       0 1684259
>range(DefCreditCard$X20)
[1]      0 896040
>range(DefCreditCard$X21)
[1]      0 621000
>range(DefCreditCard$X22)
[1]      0 426529
>range(DefCreditCard$X23)
[1]      0 528666
>range(DefCreditCard$Y)
[1] 0 1
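The ranges of several columns can be collected in one call instead of one call per column. The sketch below uses a made-up data frame toy; sapply returns a matrix with the minima in the first row and the maxima in the second.

```r
# Made-up values; only the column names echo the data set.
toy <- data.frame(X5 = c(24, 37, 29, 41),
                  X6 = c(-1, 0, 2, -2))

r <- sapply(toy, range)           # row 1: minima, row 2: maxima
r
diff(range(toy$X5))               # spread between largest and smallest value
range(c(3, NA, 7), na.rm = TRUE)  # missing values must be dropped explicitly
```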
Readline
readline reads a line from the terminal in interactive use; its argument is the prompt string.
> enames <- readline("DefCreditCard. ")
DefCreditCard. 1:30000
Names
Functions to get the names of an object.
>names(DefCreditCard)
[1] "X" "X1" "X2" "X3" "X4" "X5" "X6" "X7" "X8" "X9"
[11] "X10" "X11" "X12" "X13" "X14" "X15" "X16" "X17" "X18" "X19"
[21] "X20" "X21" "X22" "X23" "Y"
Summary
Summary is a generic function used to produce result summaries of the results of various model
fitting functions.
>summary(DefCreditCard$X)
   Min.
      1
>summary(DefCreditCard$X1)
>summary(DefCreditCard$X13)
   Min.
 -69780
>summary(DefCreditCard$X14)
   Min.
-157300
>summary(DefCreditCard$X15)
   Min.
-170000
>summary(DefCreditCard$X16)
   Min.
 -81330
>summary(DefCreditCard$X17)
   Min.
-339600
>summary(DefCreditCard$X18)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0    1000    2100    5664    5006  873600
>summary(DefCreditCard$X19)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0     833    2009    5921    5000 1684000
>summary(DefCreditCard$X20)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0     390    1800    5226    4505  896000
>summary(DefCreditCard$X21)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0     296    1500    4826    4013  621000
>summary(DefCreditCard$X22)
   Min.
    0.0
>summary(DefCreditCard$X23)
   Min.
    0.0
>summary(DefCreditCard$Y)
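For a numeric vector, summary reports six values: minimum, first quartile, median, mean, third quartile and maximum. The sketch below uses a made-up vector x to show how the individual entries can be picked out by name.

```r
x <- c(0, 390, 1800, 2500, 4505, 9000)   # made-up payment amounts

s <- summary(x)
names(s)                     # labels of the six reported values
s[["Median"]]                # same value as median(x)
quantile(x, c(0.25, 0.75))   # the two quartiles on their own
```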
Results:
Histogram
A histogram is a visual representation of the distribution of a dataset. As such, the shape of a
histogram is its most obvious and informative characteristic.
It lets us easily see where a relatively large amount of the data is situated and where there
is very little data to be found.
In other words, it shows where the middle of the data distribution is, how closely the data lie
around this middle and where possible outliers are to be found. Because of all this, histograms
are a great way to get to know your data!
The histograms below show the Default of Credit Card data for several of its numerical
attributes, X to Y.
>hist(DefCreditCard$X)
>hist(DefCreditCard$X1)
>hist(DefCreditCard$X5)
>hist(DefCreditCard$X6)
>hist(DefCreditCard$X12)
>hist(DefCreditCard$X3)
>hist(DefCreditCard$X23)
>hist(DefCreditCard$Y)
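The bins behind a histogram can be inspected as well as drawn. The sketch below uses simulated ages in the range reported for X5 (21 to 79); the data and the choice of ten-year bins are assumptions made for illustration.

```r
set.seed(42)
age <- round(runif(500, min = 21, max = 79))   # simulated ages, not the real X5

# Fixed ten-year bins; hist() draws the plot and also returns the bin details.
h <- hist(age, breaks = seq(20, 80, by = 10),
          main = "Histogram of simulated ages", xlab = "Age")
h$breaks   # bin edges
h$counts   # number of observations per bin
```

Setting breaks explicitly keeps the bins comparable across plots; by default hist chooses them from the data.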
Box plot
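Boxplots can be created for individual variables or for variables by group. The general format
is boxplot(x, data=), where x is a formula and data= names the data frame providing the data.
The calls below instead pass log-transformed columns as separate arguments, which draws one box
per argument.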
> boxplot(log(DefCreditCard$X1),log(DefCreditCard$X2))
boxplot(log(DefCreditCard$X23),log(DefCreditCard$X15))
boxplot(log(DefCreditCard$X7),log(DefCreditCard$X14))
boxplot(log(DefCreditCard$X10),log(DefCreditCard$X20))
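The formula format mentioned above can be sketched as follows, with simulated data in place of DefCreditCard. The data frame toy and its values are made up. The last line is a side note: several columns in this data set contain zeros or negative values (for example, the ranges of X23 and X14 above), for which log returns -Inf or NaN, and log1p is one possible alternative for non-negative amounts.

```r
set.seed(7)
toy <- data.frame(
  X1 = round(rlnorm(200, meanlog = 11.5, sdlog = 0.8)),  # simulated credit limits
  Y  = rbinom(200, 1, 0.22)                              # simulated default flag
)

# Formula interface: one box of log(X1) for each level of Y.
b <- boxplot(log(X1) ~ Y, data = toy,
             xlab = "Y", ylab = "log(X1)",
             main = "Credit limit by default status")
b$names   # group labels
b$stats   # five summary values per group, one column per box

log1p(c(0, 10, 1000))   # log(1 + x): finite at zero, unlike log(0)
```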
Scatter Plot
A scatter plot pairs up the values of two quantitative variables in a data set and displays them
as geometric points inside a Cartesian diagram.
plot(DefCreditCard$X1, DefCreditCard$X12, main="Scatterplot for X1 and X12",xlab
="X1",ylab = "X12",pch=20)
plot(DefCreditCard$X13, DefCreditCard$X16, main="Scatterplot for X13 and X16",xlab
="X13",ylab = "X16",pch=5)
Multiple Plots
R makes it easy to combine multiple plots into one overall graph, as shown below.
> par(mfrow=c(2,2))
>hist(DefCreditCard$X14,main = "Histogram")
> boxplot(DefCreditCard$X1,main = "Boxplot")
> plot(DefCreditCard$X,DefCreditCard$Y,main = "Scatterplot")
> plot(DefCreditCard$X2,DefCreditCard$X7,main = "Scatterplot")
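One detail worth adding to the pattern above: par(mfrow = ...) stays in effect for later plots until it is reset. The sketch below, on simulated vectors x and y, saves the previous settings and restores them afterwards.

```r
set.seed(3)
x <- rnorm(100)
y <- x + rnorm(100)

old <- par(mfrow = c(2, 2))   # 2 x 2 grid; par() returns the previous settings
hist(x, main = "Histogram")
boxplot(x, main = "Boxplot")
plot(x, y, main = "Scatterplot")
plot(y, x, main = "Scatterplot (axes swapped)")
par(old)                      # back to the earlier layout
```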