Default of Credit Card Clients

Data Analysis Case Study: Default of Credit Card Clients
Growing demand for consumer credit has increased competition in the credit industry. Credit
managers therefore develop and apply machine learning methods to analyze credit data, in order
to save time and reduce errors. Credit scoring can be defined as a technique that helps lenders
decide whether to grant credit to applicants on the basis of the applicants' characteristics,
such as age, income and marital status. In recent years, several quantitative methods have been
proposed for credit risk evaluation. Among the existing approaches, data mining methods have
become the most popular because of their ability to discover practical knowledge in a database
and transform it into useful information. The first research into credit scoring was done by
Fisher and Durand, who applied linear and quadratic discriminant analysis respectively to
categorize credit applications as good or bad. This study aims to survey the literature on data
mining techniques applied to the credit risk evaluation problem from 2000 to 2010. Its main
purpose is to help researchers become aware of the present methods, find their limitations and
suggest more efficient methods.
Synopsis
This research addressed the case of customers' default payments in Taiwan and compared the
predictive accuracy of the probability of default among six data mining methods. From the
perspective of risk management, the predictive accuracy of the estimated probability of default
is more valuable than the binary classification result (credible or not credible clients).
Because the real probability of default is unknown, the study presented the novel Sorting
Smoothing Method to estimate it. With the real probability of default as the response variable
(Y) and the predictive probability of default as the independent variable (X), the simple linear
regression result (Y = A + BX) shows that the forecasting model produced by the artificial
neural network has the highest coefficient of determination; its regression intercept (A) is
close to zero and its regression coefficient (B) is close to one. Therefore, among the six data
mining techniques, the artificial neural network is the only one that can accurately estimate
the real probability of default.
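The evaluation described in the synopsis can be sketched in R. This is a minimal illustration on simulated data, not the study's code: the variable names, the simulated probabilities and the window half-width k are all assumptions. The smoothing step follows one reading of the Sorting Smoothing Method, namely sorting clients by predicted probability and averaging the observed 0/1 outcomes over neighbouring clients.

```r
set.seed(42)
n <- 5000

# Simulated stand-ins (assumptions, not the Taiwan data)
p_true  <- runif(n)                                  # latent probability of default
default <- rbinom(n, size = 1, prob = p_true)        # observed 0/1 outcome
p_hat   <- pmin(pmax(p_true + rnorm(n, sd = 0.05), 0), 1)  # a model's predicted probability

# Sort by predicted probability, then average the 0/1 outcomes over a
# centred window of 2k + 1 neighbours to estimate the "real" probability.
k   <- 50                                            # illustrative half-width
ord <- order(p_hat)
w   <- rep(1 / (2 * k + 1), 2 * k + 1)
p_real <- as.numeric(stats::filter(default[ord], w, sides = 2))

# Regress the estimated real probability (Y) on the predicted probability (X).
eval_df <- data.frame(Y = p_real, X = p_hat[ord])
eval_df <- eval_df[!is.na(eval_df$Y), ]              # the window is undefined at both ends
fit <- lm(Y ~ X, data = eval_df)

A  <- unname(coef(fit)[1])       # intercept, ideally near 0
B  <- unname(coef(fit)[2])       # slope, ideally near 1
r2 <- summary(fit)$r.squared     # coefficient of determination
print(c(A = A, B = B, R2 = r2))
```

A model whose predicted probabilities track the smoothed default rate well should give A near zero, B near one and a high R2, which is the criterion the synopsis applies to the six methods.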
Attributes
The data frame has 25 numeric columns, named X, X1 to X23, and Y.
Dim
Retrieve or set the dimension of an object.
>dim(DefCreditCard)
[1] 30000    25
Head
head returns the first several rows of a matrix or data frame.
>head(DefCreditCard)
Tail
tail returns the last several rows of a matrix or data frame.
>tail(DefCreditCard)
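As a small sketch of how dim, head and tail behave, the block below uses a toy data frame in place of DefCreditCard. The toy values are invented for illustration; only the column names echo the real data.

```r
# Toy stand-in for DefCreditCard (hypothetical values)
toy <- data.frame(
  X  = 1:10,
  X1 = seq(10000, 100000, by = 10000),
  Y  = rep(c(0, 1), 5)
)

dim(toy)             # number of rows, then number of columns
head(toy, n = 3)     # first three rows
tail(toy, n = 3)     # last three rows
head(toy, n = -8)    # a negative n drops rows from the end instead
```

Without the n argument, head and tail return six rows.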
Mean
Generic function for the (trimmed) arithmetic mean.
> mean(DefCreditCard$X)
[1] 15000.5
> mean(DefCreditCard$X1)
[1] 167484.3
The same call for X2 through X23, in column order, gives:
[1] 1.603733
[1] 1.853133
[1] 1.551867
[1] 35.4855
[1] -0.0167
[1] -0.1337667
[1] -0.1662
[1] -0.2206667
[1] -0.2662
[1] -0.2911
[1] 51223.33
[1] 49179.08
[1] 47013.15
[1] 43262.95
[1] 40311.4
[1] 38871.76
[1] 5663.581
[1] 5921.163
[1] 5225.681
[1] 4826.077
[1] 4799.388
[1] 5215.503
> mean(DefCreditCard$Y)
[1] 0.2212
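Calling mean once per column works, but the same figures can be obtained in a single call. The sketch below uses a toy data frame with invented values; colMeans and sapply are base R functions, and the trim argument is what the "(trimmed)" in the description of mean refers to.

```r
# Toy stand-in for DefCreditCard (hypothetical values)
toy <- data.frame(
  X  = 1:4,
  X1 = c(20000, 120000, 90000, 50000),
  Y  = c(1, 1, 0, 0)
)

col_means     <- colMeans(toy)       # mean of every column at once
col_means_alt <- sapply(toy, mean)   # same result, one mean() call per column

# Trimmed mean: drop the lowest and highest 25% of values before averaging
trimmed <- mean(toy$X1, trim = 0.25)

print(col_means)
print(trimmed)
```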
Var
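The variance is a numerical measure of how the data values are dispersed around
the mean. In particular, the sample variance of n values x_1, ..., x_n with mean \bar{x} is defined as:
$$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$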
>var(DefCreditCard$X)
[1] 75002500
>var(DefCreditCard$X1)
[1] 16834455682
The same call for X2 through X23, in column order, gives:
[1] 0.2392474
[1] 0.624651
[1] 0.2724523
[1] 84.96976
[1] 1.26293
[1] 1.433254
[1] 1.432492
[1] 1.366885
[1] 1.284114
[1] 1.322472
[1] 5422239963
[1] 5065705363
[1] 4809337537
[1] 4138716378
[1] 3696294150
[1] 3546691724
[1] 274342256
[1] 530881709
[1] 310005092
[1] 245428561
[1] 233426624
[1] 316038289
>var(DefCreditCard$Y)
[1] 0.1722763
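The sample variance divides the sum of squared deviations from the mean by n - 1. The sketch below checks that against var on a small invented vector, then reproduces the reported variance of Y from its mean alone, using the fact that a 0/1 variable with mean p has sample variance p(1 - p) n / (n - 1). The vector x is hypothetical; 0.2212 and 30000 are the mean of Y and the row count reported in this document.

```r
# Manual sample variance versus var()
x <- c(2, 4, 4, 4, 5, 5, 7, 9)                 # invented values
n <- length(x)
manual_var <- sum((x - mean(x))^2) / (n - 1)
print(c(manual = manual_var, builtin = var(x)))

# Variance of a 0/1 column from its mean p and row count
p     <- 0.2212                                # reported mean of Y
n_obs <- 30000                                 # reported number of rows
binary_var <- p * (1 - p) * n_obs / (n_obs - 1)
print(binary_var)                              # compare with var(DefCreditCard$Y)
```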
Standard Deviation(sd)
The standard deviation of an observation variable is the square root of its variance.
>sd(DefCreditCard$X)
[1] 8660.398
>sd(DefCreditCard$X1)
[1] 129747.7
The same call for X2 through X23, in column order, gives:
[1] 0.4891292
[1] 0.7903487
[1] 0.5219696
[1] 9.217904
[1] 1.123802
[1] 1.197186
[1] 1.196868
[1] 1.169139
[1] 1.133187
[1] 1.149988
[1] 73635.86
[1] 71173.77
[1] 69349.39
[1] 64332.86
[1] 60797.16
[1] 59554.11
[1] 16563.28
[1] 23040.87
[1] 17606.96
[1] 15666.16
[1] 15278.31
[1] 17777.47
>sd(DefCreditCard$Y)
[1] 0.4150618
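Because the standard deviation is the square root of the variance, the two sets of figures in this report can be cross-checked. The sketch below does this for every column of a toy data frame (invented values) and for the reported variance of Y.

```r
# Toy stand-in for two DefCreditCard columns (hypothetical values)
toy <- data.frame(
  X5  = c(24, 26, 34, 37, 57),
  X12 = c(3913, 2682, 29239, 46990, 8617)
)

sds  <- sapply(toy, sd)     # standard deviation of each column
vars <- sapply(toy, var)    # variance of each column
print(all.equal(sds, sqrt(vars)))

# Reported variance of Y, converted to a standard deviation
sd_Y <- sqrt(0.1722763)
print(sd_Y)
```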
Length
Get or set the length of vectors
> length(DefCreditCard$X)
[1] 30000
Sum
sum returns the sum of all the values present in its arguments.
>sum(DefCreditCard$X)
[1] 450015000
>sum(DefCreditCard$X2)
[1] 48112
The same call for X3 through X23, in column order, gives:
[1] 55594
[1] 46556
[1] 1064565
[1] -501
[1] -4013
[1] -4986
[1] -6620
[1] -7986
[1] -8733
[1] 1536699927
[1] 1475372255
[1] 1410394644
[1] 1297888469
[1] 1209342029
[1] 1166152812
[1] 169907415
[1] 177634905
[1] 156770445
[1] 144782306
[1] 143981629
[1] 156465077
>sum(DefCreditCard$Y)
[1] 6636
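The totals tie back to the means: dividing a sum by the number of rows recovers the mean, and for a 0/1 column such as Y the sum is a count. The sketch below rechecks two figures from this report. Reading X as a row index running from 1 to 30000 is an inference from its range and sum, not something the source states.

```r
n <- 30000                      # reported number of rows

# If X is the row index 1..n, its sum is n(n + 1)/2
sum_X <- sum(1:n)
print(sum_X)                    # compare with sum(DefCreditCard$X)

# sum(Y) counts the rows with Y = 1; dividing by n gives the mean of Y
default_rate <- 6636 / n
print(default_rate)             # compare with mean(DefCreditCard$Y)
```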
Range
range returns a vector containing the minimum and maximum of all the given arguments.
>range(DefCreditCard$X)
[1]     1 30000
>range(DefCreditCard$X1)
[1]   10000 1000000
The same call for X2 through X5, in column order, gives:
[1] 1 2
[1] 0 6
[1] 0 3
[1] 21 79
X6 through X11 all share the same range:
[1] -2  8
X12 through X23, in column order:
[1] -165580  964511
[1] -69777 983931
[1] -157264 1664089
[1] -170000  891586
[1] -81334 927171
[1] -339603  961664
[1]      0 873552
[1]       0 1684259
[1]      0 896040
[1]      0 621000
[1]      0 426529
[1]      0 528666
>range(DefCreditCard$Y)
[1] 0 1
Readline
readline reads a line from the terminal in interactive use; its argument is the prompt string.
> enames <- readline("DefCreditCard. ")
DefCreditCard. 1:30000
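A point worth illustrating is that readline takes a prompt string rather than a data object, and that in a non-interactive session it returns an empty string immediately. The sketch below is one way to handle both cases; the helper name ask_rows, the prompt text and the fallback value are hypothetical.

```r
# Hypothetical helper: ask for a row range, with a fallback for scripts
ask_rows <- function(default = "1:30000") {
  if (interactive()) {
    answer <- readline(prompt = "Rows to use (e.g. 1:30000): ")
    if (nzchar(answer)) return(answer)
  }
  default   # readline() yields "" when R is not interactive, so fall back
}

rows_text <- ask_rows()

# Turn text such as "1:30000" into row indices without evaluating arbitrary code
bounds <- as.integer(strsplit(rows_text, ":", fixed = TRUE)[[1]])
rows   <- seq(bounds[1], bounds[2])
print(range(rows))
```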
Names
Functions to get the names of an object.
>names(DefCreditCard)
[1] "X" "X1" "X2" "X3" "X4" "X5" "X6" "X7" "X8" "X9"
[11] "X10" "X11" "X12" "X13" "X14" "X15" "X16" "X17" "X18" "X19"
[21] "X20" "X21" "X22" "X23" "Y"
Summary
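summary is a generic function that produces result summaries of various objects, including the
results of model fitting functions. For a numeric vector it returns the minimum, first quartile,
median, mean, third quartile and maximum.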
>summary(DefCreditCard$X)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       1    7501   15000   15000   22500   30000
>summary(DefCreditCard$X1)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   10000   50000  140000  167500  240000 1000000
The same call for X2 through X23, in column order, gives:
          Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
X2       1.000     1.000     2.000     1.604     2.000     2.000
X3       0.000     1.000     2.000     1.853     2.000     6.000
X4       0.000     1.000     2.000     1.552     2.000     3.000
X5       21.00     28.00     34.00     35.49     41.00     79.00
X6     -2.0000   -1.0000    0.0000   -0.0167    0.0000    8.0000
X7     -2.0000   -1.0000    0.0000   -0.1338    0.0000    8.0000
X8     -2.0000   -1.0000    0.0000   -0.1662    0.0000    8.0000
X9     -2.0000   -1.0000    0.0000   -0.2207    0.0000    8.0000
X10    -2.0000   -1.0000    0.0000   -0.2662    0.0000    8.0000
X11    -2.0000   -1.0000    0.0000   -0.2911    0.0000    8.0000
X12    -165600      3559     22380     51220     67090    964500
X13     -69780      2985     21200     49180     64010    983900
X14    -157300      2666     20090     47010     60160   1664000
X15    -170000      2327     19050     43260     54510    891600
X16     -81330      1763     18100     40310     50190    927200
X17    -339600      1256     17070     38870     49200    961700
X18          0      1000      2100      5664      5006    873600
X19          0       833      2009      5921      5000   1684000
X20          0       390      1800      5226      4505    896000
X21          0       296      1500      4826      4013    621000
X22        0.0     252.5    1500.0    4799.0    4032.0  426500.0
X23        0.0     117.8    1500.0    5216.0    4000.0  528700.0
>summary(DefCreditCard$Y)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.0000  0.0000  0.0000  0.2212  0.0000  1.0000
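A per-column summary table like the one in this section can be built in one step rather than one call per column. The sketch below applies summary to each column of a toy data frame (invented values) and transposes the result so that each row is a variable.

```r
# Toy stand-in for two DefCreditCard columns (hypothetical values)
toy <- data.frame(
  X5 = c(21, 28, 34, 41, 79),
  Y  = c(0, 0, 0, 1, 1)
)

# One row per variable, one column per summary statistic
summary_table <- t(sapply(toy, summary))
print(summary_table)

# The quartiles alone are also available from quantile()
quartiles <- quantile(toy$X5, probs = c(0.25, 0.5, 0.75))
print(quartiles)
```

Calling summary(toy) on the whole data frame prints the same statistics in a column-wise layout.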
Results:
Histogram
A histogram is a visual representation of the distribution of a dataset. As such, the shape of a
histogram is its most obvious and informative characteristic.
It lets us easily see where a relatively large amount of the data is situated and where there
is very little data to be found.
In other words, it shows where the middle of the data distribution is, how closely the data lie
around this middle, and where possible outliers are to be found. Because of all this, histograms
are a great way to get to know your data!
The histograms below show the numeric attributes X to Y of the Default of Credit Card data.
>hist(DefCreditCard$X)
>hist(DefCreditCard$X1)
>hist(DefCreditCard$Y)
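Besides drawing the plot, hist returns the bin boundaries and counts, which makes it possible to read off where most of the data sits. The sketch below uses simulated ages as a stand-in for a column such as X5; the simulated values and the choice of roughly 20 bins are assumptions.

```r
set.seed(1)
age <- round(rnorm(1000, mean = 35, sd = 9))   # hypothetical stand-in for X5

# breaks is a suggestion; R picks "pretty" boundaries close to that many bins
h <- hist(age, breaks = 20, main = "Histogram of age", xlab = "age")

print(h$breaks)                    # bin boundaries
print(h$counts)                    # number of observations per bin
busiest <- which.max(h$counts)     # index of the fullest bin
print(h$breaks[busiest + 0:1])     # its lower and upper boundary
```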
Box plot
Boxplots can be created for individual variables or for variables by group. The format
is boxplot(x, data=), where x is a formula and data= denotes the data frame providing the data.
> boxplot(log(DefCreditCard$X1),log(DefCreditCard$X2))
> boxplot(log(DefCreditCard$X23),log(DefCreditCard$X15))
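The formula form mentioned in the description draws one box per group, and log scales need some care when a column contains zeros, since log(0) is -Inf. The sketch below shows both points on simulated data; the column values are invented, and using log1p for columns with zeros is a suggestion rather than what the report does.

```r
set.seed(2)
# Simulated stand-ins (hypothetical values)
toy <- data.frame(
  X1  = round(rlnorm(200, meanlog = 11.5, sdlog = 0.8)),   # strictly positive amounts
  X18 = c(rep(0, 50), round(rlnorm(150, meanlog = 8))),    # amounts including zeros
  Y   = rbinom(200, size = 1, prob = 0.22)                 # 0/1 group label
)

# Formula interface: one box of log10(X1) for each value of Y
bp <- boxplot(log10(X1) ~ Y, data = toy,
              main = "log10(X1) by Y", xlab = "Y", ylab = "log10(X1)")
print(bp$stats)      # five-number summary per group, one column per box

# log(0) is -Inf, so zeros fall off a plain log scale; log1p(x) = log(1 + x) keeps them
print(log(0))
boxplot(log1p(toy$X18), main = "log1p(X18)")
```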
Scatter Plot
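A scatter plot pairs up the values of two quantitative variables in a data set and displays them
as geometric points in a Cartesian diagram.
plot(DefCreditCard$X1, DefCreditCard$X12, main="Scatterplot for X1 and X12", xlab="X1", ylab="X12", pch=20)
plot(DefCreditCard$X18, DefCreditCard$X15, main="Scatterplot for X18 and X15", xlab="X18", ylab="X15", pch=25)
plot(DefCreditCard$X20, DefCreditCard$X23, main="Scatterplot for X20 and X23", xlab="X20", ylab="X23", pch=1)
plot(DefCreditCard$X13, DefCreditCard$X16, main="Scatterplot for X13 and X16", xlab="X13", ylab="X16", pch=5)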
Multiple Scatter plot

R makes it easy to combine multiple scatter plots into one overall graph, as shown below.
> par(mfrow=c(1,6))
>plot(DefCreditCard$X5,DefCreditCard$X6,main = "ScatterPlot of X5 & X6")
>plot(DefCreditCard$X2,DefCreditCard$X7,main = "ScatterPlot of X2 & X7")
>plot(DefCreditCard$X9,DefCreditCard$X20,main = "X9 & X20")
>plot(DefCreditCard$X,DefCreditCard$Y,main = "X & Y")
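When changing the layout with par(mfrow=), it can help to save the previous settings and restore them afterwards, so later plots are not squeezed into the same grid. The sketch below does this on simulated data; the toy columns are invented and the 1 x 2 layout is chosen only to match the two plots drawn.

```r
set.seed(4)
# Simulated stand-ins (hypothetical values)
toy <- data.frame(
  X5 = sample(21:79, 100, replace = TRUE),
  X6 = sample(-2:8, 100, replace = TRUE),
  X2 = sample(1:2, 100, replace = TRUE),
  X7 = sample(-2:8, 100, replace = TRUE)
)

old_par <- par(mfrow = c(1, 2))   # par() returns the settings it replaces
plot(toy$X5, toy$X6, main = "ScatterPlot of X5 & X6")
plot(toy$X2, toy$X7, main = "ScatterPlot of X2 & X7")
par(old_par)                      # back to the previous layout
```

With four plots, a layout of c(1,4) or c(2,2) would fill the grid exactly; a wider grid simply leaves the unused panels empty.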
Multiple Plots
R makes it easy to combine multiple plots into one overall graph, as shown below.
> par(mfrow=c(2,2))
>hist(DefCreditCard$X14,main = "Histogram")
> boxplot(DefCreditCard$X1,main = "Boxplot")
> plot(DefCreditCard$X,DefCreditCard$Y,main = "Scatterplot")
> plot(DefCreditCard$X2,DefCreditCard$X7,main = "Scatterplot")
LINEAR REGRESSION LINE:

The regression command is lm, for linear model. In the formula, the dependent variable comes
first, followed by a tilde "~", followed by the list of independent variables. Here the fitted
model is passed directly to abline, which draws the regression line over the scatter plot.
>plot(X12 ~ X13, data = DefCreditCard)
>abline(lm(X12 ~ X13, data = DefCreditCard))
>plot(X5 ~ X23, data = DefCreditCard)
>abline(lm(X5 ~ X23, data = DefCreditCard))
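Storing the fitted model in a variable makes the intercept, slope and coefficient of determination available in addition to the plotted line. The sketch below fits a line to simulated data; the variable name model, the simulated relationship and its coefficients are assumptions for illustration only.

```r
set.seed(3)
# Simulated stand-ins for two related columns (hypothetical values)
toy <- data.frame(X13 = runif(200, min = 0, max = 100000))
toy$X12 <- 2000 + 0.95 * toy$X13 + rnorm(200, sd = 5000)

model <- lm(X12 ~ X13, data = toy)    # dependent ~ independent

print(coef(model))                    # intercept and slope
r2 <- summary(model)$r.squared        # coefficient of determination
print(r2)

plot(X12 ~ X13, data = toy)
abline(model)                         # abline accepts the fitted model directly
```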