
Data Analysis Case Study: Default of Credit Card Clients

Growing demand for consumer credit has intensified competition in the credit industry. Credit
managers therefore have to develop and apply machine learning methods for analyzing credit
data in order to save time and reduce errors. Credit scoring can be defined as a technique
that helps lenders decide whether to grant credit to applicants on the basis of the applicants'
characteristics, such as age, income and marital status. In recent years, several quantitative
methods have been proposed for credit risk evaluation. Among the existing approaches, data
mining methods have become more popular than the others because of their ability to
discover practical knowledge in a database and transform it into useful
information. The earliest research into credit scoring was done by Fisher and Durand, who
applied linear and quadratic discriminant analysis, respectively, to categorize credit applications
as good or bad. This study aims to prepare a literature survey of the data mining techniques
applied to the credit risk evaluation problem from 2000 to 2010. Its main purpose is
to help researchers become aware of the present methods, find their limitations and suggest
more efficient methods.

Synopsis
This research examines the case of customers' default payments in Taiwan and compares the
predictive accuracy of the probability of default among six data mining methods. From the
perspective of risk management, the predictive accuracy of the estimated probability of default
is more valuable than the binary classification result (credible or not credible clients). Because
the real probability of default is unknown, the study presents the novel Sorting Smoothing
Method to estimate it. With the real probability of default as the
response variable (Y) and the predicted probability of default as the independent variable (X),
the simple linear regression result (Y = A + BX) shows that the forecasting model produced by
the artificial neural network has the highest coefficient of determination; its regression intercept
(A) is close to zero and its regression coefficient (B) is close to one. Therefore, among the six
data mining techniques, the artificial neural network is the only one that can accurately estimate
the real probability of default.
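
The calibration check described above can be sketched in R. This is only a sketch on synthetic data, not the study's implementation: the smoothing step is assumed here to be a centered moving average of the 0/1 outcomes after sorting by predicted probability, and the names pred_p, y and half_window, as well as the window size, are hypothetical.

```r
# Synthetic stand-in for a validation set (NOT the Taiwan data).
set.seed(42)
n_obs  <- 5000
pred_p <- runif(n_obs)               # hypothetical predicted probabilities (X)
y      <- rbinom(n_obs, 1, pred_p)   # 0/1 default outcomes drawn from them

# Step 1: sort the observations by predicted probability.
ord         <- order(pred_p)
pred_sorted <- pred_p[ord]
y_sorted    <- y[ord]

# Step 2 (assumed form of the smoothing): centered moving average of the
# sorted outcomes; the first and last half_window positions become NA.
half_window <- 50
w           <- rep(1 / (2 * half_window + 1), 2 * half_window + 1)
est_real_p  <- as.numeric(stats::filter(y_sorted, w, sides = 2))

# Step 3: regress the estimated real probability (Y) on the predicted one (X).
calib <- data.frame(X = pred_sorted, Y = est_real_p)
fit   <- lm(Y ~ X, data = calib)     # rows with NA are dropped by default

A  <- unname(coef(fit)[1])           # intercept, ideally near 0
B  <- unname(coef(fit)[2])           # slope, ideally near 1
r2 <- summary(fit)$r.squared         # coefficient of determination
print(c(A = A, B = B, R2 = r2))
```

Because the synthetic outcomes are drawn from the predicted probabilities themselves, this toy model is well calibrated by construction, so A should come out near zero and B near one.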

Attributes
The data set has 25 numeric attributes, named X, X1 to X23, and Y.
Dim
Retrieve or set the dimension of an object.
>dim(DefCreditCard)
[1] 30000    25

Head
head returns the first several rows of a matrix or data frame.
>head(DefCreditCard)

Tail
tail returns the last several rows of a matrix or data frame.
>tail(DefCreditCard)

Mean
Generic function for the (trimmed) arithmetic mean.
> mean(DefCreditCard$X)
[1] 15000.5
> mean(DefCreditCard$X1)
[1] 167484.3
> mean(DefCreditCard$X2)
[1] 1.603733
> mean(DefCreditCard$X3)
[1] 1.853133
> mean(DefCreditCard$X4)
[1] 1.551867
> mean(DefCreditCard$X5)
[1] 35.4855
> mean(DefCreditCard$X6)
[1] -0.0167
> mean(DefCreditCard$X7)
[1] -0.1337667
> mean(DefCreditCard$X8)
[1] -0.1662
> mean(DefCreditCard$X9)
[1] -0.2206667
> mean(DefCreditCard$X10)
[1] -0.2662
> mean(DefCreditCard$X11)
[1] -0.2911
> mean(DefCreditCard$X12)
[1] 51223.33
> mean(DefCreditCard$X13)
[1] 49179.08

> mean(DefCreditCard$X14)
[1] 47013.15
> mean(DefCreditCard$X15)
[1] 43262.95
> mean(DefCreditCard$X16)
[1] 40311.4
> mean(DefCreditCard$X17)
[1] 38871.76
> mean(DefCreditCard$X18)
[1] 5663.581
> mean(DefCreditCard$X19)
[1] 5921.163
> mean(DefCreditCard$X20)
[1] 5225.681
> mean(DefCreditCard$X21)
[1] 4826.077
> mean(DefCreditCard$X22)
[1] 4799.388
> mean(DefCreditCard$X23)
[1] 5215.503
> mean(DefCreditCard$Y)
[1] 0.2212
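
The repeated calls above can be collapsed into a single call over all columns. A minimal sketch, using a small made-up data frame toy in place of DefCreditCard (which is not reproduced here):

```r
# Hypothetical stand-in for DefCreditCard; the values are made up.
toy <- data.frame(
  X  = 1:4,
  X1 = c(20000, 120000, 90000, 50000),
  Y  = c(1, 1, 0, 0)
)

# sapply() applies mean() to every column and returns a named vector.
col_means <- sapply(toy, mean)
print(col_means)

# colMeans() is an equivalent shortcut for an all-numeric data frame.
print(colMeans(toy))
```

With the real data loaded, sapply(DefCreditCard, mean) would return all 25 means at once, and the same pattern works with var, sd, sum and range.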

Var
The variance is a numerical measure of how the data values are dispersed around
the mean. In particular, the sample variance is defined as:

$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

>var(DefCreditCard$X)
[1] 75002500
>var(DefCreditCard$X1)
[1] 16834455682
>var(DefCreditCard$X2)
[1] 0.2392474
>var(DefCreditCard$X3)
[1] 0.624651
>var(DefCreditCard$X4)
[1] 0.2724523
>var(DefCreditCard$X5)
[1] 84.96976
>var(DefCreditCard$X6)
[1] 1.26293
>var(DefCreditCard$X7)
[1] 1.433254
>var(DefCreditCard$X8)
[1] 1.432492
>var(DefCreditCard$X9)
[1] 1.366885
>var(DefCreditCard$X10)
[1] 1.284114
>var(DefCreditCard$X11)
[1] 1.322472
>var(DefCreditCard$X12)
[1] 5422239963
>var(DefCreditCard$X13)
[1] 5065705363
>var(DefCreditCard$X14)
[1] 4809337537
>var(DefCreditCard$X15)
[1] 4138716378
>var(DefCreditCard$X16)
[1] 3696294150
>var(DefCreditCard$X17)
[1] 3546691724
>var(DefCreditCard$X18)
[1] 274342256
>var(DefCreditCard$X19)
[1] 530881709
>var(DefCreditCard$X20)
[1] 310005092
>var(DefCreditCard$X21)
[1] 245428561
>var(DefCreditCard$X22)
[1] 233426624
>var(DefCreditCard$X23)
[1] 316038289
>var(DefCreditCard$Y)
[1] 0.1722763
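
As a worked check of the definition, the sample variance can be computed by hand and compared with var(). The sketch below assumes that X is simply the row identifier 1, 2, ..., 30000, which is consistent with its reported mean (15000.5) and range but is not stated in the source.

```r
# Assumption: X is the row ID 1..30000 (consistent with mean 15000.5).
x <- 1:30000
n <- length(x)

# Sample variance from the definition: squared deviations divided by n - 1.
manual_var <- sum((x - mean(x))^2) / (n - 1)

# For the integers 1..n the sample variance has the closed form n * (n + 1) / 12.
closed_form <- n * (n + 1) / 12

print(c(manual = manual_var, builtin = var(x), closed = closed_form))
```

Under that assumption all three values agree with the 75002500 reported for var(DefCreditCard$X).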

Standard Deviation(sd)
The standard deviation of an observation variable is the square root of its variance.
>sd(DefCreditCard$X)
[1] 8660.398
>sd(DefCreditCard$X1)
[1] 129747.7
>sd(DefCreditCard$X2)
[1] 0.4891292
>sd(DefCreditCard$X3)
[1] 0.7903487

>sd(DefCreditCard$X4)
[1] 0.5219696
>sd(DefCreditCard$X5)
[1] 9.217904
>sd(DefCreditCard$X6)
[1] 1.123802
>sd(DefCreditCard$X7)
[1] 1.197186
>sd(DefCreditCard$X8)
[1] 1.196868
>sd(DefCreditCard$X9)
[1] 1.169139
>sd(DefCreditCard$X10)
[1] 1.133187
>sd(DefCreditCard$X11)
[1] 1.149988
>sd(DefCreditCard$X12)
[1] 73635.86
>sd(DefCreditCard$X13)
[1] 71173.77
>sd(DefCreditCard$X14)
[1] 69349.39
>sd(DefCreditCard$X15)
[1] 64332.86
>sd(DefCreditCard$X16)
[1] 60797.16
>sd(DefCreditCard$X17)
[1] 59554.11
>sd(DefCreditCard$X18)
[1] 16563.28
>sd(DefCreditCard$X19)
[1] 23040.87
>sd(DefCreditCard$X20)
[1] 17606.96
>sd(DefCreditCard$X21)
[1] 15666.16
>sd(DefCreditCard$X22)
[1] 15278.31
>sd(DefCreditCard$X23)
[1] 17777.47
>sd(DefCreditCard$Y)
[1] 0.4150618

Length
Get or set the length of vectors
> length(DefCreditCard$X)
[1] 30000

Sum
sum returns the sum of all the values present in its arguments.
>sum(DefCreditCard$X)
[1] 450015000
>sum(DefCreditCard$X2)
[1] 48112
>sum(DefCreditCard$X3)
[1] 55594
>sum(DefCreditCard$X4)
[1] 46556
>sum(DefCreditCard$X5)
[1] 1064565
>sum(DefCreditCard$X6)
[1] -501
>sum(DefCreditCard$X7)
[1] -4013
>sum(DefCreditCard$X8)
[1] -4986
>sum(DefCreditCard$X9)
[1] -6620
>sum(DefCreditCard$X10)
[1] -7986
>sum(DefCreditCard$X11)
[1] -8733
>sum(DefCreditCard$X12)
[1] 1536699927
>sum(DefCreditCard$X13)
[1] 1475372255
>sum(DefCreditCard$X14)
[1] 1410394644
>sum(DefCreditCard$X15)
[1] 1297888469
>sum(DefCreditCard$X16)
[1] 1209342029
>sum(DefCreditCard$X17)
[1] 1166152812
>sum(DefCreditCard$X18)
[1] 169907415
>sum(DefCreditCard$X19)
[1] 177634905
>sum(DefCreditCard$X20)
[1] 156770445
>sum(DefCreditCard$X21)
[1] 144782306
>sum(DefCreditCard$X22)
[1] 143981629
>sum(DefCreditCard$X23)
[1] 156465077
>sum(DefCreditCard$Y)
[1] 6636

Range
range returns a vector containing the minimum and maximum of all the given arguments.
>range(DefCreditCard$X)
[1]     1 30000

>range(DefCreditCard$X1)
[1] 10000 1000000
>range(DefCreditCard$X2)
[1] 1 2
>range(DefCreditCard$X3)
[1] 0 6
>range(DefCreditCard$X4)
[1] 0 3
>range(DefCreditCard$X5)
[1] 21 79
>range(DefCreditCard$X6)
[1] -2 8
>range(DefCreditCard$X12)
[1] -165580 964511
>range(DefCreditCard$X13)
[1] -69777 983931
>range(DefCreditCard$X14)
[1] -157264 1664089
>range(DefCreditCard$X15)
[1] -170000  891586
>range(DefCreditCard$X16)
[1] -81334 927171
>range(DefCreditCard$X17)
[1] -339603  961664
>range(DefCreditCard$X18)
[1]      0 873552
>range(DefCreditCard$X19)
[1]       0 1684259
>range(DefCreditCard$X20)
[1]      0 896040
>range(DefCreditCard$X21)
[1]      0 621000
>range(DefCreditCard$X22)
[1]      0 426529
>range(DefCreditCard$X23)
[1]      0 528666
>range(DefCreditCard$Y)
[1] 0 1

Readline
readline reads a line of text typed at the terminal; its argument is the prompt string.
> enames <- readline("DefCreditCard.")
1:30000

Names
Functions to get the names of an object.
>names(DefCreditCard)
[1] "X" "X1" "X2" "X3" "X4" "X5" "X6" "X7" "X8" "X9"
[11] "X10" "X11" "X12" "X13" "X14" "X15" "X16" "X17" "X18" "X19"
[21] "X20" "X21" "X22" "X23" "Y"

Summary
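summary is a generic function that produces result summaries of various objects, including the
results of model fitting functions. For a numeric vector it reports the minimum, first quartile,
median, mean, third quartile and maximum.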
>summary(DefCreditCard$X)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      1    7501   15000   15000   22500   30000

>summary(DefCreditCard$X1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10000   50000  140000  167500  240000 1000000

>summary(DefCreditCard$X2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.000   2.000   1.604   2.000   2.000

>summary(DefCreditCard$X3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   1.000   2.000   1.853   2.000   6.000

>summary(DefCreditCard$X4)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   1.000   2.000   1.552   2.000   3.000

>summary(DefCreditCard$X5)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  21.00   28.00   34.00   35.49   41.00   79.00

>summary(DefCreditCard$X6)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-2.0000 -1.0000  0.0000 -0.0167  0.0000  8.0000

>summary(DefCreditCard$X7)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-2.0000 -1.0000  0.0000 -0.1338  0.0000  8.0000

>summary(DefCreditCard$X8)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-2.0000 -1.0000  0.0000 -0.1662  0.0000  8.0000

>summary(DefCreditCard$X9)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-2.0000 -1.0000  0.0000 -0.2207  0.0000  8.0000

>summary(DefCreditCard$X10)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-2.0000 -1.0000  0.0000 -0.2662  0.0000  8.0000

>summary(DefCreditCard$X11)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-2.0000 -1.0000  0.0000 -0.2911  0.0000  8.0000

>summary(DefCreditCard$X12)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-165600    3559   22380   51220   67090  964500

>summary(DefCreditCard$X13)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 -69780    2985   21200   49180   64010  983900

>summary(DefCreditCard$X14)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-157300    2666   20090   47010   60160 1664000

>summary(DefCreditCard$X15)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-170000    2327   19050   43260   54510  891600

>summary(DefCreditCard$X16)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 -81330    1763   18100   40310   50190  927200

>summary(DefCreditCard$X17)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-339600    1256   17070   38870   49200  961700

>summary(DefCreditCard$X18)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0    1000    2100    5664    5006  873600

>summary(DefCreditCard$X19)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0     833    2009    5921    5000 1684000

>summary(DefCreditCard$X20)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0     390    1800    5226    4505  896000

>summary(DefCreditCard$X21)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0     296    1500    4826    4013  621000

>summary(DefCreditCard$X22)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     0.0    252.5   1500.0   4799.0   4032.0 426500.0

>summary(DefCreditCard$X23)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     0.0    117.8   1500.0   5216.0   4000.0 528700.0

>summary(DefCreditCard$Y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.0000  0.0000  0.2212  0.0000  1.0000
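
The summary output gives the minimum of X12 as -165600, whereas range reports -165580. That difference is consistent with the summary values being shown to four significant digits. The sketch below uses signif as a stand-in for that display rounding; this is an assumption about how the output was produced, and the exact behaviour may differ between R versions. The vector toy is made up.

```r
# Exact extremes reported by range() for X12 and X14 in this document.
exact <- c(X12_min = -165580, X12_max = 964511, X14_max = 1664089)

# Rounding to four significant digits reproduces the values in the summary output.
rounded <- signif(exact, 4)
print(rounded)

# For unrounded quartiles, quantile() can be called directly on a vector.
toy <- c(3, 1, 4, 1, 5, 9, 2, 6)   # made-up values
q   <- quantile(toy, probs = c(0.25, 0.5, 0.75))
print(q)
```

When exact extremes matter, range, min and max are the safer source than the printed summary.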

Results:
Histogram
A histogram is a visual representation of the distribution of a dataset. As such, the shape of a
histogram is its most obvious and informative characteristic.
It lets us easily see where a relatively large amount of the data is situated and where there
is very little data to be found.
In other words, it shows where the middle of the data distribution is, how closely the data lie
around this middle, and where possible outliers are to be found. For all these reasons,
histograms are a great way to get to know your data.
The histograms below show the Default of Credit Card data for several of its numeric
attributes, from X to Y.
>hist(DefCreditCard$X)

>hist(DefCreditCard$X1)

>hist(DefCreditCard$X5)

>hist(DefCreditCard$X6)

>hist(DefCreditCard$X12)

>hist(DefCreditCard$X3)

>hist(DefCreditCard$X23)

>hist(DefCreditCard$Y)

Box plot
Boxplots can be created for individual variables or for variables by group. The format
is boxplot(x, data=), where x is a formula and data= denotes the data frame providing the data.
The calls below instead pass log-transformed columns directly as vectors.
> boxplot(log(DefCreditCard$X1),log(DefCreditCard$X2))

> boxplot(log(DefCreditCard$X23),log(DefCreditCard$X15))

> boxplot(log(DefCreditCard$X7),log(DefCreditCard$X14))

> boxplot(log(DefCreditCard$X10),log(DefCreditCard$X20))
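
Several of these columns contain zeros or negative values (for example, X7 ranges from -2 to 8 and X14 has a negative minimum), and log() returns -Inf for 0 and NaN for negative inputs. One possible way to handle this, sketched here on made-up values, is to keep only the strictly positive entries before taking logs. The helper name log_positive and the vectors bill and paid are hypothetical.

```r
# Hypothetical helper: log of the strictly positive, non-missing entries only.
log_positive <- function(x) {
  log(x[!is.na(x) & x > 0])
}

# Made-up values standing in for a bill-amount column and a payment column.
bill <- c(-150, 0, 1200, 35000, 47000, 980000)
paid <- c(0, 0, 500, 1500, 2000, 60000)

# log of a negative number is NaN (with a warning) and log(0) is -Inf.
raw_logs <- suppressWarnings(log(bill))
print(raw_logs)

# Side-by-side box plots of the positive parts, with labelled groups.
boxplot(list(bill = log_positive(bill), paid = log_positive(paid)),
        ylab = "log(value)", main = "Positive values only")
```

Dropping non-positive values changes what the plot describes, so whether this is appropriate depends on what the zeros and negatives mean in each column.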

Scatter Plot
A scatter plot pairs up the values of two quantitative variables in a data set and displays them
as geometric points inside a Cartesian diagram.
> plot(DefCreditCard$X1, DefCreditCard$X12, main="Scatterplot for X1 and X12",
       xlab="X1", ylab="X12", pch=20)

> plot(DefCreditCard$X18, DefCreditCard$X15, main="Scatterplot for X18 and X15",
       xlab="X18", ylab="X15", pch=25)

> plot(DefCreditCard$X20, DefCreditCard$X23, main="Scatterplot for X20 and X23",
       xlab="X20", ylab="X23", pch=1)

> plot(DefCreditCard$X13, DefCreditCard$X16, main="Scatterplot for X13 and X16",
       xlab="X13", ylab="X16", pch=5)

Multiple Scatter plot


R makes it easy to combine multiple scatter plots into one overall graph, as shown below.
> par(mfrow=c(1,6))
>plot(DefCreditCard$X5,DefCreditCard$X6,main = "ScatterPlot of X5 & X6")
>plot(DefCreditCard$X2,DefCreditCard$X7,main = "ScatterPlot of X2 & X7")
>plot(DefCreditCard$X9,DefCreditCard$X20,main = "X9 & X20")
>plot(DefCreditCard$X15,DefCreditCard$X19,main = "X15 & X19")
>plot(DefCreditCard$X11,DefCreditCard$X23,main = "X11 & X23")
>plot(DefCreditCard$X,DefCreditCard$Y,main = "X & Y")

Multiple Plots
R makes it easy to combine multiple plots into one overall graph, as shown below.
> par(mfrow=c(2,2))
>hist(DefCreditCard$X14,main = "Histogram")
> boxplot(DefCreditCard$X1,main = "Boxplot")
> plot(DefCreditCard$X,DefCreditCard$Y,main = "Scatterplot")
> plot(DefCreditCard$X2,DefCreditCard$X7,main = "Scatterplot")
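
Because par(mfrow = ...) stays in effect for later plots, it can help to save the previous settings and restore them once the panel is drawn. A minimal sketch with made-up data (the data frame toy is hypothetical):

```r
# Made-up data standing in for a few columns of DefCreditCard.
set.seed(1)
toy <- data.frame(a = rnorm(200), b = runif(200), c = rexp(200))

# par() returns the previous values of the settings it changes.
old_par <- par(mfrow = c(2, 2))

hist(toy$a, main = "Histogram")
boxplot(toy$b, main = "Boxplot")
plot(toy$a, toy$b, main = "Scatterplot")
plot(toy$b, toy$c, main = "Scatterplot")

# Restore the earlier layout so later plots are not squeezed into the grid.
par(old_par)
```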

LINEAR REGRESSION LINE:


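The regression command is lm, for linear model. The fitted model can be stored in a variable
(for example, model) or, as here, passed directly to abline to draw the regression line on the
current plot. In the formula, the dependent variable comes first, followed by a tilde "~",
followed by the independent variables.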
>plot(X12 ~ X13, data = DefCreditCard)

>abline(lm(X12 ~ X13, data = DefCreditCard))

>plot(X5 ~ X23, data = DefCreditCard)

>abline(lm(X5 ~ X23, data = DefCreditCard))
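
Storing the fitted model in a variable makes it possible to inspect the coefficients as well as draw the line. A minimal sketch on made-up data with a known slope and intercept (the data frame toy and its columns are hypothetical, not the credit card data):

```r
# Made-up data with a known linear relationship: y = 3 + 2 * x + noise.
set.seed(7)
toy   <- data.frame(x = seq(0, 100, length.out = 200))
toy$y <- 3 + 2 * toy$x + rnorm(200, sd = 5)

# Fit once, store the result, and reuse it.
model <- lm(y ~ x, data = toy)
print(coef(model))                 # estimated intercept and slope
print(summary(model)$r.squared)    # coefficient of determination

# Scatter plot with the fitted regression line drawn on top.
plot(y ~ x, data = toy)
abline(model)
```

The same pattern applies to the fits above, for example model <- lm(X12 ~ X13, data = DefCreditCard) followed by abline(model).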
