
IE 4903 TERM PROJECT

German Credit Rating

METU IE
Kenan Aghayev, Gökhan Kof

DATA PREPARATION AND INITIALIZATION

1. We consider some attributes irrelevant. For instance, we think that the purpose of the credit (buying a new car, buying a used car, furniture or retraining) has little to do with whether the rating will be good or bad. We think the same about the attributes regarding marital status.

2. To pick the variables to use, we computed the entropy and the Gini index for each attribute and chose the ones with the lowest values. The entropy and Gini values for each attribute are given in Table 1 and Table 2 in the Appendix, and the code we used to calculate them is given in Figure 1 and Figure 2 in the Appendix. Based on the values in Table 1 and Table 2, we included different numbers of attributes in our analysis and compared the performances.

3. Our data does not contain any missing values. We decided to keep the outliers because we think an observation should not be discarded unless it is absolutely necessary. The continuous variables in our dataset are binned into categorical variables, both to calculate the entropy and Gini index of each attribute and to use them in our classification methods.

4. After partitioning the dataset into 70% training and 30% test, we applied the naive rule to calculate the training and test errors. Both turned out to be consistently around 30%, as expected, since our data contains 700 good and 300 bad credits. The code we wrote for the naive rule is given in Figure 3 and the confusion matrix is given in Table 3 in the Appendix.

MODEL BUILDING

We have chosen two methods to classify the dataset into good and bad credit. The first one is Naive Bayes. We chose this method because most of the attributes in our data are binary or categorical. The remaining continuous variables are binned to be used in the analysis. The second method is Decision Tree.
A decision tree is a good method for understanding the relationship between the attributes and the response in our dataset. After performing the classification, we can check whether the model is logical by looking at the structure of the tree, and include or exclude some of the variables accordingly. That is why our second method is the decision tree.

For Naive Bayes, we binned our continuous variables into categorical variables. The binning used for each continuous variable is given in Table 4 in the Appendix. We did not use equal-range or equal-probability binning. Instead, we looked at the histogram of each continuous attribute and chose bins that made sense to us. The Excel file GermanCredit.xls contains the binned form of the continuous variables.

Naive Bayes

The code we wrote to apply Naive Bayes is given in Figure 4 in the Appendix. We apply the method 10 times and take the mean of the training and test errors. We used the entropy and Gini index values to choose the variables to work with, and decided to use the 4 variables with the lowest Gini index and entropy values: CHK_ACCT, HISTORY, SAV_ACCT and DURATION. To show the effect of this choice, we ran Naive Bayes using all 30 variables and using only these four. The mean error over ten replications is about 28% with 30 variables and about 25% with only 4 variables.
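The variable selection step can be illustrated with a short sketch. The Python snippet below is not the authors' MATLAB code; it only takes an excerpt of the values reported in Table 1 and Table 2 (the six lowest of each) and picks the four attributes with the lowest score under each measure. The helper name `lowest_k` is hypothetical.

```python
# Excerpt of Table 1 (entropy) and Table 2 (Gini index): the six lowest values of each.
entropy = {
    "CHK_ACCT": 0.5452, "HISTORY": 0.5806, "SAV_ACCT": 0.5914,
    "DURATION": 0.5918, "EMPLOYMENT": 0.6018, "OWN_RES": 0.6021,
}
gini = {
    "CHK_ACCT": 0.3680, "HISTORY": 0.3941, "DURATION": 0.4008,
    "SAV_ACCT": 0.4048, "AMOUNT": 0.4086, "EMPLOYMENT": 0.4123,
}


def lowest_k(scores, k):
    """Return the k attribute names with the smallest score, lowest first."""
    return sorted(scores, key=scores.get)[:k]


by_entropy = lowest_k(entropy, 4)
by_gini = lowest_k(gini, 4)
print(by_entropy)
print(by_gini)
print(set(by_entropy) == set(by_gini))
```

Both measures rank DURATION and SAV_ACCT in a different order, but the set of four selected attributes is the same.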

Decision Tree

We applied the decision tree method using all 30 variables, expecting the tree itself to eliminate the variables that are not needed. The minimum test error after 10 replications is 24%. We also looked at other figures to decide on the tree. Figure 5 shows how the error rate changes with the pruning level. Based on this figure alone, we would select pruning level 8. The number of terminal nodes against the pruning level of the tree is given in Figure 6 in the Appendix. The complexity cost against the number of terminal nodes is shown in Figure 7 in the Appendix.

However, the most important criterion for us is the cost. The average total cost of applying the pruned tree is plotted against the pruning level in Figure 8. We use the average cost per applicant rather than the total because the training set has more applicants than the test set, so its total cost is likely to be higher. Taking averages lets us compare the costs of the two sets. The pruning level that minimizes the cost on the test set is 4, and the cost starts to increase after level 4, so it is a good level to choose. The rules of the tree are given in Figure 9 in the Appendix. The code we wrote for the decision tree is given in Figure 10 in the Appendix. Confusion matrices are given in Table 5 and Table 6 in the Appendix.

The cost of the decision tree is 81.33, while Naive Bayes gives a cost of 68.00 on average. Moreover, the Naive Bayes method is much simpler and faster, and it uses only 4 variables to predict the credit ratings. The decision tree, on the other hand, is much more complex and takes more time to implement. Therefore, we have chosen the Naive Bayes method for this problem.
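As a rough check of the cost figure, the following Python sketch (not the authors' MATLAB code) applies the cost matrix from DecisonTree.m, [0 500; 100 0], to the test-set confusion matrix of the decision tree in Table 6. It assumes that rows are actual classes and columns are predicted classes, and that class 0 is a bad credit and class 1 a good credit; the report does not state the coding explicitly. The function name `average_cost` is hypothetical.

```python
# Cost matrix from DecisonTree.m: rows = actual class, columns = predicted class.
# Assumption: class order is [0, 1] with 0 = bad credit and 1 = good credit.
OPP_COST = [[0, 500],
            [100, 0]]


def average_cost(confusion, cost=OPP_COST):
    """Average misclassification cost per applicant for a 2x2 confusion matrix."""
    total = sum(cost[i][j] * confusion[i][j] for i in range(2) for j in range(2))
    n = sum(sum(row) for row in confusion)
    return total / n


# Test-set confusion matrix of the decision tree (Table 6), rows = actual class.
tree_test = [[47, 39],
             [49, 165]]
print(round(average_cost(tree_test), 2))  # 81.33
```

Under these assumptions, accepting a bad credit costs 500 and rejecting a good credit costs 100, and the result agrees with the 81.33 reported for the decision tree.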

APPENDIX

Table 1. Entropy Values of Each Attribute


Attributes          Entropy     Attributes          Entropy
CHK_ACCT            0.5452      MALE_SINGLE         0.6076
HISTORY             0.5806      INSTALL_RATE        0.6081
SAV_ACCT            0.5914      EDUCATION           0.6086
DURATION            0.5918      CO-APPLICANT        0.6090
EMPLOYMENT          0.6018      GUARANTOR           0.6092
OWN_RES             0.6021      MALE_DIV            0.6097
PROP_UNKN_NONE      0.6034      NUM_CREDITS         0.6097
REAL_ESTATE         0.6034      JOB                 0.6099
OTHER_INSTALL       0.6047      AMOUNT              0.6101
RADIO/TV            0.6049      TELEPHONE           0.6102
USED_CAR            0.6054      RETRAINING          0.6102
AGE                 0.6058      PRESENT_RESIDENT    0.6105
NEW_CAR             0.6063      FURNITURE           0.6106
RENT                0.6067      MALE_MAR_or_WID     0.6107
FOREIGN             0.6068      NUM_DEPENDENTS      0.6109

Table 2. Gini Index for Each Attribute


Attributes          Gini Index  Attributes          Gini Index
CHK_ACCT            0.3680      FOREIGN             0.4172
HISTORY             0.3941      MALE_SINGLE         0.4173
DURATION            0.4008      INSTALL_RATE        0.4177
SAV_ACCT            0.4048      EDUCATION           0.4179
AMOUNT              0.4086      CO-APPLICANT        0.4183
EMPLOYMENT          0.4123      GUARANTOR           0.4187
OWN_RES             0.4124      MALE_DIV            0.4189
PROP_UNKN_NONE      0.4134      NUM_CREDITS         0.4190
REAL_ESTATE         0.4140      JOB                 0.4192
OTHER_INSTALL       0.4146      TELEPHONE           0.4194
RADIO/TV            0.4152      RETRAINING          0.4195
AGE                 0.4155      PRESENT_RESIDENT    0.4197
USED_CAR            0.4158      FURNITURE           0.4198
NEW_CAR             0.4161      MALE_MAR_or_WID     0.4198
RENT                0.4164      NUM_DEPENDENTS      0.4200

Figure 1. Contents of entropy.m


clear; clc;
data = xlsread('GermanCredit.xls');
attr = data(:,2:31);   % the 30 attributes
trgt = data(:,32);     % class label (0 or 1)
[r,c] = size(attr);
e = zeros(c,1);

for i = 1:c
    u = unique(attr(:,i));
    for j = 1:length(u)
        idx = attr(:,i) == u(j);          % rows taking the j-th value
        n   = sum(idx);
        p0  = sum(trgt(idx) == 0) / n;    % share of class 0 in this group
        p1  = sum(trgt(idx) == 1) / n;    % share of class 1 in this group
        % Entropy of the group (natural log), weighted by group size.
        e(i) = e(i) + (n / r) * (-p0 * log(p0) - p1 * log(p1));
    end
end
[E,IX] = sort(e);   % attributes ordered by increasing entropy
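Written as a formula, the quantity accumulated in e(i) by entropy.m for an attribute A is the following size-weighted entropy. The symbols n, n_v and p_{v,k} are notation introduced here for illustration, not taken from the report: n is the number of observations, n_v the number of observations with A = v, and p_{v,k} the share of class k among them.

```latex
E(A) = \sum_{v \in \mathrm{values}(A)} \frac{n_v}{n}
       \left( -p_{v,0} \ln p_{v,0} - p_{v,1} \ln p_{v,1} \right),
\qquad
p_{v,k} = \frac{\#\{\text{observations with } A = v \text{ and class } k\}}{n_v}
```

A lower value means that the groups defined by the attribute are purer with respect to the class, which is why the attributes with the lowest values are preferred.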

Figure 2. Contents of gini.m


clear; clc;
data = xlsread('GermanCredit.xls');

% Binning the numeric attributes (binEqualProb is a helper not listed in this report).
numericAttr = [3 11 14 23];
numOfBins = [4 3 4 5];
for i = 1:length(numericAttr)
    data(:,numericAttr(i)) = binEqualProb(data(:,numericAttr(i)),numOfBins(i));
end

attr = data(:,2:31);
trgt = data(:,32);
[r,c] = size(attr);
g = zeros(c,1);
for i = 1:c
    u = unique(attr(:,i));
    for j = 1:length(u)
        idx = attr(:,i) == u(j);          % rows taking the j-th value
        n   = sum(idx);
        p0  = sum(trgt(idx) == 0) / n;    % share of class 0 in this group
        p1  = sum(trgt(idx) == 1) / n;    % share of class 1 in this group
        % Gini index of the group, weighted by group size.
        g(i) = g(i) + (n / r) * (1 - p0^2 - p1^2);
    end
end
[G,IX] = sort(g);   % attributes ordered by increasing Gini index
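To make the two impurity measures concrete, here is a minimal Python sketch of the same size-weighted computation on made-up toy data. It is not a translation of the authors' MATLAB files, and all function and variable names are hypothetical. Unlike the MATLAB loops, it only iterates over classes that actually occur in a group, so it never evaluates log(0).

```python
import math
from collections import Counter, defaultdict


def weighted_impurity(values, labels, measure):
    """Impurity of the class labels within each attribute value, weighted by group size."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    n = len(labels)
    total = 0.0
    for ys in groups.values():
        shares = [count / len(ys) for count in Counter(ys).values()]
        total += (len(ys) / n) * measure(shares)
    return total


def entropy(shares):
    # Natural log, as in entropy.m; classes with zero count never appear in shares.
    return -sum(p * math.log(p) for p in shares)


def gini(shares):
    return 1.0 - sum(p * p for p in shares)


# Toy data (made up for illustration): one attribute with two values.
attr = [0, 0, 0, 0, 1, 1, 1, 1]
trgt = [1, 1, 1, 1, 0, 0, 1, 1]
print(weighted_impurity(attr, trgt, gini))     # 0.25
print(weighted_impurity(attr, trgt, entropy))  # 0.5 * ln(2)
```

In the toy data the group with value 0 is pure and contributes nothing, while the group with value 1 is split evenly between the classes and contributes half of its own impurity.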

Figure 3. Contents of naive.m


clear; clc;
data = xlsread('GermanCredit.xls');
trgt = data(:,32);
v = randperm(length(trgt));
p = input('Percentage of Training Set: ');   % entered as a fraction, e.g. 0.7

% Partitioning data into training and test sets.
trainingClass = trgt(v(1:round(length(trgt)*p)),:);
testClass = trgt(v(round(length(trgt)*p)+1:end),:);

% Calculating Error (the naive rule predicts the most frequent training class).
trainingError = sum(abs(trainingClass - mode(trainingClass))) / length(trainingClass);
testError = sum(abs(testClass - mode(trainingClass))) / length(testClass);

Table 3. Confusion Matrix for Naive Rule

                         Predicted
            Actual     Bad     Good
Training    Bad          0      212
            Good         0      488
Test        Bad          0       88
            Good         0      212
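The naive rule can be sketched as follows. This Python snippet is an illustration rather than the authors' naive.m: it builds a label vector with 700 good and 300 bad credits as stated in the report, assumes the coding 1 = good and 0 = bad, splits it 70/30 at random, and predicts the majority class of the training part for every applicant. The function name `naive_rule_errors` and the fixed random seed are hypothetical choices.

```python
import random
from collections import Counter


def naive_rule_errors(labels, train_share, rng):
    """Split labels at random and predict the training majority class everywhere."""
    shuffled = labels[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_share)
    train, test = shuffled[:cut], shuffled[cut:]
    majority = Counter(train).most_common(1)[0][0]
    train_error = sum(y != majority for y in train) / len(train)
    test_error = sum(y != majority for y in test) / len(test)
    return majority, train_error, test_error


# 700 good (1) and 300 bad (0) credits; the 0/1 coding is an assumption.
labels = [1] * 700 + [0] * 300
majority, train_error, test_error = naive_rule_errors(labels, 0.7, random.Random(0))
print(majority, train_error, test_error)
```

Because every applicant is predicted as good, the errors are simply the shares of bad credits in each part, which is why both stay close to 30% across random splits.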

Table 4. Binning Methods used for the Continuous Variables


Attribute         Binning Method
DURATION          0 if <=11 else 1 if <=17 else 2 if <=24 else 3
AMOUNT            0 if <=2000 else 1 if <=4000 else 2
INSTALL_RATE      Treated as categorical
AGE               0 if <=24 else 1 if <=40 else 2 if <=55 else 3
NUM_CREDITS       0 if <=1 else 1 if <=2 else 3
NUM_DEPENDENTS    Treated as categorical
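The threshold rules in Table 4 can be applied with a simple lookup. The Python sketch below is an illustration, not the procedure the authors used to produce GermanCredit.xls; the names `EDGES` and `bin_value` are hypothetical. It covers DURATION, AMOUNT and AGE, whose bin labels are consecutive integers. NUM_CREDITS is left out because its last label in Table 4 does not follow that pattern.

```python
from bisect import bisect_left

# Upper bin edges taken from Table 4; a value x falls into the first bin whose edge is >= x.
EDGES = {
    "DURATION": [11, 17, 24],
    "AMOUNT": [2000, 4000],
    "AGE": [24, 40, 55],
}


def bin_value(attribute, x):
    """Return the bin index of x: 0 if x <= first edge, 1 if x <= second edge, and so on."""
    return bisect_left(EDGES[attribute], x)


print([bin_value("DURATION", d) for d in (6, 11, 12, 24, 36)])  # [0, 0, 1, 2, 3]
print(bin_value("AMOUNT", 2500))                                 # 1
print(bin_value("AGE", 67))                                      # 3
```

Using `bisect_left` keeps the "less than or equal" boundaries of Table 4: a value equal to an edge stays in the lower bin.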

Figure 4. Contents of NBayes.m


clear; clc;
data = xlsread('GermanCredit.xls');
p = input('Percentage of Training Set: ');   % entered as a fraction, e.g. 0.7
attr = data(:,[2 4 3 12]);                   % the four selected attributes
trgt = data(:,32);
trainingError = zeros(1,10);
testError = zeros(1,10);
for i = 1:10
    % Partitioning data into training and test sets.
    v = randperm(size(attr,1));
    trainingSet = attr(v(1:round(size(attr,1)*p)),:);
    trainingClass = trgt(v(1:round(size(attr,1)*p)),:);
    testSet = attr(v(round(size(attr,1)*p)+1:end),:);
    testClass = trgt(v(round(size(attr,1)*p)+1:end),:);

    % Applying Naive Bayes.
    nb = NaiveBayes.fit(trainingSet,trainingClass);
    trainingLabel = predict(nb,trainingSet);
    testLabel = predict(nb,testSet);

    % Error rate = share of observations whose predicted and actual class differ.
    trainingError(i) = length(find(trainingLabel - trainingClass))/(size(trainingSet,1));
    testError(i) = length(find(testLabel - testClass))/(size(testSet,1));
end
display(mean(trainingError));
display(mean(testError));

Figure 5. Pruning Level vs. Error Rate

Figure 6. Pruning Level vs. Number of Terminal Nodes

Figure 7. Complexity Cost vs. Number of Terminal Nodes

Figure 8. Total Cost vs. Pruning Level

Figure 9. Rules of the Best Tree


Decision tree for classification
 1  if x1 in {0 1} then node 2 elseif x1 in {2 3} then node 3 else 1
 2  if x3 in {0 1 2} then node 4 elseif x3 in {3 4} then node 5 else 1
 3  if x23=0 then node 6 elseif x23=1 then node 7 else 1
 4  if x18=0 then node 8 elseif x18=1 then node 9 else 0
 5  if x2<2.5 then node 10 elseif x2>=2.5 then node 11 else 1
 6  if x12 in {0 1} then node 12 elseif x12 in {2 3 4} then node 13 else
 7  if x12 in {0 2} then node 14 elseif x12 in {1 3 4} then node 15 else
 8  if x11 in {0 1} then node 16 elseif x11 in {2 3 4} then node 17 else
 9  class = 1
10  if x11 in {0 1 3} then node 18 elseif x11 in {2 4} then node 19 else
11  if x11 in {0 2} then node 20 elseif x11 in {1 4} then node 21 else 1
12  if x2<2.5 then node 22 elseif x2>=2.5 then node 23 else 1
13  class = 1
14  if x13<1.5 then node 24 elseif x13>=1.5 then node 25 else 0
15  if x4=0 then node 26 elseif x4=1 then node 27 else 1
16  if x12 in {0 3} then node 28 elseif x12 in {1 2 4} then node 29 else
17  if x10<1.5 then node 30 elseif x10>=1.5 then node 31 else 1
18  if x23=0 then node 32 elseif x23=1 then node 33 else 1
19  class = 1
20  if x26<2.5 then node 34 elseif x26>=2.5 then node 35 else 0
21  class = 1
22  if x17=0 then node 36 elseif x17=1 then node 37 else 1
23  class = 0
24  class = 1
25  if x2<2.5 then node 38 elseif x2>=2.5 then node 39 else 0
26  class = 1
27  class = 0
28  if x10<0.5 then node 40 elseif x10>=0.5 then node 41 else 1
29  if x25=0 then node 42 elseif x25=1 then node 43 else 0
30  if x10<0.5 then node 44 elseif x10>=0.5 then node 45 else 1

1 1 0 1

31  class = 0
32  if x27 in {1 2} then node 46 elseif x27 in {0 3} then node 47 else 1
33  class = 0
34  class = 0
35  class = 1
36  class = 1
37  class = 0
38  if x9=0 then node 48 elseif x9=1 then node 49 else 1
39  class = 0
40  if x1=1 then node 50 elseif x1=0 then node 51 else 0
41  if x26<1.5 then node 52 elseif x26>=1.5 then node 53 else 1
42  class = 0
43  if x4=0 then node 54 elseif x4=1 then node 55 else 0
44  if x2<1.5 then node 56 elseif x2>=1.5 then node 57 else 1
45  class = 1
46  class = 1
47  if x26<1.5 then node 58 elseif x26>=1.5 then node 59 else 1
48  class = 1
49  class = 0
50  class = 1
51  class = 0
52  class = 1
53  class = 0
54  if x19 in {2 3} then node 60 elseif x19 in {1 4} then node 61 else 1
55  class = 0
56  class = 1
57  if x7=0 then node 62 elseif x7=1 then node 63 else 0
58  class = 0
59  class = 1
60  if x2<1.5 then node 64 elseif x2>=1.5 then node 65 else 0
61  if x8=0 then node 66 elseif x8=1 then node 67 else 1
62  class = 0
63  class = 1
64  class = 1
65  class = 0
66  if x14=0 then node 68 elseif x14=1 then node 69 else 1
67  class = 0
68  class = 1
69  class = 0

Figure 10. Contents of DecisonTree.m


clear; clc;
data = xlsread('GermanCredit.xls');
attr = data(:,2:31);
trgt = data(:,32);
v = randperm(size(attr,1));
p = input('Percentage of Training Set: ');   % entered as a fraction, e.g. 0.7

% Partitioning data into training and test sets.
trainingSet = attr(v(1:round(size(attr,1)*p)),:);
trainingClass = trgt(v(1:round(size(attr,1)*p)),:);
testSet = attr(v(round(size(attr,1)*p)+1:end),:);
testClass = trgt(v(round(size(attr,1)*p)+1:end),:);

% Fitting a decision tree to the training set.
t = classregtree(...
    trainingSet,trainingClass,...
    'method','classification',...
    'categorical',[1 3:9 11 12 14:21 23:25 27 29 30]...
    );

% Applying decision tree on both training and test sets
% with different pruning levels.
m = max(prunelist(t));
trainingLabel = t.eval(trainingSet,0:m);
testLabel = t.eval(testSet,0:m);

% Initializing error matrices.
trainingError = zeros(size(trainingLabel,2),1);
testError = zeros(size(testLabel,2),1);

% Calculating training error.
for i = 1:size(trainingLabel,2)
    trainingError(i) = length(find(str2num(cell2mat(trainingLabel(:,i))) - trainingClass))/size(trainingLabel,1);
end

% Calculating test error.
for i = 1:size(testLabel,2)
    testError(i) = length(find(str2num(cell2mat(testLabel(:,i))) - testClass))/size(testLabel,1);
end

% Plotting errors with respect to different pruning levels.
plot(0:m,trainingError,'r');
hold on
plot(0:m,testError,'b');
xlabel('Pruning Level');
ylabel('Error Rate');
legend('Training Set','Test Set');
hold off

% Creating best tree and minimum error tree.
[cost,secost,ntnodes,bestlevel] = test(t,'cross',testSet,testClass);
bestTree = prune(t,'level',bestlevel);
[minTestError,minTestErrorIndex] = min(testError);
% Index 1 corresponds to pruning level 0, hence the -1.
minErrorTree = prune(t,'level',minTestErrorIndex-1);

% Plotting # of terminal nodes vs pruning level.
figure;
plot(0:m,ntnodes);
xlabel('Pruning Level');
ylabel('# of Terminal Nodes');
title('# of Terminal Nodes vs. Pruning Level (Test Set)');

% Plotting # of terminal nodes vs cost.
figure;
plot(ntnodes,cost,'b');
xlabel('# of Terminal Nodes');
ylabel('Cost');
title('# of Terminal Nodes vs. Cost (Test Set)');

% Calculating the average misclassification cost at each pruning level.
oppCost = [0 500;100 0];
TotalCostTraining = zeros(1,size(trainingLabel,2));
TotalCostTest = zeros(1,size(testLabel,2));
for i = 1:size(trainingLabel,2)
    CTraining = confusionmat(trainingClass,str2num(cell2mat(trainingLabel(:,i))));
    TotalCostTraining(i) = sum(sum(oppCost .* CTraining)) / size(trainingLabel,1);
    CTest = confusionmat(testClass,str2num(cell2mat(testLabel(:,i))));
    TotalCostTest(i) = sum(sum(oppCost .* CTest)) / size(testLabel,1);
end
[MinCostTest,IX] = min(TotalCostTest);
MinCostTree = prune(t,'level',IX-1);

% Plotting average cost vs pruning level and marking the minimum on the test set.
figure;
plot(0:size(trainingLabel,2)-1,TotalCostTraining,'r');
hold on
plot(0:size(testLabel,2)-1,TotalCostTest,'b');
plot(IX-1,MinCostTest,'*');
legend('Training','Test');
xlabel('Pruning Level');
ylabel('Total Cost');
title('Total Cost vs. Pruning Level');
view(prune(t,'level',IX-1));

Table 5. Confusion Matrix for Naive Bayes

                         Predicted
            Actual     Bad     Good
Training    Bad        127       84
            Good        92      397
Test        Bad         45       44
            Good        32      179

Table 6. Confusion Matrix for Decision Tree

                         Predicted
            Actual     Bad     Good
Training    Bad        134       34
            Good        70      462
Test        Bad         47       39
            Good        49      165

