
PROJECT REPORT

Loan Defaulter Prediction


Submitted towards the partial fulfillment of the criteria for the award of the Genpact Data Science Prodegree by Imarticus

Submitted By:

Puja Lonkar (Roll #IL012536)


Trupti Joke (Roll #IL013171)
Sneha Bhalerao (Roll #IL004011)

Course and Batch


DSP-18 Pune November 2019
Abstract

Loan default is always a threat to any financial institution and should be predicted in
advance based on various features of the applicant. This study applies machine learning
models, including decision trees, logistic regression and random forests, to classify
applicants with and without loan default from a group of predictor variables and to
evaluate their performance. The study finds that logistic regression is the best model for
classifying applicants who default on their loans.

The recent significant increase in loan defaults has generated interest in understanding
the key factors that predict the non-performance of these loans. However, despite the large
size of the loan market, existing analyses have been limited by data. This report is based
on a model created to predict the repayment of loans issued by XYZ Corp. The detailed
analysis is carried out on a dataset of 855969 observations and 73 features, each
observation containing the characteristics of an individual borrower. After exploring the
data by calculating summary and descriptive statistics, and by creating visualizations,
several potential relationships between borrower characteristics and loan default were
identified. A predictive model was then created to identify loan defaulters. The efficiency
of the model was improved by identifying the most valuable and contributing features
among those available in the dataset. This report therefore documents the steps followed
during the implementation of the machine learning models for loan defaulter prediction
and presents the relevant descriptive statistics and content-rich visualizations.

Acknowledgements

We are using this opportunity to express our gratitude to everyone who supported us
throughout the course of this group project. We are thankful for their aspiring guidance,
invaluably constructive criticism and friendly advice during the project work. We are
sincerely grateful to them for sharing their truthful and illuminating views on a number
of issues related to the project.

Further, we were fortunate to have _________________ as our mentor. He/She readily
shared his/her immense knowledge in data analytics and guided us in a manner that
enhanced our data skills.

We wish to thank all the faculty, as this project utilized knowledge gained from every
course in the DSP program.

We certify that the work done by us for conceptualizing and completing this project is
original and authentic.

Date: November 30, 2019 1) Puja Lonkar

Place: Pune 2) Sneha Bhalerao

3) Trupti Joke

Certificate of Completion

I hereby certify that the project titled “Loan Defaulter Prediction” was undertaken and
completed under my supervision by Puja Lonkar, Sneha Bhalerao and Trupti Joke
from the batch of DSP 18 (Nov 2019).

Mentor: Mentor Name (if any)

Date: November 30, 2019

Place – Pune

Table of Contents
Abstract
Acknowledgements
Certificate of Completion
CHAPTER 1: INTRODUCTION
    1.1 Title & Objective of the study
    1.2 Need of the Study
    1.3 Business Model of Enterprise
    1.4 Data Sources
    1.5 Tools & Techniques
    1.6 Infrastructure Challenges
CHAPTER 2: DATA PREPARATION AND UNDERSTANDING
    2.1 Phase I – Data Extraction and Cleaning
    2.2 Phase II – Feature Engineering
    2.3 Data Dictionary
    2.4 Exploratory Data Analysis
CHAPTER 3: FITTING MODELS TO DATA
    4.1 CLASSIFICATION MODELS
CHAPTER 5: KEY FINDINGS
CHAPTER 6: RECOMMENDATIONS AND CONCLUSION
CHAPTER 7: REFERENCES

List of Figures

List of Tables

CHAPTER 1: INTRODUCTION

1.1 Title & Objective of the study


The objective of this project is to build a predictive model for the individual assessment
of loan applications, to determine whether an applicant will default or not. The model must
be based only on the objective financial data given by the applicant. This study is intended
to help XYZ Corp. at the conclusion of underwriting. The primary goal of the project is to
develop a specific tool that supports the decision "Should the investors lend to a
particular customer or not?".

1.2 Need of the Study

Peer-to-peer (P2P) lending is a way to borrow without using a traditional bank or credit
union. For applicants with a good credit score (often a FICO credit score higher than 720),
P2P loan rates can be surprisingly low. With less-than-perfect credit, an applicant still has
a decent shot at being approved for an affordable loan with online lenders such as XYZ
Corp. These loans are made by individuals and investors, as opposed to loans that come
from a bank. People with extra funds offer to lend that money to others (individuals and
businesses) in need of cash. A P2P service (such as XYZ Corp.) provides the "bridge"
between investors and borrowers so that the process is relatively easy for all involved.

1.3 Business Model of Enterprise

As easy as these financial loans look, the risk associated with them is immense. Lending
and collection are profoundly affected by the market situation and the economic
environment, which can fluctuate unpredictably. This risk, faced by every individual or
organization such as our subject of interest XYZ Corp., gives rise to the need for a
predictive study of the situation.

1.4 Data Sources

We received the data from our official Imarticus portal as part of the curriculum. The
given data is used purely for academic purposes, to practice and build our knowledge of
current industrial and analytical tools and subjects. The characteristics of the data we
received for analysis are as follows: the data was supplied in .txt format and was imported
into Python using the pandas module as a tab-delimited file. The provided dataset
corresponds to all loans issued to individuals from 2007 to 2015.
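As an illustration, a minimal sketch of the import step is shown below, assuming the raw file is named XYZCorp_LoanData.txt (the actual file name supplied on the portal may differ):

```python
import pandas as pd

# Assumed file name; the tab-delimited text file from the portal may be named differently.
df = pd.read_csv("XYZCorp_LoanData.txt", sep="\t", low_memory=False)

print(df.shape)  # expected to be (855969, 73)
```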

The dataset has 855969 observations and 73 features. The data contains various
indicators of default, such as payment information and credit history. For ease of
computation, customers under 'Current' status have been treated as non-defaulters. We
have also been provided with a data dictionary that describes the features. We use data
from 2007-2015 because most of the loans from that period have already been repaid or
defaulted on. Moreover, we split the dataset into train and test sets chronologically, in
keeping with its time-series nature.

1.5 Tools & Techniques

Tools: We are using Jupyter Notebook 6.0.0 as the development environment for this
project, along with Python 3.7 and the NumPy, pandas, Matplotlib, Seaborn, scikit-learn,
SciPy and Plotly libraries.

Techniques: Logistic Regression, Random Forest Classifier and Gradient Boosting
Classifier, along with other classification and analytical algorithms.

1.6 Infrastructure Challenges


1. The modus operandi of credit rating used by banks and rating agencies.
2. The impact of an economic downturn on the behaviour of borrowers as well as lenders.
3. The mode of calculation of the default probability.

CHAPTER 2: DATA PREPARATION AND UNDERSTANDING

One of the first steps we engaged in was to outline the sequence of steps that we would
follow for the project. Each of these steps is elaborated below.

2.1 Phase I – Data Extraction and Cleaning:


• Missing Value Analysis and Treatment
In our dataset the target shows that 94% of applicants have not defaulted and 6% are
defaulters or charged off, so this is clearly an imbalanced dataset.
The first issue was to determine whether the columns were filled with useful information
or were mostly empty. Data exploration uncovered many empty or almost empty columns,
which were removed from the dataset because it would be difficult to go back and fill in
values that did not seem necessary at the time of the loan application.
Our dataset has 855969 rows × 73 features including the target, out of which 32 columns
have missing or NaN values. A missing-value plot of these columns gives some insights.

Insights: the missing-value plot shows that there are 21 columns in the dataset where
all the values are NA.
Since there are 855969 observations and 73 columns in the dataset, it is impractical to
inspect each column one by one for NA or missing values. Instead, we find all columns
where the proportion of missing values exceeds a certain threshold, say 40%, and remove
them, as it is not feasible to impute missing values for those columns.
Out of 73 features we kept only 52: the roughly 21 features with more than 40% missing
values were removed, since they would add nothing to further exploration.
• We need to pay especially close attention to data leakage, which can cause the
model to overfit: the model would also be learning from features that would not be
available when it is used to make predictions on future loans.
• We checked the variance and removed low-variance variables (excluding our
target). However, since this is a sensitive problem, we removed variables based on
our own discretion and business knowledge. We dropped irrelevant unique-ID
columns such as "id" and "member_id" because they do not provide any useful
information about the customer. As the last two digits of the zip code are masked
as 'xx', we removed that column as well.
• We still had five more features with null values (emp_length, revol_util,
tot_coll_amt, tot_cur_bal, total_rev_hi_lim), which we imputed using the fillna
method with an appropriate statistic (see the sketch after this list).

• Handling Outliers
• Feature Extraction
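A minimal sketch of the missing-value treatment described above, assuming the dataframe `df` from the import step and column names taken from the data dictionary:

```python
# Drop columns where more than 40% of the values are missing.
missing_pct = df.isnull().mean() * 100
df = df.drop(columns=missing_pct[missing_pct > 40].index)

# Drop identifier-like columns and the masked zip code (column names assumed from the data dictionary).
df = df.drop(columns=["id", "member_id", "zip_code"], errors="ignore")

# Impute the five remaining sparse columns with a simple statistic:
# mode for the categorical employment length, median for the numeric columns.
df["emp_length"] = df["emp_length"].fillna(df["emp_length"].mode()[0])
for col in ["revol_util", "tot_coll_amt", "tot_cur_bal", "total_rev_hi_lim"]:
    df[col] = df[col].fillna(df[col].median())
```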

Decide On A Target Column

Now, let's decide on the appropriate column to use as the target for modeling, keeping in
mind that the main goal is to predict who will pay off a loan and who will default. From the
column descriptions and a preview of the DataFrame, we learned that 'default_ind' is the
only field in the main dataset that describes the loan status, so we use this column as the
target. 94% of records have not defaulted (fully paid) and 6% are defaulters or charged off.
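A quick check of the class balance on the chosen target, assuming the cleaned dataframe `df`:

```python
# Roughly 94% of records are expected to be non-defaulters and ~6% defaulters/charged off.
print(df["default_ind"].value_counts())
print(df["default_ind"].value_counts(normalize=True) * 100)
```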

9
2.2 Phase II - Feature Engineering
Mapping:
We map observations with issue dates from "Jun-2015" to "Dec-2015" as the test set,
using a dictionary, to make the time-based split of the data straightforward.
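A sketch of that dictionary-based, time-ordered split, assuming the issue date column is named issue_d and holds string values such as 'Jun-2015':

```python
# Months mapped to the out-of-time test set.
test_months = ["Jun-2015", "Jul-2015", "Aug-2015", "Sep-2015",
               "Oct-2015", "Nov-2015", "Dec-2015"]
split_map = {month: "Test" for month in test_months}

# Everything not in the mapping falls back to the training set.
df["data_split"] = df["issue_d"].map(split_map).fillna("Train")

train_df = df[df["data_split"] == "Train"].drop(columns="data_split")
test_df = df[df["data_split"] == "Test"].drop(columns="data_split")
```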

Correlation:

Finding the correlation between variables


We will now look at the correlation structure between the variables we selected above.
This will reveal dependencies between different variables and help us reduce the
dimensionality a little further.
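A sketch of how such a correlation heatmap can be produced with Seaborn, using the numeric columns of the cleaned dataframe:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations between the numeric candidate features.
corr = df.select_dtypes(include="number").corr()

plt.figure(figsize=(14, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation heatmap of numeric features")
plt.tight_layout()
plt.show()
```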

The variables checked for correlation are:

Feature Engineering and Feature selection among the correlated columns:

From the correlation heatmap, the following groups of columns are found to be highly correlated:


1. Funded amount, loan amount and funded amount by investors: we are concerned with the
funded amount and the loan amount. The amount funded by investors has no direct impact
on whether the customer will default or not; it is an irrelevant field, so we drop it. As the
funded amount directly affects the customer's defaulter status, we keep it. However, if the
loan amount is higher than the funded amount, there is a chance that the customer may not
be able to repay the funded amount, since his plans may fail if the loan amount (the amount
needed to succeed at his task) is far greater than the amount funded by the lender.
Therefore we take the difference between the loan amount and the funded amount for
analysis, as it could have a direct impact on the customer's defaulter status (see the sketch
after this list).

2. Installment amount, funded amount, loan amount and funded amount by investors: the
installment is directly proportional to the funded amount / loan amount / funded_amnt_inv.
We can instead calculate the funded amount to installment ratio for our analysis, and
therefore remove the installment field itself.

3. Total revolving high credit limit and revolving balance: the total revolving high credit
limit is the maximum credit that the customer holds across all accounts, whereas the
revolving balance is the available credit left that the customer can use. The high limit by
itself does not play a role: even if the limit is 100000 per month but the customer's
expenses exceed 100000, the field provides no insight into defaulter status, since a higher
credit limit could go with either high or low expenses, and without that factor we cannot
analyse the impact of this field. The revolving balance is more useful: it tells us the
available balance across all accounts that the person can use. If the balance is consistently
low we can infer a higher probability of the customer becoming a defaulter, whereas if the
balance is consistently high we may infer a lower probability of default.

4. Outstanding principal amount by investors and outstanding principal amount: the
outstanding principal amount is the total remaining out of the principal amount funded,
and the outstanding principal by investors is the amount remaining of the principal funded
by investors. The customer does not know or care who funded the amount, and his defaulter
status will never depend on the amount funded by investors, so we remove the outstanding
principal funded by investors from the analysis. We retain the total outstanding principal,
however, as the principal the customer is still paying matters to him and may have a direct
impact on his defaulter status.

5. Total payment by investors, total payment and total received principal: of these, the
total payment received from the amount invested by investors has no impact on whether
the person will default, as the customer has no knowledge of or concern about who funded
the amount, so we remove this column. The total received principal and the outstanding
principal capture the same information from different angles, so we can retain either of
the two and we remove the total received principal. Since the other collinear columns are
removed, the total payment received can be safely kept for analysis: it includes both the
principal and the interest received, and how much the customer is repaying at any point in
time may matter to whether he defaults. Therefore we retain this column.

6. Total payment and total received interest are correlated. As we have already retained
the outstanding principal amount and the funded amount, together with the total payment
column they already capture the effect of the interest received, so there is no need to
retain the total received interest. We therefore retain only the total payment column.
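A sketch of the derived features and column drops motivated by points 1-6 above, with column names assumed from the data dictionary (loan_amnt, funded_amnt, funded_amnt_inv, installment, out_prncp_inv, total_pymnt_inv, total_rec_prncp, total_rec_int, total_rev_hi_lim); adjust if the supplied file uses different labels:

```python
# 1. Gap between the amount requested and the amount actually funded.
df["loan_funded_diff"] = df["loan_amnt"] - df["funded_amnt"]

# 2. Funded amount to installment ratio, replacing the collinear installment column.
df["funded_to_installment"] = df["funded_amnt"] / df["installment"]

# Drop the redundant, highly correlated columns discussed above.
redundant = ["funded_amnt_inv", "installment", "total_rev_hi_lim",
             "out_prncp_inv", "total_pymnt_inv", "total_rec_prncp", "total_rec_int"]
df = df.drop(columns=redundant, errors="ignore")
```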

2.3 Data Dictionary:

DataDictionary-14-10-17.xlsx (data dictionary file provided with the dataset)

2.4 Exploratory Data Analysis:


Categorical variables:
Univariate Analysis:
1. Check number of unique values.

Bivariate Analysis:
2. Check the distribution of the data with respect to the default indicator:

3. Check the percent of defaulters per category:

   Percent for a category = (number of records in the category with a given default
   indicator value (1 or 0) / total number of records in that category) * 100

4. Chi Square test:

Depending on the percentile plots and the chi-squared test we decided whether to retain
or discard each column from the analysis. For instance, sub_grade appears to have a
notable impact on the defaulter status, so this column is retained (a sketch of the
per-category default rate and the chi-square test follows).
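A minimal sketch of the per-category default rate and the chi-square test, assuming default_ind is the 0/1 target and sub_grade is one of the categorical columns:

```python
import pandas as pd
from scipy.stats import chi2_contingency

def default_rate_by_category(data, col, target="default_ind"):
    # Percent of defaulters within each level of a categorical column.
    return (data.groupby(col)[target].mean() * 100).sort_values(ascending=False)

def chi_square_test(data, col, target="default_ind"):
    # Chi-square test of independence between the category and the default flag.
    chi2, p_value, dof, _ = chi2_contingency(pd.crosstab(data[col], data[target]))
    return chi2, p_value

print(default_rate_by_category(df, "sub_grade").head())
print(chi_square_test(df, "sub_grade"))
```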

Numeric Variable Analysis:


Univariate analysis:
1. Histogram: with respect to default indicator

Bivariate Analysis:
1. Boxplot:

2. Percentile Analysis:

Using percentile analysis and boxplots we decided whether to remove or retain each
column, depending on the impact on the default indicator seen visually. For example, the
int_rate column appears to have a notable impact on the defaulter status, so we have
retained this column (see the sketch below).
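A sketch of the boxplot and percentile analysis for one numeric column, assuming int_rate is numeric in the cleaned dataframe:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Boxplot of interest rate split by the default indicator.
sns.boxplot(x="default_ind", y="int_rate", data=df)
plt.title("Interest rate vs. default indicator")
plt.show()

# Percentile analysis: bucket int_rate into quintiles and compare default rates.
df["int_rate_bucket"] = pd.qcut(df["int_rate"], q=5,
                                labels=["0-20", "20-40", "40-60", "60-80", "80-100"])
print(df.groupby("int_rate_bucket")["default_ind"].mean() * 100)
```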

Multivariate analysis:
1. We used various columns, such as the computed percentile columns and the existing
categorical columns, to get insights into the data. For example, among applicants with
annual income above the 80th percentile of the annual income range, the number of
defaulters is highest in Illinois compared with other states.

CHAPTER 3: FITTING MODELS TO DATA


Logistic Regression
Logistic regression, despite its name, is a linear model for classification rather than
regression. Logistic regression is also known in the literature as logit regression,
maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the
probabilities describing the possible outcomes of a single trial are modeled using
a logistic function.

Decision Trees
A decision tree classifier is used to model the data. The maximum depth parameter of the
tree is tuned to get the best results possible from the model. We tried maximum depths of
4, 6, 8, 9, 10 and 12. Although max depth 12 gave the best results on the test 1 and
validation sets, it performed poorly on the final test set and was clearly overfitting.
Trading off bias against variance, we chose the max-depth 6 decision tree.
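A sketch of that depth sweep, assuming X_train/y_train and the hold-out features/labels (named X_valid/y_valid here) come from the time-based split and encoding steps described earlier:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score

# Try each candidate depth and compare precision/recall on the hold-out set.
for depth in [4, 6, 8, 9, 10, 12]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    pred = tree.predict(X_valid)
    print(depth,
          "precision:", round(precision_score(y_valid, pred), 4),
          "recall:", round(recall_score(y_valid, pred), 4))
```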

Random forest classifier.
A random forest is a meta estimator that fits a number of decision tree classifiers on
various sub-samples of the dataset and uses averaging to improve the predictive
accuracy and control over-fitting. The sub-sample size is always the same as the original
input sample size but the samples are drawn with replacement.

Our main aim was to have a good trade-off between precision and recall so as to predict
the defaulters accurately without much loss of customers to the bank.

4.1 CLASSIFICATION MODELS

4.1.1 First Logistic Regression Model

We tried a logistic regression model with no regularization. Since the model was not
overfitting, no regularization was applied.
Test set 1:

Cross validation set:

On the final test set it performed as below:

The results with the default threshold of 0.5 were unsatisfactory, so we used the
precision-recall vs. threshold graph to set the classification threshold.

At a threshold of 0.4 the precision and recall are both as high as they can be given the
trade-off between the two (a sketch follows).
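A sketch of the threshold selection via the precision-recall curve, assuming the same X_train/y_train and X_valid/y_valid splits as above (penalty=None requires scikit-learn 1.2+; older versions use penalty='none'):

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# Unregularized logistic regression.
logit = LogisticRegression(penalty=None, max_iter=1000)
logit.fit(X_train, y_train)

# Probability of the default class, then inspect the precision/recall trade-off.
probs = logit.predict_proba(X_valid)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_valid, probs)

plt.plot(thresholds, precision[:-1], label="precision")
plt.plot(thresholds, recall[:-1], label="recall")
plt.xlabel("threshold")
plt.legend()
plt.show()

# Classify with the chosen threshold of 0.4 instead of the default 0.5.
pred = (probs >= 0.4).astype(int)
```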

4.1.2 First Decision Tree model

We tried the default tree depth. No hyperparameters were tuned for this first model, in
order to check the performance of the default settings.

Test set 1:

Cross validation set:

4.1.3 Second Decision tree model


Here we set the max_depth hyperparameter to 4.
Test set 1:

Cross validation set:

4.1.4 Third Decision Tree model


Here we set the max_depth hyperparameter to 6.
Test set 1:

Cross validation set:

On the final test set it performed as below:

4.1.5. Fourth Decision Tree model
Here we set the max_depth hyperparameter to 8.
Test set 1:

Cross validation set:

4.1.6. Fifth Decision Tree model


Here we set the max_depth hyperparameter to 9.
Test set 1:

Cross validation set:

4.1.7. Sixth Decision Tree model


Here we set the max_depth hyperparameter to 10.
Test set 1:

Cross validation set:

4.1.8. Seventh Decision Tree model
Here we set the max_depth hyperparameter to 12.
Test set 1:

Cross validation set:

Out of all the decision tree models, the tree with max depth 6 performed well without
overfitting much, compared with the max-depth 12 tree, which showed good performance
on the validation and test 1 sets but not on the final test set.

4.1.9. Eighth Decision Tree model:


This model uses the oversampled training set together with test set 1 and the cross-validation set.
Test set 1:

Cross validation set:

Final test set:

4.1.10. Random Forest Model

Here we did not tune the hyperparameters; we used a random forest with default settings
simply to check whether it would perform better than the decision tree, for our own
understanding.
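A sketch of that baseline, assuming the same train/validation splits as before:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Default-setting random forest, used only as a baseline comparison.
rf = RandomForestClassifier(random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
print(classification_report(y_valid, rf.predict(X_valid)))
```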
On the intermediate test1 and validation set it performed as below:
Test set1:

Cross validation set:

On the final set it performed as below:

4.1.11. Model Validation


• The imbalance in the target category of loan repayment in the dataset is due to the
fact that about 82 out of 100 loans were repaid. This indicates that money could be lent
continuously (always predicting that the borrower would repay) and the prediction would
be correct about 82.07% of the time.
• The cross-validation scores and ROC curves suggest that logistic regression is the best model.

• If we look at the confusion matrix, though, we see a big problem. The model can predict who
is going to pay off the loan with a good accuracy of 99%, but it cannot predict who is going
to default: the true positive rate for the default class is almost 0. Since our main goal is
to predict defaulters, we have to address this. The likely reason is the high imbalance in
our dataset, which leads the algorithm to put almost everything into the majority class.

• We cannot perform grid search as the running time is very high.

• Due to the class imbalance we used stratified sampling for logistic regression, and both
oversampling and stratified sampling for the decision trees. The decision trees performed
better in the oversampled setting, though compared with logistic regression the
performance was still not up to the mark (a sketch of the sampling approach follows).
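A sketch of the stratified split and the naive oversampling of the minority class, assuming X and y are the prepared training features and the default_ind target as pandas objects:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Stratified split keeps the ~94/6 class ratio in both partitions.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

# Naive random oversampling of the defaulter class for the tree models.
train = pd.concat([X_tr, y_tr], axis=1)
majority = train[train["default_ind"] == 0]
minority = train[train["default_ind"] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
train_balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=42)
```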

It appears from the graph that the decision tree model overfits the data.

CHAPTER 5: KEY FINDINGS
Significant variables identified in the linear models were also used in the decision trees
and the random forest.
The table below provides a snapshot of the various models, from which the business can
choose based on the pros and cons of each.
Model                 Accuracy   Precision   Recall   F1 Score
Logistic Regression   0.9974     0.2912      0.7717   0.4229
Decision Tree         0.6207     0.0029      0.9421   0.0059
Random Forest         0.9678     0.0272      0.7363   0.0525

Below are some of the key findings:

• Logistic Regression:
We are concentrating mostly on recall, and logistic regression performs well on it.
Even though its recall is not as good as that of the decision trees, its precision is
higher, which is also of great importance to us, as we wanted a good trade-off between
Type I and Type II errors.
• Decision Trees:
Even though the recall of the decision trees is higher than that of logistic regression,
the precision is far too low. This would result in losing a great many genuine clients.
Therefore, we cannot use the decision trees for the loan defaulter model.

CHAPTER 6: RECOMMENDATIONS AND CONCLUSION

We have successfully built a machine learning model to predict the people who might
default on their loans.
We might also want to look at other techniques or variables to improve the predictive
power of the algorithm. One of the drawbacks is the limited number of people who
defaulted on their loans in the years of data (2007-2015) present in the dataset. We
could use an updated data frame containing the next three years of values (2015-2018)
and see how many of the current loans were paid off, defaulted or charged off. These new
data points could then be scored by the model, or even used to retrain it and improve its
accuracy.

Since we had a lot of categorical data, PCA was not a suitable choice, so we did not use
it for dimensionality reduction.

We also wanted to learn about the data in depth and make decisions according to the
business significance of the columns, so we did not use any automated feature selection
technique; instead, based on the EDA and our understanding of the business meaning of
the columns, we performed manual feature selection.

Since the algorithm places around 584 non-defaulting clients in the default class, we
might want to look further into this issue to make the model more robust.

Business Insights and Recommendations:

From the analysis it is evident that banks should grant loans according to the sub-grade
assigned to the client's loan application. Clients with lower sub-grades have a higher
tendency to default on their loans.

The analysis shows that even though banks currently grant more 36-month term loans and
fewer 60-month term loans, a careful eye should be kept on 60-month term applications,
as they have a higher tendency to default.

People with home ownership recorded as OTHER tend to default more, so banks can take
other considerations into account while granting them loans.

It is better if banks grant loans only after source verification of the person's income and
other documents is done, as this reduces the probability of default.

Even though the percentage of loan applicants in the small business and educational
categories is small, banks should be extra vigilant while granting them loans, as they
tend to default more.

It appears from the analysis that Texas, California, New York and Florida have the highest
numbers of loan applicants. We need to investigate the total percentage of defaulters in
the data for each state.

It is clear from the analysis that the percentage of defaulters varies from state to state,
with the minimum in NE and the maximum in IA, followed by ID and NV. So, when someone
from these states applies for a loan, banks could check the other defaulter behaviour
parameters stated above in order to decide whether to grant the loan.

It is seen that as the interest rate increases the default rate also increases, with the
0-20 percentile bucket at the minimum and the 60-80 percentile bucket at the maximum.
The default rate decreases as annual income rises from 0 to the maximum of $364,000, so
people with lower annual incomes should be verified on the other parameters stated above
before loans are granted.

Clients with a debt-to-income ratio below 218.4 are found to default more, and within
that range there is also a trend of decreasing defaulters from the low end to the maximum.

It appears that the chances of default are higher in the category of fewer than 15 total
credit lines in the borrower's credit file, and the chance of being a defaulter decreases
as we move towards 62 or more accounts. This trend is consistent with the trend we saw
with the outliers in the data. So, we can safely assume that as the total number of credit
lines of a user increases, the chance of default reduces slightly.

CHAPTER 7: REFERENCES
1) https://www.kaggle.com/deepanshu08/prediction-of-lendingclub-loan-defaulters
2) https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
3) https://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score
4) https://scikit-learn.org/stable/model_selection.html#model-selection
5) https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier
