PROJECT REPORT
Submitted By:
Loan default is always a threat to any financial institution and should be predicted in advance from the various features of an applicant. This study applies machine learning models, including logistic regression, decision trees and random forests, to classify applicants into loan defaulters and non-defaulters from a set of predictor variables, and evaluates their performance. The study finds that logistic regression is the best model for classifying applicants who will default on a loan.
The recent significant increase in loan defaults has generated interest in understanding the key factors that predict the non-performance of these loans. However, despite the large size of the loan market, existing analyses have been limited by data. This report is based on a model created to predict the repayment of loans at XYZ Corp. The detailed analysis is carried out on the given data set of 855969 observations and 73 features, each observation containing specific characteristics of the person borrowing the loan. After exploring the data by calculating summary and descriptive statistics, and by creating visualizations, several potential relationships between applicant characteristics and loan default were identified. A predictive model was then created to identify loan defaulters. Later, a regression model was created to predict the repayment rate of a loan from its features. The efficiency of the models was increased by identifying the most valuable and contributing features among those given in the data set. This report therefore documents the steps followed during the implementation of the machine learning models for loan defaulter prediction and presents descriptive statistics and content-rich visualizations.
Acknowledgements
Further, we were fortunate to have _________________ as our mentor, who readily shared immense knowledge of data analytics and guided us in a manner that enhanced our data skills.
We wish to thank all the faculty, as this project drew on knowledge gained from every course in the DSP program.
We certify that the work done by us for conceptualizing and completing this project is
original and authentic.
3) Trupti Joke
Certificate of Completion
I hereby certify that the project titled “Loan Defaulter Prediction” was undertaken and completed under my supervision by Puja Lonkar, Snaha Bhalerao and Trupti Joke from the batch of DSP 18 (Nov 2019).
Place – Pune
Table of Contents [Sample purpose only]
Abstract
Acknowledgements
Certificate of Completion
CHAPTER 1: INTRODUCTION
1.1 Title & Objective of the study
1.2 Need of the Study
1.3 Business or Enterprise under study
1.4 Business Model of Enterprise
1.4 Data Sources
1.5 Tools & Techniques
1.6 Infrastructure Challenges
CHAPTER 2: DATA PREPARATION AND UNDERSTANDING
2.1 Phase I – Data Extraction and Cleaning
2.2 Phase II – Feature Engineering
2.3 Data Dictionary
2.4 Exploratory Data Analysis
CHAPTER 3: FITTING MODELS TO DATA
4.1 LINEAR REGRESSION MODEL
4.1.1 First Linear Regression Model
4.1.2 Second Linear Regression Model
4.1.3 Third Linear Regression Model
4.2 RANDOM FOREST
4.2.1 INFRASTRUCTURE CHALLENGES
CHAPTER 5: KEY FINDINGS
CHAPTER 6: RECOMMENDATIONS AND CONCLUSION
CHAPTER 7: REFERENCES
List of Figures
List of Tables
CHAPTER 1: INTRODUCTION
Commercial lending is a way to borrow without using a traditional bank or credit union. For applicants with a good credit score (often a FICO score higher than 720), P2P loan rates can be surprisingly low. With less-than-perfect credit, an applicant still has a decent chance of being approved for an affordable loan with online lenders such as XYZ Corporation. Financial loans are loans made by individuals and investors, as opposed to loans that come from a bank. People with extra funds offer to lend that money to others (individuals and businesses) in need of cash. A P2P service (such as XYZ Corp.) provides the “bridge” between investors and borrowers so that the process is relatively easy for all involved.
As easy as financial loans look, the risk associated with them is immense. Lending and collection are profoundly affected by the market situation and the economic environment, which can fluctuate unpredictably. For every individual or organization, such as our subject of interest XYZ Corp., this risk gives rise to the need for a predictive study of the situation.
1.4 Data Sources
We received the data from our official Imarticus portal as part of the curriculum. The data is used purely for academic purposes, to practice and build our knowledge of current industry analytical tools and subjects. The characteristics of the data we received for analysis are as follows: the data arrived in .txt format and was imported into Python with the pandas module as a tab-delimited file. The provided dataset corresponds to all loans issued to individuals from 2007 to 2015.
The dataset has 855969 observations and 73 features. The data contains various indicators of default, such as payment information and credit history. For ease of computation, customers under 'current' status have been treated as non-defaulters. We were also provided with a data dictionary that describes the features. We have data from 2007-2015 because most of the loans from that period have already been repaid or defaulted on. Moreover, we divide the data set into train and test in accordance with the time series only.
Tools: We are using the Jupyter Notebook 6.0.0 IDE for this project along with Python 3.7, NumPy, pandas, Matplotlib, Seaborn, scikit-learn, SciPy, Plotly and various other analytical tools.
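As a minimal sketch of the import step described above, the snippet below reads a tab-delimited extract with pandas. The file name and column names are hypothetical stand-ins for the provided XYZ Corp file; here a small in-memory string plays the role of the .txt file.

```python
import io
import pandas as pd

# Stand-in for the provided tab-delimited .txt file; in the project this
# would be pd.read_csv("loan_data.txt", sep="\t", low_memory=False).
sample = (
    "id\tloan_amnt\tissue_d\tdefault_ind\n"
    "1\t5000\tJun-2014\t0\n"
    "2\t2400\tDec-2015\t1\n"
)
df = pd.read_csv(io.StringIO(sample), sep="\t")  # tab-delimited import
print(df.shape)  # (2, 4)
```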
CHAPTER 2: DATA PREPARATION AND UNDERSTANDING
One of the first steps we undertook was to outline the sequence of steps to follow for the project. Each of these steps is elaborated below.
Insights: We can see from the plot above that there are 21 columns in the dataset where all the values are NA.
With 855969 observations and 73 columns in the dataset, it would be very difficult to inspect each column one by one to find the NA or missing values. So let us find all columns where missing values exceed a certain percentage, say 40%. We will remove those columns, as it is not feasible to impute missing values for them.
Out of 73 features we kept only 52: we removed the 21 features that had more than 40% missing values, since they would not contribute to further exploration.
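The 40%-missing-value filter described above can be sketched as follows. The toy frame and its column names are our own illustration, standing in for the 855969 x 73 dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "loan_amnt": [5000, 2400, 10000, 3600, 7500],
    "desc":      [np.nan, np.nan, np.nan, "txt", np.nan],  # 80% missing
    "int_rate":  [10.65, 15.27, np.nan, 13.49, 12.69],     # 20% missing
})

missing_frac = df.isna().mean()                # share of NAs per column
keep = missing_frac[missing_frac <= 0.40].index
df_clean = df[keep]                            # drop columns over the cut-off
print(list(df_clean.columns))  # ['loan_amnt', 'int_rate']
```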
• We need to pay especially close attention to data leakage, which can cause the model to overfit. The model would otherwise also learn from features that would not be available when using it to make predictions on future loans.
• We checked the variance and removed low-variance variables (excluding our target). However, since this is a sensitive problem, we also removed columns at our own discretion using business knowledge. Irrelevant unique-ID columns such as "id" and "member_id" were dropped because they provide no useful information about the customer. As the last two digits of the zip code are masked with 'xx', we can remove that column as well.
• We still had five features with null values (emp_length, revol_util, tot_coll_amt, tot_cur_bal, total_rev_hi_lim), which we imputed using the fillna method with the appropriate statistic.
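The fillna imputation above can be sketched like this. The choice of statistic per column (mode for the categorical emp_length, median for numeric fields) is our assumption about "the appropriate statistic"; the values are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "emp_length":  ["10+ years", np.nan, "2 years", "10+ years"],
    "revol_util":  [45.2, np.nan, 30.1, 88.0],
    "tot_cur_bal": [12000.0, 54000.0, np.nan, 8000.0],
})

# Categorical: fill with the most frequent value (mode).
df["emp_length"] = df["emp_length"].fillna(df["emp_length"].mode()[0])
# Numeric: fill with the median, which is robust to outliers.
for col in ["revol_util", "tot_cur_bal"]:
    df[col] = df[col].fillna(df[col].median())

print(df.isna().sum().sum())  # 0
```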
• Handling Outliers
• Feature Extraction
Now, let us decide on the appropriate column to use as the target for modeling, keeping in mind that the main goal is to predict who will pay off a loan and who will default. We learned from the column descriptions in the preview DataFrame that 'default_ind' is the only field in the main dataset that describes loan status, so we use this column as the target.
94% of borrowers have not defaulted (fully paid) and 6% have defaulted or been charged off.
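The class balance quoted above can be checked with a one-liner; the series below is a synthetic stand-in for the real default_ind column.

```python
import pandas as pd

# Hypothetical target column with the 94/6 split reported above.
default_ind = pd.Series([0] * 94 + [1] * 6, name="default_ind")
balance = default_ind.value_counts(normalize=True)  # class proportions
print(balance[0], balance[1])  # 0.94 0.06
```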
2.2 Phase II - Feature Engineering
Mapping:
We map the issue dates from "Jun-2015" to "Dec-2015" to the test set, using a dictionary, for ease of splitting our data.
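The dictionary-based split described above can be sketched as below; the issue_d values are made up, and rows whose month is not in the dictionary fall into the training set.

```python
import pandas as pd

issue_d = pd.Series(["Jan-2014", "Jun-2015", "Dec-2015", "Mar-2013"])

# Months from Jun-2015 to Dec-2015 are labelled "Test", per the report.
test_months = {m + "-2015": "Test"
               for m in ["Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]}
split = issue_d.map(test_months).fillna("Train")
print(list(split))  # ['Train', 'Test', 'Test', 'Train']
```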
Correlation:
We can instead calculate the funded-amount-to-installment ratio for our analysis, and therefore remove this field directly.
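The replacement feature suggested above is a simple ratio; the values below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"funded_amnt": [5000.0, 2400.0],
                   "installment": [162.87, 84.33]})
# Funded amount expressed in units of the monthly installment.
df["funded_to_installment"] = df["funded_amnt"] / df["installment"]
print(df["funded_to_installment"].round(1).tolist())  # [30.7, 28.5]
```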
3. Total revolving high credit limit vs. revolving balance: the total revolving high credit limit is the maximum credit the customer holds across all accounts, whereas the revolving balance is the available credit left for the customer to use. The high limit on its own does not play a role: even if the limit is 100000/month, the customer's expenses may be above 100000, in which case the field provides no insight into defaulter status; a higher credit limit may accompany either high or low expenses, and we cannot comment on that. Without that factor in the picture we cannot analyse the impact of the field on defaulter status. The revolving balance, however, helps here: it tells us the available balance across all accounts that the person can use. If the balance is consistently low we can infer a higher probability of default, whereas if the balance is consistently high we may infer a lower probability of default.
4. Outstanding principal amount by investors vs. outstanding principal amount: the outstanding principal amount is the total remaining out of the principal funded, and the outstanding principal by investors is the amount remaining of the principal funded by investors. The customer neither knows nor cares who funded the amount, and default status will never depend on the amount funded by investors. Therefore, we remove the outstanding principal funded by investors from the analysis. We retain, however, the total outstanding principal, since the amount of principal the customer has been paying matters to them and may have a direct impact on default status.
5. Total payment by investors, total payment and total received principal: of these, the total payment received from the amount invested by investors has no impact on whether a person will default, as the customer has no knowledge of or concern for who funded the amount, so we remove this column. The total received principal and the outstanding principal capture the same information from different angles, so we can retain either of the two; we remove the total received principal column. Total payment received, since the other collinear columns are removed, can safely be kept for analysis: it includes both the principal and interest received, and how much is being repaid at a given point in time may matter to a customer who could default on his payments. Therefore, we retain this column for analysis.
6. The total payment and total received interest columns are correlated. As we have already retained the outstanding principal amount and the funded amount, we already capture the impact of the pending interest on defaulter status, so there is no need to retain total received interest. Since we are also retaining the total_payment column, and the outstanding principal and funded amount are present, the effect of total received interest is covered there as well. Therefore we retain only the total payment column.
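The correlation screening behind the decisions above can be sketched as below: compute pairwise correlations and flag any column that is highly collinear (here |r| > 0.9, our assumed cut-off) with an earlier column. The column names echo the report; the numbers are synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
out_prncp = rng.uniform(0, 10000, 200)
df = pd.DataFrame({
    "out_prncp": out_prncp,
    # Near-duplicate of out_prncp, like the investor-funded variant.
    "out_prncp_inv": out_prncp * 0.99 + rng.normal(0, 10, 200),
    "annual_inc": rng.uniform(20000, 150000, 200),
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
print(to_drop)  # ['out_prncp_inv']
```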
2.3 Data Dictionary:
Bivariate Analysis:
2. Check the distribution of the data with respect to the default indicator:
4. Chi-square test:
Based on the percentile plots and the chi-square test we decided whether to retain or discard columns from the analysis. For instance, sub_grade appears to have a notable impact on defaulter status, so this column is retained.
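A minimal sketch of the chi-square independence test used above, on a hypothetical sub_grade-by-default contingency table (the counts are invented):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: two sub-grades; columns: counts of non-defaulters and defaulters.
table = np.array([[900,  20],
                  [700, 120]])
chi2, p, dof, expected = chi2_contingency(table)
# A small p-value suggests defaulter status depends on sub-grade.
print(round(chi2, 1), p < 0.05)
```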
Bivariate Analysis:
1. Boxplot:
2. Percentile Analysis:
Using percentile analysis and boxplots we decided whether to remove or retain each column, depending on the impact on the default indicator seen visually. For example, the int_rate column appears to have a notable impact on defaulter status, so we retained it.
Multivariate analysis:
1. We used various columns, such as the computed percentile columns and the existing categorical columns, to get insights into the data.
For example, from the graph above we understand that the number of defaulters is highest in Illinois among applicants with annual incomes above the 80th percentile of the annual income range of loan applicants.
Logistic Regression
Logistic regression, despite its name, is a linear model for classification rather than
regression. Logistic regression is also known in the literature as logit regression,
maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the
probabilities describing the possible outcomes of a single trial are modeled using
a logistic function.
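A minimal fitting sketch for scikit-learn's LogisticRegression, on a synthetic two-feature problem standing in for the loan data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
# Synthetic default flag, deterministic in a linear score.
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]   # modelled P(default) per applicant
print(clf.score(X, y) > 0.9)         # linearly separable, so accuracy is high
```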
Decision Trees
A decision tree classifier is used to model the data. The maximum depth parameter of the tree is tuned to get the best results possible from the model.
We tried max depths of 4, 6, 8, 9, 10 and 12. Although max depth 12 gave the best results on the test 1 and validation sets, it performed poorly on the final test set, so it had clearly overfitted. Trading off bias against variance, we chose the decision tree with max depth 6.
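The depth sweep above can be sketched as follows, scoring each candidate depth on a held-out validation split (the data here is synthetic; the depths are those listed in the report):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (X[:, 0] - X[:, 1] ** 2 > -0.5).astype(int)  # synthetic non-linear target

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                            random_state=0)
scores = {}
for depth in [4, 6, 8, 9, 10, 12]:
    tree = DecisionTreeClassifier(max_depth=depth,
                                  random_state=0).fit(X_tr, y_tr)
    scores[depth] = tree.score(X_val, y_val)  # validation accuracy per depth
print(sorted(scores))  # [4, 6, 8, 9, 10, 12]
```

A large gap between training and validation accuracy at the deeper settings is the overfitting signal the report describes.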
Random forest classifier.
A random forest is a meta estimator that fits a number of decision tree classifiers on
various sub-samples of the dataset and uses averaging to improve the predictive
accuracy and control over-fitting. The sub-sample size is always the same as the original
input sample size but the samples are drawn with replacement.
Our main aim was a good trade-off between precision and recall, so as to predict defaulters accurately without losing too many customers for the bank.
We got lower accuracy with a threshold of 0.5, so we used the precision-recall vs. threshold graph to set the classification threshold.
We observed that at 0.4 the precision and recall are both as high as they can be, given the trade-off between the two.
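The threshold selection above can be sketched with scikit-learn's precision_recall_curve. The scores below are invented; here the best threshold is picked by maximal F1, one common way to balance the two metrics (the report chose 0.4 by inspecting the plot).

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.2, 0.3, 0.35, 0.45, 0.42, 0.6, 0.7, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
# precision/recall have one more entry than thresholds; drop the last point.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = thresholds[np.argmax(f1)]       # threshold with the best F1 trade-off
print(0.0 < best < 1.0)  # True
```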
4.1.2 First Decision Tree model
We tried the default tree height. No hyperparameters were tuned for this first model, in order to check the performance of the default settings first.
Test set 1:
4.1.5. Fourth Decision Tree model
Here we tried setting the max_depth hyperparameter to 8.
Test set 1:
4.1.8. Seventh Decision Tree model
Here we tried setting the max_depth hyperparameter to 12.
Test set 1:
Of all the decision tree models, the tree with max depth 6 performed well without much overfitting, whereas max depth 12 showed good performance only on the validation and intermediate test sets.
Here we have not tuned the hyperparameters; we used this model simply to check, for our own understanding, whether a random forest with default settings would outperform the decision tree.
On the intermediate test 1 and validation sets it performed as below:
Test set 1:
• If we look at the confusion matrix, though, we see a big problem. The model can predict who is going to pay off the loan with a good accuracy of 99%, but cannot predict who is going to default. The true positive rate for default (0 predicted as 0) is almost zero. Since our main goal is to predict defaulters, we have to address this. The likely cause is the high imbalance in our dataset, which leads the algorithm to put everything into class 1.
• Due to the class imbalance we used stratified sampling for logistic regression, and both oversampling and stratified sampling for decision trees. Decision trees performed well in the oversampled setting, though their performance was still not up to the mark compared to logistic regression.
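The imbalance handling above can be sketched as follows: a stratified split preserves the roughly 94/6 class ratio across train and test, and naive random oversampling then duplicates minority rows in the training fold only. The data is synthetic and the 1:1 target ratio is our assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = np.array([0] * 470 + [1] * 30)        # ~6% defaulters, like the data

# Stratified split keeps the class ratio identical in both folds.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Oversample the minority class in the training fold up to a 1:1 ratio.
minority = np.where(y_tr == 1)[0]
extra = rng.choice(minority, size=len(y_tr) - 2 * len(minority), replace=True)
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])
print(y_bal.mean())  # 0.5 after oversampling
```

Oversampling is applied after the split so that duplicated rows never leak into the evaluation fold.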
It appears from the graph below that the decision tree model overfits the data.
CHAPTER 5: KEY FINDINGS
Significant variables identified in the linear models are also used in the decision trees and random forest.
The table below provides a snapshot of the various models, from which the business can choose based on the pros and cons of each.

Model                 Accuracy   Precision   Recall   F1 Score
Logistic Regression   0.9974     0.2912      0.7717   0.4229
Decision Tree         0.6207     0.0029      0.9421   0.0059
• Logistic Regression:
We are concentrating mostly on recall, and logistic regression performs well here. Even though its recall is not as good as the decision trees', its precision is higher, which also matters greatly to us, as we wanted a good trade-off between type 1 and type 2 errors.
• Decision Trees:
Even though recall is higher for the decision trees than for logistic regression, the precision is far too low. This would result in losing a great many genuine clients. Therefore, we cannot use the decision trees as the model for loan defaulters.
CHAPTER 6: RECOMMENDATIONS AND CONCLUSION
We have successfully built the machine learning algorithm to predict the people who
might default on their loans.
We might also look at other techniques or variables to improve the predictive power of the algorithm. One drawback is the limited number of people who defaulted on their loans in the 8 years of data (2007-2015) present in the dataset. We could use an updated data frame that includes the next 3 years of values (2015-2018) and see how many of the current loans were paid off, defaulted or charged off. These new data points could then be used for prediction, or to retrain the model and improve its accuracy.
Since we had a lot of categorical data, we could not apply PCA for dimensionality reduction.
We also wanted to learn about the data in depth and make decisions according to the business significance of the columns, so rather than an automated feature selection technique we performed manual feature selection based on the EDA and our understanding of the business columns.
Since the algorithm puts around 584 non-defaulting clients into the default class, we may want to look further into this issue to make the model more robust.
From the analysis it is evident that banks should grant loans according to the sub-grade assigned to the client's loan application file. Clients with lower sub-grades have a higher tendency to default on their loans.
The analysis shows that even though banks now issue more 36-month loans and fewer 60-month loans, a careful eye should be kept on 60-month loan applications, as they have a higher tendency to default.
People with home ownership recorded as OTHER tend to default more, so banks can take other considerations into account when lending to them.
It is better if banks grant loans only after source verification of the person's income and other documents, as this reduces the probability of default.
Even though the percentage of loan applicants in the small-business and educational categories is small, banks should be extra vigilant when lending to them, as they tend to default more.
The analysis indicates that Texas, California, New York and Florida have the highest numbers of loan applicants. We need to investigate the total percentage of defaulters in each state.
It is clear from the analysis, though, that the percentage of defaulters varies from state to state, with the minimum in NE and the maximum in IA, followed by ID and NV. So when someone from these states applies for a loan, banks could check the other defaulter-behaviour parameters stated above in order to decide whether to lend.
It is seen that as the interest rate increases the defaulter rate also increases, being lowest up to the 20th percentile and highest between the 60th and 80th percentiles.
The defaulter rate decreases as annual income rises from 0 to the maximum of $364,000, so people with lower annual incomes should be verified against the other parameters stated above before being given loans.
Clients with a debt-to-income ratio below 218.4 are found to default more, and within that range there is a trend of decreasing defaulters from the low end to the maximum.
There appear to be more defaulters among borrowers with fewer than 15 total credit lines in their credit file, and the chance of default decreases as we move towards 62 or more accounts. This trend is consistent with what we saw in the outliers in the data, so we can safely assume that as the number of total credit lines increases, the chance of the borrower defaulting decreases slightly.
CHAPTER 7: REFERENCES
1) https://www.kaggle.com/deepanshu08/prediction-of-lendingclub-loan-defaulters
2) https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
3) https://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score
4) https://scikit-learn.org/stable/model_selection.html#model-selection
5) https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier