Machine Learning Applications in Credit Risk

Presented By: Sri Krishnamurthy, CFA, CAP (sri@quantuniversity.com)
Location: ARPM Open Source Conference, 8/13/2017
www.analyticscertificate.com
2017 Copyright QuantUniversity LLC.
Slides will be available at:

www.analyticscertificate.com/MachineLearning
Sri Krishnamurthy, Founder and CEO
• Founder of QuantUniversity LLC. and www.analyticscertificate.com
• Advisory and Consultancy for Financial Analytics
• Prior Experience at MathWorks, Citigroup and Endeca and 25+ financial
services and energy customers.
• Regular Columnist for the Wilmott Magazine
• Author of forthcoming book “Financial Modeling: A case study approach”
published by Wiley
• Chartered Financial Analyst and Certified Analytics Professional
• Teaches Analytics in the Babson College MBA program and at
Northeastern University, Boston
Quantitative Analytics and Big Data Analytics Onboarding

• Data Science, Quant Finance and
Machine Learning Advisory
• Trained more than 1000 students in
Quantitative methods, Data Science
and Big Data Technologies using
MATLAB, Python and R
• Launching
▫ Analytics Certificate Program (Spring 2018)
▫ Fintech Certification program (Fall 2017)
• Building
Credit risk in consumer credit
Credit-scoring models and techniques assess the risk in lending to
customers.
Typical decisions:
• Grant or deny credit to new applicants
• Increase or decrease spending limits
• Increase or decrease lending rates
• What new products can be given to existing applicants?
Credit assessment in consumer credit
History:
• Gut feel
• Social network
• Communities and influence
Traditional:
• Scoring mechanisms through credit bureaus
• Bank assessments through business rules
Newer approaches (FINTECH):
• Peer-to-Peer lending
• Lending Club, Prosper Marketplace
Types of algorithms
Machine learning
• Supervised Learning
▫ Prediction
▫ Classification
• Unsupervised Learning
▫ Clustering
Supervised Learning
Used to derive a relationship between dependent and independent
variables
• Prediction
▫ Regression
▫ Decision Trees (CART)
▫ Neural Networks
• Classification
▫ Logistic Regression
▫ CART, Random Forest, SVM
▫ Neural Networks
Methodology
• Step 1: Data pre-processing
• Step 2: Split data into Training and Testing sets
• Step 3: Train the model on Training data
• Step 4: Test the model using Testing data to evaluate model performance
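The workflow on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy records (a debt-to-income ratio and a default flag) are invented, and the "model" is a single cut-off rule chosen on the training set, standing in for whatever model is actually used. The helper names `preprocess`, `split`, `train` and `accuracy` are hypothetical.

```python
import random

def preprocess(rows):
    """Data pre-processing: drop records with a missing feature or label."""
    return [(x, y) for x, y in rows if x is not None and y is not None]

def split(rows, test_fraction=0.3, seed=0):
    """Shuffle the records and split them into training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(rows, cutoff):
    """Share of records where the rule 'x >= cutoff' matches the label (1 = default)."""
    hits = sum((x >= cutoff) == (y == 1) for x, y in rows)
    return hits / len(rows)

def train(rows):
    """Pick the cut-off (among observed values) with the best training accuracy."""
    candidates = sorted({x for x, _ in rows})
    return max(candidates, key=lambda c: accuracy(rows, c))

# Invented toy data: (debt-to-income ratio, defaulted?)
raw = [(0.10, 0), (0.15, 0), (0.20, 0), (None, 1), (0.25, 0), (0.30, 0),
       (0.35, 1), (0.40, 0), (0.45, 1), (0.50, 1), (0.55, 1)]

clean = preprocess(raw)
train_set, test_set = split(clean, test_fraction=0.3, seed=42)
cutoff = train(train_set)                    # fitted on training data only
test_accuracy = accuracy(test_set, cutoff)   # evaluated on held-out data
print(f"cut-off = {cutoff}, test accuracy = {test_accuracy:.2f}")
```

The point of the split is that the cut-off never sees the testing records, so the reported accuracy is an out-of-sample estimate.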
Unsupervised Learning
• No distinction between independent variables and dependent
variables
• No result labels to determine “correct” results
• Goals:
▫ Data Reduction
▫ Clustering
Types of Clustering
• Partitioning Clustering
▫ Starts with K, the number of clusters sought
▫ Observations are randomly divided into K groups and then reassigned to
form cohesive clusters
▫ Example : K-means
• Hierarchical Agglomerative Clustering
▫ Each observation is its own cluster
▫ Combine clusters two at a time to finally have one cluster
▫ Example: Hierarchical clustering using single linkage, Ward’s method
etc.
K-means
• Tries to separate samples into K groups with the goal of maximizing
between-group variance and minimizing within-group variance
• Requires K to be specified up front.
• Starts with K initial centroids and optimizes to minimize the criterion
or until the specified number of iterations is reached.
• Suited for larger datasets
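A minimal sketch of the K-means loop described above, written in plain Python so it runs without extra packages. It assumes squared Euclidean distance and random initial centroids drawn from the data; the function name `kmeans` and the toy points are illustrative, not taken from the talk.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    """Return (labels, centroids) after at most max_iter iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # K initial centroids
    labels = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:               # assignments stable: converged
            break
        labels = new_labels
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                        # keep old centroid if cluster is empty
                centroids[j] = mean(members)
    return labels, centroids

# Two well-separated toy groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

Because the starting centroids are random, different seeds can give different cluster numbering, and on less clean data different local optima.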
Hierarchical clustering
• Goal is to derive a dendrogram starting from each record being its
own cluster
• Works well for smaller data sets
• Proximity is measured in multiple ways (more later)
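As a rough illustration of the agglomerative idea, the sketch below starts with every record as its own cluster and repeatedly merges the two closest clusters, recording each merge (the information a dendrogram displays). It assumes single linkage, where cluster proximity is the smallest pairwise distance; the name `single_linkage` and the toy values are hypothetical. The brute-force search is slow on large inputs, which is consistent with the remark that the method suits smaller data sets.

```python
def single_linkage(points, dist):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = [[i] for i in range(len(points))]   # each record is its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest single-linkage distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]    # combine the two clusters
        del clusters[b]
    return merges

values = [0.0, 1.0, 5.0, 5.5]
merges = single_linkage(values, lambda x, y: abs(x - y))
for left, right, height in merges:
    print(left, "+", right, "at distance", height)
```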
The notion of distance

How do you measure similarity between two entities?
▫ Apples and Bananas
▫ Coke and Pepsi vs Orange juice
▫ Honda Civic vs Toyota Corolla
▫ New York and Boston
• The notion of distance
Distance measures
• Euclidean distance
• Cosine distance
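The slide's formulas did not survive extraction, so here is a small sketch using the standard textbook definitions: Euclidean distance is the square root of the summed squared differences, and cosine distance is one minus the cosine of the angle between the two vectors. The function names are illustrative.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

print(euclidean((0, 0), (3, 4)))          # 5.0
print(cosine_distance((1, 0), (0, 1)))    # 1.0 (orthogonal vectors)
```

Cosine distance ignores magnitude: two vectors pointing in the same direction are at distance (approximately) zero even if one is much longer.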
Other distance measures

• Manhattan distance
(Taxi-cab distance)
• Jaccard distance
▫ Used to measure similarity or dissimilarity between binary and non-
binary variables
▫ http://people.revoledu.com/kardi/tutorial/Similarity/Jaccard.html
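A short sketch of both measures, using their usual definitions: Manhattan distance sums the absolute coordinate differences, and Jaccard distance is one minus the ratio of shared items to all items. Treating each record as a set of attributes that are present is an assumption made for the example; the attribute names are invented.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences (taxi-cab distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:            # two empty sets: treat as identical
        return 0.0
    return 1.0 - len(a & b) / len(union)

print(manhattan((1, 2), (4, 6)))   # 7
# Invented attribute sets for two borrowers.
print(jaccard_distance({"mortgage", "car_loan", "credit_card"},
                       {"car_loan", "credit_card", "student_loan"}))   # 0.5
```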
Working with mixed-data

• Gower distance is used for calculating distances when we have mixed types
of variables (continuous and categorical)
• Variables can be:
▫ Quantitative (such as rating scale)
▫ Binary (such as present/absent)
▫ Nominal (such as worker/teacher/clerk)
• The metrics used for each data type are described below:
▫ Quantitative: range-normalized Manhattan distance
▫ Ordinal: variable is first ranked, then Manhattan distance is used with a special
adjustment for ties
▫ Nominal: variables of k categories are first converted into k binary columns and
then the Dice coefficient is used
(https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient)
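A simplified sketch of a Gower-style distance for one pair of records, assuming only two variable kinds: quantitative (range-normalized absolute difference) and nominal or binary (0 if the values match, 1 otherwise). For a nominal variable expanded into k one-hot columns, the Dice dissimilarity reduces to exactly this match/mismatch rule. Ordinal ranking with tie adjustment is left out, and the field names and values are invented.

```python
def gower_distance(a, b, kinds, ranges):
    """Average per-variable dissimilarity between records a and b.

    kinds[i]  is "num" for quantitative or "cat" for nominal/binary.
    ranges[i] is the observed range (max - min) for quantitative variables.
    """
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "num":
            total += abs(x - y) / rng if rng else 0.0   # range-normalized Manhattan
        else:
            total += 0.0 if x == y else 1.0             # match / mismatch
    return total / len(kinds)

# Invented records: (annual income, home ownership, prior delinquency)
kinds = ["num", "cat", "cat"]
ranges = [100000, None, None]      # assumed income range in the data set
loan_a = (50000, "RENT", False)
loan_b = (90000, "OWN", False)
d = gower_distance(loan_a, loan_b, kinds, ranges)
print(round(d, 4))                 # (0.4 + 1 + 0) / 3
```

Each variable contributes a value between 0 and 1, so no single variable dominates because of its units.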
Support in R
• daisy: Computes all the pairwise dissimilarities (distances) between
observations in the data set
• pam: Partitions (clusters) the data into k clusters “around
medoids”, a more robust version of K-means.
• agnes: Computes agglomerative nesting (hierarchical clustering) of
the dataset.
Lending Club
The Data
https://www.lendingclub.com/info/download-data.action
The Data
https://www.kaggle.com/wendykan/lending-club-loan-data
Variable description
Objective
• Calculate dissimilarity between observations.
• Select algorithm to group observations together
• Choose the best number of clusters
• Visualize clusters on reduced dimensions
Selecting number of clusters
• Partitioning around medoids (PAM) is used in this case.
• PAM is an iterative clustering procedure with the following steps:
▫ Step 1: Choose k random entities to become the medoids.
▫ Step 2: Assign every entity to its closest medoid (using the distance
matrix we have calculated).
▫ Step 3: For each cluster, identify the observation that would yield the
lowest average distance if it were re-assigned as the medoid. If that
observation is not the current medoid, make it the new medoid.
▫ Step 4: If at least one medoid has changed, return to step 2.
Otherwise, end the algorithm.
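The four steps above can be sketched directly on a precomputed distance matrix. This follows the steps as written on the slide and is not a reproduction of any particular library's implementation; the name `pam_sketch` and the toy data are hypothetical.

```python
import random

def pam_sketch(dist, k, seed=0, max_iter=100):
    """dist is a square matrix (list of lists); returns (medoids, labels)."""
    n = len(dist)
    rng = random.Random(seed)
    medoids = rng.sample(range(n), k)                  # Step 1: k random medoids

    def assign(meds):
        # Step 2: each entity goes to its closest medoid.
        return [min(range(k), key=lambda j: dist[i][meds[j]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for j in range(k):
            members = [i for i in range(n) if labels[i] == j]
            if not members:                            # empty cluster: keep old medoid
                new_medoids.append(medoids[j])
                continue
            # Step 3: the member with the lowest total (hence average)
            # distance to the rest of its cluster becomes the medoid.
            new_medoids.append(min(members,
                                   key=lambda c: sum(dist[c][m] for m in members)))
        if new_medoids == medoids:                     # Step 4: no change, stop
            break
        medoids = new_medoids
    return medoids, assign(medoids)

values = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in values] for a in values]
medoids, labels = pam_sketch(dist, k=2)
print(sorted(values[m] for m in medoids), labels)
```

Because the medoids are actual observations and only the distance matrix is needed, the same loop works with a mixed-data dissimilarity such as Gower distance.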
Visualization with reduced dimension
• One way to visualize many variables in a lower dimensional space is
with t-distributed stochastic neighbor embedding (t-SNE)
• This method is a dimension reduction technique that tries to
preserve local structure so as to make clusters visible in a 2D or 3D
visualization.
• https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding
Alternative Credit scoring in the news

Fintech being noticed by Regulators

Regulatory Sandboxes
• The regulatory sandbox allows businesses to test innovative
products, services, business models and delivery mechanisms in the
real market, with real consumers.
• The sandbox is a supervised space, open to both authorized and
unauthorized firms, that provides firms with:
▫ reduced time-to-market at potentially lower cost
▫ appropriate consumer protection safeguards built in to new products and
services
▫ better access to finance
• https://www.fca.org.uk/firms/regulatory-sandbox
US Regulators catching up
Model Validation
• “Model risk is the potential for adverse consequences from
decisions based on incorrect or misused model outputs and
reports.” [1]
• “Model validation is the set of processes and activities
intended to verify that models are performing as expected,
in line with their design objectives and business uses.” [1]
• Ref:
• [1] Supervisory Letter SR 11-7, Guidance on Model Risk Management
Popularity of Open-source software in the enterprise increasing
Cloud maturing
• Financial Services customers like Capital One, FINRA, and Pacific Life
are moving critical workloads to AWS
Challenges in adopting Open-source software in the enterprise
• Versions and packages
• Difficulty in replicating and reconciling differences in environments
• Deploying models built by Data Scientists is still a problem
(Figure: Data Scientists, Enterprise IT)
• The try-before-adopt model is difficult with unproven open-source
solutions
www.QuSandbox.com
Use cases
Quant/Enterprise use cases
• Create an environment that can support multiple platforms and
programming languages
• Enable remote running of applications
• Ability to try out a GitHub submission or someone else’s code
• Facilitate creation of Docker images to create replicable containers
• Create prototyping environments for Data Science/Quant teams
• Enable Data scientists/Quants to deploy their solutions
• Enable running multiple experiments concurrently
• Integrate seamlessly with the cloud to scale up computations
Use cases
Fintech use cases
• To demonstrate solutions to enterprises
• Create customized enterprise trials for companies that don’t permit
installation of vendor software prior to procurement
• To manage quick updates
• Enable effective integration and hosting of services (REST APIs)
Use cases
Academic use cases
• Enable creation of course material and exercises that could be
shared
• Enable students and workshop participants to focus on the data
science experiments rather than environment setting
Creating replicable environments
Create and manage replicable environments (Code + software + data) in a single portal
Creating replicable environments
Create replicable environments (Code + software + data) through an easy point & click tool and
publish to Dockerhub or manage internally
Share it with target users
User portal
• Run multiple experiments in pre-created environments (Code + software + data)
• Deploy your own solutions
• Run any Docker image or GitHub submission on the cloud
Run Jupyter notebooks and prototype applications

Run RStudio and Shiny applications

Run any Docker application

Manage tasks and errors

User portal
• Dockerize and deploy applications on AWS in just a few steps

Deploy applications with ease

QU’s open source project – Project Mozaic

www.QuSandbox.com
www.analyticscertificate.com/MachineLearning
Thank you ARPM and enjoy the boot camp!

Check out our programs at:
www.analyticscertificate.com/fintech
www.qusandbox.com
Sri Krishnamurthy, CFA, CAP

Founder and CEO
QuantUniversity LLC.
srikrishnamurthy
www.QuantUniversity.com
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be
distributed or used in any other publication without the prior written consent of QuantUniversity LLC.
