Machine Learning applications in Credit Risk

Presented By: Sri Krishnamurthy, CFA, CAP (sri@quantuniversity.com)
Location: ARPM Open Source Conference, 8/13/2017
www.analyticscertificate.com
2017 Copyright QuantUniversity LLC.

Slides will be available at:
www.analyticscertificate.com/MachineLearning

Sri Krishnamurthy, Founder and CEO
• Founder of QuantUniversity LLC. and www.analyticscertificate.com
• Advisory and Consultancy for Financial Analytics
• Prior Experience at MathWorks, Citigroup and
Endeca and 25+ financial services and energy
customers.
• Regular Columnist for the Wilmott Magazine
• Author of forthcoming book
“Financial Modeling: A case study approach”
published by Wiley
• Chartered Financial Analyst and Certified Analytics
Professional
• Teaches Analytics in the Babson College MBA
program and at Northeastern University, Boston

Quantitative Analytics and Big Data Analytics Onboarding


• Data Science, Quant Finance and
Machine Learning Advisory
• Trained more than 1000 students in
Quantitative methods, Data Science
and Big Data Technologies using
MATLAB, Python and R
• Launching
▫ Analytics Certificate Program (Spring 2018)
▫ Fintech Certification program (Fall 2017)
• Building

Credit risk in consumer credit
Credit-scoring models and techniques assess the risk in lending to
customers.

Typical decisions:
• Grant credit or not to new applicants
• Increasing/Decreasing spending limits
• Increasing/Decreasing lending rates
• What new products can be given to existing applicants?

Credit assessment in consumer credit
History:
• Gut feel
• Social network
• Communities and influence

Traditional:
• Scoring mechanisms through credit bureaus
• Bank assessments through business rules

Newer approaches (FINTECH):
• Peer-to-Peer lending
• Lending Club, Prosper Marketplace

Types of algorithms

Machine learning
• Supervised Learning
▫ Prediction
▫ Classification
• Unsupervised Learning
▫ Clustering

Supervised Learning
Used to derive a relationship between dependent and independent
variables
• Prediction
▫ Regression
▫ Decision Trees (CART)
▫ Neural Networks
• Classification
▫ Logistic Regression
▫ CART, Random Forest, SVM
▫ Neural Networks

Methodology

▫ Step 1: Data pre-processing
▫ Step 2: Split data into Training and Testing sets
▫ Step 3: Train the model using Training data
▫ Step 4: Test the model on Testing data to evaluate model performance
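
The methodology above can be sketched end to end in a few lines of Python. This is a minimal illustration on made-up data, not the workflow used in the talk: the helper names (`preprocess`, `split`, `train`, `evaluate`), the 30% test fraction and the toy cut-off "model" are all assumptions chosen to keep the example self-contained.

```python
import random

def preprocess(rows):
    """Data pre-processing: here, simply drop rows with a missing feature."""
    return [(x, y) for x, y in rows if x is not None]

def split(rows, test_fraction=0.3, seed=0):
    """Shuffle, then split into (training, testing) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def train(training):
    """Toy model: a cut-off halfway between the two class means."""
    good = [x for x, y in training if y == 0]
    bad = [x for x, y in training if y == 1]
    return (sum(good) / len(good) + sum(bad) / len(bad)) / 2.0

def evaluate(cutoff, testing):
    """Share of testing rows whose label is predicted correctly."""
    hits = sum(1 for x, y in testing if (x > cutoff) == (y == 1))
    return hits / len(testing)

# Made-up data: x is a hypothetical risk indicator, y = 1 marks a default.
raw = [(i / 100.0, 0) for i in range(100)]
raw += [(2.0 + i / 100.0, 1) for i in range(100)]
raw += [(None, 0), (None, 1)]  # two incomplete records

clean = preprocess(raw)                 # Step 1
training, testing = split(clean)        # Step 2
cutoff = train(training)                # Step 3
test_accuracy = evaluate(cutoff, testing)  # Step 4
print(len(training), len(testing), test_accuracy)
```

The key design point is that the cut-off is fitted on the training rows only, so the accuracy measured on the testing rows is an out-of-sample estimate.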

Unsupervised Learning
• No distinction between independent variables and dependent
variables
• No result labels to determine “correct” results
• Goals:
▫ Data Reduction
▫ Clustering

Types of Clustering
• Partitioning Clustering
▫ Starts with K, the number of clusters sought
▫ Observations are initially divided at random, then reassigned to form
cohesive clusters
▫ Example : K-means
• Hierarchical Agglomerative Clustering
▫ Each observation is its own cluster
▫ Combine clusters two at a time to finally have one cluster
▫ Example: Hierarchical clustering using single linkage, Ward’s method
etc.

K-means
• Tries to separate samples into K groups with the goal of maximizing
between-group variance and minimizing within-group variance
• Requires K to be specified up front.
• Starts with K initial centroids and iterates to minimize the criterion
or until the specified number of iterations is reached.
• Suited for larger datasets
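
The K-means loop described above can be sketched in plain Python as follows. This is an illustrative version under simple assumptions (random initial centroids drawn from the data, squared Euclidean distance, a hypothetical `kmeans` helper), not a production implementation; the sample points are made up.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    """Return (labels, centroids) after at most max_iter iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # K initial centroids
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Two made-up, well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

Because the starting centroids are random, the cluster numbering can change with the seed, and on less cleanly separated data the result itself can depend on the initialization.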

Hierarchical clustering
• Goal is to derive a dendrogram starting from each record being its
own cluster
• Works well for smaller data sets
• Proximity is measured in multiple ways (more later)
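
A small sketch of agglomerative clustering with single linkage is shown below. It starts with every record as its own cluster and records each merge, which is the information a dendrogram displays. The function name `single_linkage` and the sample values are assumptions made for illustration; the brute-force search is only practical for small data sets.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(points):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = [[i] for i in range(len(points))]  # each record is a cluster
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the two clusters whose closest members are nearest.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]  # combine two at a time
        del clusters[b]
    return merges

# Made-up one-dimensional records.
records = [(0.0,), (1.0,), (5.0,), (5.5,)]
merges = single_linkage(records)
for step in merges:
    print(step)
```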

The notion of distance

How do you measure similarity between two entities?
▫ Apples and Bananas
▫ Coke and Pepsi vs Orange juice
▫ Honda Civic vs Toyota Corolla
▫ New York and Boston
• The notion of distance

Distance measures
• Euclidean distance

• Cosine distance
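
The two measures above can be written out directly from their standard definitions: Euclidean distance is the square root of the summed squared differences, and cosine distance is one minus the cosine of the angle between the two vectors. The function names below are illustrative, and the cosine version assumes neither vector is all zeros.

```python
import math

def euclidean_distance(a, b):
    """sqrt(sum_i (a_i - b_i)^2)"""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - (a . b) / (|a| * |b|); assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

d_euclid = euclidean_distance((0, 0), (3, 4))
d_orthogonal = cosine_distance((1, 0), (0, 1))
d_parallel = cosine_distance((1, 2), (2, 4))
print(d_euclid, d_orthogonal, d_parallel)
```

Note that cosine distance ignores magnitude: (1, 2) and (2, 4) point the same way, so their cosine distance is essentially zero even though their Euclidean distance is not.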

Other distance measures


• Manhattan distance (Taxi-cab distance)
• Jaccard distance
▫ Used to measure similarity or dissimilarity between binary and
non-binary variables
▫ http://people.revoledu.com/kardi/tutorial/Similarity/Jaccard.html
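
Both measures are short enough to sketch directly. Manhattan distance sums absolute coordinate differences; Jaccard distance, in its common set form, is one minus the size of the intersection divided by the size of the union. The function names and the sample sets are illustrative assumptions.

```python
def manhattan_distance(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard_distance(set_a, set_b):
    """1 - |A and B| / |A or B|; two empty sets are treated as identical."""
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)

d_manhattan = manhattan_distance((1, 2), (4, 6))
# Hypothetical product holdings of two customers.
d_jaccard = jaccard_distance({"card", "loan", "mortgage"},
                             {"loan", "mortgage", "savings"})
print(d_manhattan, d_jaccard)
```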

Working with mixed-data


• Gower distance is used for calculating distances when we have mixed types
of variables (continuous and categorical)
• Variables can be:
▫ Quantitative (such as rating scale)
▫ Binary (such as present/absent)
▫ Nominal (such as worker/teacher/clerk)
• The metrics used for each data type are described below:
▫ Quantitative: range-normalized Manhattan distance
▫ Ordinal: variable is first ranked, then Manhattan distance is used with a special
adjustment for ties
▫ Nominal: variables of k categories are first converted into k binary columns and
then the Dice coefficient is used
(https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient)
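
A simplified sketch of the Gower idea for two variable types follows. It is an assumption-laden illustration, not the R implementation: quantitative variables use the range-normalized absolute difference, and nominal variables use a 0/1 mismatch (for a single nominal variable expanded into k binary columns, the Dice-based dissimilarity reduces to exactly this). Ordinal handling is left out, and the loan fields and values are made up.

```python
def gower_distance(row_a, row_b, kinds, ranges):
    """Average per-variable dissimilarity, each scaled to [0, 1]."""
    total = 0.0
    for a, b, kind, value_range in zip(row_a, row_b, kinds, ranges):
        if kind == "numeric":
            # Range-normalized Manhattan distance for one variable.
            total += abs(a - b) / value_range if value_range else 0.0
        else:
            # Nominal: 0 if the categories match, 1 otherwise.
            total += 0.0 if a == b else 1.0
    return total / len(kinds)

# Hypothetical records: (loan amount, term, home ownership).
rows = [
    (5000.0, "36 months", "RENT"),
    (15000.0, "36 months", "OWN"),
    (25000.0, "60 months", "RENT"),
]
kinds = ("numeric", "nominal", "nominal")

# Observed range of each numeric column (None for nominal columns).
ranges = []
for col, kind in enumerate(kinds):
    if kind == "numeric":
        values = [r[col] for r in rows]
        ranges.append(max(values) - min(values))
    else:
        ranges.append(None)

d01 = gower_distance(rows[0], rows[1], kinds, ranges)
d02 = gower_distance(rows[0], rows[2], kinds, ranges)
print(d01, d02)
```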

Support in R
• daisy: Computes all the pairwise dissimilarities (distances) between
observations in the data set
• pam: Partitioning (clustering) of the data into k clusters “around
medoids”, a more robust version of K-means.
• agnes: Computes agglomerative nesting (hierarchical clustering) of
the dataset.
• All three functions are provided by the cluster package.

Lending Club

The Data

https://www.lendingclub.com/info/download-data.action

The Data

https://www.kaggle.com/wendykan/lending-club-loan-data
Variable description
Objective
• Calculate dissimilarity between observations.
• Select algorithm to group observations together
• Choose the best number of clusters
• Visualize clusters on reduced dimensions
Selecting number of clusters
• Partitioning around medoids (PAM) is used in this case.
• PAM is an iterative clustering procedure with the following steps:
▫ Step 1: Choose k random entities to become the medoids.
▫ Step 2: Assign every entity to its closest medoid (using the distance
matrix we have calculated).
▫ Step 3: For each cluster, identify the observation that would yield the
lowest average distance to the other members if it were the medoid. If it
differs from the current medoid, make this observation the new medoid.
▫ Step 4: If at least one medoid has changed, return to step 2.
Otherwise, end the algorithm.
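
The four steps above can be sketched as follows, working from a precomputed distance matrix. This follows the steps as listed on the slide; the `pam` function in R uses a build-and-swap procedure, so its results can differ. The helper names and the one-dimensional sample values are assumptions for illustration.

```python
import random

def assign(dist, medoids):
    """Step 2: index of the closest medoid for every observation."""
    return [min(range(len(medoids)), key=lambda j: dist[i][medoids[j]])
            for i in range(len(dist))]

def k_medoids(dist, k, seed=0, max_iter=100):
    """Return (medoid indices, cluster labels) for a distance matrix."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(dist)), k)  # Step 1: random medoids
    for _ in range(max_iter):
        labels = assign(dist, medoids)         # Step 2
        new_medoids = []
        for j in range(k):                     # Step 3
            members = [i for i, lab in enumerate(labels) if lab == j]
            if not members:                    # keep medoid of an empty cluster
                new_medoids.append(medoids[j])
                continue
            # Lowest total (hence lowest average) distance within the cluster.
            new_medoids.append(
                min(members, key=lambda c: sum(dist[c][m] for m in members)))
        if new_medoids == medoids:             # Step 4: nothing changed
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)

# Made-up values forming two obvious groups.
values = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in values] for a in values]
medoids, labels = k_medoids(dist, k=2)
print(medoids, labels)
```

Because the function only needs a distance matrix, it can be paired with any dissimilarity measure, including a mixed-type one such as Gower distance.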
Visualization with reduced dimension
• One way to visualize many variables in a lower dimensional space is
with t-distributed stochastic neighbor embedding (t-SNE)
• This method is a dimension reduction technique that tries to
preserve local structure so as to make clusters visible in a 2D or 3D
visualization.
• https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding

Alternative Credit scoring in the news



Fintech being noticed by Regulators



Regulatory Sandboxes
• The regulatory sandbox allows businesses to test innovative
products, services, business models and delivery mechanisms in the
real market, with real consumers.

• The sandbox is a supervised space, open to both authorized and
unauthorized firms, that provides firms with:
▫ reduced time-to-market at potentially lower cost
▫ appropriate consumer protection safeguards built in to new products and
services
▫ better access to finance

• https://www.fca.org.uk/firms/regulatory-sandbox

US Regulators catching up

Model Validation

• “Model risk is the potential for adverse consequences from
decisions based on incorrect or misused model outputs and
reports.” [1]

• “Model validation is the set of processes and activities
intended to verify that models are performing as expected,
in line with their design objectives and business uses.” [1]

• Ref:
• [1] Federal Reserve Supervisory Letter SR 11-7, Guidance on Model Risk Management

Popularity of Open-source software in the enterprise increasing

Cloud maturing
• Financial Services customers like Capital One, FINRA, and Pacific Life
are moving critical workloads to AWS

Challenges in adopting Open-source software in the enterprise
• Versions and packages
• Difficulty in replicating and reconciling differences in environments
• Deploying models built by Data Scientists still a problem

Data Scientists Enterprise IT


• The try-before-adopt model is difficult with unproven open-source
solutions

www.QuSandbox.com

Use cases
Quant/Enterprise use cases
• Create an environment that can support multiple platforms and
programming languages
• Enable remote running of applications
• Ability to try out a GitHub submission or someone else’s code
• Facilitate creation of Docker images to create replicable containers
• Create prototyping environments for Data Science/Quant teams
• Enable Data scientists/Quants to deploy their solutions
• Enable running multiple experiments concurrently
• Integrate seamlessly with the cloud to scale up computations

Use cases
Fintech use cases
• To demonstrate solutions to enterprises
• Create customized enterprise trials for companies that don’t permit
installation of vendor software prior to procurement
• To manage quick updates
• Enable effective integration and hosting of services (REST APIs)

Use cases
Academic use cases
• Enable creation of course material and exercises that could be
shared
• Enable students and workshop participants to focus on the data
science experiments rather than environment setup

Creating replicable environments

Create and manage replicable environments (Code + software + data) in a single portal

Creating replicable environments

Create replicable environments (Code + software + data) through an easy point-and-click tool and
publish to Docker Hub or manage internally
Share it with target users

User portal

• Run multiple experiments in pre-created environments (Code + software + data)
• Deploy your own solutions
• Run any Docker image or GitHub submission on the cloud

Run Jupyter notebooks and prototype applications



Run RStudio and Shiny applications



Run any Docker application



Manage tasks and errors



User portal

• Dockerize and deploy applications on AWS in just a few steps



Deploy applications with ease



QU’s open source project – Project Mozaic



www.QuSandbox.com

www.analyticscertificate.com/MachineLearning

Thank you ARPM and enjoy the boot camp!


Check out our programs at:
www.analyticscertificate.com/fintech
www.qusandbox.com

Sri Krishnamurthy, CFA, CAP


Founder and CEO
QuantUniversity LLC.
srikrishnamurthy
www.QuantUniversity.com
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be
distributed or used in any other publication without the prior written consent of QuantUniversity LLC.