Typical decisions:
• Grant or deny credit to new applicants
• Increase or decrease spending limits
• Increase or decrease lending rates
• Decide which new products can be offered to existing applicants
Credit assessment in consumer credit
History:
• Gut feel
• Social network
• Communities and influence
Traditional:
• Scoring mechanisms through credit bureaus
• Bank assessments through business rules
Types of algorithms
• Machine learning
▫ Supervised Learning: Prediction, Classification
▫ Unsupervised Learning: Clustering
Supervised Learning
Used to derive a relationship between dependent and independent
variables
• Prediction
▫ Regression
▫ Decision Trees (CART)
▫ Neural Networks
• Classification
▫ Logistic Regression
▫ CART, Random Forest, SVM
▫ Neural Networks
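To make the idea of relating a dependent variable to an independent variable concrete, here is a minimal sketch of the simplest prediction method in the list, a one-variable least-squares regression. The function names and the toy numbers are illustrative assumptions, not data from the deck.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict the dependent variable for a new value of x."""
    return slope * x + intercept


# Hypothetical toy data: independent variable xs, dependent variable ys.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept, predict(5.0, slope, intercept))
```

The labelled pairs (xs, ys) are what makes this supervised: the fit is judged against known outcomes.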
Methodology
Unsupervised Learning
• No distinction between independent variables and dependent
variables
• No result labels to determine “correct” results
• Goals:
▫ Data Reduction
▫ Clustering
Types of Clustering
• Partitioning Clustering
▫ Starts with K, the number of clusters sought
▫ Observations are randomly divided into K groups and reshuffled to form cohesive clusters
▫ Example : K-means
• Hierarchical Agglomerative Clustering
▫ Each observation starts as its own cluster
▫ Clusters are combined two at a time until only one cluster remains
▫ Examples: hierarchical clustering using single linkage, Ward’s method, etc.
K-means
• Tries to separate samples into K groups, with the goal of maximizing
between-group variance and minimizing within-group variance
• Requires K to be specified up front.
• Starts with K initial centroids and iterates to minimize the within-group
variance, or until the specified number of iterations is reached.
• Suited for larger datasets
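The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration of the standard assign-then-update iteration, assuming squared Euclidean distance and random initial centroids drawn from the data. The function names, the seed parameter, and the sample points are all hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=100, seed=0):
    """Return (labels, centroids) after iterating assignment and update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # K initial centroids picked from the data
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the criterion cannot improve
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean_point(members)
    return labels, centroids


# Hypothetical sample: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Because the starting centroids are random, different seeds can number the clusters differently, and on less cleanly separated data they can also lead to different groupings.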
Hierarchical clustering
• Goal is to derive a dendrogram starting from each record being its
own cluster
• Works well for smaller data sets
• Proximity is measured in multiple ways (more later)
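As a rough sketch of how the dendrogram is built, the following Python code performs agglomerative clustering with single linkage and records each merge together with the distance at which it happened. The merge list is a stand-in for the dendrogram. The function names and the one-dimensional sample points are illustrative assumptions, and the brute-force search is only practical for small data sets.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(points):
    """Merge the two closest clusters repeatedly; return the merge history."""
    clusters = [[i] for i in range(len(points))]  # each record is its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    euclidean(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical sample: four records on a line.
points = [(0.0,), (1.0,), (5.0,), (6.5,)]
merges = single_linkage(points)
for left, right, height in merges:
    print(left, right, height)
```

Cutting the merge history at a chosen height gives a flat clustering; here a cut between the second and third merge would leave two clusters.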
Distance measures
• Euclidean distance
• Cosine distance
• Jaccard distance
▫ Used to measure similarity or dissimilarity between binary and non-binary variables
▫ http://people.revoledu.com/kardi/tutorial/Similarity/Jaccard.html
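The three measures listed above can be written out directly. The sketch below uses their textbook definitions; the function names and sample values are made up for illustration, and the Jaccard version treats each observation as a set of items, which is one common way to apply it.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine of the angle between two numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def jaccard_distance(a, b):
    """One minus the share of items common to both sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


print(euclidean_distance((0, 0), (3, 4)))  # 5.0
print(cosine_distance((1, 0), (0, 1)))     # 1.0, the vectors are orthogonal
# Hypothetical product holdings of two customers; one of three items is shared.
print(jaccard_distance({"mortgage", "car loan"}, {"car loan", "credit card"}))
```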
Support in R
• daisy: Computes all the pairwise dissimilarities (distances) between
observations in the data set
• pam: Partitions (clusters) the data into k clusters “around
medoids”, a more robust version of K-means.
• agnes: Computes agglomerative nesting (hierarchical clustering) of
the dataset.
Lending club
The Data
https://www.lendingclub.com/info/download-data.action
https://www.kaggle.com/wendykan/lending-club-loan-data
Variable description
Objective
• Calculate dissimilarity between observations.
• Select algorithm to group observations together
• Choose the best number of clusters
• Visualize clusters on reduced dimensions
Selecting number of clusters
• Partitioning around medoids (PAM) is used in this case.
• PAM is an iterative clustering procedure with the following steps:
▫ Step 1: Choose k random entities to become the medoids.
▫ Step 2: Assign every entity to its closest medoid (using the distance
matrix we have calculated).
▫ Step 3: For each cluster, identify the observation that would yield the
lowest average distance if it were to be re-assigned as the medoid. If
it differs from the current medoid, make this observation the new medoid.
▫ Step 4: If at least one medoid has changed, return to step 2.
Otherwise, end the algorithm.
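The four steps above can be sketched in Python as follows. This is a simplified illustration that follows the steps as listed, working from a precomputed distance matrix. It is not the implementation behind R's pam function, which uses a build phase and a swap phase. The function names, the seed parameter, and the toy values are assumptions.

```python
import random


def assign(distance, medoids):
    """Step 2: give every entity the index of its closest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: distance[i][medoids[c]])
        for i in range(len(distance))
    ]


def pam_sketch(distance, k, seed=0, max_iterations=100):
    n = len(distance)
    rng = random.Random(seed)
    medoids = rng.sample(range(n), k)  # Step 1: k random entities
    for _ in range(max_iterations):
        labels = assign(distance, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the medoid of an empty cluster
                continue
            # Step 3: the member with the lowest total (hence average) distance
            # to the other members of its cluster becomes the medoid.
            best = min(members, key=lambda m: sum(distance[m][j] for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break  # Step 4: no medoid changed
        medoids = new_medoids
    return medoids, assign(distance, medoids)


# Hypothetical one-dimensional values and their pairwise distance matrix.
values = [1, 2, 3, 10, 11, 12]
distance = [[abs(a - b) for b in values] for a in values]

medoids, labels = pam_sketch(distance, k=2)
print(sorted(medoids), labels)
```

Because the medoids are always actual observations and only the distance matrix is needed, the same loop works with any dissimilarity measure, not just Euclidean distance.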
Visualization with reduced dimension
• One way to visualize many variables in a lower dimensional space is
with t-distributed stochastic neighbor embedding (t-SNE).
• This method is a dimension reduction technique that tries to
preserve local structure so as to make clusters visible in a 2D or 3D
visualization.
• https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding
Regulatory Sandboxes
• The regulatory sandbox allows businesses to test innovative
products, services, business models and delivery mechanisms in the
real market, with real consumers.
• https://www.fca.org.uk/firms/regulatory-sandbox
US Regulators catching up
Model Validation
• Ref: [1] Supervisory Letter SR 11-7, Guidance on Model Risk Management
Cloud maturing
• Financial Services customers like Capital One, FINRA, and Pacific Life
are moving critical workloads to AWS
www.QuSandbox.com
Use cases
Quant/Enterprise use cases
• Create an environment that can support multiple platforms and
programming languages
• Enable remote running of applications
• Enable trying out a GitHub submission or someone else’s code
• Facilitate creation of Docker images to create replicable containers
• Create prototyping environments for Data Science/Quant teams
• Enable Data scientists/Quants to deploy their solutions
• Enable running multiple experiments concurrently
• Integrate seamlessly with the cloud to scale up computations
Fintech use cases
• To demonstrate solutions to enterprises
• Create customized enterprise trials for companies that don’t permit
installation of vendor software prior to procurement
• To manage quick updates
• Enable effective integration and hosting of services (REST APIs)
Academic use cases
• Enable creation of course material and exercises that could be
shared
• Enable students and workshop participants to focus on the data
science experiments rather than environment setup
Create and manage replicable environments (Code + software + data) in a single portal
Create replicable environments (Code + software + data) through an easy point-and-click tool and
publish to Docker Hub or manage internally
Share it with target users
User portal
www.QuSandbox.com
www.analyticscertificate.com/MachineLearning