Cluster Analysis
An Introduction
Agenda
Introduction:
• Why Segmentation?
• Types of Segmentation:
a) Objective Segmentation (CHAID).
b) Subjective Segmentation (Cluster Analysis).
• What is Cluster Analysis?
• Basic Concepts.
Business Examples of Cluster Analysis:
• Cluster Analysis for CCC Thailand (An Overview).
Cluster Analysis Process:
• Data Cleaning and Preparing the data set for analysis.
• Creating new relevant Variables.
• Selection of Variables.
• Tackling the Outliers.
• Treatment on Missing Values.
• Multicollinearity Check and hence reducing dimensions.
• Standardization of the selected variables.
• Getting Cluster Solution.
• Checking the optimality of the solution.
SAS Procedures:
• Proc Factor
• Proc Standard
• Proc Cluster / Proc Fastclus
Implementing a Cluster Solution:
• Scoring Code and Minimum Euclidean Distance Method
Step A: Why Segmentation?
Step B: Types of Segmentation:
1. Objective Segmentation (CHAID)
2. Subjective Segmentation (Cluster Analysis)
Solution: Identify segments whose members share the same characteristics, and target each of these segments in a different way.
[Diagram: Segmentation divides into Objective and Subjective]
Subjective Segmentation (Cluster Analysis)
Step D: Basic Concepts
The attributes of the objects are allowed to determine which objects should be grouped together.
Step B.2: Subjective Segmentation
Consider a portfolio of 1000 customers holding credit. The business wants to apply different strategies to different groups of people. ‘Objective segmentation’ won’t help here, since we don’t know who the responders are.
Total Population (1000), divided into four groups:
• Avg. delinquency age = 15 days, Avg. age = 33 yrs., Avg. Utilization = 60%
• Avg. delinquency age = 12 days, Avg. age = 25 yrs., Avg. Utilization = 90%
• Avg. delinquency age = 0 days, Avg. age = 35 yrs., Avg. Utilization > 80%
• Avg. delinquency age = 75 days, Avg. age = 50 yrs., Avg. Utilization = 40%
We can exclude the group with avg. delinquency age = 75 days from mailing.
This type of segmentation is known as ‘Subjective Segmentation’. It gives the salient characteristics of the best customers.
Step D: Basic Concepts
[Figure: example plot of Current Balance against Income (Low / Medium / High), showing Cluster 1 and Cluster 2]
Cluster 1 and Cluster 2 are differentiated by Income and Current Balance. The objects in Cluster 1 share similar characteristics (High Income and Low Balance), while the objects in Cluster 2 share a different set (High Balance and Low Income).
But an object in Cluster 1 differs considerably from an object in Cluster 2.
Cluster Analysis for CCC Thailand
Central Credit Card - Thailand
Segment Sizes
• Segment 1: 20%
• Segment 2: 29%
• Segment 3: 23%
• Segment 4: 5%
• Segment 5: 15%
• Segment 6: 8%
Significant Variables
Segment Descriptions
Segment 1: Credit Hungry Poor (20%)
• Low monthly income segment.
• High Balance, Credit Utilisation, Fees and Fin. Charges. High revolvers.
• Older accounts.
• Delinquency marginally higher than average.
• Low transactors.
Segment 2: New Revolvers (28%)
• Newer accounts.
• Low monthly income segment.
• High Balance, Credit Utilisation, Fees and Fin. Charges. High revolvers.
• Delinquency marginally higher than average.
• Average transactors.
Segment Descriptions
Segment                  No. of customers    %
Credit hungry poor       15356               20.44
New Revolvers            21161               28.16
New Hopefuls             17373               23.12
Affluent Spenders        3815                5.08
Vintage fuddy-duddies    11248               14.97
Old Risky Inactives      6181                8.23
Note: Refer ‘Behaviour Profile’ and ‘Demographic Profile E1’ in Appendix for Profile details.
Spend Characteristics of Segments
‘Affluent Spenders’ and ‘New Revolvers’ form 33% of the portfolio but account for 53% of all
spends.
‘Affluent Spenders’ alone accounts for 25% of the total spend though it is only 5% of the
portfolio.
[Figure: chart of transaction value by segment. Labels shown: New Revolvers, 28% of portfolio - 28% of all txn. value; Vintage Fuddy-Duddies, 15% of portfolio - 14% of all txn. value]
‘Credit Hungry Poor’ and ‘New Revolvers’ together account for 48% of the
portfolio and 78% of revolvers.
[Figure: stacked bar chart comparing % of Portfolio with % of Revolvers by segment. Data labels, bottom to top: % of Portfolio 20, 28, 23, 5, 15, 8; % of Revolvers 33, 45, 9, 4, 6, 3]
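Step 1: Data Cleaning and Preparing the Data Set
[Diagram: client data on the server, held in different tables, is combined into the final data, which is then ready for analysis]
• Check for inconsistency in the values of variables.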
Step 2: Creating new Variables
Variable Types:
Demographic
Socio-Economic
Product Related
Behavioral
Variable Creation:
New relevant variables are to be created from the existing ones where necessary. For example, in an auto loan portfolio that holds the deposit amount and the price of the vehicle under finance, a new variable ‘Deposit Percent’ = (Deposit Amount / Price of the Vehicle) * 100 can be created.
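The derived variable above can be sketched in a few lines of Python. The field names and the two sample accounts are hypothetical and serve only to illustrate the formula given on the slide.

```python
def deposit_percent(deposit_amount, vehicle_price):
    """Deposit as a percentage of the price of the vehicle under finance."""
    if vehicle_price <= 0:
        raise ValueError("vehicle_price must be positive")
    return (deposit_amount / vehicle_price) * 100


# Hypothetical accounts; the field names are illustrative, not from the source data.
accounts = [
    {"deposit_amount": 2000.0, "vehicle_price": 10000.0},
    {"deposit_amount": 4500.0, "vehicle_price": 18000.0},
]

# Attach the new variable to every account.
for account in accounts:
    account["deposit_percent"] = deposit_percent(
        account["deposit_amount"], account["vehicle_price"]
    )
```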
Step 3: Selection of Variables
E.g. for an Auto Loan portfolio, variables such as Loan Amount, Month on Book, Term, Deposit Amount and Car Price should be considered. We should not look at APR (Annual Percentage Rate), because APR depends mainly on the yearly performance of the business overall rather than on the individual accounts.
E.g. suppose the clustering process is meant to identify highly delinquent people. In this case Maximum Delinquency reached will carry more significance than the month-end delinquency variables.
Step 4: Tackling Outliers
What is an outlier?
An observation is said to be an outlier with respect to a variable if its value lies far away from those of the remaining observations.
[Figure: scatter plot of Var 2 against Var 1 with an outlier marked]
To identify them:
• Univariate and Frequency analysis
To tackle them:
1. The outliers can be deleted from analysis if they are very small in number.
2. The variables selected can be trimmed or capped.
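Capping, the second option above, might look like the following sketch. The 1st and 99th percentile limits are an assumption made for illustration; the slides do not state which cut-offs to use, and the sample values are made up.

```python
def percentile(sorted_values, p):
    """Percentile p (0 to 100) of a pre-sorted list, using linear interpolation."""
    if not sorted_values:
        raise ValueError("no values supplied")
    k = (len(sorted_values) - 1) * p / 100
    lower = int(k)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = k - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def cap(values, lower_pct=1, upper_pct=99):
    """Pull values outside the chosen percentiles back to those percentiles."""
    ordered = sorted(values)
    low = percentile(ordered, lower_pct)
    high = percentile(ordered, upper_pct)
    return [min(max(v, low), high) for v in values]


# Hypothetical variable: 100 ordinary values plus one extreme observation.
balances = list(range(1, 101)) + [10_000]
capped = cap(balances)
```

Capping keeps every observation in the data set, which matters when the outliers are too numerous to delete.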
Step 5: Treatment on Missing Values
Variables with a large share of missing values (about 15%) should not be used for clustering unless ‘Missing’ has a special significance and can be replaced by some meaningful number.
E.g. Insurance Variables.
Note: SAS does not include observations with missing values in the Clustering Process.
% of Missing      Treatments
5-10%             • Regression Imputation
                  • Mean Imputation
More than 10%     • Regression Imputation
                  • Try to use some proxy Variable
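Mean imputation, one of the treatments listed in the table, can be sketched as follows. Missing values are represented by `None`, and the utilization figures are hypothetical; regression imputation would replace the mean with a model prediction and is not shown.

```python
def missing_share(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def mean_impute(values):
    """Replace each missing entry with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: every value is missing")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Hypothetical variable with 1 missing value out of 12 (about 8%).
utilization = [40.0, 60.0, 80.0, None, 60.0, 40.0,
               80.0, 60.0, 40.0, 80.0, 60.0, 60.0]
share = missing_share(utilization)
imputed = mean_impute(utilization)
```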
Step 6: Multi-collinearity Check
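What is ‘Multi-collinearity’?
Multicollinearity arises when two or more of the candidate variables are highly correlated with each other, so that they carry largely the same information.
Factor Analysis:
Using Factor Analysis, select the factors that together explain about 90-95% of the total variation. Then select the variables that have high loadings on those factors.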
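The 90-95% rule can be applied to the eigenvalues reported by a factor extraction. The sketch below assumes those eigenvalues are already available; the five values used are hypothetical, and the function name is illustrative.

```python
def factors_needed(eigenvalues, threshold=0.90):
    """Smallest number of factors whose cumulative share of variation reaches threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return count
    return len(ordered)


# Hypothetical eigenvalues for five standardized variables (they sum to 5).
eigenvalues = [2.6, 1.4, 0.6, 0.25, 0.15]
n_at_90 = factors_needed(eigenvalues, threshold=0.90)
n_at_95 = factors_needed(eigenvalues, threshold=0.95)
```

With these made-up eigenvalues, the first three factors explain 92% of the variation and the first four explain 97%.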
In SAS, the two most commonly used procedures are ‘Proc Fastclus’ and ‘Proc Cluster’.
Proc Cluster offers hierarchical methods:
• Single Linkage
• Complete Linkage
• Two Stage
• Etc.
What is K-Means:
The process starts with K distinct observations (seeds) that lie as far from each other as possible. Each of the remaining observations is then considered one by one and assigned to the nearest cluster. If, along the way, two clusters come significantly close to each other, they are merged to form a new cluster.
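The idea can be illustrated with a minimal sketch of the textbook K-means loop. It is an assumption-laden simplification: seeds are chosen greedily so that they are far apart, starting from the first observation; assignment and centroid updates are repeated until nothing changes; the merging of close clusters described above is not implemented. It is not a reproduction of what Proc Fastclus does internally, and the six points are hypothetical.

```python
import math


def nearest(point, centers):
    """Index of the center closest to point by Euclidean distance."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def farthest_first_seeds(points, k):
    """Greedy seeding: each new seed is the point farthest from the seeds chosen so far."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return seeds


def k_means(points, k, max_iter=100):
    """Return (labels, centers) after alternating assignment and centroid updates."""
    centers = farthest_first_seeds(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(p, centers) for p in points]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous center
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


# Hypothetical standardized observations forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = k_means(points, k=2)
```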
Step 8: Getting the Cluster Solution
Cluster Process:
After the outliers have been cleaned from the data set, either of the above procedures can be used to build clusters. There is no hard and fast rule on the number of clusters or their sizes. As a rule of thumb, each cluster should hold at least 5% of the observations and the total number of clusters should be between 5 and 15.
Some of the variables present in the data set are to be used for clustering. These variables must be numeric. They may be continuous or discrete, but if discrete there must be an ordinality among the categories.
The goodness of a particular set of clusters is measured by the extent to which the means of the clustering variables differ from one cluster to another.
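The rule of thumb on cluster count and cluster size is easy to check mechanically. The sketch below applies it to the six CCC Thailand segment counts shown earlier in this deck; the function name and the shape of the returned dictionary are illustrative choices.

```python
def check_rule_of_thumb(sizes, min_share=0.05, min_clusters=5, max_clusters=15):
    """Check the number of clusters and flag clusters holding less than min_share."""
    total = sum(sizes.values())
    too_small = sorted(name for name, size in sizes.items() if size / total < min_share)
    return {
        "n_clusters_ok": min_clusters <= len(sizes) <= max_clusters,
        "too_small": too_small,
    }


# Segment counts taken from the CCC Thailand table in this deck.
segment_sizes = {
    "Credit hungry poor": 15356,
    "New Revolvers": 21161,
    "New Hopefuls": 17373,
    "Affluent Spenders": 3815,
    "Vintage fuddy-duddies": 11248,
    "Old Risky Inactives": 6181,
}
result = check_rule_of_thumb(segment_sizes)
```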
Approximate Expected Overall R Square | Similar to above, different formula; calculated assuming variables are independent. | Close to Overall R Square (diff <= 0.1)
RMS STD | For each cluster: Sqrt{Avg[Variance(Var1), Variance(Var2), ...]} | <= 1.1
Distance Between Cluster Centroids | How close or how far apart are cluster centroids | >= 1.5
Maximum distance from seed to observation | "Dispersion" within each cluster | Relative; roughly uniform across clusters
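The RMS STD entry in the table can be sketched directly from its formula: for one cluster, take the variance of each clustering variable, average the variances and take the square root. Using the sample variance (divisor n - 1) is an assumption here, and the three observations are hypothetical.

```python
import math
import statistics


def rms_std(cluster_rows):
    """Sqrt of the average per-variable variance within one cluster.

    cluster_rows holds one tuple per observation; at least two rows are needed.
    """
    columns = list(zip(*cluster_rows))
    variances = [statistics.variance(column) for column in columns]
    return math.sqrt(sum(variances) / len(variances))


# Hypothetical cluster of three observations on two standardized variables.
cluster = [(1.0, 2.0), (3.0, 2.0), (5.0, 2.0)]
value = rms_std(cluster)
```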
Step 9: Evaluating the Cluster Solution
Variable R-Square:
For a given data set / variable combination, the “Total amount of Variation” is fixed.
If there are k clusters in the solution, then Total Variation = Within Variation + Between Variation
Within Variation(var 1) = [Variation(var 1) within Cluster 1] + [Variation(var 1) within Cluster 2] + ... + [Variation(var 1) within Cluster k]
Between Variation = Variation from one cluster to another (i.e. variation of cluster means).
[Figure: scatter plot of Var 2 against Var 1 showing Cluster 1, Cluster 2 and Cluster 3]
R-Square = Between Variation / Total Variation
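The decomposition can be worked through for a single variable. The sketch below assumes "variation" means a sum of squared deviations, obtains Between Variation as Total minus Within, and uses hypothetical values and cluster labels.

```python
def r_square(values, labels):
    """Between Variation / Total Variation for one variable, given cluster labels."""
    grand_mean = sum(values) / len(values)
    total = sum((v - grand_mean) ** 2 for v in values)
    if total == 0:
        raise ValueError("variable has no variation")

    # Group the values by cluster.
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    # Within Variation: squared deviations from each cluster's own mean.
    within = 0.0
    for members in groups.values():
        cluster_mean = sum(members) / len(members)
        within += sum((v - cluster_mean) ** 2 for v in members)

    between = total - within  # Total = Within + Between
    return between / total


# Hypothetical variable with two well separated clusters.
var1 = [1, 2, 3, 11, 12, 13]
cluster_labels = [0, 0, 0, 1, 1, 1]
rsq = r_square(var1, cluster_labels)
```

Here Total Variation is 154 and Within Variation is 4, so R-Square is 150 / 154, roughly 0.97. A value close to 1 means the clusters separate this variable well.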
RMS STD
Minimum Euclidean Distance Method
[Figure: scatter plot of Var 2 against Var 1 showing a new observation relative to Cluster 1, Cluster 2 and Cluster 3]
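Scoring a new observation by minimum Euclidean distance can be sketched as follows. The centroid coordinates and the new observation are hypothetical. In practice the new observation would presumably need the same standardization that was applied to the clustering variables when the solution was built.

```python
import math


def score(observation, centroids):
    """Assign the observation to the cluster whose centroid is nearest.

    Returns the winning cluster name and the distance to every centroid.
    """
    distances = {name: math.dist(observation, center) for name, center in centroids.items()}
    return min(distances, key=distances.get), distances


# Hypothetical centroids on (Var 1, Var 2); not taken from the slides.
centroids = {
    "Cluster 1": (10.0, 45.0),
    "Cluster 2": (25.0, 15.0),
    "Cluster 3": (40.0, 35.0),
}
new_observation = (12.0, 60.0)
assigned, distances = score(new_observation, centroids)
```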
Thank You
Contact us
Join us on:
Twitter - http://twitter.com/#!/AnalytixLabs
Facebook - http://www.facebook.com/analytixlabs
LinkedIn - http://www.linkedin.com/in/analytixlabs
Blog - http://www.analytixlabs.co.in/category/blog/