Cluster Analysis (SAS Based)

Cluster Analysis
An Introduction
This document is an introduction to Cluster Analysis. The objectives:
1. Understanding of the underlying concepts
2. Appreciation of when to use the technique
3. What Cluster Analysis can and can’t do
4. The process of building a cluster solution
5. Evaluating a cluster solution
6. Implementing a cluster solution
Agenda
Introduction:
• Why Segmentation?
• Types of Segmentation:
a) Objective Segmentation (CHAID).
b) Subjective Segmentation (Cluster Analysis).
• What is Cluster Analysis?
• Basic Concepts.
Business Examples of Cluster Analysis:
• Cluster Analysis for CCC Thailand (An Overview).
Cluster Analysis Process:
• Data Cleaning and Preparing the data set for analysis.
• Creating new relevant Variables.
• Selection of Variables.
• Tackling the Outliers.
• Treatment of Missing Values.
• Multicollinearity Check and hence reducing dimensions.
• Standardization of the selected variables.
• Getting Cluster Solution.
• Checking the optimality of the solution.
SAS Procedures:
• Proc Factor
• Proc Standard
• Proc Cluster / Proc Fastclus
Evaluating a Cluster Solution:
• % of Population in each Cluster
• Maximum Distance from Cluster Centroid
• Distance from the nearest Cluster
• Variable R-square
• Overall R-square
Implementing a Cluster Solution:
• Scoring Code and Minimum Euclidean Distance Method
Step A: Why Segmentation ?
Step B: Types of Segmentation:
1. Objective Segmentation (CHAID)
2. Subjective Segmentation (Cluster Analysis)
Step C: What is Cluster Analysis?
Step D: Basic Concepts.
Each individual is so different that ideally we would want to reach out to each one of them in a different way.
Problem: The volume is too large for customization at the individual level.
Solution: Identify segments in which people share the same characteristics and target each of these segments in a different way.
Segmentation is for better targeting
• Segmentation provides a catalyst for creative insights; it is often the first step in marketing strategy planning.
• Segmentation is a multipurpose technique.
• Segmentation provides a common vocabulary for communicating marketing analysis.
• Segmentation can complement models.
• Segmentation is a technique used by many of our competitors.
Step B: Types of Segmentation
Segmentation is of two types:
• Objective: CHAID
• Subjective: Cluster Analysis
Step B.2: Subjective Segmentation (Cluster Analysis)
Step D: Basic Concepts
Cluster Analysis is a technique used for combining observations into groups such that:
• Each group is homogeneous (similar) with respect to certain characteristics, and
• Each group is different from the other groups with respect to the same characteristics.
Step C: What is Cluster Analysis?
Cluster Analysis is the process by which objects are classified into a number of groups so that they are as dissimilar as possible from one group to another, but as similar as possible within each group.
In other words, Cluster Analysis means dividing the whole population into groups which are distinct from one another but internally similar.
[Figure: Total Population split into Group 1, Group 2, Group 3 and Group 4]
The objects in Group 1 should be as similar as possible, but there should be a large difference between an object in Group 1 and an object in Group 2.
The attributes of the objects are allowed to determine which objects should be grouped together.
Step B.2: Subjective Segmentation
Consider a portfolio of 1000 customers holding credit. The business wants different strategies for different groups of people. ‘Objective segmentation’ won’t help here, since we don’t know who the responders are, so there is no target variable to segment on.
In this case we need some profiling as below:
Total Population (1000), split into four groups:
• Avg. delinquency age = 0 days, Avg. age = 35 yrs., Avg. Utilization > 80%
• Avg. delinquency age = 15 days, Avg. age = 33 yrs., Avg. Utilization = 60%
• Avg. delinquency age = 12 days, Avg. age = 25 yrs., Avg. Utilization = 90%
• Avg. delinquency age = 75 days, Avg. age = 50 yrs., Avg. Utilization = 40%
We can exclude the group with avg. delinquency age = 75 days from mailing.
This type of segmentation is known as ‘Subjective Segmentation’. It gives the salient characteristics of the best customers.
Step D: Basic Concepts
Basic Concepts of Cluster Analysis Using Two Variables
[Figure: scatter plot of Current Balance (Low / Medium / High) against Gross Monthly Income (Low / Medium / High). Example Cluster 1: High Balance, Low Income. Example Cluster 2: High Income, Low Balance.]
Cluster 1 and Cluster 2 are differentiated by Income and Current Balance. The objects in Cluster 1 share one set of characteristics (High Balance and Low Income), while the objects in Cluster 2 share another (High Income and Low Balance).
There is, however, a large difference between an object in Cluster 1 and an object in Cluster 2.
Cluster Analysis for CCC Thailand
Central Credit Card - Thailand: Segmentation of the PLCC Portfolio
Cluster Sizes
The Cluster Analysis, performed on 75,134 Logo 02 accounts, resulted in 6 segments.
Segment Sizes
• Segment 1: 20%
• Segment 2: 29%
• Segment 3: 23%
• Segment 4: 5%
• Segment 5: 15%
• Segment 6: 8%
Significant Variables
The variables used for generating the segments are:
1. Account age (in days)
2. No. of times revolved in the last 3 months
3. Time (in months) elapsed between last transaction and end of window (Recency)
4. Sum of all transaction amounts over the last 6 months
Variable | Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 | Segment 6
Account age (days) | 436.18 | 282.06 | 265.66 | 342.17 | 434.42 | 397.7
Total amt. of sale transactions - last 6 months (Bht) | 6282.05 | 8975.72 | 7370.61 | 43385.06 | 8075.22 | 0.53
Recency (months) | 1.07 | 0.85 | 1.09 | 0.11 | 0.91 | 10.41
No. of times revolved - last 3 months | 2.92 | 2.9 | 0.15 | 0.72 | 0.16 | 0.55
These variables differ across segments.

Segment Descriptions: 1 to 4
Segment 1: Credit Hungry Poor (20%)
• Low monthly income segment.
• High Balance, Credit Utilisation, Fees and Fin. Charges. High revolvers.
• Older accounts.
• Delinquency marginally higher than average.
• Low transactors.
Segment 2: New Revolvers (28%)
• Newer accounts.
• Low monthly income segment.
• High Balance, Credit Utilisation, Fees and Fin. Charges. High revolvers.
• Delinquency marginally higher than average.
• Average transactors.
Segment 3: New Hopefuls (23%)
• Newest accounts.
• Low Balance, Credit Utilisation, Fees and Fin. Charges. Low revolvers.
• Low transactors in number and total amount transacted.
• But sale per transaction is slightly higher than average. Therefore, they hold some hope for the future.
• Delinquency higher than average.
Segment 4: Affluent Spenders (5%)
• Highest monthly income.
• High Balance, but low Fees and Fin. Charges.
• Heavy transactors.
Segment Descriptions: 5 & 6
Segment 5: Vintage Fuddy-Duddies (15%)
• Older accounts.
• Low Balance, Credit Utilisation, Fin. Charges, Fees.
• Low revolvers.
• Average transactors.
Segment 6: Old Risky Inactives (8%)
• Older customers.
• Low Balance, Credit Utilisation.
• Non transactors.
• High delinquency.
Segment | No. of customers | %
Credit Hungry Poor | 15356 | 20.44
New Revolvers | 21161 | 28.16
New Hopefuls | 17373 | 23.12
Affluent Spenders | 3815 | 5.08
Vintage Fuddy-Duddies | 11248 | 14.97
Old Risky Inactives | 6181 | 8.23
Note: Refer to ‘Behaviour Profile’ and ‘Demographic Profile E1’ in the Appendix for profile details.
Spend Characteristics of Segments
‘Affluent Spenders’ and ‘New Revolvers’ form 33% of the portfolio but account for 53% of all
spends.
‘Affluent Spenders’ alone accounts for 25% of the total spend though it is only 5% of the
portfolio.
[Figure: AVT (value per txn. in Bht) plotted against AFT (no. of txns. per month) for each segment. Chart annotations: Affluent Spenders, 5% of portfolio and 25% of all txn. value; Credit Hungry Poor, 20% of portfolio and 14% of all txn. value; New Hopefuls, 23% of portfolio and 19% of all txn. value; Old Risky Inactives, 8% of portfolio and 0% of all txn. value. New Revolvers and Vintage Fuddy-Duddies are also plotted.]
Revolver Characteristics of Segments
‘Credit Hungry Poor’ and ‘New Revolvers’ together account for 48% of the
portfolio and 78% of revolvers.
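[Figure: stacked bar chart comparing % of Portfolio with % of Revolvers for the six segments (Credit Hungry Poor, New Revolvers, New Hopefuls, Affluent Spenders, Vintage Fuddy-Duddies, Old Risky Inactives). Credit Hungry Poor and New Revolvers make up 20% and 28% of the portfolio but 33% and 45% of revolvers.]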
Step 1: Data Cleaning and Preparing the data set for analysis.
Step 2: Creating new relevant Variables.
Step 3: Selection of Variables.
Step 4: Tackling the Outliers.
Step 5: Treatment of Missing Values.
Step 6: Multicollinearity Check and hence reducing dimensions.
Step 7: Standardization of the selected variables
Step 8: Getting Cluster Solution.
Step 9: Checking the optimality of the solution.
Clustering Process
[Figure: Process Flow for Cluster Analysis, running from Step 1 (Data Cleaning and Preparing the data set for analysis) through Step 2 (Creating New Relevant Variables), Step 3 (Selection of Variables), Step 4 (Tackling the Outliers), Step 5 (Treatment of Missing Values), Step 6 (Multicollinearity Check), Step 7 (Standardization) and Step 8 (Getting Cluster Solution) to Step 9 (Checking the Optimality of the Solution).]

Step 1: Preparation of Data
[Figure: data preparation flow. Client data held in different tables on the server is merged (Data Merging) into Merged Data at account level or customer level, then cleaned (Data Cleaning) to give the Final Data, ready for analysis.]
Cleaning Process:
• Identify the erroneous values.
• Check for inconsistency in the values of variables.
Step 2: Creating new Variables
Variable Types:
Demographic
Socio-Economic
Product Related
Behavioral
Variable Creation:
New relevant variables are to be created from the existing ones where necessary. For example, in an auto loan portfolio with variables such as deposit amount and price of the vehicle under finance, a new variable ‘Deposit Percent’ = (Deposit Amount / Price of the Vehicle) * 100 can be created.
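The ‘Deposit Percent’ derivation can be sketched in a few lines. This is an illustrative Python sketch rather than the deck's SAS code; the account records, the field names and the guard against a zero or missing price are all assumptions made for the example.

```python
def deposit_percent(deposit_amount, vehicle_price):
    """Deposit as a percentage of the vehicle price; None if it cannot be computed."""
    if deposit_amount is None or not vehicle_price:
        return None
    return deposit_amount / vehicle_price * 100


# Hypothetical auto-loan accounts (illustrative values only).
accounts = [
    {"id": "A1", "deposit_amount": 2000.0, "vehicle_price": 10000.0},
    {"id": "A2", "deposit_amount": 4500.0, "vehicle_price": 18000.0},
    {"id": "A3", "deposit_amount": 1000.0, "vehicle_price": 0.0},  # erroneous price
]

# Add the derived variable to every account record.
for acct in accounts:
    acct["deposit_percent"] = deposit_percent(
        acct["deposit_amount"], acct["vehicle_price"]
    )
    print(acct["id"], acct["deposit_percent"])
```

Returning None for an unusable price keeps the erroneous record visible, so it can be handled in the missing-value step instead of silently becoming zero.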
Step 3: Selection of Variables
• There is no limit to the number of variables that can be selected for analysis.
• The selection of variables depends on the purpose of the clustering.
• Irrelevant variables are to be dropped.
• Variables with a large % of missing values are to be dropped.
E.g. For an Auto Loan portfolio, variables such as Loan Amount, Months on Book, Term, Deposit Amount and Car Price should be considered. We should not look at APR (Annual Percentage Rate), because APR depends mainly on the yearly performance of the business overall rather than on the individual accounts.
E.g. Suppose the clustering is meant to identify highly delinquent people. In this case, Maximum Delinquency reached will be more significant than the month-end delinquency variables.
Step 4: Tackling Outliers
What is an outlier?
An observation is said to be an outlier with respect to a variable if it lies far away from the remaining observations.
[Figure: scatter plot of Var 2 against Var 1 with one outlier marked]
To identify them:
• Univariate and frequency analysis
• Histogram and box plot
To tackle them:
1. The outliers can be deleted from the analysis if they are very few in number.
2. The selected variables can be trimmed or capped.
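Capping can be illustrated with a short sketch. The Python below is an assumption-laden example and not the deck's method: it caps at nearest-rank percentiles, and the 10th/90th cut-offs and the balance values are hypothetical choices made only so the effect is visible on ten observations.

```python
import math


def percentile(ordered, pct):
    """Nearest-rank percentile (pct between 0 and 100) of a sorted list."""
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cap(values, lower_pct=1, upper_pct=99):
    """Pull values outside the chosen percentiles back to those percentiles."""
    ordered = sorted(values)
    low = percentile(ordered, lower_pct)
    high = percentile(ordered, upper_pct)
    return [min(max(v, low), high) for v in values]


# Hypothetical balances with one extreme value.
balances = [120, 150, 135, 160, 140, 155, 145, 130, 125, 9000]
capped = cap(balances, lower_pct=10, upper_pct=90)
print(capped)
```

Capping keeps every observation in the data set, whereas deletion is only reasonable when the outliers are very few.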
Step 5: Treatment of Missing Values
Variables with many missing values (about 15%) should not be used for clustering unless ‘Missing’ has a special significance and can be replaced by some meaningful number.
E.g. Insurance variables.
Note: SAS does not include observations with missing values in the clustering process.
% of Missing | Treatments
Less than 1% | Delete those observations; Mean Imputation
1-5% | Mean Imputation; Regression Imputation
5-10% | Mean Imputation; Regression Imputation
More than 10% | Try to use some proxy variable
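Mean imputation, the simplest treatment in the table above, might look like the following Python sketch. The utilization values are hypothetical, missing values are represented as None by assumption, and regression imputation is not shown.

```python
def missing_share(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a variable with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Hypothetical utilization variable: 19 observed values and 1 missing (5%).
utilization = [50.0 + i for i in range(19)] + [None]

print(missing_share(utilization))
filled = impute_mean(utilization)
print(filled[-1])
```

Checking the missing share first makes it easy to decide which row of the treatment table applies before any value is filled in.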
Step 6: Multi-collinearity Check
What is ‘Multi-collinearity’?
A set of independent (explanatory) variables is said to exhibit ‘Multi-collinearity’ if there is a linear relationship among them.
Devices to tackle ‘Multi-collinearity’:
Factor Analysis:
Using Factor Analysis, select the factors that together explain about 90 to 95% of the total variation. Then select the variables that have high loadings on those factors.
VIF (Variance Inflation Factor):
Variables with a VIF of more than 2 should be dropped.
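The VIF of a variable is 1 / (1 - R²), where R² comes from regressing that variable on the other candidates. With only two variables, R² is simply the squared Pearson correlation, which allows a small Python sketch without a regression routine. The two-variable restriction and all data values below are assumptions for illustration; a real check would regress each variable on all the others.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_variables(x, y):
    """VIF when y is the only other candidate: 1 / (1 - r^2)."""
    r_squared = pearson_r(x, y) ** 2
    return 1 / (1 - r_squared)


# Hypothetical auto-loan variables (in thousands, and months for term).
loan_amount = [10, 12, 15, 18, 20, 25]
car_price = [12, 15, 18, 22, 24, 31]    # moves almost in step with loan_amount
term = [36, 60, 24, 48, 36, 60]         # only weakly related to loan_amount

print(vif_two_variables(loan_amount, car_price))  # far above the cut-off of 2
print(vif_two_variables(loan_amount, term))       # below 2
```

Under the deck's cut-off of 2, only one of Loan Amount and Car Price would be kept in this toy data, while Term could stay.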
Step 7: Standardization
Why do we need ‘Standardization’?
Since the units of measurement differ from variable to variable, standardization is a must.
E.g. Consider two variables, Age and Income. The unit of Age is ‘Year’ and the unit of Income is, say, ‘$’. Hence they are not comparable.
Without standardization, the distance between two clusters has no meaningful unit of measurement, and the variable with the larger scale dominates it.
Generally we standardize each variable to mean = 0 and variance = 1.
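A minimal Python sketch of this standardization follows. It assumes the sample standard deviation (n - 1 divisor), and the age and income values are hypothetical.

```python
import math


def standardize(values):
    """Rescale values to mean 0 and sample variance 1 (z-scores)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]


# Hypothetical customers: age in years, income in dollars.
age = [25, 35, 45, 55]
income = [2000, 4000, 6000, 8000]

z_age = standardize(age)
z_income = standardize(income)
print(z_age)
print(z_income)
```

In this toy data the two variables end up with identical z-scores, which shows that after standardization the original units no longer affect distances.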

Step 8: Getting the Cluster Solution
Cluster Process:
In SAS the two most commonly used procedures are ‘Proc Cluster’ and ‘Proc Fastclus’.
• Proc Cluster (hierarchical methods): Single Linkage, Complete Linkage, Two Stage, etc.
• Proc Fastclus: K-Means
What is K-Means:
The process starts with K distinct observations (seeds) that are as far from each other as possible. Each of the remaining observations is then considered one by one and assigned to the nearest cluster.
The cluster centres are then recomputed as the means of their members, and the assignment is repeated until the clusters no longer change.
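A minimal sketch of the standard K-Means iteration is given below in Python. It is not the exact algorithm inside Proc Fastclus: the farthest-point seeding rule, the stopping rule and the nine two-dimensional points are assumptions chosen to keep the example small and deterministic.

```python
import math


def nearest(point, centres):
    """Index of the centre closest to point (Euclidean distance)."""
    return min(range(len(centres)), key=lambda i: math.dist(point, centres[i]))


def pick_seeds(points, k):
    """Take the first point, then keep adding the point farthest from the seeds so far."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return seeds


def k_means(points, k, max_iter=100):
    """Assign points to the nearest centre and update centres until stable."""
    centres = pick_seeds(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(p, centres) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old centre if a cluster is empty
                centres[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centres


# Hypothetical standardized observations forming three visible groups.
points = [(1, 1), (1.5, 2), (1, 1.5),
          (8, 8), (8.5, 9), (9, 8),
          (1, 9), (1.5, 8.5), (2, 9)]
labels, centres = k_means(points, k=3)
print(labels)
print(centres)
```

Seeding with points that are far apart mirrors the idea of starting from K well-separated observations and usually avoids two seeds landing in the same group.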
Cluster Process:
After the outliers have been cleaned from the data set, any of the above procedures can be used to build clusters. There is no hard and fast rule on the number of clusters or on cluster sizes, but the rule of thumb is that each cluster should hold at least 5% of the observations and the total number of clusters should be between 5 and 15.
Some of the variables present in the data set are to be used for clustering. These variables must be numeric. They may be continuous or discrete, but if discrete the categories must have a natural order (ordinality).
The goodness of a particular set of clusters is measured by the extent to which the means of the clustering variables differ from one cluster to another.
Profiling the Clusters:
After the clusters are built, they are to be profiled with respect to discrete and continuous variables to identify the distinguishing features of each cluster.
Step 9: Evaluating the Cluster Solution
Statistic/Measure | Meaning | Ideal value
Variable R-Square | Between Variation / Total Variation | >= 0.3
Overall R-Square | Avg(Var R-Square); 1 - Avg[WithinVariance(Var1), WithinVariance(Var2), …] | >= 0.6
Approximate Expected Overall R-Square | Similar to above, different formula; calculated assuming variables are independent | Close to Overall R-Square (diff <= 0.1)
RMS STD | For each cluster: Sqrt{Avg[Variance(Var1), Variance(Var2), …]} | <= 1.1
Distance Between Cluster Centroids | How close or how far apart cluster centroids are | >= 1.5
Maximum distance from seed to observation | "Dispersion" within each cluster | Relative; roughly uniform across clusters
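Two of these checks, the share of the population in each cluster and the distance between cluster centroids, are easy to automate. The Python sketch below uses the thresholds quoted in this deck (at least 5% per cluster, centroid distance >= 1.5); the labels and the standardized centroids are hypothetical.

```python
import math
from collections import Counter


def cluster_shares(labels):
    """Fraction of all observations that falls in each cluster."""
    counts = Counter(labels)
    return {cluster: n / len(labels) for cluster, n in counts.items()}


def min_centroid_distance(centroids):
    """Smallest Euclidean distance between any two cluster centroids."""
    return min(
        math.dist(a, b)
        for i, a in enumerate(centroids)
        for b in centroids[i + 1:]
    )


# Hypothetical solution: 100 observations in 4 clusters.
labels = [0] * 50 + [1] * 30 + [2] * 17 + [3] * 3
centroids = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (1.0, 1.0)]

shares = cluster_shares(labels)
too_small = [c for c, share in shares.items() if share < 0.05]
closest = min_centroid_distance(centroids)

print(too_small)        # clusters holding less than 5% of the population
print(closest >= 1.5)   # False means some centroids are too close together
```

In this toy solution the fourth cluster fails both checks, which would suggest trying a smaller number of clusters.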
Variable R-Square:
For a given data set / variable combination, the “Total amount of Variation” is fixed.
If there are k clusters in the solution, then Total Variation = Within Variation + Between Variation.
Within Variation(var 1) = [Variation(var 1) within Cluster 1] + [Variation(var 1) within Cluster 2] + … + [Variation(var 1) within Cluster k]
Between Variation = variation from one cluster to another (i.e. variation of the cluster means).
R-Square = Between Variation / Total Variation
[Figure: scatter plot of Var 2 against Var 1 showing Cluster 1, Cluster 2 and Cluster 3]
• Clusters should be homogeneous within themselves.
• Within Variation should be lower and Between Variation should be higher.
• The higher the R-Square, the better.
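The decomposition Total Variation = Within Variation + Between Variation can be checked numerically. The Python sketch below computes the R-Square of a single variable from sums of squared deviations; the six values and their cluster labels are hypothetical.

```python
def variable_r_square(values, labels):
    """R-Square of one variable: Between Variation / Total Variation."""
    n = len(values)
    grand_mean = sum(values) / n
    total = sum((v - grand_mean) ** 2 for v in values)

    # Within Variation: squared deviations from each cluster's own mean.
    within = 0.0
    for cluster in set(labels):
        members = [v for v, lab in zip(values, labels) if lab == cluster]
        cluster_mean = sum(members) / len(members)
        within += sum((v - cluster_mean) ** 2 for v in members)

    between = total - within
    return between / total


# Hypothetical variable measured on six observations in two clusters.
values = [1, 2, 3, 11, 12, 13]
labels = [0, 0, 0, 1, 1, 1]

# Total = 154, Within = 4, Between = 150
print(variable_r_square(values, labels))
```

Here the clusters are tight and far apart, so nearly all of the variation is between clusters and the R-Square is close to 1.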
Approximate Expected Overall R-Square
The Approximate Expected Overall R-Square is calculated under the hypothesis that all the variables used for clustering are independent.
Hence, if there is a large difference between the observed Overall R-Square and the Approximate Expected Overall R-Square, we can suspect high correlation among the clustering variables.
RMS STD
RMS STD within a cluster = Square root of the Average of (Variance of variable 1 in that cluster, Variance of variable 2 in that cluster, …, Variance of variable p in that cluster). [Assuming p variables were used for clustering.]
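The RMS STD formula translates directly into code. The Python sketch below assumes sample variances (n - 1 divisor, as computed by statistics.variance) and uses a hypothetical cluster of three observations on p = 2 variables.

```python
import math
from statistics import variance


def rms_std(cluster_rows):
    """Root-mean-square standard deviation of one cluster.

    cluster_rows holds the cluster's observations, each a tuple of p variables.
    """
    columns = list(zip(*cluster_rows))            # one tuple per variable
    variances = [variance(col) for col in columns]
    return math.sqrt(sum(variances) / len(variances))


# Hypothetical cluster: the variances of the two variables are 1 and 4.
cluster = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
print(rms_std(cluster))  # square root of (1 + 4) / 2
```

On standardized variables a value well above 1 indicates a cluster that is more spread out than the population as a whole.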
There is no hard restriction on the number of clusters, but as a rule of thumb it should be between 5 and 15.
Care should be taken over the number of observations in each cluster. A good rule of thumb is to have >= 5% of the population in each cluster.
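SAS Procedures: Proc Factor, Proc Cluster, Proc Fastclus
Typical SAS code
/* Factor analysis: principal-component extraction, varimax rotation */
proc factor data=t5 method=prin nfactors=10 rotate=varimax out=final1;
run;

/* Standardize the selected variables to mean 0 and standard deviation 1 */
proc standard data=out1 out=out2 mean=0 std=1;
   var amt_fin term dep_per age mon_book;
run;

/* Hierarchical clustering with complete linkage */
proc cluster data=out1 method=complete;
run;

/* K-means: at most 120 clusters and 100 iterations; delete= drops seeds */
/* with too few observations, short trims the printed output */
proc fastclus data=out1 out=out2 maxc=120 maxiter=100 delete=1200 short;
run;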
Scoring: Minimum Euclidean Distance Method
[Figure: scatter plot of Var 2 against Var 1 showing Cluster 1, Cluster 2, Cluster 3 and a New Observation lying closest to Cluster 1]
The New Observation will be a member of Cluster 1.
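Scoring by minimum Euclidean distance can be sketched as follows in Python. The centroid coordinates and the new observation are hypothetical values read loosely off a plot of this kind; in practice a new observation would first be standardized with the same means and standard deviations used when the clusters were built.

```python
import math


def score(observation, centroids):
    """Return the name of the nearest centroid and the distance to it."""
    distances = {
        name: math.dist(observation, centre)
        for name, centre in centroids.items()
    }
    best = min(distances, key=distances.get)
    return best, distances[best]


# Hypothetical cluster centroids on (Var 1, Var 2).
centroids = {
    "Cluster 1": (15.0, 45.0),
    "Cluster 2": (30.0, 15.0),
    "Cluster 3": (42.0, 32.0),
}
new_observation = (12.0, 68.0)

cluster, distance = score(new_observation, centroids)
print(cluster, round(distance, 2))
```

Only the centroids need to be stored to score new accounts, which makes the method simple to deploy once the cluster solution is final.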

The SAS code for implementation and the Cluster Analysis output were attached to the original slides as embedded Microsoft Word documents.
Thank You
Contact us
Visit us on: http://www.analytixlabs.in/
For course registration, please visit: http://www.analytixlabs.co.in/course-registration/
For more information, please contact us: http://www.analytixlabs.co.in/contact-us/

Or email: info@analytixlabs.co.in
Call us we would love to speak with you: (+91) 88021-73069
Join us on:
Twitter - http://twitter.com/#!/AnalytixLabs
Facebook - http://www.facebook.com/analytixlabs
LinkedIn - http://www.linkedin.com/in/analytixlabs
Blog - http://www.analytixlabs.co.in/category/blog/