Cluster Analysis

An Introduction

This document is an introduction to Cluster Analysis. The objectives:
1. Understanding of the underlying concepts
2. Appreciation of when to use the technique
3. What Cluster Analysis can and can’t do
4. The process of building a cluster solution
5. Evaluating a cluster solution
6. Implementing a cluster solution
Agenda

Introduction:
• Why Segmentation ?
• Types of Segmentation:
a) Objective Segmentation (CHAID).
b) Subjective Segmentation (Cluster Analysis).
• What is Cluster Analysis ?
• Basic Concepts.
Business Examples of Cluster Analysis:
• Cluster Analysis for CCC Thailand (An Overview).
Cluster Analysis Process:
• Data Cleaning and Preparing the data set for analysis.
• Creating new relevant Variables.
• Selection of Variables.
• Tackling the Outliers.
• Treatment of Missing Values.
• Multicollinearity Check and hence reducing dimensions.
• Standardization of the selected variables.
• Getting Cluster Solution.
• Checking the optimality of the solution.
SAS Procedures:
• Proc Factor
• Proc Standard
• Proc Cluster / Proc Fastclus

Evaluating a Cluster Solution:


• % of Population in each Cluster
• Maximum Distance from Cluster Centroid
• Distance from the nearest Cluster
• Variable R-square
• Overall R-square

Implementing a Cluster Solution:
• Scoring Code and Minimum Euclidean Distance Method
Step A: Why Segmentation ?
Step B: Types of Segmentation:
1. Objective Segmentation (CHAID)
2. Subjective Segmentation (Cluster Analysis)

Step C: What is Cluster Analysis?


Step D: Basic Concepts.
Step A: Why Segmentation ?

Each individual is so different that ideally we would want to reach out to each one of them in a different way.

Problem: The volume is too large for customization at the individual level.

Solution: Identify segments where people share the same characteristics and target each of these segments in a different way.

Segmentation is for better targeting.


• Segmentation provides a catalyst for creative insights; it is often the first step in marketing strategy planning.
• Segmentation is a multipurpose technique.
• Segmentation provides a common vocabulary for communicating marketing analysis.
• Segmentation can complement models.
• Segmentation is a technique used by many of our competitors.
Step B: Types of Segmentation

Segmentation:
1. Objective: CHAID
2. Subjective: Cluster Analysis

Step B.2: Subjective Segmentation (Cluster Analysis)
Step D: Basic Concepts

Cluster Analysis is a technique used for combining observations into groups such that:
• each group is homogeneous (similar) w.r.t. certain characteristics, and
• each group is different from the other groups w.r.t. the same characteristics.


Step C: What is Cluster Analysis?

Cluster analysis is the process by which objects are classified into a number of groups so that they are as dissimilar as possible from one group to another, but as similar as possible within each group. In other words, cluster analysis means dividing the whole population into groups that are distinct from one another but internally similar.

Total Population: Group 1, Group 2, Group 3, Group 4

The objects in Group 1 should be as similar as possible, but there should be a large difference between an object in Group 1 and an object in Group 2.

The attributes of the objects are allowed to determine which objects should be grouped together.
Step B.2: Subjective Segmentation

Consider a portfolio of 1000 customers with credit accounts. The business wants to apply different strategies to different groups of people. ‘Objective segmentation’ won’t help here, since we don’t know who the responders are.

In this case we need some profiling, as below:

Total Population (1000), split into four groups:
• Avg. delinquency age = 0 days, Avg. age = 35 yrs., Avg. Utilization > 80%
• Avg. delinquency age = 15 days, Avg. age = 33 yrs., Avg. Utilization = 60%
• Avg. delinquency age = 12 days, Avg. age = 25 yrs., Avg. Utilization = 90%
• Avg. delinquency age = 75 days, Avg. age = 50 yrs., Avg. Utilization = 40%

We can exclude the group with avg. delinquency age = 75 days from mailing.
This type of segmentation is known as ‘Subjective Segmentation’. It gives the salient characteristics of the best customers.
Step D: Basic Concepts

Basic Concepts of Cluster Analysis Using Two Variables

[Figure: scatter plot of Current Balance (vertical axis: Low, Medium, High) against Gross Monthly Income (horizontal axis: Low, Medium, High). Example Cluster 1: High Balance, Low Income. Example Cluster 2: High Income, Low Balance.]

Cluster 1 and Cluster 2 are differentiated by Income and Current Balance. The objects in Cluster 1 share similar characteristics (High Balance and Low Income), while the objects in Cluster 2 share a different set of characteristics (High Income and Low Balance).
There is, however, a large difference between an object in Cluster 1 and an object in Cluster 2.
Cluster Analysis for CCC Thailand
Central Credit Card - Thailand

Segmentation of the PLCC Portfolio
Cluster Sizes

The Cluster Analysis that was performed on 75,134 Logo 02 Accounts resulted in 6 segments.

Segment Sizes (pie chart):
Segment 1: 20%
Segment 2: 29%
Segment 3: 23%
Segment 4: 5%
Segment 5: 15%
Segment 6: 8%
Significant Variables

The variables used for generating the segments are:

1. Account age (in days)
2. No. of times revolved in the last 3 months
3. Time (in months) elapsed between last transaction and end of window (Recency)
4. Sum of all transaction amounts over the last 6 months

Variable | Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 | Segment 6
Account age (days) | 436.18 | 282.06 | 265.66 | 342.17 | 434.42 | 397.7
Total amt. of sale transactions, last 6 months (Bht) | 6282.05 | 8975.72 | 7370.61 | 43385.06 | 8075.22 | 0.53
Recency (months) | 1.07 | 0.85 | 1.09 | 0.11 | 0.91 | 10.41
No. of times revolved, last 3 months | 2.92 | 2.9 | 0.15 | 0.72 | 0.16 | 0.55

These variables are different across segments.


Segment Descriptions: 1 to 4

Segment 1: Credit Hungry Poor (20%)
• Low monthly income segment.
• High Balance, Credit Utilisation, Fees and Fin. Charges. High revolvers.
• Older accounts.
• Delinquency marginally higher than average.
• Low transactors.

Segment 2: New Revolvers (28%)
• Newer accounts.
• Low monthly income segment.
• High Balance, Credit Utilisation, Fees and Fin. Charges. High revolvers.
• Delinquency marginally higher than average.
• Average transactors.

Segment 3: New Hopefuls (23%)
• Newest accounts.
• Low Balance, Credit Utilisation, Fees and Fin. Charges. Low revolvers.
• Low transactors in number and total amount transacted.
• But, sale per transaction slightly higher than average. Therefore, hold some hope for the future.
• Delinquency higher than average.

Segment 4: Affluent Spenders (5%)
• Highest monthly income.
• High Balance; but low Fees and Fin. Charges.
• Heavy transactors.
Segment Descriptions: 5 & 6

Segment 5: Vintage Fuddy-Duddies (15%)
• Older accounts.
• Low Balance, Credit Utilisation, Fin. Charges, Fees.
• Low revolvers.
• Average transactors.

Segment 6: Old Risky Inactives (8%)
• Older customers.
• Low Balance, Credit Utilisation.
• Non transactors.
• High delinquency.

Segment | No. of customers | %
Credit hungry poor | 15356 | 20.44
New Revolvers | 21161 | 28.16
New Hopefuls | 17373 | 23.12
Affluent Spenders | 3815 | 5.08
Vintage fuddy-duddies | 11248 | 14.97
Old Risky Inactives | 6181 | 8.23

Note: Refer ‘Behaviour Profile’ and ‘Demographic Profile E1’ in Appendix for Profile details.
Spend Characteristics of Segments

‘Affluent Spenders’ and ‘New Revolvers’ form 33% of the portfolio but account for 53% of all spend.
‘Affluent Spenders’ alone account for 25% of the total spend, though they are only 5% of the portfolio.

[Figure: bubble chart of AVT (value per txn. in Bht) against AFT (no. of txns. per month), one bubble per segment.]

Segment | % of portfolio | % of all txn. value
Affluent Spenders | 5 | 25
New Revolvers | 28 | 28
New Hopefuls | 23 | 19
Credit Hungry Poor | 20 | 14
Vintage Fuddy-Duddies | 15 | 14
Old Risky Inactives | 8 | 0
Revolver Characteristics of Segments

‘Credit Hungry Poor’ and ‘New Revolvers’ together account for 48% of the portfolio and 78% of revolvers.

[Figure: stacked bar chart comparing each segment's share of the portfolio with its share of revolvers.]

Segment | % of Portfolio | % of Revolvers
Credit hungry poor | 20 | 33
New Revolvers | 28 | 45
New Hopefuls | 23 | 9
Affluent Spenders | 5 | 4
Vintage fuddy-duddies | 15 | 6
Old Risky Inactives | 8 | 3
Step 1: Data Cleaning and Preparing the data set for analysis.
Step 2: Creating new relevant Variables.
Step 3: Selection of Variables.
Step 4: Tackling the Outliers.
Step 5: Treatment of Missing Values.
Step 6: Multicollinearity Check and hence reducing dimensions.
Step 7: Standardization of the selected variables
Step 8: Getting Cluster Solution.
Step 9: Checking the optimality of the solution.
Clustering Process

[Figure: Process Flow for Cluster Analysis. Step 1 (Data Cleaning and Preparing the data set for analysis), Step 2 (Creating New Relevant Variables), Step 3 (Selection of Variables), Step 4 (Tackling the Outliers), Step 5 (Treatment of Missing Values), Step 6 (Multicollinearity Check), Step 7 (Standardization), Step 8 (Getting Cluster Solution), Step 9 (Checking the Optimality of the Solution).]


Step 1: Preparation of Data

[Figure: data flow. Client data held in different tables on the server is merged (at account level or customer level), then cleaned, giving the final data ready for analysis.]

Cleaning Process:
• Identify the erroneous values.
• Check for inconsistency in the values of variables.
Step 2: Creating new Variables

Variable Types:
• Demographic
• Socio-Economic
• Product Related
• Behavioral

Variable Creation:
New relevant variables are to be created from the existing ones where necessary. For example, in an auto loan portfolio with variables such as deposit amount and price of the vehicle under finance, a new variable ‘Deposit Percent’ = (Deposit Amount / Price of the Vehicle) * 100 can be created.
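The ‘Deposit Percent’ derivation above can be sketched in a few lines of Python. This is only an illustration: the record layout and the field names (`deposit_amount`, `vehicle_price`) are assumptions, not names from the original data set.

```python
# Hypothetical auto-loan records; the field names are illustrative only.
accounts = [
    {"deposit_amount": 2000.0, "vehicle_price": 10000.0},
    {"deposit_amount": 4500.0, "vehicle_price": 18000.0},
]


def deposit_percent(deposit_amount, vehicle_price):
    """Deposit Percent = (Deposit Amount / Price of the Vehicle) * 100."""
    if vehicle_price <= 0:
        return None  # treat as missing rather than divide by zero
    return deposit_amount / vehicle_price * 100.0


# Add the derived variable to every account record.
for acc in accounts:
    acc["deposit_percent"] = deposit_percent(
        acc["deposit_amount"], acc["vehicle_price"]
    )
```

Returning `None` for a non-positive price is one possible choice; it leaves the value to be handled later by the missing-value treatment.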
Step 3: Selection of Variables

• There is no limit to the number of variables that can be selected for analysis.
• The selection of variables depends on the purpose of the clustering.
• Irrelevant variables are to be dropped.
• Variables with a large % of missing values are to be dropped.

E.g. for an Auto Loan portfolio, variables like Loan Amount, Months on Book, Term, Deposit Amount and Car Price should be considered. We should not look at APR (Annual Percentage Rate), because APR depends mainly on the yearly performance of the business overall rather than on the individual accounts.

E.g. suppose we want the clustering process to identify highly delinquent people. In this case, Maximum Delinquency reached will have more significance than the month-end delinquency variables.
Step 4: Tackling Outliers

What is an outlier?
An observation is said to be an outlier w.r.t. a variable if it is far away from the remaining observations.

[Figure: scatter plot of Var 2 against Var 1, with one point far from the rest marked as an outlier.]

To identify them:
• Univariate and Frequency analysis
• Histogram and Box-Plot

To tackle them:
1. The outliers can be deleted from the analysis if they are very few in number.
2. The selected variables can be trimmed or capped.
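Capping, as mentioned in point 2, can be sketched as follows. This is a minimal Python illustration, assuming that the caps are taken at chosen percentiles of the variable; the percentile levels and the sample values are made up for the example and are not from the original deck.

```python
def percentile(sorted_values, p):
    """Linearly interpolated percentile (0 <= p <= 100) of a sorted list."""
    if not sorted_values:
        raise ValueError("no values supplied")
    k = (len(sorted_values) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (k - lo)


def cap(values, lower_pct=1.0, upper_pct=99.0):
    """Pull values outside the chosen percentiles back to those percentiles."""
    s = sorted(values)
    low, high = percentile(s, lower_pct), percentile(s, upper_pct)
    return [min(max(v, low), high) for v in values]


# Illustrative data: one extreme balance among otherwise similar values.
balances = [10, 12, 11, 13, 12, 14, 11, 500]
capped = cap(balances, lower_pct=0.0, upper_pct=90.0)
```

Capping keeps every observation in the data set, whereas deletion (point 1) removes the observation altogether; which is preferable depends on how many outliers there are.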
Step 5: Treatment of Missing Values

Variables with a lot of missing values (about 15%) should not be used for clustering unless ‘Missing’ has a special significance and can be replaced by some meaningful number.
E.g. Insurance variables.

Note: SAS does not include observations with missing values in the clustering process.

% of Missing | Treatments
Less than 1% | Delete those observations; Mean Imputation
1-5% | Mean Imputation
5-10% | Regression Imputation; Mean Imputation
More than 10% | Regression Imputation; try to use some proxy variable
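Mean imputation, the simplest treatment listed above, can be sketched in Python as below. The sketch assumes missing values are represented as `None`; the helper names are hypothetical.

```python
from statistics import mean


def missing_pct(values):
    """Percentage of entries that are missing (represented here as None)."""
    return 100.0 * sum(v is None for v in values) / len(values)


def mean_impute(values):
    """Replace every missing entry with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Illustrative variable with one missing value out of four.
utilization = [0.40, None, 0.60, 0.80]
share_missing = missing_pct(utilization)
imputed = mean_impute(utilization)
```

Computing the missing percentage first makes it possible to decide which treatment applies before imputing anything.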
Step 6: Multi-collinearity Check

What is ‘Multi-collinearity’?

A set of independent or explanatory variables is said to have ‘Multi-collinearity’ if there is a linear relation among them.

Devices to tackle ‘Multi-collinearity’:

Factor Analysis:
Use Factor Analysis to select the factors that together explain about 90-95% of the total variation. Then select the variables that have high loadings on those factors.

VIF (Variance Inflation Factor):
Variables with VIF more than 2 should be dropped.
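For readers who want to see how a VIF is obtained, the sketch below computes it in plain Python. It relies on the standard result that the VIF of variable j, 1 / (1 - R_j^2), equals the j-th diagonal element of the inverse of the correlation matrix. The function names and the sample data are illustrative assumptions; this is not the SAS implementation.

```python
from statistics import mean, pstdev


def correlation_matrix(columns):
    """Pearson correlation matrix of equally long columns of numbers."""
    n = len(columns[0])
    z = []
    for col in columns:
        m, s = mean(col), pstdev(col)
        z.append([(v - m) / s for v in col])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]


def invert(matrix):
    """Gauss-Jordan matrix inverse with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: perfectly collinear variables")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF of each variable: diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]


# Two illustrative variables with correlation 0.8.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
vifs = vif([x, y])
```

With two variables correlated at 0.8, both VIFs come out at 1 / (1 - 0.64), which is above the threshold of 2 quoted above.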
Step 7: Standardization

Why do we need ‘Standardization’?

Since the units of measurement are different for different variables, standardization is a must.

E.g. consider two variables, Age and Income. The unit of Age is ‘Year’ and the unit of Income is, say, ‘$’, so they are not comparable. Without standardization there would be no common unit of measurement for the distance between two clusters, and the variable with the larger numeric scale would dominate that distance.

Generally we standardize by making the mean = 0 and variance = 1.
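The standardization described above can be sketched in Python as follows. The sketch assumes the sample standard deviation (n - 1 divisor) is the one wanted; the Age and Income figures are invented for the illustration.

```python
from statistics import mean, stdev


def standardize(values):
    """Rescale values to mean 0 and (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    if s == 0:
        raise ValueError("constant variable cannot be standardized")
    return [(v - m) / s for v in values]


# Illustrative data in very different units: years and dollars.
ages = [25, 35, 45, 55]
incomes = [20000, 40000, 60000, 80000]

z_age = standardize(ages)
z_income = standardize(incomes)
```

In this example the two variables end up with identical standardized values even though Income is numerically thousands of times larger, which is why distances computed afterwards treat them on an equal footing.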


Step 8: Getting the Cluster Solution

Cluster Process:

In SAS, the two most commonly used procedures are ‘Proc Fastclus’ and ‘Proc Cluster’.

Proc Cluster (hierarchical methods):
• Single Linkage
• Complete Linkage
• Two Stage
• Etc.

Proc Fastclus:
• K-Means

What is K-Means:
The process starts with K distinct observations (seeds) that are at the greatest distance from each other. Each of the remaining observations is then considered one by one and assigned to the nearest cluster. If, in the course of this, two clusters come significantly close to each other, they are merged to form a new cluster.
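The idea can be illustrated with a small Python sketch of textbook k-means (Lloyd's algorithm) with farthest-first seeding. It is an assumption-laden illustration, not the algorithm used by Proc Fastclus: it has no merging of close clusters, and the data points and function names are made up.

```python
import math


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_first_seeds(points, k):
    """Pick k seeds, each as far as possible from the seeds chosen so far."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(dist(p, s) for s in seeds)))
    return seeds


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points]


def k_means(points, k, max_iter=100):
    centroids = [list(s) for s in farthest_first_seeds(points, k)]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        new_labels = assign(points, centroids)
        if new_labels == labels:  # converged: assignments no longer change
            break
        labels = new_labels
    return labels, centroids


# Illustrative (already standardized) observations forming two groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = k_means(points, k=2)
```

The loop alternates between recomputing each centroid as the mean of its members and reassigning every observation to its nearest centroid, stopping when the assignments stop changing.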
Cluster Process (continued):

After the outliers have been cleaned from the data set, either of the above procedures can be used to build clusters. There is no hard and fast rule for the number of clusters or for cluster sizes, but as a rule of thumb each cluster should hold at least 5% of the observations and the total number of clusters should be between 5 and 15.

Some of the variables present in the data set are to be used for clustering. These variables must be numeric. They may be continuous or discrete, but if discrete, the categories must be ordinal.

The goodness of a particular set of clusters is measured by the extent to which the means of the clustering variables differ from one cluster to another.

Profiling the Clusters:

After the clusters have been built, they are to be profiled with respect to discrete and continuous variables to identify the distinguishing features of each cluster.
Step 9: Evaluating the Cluster Solution

Statistic/Measure | Meaning | Ideal value
Variable R Square | Between Variation / Total Variation | >= 0.3
Overall R Square | Avg(Var R Square); 1 - Avg[WithinVariance(Var1), WithinVariance(Var2), ...] | >= 0.6
Approximate Expected Overall R Square | Similar to above, different formula; calculated assuming variables are independent. | Close to Overall R Square (diff <= 0.1)
RMS STD | For each cluster: Sqrt{Avg[Variance(Var1), Variance(Var2), ...]} | <= 1.1
Distance Between Cluster Centroids | How close or how far apart are cluster centroids | >= 1.5
Maximum distance from seed to observation | "Dispersion" within each cluster | Relative; roughly uniform across clusters
Variable R-square:
For a given data set / variable combination, the “Total amount of Variation” is fixed.
If there are k clusters in the solution, then Total Variation = Within Variation + Between Variation.
Within Variation(var 1) = [Variation(var 1) within Cluster 1] + [Variation(var 1) within Cluster 2] + ... + [Variation(var 1) within Cluster k]
Between Variation = Variation from one cluster to another (i.e. variation of the cluster means).

R-Square = Between Variation / Total Variation

[Figure: scatter plot of Var 2 against Var 1 showing three clusters (Cluster 1, Cluster 2, Cluster 3).]

• Clusters should be homogeneous within themselves.
• Within Variation should be lower and Between Variation should be higher.
• The higher the R-Square, the better.
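The decomposition above can be checked numerically with a short Python sketch. It measures variation as sums of squared deviations, and computes the overall R-square by pooling those sums across variables; that pooling is an assumption of this sketch (for standardized variables it agrees with averaging the variable R-squares). All names and data are illustrative.

```python
from statistics import mean


def sums_of_squares(values, labels):
    """Return (total, within) sums of squared deviations for one variable."""
    grand = mean(values)
    total = sum((v - grand) ** 2 for v in values)
    within = 0.0
    for c in set(labels):
        members = [v for v, lab in zip(values, labels) if lab == c]
        m = mean(members)
        within += sum((v - m) ** 2 for v in members)
    return total, within


def variable_r_square(values, labels):
    """R-Square = Between Variation / Total Variation for one variable."""
    total, within = sums_of_squares(values, labels)
    return (total - within) / total


def overall_r_square(columns, labels):
    """1 - pooled Within Variation / pooled Total Variation."""
    parts = [sums_of_squares(col, labels) for col in columns]
    return 1.0 - sum(w for _, w in parts) / sum(t for t, _ in parts)


# Illustrative data: var1 separates the two clusters well, var2 does not.
labels = [0, 0, 0, 1, 1, 1]
var1 = [1, 2, 3, 11, 12, 13]
var2 = [0, 1, 0, 1, 0, 1]

r2_var1 = variable_r_square(var1, labels)
r2_var2 = variable_r_square(var2, labels)
r2_all = overall_r_square([var1, var2], labels)
```

Here `var1` has an R-square close to 1 while `var2` falls well below the 0.3 guideline, which is the kind of contrast the variable-level statistic is meant to expose.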
Approximate Expected Overall R-square

The Approximate Expected Overall R-Square is calculated under the hypothesis that all the explanatory variables used for clustering are independent. Hence, if there is a large difference between the Observed Overall R-square and the Approximate Expected Overall R-square, we can suspect high correlation among the clustering variables.

RMS STD

RMS STD within a cluster = Square root of the Average of (Variance of variable 1 in that cluster, Variance of variable 2 in that cluster, ..., Variance of variable p in that cluster), assuming p variables were used for clustering.
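The RMS STD formula above translates directly into Python. The sketch assumes the sample variance (n - 1 divisor) for each variable; the function name and the data are illustrative.

```python
import math
from statistics import variance


def rms_std(cluster_columns):
    """RMS STD of one cluster.

    cluster_columns holds one list of values per clustering variable,
    restricted to the members of that cluster.
    """
    variances = [variance(col) for col in cluster_columns]
    return math.sqrt(sum(variances) / len(variances))


# Illustrative cluster with three members and two clustering variables.
cluster_var1 = [1.0, 2.0, 3.0]   # sample variance 1
cluster_var2 = [2.0, 4.0, 6.0]   # sample variance 4
value = rms_std([cluster_var1, cluster_var2])
```

Because the variables are standardized before clustering, a value well above 1 suggests a cluster that is more spread out than the population as a whole.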

There is no strict restriction on the number of clusters, but it should generally be between 5 and 15.
Care should be taken over the number of observations in each cluster. A good rule of thumb is to have >= 5% of the population in each cluster.
Proc Factor
Proc Cluster
Proc Fastclus
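Typical SAS codes

/* Factor analysis: principal components, 10 factors, varimax rotation. */
/* OUT= writes the input data together with the factor scores. */
proc factor data=t5
    method=prin nfactors=10 rotate=varimax out=final1;
run;

/* Standardize the clustering variables to mean 0 and std 1. */
proc standard data=out1 out=out2 mean=0 std=1;
    var amt_fin term dep_per age mon_book;
run;

/* Hierarchical clustering (complete linkage) on the standardized data. */
proc cluster data=out2 method=complete;
    var amt_fin term dep_per age mon_book;
run;

/* K-means on the standardized data; OUT= holds the cluster assignments. */
/* MAXC= caps the number of clusters; DELETE= drops sparsely populated seeds. */
proc fastclus data=out2 out=out3 maxc=120
    maxiter=100 delete=1200 short;
    var amt_fin term dep_per age mon_book;
run;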
Scoring: Minimum Euclidean Distance Method

[Figure: scatter plot of Var 2 against Var 1 showing Cluster 1, Cluster 2, Cluster 3 and a New Observation lying closest to Cluster 1.]

A new observation is assigned to the cluster whose centroid is at the minimum Euclidean distance from it. Here, the New Observation will be a member of Cluster 1.
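Scoring by minimum Euclidean distance can be sketched in Python as below. The centroid coordinates and the new observation are hypothetical values chosen for the illustration (they are not read off the figure), and `score` is an assumed helper name.

```python
import math


def score(observation, centroids):
    """Return (cluster name, distance) for the nearest cluster centroid."""
    distances = {name: math.dist(observation, c) for name, c in centroids.items()}
    best = min(distances, key=distances.get)
    return best, distances[best]


# Hypothetical centroids from a finished cluster solution.
centroids = {
    "Cluster 1": (35.0, 45.0),
    "Cluster 2": (30.0, 15.0),
    "Cluster 3": (10.0, 30.0),
}

new_observation = (40.0, 65.0)
assigned_cluster, distance = score(new_observation, centroids)
```

If the clustering variables were standardized before the solution was built, a new observation has to be standardized with the same means and standard deviations before it is scored; otherwise the distances are not comparable.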


The SAS code for implementation: [embedded Microsoft Word document]

The Cluster Analysis Output: [embedded Microsoft Word document]
Thank You
Contact us

Visit us on: http://www.analytixlabs.in/

For course registration, please visit: http://www.analytixlabs.co.in/course-registration/

For more information, please contact us: http://www.analytixlabs.co.in/contact-us/


Or email: info@analytixlabs.co.in

Call us we would love to speak with you: (+91) 88021-73069

Join us on:
Twitter - http://twitter.com/#!/AnalytixLabs
Facebook - http://www.facebook.com/analytixlabs
LinkedIn - http://www.linkedin.com/in/analytixlabs
Blog - http://www.analytixlabs.co.in/category/blog/
