
Predictive Analytics and Data Mining

Segmentation using Clustering


Automatic Cluster Detection

Data mining techniques are used to find patterns in data

• Patterns are not always easy to identify
  - No observable pattern
  - Too many patterns

Decomposition (breaking the data down into smaller pieces) can help

Automatic Cluster Detection is useful for finding “better behaved” clusters
of data within a larger dataset
Automatic Cluster Detection

The K-means clustering algorithm depends on a geometric interpretation of
the data

Other automatic cluster detection (ACD) algorithms include:
• Gaussian mixture models
• Agglomerative clustering
• Divisive clustering
• Self-organizing maps (SOM) – neural nets

ACD is a tool
• No preclassified training data set
• No distinction between independent and dependent variables
• Marketing clusters are referred to as “segments”
• Customer segmentation is a popular application of clustering

ACD is rarely used in isolation – other methods follow up


Segmentation

Organizing customers into groups with similar traits, product
preferences, or expectations
• Demographic characteristics
• Psychographics (interests, attitudes, opinions, personality, values,
lifestyles…)
• Desired benefits from products/services
• Past-purchase or product-use behaviors
K-means Clustering

“K-means” – circa 1967 – this algorithm looks for a fixed number of
clusters, which are defined in terms of the proximity of data points to
each other

How K-means works (see next slide figures):
• The algorithm selects K data points at random as initial cluster seeds
• Each of the remaining data points is assigned to one of the K clusters
• The mean of the cases in each cluster is calculated, and each cluster
seed is moved to the mean of its cluster
• Cases closest to the new seed i are reassigned as belonging to cluster
i; the assign-and-move steps repeat until the seeds stop moving
• Euclidean distance between two points (u1, v1) and (u2, v2) is
sqrt((u1 − u2)² + (v1 − v2)²)
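The steps above can be sketched in a few lines of plain Python. The data points and the choice of initial seeds (indices 0 and 3, fixed for reproducibility instead of the random selection the slide describes) are invented for illustration, and the sketch assumes no cluster ever goes empty.

```python
import math

def euclidean(p, q):
    # sqrt((u1 - u2)^2 + (v1 - v2)^2), as on the slide
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def k_means(points, seeds, max_iter=100):
    centroids = list(seeds)
    for _ in range(max_iter):
        # Assign each point to the nearest cluster seed
        labels = [min(range(len(centroids)),
                      key=lambda i: euclidean(p, centroids[i]))
                  for p in points]
        # Move each seed to the mean of the cases in its cluster
        new_centroids = []
        for i in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == i]
            new_centroids.append(
                (sum(m[0] for m in members) / len(members),
                 sum(m[1] for m in members) / len(members)))
        if new_centroids == centroids:  # seeds stopped moving
            break
        centroids = new_centroids
    return labels, centroids

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(data, seeds=[data[0], data[3]])
```

With two well-separated groups, the seeds settle on the two group means after a couple of iterations.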
K-means Clustering

(figure slide: the k-means steps illustrated step by step)

K-means Clustering

Resulting clusters describe underlying structure in the data; however,
there is no one right description of that structure
Similarity & Difference

Automatic Cluster Detection is quite simple for a software program to
accomplish – data points and clusters are mapped in space

However, business data is not about points in space but about purchases,
phone calls, airplane trips, car registrations, etc., which have no
obvious connection to the dots in a cluster diagram
Similarity & Difference

Clustering business data requires some notion of natural association –
records in a given cluster are more similar to each other than to those
in another cluster

For DM software, this concept of association must be translated into some
sort of numeric measure of the degree of similarity

The most common approach is to translate data values (e.g., gender, age,
product) into numeric values so they can be treated as points in space

If two points are close in the geometric sense, then they represent
similar data in the database
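A toy illustration of that translation, with invented field names: a categorical field (gender) is one-hot encoded and a numeric field (age) is rescaled, so the records become points whose Euclidean distance reflects similarity. Without the rescaling, the field with the largest raw range would dominate the distance.

```python
import math

def encode(record, genders=("F", "M"), max_age=100):
    # gender -> one-hot dimensions; age -> scaled into [0, 1]
    onehot = tuple(1.0 if record["gender"] == g else 0.0 for g in genders)
    return onehot + (record["age"] / max_age,)

def distance(a, b):
    # Euclidean distance in the encoded space
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = encode({"gender": "F", "age": 30})
b = encode({"gender": "F", "age": 35})  # a similar customer
c = encode({"gender": "M", "age": 70})  # a dissimilar customer
```

Here the similar pair ends up close in the encoded space while the dissimilar pair does not.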
Similarity & Difference

Business variable (field) types:
• Categorical (e.g., mint, cherry, chocolate)
• Ranks (e.g., freshman, sophomore, etc., or valedictorian, salutatorian)
• Intervals (e.g., 56 degrees, 72 degrees, etc.)
• True measures – interval variables measured from a meaningful zero
point
  - Age, weight, height, length, and tenure are good examples
Pattern Discovery

“…the discovery of interesting, unexpected, or valuable structures in
large data sets.”
– David Hand, Professor of Statistics, Imperial College

“If you’ve got terabytes of data, and you’re relying on data mining to
find interesting things in there for you, you’ve lost before you’ve even
begun. You really need people who understand what it is they are looking
for – and what they can do with it once they find it.”
– Herb Edelstein, President of Two Crows Corporation
Inputs (Desirable Characteristics)

Meaningful to the analysis objective

Relatively independent

Limited in number

Have a measurement level of Interval

Have low kurtosis and skewness statistics
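The last check above can be automated. A minimal, pure-Python sketch of moment-based skewness and excess kurtosis follows; the example data is invented, and any cutoff for "low" (e.g., |skewness| > 1 suggesting a transform) is a rule of thumb, not a fixed standard.

```python
def moments(xs):
    # Population skewness m3/m2^1.5 and excess kurtosis m4/m2^2 - 3
    # (both are 0 for a perfect normal distribution).
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]            # skewness 0
right_skewed = [1, 1, 1, 1, 2, 2, 3, 20]  # one large outlier
```

An input like `right_skewed` would fail the screening and is a candidate for a log or similar transformation before clustering.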


What Value of K to Use

Subject matter knowledge (there are most likely five groups)

Convenience (it is convenient to market to 3 or 4 groups)

Constraints (you have 5 products and need 5 segments)

Arbitrarily (always pick 10)

Based on the data (Ward’s method or the elbow criterion)

(Elbow plot – a plot of the ratio of within-cluster variance to
between-cluster variance versus the number of clusters)
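A sketch of the elbow idea on invented one-dimensional data: run a simple k-means for several values of k and record the within-cluster sum of squares (WSS). The deterministic seeding rule below (evenly spaced sorted points) is an illustration choice, not part of the criterion.

```python
def k_means_1d(data, k, iters=50):
    pts = sorted(data)
    # seed with k evenly spaced points from the sorted data
    cents = [pts[i * (len(pts) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in cents]
        for x in pts:
            groups[min(range(k), key=lambda i: abs(x - cents[i]))].append(x)
        new = [sum(g) / len(g) if g else cents[i] for i, g in enumerate(groups)]
        if new == cents:  # converged
            break
        cents = new
    return groups

def wss(groups):
    # within-cluster sum of squared distances to each cluster mean
    total = 0.0
    for g in groups:
        if g:
            m = sum(g) / len(g)
            total += sum((x - m) ** 2 for x in g)
    return total

data = [1, 2, 3, 20, 21, 22, 40, 41, 42]   # three obvious groups
curve = {k: wss(k_means_1d(data, k)) for k in range(1, 5)}
```

Plotting `curve` would show WSS dropping sharply until k = 3 and barely improving afterward – the "elbow" that suggests three clusters.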
Ward’s Method

An algorithm for hierarchical cluster analysis

In this method each observation starts as its own cluster, and the
clusters are hierarchically joined, at each step minimizing the ratio of
within-cluster to between-cluster variation

Based on a statistical analysis, the number of clusters is selected

This number of clusters is then used for k-means cluster analysis
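A brute-force sketch of Ward-style agglomeration on invented 1-D data: at each step, merge the pair of clusters whose union increases the total within-cluster sum of squared errors (SSE) the least. Real implementations use the Lance-Williams update instead of this O(n³) pair search; this is only to show the criterion.

```python
def sse(cluster):
    # sum of squared differences from the cluster mean
    m = sum(cluster) / len(cluster)
    return sum((x - m) ** 2 for x in cluster)

def ward_merge_order(points):
    clusters = [[p] for p in points]
    merges = []  # (merged cluster, SSE increase) at each step
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                inc = (sse(clusters[i] + clusters[j])
                       - sse(clusters[i]) - sse(clusters[j]))
                if best is None or inc < best[0]:
                    best = (inc, i, j)
        inc, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append((sorted(merged), inc))
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (i, j)] + [merged]
    return merges

merges = ward_merge_order([1, 2, 10, 11])
```

The two tight pairs merge cheaply first; the final merge across the gap costs far more, which is the signal used to pick the number of clusters.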


Ward’s Method in SAS Enterprise Miner

1. Preliminary k-means clustering on the data to save many cluster
centroids (default 50)

2. Ward’s hierarchical clustering on the saved cluster centroids (k, then
k−1, k−2, etc.) to determine the ideal value of k (greater than the
minimum specified in the selection criteria, and with a CCC (cubic
clustering criterion) statistic greater than the threshold specified in
the selection criteria)

3. K-means clustering on the original dataset using the k from step 2


Evaluating Clusters

What does it mean to say that a cluster is “good”?

• Clusters should have members with a high degree of similarity
• The standard way to measure within-cluster similarity is variance* –
the cluster with the lowest variance is considered best
• Cluster size is also important, so an alternate approach is to use the
average variance**

* The sum of the squared differences of each element from the mean
** The total variance divided by the size of the cluster
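The two footnoted measures are one-liners; the example clusters are invented.

```python
def variance(cluster):
    # * sum of squared differences of each element from the cluster mean
    m = sum(cluster) / len(cluster)
    return sum((x - m) ** 2 for x in cluster)

def average_variance(cluster):
    # ** total variance divided by the size of the cluster
    return variance(cluster) / len(cluster)

tight = [10, 10, 11, 11]
loose = [0, 5, 10, 15]
```

The size adjustment matters: a big cluster can have a large raw variance simply because it has many members, while its average variance may still be small.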
Evaluating Clusters

Finally, if detection identifies good clusters along with weak ones, it
can be useful to set the good ones aside (for further study) and run the
analysis again, to see whether improved clusters emerge from only the
weaker ones
Validating Clusters
Interpretation

Goal: obtain meaningful and useful clusters

Caveats:
(1) Random chance can often produce apparent clusters
(2) Different clustering methods produce different results

Solutions:

Obtain summary statistics

Also review clusters in terms of variables not used in the clustering

Label the clusters (e.g., clustering financial firms in 2008 might yield
a label like “midsize, sub-prime loser”)
Desirable Cluster Features

Stability – are clusters and cluster assignments sensitive to slight
changes in the inputs? Are cluster assignments in partition B similar to
those in partition A?

Separation – check the ratio of between-cluster variation to
within-cluster variation (higher is better)
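The separation check can be sketched directly, using invented 1-D clusters: between-cluster variation is the size-weighted spread of the cluster means around the grand mean, and within-cluster variation is the total SSE inside the clusters.

```python
def sse(xs, center):
    return sum((x - center) ** 2 for x in xs)

def separation_ratio(clusters):
    # between-cluster variation / within-cluster variation (higher is better)
    means = [sum(c) / len(c) for c in clusters]
    all_pts = [x for c in clusters for x in c]
    grand = sum(all_pts) / len(all_pts)
    within = sum(sse(c, m) for c, m in zip(clusters, means))
    between = sum(len(c) * (m - grand) ** 2
                  for c, m in zip(clusters, means))
    return between / within   # assumes within > 0

well_separated = [[1, 2, 3], [20, 21, 22]]
overlapping = [[1, 5, 9], [4, 8, 12]]
```

The well-separated partition scores orders of magnitude higher than the overlapping one, matching the "higher is better" rule on the slide.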
CLUSTERING USING SAS ENTERPRISE MINER
Grocery Store Case Study

Analysis goal:
Where should you open new grocery store locations?
Group geographic regions into segments based on income, household size,
and population density.

Analysis plan:
• Select and transform segmentation inputs.
• Select the number of segments to create.
• Create segments with the Cluster tool.
• Interpret the segments.

