You are on page 1of 45

K-Means Cluster Analysis

Chapter 3
PPDM Cl
Class

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

What is Cluster Analysis?


z

Finding groups of objects such that the objects in a group


will be similar (or related) to one another and different
from (or unrelated to) the objects in other groups
Inter-cluster
Inter
cluster
distances are
maximized

Intra-cluster
distances are
minimized

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

Applications of Cluster Analysis


z

Discovered Clusters

Understanding
Group
p related documents
for browsing, group genes
and proteins that have
similar functionality, or
group stocks with similar
price fluctuations

Industry Group

Applied-Matl-DOWN,Bay-Network-Down,3-COM-DOWN,
Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,
DSC-Comm-DOWN,INTEL-DOWN,LSI-Logic-DOWN,
Micron-Tech-DOWN,Texas-Inst-Down,Tellabs-Inc-Down,
Natl-Semiconduct-DOWN,Oracl-DOWN,SGI-DOWN,
Sun-DOWN
Apple-Comp-DOWN,Autodesk-DOWN,DEC-DOWN,
ADV-Micro-Device-DOWN,Andrew-Corp-DOWN,
Computer-Assoc-DOWN,Circuit-City-DOWN,
Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN,
Motorola-DOWN,Microsoft-DOWN,Scientific-Atl-DOWN

1
2

Fannie Mae DOWN Fed Home Loan DOWN


Fannie-Mae-DOWN,Fed-Home-Loan-DOWN,
MBNA-Corp-DOWN,Morgan-Stanley-DOWN

3
4

Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP,
Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP,
Schlumberger-UP

Technology1-DOWN

Technology2-DOWN

Financial-DOWN
Oil-UP

Summarization
Reduce the size of large
data sets
C uste g p
Clustering
precipitation
ec p tat o
in Australia

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

What is not Cluster Analysis?


z

Supervised classification
Have class label information

Simple segmentation
Di
Dividing
idi students
t d t into
i t different
diff
t registration
i t ti groups
alphabetically, by last name

Results of a query
Groupings are a result of an external specification

Graph partitioning
Some mutual relevance and synergy, but areas are not
identical

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

Notion of a Cluster can be Ambiguous

How many clusters?

Six Clusters

Two Clusters

Four Clusters

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

Types of Clusterings
z

A clustering is a set of clusters

Important distinction between hierarchical and


partitional sets of clusters

Partitional Clustering
A division data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset

Hierarchical clustering
A set of nested clusters organized as a hierarchical tree

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

Partitional Clustering

Original
g
Points

Tan,Steinbach, Kumar

A Partitional Clustering
g

Introduction to Data Mining

4/18/2004

Hierarchical Clustering

p1
p3

p4

p2

p1 p2
Traditional Hierarchical Clustering

p3 p4

Traditional Dendrogram

p1
p3

p4

p2

p1 p2
Non-traditional Hierarchical Clustering
Tan,Steinbach, Kumar

p3 p4

Non-traditional Dendrogram

Introduction to Data Mining

4/18/2004

Other Distinctions Between Sets of Clusters


z

Exclusive versus non-exclusive


In non-exclusive clusterings, points may belong to multiple
clusters.
clusters
Can represent multiple classes or border points

Fuzzyy versus non-fuzzyy


In fuzzy clustering, a point belongs to every cluster with some
weight between 0 and 1
Weights must sum to 1
Probabilistic clustering has similar characteristics

Partial versus complete


In
I some cases, we only
l wantt to
t cluster
l t some off the
th data
d t

Heterogeneous versus homogeneous


Cluster of widely different sizes, shapes, and densities

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

Types of Clusters
z

Well-separated clusters

Center-based clusters

Contiguous clusters

Density-based clusters

P
Property
t or Conceptual
C
t l

Described by an Objective Function

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

10

Types of Clusters: Well-Separated


z

Well-Separated Clusters:
A cluster is a set of points such that any point in a cluster is
closer (or more similar) to every other point in the cluster than
to any point not in the cluster.

3 well-separated clusters
Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

11

Types of Clusters: Center-Based


z

Center-based
A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the center of a cluster, than to the
center of any other cluster
The center of a cluster is often a centroid, the average of all
th points
the
i t iin th
the cluster,
l t or a medoid,
d id the
th mostt representative

t ti
point of a cluster

4 center-based clusters
Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

12

Types of Clusters: Contiguity-Based


z

Contiguous Cluster (Nearest neighbor or


Transitive)
A cluster is a set of points such that a point in a cluster is
closer (or more similar) to one or more other points in the
cluster than to any point not in the cluster.

8 contiguous clusters
Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

13

Types of Clusters: Density-Based


z

Density-based
A cluster is a dense region of points, which is separated by
low-density regions, from other regions of high density.
Used when the clusters are irregular or intertwined, and when
noise and outliers are present.

6 density-based clusters
Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

14

Types of Clusters: Conceptual Clusters


z

Shared Property or Conceptual Clusters


Finds clusters that share some common property or represent
a particular concept.
.

2 Overlapping Circles
Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

15

Types of Clusters: Objective Function


z

Clusters Defined by an Objective Function


Finds clusters that minimize or maximize an objective function.
Enumerate all possible ways of dividing the points into clusters and
evaluate the `goodness' of each potential set of clusters by using
the given objective function
function. (NP Hard)

Can have global or local objectives.

Hierarchical clustering algorithms typically have local objectives

P titi
Partitional
l algorithms
l ith
typically
t i ll h
have global
l b l objectives
bj ti

A variation of the global objective function approach is to fit the


data to a parameterized model.

Parameters for the model are determined from the data.

Mixture models assume that the data is a mixture' of a number of


statistical distributions.

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

16

Types of Clusters: Objective Function


z

Map the clustering problem to a different domain


and solve a related problem in that domain
Proximity matrix defines a weighted graph, where the
nodes are the points being clustered, and the
weighted edges represent the proximities between
points
Clustering is equivalent to breaking the graph into
connected components, one for each cluster.
Want to minimize the edge weight between clusters
and maximize the edge weight within clusters

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

17

Characteristics of the Input Data Are Important


z

Type of proximity or density measure


This is a derived measure, but central to clustering

Sparseness
Dictates type of similarity
Adds to efficiency

Attribute type
Dictates type of similarity

Type of Data
Dictates type of similarity
Other characteristics, e.g., autocorrelation

z
z
z

Dimensionality
Di
i
lit
Noise and Outliers
Type of Distribution

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

18

Clustering Algorithms
z

K-means and its variants

Hierarchical clustering

Density-based clustering

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

19

K-means Clustering
z
z

Partitional clustering approach


Each cluster is associated with a centroid (center point)

Each point is assigned to the cluster with the closest


centroid

Number of clusters,
clusters K
K, must be specified
The basic algorithm is very simple

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

20

K-means Clustering Details


z

Initial centroids are often chosen randomly.

Clusters produced vary from one run to another.

The centroid is (typically) the mean of the points in the


cluster.

Closeness is measured by Euclidean distance, cosine


similarity correlation
similarity,
correlation, etc
etc.
K-means will converge for common similarity measures
mentioned above.

z
z

Most of the convergence happens in the first few


iterations.

Often the stopping


pp g condition is changed
g to Until relatively
y few
points change clusters

Complexity is O( n * K * I * d )

n = number of points, K = number of clusters,


I = number of iterations, d = number of attributes

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

21

Two different K-means Clusterings


3

2.5

Original Points

15
1.5

0.5
0

-2

-1.5

-1

-0.5

0.5

1.5

2.5

2.5

1.5

1.5

0.5

0.5

-2

-1.5

-1

-0.5

0.5

1.5

-1.5

-1

-0.5

0.5

1.5

Optimal Clustering
Tan,Steinbach, Kumar

-2

Introduction to Data Mining

Sub-optimal Clustering
4/18/2004

22

Importance of Choosing Initial Centroids


Iteration 6
1
2
3
4
5
3

2.5

1.5

0.5
0

-2

-1.5

-1

-0.5

0.5

1.5

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

23

Importance of Choosing Initial Centroids


Iteration 1

Iteration 2

Iteration 3

2.5

2.5

2.5

1.5

1.5

1.5

0.5

0.5

0.5

-2

-1.5

-1

-0.5

0.5

1.5

-2

-1.5

-1

-0.5

0.5

1.5

-2

It ti 4
Iteration

It ti 5
Iteration
2.5

1.5

1.5

1.5

0.5

0.5

0.5

-0.5

0.5

Tan,Steinbach, Kumar

1.5

0.5

1.5

1.5

2.5

2.5

-1

-0.5

It ti 6
Iteration

-1.5

-1

-2

-1.5

-2

-1.5

-1

-0.5

0.5

Introduction to Data Mining

1.5

-2

-1.5

-1

-0.5

0.5

4/18/2004

24

Evaluating K-means Clusters


z

Most common measure is Sum of Squared Error (SSE)


For each point, the error is the distance to the nearest cluster
To get SSE, we square these errors and sum them.
K

SSE = dist 2 ( mi , x )
i =1 xCi

x is a data point in cluster Ci and mi is the representative point for


cluster Ci

can show that mi corresponds to the center (mean) of the cluster

Given two clusters, we can choose the one with the smallest
error
One easy way to reduce SSE is to increase K, the number of
clusters
A good clustering with smaller K can have a lower SSE than a poor
clustering
l t i with
ith hi
higher
h K

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

25

Importance of Choosing Initial Centroids


Iteration 5
1
2
3
4
3

2.5

1.5

0.5
0

-2

-1.5

-1

-0.5

0.5

1.5

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

26

Importance of Choosing Initial Centroids


Iteration 1

Iteration 2

2.5

2.5

1.5

1.5

0.5

0.5

-2

-1.5

-1

-0.5

0.5

1.5

-2

-1.5

-1

-0.5

0.5

Iteration 3

2.5

1.5

1.5

1.5

2.5

2.5

0.5

0.5

0.5

-1
1

-0.5
05

05
0.5

Tan,Steinbach, Kumar

Iteration 5

-1.5
15

1.5

Iteration 4

-2
2

15
1.5

-2
2

-1.5
15

-1
1

-0.5
05

05
0.5

Introduction to Data Mining

15
1.5

-2
2

-1.5
15

-1
1

-0.5
05

05
0.5

15
1.5

4/18/2004

27

Problems with Selecting Initial Points


z

If there are K real clusters then the chance of selecting


one centroid from each cluster is small.

Ch
Chance
is
i relatively
l ti l smallll when
h K iis llarge

If clusters are the same size, n, then

For example, if K = 10, then probability = 10!/1010 = 0.00036

Sometimes the initial centroids will readjust themselves in


i ht way, and
right
d sometimes
ti
they
th d
dont
t

Consider an example of five pairs of clusters

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

28

10 Clusters Example
Iteration 4
1
2
3
8
6
4

2
0
-2
-4
-6
0

10

15

20

x
Starting with two initial centroids in one cluster of each pair of clusters
Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

29

10 Clusters Example
Iteration 2
8

Iteration 1
8

-2

-2

-4

-4

-6

-6
0

10

15

20

15

20

-2

-2

-4

-4

-6

-6
10

20

Iteration 4
8

Iteration 3

15

10

15

20

10

Starting with two initial centroids in one cluster of each pair of clusters
Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

30

10 Clusters Example
Iteration 4
1
2
3
8
6
4

2
0
-2
-4
-6
0

10

15

20

x
Starting with some pairs of clusters having three initial centroids, while other have only one.
Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

31

10 Clusters Example
Iteration 2
8

Iteration 1
8

-2

-2

-4

-4

-6

-6
0

10

15

20

-2

-4

-4

-6

-6
5

10

15

20

15

20

-2

10

x
Iteration
4

x
Iteration
3

15

20

10

Starting with some pairs of clusters having three initial centroids, while other have only one.
Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

32

Solutions to Initial Centroids Problem


z

Multiple runs
Helps,
p , but probability
p
y is not on yyour side

Sample and use hierarchical clustering to


determine initial centroids
z Select more than k initial centroids and then
select among these initial centroids
z

Select most widely separated

Postprocessing
z Bisecting K-means
z

Not as susceptible to initialization issues

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

33

Handling Empty Clusters


z

Basic K-means algorithm can yield empty


clusters

Several strategies
Choose the point that contributes most to SSE
Choose a point from the cluster with the highest SSE
If there are several empty clusters, the above can be
repeated several times.

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

34

Updating Centers Incrementally


z

In the basic K-means algorithm, centroids are


updated after all points are assigned to a centroid

An alternative is to update the centroids after


each assignment (incremental approach)

Each assignment updates zero or two centroids


More expensive
Introduces an order dependency
N
Never
gett an empty
t cluster
l t
Can use weights to change the impact

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

35

Pre-processing and Post-processing


z

Pre-processing
Normalize the data
Eliminate outliers

Post processing
Post-processing
Eliminate small clusters that may represent outliers
Split loose
loose clusters,
clusters ii.e.,
e clusters with relatively high
SSE
Merge
g clusters that are close and that have relatively
y
low SSE
Can use these steps during the clustering process

ISODATA
SO

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

36

Bisecting K-means
z

Bisecting K-means algorithm

Variant of K-means that can produce a partitional or a


hierarchical clustering

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

37

Bisecting K-means Example

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

38

Limitations of K-means
z

K-means has problems when clusters are of


differing
Sizes
Densities
Non-globular shapes

K-means has problems when the data contains


outliers.

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

39

Limitations of K-means: Differing Sizes

K-means (3 Clusters)

Original Points

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

40

Limitations of K-means: Differing Density

K-means (3 Clusters)

Original Points

Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

41

Limitations of K-means: Non-globular Shapes

Original Points

Tan,Steinbach, Kumar

K-means (2 Clusters)

Introduction to Data Mining

4/18/2004

42

Overcoming K-means Limitations

Original Points

K-means Clusters

One solution is to use many


y clusters.
Find parts of clusters, but need to put together.
Tan,Steinbach, Kumar

Introduction to Data Mining

4/18/2004

43

Overcoming K-means Limitations

Original Points

Tan,Steinbach, Kumar

K-means Clusters

Introduction to Data Mining

4/18/2004

44

Overcoming K-means Limitations

Original Points

Tan,Steinbach, Kumar

K-means Clusters

Introduction to Data Mining

4/18/2004

45