
Clustering

Lecture 3: Hierarchical Methods

Jing Gao
SUNY Buffalo

Outline
• Basics
– Motivation, definition, evaluation
• Methods
– Partitional
– Hierarchical
– Density-based
– Mixture model
– Spectral methods
• Advanced topics
– Clustering ensemble
– Clustering in MapReduce
– Semi-supervised clustering, subspace clustering, co-clustering,
etc.

Hierarchical Clustering

• Agglomerative approach (bottom-up)
  – Initialization: each object is a cluster
  – Iteration: merge the two clusters which are most similar to each other
  – Until all objects are merged into a single cluster

[Figure: five objects a–e merged bottom-up over Steps 0–4: a,b → ab; d,e → de; c,de → cde; ab,cde → abcde]

Hierarchical Clustering

• Divisive approach (top-down)
  – Initialization: all objects stay in one cluster
  – Iteration: select a cluster and split it into two sub-clusters
  – Until each leaf cluster contains only one object

[Figure: the same tree read top-down over Steps 0–4: abcde splits into ab and cde; cde splits into c and de; then into singletons]

Dendrogram
• A tree that shows how clusters are merged/split
hierarchically
• Each node on the tree is a cluster; each leaf node is a
singleton cluster

Dendrogram
• A clustering of the data objects is obtained by cutting the
  dendrogram at the desired level; each connected component then
  forms a cluster (see the sketch below)
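A minimal sketch of this cut operation, assuming SciPy is available (the toy points and the cut height are made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points: three near the origin, two near (5, 5)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2]])

# Build the full merge tree (single link; inter-cluster distance
# definitions are covered later in the lecture)
Z = linkage(X, method='single')

# Cut the dendrogram at height 1.0: every branch hanging entirely
# below the cut becomes one cluster (one connected component)
labels = fcluster(Z, t=1.0, criterion='distance')
print(labels)  # expect two clusters, e.g. [1 1 1 2 2]
```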

Agglomerative Clustering Algorithm

• The more popular hierarchical clustering technique


• Basic algorithm is straightforward
1. Compute the distance matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the distance matrix
6. Until only a single cluster remains

• The key operation is the computation of the distance between two clusters
  – Different approaches to defining the distance between clusters
    distinguish the different algorithms; a rough sketch of the overall
    procedure follows below
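As a rough sketch of steps 1–6 in plain NumPy (written for clarity, not speed; `inter_dist` is a hypothetical placeholder into which any of the cluster-distance definitions below can be plugged):

```python
import numpy as np

def agglomerative(X, inter_dist, k=1):
    """Naive agglomerative clustering following steps 1-6 above."""
    clusters = [[i] for i in range(len(X))]   # 2. each point is a cluster
    while len(clusters) > k:                  # 3./6. repeat until done
        # 1./4. compute cluster distances and find the closest pair
        pairs = [(a, b) for a in range(len(clusters))
                        for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda ab: inter_dist(X[clusters[ab[0]]],
                                                    X[clusters[ab[1]]]))
        # 5. merge the pair; distances are recomputed on the next pass,
        #    which stands in for "update the distance matrix"
        clusters[a].extend(clusters[b])
        del clusters[b]   # b > a, so index a is unaffected
    return clusters
```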

Starting Situation
• Start with clusters of individual points and a distance matrix
[Figure: twelve singleton clusters p1–p12 and the initial pairwise
distance matrix, with one row and one column per point]

Intermediate Situation
• After some merging steps, we have some clusters
• Choose the two clusters that have the smallest distance (largest
  similarity) to merge

[Figure: five current clusters C1–C5 over the points p1–p12, and the
5×5 distance matrix indexed by C1…C5]

Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update
  the distance matrix

[Figure: the same five clusters, with C2 and C5 marked as the closest
pair about to be merged]

After Merging
• The question is “How do we update the distance matrix?”
[Figure: after the merge, the distance matrix has rows and columns
C1, C2 ∪ C5, C3, C4; every entry involving the new cluster C2 ∪ C5
is marked "?" because it must be recomputed]

How to Define Inter-Cluster Distance

[Figure: two clusters of points with the question "Distance?" between
them, next to the pairwise distance matrix]

• MIN
• MAX
• Group Average
• Distance Between Centroids
• …

MIN or Single Link
• Inter-cluster distance
– The distance between two clusters is represented by the
distance of the closest pair of data objects belonging to
different clusters.
– Determined by one pair of points, i.e., by one link in the
proximity graph

$$d_{\min}(C_i, C_j) = \min_{p \in C_i,\; q \in C_j} d(p, q)$$
MIN

[Figure: nested single-link clusters on six points and the corresponding
dendrogram (leaf order 3, 6, 2, 5, 4, 1; merge heights between 0 and 0.2)]

Strength of MIN

[Figure: original points (left) and the two clusters found by MIN (right)]

• Can handle non-elliptical shapes

Limitations of MIN

[Figure: original points (left) and the two clusters found by MIN (right)]

• Sensitive to noise and outliers

MAX or Complete Link
• Inter-cluster distance
– The distance between two clusters is represented by the
distance of the farthest pair of data objects belonging to
different clusters

$$d_{\max}(C_i, C_j) = \max_{p \in C_i,\; q \in C_j} d(p, q)$$

MAX

[Figure: nested complete-link clusters on the same six points and the
corresponding dendrogram (leaf order 3, 6, 4, 1, 2, 5; merge heights
between 0 and 0.4)]

Strength of MAX

[Figure: original points (left) and the two clusters found by MAX (right)]

• Less susceptible to noise and outliers

Limitations of MAX
• Tends to break large clusters

[Figure: original points containing one large cluster]

Limitations of MAX

[Figure: the same data cut into 2 clusters by MIN (left) and by MAX (right)]

• Biased towards globular clusters


Group Average or Average Link
• Inter-cluster distance
– The distance between two clusters is represented by the
average distance of all pairs of data objects belonging to
different clusters
– Determined by all pairs of points in the two clusters

$$d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$$
Group Average

[Figure: nested average-link clusters on the six points and the
corresponding dendrogram (leaf order 3, 6, 4, 1, 2, 5; merge heights
between 0 and 0.25)]

Group Average

• Compromise between Single and Complete Link

• Strengths
– Less susceptible to noise and outliers

• Limitations
– Biased towards globular clusters

Centroid Distance
• Inter-cluster distance
– The distance between two clusters is represented by the
distance between the centers of the clusters
– Determined by cluster centroids

$$d_{\mathrm{mean}}(C_i, C_j) = d(\vec{m}_i, \vec{m}_j)$$

where $\vec{m}_i$ and $\vec{m}_j$ are the centroids of $C_i$ and $C_j$
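A sketch in the same style, computing the two centroids with NumPy:

```python
import numpy as np

def d_mean(Ci, Cj):
    """Centroid distance: distance between the two cluster means."""
    return np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))
```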

Ward’s Method

• Similarity of two clusters is based on the increase in squared error
  when the two clusters are merged
  – Similar to group average if the distance between points is the
    squared distance

• Less susceptible to noise and outliers


• Biased towards globular clusters
• Hierarchical analogue of K-means
– Can be used to initialize K-means (a sketch follows below)
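A sketch of that initialization idea, assuming SciPy and scikit-learn are available (the random data and the choice of 3 clusters are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(200, 2)        # made-up data

# Ward linkage: each merge minimizes the increase in squared error
Z = linkage(X, method='ward')
labels = fcluster(Z, t=3, criterion='maxclust')  # cut into 3 clusters

# Seed K-means with the centroids of the Ward clusters
seeds = np.vstack([X[labels == c].mean(axis=0) for c in np.unique(labels)])
km = KMeans(n_clusters=3, init=seeds, n_init=1).fit(X)
```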

Comparison
[Figure: the same six points clustered by MIN, MAX, Group Average, and
Ward's Method; the four panels show how the resulting nested clusters
differ across the four inter-cluster distance definitions]
Time and Space Requirements

• O(N²) space, since it uses the distance matrix
  – N is the number of points

• O(N³) time in many cases
  – There are N merge steps, and at each step the O(N²)-size distance
    matrix must be updated and searched
  – Complexity can be reduced to O(N² log N) time for some approaches

Strengths

• Do not have to assume any particular number of clusters
  – Any desired number of clusters can be obtained by 'cutting' the
    dendrogram at the proper level

• They may correspond to meaningful taxonomies
  – e.g., shopping websites: electronics (computer, camera, …),
    furniture, groceries

Problems and Limitations

• Once a decision is made to combine two clusters, it cannot be undone
• No objective function is directly minimized
• Different schemes have problems with one or
more of the following:
– Sensitivity to noise and outliers
– Difficulty handling different sized clusters and
irregular shapes
– Breaking large clusters

Take-away Message

• Agglomerative and divisive hierarchical clustering


• Several ways of defining inter-cluster distance
• The properties of the clusters produced by the different approaches,
  which follow from the different inter-cluster distance definitions
• Pros and cons of hierarchical clustering

