
FEM 2063 - Data Analytics

Chapter 8

CLUSTERING

CLUSTERING

• Clustering refers to a very broad set of techniques for finding subgroups, or
clusters, in a data set.

• Clustering methods are unsupervised methods that seek a partition of the data
into distinct groups so that the observations within each group are quite
similar to each other.

• They require us to define what it means for two or more observations to be
similar or different.
CLUSTERING
PCA versus Clustering

• PCA looks for a low-dimensional representation of the observations that
explains a good fraction of the variance (a dimensionality reduction method).

• Clustering looks for homogeneous subgroups among the observations.


CLUSTERING
Example: Clustering for Market Segmentation
Suppose we are given a large number of measurements (e.g. household income,
occupation, distance from nearest urban area, etc.) for a large number of people.

Objective: to perform market segmentation by identifying subgroups of people who
might be more receptive to a particular form of advertising, or more likely to
purchase a particular product.

The task of performing market segmentation amounts to clustering the people in
the data set.
Overview

➢ K-Means Clustering

➢ Hierarchical Clustering

K-Means Clustering

In K-means clustering, we seek to partition the observations into a
pre-specified number of clusters, K.
K-Means Clustering
Example: simulated data set with 150 observations
K-Means Clustering
Details

Let 𝐶1 , . . . , 𝐶𝐾 denote sets containing the indices of the observations in each cluster.
These sets satisfy two properties:

1. 𝐶1 ∪ 𝐶2 ∪ . . . ∪ 𝐶𝐾 = {1, . . . , 𝑛}: each observation belongs to one of the K clusters.

2. 𝐶𝑘 ∩ 𝐶𝑘′ = ∅ for all 𝑘 ≠ 𝑘′: the clusters are non-overlapping.

K-Means Clustering
Details
• A good clustering is one for which the within-cluster variation is as small
as possible.

• The within-cluster variation for cluster 𝐶𝑘 is a measure W(𝐶𝑘) of the amount
by which the observations within the cluster differ from each other.

• Hence, we want to solve the problem

        minimize over 𝐶1, . . . , 𝐶𝐾:   Σ_{k=1}^{K} W(𝐶𝑘)
K-Means Clustering
Define W(𝐶𝑘)
• Using Euclidean distance,

        W(𝐶𝑘) = (1/|𝐶𝑘|) Σ_{i,i′ ∈ 𝐶𝑘} Σ_{j=1}^{p} (x_{ij} − x_{i′j})²

where |𝐶𝑘| denotes the number of observations in the kth cluster.

• The optimization problem that defines K-means clustering:

        minimize over 𝐶1, . . . , 𝐶𝐾:   Σ_{k=1}^{K} (1/|𝐶𝑘|) Σ_{i,i′ ∈ 𝐶𝑘} Σ_{j=1}^{p} (x_{ij} − x_{i′j})²
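These quantities are straightforward to compute directly. Below is a minimal NumPy sketch, not part of the original slides; the helper names, and the assumption that X is an (n, p) array of observations and labels an array of cluster indices, are illustrative.

```python
import numpy as np

def within_cluster_variation(X, labels, k):
    # W(C_k): squared Euclidean distances summed over all pairs of
    # observations in cluster k, divided by the cluster size |C_k|.
    labels = np.asarray(labels)
    Xk = X[labels == k]                       # observations assigned to cluster k
    diffs = Xk[:, None, :] - Xk[None, :, :]   # pairwise differences, shape (|C_k|, |C_k|, p)
    return (diffs ** 2).sum() / len(Xk)

def kmeans_objective(X, labels):
    # The K-means objective: sum of W(C_k) over the K clusters.
    return sum(within_cluster_variation(X, labels, k) for k in np.unique(labels))
```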


K-Means Clustering
Algorithm
1. Initialization: randomly assign a cluster Ci (i between 1 and K) to each of
the observations.

2. Iterate until the cluster assignments stop changing:

   a. For each of the K clusters, compute the cluster centroid (the mean vector).

   b. Assign each observation to the cluster whose centroid is closest (e.g.
   using Euclidean distance).
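A minimal sketch of this algorithm in NumPy (illustrative only, not the course's reference implementation; it assumes no cluster ever becomes empty, which a production implementation would have to handle):

```python
import numpy as np

def kmeans(X, K, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))            # step 1: random initial assignment
    for _ in range(max_iter):
        # step 2a: centroid (mean vector) of each cluster
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # step 2b: reassign each observation to the nearest centroid (Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):           # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids
```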
K-Means Clustering
Properties of the Algorithm

• This algorithm is guaranteed to decrease the value of the objective at each
step, since each of steps 2(a) and 2(b) can only reduce (or leave unchanged)
the total within-cluster variation.

• It is not guaranteed to reach the global minimum; the result depends on the
initial random assignment (see the sketch below).
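Because only a local minimum is found, K-means is usually rerun from several random initializations and the solution with the smallest objective is kept. A sketch with scikit-learn (the data matrix X and the choice K = 3 are assumptions for illustration):

```python
from sklearn.cluster import KMeans

# n_init=20 reruns the algorithm from 20 random starts and keeps the best run;
# inertia_ reports the within-cluster sum of squares of that best run.
km = KMeans(n_clusters=3, n_init=20, random_state=0).fit(X)
print(km.inertia_)
```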


K-Means Clustering
Example
K-Means Clustering
Example: Different initializations
Overview

➢ K-Means Clustering

➢ Hierarchical Clustering

Hierarchical Clustering

• Hierarchical clustering builds a tree-based diagram called a dendrogram, which
lets us view at once the clusterings obtained for every possible number of
clusters.

• It does not require the number of clusters to be specified up front.

• There are several approaches; the most common, described here, is bottom-up or
agglomerative clustering. The name refers to the fact that the dendrogram is
built starting from the leaves and combining clusters up to the trunk.
Hierarchical Clustering
“Bottom-up”
Hierarchical Clustering
Approach
• Start with each point in its own cluster.
• Identify the two closest clusters and merge them.
• Repeat.
• Stop when all points are in a single cluster.
Hierarchical Clustering
Example 1: 45 generated points
Hierarchical Clustering
Algorithm
1. Begin with n observations and a measure (e.g. Euclidean distance) of all
n(n − 1)/2 pairwise dissimilarities. Treat each observation as its own cluster.

2. For i = n, n − 1, . . . , 2:
   (a) Examine all pairwise inter-cluster dissimilarities among the i clusters.
   Identify the pair of clusters that are least dissimilar and fuse them. The
   dissimilarity between these two clusters indicates the height in the
   dendrogram at which the fusion should be placed.
   (b) Compute the new pairwise inter-cluster dissimilarities among the i − 1
   remaining clusters.
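A sketch of this procedure using SciPy's hierarchical clustering routines (the toy data, the complete linkage, and the cut into 3 clusters are illustrative assumptions, not taken from the slides):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).normal(size=(45, 2))        # toy data, standing in for Example 1

Z = linkage(X, method='complete', metric='euclidean')     # the sequence of fusions
dendrogram(Z)                                             # fusion heights on the vertical axis
plt.show()

labels = fcluster(Z, t=3, criterion='maxclust')           # cut the dendrogram into 3 clusters
```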
Hierarchical Clustering
Dissimilarity between two clusters
The concept of dissimilarity between a pair of observations can be extended to a
pair of groups of observations using the notion of linkage.

The four most common types of linkage:

• Complete
• Average
• Single
• Centroid
Hierarchical Clustering
Linkage    Description

Complete   Maximal inter-cluster dissimilarity. Compute all pairwise
           dissimilarities between the observations in cluster A and the
           observations in cluster B, and record the largest of these
           dissimilarities.

Single     Minimal inter-cluster dissimilarity. Compute all pairwise
           dissimilarities between the observations in cluster A and the
           observations in cluster B, and record the smallest of these
           dissimilarities.

Average    Mean inter-cluster dissimilarity. Compute all pairwise
           dissimilarities between the observations in cluster A and the
           observations in cluster B, and record the average of these
           dissimilarities.

Centroid   Dissimilarity between the centroid for cluster A and the centroid
           for cluster B. Centroid linkage can result in undesirable inversions
           (fusions at a height below that of an earlier fusion).
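In SciPy the linkage rule is selected with the method argument; a short sketch, reusing the X array from the earlier sketch:

```python
from scipy.cluster.hierarchy import linkage

# Same data, different linkage rules; each produces a different dendrogram.
Z_complete = linkage(X, method='complete')
Z_single   = linkage(X, method='single')
Z_average  = linkage(X, method='average')
Z_centroid = linkage(X, method='centroid')   # can produce the inversions noted above
```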
Hierarchical Clustering
Choice of Dissimilarity

Euclidean distance: observations 1 and 3 are similar (small distance).

Correlation: observations 1 and 2 are highly correlated (similar under a
correlation-based distance).
Hierarchical Clustering
Example 2: 9 points
Hierarchical Clustering
Example 2: 9 points

• Observations 5 and 7 are quite similar to each other, as are observations 1 and 6.

• However, observation 9 is no more similar to observation 2 than it is to
observations 8, 5 and 7, even though observations 9 and 2 are close together in
terms of horizontal distance on the dendrogram.

• This is because observations 2, 8, 5 and 7 all fuse with observation 9 at the
same height, approximately 1.8.
Hierarchical Clustering
Example 2: 9 points
Hierarchical Clustering
Choice of Dissimilarity

• An alternative to Euclidean distance is correlation-based distance, which
considers two observations to be similar if their features are highly
correlated (see the sketch below).

• This is an unusual use of correlation, which is normally computed between
variables; here it is computed between the observation profiles for each pair
of observations.
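A sketch of how a correlation-based distance between observations can be computed and passed to hierarchical clustering (X is again an assumed (n, p) data matrix):

```python
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

# The 'correlation' metric gives 1 minus the correlation between each pair of
# observation profiles (rows of X), not between variables (columns).
D = pdist(X, metric='correlation')     # condensed pairwise distance matrix
Z = linkage(D, method='complete')      # hierarchical clustering on that distance
```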
Hierarchical Clustering
Scaling

Left: number, without scaling. Center: number, after scaling. Right: amount (USD), without scaling.
Hierarchical Clustering
Practical Issues

• Should the observations or features first be standardized in some way? For
instance, the variables may be centered to have mean zero and scaled to have
standard deviation one (see the sketch after this list).

• In the case of hierarchical clustering:
  • What dissimilarity measure should be used?
  • What type of linkage should be used?

• How many clusters should be chosen? (This applies to both K-means and
hierarchical clustering.)
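One common way to standardize before clustering, sketched with scikit-learn (the data matrix X and the choice of K-means with K = 3 are illustrative assumptions):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Center each variable to mean zero and scale it to standard deviation one,
# so that no single variable dominates the distance computations.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=20, random_state=0).fit_predict(X_scaled)
```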
Hierarchical Clustering
Example: Breast Cancer Microarray Study

• Kruger, D.T., et al., "Hierarchical clustering of activated proteins in the
PI3K and MAPK pathways in ER-positive, HER2-negative breast cancer with
potential therapeutic consequences", Br J Cancer 119, 832–839 (2018).

• Euclidean distance was used to determine the similarity among the tumours, and
a complete linkage function was used to iteratively build up the clusters. Data
were normalized.

• A subset of 293 ER+/HER2− tumours for which scorings of the seven proteins of
interest were available was analysed.
Hierarchical Clustering

Hierarchical clustering of seven proteins in ER+/HER2− tumours, visualised in a heatmap.
Red (blue) bars indicate a higher (lower) activation score.
Conclusions
• Unsupervised learning is important for understanding the variation and
grouping structure of a set of unlabeled data, and can be a useful
pre-processing step for supervised learning.

• It is intrinsically more difficult than supervised learning because there is
no gold standard (like an outcome variable) and no single objective (like test
set accuracy).

• It is an active field of research.


