20 1clustering

Inspire…Educate…Transform.
Machine Learning
Clustering
Lt Suryaprakash Kompalli
Senior Mentor, INSOFE
The best place for students to learn Applied Engineering http://www.insofe.edu.in

Unsupervised learning
• Supervised: Data and target
• Unsupervised: Just data
The best place for students to learn Applied Engineering 2 http://www.insofe.edu.in

Clustering
• Finding similarity groups in data, called
clusters. I.e.,
– data instances that are similar to (near) each other
are in the same cluster
– data instances that are very different (far away) from
each other fall in different clusters.

A few clustering applications
• Targeted marketing
– It is not uncommon to have over 100,000 segments in insurance
clustering
• Given a collection of text documents,

organize them according to their content
similarities,
– E.g., Google news
• Blind signal separation (separating two
speakers)

Algorithms
• Hierarchical approach: Create a hierarchical
decomposition of the set of data (or objects) using
some criterion (Wald)
• Partitioning approach: Construct various
partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
(K-means, Spectral clustering)
• Model-based methods: A model is hypothesized
for each of the clusters and tries to find the best
fit of that model to each other (EM)
HIERARCHICAL
(AGGLOMERATIVE)
CLUSTERING

Example of agglomerative clustering
At each iteration, pick two data points that have least distance between them. Add the points
into a cluster.

Note how we update distances between other clusters. The lower distance is picked.
Distance between BOS to DC was 429, now set to 233. Distance from BOS to MIA was 1504,
now set to 1308.
Averaging may also be used instead of taking distance to the closest point.

Note how creation of
SF/LA cluster has
changed distnces

Agglomerative clustering (Hierarchical)
Typically, a particular “level” of the hierarchy Highlighted clusters divide airports

is selected to be your clustering result into North-East, Central, South and
Pacific areas
All airports
BOS/NY/
DC/CHI/
MIA
DEN/SF/
LA/SEA
SF/LA/SEA
BOS/NY/
SF/LA SEA DC/CHI/
DEN
BOS/NY/
SF LA DEN
DC/CHI
Decomposes data into a levels of nested BOS/NY/

CHI
DC
partitioning.
A clustering of the data objects is obtained BOS/NY DC

by cutting the dendrogram at the desired
level, then each connected component Highlighted clusters divide airports
BOS NY
forms a cluster. into North-East, Central, South and
Pacific areas

• Assign each item to its own cluster, so that if

you have N items, you now have N clusters,
each containing just one item.
• Merge most similar clusters into a single cluster,
so that now you have one less cluster.
• Compute distances (similarities) between the
new cluster and each of the old clusters.
• Repeat steps 2 and 3 until all items are
clustered into a single cluster of size N.

(Cluster Distance Measures)
•

KD-Tree Clustering
Rojas et al 2005, SIOX: Simple Interactive Object Extraction in

Still Images
http://www.siox.org/downloads/tr-b-05-07.pdf
KD-Tree Clustering
• Works well in practical situations like image

segmentation
• KD-tree built with one feature at each level
1) Example 3D data: x1, x2, x3: (3, 1, 4), (2, 3, 7), (4, 3, 4), (2, 1, 3),
(2, 4, 5), (6, 1, 4), (5, 2, 5), (1, 4,4), (0, 5, 7), (4, 0, 6), (7, 1, 6)
2) Select one feature (Say X1) Find Mean in X1-Dimension i.e. (sum of
x1-values)/(no. of elements)
a) (3+2+4+2+2+6+5+1+0+4+7)/11 = 3.27 ~= 3
3) Select the first item having X-value as 3 as the ROOT element.
4) Use ROOT to classify remaining elements whether they belongs to
left sub tree (x1 <= 3) or right sub tree (x1 > 3)
5) Using elements in the right sub tree and left Sub tree, repeat from
step 2 using next feature till only a few items are left in each sub-
tree
KD-Tree Clustering
X (3,1,4) X
(4,3,4)
(2,3,7)
Y Y Y
(2,1,3) (2,4,5) (6,1,4)

Z Z Z
(5,2,5)
(1,4,4) (0,5,7) X X
(4,0,6) (7,1,6)
Data: (3, 1, 4), (2, 3, 7), (4, 3, 4), (2, 1, 3), (2, 4, 5), (6,
1, 4), (5, 2, 5), (1, 4,4), (0, 5, 7), (4, 0, 6), (7, 1, 6)
KD-Tree Clustering
From left to right: Original Image, Red: Region of interest selected by user,
Green: Known foreground object, Intermediate result, Final result

A color signature is a set of representative colors,not
necessarily a sub set of the input colors.

The set of background samples from any image
typically consists of a few hundreds of thousands of
colors.

Clustering reduces the background sample to its
representative colors, usually about a few hundreds.
Rojas et al 2005, SIOX: Simple Interactive Object Extraction

in Still Images
Partitioning algorithms
K-MEANS AND K-
MEDOIDS
K-means clustering
• K-means is a partitional clustering
algorithm as it partitions the given data
into k clusters.
– Each cluster has a cluster center, called
centroid.
– k is specified by the user

Pictures courtesy Andrew Ngs course on Coursera

K-means algorithm
• Given k, the k-means algorithm works as
follows:
1. Randomly choose k data points (seeds) to be the

initial centroids, cluster centers
2. Assign each data point to the closest centroid
3. Re-compute the centroids using the current cluster
memberships.
4. If a convergence criterion is not met, or if some
clusters don’t get any points go to 2.

Stopping/convergence criterion
1. no (or minimum) re-assignments of data points to
different clusters,
2. no (or minimum) change of centroids, or
3. minimum decrease in the sum of squared error
(SSE),
(1)
– j C is the jth cluster, mj is the centroid (mean) of

Check SCREE plots.
cluster Cj
http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters
Local optima

What Is the Problem with K-Means?
• The k-means algorithm is sensitive to outliers !

• K-Medoids: Instead of taking the mean value of the object
in a cluster as a reference point, medoids can be used,
which is the most centrally located object in a cluster.
– In some problems, cluster center will need to be a member of the
dataset
– Ex: Graph similarity, Gene sequence matching, image matching

What Is the Problem with Medoids?
• More robust than k-means in the presence of noise and
outliers because a medoid is less influenced by outliers or
other extreme values than a mean
• Works efficiently for small data sets but does not scale
well for large data sets.
– In regular K-Means, the new mean is merely average of values
O(nkt)
– In K-Medoid, if a particular cluster has data points, you will
need O computations to determine the medoid for that cluster
– O(k(n-k)2 ) for each iteration
where n is # of data, k is # of clusters, t: # of iterations

K-means versus Hierarchical
• K-means produces a • Hierarchical Clustering can give
single partitioning different partitions depending on the
• K-means needs the level-of-resolution we are looking at
number of clusters to be • Hierarchical clustering doesn’t need the
specified number of clusters to be specified
• K-means is usually more • Hierarchical clustering can be slow
efficient run-time wise (has to make several merge/Split
decisions)

Comparison of Clustering
Sample data where single-link (i.e. heirarchical) does

better than k-means
Sample data where k-means does better than single

link

ANALYZING YOUR
CLUSTERS
Using Distance between Clusters
• Single link: smallest distance between an element in one cluster

and an element in the other, i.e., dis(Ki, Kj) = min(tip, tjq)
• Complete link: largest distance between an element in one cluster

and an element in the other, i.e., dis(Ki, Kj) = max(tip, tjq)
• Average: average distance between an element in one cluster and

an element in the other, i.e., dis(Ki, Kj) = avg(tip, tjq)
• Centroid: distance between the centroids of two clusters,

i.e., dis(Ki, Kj) = dis(Ci, Cj)
• Medoid: distance between the medoids of two clusters, i.e., dis(K i,

Kj) = dis(Mi, Mj)
– Medoid: one chosen, centrally located object in the cluster
Centroid, Radius & Diameter of a Cluster
(for numerical data sets)
• Centroid: the “middle” of a cluster iN1( X )
Cm  N i
• Radius: square root of average distance from any point of the

cluster to its centroid  N ( X  cm ) 2
Rm  i 1 i
N
– Sometimes changed to max of distance between all points to centroid
• Diameter: square root of average mean squared distance between

all pairs of points in the cluster  N  N ( X  X q )2
i
Dm  i 1 q 1
N ( N  1)
– Sometimes changed to max of the distance between all pairs of points in a

cluster

Stability Check of Clusters
• To check the stability of the

clusters take a random sample
of 95% of records. Compute
the clusters. If the clusters
formed are very similar to the
original, then the clusters are
fine.
Linearly clustered data
Nice

Linearly separable but merged
Data points of individuals of a

particular height and weight

Linearly separable but merged
Clustering the data points into T-

shirt sizes. Top: Three sizes, Right:
Five sizes

Linearly separable
• Run 50-500 simulations for
small k (2-10). For large k
(100 or so), we can do 1-5
simulations
• Pick the one that gives the
best SSE

Other Topics
• Clustering large datasets
– CLARA (Kaufman and Rousseeuw, 1990)
and CLARANS (Ng and Han 1994, 2002):
Select a small % of data, run K-means or
K-medoids
• Parallel and Efficient
implementations of K-means /
K-medoids http://www.math.unipd.it/~dulli/corso04/ng94efficient.pdf
https://anuradhasrinivas.files.wordpress.com/2013/04/lesson8-clustering.pdf
http://www.vlfeat.org/overview/kmeans.html
http://repository.cmu.edu/cgi/viewcontent.cgi?article=2397&context=compsci
http://www.cs.ucsb.edu/~veronika/MAE/Global_Kernel_K-Means.pdf

Bonus slides, if
time permits
UNDERSTANDING
DISTANCE

Numeric
2 , 𝑦2)
(𝑥
Eucledian
| 𝑦 2 − 𝑦 1|
𝑥 2 − 𝑥 1|
|
1 , 𝑦1)
(𝑥 Manhattan

Distance between categorical
attributes
• For each categorical variable
–Distance is zero if two records
have same value
–1 if different
–Sum the total distance across all
attributes

Categorical attributes 11 1 010010101
11 0 110110011
For example, in above

two feature vectors: a=5
b=2
c=3
d=2

Symmetric binary attributes
• Hamming distance function: Simple Matching

Coefficient, proportion of mismatches of their
values
bc
dist( xi , x j ) 
Hamming Distance a b c d Jaccard Distance
b  c *m W* W* *m
a b c d
Weighted Hamming Distance Weighted Jaccard Distance

Asymmetric binary attributes
• Asymmetric: if one of the states is more

important or more valuable than the other.
– By convention, state 1 represents the more important
state, which is typically the rare or infrequent state.
– Jaccard coefficient is a popular measure
bc
a b c
– We can have some variations, adding weights

– Assymetric features are sometimes ignored or given
less weights
Dissimilarity between Binary
Variables
• Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
– gender is a symmetric attribute, hence ignored

– the remaining attributes are asymmetric binary (Having fever is
more significant than not having fever)
– let the values Y and P be set to 1, and the value N be set to 0
0 1
d ( jack , mary )  0.33
2  0 1
11
d ( jack , jim )  0.67
111 a c
1 2 b is 0, no features where jack=Y/P and mary=N
d ( jim , mary )  0.75
11 2
Another distance metric
used in supervised learning


Value Distance Measure


Ordinal variables
• Same as numeric
• Look up is better than

computation

Look up matrix for ordinal with 3 states

Distance between different ordinals may not be same !!! On the right, distance
between state 1 and state 2 is much less than distance between state 1 and state 3.
Depends on what the ordinals represent.
E.g.: If ordinals represent education: class 5, Class 10, Degree. Distance between
class 5 and class 10 may be considered less than distance between class 10 and
degree

KERNEL TRICK

Moving into higher dimensions
Decision surface is complex

Decision surface is simple in
in lower dimension
higher dimension

Linear Separation of XOR
y x.y
(-1,1) (1,1)
x x
(-1,-1) (1,-1)
• XOR is not linearly separable in x,y space

• Linearly separable in x.y space
– The kernel here is K(x,y) = x.y
Reference: http://www.doc.ic.ac.uk/~dfg/ProbabilisticInference/IDAPILecture18.pdf, Duda and Hart, chapter 3.

Standard Kernels
•

k-means Vs. Kernel k-means
k-means Kernel k-means

Performance of Kernel K-means
Evaluation of the performance of clustering algorithms in kernel-induced feature space, Pattern

Recognition, 2005

Kernel Trick
• The original way is to transform each
data point into a high dimensional space
and then do computation
– This can be computationally complex
• Note that we need to compute distances
in Euclidian space, or sometimes scalar
product

Kernel Trick
•

Kernel Trick
•

Kernel trick continued
• This is where Kernel trick
comes into picture - we can
choose a kernel where the
transformed space is of high
dimension and yet it is easy to
compute the similarity score in
the original space.

International School of Engineering
Plot 63/A, 1st Floor, Road # 13, Film Nagar, Jubilee Hills, Hyderabad - 500 033
For Individuals: +91-9502334561/63 or 040-65743991

For Corporates: +91-9618483483
Web: http://www.insofe.edu.in
Facebook: https://www.facebook.com/insofe
Twitter: https://twitter.com/Insofeedu
YouTube: http://www.youtube.com/InsofeVideos
SlideShare: http://www.slideshare.net/INSOFE
LinkedIn: http://www.linkedin.com/company/international-scho
ol-of-engineering
This presentation may contain references to findings of various reports available in the public domain. INSOFE makes no representation as to their accuracy or that the organization
subscribes to those findings.

20 1clustering

Загружено:

Сведения о документе

Оригинальное название

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

20 1clustering

Загружено:

Авторское право:

Доступные форматы

Inspire…Educate…Transform.

The best place for students to learn Applied Engineering http://www.insofe.edu.in

• Unsupervised: Just data

The best place for students to learn Applied Engineering 2 http://www.insofe.edu.in

The best place for students to learn Applied Engineering 4 http://www.insofe.edu.in

• Given a collection of text documents,

The best place for students to learn Applied Engineering 5 http://www.insofe.edu.in

The best place for students to learn Applied Engineering 7 http://www.insofe.edu.in

The best place for students to learn Applied Engineering 8 http://www.insofe.edu.in

The best place for students to learn Applied Engineering 9 http://www.insofe.edu.in

The best place for students to learn Applied Engineering 11 http://www.insofe.edu.in

Typically, a particular “level” of the hierarchy Highlighted clusters divide airports

Decomposes data into a levels of nested BOS/NY/

A clustering of the data objects is obtained BOS/NY DC

The best place for students to learn Applied Engineering 16 http://www.insofe.edu.in

• Assign each item to its own cluster, so that if

The best place for students to learn Applied Engineering 17 http://www.insofe.edu.in

The best place for students to learn Applied Engineering 18 http://www.insofe.edu.in

Rojas et al 2005, SIOX: Simple Interactive Object Extraction in

• Works well in practical situations like image

(2,1,3) (2,4,5) (6,1,4)

Rojas et al 2005, SIOX: Simple Interactive Object Extraction

The best place for students to learn Applied Engineering 24 http://www.insofe.edu.in

The best place for students to learn Applied Engineering 25 http://www.insofe.edu.in

1. Randomly choose k data points (seeds) to be the

The best place for students to learn Applied Engineering 33 http://www.insofe.edu.in

– j C is the jth cluster, mj is the centroid (mean) of

The best place for students to learn Applied Engineering 35 http://www.insofe.edu.in

• The k-means algorithm is sensitive to outliers !

The best place for students to learn Applied Engineering 36 http://www.insofe.edu.in

The best place for students to learn Applied Engineering 37 http://www.insofe.edu.in

The best place for students to learn Applied Engineering 38 http://www.insofe.edu.in

Sample data where single-link (i.e. heirarchical) does

Sample data where k-means does better than single

The best place for students to learn Applied Engineering 39 http://www.insofe.edu.in

• Single link: smallest distance between an element in one cluster

• Complete link: largest distance between an element in one cluster

• Average: average distance between an element in one cluster and

• Centroid: distance between the centroids of two clusters,

• Medoid: distance between the medoids of two clusters, i.e., dis(K i,

• Radius: square root of average distance from any point of the

– Sometimes changed to max of distance between all points to centroid

• Diameter: square root of average mean squared distance between

– Sometimes changed to max of the distance between all pairs of points in a

The best place for students to learn Applied Engineering 42 http://www.insofe.edu.in

• To check the stability of the

The best place for students to learn Applied Engineering 44 http://www.insofe.edu.in

Data points of individuals of a

The best place for students to learn Applied Engineering 45 http://www.insofe.edu.in

Clustering the data points into T-

The best place for students to learn Applied Engineering 46 http://www.insofe.edu.in

The best place for students to learn Applied Engineering 47 http://www.insofe.edu.in

The best place for students to learn Applied Engineering 48 http://www.insofe.edu.in

The best place for students to learn Applied Engineering 49 http://www.insofe.edu.in

The best place for students to learn Applied Engineering 50 http://www.insofe.edu.in

The best place for students to learn Applied Engineering 51 http://www.insofe.edu.in

For example, in above

The best place for students to learn Applied Engineering 52 http://www.insofe.edu.in

• Hamming distance function: Simple Matching

The best place for students to learn Applied Engineering 53 http://www.insofe.edu.in

• Asymmetric: if one of the states is more

– We can have some variations, adding weights

– gender is a symmetric attribute, hence ignored