
# Inspire…Educate…Transform.

Machine Learning

Clustering

Lt Suryaprakash Kompalli
Senior Mentor, INSOFE

## The best place for students to learn Applied Engineering http://www.insofe.edu.in

Unsupervised learning
• Supervised: data and target
• Unsupervised: data only; no target labels are provided


Clustering
• Finding similarity groups in data, called clusters. That is,
  – data instances that are similar to (near) each other are in the same cluster
  – data instances that are very different from (far away from) each other fall in different clusters


A few clustering applications

• Targeted marketing
  – It is not uncommon to have over 100,000 segments in insurance clustering
• Document organization: given a collection of text documents, organize them according to their content similarities
• Blind signal separation (e.g., separating two speakers)


Algorithms
• Hierarchical approach: create a hierarchical decomposition of the set of data (or objects) using some criterion (e.g., Ward's method)
• Partitioning approach: construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors (k-means, spectral clustering)
• Model-based methods: a model is hypothesized for each of the clusters, and the algorithm finds the best fit of the data to those models (EM)
HIERARCHICAL
(AGGLOMERATIVE)
CLUSTERING


Example of agglomerative clustering

At each iteration, pick the two points (or clusters) with the least distance between them and merge them into a single cluster.


Note how we update the distances between the other clusters: the lower distance is kept (single linkage). The distance from BOS to DC was 429 and is now set to 233; the distance from BOS to MIA was 1504 and is now set to 1308.

Averaging may also be used instead of taking the distance to the closest point (average linkage).


Note how the creation of the SF/LA cluster has changed the distances.


Agglomerative clustering (Hierarchical)

Typically, a particular "level" of the hierarchy is selected to be your clustering result. The highlighted clusters divide the airports into North-East, Central, South, and Pacific areas.
Agglomerative clustering (Hierarchical)

[Dendrogram: the airports merge bottom-up: BOS/NY, then DC and CHI; SF/LA, then SEA and DEN; MIA joins last, giving clusters such as BOS/NY/DC/CHI/MIA and DEN/SF/LA/SEA, with "all airports" at the root.]

The dendrogram represents a nested partitioning. A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster. The highlighted clusters divide the airports into North-East, Central, South, and Pacific areas.

Agglomerative clustering (Hierarchical)

1. Assign each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item.
2. Merge the two most similar clusters into a single cluster, so that you now have one less cluster.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
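The steps above can be sketched with SciPy's hierarchical-clustering routines. The toy 2-D points below are illustrative, not the slide's airport distance table:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five toy 2-D points: two near the origin, two near (5, 5), one outlier.
points = np.array([[0.0, 0.0], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9],
                   [9.0, 0.0]])

# 'single' linkage merges the two closest clusters at each step,
# exactly as in the example above.
Z = linkage(points, method="single")

# Cut the dendrogram so that exactly 2 clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # the first two points share one label; the rest share another
```

Changing `method` to `"average"` gives the averaging variant mentioned above.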


Agglomerative clustering (Hierarchical)
(Cluster Distance Measures)



KD-Tree Clustering

Rojas et al. 2005, SIOX: Simple Interactive Object Extraction in Still Images
KD-Tree Clustering

• Works well in practical situations like image segmentation
• The KD-tree is built with one feature at each level:
  1) Example 3-D data (x1, x2, x3): (3, 1, 4), (2, 3, 7), (4, 3, 4), (2, 1, 3), (2, 4, 5), (6, 1, 4), (5, 2, 5), (1, 4, 4), (0, 5, 7), (4, 0, 6), (7, 1, 6)
  2) Select one feature (say x1) and find the mean in the x1 dimension, i.e., (sum of x1 values) / (number of elements):
     a) (3+2+4+2+2+6+5+1+0+4+7)/11 = 3.27 ≈ 3
  3) Select the first item having x1 value 3 as the ROOT element.
  4) Use the ROOT to classify the remaining elements into the left subtree (x1 <= 3) or the right subtree (x1 > 3).
  5) Using the elements in the left and right subtrees, repeat from step 2 with the next feature, until only a few items are left in each subtree.
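The construction above can be sketched as follows. The function and field names are our own, not from the SIOX paper, and "only a few items" is taken to mean a leaf size of 2:

```python
# Minimal KD-tree build: split on one feature per level, using the
# (rounded) mean on that feature as the pivot, as described above.
def build_kdtree(points, depth=0, leaf_size=2):
    if len(points) <= leaf_size:
        return {"leaf": points}
    axis = depth % len(points[0])  # cycle x1 -> x2 -> x3 -> x1 ...
    mean = round(sum(p[axis] for p in points) / len(points))
    # Take the first point whose value on this axis equals the rounded mean,
    # falling back to the point closest to the mean if none matches exactly.
    root = next((p for p in points if p[axis] == mean),
                min(points, key=lambda p: abs(p[axis] - mean)))
    rest = [p for p in points if p is not root]
    return {"root": root, "axis": axis,
            "left": build_kdtree([p for p in rest if p[axis] <= root[axis]],
                                 depth + 1, leaf_size),
            "right": build_kdtree([p for p in rest if p[axis] > root[axis]],
                                  depth + 1, leaf_size)}

data = [(3, 1, 4), (2, 3, 7), (4, 3, 4), (2, 1, 3), (2, 4, 5), (6, 1, 4),
        (5, 2, 5), (1, 4, 4), (0, 5, 7), (4, 0, 6), (7, 1, 6)]
tree = build_kdtree(data)
print(tree["root"])  # (3, 1, 4): the x1 mean 3.27 rounds to 3
```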
KD-Tree Clustering

[Tree diagram: root (3,1,4) splits on x; (2,3,7) and (4,3,4) split on y; (2,1,3), (2,4,5), and (6,1,4) split on z; leaves include (1,4,4), (0,5,7), (5,2,5), (4,0,6), and (7,1,6).]

Data: (3, 1, 4), (2, 3, 7), (4, 3, 4), (2, 1, 3), (2, 4, 5), (6, 1, 4), (5, 2, 5), (1, 4, 4), (0, 5, 7), (4, 0, 6), (7, 1, 6)
KD-Tree Clustering

From left to right: original image; red: region of interest selected by the user; green: known foreground object; intermediate result; final result.

A color signature is a set of representative colors, not necessarily a subset of the input colors.

The set of background samples from an image typically consists of a few hundred thousand colors. Clustering reduces the background sample to its representative colors, usually a few hundred.

Rojas et al. 2005, SIOX: Simple Interactive Object Extraction in Still Images
Partitioning algorithms
K-MEANS AND K-MEDOIDS
K-means clustering
• K-means is a partitional clustering algorithm: it partitions the given data into k clusters.
  – Each cluster has a cluster center, called the centroid.
  – k is specified by the user.


Pictures courtesy of Andrew Ng's course on Coursera


K-means algorithm
• Given k, the k-means algorithm works as follows:
  1. Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
  2. Assign each data point to the closest centroid.
  3. Re-compute the centroids using the current cluster memberships.
  4. If a convergence criterion is not met, or if some clusters did not get any points, go to step 2.
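The four steps above can be sketched in NumPy. This is a minimal illustration, not production code; the only empty-cluster handling shown is keeping the old centroid:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1
    for _ in range(n_iter):
        # Step 2: assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute centroids from the current memberships
        # (keep the old centroid if a cluster lost all its points).
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
labels, centroids = kmeans(X, k=2)
print(labels)  # the first two points form one cluster, the last two the other
```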


Stopping/convergence criterion
1. No (or minimal) re-assignment of data points to different clusters,
2. no (or minimal) change of the centroids, or
3. a minimal decrease in the sum of squared errors (SSE):

   SSE = Σ_{j=1}^{k} Σ_{x ∈ C_j} dist(x, m_j)²   (1)

   where C_j is the jth cluster and m_j is the centroid (mean) of cluster C_j.

To pick k, check scree plots:
http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters
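The SSE of Eq. (1) can be tracked against k to build the scree (elbow) plot mentioned above. A minimal sketch using scikit-learn, whose `inertia_` attribute is exactly this SSE; the three synthetic blobs are our own example:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs, so the "elbow" should appear at k = 3.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

sse = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_  # sum of squared distances to the closest centroid
print(sse)  # the SSE drops sharply up to k = 3, then flattens
```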
Local optima


What Is the Problem with K-Means?

• The k-means algorithm is sensitive to outliers!
• K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in the cluster.
  – In some problems, the cluster center needs to be a member of the dataset.
  – Examples: graph similarity, gene-sequence matching, image matching.


What Is the Problem with Medoids?
• More robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.
• Works efficiently for small data sets but does not scale well to large data sets:
  – In regular k-means, the new mean is merely the average of the values, giving O(nkt) overall.
  – In k-medoids, if a particular cluster has m data points, you need O(m²) computations to determine the medoid for that cluster.
  – Each iteration costs O(k(n−k)²),
  where n is the number of data points, k is the number of clusters, and t is the number of iterations.


K-means versus Hierarchical
• K-means produces a single partitioning; hierarchical clustering can give different partitions depending on the level of resolution we are looking at.
• K-means needs the number of clusters to be specified; hierarchical clustering does not.
• K-means is usually more efficient run-time wise; hierarchical clustering can be slow (it has to make several merge/split decisions).


Comparison of Clustering

Sample data where single-link (i.e., hierarchical) clustering does better than k-means.


ANALYZING YOUR
CLUSTERS
Using Distance between Clusters

• Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
• Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
• Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
• Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
• Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
  – Medoid: one chosen, centrally located object in the cluster
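The first four measures above can be computed directly from the pairwise-distance matrix of two clusters. A small sketch with two made-up 1-D-ish clusters:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small clusters on a line; the pairwise distances are 4, 6, 3, and 5.
Ki = np.array([[0.0, 0.0], [1.0, 0.0]])
Kj = np.array([[4.0, 0.0], [6.0, 0.0]])

D = cdist(Ki, Kj)                                   # all pairwise distances
single   = D.min()                                  # 3.0: closest pair
complete = D.max()                                  # 6.0: farthest pair
average  = D.mean()                                 # 4.5: mean of all pairs
centroid = np.linalg.norm(Ki.mean(0) - Kj.mean(0))  # 4.5: between centroids
print(single, complete, average, centroid)
```

Note that the centroid distance coincides with the average distance here only by accident of this toy data; in general the two differ.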
Centroid, Radius & Diameter of a Cluster
(for numerical data sets)
• Centroid: the "middle" of a cluster:
  C_m = (Σ_{i=1}^{N} X_i) / N
• Radius: square root of the average squared distance from the points of the cluster to its centroid:
  R_m = sqrt( Σ_{i=1}^{N} (X_i − C_m)² / N )
• Diameter: square root of the average squared distance between all pairs of points in the cluster:
  D_m = sqrt( Σ_{i=1}^{N} Σ_{q=1}^{N} (X_i − X_q)² / (N(N − 1)) )

Stability Check of Clusters

• To check the stability of the clusters, take a random sample of 95% of the records and compute the clusters. If the clusters formed are very similar to the original, then the clusters are fine.
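The check above can be sketched as follows. The slide does not prescribe a similarity measure between the two labelings; using the adjusted Rand index here is our choice:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs.
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in ([0, 0], [5, 5])])

# Cluster the full data, then a random 95% subsample.
full = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
idx = rng.choice(len(X), size=int(0.95 * len(X)), replace=False)
sub = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X[idx])

# Compare the subsample labels with the original labels on the same rows;
# the adjusted Rand index is 1.0 for identical partitions.
score = adjusted_rand_score(full[idx], sub)
print(score)  # close to 1.0 for stable, well-separated clusters
```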
Linearly clustered data

[Figure: well-separated clusters, a nice case for clustering.]

Linearly separable but merged

Data points of individuals of a particular height and weight


Linearly separable but merged

Clustering the data points into T-shirt sizes. Top: three sizes; right: five sizes.


Linearly separable
• Run 50-500 simulations for small k (2-10). For large k (100 or so), we can do 1-5 simulations.
• Pick the one that gives the best SSE.


Other Topics
• Clustering large datasets
  – CLARA (Kaufman and Rousseeuw, 1990) and CLARANS (Ng and Han 1994, 2002): select a small percentage of the data and run k-means or k-medoids on it.
• Parallel and efficient implementations of k-means / k-medoids:
  http://www.math.unipd.it/~dulli/corso04/ng94efficient.pdf
  http://www.vlfeat.org/overview/kmeans.html
  http://repository.cmu.edu/cgi/viewcontent.cgi?article=2397&context=compsci
  http://www.cs.ucsb.edu/~veronika/MAE/Global_Kernel_K-Means.pdf

Bonus slides, if
time permits

UNDERSTANDING
DISTANCE

Numeric

For two points (x1, y1) and (x2, y2):
• Euclidean distance: sqrt((x2 − x1)² + (y2 − y1)²)
• Manhattan distance: |x2 − x1| + |y2 − y1|

Distance between categorical attributes
• For each categorical variable:
  – the distance is zero if two records have the same value,
  – 1 if they differ;
  – sum the total distance across all attributes.

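This mismatch-count rule is a one-liner; the example records below are our own:

```python
# Per-attribute distance: 0 if two records agree on an attribute,
# 1 if they differ, summed over all attributes.
def categorical_distance(rec_a, rec_b):
    return sum(1 for a, b in zip(rec_a, rec_b) if a != b)

r1 = ("red", "small", "round")
r2 = ("red", "large", "square")
print(categorical_distance(r1, r2))  # 2: the records differ on size and shape
```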

Categorical attributes

x = 1 1 1 0 1 0 0 1 0 1 0 1
y = 1 1 0 1 1 0 1 1 0 0 1 1

For the two feature vectors above:
a = 5 (positions where both are 1)
b = 2 (x = 1, y = 0)
c = 3 (x = 0, y = 1)
d = 2 (positions where both are 0)

Symmetric binary attributes

• Hamming distance function (Simple Matching Coefficient): proportion of mismatches of the values:

  dist(x_i, x_j) = (b + c) / (a + b + c + d)   Hamming distance
  dist(x_i, x_j) = (b + c) / (a + b + c)       Jaccard distance

• Weighted Hamming and weighted Jaccard distances multiply the counts a, b, c, d by per-attribute weights w_m.

Asymmetric binary attributes

• Asymmetric: one of the states is more important or more valuable than the other.
  – By convention, state 1 represents the more important state, which is typically the rare or infrequent state.
  – The Jaccard coefficient is a popular measure:

    dist(x_i, x_j) = (b + c) / (a + b + c)

  – We can have some variations, adding weights.
  – Asymmetric features are sometimes ignored or given less weight.
Dissimilarity between Binary Variables
• Example

  Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack  M       Y      N      P       N       N       N
  Mary  F       Y      N      P       N       P       N
  Jim   M       Y      P      N       N       N       N

  – Gender is a symmetric attribute, hence ignored.
  – The remaining attributes are asymmetric binary (having fever is more significant than not having fever).
  – Let the values Y and P be set to 1, and the value N be set to 0.

  d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33   (b = 0: no attribute where Jack is 1 and Mary is 0)
  d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
  d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
Another distance metric
used in supervised learning


Value Distance Measure


Ordinal variables
• Treated the same as numeric values in the distance computation

Look-up matrix for ordinals with 3 states

The distance between different ordinal values may not be the same! On the right, the distance between state 1 and state 2 is much less than the distance between state 1 and state 3.

It depends on what the ordinals represent. E.g., if the ordinals represent education (class 5, class 10, degree), the distance between class 5 and class 10 may be considered less than the distance between class 10 and degree.

KERNEL TRICK


Moving into higher dimensions

The decision surface is complex in the lower dimension but simple in the higher dimension.


Linear Separation of XOR

[Figure: the four XOR points (-1,-1), (1,-1), (-1,1), (1,1) plotted in the x-y plane, and the same points plotted against the product x.y.]

• XOR is not linearly separable in x,y space
• It is linearly separable in x.y space
  – The kernel here is K(x,y) = x.y

Standard Kernels


k-means vs. Kernel k-means


Performance of Kernel K-means

Evaluation of the performance of clustering algorithms in kernel-induced feature space, Pattern Recognition, 2005


Kernel Trick
• The original way is to transform each data point into a high-dimensional space and then do the computation there.
  – This can be computationally complex.
• Note that we need to compute distances in Euclidean space, or sometimes a scalar product.

## The best place for students to learn Applied Engineering 68 http://www.insofe.edu.in

Kernel trick continued
• This is where the kernel trick comes into the picture: we can choose a kernel where the transformed space is of high dimension, and yet it is easy to compute the similarity score in the original space.


International School of Engineering

Plot 63/A, 1st Floor, Road # 13, Film Nagar, Jubilee Hills, Hyderabad - 500 033

For Individuals: +91-9502334561/63 or 040-65743991

For Corporates: +91-9618483483
Web: http://www.insofe.edu.in