
Lecture on CLUSTERING

By
Er. Pramod Nath
Assistant Professor
Computer Science & Engineering Department
Mohammad Ali Jauhar University, Rampur, U.P.
er.pramodnath@gmail.com

What is Cluster Analysis?

Cluster: a collection of data objects
Similar to one another within the same cluster.
Dissimilar to the objects in other clusters.

Cluster analysis
Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters.

Unsupervised learning: no predefined classes.

Examples of Clustering Applications

Marketing: Help marketers discover distinct groups of their customers, and then use this knowledge to develop targeted marketing programs.

Land use: Identification of areas of similar land use in an earth observation database.

Insurance: Identifying groups of insurance policy holders with a high average claim cost.

City planning: Identifying groups of houses according to their house type, value, and geographical location.

Earthquake studies: Observed earthquake epicenters can be clustered, e.g., along fault lines.

Measure the Quality of Clustering

A good clustering method will produce high-quality clusters with
high intra-class similarity
low inter-class similarity

Dissimilarity/Similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j).

There is a separate quality function that measures the "goodness" of a cluster.

The definitions of distance functions are usually very different for interval-scaled, Boolean, categorical, ordinal, ratio, and mixed variables.

Data Structures

Data matrix (two modes):

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

Dissimilarity matrix (one mode):

$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$
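To make the two structures concrete, here is a minimal Python/NumPy sketch (the data values and function name are my own, purely for illustration) that turns an n x p data matrix into an n x n dissimilarity matrix using Euclidean distance:

```python
import numpy as np

# Hypothetical data matrix: n = 4 objects described by p = 2 attributes.
X = np.array([[1.0, 2.0],
              [1.5, 1.8],
              [8.0, 8.0],
              [9.0, 9.5]])

def dissimilarity_matrix(X):
    """Return the n x n matrix of pairwise Euclidean distances d(i, j)."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i):          # only the lower triangle is needed (one mode)
            D[i, j] = np.linalg.norm(X[i] - X[j])
            D[j, i] = D[i, j]       # the matrix is symmetric
    return D

print(dissimilarity_matrix(X))
```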

Categorization of Major Clustering Methods

Partitioning Methods
Randomly select k objects as the initial cluster centers and perform an iterative process, using a distance criterion, to discover k clusters of the dataset, e.g., k-means, k-medoids.
Each cluster must contain at least one object.
Every object should belong to exactly one cluster.
These methods are often efficient and work well on small to medium-sized datasets.
They produce locally optimal, typically spherical clusters, and take k as an input.

Hierarchical Methods
The agglomerative approach, such as AGNES, starts with each object forming a separate cluster and successively merges the objects or clusters that are closest to each other into a bigger cluster, until all the objects are merged into one cluster or a termination criterion is satisfied.
The divisive approach, such as DIANA, is the reverse of the agglomerative approach.
Expensive in terms of their computational and storage requirements.
Suffer from greediness: once a merge or split is committed, it cannot be undone.

Typical Alternatives to Calculate the Distance between Clusters

Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min dis(tip, tjq) over pairs tip in Ki, tjq in Kj.

Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max dis(tip, tjq).

Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg dis(tip, tjq).

Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj).

Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj).
A medoid is one chosen, centrally located object in the cluster.
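The following NumPy sketch (illustrative only; the two example clusters and all function names are mine) computes these inter-cluster distances for two small clusters of points:

```python
import numpy as np

def pairwise_dists(A, B):
    """All distances dis(t_ip, t_jq) between points of cluster A and cluster B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def single_link(A, B):   return pairwise_dists(A, B).min()
def complete_link(A, B): return pairwise_dists(A, B).max()
def average_link(A, B):  return pairwise_dists(A, B).mean()
def centroid_link(A, B): return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

def medoid(A):
    """Medoid: the object of A with the smallest total distance to the others."""
    return A[pairwise_dists(A, A).sum(axis=1).argmin()]

def medoid_link(A, B):   return np.linalg.norm(medoid(A) - medoid(B))

# Hypothetical clusters Ki and Kj
Ki = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]])
Kj = np.array([[6.0, 6.0], [7.0, 7.5]])
print(single_link(Ki, Kj), complete_link(Ki, Kj), average_link(Ki, Kj))
print(centroid_link(Ki, Kj), medoid_link(Ki, Kj))
```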

Centroid, Radius and Diameter of a Cluster (for numerical data sets)

Centroid: the "middle" of a cluster:
$$ C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N} $$

Radius: square root of the average distance from any point of the cluster to its centroid:
$$ R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}} $$

Diameter: square root of the average squared distance between all pairs of points in the cluster:
$$ D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}} $$
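A minimal NumPy sketch of these three statistics (the sample points and the helper name cluster_stats are invented for illustration):

```python
import numpy as np

def cluster_stats(points):
    """Centroid, radius and diameter of a cluster given as an (N, p) array."""
    N = len(points)
    centroid = points.mean(axis=0)
    # Radius: sqrt of the average squared distance to the centroid
    radius = np.sqrt(((points - centroid) ** 2).sum(axis=1).mean())
    # Diameter: sqrt of the sum of squared distances over all ordered pairs, divided by N(N-1)
    diffs = points[:, None, :] - points[None, :, :]
    diameter = np.sqrt((diffs ** 2).sum(axis=-1).sum() / (N * (N - 1)))
    return centroid, radius, diameter

pts = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 1.0]])   # hypothetical cluster
print(cluster_stats(pts))
```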

Partitioning Algorithms: Basic Concept

Partitioning method: construct a partition of a database D of n objects into a set of k clusters that minimizes the sum of squared distances:
$$ E = \sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2 $$

Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion.
Heuristic methods: the k-means and k-medoids algorithms.
k-means (MacQueen, 1967): each cluster is represented by the center (mean) of the cluster.
k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987): each cluster is represented by one of its objects, located near the center of the cluster.
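As an illustration of the criterion above, this short NumPy sketch (hypothetical data, labels, and function name) evaluates the sum of squared errors E for a given assignment of objects to clusters:

```python
import numpy as np

def sse(X, labels):
    """Sum of squared distances of each object to the center of its cluster."""
    total = 0.0
    for m in np.unique(labels):
        members = X[labels == m]
        center = members.mean(axis=0)              # C_m
        total += ((members - center) ** 2).sum()   # sum over t_mi in K_m
    return total

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.0]])
labels = np.array([0, 0, 1, 1])    # hypothetical partition with k = 2
print(sse(X, labels))
```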

The K-Means Clustering Method

Example (K = 2):
1. Arbitrarily choose K objects as the initial cluster centers.
2. Assign each object to the most similar (closest) center.
3. Update the cluster means.
4. Reassign objects to the nearest center and update the cluster means again; repeat until the assignments no longer change.

[Figure: scatter plots of the objects showing the initial centers, the first assignment, the updated means, and the reassignment steps.]
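A compact k-means sketch in plain NumPy, following the loop illustrated above (the data, k, and function name are illustrative and not the lecturer's code):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic k-means: returns cluster labels and the final cluster means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # arbitrary initial centers
    for _ in range(max_iter):
        # Assign each object to the most similar (closest) center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update the cluster means (keep the old center if a cluster goes empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # stop when nothing moves
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])
labels, centers = kmeans(X, k=2)
print(centers)
```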

Comments on the K-Means Method

Strength: Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.

Weaknesses
Applicable only when a mean is defined; then what about categorical data?
Often terminates at a local optimum.
Need to specify k, the number of clusters, in advance.
Unable to handle noisy data and outliers.
Not suitable for discovering clusters with arbitrary shapes.

What Is the Problem of the K-Means Method?

The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.

K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.

[Figure: two scatter plots contrasting a cluster mean pulled away by an outlier with a medoid that stays inside the cluster.]

A Typical K-medoids Algorithm (PAM)

Example (K = 2):
1. Arbitrarily choose k objects as the initial medoids.
2. Assign each remaining object to the nearest medoid (in the illustration, total cost = 20).
3. Randomly select a non-medoid object, O_random.
4. Compute the total cost of swapping a medoid O with O_random (in the illustration, total cost = 26).
5. If the quality is improved, perform the swap.
6. Repeat the loop until no change.

[Figure: scatter plots of the objects showing the initial medoids, the assignment of the remaining objects, and a candidate swap.]
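A simplified PAM-style sketch in NumPy (illustrative only: it tries every possible swap rather than randomly selecting a non-medoid, and the data and function names are invented):

```python
import numpy as np

def total_cost(D, medoids):
    """Sum of distances from every object to its nearest medoid (D is the distance matrix)."""
    return D[:, medoids].min(axis=1).sum()

def pam(D, k, max_iter=100, seed=0):
    """Basic PAM: keep swapping medoids with non-medoids while the total cost improves."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = list(rng.choice(n, size=k, replace=False))   # arbitrary initial medoids
    for _ in range(max_iter):
        improved = False
        for m in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[m] = o                     # swap medoid m with non-medoid o
                if total_cost(D, candidate) < total_cost(D, medoids):
                    medoids, improved = candidate, True
        if not improved:                             # stop when no swap helps
            break
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

# Hypothetical points and their pairwise distance matrix
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [8.0, 8.0], [8.5, 9.0], [25.0, 25.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(pam(D, k=2))
```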

Hierarchical Clustering

A hierarchical clustering method works by grouping data objects into a tree of clusters.

It uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.

[Figure: objects a, b, c, d, e. The agglomerative approach (AGNES) proceeds from Step 0 to Step 4, merging a and b into (ab), d and e into (de), then c with (de) into (cde), and finally (ab) with (cde) into (abcde). The divisive approach (DIANA) runs the same tree in reverse, from Step 4 to Step 0.]

Hierarchical Clustering Methods

The partition of the data is not done in a single step.
There are two varieties of hierarchical clustering algorithms:

Agglomerative: successively fuses the data into groups.

Divisive: separates the data successively into finer groups.

Dendrogram: Shows How the Clusters are Merged

Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster.
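For instance, with SciPy's hierarchical-clustering utilities (the small dataset here is hypothetical), the dendrogram can be cut at a chosen distance to obtain flat clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.2, 5.1], [9.0, 1.0]])
Z = linkage(pdist(X), method='single')               # build the merge tree (dendrogram)
labels = fcluster(Z, t=2.0, criterion='distance')    # cut the dendrogram at height 2.0
print(labels)   # each connected component below the cut becomes one cluster
```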

AGNES (Agglomerative Nesting)

Uses the single-link method and the dissimilarity matrix.
Merges the nodes that have the least dissimilarity.
Successively merges clusters to form larger clusters.
Finally, all nodes belong to the same cluster.

[Figure: three scatter plots showing the objects being merged into progressively larger clusters.]

Dendrogram

A hierarchic grouping can be represented by a two-dimensional diagram known as a dendrogram.

[Figure: dendrogram of objects 1-5 with the fusion distance (1.0-5.0) on the vertical axis.]

Distance
Single Linkage
Complete Linkage
Group Average Linkage
Centroid Linkage
Median Linkage

Single Linkage

Also known as the nearest neighbor technique.
The distance between groups is defined as that of the closest pair of data points, where only pairs consisting of one record from each group are considered.

Worked example. Initial dissimilarity matrix for objects 1-5:

        1     2     3     4     5
1      0.0
2      2.0   0.0
3      6.0   5.0   0.0
4     10.0   9.0   4.0   0.0
5      9.0   8.0   5.0   3.0   0.0

Step 1: the smallest entry is d(1,2) = 2.0, so merge objects 1 and 2 into cluster (12). Under single linkage the new distances are the minima of the old ones:

        (12)    3     4     5
(12)    0.0
3       5.0    0.0
4       9.0    4.0   0.0
5       8.0    5.0   3.0   0.0

Step 2: the smallest entry is d(4,5) = 3.0, so merge 4 and 5 into (45):

        (12)    3    (45)
(12)    0.0
3       5.0    0.0
(45)    8.0    4.0   0.0

Step 3: the smallest entry is d(3,(45)) = 4.0, so merge 3 and (45) into (345):

        (12)   (345)
(12)    0.0
(345)   5.0    0.0

Step 4: finally merge (12) and (345) at distance 5.0, so all objects are in one cluster.

[Dendrogram: objects 1 and 2 join at height 2.0, objects 4 and 5 at 3.0, object 3 joins (45) at 4.0, and the two remaining clusters join at 5.0.]
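The same single-linkage merge sequence can be reproduced with SciPy (the 5x5 matrix below is the example matrix from the slides; squareform converts it to the condensed form that linkage expects):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Example dissimilarity matrix for objects 1-5 (symmetric, zero diagonal)
D = np.array([[ 0.0,  2.0,  6.0, 10.0,  9.0],
              [ 2.0,  0.0,  5.0,  9.0,  8.0],
              [ 6.0,  5.0,  0.0,  4.0,  5.0],
              [10.0,  9.0,  4.0,  0.0,  3.0],
              [ 9.0,  8.0,  5.0,  3.0,  0.0]])

Z = linkage(squareform(D), method='single')
# Each row of Z is one merge: [cluster_i, cluster_j, fusion_distance, new_cluster_size]
print(Z)   # fusion distances come out as 2.0, 3.0, 4.0, 5.0, matching the worked example
```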

Distance
Single Linkage
Complete Linkage
Group Average Linkage
Centroid Linkage
Median Linkage

Complete Linkage

The distance between two clusters is given by the distance between their most distant members.

Worked example. Initial dissimilarity matrix for objects 1-5 (the same as before):

        1     2     3     4     5
1      0.0
2      2.0   0.0
3      6.0   5.0   0.0
4     10.0   9.0   4.0   0.0
5      9.0   8.0   5.0   3.0   0.0

Step 1: the smallest entry is d(1,2) = 2.0, so merge 1 and 2 into (12). Under complete linkage the new distances are the maxima of the old ones:

        (12)    3     4     5
(12)    0.0
3       6.0    0.0
4      10.0    4.0   0.0
5       9.0    5.0   3.0   0.0

Step 2: the smallest entry is d(4,5) = 3.0, so merge 4 and 5 into (45):

        (12)    3    (45)
(12)    0.0
3       6.0    0.0
(45)   10.0    5.0   0.0

Step 3: the smallest entry is d(3,(45)) = 5.0, so merge 3 and (45) into (345):

        (12)   (345)
(12)    0.0
(345)  10.0    0.0

Step 4: finally merge (12) and (345) at distance 10.0.

[Dendrogram: objects 1 and 2 join at height 2.0, objects 4 and 5 at 3.0, object 3 joins (45) at 5.0, and the two remaining clusters join at 10.0.]
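Switching the SciPy call from the previous sketch to method='complete' on the same example matrix reproduces these fusion levels (2.0, 3.0, 5.0, 10.0):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([[ 0.0,  2.0,  6.0, 10.0,  9.0],
              [ 2.0,  0.0,  5.0,  9.0,  8.0],
              [ 6.0,  5.0,  0.0,  4.0,  5.0],
              [10.0,  9.0,  4.0,  0.0,  3.0],
              [ 9.0,  8.0,  5.0,  3.0,  0.0]])

print(linkage(squareform(D), method='complete'))   # fusion distances: 2.0, 3.0, 5.0, 10.0
```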

Distance
Single Linkage
Complete Linkage
Group Average Linkage
Centroid Linkage
Median Linkage

Group Average Clustering

The distance between two clusters is defined as the average of the distances between all pairs of records (one from each cluster).
For example, with cluster A = {1, 2} and cluster B = {3, 4, 5}:
dAB = 1/6 (d13 + d14 + d15 + d23 + d24 + d25)

[Figure: two clusters A = {1, 2} and B = {3, 4, 5} with all pairwise links between them.]
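A tiny sketch of the group-average distance, reusing the example dissimilarity matrix from the single-linkage slides (the cluster membership and function name are illustrative):

```python
import numpy as np

# Example dissimilarity matrix for objects 1-5 (0-based indices 0-4)
D = np.array([[ 0.0,  2.0,  6.0, 10.0,  9.0],
              [ 2.0,  0.0,  5.0,  9.0,  8.0],
              [ 6.0,  5.0,  0.0,  4.0,  5.0],
              [10.0,  9.0,  4.0,  0.0,  3.0],
              [ 9.0,  8.0,  5.0,  3.0,  0.0]])

def group_average(D, A, B):
    """Average of d(i, j) over all pairs with i in cluster A and j in cluster B."""
    return np.mean([D[i, j] for i in A for j in B])

print(group_average(D, A=[0, 1], B=[2, 3, 4]))   # d_AB = (6+10+9+5+9+8)/6
```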

Distance
Single Linkage
Complete Linkage
Group Average Linkage
Centroid Linkage
Median Linkage

Centroid Clustering

The distance between two clusters is defined as the distance between the mean vectors of the two clusters:
dAB = d(a, b)
where a is the mean vector of cluster A and b is the mean vector of cluster B.

[Figure: two clusters A and B with their mean vectors a and b.]

Distance
Single Linkage
Complete Linkage
Group Average Linkage
Centroid Linkage
Median Linkage

Median Clustering

Disadvantage of centroid clustering: when a large cluster is merged with a small one, the centroid of the combined cluster is close to that of the large one, i.e. the characteristic properties of the small one are lost.
In median clustering, after we have combined two groups, the mid-point of the original two cluster centres is used as the centre of the newly combined group.

[Figure: clusters A and B with their centres.]

DIANA (Divisive Analysis)

The inverse order of AGNES.
Eventually each node forms a cluster on its own.

[Figure: three scatter plots showing one cluster being split into progressively smaller clusters.]

Divisive Methods

In a divisive algorithm, we start with the assumption that all the data is part of one cluster.
We then use a distance criterion to divide the cluster in two, and then subdivide the clusters until a stopping criterion is achieved.
Polythetic: divides the data based on the values of all attributes.
Monothetic: divides the data on the basis of the possession of a single specified attribute.

Polythetic Approach
Distance
Single Linkage
Complete Linkage
Group Average Linkage
Centroid Linkage
Median Linkage

Polythetic Approach

Worked example. Dissimilarity matrix for objects 1-7:

     1    2    3    4    5    6    7
1    0
2   10    0
3    7    7    0
4   30   23   21    0
5   29   25   22    7    0
6   38   34   31   10   11    0
7   42   36   36   13   17    9    0

Step 1: compute the average dissimilarity of each object to all the others:
D(1, *) = 26.0, D(2, *) = 22.5, D(3, *) = 20.7, D(4, *) = 17.3, D(5, *) = 18.5, D(6, *) = 22.2, D(7, *) = 25.5
Object 1 has the largest average dissimilarity, so it starts the splinter group: A = {1}, B = {2, 3, 4, 5, 6, 7}.

Step 2: for each object in B, compare its average dissimilarity to A with its average dissimilarity to the rest of B:

Object i   D(i, A)   D(i, B)   D(i, B) - D(i, A)
2          10.0      25.0       15.0
3           7.0      23.4       16.4
4          30.0      14.8      -15.2
5          29.0      16.4      -12.6
6          38.0      19.0      -19.0
7          42.0      22.2      -19.8

Object 3 has the largest positive difference, so it moves to the splinter group: A = {1, 3}, B = {2, 4, 5, 6, 7}.

Step 3: recompute the averages:

Object i   D(i, A)   D(i, B)   D(i, B) - D(i, A)
2           8.5      29.5       21.0
4          25.5      13.2      -12.3
5          25.5      15.0      -10.5
6          34.5      16.0      -18.5
7          39.0      18.9      -20.3

Object 2 has the only positive difference, so it moves: A = {1, 3, 2}, B = {4, 5, 6, 7}.

Step 4: recompute the averages once more:

Object i   D(i, A)   D(i, B)   D(i, B) - D(i, A)
4          24.7      10.0      -14.7
5          25.3      11.7      -13.6
6          34.3      10.0      -24.3
7          38.0      13.0      -25.0

All differences are negative, so no further object moves and the split is A = {1, 2, 3}, B = {4, 5, 6, 7}.
The process would then continue on each subgroup separately.
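A short Python sketch of this splinter-group procedure (my own illustrative implementation of the steps above, run on the slides' 7x7 matrix; indices in the code are 0-based):

```python
import numpy as np

D = np.array([
    [ 0, 10,  7, 30, 29, 38, 42],
    [10,  0,  7, 23, 25, 34, 36],
    [ 7,  7,  0, 21, 22, 31, 36],
    [30, 23, 21,  0,  7, 10, 13],
    [29, 25, 22,  7,  0, 11, 17],
    [38, 34, 31, 10, 11,  0,  9],
    [42, 36, 36, 13, 17,  9,  0]], dtype=float)

def splinter_split(D):
    """Split one cluster in two by growing a splinter group (polythetic divisive step)."""
    n = D.shape[0]
    avg_all = D.sum(axis=1) / (n - 1)          # average dissimilarity to all other objects
    A = [int(avg_all.argmax())]                # object with the largest average starts A
    B = [i for i in range(n) if i not in A]
    while len(B) > 1:
        # difference D(i, rest of B) - D(i, A) for every object still in B
        diffs = {i: D[i, [j for j in B if j != i]].mean() - D[i, A].mean() for i in B}
        best = max(diffs, key=diffs.get)
        if diffs[best] <= 0:                   # all differences negative: stop
            break
        A.append(best)
        B.remove(best)
    return A, B

# Expected result: A = [0, 2, 1], B = [3, 4, 5, 6], i.e. objects {1, 3, 2} and {4, 5, 6, 7}
print(splinter_split(D))
```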

Monothetic

Monothetic division is usually used when the data consists of binary variables.

Example data (5 objects, 3 binary attributes):

          A   B   C
Object 1  0   1   1
Object 2  1   1   0
Object 3  1   1   1
Object 4  1   1   0
Object 5  0   0   1

Chi-Square Measure

For a pair of attributes, build the 2x2 contingency table over the objects, where a counts objects with both attributes present, d counts objects with both absent, b counts objects with only the second attribute present, and c counts objects with only the first present. The association between the two attributes is then

$$ \chi^2_{AB} = \frac{(ad - bc)^2\, N}{(a+b)(a+c)(b+d)(c+d)} $$

For attributes A and B: a = 3, b = 1, c = 0, d = 1, N = 5, so

$$ \chi^2_{AB} = \frac{(3 \cdot 1 - 1 \cdot 0)^2 \cdot 5}{4 \cdot 3 \cdot 2 \cdot 1} = 1.875 $$

Doing the same for every pair of attributes:

Pair   AB    AC    BC
a       3     1     2
b       1     2     1
c       0     2     2
d       1     0     0
N       5     5     5
χ²    1.87  2.22  0.83

Summing the chi-square values involving each attribute:
For attribute A: χ²(AB) + χ²(AC) = 4.09
For attribute B: χ²(AB) + χ²(BC) = 2.70
For attribute C: χ²(AC) + χ²(BC) = 3.05

Attribute A has the largest total, so we choose attribute A for dividing the data into two groups: {2, 3, 4} (A = 1) and {1, 5} (A = 0).
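A small sketch that reproduces this attribute choice (the data matrix is the example above; the variable and function names are mine):

```python
import numpy as np
from itertools import combinations

# Rows = objects 1-5, columns = binary attributes A, B, C
X = np.array([[0, 1, 1],
              [1, 1, 0],
              [1, 1, 1],
              [1, 1, 0],
              [0, 0, 1]])
names = ['A', 'B', 'C']

def chi_square(x, y):
    """Chi-square association between two binary attribute vectors."""
    a = np.sum((x == 1) & (y == 1))
    b = np.sum((x == 0) & (y == 1))
    c = np.sum((x == 1) & (y == 0))
    d = np.sum((x == 0) & (y == 0))
    N = len(x)
    return (a * d - b * c) ** 2 * N / ((a + b) * (a + c) * (b + d) * (c + d))

# Sum, for each attribute, the chi-square values of the pairs it belongs to
totals = {n: 0.0 for n in names}
for (i, p), (j, q) in combinations(enumerate(names), 2):
    chi = chi_square(X[:, i], X[:, j])
    totals[p] += chi
    totals[q] += chi

best = max(totals, key=totals.get)        # A: 4.09, B: 2.70, C: 3.05 -> choose A
print(totals, '-> split on', best)
print('groups:', np.where(X[:, names.index(best)] == 1)[0] + 1,
      np.where(X[:, names.index(best)] == 0)[0] + 1)
```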

Thank You
