CLUSTERING
By
Er. Pramod Nath
Assistant Professor
Computer Science & Engineering Department
Mohammad Ali Jauhar University, Rampur, U.P.
er.pramodnath@gmail.com
Cluster analysis
Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters.
Unsupervised learning: no predefined classes.
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups of their customers, and then use this knowledge to develop targeted marketing programs.
Land use: Identification of areas of similar land use in an earth-observation database.

A good clustering method will produce clusters with
high intra-class similarity
low inter-class similarity.
Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, typically a metric d(i, j).
The definitions of distance functions are usually very different for interval-scaled, Boolean, categorical, ordinal, ratio, and mixed variables.
Data Structures
Data matrix (two modes): n objects described by p variables,

$$\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}$$
Dissimilarity matrix (one mode): pairwise distances d(i, j),

$$\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}$$
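A minimal Python sketch of how the two structures relate, assuming NumPy and Euclidean distance (the slides do not fix a particular distance function):

```python
import numpy as np

def dissimilarity_matrix(X):
    """Build the (one-mode) n x n dissimilarity matrix from an
    n x p (two-mode) data matrix, using Euclidean distance d(i, j)."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i):  # lower triangle; D is symmetric with zero diagonal
            D[i, j] = np.sqrt(np.sum((X[i] - X[j]) ** 2))
            D[j, i] = D[i, j]
    return D

X = np.array([[1.0, 2.0], [2.0, 2.0], [8.0, 9.0]])  # toy 3 x 2 data matrix
print(dissimilarity_matrix(X))
```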
Partitioning Methods
These methods are often economical and effective for clustering a dataset.
They produce a locally optimal clustering, tend to find spherical clusters, and take k as an input.
Hierarchical Methods
The agglomerative approach, such as AGNES, starts with each object forming a separate cluster and successively merges the objects or clusters that are closest to each other into bigger clusters, until all the objects are merged into one cluster or a termination condition is satisfied.
The divisive approach, such as DIANA, is the reverse of the agglomerative approach.
These methods are expensive in terms of their computational and storage requirements.
They suffer from greediness: once a merge or split is committed, it cannot be undone.
Two measures of the compactness of a cluster with N points $t_{ip}$ and centroid $c_m$:

Radius (square root of the mean squared distance of the points from the centroid):
$$R_m = \sqrt{\frac{\sum_{i=1}^{N}(t_{ip} - c_m)^2}{N}}$$

Diameter (square root of the average squared distance between all pairs of points in the cluster):
$$D_m = \sqrt{\frac{\sum_{i=1}^{N}\sum_{j=1}^{N}(t_{ip} - t_{jq})^2}{N(N-1)}}$$
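As a sanity check on these definitions, a small NumPy sketch (Euclidean distance assumed; the helper name is mine):

```python
import numpy as np

def cluster_stats(points):
    """Centroid, radius R_m, and diameter D_m of a cluster,
    following the formulas above (Euclidean distance assumed)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    centroid = pts.mean(axis=0)
    # Radius: root mean squared distance of the points to the centroid.
    radius = np.sqrt(np.mean(np.sum((pts - centroid) ** 2, axis=1)))
    # Diameter: root of the average squared distance over all ordered pairs;
    # the diagonal (i = j) contributes zero, matching the double sum above.
    diffs = pts[:, None, :] - pts[None, :, :]
    sq = np.sum(diffs ** 2, axis=-1)
    diameter = np.sqrt(sq.sum() / (n * (n - 1)))
    return centroid, radius, diameter

print(cluster_stats([[0, 0], [2, 0], [1, 2]]))
```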
Example: K-means (K = 2)
1. Arbitrarily choose K objects as the initial cluster centers.
2. Assign each object to the most similar center.
3. Update the cluster means.
4. Reassign the objects; repeat steps 2-3 until the assignments no longer change.
[Figure: a sequence of scatter plots of the same 2-D dataset showing the assignments and updated cluster means at each iteration.]
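A compact Python sketch of the loop the figure illustrates (NumPy assumed; the random initialization and Euclidean distance are conventional choices, not prescribed by the slides):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Plain k-means: choose k objects as initial centers, then
    alternate (1) assign each object to the most similar center
    and (2) update each cluster mean, until nothing moves."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Distance of every object to every center; pick the closest.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each cluster mean (keep the old center if a cluster empties).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])
print(k_means(X, k=2)[1])   # two cluster means, near (0, 0) and (6, 6)
```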
Weaknesses of K-means
Applicable only when the mean is defined; what about categorical data?
Often terminates at a local optimum.
Need to specify k, the number of clusters, in advance.
Unable to handle noisy data and outliers.
Not suitable for discovering clusters with arbitrary shapes.
The K-Medoids Method
Find representative objects, called medoids, in a cluster: instead of the mean, use the most centrally located object as the cluster representative.
Example: PAM (Partitioning Around Medoids, K = 2)
1. Arbitrarily choose k objects as the initial medoids.
2. Assign each remaining object to the nearest medoid.
3. Randomly select a non-medoid object O_random.
4. Compute the total cost of swapping a medoid O with O_random; swap them if the quality is improved.
5. Repeat the loop until no change.
[Figure: scatter plots of the iterations; the panels show total cost = 20 for one configuration and total cost = 26 for the candidate swap.]
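A sketch of the swap loop, assuming a precomputed dissimilarity matrix D (NumPy). The slides' example swaps a randomly selected non-medoid; this sketch scans all candidate swaps per pass, a deterministic variant, and is mine rather than the slides':

```python
import numpy as np

def pam(D, k, max_iter=50, seed=0):
    """PAM sketch over a precomputed dissimilarity matrix D.
    Total cost = sum of distances from every object to its
    nearest medoid; a swap is kept only if it lowers the cost."""
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = list(rng.choice(n, size=k, replace=False))

    def total_cost(meds):
        return D[:, meds].min(axis=1).sum()

    cost = total_cost(medoids)
    for _ in range(max_iter):
        best = None
        # Consider swapping every medoid with every non-medoid object.
        for m in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[m] = o
                c = total_cost(trial)
                if c < cost and (best is None or c < best[0]):
                    best = (c, trial)
        if best is None:        # no swap improves the quality: stop
            break
        cost, medoids = best
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels, cost
```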
Hierarchical Clustering
[Figure: five objects a, b, c, d, e. Read left to right for the agglomerative direction (AGNES): step 1 merges a and b into ab, and d and e into de; c then joins de to give cde (step 3); finally ab and cde merge into abcde (step 4). Read right to left for the divisive direction (DIANA).]
Hierarchical Clustering Methods
The partition of the data is not done in a single step.
There are two varieties of hierarchical clustering algorithms: agglomerative and divisive.
AGNES (Agglomerative Nesting)
Uses the single-link method and the dissimilarity matrix.
Merges the nodes that have the least dissimilarity.
Successively merges clusters to form larger clusters.
Finally, all nodes belong to the same cluster.
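In practice, an AGNES-style agglomeration can be run with SciPy (a library choice of this note, not of the slides): `linkage` consumes the condensed dissimilarity matrix and `dendrogram` draws the tree discussed below.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

X = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0], [5.5, 5.2], [5.0, 6.0]])
Z = linkage(pdist(X), method='single')   # single-link, AGNES-style merging
print(Z)   # each row: the two clusters merged, their distance, new cluster size
# dendrogram(Z)   # draws the tree when matplotlib is available
```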
Dendrogram
A hierarchic grouping can be represented by a two-dimensional diagram known as a dendrogram.
[Figure: dendrogram of objects 1-5, with the distance axis running from 1.0 to 5.0.]
Distance Between Clusters
Single Linkage
Complete Linkage
Group Average Linkage
Centroid Linkage
Median Linkage
Single Linkage
Also known as the nearest-neighbour technique.
The distance between groups is defined as that of the closest pair of data points, where only pairs consisting of one record from each group are considered.
[Figure: the single-link distance between Cluster A and Cluster B is the distance between their closest pair of points.]

Worked example. Five objects with the initial distance matrix:

        1     2     3     4     5
 1    0.0
 2    2.0   0.0
 3    6.0   5.0   0.0
 4   10.0   9.0   4.0   0.0
 5    9.0   8.0   5.0   3.0   0.0

Step 1: the smallest entry is d(1,2) = 2.0, so merge 1 and 2 into (12). Under single linkage the distance from (12) to each remaining object is the minimum over its members:

        (12)    3     4     5
 (12)   0.0
 3      5.0   0.0
 4      9.0   4.0   0.0
 5      8.0   5.0   3.0   0.0

Step 2: the smallest entry is d(4,5) = 3.0, so merge 4 and 5 into (45):

        (12)    3   (45)
 (12)   0.0
 3      5.0   0.0
 (45)   8.0   4.0   0.0

Step 3: the smallest entry is d(3,(45)) = 4.0, so merge 3 with (45) into (345):

        (12)  (345)
 (12)   0.0
 (345)  5.0   0.0

Step 4: merge (12) and (345) at distance 5.0.

[Dendrogram: 1 and 2 join at height 2.0; 4 and 5 at 3.0; 3 joins (45) at 4.0; all objects join at 5.0. The distance axis runs from 1.0 to 5.0.]
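The matrix-reduction procedure above is easy to script. This sketch (a minimal implementation written for this note, with the linkage rule passed in: `min` for single link, `max` for complete link) reproduces the merge sequence:

```python
def agglomerate(D, labels, link=min):
    """Naive agglomerative clustering on a full symmetric distance
    matrix D (list of lists). link=min gives single linkage,
    link=max gives complete linkage."""
    D = [row[:] for row in D]
    labels = labels[:]
    while len(labels) > 1:
        # Find the closest pair of clusters in the current matrix.
        i, j = min(((i, j) for i in range(len(D)) for j in range(i)),
                   key=lambda p: D[p[0]][p[1]])
        print(f"merge {labels[j]} and {labels[i]} at {D[i][j]}")
        # Distance from the merged cluster to every other cluster.
        merged = [link(D[i][k], D[j][k]) for k in range(len(D)) if k not in (i, j)]
        keep = [k for k in range(len(D)) if k not in (i, j)]
        D = [[D[a][b] for b in keep] for a in keep]
        for row, d in zip(D, merged):
            row.append(d)
        D.append(merged + [0.0])
        labels = [labels[k] for k in keep] + [f"({labels[j]}{labels[i]})"]

D0 = [[0, 2, 6, 10, 9],
      [2, 0, 5, 9, 8],
      [6, 5, 0, 4, 5],
      [10, 9, 4, 0, 3],
      [9, 8, 5, 3, 0]]
agglomerate(D0, ["1", "2", "3", "4", "5"])   # single linkage: 2.0, 3.0, 4.0, 5.0
```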
Complete Linkage
Also known as the furthest-neighbour technique: the distance between two clusters is given by the distance between their most distant pair of points, one from each cluster.
[Figure: the complete-link distance between Cluster A and Cluster B is the distance between their most distant pair of points.]

Worked example, with the same five objects and initial distance matrix as before.

Step 1: the smallest entry is still d(1,2) = 2.0, so merge 1 and 2 into (12). Under complete linkage the distance from (12) to each remaining object is the maximum over its members:

        (12)    3     4     5
 (12)   0.0
 3      6.0   0.0
 4     10.0   4.0   0.0
 5      9.0   5.0   3.0   0.0

Step 2: the smallest entry is d(4,5) = 3.0, so merge 4 and 5 into (45):

        (12)    3   (45)
 (12)   0.0
 3      6.0   0.0
 (45)  10.0   5.0   0.0

Step 3: the smallest entry is d(3,(45)) = 5.0, so merge 3 with (45) into (345):

        (12)  (345)
 (12)   0.0
 (345) 10.0   0.0

Step 4: merge (12) and (345) at distance 10.0.

[Dendrogram: 1 and 2 join at height 2.0; 4 and 5 at 3.0; 3 joins (45) at 5.0; all objects join at 10.0. The distance axis runs from 2.0 to 10.0.]
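With the `agglomerate` sketch above, the complete-linkage run is the same call with the other linkage rule:

```python
agglomerate(D0, ["1", "2", "3", "4", "5"], link=max)   # complete linkage
# merges at 2.0, 3.0, 5.0 and finally 10.0, matching this dendrogram
```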
Group Average Linkage
The distance between two clusters is the average of the distances between all pairs of records, one taken from each cluster.
[Figure: Cluster A and Cluster B with the between-cluster pairwise distances that are averaged.]
Centroid Clustering
The distance between two clusters is defined as the distance between the mean vectors of the two clusters:
$$d_{AB} = d(\bar{a}, \bar{b})$$
where $\bar{a}$ is the mean vector of cluster A and $\bar{b}$ is the mean vector of cluster B.
[Figure: Clusters A and B with their mean vectors marked.]
Median Clustering
A disadvantage of centroid clustering: when a large cluster is merged with a small one, the centroid of the merged cluster lies close to that of the large cluster, and the small cluster effectively loses its own identity.
In median clustering, the merged cluster is instead represented by the midpoint of the two cluster centres, so both clusters are weighted equally regardless of size.
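A small sketch contrasting the two rules (helper names are mine; the midpoint representative for median linkage follows the description above):

```python
import numpy as np

def centroid_dist(A, B):
    """Centroid linkage: distance between the mean vectors of A and B."""
    return np.linalg.norm(np.mean(A, axis=0) - np.mean(B, axis=0))

def median_representative(rep_A, rep_B):
    """Median linkage: the merged cluster is represented by the midpoint
    of the two cluster representatives, regardless of cluster sizes."""
    return (np.asarray(rep_A) + np.asarray(rep_B)) / 2.0

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
B = np.array([[5.0, 5.0]])
print(centroid_dist(A, B))                  # distance between mean vectors
print(median_representative(A.mean(axis=0), B.mean(axis=0)))
```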
Divisive Methods
In a divisive algorithm, we start with all of the data in a single cluster and successively divide it.
Monothetic: divide the data on the basis of the possession of a single specified attribute.
Polythetic: divisions are based on the values taken by all attributes.
Polythetic Approach
The cluster is split by growing a "splinter group": start the splinter with the object that is, on average, farthest from the others, then repeatedly move over any object that is closer (on average) to the splinter group A than to the rest of its own group B.

Worked example. Seven objects with the distance matrix:

       1    2    3    4    5    6    7
 1     0
 2    10    0
 3     7    7    0
 4    30   23   21    0
 5    29   25   22    7    0
 6    38   34   31   10   11    0
 7    42   36   36   13   17    9    0

Step 1: average distance of each object to all the others:
D(1, *) = 26.0, D(2, *) = 22.5, D(3, *) = 20.7, D(4, *) = 17.3, D(5, *) = 18.5, D(6, *) = 22.2, D(7, *) = 25.5.
Object 1 has the largest average distance, so A = {1}, B = {2, 3, 4, 5, 6, 7}.

Step 2: average distances from each object of B to A:
D(2, A) = 10, D(3, A) = 7, D(4, A) = 30, D(5, A) = 29, D(6, A) = 38, D(7, A) = 42.
Object 3 is far closer to A than to the rest of B (D(3, A) = 7 versus D(3, B) = 23.4), so move it: A = {1, 3}, B = {2, 4, 5, 6, 7}.

Step 3: recompute with A = {1, 3}:
D(2, A) = 8.5, D(4, A) = 25.5, D(5, A) = 25.5, D(6, A) = 34.5, D(7, A) = 39.0.
Object 2 satisfies D(2, B) = 29.5 > D(2, A) = 8.5, so move it as well: A = {1, 3, 2}, B = {4, 5, 6, 7}.

Step 4: recompute with A = {1, 2, 3}:
D(4, A) = 24.7, D(5, A) = 25.3, D(6, A) = 34.3, D(7, A) = 38.0.
Every remaining object of B is now closer to the rest of B than to A, so the division stops with A = {1, 2, 3} and B = {4, 5, 6, 7}.
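The split can be checked mechanically. A sketch (NumPy assumed, helper name mine) that reproduces the sequence {1} → {1, 3} → {1, 3, 2}:

```python
import numpy as np

D = np.array([  # distance matrix from the worked example
    [ 0, 10,  7, 30, 29, 38, 42],
    [10,  0,  7, 23, 25, 34, 36],
    [ 7,  7,  0, 21, 22, 31, 36],
    [30, 23, 21,  0,  7, 10, 13],
    [29, 25, 22,  7,  0, 11, 17],
    [38, 34, 31, 10, 11,  0,  9],
    [42, 36, 36, 13, 17,  9,  0]], dtype=float)

def splinter(D):
    """One divisive split: seed the splinter group A with the object of
    largest average distance, then move objects from B to A while some
    object is closer (on average) to A than to the rest of B."""
    n = len(D)
    avg = D.sum(axis=1) / (n - 1)
    A = [int(avg.argmax())]
    B = [i for i in range(n) if i not in A]
    while len(B) > 1:
        d_A = np.array([D[i, A].mean() for i in B])
        d_B = np.array([D[i, [j for j in B if j != i]].mean() for i in B])
        gains = d_B - d_A
        best = int(gains.argmax())
        if gains[best] <= 0:        # nobody wants to defect: stop
            break
        A.append(B.pop(best))
    return A, B

print(splinter(D))   # 0-based: A = [0, 2, 1], B = [3, 4, 5, 6], i.e. {1,2,3} vs {4,5,6,7}
```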
Monothetic
It is usually used when the data consists of binary variables. Example: five individuals scored on three binary attributes A, B, C.

      A  B  C
 1    0  1  1
 2    1  1  0
 3    1  1  1
 4    1  1  0
 5    0  0  1

For a pair of attributes, count the cells of their 2 x 2 contingency table: a = number of individuals with both attributes present, b = with only the second present, c = with only the first present, d = with both absent. For A and B: a = 3, b = 1, c = 0, d = 1, with N = 5 individuals.

Chi-Square Measure

$$\chi^2_{AB} = \frac{(ad - bc)^2\,N}{(a+b)(a+c)(b+d)(c+d)} = \frac{(3 \cdot 1 - 1 \cdot 0)^2 \cdot 5}{4 \cdot 3 \cdot 2 \cdot 1} = 1.875$$

Repeating the count for the other attribute pairs:

 Attr.    AB     AC     BC
 a         3      1      2
 b         1      2      1
 c         0      2      2
 d         1      0      0
 N         5      5      5
 χ²     1.875   2.22   0.83

Summing the associations involving each attribute:
For attribute A: χ²(AB) + χ²(AC) = 4.09
For attribute B: χ²(AB) + χ²(BC) = 2.70
For attribute C: χ²(AC) + χ²(BC) = 3.05

We choose attribute A, the attribute with the largest total association, for dividing the data into two groups: {2, 3, 4} (individuals with A = 1) and {1, 5} (individuals with A = 0).
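A short script (written for this note, NumPy assumed) that reproduces the chi-square totals and the resulting split:

```python
import numpy as np

# Rows: individuals 1..5; columns: binary attributes A, B, C.
X = np.array([[0, 1, 1],
              [1, 1, 0],
              [1, 1, 1],
              [1, 1, 0],
              [0, 0, 1]])

def chi2_assoc(x, y):
    """Chi-square association between two binary attributes:
    (ad - bc)^2 N / ((a+b)(a+c)(b+d)(c+d))."""
    a = np.sum((x == 1) & (y == 1))
    b = np.sum((x == 0) & (y == 1))
    c = np.sum((x == 1) & (y == 0))
    d = np.sum((x == 0) & (y == 0))
    n = len(x)
    return (a * d - b * c) ** 2 * n / ((a + b) * (a + c) * (b + d) * (c + d))

names = "ABC"
totals = {names[i]: sum(chi2_assoc(X[:, i], X[:, j]) for j in range(3) if j != i)
          for i in range(3)}
print(totals)                            # A: 4.09..., B: 2.70..., C: 3.05...
split_on = max(totals, key=totals.get)   # attribute A wins
col = names.index(split_on)
print(split_on,
      np.where(X[:, col] == 1)[0] + 1,   # group {2, 3, 4}
      np.where(X[:, col] == 0)[0] + 1)   # group {1, 5}
```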
Thank You