
Lecture on CLUSTERING

By
Er. Pramod Nath
Assistant Professor
Computer Science & Engineering Department
Mohammad Ali Jauhar University, Rampur, U.P.
er.pramodnath@gmail.com

What is Cluster Analysis?

Cluster: a collection of data objects
Similar to one another within the same cluster.
Dissimilar to the objects in other clusters.

Cluster analysis
Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters.

Unsupervised learning: no predefined classes.

Examples of Clustering Applications

Marketing: Help marketers discover distinct groups of their customers, and then use this knowledge to develop targeted marketing programs.

Land use: Identification of areas of similar land use in an earth observation database.

Insurance: Identifying groups of insurance policy holders with a high average claim cost.

City planning: Identifying groups of houses according to their house type, value, and geographical location.

Earthquake studies: Observed earthquake epicenters can be clustered, e.g., along fault lines.

Measure the Quality of Clustering

A good clustering method will produce high-quality clusters with
high intra-class similarity
low inter-class similarity

Dissimilarity/Similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j).

There is a separate quality function that measures the "goodness" of a cluster.

The definitions of distance functions are usually very different for interval-scaled, Boolean, categorical, ordinal, ratio, and mixed variables.

Data Structures

Data matrix (two modes):

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

Dissimilarity matrix (one mode):

$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$
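To make the two structures concrete, here is a minimal Python/NumPy sketch (the data values and function name are my own, purely for illustration) that turns an n x p data matrix into an n x n dissimilarity matrix using Euclidean distance:

```python
import numpy as np

# Hypothetical data matrix: n = 4 objects described by p = 2 attributes.
X = np.array([[1.0, 2.0],
              [1.5, 1.8],
              [8.0, 8.0],
              [9.0, 9.5]])

def dissimilarity_matrix(X):
    """Return the n x n matrix of pairwise Euclidean distances d(i, j)."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i):          # only the lower triangle is needed (one mode)
            D[i, j] = np.linalg.norm(X[i] - X[j])
            D[j, i] = D[i, j]       # the matrix is symmetric
    return D

print(dissimilarity_matrix(X))
```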

Categorization of Major Clustering Methods

Partitioning Methods
Randomly select k objects as the initial cluster centers and perform an iterative process, using a distance criterion, to discover k clusters of the dataset, e.g., k-means, k-medoids.
Each cluster must contain at least one object.
Every object should belong to exactly one cluster.
These methods are often efficient and work well on small to medium-sized datasets.
They produce locally optimal, typically spherical clusters, and take k as an input.

Hierarchical Methods
The agglomerative approach, such as AGNES, starts with each object forming a separate cluster and successively merges the objects or clusters that are closest to each other into a bigger cluster, until all the objects are merged into one cluster or a termination criterion is satisfied.
The divisive approach, such as DIANA, is the reverse of the agglomerative approach.
Expensive in terms of their computational and storage requirements.
Suffer from greediness: once a merge or split is committed, it cannot be undone.

Typical Alternatives to Calculate the Distance between Clusters

Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min dis(tip, tjq) over pairs tip in Ki, tjq in Kj.

Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max dis(tip, tjq).

Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg dis(tip, tjq).

Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj).

Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj).
A medoid is one chosen, centrally located object in the cluster.
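The following NumPy sketch (illustrative only; the two example clusters and all function names are mine) computes these inter-cluster distances for two small clusters of points:

```python
import numpy as np

def pairwise_dists(A, B):
    """All distances dis(t_ip, t_jq) between points of cluster A and cluster B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def single_link(A, B):   return pairwise_dists(A, B).min()
def complete_link(A, B): return pairwise_dists(A, B).max()
def average_link(A, B):  return pairwise_dists(A, B).mean()
def centroid_link(A, B): return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

def medoid(A):
    """Medoid: the object of A with the smallest total distance to the others."""
    return A[pairwise_dists(A, A).sum(axis=1).argmin()]

def medoid_link(A, B):   return np.linalg.norm(medoid(A) - medoid(B))

# Hypothetical clusters Ki and Kj
Ki = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]])
Kj = np.array([[6.0, 6.0], [7.0, 7.5]])
print(single_link(Ki, Kj), complete_link(Ki, Kj), average_link(Ki, Kj))
print(centroid_link(Ki, Kj), medoid_link(Ki, Kj))
```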

Centroid, Radius and Diameter of a Cluster (for numerical data sets)

Centroid: the "middle" of a cluster:
$$ C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N} $$

Radius: square root of the average distance from any point of the cluster to its centroid:
$$ R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}} $$

Diameter: square root of the average squared distance between all pairs of points in the cluster:
$$ D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}} $$
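A minimal NumPy sketch of these three statistics (the sample points and the helper name cluster_stats are invented for illustration):

```python
import numpy as np

def cluster_stats(points):
    """Centroid, radius and diameter of a cluster given as an (N, p) array."""
    N = len(points)
    centroid = points.mean(axis=0)
    # Radius: sqrt of the average squared distance to the centroid
    radius = np.sqrt(((points - centroid) ** 2).sum(axis=1).mean())
    # Diameter: sqrt of the sum of squared distances over all ordered pairs, divided by N(N-1)
    diffs = points[:, None, :] - points[None, :, :]
    diameter = np.sqrt((diffs ** 2).sum(axis=-1).sum() / (N * (N - 1)))
    return centroid, radius, diameter

pts = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 1.0]])   # hypothetical cluster
print(cluster_stats(pts))
```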

Partitioning Algorithms: Basic Concept

Partitioning method: construct a partition of a database D of n objects into a set of k clusters that minimizes the sum of squared distances:
$$ E = \sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2 $$

Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion.
Heuristic methods: the k-means and k-medoids algorithms.
k-means (MacQueen, 1967): each cluster is represented by the center (mean) of the cluster.
k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987): each cluster is represented by one of its objects, located near the center of the cluster.
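As an illustration of the criterion above, this short NumPy sketch (hypothetical data, labels, and function name) evaluates the sum of squared errors E for a given assignment of objects to clusters:

```python
import numpy as np

def sse(X, labels):
    """Sum of squared distances of each object to the center of its cluster."""
    total = 0.0
    for m in np.unique(labels):
        members = X[labels == m]
        center = members.mean(axis=0)              # C_m
        total += ((members - center) ** 2).sum()   # sum over t_mi in K_m
    return total

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.0]])
labels = np.array([0, 0, 1, 1])    # hypothetical partition with k = 2
print(sse(X, labels))
```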

The K-Means Clustering Method

Example (K = 2):
1. Arbitrarily choose K objects as the initial cluster centers.
2. Assign each object to the most similar (closest) center.
3. Update the cluster means.
4. Reassign objects to the nearest center and update the cluster means again; repeat until the assignments no longer change.

[Figure: scatter plots of the objects showing the initial centers, the first assignment, the updated means, and the reassignment steps.]
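A compact k-means sketch in plain NumPy, following the loop illustrated above (the data, k, and function name are illustrative and not the lecturer's code):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic k-means: returns cluster labels and the final cluster means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # arbitrary initial centers
    for _ in range(max_iter):
        # Assign each object to the most similar (closest) center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update the cluster means (keep the old center if a cluster goes empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # stop when nothing moves
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])
labels, centers = kmeans(X, k=2)
print(centers)
```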

Comments on the K-Means Method

Strength: Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.

Weaknesses
Applicable only when a mean is defined; then what about categorical data?
Often terminates at a local optimum.
Need to specify k, the number of clusters, in advance.
Unable to handle noisy data and outliers.
Not suitable for discovering clusters with arbitrary shapes.

What Is the Problem of the K-Means Method?

The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.

K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.

[Figure: two scatter plots contrasting a cluster mean pulled away by an outlier with a medoid that stays inside the cluster.]

A Typical K-medoids Algorithm (PAM)

Example (K = 2):
1. Arbitrarily choose k objects as the initial medoids.
2. Assign each remaining object to the nearest medoid (in the illustration, total cost = 20).
3. Randomly select a non-medoid object, O_random.
4. Compute the total cost of swapping a medoid O with O_random (in the illustration, total cost = 26).
5. If the quality is improved, perform the swap.
6. Repeat the loop until no change.

[Figure: scatter plots of the objects showing the initial medoids, the assignment of the remaining objects, and a candidate swap.]
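A simplified PAM-style sketch in NumPy (illustrative only: it tries every possible swap rather than randomly selecting a non-medoid, and the data and function names are invented):

```python
import numpy as np

def total_cost(D, medoids):
    """Sum of distances from every object to its nearest medoid (D is the distance matrix)."""
    return D[:, medoids].min(axis=1).sum()

def pam(D, k, max_iter=100, seed=0):
    """Basic PAM: keep swapping medoids with non-medoids while the total cost improves."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = list(rng.choice(n, size=k, replace=False))   # arbitrary initial medoids
    for _ in range(max_iter):
        improved = False
        for m in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[m] = o                     # swap medoid m with non-medoid o
                if total_cost(D, candidate) < total_cost(D, medoids):
                    medoids, improved = candidate, True
        if not improved:                             # stop when no swap helps
            break
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

# Hypothetical points and their pairwise distance matrix
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [8.0, 8.0], [8.5, 9.0], [25.0, 25.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(pam(D, k=2))
```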

Hierarchical Clustering

A hierarchical clustering method works by grouping data objects into a tree of clusters.

It uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.

[Figure: objects a, b, c, d, e. The agglomerative approach (AGNES) proceeds from Step 0 to Step 4, merging a and b into (ab), d and e into (de), then c with (de) into (cde), and finally (ab) with (cde) into (abcde). The divisive approach (DIANA) runs the same tree in reverse, from Step 4 to Step 0.]

Hierarchical Clustering Methods

The partition of the data is not done in a single step.
There are two varieties of hierarchical clustering algorithms:

Agglomerative: successively fuses the data into groups.

Divisive: separates the data successively into finer groups.

Dendrogram: Shows How the Clusters are Merged

Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster.
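For instance, with SciPy's hierarchical-clustering utilities (the small dataset here is hypothetical), the dendrogram can be cut at a chosen distance to obtain flat clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.2, 5.1], [9.0, 1.0]])
Z = linkage(pdist(X), method='single')               # build the merge tree (dendrogram)
labels = fcluster(Z, t=2.0, criterion='distance')    # cut the dendrogram at height 2.0
print(labels)   # each connected component below the cut becomes one cluster
```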

AGNES (Agglomerative Nesting)

Uses the single-link method and the dissimilarity matrix.
Merges the nodes that have the least dissimilarity.
Successively merges clusters to form larger clusters.
Finally, all nodes belong to the same cluster.

[Figure: three scatter plots showing the objects being merged into progressively larger clusters.]

Dendrogram

A hierarchic grouping can be represented by a two-dimensional diagram known as a dendrogram.

[Figure: dendrogram of objects 1-5 with the fusion distance (1.0-5.0) on the vertical axis.]

Distance
Single Linkage
Complete Linkage
Group Average Linkage
Centroid Linkage
Median Linkage

Single Linkage

Also known as the nearest neighbor technique.
The distance between groups is defined as that of the closest pair of data points, where only pairs consisting of one record from each group are considered.

Worked example. Initial dissimilarity matrix for objects 1-5:

        1     2     3     4     5
1      0.0
2      2.0   0.0
3      6.0   5.0   0.0
4     10.0   9.0   4.0   0.0
5      9.0   8.0   5.0   3.0   0.0

Step 1: the smallest entry is d(1,2) = 2.0, so merge objects 1 and 2 into cluster (12). Under single linkage the new distances are the minima of the old ones:

        (12)    3     4     5
(12)    0.0
3       5.0    0.0
4       9.0    4.0   0.0
5       8.0    5.0   3.0   0.0

Step 2: the smallest entry is d(4,5) = 3.0, so merge 4 and 5 into (45):

        (12)    3    (45)
(12)    0.0
3       5.0    0.0
(45)    8.0    4.0   0.0

Step 3: the smallest entry is d(3,(45)) = 4.0, so merge 3 and (45) into (345):

        (12)   (345)
(12)    0.0
(345)   5.0    0.0

Step 4: finally merge (12) and (345) at distance 5.0, so all objects are in one cluster.

[Dendrogram: objects 1 and 2 join at height 2.0, objects 4 and 5 at 3.0, object 3 joins (45) at 4.0, and the two remaining clusters join at 5.0.]
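The same single-linkage merge sequence can be reproduced with SciPy (the 5x5 matrix below is the example matrix from the slides; squareform converts it to the condensed form that linkage expects):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Example dissimilarity matrix for objects 1-5 (symmetric, zero diagonal)
D = np.array([[ 0.0,  2.0,  6.0, 10.0,  9.0],
              [ 2.0,  0.0,  5.0,  9.0,  8.0],
              [ 6.0,  5.0,  0.0,  4.0,  5.0],
              [10.0,  9.0,  4.0,  0.0,  3.0],
              [ 9.0,  8.0,  5.0,  3.0,  0.0]])

Z = linkage(squareform(D), method='single')
# Each row of Z is one merge: [cluster_i, cluster_j, fusion_distance, new_cluster_size]
print(Z)   # fusion distances come out as 2.0, 3.0, 4.0, 5.0, matching the worked example
```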

Distance
Single Linkage
Complete Linkage
Group Average Linkage
Centroid Linkage
Median Linkage

Complete Linkage

The distance between two clusters is given by the distance between their most distant members.

Worked example. Initial dissimilarity matrix for objects 1-5 (the same as before):

        1     2     3     4     5
1      0.0
2      2.0   0.0
3      6.0   5.0   0.0
4     10.0   9.0   4.0   0.0
5      9.0   8.0   5.0   3.0   0.0

Step 1: the smallest entry is d(1,2) = 2.0, so merge 1 and 2 into (12). Under complete linkage the new distances are the maxima of the old ones:

        (12)    3     4     5
(12)    0.0
3       6.0    0.0
4      10.0    4.0   0.0
5       9.0    5.0   3.0   0.0

Step 2: the smallest entry is d(4,5) = 3.0, so merge 4 and 5 into (45):

        (12)    3    (45)
(12)    0.0
3       6.0    0.0
(45)   10.0    5.0   0.0

Step 3: the smallest entry is d(3,(45)) = 5.0, so merge 3 and (45) into (345):

        (12)   (345)
(12)    0.0
(345)  10.0    0.0

Step 4: finally merge (12) and (345) at distance 10.0.

[Dendrogram: objects 1 and 2 join at height 2.0, objects 4 and 5 at 3.0, object 3 joins (45) at 5.0, and the two remaining clusters join at 10.0.]
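Switching the SciPy call from the previous sketch to method='complete' on the same example matrix reproduces these fusion levels (2.0, 3.0, 5.0, 10.0):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([[ 0.0,  2.0,  6.0, 10.0,  9.0],
              [ 2.0,  0.0,  5.0,  9.0,  8.0],
              [ 6.0,  5.0,  0.0,  4.0,  5.0],
              [10.0,  9.0,  4.0,  0.0,  3.0],
              [ 9.0,  8.0,  5.0,  3.0,  0.0]])

print(linkage(squareform(D), method='complete'))   # fusion distances: 2.0, 3.0, 5.0, 10.0
```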

Distance
Single Linkage
Complete Linkage
Group Average Linkage
Centroid Linkage
Median Linkage

Group Average Clustering

The distance between two clusters is defined as the average of the distances between all pairs of records (one from each cluster).
For example, with cluster A = {1, 2} and cluster B = {3, 4, 5}:
dAB = 1/6 (d13 + d14 + d15 + d23 + d24 + d25)

[Figure: two clusters A = {1, 2} and B = {3, 4, 5} with all pairwise links between them.]
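A tiny sketch of the group-average distance, reusing the example dissimilarity matrix from the single-linkage slides (the cluster membership and function name are illustrative):

```python
import numpy as np

# Example dissimilarity matrix for objects 1-5 (0-based indices 0-4)
D = np.array([[ 0.0,  2.0,  6.0, 10.0,  9.0],
              [ 2.0,  0.0,  5.0,  9.0,  8.0],
              [ 6.0,  5.0,  0.0,  4.0,  5.0],
              [10.0,  9.0,  4.0,  0.0,  3.0],
              [ 9.0,  8.0,  5.0,  3.0,  0.0]])

def group_average(D, A, B):
    """Average of d(i, j) over all pairs with i in cluster A and j in cluster B."""
    return np.mean([D[i, j] for i in A for j in B])

print(group_average(D, A=[0, 1], B=[2, 3, 4]))   # d_AB = (6+10+9+5+9+8)/6
```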

Distance
Single Linkage
Complete Linkage
Group Average Linkage
Centroid Linkage
Median Linkage

Centroid Clustering

The distance between two clusters is defined as the distance between the mean vectors of the two clusters:
dAB = d(a, b)
where a is the mean vector of cluster A and b is the mean vector of cluster B.

[Figure: two clusters A and B with their mean vectors a and b.]

Distance
Single Linkage
Complete Linkage
Group Average Linkage
Centroid Linkage
Median Linkage

Median Clustering

Disadvantage of centroid clustering: when a large cluster is merged with a small one, the centroid of the combined cluster is close to that of the large one, i.e. the characteristic properties of the small one are lost.
In median clustering, after we have combined two groups, the mid-point of the original two cluster centres is used as the centre of the newly combined group.

[Figure: clusters A and B with their centres.]

DIANA (Divisive Analysis)

The inverse order of AGNES.
Eventually each node forms a cluster on its own.

[Figure: three scatter plots showing one cluster being split into progressively smaller clusters.]

Divisive Methods

In a divisive algorithm, we start with the assumption that all the data is part of one cluster.
We then use a distance criterion to divide the cluster in two, and then subdivide the clusters until a stopping criterion is achieved.
Polythetic: divides the data based on the values of all attributes.
Monothetic: divides the data on the basis of the possession of a single specified attribute.

Polythetic Approach
Distance
Single Linkage
Complete Linkage
Group Average Linkage
Centroid Linkage
Median Linkage

Polythetic Approach

Worked example. Dissimilarity matrix for objects 1-7:

     1    2    3    4    5    6    7
1    0
2   10    0
3    7    7    0
4   30   23   21    0
5   29   25   22    7    0
6   38   34   31   10   11    0
7   42   36   36   13   17    9    0

Step 1: compute the average dissimilarity of each object to all the others:
D(1, *) = 26.0, D(2, *) = 22.5, D(3, *) = 20.7, D(4, *) = 17.3, D(5, *) = 18.5, D(6, *) = 22.2, D(7, *) = 25.5
Object 1 has the largest average dissimilarity, so it starts the splinter group: A = {1}, B = {2, 3, 4, 5, 6, 7}.

Step 2: for each object in B, compare its average dissimilarity to A with its average dissimilarity to the rest of B:

Object i   D(i, A)   D(i, B)   D(i, B) - D(i, A)
2          10.0      25.0       15.0
3           7.0      23.4       16.4
4          30.0      14.8      -15.2
5          29.0      16.4      -12.6
6          38.0      19.0      -19.0
7          42.0      22.2      -19.8

Object 3 has the largest positive difference, so it moves to the splinter group: A = {1, 3}, B = {2, 4, 5, 6, 7}.

Step 3: recompute the averages:

Object i   D(i, A)   D(i, B)   D(i, B) - D(i, A)
2           8.5      29.5       21.0
4          25.5      13.2      -12.3
5          25.5      15.0      -10.5
6          34.5      16.0      -18.5
7          39.0      18.9      -20.3

Object 2 has the only positive difference, so it moves: A = {1, 3, 2}, B = {4, 5, 6, 7}.

Step 4: recompute the averages once more:

Object i   D(i, A)   D(i, B)   D(i, B) - D(i, A)
4          24.7      10.0      -14.7
5          25.3      11.7      -13.6
6          34.3      10.0      -24.3
7          38.0      13.0      -25.0

All differences are negative, so no further object moves and the split is A = {1, 2, 3}, B = {4, 5, 6, 7}.
The process would then continue on each subgroup separately.
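A short Python sketch of this splinter-group procedure (my own illustrative implementation of the steps above, run on the slides' 7x7 matrix; indices in the code are 0-based):

```python
import numpy as np

D = np.array([
    [ 0, 10,  7, 30, 29, 38, 42],
    [10,  0,  7, 23, 25, 34, 36],
    [ 7,  7,  0, 21, 22, 31, 36],
    [30, 23, 21,  0,  7, 10, 13],
    [29, 25, 22,  7,  0, 11, 17],
    [38, 34, 31, 10, 11,  0,  9],
    [42, 36, 36, 13, 17,  9,  0]], dtype=float)

def splinter_split(D):
    """Split one cluster in two by growing a splinter group (polythetic divisive step)."""
    n = D.shape[0]
    avg_all = D.sum(axis=1) / (n - 1)          # average dissimilarity to all other objects
    A = [int(avg_all.argmax())]                # object with the largest average starts A
    B = [i for i in range(n) if i not in A]
    while len(B) > 1:
        # difference D(i, rest of B) - D(i, A) for every object still in B
        diffs = {i: D[i, [j for j in B if j != i]].mean() - D[i, A].mean() for i in B}
        best = max(diffs, key=diffs.get)
        if diffs[best] <= 0:                   # all differences negative: stop
            break
        A.append(best)
        B.remove(best)
    return A, B

# Expected result: A = [0, 2, 1], B = [3, 4, 5, 6], i.e. objects {1, 3, 2} and {4, 5, 6, 7}
print(splinter_split(D))
```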

Monothetic

Monothetic division is usually used when the data consists of binary variables.

Example data (5 objects, 3 binary attributes):

          A   B   C
Object 1  0   1   1
Object 2  1   1   0
Object 3  1   1   1
Object 4  1   1   0
Object 5  0   0   1

Chi-Square Measure

For a pair of attributes, build the 2x2 contingency table over the objects, where a counts objects with both attributes present, d counts objects with both absent, b counts objects with only the second attribute present, and c counts objects with only the first present. The association between the two attributes is then

$$ \chi^2_{AB} = \frac{(ad - bc)^2\, N}{(a+b)(a+c)(b+d)(c+d)} $$

For attributes A and B: a = 3, b = 1, c = 0, d = 1, N = 5, so

$$ \chi^2_{AB} = \frac{(3 \cdot 1 - 1 \cdot 0)^2 \cdot 5}{4 \cdot 3 \cdot 2 \cdot 1} = 1.875 $$

Doing the same for every pair of attributes:

Pair   AB    AC    BC
a       3     1     2
b       1     2     1
c       0     2     2
d       1     0     0
N       5     5     5
χ²    1.87  2.22  0.83

Summing the chi-square values involving each attribute:
For attribute A: χ²(AB) + χ²(AC) = 4.09
For attribute B: χ²(AB) + χ²(BC) = 2.70
For attribute C: χ²(AC) + χ²(BC) = 3.05

Attribute A has the largest total, so we choose attribute A for dividing the data into two groups: {2, 3, 4} (A = 1) and {1, 5} (A = 0).
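A small sketch that reproduces this attribute choice (the data matrix is the example above; the variable and function names are mine):

```python
import numpy as np
from itertools import combinations

# Rows = objects 1-5, columns = binary attributes A, B, C
X = np.array([[0, 1, 1],
              [1, 1, 0],
              [1, 1, 1],
              [1, 1, 0],
              [0, 0, 1]])
names = ['A', 'B', 'C']

def chi_square(x, y):
    """Chi-square association between two binary attribute vectors."""
    a = np.sum((x == 1) & (y == 1))
    b = np.sum((x == 0) & (y == 1))
    c = np.sum((x == 1) & (y == 0))
    d = np.sum((x == 0) & (y == 0))
    N = len(x)
    return (a * d - b * c) ** 2 * N / ((a + b) * (a + c) * (b + d) * (c + d))

# Sum, for each attribute, the chi-square values of the pairs it belongs to
totals = {n: 0.0 for n in names}
for (i, p), (j, q) in combinations(enumerate(names), 2):
    chi = chi_square(X[:, i], X[:, j])
    totals[p] += chi
    totals[q] += chi

best = max(totals, key=totals.get)        # A: 4.09, B: 2.70, C: 3.05 -> choose A
print(totals, '-> split on', best)
print('groups:', np.where(X[:, names.index(best)] == 1)[0] + 1,
      np.where(X[:, names.index(best)] == 0)[0] + 1)
```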

Thank You
