
Cluster Analysis IV

Presidency University

September, 2016
So far we have done

- What is cluster analysis?
- Hierarchical agglomerative and non-hierarchical clustering.
- R implementation of K-means and hierarchical agglomerative clustering.

In this lecture we shall see:

- Manual implementation of the hierarchical clustering and K-means methods.
- Some more R applications.
Example of Hierarchical Clustering

- Consider the distance matrix between 5 objects:

  D =
          1    2    3    4    5
     1    0    9    3    6   11
     2    9    0    7    5   10
     3    3    7    0    9    2
     4    6    5    9    0    8
     5   11   10    2    8    0

- We begin hierarchical clustering using single linkage.
- We start by treating each item as a single cluster.
- We proceed by merging the closest items.

Stage 1

- From D, min_{i,j} dij = d53 = 2, so objects 5 and 3 are merged into the cluster (35).
- Distances of 1, 2, 4 from (35) are computed using single linkage:

  d(35)1 = min{d31, d51} = min{3, 11} = 3
  d(35)2 = min{d32, d52} = min{7, 10} = 7
  d(35)4 = min{d34, d54} = min{9, 8} = 8
Stage 2

We get the new distance matrix as

         (35)   1    2    4
  (35)     0    3    7    8
    1      3    0    9    6
    2      7    9    0    5
    4      8    6    5    0

- min_{i,j} dij = d(35)1 = 3, so object 1 and the cluster (35) are merged to get (135).
- Distances of 2, 4 from (135) are computed using single linkage:

  d(135)2 = min{d(35)2, d12} = min{7, 9} = 7
  d(135)4 = min{d(35)4, d14} = min{8, 6} = 6
Stage 3

We get the new distance matrix as

          (135)   2    4
  (135)      0    7    6
     2       7    0    5
     4       6    5    0

- min_{i,j} dij = d42 = 5, so objects 2 and 4 are merged to get (24).
- The distance of (24) from (135) is computed using single linkage:

  d(135)(24) = min{d(135)2, d(135)4} = min{7, 6} = 6
Stage 4

We get the final distance matrix as

          (135)  (24)
  (135)      0     6
   (24)      6     0

- The clusters (135) and (24) are merged to get a single cluster (12345).
- The final nearest-neighbour distance reaches 6.
Dendrogram

[Figure: Cluster Dendrogram, hclust (*, "single") on d. Objects 3 and 5 merge at height 2, object 1 joins them at height 3, objects 2 and 4 merge at height 5, and the two clusters join at height 6.]
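The same result can be reproduced in R. The following is a minimal sketch that feeds the distance matrix D of the example to hclust with single linkage; the variable names D and tree are chosen here for illustration only.

# Distance matrix between the 5 objects, as given in the example
D <- matrix(c( 0,  9,  3,  6, 11,
               9,  0,  7,  5, 10,
               3,  7,  0,  9,  2,
               6,  5,  9,  0,  8,
              11, 10,  2,  8,  0), nrow = 5, byrow = TRUE)

# Single-linkage hierarchical clustering on the distance object
tree <- hclust(as.dist(D), method = "single")

tree$merge    # order in which items/clusters are merged
tree$height   # merge heights: 2, 3, 5, 6 (matching Stages 1-4)
plot(tree)    # dendrogram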
K-means

- Observations were taken on two variables x1 and x2 on four items A, B, C, D:

  Item   x1   x2
   A      5    3
   B     -1    1
   C      1   -2
   D     -3   -2

- We shall divide these items into K = 2 clusters.

Step 1

- We arbitrarily partition the items into two groups, say (AB) and (CD).
- We next compute the centroids (i.e. means) x̄1 and x̄2 of each group:

  Cluster        x̄1                      x̄2
  (AB)      (5 + (−1))/2 = 2         (3 + 1)/2 = 2
  (CD)      (1 + (−3))/2 = −1        (−2 + (−2))/2 = −2
Updating Cluster Centroids

- The i-th coordinate of a group centroid is updated using the formulas:

  x̄_i,new = (n x̄_i + x_ji) / (n + 1)   if the j-th item is added to the group
  x̄_i,new = (n x̄_i − x_ji) / (n − 1)   if the j-th item is removed from the group

  where n is the number of items in the old group and x_ji is the value of variable i for item j.
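As an illustration, here is a small R helper implementing the two update formulas; the function names add_item and remove_item are ad hoc, chosen only for this sketch. It reproduces the centroids used in Step 2 below when A is moved from (AB) to (CD).

add_item    <- function(centroid, x, n) (n * centroid + x) / (n + 1)  # add item x to a group of size n
remove_item <- function(centroid, x, n) (n * centroid - x) / (n - 1)  # remove item x from a group of size n

# Moving A = (5, 3) out of (AB) (centroid (2, 2), n = 2) and into (CD) (centroid (-1, -2), n = 2):
remove_item(c(2, 2), c(5, 3), 2)    # new centroid of (B):   (-1, 1)
add_item(c(-1, -2), c(5, 3), 2)     # new centroid of (ACD): (1, -0.33)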
Step 2

- We compute the squared Euclidean distances of each item from the group centroids.
- We reassign each item to the nearest group.
- If A is not moved:

  d²(A, (AB)) = (5 − 2)² + (3 − 2)² = 10
  d²(A, (CD)) = (5 + 1)² + (3 + 2)² = 61

- If A is moved to the group (CD), the cluster centers become:

  Group (B):   x̄1,new = (2(2) − 5)/(2 − 1) = −1,    x̄2,new = (2(2) − 3)/(2 − 1) = 1
  Group (ACD): x̄1,new = (2(−1) + 5)/(2 + 1) = 1,    x̄2,new = (2(−2) + 3)/(2 + 1) = −0.33

  and consequently we get:

  d²(A, (B)) = (5 + 1)² + (3 − 1)² = 40
  d²(A, (ACD)) = (5 − 1)² + (3 + 0.33)² = 27.09

- Since A is closer to the center of (AB) than it is to the center of (ACD), it is not reassigned.
Step 2

- Next we consider reassigning B.
- If B is not moved:

  d²(B, (AB)) = (−1 − 2)² + (1 − 2)² = 10
  d²(B, (CD)) = (−1 + 1)² + (1 + 2)² = 9

- If B is moved to the group (CD):

  d²(B, (A)) = (−1 − 5)² + (1 − 3)² = 40
  d²(B, (BCD)) = (−1 + 1)² + (1 + 1)² = 4

- Since B is closer to the center of (BCD) than it is to the center of (AB), B is reassigned to the (CD) group. We now have the clusters (A) and (BCD) with centroids (5, 3) and (−1, −1) respectively.
Step 2

- Next we consider C for reassignment.
- If C is not moved:

  d²(C, (A)) = (1 − 5)² + (−2 − 3)² = 41
  d²(C, (BCD)) = (1 + 1)² + (−2 + 1)² = 5

- If C is moved to the group (A):

  d²(C, (AC)) = (1 − 3)² + (−2 − 0.5)² = 10.25
  d²(C, (BD)) = (1 + 2)² + (−2 + 0.5)² = 11.25

- Since C is closer to the center of the (BCD) group than it is to the center of the (AC) group, C is not moved.
Step 3

- Continuing in this way, we find that no more reassignments take place and the final K = 2 clusters are (A) and (BCD).
- For the final clusters, the squared distances to the group centroids are:

  Cluster    A    B    C    D
  A          0   40   41   89
  BCD       52    4    5    5

- The within-cluster sums of squares are:

  Cluster (A): 0
  Cluster (BCD): 4 + 5 + 5 = 14
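For comparison, the small example can also be run through R's kmeans(); a minimal sketch is given below. The variable names x and cl are illustrative, and kmeans() is started here from the centroids of the initial grouping (AB)/(CD) rather than following the item-by-item updates above, but it should recover the same partition (A) and (BCD).

# The four items, with illustrative row names
x <- rbind(A = c(5, 3), B = c(-1, 1), C = c(1, -2), D = c(-3, -2))

# K-means with K = 2, started from the centroids of the initial groups (AB) and (CD)
cl <- kmeans(x, centers = rbind(c(2, 2), c(-1, -2)))

cl$cluster    # cluster membership: A alone, and B, C, D together
cl$centers    # final centroids (5, 3) and (-1, -1)
cl$withinss   # within-cluster sums of squares: 0 and 14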
Old Faithful Geyser Eruptions

- A geyser is a hot spring which occasionally becomes unstable and erupts hot water and steam into the air. Old Faithful Geyser is the most famous of all geysers and is an extremely popular tourist attraction.
- The dataset¹ contains 107 bivariate observations taken from a study of the eruptions of Old Faithful Geyser in Yellowstone National Park, Wyoming.
- The variables measured are duration of eruption (x1) and waiting time until the next eruption (x2), both recorded in minutes, for all eruptions of Old Faithful Geyser between 6 a.m. and midnight, 18 August 1978.

¹ (Weisberg, 1985, p. 231)
Old Faithful Geyser Eruptions I

- The two variables are measured on very different scales (the standard deviations of x1 and x2 being approximately 1 and 13, respectively).
- Without standardizing both variables we cannot obtain a realistic partitioning of the data, so we standardize the variables prior to clustering.

data=apply(faithful,2,'scale')
d=dist(data)
mytree=hclust(d)
plot(mytree)
Old Faithful Geyser Eruptions II

[Figure: Cluster Dendrogram of the standardized data, hclust (*, "complete"), with heights from 0 to about 5 and observation numbers as leaf labels.]
Visualizing the clusters I

classes=cutree(mytree,h=2.5)
table(classes)

## classes
## 1 2 3
## 125 97 50

plot(faithful,col=classes)
Visualizing the clusters II

[Figure: scatter plot of waiting (y-axis, about 50 to 90 minutes) against eruptions (x-axis, about 1.5 to 5.0 minutes) for the faithful data, with points coloured by the three clusters.]
Partitioning the dendrogram I

plot(mytree)
rect.hclust(mytree, k=3, border="red")
Partitioning the dendrogram II

[Figure: the same complete-linkage Cluster Dendrogram, hclust (*, "complete"), with red rectangles marking the k = 3 clusters.]
Landsat Satellite Image Data

- The Landsat data used here were generated from a Landsat Multispectral Scanner (MSS) image database used in the European Statlog Project for assessing machine-learning methods.²
- The data consist of 4,435 observations in six groups, measured on 36 variables.
- Prior to clustering we standardize each variable.

² see http://edc.usgs.gov/guides/landsat mss.html
Applying K-means

- K-means is implemented by the command kmeans, with K specified.

data=read.table("/home/user/pics/satimage.txt",header=T)
data=scale(data); cl=kmeans(data,6);

- We then find the number of points in each cluster.

cl$size

[1] 611 749 1057 665 382 971

- The total within-cluster sum of squares and the between-cluster sum of squares are computed next.

cl$tot.withinss; cl$betweenss;

[1] 35207.24
[1] 128850.8
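As a quick check on what these quantities mean, the within- and between-cluster sums of squares should add up to the total sum of squares of the standardized data. A minimal sketch, assuming the cl object from above:

cl$totss                         # total sum of squares about the grand centroid
cl$tot.withinss + cl$betweenss   # should match cl$totss (up to rounding)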
Visualizing Clusters I

- A nice way to visualize cluster solutions for high-dimensional data is to plot them on the first two principal components.

library(cluster)
clusplot(data,cl$cluster,
color=TRUE,shade=TRUE,lines=0)
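The same idea can be sketched directly with prcomp: compute the principal component scores and colour them by cluster label. The variable names pca and scores below are illustrative, and clusplot's exact computation may differ slightly, but the picture is essentially the first two PC scores coloured by cluster.

# Principal components of the standardized data
pca <- prcomp(data)
scores <- pca$x[, 1:2]

# Scatter of the first two PC scores, coloured by K-means cluster label
plot(scores, col = cl$cluster, pch = 20,
     xlab = "Component 1", ylab = "Component 2")

# Cumulative proportion of variance explained by the first two components
summary(pca)$importance[3, 2]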
Visualizing Clusters II

[Figure: CLUSPLOT( data ) — the six clusters plotted on the first two principal components (Component 1 and Component 2). These two components explain 84.09 % of the point variability.]
