
Cluster Analysis IV

Presidency University

September, 2016
So far we have done

- What is cluster analysis?
- Hierarchical agglomerative and non-hierarchical clustering.
- R implementation of K-means and hierarchical agglomerative clustering.

In this lecture we shall see:

- Manual implementation of the hierarchical clustering and K-means methods.
- Some more R applications.
Example of Hierarchical Clustering

- Consider the distance matrix between 5 objects:

  D =
          1    2    3    4    5
     1    0    9    3    6   11
     2    9    0    7    5   10
     3    3    7    0    9    2
     4    6    5    9    0    8
     5   11   10    2    8    0

- We begin hierarchical clustering using single linkage.
- We start by treating each item as a single cluster.
- We proceed by merging the closest items.

Stage 1

- From D, min_{i,j} dij = d53 = 2, so objects 5 and 3 are merged into the cluster (35).
- Distances of 1, 2, 4 from (35) are computed using single linkage:

  d(35)1 = min{d31, d51} = min{3, 11} = 3
  d(35)2 = min{d32, d52} = min{7, 10} = 7
  d(35)4 = min{d34, d54} = min{9, 8} = 8
Stage 2

We get the new distance matrix as

         (35)   1    2    4
  (35)     0    3    7    8
    1      3    0    9    6
    2      7    9    0    5
    4      8    6    5    0

- min_{i,j} dij = d(35)1 = 3, so object 1 and the cluster (35) are merged to get (135).
- Distances of 2, 4 from (135) are computed using single linkage:

  d(135)2 = min{d(35)2, d12} = min{7, 9} = 7
  d(135)4 = min{d(35)4, d14} = min{8, 6} = 6
Stage 3

We get the new distance matrix as

          (135)   2    4
  (135)      0    7    6
     2       7    0    5
     4       6    5    0

- min_{i,j} dij = d42 = 5, so objects 2 and 4 are merged to get (24).
- The distance of (24) from (135) is computed using single linkage:

  d(135)(24) = min{d(135)2, d(135)4} = min{7, 6} = 6
Stage 4

We get the final distance matrix as

          (135)  (24)
  (135)      0     6
   (24)      6     0

- The clusters (135) and (24) are merged to get a single cluster (12345).
- The final nearest-neighbour distance reaches 6.
Dendrogram

[Figure: Cluster Dendrogram, hclust (*, "single") on d. Objects 3 and 5 merge at height 2, object 1 joins them at height 3, objects 2 and 4 merge at height 5, and the two clusters join at height 6.]
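The same result can be reproduced in R. The following is a minimal sketch that feeds the distance matrix D of the example to hclust with single linkage; the variable names D and tree are chosen here for illustration only.

# Distance matrix between the 5 objects, as given in the example
D <- matrix(c( 0,  9,  3,  6, 11,
               9,  0,  7,  5, 10,
               3,  7,  0,  9,  2,
               6,  5,  9,  0,  8,
              11, 10,  2,  8,  0), nrow = 5, byrow = TRUE)

# Single-linkage hierarchical clustering on the distance object
tree <- hclust(as.dist(D), method = "single")

tree$merge    # order in which items/clusters are merged
tree$height   # merge heights: 2, 3, 5, 6 (matching Stages 1-4)
plot(tree)    # dendrogram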
K-means

- Observations were taken on two variables x1 and x2 on four items A, B, C, D:

  Item   x1   x2
   A      5    3
   B     -1    1
   C      1   -2
   D     -3   -2

- We shall divide these items into K = 2 clusters.

Step 1

- We arbitrarily partition the items into two groups, say (AB) and (CD).
- We next compute the centroids (i.e. means) x̄1 and x̄2 of each group:

  Cluster        x̄1                      x̄2
  (AB)      (5 + (−1))/2 = 2         (3 + 1)/2 = 2
  (CD)      (1 + (−3))/2 = −1        (−2 + (−2))/2 = −2
Updating Cluster Centroids

- The i-th coordinate of a group centroid is updated using the formulas:

  x̄_i,new = (n x̄_i + x_ji) / (n + 1)   if the j-th item is added to the group
  x̄_i,new = (n x̄_i − x_ji) / (n − 1)   if the j-th item is removed from the group

  where n is the number of items in the old group and x_ji is the value of variable i for item j.
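As an illustration, here is a small R helper implementing the two update formulas; the function names add_item and remove_item are ad hoc, chosen only for this sketch. It reproduces the centroids used in Step 2 below when A is moved from (AB) to (CD).

add_item    <- function(centroid, x, n) (n * centroid + x) / (n + 1)  # add item x to a group of size n
remove_item <- function(centroid, x, n) (n * centroid - x) / (n - 1)  # remove item x from a group of size n

# Moving A = (5, 3) out of (AB) (centroid (2, 2), n = 2) and into (CD) (centroid (-1, -2), n = 2):
remove_item(c(2, 2), c(5, 3), 2)    # new centroid of (B):   (-1, 1)
add_item(c(-1, -2), c(5, 3), 2)     # new centroid of (ACD): (1, -0.33)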
Step 2

- We compute the squared Euclidean distances of each item from the group centroids.
- We reassign each item to the nearest group.
- If A is not moved:

  d²(A, (AB)) = (5 − 2)² + (3 − 2)² = 10
  d²(A, (CD)) = (5 + 1)² + (3 + 2)² = 61

- If A is moved to the group (CD), the cluster centers become:

  Group (B):   x̄1,new = (2(2) − 5)/(2 − 1) = −1,    x̄2,new = (2(2) − 3)/(2 − 1) = 1
  Group (ACD): x̄1,new = (2(−1) + 5)/(2 + 1) = 1,    x̄2,new = (2(−2) + 3)/(2 + 1) = −0.33

  and consequently we get:

  d²(A, (B)) = (5 + 1)² + (3 − 1)² = 40
  d²(A, (ACD)) = (5 − 1)² + (3 + 0.33)² = 27.09

- Since A is closer to the center of (AB) than it is to the center of (ACD), it is not reassigned.
Step 2

- Next we consider reassigning B.
- If B is not moved:

  d²(B, (AB)) = (−1 − 2)² + (1 − 2)² = 10
  d²(B, (CD)) = (−1 + 1)² + (1 + 2)² = 9

- If B is moved to the group (CD):

  d²(B, (A)) = (−1 − 5)² + (1 − 3)² = 40
  d²(B, (BCD)) = (−1 + 1)² + (1 + 1)² = 4

- Since B is closer to the center of (BCD) than it is to the center of (AB), B is reassigned to the (CD) group. We now have the clusters (A) and (BCD) with centroids (5, 3) and (−1, −1) respectively.
Step 2

- Next we consider C for reassignment.
- If C is not moved:

  d²(C, (A)) = (1 − 5)² + (−2 − 3)² = 41
  d²(C, (BCD)) = (1 + 1)² + (−2 + 1)² = 5

- If C is moved to the group (A):

  d²(C, (AC)) = (1 − 3)² + (−2 − 0.5)² = 10.25
  d²(C, (BD)) = (1 + 2)² + (−2 + 0.5)² = 11.25

- Since C is closer to the center of the (BCD) group than it is to the center of the (AC) group, C is not moved.
Step 3

- Continuing in this way, we find that no more reassignments take place and the final K = 2 clusters are (A) and (BCD).
- For the final clusters, the squared distances to the group centroids are:

  Cluster    A    B    C    D
  A          0   40   41   89
  BCD       52    4    5    5

- The within-cluster sums of squares are:

  Cluster (A): 0
  Cluster (BCD): 4 + 5 + 5 = 14
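For comparison, the small example can also be run through R's kmeans(); a minimal sketch is given below. The variable names x and cl are illustrative, and kmeans() is started here from the centroids of the initial grouping (AB)/(CD) rather than following the item-by-item updates above, but it should recover the same partition (A) and (BCD).

# The four items, with illustrative row names
x <- rbind(A = c(5, 3), B = c(-1, 1), C = c(1, -2), D = c(-3, -2))

# K-means with K = 2, started from the centroids of the initial groups (AB) and (CD)
cl <- kmeans(x, centers = rbind(c(2, 2), c(-1, -2)))

cl$cluster    # cluster membership: A alone, and B, C, D together
cl$centers    # final centroids (5, 3) and (-1, -1)
cl$withinss   # within-cluster sums of squares: 0 and 14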
Old Faithful Geyser Eruptions

- A geyser is a hot spring which occasionally becomes unstable and erupts hot water and steam into the air. Old Faithful Geyser is the most famous of all geysers and is an extremely popular tourist attraction.
- The dataset¹ contains 107 bivariate observations taken from a study of the eruptions of Old Faithful Geyser in Yellowstone National Park, Wyoming.
- The variables measured are duration of eruption (x1) and waiting time until the next eruption (x2), both recorded in minutes, for all eruptions of Old Faithful Geyser between 6 a.m. and midnight, 18 August 1978.

¹ (Weisberg, 1985, p. 231)
Old Faithful Geyser Eruptions I

- The two variables are measured on very different scales (the standard deviations of x1 and x2 being approximately 1 and 13, respectively).
- Without standardizing both variables we cannot obtain a realistic partitioning of the data, so we standardize the variables prior to clustering.

data=apply(faithful,2,'scale')
d=dist(data)
mytree=hclust(d)
plot(mytree)
Old Faithful Geyser Eruptions II

[Figure: Cluster Dendrogram of the standardized data, hclust (*, "complete"), with heights from 0 to about 5 and observation numbers as leaf labels.]
Visualizing the clusters I

classes=cutree(mytree,h=2.5)
table(classes)

## classes
## 1 2 3
## 125 97 50

plot(faithful,col=classes)
Visualizing the clusters II

[Figure: scatter plot of waiting (y-axis, about 50 to 90 minutes) against eruptions (x-axis, about 1.5 to 5.0 minutes) for the faithful data, with points coloured by the three clusters.]
Partitioning the dendrogram I

plot(mytree)
rect.hclust(mytree, k=3, border="red")
Partitioning the dendrogram II

[Figure: the same complete-linkage Cluster Dendrogram, hclust (*, "complete"), with red rectangles marking the k = 3 clusters.]
Landsat Satellite Image Data

- The Landsat data used here were generated from a Landsat Multispectral Scanner (MSS) image database used in the European Statlog Project for assessing machine-learning methods.²
- The data consist of 4,435 observations in six groups, measured on 36 variables.
- Prior to clustering we standardize each variable.

² see http://edc.usgs.gov/guides/landsat mss.html
Applying K-means

- K-means is implemented by the command kmeans, with K specified.

data=read.table("/home/user/pics/satimage.txt",header=T)
data=scale(data); cl=kmeans(data,6);

- We then find the number of points in each cluster.

cl$size

[1] 611 749 1057 665 382 971

- The total within-cluster sum of squares and the between-cluster sum of squares are computed next.

cl$tot.withinss; cl$betweenss;

[1] 35207.24
[1] 128850.8
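As a quick check on what these quantities mean, the within- and between-cluster sums of squares should add up to the total sum of squares of the standardized data. A minimal sketch, assuming the cl object from above:

cl$totss                         # total sum of squares about the grand centroid
cl$tot.withinss + cl$betweenss   # should match cl$totss (up to rounding)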
Visualizing Clusters I

- A nice way to visualize cluster solutions for high-dimensional data is to plot them on the first two principal components.

library(cluster)
clusplot(data,cl$cluster,
color=TRUE,shade=TRUE,lines=0)
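The same idea can be sketched directly with prcomp: compute the principal component scores and colour them by cluster label. The variable names pca and scores below are illustrative, and clusplot's exact computation may differ slightly, but the picture is essentially the first two PC scores coloured by cluster.

# Principal components of the standardized data
pca <- prcomp(data)
scores <- pca$x[, 1:2]

# Scatter of the first two PC scores, coloured by K-means cluster label
plot(scores, col = cl$cluster, pch = 20,
     xlab = "Component 1", ylab = "Component 2")

# Cumulative proportion of variance explained by the first two components
summary(pca)$importance[3, 2]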
Visualizing Clusters II

[Figure: CLUSPLOT( data ) — the six clusters plotted on the first two principal components (Component 1 and Component 2). These two components explain 84.09 % of the point variability.]
