Today's objective
Clustering Analysis
• Flat algorithms
– Usually start with a random (partial) partitioning
– Refine it iteratively
• K means clustering
• (Model based clustering)
• Hierarchical algorithms
– Bottom-up, agglomerative
– (Top-down, divisive)
• Example
[Figure: K-means example with K=2, plotted on axes running from 0 to 10. Steps shown: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign.]
Indian Institute of Management (IIM), Rohtak

First, we list all points in the first column of the table. The
initial cluster centers (means) are (2, 10), (5, 8) and (1, 2),
chosen randomly. Next, we calculate the distance from the first
point (2, 10) to each of the three means, using the distance
function:
ρ(a, b) = |x2 – x1| + |y2 – y1|

where (x1, y1) is the point (2, 10) and (x2, y2) is the mean.

ρ(point, mean1) = |2 – 2| + |10 – 10| = 0 + 0 = 0    (mean1 = (2, 10))
ρ(point, mean2) = |5 – 2| + |8 – 10| = 3 + 2 = 5     (mean2 = (5, 8))
ρ(point, mean3) = |1 – 2| + |2 – 10| = 1 + 8 = 9     (mean3 = (1, 2))

So, we fill in these values in the table:

Point          Dist Mean 1   Dist Mean 2   Dist Mean 3   Cluster
               (2, 10)       (5, 8)        (1, 2)
A1 (2, 10)     0             5             9             1
A2 (2, 5)
A3 (8, 4)
A4 (5, 8)
A5 (7, 5)
A6 (6, 4)
A7 (1, 2)
A8 (4, 9)

Cluster 1      Cluster 2     Cluster 3
(2, 10)
So, which cluster should the point (2, 10) be placed in? The one where the point has the shortest distance to the mean,
that is, mean 1 (cluster 1), since the distance is 0.
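
The assignment of the first point can be checked with a few lines of code. This is a minimal sketch in Python, used here only for illustration; the function name `manhattan` and the variable names are assumptions, not part of the slides. It applies the distance function ρ(a, b) = |x2 – x1| + |y2 – y1| to point A1 and picks the nearest mean.

```python
def manhattan(a, b):
    """Distance used in the example: |x2 - x1| + |y2 - y1|."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])

# Initial means and first point, taken from the worked example.
means = [(2, 10), (5, 8), (1, 2)]
point = (2, 10)  # A1

# Distance from the point to each of the three means.
distances = [manhattan(point, m) for m in means]

# Clusters are numbered from 1, as in the table.
cluster = distances.index(min(distances)) + 1

print(distances, cluster)  # [0, 5, 9] 1
```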
Next, we go to the second point (2, 5) and calculate the distance to each of the
three means, using the distance function:

ρ(a, b) = |x2 – x1| + |y2 – y1|

where (x1, y1) is the point (2, 5) and (x2, y2) is the mean.

ρ(point, mean1) = |2 – 2| + |10 – 5| = 0 + 5 = 5    (mean1 = (2, 10))
ρ(point, mean2) = |5 – 2| + |8 – 5| = 3 + 3 = 6     (mean2 = (5, 8))
ρ(point, mean3) = |1 – 2| + |2 – 5| = 1 + 3 = 4     (mean3 = (1, 2))
Point          Dist Mean 1   Dist Mean 2   Dist Mean 3   Cluster
               (2, 10)       (5, 8)        (1, 2)
A1 (2, 10)     0             5             9             1
A2 (2, 5)      5             6             4             3
A3 (8, 4)
A4 (5, 8)
A5 (7, 5)
A6 (6, 4)
A7 (1, 2)
A8 (4, 9)

We fill in the rest of the table and place each point in one of the clusters.
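
The remaining rows follow from the same calculation. The sketch below is illustrative only: the names `points`, `table` and `manhattan` are assumptions. It computes the three distances and the resulting cluster for all eight points, using the initial means from the example.

```python
def manhattan(a, b):
    """Distance used in the example: |x2 - x1| + |y2 - y1|."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])

# The eight points and the three initial means from the example.
points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}
means = [(2, 10), (5, 8), (1, 2)]

# For each point: distances to mean 1, 2, 3 and the nearest cluster (1-based).
table = {}
for name, p in points.items():
    dists = [manhattan(p, m) for m in means]
    table[name] = (dists, dists.index(min(dists)) + 1)

for name, (dists, cluster) in table.items():
    print(name, points[name], *dists, cluster)
```

Under these inputs, A1 stays alone in cluster 1, A2 and A7 fall into cluster 3, and the other five points fall into cluster 2.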
The initial cluster centers are shown as red dots. The new cluster centers are shown as red x marks.
That was Iteration 1 (epoch 1). Next, we go to Iteration 2 (epoch 2),
Iteration 3, and so on until the means no longer change.
In Iteration 2, we repeat the process from Iteration 1, this time
using the new means we computed.
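
The full loop (assign, update the means, repeat until the means stop changing) can be sketched as follows. This is a minimal illustration, not the slides' own implementation. The helper names `assign` and `update` are assumptions, and so is the choice to keep a cluster's old mean if it ever becomes empty. Assignment uses the same distance ρ(a, b) = |x2 – x1| + |y2 – y1| as the worked example.

```python
def manhattan(a, b):
    """Distance used in the example: |x2 - x1| + |y2 - y1|."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])

def assign(points, means):
    """Index (0-based) of the nearest mean for every point."""
    return [min(range(len(means)), key=lambda k: manhattan(p, means[k]))
            for p in points]

def update(points, labels, means):
    """Recompute each mean as the average of the points assigned to it."""
    new_means = []
    for k in range(len(means)):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            # Assumption: an empty cluster keeps its previous mean.
            new_means.append(means[k])
            continue
        new_means.append((sum(x for x, _ in members) / len(members),
                          sum(y for _, y in members) / len(members)))
    return new_means

points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
means = [(2, 10), (5, 8), (1, 2)]  # initial means from the example

history = []  # means produced by each iteration
while True:
    labels = assign(points, means)
    new_means = update(points, labels, means)
    history.append(new_means)
    if new_means == means:  # means did not change: stop
        break
    means = new_means

print(len(history), labels)
```

With these inputs, the first iteration yields the means (2, 10), (6, 6) and (1.5, 3.5), and the loop stops on the fourth iteration with A1, A4, A8 in the first cluster, A3, A5, A6 in the second, and A2, A7 in the third.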
In the example, if you cluster using the raw attribute levels given in the
Excel sheet, the percentages of Blacks and Hispanics in each city will drive
the clusters, because these values are more spread out than the other
demographic attributes.
To remedy this problem, you can standardize each demographic
attribute by subtracting the attribute's mean and dividing by the
attribute's standard deviation.
For example, the average city has 24.34 percent Blacks with a standard
deviation of 18.11 percent. This implies that after standardizing the
percentage of Blacks, Atlanta has 2.35 standard deviations more Blacks
(on a percentage basis) than a typical city. Working with standardized
values for each attribute ensures that your analysis is unit-free and each
attribute has the same effect on your cluster selection. Of course you
may give a larger weight to any attribute.
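
Standardization can be sketched in a few lines. The function names `z_score` and `standardize` are assumptions, the sample values in `pct` are hypothetical, and the use of the sample standard deviation is also an assumption because the slides do not say which form is used. The 66.9 below is back-computed from the slide's figures (24.34 + 2.35 × 18.11) purely to reproduce the 2.35 result; it is not taken from the data sheet.

```python
import statistics

def z_score(value, mean, sd):
    """Standardized value: (value - mean) / standard deviation."""
    return (value - mean) / sd

def standardize(values):
    """Standardize a whole attribute column.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [z_score(v, mean, sd) for v in values]

# Slide figures: mean 24.34 percent, standard deviation 18.11 percent.
# 66.9 is an illustrative, back-computed percentage (see the note above).
z = z_score(66.9, 24.34, 18.11)
print(round(z, 2))  # 2.35

# Hypothetical attribute column, for illustration only.
pct = [10.0, 20.0, 30.0, 60.0]
pct_std = standardize(pct)
print([round(v, 3) for v in pct_std])
```

After standardization every attribute has mean 0 and standard deviation 1, so no single attribute dominates the distance calculation unless you deliberately weight it.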
[Figure: annotated code for the elbow plot. Callouts: dim(input); number of clusters; initial seed to start clustering; within sum of squares, i.e., intra-cluster distance; option used to join points.]
The within-group sums of squares (represented by the ncol parameter) are
calculated for different cluster counts k = 1, 2, 3, 4, 5, etc. and plotted
against the number of clusters (represented by the nrow parameter). This forms
an elbow curve; the region of the bend indicates the optimal number of clusters.
Interpretation of the scree plot
The within-group sums of squares are calculated and plotted for different
cluster counts k = 1, 2, 3, 4, 5, etc., forming an elbow curve; the region
of the bend indicates the optimal number of clusters.
The bend indicates the optimal k (number of clusters) = 2.
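
The numbers behind an elbow curve can be produced with a small self-contained sketch. This is an illustration under stated assumptions, not the code shown on the slides: the function `kmeans_wss` is hypothetical, it seeds the centers with the first k points rather than a random seed, it uses squared Euclidean distance (the usual basis for the within sum of squares), and it runs on the eight points from the earlier worked example rather than the data set behind the scree plot. The bend for this toy data therefore need not fall at k = 2.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))

def kmeans_wss(points, k, max_iter=100):
    """Run a basic k-means and return the total within-cluster sum of squares.

    Assumption: the first k points are used as the initial centers.
    """
    centers = list(points[:k])
    for _ in range(max_iter):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append((sum(x for x, _ in members) / len(members),
                                    sum(y for _, y in members) / len(members)))
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    # Sum of squared distances from each point to its nearest final center.
    return sum(sq_dist(p, centers[nearest(p, centers)]) for p in points)

points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]

# Within sum of squares for k = 1..5; plotting these against k gives the elbow curve.
wss = {k: kmeans_wss(points, k) for k in range(1, 6)}
for k, value in wss.items():
    print(k, round(value, 2))
```

Plotting `wss` against k and looking for the point where the curve flattens is the elbow reading described above.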