
PAM CLUSTERING TECHNIQUE

Bachelor of Technology
Computer Science and Engineering

OCTOBER 2019

India

TABLE OF CONTENTS
1. Abstract
2. Introduction
3. Body
i. PAM Algorithm
ii. PAM Time Complexity
iii. Applications of PAM
iv. Advantages of PAM
v. Problems with PAM
vi. Improvements of CLARA over PAM

4. Conclusion
5. References

ABSTRACT

Earlier research produced an "all-rules" algorithm for data mining that generates all conjunctive rules above given confidence and coverage thresholds. While this is a useful tool, it may produce a large number of rules. This paper describes the application of two clustering algorithms to these rules, in order to identify sets of similar rules and to better understand the data. Clustering is the procedure of grouping similar objects together, and several algorithms have been proposed for it. Among them, the K-means clustering method has low time complexity, but it is sensitive to extreme values and can therefore cluster a data set less accurately. The K-medoids method does not have this limitation, but it relies on a user-defined value for K: if the number of clusters is not chosen correctly, the method will not recover the natural clusters and accuracy suffers. In this paper, we propose a grid-based clustering method with higher accuracy than the existing K-medoids algorithm. Our proposed Grid Multi-dimensional K-medoids (GMK) algorithm uses the concept of a cluster validity index, and the experimental results show that the proposed method is more accurate than the existing K-medoids method. The object space is quantized into a number of cells, and the distance between intra-cluster objects decreases, which contributes to the higher accuracy of the proposed method. The proposed approach therefore achieves higher accuracy and provides a natural clustering that scales well to large data sets.

INTRODUCTION

The partitioning around medoids (PAM) algorithm is a clustering algorithm reminiscent of the k-means algorithm. Both the k-means and k-medoids algorithms are partitional (they break the data set up into groups), and both attempt to minimize the distance between points labeled to be in a cluster and a point designated as the center of that cluster. In contrast to the k-means algorithm, k-medoids chooses data points as centers (medoids or exemplars) and can be used with arbitrary distances, while in k-means the center of a cluster is not necessarily one of the input data points (it is the average of the points in the cluster). The PAM method was proposed in 1987 [1] for work with the ℓ1 norm and other distances.

k-medoids is a classical partitioning technique of clustering, which clusters the data set of n
objects into k clusters, with the number k of clusters assumed known a priori (which
implies that the programmer must specify k before the execution of the algorithm). The
"goodness" of the given value of k can be assessed with methods such as the silhouette
method.

It is more robust to noise and outliers as compared to k-means because it minimizes a sum
of pairwise dissimilarities instead of a sum of squared Euclidean distances.
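
In symbols (a standard textbook formulation supplied here for contrast, not taken from the source):

k-means:    \min \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2, where \mu_i is the mean of cluster C_i
k-medoids:  \min \sum_{i=1}^{k} \sum_{x \in C_i} d(x, m_i), where the medoid m_i \in C_i is itself a data point

Squaring amplifies the influence of far-away points on the mean, whereas the medoid objective grows only linearly in each dissimilarity, which is the source of the robustness claimed above.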

A medoid can be defined as the object of a cluster whose average dissimilarity to all the
objects in the cluster is minimal; that is, it is the most centrally located point in the cluster.
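
From this definition a medoid can be computed directly. The following minimal sketch (the array names and the choice of the Manhattan metric are illustrative, not from the source) picks the medoid of one of the clusters from the worked example later in this paper:

import numpy as np

def medoid(points):
    # the medoid minimizes the summed dissimilarity to all cluster members
    pts = np.asarray(points, dtype=float)
    # pairwise Manhattan (L1) distances: dists[i, j] = ||pts[i] - pts[j]||_1
    dists = np.abs(pts[:, None, :] - pts[None, :, :]).sum(axis=2)
    return tuple(pts[dists.sum(axis=1).argmin()])

cluster = [(7, 6), (8, 5), (7, 4), (6, 2), (7, 3), (6, 4)]
print(medoid(cluster))   # (7.0, 4.0): the most centrally located point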

Data mining is the procedure of non-trivial extraction of implicit, previously unknown, and
potentially useful information from data. Commonly, data mining tasks are classified into two
categories: descriptive and predictive. Descriptive mining tasks characterize the general properties
of the data in the database, i.e., they find human-interpretable patterns that describe the data.
Predictive mining tasks perform inference on the existing data in order to make predictions.
Clustering is one of the major descriptive data mining tasks. As mentioned, clustering is the
partitioning of data into groups of analogous objects. Representing the data set by fewer clusters
loses certain fine details, but achieves simplification.

Data modeling places clustering in a historical perspective rooted in mathematics, statistics, and
numerical analysis; clustering can be viewed as a density estimation problem. From a machine
learning viewpoint, clusters correspond to hidden patterns, the search for clusters is
unsupervised learning, and the resulting system represents a data concept. From a practical
perspective, clustering plays an important role in data mining applications such as scientific data
exploration, information retrieval and text mining, spatial database applications, Web analysis,
CRM, marketing, medical diagnostics, computational biology, and many others.


PAM ALGORITHM
The algorithm proceeds in two steps:

• BUILD-step: this step sequentially selects k "centrally located" objects to be used as initial medoids.

• SWAP-step: if the objective function can be reduced by interchanging (swapping) a selected object with an unselected object, then the swap is carried out.

This is continued until the objective function can no longer be decreased. The algorithm is
as follows:

1. Initially select k random points as the medoids from the given n data points of the data set.

2. Associate each data point with the closest medoid, using any of the common distance metrics.

3. For each pair of non-selected object h and selected object i, calculate the total swapping cost TCih.

   • If TCih < 0, i is replaced by h.

4. Repeat steps 2-3 until there is no change in the medoids.

There are four situations to be considered in this process:

i. Shift-out membership: an object pi may need to be shifted from the currently considered cluster of oj to another cluster;

ii. Update the current medoid: a new medoid oc is found to replace the current medoid oj;

iii. No change: objects in the current cluster result have the same or an even smaller square-error criterion (SEC) measure for all the possible redistributions considered;

iv. Shift-in membership: an outside object pi is assigned to the current cluster with the new (replaced) medoid oc.

The PAM algorithm uses a greedy search, which is faster than exhaustive search but may not find the global optimum; a runnable sketch of the whole procedure is given below.
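
To make the four steps concrete, here is a minimal sketch in Python (the helper names, the random initialization, and the accept-any-improving-swap policy are my own simplifications; the textbook SWAP step instead applies the single best swap per pass):

import random

def manhattan(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))

def pam(points, k, dist=manhattan, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)            # step 1: k random medoids

    def cost(meds):
        # steps 2-3: total distance of every point to its nearest medoid
        return sum(min(dist(p, m) for m in meds) for p in points)

    current = cost(medoids)
    improved = True
    while improved:                            # step 4: repeat until stable
        improved = False
        for i in range(k):
            for h in points:
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                tc = cost(trial) - current     # total swapping cost TC_ih
                if tc < 0:                     # TC_ih < 0: i is replaced by h
                    medoids, current, improved = trial, current + tc, True
    return medoids

The worked example in the next section traces one candidate swap of this loop by hand.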

A WORKED EXAMPLE OF THIS ALGORITHM IS SHOWN BELOW:

Example:

For a given k=2, cluster the following data set using PAM.

Point   X-axis   Y-axis
1         7        6
2         2        6
3         3        8
4         8        5
5         7        4
6         4        7
7         6        2
8         7        3
9         6        4
10        3        4

Let us choose (3, 4) and (7, 4) as the initial medoids, and use the Manhattan distance
d((x1, y1), (x2, y2)) = |x1 − x2| + |y1 − y2| as the distance measure.

So, now we calculate the distance from each remaining point to the two medoids:

• For (7, 6): distance 6 to (3, 4) and 2 to (7, 4), so this point is nearest to (7, 4).

• For (2, 6): distance 3 to (3, 4) and 7 to (7, 4), so this point is nearest to (3, 4).

• For (3, 8): distance 4 to (3, 4) and 8 to (7, 4), so this point is nearest to (3, 4).

• For (8, 5): distance 6 to (3, 4) and 2 to (7, 4), so this point is nearest to (7, 4).

• For (4, 7): distance 4 to (3, 4) and 6 to (7, 4), so this point is nearest to (3, 4).

• For (6, 2): distance 5 to (3, 4) and 3 to (7, 4), so this point is nearest to (7, 4).

• For (7, 3): distance 5 to (3, 4) and 1 to (7, 4), so this point is nearest to (7, 4).

• For (6, 4): distance 3 to (3, 4) and 1 to (7, 4), so this point is nearest to (7, 4).

So, after the clustering, the clusters formed are {(3, 4), (2, 6), (3, 8), (4, 7)} and {(7, 4),
(6, 2), (6, 4), (7, 3), (8, 5), (7, 6)}. Now we calculate the cost, which is simply the sum of
the distance of each non-selected point from the medoid of the cluster it belongs to:

Total Cost = cost((3, 4), (2, 6)) + cost((3, 4), (3, 8)) + cost((3, 4), (4, 7)) + cost((7, 4), (6, 2)) + cost((7, 4), (6, 4)) + cost((7, 4), (7, 3)) + cost((7, 4), (8, 5)) + cost((7, 4), (7, 6))

= 3 + 4 + 4 + 3 + 1 + 1 + 2 + 2 = 20.

Similarly, if we swap the medoid (7, 4) for the non-medoid point (7, 3), the total cost works
out to 22.

Swap Cost = Present Cost − Previous Cost = 22 − 20 = 2 > 0

As the swap cost is not less than zero, we undo the swap. Hence (3, 4) and (7, 4) remain
the final medoids, and the clustering stays as computed above.
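
This arithmetic can be checked mechanically. A minimal sketch (reusing the manhattan helper defined in the algorithm section; the point list is copied from the table above):

points = [(7, 6), (2, 6), (3, 8), (8, 5), (7, 4),
          (4, 7), (6, 2), (7, 3), (6, 4), (3, 4)]

def total_cost(medoids):
    # sum of each point's distance to its nearest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

print(total_cost([(3, 4), (7, 4)]))   # 20  (current configuration)
print(total_cost([(3, 4), (7, 3)]))   # 22  -> swap cost 2 > 0, so reject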

TMSL/CSE/Term-Paper/Semester-7 8
Algorithm 1: PAM BUILD: Find initial cluster centers.

(TD, m1) ← (∞, null);
foreach xj do
    TDj ← 0;
    foreach xo ≠ xj do TDj ← TDj + d(xo, xj);
    if TDj < TD then (TD, m1) ← (TDj, xj);            // smallest distance sum
for i = 1 ... k − 1 do                                // other medoids
    (ΔTD*, x*) ← (∞, null);
    foreach xj ∉ {m1, ..., mi} do
        ΔTD ← 0;
        foreach xo ∉ {m1, ..., mi, xj} do
            δ ← d(xo, xj) − min_{m ∈ {m1,...,mi}} d(xo, m);
            if δ < 0 then ΔTD ← ΔTD + δ;
        if ΔTD < ΔTD* then (ΔTD*, x*) ← (ΔTD, xj);    // best reduction in TD
    (TD, mi+1) ← (TD + ΔTD*, x*);
return TD, {m1, ..., mk};

Algorithm 2: PAM SWAP: Iterative improvement.

repeat
    (ΔTD*, m*, x*) ← (0, null, null);
    foreach mi ∈ {m1, ..., mk} do                     // each medoid
        foreach xj ∉ {m1, ..., mk} do                 // each non-medoid
            ΔTD ← 0;
            foreach xo ∉ {m1, ..., mk} \ {mi} do ΔTD ← ΔTD + Δ(xo, mi, xj);
            if ΔTD < ΔTD* then (ΔTD*, m*, x*) ← (ΔTD, mi, xj);
    break loop if ΔTD* ≥ 0;                           // no improving swap left
    swap the roles of medoid m* and non-medoid x*;    // perform best swap
    TD ← TD + ΔTD*;
return TD, M, C;
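
For readers who prefer executable code, here is a direct (unoptimized) Python transcription of the BUILD phase; d is assumed to be a precomputed n × n matrix of pairwise dissimilarities, and the names are mine. Note that the vectorized gain folds in the reduction for the candidate point itself via d[j][j] = 0. The SWAP phase corresponds to the swap loop already sketched in the PAM Algorithm section above.

import numpy as np

def pam_build(d, k):
    # m1: the point with the smallest sum of distances to all others
    medoids = [int(d.sum(axis=1).argmin())]
    for _ in range(k - 1):
        # distance of every point to its closest already-chosen medoid
        nearest = d[:, medoids].min(axis=1)
        # gains[j] = total reduction in TD if point j were added as a medoid
        gains = np.minimum(d - nearest[:, None], 0.0).sum(axis=0)
        gains[medoids] = np.inf        # existing medoids are not candidates
        medoids.append(int(gains.argmin()))
    return medoids

Starting SWAP from BUILD's medoids instead of a random draw typically reduces the number of swap iterations, which is why the original PAM pairs the two phases.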

PAM TIME COMPLEXITY
EXPLANATION:

To place each data point in its closest cluster, we have to compute the distance from each
of the (n − k) non-medoid points to each of the k medoids.

After this, for each of the k medoids we consider replacing it with each of the (n − k)
non-medoid points, and each such candidate swap requires re-computing the cost over the
(n − k) non-medoid objects.

So in each iteration we evaluate k(n − k) candidate swaps at a cost of (n − k) distance
computations each, and the time complexity is O(k(n − k)^2), as restated below.
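
In symbols, the per-iteration cost is (a restatement of the count above, not an additional result):

\underbrace{k\,(n-k)}_{\text{assignment}} \;+\; \underbrace{k\,(n-k)\cdot(n-k)}_{\text{evaluating all candidate swaps}} \;\in\; O\!\left(k\,(n-k)^{2}\right)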

APPLICATIONS OF PAM
The principal application of PAM is to partition a data set into a manageable number of clusters for observation and exploration. Because it selects actual data points as centers and accepts arbitrary dissimilarities, it is used in the same data-mining settings listed in the introduction, such as information retrieval, spatial databases, marketing, and medical diagnostics.

ADVANTAGES OF PAM
1. It is simple to understand and easy to implement.
2. The k-medoids algorithm is fast on small data sets and converges after a finite number of steps.
3. PAM is less sensitive to outliers than other partitioning algorithms such as k-means.
PROBLEMS WITH PAM

1. The main disadvantage of k-medoids algorithms is that they are not suitable for
clustering non-spherical (arbitrarily shaped) groups of objects. This is because they
rely on minimizing the distances between the non-medoid objects and the medoid
(the cluster center); briefly, they use compactness as the clustering criterion
instead of connectivity.
2. The algorithm may produce different results on different runs over the same dataset,
because the first k medoids are chosen randomly.

CLARA VS K-MEDOIDS (PAM)

• Instead of finding medoids for the entire data set, CLARA considers a small
sample of the data with fixed size (sampsize) and applies the PAM algorithm to it,
generating an optimal set of medoids for the sample. The quality of the resulting
medoids is measured by the average dissimilarity between every object in the
entire data set and the medoid of its cluster, defined as the cost function.
• CLARA repeats the sampling and clustering processes a pre-specified number of
times in order to minimize the sampling bias. The final clustering result
corresponds to the set of medoids with the minimal cost.
• CLARA can therefore handle large data sets that PAM cannot process directly, and it
can be regarded as an improved, scalable version of PAM; a minimal sketch of the
scheme follows below.
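
The sketch below reuses the pam() and manhattan() helpers from the PAM Algorithm section; the sample count of 5 and the sample size of 40 + 2k are the defaults usually attributed to the original CLARA proposal, used here for illustration:

import random

def clara(points, k, samples=5, sampsize=None, seed=0):
    # run PAM on several random samples; keep the medoid set whose
    # cost over the FULL data set is lowest
    rng = random.Random(seed)
    sampsize = sampsize or min(len(points), 40 + 2 * k)
    best_medoids, best_cost = None, float("inf")
    for _ in range(samples):
        sample = rng.sample(points, sampsize)
        medoids = pam(sample, k)               # cluster the sample only
        cost = sum(min(manhattan(p, m) for m in medoids) for p in points)
        if cost < best_cost:                   # judged on the entire data set
            best_medoids, best_cost = medoids, cost
    return best_medoids

Each PAM run touches only sampsize points, so the quadratic swap cost applies to the sample rather than to the whole data set; only the final cost evaluation scans all points.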

CONCLUSION

The K-medoids algorithm, PAM, is a robust alternative to k-means for partitioning a


data set into clusters of observation. In k-medoids method, each cluster is represented
by a selected object within the cluster. The selected objects are named medoids and
corresponds to the most centrally located points within the cluster. The PAM algorithm
requires the user to know the data and to indicate the appropriate number of clusters to
be produced. This can be estimated using the function fviz_nbclust [in factoextra R
package]. The R function pam() [cluster package] can be used to compute PAM
algorithm. The simplified format is pam(x, k), where “x” is the data and k is the number
of clusters to be generated. After, performing PAM clustering, the R function
fviz_cluster() [factoextra package] can be used to visualize the results. The format is
fviz_cluster(pam.res), where pam.res is the PAM results.
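
Outside R, a comparable workflow exists in Python. The sketch below assumes the scikit-learn-extra package and its documented KMedoids estimator; the parameter names are taken from that interface and should be checked against the installed version:

import numpy as np
from sklearn_extra.cluster import KMedoids   # pip install scikit-learn-extra

# the ten points from the worked example
X = np.array([[7, 6], [2, 6], [3, 8], [8, 5], [7, 4],
              [4, 7], [6, 2], [7, 3], [6, 4], [3, 4]])

# method="pam" requests the BUILD/SWAP procedure described in this paper
model = KMedoids(n_clusters=2, metric="manhattan", method="pam").fit(X)
print(model.cluster_centers_)   # the two medoids
print(model.labels_)            # cluster membership of each point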

REFERENCES

[1] L. Kaufman and P. J. Rousseeuw, "Clustering by means of medoids," in Statistical Data Analysis Based on the L1-Norm and Related Methods, North-Holland, 1987.

[2] www.wikepedia.com

[3] www.mllearningprjoects.com

[4] https://www.datanovia.com/en/lessons/k-medoids-in-r-algorithm-and-practical-examples

