Bachelor of Technology
Computer Science and Engineering
OCTOBER 2019
India
TABLE OF CONTENTS
1. Abstract
2. Introduction
3. Body
i. PAM Algorithm
ii. PAM Time Complexity
iii. Applications of PAM
iv. Problems with PAM
v. Advantages of PAM
vi. Improvements of CLARA over PAM
4. Conclusion
5. References
ABSTRACT
TMSL/CSE/Term-Paper/Semester-7 2
Earlier research has resulted in the production of an ‘all-rules’ algorithm for data-mining
that produces all conjunctive rules above given confidence and coverage thresholds.
While this is a useful tool, it may produce a large number of rules. This paper describes
the application of two clustering algorithms to these rules, in order to identify sets of
similar rules and to better understand the data. Clustering is the procedure of grouping
similar objects together. Several algorithms have been proposed for clustering. Among
them, the K-means clustering method has low time complexity, but it is sensitive to
extreme values, which can make its clustering of the data-set less accurate. The
K-medoids method does not have this limitation, but it relies on a user-defined value
for K. Therefore, if the number of clusters is not chosen correctly, it will not find the
natural number of clusters, and the accuracy will suffer. In this paper, we propose a
grid-based clustering method that has higher accuracy than the existing K-medoids
algorithm. Our proposed Grid Multi-dimensional K-medoids (GMK) algorithm uses the
concept of a cluster validity index, and the experimental results show that the proposed
method has higher accuracy than the existing K-medoids method. The object space is
quantized into a number of cells, and the distance between intra-cluster objects
decreases, which contributes to the higher accuracy of the proposed method. The
proposed approach therefore has higher accuracy and provides a natural clustering
method that scales well to large data-sets.
INTRODUCTION
The partitioning around medoids (PAM) algorithm is a clustering algorithm similar to
the k-means algorithm. Both the k-means and k-medoids algorithms are partitional
(they break the dataset up into groups), and both attempt to minimize the distance between
points labeled to be in a cluster and a point designated as the center of that cluster. In
contrast to the k-means algorithm, k-medoids chooses data points as centers (medoids or
exemplars) and can be used with arbitrary distances, whereas in k-means the centre of a
cluster is not necessarily one of the input data points (it is the average of the points
in the cluster). The PAM method was proposed in 1987 [1] for work with the L1 norm and other
distances.
k-medoid is a classical partitioning technique of clustering, which clusters the data set of n
objects into k clusters, with the number k of clusters assumed known a priori (which
implies that the programmer must specify k before the execution of the algorithm). The
"goodness" of the given value of k can be assessed with methods such as the silhouette
method.
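The silhouette method mentioned above can be sketched in a few lines of Python. The helper below is a hand-rolled illustration, not a library call, and it assumes every cluster contains at least two points:

```python
def manhattan(a, b):
    # Manhattan (L1) dissimilarity between two points given as tuples.
    return sum(abs(x - y) for x, y in zip(a, b))

def silhouette_width(point, own_cluster, other_clusters, dist=manhattan):
    # s = (b - a) / max(a, b), where a is the mean dissimilarity of the
    # point to the rest of its own cluster and b is the smallest mean
    # dissimilarity to any other cluster; values near 1 suggest k fits.
    others = [p for p in own_cluster if p != point]
    a = sum(dist(point, p) for p in others) / len(others)
    b = min(sum(dist(point, p) for p in c) / len(c)
            for c in other_clusters)
    return (b - a) / max(a, b)
```

Averaging the silhouette width over all points for several candidate values of k, and keeping the k with the highest average, is one common way to assess the "goodness" of a chosen k.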
It is more robust to noise and outliers as compared to k-means because it minimizes a sum
of pairwise dissimilarities instead of a sum of squared Euclidean distances.
A medoid can be defined as the object of a cluster whose average dissimilarity to all the
objects in the cluster is minimal, that is, it is a most centrally located point in the cluster.
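This definition can be computed directly: among the members of a cluster, pick the one that minimises the total (equivalently, average) dissimilarity to the rest. A minimal Python sketch, with illustrative names only:

```python
def manhattan(a, b):
    # Manhattan (L1) distance between two points given as tuples.
    return sum(abs(x - y) for x, y in zip(a, b))

def medoid(cluster, dist=manhattan):
    # The medoid must be an actual member of the cluster, unlike a
    # k-means centroid, which may lie between the data points.
    return min(cluster, key=lambda c: sum(dist(c, p) for p in cluster))
```

For example, in the cluster {(0,0), (1,0), (5,0)} the medoid is (1,0), whose total distance to the other members (1 + 4 = 5) is the smallest.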
Data Mining is the procedure of non-trivial extraction of implicit, previously unknown, and
potentially helpful information from data. Commonly, data mining tasks can be classified into two
categories: descriptive and predictive. Descriptive mining tasks illustrate the general properties
of the data in the database i.e. descriptive task finds the human-interpretable patterns that
describe the data. Predictive mining tasks perform inference on the existing data in order to
make predictions. Clustering is one of the major descriptive data mining tasks. As mentioned,
clustering is partitioning of data into groups of analogous objects. Representing the data-set by
fewer clusters loses certain fine details, but achieves simplification.
Data modeling puts clustering in a historical viewpoint rooted in mathematics, statistics, and
numerical analysis. Clustering can be viewed as a density evaluation problem. From a machine
learning viewpoint clusters correspond to hidden patterns, the exploration for clusters is
unsupervised learning, and the resulting system represents a data concept. From a practical
perspective clustering plays a marvelous role in data mining applications such as scientific data
exploration, information retrieval and text mining, spatial database applications, Web analysis,
CRM, marketing, medical diagnostics, computational biology, and many others.
PAM ALGORITHM
The algorithm proceeds in two steps: an initial selection of k medoids, followed by an
iterative improvement phase in which medoids are swapped with non-medoid objects.
This is continued till the objective function can no longer be decreased. The algorithm is
as follows:
1. Initially select k random points as the medoids from the given n data points of the data
set.
2. Associate each data point with the closest medoid, using any of the most common
distance metrics.
3. For each pair of a non-selected object h and a selected object i, calculate the total
swapping cost TCih.
4. If the minimum TCih is negative, swap the roles of object h and medoid i, and repeat
from step 2. For each object affected by a swap, one of three cases occurs:
i. Reassignment: the object is reassigned to the cluster of another medoid that is now closer;
ii. Update of the current medoid: a new medoid oc is found to replace the current medoid oj;
iii. No change: objects in the current cluster result have the same or an even smaller square
error criterion (SEC) measure for all the possible redistributions considered.
Example:
For a given k=2, cluster the following data set using PAM.
Point   X   Y
1       7   6
2       2   6
3       3   8
4       8   5
5       7   4
6       4   7
7       6   2
8       7   3
9       6   4
10      3   4
Let us choose (3, 4) and (7, 4) as the medoids, and suppose we take the
Manhattan distance metric as the distance measure.
For (7, 6), Calculating the distance from the medoids chosen, this point is nearest
to (7, 4).
For (2, 6), Calculating the distance from the medoids chosen, this point is nearest
to (3, 4).
For (3, 8), Calculating the distance from the medoids chosen, this point is nearest
to (3, 4).
For (8, 5), Calculating the distance from the medoids chosen, this point is nearest
to (7, 4).
For (4, 7), Calculating the distance from the medoids chosen, this point is nearest
to (3, 4).
For (6, 2), Calculating the distance from the medoids chosen, this point is nearest
to (7, 4).
For (7, 3), Calculating the distance from the medoids chosen, this point is nearest
to (7, 4).
For (6, 4), Calculating the distance from the medoids chosen, this point is nearest
to (7, 4).
So, after the clustering, the clusters formed are: {(3,4), (2,6), (3,8), (4,7)} and {(7,4),
(6,2), (6,4), (7,3), (8,5), (7,6)}. Now we calculate the cost, which is the sum of the
distances of each non-selected point from the medoid of the cluster it belongs to.
Total Cost = cost((3, 4), (2, 6)) + cost((3, 4), (3, 8)) + cost((3, 4), (4, 7)) + cost((7, 4), (6,
2)) + cost((7, 4), (6, 4)) + cost((7, 4), (7, 3)) + cost((7, 4), (8, 5)) + cost((7, 4), (7, 6))
= 3 + 4 + 4 + 3 + 1 + 1 + 2 + 2 = 20.
Similarly, when we choose the point (7,3) as a medoid instead of the point (7,4), we get a
total cost of 22. Since 22 > 20, this swap increases the objective function and is rejected.
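The cost computation above can be verified with a few lines of Python; the points and both medoid choices are exactly those of the example:

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

points = [(7, 6), (2, 6), (3, 8), (8, 5), (7, 4),
          (4, 7), (6, 2), (7, 3), (6, 4), (3, 4)]

def total_cost(medoids):
    # Each point contributes its Manhattan distance to the nearest
    # medoid; the medoids themselves contribute zero.
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

print(total_cost([(3, 4), (7, 4)]))  # 20
print(total_cost([(3, 4), (7, 3)]))  # 22
```

Since swapping (7,4) for (7,3) raises the cost from 20 to 22, PAM rejects that swap.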
Algorithm 1: PAM BUILD: find the initial cluster centers.
(TD, m1) ← (∞, null);
foreach xj do
    TDj ← 0;
    foreach xo ≠ xj do TDj ← TDj + d(xo, xj);
    if TDj < TD then (TD, m1) ← (TDj, xj);    // smallest distance sum
for i = 1...k−1 do    // other medoids
    (ΔTD*, x*) ← (∞, null);
    foreach xj ∉ {m1,...,mi} do
        ΔTD ← 0;
        foreach xo ∉ {m1,...,mi, xj} do
            δ ← d(xo, xj) − min m∈{m1,...,mi} d(xo, m);
            if δ < 0 then ΔTD ← ΔTD + δ;
        if ΔTD < ΔTD* then (ΔTD*, x*) ← (ΔTD, xj);    // best reduction in TD
    (TD, mi+1) ← (TD + ΔTD*, x*);
return TD, {m1,...,mk};

Algorithm 2: PAM SWAP: iteratively improve the clustering.
repeat
    (ΔTD*, m*, x*) ← (0, null, null);
    foreach mi ∈ {m1,...,mk} do    // each medoid
        foreach xj ∉ {m1,...,mk} do    // each non-medoid
            ΔTD ← 0;
            foreach xo ∉ {m1,...,mk} \ {mi} do ΔTD ← ΔTD + Δ(xo, mi, xj);
            if ΔTD < ΔTD* then (ΔTD*, m*, x*) ← (ΔTD, mi, xj);
    break loop if ΔTD* ≥ 0;
    swap the roles of medoid m* and non-medoid x*;    // perform the best swap
    TD ← TD + ΔTD*;
return TD, M, C;
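The BUILD and SWAP phases above can be rendered compactly in Python. This is a straightforward sketch assuming Manhattan distance; for clarity the SWAP phase simply re-evaluates the total deviation of each candidate medoid set rather than using the incremental Δ of the pseudocode:

```python
def d(a, b):
    # Manhattan (L1) distance.
    return sum(abs(x - y) for x, y in zip(a, b))

def total_deviation(X, m):
    # TD: sum of distances from every point to its nearest medoid.
    return sum(min(d(x, mi) for mi in m) for x in X)

def build(X, k):
    # First medoid: the point with the smallest distance sum.
    m = [min(X, key=lambda c: sum(d(c, x) for x in X))]
    for _ in range(k - 1):
        best, best_dtd = None, float("inf")
        for xj in X:
            if xj in m:
                continue
            # Change in TD if xj were added as a medoid.
            dtd = sum(min(0, d(xo, xj) - min(d(xo, mi) for mi in m))
                      for xo in X if xo not in m and xo != xj)
            if dtd < best_dtd:
                best, best_dtd = xj, dtd
        m.append(best)
    return m

def pam(X, k):
    m = build(X, k)
    td = total_deviation(X, m)
    while True:  # SWAP: perform best-improvement swaps until no gain
        best = None
        for i in range(len(m)):
            for xj in X:
                if xj in m:
                    continue
                cand = m[:i] + [xj] + m[i + 1:]
                dtd = total_deviation(X, cand) - td
                if best is None or dtd < best[0]:
                    best = (dtd, cand)
        if best[0] >= 0:  # no swap decreases TD: a local optimum
            return m, td
        td, m = td + best[0], best[1]

X = [(7, 6), (2, 6), (3, 8), (8, 5), (7, 4),
     (4, 7), (6, 2), (7, 3), (6, 4), (3, 4)]
medoids, td = pam(X, 2)
print(td)  # 18
```

On the ten-point example data set this converges to a total deviation of 18, improving on the cost of 20 obtained with the initially guessed medoids (3, 4) and (7, 4).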
PAM TIME COMPLEXITY
EXPLANATION:
We have to compute the distance between each of the (n-k) non-medoid data points
and each of the k medoids in order to place the data points in their closest cluster.
After this, we tentatively replace each of the k medoids with each of the (n-k) non-
medoids, and for every such pair we re-compute the distances for the (n-k) remaining
objects to evaluate the cost of the swap.
The swap evaluation therefore traverses the (n-k) data points for each of the k(n-k)
medoid/non-medoid pairs, so the time complexity of one iteration is O(k(n-k)^2).
APPLICATIONS OF PAM
The principal application of PAM is to partition a data-set into a small number of clusters
for exploratory analysis of the observations. Because it accepts arbitrary dissimilarity
measures, it is also applicable where k-means is not, for example to non-Euclidean data.
ADVANTAGES OF PAM
1. It is simple to understand and easy to implement.
2. The k-medoids algorithm is fast and converges in a finite number of steps.
3. PAM is less sensitive to outliers than other partitioning algorithms.
PROBLEMS WITH PAM
1. Its time complexity of O(k(n-k)^2) per iteration makes it expensive for large values of
n and k, so PAM does not scale well to large data-sets.
2. The number of clusters k must be specified in advance; a poorly chosen k yields an
unnatural clustering and reduced accuracy.
3. The final clustering can depend on the initial choice of medoids, since the algorithm
only converges to a local optimum of the objective function.
CLARA VS K-MEDOIDS (PAM)
Instead of finding medoids for the entire data set, CLARA considers a small
sample of the data with fixed size (sampsize) and applies the PAM algorithm to
generate an optimal set of medoids for the sample. The quality of resulting
medoids is measured by the average dissimilarity between every object in the
entire data set and the medoid of its cluster, defined as the cost function.
CLARA repeats the sampling and clustering processes a pre-specified number of
times in order to minimize the sampling bias. The final clustering results
correspond to the set of medoids with the minimal cost.
CLARA can thus handle data-sets far larger than PAM can, trading a small loss of
accuracy for scalability. It can be regarded as a scalable extension of PAM.
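The sampling scheme described above can be sketched as follows. This is an illustrative stand-alone version: the inner clustering step is a tiny greedy placeholder for a real PAM call, and the parameter names (samples, sampsize, seed) mirror the description rather than any particular library:

```python
import random

def d(a, b):
    # Manhattan (L1) dissimilarity.
    return sum(abs(x - y) for x, y in zip(a, b))

def greedy_medoids(sample, k):
    # Placeholder for PAM on the sample: first the most central point,
    # then repeatedly the point farthest from the chosen medoids.
    m = [min(sample, key=lambda c: sum(d(c, x) for x in sample))]
    while len(m) < k:
        m.append(max((x for x in sample if x not in m),
                     key=lambda x: min(d(x, mi) for mi in m)))
    return m

def cost(X, medoids):
    # Average dissimilarity of every object in the ENTIRE data set
    # to the medoid of its cluster -- CLARA's quality measure.
    return sum(min(d(x, m) for m in medoids) for x in X) / len(X)

def clara(X, k, samples=5, sampsize=None, seed=0):
    rng = random.Random(seed)
    sampsize = sampsize or min(len(X), 40 + 2 * k)
    best = None
    for _ in range(samples):
        sample = rng.sample(X, sampsize)     # draw a fixed-size sample
        medoids = greedy_medoids(sample, k)  # cluster the sample only
        c = cost(X, medoids)                 # score on the full data
        if best is None or c < best[0]:
            best = (c, medoids)              # keep the cheapest medoids
    return best[1], best[0]
```

Because each clustering run touches only sampsize points, the expensive step no longer grows quadratically with n, which is why CLARA remains practical on data-sets where plain PAM is not.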
CONCLUSION
PAM is a robust partitional clustering method: because it chooses actual data points as
medoids and minimizes a sum of pairwise dissimilarities, it tolerates noise and outliers
better than k-means and can be used with arbitrary distance measures. Its main
drawbacks are the O(k(n-k)^2) per-iteration cost and the need to specify k in advance,
which make it impractical for large data-sets. Sampling-based extensions such as
CLARA retain the benefits of medoid-based clustering while scaling to much larger
data-sets.
REFERENCES
[1] L. Kaufman and P. J. Rousseeuw, "Clustering by means of medoids," in Statistical
Data Analysis Based on the L1-Norm and Related Methods, 1987.
[2] www.wikipedia.org
[3] www.mllearningprjoects.com
[4] https://www.datanovia.com/en/lessons/k-medoids-in-r-algorithm-and-practical-examples