Performance Evaluation of Distance Metrics in the Clustering Algorithms

Vijay Kumar¹
Jitender Kumar Chhabra²
Dinesh Kumar³

¹ Computer Science & Engineering Department, Manipal University, Jaipur, Rajasthan, INDIA
vijaykumarchahar@gmail.com
² Computer Engineering Department, National Institute of Technology, Kurukshetra, Haryana, INDIA
jitenderchhabra@gmail.com
³ CSE Department, Guru Jambheshwar University of Science & Technology, Hisar, Haryana, INDIA
dinesh_chutani@yahoo.com
Abstract. Distance measures play an important role in cluster analysis. No single distance measure best fits all types of clustering problems, so it is important to find the set of distance measures that yields optimal results for different clustering techniques on different datasets. In this paper, an attempt has been made to evaluate ten different distance measures on eight clustering techniques. The quality of the distance measures has been computed on the basis of three factors: accuracy, inter-cluster distance and intra-cluster distance. The performance of the clustering algorithms with different distance measures has been evaluated on three artificial and six real-life datasets. The experimental results reveal that the performance and quality of the distance measures vary with the nature of the data as well as with the clustering technique. Hence the choice of distance measure must be made on the basis of both the dataset and the clustering technique.

Keywords: Distance Measures; Clustering Algorithms; Ant Colony based Clustering; Modified Harmony Search Clustering.
2. The distance between two data points must be zero if and only if the two data points are identical, i.e., D(x_i, x_j) = 0 if and only if x_i = x_j.

3. The distance from x_i to x_j is the same as the distance from x_j to x_i, i.e., D(x_i, x_j) = D(x_j, x_i).

4. The distance measure must satisfy the triangle inequality, which is D(x_i, x_j) ≤ D(x_i, x_k) + D(x_k, x_j).

INFOCOMP, v. 13, no. 1, p. 38-51, June 2014.

Kumar, Chhabra and Kumar: Performance Evaluation of Distance Metrics in the Clustering Algorithms

The Manhattan distance between two data points is defined as the sum of the absolute differences of their coordinates. It is also known as the city block, rectilinear, taxicab or L1 distance. It is mathematically defined as:

D_Mn(x_i, x_j) = Σ_{l=1}^{d} |x_il − x_jl|    (7)

Clusters formed using the Manhattan distance tend to be rectangular in shape. When all the features
of the dataset are binary in nature, the Manhattan distance acts as a Hamming distance [7]. It is also a metric. Its advantage over the Euclidean distance is the reduced computation time [7]. Further, it does not depend upon translation and reflection of the coordinate system. Its one disadvantage is that it depends upon the rotation of the coordinate system.

where m_ik = x_ik − x̄_i, m_jk = x_jk − x̄_j, x̄_i = (1/d) Σ_{k=1}^{d} x_ik and x̄_j = (1/d) Σ_{k=1}^{d} x_jk. This measure is not a distance metric. It tends to disclose differences in shape rather than to detect the magnitude of the differences between two data points [21]. It is invariant to both scaling and translation.
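As a minimal sketch (with made-up points, not the paper's datasets), the Manhattan distance of Eq. (7) is just the sum of absolute coordinate differences; the second call illustrates the Hamming-distance behaviour on binary features noted above:

```python
import numpy as np

def manhattan(x_i, x_j):
    """City-block (L1) distance of Eq. (7): sum of absolute differences."""
    return np.sum(np.abs(np.asarray(x_i) - np.asarray(x_j)))

# Two illustrative data points
print(manhattan([1.0, 2.0, 3.0], [4.0, 0.0, 3.0]))  # 3 + 2 + 0 = 5.0

# On binary features it counts mismatching positions, i.e., the Hamming distance
print(manhattan([1, 0, 1, 1], [0, 0, 1, 0]))        # 2
```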
3.4 Mahalanobis Distance

Mahalanobis [12] introduced a new distance measure, named the Mahalanobis distance. It is also known as the quadratic distance. It is based on the correlations between variables, by which different patterns can be identified and analyzed. The Mahalanobis distance is defined as:

D_Ma(x_i, x_j) = (x_i − x_j) V⁻¹ (x_i − x_j)^T    (8)

where V is the covariance matrix. If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. It differs from the Euclidean distance in that it takes into account the correlations of the dataset and is scale-invariant. It can lead to violations of the triangle inequality and is sensitive to sampling fluctuations (Cherry et al., 1982). The Mahalanobis distance tends to form ellipsoidal clusters.

3.5 Cosine Distance

Cosine distance is a measure of dissimilarity between two vectors, obtained by measuring the cosine of the angle between them. It is defined as:

D_cos(x_i, x_j) = 1 − (x_i^T x_j) / (‖x_i‖ ‖x_j‖)    (9)

It is bounded between 0 and 1 if x_i and x_j are non-negative. It is used to measure cohesion within clusters [18]. This measure is not a distance metric and violates the triangle inequality. It is invariant to scaling but not to shifts, and it is unable to provide information on the magnitude of the differences.

3.7 Spearman Distance

The Spearman distance measure is derived from the Spearman correlation coefficient [5]. It can be defined as:

D_Spear(x_i, x_j) = 1 − SC(x_i, x_j)    (12)

SC(x_i, x_j) = Σ_{k=1}^{d} (m^r_ik)(m^r_jk) / √( Σ_{k=1}^{d} (m^r_ik)² · Σ_{k=1}^{d} (m^r_jk)² )    (13)

where m^r_ik = r(x_ik) − r̄ and m^r_jk = r(x_jk) − r̄. In the Spearman rank correlation, each data value is replaced by its rank when the data in each vector are ordered by value. The Pearson correlation is then computed between the two rank vectors instead of the data vectors. The Spearman rank correlation is an example of a non-parametric similarity measure. It is more robust against outliers than the Pearson correlation. Its disadvantage is that there is a loss of information when data are converted to ranks.

3.8 Chebyshev Distance

The Chebyshev distance calculates the maximum of the absolute differences between the features of a pair of data points. This distance is named after Pafnuty Chebyshev. It is also known as the Tchebyshev distance, maximum metric, chessboard distance, or L∞ metric. It is mathematically defined as:

D_Ch(x_i, x_j) = max_{1≤l≤d} |x_il − x_jl|    (14)

This distance measure is a metric. Its advantage is that it takes less time to compute the distances between data sets [15].
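As a rough sketch (made-up points, and SciPy's stock implementations rather than the authors' code), the measures of Eqs. (8), (9), (12) and (14) can be computed as follows; the identity covariance case illustrates the reduction of the Mahalanobis distance to the Euclidean distance noted above:

```python
import numpy as np
from scipy.spatial.distance import chebyshev, cosine, mahalanobis
from scipy.stats import spearmanr

x_i = np.array([1.0, 2.0, 3.0, 4.0])
x_j = np.array([2.0, 1.0, 4.0, 3.0])

# Chebyshev distance, Eq. (14): largest absolute coordinate difference
d_ch = chebyshev(x_i, x_j)            # max(1, 1, 1, 1) = 1.0

# Cosine distance, Eq. (9)
d_cos = cosine(x_i, x_j)              # 1 - 28/30 ≈ 0.0667

# Spearman distance, Eq. (12): 1 - rank correlation
d_sp = 1.0 - spearmanr(x_i, x_j)[0]   # rank correlation is 0.6, so 0.4

# Mahalanobis distance, Eq. (8), with V the identity matrix:
# it then reduces to the Euclidean distance, here sqrt(4) = 2.0
VI = np.eye(4)                        # inverse covariance V^-1
d_ma = mahalanobis(x_i, x_j, VI)
```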
The Bray-Curtis distance is also known as the Sorensen distance [2]. This measure is computed using the absolute differences divided by the summation. It is defined as follows:

D_Bc(x_i, x_j) = Σ_{l=1}^{d} |x_il − x_jl| / Σ_{l=1}^{d} (x_il + x_jl)    (16)

This distance measure is not a metric, as it does not satisfy the triangle inequality property. Its main drawback is that it is undefined if both data points have values near zero.

4 Experimental Results

This section provides a description of the datasets and demonstrates the efficiency of well-known clustering algorithms based on ten different distance measures. The results are evaluated and compared using widely accepted performance evaluation metrics such as accuracy, inter-cluster distance and intra-cluster distance [4]. A larger value of the accuracy measure indicates better clustering; likewise, a smaller intra-cluster distance and a larger inter-cluster distance indicate better clustering. All the results are reported in terms of mean and standard deviation. The standard deviation is used as a measure of robustness and is shown in parentheses.

The K-Means and K-Medoid algorithms were executed for 100 iterations. The parameters of ACOC are as follows: evaporation rate = 0.1, number of ants = 20, and maximum number of iterations = 100, as mentioned in [17]. The parameters of MHSC are as follows: harmony memory size = 15 and maximum number of iterations = 100; the pitch adjustment rate, harmony memory consideration rate and bandwidth are chosen as in [9, 10]. The value of K, the number of clusters, for each dataset equals the number of classes of the corresponding dataset, as mentioned in Table 1.

4.3 Experimentation 1: Effect of distance measures on Hierarchical Techniques

Tables 2-10 show the effect of the distance measures on accuracy for the Sp_5_2, Sp_6_2, Sp_4_3, Iris, Wine, Glass, Haberman, Bupa and Libras datasets, respectively. The results reported in the tables are average values obtained over ten runs of the algorithms. Figures 1-2 show the effect of the distance measures on inter- and intra-cluster distance.

Figure 1: Effect of distance measures on inter-cluster distance for hierarchical techniques; (a) Sp_5_2 (b) Sp_6_2 (c) Sp_4_3 (d) Iris (e) Wine (f) Glass (g) Glass (h) Haberman (i) Bupa.
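The paper does not spell out the exact formulas it uses for the inter- and intra-cluster distances; one common centroid-based variant, sketched here with NumPy under that assumption, is the mean distance of points to their own centroid (intra) and the minimum distance between centroids (inter):

```python
import numpy as np

def intra_cluster(points_by_cluster):
    """Mean Euclidean distance of each point to its own cluster centroid."""
    dists = []
    for pts in points_by_cluster:
        pts = np.asarray(pts, dtype=float)
        centroid = pts.mean(axis=0)
        dists.extend(np.linalg.norm(pts - centroid, axis=1))
    return float(np.mean(dists))

def inter_cluster(points_by_cluster):
    """Minimum pairwise Euclidean distance between cluster centroids."""
    cents = [np.asarray(p, dtype=float).mean(axis=0) for p in points_by_cluster]
    return min(float(np.linalg.norm(a - b))
               for i, a in enumerate(cents) for b in cents[i + 1:])

clusters = [[[0, 0], [0, 2]],     # centroid (0, 1)
            [[10, 0], [10, 2]]]   # centroid (10, 1)
print(intra_cluster(clusters))    # every point is 1 away from its centroid -> 1.0
print(inter_cluster(clusters))    # distance between the two centroids -> 10.0
```

A smaller intra-cluster value and a larger inter-cluster value indicate better clustering, matching the criterion stated above.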
Dist. Meas.   S.Lin.          C.Lin.          A.Lin.          W.Lin.
Eucl.         0.596 (0.000)   0.948 (0.000)   0.940 (0.000)   0.956 (0.000)
S.Eucl.       0.588 (0.000)   0.848 (0.000)   0.948 (0.000)   0.948 (0.000)
Manh.         0.356 (0.000)   0.940 (0.000)   0.944 (0.000)   0.860 (0.000)
Mahal.        0.588 (0.000)   0.928 (0.000)   0.948 (0.000)   0.912 (0.000)
Cos.          0.464 (0.000)   0.488 (0.000)   0.508 (0.000)   0.556 (0.000)
Corr.         0.400 (0.000)   0.416 (0.000)   0.404 (0.000)   0.408 (0.000)
Spear.        0.400 (0.000)   0.400 (0.000)   0.400 (0.000)   0.400 (0.000)
Cheb.         0.420 (0.000)   0.892 (0.000)   0.932 (0.000)   0.896 (0.000)
Canb.         0.216 (0.000)   0.736 (0.000)   0.944 (0.000)   0.812 (0.000)
Bray          0.204 (0.000)   0.964 (0.000)   0.964 (0.000)   0.720 (0.000)

Dist. Meas.   S.Lin.          C.Lin.          A.Lin.          W.Lin.
Eucl.         1.000 (0.000)   1.000 (0.000)   1.000 (0.000)   1.000 (0.000)
S.Eucl.       1.000 (0.000)   1.000 (0.000)   1.000 (0.000)   1.000 (0.000)
Manh.         1.000 (0.000)   1.000 (0.000)   1.000 (0.000)   1.000 (0.000)
Mahal.        0.258 (0.000)   0.520 (0.000)   0.508 (0.000)   0.398 (0.000)
Cos.          0.358 (0.000)   0.323 (0.000)   0.338 (0.000)   0.340 (0.000)
Corr.         0.300 (0.000)   0.310 (0.000)   0.308 (0.000)   0.290 (0.000)
Spear.        0.302 (0.000)   0.302 (0.000)   0.302 (0.000)   0.302 (0.000)
Cheb.         1.000 (0.000)   1.000 (0.000)   1.000 (0.000)   1.000 (0.000)
Canb.         0.495 (0.000)   0.358 (0.000)   0.385 (0.000)   0.375 (0.000)
Bray          0.748 (0.000)   0.335 (0.000)   0.330 (0.000)   0.345 (0.000)
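The linkage columns in these tables (S.Lin., C.Lin., A.Lin., W.Lin.) presumably denote single, complete, average and Ward linkage. A minimal SciPy sketch of running hierarchical clustering under several of the evaluated distance measures (made-up data, not the paper's Sp_* or UCI datasets; note that SciPy's Ward linkage is defined for Euclidean distance and takes raw observations):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Two tight, well-separated groups pointing in different directions,
# so that both magnitude- and angle-based measures can separate them
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([5.0, 0.0], 0.05, (10, 2)),
               rng.normal([0.0, 5.0], 0.05, (10, 2))])

labels = {}
for metric in ["euclidean", "cityblock", "chebyshev", "cosine", "braycurtis"]:
    D = pdist(X, metric=metric)                  # condensed distance matrix
    for method in ["single", "complete", "average"]:
        Z = linkage(D, method=method)
        labels[(metric, method)] = fcluster(Z, t=2, criterion="maxclust")

# Ward linkage operates on the raw (Euclidean) observations
labels[("euclidean", "ward")] = fcluster(linkage(X, method="ward"),
                                         t=2, criterion="maxclust")
```

On such clean, well-separated groups every metric/linkage combination recovers the two original groups; the tables above show that on real datasets the choice of distance measure matters considerably more.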