

Chameleon 2: An Improved Graph-Based Clustering Algorithm for Finding Structure in Complex Data

Tomas Barton, Tomas Bruna, and Pavel Kordik

Abstract: Traditional clustering algorithms fail to produce human-like results, either because they are limited to the discovery of spherically shaped clusters in the data or because they are too sensitive to noise. We propose an improved version of the Chameleon algorithm which overcomes several drawbacks of the previous version. Furthermore, a cutoff method which produces a high-quality clustering without requiring human interaction is presented. The proposed algorithm is compared with state-of-the-art algorithms on a benchmark that consists of artificial and real-world datasets.

Index Terms: clustering, graph clustering, cluster analysis, chameleon, clustering algorithm

I. INTRODUCTION

CLUSTERING is a technique that tries to group similar objects into the same groups, called clusters, and dissimilar objects into different clusters [1]. There is no general consensus on what exactly a cluster is; in fact, it is generally acknowledged that the problem is ill-defined [2]. Various algorithms use slightly different definitions of a cluster, e.g. based on the distance to the closest cluster center or on the density of points in its neighborhood. Unlike supervised learning, where labeled data are used to train a model which is afterwards used to classify unseen data, clustering belongs to the category of unsupervised problems. Clustering is a more difficult and challenging problem than classification [3]. Clustering has applications in many fields including data mining, machine learning, marketing, biology, chemistry, astronomy, psychology and spatial database technology.
Probably due to this interdisciplinary scope, most respected authors ([4], [5]) define clustering in a vague way, leaving space for several interpretations. Generally, clustering algorithms are designed to capture the notion of grouping in the same way as a human observer does. The ultimate goal would be the detection of structures in higher dimensions, where humans fail. How to evaluate such methods is another problem. In this contribution we focus on patterns that are easily detected by a human; yet many algorithms fail such a test. A detailed overview of many algorithms and their applications, including recently proposed clustering methods, can be found in [6]. So far hundreds of algorithms have been proposed; some assign items to exactly one cluster, others allow fuzzy assignment to many clusters. There are methods based on cluster prototypes, mixture models, graph structures, and density or grid based models.

T. Barton is with the Institute of Molecular Genetics of the ASCR, v. v. i., Videnska 1083, Prague 4, Czech Republic, and with the Czech Technical University in Prague, Faculty of Information Technology. E-mail: tomas.barton@fit.cvut.cz.
P. Kordik and T. Bruna are with the Faculty of Information Technology, Czech Technical University in Prague, Thakurova 9, Prague 6, Czech Republic.
Manuscript received April XX, 2015; revised September XX, 2015.

The Chameleon algorithm was originally introduced by Karypis et al. [7] in 1999. It is a graph-based hierarchical clustering algorithm which tries to overcome the limitations of traditional clustering approaches. The main idea is based on a combination of the approaches used by the CURE [8] and ROCK [9] algorithms (see Section II for more details). CURE ignores information about the interconnectivity of two clusters, while ROCK ignores the closeness of two clusters and emphasizes only their interconnectivity. Chameleon tries to resolve these drawbacks by combining the proximity and connectivity of items, as well as by taking internal cluster properties into account during the merging phase. Unlike other algorithms, Chameleon produces results that are as human-like as possible.
The well-known k-means algorithm, usually referred to as Lloyd's algorithm, was proposed in 1957, although published only in 1982 [10]. The algorithm itself was independently discovered in several fields [11], [12]. Since that time many algorithms have been proposed, but k-means remains quite popular due to its simplicity and nearly linear computational complexity. This fact also emphasizes the difficulty of designing a general purpose clustering algorithm [3].
We implemented a modified version of the algorithm which is capable of finding complex structures in various datasets with a minimal error rate. In this article, we describe the modifications and compare the altered version with the original implementation as well as with other clustering algorithms. Furthermore, in the last chapter, we focus on the automatic selection of a high-quality clustering during the process of merging clusters.
Based on the description in the original paper [7], we were unable to reproduce exactly the same results as were presented in the study. Nonetheless, with several modifications our implementation works at least as well as the original one (the comparison is based solely on a visualization of the results on several datasets; there are no quantifiable results available).
II. RELATED WORK

Chameleon gained a lot of attention in the literature; however, most authors mention the algorithm only as an interesting graph-based clustering approach and do not investigate its properties further. This is probably due to the fact that a complete implementation of the algorithm is not available.
Hierarchical agglomerative clustering (HAC) is one of the oldest clustering algorithms; it uses a bottom-up approach, repeatedly merging the closest items and producing a hierarchical data structure. Lance and Williams [13] proposed formulas for a faster computation of the similarity of merged items; still, the algorithm has a high time complexity, at least O(n^2), while not providing high-quality results.

Fig. 1. Overview of the Chameleon approach: from the data set, construct a sparse k-nearest neighbor graph, partition the graph, and merge the partitions into the final clusters. Diagram courtesy of Karypis et al. [7].

There are several methods for computing the similarity of clusters, usually referred to as single-link (HAC-SL), complete-link (HAC-CL), average-link (HAC-AL) and Ward's linkage (HAC-WL). For discovering clusters of arbitrary shapes, the most suitable method is single-linkage, which computes the similarity of clusters from their closest items; however, this method cannot deal with outliers and noise [14].
Another group of clustering algorithms models the density function by a probabilistic mixture model. It is assumed that the data follow some distribution model and each cluster is described by one or more mixture components [15]. The Expectation-Maximization (EM) algorithm was proposed by Dempster et al. [16] in 1977 and is often used to infer the parameters of mixture models.
Jarvis and Patrick [17] proposed in 1973 an algorithm that defines the similarity of items based on the number of neighbors they share. A very similar approach is taken by DBSCAN [18], a density based algorithm proposed by Ester et al. in 1996. DBSCAN is capable of discovering clusters of arbitrary shapes, provided that the cluster density can be determined beforehand and each cluster has a uniform distribution. A cluster is defined as a maximal set of density-connected points, where every core point in a cluster must have at least a minimum number of points (MinPts) within a given radius (ε). All points within one cluster can be reached by traversing a path of density-connected points. The algorithm itself can be relatively fast; however, in order to configure its parameters properly, prior knowledge of the dataset is required, or a run of the k-NN algorithm is needed to estimate an appropriate parameter setting. The main disadvantage of DBSCAN is its sensitivity to the parameter setting; even a small modification of the ε parameter can cause all data points to be assigned to a single cluster. Moreover, the algorithm fails on datasets with non-uniform density.
CURE (Clustering Using REpresentatives) [8] is a clustering algorithm that uses a variety of techniques to address various drawbacks of the agglomerative clustering method. A cluster is represented by multiple points; the first point to be chosen is the farthest point from the center of the cluster. Once the representatives are chosen, they are shrunk towards the center of the cluster by a user-defined factor. This helps to moderate the effect of outliers; the absolute shift of each point is bigger for points lying farther out. As the clustering method CURE uses hierarchical agglomerative clustering, with a distance function defined as the minimum distance between any two representative points of the two clusters. However, this algorithm requires the specification of the number of clusters we would like to find (a parameter k). There are two phases of eliminating outliers. The first one occurs when approximately 1/3 of the desired k clusters is reached: slowly growing clusters are removed because they might potentially contain only outliers. The second phase of elimination is done when the desired number of clusters is reached; at this point all small clusters are removed. CURE is more robust to outliers than a hierarchical agglomerative clustering algorithm, and it also manages to identify clusters that have non-spherical shapes and wide variances in size.
Strehl and Ghosh [19] proposed in 2002 several ensemble approaches for combining multiple clusterings into a single final solution. The Cluster-based Similarity Partitioning Algorithm (CSPA) uses METIS [20] for partitioning the similarity matrix into k components.
Shatovska et al. [21] proposed an improved similarity measure in 2007 which provides results similar to those presented in the Chameleon paper. The formula is further discussed in Section IV-A1. Shatovska was not able to reproduce the same results as were described by Karypis et al. [7].
Liu et al. [22] reported that Chameleon gives the best results on an artificial dataset with a skewed distribution, when compared with the clusterings provided by the k-means [10], DBSCAN [18] and hierarchical agglomerative clustering algorithms.
III. ORIGINAL CHAMELEON ALGORITHM

In order to find clusters, Chameleon uses a three-step approach. The first two steps aim to partition the data into many small subclusters, and the last one repeatedly merges these subclusters into the final result. The whole process is illustrated in Figure 1.
The Chameleon algorithm works with a graph representation of the data. At first, we construct a graph using the k-NN method: a data point, represented by a node, is connected with its k nearest neighbors by edges. If a graph is already given, we can skip this step and continue with the second one.
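As an illustration, the following sketch builds such a k-nearest neighbor graph. It uses scikit-learn, which is an assumption of ours (the paper does not prescribe a library), and the 1/(1+d) conversion of distances to edge weights is our choice; the paper only states that edge weights come from the k-NN construction.

# Hypothetical sketch of the k-NN graph construction step (not the authors' code).
import numpy as np
from sklearn.neighbors import kneighbors_graph

def build_knn_graph(X, k):
    """Return a symmetric sparse adjacency matrix with similarity weights."""
    # Sparse matrix of distances to the k nearest neighbors of every point.
    dist = kneighbors_graph(X, n_neighbors=k, mode="distance", include_self=False)
    # Convert distances to similarities; 1/(1+d) is an assumed weighting scheme.
    sim = dist.copy()
    sim.data = 1.0 / (1.0 + dist.data)
    # Symmetrize: keep an edge if either endpoint lists the other as a neighbor.
    sim = sim.maximum(sim.T)
    return sim

# Usage: graph = build_knn_graph(np.random.rand(100, 2), k=10)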
The second step partitions the previously created graph. The goal of the partitioning is to produce equal-sized partitions and to minimize the number of edges cut. In this manner, many small clusters are created, each containing a few highly connected nodes. For the partitioning, Karypis et al. use their own hyper-graph partitioning algorithm called hMETIS, a multilevel partitioning algorithm which works with a coarsened version of the graph. Coarsening is not deterministic, thus for each


run we might obtain a slightly different result (for further details please see [23] and [24]).
The final and most important step merges the partitioned clusters using Chameleon's dynamic modeling framework, starting from the clusters with the highest similarity. The general idea is to begin with connected clusters that are close to each other and that have similar densities. The original Chameleon uses the following function to compute the similarity of two clusters C_i and C_j that are about to be merged [7]:
Sim(C_i, C_j) = RCL(C_i, C_j)^{\alpha} \cdot RIC(C_i, C_j)^{\beta} \qquad (1)

where RCL stands for the relative cluster closeness, RIC denotes the relative cluster interconnectivity, and α, β are user-specified parameters that give higher importance either to the relative closeness or to the relative interconnectivity.
The relative interconnectivity and closeness are computed from external and internal properties which describe the relations between clusters and among the items inside a cluster, respectively.

RCL(C_i, C_j) = \frac{\bar{S}(C_i, C_j)}{\frac{|EC_i|}{|EC_i| + |EC_j|}\,\bar{S}(C_i) + \frac{|EC_j|}{|EC_i| + |EC_j|}\,\bar{S}(C_j)} \qquad (2)

RIC(C_i, C_j) = \frac{\hat{S}(C_i, C_j)}{\frac{\hat{S}(C_i) + \hat{S}(C_j)}{2}} = \frac{2\,\hat{S}(C_i, C_j)}{\hat{S}(C_i) + \hat{S}(C_j)} \qquad (3)

where |EC_{i,j}| and |EC_i| are interconnectivity properties: the number of edges between clusters C_i and C_j, respectively the number of edges inside cluster C_i. \bar{S}(C_i, C_j) and \bar{S}(C_i) denote closeness properties: the average weight of the edges between clusters C_i and C_j, respectively the average weight of the edges removed by the bisection of cluster C_i (an example is shown in Figure 2). \hat{S} denotes the corresponding sums of edge weights:

BC_i = bisect(C_i) \qquad (4)

\hat{S}(C_i) = \sum_{e \in BC_i} w(e) \qquad (5)

\bar{S}(C_i) = \frac{1}{|BC_i|}\,\hat{S}(C_i) \qquad (6)

BC_i is the set of edges selected by the bisection algorithm; removing those edges would split the cluster's graph into two components (clusters). w(e) is the weight of a given edge, as computed during the k-NN graph construction. \hat{S}(C_i, C_j) is defined analogously as the sum of the weights of the edges connecting C_i and C_j.
It is important to note that the internal properties are computed via bisection. Therefore, the quality of the bisection algorithm determines how well the internal properties are computed.

Fig. 2. Example of finding a bisection between clusters C_i and C_j; \hat{S}(C_i, C_j) is computed as the sum of the weights of the edges between the clusters (marked with a dashed line).

Algorithm 1 Original Chameleon
1: procedure ORIGINALCHAMELEON(dataset)
2:     graph ← knnGraph(k, dataset)
3:     partitioning ← hMETIS(graph)
4:     return mergePairs(partitioning)
5: end procedure

Algorithm 2 Chameleon 2
1: procedure CHAMELEON2(dataset)
2:     graph ← knnGraph(k, dataset)
3:     partitioning ← recursiveBisection(graph)
4:     partitioning ← floodFill(partitioning)
5:     return mergePairs(partitioning)
6: end procedure

IV. CHAMELEON 2 ALGORITHM
1) Partitioning algorithms: In the partitioning phase, we experimented with the original partitioning algorithm hMETIS [23] and the proposed partitioning method based on recursive Fiduccia-Mattheyses bisection [25]. The recursive bisection process is thoroughly described in [26]. A comparison of the two methods is shown in Table I. The difference in speed and quality is not surprising: hMETIS is a multilevel partitioning algorithm which works with a coarsened version of the graph, while our implementation works with the original graph. The coarsened version contains fewer nodes and edges, therefore the process is faster. However, some information may be missing in the coarsened approximation, which can result in a worse partitioning. The resulting procedure is summarized in Algorithm 2.


2) Partitioning refinement: Unfortunately, disconnected clusters can occur in the partitioning even when the edge weights are ignored. To fix this situation, we use a simple method called flood fill, which finds the connected components of the graph. The partitioning refinement with the flood fill algorithm is shown in Algorithm 3. The result of the described algorithm is a list of connected components, where each component is represented by a list of nodes.
A. Merging

The most significant modification was made in the merging phase. During merging, Chameleon chooses the most similar pair of clusters and merges them. The choice is based on a function which evaluates the similarity between every pair of clusters, the so-called similarity measure.


TABLE I
COMPARISON OF THE PARTITIONING METHODS

Computational time
Dataset size | hMETIS | RB with F-M

TABLE II
CHAMELEON 2 PARAMETERS

Parameter  | Description                 | Default value
k          | number of neighbors (k-NN)  | 2 log(n)
psize      | max. partition size         | max{5, n/100}
α          | closeness priority          | 2.0
β          | interconnectivity priority  | 1.0
similarity | determines merging order    | Shatovska
Algorithm 3 Flood Fill
1: procedure FLOODFILL(graph)
2:     partitions ← []
3:     for all node ∈ graph do
4:         partition ← []
5:         Fill(node, partition)
6:         partitions ← partitions ∪ partition
7:     end for
8: end procedure
9: procedure FILL(node, partition)
10:     if node.marked then
11:         return
12:     end if
13:     node.marked ← true
14:     partition ← partition ∪ node
15:     for all neighbor of node do
16:         Fill(neighbor, partition)
17:     end for
18:     return
19: end procedure
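For reference, a compact iterative variant of Algorithm 3 in Python (our sketch, not the authors' implementation); it uses an explicit stack instead of recursion to avoid deep call chains on large partitions.

# Hypothetical iterative flood fill over an adjacency-list graph.

def flood_fill(adjacency):
    """adjacency: dict mapping node -> iterable of neighbor nodes.
    Returns a list of connected components, each a list of nodes."""
    marked = set()
    partitions = []
    for start in adjacency:
        if start in marked:
            continue
        partition = []
        stack = [start]
        while stack:
            node = stack.pop()
            if node in marked:
                continue
            marked.add(node)
            partition.append(node)
            # Visit neighbors, mirroring the recursive Fill procedure.
            stack.extend(n for n in adjacency[node] if n not in marked)
        partitions.append(partition)
    return partitions

# Usage: flood_fill({0: [1], 1: [0], 2: []}) -> [[0, 1], [2]]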

1) Improved similarity measure: Shatovska et al. [21] propose a modified similarity measure which proved to be more robust than the original function based only on interconnectivity and closeness. We incorporated this function into the Chameleon algorithm, and the experiments showed that the achieved results are always at least as good as with the original measure, and most of the time even better.
The whole improved formula [21] can be written as:

Sim_{shat}(C_i, C_j) = RCL_S(C_i, C_j)^{\alpha} \cdot RIC_S(C_i, C_j)^{\beta} \cdot \gamma(C_i, C_j) \qquad (7)

RCL_S(C_i, C_j) = \frac{\bar{s}(C_i, C_j)}{\frac{|EC_i|}{|EC_i| + |EC_j|}\,\bar{s}(C_i) + \frac{|EC_j|}{|EC_i| + |EC_j|}\,\bar{s}(C_j)} \qquad (8)

RIC_S(C_i, C_j) = \frac{\min\{\bar{s}(C_i), \bar{s}(C_j)\}}{\max\{\bar{s}(C_i), \bar{s}(C_j)\}} \qquad (9)

\gamma(C_i, C_j) = \frac{|EC_{i,j}|}{\min(|EC_i|, |EC_j|)} \qquad (10)

where \bar{s}(C_i) is defined as the average weight of the edges inside a cluster:

\bar{s}(C_i) = \frac{1}{|EC_i|} \sum_{e \in C_i} w(e) \qquad (11)

and \bar{s}(C_i, C_j) is computed analogously from the edges between the clusters C_i and C_j.
The improved measure also incorporates internal and external cluster properties to determine the relative similarity, but in a slightly different way. The internal cluster properties are computed from all edges inside the cluster. This way, the measure does not have to rely on the quality of the bisection algorithm. Also, not having to bisect every cluster during the merging phase saves quite a lot of computational time.
Additionally, the result is multiplied by the quotient of the cluster densities: the density of the sparser cluster is divided by the density of the denser one. This further encourages the merging of clusters with similar densities.
A comparison of the results achieved by the original and improved similarity measures is provided in the next section.
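A minimal sketch of Equations (7)-(11) follows, again with hypothetical helper names. The internal statistics are taken over all edges of a cluster, so no bisection is needed; the cross-cluster term is taken as the average edge weight here, mirroring Eq. (8), and the default exponents follow our reading of Table II.

# Hypothetical sketch of the Shatovska-style similarity, Eqs. (7)-(11).

def avg_weight(weights):
    """Average edge weight (Eq. 11); 0.0 for an empty edge set."""
    return sum(weights) / len(weights) if weights else 0.0

def shatovska_similarity(ci_weights, cj_weights, cross_weights, alpha=2.0, beta=1.0):
    """ci_weights, cj_weights: weights of all edges inside clusters Ci and Cj.
    cross_weights: weights of the edges connecting the two clusters."""
    s_i, s_j = avg_weight(ci_weights), avg_weight(cj_weights)
    s_ij = avg_weight(cross_weights)
    n_i, n_j = len(ci_weights), len(cj_weights)
    # Relative closeness (Eq. 8): cross-edge average vs. weighted internal averages.
    rcl = s_ij / ((n_i * s_i + n_j * s_j) / (n_i + n_j))
    # Relative interconnectivity (Eq. 9): ratio of the sparser to the denser cluster.
    ric = min(s_i, s_j) / max(s_i, s_j)
    # Density factor gamma (Eq. 10): cross-edge count vs. the smaller cluster.
    gamma = len(cross_weights) / min(n_i, n_j)
    return (rcl ** alpha) * (ric ** beta) * gamma

A production version would additionally guard the divisions for clusters without internal edges; that case is handled separately by the algorithm, as described in the next subsection.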
2) Clusters formed by one node: Independently of the similarity measure chosen, problems arise while merging clusters formed by individual items. Since they contain no edges, the internal characteristics of such clusters are impossible to determine. Chameleon's similarity measures rely heavily on the internal characteristics, and without them clusters with only one node are often merged incorrectly, deteriorating the overall result.
Even when the partitioning algorithm is set to make strictly balanced partitions with the same number of items in each cluster, the partitioning refinement described in Section IV-2 can still produce clusters formed by just one node. Therefore, the problem has to be solved during the merging phase; our solution is described below.
When computing the similarity of a cluster pair in which one of the clusters contains no edges, only the external properties of the pair are computed. The resulting cluster similarity is set to the external similarity multiplied by a constant. This multiplication increases the similarity of all pairs containing single-item clusters and causes these clusters to merge with their neighbors in the early stages of the merging process. This way, clusters with one node are quickly merged with the nearest cluster and do not cause problems later on. The multiplying constant chosen for our implementation is 1000, but any number that significantly enlarges the computed external properties would work.
V. DATASETS

Our benchmark intentionally contains mostly 2D and 3D datasets with structure that is clearly distinguishable by a human; some of the datasets contain noise as well.


TABLE III
DATASETS USED FOR EXPERIMENTS

Dataset            | d | n     | classes | source
aggregation        | 2 | 788   | 7       | [27]
atom               | 3 | 800   | 2       | [28]
chainlink          | 3 | 1000  | 2       | [28]
chameleon-t4.8k    | 2 | 8000  | 7 (1)   | [29]
chameleon-t5.8k    | 2 | 8000  | 9 (1)   | [29]
chameleon-t7.10k   | 2 | 10000 | 8 (1)   | [29]
chameleon-t8.8k    | 2 | 8000  | 8 (1)   | [29]
compound           | 2 | 399   | 6       | [30]
cure-t2-4k         | 2 | 4000  | 7       | [8] (2)
D31                | 2 | 3100  | 31      | [31]
DS-850             | 2 | 850   | 5       | [32]
diamond9           | 2 | 3000  | 9       | [33]
flame              | 2 | 240   | 2       | [34]
jain               | 2 | 373   | 2       | [35]
long1              | 2 | 1000  | 2       | [36]
longsquare         | 2 | 900   | 6       | [36]
lsun               | 2 | 400   | 3       | [28]
pathbased          | 2 | 300   | 3       | [37]
s-set1             | 2 | 5000  | 15      | [38]
spiralsquare       | 2 | 1500  | 6       | [36]
target             | 2 | 770   | 6       | [28]
triangle1          | 2 | 1000  | 4       | [36]
twodiamonds        | 2 | 800   | 2       | [28]
wingnut            | 2 | 1016  | 2       | [28]

(1) Labels were added manually; the original dataset did not contain any. Visualizations of the assigned labels can be found in Appendix A.
(2) The author no longer has this dataset; we generated similar data based on the images provided in the referenced paper.

An overview of all used datasets with their properties and references can be found in Table III. We tried to include the same datasets as were used in the original Chameleon paper; the datasets marked as DS1 and DS2 were not available. The first one comes from CURE [8]; based on the description in the paper we generated a similar dataset. Instead of DS2 we included the twodiamonds dataset from the FCPS suite [28]. The Chameleon datasets (t4.8k, t5.8k, t7.10k marked as DS3 in [7], and t8.8k marked as DS4 in [7]) are available for download at the CLUTO website [29] (a software package [39] from Karypis Labs that provides dynamic model clustering, however it does not directly offer the Chameleon algorithm). There are no labels provided for these datasets, thus we assigned labels manually.

Fig. 3. Sorted distances to the 4th nearest neighbor for the dataset chameleon-t4.8k. The purple line shows the ε value selected by our heuristic.
VI. EXPERIMENTS

We evaluated Chameleon 2 against several popular algorithms. For the evaluation of clusterings we used Normalized Mutual Information (NMI_sqrt) as defined by Strehl and Ghosh in [19]. NMI computes the agreement between a clustering and the ground-truth labels which we provided for each dataset. An NMI value of 1.0 means complete agreement of the clustering with the external labels, while 0.0 means no agreement at all. Another popular criterion for external evaluation is the Adjusted Rand Index, which would provide very similar results in this case.
It is not feasible to run a benchmark of our algorithm against every other existing algorithm. However, we tried to select a representative algorithm from several distinguishable groups of clustering algorithms. The complete results of our experiments can be found in Table IV. Each NMI value is the best possible result of the given algorithm when configured optimally. These experiments are aimed at exploring the boundaries of each algorithm; in a real-world scenario it is often hard to select the optimal configuration.
The k-means [10] algorithm was provided with the correct number of ground-truth classes (parameter k). The initialization is randomized and the reported NMI value is an average over 100 independent runs.
For DBSCAN [18] we followed the authors' recommendation regarding the estimation of ε: first we compute and sort the distances to the 4th nearest neighbor of each data point (see Figure 3). Then, using a simple heuristic, we search for an elbow in the distance values, which is typically located in the first third of the distances sorted from the highest to the lowest value. From the elbow area we choose 10 different ε values and run DBSCAN with MinPts in the interval from 4 to 10. From the 60 DBSCAN clusterings we select the one with the highest NMI value.
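The parameter search described above can be sketched as follows. This is a scikit-learn based sketch under our assumptions: the elbow detection shown is a simplified stand-in for the heuristic used in the paper, and the exact grid of candidate values is illustrative.

# Hypothetical sketch of the DBSCAN configuration search used in the benchmark.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import normalized_mutual_info_score

def best_dbscan_nmi(X, labels_true, n_eps=10, minpts_range=range(4, 11)):
    # Distance of every point to its 4th nearest neighbor, sorted descending.
    dist, _ = NearestNeighbors(n_neighbors=5).fit(X).kneighbors(X)
    d4 = np.sort(dist[:, 4])[::-1]
    # Simplified elbow: take candidate eps values from the first third of the curve.
    elbow_zone = d4[: len(d4) // 3]
    eps_candidates = np.linspace(elbow_zone.min(), elbow_zone.max(), n_eps)
    best = 0.0
    for eps in eps_candidates:
        for minpts in minpts_range:
            pred = DBSCAN(eps=eps, min_samples=minpts).fit_predict(X)
            # NMI_sqrt corresponds to the geometric averaging variant.
            nmi = normalized_mutual_info_score(labels_true, pred,
                                               average_method="geometric")
            best = max(best, nmi)
    return best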
VII. CUTOFF

The output of the Chameleon algorithm is a hierarchical structure of merges. This result can be useful on its own, but most of the time the user needs a single flat partitioning. To obtain it, we need to explore the hierarchical result and determine where the best single clustering lies.

A. Dendrogram representation

A hierarchical structure is best visualized via a dendrogram. In order to meaningfully visualize Chameleon's results, certain changes to the traditional dendrogram representation have to be made.
Firstly, the first level of the dendrogram does not represent individual items but the small clusters created by the partitioning phase.
Secondly, the height of the nodes has been redefined. Normally, the height of each cluster node is simply the distance between the merged clusters:

h(C_i) = d(C_x, C_y) \qquad (12)

In Chameleon, however, we redefine the height as:

h(C_i) = h(C_{i-1}) + d(C_x, C_y) \qquad (13)


TABLE IV
CLUSTERING BENCHMARK ON DATASETS USED IN THE LITERATURE (NMI)

Dataset          | Ch2-auto | Ch2-nd1 | Ch2-Std | DBSCAN | HAC-AL | HAC-CL | HAC-SL | HAC-WL | k-means
aggregation      | 0.99     | 0.99    | 0.98    | 0.98   | 0.97   | 0.87   | 0.89   | 0.94   | 0.85
atom             | 1.00     | 0.99    | 1.00    | 0.99   | 0.59   | 0.57   | 1.00   | 1.00   | 0.29
chainlink        | 1.00     | 0.99    | 1.00    | 1.00   | 0.55   | 0.51   | 1.00   | 0.55   | 0.07
chameleon-t4.8k  | 0.88     | 0.89    | 0.86    | 0.95   | 0.67   | 0.63   | 0.86   | 0.65   | 0.59
chameleon-t5.8k  | 0.82     | 0.87    | 0.82    | 0.94   | 0.82   | 0.68   | 0.80   | 0.82   | 0.77
chameleon-t7.10k | 0.86     | 0.90    | 0.90    | 0.97   | 0.68   | 0.63   | 0.87   | 0.67   | 0.58
chameleon-t8.8k  | 0.88     | 0.89    | 0.88    | 0.89   | 0.69   | 0.66   | 0.86   | 0.68   | 0.57
compound         | 0.96     | 0.95    | 0.95    | 0.92   | 0.85   | 0.82   | 0.85   | 0.82   | 0.72
cure-t2-4k       | 0.88     | 0.91    | 0.87    | 0.88   | 0.83   | 0.72   | 0.82   | 0.78   | 0.69
D31              | 0.96     | 0.96    | 0.94    | 0.88   | 0.95   | 0.95   | 0.87   | 0.95   | 0.92
diamond9         | 0.99     | 0.98    | 0.97    | 0.98   | 1.00   | 1.00   | 0.99   | 1.00   | 0.95
DS-850           | 0.98     | 0.95    | 0.99    | 0.98   | 0.98   | 0.62   | 0.99   | 0.69   | 0.57
flame            | 0.87     | 0.86    | 0.91    | 0.90   | 0.80   | 0.70   | 0.84   | 0.59   | 0.43
jain             | 1.00     | 0.93    | 1.00    | 0.89   | 0.70   | 0.70   | 0.86   | 0.52   | 0.37
long1            | 1.00     | 0.97    | 1.00    | 0.99   | 0.62   | 0.55   | 1.00   | 0.55   | 0.02
longsquare       | 0.98     | 0.97    | 0.98    | 0.94   | 0.90   | 0.83   | 0.93   | 0.84   | 0.81
lsun             | 1.00     | 0.99    | 1.00    | 1.00   | 0.82   | 0.83   | 1.00   | 0.73   | 0.54
pathbased        | 0.90     | 0.81    | 0.86    | 0.89   | 0.71   | 0.58   | 0.70   | 0.62   | 0.55
s-set1           | 1.00     | 0.98    | 1.00    | 0.97   | 0.98   | 0.97   | 0.96   | 0.98   | 0.95
spiralsquare     | 0.91     | 0.93    | 0.99    | 0.98   | 0.74   | 0.78   | 0.92   | 0.67   | 0.64
target           | 0.94     | 0.96    | 0.94    | 0.99   | 0.74   | 0.70   | 1.00   | 0.69   | 0.69
triangle1        | 1.00     | 0.97    | 1.00    | 1.00   | 0.98   | 0.91   | 1.00   | 1.00   | 0.93
twodiamonds      | 0.99     | 0.99    | 1.00    | 1.00   | 0.99   | 0.97   | 0.93   | 1.00   | 1.00
wingnut          | 0.97     | 0.91    | 0.97    | 1.00   | 1.00   | 1.00   | 1.00   | 1.00   | 0.77


Fig. 4. Chameleon's dendrograms on the flame dataset using the standard similarity (a) and the Shatovska similarity (b). It is easier to automatically find a reasonable cutoff for the latter dendrogram than when the standard similarity is used.

In both equations, h(C_i) represents the height of the cluster created at level i, and d(C_x, C_y) is the distance between the clusters at levels x and y which are merged into the cluster C_i. Distance is an inverse similarity measure and is therefore computed as:

d(C_x, C_y) = \frac{1}{Sim(C_x, C_y)} \qquad (14)

The main reason for this change is that over time the cluster similarity can increase, and thus the distance decreases. In the standard representation this would mean that the dendrogram would have to grow downwards, which would be confusing, and the search for a line that cuts the dendrogram would not make sense.
B. First Jump Cutoff

To find an optimal cut in the proposed structure, we came up with a simple yet effective method named First Jump Cutoff. The method is based on the idea that the first big jump in the dendrogram (the first large distance between levels) is the place where two clusters which should remain separate start to merge. Therefore, this is where the cut should be made.
To find the first jump, we compute the average distance between levels in the first half of the dendrogram. After that, we look for the first distance between levels in the second half which is at least 100 times greater than the computed average. If it is found, the cut is made between these levels. If not, we search for the first jump at least 50 times bigger, then 25 times bigger, and so forth. The whole method is illustrated in Algorithm 4 and sketched in code below.
The starting multiple of 100 and the division of the multiple by 2 were chosen arbitrarily. Our experiments showed that the algorithm is not very sensitive to these values, and that a starting multiple of 340 and a factor of 1.9 produce the best overall results on the chosen datasets.
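A compact Python rendering of Algorithm 4 follows (a sketch under our assumptions about the data layout; heights is the list produced by the previous sketch, and the loop's termination condition is slightly adapted so that it always ends with floating-point multipliers).

# Hypothetical sketch of the First Jump Cutoff over cumulative level heights.

def first_jump_cutoff(heights, init_multiplier=100.0, factor=2.0):
    """Return the height at which the dendrogram should be cut, or 0.0."""
    jumps = [b - a for a, b in zip(heights, heights[1:])]
    half = max(1, len(jumps) // 2)
    avg_jump = sum(jumps[:half]) / half        # average jump in the first half
    multiplier = init_multiplier
    while multiplier >= 1.0:
        threshold = multiplier * avg_jump
        # Look for the first sufficiently large jump in the second half.
        for i in range(half, len(jumps)):
            if jumps[i] > threshold:
                return heights[i]
        multiplier /= factor                   # relax the threshold and retry
    return 0.0

# Usage: first_jump_cutoff(cumulative_heights(similarities))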


Algorithm 4 First Jump Cutoff
1: procedure FIRSTJUMP(initMultiplier, factor)
2:     avgJump ← ComputeAvgJump()
3:     multp ← initMultiplier
4:     while multp > 0 do
5:         res ← FindBiggerJump(multp · avgJump)
6:         if res.found then
7:             return res.height
8:         else
9:             multp ← multp / factor
10:        end if
11:    end while
12:    return 0
13: end procedure
14: procedure FINDBIGGERJUMP(jump)
15:     result ← []
16:     for all level ∈ Dendrogram do
17:         levelWidth ← level.next.height − level.height
18:         if levelWidth > jump then
19:             result.found ← true
20:             result.height ← level.height
21:             return result
22:         end if
23:     end for
24:     result.found ← false
25:     return result
26: end procedure

Fig. 6. Clustering result of our Chameleon 1 implementation on the dataset chameleon-t7.10k, using hMETIS for bisection and a manual cutoff. Several tiny clusters are left over from the partitioning phase; the total number of discovered clusters is 15.

Fig. 7. Chameleon 2 result on the dataset chameleon-t7.10k using max. partition size = 10, k = 4 and a manual cutoff. The total number of discovered clusters is 10.


Fig. 5. Chameleon 2 result on the dataset chameleon-t8.8k using max. partition size = 10, k = 4 and a manual cutoff.

VIII. CONCLUSION

In this article we introduced several advanced clustering techniques, the original Chameleon algorithm, and our improved version called Chameleon 2. We described all the differences and improvements which aim to enhance the clustering produced by Chameleon 2, and we conducted several experiments. They showed both external evaluation scores and visual results which prove the superiority of our approach over other methods. We also proposed a method which is able to automatically select the best clustering result from a modified Chameleon dendrogram.
Our original goal was to create a fully automated algorithm which produces high-quality clusterings on diverse datasets without having to set any dataset-specific parameters. We achieved this and we managed to find a configuration which works with a minimal error rate on all of the tested data. However, by configuring each phase of the algorithm, Chameleon 2 is able to correctly identify clusters in basically any dataset. Therefore, Chameleon 2 can also be viewed as a general robust clustering framework which can be adjusted for a wide range of specific problems.

Fig. 8. Generated dataset cure-t2-4k with ground-truth labels, inspired by CURE's dataset (marked as DS1 in [7]). Many algorithms fail to identify the two upper ellipsoids due to the chain of points connecting them. Chameleon 2 provides the best result with α = 1 and β = 1.


APPENDIX A
DATASET VISUALIZATIONS

All datasets used in our experiments contain a distinguishable pattern. In the case of datasets contaminated by noise, the clusters are areas with a high density of data points.
ACKNOWLEDGMENT

We would like to thank Petr Bartunek, Ph.D. from the IMG CAS institute for supporting our research and letting us publish all details of our work. This research is partially supported by the CTU grant SGS15/117/OHK3/1T/18 "New data processing methods for data mining" and by Program NPU I (LO1419) of the Ministry of Education, Youth and Sports of the Czech Republic.
REFERENCES

[1] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, Inc., 1990.
[2] R. Caruana, M. Elhawary, N. Nguyen, and C. Smith, "Meta clustering," in Proceedings of the Sixth International Conference on Data Mining, ser. ICDM '06. Washington, DC, USA: IEEE Computer Society, 2006, pp. 107-118.
[3] A. K. Jain, "Data clustering: 50 years beyond k-means," Pattern Recognition Letters, 2010.
[4] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1988.
[5] B. S. Everitt, Cluster Analysis. Edward Arnold, 1993.
[6] C. C. Aggarwal and C. K. Reddy, Eds., Data Clustering: Algorithms and Applications. CRC Press, 2014.
[7] G. Karypis, E. Han, and V. Kumar, "Chameleon: Hierarchical clustering using dynamic modeling," Computer, vol. 32, no. 8, pp. 68-75, August 1999.
[8] S. Guha, R. Rastogi, and K. Shim, "CURE: An efficient clustering algorithm for large databases," in ACM SIGMOD Record, vol. 27, no. 2. ACM, 1998, pp. 73-84.
[9] S. Guha, R. Rastogi, and K. Shim, "ROCK: A robust clustering algorithm for categorical attributes," in ICDE, M. Kitsuregawa, M. P. Papazoglou, and C. Pu, Eds. IEEE Computer Society, 1999, pp. 512-521.
[10] S. Lloyd, "Least squares quantization in PCM," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129-137, 1982.
[11] G. Ball and D. Hall, "ISODATA: A novel method of data analysis and pattern classification," Stanford Research Institute, Menlo Park, Tech. Rep., 1965.
[12] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, L. M. L. Cam and J. Neyman, Eds., vol. 1. University of California Press, 1967, pp. 281-297.
[13] G. N. Lance and W. T. Williams, "A general theory of classificatory sorting strategies," The Computer Journal, vol. 9, no. 4, pp. 373-380, 1967.
[14] A. K. Jain, A. Topchy, M. H. C. Law, and J. M. Buhmann, "Landscape of clustering algorithms," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), Volume 1. Washington, DC, USA: IEEE Computer Society, 2004, pp. 260-263.
[15] G. McLachlan and K. Basford, Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.
[16] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), pp. 1-38, 1977.
[17] R. A. Jarvis and E. A. Patrick, "Clustering using a similarity measure based on shared near neighbors," IEEE Transactions on Computers, vol. 100, no. 11, pp. 1025-1034, 1973.
[18] M. Ester, H. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in KDD, E. Simoudis, J. Han, and U. M. Fayyad, Eds. AAAI Press, 1996, pp. 226-231.
[19] A. Strehl and J. Ghosh, "Cluster ensembles: A knowledge reuse framework for combining multiple partitions," Journal of Machine Learning Research (JMLR), vol. 3, pp. 583-617, December 2002.
[20] G. Karypis and V. Kumar, "A fast and high quality multilevel scheme for partitioning irregular graphs," SIAM J. Sci. Comput., vol. 20, pp. 359-392, December 1998. [Online]. Available: http://dx.doi.org/10.1137/S1064827595287997
[21] T. Shatovska, T. Safonova, and I. Tarasov, "A modified multilevel approach to the dynamic hierarchical clustering for complex types of shapes," in ISTA, ser. LNI, H. C. Mayr and D. Karagiannis, Eds., vol. 107. GI, 2007, pp. 176-186.
[22] Y. Liu, Z. Li, H. Xiong, X. Gao, and J. Wu, "Understanding of internal clustering validation measures," in ICDM, G. I. Webb, B. Liu, C. Zhang, D. Gunopulos, and X. Wu, Eds. IEEE Computer Society, 2010, pp. 911-916.
[23] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar, "Multilevel hypergraph partitioning: Applications in VLSI domain," IEEE Trans. Very Large Scale Integr. Syst., vol. 7, no. 1, pp. 69-79, Mar. 1999.
[24] G. Karypis and V. Kumar, "Multilevel k-way hypergraph partitioning," in DAC, 1999, pp. 343-348.
[25] C. M. Fiduccia and R. M. Mattheyses, "A linear-time heuristic for improving network partitions," in Proceedings of the 19th Design Automation Conference, ser. DAC '82. Piscataway, NJ, USA: IEEE Press, 1982, pp. 175-181.
[26] T. Bruna, "Implementation of the Chameleon clustering algorithm," Master's thesis, Czech Technical University in Prague, 2015.
[27] A. Gionis, H. Mannila, and P. Tsaparas, "Clustering aggregation," TKDD, vol. 1, no. 1, 2007.
[28] A. Ultsch and F. Morchen, "ESOM-Maps: Tools for clustering, visualization, and classification with Emergent SOM," Technical Report No. 46, Dept. of Mathematics and Computer Science, University of Marburg, Germany, 2005.
[29] G. Karypis, "Karypis Lab - CLUTO's datasets." [Online]. Available: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download
[30] C. Zahn, "Graph-theoretical methods for detecting and describing gestalt clusters," IEEE Trans. on Computers, vol. C-20, no. 1, pp. 68-86, Jan. 1971.
[31] C. J. Veenman, M. J. T. Reinders, and E. Backer, "A maximum variance cluster algorithm," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, pp. 1273-1280, 2002.
[32] M.-C. Su, C.-H. Chou, and C.-C. Hsieh, "Fuzzy C-means algorithm with a point symmetry distance," International Journal of Fuzzy Systems, vol. 7, no. 4, pp. 175-181, 2005.
[33] S. Salvador and P. Chan, "Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms," in ICTAI. IEEE Computer Society, 2004, pp. 576-584.
[34] L. Fu and E. Medico, "FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data," BMC Bioinformatics, vol. 8, 2007.
[35] A. K. Jain and M. H. C. Law, "Data clustering: A user's dilemma," in PReMI, ser. Lecture Notes in Computer Science, S. K. Pal, S. Bandyopadhyay, and S. Biswas, Eds., vol. 3776. Springer, 2005, pp. 1-10.
[36] J. Handl and J. Knowles, "Multiobjective clustering with automatic determination of the number of clusters," UMIST, Tech. Rep., 2004.
[37] H. Chang and D.-Y. Yeung, "Robust path-based spectral clustering," Pattern Recognition, vol. 41, no. 1, pp. 191-203, 2008.
[38] P. Franti and O. Virmajoki, "Iterative shrinking method for clustering problems," Pattern Recognition, vol. 39, no. 5, pp. 761-775, 2006.
[39] G. Karypis, "CLUTO: A clustering toolkit," Dept. of Computer Science, University of Minnesota, Tech. Rep. 02-017, 2002, available at http://www.cs.umn.edu/cluto.

Tomas Barton Biography text here.


Tomas Bruna Biography text here.


Fig. 9. Visualization of datasets used in experiments with ground-truth assignments: (a) aggregation, (b) atom, (c) chainlink, (d) chameleon-t4.8k, (e) chameleon-t5.8k, (f) chameleon-t7.10k, (g) chameleon-t8.8k, (h) compound, (i) cure-t2-4k, (j) D31, (k) diamond9, (l) DS-850, (m) flame, (n) jain, (o) long1.


Fig. 10. Visualization of datasets used in experiments with ground-truth assignments: (a) longsquare, (b) lsun, (c) pathbased, (d) s-set1, (e) spiralsquare, (f) target, (g) triangle1, (h) twodiamonds, (i) wingnut.

Pavel Kordik Biography text here.
