
Pattern Recognition 37 (2004) 175 – 188

www.elsevier.com/locate/patcog

Clustering of the self-organizing map using a clustering validity index based on inter-cluster and intra-cluster density

Sitao Wu∗, Tommy W.S. Chow
Department of Electronic Engineering, City University of Hong Kong, 83 Tat Chu Avenue, Kowloon, Hong Kong

Received 10 December 2002; received in revised form 20 June 2003; accepted 20 June 2003

Abstract

The self-organizing map (SOM) has been widely used in many industrial applications. Classical clustering methods based on the SOM often fail to deliver satisfactory results, especially when clusters have arbitrary shapes. In this paper, through some preprocessing techniques for filtering out noises and outliers, we propose a new two-level SOM-based clustering algorithm using a clustering validity index based on inter-cluster and intra-cluster density. Experimental results on synthetic and real data sets demonstrate that the proposed clustering algorithm is able to cluster data better than the classical clustering algorithms based on the SOM, and find an optimal number of clusters.
© 2003 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Partitioning clustering; Hierarchical clustering; Clustering validity index; Self-organizing map; Multi-representation

∗ Corresponding author. Tel.: +852-21942874.
E-mail address: eetchow@cityu.edu.hk (S. Wu).

0031-3203/03/$30.00 © 2003 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.
doi:10.1016/S0031-3203(03)00237-1

1. Introduction

The self-organizing map (SOM), proposed by Kohonen [1], has been widely used in many industrial applications such as pattern recognition, biological modeling, data compression, signal processing, and data mining [2]. It is an unsupervised and nonparametric neural network approach. The success of the SOM algorithm lies in its simplicity, which makes it easy to understand, simulate, and use in many applications.

The basic SOM consists of a set of neurons usually arranged in a two-dimensional structure such that there are neighborhood relations among the neurons. After completion of training, each neuron is attached to a feature vector of the same dimension as the input space. By assigning each input vector to the neuron with the nearest feature vector, the SOM is able to divide the input space into regions with common nearest feature vectors. This process can be considered as performing vector quantization (VQ) [3]. Also, because of the neighborhood relation contributed by the inter-connections among neurons, the SOM exhibits another important property of topology preservation. In other words, if two feature vectors are near to each other in the input space, the corresponding neurons will also be close in the output space, and vice versa. Usually the output neurons are arranged in 2D grids. Therefore, the SOM is suitable for visualization purposes.

Clustering algorithms attempt to organize unlabeled input vectors into clusters or "natural groups" such that points within a cluster are more similar to each other than to vectors belonging to different clusters [4]. Clustering has been used in exploratory pattern-analysis, grouping, decision-making, and machine-learning situations, including data mining, document retrieval, image segmentation, and pattern classification [5]. The clustering methods are of five types: hierarchical clustering, partitioning clustering, density-based clustering, grid-based clustering, and model-based clustering [6]. Each type has its advantages and disadvantages.

Several attempts have been made to cluster data based on the SOM. In Refs. [7,8] the SOM is used to develop clustering algorithms. The clustering algorithm in Ref. [7] assumes each cluster has a spherical shape, while the HEC network in

Ref. [8] assumes each cluster can be either hyper-spherical or hyper-ellipsoidal. The number of output neurons is equal to the desired number of clusters. When the number of clusters is a prime number, the SOM cannot be realized in two dimensions. Furthermore, the SOM is conceptually different from clustering [4]. The SOM tries to extract and visually display the topological structure of high-dimensional input data, while clustering is to partition the input data into groups. The algorithms in Refs. [7,8] seem to mix two objectives, feature mapping and clustering, and the overall methodology is difficult to interpret in either case [4].

It is then reasonable that a two-level approach is able to cluster data based on the SOM. The idea is that the first level is to train data by the SOM and the second level is to cluster data based on the SOM. The required number of output neurons at the first level is more than the desired number of clusters. Clustering is carried out by clustering of output neurons after completion of training performed by the SOM. This two-level approach has been addressed in Refs. [9–12]. These are actually multirepresentation-based clustering because each cluster can be represented by multiple output neurons. In Ref. [9] there are two SOM layers for clustering. The second SOM layer takes the outputs of the first SOM layer as its inputs. The number of neurons on the second map is equal to the desired number of clusters. The task of the second SOM layer is analogous to clustering of the SOM by the k-means algorithm. In Ref. [10] an agglomerative contiguity-constrained clustering method on the SOM was proposed. The merging process of neighboring neurons was based on a minimal distance criterion. The algorithm in Ref. [11] extended the idea in Ref. [10] by a minimal variance criterion and achieved better clustering results. In Ref. [12] both the classical hierarchical and partitioning clustering algorithms are applied in clustering of the SOM. The proposed algorithms [12] were aimed at reducing computational complexity compared with the classical clustering methods. The algorithms presented in Refs. [10,11] need to recalculate the center after two clusters are merged. They are only feasible for clusters with hyper-spherical or hyper-ellipsoidal shapes. The second SOM layer [9] and the batch k-means algorithm in clustering of the SOM [12] require the desired number of clusters to be known and are only feasible for hyper-spherical-shaped clusters. Hierarchical clustering algorithms on the SOM in Ref. [12] use only an inter-cluster distance criterion to cluster the output neurons. In order to deal with arbitrary cluster shapes, high-order neurons are introduced in Ref. [13]. The inverse covariance matrix in the clustering metric can be considered as second-order statistics that capture hyper-ellipsoidal properties of clusters. But the algorithm in Ref. [13] is so computation-consuming that it is not suitable for real applications. Some additional information other than distances will be helpful in the clustering process. In Ref. [14] a clustering validity assessment was proposed in which not only inter-cluster distances, but also inter-cluster density and intra-cluster density are considered in the clustering assessments. The assessment is able to find an optimal partition of the input data [14].

In this paper, a new two-level algorithm for clustering of the SOM is proposed. The clustering at the second level is agglomerative hierarchical clustering. The merging criterion is motivated by a clustering validity index based on the inter-cluster and intra-cluster density, and inter-cluster distances [14]. The original index is used for the whole input data and is therefore a global index. The optimal number of clusters can be found by the clustering validity index. In this paper, the clustering validity index is slightly modified and used locally to determine which neighboring pair of clusters is to be merged into one cluster in agglomerative hierarchical clustering. Since more information is added into the merging criterion in addition to inter-cluster distances, the proposed algorithm clusters data better than other clustering algorithms based on the SOM. Through certain preprocessing techniques for filtering, the proposed clustering algorithm is able to handle input data with noises and outliers.

This paper is organized into five sections. In Section 2, the SOM and clustering algorithms are briefly reviewed, and algorithms of two-level clustering of the SOM are discussed. In Section 3, a new algorithm of multirepresentation clustering of the SOM is proposed. In Section 4, experimental results on synthetic and real data sets demonstrate that the proposed algorithm is able to cluster the input data and find the optimal number of clusters. The clustering effect of the proposed algorithm is better than that of other clustering algorithms on the SOM. Finally, the conclusions of this work are presented in Section 5.

2. Self-organizing map and clustering

2.1. Self-organizing map and visualization

Competitive learning is an adaptive process in which the neurons in a neural network gradually become sensitive to different input categories, i.e., sets of samples in a specific domain of the input space. A division of neural nodes emerges in the network to represent different patterns of the inputs after training.

The division is enforced by competition among the neurons: when an input x arrives, the neuron that is best able to represent it wins the competition and is allowed to learn it even better, as will be described later. If there exists an ordering between the neurons, i.e., the neurons are located on a discrete lattice, the competitive learning algorithm can be generalized: if not only the winning neuron but also its neighboring neurons on the lattice are allowed to learn, the whole effect is that the final map becomes an ordered map in the input space. This is the essence of the SOM algorithm.

The SOM consists of M neurons located on a regular low-dimensional grid, usually one- or two-dimensional.

Higher-dimensional grids are possible, but they are not generally used since their visualization is problematic. The lattice of the grid is either hexagonal or rectangular.

The basic SOM algorithm is iterative. Each neuron i has a d-dimensional feature vector w_i = [w_i1, ..., w_id]. At each training step t, a sample data vector x(t) is randomly chosen from the training set. Distances between x(t) and all the feature vectors are computed. The winning neuron, denoted by c, is the neuron with the feature vector closest to x(t):

    c = arg min_i ||x(t) − w_i||,   i ∈ {1, ..., M}.                    (1)

A set of neighboring nodes of the winning node is denoted as N_c. We define h_ic(t) as the neighborhood kernel function around the winning neuron c at time t. The neighborhood kernel function is a nonincreasing function of time and of the distance of neuron i from the winning neuron c. The kernel can be taken as a Gaussian function:

    h_ic(t) = exp( −||Pos_i − Pos_c||² / (2σ(t)²) ),   i ∈ N_c,         (2)

where Pos_i is the coordinates of neuron i on the output grid and σ(t) is the kernel width.

The weight update rule in the sequential SOM algorithm can be written as

    w_i(t + 1) = w_i(t) + α(t) h_ic(t) (x(t) − w_i(t))   if i ∈ N_c,
    w_i(t + 1) = w_i(t)                                  otherwise.     (3)

Both the learning rate α(t) and the kernel width σ(t) decrease monotonically with time. During training, the SOM behaves like a flexible net that folds onto a "cloud" formed by the training data. Because of the neighborhood relations, neighboring neurons are pulled in the same direction, and thus feature vectors of neighboring neurons resemble each other.

There are many variants of the SOM. However, these variants are not considered in this paper because the proposed algorithm is based on the SOM, not on a new variant of the SOM.

The 2D map can be easily visualized and thus gives people useful information about the input data. The usual way to display the cluster structure of the data is to use a distance matrix, such as the U-matrix [15]. The U-matrix method displays the SOM grid according to the distances of neighboring neurons. The visual display of the U-matrix can be three-dimensional, using ridges and valleys, or two-dimensional, using gray levels. Clusters can be identified by low inter-neuron distances and borders by high inter-neuron distances. Another method of visualizing cluster structure is to assign the input data to their nearest neurons. Some neurons then have no input data assigned to them. These neurons can be used as the borders of clusters [16]. These methods are cluster visualization tools and inherently are not clustering methods.

2.2. Clustering algorithms

There are a multitude of clustering methods in the literature, which can be broadly classified into the following categories [6]: hierarchical clustering, partitioning clustering, density-based clustering, grid-based clustering, and model-based clustering. In this paper, only the first two categories are considered.

In partitioning clustering, given a database of n objects, a partitioning clustering algorithm constructs k partitions of the data, where each partition represents a cluster and k ≤ n. The most used partitioning clustering algorithm is the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster. One advantage of partitioning clustering is that the clustering is dynamic, i.e., data points can move from one cluster to another. The other advantage is that some a priori knowledge, such as cluster shapes, can be incorporated in the clustering. The drawbacks of partitioning clustering are the following: (1) it encounters difficulty at discovering clusters of arbitrary shapes; (2) the number of clusters is pre-fixed, and the optimal number of clusters is hard to determine.

A hierarchical clustering algorithm creates a hierarchical decomposition of the given set of data objects. It can be classified as either agglomerative or divisive. The advantage of hierarchical clustering is that it is not affected by initialization and local minima. The shortcomings of hierarchical clustering are the following: (1) it is impractical for large data sets due to the high computational complexity; (2) it does not incorporate any a priori knowledge such as cluster shapes; (3) the clustering is static, i.e., data points in a cluster at an early stage cannot move to another cluster at a later stage. In this paper, divisive hierarchical clustering is not considered because the top–down direction of divisive hierarchical clustering is not suitable for two-level clustering of the SOM.

In the classical agglomerative hierarchical clustering, the pair of clusters to be merged has the minimum inter-cluster distance. The widely used measures of inter-cluster distance are listed in Table 1 (m_i is the mean of cluster C_i and n_i is the number of points in C_i). All of these distance measures yield the same clustering results if the clusters are compact and well separated. But in some cases [17], using d_max, d_ave, and d_mean as the distance measures results in wrong clusters that are similar to those determined by partitioning clustering. Single-linkage clustering with distance measure d_min may have "chaining effects": a few points located so as to form a bridge between two clusters cause points across the clusters to be grouped into a single cluster.

The extended SOM (minimum distance) [10] utilizes the single-linkage clustering method on the SOM. The clustering in the extended SOM (minimum variance) [11] is different from the minimum-distance-based agglomerative hierarchical clustering, but it is also an agglomerative hierarchical clustering on the SOM based on minimum variance.
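As an illustration, the sequential training loop of Eqs. (1)–(3) can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: the learning-rate and kernel-width schedules are assumptions, and the Gaussian kernel of Eq. (2) is applied to all neurons rather than only to the neighborhood set N_c (a common simplification, since the kernel decays quickly with grid distance).

```python
import numpy as np

def train_som(data, grid_shape=(4, 4), n_epochs=100,
              lr0=1.0, lr_final=0.0001, sigma0=2.0):
    """Sequential SOM training following Eqs. (1)-(3) (sketch)."""
    rng = np.random.default_rng(0)
    rows, cols = grid_shape
    d = data.shape[1]
    # Feature vectors w_i, one per neuron, initialized randomly.
    w = rng.random((rows * cols, d))
    # Grid coordinates Pos_i of each neuron.
    pos = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    n_steps = n_epochs * len(data)
    for t in range(n_steps):
        x = data[rng.integers(len(data))]       # random sample x(t)
        frac = t / n_steps
        lr = lr0 * (lr_final / lr0) ** frac     # decreasing learning rate
        sigma = sigma0 * (1 - frac) + 0.01      # shrinking kernel width
        # Eq. (1): the winning neuron c has the closest feature vector.
        c = np.argmin(np.linalg.norm(w - x, axis=1))
        # Eq. (2): Gaussian neighborhood kernel around c.
        h = np.exp(-np.sum((pos - pos[c]) ** 2, axis=1) / (2 * sigma ** 2))
        # Eq. (3): move feature vectors toward x, weighted by the kernel.
        w += lr * h[:, None] * (x - w)
    return w
```

Since lr·h ≤ 1 here, each feature vector stays within the convex hull of its initialization and the training data.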

Table 1
Four types of definitions of inter-cluster distance

Definition          Inter-cluster distance

Single-linkage:     d_min(C_i, C_j)  = min_{p ∈ C_i, p′ ∈ C_j} ||p − p′||
Complete-linkage:   d_max(C_i, C_j)  = max_{p ∈ C_i, p′ ∈ C_j} ||p − p′||
Centroid-linkage:   d_mean(C_i, C_j) = ||m_i − m_j||
Average-linkage:    d_ave(C_i, C_j)  = (1/(n_i n_j)) Σ_{p ∈ C_i, p′ ∈ C_j} ||p − p′||
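The four definitions in Table 1 translate directly into code (a sketch; the clusters are assumed to be NumPy arrays with one point per row):

```python
import numpy as np

def pairwise(Ci, Cj):
    """All pairwise distances ||p - p'|| between points of two clusters."""
    return np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)

def d_min(Ci, Cj):   # single-linkage
    return pairwise(Ci, Cj).min()

def d_max(Ci, Cj):   # complete-linkage
    return pairwise(Ci, Cj).max()

def d_mean(Ci, Cj):  # centroid-linkage: distance between cluster means
    return np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))

def d_ave(Ci, Cj):   # average-linkage
    return pairwise(Ci, Cj).mean()
```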

[Figure: Input data → SOM training → Clustering of the trained map.]
Fig. 1. The two-level approach of clustering of the SOM. Different symbols on the map represent different clusters.

2.3. Clustering of the SOM

In the first level, we use the SOM to form a 2D feature map. The number of output neurons is significantly more than the desired number of clusters. This means that several neurons, rather than a single neuron, represent a cluster. Then, in the second level, the output neurons are clustered such that the neurons on the map are divided into as many different regions as the desired number of clusters. Each input data point can then be assigned to a cluster according to its nearest output neuron. The whole process is illustrated in Fig. 1. This two-level approach for clustering of the SOM has been addressed in Refs. [9–12]. The recently developed hierarchical algorithms CURE [17] and Chameleon [18] also utilize a two-level clustering technique.

The classical clustering algorithms can be used in clustering the output neurons of the SOM. However, due to the disadvantages of different types of clustering algorithms, we must choose an appropriate one to cluster the SOM. Partitioning clustering of the SOM can cause incorrect clusters: if the clusters in the input data have nonspherical shapes, clustering of the SOM deteriorates. Moreover, the number of clusters in partitioning clustering, which is usually unknown beforehand, must be predefined. Therefore, a partitioning algorithm is not adopted in this study. This will be illustrated in Section 4. Agglomerative hierarchical clustering of the SOM is the algorithm we adopted in this paper, but here more information about the clusters is added into the clustering algorithm. In the classical agglomerative hierarchical clustering of the SOM, i.e., the extended SOM (minimum distance) [10], only inter-cluster distance information is used to merge the nearest neighboring clusters. In the extended SOM (minimum variance) [11], intra-cluster variances were considered and better clustering results were obtained, but there is no explicit inter-cluster distance information in the extended SOM (minimum variance). In some cases, a clustering algorithm using more information about the pair of clusters in addition to inter-cluster distances, such as special characteristics of the individual clusters and of the between-cluster region, will have better clustering effects than the classical hierarchical clustering algorithms. This additional information has been considered in the Chameleon algorithm [18], but that algorithm has to construct a k-nearest-neighbor graph, which is computationally complex when the data set is very large.

In this paper, inter-cluster and intra-cluster density are added to the merging criterion, and some useful steps for filtering noises and outliers are proposed before clustering of the SOM. This will be discussed in the next section.

3. Clustering of the SOM using a local clustering validity index and preprocessing of the SOM for filtering

3.1. Global clustering validity index for different clustering algorithms

Since there are many clustering algorithms in the literature, evaluation criteria are needed to justify the correctness of a partition. Furthermore, the evaluation algorithms need to address the number of clusters that appear in a data set. A lot of effort has been made for clustering in the area of pattern recognition [19]. In general, there are three types of methods used to investigate cluster validity: (1) external criteria; (2) internal criteria; (3) relative criteria [19]. The idea

of cluster validity is to find "compact and well-separated clusters" [20]. The implementation of most validity algorithms is very computationally intensive, especially when the number of clusters and the number of input data are very large [21]. Some validity indexes depend on the data [22], and certain indexes depend on the number of clusters [23]. Some indexes use the sample mean of each subset, while others, at the other extreme, use all points in each subset. For arbitrarily shaped clusters, using multiple representation points per cluster gives a better cluster description than a single-center representation, and multirepresentation points are computationally lighter than an all-point representation.

The newly proposed multirepresentation clustering validity index [14] is a relative, algorithm-independent clustering index for assessing the quality of a partitioning. The index is based on two accepted concepts: (1) a cluster's compactness; (2) a cluster's separation. The definitions of the clustering validity index are modified slightly in this paper.

The notations in the clustering validity index are defined as follows. The data set is partitioned into c clusters. A set of representation points V_i = {v_i1, v_i2, ..., v_i,ri} represents the ith cluster, where r_i is the number of representation points of the ith cluster. stdev(i) is the standard deviation vector of the ith cluster. The pth component of stdev(i) is defined by

    stdev_p(i) = sqrt( Σ_{k=1}^{n_i} (x_k^p − m_i^p)² / (n_i − 1) ),

where n_i is the number of data points in the ith cluster, x_k is the data belonging to the ith cluster, and m_i is the sample mean of the ith cluster. The average standard deviation is given by

    stdev = sqrt( (1/c) Σ_{i=1}^{c} ||stdev(i)||² ).

Intra-cluster density in Ref. [14] is defined as the number of points that belong to the neighborhoods of the representation points of the clusters. The intra-cluster density is significantly high for well-separated clusters. The definition of intra-cluster density is given by

    Intra_den(c) = (1/c) Σ_{i=1}^{c} Σ_{j=1}^{r_i} density(v_ij),   c > 1.             (4)

The term density(v_ij) in (4) is defined by density(v_ij) = Σ_{l=1}^{n_i} f(x_l, v_ij), where x_l belongs to the ith cluster, v_ij is the jth representation point of the ith cluster, and f(x_l, v_ij) is defined by

    f(x_l, v_ij) = 1 if ||x_l − v_ij|| ≤ stdev, and 0 otherwise.                       (5)

Inter-cluster density is defined as the density in the between-cluster areas [14]. The density in the between-cluster region is intended to be significantly low. It is defined as follows:

    Inter_den(c) = Σ_{i=1}^{c} Σ_{j=1, j≠i}^{c} [ ||close_rep(i) − close_rep(j)|| / (stdev(i) + stdev(j)) ] × density(u_ij),   c > 1,   (6)

where close_rep(i) and close_rep(j) are the closest pair of representation points of the ith and jth clusters, and u_ij is the middle point between the pair of points close_rep(i) and close_rep(j). density(u_ij) is given by density(u_ij) = Σ_{k=1}^{n_i+n_j} f(x_k, u_ij), where x_k is an input vector belonging to the ith or jth cluster, and f(x_k, u_ij) is defined by

    f(x_k, u_ij) = 1 if ||x_k − u_ij|| ≤ (stdev(i) + stdev(j))/2, and 0 otherwise.     (7)

The clusters' separation evaluates how well the clusters are separated. It includes both the inter-cluster distances and the inter-cluster density. The goal is that the inter-cluster distance is significantly high while the inter-cluster density is significantly low. The definition of the clusters' separation is

    Sep(c) = Σ_{i=1}^{c} Σ_{j=1, j≠i}^{c} ||close_rep(i) − close_rep(j)|| / (1 + Inter_den(c)),   c > 1.   (8)

The overall clustering validity index, which is called "Composing Density Between and Within clusters" (CDbw) [14], is defined by

    CDbw(c) = Intra_den(c) × Sep(c).                                                   (9)

Experiments in Ref. [14] demonstrate that the value of the CDbw reaches its maximum when c is the optimal number of clusters, irrespective of the different clustering algorithms. The computational complexity is O(n) [14], which is acceptable for large data sets.

The CDbw can be applied in hierarchical clustering algorithms. Instead of only distance information being used, intra-cluster and inter-cluster density information are also utilized in merging two neighboring clusters. The advantages of the new merging mechanism are described in the next subsection.

3.2. Merging criterion using the CDbw

In the proposed hierarchical clustering of the SOM, the inter-cluster and intra-cluster density are incorporated into the merging criterion in addition to distance information. The merging process is described as follows. First, compute the CDbw for the data belonging to each neighboring pair of clusters. Contrary to finding the maximal CDbw for all the input data, the merging mechanism is to find the pair of clusters with the minimal value of the CDbw, which indicates that the two clusters have the strongest tendency to be clustered together. The CDbw is thus used locally.

[Figure: 2D scatter plot of the three clusters, labeled 1–3.]
Fig. 2. The 2D input data with three well-separated clusters.

Table 2
The inter-cluster distance and the CDbw of the three clusters

Pair of clusters    Distance    CDbw
1 and 2             11.90       50.00
2 and 3             14.37       13102.56
1 and 3             7.94        1292.75

The advantage of the merging mechanism is that the clustering result is more accurate, due to the additional information about the individual clusters that is considered. This can be illustrated in Fig. 2. There are three well-separated clusters in Fig. 2. The left two clusters are elongated in the vertical direction. The third cluster has a spherical shape and is more compact than the left two clusters. The density of the third cluster is greater than that of the other two. The input data are agglomerated from three clusters into two clusters by using centroid-linkage hierarchical clustering. The distances between the cluster centers are listed in Table 2. It seems that the first and third clusters are to be merged, since they have the shortest inter-cluster distance between cluster centers. But the size and density of each cluster can also affect the cluster-merging process. The first two clusters are more likely to be merged, because the inter-cluster region between the first and second clusters is denser than that between the other two pairs of clusters. From Table 2, the value of the CDbw between the first and second clusters is the lowest, which indicates the clustering tendency of merging the first and second clusters. This example considers only a single representation. Using the CDbw, a multirepresentation clustering will usually have a better effect than hierarchical clustering algorithms using only distance information.

3.3. Preprocessing before clustering of the SOM

The algorithms in Refs. [9–12] use only feature vectors for clustering of the SOM. Information about the density of the clusters is lost. But if all input data belonging to certain neurons are considered, noises and outliers may deteriorate the clustering effect. To achieve a better clustering result and be less affected by noises and outliers, some preliminary steps must be performed before clustering of the SOM. The preprocessing steps are described as follows:

1. First, assign each input x_i to its nearest neuron j on the map according to Eq. (1). Neurons with no input data assigned, which are called interpolating units in Ref. [12], are not included in the next clustering steps.
2. Compute the mean vector m_j of the assigned input data for each effective neuron j. Then compute the distance deviation between each feature vector w_j and m_j: dev_j = ||w_j − m_j||. Compute the mean mean_dev and standard deviation std_dev of all the dev_j's.
3. If dev_j > mean_dev + std_dev, exclude neuron j from the later clustering and filter out the input data assigned to it. This mechanism can filter out the input outliers.
4. Compute the distances between the input vectors assigned to the jth neuron and the feature vector w_j, i.e., dis_j(x_i) = ||x_i − w_j||, where x_i belongs to the jth neuron. Then compute the mean mean_dis_j and standard deviation std_dis_j of all the dis_j(x_i)'s.
5. If the distance between an input vector x_i assigned to the jth cluster and the feature vector w_j is larger than mean_dis_j + std_dis_j, i.e., ||x_i − w_j|| > mean_dis_j + std_dis_j, filter out the input vector x_i for the next clustering steps. This can filter out the input outliers and noises.
6. Compute the number of data belonging to the jth cluster: num_j. Then compute the mean mean_num and standard deviation std_num of all the num_j's.
7. If the number of data belonging to the jth neuron is less than mean_num − std_num, i.e., num_j < mean_num − std_num, exclude neuron j from the later clustering and filter out the input data assigned to it. This can filter out the input noises.

3.4. Clustering of the SOM

After the preprocessing before clustering of the SOM, some neurons and some input data are excluded. The remaining neurons and input data can be hierarchically clustered. In this paper, rectangular grids are used for the SOM. The merging process happens for neighboring clusters, which means the neurons belonging to the pair of clusters are direct neighbors. For example, the neurons in the nonboundary area can have eight direct neighbors. This is illustrated in

cause the 4nal number of cluster to be more than the number


of natural clusters. Thus, we must choose an appropriate
SOM map size.
B C D

E A F
3.5. The algorithm of clustering of the SOM
G H I
The overall proposed algorithm is summarized as follows:

1. Train input data by the SOM.


2. Preprocessing before clustering of the SOM as described
in the previous Section 3.3.
(a) 3. Cluster SOM by using the agglomerative hierarchical
clustering. The merging criterion is made by the CDbw
for all pairs of directly neighboring clusters. Compute the
1 1 2 global CDbw for all the input data before the merging
1 1 2 2 2
process until only two clusters exist, or merging cannot
happen.
1 1
1 2 2
4. Find the optimal partition of the input data according to
1 1 1 2 the CDbw for all the input data as a function of the number
of clusters.

4. Experimental results
(b)
To demonstrate the eGectiveness of the proposed cluster-
ing algorithm, four data sets were used in our experiments.
The input data are normalized such that the value of each
1 1 2 datum in each dimension lies in [0,1]. For training SOM we
1 1 2 2 2
used 100 training epochs on the input data and the learning
rate decreases from 1 to 0.0001.
1 1 2 2

1 1 2
4

2
(c)

Fig. 3. (a) Neuron A has eight direct neighboring neurons: B − I ; 1


(b) multi-neuron represented neighboring clusters 1 and 2 can
be clustered into one cluster because the two clusters are direct 0
neighbors; (c) multi-neuron represented clusters 1 and 2 cannot be
clustered into one cluster because the two clusters are not direct
neighbors. -1

-2
Fig. 3(a). Two clusters can be merged if the two clusters
are direct neighbors. When the pair of clusters are not di-
rect neighbors, they cannot be merged. This is shown in -3
Fig. 3(b) and (c). The CDbw are used locally for the input
data belonging to the directly neighboring pair of clusters. -4
If the CDbw for a pair of directly neighboring clusters is
the lowest among all the available directly neighboring pair -5
of clusters, the pair of clusters is merged into one cluster. -3 -2 -1 0 1 2
If the number of neurons is very large, the interpolating
neurons form the inter-cluster borders on the map. This may Fig. 4. The synthetic data set in the 2D plane.

[Figure: six scatter-plot panels (a)–(f), each showing the synthetic data set partitioned into three clusters.]
Fig. 5. Three partitions of the synthetic data set for clustering of the SOM by (a) the proposed algorithm; (b) k-means clustering algorithm; (c) single-linkage hierarchical clustering algorithm; (d) complete-linkage hierarchical clustering algorithm; (e) centroid-linkage hierarchical clustering algorithm; (f) average-linkage hierarchical clustering algorithm. ".", "*" and "+" indicate three different clusters, respectively.


Fig. 6. The CDbw as a function of the number of clusters for the synthetic data set by the proposed algorithm (solid line), and the single-linkage clustering algorithm (dashed line) on the SOM.

Fig. 7. Three clusters for the synthetic data set are displayed on the map by the proposed algorithm or the single-linkage clustering algorithm on the SOM (SOM map size of 4 × 4). The same symbol on the map represents the same cluster.

Fig. 8. Three clusters for the synthetic data set are displayed on the map by the proposed algorithm or the single-linkage clustering algorithm on the SOM (SOM map size of 6 × 6). The same symbol on the map represents the same cluster.
4.1. 2D synthetic data set

A 2D data set of 200 points was generated with some noises and outliers. The correctness of the clustering result can be easily seen in a 2D plane. The data generated consisted of three shallow elongated parallel clusters in the 2D plane, as shown in Fig. 4. Some noises and outliers were added into the input data. We used the SOM algorithm to train the input data. Then we preprocessed the SOM before clustering of the SOM. Finally we used k-means, four different hierarchical clustering algorithms, and the proposed algorithm for clustering of the SOM. In the experiment we first used a map size of 4 × 4. The clustering results are illustrated in Fig. 5. Except for the single-linkage algorithm, the k-means and hierarchical clustering algorithms cannot correctly find the true clusters. These clustering algorithms split one true cluster into two clusters because they can only deal with clusters of spherical shapes. For the proposed algorithm and the single-linkage clustering algorithm, the partition of the data is correct. The CDbw as a function of the number of clusters for the two clustering algorithms is plotted in Fig. 6. The maximal value of the CDbw by the two algorithms indicates there are three clusters in the input data, which is consistent with the cluster structure designed by us. The input data are assigned to the nearest effective neurons. According to the assigned input data and their cluster memberships, the cluster memberships for the effective neurons can be displayed on the 2D map. The three clusters on the map with map size of 4 × 4 by the proposed algorithm are shown in Fig. 7, although it is


Fig. 9. The CDbw as a function of the number of clusters for the iris data by the proposed algorithm (solid line), and the single-linkage clustering algorithm (dashed line) on the SOM.

Fig. 10. Two clusters for the iris data set are identified by the proposed algorithm or single-linkage clustering algorithm on the SOM (SOM map size of 4 × 4). The same symbol on the map represents the same cluster.

Fig. 11. For the known three classes, three clusters are formed (SOM map size of 4 × 4) for the iris data set by (a) the proposed algorithm; (b) single-linkage clustering algorithm on the SOM. The same symbol on the map represents the same cluster.
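The clustering accuracies reported below (Table 3) compare unsupervised cluster labels against the known classes. A common way to score this, and our reading of the table since the paper does not spell out the formula, is to map each cluster to its majority class and count the matching points:

```python
from collections import Counter

def clustering_accuracy(cluster_labels, class_labels):
    """Map each cluster to its majority true class, then return the fraction
    of points whose class equals their cluster's majority class."""
    correct = 0
    for c in set(cluster_labels):
        members = [t for k, t in zip(cluster_labels, class_labels) if k == c]
        correct += Counter(members).most_common(1)[0][1]  # majority count
    return correct / len(class_labels)
```

For example, a cluster holding two points of class "a" and one of class "b" contributes two correct points, regardless of how the cluster is numbered.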

a trivial thing for 2D data. We also tested the above algorithms on map sizes of 3 × 3 and 5 × 5 and obtained similar results. For a smaller map size such as 2 × 3, the clustering by all clustering algorithms on the SOM gave wrong clusters. This is because each cluster has fewer representations, so that the elongated clusters cannot be adequately described by the representation neurons. But for a larger map size such as 6 × 6, all these algorithms achieved the same correct results because the interpolating neurons explicitly form the borders of the three natural clusters and the minimal number of final agglomerated clusters is 3. The three clusters on the SOM with map size of 6 × 6 are shown in Fig. 8, where the interpolating neurons clearly separate the map into three regions representing three clusters. For an even larger map size, the minimal number of final agglomerated clusters can be larger than three and thus cannot express the true information about the input data.

Table 3
Performance comparison of different clustering algorithms for the iris data set

Algorithm                                Clustering accuracy (%)
Proposed clustering of the SOM           96.0
Single-linkage clustering of the SOM     74.7
Extended SOM (minimum variance)          90.3
Extended SOM (minimum distance)          89.2
Direct k-means                           85.3
Direct single-linkage clustering         68.0
Direct complete-linkage clustering       84.0
Direct centroid-linkage clustering       68.0
Direct average-linkage clustering        69.3

We do not use the k-means, the complete-linkage, the centroid-linkage and the average-linkage clustering algorithms on the SOM in the next three data sets because they

Fig. 12. The statistical information of 20 clusters for the 15D synthetic data set. The horizontal axis in each subfigure represents the dimension and the vertical axis in each subfigure represents the value in each dimension. The mean value of each dimension in each cluster is represented by a black dot in each subfigure. The standard deviation of each dimension in each cluster is represented by two curves enclosing the black dots in each subfigure. The number of data points in each cluster is bracketed in each subfigure.
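The per-cluster summaries shown in Fig. 12 (size, per-dimension mean, per-dimension standard deviation) can be computed in a few lines of NumPy; a sketch:

```python
import numpy as np

def cluster_statistics(X, labels):
    """Summarize each cluster as in Fig. 12: number of points, and the mean
    and standard deviation of every dimension."""
    stats = {}
    for c in np.unique(labels):
        members = X[labels == c]
        stats[int(c)] = {"n": len(members),
                         "mean": members.mean(axis=0),
                         "std": members.std(axis=0)}
    return stats
```

Plotting `mean` as dots and `mean ± std` as enclosing curves, one subfigure per cluster, reproduces the layout of Fig. 12.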

are only suitable for hyper-spherical cluster shapes. The map size should be carefully chosen so that it is able to learn the natural number of clusters.

4.2. Iris data set

The Iris data set [24] has been widely used in pattern classification. It has 150 data points of four dimensions. The data are divided into three classes with 50 points each. The first class of Iris plant is linearly separable from the other two. The other two classes overlap to some extent and are not linearly separable. We clustered the data by the proposed clustering algorithm and the single-linkage clustering algorithm on the SOM. We used an appropriate map size of 4 × 4. The two algorithms achieved the same optimal clustering results. The CDbw as a function of the number of clusters, plotted in Fig. 9, indicates that the data are optimally divided into two clusters. The iris data of the first class form a cluster, and the rest two classes form the other cluster. The two clusters can be displayed on the map shown in Fig. 10, where the "*" symbol represents the first class and "+" represents the second and third classes. It is inconsistent with the inherent three classes in the data. The two clusters are also achieved in Ref. [25]. The two clusters are formed without a priori information about the classes of the data. Therefore, the two clusters found in the data are reasonable.

Fig. 13. Twenty clusters for the 15D synthetic data set are displayed on the map by the proposed algorithm on the SOM (SOM map size of 8 × 8). The same number on the map represents the same cluster.

If three clusters are forced to be formed, the proposed algorithm is better than the single-linkage clustering algorithm. The partition of the map performed by the two algorithms is shown in Fig. 11. One cluster representing the first class ("*" symbol) is clearly separated from the other two clusters by the interpolating neurons representing borders. The clustering accuracies by the proposed algorithm, the single-linkage clustering of the SOM, the extended SOM


Fig. 14. The CDbw as a function of the number of clusters for the 15D synthetic data set by the proposed algorithm on the SOM.
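The generation recipe for the 15D set described in Section 4.3 (20 uniform random prototypes in [0, 1]^15, Gaussian noise with per-dimension standard deviation 0.12, and cluster sizes between 30 and 165) can be sketched as follows. The drawn cluster sizes here are illustrative, not the paper's exact counts:

```python
import numpy as np

def make_15d_clusters(n_clusters=20, dim=15, noise_std=0.12, seed=0):
    """Generate clustered data as described in Section 4.3: uniform random
    prototypes in [0, 1]^dim, each replicated with Gaussian noise."""
    rng = np.random.default_rng(seed)
    prototypes = rng.random((n_clusters, dim))
    sizes = rng.integers(30, 166, size=n_clusters)       # 30..165 points each
    X = np.vstack([p + rng.normal(0.0, noise_std, (s, dim))
                   for p, s in zip(prototypes, sizes)])
    labels = np.repeat(np.arange(n_clusters), sizes)
    return X, labels, prototypes
```

With this construction, each cluster's empirical per-dimension mean stays close to its prototype, which is exactly the structure summarized in Fig. 12.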

(minimum distance) [10], the extended SOM (minimum variance) [11], the direct k-means, and the four direct agglomerative hierarchical clustering algorithms are listed in Table 3. The single-linkage clustering of the SOM has the disadvantage of a chain-link tendency, so that it joins most of the points in the second and third classes, which resulted in a low clustering accuracy of 74.7%. Therefore, in the next two experiments we do not use the single-linkage clustering of the SOM. On the other hand, the proposed algorithm is able to form the correct three clusters with the highest accuracy, 96.0%, while the other agglomerative hierarchical clustering algorithms (direct or not) and the direct k-means algorithm have lower clustering accuracy to some extent in distinguishing the second and third classes.

So in real data sets, if the number of classes is known and is used as the number of clusters, and some overlapping exists in some pair of clusters, the proposed algorithm is a better choice. If we do not use the information about the number of classes, the pair of overlapping classes may merge into a single cluster and then the number of clusters may be less than the number of classes.

4.3. 15D synthetic data set

In this example, we used a data set of 1780 15D data points. The data were created by first generating 20 uniformly distributed random 15D points with each dimension lying in [0, 1], and then adding some Gaussian noise to each point with standard deviation 0.12 in each dimension. The number of data points in each cluster varies from 30 to 165. The statistical information for each cluster can be seen in Fig. 12. We used the proposed clustering algorithm on the SOM with a map size of 8 × 8. The 20 clusters displayed on the map are shown in Fig. 13. The CDbw as a function of the number of clusters, plotted in Fig. 14, strongly indicates that 20 clusters exist in the data. The partition of the data is consistent with the cluster structure of the data with 100% accuracy, and thus the statistical information of the clusters generated by the proposed clustering algorithm is the same as that of the known clusters shown in Fig. 12.

4.4. Wine data set

The wine data set [26] has 178 13D data points with three known classes. The numbers of data samples in the three classes are 59, 71 and 48, respectively. We use the proposed algorithm with a map size of 4 × 4 to cluster the data. The CDbw as a function of the number of clusters, plotted in Fig. 15, indicates that the number of clusters is three, which is exactly equal to the number of the classes. This is because the three classes are well separated from each other. The three clusters on the map are shown in Fig. 16. The clustering accuracies by the proposed algorithm, the extended SOM (minimum variance), the direct k-means, and the four direct agglomerative hierarchical clustering algorithms are listed in Table 4. For this data set, the proposed algorithm achieved the best clustering result with the highest clustering accuracy of 98.3%. The extended SOM (minimum variance) and the direct k-means algorithm also have good clustering results with accuracy larger than 90.0%, while the four direct agglomerative hierarchical clustering algorithms have a worse effect in distinguishing the second and third classes, with lower


Fig. 15. The CDbw as a function of the number of clusters for the wine data set by the proposed algorithm on the SOM.
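The validity curves of Figs. 6, 9, 14 and 15 all peak at the natural number of clusters. The sketch below is a deliberately simplified inter/intra-density score in the spirit of CDbw, not the exact index of Halkidi and Vazirgiannis used in the paper: intra-cluster density rewards points packed around each centroid, while inter-cluster density penalizes points lying midway between centroid pairs.

```python
import numpy as np

def density_validity(X, labels):
    """Simplified density-based validity score (NOT the exact CDbw):
    ratio of average intra-cluster density to average inter-cluster
    density at centroid-pair midpoints. Meaningful for >= 2 clusters."""
    groups = [X[labels == c] for c in np.unique(labels)]
    centroids = [g.mean(axis=0) for g in groups]
    radii = [np.linalg.norm(g - m, axis=1).mean() + 1e-12
             for g, m in zip(groups, centroids)]

    # intra: average fraction of a cluster's points within its mean radius
    intra = np.mean([np.mean(np.linalg.norm(g - m, axis=1) <= r)
                     for g, m, r in zip(groups, centroids, radii)])

    # inter: average density at the midpoint between every centroid pair
    dens, pairs = 0.0, 0
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            mid = (centroids[i] + centroids[j]) / 2
            r = (radii[i] + radii[j]) / 2
            both = np.vstack([groups[i], groups[j]])
            dens += np.mean(np.linalg.norm(both - mid, axis=1) <= r)
            pairs += 1
    inter = dens / pairs if pairs else 0.0
    return intra / (inter + 1e-3)   # higher = compact and well separated
```

Evaluating such a score over the partitions produced at each agglomeration level, and keeping the partition with the maximum value, mirrors how the optimal number of clusters is selected in the experiments.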

clustering accuracy less than 70%. The three clusters are of near spherical shapes, but have some noises between clusters. Therefore, the direct k-means algorithm has a better clustering effect than the direct agglomerative hierarchical clustering algorithms.

Fig. 16. Three clusters for the wine data set are displayed on the map by the proposed clustering algorithm on the SOM (SOM map size of 4 × 4). The same symbol on the map represents the same cluster.

Table 4
Performance comparison of different clustering algorithms for the wine data set

Algorithm                                Clustering accuracy (%)
Proposed clustering of the SOM           98.3
Extended SOM (minimum variance)          93.3
Direct k-means                           97.8
Direct single-linkage clustering         57.4
Direct complete-linkage clustering       67.4
Direct centroid-linkage clustering       61.2
Direct average-linkage clustering        61.2

5. Conclusions

In this paper, we propose a new SOM-based clustering algorithm. It uses the clustering validity index locally to determine which pair of clusters is to be merged. The optimal number of clusters can be determined by the maximum value of the CDbw, which is the clustering validity index computed globally for all input data. Compared with classical clustering methods on the SOM, the proposed algorithm utilizes more information about the data in each cluster in addition to inter-cluster distances. The proposed algorithm therefore clusters data better than the classical clustering algorithms on the SOM. The preprocessing steps for filtering out noises and outliers are also included to increase the accuracy and robustness of clustering of the SOM. The experimental results on the four data sets demonstrate that the proposed clustering algorithm is a better clustering algorithm than other clustering algorithms on the SOM.

References

[1] T. Kohonen, Self-organized formation of topologically correct feature maps, Biol. Cybern. 43 (1982) 59–69.
[2] T. Kohonen, Self-Organizing Maps, Springer, Berlin, Germany, 1997.
[3] R.M. Gray, Vector quantization, IEEE Acoust., Speech, Signal Process. Mag. 1 (2) (1984) 4–29.
[4] N.R. Pal, J.C. Bezdek, E.C.-K. Tsao, Generalized clustering networks and Kohonen's self-organizing scheme, IEEE Trans. Neural Networks 4 (4) (1993) 549–557.
[5] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Comput. Surveys 31 (3) (1999) 264–323.

[6] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, 2000.
[7] T. Huntsberger, P. Ajjimarangsee, Parallel self-organizing feature maps for unsupervised pattern recognition, Int. J. Gen. Systems 16 (1989) 357–372.
[8] J. Mao, A.K. Jain, A self-organizing network for hyperellipsoidal clustering (HEC), IEEE Trans. Neural Networks 7 (1) (1996) 16–29.
[9] J. Lampinen, E. Oja, Clustering properties of hierarchical self-organizing maps, J. Math. Imag. Vis. 2 (2–3) (1992) 261–272.
[10] F. Murtagh, Interpreting the Kohonen self-organizing feature map using contiguity-constrained clustering, Pattern Recognition Lett. 16 (1995) 399–408.
[11] M.Y. Kiang, Extending the Kohonen self-organizing map networks for clustering analysis, Comput. Stat. Data Anal. 38 (2001) 161–180.
[12] J. Vesanto, E. Alhoniemi, Clustering of the self-organizing map, IEEE Trans. Neural Networks 11 (3) (2000) 586–600.
[13] H. Lipson, H.T. Siegelmann, Clustering irregular shapes using high-order neurons, Neural Computation 12 (10) (2000) 2331–2353.
[14] M. Halkidi, M. Vazirgiannis, Clustering validity assessment using multi representatives, Proceedings of SETN Conference, Thessaloniki, Greece, April 2002.
[15] A. Ultsch, H.P. Siemon, Kohonen's self organizing feature maps for exploratory data analysis, Proceedings of the International Neural Network Conference, Dordrecht, Netherlands, 1990, pp. 305–308.
[16] X. Zhang, Y. Li, Self-organizing map as a new method for clustering and data analysis, Proceedings of the International Joint Conference on Neural Networks, Nagoya, Japan, 1993, pp. 2448–2451.
[17] S. Guha, R. Rastogi, K. Shim, CURE: an efficient clustering algorithm for large databases, Proceedings of ACM SIGMOD International Conference on Management of Data, New York, 1998, pp. 73–84.
[18] G. Karypis, E.-H. Han, V. Kumar, Chameleon: hierarchical clustering using dynamic modeling, IEEE Comput. 32 (8) (1999) 68–74.
[19] S. Theodoridis, K. Koutroumbas, Pattern Recognition, Academic Press, New York, 1999.
[20] J.C. Dunn, Well separated clusters and optimal fuzzy partitions, J. Cybern. 4 (1974) 95–104.
[21] X.L. Xie, G. Beni, A validity measure for fuzzy clustering, IEEE Trans. Pattern Anal. Mach. Intell. 13 (8) (1991) 841–847.
[22] G.W. Milligan, S.C. Soon, L.M. Sokol, The effect of cluster size, dimensionality and number of clusters on recovery of true cluster structure, IEEE Trans. Pattern Anal. Mach. Intell. 5 (1983) 40–47.
[23] R.N. Dave, Validating fuzzy partitions obtained through c-shell clustering, Pattern Recognition Lett. 17 (1996) 613–623.
[24] R.A. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugenics 7 (Part II) (1936) 179–188.
[25] J.C. Bezdek, N.R. Pal, Some new indexes of cluster validity, IEEE Trans. Systems, Man, and Cybern. 28 (3) (1998) 301–315.
[26] C.L. Blake, C.J. Merz, UCI repository of machine learning databases, (http://www.ics.uci.edu/∼mlearn/MLRepository.html), Department of Information and Computer Science, University of California at Irvine, CA, 1998.

About the Author—SITAO WU is now pursuing Ph.D. degree in the Department of Electronic Engineering of City University of Hong
Kong, Hong Kong, China. He obtained B.E. and M.E. degrees in the Department of Electrical Engineering of Southwest Jiaotong University,
China in 1996 and 1999, respectively. His research interest areas are neural networks, pattern recognition, and their applications.

About the Author—TOMMY W.S. CHOW (M'93) received the B.Sc. (First Hons.) and Ph.D. degrees from the University of Sunderland, Sunderland, UK. He joined the City University of Hong Kong, Hong Kong, as a Lecturer in 1988. He is currently a Professor in the Electronic Engineering Department. His research interests include machine fault diagnosis, HOS analysis, system identification, and neural networks learning algorithms and applications.