
www.elsevier.com/locate/patcog

Clustering of the self-organizing map using a clustering validity index based on inter-cluster and intra-cluster density

Sitao Wu∗ , Tommy W.S. Chow

Department of Electronic Engineering, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong

Received 10 December 2002; received in revised form 20 June 2003; accepted 20 June 2003

Abstract

The self-organizing map (SOM) has been widely used in many industrial applications. Classical clustering methods based on the SOM often fail to deliver satisfactory results, especially when clusters have arbitrary shapes. In this paper, through some preprocessing techniques for filtering out noises and outliers, we propose a new two-level SOM-based clustering algorithm using a clustering validity index based on inter-cluster and intra-cluster density. Experimental results on synthetic and real data sets demonstrate that the proposed clustering algorithm is able to cluster data better than the classical clustering algorithms based on the SOM, and find an optimal number of clusters.

© 2003 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Partitioning clustering; Hierarchical clustering; Clustering validity index; Self-organizing map; Multi-representation

∗ Corresponding author. Tel.: +852-21942874. E-mail address: eetchow@cityu.edu.hk (S. Wu).

1. Introduction

The self-organizing map (SOM), proposed by Kohonen [1], has been widely used in many industrial applications such as pattern recognition, biological modeling, data compression, signal processing, and data mining [2]. It is an unsupervised and nonparametric neural network approach. The success of the SOM algorithm lies in its simplicity, which makes it easy to understand, simulate, and use in many applications.

The basic SOM consists of a set of neurons usually arranged in a two-dimensional structure such that there are neighborhood relations among the neurons. After completion of training, each neuron is attached to a feature vector of the same dimension as the input space. By assigning each input vector to the neuron with the nearest feature vector, the SOM is able to divide the input space into regions with common nearest feature vectors. This process can be considered as performing vector quantization (VQ) [3]. Because of the inter-connections among neurons, the SOM exhibits another important property of topology preservation. In other words, if two feature vectors are near to each other in the input space, the corresponding neurons will also be close in the output space, and vice versa. Usually the output neurons are arranged in 2D grids. Therefore, the SOM is suitable for visualization purposes.

Clustering algorithms attempt to organize unlabeled input vectors into clusters or "natural groups" such that points within a cluster are more similar to each other than vectors belonging to different clusters [4]. Clustering has been used in exploratory pattern-analysis, grouping, decision-making, and machine-learning situations, including data mining, document retrieval, image segmentation, and pattern classification [5]. The clustering methods are of five types: hierarchical clustering, partitioning clustering, density-based clustering, grid-based clustering, and model-based clustering [6]. Each type has its advantages and disadvantages.

Several attempts have been made to cluster data based on the SOM. In Refs. [7,8] the SOM is used to develop clustering algorithms. The clustering algorithm in Ref. [7] assumes each cluster has a spherical shape, while the HEC network in

0031-3203/03/$30.00 © 2003 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

doi:10.1016/S0031-3203(03)00237-1

176 S. Wu, T.W.S. Chow / Pattern Recognition 37 (2004) 175 – 188

Ref. [8] assumes each cluster can be either hyper-spherical or hyper-ellipsoidal. The number of output neurons is equal to the desired number of clusters. When the number of clusters is a prime number, the SOM cannot be realized in two dimensions. Furthermore, the SOM is conceptually different from clustering [4]. The SOM tries to extract and visually display the topological structure of high-dimensional input data, while clustering is to partition the input data into groups. The algorithms in Refs. [7,8] seem to mix two objectives, feature mapping and clustering, and the overall methodology is difficult to interpret in either case [4].

It is then reasonable that a two-level approach is able to cluster data based on the SOM. The idea is that the first level is to train data by the SOM and the second level is to cluster data based on the SOM. The required number of output neurons at the first level is more than the desired number of clusters. Clustering is carried out by clustering of output neurons after completion of training performed by the SOM. This two-level approach has been addressed in Refs. [9–12]. They are actually multirepresentation-based clustering because each cluster can be represented by multiple output neurons. In Ref. [9] there are two SOM layers for clustering. The second SOM layer takes the outputs of the first SOM layer as its inputs. The number of neurons on the second map is equal to the desired number of clusters. The task of the second SOM layer is analogous to clustering of the SOM by the k-means algorithm. In Ref. [10] an agglomerative contiguity-constrained clustering method on the SOM was proposed. The merging process of neighboring neurons was based on a minimal distance criterion. The algorithm in Ref. [11] extended the idea in Ref. [10] with a minimal variance criterion and achieved better clustering results. In Ref. [12] both the classical hierarchical and partitioning clustering algorithms are applied in clustering of the SOM. The proposed algorithms [12] were aimed at reducing computational complexity compared with the classical clustering methods. The algorithms presented in Refs. [10,11] need to recalculate the center after two clusters are merged. They are only feasible for clusters with hyper-spherical or hyper-ellipsoidal shapes. The second SOM layer [9] and the batch k-means algorithm in clustering of the SOM [12] require the desired number of clusters to be known and are only feasible for hyper-spherical-shaped clusters. Hierarchical clustering algorithms on the SOM in Ref. [12] use only an inter-cluster distance criterion to cluster the output neurons. In order to deal with arbitrary cluster shapes, high-order neurons are introduced in Ref. [13]. The inverse covariance matrix in the clustering metric can be considered as second-order statistics to capture hyper-ellipsoidal properties of clusters. But the algorithm in Ref. [13] is so computationally demanding that it is not suitable for real applications. Some additional information other than distances will be helpful in the clustering process. In Ref. [14] a clustering validity assessment was proposed in which not only inter-cluster distances, but also inter-cluster density and intra-cluster density are considered in the clustering assessments. The assessment is able to find an optimal partition of the input data [14].

In this paper, a new two-level algorithm for clustering of the SOM is proposed. The clustering at the second level is agglomerative hierarchical clustering. The merging criterion is motivated by a clustering validity index based on inter-cluster and intra-cluster density, and inter-cluster distances [14]. The original index is used for the whole input data and therefore is a global index. The optimal number of clusters can be found by the clustering validity index. In this paper, the clustering validity index is slightly modified and used locally to determine which neighboring pair of clusters is to be merged into one cluster in agglomerative hierarchical clustering. Since more information is added into the merging criterion in addition to inter-cluster distances, the proposed algorithm clusters data better than other clustering algorithms based on the SOM. Through certain preprocessing techniques for filtering, the proposed clustering algorithm is able to handle input data with noises and outliers.

This paper is organized into five sections. In Section 2, the SOM and clustering algorithms are briefly reviewed, and algorithms of two-level clustering of the SOM are discussed. In Section 3, a new algorithm of multirepresentation clustering of the SOM is proposed. In Section 4, experimental results on synthetic and real data sets demonstrate that the proposed algorithm is able to cluster the input data and find the optimal number of clusters; the clustering effect of the proposed algorithm is better than that of other clustering algorithms on the SOM. Finally, the conclusions of this work are presented in Section 5.

2. Self-organizing map and clustering

2.1. Self-organizing map and visualization

Competitive learning is an adaptive process in which the neurons in a neural network gradually become sensitive to different input categories, i.e., sets of samples in a specific domain of the input space. A division of neural nodes emerges in the network to represent different patterns of the inputs after training.

The division is enforced by competition among the neurons: when an input x arrives, the neuron that is best able to represent it wins the competition and is allowed to learn it even better, as will be described later. If there exists an ordering between the neurons, i.e., the neurons are located on a discrete lattice, the competitive learning algorithm can be generalized: if not only the winning neuron but also its neighboring neurons on the lattice are allowed to learn, the whole effect is that the final map becomes an ordered map in the input space. This is the essence of the SOM algorithm.

The SOM consists of M neurons located on a regular low-dimensional grid, usually one or two dimensional.


Higher-dimensional grids are possible, but they are not generally used since their visualization is problematic. The lattice of the grid is either hexagonal or rectangular.

The basic SOM algorithm is iterative. Each neuron i has a d-dimensional feature vector w_i = [w_{i1}, \ldots, w_{id}]. At each training step t, a sample data vector x(t) is randomly chosen from the training set. Distances between x(t) and all the feature vectors are computed. The winning neuron, denoted by c, is the neuron with the feature vector closest to x(t):

c = \arg\min_i \|x(t) - w_i\|, \quad i \in \{1, \ldots, M\}.  (1)

A set of neighboring nodes of the winning node is denoted as N_c. We define h_{ic}(t) as the neighborhood kernel function around the winning neuron c at time t. The neighborhood kernel function is a nonincreasing function of time and of the distance of neuron i from the winning neuron c. The kernel can be taken as a Gaussian function:

h_{ic}(t) = \exp\left(-\frac{\|\mathrm{Pos}_i - \mathrm{Pos}_c\|^2}{2\sigma(t)^2}\right), \quad i \in N_c,  (2)

where Pos_i is the coordinates of neuron i on the output grid and \sigma(t) is the kernel width.

The weight update rule in the sequential SOM algorithm can be written as

w_i(t+1) = \begin{cases} w_i(t) + \alpha(t)\, h_{ic}(t)\, (x(t) - w_i(t)), & \forall i \in N_c, \\ w_i(t), & \text{otherwise}. \end{cases}  (3)

Both the learning rate \alpha(t) and the kernel width \sigma(t) decrease monotonically with time. During training, the SOM behaves like a flexible net that folds onto a "cloud" formed by the training data. Because of the neighborhood relations, neighboring neurons are pulled in the same direction, and thus feature vectors of neighboring neurons resemble each other.

There are many variants of the SOM. However, these variants are not considered in this paper because the proposed algorithm is based on the SOM itself, not on a new variant of the SOM.

The 2D map can be easily visualized and thus gives people useful information about the input data. The usual way to display the cluster structure of the data is to use a distance matrix, such as the U-matrix [15]. The U-matrix method displays the SOM grid according to the distance of neighboring neurons. The visual display of the U-matrix can be three-dimensional using ridges and valleys, or two-dimensional using gray level. Clusters are identified by low inter-neuron distances and borders are identified by high inter-neuron distances. Another method of visualizing cluster structure is to assign the input data to their nearest neurons. Some neurons then have no input data assigned to them. These neurons can be used as the borders of clusters [16]. These methods are cluster visualization tools and inherently are not clustering methods.

2.2. Clustering algorithms

There are a multitude of clustering methods in the literature, which can be broadly classified into the following categories [6]: hierarchical clustering, partitioning clustering, density-based clustering, grid-based clustering, and model-based clustering. In this paper, only the first two categories are considered.

In partitioning clustering, given a database of n objects, a partitioning clustering algorithm constructs k partitions of the data, where each partition represents a cluster and k ≤ n. The most used partitioning clustering algorithm is the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster. One advantage of partitioning clustering is that the clustering is dynamic, i.e., data points can move from one cluster to another. The other advantage is that some a priori knowledge, such as cluster shapes, can be incorporated in the clustering. The drawbacks of partitioning clustering are the following: (1) it encounters difficulty at discovering clusters of arbitrary shapes; (2) the number of clusters is pre-fixed and the optimal number of clusters is hard to determine.

A hierarchical clustering algorithm creates a hierarchical decomposition of the given set of data objects. It can be classified as either agglomerative or divisive. The advantage of hierarchical clustering is that it is not affected by initialization and local minima. The shortcomings of hierarchical clustering are the following: (1) it is impractical for large data sets due to the high computational complexity; (2) it does not incorporate any a priori knowledge such as cluster shapes; (3) the clustering is static, i.e., data points in a cluster at an early stage cannot move to another cluster at a later stage. In this paper, divisive hierarchical clustering is not considered because the top–down direction of divisive hierarchical clustering is not suitable for two-level clustering of the SOM.

In the classical agglomerative hierarchical clustering, the pair of clusters to be merged has the minimum inter-cluster distance. The widely used measures of inter-cluster distance are listed in Table 1 (m_i is the mean for cluster C_i and n_i is the number of points in C_i). All of these distance measures yield the same clustering results if the clusters are compact and well separated. But in some cases [17], using d_max, d_ave, and d_mean as the distance measures results in wrong clusters that are similar to those determined by partitioning clustering. The single-linkage clustering with distance measure d_min may have "chaining effects": a few points located so as to form a bridge between two clusters cause points across the clusters to be grouped into a single cluster.

The extended SOM (minimum distance) [10] utilizes the single-linkage clustering method on the SOM. The clustering in the extended SOM (minimum variance) [11] is different from the minimum-distance-based agglomerative hierarchical clustering, but it is also an agglomerative hierarchical clustering on the SOM, based on minimum variance.
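As a concrete illustration of Section 2.1, the sequential training of Eqs. (1)–(3) can be sketched in NumPy as follows. The linear decay schedules and the parameter values (`alpha0`, `sigma0`, the map size) are our own illustrative assumptions; the paper only requires that \alpha(t) and \sigma(t) decrease monotonically with time.

```python
import numpy as np

def som_train(X, grid_h=4, grid_w=4, epochs=100,
              alpha0=0.5, sigma0=2.0, seed=0):
    """Sequential SOM training following Eqs. (1)-(3).

    X: (n, d) data matrix. Returns the (M, d) feature vectors and the
    (M, 2) grid coordinates, with M = grid_h * grid_w.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Grid coordinates Pos_i of the M neurons (rectangular lattice).
    pos = np.array([[r, c] for r in range(grid_h) for c in range(grid_w)], float)
    M = len(pos)
    W = rng.random((M, d))        # initial feature vectors w_i
    T = epochs * n                # total number of update steps
    for t in range(T):
        x = X[rng.integers(n)]                            # random sample x(t)
        c = np.argmin(np.linalg.norm(W - x, axis=1))      # winning neuron, Eq. (1)
        # Learning rate and kernel width decrease monotonically with t
        # (linear decay is an assumption, not specified by the paper).
        alpha = alpha0 * (1.0 - t / T)
        sigma = sigma0 * (1.0 - t / T) + 1e-3
        # Gaussian neighborhood kernel around the winner c, Eq. (2).
        h = np.exp(-np.sum((pos - pos[c]) ** 2, axis=1) / (2.0 * sigma ** 2))
        # Weight update rule, Eq. (3).
        W += alpha * h[:, None] * (x - W)
    return W, pos
```

Each step draws one sample, finds the winner by Eq. (1), and pulls the winner and its lattice neighbors toward the sample, weighted by the Gaussian kernel of Eq. (2).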


Table 1
Four types of definitions of inter-cluster distance

Single-linkage: d_{min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} \|p - p'\|
Complete-linkage: d_{max}(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} \|p - p'\|
Average-linkage: d_{ave}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i,\, p' \in C_j} \|p - p'\|
Centroid-linkage: d_{mean}(C_i, C_j) = \|m_i - m_j\|

Fig. 1. The two-level approach of clustering of the SOM. Different symbols on the map represent different clusters.

2.3. Clustering of the SOM

In the first level, we use the SOM to form a 2D feature map. The number of output neurons is significantly more than the desired number of clusters, so that several neurons, rather than a single neuron, represent a cluster. Then in the second level the output neurons are clustered such that the neurons on the map are divided into as many different regions as the desired number of clusters. Each input data point can be assigned to a cluster according to its nearest output neuron. The whole process is illustrated in Fig. 1. This two-level approach for clustering of the SOM has been addressed in Refs. [9–12]. The recently developed hierarchical algorithms CURE [17] and Chameleon [18] also utilize a two-level clustering technique.

The classical clustering algorithms can be used in clustering the output neurons of the SOM. However, due to the disadvantages of the different types of clustering algorithms, we must choose an appropriate one to cluster the SOM. Partitioning clustering of the SOM can cause incorrect clusters: if the clusters in the input data have nonspherical shapes, clustering of the SOM deteriorates. The number of clusters in partitioning clustering, which is usually unknown beforehand, must be predefined. Therefore, a partitioning algorithm is not adopted in this study; this will be illustrated in Section 4. Agglomerative hierarchical clustering of the SOM is the algorithm we adopted in this paper, but here more information about the clusters is added into the clustering algorithm. In the classical agglomerative hierarchical clustering of the SOM, i.e., the extended SOM (minimum distance) [10], only inter-cluster distance information is used to merge the nearest neighboring clusters. In the extended SOM (minimum variance) [11], intra-cluster variances were considered and better clustering results were obtained. But there is no explicit inter-cluster distance information in the extended SOM (minimum variance). In some cases, a clustering algorithm using more information about the pair of clusters in addition to inter-cluster distances, such as special characteristics of the individual clusters and of the between-cluster region, will have better clustering effects than the classical hierarchical clustering algorithms. This additional information has been considered in the Chameleon algorithm [18], but that algorithm has to construct a k-nearest-neighbor graph, which is computationally complex when the data set is very large.

In this paper, inter-cluster and intra-cluster density are added into the merging criterion, and some useful steps for filtering noises and outliers are proposed before clustering of the SOM. This will be discussed in the next section.

3. Clustering of the SOM using local clustering validity index and preprocessing of the SOM for filtering

3.1. Global clustering validity index for different clustering algorithms

Since there are many clustering algorithms in the literature, evaluation criteria are needed to justify the correctness of a partition. Furthermore, the evaluation algorithms need to address the number of clusters that appear in a data set. A lot of effort has been made on clustering in the area of pattern recognition [19]. In general, there are three types of methods used to investigate cluster validity: (1) external criteria; (2) internal criteria; (3) relative criteria [19].


The idea of the relative criteria is to evaluate the results of a clustering algorithm in terms of "good clusters" [20]. The implementation of most validity algorithms is very computationally intensive, especially when the number of clusters and the number of input data are very large [21]. Some validity indexes are dependent on the data [22], and certain indexes are dependent on the number of clusters [23]. Some indexes use the sample mean of each subset, and others, in the other extreme, use all points in each subset. For arbitrarily shaped clusters, using multiple representation points to represent a cluster will give a better cluster description than a single-center representation, and multirepresentation points will be computationally lighter than an all-point representation.

The newly proposed multirepresentation clustering validity index [14] is a relative, algorithm-independent clustering index for assessing the quality of a partitioning. The index is based on two accepted concepts: (1) a cluster's compactness; (2) a cluster's separation. The definitions of the clustering validity index are modified a little in this paper.

The notations in the clustering validity index are defined as follows. The data set is partitioned into c clusters. A set of representation points V_i = \{v_{i1}, v_{i2}, \ldots, v_{ir_i}\} represents the ith cluster, where r_i is the number of representation points of the ith cluster. stdev(i) is the standard deviation vector of the ith cluster. The pth component of stdev(i) is defined by

\mathrm{stdev}^p(i) = \sqrt{\sum_{k=1}^{n_i} (x_k^p - m_i^p)^2 / (n_i - 1)},

where n_i is the number of data points in the ith cluster, x_k is the data belonging to the ith cluster, and m_i is the sample mean of the ith cluster. The average standard deviation is given by

\mathrm{stdev} = \sqrt{\frac{1}{c} \sum_{i=1}^{c} \|\mathrm{stdev}(i)\|^2}.

Intra-cluster density in Ref. [14] is defined as the number of points that belong to the neighborhood of the representation points of the clusters. The intra-cluster density is significantly high for well-separated clusters. The definition of intra-cluster density is given by

\mathrm{Intra\_den}(c) = \frac{1}{c} \sum_{i=1}^{c} \sum_{j=1}^{r_i} \mathrm{density}(v_{ij}), \quad c > 1.  (4)

The term density(v_{ij}) in (4) is defined by \mathrm{density}(v_{ij}) = \sum_{l=1}^{n_i} f(x_l, v_{ij}), where x_l belongs to the ith cluster, v_{ij} is the jth representation point of the ith cluster, and f(x_l, v_{ij}) is defined by

f(x_l, v_{ij}) = \begin{cases} 1, & \|x_l - v_{ij}\| \le \mathrm{stdev}, \\ 0, & \text{otherwise}. \end{cases}  (5)

Inter-cluster density is defined as the density in the between-cluster areas [14]. The density in the between-cluster region is intended to be significantly low:

\mathrm{Inter\_den}(c) = \sum_{i=1}^{c} \sum_{\substack{j=1 \\ j \ne i}}^{c} \frac{\|\mathrm{close\_rep}(i) - \mathrm{close\_rep}(j)\|}{\mathrm{stdev}(i) + \mathrm{stdev}(j)} \, \mathrm{density}(u_{ij}), \quad c > 1,  (6)

where close_rep(i) and close_rep(j) are the closest pair of representation points of the ith and jth clusters, and u_{ij} is the middle point between the pair of points close_rep(i) and close_rep(j). density(u_{ij}) is given by \mathrm{density}(u_{ij}) = \sum_{k=1}^{n_i + n_j} f(x_k, u_{ij}), where x_k is an input vector belonging to the ith or jth cluster, and f(x_k, u_{ij}) is defined by

f(x_k, u_{ij}) = \begin{cases} 1, & \|x_k - u_{ij}\| \le (\mathrm{stdev}(i) + \mathrm{stdev}(j))/2, \\ 0, & \text{otherwise}. \end{cases}  (7)

The clusters' separation evaluates the separation of clusters. It includes both the inter-cluster distances and the inter-cluster density. The goal is that the inter-cluster distance is significantly high while the inter-cluster density is significantly low. The definition of the clusters' separation is

\mathrm{Sep}(c) = \sum_{i=1}^{c} \sum_{\substack{j=1 \\ j \ne i}}^{c} \frac{\|\mathrm{close\_rep}(i) - \mathrm{close\_rep}(j)\|}{1 + \mathrm{Inter\_den}(c)}, \quad c > 1.  (8)

The overall clustering validity index, which is called "Composing Density Between and Within clusters" (CDbw) [14], is defined by

\mathrm{CDbw}(c) = \mathrm{Intra\_den}(c) \times \mathrm{Sep}(c).  (9)

Experiments in Ref. [14] demonstrate that the value of CDbw reaches its maximum when c is the optimal number of clusters, irrespective of the different clustering algorithms. The computational complexity is O(n) [14], which is acceptable for large data sets.

The CDbw can be applied in hierarchical clustering algorithms. Instead of only distance information being used, intra-cluster and inter-cluster density information are also utilized in merging two neighboring clusters. The advantages of the new merging mechanism are described in the next subsection.

3.2. Merging criterion using the CDbw

In the proposed hierarchical clustering of the SOM, the inter-cluster and intra-cluster density are incorporated into the merging criterion in addition to distance information. The merging process is described as follows. First, compute the CDbw for the data belonging to each neighboring pair of clusters. Contrary to finding the maximal CDbw for all the input data, the merging mechanism is to find the pair of clusters with the minimal value of the CDbw, which indicates that the two clusters have the strongest tendency to be clustered together. The CDbw is then used locally.
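To make the merging criterion concrete, the following is a simplified sketch of computing the CDbw for one pair of clusters, using a single representation point (the centroid) per cluster instead of the multiple representation points of Ref. [14]; the function name `cdbw_pair` is ours.

```python
import numpy as np

def cdbw_pair(A, B):
    """Simplified CDbw (Eq. (9)) for one pair of clusters, with one
    representation point (the centroid) per cluster.

    A, B: (n_i, d) arrays of the points in each cluster. In the
    proposed merging step, the pair of directly neighboring clusters
    with the *lowest* CDbw is merged.
    """
    reps = [A.mean(axis=0), B.mean(axis=0)]
    stdevs = [np.linalg.norm(C.std(axis=0, ddof=1)) for C in (A, B)]
    avg_stdev = np.sqrt(sum(s ** 2 for s in stdevs) / 2.0)

    # Intra-cluster density, Eq. (4): points within avg_stdev of a rep.
    intra = 0.0
    for C, v in zip((A, B), reps):
        intra += np.sum(np.linalg.norm(C - v, axis=1) <= avg_stdev)
    intra /= 2.0

    # Inter-cluster density, Eq. (6): density at the midpoint u between
    # the closest representation points (here, the two centroids).
    u = (reps[0] + reps[1]) / 2.0
    both = np.vstack((A, B))
    radius = (stdevs[0] + stdevs[1]) / 2.0
    dens_u = np.sum(np.linalg.norm(both - u, axis=1) <= radius)
    dist = np.linalg.norm(reps[0] - reps[1])
    inter = 2.0 * (dist / (stdevs[0] + stdevs[1])) * dens_u  # (i,j) and (j,i)

    # Separation, Eq. (8), and the overall index, Eq. (9).
    sep = 2.0 * dist / (1.0 + inter)
    return intra * sep
```

A dense between-cluster region and a short centroid distance both drive the value down, which is why the pair with the lowest CDbw is the one with the strongest merging tendency.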


The advantage of the merging mechanism is that the clustering result is more accurate, due to the more information about the individual clusters considered. This can be illustrated in Fig. 2. There are three well-separated clusters in Fig. 2. The left two clusters are elongated in the vertical direction. The third cluster has a spherical shape and is more compact than the left two clusters. The density of the third cluster is greater than that of the other two. The input data are agglomerated from three clusters into two clusters by using centroid-linkage hierarchical clustering. The distances between the cluster centers are listed in Table 2. It seems that the first and third clusters are to be merged since they have the shortest inter-cluster distance between cluster centers. But the size and density of each cluster can also affect the cluster-merging process. The first two clusters are more likely to be merged because the inter-cluster region between the first and second clusters is denser than that between the other two pairs of clusters. From Table 2 the value of the CDbw between the first and second clusters is the lowest, which indicates the clustering tendency of merging the first and second clusters. This example considers only a single representation point per cluster. Using the CDbw, a multirepresentation clustering can capture this merging tendency more accurately than clustering algorithms using only distance information.

Fig. 2. The 2D input data with three well-separated clusters.

Table 2
The inter-cluster distance and the CDbw of the three clusters

Pair of clusters   Distance   CDbw
1 and 2            11.90      50.00
2 and 3            14.37      13102.56
1 and 3            7.94       1292.75

3.3. Preprocessing of the SOM for filtering

In two-level clustering, only the feature vectors of the neurons are used to cluster the SOM, so information about the density of the clusters is lost. But if all input data belonging to certain neurons are considered, noises and outliers may deteriorate the clustering effect. To achieve a better clustering result and be less affected by noises and outliers, some preliminary steps must be performed before clustering of the SOM. The preprocessing steps are described as follows:

1. First, assign each input x_i to its nearest neuron j on the map according to Eq. (1). Neurons with no input data assigned, which are called interpolating units in Ref. [12], are not included in the next clustering steps.
2. Compute the mean vector m_j of the assigned input data for each effective neuron j. Then compute the distance deviation between each feature vector w_j and m_j: dev_j = \|w_j - m_j\|. Compute the mean mean_dev and standard deviation std_dev for all the dev_j's.
3. If dev_j > mean_dev + std_dev, exclude the neuron j from the later clustering and filter out the input data assigned to it. This mechanism can filter out the input outliers.
4. Compute the distances between the input vectors assigned to the jth neuron and the feature vector w_j, i.e., dis_j(x_i) = \|x_i - w_j\|, where x_i belongs to the jth neuron. Then compute the mean mean_dis_j and standard deviation std_dis_j for all the dis_j(x_i)'s.
5. If the distance between an input vector x_i assigned to the jth neuron and the feature vector w_j is larger than mean_dis_j + std_dis_j, i.e., \|x_i - w_j\| > mean_dis_j + std_dis_j, filter out the input vector x_i for the next clustering steps. This can filter out the input outliers and noises.
6. Compute the number of data belonging to the jth neuron: num_j. Then compute the mean mean_num and standard deviation std_num for all the num_j's.
7. If the number of data belonging to the jth neuron is less than mean_num − std_num, i.e., num_j < mean_num − std_num, exclude the neuron j from the later clustering and filter out the input data assigned to it. This can filter out the input noises.

3.4. Clustering of the SOM

After the preprocessing before clustering of the SOM, some neurons and some input data are excluded. The neurons and input data left can be hierarchically clustered. In this paper, rectangular grids are used for the SOM. The merging process happens for neighboring clusters, which means that the neurons belonging to the pair of clusters are direct neighbors. For example, the neurons in the nonboundary area can have eight direct neighbors.
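The seven filtering steps of Section 3.3 might be sketched as follows; `preprocess_som` is our own name, and the sketch assumes Euclidean distances and the one-standard-deviation thresholds stated in the steps.

```python
import numpy as np

def preprocess_som(X, W):
    """A sketch of the filtering steps of Section 3.3.

    X: (n, d) input data; W: (M, d) trained SOM feature vectors.
    Returns boolean masks over neurons and data points that survive
    the filtering.
    """
    M = len(W)
    # Step 1: assign each input to its nearest neuron (Eq. (1)).
    assign = np.argmin(np.linalg.norm(X[:, None] - W[None], axis=2), axis=1)
    counts = np.bincount(assign, minlength=M)
    keep_neuron = counts > 0          # drop interpolating units
    # Steps 2-3: drop neurons whose feature vector w_j deviates from
    # the mean m_j of their assigned data by more than mean_dev + std_dev.
    dev = np.full(M, np.nan)
    for j in np.flatnonzero(keep_neuron):
        dev[j] = np.linalg.norm(W[j] - X[assign == j].mean(axis=0))
    d = dev[keep_neuron]
    keep_neuron &= ~(dev > d.mean() + d.std(ddof=1))
    # Steps 4-5: within each kept neuron, drop inputs farther from w_j
    # than mean_dis_j + std_dis_j (outliers and noises).
    keep_data = keep_neuron[assign].copy()
    for j in np.flatnonzero(keep_neuron):
        idx = np.flatnonzero(assign == j)
        if len(idx) < 2:
            continue
        dis = np.linalg.norm(X[idx] - W[j], axis=1)
        keep_data[idx[dis > dis.mean() + dis.std(ddof=1)]] = False
    # Steps 6-7: drop sparsely populated neurons (num_j < mean_num - std_num).
    c = counts[keep_neuron].astype(float)
    keep_neuron &= ~(counts < c.mean() - c.std(ddof=1))
    keep_data &= keep_neuron[assign]
    return keep_neuron, keep_data
```

Only the surviving neurons and data points take part in the hierarchical clustering of Section 3.4.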


This is illustrated in Fig. 3(a). Two clusters can be merged if the two clusters are direct neighbors. When the pair of clusters are not direct neighbors, they cannot be merged. This is shown in Fig. 3(b) and (c). The CDbw is used locally for the input data belonging to each directly neighboring pair of clusters. If the CDbw for a pair of directly neighboring clusters is the lowest among all the available directly neighboring pairs of clusters, that pair of clusters is merged into one cluster. If the number of neurons is very large, the interpolating neurons form the inter-cluster borders on the map. This may produce more clusters than the number of natural clusters. Thus, we must choose an appropriate SOM map size.

Fig. 3. (a) The direct neighbors (B–I) of neuron A; (b) multi-neuron represented neighboring clusters 1 and 2 can be clustered into one cluster because the two clusters are direct neighbors; (c) multi-neuron represented clusters 1 and 2 cannot be clustered into one cluster because the two clusters are not direct neighbors.

3.5. The algorithm of clustering of the SOM

The overall proposed algorithm is summarized as follows:

1. Train the SOM on the input data at the first level.
2. Preprocess before clustering of the SOM, as described in Section 3.3.
3. Cluster the SOM by using agglomerative hierarchical clustering. The merging criterion is made by the CDbw for all pairs of directly neighboring clusters. Compute the global CDbw for all the input data before each merging step, until only two clusters exist or merging cannot happen.
4. Find the optimal partition of the input data according to the CDbw for all the input data as a function of the number of clusters.

4. Experimental results

To demonstrate the effectiveness of the proposed clustering algorithm, four data sets were used in our experiments. The input data are normalized such that the value of each datum in each dimension lies in [0,1]. For training the SOM we used 100 training epochs on the input data, and the learning rate decreases from 1 to 0.0001.

Fig. 4. The synthetic data set in the 2D plane.
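The direct-neighbor test of Section 3.4 (Fig. 3) on a rectangular grid amounts to checking whether any neuron of one cluster is 8-connected to a neuron of the other; a small sketch (the function name is ours):

```python
def are_direct_neighbors(cluster_a, cluster_b):
    """Merging constraint of Section 3.4 on a rectangular SOM grid:
    two clusters may merge only if some neuron of one is a direct
    (8-connected) neighbor of some neuron of the other.

    cluster_a, cluster_b: lists of (row, col) grid coordinates.
    """
    for ra, ca in cluster_a:
        for rb, cb in cluster_b:
            # Chebyshev distance 1 means adjacent horizontally,
            # vertically, or diagonally, as in Fig. 3(a).
            if max(abs(ra - rb), abs(ca - cb)) == 1:
                return True
    return False
```

In the agglomerative loop, only pairs passing this test are candidates, and among them the pair with the lowest local CDbw is merged.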


Fig. 5. Three partitions of the synthetic data set for clustering of the SOM by (a) the proposed algorithm; (b) the k-means clustering algorithm; (c) the single-linkage hierarchical clustering algorithm; (d) the complete-linkage hierarchical clustering algorithm; (e) the centroid-linkage hierarchical clustering algorithm; (f) the average-linkage hierarchical clustering algorithm. ".", "*" and "+" indicate three different clusters, respectively.



Fig. 6. The CDbw as a function of the number of clusters for the synthetic data set by the proposed algorithm (solid line), and the

single-linkage clustering algorithm (dashed line) on the SOM.
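Reading the optimal number of clusters off such a curve is simply an argmax over the candidate cluster counts. A trivial sketch, with hypothetical scores standing in for measured CDbw values:

```python
def best_k(curve):
    """curve: dict mapping a candidate number of clusters to its validity
    score (here, CDbw). Returns the count with the maximal score."""
    return max(curve, key=curve.get)

# Hypothetical validity scores, not the paper's measurements.
curve = {2: 6.1, 3: 10.4, 4: 8.0, 5: 7.2}
assert best_k(curve) == 3
```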

Fig. 7. Three clusters for the synthetic data set are displayed on the map by the proposed algorithm or the single-linkage clustering algorithm on the SOM (SOM map size of 4 × 4). The same symbol on the map represents the same cluster.

Fig. 8. Three clusters for the synthetic data set are displayed on the map by the proposed algorithm or the single-linkage clustering algorithm on the SOM (SOM map size of 6 × 6). The same symbol on the map represents the same cluster.

4.1. 2D synthetic data set

outliers. The correctness of the clustering result can be easily seen in a 2D plane. The data generated consisted of three shallow elongated parallel clusters in the 2D plane, as shown in Fig. 4. Some noises and outliers were added into the input data. We used the SOM algorithm to train on the input data. Then we preprocessed the SOM before clustering of the SOM. Finally we used k-means, four different hierarchical clustering algorithms, and the proposed algorithm for clustering of the SOM. In the experiment we first used a map size of 4 × 4. The clustering results are illustrated in Fig. 5. Except for the single-linkage algorithm, the k-means and hierarchical clustering algorithms cannot correctly find the true clusters. These clustering algorithms split one true cluster into two clusters because they can only deal with clusters of spherical shapes. For the proposed algorithm and the single-linkage clustering algorithm, the partition of the data is correct. The CDbw as a function of the number of clusters for the two clustering algorithms is plotted in Fig. 6. The maximal value of the CDbw by the two algorithms indicates there are three clusters in the input data, which is consistent with the cluster structure designed by us. The input data are assigned to the nearest effective neurons. According to the assigned input data and their cluster memberships, the cluster memberships for the effective neurons can be displayed on the 2D map. The three clusters on the map with map size of 4 × 4 by the proposed algorithm are shown in Fig. 7, although it is
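The assignment of input data to the nearest effective neurons mentioned above is a plain nearest-prototype lookup; a minimal numpy sketch (variable names are ours):

```python
import numpy as np

def assign_to_neurons(data, weights):
    """data: (n, d) array of input vectors; weights: (m, d) array of
    effective-neuron weight vectors. Returns, for each datum, the index
    of its nearest neuron under Euclidean distance."""
    # Pairwise squared distances via broadcasting: (n, 1, d) - (1, m, d).
    d2 = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)
```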



Fig. 9. The CDbw as a function of the number of clusters for the iris data by the proposed algorithm (solid line), and the single-linkage

clustering algorithm (dashed line) on the SOM.

Fig. 10. Two clusters for the iris data set are identified by the proposed algorithm or single-linkage clustering algorithm on the SOM (SOM map size of 4 × 4). The same symbol on the map represents the same cluster.

Fig. 11. For the known three classes, three clusters are formed (SOM map size of 4 × 4) for the iris data set by (a) the proposed algorithm; (b) single-linkage clustering algorithm on the SOM. The same symbol on the map represents the same cluster.

a trivial thing for 2D data. We also tested the above algorithms on map sizes of 3 × 3 and 5 × 5 and obtained similar results. For a smaller map size such as 2 × 3, clustering by all clustering algorithms on the SOM gave wrong clusters. This is because each cluster has fewer representations, so that the elongated clusters cannot be adequately described by the representation neurons. But for a larger map size such as 6 × 6, all these algorithms achieved the same correct results, because the interpolating neurons explicitly form the borders of the three natural clusters and the minimal number of final agglomerated clusters is 3. The three clusters on the SOM with map size of 6 × 6 are shown in Fig. 8, where the interpolating neurons clearly separate the map into three regions representing three clusters. For an even larger map size, the minimal number of final agglomerated clusters can be larger than three and thus cannot express the true information about the input data.

We do not use the k-means, the complete-linkage, the centroid-linkage and the average-linkage clustering algorithms on the SOM in the next three data sets because they cannot deal with clusters of arbitrary shapes.

Table 3
Performance comparison of different clustering algorithms for the iris data set

Algorithm                               Clustering accuracy (%)
Proposed clustering of the SOM          96.0
Single-linkage clustering of the SOM    74.7
Extended SOM (minimum variance)         90.3
Extended SOM (minimum distance)         89.2
Direct k-means                          85.3
Direct single-linkage clustering        68.0
Direct complete-linkage clustering      84.0
Direct centroid-linkage clustering      68.0
Direct average-linkage clustering       69.3


[Fig. 12 panels: Cluster1 (30), Cluster2 (35), Cluster3 (40), Cluster4 (45), Cluster5 (50), Cluster6 (55), Cluster7 (60), Cluster8 (70), Cluster9 (75), Cluster10 (80), Cluster11 (85), Cluster12 (90), Cluster13 (95), Cluster14 (105), Cluster15 (120), Cluster16 (130), Cluster17 (140), Cluster18 (150), Cluster19 (160), Cluster20 (160).]

Fig. 12. The statistical information of 20 clusters for the 15D synthetic data set. The horizontal axis in each subfigure represents the dimension and the vertical axis in each subfigure represents the value in each dimension. The mean value of each dimension in each cluster is represented by a black dot in each subfigure. The standard deviation of each dimension in each cluster is represented by two curves enclosing the black dots in each subfigure. The number of data points in each cluster is bracketed in each subfigure.
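Data of the kind summarized in Fig. 12 (20 uniformly placed 15D centres with per-dimension Gaussian noise of standard deviation 0.12 and cluster sizes between 30 and 165, as described in Section 4.3) can be generated roughly as follows; the seed and the random per-cluster sizes are our choices, and the paper's actual sizes total 1780 points:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

n_clusters, dim, noise_sd = 20, 15, 0.12
sizes = rng.integers(30, 166, size=n_clusters)  # points per cluster in [30, 165]

# 20 uniformly distributed random centres with each dimension in [0, 1].
centres = rng.uniform(0.0, 1.0, size=(n_clusters, dim))

# Each datum is its cluster centre plus Gaussian noise in every dimension.
data = np.vstack([c + rng.normal(0.0, noise_sd, size=(n, dim))
                  for c, n in zip(centres, sizes)])
labels = np.repeat(np.arange(n_clusters), sizes)
```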

The map size should be carefully chosen so that it is able to learn the natural number of clusters.

4.2. Iris data set

The Iris data set [24] has been widely used in pattern classification. It has 150 data points of four dimensions. The data are divided into three classes with 50 points each. The first class of Iris plant is linearly separable from the other two. The other two classes overlap to some extent and are not linearly separable. We clustered the data by the proposed clustering algorithm and the single-linkage clustering algorithm on the SOM. We used an appropriate map size of 4 × 4. The two algorithms achieved the same optimal clustering results. The CDbw as a function of the number of clusters, plotted in Fig. 9, indicates that the data are optimally divided into two clusters. The iris data of the first class form a cluster, and the other two classes form the other cluster. The two clusters can be displayed on the map shown in Fig. 10, where the "*" symbol represents the first class and "+" represents the second and third classes. This is inconsistent with the inherent three classes in the data. The same two clusters are also achieved in Ref. [25]. The two clusters are formed without a priori information about the classes of the data. Therefore, the two clusters found in the data are reasonable.

Fig. 13. Twenty clusters for the 15D synthetic data set are displayed on the map by the proposed algorithm on the SOM (SOM map size of 8 × 8). The same number on the map represents the same cluster.

If three clusters are forced to be formed, the proposed algorithm is better than the single-linkage clustering algorithm. The partition of the map performed by the two algorithms is shown in Fig. 11. One cluster representing the first class (the "*" symbol) is clearly separated from the other two clusters by the interpolating neurons representing borders. The clustering accuracies by the proposed algorithm, the single-linkage clustering of the SOM, the extended SOM



Fig. 14. The CDbw as a function of the number of clusters for the 15D synthetic data set by the proposed algorithm on the SOM.

(minimum distance) [10], the extended SOM (minimum variance) [11], the direct k-means, and the four direct agglomerative hierarchical clustering algorithms are listed in Table 3. The single-linkage clustering of the SOM has the disadvantage of a chain-link tendency, so that it joins most of the points in the second and third classes, which resulted in a low clustering accuracy of 74.7%. Therefore, in the next two experiments we do not use the single-linkage clustering of the SOM. On the other hand, the proposed algorithm is able to form the correct three clusters with the highest accuracy of 96.0%, while the other agglomerative hierarchical clustering algorithms (direct or not) and the direct k-means algorithm have lower clustering accuracy to some extent in distinguishing the second and third classes.

So in real data sets, if the number of classes is known and is used as the number of clusters, and some overlap exists in some pair of clusters, the proposed algorithm is a better choice. If we do not use the information about the number of classes, a pair of overlapping classes may merge into a single cluster and then the number of clusters may be less than the number of classes.

4.3. 15D synthetic data set

In this example, we used a data set of 1780 15D data. The data were created by first generating 20 uniformly distributed random 15D points with each dimension lying in [0, 1], and then adding some Gaussian noise to each point with standard deviation 0.12 in each dimension. The number of data points in each cluster varies from 30 to 165. The statistical information for each cluster can be seen in Fig. 12. We used the proposed clustering algorithm on the SOM with a map size of 8 × 8. The 20 clusters displayed on the map are shown in Fig. 13. The CDbw as a function of the number of clusters, plotted in Fig. 14, strongly indicates that 20 clusters exist in the data. The partition of the data is consistent with the cluster structure of the data with 100% accuracy, and thus the statistical information of the clusters generated by the proposed clustering algorithm is the same as that of the known clusters shown in Fig. 12.

4.4. Wine data set

The wine data set [26] has 178 13D data with three known classes. The numbers of data samples in the three classes are 59, 71 and 48, respectively. We use the proposed algorithm with a map size of 4 × 4 to cluster the data. The CDbw as a function of the number of clusters, plotted in Fig. 15, indicates that the number of clusters is three, which is exactly equal to the number of the classes. This is because the three classes are well separated from each other. The three clusters on the map are shown in Fig. 16. The clustering accuracies by the proposed algorithm, the extended SOM (minimum variance), the direct k-means, and the four direct agglomerative hierarchical clustering algorithms are listed in Table 4. For this data set, the proposed algorithm achieved the best clustering result with the highest clustering accuracy of 98.3%. The extended SOM (minimum variance) and the direct k-means algorithm also have good clustering results with accuracy larger than 90.0%, while the four direct agglomerative hierarchical clustering algorithms have a worse effect in distinguishing the second and third classes with lower



Fig. 15. The CDbw as a function of the number of clusters for the wine data set by the proposed algorithm on the SOM.

clustering accuracy less than 70%. The three clusters are of near-spherical shape, but have some noises between clusters. Therefore, the direct k-means algorithm has a better clustering effect than the direct agglomerative hierarchical clustering algorithms.

Fig. 16. Three clusters for the wine data set are displayed on the map by the proposed clustering algorithm on the SOM (SOM map size of 4 × 4). The same symbol on the map represents the same cluster.

Table 4
Performance comparison of different clustering algorithms for the wine data set

Algorithm                              Clustering accuracy (%)
Proposed clustering of the SOM         98.3
Extended SOM (minimum variance)        93.3
Direct k-means                         97.8
Direct single-linkage clustering       57.4
Direct complete-linkage clustering     67.4
Direct centroid-linkage clustering     61.2
Direct average-linkage clustering      61.2

5. Conclusions

In this paper, we have proposed a new two-level SOM-based clustering algorithm. It uses the clustering validity index locally to determine which pair of clusters is to be merged. The optimal number of clusters can be determined by the maximum value of the CDbw, which is the clustering validity index computed globally for all input data. Compared with classical clustering methods on the SOM, the proposed algorithm utilizes more information about the data in each cluster in addition to inter-cluster distances. The proposed algorithm therefore clusters data better than the classical clustering algorithms on the SOM. The preprocessing steps for filtering out noises and outliers are also included to increase the accuracy and robustness of clustering of the SOM. The experimental results on the four data sets demonstrate that the proposed clustering algorithm is a better clustering algorithm than other clustering algorithms on the SOM.

References

[1] T. Kohonen, Self-organized formation of topologically correct feature maps, Biol. Cybern. 43 (1982) 59–69.
[2] T. Kohonen, Self-Organizing Maps, Springer, Berlin, Germany, 1997.
[3] R.M. Gray, Vector quantization, IEEE Acoust., Speech, Signal Process. Mag. 1 (2) (1984) 4–29.
[4] N.R. Pal, J.C. Bezdek, E.C.-K. Tsao, Generalized clustering networks and Kohonen's self-organizing scheme, IEEE Trans. Neural Networks 4 (4) (1993) 549–557.
[5] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Comput. Surveys 31 (3) (1999) 264–323.
[6] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, 2000.
[7] T. Huntsberger, P. Ajjimarangsee, Parallel self-organizing feature maps for unsupervised pattern recognition, Int. J. Gen. Systems 16 (1989) 357–372.
[8] J. Mao, A.K. Jain, A self-organizing network for hyperellipsoidal clustering (HEC), IEEE Trans. Neural Networks 7 (1) (1996) 16–29.
[9] J. Lampinen, E. Oja, Clustering properties of hierarchical self-organizing maps, J. Math. Imag. Vis. 2 (2–3) (1992) 261–272.
[10] F. Murtagh, Interpreting the Kohonen self-organizing feature map using contiguity-constrained clustering, Pattern Recognition Lett. 16 (1995) 399–408.
[11] M.Y. Kiang, Extending the Kohonen self-organizing map networks for clustering analysis, Comput. Stat. Data Anal. 38 (2001) 161–180.
[12] J. Vesanto, E. Alhoniemi, Clustering of the self-organizing map, IEEE Trans. Neural Networks 11 (3) (2000) 586–600.
[13] H. Lipson, H.T. Siegelmann, Clustering irregular shapes using high-order neurons, Neural Computation 12 (10) (2000) 2331–2353.
[14] M. Halkidi, M. Vazirgiannis, Clustering validity assessment using multi representatives, Proceedings of SETN Conference, Thessaloniki, Greece, April 2002.
[15] A. Ultsch, H.P. Siemon, Kohonen's self organizing feature maps for exploratory data analysis, Proceedings of the International Neural Network Conference, Dordrecht, Netherlands, 1990, pp. 305–308.
[16] X. Zhang, Y. Li, Self-organizing map as a new method for clustering and data analysis, Proceedings of the International Joint Conference on Neural Networks, Nagoya, Japan, 1993, pp. 2448–2451.
[17] S. Guha, R. Rastogi, K. Shim, CURE: an efficient clustering algorithm for large databases, Proceedings of ACM SIGMOD International Conference on Management of Data, New York, 1998, pp. 73–84.
[18] G. Karypis, E.-H. Han, V. Kumar, Chameleon: hierarchical clustering using dynamic modeling, IEEE Comput. 32 (8) (1999) 68–74.
[19] S. Theodoridis, K. Koutroumbas, Pattern Recognition, Academic Press, New York, 1999.
[20] J.C. Dunn, Well separated clusters and optimal fuzzy partitions, J. Cybern. 4 (1974) 95–104.
[21] X.L. Xie, G. Beni, A validity measure for fuzzy clustering, IEEE Trans. Pattern Anal. Mach. Intell. 13 (8) (1991) 841–847.
[22] G.W. Milligan, S.C. Soon, L.M. Sokol, The effect of cluster size, dimensionality and number of clusters on recovery of true cluster structure, IEEE Trans. Pattern Anal. Mach. Intell. 5 (1983) 40–47.
[23] R.N. Dave, Validating fuzzy partitions obtained through c-shell clustering, Pattern Recognition Lett. 17 (1996) 613–623.
[24] R.A. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugenics 7 (Part II) (1936) 179–188.
[25] J.C. Bezdek, N.R. Pal, Some new indexes of cluster validity, IEEE Trans. Systems, Man, and Cybern. 28 (3) (1998) 301–315.
[26] C.L. Blake, C.J. Merz, UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/~mlearn/MLRepository.html, Department of Information and Computer Science, University of California at Irvine, CA, 1998.

About the Author—SITAO WU is now pursuing the Ph.D. degree in the Department of Electronic Engineering of City University of Hong Kong, Hong Kong, China. He obtained the B.E. and M.E. degrees in the Department of Electrical Engineering of Southwest Jiaotong University, China, in 1996 and 1999, respectively. His research interests are neural networks, pattern recognition, and their applications.

About the Author—TOMMY W.S. CHOW (M'93) received the B.Sc. (First Hons.) and Ph.D. degrees from the University of Sunderland, Sunderland, UK. He joined the City University of Hong Kong, Hong Kong, as a Lecturer in 1988. He is currently a Professor in the Electronic Engineering Department. His research interests include machine fault diagnosis, HOS analysis, system identification, and neural network learning algorithms and applications.