Abstract: This paper proposes two new approaches to using PSO to cluster data. It is shown how PSO can be used to find the centroids of a user specified number of clusters. The algorithm is then extended to use K-means clustering to seed the initial swarm. This second algorithm basically uses PSO to refine the clusters formed by K-means. The new PSO algorithms are evaluated on six data sets, and compared to the performance of K-means clustering. Results show that both PSO clustering techniques have much potential.

1 Introduction

Data clustering is the process of grouping together similar multi-dimensional data vectors into a number of clusters or bins. Clustering algorithms have been applied to a wide range of problems, including exploratory data analysis, data mining [4], image segmentation [12] and mathematical programming [1, 16]. Clustering techniques have been used successfully to address the scalability problem of machine learning and data mining algorithms, where prior to, and during, training, training data is clustered and samples from these clusters are selected for training, thereby reducing the computational complexity of the training process, and even improving generalization performance [6, 15, 14, 3].

Clustering algorithms can be grouped into two main classes of algorithms, namely supervised and unsupervised. With supervised clustering, the learning algorithm has an external teacher that indicates the target class to which a data vector should belong. For unsupervised clustering, a teacher does not exist, and data vectors are grouped based on distance from one another. This paper focuses on unsupervised clustering.

Many unsupervised clustering algorithms have been developed. Most of these algorithms group data into clusters independently of the topology of the input space. These algorithms include, among others, K-means [7, 8], ISODATA [2], and learning vector quantizers (LVQ) [5]. The self-organizing feature map (SOM) [11], on the other hand, performs a topological clustering, where the topology of the original input space is maintained. While clustering algorithms are usually supervised or unsupervised, efficient hybrids have been developed that perform both supervised and unsupervised learning, e.g. LVQ-II [5].

Recently, particle swarm optimization (PSO) [9, 10] has been applied to image clustering [13]. This paper explores the applicability of PSO to cluster data vectors. In the process of doing so, the objective of the paper is twofold:

- to show that the standard PSO algorithm can be used to cluster arbitrary data, and
- to develop a new PSO-based clustering algorithm where K-means clustering is used to seed the initial swarm.

The rest of the paper is organized as follows: Section 2 presents an overview of the K-means algorithm. PSO is overviewed in section 3. The two PSO clustering techniques are discussed in section 4. Experimental results are summarized in section 5.

2 K-Means Clustering

One of the most important components of a clustering algorithm is the measure of similarity used to determine how close two patterns are to one another. K-means clustering groups data vectors into a predefined number of clusters, based on Euclidean distance as similarity measure. Data vectors within a cluster have small Euclidean distances from one another, and are associated with one centroid vector, which represents the "midpoint" of that cluster. The centroid vector is the mean of the data vectors that belong to the corresponding cluster.

For the purpose of this paper, define the following symbols:

- Nd denotes the input dimension, i.e. the number of parameters of each data vector
- No denotes the number of data vectors to be clustered
- Nc denotes the number of cluster centroids (as provided by the user), i.e. the number of clusters to be formed
- zp denotes the p-th data vector
- mj denotes the centroid vector of cluster j
where w is the inertia weight, c1 and c2 are the acceleration constants, r1,k(t), r2,k(t) ~ U(0,1), and k = 1, ..., Nd. The velocity is thus calculated based on three contributions: a fraction of the previous velocity, a cognitive component that attracts the particle toward the best position it has found itself, and a social component that attracts it toward the best position found by the swarm.
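The three-contribution update can be sketched in code. This is a minimal illustration of the standard gbest update with our own naming; the defaults w = 0.72 and c1 = c2 = 1.49 are the settings the paper reports in section 5:

```python
import numpy as np

def pso_step(x, v, pbest, gbest, w=0.72, c1=1.49, c2=1.49, rng=None):
    """One gbest PSO update. The new velocity is the sum of three terms:
    the inertia term w*v, the cognitive term pulling the particle toward
    its own best position, and the social term pulling it toward the
    swarm's global best position."""
    rng = np.random.default_rng() if rng is None else rng
    r1 = rng.random(x.shape)  # r1,k(t) ~ U(0,1), drawn per component
    r2 = rng.random(x.shape)  # r2,k(t) ~ U(0,1), drawn per component
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v, v
```

With v = 0 and pbest = gbest the particle simply moves a random fraction of the way toward the best known position, which is the behavior the clustering algorithm of section 4 exploits.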
where mij refers to the j-th cluster centroid vector of the i-th particle in cluster Cij. Therefore, a swarm represents a number of candidate clusterings for the current data vectors. The fitness of particles is easily measured as the quantization error,

    Je = [ sum_{j=1..Nc} ( sum_{zp in Cij} d(zp, mij) / |Cij| ) ] / Nc    (8)

where d is defined in equation (1), and |Cij| is the number of data vectors belonging to cluster Cij, i.e. the frequency of that cluster.

This section first presents a standard gbest PSO for clustering data into a given number of clusters in section 4.1, and then shows how K-means and the PSO algorithm can be combined to further improve the performance of the PSO clustering algorithm in section 4.2.

4.1 gbest PSO Cluster Algorithm

Using the standard gbest PSO, data vectors can be clustered as follows:

1. Initialize each particle to contain Nc randomly selected cluster centroids.

2. For t = 1 to tmax do

   (a) For each particle i do
   (b) For each data vector zp
       i. calculate the Euclidean distance d(zp, mij) to all cluster centroids Cij
       ii. assign zp to cluster Cij such that d(zp, mij) = min{ d(zp, mic) : c = 1, ..., Nc }
       iii. calculate the fitness using equation (8)
   (c) Update the global best and local best positions
   (d) Update the cluster centroids using equations (3) and (4).

where tmax is the maximum number of iterations.

The population-based search of the PSO algorithm reduces the effect that initial conditions have, as opposed to the K-means algorithm. The hybrid algorithm first executes the K-means algorithm once. In this case the K-means clustering is terminated when (1) the maximum number of iterations is exceeded, or when (2) the average change in centroid vectors is less than 0.0001 (a user specified parameter). The result of the K-means algorithm is then used as one of the particles, while the rest of the swarm is initialized randomly. The gbest PSO algorithm as presented above is then executed.

5 Experimental Results

This section compares the results of the K-means, PSO and Hybrid clustering algorithms on six classification problems. The main purpose is to compare the quality of the respective clusterings, where quality is measured according to the following three criteria:

- the quantization error as defined in equation (8);
- the intra-cluster distances, i.e. the distances between data vectors within a cluster, where the objective is to minimize the intra-cluster distances;
- the inter-cluster distances, i.e. the distances between the centroids of the clusters, where the objective is to maximize the distance between clusters.

The latter two objectives respectively correspond to crisp, compact clusters that are well separated.

For all the results reported, averages over 30 simulations are given. All algorithms were run for 1000 function evaluations, and the PSO algorithms used 10 particles. For PSO, w = 0.72 and c1 = c2 = 1.49. These values were chosen to ensure good convergence [17].

The classification problems used for the purpose of this paper are:

Artificial problem 1: This problem uses the following classification rule:

    class = 1 if (z1 >= 0.7) or ((z1 <= 0.3) and (z2 >= -0.2 - z1)),
            0 otherwise                                              (9)
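Steps 1-2 and the fitness of equation (8) can be sketched as follows. This is a simplified NumPy illustration with our own names and toy data, not the authors' implementation; for simplicity, empty clusters are skipped when averaging in the fitness:

```python
import numpy as np

def quantization_error(data, centroids):
    """Equation (8): average, over clusters, of the mean distance between
    each cluster's data vectors and its centroid (empty clusters skipped)."""
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    per_cluster = [dists[labels == j, j].mean()
                   for j in range(len(centroids)) if np.any(labels == j)]
    return float(np.mean(per_cluster))

def pso_cluster(data, n_clusters, n_particles=10, t_max=100,
                w=0.72, c1=1.49, c2=1.49, seed=0):
    """gbest PSO clustering sketch: each particle holds n_clusters centroids;
    fitness is the quantization error of equation (8)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize each particle with randomly selected data vectors.
    x = data[rng.choice(len(data), (n_particles, n_clusters))]
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_f = np.array([quantization_error(data, p) for p in x])
    gbest = pbest[pbest_f.argmin()].copy()
    # Step 2: evaluate fitness, update bests, then move the particles.
    for _ in range(t_max):
        for i in range(n_particles):
            f = quantization_error(data, x[i])
            if f < pbest_f[i]:
                pbest_f[i], pbest[i] = f, x[i].copy()
        gbest = pbest[pbest_f.argmin()].copy()
        r1 = rng.random(x.shape)
        r2 = rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
    return gbest, quantization_error(data, gbest)
```

On two well-separated Gaussian blobs this reliably recovers one centroid per blob; note that each particle is an entire candidate clustering, so the swarm searches over clusterings rather than over single centroids.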
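The seeding step of the hybrid algorithm can be sketched as follows (our own helper name; it assumes the K-means centroids have already been computed separately):

```python
import numpy as np

def seeded_swarm(data, kmeans_centroids, n_particles, rng):
    """Hybrid initialization: the K-means result becomes one particle,
    while the remaining particles are random selections of data vectors."""
    n_clusters = len(kmeans_centroids)
    # Each particle holds n_clusters candidate centroids.
    swarm = data[rng.choice(len(data), (n_particles, n_clusters))]
    swarm[0] = kmeans_centroids  # seed particle 0 with the K-means clustering
    return swarm
```

The gbest PSO then runs unchanged on this swarm, effectively using PSO to refine the K-means solution.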
Figure 1: Artificial rule classification problem defined in equation (9)

Figure 2: Four-class artificial classification problem defined in equation (10)
Problem        Algorithm   Quantization Error   Intra-cluster Distance   Inter-cluster Distance
Artificial 1   K-means     0.984±0.032          3.678±0.085              1.771±0.046
               PSO         0.769±0.031          3.826±0.091              1.142±0.052
               Hybrid      0.768±0.048          3.823±0.083              1.151±0.043
Artificial 2   K-means     0.264±0.001          0.911±0.027              0.796±0.022
               PSO         0.252±0.001          0.873±0.023              0.815±0.019
               Hybrid      0.250±0.001          0.869±0.018              0.814±0.011
Iris           K-means     0.649±0.146          3.374±0.245              0.887±0.091
               PSO         0.774±0.094          3.489±0.186              0.881±0.086
               Hybrid      0.633±0.143          3.304±0.204              0.852±0.097
Wine           K-means     1.139±0.125          4.202±0.221              1.010±0.146
               PSO         1.493±0.095          4.911±0.353              2.977±0.241
               Hybrid      1.078±0.085          4.199±0.514              2.799±0.111
Breast-cancer  K-means     1.999±0.054          6.599±0.332              1.824±0.251
               PSO         2.536±0.197          7.285±0.351              3.545±0.204
               Hybrid      1.890±0.125          6.551±0.436              3.335±0.097
Automotive     K-means     1030.714±44.69       11032.355±342.2          1037.920±22.14
               PSO         971.553±44.11        13675.675±341.3          988.818±22.44
               Hybrid      902.414±43.81        11895.797±340.7          952.892±21.55
[Figure: convergence of the quantization error over function evaluations; remaining detail not recoverable from the scan]
have better convergence to lower quantization errors, and in general, larger inter-cluster distances and smaller intra-cluster distances.

Future studies will extend the fitness function to also explicitly optimize the inter- and intra-cluster distances. More elaborate tests on higher dimensional problems and large numbers of patterns will be done. The PSO clustering algorithms will also be extended to dynamically determine the optimal number of clusters.

Bibliography

[1] HC Andrews, Introduction to Mathematical Techniques in Pattern Recognition, John Wiley & Sons, New York, 1972.

[2] G Ball, D Hall, A Clustering Technique for Summarizing Multivariate Data, Behavioral Science, Vol. 12, pp 153-155, 1967.

[3] AP Engelbrecht, Sensitivity Analysis of Multilayer Neural Networks, PhD Thesis, Department of Computer Science, University of Stellenbosch, Stellenbosch, South Africa, 1999.

[12] T Lillesand, R Keifer, Remote Sensing and Image Interpretation, John Wiley & Sons, 1994.

[13] M Omran, A Salman, AP Engelbrecht, Image Classification using Particle Swarm Optimization, Proceedings of the 4th Asia-Pacific Conference on Simulated Evolution and Learning, Singapore, 2002.

[14] G Potgieter, Mining Continuous Classes using Evolutionary Computing, MSc Thesis, Department of Computer Science, University of Pretoria, Pretoria, South Africa, 2002.

[15] JR Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, 1993.

[16] MR Rao, Cluster Analysis and Mathematical Programming, Journal of the American Statistical Association, Vol. 22, pp 622-626, 1971.

[17] F van den Bergh, An Analysis of Particle Swarm Optimizers, PhD Thesis, Department of Computer Science, University of Pretoria, Pretoria, South Africa, 2002.