PII: S0957-4174(17)30034-9
DOI: 10.1016/j.eswa.2017.01.024
Reference: ESWA 11072
Please cite this article as: Rasool Azimi, Mohadeseh Ghayekhloo, Mahmoud Ghofrani, Hedieh Sajedi, A novel clustering algorithm based on data transformation, Expert Systems With Applications (2017), doi: 10.1016/j.eswa.2017.01.024
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT
Highlights
A new initialization technique is proposed to improve the performance of K-means.
Department of Computer Science, College of Science, University of Tehran, Tehran, Iran.
*Corresponding author: UWBB room 227, 18807 Beardslee Blvd, Bothell, WA 98011, USA Fax number: 425.352.3775
Abstract: Clustering provides a knowledge acquisition method for intelligent systems. This paper proposes a novel data-clustering algorithm that combines a new initialization technique, the K-means algorithm, and a new gradual data transformation approach to provide more accurate clustering results than the K-means algorithm and its variants by increasing cluster coherence. The proposed data transformation approach solves the problem of generating empty clusters, which frequently occurs in other clustering algorithms. An efficient method based on the principal component transformation and a modified silhouette algorithm is also proposed to determine the number of clusters. Several different data sets are used to evaluate the efficacy of the proposed method in dealing with the empty cluster generation problem, and to assess its accuracy and computational performance in comparison with other K-means based initialization techniques and clustering methods. The developed method for estimating the number of clusters is also evaluated and compared with other estimation algorithms. The significance of the proposed method lies in addressing the limitations of K-means based clustering and improving clustering accuracy, an important task in the field of data mining and expert systems. Applied to knowledge acquisition in time series data such as wind, solar, electric load and stock market series, the proposed method provides a pre-processing tool for selecting the most appropriate data to feed into neural networks or other estimators used for forecasting such time series. In addition, using the knowledge discovered by the proposed K-means clustering to develop rule-based expert systems is one of its main impacts.
Index Terms: Data mining, clustering, K-means, data transformation, silhouette, transformed K-means
1. Introduction
Expert systems are computer applications that contain stored knowledge and are developed to solve problems in a specific field in much the same way a human expert would (Shuliang et al., 2002). Acquiring expert knowledge is a challenge in developing such expert systems (Yang et al., 2012). One of the major problems and most difficult tasks in developing rule-based expert systems is representing the knowledge discovered by data clustering (Markic and Tomic, 2010). The K-means algorithm is one of the most commonly used clustering techniques; it uses a data reassignment method to iteratively optimize the clustering (Lloyd, 1982). The main goal of clustering is to generate compact groups of objects or data that share similar patterns within the same cluster, and to isolate these groups from those containing elements with different characteristics.
Although the K-means algorithm offers simplicity and high convergence speed, it depends entirely on the initial centroids, which are randomly selected in the first phase of the algorithm. Due to this random selection, the algorithm may converge to locally optimal solutions (Celebi et al., 2013). Different variants of the K-means algorithm have been proposed to address this limitation. The K-medoids algorithm was proposed in (Kaufman and Rousseeuw, 1987) to define each cluster by its most central medoid. First, K data points are considered as initial centroids (medoids) and each data point is assigned to the closest medoid, forming the initial clusters. In an iterative process, the most central data point in each cluster is taken as the new centroid and each data point is assigned to the nearest centroid; the remaining steps match the K-means procedure. Fuzzy C-means (FCM) clustering introduced the partial membership concept (Dunn, 1973; Bezdek et al., 1984). In the FCM algorithm, each data point belongs to all clusters; the degree of belonging is represented by a partial membership determined by a fuzzy clustering matrix. A genetic algorithm-based K-means (GA-K-means) was proposed in (Krishna and Murty, 1999) to provide a global optimum for the clustering. In this method, the K-means algorithm was used as a search operator instead of crossover, and a biased mutation operator was proposed to help the K-means algorithm avoid local minima. The global K-means algorithm was developed in (Likas et al., 2003) to provide an experimentally optimal solution for clustering problems; however, it is not appropriate for clustering medium-sized and large-scale datasets due to its heavy computational burden. The K-means++ initialization algorithm was proposed in (Arthur and Vassilvitskii, 2007) to obtain a near-optimal initial set of centroids. The main drawback of K-means++ is its inherently sequential nature, which limits its effectiveness for high-volume data. An artificial bee colony K-means (ABC-K-means) clustering approach was proposed in (Zhang et al., 2010) for optimal partitioning of data objects into a fixed number of clusters. A hybrid of differential evolution and K-means, named DE-K-means, was introduced in (Kwedlo, 2011); the differential evolution algorithm was used as a global optimization method and the resulting clustering solutions were fine-tuned and corrected using the K-means algorithm. Dogan et al. proposed a hybrid of K-means++ and the self-organizing map (SOM) (Kohonen, 1990) to improve clustering accuracy. It first uses the K-means++ initialization method to determine the initial weight values and starting points, and then uses SOM to find an appropriate final clustering solution; however, the aforementioned limitation of K-means++ was not addressed. A new clustering technique combining the global K-means algorithm and a topology neighborhood based on Axiomatic Fuzzy Set (AFS) theory was developed in (Wang et al., 2013) to determine initial centroids. A new clustering algorithm named K-means* was presented in (Malinen et al., 2014) that generates an artificial dataset X* as the input data. The input data are mapped one-by-one to the generated artificial data (X → X*). Next, the inverse transformation of the artificial data to the original data is performed by a series of gradual transformations. To do so, the K-means algorithm updates the clustering model after each transformation and moves
the data vectors slowly to their original positions. The K-means* algorithm uses a random data swapping strategy to deal with the problem of generating empty clusters. However, the random selection of data vectors as cluster centroids may reduce the coherence of the other clusters and decrease the efficiency of the K-means* algorithm. Moreover, the convergence rate of the K-means* algorithm drops significantly as the number of clusters increases, especially with increasing data volumes. Density-based clustering methods were proposed in (Mahesh Kumar and Rama Mohan Reddy, 2016) to speed up the neighbor search for clustering spatial databases with noise. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) provides a graph-based index structure for high-dimensional data with large amounts of noise. It was shown that the running time of the proposed method is faster than DBSCAN with exactly the same clustering results, and that it solves the inability of DBSCAN to handle clusters with large differences in densities. A novel clustering algorithm named CLUB (CLUstering based on Backbone) was developed in (Chen et al., 2016) to determine optimal clusters. First, the algorithm detects the initial clusters and finds their density backbones. Then, it finds the outliers in each cluster based on the K-nearest-neighbor (KNN) method. Finally, by assigning each unlabeled point to the cluster of its nearest higher-density neighbor, the algorithm yields the final clusters. CLUB has several drawbacks: the KNN method lacks an efficient algorithm to determine the value of the parameter K (the number of nearest neighbors), and its computational cost is too high because it requires calculating the distance of each query instance to all training samples. Two particle swarm optimization (PSO) based fuzzy clustering methods were proposed in (Silva Filho et al., 2015) to deal with the shortcomings of PSO algorithms used for fuzzy clustering. These methods adjust the parameters of PSO dynamically to achieve a balance between exploration and exploitation and thereby avoid trapping in local optima, but they lack precision for high-dimensional applications, and their iterative process significantly decreases the convergence rate. Generally, the speed at which a convergent sequence approaches its limit is defined as the rate of convergence. Three clustering algorithms named Clustering by Near Neighbor Influence (CNNI), an improved version with lower time cost (ICNNI), and a variation (VCNNI) were presented in (Chen, 2015). The clustering results showed that ICNNI is faster than CNNI and that CNNI requires less space than VCNNI; these methods suffer from large-scale computing and storage requirements. A growing incremental self-organizing neural network (GISONN) was developed in (Liu and Ban, 2015) to select appropriate clusters by learning the data distribution of each cluster. The method is, however, not applicable to large-volume or high-dimensional datasets due to its computational complexity. In addition, the neighborhood-preserving feature of the algorithm is violated when the output space topology does not match the structure of the data.
In spite of the improved performance of the K-means variants on synthetic datasets with Gaussian distributions, their performance on real datasets is neither very promising nor much different from the original K-means algorithm. In addition, all K-means based algorithms lack an efficient method to determine the optimal number of clusters, requiring the user to set the number of clusters either arbitrarily or based on practical and experimental estimates, which might not be optimal.
In this paper, we propose a novel clustering approach called transformed K-means that provides more accurate clustering results than the K-means algorithm and its improved versions. The proposed clustering method combines a new initialization technique, the K-means algorithm and a new gradual data transformation approach to appropriately select the initial cluster centroids and move the real data into the locations of the initial cluster centroids that are closest to the actual positions of the associated data. By doing this, the data are placed in an artificial structure that properly initializes the K-means clustering. The inverse transformation is then performed to gradually move the artificial data back to their original places. During this process, K-means updates the cluster centroids after each change in the data structure. This provides better clustering results for both synthetic and real datasets. In addition, the proposed data transformation solves the empty cluster problem of the K-means algorithm and its improved versions. An efficient method based on the principal component transformation and a modified silhouette algorithm is also proposed to determine the optimal number of clusters for the K-means algorithms.
The proposed clustering method supports the development of rule-based expert systems by means of knowledge acquisition through data transformation. The significance of the proposed method lies in addressing the limitations of K-means based clustering and improving clustering accuracy, an important task in the field of data mining and expert systems. The proposed method can be used for intelligent system applications such as forecasting time series including solar, wind, load and stock market series. The main contributions of this paper are as follows:
1. A new initialization technique is proposed to select initial centroids that are closer to the optimal centroid locations.
2. A novel gradual data transformation approach is proposed to significantly reduce the number of empty clusters generated.
4. A hybrid clustering algorithm is developed by combining the proposed initialization, data transformation and cluster number estimation to provide better knowledge discovery of the input patterns and more accurate clustering results.
The rest of the paper is organized as follows. Section 2 provides a brief description of the K-means algorithm and explains the proposed clustering method. Section 3 demonstrates case studies in which the performance of the developed clustering method is evaluated.
2. Methodology
2.1 The K-means Algorithm
The K-means algorithm (Lloyd, 1982) is a well-known, low-complexity algorithm for data partitioning. The algorithm starts with a given number of clusters K and outputs the cluster centroids through iterations. Let X = [x_1, ..., x_n] be the set of n points to be grouped into K different cluster (partition) sets C = {c_p}, p = 1, 2, ..., K. Using the Euclidean distance, the algorithm assigns each data point to its closest centroid c_p, calculated by:

c_p = (1 / n_p) * sum_{i=1}^{n_p} x_i^(p)   (1)

where x_i^(p) is the i-th data point in cluster p, and n_p is the number of data points in that cluster. After the first run, the algorithm calculates the mean of the data points in each cluster c_p and selects this value as the new cluster centroid, starting a new iteration. As new clusters are formed, new mean values are obtained. The algorithm halts once the sum of squared errors over the K clusters is minimized (Cui et al., 2014).
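The assignment/update loop described above can be sketched as follows. This is an illustrative minimal implementation, not the authors' code; the iteration cap and convergence tolerance are assumptions.

```python
import numpy as np

def kmeans(X, K, init_centroids, max_iter=100, tol=1e-9):
    """Plain Lloyd-style K-means: assign each point to its nearest
    centroid, then recompute each centroid as the cluster mean (Eq. 1)."""
    centroids = init_centroids.astype(float).copy()
    for _ in range(max_iter):
        # distances from every point to every centroid, shape (n, K)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centroids = centroids.copy()
        for p in range(K):
            members = X[labels == p]
            if len(members) > 0:          # an empty cluster keeps its old centroid
                new_centroids[p] = members.mean(axis=0)
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels
```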
2.2 The Proposed Clustering Method
An improved version of the K-means algorithm, named transformed K-means, is proposed in this section. The proposed clustering algorithm combines a new technique to select the initial cluster centroids with a new approach for the inverse transformation of the data to enhance clustering performance. The steps of the transformed K-means algorithm are as follows:
A. Initial centroid selection
Let X = [x_1, ..., x_n] be a set of n data vectors. The K initial centroids are selected as follows:
1) Remove duplicate data vectors and store the result in a new dataset X' = [(x'_1, r_1), ..., (x'_m, r_m)], where r_i is the repetition number of x'_i in X.
2) Sort the data vectors in the dataset X' in ascending order based on the Euclidean length of the vectors.
3) Divide the dataset X', consisting of m data vectors, into K sub-datasets with (at most) S = m / K data vectors each, according to Eq. (2), such that the data elements of X' are distributed among the sub-datasets X'_1 to X'_K.
4) We now have K sub-datasets, each of which is used to determine exactly one of the K initial centroids. Eq. (3) is used to calculate a weight attribute w(x'_i) for each data entry x'_i with repetition number r_i in each of the K sub-datasets:

w(x'_i)_m = r_i / ((1/S) * sum_{j=1}^{S} dist(x'_i, x'_j)),  (1 <= m <= K)   (3)

where w(x'_i)_m is the weight attribute of x'_i in the m-th sub-dataset.
5) In each of the K sub-datasets, the data entry with the highest weight attribute is selected as the initial centroid.
Fig. 1 shows the flowchart of our proposed method for selecting initial centroids.
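The five initialization steps above can be sketched as follows. This is a hedged reconstruction, not the authors' code: the sub-dataset size S = ceil(m/K), the contiguous split after sorting, and the reading of Eq. (3) as repetitions divided by the mean intra-sub-dataset distance are our assumptions.

```python
import numpy as np
from collections import Counter

def select_initial_centroids(X, K):
    """Sketch of the proposed initialization (Section 2.2.A)."""
    # 1) deduplicate, keeping the repetition count r_i of each vector
    counts = Counter(map(tuple, X))
    Xp = np.array(list(counts.keys()), dtype=float)
    r = np.array(list(counts.values()), dtype=float)
    # 2) sort by Euclidean vector length
    order = np.argsort(np.linalg.norm(Xp, axis=1))
    Xp, r = Xp[order], r[order]
    # 3) split into K contiguous sub-datasets of size at most ceil(m/K)
    m = len(Xp)
    S = int(np.ceil(m / K))
    centroids = []
    for start in range(0, m, S):
        sub, rsub = Xp[start:start + S], r[start:start + S]
        # 4) weight each entry: repetitions divided by mean distance (Eq. 3)
        d = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=2)
        mean_d = d.mean(axis=1) + 1e-12     # guard against division by zero
        w = rsub / mean_d
        # 5) the highest-weight entry becomes this sub-dataset's centroid
        centroids.append(sub[w.argmax()])
    return np.array(centroids[:K])
```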
Fig. 1. Flowchart of the proposed method for the initial centroids selection
B. Inverse transformation
The inverse data transformation approach was first used in (Malinen et al., 2014) to solve the problems associated with the K-means clustering algorithm. However, that approach has a number of shortcomings, such as finding a suitable artificial data structure, performing the mapping, and controlling the inverse transformations, and it cannot generally guarantee an optimal solution. This was demonstrated by the clustering results of (Malinen et al., 2014), where in some cases the data transformation led to the deviation of the data towards incorrect cluster centroids. For the inverse transformation of data, we first generate artificial data X* of the same size (n) and dimension (d) as the input data. This divides the data vectors into K distinct clusters without any fluctuations. We then define a one-to-one mapping of the input data to the artificial data.
The inverse data transformation approach of (Malinen et al., 2014) uniformly distributes the initial cluster centroids along a line. This random placement may break the clustering structure, deviate the data to incorrect cluster centroids, and consequently produce incorrect results. To address this problem, our proposed inverse data transformation approach places each initial centroid c0_j (1 <= j <= K) at the location of each data vector x_i (1 <= i <= n) that is closest to it in the artificial structure X*:

x_i* = ArgMin_{c0_j} || x_i - c0_j ||,  (1 <= i <= n), (1 <= j <= K)   (5)

so that each x_i* takes the value of its nearest initial centroid (x_i* ∈ initC, 1 <= i <= n).
A series of inverse transformations is then performed that gradually moves the data elements to their real (original) positions. This inversely transforms the artificial data back into the original data. During this process, K-means updates the cluster centroids of the transformed data; the cluster centroids calculated in each step are used as the initial cluster centroids for the next step. The process continues until the last step, whose results provide the final cluster centroids. The proposed procedure is outlined as follows:
First, each vector x_i is placed at the position of the initial centroid initC_l (1 <= l <= K) with the minimum distance to the corresponding data point. Next, the vectors gradually move back to their real positions.
Generally, for a dataset X = [x_1, ..., x_n] of n data vectors, the gradual inverse transformation of the data to their real positions proceeds as follows:
1) Sort the dataset X = [x_1, ..., x_n] in ascending order based on the Euclidean length of the vectors, and store the sorted result.
[Fig. 2 appears here: flowchart of the transformed K-means. Inputs: input pattern X = {x1, ..., xn}, number of final clusters K, initial centroids initC0 = {c01, ..., c0K}, and the number of inverse transformation steps. The flowchart computes Dist'' = (X'' - X*) / Steps, then at each step i moves all data points X*_i towards their real locations and performs K-means(X*_i, K, initC_{i-1}) with the previous centroids and the modified dataset as inputs, finally outputting the final centroids CF = {c1, ..., cK}.]
2) To construct the artificial data structure X* as the initial position of the data, place each initial centroid initC_l (1 <= l <= K) at the positions of the data vectors of X' that are closer to that initial centroid than to the (K-1) other initial centroids. This forms the artificial structure X* = [x*_1, ..., x*_n], moving each real data vector to the location of the initial centroid closest to the actual position of the associated data.
3) Shuffle the real data vectors X = [x_1, ..., x_n] into random order and store them in the new dataset X'' = [x''_1, ..., x''_n].
4) Determine the distances between the initial artificial data (X*) and the shuffled real data (X''), and store them in the set Dist'' = [dist''_1, dist''_2, ..., dist''_n]. Each element dist''_i represents the distance vector between the i-th data vector (x*_i) in the artificial dataset X* and the position of the corresponding data vector (x''_i) in the dataset X''.
5) Divide each element of Dist'' = [dist''_1, ..., dist''_n] by the user-specified number of steps (Steps > 1) and update the values of the elements of Dist'' accordingly, as given by Eq. (6).
6) At each step of the inverse transformation, all data points move towards their real locations:

X*_i = X*_{i-1} + Dist''

where X* is the position of the data in the artificial structure, i is the step number and Dist'' is the per-step distance of the shuffled data from the data positions in the artificial structure. Fig. 2 shows the flowchart of the proposed transformed K-means. Note that X*_1 is the initial data position in the artificial structure (the initial artificial dataset). In the first step, the initial centroids (initC = initC0), calculated by the proposed initial centroid selection method, are fed to the K-means algorithm as inputs. After every inverse transformation, K-means is executed with the previous centroids (initC_{i-1}) and the modified dataset (X*_i) as the input pattern. After completion of all steps (i = Steps), all data points are back in their original locations (X*_i = X') and the final centroids (CF) are produced as the outputs. The proposed initialization approach of Section 2.2.A significantly reduces the chance of empty cluster generation through proper selection of the initial centroids, and the proposed data transformation approach completely solves the empty cluster problem during the transformation process.
The transformed K-means algorithm has a time complexity of order O(n log n + n K s), where n is the total number of data points, K is the number of clusters and s is the number of steps. More details of the time complexity of the proposed transformed K-means algorithm are given for its different phases in TABLE I.
TABLE I
Time complexity of the proposed transformed K-means algorithm.
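The inverse transformation procedure of Section 2.2.B can be sketched as follows. This is an illustrative reading of steps 1-6, not the authors' implementation; in particular, the equal per-step increment and the inner K-means iteration budget are assumptions.

```python
import numpy as np

def lloyd(X, K, centroids, iters=50):
    """Minimal K-means used inside each transformation step."""
    centroids = centroids.astype(float).copy()
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(1)
        for p in range(K):
            pts = X[labels == p]
            if len(pts):
                centroids[p] = pts.mean(0)
    return centroids

def transformed_kmeans(X, init_centroids, steps=10):
    """Sketch of the gradual inverse transformation: each point starts at
    its nearest initial centroid and moves home in equal increments,
    with K-means re-run after every move."""
    K = len(init_centroids)
    C = init_centroids.astype(float).copy()
    # Step 2: artificial structure X* -- every point sits on its nearest centroid
    nearest = np.linalg.norm(X[:, None] - C[None], axis=2).argmin(1)
    X_star = C[nearest].copy()
    # Steps 4-5: per-step displacement Dist'' = (X - X*) / Steps
    dist = (X - X_star) / steps
    # Step 6: move the points gradually back, updating centroids each step
    for _ in range(steps):
        X_star = X_star + dist
        C = lloyd(X_star, K, C)
    return C
```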
TABLE II provides the time complexity orders for the proposed method and well-known clustering algorithms including K-means*, K-means++, global K-means, original K-means, K-medoids, FCM, SOM, SOM++ and game-theoretic SOM (GTSOM).
TABLE II
Time complexity comparison of the proposed transformed K-means algorithm and several well-known clustering algorithms.

Algorithm       | Time complexity
K-means++       | O(n K)
Global K-means  | O(n^2 K^2)
K-means         | O(n K)
K-medoids       | O(n^2 K)
FCM             | O(n K^2)
SOM             | O(n^2 K)
GTSOM           | O(n^2 K)
SOM++           | O(n^2 K)
The time complexity comparison in TABLE II shows that the proposed transformed K-means algorithm is faster than SOM, GTSOM, and global K-means, and competes with FCM and K-medoids. The time complexity of K-means and K-means++ is better than that of our proposed algorithm. However, as the data volume increases, the K-means++ algorithm may not be as efficient as our proposed method due to its sequential initialization (Bahmani et al., 2012). The proposed transformed K-means and K-means* algorithms have almost the same time complexity; however, the approach used to deal with the generation of empty clusters in the K-means* algorithm reduces its convergence rate nonlinearly as the data volume (n) and the number of clusters (K) increase. Consequently, our proposed clustering algorithm is generally faster than the K-means* algorithm.
2.3 Estimating the Number of Clusters
K-means and many other clustering algorithms assume that the number of clusters is known in advance. In cases where the number of clusters is not predefined, an efficient method is required to determine the optimal number of clusters. In this section we present a new method, based on the silhouette approach proposed in (Rousseeuw, 1987), to estimate the number of clusters. The original silhouette procedure is:
1) Cluster the input data using any clustering technique for each iteration m, (K_min <= m <= K_max).
2) Calculate the silhouette value for each data point i:

S_i^m = (b(i) - a(i)) / max(a(i), b(i))   (8)

where a(i) is the average distance between the i-th data point (1 <= i <= n) and the other data in the same cluster, and b(i) is the lowest average distance of the i-th data point to the data in the other K-1 clusters at the m-th iteration.
3) Average the silhouette values at each iteration.
4) Select the iteration index with the highest average silhouette S as the estimated number of clusters.
The proposed method uses the principal component transformation to modify the silhouette algorithm. The proposed procedure is as follows:
4-1) Transform the input data using the Karhunen-Loeve Transform (KLT). The KLT method, also known as the principal component transformation, projects the data onto the eigenvectors of its covariance matrix.
4-2) Let φ_k denote the eigenvector corresponding to the k-th eigenvalue λ_k of the covariance matrix Σ_x:

Σ_x φ_k = λ_k φ_k,  k = 1, ..., N

where the covariance entries are

Σ_{i,j} = cov(X_i, X_j) = E[(X_i - μ_i)(X_j - μ_j)],  μ_i = E(X_i)   (9)

4-3) Form the eigenvector matrix Φ = [φ_1, ..., φ_N]; Φ is orthonormal, Φ*T Φ = I, so Φ^{-1} = Φ*T.   (10)
4-4) Estimate the covariance matrix from the N data samples, Σ_x ≈ (1/(N-1)) Σ_i (x_i - μ)(x_i - μ)*T.   (11)
4-5) Diagonalize the covariance matrix: Φ*T Σ_x Φ = Λ = diag(λ_1, ..., λ_N).   (12)
4-6) Given the input data X, define the Karhunen-Loeve transformation of X as:

Y = Φ*T X,  i.e.,  y_k = φ_k*T x_k,  k = 1, ..., N   (13)
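A minimal sketch of the KLT steps 4-1 to 4-6: eigendecompose the sample covariance and project the centered data. `numpy.linalg.eigh` is used here because the covariance matrix is symmetric; the descending eigenvalue ordering is a convention, not part of the source.

```python
import numpy as np

def klt(X):
    """Karhunen-Loeve (principal component) transform of row-wise data X."""
    mu = X.mean(axis=0)
    Xc = X - mu                                  # center the data
    cov = np.cov(Xc, rowvar=False)               # Eq. (11): sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)       # Eq. (9): Sigma phi = lambda phi
    order = np.argsort(eigvals)[::-1]            # largest variance first
    Phi = eigvecs[:, order]
    Y = Xc @ Phi                                 # Eq. (13): Y = Phi^T X
    return Y, Phi, eigvals[order]
```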
ACCEPTED MANUSCRIPT
14
6) For each iteration m, calculate the initial centroids using the proposed initialization method and assign each of the transformed data points (Y) to the nearest initial centroid to form initial clusters. Then calculate the mean of all data in each cluster as the new centroids Cs = {cs_1, ..., cs_K}.
7) Calculate S_i^m for each of the input data points at each iteration m, (K_min <= m <= K_max), by:

S_i^m = (b(i) - a(i)) / max(a(i), b(i))   (14)

where a(i) is the distance between the i-th data point (1 <= i <= n) and the nearest centroid cs_j (1 <= j <= K) at the m-th iteration, and b(i) is the minimum distance of the i-th data point to the other K-1 centroids at the m-th iteration. The proposed definitions of a(i) and b(i) in Eq. (15) decrease the computational burden and speed up the process compared to their original definitions in the silhouette algorithm.
8) Include S_i^m (for the i-th data point at the m-th iteration) in the S_m array.
9) Store the average value of S_m (for the m-th iteration) in the m-th cell of the array S_ave^est.
10) Use Eq. (15) and select the index m with the highest S_ave value as the estimated number of clusters.
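Steps 6-10 can be sketched as follows, assuming a user-supplied clustering routine in place of the proposed initialization. The simplified a(i) and b(i) are taken here as the nearest and second-nearest centroid distances, which is our reading of step 7.

```python
import numpy as np

def estimate_k(Y, k_min, k_max, cluster_fn):
    """Simplified-silhouette estimate of the number of clusters.
    cluster_fn(Y, K) is assumed to return K centroids (any clustering
    routine, e.g. K-means with the proposed initialization, can be used)."""
    s_ave = {}
    for K in range(k_min, k_max + 1):
        C = cluster_fn(Y, K)
        d = np.linalg.norm(Y[:, None] - C[None], axis=2)   # (n, K) distances
        d_sorted = np.sort(d, axis=1)
        a = d_sorted[:, 0]               # distance to the nearest centroid
        b = d_sorted[:, 1]               # minimum distance to the other centroids
        s = (b - a) / np.maximum(a, b)   # Eq. (14)
        s_ave[K] = s.mean()              # steps 8-9: average per iteration
    return max(s_ave, key=s_ave.get)     # step 10: K with the highest score
```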
Fig. 3. Flowchart of the proposed method for estimating the number of clusters
This procedure modifies the method proposed in (Rousseeuw, 1987) to provide stable results with less processing time. TABLE III provides the time complexity orders of the proposed estimation method and the silhouette algorithm.
TABLE III
Time complexity comparison of the proposed estimation method and the silhouette algorithm.
Here n is the data volume and ΔK is the difference between K_max and K_min (ΔK = K_max - K_min). The comparison confirms the reduced processing time of the proposed method.
3. Case Studies
In this section, we first evaluate the performance of the proposed method in dealing with the empty cluster problem; we then assess the proposed transformed K-means clustering algorithm, and finally examine the proposed estimation method for determining the optimal number of clusters. The datasets used in the experiments are available online at Joensuu.
3.1. Evaluation of the proposed method for dealing with the empty cluster generation
Three real time-series datasets of solar radiation at the Ames, Chariton and Calmar stations between 01/01/2009 and 01/01/2014 are used to calculate the number of empty clusters (N.E.C.) generated by the K-means algorithm and the proposed method. The number of clusters for both algorithms is 200 (K = 200). First, the clustering is performed for one step (Steps = 1), i.e., without any data transformation, to evaluate the performance of the proposed initialization approach in dealing with the empty cluster problem. TABLE IV shows the N.E.C. for the proposed method with one step and for the K-means algorithm. The results demonstrate that the proposed method significantly reduces the N.E.C. compared to the K-means algorithm, thanks to the proposed initialization approach that properly selects the initial centroids. The N.E.C. generated by the proposed method is then calculated as the number of steps increases. The results are provided in TABLE V for steps 1 to 10. They show that the empty cluster problem is completely solved as the data are transformed back to their original positions, and the proposed clustering algorithm converges without generating any empty cluster.
TABLE IV
Performance comparison of the K-means and proposed transformed K-means on the problem of empty cluster generation

Dataset  | Number of objects | Number of clusters | N.E.C. (K-means) | N.E.C. (Proposed method, Step = 1)
Ames     | 43827 | 200 | 84 | 1
Chariton | 43827 | 200 | 75 | 0
Calmar   | 43827 | 200 | 27 | 2
TABLE V
Performance of the proposed method with different steps for the empty cluster problem

Step number | N.E.C. in Ames dataset | N.E.C. in Chariton dataset | N.E.C. in Calmar dataset
1  | 1 | 1 | 3
2  | 0 | 0 | 0
3  | 1 | 0 | 0
4  | 0 | 0 | 0
5  | 0 | 0 | 0
6  | 0 | 0 | 0
7  | 0 | 0 | 0
8  | 0 | 0 | 0
9  | 0 | 0 | 0
10 | 0 | 0 | 0
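For reference, the N.E.C. after a nearest-centroid assignment can be counted as follows. This is a generic check written for illustration, not the authors' evaluation script.

```python
import numpy as np

def count_empty_clusters(X, centroids):
    """Count clusters that receive no points when every point is
    assigned to its nearest centroid (the N.E.C. measure)."""
    labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(1)
    K = len(centroids)
    return K - len(np.unique(labels))
```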
3.2. Evaluation of the accuracy of the proposed clustering method
This section evaluates the accuracy of the proposed clustering method (transformed K-means). The mean squared error (MSE) is used as the accuracy measure:

MSE = (1 / (K·N)) * sum_{k=1}^{K} sum_{i=1}^{N} || X_i^(k) - C_k ||^2   (16)

where N is the number of data points in cluster k, and X_i^(k) is the i-th data point in cluster k. The testing datasets are normalized in the range [-1, 1]. K-means clustering is run with different initialization methods, including random initialization, K-means* based, K-means++ based and the proposed initialization method (Section 2.2.A). The calculated error values as well as the processing times are provided in TABLE VI. A comparison of the results shows that the proposed initialization method improves the accuracy of the K-means algorithm compared to the other initialization methods. However, the computational complexity is increased due to the data sorting used by our initialization to optimally select the initial centroids.
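The MSE of Eq. (16) can be computed as sketched below. Since N varies per cluster, we read Eq. (16) as averaging the per-cluster mean squared distances; that interpretation is an assumption.

```python
import numpy as np

def clustering_mse(X, centroids, labels):
    """MSE per Eq. (16): mean squared distance of each point to its
    cluster centroid, averaged over the non-empty clusters."""
    per_cluster = []
    for k in range(len(centroids)):
        pts = X[labels == k]
        if len(pts):
            per_cluster.append(((pts - centroids[k]) ** 2).sum(axis=1).mean())
    return float(np.mean(per_cluster))
```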
TABLE VI
MSE measures and running time (sec) for the K-means algorithm with different initialization methods

Dataset | Proposed initialization MSE | Time(s) | K-means* based MSE | Time(s) | K-means++ based MSE | Time(s) | Random based MSE | Time(s)
IRIS    | 0.0432 | 0.24 | 0.0432 | 0.0891 | 0.0431 | 0.0238 | 0.0432 | 0.0248
Olitos  |        |      |        |        |        |        |        |
The MSE value is calculated for different clustering methods including the proposed transformed K-means, ABC-K-means, DE-K-means, GA-K-means, K-means*, K-means++, K-means, SOM, GTSOM, SOM++, K-medoids and FCM, and is provided in TABLE VII. The calculated MSE values show that the transformed K-means (with Steps = 20) outperforms or competes with the existing methods in terms of clustering quality. The improved clustering quality results from several procedures embraced by our proposed method, namely the determination of the optimal number of clusters, the proposed initialization and the gradual data transformation.
The proposed transformed K-means algorithm has a faster processing time than ABC-K-means, DE-K-means, GA-K-means, SOM, GTSOM, and SOM++, and competes with FCM, K-medoids, K-means*, and K-means++. For small- and medium-sized data, our proposed method is generally more time consuming than K-medoids, K-means*, and K-means++. However, the reduced convergence rate of K-means* in dealing with empty cluster generation, particularly for higher numbers of clusters, and the sequential initialization of K-means++ increase the computational complexity of these methods for large data volumes. This is evident from our running-time results for large datasets such as Missa1 and Shuttle.
TABLE VII
MSE measure and the running time (sec) for different clustering techniques
Columns, left to right: Proposed method, ABC-K-means, DE-K-means, GA-K-means, K-means*, K-means++, K-means, SOM, GTSOM, SOM++, K-medoids, FCM

IRIS        MSE:     0.0432 0.0475 0.043 0.043 0.0432 0.0432 0.0432 0.0432 0.0433 0.0432 0.0432 0.0432
            Time(s): 0.5754 1.2629 1.4003 1.6738 1.4858 0.0063 0.0061 2.3584 1.7599 2.4825 0.0127 0.2132
Glass       MSE:     0.0017 0.0019 0.0018 0.0017 0.0017 0.0018 0.0017 0.0017 0.0019 0.0017 0.0017 0.0017
            Time(s): 0.6301 1.2252 1.3897 1.6864 0.1381 0.0134 0.0058 2.3659 1.8439 2.3572 0.0311 0.0134
Missa1      MSE:     0.0001 0.0002 0.0002 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.005
            Time(s): 3.9664 45.2259 60.29 81.2574 47.813 37.6179 0.3216 3.4526 10.5227 3.4381 47.6677 4.7857
Bridge      MSE:     0.0008 0.0018 0.0015 0.0013 0.0009 0.0013 0.001 0.001 0.0009 0.0012 0.0009 0.0038
            Time(s): 2.4381 29.2082 38.8446 52.4054 18.794 10.5017 0.1362 3.3597 10.6182 3.3557 18.0322 15.5823
Thyroid     MSE:     0.0134 0.0147 0.0134 0.0133 0.0591 0.0151 0.0145 0.0167 0.0169 0.0168 0.0146 0.0149
            Time(s): 0.6331 1.2185 1.3674 1.6443 0.3718 0.0087 0.0069 2.3265 1.7153 2.3346 0.0394 0.4068
Magic       MSE:     0.0271 0.0325 0.0295 0.0295 0.0304 0.0304 0.0304 0.0307 0.0306 0.0304 0.0304 0.0304
            Time(s): 0.9886 2.6975 3.2338 4.1301 0.7922 1.3478 0.0373 2.3212 1.684 2.3245 3.777 0.7571
Wine        MSE:     0.0255 0.0277 0.0254 0.0252 0.1349 0.0256 0.0255 0.0257 0.0255 0.0255 0.0256 0.0256
            Time(s): 0.6183 1.1893 1.3275 1.5914 1.8411 0.0059 0.006 2.3257 1.7137 2.3216 0.0127 0.4052
Shuttle     MSE:     0.0007 0.0008 0.0008 0.0007 0.0009 0.0008 0.0008 0.0008 0.0009 0.0008 0.0008 0.0009
            Time(s): 1.2867 5.363 6.7498 9.2292 1.4791 13.8573 0.047 2.3333 1.7205 2.3379 4.9705 0.9686
Pendigit    MSE:     0.0069 0.0085 0.0077 0.0073 0.0079 0.0071 0.0071 0.0069 0.0071 0.007 0.007 0.0969
            Time(s): 0.814 2.3004 2.8268 3.6743 7.2133 0.3607 0.018 2.3571 1.969 2.3645 1.6015 1.3866
Wdbc        MSE:     0.0138 0.0152 0.0138 0.0138 0.0231 0.0206 0.0206 0.0188 0.0243 0.0205 0.0192 0.0205
            Time(s): 0.6232 1.2157 1.3701 1.6524 0.2455 0.0057 0.0044 2.3226 1.6766 2.3566 0.017 0.1047
Yeast       MSE:     0.0051 0.0062 0.0057 0.0055 0.0088 0.0055 0.0053 0.0052 0.0054 0.0052 0.0054 0.0058
            Time(s): 0.6999 1.6819 1.9743 2.4665 7.0087 0.0898 0.016 2.3747 1.9684 2.3955 1.0383 0.2853
P. I. D     MSE:     0.0438 0.0474 0.0431 0.0431 0.1013 0.0536 0.0536 0.0525 0.0541 0.0574 0.0537 0.0512
            Time(s): 0.6353 1.2327 1.3823 1.698 0.2842 0.008 0.0056 2.337 1.6765 2.3256 0.0762 0.4109
Olitos      MSE:     0.0153 0.017 0.0156 0.0153 0.051 0.0162 0.0167 0.0162 0.0161 0.0161 0.0161 0.016
            Time(s): 0.5645 1.2042 1.3552 1.6307 1.3999 0.0101 0.0065 2.337 1.8621 2.3347 0.0259 0.5152
Heart       MSE:     0.0352 0.0387 0.0352 0.0352 0.0574 0.0358 0.0358 0.0354 0.0359 0.0356 0.0358 0.0359
            Time(s): 0.6091 1.1922 1.342 1.6161 0.3494 0.0052 0.0057 2.3168 1.6736 2.3149 0.0408 0.3544
Ionosphere  MSE:     0.081 0.0891 0.081 0.081 0.0959 0.081 0.081 0.0811 0.081 0.081 0.081 0.0813
            Time(s): 0.6061 1.2029 1.3576 1.6325 0.2635 0.0053 0.0052 2.3253 1.6762 2.3317 0.0175 0.4333
Movement    MSE:     0.0055 0.0081 0.0072 0.006 0.0094 0.0063 0.0062 0.0055 0.0058 0.0057 0.0056 0.0057
Libras      Time(s): 0.6293 1.335 1.5315 1.8818 15.938 0.0252 0.01 2.3821 2.1349 2.3865 0.1281 0.8592
Spambase    MSE:     0.0078 0.0087 0.0079 0.0079 0.0098 0.01 0.01 0.01 0.0099 0.0097 0.01 0.0091
            Time(s): 0.6714 1.4971 1.7339 2.1458 0.4771 0.079 0.0092 2.3608 1.684 2.405 0.6408 0.541
Waveform    MSE:     0.0114 0.0122 0.0111 0.0111 0.0158 0.0114 0.0114 0.0116 0.0114 0.0115 0.0114 0.0114
            Time(s): 0.7298 1.7551 2.0672 2.5838 0.9814 0.2015 0.014 2.3478 1.718 2.3255 1.0042 0.6661
a1          MSE:     0.0057 0.0088 0.0075 0.0066 0.0067 0.0077 0.0058 0.0057 0.0058 0.0059 0.0058 0.0058
            Time(s): 0.9649 3.0271 3.6953 4.707 14.172 0.4251 0.0481 2.3947 2.3222 2.4029 2.984 1.027
s1          MSE:     0.0094 0.0131 0.0114 0.0104 0.0102 0.0104 0.0108 0.0096 0.0098 0.0098 0.0114 0.0103
            Time(s): 1.2519 3.7156 4.5436 5.7588 13.730 0.6793 0.06 2.3753 2.1462 2.6723 4.8069 1.4703
3.3 Evaluation of the estimation method for determining the number of clusters

This section evaluates the performance of the proposed method to estimate the number of clusters. The proposed estimation method is used to calculate the number of clusters for different datasets, and the results are compared with the numbers determined by the silhouette, Calinski-Harabasz (Caliński and Harabasz, 1974), Davies-Bouldin (Davies and Bouldin, 1979) and Gap (Tibshirani, Walther, and Hastie, 2000) methods. The results are provided in TABLE VIII.
TABLE VIII
Estimated number of clusters and running time (sec) for different estimation methods

            Number    Proposed Method     Silhouette +        Calinski-Harabasz   Davies-Bouldin +    Gap + kmeans
            of real                       kmeans              + kmeans            kmeans
Dataset     clusters  Est.   Time(s)      Est.   Time(s)      Est.   Time(s)      Est.   Time(s)      Est.   Time(s)
Ionosphere
Newthyroid  3         3      1.186        4      1.5288       3      3.9329       3      4.3865       47     24.276
Pendigit    10        10     9.9373       15     56.519       3      40.15        8      42.363       50     76.589
P. I. D     2         2      2.3712       2      6.5364       3      11.640       3      12.964       42     33.739
s1          15        15     10.9825      13     53.430       16     22.457       13     21.579       15     71.095
Spambase    2         2      23.3221      2      565.50       30     353.06       2      380.40       49     160.60
Wine        3         3      0.9984       3      1.0764       37     2.3514       7      2.5995       39     24.542
Wdbc        2         2      2.2308       6      5.0544       40     5.9540       4      6.4251       50     32.395
Heart       2         2      4.3531       2      28.18        2      20.243       48     16.373       2      29.309
Shuttle     7         5      65.6080      2      741.52       2      423.87       2      497.51       10     819.32
Waveform    3         3      24.6797      2      795.24       2      72.841       2      96.211       8      186.56
Olitos      4         2      1.0713       2      4.0011       2      1.2951       44     1.3393       50     27.393
Yeast       10        3      1.9232       3      46.698       3      2.3961       5      3.4387       47     39.521
The comparison shows that the proposed estimation method is both considerably faster and more accurate than the other estimation methods.
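The silhouette-style selection evaluated above can be sketched as follows: cluster the data for each candidate k and keep the k with the highest mean silhouette score. The sketch below uses the standard silhouette coefficient (Rousseeuw, 1987) rather than the paper's modified variant, and a toy clusterer of our own:

```python
import numpy as np

def kmeans(X, K, iters=30):
    """Tiny Lloyd's K-means with deterministic farthest-first seeding
    (a stand-in clusterer, not the paper's transformed K-means)."""
    C = [X[0]]
    for _ in range(K - 1):
        # Pick the point farthest from all centroids chosen so far
        d2 = ((X[:, None] - np.array(C)[None]) ** 2).sum(-1).min(1)
        C.append(X[d2.argmax()])
    C = np.array(C, dtype=float)
    for _ in range(iters):
        labels = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        for k in range(K):
            if (labels == k).any():
                C[k] = X[labels == k].mean(0)
    return labels

def mean_silhouette(X, labels):
    """Standard mean silhouette coefficient computed from pairwise distances."""
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    s = []
    for i in range(len(X)):
        same = labels == labels[i]
        a = D[i, same].sum() / max(same.sum() - 1, 1)  # mean intra-cluster distance
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        s.append((b - a) / max(a, b))
    return float(np.mean(s))

# Toy data: three well-separated blobs, so the right answer is k = 3
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 0.4, size=(40, 2)) for c in [(0, 0), (6, 6), (12, 0)]])
scores = {k: mean_silhouette(X, kmeans(X, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(best_k)
```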
4. Conclusion

Clustering provides a knowledge acquisition method for intelligent applications to develop rule-based expert systems. This paper proposes an improved version of the K-means clustering algorithm named transformed K-means. The proposed clustering method is a combination of a new initialization technique, the K-means algorithm and a new gradual data transformation approach that presents more accurate clustering results on real datasets when compared to other K-means-based algorithms. By selecting initial centroids that are closer to the optimum centroid locations, the proposed initialization approach solves the limitation of the methods in (Lloyd, 1982), (Dunn, 1973) and (Bezdek et al., 1984) to properly initialize the K-means clustering. The inverse transformation gradually moves the artificial data back to their original places. During this process, the clustering centroids are updated after any change in the data structure. This provides more optimal clustering results for both synthetic and real datasets and addresses the drawback of the models in (Arthur and Vassilvitskii, 2007) and (Mahesh Kumar and Rama Mohan Reddy, 2016). In addition, the proposed data transformation solves the problem of empty cluster generation associated with the K-means clustering method and its improved versions. An efficient method based on the principal component transformation and a modified silhouette algorithm is also proposed in this paper to determine the number of clusters for cases where the number is not specified in advance (Arthur and Vassilvitskii, 2007), (Chen et al., 2016) and (Silva Filho et al., 2015). Finally, the proposed clustering method addresses the time and computational burden associated with the models in (Kwedlo, 2011) and (Malinen et al., 2014).
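A minimal sketch of the gradual-transformation idea summarized above (following Malinen et al., 2014, on which the method builds): cluster an easy artificial structure first, then move the data back toward its original positions in small steps, updating the centroids after every step. The artificial structure and the initialization here are simplified assumptions of ours, not the paper's exact procedure.

```python
import numpy as np

def lloyd(X, C, iters=20):
    """A few Lloyd iterations warm-started from centroids C."""
    for _ in range(iters):
        labels = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        for k in range(len(C)):
            if (labels == k).any():
                C[k] = X[labels == k].mean(0)
    return labels, C

def transformed_kmeans(X, K, steps=20, seed=0):
    """Gradual data transformation: cluster an artificial structure,
    then move the data back to its original places step by step."""
    rng = np.random.default_rng(seed)
    # Artificial starting structure: points spread evenly along a line
    order = np.argsort(X[:, 0])
    line = np.zeros_like(X)
    line[order, 0] = np.linspace(-1.0, 1.0, len(X))
    C = line[rng.choice(len(X), K, replace=False)].copy()
    labels, C = lloyd(line, C)
    for s in range(1, steps + 1):
        Xs = line + (s / steps) * (X - line)  # inverse transform, one step back
        labels, C = lloyd(Xs, C)              # update centroids after each change
    return labels, C

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in [(-1, 0), (1, 0)]])
labels, C = transformed_kmeans(X, K=2)
```

Because the centroids follow the data through every intermediate step, no cluster is ever left stranded far from the points, which is the intuition behind the empty-cluster guarantee discussed above.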
Several experiments were performed to evaluate: 1) the proposed method for dealing with the empty cluster generation; 2) the proposed transformed K-means clustering algorithm; and 3) the estimation method for determining the number of clusters. For the first experiment, three real time-series datasets of solar radiation at the Ames, Chariton and Calmar stations were used to calculate the number of empty clusters (N. E. C) generated by the K-means algorithm and the proposed method. The results demonstrated that the proposed method significantly reduces the N. E. C compared to the K-means algorithm. The empty cluster problem was then completely solved by the proposed data transformation approach, which guarantees the convergence of the algorithm.
For the second experiment, K-means clustering was run with different initialization methods including the random-based, K-means*-based, K-means++-based and the proposed initialization method. Simulation results showed that the proposed initialization method improves the accuracy of the K-means algorithm when compared to the other initialization methods. However, the computational complexity was increased due to the data sorting performed by our initialization to optimally select the initial centroids. The performance of the proposed transformed K-means was also evaluated using several different real datasets and compared with different variants of K-means clustering as well as the SOM, SOM++, FCM, K-medoids and GTSOM clustering algorithms. The comparison demonstrated the improved clustering quality of the proposed transformed K-means. The proposed transformed K-means algorithm provided a faster processing time compared to the ABC-K-means, DE-K-means, GA-K-means, SOM, GTSOM, and SOM++ and competed with FCM, K-medoids, K-means*, and K-means++. For small- and medium-sized data, our proposed method was shown to be generally more time consuming than K-medoids, K-means*, and K-means++. However, it converged faster than K-means* and K-means++ for large data volumes.
For the third experiment, the proposed estimation method was evaluated and compared with other estimation techniques to determine the number of clusters. The comparison showed that, while working much faster, the proposed estimation method was more accurate than the other estimation methods.
Acknowledgment

The authors would like to thank Prof. P. Fränti and Mr. M. Malinen for their valuable technical advice.
References

Abdul Nazeer, K. A., & Sebastian, M. P. (2009) Improving the accuracy and efficiency of the K-means clustering algorithm. Proceedings of the World Congress on Engineering, London, UK.
Arthur, D., & Vassilvitskii, S. (2007) K-means++: the advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms, New Orleans, Louisiana, Society for Industrial & Applied Mathematics, pp 1027-1035.
Bahmani, B., Moseley, B., Vattani, A., Kumar, R., & Vassilvitskii, S. (2012) Scalable K-means++. Proc. VLDB Endow., vol. 5, no. 7, pp 622-633.
Bezdek, J. C., Ehrlich, R., & Full, W. (1984) FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, vol. 10, no. 2-3, pp 191-203.
Caliński, T., & Harabasz, J. (1974) A dendrite method for cluster analysis. Communications in Statistics, vol. 3, no. 1, pp 1-27.
Celebi, M. E., Kingravi, H., & Vela, P. A. (2013) A comparative study of efficient initialization methods for the K-means clustering algorithm. Expert Systems with Applications, vol. 40, no. 1, pp 200-210.
Chen, M., Li, L., Wang, B., Cheng, J., Pan, L., & Chen, X. (2016) Effectively clustering by finding density backbone based-on kNN. Pattern Recognition, vol. 60, pp 486-498.
Chen, X. (2015) A new clustering algorithm based on near neighbor influence. Expert Systems with Applications, vol. 42, pp 7746-7758.
Cui, X., Zhu, P., Yang, X., Li, K., & Ji, C. (2014) Optimized big data K-means clustering using MapReduce. The Journal of Supercomputing.
Davies, D. L., & Bouldin, D. W. (1979) A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, no. 2, pp 224-227.
Dogan, Y., Birant, D., & Kut, A. (2013) SOM++: Integration of self-organizing map and K-means++ algorithms. Machine Learning and Data Mining in Pattern Recognition, P. Perner (Ed.), Springer Berlin Heidelberg, vol. 7988, pp 246-259.
Dunn, J. C. (1973) A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, vol. 3, no. 3, pp 32-57.
Herbert, J., & Yao, J. (2007) GTSOM: Game theoretic self-organizing maps. Trends in Neural Computation, K. Chen and L. Wang (Eds.), Springer Berlin Heidelberg.
http://cs.uef.fi/sipu/datasets
http://mesonet.agron.iastate.edu
https://archive.ics.uci.edu/ml/datasets
Kaufman, L., & Rousseeuw, P. J. (1987) Clustering by means of medoids. In Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods, North-Holland, pp 405-416.
Kohonen, T. (1990) The self-organizing map. Proceedings of the IEEE, vol. 78, no. 9, pp 1464-1480.
Krishna, K., & Murty, M. N. (1999) Genetic K-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 29, no. 3, pp 433-439.
Kwedlo, W. (2011) A clustering method combining differential evolution with the K-means algorithm. Pattern Recognition Letters, vol. 32, no. 12, pp 1613-1621.
Likas, A., Vlassis, N., & Verbeek, J. J. (2003) The global K-means clustering algorithm. Pattern Recognition, vol. 36, no. 2, pp 451-461.
Liu, H., & Ban, X.-j. (2015) Clustering by growing incremental self-organizing neural network. Expert Systems with Applications.
Lloyd, S. P. (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory, vol. 28, no. 2, pp 129-136.
Mahesh Kumar, K., & Rama Mohan Reddy, A. (2016) A fast DBSCAN clustering algorithm by accelerating neighbor searching using Groups method. Pattern Recognition, vol. 58, pp 39-48.
Malinen, M., Mariescu-Istodor, R., & Fränti, P. (2014) K-means*: Clustering by gradual data transformation. Pattern Recognition, vol. 47, no. 10, pp 3376-3386.
Markic, B., & Tomic, D. (2010) Marketing intelligent system for customer segmentation. In J. Casillas & F. J. Martínez-López (Eds.), Marketing Intelligent Systems Using Soft Computing: Managerial and Research Applications, Springer Berlin Heidelberg.
Rousseeuw, P. J. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, vol. 20, pp 53-65.
Shuliang, L., Barry, D., Edwards, J., Kinman, R., & Duan, Y. (2002) Integrating group Delphi, fuzzy logic and expert systems for marketing strategy development: the hybridisation and its effectiveness. Marketing Intelligence & Planning, vol. 20, no. 5, pp 273-284.
Silva Filho, T. M., Pimentel, B. A., Souza, R. M. C. R., & Oliveira, A. L. I. (2015) Hybrid methods for fuzzy clustering based on fuzzy c-means and improved particle swarm optimization. Expert Systems with Applications, vol. 42, pp 6315-6328.
Tibshirani, R., Walther, G., & Hastie, T. (2000) Estimating the number of data clusters via the Gap statistic. Journal of the Royal Statistical Society: Series B.
Yang, B.-r., Li, H., & Qian, W.-b. (2012) The cognitive-base knowledge acquisition in expert system. In H. Tan (Ed.), Technology for Education and Learning, Springer Berlin Heidelberg, pp 73-80.
Zhang, C., Ouyang, D., & Ning, J. (2010) An artificial bee colony approach for clustering. Expert Systems with Applications, vol. 37, pp 4761-4767.
Biographies

Rasool Azimi received his B.Sc. degree in Software Engineering from Mehrastan University, Guilan, Iran, in 2011 and the M.Sc. degree from Science and Research Branch, Islamic Azad University, Qazvin, Iran, in 2014. His research interests include distributed data mining, data clustering, artificial intelligence and their applications in power systems.
Mohadeseh Ghayekhloo received her B.Sc. degree in Computer Engineering from Mazandaran University of Science and Technology, Babol, Iran, and the M.Sc. degree from Science and Research Branch, Islamic Azad University, Qazvin, Iran, in 2011 and 2014, respectively. Her research interests include optimization algorithms, artificial neural networks, and computational intelligence.
Mahmoud Ghofrani received his B.Sc. degree in Electrical Engineering from Amirkabir University of Technology, Tehran, Iran, in 2005, the M.Sc. degree from University of Tehran, Tehran, Iran, in 2008, and the Ph.D. degree from the University of Nevada, Reno, in 2014. He is currently an Assistant Professor at the School of Science, Technology, Engineering and Mathematics, University of Washington, Bothell. His research interests include power systems operation and planning, renewable energy systems, and smart grids.
Hedieh Sajedi received her B.Sc. degree in Computer Engineering from Amirkabir University of Technology in 2003, and the M.Sc. and Ph.D. degrees in Computer Engineering (Artificial Intelligence) from Sharif University of Technology, Tehran, Iran, in 2006 and 2010, respectively. She is currently an Assistant Professor at the Department of Computer Science, Tehran University, Iran. Her research interests include multimedia data hiding, steganography and steganalysis methods, pattern recognition, and machine learning.