PII: S0957-4174(17)30034-9
DOI: 10.1016/j.eswa.2017.01.024
Reference: ESWA 11072
Please cite this article as: Rasool Azimi, Mohadeseh Ghayekhloo, Mahmoud Ghofrani, Hedieh Sajedi, A novel clustering algorithm based on data transformation, Expert Systems With Applications (2017), doi: 10.1016/j.eswa.2017.01.024
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT
Highlights
A new initialization technique is proposed to improve the performance of K-means.
Department of Computer Science, College of Science, University of Tehran, Tehran, Iran.
*Corresponding author: UWBB room 227, 18807 Beardslee Blvd, Bothell, WA 98011, USA Fax number: 425.352.3775
Abstract: Clustering provides a knowledge acquisition method for intelligent systems. This paper proposes a novel data-clustering algorithm that combines a new initialization technique, the K-means algorithm, and a new gradual data transformation approach to provide more accurate clustering results than the K-means algorithm and its variants by increasing cluster coherence. The proposed data transformation approach solves the problem of generating empty clusters, which frequently occurs in other clustering algorithms. An efficient method based on the principal component transformation and a modified silhouette algorithm is also proposed to determine the number of clusters. Several different data sets are used to evaluate the efficacy of the proposed method in dealing with the empty cluster generation problem, and to assess its accuracy and computational performance in comparison with other K-means based initialization techniques and clustering methods. The developed method for estimating the number of clusters is also evaluated and compared with other estimation algorithms. The significance of the proposed method lies in addressing the limitations of K-means based clustering and improving clustering accuracy, an important task in the field of data mining and expert systems. Applied to knowledge acquisition in time series data such as wind, solar, electric load and stock market series, the proposed method provides a pre-processing tool for selecting the most appropriate data to feed into neural networks or other estimators used for forecasting such time series. In addition, using the knowledge discovered by the proposed K-means clustering to develop rule-based expert systems is one of its main impacts.
Index Terms: Data mining, clustering, K-means, data transformation, silhouette, transformed K-means
1. Introduction
Expert systems are computer applications that contain stored knowledge and are developed to solve problems in a specific field in much the same way a human expert would (Shuliang et al., 2002). Acquiring expert knowledge is a challenge in developing such expert systems (Yang et al., 2012). One of the major problems and most difficult tasks in developing rule-based expert systems is representing the knowledge discovered by data clustering (Markic and Tomic, 2010). The K-means algorithm is one of the most commonly used clustering techniques; it uses a data reassignment method to iteratively optimize the clustering (Lloyd, 1982). The main goal of clustering is to generate compact groups of objects or data that share similar patterns within the same cluster, and to isolate these groups from those containing elements with different characteristics.
Although the K-means algorithm offers simplicity and high convergence speed, it depends entirely on the initial centroids, which are randomly selected in the first phase of the algorithm. Due to this random selection, the algorithm may converge to locally optimal solutions (Celebi et al., 2013). Different variants of the K-means algorithm have been proposed to address this limitation. The K-medoids algorithm was proposed in (Kaufman and Rousseeuw, 1987) to define each cluster by its most central medoid. First, K data points are considered as initial centroids (medoids) and each data point is assigned to the closest medoid, forming the initial clusters. In an iterative process, the most central data point in each cluster is taken as the new centroid and each data point is assigned to the nearest centroid; the remaining steps match the K-means procedure. Fuzzy C-means (FCM) clustering introduced the partial membership concept (Dunn, 1973; Bezdek et al., 1984). In the FCM algorithm, each data point belongs to all clusters; the degree of belonging is represented by a partial membership determined by a fuzzy clustering matrix. A genetic algorithm-based K-means (GA-K-means) was proposed in (Krishna and Murty, 1999) to provide a global optimum for the clustering. In this method, the K-means algorithm was used as a search operator instead of crossover, and a biased mutation operator was proposed to help the K-means algorithm avoid local minima. The global K-means algorithm was developed in (Likas et al., 2003) to provide an experimentally optimal solution for clustering problems; however, it is not appropriate for clustering medium-sized and large-scale datasets due to its heavy computational burden. The K-means++ initialization algorithm was proposed in (Arthur and Vassilvitskii, 2007) to obtain a near-optimal initial set of centroids. The main drawback of K-means++ is its inherently sequential nature, which limits its effectiveness for high-volume data. An artificial bee colony K-means (ABC-K-means) clustering approach was proposed in (Zhang et al., 2010) for optimal partitioning of data objects into a fixed number of clusters. A hybrid of differential evolution and K-means, named DE-K-means, was introduced in (Kwedlo, 2011); the differential evolution algorithm was used as a global optimization method and the resulting clustering solutions were fine-tuned and corrected using the K-means algorithm. Dogan et al. proposed a hybrid of K-means++ and the self-organizing map (SOM) (Kohonen, 1990) to improve clustering accuracy. It first uses the K-means++ initialization method to determine the initial weight values and starting points, and then uses SOM to find an appropriate final clustering solution; however, the aforementioned limitation of K-means++ was not addressed. A new clustering technique combining the global K-means algorithm and a topology neighborhood based on Axiomatic Fuzzy Set (AFS) theory was developed in (Wang et al., 2013) to determine initial centroids. A new clustering algorithm named K-means* was presented in (Malinen et al., 2014) that generates an artificial dataset X* as the input data. The input data are mapped one-by-one to the generated artificial data (X → X*). Next, the inverse transformation of the artificial data to the original data is performed by a series of gradual transformations. To do so, the K-means algorithm updates the clustering model after each transformation and moves
the data vectors slowly to their original positions. The K-means* algorithm uses a random data swapping strategy to deal with the problem of generating empty clusters. However, the random selection of data vectors as cluster centroids may reduce the coherence of the other clusters and decrease the efficiency of the K-means* algorithm. Moreover, the convergence rate of the K-means* algorithm drops significantly as the number of clusters increases, especially with increasing data volumes. Density-based clustering methods were proposed in (Mahesh Kumar and Rama Mohan Reddy, 2016) to speed up the neighbor search for clustering spatial databases with noise. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) provides a graph-based index structure for high-dimensional data with large amounts of noise. It was shown that the running time of the proposed method is faster than DBSCAN with exactly the same clustering results, and that it solves the inability of DBSCAN to handle clusters with large differences in densities. A novel clustering algorithm named CLUB (CLUstering based on Backbone) was developed in (Chen et al., 2016) to determine optimal clusters. First, the algorithm detects the initial clusters and finds their density backbones. Then, it finds the outliers in each cluster based on the K-nearest-neighbor (KNN) method. Finally, by assigning each unlabeled point to the cluster of its nearest higher-density neighbor, the algorithm yields the final clusters. CLUB has several drawbacks: the KNN method lacks an efficient algorithm to determine the value of the parameter K (the number of nearest neighbors), and its computational cost is too high because it requires calculating the distance of each query instance to all training samples. Two particle swarm optimization (PSO) based fuzzy clustering methods were proposed in (Silva Filho et al., 2015) to deal with the shortcomings of PSO algorithms used for fuzzy clustering. These methods adjust the parameters of PSO dynamically to achieve a balance between exploration and exploitation and thereby avoid trapping in local optima, but they lack precision for high-dimensional applications, and their iterative process significantly decreases the convergence rate. Generally, the speed at which a convergent sequence approaches its limit is defined as the rate of convergence. Three clustering algorithms named Clustering by Near Neighbor Influence (CNNI), an improved version with lower time cost (ICNNI), and a variation (VCNNI) were presented in (Chen, 2015). The clustering results showed that ICNNI is faster than CNNI and that CNNI requires less space than VCNNI; these methods suffer from large-scale computing and storage requirements. A growing incremental self-organizing neural network (GISONN) was developed in (Liu and Ban, 2015) to select appropriate clusters by learning the data distribution of each cluster. The method is, however, not applicable to large-volume or high-dimensional datasets due to its computational complexity. In addition, the neighborhood-preserving feature of the algorithm is violated when the output space topology does not match the structure of the data.
In spite of the improved performance of the K-means variants on synthetic datasets with Gaussian distributions, their performance on real datasets is neither very promising nor much different from the original K-means algorithm. In addition, all K-means based algorithms lack an efficient method to determine the optimal number of clusters, requiring the user to set the number of clusters either arbitrarily or based on practical and experimental estimates, which might not be optimal.
In this paper, we propose a novel clustering approach called transformed K-means that provides more accurate clustering results than the K-means algorithm and its improved versions. The proposed clustering method combines a new initialization technique, the K-means algorithm and a new gradual data transformation approach to appropriately select the initial cluster centroids and move the real data into the locations of the initial cluster centroids that are closest to the actual positions of the associated data. By doing this, the data are placed in an artificial structure that properly initializes the K-means clustering. The inverse transformation is then performed to gradually move the artificial data back to their original places. During this process, K-means updates the cluster centroids after each change in the data structure. This provides better clustering results for both synthetic and real datasets. In addition, the proposed data transformation solves the empty cluster problem of the K-means algorithm and its improved versions. An efficient method based on the principal component transformation and a modified silhouette algorithm is also proposed to determine the optimal number of clusters for the K-means algorithms.
The proposed clustering method supports the development of rule-based expert systems by means of knowledge acquisition through data transformation. The significance of the proposed method lies in addressing the limitations of K-means based clustering and improving clustering accuracy, an important task in the field of data mining and expert systems. The proposed method can be used for intelligent system applications such as forecasting time series including solar, wind, load and stock market series. The main contributions of this paper are as follows:
1. A new initialization technique is proposed to select initial centroids that are closer to the optimal centroid locations.
2. A novel gradual data transformation approach is proposed to significantly reduce the number of empty clusters generated.
4. A hybrid clustering algorithm is developed by combining the proposed initialization, data transformation and cluster number estimation to provide better knowledge discovery of the input patterns and more accurate clustering results.
The rest of the paper is organized as follows. Section 2 provides a brief description of the K-means algorithm and explains the proposed clustering method. Section 3 demonstrates case studies in which the performance of the developed clustering method is evaluated.
2. Methodology
2.1 The K-means Algorithm
The K-means algorithm (Lloyd, 1982) is a well-known, low-complexity algorithm for data partitioning. The algorithm starts with a given number of clusters K and outputs the cluster centroids through iterations. Let X = [x_1, ..., x_n] be the set of n points to be grouped into K different cluster (partition) sets C = {c_p}, p = 1, 2, ..., K. Using the Euclidean distance, the algorithm assigns each data point to its closest centroid c_p, calculated by:

c_p = (1 / n_p) * sum_{i=1}^{n_p} x_i^(p)   (1)

where x_i^(p) is the i-th data point in cluster p, and n_p is the number of data points in that cluster. After the first run, the algorithm calculates the mean of the data points in each cluster c_p and selects this value as the new cluster centroid, starting a new iteration. As new clusters are formed, new mean values are obtained. The algorithm halts once the sum of squared errors over the K clusters is minimized (Cui et al., 2014).
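The assignment/update loop described above can be sketched as follows. This is an illustrative minimal implementation, not the authors' code; the iteration cap and convergence tolerance are assumptions.

```python
import numpy as np

def kmeans(X, K, init_centroids, max_iter=100, tol=1e-9):
    """Plain Lloyd-style K-means: assign each point to its nearest
    centroid, then recompute each centroid as the cluster mean (Eq. 1)."""
    centroids = init_centroids.astype(float).copy()
    for _ in range(max_iter):
        # distances from every point to every centroid, shape (n, K)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centroids = centroids.copy()
        for p in range(K):
            members = X[labels == p]
            if len(members) > 0:          # an empty cluster keeps its old centroid
                new_centroids[p] = members.mean(axis=0)
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels
```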
2.2 The Proposed Clustering Method
An improved version of the K-means algorithm, named transformed K-means, is proposed in this section. The proposed clustering algorithm combines a new technique to select the initial cluster centroids with a new approach for the inverse transformation of the data to enhance clustering performance. The steps of the transformed K-means algorithm are as follows:
A. Initial centroid selection
Let X = [x_1, ..., x_n] be a set of n data vectors. The K initial centroids are selected as follows:
1) Remove duplicate data vectors and store the result in a new dataset X' = [(x'_1, r_1), ..., (x'_m, r_m)], where r_i is the repetition number of x'_i in X.
2) Sort the data vectors in the dataset X' in ascending order based on the Euclidean length of the vectors.
3) Divide the dataset X', consisting of m data vectors, into K sub-datasets with (at most) S = m / K data vectors each, according to Eq. (2), such that the data elements of X' are distributed among the sub-datasets X'_1 to X'_K.
4) We now have K sub-datasets, each of which is used to determine exactly one of the K initial centroids. Eq. (3) is used to calculate a weight attribute w(x'_i) for each data entry x'_i with repetition number r_i in each of the K sub-datasets:

w(x'_i)_m = r_i / ((1/S) * sum_{j=1}^{S} dist(x'_i, x'_j)),  (1 <= m <= K)   (3)

where w(x'_i)_m is the weight attribute of x'_i in the m-th sub-dataset.
5) In each of the K sub-datasets, the data entry with the highest weight attribute is selected as the initial centroid.
Fig. 1 shows the flowchart of our proposed method for selecting initial centroids.
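The five initialization steps above can be sketched as follows. This is a hedged reconstruction, not the authors' code: the sub-dataset size S = ceil(m/K), the contiguous split after sorting, and the reading of Eq. (3) as repetitions divided by the mean intra-sub-dataset distance are our assumptions.

```python
import numpy as np
from collections import Counter

def select_initial_centroids(X, K):
    """Sketch of the proposed initialization (Section 2.2.A)."""
    # 1) deduplicate, keeping the repetition count r_i of each vector
    counts = Counter(map(tuple, X))
    Xp = np.array(list(counts.keys()), dtype=float)
    r = np.array(list(counts.values()), dtype=float)
    # 2) sort by Euclidean vector length
    order = np.argsort(np.linalg.norm(Xp, axis=1))
    Xp, r = Xp[order], r[order]
    # 3) split into K contiguous sub-datasets of size at most ceil(m/K)
    m = len(Xp)
    S = int(np.ceil(m / K))
    centroids = []
    for start in range(0, m, S):
        sub, rsub = Xp[start:start + S], r[start:start + S]
        # 4) weight each entry: repetitions divided by mean distance (Eq. 3)
        d = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=2)
        mean_d = d.mean(axis=1) + 1e-12     # guard against division by zero
        w = rsub / mean_d
        # 5) the highest-weight entry becomes this sub-dataset's centroid
        centroids.append(sub[w.argmax()])
    return np.array(centroids[:K])
```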
Fig. 1. Flowchart of the proposed method for the initial centroids selection
B. Inverse transformation
The inverse data transformation approach was first used in (Malinen et al., 2014) to solve the problems associated with the K-means clustering algorithm. However, that approach has a number of shortcomings, such as finding a suitable artificial data structure, performing the mapping, and controlling the inverse transformations, and it cannot generally guarantee an optimal solution. This was demonstrated by the clustering results of (Malinen et al., 2014), where in some cases the data transformation led to the deviation of the data towards incorrect cluster centroids. For the inverse transformation of data, we first generate artificial data X* of the same size (n) and dimension (d) as the input data. This divides the data vectors into K distinct clusters without any fluctuations. We then define a one-to-one mapping of the input data to the artificial data.
The inverse data transformation approach of (Malinen et al., 2014) uniformly distributes the initial cluster centroids along a line. This random placement may break the clustering structure, deviate the data to incorrect cluster centroids, and consequently produce incorrect results. To address this problem, our proposed inverse data transformation approach places each initial centroid c0_j (1 <= j <= K) at the location of each data vector x_i (1 <= i <= n) that is closest to it in the artificial structure X*:

x_i* = ArgMin_{c0_j} || x_i - c0_j ||,  (1 <= i <= n), (1 <= j <= K)   (5)

so that each x_i* takes the value of its nearest initial centroid (x_i* ∈ initC, 1 <= i <= n).
A series of inverse transformations is then performed that gradually moves the data elements to their real (original) positions. This inversely transforms the artificial data back into the original data. During this process, K-means updates the cluster centroids of the transformed data; the cluster centroids calculated in each step are used as the initial cluster centroids for the next step. The process continues until the last step, whose results provide the final cluster centroids. The proposed procedure is outlined as follows:
First, each vector x_i is placed at the position of the initial centroid initC_l (1 <= l <= K) with the minimum distance to the corresponding data point. Next, the vectors gradually move back to their real positions.
Generally, for a dataset X = [x_1, ..., x_n] of n data vectors, the gradual inverse transformation of the data to their real positions proceeds as follows:
1) Sort the dataset X = [x_1, ..., x_n] in ascending order based on the Euclidean length of the vectors, and store the sorted result.
[Fig. 2 appears here: flowchart of the transformed K-means. Inputs: input pattern X = {x1, ..., xn}, number of final clusters K, initial centroids initC0 = {c01, ..., c0K}, and the number of inverse transformation steps. The flowchart computes Dist'' = (X'' - X*) / Steps, then at each step i moves all data points X*_i towards their real locations and performs K-means(X*_i, K, initC_{i-1}) with the previous centroids and the modified dataset as inputs, finally outputting the final centroids CF = {c1, ..., cK}.]
2) To construct the artificial data structure X* as the initial position of the data, place each initial centroid initC_l (1 <= l <= K) at the positions of the data vectors of X' that are closer to that initial centroid than to the (K-1) other initial centroids. This forms the artificial structure X* = [x*_1, ..., x*_n], moving each real data vector to the location of the initial centroid closest to the actual position of the associated data.
3) Shuffle the real data vectors X = [x_1, ..., x_n] into random order and store them in the new dataset X'' = [x''_1, ..., x''_n].
4) Determine the distances between the initial artificial data (X*) and the shuffled real data (X''), and store them in the set Dist'' = [dist''_1, dist''_2, ..., dist''_n]. Each element dist''_i represents the distance vector between the i-th data vector (x*_i) in the artificial dataset X* and the position of the corresponding data vector (x''_i) in the dataset X''.
5) Divide each element of Dist'' = [dist''_1, ..., dist''_n] by the user-specified number of steps (Steps > 1) and update the values of the elements of Dist'' accordingly, as given by Eq. (6).
6) At each step of the inverse transformation, all data points move towards their real locations:

X*_i = X*_{i-1} + Dist''

where X* is the position of the data in the artificial structure, i is the step number and Dist'' is the per-step distance of the shuffled data from the data positions in the artificial structure. Fig. 2 shows the flowchart of the proposed transformed K-means. Note that X*_1 is the initial data position in the artificial structure (the initial artificial dataset). In the first step, the initial centroids (initC = initC0), calculated by the proposed initial centroid selection method, are fed to the K-means algorithm as inputs. After every inverse transformation, K-means is executed with the previous centroids (initC_{i-1}) and the modified dataset (X*_i) as the input pattern. After completion of all steps (i = Steps), all data points are back in their original locations (X*_i = X') and the final centroids (CF) are produced as the outputs. The proposed initialization approach of Section 2.2.A significantly reduces the chance of empty cluster generation through proper selection of the initial centroids, and the proposed data transformation approach completely solves the empty cluster problem during the transformation process.
The transformed K-means algorithm has a time complexity of order O(n log n + n K s), where n is the total number of data points, K is the number of clusters and s is the number of steps. More details of the time complexity of the proposed transformed K-means algorithm are given for its different phases in TABLE I.
TABLE I
Time complexity of the proposed transformed K-means algorithm.
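The inverse transformation procedure of Section 2.2.B can be sketched as follows. This is an illustrative reading of steps 1-6, not the authors' implementation; in particular, the equal per-step increment and the inner K-means iteration budget are assumptions.

```python
import numpy as np

def lloyd(X, K, centroids, iters=50):
    """Minimal K-means used inside each transformation step."""
    centroids = centroids.astype(float).copy()
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(1)
        for p in range(K):
            pts = X[labels == p]
            if len(pts):
                centroids[p] = pts.mean(0)
    return centroids

def transformed_kmeans(X, init_centroids, steps=10):
    """Sketch of the gradual inverse transformation: each point starts at
    its nearest initial centroid and moves home in equal increments,
    with K-means re-run after every move."""
    K = len(init_centroids)
    C = init_centroids.astype(float).copy()
    # Step 2: artificial structure X* -- every point sits on its nearest centroid
    nearest = np.linalg.norm(X[:, None] - C[None], axis=2).argmin(1)
    X_star = C[nearest].copy()
    # Steps 4-5: per-step displacement Dist'' = (X - X*) / Steps
    dist = (X - X_star) / steps
    # Step 6: move the points gradually back, updating centroids each step
    for _ in range(steps):
        X_star = X_star + dist
        C = lloyd(X_star, K, C)
    return C
```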
TABLE II provides the time complexity orders for the proposed method and well-known clustering algorithms including K-means*, K-means++, global K-means, original K-means, K-medoids, FCM, SOM, SOM++ and game-theoretic SOM (GTSOM).
TABLE II
Time complexity comparison of the proposed transformed K-means algorithm and several well-known clustering algorithms.

Algorithm       | Time complexity
K-means++       | O(n K)
Global K-means  | O(n^2 K^2)
K-means         | O(n K)
K-medoids       | O(n^2 K)
FCM             | O(n K^2)
SOM             | O(n^2 K)
GTSOM           | O(n^2 K)
SOM++           | O(n^2 K)
The time complexity comparison in TABLE II shows that the proposed transformed K-means algorithm is faster than SOM, GTSOM, and global K-means, and competes with FCM and K-medoids. The time complexity of K-means and K-means++ is better than that of our proposed algorithm. However, as the data volume increases, the K-means++ algorithm may not be as efficient as our proposed method due to its sequential initialization (Bahmani et al., 2012). The proposed transformed K-means and K-means* algorithms have almost the same time complexity; however, the approach used to deal with the generation of empty clusters in the K-means* algorithm reduces its convergence rate nonlinearly as the data volume (n) and the number of clusters (K) increase. Consequently, our proposed clustering algorithm is generally faster than the K-means* algorithm.
2.3 Estimating the Number of Clusters
K-means and many other clustering algorithms assume that the number of clusters is known in advance. In cases where the number of clusters is not predefined, an efficient method is required to determine the optimal number of clusters. In this section we present a new method, based on the silhouette approach proposed in (Rousseeuw, 1987), to estimate the number of clusters. The original silhouette procedure is:
1) Cluster the input data using any clustering technique for each iteration m, (K_min <= m <= K_max).
2) Calculate the silhouette value for each data point i:

S_i^m = (b(i) - a(i)) / max(a(i), b(i))   (8)

where a(i) is the average distance between the i-th data point (1 <= i <= n) and the other data in the same cluster, and b(i) is the lowest average distance of the i-th data point to the data in the other K-1 clusters at the m-th iteration.
3) Average the silhouette values at each iteration.
4) Select the iteration index with the highest average silhouette S as the estimated number of clusters.
The proposed method uses the principal component transformation to modify the silhouette algorithm. The proposed procedure is as follows:
4-1) Transform the input data using the Karhunen-Loeve Transform (KLT). The KLT method, also known as the principal component transformation, projects the data onto the eigenvectors of its covariance matrix.
4-2) Let φ_k denote the eigenvector corresponding to the k-th eigenvalue λ_k of the covariance matrix Σ_x:

Σ_x φ_k = λ_k φ_k,  k = 1, ..., N

where the covariance entries are

Σ_{i,j} = cov(X_i, X_j) = E[(X_i - μ_i)(X_j - μ_j)],  μ_i = E(X_i)   (9)

4-3) Form the eigenvector matrix Φ = [φ_1, ..., φ_N]; Φ is orthonormal, Φ*T Φ = I, so Φ^{-1} = Φ*T.   (10)
4-4) Estimate the covariance matrix from the N data samples, Σ_x ≈ (1/(N-1)) Σ_i (x_i - μ)(x_i - μ)*T.   (11)
4-5) Diagonalize the covariance matrix: Φ*T Σ_x Φ = Λ = diag(λ_1, ..., λ_N).   (12)
4-6) Given the input data X, define the Karhunen-Loeve transformation of X as:

Y = Φ*T X,  i.e.,  y_k = φ_k*T x_k,  k = 1, ..., N   (13)
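A minimal sketch of the KLT steps 4-1 to 4-6: eigendecompose the sample covariance and project the centered data. `numpy.linalg.eigh` is used here because the covariance matrix is symmetric; the descending eigenvalue ordering is a convention, not part of the source.

```python
import numpy as np

def klt(X):
    """Karhunen-Loeve (principal component) transform of row-wise data X."""
    mu = X.mean(axis=0)
    Xc = X - mu                                  # center the data
    cov = np.cov(Xc, rowvar=False)               # Eq. (11): sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)       # Eq. (9): Sigma phi = lambda phi
    order = np.argsort(eigvals)[::-1]            # largest variance first
    Phi = eigvecs[:, order]
    Y = Xc @ Phi                                 # Eq. (13): Y = Phi^T X
    return Y, Phi, eigvals[order]
```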
ACCEPTED MANUSCRIPT
14
6) For each iteration m, calculate the initial centroids using the proposed initialization method and assign each of the transformed data points (Y) to the nearest initial centroid to form initial clusters. Then calculate the mean of all data in each cluster as the new centroids Cs = {cs_1, ..., cs_K}.
7) Calculate S_i^m for each of the input data points at each iteration m, (K_min <= m <= K_max), by:

S_i^m = (b(i) - a(i)) / max(a(i), b(i))   (14)

where a(i) is the distance between the i-th data point (1 <= i <= n) and the nearest centroid cs_j (1 <= j <= K) at the m-th iteration, and b(i) is the minimum distance of the i-th data point to the other K-1 centroids at the m-th iteration. The proposed definitions of a(i) and b(i) in Eq. (15) decrease the computational burden and speed up the process compared to their original definitions in the silhouette algorithm.
8) Include S_i^m (for the i-th data point at the m-th iteration) in the S_m array.
9) Store the average value of S_m (for the m-th iteration) in the m-th cell of the array S_ave^est.
10) Use Eq. (15) and select the index m with the highest S_ave value as the estimated number of clusters.
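Steps 6-10 can be sketched as follows, assuming a user-supplied clustering routine in place of the proposed initialization. The simplified a(i) and b(i) are taken here as the nearest and second-nearest centroid distances, which is our reading of step 7.

```python
import numpy as np

def estimate_k(Y, k_min, k_max, cluster_fn):
    """Simplified-silhouette estimate of the number of clusters.
    cluster_fn(Y, K) is assumed to return K centroids (any clustering
    routine, e.g. K-means with the proposed initialization, can be used)."""
    s_ave = {}
    for K in range(k_min, k_max + 1):
        C = cluster_fn(Y, K)
        d = np.linalg.norm(Y[:, None] - C[None], axis=2)   # (n, K) distances
        d_sorted = np.sort(d, axis=1)
        a = d_sorted[:, 0]               # distance to the nearest centroid
        b = d_sorted[:, 1]               # minimum distance to the other centroids
        s = (b - a) / np.maximum(a, b)   # Eq. (14)
        s_ave[K] = s.mean()              # steps 8-9: average per iteration
    return max(s_ave, key=s_ave.get)     # step 10: K with the highest score
```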
Fig. 3. Flowchart of the proposed method for estimating the number of clusters
This procedure modifies the method proposed in (Rousseeuw, 1987) to provide stable results with less processing time. TABLE III provides the time complexity orders of the proposed estimation method and the silhouette algorithm.
TABLE III
Time complexity comparison of the proposed estimation method and the silhouette algorithm.
Here n is the data volume and ΔK is the difference between K_max and K_min (ΔK = K_max - K_min). The comparison confirms the reduced processing time of the proposed method.
3. Case Studies
In this section, we first evaluate the performance of the proposed method in dealing with the empty cluster problem; we then assess the proposed transformed K-means clustering algorithm, and finally examine the proposed estimation method for determining the optimal number of clusters. The datasets used in the experiments are available online at Joensuu.
3.1. Evaluation of the proposed method for dealing with the empty cluster generation
Three real time-series datasets of solar radiation at the Ames, Chariton and Calmar stations between 01/01/2009 and 01/01/2014 are used to calculate the number of empty clusters (N.E.C.) generated by the K-means algorithm and the proposed method. The number of clusters for both algorithms is 200 (K = 200). First, the clustering is performed for one step (Steps = 1), i.e., without any data transformation, to evaluate the performance of the proposed initialization approach in dealing with the empty cluster problem. TABLE IV shows the N.E.C. for the proposed method with one step and for the K-means algorithm. The results demonstrate that the proposed method significantly reduces the N.E.C. compared to the K-means algorithm, thanks to the proposed initialization approach that properly selects the initial centroids. The N.E.C. generated by the proposed method is then calculated as the number of steps increases. The results are provided in TABLE V for steps 1 to 10. They show that the empty cluster problem is completely solved as the data are transformed back to their original positions, and the proposed clustering algorithm converges without generating any empty cluster.
TABLE IV
Performance comparison of the K-means and proposed transformed K-means on the problem of empty cluster generation

Dataset  | Number of objects | Number of clusters | N.E.C. (K-means) | N.E.C. (Proposed method, Step = 1)
Ames     | 43827 | 200 | 84 | 1
Chariton | 43827 | 200 | 75 | 0
Calmar   | 43827 | 200 | 27 | 2
TABLE V
Performance of the proposed method with different steps for the empty cluster problem

Step number | N.E.C. in Ames dataset | N.E.C. in Chariton dataset | N.E.C. in Calmar dataset
1  | 1 | 1 | 3
2  | 0 | 0 | 0
3  | 1 | 0 | 0
4  | 0 | 0 | 0
5  | 0 | 0 | 0
6  | 0 | 0 | 0
7  | 0 | 0 | 0
8  | 0 | 0 | 0
9  | 0 | 0 | 0
10 | 0 | 0 | 0
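For reference, the N.E.C. after a nearest-centroid assignment can be counted as follows. This is a generic check written for illustration, not the authors' evaluation script.

```python
import numpy as np

def count_empty_clusters(X, centroids):
    """Count clusters that receive no points when every point is
    assigned to its nearest centroid (the N.E.C. measure)."""
    labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(1)
    K = len(centroids)
    return K - len(np.unique(labels))
```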
3.2. Evaluation of the accuracy of the proposed clustering method
This section evaluates the accuracy of the proposed clustering method (transformed K-means). The mean squared error (MSE) is used as the accuracy measure:

MSE = (1 / (K·N)) * sum_{k=1}^{K} sum_{i=1}^{N} || X_i^(k) - C_k ||^2   (16)

where N is the number of data points in cluster k, and X_i^(k) is the i-th data point in cluster k. The testing datasets are normalized in the range [-1, 1]. K-means clustering is run with different initialization methods, including random initialization, K-means* based, K-means++ based and the proposed initialization method (Section 2.2.A). The calculated error values as well as the processing times are provided in TABLE VI. A comparison of the results shows that the proposed initialization method improves the accuracy of the K-means algorithm compared to the other initialization methods. However, the computational complexity is increased due to the data sorting used by our initialization to optimally select the initial centroids.
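The MSE of Eq. (16) can be computed as sketched below. Since N varies per cluster, we read Eq. (16) as averaging the per-cluster mean squared distances; that interpretation is an assumption.

```python
import numpy as np

def clustering_mse(X, centroids, labels):
    """MSE per Eq. (16): mean squared distance of each point to its
    cluster centroid, averaged over the non-empty clusters."""
    per_cluster = []
    for k in range(len(centroids)):
        pts = X[labels == k]
        if len(pts):
            per_cluster.append(((pts - centroids[k]) ** 2).sum(axis=1).mean())
    return float(np.mean(per_cluster))
```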
TABLE VI
MSE measures and running time (sec) for the K-means algorithm with different initialization methods

Dataset | Proposed initialization MSE | Time(s) | K-means* based MSE | Time(s) | K-means++ based MSE | Time(s) | Random based MSE | Time(s)
IRIS    | 0.0432 | 0.24 | 0.0432 | 0.0891 | 0.0431 | 0.0238 | 0.0432 | 0.0248
Olitos  |        |      |        |        |        |        |        |
The MSE value is calculated for different clustering methods including the proposed transformed K-means, ABC-K-means, DE-K-means, GA-K-means, K-means*, K-means++, K-means, SOM, GTSOM, SOM++, K-medoids and FCM, and is provided in TABLE VII. The calculated MSE values show that the transformed K-means (with Steps = 20) outperforms or competes with the existing methods in terms of clustering quality. The improved clustering quality results from several procedures embraced by our proposed method, namely the determination of the optimal number of clusters, the proposed initialization and the gradual data transformation.
The proposed transformed K-means algorithm has a faster processing time than ABC-K-means, DE-K-means, GA-K-means, SOM, GTSOM, and SOM++, and competes with FCM, K-medoids, K-means*, and K-means++. For small- and medium-sized data, our proposed method is generally more time consuming than K-medoids, K-means*, and K-means++. However, the reduced convergence rate of K-means* in dealing with empty cluster generation, particularly for higher numbers of clusters, and the sequential initialization of K-means++ increase the computational complexity of these methods for large data volumes. This is evident from our running-time results for large datasets such as Missa1 and Shuttle.
TABLE VII
MSE measure and the running time (sec) for different clustering techniques
Columns, left to right: Proposed method, ABC-K-means, DE-K-means, GA-K-means, K-means*, K-means++, K-means, SOM, GTSOM, SOM++, K-medoids, FCM

IRIS        MSE:     0.0432 0.0475 0.043 0.043 0.0432 0.0432 0.0432 0.0432 0.0433 0.0432 0.0432 0.0432
            Time(s): 0.5754 1.2629 1.4003 1.6738 1.4858 0.0063 0.0061 2.3584 1.7599 2.4825 0.0127 0.2132
Glass       MSE:     0.0017 0.0019 0.0018 0.0017 0.0017 0.0018 0.0017 0.0017 0.0019 0.0017 0.0017 0.0017
            Time(s): 0.6301 1.2252 1.3897 1.6864 0.1381 0.0134 0.0058 2.3659 1.8439 2.3572 0.0311 0.0134
Missa1      MSE:     0.0001 0.0002 0.0002 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.005
            Time(s): 3.9664 45.2259 60.29 81.2574 47.813 37.6179 0.3216 3.4526 10.5227 3.4381 47.6677 4.7857
Bridge      MSE:     0.0008 0.0018 0.0015 0.0013 0.0009 0.0013 0.001 0.001 0.0009 0.0012 0.0009 0.0038
            Time(s): 2.4381 29.2082 38.8446 52.4054 18.794 10.5017 0.1362 3.3597 10.6182 3.3557 18.0322 15.5823
Thyroid     MSE:     0.0134 0.0147 0.0134 0.0133 0.0591 0.0151 0.0145 0.0167 0.0169 0.0168 0.0146 0.0149
            Time(s): 0.6331 1.2185 1.3674 1.6443 0.3718 0.0087 0.0069 2.3265 1.7153 2.3346 0.0394 0.4068
Magic       MSE:     0.0271 0.0325 0.0295 0.0295 0.0304 0.0304 0.0304 0.0307 0.0306 0.0304 0.0304 0.0304
            Time(s): 0.9886 2.6975 3.2338 4.1301 0.7922 1.3478 0.0373 2.3212 1.684 2.3245 3.777 0.7571
Wine        MSE:     0.0255 0.0277 0.0254 0.0252 0.1349 0.0256 0.0255 0.0257 0.0255 0.0255 0.0256 0.0256
            Time(s): 0.6183 1.1893 1.3275 1.5914 1.8411 0.0059 0.006 2.3257 1.7137 2.3216 0.0127 0.4052
Shuttle     MSE:     0.0007 0.0008 0.0008 0.0007 0.0009 0.0008 0.0008 0.0008 0.0009 0.0008 0.0008 0.0009
            Time(s): 1.2867 5.363 6.7498 9.2292 1.4791 13.8573 0.047 2.3333 1.7205 2.3379 4.9705 0.9686
Pendigit    MSE:     0.0069 0.0085 0.0077 0.0073 0.0079 0.0071 0.0071 0.0069 0.0071 0.007 0.007 0.0969
            Time(s): 0.814 2.3004 2.8268 3.6743 7.2133 0.3607 0.018 2.3571 1.969 2.3645 1.6015 1.3866
Wdbc        MSE:     0.0138 0.0152 0.0138 0.0138 0.0231 0.0206 0.0206 0.0188 0.0243 0.0205 0.0192 0.0205
            Time(s): 0.6232 1.2157 1.3701 1.6524 0.2455 0.0057 0.0044 2.3226 1.6766 2.3566 0.017 0.1047
Yeast       MSE:     0.0051 0.0062 0.0057 0.0055 0.0088 0.0055 0.0053 0.0052 0.0054 0.0052 0.0054 0.0058
            Time(s): 0.6999 1.6819 1.9743 2.4665 7.0087 0.0898 0.016 2.3747 1.9684 2.3955 1.0383 0.2853
P. I. D     MSE:     0.0438 0.0474 0.0431 0.0431 0.1013 0.0536 0.0536 0.0525 0.0541 0.0574 0.0537 0.0512
            Time(s): 0.6353 1.2327 1.3823 1.698 0.2842 0.008 0.0056 2.337 1.6765 2.3256 0.0762 0.4109
Olitos      MSE:     0.0153 0.017 0.0156 0.0153 0.051 0.0162 0.0167 0.0162 0.0161 0.0161 0.0161 0.016
            Time(s): 0.5645 1.2042 1.3552 1.6307 1.3999 0.0101 0.0065 2.337 1.8621 2.3347 0.0259 0.5152
Heart       MSE:     0.0352 0.0387 0.0352 0.0352 0.0574 0.0358 0.0358 0.0354 0.0359 0.0356 0.0358 0.0359
            Time(s): 0.6091 1.1922 1.342 1.6161 0.3494 0.0052 0.0057 2.3168 1.6736 2.3149 0.0408 0.3544
Ionosphere  MSE:     0.081 0.0891 0.081 0.081 0.0959 0.081 0.081 0.0811 0.081 0.081 0.081 0.0813
            Time(s): 0.6061 1.2029 1.3576 1.6325 0.2635 0.0053 0.0052 2.3253 1.6762 2.3317 0.0175 0.4333
Movement    MSE:     0.0055 0.0081 0.0072 0.006 0.0094 0.0063 0.0062 0.0055 0.0058 0.0057 0.0056 0.0057
Libras      Time(s): 0.6293 1.335 1.5315 1.8818 15.938 0.0252 0.01 2.3821 2.1349 2.3865 0.1281 0.8592
Spambase    MSE:     0.0078 0.0087 0.0079 0.0079 0.0098 0.01 0.01 0.01 0.0099 0.0097 0.01 0.0091
            Time(s): 0.6714 1.4971 1.7339 2.1458 0.4771 0.079 0.0092 2.3608 1.684 2.405 0.6408 0.541
Waveform    MSE:     0.0114 0.0122 0.0111 0.0111 0.0158 0.0114 0.0114 0.0116 0.0114 0.0115 0.0114 0.0114
            Time(s): 0.7298 1.7551 2.0672 2.5838 0.9814 0.2015 0.014 2.3478 1.718 2.3255 1.0042 0.6661
a1          MSE:     0.0057 0.0088 0.0075 0.0066 0.0067 0.0077 0.0058 0.0057 0.0058 0.0059 0.0058 0.0058
            Time(s): 0.9649 3.0271 3.6953 4.707 14.172 0.4251 0.0481 2.3947 2.3222 2.4029 2.984 1.027
s1          MSE:     0.0094 0.0131 0.0114 0.0104 0.0102 0.0104 0.0108 0.0096 0.0098 0.0098 0.0114 0.0103
            Time(s): 1.2519 3.7156 4.5436 5.7588 13.730 0.6793 0.06 2.3753 2.1462 2.6723 4.8069 1.4703
3.3 Evaluation of the estimation method for determining the number of clusters

This section evaluates the performance of the proposed method to estimate the number of clusters. The proposed estimation method is used to calculate the number of clusters for different datasets, and the results are compared with the numbers determined by the silhouette, Calinski-Harabasz (Caliński and Harabasz, 1974), Davies-Bouldin (Davies and Bouldin, 1979) and Gap (Tibshirani, Walther, and Hastie, 2000) methods. The results are provided in TABLE VIII.
TABLE VIII
Estimated number of clusters and running time (sec) for different estimation methods

            Number    Proposed Method     Silhouette +        Calinski-Harabasz   Davies-Bouldin +    Gap + kmeans
            of real                       kmeans              + kmeans            kmeans
Dataset     clusters  Est.   Time(s)      Est.   Time(s)      Est.   Time(s)      Est.   Time(s)      Est.   Time(s)
Ionosphere
Newthyroid  3         3      1.186        4      1.5288       3      3.9329       3      4.3865       47     24.276
Pendigit    10        10     9.9373       15     56.519       3      40.15        8      42.363       50     76.589
P. I. D     2         2      2.3712       2      6.5364       3      11.640       3      12.964       42     33.739
s1          15        15     10.9825      13     53.430       16     22.457       13     21.579       15     71.095
Spambase    2         2      23.3221      2      565.50       30     353.06       2      380.40       49     160.60
Wine        3         3      0.9984       3      1.0764       37     2.3514       7      2.5995       39     24.542
Wdbc        2         2      2.2308       6      5.0544       40     5.9540       4      6.4251       50     32.395
Heart       2         2      4.3531       2      28.18        2      20.243       48     16.373       2      29.309
Shuttle     7         5      65.6080      2      741.52       2      423.87       2      497.51       10     819.32
Waveform    3         3      24.6797      2      795.24       2      72.841       2      96.211       8      186.56
Olitos      4         2      1.0713       2      4.0011       2      1.2951       44     1.3393       50     27.393
Yeast       10        3      1.9232       3      46.698       3      2.3961       5      3.4387       47     39.521
The comparison shows that the proposed estimation method is both considerably faster and more accurate than the other estimation methods.
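The silhouette-style selection evaluated above can be sketched as follows: cluster the data for each candidate k and keep the k with the highest mean silhouette score. The sketch below uses the standard silhouette coefficient (Rousseeuw, 1987) rather than the paper's modified variant, and a toy clusterer of our own:

```python
import numpy as np

def kmeans(X, K, iters=30):
    """Tiny Lloyd's K-means with deterministic farthest-first seeding
    (a stand-in clusterer, not the paper's transformed K-means)."""
    C = [X[0]]
    for _ in range(K - 1):
        # Pick the point farthest from all centroids chosen so far
        d2 = ((X[:, None] - np.array(C)[None]) ** 2).sum(-1).min(1)
        C.append(X[d2.argmax()])
    C = np.array(C, dtype=float)
    for _ in range(iters):
        labels = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        for k in range(K):
            if (labels == k).any():
                C[k] = X[labels == k].mean(0)
    return labels

def mean_silhouette(X, labels):
    """Standard mean silhouette coefficient computed from pairwise distances."""
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    s = []
    for i in range(len(X)):
        same = labels == labels[i]
        a = D[i, same].sum() / max(same.sum() - 1, 1)  # mean intra-cluster distance
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        s.append((b - a) / max(a, b))
    return float(np.mean(s))

# Toy data: three well-separated blobs, so the right answer is k = 3
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 0.4, size=(40, 2)) for c in [(0, 0), (6, 6), (12, 0)]])
scores = {k: mean_silhouette(X, kmeans(X, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(best_k)
```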
4. Conclusion

Clustering provides a knowledge acquisition method for intelligent applications to develop rule-based expert systems. This paper proposes an improved version of the K-means clustering algorithm named transformed K-means. The proposed clustering method is a combination of a new initialization technique, the K-means algorithm and a new gradual data transformation approach that presents more accurate clustering results on real datasets when compared to other K-means-based algorithms. By selecting initial centroids that are closer to the optimum centroid locations, the proposed initialization approach solves the limitation of the methods in (Lloyd, 1982), (Dunn, 1973) and (Bezdek et al., 1984) to properly initialize the K-means clustering. The inverse transformation gradually moves the artificial data back to their original places. During this process, the clustering centroids are updated after any change in the data structure. This provides more optimal clustering results for both synthetic and real datasets and addresses the drawback of the models in (Arthur and Vassilvitskii, 2007) and (Mahesh Kumar and Rama Mohan Reddy, 2016). In addition, the proposed data transformation solves the problem of empty cluster generation associated with the K-means clustering method and its improved versions. An efficient method based on the principal component transformation and a modified silhouette algorithm is also proposed in this paper to determine the number of clusters for cases where the number is not specified in advance (Arthur and Vassilvitskii, 2007), (Chen et al., 2016) and (Silva Filho et al., 2015). Finally, the proposed clustering method addresses the time and computational burden associated with the models in (Kwedlo, 2011) and (Malinen et al., 2014).
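A minimal sketch of the gradual-transformation idea summarized above (following Malinen et al., 2014, on which the method builds): cluster an easy artificial structure first, then move the data back toward its original positions in small steps, updating the centroids after every step. The artificial structure and the initialization here are simplified assumptions of ours, not the paper's exact procedure.

```python
import numpy as np

def lloyd(X, C, iters=20):
    """A few Lloyd iterations warm-started from centroids C."""
    for _ in range(iters):
        labels = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        for k in range(len(C)):
            if (labels == k).any():
                C[k] = X[labels == k].mean(0)
    return labels, C

def transformed_kmeans(X, K, steps=20, seed=0):
    """Gradual data transformation: cluster an artificial structure,
    then move the data back to its original places step by step."""
    rng = np.random.default_rng(seed)
    # Artificial starting structure: points spread evenly along a line
    order = np.argsort(X[:, 0])
    line = np.zeros_like(X)
    line[order, 0] = np.linspace(-1.0, 1.0, len(X))
    C = line[rng.choice(len(X), K, replace=False)].copy()
    labels, C = lloyd(line, C)
    for s in range(1, steps + 1):
        Xs = line + (s / steps) * (X - line)  # inverse transform, one step back
        labels, C = lloyd(Xs, C)              # update centroids after each change
    return labels, C

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in [(-1, 0), (1, 0)]])
labels, C = transformed_kmeans(X, K=2)
```

Because the centroids follow the data through every intermediate step, no cluster is ever left stranded far from the points, which is the intuition behind the empty-cluster guarantee discussed above.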
Several experiments were performed to evaluate: 1) the proposed method for dealing with the empty cluster generation; 2) the proposed transformed K-means clustering algorithm; and 3) the estimation method for determining the number of clusters. For the first experiment, three real time-series datasets of solar radiation at the Ames, Chariton and Calmar stations were used to calculate the number of empty clusters (N. E. C) generated by the K-means algorithm and the proposed method. The results demonstrated that the proposed method significantly reduces the N. E. C compared to the K-means algorithm. The empty cluster problem was then completely solved by the proposed data transformation approach, which guarantees the convergence of the algorithm.
For the second experiment, K-means clustering was run with different initialization methods including the random-based, K-means*-based, K-means++-based and the proposed initialization method. Simulation results showed that the proposed initialization method improves the accuracy of the K-means algorithm when compared to the other initialization methods. However, the computational complexity was increased due to the data sorting performed by our initialization to optimally select the initial centroids. The performance of the proposed transformed K-means was also evaluated using several different real datasets and compared with different variants of K-means clustering as well as the SOM, SOM++, FCM, K-medoids and GTSOM clustering algorithms. The comparison demonstrated the improved clustering quality of the proposed transformed K-means. The proposed transformed K-means algorithm provided a faster processing time compared to the ABC-K-means, DE-K-means, GA-K-means, SOM, GTSOM, and SOM++ and competed with FCM, K-medoids, K-means*, and K-means++. For small- and medium-sized data, our proposed method was shown to be generally more time consuming than K-medoids, K-means*, and K-means++. However, it converged faster than K-means* and K-means++ for large data volumes.
For the third experiment, the proposed estimation method was evaluated and compared with other estimation techniques to determine the number of clusters. The comparison showed that, while working much faster, the proposed estimation method was more accurate than the other estimation methods.
Acknowledgment

The authors would like to thank Prof. P. Fränti and Mr. M. Malinen for their valuable technical advice.
References

Abdul Nazeer, K. A., & Sebastian, M. P. (2009) Improving the accuracy and efficiency of the K-means clustering algorithm. Proceedings of the World Congress on Engineering, London, UK.
Arthur, D., & Vassilvitskii, S. (2007) K-means++: the advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms, New Orleans, Louisiana, Society for Industrial & Applied Mathematics, pp 1027-1035.
Bahmani, B., Moseley, B., Vattani, A., Kumar, R., & Vassilvitskii, S. (2012) Scalable K-means++. Proc. VLDB Endow., vol. 5, no. 7, pp 622-633.
Bezdek, J. C., Ehrlich, R., & Full, W. (1984) FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, vol. 10, no. 2-3, pp 191-203.
Caliński, T., & Harabasz, J. (1974) A dendrite method for cluster analysis. Communications in Statistics, vol. 3, no. 1, pp 1-27.
Celebi, M. E., Kingravi, H., & Vela, P. A. (2013) A comparative study of efficient initialization methods for the K-means clustering algorithm. Expert Systems with Applications, vol. 40, no. 1, pp 200-210.
Chen, M., Li, L., Wang, B., Cheng, J., Pan, L., & Chen, X. (2016) Effectively clustering by finding density backbone based-on kNN. Pattern Recognition, vol. 60, pp 486-498.
Chen, X. (2015) A new clustering algorithm based on near neighbor influence. Expert Systems with Applications, vol. 42, pp 7746-7758.
Cui, X., Zhu, P., Yang, X., Li, K., & Ji, C. (2014) Optimized big data K-means clustering using MapReduce. The Journal of Supercomputing.
Davies, D. L., & Bouldin, D. W. (1979) A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, no. 2, pp 224-227.
Dogan, Y., Birant, D., & Kut, A. (2013) SOM++: Integration of self-organizing map and K-means++ algorithms. Machine Learning and Data Mining in Pattern Recognition, P. Perner (Ed.), Springer Berlin Heidelberg, vol. 7988, pp 246-259.
Dunn, J. C. (1973) A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, vol. 3, no. 3, pp 32-57.
Herbert, J., & Yao, J. (2007) GTSOM: Game theoretic self-organizing maps. Trends in Neural Computation, K. Chen and L. Wang (Eds.), Springer Berlin Heidelberg.
http://cs.uef.fi/sipu/datasets
http://mesonet.agron.iastate.edu
https://archive.ics.uci.edu/ml/datasets
Kaufman, L., & Rousseeuw, P. J. (1987) Clustering by means of medoids. In Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods, North-Holland, pp 405-416.
Kohonen, T. (1990) The self-organizing map. Proceedings of the IEEE, vol. 78, no. 9, pp 1464-1480.
Krishna, K., & Murty, M. N. (1999) Genetic K-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 29, no. 3, pp 433-439.
Kwedlo, W. (2011) A clustering method combining differential evolution with the K-means algorithm. Pattern Recognition Letters, vol. 32, no. 12, pp 1613-1621.
Likas, A., Vlassis, N., & Verbeek, J. J. (2003) The global K-means clustering algorithm. Pattern Recognition, vol. 36, no. 2, pp 451-461.
Liu, H., & Ban, X.-j. (2015) Clustering by growing incremental self-organizing neural network. Expert Systems with Applications.
Lloyd, S. P. (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory, vol. 28, no. 2, pp 129-136.
Mahesh Kumar, K., & Rama Mohan Reddy, A. (2016) A fast DBSCAN clustering algorithm by accelerating neighbor searching using Groups method. Pattern Recognition, vol. 58, pp 39-48.
Malinen, M., Mariescu-Istodor, R., & Fränti, P. (2014) K-means*: Clustering by gradual data transformation. Pattern Recognition, vol. 47, no. 10, pp 3376-3386.
Markic, B., & Tomic, D. (2010) Marketing intelligent system for customer segmentation. In J. Casillas & F. J. Martínez-López (Eds.), Marketing Intelligent Systems Using Soft Computing: Managerial and Research Applications, Springer Berlin Heidelberg.
Rousseeuw, P. J. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, vol. 20, pp 53-65.
Shuliang, L., Barry, D., Edwards, J., Kinman, R., & Duan, Y. (2002) Integrating group Delphi, fuzzy logic and expert systems for marketing strategy development: the hybridisation and its effectiveness. Marketing Intelligence & Planning, vol. 20, no. 5, pp 273-284.
Silva Filho, T. M., Pimentel, B. A., Souza, R. M. C. R., & Oliveira, A. L. I. (2015) Hybrid methods for fuzzy clustering based on fuzzy c-means and improved particle swarm optimization. Expert Systems with Applications, vol. 42, pp 6315-6328.
Tibshirani, R., Walther, G., & Hastie, T. (2000) Estimating the number of data clusters via the Gap statistic. Journal of the Royal Statistical Society: Series B.
Yang, B.-r., Li, H., & Qian, W.-b. (2012) The cognitive-base knowledge acquisition in expert system. In H. Tan (Ed.), Technology for Education and Learning, Springer Berlin Heidelberg, pp 73-80.
Zhang, C., Ouyang, D., & Ning, J. (2010) An artificial bee colony approach for clustering. Expert Systems with Applications, vol. 37, pp 4761-4767.
Biographies

Rasool Azimi received his B.Sc. degree in Software Engineering from Mehrastan University, Guilan, Iran, in 2011 and the M.Sc. degree from Science and Research Branch, Islamic Azad University, Qazvin, Iran, in 2014. His research interests include distributed data mining, data clustering, artificial intelligence and their applications in power systems.
Mohadeseh Ghayekhloo received her B.Sc. degree in Computer Engineering from Mazandaran University of Science and Technology, Babol, Iran, and the M.Sc. degree from Science and Research Branch, Islamic Azad University, Qazvin, Iran, in 2011 and 2014, respectively. Her research interests include optimization algorithms, artificial neural networks, and computational intelligence.
Mahmoud Ghofrani received his B.Sc. degree in Electrical Engineering from Amirkabir University of Technology, Tehran, Iran, in 2005, the M.Sc. degree from University of Tehran, Tehran, Iran, in 2008, and the Ph.D. degree from the University of Nevada, Reno, in 2014. He is currently an Assistant Professor at the School of Science, Technology, Engineering and Mathematics, University of Washington, Bothell. His research interests include power systems operation and planning, renewable energy systems, and smart grids.
Hedieh Sajedi received her B.Sc. degree in Computer Engineering from Amirkabir University of Technology in 2003, and the M.Sc. and Ph.D. degrees in Computer Engineering (Artificial Intelligence) from Sharif University of Technology, Tehran, Iran, in 2006 and 2010, respectively. She is currently an Assistant Professor at the Department of Computer Science, Tehran University, Iran. Her research interests include multimedia data hiding, steganography and steganalysis methods, pattern recognition, and machine learning.