Accepted Manuscript
A novel clustering algorithm based on data transformation
Rasool Azimi , Mohadeseh Ghayekhloo , Mahmoud Ghofrani , Hedieh Sajedi
PII: S0957-4174(17)30034-9
DOI: 10.1016/j.eswa.2017.01.024
Reference: ESWA 11072
To appear in: Expert Systems With Applications
Received date: 28 November 2015
Revised date: 29 October 2016
Accepted date: 24 January 2017
Please cite this article as: Rasool Azimi, Mohadeseh Ghayekhloo, Mahmoud Ghofrani, Hedieh Sajedi, A novel clustering algorithm based on data transformation, Expert Systems With Applications (2017), doi: 10.1016/j.eswa.2017.01.024
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Highlights
A new initialization technique is proposed to improve the performance of K-means.
A data transformation approach is proposed to solve the empty cluster problem.
An efficient method is proposed to estimate the optimal number of clusters.
The proposed clustering method provides more accurate clustering results.
A Novel Clustering Algorithm Based on Data Transformation
Rasool Azimi (a) r.azimi@qiau.ac.ir, Mohadeseh Ghayekhloo (b) m.ghayekhloo@qiau.ac.ir, Mahmoud Ghofrani (c,*) mghofrani@uwb.edu, Hedieh Sajedi (d) hhsajedi@ut.ac.ir
(a) Computer and Information Technology Engineering, Qazvin Branch, Islamic Azad University, Qazvin, Iran.
(b) Young Researchers and Elite Club, Qazvin Branch, Islamic Azad University, Qazvin, Iran.
(c) School of Science, Technology, Engineering and Mathematics (STEM), University of Washington, Bothell, USA.
(d) Department of Computer Science, College of Science, University of Tehran, Tehran, Iran.
*Corresponding author: UWBB room 227, 18807 Beardslee Blvd, Bothell, WA 98011, USA Fax number: 425.352.3775
Abstract— Clustering provides a knowledge acquisition method for intelligent systems. This paper proposes a novel data clustering algorithm that combines a new initialization technique, the K-means algorithm and a new gradual data transformation approach to provide more accurate clustering results than the K-means algorithm and its variants by increasing the clusters' coherence. The proposed data transformation approach solves the problem of generating empty clusters, which frequently occurs in other clustering algorithms. An efficient method based on the principal component transformation and a modified silhouette algorithm is also proposed in this paper to determine the number of clusters. Several different datasets are used to evaluate the efficacy of the proposed method in dealing with the empty cluster generation problem, and its accuracy and computational performance in comparison with other K-means based initialization techniques and clustering methods. The developed estimation method for determining the number of clusters is also evaluated and compared with other estimation algorithms. The significance of the proposed method includes addressing the limitations of K-means based clustering and improving the accuracy of clustering as an important method in the field of data mining and expert systems. Application of the proposed method for knowledge acquisition in time series data such as wind, solar, electric load and stock market series provides a preprocessing tool to select the most appropriate data to feed into neural networks or other estimators used for forecasting such time series. In addition, utilization of the knowledge discovered by the proposed K-means clustering to develop rule-based expert systems is one of the main impacts of the proposed method.

Index Terms— Data mining, clustering, K-means, data transformation, silhouette, transformed K-means
1. Introduction

Expert systems are computer applications that contain stored knowledge and are developed to solve problems in a specific field in almost the same way a human expert would (Shuliang et al., 2002). Acquisition of expert knowledge is a challenge in developing such expert systems (Yang et al., 2012). One of the major problems and most difficult tasks in developing rule-based expert systems is representing the knowledge discovered by data clustering (Markic and Tomic, 2010). The K-means algorithm is one of the most commonly used clustering techniques, which uses the data reassignment method to repeatedly optimize clustering (Lloyd, 1982). The main goal of clustering is to generate compact groups of objects or data that share similar patterns within the same cluster, and to isolate these groups from those containing elements with different characteristics.

Although the K-means algorithm has features such as simplicity and high convergence speed, it is totally dependent on the initial centroids, which are randomly selected in the first phase of the algorithm. Due to this random selection, the algorithm may converge to locally optimal solutions (Celebi et al., 2013). Different variants of the K-means algorithm have been proposed to address this limitation. The K-medoids algorithm was proposed in (Kaufman and Rousseeuw, 1987) to define each cluster by the most central medoid in which it is located. First, K data points are considered as initial centroids (medoids) and each data point is assigned to the closest medoid, forming the initial clusters. In an iteration-based process, the most central data point in each cluster is considered as the new centroid and each data point is assigned to the nearest centroid. The remaining steps of this algorithm match the K-means procedure. Fuzzy C-means (FCM) clustering introduced the partial membership concept (Dunn, 1973), (Bezdek et al., 1984). In fact, in the FCM algorithm, each data point belongs to all clusters; the degree of belonging is represented by a partial membership determined by a fuzzy clustering matrix. A genetic algorithm-based K-means (GA K-means) was proposed in (Krishna and Murty, 1999) to provide a global optimum for the clustering. In this method, the K-means algorithm was used as a search operator instead of crossover. A biased mutation operator was also proposed for clustering to help the K-means algorithm avoid local minima. The global K-means algorithm was developed in (Likas et al., 2003) to provide an experimentally optimal solution for clustering problems. However, it is not appropriate for clustering medium-sized and large-scale datasets due to its heavy computational burden. The K-means++ initialization algorithm was proposed in (Arthur and Vassilvitskii, 2007) for obtaining an initial set of centroids that is near-optimal. The main drawback of K-means++ is its inherent sequential nature, which limits the effectiveness of the method for high-volume data. An artificial bee colony K-means (ABC K-means) clustering approach was proposed in (Zhang et al., 2010) for optimal partitioning of data objects into a fixed number of clusters. A hybrid of differential evolution and K-means algorithms named DE-K-means was introduced in (Kwedlo, 2011). The differential evolution algorithm was used as a global optimization method and the resultant clustering solutions were fine-tuned and corrected using the K-means algorithm. Dogan et al. proposed a hybrid of K-means++ and the self-organizing map (SOM) (Kohonen, 1990) to improve clustering accuracy. It first uses the K-means++ initialization method to determine the initial weight values and the starting points, and then uses SOM to find an appropriate final clustering solution. However, the aforementioned limitation of K-means++ was not addressed. A new clustering technique using a combination of the global K-means algorithm and the topology neighborhood based on Axiomatic Fuzzy Set (AFS) theory was developed in (Wang et al., 2013) to determine initial centroids. A new clustering algorithm, named K-means*, was presented in (Malinen et al., 2014) that generates an artificial dataset X* as the input data. The input data are then mapped one-by-one to the generated artificial data (X → X*). Next, the inverse transformation of the artificial data to the original data is performed by a series of gradual transformations. To do so, the K-means algorithm updates the clustering model after each transformation and moves
the data vectors slowly to their original positions. The K-means* algorithm uses a random data swapping strategy to deal with the problem of generating empty clusters. However, the random selection of the data vectors as the cluster centroids may reduce the other clusters' coherence and decrease the efficiency of the K-means* algorithm. Moreover, the convergence rate of the K-means* algorithm reduces significantly as the number of clusters increases, especially with increasing data volumes. Density-based clustering methods were proposed in (Mahesh Kumar and Rama Mohan Reddy, 2016) to speed up the neighbor search for clustering spatial databases with noise. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) provides a graph-based index structure for high-dimensional data with a large amount of noise. It was shown that the running time of the proposed method is faster than DBSCAN with exactly the same clustering results. The proposed method solved the inefficacy of the DBSCAN method in handling clusters with large differences in densities. A novel clustering algorithm named CLUB (CLUstering based on Backbone) was developed in (Chen et al., 2016) to determine optimal clusters. First, the algorithm detects the initial clusters and finds their density backbones. Then, the algorithm finds the outliers in each cluster based on the K-nearest-neighbour (KNN) method. Finally, by assigning each unlabeled point to the cluster with the nearest higher-density neighbour, the algorithm yields the final clusters. CLUB has several drawbacks: the KNN method lacks an efficient algorithm to determine the value of the parameter K (the number of nearest neighbors), and the computational cost of this method is too high because it requires calculating the distance of each query instance with respect to all training samples. Two particle swarm optimization (PSO) based fuzzy clustering methods were proposed in (Silva Filho et al., 2015) to deal with the shortcomings of the PSO algorithms used for fuzzy clustering. The proposed methods adjust the parameters of PSO dynamically to achieve a balance between exploration and exploitation and to avoid trapping in local optima. These methods lack precision for high-dimensional applications. In addition, their iterative process significantly decreases the convergence rate. Generally, the speed at which a convergent sequence approaches its limit is defined as the rate of convergence. Three clustering algorithms named Near Neighbor Influence (CNNI), an improved version of the time cost of Near Neighbor Influence (ICNNI), and a variation of Near Neighbor Influence (VCNNI) were presented in (Chen, 2015). The clustering results showed that ICNNI is faster than CNNI, and that CNNI requires less space than VCNNI. These methods suffer from large-scale computing and storage requirements. A growing incremental self-organizing neural network (GISONN) was developed in (Liu and Ban, 2015) to select appropriate clusters by learning the data distribution of each cluster. The proposed method is, however, not applicable to large-volume or high-dimensional datasets due to its computational complexity. In addition, the neighborhood-preserving feature of the algorithm is violated when the output space topology does not match the structure of the data in the input space.

In spite of the improved performance of the K-means variants for synthetic datasets with Gaussian distributions, their performance on real datasets is neither very promising nor different from the original K-means algorithm. In addition, all K-means based algorithms lack an efficient method to determine the optimal number of clusters. This requires the user to determine the number of clusters either arbitrarily or based on practical and experimental estimates, which might not be optimal.
In this paper, we propose a novel clustering approach called transformed K-means to provide more accurate clustering results compared with the K-means algorithm and its improved versions. The proposed clustering method combines a new initialization technique, the K-means algorithm and a new gradual data transformation approach to appropriately select the initial cluster centroids and move the real data into the locations of the initial cluster centroids that are closest to the actual positions of the associated data. By doing this, the data are placed in an artificial structure that properly initiates the K-means clustering. The inverse transformation is then performed to gradually move the artificial data back to their original places. During this process, K-means updates the clustering centroids after any change in the data structure. This provides more optimal clustering results for both synthetic and real datasets. In addition, the proposed data transformation solves the empty cluster problem of the K-means algorithm and its improved versions. An efficient method based on the principal component transformation and a modified silhouette algorithm is also proposed in this paper to determine the optimal number of clusters for K-means algorithms.

The proposed clustering method develops a rule-based expert system by means of knowledge acquisition through data transformation. The significance of the proposed method includes addressing the limitations of K-means based clustering and improving the accuracy of clustering as an important method in the field of data mining and expert systems. The proposed method can be used for intelligent system applications such as forecasting time series including solar, wind, load and stock market series.
Contributions of the paper are outlined as follows:
1. A new initialization technique is proposed to select initial centroids which are closer to the optimum centroids’ locations.
2. A novel gradual data transformation approach is proposed to significantly reduce the number of empty clusters generated by K-means based algorithms.
3. An efficient method is proposed to estimate the optimal number of clusters.
4. A hybrid clustering algorithm is developed by combining the proposed initialization, data transformation and cluster-number estimation to provide better knowledge discovery of the input patterns and more accurate clustering results.
The rest of the paper is organized as follows. Section 2 provides a brief description of the K-means algorithm. It also explains the
proposed clustering method. Section 3 demonstrates a case study where the performance of the developed clustering method is
evaluated by several experiments. Finally, Section 4 concludes the paper.
2. Methodology
2.1 K-means algorithm

The K-means algorithm (Lloyd, 1982) is a well-known, low-complexity algorithm utilized for data partitioning. The algorithm starts running once the number of clusters K is given as input, and outputs the cluster centroids through iterations. Let X = [x_1, ..., x_n] be the set of n points to be grouped into K different cluster (partition) sets C = {c_p}, p = 1, 2, ..., K. By means of the Euclidean distance, the algorithm assigns each data point to its closest centroid c_p, calculated by:

c_p = \frac{1}{n_p} \sum_{i=1}^{n_p} x_i^{(p)}     (1)

where x_i^{(p)} is the ith data point in cluster p, and n_p is the number of data points in the respective cluster. After the first run, the algorithm calculates the mean of the data points in each cluster c_p and selects this value as the new cluster centroid, starting a new iteration. As new clusters are formed, new mean values are obtained. The algorithm halts once the sum of the squared errors over the K clusters is minimized (Cui et al., 2014).
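The Lloyd iteration described above can be sketched as follows. This is an illustrative `kmeans` helper written for this text, not the authors' implementation; the toy data and variable names are ours.

```python
import numpy as np

def kmeans(X, centroids, max_iter=100):
    """Plain Lloyd iteration: assign every point to its nearest centroid,
    then recompute each centroid as the mean of its cluster (Eq. 1)."""
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)   # n x K distances
        labels = d.argmin(axis=1)
        new_c = np.array([X[labels == p].mean(axis=0) if (labels == p).any()
                          else centroids[p] for p in range(len(centroids))])
        if np.allclose(new_c, centroids):   # centroids stable -> SSE minimized
            break
        centroids = new_c
    return centroids, labels

# two obvious groups; the centroids converge to the group means
X = np.array([[0., 0.], [0., 1.], [1., 0.], [10., 10.], [10., 11.], [11., 10.]])
c, lab = kmeans(X, X[[0, 3]].copy())
```

Here the initial centroids are passed in explicitly, which is exactly the hook the proposed initialization of Section 2.2.A plugs into.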
2.2 The Proposed Clustering Method

An improved version of the K-means algorithm, named transformed K-means, is proposed in this section. The proposed clustering algorithm uses a combination of a new technique to select the initial cluster centroids and a new approach for the reverse transformation of the data to enhance the clustering performance. The steps of the transformed K-means algorithm are as follows:
A. Initial centroids selection

Let X = [x_1, ..., x_n] be a set of n data. The selection of the K initial centroids proceeds as follows:

1) Remove duplicate data vectors and store the result in a new dataset X' = [(x'_1, r_1), ..., (x'_m, r_m)], where r_i is the repetition number of each non-repetitive data vector x'_i in the new dataset X' (1 ≤ i ≤ m ≤ n).

2) Sort the data vectors in the dataset X' in ascending order based on the Euclidean length of the vectors.

3) Divide the dataset X', consisting of m data, into K sub-datasets with (at most) S = ⌈m/K⌉ data each, according to Eq. (2), such that the data elements of X' are distributed among the sub-datasets X'_1 to X'_K:

X'_1 = [(x'_1, r_1), \ldots, (x'_S, r_S)],
X'_2 = [(x'_{S+1}, r_{S+1}), \ldots, (x'_{2S}, r_{2S})],
X'_3 = [(x'_{2S+1}, r_{2S+1}), \ldots, (x'_{3S}, r_{3S})],
\ldots
X'_K = [(x'_{(K-1)S+1}, r_{(K-1)S+1}), \ldots, (x'_{KS}, r_{KS})].     (2)

where r_i is the repetition number for the ith data vector.

4) We now have K sub-datasets, each of which is used to determine exactly one of the K initial centroids. Eq. (3) is used to calculate a weight attribute w(x'_i) for each data entry x'_i with repetition number r_i in each of the K sub-datasets {X'_1, X'_2, ..., X'_K}:

w(x'_i)_m = \frac{1}{S} \sum_{j=1}^{S} \mathrm{dist}(x'_i, x'_j) \cdot r_i, \quad (1 \le m \le K)     (3)

where w(x'_i)_m is the weight attribute of x'_i in the mth sub-dataset.

5) In each of the K sub-datasets, the data entry with the highest weight attribute is selected as the initial centroid.

Fig. 1 shows the flowchart of our proposed method for selecting initial centroids.

Fig. 1. Flowchart of the proposed method for the initial centroids selection
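The five steps above can be sketched as follows. This is an illustrative helper, not the authors' code, and it assumes one reading of the extraction-damaged Eq. (3): that the weight multiplies the average distance to the other members of the sub-dataset by the repetition count r_i.

```python
import numpy as np

def initial_centroids(X, K):
    """Sketch of Section 2.2.A, assuming the Eq. (3) weight is the average
    intra-subset distance scaled by the repetition count r_i."""
    # 1) remove duplicates, keeping a repetition count r_i per unique vector
    uniq, counts = np.unique(X, axis=0, return_counts=True)
    # 2) sort the unique vectors by Euclidean length
    order = np.argsort(np.linalg.norm(uniq, axis=1))
    uniq, counts = uniq[order], counts[order]
    # 3) split into K consecutive sub-datasets of at most S = ceil(m/K) points
    S = int(np.ceil(len(uniq) / K))
    centroids = []
    for k in range(K):
        sub, r = uniq[k * S:(k + 1) * S], counts[k * S:(k + 1) * S]
        # 4) weight each entry: mean distance within the sub-dataset, times r_i
        d = np.linalg.norm(sub[:, None] - sub[None], axis=2)
        w = d.mean(axis=1) * r
        # 5) the highest-weight entry becomes the k-th initial centroid
        centroids.append(sub[w.argmax()])
    return np.array(centroids)

X = np.array([[0., 0.], [0., 0.], [1., 1.], [9., 9.], [10., 10.]])
C = initial_centroids(X, 2)   # one centroid from each half of the norm-sorted data
```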
B. Inverse transformation

The inverse data transformation approach was first used in (Malinen et al., 2014) to solve the problems associated with the K-means clustering algorithm. However, that approach has a number of shortcomings, such as finding a suitable artificial data structure, performing the mapping, and controlling the inverse transformations. The algorithm cannot generally guarantee an optimal solution. This was demonstrated by the clustering results of (Malinen et al., 2014), where in some cases the data transformation led to the deviation of the data towards incorrect cluster centroids. For the inverse transformation of data, we first generate artificial data X* as the input data, of the same size (n) and dimension (d). This divides the data vectors into K distinct clusters without any fluctuations. Then we define a one-to-one mapping of the input data to the artificial data (X → X*).

The inverse data transformation approach of (Malinen et al., 2014) uniformly distributes the initial cluster centroids along a line in the artificial structure. This is given by Eq. (4).
X = [x_1, \ldots, x_n], \quad X^* = [x^*_1, \ldots, x^*_n], \quad \mathrm{initC} = [c_{01}, \ldots, c_{0K}],
x^*_i = \mathrm{RandomSample}(\mathrm{initC}), \quad (0 < i \le n)     (4)
This random placement may break the clustering structure, deviate the data to incorrect cluster centroids and, consequently, produce incorrect results. To address this problem, our proposed inverse data transformation approach places each data vector x_i at the location of the initial centroid c_0j (1 ≤ j ≤ K) that is closest to it in the artificial structure X*. This is given by Eq. (5).

x^*_i = c_{0j}, \quad j = \operatorname{ArgMin}_{1 \le j \le K} \mathrm{dist}(x_i, c_{0j}), \quad (1 \le i \le n)     (5)
A series of inverse transformations is then performed to gradually move the data elements to their real (original) positions. This inversely transfers the artificial data to the original data. During this process, K-means updates the cluster centroids of the transformed data. The cluster centroids calculated in each step are used as the initial cluster centroids for the next step. This process continues until the last step, whose results provide the final cluster centroids. The proposed procedure is outlined as follows.

First, each vector x_i is placed in the position closest to the initial centroid initC_l (1 ≤ l ≤ K) that has the minimum distance to the corresponding data. Next, the vectors gradually move back to their real positions. Generally, for a dataset X = [x_1, ..., x_n] of n data vectors, the gradual inverse transformation of the data to their real positions follows the steps below:

1) Sort the dataset X = [x_1, ..., x_n] in ascending order based on the Euclidean length of the vectors. Next, store the sorted data in a new dataset X' = [x'_1, ..., x'_n].
Fig. 2. Flowchart of the transformed K-means algorithm
2) To construct the artificial data structure X* = [x*_1, ..., x*_n] as the initial position of the data, place each initial centroid initC_l (1 ≤ l ≤ K) in the position of the data vectors of the dataset X' that are closer to that initial centroid (x'_i → initC_l) than to the (K-1) other initial centroids. This moves each real data vector into the location of the initial centroid that is closest to the actual position of the associated data in the artificial structure X*.

3) Displace all the real data vectors X = [x_1, ..., x_n] in random order and store them in the new dataset X'' = [x''_1, ..., x''_n].
4) Determine the distance between the initial artificial data (X*) and the displaced real data (X''), and store the distances in the set Dist'' = [dist''_1, dist''_2, ..., dist''_n]. Each element dist''_i represents the distance vector between the ith data vector (x*_i) in the artificial dataset X* and the position of the corresponding data (x''_i) in the dataset X''.

5) According to the number of steps given by the user (Steps > 1), divide each element of Dist'' = [dist''_1, ..., dist''_n] by the value of "Steps" and update the values of the elements of Dist''. This is given by Eq. (6).

\mathrm{Dist}'' = [(\mathrm{dist}''_1 / \mathrm{Steps}), \ldots, (\mathrm{dist}''_n / \mathrm{Steps})]     (6)
6) At each step of the inverse transformation process, all data points move towards their real locations as follows:

X^*_i = X^*_{i-1} + \mathrm{Dist}'', \quad (1 \le i \le \mathrm{Steps})     (7)

where X* is the position of the data in the artificial structure, i is the step number and Dist'' is the set of per-step distance vectors between the data positions in the artificial structure and the real data positions. Fig. 2 shows the flowchart for the proposed transformed K-means. Note that X*_1 denotes the initial data positions in the artificial structure (the initial artificial dataset). In the first step, the initial centroids (initC), calculated by the proposed initial centroid selection method, are fed to the K-means algorithm as inputs (initC = initC_0). After every inverse transformation, K-means is executed given the previous centroids (initC_{i-1}) along with the modified dataset (X*_i) as the input pattern. After completion of all steps (i = Steps), all data points are placed in their original locations and the final centroids (C_F) are calculated as the outputs. The proposed initialization approach of Section 2.2.A significantly reduces the chance of empty cluster generation by proper selection of the initial centroids, and the proposed data transformation approach completely solves the empty cluster problem during the data transformation process.
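The transformation loop can be sketched as follows. This is an illustrative reading of Section 2.2.B (function and variable names are ours, and the random displacement of step 3 is omitted for clarity): snap every point onto its nearest initial centroid, then move the points back in equal increments, re-running K-means after each increment.

```python
import numpy as np

def lloyd(X, c, iters=20):
    """Minimal K-means update used inside the sketch below."""
    for _ in range(iters):
        lab = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(axis=1)
        c = np.array([X[lab == p].mean(axis=0) if (lab == p).any() else c[p]
                      for p in range(len(c))])
    return c

def transformed_kmeans(X, init_c, steps, kmeans_fn):
    """Sketch of the gradual inverse transformation: build the artificial
    structure X*, then move it back toward X in `steps` increments
    (Eqs. 6-7), updating the centroids after each increment."""
    d = np.linalg.norm(X[:, None] - init_c[None], axis=2)
    x_star = init_c[d.argmin(axis=1)].astype(float)   # artificial structure X*
    dist = (X - x_star) / steps                       # Eq. (6): per-step move
    c = init_c.astype(float)
    for _ in range(steps):
        x_star = x_star + dist                        # Eq. (7): partial inverse transform
        c = kmeans_fn(x_star, c)                      # centroids of previous step reused
    return c

X = np.array([[0., 0.], [1., 0.], [0., 1.], [8., 8.], [9., 8.], [8., 9.]])
C = transformed_kmeans(X, np.array([[0., 0.], [8., 8.]]), steps=4, kmeans_fn=lloyd)
```

After the last step the points sit exactly at their original positions, so the returned centroids are those of the real data.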
2.3 Time complexity

The transformed K-means algorithm has a time complexity of order O((n log n) · K · s), where n is the total number of data, K is the number of clusters and s is the number of steps. More details of the time complexity of the proposed transformed K-means algorithm are given for its different phases in TABLE I.

TABLE I
Time complexity of the proposed transformed K-means algorithm.

Algorithm Phase                       Time complexity
Initialization                        O(n log n)
Data Transformation                   O(n log n)
K-means algorithm                     O(n · K)
Total (running in s steps, i > 1)     O((n log n) · K · s)
TABLE II provides the time complexity orders for the proposed method and well-known clustering algorithms including K-means*, K-means++, global K-means, original K-means, K-medoids, FCM, SOM, SOM++ and game-theoretic SOM (GTSOM) (Herbert and Yao, 2007).

TABLE II
Time complexity comparison of the proposed transformed K-means algorithm and several well-known clustering algorithms.

Algorithm               Time complexity
Transformed K-means     O((n log n) · K · s)
K-means*                O(n · K · s)
K-means++               O(n · K)
Global K-means          O(n^2 · K^2)
K-means                 O(n · K)
K-medoids               O(n^2 · K)
FCM                     O(n · K^2)
SOM                     O(n^2 · K)
GTSOM                   O(n^2 · K)
SOM++                   O(n^2 · K)
The time complexity comparison of TABLE II shows that the proposed transformed K-means algorithm is faster than SOM, GTSOM and global K-means, and competes with FCM and K-medoids. The time complexity of K-means and K-means++ is better than that of our proposed algorithm. However, as the data volume increases, the K-means++ algorithm may not be as efficient as our proposed method due to its sequential initialization (Bahmani et al., 2012). The proposed transformed K-means and K-means* algorithms have almost the same time complexity. However, the approach used to deal with the generation of empty clusters in the K-means* algorithm reduces the convergence rate nonlinearly as the data volume (n) and the number of clusters (K) increase. Consequently, our proposed clustering algorithm is generally faster than the K-means* algorithm.
2.4 Estimation of the number of clusters

K-means and many other clustering algorithms assume that the number of clusters is known in advance. In cases where the number of clusters is not predefined, an efficient method is required to determine the optimal number of clusters. In this section, we present a new method based on the silhouette approach proposed in (Rousseeuw, 1987) to estimate the number of clusters. The silhouette algorithm is as follows:

1) Cluster the input data using any clustering technique for each iteration m, (K_min ≤ m ≤ K_max).

2) Calculate the silhouette function S(i) for the input data:

S_i^m = \frac{b(i) - a(i)}{\max(a(i), b(i))}     (8)

where a(i) is the average distance between the ith data point (1 ≤ i ≤ n) and the other data in the same cluster, and b(i) is the lowest average distance of the ith data point from the data in the other K-1 clusters at the mth iteration.

3) Calculate S(m) as the average value of S at the mth iteration.

4) Select the iteration number (index) with the highest S as the estimated number of clusters.
The proposed method uses the principal component transformation to modify the silhouette algorithm. The proposed procedure is as follows:

4-1) Transform the input data using the Karhunen-Loeve Transform (KLT) method. The KLT method, also known as principal component analysis (PCA), is outlined below:

4-2) Let \varphi_k denote the eigenvector corresponding to the kth eigenvalue \lambda_k of the covariance matrix \Sigma_x:

\Sigma_x \varphi_k = \lambda_k \varphi_k, \quad k = 1, \ldots, N, \quad \mathrm{cov}(X_i, Y_j) = E[(X_i - \mu_i)(Y_j - \mu_j)]     (9)

4-3) Construct an N × N unitary matrix:

\Phi = [\varphi_1, \ldots, \varphi_N], \quad \Phi^{*T} \Phi = I, \quad \Phi^{-1} = \Phi^{*T}     (10)

4-4) Combine the N eigenvalue equations as follows:

\Sigma_x \Phi = \Phi \Lambda     (11)

where \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_N) is a diagonal matrix.

4-5) Multiply both sides of Eq. (11) by \Phi^{*T} = \Phi^{-1}:

\Phi^{*T} \Sigma_x \Phi = \Phi^{*T} \Phi \Lambda = \Lambda     (12)

4-6) Given the input data X, define the Karhunen-Loeve Transformation of X as follows:

Y = \Phi^{*T} X     (13)
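Steps 4-1 to 4-6 amount to an eigendecomposition of the data covariance. A minimal sketch with NumPy (an illustrative `klt` helper, not the authors' code; the data are synthetic):

```python
import numpy as np

def klt(X):
    """Karhunen-Loeve transform (Eqs. 9-13): project the centered data onto
    the eigenvectors of its covariance matrix, largest eigenvalue first."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)          # Sigma_x
    lam, phi = np.linalg.eigh(cov)          # eigenvalues in ascending order
    phi = phi[:, ::-1]                      # largest-variance axes first
    return Xc @ phi                         # Y = Phi^T x, applied row-wise

X = np.random.default_rng(0).normal(size=(100, 3))
Y = klt(X)   # columns of Y are uncorrelated principal components
```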
5) Specify the K_min and K_max values.

6) For each iteration m, calculate the initial centroids using the proposed initialization method and assign each of the transformed data points (Y) to the nearest initial centroid to form the initial clusters. Then calculate the mean of all data in each cluster as the new centroids Cs = [c_s1, ..., c_sK].

7) Calculate S_i^m for each of the input data points at each iteration m, (K_min ≤ m ≤ K_max), by:

S_i^m = \frac{b(i) - a(i)}{\max(a(i), b(i))}     (14)

where a(i) is the distance between the ith data point (1 ≤ i ≤ n) and the nearest centroid c_sj (1 ≤ j ≤ K) at the mth iteration, and b(i) is the minimum distance of the ith data point from the other K-1 centroids at the mth iteration. The proposed definition of a(i) and b(i) in Eq. (14) decreases the computational burden and speeds up the process as compared with their original definitions in the silhouette algorithm (Rousseeuw, 1987).

8) Include S_i^m (for the ith data point at the mth iteration) in the S^m array.

9) Include the average value of S^m (for the mth iteration) in the mth cell of the array S_ave^est.

10) Use Eq. (15) and select the index with the highest S_ave^m as the estimated number of clusters:

K_{est} = \operatorname{ArgMax}_m (S_{ave}^{est}), \quad (K_{min} \le K_{est} \le K_{max})     (15)

Fig. 3 shows the flowchart for the proposed method.
Fig. 3. Flowchart of the proposed method for estimating the number of clusters
This procedure modifies the method proposed in (Rousseeuw, 1987) to provide stable results with less processing time.
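Steps 5 to 10 can be sketched as follows. This is an illustrative helper, not the authors' code: the initial centroids are a simple stand-in (evenly spaced points of the norm-sorted data) rather than the full initialization of Section 2.2.A, and the blob data are synthetic.

```python
import numpy as np

def estimate_k(Y, k_min, k_max):
    """Modified silhouette estimate (Section 2.4): a(i) is the distance to
    the nearest centroid and b(i) the minimum distance to the remaining K-1
    centroids, so each candidate K costs O(nK) instead of O(n^2)."""
    order = np.argsort(np.linalg.norm(Y, axis=1))
    scores = {}
    for k in range(k_min, k_max + 1):
        # stand-in initial centroids: evenly spaced norm-sorted points
        init = Y[order[np.linspace(0, len(Y) - 1, k, dtype=int)]]
        lab = np.linalg.norm(Y[:, None] - init[None], axis=2).argmin(axis=1)
        # cluster means become the centroids Cs of step 6
        cs = np.array([Y[lab == p].mean(axis=0) if (lab == p).any() else init[p]
                       for p in range(k)])
        d = np.linalg.norm(Y[:, None] - cs[None], axis=2)
        d.sort(axis=1)
        a, b = d[:, 0], d[:, 1]                 # nearest / second-nearest centroid
        scores[k] = np.mean((b - a) / np.maximum(a, b))   # Eq. (14), averaged
    return max(scores, key=scores.get)          # Eq. (15): ArgMax over K

# three well-separated blobs -> the estimate should pick K = 3
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(c, 0.1, size=(30, 2))
                   for c in ([0., 0.], [5., 5.], [10., 0.])])
k_hat = estimate_k(blobs, 2, 5)
```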
TABLE III provides the time complexity orders of the proposed estimation method and the silhouette algorithm.

TABLE III
Time complexity comparison of the proposed estimation method and the silhouette algorithm.

Algorithm               Time complexity
Proposed method         O(n log(n) · ΔK)
Silhouette algorithm    O(n^2 · ΔK)

where n is the data volume and ΔK is the difference between K_max and K_min (ΔK = K_max - K_min). The comparison demonstrates the lower time complexity of the proposed algorithm.
3. Case Studies

In this section, we evaluate the performance of the proposed method in dealing with the empty cluster problem; we then assess the proposed transformed K-means clustering algorithm, and finally we examine the proposed estimation method for determining the optimal number of clusters. The datasets used in the experiments are available online at the Joensuu (http://cs.uef.fi/sipu/datasets), UCI (https://archive.ics.uci.edu/ml/datasets) and Mesonet (http://mesonet.agron.iastate.edu) websites. More information regarding the data is presented in Fig. 4.

Fig. 4. Datasets used for the case study
3.1. Evaluation of the proposed method for dealing with the empty cluster generation
Three real time-series datasets of solar radiation at the Ames, Chariton and Calmar stations between 01/01/2009 and 01/01/2014 are used to calculate the number of empty clusters (N.E.C.) generated by the K-means algorithm and the proposed method. The number of clusters for both algorithms is 200 (K = 200). First, the clustering is performed for one step (Steps = 1), i.e., without any data transformation, to evaluate the performance of the proposed initialization approach in dealing with the empty cluster problem. TABLE IV shows the N.E.C. for the proposed method with 1 step and for the K-means algorithm. The results demonstrate that the proposed method significantly reduces the N.E.C. compared to the K-means algorithm. This is due to the proposed initialization approach, which properly selects the initial centroids. The N.E.C. generated by the proposed method is then calculated as the number of steps increases. The results are provided in TABLE V for steps 1 to 10. They show that the empty cluster problem is completely solved during the transformation of the data to their original positions, and the proposed clustering algorithm converges without generating any empty cluster.
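As an illustrative sketch (not the paper's implementation; `kmeans_with_nec` is a hypothetical helper name), the following Python shows how empty clusters arise in plain Lloyd's K-means with random initialization, and how the N.E.C. can be counted across iterations:

```python
import numpy as np

def kmeans_with_nec(X, k, iters=20, seed=0):
    """Plain Lloyd's K-means that also reports the number of empty
    clusters (N.E.C.) observed over the iterations.

    Random initial centroids, as in the baseline K-means the paper
    compares against -- not the proposed initialization."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    nec = 0
    for _ in range(iters):
        # Assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Count clusters that received no points (the empty-cluster problem)
        empty = [j for j in range(k) if not np.any(labels == j)]
        nec += len(empty)
        # Update only non-empty clusters; empty ones keep their centroid
        for j in range(k):
            if j not in empty:
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, nec
```

With many clusters relative to the data (as in the K = 200 experiments above), some centroids routinely end an assignment step with no members, which is exactly what the N.E.C. column measures.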
TABLE IV
Performance comparison of the K-means and proposed transformed K-means on the problem of empty cluster generation

Dataset     Number of objects    Number of clusters    N.E.C. (K-means)    N.E.C. (Proposed method, Step = 1)
Ames        43827                200                   84                  1
Chariton    43827                200                   75                  0
Calmar      43827                200                   27                  2
TABLE V
Performance of the proposed method with different steps for the empty cluster problem

Step number    N.E.C. (Ames)    N.E.C. (Chariton)    N.E.C. (Calmar)
1              1                1                    3
2              0                0                    0
3              1                0                    0
4              0                0                    0
5              0                0                    0
6              0                0                    0
7              0                0                    0
8              0                0                    0
9              0                0                    0
10             0                0                    0
3.2. Evaluation of the proposed transformed Kmeans clustering algorithm
This section evaluates the accuracy of the proposed clustering method (transformed K-means). The mean squared error (MSE) is used as the accuracy performance indicator, calculated by:
MSE = (1/(K·N)) ∑_{k=1}^{K} ∑_{i=1}^{N} ‖X_i^{(k)} − C_k‖²    (16)

where N is the number of data points in cluster k, C_k is the centroid of cluster k, and X_i^{(k)} is the ith data point in cluster k. The testing datasets are
normalized in the range of [−1, 1]. K-means clustering is run with different initialization methods, including the random-based, K-means*-based, K-means++-based and the proposed initialization method (Section 2.2.A). The calculated error values as well as the processing times are provided in TABLE VI. A comparison of the results shows that the proposed initialization method improves the accuracy of the K-means algorithm compared to the other initialization methods. However, the computational complexity is increased due to the data sorting used by our initialization to optimally select the initial centroids.
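Eq. (16) can be computed directly from the cluster assignments and centroids. The sketch below is illustrative (the function name is hypothetical): for each cluster it averages the squared distance of its members to the centroid, then averages over the K clusters.

```python
import numpy as np

def clustering_mse(X, labels, centroids):
    """MSE per Eq. (16): mean squared distance to the cluster centroid,
    averaged within each cluster (over its N points), then over K clusters."""
    K = len(centroids)
    total = 0.0
    for k in range(K):
        members = X[labels == k]
        if len(members) == 0:
            continue  # skip empty clusters
        total += np.mean(np.sum((members - centroids[k]) ** 2, axis=1))
    return total / K
```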
TABLE VI
MSE measures and running time (sec) for the K-means algorithm with different initialization methods

             Proposed init.       K-means*-based       K-means++-based      Random-based
Dataset      MSE      Time(s)     MSE      Time(s)     MSE      Time(s)     MSE      Time(s)
IRIS         0.0432   0.24        0.0432   0.0891      0.0431   0.0238      0.0432   0.0248
Glass        0.0017   0.0349      0.002    0.0235      0.0017   0.0079      0.0017   0.007
Missa1       0.0094   0.0678      0.0095   0.2665      0.0094   0.8085      0.0097   0.0901
Bridge       0.0008   0.2256      0.0008   2.3118      0.0011   12.8968     0.001    0.1551
Thyroid      0.0145   0.115       0.0554   0.0261      0.0151   0.0084      0.0145   0.0062
Magic        0.0304   0.1871      0.0304   0.4238      0.0304   1.382       0.0304   0.0353
Wine         0.0255   0.0858      0.1352   0.0862      0.0255   0.0101      0.0255   0.0053
Shuttle      0.0008   0.3903      0.0009   0.9169      0.0008   14.0788     0.0008   0.0451
Pendigit     0.007    0.1973      0.0083   0.1728      0.0071   0.3836      0.0071   0.0124
Wdbc         0.0206   0.0274      0.0231   0.0279      0.0206   0.0073      0.0206   0.0053
Yeast        0.0053   0.0373      0.0097   0.3779      0.0053   0.0997      0.0053   0.0311
P. I. D      0.0536   0.0265      0.1013   0.0429      0.054    0.0137      0.0536   0.0086
Olitos       0.0162   0.0148      0.0517   0.1279      0.0162   0.0059      0.0161   0.005
Heart        0.0358   0.0138      0.0574   0.0279      0.0358   0.0113      0.0358   0.0065
Ionosphere   0.081    0.0167      0.096    0.0311      0.081    0.0076      0.081    0.0058
M. Libras    0.0064   0.0185      0.0094   1.3832      0.0066   0.0352      0.0058   0.0061
Spambase     0.01     0.049       0.0125   0.13        0.01     0.0894      0.01     0.0089
Waveform     0.0114   0.0675      0.0159   0.1521      0.0114   0.2114      0.0114   0.0176
a1           0.0057   0.1083      0.0065   0.6655      0.0061   0.3598      0.0059   0.0367
s1           0.0094   0.0869      0.01     0.6682      0.0101   0.7354      0.0095   0.0358
The MSE value is calculated for different data clustering methods, including the proposed transformed K-means, ABC-K-means, DE-K-means, GA-K-means, K-means*, K-means++, K-means, SOM, GT-SOM, SOM++, K-medoids and FCM, and is provided in TABLE VII. The calculated MSE values show that the transformed K-means (with Steps = 20) outperforms or competes with the existing methods in terms of clustering quality. The improved clustering quality is the result of several procedures embraced by our proposed method, namely the determination of the optimal number of clusters, the proposed initialization, and the gradual data transformation.
The proposed transformed K-means algorithm has a faster processing time than the ABC-K-means, DE-K-means, GA-K-means, SOM, GT-SOM, and SOM++ algorithms, and competes with FCM, K-medoids, K-means*, and K-means++. For small and medium-sized data, our proposed method is generally more time-consuming than K-medoids, K-means*, and K-means++. However, the reduced convergence rate of K-means* in dealing with empty cluster generation, particularly for higher numbers of clusters, and the sequential initialization of K-means++ increase the computational complexity of these methods for large data volumes. This is evident from our running-time results for large datasets such as Missa1 and Shuttle, where the proposed transformed K-means converges faster than K-means* and K-means++.
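The sequential cost of K-means++ initialization can be seen in a short sketch (illustrative, not the exact implementation benchmarked here): the K centroids are chosen one at a time, and each round re-scans all n points against the centroids chosen so far, which is what makes the seeding expensive for large data volumes.

```python
import numpy as np

def kmeans_pp_seeds(X, k, seed=0):
    """K-means++ seeding sketch: centroids are chosen sequentially, with a
    full pass over all n points per round (the cost discussed above)."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of each point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Sample the next centroid with probability proportional to d2
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```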
TABLE VII
MSE measure and running time (sec) for different clustering techniques
Columns, left to right: transformed K-means, ABC-K-means, DE-K-means, GA-K-means, K-means*, K-means++, K-means, SOM, GT-SOM, SOM++, K-medoids, FCM

IRIS        MSE:     0.0432  0.0475  0.043   0.043   0.0432  0.0432  0.0432  0.0432  0.0433  0.0432  0.0432  0.0432
            Time(s): 0.5754  1.2629  1.4003  1.6738  1.4858  0.0063  0.0061  2.3584  1.7599  2.4825  0.0127  0.2132
Glass       MSE:     0.0017  0.0019  0.0018  0.0017  0.0017  0.0018  0.0017  0.0017  0.0019  0.0017  0.0017  0.0017
            Time(s): 0.6301  1.2252  1.3897  1.6864  0.1381  0.0134  0.0058  2.3659  1.8439  2.3572  0.0311  0.0134
Missa1      MSE:     0.0001  0.0002  0.0002  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.005
            Time(s): 3.9664  45.2259 60.29   81.2574 47.813  37.6179 0.3216  3.4526  10.5227 3.4381  47.6677 4.7857
Bridge      MSE:     0.0008  0.0018  0.0015  0.0013  0.0009  0.0013  0.001   0.001   0.0009  0.0012  0.0009  0.0038
            Time(s): 2.4381  29.2082 38.8446 52.4054 18.794  10.5017 0.1362  3.3597  10.6182 3.3557  18.0322 15.5823
Thyroid     MSE:     0.0134  0.0147  0.0134  0.0133  0.0591  0.0151  0.0145  0.0167  0.0169  0.0168  0.0146  0.0149
            Time(s): 0.6331  1.2185  1.3674  1.6443  0.3718  0.0087  0.0069  2.3265  1.7153  2.3346  0.0394  0.4068
Magic       MSE:     0.0271  0.0325  0.0295  0.0295  0.0304  0.0304  0.0304  0.0307  0.0306  0.0304  0.0304  0.0304
            Time(s): 0.9886  2.6975  3.2338  4.1301  0.7922  1.3478  0.0373  2.3212  1.684   2.3245  3.777   0.7571
Wine        MSE:     0.0255  0.0277  0.0254  0.0252  0.1349  0.0256  0.0255  0.0257  0.0255  0.0255  0.0256  0.0256
            Time(s): 0.6183  1.1893  1.3275  1.5914  1.8411  0.0059  0.006   2.3257  1.7137  2.3216  0.0127  0.4052
Shuttle     MSE:     0.0007  0.0008  0.0008  0.0007  0.0009  0.0008  0.0008  0.0008  0.0009  0.0008  0.0008  0.0009
            Time(s): 1.2867  5.363   6.7498  9.2292  1.4791  13.8573 0.047   2.3333  1.7205  2.3379  4.9705  0.9686
Pendigit    MSE:     0.0069  0.0085  0.0077  0.0073  0.0079  0.0071  0.0071  0.0069  0.0071  0.007   0.007   0.0969
            Time(s): 0.814   2.3004  2.8268  3.6743  7.2133  0.3607  0.018   2.3571  1.969   2.3645  1.6015  1.3866
Wdbc        MSE:     0.0138  0.0152  0.0138  0.0138  0.0231  0.0206  0.0206  0.0188  0.0243  0.0205  0.0192  0.0205
            Time(s): 0.6232  1.2157  1.3701  1.6524  0.2455  0.0057  0.0044  2.3226  1.6766  2.3566  0.017   0.1047
Yeast       MSE:     0.0051  0.0062  0.0057  0.0055  0.0088  0.0055  0.0053  0.0052  0.0054  0.0052  0.0054  0.0058
            Time(s): 0.6999  1.6819  1.9743  2.4665  7.0087  0.0898  0.016   2.3747  1.9684  2.3955  1.0383  0.2853
P. I. D     MSE:     0.0438  0.0474  0.0431  0.0431  0.1013  0.0536  0.0536  0.0525  0.0541  0.0574  0.0537  0.0512
            Time(s): 0.6353  1.2327  1.3823  1.698   0.2842  0.008   0.0056  2.337   1.6765  2.3256  0.0762  0.4109
Olitos      MSE:     0.0153  0.017   0.0156  0.0153  0.051   0.0162  0.0167  0.0162  0.0161  0.0161  0.0161  0.016
            Time(s): 0.5645  1.2042  1.3552  1.6307  1.3999  0.0101  0.0065  2.337   1.8621  2.3347  0.0259  0.5152
Heart       MSE:     0.0352  0.0387  0.0352  0.0352  0.0574  0.0358  0.0358  0.0354  0.0359  0.0356  0.0358  0.0359
            Time(s): 0.6091  1.1922  1.342   1.6161  0.3494  0.0052  0.0057  2.3168  1.6736  2.3149  0.0408  0.3544
Ionosphere  MSE:     0.081   0.0891  0.081   0.081   0.0959  0.081   0.081   0.0811  0.081   0.081   0.081   0.0813
            Time(s): 0.6061  1.2029  1.3576  1.6325  0.2635  0.0053  0.0052  2.3253  1.6762  2.3317  0.0175  0.4333
Movement    MSE:     0.0055  0.0081  0.0072  0.006   0.0094  0.0063  0.0062  0.0055  0.0058  0.0057  0.0056  0.0057
Libras      Time(s): 0.6293  1.335   1.5315  1.8818  15.938  0.0252  0.01    2.3821  2.1349  2.3865  0.1281
