
Accepted Manuscript

A novel clustering algorithm based on data transformation

Rasool Azimi, Mohadeseh Ghayekhloo, Mahmoud Ghofrani, Hedieh Sajedi

PII: S0957-4174(17)30034-9
DOI: 10.1016/j.eswa.2017.01.024
Reference: ESWA 11072

To appear in: Expert Systems With Applications

Received date: 28 November 2015


Revised date: 29 October 2016
Accepted date: 24 January 2017

Please cite this article as: Rasool Azimi, Mohadeseh Ghayekhloo, Mahmoud Ghofrani, Hedieh Sajedi, A novel clustering algorithm based on data transformation, Expert Systems With Applications (2017), doi: 10.1016/j.eswa.2017.01.024

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.

Highlights
A new initialization technique is proposed to improve the performance of K-means.

A data transformation approach is proposed to solve the empty cluster problem.

An efficient method is proposed to estimate the optimal number of clusters.

The proposed clustering method provides more accurate clustering results.


A Novel Clustering Algorithm Based on Data Transformation

Rasool Azimi a (r.azimi@qiau.ac.ir), Mohadeseh Ghayekhloo b (m.ghayekhloo@qiau.ac.ir), Mahmoud Ghofrani c,* (mghofrani@uwb.edu), Hedieh Sajedi d (hhsajedi@ut.ac.ir)

a Computer and Information Technology Engineering, Qazvin Branch, Islamic Azad University, Qazvin, Iran.
b Young Researchers and Elite Club, Qazvin Branch, Islamic Azad University, Qazvin, Iran.
c School of Science, Technology, Engineering and Mathematics (STEM), University of Washington, Bothell, USA.
d Department of Computer Science, College of Science, University of Tehran, Tehran, Iran.

* Corresponding author: UWBB room 227, 18807 Beardslee Blvd, Bothell, WA 98011, USA. Fax: 425.352.3775

Abstract: Clustering provides a knowledge acquisition method for intelligent systems. This paper proposes a novel data-clustering algorithm that combines a new initialization technique, the K-means algorithm, and a new gradual data transformation approach to provide more accurate clustering results than the K-means algorithm and its variants by increasing the coherence of the clusters. The proposed data transformation approach solves the problem of generating empty clusters, which frequently occurs in other clustering algorithms. An efficient method based on the principal component transformation and a modified silhouette algorithm is also proposed in this paper to determine the number of clusters. Several different datasets are used to evaluate the efficacy of the proposed method in dealing with the empty cluster generation problem, and its accuracy and computational performance are compared with other K-means based initialization techniques and clustering methods. The developed estimation method for determining the number of clusters is also evaluated and compared with other estimation algorithms. Significances of the proposed method include addressing the limitations of K-means based clustering and improving the accuracy of clustering as an important method in the field of data mining and expert systems. Applying the proposed method for knowledge acquisition in time series data such as wind, solar, electric load and stock market series provides a pre-processing tool to select the most appropriate data to feed the neural networks or other estimators used for forecasting such time series. In addition, utilization of the knowledge discovered by the proposed K-means clustering to develop rule-based expert systems is one of the main impacts of the proposed method.

Index Terms: Data mining, clustering, K-means, data transformation, silhouette, transformed K-means

1. Introduction

Expert systems are computer applications that contain stored knowledge and are developed to solve problems in a specific field in almost the same way in which a human expert would (Shuliang et al., 2002). Acquisition of the expert knowledge is a challenge in developing such expert systems (Yang et al., 2012). One of the major problems and most difficult tasks in developing rule-based expert systems is representing the knowledge discovered by data clustering (Markic and Tomic, 2010). The K-means algorithm is one of the most commonly used clustering techniques; it uses the data reassignment method to repeatedly optimize the clustering (Lloyd, 1982). The main goal of clustering is to generate compact groups of objects or data that share similar patterns within the same cluster, and to isolate these groups from those which contain elements with different characteristics.

Although the K-means algorithm has features such as simplicity and high convergence speed, it is totally dependent on the initial centroids, which are randomly selected in the first phase of the algorithm. Due to this random selection, the algorithm may converge to locally optimal solutions (Celebi et al., 2013). Different variants of the K-means algorithm have been proposed to address this limitation. The K-medoids algorithm was proposed in (Kaufman and Rousseeuw, 1987) to define each cluster by the most central medoid in which it is located. First, K data points are considered as initial centroids (medoids) and each data point is assigned to the closest medoid, forming the initial clusters. In an iteration-based process, the most central data point in each cluster is considered as the new centroid and each data point is assigned to the nearest centroid. The remaining steps of this algorithm match the K-means procedure. Fuzzy C-means (FCM) clustering introduced the partial membership concept (Dunn, 1973), (Bezdek et al., 1984). In fact, in the FCM algorithm, each data point belongs to all clusters. The degree of belonging is represented by a partial membership determined by a fuzzy clustering matrix. A genetic algorithm-based K-means (GA-K-means) was proposed in (Krishna and Murty, 1999) to provide a global optimum for the clustering. In this method, the K-means algorithm was used as a search operator instead of crossover. A biased mutation operator was also proposed for clustering to help the K-means algorithm avoid local minima. The global K-means algorithm was developed in (Likas et al., 2003) to provide an experimentally optimal solution for clustering problems. However, it is not appropriate for clustering medium-sized and large-scale datasets due to its heavy computational burden. The K-means++ initialization algorithm was proposed in (Arthur and Vassilvitskii, 2007) for obtaining an initial set of centroids that is near-optimal. The main drawback of K-means++ is its inherent sequential nature, which limits the effectiveness of the method for high-volume data. An artificial bee colony K-means (ABC-K-means) clustering approach was proposed in (Zhang et al., 2010) for optimal partitioning of data objects into a fixed number of clusters. A hybrid of differential evolution and K-means algorithms named DE-K-means was introduced in (Kwedlo, 2011). The differential evolution algorithm was used as a global optimization method, and the resultant clustering solutions were fine-tuned and corrected using the K-means algorithm. Dogan et al. (2013) proposed a hybrid of the K-means++ and self-organizing map (SOM) (Kohonen, 1990) algorithms to improve the clustering accuracy. It first uses the K-means++ initialization method to determine the initial weight values and the starting points, and then uses SOM to find an appropriate final clustering solution. However, the aforementioned limitation of K-means++ was not addressed. A new clustering technique using a combination of the global K-means algorithm and the topology neighborhood based on Axiomatic Fuzzy Sets (AFS) theory was developed in (Wang et al., 2013) to determine the initial centroids. A new clustering algorithm, named K-means*, was presented in (Malinen et al., 2014) that generates an artificial dataset X* as the input data. The input data are then mapped one-by-one to the generated artificial data (X → X*). Next, the inverse transformation of the artificial data to the original data is performed by a series of gradual transformations. To do so, the K-means algorithm updates the clustering model after each transformation and moves the data vectors slowly to their original positions. The K-means* algorithm uses a random data swapping strategy to deal with the problem of generating empty clusters. However, the random selection of the data vectors as the cluster centroids may reduce the coherence of the other clusters and decrease the efficiency of the K-means* algorithm. Moreover, the convergence rate of the K-means* algorithm reduces significantly as the number of clusters increases, especially with increasing data volumes. Density-based clustering methods were proposed in (Mahesh Kumar and Rama Mohan Reddy, 2016) to speed up the neighbor search when clustering spatial databases with noise. Density Based Spatial Clustering of Applications with Noise (DBSCAN) provides a graph-based index structure for high-dimensional data with a large amount of noise. It was shown that the running time of the proposed method is shorter than that of DBSCAN while producing exactly the same clustering results. The proposed method solved the inefficacy of the DBSCAN method in handling clusters with large differences in densities. A novel clustering algorithm named CLUB (CLUstering based on Backbone) was developed in (Chen et al., 2016) to determine optimal clusters. First, the algorithm detects the initial clusters and finds their density backbones. Then, the algorithm finds the outliers in each cluster based on the K Nearest Neighbour (KNN) method. Finally, by assigning each unlabeled point to the cluster with the nearest higher-density neighbour, the algorithm yields the final clusters. CLUB has several drawbacks: the KNN method lacks an efficient algorithm to determine the value of the parameter K (the number of nearest neighbors), and the computational cost of this method is too high because it requires calculating the distance of each query instance with respect to all training samples. Two particle swarm optimization (PSO) based fuzzy clustering methods were proposed in (Silva Filho et al., 2015) to deal with the shortcomings of the PSO algorithms used for fuzzy clustering. The proposed methods adjust the parameters of PSO dynamically to achieve a balance between exploration and exploitation and to avoid being trapped in local optima. These methods lack precision for high-dimensional applications. In addition, their iterative process significantly decreases the convergence rate. Generally, the speed at which a convergent sequence approaches its limit is defined as the rate of convergence. Three clustering algorithms named Near Neighbor Influence (CNNI), an improved version of the time cost of Near Neighbor Influence (ICNNI), and a variation of Near Neighbor Influence (VCNNI) were presented in (Chen, 2015). The clustering results showed that ICNNI is faster than CNNI and that CNNI requires less space than VCNNI. These methods suffer from large-scale computing and storage requirements. A growing incremental self-organizing neural network (GISONN) was developed in (Liu and Ban, 2015) to select appropriate clusters by learning the data distribution of each cluster. The proposed method is, however, not applicable to large-volume or high-dimensional datasets due to its computational complexity. In addition, the neighborhood-preserving feature of the algorithm is violated when the output space topology does not match the structure of the data in the input space.

In spite of the improved performance of the K-means variants for synthetic datasets with Gaussian distribution, their performance on real datasets is neither very promising nor different from the original K-means algorithm. In addition, all K-means based algorithms lack an efficient method to determine the optimal number of clusters. This requires the user to determine the number of clusters either arbitrarily or based on practical and experimental estimates, which might not be optimal.

In this paper, we propose a novel clustering approach called transformed K-means to provide more accurate clustering results compared to the K-means algorithm and its improved versions. The proposed clustering method combines a new initialization technique, the K-means algorithm, and a new gradual data transformation approach to appropriately select the initial cluster centroids and move the real data into the locations of the initial cluster centroids that are closer to the actual positions of the associated data. By doing this, the data are placed in an artificial structure to properly initiate the K-means clustering. The inverse transformation is then performed to gradually move the artificial data back to their original places. During this process, K-means updates the clustering centroids after any change in the data structure. This provides more optimal clustering results for both synthetic and real datasets. In addition, the proposed data transformation solves the empty cluster problem of the K-means algorithm and its improved versions. An efficient method based on the principal component transformation and a modified silhouette algorithm is also proposed in this paper to determine the optimal number of clusters for the K-means algorithms.

The proposed clustering method develops a rule-based expert system by means of knowledge acquisition through data transformation. Significances of the proposed method include addressing the limitations of K-means based clustering and improving the accuracy of clustering as an important method in the field of data mining and expert systems. The proposed method can be used for intelligent system applications such as forecasting time series including solar, wind, load and stock market series.

Contributions of the paper are outlined as follows:

1. A new initialization technique is proposed to select initial centroids that are closer to the optimum centroid locations.

2. A novel gradual data transformation approach is proposed to significantly reduce the number of empty clusters generated by the K-means based algorithms.

3. An efficient method is proposed to estimate the optimal number of clusters.

4. A hybrid clustering algorithm is developed by combining the proposed initialization, data transformation and cluster number estimation to provide a better knowledge discovery of the input patterns and more accurate clustering results.

The rest of the paper is organized as follows. Section 2 provides a brief description of the K-means algorithm. It also explains the proposed clustering method. Section 3 demonstrates a case study where the performance of the developed clustering method is evaluated by several experiments. Finally, Section 4 concludes the paper.

2. Methodology

2.1 K-means algorithm

The K-means algorithm (Lloyd, 1982) is a well-known, low-complexity algorithm utilized for data partitioning. The algorithm starts running after an input of K clusters is given, and outputs the cluster centroids through iterations. Let $X = [x_1, ..., x_n]$ be the set of $n$ points to be grouped into $K$ different cluster (partition) sets $C = \{c_p\}$, $p = 1, 2, ..., K$. By means of the Euclidean distance, the algorithm assigns each data point to its closest centroid $c_p$, calculated by:

$c_p = \frac{1}{n_p} \sum_{i=1}^{n_p} x_i^{(p)}$     (1)

where $x_i^{(p)}$ is the $i$-th data point in the cluster $p$, and $n_p$ is the number of data points in the respective cluster.

After the first run, the algorithm calculates the mean of the data points in each cluster $c_p$ and selects this value as the new cluster centroid, starting a new iteration. As new clusters are selected, a new mean value is obtained. The algorithm halts once the sum of the squared errors over the K clusters is minimized (Cui et al., 2014).
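For concreteness, the iteration described above can be sketched in a few lines of Python with NumPy. This is an illustrative reference implementation of Lloyd's procedure, not code from the paper; the names `kmeans` and `tol` are our own.

```python
import numpy as np

def kmeans(X, centroids, max_iter=100, tol=1e-9):
    """Lloyd's K-means: assign each point to its nearest centroid, then
    recompute every centroid as the mean of its cluster (Eq. (1))."""
    centroids = np.array(centroids, dtype=float)
    for _ in range(max_iter):
        # Squared Euclidean distance from each of the n points to each centroid.
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centroids = centroids.copy()
        for p in range(len(centroids)):
            members = X[labels == p]
            if len(members):                  # an empty cluster keeps its centroid
                new_centroids[p] = members.mean(axis=0)
        if np.linalg.norm(new_centroids - centroids) < tol:
            break                             # centroids have stabilized
        centroids = new_centroids
    return new_centroids, labels
```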

2.2 The Proposed Clustering Method

An improved version of the K-means algorithm, named transformed K-means, is proposed in this section. The proposed clustering algorithm uses a combination of a new technique to select the initial cluster centroids and a new approach for the reverse transformation of the data to enhance the clustering performance. The steps of the transformed K-means algorithm are as follows:

A. Initial centroids selection



Let $X = [x_1, ..., x_n]$ be a set of $n$ data. The selection of the $K$ initial centroids is as follows:

1) Remove duplicate data vectors and store the unique vectors in a new dataset $X' = [(x'_1, r_1), ..., (x'_m, r_m)]$, where $r_i$ is the repetition number of each non-repetitive data vector $x'_i$ in the new dataset $X'$ ($i \le m \le n$).

2) Sort the data vectors in the dataset $X'$ in ascending order based on the Euclidean length of the vectors.

3) Divide the dataset $X'$, consisting of $m$ data, into $K$ sub-datasets with (at most) $S = \lceil m / K \rceil$ data each, according to Eq. (2), such that the data elements of $X'$ are distributed among the sub-datasets $X'_1$ to $X'_K$:

$X'_1 = [(x'_1, r_1), ..., (x'_S, r_S)]$
$X'_2 = [(x'_{S+1}, r_{S+1}), ..., (x'_{2S}, r_{2S})]$
$X'_3 = [(x'_{2S+1}, r_{2S+1}), ..., (x'_{3S}, r_{3S})]$
$...$
$X'_K = [(x'_{(K-1)S+1}, r_{(K-1)S+1}), ..., (x'_{KS}, r_{KS})]$
$X' = \bigcup_{k=1}^{K} X'_k$     (2)

where $r_i$ is the repetition number of the $i$-th data vector.

4) Now we have $K$ sub-datasets, each of which is used to determine exactly one of the $K$ initial centroids. Eq. (3) is used to calculate a weight attribute $w(x'_i)$ for each data entry $x'_i$ with repetition number $r_i$ in each of the $K$ sub-datasets $\{X'_1, X'_2, ..., X'_K\}$:

$w(x'_i)_m = \frac{1}{\frac{1}{S}\sum_{j=1}^{S} \mathrm{dist}(x'_i, x'_j)} \cdot (r_i)_m, \quad (1 \le m \le K)$     (3)

where $w(x'_i)_m$ is the weight attribute of $x'_i$ in the $m$-th sub-dataset.

5) In each of the $K$ sub-datasets, the data entry with the highest weight attribute is selected as the initial centroid.

Fig. 1 shows the flowchart of our proposed method for selecting the initial centroids; a short code sketch of the procedure follows the figure.

[Fig. 1 flowchart: inputs X = {x1, ..., xn} and the number of final clusters K; remove duplicate data vectors from X to form X'; sort X' in ascending order; split X' into K sub-datasets; for each sub-dataset, calculate the weight attribute w(x'_i) of every entry (Eq. (3)) and select the highest-weight entry as the initial centroid init_cm; output the initial centroids initC0 = {c01, ..., c0K}.]

Fig. 1. Flowchart of the proposed method for the initial centroid selection
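A compact sketch of this initialization procedure (steps 1-5) is given below, assuming Euclidean distances and NumPy arrays; `select_initial_centroids` and its internals are our own illustrative names, not code from the paper.

```python
import numpy as np

def select_initial_centroids(X, K):
    """Proposed initialization (Section 2.2.A): deduplicate, sort by vector
    length, split into K sub-datasets, and pick the highest-weight entry of
    each sub-dataset (Eq. (3)) as its initial centroid."""
    # Step 1: unique data vectors with their repetition counts r_i.
    Xu, r = np.unique(X, axis=0, return_counts=True)
    # Step 2: sort by Euclidean length.
    order = np.argsort(np.linalg.norm(Xu, axis=1))
    Xu, r = Xu[order], r[order]
    # Step 3: split into K sub-datasets of (at most) ceil(m / K) entries.
    m = len(Xu)
    S = int(np.ceil(m / K))
    centroids = []
    for k in range(K):
        sub = Xu[k * S:(k + 1) * S]
        rs = r[k * S:(k + 1) * S]
        if len(sub) == 0:
            break
        # Step 4: weight = r_i / (average distance to the sub-dataset members).
        d = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=2)
        avg = d.mean(axis=1) + 1e-12          # guard against division by zero
        w = rs / avg
        # Step 5: the highest-weight entry becomes the initial centroid.
        centroids.append(sub[w.argmax()])
    return np.array(centroids)
```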

B. Inverse transformation

The inverse data transformation approach was first used in (Malinen et al., 2014) to solve the problems associated with the K-means clustering algorithm. However, the approach presented in (Malinen et al., 2014) has a number of shortcomings, such as finding a suitable artificial data structure, performing the mapping, and controlling the inverse transformations. The algorithm cannot generally guarantee an optimal solution. This was demonstrated by the clustering results of (Malinen et al., 2014), where in some cases the data transformation led to the deviation of the data towards incorrect cluster centroids. For the inverse transformation of data, we first generate artificial data $X^*$ as the input data, of the same size ($n$) and dimension ($d$). This divides the data vectors into $K$ distinct clusters without any fluctuations. We then define a one-to-one mapping of the input data to the artificial data ($X \rightarrow X^*$).

The inverse data transformation approach of (Malinen et al., 2014) uniformly distributes the initial cluster centroids along a line in the artificial structure. This is given by Eq. (4):

$X = [x_1, ..., x_n], \quad X^* = [x_1^*, ..., x_n^*], \quad initC = [c_{01}, ..., c_{0K}]$
$x_i^* = \mathrm{RandomSample}(initC), \quad x_i^* \in initC \quad (0 \le i \le n)$     (4)

This random placement may break the clustering structure, deviate the data towards incorrect cluster centroids and, consequently, provide incorrect results. To address this problem, our proposed inverse data transformation approach places each initial centroid $c_{0j}$ ($1 \le j \le K$) in the location of the data $d_i$ ($1 \le i \le n$) that is closest to $c_{0j}$ in the artificial structure $X^*$. This is given by Eq. (5):

$x_i^* = \mathrm{ArgMin}_{c_{0j}} \lVert x_i - c_{0j} \rVert, \quad x_i^* \in initC, \quad (0 \le i \le n), \ (0 \le j \le K)$     (5)

A series of inverse transformations is then performed that gradually moves the data elements to their real (original) positions. This inversely transfers the artificial data to the main data. During this process, K-means updates the cluster centroids of the transformed data. The cluster centroids calculated in each step are used as the initial cluster centroids for the next step. This process continues until the last step, whose results provide the final cluster centroids. The proposed procedure is outlined as follows:

First, each vector $x_i$ is placed in the position of the initial centroid $initC_l$ ($1 \le l \le K$) that has the minimum distance to the corresponding data. Next, the data gradually move back to their real positions.

Generally, for a dataset $X = [x_1, ..., x_n]$ of $n$ data vectors, the gradual inverse transformation of data to their real positions follows the steps below:

1) Sort the dataset $X = [x_1, ..., x_n]$ in ascending order based on the Euclidean length of the vectors. Next, store the sorted data in a new dataset $X' = [x'_1, ..., x'_n]$.



[Fig. 2 flowchart: inputs X = {x1, ..., xn}, K, initC0 = {c01, ..., c0K} and the number of inverse transformation steps; sort the input pattern into X'; displace X in random order to form X''; create the artificial data (X' → X*); determine Dist'' = (X'' - X*) / Steps; at each step i, move all points by X*_i = X*_{i-1} + (i × Dist'') and run K-means(X*_i, K, initC_{i-1}) with the previous centroids as input; output the final centroids CF = {c1, ..., cK}.]

Fig. 2. Flowchart of the transformed K-means algorithm


2) To construct the artificial data structure $X^*$ as the initial position of the data, place each initial centroid $initC_l$ ($1 \le l \le K$) in the position of the data vectors of the dataset $X'$ that are closer to that initial centroid ($x'_i \rightarrow initC_l$) than to the ($K-1$) other initial centroids. This forms the artificial structure $X^* = [x_1^*, ..., x_n^*]$ and moves each real data point into the location of the initial centroid that is closest to the actual position of the associated data.

3) Displace all the real data vectors $X = [x_1, ..., x_n]$ in random order and store them in the new dataset $X'' = [x''_1, ..., x''_n]$.

4) Determine the distances between the initial artificial data ($X^*$) and the displaced real data ($X''$), and store them in the set $Dist'' = [dist''_1, dist''_2, ..., dist''_n]$. Each element $dist''_i$ represents the distance vector between the $i$-th data vector $x_i^*$ in the artificial dataset $X^*$ and the position of the corresponding data $x''_i$ in the dataset $X''$.

5) According to the number of steps given by the user ($Steps > 1$), divide each element of $Dist'' = [dist''_1, ..., dist''_n]$ by the value of $Steps$ and update the data elements in $Dist''$. This is given by Eq. (6):

$Dist'' = [(dist''_1 / Steps), ..., (dist''_n / Steps)]$     (6)

6) At each step of the inverse transform process, all data points move towards their real locations as follows:

$X_i^* = X_{i-1}^* + (i \cdot Dist''), \quad (1 \le i \le Steps)$     (7)

where $X^*$ is the position of the data in the artificial structure, $i$ is the step number and $Dist''$ is the set of per-step displacement vectors defined in steps 4 and 5. Fig. 2 shows the flowchart for the proposed transformed K-means. We should note that $X_1^*$ is the initial data position in the artificial structure (the initial artificial dataset). In the first step, the initial centroids ($initC$), calculated by the proposed initial centroid selection method, are fed to the K-means algorithm as the inputs ($initC = initC_0$). After every inverse transform, K-means is executed given the previous centroids ($initC_{i-1}$) along with the modified dataset $X_i^*$ as the input pattern. After completion of all steps ($i = Steps$), all data points are placed in their original locations ($X_i^* \rightarrow X'$) and the final centroids ($CF$) are calculated as the outputs. The proposed initialization approach of Section 2.2.A significantly reduces the chance of empty cluster generation by proper selection of the initial centroids, and the proposed data transformation approach completely solves the empty cluster problem during the data transformation process. A code sketch of this transformation loop is given below.
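The following is a minimal sketch of the gradual inverse transformation, under two assumptions of ours: scikit-learn's KMeans is used as the inner, warm-started clustering routine for brevity, and the points are moved by an equal increment at every step, which is our reading of Eqs. (6)-(7). The name `transformed_kmeans` is illustrative, not from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def transformed_kmeans(X, init_centroids, steps=20):
    """Transformed K-means sketch (Section 2.2.B): start every point on its
    nearest initial centroid (the artificial structure X*), then move the
    points back to their real positions over `steps` increments, re-running
    K-means with the previous centroids after each move."""
    K = len(init_centroids)
    # Step 2: build X* by snapping each point to its nearest initial centroid.
    dist = np.linalg.norm(X[:, None, :] - init_centroids[None, :, :], axis=2)
    X_star = init_centroids[dist.argmin(axis=1)].astype(float)
    # Steps 4-5: per-step displacement towards the real positions (Eq. (6)).
    delta = (X - X_star) / steps
    centroids = np.asarray(init_centroids, dtype=float)
    for _ in range(steps):
        X_star = X_star + delta               # one gradual move (cf. Eq. (7))
        km = KMeans(n_clusters=K, init=centroids, n_init=1).fit(X_star)
        centroids = km.cluster_centers_       # warm start for the next step
    return centroids, km.labels_
```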


2.3 Time complexity

The transformed K-means algorithm has a time complexity of order $O((n \log n) \cdot K \cdot s)$, where $n$ is the total number of data, $K$ is the number of clusters and $s$ is the number of steps. More details of the time complexity of the proposed transformed K-means algorithm are given for its different phases in TABLE I.

TABLE I
Time complexity of the proposed transformed K-means algorithm.

Algorithm phase              Time complexity
Initialization               O(n log n)
Data transformation          O(n log n)
K-means algorithm            O(n · K)
Total                        O((n log n) · K)
Running in s steps (s > 1)   O((n log n) · K · s)

TABLE II provides the time complexity orders for the proposed method and well-known clustering algorithms including K-means*, K-means++, global K-means, original K-means, K-medoids, FCM, SOM, SOM++ and game theoretic SOM (GTSOM) (Herbert and Yao, 2007).

TABLE II
Time complexity comparison of the proposed transformed K-means algorithm and several well-known clustering algorithms.

Algorithm             Time complexity
Transformed K-means   O((n log n) · K · s)
K-means*              O(n · K · s)
K-means++             O(n · K)
Global K-means        O(n² · K²)
K-means               O(n · K)
K-medoids             O(n² · K)
FCM                   O(n · K²)
SOM                   O(n² · K)
GTSOM                 O(n² · K)
SOM++                 O(n² · K)

The time complexity comparison in TABLE II shows that the proposed transformed K-means algorithm is faster than SOM, GTSOM and global K-means, and competes with FCM and K-medoids. The time complexity of K-means and K-means++ is better than that of our proposed algorithm. However, as the data volume increases, the K-means++ algorithm may not be as efficient as our proposed method due to its sequential initialization (Bahmani et al., 2012). The proposed transformed K-means and K-means* algorithms have almost the same time complexity. However, the approach used to deal with the generation of empty clusters in the K-means* algorithm reduces its convergence rate nonlinearly as the data volume (n) and the number of clusters (K) increase. Consequently, our proposed clustering algorithm is generally faster than the K-means* algorithm.

2.4 Estimation of the number of clusters

K-means and many other clustering algorithms are designed assuming that the number of clusters is known in advance. In cases where the number of clusters is not predefined, an efficient method is required to determine the optimal number of clusters. In this section we present a new method, based on the silhouette approach proposed in (Rousseeuw, 1987), to estimate the number of clusters. The silhouette algorithm is as follows:

1) Cluster the input data using any clustering technique for each iteration $m$ ($K_{min} \le m \le K_{max}$).

2) Calculate the silhouette function $S(i)$ for the input data:

$S_i^m = \frac{b(i) - a(i)}{\max(a(i), b(i))}$     (8)

where $a(i)$ is the average distance between the $i$-th data point ($1 \le i \le n$) and the other data in the same cluster, and $b(i)$ is the lowest average distance of the $i$-th data point from the data in the other $K-1$ clusters at the $m$-th iteration.

3) Calculate $S(m)$ as the average value of $S$ at the $m$-th iteration.

4) Select the iteration number index with the highest $S$ as the estimated number of clusters.

The proposed method uses the principal component transformation to modify the silhouette algorithm. The proposed procedure is as follows:

4-1) Transform the input data using the Karhunen-Loeve Transform (KLT) method. The KLT method, also known as principal component analysis (PCA), is outlined below:

4-2) Let $\phi_k$ denote the eigenvector corresponding to the $k$-th eigenvalue $\lambda_k$ of the covariance matrix $\Sigma_x$:

$\Sigma_x \phi_k = \lambda_k \phi_k, \quad k \in \{1, ..., N\}$
$\Sigma_{i,j} = \mathrm{cov}(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)], \quad \mu_i = E(X_i)$     (9)

4-3) Construct an $N \times N$ unitary matrix:

$\Phi = [\phi_1, ..., \phi_N], \quad \Phi^{*T} \Phi = I, \quad \Phi^{-1} = \Phi^{*T}$     (10)

4-4) Combine the $N$ eigenvalue equations as follows:

$\Sigma_x \Phi = \Phi \Lambda$     (11)

where $\Lambda = \mathrm{diag}(\lambda_1, ..., \lambda_N)$ is a diagonal matrix.

4-5) Multiply $\Phi^{*T} = \Phi^{-1}$ by both sides of Eq. (11):

$\Phi^{*T} \Sigma_x \Phi = \Lambda$     (12)

4-6) Given the input data $X$, define the Karhunen-Loeve Transformation of $X$ as follows:

$Y = \Phi^{*T} X, \quad \text{i.e.,} \quad y_k = \phi_k^{*T} x_k, \quad k \in \{1, ..., N\}$     (13)


5) Specify the $K_{min}$ and $K_{max}$ values.

6) For each iteration $m$, calculate the initial centroids using the proposed initialization method and assign each of the transformed data ($Y$) to the nearest initial centroid to form the initial clusters. Then calculate the mean of all data in each cluster as the new centroids $C_s = [c_{s1}, ..., c_{sK}]$.

7) Calculate $S_i^m$ for each of the input data at each iteration $m$ ($K_{min} \le m \le K_{max}$) by:

$S_i^m = \frac{b(i) - a(i)}{\max(a(i), b(i))}$     (14)

where $a(i)$ is the distance between the $i$-th data point ($1 \le i \le n$) and the nearest centroid $c_{sj}$ ($1 \le j \le K$) at the $m$-th iteration, and $b(i)$ is the minimum distance of the $i$-th data point from the other $K-1$ centroids at the $m$-th iteration. The proposed definition of $a(i)$ and $b(i)$ in Eq. (14) decreases the computational burden and speeds up the process as compared to their original definitions in the silhouette algorithm (Rousseeuw, 1987).

8) Include $S_i^m$ (for the $i$-th data point at the $m$-th iteration) in the $S_m$ array.

9) Include the average value of $S_m$ (for the $m$-th iteration) in the $m$-th cell of the array $S_{ave}^{est}$.

10) Use Eq. (15) and select the row number with the highest $S_{ave,m}^{est}$ as the estimated number of clusters:

$K_{est} = \mathrm{ArgMax}(S_{ave,m}^{est}), \quad (K_{min} \le K_{est} \le K_{max})$     (15)

Fig. 3 shows the flowchart for the proposed method; a short code sketch follows it.


[Fig. 3 flowchart: inputs X = {x1, ..., xn}, K_min and K_max; transform the data with the KLT (X → Y); perform the proposed initialization to obtain C0 = {c1, ..., cK}; form initial clusters and compute the mean of each cluster as the new centroids Cs = {cs1, ..., csK}; for each m from K_min to K_max, compute S_i^m for every data point and store the average in the m-th cell of S_ave^est; output K_est = ArgMax(S_ave,m^est), K_min ≤ K_est ≤ K_max.]

Fig. 3. Flowchart of the proposed method for estimating the number of clusters
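A condensed sketch of the estimation procedure is given below. It assumes the centroid-based a(i)/b(i) of step 7, uses a plain PCA projection in place of the full KLT derivation, and reuses `select_initial_centroids` from the sketch in Section 2.2.A; `estimate_k` is an illustrative name of ours and K_min ≥ 2 is assumed.

```python
import numpy as np

def estimate_k(X, k_min, k_max):
    """Estimate the number of clusters with the modified silhouette:
    PCA-transform the data, then score each candidate K with the
    centroid-based a(i)/b(i) of step 7 and return the best-scoring K."""
    # KLT/PCA: project the centered data onto the covariance eigenvectors.
    Xc = X - X.mean(axis=0)
    _, eigvecs = np.linalg.eigh(np.cov(Xc.T))
    Y = Xc @ eigvecs
    scores = {}
    for m in range(k_min, k_max + 1):
        C0 = select_initial_centroids(Y, m)    # proposed initialization
        d = np.linalg.norm(Y[:, None, :] - C0[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # New centroids: mean of the data assigned to each initial centroid.
        Cs = np.array([Y[labels == j].mean(axis=0) if np.any(labels == j)
                       else C0[j] for j in range(len(C0))])
        d = np.linalg.norm(Y[:, None, :] - Cs[None, :, :], axis=2)
        a = d.min(axis=1)                      # distance to nearest centroid
        d[np.arange(len(Y)), d.argmin(axis=1)] = np.inf
        b = d.min(axis=1)                      # distance to second-nearest
        s = (b - a) / np.maximum(a, b)         # Eq. (14)
        scores[m] = s.mean()                   # m-th cell of S_ave^est
    return max(scores, key=scores.get)         # Eq. (15)
```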

This procedure modifies the method proposed in (Rousseeuw, 1987) to provide stable results with less processing time. TABLE III provides the time complexity orders of the proposed estimation method and the silhouette algorithm.

TABLE III
Time complexity comparison of the proposed estimation method and the silhouette algorithm.

Algorithm              Time complexity
Proposed method        O(n log(n) · ΔK)
Silhouette algorithm   O(n² · ΔK)

where n is the data volume and ΔK is the difference between K_max and K_min (ΔK = K_max - K_min). The comparison demonstrates the lower time complexity of the proposed algorithm.



3. Case Studies

In this section, we evaluate the performance of the proposed method in dealing with the empty cluster problem; we then assess the proposed transformed K-means clustering algorithm and finally examine the proposed estimation method for determining the optimal number of clusters. The datasets used in the experiments are available online at the Joensuu (http://cs.uef.fi/sipu/datasets), UCI (https://archive.ics.uci.edu/ml/datasets) and Mesonet (http://mesonet.agron.iastate.edu) websites. More information regarding the data is presented in Fig. 4.

[Fig. 4: summary table of the datasets used in the case study.]

Fig. 4. Datasets used for the case study

3.1. Evaluation of the proposed method for dealing with the empty cluster generation

Three real time-series datasets of solar radiation at the Ames, Chariton and Calmar stations between 01/01/2009 and 01/01/2014 are used to calculate the number of empty clusters (N. E. C) generated by the K-means algorithm and the proposed method. The number of clusters for both algorithms is 200 (K = 200). First, the clustering is performed for one step (Steps = 1), i.e., without any data transformation, to evaluate the performance of the proposed initialization approach in dealing with the empty cluster problem. TABLE IV shows the N. E. C for the proposed method with one step and for the K-means algorithm. The results demonstrate that the proposed method significantly reduces the N. E. C as compared to the K-means algorithm. This is due to the proposed initialization approach, which properly selects the initial centroids. The N. E. C generated by the proposed method is then calculated as the number of steps increases. The results are provided in TABLE V for steps 1 to 10. The results show that the empty cluster problem is completely solved during the transformation of the data to their original positions, and the proposed clustering algorithm converges without generating any empty clusters.

TABLE IV
Performance comparison of the K-means and proposed transformed K-means on the problem of empty cluster generation

                                                N. E. C generated by:
Dataset    Number of objects  Number of clusters  K-means  Proposed method (Step=1)
Ames       43827              200                 84       1
Chariton   43827              200                 75       0
Calmar     43827              200                 27       2
TABLE V
Performance of the proposed method with different steps for the empty cluster problem

Step number   N. E. C in Ames dataset   N. E. C in Chariton dataset   N. E. C in Calmar dataset
1             1                         1                             3
2             0                         0                             0
3             1                         0                             0
4             0                         0                             0
5             0                         0                             0
6             0                         0                             0
7             0                         0                             0
8             0                         0                             0
9             0                         0                             0
10            0                         0                             0

3.2. Evaluation of the proposed transformed K-means clustering algorithm

This section evaluates the accuracy of the proposed clustering method (transformed K-means). The mean squared error (MSE) is used as the accuracy performance indicator, calculated by:



1

K N 2
MSE k 1 i 1
X i( k ) C k (16)
K .N

where N is the number of data points in the cluster k, and X i( k ) is the i-th data point in the cluster k. The testing datasets are

normalized in the range of [1, 1]. K-means clustering is run with different initialization methods including the random
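As a reference point, Eq. (16) can be evaluated directly from a clustering result; the helper below averages the squared point-to-centroid distances over all points and is our own illustrative code, not the paper's.

```python
import numpy as np

def mse(X, centroids, labels):
    """Clustering MSE per Eq. (16): mean squared Euclidean distance
    between every data point and the centroid of its assigned cluster."""
    return np.sum((X - centroids[labels]) ** 2, axis=1).mean()
```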

K-means clustering is run with different initialization methods including random-based, K-means*-based, K-means++-based and the proposed initialization method (Section 2.2.A). The calculated error values as well as the processing times are provided in TABLE VI. A comparison of the results shows that the proposed initialization method improves the accuracy performance of the K-means algorithm when compared to the other initialization methods. However, the computational complexity is increased due to the data sorting used by our initialization to optimally select the initial centroids.

TABLE VI
MSE measures and running time (sec) for the K-means algorithm with different initialization methods

              Proposed init.      K-means* based      K-means++ based      Random based
Dataset       MSE      Time(s)    MSE      Time(s)    MSE      Time(s)     MSE      Time(s)
IRIS          0.0432   0.24       0.0432   0.0891     0.0431   0.0238      0.0432   0.0248
Glass         0.0017   0.0349     0.002    0.0235     0.0017   0.0079      0.0017   0.007
Missa1        0.0094   0.0678     0.0095   0.2665     0.0094   0.8085      0.0097   0.0901
Bridge        0.0008   0.2256     0.0008   2.3118     0.0011   12.8968     0.001    0.1551
Thyroid       0.0145   0.115      0.0554   0.0261     0.0151   0.0084      0.0145   0.0062
Magic         0.0304   0.1871     0.0304   0.4238     0.0304   1.382       0.0304   0.0353
Wine          0.0255   0.0858     0.1352   0.0862     0.0255   0.0101      0.0255   0.0053
Shuttle       0.0008   0.3903     0.0009   0.9169     0.0008   14.0788     0.0008   0.0451
Pendigit      0.007    0.1973     0.0083   0.1728     0.0071   0.3836      0.0071   0.0124
Wdbc          0.0206   0.0274     0.0231   0.0279     0.0206   0.0073      0.0206   0.0053
Yeast         0.0053   0.0373     0.0097   0.3779     0.0053   0.0997      0.0053   0.0311
P. I. D       0.0536   0.0265     0.1013   0.0429     0.054    0.0137      0.0536   0.0086
Olitos        0.0162   0.0148     0.0517   0.1279     0.0162   0.0059      0.0161   0.005
Heart         0.0358   0.0138     0.0574   0.0279     0.0358   0.0113      0.0358   0.0065
Ionosphere    0.081    0.0167     0.096    0.0311     0.081    0.0076      0.081    0.0058
M. Libras     0.0064   0.0185     0.0094   1.3832     0.0066   0.0352      0.0058   0.0061
Spambase      0.01     0.049      0.0125   0.13       0.01     0.0894      0.01     0.0089
Waveform      0.0114   0.0675     0.0159   0.1521     0.0114   0.2114      0.0114   0.0176
a1            0.0057   0.1083     0.0065   0.6655     0.0061   0.3598      0.0059   0.0367
s1            0.0094   0.0869     0.01     0.6682     0.0101   0.7354      0.0095   0.0358



The MSE value is calculated for different data clustering methods including the proposed transformed K-means, ABC-K-means, DE-K-means, GA-K-means, K-means*, K-means++, K-means, SOM, GTSOM, SOM++, K-medoids and FCM, and provided in TABLE VII. The calculated MSE values show that the transformed K-means (with Steps = 20) outperforms or competes with the existing methods in terms of clustering quality. The improved clustering quality is the result of several procedures embraced by our proposed method, namely the determination of the optimal number of clusters, the proposed initialization, and the gradual data transformation.

The proposed transformed K-means algorithm has a faster processing time compared to ABC-K-means, DE-K-means, GA-K-means, SOM, GTSOM and SOM++, and competes with FCM, K-medoids, K-means* and K-means++. For small- and medium-sized data, our proposed method is generally more time consuming than K-medoids, K-means* and K-means++. However, the reduced convergence rate of K-means* in dealing with empty cluster generation, particularly for higher numbers of clusters, and the sequential initialization of K-means++ increase the computational complexity of these methods for large data volumes. This is evident from our running time results for large datasets such as Missa1 and Shuttle, where the proposed transformed K-means converges faster than K-means* and K-means++.


TABLE VII
MSE measure and the running time (sec) for different clustering techniques. Column order: Proposed method, ABC-K-means, DE-K-means, GA-K-means, K-means*, K-means++, K-means, SOM, GTSOM, SOM++, K-medoids, FCM.

IRIS        MSE:     0.0432 0.0475 0.043 0.043 0.0432 0.0432 0.0432 0.0432 0.0433 0.0432 0.0432 0.0432
            Time(s): 0.5754 1.2629 1.4003 1.6738 1.4858 0.0063 0.0061 2.3584 1.7599 2.4825 0.0127 0.2132
Glass       MSE:     0.0017 0.0019 0.0018 0.0017 0.0017 0.0018 0.0017 0.0017 0.0019 0.0017 0.0017 0.0017
            Time(s): 0.6301 1.2252 1.3897 1.6864 0.1381 0.0134 0.0058 2.3659 1.8439 2.3572 0.0311 0.0134
Missa1      MSE:     0.0001 0.0002 0.0002 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.005
            Time(s): 3.9664 45.2259 60.29 81.2574 47.813 37.6179 0.3216 3.4526 10.5227 3.4381 47.6677 4.7857
Bridge      MSE:     0.0008 0.0018 0.0015 0.0013 0.0009 0.0013 0.001 0.001 0.0009 0.0012 0.0009 0.0038
            Time(s): 2.4381 29.2082 38.8446 52.4054 18.794 10.5017 0.1362 3.3597 10.6182 3.3557 18.0322 15.5823
Thyroid     MSE:     0.0134 0.0147 0.0134 0.0133 0.0591 0.0151 0.0145 0.0167 0.0169 0.0168 0.0146 0.0149
            Time(s): 0.6331 1.2185 1.3674 1.6443 0.3718 0.0087 0.0069 2.3265 1.7153 2.3346 0.0394 0.4068
Magic       MSE:     0.0271 0.0325 0.0295 0.0295 0.0304 0.0304 0.0304 0.0307 0.0306 0.0304 0.0304 0.0304
            Time(s): 0.9886 2.6975 3.2338 4.1301 0.7922 1.3478 0.0373 2.3212 1.684 2.3245 3.777 0.7571
Wine        MSE:     0.0255 0.0277 0.0254 0.0252 0.1349 0.0256 0.0255 0.0257 0.0255 0.0255 0.0256 0.0256
            Time(s): 0.6183 1.1893 1.3275 1.5914 1.8411 0.0059 0.006 2.3257 1.7137 2.3216 0.0127 0.4052
Shuttle     MSE:     0.0007 0.0008 0.0008 0.0007 0.0009 0.0008 0.0008 0.0008 0.0009 0.0008 0.0008 0.0009
            Time(s): 1.2867 5.363 6.7498 9.2292 1.4791 13.8573 0.047 2.3333 1.7205 2.3379 4.9705 0.9686
Pendigit    MSE:     0.0069 0.0085 0.0077 0.0073 0.0079 0.0071 0.0071 0.0069 0.0071 0.007 0.007 0.0969
            Time(s): 0.814 2.3004 2.8268 3.6743 7.2133 0.3607 0.018 2.3571 1.969 2.3645 1.6015 1.3866
Wdbc        MSE:     0.0138 0.0152 0.0138 0.0138 0.0231 0.0206 0.0206 0.0188 0.0243 0.0205 0.0192 0.0205
            Time(s): 0.6232 1.2157 1.3701 1.6524 0.2455 0.0057 0.0044 2.3226 1.6766 2.3566 0.017 0.1047
Yeast       MSE:     0.0051 0.0062 0.0057 0.0055 0.0088 0.0055 0.0053 0.0052 0.0054 0.0052 0.0054 0.0058
            Time(s): 0.6999 1.6819 1.9743 2.4665 7.0087 0.0898 0.016 2.3747 1.9684 2.3955 1.0383 0.2853
P. I. D     MSE:     0.0438 0.0474 0.0431 0.0431 0.1013 0.0536 0.0536 0.0525 0.0541 0.0574 0.0537 0.0512
            Time(s): 0.6353 1.2327 1.3823 1.698 0.2842 0.008 0.0056 2.337 1.6765 2.3256 0.0762 0.4109
Olitos      MSE:     0.0153 0.017 0.0156 0.0153 0.051 0.0162 0.0167 0.0162 0.0161 0.0161 0.0161 0.016
            Time(s): 0.5645 1.2042 1.3552 1.6307 1.3999 0.0101 0.0065 2.337 1.8621 2.3347 0.0259 0.5152
Heart       MSE:     0.0352 0.0387 0.0352 0.0352 0.0574 0.0358 0.0358 0.0354 0.0359 0.0356 0.0358 0.0359
            Time(s): 0.6091 1.1922 1.342 1.6161 0.3494 0.0052 0.0057 2.3168 1.6736 2.3149 0.0408 0.3544
Ionosphere  MSE:     0.081 0.0891 0.081 0.081 0.0959 0.081 0.081 0.0811 0.081 0.081 0.081 0.0813
            Time(s): 0.6061 1.2029 1.3576 1.6325 0.2635 0.0053 0.0052 2.3253 1.6762 2.3317 0.0175 0.4333
M. Libras   MSE:     0.0055 0.0081 0.0072 0.006 0.0094 0.0063 0.0062 0.0055 0.0058 0.0057 0.0056 0.0057
            Time(s): 0.6293 1.335 1.5315 1.8818 15.938 0.0252 0.01 2.3821 2.1349 2.3865 0.1281 0.8592
Spambase    MSE:     0.0078 0.0087 0.0079 0.0079 0.0098 0.01 0.01 0.01 0.0099 0.0097 0.01 0.0091
            Time(s): 0.6714 1.4971 1.7339 2.1458 0.4771 0.079 0.0092 2.3608 1.684 2.405 0.6408 0.541
Waveform    MSE:     0.0114 0.0122 0.0111 0.0111 0.0158 0.0114 0.0114 0.0116 0.0114 0.0115 0.0114 0.0114
            Time(s): 0.7298 1.7551 2.0672 2.5838 0.9814 0.2015 0.014 2.3478 1.718 2.3255 1.0042 0.6661
a1          MSE:     0.0057 0.0088 0.0075 0.0066 0.0067 0.0077 0.0058 0.0057 0.0058 0.0059 0.0058 0.0058
            Time(s): 0.9649 3.0271 3.6953 4.707 14.172 0.4251 0.0481 2.3947 2.3222 2.4029 2.984 1.027
s1          MSE:     0.0094 0.0131 0.0114 0.0104 0.0102 0.0104 0.0108 0.0096 0.0098 0.0098 0.0114 0.0103
            Time(s): 1.2519 3.7156 4.5436 5.7588 13.730 0.6793 0.06 2.3753 2.1462 2.6723 4.8069 1.4703

3.3 Evaluation of the estimation method for determining the number of clusters

This section evaluates the performance of the proposed method in estimating the number of clusters. The proposed estimation method is used to calculate the number of clusters for different datasets, and the results are compared with the numbers determined by the silhouette, Calinski-Harabasz (Caliński and Harabasz, 1974), Davies-Bouldin (Davies and Bouldin, 1979) and Gap (Tibshirani et al., 2000) methods. TABLE VIII provides the comparison.


TABLE VIII
Performance comparison of different techniques for determining the number of clusters. For each method, "Est." is the number of estimated clusters and "Time(s)" is the running time.

                      Proposed method    Silhouette+kmeans  Calinski-Harabasz+kmeans  Davies-Bouldin+kmeans  Gap+kmeans
Dataset      Real K   Est.   Time(s)     Est.   Time(s)     Est.   Time(s)            Est.   Time(s)         Est.   Time(s)
a1           20       19     6.1932      17     25.771      21     16.435             18     12.735          20     45.4
Glass        6        9      0.5460      14     0.9672      2      3.8382             31     4.4005          48     25.493
IRIS         3        2      0.7176      2      0.8736      2      2.0415             2      2.2817          43     23.656
Ionosphere   2        2      1.6068      5      3.8376      2      10.154             50     11.146          50     30.376
Newthyroid   3        3      1.186       4      1.5288      3      3.9329             3      4.3865          47     24.276
Pendigit     10       10     9.9373      15     56.519      3      40.15              8      42.363          50     76.589
P. I. D      2        2      2.3712      2      6.5364      3      11.640             3      12.964          42     33.739
s1           15       15     10.9825     13     53.430      16     22.457             13     21.579          15     71.095
Spambase     2        2      23.3221     2      565.50      30     353.06             2      380.40          49     160.60
Wine         3        3      0.9984      3      1.0764      37     2.3514             7      2.5995          39     24.542
Wdbc         2        2      2.2308      6      5.0544      40     5.9540             4      6.4251          50     32.395
Heart        2        2      4.3531      2      28.18       2      20.243             48     16.373          2      29.309
Shuttle      7        5      65.6080     2      741.52      2      423.87             2      497.51          10     819.32
Waveform     3        3      24.6797     2      795.24      2      72.841             2      96.211          8      186.56
Olitos       4        2      1.0713      2      4.0011      2      1.2951             44     1.3393          50     27.393
Yeast        10       3      1.9232      3      46.698      3      2.3961             5      3.4387          47     39.521
M. Libras    15       12     1.0857      29     14.013      2      1.3357             49     0.3777          48     31.997
Magic        2        2      12.6995     2      426.94      2      8.2453             10     6.8662          2      578.32

The comparison shows that while working much faster, the proposed estimation method is more accurate than the other estimation methods.

4. Conclusion

Clustering provides a knowledge acquisition method for intelligent applications to develop rule-based expert systems. This paper proposes an improved version of the K-means clustering algorithm named transformed K-means. The proposed clustering method is a combination of a new initialization technique, the K-means algorithm and a new gradual data transformation approach that presents more accurate clustering results on real datasets when compared to other K-means based algorithms. By selecting initial centroids that are closer to the optimum centroid locations, the proposed initialization approach overcomes the limitation of the methods in (Lloyd, 1982), (Dunn, 1973) and (Bezdek et al., 1984) in properly initiating the K-means clustering. The inverse transformation gradually moves the artificial data back to their original places. During this process, the clustering centroids are updated after any change in the data structure. This provides more optimal clustering results for both synthetic and real datasets to address the drawback of the forecasting models in (Arthur and Vassilvitskii, 2007) and (Mahesh Kumar and Rama Mohan Reddy, 2016). In addition, the proposed data transformation solves the problem of empty cluster generation associated with the K-means clustering method and its improved versions. An efficient method based on the principal component transformation and a modified silhouette algorithm is also proposed in this paper to determine the number of clusters for cases where the number is not specified in advance (Arthur and Vassilvitskii, 2007), (Chen et al., 2016) and (Silva Filho et al., 2015). Finally, the proposed clustering method addresses the time and computational burden associated with the models in (Kwedlo, 2011) and (Malinen et al., 2014).

Several experiments are performed to evaluate: 1) the proposed method for dealing with empty cluster generation; 2) the proposed transformed K-means clustering algorithm; and 3) the estimation method for determining the number of clusters. For the first experiment, three real time-series datasets of solar radiation at the Ames, Chariton and Calmar stations are used to calculate the number of empty clusters (N. E. C) generated by the K-means algorithm and the proposed method. The results demonstrated that the proposed method significantly reduces the N. E. C as compared to the K-means algorithm. The empty cluster problem was then completely solved by the proposed data transformation approach, which guarantees the convergence of the algorithm without any empty clusters.

For the second experiment, K-means clustering was run with different initialization methods including random-based, K-means*-based, K-means++-based and the proposed initialization method. Simulation results showed that the proposed initialization method improves the accuracy performance of the K-means algorithm when compared to the other initialization methods. However, the computational complexity was increased due to the data sorting used by our initialization to optimally select the initial centroids. The performance of the proposed transformed K-means was also evaluated using several different real datasets and compared with different variants of K-means clustering as well as the SOM, SOM++, FCM, K-medoids and GTSOM clustering algorithms. The comparison demonstrated the improved quality of the clustering for the proposed transformed K-means. The proposed transformed K-means algorithm provided a faster processing time compared to ABC-K-means, DE-K-means, GA-K-means, SOM, GTSOM and SOM++, and competed with FCM, K-medoids, K-means* and K-means++. For small- and medium-sized data, our proposed method was shown to be generally more time consuming than K-medoids, K-means* and K-means++. However, it converged faster than K-means* and K-means++ for large data volumes.

For the third experiment, the proposed estimation method was evaluated and compared with other estimation techniques for determining the number of clusters. The comparison showed that while working much faster, the proposed estimation method was more accurate than the other estimation methods.

Acknowledgment

The authors would like to thank Prof. P. Fränti and Mr. M. Malinen for their valuable technical advice.

References

Abdul Nazeer, K. A., & Sebastian, M. P. (2009) Improving the accuracy and efficiency of the K-means clustering algorithm. Proceedings of the World Congress on Engineering (WCE), Vol. 1, pp 1-3.

Arthur, D., & Vassilvitskii, S. (2007) K-means++: the advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms, New Orleans, Louisiana, Society for Industrial & Applied Mathematics, pp 1027-1035.

Bahmani, B., Moseley, B., Vattani, A., Kumar, R., & Vassilvitskii, S. (2012) Scalable K-means++. Proc. VLDB Endow., vol. 5, no. 7, pp 622-633.

Bezdek, J. C., Ehrlich, R., & Full, W. (1984) FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, vol. 10, no. 2-3, pp 191-203.

Caliński, T., & Harabasz, J. (1974) A dendrite method for cluster analysis. Communications in Statistics, vol. 3, no. 1, pp 1-27.

Celebi, M. E., Kingravi, H., & Vela, P. A. (2013) A comparative study of efficient initialization methods for the K-means clustering algorithm. Expert Systems with Applications, vol. 40, no. 1, pp 200-210.

Chen, M., Li, L., Wang, B., Cheng, J., Pan, L., & Chen, X. (2016) Effectively clustering by finding density backbone based-on kNN. Pattern Recognition, vol. 60, pp 486-498.

Chen, X. (2015) A new clustering algorithm based on near neighbor influence. Expert Systems with Applications, vol. 42, pp 7746-7758.

Cui, X., Zhu, P., Yang, X., Li, K., & Ji, C. (2014) Optimized big data K-means clustering using MapReduce. The Journal of Supercomputing, vol. 70, no. 3, pp 1249-1259.

Davies, D. L., & Bouldin, D. W. (1979) A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, no. 2, pp 224-227.

Dogan, Y., Birant, D., & Kut, A. (2013) SOM++: Integration of self-organizing map and K-means++ algorithms. Machine Learning and Data Mining in Pattern Recognition, P. Perner (Ed.), Springer Berlin Heidelberg, vol. 7988, pp 246-259.

Dunn, J. C. (1973) A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, vol. 3, no. 3, pp 32-57.

Herbert, J., & Yao, J. (2007) GTSOM: Game theoretic self-organizing maps. Trends in Neural Computation, K. Chen and L. Wang (Eds.), Springer Berlin Heidelberg, vol. 35, pp 199-223.

http://cs.uef.fi/sipu/datasets

http://mesonet.agron.iastate.edu

https://archive.ics.uci.edu/ml/datasets

Kaufman, L., & Rousseeuw, P. J. (1987) Clustering by means of medoids. In Y. Dodge (Ed.), Statistical Data Analysis Based on the L1 Norm and Related Methods, pp 405-416.

Kohonen, T. (1990) The self-organizing map. Proceedings of the IEEE, vol. 78, no. 9, pp 1464-1480.

Krishna, K., & Murty, M. N. (1999) Genetic K-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 29, no. 3, pp 433-439.

Kwedlo, W. (2011) A clustering method combining differential evolution with the K-means algorithm. Pattern Recognition Letters, vol. 32, no. 12, pp 1613-1621.

Likas, A., Vlassis, N., & Verbeek, J. J. (2003) The global K-means clustering algorithm. Pattern Recognition, vol. 36, no. 2, pp 451-461.

Liu, H., & Ban, X.-j. (2015) Clustering by growing incremental self-organizing neural network. Expert Systems with Applications, vol. 42, pp 4965-4981.

Lloyd, S. P. (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory, vol. 28, no. 2, pp 129-136.

Mahesh Kumar, K., & Rama Mohan Reddy, A. (2016) A fast DBSCAN clustering algorithm by accelerating neighbor searching using Groups method. Pattern Recognition, vol. 58, pp 39-48.

Malinen, M., Mariescu-Istodor, R., & Fränti, P. (2014) K-means*: Clustering by gradual data transformation. Pattern Recognition, vol. 47, no. 10, pp 3376-3386.

Markic, B., & Tomic, D. (2010) Marketing intelligent system for customer segmentation. In J. Casillas & F. J. Martínez-López (Eds.), Marketing Intelligent Systems Using Soft Computing: Managerial and Research Applications, Springer Berlin Heidelberg, pp 79-111.

Rousseeuw, P. J. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, vol. 20, pp 53-65.

Shuliang, L., Barry, D., Edwards, J., Kinman, R., & Duan, Y. (2002) Integrating group Delphi, fuzzy logic and expert systems for marketing strategy development: the hybridisation and its effectiveness. Marketing Intelligence & Planning, vol. 20, no. 5, pp 273-284.

Silva Filho, T. M., Pimentel, B. A., Souza, R. M. C. R., & Oliveira, A. L. I. (2015) Hybrid methods for fuzzy clustering based on fuzzy c-means and improved particle swarm optimization. Expert Systems with Applications, vol. 42, pp 6315-6328.

Tibshirani, R., Walther, G., & Hastie, T. (2000) Estimating the number of data clusters via the Gap statistic. Journal of the Royal Statistical Society, B. 63, pp 411-423.

Wang, L., Liu, X., & Mu, Y. (2013) The global k-means clustering analysis based on multi-granulations nearness neighborhood. Mathematics in Computer Science, vol. 7, no. 1, pp 113-124.

Yang, B.-r., Li, H., & Qian, W.-b. (2012) The cognitive-base knowledge acquisition in expert system. In H. Tan (Ed.), Technology for Education and Learning, Springer Berlin Heidelberg, pp 73-80.

Zhang, C., Ouyang, D., & Ning, J. (2010) An artificial bee colony approach for clustering. Expert Systems with Applications, vol. 37, no. 7, pp 4761-4767.

Biographies
Rasool Azimi received his B.Sc. degree in Software Engineering from Mehrastan University, Guilan, Iran, in 2011 and the M.Sc. degree from the Science and Research Branch, Islamic Azad University, Qazvin, Iran, in 2014. His research interests include distributed data mining, data clustering, artificial intelligence and their applications in power systems.

Mohadeseh Ghayekhloo received her B.Sc. degree in Computer Engineering from Mazandaran University of Science and Technology, Babol, Iran, and the M.Sc. degree from the Science and Research Branch, Islamic Azad University, Qazvin, Iran, in 2011 and 2014, respectively. Her research interests include optimization algorithms, artificial neural networks, computational intelligence and their applications in power systems.

Mahmoud Ghofrani received his B.Sc. degree in Electrical Engineering from Amirkabir University of Technology, Tehran, Iran, in 2005, the M.Sc. degree from the University of Tehran, Tehran, Iran, in 2008, and the Ph.D. degree from the University of Nevada, Reno, in 2014. He is currently an Assistant Professor at the School of Science, Technology, Engineering and Mathematics, University of Washington, Bothell. His research interests include power system operation and planning, renewable energy systems, smart grids, electric vehicles and electricity markets.

Hedieh Sajedi received her B.Sc. degree in Computer Engineering from Amirkabir University of Technology in 2003, and the M.Sc. and Ph.D. degrees in Computer Engineering (Artificial Intelligence) from Sharif University of Technology, Tehran, Iran, in 2006 and 2010, respectively. She is currently an Assistant Professor at the Department of Computer Science, University of Tehran, Iran. Her research interests include multimedia data hiding, steganography and steganalysis methods, pattern recognition, and machine learning.
