
Accepted Manuscript

A novel clustering algorithm based on data transformation

Rasool Azimi , Mohadeseh Ghayekhloo , Mahmoud Ghofrani , Hedieh Sajedi

PII: S0957-4174(17)30034-9
DOI: 10.1016/j.eswa.2017.01.024
Reference: ESWA 11072

To appear in: Expert Systems With Applications

Received date: 28 November 2015
Revised date: 29 October 2016
Accepted date: 24 January 2017

Please cite this article as: Rasool Azimi, Mohadeseh Ghayekhloo, Mahmoud Ghofrani, Hedieh Sajedi, A novel clustering algorithm based on data transformation, Expert Systems With Applications (2017), doi: 10.1016/j.eswa.2017.01.024

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Highlights

A new initialization technique is proposed to improve the performance of K-means.

A data transformation approach is proposed to solve the empty cluster problem.

An efficient method is proposed to estimate the optimal number of clusters.

The proposed clustering method provides more accurate clustering results.


A Novel Clustering Algorithm Based on Data Transformation

Rasool Azimi a (r.azimi@qiau.ac.ir), Mohadeseh Ghayekhloo b (m.ghayekhloo@qiau.ac.ir), Mahmoud Ghofrani c,* (mghofrani@uwb.edu), Hedieh Sajedi d (hhsajedi@ut.ac.ir)

a Computer and Information Technology Engineering, Qazvin Branch, Islamic Azad University, Qazvin, Iran.

b Young Researchers and Elite Club, Qazvin Branch, Islamic Azad University, Qazvin, Iran.

c School of Science, Technology, Engineering and Mathematics (STEM), University of Washington, Bothell, USA.

d Department of Computer Science, College of Science, University of Tehran, Tehran, Iran.

*Corresponding author: UWBB room 227, 18807 Beardslee Blvd, Bothell, WA 98011, USA. Fax: 425.352.3775

Abstract— Clustering provides a knowledge acquisition method for intelligent systems. This paper proposes a novel data-clustering algorithm that combines a new initialization technique, the K-means algorithm, and a new gradual data transformation approach to provide more accurate clustering results than the K-means algorithm and its variants by increasing the clusters' coherence. The proposed data transformation approach solves the problem of generating empty clusters, which frequently occurs in other clustering algorithms. An efficient method based on the principal component transformation and a modified silhouette algorithm is also proposed in this paper to determine the number of clusters. Several different datasets are used to evaluate the efficacy of the proposed method in dealing with the empty cluster generation problem, and its accuracy and computational performance in comparison with other K-means based initialization techniques and clustering methods. The developed estimation method for determining the number of clusters is also evaluated and compared with other estimation algorithms. Significances of the proposed method include addressing the limitations of K-means based clustering and improving the accuracy of clustering as an important method in the field of data mining and expert systems. Application of the proposed method for knowledge acquisition in time series data such as wind, solar, electric load and stock market series provides a pre-processing tool to select the most appropriate data to feed into neural networks or other estimators used for forecasting such time series. In addition, utilization of the knowledge discovered by the proposed K-means clustering to develop rule-based expert systems is one of the main impacts of the proposed method.

Index Terms— Data mining, clustering, K-means, data transformation, silhouette, transformed K-means

1. Introduction

Expert systems are computer applications that contain stored knowledge and are developed to solve problems in a specific field in almost the same way a human expert would (Shuliang et al., 2002). Acquisition of expert knowledge is a challenge in developing such expert systems (Yang et al., 2012). One of the major problems and most difficult tasks in developing rule-based expert systems is representing the knowledge discovered by data clustering (Markic and Tomic, 2010). The K-means algorithm is one of the most commonly used clustering techniques; it uses a data reassignment method to iteratively optimize the clustering (Lloyd, 1982). The main goal of clustering is to generate compact groups of objects or data that share similar patterns within the same cluster, and to isolate these groups from those containing elements with different characteristics.

Although the K-means algorithm has features such as simplicity and high convergence speed, it is totally dependent on the initial centroids, which are randomly selected in the first phase of the algorithm. Due to this random selection, the algorithm may converge to locally optimal solutions (Celebi et al., 2013). Different variants of the K-means algorithm have been proposed to address this limitation. The K-medoids algorithm was proposed in (Kaufman and Rousseeuw, 1987) to define each cluster by its most central medoid. First, K data points are considered as initial centroids (medoids) and each data point is assigned to the closest medoid, forming the initial clusters. In an iterative process, the most central data point in each cluster is then considered as the new centroid and each data point is assigned to the nearest centroid. The remaining steps of this algorithm match the K-means procedure. Fuzzy C-means (FCM) clustering introduced the partial membership concept (Dunn, 1973; Bezdek et al., 1984). In the FCM algorithm, each data point belongs to all clusters; the degree of belonging is represented by a partial membership determined by a fuzzy clustering matrix. A genetic algorithm-based K-means (GA-K-means) was proposed in (Krishna and Murty, 1999) to provide a global optimum for the clustering. In this method, the K-means algorithm was used as a search operator instead of crossover, and a biased mutation operator was also proposed to help the K-means algorithm avoid local minima. The global K-means algorithm was developed in (Likas et al., 2003) to provide an experimentally optimal solution for clustering problems. However, it is not appropriate for clustering medium-sized and large-scale datasets due to its heavy computational burden. The K-means++ initialization algorithm was proposed in (Arthur and Vassilvitskii, 2007) for obtaining an initial set of centroids that is near-optimal. The main drawback of K-means++ is its inherent sequential nature, which limits the effectiveness of the method for high-volume data. An artificial bee colony K-means (ABC-K-means) clustering approach was proposed in (Zhang et al., 2010) for optimal partitioning of data objects into a fixed number of clusters. A hybrid of differential evolution and K-means named DE-K-means was introduced in (Kwedlo, 2011); the differential evolution algorithm was used as a global optimization method and the resulting clustering solutions were fine-tuned and corrected using the K-means algorithm. Dogan et al. proposed a hybrid of K-means++ and the self-organizing map (SOM) (Kohonen, 1990) to improve clustering accuracy. It first uses the K-means++ initialization method to determine the initial weight values and starting points, and then uses SOM to find an appropriate final clustering solution.

However, the aforementioned limitation of K-means++ was not addressed. A new clustering technique using a combination of the global K-means algorithm and the topology neighborhood based on Axiomatic Fuzzy Sets (AFS) theory was developed in (Wang et al., 2013) to determine initial centroids. A new clustering algorithm, named K-means*, was presented in (Malinen et al., 2014) that generates an artificial dataset X* as the input data. The input data are then mapped one-by-one to the generated artificial data (X → X*). Next, the inverse transformation of the artificial data to the original data is performed by a series of gradual transformations. To do so, the K-means algorithm updates the clustering model after each transformation and moves the data vectors slowly back to their original positions. The K-means* algorithm uses a random data swapping strategy to deal with the problem of generating empty clusters. However, the random selection of data vectors as cluster centroids may reduce the other clusters' coherence and decrease the efficiency of the K-means* algorithm. Moreover, the convergence rate of the K-means* algorithm decreases significantly as the number of clusters increases, especially with increasing data volumes. Density-based clustering methods were proposed in (Mahesh Kumar and Rama Mohan Reddy, 2016) to speed up the neighbor search for clustering spatial databases with noise. Density Based Spatial Clustering of Applications with Noise (DBSCAN) provides a graph-based index structure for high-dimensional data with a large amount of noise. It was shown that the running time of the proposed method is faster than DBSCAN with exactly the same clustering results, and that the method solves the inefficacy of DBSCAN in handling clusters with large differences in densities. A novel clustering algorithm named CLUB (CLUstering based on Backbone) was developed in (Chen et al., 2016) to determine optimal clusters. First, the algorithm detects the initial clusters and finds their density backbones. Then, it identifies the outliers in each cluster based on the K Nearest Neighbour (KNN) method. Finally, by assigning each unlabeled point to the cluster with the nearest higher-density neighbour, the algorithm yields the final clusters. CLUB has several drawbacks: the KNN method lacks an efficient algorithm to determine the value of the parameter K (the number of nearest neighbors), and its computational cost is high because it requires calculating the distance of each query instance to all training samples. Two particle swarm optimization (PSO) based fuzzy clustering methods were proposed in (Silva Filho et al., 2015) to deal with the shortcomings of the PSO algorithms used for fuzzy clustering. The proposed methods adjust the parameters of PSO dynamically to achieve a balance between exploration and exploitation and to avoid trapping in local optima. However, these methods lack precision for high-dimensional applications, and their iterative process significantly decreases the convergence rate. Generally, the speed at which a convergent sequence approaches its limit is defined as the rate of convergence. Three clustering algorithms named Near Neighbor Influence (CNNI), an improved version in time cost of Near Neighbor Influence (ICNNI), and a variation of Near Neighbor Influence (VCNNI) were presented in (Chen, 2015). The clustering results showed that ICNNI is faster than CNNI and that CNNI requires less space than VCNNI. These methods suffer from large-scale computing and storage requirements. A growing incremental self-organizing neural network (GISONN) was developed in (Liu and Ban, 2015) to select appropriate clusters by learning the data distribution of each cluster. The method is, however, not applicable to large-volume or high-dimensional datasets due to its computational complexity. In addition, the neighborhood-preserving feature of the algorithm is violated when the output space topology does not match the structure of the data in the input space.

In spite of the improved performance of the K-means variants for synthetic datasets with Gaussian distribution, their performance on real datasets is neither very promising nor different from the original K-means algorithm. In addition, all K-means based algorithms lack an efficient method to determine the optimal number of clusters. This requires the user to determine the number of clusters either arbitrarily or based on practical and experimental estimates, which might not be optimal.

In this paper, we propose a novel clustering approach called transformed K-means to provide more accurate clustering results compared to the K-means algorithm and its improved versions. The proposed clustering method combines a new initialization technique, the K-means algorithm, and a new gradual data transformation approach to appropriately select the initial cluster centroids and move the real data into the locations of the initial cluster centroids that are closest to the actual positions of the associated data. By doing this, the data are placed in an artificial structure that properly initiates the K-means clustering. The inverse transformation is then performed to gradually move the artificial data back to their original places. During this process, K-means updates the clustering centroids after each change in the data structure. This provides more optimal clustering results for both synthetic and real datasets. In addition, the proposed data transformation solves the empty cluster problem of the K-means algorithm and its improved versions. An efficient method based on the principal component transformation and a modified silhouette algorithm is also proposed in this paper to determine the optimal number of clusters for the K-means algorithms.

The proposed clustering method develops a rule-based expert system by means of knowledge acquisition through data transformation. Significances of the proposed method include addressing the limitations of K-means based clustering and improving the accuracy of clustering as an important method in the field of data mining and expert systems. The proposed method can be used for intelligent system applications such as forecasting time series including solar, wind, load and stock market series.

Contributions of the paper are outlined as follows:

1. A new initialization technique is proposed to select initial centroids that are closer to the optimum centroid locations.
2. A novel gradual data transformation approach is proposed to significantly reduce the number of empty clusters generated by K-means based algorithms.
3. An efficient method is proposed to estimate the optimal number of clusters.
4. A hybrid clustering algorithm is developed by combining the proposed initialization, data transformation and cluster number estimation to provide a better knowledge discovery of the input patterns and more accurate clustering results.

The rest of the paper is organized as follows. Section 2 provides a brief description of the K-means algorithm and explains the proposed clustering method. Section 3 demonstrates a case study where the performance of the developed clustering method is evaluated by several experiments. Finally, Section 4 concludes the paper.

2. Methodology

2.1 K-means algorithm

K-means (Lloyd, 1982) is a well-known, low-complexity algorithm for data partitioning. The algorithm starts running after an input of K clusters is given, and outputs the cluster centroids through iterations. Let X = [x_1, ..., x_n] be the set of n points to be grouped into K different cluster (partition) sets C = {c_p}, p = 1, 2, ..., K. By means of the Euclidean distance, the algorithm assigns each data point to its closest centroid c_p, calculated by:

$$c_p = \frac{1}{n_p}\sum_{i=1}^{n_p} x_i^{(p)} \qquad (1)$$

where $x_i^{(p)}$ is the i-th data point in cluster p, and $n_p$ is the number of data points in the respective cluster.

After the first run, the algorithm calculates the mean of the data points in each cluster c_p and selects this value as the new cluster centroid, starting a new iteration. As new clusters are selected, a new mean value is obtained. The algorithm halts once the sum of the squared errors over the K clusters is minimized (Cui et al., 2014).
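As a concrete reference for the update rule of Eq. (1), the following is a minimal Python/NumPy sketch of Lloyd's K-means. The function name, signature, and the empty-cluster guard are illustrative choices, not the authors' implementation.

```python
import numpy as np

def kmeans(X, init_centroids, max_iter=100):
    """Minimal Lloyd's K-means: assign each point to its nearest centroid,
    then recompute each centroid as the cluster mean (Eq. (1))."""
    C = init_centroids.astype(float).copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: c_p = (1/n_p) * sum of the points assigned to cluster p.
        new_C = np.array([X[labels == p].mean(axis=0) if np.any(labels == p)
                          else C[p] for p in range(len(C))])
        if np.allclose(new_C, C):
            break  # centroids stable: the SSE objective has converged
        C = new_C
    return C, labels
```

Keeping the old centroid when a cluster becomes empty (the `else C[p]` branch) is one common guard; the proposed method of Section 2.2 avoids the problem by construction instead.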

2.2 The Proposed Clustering Method

An improved version of the K-means algorithm, named transformed K-means, is proposed in this section. The proposed clustering algorithm uses a combination of a new technique to select the initial cluster centroids and a new approach for the inverse transformation of the data to enhance the clustering performance. The steps of the transformed K-means algorithm are as follows:

A. Initial centroids selection

Let X = [x_1, ..., x_n] be a set of n data vectors. The selection of the K initial centroids proceeds as follows:

1) Remove duplicate data vectors and store the unique vectors in a new dataset $X' = [(x'_1, r_1), \ldots, (x'_m, r_m)]$, where $r_i$ is the repetition number of each non-repetitive data vector $x'_i$ in the new dataset $X'$ ($i \le m \le n$).

2) Sort the data vectors in the dataset $X'$ in ascending order based on the Euclidean length of the vectors.

3) Divide the dataset $X'$, consisting of m data vectors, into K sub-datasets with (at most) $S = \lceil m/K \rceil$ vectors each, according to Eq. (2), such that the data elements of $X'$ are distributed among the sub-datasets $X'_1$ to $X'_K$:

$$
\begin{aligned}
X'_1 &= [(x'_1, r_1), \ldots, (x'_S, r_S)],\\
X'_2 &= [(x'_{S+1}, r_{S+1}), \ldots, (x'_{2S}, r_{2S})],\\
X'_3 &= [(x'_{2S+1}, r_{2S+1}), \ldots, (x'_{3S}, r_{3S})],\\
&\;\;\vdots\\
X'_K &= [(x'_{(K-1)S+1}, r_{(K-1)S+1}), \ldots, (x'_{KS}, r_{KS})],
\qquad X' = \bigcup_{k=1}^{K} X'_k
\end{aligned}
\qquad (2)
$$

where $r_i$ is the repetition number for the i-th data vector.

4) We now have K sub-datasets $\{X'_1, X'_2, \ldots, X'_K\}$, each of which is used to determine exactly one of the K initial centroids. Eq. (3) is used to calculate a weight attribute $w(x'_i)_m$ for each data entry $x'_i$ with repetition number $r_i$ in each of the K sub-datasets:

$$w(x'_i)_m = \frac{r_i}{\dfrac{1}{S}\displaystyle\sum_{j=1}^{S} \mathrm{dist}(x'_i,\, x'_j)}, \qquad (1 \le m \le K) \qquad (3)$$

where $w(x'_i)_m$ is the weight attribute for $x'_i$ in the m-th sub-dataset.

5) In each of the K sub-datasets, the data entry with the highest weight attribute is selected as the initial centroid.

Fig. 1 shows the flowchart of our proposed method for selecting initial centroids.


Fig. 1. Flowchart of the proposed method for the initial centroids selection
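The following Python sketch illustrates steps 1 through 5 under one reading of Eq. (3), namely that the weight of $x'_i$ is its repetition count divided by its mean distance to the members of the same sub-dataset. Treat the exact weight formula, and the function name, as a reconstruction rather than the authors' code.

```python
import numpy as np

def proposed_initialization(X, K):
    """Sketch of the initial-centroid selection of Section 2.2.A."""
    # 1) Remove duplicates; r[i] is the repetition count of unique row i.
    Xu, r = np.unique(X, axis=0, return_counts=True)
    # 2) Sort the unique vectors by Euclidean length.
    order = np.argsort(np.linalg.norm(Xu, axis=1))
    Xu, r = Xu[order], r[order]
    # 3) Split into K sub-datasets of at most S = ceil(m/K) vectors (Eq. (2)).
    S = int(np.ceil(len(Xu) / K))
    centroids = []
    for m in range(K):
        sub, rs = Xu[m * S:(m + 1) * S], r[m * S:(m + 1) * S]
        if len(sub) == 0:
            break  # fewer unique vectors than K: stop early
        # 4) Weight (one reading of Eq. (3)): higher repetition and smaller
        #    mean distance to the sub-dataset give a higher weight.
        d = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=2)
        w = rs / (d.mean(axis=1) + 1e-12)  # epsilon avoids division by zero
        # 5) The highest-weight entry becomes the initial centroid.
        centroids.append(sub[w.argmax()])
    return np.array(centroids)
```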

B. Inverse transformation

The inverse data transformation approach was first used in (Malinen et al., 2014) to solve the problems associated with the K-means clustering algorithm. However, the approach presented in (Malinen et al., 2014) has a number of shortcomings, such as finding a suitable artificial data structure, performing the mapping, and controlling the inverse transformations. This algorithm cannot generally guarantee an optimal solution. This was demonstrated by the clustering results of (Malinen et al., 2014), where in some cases the data transformation led to the deviation of the data towards incorrect cluster centroids. For the inverse transformation of data, we first generate artificial data X* as the input data of the same size (n) and dimension (d). This divides the data vectors into K distinct clusters without any fluctuations. Then we define a one-to-one mapping of the input data to the artificial data (X → X*).

The inverse data transformation approach of (Malinen et al., 2014) uniformly distributes the initial cluster centroids along a line in the artificial structure. This is given by Eq. (4):

$$X = [x_1, \ldots, x_n], \quad X^* = [x^*_1, \ldots, x^*_n], \quad initC = [c_{01}, \ldots, c_{0K}],$$
$$x^*_i = \mathrm{RandomSample}(initC), \qquad (1 \le i \le n) \qquad (4)$$

This random placement may break the clustering structure, deviate the data to incorrect cluster centroids, and consequently provide incorrect results. To address this problem, our proposed inverse data transformation approach assigns each data vector $x_i$ the position of the initial centroid $c_{0j}$ ($1 \le j \le K$) that is closest to the actual position of that data vector in the artificial structure X*. This is given by Eq. (5):

$$x^*_i = \underset{c_{0j},\; 1 \le j \le K}{\mathrm{ArgMin}} \left\| x_i - c_{0j} \right\|, \qquad (1 \le i \le n) \qquad (5)$$

A series of inverse transformations is then performed that gradually moves the data elements back to their real (original) positions. This inversely transforms the artificial data into the original data. During this process, K-means updates the cluster centroids of the transformed data; the cluster centroids calculated in each step are used as the initial cluster centroids for the next step. This process continues until the last step, whose results provide the final cluster centroids. The proposed procedure is outlined as follows: first, each vector $x_i$ is placed in the position of the initial centroid $initC_l$ ($1 \le l \le K$) that has the minimum distance to the corresponding data vector. Next, the vectors gradually move back to their real positions.

Generally, for a dataset X = [x_1, ..., x_n] of n data vectors, the gradual inverse transformation of the data to their real positions follows the steps below:

1) Sort the dataset X = [x_1, ..., x_n] in ascending order based on the Euclidean length of the vectors. Next, store the sorted data into a new dataset $X' = [x'_1, \ldots, x'_n]$.


Fig. 2. Flowchart of the Transformed K-means algorithm

2) To construct the artificial data structure X* as the initial position of the data, place each initial centroid $initC_l$ ($1 \le l \le K$) in the position of the data vectors of the dataset $X'$ that are closer to that initial centroid than to the (K-1) other initial centroids. This forms the artificial structure $X^* = [x^*_1, \ldots, x^*_n]$ and moves each real data vector into the location of the initial centroid closest to the actual position of the associated data.

3) Displace all the real data vectors X = [x_1, ..., x_n] in random order and store them into the new dataset $X'' = [x''_1, \ldots, x''_n]$.

4) Determine the distances between the initial artificial data (X*) and the real data (X''), and store them in the set $Dist'' = [dist''_1, dist''_2, \ldots, dist''_n]$. Each element $dist''_i$ represents the distance vector between the i-th data vector $x^*_i$ in the artificial dataset X* and the position of the corresponding data vector $x''_i$ in the dataset X''.

5) According to the number of steps given by the user (Steps > 1), divide each element of $Dist'' = [dist''_1, \ldots, dist''_n]$ by the value of "Steps" and update the data elements in $Dist''$. This is given by Eq. (6):

$$Dist'' = [(dist''_1/\mathrm{Steps}), \ldots, (dist''_n/\mathrm{Steps})] \qquad (6)$$

6) At each step of the inverse transformation process, all data points move towards their real locations as follows:

$$X^*_i = X^*_{i-1} + Dist'', \qquad (1 \le i \le \mathrm{Steps}) \qquad (7)$$

where X* is the position of the data in the artificial structure, i is the step number and $Dist''$ is the per-step distance of the data from their positions in the artificial structure. Fig. 2 shows the flowchart for the proposed transformed K-means. We should note that $X^*_1$ is the initial position of the data in the artificial structure (the initial artificial dataset). In the first step, the initial centroids (initC), calculated by the proposed initial centroid selection method, are fed to the K-means algorithm as the inputs (initC = initC_0). After every inverse transformation, K-means is executed given the previous centroids (initC_{i-1}) along with the modified dataset ($X^*_i$) as the input pattern. After completion of all steps (i = Steps), all data points are placed in their original locations and the final centroids (C_F) are calculated as the outputs. The proposed initialization approach of Section 2.2.A significantly reduces the chance of empty cluster generation through proper selection of the initial centroids, and the proposed data transformation approach completely solves the empty cluster problem during the data transformation process.
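A compact sketch of the gradual inverse transformation (Eqs. (5)-(7)) is shown below: each point starts at its nearest initial centroid and is moved back toward its true position in `steps` increments, with K-means re-run after each move. It relies on the `kmeans` routine sketched in Section 2.1; the rest is a hedged reconstruction, not the authors' implementation.

```python
import numpy as np

def transformed_kmeans(X, K, init_C, steps=20):
    """Sketch of the transformed K-means loop of Section 2.2.B."""
    # Eq. (5): artificial structure -- every point starts at its
    # nearest initial centroid.
    d = np.linalg.norm(X[:, None, :] - init_C[None, :, :], axis=2)
    X_art = init_C[d.argmin(axis=1)].astype(float)
    # Eq. (6): per-step displacement from artificial to real positions.
    step_vec = (X - X_art) / steps
    C, labels = init_C, None
    for i in range(1, steps + 1):
        X_art = X_art + step_vec       # Eq. (7): move toward the real data
        C, labels = kmeans(X_art, C)   # warm-started K-means on moved data
    # After i = steps, X_art coincides with X (up to rounding) and C
    # holds the final centroids C_F.
    return C, labels
```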

2.3 Time complexity

The transformed K-means algorithm has a time complexity of the order $O((n \log n) \cdot K \cdot s)$, where n is the total number of data, K is the number of clusters and s is the number of steps. More details of the time complexity of the proposed transformed K-means algorithm are given for its different phases in TABLE I.

TABLE I
Time complexity of the proposed transformed K-means algorithm.

Algorithm Phase               Time complexity
Initialization                O(n log n)
Data Transformation           O(n log n)
K-means algorithm             O(nK)
Total (running in s steps)    O((n log n) K s)

TABLE II provides the time complexity orders for the proposed method and well-known clustering algorithms including K-means*, K-means++, global K-means, original K-means, K-medoids, FCM, SOM, SOM++ and game theoretic SOM (GTSOM) (Herbert and Yao, 2007).

TABLE II
Time complexity comparison of the proposed transformed K-means algorithm and several well-known clustering algorithms.

Algorithm              Time complexity
Transformed K-means    O((n log n) K s)
K-means*               O(n K s)
K-means++              O(n K)
Global K-means         O(n^2 K^2)
K-means                O(n K)
K-medoids              O(n^2 K)
FCM                    O(n K^2)
SOM                    O(n^2 K)
GTSOM                  O(n^2 K)
SOM++                  O(n^2 K)

The time complexity comparison of TABLE II shows that the proposed transformed K-means algorithm is faster than SOM, GTSOM, and global K-means, and competes with FCM and K-medoids. The time complexity of K-means and K-means++ is better than that of our proposed algorithm. However, as the data volume increases, the K-means++ algorithm may not be as efficient as our proposed method due to its sequential initialization (Bahmani et al., 2012). The proposed transformed K-means and K-means* algorithms have almost the same time complexity. However, the approach used to deal with the generation of empty clusters in the K-means* algorithm reduces its convergence rate nonlinearly as the data volume (n) and the number of clusters (K) increase. Consequently, our proposed clustering algorithm is generally faster than the K-means* algorithm.

2.4 Estimation of the number of clusters

K-means and many other clustering algorithms assume that the number of clusters is known in advance. In cases where the number of clusters is not predefined, an efficient method is required to determine the optimal number of clusters. In this section we present a new method, based on the silhouette approach proposed in (Rousseeuw, 1987), to estimate the number of clusters. The silhouette algorithm is as follows:

1) Cluster the input data using any clustering technique for each iteration m, $(K_{min} \le m \le K_{max})$.

2) Calculate the silhouette function $S_i^m$ for the input data:

$$S_i^m = \frac{a(i) - b(i)}{\max(a(i),\, b(i))} \qquad (8)$$

where a(i) is the average distance between the i-th data point $(1 \le i \le n)$ and the other data in the same cluster, and b(i) is the lowest average distance of the i-th data point from the data in the other K-1 clusters at the m-th iteration.

3) Calculate $\bar{S}(m)$ as the average value of $S_i^m$ at the m-th iteration.

4) Select the iteration number index with the highest $\bar{S}$ as the estimated number of clusters.

The proposed method uses the principal component transformation to modify the silhouette algorithm. The proposed procedure is as follows:

4-1) Transform the input data using the Karhunen-Loeve Transform (KLT). The KLT, also known as principal component analysis (PCA), is outlined below.

4-2) Let $\varphi_k$ denote the eigenvector corresponding to the k-th eigenvalue $\lambda_k$ of the covariance matrix $\Sigma_x$:

$$\Sigma_x \varphi_k = \lambda_k \varphi_k, \qquad k \in \{1, \ldots, N\} \qquad (9)$$

where $\mathrm{cov}(X_i, Y_j) = E[(X_i - \mu_i)(Y_j - \mu_j)]$.

4-3) Construct an $N \times N$ unitary matrix:

$$\Phi = [\varphi_1, \ldots, \varphi_N], \qquad \Phi^{*T}\Phi = I, \qquad \Phi^{-1} = \Phi^{*T} \qquad (10)$$

4-4) Combine the N eigenvalue equations as follows:

$$\Sigma_x \Phi = \Phi \Lambda \qquad (11)$$

where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_N)$ is a diagonal matrix.

4-5) Multiply both sides of Eq. (11) by $\Phi^{*T} = \Phi^{-1}$:

$$\Phi^{*T} \Sigma_x \Phi = \Phi^{*T} \Phi \Lambda = \Lambda \qquad (12)$$

4-6) Given the input data X, define the Karhunen-Loeve transformation of X as:

$$Y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}
  = \begin{bmatrix} \varphi_1^{*T} x \\ \vdots \\ \varphi_N^{*T} x \end{bmatrix}
  = \Phi^{*T} X \qquad (13)$$
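Steps 4-1 through 4-6 amount to eigendecomposing the data covariance matrix and projecting the data onto the eigenvector basis. A minimal NumPy equivalent is sketched below (assuming rows of X are observations and the data are centered before Eq. (9); the function name is illustrative):

```python
import numpy as np

def klt(X):
    """Karhunen-Loeve transform (PCA) of X (rows = samples), per
    Eqs. (9)-(13): eigendecompose the covariance, project onto Phi."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = np.cov(Xc, rowvar=False)           # covariance matrix Sigma_x
    eigvals, Phi = np.linalg.eigh(cov)       # Sigma_x Phi = Phi Lambda, Eq. (11)
    Phi = Phi[:, np.argsort(eigvals)[::-1]]  # order by descending eigenvalue
    return Xc @ Phi                          # Y = Phi^T x per sample, Eq. (13)
```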


5) Specify the $K_{min}$ and $K_{max}$ values.

6) For each iteration m, calculate the initial centroids using the proposed initialization method and assign each of the transformed data points (Y) to the nearest initial centroid to form initial clusters. Then calculate the mean of all data in each cluster as the new centroids $C_s = [c_{s1}, \ldots, c_{sK}]$.

7) Calculate $S_i^m$ for each of the input data points at each iteration m, $(K_{min} \le m \le K_{max})$, by:

$$S_i^m = \frac{a(i) - b(i)}{\max(a(i),\, b(i))} \qquad (14)$$

where a(i) is the distance between the i-th data point $(1 \le i \le n)$ and the nearest centroid $c_{sj}$ $(1 \le j \le K)$ at the m-th iteration, and b(i) is the minimum distance of the i-th data point from the other K-1 centroids at the m-th iteration. This centroid-based definition of a(i) and b(i) decreases the computational burden and speeds up the process compared to the original definitions of the silhouette algorithm (Rousseeuw, 1987).

8) Include $S_i^m$ (for the i-th data point at the m-th iteration) in the $S^m$ array.

9) Include the average value of $S^m$ (for the m-th iteration) in the m-th cell of the array $S_{ave}^{est}$.

10) Use Eq. (15) and select the index with the highest $S_{ave}^m$ as the estimated number of clusters:

$$K_{est} = \underset{m}{\mathrm{ArgMax}}\left\{ S_{ave}^{est}(m) \right\}, \qquad (K_{min} \le K_{est} \le K_{max}) \qquad (15)$$

Fig. 3 shows the flowchart for the proposed method.


Fig. 3. Flowchart of the proposed method for estimating the number of clusters

This procedure modifies the method proposed in (Rousseeuw, 1987) to provide stable results with less processing time. TABLE III provides the time complexity orders of the proposed estimation method and the silhouette algorithm.

TABLE III
Time complexity comparison of the proposed estimation method and the silhouette algorithm.

Algorithm               Time complexity
Proposed Method         O(n log(n) · ΔK)
Silhouette algorithm    O(n² · ΔK)

where n is the data volume and ΔK is the difference between $K_{max}$ and $K_{min}$ $(\Delta K = K_{max} - K_{min})$. The comparison demonstrates the lower time complexity of the proposed algorithm.
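Putting steps 5-10 together, a hedged sketch of the estimator follows: for each candidate K it clusters the KLT-transformed data once (proposed initialization plus one mean update) and scores the result with the centroid-based silhouette of Eq. (14). It reuses the `klt` and `proposed_initialization` sketches above. Note that the manuscript writes (a-b)/max(a,b); the sketch uses the conventional (b-a) orientation so that the ArgMax of Eq. (15) selects the best-separated clustering.

```python
import numpy as np

def estimate_k(X, k_min, k_max):
    """Sketch of the cluster-number estimator of Section 2.4 (k_min >= 2)."""
    Y = klt(X)                                   # step 4: KLT-transformed data
    scores = {}
    for m in range(k_min, k_max + 1):
        C0 = proposed_initialization(Y, m)       # step 6: initial centroids
        d = np.linalg.norm(Y[:, None, :] - C0[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # step 6 (cont.): one mean update yields the centroids Cs.
        Cs = np.array([Y[labels == p].mean(axis=0) if np.any(labels == p)
                       else C0[p] for p in range(len(C0))])
        d = np.sort(np.linalg.norm(Y[:, None, :] - Cs[None, :, :], axis=2),
                    axis=1)
        a, b = d[:, 0], d[:, 1]         # nearest / next-nearest centroid, Eq. (14)
        s = (b - a) / np.maximum(b, a)  # conventional silhouette orientation
        scores[m] = s.mean()            # step 9: average over all points
    return max(scores, key=scores.get)  # step 10 / Eq. (15): best-scoring K
```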

3. Case Studies

In this section, we first evaluate the performance of the proposed method in dealing with the empty cluster problem; we then assess the proposed transformed K-means clustering algorithm, and finally we examine the proposed estimation method for determining the optimal number of clusters. The datasets used in the experiments are available online at the Joensuu (http://cs.uef.fi/sipu/datasets), UCI (https://archive.ics.uci.edu/ml/datasets) and Mesonet (http://mesonet.agron.iastate.edu) websites. More information regarding the data is presented in Fig. 4.

Fig. 4. Datasets used for the case study

3.1. Evaluation of the proposed method for dealing with the empty cluster generation

Three real time-series datasets of solar radiation at the Ames, Chariton and Calmar stations between 01/01/2009 and 01/01/2014 are used to calculate the number of empty clusters (N.E.C) generated by the K-means algorithm and the proposed method. The number of clusters for both algorithms is 200 (K = 200). First, the clustering is performed in one step (Steps = 1), i.e., without any data transformation, to evaluate the performance of the proposed initialization approach in dealing with the empty cluster problem. TABLE IV shows the N.E.C for the proposed method with one step and for the K-means algorithm. The results demonstrate that the proposed method significantly reduces the N.E.C as compared to the K-means algorithm. This is due to the proposed initialization approach that properly selects the initial centroids. The N.E.C generated by the proposed method is then calculated as the number of steps increases. The results are provided in TABLE V for steps 1 to 10. They show that the empty cluster problem is completely solved during the transformation of the data to their original positions, and that the proposed clustering algorithm converges without generating any empty cluster.

TABLE IV
Performance comparison of K-means and the proposed transformed K-means on the problem of empty cluster generation

Dataset    Number of objects   Number of clusters   N.E.C (K-means)   N.E.C (Proposed method, Step=1)
Ames       43827               200                  84                1
Chariton   43827               200                  75                0
Calmar     43827               200                  27                2

TABLE V
Performance of the proposed method with different steps for the empty cluster problem

Step number   N.E.C (Ames)   N.E.C (Chariton)   N.E.C (Calmar)
1             1              1                  3
2             0              0                  0
3             1              0                  0
4             0              0                  0
5             0              0                  0
6             0              0                  0
7             0              0                  0
8             0              0                  0
9             0              0                  0
10            0              0                  0

3.2. Evaluation of the proposed transformed K-means clustering algorithm

This section evaluates the accuracy of the proposed clustering method (transformed K-means). The mean squared error (MSE) is used as the accuracy performance indicator, calculated by:

$$MSE = \frac{1}{K \cdot N}\sum_{k=1}^{K}\sum_{i=1}^{N}\left\| X_i^{(k)} - C_k \right\|^2 \qquad (16)$$

where N is the number of data points in cluster k, and $X_i^{(k)}$ is the i-th data point in cluster k. The testing datasets are normalized in the range of [-1, 1]. K-means clustering is run with different initialization methods, including the random-based, K-means*-based, K-means++-based and the proposed initialization method (Section 2.2.A). The calculated error values as well as the processing times are provided in TABLE VI. A comparison of the results shows that the proposed initialization method improves the accuracy of the K-means algorithm compared to the other initialization methods. However, the computational complexity is increased due to the data sorting used by our initialization to optimally select the initial centroids.
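A direct NumPy transcription of Eq. (16) is shown below, assuming `labels` holds the cluster index of each row of X and C holds the K centroids (the names and the interpretation of the normalization as a mean over all points are illustrative assumptions):

```python
import numpy as np

def clustering_mse(X, labels, C):
    """MSE of a clustering per Eq. (16), computed here as the average
    squared Euclidean distance of every point to its own centroid."""
    sq = np.linalg.norm(X - C[labels], axis=1) ** 2
    return sq.mean()
```

For example, calling `clustering_mse(X, labels, C)` on the outputs of the `kmeans` sketch of Section 2.1 reproduces the kind of indicator reported in TABLE VI.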

TABLE VI
MSE measures and running time (sec) for the K-means algorithm with different initialization methods

             Proposed           K-means* based     K-means++ based    Random based
             initialization     initialization     initialization     initialization
Dataset      MSE      Time(s)   MSE      Time(s)   MSE      Time(s)   MSE      Time(s)
IRIS         0.0432   0.24      0.0432   0.0891    0.0431   0.0238    0.0432   0.0248
Glass        0.0017   0.0349    0.002    0.0235    0.0017   0.0079    0.0017   0.007
Missa1       0.0094   0.0678    0.0095   0.2665    0.0094   0.8085    0.0097   0.0901
Bridge       0.0008   0.2256    0.0008   2.3118    0.0011   12.8968   0.001    0.1551
Thyroid      0.0145   0.115     0.0554   0.0261    0.0151   0.0084    0.0145   0.0062
Magic        0.0304   0.1871    0.0304   0.4238    0.0304   1.382     0.0304   0.0353
Wine         0.0255   0.0858    0.1352   0.0862    0.0255   0.0101    0.0255   0.0053
Shuttle      0.0008   0.3903    0.0009   0.9169    0.0008   14.0788   0.0008   0.0451
Pendigit     0.007    0.1973    0.0083   0.1728    0.0071   0.3836    0.0071   0.0124
Wdbc         0.0206   0.0274    0.0231   0.0279    0.0206   0.0073    0.0206   0.0053
Yeast        0.0053   0.0373    0.0097   0.3779    0.0053   0.0997    0.0053   0.0311
P. I. D      0.0536   0.0265    0.1013   0.0429    0.054    0.0137    0.0536   0.0086
Olitos       0.0162   0.0148    0.0517   0.1279    0.0162   0.0059    0.0161   0.005
Heart        0.0358   0.0138    0.0574   0.0279    0.0358   0.0113    0.0358   0.0065
Ionosphere   0.081    0.0167    0.096    0.0311    0.081    0.0076    0.081    0.0058
M. Libras    0.0064   0.0185    0.0094   1.3832    0.0066   0.0352    0.0058   0.0061
Spambase     0.01     0.049     0.0125   0.13      0.01     0.0894    0.01     0.0089
Waveform     0.0114   0.0675    0.0159   0.1521    0.0114   0.2114    0.0114   0.0176
a1           0.0057   0.1083    0.0065   0.6655    0.0061   0.3598    0.0059   0.0367
s1           0.0094   0.0869    0.01     0.6682    0.0101   0.7354    0.0095   0.0358

The MSE value is calculated for different data clustering methods including the proposed transformed K-means, ABC-K-means, DE-K-means, GA-K-means, K-means*, K-means++, K-means, SOM, GTSOM, SOM++, K-medoids and FCM, and provided in TABLE VII. The calculated MSE values show that the transformed K-means (with Steps = 20) outperforms or competes with the existing methods in terms of clustering quality. The improved clustering quality is the result of several procedures embraced by our proposed method, namely the determination of the optimal number of clusters, the proposed initialization, and the gradual data transformation.

The proposed transformed K-means algorithm has a faster processing time compared to ABC-K-means, DE-K-means, GA-K-means, SOM, GTSOM, and SOM++, and competes with FCM, K-medoids, K-means*, and K-means++. For small- and medium-sized data, our proposed method is generally more time consuming than K-medoids, K-means*, and K-means++. However, the reduced convergence rate of K-means* in dealing with empty cluster generation, particularly for higher numbers of clusters, and the sequential initialization of K-means++ increase the computational complexity of these methods for large data volumes. This is evident from our running time results for large datasets such as Missa1 and Shuttle, where the proposed transformed K-means converges faster than K-means* and K-means++.

TABLE VII
MSE measure and the running time (sec) for different clustering techniques

Methods (left to right): Proposed method, ABC-K-means, DE-K-means, GA-K-means, K-means*, K-means++, K-means, SOM, GTSOM, SOM++, K-medoids, FCM

IRIS        MSE:      0.0432   0.0475   0.043    0.043    0.0432   0.0432   0.0432   0.0432   0.0433   0.0432   0.0432   0.0432
            Time(s):  0.5754   1.2629   1.4003   1.6738   1.4858   0.0063   0.0061   2.3584   1.7599   2.4825   0.0127   0.2132
Glass       MSE:      0.0017   0.0019   0.0018   0.0017   0.0017   0.0018   0.0017   0.0017   0.0019   0.0017   0.0017   0.0017
            Time(s):  0.6301   1.2252   1.3897   1.6864   0.1381   0.0134   0.0058   2.3659   1.8439   2.3572   0.0311   0.0134
Missa1      MSE:      0.0001   0.0002   0.0002   0.0001   0.0001   0.0001   0.0001   0.0001   0.0001   0.0001   0.0001   0.005
            Time(s):  3.9664   45.2259  60.29    81.2574  47.813   37.6179  0.3216   3.4526   10.5227  3.4381   47.6677  4.7857
Bridge      MSE:      0.0008   0.0018   0.0015   0.0013   0.0009   0.0013   0.001    0.001    0.0009   0.0012   0.0009   0.0038
            Time(s):  2.4381   29.2082  38.8446  52.4054  18.794   10.5017  0.1362   3.3597   10.6182  3.3557   18.0322  15.5823
Thyroid     MSE:      0.0134   0.0147   0.0134   0.0133   0.0591   0.0151   0.0145   0.0167   0.0169   0.0168   0.0146   0.0149
            Time(s):  0.6331   1.2185   1.3674   1.6443   0.3718   0.0087   0.0069   2.3265   1.7153   2.3346   0.0394   0.4068
Magic       MSE:      0.0271   0.0325   0.0295   0.0295   0.0304   0.0304   0.0304   0.0307   0.0306   0.0304   0.0304   0.0304
            Time(s):  0.9886   2.6975   3.2338   4.1301   0.7922   1.3478   0.0373   2.3212   1.684    2.3245   3.777    0.7571
Wine        MSE:      0.0255   0.0277   0.0254   0.0252   0.1349   0.0256   0.0255   0.0257   0.0255   0.0255   0.0256   0.0256
            Time(s):  0.6183   1.1893   1.3275   1.5914   1.8411   0.0059   0.006    2.3257   1.7137   2.3216   0.0127   0.4052
Shuttle     MSE:      0.0007   0.0008   0.0008   0.0007   0.0009   0.0008   0.0008   0.0008   0.0009   0.0008   0.0008   0.0009
            Time(s):  1.2867   5.363    6.7498   9.2292   1.4791   13.8573  0.047    2.3333   1.7205   2.3379   4.9705   0.9686
Pendigit    MSE:      0.0069   0.0085   0.0077   0.0073   0.0079   0.0071   0.0071   0.0069   0.0071   0.007    0.007    0.0969
            Time(s):  0.814    2.3004   2.8268   3.6743   7.2133   0.3607   0.018    2.3571   1.969    2.3645   1.6015   1.3866
Wdbc        MSE:      0.0138   0.0152   0.0138   0.0138   0.0231   0.0206   0.0206   0.0188   0.0243   0.0205   0.0192   0.0205
            Time(s):  0.6232   1.2157   1.3701   1.6524   0.2455   0.0057   0.0044   2.3226   1.6766   2.3566   0.017    0.1047
Yeast       MSE:      0.0051   0.0062   0.0057   0.0055   0.0088   0.0055   0.0053   0.0052   0.0054   0.0052   0.0054   0.0058
            Time(s):  0.6999   1.6819   1.9743   2.4665   7.0087   0.0898   0.016    2.3747   1.9684   2.3955   1.0383   0.2853
P. I. D     MSE:      0.0438   0.0474   0.0431   0.0431   0.1013   0.0536   0.0536   0.0525   0.0541   0.0574   0.0537   0.0512
            Time(s):  0.6353   1.2327   1.3823   1.698    0.2842   0.008    0.0056   2.337    1.6765   2.3256   0.0762   0.4109
Olitos      MSE:      0.0153   0.017    0.0156   0.0153   0.051    0.0162   0.0167   0.0162   0.0161   0.0161   0.0161   0.016
            Time(s):  0.5645   1.2042   1.3552   1.6307   1.3999   0.0101   0.0065   2.337    1.8621   2.3347   0.0259   0.5152
Heart       MSE:      0.0352   0.0387   0.0352   0.0352   0.0574   0.0358   0.0358   0.0354   0.0359   0.0356   0.0358   0.0359
            Time(s):  0.6091   1.1922   1.342    1.6161   0.3494   0.0052   0.0057   2.3168   1.6736   2.3149   0.0408   0.3544
Ionosphere  MSE:      0.081    0.0891   0.081    0.081    0.0959   0.081    0.081    0.0811   0.081    0.081    0.081    0.0813
            Time(s):  0.6061   1.2029   1.3576   1.6325   0.2635   0.0053   0.0052   2.3253   1.6762   2.3317   0.0175   0.4333
Movement    MSE:      0.0055   0.0081   0.0072   0.006    0.0094   0.0063   0.0062   0.0055   0.0058   0.0057   0.0056   0.0057
Libras      Time(s):  0.6293   1.335    1.5315   1.8818   15.938   0.0252   0.01     2.3821   2.1349   2.3865   0.1281