Parallel K-means for big data: On enhancing its cluster metrics and patterns
Veronica Moertini
Universitas Katolik Parahyangan
ABSTRACT
The k-Means clustering algorithm has been enhanced based on MapReduce such that it works in a distributed Hadoop cluster for clustering big data. We found that the existing algorithms do not include techniques for computing the cluster metrics necessary for evaluating the quality of clusters and finding interesting patterns. This research adds this capability. A few metrics are computed in every iteration of k-Means in Hadoop's Reduce function, such that when the algorithm converges, the metrics are ready to be evaluated. We have implemented the proposed parallel k-Means, and the experiment results show that the proposed metrics are useful for selecting clusters and finding interesting patterns.
Keywords: Clustering Big Data, Parallel k-Means, Hadoop MapReduce
Journal of Theoretical and Applied Information Technology
30th April 2017. Vol.95. No 8
© 2005 – ongoing JATIT & LLS
experiments using two samples of big data for obtaining knowledge. Our main contribution is enhancing the previously developed parallel k-Means based on MapReduce such that it has the capability to generate the metrics necessary for evaluating cluster quality and discovering interesting patterns.

This paper presents a review of related literature, the proposed techniques, experiment results using two big datasets, conclusions and further work.

2. LITERATURE REVIEW

2.1. Clustering Stages

Among business organizations, data mining techniques are commonly used in supporting customer relationship management. The cycle of using data mining includes the stages of identifying the business problem, mining data to transform the data into actionable information, acting on the information, and measuring the results [6]. When the problem is a lack of data insight, data miners can define the objective as obtaining knowledge from the data and select a clustering technique to seek solutions. The process for clustering data is shown in Fig. 1. Based on the objective, the data miner gathers and selects some raw data. Then, the selected dataset is preprocessed, which may involve data cleaning, attribute selection and transformation [3]. Data cleaning is needed because the raw dataset may contain missing, outlier or unwanted values. Some attributes may be irrelevant and should be removed. Attribute values may need to be normalized or transformed into the values and/or types accepted by the algorithms. The patterns resulting from clustering are then evaluated by some measures to obtain knowledge, which can be used to design organizational actions.
2.2. k-Means Algorithm, Cluster Quality and Patterns Generation

Clustering aims to find similarities between data objects according to the characteristics found in the objects, and to group similar objects into clusters [2]. As the k-Means algorithm processes matrix data input where all of the attributes must be numeric, each object is a vector.

The k-means algorithm partitions a collection of n vectors x_j, j = 1,…,n into c groups G_i, i = 1,…,c, and finds a cluster center (centroid) in each group such that a cost function of a dissimilarity measure is minimized ([8] as appeared in [9, 11]). If a generic distance function d(x_k, c_i) is applied for vector x_k in group i, the overall cost function is

J = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \sum_{k,\, x_k \in G_i} d(x_k, c_i)    (1)

The partitioned groups are represented by a c × n binary membership matrix U, where element u_ij is 1 if the j-th data point x_j belongs to group i, and 0 otherwise. The cluster center (centroid) c_i is the mean of all vectors in group i:

c_i = \frac{1}{|G_i|} \sum_{k,\, x_k \in G_i} x_k    (2)

where |G_i| is the size (number of objects) of G_i.

The k-means algorithm is presented with a dataset x_i, i = 1,…,n. The algorithm determines the centroids c_i and the membership matrix U iteratively using the following steps: (1) Initialize the cluster centers c_i, i = 1,…,c; (2) Determine the membership matrix U; (3) Compute the cost function by Eq. (1), and stop if its improvement over the previous iteration is below a certain threshold or the maximum number of iterations (defined by data miners) is reached; (4) Update the cluster centers by Eq. (2) and go to step 2.

The performance of the k-means algorithm depends on the initial positions of the cluster centers. k-Means is relatively efficient, with complexity O(tkn), where n is the total number of vectors/objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.

Measuring Clustering Algorithm Quality: A good clustering method will produce high quality clusters with high intra-class similarity and low inter-class similarity. It should also be able to discover hidden patterns [2]. Other requirements are: (1) scalability; (2) the ability to deal with noise and outliers; (3) interpretability and usability, etc.
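The iterative procedure of steps (1)-(4) above can be sketched in a few lines. This is a minimal in-memory illustration (plain Python, our own function names; initializing the centers to the first c vectors is a simplification, not the paper's method):

```python
import math

def kmeans(vectors, c, max_iter=100, eps=1e-6):
    """Plain k-means following steps (1)-(4): initialize the cluster
    centers, assign each vector to its nearest center (the membership
    matrix U), compute the cost J of Eq. (1), stop when J improves by
    less than eps, otherwise update the centers by Eq. (2)."""
    centers = list(vectors[:c])               # step (1): simple initialization
    prev_j = float("inf")
    groups, j = [], 0.0
    for _ in range(max_iter):
        groups = [[] for _ in range(c)]
        j = 0.0
        for x in vectors:                     # step (2): assign to nearest center
            dists = [math.dist(x, ci) for ci in centers]
            i = dists.index(min(dists))
            groups[i].append(x)
            j += min(dists)                   # step (3): cost J, Eq. (1)
        if prev_j - j < eps:                  # stop criterion
            break
        prev_j = j
        for i, g in enumerate(groups):        # step (4): centroid = mean, Eq. (2)
            if g:
                centers[i] = tuple(sum(col) / len(g) for col in zip(*g))
    return centers, groups, j

data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
        (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centers, groups, j = kmeans(data, 2)
# two well-separated blobs are recovered as two groups of three objects
```

Note that the quality of the final centers depends on the initialization, which is the motivation for the initial-centroid computation discussed in Section 3.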
Measuring Clustering Results Quality: As defined in [2], high quality clusters should have high intra-class similarity and low inter-class similarity. To achieve this, data miners should assess the homogeneity or cohesion of the clusters and the level of similarity of their members, as well as their separation.

In examining and evaluating the clusters, [3] proposes three measures:

(a) The number of clusters and the size of each cluster: A large, dominating cluster which concentrates most of the records may indicate the need for further segmentation. Conversely, a small cluster with few records merits special attention. If considered an outlier cluster, it can be set apart from the overall clustering solution and studied separately.

(b) Cohesion of the clusters: A good clustering solution is expected to be composed of dense concentrations of records around their centroids. Two metrics can be calculated to summarize the concentration and the level of internal cohesion of the revealed clusters:

(b.1) Standard deviations of cluster attributes and the pooled standard deviation of each cluster: The standard deviation of attribute j in a cluster can be defined as

SD_j = \sqrt{ \frac{\sum_i (x_i - \mu)^2}{N - 1} }    (3)

where x_i is the attribute value of object i, µ is the average of this attribute, and N is the total number of object members in the corresponding cluster.

The pooled standard deviation of a cluster having k attributes and N object members can be defined as

SD = \sqrt{ \frac{\sum_{j=1}^{k} (N - 1)\, SD_j^2}{kN - k} }    (4)

(b.2) The average of squared Euclidean distances (SSE) between the objects and their centroids:

Average\ SSE = \frac{1}{N} \sum_{i \in C} \sum_{x \in C_i} dist(x, c_i)^2    (5)

where C_i is cluster i, c_i is its centroid, x is an object of cluster i, and N is the total number of objects.

(c) Separation of the clusters: High quality clusters should have low inter-cluster similarity, or high inter-cluster dissimilarity. This can be measured by computing the silhouette coefficients of the clustering results. The silhouette coefficient of each clustered object, S(i), is computed as

S(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}    (6)

where a(i) is the average dissimilarity between object i and all other objects of the cluster to which i belongs, and b(i) is the average dissimilarity between object i and its "neighbor" cluster (the nearest cluster to which i does not belong). In Eq. (6), −1 ≤ S(i) ≤ 1. A large value of S(i) denotes that object i is well clustered, a small value denotes the opposite, and a negative value of S(i) denotes that object i is wrongly clustered. Generally, if the average of S(i) over all clustered objects is greater than 0.5, the cluster solution is acceptable.

Patterns Generated from k-Means Output: Patterns of clusters can be found through profiling [3, 10]. One method of profiling is comparing the attributes of the objects in the clusters. Things that can be compared include the average (mean), minimum, maximum and standard deviation of the attribute values, and the percentage of objects having each attribute value. Likewise, the number of object members in each cluster can also be examined.

From the metrics used to evaluate cluster quality and to generate patterns, it is clear that the size of each cluster and the standard deviations can be used both in generating patterns and in measuring cluster quality. Thus, computing these metrics is important.

2.3. Hadoop, HDFS and MapReduce

Hadoop is a platform that has been developed for storing and analyzing big data in distributed systems [1]. It comes with a master-slave architecture and consists of the Hadoop Distributed File System (HDFS) for storage and MapReduce for computational capabilities. Its storage and computational capabilities scale with the addition of hosts to a Hadoop cluster, and can reach volume sizes in the petabytes on clusters with thousands of hosts. The following is a brief overview of HDFS and MapReduce.

HDFS: HDFS is a distributed file system designed for large-scale distributed data processing under frameworks such as MapReduce, and is optimized for high throughput. It automatically re-replicates data blocks on nodes (the default is 3 replications).

MapReduce: MapReduce is a data processing model that has the advantage of easy scaling of data processing over multiple computing nodes. A MapReduce program processes data by manipulating (key/value) pairs in the general form:

map: (k1, v1) ➞ list(k2, v2)
reduce: (k2, list(v2)) ➞ list(k3, v3)

Map receives (key, value) pairs; then, based on the functions designed by developers, it generates one or more output pairs, list(k2, v2). Through a shuffle and sort phase, the output pairs are partitioned and then transferred to reducers. Pairs with the same key are grouped together as (k2, list(v2)). Then the reduce function (designed by developers) generates the final output pairs, list(k3, v3), for each group.

In some situations, the traffic in the shuffle phase can be reduced by using a local Combiner. A Combiner function is useful when the reducer only performs a distributive function, such as maximum, minimum or summation (counting). Many useful functions are not distributive, however, so using a combiner does not necessarily improve performance [12].

The overall MapReduce process is shown in Fig. 2 [1, 13]. A client submits a job to the master, which then assigns and manages Map and Reduce job parts on the slave nodes. Map reads and processes blocks of files stored locally on the slave node. The Map output key-value pairs are sent to the Reducers.
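For modest data that fits in memory, the quality measures of Eqs. (3)-(6) in Subsection 2.2 can be computed directly. The sketch below (plain Python, our own helper names) mirrors the formulas one-to-one:

```python
import math

def attribute_sd(values):
    """Eq. (3): sample standard deviation of one attribute within a cluster."""
    mu = sum(values) / len(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / (len(values) - 1))

def pooled_sd(sds, n):
    """Eq. (4): pooled standard deviation of a cluster with len(sds)
    attributes and n object members."""
    k = len(sds)
    return math.sqrt(sum((n - 1) * sd ** 2 for sd in sds) / (k * n - k))

def average_sse(clusters, centroids):
    """Eq. (5): average squared Euclidean distance of every object to the
    centroid of its own cluster."""
    n = sum(len(objs) for objs in clusters)
    total = sum(math.dist(x, ci) ** 2
                for objs, ci in zip(clusters, centroids)
                for x in objs)
    return total / n

def silhouette(obj, own, neighbour):
    """Eq. (6): silhouette of one object, given the other objects of its
    cluster (own) and the objects of the nearest other cluster (neighbour)."""
    a = sum(math.dist(obj, o) for o in own) / len(own)
    b = sum(math.dist(obj, o) for o in neighbour) / len(neighbour)
    return (b - a) / max(a, b)
```

For big data these direct forms require all objects (or all pairwise distances, for the silhouette) to be held in memory, which is exactly the limitation that Section 3.1 works around.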
(Figure 2: The overall MapReduce process — the master assigns job parts to slave nodes, Map processes the blocks stored locally on each node, and the Map outputs are sent to Reduce.)
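The map/combine/shuffle/reduce flow above, together with a safe (distributive) combiner, can be illustrated by simulating one parallel k-Means iteration in the style described in [4]. The node blocks, function names and toy data below are our own:

```python
import math
from collections import defaultdict

def map_fn(block, centroids):
    """Map: emit (index of the nearest centroid, object attributes)."""
    for x in block:
        dists = [math.dist(x, c) for c in centroids]
        yield dists.index(min(dists)), x

def combine_fn(pairs):
    """Combine: per-cluster partial attribute sums and counts within one
    map task; summation is distributive, so a combiner is safe here."""
    partial = {}
    for k, x in pairs:
        s, n = partial.get(k, ((0.0,) * len(x), 0))
        partial[k] = (tuple(a + b for a, b in zip(s, x)), n + 1)
    return list(partial.items())

def reduce_fn(key, values):
    """Reduce: merge the partial sums and emit the new centroid."""
    sums = [v[0] for v in values]
    count = sum(v[1] for v in values)
    total = tuple(sum(col) for col in zip(*sums))
    return key, tuple(t / count for t in total)

# Two "nodes", each holding one block; the shuffle phase groups the
# combiner outputs by key before they reach the reducer.
blocks = [[(0.0, 0.0), (0.2, 0.0), (4.0, 4.0)],
          [(0.0, 0.2), (4.2, 4.0), (4.0, 4.2)]]
centroids = [(0.0, 0.0), (4.0, 4.0)]
grouped = defaultdict(list)
for block in blocks:
    for k, v in combine_fn(map_fn(block, centroids)):
        grouped[k].append(v)
new_centroids = dict(reduce_fn(k, vs) for k, vs in grouped.items())
# new_centroids: cluster 0 near (0.07, 0.07), cluster 1 near (4.07, 4.07)
```

Because the combiner only sums, the reducer sees at most one partial record per cluster per node instead of one record per object, which is the shuffle-traffic reduction mentioned above.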
2.4. Parallel k-Means for Hadoop

We have found two parallel k-Means algorithms developed for the Hadoop environment. The core concept of each is excerpted as follows.

First, in [4], the map function assigns each object to the closest centroid, while the reduce function performs the procedure of updating the new centroids. To decrease the cost of network communication, a combiner function combines the intermediate values with the same key within the same map task on a Hadoop node. An excerpt of the Map, Combine and Reduce algorithms (the detailed algorithms can be found in [4]):

(a) Map function: The input dataset is stored in HDFS as a sequence file of <key, value> pairs, each of which represents a record/instance/object in the dataset. Map computes the minimum distance from each object to all centroids. It then emits strings comprising the index of the object's closest centroid (as key') and the object attributes (as value').

(b) Combine function: Processing the key-value pairs from Map, Combine partially sums the attribute values of the points assigned to the same cluster, along with the number of objects in each cluster. It emits strings comprising the index of the cluster centroid (as key') and the sum of each attribute value of the objects in that cluster.

(c) Reduce function: The Reduce function sums all the samples and computes the new centroids (centers), which are used for the next iteration. It then emits key' as the index of the cluster and value' as a string representing the new centroids.

Secondly, in [5], the parallel k-Means algorithm is improved by removing noise and supplying a pre-computed value of k and initial clusters (to reduce iterations). An excerpt of the general idea: The value of each attribute of each object is evaluated, and based on this value a GridId is assigned to each object. Objects having attribute values beyond a threshold are removed. The centroids of the grids are fed into the DBSCAN algorithm to obtain the best value of k (the k initial cluster centers are computed from the sample of grids). The k and initial clusters are used as input to the Map function of the MapReduce-based parallel k-Means.

Some drawbacks that we found in those existing parallel k-Means algorithms are:
(a) Big data will most likely contain noise, outliers and missing values, which must be handled. If data cleaning is performed before the big data is fed into k-Means, it will be inefficient. This has not been addressed in the algorithms.
(b) The Reduce function emits cluster centroids only as patterns. For some big data, such as organizations' business data, this may not be sufficient.
(c) If more patterns need to be computed (in the Reduce) that require detailed information (the attribute values) of each object, Combine (which sums up the attribute values of a "local cluster") cannot be employed.
(d) In [5], the formula for obtaining the GridId is not presented clearly. While an object may have several attributes, the GridId of an object is computed based on a single attribute value.
(e) As the parallel DBSCAN algorithm is not included in the proposed technique, it seems that the initial centroids are still computed outside the Hadoop system.

By examining those drawbacks, we aim to develop a parallel k-Means with the capability to preprocess the big dataset and compute suitable metrics that can be used to evaluate cluster quality as well as patterns.

3. PROPOSED TECHNIQUE

In this section, we present the analysis of selecting cluster quality metrics, the enhancement providing metrics and pattern components, and the parallel k-Means algorithm based on MapReduce. Our proposed technique is designed based on the MapReduce concept depicted in Section 2.3.
Hence, the parallel k-Means is not applicable outside a Hadoop distributed environment.

3.1. Selecting Cluster Quality Metrics and Pattern Components

Big data may consist of millions or even billions of objects. Clustering big data will produce clusters where each cluster may have a very large number of object members. The following is a feasibility review of using the metrics depicted in Subsection 2.2 for measuring the quality of big data clusters:

(a) Number of object members: In each iteration, the number of objects in every cluster is computed (and used to compute the new centroids), so providing this metric is feasible.

(b) Standard deviations of cluster attributes and the pooled standard deviation of each cluster: The computation of µ in (x_i − µ) (Eq. 3) requires that all attribute values of every object in each cluster be stored in slave node memory. Storing the whole (raw) large set of objects and their attribute values in slave node memory will not guarantee scalability (required of a good clustering algorithm) in processing big data. Accessing each of millions of objects in each iteration also worsens the time complexity. As a solution, we propose the following approach: since in each k-Means iteration the cluster centroids move closer to the final centroids, the cluster centroids obtained from the previous iteration are used as µ in the current iteration, such that while iterating over the list of values (which include x_i), the Reducer functions compute (x_i − µ) along with the other computations (to produce pattern components). Then, after all of the computations are performed, the standard deviation of each attribute (SD_j) and the pooled standard deviation (SD) can also be computed.

(c) Separation of the clusters: Computing the silhouette coefficient of each clustered object, S(i), requires that the whole (raw) large set of objects and their attribute values be stored in slave node memory in every k-Means iteration. This is necessary because the a(i) and b(i) computations in Eq. 6 need distance computations from one object to every other object in its own cluster as well as in other clusters. If this metric is adopted for clustering big data, the computation will worsen the scalability and time complexity of the parallel k-Means. Hence, it is not feasible to adopt.

Based on this analysis, the metrics chosen for evaluating cluster quality are the number of members and the pooled standard deviation of every cluster. As discussed in Subsection 2.2, the number of members and the standard deviations of attributes in clusters can be regarded as cluster pattern components. Hence, we can include these as part of the patterns, reducing computations. Other components that we adopt are the cluster centroids and the minimum and maximum attribute values in every cluster. Computing those 5 pattern components will not add significant time complexity, as it can be performed along with the clustering process in every iteration.

3.2. Parallel Clustering Technique

In our previous work presented at a conference [7], we proposed a technique for clustering big data consisting of two stages, which includes data sampling for finding initial centroids, and the following enhancements:

(1) Data preprocessing: Attribute selection, cleaning and transformations are performed along with the clustering process, in the Map functions that take the raw dataset as input. Hence, the big data is not "visited" more than once.

(2) Reducing iterations: MapReduce is known for its inefficiency in iterative processes (such as the k-Means algorithm), as in each pass the output must be written to HDFS. Reducing the number of iterations is therefore significant. We propose that the initial centroids be computed by MapReduce from a sample of the dataset; these are expected to be closer to the final centroids.

(3) Adding computations in the Reduce function for computing some pattern components. In that past research, we had not conducted experiments to support this concept.

In [7], we present experiment results showing that the proposed technique is scalable, but we had not conducted experiments with a real big dataset for evaluating the cluster patterns. After further work, we find that the sampling does not always perform well in finding initial centroids close to the real ones. Hence, the technique needs to be revised as depicted in Fig. 3.
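The single-pass standard-deviation approximation proposed in Subsection 3.1 can be sketched for one attribute of one cluster as follows (our own variable and function names): the previous iteration's centroid stands in for the unknown current mean µ, so Σ(x_i − µ)² can be accumulated in the same pass that computes the new centroid.

```python
import math

def reduce_pass(values, prev_center):
    """Single pass over one cluster's attribute values, as in the proposed
    Reducer: accumulate the sum (for the new centroid) and the squared
    deviations from the PREVIOUS centroid (for the approximate SD)."""
    n = 0
    sum_x = 0.0
    sum_sq_diff = 0.0
    for x in values:
        n += 1
        sum_x += x
        sum_sq_diff += (x - prev_center) ** 2   # prev_center stands in for mu
    new_center = sum_x / n
    approx_sd = math.sqrt(sum_sq_diff / (n - 1))  # Eq. (3) with mu ~ prev_center
    return new_center, approx_sd

values = [9.8, 10.1, 10.0, 9.9, 10.2]
new_center, approx_sd = reduce_pass(values, prev_center=10.05)
# exact SD (mu = 10.0) is ~0.158; with the previous centroid standing in
# for mu the single pass yields ~0.168, and the gap shrinks as the
# centroids stabilize across iterations
```

No object needs to be stored or revisited: each value is consumed once from the Reducer's value list, which is what preserves scalability.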
(Figure 3: The revised two-stage technique — stage 1 computes the initial centroids from the dataset in HDFS; stage 2 runs the enhanced parallel k-Means, which reads the initial centroids and produces the patterns.)
Output: a <key', value'> pair, where key' is the index of the cluster and value' is a string comprising: the centroid of the cluster, centers[]; the number of objects in the cluster, countObj; the minimum, maximum and standard deviation of every attribute, minAtrVal[], maxAtrVal[], StdAtrCluster[]; the pooled standard deviation of the cluster, PooldStdCluster; the current cost function, J; and the average SSE, AvgSSE.

Steps:
1. Initialize minAtrVal[], maxAtrVal[], SumDiffAtrPrevCenter[], SumAtr[], StdAtrCluster[], centers[]
2. countObj = 0; J = 0
3. While (ListVal.hasNext())
4.   Get the object attribute values and its distance to the centroid from value
5.   For each attribute: add its value to SumAtr[] accordingly; subtract from it the previous centroid stored in prevcenters, square the result and add it to SumDiffAtrPrevCenter[] accordingly; compare the value to the corresponding elements of minAtrVal[] and maxAtrVal[] and replace them where appropriate
6.   J = J + dist
7.   countObj = countObj + 1
8. Compute the new centroids by dividing SumAtr[] by countObj and store the results in centers[]
9. Compute the approximate standard deviation of every attribute in the cluster using SumDiffAtrPrevCenter[], storing the results in StdAtrCluster[]
10. Compute PooldStdCluster from StdAtrCluster[] based on Eqs. 3 and 4
11. Compute AvgSSE by dividing J by countObj
12. Construct value'' as a string comprising countObj, centers, J, minAtrVal, maxAtrVal, StdAtrCluster, PooldStdCluster, AvgSSE
13. Emit <key, value''>

Algorithm kM-4. run (the Hadoop job)
Input: Array of cost function values, J; maximum number of iterations, maxIter; minimum difference between the cost functions of the current and previous iterations, Eps.
Output: J
Steps:
1. Initialize J[maxIter]
2. iter = 1
3. While iter <= maxIter {execute the configure, map and reduce functions; get J from the output of the reduce function and store it in J[iter]; if the absolute value of (J[iter] − J[iter−1]) <= Eps then break; else iter = iter + 1}

4. EXPERIMENTS

We have implemented the algorithms and performed a series of experiments using big datasets on a Hadoop cluster with one master (name node) and 6 slave nodes. All of the nodes are commodity computers with low specifications: a Quad-Core processor running at a 3.2 GHz clock and 8 Gb of RAM.

Dataset: The dataset of household energy consumption is obtained from https://archive.ics.uci.edu/ml/datasets/ and is approximately 132 Mb in size. This archive contains 2,075,259 measurements (records) gathered between December 2006 and November 2010 (47 months). A sample of the dataset:

9/6/2007; 17:31:00 ; 0.486 ; 0.066; 241.810; 2.000; 0.000 ; 0.000 ; 0.000
9/6/2007; 17:32:00 ; 0.484 ; 0.066; 241.220; 2.000; 0.000 ; 0.000; 0.000
9/6/2007; 17:33:00 ; 0.484 ; 0.066 ; 241.510; 2.000; 0.000 ; 0.000 ; 0.000

Each line presents a record with 9 attributes; in excerpt:
(1) Date;
(2) Time;
(3, 4, 5, 6) some metric results;
(7) sub_metering_1: energy sub-metering (watt-hour) that corresponds to the kitchen;
(8) sub_metering_2: energy sub-metering that corresponds to the laundry room;
(9) sub_metering_3: energy sub-metering that corresponds to a water-heater and an air-conditioner.

Mining objective: Given this dataset, a feasible objective is to obtain the energy consumption patterns of the household. Knowing these, the power provider may design better services for this household.

Data preprocessing: Based on the objective, we intend to find the patterns of electric power usage of sub-metering 1-2-3 based on day number (1 = Monday, 2 = Tuesday, …, 7 = Sunday) and hour. Hence, the data preprocessing performed in the Map function is as follows:
a) The day number (1, 2, …, 7) is extracted from Date and stored as attribute-1;
b) The hour (1, 2, …, 24) is extracted from Time and stored as attribute-2;
c) The values of sub_metering_1, _2 and _3 are taken as-is and stored as attribute-3, -4 and -5.
Thus, the preprocessed dataset has 5 attributes: day number, hour and the 3 sub-metering measures.

Testing the performance of the proposed parallel k-Means: To test speed and scalability, we created several simulation datasets of size 0.2, 0.4, 0.8, …, 2 gigabytes as "multiplications" of the original dataset. We used HDFS block sizes of 32 and 64 Mb, then repeatedly clustered each of the stored datasets with k = 3 for each block size setting.
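The Map-side preprocessing of steps a)-c) above can be sketched for one raw record as follows (our own function name; the dd/mm/yyyy field order follows the sample records, mapping hour 0 to 24 to fit the 1-24 range is our assumption, and handling of the dataset's missing-value markers is omitted):

```python
from datetime import date

def preprocess(record):
    """Turn one semicolon-separated raw record into the 5 clustering
    attributes: day number, hour, sub_metering_1, _2, _3."""
    fields = [f.strip() for f in record.split(";")]
    d, m, y = (int(p) for p in fields[0].split("/"))   # dd/mm/yyyy
    day_no = date(y, m, d).isoweekday()                # 1 = Monday ... 7 = Sunday
    hour = int(fields[1].split(":")[0]) or 24          # hour 0 mapped to 24
    sub1, sub2, sub3 = (float(fields[i]) for i in (6, 7, 8))
    return (day_no, hour, sub1, sub2, sub3)

rec = "9/6/2007; 17:31:00 ; 0.486 ; 0.066; 241.810; 2.000; 0.000 ; 0.000 ; 0.000"
# preprocess(rec) → (6, 17, 0.0, 0.0, 0.0): Saturday, hour 17, and the
# three sub-metering values
```

Performing this inside Map keeps the raw data from being visited more than once, per enhancement (1) of Subsection 3.2.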
We then averaged the execution times over all blocks and plotted the results as depicted in Fig. 4. The speeds with 32 and 64 Mb block sizes are almost the same. The execution time plots are linear, which indicates that the execution of the proposed k-Means scales linearly with the size of the dataset, i.e. it guarantees scalability.

Mining knowledge from the dataset: The experiments for obtaining the best k, patterns and knowledge are presented as follows.

Selecting the best k using the cluster quality metrics: This experiment is intended to show how to use the proposed quality metrics for finding the best cluster number, k. We clustered the preprocessed dataset with k = 3, 4, 5, 6 and 7. The number of iterations until k-Means reached convergence and the average execution time for each k are: k = 3: 11 iterations, 60 s; k = 4: 10 iterations, 61.88 s; k = 5: 15 iterations, 63.13 s; k = 6: 15 iterations, 65 s; and k = 7: 16 iterations, 66.25 s. The results for each metric are depicted in Tables 1, 2 and 3.
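One way to operationalize this selection from the emitted metrics can be sketched as follows. The per-k member counts, pooled standard deviations and the 5% threshold below are invented for illustration (they are not the values of Tables 1-3): prefer the k with the lowest mean pooled standard deviation, while rejecting solutions whose smallest cluster is so small that it is likely an outlier cluster deserving separate study.

```python
# Hypothetical per-k summaries, as a Reducer would emit them:
# per-cluster member counts and pooled standard deviations.
summaries = {
    3: {"counts": [900, 700, 475],          "pooled_sd": [2.9, 3.1, 2.8]},
    4: {"counts": [800, 600, 400, 275],     "pooled_sd": [2.1, 2.2, 2.0, 2.3]},
    5: {"counts": [790, 590, 395, 272, 28], "pooled_sd": [2.0, 2.1, 2.0, 2.2, 1.0]},
}

def best_k(summaries, min_cluster_frac=0.05):
    """Pick the k with the lowest mean pooled SD among the solutions whose
    smallest cluster still holds min_cluster_frac of all objects."""
    feasible = {}
    for k, s in summaries.items():
        total = sum(s["counts"])
        if min(s["counts"]) >= min_cluster_frac * total:
            feasible[k] = sum(s["pooled_sd"]) / k
    return min(feasible, key=feasible.get)
```

With these invented numbers, k = 5 has the lowest mean pooled SD but its 28-member cluster falls below the threshold, so k = 4 is selected; relaxing the threshold to zero would select k = 5 instead.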
REFERENCES

[1] A. Holmes, Hadoop in Practice, Manning Publications Co., USA, 2012.
[2] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd Ed., Morgan Kaufmann Publishers, USA, 2011.
Table A.1: Iterations and execution times for data from a station in Norway
Period       #Iterations   Time (sec)   AvgTime (sec)
1973-1980    11            426          39
1981-1985    22            872          40
1986-1990    9             349          39
1991-1995    13            512          39
1996-2000    9             354          39
2001-2005    17            676          40
2006-2010    15            581          39
2011-2015    15            603          40
Average      13.875        546.64       39.32
Wsp: wind speed, Tem: air temperature, Dew: dew point temperature, Prs: atmospheric pressure.
Figure A.1: Patterns of weather from station 010010 at Jan Mayen Norway, with (a) centroids, (b) deviations, (c) minimum,
(d) maximum of attribute values in each cluster, (e) object members in each cluster.