
A Critical Review of Density-based Data Stream Clustering Techniques
Affan Ahmad Toor, Muhammad Usman
Department of Computer Science
Shaheed Zulfikar Ali Bhutto Institute of Science and Technology, Pakistan
affan.toor85@gmail.com, dr.usman@szabist-isb.edu.pk

Waseem Ahmed
School of Computing
Waiariki Institute of Technology, New Zealand
waseem.ahmad@waiariki.ac.nz

Abstract: Data streams are a relatively new and emerging domain in the current era of Internet advancement. Clustering data streams is equally important and difficult because of the numerous hurdles attached to it. A number of algorithms have been proposed to offer solutions for efficient clustering. The grid-based clustering approach was adopted a few years ago to overcome the limitations of conventional partition-based algorithms for data stream clustering. Data points are mapped to grid cells to form micro-clusters, which are later used for clustering. Using density in the clustering process has proved to be a remarkable success, and in recent years many researchers have used density to find clusters of arbitrary shape and density and to identify outliers. The concept of density-based clustering is to use grid-based clustering at the core, to distinguish between dense and sparse grids using density threshold values, and to use the dense grids to yield the clustering results, which provides more cluster purity and accuracy. In this paper, we reviewed grid-based data stream clustering algorithms which utilize density. We evaluated their functionalities and identified their limitations. In the end, we critically evaluated different aspects of the algorithms and suggested one of these algorithms which is better in terms of performance and accuracy.

Keywords: Data Streams, Data Stream Mining, Density-based Clustering, Density-grid clustering

I. INTRODUCTION
With the evolution of the internet, the amount of data being transferred on a daily basis has increased rapidly over the past few years. It is very important to mine this data to extract valuable information and make sense of it. But conventional mining algorithms are not designed to process such huge and continuously flowing data. There are many clustering techniques which were previously used on large amounts of data, but they have certain limitations and drawbacks when they are applied to data streams. So the need to create new algorithms to cope with data streams was recognized, and in the past few years a lot of research work has been done in this regard.

This review explores a few density-based data stream clustering techniques and tries to develop an understanding of how these algorithms work, what challenges they face and what issues they resolve. Although these algorithms target different research areas and domains, they have the same goal of clustering data streams and, to some extent, they face similar issues, such as not being able to identify clusters of arbitrary shape or density, being unable to handle noise and boundary points, etc.

The biggest challenges in stream clustering are the huge amount of data, which cannot be saved or processed in memory, and the speed of the stream, which is so fast that there is not enough time to process the incoming data. To handle these challenges, an online/offline component-based structure was introduced by Aggarwal [13] in the CluStream algorithm. It was so effective that almost all stream clustering algorithms after CluStream followed the same structure. The online component fetches data from the stream and stores it in micro-clusters, and the offline component applies a clustering algorithm on those micro-clusters to perform clustering.

The use of grids in clustering algorithms was a big leap forward in data stream mining. Previously, partitioning-based clustering was used, which divides the data into 2 partitions, applies clustering techniques to both partitions and finally merges the partitions to form the final clusters. The grid-based technique solved the problem of finding clusters of arbitrary nature and was also able to identify noise points in the data. The data points are plotted in grid cells and the neighboring non-empty grid cells are processed further to form the final clusters. Another valuable addition to this process was the use of density. It is a calculated value assigned to each grid cell which defines the importance of each data point as well as the cumulative weightage of each grid. This density-based process has increased the clustering purity and efficiency a great deal.

The purpose of this study is to explore the available density-based stream clustering algorithms, their application in various domains and the problems they solve in order to achieve efficiency. This study is helpful in choosing a stream clustering algorithm on the basis of the performance enhancements made and the problems addressed. The methodology used for this study is, firstly, selecting research papers published in 2010

978-1-5090-2641-8/16/$31.00 ©2016 IEEE

and onward, so that the most recent research is included. Further, the evaluation is done in ascending order of publication year, which shows the gradual progress in the algorithms and the changes in methodology over time. Lastly, we used attributes like dataset, dimensions, data size, stream speed, etc. to compare the kind of environment in which these algorithms were evaluated. We also present a performance comparison of these algorithms on the basis of the common problems addressed and the performance enhancements made.

The rest of the paper is organized as follows. Section II discusses the literature review in detail. Section III provides a critical evaluation along with comparison tables, and finally Section IV presents the conclusion of the study.

II. LITERATURE REVIEW

A. AGDStream
Yang et al. in [1] emphasize the importance of mining data streams in the current era of IT advancement, when our lives are flooded every day with more data than we can handle. Data streams can only be mined sequentially within a limited timeframe, which makes the existing clustering algorithms useless. A new algorithm, Active Grid Density Stream (AGD-Stream), is proposed to solve these problems; this algorithm makes use of the density grid decaying technology to identify active grids and creates clusters from them.

The CluStream algorithm can cluster data streams, but it can only form clusters of spherical shape and fails when it comes to arbitrarily shaped clusters. Another drawback of CluStream is that it cannot handle boundary points, which results in low cluster accuracy. This algorithm works with offline and online components. The online component either forms new micro-clusters from the stream or merges the data into existing relevant micro-clusters. The offline component applies the k-means algorithm on the micro-clusters to obtain the final clusters. A few other algorithms like ACluStream, DenStream, D-Stream and RCTS were able to find arbitrarily shaped clusters, but these involve other complexities which make them less efficient in a real-time environment.

The proposed AGD-Stream algorithm divides the data space into a grid structure by forming small cube-shaped grids and then maps the data stream to this structure. This concept is referred to as grid density. The purpose of the density decaying technology is to extract the boundary points of a grid, which are later used to process the active grid density. The paper introduces the concept of activity, which is used to find the grid density. An inactive grid is a sparse grid to which no new data is arriving; such grids are excluded. On the other hand, an active grid has up-to-date data, so it is kept and used for the clustering process.

The experiments comparing AGD-Stream with CluStream showed that the proposed algorithm is time efficient and improves as the stream length increases. Further, clustering accuracy was tested by dividing the data into 3 classes, and over time the accuracy was found to be better than that of CluStream.

B. SDSStream
Ren et al. in [2] highlight the increasing importance of data stream clustering and, more specifically, the adaptation of the subspace clustering model in today's technology trends. Subspace clustering is defined as a process of finding all clusters hidden in subspaces formed by some degree of relevance in the dimensions. The existing techniques either do not provide a solution for streaming data or work on the decayed window model, which relies heavily on historical data. A new algorithm, SDSStream, is proposed to solve these problems. SDSStream combines the subspace and sliding window models to cluster stream data.

CLIQUE, which is a grid-based subspace clustering algorithm, starts processing with one-dimensional dense units and computes (k+1)-dimensional dense units. After considering dense grids using an entropy measure, the data is plotted in adaptive grids and processed in parallel to obtain effective results. The drawback of this technique is that it loses effectiveness on large datasets such as data streams. Later, many algorithms such as GCHDS, Park, HPStream, etc. were introduced which were able to process stream data and also applied the subspace concept. But they all perform clustering on the decayed window model. To overcome this problem, a sliding window model was developed and used in many algorithms. In the sliding window model, an exponential histogram technique is used to keep track of the variance in data elements over time.

The proposed algorithm SDSStream extends the sliding window model and introduces a weighted sliding window model. In this framework, time defines a window's size to eliminate irregularities. Also, recent data is assigned more weightage than old data so that the relevance and effectiveness of the data remain intact. SDSStream makes use of features like EHCF and TCF to form and maintain micro-clusters, and to sum up the results, core-micro-clusters are extracted from those micro-clusters. New data points are identified either as potential micro-clusters or as outlier micro-clusters and are assigned weights accordingly.

Since the SDSStream algorithm extends the CluStream algorithm, the results are compared with CluStream. For a fair comparison, the EHCFs in SDSStream and the micro-clusters in

CluStream were kept the same. The results show that SDSStream outperforms CluStream in cluster quality, which is measured by the purity of each cluster. Further, the execution time of the online component of the proposed algorithm is faster with one snapshot per 100-second cycle, and it also improves with the passage of time. Finally, SDSStream showed better memory usage with window sizes from 10000 to 100000.

C. MDStream
Magdy et al. in [3] discuss the increasing trend of data streams with the advancement of hardware devices and the internet. Wireless sensors, network traffic, e-commerce transactions, etc. are the main sources of data streams. The nature of data streams is different from static data, as they contain non-static, free-flowing data points, which imposes limitations such as limited memory and limited time. Though data stream algorithms have evolved in recent years and can now identify clusters of arbitrary shapes, they are still not able to identify clusters of arbitrary densities. The reason could be the use of static density parameters.

Previously, a stream clustering technique named k-median was proposed, which divides the data into two partitions; each part is clustered separately before finally merging all parts to form the final clusters. Another algorithm, STREAM, was later proposed to enhance the capability of k-median by producing the centers at each stage. CluStream is another algorithm, which introduced an offline/online component mechanism to cluster stream data. The online component is responsible for extracting data from the stream, and the offline component applies k-means on the data to form clusters.

A grid-based algorithm, DStream, maps the data points to grid cells with the help of its online component. The density of each grid is represented by the total number of points assigned to that grid. Dense grid cells are grouped together to make clusters using the offline component. A density threshold function distinguishes between sparse and dense grids. Later, a few modifications were made to DStream to improve its performance, but it is still not able to identify clusters of arbitrary densities. Furthermore, outliers are not properly handled, and the sensitivity of the clustering result to parameter selection is also ignored.

The proposed algorithm MDStream targets these problems and offers a solution. It follows the grid-based approach by dividing the data into grid cells, but it also avoids overlapping. The resultant cube-shaped grids make efficient use of memory. Grid density is calculated using a decay factor model, and a dynamic range neighborhood mechanism enables the algorithm to discover clusters with arbitrary shapes and densities. The online component reads the stream and creates grids or merges the data points into existing grids. The offline component not only assigns a grid cell to a new density-based cluster but also merges related neighboring clusters until a final cluster is formed.

The experimental evaluation of MDStream compares the relative density relatedness and dynamic range neighborhood factors. The results are compared with the DStream algorithm using both synthetic and real data sets. Cluster quality and efficiency are measured in terms of cluster purity and resource utilization respectively. The results showed that MDStream was not only able to find clusters with arbitrary densities but was also scalable in terms of execution time and data size.

D. DDBound
Zhang et al. in [4] point out that with the recent emergence of real-time data streams there is a need to develop tools and techniques to explore the stream and make use of it. Boundary pattern algorithms are gaining popularity, and many researchers have developed solutions based on boundary patterns. BORDER, BRIM and GBCB are a few algorithms which are used for boundary point detection, but these algorithms have a few limitations, discussed in detail below, which create demand for further research in this area.

BORDER makes use of the reverse k-nearest neighbor (RKNN) technique to detect boundary points. It finds the RKNNs of every data point, sorts them and picks any given number of objects as boundary points. BORDER works fine on any regular dataset but fails when the data contains noise. Further, calculating RKNNs and sorting repeatedly takes up a lot of resources, and the algorithm also depends on some prior knowledge to execute. Another algorithm, BRIM, is able to separate boundary points from noise points by using border degree calculations. However, it is still not very accurate and has a high computational cost. The GBCB algorithm also overcame the problem of noise point recognition by using the restricted k-nearest algorithm. But since it needs to collect knowledge of all boundary points, it becomes more and more complex and resource hungry.

The proposed algorithm DDBound is based on the damped window model and is able to overcome the shortcomings of the above-mentioned algorithms. DDBound maps each point to a grid and then calculates a density threshold to divide the grids into high-density and low-density grids. A maximal density grid is discovered among the high-density grids by iterating through the data points. This maximal density grid is considered the starting point for a depth-first search to cluster the connected high-density grids and their respective transitional density grids. In the end, all the boundary grid clusters are extracted to form the final results.
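The grid mapping, density threshold and depth-first search steps described for DDBound can be sketched roughly as follows. This is a simplified Python illustration under stated assumptions, not the implementation from [4]: it uses plain point counts instead of damped-window densities, treats only face-adjacent cells as connected, and omits transitional density grids and the final boundary extraction. The names `grid_cell`, `neighbors`, `cluster_dense_cells`, `cell_size` and `threshold` are hypothetical.

```python
from collections import defaultdict


def grid_cell(point, cell_size):
    """Map a point to the integer coordinates of the grid cell containing it."""
    return tuple(int(x // cell_size) for x in point)


def neighbors(cell):
    """Yield the face-adjacent cells (differing by 1 in exactly one dimension)."""
    for i in range(len(cell)):
        for step in (-1, 1):
            yield cell[:i] + (cell[i] + step,) + cell[i + 1:]


def cluster_dense_cells(points, cell_size=1.0, threshold=3):
    """Group connected high-density cells into clusters.

    Returns (labels, density): labels maps each high-density cell to a
    cluster id, density maps every non-empty cell to its point count.
    """
    # Assumption: density is a plain point count (no damped-window decay).
    density = defaultdict(int)
    for p in points:
        density[grid_cell(p, cell_size)] += 1

    # Split cells into high- and low-density using a fixed threshold.
    dense = {c for c, d in density.items() if d >= threshold}

    labels = {}
    cluster_id = 0
    # Start each search from the densest unlabeled cell, so the first
    # cluster grows from the maximal density cell.
    for start in sorted(dense, key=lambda c: (-density[c], c)):
        if start in labels:
            continue
        labels[start] = cluster_id
        stack = [start]  # explicit stack = iterative depth-first search
        while stack:
            cell = stack.pop()
            for nb in neighbors(cell):
                if nb in dense and nb not in labels:
                    labels[nb] = cluster_id
                    stack.append(nb)
        cluster_id += 1
    return labels, dict(density)
```

Calling `cluster_dense_cells(points, cell_size, threshold)` on a list of coordinate tuples returns a cluster id for each high-density cell; cells below the threshold receive no label and would be candidates for the boundary or noise handling that this sketch leaves out.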

The experimental studies conducted on DDBound compared its results with the BORDER algorithm. BORDER was unable to identify boundaries properly, since it also showed noise points separately, whereas DDBound handled the noise points. Also, DDBound was able to recognize clusters of arbitrary shapes and sizes.

E. GRDCluStream
Gao et al. in [5] state that the main source of data streams is sensors and the advancement in sensor technologies in recent years. Due to the characteristics of data streams, researchers have tried to develop algorithms that can cluster data in a single scan.

Previously, algorithms like LOCALSEARCH, STREAM, BIRCH, etc. have been proposed which try to cope with data stream challenges, but they lacked the ability to handle the dynamic nature of data streams. CluStream was major progress in this regard, as it introduced online/offline components to cluster stream data. Later, ACluStream enhanced the capabilities of CluStream and was able to form arbitrarily shaped clusters. But it also has the limitation of not being able to provide multi-density functionality.

GRDCluStream adopts the clustering model of the CluStream algorithm and in addition uses a relative density parameter to improve the density calculation process. The online process of GRDCluStream reads data from the stream and saves it in the form of time frame pyramids. The reason this algorithm is labeled as a 'kind' of data stream clustering algorithm is that it mainly emphasizes the online process and leaves the offline process to be done using any clustering algorithm of choice. The online process, referred to as Gad_Mac, calculates the attenuation coefficient, the support of each grid and the eigenvectors of mesh elements. These variables are then used in the polymerization process.

Compared with other algorithms, GRDCluStream not only takes into account the impact of surrounding data elements in grids but also quantifies them into a mesh density and makes the algorithm behave according to the situation; maintaining a low time complexity at the same time makes it very effective.

F. RA-DCluster
Chao et al. in [6] highlight the importance of data stream mining because of the ever-growing stream data continuously generated from sensors, stock markets, online clicks, telecom, etc. Although there has been a lot of progress in the computational capabilities of mobile devices, they are still unable to perform complex data mining functions. This creates the need for new mining algorithms which are resource aware, work well within memory limitations and also consume less battery.

The CluStream algorithm introduced the mechanism of having two components, i.e. an online and an offline component. Since then it has been extensively used in data stream clustering algorithms. DenStream adopted this approach and extended the DBSCAN algorithm to perform density-based clustering. The online component of DenStream reads the stream and creates micro-clusters, which are later used by the offline component to perform clustering. Another density-based algorithm, DStream, divides the space into small grids and plots data into those grids based upon their densities; the grids are later merged according to the updated densities to form clusters. Although these algorithms perform density-based clustering quite efficiently, they are not resource aware. The first algorithm to consider resource conditions was RA-Cluster. It adjusts its parameters according to the available memory and remaining battery. However, it is not capable of finding arbitrary densities in streams.

The proposed RA-DCluster algorithm not only keeps track of available resources but also performs grid- and density-based clustering. In the online component of the algorithm, records are mapped to dynamic grids using the sliding window model. The offline component performs density-based clustering and also uses a resource monitor to check the available resources. In case of low memory or battery, the window size is reduced and the grid size is increased. If resources are insufficient during clustering, the number of sparse grids is reduced by increasing the density threshold. Even at the critical battery level, clustering results are still generated by prioritizing the unprocessed grids on the basis of lowest density.

The comparison of RA-DCluster with RA-Cluster shows better efficiency, accuracy, memory usage and battery consumption. The experimental results show that the number of points processed per second is better than for RA-Cluster and that the algorithm is also scalable. The comparison of memory and battery usage also shows that RA-DCluster adapts to the available memory with the passage of time.

G. HDDStream
Ntoutsi et al. in [7] emphasize the importance of choosing relevant attributes in mining high-dimensional datasets. With data streams this becomes more difficult, as appropriate attributes have to be selected within a time window before a concept drift or concept shift takes place. The approaches used to cluster high-dimensional stream data have issues like assuming that the cluster count remains constant throughout the life cycle and using non-partitioning-based approaches which are not able to identify the clusters in subspaces. Three types of

clustering approaches have been researched, i.e. density-based, high-dimensional and full-dimensional space clustering for data streams.

Density-based clustering, being a non-parametric approach, does not require parameters like the number of clusters or any other prior knowledge before clustering. This simplifies the clustering; however, it does not provide any working model of or insights into the discovered clusters. The idea of high-dimensional data clustering is that, due to the irrelevance of many attributes/features in the context of the whole space, the clustering is done more accurately in subspaces using local features which have relevance to or correlation with that specific subspace. The two approaches used in subspace clustering are bottom-up and top-down. The first one computes all clusters in all subspaces, and the latter one avoids overlapping and uses disjoint ones instead. The problem with those approaches is that they involve too much redundancy and projected clustering in a disjoint group. Clustering stream data has always been a challenge because of the stream size and the time window, keeping in mind the resource constraints.

The proposed algorithm HDDStream clusters the streams in a three-step process. In the first step, i.e. initialization, microclusters are extracted using the PreDeCon algorithm. This basically computes the d-dimensional neighborhood value for all points and then distributes the points into microclusters based on the variance and preference vectors. The second step is online microcluster maintenance, which is responsible for distinguishing between projected microclusters and outlier microclusters. The final step is offline clustering, which uses a variant of the PreDeCon algorithm to form clusters. Each microcluster which is also a core projected microcluster is assumed as a starting point. The neighbors of this cluster become seeds to expand the cluster and finalize the clustering process.

HDDStream is compared with the HPStream algorithm to show the performance improvement. Clustering quality is measured by means of the average percentage of true class labels in a cluster. Memory usage is measured by the microclusters simultaneously maintained at a given time. Several experiments conducted on different datasets showed that both cluster purity and memory consumption are better than for HPStream.

H. Modified DenStream
According to Vallim et al. in [8], data streams are so huge that performing analysis on them becomes a challenging task. A data stream algorithm must work incrementally in order to analyze continuously flowing data. Proper handling of concept drift is an important aspect, because if behavioral changes of streams are ignored then the performance is gradually degraded. Two approaches are used for behavioral change detection, i.e. supervised and unsupervised. The supervised approach monitors performance measures with the help of prior knowledge and detects when there is a loss in quality, whereas the unsupervised approach uses clustering techniques to detect behavioral changes without any prior knowledge.

The supervised approach for behavior change detection is more suitable for streams which provide information about class labels. This information is used to monitor measures like accuracy, recall, etc. This approach fails when no class labeling is provided; for that purpose the unsupervised approach, e.g. clustering, is used. Algorithms like Growing k-means, OLINDDA, SLC, etc. are incremental clustering-based examples of concept drift detection. Some neural-network-based approaches like SOM, GWR, SONDE, etc. are also effective in this regard.

The newly proposed algorithm is a modified version of the DenStream algorithm, which is a 2-phase clustering process for density-based streams. In the conventional DenStream algorithm, the online phase distributes the data points into potential microclusters and outlier microclusters on the basis of weightage. In the offline phase, the clustering is done using the DBSCAN algorithm. The modified version of DenStream handles the densities and also implements a concept drift detection technique. Unlike DenStream, this modified version calculates the entropy levels for microclusters and keeps a history of them, which is later used to detect any novel change or event that indicates concept drift. The experiments show that the modified DenStream performs better than the conventional DenStream. Keeping a history of the microclusters improves the ability to detect behavioral changes accurately.

I. DCUStream
Yang et al. in [9] point out that due to some internal and external environmental factors the amount of uncertain data is increasing day by day. Specifically, Radio Frequency Identification (RFID) and Wireless Sensor Networks (WSN) produce a lot of uncertain data due to lack of energy, cost, etc. and also due to a high error rate. With the emergence of Internet of Things technology, it is necessary to develop data mining techniques to utilize uncertain data streams and perform useful analysis.

Previously, an algorithm named UMicro was proposed to cluster data streams of uncertain nature. In this algorithm an uncertain feature was used to compute the cluster center and radius. Another such algorithm addressed the problem of the "curse of dimensionality", and a few other algorithms used entropy information to overcome the uncertainty of data

streams. These are attribute-level uncertainty approaches; however, another algorithm named EMicro was proposed to offer an existence-level uncertain data solution. A few problems with these algorithms are that they are unable to form arbitrarily shaped clusters and that newly arriving and existing data are treated the same way, which is not a practical approach, as the quality of existing data may vary with the passage of time.

The proposed algorithm DCUStream addresses these problems and performs density-based clustering on uncertain data streams. In this algorithm the grid is partitioned into multiple hypercubes within the data space. Then a weight value is calculated, and the grids are labeled as sparse or dense grids according to the weight value. After that, the existence-uncertain factor and tense factors are used to classify the data points. Finally, when it comes to clustering, core dense grids are identified and their neighbors are assembled into clusters using depth-first search. The sparse grids represent the boundary points and are processed for noise point detection. The process goes on until all data points are assigned either to sparse grids or to dense grids.

The experimental results of DCUStream are compared with the EMicro algorithm. The average cluster quality of DCUStream was found to be better than that of EMicro, which is probably because EMicro creates sphere-shaped clusters, whereas DCUStream creates arbitrarily shaped clusters. Also, DCUStream is more time efficient, because EMicro is a k-means-based algorithm which requires complex calculation and hence takes more time and resources.

J. PKS-Stream-1
Zhang et al. in [10] divide data stream clustering algorithms into two categories, i.e. single-stage processing model algorithms and two-phase processing model algorithms. The first category partitions the stream into small chunks and clusters them one by one. Those algorithms are mostly based on k-means and use an iterative approach to discover clusters by assigning equal weights to all clusters. This process, however, ignores the evolution of the stream, and after some time the clusters become outdated. The two-phase processing model (e.g. CluStream) consists of an online and an offline phase. The online phase reads the raw data from the stream and saves it with some information. The offline phase takes this data and performs k-means-based clustering.

A few limitations of the two-phase model are that it is not able to identify arbitrarily shaped clusters or noise points. It also requires prior knowledge of the data points. The PKS-Stream algorithm is a combination of density- and grid-based algorithms. With the help of the Pks-tree technique, the non-empty grid cells are stored, and a method is devised which decides for how long density needs to be calculated before finalizing the clusters.

PKS-Stream-1 is the proposed algorithm, which extends the functionality and enhances the performance of PKS-Stream. This algorithm omits the empty or less frequently used grids to save resources and increase efficiency. The grids are partitioned with multiple granularities, and any grid is subject to being further partitioned into sub-grids. The Pks-tree is then used to traverse these sub-grids and pick the non-empty cells, which will later be used to form the clusters.

The comparison of PKS-Stream-1 and PKS-Stream is done by experimenting with multiple datasets and measuring cluster quality and scalability. Cluster purity and entropy represent the quality of a cluster, which in this case is found to be better in PKS-Stream-1. The clustering execution time becomes more or less the same in both algorithms if the speed of the data stream is increased. However, when scalability in terms of memory consumption is measured, PKS-Stream-1 was found to take much less memory than PKS-Stream.

K. MuDi Stream
Armini et al. in [11] mention various application areas of data stream mining such as network traffic anomaly detection, web searches, sensor monitoring, social networks, etc. In these domains the data arrives at a continuous rate and in huge amounts, so mining such data has some special requirements and limitations. Processing density-based data is one of the attributes which a data stream clustering algorithm must have. Existing density-based algorithms like Den-Stream, D-Stream and MR-Stream work well, but they have limitations when it comes to multi-density data. Though a few multi-density clustering algorithms exist, they are not meant for handling data streams; instead they are suitable for static data. A new algorithm, MuDi-Stream, is proposed which is not only able to handle data streams but can also cope with multi-density clusters.

Clustering algorithms are classified into four categories, i.e. partitioning-based clustering, hierarchical clustering, grid-based clustering and density-based clustering. The proposed algorithm falls under the density-based clustering technique. It has a tree-like structure which keeps track of each node's parent and child and uses this information to create clusters with improved performance. A few existing density-based algorithms such as D-Stream and FlockStream are mentioned. D-Stream uses a multi-resolution approach, and its quality depends on the granularity of the lowest level structure. On the other hand, FlockStream is a bio-inspired algorithm which makes use of independently working agents which form the clusters simultaneously. These algorithms work quite well

in their own range but they are unable to handle multi-density data. A comparison of various algorithms shows that algorithms either support data streams but not multiple densities, or support multiple densities but are not data stream compatible.

In the proposed algorithm, a new term is introduced, i.e. mini core distance (mcd), which is a minimum threshold value for capturing clusters of different densities. Basically, multiple small clusters are formed on the basis of mcd, and they group together to make a multi-density cluster. The algorithm also assigns a weight to each data point, which is initialized to 1 and decreases over time. These weights are summed to calculate the grid weight, which helps in finding the overall grid density. The online part of the MuDi algorithm reads the data stream and either passes each point to a core mini cluster or maps it to the grid using a helper algorithm called Merge. Meanwhile, another helper algorithm named Pruning removes the outliers periodically and recalculates the grid weights. In the offline component of the algorithm, an extended form of the DBSCAN algorithm fetches the results from the mini clusters.

In the end, some experimental studies are conducted to show the results of the newly proposed algorithm. The implementation is developed in Java and the results are compared with the DenStream algorithm. Two synthetic datasets having 5000 and 10000 data points respectively were used. Each dataset had multiple arbitrary shaped and multi-density clusters. When it comes to cluster purity, both algorithms show similar results for single-density clusters, but when multi-density clusters are passed, the quality and purity of DenStream decrease significantly. MuDi was not only able to identify all clusters but also took less time than DenStream. MuDi Stream is also insensitive to data noise, which makes it more efficient and reliable.

L. PGDC-Stream

Hu et al. in [12] point out that existing data clustering algorithms are not capable of processing high-dimensional data streams. To be specific, they are not able to find arbitrary shaped clusters and cannot remove noise points in a timely manner. The proposed algorithm uses the MapReduce technique, which is based on the Hadoop platform, which is widely used for processing large-scale static data.

The basic idea of the MapReduce pattern is that it generates key-value pairs of data. The values of pairs that share the same key can be merged using a reduce function. This mechanism allows the algorithm to process a large amount of data as well as distribute it among thousands of machines for parallel processing. The MapReduce process is done in 3 phases. The output of each phase is used as the input of the next one. This technique makes the clustering process very efficient. Instead of scanning all grid cells, the MapReduce function processes the data in parallel and reduces the calculations at each step to get improved results.

When applying this technique to data streams, a few things are very important. First, a proper inspection cycle must be in place. If data arrives regularly at a sparse grid unit, then with the increase in density it evolves into a dense grid unit. However, if data does not arrive frequently, a dense grid unit will become sparse over time. So an inspection cycle at the right interval will ensure that the quality of the dense grid remains intact. The other thing to look for is to formulate an appropriate density threshold function. This function removes the noise points from the data to ensure that the clustering results are not affected. The online component of the PGDC-Stream algorithm is responsible for taking data from the stream and mapping it to the grid. The offline component continuously inspects and removes sparse grid units which are formed by noise points.

The experimental studies for this algorithm used a dataset of 100000 points with 2 dimensions and a data stream rate of 1000 points/s. The results were compared with CluStream, and the performance was found to be much better. The PGDC algorithm was able to detect noise points in a timely manner, and arbitrary shaped clusters were also formed. The resource release rate was also very high. In the scalability test, the cluster quality, purity and scalability were found to be better than those of CluStream, because unlike CluStream, PGDC does not need to do complex distance computations; it just has to map continuously arriving data to the relevant grid. In the end it is pointed out that this algorithm may have some drawbacks, as the results could be different if the order of the stream is changed; since it is preliminary research, there is room for improvement.

III. CRITICAL EVALUATION

The purpose of this section is to compare all the algorithms discussed above, to identify their strengths and weaknesses, and to determine which algorithm is most suitable in which scenario. Although all of the discussed algorithms perform clustering on data streams and all of them make use of the density measure, the circumstances they operate in are different. The problems they target and the enhancements they offer are also different. So, in our opinion, grading any algorithm lower than any other based on the above criteria is not fair. However, we can evaluate the algorithms on the basis of their targeted problems and the number of enhancements they offer.
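
Since all of the algorithms reviewed above rely on a density grid, the shared mechanism can be illustrated with a small sketch: points are mapped to grid cells, each point contributes a weight of 1 that decreases over time, the weights in a cell sum to its grid weight, and a periodic inspection removes sparse cells. The Python code below is only an illustration under assumptions and is not the implementation of any reviewed algorithm. The exponential form of the decay, the class name DecayingDensityGrid and the parameters cell_size, decay_rate and dense_threshold are hypothetical choices made for this example.

```python
import math


class DecayingDensityGrid:
    """Toy density grid with decaying weights (illustrative names only)."""

    def __init__(self, cell_size, decay_rate, dense_threshold):
        self.cell_size = cell_size              # side length of a grid cell
        self.decay_rate = decay_rate            # assumed exponential decay constant
        self.dense_threshold = dense_threshold  # minimum weight of a dense cell
        self.cells = {}                         # cell -> (weight, time of last update)

    def cell_of(self, point):
        # Map a point to the integer coordinates of its grid cell.
        return tuple(math.floor(x / self.cell_size) for x in point)

    def weight(self, cell, now):
        # Grid weight of a cell at time `now`, after decay since its last update.
        if cell not in self.cells:
            return 0.0
        stored, last_update = self.cells[cell]
        return stored * math.exp(-self.decay_rate * (now - last_update))

    def add(self, point, now):
        # Online step: a new point starts with weight 1 and is added to its cell.
        cell = self.cell_of(point)
        self.cells[cell] = (self.weight(cell, now) + 1.0, now)
        return cell

    def prune(self, now, sparse_threshold):
        # Inspection step: drop cells whose decayed weight fell below the threshold.
        sparse = [c for c in self.cells if self.weight(c, now) < sparse_threshold]
        for cell in sparse:
            del self.cells[cell]
        return sparse

    def dense_cells(self, now):
        # Cells that currently qualify as dense.
        return {c for c in self.cells if self.weight(c, now) >= self.dense_threshold}


grid = DecayingDensityGrid(cell_size=1.0, decay_rate=math.log(2), dense_threshold=2.5)
for p in [(0.1, 0.2), (0.5, 0.9), (0.7, 0.3)]:
    grid.add(p, now=0.0)
grid.add((5.2, 5.7), now=0.0)           # an isolated point, likely noise
print(grid.dense_cells(now=0.0))        # {(0, 0)}
print(grid.prune(now=1.0, sparse_threshold=1.0))  # [(5, 5)]
```

With the assumed decay rate the weight of every cell halves per time unit, so the isolated point is removed at the first inspection, while the cell that received three points survives but is no longer dense until new data arrives. This mirrors the observation that a dense grid unit becomes sparse over time when data stops arriving.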

Table I presents some common attributes of each algorithm so that it can be compared with its peers. Attributes like objective, environment, tool used, etc. can be ignored, as they do not affect the evaluation because we are not imposing any platform or tool dependency. However, other attributes such as dataset, number of clusters, number of attributes or dimensions, total data points, stream speed, etc. are very important. Our thoughts on each of these attributes are as follows:

Dataset: Most of the algorithms used the KDD CUP99 dataset, which is a real dataset based on intrusion detection data points. Some algorithms also used synthetic datasets along with KDD CUP99. A few algorithms ([8], [11] and [12]) used only synthetic datasets, but we recommend the use of both synthetic and real datasets; the results can be improved further if multiple real datasets are used.

Clusters: A maximum of 10 clusters [6] and 80 micro-clusters [7] is used. A few algorithms did not mention the number of clusters, and the rest used very few clusters. In our opinion, to properly test the results the number of clusters should be increased, because in real data streams the number of clusters could be very high.

Attributes/Dimensions: Most of the algorithms used a decent number of attributes/dimensions (30-50), which can help in in-depth clustering.

Data Points: This is a very important attribute in stream clustering. Since streams normally consist of a huge volume of data points, it is recommended to use a very large number of data points in order to get a real-time feel. Generally, algorithms have used 100k-200k data points, which is not enough. We recommend using at least 2 million data points, keeping in mind, however, that large real datasets are of limited availability and very difficult to obtain.

Stream Speed/sec: Stream speed is another important attribute to consider. It represents the number of data points arriving per second. The maximum stream speed, used by [4], [7] and [11], is 2000 data points per second. It is a decent speed, but the speed of a real stream, depending upon its source, can be far higher than 2000, so we need algorithms which can perform well at high stream speeds as well.

Table II depicts the claimed contributions of each algorithm. The table has two sections: problems addressed and performance enhancements. Red cells show that an algorithm does not have that feature and green cells show that it claims to have that feature. We have not verified the claims made by each algorithm; we have assumed that the experimental results they provide are valid.

The problems addressed section shows which problems each algorithm targeted. As Table II shows, the most common problems addressed by the algorithms are identifying arbitrary shaped clusters, finding arbitrary density clusters, noise point handling and dependency on prior clustering knowledge. These are also a few of the main challenges in data stream clustering, so we expect each data stream clustering algorithm to address these problems even if its main research problem is different.

The performance enhancements section shows how many performance enhancements are claimed by each algorithm. These claims are extracted from the experimental results sections. The experimental results of each algorithm are compared with those of its predecessor. The most common enhancements are time efficiency, cluster purity, clustering quality/accuracy, efficient memory consumption and algorithm scalability. All of these features are very important for the success of a data stream algorithm; however, time and memory efficiency are the 2 most important aspects, since these are the biggest challenges in stream clustering.

Another useful feature is algorithm scalability, which almost all of the algorithms claim to provide. It means that the algorithm maintains its performance and keeps on yielding accurate results even when measures like cluster count, data size, stream speed, etc. are increased. An algorithm which shows improved performance only in a static environment will most likely fail when it is overloaded with input parameters.

If we summarize the evaluation of all algorithms, then in our opinion PKS-Stream-1 [10] is a good example of an efficient data stream clustering algorithm. One reason is that it not only uses the grid-based structure and density values but also utilizes the Pks-tree structure. It plots data points into Pks-trees and then assigns those trees to grid cells. This way the clustering is done more efficiently. Another reason for declaring PKS-Stream-1 better is that it addresses the largest number of problems and also offers the largest number of performance enhancements.

IV. CONCLUSIONS

In this paper, we have reviewed various density-based clustering algorithms for data streams. These algorithms are grid-based, where data points are plotted on grids and processed in a systematic way. The density values associated with each data point help in distinguishing between dense and sparse grids. According to our understanding, the grid-based structure enables an algorithm to overcome the shortcomings of legacy partition-based algorithms, which were not suitable for data streams. The use of density further improves clustering

TABLE I. SUMMARY OF DENSITY-BASED ALGORITHMS
| Algorithm | Year | Objective | Environment | Tool Used | Dataset | Clusters | Attributes/Dimensions | Data Points | Stream Speed/sec |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AGD-Stream [1] | 2010 | Active Grid Density based clustering | Intel Pentium 4, Windows XP | MATLAB | KDD CUP-99 | 5 | 42 | 30k | |
| SDSStream [2] | 2010 | Subspace Clustering over Weighted Sliding Window | Intel Pentium 4, Windows XP | Visual C++ | KDD CUP-99 | | 41 | 300k | |
| MDStream [3] | 2011 | Arbitrary shape and density clustering | | C++ | KDD CUP-99 and Synthetic datasets | | 50 | 200k | |
| DDBound [4] | 2011 | Real time stream boundary detection | | | 3 Benchmark datasets | 4 | | 2.47m | 2000 |
| GRDCLuStream [5] | 2012 | Grid and Relative Density based Clustering | | | | | | | |
| RA-DCluster [6] | 2012 | Resource aware clustering in mobile devices | Android Emulator, Windows XP | Eclipse | Shuttle dataset (real) and synthetic dataset | 10 | 9 | 243k | 100 |
| HDDStream [7] | 2012 | Projected clustering over high dimension data stream | | Java | KDD CUP-99, UCI KDD | 80 (micro) | 34 | 1m | 2000 |
| Modified DenStream [8] | 2012 | Behavior change detection in data streams | | | 8 Numeric datasets | 4-7 | 2 | 8 | |
| DCUStream [9] | 2012 | Uncertain data stream clustering | Pentium 4, Windows 2003 | MATLAB | Forest Covertype dataset (real) and synthetic dataset | | 40 | 1.1m | |
| PKS-Stream-1 [10] | 2012 | Density grid based stream clustering | Intel Core i3 | Visual Studio 2010 | KDD CUP-99 | | 41 | 300k | 1550 |
| MuDi Stream [11] | 2013 | Multi density clustering for noisy data streams | Mac OS X | Java | 2 synthetic datasets | | | 15k | 2000 |
| PGDC Stream [12] | 2015 | Parallel data stream clustering | Intel Pentium D, Ubuntu | HADOOP | Synthetic dataset | | 2 | 100k | 1000 |

TABLE II. COMPARISON OF DENSITY-BASED CLUSTERING ALGORITHMS
[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12]
Problems Addressed
No arbitrary shaped clusters ● ● ● ● ●
No boundary point detection ●
No arbitrary densities in clusters ● ●
No outlier/noise handling ● ● ● ● ●
No selection of clustering result sensitivity ● ●
Dependency on prior cluster knowledge ● ● ● ●
Clustering process redundancy ● ●
Projected clustering in disjoint ●
Concept drift detection ●
No multi-density support ● ●
Performance Enhancements
Time Efficiency ● ● ● ● ● ● ● ● ● ●
Clustering Accuracy ● ● ●
Cluster Quality/Purity ● ● ● ● ● ●
Memory Consumption ● ● ● ●
Accurate boundary detection ●
Concept/Behavioral change detection ●
Clustering Entropy ●
Battery Consumption ●
Scalability ● ● ● ● ● ● ● ●
Relative Density ●

results and makes the clustering fast and reliable. We analyzed the problems and challenges that exist in data stream clustering and how the proposed algorithms have addressed them. Every researcher in this domain has to face the following major challenges:

Memory Constraint: Due to the immense size of data streams, memory is the biggest limitation. It is physically impossible to store all the data in memory, which makes it tricky to perform any kind of clustering on it. The solution to this limitation is to take a chunk of data in a given time window and perform clustering on it. This data is then released to free memory, and the next chunk of data is fetched to move forward.

Time Constraint: Another limitation of data stream clustering is time. The stream is very fast and the continuously flowing data does not wait to be processed. We not only have to perform clustering in limited time but also prepare our system to receive new data, which keeps on coming without any interruption. The only solution to this problem is to improve the efficiency and response time of our algorithm to handle fast-paced data.

REFERENCES

[1] J. Yang, W. Zhu, J. Zhang and Y. Yang, "Data Stream Clustering Algorithm Based on Active Grid Density," 2010 Fifth International Conference on Internet Computing for Science and Engineering, Heilongjiang, 2010, pp. 97-101.
[2] J. Ren, S. Cao and C. Hu, "Density-Based Data Streams Subspace Clustering over Weighted Sliding Windows," Cryptography and Network Security, Data Mining and Knowledge Discovery, E-Commerce & Its Applications and Embedded Systems (CDEE), 2010 First ACIS International Symposium on, Qinhuangdao, 2010, pp. 212-216.
[3] A. Magdy, N. A. Yousri and N. M. El-Makky, "Discovering Clusters with Arbitrary Shapes and Densities in Data Streams," Machine Learning and Applications and Workshops (ICMLA), 2011 10th International Conference on, Honolulu, HI, 2011, pp. 279-282.
[4] X. Zhang, X. Liang and B. Li, "Real-time data stream clustering and its boundary detection based on distance and density," Advanced Computational Intelligence (IWACI), 2011 Fourth International Workshop on, Wuhan, 2011, pp. 209-212.
[5] H. Gao, R. Li and J. Hou, "A Kind of Data Stream Clustering Algorithm Based on Grid and Relative Density," Engineering and Technology (S-CET), 2012 Spring Congress on, Xian, 2012, pp. 1-4.
[6] C. M. Chao and G. L. Chao, "Resource-Aware Density-and-Grid-Based Clustering in Ubiquitous Data Streams," Advanced Information Networking and Applications Workshops (WAINA), 2012 26th International Conference on, Fukuoka, 2012, pp. 1203-1208.
[7] Irene Ntoutsi, Arthur Zimek, Themis Palpanas, Peer Kröger, and Hans-Peter Kriegel, "Density-based Projected Clustering over High Dimensional Data Streams," Proceedings of the 2012 SIAM International Conference on Data Mining, 2012, pp. 987-998.
[8] R. M. M. Vallim, J. A. A. Filho, A. C. P. L. F. Carvalho and J. Gama, "A Density-Based Clustering Approach for Behavior Change Detection in Data Streams," 2012 Brazilian Symposium on Neural Networks, Curitiba, 2012, pp. 37-42.
[9] Y. Yang, Z. Liu, J. P. Zhang and J. Yang, "Dynamic density-based clustering algorithm over uncertain data streams," Fuzzy Systems and Knowledge Discovery (FSKD), 2012 9th International Conference on, Sichuan, 2012, pp. 2664-2670.

[10] D. Zhang et al., "A Clustering Algorithm Based on Density-Grid for Stream Data," 2012 13th International Conference on Parallel and Distributed Computing, Applications and Technologies, Beijing, 2012, pp. 398-403.
[11] A. Amini, H. Saboohi and T. Y. Wah, "A Multi Density-Based Clustering Algorithm for Data Stream with Noise," 2013 IEEE 13th International Conference on Data Mining Workshops, Dallas, TX, 2013, pp. 1105-1112.
[12] W. Hu, M. Cheng, G. Wu and L. Wu, "Research on Parallel Data Stream Clustering Algorithm Based on Grid and Density," Computer Science and Mechanical Automation (CSMA), 2015 International Conference on, Hangzhou, 2015, pp. 70-75.
[13] Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu. 2003. A framework for clustering evolving data streams. In Proceedings of the 29th international conference on Very large data bases - Volume 29 (VLDB '03), Johann Christoph Freytag, Peter C. Lockemann, Serge Abiteboul, Michael J. Carey, Patricia G. Selinger, and Andreas Heuer (Eds.), Vol. 29. VLDB Endowment 81-92.
[14] A. S. Sabau, "Clustering Data Streams Using Mass Estimation," Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 2013 15th International Symposium on, Timisoara, 2013, pp. 289-295.
[15] Z. Sun, K. Z. Mao, W. Tang, L. O. Mak, K. Xian and Y. Liu, "Knowledge-based evolving clustering algorithm for data stream," 2014 11th International Conference on Service Systems and Service Management (ICSSSM), Beijing, 2014, pp. 1-6.
[16] A. Amini, T. Y. Wah, M. R. Saybani and S. R. A. S. Yazdi, "A study of density-grid based clustering algorithms on data streams," Fuzzy Systems and Knowledge Discovery (FSKD), 2011 Eighth International Conference on, Shanghai, 2011, pp. 1652-1656.
[17] M. Hahsler and M. Bolaños, "Clustering Data Streams Based on Shared Density between Micro-Clusters," in IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 6, pp. 1449-1461, June 1 2016.
[18] Yogita and D. Toshniwal, "Clustering techniques for streaming data-a survey," Advance Computing Conference (IACC), 2013 IEEE 3rd International, Ghaziabad, 2013, pp. 951-956.
