Академический Документы
Профессиональный Документы
Культура Документы
scribed, the WSTREAM algorithm needs only a single pass Comparisons to Other Algorithms In this work we have
over the dataset, which satisfies the constraints imposed by used the same datasets as similar works, and this greatly fa-
streaming data. The memory requirements of the algorithm cilitates comparison of our proposed approach to related al-
strongly depend on the complexity of the underlying cluster gorithms. Specifically, both DsetKDD , and DsetForest , have
structure. This is because a cluster is defined by at least one been used in assessment of HPSTREAM [3] and DEN-
window. To this end, as the number of clusters grows more STREAM [4]. We have measured clustering efficiency in
memory is required to keep track of their structure. More- a similar manner, and our results yield superior purity to
over, if the clusters have a non-convex shape then more than these other approaches.
one window is required to capture their structure [9]. To compare the three algorithms we constructed a se-
ries of artificial datasets Dset3 , by drawing 20,000 random
points from a mixture of 5-dimensional spherical Gaussian
5. Experimental Results
distributions. The number of Gaussians varied between 2
and 100, in order to measure the scalability of the algo-
To test the the efficiency of the WSTREAM algorithm
rithms with respect to the number of clusters in the data.
we first employ the Forest CoverType real world data set,
Similarly, a series of artificial datasets, Dset4 , was con-
obtained from the UCI machine learning repository [7].
structed in a similar manner, with the number of Gaussians
This data set, DsetForest , is comprised of 581012 observa-
fixed at 5, and the dimensionality varied between 2 and 20.
tions characterised using 54 attributes, where each obser-
The mean of all the Gaussians was uniformly scattered in
vation is labeled in one of seven forest cover classes. We
the interval [0, 2000] for each coordinate, and the variance
retain the 10 numerical attributes. To measure clustering ef-
along each dimension was a uniformly chosen number in
ficiency we employed the cluster purity measure P , [3, 4],
Pk |Di | the interval [0.5, 5]. The required processing time for a
defined as: P = i=1k |Ni | , where |Di | denotes the num- 2.8GHz Pentium 4 personal computer with 1GB of memory
ber of points with the dominant class label in cluster i, |Ni | was measured for both series of datasets; Dset3 , and Dset4 .
the total number of points in cluster i, and k, the number of we used the suggestions in [3, 4] to set the parameters of
clusters. Intuitively, this measures the purity of the clusters HPSTREAM and DENSTREAM. For HPSTREAM we set
with respect to the true cluster (class) labels that are known the number of fading cluster structures as twice the true
for our data sets. In Fig. 3(a), we exhibit the purity of the cluster number in all cases. For DENSTREAM, the pa-
clustering result obtained in the last 2000 points for vari- rameter was set to 2.0, that is half the size of the largest pos-
ous time points of the data stream. Note that purity is never sible covariance. The results are presented in Figs. 4(a) and
lower than 80%. Additionally, to examine sensitivity to var- (b). Since there is some randomness involved, all the exper-
ious parameters, at time point 20000, we examine the purity iments were performed 10 times. In the figures the mean
of the clustering result on the last 2000 records, obtained for of these experiments is displayed, along with vertical lines
different values of the parameters a (initial size of the win- that depict the variation of the observed times. Although
dow), and θe (enlarge-contract parameter). Fig.3(b) shows HPSTREAM is a projected clustering algorithm, additional
(a) (b) (c)
98 94 94
96 92 92
Purity (%)
Purity (%)
Purity (%)
94
90 90
92
88 88
90
86 86
88
84 84
86
84 82 82
82 80 80
0 50000 100000 150000 200000 250000 300000 350000 400000 0 2 4 6 8 10 12 14 16 18 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
time α θe
Figure 3. Results for the DsetForest , and different parameter settings.
Processing Time
DENSTREAM
tend such clustering methods to the spatio-temporal form of
2
WSTREAM streaming data. Our proposed algorithmic scheme appears
1.5
able to capture evolving cluster dynamics and provide both
efficient and meaningful results.
1
0.5
References
0
10 20 30 40 50 60 70 80 90 100 [1] C. Aggarwal. A framework for diagnosing changes in evolv-
Number of Clusters ing data streams, 2003.
Processing Time