Sivakami P.¹, Ramesh D.²
Department of Computer Science & Engineering
PSNA College of Engineering and Technology, Dindigul, Tamil Nadu, India
1. PG Student, 2. Lecturer
Siva_pon2006@yahoo.co.in
Notation:
I_ir : one node (entry) in the NIR table of cluster c_i
w(c_i, I_ir) : the number of occurrences (importance) of I_ir in cluster c_i
K : the number of clusters in C
m_i : the number of data points in C_i
S_t : the sliding window t
θ : the outlier threshold
ε : the cluster variation threshold
η : the cluster difference threshold

[Fig. 3 flowchart labels: Distribution Comparison, NIR / Previous NIR, Updating, Dumping, Re-clustering]
In this section, the DCD algorithm is used to detect the difference of cluster distributions between the current data subset S_t and the last clustering result C^[t_e, t−1], and to decide whether re-clustering is required in S_t. We modify our previous work on the labeling process in order to detect outliers in S_t: a data point that does not belong to any proper cluster is called an outlier. After labeling, the last clustering result C^[t_e, t−1] and the current temporal clustering result C_t^t obtained by data labeling are compared with each other. The flowchart of the DCD algorithm is shown in Fig. 3. Section 3.1 introduces the data labeling process and the outlier detection; Section 3.2 presents the cluster distribution comparison.

Given a data point p_j and the NIR table of cluster c_i, the resemblance is defined by the following equation:

    R(p_j, c_i) = ∑_{r=1..q} w(c_i, I_ir),

where I_ir is one entry in the NIR table of cluster c_i. The value of the resemblance R(p_j, c_i) is obtained directly by summing up the importance of the nodes in the NIR table of cluster c_i, where these nodes are decomposed from the data point p_j.

Example 1. Consider the data set in Fig. 2 and the NIR of C^1 in Fig. 5. The data points in the second sliding window perform data labeling with the thresholds λ1 = λ2 = 0.5. The first data point p_6 = (B, E, G) in S_2 is decomposed into three nodes, i.e., {[A1=B]}, {[A2=E]}, and {[A3=G]}. The resemblance of p_6 in c_1^1 is zero, and in c_2^1 it is also zero. Since the maximal resemblance is not larger than the threshold, the data point p_6 is considered an outlier. In addition, the resemblance of p_7 in c_1^1 is 0.029, and in c_2^1 it is 1.529 (0.5 + 0.029 + 1). The maximal resemblance value is R(p_7, c_2^1), and it is larger than the threshold λ2 = 0.5. Therefore, p_7 is labeled to cluster c_2^1.

The last clustering result and the current temporal clustering result obtained by data labeling are compared with each other to detect the drifting concept. The clustering results are said to be different according to the following two criteria:

1. The clustering results are different if quite a large number of outliers are found by data labeling.
2. The clustering results are different if quite a large number of clusters are varied in the ratio of data points.

The entire Cluster Distribution Comparison equation is shown as follows:

    Concept drift = yes, if (# of outliers) / N > θ  or  K_varied^[t_e, t−1] / K^[t_e, t−1] > η;
                    no, otherwise,

where N is the number of data points in the sliding window and K_varied^[t_e, t−1] is the number of varied clusters.
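The labeling and drift-detection steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming an NIR table is represented as a mapping from (attribute index, value) nodes to importance weights; the helper names (resemblance, label_points, concept_drift) are ours, not from the paper.

```python
def resemblance(point, nir_table):
    """R(p_j, c_i): sum the importance of the nodes decomposed from
    the data point that appear in the cluster's NIR table."""
    # enumerate(point) yields the (attribute index, value) nodes of p_j
    return sum(nir_table.get(node, 0.0) for node in enumerate(point))

def label_points(points, nir_tables, threshold):
    """Assign each point to the cluster of maximal resemblance, or mark
    it as an outlier (None) when that maximum does not exceed the
    labeling threshold."""
    labels = []
    for p in points:
        scores = [resemblance(p, table) for table in nir_tables]
        best = max(range(len(scores)), key=lambda i: scores[i])
        labels.append(best if scores[best] > threshold else None)
    return labels

def concept_drift(labels, n_varied_clusters, n_clusters, theta, eta):
    """Cluster Distribution Comparison: drift is reported when the
    outlier ratio exceeds theta or the ratio of varied clusters
    exceeds eta."""
    outlier_ratio = sum(1 for l in labels if l is None) / len(labels)
    return outlier_ratio > theta or n_varied_clusters / n_clusters > eta
```

With hypothetical NIR tables, a point whose maximal resemblance does not exceed the threshold comes back labeled None, mirroring the outlier case of p_6 in Example 1.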
Fig. 6. Execution time comparison: scalability with data size and the number of drifting concepts.

The expression to calculate the expected value of the CU function is shown in the following equation:

    CU = ∑_{i=1..k} (m_i / N) ∑_{r=1..z} [ P(I_r | c_i)² − P(I_r)² ],

where the number of data points in cluster C_i is m_i, and there are in total z distinct nodes in the clustering results.

Confusion matrix accuracy (CMA). Since the synthetic data sets shown in Table 2 contain the clustering label on each data point, we can evaluate the clustering results by comparing them with the original clustering labels. In the confusion matrix [2], the entry (i, j) is equal to the number of data points assigned to output cluster C_i that carry the original clustering label j. We measure the accuracy of this matrix (CMA) by maximizing the count of the one-to-one mapping in which one output cluster C_i is mapped to one original clustering label j.

6. Conclusions

This paper proposes a framework to perform clustering on categorical time-evolving data. The framework detects the drifting concepts at different sliding windows, generates the clustering results based on the current concept, and shows the relationship between clustering results by visualization. Drifting concepts are detected on each sliding window by using the DCD algorithm to compare the cluster distributions between the last clustering result and the current temporal clustering result. If the results are quite different, the last clustering result will be dumped out, and the current data in this sliding window will perform re-clustering.

References

[1] C. Aggarwal, J. Han, J. Wang, and P. Yu, "A Framework for Clustering Evolving Data Streams," Proc. 29th Int'l Conf. Very Large Data Bases (VLDB), 2003.
[2] C.C. Aggarwal, J.L. Wolf, P.S. Yu, C. Procopiuc, and J.S. Park, "Fast Algorithms for Projected Clustering," Proc. ACM SIGMOD '99, pp. 61-72, 1999.
[3] P. Andritsos, P. Tsaparas, R.J. Miller, and K.C. Sevcik, "Limbo: Scalable Clustering of Categorical Data," Proc. Ninth Int'l Conf. Extending Database Technology (EDBT), 2004.
[4] D. Barbara, Y. Li, and J. Couto, "Coolcat: An Entropy-Based Algorithm for Categorical Clustering," Proc. ACM Int'l Conf. Information and Knowledge Management (CIKM), 2002.
[5] F. Cao, M. Ester, W. Qian, and A. Zhou, "Density-Based Clustering over an Evolving Data Stream with Noise," Proc. Sixth SIAM Int'l Conf. Data Mining (SDM), 2006.
[6] D. Chakrabarti, R. Kumar, and A. Tomkins, "Evolutionary Clustering," Proc. ACM SIGKDD '06, pp. 554-560, 2006.
[7] H.-L. Chen, K.-T. Chuang, and M.-S. Chen, "Labeling Unclustered Categorical Data into Clusters Based on the Important Attribute Values," Proc. Fifth IEEE Int'l Conf. Data Mining (ICDM), 2005.
[8] Y. Chi, X.-D. Song, D.-Y. Zhou, K. Hino, and B.L. Tseng, "Evolutionary Spectral Clustering by Incorporating Temporal Smoothness," Proc. ACM SIGKDD '07, pp. 153-162, 2007.
[9] D.H. Fisher, "Knowledge Acquisition via Incremental Conceptual Clustering," Machine Learning, 1987.
[10] M.M. Gaber and P.S. Yu, "Detection and Classification of Changes in Evolving Data Streams," Int'l J. Information Technology and Decision Making, vol. 5, no. 4, pp. 659-670, 2006.
[11] H.-L. Chen, M.-S. Chen, and S.-C. Lin, "Catching the Trend: A Framework for Clustering Concept-Drifting Categorical Data," IEEE Trans. Knowledge and Data Eng., vol. 21, no. 5, 2009.
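The two evaluation measures above, the CU function and the confusion matrix accuracy, can be sketched as follows. This is a minimal illustration with our own helper names, assuming each data point is a tuple of categorical attribute values and a node I_r is an (attribute index, value) pair.

```python
from collections import Counter
from itertools import permutations

def category_utility(clusters):
    """CU = sum_i (m_i/N) * sum_r [P(I_r|c_i)^2 - P(I_r)^2], where the
    inner sum runs over all z distinct nodes in the clustering results."""
    n = sum(len(c) for c in clusters)
    # global node frequencies, for P(I_r)
    global_counts = Counter(
        node for c in clusters for p in c for node in enumerate(p))
    cu = 0.0
    for c in clusters:
        # per-cluster node frequencies, for P(I_r | c_i)
        local = Counter(node for p in c for node in enumerate(p))
        cu += (len(c) / n) * sum(
            (local[node] / len(c)) ** 2 - (global_counts[node] / n) ** 2
            for node in global_counts)
    return cu

def cma(pred, true):
    """Confusion-matrix accuracy: entry (i, j) counts the points of
    output cluster i carrying original label j; the best one-to-one
    mapping is found by brute force (assumes no more output clusters
    than original labels)."""
    out, lab = sorted(set(pred)), sorted(set(true))
    m = {(i, j): sum(1 for p, t in zip(pred, true) if p == i and t == j)
         for i in out for j in lab}
    best = max(sum(m[i, j] for i, j in zip(out, perm))
               for perm in permutations(lab, len(out)))
    return best / len(pred)
```

For example, cma([0, 1, 0, 1], ["a", "a", "b", "b"]) returns 0.5: every one-to-one mapping between the two output clusters and the two original labels explains only two of the four points.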