
Clustering Behavior Analysis Using Data Labeling Technique

Sivakami.P 1, Ramesh.D 2
Department of Computer Science & Engineering
PSNA College of Engineering and Technology, Dindigul, Tamilnadu, India
1 PG Student, 2 Lecturer
Siva_pon2006@yahoo.co.in

Abstract

Sampling can improve the efficiency of clustering, but the data points that are not sampled remain unlabeled after the normal clustering process. Although there is a straightforward approach in the numerical domain, the problem of how to allocate those unlabeled data points into proper clusters remains a challenging issue in the categorical domain. In this paper, a technique named MAximal Resemblance Data Labeling (MARDL) is proposed to allocate each unlabeled data point into the corresponding appropriate cluster based on a novel categorical clustering representative, namely, the N-Nodeset Importance Representative (abbreviated as NNIR), which represents clusters by the importance of the combinations of attribute values. MARDL exhibits high execution efficiency and can achieve high intracluster similarity and low intercluster similarity, which are regarded as the most important properties of clusters, thus benefiting the analysis of cluster behaviors. MARDL is empirically validated on real and synthetic data sets and is shown to be significantly more efficient than prior works while attaining results of high quality.

1. Introduction

Data clustering is an important technique for exploratory data analysis. Clustering analysis can help us gain better insight into the distribution of data. However, the concepts that we try to learn from the data may drift with time. For example, the buying preferences of customers may change with time, depending on the current day of the week, availability of alternatives, discounting rate, etc. As the concepts behind the data evolve with time, the underlying clusters may also change considerably with time [1]. The problem of clustering time-evolving data has not been widely discussed in the categorical domain, with the exception of a Web usage mining framework for mining evolving user profiles in dynamic Web sites from Web log transactions.

Therefore, a framework for performing clustering on categorical time-evolving data is proposed, as shown in Fig. 1. The framework generalizes clustering: it can utilize any existing clustering algorithm and detects whether there is a drifting concept in the incoming data. For detecting the drifting concept, the sliding window technique is adopted. For capturing the characteristics of clusters, an effective cluster representative that summarizes the clustering result is needed. This categorical cluster representative, named the Node Importance Representative (NIR), measures the importance of each attribute value in the clusters. Based on NIR, the Drifting Concept Detection (DCD) algorithm is proposed. After a drifting concept is detected, the distributions of clusters and outliers between the last clustering result and the current temporal clustering result are compared with each other; if the distribution has changed, the concept is said to drift. The framework also explains the drifting concepts by analyzing the relationship between clustering results at different times. The analyzing algorithm is named Cluster Relationship Analysis (CRA). Analyzing the relationship between clustering results captures the time-evolving trend and explains why the clustering results change in the data set.

This paper is organized as follows: Section 2 presents the preliminaries and formulates the problem of this work. Section 3 presents the DCD algorithm, and the CRA algorithm is introduced in Section 4. Section 5 reports our performance study, and the paper concludes with Section 6.

Fig. 1. System design of performing clustering on the categorical time-evolving data: initial clustering, generation of the cluster representative (CR), DCD with updating of the CR or dumping of the previous CR and reclustering, and Cluster Relationship Analysis over the resulting clusters.

2. Preliminaries

Section 2.1 presents the problem definition of this work, and Section 2.2 introduces NIR, a practical categorical clustering representative, which was presented in our previous work [7].

2.1 Problem Description

The objective of the framework is to perform clustering on the data set D, to consider the drifting concepts between St and St+1, and to analyze the relationship between different clustering results.

In this framework, several clustering results at different time stamps will be reported. Each clustering result C^[t1,t2] is formed by one stable concept that persists for a period of time, i.e., the sliding windows from t1 to t2. The clustering result C^[t1,t2] contains k^[t1,t2] clusters, i.e., C^[t1,t2] = {c_1^[t1,t2], c_2^[t1,t2], ..., c_k^[t1,t2]}, where c_i^[t1,t2], 1 <= i <= k^[t1,t2], is the ith cluster in C^[t1,t2]. If t1 = t2 = t, we simplify the superscript to t. For example, the first clustering result that is obtained from the initial clustering step is C1. The notation C_t^t is used to represent the temporal clustering result at time stamp t.

Fig. 2 shows an example data set D with 15 data points, three attributes, and sliding window size N = 5. The initial clustering is performed on the first sliding window S1, and the clustering result C1, which contains two clusters, c_1^1 and c_2^1, is obtained. All of the symbols utilized in this section are summarized in Table 1.

Fig. 2. Example data set with the initial clustering performed.

2.2 Node Importance Representative

NIR represents a cluster as the distribution of the attribute values, called "nodes" in [7]. In order to measure the representability of each node in a cluster, the importance of a node is evaluated based on the following two concepts:

1. The node is important in the cluster when the frequency of the node is high in this cluster.

2. The node is important in the cluster if the node appears prevalently in this cluster rather than in other clusters.

The formal definitions of nodes and node importance are as follows:

Definition 1 (node). A node, I_r, is defined as attribute name + attribute value.

Definition 2 (node importance). The importance value of the node I_ir is calculated as

  w(c_i, I_ir) = p(I_ir) * f(I_r),

where

  p(I_yr) = |I_yr| / (sum over z = 1..k^t of |I_zr|).

Here, w(c_i, I_ir) represents the importance of node I_ir in cluster c_i with two factors: the probability of I_ir being in c_i, and the weighting function f(I_r).
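The two concepts above can be sketched in Python. The exact form of the weighting function f(I_r) is not spelled out here (it is deferred to [7]), so this sketch folds the two concepts into a product of a node's within-cluster frequency and its cross-cluster concentration p(I_ir); the data layout (clusters as lists of attribute-value tuples) is likewise an assumption of the sketch.

```python
from collections import Counter

def node_counts(clusters):
    """Count each node (attribute index, attribute value) per cluster.

    clusters: list of clusters; each cluster is a list of tuples of
    categorical values sharing the same attribute order.
    """
    counts = []
    for cluster in clusters:
        c = Counter()
        for point in cluster:
            for attr, value in enumerate(point):
                c[(attr, value)] += 1
        counts.append(c)
    return counts

def node_importance(clusters):
    """Sketch of w(c_i, I_ir): within-cluster frequency of the node
    (concept 1) times its concentration across clusters (concept 2)."""
    counts = node_counts(clusters)
    importance = []
    for i, cluster in enumerate(clusters):
        w = {}
        for node, n in counts[i].items():
            freq = n / len(cluster)                      # concept 1
            total = sum(c.get(node, 0) for c in counts)  # occurrences in all clusters
            w[node] = freq * (n / total)                 # concept 2: p(I_ir)
        importance.append(w)
    return importance

# Two toy clusters over attributes (A1, A2)
c1 = [("A", "M"), ("A", "M"), ("A", "F")]
c2 = [("B", "F"), ("B", "F")]
nir = node_importance([c1, c2])
# (0, "A") occurs in every point of c1 and nowhere else -> importance 1.0
print(nir[0][(0, "A")])
```

A node that is frequent in its own cluster but also common elsewhere (such as A2=F above) receives a low importance, which is exactly the intercluster discrimination that concept 2 asks for.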
TABLE 1
Summary of the Symbols

C^[t1,t2]   The clustering result from t1 to t2.
C^t         The clustering result on sliding window t.
C_t^t       The temporal clustering result on sliding window t.
c_i         The ith cluster in C.
c_i (vec)   The node importance vector of c_i.
I_ir        The rth node in c_i.
|I_ir|      The number of occurrences of I_ir.
k           The number of clusters in C.
m_i         The number of data points in c_i.
St          The sliding window t.
θ           The outlier threshold.
ε           The cluster variation threshold.
η           The cluster difference threshold.

NIR is related to the idea of conceptual clustering [11], which creates a conceptual structure to represent a concept (cluster) during clustering. However, NIR only analyzes the conceptual structure and does not perform clustering, i.e., there is no objective function. Furthermore, NIR considers both the intracluster similarity and the intercluster similarity in the representation by integrating the first and the second concepts.

3. Drifting Concept Detection

The objective of the DCD algorithm is to detect the difference of cluster distributions between the current data subset St and the last clustering result C^[te,t-1], and to decide whether reclustering is required in St. In this paper, we modify our previous work on the labeling process in order to detect outliers in St; a data point that does not belong to any proper cluster is called an outlier. After labeling, the last clustering result C^[te,t-1] and the current temporal clustering result C_t^t obtained by data labeling are compared with each other. The flowchart of the DCD algorithm is shown in Fig. 3.

Fig. 3. Flowchart of the DCD algorithm: data labeling on the incoming sliding window, comparison of the cluster distributions of the previous clustering result and the temporal clustering result, and then either updating the NIR or dumping the previous NIR and reclustering.

Section 3.1 introduces the data labeling process and the outlier detection; Section 3.2 presents the cluster distribution comparison method.

3.1 Data Labeling and Outlier Detection

Data labeling decides the most appropriate cluster label for each incoming data point. The clusters are represented by the clustering representative NIR, and, based on NIR, a similarity measure referred to as resemblance is defined below.

Definition 3 (resemblance and maximal resemblance). Given a data point p_j and the NIR table of cluster c_i, the resemblance is defined by the following equation:

  R(p_j, c_i) = sum over r = 1..q of w(c_i, I_ir),

where I_ir is one entry in the NIR table of cluster c_i. The value of the resemblance R(p_j, c_i) can be directly obtained by summing up the node importances in the NIR table of cluster c_i, where these nodes are decomposed from the data point p_j.
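Definition 3 translates into a few lines of Python: decompose the point into nodes and sum their importance values from the cluster's NIR table, with absent nodes contributing zero. The NIR values below are transcribed from Fig. 5 (cluster c_1^1); the query point is a hypothetical example, not one of the paper's data points.

```python
# NIR table of cluster c_1^1, transcribed from Fig. 5
nir_c1 = {
    ("A1", "A"): 1.0,
    ("A2", "M"): 0.029,
    ("A3", "C"): 0.67,
    ("A3", "D"): 0.33,
}

def resemblance(point, nir_table, attrs=("A1", "A2", "A3")):
    """R(p_j, c_i): sum the importances of the point's nodes;
    nodes absent from the NIR table contribute zero."""
    return sum(nir_table.get((a, v), 0.0) for a, v in zip(attrs, point))

# Hypothetical point (A, M, C): 1 + 0.029 + 0.67
print(round(resemblance(("A", "M", "C"), nir_c1), 3))  # -> 1.699
```

A point sharing no node with the cluster, such as (B, E, G), gets resemblance zero, which is how the outlier case in Example 1 below arises.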
Example 1. Consider the data set in Fig. 2 and the NIR of C1 in Fig. 5. The data points in the second sliding window are going to perform data labeling, and the thresholds are λ1 = λ2 = 0.5. The first data point p_6 = (B, E, G) in S2 is decomposed into three nodes, i.e., [A1=B], [A2=E], and [A3=G]. The resemblance of p_6 in c_1^1 is zero, and in c_2^1, it is also zero. Since the maximal resemblance is not larger than the threshold, the data point p_6 is considered an outlier. In addition, the resemblance of p_7 in c_1^1 is 0.029, and in c_2^1, it is 1.529 (0.5 + 0.029 + 1). The maximal resemblance value is R(p_7, c_2^1), and this resemblance value is larger than the threshold λ2 = 0.5. Therefore, p_7 is labeled to cluster c_2^1.

An incoming data point can be allocated to a cluster if its resemblance value is larger than the smallest resemblance value in that cluster measured by the data points in the last sliding window; for example, in Figs. 2 and 4, λi = 1 + 0.029 + 0.33 = 1.359. The temporal clustering result C_2^2 is shown in Fig. 4. In the next section, we introduce how we compare two cluster distributions.

Fig. 4. The temporal clustering result C_2^2 that is obtained by data labeling.

Cluster c_1^1

Node    Importance
A1=A    1
A2=M    0.029
A3=C    0.67
A3=D    0.33

Fig. 5. The NIR of the clustering result C1 in Fig. 2.

3.2 Cluster Distributions Comparison

In this step, the last clustering result and the current temporal clustering result obtained by data labeling are compared with each other to detect the drifting concept. The clustering results are said to be different according to the following two criteria:

1. The clustering results are different if quite a large number of outliers are found by data labeling.

2. The clustering results are different if quite a large number of clusters are varied in the ratio of data points.

The entire cluster distribution comparison is shown as follows:

  Concept drift = yes, if (# of outliers) / N > θ;
  Concept drift = yes, if ( sum over i = 1..k^[te,t-1] of d(c_i^[te,t-1], c_i of C_t^t) ) / k^[te,t-1] > η.

The ratio of outliers in the current sliding window t is first measured and compared with θ. After that, the variation of the ratio of data points in cluster c_i between the last clustering result C^[te,t-1] and the current temporal clustering result C_t^t is calculated and compared by a zero-one function d(c_i^[te,t-1], c_i of C_t^t), where a varied cluster is represented by one. The number of different clusters is summed, and the ratio of different clusters between C^[te,t-1] and C_t^t is compared with η. If the current sliding window t is considered to contain a drifting concept, the data points in the current sliding window t will perform reclustering.

3.3 Implementation of DCD

Two algorithms are used, as shown below. All the clustering results C are represented by NIR, which contains all the pairs of nodes and node importances.

Algorithm 1. Data Labeling (C^[te,t-1], St)
  out = 0
  while there is a next tuple in St do
    read in data point p_j from St
    divide p_j into nodes I_1 to I_q
    for all clusters c_i^[te,t-1] do
      calculate resemblance R(p_j, c_i^[te,t-1])
    end for
    find the maximal resemblance c_m^[te,t-1]
    if R(p_j, c_m^[te,t-1]) >= λm then
      p_j is assigned to c_m of C_t^t
    else
      out = out + 1
    end if
  end while
  return out
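Algorithm 1 can be sketched in Python as follows. The NIR tables, the per-cluster thresholds λ_m, and the tuple-based window layout are assumptions of this sketch; outliers are recorded as `None` labels.

```python
def data_labeling(nir_tables, thresholds, window, attrs):
    """Sketch of Algorithm 1: label each point in the sliding window
    with the cluster of maximal resemblance, or count it as an outlier.

    nir_tables: one dict per cluster mapping (attribute, value) -> importance
    thresholds: per-cluster resemblance thresholds (lambda_m)
    """
    labels, outliers = [], 0
    for point in window:
        # resemblance of the point in every cluster (Definition 3)
        scores = [
            sum(table.get((a, v), 0.0) for a, v in zip(attrs, point))
            for table in nir_tables
        ]
        m = max(range(len(scores)), key=scores.__getitem__)
        if scores[m] >= thresholds[m]:
            labels.append(m)      # assign to the cluster of maximal resemblance
        else:
            labels.append(None)   # no proper cluster: outlier
            outliers += 1
    return labels, outliers

# Hypothetical NIR tables and window
labels, out = data_labeling(
    [{("A1", "A"): 1.0, ("A3", "C"): 0.67}, {("A1", "B"): 0.5}],
    thresholds=[0.5, 0.5],
    window=[("A", "M", "C"), ("X", "Y", "Z")],
    attrs=("A1", "A2", "A3"),
)
print(labels, out)  # -> [0, None] 1
```

The returned outlier count is exactly the `out` value that Algorithm 2 below compares against the threshold θ.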
Algorithm 2. Drifting Concept Detecting (C^[te,t-1], St)
  outlier = DataLabeling(C^[te,t-1], St)
      {do data labeling on the current sliding window}
  numdiffclusters = 0
  for all clusters c_i^[te,t-1] in C^[te,t-1] do
    if | m_i^[te,t-1] / (sum over x = 1..k^[te,t-1] of m_x^[te,t-1])
         - m_i^t / (sum over x = 1..k^t of m_x^t) | > ε then
      numdiffclusters = numdiffclusters + 1
    end if
  end for
  if outlier / N > θ or numdiffclusters / k^[te,t-1] > η then
      {concept drifts}
    dump out C^[te,t-1]
    call initial clustering on St
  else
      {concept does not drift}
    add C_t^t into C^[te,t-1]
    update NIR as C^[te,t]
  end if
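The drift test at the end of Algorithm 2 can be sketched as a pure function of the outlier count and the per-cluster data-point ratios. The default values of θ, ε, and η below are placeholders for illustration, not values used in this paper.

```python
def concept_drifts(outliers, n, prev_sizes, curr_sizes,
                   theta=0.3, eps=0.1, eta=0.5):
    """Sketch of the DCD test: drift if the outlier ratio exceeds theta,
    or if the fraction of clusters whose data-point ratio changed by
    more than eps exceeds eta (thresholds as in Table 1)."""
    if outliers / n > theta:          # criterion 1: too many outliers
        return True
    prev_total, curr_total = sum(prev_sizes), sum(curr_sizes)
    diff = sum(                       # criterion 2: varied clusters
        1 for p, c in zip(prev_sizes, curr_sizes)
        if abs(p / prev_total - c / curr_total) > eps   # zero-one function d
    )
    return diff / len(prev_sizes) > eta

# Stable window: low outlier ratio, cluster ratios unchanged
print(concept_drifts(1, 10, [5, 5], [5, 5]))   # -> False
# Cluster sizes swapped: both ratios move by 0.6 > eps
print(concept_drifts(0, 10, [8, 2], [2, 8]))   # -> True
```

When the function returns True, the framework dumps the previous clustering result and calls the initial clustering on the current sliding window, as in Algorithm 2.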
4. Clustering Relationship Analysis

CRA measures the similarity of clusters between the clustering results at different time stamps and links the similar clusters. The following subsection introduces the NIR-based definitions that CRA relies on.

4.1 Node Importance Vector and Cluster Distance

This subsection includes the definitions of the node importance vector and the cluster distance.

Definition 4 (node importance vector). Suppose that there are in total z distinct nodes in the entire data set D. The node importance vector of a cluster c_i is defined as

  c_i (vec) = (W_i(I_1), W_i(I_2), ..., W_i(I_r), ..., W_i(I_z)),

where W_i(I_r) = 0 if I_r does not occur in c_i, and W_i(I_r) = w(c_i, I_ir) if I_r occurs in c_i.

The value in the vector on each node domain is the importance value of this node in cluster c_i, i.e., w(c_i, I_ir). Therefore, the dimensions of all the node importance vectors are the same.

5. Experimental Results

In this section, we demonstrate the scalability and accuracy of the framework on clustering evolving categorical data. Section 5.1 presents the efficiency and scalability of DCD; Section 5.2 presents the accuracy of the evolution results of DCD.

5.1 Evaluation on Efficiency and Scalability

The scalability of DCD with the data size is shown in Fig. 6. This study fixes the dimensionality to 20 and the number of clusters to 20, and tests DCD on different numbers of data points, e.g., 50,000, 100,000, and 150,000. The sliding window size is set to 500.

5.2 Evaluation on Accuracy

In this experiment, we test the accuracy of DCD on both synthetic and real data sets. First, we test the accuracy of the drifting concepts that are detected by DCD. Then, in order to evaluate the results of the clustering algorithms, we adopt the following two widely used methods.
The CD function. The CD function [13] attempts to maximize both the probability that two data points in the same cluster obtain the same attribute values and the probability that data points from different clusters have different attribute values. The expression to calculate the expected value of the CD function is shown in the following equation:

  CU = sum over i = 1..k of (m_i / N) * ( sum over r = 1..z of [ P(I_r | c_i)^2 - P(I_r)^2 ] ),

where the number of data points in cluster c_i is m_i, and there are in total z distinct nodes in the clustering results.

Confusion matrix accuracy (CMA). Since the synthetic data sets shown in Table 2 contain the clustering label on each data point, we can evaluate the clustering results by comparing with the original clustering labels. In the confusion matrix [2], the entry (i, j) is equal to the number of data points assigned to output cluster c_i that carry the original clustering label j. We measure the accuracy in this matrix (CMA) by maximizing the count of the one-to-one mapping in which one output cluster c_i is mapped to one original clustering label j.

Fig. 6. Execution time comparison: scalability with data size and the number of drifting concepts.

6. Conclusions

This paper proposes a framework to perform clustering on categorical time-evolving data. The framework detects the drifting concepts at different sliding windows, generates the clustering results based on the current concept, and shows the relationship between clustering results by visualization. For the detection of drift on a sliding window, the DCD algorithm compares the cluster distributions between the last clustering result and the current temporal clustering result. If the results are quite different, the last clustering result is dumped out, and the current data in this sliding window performs reclustering. To observe the relationship between different clustering results, the algorithm CRA analyzes and shows the changes between different clustering results. The experimental evaluation shows that performing DCD is faster than doing clustering once on the entire data set and that DCD can provide high-quality clustering results with correctly detected drifting concepts.

References

[1] C. Aggarwal, J. Han, J. Wang, and P. Yu, "A Framework for Clustering Evolving Data Streams," Proc. 29th Int'l Conf. Very Large Data Bases (VLDB), 2003.

[2] C.C. Aggarwal, J.L. Wolf, P.S. Yu, C. Procopiuc, and J.S. Park, "Fast Algorithms for Projected Clustering," Proc. ACM SIGMOD '99, pp. 61-72, 1999.

[3] P. Andritsos, P. Tsaparas, R.J. Miller, and K.C. Sevcik, "LIMBO: Scalable Clustering of Categorical Data," Proc. Ninth Int'l Conf. Extending Database Technology (EDBT), 2004.

[4] D. Barbara, Y. Li, and J. Couto, "Coolcat: An Entropy-Based Algorithm for Categorical Clustering," Proc. ACM Int'l Conf. Information and Knowledge Management (CIKM), 2002.

[5] F. Cao, M. Ester, W. Qian, and A. Zhou, "Density-Based Clustering over an Evolving Data Stream with Noise," Proc. Sixth SIAM Int'l Conf. Data Mining (SDM), 2006.

[6] D. Chakrabarti, R. Kumar, and A. Tomkins, "Evolutionary Clustering," Proc. ACM SIGKDD '06, pp. 554-560, 2006.

[7] H.-L. Chen, K.-T. Chuang, and M.-S. Chen, "Labeling Unclustered Categorical Data into Clusters Based on the Important Attribute Values," Proc. Fifth IEEE Int'l Conf. Data Mining (ICDM), 2005.

[8] Y. Chi, X.-D. Song, D.-Y. Zhou, K. Hino, and B.L. Tseng, "Evolutionary Spectral Clustering by Incorporating Temporal Smoothness," Proc. ACM SIGKDD '07, pp. 153-162, 2007.

[9] D.H. Fisher, "Knowledge Acquisition via Incremental Conceptual Clustering," Machine Learning, 1987.

[10] M.M. Gaber and P.S. Yu, "Detection and Classification of Changes in Evolving Data Streams," Int'l J. Information Technology and Decision Making, vol. 5, no. 4, pp. 659-670, 2006.

[11] H.-L. Chen, M.-S. Chen, and S.-C. Lin, "Catching the Trend: A Framework for Clustering Concept-Drifting Categorical Data," IEEE Trans. Knowledge and Data Eng., vol. 21, no. 5, 2009.
