Академический Документы
Профессиональный Документы
Культура Документы
Volume: 2 Issue: 6
ISSN: 2321-8169
1465 1470
_______________________________________________________________________________________________
__________________________________________________*****_________________________________________________
1. INTRODUCTION
A. Data Mining
Data mining, called Knowledge Discovery in Databases
(KDD) an interdisciplinary subfield of computer science is
the process of identifying knowledge / patterns in large
heterogeneous data sets .The goal of the Data mining
process is to extract information from a data set, preprocess
and transform it into an understandable structure for further
use. Various stages of Data mining are Selection,
Preprocessing, Transformation, Data mining, Interpretation
and evaluation. The various Data mining techniques are
Classification, Clustering, Prediction, Association and
Discrimination.
_______________________________________________________________________________________
ISSN: 2321-8169
1465 1470
_______________________________________________________________________________________________
of events/items <i1, i2,...,in>.There are several key
traditional computational problems addressed within this
field. These include building efficient databases and indexes
for sequence information, extracting the frequently
occurring patterns, comparing sequences for similarity, and
recovering missing sequence numbers. In general, sequence
mining problems can be classified as string mining which is
typically based on string based algorithms and itemset
mining which is typically based on association rule mining.
=( )
=( )
,
( , )
+
=( )
=( )
2. EXISTING METHODOLOGY
Usually when dealing with sequences, the data is converted
into n-dimensional frequency vectors. The vector
representation can be either indicating presence or absence
of symbol in a sequence, or, indicating frequency of symbol
within a sequence. While computing similarity between
sequences they either consider the content /information or
the order information. In the existing work the sequences
are converted to intermediate representations and the
similarity between any two sequences is calculated using
any of the similarity measures like Euclidean, Jaccard,
Cosine. PAM algorithms can be applied for clustering.
Similarity are calculated which illustrates the similarity
between the sequences. And the Inter cluster similarity
has to be maximized and Intra cluster similarity has to
be minimized.
Procedure:
Similarity between
sequences
)2
+ (1 2 2 2
)2
+ + (1 2
)2
=(
1 , 2 =
Vector representation of
sequences
1 , 2
(1 1 2 1
D. Web Personalization
(1 , 2 ) =
Euclid
ean
Jacca
rd
Cosine
Fuzzy
PAM algorithms
(1 2
)2
=1
Inter
cluster
Intra cluster
_______________________________________________________________________________________
ISSN: 2321-8169
1465 1470
_______________________________________________________________________________________________
The work concentrates on Clustering technique on the
domain of web usage data. A new similarity measure [6] is
used to measure similarity/distance between two sequences
and experiments are conducted on PAM algorithms..In all
the experiments the running time of the new similarity
measure is accurate and best compared to the earlier
similarity measures. An experimental framework for
sequential data stream mining on clustering on web usage
data is built.
A. Experimental Results
a)Web Navigation dataset used for Testing
MSNBC is a joint venture between Microsoft and
NBC(National Broad casting) is a famous online news
website with has different news subjects. There are 17
categories
of
news
like
frontpage,news,tech,local,opinion,onair,weather,health,livin
g,business,sports,summary,bbs,travelmisc,msn-news
and
msn-sports. For example, frontpage is coded as 1, news
as 2, tech as 3, etc. Web Navigational dataset
is
considered in Table 5.1
T2
T3
T4
T5
T6
T7
T8
T9
T10
C2
C3
C4
C5
C6
C1
C2
C3
C4
C5
C6
C7
C8
0.1
5
0.1
6
0.1
6
0.16
0.16
0.1
7
0.17
0.15
0.1
3
0.1
3
0.14
0.13
0.1
3
0.14
0.16
0.1
3
0.1
2
0.14
0.15
0.1
5
0.15
0.16
0.1
3
0.1
2
0.18
0.18
0.1
8
0.19
0.16
0.1
4
0.1
4
0.1
8
0.16
0.1
6
0.16
0.16
0.1
3
0.1
5
0.1
8
0.16
0.1
6
0.17
0.17
0.1
3
0.1
5
0.1
8
0.16
0.16
0.21
0.17
0.1
4
0.1
5
0.1
9
0.16
0.17
0.2
1
C7
C8
T1
6,15,15,15,6,15
T2
2,11,3,4,11,11
T3
11,13,13,13,13,13
T4
1,1,11,2,2,4
T5
6,7,7,7,11,11
T6
6,6,6,6,3,6
T7
1,13,13,1,1,2
Jacca
rd
C1
C2
C3
C4
C5
C6
C7
C8
C9
C1
0.1
5
0.1
6
0.1
3
0.1
6
0.1
3
0.1
2
0.1
6
0.1
4
0.1
4
0.1
6
0.1
3
0.1
5
0.1
7
0.1
3
0.1
5
0.1
7
0.1
4
0.1
5
0.1
8
0.1
5
0.1
6
0.1
8
0.1
8
0.1
6
-
0.1
8
0.1
6
0.1
0.1
9
0.1
6
0.1
0.1
9
0.1
7
0.1
C2
C3
C4
T8
1,1,1,1,1,13
T9
2,2,14,5,5,16
C5
T10
1,10,1,2,2,13
C6
0.1
5
0.1
6
0.1
3
0.1
6
0.1
6
0.1
0.1
3
0.1
4
0.1
0.1
2
0.1
4
0.1
0.1
8
0.1
0.1
1467
IJRITCC | June 2014, Available @ http://www.ijritcc.org
_______________________________________________________________________________________
ISSN: 2321-8169
1465 1470
_______________________________________________________________________________________________
6
0.1
7
0.1
7
0.1
8
C7
C8
C9
3
0.1
3
0.1
4
0.1
5
5
0.1
5
0.1
5
0.1
6
8
0.1
8
0.1
9
0.1
9
6
0.1
6
0.1
6
0.1
7
6
0.1
6
0.1
7
0.1
8
7
0.2
1
0.2
1
0.2
1
8
0.2
1
0.1
8
0.1
8
C1
C1
C2
C3
C4
C5
C6
C7
C8
C9
0.1
6
0.1
7
0.1
7
0.1
7
0.1
8
0.1
1
0.1
7
0.1
7
0.1
2
0.1
6
0.1
8
0.1
7
0.1
3
0.1
3
0.1
3
0.1
9
0.1
8
0.1
4
0.1
9
0.2
1
0.1
8
0.1
4
0.1
7
0.1
8
0.1
8
0.1
8
0.1
9
0.1
7
0.1
4
0.1
7
0.1
8
0.1
8
0.2
2
0.2
3
0.1
6
0.1
7
0.1
7
0.1
7
0.1
8
0.1
9
0.2
1
0.1
9
C2
C3
C4
C5
C6
C7
C8
C9
0.1
7
0.1
8
0.1
7
0.1
7
0.1
8
0.1
8
0.1
7
0.1
1
0.1
2
0.1
3
0.1
4
0.1
4
0.1
4
0.1
6
0.1
3
0.1
9
0.1
7
0.1
7
0.1
3
0.2
0.1
8
0.1
8
0.2
0.2
1
0.2
1
0.1
8
0.1
8
0.1
8
0.2
2
0.2
3
C1
C2
C3
C4
C5
C2
C3
C4
C5
C6
C7
C8
C9
0.1
6
0.1
7
0.1
7
0.1
7
0.1
8
0.1
1
0.1
7
0.1
7
0.1
2
0.1
1
-
0.1
8
0.1
7
0.1
3
0.1
2
0.1
0.1
9
0.1
8
0.1
4
0.1
4
0.1
0.2
1
0.1
8
0.1
4
0.1
6
0.1
0.1
9
0.1
7
0.1
4
0.1
8
0.1
0.1
6
0.1
7
0.1
7
0.1
0.1
7
0.1
8
0.1
0.1
1
0.1
0.1
C1
0
0.1
9
0.1
6
0.1
3
0.1
2
0.1
C8
C1
0
C1
1
7
0.1
7
0.1
8
0.1
8
0.1
7
0.1
6
0.1
6
2
0.1
3
0.1
4
0.1
4
0.1
4
0.1
3
0.1
4
1
0.1
2
0.1
4
0.1
6
0.1
8
0.1
2
0.1
4
7
0.1
7
0.1
6
0.1
7
0.1
7
0.1
6
0.1
6
6
0.1
1
0.1
1
0.1
2
0.1
2
0.1
3
0.1
4
0.1
9
0.1
8
0.1
9
0.1
7
7
0.1
2
0.1
9
0.1
7
0.1
6
0.2
1
7
0.1
2
0.1
8
0.1
7
0.2
1
0.2
2
6
0.1
3
0.1
9
0.1
6
0.2
1
6
0.1
4
0.1
7
0.2
1
0.2
2
0.1
7
0.1
7
C1
C7
C9
C6
7
0.1
8
0.1
9
0.2
1
0.1
9
0.1
9
0.1
9
C1
1
0.1
9
0.1
6
0.1
4
0.1
4
0.1
No of
Samples
5000
10000
20000
30000
40000
No of
clusters
formed
82
124
155
116
189
Inter
cluster
4.5
4.9
5.124
6.893
6.989
Average
inter
cluster
0.054
0.039
0.033
0.059
0.036
Average
Intra
cluster
4.27
4.000
4.989
6.867
5.896
5000
10000
20000
30000
40000
1468
IJRITCC | June 2014, Available @ http://www.ijritcc.org
_______________________________________________________________________________________
ISSN: 2321-8169
1465 1470
_______________________________________________________________________________________________
No of
clusters
formed
99
Inter
cluster
4.281
4.317
5.213
8.153
7.298
Average
Inter
cluster
0.043
0.037
0.035
0.045
0.026
Average
Intra
cluster
4.013
4.291
5.222
7.293
8.123
114
147
135
197
4. TIME REQUIREMENTS
No
clusters
formed
5000
10000
20000
30000
40000
Size
of
sequences
5000
10000
20000
30000
40,000
No
of
clusters
83
124
155
116
189
Time
taken in
seconds
1566
2665
2785
3218
3196
PAM-KMEDOIDS
of
93
Inter cluster
Average
Inter cluster
Average
Intra cluster
4.6
0.049
4.001
113
4.8
0.042
4.019
124
5.39
0.043
4.318
138
7.123
0.044
5.293
174
Size
of
sequences
of
5000
94
10000
126
20000
149
30000
141
0.045
6.142
40,000
187
Inter
cluster
4.69
4.47
5.213
6.153
7.298
Average
Inter cluster
0.049
0.035
0.035
0.043
0.039
Average
Intra cluster
Size
of
sequences
5000
10000
20000
30000
40,000
No
of
clusters
99
114
147
135
197
Time
taken in
seconds
1624
2660
2794
3301
3126
7.932
No
clusters
No
of
samples
SSM-KMEANS
Size of
sequences
5000
10000
20000
30000
40,000
No of
clusters
93
113
124
138
174
Time
taken in
seconds
1085
1879
3643
1956
2498
SSM-KMEDOIDS
Size of
sequences
3.314
3.187
4.212
3.297
5000
10000
20000
30000
40,000
94
126
149
141
187
4.123
No of
1469
IJRITCC | June 2014, Available @ http://www.ijritcc.org
_______________________________________________________________________________________
ISSN: 2321-8169
1465 1470
_______________________________________________________________________________________________
Sequential information is important as well as Content
information is also important.
clusters
Time
taken in
seconds
1080
1871
2946
1749
2156
a) Future Work
we extend our work in future to other
techniques and to other domains as well.
[1].
[2].
5. CONCLUSIONS
[3].
[4].
[5].
clustering
_______________________________________________________________________________________