
International Conference on Methods and Models in Computer Science, 2009

A Linear Regression-Based Frequent Itemset Forecast Algorithm for Stream Data
Sonali Shukla1, Sushil Kumar2, Bhupendra Verma3
TIT, Bhopal
e-mail: 1im_sonalishukla@yahoo.com, 2sushil.tit@gmail.com, 3bk_verma3@rediffmail.com

Abstract- Data mining deals with extracting knowledge from large, potentially infinite volumes of stream data, and it must also maintain data quality with a limited amount of disk space or memory. In a traditional transaction environment, frequent-item mining over such data is not feasible, because it requires determining, for continuously incoming stream data, both which items are frequent and which are likely to become frequent. This paper proposes a way to predict frequent items by applying a regression model to continuously incoming real-time stream data. Once the regression model has been established from the stream data, it may be used to predict uncertain items. After gathering real-time stream data through a sliding window, the FIM-2DS algorithm computes the support of a given sequence and derives a linear equation to forecast future sequence trends.
Keywords: Frequent items; stream data; linear regression; sliding window.
I. INTRODUCTION

Stream data is generated continuously in a dynamic environment, with huge volume, infinite flow, and fast-changing behavior.
In recent years, many time-series mining problems have been explored for the data stream environment. Because sequence forecasting is an important subject among them, several methods have been designed for sequence trend analysis in a dynamic environment. Yixin Chen et al. [9] investigated methods for multi-dimensional regression analysis of time-series stream data; they built a partially materialized data cube model and took an exception-guided regression approach to analyze stream data. Wei-Guang Teng et al. [10] devised the FTP-DS algorithm to mine frequent temporal patterns of data streams. The FTP-DS algorithm uses linear regression to perform trend detection. It is an effective method, but it always omits some exceptions, and these can be pivotal enough to lead to failure.
In this paper, we present the FIM-2DS algorithm, which uses a linear regression approach to improve on the FTP-DS algorithm so that the sequence trends cover those key exceptions. Our experimental comparison of the FTP-DS and FIM-2DS algorithms shows that the coverage of FIM-2DS is much wider.
The rest of the paper is organized as follows. The next section introduces the basic concepts of sequence forecasting and the problem of concern. Section 1 describes the FIPM algorithm with an example, and Section 2 describes the FTP-DS algorithm. Section 3 presents the FIM-2DS algorithm. Section 4 reports our experiments and performance analysis. Sections 5 and 6 give our conclusions and future research, respectively.
II. RELATED CONCEPTS

In this section, we introduce the basic concepts of sequence forecasting in a data stream and briefly review the FIPM and FTP-DS algorithms.
A sequence in a data stream is essentially a sequence of sets of events that conform to given timing constraints [11]. For example, the sequential pattern (A)(C, B)(D) encodes the fact that event D occurs after the event set (C, B), which in turn occurs after event A. If the support of a sequence is no less than the threshold, the sequence is called a frequent sequence.
Sequence forecasting plays an important role in gene analysis, financial forecasting, network security, web information extraction, communication statistics, and so on.
1. FREQUENT ITEMS PREDICTION METHOD
We first explain the frequent items prediction method [8], called FIPM. The method applies a regression analysis model to one-dimensional stream data and predicts frequent items using the estimated regression model. First, the method preprocesses the one-dimensional stream data and transforms it into sample values for regression. Once the sample values have been generated, a regression model is estimated using the method of least squares.
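For reference, the least-squares estimates of a simple linear model $y = b_0 + b_1 x$ take the standard textbook form shown below. The worked numbers are an illustration added here, not a result reported by the authors: they assume the nine interval pairs tabulated in Table 2 of this paper, that is x = 2, 3, 2, 4, 4, 3, 5, 7, 6 and y = 3, 2, 4, 4, 3, 5, 7, 6, 5.

```latex
\begin{aligned}
b_1 &= \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2},
\qquad b_0 = \bar{y} - b_1\bar{x} \\
\bar{x} &= \tfrac{36}{9} = 4, \qquad \bar{y} = \tfrac{39}{9} = \tfrac{13}{3} \approx 4.33 \\
\sum_{i}(x_i-\bar{x})^2 &= 24, \qquad \sum_{i}(x_i-\bar{x})(y_i-\bar{y}) = 14 \\
b_1 &= \tfrac{14}{24} = \tfrac{7}{12} \approx 0.58, \qquad
b_0 = \tfrac{13}{3} - \tfrac{7}{12}\cdot 4 = 2
\end{aligned}
```

Under these assumptions the fitted line is approximately $y = 2 + 0.58x$, so an observed interval of 4 time units would predict a next interval of about 4.33 time units.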
A. Data Preprocessing

The purpose of regression analysis is to find an estimator for future values from observed sample values. In other words, given n sample values, an estimate $y_{n+1}$ is predicted for the input $x_{n+1}$ based on the analysis of the samples.
Therefore, the incoming stream data must be transformed into data pairs of the form $(x_i, y_i)$. Stream data has the properties of time-series data: as shown in Figure 1, values arrive continuously over time t.
The stream data is preprocessed as follows. First, each input value in the stream is reorganized according to its arrival time, so that the arrival times of each value are collected together before further preprocessing.
From the reorganized stream data we can find the differences in arrival time for each item. For example, the value A enters the stream at t-15, t-13, t-10, t-8, t-4, and t. From the example in Figure 1, it is unknown when A arrived before time t-15, but the time differences between successive arrivals of A from t-15 to the present time t are {2, 3, 2, 4, 4}. These differences can be paired sequentially as (2, 3), (3, 2), (2, 4), (4, 4). Each pair gives the time differences before and after an arrival of A in the stream data (as given in Table 2).
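The pairing step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `interval_pairs` is hypothetical, and the arrival times of A are written relative to the present time t, with t taken as 0.

```python
from typing import List, Tuple


def interval_pairs(arrival_times: List[int]) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Return the successive time differences and their (earlier, later) pairs."""
    times = sorted(arrival_times)
    # Difference between each arrival and the one that follows it.
    diffs = [later - earlier for earlier, later in zip(times, times[1:])]
    # Pair each difference with the next one: (x_i, y_i) = (d_i, d_{i+1}).
    pairs = list(zip(diffs, diffs[1:]))
    return diffs, pairs


# Arrival times of value A from the text: t-15, t-13, t-10, t-8, t-4, t (t = 0).
arrivals_of_a = [-15, -13, -10, -8, -4, 0]
diffs, pairs = interval_pairs(arrivals_of_a)
print(diffs)
print(pairs)
```

Each pair uses the earlier interval as the independent value and the later interval as the dependent value, which is the form the regression step expects.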
Fig. 1. Example of one-dimensional stream data

The values of the dependent and independent variables are calculated from the pairs of data sets, as shown in the following tables.

TABLE 1. EXAMPLE OF THE ORDERED PAIR (X, Y) FOR THE VALUE A
Input time sequence   Independent variable (x)   Dependent variable (y)
1                     2                          3
2                     3                          2
3                     2                          4
4                     4                          4

TABLE 2. PROCESS OF ESTIMATING THE REGRESSION FORMULA BETWEEN AN INDEPENDENT VARIABLE AND A DEPENDENT VARIABLE
Seq.   Var.(x_i)   Var.(y_i)   (x_i - x̄)^2   (y_i - ȳ)   (x_i - x̄)(y_i - ȳ)
1      2           3           4             -1.3        2.7
2      3           2           1             -2.3        2.3
3      2           4           4             -0.3        0.7
4      4           4           0             -0.3        0
5      4           3           0             -1.3        0
6      3           5           1             0.7         -0.7
7      5           7           1             2.7         2.7
8      7           6           9             1.7         5
9      6           5           4             0.7         1.3

From these values, you can calculate the intercept and slope that are used to draw the regression line. Therefore, we can estimate the regression model using these pairs of time differences.
Regression analysis is often used to find the functional relationship between a dependent variable y and an independent variable x in a population using sample data. An ordered pair (2, 3) indicates that the value A entered the stream after two points in time and entered again after three points in time. In other words, if the earlier time interval is taken as the independent variable and the later time interval as the dependent variable, we can estimate the desired linear regression model.
Substituting the values of the dependent and independent variables into the linear regression model gives the regression fit line.

2. A BRIEF LOOK AT FTP-DS ALGORITHM
Algorithm FTP-DS [10] was presented by Wei-Guang Teng et al. at the 29th VLDB conference in 2003. The FTP-DS algorithm rests on two key definitions: the support of a frequent sequence, and the pattern representation, which is called a compact ATF. We briefly review both below.
The support of a frequent sequence X at a specific time t is the ratio of the number of events containing sequence X in the current sliding window to the total number of events. Suppose the sliding window size is 3; then there are 3 sliding windows in Figure 2, i.e., [0,3], [1,4], and [2,5]. The support of sequence <bd> is computed as follows: in [0,3], <bd> occurs in events {ID:2, ID:5}, so its support is 2/5 = 0.4; in [1,4], <bd> occurs in events {ID:1, ID:4}, so its support is 2/5 = 0.4; in [2,5], <bd> occurs in event {ID:1}, so its support is 1/5 = 0.2.

Fig. 2. An example of support computation

The ATF form of a time series is defined as
$(t_s, \sum f, \sum tf, \sum t^2)$
and the FTP-DS algorithm extracts a regression fit line
$f = \alpha + \beta t$
where $\alpha$ and $\beta$ have to be calculated using some formulae.
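The sliding-window support defined earlier in this section can be sketched as follows. This is a rough illustration under stated assumptions, not the FTP-DS implementation: the event data is invented so that the result reproduces the 0.4, 0.4, 0.2 supports of the worked example (it is not the data of the paper's figure), windows are treated as closed intervals [start, start + N], each event is assumed to have unique timestamps, and the names `contains_sequence` and `window_supports` are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

# One event (customer ID) is a list of (timestamp, itemset) entries.
Event = List[Tuple[int, Set[str]]]


def contains_sequence(event: Event, pattern: Sequence[str], start: int, end: int) -> bool:
    """True if the items of `pattern` occur in order, at increasing timestamps,
    all inside the closed window [start, end]."""
    idx = 0
    for ts, items in sorted(event, key=lambda entry: entry[0]):
        if start <= ts <= end and pattern[idx] in items:
            idx += 1
            if idx == len(pattern):
                return True
    return False


def window_supports(events: Dict[int, Event], pattern: Sequence[str],
                    window_size: int, last_start: int) -> List[float]:
    """Support of `pattern` in each window [s, s + window_size], s = 0..last_start."""
    supports = []
    for start in range(last_start + 1):
        end = start + window_size
        hits = sum(contains_sequence(ev, pattern, start, end) for ev in events.values())
        supports.append(hits / len(events))
    return supports


# Hypothetical data: five event IDs with time-stamped itemsets.
events: Dict[int, Event] = {
    1: [(2, {"b"}), (4, {"d"})],
    2: [(0, {"a", "b", "c"}), (1, {"c", "d"}), (2, {"d", "f"})],
    3: [(0, {"d"}), (2, {"b"}), (4, {"a"})],
    4: [(1, {"b", "e"}), (4, {"d"})],
    5: [(0, {"b"}), (3, {"c", "d"})],
}

supports = window_supports(events, ["b", "d"], window_size=3, last_start=2)
print(supports)
```

The resulting list of supports, one per window, is the time series to which a fit line of the form $f = \alpha + \beta t$ would then be applied.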

3. FREQUENT ITEMSET MINING SCHEME FOR 2-DIMENSIONAL DATA STREAM

Stream data is continuous and complex over time, so it can only be accessed temporarily. Stream data has sequential characteristics and can therefore be treated as time-series data. Time-series prediction estimates future values through the analysis of past data. In the FIM-2DS method, the two-dimensional stream data is first preprocessed to establish a simple linear regression model. Once the regression model has been generated, the likelihood of items becoming frequent is predicted based on that model. The stream data is reorganized by time, and the support of each sequence X is calculated as the ratio of the number of events containing X in the current sliding window to the total number of events. In the reorganization step, we collect the input times of identical data and calculate the differences between the times at which the data arrive. At a later stage, pairing is done using the times calculated during reorganization.

Algorithm
Step 1: Input the frequent sequence of real-time stream data, the window size N, and the support threshold.
Step 2: Calculate the support, which must be no less than the threshold.
Step 3: Take time as the independent variable and support as the dependent variable.
Step 4: Calculate the coefficients $b_0$ and $b_1$ of the FIM-2DS regression model:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

$$b_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad b_0 = \bar{y} - b_1\bar{x}$$

Step 5: Fit the regression model of FIM-2DS:

$$y = b_0 + b_1 x + \varepsilon$$

where x is the time, y is the support, and $\varepsilon$ is the error term.
Step 6: Calculate the error.

4. EXPERIMENT AND EVALUATION

We performed the experiments in a C++ programming environment on Microsoft Windows XP, used MS Excel 2003 to plot the graphs, and evaluated the results in terms of errors.
We took time as the independent (predictor) variable and support as the dependent (response) variable. We preprocessed the data stream using a sliding window and calculated the support (occurrence frequency). Using the values of these two variables, we calculated the regression coefficients $b_0$ and $b_1$ to create a linear regression model for forecasting frequent items.
In this mining scheme, we found that the errors (residuals) are slightly smaller than the residuals of the FTP-DS algorithm. The comparison of errors is shown in the form of line graphs (Fig. 3 and Fig. 4).

Fig. 4. Graph to show the mean residual comparison (series: error on FTPDS, error on FIPM2D)

5. CONCLUSION

In this method, the least squares method is used to estimate a regression line based on linear regression analysis. The regression is a simple linear regression model that captures the changes in two different variables. Since stream data is transmitted continuously over time, the changes over time are defined with two variables, namely an independent variable and a dependent variable.
In this paper, we have presented an overview of some vulnerabilities of the FIPM algorithm and designed a linear regression-based algorithm, FIM-2DS, to predict frequent sequences in real-time stream data, for which we reviewed the concept of preprocessing from the FTP-DS algorithm. Our experimental results show the merit of our algorithm in terms of errors.
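The least-squares step summarized in this section, with time as the independent variable and support as the dependent variable, can be sketched as below. It is a minimal illustration rather than the authors' C++ code: the function names are hypothetical, the supports 0.4, 0.4, 0.2 are reused from the worked example in Section 2, and taking the window start times 0, 1, 2 as the time values is an assumption made only for this example.

```python
from typing import List, Tuple


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Least-squares estimates (b0, b1) for the model y = b0 + b1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    b1 = s_xy / s_xx
    b0 = y_mean - b1 * x_mean
    return b0, b1


def residuals(x: List[float], y: List[float], b0: float, b1: float) -> List[float]:
    """Observed value minus fitted value for every sample."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


# Assumed inputs: window start times as x, supports of <bd> as y.
times = [0.0, 1.0, 2.0]
supports = [0.4, 0.4, 0.2]

b0, b1 = fit_line(times, supports)
errors = residuals(times, supports, b0, b1)
forecast = b0 + b1 * 3.0  # extrapolated support for the next window

print(b0, b1)
print(errors)
print(forecast)
```

A forecast that falls below the support threshold would suggest the sequence is trending out of the frequent set, while the residuals give the per-window error that the evaluation compares between algorithms.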
6. FUTURE ENHANCEMENT
We now consider a few open problems of the FIM-2DS algorithm in the data stream setting. Stream data is so huge, infinite, and fast-changing that it always contains many noisy, overflowing, and incomplete data items. The present work does not deal with this problem, and it also comes at a cost in execution time. Resolving this problem is our future work.
Fig. 3. Graph to show the residual comparison (title: Error comparison on FTPDS and FIPM2D; series: error on FTPDS, error on FIPM2D)

REFERENCES
[1] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, "Models and Issues in Data Stream Systems," in Proc. of PODS, March 2002.
[2] N. Davey, S. P. Hunt, and R. J. Frank, "Time Series Prediction and Neural Networks," in Journal of Intelligent and Robotic Systems, 2001.
[3] L. Golab, M. Tamer Ozsu, "Issues in Data Stream Management," in SIGMOD Record, Volume 32, Number 2, 2003.
[4] X. Hao, D. Xu, "Time Series Prediction based on Non-Parametric," in SIGMOD Record, Volume 32, Number 2, 2003.
[5] D. F. Specht, "A General Regression Neural Network," IEEE Trans. on Neural Networks, Vol. 2, No. 6, pp. 568-576, Nov. 1991.
[6] D. Wackerly, W. Mendenhall, R. L. Scheaffer, "Mathematical Statistics with Applications," 5th edition, 2005.
[7] O. B. Yaik, C. H. Yong, and F. Haron, "Time Series Prediction using Adaptive Association Rules," in Proc. of DFMA05, pp. 310-314, 2005.
[8] Duck Jin Chai, Eun Hee Kin, Long Jin, "Frequent Items Prediction Method to one dimensional stream data," in Fifth International Conference on Computational Science and Applications, IEEE, 2007.
[9] Y. Chen, G. Dong, J. Han, et al., "Multi-Dimensional Regression Analysis of Time-Series Data Stream," Proceedings of the 28th International Conference on Very Large Data Base, Berlin, pp. 323-334, August 2002.
[10] Wei-Guang Teng, Ming-Syan Chen, Philip S. Yu, "A Regression-Based Temporal Pattern Mining Scheme for Data Stream," Proceedings of the 29th International Conference on Very Large Data Base, Berlin, pp. 607-617, August 2003.
[11] Joshi M, Karypis G, "A universal formulation of sequential patterns," Research report No. 99-021, Department of Computer Science, University of Minnesota, Minnesota, 1999.
