Abstract—Sequential pattern mining has been used to predict various aspects of customer buying behavior for a long time.
Discovered sequence reveals the chronological relation between items and provides valuable information to aid in developing
marketing strategies. Nevertheless, we can hardly know whether the buying is cyclic and how long the interval between the two
consecutive items in the sequential pattern is. To solve this problem, in this paper, data mining skills and the fundamentals of statistics
are combined to develop a set of algorithms to unearth the cyclic properties of discovered sequential patterns. The algorithms, coupled
with the sequential pattern mining process, constitute a thorough scheme to analyze and predict likely consumer behavior. The
proposed algorithms are implemented and applied to test against real data collected from a consumer goods company. The
experimental results illustrate how the model can be used to predict likely purchases within a certain time frame. Consequently,
marketing professionals can execute campaigns to favorably impact customers’ behaviors.
Index Terms—Association rules, data mining, frequency, sequential pattern, polynomial regression.
1 INTRODUCTION
the cycle and interval of items purchased are hardly known. Thus, the best time to recommend the right
products to the right person is hardly known either. Actually, periodical patterns are common in daily life.
A time-interval sequential pattern provides more information than a conventional sequential pattern does, and
discovering the time interval between successive item sets is the first step toward a more accurate analysis
of customer behavior. Therefore, in this paper, we develop a set of algorithms to analyze the periodical
properties of time intervals over sequential patterns.
Data mining skills and the fundamentals of statistics are combined to introduce an algorithm, Cyclic Model
Analysis (CMA), to find the model of recurring purchasing. The modeling process commences with the discovery
of sequential patterns from the transactional database. Then the existence of periodicity is identified and
the interval between successive events is computed by Generalized Periodicity Detection (GPD)/Trend Modeling
(TM), which will be explained in more detail later. Next, the CMA algorithm is used to obtain the period and
trends of the quantities purchased. Consequently, marketing people can recommend the right products to the
right customers at the right time.
This section provides a comprehensive review of prior works related to sequential pattern mining. In addition,
the motivation and research objectives of this paper are also explained in this section. The remainder of this
paper is organized as follows: The mathematical models that portray the sequential buying behavior are
constructed in Section 2. The proposed algorithms are presented in Section 3. The experimental results are
shown in Section 4. A brief conclusion and a discussion of future directions are given in Section 5.

2 PROBLEM STATEMENT
An item set $i$, denoted by $(x_1, x_2, \ldots, x_t)$, is a nonempty set of items. A sequence $S$, denoted by
$\langle i_1, i_2, \ldots, i_q \rangle$, is an ordered set of item sets. The size of a sequence $S$, written as
$|S|$, is the number of elements in $S$. A sequence is a k-sequence if $|S| = k$. For example, sequence
<a, b, c, d> is a 4-sequence. A sequence $\langle a_1, a_2, \ldots, a_n \rangle$ is a subsequence of another
sequence $\langle b_1, b_2, \ldots, b_m \rangle$ if there exist $1 \le i_1 < i_2 < i_3 < \cdots < i_n \le m$
such that $a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots$, and $a_n \subseteq b_{i_n}$. We also say
that the sequence $\langle a_1, a_2, \ldots, a_n \rangle$ is contained in the sequence
$\langle b_1, b_2, \ldots, b_m \rangle$. For example, the sequence <a, b> is a subsequence of <(a, c), (b, d)>
since $a \subseteq (a, c)$ and $b \subseteq (b, d)$. On the other hand, the sequence <(a, c), b> is not
contained in <(a, c, b)>, and vice versa.
Given a database D of customer transactions, each transaction is characterized by the fields: <customer-id>,
<time stamp>, and <items purchased>. More precisely, each transaction is a set of item sets and each sequence
is a list of transactions ordered by transaction time. Usually, the list of all the transactions of a customer
is called the customer sequence.
A customer supports a sequence s if s is contained in the corresponding customer sequence. The support for a
sequence s is defined as the fraction of customers whose customer sequences contain s. The definition of
support for a sequence s can be written as follows:
\[
\mathrm{Support}(s) = \frac{\text{Number of customers supporting sequence } s}{\text{Total number of customers}}.
\]
A sequence is maximal if it is not contained in any other sequence. Given a database D of customer
transactions, sequential pattern mining is the process of finding maximal sequences among all sequences that
have a certain user-specified minimum support. Each such maximal sequence represents a sequential pattern. The
user-specified minimum support threshold (denoted by minsup) expresses the statistical significance of a
sequence in the database.
Table 1 gives a simple example, which contains four customers and their activities over one month. Given the
threshold minsup = 0.5, three frequent sequences <A, F>, <F, H>, and <D, E> are found. The support of <A, F>
is 3/4 = 0.75. The support of <F, H> and <D, E> is 2/4 = 0.5. Hence, there are three sequential patterns in
the example database.
TABLE 1
The Transactions by Four Customers over One Month
A sequential pattern indicates the correlation between transactions. The sequence mined from the transaction
databases represents the order of purchases by the same customer, where the items come from different
transactions. A typical example of such a sequential pattern is a customer who buys a personal computer, then
a laser printer. As discussed in the previous section, researchers have developed many algorithms to address
the problem of efficiently discovering sequences. However, prior works seldom address our major concerns:
Tendency and Periodicity. Whether the next purchase will happen or how long the purchase behavior will last is
hard to tell. A tool to capture the characteristics of discovered sequences is needed.
To simplify the discussion, the case for a 2-sequence $\langle i_1, i_2 \rangle$, where $i_1, i_2$ are item
sets, is considered. The item set is a collection of the items. Thus, the case can be extended to more
complicated situations. Given a 2-sequence $\langle i_1, i_2 \rangle$ mined from transactions, the definition
of the Trend Distribution Function (TDF) of the 2-sequence is stated in Definition 1.
Definition 1. The sequential pattern $s = \langle i_1, i_2 \rangle$ is a 2-sequence mined from the transaction
database over a designated time frame $T = [t_1, t_n]$. The Trend Distribution Function of the given sequence
s, denoted by $f(x_j)$, is a nonnegative function defined on $[0, t_n - t_1]$. A sequence s is said to be an
$x_j$-interval-sequence if the interval difference between $i_1$ and $i_2$ is $x_j$. The value of $f(x_j)$ is
the total number of occurrences of $x_j$-interval-sequences in the transaction database D.
The pseudocode for computing the value of the trend distribution function is presented in Fig. 1.
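As an illustration of the support measure and of the trend distribution function of Definition 1, the following is a minimal Python sketch, not the pseudocode of Fig. 1. The data layout (a dict mapping a customer id to a time-ordered list of `(day, items)` pairs), the function names, and the rule that each purchase of the first item set is paired with the nearest later purchase of the second item set are all assumptions made for this example. The sample database is invented and is not the content of Table 1.

```python
from collections import Counter


def contains(pattern, customer_sequence):
    """True if the item sets of `pattern` occur, in order, as subsets of
    distinct transactions of the time-ordered `customer_sequence`."""
    pos = 0
    for item_set in pattern:
        # Greedily advance to the next transaction that covers item_set.
        while pos < len(customer_sequence) and not set(item_set) <= customer_sequence[pos][1]:
            pos += 1
        if pos == len(customer_sequence):
            return False
        pos += 1
    return True


def support(pattern, db):
    """Fraction of customers whose customer sequence contains `pattern`."""
    return sum(contains(pattern, seq) for seq in db.values()) / len(db)


def trend_distribution(first, second, db):
    """Count, per interval x, how often `second` follows `first` after x days.

    Assumption: each transaction containing `first` is paired with the
    nearest later transaction containing `second`.
    """
    counts = Counter()
    for seq in db.values():
        for i, (day1, items1) in enumerate(seq):
            if not set(first) <= items1:
                continue
            for day2, items2 in seq[i + 1:]:
                if set(second) <= items2:
                    counts[day2 - day1] += 1
                    break
    return counts


# Hypothetical database: customer id -> [(day, items), ...] sorted by day.
db = {
    1: [(1, {"A"}), (4, {"F"})],
    2: [(2, {"A", "B"}), (5, {"F"})],
    3: [(1, {"A"}), (8, {"F", "H"})],
    4: [(3, {"D"}), (9, {"E"})],
}
print(support([{"A"}, {"F"}], db))
print(trend_distribution({"A"}, {"F"}, db))
```

In this made-up database three of the four customers support <A, F>, and the intervals between the two purchases are 3, 3, and 7 days, so the distribution counts two occurrences at interval 3 and one at interval 7.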
CHIANG ET AL.: THE CYCLIC MODEL ANALYSIS ON SEQUENTIAL PATTERNS 1619
\[
f(x) = \frac{(24 - x)}{18}\,(\sin(x) + 1.5). \tag{1}
\]
The function shown in Fig. 3 is not a strictly (monotonically) decreasing function, but the movement along the
curve goes downward steadily. As mentioned before in this section, the inclination of the function within the
designated domain can be characterized as the slope of the regression line within the domain. Given any subset
of the whole domain, the function is called a linearly increasing distribution function if the slope of the
regression line is positive. The function is a linearly decreasing function if the slope of the regression
line is negative. Accordingly, the following is the definition for a periodical distribution function.
Definition 3. Let f(x) be a linearly periodical trend distribution function of sequence s defined on the
domain $X = [x_1, x_n]$. The straight line $y = ax + b$ is the trend line constructed by linear regression.
If, for each $x_i$ in X, $f(x) \cong f(x + \lambda) + ax$, $\lambda \le \frac{1}{2}(x_n - x_1)$, then f(x) is
said to be a linearly increasing periodic trend distribution function of the sequence with period $\lambda$ on
the domain X. The function f(x) is said to be a linearly decreasing periodical trend distribution function if
$f(x) \cong f(x + \lambda) - ax$, $\lambda \le (x_n - x_1)/2$.
Fig. 4 is a sample of a typical linearly decreasing function. The graph of the function goes downward along
the x-axis at a certain rate, and the curve repeats its shape after a period of 63. That is, the function
decreases steadily with period $\lambda = 63$. Then the function reaches zero at a certain point, that is,
x = 300.
As mentioned, periodicity is not the only interest. The degeneration phenomenon is another major concern.
Since the distribution function is a nonnegative function defined on

3 ALGORITHMS
In this section, a set of algorithms designed to deal with the distribution functions obtained from the
transaction databases is presented. The procedures proposed here are a synthesis of data mining techniques and
mathematical tools. More specifically, the aim of this research is to devise a scheme to analyze the trend
underlying the patterns. The scheme is to be integrated with traditional sequential pattern mining to offer a
comprehensive analysis procedure, which can more easily be adopted by marketers. The scheme presented in this
paper takes a two-phase approach to cope with the periodicity-related problems that occur in the analysis of
sequential patterns mined from transactions.
The core theme of the research is Simple is Beauty. It is well known that a host of algorithms have been
developed for efficient mining of sequential patterns. To solve the periodicity problem, a mathematical model
is constructed to portray the sequential pattern mined from the database. The structure proposed to describe
the nature of the pattern can reveal not only the periodicity but also the tendency of the occurrence of
purchasing actions. Then the mathematical tool is used to determine whether periodicity exists. If periodicity
exists, a procedure is proposed to analyze the likely consumer behavior.
The scheme comprises the sequential pattern mining technique and the algorithms presented in this section.
Given the result of sequential pattern mining, the primary concern is to know whether there are regularities
that can be found. Thus, the value of the trend distribution function is computed and then the GPD is
introduced to detect the periodicity of the function. If the periodicity can be
Fig. 6. The first step of the GPD procedure is to find the regression line y = ax + b. Then the iterative
computation of the error threshold suggests that 6.39 is the most likely period of the function.
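The two steps named in the caption can be sketched in Python as follows. This is only an illustration under stated assumptions, not the GPD pseudocode of the paper, which is not reproduced in this excerpt. The least-squares fit is standard; the period search (removing the trend line, then choosing the lag whose mean absolute difference between shifted residuals is smallest and within `max_error`) is a hypothetical stand-in for the iterative error computation. The parameter names `min_period` and `max_error` follow the text; the function names and the assumption of evenly spaced samples are ours.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def detect_period(xs, ys, min_period, max_error):
    """Return (a, b, period); period is None if no lag meets max_error.

    Assumes evenly spaced xs. Candidate periods are limited to half the
    domain, as in the definition of a periodic trend distribution function.
    """
    a, b = fit_line(xs, ys)
    residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
    step = xs[1] - xs[0]
    n = len(xs)
    best = None
    for lag in range(1, n // 2 + 1):
        period = lag * step
        if period < min_period:
            continue
        diffs = [abs(residuals[i] - residuals[i + lag]) for i in range(n - lag)]
        error = sum(diffs) / len(diffs)
        if error <= max_error and (best is None or error < best[1]):
            best = (period, error)
    return a, b, (best[0] if best else None)


# Synthetic check: a rising line plus a sine wave of period 5.
xs = [i * 0.25 for i in range(161)]
ys = [0.5 * x + 2 + math.sin(2 * math.pi * x / 5) for x in xs]
a, b, period = detect_period(xs, ys, min_period=1.0, max_error=0.1)
print(a, b, period)
```

Choosing the lag with the smallest error, rather than the first lag under the threshold, is a design choice of this sketch; it favors the fundamental period over its multiples only when the shifted residuals match most closely there.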
Fig. 8. The plot of the function f(x) and the polynomial f'(x).
Fig. 8 shows the result of applying TM to the function f(x). The darker line is the polynomial f'(x)
determined by regression and the other line is the input function f(x).
Next, TM is used to find the polynomial of the function (5), which is similar to the previously inspected
example (1) but is a linearly increasing function:
\[
g(x) = \frac{(24 + x)}{18}\,(\sin(x) + 1.5). \tag{5}
\]
The function g(x) is defined on the same domain, X = [0, 18], which is divided into 100 partitions. The same
input parameters are used as for (1). Let min_period = 0.5, max_error = 0.2, and degree = 6; then, apply Trend
Modeling to find the approximating model of g(x). This will give the following:
1. Invoking GPD on g(x) gives a = 0.023, b = 2.526, and $\lambda = 6.42$.
2. Compute the polynomial
\[
\begin{aligned}
g'(x) = (0.023x + 2.526)\,\{\, & 0.9012 + 0.5577\,(x \bmod 6.42) + 0.5251\,(x \bmod 6.42)^2 \\
 & - 0.5153\,(x \bmod 6.42)^3 + 0.1321\,(x \bmod 6.42)^4 \\
 & - 0.0144\,(x \bmod 6.42)^5 + 0.00066\,(x \bmod 6.42)^6 \,\}.
\end{aligned}
\]
Fig. 9. The plot of the function f(x) and the polynomial f'(x).
The darker line in Fig. 9 is the polynomial g'(x), which is determined by regression to the input g(x); the
lighter one is the plot of the input function g(x).
It has been demonstrated how TM can perfectly approximate the descending and ascending types of the linearly
periodic functions.
The polynomial gained by the TM process is an aid to identify the nature of the sequence mined. The graphical
representation of the polynomial is an extremely good aid to help observers have a better understanding of the
tendency of the pattern. And the analysis of the characteristics of the polynomial itself is helpful in
describing the phenomenon of the mined pattern. Hence, an elaborate and systematic plan of action is needed to
complete the task. That is why we developed CMA.

3.3 Cyclic Model Analysis
The purpose of establishing the mathematical model is to help analysts obtain a better understanding of the
whole picture of what happened and predict what is likely to happen. With GPD and TM, it can be determined
whether customers tend to repeat buying at a regular period, and an equation can be formulated to approximate
the distribution obtained from the sequence. In short, the mathematical model established by GPD/TM is used to
describe the characteristics of the sequential patterns mined from the designated time frame.
Next, CMA is proposed to analyze and describe the characteristics of the sequential pattern mined from the
transaction databases. Users must determine the values of the parameters min_period, max_error, degree, and
trcd. The meaning of the parameters min_period, max_error, and degree is the same as defined in GPD and TM.
The value of trcd is the terminating condition of the process. If the length of the domain is too short, the
process should be stopped since it is meaningless to investigate the characteristics of repeated patterns. The
procedural steps are shown in Fig. 10.
The trend distribution function of a given sequence is defined in Definition 1. Then the type of the function
is determined by finding the local maximum of the function. If the local maximum exists at the end of the
domain of the function, the function belongs to the ascending type. The descending type can be determined if
the local maximum exists at the beginning of the domain.
If the distribution function is of the ascending or descending type, GPD/TM is applied directly to get the
polynomial approximating the patterns and to find the period of the distribution function. If the distribution
is neither the ascending nor the descending type, the whole time frame has to be partitioned into two
subframes and CMA is invoked recursively until the distribution function of each subframe is simplified. If
the length of the inspected subframe is smaller than the predefined terminating condition trcd, the process
will be stopped.
In other words, CMA takes the divide-and-conquer approach to collect the knowledge of the designated
distribution function. Analysts use the synthesized mathematical model associated with product knowledge to
interpret the meaning of the model discovered by the proposed algorithms. Consequently, the interpretations
will be translated into marketing insight and marketing practice accordingly.
Below is an illustration of CMA applied to real-world databases. The data collected were transactions of a
domestic cosmetic supplier. The marketing department discovered several sequential patterns, which are of
interest. They found that the pattern <37, 27> was unusual;
1624 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 11, NOVEMBER 2009
Fig. 11. (a) The plot of the distribution function of pattern <37, 27>.
(b) After applying GPD/TM, the vertical bar used to indicate the
discovered period was drawn on the picture.
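The divide-and-conquer control flow of CMA described in Section 3.3 can be sketched in Python as below. This is a hedged illustration, not the procedure of Fig. 10, which is not reproduced in this excerpt. Several points are assumptions of the sketch: the type test uses the position of the global maximum of the sampled function, a frame that is neither ascending nor descending is split at that maximum, and `model_fn` is a hypothetical callback standing in for GPD/TM. Only the parameter name `trcd` comes from the text.

```python
def classify(ys):
    """Type of a sampled distribution function by the position of its maximum."""
    peak = max(range(len(ys)), key=lambda i: ys[i])
    if peak == 0:
        return "descending", peak
    if peak == len(ys) - 1:
        return "ascending", peak
    return "mixed", peak


def cma(xs, ys, trcd, model_fn):
    """Return a list of (x_start, x_end, kind, model) for each modeled subframe.

    Frames shorter than `trcd` are dropped; mixed frames are split at the
    maximum (an assumption) and analyzed recursively.
    """
    if len(xs) < 2 or xs[-1] - xs[0] < trcd:
        return []
    kind, peak = classify(ys)
    if kind != "mixed":
        return [(xs[0], xs[-1], kind, model_fn(xs, ys))]
    left = cma(xs[:peak + 1], ys[:peak + 1], trcd, model_fn)
    right = cma(xs[peak:], ys[peak:], trcd, model_fn)
    return left + right


# Toy input: rises to a peak at x = 3, then falls.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [1, 2, 3, 4, 3, 2, 1]
frames = cma(xs, ys, trcd=2, model_fn=lambda sub_xs, sub_ys: len(sub_xs))
print(frames)
```

Because the maximum of a mixed frame is strictly interior, both subframes are shorter than the frame they came from, so the recursion terminates even without the `trcd` check. With real, oscillating data the strict test on the first or last sample would likely need a tolerance, which this sketch leaves out.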
TABLE 2
The Result of Sequential Pattern Mining
\[
f(x) =
\begin{cases}
(-0.08x + 25.87)\,\{\,1.756 - 0.361\,(x \bmod 35) + 0.070\,(x \bmod 35)^2 - 0.007\,(x \bmod 35)^3 \\
\quad {} + 0.000\,(x \bmod 35)^4 - 0.000\,(x \bmod 35)^5 + 0.000\,(x \bmod 35)^6\,\}, & \text{if } x < 305, \\
0, & \text{otherwise.}
\end{cases}
\]
However, the approximating polynomial is an abstraction of the pattern, which is hard for nontechnical people
to interpret. With the help of a visual representation of the distribution function, engineers, marketers, and
business owners can communicate among themselves easily. Thus, the results of the modeling process can easily
be incorporated into a marketing practice.
Fig. 13. The plot of the pattern <38, 20> and its regression line.
The plot of the distribution function and its regression line are shown in Fig. 13. It can easily be seen that
<38, 20> is a simple descending sequence. The pattern has the period $\lambda = 35$ days and degenerates at
day x = 305. Thus, the analysis suggested the following:
• The majority of customers tended to buy product <20> every 35 days.
• The purchasing decreases moderately.
• Customers will not buy product <20> more than 305 days after the initial purchase of product <38>.
The characteristics of the patterns were learned from the mathematical model and the visualization of the
distribution function. Marketing professionals combine the information gained from CMA with their knowledge of
a product, then adapt the marketing practice to influence consumers' likely behavior.
Next, CMA was applied to <38, 36>. Similarly, sequence <38, 36> means that the customer purchases product 38
first and then buys product 36. It was understood that marketers require more information than what was
revealed. Thus, the predetermined parameters min_period = 5, max_error = 0.5, and degree = 6 were used. The
results of GPD were: a = -0.07, b = 22.13, and $\lambda = 63$. The approximating polynomial of the distribution
function of <38, 36> is
\[
f(x) =
\begin{cases}
(-0.07x + 22.13)\,\{\,1.995 - 0.231\,(x \bmod 63) + 0.016\,(x \bmod 63)^2 - 0.001\,(x \bmod 63)^3 \\
\quad {} + 0.000\,(x \bmod 63)^4 - 0.000\,(x \bmod 63)^5 + 0.000\,(x \bmod 63)^6\,\}, & \text{if } x < 299, \\
0, & \text{otherwise.}
\end{cases}
\]
Fig. 14. The plot of the pattern <38, 36> and its regression line.
The plot of the distribution function and its regression line are shown in Fig. 14. Together, the picture of
the model and the characteristics of the polynomial were obtained. It was learned that:
• The majority of customers tended to buy product <36> every 63 days.
• The purchasing decreases moderately.
• Customers will not buy product <36> more than 299 days after the initial purchase of product <38>.
The results indicated that CMA performs well in exploring the trends of repeat-buying behaviors and provides a
practical model for predicting when customers tend to purchase and when they are likely to stop buying.
Consequently, marketers can allocate resources to build and execute marketing campaigns that favorably impact
the behavior of these customers.

4.3 Consistent Buying Behaviors
Next, transactions which occurred in the year 2001 were examined to see whether the patterns that CMA
identified as important in the year 2000 retained the same characteristics. Hence, we applied GPD/TM to find
the regression polynomial of each pattern in the years 2000 and 2001.
Fig. 14 is the plot of the trend line and regression polynomial determined by GPD/TM of the pattern <38, 20>
in the years 2000 and 2001, respectively. Fig. 15 is the plot of the trend line and regression polynomial
discovered by GPD/TM of the pattern <38, 36> in the years 2000 and 2001, respectively.
Fig. 15. The regression line and approximation polynomial of <38, 20> in the years 2000 and 2001. We found
that the shapes of the plots of the two polynomials are similar.
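To show how a model of the kind reported in this section could be used for prediction, the following Python sketch evaluates a trend line combined with a polynomial in (x mod period) and lists the days on which repeat purchases would be expected. It rests on assumptions: the trend line and the polynomial are taken to be multiplied, which is how the extracted formulas read but is not stated explicitly in this excerpt, and the function names are hypothetical. Because the published higher-order coefficients are rounded to 0.000, the sketch is tested with simple made-up coefficients rather than with the reported models.

```python
def evaluate_model(x, a, b, period, coeffs, cutoff=None):
    """Value of (a*x + b) * p(x mod period), or 0 at or beyond `cutoff`.

    `coeffs` lists the polynomial coefficients from the constant term upward.
    Assumption: trend line and periodic polynomial are multiplied.
    """
    if cutoff is not None and x >= cutoff:
        return 0.0
    u = x % period
    poly = 0.0
    for c in reversed(coeffs):  # Horner's scheme
        poly = poly * u + c
    return (a * x + b) * poly


def repeat_days(period, cutoff):
    """Days (integer multiples of `period`) before `cutoff` at which a
    repeat purchase is expected; both arguments must be integers."""
    return list(range(period, cutoff, period))


# Period and cutoff reported for pattern <38, 20>: 35 days, degeneration at day 305.
print(repeat_days(35, 305))

# Made-up coefficients, used only to exercise the function.
print(evaluate_model(7, a=0.0, b=2.0, period=5, coeffs=[1.0, 1.0]))
```

For the pattern <38, 20>, the listed days are the multiples of 35 up to day 280, which is one way to read the finding that customers repurchase every 35 days and stop after day 305.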
Ding-An Chiang received the BS degree in hydraulic engineering from Chung Yuan Christian University, Taiwan,
in 1981, and the MS and PhD degrees in computer science from the University of Southwestern Louisiana in 1986
and 1990, respectively. He is currently a professor in the Department of Computer Science and Information
Engineering and the dean of student affairs at Tamkang University. His research interests include fuzzy,
relational databases and data mining.

Cheng-Tzu Wang received the MS and PhD degrees from the Center for Advanced Computer Studies at the University
of Louisiana in 1991 and 1994, respectively. He is currently an associate professor in the Department of
Computer Science at the National Taipei University of Education, Taiwan. His research interests include
software engineering, hybrid soft computing models, and data mining.

Shao-Ping Chen is currently working toward the PhD degree in computer science and information engineering at
Tamkang University in Taipei, Taiwan. His research interests include data mining, e-commerce, and cyber
culture.

Chun-Chi Chen received the MS degree in computer science and information engineering from Tamkang University
in Taipei, Taiwan, in 2003. His research interests include relational databases and data mining.

For more information on this or any other computing topic, please visit our Digital Library at
www.computer.org/publications/dlib.