Junqiang Liu (liujunq@mail.hz.zj.cn)
Yunhe Pan (panyh@sun.zju.edu.cn)
Ke Wang (wangk@cs.sfu.ca)
Jiawei Han (hanj@cs.uiuc.edu)
ABSTRACT
In this paper, we present a novel algorithm, OpportuneProject, for
mining the complete set of frequent item sets by projecting databases
to grow a frequent item set tree. Our algorithm is fundamentally
different from those proposed in the past in that it
opportunistically chooses between two different structures, array-based or tree-based, to represent projected transaction subsets,
and heuristically decides whether to build an unfiltered pseudo projection or
to make a filtered copy according to features of the subsets. More
importantly, we propose novel methods to build tree-based
pseudo projections and array-based unfiltered projections for
projected transaction subsets, which make our algorithm both
CPU time efficient and memory saving. Basically, the algorithm
grows the frequent item set tree by depth first search, whereas
breadth first search is used to build the upper portion of the tree if
necessary. We test our algorithm versus several other algorithms
on real world datasets, such as BMS-POS, and on IBM artificial
datasets. The empirical results show that our algorithm is not only
the most efficient on both sparse and dense databases at all levels
of support threshold, but also highly scalable to very large
databases.
General Terms
Algorithms
Keywords
Association Rules, Frequent Patterns
1. INTRODUCTION
Mining frequent item sets is a key step in many data mining
problems, such as association rule mining, sequential pattern
mining, classification, and so on. Since the pioneering work in [3],
the problem of efficiently generating frequent item sets has been
an active research topic.
Let I = {i1, i2, ..., im} be a set of literals, called items. Let database
D be a set of transactions, where each transaction T is a set of
items such that T ⊆ I. Each transaction is associated with a
unique identifier, called its TID. Let X be a set of items. A
transaction T is said to contain X if and only if X ⊆ T.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
SIGKDD 02, July 23-26, 2002, Edmonton, Alberta, Canada.
Copyright 2002 ACM 1-58113-567-X/02/0007$5.00.
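The definitions above (items, transactions, containment) can be made concrete with a short sketch. The helper name `support` is our own illustration, and the tiny database is the frequent-item projection of the running example; neither is code from the paper:

```python
# Support counting for item sets: a transaction T contains X iff X is a subset of T.
def support(db, itemset):
    """Number of transactions in db (TID -> items) that contain itemset."""
    x = set(itemset)
    return sum(1 for items in db.values() if x <= set(items))

# Example database restricted to its frequent items.
db = {
    1: {"a", "c", "f", "m", "p"},
    2: {"a", "b", "c", "f", "m"},
    3: {"b", "f"},
    4: {"b", "c", "p"},
    5: {"a", "c", "f", "m", "p"},
}

print(support(db, {"c", "f", "m"}))  # -> 3 (transactions 1, 2, and 5)
```

An item set is frequent when its support reaches the user-given support threshold.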
2. PROBLEM DESCRIPTIONS
Frequent item sets can be represented by a tree, namely the frequent
item set tree, abbreviated as FIST, which is not necessarily
materialized. To avoid redundancy, we impose an
ordering on the items.
FIST is an ordered tree, where each node is labeled by an item,
and associated with a weight. The ordering of items labeling the
nodes along any path (top down) and the ordering of items
labeling children of any node (left to right) follow the imposed
ordering. Each frequent item set is represented by one and only
one path starting from the root, and the weight of the ending node
is the support of the item set. The null root corresponds to the
empty item set. For example, the path (∅)-(c,4)-(f,3)-(m,3) in
Figure 1 represents the item set {c, f, m} with a support of 3. The
weights associated with the nodes need not be actually implemented.
(Figure 1. The frequent item set tree for the example database.)
(Figure 2. The example database, TIDs 01 to 05, and the projected transaction subsets of the frequent items.)
In depth first search, PTSs are maintained for all nodes on the
path from the root to the node that is currently being
explored. Depth first search has the advantage that it is not
necessary to re-create PTSs, which avoids the CPU-bound pattern
matching inherent in breadth first search. Moreover, only the
branch that is currently being explored needs to be maintained in
memory, which overcomes the limitation on the size of the FIST.
Depth first search is especially efficient for dense databases and for
low support thresholds. Depth first search is usually memory based,
that is, PTSs are maintained in main memory. Hence, depth first
search is not scalable to very large databases. Generally speaking,
depth first search is more efficient, and breadth first search is
more scalable.
Each transaction is represented by one and only one path in the forest. Each node in the forest is
labeled by (i, w), where i is an item and w is a count, namely the
number of transactions represented by the path starting from a
root and ending at the node. Items labeling nodes along any path are
sorted by the same ordering as the IL. All nodes labeled by the same
item are threaded from the entry in the IL with that item. A TTF is
filtered if only locally frequent items appear in it, and unfiltered
otherwise.
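The threaded transaction forest can be sketched as follows. The class layout and helper names are our own illustration of the structure described above, not the paper's implementation:

```python
class TTFNode:
    """A TTF node: (item, count), child links, and a 'next' thread
    connecting all nodes labeled by the same item."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}   # item -> TTFNode
        self.next = None     # thread to the next node with the same item

def build_filtered_ttf(db, item_order, frequent):
    """Insert each transaction (restricted to locally frequent items and
    sorted by the imposed ordering) as one path; equal prefixes share nodes."""
    roots, threads = {}, {}  # forest roots; item -> head of its thread
    for items in db.values():
        path = sorted((i for i in items if i in frequent), key=item_order.index)
        level, node = roots, None
        for i in path:
            if i not in level:
                child = TTFNode(i, node)
                level[i] = child
                # Link the new node into the thread for item i.
                child.next, threads[i] = threads.get(i), child
            node = level[i]
            node.count += 1
            level = node.children
    return roots, threads

def thread_sum(threads, item):
    """Support of an item, read off by following its thread."""
    n, s = threads.get(item), 0
    while n:
        s, n = s + n.count, n.next
    return s

# The example database, restricted to its frequent items.
db = {1: "acfmp", 2: "abcfm", 3: "bf", 4: "bcp", 5: "acfmp"}
roots, threads = build_filtered_ttf(db, "abcfmp", set("abcfmp"))
```

Each transaction contributes exactly one root-to-node path, and an item's support is the sum of counts along its thread.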
(Figure residue, panels (a) through (f): array-based and tree-based representations of the projected transaction subsets, with the frequent item list FIL: a:3, b:3, c:4, f:4, m:3, p:3.)

For example, the filtered TTF representation for the PTS of the
null root in Figure 2 is shown in Figure 5(a), where the path
(a,3)-(c,2)-(f,2)-(m,2)-(p,2) represents transactions 01 and 05, (a,3)-(b,1)-(c,1)-(f,1)-(m,1) represents transaction 02, and so on.
(Figure 5. Bottom up pseudo projecting TTF, panels (a) through (f).)
(Figure 6. Children of the TTF in Figure 5(d).)
First, any pseudo TTF consists of a sub forest of its parent TTF;
the leaves of the sub forest are labeled by the same item and
threaded from the entry of the IL with that item.
Second, by traversing the sub forest, we can delimit the PTS by
re-threading the nodes in the sub forest, and count the support of each
item in the sub forest by re-calculating the count of each node.
Fourth, the order in which frequent item sets are found by top down pseudo
projection of TTFs differs from that of bottom up pseudo
projection, as shown in Figure 9.
Our novel methods of pseudo projection avoid recursively
building projected transaction sets, whose number equals that of the
frequent item sets. These methods are not only space efficient, in
that no additional space is needed for any child TTF, but the
counting and projecting operations are also highly CPU efficient.
Empirical study shows that algorithms based on these methods are
much more efficient than Apriori and FP-Growth.
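The counting step of pseudo projection can be sketched in isolation: for a threaded item, walk each threaded node's ancestor path and accumulate counts, so no copy of the subset is ever made. The names below are our own illustration, not the paper's code:

```python
class Node:
    """Minimal TTF node for the sketch: (item, count), parent link, thread."""
    def __init__(self, item, count, parent):
        self.item, self.count, self.parent, self.next = item, count, parent, None

def count_pts(thread_head):
    """Support of each item inside the PTS of a threaded item, computed by
    walking every threaded node's ancestor path (no projected copy is built)."""
    support = {}
    n = thread_head
    while n:
        a = n.parent
        while a:
            support[a.item] = support.get(a.item, 0) + n.count
            a = a.parent
        n = n.next
    return support

# Hand-built fragment of the example TTF: paths (a,3)-(c,2)-(f,2)-(m,2)
# and (a,3)-(b,1)-(c,1)-(f,1)-(m,1), with the two m-nodes threaded.
a = Node("a", 3, None)
m2 = Node("m", 2, Node("f", 2, Node("c", 2, a)))
m1 = Node("m", 1, Node("f", 1, Node("c", 1, Node("b", 1, a))))
m2.next = m1
```

Here `count_pts(m2)` yields the local supports within m's PTS: a:3, c:3, f:3, b:1.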
dense, and some are sparse; real databases come in all sizes. We
propose a hybrid approach that maximizes efficiency and
scalability for mining real databases, based on the following
observations and heuristics.
(Figure 7. Top down pseudo projecting TTF, panels (a) through (f).)
(Figure 8. Children of the TTF in Figure 7(d).)
(Figure 9. Build FIST by top down pseudo projecting TTF.)
n = Σ_{i=1}^{l−1} C_f^i (2^i − 1) + 2^f − 1
In our algorithm we estimate u and l from the average
transaction length. Numerous experiments show that this estimate is
always larger than the actual size; in other words, it is a
pessimistic measure. The compression ratio of a TTF is r = o/n. If r
is less than 6 − (t/n), the size of the TTF is greater than that of the TVLA, which is
the case for sparse databases.
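The representation heuristic above reduces to a one-line decision rule. The function below is our own sketch of that rule, with o, n, and t as defined in the text:

```python
def choose_representation(o, n, t):
    """Heuristic sketch: o = item occurrences in the PTS, n = estimated
    number of TTF nodes, t = number of transactions. The compression ratio
    is r = o / n; when r falls below 6 - (t / n), the TTF would be larger
    than the array-based TVLA, so the array form is preferred (the typical
    case for sparse databases)."""
    r = o / n
    return "TVLA" if r < 6 - (t / n) else "TTF"
```

For instance, a sparse subset with little prefix sharing (o close to n) selects the TVLA, while a dense subset with high compression (o much larger than n) selects the TTF.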
OpportuneProject(Database: D)
begin
  create a null root for frequent item set tree T;
  D' = BreadthFirst(T, 1, D);
  v = the null root of T;
  GuidedDepthFirst(v, D');
end
BreadthFirst(FIST: T, CurrentLevel: L, Database: D)
begin
  for each node v at level L of T do
    CreateCountingVector(v);
  D' = { };
  for each transaction t in D do
    ProjectAndCount(t, D');
  for each node v at level L of T do
    GenerateChildren(v);
  if D' can be represented by neither TVLA nor TTF
  then return(BreadthFirst(T, L+1, D'));
  else return(D');
end
GuidedDepthFirst(CurrentFISTNode: p, PTS: D)
begin
  ILp = TraverseAndCount(D, p);
  Dp = Represent(D, p);
  for each frequent entry e in ILp by the particular ordering do
  begin
    c = GetChild(p, e);
    GuidedDepthFirst(c, Dp);
  end
end
Figure 10. Algorithm OpportuneProject
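The depth first phase of Figure 10 can be approximated by a compact, purely array-based sketch. This is not the OpportuneProject algorithm itself (no TTF, no opportunistic switching, no breadth first prefix); it only makes the recursion of projecting and counting concrete, and all names are ours:

```python
from collections import defaultdict

def mine(db, minsup):
    """Simplified depth first projection mining: count items in the current
    projected transaction set, extend the prefix with each frequent item,
    and recurse on the projection. Returns {frozenset(itemset): support}."""
    counts = defaultdict(int)
    for t in db:
        for i in set(t):
            counts[i] += 1
    items = sorted(i for i, c in counts.items() if c >= minsup)
    result = {}
    for k, i in enumerate(items):
        result[frozenset([i])] = counts[i]
        # Project: from each transaction containing i, keep only items after i.
        proj = [[j for j in t if j in items[k + 1:]] for t in db if i in t]
        for s, c in mine(proj, minsup).items():
            result[s | {i}] = c
    return result
```

On the running example with a support threshold of 3, `mine([list("acfmp"), list("abcfm"), list("bf"), list("bcp"), list("acfmp")], 3)` reports, among others, {c, f, m} with support 3, matching the FIST of Figure 1.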
4. PERFORMANCE EVALUATIONS
To evaluate the efficiency and effectiveness of our algorithm
OpportuneProject, we have done extensive experiments on
various kinds of datasets with different features by comparing
with Apriori [16], FP-Growth [9], and H-Mine [12] on an 800MHz
Pentium IV PC with 512MB main memory and a 20GB hard drive,
running Microsoft Windows 2000 Server.
(Table residue: dataset characteristics, including maximum/average transaction sizes of 164/6.5, 267/2.5, 161/5.0, 43/43.0, and 72/28.4.)
(Figure residue: run time in seconds of Apriori, FP-Growth, H-Mine, and OP versus support threshold (%) on the test datasets.)
6. ACKNOWLEDGEMENTS
We would like to thank Blue Martini Software, Inc. for providing
us the BMS datasets.
7. REFERENCES
(Figure residue: run time of Apriori, FP-Growth, H-Mine, and OP versus number of transactions, up to 15M.)
5. CONCLUSIONS
In this paper, we propose an efficient algorithm to find the complete
set of frequent item sets for databases of all features, sparse or
dense, and of all sizes, from moderate to very large. The
algorithm combines the depth first approach with the breadth first
approach, and opportunistically chooses between array-based
and tree-based representations for projected