Abstract: For a transaction database, a frequent itemset is an itemset included in at least a specified number of transactions. A frequent itemset P is maximal if P is included in no other frequent itemset, and closed if P is included in no other itemset included in exactly the same transactions as P. The problems of finding these frequent itemsets are fundamental in data mining, and the applications need fast implementations for solving them. In this paper, we propose efficient algorithms LCM (Linear time Closed itemset Miner), LCMfreq and LCMmax for these problems. We show the efficiency of our algorithms by computational experiments comparing them with existing algorithms.

1 Introduction

Frequent itemset mining is one of the fundamental problems in data mining and has many applications, such as association rule mining, inductive databases, and query expansion. These applications need fast implementations of frequent itemset mining. In this paper, we propose the second versions of LCM, LCMfreq and LCMmax, for enumerating closed, all, and maximal frequent itemsets. LCM is an abbreviation of Linear time Closed itemset Miner.

In FIMI03[7], we proposed the first version of LCM, which enumerates frequent closed itemsets. LCM uses prefix preserving closure extension (ppc extension for short), which is an extension from a closed itemset to another closed itemset. The extension induces a search tree on the set of frequent closed itemsets, so we can completely enumerate the closed itemsets without duplication. Generating a ppc extension needs no previously obtained closed itemset. Hence, the memory use of LCM does not depend on the number of frequent closed itemsets, even if there are many of them. The time complexity of LCM is theoretically bounded by a linear function in the number of frequent closed itemsets, while that of the existing algorithms is not. We further developed algorithms for the frequency counting: occurrence deliver and a hybrid with diffsets. They reduce the practical computation time considerably. Moreover, the framework of LCM is simple. Generating ppc extensions needs no sophisticated data structure such as binary trees; LCM is implemented with only arrays. Therefore, LCM is fast, and outperforms other algorithms on some sparse datasets.

However, LCM has no routine for reducing the database, while many existing algorithms have one. Thus, the performance of LCM is not good for dense datasets with large minimum supports, which involve many unnecessary items and transactions. At FIMI03, we also proposed modifications of LCM, LCMfreq and LCMmax, for enumerating all frequent itemsets and maximal frequent itemsets. Although they are fast for some instances, if LCM is not fast on an instance, they are not fast on it either. Existing maximal frequent itemset mining algorithms have efficient pruning methods to reduce the number of iterations, while LCMmax does not; this is another reason for the slowness of LCMmax.

This paper proposes the second version of the LCM algorithms. We added database reduction to LCM, so that problems on dense datasets can be solved in a short time. The second version of LCMmax includes a pruning method, so the computation time is reduced when the number of maximal frequent itemsets is small. We further developed new algorithms for checking the maximality of a frequent itemset and for taking the closure of an itemset. We compare the performance of the LCM algorithms and the other algorithms submitted to FIMI03 by computational experiments. In many instances, the LCM algorithms outperform the other algorithms.

The organization of the paper is as follows. Section 2 introduces preliminaries. The main algorithms and practical techniques of the LCM algorithms are described in Section 3. Section 4 shows the results of computational experiments, and Section 5 concludes the paper.

2 Preliminaries

frq(Q) < frq(P). We call the item with the maximum index in P the tail of P, and denote it by tail(P).

3 Algorithms for Efficient Enumeration

In this section, we explain the techniques used in the second versions of the LCM algorithms, comparing each of them to the techniques used by the other algorithms in the following subsections. The new techniques used in the second version are described below.
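To make the notions recalled above concrete, here is a small self-contained sketch (our illustration, not code from the paper) of a transaction database together with the denotation T(P), the frequency frq(P), and tail(P). Items are modeled as integers and transactions as Python sets; these representation choices are our assumptions.

```python
# Illustration of the basic notions: items are integers, transactions are sets.
database = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3}]

def denotation(P, db):
    """T(P): indices of the transactions that include every item of P."""
    return {i for i, t in enumerate(db) if P <= t}

def frq(P, db):
    """frq(P): the number of transactions including P."""
    return len(denotation(P, db))

def tail(P):
    """tail(P): the item of maximum index in P (P must be non-empty)."""
    return max(P)

# {1, 2} is contained in transactions 0, 1 and 3:
assert denotation({1, 2}, database) == {0, 1, 3}
assert frq({1, 2}, database) == 3
assert tail({1, 2}) == 2
```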
The other group is the so-called backtracking algorithms[3, 18, 19]. A backtracking algorithm is based on recursive calls. An iteration of a backtracking algorithm takes a frequent itemset P as input, and generates new itemsets by adding each item to P. Then, for each generated itemset that is frequent, the iteration makes a recursive call with respect to it. To avoid duplication, an iteration of a backtracking algorithm adds only items with indices larger than the tail of P. We describe the framework of backtracking algorithms as follows.

ALGORITHM BackTracking (P: current solution)
1. Output P
2. For each e ∈ I, e > tail(P) do
3.   If P ∪ {e} is frequent then call BackTracking (P ∪ {e})

An execution of a backtracking algorithm gives a tree structure such that the vertices of the tree are the iterations, and edges connect two iterations if one of them calls the other. If an iteration I recursively calls another iteration I′, we say that I is the parent of I′, and I′ is a child of I. For an iteration, the itemset received from the parent is called the current solution.

Apriori algorithms use much memory for storing Dk, while backtracking algorithms use less memory since they keep only the current solution. Backtracking algorithms need no computation for maintaining previously obtained itemsets, so their computation time is generally short. However, apriori algorithms have advantages for the frequency counting.

The LCM algorithms are based on backtracking, and use efficient techniques for the frequency counting: occurrence deliver and anytime database reduction, described below. Hence, the LCM algorithms compute frequencies efficiently without keeping previously obtained itemsets in memory.

3.2 Maintaining Databases

In the existing studies, database reduction is said to be important for reducing the computation time. It reduces the input database by the following rules:

1. remove each item included in fewer than θ transactions,
2. remove each item included in all transactions,
3. merge identical transactions into one.

Database reduction performs well when the minimum support is large, and many existing algorithms use it. The LCM algorithms also use database reduction.

In the existing studies, the input databases are often stored and maintained using an FP-tree (frequent pattern tree), which is a version of a prefix tree (trie) [9]. With an FP-tree, we can search for specified transactions in the dataset efficiently. The FP-tree compresses common prefixes, so it decreases the memory use. In addition, the FP-tree can detect identical transactions, so we can merge them into one; this merge accelerates the frequency counting. For these reasons, FP-trees are used in many algorithms and implementations.

Although the FP-tree has many advantages, we do not use it in the implementation of LCM, but use simple arrays instead. The main reason is that LCM does not have to search for transactions in the database: the main operation of LCM is tracing the transactions in the denotation of the current solution. Thus, we do not need sophisticated data structures for searching.

The other reason is the computation time for the initialization. If we use a standard binary tree for implementing the FP-tree, initializing the input database takes O(||T|| + |T| log |T|) time for constructing the FP-tree in memory. Compared to this, LCM detects identical transactions and stores the database in memory in time linear in the database size. This is because LCM uses radix sort for this task, which sorts the transactions into lexicographic order in linear time. In general, the datasets of data mining problems have many transactions, each with few items. Thus, ||T|| is usually smaller than |T| log |T|, and LCM has an advantage. Moreover, the constant factors of binary tree operations are larger than those of array operations, so LCM has an advantage here as well. Again, we recall that LCM never searches the transactions, so each operation required by LCM can be done in constant time.
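As a concrete illustration (our sketch, not the authors' implementation), the three reduction rules of Section 3.2 can be written down directly. The set-based data representation and the use of a multiplicity count for merged transactions are our assumptions.

```python
# Database reduction for minimum support theta (illustrative sketch):
#   rule 1: remove items occurring in fewer than theta transactions,
#   rule 2: remove items occurring in every transaction,
#   rule 3: merge identical transactions, keeping a multiplicity count.
from collections import Counter

def reduce_database(db, theta):
    occ = Counter(e for t in db for e in t)     # occurrence count per item
    n = len(db)
    keep = {e for e, c in occ.items() if theta <= c < n}    # rules 1 and 2
    merged = Counter(frozenset(t & keep) for t in db)       # rule 3
    # drop transactions that became empty; the value is the multiplicity
    return {t: m for t, m in merged.items() if t}

db = [{1, 2, 3}, {1, 2, 3}, {2, 3, 4}, {2, 5}]
# With theta = 2: item 2 occurs in all four transactions and is removed
# (rule 2), items 4 and 5 are infrequent (rule 1), and the two identical
# reduced transactions {1, 3} are merged with multiplicity 2 (rule 3).
assert reduce_database(db, 2) == {frozenset({1, 3}): 2, frozenset({3}): 1}
```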
The FP-tree has an advantage in reducing the memory use, and this memory reduction can also reduce the computation time of the frequency counting. To check the efficiency of this reduction, we measured the reduction ratio of the FP-tree for some datasets examined in FIMI03. The results are shown in Table 1. Each cell shows the ratio between the number of items that have to be stored by arrays and by an FP-tree. Since the input database is usually reduced in each iteration, we sum the numbers over all iterations to compute the ratio. In our experiments, the ratio was not greater than 3 in many instances. If |I| is small, |T| is large, and the dataset has some randomness, as in accidents, the ratio was up to 6. Generally, a binary tree uses three times as much memory as an array. Thus, for many datasets, the performance of the FP-tree seems to be no better than that of an array in both memory use and computation time.

3.3 Hypercube Decomposition

LCM finds a number of frequent itemsets at once to reduce the computation time[18]. Since the itemsets obtained at once compose a hypercube in the itemset lattice, we call this technique hypercube decomposition. For a frequent itemset P, let H(P) be the set of items e satisfying e > tail(P) and T(P) = T(P ∪ {e}). Then, for any Q ⊆ H(P), T(P ∪ Q) = T(P) holds, and P ∪ Q is frequent. LCMfreq uses this property. For two itemsets P and P ∪ Q, we say that P′ is between P and P ∪ Q if P ⊆ P′ ⊆ P ∪ Q. In the iteration with respect to P, we output all P′ between P and P ∪ H(P). This saves about 2^|H(P)| executions of the frequency counting.

To avoid duplication, we do not generate recursive calls with respect to items included in H(P). Instead, we output the frequent itemsets including items of H(P) in the recursive calls with respect to items not included in H(P). When the algorithm generates a recursive call with respect to e ∉ H(P), we pass H(P) to it. In the recursive call, we output all itemsets between P ∪ {e} and P ∪ {e} ∪ H(P) ∪ H(P ∪ {e}). Since any itemset Q satisfies T(P ∪ Q ∪ H(P)) = T(P ∪ Q), the itemsets output in the recursive calls are frequent. We describe hypercube decomposition as follows.

ALGORITHM HypercubeDecomposition (P: current solution, S: itemset)
S′ := S ∪ H(P)
Output all itemsets including P and included in P ∪ S′
For each item e ∈ I \ (P ∪ S′), e > tail(P) do
  If P ∪ {e} is frequent then call HypercubeDecomposition (P ∪ {e}, S′)
End for

3.4 Frequency Counting

Generally, the heaviest part of frequent itemset mining is the frequency counting: counting the number of transactions that include a newly generated itemset. To reduce the computation time, existing algorithms use down project. For an itemset P, down project computes its denotation T(P) by using two subsets P1 and P2 of P: if P = P1 ∪ P2, then T(P) = T(P1) ∩ T(P2). Under the condition that the items of P1 and P2 are sorted by their indices, the intersection can be computed in O(|T(P1)| + |T(P2)|) time. Down project uses this property to compute the denotations quickly. Moreover, if |T(P1)| < θ or |T(P2)| < θ holds, we can see that P can never be frequent; this also helps to reduce the computation time. Apriori-type algorithms accelerate the frequency counting by finding a good pair of subsets P1 and P2 of P such that |T(P1)| + |T(P2)| is small, or such that either P1 or P2 is infrequent. A backtracking algorithm adds an item e to the current solution P in each iteration and computes the denotation. By using T({e}), the computation time for the frequency counting is reduced to O(Σ_{e > tail(P)} (|T(P)| + |T({e})|)).

The bitmap method[5] is a technique for speeding up the intersection in down project. It uses a bitmap image (the characteristic vector) of the denotations. Taking the intersection still takes O(|T|) time with bitmaps, but a 32-bit CPU can intersect 32 bits at once, so roughly speaking the computation time is reduced to 1/32. This method has a disadvantage for sparse datasets, and is not orthogonal to the anytime database reduction described below. From the results of the experiments in FIMI03, the bitmap method seems not to be good for sparse large datasets.

The LCM algorithms use another method for the frequency counting, called occurrence deliver[18, 19]. Occurrence deliver computes the denotations of P ∪ {e} for e = tail(P)+1, ..., |I| at once, by tracing the transactions in T(P). It uses a bucket for each item e to be added, all initialized to the empty set. Then, for each transaction t ∈ T(P), occurrence deliver inserts t into the bucket of e for each e ∈ t with e > tail(P). After these insertions, the bucket of e is equal to T(P ∪ {e}). For each transaction t, occurrence deliver takes O(|t ∩ {tail(P)+1, ..., |I|}|) time. Thus, the computation time is O(Σ_{T ∈ T(P)} |T ∩ {tail(P)+1, ..., |I|}|) = O(|T(P)| + Σ_{e > tail(P)} |T(P ∪ {e})|). This time complexity is smaller than that of down project. We describe the pseudo code of occurrence deliver in the following.

ALGORITHM OccurrenceDeliver (T: database, P: itemset)
1. Set Bucket[e] := ∅ for each item e > tail(P)
2. For each transaction t ∈ T(P) do
3.   For each item e ∈ t, e > tail(P), insert t into Bucket[e]
4. Output Bucket[e] for each item e > tail(P)
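The pseudo code above, plugged into the backtracking framework of Section 3.1, can be sketched as follows. This is our reading of the text, not the authors' code: transactions are Python sets, denotations are lists of transaction indices, and the brute-force setup is an assumption made for brevity.

```python
# Backtracking enumeration of frequent itemsets with occurrence deliver:
# the denotations of all P ∪ {e}, e > tail(P), are built in one pass
# over the transactions of T(P).
def occurrence_deliver(db, occ_P, tail_P):
    """Return {e: T(P ∪ {e})} for all items e > tail(P) by scanning T(P)."""
    bucket = {}
    for i in occ_P:                       # trace only the transactions in T(P)
        for e in db[i]:
            if e > tail_P:
                bucket.setdefault(e, []).append(i)
    return bucket

def backtrack(db, theta, P=(), occ=None, out=None):
    """Enumerate all frequent itemsets as tuples of increasing items."""
    if occ is None:
        occ, out = range(len(db)), []
    out.append(P)
    tail_P = P[-1] if P else -1
    for e, occ_e in sorted(occurrence_deliver(db, occ, tail_P).items()):
        if len(occ_e) >= theta:           # P ∪ {e} is frequent: recurse
            backtrack(db, theta, P + (e,), occ_e, out)
    return out

db = [{1, 2}, {1, 2}, {2, 3}]
# With theta = 2 the frequent itemsets are {}, {1}, {1, 2}, {2}.
assert backtrack(db, 2) == [(), (1,), (1, 2), (2,)]
```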
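The hypercube decomposition of Section 3.3 can likewise be sketched in code (our illustration, not the authors' implementation; the brute-force denotation is an assumption made for clarity): H(P) collects the items e > tail(P) that leave the denotation unchanged, and every itemset between P and P ∪ H(P) can then be reported with no further frequency counting.

```python
# Hypercube decomposition (illustrative sketch): all 2^|H(P)| itemsets
# between P and P ∪ H(P) share the denotation T(P).
from itertools import chain, combinations

def denotation(P, db):
    return frozenset(i for i, t in enumerate(db) if set(P) <= t)

def hypercube(P, db):
    """Report every itemset between P and P ∪ H(P) at once."""
    occ = denotation(P, db)
    tail_P = max(P) if P else -1
    items = set().union(*db)
    H = {e for e in items
         if e > tail_P and denotation(set(P) | {e}, db) == occ}
    subsets = chain.from_iterable(
        combinations(sorted(H), r) for r in range(len(H) + 1))
    return [frozenset(P) | set(Q) for Q in subsets]

db = [{1, 2, 3}, {1, 2, 3}, {2}]
# For P = {1}: T({1}) = {0, 1}; adding 2 or 3 keeps the denotation
# unchanged, so H({1}) = {2, 3} and four itemsets are output at once.
assert sorted(map(sorted, hypercube({1}, db))) == [[1], [1, 2], [1, 2, 3], [1, 3]]
```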
dataset                                 chess   accidents   BMS-WebView2   T40I10D100K
minimum support                          40%       30%         0.05%          0.1%
reduction factor by FP-tree             2.27      6.01          1.9           1.57
reduction factor by
  hypercube decomposition               6.25       1            1.21           1
reduction factor by apriori (best)      1.11      1.34          1.35          2.85

Table 1: Efficiency test of FP-tree, hypercube decomposition, and apriori. The reduction factor of FP-tree is (sum of # of elements in the database reduced by LCM) / (sum of # of elements in the database reduced by FP-tree) over all iterations; the reduction factor of hypercube decomposition is the average number of frequent itemsets output per iteration; and the reduction factor of apriori is (sum of Σ_{e > tail(P)} |T(P ∪ {e})|) / (sum of Σ_{e > F(P)} |T(P ∪ {e})|) over all iterations.
dataset                         connect       pumsb      BMS-WebView2   T40I10D100K
minimum support                   50%          60%           0.1%          0.03%
Database reduction             188319235   2125460007      2280260      1704927639
Anytime database reduction       538931      7777187        521576       77371534
Reduction factor                  349.4        273.2          4.3           22.0

Table 2: Accumulated number of transactions in the database over all iterations
thus they will not be efficient. Many pruning methods have been developed for speeding up, but they are not complete. Thus, the computation time is not bounded by a linear function in the number of frequent closed itemsets, and there is a possibility of a super-linear increase of the computation time in the size of the output.

LCM uses prefix preserving closure extension (ppc-extension for short) for generating closed itemsets[18, 19]. For a closed itemset P, we define the closure tail clo_tail(P) as the item i of minimum index satisfying clo(P(i)) = P; clo_tail(P) is always included in P. We say that P′ is a ppc extension of P if P′ = clo(P ∪ {e}) and P′(e − 1) = P(e − 1) hold for an item e > clo_tail(P). Let P0 be the itemset satisfying T(P0) = T. Any closed itemset P′ ≠ P0 is a ppc extension of another closed itemset P, and such a P is unique for P′. Moreover, the frequency of P is strictly larger than that of P′, hence ppc extension induces a rooted tree on the frequent closed itemsets. LCM starts from P0, and finds all frequent closed itemsets in a depth-first manner by recursively generating ppc extensions. The proofs for the ppc extension algorithms are described in [18, 19].

By ppc extension, the time complexity is bounded by a linear function in the number of frequent closed itemsets. Hence, the computation time of LCM is never super-linear in the number of frequent closed itemsets.

3.6 Closure Operation

To enumerate closed itemsets, we have to check whether the current solution P is a closed itemset or not. In the existing studies, there are two methods for this task. The first method is to store in memory the previously obtained itemsets that are currently maximal among the itemsets having the identical denotation. In this method, we find frequent itemsets one by one, and store them in memory while removing itemsets included in another itemset having the identical denotation. After finding all frequent itemsets, only the closed itemsets remain in memory. We call this the storage method. The second method is to generate the closure of P. By adding to P all items e such that frq(P) = frq(P ∪ {e}), we can construct the closure of P. We call this second method the closure operation.

LCM uses closure operations for generating ppc extensions. As for the frequency counting, we use database reduction for the closure operation. Suppose that the current solution is P, the reduced database is composed of transactions S1, ..., Sh, and each Sl is obtained from transactions T1^l, ..., Tk^l of the original database. For each Sl, we define the interior intersection In(Sl) by ∩_{T ∈ {T1^l, ..., Tk^l}} T. Then the closure of P is equal to ∩_{S ∈ {S1, ..., Sh}} In(S). Thus, by using interior intersections, we can efficiently construct the closure of P.

When we merge transactions to reduce the database, the interior intersections can be updated efficiently by taking the intersection of their interior intersections. In the same way as for the frequency counting, we can remove infrequent items from the interior intersections for further reduction. The computation time for the closure operation in LCM depends on the size of the database, but not on the number of previously obtained itemsets. Thus, the storage method has an advantage if the number of frequent closed itemsets is small; however, for instances with many frequent closed itemsets, which take a long time to solve, LCM has the advantage.

3.7 Enumerating Maximal Frequent Itemsets

Many existing algorithms for maximal frequent itemset enumeration are based on the enumeration of frequent itemsets. In a breadth-first or depth-first manner, they enumerate frequent itemsets and output the maximal ones among them. To reduce the computation time, the algorithms prune unnecessary itemsets and recursive calls.

Similarly to these algorithms, LCMmax enumerates closed itemsets by backtracking, and outputs the maximal itemsets among them. It uses a pruning to cut off unnecessary branches of the recursion. The pruning is based on a re-ordering of the indices of the items in each iteration.
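The closure operation and the ppc-extension test can be sketched as follows. This is our illustration of Sections 3.5 and 3.6, not the authors' code: the closure is computed directly as the intersection of the transactions in T(P) (the interior intersections of the reduced database amount to the same intersection, grouped by merged transaction), and for brevity the sketch checks only the prefix-preservation condition P′(e − 1) = P(e − 1), omitting the e > clo_tail(P) condition of the full algorithm.

```python
# Closure operation and ppc extension (illustrative sketch).
def denotation(P, db):
    """T(P): the transactions that include P (db: list of frozensets)."""
    return [t for t in db if P <= t]

def closure(P, db):
    """clo(P): intersection of all transactions containing P (P frequent)."""
    occ = denotation(P, db)
    out = occ[0]
    for t in occ[1:]:
        out = out & t
    return out

def ppc_extensions(P, db, theta):
    """Frequent closed itemsets Q = clo(P ∪ {e}) whose prefix below e
    equals that of P (prefix-preserving closure extension)."""
    items = set().union(*db)
    res = []
    for e in sorted(items - P):
        Q = closure(P | {e}, db)
        if (len(denotation(Q, db)) >= theta
                and {i for i in Q if i < e} == {i for i in P if i < e}):
            res.append(Q)
    return res

db = [frozenset(t) for t in ({1, 2, 3}, {1, 2, 3}, {2, 3, 4}, {2, 4})]
assert closure({1}, db) == {1, 2, 3}   # item 1 always occurs with 2 and 3
assert ppc_extensions(frozenset({2}), db, 2) == [{1, 2, 3}, {2, 3}, {2, 4}]
```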
We explain the re-ordering in the following. Let us consider a backtracking algorithm for enumerating frequent itemsets, and let P be the current solution of an iteration of the algorithm. Suppose that P′ is a maximal frequent itemset including P. LCMmax assigns new indices to the items with indices larger than tail(P), so that any item in P′ has an index larger than any item not in P′. Note that this re-ordering of indices has no effect on the correctness of the algorithm.

Let e > tail(P) be an item in P′, and consider the recursive call with respect to P ∪ {e}. Any frequent itemset P̂ found in the recursive call is included in P′, since every item having an index larger than e is included in P′, and the recursive call adds to P only items with indices larger than e. From this, we can see that with the re-ordering of indices, the recursive calls with respect to items in P′ ∩ H generate no maximal frequent itemset other than P′.

According to this, an iteration of LCMmax chooses an item e* ∈ H, and generates a recursive call with respect to P ∪ {e*} to obtain a maximal frequent itemset P′. It then re-orders the indices of the items other than e* as above, and generates recursive calls with respect to each e > tail(P) not included in P′ ∪ {e*}. In this way, we save the computation time for finding P′, and by finding a large itemset, we increase the efficiency of this approach. In the following, we describe LCMmax.

ALGORITHM LCMmax (P: itemset, H: items to be added)
1. H′ := the set of items e in H s.t. P ∪ {e} is frequent
2. If H′ = ∅ then
3.   If P ∪ {e} is infrequent for every e then output P; return
4. End if
5. End if
6. Choose an item e* ∈ H′; H′ := H′ \ {e*}
7. Call LCMmax (P ∪ {e*}, H′)
8. P′ := the frequent itemset of maximum size found in the recursive call in step 7
9. For each item e ∈ H \ P′ do
10.  H′ := H′ \ {e}
11.  Call LCMmax (P ∪ {e}, H′)
12. End for

3.8 Checking Maximality

When LCMmax finds a frequent itemset P, it checks whether the current solution is maximal or not. We call this operation the maximality check. The maximality check is a heavy task, so many existing algorithms avoid it: they store in memory the maximal itemsets among the previously obtained frequent itemsets, and update them whenever they find a new itemset. When such an algorithm terminates, having found all frequent itemsets, only the maximal frequent itemsets remain in memory. We call this the storage method. If the number of maximal frequent itemsets is small, the storage method is efficient. However, if the number is large, the storage method needs much memory. Moreover, when a frequent itemset is newly found, the storage method checks whether the itemset is included in some itemset in memory; if the number of frequent itemsets is large, this operation takes a long time.

To avoid the disadvantage of the storage method, LCMmax performs an explicit maximality check. LCMmax checks maximality by looking for an item e such that P ∪ {e} is frequent: such an e exists if and only if P is not maximal. To perform this check efficiently, we reduce the database. Let us consider an iteration of LCMmax with respect to a frequent itemset P. The LCM algorithms reduce the database by anytime database reduction for the frequency counting. Suppose that the reduced database is composed of transactions S1, ..., Sh, and each Sl is obtained by merging transactions T1^l, ..., Tk^l of the original database. Let H be the set of items to be added in the iteration, and suppose that we remove from H all items e such that P ∪ {e} is infrequent. Then, for any l, T1^l ∩ H = T2^l ∩ H = ... = Tk^l ∩ H holds. For an item e and a transaction Sl, we define the weight w(e, Sl) as the number of transactions among T1^l, ..., Tk^l that include e. The frequency of P ∪ {e} is then Σ_{S ∈ {S1, ..., Sh}} w(e, S). Thus, by using the weights, we can check maximality efficiently, in time linear in the size of the reduced database.

When we merge transactions to reduce the database, the weights can be updated easily: for each item e, we take the sum of w(e, S) over all transactions S to be merged. In the same way as for the frequency counting, we can remove infrequent items from the database for the maximality check, for further reduction.

The computation time for the maximality check in LCMmax depends on the size of the database, but not on the number of previously obtained itemsets. Thus, the storage method has an advantage if the number of maximal frequent itemsets is small, but for instances with many maximal frequent itemsets, which take a long time to solve, LCMmax has the advantage.
[Figure 1 (plots omitted): CPU time (sec) versus minimum support (%) on BMS-WebView-2 for all, closed, and maximal frequent itemset mining (minimum supports 0.11% to 0.01%); the algorithms shown include afopt, fpg, lcm, mafia, dci, and patriciamine.]

Figure 1: Results 1
[Figure 2 (plots omitted): CPU time (sec) versus minimum support (%) on chess (all, closed, and maximal; minimum supports 90% to 20%) and accidents (all, closed, and maximal; minimum supports 90% to 10%); the algorithms shown include afopt, fpg, lcm, mafia, dci, and patriciamine.]

Figure 2: Results 2
from the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html): connect, chess, mushrooms, pumsb, and pumsb-star. These datasets have many transactions but few items. We call these middle density datasets. As a representative, we show the result for chess.

The third group is accidents, which differs from all the other datasets. It has a huge number of transactions but few items, and each transaction includes many items, so the dataset is very dense. We call it a very dense dataset.

On almost all instances and minimum supports, the LCM algorithms perform well. When the minimum support is large, the LCM algorithms are the fastest on all instances, because of their fast initialization. On all instances, at any minimum support, LCM outperforms the other closed itemset mining algorithms; this shows the efficiency of ppc extension.

For sparse datasets, the LCM algorithms are the fastest at any minimum support. The gain from the FP-tree is not large there, and occurrence deliver works efficiently. The performances of afopt and fp-growth are quite similar on these problems; they are the second best, 2 to 10 times slower than the LCM algorithms. For enumerating frequent closed itemsets, they take much time when the number of closed itemsets is large. Although PATRICIAMINE is as fast as fp-growth and afopt, it terminated abnormally on some instances. kDCI is slow when the number of frequent itemsets is large. MAFIA was the slowest on these instances at any minimum support.

For middle density datasets, LCM is the fastest on all instances for closed itemset mining. For all and maximal frequent itemset mining, LCMfreq and LCMmax are the fastest at large minimum supports, for any dataset. At small minimum supports, LCMfreq and LCMmax are the fastest on half of the instances; on the other instances, the results vary case by case, with each algorithm winning in some cases.

For accidents, the LCM algorithms are the fastest when the minimum support is large. For small supports, LCM (closed) is the fastest, but LCMfreq and LCMmax are slower than fp-growth. For this dataset, the gain from the FP-tree is large, with a compression ratio of up to 6, and the bitmap method is also efficient because of the density. Hence, the computation time for the frequency counting is short in the existing implementations. However, thanks to ppc extension, LCM retains an advantage for closed itemset mining; hence LCM (closed) is the fastest.

5 Conclusion

In this paper, we proposed a fast implementation of LCM for enumerating frequent closed itemsets, based on prefix preserving closure extension. We further gave implementations LCMfreq and LCMmax, obtained by modifying LCM, for enumerating all frequent itemsets and maximal frequent itemsets. We showed by computational experiments that our implementations of LCM, LCMfreq and LCMmax outperform the other algorithms on many datasets, especially sparse ones. The LCM algorithms might be sped up further by developing more efficient maximality checking algorithms, or by developing a hybrid of arrays and FP-tree-like data structures.

Acknowledgment

This research was supported by joint-research funds of the National Institute of Informatics.

References

[1] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," In Proc. VLDB '94, pp. 487–499, 1994.

[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen and A. I. Verkamo, "Fast Discovery of Association Rules," In Advances in Knowledge Discovery and Data Mining, MIT Press, pp. 307–328, 1996.

[3] R. J. Bayardo Jr., "Efficiently Mining Long Patterns from Databases," In Proc. SIGMOD '98, pp. 85–93, 1998.

[4] E. Boros, V. Gurvich, L. Khachiyan, and K. Makino, "On the Complexity of Generating Maximal Frequent and Minimal Infrequent Sets," STACS 2002, pp. 133–141, 2002.

[5] D. Burdick, M. Calimlim, J. Gehrke, "MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases," In Proc. ICDE 2001, pp. 443–452, 2001.

[6] D. Burdick, M. Calimlim, J. Flannick, J. Gehrke, and T. Yiu, "MAFIA: A Performance Study of Mining Maximal Frequent Itemsets," In Proc. IEEE ICDM'03 Workshop FIMI'03, 2003. (Available as CEUR Workshop Proc. series, Vol. 90, http://ceur-ws.org/vol-90)

[7] B. Goethals, the FIMI'03 Homepage, http://fimi.cs.helsinki.fi/, 2003.
[8] G. Grahne and J. Zhu, "Efficiently Using Prefix-trees in Mining Frequent Itemsets," In Proc. IEEE ICDM'03 Workshop FIMI'03, 2003. (Available as CEUR Workshop Proc. series, Vol. 90, http://ceur-ws.org/vol-90)

[9] J. Han, J. Pei, Y. Yin, "Mining Frequent Patterns without Candidate Generation," SIGMOD Conference 2000, pp. 1–12, 2000.

[10] R. Kohavi, C. E. Brodley, B. Frasca, L. Mason and Z. Zheng, "KDD-Cup 2000 Organizers' Report: Peeling the Onion," SIGKDD Explorations, 2(2), pp. 86–98, 2000.

[11] G. Liu, H. Lu, J. X. Yu, W. Wei, and X. Xiao, "AFOPT: An Efficient Implementation of Pattern Growth Approach," In Proc. IEEE ICDM'03 Workshop FIMI'03, 2003. (Available as CEUR Workshop Proc. series, Vol. 90, http://ceur-ws.org/vol-90)

[12] S. Orlando, C. Lucchese, P. Palmerini, R. Perego and F. Silvestri, "kDCI: a Multi-Strategy Algorithm for Mining Frequent Sets," In Proc. IEEE ICDM'03 Workshop FIMI'03, 2003. (Available as CEUR Workshop Proc. series, Vol. 90, http://ceur-ws.org/vol-90)

[13] N. Pasquier, Y. Bastide, R. Taouil, L. Lakhal, "Efficient Mining of Association Rules Using Closed Itemset Lattices," Inform. Syst., 24(1), pp. 25–46, 1999.

[14] N. Pasquier, Y. Bastide, R. Taouil, L. Lakhal, "Discovering Frequent Closed Itemsets for Association Rules," In Proc. ICDT'99, pp. 398–416, 1999.

[18] T. Uno, T. Asai, Y. Uchida, H. Arimura, "LCM: An Efficient Algorithm for Enumerating Frequent Closed Item Sets," In Proc. IEEE ICDM'03 Workshop FIMI'03, 2003. (Available as CEUR Workshop Proc. series, Vol. 90, http://ceur-ws.org/vol-90)

[19] T. Uno, T. Asai, Y. Uchida, H. Arimura, "An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases," to appear in Proc. of Discovery Science 2004, 2004.

[20] M. J. Zaki, C. Hsiao, "CHARM: An Efficient Algorithm for Closed Itemset Mining," 2nd SIAM International Conference on Data Mining (SDM'02), pp. 457–473, 2002.

[21] Z. Zheng, R. Kohavi and L. Mason, "Real World Performance of Association Rule Algorithms," KDD 2001, pp. 401–406, 2001.