Вы находитесь на странице: 1из 4

2010 Second International Conference on Computer Modeling and Simulation

Study on Application of Apriori Algorithm in Data Mining


Yanxi Liu
School of ScienceChangchun University,Ch ina130022
liuy x.ccu@163.co m
Abstract- with the rapid de velopment of networks and
information technology, the endless information has paid
more and more attention by people. While in the pursuit of
information with high speed, the analysis and mining of the
information and rules hidden deep in the data are also paid
more emphasis. Data mining technology is to organize and
analyze the data, which can extract and discover knowledge
from the mass of data, so how to apply the data mining
technology into enterprise stock management is the focus of
this topic studied. In this paper, combining with data mining
theory, Apriori algorithm in association rule mining
algorithm is described in detail, the algorithm
implementation process is illustrated, and the improved
methods of the algorithm are discussed.
Keywords-data
algorithm

mining; information

frequently set generated from the database, strong


association rules can be directly generated. The core
algorithm as follows:
Apriori algorith m called two sub-processes which are
Apriori-gen() and subset(), Apriori-gen() process produces
a candidate, then use the Apriori property (all non-empty
subsets of frequent item sets must also be frequent) to
delete those candidates of the non-frequent subsets. Once
generated all of the candidates, we will scan the database
D, and for each transaction, use the Subset () sub
procedure to identify all the candidate subsets, and make
cumulative count for each of these candidates. Finally, all
candidates met the minimu m support form frequent item
set L.

extraction; Apriori

III A PRIORI ALGORITHM ILLUST RATION


Here by way of examp le to illustrate the Apriori
algorithm imp lementation process.

I
INT RODUCTION
Although modern computer technology and database
technology have been developed rapidly, could support
the store and quickly retrieve the grand scale databases or
data warehouses, but these techniques was only to gather
these "massive" data, and not to effectively organize and
use the knowledge hidden them, which eventually led to
todays phenomenon of "rich data, poor knowledge". The
emergence of data mining technology met people needs.
The technology involved in artificial intelligence, machine
learning, statistical analysis and other technologies, and it
makes decision analysis into a new stage. In this paper the
association rule min ing algorithm - Apriori algorith m
which is common ly used in data mining is mainly
discussed.

A Generate frequent item sets


For examp le, a transaction database as shown in table
1, in wh ich there are nine affairs, that is, |D|=9. Apriori
assumes that the items of the affair are stored by the order
of
dictionary.
Minimu m
support
threshold
minsupport=2/9=22%, minimu m support count is 2.
TABLE1.

Tid
T1
T2
T3
T4
T5
T6
T7
T8
T9

II A PRIORI ALGORITHM SUMMARY


Apriori algorith m is one of the most influential
algorithms to mine the frequent item sets of Boolean
association rules. This algorithm is an approach based on
two-stage frequency, the design of association rule mining
algorithm can be decomposed into two sub-issues:
(1) Find all the item sets with the support greater than
the min imu m support, which is called frequent item;
(2) Based on the above obtained frequent set, all the
association rules will be generated, and for each frequent
item set A, all the nonempty subset a of A will be found, if
the ratio of sup port (A)/ sup port(a) min confidence, to
generate the association rules A-a. That is, from the
frequent sets obtained in the first step, to exploit the rules
with confidence not less than minimu m confidence
minconfidence the user specified.
The realization process of the algorithm can be
described as follows: First of all, Apriori algorith m solve
the frequent sets L with items of 1, then from L to
generate the candidate sets Cz with items of 2, scan the
transaction database D, calculate the support to solve Lz,
and so on, resulting in CK, to scan D and derive LK. Once
978-0-7695-3941-6/10 $26.00 2010 IEEE
DOI 10.1109/ICCMS.2010.398

TRANSACTION DATABASE TABLE

Item list
I1,I2,I5
I2,I4
I2,I3
I1,I2,I4
I1,I3
I2,I3
I1,I3
I1,I2,I3,I5
I1,I2,I3

The following figure is candidate sets, the generation


process of frequent item sets.

C1

L1

Item sets Support count


Item set
6
{I1}
{I1} Scan
7
{I2} calculate {I2}
support
{I3}
6
{I3} count
{I4}
2
{I4}
{I5}
2
{I5}

107
111

C1
Compare
with
minimum
support
count

Item sets Support count


6
{I1}
{I2}
7
{I3}
6
{I4}
2
{I5}
2

C2

Item sets
{I1,I2}
{I1,I3}
{I1,I4}
{I1,I5}
{I2,I3}
{I2,I4}
{I2,I5}
{I3,I4}
{I3, I5}
{I4,I5}

C3
item sets
{I1,I2,I3}
{I1,I2,I5}
{I1,I3,I5}
{I2,I3,I4}
{I2,I3,I5}
{I2,I4,I5}

Scan
calcul
ate
suppo
rt
count

delete
item set
with
subset not
belong to
L2

Item sets
{I1,I2}
{I1,I3}
{I1,I4}
{I1,I5}
{I2,I3}
{I2,I4}
{I2,I5}
{I3,I4}
{I3, I5}
{I4,I5}

Support count
4
4
1
2
4
2
2
0
1
0

C3
item sets
{I1,I2,I3}
{I1,I2,I5}

Compare
L3
with
minimum item sets support count
2
support {I1,I2,I3}
2
count {I1,I2,I5}
Figure1.

C2

C3=L2L2={{I1,I2,I3},{I1,I2,I5},{I1,I3,I5},{I2,I3,I4
},{I2,I3,I5},{I2,I3,I5},{I2,I3,I5}
Item set {I1, I3, I5} has a 2-item subset {I3, I5} which
is not a L2 element, thus it is not frequent, from C3 to
remove {I1, I3, I5}, empathy delete{I2,I3,I4}, {I2,I3,I5},
{I2,I4,I5}. Thus C3= {{I1, I2, I3}, {I1, I2, I5}}. For each
item in the C3, scanning the transaction database and
calculate its support count. And then it will be co mpared
with the min imu m support count 2, to determine whether
the frequency, and to determine frequent 3-itemset L3.
5) To find frequent 4-itemsets L4.
Using L3 to generate aggregation C4 of candidate
4-itemsets and C4 = L3 L3= {{I1, I2, I3, I5}}. The
3-subset {I2, I3, I5} of item set {I1, I2, I3, I5} does not
belong to L3, thus {I1, I2, I3, I5} will be removed.
Therefore, C4=, the algorith m is ended and all the
frequent item sets are found.

L2
Compar
e with
minimu
m
support
count

scan D,
calculate
support
count

Item sets
{I1,I2}
{I1,I3}
{I1,I5}
{I2,I3}
{I2,I4}
{I2,I5}

item sets
{I1,I2,I3}
{I1,I2,I5}

C4

item sets
{I1,I2,I3,I5}

Support count
4
4
2
4
2
2

C3

support count
2
2

B Generate association rules from frequent item sets


Once the frequent item sets found from the transaction
database, the next step is to generate association rules
fro m them. That is, to produce strong association rules to
meet the minimu m support and minimu m confidence, and
the confidence of the obtained association rules can be
calculated by using the following formu la. Here the
conditional probability is calculated by using the support
for item sets.

delete
item set C4
with
subset not item sets
belong to
L3

confidence( X Y ) G ( X Y ) / G x u 100%
sup port ( X Y ) / sup port ( X ) u 100%

generation process of candidate frequent item sets

In which sup port(X Y) is the affair number which


contains item set XY; sup port(X) is the affair
number which contains item set X; based on the above
formula, the operation of forming the association rules
as follows:

The specific process described as follows:


1) Scan the database to initialize the source data, and
form candidate 1-itemsets with all the items of the
database for the aggregation.
2) To find frequent 1-itemsets L1. As the figure shows,
candidate 1-itemset C1 is an aggregation {{I1}, {I2},
{I3}, {I4}, {I5}} which is consisted of each item set. For
each item in the C1, scanning the transaction database
and calculate its support count. And then it will be
compared with the minimum support count 2, to
determine whether the frequency, the item greater than
the minimum support count will be added to the frequent
1-itemset to determine frequent 1-itemset L1.
3) To find frequent 2-itemsets L2. We should connect
frequent 1-itemsets L1 to generate aggregation C2 of
candidate 2-itemsets.
C2=L1L1={{I1,I2},{I1,I3},{I1,I4},{I1,I5},{I2,I4},{
I2,I5},{I3,I4},{I3,I5},{I4,I5}}
As all the 1-itemsets of the 2-itemsets in C2 are
included, so it is not need to delete. For each item in the
C2, scanning the transaction database and calculate its
support count. And then it will be compared with the
minimu m support count 2, to determine whether the
frequency, same as step 2, to determine frequent 2-itemset
L2.
4) To find frequent 3-itemsets L3.
Same as step 3, it is to use L2 to generate aggregation
C3 of candidate 3-itemsets.

1) For each frequent item set 1, generate all


non-empty subsets of it.
2) For
each
non-empty
subset
of
1,
if

sup port (l )
t min confidence,
sup port ( s)

then to generate

association rule s (l-s). Here minconfidence is the


minimum confidence threshold.
As the rules are directly generated through frequent
item sets, so all the item sets the association rules involved
have to meet the minimu m support threshold.
With the above data as an examp le, the generation
process of association rules will be illustrated. The
frequent item set l= {I1, I2, I5}. The fo llowing will g ive
the association rules generated according to 1. The
non-empty subsets of I are {I1, I2}, {II, IS}, {I2, I5},
{I1}, {I2}, {I5}. The following are the association rules
and their confidence obtained according to this.
(1) Il I2I5
confidence=2/4=50%
(2) Il I5 12 confidence=2/2=100%
(3)12 I5 I1
confidence=2/2=100%
(4) I2 IlI5
confidence=2/6=33.3%
(5) Il 12I5
confidence=2/7=28.6%
(6)I5I1I2
confidence=2/2=100%

112
108

If the threshold of the minimu m confidence is 70%,


then there are only rule (2), (3) and (6), because their
confidence is greater than the minimu m confidence
threshold, so they are retained as the final output. That is,
output rules: Il I2I5, 12 I5 I1, I5I1I2.

association rule min ing in large-capacity data set. Its basic


idea is: first of all, the large-capacity databases will be
logically divided into several disjoint blocks which are
used to generate locally frequent item sets, and then these
local frequent item sets are regarded as candidate global
frequent item sets to get the final global frequent item sets
through testing their support.
2) Method based on Hash.
A hash-based technology can be used to compress the
candidate k-item set Ck (k>1). For example, when
scanning each transaction in database to hash them (that is,
map) to the different barrels of the different barrels and
increase the corresponding table count. In the hash table
the 2-itemset with corresponding hash bucket count less
than the value of support must not be a frequent 2-itemset,
should be removed from the candidate item sets. This
hash-based technology can greatly compressed the k-item
set to be examined.
3) Method based on transaction compression.
Reducing the affairs needed to be scanned in latter
circle, thus one affair does not contain any frequent k-item
set can not contain any frequent (k+l) - item set.
Therefore, when this record appears, you can add a mark
to it or remove from the transaction database. Therefore,
the database scans to generate j-item set (j>k) would
obviate the need to scan and analyze these records.
4) Method based on sampling.
This method is to select a random sample S of given
database D, and then search frequent item sets in the S
rather than D. This approach sacrifices some accuracy, but
in exchange for the effectiveness.
This method only needs to scan S affairs for one time.
As the search in S instead of D, it may lose some global
frequent item sets. To reduce this possibility, the support
less than the minimu m support can be used to find out the
frequent item sets (LS) locally in S, and then the rest of
the database will be used to calculate the actual support of
each item set in LS.
5) Dynamic item set count.
Dynamic item set count technology divides the
database into blocks which will mark the starting point.
Unlike Apriori only identify new candidate before each
time of complete scan, in this deformation, new candidate
item set can be added at any start point. If all subsets of a
key set have been identified as frequently, it will be added
as a new candidate. The result algorithm needs fewer
database scans than Apriori.

IV A PRIORI ALGORITHM EVALUATION


Apriori algorith m is to gradually complete the frequent
item set discovery through a growing number of the item
elements. Firstly, to generate frequent 1-itemset L1, and
then frequent 2-itemset L2, until no longer able to expand
the number of frequent item set elements, the algorithm is
ended. In the first k-cycle, the process firstly generates the
collection of Ck of the candidate k-item, and then
generates support through scanning the database and tests
to generate frequent k-item Lk. A lgorithm is simple and
clear, has no complicated derivation, and is complicated
derivation, but there are some shortcomings difficult to
overcome.
1) Repeatedly scanning the transaction database. For
each element of each circular candidate set C*, it must be
verified whether to join the frequent item set Lk through
scanning the database. If there is a large frequent item set
contained 10 items, then it is need to scan the transaction
database at least 10 times, which will bring a great I/O
load.
2) It may generate a large candidate set, from Lk-1 to
generate k- candidate item set, and Ck is the exponential
growth, for example, 104 frequent 1-itemsets may are
likely to generate candidate 2-itemset with close 107
elements. Such a large candidate set is a challenge to
time and main memory space.
3) Adopting only support. In real life, the occurrences
of some affairs are very frequent, but some are very rare,
so for our mining there has a problem: If the minimum
support threshold set too high, although the pace has
accelerated, but the data covered is less, meaningful rules
may not be discovered; If the minimum support threshold
set too low, then a large number of rules without
practical meaning will flood the entire mining process,
which will greatly reduce the mining efficiency and the
availability of the rule. All this will mislead decision
making.
4) The fitness landscape of the algorithm is narrow.
The algorithm only considers a single Boolean
association rule mining, but in practical applications,
there might be multi-dimensional, multi-volume, and
multi-layer association rules. At this time, the algorithm
is no longer applicable, needs to improve, or even needs
to be re-designed.
In order to improve the efficiency of Apriori
algorithm, there are a series of improved algorithms.
Although these algorithms follow the above theory, but
because of the introduction of related technologies (such
as data partition, sampling, etc.), the adaptability and
efficiency of the Apriori algorithm is improved to a certain
extent.
1) Method based on data partition.
The application of data partition technology in
association rule min ing can improve the flexibility of the

CONCLUSION

In this paper, C# language and a powerful class library


.Net framework provided are used to achieve the
development and application of the Apriori association
rule mining algorith m. The algorith m is theoretically
studied in-depth, the advantages and disadvantages of it
are also pointed out, and finally its improvement methods
are discussed.
REFERENCES

[1] Savasere, E.Omiecinski, and S.Navathe, An


Efficient Algorithm for Mining Association Rules
in Large Databases, Proceedings of the 21st
International Conference on Very large Database,
113
109

1995:175-186
[2] Fayyad U M, Piatetsky-Shapiro G, Smyth p, The
KDD process for extracting useful knowledge
from volumes of data[J], Communication of the
ACM, 1996. 39 (11)
[3] D. W. Cheung, J. Han. V Ng, Maintenance of
discovered association rules in large databases: An
incremental updating technique, In proc.1996 Int.
Conf. Data Engineering
[4] E. H. Han, G Kary Pis, V Kumar, Scalable parallel
data mining for associate rules. In Proc.1997,
ACM-SIGMOD Int. Conf. Management of Data
[5] J. Han, Y Fu, Discovery of multiple-level
association rules from large databases. In
Proc.1995 Int. Conf. Very Large Data Bases

114
110