
DATA MINING

MAFIA: A MAXIMAL FREQUENT ITEM SET ALGORITHM

P. Radhika (radha_p_cse@yahoo.co.in), R. Sambavi (sam_ram06@yahoo.co.in)
2/4 CSE, Gayathri Vidya Parishad College of Engineering,
Madhurawada, Visakhapatnam.

ABSTRACT:
In this Information Age, we are deluged by data: scientific data, medical data,
demographic data, financial data and marketing data. People have no time to look at this
data; human attention has become a precious resource. We therefore need techniques that
automatically discover and characterize the trends in it. Our capabilities for both
generating and collecting data have been increasing rapidly over the last several decades,
and this explosive growth in stored data has created an urgent need for new techniques
that can assist in transforming the vast amounts of data into useful information.
To analyze this data, we introduce the concepts and techniques of data mining, a
promising and flourishing frontier in database systems and new database applications. We
also deal with algorithms that are specially designed for data mining.
Our paper focuses on a new algorithm for mining maximal frequent item sets
from a transactional database. The search strategy of the algorithm integrates a depth-first
traversal of the item set lattice with effective pruning mechanisms that significantly
improve mining performance. Our implementation of support counting combines a
vertical bitmap representation of the data with an efficient bitmap compression scheme.
Our analysis shows that MAFIA performs best when mining long item sets and
outperforms other algorithms on dense data by a factor of 3 to 30.

INTRODUCTION:
Data mining has attracted a great deal of attention in the information industry in
recent years because of the wide availability of huge amounts of data and the imminent
need for turning such data into useful information and knowledge, which can be used for
applications ranging from business management, production control and market analysis
to engineering design and science exploration.
Data mining refers to extracting, or "mining", knowledge from large amounts of data.
It is the task of discovering interesting patterns in large amounts of data, where the data
can be stored in databases, data warehouses or other information repositories. It is an
interdisciplinary field merging ideas from statistics, machine learning, databases and
parallel computing, and it is an essential step in the process of knowledge discovery in
databases. Data mining functionalities are used to specify the kinds of patterns,
representing knowledge, to be found in data mining tasks. Data mining tasks can be
classified into two categories: descriptive and predictive. Descriptive mining tasks
characterize the general properties of the data in the database, while predictive mining
tasks perform inference on the current data in order to make predictions. The
functionalities include the discovery of concept descriptions, associations, classification,
prediction, clustering, trend analysis, deviation analysis and similarity analysis.
Among the areas of data mining, the problem of deriving associations from data
has received a great deal of attention. Association analysis is widely used for market
basket, or transaction, data analysis. Here we are given a set of items and a large
collection of transactions, which are subsets (baskets) of these items. The problem is to
analyze customers' buying habits by finding associations between the different items that
customers place in their shopping baskets. The discovery of such association rules helps
in the development of marketing strategies by giving insight into questions like "which
items are most frequently purchased together by customers". It also helps in inventory
management, sales promotion strategies, etc. The discovery of association rules depends
in turn on the discovery of frequent item sets, and many algorithms have been proposed
for the efficient mining of association rules.

ASSOCIATION ANALYSIS:
Association analysis is the discovery of association rules showing attribute-value
conditions that occur frequently together in a given set of data. Association rule mining
searches for interesting relationships among items in a given data set.
ASSOCIATION RULE: Given a set of items I = {I1, I2, ..., In} and a database of
transactions D = {t1, t2, ..., tn}, where ti = {Ii1, Ii2, ..., Iik} and Iij ∈ I, an
association rule is an implication of the form X ⇒ Y, where X, Y ⊂ I are sets of items
called item sets and X ∩ Y = ∅.
The support (S) of an association rule X ⇒ Y is the percentage of transactions in the
database that contain X ∪ Y. The confidence or strength (α) of an association rule
X ⇒ Y is the ratio of the number of transactions that contain X ∪ Y to the number of
transactions that contain X. The association rule problem is to identify all association
rules X ⇒ Y with a given minimum support and confidence; these values (S, α) are
inputs to the problem. The efficiency of association rule algorithms is usually discussed
with respect to the number of database scans required and the maximum number of
item sets that must be counted.
The problem of mining association rules can be decomposed into two sub-problems:
• Find all sets of items (item sets) whose support is greater than the user-specified
minimum support, S. Such item sets are called frequent item sets.
• Use the frequent item sets to generate the desired rules. For example, if ABCD and AB
are frequent item sets, then we can determine whether the rule AB ⇒ CD holds by
checking the inequality

S({A,B,C,D}) / S({A,B}) ≥ α

where S(X) is the support of X in T.
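To make the definitions concrete, here is a minimal Python sketch (the toy transaction database and the helper names are our own illustration, not part of the paper):

```python
# A toy transaction database; every transaction is a set of items.
transactions = [
    {"A", "B", "C", "D"},
    {"A", "B", "C"},
    {"A", "B"},
    {"B", "D"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(x, y, db):
    """S(X ∪ Y) / S(X): the strength of the rule X => Y."""
    return support(x | y, db) / support(x, db)

# Does AB => CD hold at minimum confidence alpha = 0.4?
alpha = 0.4
print(confidence({"A", "B"}, {"C", "D"}, transactions) >= alpha)  # False (1/3 < 0.4)
```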
FREQUENT SET: Let T be a transactional database and S the user-specified minimum
support. An item set X ⊂ I is said to be a frequent item set in T with respect to S if
S(X) ≥ S.
Discovering all frequent item sets and their supports is a non-trivial problem when the
cardinality of I, the set of items, and the database T are large. Potentially frequent item
sets are called candidates, and the set of all potentially frequent item sets is called the
candidate item set.
For example, if |I| = m, the number of possible distinct item sets is 2^m. The problem is
to identify which of these are frequent in the given set of transactions. One way to
achieve this is to set up 2^m counters, one for each distinct item set, and count the
support of every item set by scanning the database once. However, this approach is
impractical for many applications, where m can be more than 1000.
To reduce the combinatorial search space, all algorithms exploit the following two
properties:
• Downward Closure Property: Any subset of a frequent set is frequent.
• Upward Closure Property: Any superset of an infrequent set is infrequent.
We denote the set of all frequent item sets by FI. If X is frequent and no superset of X is
frequent, we say that X is a Maximal Frequent Item set; we denote the set of all maximal
frequent item sets by MFI. A frequent item set X is said to be closed, and is called a
Frequent Closed Item set, if there does not exist any proper superset Y ⊃ X with
S(X) = S(Y).
Hence MFI ⊆ FCI ⊆ FI.
A PRIORI ALGORITHM:

It is also called the level-wise algorithm. It is the most popular algorithm for finding all
frequent sets, and it makes use of the downward closure property. The algorithm is a
bottom-up search, moving upward level by level in the lattice.
The basic idea of the A Priori algorithm is to generate candidate item sets of a
particular size and then scan the database to count them, to see if they are frequent.
During scan i, candidates of size i, Ci, are counted. Only those candidates that are
frequent are used to generate candidates for the next pass; that is, Li (the set of frequent
item sets found during scan i) is used to generate Ci+1. An item set is considered a
candidate only if all of its subsets are frequent. To generate candidates of size i+1, joins
are made of the frequent item sets found in the previous pass. An algorithm called
A Priori Gen is used to generate the candidate item sets for each pass after the first; all
singleton item sets are used as candidates in the first pass. The set of frequent item sets
of the previous pass, Li-1, is joined with itself to determine the candidates, so after the
first scan every frequent item set is joined with every other frequent item set.
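As a rough sketch of this level-wise candidate generation (our own minimal Python illustration of the join-and-prune idea; `apriori_gen` and the sample L2 are illustrative names, not the paper's code):

```python
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Generate size-k candidates by joining L(k-1) with itself,
    then prune any candidate with an infrequent (k-1)-subset
    (the downward closure property)."""
    prev = set(prev_frequent)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k and all(
                frozenset(s) in prev for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates

L2 = [frozenset({"a", "b"}), frozenset({"a", "c"}), frozenset({"b", "c"})]
print(apriori_gen(L2, 3))  # {frozenset({'a', 'b', 'c'})}
```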
The A Priori algorithm traverses the search space in pure breadth-first manner and
finds support information by explicitly generating and counting each node. When the
frequent patterns are long (more than 15 to 20 items), FI and even FCI become large, and
traditional methods must count too many item sets to remain feasible. Straight A Priori-
based algorithms count all of the 2^k subsets of each k-item set they discover, and thus
do not scale well to long item sets. This approach also limits the effectiveness of
lookaheads, since useful longer frequent patterns have not yet been discovered.
Recently, the merits of the depth-first approach have been recognized. Here we present a
new algorithm named MAFIA (Maximal Frequent Item set Algorithm). MAFIA uses a
vertical bitmap representation for counting and effective pruning mechanisms for
searching the item set lattice. By changing some of the pruning tools, MAFIA can also
generate all frequent item sets and closed frequent item sets, though the algorithm is
optimized for mining only maximal frequent item sets. The set of maximal frequent
item sets is the smallest representation of the data that can still be used to generate the
set FI; once the set is generated, the support information can easily be recomputed from
the transactional database.
MAFIA focuses on efficiently traversing the search space when there are long item
sets, rather than on minimizing I/O costs. In a thorough experimental evaluation, we
first quantify the effect of each individual pruning component on the performance of
MAFIA. We then demonstrate the benefits of using compression on the bitmaps to speed
up counting and yield large savings in computation. Finally, we study the performance of
MAFIA versus other current algorithms for mining the MFI. Because of its strong
pruning mechanisms, MAFIA performs best on dense data sets, where large portions of
the search space can be pruned away.
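The vertical bitmap idea can be sketched in a few lines of Python, using plain integers as bit vectors; this is our own minimal illustration and omits the compression scheme the paper relies on:

```python
# One bit vector per item, one bit per transaction: bit t of
# bitmap[item] is set iff transaction t contains the item.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]

bitmap = {}
for t, txn in enumerate(transactions):
    for item in txn:
        bitmap[item] = bitmap.get(item, 0) | (1 << t)

def support_count(itemset):
    """AND the per-item bitmaps together; the number of set bits
    is the support count of the item set."""
    bits = (1 << len(transactions)) - 1  # start with all transactions
    for item in itemset:
        bits &= bitmap.get(item, 0)
    return bin(bits).count("1")

print(support_count({"a", "b"}))  # 2 (transactions 0 and 2)
```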
PRELIMINARIES: In this section, we describe the conceptual framework of the item
subset lattice (Fig. 1). Assume there is a total ordering <L of the items I in the database,
e.g. lexicographic ordering; if item i occurs before item j, we denote this by i <L j.
Fig. 2 shows the subset lattice reduced to a lexicographic subset tree. The item set
identifying each node is referred to as the node's head, while the possible extensions of
the node are called its tail.
{ }

{a} {b} {c} {d}

{a,b} {a,c} {a,d} {b,c} {b,d} {c,d}

{a,b,c} {a,b,d} {a,c,d} {b,c,d}

{a,b,c,d}

Fig. 1: Subset lattice for four items

{ }

{a} {b} {c} {d}

Cut {a,b} {a,c} {a,d} {b,c} {b,d} {c,d}

{a,b,c} {a,b,d} {a,c,d} {b,c,d}

{a,b,c,d}

Fig. 2: Lexicographic subset tree for four items


For example, consider the node P whose head is {a}; its tail is the set {b,c,d}. Note that
the tail contains all items lexicographically larger than any element of the head. Here
H ∪ T is {a,b,c,d}. The problem of mining the frequent item sets from the lattice can be
viewed as finding a cut through the lattice such that all elements above the cut are
frequent item sets and all item sets below it are infrequent. For a node C, we call the
items in the tail of C the 1-extensions of C. Given a transaction T, we define the
projected transaction T(C) with respect to an item set C as follows: if C is not present in
T, then T(C) = ∅; if C is present in T, then T(C) is the set of all items present in the
transaction T that are also frequent 1-extensions of C.
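A minimal sketch of the projected-transaction definition (illustrative names; `frequent_extensions` is assumed to already hold the frequent 1-extensions of C):

```python
def project(txn, c_head, frequent_extensions):
    """T(C): empty if C is not contained in T, otherwise the items
    of T that are also frequent 1-extensions of C."""
    if not c_head <= txn:
        return set()
    return txn & frequent_extensions

print(project({"a", "b", "d"}, {"a"}, {"b", "c", "d"}))  # {'b', 'd'}
```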
MAFIA ALGORITHM: In this section, various pruning techniques are used to reduce
the search space. First we describe a simple depth-first traversal with no pruning; we use
this algorithm to motivate the pruning and ordering improvements, and later the
effective MFI superset checking.
SIMPLE DEPTH FIRST TRAVERSAL:
Here we traverse the lexicographic tree in pure depth-first order. At each node C, each
element in the node's tail is generated and counted as a 1-extension. If the support of
{C's head} ∪ {1-extension} is less than the minimum support, we can stop by the
A Priori principle, since any item set in the subtree rooted at
{C's head} ∪ {1-extension} would be infrequent. If none of the 1-extensions of C leads
to a frequent item set, the node is a leaf.
When we reach a leaf C in the tree, we have a candidate for entry into the MFI.
However, a frequent superset of C may already have been discovered, so we need to
check whether a superset of the candidate item set C is already in the MFI. Only if no
such superset exists do we add the candidate item set C to the MFI.
ALGORITHM: Simple (Current node C, MFI)
1. For each item i in C.tail
2.     Cn = C ∪ {i}
3.     If (Cn is frequent)
4.         Simple(Cn, MFI)
5. If (C is a leaf and C.head is not in MFI)
6.     Add C.head to MFI
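The same traversal as runnable Python, reusing the `support_count` sketch from above (the names and the minimum support value are illustrative):

```python
def simple_dfs(head, tail, mfi, min_sup):
    """Pure depth-first traversal of the lexicographic tree.
    Stops extending whenever a child is infrequent (A Priori)."""
    is_leaf = True
    for i, item in enumerate(tail):
        child = head | {item}
        if support_count(child) >= min_sup:
            is_leaf = False
            # the child's tail is every item after `item` in this tail
            simple_dfs(child, tail[i + 1:], mfi, min_sup)
    if is_leaf and not any(head <= m for m in mfi):
        mfi.append(head)

mfi = []
simple_dfs(frozenset(), ["a", "b", "c"], mfi, min_sup=2)
print(mfi)  # [frozenset({'a', 'b'}), frozenset({'a', 'c'})]
```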

SEARCH SPACE PRUNING: The set of all possible solutions to a problem is called its
search space. The simple depth-first traversal is ultimately no better than a comparable
breadth-first traversal, since exactly the same search space is generated and counted. To
realize performance gains, we must prune out parts of the search space.

EFFECTIVE PRUNING TECHNIQUES:


1. PARENT EQUIVALENCE PRUNING (PEP): One method of pruning involves
comparing the transaction sets of each parent/child pair. Let X be node C's head and y
an element in C's tail. If t(X) = t(X ∪ {y}), then any transaction containing X also
contains y. This guarantees that any frequent item set Z containing X but not y has the
frequent superset Z ∪ {y}. Since we only want the maximal frequent item sets, it is not
necessary to count item sets containing X but not y. Therefore we can move item y from
the tail to the head: for node C, X becomes X ∪ {y} and the element y is removed from
C's tail. This can yield significant savings, since the subtree rooted at C no longer has to
count y as an extension for any node in the subtree.
ALGORITHM: PEP (Current node C, MFI)
1. For each item i in C.tail
2.     Cn = C ∪ {i}
3.     If (Cn.support == C.support)
4.         Move i from C.tail to C.head
5.     Else if (Cn is frequent)
6.         PEP(Cn, MFI)
7. If (C is a leaf and C.head is not in MFI)
8.     Add C.head to MFI
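With the vertical bitmap representation sketched earlier, the PEP test t(X) = t(X ∪ {y}) reduces to a bitwise check that ANDing in the item's bitmap leaves the head's bitmap unchanged (a minimal sketch reusing the earlier `bitmap`):

```python
def pep_holds(head_bits, item_bits):
    """PEP test: every transaction containing the head also contains
    the item, i.e. ANDing in the item's bitmap changes nothing."""
    return head_bits & item_bits == head_bits

# Every transaction containing {c} also contains {a}, so 'a' can be
# moved from the tail into the head at node {c}:
print(pep_holds(bitmap["c"], bitmap["a"]))  # True
```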
2. FHUT:
Another type of pruning is superset pruning. We observe that, at node C, the largest
possible frequent item set contained in the subtree rooted at C is C's H ∪ T (head union
tail). If C's H ∪ T is discovered to be frequent, we never have to explore subsets of the
H ∪ T, and thus we can prune out the entire subtree rooted at node C. We refer to this
method of pruning as FHUT (Frequent Head Union Tail) pruning; it can be computed by
exploring the leftmost branch of the subtree at each node.
ALGORITHM: FHUT (Current node C, MFI, Boolean isHUT)
1. For each item i in C.tail
2.     Cn = C ∪ {i}
3.     isHUT = whether i is the leftmost child in the tail
4.     If (Cn is frequent)
5.         FHUT(Cn, MFI, isHUT)
6. If (C is a leaf and C.head is not in MFI)
7.     Add C.head to MFI
8. If (isHUT and all extensions are frequent)
9.     Stop exploring the subtree and go back up the tree to where isHUT was
       changed to true
3. HUTMFI:
There are two methods for determining whether an item set X is frequent:
(1) Direct counting of the support of X.
(2) Checking whether a superset of X has already been declared frequent.
FHUT uses the first method. The second approach determines whether a superset of
the HUT is in the MFI. If such a superset exists, then the HUT must be frequent, and the
subtree rooted at the node can be pruned away. We call this type of superset pruning
HUTMFI. Unlike FHUT, where the leftmost branch of the subtree is explored, HUTMFI
does not expand any children to check for successful superset pruning; hence HUTMFI
is preferable to FHUT, since it counts fewer item sets.
ALGORITHM: HUTMFI (Current node C, MFI, isHUT)
1. HUT = C.head ∪ C.tail
2. If (HUT is in MFI)
3.     Stop searching and return
4. For each item i in C.tail
5.     Cn = C ∪ {i}
6.     isHUT = whether i is the leftmost child in the tail
7.     If (Cn is frequent)
8.         HUTMFI(Cn, MFI, isHUT)
9. If (C is a leaf and C.head is not in MFI)
10.     Add C.head to MFI
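The superset check at the heart of HUTMFI can be sketched as follows (illustrative only; a real implementation would index the MFI, e.g. with the LMFI technique described later, rather than scan it):

```python
def hut_passes_mfi_check(head, tail, mfi):
    """HUTMFI test: head ∪ tail is known frequent if it is a
    subset of some item set already in the MFI."""
    hut = frozenset(head) | frozenset(tail)
    return any(hut <= m for m in mfi)

print(hut_passes_mfi_check({"a"}, ["b"], [frozenset({"a", "b", "c"})]))  # True
```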

4. DYNAMIC REORDERING:
Dynamic reordering involves arranging the children of each node by increasing support
instead of lexicographically. As the tree grows, dynamic reordering helps to trim many
branches from the search tree. The benefit of dynamically reordering the children of each
node based on support is significant: an algorithm that trims the tail to only the frequent
extensions at a higher level saves a great deal of computation, and ordering the tail
elements by increasing support keeps the search space as small as possible.
Dynamic reordering also greatly increases the effectiveness of the other pruning
mechanisms. Since PEP depends on the support of each child relative to the parent, we
can move all elements for which PEP holds from the tail to the head at once, quickly
reducing the size of the tail. For both FHUT and HUTMFI, ordering by increasing
support yields significant savings: the infrequent extensions keep the left side of the
subtree small, while on the right side of the subtree, where extensions are more frequent,
FHUT and HUTMFI are more effective at trimming the search space.
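The reordering step itself is a short trim-and-sort of the tail (a sketch reusing the earlier `support_count`; the names are ours):

```python
def trim_and_reorder(head, tail, min_sup):
    """Drop infrequent 1-extensions, then sort the survivors by
    increasing support so the least frequent items come first."""
    frequent = [i for i in tail if support_count(head | {i}) >= min_sup]
    return sorted(frequent, key=lambda i: support_count(head | {i}))

print(trim_and_reorder(frozenset(), ["a", "b", "c"], 2))  # ['c', 'a', 'b']
```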
ALGORITHM: MAFIA (Current node C, MFI, Boolean isHUT)
1. HUT = C.head ∪ C.tail
2. If (HUT is in MFI)
3.     Stop searching and return
4. Count all children, use PEP to trim the tail, and reorder by increasing support
5. For each item i in C.trimmed_tail
6.     isHUT = whether i is the first item in the tail
7.     Cn = C ∪ {i}
8.     MAFIA(Cn, MFI, isHUT)
9. If (isHUT and all extensions are frequent)
10.     Stop exploring the subtree and go back up the subtree
11. If (C is a leaf and C.head is not in MFI)
12.     Add C.head to MFI

EFFECTIVE MFI SUPERSET CHECKING: In order to enumerate the exact set of
maximal frequent item sets, before adding any item set to the MFI we must check the
entire MFI to ensure that no superset of the item set has already been found. This check
is done often, and a significant performance improvement can be realized if it is done
efficiently. To this end, we adopt progressive focusing. The basic idea is that, while the
entire MFI may be large, at any given node only a fraction of the MFI consists of
possible supersets of the item set at that node.
We therefore maintain, for each node, an LMFI (Local MFI), the subset of the MFI
that is relevant when performing superset checks at the node. Initially, the LMFI of the
root is the empty set. Assume that we are examining node C and are about to recurse on
Cn, where Cn = C ∪ {y}. The LMFI for Cn consists of all item sets in the LMFI for C
that also contain the item y with which we extended C to form Cn. When the recursive
call on Cn is finished, we add to the LMFI of C all item sets that were added to the
LMFI of Cn during the call. In addition, each time we add an item set to the MFI, we
also add it to the LMFI of the node we are examining at that moment. Candidate item
sets now no longer have to perform superset checks against the whole MFI: the LMFI
contains exactly those item sets in the MFI that are supersets of the current node's item
set. Therefore, if the LMFI of a candidate node is empty, no superset will be found in the
entire MFI; conversely, if the LMFI is not empty, a superset is guaranteed to exist.
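A minimal sketch of the progressive-focusing step (illustrative; the actual implementation maintains the LMFI with sorted pointer ranges, as the pseudocode below indicates, rather than building new lists):

```python
def lmfi_for_child(parent_lmfi, new_item):
    """The child's LMFI: the item sets in the parent's LMFI that
    also contain the item used to extend the node."""
    return [m for m in parent_lmfi if new_item in m]

parent_lmfi = [frozenset("abc"), frozenset("abd"), frozenset("acd")]
print(lmfi_for_child(parent_lmfi, "b"))  # the two sets containing 'b'
```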
ALGORITHM: MAFIALMFI (Current node C, MFI, Boolean isHUT)
1. HUT = C.head ∪ C.tail
2. If (HUT is in MFI)
3.     Stop searching and return
4. Count all children, use PEP to trim the tail, and reorder by increasing support
5. For each item i in C.trimmed_tail
6.     isHUT = whether i is the first item in the tail
7.     Cn = C ∪ {i}
8.     Sort the MFI by the new item i and update the left and right pointers for Cn
9.     MAFIALMFI(Cn, MFI, isHUT)
10.     Adjust the right LMFI pointers of C for any new item sets added to the MFI
11. If (isHUT and all extensions are frequent)
12.     Stop exploring the subtree and go back up the subtree
13. If (C is a leaf and C's LMFI is empty)
14.     Add C.head to MFI
ALGORITHMIC ANALYSIS:
First we analyze each pruning component of the MAFIA algorithm. Three types of
pruning are used to trim the tree: FHUT, HUTMFI and PEP. FHUT and HUTMFI are
both forms of superset pruning and thus tend to overlap in their effectiveness at reducing
the search space. In addition, dynamic reordering significantly reduces the size of the
search space by removing infrequent items from each node's tail. Results on dense data
sets support the conclusion that MAFIA runs fastest on long item sets, where its
performance advantage is greatest.
CONCLUSIONS:
In this paper, we presented a detailed performance analysis of MAFIA. Powerful pruning
techniques such as PEP and superset checking are very beneficial in reducing the search
space. MAFIA is highly optimized for mining long item sets, and on dense data it
consistently outperforms other algorithms by a factor of 3 to 30.

