P. Radhika (radha_p_cse@yahoo.co.in), R. Sambavi (sam_ram06@yahoo.co.in)
2/4 CSE, Gayathri Vidya Parishad College of Engineering,
Madhurawada, Visakhapatnam.
ABSTRACT:
In this Information Age, we are deluged by data: scientific data, medical data,
demographic data, financial data and marketing data. People have no time to look at this
data; human attention has become a precious resource. We must therefore find ways to
automatically discover and characterize the trends in it. Our capabilities for both generating
and collecting data have been increasing rapidly over the last several decades. This
explosive growth in stored data has generated an urgent need for new techniques that can
assist in transforming the vast amounts of data into useful information.
To analyze this data, we introduce the concepts and techniques of data mining, a
promising and flourishing frontier in database systems and new database applications. We
also discuss algorithms that are specially designed for data mining.
Our paper focuses on a new algorithm for mining maximal frequent item sets
from a transactional database. The search strategy of the algorithm integrates a depth first
traversal of the item set lattice with effective pruning mechanisms that significantly
improve mining performance. Our implementation of support counting combines a
vertical bitmap representation of the data with an efficient bitmap compression scheme.
Our analysis shows that MAFIA performs best when mining long item sets and
outperforms other algorithms on dense data by a factor of 3 to 30.
INTRODUCTION:
The major reason that data mining has attracted a great deal of attention in the
information industry in recent years is the wide availability of huge amounts of data
and the imminent need for turning such data into useful information and knowledge,
which can be used for applications ranging from business management, production
control and market analysis to engineering design and science exploration.
Data mining refers to extracting or mining knowledge from large amounts of data.
It is the task of discovering interesting patterns from large amounts of data where the data
can be stored in databases, data warehouses or other information repositories. It is an
interdisciplinary field merging ideas from statistics, machine learning, databases and
parallel computing. It is an essential step in the process of knowledge discovery in
databases. Data mining functionalities are used to specify the kinds of patterns to be
found in data mining tasks. Data mining tasks can be
classified into two categories: descriptive and predictive. Descriptive mining tasks
characterize the general properties of the data in the database. Predictive mining tasks
perform inference on the current data in order to make predictions. The functionalities include the
discovery of concept descriptions, associations, classification, prediction, clustering,
trend analysis, deviation analysis and similarity analysis.
Among the areas of data mining, the problem of deriving associations from data
has received a great deal of attention. Association analysis is widely used for market
basket or transaction data analysis. Here we are given a set of items and a large collection
of transactions which are subsets (baskets) of these items. The problem is to analyze
customers' buying habits by finding associations between the different items that
customers place in their shopping baskets. The discovery of such association rules helps
in the development of marketing strategies by giving insight into questions like "which
items are most frequently purchased by customers". It also helps in inventory
management, sales promotion strategies, etc. The discovery of association rules hence
depends on the discovery of frequent sets. Many algorithms have been proposed for the
efficient mining of association rules.
ASSOCIATION ANALYSIS:
Association analysis is the discovery of association rules showing attribute-value
conditions that occur frequently together in a given set of data. Association rule mining
searches for interesting relationships among items in a given data set.
ASSOCIATION RULE: Given a set of items I = {I1, I2, …, In} and a database of
transactions D = {t1, t2, …, tm} where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, an
association rule is an implication of the form X ⇒ Y where X, Y ⊂ I are sets of items
called item sets and X ∩ Y = Ø.
The support (S) of an association rule X ⇒ Y is the percentage of transactions in the
database that contain X ∪ Y. The confidence or strength (α) of an association rule
X ⇒ Y is the ratio of the number of transactions that contain X ∪ Y to the number of
transactions that contain X. The association rule problem is to identify all association
rules X⇒Y with a minimum support and confidence. These values (S, α) are given as
input to the problem. The efficiency of association rule algorithms usually is discussed
with respect to the number of scans in the database that are referred and the maximum
number of item sets that must be counted.
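For illustration, the definitions of support and confidence can be sketched in a few lines of Python over a toy set of transactions (the item names here are made up for the example):

```python
# Toy market-basket data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    """Confidence of the rule X => Y: S(X union Y) / S(X)."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)
```

Here support({"bread", "milk"}) is 2/4 = 0.5, and the rule {bread} ⇒ {milk} has confidence 0.5 / 0.75 = 2/3.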
The problem of mining association rules can be decomposed into two sub
problems:
• Find all sets of items (item sets) whose support is greater than the user specified
minimum support, S. Such item sets are called frequent item sets.
• Use the frequent item sets to generate the desired rules. For example, if ABCD and AB
are frequent item sets, then we can determine whether the rule AB ⇒ CD holds by
checking the following inequality:

S({A,B,C,D}) / S({A,B}) ≥ α

where S(X) is the support of X in T.
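The second sub-problem can be sketched as follows (a minimal illustration, not a full rule miner: `support` is assumed to be a lookup table from item set to support, and every subset of F is assumed to be present in it):

```python
from itertools import combinations

def rules_from_itemset(F, support, alpha):
    """Emit every rule X => F\\X from a frequent item set F whose
    confidence S(F) / S(X) meets the threshold alpha."""
    F = frozenset(F)
    rules = []
    for r in range(1, len(F)):                 # every proper non-empty subset X
        for X in combinations(sorted(F), r):
            X = frozenset(X)
            if support[F] / support[X] >= alpha:
                rules.append((X, F - X))
    return rules
```

For example, with S({A}) = 0.8, S({B}) = 0.6 and S({A,B}) = 0.5, only B ⇒ A (confidence ≈ 0.83) survives a threshold of α = 0.7.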
FREQUENT SET: Let T be a transactional database and S be the user-specified minimum
support. An item set X ⊆ A, where A is the set of items, is said to be a frequent item set
in T with respect to S if

S(X) ≥ S
Discovering all frequent item sets and their supports is a non-trivial problem if the
cardinality of A, the set of items, and the database T are large. The potentially frequent
item sets are called candidates, and the set of all potentially frequent item sets is called
the candidate item set.
For example, if |A| = m, the number of possible distinct item sets is 2^m. The problem is to
identify which of these are frequent in the given set of transactions. One way to achieve
this is to set up 2^m counters, one for each distinct item set, and count the support for
every item set by scanning the database once. However, this approach is impractical for
many applications where m can be more than 1000.
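The naive scheme just described can be sketched directly, and the code makes the blow-up obvious: the counter dictionary has 2^m − 1 entries, so it is only feasible for tiny m:

```python
from itertools import combinations

def count_all_itemsets(items, transactions):
    """Naive approach: one counter per distinct non-empty item set
    (2^m - 1 of them), filled in a single pass over the data."""
    counters = {}
    for r in range(1, len(items) + 1):
        for subset in combinations(sorted(items), r):
            counters[frozenset(subset)] = 0
    for t in transactions:                      # single database scan
        for itemset in counters:
            if itemset <= t:
                counters[itemset] += 1
    return counters
```

With m = 3 items there are already 7 counters; with m = 1000 there would be about 10^301, which is why the pruning properties below are essential.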
To reduce the combinatorial search space, all algorithms implement the following two
properties:
Downward Closure Property: Any subset of a frequent set is a frequent set.
Upward Closure Property: Any superset of an infrequent set is an infrequent set.
We denote the set of all frequent item sets by FI. If X is frequent and no superset of X is
frequent, we say that X is a maximal frequent item set. We denote the set of all
maximal frequent item sets by MFI. A frequent item set X is said to be closed and is
called a frequent closed item set if there does not exist any proper superset Y ⊃ X with
S(X) = S(Y).
Hence it holds that MFI ⊆ FCI ⊆ FI.
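These three sets and the containment between them can be checked on a toy database with a brute-force sketch (exponential, for illustration only):

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """FI: all non-empty item sets whose absolute count >= minsup."""
    items = sorted(set().union(*transactions))
    FI = {}
    for r in range(1, len(items) + 1):
        for c in combinations(items, r):
            s = frozenset(c)
            count = sum(1 for t in transactions if s <= t)
            if count >= minsup:
                FI[s] = count
    return FI

def closed_and_maximal(FI):
    """FCI: no proper superset with equal support; MFI: no frequent superset."""
    FCI = {X for X in FI if not any(X < Y and FI[X] == FI[Y] for Y in FI)}
    MFI = {X for X in FI if not any(X < Y for Y in FI)}
    return FCI, MFI
```

On [{a,b}, {a,b}, {a,c}] with minimum support 2, FI = {{a}, {b}, {a,b}}, FCI = {{a}, {a,b}} (since S({b}) = S({a,b})), and MFI = {{a,b}}, so MFI ⊆ FCI ⊆ FI holds as stated.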
A PRIORI ALGORITHM:
It is also called the level-wise algorithm. It is the most popular algorithm for finding all
the frequent sets. It makes use of the downward closure property. The algorithm is a
bottom-up search, moving upward level-wise in the lattice.
The basic idea of the A Priori Algorithm is to generate candidate item sets of a
particular size and then scan the database to count these to see if they are frequent.
During scan i, candidates of size i, Ci are counted. Only those candidates that are frequent
are used to generate candidates for the next pass. That is, Li (the set of frequent item sets
found during scan i) is used to generate Ci+1. An item set is considered a candidate only if all
its subsets are large. To generate candidates of size i+1, joins are made of frequent item
sets found in the previous pass. An algorithm called A Priori Gen is used to generate the
candidate item sets for each pass after the first. All singleton item sets are used as
candidates in the first pass. Here the set of frequent item sets of the previous pass Li-1 is
joined with itself to determine the candidates. After the first scan, every frequent item set
is counted with every other frequent item set.
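The level-wise search described above can be sketched in Python (a compact illustration of the join and prune steps, not a tuned implementation):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise search: candidates of size k+1 are unions of frequent
    k-item sets; a candidate is kept only if all of its k-item subsets
    are frequent (downward closure)."""
    items = sorted(set().union(*transactions))
    # Pass 1: all singletons are candidates.
    L = {frozenset([i]) for i in items
         if sum(1 for t in transactions if i in t) >= minsup}
    frequent = set(L)
    k = 1
    while L:
        # A Priori Gen: join Lk with itself, then prune by subsets.
        candidates = set()
        for a in L:
            for b in L:
                u = a | b
                if len(u) == k + 1 and all(
                        frozenset(s) in L for s in combinations(u, k)):
                    candidates.add(u)
        # One scan of the database counts all candidates of this size.
        L = {c for c in candidates
             if sum(1 for t in transactions if c <= t) >= minsup}
        frequent |= L
        k += 1
    return frequent
```

On five transactions over {a, b, c} with minimum support 3, the algorithm finds all three singletons and all three pairs frequent but prunes {a, b, c} after counting it once.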
The A Priori Algorithm traverses the search space in a pure breadth-first manner and
finds support information by explicitly generating and counting each node. When the
frequent patterns are long (more than 15 to 20 items), FI and even FCI become so large
that more traditional methods count too many item sets to be feasible. Straight
A Priori-based algorithms count all of the 2^k subsets of each k-item set they discover,
and thus do not scale well for long item sets. This approach limits the effectiveness of
look-aheads, since useful longer frequent patterns have not yet been discovered.
Recently, the merits of depth first approach have been recognized. Here we present a
new algorithm named MAFIA (A Maximal Frequent Item Set Algorithm). MAFIA uses a
vertical bitmap representation for counting and effective pruning mechanisms for
searching the item set lattice. By changing some of the pruning tools, MAFIA can also
generate all frequent item sets and closed frequent item sets, though the algorithm is
optimized for mining only maximal frequent item sets. The set of maximal frequent
item sets is the smallest representation of the data that can still be used to generate the
set FI. Once the set is generated, the support information can be easily recomputed from
the transactional database.
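The vertical bitmap idea mentioned above can be illustrated as follows (a minimal sketch using Python integers as bit vectors; the real implementation adds compression on top of this):

```python
# Horizontal transactions -> vertical bitmaps: one bit vector per item,
# with bit j set iff transaction j contains that item. The support of an
# item set is then a bitwise AND of its items' bitmaps plus a popcount.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]

def vertical_bitmaps(transactions):
    bitmaps = {}
    for j, t in enumerate(transactions):
        for item in t:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << j)
    return bitmaps

def support_count(itemset, bitmaps):
    acc = None
    for item in itemset:
        acc = bitmaps[item] if acc is None else acc & bitmaps[item]
    return bin(acc).count("1")          # popcount of the AND result
```

For the four transactions above, the bitmap for a covers transactions 0, 1 and 2, the bitmap for b covers 0, 2 and 3, and their AND gives a support count of 2 for {a, b}.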
MAFIA focuses on efficiently traversing the search space when there are long item
sets, rather than on minimizing I/O costs. In a thorough experimental evaluation, we
first quantify the effect of each individual pruning component on the performance of
MAFIA. We then demonstrate the benefits of using compression on the bit maps to speed
counting and yield large savings in computations. Finally we study the performance of
MAFIA versus other current algorithms for mining MFI. Because of our strong pruning
mechanisms, MAFIA performs best on dense data sets where large subsets are removed
from the search space.
PRELIMINARIES: In this section, we describe the conceptual framework of the item
subset lattice (Fig. 1). Assume there is a total ordering <L of the items I in the database,
i.e. a lexicographic ordering. If item i occurs before item j, we denote this by i <L j.
Fig. 2 shows the subset lattice reduced to a lexicographic subtree. The item set identifying
each node is referred to as the node's head, while the possible extensions of the node are
called the tail.
[Figs. 1 and 2: the item subset lattice over {a, b, c, d} and its reduction to a lexicographic subtree.]
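The head/tail structure of the lexicographic subtree can be enumerated with a short recursive sketch (an illustration of the concept, not part of MAFIA itself):

```python
def lexicographic_tree(head, tail):
    """Enumerate the lexicographic subtree as (head, tail) pairs:
    each child extends the head with one tail item and keeps only
    the items that come after it in the ordering."""
    yield head, tail
    for i, item in enumerate(tail):
        yield from lexicographic_tree(head + [item], tail[i + 1:])
```

Over the items {a, b, c, d} this yields 16 nodes, one per subset; for instance the node with head {a, b} has tail {c, d}.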
SEARCH SPACE PRUNING: The set of all possible solutions for a problem is
called the search space. A simple depth-first traversal is ultimately no better than a
comparable breadth-first traversal, since exactly the same search space is generated and
counted. To realize performance gains, we must prune out parts of the search space.
DYNAMIC REORDERING:
Dynamic reordering involves rearranging the children of each node by increasing
support instead of lexicographically. As the size of the tree grows, dynamic reordering helps
to trim out many branches of the search tree. The benefit of dynamically reordering the
children of each node based on support is significant. An algorithm that trims the tail to
only frequent extensions at a higher level will save a lot of computation. The order of tail
elements is an important consideration. Ordering the tail elements by increasing the
support will keep the search space as small as possible.
Dynamic reordering greatly increases the effectiveness of the pruning mechanisms.
Since PEP (parent equivalence pruning) depends on the support of each child relative to
the parent, we can move all elements for which PEP holds from the tail to the head at
once, quickly reducing the size of the tail. For both FHUT (frequent head-union-tail
pruning) and HUTMFI (checking whether the head-union-tail is already in the MFI),
ordering by increasing support yields significant savings. The infrequent extensions keep
the left side of the sub tree small. On
the right side of the sub tree, where extensions are more frequent, FHUT and HUTMFI
are more effective in trimming the search space.
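The trim-and-reorder step for a node's tail amounts to a filter and a sort (a small sketch; `counts` maps each tail item to its support count relative to the current node):

```python
def trim_and_reorder(tail, counts, minsup):
    """Drop infrequent extensions, then sort the survivors by
    increasing support so low-support branches are explored first."""
    kept = [i for i in tail if counts[i] >= minsup]
    return sorted(kept, key=lambda i: counts[i])
```

For example, with counts a:5, b:2, c:9, d:1 and minimum support 2, item d is trimmed and the remaining tail is ordered b, a, c.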
ALGORITHM: MAFIA(Current node C, MFI, Boolean isHUT)
 1. HUT = C.head ∪ C.tail
 2. if HUT is in MFI
 3.     stop searching and return
 4. count all children, use PEP to trim the tail, and reorder by increasing support
 5. for each item i in C.trimmed_tail
 6.     isHUT = whether i is the first item in the tail
 7.     Cn = C ∪ {i}
 8.     MAFIA(Cn, MFI, isHUT)
 9.     if isHUT and all extensions are frequent
10.         stop exploring the sub tree and go back up
11. if C is a leaf and C.head is not in MFI
12.     add C.head to MFI
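The pseudocode can be sketched in Python as a simplified, assumed implementation (not the authors' code): it combines the vertical bitmaps described earlier with the HUTMFI check and PEP, and orders tails by increasing support; the FHUT look-ahead is omitted for brevity.

```python
def mafia(head, tail, bitmaps, n, minsup, mfi):
    """Depth-first search over the lexicographic tree for maximal
    frequent item sets. `bitmaps` maps item -> bit vector over the
    n transactions; `mfi` accumulates the answer."""
    hut = set(head) | set(tail)
    if any(hut <= m for m in mfi):              # HUTMFI: subtree covered
        return
    head_bits = (1 << n) - 1
    for i in head:
        head_bits &= bitmaps[i]
    head_sup = bin(head_bits).count("1")
    freq = []
    for i in tail:                              # count all children
        sup = bin(head_bits & bitmaps[i]).count("1")
        if sup >= minsup:
            if sup == head_sup:
                head = head | {i}               # PEP: move i into the head
            else:
                freq.append((sup, i))
    freq.sort()                                 # reorder by increasing support
    for k, (_, i) in enumerate(freq):
        rest = [j for _, j in freq[k + 1:]]
        mafia(head | {i}, rest, bitmaps, n, minsup, mfi)
    if not freq and head and not any(head <= m for m in mfi):
        mfi.append(head)                        # leaf: head is maximal

def mine_mfi(transactions, minsup):
    n = len(transactions)
    bitmaps = {}
    for j, t in enumerate(transactions):
        for i in t:
            bitmaps[i] = bitmaps.get(i, 0) | (1 << j)
    items = [i for i in sorted(bitmaps)
             if bin(bitmaps[i]).count("1") >= minsup]
    items.sort(key=lambda i: bin(bitmaps[i]).count("1"))
    mfi = []
    mafia(frozenset(), items, bitmaps, n, minsup, mfi)
    return mfi
```

On the four transactions [{a,b,c}, {a,b}, {a,b,c}, {b,c}] with minimum support 2, item b is absorbed by PEP at the root and the single maximal frequent item set {a, b, c} is found.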