
Data Mining Lab Report

Lab 1
Apriori and FP growth


Submitted by
Redowan Mahmud
Roll: 16

Submitted to
Md. Samiullah





Tasks to perform

Implement the Apriori and FP-growth algorithms for mining frequent item sets.

Run the simulation program on some given datasets.

Derive Minimum Support vs. Time, Minimum Support vs. Memory, Dataset Size
vs. Time, and Dataset Size vs. Memory graphs for the two algorithms (a
sketch of how these measurements can be taken is given after this list).

*System configuration should be mentioned.
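
A minimal sketch of the time and memory measurement, using only the Java standard library; runMiner is a hypothetical placeholder for either algorithm's entry point, and "chess.dat" and the 0.87 support threshold are illustrative values, not the report's actual code:

    // Sketch: timing and memory measurement around one mining run.
    // runMiner is a hypothetical placeholder for the Apriori or FP-growth
    // entry point; the dataset name and threshold are illustrative.
    public class Measure {
        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            rt.gc();  // best-effort request for a clean memory baseline
            long memBefore = rt.totalMemory() - rt.freeMemory();
            long timeBefore = System.currentTimeMillis();

            runMiner("chess.dat", 0.87);

            long timeAfter = System.currentTimeMillis();
            long memAfter = rt.totalMemory() - rt.freeMemory();
            System.out.println("Time (ms):      " + (timeAfter - timeBefore));
            System.out.println("Memory (bytes): " + (memAfter - memBefore));
        }

        static void runMiner(String dataset, double minSupport) {
            // placeholder for the actual mining call
        }
    }

Each such run yields one (minimum support, time) and one (minimum support, memory) point for the corresponding graph.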


Basic Knowledge

Apriori Algorithm:

As is common in association rule mining, the algorithm is given a set of item
sets (the transactions) and attempts to find subsets that are common to at
least a minimum number C of those item sets. Apriori uses a "bottom-up"
approach, in which frequent subsets are extended one item at a time (a step
known as candidate generation) and groups of candidates are tested against the
data. The algorithm terminates when no further successful extensions are found.

Apriori uses breadth-first search and a tree structure to count candidate item
sets efficiently. It generates candidate item sets of length k from the
frequent item sets of length k-1, then prunes every candidate that has an
infrequent sub-pattern. By the downward-closure lemma, the pruned candidate set
still contains all frequent k-length item sets. Finally, it scans the
transaction database to determine which candidates are actually frequent. A
sketch of this join-and-prune step is given below.
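
A minimal sketch of the join-and-prune step, assuming item sets are kept as sorted lists of integer item ids; the method name generateCandidates and the surrounding class are illustrative, not the report's actual source:

    import java.util.*;

    // Sketch: generate candidate k-item-sets from frequent (k-1)-item-sets,
    // then prune any candidate with an infrequent (k-1)-subset
    // (downward closure).
    public class AprioriStep {
        static List<List<Integer>> generateCandidates(List<List<Integer>> frequentKMinus1) {
            Set<List<Integer>> frequent = new HashSet<>(frequentKMinus1);
            List<List<Integer>> candidates = new ArrayList<>();
            int n = frequentKMinus1.size();
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    List<Integer> a = frequentKMinus1.get(i);
                    List<Integer> b = frequentKMinus1.get(j);
                    int k = a.size();
                    // join step: merge two item sets sharing the first k-2 items
                    if (!a.subList(0, k - 1).equals(b.subList(0, k - 1))) continue;
                    List<Integer> cand = new ArrayList<>(a);
                    cand.add(b.get(k - 1));
                    Collections.sort(cand);
                    // prune step: every (k-1)-subset must itself be frequent
                    boolean keep = true;
                    for (int drop = 0; drop < cand.size(); drop++) {
                        List<Integer> sub = new ArrayList<>(cand);
                        sub.remove(drop);
                        if (!frequent.contains(sub)) { keep = false; break; }
                    }
                    if (keep) candidates.add(cand);
                }
            }
            return candidates;
        }
    }

The surviving candidates are then counted in one pass over the transaction database, and those meeting the minimum support become the frequent k-item-sets for the next round.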

Apriori, while historically significant, suffers from a number of
inefficiencies and trade-offs, which have spawned other algorithms. Candidate
generation can produce very large numbers of subsets (the algorithm loads the
candidate set with as many item sets as possible before each scan). Moreover,
bottom-up subset exploration (essentially a breadth-first traversal of the
subset lattice) finds any maximal frequent item set S only after examining all
2^|S| - 1 of its proper subsets; for example, a maximal item set of size 10
forces 2^10 - 1 = 1023 smaller item sets to be generated and counted first.



FP-Growth Algorithm:

The FP-Growth algorithm is an alternative way to find frequent item sets
without candidate generation, which improves performance. To achieve this it
uses a divide-and-conquer strategy. The core of the method is a special data
structure named the frequent-pattern tree (FP-tree), which retains the item
set association information.

In simple terms, the algorithm works as follows. First, it compresses the
input database into an FP-tree instance that represents the frequent items.
It then divides this compressed database into a set of conditional databases,
each associated with one frequent pattern, and mines each of them separately.
With this strategy, FP-Growth reduces the search cost by looking for short
patterns recursively and then concatenating them into longer frequent
patterns, which offers good selectivity. A sketch of the FP-tree construction
step follows.
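
A minimal sketch of an FP-tree node and the insertion of one transaction, assuming each transaction has already been filtered to frequent items and sorted by descending support; the class and field names are illustrative assumptions, not the report's actual code:

    import java.util.*;

    // Sketch: FP-tree node and insertion of one transaction whose items are
    // already filtered to frequent ones and sorted by descending support.
    class FPNode {
        String item;                     // item label (null for the root)
        int count;                       // transactions passing through this node
        FPNode parent;                   // link used when extracting prefix paths
        Map<String, FPNode> children = new HashMap<>();

        FPNode(String item, FPNode parent) {
            this.item = item;
            this.parent = parent;
        }

        // Insert a transaction starting at this node, sharing existing prefixes.
        void insert(List<String> transaction) {
            if (transaction.isEmpty()) return;
            String first = transaction.get(0);
            FPNode child = children.get(first);
            if (child == null) {
                child = new FPNode(first, this);
                children.put(first, child);
            }
            child.count++;
            child.insert(transaction.subList(1, transaction.size()));
        }
    }

Building the full tree then takes two database scans: one to count item frequencies, and one to insert every reordered transaction into the root node.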

In large databases it is not always possible to hold the FP-tree in main
memory. One strategy to cope with this problem is to first partition the
database into a set of smaller databases (called projected databases), and
then construct an FP-tree from each of these smaller databases.

Implementation

Both algorithms have been developed in the Java programming language, and some
built-in Java data structures are used in both source files:

Java Vector, to handle the candidate item sets (Apriori) and the projected
conditional databases (FP-growth).
Java file I/O, to read input from the datasets and write the output.
Java objects, to keep each item's name and occurrence count (a sketch follows
this list).
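
A minimal sketch of the kind of item record described in the last point, held in a java.util.Vector; the report's actual classes are not reproduced here, so these names are assumptions:

    import java.util.Vector;

    // Sketch: a record holding an item's name and its occurrence count,
    // stored in a Vector as in the implementation described above.
    class Item {
        String name;
        int occurrences;

        Item(String name) { this.name = name; this.occurrences = 1; }
    }

    class ItemTable {
        Vector<Item> items = new Vector<>();

        // Increment the count for a known item, or register a new one.
        void record(String name) {
            for (Item it : items) {
                if (it.name.equals(name)) { it.occurrences++; return; }
            }
            items.add(new Item(name));
        }
    }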

Simulation

Both programs have been run on some real-life datasets (chess, mushroom). The
outputs were cross-checked against a third-party reference, and the simulation
results were 100% correct.



Graphs

1. Dataset: Chess. Minimum Support (x) vs. Time (y). Constant: Dataset Size (3196).

2. Dataset: Chess. Minimum Support (x) vs. Memory (y). Constant: Dataset Size (3196).

3. Dataset: Chess. Dataset Size (x) vs. Time (y). Constant: Minimum Support (87%).

4. Dataset: Chess. Dataset Size (x) vs. Memory (y). Constant: Minimum Support (87%).

5. Dataset: Mushroom. Minimum Support (x) vs. Time (y). Constant: Dataset Size (8124).

6. Dataset: Mushroom. Minimum Support (x) vs. Memory (y). Constant: Dataset Size (8124).

7. Dataset: Mushroom. Dataset Size (x) vs. Time (y). Constant: Minimum Support (47%).

8. Dataset: Mushroom. Dataset Size (x) vs. Memory (y). Constant: Minimum Support (47%).


System Configuration
