
International Journal of Engineering Trends and Technology (IJETT) Volume 4 Issue 9- Sep 2013

ISSN: 2231-5381 http://www.ijettjournal.org Page 4046


Mine High Utility Itemset using UP-Tree and
FP-Growth
N. S. JAGADEESH #1, B. JYOTHSNA *2, K. N. DHARANIDHAR #3, A. ANANTHA BIPIN *4
# Assistant Professor, Dept. of CSE, Kuppam Engineering College, Kuppam, India.


Abstract: Data mining is defined as a process that extracts
new, non-trivial, previously unknown and potentially useful
information contained in large databases. Traditional mining
techniques have focused largely on detecting the statistical
correlations between the items that are more frequent in the
transaction databases, an approach also termed frequent itemset
mining. In this paper, we propose strategies for UP-Growth from
the emerging area called utility mining, which considers not only
the frequency of the itemsets but also the utility associated
with them. The term utility refers to the importance or
usefulness of an itemset in transactions, quantified in terms
such as profit, sales or any other user preference. The objective
is to identify itemsets that have utility values above a given
utility threshold, using the pattern growth methodology to mine
the set of utility patterns.

Keywords: candidate pruning, frequent itemset, high utility
itemset, utility mining, UP-tree, FP-Growth.
I. INTRODUCTION
Over the last two decades data mining has
emerged as a significant research area. This is
primarily due to the interdisciplinary nature of the
subject and the diverse range of application
domains in which data mining based products and
techniques are being employed. These include
bioinformatics, genetics, medicine, clinical research,
education, retail and marketing research.
Data mining is the process of revealing
previously unknown and potentially useful
information from large databases. The primary goal
is to discover hidden patterns and unexpected trends in
the data. The term is frequently misused to mean
any form of large-scale data or information
processing. The actual data mining task is the
automatic or semi-automatic analysis of large
quantities of data to extract previously unknown
interesting patterns.
Data mining activities use a combination of
techniques from database technologies, statistics,
artificial intelligence and machine learning.
Discovering useful patterns hidden in a database
plays an essential role in several data mining tasks,
such as frequent pattern mining, weighted frequent
pattern mining and high utility pattern mining.
Among them, frequent pattern mining is a
fundamental research topic that has been applied to
different kinds of databases, such as transactional
databases. It is used in the analysis of customer
transactions in retail research, where it is termed
market basket analysis, and has also been used to
identify the purchase patterns of consumers.

II. LITERATURE SURVEY
Extensive studies have been conducted on mining
frequent patterns [1, 2, 3, 4, 6]. Among the problems in
frequent pattern mining, the best known are
association rule mining [1, 3, 4, 6] and sequential
pattern mining. One of the well-known algorithms
for mining association rules is Apriori [1], which
pioneered the efficient mining of association rules
from large databases. Pattern growth based
association rule mining algorithms [4, 6] such as
FP-Growth [4] were proposed afterward. It is
widely recognised that FP-Growth achieves better
performance than Apriori based algorithms since it
finds frequent itemsets without generating any
candidate itemsets and scans the database just twice.

Frequent Itemset Mining

An itemset can be defined as a non-empty set of
items. An itemset with k different items is termed
a k-itemset. For example, {bread, butter, milk} may
denote a 3-itemset in a supermarket
transaction. The notion of frequent itemsets was
introduced by Agrawal et al. [1]. Frequent itemsets
are the itemsets that appear frequently in the
transactions. The goal of frequent itemset mining is
to identify all the frequent itemsets in a transaction
dataset [6]. Frequent itemset mining plays an essential role
in the theory and practice of many important data
mining tasks, such as mining association rules [1, 2],
long patterns [5], emerging patterns and
dependency rules. It has been applied in the fields of
telecommunications [3], census analysis [6] and text
analysis.

The criterion of being frequent is expressed in
terms of the support value of an itemset. The support
value of an itemset is the percentage of transactions
that contain the itemset.

1) EXAMPLE 1:
Consider the small example of a transaction
database representing the sales data and the profit
associated with the sale of each unit of the items.
Table I represents the sales figures for three items,
Item A, B and C, over ten transactions. Each
cell gives the number of units of an item sold
in that transaction.

TABLE I
TRANSACTION DATABASE
(QUANTITY OF ITEM SOLD IN TRANSACTION)

Transaction ID   Item A   Item B   Item C
T1               2        0        1
T2               4        0        2
T3               4        1        0
T4               0        1        1
T5               5        1        2
T6               10       1        5
T7               4        0        2
T8               1        0        0
T9               3        0        0
T10              5        0        0

Table II represents the unit profit associated
with the sale of individual items.

TABLE II
UNIT PROFIT ASSOCIATED WITH ITEMS

Item Name   Unit Profit (in USD)
Item A      5
Item B      100
Item C      40

Now consider the itemset AB. Since only
3 transactions (T3, T5 and T6) out of the overall 10
contain this itemset, the support for this itemset is

Support(AB) = 3 / 10 * 100 = 30 %

Since T3 contains 4 units of item A and 1 unit of
item B, the profit earned by the sale of the
itemset AB in transaction T3 is given by

profit(AB, T3) = 4 * profit(A) + 1 * profit(B) = 4*5 + 1*100 = 120

Since AB appears in transactions T3, T5 and T6, the
total profit associated with itemset AB over the
complete set of 10 transactions is

Profit(AB) = profit(AB, T3) + profit(AB, T5) + profit(AB, T6)
           = (4*5 + 1*100) + (5*5 + 1*100) + (10*5 + 1*100)
           = 395

Similarly, we can calculate the support values for
the different itemsets and also the profit obtained by
the sale of those itemsets over all ten transactions, as
indicated in Table III.
If we consider minimum support = 40 %, then we
observe that there are 4 itemsets, A, B, C and AC,
which qualify as frequent itemsets because their
support is no less than the minimum support threshold
value. But if we consider the associated profit, we
find that out of the 4 most profitable itemsets, i.e. C,
AC, BC and ABC, only two are also frequent itemsets.
Itemsets BC and ABC are not frequent,
but they still fetch more profit than
some of the frequent itemsets like A or B. This is
inherently because of the variation in the unit profits
of the items. As we can see, one unit of item B, when
sold, fetches much more profit than one unit of
item A or item C.
TABLE III
SUPPORT AND PROFIT FOR ALL ITEMSETS

Itemset   Support (%)   Profit (USD)
A         90            190
B         40            400
C         60            520
AB        30            395
AC        50            605
BC        30            620
ABC       20            555
This example illustrates the fact that the frequent
itemset mining approach may not always satisfy a
sales manager's goal. In this case the support
measure of the itemsets reflects the statistical
correlation of items, but it does not reflect their
semantic significance, which in this example is
the associated profit.
In reality a retail business may be interested in
identifying its most valuable customers (customers
who contribute a major fraction of the profits to the
business). These are the customers who may buy
full priced items or high margin items, which may
be absent from a large number of transactions
because most customers do not buy these items
frequently.

Utility Mining

The limitations of frequent or rare itemset mining
motivated researchers to conceive a utility based
mining approach, which allows a user to
conveniently express his or her perspectives
concerning the usefulness of itemsets as utility
values and then find itemsets with utility
values higher than a threshold. In utility based
mining the term utility refers to the quantitative
representation of user preference, i.e. the utility
value of an itemset is the measurement of the
importance of that itemset from the user's perspective.
For example, if a sales analyst involved in retail
research needs to find out which itemsets
earn the maximum sales revenue for the
stores, he or she will define the utility of an itemset
as the monetary profit that the store earns by selling
each unit of that itemset.
Note that the sales analyst is not interested in
the number of transactions that contain the itemset;
he or she is concerned only with the revenue
generated collectively by all the transactions
containing the itemset. In practice the utility value
of an itemset can be profit, popularity, page-rank,
a measure of some aesthetic aspect such as beauty or
design, or some other measure of the user's preference.
Formally, an itemset S is useful to a user if it
satisfies a utility constraint, i.e. any constraint of the
form u(S) >= min_util, where u(S) is the utility value
of the itemset and min_util is a utility threshold
defined by the user [32]. In our example, if we take
the utility of an itemset to be the profit associated
with the sale of that itemset and set the utility
threshold min_util = 500, then the itemset ABC has
a utility value of 555, which means that this itemset
is of interest to the user even though its support
value is just 20%. When computing the total
utility of an itemset S, we multiply the utility values
of the individual items constituting S by
the corresponding quantities of those
items in the transactions that contain S. The
utility based mining approach can therefore be said to
measure the significance of an itemset along two
dimensions. The first dimension is the support
value of the itemset, i.e. its frequency,
and the second dimension is the semantic
significance of the itemset as measured by the user.
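
The utility computation described above can be written compactly. The notation is an assumption introduced here for illustration and does not appear in the original text: D is the transaction database, q(i, T) is the quantity of item i in transaction T, p(i) is the unit utility (here the unit profit) of item i, and S ⊆ T means that every item of S has a non-zero quantity in T. The second line applies the definition to itemset ABC, which occurs only in T5 and T6.

```latex
u(S) = \sum_{\substack{T \in D \\ S \subseteq T}} \; \sum_{i \in S} q(i, T)\, p(i)

u(ABC) = (5 \cdot 5 + 1 \cdot 100 + 2 \cdot 40) + (10 \cdot 5 + 1 \cdot 100 + 5 \cdot 40)
       = 205 + 350 = 555 \;\ge\; \mathit{min\_util} = 500
```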

III. PROPOSED METHODS

The framework of the proposed method consists of
two steps:
(1) Scan the database twice to construct a global
UP-Tree with the first two strategies (given
in Subsection III.A).
(2) Recursively generate potential high utility
itemsets (abbreviated as PHUIs) from the global
UP-Tree and local UP-Trees by UP-Growth+
with the last two strategies (given
in Subsection III.B).

A. The Proposed Data Structure: UP-Tree
To improve mining performance and avoid
scanning the original database repeatedly, we use a
compact tree structure, named UP-Tree (Utility
Pattern Tree), to maintain the information of
transactions and high utility itemsets. Two
strategies are applied to minimise the overestimated
utilities stored in the nodes of the global UP-Tree. In
the following subsections, the elements of the UP-Tree
are defined first. Next, the two strategies are introduced.

1) The Elements in UP-Tree
In a UP-Tree, each node N consists of N.name,
N.count, N.nu, N.parent, N.hlink and a set of child
nodes. N.name is the node's item name. N.count is
the node's support count. N.nu is the node's node
utility, i.e., the overestimated utility of the node.
N.parent records the parent node of N. N.hlink is a
node link which points to a node whose item name
is the same as N.name.
A table named the header table is employed to
facilitate the traversal of the UP-Tree. In the header
table, each entry records an item name, an overestimated
utility, and a link. The link points to the last
occurrence of the node which has the same item as
the entry in the UP-Tree. By following the links in
the header table and the nodes in the UP-Tree, the nodes
having the same name can be traversed efficiently.
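
The node fields and header table described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the class names UPNode, HeaderEntry and HeaderTable are hypothetical, and the way nodes are chained (each new node's hlink pointing to the previous occurrence of the same item) is our assumption about how the links are maintained.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, Optional


@dataclass(eq=False)  # compare nodes by identity; avoids recursing through links
class UPNode:
    """One UP-Tree node with the fields listed in the text."""
    name: Optional[str]                                  # N.name (None for the root)
    count: int = 0                                       # N.count: support count
    nu: int = 0                                          # N.nu: overestimated node utility
    parent: Optional["UPNode"] = field(default=None, repr=False)  # N.parent
    hlink: Optional["UPNode"] = field(default=None, repr=False)   # N.hlink
    children: Dict[str, "UPNode"] = field(default_factory=dict, repr=False)


@dataclass
class HeaderEntry:
    """One header-table entry: item name, overestimated utility and a link."""
    name: str
    utility: int = 0
    link: Optional[UPNode] = None   # last occurrence of the item in the tree


class HeaderTable:
    def __init__(self) -> None:
        self.entries: Dict[str, HeaderEntry] = {}

    def register(self, node: UPNode) -> None:
        """Record `node` as the last occurrence of its item (assumed chaining)."""
        entry = self.entries.setdefault(node.name, HeaderEntry(node.name))
        node.hlink = entry.link     # chain back to the previous occurrence
        entry.link = node

    def nodes_of(self, name: str) -> Iterator[UPNode]:
        """Yield every node named `name`, newest first, by following the links."""
        node = self.entries[name].link if name in self.entries else None
        while node is not None:
            yield node
            node = node.hlink
```

With this layout, visiting all nodes of one item costs one header lookup plus one hop per node, which is the efficient traversal the header table is meant to provide.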
In the following subsections, two strategies for
decreasing the overestimated utility of each item
during the construction of a global UP-Tree are
introduced.
2) Strategy DGU: Discarding Global
Unpromising Items
The construction of a global UP-Tree can be
performed with two scans of the original database.
In the first scan, the Transaction Utility
(abbreviated as TU) of each transaction is computed.
At the same time, the Transaction-Weighted Utility
(abbreviated as TWU) of each single item is
accumulated. By the transaction-weighted
downward closure (abbreviated as TWDC)
property, an item and its supersets cannot
be high utility itemsets if the item's TWU is less
than the minimum utility threshold. Such an item is
called an unpromising item.
Formally, an item is called a promising item if its
TWU >= min_util; otherwise, it is called an
unpromising item. Without loss of generality, the
same terms apply when an item's overestimated utility,
rather than its TWU, is compared against min_util.
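
The first scan can be sketched on the database of Example 1 as follows. The function and variable names (transaction_utility, first_scan, promising_items) are hypothetical, and the threshold of 900 in the usage note is chosen only to make one item unpromising; it is not a value from the paper.

```python
from collections import defaultdict

# Unit profits (Table II) and per-transaction quantities (Table I);
# items with quantity zero are simply omitted from a transaction.
UNIT_PROFIT = {"A": 5, "B": 100, "C": 40}
DATABASE = {
    "T1": {"A": 2, "C": 1}, "T2": {"A": 4, "C": 2}, "T3": {"A": 4, "B": 1},
    "T4": {"B": 1, "C": 1}, "T5": {"A": 5, "B": 1, "C": 2},
    "T6": {"A": 10, "B": 1, "C": 5}, "T7": {"A": 4, "C": 2},
    "T8": {"A": 1}, "T9": {"A": 3}, "T10": {"A": 5},
}


def transaction_utility(transaction):
    """TU: total utility of all items in one transaction."""
    return sum(qty * UNIT_PROFIT[item] for item, qty in transaction.items())


def first_scan(database):
    """One pass over the database: TU per transaction and TWU per item."""
    tu = {}
    twu = defaultdict(int)
    for tid, transaction in database.items():
        tu[tid] = transaction_utility(transaction)
        # Each item inherits the full utility of every transaction it occurs in.
        for item in transaction:
            twu[item] += tu[tid]
    return tu, dict(twu)


def promising_items(twu, min_util):
    """Items whose TWU is no less than min_util."""
    return {item for item, value in twu.items() if value >= min_util}
```

For this database the scan gives TWU(A) = 970, TWU(B) = 815 and TWU(C) = 945, so with an illustrative min_util of 900 item B would be unpromising, while with min_util = 500 all three items are promising.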
3) Strategy DGN: Decreasing Global Node
Utilities
Global node utilities can be decreased by discarding
the actual utilities of descendant nodes during the
construction of the global UP-Tree. By applying strategy
DGN, the utilities of the nodes that are closer to the root of a
global UP-Tree are further reduced. DGN is
especially suitable for databases containing many
long transactions. In other words, the more items
a transaction contains, the more utilities can be
discarded by DGN. In contrast, the traditional
TWU mining model is not suitable for such
databases since the more items a transaction
contains, the higher its TWU is.
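
The effect of DGN on a single transaction can be sketched as below. The sketch assumes that the transaction has already been stripped of unpromising items and sorted in a fixed global item order, root side first (descending TWU is used in the usage note); the function names are hypothetical. Under DGN, the node of the k-th item receives only the utilities of items 1 to k, i.e. the transaction utility minus the utilities of its descendants.

```python
from itertools import accumulate


def dgn_node_utilities(reorganized):
    """Node-utility contributions of one reorganized transaction under DGN.

    `reorganized` is a list of (item, utility) pairs in tree order, root side
    first. The k-th item's node is credited with the running sum of the
    utilities of items 1..k, so descendant utilities are left out.
    """
    utilities = [utility for _, utility in reorganized]
    running = accumulate(utilities)
    return [(item, nu) for (item, _), nu in zip(reorganized, running)]


def twu_node_utilities(reorganized):
    """Baseline without DGN: every node is credited with the full utility."""
    total = sum(utility for _, utility in reorganized)
    return [(item, total) for item, _ in reorganized]
```

For transaction T5 of Example 1, ordered as A, C, B with item utilities 25, 80 and 100, DGN credits the three nodes with 25, 105 and 205, whereas the baseline credits each of them with 205. The saving grows with the length of the transaction, which matches the remark about long transactions above.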

B. The Proposed Mining Method: UP-Growth+
In UP-Growth+, minimal node utilities (abbreviated
as MNUs) in each path are used to
make the estimated pruning values closer to the real
utility values of the pruned items in the database.
The MNU for each node can be acquired during the
construction of a global UP-Tree. First, we add an
element, namely N.mnu, to each node of the UP-Tree.
N.mnu is the minimal node utility of N. When N is
traced, N.mnu keeps track of the minimal value of
N.name's utility in different transactions. If N.mnu
is larger than u(N.name, Tcurrent), N.mnu is set to
u(N.name, Tcurrent).
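
The update rule for N.mnu can be sketched as follows. The class name MnuNode and the method trace are hypothetical, and starting N.mnu at infinity before the first transaction is our assumption, since the text does not state an initial value.

```python
import math
from dataclasses import dataclass


@dataclass
class MnuNode:
    """Only the fields needed to illustrate the MNU update."""
    name: str
    count: int = 0
    mnu: float = math.inf   # N.mnu; assumed to start at infinity

    def trace(self, utility_in_current_transaction):
        """Record one more transaction passing through this node."""
        self.count += 1
        # If N.mnu is larger than u(N.name, Tcurrent), lower it to that value.
        if self.mnu > utility_in_current_transaction:
            self.mnu = utility_in_current_transaction
```

For example, if a node for item A is traced by three transactions in which A has utilities 25, 50 and 20, its N.mnu ends at 20.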

Fig. 1 A block diagram of the proposed system

1) Strategy ENU: Eliminating local unpromising
items and their estimated Node Utilities from the
paths and path utilities
ENU can be regarded as a local version of DGU.
It provides a simple but useful scheme to reduce
overestimated utilities locally without an extra scan
of the original database.
2) Strategy DNN: Decreasing local Node utilities
for nodes of local UP-Tree by estimated utilities of
descendant Nodes
DNN can likewise be regarded as a local
version of the DGN strategy described earlier.
With these two strategies, the overestimated utilities of
itemsets can be locally reduced to a certain degree
without losing any actual high utility itemset.


IV. CONCLUSION
In this paper, we have presented novel strategies
for UP-Growth that utilize a tree structure for
storing essential information about frequent patterns
for mining high utility itemsets. We have utilized the
concepts of standard frequent itemset mining to
mine the complete set of frequent patterns by
means of pattern growth.
Higher efficiency in mining high utility patterns
can be realized by implementing two
important concepts. One is the construction of the
UP-tree and the other is the mining of high utility
itemsets from the UP-tree. The proposed UP-tree
based pattern mining utilizes the pattern growth
method to avoid the costly generation of a large
number of candidate sets and reduces the search
space dramatically.
REFERENCES
[1] R. Agrawal and R. Srikant, "Fast algorithms for mining
association rules," in Proc. of the 20th VLDB Conf., pp.
487-499, 1994.
[2] R. Agrawal and R. Srikant, "Mining sequential
patterns," in Proc. of the 11th Int'l Conference on Data
Engineering, pp. 3-14, Mar. 1995.
[3] J. Han and Y. Fu, "Discovery of multiple-level
association rules from large databases," in Proc. of the 21st
VLDB Conf., pp. 420-431, Sep. 1995.
[4] J. Han, J. Pei and Y. Yin, "Mining frequent patterns without
candidate generation," in Proc. of the ACM-SIGMOD
Int'l Conf. on Management of Data, pp. 1-12, 2000.
[5] V. S. Tseng, C. J. Chu and T. Liang, "Efficient mining
of temporal high utility itemsets from data streams,"
in Proc. of ACM KDD Workshop on Utility-Based Data
Mining (UBDM'06), USA, Aug. 2006.
[6] R. Martinez, N. Pasquier and C. Pasquier, "GenMiner:
mining non-redundant association rules from integrated
gene expression data and annotations," Bioinformatics,
Vol. 24, pp. 2643-2644, 2008.
[7] S. J. Yen, Y. S. Lee, C. K. Wang, C. W. Wu and L.-Y.
Ouyang, "The studies of mining frequent patterns based
on frequent pattern tree," in Proc. of the 13th PAKDD,
LNCS, Vol. 5476, pp. 232-241, 2009.




AUTHORS DESCRIPTION

N. S. Jagadeesh is currently working as an
Assistant Professor at Kuppam Engineering
College, Kuppam. He received his B.Tech
(Information Technology) and M.Tech
(Computer Science and Engineering) from
JNTU, Anantapur. His research interests
are data warehousing and mining, and software
engineering.

B. Jyothsna is currently working as an Assistant
Professor at Sir Vishveshwaraiah Institute of Science &
Technology, Madanapalle. She received her B.Tech and
M.Tech (Computer Science and Engineering)
from JNTU, Anantapur. Her research interests
are data warehousing and mining, and
software engineering.

K. N. Dharanidhar is currently working as an
Assistant Professor at Kuppam Engineering College,
Kuppam. He received his B.Tech (Information Technology)
and M.Tech (Computer Science and Engineering)
from JNTU, Anantapur. His research interests
are data warehousing and mining, and
mobile computing.

A. Anantha Bipin is currently working as an
Assistant Professor at Kuppam Engineering College,
Kuppam. He received his B.E (Computer Science and
Engineering) and M.E (Computer Science and
Engineering) from Anna University, Chennai.
His research interests are data
warehousing and mining, and networks.
