Mine High Utility Itemset using UP-Tree and FP-Growth
N. S. Jagadeesh #1, B. Jyothsna *2, K. N. Dharanidhar #3, A. Anantha Bipin *4
# Assistant Professor, Dept. of CSE, Kuppam Engineering College, Kuppam, India.
Abstract
Data mining is defined as a process that extracts new, non-trivial, previously unknown and potentially useful information from large databases. Traditional mining techniques have focused largely on detecting statistical correlations between the items that occur most frequently in transaction databases, a task known as frequent itemset mining. In this paper, we propose strategies for UP-Growth from the emerging area of utility mining, which considers not only the frequency of itemsets but also the utility associated with them. The term utility refers to the importance or usefulness of an itemset in transactions, quantified in terms such as profit, sales or any other user preference. The objective is to identify itemsets whose utility values are above a given utility threshold, using the pattern growth methodology to mine the set of high utility patterns.
Keywords
candidate pruning, frequent itemset, high utility itemset, utility mining, UP-tree, FP-Growth.
I. INTRODUCTION
Over the last two decades data mining has emerged as a significant research area. This is primarily due to the interdisciplinary nature of the subject and the diverse range of application domains in which data mining based products and techniques are being employed. These include bioinformatics, genetics, medicine, clinical research, education, retail and marketing research.
Data mining is the process of revealing previously unknown and potentially useful information from large databases. The primary goal is to discover hidden patterns and unexpected trends in the data. The term is frequently misused to mean any form of large-scale data or information processing. The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown interesting patterns. Data mining activities use a combination of techniques from database technologies, statistics, artificial intelligence and machine learning.
Discovering useful patterns hidden in a database plays an essential role in several data mining tasks, such as frequent pattern mining, weighted frequent pattern mining and high utility pattern mining. Among them, frequent pattern mining is a fundamental research topic that has been applied to different kinds of databases, such as transactional databases. It is used in the analysis of customer transactions in retail research, where it is termed market basket analysis, and it has also been used to identify the purchase patterns of consumers.
II. LITERATURE SURVEY
Extensive studies have been proposed for mining frequent patterns [1, 2, 3, 4, 6]. Among the issues of frequent pattern mining, the most famous are association rule mining [1, 3, 4, 6] and sequential pattern mining. One of the best-known algorithms for mining association rules is Apriori [1], the pioneering algorithm for efficiently mining association rules from large databases. Pattern growth based association rule mining algorithms [4, 6] such as FP-Growth [4] were proposed afterward. It is widely recognised that FP-Growth achieves better performance than Apriori based algorithms, since it finds frequent itemsets without generating any candidate itemsets and scans the database just twice.
Frequent Itemset Mining
An itemset can be defined as a non-empty set of items. An itemset with k different items is termed a k-itemset. For example, {bread, butter, milk} may denote a 3-itemset in a supermarket transaction. The notion of frequent itemsets was introduced by Agrawal et al. [1]. Frequent itemsets are the itemsets that appear frequently in the transactions. The goal of frequent itemset mining is to identify all the frequent itemsets in a transaction dataset [6]. Frequent itemset mining plays an essential role in the theory and practice of many important data mining tasks, such as mining association rules [1, 2], long patterns [5], emerging patterns and dependency rules. It has been applied in the fields of telecommunications [3], census analysis [6] and text analysis.
International Journal of Engineering Trends and Technology (IJETT) Volume 4 Issue 9- Sep 2013 ISSN: 2231-5381 http://www.ijettjournal.org Page 4047
The criterion for being frequent is expressed in terms of the support value of an itemset. The support value of an itemset is the percentage of transactions that contain the itemset.
1) EXAMPLE 1: Consider a small transaction database representing sales data and the profit associated with the sale of each unit of the items. Table I gives the sales figures for three items (Item A, B and C) over ten transactions. Each cell gives the number of units of an item sold in that transaction.
TABLE I
TRANSACTION DATABASE
(Quantity of item sold in transaction)
Transaction ID   Item A   Item B   Item C
T1               2        0        1
T2               4        0        2
T3               4        1        0
T4               0        1        1
T5               5        1        2
T6               10       1        5
T7               4        0        2
T8               1        0        0
T9               3        0        0
T10              5        0        0
Table II gives the unit profit associated with the sale of individual items.
TABLE II
UNIT PROFIT ASSOCIATED WITH ITEMS
Item Name   Unit Profit (in USD)
Item A      5
Item B      100
Item C      40
Now consider the itemset AB. Since only 3 of the 10 transactions (T3, T5 and T6) contain this itemset, its support is
Support (AB) = 3 / 10 * 100 = 30 %
Since T3 contains 4 units of item A and 1 unit of item B, the profit earned by the sale of the itemset AB in transaction T3 is given by
Profit (AB, T3) = 4 * 5 + 1 * 100 = 120 USD
Similarly, we can calculate the support values for the different itemsets and the profit obtained from the sale of those itemsets over all ten transactions, as indicated in Table III. If we take a minimum support of 40 %, we observe that 4 itemsets (A, B, C and AC) qualify as frequent itemsets because their support is no less than the minimum support threshold. But if we consider the associated profit, we find that of the 4 most profitable itemsets (C, AC, BC and ABC) only two are also frequent. Itemsets BC and ABC are not frequent, yet they fetch more profit than some of the frequent itemsets such as A or B. This is because of the variation in the unit profits of the items. As we can see, one unit of item B fetches much more profit when sold than one unit of item A or item C.
TABLE III
SUPPORT AND PROFIT FOR ALL ITEMSETS
Itemset   Support (%)   Profit (USD)
A         90            190
B         40            400
C         60            520
AB        30            395
AC        50            605
BC        30            620
ABC       20            555
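The figures in Table III can be reproduced directly from Tables I and II. The following is a minimal Python sketch of that calculation; the names `rows`, `unit_profit`, `support_and_profit` and `table3` are introduced here for illustration only and do not come from the paper.

```python
from itertools import combinations

# Table I: units of (A, B, C) sold in each transaction.
rows = {
    "T1": (2, 0, 1), "T2": (4, 0, 2), "T3": (4, 1, 0), "T4": (0, 1, 1),
    "T5": (5, 1, 2), "T6": (10, 1, 5), "T7": (4, 0, 2), "T8": (1, 0, 0),
    "T9": (3, 0, 0), "T10": (5, 0, 0),
}
# Table II: unit profit in USD.
unit_profit = {"A": 5, "B": 100, "C": 40}

items = ("A", "B", "C")
transactions = [dict(zip(items, quantities)) for quantities in rows.values()]


def support_and_profit(itemset):
    """Return (support in percent, total profit in USD) for an itemset."""
    # A transaction contains the itemset if every item has a non-zero quantity.
    containing = [t for t in transactions if all(t[i] > 0 for i in itemset)]
    support = 100 * len(containing) / len(transactions)
    profit = sum(t[i] * unit_profit[i] for t in containing for i in itemset)
    return support, profit


# Enumerate every non-empty itemset over {A, B, C}.
table3 = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        table3["".join(combo)] = support_and_profit(combo)

for name, (support, profit) in table3.items():
    print(f"{name:4s} {support:5.0f} % {profit:5d} USD")
```

Enumerating every itemset in this way is only practical for a handful of items; it is meant to check the table, not to stand in for a mining algorithm.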
This example illustrates that the frequent itemset mining approach may not always satisfy a sales manager's goal. Here the support measure reflects the statistical correlation of items, but it does not reflect their semantic significance, which in this example is the associated profit. In reality a retail business may be interested in identifying its most valuable customers (customers who contribute a major fraction of the profits to the business). These are the customers who may buy full-priced or high-margin items, which may be absent from a large number of transactions because most customers do not buy these items frequently.
Utility Mining
The limitations of frequent or rare itemset mining motivated researchers to conceive a utility based mining approach, which allows a user to conveniently express his or her perspective on the usefulness of itemsets as utility values and then find the itemsets whose utility values are higher than a threshold. In utility based mining the term utility refers to the quantitative representation of user preference, i.e. the utility value of an itemset is a measure of the importance of that itemset from the user's perspective. For example, if a sales analyst involved in retail research needs to find out which itemsets earn the maximum sales revenue for the stores, he or she will define the utility of an itemset as the monetary profit that the store earns by selling each unit of that itemset. Note that the sales analyst is not interested in the number of transactions that contain the itemset, but only in the revenue generated collectively by all the transactions containing it. In practice the utility value of an itemset can be profit, popularity, page-rank, a measure of some aesthetic aspect such as beauty or design, or some other measure of the user's preference. Formally, an itemset S is useful to a user if it satisfies a utility constraint, i.e. any constraint of the form u(S) >= min_util, where u(S) is the utility value of the itemset and min_util is a utility threshold defined by the user [32]. In our example, if we take the utility of an itemset to be the profit associated with its sale and set the utility threshold min_util = 500, then the itemset ABC, with a utility value of 555, is of interest to the user even though its support is just 20%.
Since the total utility of an itemset S is obtained by multiplying the utility value of each item in S by the quantity of that item in every transaction that contains S, and summing the results, the utility based mining approach can be said to measure the significance of an itemset along two dimensions. The first dimension is the support of the itemset, i.e., its frequency, and the second is the semantic significance of the itemset as measured by the user.
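The calculation described above can be written compactly. The notation is an assumption introduced here for illustration, since the paper does not fix symbols: q(i, T) is the quantity of item i in transaction T, p(i) is the unit utility (here, unit profit) of item i, and D is the transaction database.

```latex
u(S, T) = \sum_{i \in S} q(i, T)\, p(i),
\qquad
u(S) = \sum_{\substack{T \in D \\ S \subseteq T}} u(S, T)
```

For the itemset ABC in the running example, only T5 and T6 contain all three items, so u(ABC) = (5 * 5 + 1 * 100 + 2 * 40) + (10 * 5 + 1 * 100 + 5 * 40) = 205 + 350 = 555, which matches the value in Table III.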
III. PROPOSED METHODS
The framework of the proposed method consists of two steps: (1) Scan the database twice to construct a global UP-Tree with the first two strategies (given in Subsection III.A). (2) Recursively generate potential high utility itemsets (abbreviated as PHUIs) from the global UP-Tree and local UP-Trees by UP-Growth+ with the last two strategies (given in Subsection III.B).
A. The Proposed Data Structure: UP-Tree
To improve mining performance and avoid scanning the original database repeatedly, we use a compact tree structure, named UP-Tree (Utility Pattern Tree), to maintain the information of transactions and high utility itemsets. Two strategies are applied to minimise the overestimated utilities stored in the nodes of the global UP-Tree. In the following subsections, the elements of the UP-Tree are first defined. Next, the two strategies are introduced.
1) The Elements in UP-Tree
In a UP-Tree, each node N consists of N.name, N.count, N.nu, N.parent, N.hlink and a set of child nodes. N.name is the node's item name. N.count is the node's support count. N.nu is the node's node utility, i.e., the overestimated utility of the node. N.parent records the parent node of N. N.hlink is a node link which points to a node whose item name is the same as N.name.
A table named the header table is employed to facilitate the traversal of the UP-Tree. In the header table, each entry records an item name, an overestimated utility, and a link. The link points to the last occurrence of the node which has the same item as the entry in the UP-Tree. By following the links in the header table and the nodes in the UP-Tree, the nodes having the same name can be traversed efficiently.
In the following subsections, two strategies for decreasing the overestimated utility of each item during the construction of a global UP-Tree are introduced.
2) Strategy DGU: Discarding Global Unpromising Items
The construction of a global UP-Tree can be performed with two scans of the original database. In the first scan, the Transaction Utility (abbreviated as TU) of each transaction is computed. At the same time, the Transaction-Weighted Utility (abbreviated as TWU) of each single item is accumulated. By the transaction-weighted downward closure (abbreviated as TWDC) property, an item and all of its supersets cannot be high utility itemsets if its TWU is less than the minimum utility threshold. An item is called a promising item if its TWU >= min_util. Otherwise, it is called an unpromising item. Without loss of generality, an item is also called a promising item if its overestimated utility is no less than min_util, and an unpromising item otherwise.
3) Strategy DGN: Decreasing Global Node Utilities
During the construction of the global UP-Tree, the global node utilities can be decreased by the actual utilities of descendant nodes.
By applying strategy DGN, the utilities of the nodes that are closer to the root of a global UP-Tree are further reduced. DGN is especially suitable for databases containing many long transactions. In other words, the more items a transaction contains, the more utilities can be discarded by DGN. By contrast, the traditional TWU mining model is not suitable for such databases, since the more items a transaction contains, the higher the TWU becomes.
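The two-scan construction with DGU and DGN can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that items within a transaction are ordered by descending TWU (ties broken by item name), that the utilities of unpromising items are simply left out of the accumulated node utilities, and that a node's utility for one transaction is the sum of the utilities of the items up to and including that node (so that the utilities of its descendants are excluded). The header table and N.hlink are omitted for brevity, the function name `build_global_up_tree` is hypothetical, and min_util = 900 is chosen only so that one item of the running example is pruned.

```python
class Node:
    """UP-Tree node; N.hlink and the header table are omitted in this sketch."""

    def __init__(self, name, parent=None):
        self.name = name
        self.count = 0        # support count
        self.nu = 0           # node utility (overestimated utility)
        self.parent = parent
        self.children = {}


def build_global_up_tree(db, profit, min_util):
    # First scan: transaction utility (TU) and transaction-weighted utility (TWU).
    tu = {tid: sum(q * profit[i] for i, q in t.items()) for tid, t in db.items()}
    twu = {}
    for tid, t in db.items():
        for item in t:
            twu[item] = twu.get(item, 0) + tu[tid]

    # DGU: only promising items (TWU >= min_util) enter the tree.
    promising = {item for item, w in twu.items() if w >= min_util}

    root = Node(None)
    # Second scan: insert each reorganized transaction into the tree.
    for t in db.values():
        # Assumed ordering: descending TWU, ties broken by item name.
        kept = sorted((i for i in t if i in promising), key=lambda i: (-twu[i], i))
        node, prefix_utility = root, 0
        for item in kept:
            # DGN: count only the utilities of items up to this node, which
            # excludes the utilities of the node's descendants on this path.
            prefix_utility += t[item] * profit[item]
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item, node)
            child.count += 1
            child.nu += prefix_utility
            node = child
    return root, tu, twu


# Tables I and II, listing only items with a non-zero quantity.
db = {
    "T1": {"A": 2, "C": 1}, "T2": {"A": 4, "C": 2}, "T3": {"A": 4, "B": 1},
    "T4": {"B": 1, "C": 1}, "T5": {"A": 5, "B": 1, "C": 2},
    "T6": {"A": 10, "B": 1, "C": 5}, "T7": {"A": 4, "C": 2},
    "T8": {"A": 1}, "T9": {"A": 3}, "T10": {"A": 5},
}
profit = {"A": 5, "B": 100, "C": 40}

root, tu, twu = build_global_up_tree(db, profit, min_util=900)
print(twu)                                   # TWU of each item
node_a = root.children["A"]
print(node_a.count, node_a.nu)               # node A directly under the root
print(node_a.children["C"].nu)               # node C under A
```

With this threshold, item B (TWU 815) is discarded and the tree holds only A and C. The node for A accumulates a utility of 190, whereas summing the whole reorganized transaction utility at every node would give 670 for the same node, which illustrates how DGN tightens the estimate for nodes near the root.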
B. The Proposed Mining Method: UP-Growth+
In UP-Growth+, the minimal node utilities (abbreviated as MNUs) in each path are used to bring the estimated pruning values closer to the real utility values of the pruned items in the database. The MNU of each node can be acquired during the construction of the global UP-Tree. First, we add an element, namely N.mnu, to each node of the UP-Tree. N.mnu is the minimal node utility of N. When N is traced, N.mnu keeps track of the minimal value of N.name's utility in different transactions. If N.mnu is larger than u(N.name, Tcurrent), N.mnu is set to u(N.name, Tcurrent).
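The N.mnu update rule can be sketched as follows. This is a minimal illustration rather than the authors' code: the class name `MnuNode` and the method `trace` are hypothetical, and initialising N.mnu to infinity is an assumption, since the text does not say how the field starts out. The example feeds in the utilities of item A from Tables I and II, assuming all nine transactions containing A pass through the same node.

```python
import math


class MnuNode:
    """Hypothetical UP-Tree node carrying only the fields needed for MNU."""

    def __init__(self, name):
        self.name = name
        self.count = 0
        self.mnu = math.inf   # assumed initial value: no utility seen yet

    def trace(self, item_utility):
        """Record one transaction passing through this node.

        item_utility plays the role of u(N.name, Tcurrent).
        """
        self.count += 1
        if self.mnu > item_utility:
            self.mnu = item_utility


node_a = MnuNode("A")
# Quantities of item A in T1, T2, T3, T5, T6, T7, T8, T9, T10 (Table I).
for quantity in (2, 4, 4, 5, 10, 4, 1, 3, 5):
    node_a.trace(quantity * 5)    # unit profit of item A is 5 USD (Table II)

print(node_a.count, node_a.mnu)   # prints: 9 5
```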
Fig. 1 A block diagram of the proposed system
1) Strategy ENU: Eliminating local unpromising items and their estimated Node Utilities from the paths and path utilities
ENU can be regarded as a local version of DGU. It provides a simple but useful scheme to reduce overestimated utilities locally without an extra scan of the original database.
2) Strategy DNN: Decreasing local Node utilities for nodes of local UP-Tree by estimated utilities of descendant Nodes
DNN can likewise be regarded as a local version of DGN, described in Subsection III.A. With these two strategies, the overestimated utilities of itemsets can be reduced locally to a certain degree without losing any actual high utility itemset.
IV. CONCLUSION
In this paper, we have presented strategies for UP-Growth that use a tree structure to store essential information about frequent patterns for mining high utility itemsets. We have used concepts from standard frequent itemset mining to mine the complete set of frequent patterns by means of pattern growth. Higher efficiency in mining high utility patterns can be realized by implementing two important concepts: the construction of the UP-tree, and the mining of utility itemsets from the UP-tree. The proposed UP-tree based pattern mining uses the pattern growth method to avoid the costly generation of a large number of candidate sets and reduces the search space dramatically.
REFERENCES
[1] R. Agrawal and R. Srikant, Fast algorithms for mining association rules, in Proc. of the 20th VLDB Conf., pp. 487-499, 1994.
[2] R. Agrawal and R. Srikant, Mining Sequential Patterns, in Proc. of the 11th Int'l Conference on Data Engineering, pp. 3-14, Mar. 1995.
[3] J. Han and Y. Fu, Discovery of multiple-level association rules from large databases, in Proc. of the 21st VLDB Conf., pp. 420-431, Sep. 1995.
[4] J. Han, J. Pei and Y. Yin, Mining frequent patterns without candidate generation, in Proc. of the ACM-SIGMOD Int'l Conf. on Management of Data, pp. 1-12, 2000.
[5] V. S. Tseng, C. J. Chu and T. Liang, Efficient Mining of Temporal High Utility Itemsets from Data Streams, in Proc. of ACM KDD Workshop on Utility-Based Data Mining (UBDM'06), USA, Aug. 2006.
[6] R. Martinez, N. Pasquier and C. Pasquier, GenMiner: mining non-redundant association rules from integrated gene expression data and annotations, Bioinformatics, Vol. 24, pp. 2643-2644, 2008.
[7] S. J. Yen, Y. S. Lee, C. K. Wang, C. W. Wu and L.-Y. Ouyang, The studies of mining frequent patterns based on frequent pattern tree, in Proc. of the 13th PAKDD, LNCS, Vol. 5476, pp. 232-241, 2009.
AUTHORS DESCRIPTION
N. S. Jagadeesh is currently working as an Assistant Professor at Kuppam Engineering College, Kuppam. He received his B.Tech (Information Technology) and M.Tech (Computer Science and Engineering) from JNTU, Anantapur. His research interests are Data Warehousing and Mining, and Software Engineering.
B. Jyothsna is currently working as an Assistant Professor at Sir Vishveshwaraiah Institute of Science & Technology, Madanapalle. She received her B.Tech and M.Tech (Computer Science and Engineering) from JNTU, Anantapur. Her research interests are Data Warehousing and Mining, and Software Engineering.
K. N. Dharanidhar is currently working as an Assistant Professor at Kuppam Engineering College, Kuppam. He received his B.Tech (Information Technology) and M.Tech (Computer Science and Engineering) from JNTU, Anantapur. His research interests are Data Warehousing and Mining, and Mobile Computing.
A. Anantha Bipin is currently working as an Assistant Professor at Kuppam Engineering College, Kuppam. He received his B.E (Computer Science and Engineering) and M.E (Computer Science and Engineering) from Anna University, Chennai. His research interests are Data Warehousing and Mining, and Networks.