
International Journal of Computational Intelligence and Information Security, August 2011, Vol. 2, No. 8

A Study & Emergence of Similarity-Profiled Temporal Association Mining


N. Prasanna Balaji, Professor & HOD, Gurunank Engineering College
G. Ravi Kumar, M.Tech, Gurunank Engineering College
U. Sreenivasulu, Asst. Professor, Gurunank Engineering College
Anjaneyulu N., M.Tech, Gurunank Engineering College

Abstract
A system that supports weather- and market-based sales planning of future activities provides a graphical user interface through which users can order series of supermarket items, while an administrator collects the series of customer orders over time together with weather report patterns. Such a similarity-profiled temporal association pattern can reveal interesting relationships among data items that co-occur with a particular event over time. Most work in temporal association mining has focused on capturing special temporal regulation patterns such as cyclic patterns and calendar scheme-based patterns; similarity-profiled mining, in contrast, is flexible in representing interesting temporal patterns through a user-defined reference sequence. The dissimilarity between the sequence of support values of an item set and the reference sequence captures how well the item set's temporal prevalence variation matches the reference pattern. By exploiting properties such as an envelope of the support time sequence and a lower bounding distance for early pruning of candidate item sets, we develop an algorithm for effectively mining similarity-profiled temporal association patterns.

Keywords: data mining, temporal sequence patterns, association rules, contrast set mining, generalized association rules

SECTION I: INTRODUCTION

Association rule mining is one of the most important and well-researched techniques of data mining [1]. It aims to extract interesting correlations, frequent patterns, associations, or causal structures among sets of items in transaction databases or other data repositories. Association rules are widely used in areas such as telecommunication networks, market and risk management, and inventory control. The task is to find all association rules that satisfy predefined minimum support and confidence thresholds in a given database. The problem is usually decomposed into two subproblems. The first is to find those item sets whose occurrence frequencies exceed a predefined threshold in the database; these are called frequent (or large) item sets. The second is to generate association rules from those large item sets under a minimum-confidence constraint. Suppose one of the large item sets is Lk = {I1, I2, ..., Ik}. Association rules over this item set are generated as follows: the first rule is {I1, I2, ..., Ik-1} => {Ik}, and checking its confidence determines whether it is interesting. Further rules are generated by deleting the last items of the antecedent and inserting them into the consequent, and the confidences of the new rules are checked in turn; this process iterates until the antecedent becomes empty. Since the second subproblem is relatively straightforward, most research focuses on the first. A small sketch of this rule-generation step is shown below.
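To make the rule-generation step concrete, here is a minimal Python sketch. It is not the exact procedure above (which deletes trailing items one at a time); it simply enumerates every non-empty proper consequent, which yields the same set of rules. The item names and support values are invented, and support() is assumed to be a lookup into the supports found in the first subproblem.

```python
# Hedged sketch: generate rules from one frequent itemset Lk by choosing a
# consequent and keeping the rule when its confidence meets min_conf.
from itertools import combinations

def rules_from_itemset(itemset, support, min_conf):
    """Return (antecedent, consequent, confidence) tuples for one frequent itemset."""
    items = frozenset(itemset)
    rules = []
    for r in range(1, len(items)):                 # consequent sizes 1 .. k-1
        for consequent in combinations(items, r):
            antecedent = items - set(consequent)   # shrinks as the consequent grows
            conf = support(items) / support(frozenset(antecedent))
            if conf >= min_conf:
                rules.append((antecedent, frozenset(consequent), conf))
    return rules

# toy usage with hypothetical supports
supports = {frozenset({'I1', 'I2', 'I3'}): 0.30,
            frozenset({'I1', 'I2'}): 0.40, frozenset({'I1', 'I3'}): 0.50,
            frozenset({'I2', 'I3'}): 0.35,
            frozenset({'I1'}): 0.60, frozenset({'I2'}): 0.50, frozenset({'I3'}): 0.55}
print(rules_from_itemset({'I1', 'I2', 'I3'}, supports.get, min_conf=0.7))
```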
Given a time-stamped transaction database and a user-defined reference sequence of interest over time, the goal of similarity-profiled temporal association mining is to discover all associated item sets whose prevalence variations over time are similar to the reference sequence under a dissimilarity threshold (a minimal sketch of this test is given at the end of this section). Similarity-profiled temporal association mining can reveal interesting relationships of data items that co-occur with a particular event over time. For example, the fluctuation of consumer retail sales is closely tied to changes in weather and climate [2]. While bottled water and generator sales might not show any correlation on normal days, a sales association between the two items may develop with the increasing strength of a hurricane in a particular region [2]. Retail decision makers may be interested in such an association pattern to improve their decisions about how changes in weather affect consumer needs.

Recent advances in data collection and storage technology have made it possible to collect vast amounts of data every day in many areas of business and science, for example recordings of product sales, stock exchanges, web logs, and climate measures. One major area of data mining over such data is association pattern analysis. Association rules reveal interrelationships among various data items in transactional data. Following the work of Agrawal and Srikant [3], the discovery of association rules has been extensively studied in [4], [5], [6], and [7]. In particular, [8], [9], and [10] pay attention to temporal information that is implicitly attached to transaction data, e.g., the time at which a transaction is executed, and discover association patterns that vary over time. However, most work in temporal association mining has focused on special temporal regulation patterns of associated item sets such as cyclic patterns [8] and calendar-based patterns [9]. For example, it may be found that beer and chips are sold together primarily in the evenings on weekdays.
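As an illustration of the problem statement, the following sketch computes the support time sequence of one item set over a few time slots and compares it with a reference sequence under a Euclidean distance threshold. The transactions, time slots, reference values, and threshold are invented for illustration; the actual SPAMINE algorithm adds the pruning machinery described later.

```python
# Minimal sketch: an itemset's support per time slot forms a support time
# sequence; the itemset is kept when the distance to a user-defined
# reference sequence stays within a dissimilarity threshold.
import math

def support_time_sequence(itemset, transactions_by_slot):
    """Fraction of transactions containing `itemset` in each time slot."""
    seq = []
    for slot in transactions_by_slot:
        hits = sum(1 for t in slot if itemset <= t)
        seq.append(hits / len(slot))
    return seq

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# three time slots (e.g. before / during / after a storm), toy baskets
slots = [
    [{'water'}, {'chips'}, {'water', 'generator'}, {'beer'}],
    [{'water', 'generator'}, {'water', 'generator'}, {'water'}, {'chips'}],
    [{'water'}, {'mop'}, {'chips'}, {'beer'}],
]
reference = [0.25, 0.5, 0.0]          # expected prevalence profile over time
threshold = 0.3

seq = support_time_sequence({'water', 'generator'}, slots)
print(seq, euclidean(seq, reference) <= threshold)   # [0.25, 0.5, 0.0] True
```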



SECTION II

2. Data mining interfaces support the following functions.

2.1. Classification: A classification task begins with build data (also known as training data) for which the target values (or class assignments) are known. Different classification algorithms use different techniques for finding relations between the predictor attributes' values and the target attribute's values in the build data. Decision tree rules provide model transparency, so that a business user, marketing analyst, or business analyst can understand the basis of the model's predictions and therefore be comfortable acting on them and explaining them to others. The decision tree implementation does not support nested tables, and decision tree models can be converted to XML. Naive Bayes (NB) makes predictions using Bayes' theorem, which derives the probability of a prediction from the underlying evidence: the probability of event A occurring given that event B has occurred, P(A|B), is proportional to the probability of event B occurring given that event A has occurred, multiplied by the probability of event A occurring, i.e., P(B|A)P(A). Adaptive Bayes Network (ABN) is an Oracle proprietary algorithm that provides a fast, scalable, non-parametric means of extracting predictive information from data with respect to a target attribute. (Non-parametric statistical techniques avoid assuming that the population is characterized by a family of simple distributional models, such as standard linear regression, where different members of the family are differentiated by a small set of parameters.) Support Vector Machine (SVM) is a state-of-the-art classification and regression algorithm with strong regularization properties; that is, the optimization procedure maximizes predictive accuracy while automatically avoiding overfitting of the training data. Neural networks and radial basis functions, both popular data mining techniques, have the same functional form as SVM models; however, neither of these algorithms has the well-founded theoretical approach to regularization that forms the basis of SVM.

2.2. Association rules: Association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. Piatetsky-Shapiro [11] describes analyzing and presenting strong rules discovered in databases using different measures of interestingness. Based on the concept of strong rules, Agrawal et al. [12] introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule {onions, potatoes} => {beef} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, he or she is likely to also buy beef. A toy computation of this rule's support and confidence is shown below.
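The following computation illustrates the support and confidence of the {onions, potatoes} => {beef} rule over a handful of invented baskets; it is a sketch of the definitions rather than of any particular mining tool.

```python
# Hedged sketch: support and confidence of {onions, potatoes} => {beef}
# over a toy transaction list (the baskets below are made up).

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

baskets = [
    {'onions', 'potatoes', 'beef'},
    {'onions', 'potatoes', 'beef', 'beer'},
    {'onions', 'potatoes'},
    {'milk', 'bread'},
    {'beef', 'bread'},
]
ant, cons = {'onions', 'potatoes'}, {'beef'}
print(support(ant | cons, baskets))      # 2/5 = 0.4
print(confidence(ant, cons, baskets))    # 2/3 ~= 0.67
```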
The association model is often used for market basket analysis, which attempts to discover relationships or correlations in a set of items. Market basket analysis is widely used in data analysis for direct marketing, catalog design, and other business decision-making processes. Traditionally, association models are used to discover business trends by analyzing customer transactions; however, they can also be used effectively to predict web page accesses for personalization. For example, assume that after mining the web access log, Company X discovered an association rule "A and B implies C" with 80% confidence, where A, B, and C are web page accesses. If a user has visited pages A and B, there is an 80% chance that he or she will visit page C in the same session. Page C may or may not have a direct link from A or B. This information can be used to create a dynamic link to page C from pages A or B so that the user can click through to page C directly. Such information is particularly valuable for a web server supporting an e-commerce site, where different product pages can be linked dynamically based on the customer interaction.

2.3. Clustering: A cluster is a number of similar objects grouped together. Clustering can also be defined as the organization of a data set into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure. A cluster is an aggregation of points in the test space such that the distance between any two points in the cluster is less than the distance between any point in the cluster and any point not in it. Two types of attributes are relevant to clustering: numerical and categorical. Numerical attributes have ordered values, such as the height of a person or the speed of a train; categorical attributes have unordered values, such as the kind of a drink or the brand of a car. Clustering comes in two main flavors: hierarchical and partitional (non-hierarchical).

In hierarchical clustering, the data are not partitioned into a particular set of clusters in a single step. Instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters each containing a single object [12]. Hierarchical clustering is subdivided into agglomerative methods, which proceed by a series of fusions of the n objects into groups, and divisive methods, which separate the n objects successively into finer groupings. A small agglomerative example is sketched below.
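A minimal agglomerative clustering sketch, assuming SciPy is available; the 2-D points are made up. linkage() performs the successive fusions described above, and fcluster() cuts the resulting tree into a chosen number of groups.

```python
# Hedged sketch of bottom-up (agglomerative) hierarchical clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],     # one tight group
                   [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])    # another group

Z = linkage(points, method='average')            # series of pairwise fusions
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into 2 clusters
print(labels)                                    # e.g. [1 1 1 2 2 2]
```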



2.4. Temporal sequence mining: Temporal data mining is the non-trivial extraction of implicit, previously unknown, and potentially useful information from temporal data. Its ultimate goal is to discover hidden relations between sequences and subsequences of events. Mining sequences mainly involves three steps: the representation and modeling of the data sequence in a suitable form, the definition of similarity measures between sequences, and the application of models and representations to the actual mining problems. Another approach to classifying temporal data mining problems and algorithms is that of Roddick, who uses three dimensions: data type, mining operations, and type of timing information. Although both approaches are equally valid, the first (representation, similarity, and operations) provides a more comprehensive view of the field. Depending on the nature of the event sequence, the approaches to solving the problem may be quite different. A sequence composed of a series of nominal symbols from a particular alphabet is usually called a temporal sequence, while a sequence of continuous, real-valued elements is known as a time series.

SECTION III

3. Usage of the SPAMINE and Apriori Algorithms

A similarity-profiled temporal association mining method (SPAMINE) is based on our algorithm design concept. We present the SPAMINE algorithm first, and then, for comparison, we also present an alternative method using a support-pruning scheme.

3.1. SPAMINE Algorithm:

Generate the support time sequences of single items and find similar items (Steps 1-3). All singletons (k = 1) become candidate items (C1). The SPAMINE algorithm uses the lattice-dominant database scan method to generate the support time sequences. In the first scan of the entire input data set, the supports of the singletons are computed for each time slot, and their support time sequences (S1) are generated. If the distance between a support time sequence and the given reference sequence does not exceed the given dissimilarity threshold, the corresponding singleton is added to the result set (R1).

3.2. Apriori Algorithm: The algorithm first finds all item sets that have minimum support; these are called frequent (or large) item sets. For example, {chicken, clothes, milk} is a frequent item set with sup = 3/7, and one rule derived from it is {clothes} => {milk, chicken} with sup = 3/7 and conf = 3/3. Here, a frequent item set is an item set whose support is at least minsup.

The key idea of the Apriori property is that any subset of a frequent item set must itself be a frequent item set, so candidates with an infrequent subset can be pruned. Figure 1 shows an example of frequent item sets; a small sketch of the algorithm is given below.
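A hedged sketch of the Apriori idea, using the chicken/clothes/milk example above; the basket contents are invented so that {chicken, clothes, milk} reaches support 3/7. Candidates with an infrequent subset are pruned before their supports are counted.

```python
# Hedged Apriori sketch illustrating downward closure: every subset of a
# frequent itemset must be frequent, so candidates with an infrequent
# subset are pruned before counting.
from itertools import combinations

def apriori(transactions, min_sup):
    n = len(transactions)
    items = {frozenset([i]) for t in transactions for i in t}
    freq = {}
    level = {c for c in items
             if sum(1 for t in transactions if c <= t) / n >= min_sup}
    k = 1
    while level:
        for s in level:                                  # record frequent k-itemsets
            freq[s] = sum(1 for t in transactions if s <= t) / n
        k += 1
        # join step: unions of frequent (k-1)-itemsets that have size k
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # prune step + support count
        level = {c for c in candidates
                 if all(frozenset(s) in freq for s in combinations(c, k - 1))
                 and sum(1 for t in transactions if c <= t) / n >= min_sup}
    return freq

baskets = [{'chicken', 'clothes', 'milk'}, {'chicken', 'milk'},
           {'clothes', 'milk'}, {'chicken', 'clothes', 'milk'},
           {'clothes'}, {'chicken', 'clothes', 'milk'}, {'bread'}]
print(apriori(baskets, min_sup=3 / 7))   # includes {chicken, clothes, milk}: 3/7
```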



SECTION IV

4. Types of Association mining


4.1. Contrast Set Mining: Contrast set mining is closely related to association rule mining and uses some of its terminology and notation. Contrast sets were first defined by Bay and Pazzani as conjunctions of attributes and values that differ meaningfully in their distributions across groups.

Procedure for mining contrast sets: In the contrast set mining process, the search space of potential contrast sets can be represented as a tree structure containing every possible combination of the attributes. Figure 2 shows the search tree for a data set with three categorical attributes. The empty set {} at the root of the tree represents the entire data set. The search process begins with the most general terms first, shown as Level 1 in Figure 2. Moving to Level 2, conjunctions of the attributes are examined. At this level there are only three contrast sets to examine, because, for instance, {1, 2} represents the same instances in the data set as {2, 1}; more complex combinations are examined at subsequent levels. The number of levels in the search tree is equal to the number of attributes. The number of contrast sets on each level is the number of combinations of the n attributes taken i at a time, C(n, i), where n is the number of attributes and i is the level of the tree. The total number of contrast sets is therefore the sum of C(n, i) over all levels, i.e., 2^n - 1.
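The level sizes can be checked with a few lines of Python; the three attribute names are placeholders, and the script simply enumerates the C(n, i) combinations on each level and confirms the 2^n - 1 total.

```python
# Sketch of the search-tree sizes discussed above: level i holds C(n, i)
# attribute combinations, and the whole tree holds 2**n - 1 of them.
from itertools import combinations
from math import comb

attributes = ['A1', 'A2', 'A3']          # n = 3 categorical attributes (placeholders)
n = len(attributes)

total = 0
for i in range(1, n + 1):                # level i of the search tree
    level = list(combinations(attributes, i))
    assert len(level) == comb(n, i)      # C(n, i) nodes on level i
    total += len(level)
    print(f'level {i}: {level}')

print(total, 2 ** n - 1)                 # both 7
```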

Figure 2: Example search tree for a data set with three attributes.

Figure 3 summarizes the algorithm. A search tree is constructed over all possible item sets as in Figure 2; the minimum support is applied to generate candidate item sets, filtering conditions are applied to produce candidate contrast sets, and independence tests are performed to confirm the candidate contrast sets (a toy independence test is sketched below).
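A hedged sketch of the independence-test step, using SciPy's chi-square test on a 2x2 contingency table of contrast-set membership versus group; the counts and the 0.05 significance level are invented for illustration and do not reproduce the exact test configuration of the algorithm in Figure 3.

```python
# Hedged sketch: a candidate contrast set is kept only if its frequency
# differs significantly across groups (independence is rejected).
from scipy.stats import chi2_contingency

#               group 1   group 2
observed = [[60,       20],   # transactions containing the contrast set
            [40,       80]]   # transactions not containing it

chi2, p_value, dof, expected = chi2_contingency(observed)
is_contrast_set = p_value < 0.05        # reject independence => keep the candidate
print(round(chi2, 2), round(p_value, 4), is_contrast_set)
```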



Figure 3: Procedure for mining contrast sets.

4.2. Generalized Association Rules: The generalized association rule mining problem is to mine association rules among items in a hierarchical (taxonomy) tree that satisfy minSup and minConf. An Apriori-based algorithm for mining frequent item sets in a hierarchical database leads to many database scans, and the number of candidates becomes very large. In this section, we apply the GIT-tree to mine frequent item sets in a hierarchical database and reduce the mining time. A small sketch of the underlying taxonomy idea follows.
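The GIT-tree itself is not reproduced here; the following sketch only illustrates the underlying taxonomy idea, namely that each transaction can be extended with the ancestors of its items so that an ordinary frequent-item-set miner can discover rules that mix levels of the hierarchy. The small taxonomy below is invented.

```python
# Hedged sketch of the taxonomy idea behind generalized rules (not GIT-tree).
taxonomy = {                     # child -> parent in the hierarchy tree G
    'cola': 'soft drink', 'lemonade': 'soft drink',
    'soft drink': 'beverage', 'coffee': 'beverage',
}

def ancestors(item):
    """Walk up the taxonomy from an item to the root."""
    while item in taxonomy:
        item = taxonomy[item]
        yield item

def extend(transaction):
    """Add every ancestor of every purchased item to the transaction."""
    out = set(transaction)
    for item in transaction:
        out.update(ancestors(item))
    return out

print(extend({'cola', 'coffee'}))
# {'cola', 'coffee', 'soft drink', 'beverage'}
```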

Mining frequent item sets in a hierarchical database using the GIT-tree: First, we create the minimum support table of items based on Equation 3, then create the parent-child relation table of the hierarchical tree G. Finally, the algorithm sorts the IMS in increasing order of minimum support (in order to create the set F of single items that satisfy the minimum support threshold of the smallest minimum-support item). The function ENUMERATE_GENERALIZED_FIs(Lr) builds the GIT-tree to mine generalized frequent item sets. It creates a new equivalence class Lc by considering every Y after X to form the item set X and the Tidset of X (T). If the items in X have no parent-child relation with each other, the support of X is computed and compared with its minimum support ms(X); if it satisfies ms(X), the node is added to Lc.

SECTION V

5.1. Problem definition: Nowadays, market business analysis often depends on the weather report when deciding which item sets to stock. The goal of similarity-profiled temporal association mining is to discover all associated item sets whose prevalence variations over time are similar to the reference sequence under a threshold.

5.2. Comparative study on similarity-profiled mining: Advances in data collection technology have made it possible to collect vast amounts of data every day in many areas of business and science, such as product sales, stock exchanges, web logs, and climate measures. Association rules discover interrelationships among various data items in transactional data. The closest related efforts have attempted to capture special temporal regulations of frequent association patterns, namely cyclic association rule mining and calendar-based association rule mining. Cyclic association rule mining detects periodically repetitive patterns of frequent item sets over time, while in calendar-based association rule mining a temporal pattern is defined by the set of time points where the user expects the discovered item sets to be frequent.



In contrast, the problem addressed here formulates the temporal pattern with a similarity function and a dissimilarity threshold; similarity-profiled temporal association mining can thus reveal interesting relationships of data items that co-occur with a particular event over time. Compared to the cyclic and calendar-based approaches, the SPAMINE algorithm substantially reduces the search space by pruning candidate item sets using the lower bounding distance of the bounds of the support sequences and the monotonicity property of the upper lower-bounding distance, without compromising the correctness and completeness of the mining results.

SECTION VI: CONCLUSION

The problem is formulated as mining similarity-profiled temporal association patterns, and a novel algorithm is proposed to discover them. The proposed SPAMINE algorithm substantially reduces the search space by pruning candidate item sets using the lower bounding distance of the bounds of the support sequences and the monotonicity property of the upper lower-bounding distance, without compromising the correctness and completeness of the mining results (a toy illustration of this pruning idea is sketched below). Comparative analysis on synthetic and real data sets shows that the SPAMINE algorithm is computationally efficient and can produce meaningful results from real data. However, the effect of the pruning scheme depends on the data distribution, the dissimilarity threshold, and the type of reference sequence. In the future, we plan to explore different similarity models for these temporal patterns. The current similarity model, based on an Lp norm similarity function, is somewhat rigid in finding similar temporal patterns. It may be interesting to consider not only a relaxed similarity model that catches temporal patterns showing similar trends but also phase shifts in time. For example, the sale of cleanup items such as chain saws and mops would increase after a storm rather than during the storm.
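A toy illustration of the pruning idea described above (not the exact SPAMINE envelope computation): because an item set's support in any time slot cannot exceed the smallest support of its subsets, the per-slot minimum of the subsets' support sequences is an upper bound, and the distance from the reference sequence down to that bound is a lower bound on the true distance, so a candidate whose lower-bound distance already exceeds the threshold can be pruned without counting it. All numbers are invented.

```python
# Hedged sketch of lower-bound pruning for similarity-profiled mining.
import math

def upper_bound_sequence(subset_sequences):
    """Per-time-slot minimum over the support sequences of the subsets."""
    return [min(vals) for vals in zip(*subset_sequences)]

def lower_bound_distance(reference, upper_bound):
    """Slots where even the upper bound falls short of the reference are
    guaranteed to contribute at least (ref - upper)^2 to the true distance."""
    return math.sqrt(sum(max(r - u, 0.0) ** 2
                         for r, u in zip(reference, upper_bound)))

reference = [0.2, 0.6, 0.4]
subsets   = [[0.5, 0.3, 0.6],      # support sequence of {A} (toy values)
             [0.4, 0.2, 0.5]]      # support sequence of {B}
threshold = 0.3

lb = lower_bound_distance(reference, upper_bound_sequence(subsets))
print(round(lb, 3), 'prune' if lb > threshold else 'keep for counting')  # 0.4 prune
```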

References:
[1] R. Agrawal, T. Imielinski, and A. Swami, "Database Mining: A Performance Perspective," IEEE Trans. Knowledge and Data Engineering, vol. 5, no. 6, pp. 914-925, 1993.
[2] R. Agrawal, T. Imielinski, and A. Swami, "Mining Associations between Sets of Items in Large Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data, Washington, D.C., 1993.
[3] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. Int'l Conf. Very Large Data Bases (VLDB), 1994.
[4] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation," Proc. ACM SIGMOD, 2000.
[5] J. Han and Y. Fu, "Discovery of Multi-Level Association Rules from Large Databases," Proc. Int'l Conf. Very Large Data Bases (VLDB), 1995.
[6] J. Park, M. Chen, and P. Yu, "An Effective Hash-Based Algorithm for Mining Association Rules," Proc. ACM SIGMOD, 1995.
[7] R. Srikant and R. Agrawal, "Mining Generalized Association Rules," Proc. Int'l Conf. Very Large Data Bases (VLDB), 1995.
[8] B. Ozden, S. Ramaswamy, and A. Silberschatz, "Cyclic Association Rules," Proc. IEEE Int'l Conf. Data Eng. (ICDE), 1998.
[9] Y. Li, P. Ning, X.S. Wang, and S. Jajodia, "Discovering Calendar-Based Temporal Association Rules," Data and Knowledge Eng., vol. 15, no. 2, 2003.
[10] S. Ramaswamy, S. Mahajan, and A. Silberschatz, "On the Discovery of Interesting Patterns in Association Rules," Proc. Int'l Conf. Very Large Data Bases (VLDB), 1998.
[11] G. Piatetsky-Shapiro, "Discovery, Analysis, and Presentation of Strong Rules," in G. Piatetsky-Shapiro and W.J. Frawley, eds., Knowledge Discovery in Databases, AAAI/MIT Press, Cambridge, MA, 1991.
[12] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," Proc. ACM SIGMOD Conf., pp. 207-216, 1993.

Prof. N Prasanna Balaji, Professor and Head of IT, did his B.E. in Computer Science from Bharathidasan University, completed his M.Tech in IT (part time) with distinction from Punjab University, Patiala, and is currently pursuing a Ph.D. on the topic of Enterprise Resource Planning at Kakatiya University, Warangal. He has 20+ years of experience in teaching, training, and systems computerization. Mr. Balaji has worked as Associate Professor in the CSE department at Vignan Institute of Technology & Science. At the Infosys Campus Connect programme (a two-week residential programme, December 2006) he was recognized as one of the best teachers. At the Institute of Public Enterprise (IPE) he was the ERP in-charge for Microsoft Business Solutions-Navision, and he organized a national-level conference on e-Customer Relationship Management, three Management Development Programmes on Recent Trends in Information Technology, two Management Development Programmes on Enterprise Resource Planning-Navision, and one Management Development Programme on Network Security for public sector executives. He is the co-editor of the proceedings of the national-level conference on e-Customer Relationship Management. He has published and presented papers in national-level seminars and journals. His areas of interest are Enterprise Resource Planning, Relational Database Management Design, Artificial Intelligence, Operating Systems, Mobile Computing, and Customer Relationship Management. He has guided many PG-level and engineering students. He is also a member of various professional societies: a Life Member of the Computer Society of India and the Indian Society for Technical Education, and a member of the Institute of Electrical and Electronics Engineers (IEEE) and the All India Management Association.



His present area of research is Enterprise Resource Planning. Currently he is the HOD of the Department of Information Technology at Gurunanak Engineering College, an NBA- and NAAC-accredited college located in Ibrahimpatnam, Ranga Reddy Dist., Hyderabad.

Mr. U Sreenivasulu holds a B.Tech. degree in Computer Science & Engineering from JNTU, Hyderabad. He obtained his postgraduate degree in Computer Science & Engineering in 2005 from SRM University, Chennai. Currently he is an Assistant Professor in the Department of Information Technology at Gurunanak Engineering College, an NBA- and NAAC-accredited college located in Ibrahimpatnam, Ranga Reddy Dist., Hyderabad.

G. Ravi Kumar is pursuing an M.Tech in Information Technology at Gurunank Engineering College. His areas of interest include networking, web applications, and information security; he is currently focusing on data mining.

Anjaneyulu N. is pursuing an M.Tech in IT at Gurunank Engineering College. His areas of interest include networking, mobile computing, and web applications.

