
Mining Data Streams

Ruoming Jin
Department of Computer and Information Sciences, Ohio State University, Columbus, OH 43210
jinr@cis.ohio-state.edu

1 Introduction
The need to understand, in a timely fashion, the enormous amount of data being generated every day has given rise to a new data processing model: data stream processing. In this model, data arrives in the form of continuous, high-volume, fast, and time-varying streams, and the processing of such streams entails a near real-time constraint. Many important applications, ranging from network security and sensor data processing to stock analysis and climate monitoring [12, 3, 26], fit the data stream model.

Over the last decade, data mining, meaning the extraction of useful information or knowledge from large amounts of data, has become the key technique for analyzing and understanding data. Typical data mining tasks include association mining, classification, and clustering. These techniques help find interesting patterns, regularities, and anomalies in the data. However, traditional data mining techniques cannot be applied directly to data streams, because mining algorithms developed in the past target disk-resident or in-core datasets and usually make several passes over the data. Algorithms for mining data streams are allowed only one look at the data, and they have to keep pace with the arrival of new data. Furthermore, dynamic data streams pose new challenges, because their underlying distribution might be changing.

Recently, a number of new algorithms have been developed to deal with these problems. They focus on approximate one-pass algorithms [5, 6, 21, 13, 22, 4], mining over dynamic data streams [15, 24, 25, 19, 1, 9, 11], and mining changes or trends in data streams [10, 8, 7, 26, 2, 23]. However, current research has been limited in the following ways. First, memory issues for one-pass algorithms are not adequately addressed: a stream mining algorithm with high memory cost will be difficult to apply in many settings, such as sensor networks.
Second, current techniques for mining frequent itemsets from data streams are still very computation intensive, and may not keep pace with the streams if the data arrives too rapidly. Third, little research has been done to improve the stream mining process as a whole: in many applications, the benefit of speeding up a single mining task is limited if the entire process is not optimized. Our research addresses these issues in two ways: designing low-memory and computation-efficient mining algorithms for data streams, and studying appropriate system support to ease the stream mining process.

The rest of the paper is organized as follows. Section 2 introduces the thesis statement as well as the specific research problems in which we are interested. Section 3 presents some of the new techniques we have developed. Section 4 discusses some preliminary ideas of our on-going work. In Section 5, related work on mining data streams is reviewed. Finally, the paper is concluded in Section 6.

2 Thesis Statement and Research Problems


To meet the challenges of mining data streams, our current and proposed research focuses on the following issues:

The need for low-memory and computation-efficient one-pass mining algorithms. Memory requirements are critical for mining data streams: stream mining tasks usually run for a long time, many tasks may run simultaneously, and some of them might be carried out on small, handheld devices. The computation should also be efficient, given the near real-time constraints. However, this need has not been adequately addressed.

The need for new techniques to keep pace with streams. One-pass algorithms cannot guarantee that mining tasks keep pace with the arrival of the data. This is particularly true for frequent itemset mining on data streams. Though one-pass algorithms [21, 16] have been shown to be very accurate and faster than traditional multi-pass algorithms, experimental results show that they are still computationally intensive; if the data arrives too rapidly, the mining algorithms will not be able to handle it. Unfortunately, this can be the case for some high-velocity streams, such as network flow data. Therefore, new techniques are needed to increase the speed of stream mining tasks.

The need to improve the efficiency of the stream mining process. Knowledge discovery or data mining is not a single task but an iterative process. Preparing the data, deciding which algorithms are appropriate, and experimenting with different parameters can take a large amount of time. However, current research mainly focuses on speeding up or performing a single mining task, and the benefit of this work can be limited if the entire mining process is not optimized. Therefore, shortening the entire mining process is critical.

Our thesis is as follows: designing computation- and memory-efficient algorithms that provide approximate results with high accuracy and confidence, and developing system support, help mine useful information from data streams. Our current and proposed research focuses on the following key problems of mining data streams:

Designing Low-Memory and Computation-Efficient One-Pass Mining Algorithms: Many existing one-pass mining algorithms do not pay enough attention to memory and computation issues, which can limit their applicability. For example, Domingos and Hulten's VFDT [5] does not consider the processing of numerical attributes, which dominates the computation and memory cost of decision tree construction, and Manku and Motwani's frequent itemset mining algorithm [21] may not be suitable for many mining environments because it depends on a disk-based trie structure. Our research focuses on developing techniques to reduce the memory and computation costs of one-pass approximate mining. In particular, we are interested in two key mining tasks for data streams: decision tree construction and frequent itemset mining. The details of our techniques and experimental results for both tasks are explained in Section 3.

Designing Algorithms for Mining Condensed Itemsets of Streaming Data: Existing frequent itemset mining algorithms for data streams [21, 16] find a superset of all frequent itemsets. However, these algorithms have been shown to be computationally intensive, and may not work on high-velocity streams. To deal with this difficulty, we propose to mine condensed itemsets on data streams, such as the Maximal Frequent Itemsets (MFI), the Closed Frequent Itemsets (CFI), and δ-free sets. On in-core datasets, mining condensed itemsets has been shown to be much faster than mining their complete counterparts. Therefore, it is reasonable to believe that mining condensed itemsets on streaming data can significantly reduce the computational cost. Currently, we are developing techniques to mine the MFI and δ-free sets over data streams. Some preliminary ideas on mining the MFI are discussed in Section 4.

Designing System Support for Mining Data Streams with Sliding Windows: The sliding window model can be very powerful for expressing complicated mining tasks as combinations of simple queries. For instance, itemsets with a large frequency change can be found by comparing the current window, or the last few windows, with the entire time span; for example, itemsets whose frequency is higher than 0.01 in the current window but lower than 0.001 over the entire time span. However, to support this type of mining, the mining process needs different mining algorithms for different constraints and combinations, and the flexibility and power of the sliding window model can make the mining process and the mining algorithms complicated. To tackle this problem, we propose to use system support to ease the mining process, focusing on query languages, system frameworks, and query optimization for frequent itemset mining on data streams. Some initial ideas are discussed in Section 4.
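The frequency-change query above can be sketched at the item level (single items rather than itemsets, to keep the example short). This is an illustrative sketch, not part of our proposed system; the function name and thresholds are hypothetical, assuming the current window and the full stream are both available as sequences:

```python
from collections import Counter

def frequency_change_items(current_window, full_stream, hi=0.01, lo=0.001):
    """Flag items whose frequency exceeds `hi` in the current window
    but stays below `lo` over the entire time span."""
    win_counts = Counter(current_window)
    all_counts = Counter(full_stream)
    n_win, n_all = len(current_window), len(full_stream)
    return {x for x in win_counts
            if win_counts[x] / n_win > hi and all_counts[x] / n_all < lo}
```

A real stream system would answer this from window summaries rather than by re-scanning the full stream, which is exactly why the query-level system support discussed above is needed.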

3 Current Contributions
3.1 Efcient Decision Tree Construction on Streaming Data
The existing methods [5, 6] for decision tree construction over streaming data are limited in that their memory and computation costs are not optimized. To overcome this limitation, we have developed an algorithmic framework that generalizes the VFDT [5] approach by considering both categorical and numerical attributes and by allowing a predefined memory bound. Similar to VFDT, it provides the same probabilistic bound on the number of nodes that differ between the tree built on samples and the one built on the complete data.

Under this framework, we propose numerical interval pruning (NIP) to speed up the processing of numerical attributes for streaming data. Compared with traditional methods, this approach can significantly reduce memory and computation costs without losing accuracy: it always finds the exact split point with respect to the sample data. NIP works as follows: it divides the domain of each numerical attribute into intervals and builds a summary of each interval; it then prunes the intervals that cannot contain the best split point, based on the summaries; finally, it visits only the unpruned intervals to complete the computation. Our experimental results show that NIP reduces execution time by an average of 39% on the experimental datasets.

Further, we propose to use the Normal Test to decide the sample size. Unlike Domingos and Hulten's approach, which uses the Hoeffding bound [14], the Normal Test is based on the observation that the gain function computed on a sample actually converges to a normal distribution. By taking advantage of this property, the Normal Test reduces the required sample size by an average of 37% in our experiments. This work has been published as a short paper in the proceedings of SIGKDD 2003 [17].
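The intuition behind the sample-size savings can be illustrated with a back-of-envelope comparison of the two bounds. This sketch does not reproduce the paper's actual formulas; it assumes a generic mean estimate over a range R, a one-sided confidence level 1 - δ, and the worst-case standard deviation σ = R/2 for the normal approximation:

```python
import math

def hoeffding_sample_size(R, eps, delta):
    # One-sided Hoeffding bound: P(mean - mu > eps) <= exp(-2 n eps^2 / R^2);
    # solving exp(-2 n eps^2 / R^2) = delta for n gives the size below.
    return (R ** 2) * math.log(1.0 / delta) / (2.0 * eps ** 2)

def normal_sample_size(sigma, eps, z):
    # Normal approximation: require z * sigma / sqrt(n) <= eps.
    return (z * sigma / eps) ** 2

R, eps, delta = 1.0, 0.01, 0.05
z_95 = 1.645                                   # one-sided 95% normal quantile
n_hoeff = hoeffding_sample_size(R, eps, delta) # about 14979 samples
n_norm = normal_sample_size(R / 2, eps, z_95)  # about 6765 samples
```

Under these illustrative settings the normal-based size is roughly half the Hoeffding-based size, which is consistent in spirit with the 37% average reduction reported above; the exact saving depends on the actual variance of the gain function.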

3.2 An Algorithm for In-Core Frequent Itemset Mining on Streaming Data


We propose a new one-pass algorithm, StreamMining, with a support parameter θ and a desired accuracy ε. In a single pass, StreamMining finds a superset of the frequent itemsets with support θ, and each itemset in this superset has frequency of at least θ - ε. Interestingly, with aggressive memory usage, the bound provided by StreamMining can improve beyond θ - ε, which implies that the actual answer set can be more accurate than expected. If a second pass is allowed, StreamMining is guaranteed to find exactly all frequent itemsets by eliminating the false positives.

StreamMining is derived from Karp et al.'s work on finding frequent elements (or 1-itemsets) [18]. To handle frequent itemset mining, StreamMining uses potential frequent 2-itemsets and the well-known Apriori property to reduce the total number of candidate itemsets. It keeps track of the largest possible number of frequent 2-itemsets, as well as the superset of all frequent itemsets for the data stream seen so far. The processing is as follows. When a new transaction arrives, StreamMining inserts its new 2-itemsets into the set of potential frequent itemsets and updates the counts of the existing ones. It also keeps the transaction temporarily in a buffer. Once the number of potential frequent 2-itemsets reaches the bound determined by ε, the 2-itemsets that are unlikely to be frequent are removed. The larger itemsets are then updated by traversing the transactions in the buffer, after which the buffer is cleared. The algorithm finds all of the frequent itemsets in a single pass.

To implement the new algorithm efficiently, we have also designed a new data structure, referred to as TreeHash. It supports insertion, update, deletion, and traversal of the itemsets in the superset. The data structure implements a prefix tree using a hash table: it has the compactness of a prefix tree and allows easy deletions like a hash table.

We have carried out a detailed evaluation using both synthetic and real datasets. Our results show that the one-pass algorithm is very accurate in practice. Even when the ratio of ε to θ is 1, the accuracy is 94% or higher, and in fact 100% in several cases; using a smaller ε results in an accuracy of 98% or higher in all cases. Further, the new algorithm is very memory efficient.
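StreamMining itself (with its 2-itemset bookkeeping and the TreeHash structure) is too involved to reproduce here, but the base algorithm it derives from, Karp et al.'s one-pass frequent-elements algorithm [18], can be sketched briefly. The function name and the Python framing are ours; the guarantee, that every element with frequency above θ survives, possibly alongside false positives that a second pass can remove, is the published one:

```python
def frequent_elements(stream, theta):
    """One-pass superset of the elements whose relative frequency
    exceeds theta (Karp et al. [18]).  May contain false positives."""
    k = int(1.0 / theta)          # at most k - 1 counters are kept
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # Decrement every counter; drop those that reach zero.
            # This charges each decrement against k distinct elements,
            # so a theta-frequent element can never be fully erased.
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return set(counters)
```

StreamMining generalizes this counter maintenance from single elements to 2-itemsets, and uses the Apriori property plus the buffered transactions to extend the surviving 2-itemsets to larger itemsets.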

4 Future Work
4.1 Mining Maximal Frequent Itemsets
We are currently developing techniques to mine the set of maximal frequent itemsets (MFI) on data streams. The difficulty in mining the MFI is that if we maintain information only for the MFI, it becomes very hard to find a good estimate of the counts of interior itemsets once the border of frequent itemsets shrinks. For example, assume the itemset {a, b, c} is currently a maximal frequent itemset. After processing a new chunk, we learn that it has become infrequent. Without loss of generality, assume that {a, b}, {b, c}, and {a, c} become potentially maximal frequent itemsets. Because we did not record any information about these itemsets, it is very hard to provide a reasonable estimate of how frequent they are.

To address this difficulty, we propose to maintain a concise frequency contour over the frequent itemsets. In other words, we maintain several MFI sets for different support levels, which we call contour sets. If an itemset does not appear in the contour sets, it falls between two different MFI sets; in this case, we take the greater of the two support levels as its frequency estimate. However, some technical issues arise when transforming this idea into an efficient algorithm, such as how to efficiently query an itemset that does not appear in the contour sets, how many MFI sets should be built, and how to build them efficiently. We are currently working on these issues.
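The contour-set lookup can be sketched as follows. This is a sketch of the preliminary idea only, under two assumptions of ours: contour levels are sorted by descending support, and an itemset "appears" at a level if it is a subset of some maximal itemset at that level. The function returns the bracketing interval; taking its upper end corresponds to the "greater support level" estimate described above:

```python
def support_interval(itemset, contours):
    """Bracket an itemset's support between two contour levels.
    contours: (support_level, maximal_itemsets) pairs, sorted by
    descending support level; maximal_itemsets are frozensets."""
    itemset = frozenset(itemset)
    high = 1.0                       # level above the topmost contour
    for level, mfis in contours:
        if any(itemset <= m for m in mfis):
            # Covered at `level` but at no higher level, so the true
            # support falls in [level, high).
            return (level, high)
        high = level
    return (0.0, high)               # below the lowest contour
```

The open technical questions above, how to answer this lookup without scanning every maximal itemset, and how many levels to maintain, are exactly what a dictionary-of-subsets index and a level-selection policy would have to resolve.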

4.2 System Support for Association Mining on Data Streams


For system support, we are interested in the following key questions: How can time-dependent mining queries for data streams be supported systematically? How can an offline mining mechanism be provided to aid online mining? What kinds of optimization issues need to be considered?

Our current research studies these problems from a language-support perspective. Several factors have been considered in designing appropriate language support. First, the language has to define a unified model that incorporates existing models, such as the tilted-time window and the sliding window. Furthermore, users should be able to explicitly define a range of data or results to be stored temporarily in order to perform offline mining, and the language's compiler should be able to recognize the necessary time windows and store the data needed by the queries optimally. Second, the language needs to provide support for time constraints and memory requirements, as well as accuracy. Third, the language and its compiler should be able to optimize multiple queries and complex queries. For example, if we need to find itemsets that are frequent in every window, the query optimizer should figure out that the mining process should first generate the frequent itemsets for the first window, and then use the other windows to eliminate the itemsets that are infrequent in them.
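The every-window optimization just described can be sketched at the item level. This is an illustrative sketch of the optimization strategy, not of our system; `mine_frequent` is a hypothetical stand-in for a real frequent-itemset miner, simplified here to single items:

```python
from collections import Counter

def mine_frequent(window, theta):
    # Stand-in miner: single items with frequency >= theta in the window.
    counts = Counter(window)
    return {x for x, c in counts.items() if c >= theta * len(window)}

def frequent_in_every_window(windows, theta):
    # Run the full miner only on the first window; the remaining
    # windows are used merely to verify (count) the candidates,
    # which is much cheaper than mining each window from scratch.
    candidates = mine_frequent(windows[0], theta)
    for w in windows[1:]:
        counts = Counter(w)
        candidates = {x for x in candidates if counts[x] >= theta * len(w)}
    return candidates
```

The query optimizer's job, in this view, is to recognize that the conjunction "frequent in every window" permits mine-once-then-verify evaluation, rather than naively mining every window.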

5 Related Work
A number of new mining algorithms have been developed targeting data streams. The first line of work designs approximate algorithms for mining static streams, where each arriving data instance is a random sample from a fixed underlying distribution. Domingos et al. [5, 6] studied sampling approaches to decision tree classification and k-means clustering. Manku and Motwani developed a counting-based approach to mine frequent itemsets over data streams [21]. Stanford's STREAM project has studied approximate k-median clustering with guaranteed probabilistic bounds [13, 22, 4].

The second line of research works with dynamic data streams, and two avenues have been taken. One avenue assumes the user has no knowledge of when the data stream will change. Hulten et al. [15] propose CVFDT for building decision trees over time-changing data streams. Another approach, taken by several research groups, applies ensembles of classifiers to this problem [24, 25, 19]. Aggarwal et al. propose a framework for clustering evolving data streams [1]. The other avenue assumes the users have some knowledge of how the data streams will change, and allows them to define a window model for mining dynamic streams. The pioneering work DEMON [9] considers a dynamic environment that evolves through systematic additions or deletions of blocks of data. Lee et al. [20] work with a sliding window model over very large datasets. More recently, Giannella et al. [11] designed a stream association mining algorithm called FP-Stream based on the tilted-time window model.

The third line of research mines changes and trends in dynamic data streams. Ganti et al. propose a framework for quantifying the difference between the models induced by different datasets [10]. Dong and Li [8] consider finding emerging patterns between datasets. Zhu and Shasha study how to perform efficient burst detection, where an abnormal aggregate occurs in the data stream [26]. Pei et al. propose a density-list notation to describe cluster changes in data streams [7]. Aggarwal [2], Pratt and Tschapek [23], and Pei et al. [7] study visualization techniques to help determine the trends of dynamic streams.

6 Conclusions
In this paper, we have described the challenges of mining data streams and discussed some limitations of current research. To tackle these problems, our research focuses on designing computation- and memory-efficient algorithms that provide approximate results with high accuracy and confidence, and on developing system support to help mine useful information from data streams. We have identified several specific research problems, presented the new techniques and algorithms we have developed for decision tree construction and frequent itemset mining on streaming data, and discussed some preliminary ideas of our on-going work. We are continuing to work on these problems.

References
[1] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for clustering evolving data streams. In Proceedings of the 29th VLDB Conference, 2003.
[2] C. C. Aggarwal. A framework for diagnosing changes in evolving data streams. In ACM SIGMOD, 2003.
[3] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proceedings of the 2002 ACM Symposium on Principles of Database Systems (PODS 2002) (invited paper). ACM Press, June 2002.
[4] M. Charikar, L. O'Callaghan, and R. Panigrahy. Better streaming algorithms for clustering problems. In Proceedings of the ACM Symposium on Theory of Computing, 2003.
[5] P. Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the ACM Conference on Knowledge and Data Discovery (SIGKDD), 2000.
[6] P. Domingos and G. Hulten. A general method for scaling up machine learning algorithms and its application to clustering. In Proceedings of the 18th International Conference on Machine Learning (ICML), pages 106-113, 2001.
[7] G. Dong, J. Han, L. Lakshmanan, J. Pei, H. Wang, and P. S. Yu. Online mining of changes from data streams: Research problems and preliminary results. In ACM SIGMOD MPDS, San Diego, CA, June 2003.
[8] G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 43-52, August 1999.
[9] V. Ganti, J. Gehrke, and R. Ramakrishnan. DEMON: Mining and monitoring evolving data. In Proceedings of ICDE, 2000.
[10] V. Ganti, J. Gehrke, R. Ramakrishnan, and W. Loh. A framework for measuring changes in data characteristics. In PODS, 1999.
[11] C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu. Mining frequent patterns in data streams at multiple time granularities. In Proceedings of the NSF Workshop on Next Generation Data Mining, November 2002.
[12] L. Golab and M. Ozsu. Issues in data stream management. SIGMOD Record, 32(2):5-14, June 2003.
[13] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In Proceedings of the Annual Symposium on Foundations of Computer Science (FOCS 2000), November 2000.
[14] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13-30, 1963.
[15] G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proceedings of the ACM Conference on Knowledge and Data Discovery (SIGKDD), 2001.
[16] R. Jin and G. Agrawal. An algorithm for in-core frequent itemset mining on streaming data. Submitted for publication, July 2003.
[17] R. Jin and G. Agrawal. Efficient and effective decision tree construction on streaming data. In Proceedings of SIGKDD 2003, Washington, DC, 2003.
[18] R. M. Karp, C. H. Papadimitriou, and S. Shenker. A simple algorithm for finding frequent elements in streams and bags. Available from http://www.cs.berkeley.edu/~christos/iceberg.ps, 2002.
[19] J. Z. Kolter and M. A. Maloof. Dynamic weighted majority: A new ensemble method for tracking concept drift. Technical Report CSTR-20030610-3, Georgetown University, June 2003.
[20] C. Lee, C. Lin, and M. Chen. Sliding-window filtering: An efficient algorithm for incremental mining. In Proceedings of the Tenth International Conference on Information and Knowledge Management, Atlanta, GA, 2001.
[21] G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proceedings of the Conference on Very Large Data Bases (VLDB), pages 346-357, 2002.
[22] L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani. High-performance clustering of streams and large data sets. In Proceedings of the 2002 International Conference on Data Engineering (ICDE 2002), February 2002.
[23] K. B. Pratt and G. Tschapek. Visualizing concept drift. In ACM SIGKDD, August 2003.
[24] W. Street and Y. Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In ACM SIGKDD, 2001.
[25] H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In ACM SIGKDD, August 2003.
[26] Y. Zhu and D. Shasha. Efficient elastic burst detection in data streams. In ACM SIGKDD, 2003.
