International Journal of Advances in Science and Technology,
Vol. 3, No. 5, 2011 (November Issue), ISSN 2229 5216
An Efficient Algorithm for Pattern Discovery in Sensor Network Data
Netreshwari Sharma, Kapil Kumar Nagwanshi and Lokesh Kumar Sharma
Department of Computer Science and Engineering, Rungta College of Engineering and Technology, Bhilai, CG, India
{netresharma, kapilkn, lksharmain}@gmail.com
Abstract. Sensor networks produce large volumes of data in the form of streams. Recently, data mining techniques have received a great deal of attention for extracting knowledge from sensor network data. Mining association rules on sensor data provides useful information for different applications. In this paper we examine ways of partitioning data for sensor pattern discovery. Our aim is to identify methods that enable efficient counting of frequent sets when the data is much too large to be held in primary memory, and when the density of the data makes the number of candidates to be considered very large. Our starting point is a method that preprocesses the data into a partial support tree structure (P-tree), which incorporates a partial counting of support totals. The experimental results show that the algorithm based on the P-tree and T-tree structures outperforms related algorithms on both sparse and dense data sets when generating association rules from sensor network (SN) data.
1. Introduction
Wide-area sensor infrastructures, remote sensors, RFIDs, phasor measurements, and Wireless Sensor Networks yield massive volumes of disparate, dynamic, and geographically distributed data. As such sensors become ubiquitous, a set of broad requirements is beginning to emerge across high-priority applications, including adaptability to national or homeland security, critical infrastructure monitoring, disaster preparedness and management, greenhouse emissions and climate change, and transportation. The raw data from sensors needs to be efficiently managed and transformed into usable information through data fusion, which in turn must be converted into predictive insights via knowledge discovery, ultimately facilitating automated or human-induced tactical decisions or strategic policy based on decision sciences and decision support systems [1] [2]. Knowledge discovery from sensor data (Sensor-KDD) is important because of its many applications of crucial importance to society, and large-scale sensor systems need to process heterogeneous, multisource information from diverse types of instruments [5]. Developing a model that facilitates the representation of, and knowledge discovery on, sensor data presents many challenges. Because sensors report data at a very high frequency, resulting in large volumes of data, the model must be memory efficient. Sensor networks have spatial characteristics, which include the location of the sensors. In addition, sensor data is temporal in nature, so the model must also support the time dependence of the data. Balancing the conflicting requirements of simplicity, expressiveness, and storage efficiency is challenging.
The model should also provide adequate support for the formulation of efficient algorithms for knowledge discovery [18]. There is a clear and present need to bring together researchers from academia, government, and the private sector in the broad areas of knowledge discovery from sensor data [6]. Recently, association rules [1] for sensors have received a great deal of attention [1] [2] [3] [14] [15] [16] due to their importance in capturing the temporal relations among sensor nodes in Wireless Sensor Networks. An example of such a rule is $(s_1 s_2 \Rightarrow s_3,\ 80\%,\ \lambda)$, which can be interpreted as follows: if events are received from sensors $s_1$ and $s_2$, then there is an 80% chance of receiving an event from sensor $s_3$ within $\lambda$ units of time, where 80% is the confidence of the rule. However, generating association rules that meet a given frequency requires first generating all patterns in the data that meet this frequency, i.e., the frequent patterns. Once the frequent patterns are determined, generating the rules is straightforward. In this paper an efficient pattern discovery algorithm for sensor data is proposed. The remainder of the paper is organized as follows. Section 2 reviews the literature on pattern discovery in sensor data. Section 3 presents the framework for pattern discovery in sensor data. Section 4 describes the data structures for the proposed algorithm and reports the experimental investigation. Finally, Section 5 concludes the work.
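To make the rule notation concrete, the following is a minimal sketch and not part of the original method. It assumes a hypothetical toy database in which each epoch is simply the set of sensors that reported in one time slot; the names `epochs`, `freq`, and `confidence` are illustrative. It counts the epochs that contain a pattern and divides two such counts to obtain the chance of seeing s3 given s1 and s2, in the sense formalized in Section 3.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy database (an assumption for illustration): each epoch is
# the set of sensors that reported an event within one time slot.
epochs: List[FrozenSet[str]] = [
    frozenset({"s1", "s2", "s3"}),
    frozenset({"s1", "s2", "s3"}),
    frozenset({"s1", "s2", "s3"}),
    frozenset({"s1", "s2", "s3"}),
    frozenset({"s1", "s2"}),
    frozenset({"s2", "s4"}),
]


def freq(pattern: Iterable[str], db: List[FrozenSet[str]]) -> int:
    """Number of epochs whose sensor set contains every sensor in `pattern`."""
    wanted = frozenset(pattern)
    return sum(1 for epoch in db if wanted <= epoch)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               db: List[FrozenSet[str]]) -> float:
    """Freq(antecedent and consequent together) / Freq(antecedent)."""
    base = freq(antecedent, db)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    both = frozenset(antecedent) | frozenset(consequent)
    return freq(both, db) / base


rule_freq = freq({"s1", "s2", "s3"}, epochs)          # epochs supporting the whole rule
rule_conf = confidence({"s1", "s2"}, {"s3"}, epochs)  # chance of s3 given s1 and s2
print(rule_freq, rule_conf)
```

In this toy database five epochs contain both s1 and s2, and four of them also contain s3, so the rule s1 s2 ⇒ s3 holds in 80% of the epochs where its antecedent occurs.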
2. Related Works
Frequent pattern mining is an important task in data mining. Many researchers have contributed to this field, and several algorithms have been proposed in the literature to tackle the problem of mining frequent patterns from transactional databases. These algorithms differ mainly in the way they represent the database and in the way they generate the frequent patterns. They can be classified into two main approaches: the candidate generation approach and the pattern growth approach [1]. The two most popular database formats are 1) the vertical layout, in which each object is associated with the list of context identifiers where it occurred, and 2) the horizontal layout, in which each context identifier is associated with the list of objects. The candidate generation approach enumerates the frequent patterns gradually, with several scans of the database. In each iteration, the patterns found to be frequent are used to generate the candidates (possible frequent patterns) to be counted in the next iteration. Within this approach, several schemes have been developed [14], such as the AIS algorithm, Apriori, and AprioriHybrid. The most popular of these is the Apriori scheme; all other approaches, except for AIS, are essentially optimized versions of it. In the Apriori algorithm, a database scan is first conducted to determine the set of frequent one-element patterns. From this set, the algorithm generates the set of candidates to be counted in the next step, and it prunes the candidates by eliminating those that have at least one infrequent subset. This process is repeated a number of times equal to the size of the largest frequent pattern.
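The candidate generation loop described above can be sketched as follows. This is a simplified illustration of the Apriori idea rather than any of the cited implementations; the function name `apriori`, the toy data, and the absolute `min_support` threshold are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    # First scan: count the one-element patterns.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: combine frequent (k-1)-sets whose union has exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: discard candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        # One database scan per level to count the surviving candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical sensor epochs used only to exercise the sketch.
db = [{"s1", "s2", "s3"}, {"s1", "s2"}, {"s1", "s3"}, {"s2", "s3"}, {"s1", "s2", "s3"}]
patterns = apriori(db, min_support=3)
for itemset, support in sorted(patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

With this data every single sensor and every pair is frequent, while the triple {s1, s2, s3} is counted as a candidate and then dropped, which shows the cost of counting candidates that turn out to be infrequent.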
The pattern growth approach [10] tries to avoid the large number of candidates generated in each pass and to overcome the repeated scans of the database, which enables most algorithms in this approach to outperform candidate-generation-based algorithms. Frequent Pattern Growth (FP-Growth), proposed by Han et al., is the core algorithm of the pattern growth approach [10]. In this method, the database is converted into a compact tree representation, called the Frequent Pattern tree (FP-tree), which is much smaller than the original database. The FP-tree is constructed in such a way that all information needed in the mining process is present in the tree structure. Building the tree requires only two scans of the database. After building the FP-tree, the FP-Growth routine mines all the frequent patterns from the tree without referring to the original database and without generating candidates [2][10]. Pattern discovery in sensor data is a relatively new research area, in which researchers have been adapting traditional pattern discovery algorithms to sensor data. Loo et al. [11] studied the problem of mining the associations that exist between sensor values in a stream of data reported from a wireless sensor network. They proposed a data model that stores and presents the data in a way that makes it possible to adapt the lossy counting algorithm, which performs an online one-pass analysis of the data. In this data model, sensors are assumed to take values from a finite set of discrete values, and a quantization method is applied to continuous values. Time is divided into equal-sized intervals, and a snapshot of the sensor readings is taken whenever a sensor reading is updated. These snapshots form the contexts of the database.
Although taking snapshots at state changes reduces the redundancy in the data, these snapshots occur randomly; thus, each context is associated with a weight that indicates for how many intervals the reading is valid (that is, for how long the readings remain unchanged). The support of a pattern is defined as the total length of the non-overlapping intervals in which the pattern is valid. Mining spatio-temporal event patterns, proposed by Roemer [12], is another attempt to link the problem of mining sensor data to the association rule mining problem. Roemer's approach takes into consideration the distributed nature of wireless sensor networks and proposes an in-network data mining technique to discover frequent patterns of events with certain spatial and temporal properties. In this approach, each sensor should be aware of the events that occur within a certain distance from itself (this distance may be a Euclidean distance or a number of hops). The sensor then collects these events and applies a mining algorithm to discover the patterns that satisfy the given parameters. The mining parameters include a minimum support S, a minimum confidence C, a maximum scope, and a maximum history. Each node in the network collects the events from its neighbors within the maximum scope and keeps a history of their events for the duration of the maximum history. After that, each node applies a mining algorithm to discover the frequent patterns (those whose frequency exceeds the given minimum support). Boukerche et al. [3] proposed a solution to extract the behavioral data required for mining patterns regarding the behavior of the sensor nodes in the network (that is, the data used in the mining process is metadata describing the nodes' activities, and it differs from the sensed data). A primary assumption of this data extraction mechanism is that a flash memory device is attached to each sensor to store the metadata about the sensor's behavior that will be used during the extraction process. Several researchers have studied the cost of attaching a storage device to each sensor. The Boukerche et al.
[1] [2] [3] framework consists of 1) a formal definition of sensor behavioral patterns and sensor association rules, 2) a novel representation structure, referred to as the Positional Lexicographic Tree (PLT), that is able to compress the data gathered for the mining process and thus allows fast and efficient mining of sensor behavioral patterns, and 3) a distributed data extraction mechanism to prepare the data required for mining sensor behavioral patterns. However, the construction of such data structures (both FP-tree and PLT) requires two database scans, which is not suitable for generating association rules from streams of sensor data. Moreover, mining the PLT requires an extra mechanism for mapping the sensors to a vector [15]. Therefore, Tanbeer et al. [15] proposed a prefix-tree structure, called the Sensor Pattern Tree (SP-tree for short), which is able to capture the information with one scan over the stream of sensor data and store it in a memory-efficient, highly compact manner, similar to the FP-tree. The main idea of the SP-tree is to obtain the frequency of all event-detecting sensors' data and construct a prefix tree based on it in any canonical order, then reorganize the tree in frequency-descending order. Through this reorganization the SP-tree keeps the frequently event-detecting sensor nodes in the upper part of the tree, which in turn makes the tree structure highly compact. Once the SP-tree is constructed, the FP-growth mining technique is applied to it. The pattern mining techniques for sensor data reported above are mainly based on the Apriori or FP-Growth framework. Apriori-like algorithms suffer from several problems: a huge set of candidate sequences can be generated in a large sequence database; multiple scans of the database are required during mining; and a combinatorially explosive number of candidates is generated when mining long sequential patterns. FP-Growth algorithms perform well when the data set is dense.
For sparse data sets, however, the FP-tree becomes large, so FP-Growth uses more space and takes computation time similar to that of the Apriori algorithm [4] [9]. In this paper we use an algorithm based on the Apriori-TFP structure [9], augmented for sensor data pattern mining. It completes the summation of the final support counts and stores the results in a second set-enumeration tree (the T-tree, of Total support counts), ordered in the opposite way to the P-tree. The T-tree finally contains all frequent sets with their complete support counts. This algorithm works well for both sparse and dense data sets.
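The effect of the frequency-descending reorganization discussed for the SP-tree earlier in this section can be illustrated with a small sketch. It does not implement the SP-tree restructuring of [15]; it only builds a plain prefix tree twice, under two insertion orders, and compares the number of nodes. The helper `count_nodes` and the toy epochs are assumptions for illustration.

```python
from collections import Counter


def count_nodes(records, order_key):
    """Insert each record into a prefix tree, items sorted by `order_key`,
    and return how many tree nodes were created."""
    root = {}
    nodes = 0
    for record in records:
        node = root
        for item in sorted(record, key=order_key):
            if item not in node:
                node[item] = {}
                nodes += 1
            node = node[item]
    return nodes


# Hypothetical epochs: sensor s9 reports in every epoch, the others rarely.
records = [{"s1", "s9"}, {"s2", "s9"}, {"s3", "s9"}, {"s9"}]
item_freq = Counter(item for record in records for item in record)

# Canonical (lexicographic) order versus frequency-descending order.
canonical = count_nodes(records, order_key=lambda s: s)
by_frequency = count_nodes(records, order_key=lambda s: (-item_freq[s], s))
print(canonical, by_frequency)
```

Placing the frequently reporting sensor near the root lets the records share one prefix node, so the frequency-ordered tree here needs 4 nodes where the canonical order needs 7.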
3. Sensor Association Rules Mining Framework
Boukerche et al. [1][2] defined the problem of mining sensor association rules based on the definition of association rules proposed in the domain of transactional databases [10][14]. On this basis, the pattern mining framework can be generalized as follows [2]. Let $S = \{s_1, s_2, \ldots, s_m\}$ be a set of sensors in a particular sensor network. Time is divided into equal-sized slots $\{t_1, t_2, \ldots, t_n\}$ such that $t_{i+1} - t_i = \lambda$ for all $1 \le i < n$, where $\lambda$ is the size of each time slot, and $T = t_n - t_1$ represents the historical period of the behavioral data defined during the data extraction process. A set $P = \{s_1, s_2, \ldots, s_k\} \subseteq S$ is referred to as a pattern of sensors.
Definition 1. A sensor database $SDB$, the behavioral data, is defined to be a set of epochs in which each epoch is a pair $E(E_{ts}, P)$, where $P$ is a pattern of sensors that report events within the same time slot and $E_{ts}$ is the epoch's time slot.
Definition 2. Let $P_1$ be a pattern of sensor nodes such that $P_1 \subseteq S$. We say that an epoch $E(E_{ts}, P)$ supports $P_1$ if $P_1 \subseteq P$.
Definition 3. The frequency of the pattern $P_1$ in $SDB$ is defined to be the number of epochs in $SDB$ that support it: $\mathrm{Freq}(P_1, SDB) = |\{E(E_{ts}, P) \mid P_1 \subseteq P\}|$.
Definition 4. A pattern is said to be frequent if its frequency is greater than or equal to the given minimum support.
Definition 5. Sensor association rules are defined in the form $P' \Rightarrow P''$, where $P' \subset S$, $P'' \subset S$, and $P' \cap P'' = \emptyset$.
Definition 6. The frequency of the rule $(P' \Rightarrow P'')$ is the frequency of the pattern $(P' \cup P'')$ in $SDB$, whereas the confidence of the rule is defined as follows: $\mathrm{Conf}(P' \Rightarrow P'') = \mathrm{Freq}(P' \cup P'', SDB) / \mathrm{Freq}(P', SDB)$.
4. Efficient Algorithm for Mining Association Rules in Sensor Data
4.1. Data Structures
In this section we describe the data structures used and discuss an efficient algorithm for mining association rules in sensor data. The algorithm is an augmentation of the Apriori Total-from-Partial (Apriori-TFP) algorithm [9] for sensor network data.
4.1.1. The Partial Support Tree (P-Tree)
The partial support tree avoids repeated re-scanning of the same record set. The approach copies the input data into a data structure that retains all relevant input data. The P-tree has two advantages:
- It merges duplicated records and records with common leading substrings, thereby reducing storage and processing requirements.
- The partial support counts for individual nodes are accumulated while the tree is being constructed.
The overall structure of the P-tree is a compressed set-enumeration tree. To construct a P-tree, the data is scanned record by record. When complete, the P-tree contains all the item sets present as distinct records in the input data. The support stored in each node is an incomplete (partial) total support: the sum of the supports stored in the subtree of the node.
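The idea of accumulating partial supports during a single scan can be sketched as follows. This is a simplification that assumes an uncompressed prefix trie rather than the compressed P-tree of [9]; the names `PNode`, `build_ptree`, and `partial_support`, as well as the toy records, are hypothetical.

```python
class PNode:
    """One node of a simplified, uncompressed partial support tree."""

    def __init__(self):
        self.count = 0       # records whose sorted item list starts with this prefix
        self.children = {}   # item -> PNode


def build_ptree(records):
    """Build the tree in one pass; each record is inserted in sorted item order."""
    root = PNode()
    for record in records:
        node = root
        for item in sorted(set(record)):
            node = node.children.setdefault(item, PNode())
            node.count += 1  # this record passes through (or ends at) the node
    return root


def partial_support(root, prefix):
    """Partial support of `prefix`: records that have it as a leading substring."""
    node = root
    for item in prefix:
        node = node.children.get(item)
        if node is None:
            return 0
    return node.count


# Hypothetical sensor records used only to exercise the sketch.
records = [["s1", "s2", "s3"], ["s1", "s2"], ["s1", "s2", "s3"], ["s2", "s3"], ["s1", "s3"]]
tree = build_ptree(records)
print(partial_support(tree, ["s1"]), partial_support(tree, ["s1", "s2"]),
      partial_support(tree, ["s2"]), partial_support(tree, ["s3"]))
```

Duplicate records and shared prefixes collapse into the same path, and each count covers only the records in that node's subtree. Sensor s2 occurs in four records but has a partial support of 1 under the root, which is why a second stage is needed to sum partial counts into total supports.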
4.1.2. The Total Support Tree (T-Tree)
The total support tree uses the downward closure property of item sets: 'if any given item set I is not large, any superset of I will also not be large'. This property can be used to avoid generating and computing support for all combinations in the input data. The approach requires: 1) a number of passes over the data set, 2) the construction of candidate sets to be counted in the next pass, and 3) the verification of the validity of the data set. The algorithm to compute frequent patterns using the total and partial support trees is given below. The P-tree uses a linked representation in which each node has the following format:
    class participatory {
        Node code;
        Time stamp;
        Child reference;
        Sibling reference;
    }
For a database of $m$ records, stage 1 of the algorithm (A1) performs $m$ support-count increments in a single pass, to compute a total of $m'$ partial supports, for some $m' \le m$, in the given timestamps. The second stage of the algorithm (A2) involves, for each of these, the examination of subsets that are members of the target set $T$. In an exhaustive version of the method, $T$ will be the full set of subsets of $I$. Computing via summation of partial supports, however, offers three potential advantages.
Algorithm 1: Algorithm with total and partial support
Inputs: Transaction data set DS, count set P
Output: The counting sets P and T for DS
Method:
A1:
1. For each record j in the database {
2.   For all P_i ∈ u(P)
3.     add 1 to P_j
4.   Insert the new record into j.
5. }
A2:
1. For all j in P
2.   For all P_i ∈ u(P) {
3.     For all i in T, i ⊆ j, do
4.       add to the total support tree
5. end;
Firstly, when $n$ is small ($2^n \ll m$), A2 involves the summation of a set of counts that is significantly smaller than a summation over the whole database. Secondly, even for large $n$, if the database contains a high degree of duplication ($m' \ll m$), the stage 2 summation will again be significantly faster than a full database pass, especially if the duplicated records are densely populated with attributes. Finally, and most generally, stage A1 may be used to organize the partial counts in a way that facilitates a more efficient stage 2 computation, exploiting the structural relationships inherent in the lattice of partial supports [4][9].
4.2. Algorithm for Generating Sensor Association Rules Using the TFP Tree
The proposed algorithm first extracts the data for a particular interval from the whole sensor data set and then applies the TFP mining approach to find the frequent item sets in that user-defined time interval.
Algorithm 2: Pattern mining for sensor data with total and partial support tree.
Input: A sensor database D, specified temporal pattern e_0
Output: Frequent item sets from the sensor database and database table DT.
Method:
1. Set the pointer to the first record of the database.
2. Scan the database record by record and follow step (3).
3. {
4.   If (p ∈ u(p)) do
5.     Insert into DT {item set}
6. }
7. Set K = 1.
8. Build level K in the T-tree.
9. Walk the P-tree, applying algorithm TFP to add the interim supports associated with individual P-tree nodes to the level K nodes established in (8).
10. Remove any level K T-tree nodes that do not have an adequate level of support.
11. Increase K by 1.
12. Repeat steps (8), (9), (10), and (11) until a level K is reached where no nodes are adequately supported.
In the above algorithm, steps (1) to (5) find the item sets that occur in the valid time period specified by the calendar schema. Steps (7) to (12) mine the frequent item sets from the TFP tree.
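The overall flow of Algorithm 2 can be sketched under some simplifying assumptions. The sketch below is not the Apriori-TFP implementation of [9]: it replaces the P-tree with a flat table of distinct records and their partial counts, replaces the T-tree with one dictionary per level, and models the temporal pattern as a closed interval of time slots. The function `mine_interval`, its parameters, and the toy epochs are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_interval(epochs, start, end, min_support):
    """epochs: iterable of (time_slot, sensors). Returns {itemset: total support}."""
    # Steps 1-6: keep only the epochs whose time slot lies in [start, end].
    selected = [frozenset(sensors) for slot, sensors in epochs if start <= slot <= end]

    # Stage A1: a single pass that merges duplicate records into partial counts.
    partial = Counter(selected)

    totals = {}
    items = sorted({item for record in partial for item in record})
    level = [frozenset([item]) for item in items]   # candidates for level K = 1
    k = 1
    while level:
        # Stage A2: add each partial count to every candidate the record contains.
        support = {candidate: 0 for candidate in level}
        for record, count in partial.items():
            for candidate in level:
                if candidate <= record:
                    support[candidate] += count
        # Step 10: remove candidates that are not adequately supported.
        kept = {c: n for c, n in support.items() if n >= min_support}
        totals.update(kept)
        # Steps 11-12: build level K + 1 from the sets that survived level K.
        k += 1
        joined = {a | b for a in kept for b in kept if len(a | b) == k}
        level = [
            c for c in joined
            if all(frozenset(sub) in kept for sub in combinations(c, k - 1))
        ]
    return totals


# Hypothetical epochs: (time slot, sensors that reported in that slot).
epochs = [
    (1, {"s1", "s2"}), (2, {"s1", "s2"}), (3, {"s1", "s2", "s3"}),
    (4, {"s2", "s3"}), (5, {"s1", "s2"}), (9, {"s4"}),
]
result = mine_interval(epochs, start=1, end=5, min_support=3)
for itemset, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

Because the three identical {s1, s2} epochs are merged into one entry with a partial count of 3, each level touches three distinct records instead of five epochs, which is the duplication advantage described for stage A2.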
4.3. Experimental Results
To evaluate the performance of our algorithms and optimization techniques, we performed a series of experiments on synthetic data sets. In the following section, we describe the synthetic basket data generator with temporal information. We then generate synthetic data sets to evaluate the algorithms on data with different characteristics. The algorithm is implemented in Java and was run on the Windows 7 operating system. No other application was running during the performance evaluation. For the experiments we used the Dodgers loop sensor data [7] and a synthetic data set [3][17] generated by a simulator. Figure 1 shows the experimental results on the dense data set. We can observe that the PLT algorithm takes more time than SP-Growth and TFP, and that SP-Growth and TFP give comparable performance. The experimental results on sparse data are shown in Figure 2. In this experiment, PLT and SP-Growth take comparable time, and TFP is more efficient than both PLT and SP-Growth.
Figure 1: Support values versus Time for Dense Data.
Figure 2: Support values versus Time for Sparse Data.
5. Conclusion
In this paper, we have performed a systematic study of mining sensor data patterns in large sensor network databases and developed a tree-based approach for efficient and scalable sensor pattern mining. Instead of refining the Apriori-like candidate generation-and-test approach, a P-tree structure is proposed. The experimental results reported here show that the tree partitioning method described is extremely effective in limiting the maximal memory requirements of the algorithm, while its execution time scales only slowly and linearly with increasing data dimensions. Its overall performance, both in execution time and especially in memory requirements, is significantly better than that obtained from either simple data segmentation or the other methods considered. The advantage increases with increasing density of the data and with reduced support thresholds, i.e., in the cases that are in general most challenging for association rule mining.
6. References
[1] A. Boukerche and S. Samarah, "A Novel Algorithm for Mining Association Rules in Wireless Ad Hoc Sensor Networks", IEEE Transactions on Parallel and Distributed Systems, vol. 19, no. 7, 2008, pp. 865-877.
[2] A. Boukerche, S. Samarah, and H. Harbi, "Knowledge Discovery in Wireless Sensor Networks for Chronological Patterns", Proc. of 33rd IEEE Conf. on Local Computer Networks (LCN'08), 2008, pp. 667-673.
[3] A. Boukerche and S. Samarah, "An Efficient Data Extraction Mechanism for Mining Association Rules from Wireless Sensor Networks", Proc. IEEE International Conference on Communications, 2007, pp. 3936-3941.
[4] A. Akasapu, L. K. Sharma, and G. Ramakrishana, "Efficient Trajectory Pattern Mining for both Sparse and Dense Dataset", Int. J. of Computer Applications, Vol. 9(5), 2010, pp. 45-48.
[5] B. Chikhaoui, S. Wang, and H. Pigot, "A New Algorithm Based On Sequential Pattern Mining For Person Identification In Ubiquitous Environments", Proc. of the Fourth Int. Workshop on Knowledge Discovery from Sensor Data (ACM SensorKDD'10), Washington, DC, July 25-28, 2010, pp. 20-28.
[6] D. Lymberopoulos, A. Bamis, and A. Savvides, "A Methodology for Extracting Temporal Properties from Sensor Network Data Streams", ACM MobiSys'09, June 22-25, Kraków, Poland, 2009.
[7] Dodgers loop sensor data, http://pems.eecs.berkeley.edu
[8] E. Salah, R. Pauwels, R. Tavenard, and T. Gevers, "T-Patterns Revisited: Mining for Temporal Patterns in Sensor Data", Sensors, vol. 10, 2010, pp. 7496-7513.
[9] F. Coenen, P. Leng, and S. Ahmed, "Data Structure for Association Rule Mining: T-Trees and P-Trees", IEEE Transactions on Knowledge and Data Engineering, Vol. 16(6), 2004, pp. 774-778.
[10] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation", Proc. of the 2000 ACM SIGMOD Int. Conference on Management of Data, 2000, pp. 1-12.
[11] K. K. Loo, I. Tong, B. Kao, and D. Cheung, "Online Algorithms for Mining Inter-Stream Associations from Large Sensor Networks", Proc. Ninth Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD'05), May 2005, pp. 143-149.
[12] K. Roemer, "Distributed Mining of Spatio-Temporal Event Patterns in Sensor Networks", Proc. of EAWMS'06, June 2006.
[13] M. Rajpoot and L. K. Sharma, "Comparative Study of Association Rule Mining for Sensor Data", Int. J. of Computer Applications, Vol. 19(1), 2011, pp. 34-36.
[14] R. Agrawal, T. Imielinski, and A. N. Swami, "Mining Association Rules Between Sets of Items in Large Databases", Proc. ACM SIGMOD Conference on Management of Data, 1993, pp. 207-216.
[15] S. K. Tanbeer, F. A. Chowdhury, B. S. Jeong, and Y. K. Lee, "Efficient Mining of Association Rules from Wireless Sensor Networks", Int. Conf. ACT 2009, Feb. 15-18, 2009, ISBN 978-89-5519-139-4, pp. 719-724.
[16] S. Samarah and A. Boukerche, "Target Association Rules: A New Behavioral Patterns for Point of Coverage Wireless Sensor Networks", IEEE Trans. on Computers, Vol. 60, No. 6, 2011, pp. 879-889.
[17] S. Samarah, A. Boukerche, and Y. Ren, "Coverage-based Sensor Association Rules for Wireless Vehicular Ad hoc and Sensor Networks", Proc. of IEEE GLOBECOM, 2008, pp. 1-5.
[18] V. S. Tseng and K. W. Lin, "Mining Temporal Moving Patterns in Object Tracking Sensor Networks", Proc. of Int. Workshop on Ubiquitous Data Management (UDM'05), 0-7695-2411-7/05 IEEE, 2005, pp. 1-8.
Authors Profile
Netreshwari Sharma completed Master of Science in Information Technology from Guru Ghasidas Central University Bilaspur (CG)-India. She is currently doing M. Tech. in Computer Science and Engineering from Chhattisgarh Swami Vivekananda Technical University Bhilai (CG)-India. Her research interests include Data Models and Abstract Semantic Descriptions for Sensor and Trajectory data.
Kapil Kumar Nagwanshi is an active member of IEEE and the IEEE Computer Society. He has been a lifetime member of the Computer Society of India and the International Association of Computer Science & Information Technology since 2009. He is also a member of the International Association of Engineers (IAENG) and the Computer Science Teachers' Association (CSTA-ACM). He currently works as a Reader at RCET Bhilai. His research areas include signal processing, image processing, parallel computing, and information systems and security.
Dr. Lokesh Kumar Sharma received his Ph.D. degree from Pt. Ravishankar Shukla University, Raipur (CG), India. Dr. Sharma is a DAAD Fellow (Germany). He currently works as Head of the Department of Computer Science and Engineering at Rungta College of Engineering and Technology, Bhilai (CG). He has 11 years of overall teaching experience. He is a member of CSI, ACM, IAE, etc. His major research interests are data mining, spatial data mining, and mobile communication.