International Journal of Advances in Science and Technology,
Vol. 3, No. 5, 2011, November Issue, ISSN 2229-5216

An Efficient Algorithm for Pattern Discovery in Sensor
Network Data

Netreshwari Sharma, Kapil Kumar Nagwanshi and Lokesh Kumar Sharma

Department of Computer Science and Engineering,
Rungta College of Engineering and Technology, Bhilai, CG, India
{netresharma, kapilkn, lksharmain }@gmail.com

Abstract
Sensor networks produce large volumes of data in the form of streams. Recently, data mining
techniques have received a great deal of attention for extracting knowledge from sensor network
data. Mining association rules from sensor data provides useful information for different
applications. In this paper we examine ways of partitioning data for sensor pattern discovery.
Our aim is to identify methods that enable efficient counting of frequent sets in cases
where the data is much too large to be held in primary memory, and where the density of
the data makes the number of candidates to be considered very large. Our starting
point is a method that first preprocesses the data into a partial support tree
structure (P-tree), which incorporates a partial counting of support totals. The experimental results
show that the algorithm based on the P-tree and T-tree structures outperforms related algorithms on
both sparse and dense data sets when generating association rules from sensor network (SN) data.

Keywords: Pattern Discovery, Sensor KDD, Sensor Network.

1. Introduction

Wide-area sensor infrastructures, remote sensors, RFIDs, Phasor measurements, and Wireless
Sensor Networks yield massive volumes of disparate, dynamic, and geographically distributed data. As
such sensors are becoming ubiquitous, a set of broad requirements is beginning to emerge across high-
priority applications including adaptability to national or homeland security, critical infrastructures
monitoring, disaster preparedness and management, greenhouse emissions and climate change, and
transportation. The raw data from sensors need to be efficiently managed and transformed to usable
information through data fusion, which in turn must be converted to predictive insights via knowledge
discovery, ultimately facilitating automated or human-induced tactical decisions or strategic policy
based on decision sciences and decision support systems [1] [2].
Knowledge discovery from sensor data (Sensor-KDD) matters to many applications of
crucial importance to society, and large-scale sensor systems need to process heterogeneous,
multisource information from diverse types of instruments [5]. Developing a model that facilitates
the representation of sensor data and knowledge discovery on it presents many challenges. With
sensors reporting data at a very high frequency, resulting in large volumes of data, there is a need
for a model that is memory efficient.
Sensor networks have spatial characteristics, which include the location of the sensors. In addition,
sensor data is temporal in nature, and hence the model must also support the time dependence
of the data. Balancing the conflicting requirements of simplicity, expressiveness, and storage efficiency
is challenging. The model should also provide adequate support for the formulation of efficient
algorithms for knowledge discovery [18]. There is a clear and present need to bring together
researchers from academia, government, and the private sector in the broad areas of knowledge
discovery from sensor data [6].
Recently, association rules [1] for sensors have received a great deal of attention [1] [2] [3] [14] [15]
[16] due to their importance in capturing the temporal relations among sensor nodes in Wireless Sensor
Networks. An example of such a rule is (s1 s2 → s3, 80%, λ), which can be interpreted as follows: if
events are received from sensors s1 and s2, then there is an 80% chance of receiving an event from
sensor s3 within λ units of time, where 80% is the frequency of the rule. However, generating
association rules that have a certain frequency requires generating all the patterns present in the data
that meet this frequency, i.e., the frequent patterns. Once the frequent patterns are determined,
generating the rules is straightforward. In this paper an efficient pattern discovery
algorithm for sensor data is proposed.
The remainder of the paper is organized as follows. Section 2 reviews the literature on
pattern discovery in sensor data. Section 3 presents the framework of pattern discovery in sensor
data. Section 4 gives the data structures for the proposed algorithm and reports the
experimental investigation. Finally, Section 5 concludes the work.

2. Related Works

Frequent pattern mining is an important task in data mining. Several researchers have been
contributing in this field and several algorithms in the literature have been proposed to tackle the
problem of mining frequent patterns from transactional databases. These algorithms differ mainly in
the way that they represent the database and in which they generate the frequent patterns. These
algorithms can be classified into mainly two approaches: the candidate generation approach and the
pattern growth approach [1]. In terms of these approaches, the algorithms also differ in how they
represent the database. The two most popular formats are 1) the vertical layout, in which each object is
associated with the list of context identifiers where it occurred, and 2) the horizontal layout, in which
each context identifier is associated with the list of objects. The candidate generation approach
enumerates the frequent patterns gradually, with several scans of the database. In each iteration,
patterns found to be frequent are used to generate the candidates (possible frequent patterns) to be
counted in the next iteration. Within this approach, several schemes have been developed [14], such as
the AIS algorithm, Apriori, AprioriHybrid, etc. The most popular algorithm among the candidate
generation approaches is the Apriori scheme. All other approaches, except for AIS, are basically
optimized versions of the Apriori scheme. Recall that in the Apriori algorithm, a database scan is
conducted in order to determine the set of frequent one-element patterns. From this set, it generates
the set of candidates to be counted in the next step. In addition, it prunes the set of candidates by
eliminating those that have at least one infrequent subset. This process is repeated a number of
times equal to the size of the largest frequent pattern.
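The join-and-prune loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation evaluated in this paper; the function name `apriori`, the use of an absolute support count for `min_support`, and the in-memory list of transactions are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Level-wise frequent pattern mining with candidate generation and pruning."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One database scan: count the transactions that contain each candidate.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    # First pass: frequent one-element patterns.
    frequent = count({frozenset([i]) for t in transactions for i in t})
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: merge frequent (k-1)-patterns into k-element candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result
```

Each call to `count` corresponds to one scan of the database, which is the cost the pattern growth approach discussed next tries to avoid.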
The pattern growth approach [10] tries to avoid the large number of candidates generated in each
pass and to overcome the repeated scans of the database, thereby enabling most of the algorithms in this
approach to outperform the candidate-generation-based algorithms. The Frequent Pattern
Growth (FP-Growth) algorithm proposed by Han et al. is the core algorithm of the pattern growth approach [10].
In this method, the database is converted into a compact representation in the form of a tree, called
Frequent Pattern tree (FP-tree), which is much smaller in size than the original database. The FP-tree is
constructed in such a way that all relevant information needed in the mining process is presented in the
tree structure. Note that building the tree structure requires only two scans of the database. After
building the FP-tree, the FP-Growth routine mines all the frequent patterns from the tree structure
without referring to the original database and without generating candidates [2][10].
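The two-scan construction can be illustrated with a short Python sketch. It only builds the prefix tree and omits the header table and the mining routine of FP-Growth; the names `FPNode`, `build_fp_tree`, and `count_nodes` are hypothetical, and ties in frequency are broken alphabetically as an arbitrary choice for the example.

```python
from collections import Counter


class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> FPNode


def build_fp_tree(transactions, min_support):
    # Scan 1: count item frequencies and keep only the frequent items.
    freq = Counter(i for t in transactions for i in set(t))
    freq = {i: n for i, n in freq.items() if n >= min_support}
    root = FPNode(None)
    # Scan 2: insert each transaction with its items in frequency-descending order,
    # so that transactions sharing frequent items share a path from the root.
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in freq),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root, freq


def count_nodes(node):
    # Number of nodes below `node`, used here only to show the compaction.
    return sum(1 + count_nodes(c) for c in node.children.values())
```

For transactions that share many leading items, `count_nodes` returns far fewer nodes than the total number of item occurrences in the database, which is the compaction the text refers to.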
Pattern discovery in sensor data is a relatively new research area. Researchers have been
adapting traditional pattern discovery algorithms to mine sensor data. Loo et al. [11] have
studied the problem of mining the associations that exist between sensor values in a stream of data
reported from a wireless sensor network. They proposed a data model that stores the data and presents
those in a way that makes it possible to adapt the lossy counting algorithm that makes an online one-
pass analysis of the data. In this data model, sensors are assumed to take values from a finite discrete
number of values, whereas a quantization method is applied for the continuous values. The time is
divided into equal-sized intervals, and a snapshot from the sensor reading is taken whenever there is an
update on a sensor reading. These snapshots formulate the contexts of the database. Although taking
snapshots at state changes will reduce the redundancy in the data, these snapshots occur randomly;
thus, each context is associated with a weight value that indicates for how many intervals this reading
is valid (that is, for how long these readings are kept unchanged). The support of the pattern is defined
by the total length of non-overlapping intervals in which the pattern is valid. Mining spatio-temporal
event patterns, proposed by Roemer [12], is another attempt to link the problem of mining sensor data
to the association rule mining problem. Roemer's approach takes into consideration the
distributed nature of wireless sensor networks and proposes an in-network data mining technique to
discover frequent patterns of events with certain spatial and temporal properties. In this approach, each
sensor should be aware of the events that are within a certain distance from itself (this distance may be
a Euclidean distance or a number of hops). The sensor then collects these events and applies a mining
algorithm to discover the pattern that satisfies the given parameters. The mining parameters include a
minimum support S, a minimum confidence C, a maximum scope, and a maximum history. Each node
in the network collects the events from its neighbors within the maximum scope and keeps a history of
their events for the duration of the maximum history. After that, each node applies a mining algorithm to
discover the frequent patterns (those that have frequency exceeding the given minimum support).
Boukerche et al. [3] proposed a solution to extract the behavioral data required for mining patterns
regarding the behavior of the sensor nodes in the network (that is, the data used in the mining process
is metadata describing the nodes' activities, and it differs from the sensed data). A primary assumption
of this data extraction mechanism is that a flash memory device is attached to each sensor to store the
metadata about the sensor's behavior that will be used during the extraction process. Several
researchers have studied the cost of attaching a storage device to each sensor. The framework of
Boukerche et al. [1] [2] [3] consists of 1) a formal definition of sensor behavioral patterns and sensor
association rules, 2) a novel representation structure, referred to as the Positional Lexicographic
Tree (PLT), that is able to compress the data gathered for the mining process and thus allows the fast
and efficient mining of sensor behavioral patterns, and 3) a distributed data extraction mechanism to
prepare the data required for mining sensor behavioral patterns.
However, construction of such data structures (both the FP-tree and the PLT) requires two database
scans, which is not suitable for generating association rules from streams of sensor data. Moreover,
mining the PLT requires an extra mechanism for mapping the sensors to a vector [15]. Therefore, Tanbeer
et al. [15] proposed a prefix-tree structure, called the Sensor Pattern Tree (SP-tree for short), which is
able to capture the information with one scan over the stream of sensor data and store it in a
memory-efficient, highly compact manner, similar to the FP-tree. The main idea of the SP-tree is to obtain the
frequency of all event-detecting sensors' data and construct a prefix tree based on that in any canonical
order, then reorganize the tree in frequency-descending order. Through this reorganization the SP-tree
can keep the frequently event-detecting sensor nodes in the upper part of the tree, which, in turn,
provides high compactness in the tree structure. Once the SP-tree is constructed, the efficient
FP-growth mining technique is applied to it.
The pattern mining techniques for sensor data reported above are mainly based on the Apriori or
FP-Growth framework. Apriori-like algorithms suffer from several problems: a huge set of candidate
sequences may be generated in a large sequence database; multiple scans of the database are needed
during mining; and a combinatorially explosive number of candidates is generated when mining long
sequential patterns. FP-Growth algorithms perform well when the data set is dense. For a sparse data
set, however, the FP-tree becomes large, so FP-Growth uses more space and takes computation time
similar to that of the Apriori algorithm [4] [9].
In this paper we use an algorithm based on the Apriori-TFP structure [9], augmented for sensor data
pattern mining. It completes the summation of the final support counts, storing the results in a
second set-enumeration tree (the T-tree, of Total support counts), ordered in the opposite way to the
P-tree. The T-tree finally contains all frequent sets with their complete support counts. This algorithm
works well on both sparse and dense data sets.

3. Sensor Association Rules Mining Framework
Boukerche et al. [1][2] defined the problem of mining sensor association rules based upon the
definition of association rules proposed in the domain of transactional databases [10][14]. On this
basis, the pattern mining framework can be generalized as follows [2]:
Let $S = \{s_1, s_2, \ldots, s_m\}$ be a set of sensors in a particular sensor network. The time is divided
into equal-sized slots $\{t_1, t_2, \ldots, t_n\}$ such that $t_{i+1} - t_i = \lambda$ for all $1 \le i < n$, where
$\lambda$ is the size of each time slot, and $T = t_n - t_1$ represents the historical period of the behavioral
data defined during the data extraction process. Also, $P = \{s_1, s_2, \ldots, s_k\} \subseteq S$ is referred to
as a pattern of sensors.

Definition 1. A sensor database SDB, the behavioral data, is defined to be a set of epochs in which
each epoch is a couple $E(E_{ts}, P)$, where $P$ is a pattern of sensors that report events within the same
time slot and $E_{ts}$ is the epoch's time slot.
Definition 2. Let $P_1$ be a pattern of sensor nodes such that $P_1 \subseteq S$. We say that an epoch
$E(E_{ts}, P)$ supports $P_1$ if $P_1 \subseteq P$.
Definition 3. The frequency of the pattern $P_1$ in SDB is defined to be the number of epochs in SDB that
support it:
$Freq(P_1, SDB) = |\{E(E_{ts}, P) \in SDB \mid P_1 \subseteq P\}|$.
Definition 4. The pattern is said to be frequent if its frequency is greater than or equal to the given
minimum support.
Definition 5. Sensor association rules are defined in the form $P \rightarrow P'$, where $P \subset S$,
$P' \subset S$, and $P \cap P' = \emptyset$.
Definition 6. The frequency of the rule $(P \rightarrow P')$ represents the frequency of the pattern
$(P \cup P')$ in SDB, whereas the confidence of the rule is defined as follows:
$Conf(P \rightarrow P') = Freq(P \cup P', SDB) / Freq(P, SDB)$.
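Definitions 3 and 6 translate directly into code. The sketch below assumes that the sensor database is held in memory as a list of (time slot, set of sensors) pairs; the function names `freq` and `confidence` and the sample epochs are illustrative only and do not come from the experiments in this paper.

```python
def freq(pattern, sdb):
    """Definition 3: number of epochs whose sensor pattern contains `pattern`."""
    p = frozenset(pattern)
    return sum(1 for _ts, sensors in sdb if p <= frozenset(sensors))


def confidence(antecedent, consequent, sdb):
    """Definition 6: Conf(P -> P') = Freq(P u P', SDB) / Freq(P, SDB)."""
    a, c = frozenset(antecedent), frozenset(consequent)
    if a & c:
        # Definition 5 requires the two sides of a rule to be disjoint.
        raise ValueError("antecedent and consequent must be disjoint")
    base = freq(a, sdb)
    return freq(a | c, sdb) / base if base else 0.0


# Five epochs, each a couple (time slot, pattern of reporting sensors).
sdb = [
    (1, {"s1", "s2", "s3"}),
    (2, {"s1", "s2"}),
    (3, {"s1", "s2", "s3"}),
    (4, {"s2", "s3"}),
    (5, {"s1", "s2", "s3"}),
]
```

With these epochs, the pattern {s1, s2} is supported by four epochs and {s1, s2, s3} by three, so the rule s1 s2 → s3 has frequency 3 and confidence 3/4.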
4. Efficient Algorithm for Mining Association Rules in Sensor Data
4.1. Data Structures

This section describes the data structures used and then presents an efficient algorithm for mining
association rules in sensor data. The algorithm is an augmentation of the Apriori Total-from-Partial
(TFP) algorithm [9] for sensor network data.

4.1.1. The Partial Support Tree (P-Tree)
The partial support tree solves the problem of repeatedly re-scanning the same record set. In this
approach the input data is copied into a data structure that maintains all relevant input data. The P-tree
has two advantages:
- It merges duplicated records and records with common leading substrings, thus reducing
the storage and processing requirements.
- The partial counts of the support for individual nodes within the tree are accumulated as the
tree is constructed.
The overall structure of the P-tree is a compressed set-enumeration tree. To construct a P-tree, the data
is scanned record by record. When complete, the P-tree contains all the item sets present as distinct
records in the input data. Each node stores an incomplete (partial) support total for its item set:
the sum of the supports stored in the subtree of the node.
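The idea of partial support can be illustrated with an uncompressed prefix tree. This is a simplified stand-in for the P-tree of [9], which additionally collapses chains of nodes; the names `PNode`, `build_p_tree`, and `total_support` are hypothetical. Each node counts the records that have the node's item set as a leading prefix, and the total support of an item set is recovered by summing the counts of the nodes at which the item set is first completed.

```python
class PNode:
    def __init__(self):
        self.count = 0      # partial support: records passing through this node
        self.children = {}  # item -> PNode


def build_p_tree(records):
    root = PNode()
    for record in records:  # single scan, record by record
        node = root
        for item in sorted(set(record)):
            node = node.children.setdefault(item, PNode())
            node.count += 1
    return root


def total_support(root, itemset):
    needed = set(itemset)
    last = max(needed)  # the item set is complete once `last` has been seen

    def walk(node, path):
        total = 0
        for item, child in node.children.items():
            if item > last:
                continue  # items are sorted, so `last` cannot occur below here
            new_path = path | {item}
            if item == last:
                if needed <= new_path:
                    total += child.count
            else:
                total += walk(child, new_path)
        return total

    return walk(root, frozenset())
```

For the records {a, b}, {a, b}, {a, b, c}, and {b, c}, the top-level node for b holds a partial support of 1, while `total_support(root, ["b"])` sums the counts of both b nodes and yields the total support 4.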

4.1.2. The Total Support Tree (T- Tree)
The total support tree uses the downward closure property of item sets: 'if any given item set I is not
large, any superset of I will also not be large'. This can be used effectively to avoid the need to
generate and compute support for all combinations in the input data. The approach requires:
1) A number of passes over the data set,
2) The construction of candidate sets to be counted in the next pass, and
3) The verification of the validity of the data set.
The algorithm to compute frequent patterns using the total and partial support trees is as follows. The
data structure for the P-tree is a linked representation, defined in the following format:
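class participatory
{
    Node code;          // code of the sensor node stored in this tree node
    Time stamp;         // time stamp of the record
    Child reference;    // reference to the child node
    Sibling reference;  // reference to the sibling node
}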
For a database of m records, stage 1 of the algorithm (A1) performs m support-count increments in
a single pass, to compute a total of m' partial supports, for some m' ≤ m, in the given timestamps. The
second stage of the algorithm (A2) involves, for each of these, the examination of its subsets that are
members of the target set T. In an exhaustive version of the method, T will be the full set of subsets of
I, the set of n items. Computing via summation of partial supports, however, offers three potential
advantages, which are discussed after Algorithm 1.

Algorithm 1: Algorithm with total and partial support
Inputs: Transaction DS, count set P
Output: Returns P and T counting sets in DS
Method:
A1:
1. For each record j in the database {
2. If j is already present in the count set P
3. add 1 to P_j
4. Else insert j into P as a new record with P_j = 1
5. }
A2:
1. For each distinct record j in P
2. For each set i in T with i ⊆ j {
3. add P_j to T_i // added to total support tree
4. }
5. end;
Firstly, when n is small ($2^n \ll m$), A2 involves the summation of a set of counts that is
significantly smaller than a summation over the whole database. Secondly, even for large n, if the
database contains a high degree of duplication ($m' \ll m$), then the stage 2 summation will again be
significantly faster than a full database pass, especially if the duplicated records are densely populated
with attributes. Finally, and most generally, we may use stage A1 to organize the partial counts in a
way that facilitates a more efficient stage 2 computation, exploiting the structural relationships
inherent in the lattice of partial supports [4][9].
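The two stages can be sketched with plain dictionaries in place of the tree structures. The sketch assumes that records fit in memory and that the target set T is given explicitly; the function names `stage_a1` and `stage_a2` are hypothetical, and the flat dictionary does not exploit the lattice structure mentioned in the third advantage.

```python
from collections import Counter


def stage_a1(records):
    """Single pass: m increments that yield m' <= m partial supports."""
    return Counter(frozenset(r) for r in records)


def stage_a2(partial, targets):
    """Add the partial support of each distinct record to every target it contains."""
    total = {frozenset(t): 0 for t in targets}
    for record, count in partial.items():
        for t in total:
            if t <= record:
                total[t] += count
    return total


# m = 6 records, but only m' = 3 distinct ones.
records = [{"s1", "s2"}] * 3 + [{"s1", "s2", "s3"}] * 2 + [{"s3"}]
partial = stage_a1(records)
totals = stage_a2(partial, [{"s1"}, {"s1", "s2"}, {"s3"}])
```

Here stage A2 iterates over three distinct records instead of six, which is the second advantage described above in miniature.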
4.2. Algorithm for generating Sensor Association Rule using TFP Tree

The proposed algorithm first extracts the data for a particular interval from the whole sensor data set
and then applies the TFP mining approach to find the frequent item sets in that user-defined time interval.
Algorithm 2: Pattern mining for sensor data with total and partial support tree.
Input : A sensor database D, specified temporal pattern e_0
Output : Frequent item sets from the sensor database and database table DT.
Method :
1. Set pointer to the first record of the database
2. Scan the database record by record and follow step (3)
3. {
4. If (record p is covered by e_0) do
5. Insert into DT {item set}
6. }
7. Set K = 1
8. Build level K in the T-tree.
9. Walk the P-tree, applying algorithm TFP to add the interim supports associated with individual
P-tree nodes to the level K nodes established in (8)
10. Remove any level K T-tree nodes that do not have an adequate level of support.
11. Increase K by 1.
12. Repeat steps (8), (9), (10), (11) until a level K is reached where no nodes are adequately
supported.
In the above algorithm, steps (1) to (6) find the item sets that occur in the valid time
period specified by the calendar schema. Steps (7) to (12) mine the frequent item sets from the TFP
tree.
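A compact sketch of Algorithm 2 follows. It makes several simplifying assumptions: the temporal pattern is reduced to a closed interval [start, end] of time slots, the P-tree is replaced by a flat table of distinct records with their partial supports, and each T-tree level is held as a set of candidates. The function name `mine_interval` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_interval(sdb, start, end, min_support):
    # Steps 1-6: keep only the epochs whose time slot lies in [start, end].
    dt = [frozenset(sensors) for ts, sensors in sdb if start <= ts <= end]
    # Partial supports of the distinct records (flat stand-in for the P-tree).
    partial = Counter(dt)
    frequent = {}
    # Steps 7-8: level K = 1 holds the single sensors.
    k = 1
    level = {frozenset([s]) for rec in partial for s in rec}
    while level:
        # Step 9: add the partial supports to the level-K nodes.
        support = {c: sum(n for rec, n in partial.items() if c <= rec)
                   for c in level}
        # Step 10: remove nodes without adequate support.
        kept = {c: n for c, n in support.items() if n >= min_support}
        frequent.update(kept)
        # Steps 11-12: build level K+1 from the surviving nodes.
        k += 1
        level = {a | b for a in kept for b in kept if len(a | b) == k}
        level = {c for c in level
                 if all(frozenset(s) in kept for s in combinations(c, k - 1))}
    return frequent
```

The loop ends when a level has no adequately supported nodes, because no candidates for the next level can then be built.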
4.3. Experimental Results
To evaluate the performance of our algorithms and optimization techniques, we performed a series of
experiments. The synthetic data sets carry temporal information and were generated so that the
algorithms could be evaluated on data sets with different characteristics. The algorithm is implemented
in Java and was run on the Windows 7 operating system. No other application was running during the
performance evaluation. We used the Dodgers loop sensor data [7] and a synthetic data set [3][17]
generated by a simulator for the experiments. Figure 1 shows the experimental results on the dense
data set. We can observe that the PLT algorithm takes more time than SP-Growth and TFP, while
SP-Growth and TFP give comparable performance. The experimental results on the sparse data are
shown in Figure 2. In this experiment, PLT and SP-Growth take comparable time, and TFP is
more efficient than both PLT and SP-Growth.

Figure 1: Support values versus Time for Dense Data.

Figure 2: Support values versus Time for Sparse Data.

5. Conclusion
In this paper, we have performed a systematic study of mining sensor data patterns in large
sensor network databases and developed a tree-based approach for efficient and scalable sensor
pattern mining. Instead of refining the Apriori-like candidate generate-and-test approach, a P-tree
structure is proposed. The experimental results we have reported here show that the tree
partitioning method described is extremely effective in limiting the maximal memory requirements of
the algorithm, while its execution time scales only slowly and linearly with increasing data dimensions.
Its overall performance, both in execution time and especially in memory requirements, is significantly
better than that obtained from either simple data segmentation or from the other methods considered. The
advantage increases with increasing density of data and with reduced support thresholds, i.e., for the
cases that are in general most challenging for association rule mining.

6. References
[1] A. Boukerche and S. Samarah, "A Novel Algorithm for Mining Association Rules in Wireless
Ad Hoc Sensor Networks", IEEE Transactions on Parallel and Distributed Systems, vol. 19, no.
7, 2008, pp. 865-877.
[2] A. Boukerche, S. Samarah, and H. Harbi, "Knowledge Discovery in Wireless Sensor Networks
for Chronological Patterns", Proc. of 33rd IEEE Conf. on Local Computer Networks (LCN'08),
pp. 667-673, 2008.
[3] A. Boukerche and S. Samarah, "An Efficient Data Extraction Mechanism for Mining
Association Rules from Wireless Sensor Networks", Proc. IEEE International Conference on
Communications, 2007, pp. 3936-3941.
[4] A. Akasapu, L. K. Sharma and G. Ramakrishana, "Efficient Trajectory Pattern Mining for both
Sparse and Dense Dataset", Int. J. of Computer Applications, Vol. 9(5), 2010, pp. 45-48.
[5] B. Chikhaoui, S. Wang, H. Pigot, "A New Algorithm Based On Sequential Pattern Mining For
Person Identification In Ubiquitous Environments", Proc. of the Fourth Int. Workshop on
Knowledge Discovery from Sensor Data (ACM SensorKDD'10), Washington, DC, July 25-28,
pp. 20-28, 2010.
[6] D. Lymberopoulos, A. Bamis, and A. Savvides, "A Methodology for Extracting Temporal
Properties from Sensor Network Data Streams", ACM MobiSys'09, June 22-25, Kraków,
Poland, 2009.
[7] Dodgers loop sensor data, http://pems.eecs.berkeley.edu.
[8] E. Salah, R. Pauwels, R. Tavenard and T. Gevers, "T-Patterns Revisited: Mining for Temporal
Patterns in Sensor Data", Sensors 2010, vol. 10, pp. 7496-7513.
[9] F. Coenen, P. Leng and S. Ahmed, "Data Structure for Association Rule Mining: T-Trees
and P-Trees", IEEE Transactions on Knowledge and Data Engineering, Vol. 16(6), 2004, pp.
774-778.
[10] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation", Proc. the
2000 ACM SIGMOD Int. Conference on Management of Data, 2000, pp. 1-12.
[11] K. K. Loo, I. Tong, B. Kao, and D. Cheung, "Online Algorithms for Mining Inter-Stream
Associations from Large Sensor Networks", Proc. Ninth Pacific-Asia Conf. Knowledge
Discovery and Data Mining (PAKDD'05), May 2005, pp. 143-149.
[12] K. Roemer, "Distributed Mining of Spatio-Temporal Event Patterns in Sensor Networks", Proc.
of EAWMS'06, June 2006.
[13] M. Rajpoot and L. K. Sharma, "Comparative Study of Association Rule Mining for Sensor
Data", Int. J. of Computer Applications, Vol. 19(1), 2011, pp. 34-36.
[14] R. Agrawal, T. Imielinski, and A. N. Swami, "Mining Association Rules Between Sets of Items
in Large Databases", Proc. ACM SIGMOD Conference on Management of Data, 1993, pp. 207-
216.
[15] S. K. Tanbeer, F. A. Chowdhury, B. S. Jeong, Y. K. Lee, "Efficient Mining of Association Rules
from Wireless Sensor Networks", Int. Conf. ACT 2009, Feb. 15-18, 2009, ISBN 978-89-5519-
139-4, pp. 719-724.
[16] S. Samarah and A. Boukerche, "Target Association Rules: A New Behavioral Patterns for Point
of Coverage Wireless Sensor Networks", IEEE Trans. on Computers, Vol. 60, No. 6, pp. 879-889,
2011.
[17] S. Samarah, A. Boukerche, and Y. Ren, "Coverage-based Sensor Association Rules for Wireless
Vehicular Ad hoc and Sensor Networks", Proc. of IEEE GLOBECOM, pp. 1-5, 2008.
[18] V. S. Tseng and K. W. Lin, "Mining Temporal Moving Patterns in Object Tracking Sensor
Networks", Proc. of Int. Workshop on Ubiquitous Data Management (UDM'05),
0-7695-2411-7/05 IEEE, pp. 1-8, 2005.
Authors Profile

Netreshwari Sharma completed a Master of Science in Information Technology at Guru
Ghasidas Central University, Bilaspur (CG), India. She is currently pursuing an M.Tech. in
Computer Science and Engineering at Chhattisgarh Swami Vivekananda Technical
University, Bhilai (CG), India. Her research interests include data models and abstract
semantic descriptions for sensor and trajectory data.

Kapil Kumar Nagwanshi is an active member of IEEE and the IEEE Computer Society. He
has been a lifetime member of the Computer Society of India and the International Association of
Computer Science & Information Technology since 2009. He is also a member of the International
Association of Engineers (IAENG) and the Computer Science Teachers' Association (CSTA-ACM).
Currently he is working as a Reader at RCET Bhilai. His research areas include
signal processing, image processing, parallel computing, and information systems and
security.

Dr. Lokesh Kumar Sharma received his Ph.D. degree from Pt. Ravishankar Shukla
University, Raipur (CG), India. Dr. Sharma is a DAAD Fellow (Germany). He is currently
working as Head of the Department of Computer Science and Engineering at Rungta College of
Engineering and Technology, Bhilai (CG). He has 11 years of overall teaching experience.
He is a member of CSI, ACM, IAE, etc. His major research interests are in data
mining, spatial data mining, and mobile communication.

