
Temporal Data Mining: Time Series Analysis and Time-lag Detection

Cameron Hunter

School of Information Technology

James Cook University, Townsville

Cameron.hunter@jcu.edu.au

Abstract

Temporal data mining, and in particular time series analysis, has seen a great deal of interest in the past 15 years and is used to discover trends and patterns in temporal sequences from domains ranging from engineering to finance and medicine. These techniques provide a means of knowledge discovery as well as an accurate means of predicting future trends. However, relatively little research has been conducted into determining the level of influence between two variables in a temporal system. This project aims to develop a method for discovering temporal relationships between variables and the period of time over which one variable influences another.

1. Introduction

The rapid increase of stored data in our world, and the information contained within it, has given birth to the relatively new discipline of data mining. Traditional statistical data analysis involves creating mathematical models and applying them to datasets in order to identify patterns or trends. Data mining differs from this approach in that it explores the dataset to discover hidden relationships and patterns, which can in turn be used to create a model for the data. The benefit of this approach is that much larger datasets can be mined to discover patterns and trends that would most probably have been missed by traditional analysis. Numerous data mining techniques and algorithms exist, and the suitability of each depends on the application and the requirements of the end user.
Clustering, classification and association rule mining form three of the major subfields of data mining, and techniques within each of these disciplines may be applied to carry out temporal data mining. However, in many cases temporal data is treated as an unordered sequence of events, ignoring the temporal aspect of the data [1]. Clustering or classification of time series data without respect to time may ultimately produce meaningless results. Time series analysis is another subfield of data mining which deals directly with finding relationships and trends in temporal sequences.

1.1 Applications

Temporal sequences appear in a vast range of domains, from engineering to medicine and finance, and the ability to extract information from this data is crucial to the advancement of our society [1]. Time series analysis may be used to find trends and patterns in data such as stock market fluctuations so that accurate predictions of future behaviour can be made. The technique can be applied to any time series data, such as weather readings or medical condition monitoring. One of the first, and perhaps the most important, time series analysis algorithms was introduced by Agrawal et al. [2] in 1993 and formed the basis for many of today's time series data mining algorithms.

1.2 The Project

The goal of this project was to take hydrodynamic data from a coral reef off the coast of Queensland
and compare and contrast different variables such as turbidity, current, pressure and wave height to
see if any temporal relationships could be discovered. Traditional time series analysis involves taking
a sample of a temporal sequence and comparing it to historical data in an attempt to identify similar
trends and patterns. For instance, a stockbroker may want to find days in the past year which followed a stock trend similar to the present one, in order to try to predict the future behaviour of the stock.

The goal of this project differs slightly in that the aim is to compare different variables at the same time to see if there is a relationship, as well as shifting one variable temporally to see if there is a time-lag relationship between the variables; for example, variable 2 follows the trends of variable 1, but after a time-lag of two hours. Once relationships are established they can be analysed and used to build a hydrodynamic model for the reef.

1.3 The Data Set

The dataset used for this project was a hydrodynamic dataset obtained from sensor readings on a
coral reef off the coast of Queensland. The dataset includes turbidity levels, current speed, wave
height, pressure (tidal) and wind speed and direction. This dataset together with other collected
data is to be used to understand spatial and temporal hydrodynamic patterns on inshore turbid zone
reefs.

1.4 Pre-processing

The dataset contained a number of noise points for each variable. In numerical datasets a noise point is relatively easy to detect and remove if its magnitude is impossibly high or low; however, if the noise point lies within the range of possible values it is virtually undetectable. Noise points with impossibly high values, caused by sensor malfunction, were removed from the dataset before analysis and replaced with the average of the surrounding values.
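
A minimal sketch of this cleaning step is given below; the plausibility threshold maxPlausible is a hypothetical parameter, as the actual cut-off depends on the sensor in question.

    /**
     * Replaces impossibly high sensor readings with the average of their
     * immediate neighbours. maxPlausible is a hypothetical threshold;
     * the real cut-off is sensor specific. End points are left untouched.
     */
    static double[] removeNoise(double[] values, double maxPlausible) {
        double[] cleaned = values.clone();
        for (int i = 1; i < cleaned.length - 1; i++) {
            if (cleaned[i] > maxPlausible) {
                // Replace the noise point with the mean of its neighbours.
                cleaned[i] = (cleaned[i - 1] + cleaned[i + 1]) / 2.0;
            }
        }
        return cleaned;
    }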

2 Related Work

Since the classic paper by Agrawal et al. in 1993 there has been an explosion of interest in
mining time series data [9]. Presented below is a brief review of some of the more common
algorithms that have been developed for the purpose of temporal sequence analysis.

2.1 Time Series Analysis

The algorithm proposed by [2] involves taking a query sequence and searching a collection of numerical sequences for sequences that are similar to it. The comparison may look for exact matches or for matches within distance ε, where ε is a distance parameter that controls when two sequences should be considered similar. Approximate matching is an important feature of time series analysis, as identical matches may be uncommon while matches with very high similarity are quite common. The similarity threshold is perhaps the most important parameter of a time series algorithm, as it controls the quality and quantity of the results: if the similarity threshold is set too low the algorithm will return meaningless results; set too high, it will return few or no results at all.

Before a similarity measure can be applied, the numerical sequences must first be processed. [2] proposes extracting k features from every sequence, mapping them to k-dimensional space and then using a multidimensional index to store and search these points. For this to work without false dismissals, the Euclidean distance in k-dimensional space must be less than or equal to the true distance between the two sequences. This may be achieved using a Discrete Fourier Transform (DFT), which maps a numerical sequence from the time domain to the frequency domain.

2.1.1 Discrete Fourier Transforms

Discrete Fourier Transforms are used in signal processing in order to provide a better understanding of the underlying frequencies in a signal [6]. A DFT decomposes a numerical sequence into its constituent frequency components; for most real-world signals the bulk of the signal's energy is concentrated in the first few (low-frequency) coefficients, so only these need to be retained, while the remaining coefficients contribute little and can be treated as noise.

A brief description of a discrete Fourier transform is given below; the n-point DFT of a sequence x and its inverse are defined as

$X_f = \frac{1}{\sqrt{n}} \sum_{t=0}^{n-1} x_t \, e^{-2\pi i f t / n}, \quad f = 0, 1, \ldots, n-1$   (1)

$x_t = \frac{1}{\sqrt{n}} \sum_{f=0}^{n-1} X_f \, e^{2\pi i f t / n}, \quad t = 0, 1, \ldots, n-1$   (2)

Further information on the subject can be found in any digital signal processing textbook such as [3].
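
As an illustration of the feature-extraction step, the sketch below computes the first k DFT coefficients of a sequence naively in O(nk) time. It is a sketch under the definitions above, not the implementation of [2].

    /**
     * Computes the first k coefficients of the DFT of x (naive O(n*k) version).
     * Returns an array of length 2*k holding (real, imaginary) pairs.
     */
    static double[] dftFeatures(double[] x, int k) {
        int n = x.length;
        double[] features = new double[2 * k];
        for (int f = 0; f < k; f++) {
            double re = 0.0, im = 0.0;
            for (int t = 0; t < n; t++) {
                double angle = -2.0 * Math.PI * f * t / n;
                re += x[t] * Math.cos(angle);
                im += x[t] * Math.sin(angle);
            }
            // The 1/sqrt(n) normalisation of equation (1) preserves energy
            // (Parseval), so Euclidean distances between these truncated
            // feature vectors lower-bound the true sequence distances.
            features[2 * f] = re / Math.sqrt(n);
            features[2 * f + 1] = im / Math.sqrt(n);
        }
        return features;
    }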

2.1.2 Wavelet Transforms

Another technique, proposed by Chan and Fu [4], is the use of a wavelet transform instead of a Fourier transform. The algorithm proposed by [4] is similar in its implementation but differs in the type of signal transform employed. The authors note that Discrete Wavelet Transforms (DWT) have been effective in replacing Discrete Fourier Transforms in many applications, including signal processing. The advantage of the DWT is its multi-resolution representation of signals: it has the time-frequency localisation property, meaning it can give locations in both time and frequency, so wavelet representations contain more information than those of the DFT. A brief description of a wavelet transform can be found in [4], and further information may be found in any contemporary signal processing textbook [10]. Popivanov and Miller [5] note a further benefit of the DWT over the DFT: the wavelet transform can be computed in O(n) time, whereas even the Fast Fourier Transform requires O(n log n), so the DWT is not only more effective but also more efficient than the DFT.
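
The sketch below illustrates the O(n) pyramid structure of the fast wavelet transform using the Haar wavelet, the simplest DWT. It is an illustrative example only; [4] and [5] consider other wavelet bases as well.

    /**
     * Full Haar discrete wavelet transform (input length must be a power
     * of two). At each level, averages go to the front of the working
     * range and details to the back; the work halves every level, so the
     * total cost is n + n/2 + n/4 + ... = O(n).
     */
    static double[] haarTransform(double[] x) {
        double[] out = x.clone();
        double[] tmp = new double[x.length];
        for (int len = x.length; len > 1; len /= 2) {
            for (int i = 0; i < len / 2; i++) {
                tmp[i] = (out[2 * i] + out[2 * i + 1]) / Math.sqrt(2.0);           // smooth (average)
                tmp[len / 2 + i] = (out[2 * i] - out[2 * i + 1]) / Math.sqrt(2.0); // detail (difference)
            }
            System.arraycopy(tmp, 0, out, 0, len);
        }
        return out;
    }
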
2.2 Similarity comparisons

There exist many similarity measures for determining how similar two objects are to each other. The suitability of each technique is domain specific, although [7] states that Euclidean distance is the optimal distance measure for similarity estimation.

2.2.1 Dot Product

The dot product is the inner product of two vectors and returns a scalar. It is suitable as a similarity measure for normalised variables and indicates which pairs of variables are more similar to each other. The dot product of two sequences B and C of length n is defined as

$B \cdot C = \sum_{i=1}^{n} b_i c_i$   (3)

2.2.2 Euclidean Distance

Euclidean distance is the square root of the sum of squared differences between two objects [8]; hence the smaller the Euclidean distance, the closer two objects are to each other. It is a suitable similarity measure for numerical multidimensional datasets and is frequently used in time series analysis. The Euclidean distance between two sequences B and C of length n is defined as

$D(B, C) = \sqrt{\sum_{i=1}^{n} (b_i - c_i)^2}$   (4)

2.2.3 Cosine Similarity

Cosine similarity is a measure frequently used in text mining. It determines the similarity between two objects by dividing the inner product of the two variables by the product of their magnitudes [8]. The result, which lies between -1 and 1 (in this project between 0 and 1), is the cosine of the angle between the two variables and hence gives a degree of similarity: a value of 1 means the variables are identical in direction, 0 means they share no similarity, and -1 means they are diametrically opposed. The cosine similarity of two sequences B and C is defined as

$\cos(B, C) = \frac{B \cdot C}{\|B\| \, \|C\|}$   (5)
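
The three measures above translate directly into code. The sketch below assumes equal-length, pre-normalised sequences; it is illustrative rather than the project's exact implementation.

    // Dot product, equation (3).
    static double dot(double[] b, double[] c) {
        double sum = 0.0;
        for (int i = 0; i < b.length; i++) sum += b[i] * c[i];
        return sum;
    }

    // Euclidean distance, equation (4).
    static double euclidean(double[] b, double[] c) {
        double sum = 0.0;
        for (int i = 0; i < b.length; i++) {
            double d = b[i] - c[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity, equation (5): inner product divided by the
    // product of the vector magnitudes.
    static double cosine(double[] b, double[] c) {
        return dot(b, c) / (Math.sqrt(dot(b, b)) * Math.sqrt(dot(c, c)));
    }
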
3 Implementation

As stated previously, the goal of this project was to discover relationships between a set of variables in a system, for instance the hydrodynamic characteristics of a coral reef. The aim was to determine which variables were most similar to each other and then to determine the degree of influence one variable had over another: for example, if turbidity and current show a high degree of similarity, how much effect did one variable have on the other, and how long did these effects take to materialise? This temporal influence is referred to in this paper as time-lag, and is defined as the time it takes for the attributes of variable B to be repeated in some variable C, given that a similarity of greater than ε exists between B and C.

Initially a simple similarity measure, the dot product of two normalised variables (B · C), was used to cluster the set of variables with respect to a variable of interest (i.e. the variable being influenced). A dot product measure is sufficient for determining which of two variables is more similar to a third: for example, if B · C is greater than B · D, then B and C are more similar than B and D. The variables must undergo some form of normalisation in order to create a level playing field for the similarity measures; a sketch of one possible normalisation is given after Figure 1. Figure 1 shows some sample results of signal clustering [9]:

Figure 1: Signal Clustering
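
The paper does not specify which normalisation was used; a common choice, sketched below, is z-score normalisation, which rescales each variable to zero mean and unit standard deviation so that the similarity measures compare shape rather than magnitude.

    /**
     * Z-score normalisation: subtract the mean and divide by the standard
     * deviation, so that all variables are on a comparable scale.
     */
    static double[] normalise(double[] x) {
        double mean = 0.0;
        for (double v : x) mean += v;
        mean /= x.length;
        double variance = 0.0;
        for (double v : x) variance += (v - mean) * (v - mean);
        double std = Math.sqrt(variance / x.length);
        double[] z = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            z[i] = (x[i] - mean) / std;   // zero mean, unit standard deviation
        }
        return z;
    }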

The drawback of using dot product similarity is that it does not provide a degree of similarity: it can tell which variables are more similar but cannot show how similar they are. To solve this problem a similarity measure such as Euclidean distance or cosine similarity may be used.

Once a variety of similarity measures had been established, it was necessary to compare two variables over time, the reasoning being that if one variable influenced another temporally then they would show higher similarity when one was shifted against the other. Initially the shift was performed by moving one variable an element at a time and recomputing the similarity measure. For example:

for shift = 0 to length(B)/2 {
    similarity[shift] = B dot C   // compare the current alignment
    shift B forward by one element
}
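
A runnable version of this loop, under the reading that values shifted past the front of C are dropped rather than rotated, might look as follows (a sketch, not the project's exact code):

    /**
     * Naive shift comparison: slide B forward one element at a time and
     * record the (unnormalised) dot product at each offset. Only the
     * first half of B is shifted, since values shifted past the front of
     * C are dropped; note that the overlap shrinks as the lag grows,
     * which biases the raw score (a per-element average would correct this).
     */
    static double[] shiftSimilarity(double[] b, double[] c) {
        int maxShift = b.length / 2;
        double[] sims = new double[maxShift];
        for (int lag = 0; lag < maxShift; lag++) {
            double sum = 0.0;
            for (int i = 0; i < b.length && i + lag < c.length; i++) {
                sum += b[i] * c[i + lag];   // dot product over the overlapping region
            }
            sims[lag] = sum;
        }
        return sims;
    }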

The first real drawback of this technique is that only half of the values of either B or C may be compared, because the values of B can no longer be used once they are shifted past the front of C; if they were rotated back to the end, any temporal relationship between B and C would be destroyed. Another drawback of this scheme is that the comparison is done over the entire range of values for each variable, meaning that only global relationships may be discovered, and smaller local relationships may be lost in the data deluge.

A solution to this problem is to use a windowing technique, whereby each variable is broken down into a user-defined number of windows which are then compared one at a time, with a similarity measure calculated per window. The benefit of this technique is that local relationships may be discovered and a more accurate time-lag can be estimated, as in the sketch below.
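
A minimal sketch of this windowed search is given below. It reuses the cosine function sketched in Section 2.2; maxLag and numWindows are user-supplied parameters (maxLag is assumed to be well below the series length), and the lag whose windows score the highest mean similarity is taken as the estimated time-lag.

    import java.util.Arrays;

    /**
     * For each candidate lag, compare aligned windows of B and C and
     * record the mean window similarity; return the best-scoring lag.
     */
    static int estimateLag(double[] b, double[] c, int maxLag, int numWindows) {
        int bestLag = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int lag = 0; lag <= maxLag; lag++) {
            int usable = Math.min(b.length, c.length - lag); // values shifted past C are dropped
            int w = usable / numWindows;                     // window width for this lag
            double score = 0.0;
            for (int win = 0; win < numWindows; win++) {
                double[] bw = Arrays.copyOfRange(b, win * w, win * w + w);
                double[] cw = Arrays.copyOfRange(c, lag + win * w, lag + win * w + w);
                score += cosine(bw, cw);                     // any similarity measure could be used here
            }
            score /= numWindows;                             // mean window similarity at this lag
            if (score > bestScore) {
                bestScore = score;
                bestLag = lag;
            }
        }
        return bestLag;
    }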

The final technique to be implemented was a Discrete Wavelet Transform. This technique was used to reduce the amount of noise in the dataset and to produce an analysis of the underlying frequencies of each variable, which could then be compared using a similarity measure.

It should also be noted that an applet was used to graph the values of each variable and the output of the similarity measures, so that a user could visually confirm the validity of the analysis. A sample output is given in the results section.
4 Results and Discussion

4.1 Similarity Comparators

Dot product as a similarity measure was suitable only for clustering pairs of variables, as it is unable to provide a degree of similarity; without a degree of similarity, a similarity threshold cannot be set.

Euclidean distance and cosine similarity were both able to provide a degree of similarity; however, cosine similarity is also able to provide a degree of dissimilarity, which in certain circumstances may be just as important, for example if a user wanted to know which variables had little or no influence on each other and thereby identify independent variables.

Euclidean distance and cosine similarity measures were both able to correctly identify similar trends in clustered variables and produce a time-lag coefficient; however, the accuracy of these results in terms of real-world trends remains unverified.

4.2 Shift Functions

The shift function is a naive way of determining the time-lag between two variables and has some major flaws, as discussed in Section 3. A sample output of a similarity measure, together with the shifted variable, is shown in Figure 2 below.

Figure 2: Sample cosine similarity output

The windowing technique provides a more accurate analysis and estimate of the time-lag between two variables, and is also faster to execute, as can be seen in Figure 4.
4.3 Wavelet Transforms

A discrete wavelet transform was implemented to reduce noise and to provide a suitable basis for comparison with other transformed data. The wavelet transform provides an accurate means of calculating the similarity between two variables; however, it is unable to produce a time-lag coefficient unless used with a suitable shift function, as the temporal aspect of the data is lost during transformation. A sample output of a wavelet transform is given in Figure 3.

Figure 3: Sample output of a wavelet transform

4.4 Execution time

Each similarity measure and shift technique was implemented with a function to calculate its execution time. The tests were performed on a single laptop computer with an Intel Core 2 Duo 1.66 GHz processor and 2 GB of RAM; the operating system was Windows XP Professional, running IntelliJ IDEA 8.0. It should be noted that operating system overheads may skew the results; however, everything possible was done to ensure a stable testing platform. A minimal sketch of how such a measurement can be taken is given below. Figure 4 illustrates the execution time for each technique when tested with an increasing number of values for each variable.
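
A minimal sketch of such a measurement in Java, assuming the shiftSimilarity function sketched earlier, is:

    // Wall-clock timing of a single run; OS overheads can still skew the
    // numbers, so in practice several runs should be averaged.
    long start = System.nanoTime();
    double[] sims = shiftSimilarity(b, c);
    long elapsedMillis = (System.nanoTime() - start) / 1000000;
    System.out.println("shiftSimilarity: " + elapsedMillis + " ms");
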
Figure 4: Execution time of each technique (dot product, cosine similarity, windowing with 25 windows, windowing with 50 windows, and wavelet transform) as the number of values per variable increases from 0 to 250.


5 Conclusion and Future Work

Although this project was able to identify temporal relationships and time-lag values for them, it was unable to verify the validity of these results. A training dataset of temporal sequence data is needed for experimentation and verification, in order to determine whether the trends obtained with these techniques are echoed in the real world. It became apparent during experimentation that no single variable was influenced by just one other variable. An area of further research associated with this topic is therefore the influence of multiple variables in a system: it is relatively simple to investigate the influence of one variable on another, but far more complex to analyse the effect multiple variables have when combined in a system. Another possible extension to this technique is the implementation of a sliding window scheme, in order to detect similar local trends and produce a more accurate time-lag estimate.
Bibliography

[1] Antunes C., Oliveira A., "Temporal Data Mining: an Overview". KDD Workshop on Temporal Data Mining, 2001.

[2] Agrawal R., Faloutsos C., Swami A., "Efficient Similarity Search In Sequence Databases". In Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms, Chicago, IL, Oct. 13–15, pp. 69–84, 1993.

[3] Oppenheim A., Schafer R., "Digital Signal Processing". Prentice-Hall, Englewood Cliffs, N.J., 1975.

[4] Chan K., Fu A. W., "Efficient Time Series Matching by Wavelets". In Proceedings of the 15th International Conference on Data Engineering (ICDE'99), 1999.

[5] Popivanov I., Miller R., "Similarity Search Over Time-Series Data Using Wavelets". In Proceedings of the 18th International Conference on Data Engineering (ICDE'02), p. 212, 2002.

[6] Faloutsos C., Ranganathan M., Manolopoulos Y., "Fast Subsequence Matching in Time-Series Databases". ACM SIGMOD Record, Volume 23, Issue 2, pp. 419–429, 1994.

[7] Gelb A., "Applied Optimal Estimation". MIT Press, 1986.

[8] Tan P.-N., Steinbach M., Kumar V., "Introduction to Data Mining". Pearson Addison Wesley, 2006.

[9] Keogh E., Kasetty S., "On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration". Data Mining and Knowledge Discovery, Volume 7, Number 4, October 2003.

[10] Burrus C. S., Gopinath R. A., Guo H., "Introduction to Wavelets and Wavelet Transforms: A Primer". Prentice-Hall, 1997.

[11] Laxman S., Sastry P. S., "A Survey of Temporal Data Mining". Sadhana, Volume 31, Part 2, April 2006.
