CHAPTER 8
Time Series Data Mining
Time series data mining is an emerging field that holds great
opportunities for converting data into information. It is intuitively
obvious that the world is filled with time series data (strictly
speaking, transactional data) such as point-of-sale (POS) data,
financial (stock market) data, and Web site data. Transactional data is
timestamped data collected over time at no particular frequency. Time
series data, however, is timestamped data collected over time at a
particular frequency. Some examples of time series data are sales per
month, trades per weekday, and Web site visits per hour. Businesses
often want to analyze and understand transactional or time series data.
Another common activity is building models to forecast future behavior.
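The difference between transactional data and time series data can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the original text; the function name `accumulate_hourly` and the sample visit times are hypothetical. It bins irregularly timestamped events into a fixed hourly frequency and fills hours without events with zero.

```python
from collections import Counter
from datetime import datetime, timedelta


def accumulate_hourly(timestamps):
    """Bin irregular event timestamps into a regular hourly count series."""
    if not timestamps:
        return []
    # Truncate each timestamp to the start of its hour and count events.
    counts = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    start, end = min(counts), max(counts)
    series = []
    hour = start
    while hour <= end:
        # Hours with no events still get an entry (count 0); this is what
        # turns transactional data into a fixed-frequency time series.
        series.append((hour, counts.get(hour, 0)))
        hour += timedelta(hours=1)
    return series


# Hypothetical Web site visits recorded at irregular times.
visits = [
    datetime(2013, 5, 1, 9, 5),
    datetime(2013, 5, 1, 9, 40),
    datetime(2013, 5, 1, 11, 15),
]
hourly_visits = accumulate_hourly(visits)
```

Here the three visits become three hourly observations, including an explicit zero for the hour in which nothing happened.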
Unlike other data discussed in this book, time series data sets
have a time dimension that must be considered. While so-called
cross-sectional data, which can be used for building predictive models,
typically features data created by or about humans, it is not unusual
to find that time series data is created by devices or machines, such as
sensors, in addition to human-generated data. The questions that arise
from dealing with vast amounts of time series are typically different
150 BIG DATA, DATA MINING, AND MACHINE LEARNING
from those discussed earlier. Some of these questions will
be discussed in this chapter.
It is important to note that time series data is a large component of
the big data now being generated. There is a great opportunity to
enhance data marts used for predictive modeling with attributes derived
from time series data, which can improve the accuracy of models. This
new availability of data and the potential improvement in models are
very similar to the idea of using unstructured data in the context of
predictive modeling.
There are many different methods available for the analyst who
wants to take advantage of time series data mining. High-performance
data mining and big data analytics will certainly boost interest in,
and success stories from, this exciting and evolving area. After all,
we are not necessarily finding lots of new attributes in big data but
attributes similar to those that already existed, measured more
frequently. In other words, the time dimension will rise in importance.
In this chapter, we focus on two important areas of time series data
mining: dimension reduction and pattern detection.
REDUCING DIMENSIONALITY
DETECTING PATTERNS
The second area is pattern detection in time series data. This area can
be divided into two subareas: finding patterns within a time series and
finding patterns across many time series. These techniques are used
Fraud Detection
Credit card providers are using similarity analysis to automate the
detection of fraudulent behavior in financial transactions. They are
interested in spotting exceptions to average behavior by comparing many
detailed time series against a known pattern of abusive behavior.
1 Nike was great about replacing the device that was covered by warranty.
Seasonal Analysis
[Figure: calories by time of day t (0:00 to 24:00)]
560,000 observations. One of the easiest ways to explore large time
series data of this kind is seasonal analysis. Seasonal data can be
summarized by a seasonal index. For example, Figure 8.1 was obtained by
hourly seasonal analysis.
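As a rough illustration of hourly seasonal analysis, the sketch below averages all readings that fall in the same hour of the day and then expresses each hourly mean as a ratio to the overall mean. This ratio is one common definition of a seasonal index; the source does not state which definition was used for Figure 8.1, so treat this as an assumption. The function names and the sample readings are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def hourly_seasonal_profile(observations):
    """Average the values that share an hour of day.

    observations: iterable of (hour_of_day, value) pairs, hour in 0..23.
    Returns {hour: mean value} for every hour that occurs in the data.
    """
    by_hour = defaultdict(list)
    for hour, value in observations:
        by_hour[hour].append(value)
    return {hour: mean(values) for hour, values in sorted(by_hour.items())}


def seasonal_index(profile):
    """Express each hourly mean as a ratio to the mean of the hourly means."""
    overall = mean(list(profile.values()))
    return {hour: value / overall for hour, value in profile.items()}


# Hypothetical readings: (hour of day, calories burned in that reading).
readings = [(6, 2.0), (6, 4.0), (9, 9.0), (9, 11.0), (23, 1.0), (23, 3.0)]
profile = hourly_seasonal_profile(readings)
index = seasonal_index(profile)
```

An index above 1 marks an hour that is more active than average, and an index below 1 marks a quieter hour.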
From Figure 8.1, you can see that my day is rather routine. I
usually become active shortly after 6:00 AM and ramp up my activity
level until just before 9:00 AM, when I generally arrive at
work. I then keep the same activity level through the business day
and into the evening. Around 7:00 PM, I begin to ramp down my
activity levels until I retire for the evening, which appears to
happen around 12:00 to 1:00 AM. This is a very accurate picture of my
typical day at a macro level. I have an alarm set for 6:30 AM on
weekdays, and with four young children I am up by 7:00 AM most
weekend mornings. I have a predictable morning routine that used
to involve a five-mile run. The workday usually includes a break
for lunch and a brief walk, which show up as a small spike around
1:00 PM. My evening activity ramps down at 7:00 PM because that is
when the bedtime routine starts for my children.
[Figure: calories by month, Sep 2012 to Nov 2013]
Trend Analysis
Figure 8.2 shows my activity levels as a trend over 13 months. You can
see the initial increase in activity, as being able to track my
activity level gamified it. There is also the typical spike in activity
that often occurs at the beginning of a new year, with the commitment
to be more active. There is another spike in early May, as the weather
began to warm up and my children participated in soccer, football, and
other leagues that got me on the field and more active. In addition, the
days were longer and allowed for more outside time. The trend stays
high through the summer.
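One simple way to obtain a trend like this from daily data is to average the daily totals within each calendar month. The source does not say how the trend in Figure 8.2 was computed, so the sketch below is an assumption, and the function name `monthly_trend` and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_trend(daily_totals):
    """Average daily totals within each (year, month) bucket.

    daily_totals: mapping of datetime.date to a numeric daily total.
    Returns {(year, month): mean daily total}, ordered chronologically.
    """
    by_month = defaultdict(list)
    for day, total in daily_totals.items():
        by_month[(day.year, day.month)].append(total)
    return {month: mean(values) for month, values in sorted(by_month.items())}


# Hypothetical daily calorie totals around a new year.
daily = {
    date(2012, 12, 30): 400.0,
    date(2012, 12, 31): 420.0,
    date(2013, 1, 1): 500.0,
    date(2013, 1, 2): 540.0,
}
trend = monthly_trend(daily)
```

Averaging within months smooths out day-to-day variation, so slower movements such as a new-year spike become easier to see.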
Similarity Analysis
Let us look at the data in a different way. Since the FuelBand is a
continuous biosensor, an abnormal daily pattern can be detected (if any
exists). This is the similarity search problem between target and input
series that I mentioned earlier. I used the first two months of data to
make a query (target) series, which is an hourly sequence averaged
over two months. In other words, the query sequence is my average
hourly pattern in a day. For example, Figure 8.3 shows my hourly
pattern of calories burned averaged from 09SEP2012 to 09NOV2012.
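Building a query series of this kind amounts to averaging the baseline days position by position. The following is a minimal sketch under stated assumptions: each day is a list of equal length (24 hourly values in practice, shortened to 4 slots here for readability), and the name `average_daily_pattern` and the sample values are hypothetical.

```python
from statistics import mean


def average_daily_pattern(daily_series):
    """Average several equal-length daily sequences slot by slot."""
    days = list(daily_series)
    if not days:
        raise ValueError("need at least one baseline day")
    length = len(days[0])
    if any(len(day) != length for day in days):
        raise ValueError("all daily sequences must have the same length")
    # zip(*days) groups the values that belong to the same time slot.
    return [mean(values) for values in zip(*days)]


# Three hypothetical baseline days, shortened to 4 time slots each.
baseline_days = [
    [0.0, 10.0, 8.0, 2.0],
    [0.0, 12.0, 6.0, 4.0],
    [3.0, 8.0, 7.0, 0.0],
]
query = average_daily_pattern(baseline_days)
```

The resulting list plays the role of the query (target) sequence against which later days are compared.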
Using the SIMILARITY procedure in SAS/ETS to find my
abnormal day based on calorie burn data.
Using dynamic time warping methods to measure the distance
between two sequences, I found the five most abnormal days by
comparing the initial pattern from the first two months of data with
about 300 daily series from 10 Nov. 2012 onward. This technique has
many applications in areas such as monitoring customer traffic to a
retail store or traffic on a company network.
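To show the idea behind the distance measure, here is a minimal textbook implementation of dynamic time warping in plain Python, followed by a ranking of days by their distance from a query pattern. It is not the SAS/ETS SIMILARITY procedure and makes no claim to match its options or output; the absolute-difference cost, the absence of a warping window, and all names and sample values are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance using absolute difference as local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def most_abnormal_days(query, daily_series, top=5):
    """Return the `top` days with the largest DTW distance from the query."""
    scored = [
        (dtw_distance(query, series), day)
        for day, series in daily_series.items()
    ]
    scored.sort(reverse=True)
    return [(day, distance) for distance, day in scored[:top]]


# Hypothetical query pattern and daily series, shortened to 4 time slots.
query = [0.0, 10.0, 8.0, 2.0]
days = {
    "typical": [0.0, 10.0, 8.0, 2.0],
    "shifted": [0.0, 0.0, 10.0, 8.0],
    "inverted": [8.0, 2.0, 0.0, 10.0],
}
ranking = most_abnormal_days(query, days, top=2)
```

Because warping allows one sequence to be stretched against the other, a day whose routine is merely shifted in time scores closer to the query than a day whose shape is entirely different.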
From Table 8.1, you see that May 1, 2013, and April 29, 2013,
were the two most abnormal days. This abnormality measure does not
necessarily indicate that I was much more or less active, but that the
pattern of my activity was the most different from the routine I had
established in the first two months. These two dates correspond to
a conference, SAS Global Forum, which is the premier conference
for SAS users. During the conference, I attended talks, met with
customers, and demonstrated new product offerings in the software
demo area.
Table 8.1 Top Five Days with the Largest Deviation in Daily Activity
As you can see from Figures 8.4 and 8.5, my conference lifestyle
is very different from my daily routine. I get up later in the
morning. I am overall less active, but my activity level is higher
later in the day.
For certain types of analysis, the abnormal behavior is of the most
interest; for others, the similarity to a pattern is of the most interest.
An example of abnormal patterns is the monitoring of chronically ill
or elderly patients. With this type of similarity analysis, homebound
seniors could be monitored remotely and then have health care staff
dispatched when patterns of activity deviate from an individual's norm.
[Figure: Calories Burned by Time of Day (0:00 to 24:00), Baseline versus 01MAY2013]
[Figure: Calories Burned by Time of Day (0:00 to 24:00), Baseline versus 29April2013]
2 The Boston Marathon is the only marathon in the United States for which
every participant must have met a qualifying time. The qualifying times are
determined by your age and gender. I hope to qualify one day.
[Figure: Calories Burned by Time of Day (0:00 to 24:00)]
3 I am a huge fan of Kaggle and joined in the first weeks after the site came online. It has
proven to be a great chance to practice my skills and ensure that the software I develop
has the features to solve a large diversity of data mining problems. For more
information on this and other data mining contests, see http://www.kaggle.com.