
Unsupervised Anomaly Detection in Sequences Using Long Short Term Memory Recurrent Neural Networks

Majid S. alDosari
April 20, 2016
George Mason University
Contents I
Introduction
The Challenge of Anomaly Detection in Sequences
Procedure
1. Sample Extraction
2. Transformation
3. Detection Technique
Proximity
Effects on Point Distribution
Data Classification
Nearest Neighbor
Clustering
1
Contents II
Models

Problems with Established Techniques

Recurrent Neural Networks (RNNs)

Using RNNs for Anomaly Detection

Conclusions

Reproducibility

Discussion Time

2
Introduction
Modern technology facilitates the capture, storage, and processing of sequential data at scale

• Data capture
• physiological signals
• network traffic
• industrial processes
• automobiles
• website navigation
• environment
• Data storage
• Hadoop
• MongoDB
• Ubiquitous computing
• cloud/cluster
• desktop
• at point of capture
3
Problem: Finding anomalous data is challenging

• data is large
• data is varied
• domain knowledge is required

4
Solution: Use recurrent neural networks to generically find
anomalous data

• This work:
1. Background: Anomaly detection in sequences
2. Sequence Modeler: Recurrent neural network (RNN)
3. Experiments: RNNs for anomaly detection
• Prior work: Malhotra et al.1, but with no emphasis on process

1. Malhotra et al., "Long Short Term Memory Networks for Anomaly Detection in Time Series".
5
The Challenge of Anomaly
Detection in Sequences
Anomaly detection work is fragmented

• variety of solutions in communication networks, biology, economics, etc.
• different settings
• no comparison between application domains
• technical basis in computer science vs. statistics
• not much review literature: Cheboli2 and Gupta3

2. Cheboli, "Anomaly Detection of Time Series".
3. Gupta et al., "Outlier Detection for Temporal Data: A Survey".
6
Define the problem to focus on the right solution

Sequence: $\mathbf{x} = \{x^{(1)}, x^{(2)}, \ldots, x^{(t)}, \ldots, x^{(T)}\}$, where $x^{(t)} \in \mathbb{R}^{v}$

Assumption: Anomalies are a small part of the data.

7
Solution must answer the following:

1. What is normal (as an anomaly is defined as what is not normal)?
2. What measure is used to indicate how anomalous point(s) are?
3. How is the measure tested to decide if it is anomalous?

8
Solution must address different anomaly types I

Simple point anomaly



9
Solution must address different anomaly types II

Anomaly in a periodic context



10
Solution must address different anomaly types III

Discord anomaly in a periodic time series



11
Solution must address different anomaly types IV

Discord anomaly in an aperiodic time series



12
Solution must address different anomaly types V

Multivariate: (a)synchronous and (a)periodic

13
Procedure
Description of the anomaly detection procedure is straightforward

1. Compute an anomaly score for an observation.
2. Aggregate the anomaly scores for many observations.
3. Use the anomaly scores to determine whether an observation can be considered anomalous.

14
Characterizing normal behavior is involved

$\mathbf{x} = \{x^{(1)}, x^{(2)}, \ldots, x^{(t)}, \ldots, x^{(T)}\}$

1. Extract Samples
2. Transform Samples
3. Apply Detection Technique

15
Procedure

1. Sample Extraction

16
Use sliding windows to obtain samples

$X = \{W_1, W_2, \ldots, W_p\}$

• hop, h
• window, w
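
To make the sliding-window extraction concrete, here is a minimal sketch (NumPy; the window width and hop values below are illustrative, not values used in the talk):

```python
import numpy as np

def sliding_windows(x, w, h):
    """Extract windows of width w from sequence x, advancing by hop h."""
    starts = range(0, len(x) - w + 1, h)          # any trailing partial window is dropped
    return np.array([x[s:s + w] for s in starts])

x = np.sin(np.linspace(0, 20 * np.pi, 1000))      # toy sequence
X = sliding_windows(x, w=50, h=10)                # X.shape == (p, w)
```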

17
Problem: Hops can skip over anomalies

sequence: abccabcabc
hop (h) Ordered Windows
1 abc, bcc, cca, cab, abc, bca, cab, abc
2 abc, cca, abc, cab
3 abc, cab, cab
4 abc, abc

18
Problem: Window size must be large enough to contain anomaly

sequence: aaabbbccccaaabbbcccaaabbbccc

Window width must be at least 4.

19
Problem: Treating window width as a dimension size ignores
temporal nature

$X = \{W_1, W_2, \ldots, W_p\}$, with each $W_i \in \mathbb{R}^{1 \times w}$

20
Procedure

2. Transformation

21
Transformation can help reveal anomalies

• Haar transform
• Symbolic Aggregate approXimation (SAX)4

4. Lin et al., "Experiencing SAX: A novel symbolic representation of time series".
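
As an illustration of the kind of transformation meant here, below is a minimal SAX-style sketch (z-normalize, piecewise aggregate, then discretize against Gaussian breakpoints); the segment count and alphabet size are illustrative choices, not values from the talk:

```python
import numpy as np
from scipy.stats import norm

def sax(x, n_segments=8, alphabet_size=4):
    """Rough SAX-style symbolization of a 1-D series x."""
    x = (x - x.mean()) / x.std()                      # z-normalize
    paa = x[: len(x) // n_segments * n_segments]      # trim to a multiple of n_segments
    paa = paa.reshape(n_segments, -1).mean(axis=1)    # piecewise aggregate approximation
    # Breakpoints that split a standard normal into equiprobable regions
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    symbols = np.searchsorted(breakpoints, paa)       # one symbol index per segment
    return "".join(chr(ord("a") + s) for s in symbols)

print(sax(np.sin(np.linspace(0, 2 * np.pi, 128))))    # prints an 8-symbol string
```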
22
Transformation is not general

• Choice of representation must be compatible with data characteristics
• normal:anomaly as transform(normal):transform(anomaly)

Study5 suggests generally little difference among representations.

5. Wang et al., "Experimental comparison of representation methods and distance measures for time series data".
23
Procedure

3. Detection Technique

24
Anomaly detection techniques and their application domains
are varied

Based on

• Segmentation
• Information Theory
• Proximity
• Modeling

25
Model and proximity-based techniques are most developed

Based on

• Segmentation: requires homogeneous segments
• Information Theory: requires finding a sensitive information-theoretic measure
• Proximity
• Modeling

26
Proximity
Idealization never occurs

[Figure: a single outlying point p1 and a normal cluster N1]

27
Practically, distributions are complicated

[Figure: points p1 and p2 relative to clusters N1 and N2]

28
Proximity

Effects on Point Distribution

29
Distance measure should be invariant to:

• length
• translation
• (skew)
• (amplitude)

Study6: not much difference among similarity measures

6. Wang et al., "Experimental comparison of representation methods and distance measures for time series data".
30
Window width needs to be chosen on the scale of the expected anomaly

If the width is too large:

• anomalous points are not distinguished
• data becomes equidistant in high-dimensional space

31
Sliding windows challenge anomaly detection assumptions

• anomalous points are not necessarily in sparse space while repeated patterns are not necessarily in dense space7
• "Clustering of Time Series Subsequences is Meaningless"8

7. Keogh, Lin, and Fu, "HOT SAX: Efficiently finding the most unusual time series subsequence".
8. Keogh et al., "Clustering of Time Series Subsequences is Meaningless".
32
Proximity

Data Classification

33
Global vs Local: Local techniques use neighborhood data

[Figure: points p1 and p2 relative to clusters N1 and N2]

34
Proximity

Nearest Neighbor

35
Overlapping windows distort data similarity

[Figure: width-3 windows of the sequence abcabcXXXabcababc extracted with h = 1 and with h = 3; with h = 1, each window has overlapping, nearly identical neighbors]
36
Solution: Use non-self matches9

[Figure: the same windows, with each window compared only against non-self (non-overlapping) matches]
9. Keogh, Lin, and Fu, "HOT SAX: Efficiently finding the most unusual time series subsequence".
37
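
A brute-force rendering of the non-self-match idea (in the spirit of HOT SAX, but without its pruning heuristics; the window width and the overlap rule below are this sketch's assumptions):

```python
import numpy as np

def discord_score(x, w):
    """For each window, distance to its nearest NON-SELF match.

    A candidate match is 'self' if it overlaps the window in time,
    so trivially similar overlapping windows are excluded.
    """
    starts = np.arange(len(x) - w + 1)
    windows = np.array([x[s:s + w] for s in starts])
    scores = np.empty(len(starts))
    for i, wi in enumerate(windows):
        non_self = np.abs(starts - starts[i]) >= w       # exclude overlapping windows
        dists = np.linalg.norm(windows[non_self] - wi, axis=1)
        scores[i] = dists.min()
    return scores                                        # largest score = most discordant window

x = np.sin(np.linspace(0, 16 * np.pi, 800))
x[400:420] += 1.0                                        # inject a discord
print(np.argmax(discord_score(x, w=40)))                 # start index of the flagged window
```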
kNN uses no local information

[Figure: points p1 and p2 relative to clusters N1 and N2]

38
Local Outlier Factor10 uses local density information

A point is likely to be an anomaly if its neighbors are in dense regions while it is in a less dense region.
[Figure: points p1 and p2 relative to clusters N1 and N2]

(Didn’t you say anomalies may not be in less dense regions?!)


10. Breunig et al., "Optics-of: Identifying local outliers".
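
For reference, a hedged sketch of how LOF is commonly applied with scikit-learn (not the tooling used in this work); n_neighbors and the toy data are illustrative:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy windowed data: rows are windows, one of them is shifted far from the rest.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[0] += 5.0

lof = LocalOutlierFactor(n_neighbors=20)   # compares each point's density with its neighbors'
labels = lof.fit_predict(X)                # -1 marks points judged less dense than their neighborhood
print(np.where(labels == -1)[0])           # indices flagged as local outliers
```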
39
Proximity

Clustering

40
Clustering algorithms are usually not designed to find anomalies

Assumptions: anomalous points

• do not belong to a cluster (DBSCAN11)
• are far from a cluster centroid
• are in less dense clusters

11. Ester et al., "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise".
41
Models
Hidden Markov Models (HMMs) are general advanced sequence
modelers

Restrictions:

• fixed length sequences
• Markovian process

42
Problems with Established
Techniques
How to determine a priori what the best algorithm is?

review papers only give subjective assessments

43
Proximity-based techniques require a lot of decisions

Choose:

• similarity measure
• sliding window size
• sliding window hop
• compatible classification technique

44
Solution: Use a model-based technique

• characterize normal
• restriction: use when data can be modeled

Ideally:

• model arbitrary time series
• minimize the effect of window length
• require as few parameters as possible

45
Recurrent Neural Networks (RNNs)
RNNs are powerful

• speech recognition
• handwriting recognition
• music generation
• text generation
• handwriting generation
• translation
• identifying non-verbal cues from speech
• image caption generation
• video to text description
• generating talking heads

46
RNNs are more flexible and efficient than HMMs

state:

• HMM: hidden state depends only on previous state
• RNN: shared state

generality:

• HMM: Markovian
• RNN: general computation device

47
Recurrence explains the efficiency of RNN encoding

[Figure: cyclic view: state s with parameters θ and a one-step (−1) delay feeding back into itself; acyclic (unrolled) view: … → s(t−1) → s(t) → s(t+1) → …, each step reading input x(t) and reusing the same parameters θ]
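
The unrolled view corresponds to one shared state-update rule applied at every step; a minimal NumPy sketch of a vanilla (tanh) RNN, with weight names invented for this sketch:

```python
import numpy as np

def rnn_states(x, W_xs, W_ss, b):
    """Run a vanilla RNN over inputs x[t] and return all hidden states s[t]."""
    s = np.zeros(W_ss.shape[0])          # s(-1), the initial state
    states = []
    for x_t in x:                        # the same parameters are reused at every time step
        s = np.tanh(W_xs @ x_t + W_ss @ s + b)
        states.append(s)
    return np.array(states)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))            # T = 100 steps of a 3-dimensional input
S = rnn_states(x, rng.normal(size=(8, 3)), rng.normal(size=(8, 8)), np.zeros(8))
```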

48
RNN computation is elaborate

[Figure: an RNN unrolled through time, with stacked state layers s1, s2, …, sl, shared weight matrices θ_xs1, θ_s1s1, …, θ_slsl, θ_slo, per-step outputs o(t), losses L(t), and targets y(t)]

49
Training RNNs is difficult

$$L(\mathbf{o}, \mathbf{y}) = \frac{1}{TV} \sum_{t} \sum_{v} \left( o^{(t)}_{v} - y^{(t)}_{v} \right)^{2}$$

• but mini-batch SGD-flavor training still works

$$\Delta\theta = -\alpha \, \frac{1}{|M|} \sum_{(\mathbf{x}_m, \mathbf{y}_m) \in M} \frac{\partial L(\mathbf{o}_m, \mathbf{y}_m)}{\partial \theta}$$

• acute vanishing gradient problem

$$\frac{\partial L^{(t)}}{\partial W_{ss}} = \sum_{i=0}^{t} \frac{\partial L^{(t)}}{\partial o^{(t)}} \, \frac{\partial o^{(t)}}{\partial s^{(t)}} \left( \prod_{j=i+1}^{t} \frac{\partial s^{(j)}}{\partial s^{(j-1)}} \right) \frac{\partial s^{(i)}}{\partial W_{ss}}$$
50
Understand vanishing gradient problem through computational
graph for T = 4

[Figure: computational graph unrolled over steps t−4 … t, with inputs x, states s, outputs o, losses L, and targets y; the gradient ∂L(t)/∂W_ss reaches earlier states only through the product of factors ∂s(t)/∂s(t−1) ⋯ ∂s(t−3)/∂s(t−4), after ∂L(t)/∂o(t) and ∂o(t)/∂s(t)]

51
Long Short Term Memory (LSTM) ‘cells’ store information but
are more complicated than vanilla RNNs’ tanh

[Figure: LSTM cell: gates f(t), i(t), o(t) and candidate g(t) are computed from s(t−1) and x(t) with σ and τ = tanh nonlinearities; the cell state c(t) is updated additively from c(t−1) and yields the new state s(t)]12

12. Colah, Understanding LSTM Networks.
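
For reference, the standard LSTM update that the figure depicts, written in the slide's notation (the weight matrices W and biases b are implicit in the figure):

$$
\begin{aligned}
f^{(t)} &= \sigma\!\left(W_{f}\,[s^{(t-1)}, x^{(t)}] + b_{f}\right) \\
i^{(t)} &= \sigma\!\left(W_{i}\,[s^{(t-1)}, x^{(t)}] + b_{i}\right) \\
g^{(t)} &= \tau\!\left(W_{g}\,[s^{(t-1)}, x^{(t)}] + b_{g}\right) \\
o^{(t)} &= \sigma\!\left(W_{o}\,[s^{(t-1)}, x^{(t)}] + b_{o}\right) \\
c^{(t)} &= f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot g^{(t)} \\
s^{(t)} &= o^{(t)} \odot \tau\!\left(c^{(t)}\right)
\end{aligned}
$$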
52
Using RNNs for Anomaly Detection
Use the same procedure on each test time series to test generality

1. sample
2. setup RNN autoencoder
3. train
4. optimize
5. evaluate anomaly scores

53
1. Sample with sliding windows of varying length to test versatility

1. spikes
2. sine
3. power demand
4. electrocardiogram (ECG)
5. polysomnography ECG (PSG-ECG)

54
2. Set up RNN autoencoder

• Set target to the (uncorrupted) input: $\mathbf{y} = \mathbf{x}$
• Add noise to the input: $\tilde{\mathbf{x}} = \mathbf{x} + \mathcal{N}\!\left(0, (0.75\,\sigma_{\mathrm{std}}(\mathbf{x}))^{2}\right)$
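
A minimal sketch of this denoising setup as a Keras-style LSTM autoencoder; the library choice, layer size, and data are this sketch's assumptions, and only the target y = x and the 0.75 σ noise level come from the slide:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, Dense

# X: windows of shape (num_windows, window_width, num_variables)
rng = np.random.default_rng(0)
X = np.sin(np.linspace(0, 40 * np.pi, 4000)).reshape(-1, 50, 1)

X_noisy = X + rng.normal(0.0, 0.75 * X.std(), size=X.shape)   # corrupt the input
Y = X                                                          # target is the clean input

model = Sequential([
    LSTM(16, return_sequences=True, input_shape=(50, 1)),      # recurrent encoder of the window
    TimeDistributed(Dense(1)),                                 # reconstruct each time step
])
model.compile(optimizer="rmsprop", loss="mse")
model.fit(X_noisy, Y, epochs=10, batch_size=32)
```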

55
3a. Train: RMSprop is an appropriate algorithm

• works with mini-batch learning as the data is highly redundant
• similar in results to second-order methods with less computational cost
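
For completeness, the RMSprop update itself can be sketched as follows (the decay rate, learning rate, and epsilon are typical illustrative values, not values reported here):

```python
import numpy as np

def rmsprop_step(theta, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
    """One RMSprop update: scale each gradient by a running RMS of its history."""
    cache = decay * cache + (1 - decay) * grad ** 2     # running average of squared gradients
    theta = theta - lr * grad / (np.sqrt(cache) + eps)  # per-parameter adaptive step
    return theta, cache

theta = np.zeros(5)
cache = np.zeros(5)
theta, cache = rmsprop_step(theta, grad=np.array([0.1, -0.2, 0.0, 0.5, -0.1]), cache=cache)
```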

56
3b. Optimize RNN hyperparameters to find the best RNN configuration

Optimize

• number of layers, l
• 'size' of each layer, n

using Bayesian optimization, which

• minimizes expensive objective function calls
• considers the stochasticity of the function
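
A hedged sketch of what this looks like with a generic Bayesian-optimization library (scikit-optimize here; the search ranges and the placeholder objective are not the talk's actual setup):

```python
from skopt import gp_minimize
from skopt.space import Integer

def validation_loss(params):
    """Train an RNN with the given hyperparameters and return its validation loss."""
    n_layers, layer_size = params
    # ... build, train, and evaluate the autoencoder here (omitted) ...
    return float(n_layers) / layer_size          # placeholder value for illustration

result = gp_minimize(
    validation_loss,
    dimensions=[Integer(1, 2, name="l"), Integer(1, 50, name="n")],
    n_calls=20,                                  # keep the number of expensive evaluations small
    random_state=0,
)
print(result.x, result.fun)                      # best (l, n) found and its loss
```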

57
Optimization of spike-1

[Figure: final validation loss Lv for candidate layer sizes n with l ∈ {1, 2} (left); training and validation loss vs. epoch (right)]

58
Optimization of spike-2

[Figure: final validation loss Lv for candidate layer sizes n with l ∈ {1, 2} (left); training and validation loss vs. epoch (right)]

59
Optimization of sine

[Figure: final validation loss Lv for candidate layer sizes n with l ∈ {1, 2} (left); training and validation loss vs. epoch (right)]

60
Optimization of power

[Figure: final validation loss Lv for candidate layer sizes n with l ∈ {1, 2} (left); training and validation loss vs. epoch (right)]

61
Optimization of ECG

[Figure: final validation loss Lv for candidate layer sizes n with l ∈ {1, 2} (left); training and validation loss vs. epoch (right)]

62
Optimization of PSG-ECG

[Figure: final validation loss Lv for candidate layer sizes n with l ∈ {1, 2} (left); training and validation loss vs. epoch (right)]

63
4. Use squared error as an anomaly score

Reconstruction Error

• individual squared error
• mean squared error of a window
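
Concretely, both scores come from the autoencoder's reconstruction of each window; a minimal self-contained sketch with stand-in arrays (variable names and shapes are illustrative):

```python
import numpy as np

def anomaly_scores(X, R):
    """Reconstruction-error anomaly scores for windows X and reconstructions R."""
    point_err = (R - X) ** 2                      # individual squared error per time step
    window_err = point_err.mean(axis=(1, 2))      # mean squared error of each window
    return point_err, window_err

# Example with stand-in arrays shaped (num_windows, window_width, num_variables):
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 50, 1))
R = X + rng.normal(0.0, 0.1, size=X.shape)        # pretend these came from the autoencoder
point_err, window_err = anomaly_scores(X, R)
flagged = np.argsort(window_err)[-4:]             # e.g. inspect the highest-scoring windows
```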

64
spike-1: atypical value detected
[Figure: test series x with point and windowed reconstruction-error scores; max and 5% levels are marked on each score panel]

65
spike-2: irregularity detected
[Figure: test series x with point and windowed reconstruction-error scores; max and 5% levels are marked on each score panel]

66
sine: discord detection inconclusive

[Figure: test series x with point and windowed reconstruction-error scores; max and 5% levels are marked on each score panel]

67
power: discord detected

[Figure: test series x with point and windowed reconstruction-error scores; max and 5% levels are marked on each score panel]

68
power: discord detected

[Figure: test series x with point and windowed reconstruction-error scores; max and 5% levels are marked on each score panel]

69
ECG: discord detected

[Figure: test series x with point and windowed reconstruction-error scores; max and 5% levels are marked on each score panel]

70
PSG-ECG: discord detected

[Figure: test series x with point and windowed reconstruction-error scores; max and 5% levels are marked on each score panel]

71
Experiment conclusion: squared reconstruction error of AE-RNNs can be used to detect anomalies

• point errors may find extreme values
• windowed errors find anomalies if the window size is on the order of the anomaly
• RNNs were insensitive to translation and length
• RNNs learned normal behaviour despite having some anomalies in the training data
• the same process found anomalies in all tests

72
Conclusions
RNNs have advantages over advanced techniques

Model-based: HMMs

• more efficient encoding
• handles varying sequence length

Proximity-based: HOT SAX

• more efficient after training
• handles multivariate data
• not forced to find an anomaly

73
Alternative method checklist

• Is only the test sequence needed to determine how anomalous it is? (Is a summary of the data stored?)
• Is it robust to the choice of window length?
• Is it invariant to translation? (Is it invariant to sliding a
window?)
• Is it fundamentally a sequence modeler?
• Can it handle multivariate sequences?
• Can the model prediction be associated with a probability?
• Can it work with unlabeled data? If not, is it robust to
anomalous training data?
• Does it require domain knowledge?

74
Main disadvantage of RNNs: computational cost

• training
• hyperparameter optimization

75
Further work is needed to strengthen the case for using RNNs
for anomaly detection

• improve the optimization
• use autocorrelation to determine the minimum window length
• accelerate training: normalization, optimum training data size
• use dropout to guard against overfitting
• experiment with RNN architectures: bi-directional RNNs, LSTM alternatives, more connections
• incorporate uncertainty
• objective comparisons with labelled data
• try multivariate series

76
Conclusion

Use RNNs to find anomalies when computational cost can be managed.

77
Reproducibility
Technology stack enables automation and reproducibility

layer                   local                  remote
application             ...                    ...
container network       Weave                  Weave
app. containerization   Docker                 Docker
operating system        CoreOS                 CoreOS
machine                 (x64)                  x64
hypervisor              VirtualBox             ...
hypervisor interface    Vagrant                AWS
host operating sys.     Windows|OS X|Linux     ...
hardware                x64                    x64

78
Reproducibility of technology stack on any machine enables
parallel processing

[Figure: local VirtualBox/Vagrant VMs ('vagrant1', 'vagrant2') and remote AWS instances (aws1, aws2), each running CoreOS and Docker with a /project volume and application containers (app1, app2, app3, app4, lib) joined by a Weave network; an init machine on the local host provides a registry and supporting services (svc1), and /project is a Vagrant share from the host (Windows|Linux|OS X)]
79
Discussion Time
