
Unsupervised Anomaly Detection in Sequences Using Long Short Term Memory Recurrent Neural Networks

Majid S. alDosari
April 20, 2016
George Mason University
Contents I
Introduction
The Challenge of Anomaly Detection in Sequences
Procedure
1. Sample Extraction
2. Transformation
3. Detection Technique
Proximity
Effects on Point Distribution
Data Classification
Nearest Neighbor
Clustering
1
Contents II
Models

Problems with Established Techniques

Recurrent Neural Networks (RNNs)

Using RNNs for Anomaly Detection

Conclusions

Reproducibility

Discussion Time

2
Introduction
Modern technology facilitates the capture, storage, and processing of sequential data at scale

• Data capture
• physiological signals
• network traffic
• industrial processes
• automobiles
• website navigation
• environment
• Data storage
• Hadoop
• MongoDB
• Ubiquitous computing
• cloud/cluster
• desktop
• at point of capture
3
Problem: Finding anomalous data is challenging

• data is large
• data is varied
• domain knowledge is required

4
Solution: Use recurrent neural networks to generically find
anomalous data

• This work:
1. Background: Anomaly detection in sequences
2. Sequence Modeler: Recurrent neural network (RNN)
3. Experiments: RNNs for anomaly detection
• Prior work: Malhotra et al.1, but with no emphasis on process

1. Malhotra et al., "Long Short Term Memory Networks for Anomaly Detection in Time Series".
5
The Challenge of Anomaly
Detection in Sequences
Anomaly detection work is fragmented

• variety of solutions in communication networks, biology, economics, etc.
• different settings
• no comparison between application domains
• technical basis in computer science vs. statistics
• not much review literature: Cheboli2 and Gupta3

2. Cheboli, "Anomaly Detection of Time Series".
3. Gupta et al., "Outlier Detection for Temporal Data: A Survey".
6
Define the problem to focus on the right solution

Sequence: $\mathbf{x} = \{x^{(1)}, x^{(2)}, \ldots, x^{(t)}, \ldots, x^{(T)}\}$, where $x^{(t)} \in \mathbb{R}^{v}$

Assumption: Anomalies are a small part of the data.

7
Solution must answer the following:

1. What is normal (as an anomaly is defined as what is not normal)?
2. What measure is used to indicate how anomalous point(s) are?
3. How is the measure tested to decide if it is anomalous?

8
Solution must address different anomaly types I

Simple point anomaly



9
Solution must address different anomaly types II

Anomaly in a periodic context



10
Solution must address different anomaly types III

Discord anomaly in a periodic time series



11
Solution must address different anomaly types IV

Discord anomaly in an aperiodic time series



12
Solution must address different anomaly types V

Multivariate: (a)synchronous and (a)periodic

13
Procedure
Description of the anomaly detection procedure is straightforward

1. Compute an anomaly score for an observation.
2. Aggregate the anomaly scores for many observations.
3. Use the anomaly scores to determine whether an observation can be considered anomalous.

14
Characterizing normal behavior is involved

$\mathbf{x} = \{x^{(1)}, x^{(2)}, \ldots, x^{(t)}, \ldots, x^{(T)}\}$

1. Extract Samples
2. Transform Samples
3. Apply Detection Technique

15
Procedure

1. Sample Extraction

16
Use sliding windows to obtain samples

$X = \{W_1, W_2, \ldots, W_p\}$

• hop, h
• window, w
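
To make the sliding-window extraction concrete, here is a minimal sketch (NumPy; the window width and hop values below are illustrative, not values used in the talk):

```python
import numpy as np

def sliding_windows(x, w, h):
    """Extract windows of width w from sequence x, advancing by hop h."""
    starts = range(0, len(x) - w + 1, h)          # any trailing partial window is dropped
    return np.array([x[s:s + w] for s in starts])

x = np.sin(np.linspace(0, 20 * np.pi, 1000))      # toy sequence
X = sliding_windows(x, w=50, h=10)                # X.shape == (p, w)
```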

17
Problem: Hops can skip over anomalies

sequence: abccabcabc
hop (h) Ordered Windows
1 abc, bcc, cca, cab, abc, bca, cab, abc
2 abc, cca, abc, cab
3 abc, cab, cab
4 abc, abc

18
Problem: Window size must be large enough to contain anomaly

sequence: aaabbbccccaaabbbcccaaabbbccc

Window width must be at least 4.

19
Problem: Treating window width as a dimension size ignores
temporal nature

$X = \{W_1, W_2, \ldots, W_p\}$, with each $W_i \in \mathbb{R}^{1 \times w}$

20
Procedure

2. Transformation

21
Transformation can help reveal anomalies

• Haar transform
• Symbolic Aggregate approXimation (SAX)4

4. Lin et al., "Experiencing SAX: A novel symbolic representation of time series".
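
As an illustration of the kind of transformation meant here, below is a minimal SAX-style sketch (z-normalize, piecewise aggregate, then discretize against Gaussian breakpoints); the segment count and alphabet size are illustrative choices, not values from the talk:

```python
import numpy as np
from scipy.stats import norm

def sax(x, n_segments=8, alphabet_size=4):
    """Rough SAX-style symbolization of a 1-D series x."""
    x = (x - x.mean()) / x.std()                      # z-normalize
    paa = x[: len(x) // n_segments * n_segments]      # trim to a multiple of n_segments
    paa = paa.reshape(n_segments, -1).mean(axis=1)    # piecewise aggregate approximation
    # Breakpoints that split a standard normal into equiprobable regions
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    symbols = np.searchsorted(breakpoints, paa)       # one symbol index per segment
    return "".join(chr(ord("a") + s) for s in symbols)

print(sax(np.sin(np.linspace(0, 2 * np.pi, 128))))    # prints an 8-symbol string
```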
22
Transformation is not general

• Choice of representation must be compatible with data characteristics
• normal:anomaly as transform(normal):transform(anomaly)

Study5 suggests generally little difference among representations.

5. Wang et al., "Experimental comparison of representation methods and distance measures for time series data".
23
Procedure

3. Detection Technique

24
Anomaly detection techniques and their application domains
are varied

Based on

• Segmentation
• Information Theory
• Proximity
• Modeling

25
Model and proximity-based techniques are most developed

Based on

• Segmentation: requires homogeneous segments
• Information Theory: requires finding a sensitive information-theoretic measure
• Proximity
• Modeling

26
Proximity
Idealization never occurs

[Figure: a single outlying point p1 and a normal cluster N1]

27
Practically, distributions are complicated

[Figure: points p1 and p2 relative to clusters N1 and N2]

28
Proximity

Effects on Point Distribution

29
Distance measure should be invariant to:

• length
• translation
• (skew)
• (amplitude)

Study6: not much difference among similarity measures

6. Wang et al., "Experimental comparison of representation methods and distance measures for time series data".
30
Window width needs to be chosen on the scale of the expected anomaly

If the width is too large:

• anomalous points are not distinguished
• data becomes equidistant in high-dimensional space

31
Sliding windows challenge anomaly detection assumptions

• anomalous points are not necessarily in sparse space while repeated patterns are not necessarily in dense space7
• "Clustering of Time Series Subsequences is Meaningless"8

7. Keogh, Lin, and Fu, "HOT SAX: Efficiently finding the most unusual time series subsequence".
8. Keogh et al., "Clustering of Time Series Subsequences is Meaningless".
32
Proximity

Data Classification

33
Global vs Local: Local techniques use neighborhood data

[Figure: points p1 and p2 relative to clusters N1 and N2]

34
Proximity

Nearest Neighbor

35
Overlapping windows distort data similarity

[Figure: width-3 windows of the sequence abcabcXXXabcababc extracted with h = 1 and with h = 3; with h = 1, each window has overlapping, nearly identical neighbors]
36
Solution: Use non-self matches9

[Figure: the same windows, with each window compared only against non-self (non-overlapping) matches]
9. Keogh, Lin, and Fu, "HOT SAX: Efficiently finding the most unusual time series subsequence".
37
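
A brute-force rendering of the non-self-match idea (in the spirit of HOT SAX, but without its pruning heuristics; the window width and the overlap rule below are this sketch's assumptions):

```python
import numpy as np

def discord_score(x, w):
    """For each window, distance to its nearest NON-SELF match.

    A candidate match is 'self' if it overlaps the window in time,
    so trivially similar overlapping windows are excluded.
    """
    starts = np.arange(len(x) - w + 1)
    windows = np.array([x[s:s + w] for s in starts])
    scores = np.empty(len(starts))
    for i, wi in enumerate(windows):
        non_self = np.abs(starts - starts[i]) >= w       # exclude overlapping windows
        dists = np.linalg.norm(windows[non_self] - wi, axis=1)
        scores[i] = dists.min()
    return scores                                        # largest score = most discordant window

x = np.sin(np.linspace(0, 16 * np.pi, 800))
x[400:420] += 1.0                                        # inject a discord
print(np.argmax(discord_score(x, w=40)))                 # start index of the flagged window
```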
kNN uses no local information

[Figure: points p1 and p2 relative to clusters N1 and N2]

38
Local Outlier Factor10 uses local density information

A point is likely to be an anomaly if its neighbors are in dense regions while it is in a less dense region.
[Figure: points p1 and p2 relative to clusters N1 and N2]

(Didn’t you say anomalies may not be in less dense regions?!)


10. Breunig et al., "Optics-of: Identifying local outliers".
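
For reference, a hedged sketch of how LOF is commonly applied with scikit-learn (not the tooling used in this work); n_neighbors and the toy data are illustrative:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy windowed data: rows are windows, one of them is shifted far from the rest.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[0] += 5.0

lof = LocalOutlierFactor(n_neighbors=20)   # compares each point's density with its neighbors'
labels = lof.fit_predict(X)                # -1 marks points judged less dense than their neighborhood
print(np.where(labels == -1)[0])           # indices flagged as local outliers
```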
39
Proximity

Clustering

40
Clustering algorithms are usually not designed to find anomalies

Assumptions: anomalous points

• do not belong to a cluster (DBSCAN11)
• are far from a cluster centroid
• are in less dense clusters

11. Ester et al., "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise".
41
Models
Hidden Markov Models (HMMs) are general advanced sequence
modelers

Restrictions:

• fixed length sequences
• Markovian process

42
Problems with Established
Techniques
How to determine a priori what the best algorithm is?

review papers only give subjective assessments

43
Proximity-based techniques require a lot of decisions

Choose:

• similarity measure
• sliding window size
• sliding window hop
• compatible classification technique

44
Solution: Use a model-based technique

• characterize normal
• restriction: use when data can be modeled

Ideally:

• model arbitrary time series
• minimize the effect of window length
• require as few parameters as possible

45
Recurrent Neural Networks (RNNs)
RNNs are powerful

• speech recognition
• handwriting recognition
• music generation
• text generation
• handwriting generation
• translation
• identifying non-verbal cues from speech
• image caption generation
• video to text description
• generating talking heads

46
RNNs are more flexible and efficient than HMMs

state:

• HMM: hidden state depends only on previous state
• RNN: shared state

generality:

• HMM: Markovian
• RNN: general computation device

47
Recurrence explains the efficiency of RNN encoding

[Figure: cyclic view: state s with parameters θ and a one-step (−1) delay feeding back into itself; acyclic (unrolled) view: … → s(t−1) → s(t) → s(t+1) → …, each step reading input x(t) and reusing the same parameters θ]
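
The unrolled view corresponds to one shared state-update rule applied at every step; a minimal NumPy sketch of a vanilla (tanh) RNN, with weight names invented for this sketch:

```python
import numpy as np

def rnn_states(x, W_xs, W_ss, b):
    """Run a vanilla RNN over inputs x[t] and return all hidden states s[t]."""
    s = np.zeros(W_ss.shape[0])          # s(-1), the initial state
    states = []
    for x_t in x:                        # the same parameters are reused at every time step
        s = np.tanh(W_xs @ x_t + W_ss @ s + b)
        states.append(s)
    return np.array(states)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))            # T = 100 steps of a 3-dimensional input
S = rnn_states(x, rng.normal(size=(8, 3)), rng.normal(size=(8, 8)), np.zeros(8))
```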

48
RNN computation is elaborate

[Figure: an RNN unrolled through time, with stacked state layers s1, s2, …, sl, shared weight matrices θ_xs1, θ_s1s1, …, θ_slsl, θ_slo, per-step outputs o(t), losses L(t), and targets y(t)]

49
Training RNNs is difficult

$$L(\mathbf{o}, \mathbf{y}) = \frac{1}{TV} \sum_{t} \sum_{v} \left( o^{(t)}_{v} - y^{(t)}_{v} \right)^{2}$$

• but mini-batch SGD-flavor training still works

$$\Delta\theta = -\alpha \, \frac{1}{|M|} \sum_{(\mathbf{x}_m, \mathbf{y}_m) \in M} \frac{\partial L(\mathbf{o}_m, \mathbf{y}_m)}{\partial \theta}$$

• acute vanishing gradient problem

$$\frac{\partial L^{(t)}}{\partial W_{ss}} = \sum_{i=0}^{t} \frac{\partial L^{(t)}}{\partial o^{(t)}} \, \frac{\partial o^{(t)}}{\partial s^{(t)}} \left( \prod_{j=i+1}^{t} \frac{\partial s^{(j)}}{\partial s^{(j-1)}} \right) \frac{\partial s^{(i)}}{\partial W_{ss}}$$
50
Understand vanishing gradient problem through computational
graph for T = 4

[Figure: computational graph unrolled over steps t−4 … t, with inputs x, states s, outputs o, losses L, and targets y; the gradient ∂L(t)/∂W_ss reaches earlier states only through the product of factors ∂s(t)/∂s(t−1) ⋯ ∂s(t−3)/∂s(t−4), after ∂L(t)/∂o(t) and ∂o(t)/∂s(t)]

51
Long Short Term Memory (LSTM) ‘cells’ store information but
are more complicated than vanilla RNNs’ tanh

[Figure: LSTM cell: gates f(t), i(t), o(t) and candidate g(t) are computed from s(t−1) and x(t) with σ and τ = tanh nonlinearities; the cell state c(t) is updated additively from c(t−1) and yields the new state s(t)]12

12. Colah, Understanding LSTM Networks.
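
For reference, the standard LSTM update that the figure depicts, written in the slide's notation (the weight matrices W and biases b are implicit in the figure):

$$
\begin{aligned}
f^{(t)} &= \sigma\!\left(W_{f}\,[s^{(t-1)}, x^{(t)}] + b_{f}\right) \\
i^{(t)} &= \sigma\!\left(W_{i}\,[s^{(t-1)}, x^{(t)}] + b_{i}\right) \\
g^{(t)} &= \tau\!\left(W_{g}\,[s^{(t-1)}, x^{(t)}] + b_{g}\right) \\
o^{(t)} &= \sigma\!\left(W_{o}\,[s^{(t-1)}, x^{(t)}] + b_{o}\right) \\
c^{(t)} &= f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot g^{(t)} \\
s^{(t)} &= o^{(t)} \odot \tau\!\left(c^{(t)}\right)
\end{aligned}
$$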
52
Using RNNs for Anomaly Detection
Use the same procedure on each test time series to test generality

1. sample
2. setup RNN autoencoder
3. train
4. optimize
5. evaluate anomaly scores

53
1. Sample with sliding windows of varying length to test versatility

1. spikes
2. sine
3. power demand
4. electrocardiogram (ECG)
5. polysomnography ECG (PSG-ECG)

54
2. Set up RNN autoencoder

• Set target to the (uncorrupted) input: $\mathbf{y} = \mathbf{x}$
• Add noise to the input: $\tilde{\mathbf{x}} = \mathbf{x} + \mathcal{N}\!\left(0, (0.75\,\sigma_{\mathrm{std}}(\mathbf{x}))^{2}\right)$
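
A minimal sketch of this denoising setup as a Keras-style LSTM autoencoder; the library choice, layer size, and data are this sketch's assumptions, and only the target y = x and the 0.75 σ noise level come from the slide:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, Dense

# X: windows of shape (num_windows, window_width, num_variables)
rng = np.random.default_rng(0)
X = np.sin(np.linspace(0, 40 * np.pi, 4000)).reshape(-1, 50, 1)

X_noisy = X + rng.normal(0.0, 0.75 * X.std(), size=X.shape)   # corrupt the input
Y = X                                                          # target is the clean input

model = Sequential([
    LSTM(16, return_sequences=True, input_shape=(50, 1)),      # recurrent encoder of the window
    TimeDistributed(Dense(1)),                                 # reconstruct each time step
])
model.compile(optimizer="rmsprop", loss="mse")
model.fit(X_noisy, Y, epochs=10, batch_size=32)
```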

55
3a. Train: RMSprop is an appropriate algorithm

• works with mini-batch learning as the data is highly redundant
• similar in results to second-order methods with less computational cost
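
For completeness, the RMSprop update itself can be sketched as follows (the decay rate, learning rate, and epsilon are typical illustrative values, not values reported here):

```python
import numpy as np

def rmsprop_step(theta, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
    """One RMSprop update: scale each gradient by a running RMS of its history."""
    cache = decay * cache + (1 - decay) * grad ** 2     # running average of squared gradients
    theta = theta - lr * grad / (np.sqrt(cache) + eps)  # per-parameter adaptive step
    return theta, cache

theta = np.zeros(5)
cache = np.zeros(5)
theta, cache = rmsprop_step(theta, grad=np.array([0.1, -0.2, 0.0, 0.5, -0.1]), cache=cache)
```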

56
3b. Optimize RNN hyperparameters to find the best RNN configuration

Optimize

• number of layers, l
• 'size' of each layer, n

using Bayesian optimization, which

• minimizes expensive objective function calls
• considers the stochasticity of the function
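
A hedged sketch of what this looks like with a generic Bayesian-optimization library (scikit-optimize here; the search ranges and the placeholder objective are not the talk's actual setup):

```python
from skopt import gp_minimize
from skopt.space import Integer

def validation_loss(params):
    """Train an RNN with the given hyperparameters and return its validation loss."""
    n_layers, layer_size = params
    # ... build, train, and evaluate the autoencoder here (omitted) ...
    return float(n_layers) / layer_size          # placeholder value for illustration

result = gp_minimize(
    validation_loss,
    dimensions=[Integer(1, 2, name="l"), Integer(1, 50, name="n")],
    n_calls=20,                                  # keep the number of expensive evaluations small
    random_state=0,
)
print(result.x, result.fun)                      # best (l, n) found and its loss
```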

57
Optimization of spike-1

[Figure: final validation loss Lv for candidate layer sizes n with l ∈ {1, 2} (left); training and validation loss vs. epoch (right)]

58
Optimization of spike-2

[Figure: final validation loss Lv for candidate layer sizes n with l ∈ {1, 2} (left); training and validation loss vs. epoch (right)]

59
Optimization of sine

[Figure: final validation loss Lv for candidate layer sizes n with l ∈ {1, 2} (left); training and validation loss vs. epoch (right)]

60
Optimization of power

[Figure: final validation loss Lv for candidate layer sizes n with l ∈ {1, 2} (left); training and validation loss vs. epoch (right)]

61
Optimization of ECG

[Figure: final validation loss Lv for candidate layer sizes n with l ∈ {1, 2} (left); training and validation loss vs. epoch (right)]

62
Optimization of PSG-ECG

[Figure: final validation loss Lv for candidate layer sizes n with l ∈ {1, 2} (left); training and validation loss vs. epoch (right)]

63
4. Use squared error as an anomaly score

Reconstruction Error

• individual squared error
• mean squared error of a window
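
Concretely, both scores come from the autoencoder's reconstruction of each window; a minimal self-contained sketch with stand-in arrays (variable names and shapes are illustrative):

```python
import numpy as np

def anomaly_scores(X, R):
    """Reconstruction-error anomaly scores for windows X and reconstructions R."""
    point_err = (R - X) ** 2                      # individual squared error per time step
    window_err = point_err.mean(axis=(1, 2))      # mean squared error of each window
    return point_err, window_err

# Example with stand-in arrays shaped (num_windows, window_width, num_variables):
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 50, 1))
R = X + rng.normal(0.0, 0.1, size=X.shape)        # pretend these came from the autoencoder
point_err, window_err = anomaly_scores(X, R)
flagged = np.argsort(window_err)[-4:]             # e.g. inspect the highest-scoring windows
```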

64
spike-1: atypical value detected
[Figure: test series x with point and windowed reconstruction-error scores; max and 5% levels are marked on each score panel]

65
spike-2: irregularity detected
[Figure: test series x with point and windowed reconstruction-error scores; max and 5% levels are marked on each score panel]

66
sine: discord detection inconclusive

[Figure: test series x with point and windowed reconstruction-error scores; max and 5% levels are marked on each score panel]

67
power: discord detected

[Figure: test series x with point and windowed reconstruction-error scores; max and 5% levels are marked on each score panel]

68
power: discord detected

[Figure: test series x with point and windowed reconstruction-error scores; max and 5% levels are marked on each score panel]

69
ECG: discord detected

[Figure: test series x with point and windowed reconstruction-error scores; max and 5% levels are marked on each score panel]

70
PSG-ECG: discord detected

[Figure: test series x with point and windowed reconstruction-error scores; max and 5% levels are marked on each score panel]

71
Experiment conclusion: squared reconstruction error of AE-RNNs can be used to detect anomalies

• point errors may find extreme values
• windowed errors find anomalies if the window size is on the order of the anomaly
• RNNs were insensitive to translation and length
• RNNs learned normal behaviour despite having some anomalies in the training data
• the same process found anomalies in all tests

72
Conclusions
RNNs have advantages over advanced techniques

Model-based: HMMs

• more efficient encoding
• handles varying sequence length

Proximity-based: HOT SAX

• more efficient after training
• handles multivariate data
• not forced to find an anomaly

73
Alternative method checklist

• Is only the test sequence needed to determine how anomalous it is? (Is a summary of the data stored?)
• Is it robust to the choice of window length?
• Is it invariant to translation? (Is it invariant to sliding a
window?)
• Is it fundamentally a sequence modeler?
• Can it handle multivariate sequences?
• Can the model prediction be associated with a probability?
• Can it work with unlabeled data? If not, is it robust to
anomalous training data?
• Does it require domain knowledge?

74
Main disadvantage of RNNs: computational cost

• training
• hyperparameter optimization

75
Further work is needed to strengthen the case for using RNNs
for anomaly detection

• improve the optimization
• use autocorrelation to determine the minimum window length
• accelerate training: normalization, optimum training data size
• use dropout to guard against overfitting
• experiment with RNN architectures: bi-directional RNNs, LSTM alternatives, more connections
• incorporate uncertainty
• objective comparisons with labelled data
• try multivariate series

76
Conclusion

Use RNNs to find anomalies when computational cost can be managed.

77
Reproducibility
Technology stack enables automation and reproducibility

layer                   local                  remote
application             ...                    ...
container network       Weave                  Weave
app. containerization   Docker                 Docker
operating system        CoreOS                 CoreOS
machine                 (x64)                  x64
hypervisor              VirtualBox             ...
hypervisor interface    Vagrant                AWS
host operating sys.     Windows|OS X|Linux     ...
hardware                x64                    x64

78
Reproducibility of technology stack on any machine enables
parallel processing

[Figure: local VirtualBox/Vagrant VMs ('vagrant1', 'vagrant2') and remote AWS instances (aws1, aws2), each running CoreOS and Docker with a /project volume and application containers (app1, app2, app3, app4, lib) joined by a Weave network; an init machine on the local host provides a registry and supporting services (svc1), and /project is a Vagrant share from the host (Windows|Linux|OS X)]
79
Discussion Time
