Majid S. alDosari
April 20, 2016
George Mason University
Contents I
Introduction
The Challenge of Anomaly Detection in Sequences
Procedure
1. Sample Extraction
2. Transformation
3. Detection Technique
Proximity
Effects on Point Distribution
Data Classification
Nearest Neighbor
Clustering
Contents II
Models
Conclusions
Reproducibility
Discussion Time
Introduction
Modern technology facilitates the capture, storage, and processing of sequential data at scale
• Data capture
• physiological signals
• network traffic
• industrial processes
• automobiles
• website navigation
• environment
• Data storage
• Hadoop
• MongoDB
• Ubiquitous computing
• cloud/cluster
• desktop
• at point of capture
Problem: Finding anomalous data is challenging
• large
• varied
• domain knowledge required
Solution: Use recurrent neural networks to generically find anomalous data
• This work:
1. Background: Anomaly detection in sequences
2. Sequence Modeler: Recurrent neural network (RNN)
3. Experiments: RNNs for anomaly detection
• Prior work: Malhotra et al.¹, but with no emphasis on process
¹ Malhotra et al., “Long Short Term Memory Networks for Anomaly Detection in Time Series”.
The Challenge of Anomaly
Detection in Sequences
Anomaly detection work is fragmented
² Cheboli, “Anomaly Detection of Time Series”.
³ Gupta et al., “Outlier Detection for Temporal Data: A Survey”.
Define the problem to focus on the right solution
Sequence $x = \{x^{(1)}, x^{(2)}, \ldots, x^{(t)}, \ldots, x^{(T)}\}$, with $x^{(t)} \in \mathbb{R}^v$
Solution must answer the following:
Solution must address different anomaly types I
Solution must address different anomaly types II
Solution must address different anomaly types III
Solution must address different anomaly types IV
Solution must address different anomaly types V
Procedure
Describing the anomaly detection procedure is straightforward
Characterizing normal behavior is involved
$x = \{x^{(1)}, x^{(2)}, \ldots, x^{(t)}, \ldots, x^{(T)}\}$
1. Extract Samples
2. Transform Samples
3. Apply Detection Technique
Procedure
1. Sample Extraction
Use sliding windows to obtain samples
$X = \{W_1, W_2, \ldots, W_p\}$
• hop, h
• window width, w
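A minimal numpy sketch of this sampling step (the function name sliding_windows and the toy input are illustrative):

    import numpy as np

    def sliding_windows(x, w, h):
        """Extract windows W_1..W_p of width w, taken every h steps, from sequence x."""
        return np.array([x[i:i + w] for i in range(0, len(x) - w + 1, h)])

    X = sliding_windows(np.arange(10), w=3, h=1)  # 8 overlapping windows of width 3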
Problem: Hops can skip over anomalies
sequence: abccabcabc (window width w = 3)

hop (h)   ordered windows
1         abc, bcc, cca, cab, abc, bca, cab, abc
2         abc, cca, abc, cab
3         abc, cab, cab
4         abc, abc
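The table above can be reproduced directly; a tiny pure-Python check (illustrative):

    seq = "abccabcabc"
    w = 3
    for h in (1, 2, 3, 4):
        print(h, [seq[i:i + w] for i in range(0, len(seq) - w + 1, h)])
    # h = 3 and h = 4 never yield a window containing the doubled 'c' anomaly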
Problem: Window size must be large enough to contain anomaly
sequence: aaabbbccccaaabbbcccaaabbbccc
Problem: Treating window width as a dimension size ignores temporal nature
$X = \{W_1, W_2, \ldots, W_p\}$, with $W \in \mathbb{R}^{1 \times w}$
Procedure
2. Transformation
Transformation can help reveal anomalies
• Haar transform
• Symbolic Aggregate approXimation (SAX)⁴
⁴ Lin et al., “Experiencing SAX: A novel symbolic representation of time series”.
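As a sketch of the first transform above, one level of the Haar transform in numpy (the averaging normalization is assumed, as is even-length input):

    import numpy as np

    def haar_level(x):
        """One Haar level: pairwise averages (approximation) and differences (detail)."""
        x = np.asarray(x, dtype=float)
        return (x[0::2] + x[1::2]) / 2.0, (x[0::2] - x[1::2]) / 2.0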
Transformation is not general
⁵ Wang et al., “Experimental comparison of representation methods and distance measures for time series data”.
Procedure
3. Detection Technique
Anomaly detection techniques and their application domains are varied
Based on
• Segmentation
• Information Theory
• Proximity
• Modeling
Model and proximity-based techniques are most developed
Proximity
Idealization never occurs
[Figure: a point p1 lying outside a single cluster N1]
Practically, distributions are complicated
[Figure: points p1, p2 relative to clusters N1, N2 of different sizes and densities]
Proximity
Distance measure should be invariant to:
• length
• translation
• (skew)
• (amplitude)
⁶ Wang et al., “Experimental comparison of representation methods and distance measures for time series data”.
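A sketch of how z-normalization provides the translation (offset) and amplitude invariances listed above (equal-length inputs are assumed):

    import numpy as np

    def znorm_dist(a, b):
        """Euclidean distance between z-normalized series: offset and scale drop out."""
        az = (a - a.mean()) / a.std()
        bz = (b - b.mean()) / b.std()
        return np.linalg.norm(az - bz)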
Window width needs to be chosen on the scale of the expected anomaly
Sliding windows challenge anomaly detection assumptions
⁷ Keogh, Lin, and Fu, “HOT SAX: Efficiently finding the most unusual time series subsequence”.
⁸ Keogh et al., “Clustering of Time Series Subsequences is Meaningless”.
Proximity
Data Classification
Global vs Local: Local techniques use neighborhood data
[Figure: points p1, p2 and clusters N1, N2 of different densities]
Proximity
Nearest Neighbor
Overlapping windows distort data similarity
[Figure: width-3 windows extracted from the sequence abcabcXXXabcababc at h = 1 and h = 3; overlapping windows share most of their symbols]
Solution: Use non-self matches⁹
[Figure: width-3 windows of abcabcXXXabcababc at h = 1 and h = 3, with self-overlapping matches excluded]
⁹ Keogh, Lin, and Fu, “HOT SAX: Efficiently finding the most unusual time series subsequence”.
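A brute-force sketch of the non-self-match rule (quadratic in the number of windows; for illustration only, under the assumption that starts holds each window's start position):

    import numpy as np

    def nn_non_self_dist(X, starts, w):
        """Distance from each window to its nearest non-self match:
        windows overlapping in position (|start_i - start_j| < w) are skipped."""
        d = np.full(len(X), np.inf)
        for i in range(len(X)):
            for j in range(len(X)):
                if abs(starts[i] - starts[j]) >= w:
                    d[i] = min(d[i], np.linalg.norm(X[i] - X[j]))
        return d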
kNN uses no local information
[Figure: points p1, p2 and clusters N1, N2 of different densities]
Local Outlier Factor¹⁰ uses local density information
[Figure: points p1, p2 and clusters N1, N2 of different densities]
¹⁰ Breunig et al., “LOF: Identifying Density-Based Local Outliers”.
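LOF is available in scikit-learn; a minimal usage sketch (the random stand-in for the window vectors is illustrative):

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    X = np.random.default_rng(0).normal(size=(100, 3))  # stand-in window vectors
    lof = LocalOutlierFactor(n_neighbors=20)
    labels = lof.fit_predict(X)             # -1 marks outliers
    scores = -lof.negative_outlier_factor_  # larger = locally more anomalous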
Clustering
Clustering algorithms are usually not designed to find anomalies
¹¹ Ester et al., “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”.
Models
Hidden Markov Models (HMMs) are general advanced sequence modelers
Restrictions:
Problems with Established Techniques
How can one determine a priori which algorithm is best?
Proximity-based techniques require a lot of decisions
Choose:
• similarity measure
• sliding window size
• sliding window hop
• compatible classification technique
Solution: Use a model-based technique
Ideally:
• characterize normal behavior
• restriction: use when the data can be modeled
Recurrent Neural Networks (RNNs)
RNNs are powerful
• speech recognition
• handwriting recognition
• music generation
• text generation
• handwriting generation
• translation
• identifying non-verbal cues from speech
• image caption generation
• video to text description
• generating talking heads
RNNs are more flexible and efficient than HMMs
state:
• HMM: a single discrete state
• RNN: distributed, continuous state
generality:
• HMM: Markovian
• RNN: general computation device
Recurrence explains the efficiency of RNN encoding
[Figure: cyclic view, state s with parameters θ and a delayed (−1) feedback edge; acyclic (unrolled) view, states s(t−1) → s(t) → s(t+1) with parameters θ shared across steps]
RNN computation is elaborate
[Figure: two-layer RNN unrolled through time, with layer states s1 and s2 and parameters θ shared across time steps]

Loss over T steps and V variables:
$L(\mathbf{o}, \mathbf{y}) = \frac{1}{TV} \sum_{t} \sum_{v} \left( o_v^{(t)} - y_v^{(t)} \right)^2$

Minibatch gradient update:
$\Delta\theta = -\alpha \frac{1}{|M|} \sum_{(\mathbf{x}_m, \mathbf{y}_m) \in M} \frac{\partial L(\mathbf{o}, \mathbf{y}_m)}{\partial \theta}$

Backpropagation through time:
$\frac{\partial L^{(t)}}{\partial W} = \sum_{i=0}^{t} \frac{\partial L^{(t)}}{\partial o^{(t)}} \frac{\partial o^{(t)}}{\partial s^{(t)}} \left( \prod_{j=i+1}^{t} \frac{\partial s^{(j)}}{\partial s^{(j-1)}} \right) \frac{\partial s^{(i)}}{\partial W}$
Understand vanishing gradient problem through computational graph for T = 4
[Figure: computational graph showing the gradient factors ∂L(t)/∂o(t) and ∂o(t)/∂s(t) propagating through states s(t−1), s(t); the LSTM cell state adds an additive path c(t−1) + c(t) from input x(t)¹²]
¹² Colah, Understanding LSTM Networks.
Using RNNs for Anomaly Detection
Use the same procedure for each test time series to test generality
1. sample
2. setup RNN autoencoder
3. train
4. optimize
5. evaluate anomaly scores
1. Sample with sliding windows of varying length to test versatility
1. spikes
2. sine
3. power demand
4. electrocardiogram (ECG)
5. polysomnography ECG (PSG-ECG)
2. Setup RNN autoencoder
$y = x$
• Add noise to the input: $\tilde{x} = x + \mathcal{N}\left(0, (0.75\,\sigma_{\text{std}}(x))^2\right)$
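A numpy sketch of this corruption step (the seed and the function name corrupt are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)  # illustrative seed

    def corrupt(x):
        """x~ = x + N(0, (0.75 * std(x))^2): the denoising-autoencoder input above."""
        x = np.asarray(x, dtype=float)
        return x + rng.normal(0.0, 0.75 * x.std(), size=x.shape)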
3a. Train: RMSprop is an appropriate algorithm
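For reference, one RMSprop step in numpy (the hyperparameter values are common defaults, not necessarily those used here):

    import numpy as np

    def rmsprop_step(theta, grad, v, lr=1e-3, rho=0.9, eps=1e-8):
        """Scale the gradient step by a running RMS of recent squared gradients."""
        v = rho * v + (1 - rho) * grad**2
        return theta - lr * grad / (np.sqrt(v) + eps), v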
3b. Optimize RNN hyperparameters to find the best RNN configuration
Optimize
• number of layers, l
• ‘size’ of each layer, n
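A sketch of the search over (l, n); train_eval is an assumed callback that trains an AE-RNN with l layers of size n and returns its validation loss Lv, and the candidate sizes are illustrative:

    import itertools

    def best_config(train_eval, layers=(1, 2), sizes=(1, 5, 10, 20)):
        """Grid search: return the (l, n) pair with the lowest validation loss."""
        return min(itertools.product(layers, sizes),
                   key=lambda ln: train_eval(*ln))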
Optimization of spike-1
[Figure: validation loss Lv vs. layer size n for l = 1, 2 (left); training and validation loss L vs. epoch (right)]
Optimization of spike-2
[Figure: validation loss Lv vs. layer size n for l = 1, 2 (left); training and validation loss L vs. epoch (right)]
Optimization of sine
[Figure: validation loss Lv vs. layer size n for l = 1, 2 (left); training and validation loss L vs. epoch (right)]
Optimization of power
[Figure: validation loss Lv vs. layer size n for l = 1, 2 (left); training and validation loss L vs. epoch (right)]
Optimization of ECG
[Figure: validation loss Lv vs. layer size n for l = 1, 2 (left); training and validation loss L vs. epoch (right)]
Optimization of PSG-ECG
[Figure: validation loss Lv vs. layer size n for l = 1, 2 (left); training and validation loss L vs. epoch (right)]
4. Use squared error as an anomaly score
Reconstruction Error
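A minimal sketch of this score (x_hat stands for the autoencoder's reconstruction of x):

    import numpy as np

    def anomaly_score(x, x_hat):
        """Squared reconstruction error per time step, summed over variables."""
        return ((np.asarray(x) - np.asarray(x_hat)) ** 2).sum(axis=-1)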
spike-1: atypical value detected
[Figure: series x with reconstruction-error panels; 'max' and '5%' mark score levels]
spike-2: irregularity detected
[Figure: series x with reconstruction-error panels; 'max' and '5%' mark score levels]
sine: discord detection inconclusive
[Figure: series x with reconstruction-error panels; 'max' and '5%' mark score levels]
power: discord detected
[Figure: series x with reconstruction-error panels; 'max' and '5%' mark score levels]
power: discord detected
[Figure: series x with reconstruction-error panels; 'max' and '5%' mark score levels]
ECG: discord detected
[Figure: series x with reconstruction-error panels; 'max' and '5%' mark score levels]
PSG-ECG: discord detected
[Figure: series x with reconstruction-error panels; 'max' and '5%' mark score levels]
Experiment conclusion: squared reconstruction error of AE-RNNs can be used to detect anomalies
Conclusions
RNNs have advantages over advanced techniques
Model-based: HMMs
Alternative method checklist
Main disadvantage of RNNs: computational cost
• training
• hyperparameter optimization
Further work is needed to strengthen the case for using RNNs for anomaly detection
• optimize more thoroughly
• use autocorrelation to determine minimum window length
• accelerate training: normalization, optimum training data size
• use dropout to guard against overfitting
• experiment with RNN architectures: bi-directional RNNs, LSTM alternatives, more connections
• incorporate uncertainty
• objective comparisons with labelled data
• try multivariate series
Conclusion
Reproducibility
Technology stack enables automation and reproducibility
Reproducibility of the technology stack on any machine enables parallel processing
[Figure: deployment diagram; Docker containers (app1, app2, app3, svc1, registry) on CoreOS hosts (‘vagrant1’, ‘vagrant2’, aws1, aws2) joined by Weave Net, provisioned with VirtualBox/Vagrant on localhost (Windows|Linux|OS X) and on AWS, sharing /project]
Discussion Time