[Figure: a daily time series — the history is observed, the future is to be forecast]

History:  2018-06-08  67
          2018-06-09  68
          2018-06-10  70
          2018-06-11  69
          2018-06-12  72
          2018-06-13  ?
Future:   2018-06-14  ?
          2018-06-15  ?
- Non-parametric
- Flexible and expressive
- Easily inject exogenous features into the model
- Learn from large time series datasets
For a nested function L(w) = g(f1(h(w))), the chain rule gives:

L′(w) = g′(f1(h(w))) · f1′(h(w)) · h′(w)

When a weight w is shared across branches f1(w) and f2(w), the total gradient is the sum of the gradients flowing through each branch.
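As a sanity check, the chain rule can be verified numerically. The concrete choices g = sin, f1(u) = u², h = exp below are illustrative assumptions, not from the slides:

```python
import math

# Illustrative choices for the composition L(w) = g(f1(h(w)))
def h(w): return math.exp(w)
def f1(u): return u * u
def g(v): return math.sin(v)

def L(w): return g(f1(h(w)))

def L_prime(w):
    # chain rule: L'(w) = g'(f1(h(w))) * f1'(h(w)) * h'(w)
    return math.cos(f1(h(w))) * (2 * h(w)) * math.exp(w)

# compare against a central-difference approximation
w, eps = 0.3, 1e-6
numeric = (L(w + eps) - L(w - eps)) / (2 * eps)
print(abs(numeric - L_prime(w)) < 1e-6)  # True
```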
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting
- Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations
- Learning rate schedules and adaptive learning rate methods for deep learning
1-D convolution example: the input [5, 3, 2, 7, 1, 6] (1 x 6) convolved with the filter [1, 0, -1] (1 x 3) gives the output [3, -4, 1, 1] (1 x 4).

With 3 input channels, a 1 x 3 x 3 filter applied to a 1 x 6 x 3 input produces a 1 x 4 x 1 output; two such filters give a 1 x 4 x 2 output.
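The arithmetic of this example can be reproduced with NumPy's cross-correlation, which is what deep-learning "convolution" layers actually compute (no filter flip):

```python
import numpy as np

# Slide's example: input 1x6, filter 1x3 -> output 1x4
x = np.array([5, 3, 2, 7, 1, 6])
w = np.array([1, 0, -1])

# mode='valid' slides the filter with stride 1, no padding
out = np.correlate(x, w, mode='valid')
print(out)  # [ 3 -4  1  1]
```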
[Figure: a feedforward NN maps a single input X to an output; an RNN applies shared weights W across a sequence X1, X2, X3]
In Keras, you will see "timesteps" when setting the data input shape.
Deep Learning for Time Series Forecasting: aka.ms/dlts
[Figure: RNN unrolled through time — the compact form X → RNN → Y versus the unrolled form with inputs X1, X2, X3 producing outputs Y1, Y2, Y3]
In Keras, the parameter "units" is the dimensionality of the hidden state.
(Think of it as the number of units in the hidden layer of a feedforward neural network.)
model = Sequential()
model.add(RNN(4, input_shape=(timesteps, data_dim)))
model.add(Dense(3))

The input shape is (timesteps, features).
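To make "units" concrete, here is a minimal NumPy sketch of what a simple recurrent layer with 4 units computes per step. This is not Keras internals; the weight names W_x, W_h and the random values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
timesteps, data_dim, units = 3, 5, 4  # mirrors RNN(4, input_shape=(timesteps, data_dim))

# Hypothetical weights; in Keras these are learned during training
W_x = rng.normal(size=(data_dim, units))
W_h = rng.normal(size=(units, units))
b = np.zeros(units)

x = rng.normal(size=(timesteps, data_dim))
h = np.zeros(units)  # hidden state has "units" dimensions
for t in range(timesteps):
    h = np.tanh(x[t] @ W_x + h @ W_h + b)  # one recurrent step

print(h.shape)  # (4,)
```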
[Figure: unrolled RNN variants — many-to-one (a single output Y after consuming X1, X2, X3) and many-to-many (outputs Y1, Y2, Y3, one per input)]
Uninterrupted gradient flow: the cell state runs straight through the entire chain, with only minor linear interactions, which makes it easy for information to pass.
GRU:
- Update gate: z_t = σ(W_z x_t + U_z h_{t−1})
- Reset gate: r_t = σ(W_r x_t + U_r h_{t−1})
- Candidate state: h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1}))
- New state: h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

LSTM:
- Forget gate: f_t = σ(W_f x_t + U_f h_{t−1})
- Input gate: i_t = σ(W_i x_t + U_i h_{t−1})
- Output gate: o_t = σ(W_o x_t + U_o h_{t−1})
- Candidate state: c̃_t = tanh(W_c x_t + U_c h_{t−1})
- Cell state: c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
- Hidden state: h_t = o_t ⊙ tanh(c_t)
LSTM: A Search Space Odyssey, Greff et al., 2015; Long Short-Term Memory, Hochreiter & Schmidhuber, 1997
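A minimal NumPy sketch of one GRU step following the update/reset/candidate gate structure above. The random weights are illustrative; this is not the Keras or cuDNN implementation:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wc, Uc):
    """One GRU step: update gate, reset gate, candidate state, interpolation."""
    z = sigmoid(x @ Wz + h @ Uz)               # update gate
    r = sigmoid(x @ Wr + h @ Ur)               # reset gate
    h_tilde = np.tanh(x @ Wc + (r * h) @ Uc)   # candidate state
    return (1 - z) * h + z * h_tilde           # mix old state and candidate

rng = np.random.default_rng(1)
d, u = 3, 4  # input and hidden sizes (illustrative)
params = [rng.normal(size=s) for s in [(d, u), (u, u)] * 3]  # Wz, Uz, Wr, Ur, Wc, Uc
h = gru_step(rng.normal(size=d), np.zeros(u), *params)
print(h.shape)  # (4,)
```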
[Figure: stacked RNN — two recurrent layers, inputs X1, X2, X3, outputs Y1, Y2, Y3]
model = Sequential()
model.add(RNN(4, return_sequences=True, input_shape=(timesteps, data_dim)))
model.add(RNN(4))
RNN for one-step time series forecasting: given the window t−5, t−4, …, t, predict t+1.
github.com/Arturus/kaggle-web-traffic
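The sliding-window setup can be sketched as follows; the series values and lookback length here are illustrative, not from the slides:

```python
# Turn a series into (lookback window, next value) pairs for
# one-step-ahead forecasting, as in the t-5 ... t -> t+1 picture.
def make_windows(series, lookback=6):
    X, y = [], []
    for i in range(len(series) - lookback):
        X.append(series[i:i + lookback])  # inputs t-5 ... t
        y.append(series[i + lookback])    # target t+1
    return X, y

series = list(range(10))  # toy series for illustration
X, y = make_windows(series, lookback=6)
print(X[0], y[0])  # [0, 1, 2, 3, 4, 5] 6
```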
In exponential smoothing, the current observation is weighted by α and the previous smoothed value by 1 − α.

Hyndman, R.J., & Athanasopoulos, G. (2018), Forecasting: Principles and Practice
In Smyl's hybrid ES-RNN model, each series is preprocessed with multiplicative exponential smoothing:

Level: l_t = α · y_t / s_t + (1 − α) · l_{t−1}
Seasonality: s_{t+m} = γ · y_t / l_t + (1 − γ) · s_t
Normalized, de-seasonalized input: x_t = y_t / (l_t · s_t)
S. Smyl, 2018, Introducing a New Hybrid ES-RNN Model
Forecast: ŷ_{t+1} = l_t · ŝ_{t+1} · RNN(x_t), i.e. the RNN output is denormalized by the level and re-seasonalized by the predicted seasonal factor.
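A minimal sketch of the preprocessing step, assuming simple initializations (initial level = first observation, initial seasonal factors = 1) and illustrative smoothing parameters; Smyl's actual model learns α and γ per series:

```python
# Multiplicative exponential-smoothing preprocessing:
# level and seasonality updates, then normalization x_t = y_t / (l_t * s_t).
def es_preprocess(y, m, alpha=0.5, gamma=0.5):
    seas = [1.0] * m   # initial seasonal factors (assumption)
    level = y[0]       # initial level (assumption)
    x = []
    for t, y_t in enumerate(y):
        s_t = seas[t % m]
        level = alpha * y_t / s_t + (1 - alpha) * level        # level update
        x.append(y_t / (level * s_t))                          # normalized input
        seas[t % m] = gamma * y_t / level + (1 - gamma) * s_t  # seasonality update
    return x, level, seas

x, level, seas = es_preprocess([10.0, 12.0, 11.0, 13.0], m=2)
print(len(x), level > 0)
```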
[Figure: ES-RNN architecture — exponential smoothing performs normalization and de-seasonalization (x_t = y_t / (l_t · s_t)); stacked GRU layers process the x_t inputs; a Dense layer emits the forecast horizon t+1, t+2, t+3; denormalization and re-seasonalization recover the final forecasts]
Fast ES-RNN: A GPU Implementation of the ES-RNN Algorithm
Challenges
• Huge search space to explore
• Sparsity of good configurations
• Expensive evaluation
• Limited time and resources
Each run is a hyperparameter configuration sampled from the search space, e.g. run_p = {learning rate=0.9, #layers=8, …}

(B) Manage resource usage:
• How many runs to keep active in parallel (run_p … run_k)?
• How long to execute a run?
Sampling algorithm: the search space declares a distribution per hyperparameter, and the sampler draws concrete configurations from it.

{
  "learning_rate": uniform(0, 1),
  "num_layers": choice(2, 4, 8),
  …
}

Config1 = {"learning_rate": 0.2, "num_layers": 2, …}
Config2 = {"learning_rate": 0.5, "num_layers": 4, …}
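The sampling step can be sketched as follows. The space definition and the two-config draw mirror the example above; the lambda-based encoding of the search space is an assumption of this sketch, not a specific tuning library's API:

```python
import random

random.seed(0)  # deterministic draws for illustration

# Search space: each hyperparameter maps to a draw function
space = {
    "learning_rate": lambda: random.uniform(0, 1),
    "num_layers": lambda: random.choice([2, 4, 8]),
}

def sample_config(space):
    """Draw one concrete configuration from the search space."""
    return {name: draw() for name, draw in space.items()}

configs = [sample_config(space) for _ in range(2)]
print(configs)
```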