
Predicting Stock Prices with Echo State Networks
People have tried and failed to reliably predict the seemingly chaotic
nature of the stock market for decades. Do neural networks hold the
key?

Matthew Stewart, PhD Researcher


Mar 18 · 15 min read

. . .

“There (is) order and even great beauty in what looks like total chaos. If we look closely
enough at the randomness around us, patterns will start to emerge.” ― Aaron Sorkin

EDIT: Due to skepticism and criticism of the methods outlined in this article, all of the
code and data used in this tutorial and the results are provided in the related GitHub
repository, which can be found here.


The Motivation for Time Series Prediction

The stock market is typically viewed as a chaotic time series, and companies often apply advanced stochastic methods to try to make reasonably accurate predictions so that they can get the upper hand and make money. This is essentially the idea behind all investment banking, especially market trading.

I do not claim to know much about the stock market (I am, after all, a scientist and not
an investment banker), but I do know a reasonable amount about machine learning
and stochastic methods. One of the greatest problems in this area is trying to accurately
predict chaotic time series in a reliable manner. The idea of predicting the dynamics of
chaotic systems is somewhat counterintuitive given that something chaotic, by
definition, does not behave in a predictable manner.

The study of time series was around before the introduction of the stock market but saw
a marked increase in its popularity as individuals tried to leverage the stock market in
order to ‘beat the system’ and become wealthy. In order to do this, people had to
develop reliable methods of estimating market trends based on prior information.

First, let us talk about some properties of time series that make them easy to analyze so
that we can appreciate why time series analysis can get pretty tough when we look at
the stock market.

. . .

Time Series Properties


One of the most important properties a time series can have is stationarity.
A time series is said to be stationary if its statistical properties, such as mean and
variance, remain constant over time. But why is this important?

Most models actually work on the assumption that the time series is stationary.
Intuitively, if a time series has exhibited a particular behavior over time, there is a
very high probability that it will follow the same behavior in the future. Also, the
theories related to stationary series are more mature and easier to implement than
those for non-stationary series.


Stationarity is defined using a very strict criterion. However, for practical purposes we
can assume a series to be stationary if it has constant statistical properties over time,
i.e., the following:

1. Constant mean. This should be intuitive: if the mean is changing, the whole time
series can be seen to be drifting over time.

2. Constant variance. This property is known as homoscedasticity: the spread of the
series around its mean should not grow or shrink over time.

3. An autocorrelation that does not depend on time. If the spread between successive
values narrows (or widens) as time increases, the covariance is not constant with time
and the series is non-stationary.

Why do I care about the ‘stationarity’ of a time series?


The reason I took up this section first is that unless your time series is stationary, you
cannot build a time series model. In cases where the criteria for a stationary time series
are violated, the first step is to transform the time series to make it stationary, and
then try stochastic models to predict it. There are multiple ways of inducing this
stationarity, such as detrending and differencing.
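To make the idea concrete, here is a minimal sketch of both transformations applied to a toy trending series (assuming numpy and pandas; the article itself names no tool):

import numpy as np
import pandas as pd

# A trending (non-stationary) toy series: linear trend plus noise
t = np.arange(100)
series = pd.Series(0.5 * t + np.random.normal(0, 1, 100))

# Detrending: subtract a fitted linear trend
trend = np.polyval(np.polyfit(t, series, 1), t)
detrended = series - trend

# Differencing: subtract each value from the one before it
differenced = series.diff().dropna()

Both transformed series now fluctuate around a constant mean, which is what the stochastic models downstream assume.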

This may seem a little stupid to those of you who are not familiar with time series
analysis. However, it is a little more complicated than it first appears (isn't it always?).
It turns out that the best way to deal with a time series is often to first 'stationarize' it,
decompose it into several components, such as a linear trend and separate series with
different seasonal qualities, and then add them back together at the end.


An example of decoupling a time series into multiple series with desirable properties.
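In practice, this kind of decomposition is a one-liner in common libraries. A hedged sketch, assuming statsmodels' seasonal_decompose (the article does not say which tool produced its figure):

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# A toy monthly series: linear trend + yearly seasonality + noise
idx = pd.date_range("2010-01", periods=120, freq="M")
values = (0.5 * np.arange(120)
          + 10 * np.sin(2 * np.pi * np.arange(120) / 12)
          + np.random.normal(0, 1, 120))
series = pd.Series(values, index=idx)

# Split into trend, seasonal, and residual components
result = seasonal_decompose(series, model="additive", period=12)
trend, seasonal, resid = result.trend, result.seasonal, result.resid

Each component can then be modeled separately, and the pieces added back together at the end.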

For anyone familiar with Fourier transforms, this is a very similar idea. A Fourier
transform separates out the different frequency components of a time series and
represents them in the frequency domain, where they can be manipulated or analyzed
more easily before being transformed back into the time domain.
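The same idea in code — a short numpy sketch (my own illustration, not from the article) that moves a signal into the frequency domain, manipulates it there, and moves it back:

import numpy as np

# A signal with a slow (5 Hz) and a fast (40 Hz) component
t = np.linspace(0, 1, 500, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)

spectrum = np.fft.rfft(signal)                  # into the frequency domain
freqs = np.fft.rfftfreq(len(t), d=t[1] - t[0])
spectrum[freqs > 20] = 0                        # e.g. drop the fast component
filtered = np.fft.irfft(spectrum)               # back into the time domain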

How do I test for stationarity?

It might not always be obvious from visual observations whether a time series is
stationary or not. So, more formally, we can check stationarity using the following:

1. Plotting Rolling Statistics: We can plot the moving average or moving variance
and see if it varies with time. By moving average/variance I mean that at any
instant ‘t’, we’ll take the average/variance of the last year, i.e. last 12 months. But
again this is more of a visual technique.

2. Dickey-Fuller Test: This is one of the statistical tests for checking stationarity. Here
the null hypothesis is that the time series is non-stationary. The test results comprise a
test statistic and critical values for different confidence levels. If the test statistic
is less than the critical value, we can reject the null hypothesis and say that the
series is stationary. Refer to this article for details. A short sketch of both checks
follows this list.
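Here is a brief sketch of both checks (assuming pandas and statsmodels, neither of which the article uses explicitly):

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# A random walk is a classic non-stationary series
rng = np.random.default_rng(0)
series = pd.Series(rng.normal(0, 1, 240)).cumsum()

# 1. Rolling statistics over the last 12 observations ('months')
rolling_mean = series.rolling(window=12).mean()
rolling_std = series.rolling(window=12).std()

# 2. (Augmented) Dickey-Fuller test: null hypothesis = non-stationary
stat, pvalue, _, _, critical_values, _ = adfuller(series)
print("Test statistic:", stat, " p-value:", pvalue)
print("Critical values:", critical_values)
# If the test statistic is below a critical value, reject the null:
# the series can be treated as stationary at that confidence level.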

Now that we know a bit more about time series, we can look at the traditional ways
people study time series, how they develop their models, and why they are inadequate
for studying the stock market.

. . .

Basic Methods for Time Series Prediction


The most basic methods are so simple that I think most people could have come up with
them without taking a class on time series analysis. The simplest model that is of some
use is the moving average. Essentially, the moving average takes the mean of the last t
values as the prediction for the next point.


The moving average is surprisingly accurate, and its robustness to outliers and short-
term fluctuations can be controlled by altering the number of previous points used in
the averaging process.
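As a quick sketch (my own toy example, not from the article):

import numpy as np

# Predict the next point as the mean of the last t observations.
# A larger t is more robust to outliers but slower to react.
def moving_average_forecast(series, t=5):
    return np.mean(series[-t:])

prices = np.array([101.0, 102.5, 101.8, 103.0, 104.2, 103.9])
print(moving_average_forecast(prices, t=3))   # mean of the last 3 points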

More complex procedures then proceed naturally from this, such as exponential
smoothing. This is similar to the moving average, except that it is a weighted procedure
that puts higher importance on the most recent data points. The particular weighting
function used in exponential smoothing is (no surprise) an exponential function, but
the procedure can be weighted using different methods.
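A minimal sketch of simple exponential smoothing (a standard textbook recursion, not code from the article):

import numpy as np

# Each smoothed value mixes the newest observation with the previous
# smoothed value, so recent points carry exponentially more weight.
def exponential_smoothing(series, alpha=0.3):
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return np.array(smoothed)

prices = np.array([101.0, 102.5, 101.8, 103.0, 104.2, 103.9])
print(exponential_smoothing(prices, alpha=0.5))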

These methods are fine for relatively consistent and periodic time series, but they
struggle with series that exhibit seasonality combined with a persistent linear trend, or
any substantial randomness or chaotic behavior. For example, if my data have a weekly
oscillation and I am using a moving average model that averages the data from the last
week, I will completely miss this oscillatory behavior with my model.

One very popular method for analyzing time series with different levels of
autocorrelation (e.g. a weekly trend combined with monthly and yearly trends) is
Holt's linear model. Holt extended simple exponential smoothing to allow forecasting
of data with a trend. It is nothing more than exponential smoothing applied to both the
level (the average value in the series) and the trend. To express this in mathematical
notation we need three equations: one for the level, one for the trend, and one to
combine the level and trend to get the expected forecast ŷ.
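In the standard textbook formulation (reconstructed here, since the article's original figure is not reproduced), with smoothing parameters $\alpha$ and $\beta$ and forecast horizon $h$:

$$\ell_t = \alpha y_t + (1 - \alpha)(\ell_{t-1} + b_{t-1}) \quad \text{(level)}$$
$$b_t = \beta(\ell_t - \ell_{t-1}) + (1 - \beta)\, b_{t-1} \quad \text{(trend)}$$
$$\hat{y}_{t+h} = \ell_t + h\, b_t \quad \text{(forecast)}$$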

The other most popular technique for this is ARIMA, which stands for autoregressive
integrated moving average. As you can probably guess, it incorporates moving averages
as well as autoregressive features (looking at correlations between subsequent
timesteps). The ARIMA model follows a specific methodology.

Essentially, we take the original data and decompose the time series into stationary and
non-stationary components. We can then study charts known as autocorrelation or
partial autocorrelation plots, which look at how strongly correlated a specific value is
with its predecessors. From this, we can determine how to build the ARIMA model to
make the predictions.
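A hedged sketch of this workflow, assuming statsmodels (which the article does not use):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# A toy non-stationary series (random walk)
rng = np.random.default_rng(0)
series = rng.normal(0, 1, 300).cumsum()

# ARIMA(p, d, q): p autoregressive lags, d differencing steps to make
# the series stationary, q moving-average lags
model = ARIMA(series, order=(2, 1, 1))
fitted = model.fit()
print(fitted.forecast(steps=5))   # five-step-ahead forecast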

All of these methods rely on a stationary time series with some kind of autocorrelation
and/or periodicity, a feature that is inherently not present in the stock market. There
are indeed times when the stock market oscillates, and these are studied in great detail
in any university economics class: the Kitchin cycle (3–5 year periodicity), the Juglar
cycle (7–11 year periodicity), the Kuznets swing (15–25 year periodicity), and the
Kondratiev wave (45–60 year periodicity — although this one is still debated by
economists). However, the stocks of individual companies generally do not follow these
trends; some win and some lose more than others. They are affected by political,
socioeconomic, and social factors that are essentially random and chaotic when viewed
from the standpoint of a time series model. In addition, these waves are not understood
to a degree of accuracy that would let one make useful predictions about the future of
economic markets based on their existence — which makes sense, because otherwise
everyone would do it.

How about neural networks?

Neural networks seem to work for just about anything that involves non-linear feature
spaces. In fact, recurrent neural networks can be, and have been, used to predict the
stock market. However, there are several challenges facing recurrent neural networks
(RNNs) with regard to predicting stock prices, most notably the vanishing gradient
problem associated with RNNs, as well as very noisy predictions. A comprehensive
walkthrough showing how to implement a basic type of RNN called an LSTM to predict
stock prices can be found here.

By far the most important problem for RNNs is the vanishing gradient problem. This
issue comes from the fact that very deep neural networks, optimized by a procedure
called backpropagation, use derivatives between each layer in order to 'learn'. These
derivatives can be relatively small or relatively large. If my network has 100 hidden
layers and I multiply a small number by itself 100 times, the value essentially
disappears. That is a problem: my network cannot learn anything if all my gradients are
zero, so what can I do?

There are three solutions to this:

1. The gradient clipping method

2. Special RNNs with leaky units, such as Long Short-Term Memory (LSTM) and Gated
Recurrent Units (GRU)

3. Echo state RNNs

Gradient clipping stops our gradients from getting too big or too small, but we are still
losing information by doing this, so it is not an ideal approach (a one-line sketch of the
clipping step appears after this paragraph). RNNs with leaky units are fine and are the
standard technique used by most individuals and companies applying RNNs for
commercial purposes. These algorithms adapt all connections (input, recurrent, output)
by some version of gradient descent. This renders them slow and, maybe even more
cumbersome, makes the learning process prone to disruption by bifurcations;
convergence cannot be guaranteed. As a consequence, RNNs were rarely fielded in
practical engineering applications.
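For reference, gradient clipping itself is a one-liner in modern frameworks. A minimal sketch assuming PyTorch (which this article does not otherwise use):

import torch
import torch.nn as nn

# A tiny RNN with a dummy loss, just to show where the clipping step sits
model = nn.RNN(input_size=1, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 100, 1)        # batch of 8 sequences, 100 timesteps each
out, _ = model(x)
loss = out.pow(2).mean()          # placeholder loss for illustration

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # the clipping step
optimizer.step()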

This is where echo state networks come in. Echo state networks are a relatively new
invention: essentially a recurrent neural network with a sparsely connected hidden
layer, called a 'reservoir', which works surprisingly well in the presence of chaotic time
series. In an echo state network, we only have to train the output weights, which speeds
up training, generally provides better predictions, and solves all of the previous
problems we have discussed with time series analysis. ESN training, in contrast to other
methods, is fast, does not suffer from bifurcations, and is easy to implement. On a
number of benchmark tasks, ESNs have starkly outperformed all other methods of
nonlinear dynamical modeling.

The echo state network is part of a category of computational science known as
reservoir computing, and we will delve into it in more detail in the next section.

. . .

Echo State Networks

So we have made the case that there is no classical method that can handle chaotic
time series, which, unfortunately, just so happens to be how we model the stock market.
An approach that avoids this difficulty is to fix the recurrent and input weights and learn
only the output weights: the Echo State Network (ESN). The hidden units form a
'reservoir' of temporal features that capture different aspects of the input history.

The mathematical justification behind the ESN is rather involved, so I will try to avoid it
for the sake of this article. Instead, we will discuss the concept behind the ESN and look
at how it can be implemented relatively simply using Python.

The original paper's description of why it is called an 'echo' network is:

“The unifying theme throughout all these variations is to use a fixed RNN as a random
nonlinear excitable medium, whose high-dimensional dynamical “echo” response to a
driving input (and/or output feedback) is used as a non-orthogonal signal basis to
reconstruct the desired output by a linear combination, minimizing some error criteria.”

An ESN takes an input vector (u) of arbitrary sequence length, (1) maps it into a high-
dimensional feature space (i.e. the recurrent reservoir state h), and (2) applies a linear
predictor (linear regression) to find ŷ.
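In the common leaky-integrator formulation (a standard statement of the model, not quoted from this article), with leaking rate $a$, fixed random input weights $W^{in}$, fixed reservoir weights $W$, and trained output weights $W^{out}$:

$$h_t = (1 - a)\, h_{t-1} + a \tanh\!\left(W^{in} u_t + W h_{t-1}\right)$$
$$\hat{y}_t = W^{out} h_t$$

Only $W^{out}$ is learned, which amounts to an ordinary linear regression from the reservoir states to the targets.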


Schematic diagram of an echo state network.

We essentially train only the output weights, which drastically speeds up the training.
This is the great advantage of reservoir computing. By setting and fixing the input and
recurrent weights to represent a rich history, we obtain:

Recurrent states that behave as dynamical systems near stability — stability meaning
the Jacobians are all close to one (no vanishing or exploding gradients)

Leaky hidden units that partially remember the previous state — they avoid
exploding/vanishing gradients while needing no training

Echo state property

In order for the ESN principle to work, the reservoir must have the echo state property
(ESP), which relates asymptotic properties of the excited reservoir dynamics to the
driving signal. Intuitively, the ESP states that the reservoir will asymptotically wash out
any information from its initial conditions. The ESP is guaranteed for additive-sigmoid
neuron reservoirs if the reservoir weight matrix (and the leaking rates) satisfy certain
algebraic conditions in terms of singular values.
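In practice, this condition is usually approximated by rescaling the reservoir matrix to a chosen spectral radius. A sketch of the standard initialization (consistent with, but not copied from, pyESN):

import numpy as np

rng = np.random.default_rng(23)
n_reservoir, sparsity, target_radius = 500, 0.2, 1.2

# Draw random recurrent weights and zero out a fraction of them
W = rng.uniform(-0.5, 0.5, (n_reservoir, n_reservoir))
W[rng.random(W.shape) < sparsity] = 0

# Rescale so the largest absolute eigenvalue equals the target radius
W *= target_radius / np.max(np.abs(np.linalg.eigvals(W)))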

Training


So you might be wondering: how do we pick the values for the hidden weights in the
first place? The input and recurrent weights are initialized randomly and then fixed; we
are not training them. How should we fix them to optimize the prediction?

The training is very easy and fast, but there are hyperparameters to choose: those that
govern the random generation of the weights, the degree of the reservoir nodes, the
sparsity of the reservoir connections, and the spectral radius. Unfortunately, no
systematic method exists for optimizing these hyperparameters, so this is typically done
using a validation set. Cross-validation is not feasible with time series data due to the
inherent autocorrelation present in the feature space.

To recap:

The network nodes each have distinct dynamical behavior

Signals may be time-delayed along the network links

The hidden part of the network has recurrent connections

The input and internal weights are fixed and randomly chosen

Only the output weights are adjusted during training.

Coding an Echo State Network


Now for the moment you've all been waiting for: how do you actually code these
mysterious networks? We use a Python library called pyESN, which is available from
this GitHub repository. In order to install it, clone the repository and put the pyESN.py
file in your current Jupyter Notebook folder. Then, in a Python 3 notebook, you can
simply call import pyESN .

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Library for reservoir computing, from: https://github.com/cknd/pyESN
from pyESN import ESN
%matplotlib inline

# Read the dataset amazon.txt (this was scraped from the internet)
amazon = open("amazon.txt").read().split()
amazon = np.array(amazon).astype('float64')

Overview of the pyESN library for the RC implementation. You call the RC as:

esn = ESN(n_inputs = 1,
          n_outputs = #,
          n_reservoir = #,
          sparsity = #,
          random_state = #,
          spectral_radius = #,
          noise = #)

## where # denotes the value that you choose.

For a brief explanation of the parameters:

n_inputs: number of input dimensions

n_outputs: number of output dimensions

n_reservoir: number of reservoir neurons

random_state: seed for the random generator

sparsity: the proportion of recurrent weights set to zero

spectral_radius: spectral radius of the recurrent weight matrix

noise: noise added to each neuron (regularization)

Predicting Amazon Stock Prices

I will now go over an example of using echo state networks to predict future Amazon
stock prices. First, we import all of the necessary libraries and also import our data
(which in this case was scraped from the internet).

We then use the ESN from the pyESN library to employ an RC network. The task here is
to predict two days ahead using the previous 1500 points, and to do that for 100 future
points. So, in the end, you will have a 100-time-step prediction with a prediction
window of 2. We will use this as the validation set.

First, we create our echo state network implementation using some reasonable values
and specify our training and validation length. We then create functions to calculate the
mean squared error as well as run an echo state network for specific input arguments of
the spectral radius, noise, and window length; a sketch of these helper functions follows
the configuration code below.

n_reservoir = 500
sparsity = 0.2
rand_seed = 23
spectral_radius = 1.2
noise = .0005

esn = ESN(n_inputs = 1,
          n_outputs = 1,
          n_reservoir = n_reservoir,
          sparsity = sparsity,
          random_state = rand_seed,
          spectral_radius = spectral_radius,
          noise = noise)

trainlen = 1500
future = 2
futureTotal = 100
pred_tot = np.zeros(futureTotal)

# Slide over the validation window, refitting on the most recent `trainlen`
# points and predicting `future` steps ahead each time. (The original listing
# called the price series `data`; here it is the `amazon` array loaded above.)
for i in range(0, futureTotal, future):
    pred_training = esn.fit(np.ones(trainlen), amazon[i:trainlen+i])
    prediction = esn.predict(np.ones(future))
    pred_tot[i:i+future] = prediction[:,0]
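The helper functions themselves are not shown in the article as extracted here. A minimal sketch consistent with how they are called below (the names MSE and run_echo, and the argument order spectral_radius, noise, future, are inferred from that usage) might look like:

def MSE(prediction, target):
    # mean squared error between the stitched prediction and the true series
    return np.mean((np.asarray(prediction).flatten()
                    - np.asarray(target).flatten()) ** 2)

def run_echo(spectral_radius, noise, future):
    # Hypothetical reconstruction: build an ESN with the given spectral radius
    # and noise, predict `future` steps at a time across the validation window,
    # and return the error together with the stitched prediction.
    esn = ESN(n_inputs=1, n_outputs=1, n_reservoir=n_reservoir,
              sparsity=sparsity, random_state=rand_seed,
              spectral_radius=spectral_radius, noise=noise)
    pred_tot = np.zeros(futureTotal)
    for i in range(0, futureTotal, future):
        esn.fit(np.ones(trainlen), amazon[i:trainlen + i])
        prediction = esn.predict(np.ones(future))
        pred_tot[i:i + future] = prediction[:, 0]
    return MSE(pred_tot, amazon[trainlen:trainlen + futureTotal]), pred_tot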

Now we can simply run one function and obtain our prediction, and then we can plot
this to see how well we did.

error, validation_set = run_echo(1.2, .005, 2)

future = 2
plt.figure(figsize=(18,8))
plt.plot(range(0, trainlen+future), amazon[0:trainlen+future], 'k', label="target system")
# the label below was truncated in the original; completed from the pyESN examples
plt.plot(range(trainlen, trainlen+100), validation_set.reshape(-1,1), 'r', label="free running ESN")
lo, hi = plt.ylim()
plt.plot([trainlen, trainlen], [lo+np.spacing(1), hi-np.spacing(1)], 'k:')
plt.legend(loc=(0.61, 0.12), fontsize='x-large')
sns.despine();

The above code produces the following plot.


And if we zoom in on this plot, we can see just how impressive this prediction actually
is.

Not bad, right? The only caveat is that it seems to work well for short time periods (on
the order of one or two days) with reasonable accuracy, but the errors become
increasingly large as the estimate is extrapolated further. The above model was made
with a prediction window of two days, meaning that we are only ever predicting two
days into the future at any given time.


We can illustrate this by increasing the window length. For a window length of 10 days,
the prediction is still surprisingly accurate, although it is obviously worse than the two-
day prediction window.

In order to obtain a result this good, we had to do a significant amount of
hyperparameter optimization. Here is the procedure that was used to obtain the
hyperparameters in the above results.

n_reservoir = 500
sparsity = 0.2
rand_seed = 23

# an initial, coarse grid (immediately overridden by the finer one below)
radius_set = [0.9, 1, 1.1]
noise_set = [0.001, 0.004, 0.006]

radius_set = [0.5, 0.7, 0.9, 1, 1.1, 1.3, 1.5]
noise_set = [0.0001, 0.0003, 0.0007, 0.001, 0.003, 0.005, 0.007, 0.01]

radius_set_size = len(radius_set)
noise_set_size = len(noise_set)

trainlen = 1500
future = 2
futureTotal = 100

loss = np.zeros([radius_set_size, noise_set_size])

for l in range(radius_set_size):
    rho = radius_set[l]
    for j in range(noise_set_size):
        noise = noise_set[j]

        pred_tot = np.zeros(futureTotal)

        esn = ESN(n_inputs = 1,
                  n_outputs = 1,
                  n_reservoir = n_reservoir,
                  sparsity = sparsity,
                  random_state = rand_seed,
                  spectral_radius = rho,
                  noise = noise)

        for i in range(0, futureTotal, future):
            pred_training = esn.fit(np.ones(trainlen), amazon[i:trainlen+i])
            prediction = esn.predict(np.ones(future))
            pred_tot[i:i+future] = prediction[:,0]

        loss[l, j] = MSE(pred_tot, amazon[trainlen:trainlen+futureTotal])
        print('rho = ', radius_set[l], ', noise = ', noise_set[j], ', MSE = ', loss[l][j])

The above code takes a while to run, and this is only a grid over two parameters.
Plotting the MSE on a heatmap gives the result below.


However, it gives us a pretty good result on our validation set when the prediction
window is small. This makes ESNs good for short-term predictions of stock prices, but
things get more uncertain and risky if you try to extrapolate the results.

In future predictions, the error propagates through time and therefore grows. This is
why accuracy decreases as the window length increases. We can see this behavior in the
plot below, where the MSE is a monotonically increasing function of the prediction
window; longer predictions mean larger MSE.

Final Word

The ability of the echo state network to analyze chaotic time series makes it an
interesting tool for financial forecasting, where the data are highly nonlinear and
chaotic. But we can do more with these networks than predict the stock market. We can
also:

Forecast the weather

Control complex dynamical systems

Perform pattern recognition

Expect to hear a lot more about these kinds of networks in the future, especially as
people move into the development of DeepESN models, which work in a much higher-
dimensional latent space with temporal features and are able to tackle some of the most
difficult time series problems.

If you are interested in these networks, there are more articles and research papers
discussing them that are freely accessible.

Echo state network — Scholarpedia
Echo state networks (ESN) provide an architecture and supervised learning principle for recurrent…
www.scholarpedia.org

Deep Echo State Network (DeepESN): A Brief Survey
The study of deep recurrent neural networks (RNNs) and, in particular, of deep Reservoir Computing (RC) is gaining an…
arxiv.org

All of the above work is outlined in my GitHub repository that can be found here.

Happy machine learning!

