
Understanding LSTM Networks

Recurrent Neural Networks

Humans don’t start their thinking from scratch every second. As you read this essay, you
understand each word based on your understanding of previous words. You don’t throw
everything away and start thinking from scratch again. Your thoughts have persistence.
Traditional neural networks can’t do this, and it seems like a major shortcoming. For example,
imagine you want to classify what kind of event is happening at every point in a movie. It’s unclear
how a traditional neural network could use its reasoning about previous events in the film to inform
later ones.
Recurrent neural networks address this issue. They are networks with loops in them, allowing
information to persist.

Recurrent Neural Networks have loops.


In the above diagram, a chunk of neural network, A, looks at some input x_t and outputs a
value h_t. A loop allows information to be passed from one step of the network to the next.
These loops make recurrent neural networks seem kind of mysterious. However, if you think a
bit more, it turns out that they aren’t all that different from a normal neural network. A recurrent
neural network can be thought of as multiple copies of the same network, each passing a
message to a successor. Consider what happens if we unroll the loop:

An unrolled recurrent neural network.


This chain-like nature reveals that recurrent neural networks are intimately related to sequences
and lists. They’re the natural architecture of neural network to use for such data.
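Concretely, the unrolled computation is one recurrence applied at every time step. Here is a minimal sketch of a vanilla RNN step in NumPy; the weight names and small random initialization are my own illustrative choices, not anything prescribed by the text:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of a vanilla RNN: the new hidden state depends on
    the current input and the previous hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Unrolling the loop: the SAME weights are applied at every time step.
rng = np.random.default_rng(0)
d, h = 3, 4                          # input size, hidden size (arbitrary)
W_xh = rng.normal(size=(h, d)) * 0.1
W_hh = rng.normal(size=(h, h)) * 0.1
b_h = np.zeros(h)

h_t = np.zeros(h)                    # initial hidden state
for x_t in rng.normal(size=(5, d)):  # a sequence of 5 input vectors
    h_t = rnn_step(x_t, h_t, W_xh, W_hh, b_h)
print(h_t.shape)
```

Each iteration passes its message h_t to the next copy of the network, which is exactly the chain in the unrolled diagram.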
And they certainly are used! In the last few years, there has been incredible success applying
RNNs to a variety of problems: speech recognition, language modeling, translation, image
captioning… The list goes on. I’ll leave discussion of the amazing feats one can achieve with
RNNs to Andrej Karpathy’s excellent blog post, The Unreasonable Effectiveness of Recurrent
Neural Networks. But they really are pretty amazing.
Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural
network which works, for many tasks, much much better than the standard version. Almost all
exciting results based on recurrent neural networks are achieved with them. It’s these LSTMs
that this essay will explore.

The Problem of Long-Term Dependencies


One of the appeals of RNNs is the idea that they might be able to connect previous information
to the present task, such as using previous video frames to inform the understanding of the
present frame. If RNNs could do this, they’d be extremely useful. But can they? It depends.
Sometimes, we only need to look at recent information to perform the present task. For example,
consider a language model trying to predict the next word based on the previous ones. If we are
trying to predict the last word in “the clouds are in the sky,” we don’t need any further context –
it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the
relevant information and the place that it’s needed is small, RNNs can learn to use the past
information.

But there are also cases where we need more context. Consider trying to predict the last word in
the text “I grew up in France… I speak fluent French.” Recent information suggests that the next
word is probably the name of a language, but if we want to narrow down which language, we
need the context of France, from further back. It’s entirely possible for the gap between the
relevant information and the point where it is needed to become very large.
Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.
In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human
could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice,
RNNs don’t seem to be able to learn them. The problem was explored in depth by Hochreiter
(1991) [German] and Bengio, et al. (1994), who found some pretty fundamental reasons why it
might be difficult.
Thankfully, LSTMs don’t have this problem!

LSTM Networks
Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN,
capable of learning long-term dependencies. They were introduced by Hochreiter &
Schmidhuber (1997), and were refined and popularized by many people in following work.1 They
work tremendously well on a large variety of problems, and are now widely used.
LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering
information for long periods of time is practically their default behavior, not something they
struggle to learn!
All recurrent neural networks have the form of a chain of repeating modules of neural network. In
standard RNNs, this repeating module will have a very simple structure, such as a single tanh
layer.

The repeating module in a standard RNN contains a single layer.


LSTMs also have this chain-like structure, but the repeating module has a different structure.
Instead of having a single neural network layer, there are four, interacting in a very special way.

The repeating module in an LSTM contains four interacting layers.


Don’t worry about the details of what’s going on. We’ll walk through the LSTM diagram step by
step later. For now, let’s just try to get comfortable with the notation we’ll be using.
In the above diagram, each line carries an entire vector, from the output of one node to the inputs
of others. The pink circles represent pointwise operations, like vector addition, while the yellow
boxes are learned neural network layers. Lines merging denote concatenation, while a line
forking denotes its content being copied and the copies going to different locations.

The Core Idea Behind LSTMs


The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.
The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some
minor linear interactions. It’s very easy for information to just flow along it unchanged.

The LSTM does have the ability to remove or add information to the cell state, carefully regulated
by structures called gates.
Gates are a way to optionally let information through. They are composed out of a sigmoid neural
net layer and a pointwise multiplication operation.

The sigmoid layer outputs numbers between zero and one, describing how much of each
component should be let through. A value of zero means “let nothing through,” while a value of
one means “let everything through!”
An LSTM has three of these gates, to protect and control the cell state.
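As a sketch of the gate mechanism on its own (the numbers are hypothetical, chosen only to show the three regimes of element-wise filtering):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A gate is a sigmoid layer whose output multiplies a signal element-wise.
signal = np.array([2.0, -1.0, 0.5])
gate_logits = np.array([10.0, -10.0, 0.0])  # hypothetical pre-activations
gate = sigmoid(gate_logits)                 # roughly [1, 0, 0.5]
filtered = gate * signal
print(filtered)  # roughly [2.0, 0.0, 0.25]: let through, block, attenuate
```

A saturated sigmoid (near 0 or 1) acts like a closed or open valve; values in between let a fraction of each component through.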

Step-by-Step LSTM Walk Through


The first step in our LSTM is to decide what information we’re going to throw away from the cell
state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks
at h_{t-1} and x_t, and outputs a number between 0 and 1 for each number in the cell
state C_{t-1}. A 1 represents “completely keep this” while a 0 represents “completely get rid
of this.”
Let’s go back to our example of a language model trying to predict the next word based on all the
previous ones. In such a problem, the cell state might include the gender of the present subject,
so that the correct pronouns can be used. When we see a new subject, we want to forget the
gender of the old subject.

The next step is to decide what new information we’re going to store in the cell state. This has
two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update.
Next, a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the
state. In the next step, we’ll combine these two to create an update to the state.
In the example of our language model, we’d want to add the gender of the new subject to the cell
state, to replace the old one we’re forgetting.

It’s now time to update the old cell state, C_{t-1}, into the new cell state C_t. The previous steps
already decided what to do, we just need to actually do it.
We multiply the old state by f_t, forgetting the things we decided to forget earlier. Then we
add i_t ∗ C̃_t. These are the new candidate values, scaled by how much we decided to update
each state value.
In the case of the language model, this is where we’d actually drop the information about the old
subject’s gender and add the new information, as we decided in the previous steps.
Finally, we need to decide what we’re going to output. This output will be based on our cell state,
but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell
state we’re going to output. Then, we put the cell state through tanh (to push the values to
be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output
the parts we decided to.
For the language model example, since it just saw a subject, it might want to output information
relevant to a verb, in case that’s what is coming next. For example, it might output whether the
subject is singular or plural, so that we know what form a verb should be conjugated into if that’s
what follows next.
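The whole walk-through condenses into one function. This is a minimal sketch assuming the standard parameterization; the weight names (W_*, U_*, b_*) and the tiny random initialization are my own, introduced only for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step: forget, input, candidate, cell update, output."""
    f = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])        # forget gate
    i = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])        # input gate
    C_tilde = np.tanh(p["W_C"] @ x_t + p["U_C"] @ h_prev + p["b_C"])  # candidates
    C = f * C_prev + i * C_tilde                                      # new cell state
    o = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])        # output gate
    h = o * np.tanh(C)                                                # filtered output
    return h, C

rng = np.random.default_rng(1)
d, n = 3, 4                          # input size, number of hidden units
p = {}
for g in ("f", "i", "C", "o"):
    p[f"W_{g}"] = rng.normal(size=(n, d)) * 0.1
    p[f"U_{g}"] = rng.normal(size=(n, n)) * 0.1
    p[f"b_{g}"] = np.zeros(n)

h, C = np.zeros(n), np.zeros(n)
for x_t in rng.normal(size=(5, d)):  # run the cell over a short sequence
    h, C = lstm_step(x_t, h, C, p)
print(h.shape, C.shape)
```

Note how the cell state C is only touched by a multiplication (forget) and an addition (input): that is the conveyor belt from the core-idea section.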

Variants on Long Short Term Memory


What I’ve described so far is a pretty normal LSTM. But not all LSTMs are the same as the above.
In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The
differences are minor, but it’s worth mentioning some of them.
One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole
connections.” This means that we let the gate layers look at the cell state.

The above diagram adds peepholes to all the gates, but many papers will give some peepholes
and not others.
Another variation is to use coupled forget and input gates. Instead of separately deciding what to
forget and what we should add new information to, we make those decisions together. We only
forget when we’re going to input something in its place. We only input new values to the state
when we forget something older.
A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced
by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also
merges the cell state and hidden state, and makes some other changes. The resulting model is
simpler than standard LSTM models, and has been growing increasingly popular.
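For comparison, here is a sketch of one GRU step in the same style, following the Cho et al. (2014) formulation with an update gate z and reset gate r; parameter names are mine:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step: a single state h, no separate cell state."""
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])  # update gate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])  # reset gate
    h_tilde = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"])
    # One gate interpolates between keeping the old state and writing the new one,
    # playing the role of both the forget and input gates of an LSTM.
    return (1 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(2)
d, n = 3, 4
p = {}
for g in ("z", "r", "h"):
    p[f"W_{g}"] = rng.normal(size=(n, d)) * 0.1
    p[f"U_{g}"] = rng.normal(size=(n, n)) * 0.1
    p[f"b_{g}"] = np.zeros(n)

h = np.zeros(n)
for x_t in rng.normal(size=(5, d)):
    h = gru_step(x_t, h, p)
print(h.shape)
```

(Conventions differ on whether z or 1−z keeps the old state; either reading is equivalent up to relabeling.)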

These are only a few of the most notable LSTM variants. There are lots of others, like Depth
Gated RNNs by Yao, et al. (2015). There are also completely different approaches to tackling
long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014).
Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice
comparison of popular variants, finding that they’re all about the same. Jozefowicz, et al.
(2015) tested more than ten thousand RNN architectures, finding some that worked better than
LSTMs on certain tasks.

Conclusion
Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially all of
these are achieved using LSTMs. They really work a lot better for most tasks!
Written down as a set of equations, LSTMs look pretty intimidating. Hopefully, walking through
them step by step in this essay has made them a bit more approachable.
LSTMs were a big step in what we can accomplish with RNNs. It’s natural to wonder: is there
another big step? A common opinion among researchers is: “Yes! There is a next step and it’s
attention!” The idea is to let every step of an RNN pick information to look at from some larger
collection of information. For example, if you are using an RNN to create a caption describing an
image, it might pick a part of the image to look at for every word it outputs.
Long short-term memory

The Long Short-Term Memory (LSTM) cell can process data sequentially and keep its hidden state through
time.

Long short-term memory (LSTM) is an artificial recurrent neural network (RNN)
architecture[1] used in the field of deep learning. Unlike standard feedforward neural networks,
LSTM has feedback connections. It can not only process single data points (such as images), but
also entire sequences of data (such as speech or video). For example, LSTM is applicable to
tasks such as unsegmented, connected handwriting recognition[2] or speech
recognition.[3][4] Bloomberg Businessweek wrote: "These powers make LSTM arguably the most
commercial AI achievement, used for everything from predicting diseases to composing music."[5]
A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate.
The cell remembers values over arbitrary time intervals and the three gates regulate the flow of
information into and out of the cell.
LSTM networks are well-suited to classifying, processing and making predictions based on time
series data, since there can be lags of unknown duration between important events in a time
series. LSTMs were developed to deal with the exploding and vanishing gradient problems that
can be encountered when training traditional RNNs. Relative insensitivity to gap length is an
advantage of LSTM over RNNs, hidden Markov models and other sequence learning methods in
numerous applications.[citation needed]

Idea
In theory, classic (or "vanilla") RNNs can keep track of arbitrary long-term dependencies in the
input sequences. The problem of vanilla RNNs is computational (or practical) in nature: when
training a vanilla RNN using back-propagation, the gradients which are back-propagated
can "vanish" (that is, they can tend to zero) or "explode" (that is, they can tend to infinity),
because of the computations involved in the process, which use finite-precision numbers. RNNs
using LSTM units partially solve the vanishing gradient problem, because LSTM units allow
gradients to also flow unchanged. However, LSTM networks can still suffer from the exploding
gradient problem.[28]

Architecture
There are several architectures of LSTM units. A common architecture is composed of a cell (the
memory part of the LSTM unit) and three "regulators", usually called gates, of the flow of
information inside the LSTM unit: an input gate, an output gate and a forget gate. Some
variations of the LSTM unit do not have one or more of these gates or maybe have other gates.
For example, gated recurrent units (GRUs) do not have an output gate.
Intuitively, the cell is responsible for keeping track of the dependencies between the elements in
the input sequence. The input gate controls the extent to which a new value flows into the cell,
the forget gate controls the extent to which a value remains in the cell and the output
gate controls the extent to which the value in the cell is used to compute the output activation of
the LSTM unit. The activation function of the LSTM gates is often the logistic sigmoid function.
There are connections into and out of the LSTM gates, a few of which are recurrent. The weights
of these connections, which need to be learned during training, determine how the gates operate.
Variants
In the equations below, the lowercase variables represent vectors. Matrices W_q and U_q contain,
respectively, the weights of the input and recurrent connections, where the subscript q can either be the
input gate i, the output gate o, the forget gate f or the memory cell c, depending on the activation
being calculated. In this section, we are thus using a "vector notation". So, for example, c_t ∈ R^h is not just one cell
of one LSTM unit, but contains h LSTM units' cells.

LSTM with a forget gate

The compact forms of the equations for the forward pass of an LSTM unit with a forget gate are:[1][8]

f_t = σ_g(W_f x_t + U_f h_{t-1} + b_f)
i_t = σ_g(W_i x_t + U_i h_{t-1} + b_i)
o_t = σ_g(W_o x_t + U_o h_{t-1} + b_o)
c̃_t = σ_c(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t
h_t = o_t ∘ σ_h(c_t)

where the initial values are c_0 = 0 and h_0 = 0, and the operator ∘ denotes the Hadamard product (element-wise
product). The subscript t indexes the time step.


Variables

x_t ∈ R^d : input vector to the LSTM unit

f_t ∈ (0,1)^h : forget gate's activation vector

i_t ∈ (0,1)^h : input/update gate's activation vector

o_t ∈ (0,1)^h : output gate's activation vector

h_t ∈ (−1,1)^h : hidden state vector, also known as the output vector of the LSTM unit

c_t ∈ R^h : cell state vector

W ∈ R^{h×d}, U ∈ R^{h×h} and b ∈ R^h : weight matrices and bias vector parameters which need to be learned during training

where the superscripts d and h refer to the number of input features and number of hidden units,
respectively.
Activation functions

σ_g : sigmoid function.

σ_c : hyperbolic tangent function.

σ_h : hyperbolic tangent function or, as the peephole LSTM paper [29][30] suggests, σ_h(x) = x.

Training
An RNN using LSTM units can be trained in a supervised fashion, on a set of training sequences,
using an optimization algorithm, like gradient descent, combined with backpropagation through
time to compute the gradients needed during the optimization process, in order to change each
weight of the LSTM network in proportion to the derivative of the error (at the output layer of the
LSTM network) with respect to corresponding weight.
A problem with using gradient descent for standard RNNs is that error
gradients vanish exponentially quickly with the size of the time lag between important events.
This is because W^n → 0 as n → ∞ if the spectral radius of W is smaller than 1.[33][34]
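A quick numerical illustration of that limit, using a hypothetical 2×2 matrix with spectral radius 0.9:

```python
import numpy as np

# A matrix whose spectral radius is 0.9 (< 1): its powers shrink toward zero,
# which is what happens to the factors in a back-propagated RNN gradient.
W = np.array([[0.9, 0.0],
              [0.0, 0.5]])
P = np.eye(2)
for _ in range(100):   # P = W^100
    P = P @ W
print(np.max(np.abs(P)))  # about 0.9**100, i.e. roughly 2.7e-5
```

After only 100 steps the largest entry has shrunk by more than four orders of magnitude; a gradient scaled by such a product carries almost no signal.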


However, with LSTM units, when error values are back-propagated from the output layer, the
error remains in the LSTM unit's cell. This "error carousel" continuously feeds error back to each
of the LSTM unit's gates, until they learn to cut off the value.

CTC score function


Many applications use stacks of LSTM RNNs[35] and train them by connectionist temporal
classification (CTC)[36] to find an RNN weight matrix that maximizes the probability of the label
sequences in a training set, given the corresponding input sequences. CTC achieves both
alignment and recognition.

Alternatives
Sometimes, it can be advantageous to train (parts of) an LSTM by neuroevolution[37] or by policy
gradient methods, especially when there is no "teacher" (that is, training labels).
Success
There have been several success stories of training, in an unsupervised fashion, RNNs with
LSTM units.
In 2018, Bill Gates called it a “huge milestone in advancing artificial intelligence” when bots
developed by OpenAI were able to beat humans in the game of Dota 2.[38] OpenAI Five consists
of five independent but coordinated neural networks. Each network is trained by a policy gradient
method without a supervising teacher and contains a single-layer, 1024-unit Long Short-Term
Memory that sees the current game state and emits actions through several possible action
heads.[38]
In 2018, OpenAI also trained a similar LSTM by policy gradients to control a human-like robot
hand that manipulates physical objects with unprecedented dexterity.[39]
In 2019, DeepMind's program AlphaStar used a deep LSTM core to excel at the complex video
game Starcraft.[40] This was viewed as significant progress towards Artificial General
Intelligence.[40]

Understanding LSTM and its diagrams


Although we don’t know how the brain functions yet, we have the feeling that it must have a logic unit and a memory
unit. We make decisions by reasoning and by experience. So do computers: we have the logic units, CPUs and
GPUs, and we also have memories.
But when you look at a neural network, it functions like a black box. You feed in some inputs from one side, you
receive some outputs from the other side. The decision it makes is mostly based on the current inputs.
I think it’s unfair to say that a neural network has no memory at all. After all, those learnt weights are some kind of
memory of the training data. But this memory is more static. Sometimes we want to remember an input for later
use. There are many examples of such a situation, such as the stock market. To make a good investment
judgement, we have to at least look at the stock data from a time window.
The naive way to let a neural network accept time series data is connecting several neural networks together. Each
of the neural networks handles one time step. Instead of feeding the data at each individual time step, you provide
data at all time steps within a window, or a context, to the neural network.
A lot of times, you need to process data that has periodic patterns. As a silly example, suppose you want to predict
Christmas tree sales. This is a very seasonal thing, likely to peak only once a year. So a good strategy to predict
Christmas tree sales is looking at the data from exactly a year back. For this kind of problem, you either need to
have a big context to include ancient data points, or you need a good memory. You know what data is valuable to
remember for later use and what needs to be forgotten when it is useless.
Theoretically, the naively connected neural network, the so-called recurrent neural network, can work. But in practice,
it suffers from two problems, vanishing gradient and exploding gradient, which make it unusable.
Later, LSTM (long short-term memory) was invented to solve this issue by explicitly introducing a memory
unit, called the cell, into the network. This is the diagram of an LSTM building block.

At first sight, this looks intimidating. Let’s ignore the internals and only look at the inputs and outputs of the
unit. The network takes three inputs. X_t is the input of the current time step. h_t-1 is the output from the
previous LSTM unit and C_t-1 is the “memory” of the previous unit, which I think is the most important input. As
for outputs, h_t is the output of the current network. C_t is the memory of the current unit.
Therefore, this single unit makes decision by considering the current input, previous output and previous memory.
And it generates a new output and alters its memory.

The way its internal memory C_t changes is pretty similar to piping water through a pipe. Assuming the memory is
water, it flows into a pipe. You want to change this memory flow along the way and this change is controlled by two
valves. The first valve is called the forget valve. If you shut it, no old memory will be kept. If you fully open this
valve, all old memory will pass through. The second valve is the new memory valve. New memory will come in
through a T shaped joint like above and merge with the old memory. Exactly how much new memory should come
in is controlled by the second valve.
On the LSTM diagram, the top “pipe” is the memory pipe. The input is the old memory (a vector). The first cross
✖ it passes through is the forget valve. It is actually an element-wise multiplication operation. So if you multiply
the old memory C_t-1 with a vector that is close to 0, that means you want to forget most of the old memory. You
let the old memory pass through if your forget valve equals 1.

Then the second operation the memory flow will go through is this + operator. This operator means element-wise
summation. It resembles the T shaped joint pipe. New memory and the old memory will merge by this operation.
How much new memory should be added to the old memory is controlled by another valve, the ✖ below the +
sign. After these two operations, you have the old memory C_t-1 changed to the new memory C_t.

Now let’s look at the valves. The first one is called the forget valve. It is controlled by a simple one-layer neural
network. The inputs of the neural network are h_t-1, the output of the previous LSTM block, X_t, the input for the
current LSTM block, C_t-1, the memory of the previous block, and finally a bias vector b_0. This neural network
has a sigmoid function as activation, and its output vector is the forget valve, which will be applied to the old memory
C_t-1 by element-wise multiplication.

Now the second valve is called the new memory valve. Again, it is a one layer simple neural network that takes the
same inputs as the forget valve. This valve controls how much the new memory should influence the old memory.

The new memory itself, however, is generated by another neural network. It is also a one-layer network, but uses
tanh as the activation function. The output of this network will be element-wise multiplied by the new memory valve, and
added to the old memory to form the new memory.
These two ✖ signs are the forget valve and the new memory valve.

And finally, we need to generate the output for this LSTM unit. This step has an output valve that is controlled by
the new memory, the previous output h_t-1, the input X_t and a bias vector. This valve controls how much new
memory should output to the next LSTM unit.
The above diagram is inspired by Christopher’s blog post. But most of the time, you will see a diagram like below.
The major difference between the two variations is that the following diagram doesn’t treat the memory unit C as
an input to the unit. Instead, it treats it as an internal thing “Cell”.
I like Christopher’s diagram, in that it explicitly shows how this memory C gets passed from the previous unit
to the next. But in the following image, you can’t easily see that C_t-1 is actually from the previous unit, and that C_t is
part of the output.
The second reason I don’t like the following diagram is that the computation you perform within the unit should be
ordered, but you can’t see that clearly from the diagram. For example, to calculate the output of this unit,
you need to have C_t, the new memory, ready. Therefore, the first step should be evaluating C_t.
The following diagram tries to represent this “delay” or “order” with dashed lines and solid lines (there are errors in
this picture). Dashed lines mean the old memory, which is available at the beginning. Solid lines mean the
new memory. Operations requiring the new memory have to wait until C_t is available.
But these two diagrams are essentially the same. Here, I want to use the same symbols and colors of the first
diagram to redraw the above diagram:

This is the forget gate (valve) that shuts the old memory:
This is the new memory valve and the new memory:

These are the two valves and the element-wise summation to merge the old memory and the new memory to form
C_t (in green, flows back to the big “Cell”):
This is the output valve and output of the LSTM unit:

1. Flashback: A look into Recurrent Neural Networks (RNN)


Take an example of sequential data, which can be the stock market’s data for a particular stock. A
simple machine learning model or an Artificial Neural Network may learn to predict the stock prices
based on a number of features: the volume of the stock, the opening value etc. While the price of the
stock depends on these features, it is also largely dependent on the stock values in the previous
days. In fact, for a trader, these values from the previous days (or the trend) are one major deciding factor
for predictions.

In conventional feed-forward neural networks, all test cases are considered to be independent.
That is, when fitting the model for a particular day, there is no consideration for the stock prices on the
previous days.

This dependency on time is achieved via Recurrent Neural Networks. A typical RNN looks like:
This may be intimidating at first sight, but once unfolded, it looks a lot simpler:

Now it is easier for us to visualize how these networks are considering the trend of stock prices,
before predicting the stock prices for today. Here every prediction at time t (h_t) is dependent on all
previous predictions and the information learned from them.

RNNs can solve our purpose of sequence handling to a great extent but not entirely. We want our
computers to be good enough to write Shakespearean sonnets. Now RNNs are great when it comes
to short contexts, but in order to be able to build a story and remember it, we need our models to be
able to understand and remember the context behind the sequences, just like a human brain. This is
not possible with a simple RNN.

Why? Let’s have a look.

2. Limitations of RNNs
Recurrent Neural Networks work just fine when we are dealing with short-term dependencies. That is
when applied to problems like:

RNNs turn out to be quite effective. This is because this problem has nothing to do with the context of
the statement. The RNN need not remember what was said before this, or what was its meaning, all
they need to know is that in most cases the sky is blue. Thus the prediction would be:
However, vanilla RNNs fail to understand the context behind an input. Something that was said long
before, cannot be recalled when making predictions in the present. Let’s understand this as an
example:

Here, we can understand that since the author has worked in Spain for 20 years, it is very likely that
he may possess a good command of Spanish. But, to make a proper prediction, the RNN needs to
remember this context. The relevant information may be separated from the point where it is needed,
by a huge load of irrelevant data. This is where a Recurrent Neural Network fails!

The reason behind this is the problem of the Vanishing Gradient. In order to understand this, you’ll need
to have some knowledge about how a feed-forward neural network learns. We know that for a
conventional feed-forward neural network, the weight update applied to a particular layer is a
multiple of the learning rate, the error term from the previous layer and the input to that layer. Thus,
the error term for a particular layer is effectively a product of all previous layers’ errors. When
dealing with activation functions like the sigmoid function, the small values of its derivatives (occurring
in the error function) get multiplied many times over as we move towards the starting layers. As a result,
the gradient almost vanishes as we move towards the starting layers, and it becomes difficult
to train these layers.
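To see the scale of the effect, note that the sigmoid’s derivative σ′(z) = σ(z)(1 − σ(z)) is at most 0.25, so a chain of such factors shrinks geometrically. A toy calculation (not a full backprop, just the best-case bound):

```python
# The sigmoid derivative peaks at 0.25 (at z = 0). Backpropagating through
# many layers (or time steps) multiplies such factors together.
max_deriv = 0.25
grad_scale = 1.0
for layer in range(20):      # 20 layers, or 20 time steps of an unrolled RNN
    grad_scale *= max_deriv
print(grad_scale)            # 0.25**20, i.e. roughly 9.1e-13
```

Even under this best case, twenty steps shrink the gradient by about twelve orders of magnitude, so the early layers receive essentially no learning signal.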

A similar case is observed in Recurrent Neural Networks. An RNN remembers things for just small
durations of time, i.e. if we need the information after a short time it may be reproducible, but once a
lot of words are fed in, this information gets lost somewhere. This issue can be resolved by applying a
slightly tweaked version of RNNs – the Long Short-Term Memory Networks.

3. Improvement over RNN: LSTM (Long Short-Term Memory) Networks

When we arrange our calendar for the day, we prioritize our appointments, right? If we need to
make some space for anything important, we know which meeting could be canceled to accommodate
a possible meeting.

Turns out that an RNN doesn’t do so. In order to add new information, it transforms the existing
information completely by applying a function. Because of this, the entire information is modified on
the whole, i.e. there is no consideration for ‘important’ information and ‘not so important’ information.

LSTMs on the other hand, make small modifications to the information by multiplications and
additions. With LSTMs, the information flows through a mechanism known as cell states. This way,
LSTMs can selectively remember or forget things. The information at a particular cell state has three
different dependencies.

We’ll visualize this with an example. Let’s take the example of predicting stock prices for a particular
stock. The stock price of today will depend upon:

1. The trend that the stock has been following in the previous days, maybe a downtrend or an
uptrend.
2. The price of the stock on the previous day, because many traders compare the stock’s
previous day price before buying it.
3. The factors that can affect the price of the stock for today. This can be a new company policy
that is being criticized widely, or a drop in the company’s profit, or maybe an unexpected
change in the senior leadership of the company.

These dependencies can be generalized to any problem as:

1. The previous cell state (i.e. the information that was present in the memory after the previous
time step)
2. The previous hidden state (i.e. this is the same as the output of the previous cell)
3. The input at the current time step (i.e. the new information that is being fed in at that moment)

Another important feature of LSTM is its analogy with conveyor belts!

That’s right!

Industries use them to move products around for different processes. LSTMs use this mechanism to
move information around.

We may have some addition, modification or removal of information as it flows through the different
layers, just like a product may be molded, painted or packed while it is on a conveyor belt.

The following diagram explains the close relationship of LSTMs and conveyor belts.

Although this diagram is not even close to the actual architecture of an LSTM, it serves our purpose for now.

It is because of this property of LSTMs, where they do not manipulate the entire information but rather modify it slightly, that they are able to forget and remember things selectively. How they do so is what we are going to learn in the next section.

4. Architecture of LSTMs
The functioning of an LSTM can be visualized by understanding the functioning of a news channel's team covering a murder story. Now, a news story is built around facts, evidence and the statements of many people. Whenever a new event occurs, you take one of three steps.

Let’s say, we were assuming that the murder was done by ‘poisoning’ the victim, but the autopsy
report that just came in said that the cause of death was ‘an impact on the head’. Being a part of this
news team what do you do? You immediately forget the previous cause of death and all stories that
were woven around this fact.

What if an entirely new suspect is introduced into the picture, a person who had grudges against the victim and could be the murderer? You input this information into your news feed, right?

Now, all these broken pieces of information cannot be served on mainstream media. So, after a certain time interval, you need to summarize this information and output the relevant things to your audience. Maybe in the form of "XYZ turns out to be the prime suspect."

Now let's get into the details of the architecture of the LSTM network:

Now, this is nowhere close to the simplified version we saw before, but let me walk you through it. A typical LSTM network is composed of different memory blocks called cells (the rectangles that we see in the image). There are two states that are transferred to the next cell: the cell state and the hidden state. The memory blocks are responsible for remembering things, and manipulations of this memory are done through three major mechanisms, called gates. Each of them is discussed below.

4.1 Forget Gate

Take the example of a text prediction problem, and assume an LSTM is fed the following sentence:

As soon as the first full stop after “person” is encountered, the forget gate realizes that there may be a
change of context in the next sentence. As a result of this, the subject of the sentence is forgotten and
the place for the subject is vacated. And when we start speaking about “Dan” this position of the
subject is allocated to “Dan”. This process of forgetting the subject is brought about by the forget gate.

A forget gate is responsible for removing information from the cell state. Information that is no longer required for the LSTM to understand things, or that is of less importance, is removed via the multiplication of a filter. This is required for optimizing the performance of the LSTM network.

This gate takes in two inputs: h_t-1 and x_t.

h_t-1 is the hidden state from the previous cell (i.e. the output of the previous cell) and x_t is the input at that particular time step. The given inputs are multiplied by the weight matrices and a bias is added. Following this, the sigmoid function is applied to this value. The sigmoid function outputs a vector, with values ranging from 0 to 1, corresponding to each number in the cell state. Basically, the sigmoid function is responsible for deciding which values to keep and which to discard. If a '0' is output for a particular value in the cell state, it means that the forget gate wants the cell state to forget that piece of information completely. Similarly, a '1' means that the forget gate wants to keep that piece of information in its entirety. This vector output from the sigmoid function is multiplied element-wise with the cell state.
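The arithmetic described above can be sketched in plain NumPy. This is a minimal illustration only, not a trained network: the dimensions (hidden size 3, input size 2) and the weight matrix W_f and bias b_f are made-up random placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy dimensions: hidden size 3, input size 2 (illustrative, not from the article)
rng = np.random.default_rng(0)
h_prev = rng.standard_normal(3)        # h_t-1: hidden state of the previous cell
x_t = rng.standard_normal(2)           # x_t: input at the current time step
c_prev = rng.standard_normal(3)        # cell state from the previous time step

W_f = rng.standard_normal((3, 3 + 2))  # weight matrix acting on [h_t-1, x_t]
b_f = np.zeros(3)                      # bias

# sigmoid of (weights * inputs + bias) gives the filter: values in (0, 1)
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)

# the filter is multiplied element-wise into the cell state
c_after_forget = f_t * c_prev
```

Values of f_t near 0 wipe the corresponding cell-state entry; values near 1 keep it intact.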

4.2 Input Gate

Okay, let’s take another example where the LSTM is analyzing a sentence:

Now the important information here is that “Bob” knows swimming and that he has served the Navy
for four years. This can be added to the cell state, however, the fact that he told all this over the
phone is a less important fact and can be ignored. This process of adding some new information can
be done via the input gate.

Here is its structure:

The input gate is responsible for the addition of information to the cell state. This addition of information is basically a three-step process, as seen in the diagram above.

1. Regulating what values need to be added to the cell state by involving a sigmoid function.
This is basically very similar to the forget gate and acts as a filter for all the information from
h_t-1 and x_t.
2. Creating a vector containing all possible values that can be added (as perceived from h_t-1
and x_t) to the cell state. This is done using the tanh function, which outputs values from -1 to
+1.
3. Multiplying the value of the regulatory filter (the sigmoid gate) by the created vector (the tanh
   function) and then adding this useful information to the cell state via the addition operation.

Once this three-step process is done, we ensure that only information that is important and not redundant is added to the cell state.
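The three steps above can be sketched in NumPy as follows. As before, the dimensions and the weights (W_i, b_i, W_c, b_c) are random placeholders for illustration; in a full LSTM cell, step 3 adds the filtered candidates to the cell state that has already passed through the forget gate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
h_prev = rng.standard_normal(3)        # previous hidden state h_t-1
x_t = rng.standard_normal(2)           # current input x_t
c_prev = rng.standard_normal(3)        # cell state (after the forget gate)
concat = np.concatenate([h_prev, x_t])

W_i, b_i = rng.standard_normal((3, 5)), np.zeros(3)  # weights for the sigmoid filter
W_c, b_c = rng.standard_normal((3, 5)), np.zeros(3)  # weights for the tanh candidates

i_t = sigmoid(W_i @ concat + b_i)      # step 1: regulatory filter, values in (0, 1)
c_tilde = np.tanh(W_c @ concat + b_c)  # step 2: candidate values, in (-1, +1)
c_t = c_prev + i_t * c_tilde           # step 3: add the filtered candidates
```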

4.3 Output Gate

Not all information that runs along the cell state is fit for being output at a certain time. We'll visualize this with an example:
In this phrase, there could be a number of options for the empty space. But we know that the current
input of ‘brave’, is an adjective that is used to describe a noun. Thus, whatever word follows, has a
strong tendency of being a noun. And thus, Bob could be an apt output.

This job of selecting useful information from the current cell state and showing it out as an output is
done via the output gate. Here is its structure:

The functioning of an output gate can again be broken down into three steps:

1. Creating a vector after applying the tanh function to the cell state, thereby scaling the values to
   the range -1 to +1.
2. Making a filter using the values of h_t-1 and x_t, such that it can regulate the values that need
   to be output from the vector created above. This filter again employs a sigmoid function.
3. Multiplying the value of this regulatory filter by the vector created in step 1, and sending it out
   as an output and also to the hidden state of the next cell.

The filter in the above example will make sure that it diminishes all other values but ‘Bob’. Thus the
filter needs to be built on the input and hidden state values and be applied on the cell state vector.
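These three steps can also be sketched in NumPy. Again, the dimensions and the weights W_o and b_o are random placeholders; the point is only the shape of the computation, where the resulting h_t is both the cell's output and the hidden state passed to the next cell.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
h_prev = rng.standard_normal(3)        # previous hidden state h_t-1
x_t = rng.standard_normal(2)           # current input x_t
c_t = rng.standard_normal(3)           # current cell state
concat = np.concatenate([h_prev, x_t])

W_o, b_o = rng.standard_normal((3, 5)), np.zeros(3)

scaled = np.tanh(c_t)                  # step 1: scale cell state to (-1, +1)
o_t = sigmoid(W_o @ concat + b_o)      # step 2: sigmoid filter from h_t-1 and x_t
h_t = o_t * scaled                     # step 3: output and next hidden state
```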

5. Text generation using LSTMs


We have had enough of the theoretical concepts and functioning of LSTMs. Now we will try to build a model that can predict some number n of characters after the original text of Macbeth. Most of the classical texts are no longer protected under copyright and can be found here. An updated version of the .txt file can be found here.

We will use the library Keras, which is a high-level API for neural networks and works on top of
TensorFlow or Theano. So make sure that before diving into this code you have Keras installed and
functional.

Okay, so let’s generate some text!

Importing dependencies

# Importing dependencies numpy and keras
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.utils import np_utils

We import all the required dependencies and this is pretty much self-explanatory.

 Loading text file and creating character to integer mappings

# load text
filename = "/macbeth.txt"
text = (open(filename).read()).lower()

# mapping characters with integers
unique_chars = sorted(list(set(text)))

char_to_int = {}
int_to_char = {}

for i, c in enumerate(unique_chars):
    char_to_int.update({c: i})
    int_to_char.update({i: c})

The text file is opened, and all characters are converted to lowercase. In order to facilitate the following steps, we map each character to a respective number. This is done to make the computation part of the LSTM easier.

 Preparing dataset

# preparing input and output dataset
X = []
Y = []

for i in range(0, len(text) - 50, 1):
    sequence = text[i:i + 50]
    label = text[i + 50]
    X.append([char_to_int[char] for char in sequence])
    Y.append(char_to_int[label])
Data is prepared in a format such that if we want the LSTM to predict the 'O' in 'HELLO', we would feed in ['H', 'E', 'L', 'L'] as the input and ['O'] as the expected output. Similarly, here we fix the length of the sequence that we want (set to 50 in the example) and then save the encodings of each 50-character window in X and the expected output, i.e. the 51st character, in Y.
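The windowing can be seen on a toy string. This is a scaled-down illustration of the loop above, using a sequence length of 5 instead of 50 and the made-up text "hello world":

```python
# toy version of the sliding-window preparation, sequence length 5
text = "hello world"
seq_len = 5
unique_chars = sorted(set(text))
char_to_int = {c: i for i, c in enumerate(unique_chars)}

X, Y = [], []
for i in range(len(text) - seq_len):
    X.append([char_to_int[c] for c in text[i:i + seq_len]])
    Y.append(char_to_int[text[i + seq_len]])

# the first window encodes "hello"; its label is the 6th character, a space
```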

 Reshaping of X

# reshaping, normalizing and one hot encoding
X_modified = numpy.reshape(X, (len(X), 50, 1))
X_modified = X_modified / float(len(unique_chars))
Y_modified = np_utils.to_categorical(Y)

An LSTM network expects the input to be in the form [samples, time steps, features], where samples is the number of data points we have, time steps is the number of time-dependent steps in a single data point, and features refers to the number of variables we have for the corresponding true value in Y. We then scale the values in X_modified between 0 and 1 and one-hot encode our true values in Y_modified.
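The same reshaping and encoding can be checked on tiny made-up data. The identity-matrix indexing trick below is a Keras-free stand-in for np_utils.to_categorical, used here only so the shapes are easy to verify; the numbers are placeholders:

```python
import numpy as np

# toy stand-in: 2 samples, 3 time steps each, integer character encodings
X = [[1, 2, 3], [2, 3, 4]]
X_modified = np.reshape(X, (2, 3, 1))   # [samples, time steps, features]
X_modified = X_modified / 5.0           # normalize by a made-up vocabulary size

# one-hot encoding without keras: index rows of an identity matrix
Y = [0, 2]
Y_modified = np.eye(3)[Y]
```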

 Defining the LSTM model

# defining the LSTM model
model = Sequential()
model.add(LSTM(300, input_shape=(X_modified.shape[1], X_modified.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(300))
model.add(Dropout(0.2))
model.add(Dense(Y_modified.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

A sequential model which is a linear stack of layers is used. The first layer is an LSTM layer with 300
memory units and it returns sequences. This is done to ensure that the next LSTM layer receives
sequences and not just randomly scattered data. A dropout layer is applied after each LSTM layer to
avoid overfitting of the model. Finally, we have the last layer as a fully connected layer with a
‘softmax’ activation and neurons equal to the number of unique characters, because we need to
output one hot encoded result.

 Fitting the model and generating characters

# fitting the model
model.fit(X_modified, Y_modified, epochs=1, batch_size=30)

# picking a random seed
start_index = numpy.random.randint(0, len(X) - 1)
new_string = X[start_index]

# generating characters
for i in range(50):
    x = numpy.reshape(new_string, (1, len(new_string), 1))
    x = x / float(len(unique_chars))

    # predicting
    pred_index = numpy.argmax(model.predict(x, verbose=0))
    char_out = int_to_char[pred_index]
    seq_in = [int_to_char[value] for value in new_string]
    print(char_out)

    new_string.append(pred_index)
    new_string = new_string[1:len(new_string)]

The model is fit with a batch size of 30 (only one epoch here, to keep training time short; more epochs will give far better results). We then pick a random sequence from the data as a seed pattern and start generating characters. The prediction from the model gives out the character encoding of the predicted character; it is then decoded back to the character value and appended to the pattern.

This is how the output of the network would look:

Eventually, after enough training epochs, it will give better and better results over time. This is how you would use an LSTM to solve a sequence prediction task.

End Notes
LSTMs are a very promising solution to sequence and time-series related problems. However, the one disadvantage that I find with them is the difficulty in training them. A lot of time and system resources go into training even a simple model. But that is just a hardware constraint! I hope I was successful in giving you a basic understanding of these networks. For any problems or issues related to the blog, please feel free to comment below.
