
Accepted Manuscript

Title: Feature Memory-Based Deep Recurrent Neural Network for Language Modeling

Author: Hongli Deng Lei Zhang Xin Shu

PII: S1568-4946(18)30163-7
DOI: https://doi.org/doi:10.1016/j.asoc.2018.03.040
Reference: ASOC 4789

To appear in: Applied Soft Computing

Received date: 21-5-2017


Revised date: 6-2-2018
Accepted date: 23-3-2018

Please cite this article as: Hongli Deng, Lei Zhang, Xin Shu, Feature Memory-Based Deep Recurrent Neural Network for Language Modeling, Applied Soft Computing Journal (2018), https://doi.org/10.1016/j.asoc.2018.03.040

This is a PDF file of an unedited manuscript that has been accepted for publication.
As a service to our customers we are providing this early version of the manuscript.
The manuscript will undergo copyediting, typesetting, and review of the resulting proof
before it is published in its final form. Please note that during the production process
errors may be discovered which could affect the content, and all legal disclaimers that
apply to the journal pertain.
*Highlights (for review)

1. We design a feature memory module to provide direct connections between every two layers. These direct connections greatly shorten the length of the path between non-adjacent layers.

2. We propose a new stacking pattern to construct deep recurrent neural network-based language models. This pattern alleviates gradient vanishing and allows the network to be trained effectively even when a large number of layers are stacked.

Feature Memory-Based Deep Recurrent Neural Network for
Language Modeling
Hongli Denga,b , Lei Zhanga,∗, Xin Shua

a Machine Intelligence Laboratory, College of Computer Science, Sichuan University, 24# South Section 1, Yihuan Road, Chengdu, P.R. China, 610065
b Education and Information Technology Center, China West Normal University, Nanchong, P.R. China, 637002

Abstract
Recently, deep recurrent neural networks (DRNNs) have been widely proposed for language modeling. DRNNs can learn higher-level features of the input data by stacking multiple recurrent layers, which enables them to achieve better performance than single-layer models. However, because of their simple linear stacking pattern, the gradient information vanishes when it is propagated backward through too many layers. As a result, DRNNs become hard to train, and their performance degrades rapidly as the number of recurrent layers increases. To address this problem, the feature memory-based deep recurrent neural network (FMDRNN) is proposed in this paper. FMDRNN introduces a new stacking pattern through a special feature memory module (FM), which allows the hidden units of each layer to see and reuse all the features generated by the preceding stacked layers, not just the feature from the previous layer as in DRNNs. FM acts like a traffic hub that provides direct connections between every two layers, and the attention network in FM controls the switch of these connections. These direct connections enable FMDRNN to alleviate the vanishing of the gradient during backward propagation and also prevent the learned features from washing away before they reach the end of the network. FMDRNN is evaluated by performing extensive experiments on the widely used English Penn Treebank dataset and five more complex non-English language corpora. The experimental results show that FMDRNN can be effectively trained even when a large number of layers are stacked, so that it benefits from deeper networks instead of degrading in performance, and it consistently achieves markedly better results than other models through deeper but thinner networks.


Corresponding author
Email addresses: hongli_deng@foxmail.com (Hongli Deng), leizhang@scu.edu.cn (Lei
Zhang), shuxin@stu.scu.edu.cn (Xin Shu)

Preprint submitted to Applied Soft Computing April 4, 2018

Keywords: Memory and Attention, Recurrent Neural Network, Deep Learning,
Language modeling

1. Introduction

Language modeling plays an important role in many natural language-related applications such as automatic speech recognition, machine translation, information retrieval, and natural language understanding. In the past, n-gram statistical language models (LMs) [1, 2, 3, 4, 5] were the most frequently used. However, they require an exact match for the context, so the number of possible parameters (n-grams) increases exponentially as the length of the context increases. To address this problem, feedforward neural networks (FNNs) were used to cluster similar contexts and share parameters, and FNN-based LMs (FNNLMs) were proposed [6]. Compared to FNNs, RNNs [7, 8, 9, 10, 11, 12] can capture longer context patterns. Therefore, RNNs have become the most popular choice for language modeling in recent years, and various recurrent neural network (RNN)-based LMs (RNNLMs) have been proposed [13, 14, 15, 16, 17, 18].
RNNs can be considered as deep neural networks (DNNs) with indefinitely many layers when unfolded in time. However, the primary function of the unfolded layers is to introduce local memory at one time scale, not hierarchical processing. Thus RNNs are only "deep in time" and are not able to build up progressively higher-level features of the input data [19]. To deal with this disadvantage, some researchers, encouraged by the success of deep architectures in general DNNs, combined the concept of DNNs with RNNs to form deep recurrent neural networks (DRNNs) [19, 20, 21, 22]. In addition to being "deep in time," this class of models is also "deep in space," with the goal of improving the performance of the models by learning higher-level features of the input data [19]. When such hierarchies are applied to natural language sentences, they can better model the multiscale language effects that are emblematic of natural languages: layers at lower-level time scales capture short-term interactions among words, and higher layers reflect interpretations aggregated over longer spans of text. Recently, DRNNs have been used in many natural language processing tasks [19, 21], including the task of language modeling [15, 17].
However, very few recurrent layers are actually used in existing DRNNs. Adding more recurrent layers does not improve the performance of the network but instead degrades it (see Section 4.2 for details). This is because the simple linear stack of recurrent layers in conventional DRNNs may result in gradient vanishing if too many recurrent layers are stacked, similar to the phenomenon previously observed in vanilla RNNs [see, e.g., 23] when the gradient passes through many time steps. As the number of stacked recurrent layers increases, DRNNs become more difficult to train and their performance degrades rapidly.

To solve this problem, the feature memory-based deep recurrent neural network (FMDRNN) is proposed in this paper. Through the addition of FM, FMDRNN forms a new stacking pattern. During each time step, FM explicitly stores and memorizes the features that have been learned, so that the hidden units of each layer can see and reuse all the features learned by the preceding stacked layers. FM is the traffic hub of the network, providing direct connections between every two layers, and the attention network in FM is used as the gate to control the switch of these connections. These direct connections greatly shorten the length of the paths between non-adjacent layers. This ensures that the gradient and feature information does not vanish or wash out when it passes through all the stacked layers, enabling an FMDRNN with a large number of recurrent layers to be trained effectively.
Before our work, practices and theories for achieving memory in neural networks had been studied for a long time. Jordan [24] used recurrent links to give the network memory, allowing the network's hidden units to see their own previous output, so that subsequent behavior can be shaped by previous responses. Elman [25] proposed that the hidden unit patterns be fed back to themselves, making the internal representations reflect the feature in the context of prior internal states. Meanwhile, many related works were also proposed [7, 8, 9, 26]. All these works can be considered variants of the vanilla RNN. Their key point is to allow a memory of the units' own previous outputs, so that it persists in the network's internal state over time. Similarly, FM is proposed in this paper to give memory to the network, but it differs from other works in two aspects. On the one hand, the recurrent links or feedback of patterns added in all the variants of RNN are used for the hidden units to memorize their own outputs at the previous time step, whereas FM is used to make the hidden units memorize the outputs (features) generated by the preceding stacked layers during the same time step. On the other hand, the various variants of RNN were presented to represent the time dimension in the network, whereas FM is proposed to alleviate the vanishing of the gradient when it is propagated backward through the deep layers and also to prevent the learned features from washing away before they reach the end of the network.
We evaluate FMDRNN by performing extensive experiments on the English Penn Treebank (PTB) corpus, a dataset widely used for testing language models. The results show that even if many layers are stacked, FMDRNN can be effectively trained and benefits from deeper networks rather than degrading in performance as DRNNs do. Furthermore, with deeper but thinner networks, FMDRNN consistently outperforms the models that use the conventional framework. We also perform experiments on five more complex non-English benchmarks and verify that FMDRNN can obtain markedly better results for language modeling on various datasets.
The rest of this paper is organized as follows: Preliminaries are reviewed in Section 2. Section 3 provides the detailed description of FMDRNN for language modeling. In Section 4, FMDRNN is evaluated on several corpora. Some discussions are given in Section 5, and finally Section 6 offers conclusions and suggestions for future work.

2. Preliminaries

2.1. LMs
An LM is formalized as a probability distribution over a sequence of words. Specifically, for a given sequence of words with length $T$: $\langle x_1, \cdots, x_{T-1}, x_T \rangle$, the LM assigns a joint probability, $p(x_1, \cdots, x_{T-1}, x_T)$, to the sequence to estimate the probability that the sequence is a natural language sentence. For convenience, we assume that all the words have been converted to their indices in the vocabulary. A good LM should assign a higher probability to a probable sentence than to an improbable one.
Inspired by the observation that people always say a sentence from the beginning to the end, the joint probability can be decomposed as

$$p(x_1, \cdots, x_{T-1}, x_T) = \prod_{t=1}^{T} p\left(x_t \mid x_0, \cdots, x_{t-2}, x_{t-1}\right), \qquad (1)$$

where $x_t$ ($1 \le t \le T$) is the index of the $t$-th word, and $x_0$ is the special token <BOS> added to the beginning. Each $p(x_t \mid x_0, \cdots, x_{t-2}, x_{t-1})$ is a conditional probability, representing the probability of the word $x_t$ given all its previous words. The group of words conditioned on, $x_0, \cdots, x_{t-2}, x_{t-1}$, is called the context.
For simplicity, LMs usually assume that a word only depends on the previous n-1 words and adjust Equation (1) as

$$p(x_1, \cdots, x_{T-1}, x_T) \approx \prod_{t=1}^{T} p\left(x_t \mid x_{t-n+1}, \cdots, x_{t-2}, x_{t-1}\right). \qquad (2)$$

N-gram statistical LMs are the most classic models based on this assumption [1, 2, 3, 4, 5]; each conditional probability $p(x_t \mid x_{t-n+1}, \cdots, x_{t-2}, x_{t-1})$ in Equation (2) is computed by counting the number of times that $x_t$ appears in the given context and normalizing it by all observations of this context. Subsequently, Bengio et al. [6] proposed to use a probability function represented by FNNs to compute these conditional probabilities. This model associates each word with a distributed word feature vector and expresses the conditional probability function in terms of the feature vectors of the words in the context. The word feature vectors and the parameters of the probability function are learned simultaneously through the iterative training of the network.
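As a toy illustration of the count-based estimate in Equation (2), the following sketch computes a bigram probability (n = 2). The corpus and probabilities here are made up for the example and are not taken from the paper.

from collections import Counter

# Toy corpus; <BOS> plays the role of x0 in Equation (1).
corpus = [["<BOS>", "the", "cat", "sat"],
          ["<BOS>", "the", "dog", "sat"],
          ["<BOS>", "the", "cat", "ran"]]

bigrams = Counter()
context_counts = Counter()
for sentence in corpus:
    for prev, word in zip(sentence[:-1], sentence[1:]):
        bigrams[(prev, word)] += 1      # count of "word follows prev"
        context_counts[prev] += 1       # count of all observations of the context "prev"

# Count-based estimate of p(x_t | x_{t-1}): normalize by all observations of the context.
p_cat_given_the = bigrams[("the", "cat")] / context_counts["the"]
print(p_cat_given_the)  # 2/3, since "cat" follows "the" in 2 of the 3 sentences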

In recent years, instead of computing the simplified Equation (2), researchers have proposed to use RNNs as a means of directly calculating Equation (1) [13, 14, 27, 16]. Compared with n-gram and FNN models, RNNs have the ability to capture longer context patterns. Therefore, RNNLMs have drawn increasing research interest and shown significantly better performance.
Hence, we describe two standard RNN-based LMs here: (1) vanilla RNN-based LMs (RNN-LMs) [13]; (2) long short-term memory (LSTM)-based LMs (LSTM-LMs) [14], to show the computation process of LMs in detail.

RNN-LMs. In RNN-LMs, each conditional probability $p(x_t \mid x_0, \cdots, x_{t-2}, x_{t-1})$ ($1 \le t \le T$) is iteratively computed by the vanilla RNN. The architecture of RNN-LMs (as shown in Figure 1(a)) usually consists of three layers: an input layer, a recurrent layer, and an output layer. An RNN-LM is usually unfolded in time as shown in Figure 1(b).
At each time step $t$, the input layer reads a word, $x_{t-1}$, and maps the word to a dense vector,

$$\hat{x}_{t-1} = W_{emb}\, x^{*}_{t-1}. \qquad (3)$$

This vector $\hat{x}_{t-1} \in \mathbb{R}^{d_w}$ is known as the word embedding. The matrix $W_{emb} \in \mathbb{R}^{d_w \times |V|}$ ($d_w \ll |V|$) is a word embedding matrix, where $d_w$ is the dimension of the word embedding and $|V|$ is the size of the vocabulary contained in the data. Each column of this matrix corresponds to the word embedding of a word in the vocabulary. The vector $x^{*}_{t-1}$ is the one-hot representation of $x_{t-1}$, a vector of size $|V|$ in which only the $(x_{t-1})$-th element is 1 and all other elements are 0.
Then the recurrent layer updates the feature vector $h_t \in \mathbb{R}^{d_h}$ by a nonlinear transformation, dependent on the current embedding vector $\hat{x}_{t-1}$ and its previous state $h_{t-1}$,

$$h_t = f\left(W_{hx}\hat{x}_{t-1} + W_{hh}h_{t-1} + b_h\right) \quad (1 \le t \le T), \qquad (4)$$

where $f(\cdot)$ denotes the element-wise non-linear function, and $W_{hx} \in \mathbb{R}^{d_h \times d_w}$ and $W_{hh} \in \mathbb{R}^{d_h \times d_h}$ are the weight matrices of the connection from the input to this layer and of the recurrent links of this layer, respectively. Besides, $d_h$ and $b_h \in \mathbb{R}^{d_h}$ denote the size and the bias vector of this layer, respectively.
Finally, the output layer estimates the conditional probability over each word in the vocabulary as

$$y_t = \mathrm{softmax}\left(W_{yh}h_t + b_y\right). \qquad (5)$$

Figure 1: Architecture of RNN-LMs. (a) Simplified schematic of RNN-LMs; (b) an RNN-LM unfolded in time. The layers denoted by green, orange and blue indicate the input, recurrent hidden and output layers, respectively.

Here $y_t$ is the output vector of size $|V|$, and $W_{yh} \in \mathbb{R}^{|V| \times d_h}$ and $b_y \in \mathbb{R}^{|V|}$ are the weight matrix and the bias vector from the recurrent layer to the output layer, respectively. The function $\mathrm{softmax}$ ($\mathrm{softmax}(z_k) = e^{z_k} / \sum_p e^{z_p}$) ensures that $y_t$ forms a valid probability distribution over the words in the vocabulary, i.e., all elements of $y_t$ are between 0 and 1, and their sum is 1. Each element of $y_t$ is the estimated conditional probability of a word in the vocabulary. If $x_t = k$, then the conditional probability of $x_t$ is the $k$-th element of $y_t$,

$$p\left(x_t \mid x_0, x_1, \cdots, x_{t-1}\right) = y_t^k.$$
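The forward computation of Equations (3)-(5) can be summarized by a short sketch. This is an illustrative PyTorch-style snippet written for this presentation, not the authors' code; the sizes and random initialization are placeholder assumptions, and the tensor names mirror the symbols above.

import torch

d_w, d_h, V = 200, 200, 10000          # embedding size, hidden size, vocabulary size
W_emb = torch.randn(d_w, V) * 0.05     # word embedding matrix (Eq. 3)
W_hx  = torch.randn(d_h, d_w) * 0.05   # input-to-hidden weights (Eq. 4)
W_hh  = torch.randn(d_h, d_h) * 0.05   # recurrent weights (Eq. 4)
b_h   = torch.zeros(d_h)
W_yh  = torch.randn(V, d_h) * 0.05     # hidden-to-output weights (Eq. 5)
b_y   = torch.zeros(V)

def rnn_lm_step(word_index, h_prev):
    """One time step of a vanilla RNN-LM: Eq. (3), (4) and (5)."""
    x_hat = W_emb[:, word_index]                            # embedding lookup = W_emb @ one-hot
    h_t = torch.tanh(W_hx @ x_hat + W_hh @ h_prev + b_h)    # Eq. (4), with f = tanh
    y_t = torch.softmax(W_yh @ h_t + b_y, dim=0)            # Eq. (5), distribution over |V| words
    return h_t, y_t

h = torch.zeros(d_h)
h, y = rnn_lm_step(word_index=42, h_prev=h)   # y[k] estimates p(x_t = k | context)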



LSTM-LMs. The vanilla RNN architecture used in RNN-LMs can summarize all the historical information up to the current time $t$; however, learning long-range dependencies with the vanilla RNN is difficult [28]. This is because gradient values can either blow up or decay exponentially with increasing context length. To overcome this difficulty, an improved RNN architecture, long short-term memory (LSTM) [8], was proposed. Subsequently, several variants of the original LSTM have been presented [29, 30, 31]. In recent years, LSTMs have shown their effectiveness for many difficult tasks, such as handwriting recognition [32, 33], speech synthesis [21, 34], visual tracking [35, 36], as well as language modeling [14, 15].
Figure 2: Graphical depiction of the widely used LSTM block. This variant features a single cell, three gates (input, forget, output), and the peephole connections (red, omitted in our model). The symbol σ represents the element-wise non-linear activation functions of the three gates, which is always the sigmoid. And f, usually the hyperbolic tangent, denotes the block input and output activation function. The small circles (black) represent the multiplicative units.

LSTM-based LMs usually have the same architecture as the RNN-LMs depicted in Figure 1, except that the nonlinear units in the recurrent layer are replaced with LSTM blocks. The most widely used setup of the LSTM block is shown in Figure 2, which is drawn based on the LSTM block provided in [32] and [37]. However, the peephole links (red) are usually omitted in many subsequent studies, such as our baseline models [15, 17]. Therefore, the LSTM architecture used in our model is consistent with Figure 2 but without the peepholes. Then, different from RNN-LMs, LSTM-LMs compute the hidden state $h_t$ by the following series of equations:

$$\begin{cases}
i_t = \sigma\left(W_{ix}\hat{x}_{t-1} + W_{ih}h_{t-1} + b_i\right) \\
\phi_t = \sigma\left(W_{\phi x}\hat{x}_{t-1} + W_{\phi h}h_{t-1} + b_\phi\right) \\
c_t = i_t \odot f\left(W_{cx}\hat{x}_{t-1} + W_{ch}h_{t-1} + b_c\right) + \phi_t \odot c_{t-1} \\
o_t = \sigma\left(W_{ox}\hat{x}_{t-1} + W_{oh}h_{t-1} + b_o\right) \\
h_t = o_t \odot f\left(c_t\right),
\end{cases} \qquad (6)$$

where $i_t$, $\phi_t$ and $o_t$ are the activations of the input, forget, and output gates at time $t$, respectively, and $\odot$ denotes element-wise multiplication. The symbol $c_t$ represents the current state of the cell. The functions $\sigma(\cdot)$ and $f(\cdot)$ are usually the sigmoid and tanh, respectively. The matrices $W_{ix}$, $W_{\phi x}$, $W_{ox}$ and $W_{cx} \in \mathbb{R}^{d_h \times d_w}$ denote the weights of the connections from the input layer to the input gate, forget gate, output gate and cell of the LSTM blocks, respectively. The matrices $W_{ih}$, $W_{\phi h}$, $W_{oh}$ and $W_{ch} \in \mathbb{R}^{d_h \times d_h}$ correspond to the weights of the recurrent connections of the LSTM blocks. Besides, $b_i$, $b_\phi$, $b_o$ and $b_c \in \mathbb{R}^{d_h}$ are the bias vectors in the block.
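For concreteness, Equation (6) without peepholes can be written out directly. The following is an illustrative sketch written for this presentation (not the authors' implementation); the dictionary-based parameter layout and the random initialization are assumptions made for the example.

import torch

def lstm_step(x_hat, h_prev, c_prev, W, U, b):
    """One LSTM step as in Eq. (6), without peepholes.

    W, U, b are dicts keyed by gate name: 'i' (input), 'f' (forget, phi_t in the text),
    'o' (output), 'c' (cell candidate). Shapes: W[g] is (d_h, d_w), U[g] is (d_h, d_h), b[g] is (d_h,).
    """
    i = torch.sigmoid(W['i'] @ x_hat + U['i'] @ h_prev + b['i'])   # input gate
    f = torch.sigmoid(W['f'] @ x_hat + U['f'] @ h_prev + b['f'])   # forget gate
    o = torch.sigmoid(W['o'] @ x_hat + U['o'] @ h_prev + b['o'])   # output gate
    g = torch.tanh(W['c'] @ x_hat + U['c'] @ h_prev + b['c'])      # cell candidate
    c = i * g + f * c_prev                                         # new cell state
    h = o * torch.tanh(c)                                          # new hidden state
    return h, c

d_w, d_h = 200, 200
W = {g: torch.randn(d_h, d_w) * 0.05 for g in 'ifoc'}
U = {g: torch.randn(d_h, d_h) * 0.05 for g in 'ifoc'}
b = {g: torch.zeros(d_h) for g in 'ifoc'}
h, c = lstm_step(torch.randn(d_w), torch.zeros(d_h), torch.zeros(d_h), W, U, b)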
2.2. DRNNs
Although RNNs (usually including the vanilla RNN and LSTM) are naturally deep when they are unfolded in time ("depth in time"), they do not explicitly support multiple time scales. Recently, "depth in space" was introduced as an orthogonal notion to the "depth in time" of RNNs [19, 20, 21, 22]. This has been investigated by stacking multiple recurrent layers on top of each other, just as feedforward layers are stacked in DNNs, to create DRNNs [19]. In this kind of architecture, the output sequence of a recurrent layer is the input sequence of the succeeding one. Suppose that L recurrent layers of the vanilla RNN are stacked to form a DRNN; the feature at the $l$-th ($1 \le l \le L$) layer is computed as

$$\begin{cases}
h^1_t = f\left(W^1_{hx}\hat{x}_{t-1} + W^1_{hh}h^1_{t-1} + b^1_h\right), & l = 1, \\
h^l_t = f\left(W^l_{hx}h^{l-1}_t + W^l_{hh}h^l_{t-1} + b^l_h\right), & l = 2, 3, \cdots, L,
\end{cases} \qquad (7)$$

where $W^l_{hx}$ and $W^l_{hh}$ are the weight matrices of the connection from the $(l-1)$-th recurrent layer (the input layer for $l = 1$) to the $l$-th layer and of the recurrent links of the $l$-th recurrent layer, respectively. The symbol $b^l_h$ is the bias vector of the $l$-th layer. The hidden representations $h^{l-1}_t$ and $h^l_{t-1}$ are the output of the $(l-1)$-th recurrent layer at time $t$ and the output of the $l$-th recurrent layer at the previous time step, respectively. Note that for the first recurrent layer ($l = 1$), $h^{l-1}_t$ is the word embedding, $h^0_t = \hat{x}_{t-1}$.
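A conventional DRNN step as in Equation (7) simply feeds each layer's output to the next layer. The sketch below is illustrative only (written for this presentation, with vanilla-RNN layers and made-up shapes); it makes the plain linear stacking pattern explicit.

import torch

def drnn_step(x_hat, h_prev, params):
    """One time step of a conventional DRNN (Eq. 7): plain linear stacking.

    h_prev is a list of the previous hidden states, one per layer;
    params is a list of (W_hx, W_hh, b_h) tuples, one per layer.
    """
    h_new = []
    layer_input = x_hat                       # h^0_t = embedding of x_{t-1}
    for l, (W_hx, W_hh, b_h) in enumerate(params):
        h_l = torch.tanh(W_hx @ layer_input + W_hh @ h_prev[l] + b_h)
        h_new.append(h_l)
        layer_input = h_l                     # only the previous layer's output is passed on
    return h_new                              # h_new[-1] feeds the output layer

L, d_w, d_h = 3, 200, 200
params = [(torch.randn(d_h, d_w if l == 0 else d_h) * 0.05,
           torch.randn(d_h, d_h) * 0.05,
           torch.zeros(d_h)) for l in range(L)]
h = drnn_step(torch.randn(d_w), [torch.zeros(d_h) for _ in range(L)], params)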
Careful empirical investigation of this architecture showed that multiple recurrent layers in the stack can operate at different time scales [20], ensuring the integration of depth in both time and space. Although other ways of constructing DRNNs have also been explored [22], in the context of language modeling tasks, most studies focus on the stacking notion [see, e.g., 15, 17]. The schematic of an LM based on this kind of DRNN is shown in Figure 3. Here the computation of each conditional probability $p(x_t \mid x_0, x_1, \cdots, x_{t-2}, x_{t-1})$ is accomplished by the DRNN, combining the input layer (Equation (3)), stacked recurrent layers (Equation (7)), and the output layer (similar to Equation (5)).
Figure 3: Architecture of a DRNN-based LM. (a) Simplified description of the architecture of DRNNs. (b) A DRNN-based LM unfolded in time. The recurrent layer in the stack can be that of the vanilla RNN or an LSTM layer.

Although the stacked architecture of DRNNs has the advantage of integrating the temporal and spatial depths, it is difficult to train (see Section 4.2 for details). This is because the simple linear stacking of the recurrent layers means that the gradient information may vanish when it is propagated backward through too many layers, so the performance degrades as the number of recurrent layers increases.

3. FMDRNN

To improve the information flow among the stacked layers, FMDRNN introduces a special feature memory module to form a new stacking pattern. Based on the memory and attention mechanism, this module implements a global memorization of the features learned by the different-level layers, which provides direct connections to shorten the length of the path between non-adjacent layers and thereby alleviates the vanishing of the gradient in the process of backward propagation. Moreover, these direct connections also prevent the learned features from washing out when they reach the end of the network in the feedforward computation. The architecture of the FMDRNN is shown in Figure 4; it consists of the input module, the FM module, the stacked LSTM layers module, and the output layer. In this section, the main parts of FMDRNN and its training method are described in detail.

3.1. Input Module
FMDRNN receives data from the task and encodes it into a distributed vector representation by the input module. If the model takes words as input, then the distributed vector representation is the word embedding, and the input module is like the input layer of the RNN-LMs (see Equation (3)), responsible for mapping each word to its distributed vector representation $h^0_t \in \mathbb{R}^{d_w}$ using just a matrix,

$$h^0_t = W_{emb}\, x^{*}_{t-1} \quad (1 \le t \le T). \qquad (8)$$

Models that directly accept words take each word as a whole and do not consider the subword information (e.g., morphemes) included in the words. For example, they do not know that eventful, eventfully, uneventful and uneventfully should have structurally related embeddings in the vector space [17]. In order to overcome this problem, FMDRNN can also be fed with the characters of the word. Then the input module is a character-level convolutional neural network (CharCNN) [17], mapping each word to its distributed vector representation $h^0_t$ by taking its characters as input. Suppose that the word $x_{t-1}$ is made up of a sequence of characters $[c_1, c_2, \cdots, c_g]$, where $g$ is the length of the word. These characters are first represented as a matrix $C$. The $j$-th column of $C$ corresponds to the character-level representation of $c_j \in \mathbb{R}^{d_c}$, where $d_c$ is the dimension of the character-level representation.
Figure 4: Architecture of the FMDRNN. Here, $h^0_t$ is the lowest-level feature of the input at time $t$, which is obtained by the input module, and $h^l_t$ ($l \in [1, L]$) is the $l$-th feature at time $t$ learned by the $l$-th LSTM layer, where $L$ is the number of LSTM layers. The vertical arrow on the left, running from the bottom to the top of the diagram, indicates the memory update process of the feature memory module as the different-level features from $h^1_t$ to $h^L_t$ are learned. The symbols $m^1_t, m^2_t, \cdots, m^L_t$ represent the series of memory states generated in the process of updating the feature memory module. Note that there is only one feature memory module, which is updated $L$ times. For clarity, it is depicted $L$ times, and the modules with dotted frames are replicas of the one with the solid frame.
p

Then CharCNN uses $K$ filters to encode $C$ as follows:

$$\begin{cases}
z(r) = f\left(\sum_{i=1}^{d_c}\sum_{j=1}^{w} C_{i(j+r)} H^k_{ij} + b^k_C\right) \quad (0 \le r \le g - w), \\
y^k_C = \max_{r} z(r), \\
h^0_t = [y^1_C, y^2_C, \cdots, y^K_C],
\end{cases} \qquad (9)$$

where $H^k \in \mathbb{R}^{d_c \times w}$ is the $k$-th ($k = 1, 2, \cdots, K$) filter of width $w$, $b^k_C$ is the bias value, and $f(\cdot)$ is the non-linear function. The output $y^k_C$ is the encoding of $C$ using the $k$-th filter. To keep accordance with the model that uses words as input, the number of filters, $K$, is set to the same as the size of the word embedding, $K = d_w$.
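As an illustration of Equation (9), the character-level encoding of one word could look like the following sketch. The dimensions and the single filter width are assumptions made for the example and do not reproduce the exact CharCNN configuration of [17], which combines several filter widths.

import torch

d_c, w, K, g = 15, 3, 200, 7        # char-embedding size, filter width, number of filters, word length

C = torch.randn(d_c, g)             # character matrix of one word (columns = character embeddings)
H = torch.randn(K, d_c, w) * 0.05   # the K convolution filters H^k
b_C = torch.zeros(K)

def charcnn_encode(C, H, b_C):
    """Encode one word with a single-width CharCNN as in Eq. (9)."""
    K, d_c, w = H.shape
    g = C.shape[1]
    y = torch.empty(K)
    for k in range(K):
        # z(r) for all valid offsets r, followed by max-over-time pooling
        z = torch.stack([torch.tanh((C[:, r:r + w] * H[k]).sum() + b_C[k])
                         for r in range(g - w + 1)])
        y[k] = z.max()
    return y                        # h^0_t, the word representation of size K = d_w

h0_t = charcnn_encode(C, H, b_C)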
Subsequently, the representation $h^0_t$ is seen as the lowest-level feature of the input at time $t$ and is written into the FM module. Besides, it is also used as the input of the stacked LSTM layers module for learning higher-level features.

3.2. FM Module
During each time step, the FM module executes $L$ memory updates to continually store and memorize the features learned by the different-level LSTM layers. The architecture and memory update process of the FM module are displayed in Figure 5. It shows that this module is comprised of an FM space, an attention network, and a memory state. In the $l$-th ($1 \le l \le L$) memory update, the newest feature learned by the $l$-th LSTM layer, $h^l_t$, is written into the feature memory space. Then, the attention network fuses all the features in the current space using a soft attention mechanism. Finally, the memory state $m^{l-1}_t$ is updated to $m^l_t$. A detailed description of each part is given below.

FM Space. The FM space is a variable-sized ($[1, L+1]$) storage space. It is responsible for memorizing the features that have been learned by the preceding stacked layers in the same time step. The memorization is achieved by saving these features as a set:

$$s^{l-1}_t = \{h^0_t, h^1_t, \cdots, h^{l-1}_t\},$$

where $s^{l-1}_t$ ($1 \le l \le L$) is the feature set at time $t$ before the $l$-th memory update. Note that at the beginning of each time step, the size of the FM space is reset to 1 and the initial feature set is defined as $s^0_t = \{h^0_t\}$.
When the new feature $h^l_t \in \mathbb{R}^{d_h}$ is generated by the $l$-th LSTM layer, the FM module begins the $l$-th memory update. Then, $h^l_t$ is written into the FM space and $s^{l-1}_t$ is updated to

$$s^l_t = s^{l-1}_t \cup \{h^l_t\} = \{h^0_t, h^1_t, \cdots, h^{l-1}_t, h^l_t\}.$$

The feature memory space preserves all the different-level features that have been learned during the same time step, so that all the following layers can see and reuse them. This reuse provides a direct connection between any two layers to shorten their path, avoiding the attenuation of feature or gradient information that results from passing through too many layers.

Attention Network. Although the reuse of different-level features provides direct connections between any two layers, the switch of these connections needs to be controlled. Thus, an attention mechanism is applied and the attention network is proposed.
Figure 5: Architecture and memory update process of the FM module in time step $t$. (a) Architecture and memory update process of the FM module; (b) architecture of the attention network. The dashed arrows in (a) correspond to the memory state transitions. The blue and black solid arrows represent the information flow and weighted connections, respectively. The ⊕ and ⊙ indicate element-wise summation and element-wise multiplication, respectively. Note that $G^l(\cdot)$ is shared by all the features; the dotted frames are replicas of the solid frame.

In each memory update at time $t$, the attention network (as shown in Figure 5(b)) computes the attention value, $a^{lj}_t$ ($j \in [0, l]$), for each feature $h^j_t$ in the FM space. These attention values operate like gates to control the switch of the connections from the $j$-th layer to the subsequent layers.
We propose to use a group of scoring functions $\{G^l(\cdot)\}$ to compute these attention values. At the $l$-th memory update, the attention network recurrently uses the scoring function $G^l(\cdot)$ to produce a group of attention values,

$$a^{lj}_t = G^l(h^j_t) \quad (0 \le j \le l). \qquad (10)$$

In our work, the scoring function $G^l(\cdot)$ is shared by all the features in $s^l_t$ and is implemented using a two-layer feedforward neural network,

$$G^l(h^j_t) = \sigma\left(W^l_{FM} h^j_t + b^l_{FM}\right), \qquad (11)$$

where $W^l_{FM} \in \mathbb{R}^{d_h \times d_h}$ is the weight matrix and $b^l_{FM} \in \mathbb{R}^{d_h}$ is the bias vector. The function $\sigma(\cdot)$ is the element-wise sigmoid. For simplicity, we set all the features to the same dimension. If not, a linear map is needed to project them to the same dimension.
Based on these attention values, the attention network fuses all the features in the FM space by a soft attention mechanism to generate a new memory state,

$$m^l_t = \sum_{j=0}^{l} a^{lj}_t \odot h^j_t, \qquad (12)$$

where $m^l_t \in \mathbb{R}^{d_h}$ is the new memory state at time $t$ after the $l$-th iteration and $a^{lj}_t$ is the attention weight of $h^j_t$.
The attention mechanism encourages the attention network to learn larger attention values for more effective features, while it suppresses redundant features by assigning them minimal attention values.

Memory State. The memory state $m^l_t$ ($l \in [1, L]$) is like a conveyor belt, bridging the direct connections between every two layers and carrying the latest memory of all the learned features. At the $l$-th memory update of the $t$-th time step, it is updated by the attention network from $m^{l-1}_t$ to $m^l_t$, where $m^0_t$ is the initial memory state and is set to zeros. Then, $m^l_t$ ($l \in [1, L-1]$) is read as the input of the $(l+1)$-th LSTM layer to learn a higher-level feature. The last memory state $m^L_t$ is used for the final estimation.
Algorithm 1 gives the forward pass procedure of the $l$-th memory update at time $t$.
Algorithm 1 The forward pass procedure of the $l$-th memory update at time $t$.
Input:
    Previous feature set: $s^{l-1}_t = \{h^0_t, h^1_t, \ldots, h^{l-1}_t\}$; the output of the $l$-th LSTM layer: $h^l_t$.
Output:
    Current memory state: $m^l_t$.
1: Step 1. Update the feature memory space $s^l_t = s^{l-1}_t \cup \{h^l_t\}$.
2: Step 2. Recurrently compute the attention values of each feature in $s^l_t$ as follows:
3: for $j = 0, 1, \ldots, l$ do
4:     Do feedforward propagation to get the attention value $a^{lj}_t$ of $h^j_t$ according to Eq. (10) and Eq. (11).
5: end for
6: Step 3. Apply Eq. (12) to generate the new memory state $m^l_t$.
7: Step 4. Update the current memory state from $m^{l-1}_t$ to $m^l_t$.
8: Step 5. Output $m^l_t$ to the $(l+1)$-th LSTM layer.
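The memory update of Algorithm 1 and Equations (10)-(12) can be sketched as follows. This is an illustrative PyTorch-style snippet written for this presentation (not the authors' code); W_FM and b_FM stand for the shared scoring-function parameters of the $l$-th update, and the sizes are placeholder assumptions.

import torch

def fm_memory_update(feature_set, h_l, W_FM, b_FM):
    """One memory update of the FM module (Algorithm 1).

    feature_set : list of tensors [h^0_t, ..., h^{l-1}_t], each of size d_h
    h_l         : newest feature h^l_t from the l-th LSTM layer
    W_FM, b_FM  : parameters of the shared scoring function G^l (Eq. 11)
    """
    feature_set = feature_set + [h_l]                  # Step 1: s^l_t = s^{l-1}_t U {h^l_t}
    m_l = torch.zeros_like(h_l)
    for h_j in feature_set:                            # Step 2: one attention value per feature
        a_lj = torch.sigmoid(W_FM @ h_j + b_FM)        # Eq. (11), element-wise gate
        m_l = m_l + a_lj * h_j                         # Eq. (12), soft-attention fusion
    return feature_set, m_l                            # new feature set and memory state m^l_t

d_h = 200
W_FM, b_FM = torch.randn(d_h, d_h) * 0.05, torch.zeros(d_h)
s_t = [torch.randn(d_h)]                               # s^0_t = {h^0_t}
s_t, m_t = fm_memory_update(s_t, torch.randn(d_h), W_FM, b_FM)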

3.3. Stacked LSTM Layers Module
The stacking of multiple LSTM layers is used for learning different-level features. As shown in Figure 4, the input of the first LSTM layer ($l = 1$) at time $t$ is composed of the distributed vector representation of the raw input, $h^0_t$, and the output of this layer at the previous time step, $h^l_{t-1}$. This layer is the same as in the baseline models. However, the subsequent LSTM layers ($l \ge 2$), besides the common inputs of the baselines, $h^l_{t-1}$ and $h^{l-1}_t$, also take the current memory state $m^{l-1}_t$ as an input. For convenience, we set $m^0_t = 0$ and express the computation of this module uniformly as:

$$h^l_t = F\left(h^{l-1}_t,\, h^l_{t-1},\, m^{l-1}_t\right) \quad (1 \le l \le L), \qquad (13)$$

where $F(\cdot)$ is the computation function of the LSTM layer, which is formulated as

$$\begin{cases}
i^l_t = \sigma\left(W^l_{ix}h^{l-1}_t + W^l_{im}m^{l-1}_t + W^l_{ih}h^l_{t-1} + b^l_i\right), \\
\phi^l_t = \sigma\left(W^l_{\phi x}h^{l-1}_t + W^l_{\phi m}m^{l-1}_t + W^l_{\phi h}h^l_{t-1} + b^l_\phi\right), \\
c^l_t = i^l_t \odot f\left(W^l_{cx}h^{l-1}_t + W^l_{cm}m^{l-1}_t + W^l_{ch}h^l_{t-1} + b^l_c\right) + \phi^l_t \odot c^l_{t-1}, \\
o^l_t = \sigma\left(W^l_{ox}h^{l-1}_t + W^l_{om}m^{l-1}_t + W^l_{oh}h^l_{t-1} + b^l_o\right), \\
h^l_t = o^l_t \odot f\left(c^l_t\right),
\end{cases} \qquad (14)$$

where $i^l_t$, $\phi^l_t$ and $o^l_t$ are the activations of the input, forget, and output gates of the $l$-th LSTM layer at time $t$, respectively. The symbol $c^l_t$ denotes the state of the cell in this layer. The matrices $W^l_{pq}$ denote the weights from $q \in \{x, m, h\}$ to $p \in \{i, \phi, o, c\}$, and $b^l_p$ is the bias vector. The subscripts $i$, $\phi$, $o$ and $c$ refer to the input gate, forget gate, output gate and cell, respectively; $x$, $m$ and $h$ refer to the input, the memory state of the FM module, and the feature output by the LSTM layer, respectively. The feature $h^l_t$ is then written into the feature memory space. It is also used as one of the inputs of the next LSTM layer.
In contrast to conventional stacked LSTM layers, each LSTM layer in this module takes the latest memory state as one of its inputs. This memory state fuses all the features that have been learned by the preceding stacked layers, so that this LSTM layer reuses all the learned features directly.
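Putting Equations (13)-(14) together with the memory update of Section 3.2, one time step of the stacked module might look like the sketch below. It is illustrative only (reusing fm_memory_update from the earlier snippet); the parameter layout is an assumption, and all features are assumed to share the same dimension, as stated in Section 3.2.

import torch

def fmdrnn_lstm_step(h_below, h_prev, c_prev, m_prev, P):
    """One FMDRNN LSTM layer step, Eq. (14): gates see h^{l-1}_t, m^{l-1}_t and h^l_{t-1}.

    P holds, per gate g in {'i','f','o','c'}, the matrices P['Wx'][g], P['Wm'][g],
    P['Wh'][g] and the bias P['b'][g].
    """
    def pre(g):
        return P['Wx'][g] @ h_below + P['Wm'][g] @ m_prev + P['Wh'][g] @ h_prev + P['b'][g]
    i, f, o = (torch.sigmoid(pre(g)) for g in 'ifo')
    c = i * torch.tanh(pre('c')) + f * c_prev
    h = o * torch.tanh(c)
    return h, c

def fmdrnn_time_step(h0_t, h_prev, c_prev, layer_params, W_FM, b_FM):
    """Eq. (13) for l = 1..L, interleaved with the FM memory updates (Algorithm 1)."""
    s_t = [h0_t]                                   # s^0_t = {h^0_t}
    m_t = torch.zeros_like(h0_t)                   # m^0_t = 0
    h_below = h0_t                                 # h^{l-1}_t, starting with the embedding
    h_new, c_new = [], []
    for l, P in enumerate(layer_params):
        h_l, c_l = fmdrnn_lstm_step(h_below, h_prev[l], c_prev[l], m_t, P)
        s_t, m_t = fm_memory_update(s_t, h_l, W_FM[l], b_FM[l])   # m^{l-1}_t -> m^l_t
        h_new.append(h_l); c_new.append(c_l)
        h_below = h_l
    return h_new, c_new, m_t                       # m^L_t feeds the output layer (Eq. 15)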

3.4. Output Layer
In the final stage, the output layer estimates the conditional probability of $x_t$ based on the last memory state $m^L_t$ by

$$\begin{cases}
y_t = \mathrm{softmax}\left(W_{ym} m^L_t + b_y\right), \\
p\left(x_t \mid x_0, x_1, \cdots, x_{t-1}\right) = y_t^{x_t},
\end{cases} \qquad (15)$$

where $W_{ym} \in \mathbb{R}^{|V| \times d_h}$ is the weight matrix of the connection from the FM module to the output layer, $y_t \in \mathbb{R}^{|V|}$ is the estimated probability distribution over the words in the vocabulary, and $y_t^{x_t}$ denotes the $x_t$-th element of $y_t$ ($x_t$ is the index of the $t$-th word in the vocabulary).
Note that, unlike in the conventional DRNNs shown in Figure 3, the output of the last LSTM layer in FMDRNN is not used as the direct input of the output layer. Although the result could be improved further by adding a connection from the last LSTM layer to the output layer, this connection would greatly increase the parameters and the computation time of the model for the task of language modeling. Therefore, only the last memory state of the FM module is used by the output layer for the final estimation.
p

3.5. Model Training
Given a training set $X$ with $N$ samples, FMDRNN with parameter set $\Theta$ is trained by minimizing the cross-entropy of the estimated and true probability distributions over $X$ to search for the optimal parameters of the model. The parameter set $\Theta$ includes four subsets that refer to the parameters of the input module ($\{W_{emb}, C, H^k, b^k_C\}$), the FM module ($\{W^l_{FM}, b^l_{FM}\}$), the stacked LSTM layers module ($\{W^l_{pq}, b^l_p\}$) and the output layer ($\{W_{ym}, b_y\}$), respectively. In the training phase, each parameter $\theta \in \Theta$ is first initialized and then iteratively updated by mini-batch stochastic gradient descent (SGD) to find its optimum.
In detail, the training set $X$ is split into a series of mini-batches $\{X_m\}$. For a given sequence with $T_n$ words in a mini-batch, $x^{(n)} = \langle x^{(n)}_1, x^{(n)}_2, \cdots, x^{(n)}_{T_n} \rangle$ ($x^{(n)} \in X_m$), FMDRNN first forward passes this sequence through the whole network as described in Sections 3.1 to 3.4 to estimate the conditional probability at each time step, $p_\Theta(x^{(n)}_t \mid x^{(n)}_0, x^{(n)}_1, \cdots, x^{(n)}_{t-1})$. By denoting the label at time $t$ as $y^{(n)}_t \in \{1, 2, \cdots, |V|\}$, the cost is defined as the cross-entropy of the estimated and true probability distributions over $x^{(n)}$, i.e.,

$$J(\Theta; x^{(n)}) = -\frac{1}{T_n}\sum_{t=1}^{T_n} 1\{y^{(n)}_t = x^{(n)}_t\}\, \log p_\Theta\!\left(x^{(n)}_t \mid x^{(n)}_0, x^{(n)}_1, \cdots, x^{(n)}_{t-1}\right) = -\frac{1}{T_n}\sum_{t=1}^{T_n} \log p_\Theta\!\left(x^{(n)}_t \mid x^{(n)}_0, x^{(n)}_1, \cdots, x^{(n)}_{t-1}\right), \qquad (16)$$

where $1\{\cdot\}$ is the indicator function, with $1\{\text{a true statement}\} = 1$ and $1\{\text{a false statement}\} = 0$. In the task of language modeling, the input sequence for the sample $x^{(n)}$ is constructed as $\langle x^{(n)}_0, x^{(n)}_1, \cdots, x^{(n)}_{T_n-1} \rangle$ and the corresponding label is $\langle x^{(n)}_1, x^{(n)}_2, \cdots, x^{(n)}_{T_n} \rangle$; thus $1\{y^{(n)}_t = x^{(n)}_t\} = 1$. The symbol $x^{(n)}_t$ corresponds to the index in the vocabulary of the $t$-th word of $x^{(n)}$. Then the costs of all $x^{(n)}$ in the mini-batch are accumulated as

$$J(\Theta; X_m) = -\frac{1}{N_m}\sum_{n=1}^{N_m}\frac{1}{T_n}\sum_{t=1}^{T_n} \log p_\Theta\!\left(x^{(n)}_t \mid x^{(n)}_0, x^{(n)}_1, \cdots, x^{(n)}_{t-1}\right), \qquad (17)$$

where $N_m$ is the number of samples in the mini-batch $X_m$.
Subsequently, the backpropagation through time algorithm [38] is used to efficiently compute the gradients of the accumulated cost with respect to each parameter $\theta \in \Theta$. Finally, each parameter is updated according to these gradients. These processes are repeated until the best parameters are found. Algorithms 2 and 3 give the detailed optimization algorithms for learning the parameters of FMDRNN.
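A single parameter update on one mini-batch (Equations (16)-(17) followed by backpropagation through time and an SGD step) can be sketched as below. This is an illustrative outline, not the authors' implementation; model is assumed to be any PyTorch module that returns the per-step probability distributions, and labels are assumed to be integer word indices.

import torch

def sgd_step(model, batch, learning_rate):
    """One mini-batch update: cost of Eq. (17), BPTT via autograd, then plain SGD."""
    total_cost = 0.0
    for inputs, labels in batch:                 # inputs = <x_0..x_{Tn-1}>, labels = <x_1..x_Tn>
        probs = model(inputs)                    # probs[t] is the distribution y_t over |V| words
        idx = torch.arange(len(labels))
        log_p = torch.log(probs[idx, labels])    # log p(x_t | context) for every position t
        total_cost = total_cost - log_p.mean()   # Eq. (16): average negative log-likelihood
    cost = total_cost / len(batch)               # Eq. (17): average over the mini-batch
    for p in model.parameters():                 # reset gradients from the previous step
        p.grad = None
    cost.backward()                              # backpropagation through time
    with torch.no_grad():
        for p in model.parameters():             # SGD update: theta <- theta - alpha * grad
            p -= learning_rate * p.grad
    return cost.item()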
Perplexity is the most widely used metric for evaluating the quality of different language models; thus, we use the perplexity of the test data to evaluate the performance of the model after training. The lower the perplexity, the better the model. Given a test sequence $\langle x_1, \cdots, x_{T-1}, x_T \rangle$, the perplexity of a language model over this sequence is defined by

$$\mathrm{Perplexity} = \exp\!\left(\frac{NLL}{T}\right), \qquad NLL = -\sum_{t=1}^{T} \log p\!\left(x_t \mid x_0, \cdots, x_{t-1}\right),$$

where $NLL$ is the cross-entropy over all the words and $T$ is the number of words.
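For example, the test perplexity can be computed from the per-word probabilities as follows; this is a minimal sketch, and the probability values are made up for illustration.

import math

def perplexity(word_probs):
    """Perplexity of a sequence given p(x_t | x_0, ..., x_{t-1}) for each word."""
    nll = -sum(math.log(p) for p in word_probs)   # cross-entropy over all words
    return math.exp(nll / len(word_probs))        # exp(NLL / T)

# Hypothetical per-word probabilities assigned by a trained LM to a 4-word test sentence.
print(perplexity([0.2, 0.05, 0.1, 0.3]))          # about 7.6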

Algorithm 2 The optimization algorithm for searching for the optimum parameters of FMDRNN based on mini-batch SGD.
Input:
    The training set $X$, $x^{(n)} = \langle x^{(n)}_0, x^{(n)}_1, \cdots, x^{(n)}_{T_n} \rangle$, $x^{(n)} \in X$.
Output:
    Optimum parameters $\Theta = \{W_{emb}, C, H^k, b^k_C\} \cup \{W^l_{FM}, b^l_{FM}\} \cup \{W^l_{pq}, b^l_p\} \cup \{W_{ym}, b_y\}$ after training, where $1 \le l \le L$, $q \in \{x, m, h\}$, $p \in \{i, \phi, o, c\}$.
 2: Initialize each parameter $\theta \in \Theta$ and set a learning rate $\alpha$.
    // Execute maxIter iterations to update the parameters $\Theta$; maxIter is chosen by the user.
 4: for iter = 1, 2, ..., maxIter do
      Split $X$ into a series of mini-batches of size $N_m$.
 6:   for each mini-batch $X_m \subseteq X$ do
        // Forward pass each sample through all the layers
 8:     for each sample $x^{(n)} \in X_m$ do
          a. Perform the feedforward pass $fc(\Theta; x^{(n)})$ (see Algorithm 3) to compute the output of FMDRNN at each time step, $p_\Theta(x^{(n)}_t \mid x^{(n)}_0, x^{(n)}_1, \cdots, x^{(n)}_{t-1})$ ($1 \le t \le T_n$).
10:       b. Compute the cost of $x^{(n)}$ by Equation (16).
        end for
12:     Accumulate the costs of all $x^{(n)} \in X_m$ according to Equation (17).
        // Backpropagation
        Compute the gradient of $J(\Theta; X_m)$ with respect to each parameter $\theta \in \Theta$, $\nabla_\theta J(\Theta; X_m)$, using the backpropagation through time algorithm.
        // Update parameters
14:     Update each parameter $\theta \in \Theta$ as $\theta = \theta - \alpha \cdot \nabla_\theta J(\Theta; X_m)$.
      end for
16: end for
Algorithm 3 The feedforward pass function $fc(\Theta; x^{(n)})$.
Input:
    $\Theta$; $x^{(n)} = \langle x^{(n)}_0, x^{(n)}_1, \cdots, x^{(n)}_{T_n-1} \rangle$.
Output:
    $p_\Theta(x^{(n)}_t \mid x^{(n)}_0, x^{(n)}_1, \cdots, x^{(n)}_{t-1})$ ($1 \le t \le T_n$).
    for $t = 1, 2, \ldots, T_n$ do
      Step 1. Do the forward pass of the input module to get $h^0_t$ according to Eq. (8) or Eq. (9).
 3:   Step 2. Initialize $m^0_t = 0$, $s^0_t = \{h^0_t\}$.
      Step 3. Pass $h^0_t$ through the feature memory module and the stacked LSTM layers module as follows:
      for $l = 1, 2, \ldots, L$ do
 6:     a. Apply Eq. (13) and Eq. (14) to execute the forward pass of the $l$-th LSTM layer and generate $h^l_t$.
        b. Compute the $l$-th memory state of the feature memory module according to Algorithm 1.
      end for
 9:   Step 4. Compute the output probability of $x^{(n)}$ at time $t$, $p_\Theta(x^{(n)}_t \mid x^{(n)}_0, x^{(n)}_1, \cdots, x^{(n)}_{t-1})$, using Eq. (15).
    end for
    Output $p_\Theta(x^{(n)}_t \mid x^{(n)}_0, x^{(n)}_1, \cdots, x^{(n)}_{t-1})$, $1 \le t \le T_n$.
Table 1: Statistics of the five non-English language corpora, where |V| and |C| are the sizes of the word vocabulary and character vocabulary in each dataset, respectively.

           |V|    |C|    Training Tokens   Validation Tokens   Test Tokens
French     25K    76     1M                82K                 80K
Spanish    27K    72     1M                80K                 79K
German     37K    74     1M                74K                 73K
Czech      46K    101    1M                66K                 65K
Russian    63K    62     1M                72K                 65K
4. Experiments

To evaluate the effectiveness of the proposed FMDRNN for language modeling, a series of experiments was conducted on the widely used PTB dataset and five more complex non-English language corpora. The detailed experiments and results are described in this section.

4.1. Datasets
PTB: The PTB portion of the Wall Street Journal corpus [39] is the most normative and widely used dataset in the language modeling field. Thus, it is the best choice for comparing the quality of different LMs. It has been previously used by many researchers [13, 27, 15, 17], with the same preprocessing as in [13]: only the most frequent 10K words are included in the vocabulary, and the rest are replaced with a special token <unk>; all the data are split into training (sections 0-20, 930K tokens), validation (sections 21-22, 74K tokens), and test sets (sections 23-24, 82K tokens). This is the most commonly used version in the language modeling community.
Non-English Language Corpora: These are five morphologically rich non-English language corpora: French, Spanish, German, Czech, and Russian. The original data came from the 2013 ACL Workshop on Machine Translation. The preprocessed data were obtained and split into train/valid/test sets with the code provided by [17], which serves as one of the baselines for our work. In these datasets, only singleton words were mapped to <unk>, and hence the full vocabulary was used effectively. Basic statistics of these corpora are summarized in Table 1. Learning these corpora is a more complex language modeling task than learning PTB, because their training sets have the same magnitude as PTB but their vocabulary sizes are larger.

4.2. Baseline models
The regularized LSTM [15] and the character-aware neural language model [17] were used as the baseline models in our experiments. We chose them for two reasons: (1) they are very successful recent models for language modeling based on the stacked DRNN architecture; (2) they correspond to two kinds of frequently used input data: words and characters, respectively. The input of the regularized LSTM at each time step is a word in the context, while the character-aware neural language model uses the characters of the corresponding word as input to take subword information into account. We denote the two baselines as DRNN-Word and DRNN-Char, respectively. In [17], each baseline model was trained with two different sizes, as shown in Table 2, forming four networks: DRNN-Word-Small, DRNN-Word-Large, DRNN-Char-Small and DRNN-Char-Large.

Table 2: Architecture settings of DRNN-Word and DRNN-Char.

                            DRNN-Word          DRNN-Char
                            Small    Large     Small    Large
Word Embedding Size         200      650       -        -
Character Embedding Size    -        -         15       15
LSTM Layer Size             200      650       300      650
Output Layer Size           |V|      |V|       |V|      |V|

The above four networks are all DRNNs with the LSTM architecture, but only two LSTM layers were used in the original research. To observe their behavior with more LSTM layers, we trained them with different numbers of LSTM layers on the corpora given in Section 4.1. Figure 6 demonstrates an example of the results on the PTB data. It shows that the perplexity rises (the performance degrades) as the number of LSTM layers increases. More seriously, some of the networks cannot converge even when a small number of LSTM layers is used; for example, DRNN-Word-Small and DRNN-Char-Large did not converge with more than four LSTM layers, and DRNN-Char-Small did not converge with more than three LSTM layers. Although the objective of DRNNs is to improve the performance of models by learning higher-level features of the input data, the results here show that the conventional stacking of recurrent layers makes DRNNs difficult to train when the spatial depth increases. Adding more recurrent layers leads to a rapid degradation in performance rather than an improvement. Similar phenomena were also observed for the non-English datasets, as shown in Figure 8.

4.3. Experimental Setup
Model Architecture Settings: To compare with the baseline models, we trained two variants of FMDRNN: FMDRNN-Word and FMDRNN-Char, corresponding to using words and characters as inputs, respectively. Moreover, each variant was trained with two groups of settings for comparison with the baselines in Table 2. The settings are shown in Table 3. The word embedding size of FMDRNN-Word was set to the same as that of the LSTM layer.
Figure 6: Behaviors of the four baseline networks on PTB: (a) DRNN-Word-Small; (b) DRNN-Word-Large; (c) DRNN-Char-Small; (d) DRNN-Char-Large. Note that some networks cannot converge, making their results so poor that they cannot be displayed on the same scale as the other results; thus we add a small sub-graph in the corresponding panel to display these results separately.
Table 3: Architecture settings of FMDRNN. nG is the hidden layer size of the feedforward neural network used for the scoring function G(·) in the attention network.

                                               FMDRNN-Word        FMDRNN-Char
                                               Small    Large     Small    Large
Input Module        Word Embedding Size        200      650       -        -
                    Character Embedding Size   -        -         15       15
Stacked LSTM        Layers (L)                 ≥2       ≥2        ≥2       ≥2
                    Size                       200      650       300      650
Attention Network   nG                         200      650       300      650
Output Layer        Size                       |V|      |V|       |V|      |V|
For FMDRNN-Char, the size of the character embedding is 15, and the other settings of the CharCNN used in the input module are exactly the same as in [17]. The number of stacked LSTM layers L was set to L ≥ 2. The size of the LSTM layer was set in accordance with the corresponding baseline networks, and we used the same size for each LSTM layer in the same network. The output layer size was decided by the size of the vocabulary |V| of the dataset. All the settings in Table 3 are mostly guided by the baseline models. In our experiments, both architectures were used for all the corpora, and the settings were unchanged unless otherwise stated.
Implementation details: The FMDRNN-based LM was implemented in Python using Torch [40] and run on an NVIDIA TESLA K20 GPU. To exclude the effect of other techniques on the results, we used optimization settings similar to those of the baseline models for training. The training was executed by truncated backpropagation through time, backpropagating for 35 time steps. We updated the weights after minibatches of 20 sequences. Unless otherwise stated, the initial learning rate was set to 1.0 and reduced when the decrease in validation perplexity was less than 1.0. All the parameters of the model were randomly initialized from a uniform distribution in [-0.05, 0.05]. We did not employ any pre-training step for the parameters. For regularization, 50% and 65% dropout [41] were applied to the small and large networks, respectively. Because gradient clipping works well against gradient explosion and is crucial for training the models [17], we clipped the L2 norm of the gradients at 5. All the networks were trained until convergence, and the parameters with the best development set performance were selected as the final parameters to be evaluated.
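The optimization recipe above (SGD with a learning rate reduced when the validation perplexity improves by less than 1.0, and L2 gradient-norm clipping at 5) can be expressed as a short sketch. It is illustrative only; model, train_batches and valid_perplexity are placeholders for the actual data pipeline, and the decay factor is an assumption made for the example.

import torch

def train(model, train_batches, valid_perplexity, max_epochs=30):
    """SGD with validation-triggered learning-rate decay and gradient-norm clipping."""
    learning_rate, clip_norm, decay = 1.0, 5.0, 0.5   # decay factor: an assumption for this sketch
    best_ppl = float('inf')
    for epoch in range(max_epochs):
        for inputs, labels in train_batches:
            probs = model(inputs)                                          # per-step distributions
            loss = -torch.log(probs[torch.arange(len(labels)), labels]).mean()
            for p in model.parameters():
                p.grad = None
            loss.backward()                                                # truncated BPTT
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)  # clip L2 norm of gradients at 5
            with torch.no_grad():
                for p in model.parameters():
                    p -= learning_rate * p.grad
        ppl = valid_perplexity(model)
        if best_ppl - ppl < 1.0:        # validation perplexity improved by less than 1.0
            learning_rate *= decay      # reduce the learning rate
        best_ppl = min(best_ppl, ppl)
    return model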

4.4. Results for PTB
Behavior of FMDRNN with increasing spatial depth: The first set of experiments investigates the behavior of FMDRNN as the depth in space gradually increases. We trained FMDRNN-Word-Small, FMDRNN-Word-Large, FMDRNN-Char-Small, and FMDRNN-Char-Large with different numbers of LSTM layers and tested them on the PTB test data. The test perplexities of the four networks are displayed in Figure 7. These results show two major findings.
First, the test perplexities of the four networks tend to decrease as the number of stacked LSTM layers increases. Although the decrease is sometimes small, or a slight increase can even be observed when the number of LSTM layers is further increased, adding LSTM layers does not lead to serious degradation (much higher perplexity) as it does in the baseline models (Figure 6). This observation shows that FMDRNN can be effectively trained even with a large number of stacked layers and thereby avoids performance degradation.
Second, the performance of these networks tends to be relatively stable after the spatial depth grows to a certain value. For example, the test perplexity of FMDRNN-Word-Small continues to decrease until 12 LSTM layers are used, and then there is only a slight fluctuation when the number of LSTM layers is larger than 12. We believe this phenomenon occurs because: (1) the performance of FMDRNN reaches its maximum when the number of LSTM layers runs up to a certain value; (2) further increasing the number of LSTM layers results in redundant layers emerging, but fortunately the direct connections from these redundant layers are weakened or cut off by giving them small attention values. In contrast, the connections from the effective layers still connect to subsequent layers through FM (see Section 5.2 for a detailed analysis). Therefore, FMDRNN maintains relatively stable results after the network reaches its maximum performance.
Perplexity comparison with the optimal number of LSTM layers: In this section, the four networks mentioned above are compared with the corresponding baseline networks when all of them are trained with their optimal number of LSTM layers. Their test perplexities are listed in Table 4, which shows that (1) FMDRNN yields a clear performance boost over the corresponding baselines. These improvements are due to the new stacking pattern, which enables deeper FMDRNNs to be effectively trained, so that FMDRNN can benefit from deeper models and achieve better results than the baselines; (2) FMDRNN-Word-Small and FMDRNN-Char-Small obtain lower test perplexities than DRNN-Word-Large and DRNN-Char-Large, respectively. These results indicate that, whether it is fed with words or characters, FMDRNN can outperform the larger baseline models using thinner and deeper architectures.
Figure 7: Test perplexities of FMDRNN for PTB with an increasing number of LSTM layers: (a) FMDRNN-Word-Small; (b) FMDRNN-Word-Large; (c) FMDRNN-Char-Small; (d) FMDRNN-Char-Large. The results of shallow networks with only one LSTM layer (green) are also shown for comparison.
Table 4: Comparison of FMDRNN and the baseline models trained with their optimal number of LSTM layers on the PTB dataset.

Model                    LSTM layers   LSTM Size   Test Perplexity
DRNN-Word-Small [15]     2             200         97.6
FMDRNN-Word-Small        12            200         78.4
DRNN-Word-Large [15]     2             650         82.7
FMDRNN-Word-Large        14            650         74.8
DRNN-Char-Small [17]     2             300         92.3
FMDRNN-Char-Small        18            300         77.7
DRNN-Char-Large [17]     2             650         78.9
FMDRNN-Char-Large        12            650         73.9
4.5. Results for Non-English Corpora
To verify the effectiveness of FMDRNN on more difficult tasks, our next series of experiments focused on five non-English language modeling tasks: French, Spanish, German, Czech, and Russian. Compared with PTB, the five non-English corpora have much larger vocabularies but the same number of training tokens. This makes it more difficult to build deep RNNs on these corpora.
Behavior of FMDRNN with increasing spatial depth: All the networks were retrained on the five non-English corpora. We observed behaviors similar to those observed for the PTB dataset as the depth in space increases. In Figure 8, we show the results for the French dataset as an example. We can clearly see that as the number of LSTM layers increases, the validation perplexities of FMDRNN drop, whereas those of all the baseline networks rise. These results are similar to those for PTB, but differ in that fewer LSTM layers are tested for French (two to five LSTM layers are tested here). This is because the vocabulary size |V| of French is more than two times that of PTB (see Table 1), so FMDRNN has more parameters for the French corpus when the same number of LSTM layers is used. Hence, it is difficult to improve the results further with more than four LSTM layers because of serious overfitting. Fortunately, we found that FMDRNN can still be trained effectively and obtain good results even when more layers are added, as shown by FMDRNN with five LSTM layers. In contrast, some baseline networks cannot converge when more than three layers are used.
Perplexity comparison with the optimal number of LSTM layers: For the five non-English datasets, FMDRNN also obtains lower perplexity than the corresponding baselines when all of them use their optimal number of LSTM layers. The results are compared in Table 5. Moreover, most of the small networks of FMDRNN obtain better results than the large networks of DRNN, no matter what type of input is used.
The above experimental results show that the novel stacking mode of FMDRNN allows the network to be effectively trained even if many layers are stacked, so that FMDRNN can benefit from deeper models and obtain better results through deeper but thinner networks. These encouraging results strongly evidence that FMDRNN is a feasible strategy for constructing very deep RNNLMs.

5. Discussion
5.1. The number of parameters and the complexity of computation
Because of the added FM module, the proposed model may increase the number of parameters and the computation complexity compared with conventional DRNNs.
Figure 8: Behavior of FMDRNN versus that of the corresponding baselines on the French dataset: (a) DRNN-Word-Small; (b) DRNN-Word-Large; (c) DRNN-Char-Small; (d) DRNN-Char-Large; (e) FMDRNN-Word-Small; (f) FMDRNN-Word-Large; (g) FMDRNN-Char-Small; (h) FMDRNN-Char-Large.
Table 5: Test perplexities of FMDRNN versus those of the baseline models with the optimal number of LSTM layers for the five non-English corpora. The number of stacked LSTM layers is indicated in brackets. The perplexities of the baselines DRNN-Word-Small, DRNN-Word-Large, DRNN-Char-Small, and DRNN-Char-Large are from [17].

Model                 French    Spanish   German    Czech     Russian
DRNN-Word-Small       229 (2)   212 (2)   305 (2)   503 (2)   352 (2)
FMDRNN-Word-Small     210 (5)   192 (5)   289 (3)   481 (5)   349 (3)
DRNN-Word-Large       222 (2)   200 (2)   286 (2)   493 (2)   357 (2)
FMDRNN-Word-Large     196 (5)   181 (4)   269 (4)   450 (3)   331 (2)
DRNN-Char-Small       189 (2)   182 (2)   260 (2)   401 (2)   278 (2)
FMDRNN-Char-Small     177 (4)   159 (4)   239 (4)   366 (3)   254 (3)
DRNN-Char-Large       184 (2)   165 (2)   239 (2)   371 (2)   261 (2)
FMDRNN-Char-Large     172 (4)   155 (5)   230 (4)   350 (3)   240 (4)
Table 6: Parameters and perplexities of FMDRNN versus DRNNs when they are trained with the same setup for Russian.

Model               Parameters   PPL
DRNN-Word-Small     26M          352
FMDRNN-Word-Small   26M          350
DRNN-Word-Large     89M          357
FMDRNN-Word-Large   92M          330
DRNN-Char-Small     21M          278
FMDRNN-Char-Small   22M          260
DRNN-Char-Large     54M          261
FMDRNN-Char-Large   57M          240
To assess the merits of FMDRNN in this respect, a detailed analysis of this issue is given here.

The number of parameters. Compared with DRNNs, FMDRNN adds the parameters of the FM module when the same setting is used. However, the attention network weights W_FM^l and b_FM^l are the only group of parameters in the FM module, and they are shared to compute the attention values for all the features in a memory update process. Therefore, although the number of parameters of FMDRNN rises, the increase is not significant under the same setting. We take Russian, the dataset with the largest vocabulary (and hence the most parameters), as an example; the results are shown in Table 6. Moreover, Table 6 also shows that the performance of the network is greatly improved despite the additional parameters.
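For intuition, the following sketch counts the weights of a single shared attention projection against those of one LSTM layer. The square shape assumed for W_FM^l and the layer size used here are illustrative assumptions rather than the exact configuration of our models.

```python
def lstm_params(input_size, hidden_size):
    # A standard LSTM layer has 4 gates, each with input weights,
    # recurrent weights, and a bias vector.
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + hidden_size)

def fm_attention_params(hidden_size):
    # The single shared projection (W_FM, b_FM) is reused to score every
    # stored feature in a memory update, so its size does not grow with
    # the number of stacked layers.  The square shape is an assumption.
    return hidden_size * hidden_size + hidden_size

hidden = 650  # hidden size of the large setting
print(lstm_params(hidden, hidden))      # ~3.4M weights per stacked LSTM layer
print(fm_attention_params(hidden))      # ~0.4M weights for the whole FM module
```

Under this rough count, the shared attention projection adds only a small fraction of the weights of a single LSTM layer, which is consistent with the modest parameter increases reported in Table 6.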
Furthermore, as FMDRNN benefits from the spatial depth of the network, it can use a deeper and thinner network to decrease the number of parameters while still obtaining similar results. Table 7 shows results on the PTB data: the best result (78.4) of DRNN-Word was obtained when the LSTM size was set to 1,500, with approximately 66M parameters [15]; DRNN-Char obtained a similar result with fewer parameters (approximately 19M) than DRNN-Word by using characters as input; comparatively, FMDRNN reached these perplexities with far fewer parameters, and FMDRNN-Word in particular needs only 10M parameters to achieve the same result through a deeper but thinner network.

More importantly, FMDRNN can achieve better results than DRNNs even with much fewer parameters. Table 8 demonstrates the results on the PTB dataset, and Table 9 shows that the same holds when FMDRNN is used for the other five corpora.
Table 7: Parameters comparison of FMDRNN versus DRNNs when they obtain the same results for the PTB dataset. DRNN-Word and DRNN-Char correspond to the regularized LSTM and character-aware neural language models, respectively. Their results are from [15] and [17], respectively.

Model         LSTM layers   LSTM size   Test Perplexity   Parameters
DRNN-Word     2             1500        78.4              66M
DRNN-Char     2             650         78.9              19M
FMDRNN-Char   8             300         78.2              14M
FMDRNN-Word   12            200         78.4              10M

Table 8: Comparison of perplexities and the numbers of parameters on the PTB dataset.

Model         LSTM layers   LSTM size   Test Perplexity   Parameters
DRNN-Word     2             1500        78.4              66M
FMDRNN-Word   10            400         75.5              28M
DRNN-Char     2             650         78.9              19M
FMDRNN-Char   10            300         77.8              16M

Computation time and memory storage. For neural network-based LMs, estimating the probability of a word given its context requires explicitly computing the probability distribution over all the words in the vocabulary (see the calculation at the output layer: Equation (5) for conventional RNNLMs and Equation (15) for FMDRNN). Typical natural language datasets have vocabularies containing tens of thousands of words, so the size of the output layer (equal to the vocabulary size |V|) is very large (see Table 3). In contrast, the largest size of the other layers is set to 650 in our work, which is far smaller than the size of the output layer. Therefore, the large-scale matrix multiplication at the output layer consumes most of the computation time for both DRNNs and FMDRNNs, while the other layers take much less time. As shown in Table 10, the additional computational cost of the FM module is thus not significant, and it is not expensive relative to the improvement in performance. This table reports the runtime of each epoch for FMDRNN-Word/DRNN-Word on the PTB dataset and for FMDRNN-Char/DRNN-Char on the Russian corpus. These were chosen because the word-input networks tested on PTB are the simplest ones, while the character-input networks tested on the Russian corpus are the most complex.
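As a back-of-the-envelope illustration, the sketch below compares the per-time-step multiply-accumulate count of one hidden LSTM layer with that of the output projection. The vocabulary size is only an order-of-magnitude assumption, and the count ignores nonlinearities and the FM module.

```python
hidden_size = 650        # largest hidden size used in our experiments
vocab_size = 100_000     # order of magnitude of a large vocabulary (assumption)

# One LSTM layer: 4 gates, each with an input and a recurrent projection.
hidden_cost = 4 * 2 * hidden_size * hidden_size
# Output layer: one projection from the hidden state onto the full vocabulary.
output_cost = hidden_size * vocab_size

print(f"hidden layer: {hidden_cost / 1e6:.1f}M multiply-accumulates")  # ~3.4M
print(f"output layer: {output_cost / 1e6:.1f}M multiply-accumulates")  # ~65.0M
```

Even with several stacked layers and the FM module on top, the output projection remains the dominant term, which explains why the relative overhead of FMDRNN in Table 10 stays small.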

Table 9: Comparison of perplexities and the numbers of parameters on the five non-English corpora. The number in brackets indicates the size of each LSTM layer.

                          DRNN-Word   FMDRNN-Word   DRNN-Char   FMDRNN-Char
French   LSTM Layers      2 (650)     4 (200)       2 (650)     4 (300)
         Parameters       39M         12M           29M         13M
         Test Perplexity  222         210           184         177
Spanish  LSTM Layers      2 (650)     5 (200)       2 (650)     4 (300)
         Parameters       42M         13M           30M         14M
         Test Perplexity  200         192           165         159
German   LSTM Layers      2 (650)     4 (300)       2 (650)     4 (300)
         Parameters       54M         26M           37M         17M
         Test Perplexity  286         282           239         235
Czech    LSTM Layers      2 (650)     5 (200)       2 (650)     3 (300)
         Parameters       67M         21M           43M         19M
         Test Perplexity  493         481           371         363
Russian  LSTM Layers      2 (650)     3 (200)       2 (650)     3 (300)
         Parameters       89M         27M           54M         24M
         Test Perplexity  357         349           261         251

Besides, Table 10 also shows that the increase in memory storage of FMDRNN is not very large. This is because the FM space only preserves the different-level features within the same time step. In each time step, the FM space starts from one element h_t^0, and then the L different-level features are successively written in. When the computation in a time step is completed, the FM space is reset. That is to say, at most L + 1 elements are stored in this space, so it is not expensive to store these features.
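A minimal sketch of this per-time-step buffer is given below; the class and method names are ours and serve only to illustrate that at most L + 1 feature vectors are ever held at once.

```python
class FeatureMemorySpace:
    """Per-time-step store of the different-level features h_t^0, ..., h_t^L."""

    def __init__(self):
        self.features = []

    def reset(self, h0_t):
        # A new time step starts from the input-level feature h_t^0.
        self.features = [h0_t]

    def write(self, h_l_t):
        # Each stacked layer appends its output once per time step.
        self.features.append(h_l_t)

    def __len__(self):
        # Never more than L + 1 vectors, regardless of the sequence length.
        return len(self.features)
```

Because the buffer is cleared at every time step, its footprint grows with the number of stacked layers rather than with the sequence length, which keeps the additional memory reported in Table 10 modest.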


In summary, FMDRNN achieves better performance with only a limited increase in computation time and memory storage. In our work, we are more concerned with these better results than with the modest additional cost.
5.2. The attention network and attention values


The traditional DRNNs use a linear stacking method to increase the spatial depth, in which only two adjacent layers have a direct connection. Suppose the number of stacked layers is L; then the length of the path from the output to the input is L + 1. During back propagation, the gradient may vanish after passing through L + 1 layers, making DRNNs difficult to train. In contrast, FMDRNN provides a direct connection between every two non-adjacent layers by adding FM, so that the shortest possible path between any two layers is one.
Table 10: Comparison of FMDRNN and the baseline models on computation time and memory storage.

Dataset   Model               Time/Epoch   Memory Storage   Test Perplexity
PTB       DRNN-Word-Small     120s         435MB            98
          FMDRNN-Word-Small   169s         473MB            89
          DRNN-Word-Large     159s         759MB            83
          FMDRNN-Word-Large   252s         897MB            77
Russian   DRNN-Char-Small     442s         1581MB           278
          FMDRNN-Char-Small   523s         1642MB           260
          DRNN-Char-Large     686s         2579MB           261
          FMDRNN-Char-Large   818s         2724MB           240

Moreover, the attention network and the attention values are used to control the switches of these direct connections. For illustration, Figure 9 displays a simplified schematic of the direct connections between the output layer and the other layers. When necessary, the attention network switches on a direct connection, and the gradient can then be back propagated through it, compensating for the gradient that vanishes along the traditional path.
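To make the role of the attention values as switches concrete, the sketch below shows one simplified way the stored features can be combined when the memory is read. The score function and the softmax-style normalisation are assumptions for illustration, not the exact formulation of Equation (15).

```python
import numpy as np

def read_feature_memory(features, W_fm, b_fm):
    """Combine the stored features h_t^0, ..., h_t^L into one memory vector.

    Each feature receives a scalar attention value; values near zero
    effectively switch the corresponding direct connection off, while
    larger values switch it on.  This is a simplified stand-in for the
    FM read, not the exact formulation used in the paper.
    """
    scores = np.array([float(np.tanh(W_fm @ h + b_fm).sum()) for h in features])
    attention = np.exp(scores) / np.exp(scores).sum()   # normalised attention values
    memory = sum(a * h for a, h in zip(attention, features))
    return memory, attention

# Toy usage: three stored features of size 4.
rng = np.random.default_rng(0)
feats = [rng.standard_normal(4) for _ in range(3)]
memory, attention = read_feature_memory(feats, rng.standard_normal((4, 4)), rng.standard_normal(4))
print(attention)   # one gate value per stored feature
```

During back propagation, any feature with a non-zero attention value receives gradient directly from the memory read, which is the shortcut behaviour discussed above.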
To observe when these direct connections are learned to switch on, in this section we show what happens to the attention values as LSTM layers are gradually added. Note that the numbers of stacked LSTM layers tested for PTB were much higher than those for the non-English corpora in our experiments, so it is more convenient to conduct this analysis using the results on PTB. Therefore, we took FMDRNN-Word-Small and FMDRNN-Word-Large trained on PTB as examples of small and large models, respectively, to investigate the attention values of the different-level features in the L-th memory update process.
The results are displayed in Figure 10. Each multicolor bar displays the accumulated attention value distribution over all the features learned by the corresponding network. Each block with a different color in a bar represents the attention value of one different-level feature, and a larger block indicates a larger value for the corresponding feature. As Figure 10 shows, some features have larger attention values; these should be the more useful features for that layer. In contrast, some attention values are close to zero, corresponding to features that are redundant. Because the attention values are the gates that switch the added direct connections, these results indicate that FMDRNN learns to retain the direct connections from useful layers while suppressing the direct connections coming from the redundant ones.

[Figure 9 here. Schematic: the input module maps x_0, ..., x_{t-2}, x_{t-1} to h_t^0; the stacked LSTM layers produce h_t^1, h_t^2, ..., h_t^L; all of these features are written into FM, whose output m_t^L feeds the output layer computing p(x_t | x_0, ..., x_{t-2}, x_{t-1}).]

Figure 9: The direct connections from other layers to the output layer. The red path is the shortest one from input to output when the attention network switches it on.
Similar results are observed for FMDRNN-Word-Small and FMDRNN-Word-Large, with the difference that more features of FMDRNN-Word-Small have noticeable attention values. This is likely because the larger LSTM layer size of FMDRNN-Word-Large means that it needs fewer LSTM layers than FMDRNN-Word-Small to solve the same task. If redundant LSTM layers emerge in a network, the attention network suppresses them to avoid degrading performance.

In summary, the direct connections combined with the attention values greatly reduce the length of the path that the gradient passes through, providing a valid architecture for alleviating gradient vanishing as well as a feasible strategy for constructing very deep RNNLMs.

[Figure 10 here. Two panels, (a) FMDRNN-Word-LSTM200 and (b) FMDRNN-Word-LSTM650, plot the accumulated attention weights (from 0.0 to 1.0) of the features H1 to H12 against the number of stacked LSTM layers (2 to 12).]
Figure 10: Attention weights of the features learned by different levels of LSTM layers. The symbols H1 to H12 represent the features at different levels obtained by the corresponding network.
6. Conclusion
In this paper, we present a new recurrent neural network language model, FMDRNN, to overcome the challenge of propagating gradients across a large number of recurrent layers and to avoid the performance degradation this causes in conventional DRNNs. The experimental results show that the new stacking pattern of FMDRNN enables it to be effectively trained even with a larger number of recurrent layers, and to benefit from deeper models instead of suffering performance degradation. Thus, FMDRNN shows clear performance gains over the baseline DRNNs.

To verify the effect of the proposed model compared with the baselines, in the present experiments we did not search for the best experimental settings of FMDRNN, but kept them consistent with the baseline models to ensure a fair comparison. In future work, we
plan to search for better architecture settings of FMDRNN and use better regularization
techniques such as those proposed by [42] and [43] to seek better results.

7. Acknowledgments

This work was supported by the National Key R&D Program of China (grant 2016YFC0801800); the National Natural Science Foundation of China (grants 61332002 and 61772353); and the Fok Ying Tung Education Foundation (grant 151068).

Conflicts of interest: The authors declare that they have no conflicts of interest related to this work.
[1] P. F. Brown, P. V. Desouza, R. L. Mercer, V. J. D. Pietra, J. C. Lai, Class-based n-gram models of natural language, Computational Linguistics 18 (4) (1992) 467–479.

[2] R. Kneser, H. Ney, Improved backing-off for n-gram language modeling, 1995.

[3] T. R. Niesler, P. C. Woodland, A variable-length category-based n-gram language model, in: ICASSP, 1996, pp. 164–167.

[4] M. Federico, Bayesian estimation methods for n-gram language model adaptation, in: Proceedings of the International Conference on Spoken Language Processing (ICSLP), Vol. 1, 1996, pp. 240–243.

[5] H. Yamamoto, S. Isogai, Y. Sagisaka, Multi-class composite n-gram language model, Systems and Computers in Japan 34 (7) (2003) 108–114.

[6] Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin, A neural probabilistic language model, Journal of Machine Learning Research 3 (Feb) (2003) 1137–1155.

[7] K. J. Lang, A. H. Waibel, G. E. Hinton, A time-delay neural network architecture for isolated word recognition, Neural Networks 3 (1) (1990) 23–43.

[8] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (8) (1997) 1735–1780.

[9] H. Jaeger, The "echo state" approach to analysing and training recurrent neural networks - with an erratum note, Bonn, Germany: German National Research Center for Information Technology GMD Technical Report 148 (34) (2001) 13.
[10] L. Zhang, Z. Yi, J. Yu, Multiperiodicity and attractivity of delayed recurrent neural networks with unsaturating piecewise linear transfer functions, IEEE Transactions on Neural Networks 19 (1) (2008) 158–167.

[11] L. Zhang, Z. Yi, S. L. Zhang, P. A. Heng, Activity invariant sets and exponentially stable attractors of linear threshold discrete-time recurrent neural networks, IEEE Transactions on Automatic Control 54 (6) (2009) 1341–1347.

[12] L. Zhang, Z. Yi, Selectable and unselectable sets of neurons in recurrent neural networks with saturated piecewise linear transfer function, IEEE Transactions on Neural Networks 22 (7) (2011) 1021–1031.

[13] T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, S. Khudanpur, Recurrent neural network based language model, in: INTERSPEECH, Vol. 2, 2010, p. 3.

[14] M. Sundermeyer, R. Schlüter, H. Ney, LSTM neural networks for language modeling, in: INTERSPEECH, 2012, pp. 194–197.

[15] W. Zaremba, I. Sutskever, O. Vinyals, Recurrent neural network regularization, arXiv preprint arXiv:1409.2329.

[16] X. Liu, X. Chen, M. Gales, P. Woodland, Paraphrastic recurrent neural network language models, in: Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, IEEE, 2015, pp. 5406–5410.

[17] Y. Kim, Y. Jernite, D. Sontag, A. M. Rush, Character-aware neural language models, arXiv preprint arXiv:1508.06615.

[18] H. Deng, L. Zhang, L. Wang, Global context-dependent recurrent neural network language model with sparse feature learning, Neural Computing and Applications (2017) 1–13.

[19] O. Irsoy, C. Cardie, Opinion mining with deep recurrent neural networks, in: Conference on Empirical Methods in Natural Language Processing, 2014.

[20] M. Hermans, B. Schrauwen, Training and analysing deep recurrent neural networks, Advances in Neural Information Processing Systems (2013) 190–198.

[21] A. Graves, A.-R. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks, in: IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 6645–6649.
[22] R. Pascanu, C. Gulcehre, K. Cho, Y. Bengio, How to construct deep recurrent neural networks, Computer Science.

[23] D. E. Rumelhart, G. E. Hinton, R. J. Williams, Learning representations by back-propagating errors, Nature 323 (6088) (1986) 533.

[24] M. Jordan, Serial order: A parallel distributed processing approach. Technical report, June 1985–March 1986, Tech. rep., California Univ., San Diego, La Jolla (USA), Inst. for Cognitive Science (1986).

[25] J. L. Elman, Finding structure in time, Cognitive Science 14 (2) (1990) 179–211.

[26] F. J. Pineda, Generalization of back-propagation to recurrent neural networks, Physical Review Letters 59 (19) (1987) 2229.

[27] T. Mikolov, G. Zweig, Context dependent recurrent neural network language model, in: SLT, 2012, pp. 234–239.

[28] Y. Bengio, P. Simard, P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks 5 (2) (1994) 157–166.

[29] F. A. Gers, J. Schmidhuber, F. Cummins, Learning to forget: Continual prediction with LSTM, Neural Computation 12 (10) (2000) 2451–2471.

[30] F. A. Gers, N. N. Schraudolph, J. Schmidhuber, Learning precise timing with LSTM recurrent networks, The Journal of Machine Learning Research 3 (2003) 115–143.

[31] A. Graves, J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural Networks 18 (5) (2005) 602–610.

[32] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, J. Schmidhuber, A novel connectionist system for unconstrained handwriting recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (5) (2009) 855–868.

[33] P. Doetsch, M. Kozielski, H. Ney, Fast and robust training of recurrent neural networks for offline handwriting recognition, in: Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on, IEEE, 2014, pp. 279–284.

[34] H. Sak, A. Senior, F. Beaufays, Long short-term memory recurrent neural network architectures for large scale acoustic modeling, in: Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[35] H. Fan, H. Ling, SANet: Structure-aware network for visual tracking, arXiv preprint arXiv:1611.06878.

[36] L. Wang, L. Zhang, Z. Yi, Trajectory predictor by using recurrent neural networks in visual tracking, IEEE Transactions on Cybernetics 47 (10) (2017) 3172–3183.

[37] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, J. Schmidhuber, LSTM: A search space odyssey, IEEE Transactions on Neural Networks and Learning Systems.

[38] R. J. Williams, D. Zipser, Gradient-based learning algorithms for recurrent networks and their computational complexity, Backpropagation: Theory, Architectures, and Applications 1 (1995) 433–486.

[39] M. P. Marcus, M. A. Marcinkiewicz, B. Santorini, Building a large annotated corpus of English: The Penn Treebank, Computational Linguistics 19 (2) (1993) 313–330.

[40] Torch framework, http://torch.ch/.

[41] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (1) (2014) 1929–1958.

[42] Y. Gal, Z. Ghahramani, A theoretically grounded application of dropout in recurrent neural networks, in: Advances in Neural Information Processing Systems, 2016, pp. 1019–1027.

[43] O. Press, L. Wolf, Using the output embedding to improve language models, arXiv preprint arXiv:1608.05859.
