
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank


Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang,
Christopher D. Manning, Andrew Y. Ng and Christopher Potts
Stanford University, Stanford, CA 94305, USA
richard@socher.org,{aperelyg,jcchuang,ang}@cs.stanford.edu
{jeaneis,manning,cgpotts}@stanford.edu

Abstract

Semantic word spaces have been very useful but cannot express the meaning of longer phrases in a principled way. Further progress towards understanding compositionality in tasks such as sentiment detection requires richer supervised training and evaluation resources and more powerful models of composition. To remedy this, we introduce a Sentiment Treebank. It includes fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality. To address them, we introduce the Recursive Neural Tensor Network. When trained on the new treebank, this model outperforms all previous methods on several metrics. It pushes the state of the art in single sentence positive/negative classification from 80% up to 85.4%. The accuracy of predicting fine-grained sentiment labels for all phrases reaches 80.7%, an improvement of 9.7% over bag-of-features baselines. Lastly, it is the only model that can accurately capture the effects of negation and its scope at various tree levels for both positive and negative phrases.

Figure 1: Example of the Recursive Neural Tensor Network accurately predicting 5 sentiment classes, very negative to very positive (– –, –, 0, +, + +), at every node of a parse tree and capturing the negation and its scope in the sentence "This film does n't care about cleverness, wit or any other kind of intelligent humor."
1 Introduction

Semantic vector spaces for single words have been widely used as features (Turney and Pantel, 2010). Because they cannot capture the meaning of longer phrases properly, compositionality in semantic vector spaces has recently received a lot of attention (Mitchell and Lapata, 2010; Socher et al., 2010; Zanzotto et al., 2010; Yessenalina and Cardie, 2011; Socher et al., 2012; Grefenstette et al., 2013). However, progress is held back by the current lack of large and labeled compositionality resources and models to accurately capture the underlying phenomena presented in such data. To address this need, we introduce the Stanford Sentiment Treebank and a powerful Recursive Neural Tensor Network that can accurately predict the compositional semantic effects present in this new corpus.

The Stanford Sentiment Treebank is the first corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser (Klein and Manning, 2003) and includes a total of 215,154 unique phrases from those parse trees, each annotated by 3 human judges. This new dataset allows us to analyze the intricacies of sentiment and to capture complex linguistic phenomena. Fig. 1 shows one of the many examples with clear compositional structure.
The granularity and size of this dataset will enable the community to train compositional models that are based on supervised and structured machine learning techniques. While there are several datasets with document and chunk labels available, there is a need to better capture sentiment from short comments, such as Twitter data, which provide less overall signal per document.

In order to capture the compositional effects with higher accuracy, we propose a new model called the Recursive Neural Tensor Network (RNTN). Recursive Neural Tensor Networks take as input phrases of any length. They represent a phrase through word vectors and a parse tree and then compute vectors for higher nodes in the tree using the same tensor-based composition function. We compare to several supervised, compositional models such as standard recursive neural networks (RNN) (Socher et al., 2011b), matrix-vector RNNs (Socher et al., 2012), and baselines such as neural networks that ignore word order, Naive Bayes (NB), bi-gram NB and SVM. All models get a significant boost when trained with the new dataset but the RNTN obtains the highest performance with 80.7% accuracy when predicting fine-grained sentiment for all nodes. Lastly, we use a test set of positive and negative sentences and their respective negations to show that, unlike bag of words models, the RNTN accurately captures the sentiment change and scope of negation. RNTNs also learn that sentiment of phrases following the contrastive conjunction 'but' dominates.

The complete training and testing code, a live demo and the Stanford Sentiment Treebank dataset are available at http://nlp.stanford.edu/sentiment.

2 Related Work

This work is connected to five different areas of NLP research, each with their own large amount of related work to which we cannot do full justice given space constraints.

Semantic Vector Spaces. The dominant approach in semantic vector spaces uses distributional similarities of single words. Often, co-occurrence statistics of a word and its context are used to describe each word (Turney and Pantel, 2010; Baroni and Lenci, 2010), such as tf-idf. Variants of this idea use more complex frequencies such as how often a word appears in a certain syntactic context (Pado and Lapata, 2007; Erk and Padó, 2008). However, distributional vectors often do not properly capture the differences in antonyms since those often have similar contexts. One possibility to remedy this is to use neural word vectors (Bengio et al., 2003). These vectors can be trained in an unsupervised fashion to capture distributional similarities (Collobert and Weston, 2008; Huang et al., 2012) but then also be fine-tuned and trained on specific tasks such as sentiment detection (Socher et al., 2011b). The models in this paper can use purely supervised word representations learned entirely on the new corpus.

Compositionality in Vector Spaces. Most of the compositionality algorithms and related datasets capture two-word compositions. Mitchell and Lapata (2010), e.g., use two-word phrases and analyze similarities computed by vector addition, multiplication and others. Some related models such as holographic reduced representations (Plate, 1995), quantum logic (Widdows, 2008), discrete-continuous models (Clark and Pulman, 2007) and the recent compositional matrix space model (Rudolph and Giesbrecht, 2010) have not been experimentally validated on larger corpora. Yessenalina and Cardie (2011) compute matrix representations for longer phrases and define composition as matrix multiplication, and also evaluate on sentiment. Grefenstette and Sadrzadeh (2011) analyze subject-verb-object triplets and find a matrix-based categorical model to correlate well with human judgments. We compare to the recent line of work on supervised compositional models. In particular we will describe and experimentally compare our new RNTN model to recursive neural networks (RNN) (Socher et al., 2011b) and matrix-vector RNNs (Socher et al., 2012), both of which have been applied to bag of words sentiment corpora.

Logical Form. A related field that tackles compositionality from a very different angle is that of trying to map sentences to logical form (Zettlemoyer and Collins, 2005). While these models are highly interesting and work well in closed domains and on discrete sets, they could only capture sentiment distributions using separate mechanisms beyond the currently used logical forms.
Deep Learning. Apart from the above-mentioned work on RNNs, several compositionality ideas related to neural networks have been discussed by Bottou (2011) and Hinton (1990), and first models such as Recursive Auto-associative Memories were experimented with by Pollack (1990). The idea to relate inputs through three-way interactions, parameterized by a tensor, has been proposed for relation classification (Sutskever et al., 2009; Jenatton et al., 2012), extending Restricted Boltzmann machines (Ranzato and Hinton, 2010) and as a special layer for speech recognition (Yu et al., 2012).

Sentiment Analysis. Apart from the above-mentioned work, most approaches in sentiment analysis use bag of words representations (Pang and Lee, 2008). Snyder and Barzilay (2007) analyzed larger reviews in more detail by analyzing the sentiment of multiple aspects of restaurants, such as food or atmosphere. Several works have explored sentiment compositionality through careful engineering of features or polarity shifting rules on syntactic structures (Polanyi and Zaenen, 2006; Moilanen and Pulman, 2007; Rentoumi et al., 2010; Nakagawa et al., 2010).

Figure 3: The labeling interface. Random phrases were shown and annotators had a slider for selecting the sentiment and its degree.

3 Stanford Sentiment Treebank
Bag of words classifiers can work well in longer documents by relying on a few words with strong sentiment like 'awesome' or 'exhilarating.' However, sentiment accuracy even for binary positive/negative classification of single sentences has not exceeded 80% for several years. For the more difficult multiclass case including a neutral class, accuracy is often below 60% for short messages on Twitter (Wang et al., 2012). From a linguistic or cognitive standpoint, ignoring word order in the treatment of a semantic task is not plausible, and, as we will show, it cannot accurately classify hard examples of negation. Correctly predicting these hard cases is necessary to further improve performance.

In this section we will introduce and provide some analyses for the new Sentiment Treebank, which includes labels for every syntactically plausible phrase in thousands of sentences, allowing us to train and evaluate compositional models.

We consider the corpus of movie review excerpts from the rottentomatoes.com website originally collected and published by Pang and Lee (2005). The original dataset includes 10,662 sentences, half of which were considered positive and the other half negative. Each label is extracted from a longer movie review and reflects the writer's overall intention for this review. The normalized, lower-cased text is first used to recover, from the original website, the text with capitalization. Remaining HTML tags and sentences that are not in English are deleted. The Stanford Parser (Klein and Manning, 2003) is used to parse all 10,662 sentences. In approximately 1,100 cases it splits the snippet into multiple sentences. We then used Amazon Mechanical Turk to label the resulting 215,154 phrases. Fig. 3 shows the interface annotators saw. The slider has 25 different values and is initially set to neutral. The phrases in each hit are randomly sampled from the set of all phrases in order to prevent labels being influenced by what follows. For more details on the dataset collection, see the supplementary material.

Fig. 2 shows the normalized label distributions at each n-gram length. Starting at length 20, the majority are full sentences. One of the findings from labeling sentences based on the reader's perception is that many of them could be considered neutral. We also notice that stronger sentiment often builds up in longer phrases and the majority of the shorter phrases are neutral. Another observation is that most annotators moved the slider to one of the five positions: negative, somewhat negative, neutral, positive or somewhat positive. The extreme values were rarely used and the slider was not often left in between the ticks. Hence, even a 5-class classification into these categories captures the main variability of the labels. We will name this fine-grained sentiment classification and our main experiment will be to recover these five labels for phrases of all lengths.
Figure 2: Normalized histogram of sentiment annotations at each n-gram length, showing distributions of sentiment values for (a) unigrams, (b) 10-grams, (c) 20-grams, and (d) full sentences. Many shorter n-grams are neutral; longer phrases are well distributed. Few annotators used slider positions between ticks or the extreme values. Hence the two strongest labels and intermediate tick positions are merged into 5 classes.
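To make this merging step concrete, the following minimal Python sketch (our illustration, not the released annotation pipeline) bins a normalized slider value into the five classes; the equal-width bin boundaries are an assumption.

def slider_to_class(v: float) -> int:
    """Map a normalized slider value in [0, 1] to one of 5 classes
    (0 = very negative ... 4 = very positive), merging the extreme
    and between-tick positions into their nearest class."""
    bins = [0.2, 0.4, 0.6, 0.8]   # assumed equal-width boundaries
    return sum(v > b for b in bins)

assert slider_to_class(0.50) == 2   # neutral
assert slider_to_class(0.95) == 4   # very positive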

4 Recursive Neural Models

The models in this section compute compositional vector representations for phrases of variable length and syntactic type. These representations will then be used as features to classify each phrase. Fig. 4 displays this approach. When an n-gram is given to the compositional models, it is parsed into a binary tree and each leaf node, corresponding to a word, is represented as a vector. Recursive neural models will then compute parent vectors in a bottom-up fashion using different types of compositionality functions g. The parent vectors are again given as features to a classifier. For ease of exposition, we will use the tri-gram in this figure to explain all models.

Figure 4: Approach of Recursive Neural Network models for sentiment: Compute parent vectors in a bottom-up fashion using a compositionality function g (for the tri-gram "not very good": p_1 = g(b, c), p_2 = g(a, p_1)) and use node vectors as features for a classifier at that node. This function varies for the different models.
We first describe the operations that the recursive neural models below have in common: word vector representations and classification. This is followed by descriptions of two previous RNN models and our RNTN.

Each word is represented as a d-dimensional vector. We initialize all word vectors by randomly sampling each value from a uniform distribution: U(−r, r), where r = 0.0001. All the word vectors are stacked in the word embedding matrix L ∈ R^{d×|V|}, where |V| is the size of the vocabulary. Initially the word vectors will be random but the L matrix is seen as a parameter that is trained jointly with the compositionality models.

We can use the word vectors immediately as parameters to optimize and as feature inputs to a softmax classifier. For classification into five classes, we compute the posterior probability over labels given the word vector a via:

y^a = \mathrm{softmax}(W_s a),    (1)

where W_s ∈ R^{5×d} is the sentiment classification matrix. For the given tri-gram, this is repeated for vectors b and c. The main task of, and difference between, the models will be to compute the hidden vectors p_i ∈ R^d in a bottom-up fashion.
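As an illustration of Eq. 1, the following minimal Python sketch (our own, not the authors' released code) initializes the embedding matrix L and applies the shared softmax classifier to a node vector; the vocabulary size is a placeholder, while d and r follow the text above.

import numpy as np

d, num_classes, vocab_size, r = 30, 5, 100, 0.0001
rng = np.random.default_rng(0)
L = rng.uniform(-r, r, size=(d, vocab_size))      # word embedding matrix
W_s = rng.uniform(-r, r, size=(num_classes, d))   # sentiment classification matrix

def softmax(z):
    e = np.exp(z - z.max())                       # shift for numerical stability
    return e / e.sum()

def classify_node(x):
    """Posterior over the 5 sentiment classes for any node vector x (Eq. 1)."""
    return softmax(W_s @ x)

a = L[:, 0]                                       # vector for one word
print(classify_node(a))                           # five class probabilities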

4.1 RNN: Recursive Neural Network

The simplest member of this family of neural network models is the standard recursive neural network (Goller and Küchler, 1996; Socher et al., 2011a). First, it is determined which parent already has all its children computed. In the above tree example, p_1 has its two children's vectors since both are words. RNNs use the following equations to compute the parent vectors:

p_1 = f\left(W \begin{bmatrix} b \\ c \end{bmatrix}\right), \quad p_2 = f\left(W \begin{bmatrix} a \\ p_1 \end{bmatrix}\right),

where f = tanh is a standard element-wise nonlinearity, W ∈ R^{d×2d} is the main parameter to learn, and we omit the bias for simplicity. The bias can be added as an extra column to W if an additional 1 is added to the concatenation of the input vectors. The parent vectors must be of the same dimensionality to be recursively compatible and be used as input to the next composition. Each parent vector p_i is given to the same softmax classifier of Eq. 1 to compute its label probabilities.

This model uses the same compositionality function as the recursive autoencoder (Socher et al., 2011b) and recursive auto-associative memories (Pollack, 1990). The only difference to the former model is that we fix the tree structures and ignore the reconstruction loss. In initial experiments, we found that with the additional amount of training data, the reconstruction loss at each node is not necessary to obtain high performance.
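A minimal sketch of this bottom-up composition (our illustration; the tree encoding with word ids at leaves and pairs at internal nodes is our assumption):

import numpy as np

d = 30
rng = np.random.default_rng(0)
W = rng.uniform(-0.0001, 0.0001, size=(d, 2 * d))   # shared composition matrix
L = rng.uniform(-0.0001, 0.0001, size=(d, 100))     # word embeddings

def compose(tree):
    """Vector for a node: leaves are word ids, internal nodes (left, right) pairs."""
    if isinstance(tree, int):
        return L[:, tree]
    b, c = compose(tree[0]), compose(tree[1])
    return np.tanh(W @ np.concatenate([b, c]))      # bias omitted as in the text

# tri-gram "not very good" parsed as (not, (very, good)) with word ids 0, 1, 2
p2 = compose((0, (1, 2)))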
4.2 MV-RNN: Matrix-Vector RNN

The MV-RNN is linguistically motivated in that most of the parameters are associated with words and each composition function that computes vectors for longer phrases depends on the actual words being combined. The main idea of the MV-RNN (Socher et al., 2012) is to represent every word and longer phrase in a parse tree as both a vector and a matrix. When two constituents are combined the matrix of one is multiplied with the vector of the other and vice versa. Hence, the compositional function is parameterized by the words that participate in it.

Each word's matrix is initialized as a d×d identity matrix, plus a small amount of Gaussian noise. Similar to the random word vectors, the parameters of these matrices will be trained to minimize the classification error at each node. For this model, each n-gram is represented as a list of (vector, matrix) pairs, together with the parse tree. For the tree with (vector, matrix) nodes, where (p_2, P_2) is the parent of (a, A) and (p_1, P_1), and (p_1, P_1) is the parent of (b, B) and (c, C), the MV-RNN computes the first parent vector and its matrix via two equations:

p_1 = f\left(W \begin{bmatrix} Cb \\ Bc \end{bmatrix}\right), \quad P_1 = f\left(W_M \begin{bmatrix} B \\ C \end{bmatrix}\right),

where W_M ∈ R^{d×2d} and the result is again a d × d matrix. Similarly, the second parent node is computed using the previously computed (vector, matrix) pair (p_1, P_1) as well as (a, A). The vectors are used for classifying each phrase using the same softmax classifier as in Eq. 1.
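A minimal sketch of one MV-RNN composition step, based on our reading of the equations above rather than the original implementation; the function name and noise scale are our assumptions.

import numpy as np

d = 30
rng = np.random.default_rng(0)
W = rng.uniform(-0.0001, 0.0001, size=(d, 2 * d))     # vector composition
W_M = rng.uniform(-0.0001, 0.0001, size=(d, 2 * d))   # matrix composition

def mv_compose(b, B, c, C):
    """Combine children (b, B) and (c, C) into the parent pair (p, P)."""
    p = np.tanh(W @ np.concatenate([C @ b, B @ c]))   # cross-multiplied vectors
    P = np.tanh(W_M @ np.vstack([B, C]))              # (d x 2d)(2d x d) -> d x d
    return p, P

b, c = rng.standard_normal(d), rng.standard_normal(d)
B = np.eye(d) + 0.001 * rng.standard_normal((d, d))   # identity + Gaussian noise
C = np.eye(d) + 0.001 * rng.standard_normal((d, d))
p1, P1 = mv_compose(b, B, c, C)                       # then combine with (a, A)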

4.3 RNTN: Recursive Neural Tensor Network

One problem with the MV-RNN is that the number of parameters becomes very large and depends on the size of the vocabulary. It would be cognitively more plausible if there was a single powerful composition function with a fixed number of parameters. The standard RNN is a good candidate for such a function. However, in the standard RNN, the input vectors only implicitly interact through the nonlinearity (squashing) function. A more direct, possibly multiplicative, interaction would allow the model to have greater interactions between the input vectors.

Motivated by these ideas we ask the question: Can a single, more powerful composition function perform better and compose aggregate meaning from smaller constituents more accurately than many input-specific ones? In order to answer this question, we propose a new model called the Recursive Neural Tensor Network (RNTN). The main idea is to use the same, tensor-based composition function for all nodes.

Fig. 5 shows a single tensor layer. We define the output of a tensor product h ∈ R^d via the following vectorized notation and the equivalent but more detailed notation for each slice V^{[i]} ∈ R^{2d×2d}:

h = \begin{bmatrix} b \\ c \end{bmatrix}^T V^{[1:d]} \begin{bmatrix} b \\ c \end{bmatrix}; \quad h_i = \begin{bmatrix} b \\ c \end{bmatrix}^T V^{[i]} \begin{bmatrix} b \\ c \end{bmatrix},

where V^{[1:d]} ∈ R^{2d×2d×d} is the tensor that defines multiple bilinear forms.

Figure 5: A single layer of the Recursive Neural Tensor Network. Each dashed box represents one of d-many slices and can capture a type of influence a child can have on its parent.

The RNTN uses this definition for computing p_1:

p_1 = f\left( \begin{bmatrix} b \\ c \end{bmatrix}^T V^{[1:d]} \begin{bmatrix} b \\ c \end{bmatrix} + W \begin{bmatrix} b \\ c \end{bmatrix} \right),

where W is as defined in the previous models. The next parent vector p_2 in the tri-gram will be computed with the same weights:

p_2 = f\left( \begin{bmatrix} a \\ p_1 \end{bmatrix}^T V^{[1:d]} \begin{bmatrix} a \\ p_1 \end{bmatrix} + W \begin{bmatrix} a \\ p_1 \end{bmatrix} \right).

The main advantage over the previous RNN model, which is a special case of the RNTN when V is set to 0, is that the tensor can directly relate input vectors. Intuitively, we can interpret each slice of the tensor as capturing a specific type of composition.

An alternative to RNTNs would be to make the compositional function more powerful by adding a second neural network layer. However, initial experiments showed that it is hard to optimize this model and vector interactions are still more implicit than in the RNTN.
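The tensor composition above can be written compactly with an einsum; the sketch below is our illustration, not the authors' implementation, and uses placeholder dimensions.

import numpy as np

d = 30
rng = np.random.default_rng(0)
V = rng.uniform(-0.0001, 0.0001, size=(d, 2 * d, 2 * d))  # one 2d x 2d slice per output dim
W = rng.uniform(-0.0001, 0.0001, size=(d, 2 * d))

def rntn_compose(b, c):
    """p = f([b; c]^T V[1:d] [b; c] + W [b; c]); V = 0 recovers the plain RNN."""
    x = np.concatenate([b, c])                # (2d,)
    h = np.einsum('i,kij,j->k', x, V, x)      # d bilinear forms, one per slice
    return np.tanh(h + W @ x)

b, c = rng.standard_normal(d), rng.standard_normal(d)
p1 = rntn_compose(b, c)                       # then p2 = rntn_compose(a, p1), etc.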

4.4 Tensor Backprop through Structure

We describe in this section how to train the RNTN model. As mentioned above, each node has a softmax classifier trained on its vector representation to predict a given ground truth or target vector t. We assume the target distribution vector at each node has a 0-1 encoding. If there are C classes, then it has length C and a 1 at the correct label. All other entries are 0.

We want to maximize the probability of the correct prediction, or minimize the cross-entropy error between the predicted distribution y^i ∈ R^{C×1} at node i and the target distribution t^i ∈ R^{C×1} at that node. This is equivalent (up to a constant) to minimizing the KL-divergence between the two distributions. The error as a function of the RNTN parameters θ = (V, W, W_s, L) for a sentence is:

E(\theta) = -\sum_i \sum_j t_j^i \log y_j^i + \lambda \|\theta\|^2    (2)

The derivatives for the weights of the softmax classifier are standard and simply sum up from each node's error. We define x^i to be the vector at node i (in the example trigram, the x^i ∈ R^{d×1} are (a, b, c, p_1, p_2)). We skip the standard derivative for W_s. Each node backpropagates its error through to the recursively used weights V, W. Let δ^{i,s} ∈ R^{d×1} be the softmax error vector at node i:

\delta^{i,s} = \left( W_s^T (y^i - t^i) \right) \otimes f'(x^i),

where ⊗ is the Hadamard product between the two vectors and f' is the element-wise derivative of f, which in the standard case of using f = tanh can be computed using only f(x^i).

The remaining derivatives can only be computed in a top-down fashion from the top node through the tree and into the leaf nodes. The full derivative for V and W is the sum of the derivatives at each of the nodes. We define the complete incoming error message for a node i as δ^{i,com}. The top node, in our case p_2, only received errors from the top node's softmax. Hence, δ^{p_2,com} = δ^{p_2,s}, which we can use to obtain the standard backprop derivative for W (Goller and Küchler, 1996; Socher et al., 2010). For the derivative of each slice k = 1, . . . , d, we get:

\frac{\partial E^{p_2}}{\partial V^{[k]}} = \delta_k^{p_2,com} \begin{bmatrix} a \\ p_1 \end{bmatrix} \begin{bmatrix} a \\ p_1 \end{bmatrix}^T,

where δ_k^{p_2,com} is just the k'th element of this vector. Now, we can compute the error message for the two children of p_2:

\delta^{p_2,down} = \left( W^T \delta^{p_2,com} + S \right) \otimes f'\left( \begin{bmatrix} a \\ p_1 \end{bmatrix} \right),

where we define

S = \sum_{k=1}^{d} \delta_k^{p_2,com} \left( V^{[k]} + \left(V^{[k]}\right)^T \right) \begin{bmatrix} a \\ p_1 \end{bmatrix}.

The children of p_2 will then each take half of this vector and add their own softmax error message for the complete δ. In particular, we have

\delta^{p_1,com} = \delta^{p_1,s} + \delta^{p_2,down}[d+1:2d],

where δ^{p_2,down}[d+1:2d] indicates that p_1 is the right child of p_2 and hence takes the 2nd half of the error; for the final word vector derivative for a, it will be δ^{p_2,down}[1:d].

The full derivative for slice V^{[k]} for this trigram tree then is the sum at each node:

\frac{\partial E}{\partial V^{[k]}} = \frac{\partial E^{p_2}}{\partial V^{[k]}} + \delta_k^{p_1,com} \begin{bmatrix} b \\ c \end{bmatrix} \begin{bmatrix} b \\ c \end{bmatrix}^T,

and similarly for W. For this nonconvex optimization we use AdaGrad (Duchi et al., 2011), which converges in less than 3 hours to a local optimum.
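To make these pieces concrete, the following sketch (our illustration, not the released training code) shows the per-slice tensor gradient at the top node and a generic AdaGrad step as in Duchi et al. (2011); the function names are ours.

import numpy as np

def slice_gradients(delta_com, a, p1):
    """dE/dV[k] = delta_com[k] * [a; p1][a; p1]^T, stacked over all d slices."""
    x = np.concatenate([a, p1])              # (2d,)
    outer = np.outer(x, x)                   # (2d, 2d)
    return delta_com[:, None, None] * outer  # (d, 2d, 2d), one matrix per slice

def adagrad_step(theta, grad, hist, lr=0.01, eps=1e-8):
    """Per-parameter learning rates from accumulated squared gradients."""
    hist += grad ** 2                        # running history, updated in place
    theta -= lr * grad / (np.sqrt(hist) + eps)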
5 Experiments

We include two types of analyses. The first type includes several large quantitative evaluations on the test set. The second type focuses on two linguistic phenomena that are important in sentiment.

For all models, we use the dev set and cross-validate over regularization of the weights, word vector size as well as learning rate and minibatch size for AdaGrad. Optimal performance for all models was achieved at word vector sizes between 25 and 35 dimensions and batch sizes between 20 and 30. Performance decreased at larger or smaller vector and batch sizes. This indicates that the RNTN does not outperform the standard RNN due to simply having more parameters. The MV-RNN has orders of magnitude more parameters than any other model due to the word matrices. The RNTN would usually achieve its best performance on the dev set after training for 3-5 hours. Initial experiments showed that the recursive models worked significantly worse (over 5% drop in accuracy) when no nonlinearity was used. We use f = tanh in all experiments.

We compare to commonly used methods that use bag of words features with Naive Bayes and SVMs, as well as Naive Bayes with bag of bigram features. We abbreviate these with NB, SVM and biNB. We also compare to a model that averages neural word vectors and ignores word order (VecAvg).

The sentences in the treebank were split into train (8544), dev (1101) and test (2210) splits and these splits are made available with the data release. We also analyze performance on only positive and negative sentences, ignoring the neutral class. This filters about 20% of the data, with the three sets having 6920/872/1821 sentences.

5.1 Fine-grained Sentiment For All Phrases

The main novel experiment and evaluation metric analyze the accuracy of fine-grained sentiment classification for all phrases. Fig. 2 showed that a fine-grained classification into 5 classes is a reasonable approximation to capture most of the data variation.

Fig. 6 shows the result on this new corpus. The RNTN gets the highest performance, followed by the MV-RNN and RNN. The recursive models work very well on shorter phrases, where negation and composition are important, while bag of features baselines perform well only with longer sentences. The RNTN accuracy upper bounds other models at most n-gram lengths.

Table 1 (left) shows the overall accuracy numbers for fine-grained prediction at all phrase lengths and full sentences.

           Fine-grained      Positive/Negative
Model      All     Root      All     Root
NB         67.2    41.0      82.6    81.8
SVM        64.3    40.7      84.6    79.4
BiNB       71.0    41.9      82.7    83.1
VecAvg     73.3    32.7      85.1    80.1
RNN        79.0    43.2      86.1    82.4
MV-RNN     78.7    44.4      86.8    82.9
RNTN       80.7    45.7      87.6    85.4

Table 1: Accuracy for fine-grained (5-class) and binary predictions at the sentence level (root) and for all nodes.


 
Figure 6: Accuracy curves for fine-grained sentiment classification at each n-gram length. Left: Accuracy separately for each set of n-grams. Right: Cumulative accuracy of all ≤ n-grams.

5.2 Full Sentence Binary Sentiment

This setup is comparable to previous work on the original rotten tomatoes dataset which only used full sentence labels and binary classification of positive/negative. Hence, these experiments show the improvement even baseline methods can achieve with the sentiment treebank. Table 1 shows results of this binary classification for both all phrases and for only full sentences. The previous state of the art was below 80% (Socher et al., 2012). With the coarse bag of words annotation for training, many of the more complex phenomena could not be captured, even by more powerful models. The combination of the new sentiment treebank and the RNTN pushes the state of the art on short phrases up to 85.4%.

Figure 7: Example of correct prediction for contrastive conjunction X but Y in the sentence "There are slow and repetitive parts, but it has just enough spice to keep it interesting."
5.3 Model Analysis: Contrastive Conjunction

In this section, we use a subset of the test set which includes only sentences with an 'X but Y' structure: a phrase X being followed by but which is followed by a phrase Y. The conjunction is interpreted as an argument for the second conjunct, with the first functioning concessively (Lakoff, 1971; Blakemore, 1989; Merin, 1999). Fig. 7 contains an example. We analyze a strict setting, where X and Y are phrases of different sentiment (including neutral). The example is counted as correct if the classifications for both phrases X and Y are correct. Furthermore, the lowest node that dominates both the word but and the node that spans Y also has to have the same correct sentiment. For the resulting 131 cases, the RNTN obtains an accuracy of 41% compared to MV-RNN (37), RNN (36) and biNB (27).

5.4 Model Analysis: High Level Negation

We investigate two types of negation. For each type, we use a separate dataset for evaluation.

Set 1: Negating Positive Sentences. The first set contains positive sentences and their negation. In this set, the negation changes the overall sentiment of a sentence from positive to negative. Hence, we compute accuracy in terms of correct sentiment reversal from positive to negative. Fig. 9 shows two examples of positive negation the RNTN correctly classified, even if negation is less obvious in the case of 'least'. Table 2 (left) gives the accuracies over 21 positive sentences and their negation for all models. The RNTN has the highest reversal accuracy, showing its ability to structurally learn negation of positive sentences. But what if the model simply makes phrases very negative when negation is in the sentence? The next experiments show that the model captures more than such a simplistic negation rule.

Set 2: Negating Negative Sentences. The second set contains negative sentences and their negation. When negative sentences are negated, the sentiment treebank shows that overall sentiment should become less negative, but not necessarily positive. For instance, 'The movie was terrible' is negative but 'The movie was not terrible' says only that it was less bad than a terrible one, not that it was good (Horn, 1989; Israel, 2001).

Figure 9: RNTN prediction of positive and negative (bottom right) sentences and their negation. The examples shown are "Roger Dodger is one of the most compelling variations on this theme." / "Roger Dodger is one of the least compelling variations on this theme.", "I liked every single minute of this film." / "I did n't like a single minute of this film.", and "It 's just incredibly dull." / "It 's definitely not dull."

Hence, we evaluate accuracy in terms of how often each model was able to increase non-negative activation in the sentiment of the sentence. Table 2 (right) shows the accuracy. In over 81% of cases, the RNTN correctly increases the positive activations. Fig. 9 (bottom right) shows a typical case in which sentiment was made more positive by switching the main class from negative to neutral even though both not and dull were negative. Fig. 8 shows the changes in activation for both sets. Negative values indicate a decrease in average positive activation (for set 1) and positive values mean an increase in average positive activation (set 2). The RNTN has the largest shifts in the correct directions. Therefore we can conclude that the RNTN is best able to identify the effect of negations upon both positive and negative sentiment sentences.

Model      Negated Positive    Negated Negative
biNB       19.0                27.3
RNN        33.3                45.5
MV-RNN     52.4                54.6
RNTN       71.4                81.8

Table 2: Accuracy of negation detection. Negated positive is measured as correct sentiment inversions. Negated negative is measured as increases in positive activations.

Figure 8: Change in activations for negations. Only the RNTN correctly captures both types. It decreases positive sentiment more when it is negated and learns that negating negative phrases (such as not terrible) should increase neutral and positive activations.

n = 1. Most positive: engaging; best; powerful; love; beautiful. Most negative: bad; dull; boring; fails; worst; stupid; painfully.
n = 2. Most positive: excellent performances; A masterpiece; masterful film; wonderful movie; marvelous performances. Most negative: worst movie; very bad; shapeless mess; worst thing; instantly forgettable; complete failure.
n = 3. Most positive: an amazing performance; wonderful all-ages triumph; a wonderful movie; most visually stunning. Most negative: for worst movie; A lousy movie; a complete failure; most painfully marginal; very bad sign.
n = 5. Most positive: nicely acted and beautifully shot; gorgeous imagery, effective performances; the best of the year; a terrific American sports movie; refreshingly honest and ultimately touching. Most negative: silliest and most incoherent movie; completely crass and forgettable movie; just another bad movie.; A cumbersome and cliche-ridden movie; a humorless, disjointed mess.
n = 8. Most positive: one of the best films of the year; A love for films shines through each frame; created a masterful piece of artistry right here; A masterful film from a master filmmaker. Most negative: A trashy, exploitative, thoroughly unpleasant experience; this sloppy drama is an empty vessel.; quickly drags on becoming boring and predictable.; be the worst special-effects creation of the year.

Table 3: Examples of n-grams for which the RNTN predicted the most positive and most negative responses.


Figure 10: Average ground truth sentiment of top 10 most positive n-grams at various n. The RNTN correctly picks the more negative and positive examples.

5.5 Model Analysis: Most Positive and Negative Phrases

We queried the model for its predictions on what the most positive or negative n-grams are, measured as the highest activation of the most negative and most positive classes. Table 3 shows some phrases from the dev set which the RNTN selected for their strongest sentiment.
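This query can be sketched as ranking phrases by the activation of the strongest class; the snippet below is our illustration, reusing the hypothetical classify_node from the earlier softmax sketch and assuming phrase vectors have already been computed by the RNTN.

def top_phrases(phrases, vectors, n=10, cls=4):
    """Rank phrases of one length by P(class = cls); cls = 4 picks the most
    positive n-grams, cls = 0 the most negative ones."""
    scored = [(classify_node(v)[cls], p) for p, v in zip(phrases, vectors)]
    return sorted(scored, reverse=True)[:n]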
Due to lack of space we cannot compare top phrases of the other models, but Fig. 10 shows that the RNTN selects more strongly positive phrases at most n-gram lengths compared to other models. For this and the previous experiment, please find additional examples and descriptions in the supplementary material.

6 Conclusion

We introduced Recursive Neural Tensor Networks and the Stanford Sentiment Treebank. The combination of new model and data results in a system for single sentence sentiment detection that pushes state of the art by 5.4% for positive/negative sentence classification. Apart from this standard setting, the dataset also poses important new challenges and allows for new evaluation metrics. For instance, the RNTN obtains 80.7% accuracy on fine-grained sentiment prediction across all phrases and captures negation of different sentiments and scope more accurately than previous models.

Acknowledgments

We thank Rukmani Ravisundaram and Tayyab Tariq for the first version of the online demo. Richard is partly supported by a Microsoft Research PhD fellowship. The authors gratefully acknowledge the support of the Defense Advanced Research Projects Agency (DARPA) Deep Exploration and Filtering of Text (DEFT) Program under Air Force Research Laboratory (AFRL) prime contract no. FA8750-13-2-0040, the DARPA Deep Learning program under contract number FA8650-10-C-7020 and NSF IIS-1159679. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of DARPA, AFRL, or the US government.

References

M. Baroni and A. Lenci. 2010. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721.
Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. 2003. A neural probabilistic language model. J. Mach. Learn. Res., 3, March.
D. Blakemore. 1989. Denial and contrast: A relevance theoretic analysis of 'but'. Linguistics and Philosophy, 12:15–37.
L. Bottou. 2011. From machine learning to machine reasoning. CoRR, abs/1102.1808.
S. Clark and S. Pulman. 2007. Combining symbolic and distributional models of meaning. In Proceedings of the AAAI Spring Symposium on Quantum Interaction, pages 52–55.
R. Collobert and J. Weston. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML.
J. Duchi, E. Hazan, and Y. Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12, July.
K. Erk and S. Padó. 2008. A structured vector space model for word meaning in context. In EMNLP.
C. Goller and A. Küchler. 1996. Learning task-dependent distributed representations by backpropagation through structure. In Proceedings of the International Conference on Neural Networks (ICNN-96).
E. Grefenstette and M. Sadrzadeh. 2011. Experimental support for a categorical compositional distributional model of meaning. In EMNLP.
E. Grefenstette, G. Dinu, Y.-Z. Zhang, M. Sadrzadeh, and M. Baroni. 2013. Multi-step regression learning for compositional distributional semantics. In IWCS.
G. E. Hinton. 1990. Mapping part-whole hierarchies into connectionist networks. Artificial Intelligence, 46(1-2).
L. R. Horn. 1989. A natural history of negation, volume 960. University of Chicago Press, Chicago.
E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In ACL.
M. Israel. 2001. Minimizers, maximizers, and the rhetoric of scalar reasoning. Journal of Semantics, 18(4):297–331.
R. Jenatton, N. Le Roux, A. Bordes, and G. Obozinski. 2012. A latent factor model for highly multi-relational data. In NIPS.
D. Klein and C. D. Manning. 2003. Accurate unlexicalized parsing. In ACL.
R. Lakoff. 1971. If's, and's, and but's about conjunction. In Charles J. Fillmore and D. Terence Langendoen, editors, Studies in Linguistic Semantics, pages 114–149. Holt, Rinehart, and Winston, New York.
A. Merin. 1999. Information, relevance, and social decisionmaking: Some principles and results of decision-theoretic semantics. In Lawrence S. Moss, Jonathan Ginzburg, and Maarten de Rijke, editors, Logic, Language, and Information, volume 2. CSLI, Stanford, CA.
J. Mitchell and M. Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429.
K. Moilanen and S. Pulman. 2007. Sentiment composition. In Proceedings of Recent Advances in Natural Language Processing.
T. Nakagawa, K. Inui, and S. Kurohashi. 2010. Dependency tree-based sentiment classification using CRFs with hidden variables. In NAACL-HLT.
S. Pado and M. Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33(2):161–199.
B. Pang and L. Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, pages 115–124.
B. Pang and L. Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.
T. A. Plate. 1995. Holographic reduced representations. IEEE Transactions on Neural Networks, 6(3):623–641.
L. Polanyi and A. Zaenen. 2006. Contextual valence shifters. In W. Bruce Croft, James Shanahan, Yan Qu, and Janyce Wiebe, editors, Computing Attitude and Affect in Text: Theory and Applications, volume 20 of The Information Retrieval Series, chapter 1.
J. B. Pollack. 1990. Recursive distributed representations. Artificial Intelligence, 46, November.
M. Ranzato, A. Krizhevsky, and G. E. Hinton. 2010. Factored 3-way restricted Boltzmann machines for modeling natural images. In AISTATS.
V. Rentoumi, S. Petrakis, M. Klenner, G. A. Vouros, and V. Karkaletsis. 2010. United we stand: Improving sentiment analysis by joining machine learning and rule based methods. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta.
S. Rudolph and E. Giesbrecht. 2010. Compositional matrix-space models of language. In ACL.
B. Snyder and R. Barzilay. 2007. Multiple aspect ranking using the Good Grief algorithm. In HLT-NAACL.
R. Socher, C. D. Manning, and A. Y. Ng. 2010. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop.

R. Socher, C. Lin, A. Y. Ng, and C. D. Manning. 2011a. Parsing natural scenes and natural language with recursive neural networks. In ICML.
R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning. 2011b. Semi-supervised recursive autoencoders for predicting sentiment distributions. In EMNLP.
R. Socher, B. Huval, C. D. Manning, and A. Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In EMNLP.
I. Sutskever, R. Salakhutdinov, and J. B. Tenenbaum. 2009. Modelling relational data using Bayesian clustered tensor factorization. In NIPS.
P. D. Turney and P. Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.
H. Wang, D. Can, A. Kazemzadeh, F. Bar, and S. Narayanan. 2012. A system for real-time Twitter sentiment analysis of 2012 U.S. presidential election cycle. In Proceedings of the ACL 2012 System Demonstrations.
D. Widdows. 2008. Semantic vector products: Some initial investigations. In Proceedings of the Second AAAI Symposium on Quantum Interaction.
A. Yessenalina and C. Cardie. 2011. Compositional matrix-space models for sentiment analysis. In EMNLP.
D. Yu, L. Deng, and F. Seide. 2012. Large vocabulary speech recognition using deep tensor neural networks. In INTERSPEECH.
F. M. Zanzotto, I. Korkontzelos, F. Fallucchi, and S. Manandhar. 2010. Estimating linear models for compositional distributional semantics. In COLING.
L. Zettlemoyer and M. Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI.

