Proceedings of the 1st Workshop on Representation Learning for NLP, pages 1–7,
Berlin, Germany, August 11th, 2016.
© 2016 Association for Computational Linguistics
The decomposition is obtained by performing a backward pass on the network, where for each neuron the relevance associated with it is redistributed to its predecessors. Consider neurons mapping a set of n inputs (x_i)_{i \in [1,n]} to the neuron activation x_j through the sequence of functions:

    z_{ij} = x_i w_{ij} + \frac{b_j}{n}
    z_j = \sum_i z_{ij}
    x_j = g(z_j)

where for convenience the neuron bias b_j has been distributed equally to each input neuron, and where g(\cdot) is a monotonously increasing activation function. Denoting by R_i and R_j the relevance associated with x_i and x_j, the relevance is redistributed from one layer to the other by defining messages R_{i \leftarrow j} indicating how much relevance must be propagated from neuron x_j to its input neuron x_i in the lower layer. These messages are defined as:

    R_{i \leftarrow j} = \frac{z_{ij} + \frac{\epsilon \, s(z_j)}{n}}{\sum_i z_{ij} + \epsilon \, s(z_j)} \cdot R_j

where s(z_j) = (1_{z_j \geq 0} - 1_{z_j < 0}) is a stabilizing term that handles near-zero denominators, with \epsilon set to 0.01. The intuition behind this local relevance redistribution formula is that each input x_i should be assigned relevance proportionally to its contribution in the forward pass, in a way that the relevance is preserved (\sum_i R_{i \leftarrow j} = R_j).

Each neuron in the lower layer receives relevance from all upper-level neurons to which it contributes:

    R_i = \sum_j R_{i \leftarrow j}.

This pooling ensures layer-wise conservation: \sum_i R_i = \sum_j R_j. Finally, in a max-pooling layer, all relevance at the output of the layer is redistributed to the pooled neuron with maximum activation (i.e. winner-take-all). An implementation of LRP can be found in (Lapuschkin et al., 2016b) and downloaded from www.heatmapping.org[1].

2.2 Sensitivity Analysis

An alternative procedure called sensitivity analysis (SA) produces explanations by scoring input variables based on how they affect the decision output locally (Dimopoulos et al., 1995; Gevrey et al., 2003). The sensitivity of an input variable is given by its squared partial derivative:

    R_d = \left( \frac{\partial f}{\partial x_d} \right)^2.

Here, we note that unlike LRP, sensitivity analysis does not preserve the function value f(x), but the squared l2-norm of the function gradient:

    \| \nabla_x f(x) \|_2^2 = \sum_d R_d.    (2)

This quantity is however not directly related to the amount of evidence for the category to detect. Similar gradient-based analyses (Denil et al., 2014; Li et al., 2015) have recently been applied in the NLP domain, and were also used by Simonyan et al. (2014) in the context of image classification. While recent work uses different relevance definitions for a group of input variables (e.g. gradient magnitude in Denil et al. (2014) or max-norm of absolute value of simple derivatives in Simonyan et al. (2014)), in the present work (unless otherwise stated) we employ the squared l2-norm of gradients, allowing for a decomposition of Eq. 2 as a sum over relevances of input variables.

3 Experiments

For the following experiments we use the 20news-bydate version of the 20Newsgroups[2] dataset, consisting of 11314/7532 train/test documents evenly distributed among twenty fine-grained categories.

3.1 CNN Model

As a document classifier we employ a word-based CNN similar to Kim (2014), consisting of the following sequence of layers:

    Conv → ReLU → 1-Max-Pool → FC

By 1-Max-Pool we denote a max-pooling layer where the pooling regions span the whole text length, as introduced in (Collobert et al., 2011). Conv, ReLU and FC denote the convolutional layer, the rectified linear units activation and the fully-connected linear layer. For building the CNN numerical input we concatenate horizontally 300-dimensional pre-trained word2vec[3] vectors (Mikolov et al., 2013), in the same order the corresponding words appear in the pre-processed document, and further keep this input representation fixed during training. The convolutional operation we apply in the first neural network layer is one-dimensional and along the text sequence direction (i.e. along the horizontal direction). The receptive field of the convolutional layer neurons spans the entire word embedding space in the vertical direction, and covers two consecutive words in the horizontal direction. The convolutional layer filter bank contains 800 filters.

3.2 Experimental Setup

As pre-processing we remove the document headers, tokenize the text with NLTK[4], filter out punctuation and numbers[5], and finally truncate each document to the first 400 tokens. We train the CNN by stochastic mini-batch gradient descent with momentum (with l2-norm penalty and dropout). Our trained classifier achieves a classification accuracy of 80.19%[6].

Due to our input representation, applying LRP or SA to our neural classifier yields one relevance value per word-embedding dimension. To obtain word-level relevances from these single input variable relevances, we sum up the relevances over the word embedding space in case of LRP, and (unless otherwise stated) take the squared l2-norm of the corresponding word gradient in case of SA. More precisely, given an input document d consisting of a sequence (w_1, w_2, ..., w_N) of N words, each word being represented by a D-dimensional word embedding, we compute the relevance R(w_t) of the t-th word in the input document through the summation:

    R(w_t) = \sum_{i=1}^{D} R_{i,t}    (3)

where R_{i,t} denotes the relevance of the input variable corresponding to the i-th dimension of the t-th word embedding, obtained by LRP or SA as specified in Sections 2.1 & 2.2.

In particular, in case of SA, the above word relevance can equivalently be expressed as:

    R_{SA}(w_t) = \| \nabla_{w_t} f(d) \|_2^2    (4)

where f(d) represents the classifier's prediction for document d.

Note that the resulting LRP word relevance is signed, while the SA word relevance is positive.

In all experiments, we use the term "target class" to identify the function f(x) to analyze in the relevance decomposition. This function maps the neural network input to the neural network output variable corresponding to the target class.

3.3 Evaluating Word-Level Relevances

In order to evaluate different relevance models, we perform a sequence of "word deletions" (hereby for deleting a word we simply set the word-vector to zero in the input document representation), and track the impact of these deletions on the classification performance. We carry out two deletion experiments, starting either with the set of test documents that are initially classified correctly, or with those that are initially classified wrongly[7]. We estimate the LRP/SA word relevances using as target class the true document class. Subsequently we delete words in decreasing resp. increasing order of the obtained word relevances.

Fig. 1 summarizes our results. We find that LRP yields the best results in both deletion experiments. Thereby we provide evidence that LRP positive relevance is targeted to words that support a classification decision, while LRP negative relevance is tuned upon words that inhibit this decision. In the first experiment the SA classification accuracy curve decreases significantly faster than the random curve representing the performance change when randomly deleting words, indicating that SA is able to identify relevant words. However, the SA curve is clearly above the LRP curve, indicating that LRP provides better explanations for the CNN predictions. Similar results have been reported for image classification tasks (Samek et al., 2015). The second experiment indicates that the classification performance increases when deleting words with the lowest LRP relevance, while small SA values point to words that have less influence on the classification performance than random word selection. This result [...]

[1] Currently the available code is targeted on image data.
[2] http://qwone.com/%7Ejason/20Newsgroups/
[3] GoogleNews-vectors-negative300, https://code.google.com/p/word2vec/
[4] We employ NLTK's version 3.1 recommended tokenizers sent_tokenize and word_tokenize, module nltk.tokenize.
[5] We retain only tokens composed of the following characters: alphabetic character, apostrophe, hyphen and dot, and containing at least one alphabetic character.
[6] To the best of our knowledge, the best published 20Newsgroups accuracy is 83.0% (Paskov et al., 2013). However we notice that for simplification we use a fixed-length document representation, and our main focus is on explaining classifier decisions, not on improving the classification state-of-the-art.
[7] For the deletion experiments we consider only the test documents whose pre-processed length is greater or equal to 100 tokens; this amounts to a total of 4963 documents.
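Eqs. (3) and (4) above amount to simple poolings over the embedding axis; a minimal NumPy sketch (function names are illustrative, not from the paper's code):

```python
import numpy as np

def word_relevance_lrp(R):
    """Eq. (3): word relevance as the sum of single-dimension relevances
    over the D embedding dimensions; R has shape (N, D). Signed result."""
    return R.sum(axis=1)

def word_relevance_sa(G):
    """Eq. (4): word relevance as the squared l2-norm of each word's
    gradient; G has shape (N, D). Non-negative result."""
    return (G ** 2).sum(axis=1)
```

Note the asymmetry stated in the text: the LRP pooling can yield negative word scores, while the SA pooling cannot.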
[Figure 1: word deletion results; curves for LRP, SA and random deletion; y-axis: Accuracy (4154 documents).]

[...] ships between words (Mikolov et al., 2013). We explore if these regularities can be transferred to a document representation, when using as a docu- [...]
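The word-deletion evaluation summarized in Figure 1 can be sketched as follows. This is an illustrative skeleton under our own assumptions: `predict` and the toy data stand in for the trained CNN and the 20Newsgroups test set.

```python
import numpy as np

def deletion_curve(predict, docs, labels, relevances, steps, descending=True):
    """Accuracy after each word-deletion step (Sec. 3.3).

    docs       : list of (N, D) word-embedding matrices
    relevances : list of (N,) word relevance vectors
    Deleting a word means zeroing its embedding row; words are removed in
    decreasing (descending=True) or increasing relevance order."""
    docs = [d.copy() for d in docs]
    orders = [np.argsort(-r) if descending else np.argsort(r)
              for r in relevances]
    curve = []
    for step in range(steps):
        for d, order in zip(docs, orders):
            if step < len(order):
                d[order[step]] = 0.0        # delete the next word
        acc = np.mean([predict(d) == y for d, y in zip(docs, labels)])
        curve.append(acc)
    return curve
```

Running this once with relevances sorted descending and once ascending reproduces the two experimental settings described in Section 3.3.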
[Figure 2: LRP (left column) and SA (right column) heatmaps, for target classes sci.space, sci.med and rec.motorcycles (top to bottom), over the following document text:]

"It is the body's reaction to a strange environment. It appears to be induced partly to physical discomfort and part to mental distress. Some people are more prone to it than others, like some people are more prone to get sick on a roller coaster ride than others. The mental part is usually induced by a lack of clear indication of which way is up or down, ie: the Shuttle is normally oriented with its cargo bay pointed towards Earth, so the Earth (or ground) is "above" the head of the astronauts. About 50% of the astronauts experience some form of motion sickness, and NASA has done numerous tests in space to try to see how to keep the number of occurances down."
Figure 2: Heatmaps for the test document sci.space 61393 (correctly classified), using either layer-
wise relevance propagation (LRP) or sensitivity analysis (SA) for highlighting words. Positive relevance
is mapped to red, negative to blue. The target class for the LRP/SA explanation is indicated on the left.
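The SA scores behind such heatmaps follow Section 2.2; as a sanity-check sketch, the squared partial derivatives of Eq. (2) can be estimated with central finite differences. The toy linear scoring function below stands in for the CNN's target-class output and is our assumption, not the paper's model.

```python
import numpy as np

def sa_relevance(f, x, h=1e-5):
    """Sensitivity analysis: R_d = (df/dx_d)^2, estimated via central
    finite differences. f maps an input vector to a scalar score."""
    R = np.empty_like(x)
    for d in range(x.size):
        e = np.zeros_like(x)
        e[d] = h
        R[d] = ((f(x + e) - f(x - e)) / (2.0 * h)) ** 2
    return R
```

Per Eq. (2), summing the returned relevances recovers the squared l2-norm of the gradient of f at x.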
Figure 3: PCA projection of the 20Newsgroups test documents formed by linearly combining word2vec
embeddings. The weighting scheme is based on word-level relevances, or on single input variable rel-
evances (e.w.), or uniform (SUM). The target class for relevance estimation is the predicted document
class. SA(l2 ) corresponds to a variant of SA with simple l2 -norm pooling of word gradient values. All
visualizations are provided on the same equal axis scale.
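The document representations behind Figure 3 can be sketched as relevance-weighted linear combinations of word2vec rows, projected with a plain SVD-based PCA. This is a minimal sketch under our own naming; uniform weights correspond to the 'SUM' baseline of the caption.

```python
import numpy as np

def document_vector(E, w):
    """Linear combination of word embeddings E (N, D) with per-word
    weights w (N,), e.g. word-level relevances or uniform weights."""
    return w @ E

def pca_project(X, k=2):
    """Project row vectors X (num_docs, D) onto the first k principal
    components, via SVD of the mean-centered data."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T
```

Stacking one `document_vector` per test document and calling `pca_project` on the result yields the 2D scatter of the kind shown in Figure 3.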
[...] weighting. In case of SA, note that even though the power to which word gradient norms are raised (l2 or l2^2) affects the present visualization experiment, it has no influence on the earlier described word deletion analysis.

For element-wise models (e.w.), we observe slightly better separated clusters for SA, and a clear-cut cluster structure for LRP.

4 Conclusion

Through word deleting we quantitatively evaluated and compared two classifier explanation models, and pinpointed LRP to be more effective than SA. We investigated the application of word-level relevance information for document highlighting and visualization. We derive from our empirical analysis that the superiority of LRP stems from the fact that it not only reliably links to determinant words that support a specific classification decision, but further distinguishes, within the preeminent words, those that are opposed to that decision.

Future work would include applying LRP to other neural network architectures (e.g. character-based or recurrent models) on further NLP tasks, as well as exploring how relevance information could be taken into account to improve the classifier's training procedure or prediction performance.

Acknowledgments

This work was supported by the German Ministry for Education and Research as Berlin Big Data Center BBDC (01IS14013A) and the Brain Korea 21 Plus Program through the National Research Foundation of Korea funded by the Ministry of Education.

References

Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. 2003. A Neural Probabilistic Language Model. JMLR, 3:1137–1155.

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. 2011. Natural Language Processing (Almost) from Scratch. JMLR, 12:2493–2537.

M. Denil, A. Demiraj, and N. de Freitas. 2014. Extraction of Salient Sentences from Labelled Documents. Technical report, University of Oxford.

Y. Dimopoulos, P. Bourret, and S. Lek. 1995. Use of some sensitivity criteria for choosing networks with good generalization ability. Neural Processing Letters, 2(6):1–4.

D. Erhan, A. Courville, and Y. Bengio. 2010. Understanding Representations Learned in Deep Architectures. Technical report, University of Montreal.

M. Gevrey, I. Dimopoulos, and S. Lek. 2003. Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecological Modelling, 160(3):249–264.

Y. Kim. 2014. Convolutional Neural Networks for Sentence Classification. In Proc. of EMNLP, pages 1746–1751.

R. Krishnan, G. Sivakumar, and P. Bhattacharya. 1999. Extracting decision trees from trained neural networks. Pattern Recognition, 32(12):1999–2009.

W. Landecker, M. Thomure, L. Bettencourt, M. Mitchell, G. Kenyon, and S. Brumby. 2013. Interpreting Individual Classifications of Hierarchical Networks. In IEEE Symposium on Computational Intelligence and Data Mining (CIDM), pages 32–38.

S. Lapuschkin, A. Binder, G. Montavon, K.-R. Müller, and W. Samek. 2016a. Analyzing Classifiers: Fisher Vectors and Deep Neural Networks. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

S. Lapuschkin, A. Binder, G. Montavon, K.-R. Müller, and W. Samek. 2016b. The Layer-wise Relevance Propagation Toolbox for Artificial Neural Networks. JMLR, in press.

J. Li, X. Chen, E. Hovy, and D. Jurafsky. 2015. Visualizing and Understanding Neural Models in NLP. arXiv:1506.01066.

H.S. Paskov, R. West, J.C. Mitchell, and T. Hastie. 2013. Compressive Feature Learning. In Adv. in NIPS.

W. Samek, A. Binder, G. Montavon, S. Bach, and K.-R. Müller. 2015. Evaluating the visualization of what a Deep Neural Network has learned. arXiv:1509.06321.
K. Simonyan, A. Vedaldi, and A. Zisserman. 2014. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. In Workshop Proc. ICLR.

M. D. Zeiler and R. Fergus. 2014. Visualizing and Understanding Convolutional Networks. In ECCV, pages 818–833.