
Explaining Predictions of Non-Linear Classifiers in NLP

Leila Arras[1], Franziska Horn[2], Grégoire Montavon[2], Klaus-Robert Müller[2,3], and Wojciech Samek[1]

[1] Machine Learning Group, Fraunhofer Heinrich Hertz Institute, Berlin, Germany
[2] Machine Learning Group, Technische Universität Berlin, Berlin, Germany
[3] Department of Brain and Cognitive Engineering, Korea University, Seoul, Korea
{leila.arras, wojciech.samek}@hhi.fraunhofer.de
klaus-robert.mueller@tu-berlin.de

Abstract

Layer-wise relevance propagation (LRP) is a recently proposed technique for explaining predictions of complex non-linear classifiers in terms of input variables. In this paper, we apply LRP for the first time to natural language processing (NLP). More precisely, we use it to explain the predictions of a convolutional neural network (CNN) trained on a topic categorization task. Our analysis highlights which words are relevant for a specific prediction of the CNN. We compare our technique to standard sensitivity analysis, both qualitatively and quantitatively, using a "word deleting" perturbation experiment, a PCA analysis, and various visualizations. All experiments validate the suitability of LRP for explaining the CNN predictions, which is also in line with results reported in recent image classification studies.

1 Introduction

Following seminal work by Bengio et al. (2003) and Collobert et al. (2011), the use of deep learning models for natural language processing (NLP) applications has received increasing attention in recent years. In parallel, initiated by the computer vision domain, there is also a trend toward understanding deep learning models through visualization techniques (Erhan et al., 2010; Landecker et al., 2013; Zeiler and Fergus, 2014; Simonyan et al., 2014; Bach et al., 2015; Lapuschkin et al., 2016a) or through decision tree extraction (Krishnan et al., 1999). Most work dedicated to understanding neural network classifiers for NLP tasks (Denil et al., 2014; Li et al., 2015) uses gradient-based approaches. Recently, a technique called layer-wise relevance propagation (LRP) (Bach et al., 2015) has been shown to produce more meaningful explanations in the context of image classification (Samek et al., 2015). In this paper, we apply the same LRP technique to an NLP task, where a neural network maps a sequence of word2vec vectors representing a text document to its category, and evaluate whether similar benefits in terms of explanation quality are observed.

In the present work we contribute by (1) applying the LRP method to the NLP domain, (2) proposing a technique for quantitative evaluation of explanation methods for NLP classifiers, and (3) qualitatively and quantitatively comparing two different explanation methods, namely LRP and a gradient-based approach, on a topic categorization task using the 20Newsgroups dataset.

2 Explaining Predictions of Classifiers

We consider the problem of explaining a prediction f(x) associated to an input x by assigning to each input variable x_d a score R_d determining how relevant the input variable is for explaining the prediction. The scores can be pooled into groups of input variables (e.g. all word2vec dimensions of a word, or all components of an RGB pixel), such that they can be visualized as heatmaps of highlighted texts, or as images.

2.1 Layer-Wise Relevance Propagation

Layer-wise relevance propagation (Bach et al., 2015) is a newly introduced technique for obtaining these explanations. It can be applied to various machine learning classifiers such as deep convolutional neural networks. The LRP technique produces a decomposition of the function value f(x) on its input variables that satisfies the conservation property:

    f(x) = \sum_d R_d.    (1)

Proceedings of the 1st Workshop on Representation Learning for NLP, pages 1-7, Berlin, Germany, August 11th, 2016. (c) 2016 Association for Computational Linguistics
The decomposition is obtained by performing a backward pass on the network, where for each neuron, the relevance associated with it is redistributed to its predecessors. Consider neurons mapping a set of n inputs (x_i)_{i \in [1,n]} to the neuron activation x_j through the sequence of functions:

    z_{ij} = x_i w_{ij} + \frac{b_j}{n}
    z_j = \sum_i z_{ij}
    x_j = g(z_j)

where for convenience the neuron bias b_j has been distributed equally to each input neuron, and where g(\cdot) is a monotonously increasing activation function. Denoting by R_i and R_j the relevance associated with x_i and x_j, the relevance is redistributed from one layer to the other by defining messages R_{i \leftarrow j} indicating how much relevance must be propagated from neuron x_j to its input neuron x_i in the lower layer. These messages are defined as:

    R_{i \leftarrow j} = \frac{z_{ij} + \frac{s(z_j)}{n}}{\sum_i z_{ij} + s(z_j)} \, R_j

where s(z_j) = \epsilon \cdot (1_{z_j \geq 0} - 1_{z_j < 0}) is a stabilizing term that handles near-zero denominators, with \epsilon set to 0.01. The intuition behind this local relevance redistribution formula is that each input x_i should be assigned relevance proportionally to its contribution in the forward pass, in a way that the relevance is preserved (\sum_i R_{i \leftarrow j} = R_j).

Each neuron in the lower layer receives relevance from all upper-level neurons to which it contributes:

    R_i = \sum_j R_{i \leftarrow j}.

This pooling ensures layer-wise conservation: \sum_i R_i = \sum_j R_j. Finally, in a max-pooling layer, all relevance at the output of the layer is redistributed to the pooled neuron with maximum activation (i.e. winner-take-all). An implementation of LRP can be found in (Lapuschkin et al., 2016b) and downloaded from www.heatmapping.org [1].

2.2 Sensitivity Analysis

An alternative procedure called sensitivity analysis (SA) produces explanations by scoring input variables based on how they affect the decision output locally (Dimopoulos et al., 1995; Gevrey et al., 2003). The sensitivity of an input variable is given by its squared partial derivative:

    R_d = \left( \frac{\partial f}{\partial x_d} \right)^2.

Here we note that, unlike LRP, sensitivity analysis does not preserve the function value f(x), but the squared l2-norm of the function gradient:

    \| \nabla_x f(x) \|_2^2 = \sum_d R_d.    (2)

This quantity is however not directly related to the amount of evidence for the category to detect. Similar gradient-based analyses (Denil et al., 2014; Li et al., 2015) have recently been applied in the NLP domain, and were also used by Simonyan et al. (2014) in the context of image classification. While recent work uses different relevance definitions for a group of input variables (e.g. gradient magnitude in Denil et al. (2014) or max-norm of the absolute value of simple derivatives in Simonyan et al. (2014)), in the present work (unless otherwise stated) we employ the squared l2-norm of gradients, allowing for a decomposition of Eq. 2 as a sum over relevances of input variables.

3 Experiments

For the following experiments we use the 20news-bydate version of the 20Newsgroups [2] dataset, consisting of 11314/7532 train/test documents evenly distributed among twenty fine-grained categories.

3.1 CNN Model

As a document classifier we employ a word-based CNN similar to Kim (2014), consisting of the following sequence of layers:

    Conv -> ReLU -> 1-Max-Pool -> FC

By 1-Max-Pool we denote a max-pooling layer where the pooling regions span the whole text length, as introduced in (Collobert et al., 2011). Conv, ReLU and FC denote the convolutional layer, the rectified linear units activation, and the fully-connected linear layer. For building the CNN numerical input we concatenate horizontally 300-dimensional pre-trained word2vec [3] vectors (Mikolov et al., 2013), in the same order the corresponding words appear in the pre-processed document.

[1] Currently the available code is targeted on image data.
[2] http://qwone.com/%7Ejason/20Newsgroups/
[3] GoogleNews-vectors-negative300, https://code.google.com/p/word2vec/
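As a concrete illustration, the epsilon-stabilized redistribution rule of Section 2.1 can be sketched in a few lines of numpy. The two-layer ReLU network, its sizes, and its random weights below are illustrative stand-ins (this is not the paper's CNN); only the rule itself and the \epsilon = 0.01 value follow the text. Note that conservation holds exactly per layer, since the stabilizer s(z_j) appears both in the denominator and, distributed as s(z_j)/n, in the numerator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network: f(x) = w2 . ReLU(W1 x + b1)   (illustrative stand-in)
D, H = 4, 5
W1 = rng.normal(size=(H, D))
b1 = rng.normal(size=H)
w2 = rng.normal(size=(1, H))

def lrp_epsilon(R_out, a_in, W, b, eps=0.01):
    """Redistribute the relevances R_out of a layer's outputs onto its inputs a_in,
    using z_ij = a_i w_ij + b_j/n and the epsilon-stabilized messages
    R_{i<-j} = (z_ij + s(z_j)/n) / (z_j + s(z_j)) * R_j."""
    n = a_in.shape[0]
    zij = W * a_in[None, :] + (b / n)[:, None]      # shape: (outputs, inputs)
    zj = zij.sum(axis=1)                            # pre-activations z_j
    s = eps * np.where(zj >= 0, 1.0, -1.0)          # stabilizer s(z_j)
    messages = (zij + (s / n)[:, None]) / (zj + s)[:, None] * R_out[:, None]
    return messages.sum(axis=0)                     # R_i = sum_j R_{i<-j}

x = rng.normal(size=D)
h = np.maximum(W1 @ x + b1, 0.0)                    # hidden ReLU activations
fx = float(w2 @ h)                                  # network output f(x)

# Backward pass: output relevance f(x) -> hidden layer -> input variables.
# Relevance passes unchanged through the monotonic activation g (here ReLU).
R_h = lrp_epsilon(np.array([fx]), h, w2, np.zeros(1))
R_x = lrp_epsilon(R_h, x, W1, b1)

# Conservation property (Eq. 1): the input relevances sum back to f(x)
assert np.isclose(R_x.sum(), fx)
```

SA, by contrast, would score each x_d by its squared partial derivative, which sums to the squared gradient norm of Eq. 2 rather than to f(x).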

We keep this input representation fixed during training. The convolutional operation we apply in the first neural network layer is one-dimensional, along the text sequence direction (i.e. along the horizontal direction). The receptive field of the convolutional layer neurons spans the entire word embedding space in the vertical direction, and covers two consecutive words in the horizontal direction. The convolutional layer filter bank contains 800 filters.

3.2 Experimental Setup

As pre-processing we remove the document headers, tokenize the text with NLTK [4], filter out punctuation and numbers [5], and finally truncate each document to the first 400 tokens. We train the CNN by stochastic mini-batch gradient descent with momentum (with l2-norm penalty and dropout). Our trained classifier achieves a classification accuracy of 80.19% [6].

Due to our input representation, applying LRP or SA to our neural classifier yields one relevance value per word-embedding dimension. To obtain word-level relevances from these single input variable relevances, we sum up the relevances over the word embedding space in case of LRP, and (unless otherwise stated) take the squared l2-norm of the corresponding word gradient in case of SA. More precisely, given an input document d consisting of a sequence (w_1, w_2, ..., w_N) of N words, each word being represented by a D-dimensional word embedding, we compute the relevance R(w_t) of the t-th word in the input document through the summation:

    R(w_t) = \sum_{i=1}^{D} R_{i,t}    (3)

where R_{i,t} denotes the relevance of the input variable corresponding to the i-th dimension of the t-th word embedding, obtained by LRP or SA as specified in Sections 2.1 & 2.2.

In particular, in the case of SA, the above word relevance can equivalently be expressed as:

    R_{SA}(w_t) = \| \nabla_{w_t} f(d) \|_2^2    (4)

where f(d) represents the classifier's prediction for document d.

Note that the resulting LRP word relevance is signed, while the SA word relevance is positive.

In all experiments, we use the term "target class" to identify the function f(x) to analyze in the relevance decomposition. This function maps the neural network input to the neural network output variable corresponding to the target class.

3.3 Evaluating Word-Level Relevances

In order to evaluate different relevance models, we perform a sequence of "word deletions" (hereby, for deleting a word we simply set the word vector to zero in the input document representation), and track the impact of these deletions on the classification performance. We carry out two deletion experiments, starting either with the set of test documents that are initially classified correctly, or with those that are initially classified wrongly [7]. We estimate the LRP/SA word relevances using the true document class as the target class. Subsequently we delete words in decreasing resp. increasing order of the obtained word relevances.

Fig. 1 summarizes our results. We find that LRP yields the best results in both deletion experiments. Thereby we provide evidence that LRP positive relevance is targeted at words that support a classification decision, while LRP negative relevance is tuned upon words that inhibit this decision. In the first experiment the SA classification accuracy curve decreases significantly faster than the random curve representing the performance change when randomly deleting words, indicating that SA is able to identify relevant words. However, the SA curve is clearly above the LRP curve, indicating that LRP provides better explanations for the CNN predictions. Similar results have been reported for image classification tasks (Samek et al., 2015). The second experiment indicates that the classification performance increases when deleting words with the lowest LRP relevance, while small SA values point to words that have less influence on the classification performance than random word selection.

[4] We employ NLTK's version 3.1 recommended tokenizers sent_tokenize and word_tokenize, module nltk.tokenize.
[5] We retain only tokens composed of the following characters: alphabetic characters, apostrophe, hyphen and dot, and containing at least one alphabetic character.
[6] To the best of our knowledge, the best published 20Newsgroups accuracy is 83.0% (Paskov et al., 2013). However, we note that for simplification we use a fixed-length document representation, and our main focus is on explaining classifier decisions, not on improving the classification state-of-the-art.
[7] For the deletion experiments we consider only the test documents whose pre-processed length is greater or equal to 100 tokens; this amounts to a total of 4963 documents.

Figure 1: Word deletion on initially correctly (left) and falsely (right) classified test documents, using either LRP or SA. The target class is the true document class; words are deleted in decreasing (left) and increasing (right) order of their relevance. Random deletion is averaged over 10 runs (std < 0.0141). A steep decline (left) and incline (right) indicate informative word relevances. (Axes: classification accuracy over 4154 resp. 809 documents vs. number of deleted words, 0 to 50; curves: LRP, SA, random.)

This result can partly be explained by the fact that, in contrast to SA, LRP provides signed explanations. More generally, the different quality of the explanations provided by SA and LRP can be attributed to their different objectives: while LRP aims at decomposing the global amount of evidence for a class f(x), SA is built solely upon derivatives and as such describes the effect of local variations of the input variables on the classifier decision. For a more detailed view of SA, as well as an interpretation of the LRP propagation rules as a deep Taylor decomposition, see Montavon et al. (2015).

3.4 Document Highlighting

Word-level relevances can be used for highlighting purposes. In Fig. 2 we provide such visualizations on one test document for different relevance target classes, using either LRP or SA relevance models. We can observe that while the word "ride" is highly negative-relevant for LRP when the target class is not rec.motorcycles, it is positively highlighted (even though not heavily) by SA. This suggests that SA does not clearly discriminate between words speaking for or against a specific classifier decision, while LRP is more discerning in this respect.

3.5 Document Visualization

Word2vec embeddings are known to exhibit linear regularities representing semantic relationships between words (Mikolov et al., 2013). We explore whether these regularities can be transferred to a document representation, when using as a document vector a linear combination of word2vec embeddings. As a weighting scheme we employ LRP or SA scores, with the classifier's predicted class as the target class for the relevance estimation. For comparison we perform uniform weighting, where we simply sum up the word embeddings of the document words (SUM).

For SA we use either the l2-norm or the squared l2-norm for pooling word gradient values along the word2vec dimensions, i.e. in addition to the standard SA word relevance defined in Eq. 4, we use as an alternative R_{SA(l2)}(w_t) = \| \nabla_{w_t} f(d) \|_2 and denote this relevance model by SA(l2).

For both LRP and SA, we employ different variations of the weighting scheme. More precisely, given an input document d composed of the sequence (w_1, w_2, ..., w_N) of D-dimensional word2vec embeddings, we build new document representations d' and d'_{e.w.} [8] by either using word-level relevances R(w_t) (as in Eq. 3), or through element-wise multiplication of word embeddings with single input variable relevances (R_{i,t})_{i \in [1,D]} (we recall that R_{i,t} is the relevance of the input variable corresponding to the i-th dimension of the t-th word in the input document d). More formally we use:

    d' = \sum_{t=1}^{N} R(w_t) \cdot w_t

or

    d'_{e.w.} = \sum_{t=1}^{N} (R_{1,t}, R_{2,t}, ..., R_{D,t})^\top \odot w_t

where \odot denotes element-wise multiplication. Finally we normalize the document vectors d' resp. d'_{e.w.} to unit l2-norm and perform a PCA projection. In Fig. 3 we label the resulting 2D-projected test documents using five top-level document categories.

For word-based models d', we observe that while standard SA and LRP both provide similar visualization quality, the SA variant with the simple l2-norm yields partly overlapping and dense clusters; still, all schemes are better than uniform [9] weighting.

[8] The subscript e.w. stands for element-wise.
[9] We also performed a TFIDF weighting of word embeddings; the resulting 2D-visualization was very similar to uniform weighting (SUM).
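The construction of the two document vectors and the PCA projection can be sketched in numpy. The embeddings, the relevance matrix, and the small stand-in corpus are random illustrative values; only the formulas for d', d'_{e.w.}, the unit-norm normalization, and the 2D projection follow the text.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 10, 300

E = rng.normal(size=(N, D))   # word2vec embeddings w_t of one document (stand-ins)
R = rng.normal(size=(N, D))   # per-dimension relevances R_{i,t} (stand-ins)

# Word-based document vector:  d' = sum_t R(w_t) * w_t, with R(w_t) as in Eq. 3
R_word = R.sum(axis=1)
d_prime = (R_word[:, None] * E).sum(axis=0)

# Element-wise variant:  d'_ew = sum_t (R_{1,t}, ..., R_{D,t}) (*) w_t
d_ew = (R * E).sum(axis=0)

# Normalize both document vectors to unit l2-norm before the PCA projection
d_prime /= np.linalg.norm(d_prime)
d_ew /= np.linalg.norm(d_ew)

def pca_2d(X):
    """Project the rows of X onto their first two principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

# A small stand-in corpus: our two vectors plus a few random unit-norm documents
others = rng.normal(size=(8, D))
others /= np.linalg.norm(others, axis=1, keepdims=True)
coords = pca_2d(np.vstack([d_prime, d_ew, others]))   # 2D coordinates, shape (10, 2)
```

In the paper's experiment, one such vector is built per test document (with the relevances coming from LRP or SA), and the resulting 2D coordinates are colored by top-level category as in Fig. 3.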

Figure 2: Heatmaps for the test document sci.space 61393 (correctly classified), using either layer-wise relevance propagation (LRP) or sensitivity analysis (SA) for highlighting words. Positive relevance is mapped to red, negative to blue. The target class for the LRP/SA explanation is indicated on the left (sci.space, sci.med, rec.motorcycles). (The figure shows the same document text highlighted under each of the six LRP/SA and target-class combinations.)

Figure 3: PCA projection of the 20Newsgroups test documents formed by linearly combining word2vec embeddings. The weighting scheme is based on word-level relevances, or on single input variable relevances (e.w.), or uniform (SUM). The target class for relevance estimation is the predicted document class. SA(l2) corresponds to a variant of SA with simple l2-norm pooling of word gradient values. All visualizations are provided on the same equal axis scale. (Panels: LRP e.w., LRP, SUM, SA e.w., SA, SA(l2); category legend: comp, rec, sci, politics, religion.)

In case of SA, note that even though the power to which word gradient norms are raised (l2 or l2^2) affects the present visualization experiment, it has no influence on the earlier described word deletion analysis.

For element-wise models d'_{e.w.}, we observe slightly better separated clusters for SA, and a clear-cut cluster structure for LRP.

4 Conclusion

Through word deleting we quantitatively evaluated and compared two classifier explanation models, and pinpointed LRP to be more effective than SA. We investigated the application of word-level relevance information for document highlighting and visualization. We derive from our empirical analysis that the superiority of LRP stems from the fact that it not only reliably links to determinant words that support a specific classification decision, but further distinguishes, within the preeminent words, those that are opposed to that decision.

Future work would include applying LRP to other neural network architectures (e.g. character-based or recurrent models) on further NLP tasks, as well as exploring how relevance information could be taken into account to improve the classifier's training procedure or prediction performance.

Acknowledgments

This work was supported by the German Ministry for Education and Research as Berlin Big Data Center BBDC (01IS14013A) and the Brain Korea 21 Plus Program through the National Research Foundation of Korea funded by the Ministry of Education.

References

S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. 2015. On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLoS ONE, 10(7):e0130140.

Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. 2003. A Neural Probabilistic Language Model. JMLR, 3:1137-1155.

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. 2011. Natural Language Processing (Almost) from Scratch. JMLR, 12:2493-2537.

M. Denil, A. Demiraj, and N. de Freitas. 2014. Extraction of Salient Sentences from Labelled Documents. Technical report, University of Oxford.

Y. Dimopoulos, P. Bourret, and S. Lek. 1995. Use of some sensitivity criteria for choosing networks with good generalization ability. Neural Processing Letters, 2(6):1-4.

D. Erhan, A. Courville, and Y. Bengio. 2010. Understanding Representations Learned in Deep Architectures. Technical report, University of Montreal.

M. Gevrey, I. Dimopoulos, and S. Lek. 2003. Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecological Modelling, 160(3):249-264.

Y. Kim. 2014. Convolutional Neural Networks for Sentence Classification. In Proc. of EMNLP, pages 1746-1751.

R. Krishnan, G. Sivakumar, and P. Bhattacharya. 1999. Extracting decision trees from trained neural networks. Pattern Recognition, 32(12):1999-2009.

W. Landecker, M. Thomure, L. Bettencourt, M. Mitchell, G. Kenyon, and S. Brumby. 2013. Interpreting Individual Classifications of Hierarchical Networks. In IEEE Symposium on Computational Intelligence and Data Mining (CIDM), pages 32-38.

S. Lapuschkin, A. Binder, G. Montavon, K.-R. Müller, and W. Samek. 2016a. Analyzing Classifiers: Fisher Vectors and Deep Neural Networks. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

S. Lapuschkin, A. Binder, G. Montavon, K.-R. Müller, and W. Samek. 2016b. The Layer-wise Relevance Propagation Toolbox for Artificial Neural Networks. JMLR, in press.

J. Li, X. Chen, E. Hovy, and D. Jurafsky. 2015. Visualizing and Understanding Neural Models in NLP. arXiv:1506.01066.

T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In Workshop Proc. ICLR.

G. Montavon, S. Bach, A. Binder, W. Samek, and K.-R. Müller. 2015. Explaining Non-Linear Classification Decisions with Deep Taylor Decomposition. arXiv:1512.02479.

H.S. Paskov, R. West, J.C. Mitchell, and T. Hastie. 2013. Compressive Feature Learning. In Adv. in NIPS.

W. Samek, A. Binder, G. Montavon, S. Bach, and K.-R. Müller. 2015. Evaluating the visualization of what a Deep Neural Network has learned. arXiv:1509.06321.

K. Simonyan, A. Vedaldi, and A. Zisserman. 2014. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. In Workshop Proc. ICLR.

M. D. Zeiler and R. Fergus. 2014. Visualizing and Understanding Convolutional Networks. In ECCV, pages 818-833.
