
Practical Deep Learning for NLP

Maarten Versteegh
NLP Research Engineer
Overview

● Deep Learning recap
● Text classification:
  – Convnet with word embeddings
● Sentiment analysis:
  – ResNet
● Tips and tricks
What is this deep learning thing again?
[Diagram: a feedforward network with input, hidden, and output layers; activations flow forward, errors propagate backward.]
Rectified Linear Units
Backpropagation involves repeated multiplication with the derivative of the activation function
→ Problem if the result is always smaller than 1!
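Why this bites (a standard derivation, not from the slides): the sigmoid derivative is bounded by 1/4,

\sigma'(x) = \sigma(x)(1 - \sigma(x)) \le 1/4

so backpropagating through n sigmoid layers scales the gradient by at most (1/4)^n, which vanishes quickly with depth. ReLU, f(x) = max(0, x), has f'(x) = 1 for all x > 0, so gradients pass through active units unscaled.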
Text Classification
Traditional approach: BOW + TFIDF
“The car might also need a front end alignment”

Unigrams              Bigrams
"alignment" (0.323)   "also need" (0.343)
"also" (0.137)        "car might" (0.358)
"car" (0.110)         "end alignment" (0.358)
"end" (0.182)         "front end" (0.296)
"front" (0.167)       "might also" (0.358)
"might" (0.178)       "need front" (0.358)
"need" (0.157)        "the car" (0.161)
"the" (0.053)
20 newsgroups performance

                              F1-Score*
BOW+TFIDF+SVM                 Some number

(*) Scores removed


Deep Learning 1: Replace Classifier

[Diagram: BOW features (x 1000) → Hidden (x 512) → Hidden (x 256) → Output]
from keras.layers import Input, Dense
from keras.models import Model
from keras.utils.np_utils import to_categorical

input_layer = Input(shape=(1000,))
fc_1 = Dense(512, activation='relu')(input_layer)
fc_2 = Dense(256, activation='relu')(fc_1)
output_layer = Dense(10, activation='softmax')(fc_2)

model = Model(input=input_layer, output=output_layer)

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# categorical_crossentropy expects one-hot targets
model.fit(bow, to_categorical(newsgroups.target))
predictions = model.predict(bow).argmax(axis=1)
20 newsgroups performance

                              F1-Score*
BOW+TFIDF+SVM                 Some number
BOW+TFIDF+SVD + 2-layer NN    Some slightly higher number

(*) Scores removed


What about the deep learning promise?
Convolutional Networks

[Figure omitted. Source: Andrej Karpathy]

Pooling layer

[Figure omitted. Source: Andrej Karpathy]

Convolutional networks

[Figure omitted. Source: Y. Kim (2014) Convolutional Neural Networks for Sentence Classification]

Word embedding
from keras.layers import Embedding

# embedding_matrix: ndarray(vocab_size, embedding_dim)

input_layer = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')

layer = Embedding(
    embedding_matrix.shape[0],    # vocabulary size
    embedding_matrix.shape[1],    # embedding dimension
    weights=[embedding_matrix],   # initialize with pre-trained vectors
    input_length=MAX_SEQUENCE_LENGTH,
    trainable=False               # keep the embeddings fixed during training
)(input_layer)
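The slides don't show where embedding_matrix comes from; a minimal sketch, assuming word_index maps words to integer ids and w2v maps words to pre-trained vectors (both names are hypothetical):

import numpy as np

embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
for word, i in word_index.items():        # word_index: hypothetical word → id map
    if word in w2v:                       # w2v: hypothetical pre-trained vectors
        embedding_matrix[i] = w2v[word]   # rows for unknown words stay zero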
from keras.layers import Convolution1D, MaxPooling1D, \
    BatchNormalization, Activation

layer = Embedding(...)(input_layer)

layer = Convolution1D(
    128,                 # number of filters
    5,                   # filter size
    activation='relu',
)(layer)

layer = MaxPooling1D(5)(layer)   # take the max over windows of 5 steps
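The slides stop at the pooling layer; one plausible way to finish the classifier, with layer sizes that are assumptions rather than the talk's actual values:

from keras.layers import Flatten, Dense

layer = Flatten()(layer)                                # pooled feature maps → flat vector
layer = Dense(128, activation='relu')(layer)
output_layer = Dense(20, activation='softmax')(layer)   # one unit per newsgroup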
Performance
                              F1-Score*
BOW+TFIDF+SVM                 Some number
BOW+TFIDF+SVD + 2-layer NN    Some slightly higher number
ConvNet (3 layers)            Quite a bit higher now
ConvNet (6 layers)            Look mom, even higher!

(*) Scores removed


Sentiment Analysis
Data Set

● Facebook posts from media organizations:
  – CNN, MSNBC, NYTimes, The Guardian, Buzzfeed, Breitbart, Politico, The Wall Street Journal, Washington Post, Baltimore Sun
● Measure sentiment as “reactions”
Title                                                                   | Org       | Like  | Love | Wow | Haha | Sad | Angry
Poll: Clinton up big on Trump in Virginia                               | CNN       |  4176 |  601 |  17 |  211 |  11 |    83
It's a fact: Trump has tiny hands. Will this be the one that sinks him? | Guardian  |   595 |   17 |  17 |  225 |   2 |     8
Donald Trump Explains His Obama-Founded-ISIS Claim as ‘Sarcasm’         | NYTimes   |  2059 |   32 | 284 | 1214 |  80 |  2167
Can hipsters stomach the unpalatable truth about avocado toast?         | Guardian  |  3655 |    0 | 396 |   44 | 773 |    69
Tim Kaine skewers Donald Trump's military policy                        | MSNBC     |  1094 |  111 |   6 |   12 |   2 |    26
Top 5 Most Antisemitic Things Hillary Clinton Has Done                  | Breitbart |  1067 |    7 | 134 |   35 |  22 |   372
17 Hilarious Tweets About Donald Trump Explaining Movies                | Buzzfeed  | 11390 |  375 |  16 | 4121 |   4 |     5
Go deeper: ResNet
Convolutional layers with shortcuts

[Figure omitted. Source: He et al. (2015) Deep Residual Learning for Image Recognition]
Go deeper: ResNet

from keras.layers import merge

input_layer = ...

# 'same' padding keeps the sequence length so the sum merge below works
layer = Convolution1D(128, 5, activation='linear',
                      border_mode='same')(input_layer)
layer = BatchNormalization()(layer)
layer = Activation('relu')(layer)

layer = Convolution1D(128, 5, activation='linear',
                      border_mode='same')(layer)
layer = BatchNormalization()(layer)
layer = Activation('relu')(layer)

# shortcut connection: add the block's input to its output
block_output = merge([layer, input_layer], mode='sum')
block_output = Activation('relu')(block_output)
[Diagram: full model. The title + message (e.g. "It's a fact: Trump has tiny hands.") enters as word embeddings (EMBEDDING_DIM=300) and passes through Conv (128) x 10 in ResNet blocks followed by MaxPooling and a Dense layer; the news org (e.g. The Guardian, as a 1-of-K vector) passes through its own Dense layer; the combined representation predicts the reaction distribution (%).]
Cherry-picked predicted response distribution*

Sentence                | Org       | Love | Haha | Wow | Sad | Angry
Trump wins the election | Guardian  |   3% |   9% |  7% | 32% |   49%
Trump wins the election | Breitbart |  58% |  30% |  8% |  1% |    3%

(*) Your mileage may vary. By a lot. I mean it.
Tips and Tricks
Initialization
● Break symmetry:
  – Never ever initialize all your weights to the same value
● Let initialization depend on the activation function:
  – ReLU/PReLU → He Normal
  – sigmoid/tanh → Glorot Normal
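In the Keras 1 API used elsewhere in these slides, that looks roughly like this (a sketch, not from the talk):

from keras.layers import Dense

fc_relu = Dense(512, activation='relu', init='he_normal')      # He Normal for ReLU units
fc_tanh = Dense(256, activation='tanh', init='glorot_normal')  # Glorot Normal for saturating units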
Choose an adaptive optimizer

[Figure: optimizer comparison. Source: Alec Radford]
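In Keras this is a one-line change; for example (the learning rate here is an arbitrary choice):

from keras.optimizers import Adam

model.compile(optimizer=Adam(lr=1e-3),
              loss='categorical_crossentropy',
              metrics=['accuracy'])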


Choose the right model size

● Start small and keep adding layers
  – Check if test error keeps going down
● Cross-validate over the number of units
● You want to be able to overfit

Y. Bengio (2012) Practical recommendations for gradient-based training of deep architectures
Don't be scared of overfitting
● If your model can't overfit, it also can't learn enough
● So, check that your model can overfit:
  – If not, make it bigger
  – If so, get more data and/or regularize

[Figure omitted. Source: Wikipedia]
Regularization
● Norm penalties on hidden layer weights, never on the first and last layers
● Dropout
● Early stopping
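A minimal sketch of all three in the Keras 1 API (penalty strength, dropout rate, and patience are assumptions):

from keras.layers import Dense, Dropout
from keras.regularizers import l2
from keras.callbacks import EarlyStopping

# norm penalty on a hidden layer only
layer = Dense(256, activation='relu', W_regularizer=l2(0.01))(layer)
layer = Dropout(0.5)(layer)

# stop training when the validation loss stops improving
model.fit(x, y, validation_split=0.1,
          callbacks=[EarlyStopping(monitor='val_loss', patience=2)])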
Size of data set
● Just get more data already
● Augment data:
  – Textual replacements
  – Word vector perturbation (sketched below)
  – Noise Contrastive Estimation
● Semi-supervised learning:
  – Adapt word embeddings to your domain
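For the word vector perturbation item above, one simple instantiation (the noise scale is an assumption): add small Gaussian noise to embedded inputs to generate extra training samples.

import numpy as np

def perturb(embedded_batch, scale=0.01):
    # return a noisy copy of a batch of embedded sequences
    noise = np.random.normal(0.0, scale, embedded_batch.shape)
    return embedded_batch + noise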
Monitor your model

● Training loss:
  – Does the model converge?
  – Is the learning rate too low or too high?

[Figure: training loss under different learning rates. Source: Andrej Karpathy]
Monitor your model

● Training and validation accuracy:
  – Is there a large gap?
  – Does the training accuracy increase while the validation accuracy decreases?

[Figure: training vs. validation accuracy. Source: Andrej Karpathy]
Monitor your model

● Ratio of weights to updates (see the sketch below)
● Distribution of activations and gradients (per layer)
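One way to track the weights-to-updates ratio (the ~1e-3 rule of thumb comes from Karpathy's CS231n notes; prev_weights is a hypothetical snapshot taken before the update):

import numpy as np

weights = layer.get_weights()[0]      # current weight matrix of a Keras layer
update = weights - prev_weights       # prev_weights: snapshot from before this step
ratio = np.linalg.norm(update) / np.linalg.norm(weights)
# around 1e-3 is healthy; much lower suggests the learning rate is too low,
# much higher that it is too high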
Hyperparameter optimization

After the network architecture, continue with:
  – Regularization strength
  – Initial learning rate
  – Optimization strategy (and LR decay schedule)
Hyperparameter optimization

Friends don't let friends do a full grid search!
  – Use a smart strategy like Bayesian optimization or Particle Swarm Optimization (Spearmint, SMAC, Hyperopt, Optunity)
  – Even random search often beats grid search (see the sketch below)
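A minimal sketch with Hyperopt's TPE (the search space and the train_and_evaluate helper are assumptions):

from hyperopt import fmin, tpe, hp

space = {
    'lr': hp.loguniform('lr', -10, -2),   # initial learning rate
    'l2': hp.loguniform('l2', -12, -4),   # regularization strength
}

def objective(params):
    # train a model with these params and return the validation loss
    return train_and_evaluate(**params)   # hypothetical helper

best = fmin(objective, space, algo=tpe.suggest, max_evals=50)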
Keep up to date: arxiv-sanity.com
We are hiring!
DevOps & Front-end
NLP engineers
Full-stack Python engineers

www.textkernel.com/jobs
Questions?

Source: http://visualqa.org/
