
Deep Learning
NIPS'2015 Tutorial
Geoff Hinton, Yoshua Bengio & Yann LeCun
Breakthrough
Deep Learning: machine learning algorithms based on learning multiple levels of representation / abstraction.
Amazing improvements in error rate in object recognition, object detection, speech recognition, and more recently, in natural language processing / understanding.

2
Machine Learning, AI & No Free Lunch
•  Four key ingredients for ML towards AI
1.  Lots & lots of data
2.  Very flexible models
3.  Enough computing power
4.  Powerful priors that can defeat the curse of dimensionality
3
Bypassing the curse of dimensionality
We need to build compositionality into our ML models
Just as human languages exploit compositionality to give representations and meanings to complex ideas

Exploiting compositionality gives an exponential gain in representational power
(1) Distributed representations / embeddings: feature learning
(2) Deep architecture: multiple levels of feature learning

Additional prior: compositionality is useful to describe the world around us efficiently
4

Classical Symbolic AI vs
Learning Distributed Representations

•  Two symbols are equally far from each other
•  Concepts are not represented by symbols in our brain, but by patterns of activation (Connectionism, 1980's)

[Figure: input units (person, cat, dog) → hidden units → output units; photos: Geoffrey Hinton, David Rumelhart]

5
Exponential advantage of distributed
representations

Learning a set of parametric features that are not mutually exclusive can be exponentially more statistically efficient than having nearest-neighbor-like or clustering-like models
Hidden Units Discover Semantically
Meaningful Concepts
•  Zhou et al & Torralba, arXiv:1412.6856, submitted to ICLR 2015
•  Network trained to recognize places, not objects

[Figures from the paper: interpretation of a picture by different layers of the Places-CNN as annotated by AMT workers; segmentations of SUN-database images from pool5 units (J = Jaccard segmentation index, AP = average precision-recall); counts of units discovering each object class and of the most informative objects for scene recognition; histogram of AP for all discovered object classes. 115 units in pool5 of Places-CNN do not detect objects, suggesting incomplete learning or a complementary texture-based or part-based representation.]

7
Each feature can be discovered without the need for seeing the exponentially large number of configurations of the other features
•  Consider a network whose hidden units discover the following features:
•  Person wears glasses
•  Person is female
•  Person is a child
•  Etc.
If each of the n features requires O(k) parameters, need O(nk) examples

Non-parametric methods would require O(n^d) examples

8
Exponential advantage of distributed
representations

•  Bengio 2009 (Learning Deep Architectures for AI, F & T in ML)

•  Montufar & Morton 2014 (When does a mixture of products contain a product of mixtures? SIAM J. Discr. Math)

•  Longer discussion and relations to the notion of priors: Deep Learning, to appear, MIT Press.

•  Prop. 2 of Pascanu, Montufar & Bengio ICLR'2014: the number of pieces distinguished by a 1-hidden-layer rectifier net with n units and d inputs (i.e. O(nd) parameters) is given by the bound reproduced below.
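The bound itself did not survive the extraction; as I recall it from the cited paper (worth checking against the original statement), it is the classical hyperplane-arrangement count

    \sum_{j=0}^{d} \binom{n}{j}

i.e. polynomial of degree d in the number of hidden units n.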

9
Deep Learning: Automating Feature Discovery

[Fig: I. Goodfellow — comparison of the four approaches]
•  Rule-based systems: input → hand-designed program → output
•  Classic machine learning: input → hand-designed features → mapping from features → output
•  Representation learning: input → (learned) features → mapping from features → output
•  Deep learning: input → simplest features → more complex features → mapping from features → output

10
Exponential advantage of depth

Theoretical arguments:
•  2 layers of logic gates, formal neurons, or RBF units = universal approximator
•  RBMs & auto-encoders = universal approximator

Theorems on the advantage of depth: (Hastad et al 86 & 91, Bengio et al 2007, Bengio & Delalleau 2011, Martens et al 2013, Pascanu et al 2014, Montufar et al NIPS 2014)

Some functions compactly represented with k layers may require exponential size with 2 layers

[Diagram: a depth-k circuit over inputs 1 … n vs. an equivalent 2-layer circuit of exponential (2^n) width]
Why does it work? No Free Lunch
•  It only works because we are making some assumptions about the data generating distribution

•  Worst-case distributions still require exponential data

•  But the world has structure and we can get an exponential gain by exploiting some of it

12
Exponential advantage of depth

•  Expressiveness of deep networks with piecewise linear activation functions: exponential advantage for depth (Montufar et al, NIPS 2014)
•  The number of pieces distinguished by a network with depth L and n_i units per layer is at least the first bound below; if the hidden layers have width n and the input has size n_0, it is at least the second.
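Neither bound survived the extraction; the expressions below are reproduced from memory of the cited NIPS 2014 paper and should be checked against the original statement. With n_i units in layer i and input size n_0, the number of linear regions is at least

    \left( \prod_{i=1}^{L-1} \left\lfloor \frac{n_i}{n_0} \right\rfloor^{n_0} \right) \sum_{j=0}^{n_0} \binom{n_L}{j}

and, when every hidden layer has width n,

    \Omega\!\left( \left( \frac{n}{n_0} \right)^{(L-1)\, n_0} n^{n_0} \right)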

13
Y LeCun

Backprop
(modular approach)

Typical Multilayer Neural Net Architecture

l  Complex learning machines can be built by assembling modules into networks
l  Linear Module: Out = W·In + B
l  ReLU Module (Rectified Linear Unit): Out_i = 0 if In_i < 0, Out_i = In_i otherwise
l  Cost Module (Squared Distance): C = ||In1 - In2||^2
l  Objective Function: L(Θ) = 1/p Σ_k C(X_k, Y_k, Θ), with Θ = (W1, B1, W2, B2, W3, B3)

[Diagram: X (input) → Linear (W1, B1) → ReLU → Linear (W2, B2) → ReLU → Linear (W3, B3) → Squared Distance against Y (desired output) → C(X, Y, Θ)]


Building a Network by Assembling Modules

l  All major deep learning frameworks use modules (inspired by SN/Lush, 1991)
l  Torch7, Theano, TensorFlow, …

[Diagram: X (input) → Linear (W1, B1) → ReLU → Linear (W2, B2) → LogSoftMax → NegativeLogLikelihood against Y (label) → C(X, Y, Θ)]
Computing Gradients by Back-Propagation

l  A practical application of the chain rule
l  Backprop for the state gradients:
l  dC/dX_{i-1} = dC/dX_i · dX_i/dX_{i-1}
l  dC/dX_{i-1} = dC/dX_i · dF_i(X_{i-1}, W_i)/dX_{i-1}
l  Backprop for the weight gradients:
l  dC/dW_i = dC/dX_i · dX_i/dW_i
l  dC/dW_i = dC/dX_i · dF_i(X_{i-1}, W_i)/dW_i

[Diagram: a stack of modules F_1(X_0, W_1) … F_n(X_{n-1}, W_n) from X (input) to the Cost C(X, Y, Θ); the gradients dC/dX_i and dC/dW_i flow backwards through each module]


Running Backprop

l  Torch7 example
l  gradtheta contains the gradient

[Diagram: the same network as above, with parameters Θ and their gradient accumulated during the backward pass from C(X, Y, Θ) down to X (input) and Y (label)]
Module Classes

l  Linear: Y = W·X ; dC/dX = Wᵀ·dC/dY ; dC/dW = dC/dY·Xᵀ
l  ReLU: y = ReLU(x) ; if (x < 0) dC/dx = 0 else dC/dx = dC/dy
l  Duplicate: Y1 = X, Y2 = X ; dC/dX = dC/dY1 + dC/dY2
l  Add: Y = X1 + X2 ; dC/dX1 = dC/dY ; dC/dX2 = dC/dY
l  Max: y = max(x1, x2) ; if (x1 > x2) dC/dx1 = dC/dy else dC/dx1 = 0
l  LogSoftMax: Y_i = X_i – log[∑_j exp(X_j)] ; …
(a small sketch of two of these modules follows)
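A minimal numpy sketch (mine, not from the tutorial) of the Linear and ReLU modules with their forward and backward passes; it uses a minibatch-of-row-vectors convention, so the Linear backward computes dC/dX = dC/dY·W and dC/dW = dC/dYᵀ·X, the batched transpose of the per-sample formulas above:

    import numpy as np
    rng = np.random.default_rng(0)

    class Linear:
        def __init__(self, n_in, n_out):
            self.W = rng.normal(0.0, 0.1, size=(n_out, n_in))
            self.B = np.zeros(n_out)
        def forward(self, X):            # X: (batch, n_in)
            self.X = X
            return X @ self.W.T + self.B
        def backward(self, dY):          # dY = dC/dY, shape (batch, n_out)
            self.dW = dY.T @ self.X      # dC/dW
            self.dB = dY.sum(axis=0)     # dC/dB
            return dY @ self.W           # dC/dX

    class ReLU:
        def forward(self, X):
            self.mask = X > 0
            return X * self.mask
        def backward(self, dY):
            return dY * self.mask

    # one forward/backward pass through a tiny net with a squared-distance cost
    net = [Linear(4, 8), ReLU(), Linear(8, 2)]
    X, Y = rng.standard_normal((5, 4)), rng.standard_normal((5, 2))
    out = X
    for m in net:
        out = m.forward(out)
    dC = 2 * (out - Y) / len(X)          # gradient of the mean squared distance
    for m in reversed(net):
        dC = m.backward(dC)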


Module Classes

l  Many more basic module classes
l  Cost functions:
   l  Squared error
   l  Hinge loss
   l  Ranking loss
l  Non-linearities and operators
   l  ReLU, "leaky" ReLU, abs, …
   l  Tanh, logistic
   l  Just about any simple function (log, exp, add, mul, …)
l  Specialized modules
   l  Multiple convolutions (1D, 2D, 3D)
   l  Pooling/subsampling: max, average, Lp, log(sum(exp())), maxout
   l  Long Short-Term Memory, attention, 3-way multiplicative interactions
   l  Switches
   l  Normalizations: batch norm, contrast norm, feature norm…
   l  Inception
Any Architecture works
Y LeCun

" Any connection graph is permissible


"  Directed acyclic graphs (DAG)
"  Networks with loops must be
“unfolded in time”.
" Any module is permissible
"  As long as it is continuous and
differentiable almost everywhere with
respect to the parameters, and with
respect to non-terminal inputs.
" Most frameworks provide automatic
differentiation
"  Theano, Torch7+autograd,…
"   Programs are turned into
computation DAGs and automatically
differentiated.
Backprop in Practice
Y LeCun

" Use ReLU non-linearities


" Use cross-entropy loss for classification
" Use Stochastic Gradient Descent on minibatches
" Shuffle the training samples (← very important)
" Normalize the input variables (zero mean, unit variance)
" Schedule to decrease the learning rate
" Use a bit of L1 or L2 regularization on the weights (or a combination)
"  But it's best to turn it on after a couple of epochs
" Use “dropout” for regularization
" Lots more in [LeCun et al. “Efficient Backprop” 1998]
" Lots, lots more in “Neural Networks, Tricks of the Trade” (2012 edition)
edited by G. Montavon, G. B. Orr, and K-R Müller (Springer)
" More recent: Deep Learning (MIT Press book in preparation)
Y LeCun

Convolutional
Networks
Deep Learning = Training Multistage Machines

"  Traditional Pattern Recognition: fixed/handcrafted feature extractor
   Feature Extractor → Trainable Classifier

"  Mainstream Pattern Recognition (until recently)
   Feature Extractor → Mid-Level Features → Trainable Classifier

"  Deep Learning: multiple stages/layers trained end to end
   Low-Level Features → Mid-Level Features → High-Level Features → Trainable Classifier
Overall Architecture: multiple stages of
Normalization → Filter Bank → Non-Linearity → Pooling  (toy sketch below)

[Diagram: Norm → Filter Bank → Non-Linear → feature Pooling → Norm → Filter Bank → Non-Linear → feature Pooling → Classifier]

"  Normalization: variation on whitening (optional)
–  Subtractive: average removal, high-pass filtering
–  Divisive: local contrast normalization, variance normalization
"  Filter Bank: dimension expansion, projection on overcomplete basis
"  Non-Linearity: sparsification, saturation, lateral inhibition…
–  Rectification (ReLU), component-wise shrinkage, tanh, …
"  Pooling: aggregation over space or feature type
–  Max, Lp norm, log prob.
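A toy numpy sketch of one such stage (normalization → filter bank → ReLU → max pooling) on a single-channel image; all sizes are illustrative and the loops are written for clarity, not speed:

    import numpy as np

    def conv2d_valid(image, kernels):
        """Filter bank: 'valid' cross-correlation (what deep learning calls convolution)."""
        H, W = image.shape
        n_k, kh, kw = kernels.shape
        out = np.zeros((n_k, H - kh + 1, W - kw + 1))
        for k in range(n_k):
            for i in range(H - kh + 1):
                for j in range(W - kw + 1):
                    out[k, i, j] = np.sum(image[i:i+kh, j:j+kw] * kernels[k])
        return out

    def max_pool2x2(fmaps):
        """Max pooling over non-overlapping 2x2 windows, per feature map."""
        n_k, H, W = fmaps.shape
        fmaps = fmaps[:, :H - H % 2, :W - W % 2]
        return fmaps.reshape(n_k, H // 2, 2, W // 2, 2).max(axis=(2, 4))

    rng = np.random.default_rng(0)
    image = rng.standard_normal((32, 32))
    image = (image - image.mean()) / image.std()               # normalization
    kernels = rng.standard_normal((6, 5, 5))                   # filter bank: 6 kernels of 5x5
    features = np.maximum(0.0, conv2d_valid(image, kernels))   # ReLU non-linearity
    pooled = max_pool2x2(features)                             # pooling
    print(features.shape, pooled.shape)                        # (6, 28, 28) (6, 14, 14)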
ConvNet Architecture
Y LeCun

Filter Bank +non-linearity

Pooling

Filter Bank +non-linearity

Pooling

Filter Bank +non-linearity

" LeNet1 [LeCun et al. NIPS 1989]


Multiple Convolutions
Y LeCun

Animation: Andrej Karpathy http://cs231n.github.io/convolutional-networks/


Convolutional Networks (vintage 1990)
Y LeCun

" filters → tanh → average-tanh → filters → tanh → average-tanh → filters → tanh


Example: 1D (Temporal) convolutional net
Y LeCun

" 1D (Temporal) ConvNet, aka Timed-Delay Neural Nets


" Groups of units are replicated at each time step.
" Replicas have identical (shared) weights.
LeNet5

"  Simple ConvNet for MNIST [LeCun 1998]

input 1@32x32 → (5x5 convolution) → Layer 1: 6@28x28 → (2x2 pooling/subsampling) → Layer 2: 6@14x14 → (5x5 convolution) → Layer 3: 12@10x10 → (2x2 pooling/subsampling) → Layer 4: 12@5x5 → (5x5 convolution) → Layer 5: 100@1x1 → Layer 6: 10
Applying a ConvNet with a Sliding Window
Y LeCun

" Every layer is a convolution


" Sometimes called “fully convolutional nets”
" There is no such thing as a “fully connected layer”
Sliding Window ConvNet + Weighted FSM (Fixed Post-Proc)
Y LeCun
[Matan, Burges, LeCun, Denker NIPS 1991] [LeCun, Bottou, Bengio, Haffner, Proc IEEE 1998]
Sliding Window ConvNet + Weighted FSM
Y LeCun
Why Multiple Layers? The World is Compositional
Y LeCun

" Hierarchy of representations with increasing level of abstraction


" Each stage is a kind of trainable feature transform
" Image recognition: Pixel → edge → texton → motif → part → object
" Text: Character → word → word group → clause → sentence → story
" Speech: Sample → spectral band → sound → … → phone → phoneme → word

Low-Level Feature → Mid-Level Feature → High-Level Feature → Trainable Classifier
Yes, ConvNets are somewhat inspired by the Visual Cortex
Y LeCun

" The ventral (recognition) pathway in the visual cortex has multiple stages
" Retina - LGN - V1 - V2 - V4 - PIT - AIT ....

[picture from Simon Thorpe]


[Gallant & Van Essen]
What are ConvNets Good For
Y LeCun

" Signals that comes to you in the form of (multidimensional) arrays.


" Signals that have strong local correlations
" Signals where features can appear anywhere
" Signals in which objects are invariant to translations and distortions.

" 1D ConvNets: sequential signals, text


–  Text Classification
–  Musical Genre Recognition
–  Acoustic Modeling for Speech Recognition
–  Time-Series Prediction
" 2D ConvNets: images, time-frequency representations (speech and audio)
–  Object detection, localization, recognition
" 3D ConvNets: video, volumetric images, tomography images
–  Video recognition / understanding
–  Biomedical image analysis
–  Hyperspectral image analysis
Recurrent Neural Networks

37
Recurrent Neural Networks
•  Selectively summarize an input sequence in a fixed-size state vector via a recursive update

[Diagram: s = F_θ(s, x), unfolded in time as s_t = F_θ(s_{t-1}, x_t) over the inputs … x_{t-1}, x_t, x_{t+1} …]

38
Recurrent Neural Networks
•  Can produce an output at each time step: unfolding the graph tells us how to back-prop through time (a minimal sketch follows below).

[Diagram: unfolded graph with input weights U (x_t → s_t), recurrent weights W (s_{t-1} → s_t) and output weights V (s_t → o_t)]
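A minimal numpy sketch of this unfolded computation, using the U, W, V naming from the diagram (the tanh update and linear outputs are my own illustrative choices):

    import numpy as np

    def rnn_forward(X, U, W, V, s0):
        """X: (T, n_in) input sequence; returns states (T, n_hid) and outputs (T, n_out)."""
        states, outputs, s = [], [], s0
        for t in range(X.shape[0]):
            s = np.tanh(U @ X[t] + W @ s)     # recursive update s_t = F_theta(s_{t-1}, x_t)
            states.append(s)
            outputs.append(V @ s)             # an output at each time step
        return np.array(states), np.array(outputs)

    rng = np.random.default_rng(0)
    n_in, n_hid, n_out, T = 3, 5, 2, 7
    U, W, V = (rng.standard_normal(s) * 0.1
               for s in [(n_hid, n_in), (n_hid, n_hid), (n_out, n_hid)])
    states, outputs = rnn_forward(rng.standard_normal((T, n_in)), U, W, V, np.zeros(n_hid))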
39
Generative RNNs

•  An RNN can represent a fully-connected directed generative model: every variable predicted from all previous ones.

[Diagram: unfolded RNN with per-step losses L_t; each x_{t+1} is predicted from the state s_t, which summarizes x_1 … x_t]
40
Maximum Likelihood = Teacher Forcing

•  During training, the past y fed as input comes from the training data
•  At generation time, the past y fed as input is generated by the model
•  The mismatch can cause "compounding error"

[Diagram: ŷ_t ~ P(y_t | h_t); (x_t, y_t) is the next input/output training pair]
41
Increasing the Expressive Power of RNNs with more Depth

•  ICLR 2014, How to construct deep recurrent neural networks

[Diagram: ordinary RNNs vs. + deep hid-to-out, + deep hid-to-hid, + deep in-to-hid, + stacking, + skip connections for creating shorter paths]
42
Long-Term Dependencies
•  The RNN gradient is a product of Jacobian matrices, each associated with a step in the forward computation. To store information robustly in a finite-dimensional state, the dynamics must be contractive [Bengio et al 1994].

•  Storing bits robustly requires singular values < 1
•  Problems:
•  singular values of Jacobians > 1 → gradients explode (→ gradient clipping)
•  or singular values < 1 → gradients shrink & vanish (Hochreiter 1991)
•  or random → variance grows exponentially
43
Gradient Norm Clipping
(Mikolov thesis 2012;
Pascanu, Mikolov, Bengio, ICML 2013)
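The clipping rule itself is only in the (missing) slide figure; the sketch below is a common formulation, as I recall it from the cited ICML 2013 paper: rescale the gradient whenever its global norm exceeds a threshold.

    import numpy as np

    def clip_gradient_norm(grads, threshold):
        """Rescale a list of gradient arrays so their global norm is at most `threshold`."""
        total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
        if total_norm > threshold:
            scale = threshold / total_norm
            grads = [g * scale for g in grads]
        return grads

    grads = [np.random.randn(10, 10) * 5.0, np.random.randn(10) * 5.0]
    clipped = clip_gradient_norm(grads, threshold=1.0)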

44
RNN Tricks
(Pascanu, Mikolov, Bengio, ICML 2013; Bengio, Boulanger & Pascanu, ICASSP 2013)

•  Clipping gradients (avoid exploding gradients)
•  Leaky integration (propagate long-term dependencies)
•  Momentum (cheap 2nd order)
•  Initialization (start in the right ballpark, avoids exploding/vanishing)
•  Sparse gradients (symmetry breaking)
•  Gradient propagation regularizer (avoid vanishing gradient)
•  LSTM self-loops (avoid vanishing gradient)
45

Gated Recurrent Units & LSTM

•  Create a path where gradients can flow for longer, with a self-loop
•  Corresponds to an eigenvalue of the Jacobian slightly less than 1
•  LSTM is heavily used (Hochreiter & Schmidhuber 1997)
•  GRU: light-weight version (Cho et al 2014) — sketch below

[Diagram: LSTM cell with a self-looping state; input, forget and output gates modulate the input, self-loop and output paths]
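A numpy sketch of one GRU step (a minimal rendering of the Cho et al 2014 update from memory; weight names are illustrative and sign conventions for the update gate vary across papers):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
        """One step of a gated recurrent unit."""
        z = sigmoid(Wz @ x + Uz @ h_prev)               # update gate (gated self-loop)
        r = sigmoid(Wr @ x + Ur @ h_prev)               # reset gate
        h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))   # candidate state
        return z * h_prev + (1.0 - z) * h_tilde         # leaky self-loop lets gradients flow

    rng = np.random.default_rng(0)
    n_in, n_hid = 4, 8
    params = [rng.standard_normal(s) * 0.1
              for s in [(n_hid, n_in), (n_hid, n_hid)] * 3]   # Wz, Uz, Wr, Ur, Wh, Uh
    h = np.zeros(n_hid)
    for t in range(10):
        h = gru_step(rng.standard_normal(n_in), h, *params)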

46
RNN Tricks
•  Delays and multiple time scales, El Hihi & Bengio NIPS 1996

[Diagram: unfolded RNN with both one-step recurrent connections (W1) and longer skip connections across several time steps (W3)]
47
Backprop in Practice

Other tricks: see Deep Learning book (in preparation, online)

48
The Convergence of Gradient Descent

"  Batch Gradient
"  There is an optimal learning rate
"  Equal to the inverse of the 2nd derivative
Let's Look at a Single Linear Unit

"  Single unit, 2 inputs (X1, X2), weights W1, W2, bias W0
"  Quadratic loss: E(W) = 1/p ∑_p (Y^p – W·X^p)²
"  Dataset: classification: Y = -1 for blue, +1 for red
"  Hessian is the covariance matrix of the input vectors: H = 1/p ∑_p X^p X^pᵀ
"  To avoid ill conditioning: normalize the inputs
"  Zero mean
"  Unit variance for all variables
Convergence is Slow When Hessian has Different Eigenvalues

"  Batch Gradient, small learning rate  vs.  Batch Gradient, large learning rate

Convergence is Slow When Hessian has Different Eigenvalues

"  Batch Gradient, small learning rate  vs.  Stochastic Gradient: much faster
"  But fluctuates near the minimum
Multilayer Nets Have Non-Convex Objective Functions

"  1-1-1 network: Y = W1*W2*X
"  Trained to compute the identity function with quadratic loss
"  Single sample X=1, Y=1:  L(W) = (1 - W1*W2)^2
"  Solutions: W2 = 1/W1, a hyperbola

[Plot: the loss surface over (W1, W2) — two valleys of solutions along the hyperbola, separated by a saddle point at the origin]
Deep Nets with ReLUs and Max Pooling

"  Stack of linear transforms interspersed with Max operators
"  Point-wise ReLUs
"  Max Pooling: "switches" from one layer to the next
"  Input-output function:
"  Sum over active paths
"  Product of all weights along the path
"  Solutions are hyperbolas
"  Objective function is full of saddle points

[Diagram: active paths through the stack, with weights such as W31,22, W22,14, W14,3 along a path from input Z3]
A Myth Has Been Debunked: Local Minima in Neural Nets
!  Convexity is not needed
•  (Pascanu, Dauphin, Ganguli, Bengio, arXiv May 2014): On the saddle point problem for non-convex optimization
•  (Dauphin, Pascanu, Gulcehre, Cho, Ganguli, Bengio, NIPS'2014): Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
•  (Choromanska, Henaff, Mathieu, Ben Arous & LeCun, AISTATS'2015): The Loss Surface of Multilayer Nets
55
Saddle Points

•  Local minima dominate in low-D, but saddle points dominate in high-D
•  Most local minima are close to the bottom (global minimum error)
56
Saddle Points During Training
•  Oscillating between two behaviors:
•  Slowly approaching a saddle point
•  Escaping it

57
Low Index Critical Points

Choromanska et al & LeCun 2014, 'The Loss Surface of Multilayer Nets'

Shows that deep rectifier nets are analogous to spherical spin-glass models
The low-index critical points of large models concentrate in a band just above the global minimum

58
Piecewise Linear Nonlinearity

•  Jarrett, Kavukcuoglu, Ranzato & LeCun ICCV 2009: absolute value rectification works better than tanh in the lower layers of a convnet

•  Nair & Hinton ICML 2010: duplicating sigmoid units with the same weights but different biases in an RBM approximates a rectified linear unit (ReLU)

•  Glorot, Bordes and Bengio AISTATS 2011: using a rectifier non-linearity (ReLU, f(x) = max(0, x)) instead of tanh or softplus (f(x) = log(1 + exp(x))) allows for the first time to train very deep supervised networks without the need for unsupervised pre-training; was biologically motivated (leaky integrate-and-fire model)

•  Krizhevsky, Sutskever & Hinton NIPS 2012: rectifiers one of the crucial ingredients in the ImageNet breakthrough
Stochastic Neurons as Regularizer:
Improving neural networks by preventing co-adaptation of feature detectors (Hinton et al 2012, arXiv)
•  Dropout trick: during training multiply each neuron output by a random bit (p=0.5); at test time multiply by 0.5 (a minimal sketch follows below)
•  Used in deep supervised networks
•  Similar to a denoising auto-encoder, but corrupting every layer
•  Works better with some non-linearities (rectifiers, maxout) (Goodfellow et al. ICML 2013)
•  Equivalent to averaging over exponentially many architectures
•  Used by Krizhevsky et al to break through ImageNet SOTA
•  Also improves SOTA on CIFAR-10 (18 → 16% err)
•  Knowledge-free MNIST with DBMs (0.95 → 0.79% err)
•  TIMIT phoneme classification (22.7 → 19.7% err)
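A minimal numpy sketch of the trick described above (illustrative only; the 0.5 test-time scaling matches the first bullet):

    import numpy as np
    rng = np.random.default_rng(0)

    def dropout(h, p_keep=0.5, train=True):
        """Multiply activations by random bits at training time, by p_keep at test time."""
        if train:
            mask = rng.random(h.shape) < p_keep   # random bit per unit (p = 0.5)
            return h * mask
        return h * p_keep                          # test time: scale by 0.5

    h = np.maximum(0.0, rng.standard_normal(10))   # some hidden-layer activations
    print(dropout(h, train=True))
    print(dropout(h, train=False))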
60
Dropout Regularizer: Super-Efficient Bagging

[Diagram: the exponentially many sub-networks obtained by dropping units, all sharing the same weights]
61
Batch Normalization
(Ioffe & Szegedy ICML 2015)

•  Standardize activations (before the nonlinearity) across the minibatch
•  Backprop through this operation
•  Regularizes & helps to train

For each feature k, with minibatch mean x̄_k = (1/m) Σ_i x_{i,k} and variance σ_k² = (1/m) Σ_i (x_{i,k} − x̄_k)²:

    x̂_k = (x_k − x̄_k) / sqrt(σ_k² + ε),        BN(x_k) = γ_k x̂_k + β_k

where m is the size of the minibatch and ε is a small positive constant to improve numerical stability. Standardizing the intermediate activations alone would reduce the representational power of the layer, so batch normalization introduces the additional learnable parameters γ and β, which respectively scale and shift the data; by setting γ_k to sqrt(σ_k² + ε) and β_k to x̄_k, the network can recover the original representation. A standard feedforward layer y = φ(Wx + b) becomes y = φ(BN(Wx)); the bias vector can be removed, since BN already subtracts the mean.

62
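A numpy sketch of the training-time forward computation above (illustrative, not the paper's reference code):

    import numpy as np

    def batch_norm_forward(x, gamma, beta, eps=1e-5):
        """x: (m, n_features) pre-nonlinearity activations for a minibatch of size m."""
        mean = x.mean(axis=0)                       # x̄_k
        var = x.var(axis=0)                         # σ_k²
        x_hat = (x - mean) / np.sqrt(var + eps)     # standardized activations
        return gamma * x_hat + beta                 # learnable scale γ and shift β

    rng = np.random.default_rng(0)
    x = rng.standard_normal((32, 10)) * 3.0 + 7.0
    y = batch_norm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
    print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ≈ 0 and ≈ 1 per feature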
Early Stopping

•  Beautiful FREE LUNCH (no need to launch many different training runs for each value of the #iterations hyper-parameter)

•  Monitor validation error during training (after visiting a number of training examples equal to a multiple of the validation set size)

•  Keep track of the parameters with the best validation error and report them at the end

•  If the error does not improve enough (with some patience), stop.  A skeleton of the procedure follows.
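A Python skeleton of the loop; train_for_a_while and validation_error are hypothetical callables standing in for your own training and evaluation code:

    import copy

    def early_stopping(model, train_for_a_while, validation_error, patience=10):
        """Train until validation error stops improving; return the best parameters seen."""
        best_err, best_model, waited = float("inf"), copy.deepcopy(model), 0
        while waited < patience:
            train_for_a_while(model)          # e.g. visit a multiple of the validation set size
            err = validation_error(model)     # monitor validation error
            if err < best_err:
                best_err, best_model, waited = err, copy.deepcopy(model), 0
            else:
                waited += 1                   # no improvement: spend some patience
        return best_model, best_err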

63
Random Sampling of Hyperparameters
(Bergstra & Bengio 2012)
•  Common approach: manual + grid search
•  Grid search over hyperparameters: simple & wasteful
•  Random search: simple & efficient
•  Independently sample each HP, e.g. l.rate ~ exp(U[log(.1), log(.0001)])
•  Each training trial is iid
•  If a HP is irrelevant, grid search is wasteful
•  More convenient: ok to early-stop, continue further, etc.
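A numpy version of that sampling scheme; the learning-rate range is the one from the bullet above, the other hyperparameters are made-up examples:

    import numpy as np
    rng = np.random.default_rng(0)

    def sample_hyperparameters():
        return {
            # learning rate ~ exp(U[log(.0001), log(.1)]), i.e. log-uniform
            "learning_rate": float(np.exp(rng.uniform(np.log(1e-4), np.log(1e-1)))),
            "n_hidden": int(rng.integers(100, 2000)),        # illustrative extra HP
            "dropout_keep": float(rng.uniform(0.5, 1.0)),    # illustrative extra HP
        }

    trials = [sample_hyperparameters() for _ in range(20)]   # each trial is iid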

64
Sequential Model-Based Optimization of Hyper-Parameters

•  (Hutter et al JAIR 2009; Bergstra et al NIPS 2011; Thornton et al arXiv 2012; Snoek et al NIPS 2012)
•  Iterate:
•  Estimate P(valid. err | hyper-params config x, D)
•  Choose an optimistic x, e.g. max_x P(valid. err < current min. err | x)
•  Train with config x, observe valid. err. v, D ← D ∪ {(x, v)}
65
Distributed Training
•  Minibatches
•  Large minibatches + 2nd order & natural gradient methods
•  Asynchronous SGD (Bengio et al 2003, Le et al ICML 2012, Dean et al NIPS 2012)
•  Data parallelism vs model parallelism
•  Bottleneck: sharing weights/updates among nodes, so that node-models do not drift too far from each other
•  EASGD (Zhang et al NIPS 2015) works well in practice
•  Efficiently exploiting more than a few GPUs remains a challenge

66
Vision

(switch laptops)

67
Speech Recognition

68
The dramatic impact of Deep Learning on Speech Recognition
(according to Microsoft)

[Plot: word error rate on Switchboard (log scale, from 100% down to 1%) vs. year (1990–2010), with the recent sharp drop annotated "Using DL"]

69
Speech Recognition with Convolutional Nets (NYU/IBM)
Y LeCun

"M   ultilingual recognizer


"  Multiscale input
"  Large context window
Speech Recognition with Convolutional Nets (NYU/IBM)
Y LeCun

"   coustic Model: ConvNet with 7 layers. 54.4 million parameters.


A
"  Classifies acoustic signal into 3000 context-dependent subphones categories
"  ReLU units + dropout for last layers
"  Trained on GPU. 4 days of training
Speech Recognition with Convolutional Nets (NYU/IBM)
Y LeCun

"  Training samples.


"  40 MEL-frequency Cepstral Coefficients
"  Window: 40 frames, 10ms each
Speech Recognition with Convolutional Nets (NYU/IBM)
Y LeCun

"  Convolution Kernels at Layer 1:


"  64 kernels of size 9x9
End-to-End Training with Search

•  Hybrid systems, neural nets + HMMs (Bengio 1991, Bottou 1991)
•  Neural net outputs scores for each arc; recognized output = labels along the best path; trained discriminatively (LeCun et al 1998)
•  Connectionist Temporal Classification (Graves 2006)
•  DeepSpeech and attention-based end-to-end RNNs (Hannun et al 2014; Graves & Jaitly 2014; Chorowski et al NIPS 2015)

[Figure: graph composition of a recognition graph (arc scores for characters such as "c" 0.4, "a" 0.2, "t" 0.8, "u", "p" 0.2) with a grammar graph, yielding an interpretation graph with interpretations such as cut (2.0), cap (0.8), cat (1.4)]
74
Natural Language
Representations

75
Neural Language Models: fighting one exponential by another one!

•  (Bengio et al NIPS'2000)

[Figure: the neural language model architecture — each context word w(t−n+1) … w(t−1) is mapped through a shared look-up table / matrix C to its embedding C(w(t−i)); the concatenated embeddings feed a tanh hidden layer and a softmax output layer (where most computation happens) that produces the i-th output P(w(t) = i | context)]

Exponentially large set of generalizations: semantically close sequences
Exponentially large set of possible contexts
76
Neural word embeddings: visualization
directions = Learned Attributes

77
Analogical Representations for Free
(Mikolov et al, ICLR 2013)

•  Semantic relations appear as linear relationships in the space of learned representations
•  King – Queen ≈ Man – Woman
•  Paris – France + Italy ≈ Rome

[Figure: France→Paris and Italy→Rome connected by nearly parallel vectors]
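A small numpy sketch of querying such analogies by vector arithmetic; the embeddings here are random stand-ins for shape only (a trained model such as word2vec would be needed to actually retrieve "rome"):

    import numpy as np
    rng = np.random.default_rng(0)

    # toy stand-in for learned embeddings: {word: vector}
    emb = {w: rng.standard_normal(50) for w in
           ["king", "queen", "man", "woman", "paris", "france", "italy", "rome"]}

    def analogy(a, b, c, emb):
        """Return words ranked by cosine similarity to v(a) - v(b) + v(c)."""
        q = emb[a] - emb[b] + emb[c]
        def cos(u, v):
            return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
        return sorted(((cos(q, v), w) for w, v in emb.items() if w not in (a, b, c)),
                      reverse=True)

    print(analogy("paris", "france", "italy", emb)[:3])   # ideally "rome" ranks first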

78
Handling Large Output Spaces

•  Sampling "negative" examples: increase the score of the correct word and stochastically decrease all the others
•  Uniform sampling (Collobert & Weston, ICML 2008)
•  Importance sampling (Bengio & Senecal AISTATS 2003; Dauphin et al ICML 2011); GPU-friendly implementation (Jean et al ACL 2015)

•  Decompose output probabilities hierarchically (Morin & Bengio 2005; Blitzer et al 2005; Mnih & Hinton 2007, 2009; Mikolov et al 2011)

[Diagram: a tree over the vocabulary — first predict the category, then the word within the category]
79
Encoder-Decoder Framework
•  Intermediate representation of meaning = 'universal representation'
•  Encoder: from word sequence to sentence representation
•  Decoder: from representation to word sequence distribution

[Diagram: for bitext data, a French encoder feeds an English decoder (French sentence → English sentence); for unilingual data, an English encoder feeds an English decoder (English sentence → English sentence)]

(Cho et al EMNLP 2014; Sutskever et al NIPS 2014)


80
Attention Mechanism for Deep Learning

•  Consider an input (or intermediate) sequence or image
•  Consider an upper-level representation, which can choose « where to look » by assigning a weight or probability to each input position, as produced by an MLP applied at each position (see the sketch below)

•  Softmax over lower locations conditioned on context at lower and higher locations
•  Soft attention (backprop) vs
•  Stochastic hard attention (RL)

(Bahdanau, Cho & Bengio, arXiv Sept. 2014), following up on (Graves 2013) and (Larochelle & Hinton NIPS 2010)
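A numpy sketch of the soft version: a small MLP scores each position given the higher-level context, a softmax turns the scores into weights, and the weighted sum of the lower-level vectors is returned (all names and sizes are illustrative):

    import numpy as np
    rng = np.random.default_rng(0)

    def softmax(a):
        e = np.exp(a - a.max())
        return e / e.sum()

    def soft_attention(annotations, context, W_a, W_c, v):
        """annotations: (T, d) lower-level vectors; context: (d_c,) higher-level state."""
        scores = np.array([v @ np.tanh(W_a @ h + W_c @ context) for h in annotations])
        alpha = softmax(scores)                 # attention weights over input positions
        return alpha @ annotations, alpha       # weighted summary + where it looked

    T, d, d_c, d_att = 6, 8, 5, 7
    annotations = rng.standard_normal((T, d))
    context = rng.standard_normal(d_c)
    W_a = rng.standard_normal((d_att, d))
    W_c = rng.standard_normal((d_att, d_c))
    v = rng.standard_normal(d_att)
    summary, alpha = soft_attention(annotations, context, W_a, W_c, v)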

End-to-End Machine Translation with Recurrent Nets and Attention Mechanism
(Bahdanau et al 2014, Jean et al 2014, Gulcehre et al 2015, Jean et al 2015)

How far can we go with a very large target vocabulary? (1)
•  Reached the state-of-the-art in one year, from scratch

(a) English→French (WMT-14)
         NMT(A)   Google
NMT      32.68    30.6⋆
+Cand    33.28    –
+UNK     33.99    32.7◦
+Ens     36.71    36.9◦
P-SMT    37.03•

(b) English→German (WMT-15)              (c) English→Czech (WMT-15)
Model   Note                             Model   Note
24.8    Neural MT                        18.3    Neural MT
24.0    U.Edinburgh, Syntactic SMT       18.2    JHU, SMT+LM+OSM+Sparse
23.6    LIMSI/KIT                        17.6    CU, Phrase SMT
22.8    U.Edinburgh, Phrase SMT          17.4    U.Edinburgh, Phrase SMT
22.7    KIT, Phrase SMT                  16.1    U.Edinburgh, Syntactic SMT
82
IWSLT 2015 – Luong & Manning (2015)
TED talk MT, English-German

BLEU (cased):  Stanford 30.85, Karlsruhe 26.18, Edinburgh 26.02, Heidelberg 24.96, PJAIT 22.51, Baseline 20.08
HTER (HE set): Stanford 16.16, Edinburgh 21.84, Karlsruhe 22.67, Heidelberg 23.42, PJAIT 28.18  (-26% for the neural system vs. the next best)
83
Image-to-Text: Caption Generation with Attention
(Xu et al, ICML 2015)

Following many papers on caption generation, including (Kiros et al 2014; Mao et al 2014; Vinyals et al 2014; Donahue et al 2014; Karpathy & Li 2014; Fang et al 2014)

[Diagram: a convolutional neural network produces annotation vectors h_j; an attention mechanism computes weights a_j (Σ_j a_j = 1) over them, conditioned on the recurrent state z_i; the recurrent network samples the caption word by word, e.g. "a man is jumping into a lake ."]
84
Paying
Attention to
Selected Parts
of the Image
While Uttering
Words

85
The Good

86
And the Bad

87
But How can Neural Nets Remember Things?

"  Recurrent networks cannot remember things for very long
"  The cortex only remembers things for 20 seconds
"  We need a "hippocampus" (a separate memory module)
"  LSTM [Hochreiter 1997], registers
"  Memory networks [Weston et al 2014] (FAIR), associative memory
"  NTM [Graves et al. 2014], "tape"

[Diagram: recurrent net coupled to a memory via an attention mechanism]
Memory Networks Enable REASONING

"  Add a short-term memory to a network  http://arxiv.org/abs/1410.3916
"  Results on Question Answering Task (Weston, Chopra, Bordes 2014)
End-to-End Memory Network

"  [Sukhbaatar, Szlam, Weston, Fergus NIPS 2015, arXiv:1503.08895]
"  Weakly-supervised MemNN: no need to tell which memory location to use.
Stack-Augmented RNN: learning “algorithmic” sequences
Y
LeCun
"  [Joulin & Mikolov, ArXiv:1503.01007]
Sparse Access Memory for Long-Term Dependencies
•  A mental state stored in an external memory can stay for arbitrarily long durations, until evoked for read or write
•  Forgetting = vanishing gradient
•  Memory = larger state, reducing the need for forgetting/vanishing

[Diagram: passive copy in memory vs. active access]
92
How do humans generalize from very few examples?
•  They transfer knowledge from previous learning:
•  Representations
•  Explanatory factors

•  Previous learning from: unlabeled data + labels for other tasks

•  Prior: shared underlying explanatory factors, in particular between P(x) and P(Y|x)
93

Unsupervised and Transfer Learning Challenge + Transfer Learning Challenge: Won by Unsupervised Deep Learning

NIPS'2011 Transfer Learning Challenge — paper: ICML'2012
ICML'2011 workshop on Unsup. & Transfer Learning

[Plots: the evaluation-set embedding for the raw data and for representations learned with 1, 2, 3 and 4 layers]
Multi-Task Learning

•  Generalizing better to new tasks (tens of thousands!) is crucial to approach AI
•  Example: speech recognition, sharing across multiple languages
•  Deep architectures learn good intermediate representations that can be shared across tasks (Collobert & Weston ICML 2008, Bengio et al AISTATS 2011)
•  Good representations that disentangle underlying factors of variation make sense for many tasks because each task concerns a subset of the factors
•  E.g. a dictionary, with intermediate concepts re-used across many definitions

[Diagram: tasks A, B, C with outputs y1, y2, y3 sharing intermediate layers computed from the raw input x]

Prior: shared underlying explanatory factors between tasks
95
Google Image Search
Joint Embedding: different object types represented in the same space

Google: S. Bengio, J. Weston & N. Usunier (IJCAI 2011, NIPS'2010, JMLR 2010, ML J 2010)

WSABIE objective function:

Combining Multiple Sources of Evidence with Shared Representations

•  Traditional ML: data = matrix
•  Relational learning: multiple sources, different tuples of variables
•  Share representations of the same types across data sources
•  Shared learned representations help propagate information among data sources: e.g., WordNet, XWN, Wikipedia, FreeBase, ImageNet… (Bordes et al AISTATS 2012, ML J. 2013)
•  FACTS = DATA
•  Deduction = Generalization

[Diagram: relations such as P(person, url, event) and P(url, words, history) sharing the learned representations of their common argument types]
97
Multi-Task / Multimodal Learning with Different Inputs for Different Tasks

E.g. speaker adaptation, multimodal input…
Unsupervised multimodal case: (Srivastava & Salakhutdinov NIPS 2012)

[Diagram: inputs X1, X2, X3 with modality-specific encoders h1, h2, h3 feeding a shared representation and output Y via a selection switch]
98
Maps Between Representations

•  x and y represent different modalities, e.g., image, text, sound…
•  Encoders map them into representation spaces: h_x = f_x(x), h_y = f_y(y)
•  Can provide 0-shot generalization to new categories (values of y)

[Diagram: x-space and y-space with (x, y) pairs in the training set; the encoder functions f_x and f_y, the relationship between embedded points within one of the domains, and the maps between representation spaces let a test point x_test be mapped to y_test]
99
Unsupervised Representation
Learning

100
Why Unsupervised Learning?

•  Recent progress mostly in supervised DL
•  Real challenges for unsupervised DL
•  Potential benefits:
•  Exploit tons of unlabeled data
•  Answer new questions about the variables observed
•  Regularizer – transfer learning – domain adaptation
•  Easier optimization (divide and conquer)
•  Joint (structured) outputs

101
Why Latent Factors & Unsupervised Representation Learning? Because of Causality.
On causal and anticausal learning, (Janzing et al ICML 2012)

•  If the Ys of interest are among the causal factors of X, then

    P(Y | X) = P(X | Y) P(Y) / P(X)

is tied to P(X) and P(X|Y), and P(X) is defined in terms of P(X|Y), i.e.
•  The best possible model of X (unsupervised learning) MUST involve Y as a latent factor, implicitly or explicitly.
•  Representation learning SEEKS the latent variables H that explain the variations of X, making it likely to also uncover Y.

102
If Y is a Cause of X, Semi-Supervised Learning Works

•  Just observing the x-density reveals the causes y (cluster ID)
•  After learning p(x) as a mixture, a single labeled example per class suffices to learn p(y|x)

[Plot: a 1-D mixture model p(x) with three well-separated components labeled y=1, y=2, y=3]

103
Invariance & Disentangling Underlying Factors

•  Invariant features
•  Which invariances?
•  Alternative: learning to disentangle factors, i.e. keep all the explanatory factors in the representation
•  Good disentangling → avoid the curse of dimensionality
•  Emerges from representation learning (Goodfellow et al. 2009, Glorot et al. 2011)
104
Boltzmann Machines / Undirected Graphical Models

•  Boltzmann machines: (Hinton 84)
•  Iterative sampling scheme = stochastic relaxation, Monte-Carlo Markov chain
•  Training requires sampling: might take a lot of time to converge if there are well-separated modes
Restricted Boltzmann Machine (RBM)
(Smolensky 1986, Hinton et al 2006)

•  A building block (single-layer) for deep architectures
•  Bipartite undirected graphical model

[Diagram: observed units x and hidden units h; block Gibbs sampling alternates h ~ P(h|x) and x ~ P(x|h)]
Capturing the Shape of the Distribution: Positive & Negative Samples

Boltzmann machines, undirected graphical models, RBMs, energy-based models:

    Pr(x) = e^(-Energy(x)) / Z

•  Observed (+) examples push the energy down
•  Generated / dream / fantasy (-) samples / particles push the energy up
Eight Strategies to Shape the Energy Function
"   1. build the machine so that the volume of low energy stuff is constant
"  PCA, K-means, GMM, square ICA
"   2. push down of the energy of data points, push up everywhere else
"  Max likelihood (needs tractable partition function)
"   3. push down of the energy of data points, push up on chosen locations
"   contrastive divergence, Ratio Matching, Noise Contrastive Estimation,
Minimum Probability Flow
"   4. minimize the gradient and maximize the curvature around data points
"  score matching
"   5. train a dynamical system so that the dynamics goes to the manifold
" denoising auto-encoder, diffusion inversion (nonequilibrium dynamics)
"   6. use a regularizer that limits the volume of space that has low energy
"  Sparse coding, sparse auto-encoder, PSD
"   7. if E(Y) = ||Y - G(Y)||^2, make G(Y) as "constant" as possible.
"  Contracting auto-encoder, saturating auto-encoder
"   8. Adversarial training: generator tries to fool real/synthetic classifier.
Auto-Encoders

•  Iterative sampling / undirected models: RBM, denoising auto-encoder
•  Ancestral sampling / directed models: Helmholtz machine, VAE, etc. (Hinton et al 1995)

[Diagram: input x → Encoder f, Q(h|x) → code h, prior P(h) → Decoder g, P(x|h) → reconstruction r]

Probabilistic reconstruction criterion:
Reconstruction log-likelihood = - log P(x | h)

Denoising auto-encoder: during training, the input is corrupted stochastically, and the auto-encoder must learn to guess the distribution of the missing information.
109
Predictive Sparse Decomposition (PSD)
[Kavukcuoglu, Ranzato, LeCun, rejected by every conference, 2008-2009]

"  Train a "simple" feed-forward function to predict the result of a complex optimization on the data points of interest

[Diagram: generative model — input Y, latent variable Z, decoder with reconstruction distance; a fast feed-forward encoder is trained to predict Z from Y]
1. Find optimal Zi for all Yi;  2. Train Encoder to predict Zi from Yi

Energy = reconstruction_error + code_prediction_error + code_sparsity

Probabilistic Interpretation of Auto-Encoders

•  Manifold & probabilistic interpretations of auto-encoders
•  Denoising Score Matching as inductive principle (Vincent 2011)
•  Estimating the gradient of the energy function (Alain & Bengio ICLR 2013)
•  Sampling via Markov chain (Bengio et al NIPS 2013; Sohl-Dickstein et al ICML 2015)
•  Variational auto-encoders (Kingma & Welling ICLR 2014) (Gregor et al arXiv 2015)
111
Denoising Auto-Encoder
•  Learns a vector field pointing towards higher-probability directions (Alain & Bengio 2013):

    reconstruction(x) − x ∝ σ² ∂ log p(x) / ∂x

•  Prior: examples concentrate near a lower-dimensional "manifold"; corrupted inputs are mapped back towards it
•  Some DAEs correspond to a kind of Gaussian RBM with regularized Score Matching (Vincent 2011) [equivalent when noise → 0]
Regularized Auto-Encoders Learn a
Vector Field that Estimates a
Gradient Field (Alain & Bengio ICLR 2013)

113
Denoising Auto-Encoder Markov Chain

[Diagram: X_t → corrupt → X̃_t → denoise → X_{t+1} → corrupt → X̃_{t+1} → …]

The corrupt-encode-decode-sample Markov chain associated with a DAE samples from a consistent estimator of the data generating distribution
114
Preference for Locally Constant Features

•  Denoising or contractive auto-encoder on 1-D input:

    r(x) ≈ x − σ² ∂E(x)/∂x

    E[||r(x + σz) − x||²] ≈ E[||r(x) − x||²] + σ² ||∂r(x)/∂x||²_F

[Plot: energy E(x) with local minima at the training examples x1, x2, x3]
115
Helmholtz Machines (Hinton et al 1995) and Variational Auto-Encoders (VAEs)
(Kingma & Welling 2013, ICLR 2014)
(Gregor et al ICML 2014; Rezende et al ICML 2014)
(Mnih & Gregor ICML 2014; Kingma et al, NIPS 2014)

•  Parametric approximate inference
•  Encoder = inference, Q(h|x); Decoder = generator, P(x|h)
•  Successors of the Helmholtz machine (Hinton et al '95)
•  Maximize a variational lower bound on the log-likelihood: min KL(Q(x,h) || P(x,h)) where Q(x) = data distribution, or equivalently

    max Σ_x Σ_h Q(h|x) log [P(x,h) / Q(h|x)]  =  max Σ_x { Σ_h Q(h|x) log P(x|h)  −  KL(Q(h|x) || P(h)) }

[Diagram: stack of latent layers h1, h2, h3 with inference distributions Q(h1|x), Q(h2|h1), Q(h3|h2) and generative distributions P(h3), P(h2|h3), P(h1|h2), P(x|h1)]
116
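A numpy sketch of the negative bound for a single example, with a diagonal Gaussian Q(h|x) via the reparameterization trick and a Bernoulli decoder; the "networks" are single made-up weight matrices, purely to show the two terms:

    import numpy as np
    rng = np.random.default_rng(0)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def vae_negative_elbo(x, W_mu, W_logvar, W_dec):
        """-(E_Q[log P(x|h)] - KL(Q(h|x) || P(h))) with Gaussian Q and N(0, I) prior."""
        mu, logvar = W_mu @ x, W_logvar @ x                              # encoder Q(h|x)
        h = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)    # reparameterize
        p = sigmoid(W_dec @ h)                                           # decoder P(x|h)
        rec = -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))           # -log P(x|h)
        kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)       # KL to N(0, I)
        return rec + kl

    d, n_h = 12, 4
    x = (rng.random(d) > 0.5).astype(float)
    W_mu, W_logvar, W_dec = (rng.standard_normal(s) * 0.1
                             for s in [(n_h, d), (n_h, d), (d, n_h)])
    print(vae_negative_elbo(x, W_mu, W_logvar, W_dec))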
Geometric Interpretation
•  Encoder: map the input to a new space where the data has a simpler distribution
•  Add noise between the encoder output and the decoder input: train the decoder to be robust to mismatch between the encoder output and the prior output.

[Diagram: a contractive encoder f maps x to f(x), where Q(h|x) is matched to the prior P(h); the decoder g maps back to x]
117
DRAW: Sequential Variational Auto-Encoder with Attention, For Image Generation
(Gregor et al of Google DeepMind, arXiv 1502.04623, 2015)

•  Even for a static input, the encoder and decoder are now recurrent nets, which gradually add elements to the answer, and use an attention mechanism to choose where to do so.

[Figure from the paper: left, a conventional variational auto-encoder, where during generation a sample z is drawn from a prior P(z) and decoded in one step; right, the DRAW architecture, where recurrent encoder and decoder exchange read/write operations over a canvas c_1 … c_T, with latents z_1:T sampled at each step]
DRAW Samples of SVHN Images: generated samples vs. training nearest neighbor

[Figure: generated SVHN samples; the nearest training example is shown for the last column of samples]
119
GAN: Generative Adversarial Networks
(Goodfellow et al NIPS 2014)

Adversarial nets framework:

[Diagram: a random vector feeds the Generator Network, which produces a fake image; a random index selects a real training image; the Discriminator Network is trained to tell the two apart (output 1 for real, 0 for fake)]
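A numpy sketch of the two losses in this framework, given discriminator probabilities on real and generated batches (only the objective computation; the networks and their gradient updates are omitted, and the generator loss uses the common non-saturating form):

    import numpy as np
    rng = np.random.default_rng(0)

    def gan_losses(d_real, d_fake, eps=1e-8):
        """d_real, d_fake: discriminator probabilities D(x) for real and generated samples."""
        # discriminator: push D(x) -> 1 on real data, D(G(z)) -> 0 on fakes
        d_loss = -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))
        # generator: push D(G(z)) -> 1, i.e. fool the discriminator
        g_loss = -np.mean(np.log(d_fake + eps))
        return d_loss, g_loss

    d_real = rng.uniform(0.6, 0.99, size=64)   # stand-in discriminator outputs
    d_fake = rng.uniform(0.01, 0.4, size=64)
    print(gan_losses(d_real, d_fake))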
LAPGAN: Laplacian Pyramid of Generative Adversarial Networks
(Denton + Chintala, et al 2015)   http://soumith.ch/eyescream/

[Diagram: a Laplacian pyramid of GANs, one generator per scale, each conditioned on the coarser scale]

LAPGAN: Visual Turing Test
(Denton + Chintala, et al 2015)
LAPGAN results
•  40% of samples mistaken by humans for real photos

•  Sharper images than max. lik. proxies (which min. KL(data|model)):
•  GAN objective = compromise between KL(data|model) and KL(model|data)

122
Convolutional GANs
(Radford et al, arXiv 1511.06434)

Strided convolutions, batch normalization, only convolutional layers, ReLU and leaky ReLU

[Figure caption from the paper: generated bedrooms after one training pass through the dataset; with a small learning rate and minibatch SGD, memorization of training examples in a single epoch is experimentally unlikely]
123
Space-Filling in Representation-Space
Deeper representations → abstractions → disentangling (Bengio et al ICML 2013)
Manifolds are expanded and flattened

[Figure: X-space (pixels) vs. H-space (representation); linear interpolation between the 3's and 9's manifolds in pixel space, at layer 1, and at layer 2]

GAN: Interpolating in Latent Space

If the model is good (unfolds the manifold), interpolating between latent values yields plausible images.

[Figure from the paper: interpolations between a series of 9 random points in Z yield smooth transitions between plausible images]

125
Supervised and Unsupervised in One Learning Rule?

"  Boltzmann Machines have all the right properties [Hinton 1831] [OK, OK 1983 ;-]
"  Sup & unsup, generative & discriminative in one simple/local learning rule
"  Feedback circuit reconstructs and propagates virtual hidden targets
"  But they don't really work (or at least they don't scale).
"  Problem: the feedforward path eliminates information
"  If the feedforward path is invariant, then
"  the reconstruction path is a one-to-many mapping
"  Usual solution: sampling. But I'm allergic.

[Diagram: a many-to-one feedforward path paired with a one-to-many reconstruction path, with costs comparing the input to its reconstruction and the predicted "what" to the actual "what"]

Deep Semi-Supervised Learning
•  Unlike unsupervised pre-training, modern approaches optimize jointly the supervised and unsupervised objective

•  Discriminative RBMs (Larochelle & Bengio, ICML 2008)
•  Semi-Supervised VAE (Kingma et al, NIPS 2014)
•  Ladder Network (Rasmus et al, NIPS 2015)

127
Semi-Supervised Learning with Ladder Network
(Rasmus et al, NIPS 2015)

•  Jointly trained stack of denoising auto-encoders with gated lateral connections and a semi-supervised objective
•  They also use Batch Normalization
•  Semi-supervised objective: C = −log P(ỹ = t(n) | x) for labeled examples, plus per-layer denoising costs Σ_l λ_l ||z(l) − ẑ(l)||²

[Figure from the paper: the Ladder network for L = 2 — a corrupted encoder (x → z̃(1) → z̃(2) → ỹ) and a clean encoder (x → z(1) → z(2) → y) sharing the mappings f(l), and a decoder (z̃(l) → ẑ(l) → x̂) built from denoising functions g(l) with lateral connections; noise N(0, σ²) is injected at every layer of the corrupted path]

•  1% error on PI-MNIST with 100 labeled examples (Pezeshki et al arXiv 1511.06430)

128
Stacked What-Where Auto-Encoder (SWWAE)
[Zhao, Mathieu, LeCun arXiv:1506.02351]

A bit like a ConvNet paired with a DeConvNet

[Diagram: the encoder produces the predicted output and "what"/"where" variables; the decoder uses them to produce a reconstruction of the input, with losses on both the prediction (vs. desired output) and the reconstruction]
Conclusions & Challenges

130
Learning « How the world ticks »
•  So long as our machine learning models « cheat » by relying only on surface statistical regularities, they remain vulnerable to out-of-distribution examples

•  Humans generalize better than other animals by implicitly having a more accurate internal model of the underlying causal relationships

•  This allows one to predict future situations (e.g., the effect of planned actions) that are far from anything seen before, an essential component of reasoning, intelligence and science

131
Learning Multiple Levels of Abstraction

•  The big payoff of deep learning is to allow learning higher levels of abstraction
•  Higher-level abstractions disentangle the factors of variation, which allows much easier generalization and transfer

132
Challenges & Open Problems
A More Scientific Approach is Needed, not Just Building Better Systems

•  Unsupervised learning
•  How to evaluate?
•  Long-term dependencies
•  Natural language understanding & reasoning
•  More robust optimization (or easier-to-train architectures)
•  Distributed training (that scales) & specialized hardware
•  Bridging the gap to biology
•  Deep reinforcement learning

133
