Deep Learning
NIPS’2015 Tutorial
Geoff Hinton, Yoshua Bengio & Yann LeCun
Breakthrough
Deep Learning: machine learning algorithms based on learning multiple levels of representation / abstraction.
Amazing improvements in error rate in object recognition, object detection, speech recognition, and more recently, in natural language processing / understanding
Machine Learning,
AI & No Free Lunch
• Four key ingredients for ML towards AI
1. Lots & lots of data
2. Very flexible models
3. Enough computing power
[Diagram: a multilayer neural network with input units, hidden units, and output units labeled person / cat / dog, after David Rumelhart]
Exponential advantage of distributed
representations
[Figure residue from an embedded paper figure: (a) object counts of the most informative objects for scene recognition; (b) counts of CNN units discovering each object class, e.g. billiard table (J=3.2%, AP=42.6%) and sofa (J=10.8%, AP=36.2%), where J = Jaccard segmentation index and AP = average precision-recall; (c) histogram of AP for all discovered object classes. Segmentations of images from the SUN database use pool5 of Places-CNN; many classes are encoded by several units, and 115 pool5 units detect no objects, suggesting incomplete learning or a complementary texture- or part-based representation. The histograms span object classes such as wall, window, chair, building, floor, tree, person, table, bed, sofa, car, and so on.]
Each feature can be discovered
without the need for seeing the
exponentially large number of
configurations of the other features
• Consider a network whose hidden units discover the following
features:
• Person wears glasses
• Person is female
• Person is a child
• Etc.
If each of n features requires O(k) parameters, need O(nk) examples
Non-parametric methods would require O(n^d) examples
Exponential advantage of distributed
representations
Deep Learning: Automating Feature Discovery
[Diagram comparing approaches: the simplest systems use a hand-designed program; classical systems use hand-designed features with a learned mapping from features to output; deep learning learns the features themselves as well as the mapping from features to output]
Theoretical arguments:
2 layers of logic gates / formal neurons / RBF units = universal approximator
• But the world has structure and we can get an exponential gain by exploiting some of it
Exponential advantage of depth
Backprop
(modular approach)
Typical Multilayer Neural Net Architecture
• Complex learning machines can be built by assembling modules into networks
• Linear Module: Out = W·In + B
• ReLU Module (Rectified Linear Unit): Out_i = 0 if In_i < 0, Out_i = In_i otherwise
• All major deep learning frameworks use modules (inspired by SN/Lush, 1991): Torch7, Theano, TensorFlow…
[Diagram: a stack computing the cost C(X,Y,Θ): Squared Distance / NegativeLogLikelihood on top, then LogSoftMax, Linear (W3,B3), ReLU, Linear (W2,B2), ReLU, Linear (W1,B1), with input X and label Y at the bottom]
Computing Gradients by Back-Propagation
• A practical application of the chain rule
• Backprop for the state gradients:
  dC/dX_{i-1} = dC/dX_i · dX_i/dX_{i-1}
  dC/dX_{i-1} = dC/dX_i · dF_i(X_{i-1}, W_i)/dX_{i-1}
• Torch7 example: gradtheta contains the gradient
[Diagram: a stack of modules F_1(X_0, W_1) … F_n(X_{n-1}, W_n) feeding the cost C(X,Y,Θ), with state gradients dC/dX_i and weight gradients dC/dW_n propagated back down; the example network is NegativeLogLikelihood, LogSoftMax, Linear (W2,B2), ReLU, Linear (W1,B1) over input X, label Y and parameters Θ]
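The slide's Torch7 example is not reproduced here; as a stand-in, the following is a minimal NumPy sketch (not the original code) of the module idea and of backprop as repeated application of the chain rule dC/dX_{i-1} = dC/dX_i · dF_i(X_{i-1}, W_i)/dX_{i-1}. The module sizes, cost and learning rate are illustrative assumptions.

```python
import numpy as np

class Linear:
    """Out = W.In + B, with dC/dIn = W^T dC/dOut and dC/dW = dC/dOut In^T."""
    def __init__(self, n_in, n_out, rng=np.random.default_rng(0)):
        self.W = rng.normal(0, 0.1, (n_out, n_in))
        self.B = np.zeros(n_out)
    def forward(self, x):
        self.x = x                                 # cache input for backward
        return self.W @ x + self.B
    def backward(self, dC_dout, lr=0.01):
        dC_din = self.W.T @ dC_dout                # state gradient dC/dX_{i-1}
        self.W -= lr * np.outer(dC_dout, self.x)   # weight gradient step
        self.B -= lr * dC_dout
        return dC_din

class ReLU:
    """Out_i = In_i if In_i > 0 else 0."""
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask
    def backward(self, dC_dout, lr=None):
        return dC_dout * self.mask

# Assemble modules into a network, then backprop the cost gradient through them.
net = [Linear(4, 8), ReLU(), Linear(8, 3)]
x, target = np.ones(4), np.array([0.0, 1.0, 0.0])
h = x
for m in net:                    # forward pass
    h = m.forward(h)
grad = 2 * (h - target)          # dC/dOut for a squared-distance cost ||h - target||^2
for m in reversed(net):          # backward pass: chain rule, module by module
    grad = m.backward(grad)
```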
Module Classes
• Linear: Y = W·X ; dC/dX = Wᵀ·dC/dY ; dC/dW = dC/dY·Xᵀ
• Loss modules: hinge loss, ranking loss
• Nonlinearities: Tanh, logistic
• Specialized modules: multiple convolutions (1D, 2D, 3D), switches, Inception
Any Architecture works
Convolutional
Networks
Deep Learning = Training Multistage Machines
[Diagram: a trainable feature extractor followed by a trainable classifier]
[Diagram: a ConvNet. Input 1@32x32 → Layer 1: 6@28x28 (5x5 convolution) → Layer 2: 6@14x14 (2x2 pooling/subsampling) → Layer 3: 12@10x10 (5x5 convolution) → Layer 4: 12@5x5 (2x2 pooling/subsampling) → Layer 5: 100@1x1 (5x5 convolution) → Layer 6: 10 outputs]
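A quick sanity check of those feature-map sizes, as a sketch in plain Python; stride-1 "valid" convolutions and non-overlapping pooling are assumed, since the slide does not state them:

```python
def conv_out(size, kernel):          # valid convolution, stride 1
    return size - kernel + 1

def pool_out(size, window):          # non-overlapping pooling / subsampling
    return size // window

maps, size = 1, 32                   # input: 1@32x32
stages = [("5x5 convolution", conv_out, 5, 6),
          ("2x2 pooling",     pool_out, 2, 6),
          ("5x5 convolution", conv_out, 5, 12),
          ("2x2 pooling",     pool_out, 2, 12),
          ("5x5 convolution", conv_out, 5, 100)]

for name, fn, k, out_maps in stages:
    size, maps = fn(size, k), out_maps
    print(f"{name}: {maps}@{size}x{size}")
# Prints 6@28x28, 6@14x14, 12@10x10, 12@5x5, 100@1x1;
# a final layer then maps the 100 features to 10 outputs.
```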
Applying a ConvNet with a Sliding Window
• The ventral (recognition) pathway in the visual cortex has multiple stages
• Retina - LGN - V1 - V2 - V4 - PIT - AIT ....
Recurrent Neural Networks
• Selectively summarize an input sequence in a fixed-size state vector via a recursive update
[Diagram: state s updated by F_θ and unfolded through time, s_{t-1} → s_t → s_{t+1}, driven by inputs x_{t-1}, x_t, x_{t+1}]
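As a concrete reading of the recursive update, here is a minimal NumPy sketch; the tanh nonlinearity, the 16-dimensional state and the parameter shapes are illustrative assumptions, not taken from the slide:

```python
import numpy as np

def rnn_summarize(x_seq, W, U, b):
    """Recursive update s_t = F_theta(s_{t-1}, x_t) = tanh(W s_{t-1} + U x_t + b).
    Returns the final fixed-size state summarizing the whole sequence."""
    s = np.zeros(W.shape[0])
    for x_t in x_seq:                       # "unfolding" the recurrence over time
        s = np.tanh(W @ s + U @ x_t + b)
    return s

rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(16, 16)), rng.normal(size=(16, 8)), np.zeros(16)
x_seq = rng.normal(size=(20, 8))            # a sequence of 20 input vectors
state = rnn_summarize(x_seq, W, U, b)       # fixed-size (16-d) summary of the sequence
```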
Recurrent Neural Networks
• Can produce an output at each time step: unfolding the graph tells us how to back-prop through time.
[Diagram: unfolded RNN with shared weights U (input-to-state), W (state-to-state) and V (state-to-output), producing outputs o_{t-1}, o_t, o_{t+1} from inputs x_{t-1}, x_t, x_{t+1}]
Generative RNNs
[Diagram: unfolded generative RNN with per-step losses L_{t-1}, L_t, L_{t+1} on outputs o_{t-1}, o_t, o_{t+1}; each output is fed back as the next input x_t, x_{t+1}, x_{t+2}]
Maximum Likelihood = Teacher Forcing
• During training, the past y fed back as input comes from the training data
• At generation time, the past y fed back as input is generated by the model
• The mismatch can cause "compounding error"
[Diagram: ŷ_t ~ P(y_t | h_t); the hidden state h_t receives x_t and the previous y]
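A minimal sketch of that training/generation asymmetry; `step` and `sample` are hypothetical stand-ins for the model's transition and output-sampling functions, not anything defined on the slide:

```python
def generate(step, sample, h0, y0, T, targets=None):
    """Teacher forcing when `targets` is given (training), free-running otherwise.
    During training, the previous y fed back is the ground-truth symbol; at
    generation time it is the model's own sampled prediction, so errors can
    compound."""
    h, y_prev, outputs = h0, y0, []
    for t in range(T):
        h, probs = step(h, y_prev)              # h_t and P(y_t | h_t)
        y_hat = sample(probs)                   # ŷ_t ~ P(y_t | h_t)
        outputs.append(y_hat)
        y_prev = targets[t] if targets is not None else y_hat
    return outputs
```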
Ordinary RNNs can be made deeper: + deep in-to-hid, + deep hid-to-hid, + deep hid-to-out, + stacking

Storing bits robustly requires singular values < 1
• Problems:
  • singular values of Jacobians > 1 → gradients explode (hence gradient clipping)
  • or singular values < 1 → gradients shrink & vanish (Hochreiter 1991)
  • or random → variance grows exponentially
Gradient Norm Clipping
(Mikolov thesis 2012;
Pascanu, Mikolov, Bengio, ICML 2013)
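A minimal NumPy sketch of gradient norm clipping; the rescale-by-global-norm form shown here is one common variant, and the threshold is left to the caller:

```python
import numpy as np

def clip_gradient_norm(grads, threshold):
    """If the global gradient norm exceeds the threshold, rescale all gradients
    so the norm equals the threshold (as in Pascanu, Mikolov & Bengio, ICML 2013)."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads
```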
RNN Tricks
(Pascanu, Mikolov, Bengio, ICML 2013; Bengio, Boulanger & Pascanu, ICASSP 2013)
[Figure: error as a function of the parameters θ, illustrating the steep regions that motivate gradient clipping]
Gated Recurrent Units & LSTM
• LSTM (Hochreiter & Schmidhuber 1997): gating creates a self-loop whose Jacobian has an eigenvalue slightly less than 1, so the state can be preserved
• GRU: light-weight version (Cho et al 2014)
[Diagram: gated cell with input, input gate, forget gate, output gate, state and output, combined through × and + operations]
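For concreteness, a sketch of one GRU step in NumPy, using the standard update/reset-gate formulation; the parameter names (`Wz`, `Uz`, …) are illustrative, not from the slide:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h, x, p):
    """One GRU update: gates decide how much of the previous state to keep,
    giving a near-identity path (Jacobian eigenvalue close to 1) that lets
    gradients flow over long spans."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)              # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)              # reset gate
    h_tilde = np.tanh(p["W"] @ x + p["U"] @ (r * h))    # candidate state
    return (1 - z) * h + z * h_tilde                    # leaky mix of old and new state
```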
RNN Tricks
• Delays and multiple time scales, Elhihi & Bengio NIPS 1996
[Diagram: unfolded RNN with both one-step (W1) and longer skip (W3) recurrent connections between states s_{t-2}, s_{t-1}, s_t, s_{t+1}, with outputs o_t and inputs x_t]
Backprop in Practice
The Convergence of Gradient Descent
[Figure: loss surface with a saddle point between two solutions]
Deep Nets with ReLUs and Max Pooling
• Stack of linear transforms interspersed with Max operators
• Point-wise ReLUs
• Max Pooling: "switches" from one layer to the next
• Input-output function: sum over active paths, product of all weights along the path
• Solutions are hyperbolas
• Objective function is full of saddle points
[Diagram: a path through units 3, 14, 22, 31 with weights W14,3, W22,14, W31,22 and input Z3]
A Myth Has Been Debunked: Local Minima in Neural Nets
• Convexity is not needed
• (Pascanu, Dauphin, Ganguli, Bengio, arXiv May 2014): On the saddle point problem for non-convex optimization
• (Dauphin, Pascanu, Gulcehre, Cho, Ganguli, Bengio, NIPS'2014): Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
• (Choromanska, Henaff, Mathieu, Ben Arous & LeCun, AISTATS'2015): The Loss Surface of Multilayer Nets
Saddle Points
Saddle Points During Training
• Oscillating between two behaviors:
  • Slowly approaching a saddle point
  • Escaping it
Low Index Critical Points
Piecewise Linear Nonlinearity
• Nair & Hinton ICML 2010: Duplicating sigmoid units with the same weights but different biases in an RBM approximates a rectified linear unit (ReLU)
• Glorot, Bordes and Bengio AISTATS 2011: Using a rectifier non-linearity (ReLU) instead of tanh or softplus allows, for the first time, training very deep supervised networks without the need for unsupervised pre-training; was biologically motivated
• Krizhevsky, Sutskever & Hinton NIPS 2012: rectifiers one of the crucial ingredients in the ImageNet breakthrough
[Side notes: softplus f(x) = log(1 + exp(x)); rectifier f(x) = max(0, x); neuroscience motivation: leaky integrate-and-fire model]
Stochastic Neurons as Regularizer:
Improving neural networks by preventing co-adaptation of feature detectors (Hinton et al 2012, arXiv)
• Dropout trick: during training multiply neuron output by a random bit (p=0.5), during test by 0.5
• Used in deep supervised networks
• Similar to denoising auto-encoder, but corrupting every layer
• Works better with some non-linearities (rectifiers, maxout) (Goodfellow et al. ICML 2013)
• Equivalent to averaging over exponentially many architectures
• Used by Krizhevsky et al to break through ImageNet SOTA
• Also improves SOTA on CIFAR-10 (18 → 16% err)
• Knowledge-free MNIST with DBMs (.95 → .79% err)
• TIMIT phoneme classification (22.7 → 19.7% err)
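A minimal NumPy sketch of the dropout trick described above; p_keep = 0.5 matches the slide's setting:

```python
import numpy as np

def dropout(h, p_keep=0.5, train=True, rng=np.random.default_rng(0)):
    """At training time, multiply each unit by a random bit (kept with probability
    p_keep); at test time, multiply by p_keep instead, which approximates averaging
    over the exponentially many thinned architectures."""
    if train:
        return h * (rng.random(h.shape) < p_keep)
    return h * p_keep
```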
Dropout Regularizer: Super-Efficient
Bagging
Batch Normalization
(Ioffe & Szegedy ICML 2015)
• Standardize activations (before the nonlinearity) across the minibatch
• Backprop through this operation
• Regularizes & helps to train

For a feature k and a minibatch of size m (with a layer of the form y = φ(Wx + b)):
  x̄_k = (1/m) Σ_{i=1..m} x_{i,k},   σ_k² = (1/m) Σ_{i=1..m} (x_{i,k} − x̄_k)²,
  x̂_k = (x_k − x̄_k) / √(σ_k² + ε),   BN(x_k) = γ_k x̂_k + β_k,
where ε is a small positive constant to improve numerical stability. Standardizing the intermediate activations reduces the representational power of the layer; to account for this, batch normalization introduces additional learnable parameters γ and β, which respectively scale and shift the data. By setting γ_k to σ_k and β_k to x̄_k, the network can recover the original representation.
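A minimal NumPy sketch of the forward computation reconstructed above; the backward pass and the running statistics used at test time are omitted:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization over a minibatch x of shape (m, k): standardize each
    feature with the minibatch mean and variance, then rescale by the learnable
    gamma and shift by beta (Ioffe & Szegedy, 2015)."""
    mean = x.mean(axis=0)                  # x̄_k
    var = x.var(axis=0)                    # σ_k²
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta            # BN(x_k) = γ_k x̂_k + β_k
```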
Random Sampling of Hyperparameters
(Bergstra & Bengio 2012)
• Common approach: manual + grid search
• Grid search over hyperparameters: simple & wasteful
• Random search: simple & efficient
• Independently sample each HP, e.g. l.rate~exp(U[log(.1),log(.0001)])
• Each training trial is iid
• If a HP is irrelevant grid search is wasteful
• More convenient: ok to early-stop, continue further, etc.
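A sketch of random hyperparameter sampling in the spirit of the slide; only the log-uniform learning-rate range comes from the slide, the other hyperparameters and their ranges are made-up examples:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hyperparameters():
    """Independently sample each hyperparameter; e.g. the learning rate is drawn
    log-uniformly between 1e-4 and 1e-1, as on the slide."""
    return {
        "learning_rate": np.exp(rng.uniform(np.log(1e-4), np.log(1e-1))),
        "n_hidden": int(rng.integers(128, 2049)),       # assumed range, for illustration
        "dropout_keep": rng.uniform(0.5, 1.0),          # assumed range, for illustration
    }

trials = [sample_hyperparameters() for _ in range(20)]  # each training trial is i.i.d.
```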
Sequential Model-Based Optimization
of Hyper-Parameters
Distributed Training
• Minibatches
• Large minibatches + 2nd-order & natural gradient methods
• Asynchronous SGD (Bengio et al 2003, Le et al ICML 2012, Dean et al NIPS 2012)
• Data parallelism vs model parallelism
• Bottleneck: sharing weights/updates among nodes, to keep the nodes' models from drifting too far from each other
• EASGD (Zhang et al NIPS 2015) works well in practice
• Efficiently exploiting more than a few GPUs remains a challenge
Vision
(switch laptops)
Speech Recognition
The dramatic impact of Deep Learning on Speech Recognition
(according to Microsoft)
[Chart: word error rate on Switchboard, on a log scale from 100% down to 1%, plotted from 1990 to 2010; the curve drops sharply once deep learning is used]
Speech Recognition with Convolutional Nets (NYU/IBM)
• Hybrid systems: neural nets + graph composition; predicts labels along the best path; trained discriminatively (LeCun et al 1998)
• Connectionist Temporal Classification (… NIPS 2015)
[Diagram: graph composition ("match & add") over letter hypotheses such as "c", "a", "p", "t", "u", "r", "e" and "b", "u", "t", "e", with per-letter scores (e.g. 0.4, 0.2)]
Natural Language
Representations
Neural Language Models: fighting one exponential by another one!
[Diagram: neural language model with a softmax output layer over the vocabulary]
Analogical Representations for Free
(Mikolov et al, ICLR 2013)
[Figure: word-vector analogies; capitals such as Paris and Rome relate to their countries by a roughly constant vector offset]
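A toy sketch of how such analogies are read off word vectors, e.g. Paris − France + Italy ≈ Rome; the embedding dictionary and the cosine-similarity search are illustrative assumptions, not part of the slide:

```python
import numpy as np

def analogy(a, b, c, embeddings):
    """Return the word whose vector is closest (by cosine similarity) to
    v(b) - v(a) + v(c); e.g. analogy("France", "Paris", "Italy", E) ≈ "Rome".
    `embeddings` is a dict mapping words to vectors."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cos(embeddings[w], target))
```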
Handling Large Output Spaces
[Diagram: output categories; French and English encoders and decoders over a French sentence and an English sentence]
• Softmax over lower-level locations, conditioned on context at the lower and higher levels: the higher level attends to parts of the lower level
• Soft attention (backprop) vs stochastic hard attention (RL)
(Bahdanau, Cho & Bengio, arXiv Sept. 2014), following up on (Graves 2013) and (Larochelle & Hinton NIPS 2010)
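A minimal NumPy sketch of soft attention as described above; the bilinear scoring function is a simplification (Bahdanau et al. use a small MLP), and all sizes are illustrative:

```python
import numpy as np

def soft_attention(annotations, query, W):
    """Soft (differentiable) attention: score each lower-level location against the
    higher-level context, softmax over locations, and return the expected annotation
    vector. Hard attention would instead sample a single location."""
    scores = annotations @ (W @ query)          # one score per location
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over lower locations
    return weights @ annotations, weights       # weighted average of annotation vectors

rng = np.random.default_rng(0)
h = rng.normal(size=(30, 64))                   # 30 lower-level annotation vectors
q = rng.normal(size=32)                         # higher-level context vector
W = rng.normal(size=(64, 32))
context, alpha = soft_attention(h, q, W)
```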
End-to-End Machine Translation with
Recurrent Nets and Attention Mechanism
(Bahdanau et al 2014, Jean et al 2014, Gulcehre et al 2015, Jean et al 2015)
How far can we go with very large target vocabulary? (1)
• Reached the state-of-the-art in one year, from scratch
IWSLT 2015 – Luong & Manning (2015)
TED talk MT, English-German
[Bar charts: scores for Stanford, Edinburgh, Karlsruhe, Heidelberg, PJAIT and a baseline on two evaluation measures; the leading system improves on the baseline by roughly 26% relative]
Image-to-Text: Caption Generation with Attention
(Xu et al, ICML 2015)
• … on caption generation, including (Kiros et al 2014; Mao et al 2014; Vinyals et al 2014; Donahue et al 2014; …)
[Diagram: annotation vectors h_j from the image, attention weights a_j, an attention mechanism feeding a recurrent state z_i that emits words u_i; sample output: "a man is jumping into a lake ."]
Paying Attention to Selected Parts of the Image While Uttering Words
The Good
And the Bad
But How can Neural Nets Remember Things?
[Diagram: a recurrent net coupled to a separate memory through an attention mechanism]
Memory Networks Enable REASONING
Results on
Question Answering
Task
(Weston, Chopra,
Bordes 2014)
End-to-End Memory Network
" [ Sukhbataar, Szlam, Weston, Fergus NIPS 2015, ArXiv:1503.08895]
" Weakly-supervised MemNN: no need to tell which memory location to use.
Stack-Augmented RNN: learning “algorithmic” sequences
" [Joulin & Mikolov, ArXiv:1503.01007]
Sparse Access Memory for Long-Term Dependencies
• A mental state stored in an external memory can stay for arbitrarily long durations, until evoked for read or write
• Forgetting = vanishing gradient.
• Memory = larger state, reducing the need for forgetting/vanishing
[Diagram: passive copy vs. access]
How do humans generalize from very few examples?
• They transfer knowledge from previous learning:
  • Representations
  • Explanatory factors

Google: S. Bengio, J. Weston & N. Usunier (IJCAI 2011, NIPS'2010, JMLR 2010, ML J 2010)
[Diagram: shared representations h1, h2, h3 over inputs X1, X2, X3]
Maps Between Representations
• x and y represent different modalities, e.g., image, text, sound…
• Encoders: h_x = f_x(x), h_y = f_y(y)
• Can provide 0-shot generalization to new categories (values of y)
[Diagram: x-space and y-space with a test pair (x_test, y_test); legend: (x, y) pairs in the training set; x-representation (encoder) function f_x; y-representation (encoder) function f_y; relationship between embedded points within one of the domains; maps between representation spaces]
Unsupervised Representation
Learning
Why Unsupervised Learning?
Why Latent Factors & Unsupervised Representation Learning? Because of Causality.
On causal and anticausal learning, (Janzing et al ICML 2012)
[Plot: a density p(x) over x from 0 to 20]
Invariance & Disentangling Underlying Factors
• Invariant features
• Which invariances?
• Alternative: learning to disentangle factors, i.e. keep all the explanatory factors in the representation
• Good disentangling → avoid the curse of dimensionality
• Emerges from representation learning (Goodfellow et al. 2009, Glorot et al. 2011)
Boltzmann Machines / Undirected Graphical Models
• Boltzmann machines: (Hinton 84)
[Diagram: block Gibbs sampling, alternating h ~ P(h | x) and x ~ P(x | h)]
Capturing the Shape of the Distribution: Positive & Negative Samples
Boltzmann machines, undirected graphical models, RBMs, energy-based models:
  Pr(x) = e^(−Energy(x)) / Z
• Observed (+) examples push the energy down
• Generated / dream / fantasy (−) samples / particles push the energy up
[Diagram: energy curve with X+ pushed down and X− pushed up]
Eight Strategies to Shape the Energy Function
" 1. build the machine so that the volume of low energy stuff is constant
" PCA, K-means, GMM, square ICA
" 2. push down of the energy of data points, push up everywhere else
" Max likelihood (needs tractable partition function)
" 3. push down of the energy of data points, push up on chosen locations
" contrastive divergence, Ratio Matching, Noise Contrastive Estimation,
Minimum Probability Flow
" 4. minimize the gradient and maximize the curvature around data points
" score matching
" 5. train a dynamical system so that the dynamics goes to the manifold
" denoising auto-encoder, diffusion inversion (nonequilibrium dynamics)
" 6. use a regularizer that limits the volume of space that has low energy
" Sparse coding, sparse auto-encoder, PSD
" 7. if E(Y) = ||Y - G(Y)||^2, make G(Y) as "constant" as possible.
" Contracting auto-encoder, saturating auto-encoder
" 8. Adversarial training: generator tries to fool real/synthetic classifier.
• Iterative sampling / undirected models: RBM, denoising auto-encoder

Auto-Encoders
Predictive Sparse Decomposition (PSD)
[Kavukcuoglu, Ranzato, LeCun, rejected by every conference, 2008-2009]
• Train a "simple" feed-forward function to predict the result of a complex optimization on the data points of interest
[Diagram (factor graph): generative model (Factors A, B): a decoder maps the latent variable Z to a reconstruction compared with the input Y by a distance; fast feed-forward model (Factor A'): an encoder predicts Z from Y, also trained with a distance. Training: 1. find optimal Zi for all Yi; 2. train the encoder to predict Zi from Yi]
Denoising Auto-Encoder
• Learns a vector field pointing towards the higher-probability direction (Alain & Bengio 2013):
  reconstruction(x) − x → σ² ∂ log p(x) / ∂x
  (prior: examples concentrate near a lower-dimensional "manifold")
• Some DAEs correspond to a kind of Gaussian RBM with regularized Score Matching (Vincent 2011) [equivalent when noise → 0]
[Diagram: corrupted inputs are mapped back toward the data manifold]

Regularized Auto-Encoders Learn a Vector Field that Estimates a Gradient Field (Alain & Bengio ICLR 2013)
Denoising Auto-Encoder Markov Chain
[Diagram: X_t → corrupt → X̃_t → denoise → X_{t+1} → corrupt → X̃_{t+1} → denoise → X_{t+2} → …]
Preference for Locally Constant Features
[Plot: energy E(x)]
• Encoder = inference Q; Decoder = generator P
• Successors of the Helmholtz machine (Hinton et al '95)
• Maximize a variational lower bound on the log-likelihood: minimize KL(Q(x, h) || P(x, h)), where Q(x) is the data distribution, or equivalently
  max Σ_x Σ_h Q(h|x) log [P(x, h) / Q(h|x)] = max Σ_x [ Σ_h Q(h|x) log P(x|h) − KL(Q(h|x) || P(h)) ]
[Diagram: stack x, h1, h2, h3 with inference Q(h1|x), Q(h2|h1), Q(h3|h2) and generation P(x|h1), P(h1|h2), P(h2|h3)]
Geometric Interpretation
• Encoder: map the input to a new space where the data has a simpler distribution
• Add noise between the encoder output and the decoder input: train the decoder to be robust to the mismatch between the encoder output f(x) and the prior output
[Diagram: x mapped by a contractive encoder f to f(x) under Q(h|x), prior P(h), decoder g mapping back to x-space]
DRAW: Sequential Variational Auto-Encoder with Attention For Image Generation
(Gregor et al of Google DeepMind, arXiv 1502.04623, 2015)
• Even for a static input, the encoder and decoder are now recurrent networks, learning where to look; the attention model is fully differentiable, so it can be trained with standard backpropagation, unlike attention trained with reinforcement learning (Mnih et al., 2014), and it resembles the read and write operations of the Neural Turing Machine (Graves et al., 2014)
[Figures from the paper: a trained DRAW network generating MNIST digits, each row showing successive stages in the generation of a single digit; and a comparison of a conventional variational auto-encoder (during generation, a sample z is drawn from a prior P(z) and passed through a feed-forward decoder) with the DRAW architecture (recurrent encoder and decoder unrolled over time, with read and write operations and inference Q(z_t | x, z_{1:t−1}))]
DRAW Samples of SVHN Images: generated samples vs. training nearest neighbors
[Figure: generated SVHN samples; the nearest training example is shown for the last column of samples]
GAN: Generative Adversarial Networks
Goodfellow et al NIPS 2014
Adversarial nets framework
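A structural sketch of the adversarial game; `D`, `G` and the two update functions are hypothetical stand-ins (the gradient computation is abstracted away), so this only shows the shape of the objective from Goodfellow et al. (2014):

```python
import numpy as np

def gan_training_step(D, G, update_D, update_G, real_batch, noise_dim, rng):
    """One step of the adversarial game: the discriminator D learns to tell real
    from generated samples, the generator G learns to fool it."""
    z = rng.normal(size=(len(real_batch), noise_dim))
    fake_batch = G(z)
    # Discriminator: maximize log D(x) + log(1 - D(G(z)))  (minimize the negative)
    d_loss = -np.mean(np.log(D(real_batch)) + np.log(1.0 - D(fake_batch)))
    update_D(d_loss)
    # Generator: minimize log(1 - D(G(z)))  (in practice, often maximize log D(G(z)))
    g_loss = np.mean(np.log(1.0 - D(G(z))))
    update_G(g_loss)
    return d_loss, g_loss
```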
LAPGAN: Visual Turing Test
(Denton + Chintala, et al 2015)
LAPGAN results
• 40% of samples mistaken by humans for real photos
Convolutional GANs
(Radford et al, arXiv 1511.06343)
[Figure 2 from the paper: generated bedrooms after one training pass through the dataset. "Theoretically, the model could learn to memorize training examples, but this is experimentally unlikely as we train with a small learning rate and minibatch SGD. We are aware of no prior empirical evidence demonstrating memorization with SGD and a small learning rate in only one epoch."]
Space-Filling in Representation-Space
Deeper representations → abstractions → disentangling (Bengio et al ICML 2013)
Manifolds are expanded and flattened
[Diagram: X-space vs. H-space; the manifold of 9's; linear interpolation at layer 1 vs. in input space. Figure 4 (top rows): interpolation between a series of 9 random points in Z shows that the space …]
Supervised and Unsupervised in One Learning Rule?
" Boltzmann Machines have all the right properties [Hinton 1831] [OK, OK 1983 ;-]
" Sup & unsup, generative & discriminative in one simple/local learning rule
" Feedback circuit reconstructs and propagates virtual hidden targets
" But they don't really work (or at least they don't scale).
" Problem: the feedforward path eliminates information
" If the feedforward path is invariant, then
" the reconstruction path is a one-to-many mapping
" Usual solution: sampling. But I'm allergic.
Semi-supervised Learning with Ladder Network
(Rasmus et al, NIPS 2015)
• Jointly trained stack of denoising auto-encoders with gated lateral connections and a semi-supervised objective
• Semi-supervised objective = supervised cross-entropy plus layer-wise denoising costs:
  C = −log P(ỹ = t(n) | x(n)) + Σ_{l=1..L} λ_l ||z(l) − ẑ_BN(l)||²
[Algorithm and diagram residue from the paper: a corrupted encoder x → z̃(1) → z̃(2) → ỹ (batch-normalized pre-activations with added noise at every layer), a clean encoder providing the denoising targets z(l), and a decoder z̃(l) → ẑ(l) → x̂ built from denoising functions ẑ_i(l) = g(z̃_i(l), u_i(l)) that combine the lateral corrupted signal with the batch-normalized top-down signal u(l) = batchnorm(V(l) ẑ(l+1))]
Stacked What-Where Auto-Encoder (SWWAE)
[Zhao, Mathieu, LeCun arXiv:1506.02351]
[Diagram: the SWWAE loss combines the discrepancy between predicted output and desired output with a reconstruction term from input to reconstruction]
Conclusions & Challenges
Learning « How the world ticks »
• So long as our machine learning models « cheat » by relying only on surface statistical regularities, they remain vulnerable to out-of-distribution examples
Learning Multiple Levels of
Abstraction
Challenges & Open Problems
A More Scientific Approach is Needed, not Just Building Better Systems
• Unsupervised learning
  • How to evaluate?
• Long-term dependencies
• Natural language understanding & reasoning
• More robust optimization (or easier to train architectures)
• Distributed training (that scales) & specialized hardware
• Bridging the gap to biology
• Deep reinforcement learning