What's Wrong With Deep Learning?
Yann LeCun
Facebook AI Research &
Center for Data Science, NYU
yann@cs.nyu.edu
http://yann.lecun.com
Plan
[Figure: feature extractor → trainable classifier; multiple stages of convolutions ("simple cells") and pooling/subsampling ("complex cells")]
The ventral (recognition) pathway in the visual cortex has multiple stages
Retina - LGN - V1 - V2 - V4 - PIT - AIT ....
Lots of intermediate representations
It's deep if it has more than one stage of non-linear feature transformation
Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013]
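To make "more than one stage of non-linear feature transformation" concrete, here is a minimal PyTorch sketch, with two convolution ("simple cells") + pooling ("complex cells") stages feeding a trainable classifier; sizes and layer choices are illustrative, not the slide's network:

```python
import torch
import torch.nn as nn

# Minimal sketch of multiple stages of non-linear feature transformation:
# convolutions ("simple cells") followed by pooling ("complex cells"),
# repeated, then a trainable classifier on top.
feature_extractor = nn.Sequential(
    nn.Conv2d(3, 16, 5), nn.ReLU(), nn.MaxPool2d(2),   # stage 1
    nn.Conv2d(16, 32, 5), nn.ReLU(), nn.MaxPool2d(2),  # stage 2
)
classifier = nn.Sequential(nn.Flatten(), nn.LazyLinear(10))

logits = classifier(feature_extractor(torch.randn(1, 3, 32, 32)))
```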
Early Networks [LeCun 85, 86]
[Figure: early architectures: single layer FC, two layers FC, locally connected, shared weights, shared weights with pooling]
Integrating Segmentation
Word-level training
No labeling of individual characters
How do we do the training?
We need a “deformable part model”
[Figure: multiple ConvNet classifiers slide over the word, with a different window width (e.g. 5, 4, 3 characters) for each classifier]
“Deformable Part Model” on top of a ConvNet
[Driancourt, Bottou 1991]
[Figure: word-level training architecture: input sequence (acoustic vectors) → trainable feature extractors → sequence of feature vectors → object models (elastic templates) producing energies, selected through a switch; warping is a latent variable, category is the output; trained with the LVQ2 loss]
Word-level training with elastic word models
Energy function is linear in the parameters:

$E(X, Y, Z) = \sum_i W_i^\top h_i(X, Y, Z)$

[Figure: trellis over the sequence of feature vectors; object models define the warping/path (latent variable). Factor graph: features $h(X,Y,Z)$ with parameters $W_1, W_2, W_3$; outputs $Y_1 \dots Y_4$; latent vars $Z_1 \dots Z_3$; input $X$]

with the NLL loss: Conditional Random Field [Lafferty, McCallum, Pereira 2001]
with the hinge loss: Max-Margin Markov Nets and Latent SVM [Taskar, Altun, Hofmann...]
with the perceptron loss: Structured Perceptron [Collins...]
(a toy sketch of this setup follows below)
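A toy numpy sketch of this linear-energy setup with a structured-perceptron update; the feature function, label/latent sets and sizes are all made up for illustration:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
LABELS, LATENTS = [0, 1, 2], [0, 1]

def h(X, Y, Z):
    # toy feature function; real systems use rich structured features
    return np.array([X[Y], X[Y] * (Z + 1), float(Y == Z)])

W = rng.normal(size=3)

def energy(X, Y, Z):
    return W @ h(X, Y, Z)          # E(X,Y,Z) = sum_i W_i h_i(X,Y,Z)

def infer(X, labels=LABELS):
    # exhaustive argmin over (Y, Z); real systems use dynamic programming
    return min(itertools.product(labels, LATENTS),
               key=lambda yz: energy(X, *yz))

# One structured-perceptron step: push down the energy of the correct
# answer, push up the energy of the current (wrong) minimum.
X, Y_true = rng.normal(size=3), 1
Y_hat, Z_hat = infer(X)
if Y_hat != Y_true:
    _, Z_true = infer(X, labels=[Y_true])   # best latent for the true label
    W -= 0.1 * (h(X, Y_true, Z_true) - h(X, Y_hat, Z_hat))
```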
Deep Structured Prediction
Energy function is no longer linear in the parameters:

$E(X, Y, Z) = \sum_i g_i(X, Y, Z, W_i)$

Graph Transformer Networks [LeCun, Bottou, Bengio, Haffner 1997-98]:
– NLL loss
– Perceptron loss

[Figure: same factor graph, with each factor $g(X, Y, Z, W)$ computed by a ConvNet; outputs $Y_1 \dots Y_4$; latent vars $Z_1 \dots Z_3$; input $X$]
Graph Transformer Networks
Structured Prediction
on top of Deep Learning
Object Detection
Face Detection [Vaillant et al. 93, 94]
[Garcia & Delakis 2003][Osadchy et al. 2004] [Osadchy et al, JMLR 2007]
Simultaneous face detection and pose estimation
VIDEOS
Semantic Segmentation
ConvNets for Biological Image Segmentation
[Figure: ConvNet architecture: YUV input band, 20-36 pixels tall and 36-500 pixels wide (3@36x484) → convolutions (7x6) → 20@30x484 → ... → 20@30x125 → convolutions (6x5) → ... → 100@25x121; each output classifier sees a 3x12x25 input window]
Scene Parsing/Labeling: Multiscale ConvNet Architecture
Each output sees a large input context:
46x46 window at full rez; 92x92 at ½ rez; 184x184 at ¼ rez
[7x7conv]->[2x2pool]->[7x7conv]->[2x2pool]->[7x7conv]->
Trained supervised on fully-labeled images
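A hedged PyTorch sketch of the multiscale scheme (the trunk below is illustrative, not the slide's exact layer sizes): the same ConvNet is applied at full, 1/2 and 1/4 resolution, and the per-scale feature maps are upsampled back to a common grid and concatenated, so each output pixel sees all three contexts:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One trunk, shared weights across scales.
trunk = nn.Sequential(
    nn.Conv2d(3, 16, 7, padding=3), nn.Tanh(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 7, padding=3), nn.Tanh(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 7, padding=3), nn.Tanh(),
)

def multiscale_features(img):
    maps = []
    for scale in (1.0, 0.5, 0.25):
        x = img if scale == 1.0 else F.interpolate(
            img, scale_factor=scale, mode='bilinear', align_corners=False)
        f = trunk(x)
        if maps:  # upsample coarser scales to the full-resolution grid
            f = F.interpolate(f, size=maps[0].shape[-2:],
                              mode='bilinear', align_corners=False)
        maps.append(f)
    return torch.cat(maps, dim=1)  # each pixel sees 3 contexts at once

feats = multiscale_features(torch.randn(1, 3, 184, 184))
```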
Method 1: majority over super-pixel regions
[Figure: input image → multi-scale ConvNet → convolutional classifier; majority vote over superpixel boundaries, so categories are aligned with region boundaries]
Barcelona dataset
[Tighe 2010]:
170 categories.
No post-processing
Frame-by-frame
ConvNet runs at 50 ms/frame on Virtex-6 FPGA hardware
But communicating the features over Ethernet limits system performance
Then, two things happened...
The ImageNet dataset [Fei-Fei et al. 2012]
1.2 million training samples
1000 categories (example labels from the figure: matchstick, sea lion, flute, strawberry, bathing cap, backpack, racket)

Fast Graphics Processing Units (GPUs)
Capable of 1 trillion operations/second
Very Deep ConvNet for Object Recognition
[Figure: layer-1 kernels (11x11)]
• C3D Architecture
– 8 convolution, 5 pooling, and 2 fully-connected layers (see the sketch after these bullets)
– 3x3x3 convolution kernels
– 2x2x2 pooling kernels
• Dataset: Sports-1M [Karpathy et al. CVPR’14]
– 1.1M videos of 487 different sport categories
– Train/test splits are provided
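A PyTorch sketch matching these bullets; the channel widths are an assumption for illustration, not stated on the slide:

```python
import torch
import torch.nn as nn

# C3D-style net: 8 conv layers (3x3x3 kernels), 5 pooling layers (2x2x2,
# except the first, which keeps temporal resolution), 2 fully-connected layers.
def conv(cin, cout):
    return nn.Sequential(nn.Conv3d(cin, cout, 3, padding=1), nn.ReLU())

c3d = nn.Sequential(
    conv(3, 64), nn.MaxPool3d((1, 2, 2)),
    conv(64, 128), nn.MaxPool3d(2),
    conv(128, 256), conv(256, 256), nn.MaxPool3d(2),
    conv(256, 512), conv(512, 512), nn.MaxPool3d(2),
    conv(512, 512), conv(512, 512), nn.MaxPool3d(2),
    nn.Flatten(),
    nn.LazyLinear(4096), nn.ReLU(),   # two fully-connected layers
    nn.Linear(4096, 487),             # 487 Sports-1M classes
)

scores = c3d(torch.randn(1, 3, 16, 112, 112))  # 16 RGB frames of 112x112
```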
Now, What's Wrong with Deep Learning?
$L(W) = \sum_P C_P(X, Y, W) \prod_{(ij) \in P} W_{ij}$

Piecewise polynomial in W with random coefficients
A lot is known about the distribution of critical points of polynomials on the sphere with random (Gaussian) coefficients [Ben Arous et al.]
– High-order spherical spin glasses
– Random matrix theory

[Figure: multilayer network with weights along a path, e.g. $W_{31,22}, W_{22,14}, W_{14,3}$, down to input component $Z_3$]
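A toy numpy check of the path-sum view, using a small *linear* network for simplicity (an assumption; with ReLUs each path gains a data-dependent 0/1 gate, which is what makes the loss piecewise polynomial):

```python
import itertools
import numpy as np

# For a linear net the output is literally a polynomial in the weights:
# a sum over input-to-output paths of the product of the weights on each path.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(1, 4))  # 3 -> 4 -> 1 net
x = rng.normal(size=3)

forward = (W2 @ (W1 @ x))[0]
path_sum = sum(x[i] * W1[j, i] * W2[0, j]
               for i, j in itertools.product(range(3), range(4)))
assert np.isclose(forward, path_sum)
```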
Missing: Reasoning
Reasoning as Energy Minimization (structured prediction++)
SPATIAL MODEL

[Figure: graphical model over body-part locations (f, s, e, w — e.g. face, shoulder, elbow, wrist) with pairwise compatibility functions; the joint distribution $P(f, s, e, w)$ is a product of compatibility terms $\phi(x_i, x_j)$ over part pairs, normalized by $1/Z$, with latent/hidden variables for unobserved parts]
[Figure: spatial-model filtering: each part's unary distribution is combined with conditionals from other parts, e.g. terms c(f|f), b(f|f), c(f|s), b(f|s)]
SPATIAL MODEL: RESULTS
[Plots: pose-estimation accuracy on FLIC(1) (elbow, wrist) and LSP(2) (arms, legs)]
(1) B. Sapp and B. Taskar. MODEC: Multimodal decomposition models for human pose estimation. CVPR'13
(2)S. Johnson and M. Everingham. Learning Effective Human Pose Estimation for Inaccurate Annotation. CVPR’11
Missing: Memory
In Natural Language Processing: Word Embedding
[Figure: a neural net of some kind (ConvNet or recurrent net) maps the question to a sentence vector; a Freebase entity is detected in the question and embedded in the same space (example entity labels from the figure: Ocean's 11, Actor, Male, ER, Lexington)]
Question-Answering System
Recurrent net memory
Results on Question-Answering Task
Missing:
Unsupervised Learning
Energy-Based Unsupervised Learning
1. build the machine so that the volume of low energy stuff is constant
PCA, K-means, GMM, square ICA
2. push down of the energy of data points, push up everywhere else
Max likelihood (needs tractable partition function)
3. push down of the energy of data points, push up on chosen locations
contrastive divergence, Ratio Matching, Noise Contrastive Estimation,
Minimum Probability Flow
4. minimize the gradient and maximize the curvature around data points
score matching
5. train a dynamical system so that the dynamics go to the manifold
denoising auto-encoder (see the sketch after this list)
6. use a regularizer that limits the volume of space that has low energy
Sparse coding, sparse auto-encoder, PSD
7. if E(Y) = ||Y - G(Y)||^2, make G(Y) as "constant" as possible.
Contracting auto-encoder, saturating auto-encoder
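As an example of strategy 5, a minimal denoising auto-encoder sketch in PyTorch; the sizes and noise level are illustrative:

```python
import torch
import torch.nn as nn

# Corrupt each data point, then train the system to map it back to the
# clean point, so the learned dynamics flow toward the data manifold.
enc = nn.Sequential(nn.Linear(784, 128), nn.ReLU())
dec = nn.Linear(128, 784)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

def train_step(y):                            # y: batch of clean data points
    y_noisy = y + 0.3 * torch.randn_like(y)   # push points off the manifold
    loss = ((dec(enc(y_noisy)) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

loss = train_step(torch.rand(32, 784))        # e.g. MNIST-like vectors
```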
#1: constant volume of low energy
Energy surface for PCA and K-means
1. build the machine so that the volume of low energy stuff is constant
PCA, K-means, GMM, square ICA...
PCA: $E(Y) = \|W^\top W Y - Y\|^2$
K-Means (Z constrained to a 1-of-K code): $E(Y) = \min_Z \sum_i \|Y - W_i Z_i\|^2$
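A small numpy check of these two energies, with random W and Y as stand-ins for trained parameters and a data point:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=5)

# PCA: W has K orthonormal rows, so E(Y) is the squared residual outside
# the principal subspace; the volume of low-energy space is fixed by K.
W = np.linalg.qr(rng.normal(size=(5, 2)))[0].T      # 2 orthonormal rows
e_pca = np.sum((W.T @ (W @ Y) - Y) ** 2)

# K-Means: Z is a 1-of-K code, so minimizing over Z picks the nearest of
# the K prototypes (columns of W_km).
W_km = rng.normal(size=(5, 3))                      # 3 prototypes
e_kmeans = min(np.sum((Y - W_km[:, k]) ** 2) for k in range(3))
```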
#6. use a regularizer that limits
the volume of space that has low energy
[Figure: generative model: INPUT Y compared (Distance, Factor A) to the Decoder's reconstruction from the LATENT VARIABLE Z; Factor B regularizes Z]
Sparse Modeling: Sparse Coding + Dictionary Learning
$E(Y^i, Z) = \|Y^i - W_d Z\|^2 + \lambda \sum_j |z_j|$

[Figure: energy diagram: input Y compared to the reconstruction $W_d Z$ through $\|Y^i - \tilde{Y}\|^2$, plus the L1 sparsity $\sum_j |z_j|$ on the code Z]

Inference: $Y \rightarrow \hat{Z} = \arg\min_Z E(Y, Z)$ (see the ISTA sketch below)
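Inference can be carried out by ISTA (iterative soft-thresholding); a minimal numpy sketch with illustrative sizes:

```python
import numpy as np

# ISTA for Z_hat = argmin_Z ||Y - Wd Z||^2 + lam * sum|z_j|.
rng = np.random.default_rng(0)
Wd = rng.normal(size=(64, 256))            # overcomplete dictionary
Wd /= np.linalg.norm(Wd, axis=0)           # unit-norm basis functions
Y = rng.normal(size=64)

def shrink(v, t):                          # soft-thresholding operator sh()
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

lam = 0.1
L = np.linalg.norm(Wd, 2) ** 2             # Lipschitz constant of the gradient
Z = np.zeros(256)
for _ in range(100):                       # gradient step on the fit term,
    Z = shrink(Z - Wd.T @ (Wd @ Z - Y) / L, lam / L)   # then shrinkage
```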
#6. use a regularizer that limits
the volume of space that has low energy
[Figure: fast feed-forward model: an Encoder predicts the LATENT VARIABLE Z from the INPUT Y, with a Distance term (Factor A'); Factor B regularizes Z]
Encoder-Decoder Architecture
[Kavukcuoglu, Ranzato, LeCun, rejected by every conference, 2008-2009]
Train a “simple” feed-forward function to predict the result of a complex optimization on the data points of interest
[Figure: the generative model (Decoder + Distance, Factors A and B) paired with the fast feed-forward model (Encoder + Distance, Factor A') that predicts the latent variable Z from the input Y]
Learning to Perform Approximate Inference:
Predictive Sparse Decomposition / Sparse Auto-Encoders
Sparse auto-encoder: Predictive Sparse Decomposition (PSD)
[Kavukcuoglu, Ranzato, LeCun, 2008 → arXiv:1010.3467],
Predict the optimal code with a trained encoder
Energy = reconstruction_error + code_prediction_error + code_sparsity
$E(Y^i, Z) = \|Y^i - W_d Z\|^2 + \|Z - g_e(W_e, Y^i)\|^2 + \lambda \sum_j |z_j|$

$g_e(W_e, Y^i) = \mathrm{shrinkage}(W_e Y^i)$

[Figure: INPUT Y → FEATURES Z; decoder reconstruction $\|Y^i - \tilde{Y}\|^2$ via $W_d Z$, code prediction error $\|Z - \tilde{Z}\|^2$ via $g_e(W_e, Y)$, and sparsity $\sum_j |z_j|$]
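A hedged PyTorch sketch of one PSD training step on this energy; the dimensions, step sizes and the few ISTA iterations standing in for full sparse-coding inference of Z* are all illustrative:

```python
import torch
import torch.nn.functional as F

n_in, n_code, lam = 64, 256, 0.1
Wd = torch.randn(n_in, n_code, requires_grad=True)
We = torch.randn(n_code, n_in, requires_grad=True)
opt = torch.optim.SGD([Wd, We], lr=0.01)

def g_e(y):                                  # g_e(We, y) = shrinkage(We y)
    return F.softshrink(We @ y, lam)

y = torch.randn(n_in)
with torch.no_grad():                        # crude stand-in for argmin_Z E
    z = g_e(y)
    for _ in range(20):
        z = F.softshrink(z - 0.01 * (Wd.t() @ (Wd @ z - y)), lam)

# reconstruction + code prediction + sparsity, then a gradient step on W
energy = ((Wd @ z - y) ** 2).sum() \
    + ((z - g_e(y)) ** 2).sum() \
    + lam * z.abs().sum()
opt.zero_grad(); energy.backward(); opt.step()
```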
Regularized Encoder-Decoder Model (Auto-Encoder) for Unsupervised Feature Learning
[Figure: the same regularized auto-encoder, with the code prediction term $\|Z - \tilde{Z}\|^2$ from $g_e(W_e, Y^i)$ and an L2 norm within each pool as the sparsity term]
PSD: Basis Functions on MNIST
Learning to Perform Approximate Inference:
LISTA
Better Idea: Give the “right” structure to the encoder
[Figure: flow graph: INPUT Y → $W_e$ → (+) → sh() → Z, with a lateral-inhibition matrix S feeding Z back into the sum]

ISTA/FISTA reparameterized: $Z(k+1) = \mathrm{sh}(W_e Y + S\, Z(k))$
LISTA (Learned ISTA): learn the We and S matrices to get fast solutions
[Gregor & LeCun, ICML 2010], [Bronstein et al. ICML 2012], [Rolfe & LeCun ICLR 2013]
LISTA: Train We and S matrices
to give a good approximation quickly
Think of the FISTA flow graph as a recurrent neural net where We and S are
trainable parameters
Time-Unfold the flow graph for K iterations
Learn the We and S matrices with “backprop-through-time”
Get the best approximate solution within K iterations
[Figure: time-unfolded graph: Y → $W_e$ → (+) → sh() → S → (+) → sh() → S → ... → Z]
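A minimal PyTorch sketch of the unrolled encoder; the sizes, K and the shrinkage threshold are illustrative:

```python
import torch
import torch.nn as nn

# LISTA: the ISTA recurrence unfolded for K steps, with We and S as
# ordinary trainable parameters.
class LISTA(nn.Module):
    def __init__(self, n_in, n_code, K=3, lam=0.1):
        super().__init__()
        self.We = nn.Linear(n_in, n_code, bias=False)   # input filters
        self.S = nn.Linear(n_code, n_code, bias=False)  # lateral inhibition
        self.sh = nn.Softshrink(lam)
        self.K = K

    def forward(self, y):
        b = self.We(y)                  # filter the input once
        z = self.sh(b)
        for _ in range(self.K):         # K unrolled iterations, trained by
            z = self.sh(b + self.S(z))  # backprop-through-time
        return z

# typically trained to regress the exact sparse codes produced by ISTA/FISTA
z = LISTA(64, 256)(torch.randn(8, 64))
```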
Learning ISTA (LISTA) vs ISTA/FISTA
[Plots: reconstruction error vs. number of iterations for ISTA/FISTA vs. LISTA, and with the smallest code elements removed]
[Figure: recurrent sparse auto-encoder architecture: encoding filters $W_e$, lateral inhibition S (the recurrent block can be repeated), decoding filters $W_d$ reconstructing $\bar{X}$, classification filters $W_c$ producing $\bar{Y}$, L1 penalty on the code $\bar{Z}$]
Image = prototype + sparse sum of “parts” (to move around the manifold)
Convolutional Sparse Coding
Convolutional sparse coding: $Y = \sum_k W_k * Z_k$
“deconvolutional networks” [Zeiler, Taylor, Fergus CVPR 2010]
Convolutional PSD: Encoder with a soft sh() Function
Convolutional Formulation
Extend sparse coding from PATCH to IMAGE
Filters and Basis Functions obtained with 1, 2, 4, 8, 16, 32, and 64 filters.
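A one-line check of the convolutional reconstruction $Y = \sum_k W_k * Z_k$ as a single multi-channel convolution; sizes are illustrative (and strictly, PyTorch's conv2d computes cross-correlation, i.e. convolution with flipped kernels, which is immaterial for the sketch):

```python
import torch
import torch.nn.functional as F

K, H, W = 16, 64, 64
Z = torch.randn(1, K, H, W)          # K feature maps, one per filter
Wk = torch.randn(1, K, 7, 7)         # K basis functions
Y = F.conv2d(Z, Wk, padding=3)       # sums the K filtered maps -> 1xHxW image
```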
Using PSD to Train a Hierarchy of Features

[Figure sequence: greedy layer-wise training. (1) Train a first-stage PSD on the input Y: reconstruction $\|Y^i - \tilde{Y}\|^2$ through $W_d Z$, code prediction through $g_e(W_e, Y^i)$, sparsity $\lambda \sum_j |z_j|$. (2) Keep only the encoder and use its output as FEATURES. (3) Train a second-stage PSD on those features. (4) Keep only its encoder. (5) Train a supervised classifier on top of the final FEATURES]
Pedestrian Detection: INRIA Dataset. Miss rate vs. false positives
[Plot legend: ConvNet B&W, supervised; ConvNet B&W, unsup+sup; ConvNet Color+Skip, supervised; ConvNet Color+Skip, unsup+sup]
Unsupervised Learning:
Invariant Features
Learning Invariant Features with L2 Group Sparsity
$E(Y^i, Z) = \|Y^i - \tilde{Y}\|^2 + \|Z - g_e(W_e, Y^i)\|^2 + \lambda \sum_{\text{pools}} \sqrt{\textstyle\sum_{k \in \text{pool}} Z_k^2}$

[Figure: INPUT Y → FEATURES Z, reconstructed through $W_d Z$; the sparsity penalty is an L2 norm within each pool]
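The group term in isolation, as a PyTorch sketch; the pool size and layout (64 disjoint pools of 4 units) are illustrative:

```python
import torch

Z = torch.randn(256)
pools = Z.view(64, 4)
penalty = pools.pow(2).sum(dim=1).sqrt().sum()  # sum over pools of ||pool||_2
# Minimizing this drives whole pools to zero while letting units inside an
# active pool trade off freely -- the source of the invariance.
```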
Learning Invariant Features with L2 Group Sparsity
Idea: features are pooled in groups.
Sparsity: sum over groups of L2 norm of activity in group.
[Hyvärinen Hoyer 2001]: “subspace ICA”
decoder only, square
[Welling, Hinton, Osindero NIPS 2002]: pooled product of experts
encoder only, overcomplete, log student-T penalty on L2 pooling
[Kavukcuoglu, Ranzato, Fergus LeCun, CVPR 2010]: Invariant PSD
encoder-decoder (like PSD), overcomplete, L2 pooling
[Le et al. NIPS 2011]: Reconstruction ICA
Same as [Kavukcuoglu 2010] with linear encoder and tied decoder
[Gregor & LeCun arXiv:1006:0448, 2010] [Le et al. ICML 2012]
Locally-connected, non-shared (tiled) encoder-decoder
[Figure: INPUT Y → encoder only (PoE, ICA), decoder only, or encoder-decoder (iPSD, RICA) → SIMPLE FEATURES Z → L2 norm within each pool $\sqrt{\sum Z_k^2}$ → INVARIANT FEATURES]
Groups are local in a 2D Topographic Map
[Figure: input → encoder → 2D topographic map of features]
Image-level training, local filters but no weight sharing
Training on 115x115 images. Kernels are 15x15 (not shared across space!)
Topographic Maps
[K. Obermayer and G.G. Blasdel, Journal of Neuroscience, Vol. 13, 4114-4129 (Monkey)]
Each edge in the tree indicates a zero in the S matrix (no mutual inhibition)
Sij is larger if two neurons are far away in the tree
Invariant Features via Lateral Inhibition: Topographic Maps
[Filter figures: supervised filters (CIFAR10); sparse conv. auto-encoder; slow & sparse conv. auto-encoder trained on YouTube videos]
[Goroshin et al. arXiv:1412.6056]
Invariant Features through Temporal Constancy
[Figure: temporal architecture: an encoder with shared weights $W_1$ maps frames $S_t, S_{t-1}, S_{t-2}$ (input) to codes; a decoder ($W_2$, $\tilde{W}_2$) predicts the input frames]
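A hedged sketch of a temporal-constancy (slowness) term of this flavor; `encode`/`decode` stand for trained modules, and the weights alpha and lam are assumptions:

```python
import torch

# Codes of consecutive video frames should change slowly, on top of the
# usual reconstruction and sparsity terms.
def temporal_loss(frames, encode, decode, alpha=0.5, lam=0.1):
    codes = [encode(f) for f in frames]        # e.g. S_t, S_{t-1}, S_{t-2}
    recon = sum(((decode(z) - f) ** 2).mean()
                for z, f in zip(codes, frames))
    slow = sum(((z1 - z0) ** 2).mean()         # penalize fast code changes
               for z0, z1 in zip(codes, codes[1:]))
    sparse = sum(z.abs().mean() for z in codes)
    return recon + alpha * slow + lam * sparse
```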
Low-Level Filters Connected to Each Complex Cell
[Filter figures: complex-cell pools C1 ("where") and C2 ("what")]
Integrated Supervised &
Unsupervised Learning
The End
Lily is a swan.
Lily is white.
Greg is a swan.
What color is Greg? A:white
John is hungry.
John goes to the kitchen.
John eats the apple.
Daniel is hungry.
Where does Daniel go? A:kitchen
Why did John go to the kitchen? A:hungry
One way of solving these tasks: Memory Networks!!
MemNNs have four component networks (which may or may not have shared parameters; see the sketch after this list):
▪ I: (input feature map) converts incoming data to the internal feature representation.
▪ G: (generalization) updates memories given new input.
▪ O: (output) produces new output (in feature-representation space) given the memories.
▪ R: (response) converts output O into a response seen by the outside world.
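A toy skeleton of the four components; the feature map and scoring functions are trivial word-overlap stand-ins, not the trained networks of the paper:

```python
class MemNN:
    def __init__(self):
        self.memory = []

    def I(self, text):                # input feature map
        return set(text.lower().rstrip('.?').split())

    def G(self, features):            # generalization: store/update memories
        self.memory.append(features)

    def O(self, q):                   # output: best supporting memory
        return max(self.memory, key=lambda m: len(m & q), default=set())

    def R(self, output, q):           # response: words the question lacks
        return ' '.join(sorted(output - q))

mem = MemNN()
for s in ['John is hungry.', 'John goes to the kitchen.']:
    mem.G(mem.I(s))
q = mem.I('John goes where?')
print(mem.R(mem.O(q), q))             # -> 'kitchen the to' (crude, but the
                                      #    supporting memory is found)
```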
Experiments
▪ Protocol: 1000 training QA pairs, 1000 for test.
“Weakly supervised” methods:
▪ Ngram baseline: uses bag-of-Ngram features from sentences that share a word with the question.
▪ LSTM