
GENERALIZING FROM FEW EXAMPLES

WITH META-LEARNING

Hugo Larochelle
Google Brain

A RESEARCH AGENDA
• Deep learning successes have required a lot of labeled training data
‣ collecting and labeling such data requires significant human labor
‣ practically, is that really how we’ll solve AI?
‣ scientifically, this means there is a gap with the ability of humans to learn, which we should try to understand

• Alternative solution: exploit other sources of data that are imperfect but plentiful
‣ unlabeled data (unsupervised learning)
‣ multimodal data (multimodal learning)
‣ multidomain data (transfer learning, domain adaptation)

People are good at it
‣ people can learn from just one or a handful of examples, whereas standard algorithms in machine learning require tens or hundreds of examples to perform similarly
• Human-level concept learning through probabilistic program induction (2015)
Brenden M. Lake, Ruslan Salakhutdinov, Joshua B. Tenenbaum

Machines are getting better at it
[Slide background: abstract of the same paper]

A RESEARCH AGENDA
• Let’s attack directly the problem of few-shot learning
‣ we want to design a learning algorithm A that outputs good parameters $\theta$ of a model M, when fed a small dataset $D_{\text{train}} = \{(x_i, y_i)\}_{i=1}^{L}$

• Idea: let’s learn that algorithm A, end-to-end


‣ this is known as meta-learning or learning to learn

RELATED WORK: TRANSFER LEARNING


• Large image datasets (e.g. ImageNet) have been shown to allow training representations that transfer to other problems
‣ DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition (2014)

Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng and Trevor Darrell

‣ CNN Features off-the-shelf: an Astounding Baseline for Recognition (2014)



Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, Stefan Carlsson

‣ …some have even reported some positive transfer on medical imaging datasets!

• In few-shot learning, we aim at transferring the complete training of the model on new datasets (not just transferring the features or initialization)
‣ ideally there should be no human involved in producing a model for new datasets

RELATED WORK: ONE-SHOT LEARNING


• One-shot learning has been studied before
‣ One-Shot learning of object categories (2006)

Fei-Fei Li, Rob Fergus and Pietro Perona

‣ Knowledge transfer in learning to recognize visual objects classes (2004)



Fei-Fei Li

‣ Object classification from a single example utilizing class relevance pseudo-metrics (2004)

Michael Fink

‣ Cross-generalization: learning novel classes from a single example by feature replacement (2005)

Evgeniy Bart and Shimon Ullman

• These largely relied on hand-engineered features


‣ with recent progress in end-to-end deep learning, we hope to learn a representation better suited for few-shot learning

RELATED WORK: META-LEARNING


• Early work on learning an update rule
‣ Learning a synaptic learning rule (1990)

Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier

‣ The Evolution of Learning: An Experiment in Genetic Connectionism (1990)



David Chalmers

‣ On the search for new learning rules for ANNs (1995)



Samy Bengio, Yoshua Bengio, and Jocelyn Cloutier

• Early work on recurrent networks modifying their weights


‣ Learning to control fast-weight memories: An alternative to dynamic recurrent networks (1992)

Jürgen Schmidhuber

‣ A neural network that embeds its own meta-levels (1993)



Jürgen Schmidhuber

RELATED WORK: META-LEARNING


• Training a recurrent neural network to optimize
‣ the network outputs the update, so it can decide to do something other than gradient descent
[Figure 2 of Andrychowicz et al.: computational graph used for computing the gradient of the optimizer, showing the optimizee (parameters θ, losses f) and the optimizer (gradients ∇, updates g, recurrent states h) unrolled over steps t-2, t-1, t]

• Learning to learn by gradient descent by gradient descent (2016)

Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas

• Learning to learn using gradient descent (2001)

Sepp Hochreiter, A. Steven Younger, and Peter R. Conwell

RELATED WORK: META-LEARNING


• Hyper-parameter optimization
‣ idea of learning the learning rates and the initialization conditions
[Figure 2 of Maclaurin et al.: a learning-rate training schedule for the weights in each layer of a neural network, optimized by hypergradient descent; x-axis: schedule index, y-axis: learning rate, one curve per layer (Layer 1 to Layer 4)]

• Gradient-based hyperparameter optimization through reversible learning (2015)

Dougal Maclaurin, David Duvenaud, and Ryan P. Adams
RELATED WORK: META-LEARNING

• AutoML (Bayesian optimization, reinforcement learning)
[Figure 2 of Zoph and Le: a controller recurrent neural network samples a simple convolutional network, predicting filter height, filter width, stride height, stride width, and number of filters for one layer]

• Neural Architecture Search with Reinforcement Learning (2017)

Barret Zoph and Quoc Le

If you don’t evaluate on never-seen problems/datasets…

… it’s not meta-learning!


META-LEARNING
• Learning algorithm A
‣ input: training set $D_{\text{train}} = \{(x_i, y_i)\}$
‣ output: parameters $\theta$ of model M (the learner)
‣ objective: good performance on test set $D_{\text{test}} = \{(x'_i, y'_i)\}$

• Meta-learning algorithm
‣ input: meta-training set $\mathcal{D}_{\text{meta-train}} = \{(D^{(n)}_{\text{train}}, D^{(n)}_{\text{test}})\}_{n=1}^{N}$ of episodes
‣ output: parameters $\Theta$ of algorithm A (the meta-learner)
‣ objective: good performance on meta-test set $\mathcal{D}_{\text{meta-test}} = \{(D'^{(n)}_{\text{train}}, D'^{(n)}_{\text{test}})\}_{n=1}^{N'}$
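The episode structure above can be sketched in a few lines. This is only an illustration under stated assumptions: the labeled pool is a pair of NumPy arrays, and the helper `sample_episode` with its parameters `n_way`, `k_shot` and `n_query` is hypothetical (it does not come from the slides). It draws one episode $(D_{\text{train}}, D_{\text{test}})$ from classes reserved for meta-training, so that meta-test episodes can later use classes never seen before.

```python
import numpy as np

def sample_episode(x, y, n_way, k_shot, n_query, rng):
    """Draw one episode (D_train, D_test) from a labeled pool (x, y)."""
    # Pick the classes that define this episode's classification problem.
    classes = rng.choice(np.unique(y), size=n_way, replace=False)
    train_idx, test_idx = [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(y == c))
        train_idx.extend(idx[:k_shot])                   # k labeled examples per class
        test_idx.extend(idx[k_shot:k_shot + n_query])    # held-out examples of the same class
    train_idx, test_idx = np.array(train_idx), np.array(test_idx)
    return (x[train_idx], y[train_idx]), (x[test_idx], y[test_idx])

# Toy pool with made-up data: 10 classes, 20 examples each, 4 features.
rng = np.random.default_rng(0)
pool_y = np.repeat(np.arange(10), 20)
pool_x = rng.normal(size=(pool_y.size, 4))

# Classes 0..6 are reserved for meta-training; classes 7..9 would be kept for meta-testing.
meta_train_mask = pool_y < 7
(d_train_x, d_train_y), (d_test_x, d_test_y) = sample_episode(
    pool_x[meta_train_mask], pool_y[meta_train_mask],
    n_way=5, k_shot=1, n_query=3, rng=rng)
```

Keeping the class split outside the sampler makes it easy to check that no meta-test class leaks into meta-training.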

META-LEARNING

[Figure 1: Example of meta-learning setup. The top represents the meta-training set $\mathcal{D}_{\text{meta-train}}$, where inside each gray box is a separate dataset that consists of the training set $D_{\text{train}}$ (left side of dashed line) and the test set $D_{\text{test}}$ (right side of dashed line). In this illustration, we are considering …]

META-LEARNING
[Diagram: an episode = ($D_{\text{train}}$, $D_{\text{test}}$), shown together with the Meta-learner (A), the Learner (M) and the Loss]


META-LEARNING NOMENCLATURE

• Episode
‣ Training set (also called Support set)
‣ Test set (also called Query set)

• Meta-training set (also called Training set)

• Meta-test set (also called Test set)


LEARNING PROBLEM STATEMENT


• Assuming a probabilistic model M over labels, the cost per episode can become

$$C(\Theta; D_{\text{train}}, D_{\text{test}}) = \frac{1}{|D_{\text{test}}|} \sum_{(x'_i, y'_i) \in D_{\text{test}}} -\log p_\Theta(y'_i \mid x'_i, D_{\text{train}})$$

• Depending on the choice of meta-learner, $p_\Theta(y \mid x, D_{\text{train}})$ will take a different form
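As a small numerical illustration of the per-episode cost, the sketch below averages the negative log-probability of the correct label over the test set of one episode. It assumes the meta-learner has already produced a table of log-probabilities; the function name `episode_cost` and the toy numbers are made up for the example.

```python
import numpy as np

def episode_cost(log_probs, y_test):
    """Average negative log-likelihood over the test set of one episode.

    log_probs[i, c] stands for log p_Theta(y = c | x'_i, D_train).
    """
    # Select the log-probability assigned to the true label of each test example.
    picked = log_probs[np.arange(len(y_test)), y_test]
    return -picked.mean()

# Two test examples, three classes (made-up predictive distributions).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
cost = episode_cost(np.log(probs), np.array([0, 1]))
```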

CHOOSING A META-LEARNER
• How to parametrize learning algorithms?

• Two approaches to defining a meta-learner


‣ Take inspiration from a known learning algorithm
- kNN/kernel machine: Matching networks (Vinyals et al. 2016)
- Gaussian classifier: Prototypical Networks (Snell et al. 2017)
- Gradient Descent: Meta-Learner LSTM (Ravi & Larochelle, 2017), MAML (Finn et al. 2017)
‣ Derive it from a black box neural network
- MANN (Santoro et al. 2016)
- SNAIL (Mishra et al. 2018)
MATCHING NETWORKS
• Training a “pattern matcher”

$$\hat{y} = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i \qquad (1)$$

$$a(\hat{x}, x_i) = \frac{e^{c(f(\hat{x}), g(x_i))}}{\sum_{j=1}^{k} e^{c(f(\hat{x}), g(x_j))}}$$

‣ $x_i$, $y_i$ are the samples and labels from the support set $S = \{(x_i, y_i)\}_{i=1}^{k}$, and $a$ is an attention mechanism
‣ $c$ is the cosine distance, and $f$ and $g$ are embedding functions given by neural networks (potentially with $f = g$) that embed $\hat{x}$ and $x_i$
‣ (1) subsumes both KDE and kNN methods
[Figure 1: Matching Networks architecture]

• Matching networks for one shot learning (2016)

Oriol Vinyals, Charles Blundell, Timothy P. Lillicrap, Koray Kavukcuoglu, and Daan Wierstra
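The attention-based classifier of Matching Networks can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embeddings $f(\hat{x})$ and $g(x_i)$ are assumed to be precomputed vectors, the function names are hypothetical, and $c$ is implemented as cosine similarity so that closer support examples receive more attention.

```python
import numpy as np

def cosine_similarity(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def matching_predict(query_emb, support_emb, support_labels, n_classes):
    """y_hat = sum_i a(x_hat, x_i) y_i, with a = softmax over cosine similarities."""
    sims = cosine_similarity(query_emb, support_emb)     # shape (n_query, k)
    sims = sims - sims.max(axis=1, keepdims=True)         # for numerical stability
    attn = np.exp(sims)
    attn /= attn.sum(axis=1, keepdims=True)               # a(x_hat, x_i), rows sum to 1
    one_hot = np.eye(n_classes)[support_labels]           # labels y_i as one-hot rows
    return attn @ one_hot                                 # distribution over classes

# Made-up 2-D embeddings: two support points of class 0, one of class 1.
support = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]])
labels = np.array([0, 1, 0])
query = np.array([[0.9, 0.1]])
probs = matching_predict(query, support, labels, n_classes=2)
```

Because the prediction is a weighted sum over the support set, adding a new support example changes the classifier without any retraining, which is the non-parametric behaviour the slide refers to.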
PROTOTYPICAL NETWORKS

• Training a “prototype extractor”

$$c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\phi(x_i) \qquad (1)$$

$$p_\phi(y = k \mid x) = \frac{\exp(-d(f_\phi(x), c_k))}{\sum_{k'} \exp(-d(f_\phi(x), c_{k'}))} \qquad (2)$$

$$S_k = \{(x_i, y_i) \mid y_i = k, (x_i, y_i) \in D_{\text{train}}\}$$

‣ $S_k$ denotes the set of examples labeled with class $k$
‣ given a distance function $d : \mathbb{R}^M \times \mathbb{R}^M \to [0, +\infty)$, Prototypical Networks produce a distribution over classes for a query point $x$ based on a softmax over distances to the prototypes in the embedding space
[Figure 1: Prototypical networks in the few-shot scenario (panel (a): Few-shot); prototypes $c_k$ are computed as the mean of the embedded support points of each class]

• Prototypical Networks for Few-shot Learning (2017)

Jake Snell, Kevin Swersky and Richard Zemel

PROTOTYPICAL NETWORKS
• Training a “prototype extractor”
‣ the distance function $d(\cdot, \cdot)$ can be anything (squared Euclidean, negative cosine)
‣ if the distance is squared Euclidean, this is equivalent to learning an embedding network $f_\phi(\cdot)$ such that a Gaussian classifier works well
‣ prototype vectors are equivalent to output weights of a neural network
‣ Snell et al. find that using more classes in the meta-training episodes compared to meta-testing works better

• Prototypical Networks for Few-shot Learning (2017)

Jake Snell, Kevin Swersky and Richard Zemel
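The sketch below illustrates the prototype classifier with squared Euclidean distance, and checks the remark that prototypes act like output weights: since $-\|x - c_k\|^2 = 2 c_k^\top x - \|c_k\|^2 - \|x\|^2$ and the last term does not depend on $k$, the same predictions come from a linear layer with weights $2 c_k$ and biases $-\|c_k\|^2$. The embeddings are assumed to be precomputed, and all function names and data are made up for the example.

```python
import numpy as np

def prototypes(support_emb, support_labels, n_classes):
    """c_k: mean of the embedded support points of class k."""
    return np.stack([support_emb[support_labels == k].mean(axis=0)
                     for k in range(n_classes)])

def proto_log_probs(query_emb, protos):
    """log p(y = k | x) from a softmax over negative squared Euclidean distances."""
    diff = query_emb[:, None, :] - protos[None, :, :]
    logits = -(diff ** 2).sum(axis=-1)                    # -d(f(x), c_k)
    logits = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

def linear_logits(query_emb, protos):
    """Equivalent linear layer: weights 2 c_k, biases -||c_k||^2."""
    return query_emb @ (2 * protos).T - (protos ** 2).sum(axis=1)

# Made-up 2-D embeddings: two support points per class.
support = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8]])
labels = np.array([0, 0, 1, 1])
query = np.array([[0.1, 0.1], [0.9, 1.0]])

protos = prototypes(support, labels, n_classes=2)
pred = proto_log_probs(query, protos).argmax(axis=1)
```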
META-LEARNER LSTM
• Training a “gradient descent procedure” applied on some learner M
‣ gradient descent starts from some initial parameters $\theta_0$, and then performs updates of the form

$$\theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} L_t$$

‣ this is quite similar to LSTM cell state updates:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

- state $c_t$ is model M’s parameter space $\theta_t$
- state update $\tilde{c}_t$ is the negative gradient $-\nabla_{\theta_{t-1}} L_t$
- $f_t$ and $i_t$ are LSTM gates:

$$i_t = \sigma\left(W_I \cdot [\nabla_{\theta_{t-1}} L_t, L_t, \theta_{t-1}, i_{t-1}] + b_I\right) \quad \text{(adaptive learning rate)}$$

$$f_t = \sigma\left(W_F \cdot [\nabla_{\theta_{t-1}} L_t, L_t, \theta_{t-1}, f_{t-1}] + b_F\right)$$

‣ the two updates coincide if $f_t = 1$, $c_{t-1} = \theta_{t-1}$, $i_t = \alpha_t$, and $\tilde{c}_t = -\nabla_{\theta_{t-1}} L_t$
‣ the initial value of the cell state $c_0$ can also be learned, treating it as a parameter of the meta-learner; this corresponds to the initial weights of the learner model

• Optimization as a Model for Few-Shot Learning (2017)

Sachin Ravi and Hugo Larochelle
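A minimal sketch of one such update step is given below, to make the correspondence with gradient descent concrete. It is an illustration under assumptions, not the paper's implementation: the gate weights are shared across all coordinates of $\theta$, no preprocessing is applied to gradients or losses, and the names `meta_lstm_step`, `W_I`, `b_I`, `W_F`, `b_F` are chosen here for the example. With the gate weights set so that $f_t \approx 1$ and $i_t = 0.1$, the step reduces to plain gradient descent with learning rate 0.1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def meta_lstm_step(theta_prev, grad, loss, i_prev, f_prev, W_I, b_I, W_F, b_F):
    """One coordinate-wise update c_t = f_t * c_{t-1} + i_t * (-grad)."""
    loss_col = np.full_like(theta_prev, loss)
    # Each row holds [gradient, loss, previous parameter, previous gate] for one coordinate.
    inp_i = np.stack([grad, loss_col, theta_prev, i_prev], axis=1)
    inp_f = np.stack([grad, loss_col, theta_prev, f_prev], axis=1)
    i_t = sigmoid(inp_i @ W_I + b_I)        # input gate, plays the role of the learning rate
    f_t = sigmoid(inp_f @ W_F + b_F)        # forget gate, scales the previous parameters
    theta_t = f_t * theta_prev + i_t * (-grad)
    return theta_t, i_t, f_t

# Toy learner loss L(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([1.0, -2.0, 0.5])
grad = theta.copy()
loss = 0.5 * float(theta @ theta)

# Hand-set gate parameters (not learned): f_t is close to 1 and i_t equals 0.1.
W_I, b_I = np.zeros(4), np.log(0.1 / 0.9)
W_F, b_F = np.zeros(4), 20.0
theta_next, i_t, f_t = meta_lstm_step(
    theta, grad, loss, np.zeros(3), np.ones(3), W_I, b_I, W_F, b_F)
```

In the meta-learner, the gate parameters would be trained across episodes instead of being fixed by hand as they are here.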
Under review as a conference paper at ICLR 2017
work. We set the cell state of the LSTM to be the parameters of the learner, or ct = ✓t , and the 27
M ODELcandidate D ESCRIPTION
META-LEARNER LSTM
cell state c̃t = r✓t 1 Lt , given how valuable information about the gradient is for opti-
mization. We define parametric forms for it and ft so that the meta-learner can determine optimal
der
iderreview
a single
values as athroughconference
dataset theDcourse 2paper
Dofmeta at updates.
the ICLRtrain2017 . Suppose we have a learner neural net mode
• Training a “gradient descent procedure” applied on some learner M
meters ✓Our Let thatus we
keystart want
with itto
observation train
that
, which we on
leverage
correspondsDtrain to .the
here isThe
that
learningstandard
this update optimization
rate forresembles
the updates. the We algorithms
update let for the cell used
statet
in an LSTM
neural networks are some variant ⇥ ⇤
‣ gradient descent starts from it = of
some Wgradient
initial parameters
· r descent,

L and
0,L ,✓
ct = ft t 1 ct 1 + it c̃t ,
I ✓ t t
then
t
which
performs
1 , i t 1 +uses I
updates
theb following
, of
updates: the form
(2)
ifmeaning
ft = 1, that ct 1the = ✓learning
t 1 , it =rate↵is t , aandfunction
c̃t = rof✓t the Lcurrent
t. parameter value ✓t , the current gradient
✓ = ✓ ↵ 1
r
r✓t Lt , the current loss Lt , and the previous learning rate it 1 . With this information, the meta-
t t 1 t ✓ t 1 L t ,
r key observation
Thus,
learnerweshould thatbewe
propose able leverage
training here
a meta-learner
to finely control is that LSTM
the thistoupdate
learning learn
rate ansoresembles
update
as to train rulethefor update
the for
trainingquickly
learner thewhile
a neural cell
net-
this is quite
‣ work. We similar
set the tocell
LSTM cellof
state state
the updates:

LSTM to be the parameters of the learner, or ct = ✓t , and the
an LSTMavoiding divergence.
e ✓t 1 are 

candidatethe parameters
cell state c̃t =ofrthe ✓t 1 L
c learner
t , given after
= f t
how valuable
c + i 1 updates,
information↵about
c̃ , t is the learning
the gradient rate
is for at
opti-
As
the lossmization. for
optimized f , it seems
t We define by the possible
parametric that
learnerforms t the
for its optimal
t t
for tit and
th choice
1 tisn’t
ft so that
update, r✓t constant
the
t meta-learner
L is 1. Intuitively,
the can
gradient what
determine
of would
optimal
that los
t
fect
t = 1, c justify
values =
state cshrinking
through

t is model , ithe
M’s the
= courseparameters

parameter of
, and
space c̃of the learner and forgetting part of its previous value would be
t t = r ✓t 1 L t .
the ✓updates.
1
to parameters
t - 1
if the learner
t 1 t , and ✓t inisathe
✓t is1currently t
badupdated
local optima parameters
and needs of the learner.
a large change to escape. This would
Let
us, we propose us start
- state
correspond updatewith it , negative
toc~t ais the
training awhich
situation corresponds
meta-learner
gradient
where ther✓loss
t 1 totisthe
LSTM
L
high learning
tobut learn
the rate anforupdate
gradient theisupdates.
close rule
to We
for
zero. lettraining
Thus, oneaproposal
neural
⇥ ⇤
rk. We set for- ftthe
and forget
the it cell
are LSTM gate
state is of
gates: to have
the it be W
it =LSTM a function
I to
· r be✓t of
the that
L Linformation,
t ,parameters
t , ✓t 1 , it 1of as+the
well as
, the previous
bI learner, or ct value
= ✓tof, the
and
2
1

ndidate cell forget stategate:


t 1 rate is a function of the current parameter about
, given how valuable information the gradient is for

meaning that the learning t = r ✓ L t ⇥ ⇤ value ✓ , the current gradient
ft = W F · r ✓t 1 L t , L t , ✓t 1 , f t 1 + b F . t
zation. We r ✓t L define parametric
t , the current loss Lforms
t , and the forprevious
it and flearning t so that ratetheit meta-learner
1 . With this information,can determine opt
the meta-
learner
• Optimization
ues through theshould
Additionally, course asbe aof
notice able thetowe
Model
that finely
for
updates. control
Few-Shot
can also learnthe learning
theLearning
initial value rate so the
(2017)

of as cell
to train
statethe learner
c0 for quicklytreating
the LSTM, while
avoiding
it asRavi
Sachin anddivergence.
a parameter Hugo Larochelleof the meta-learning. This corresponds to the initial weights of the learner model
Under review as a conference paper at ICLR 2017
work. We set the cell state of the LSTM to be the parameters of the learner, or ct = ✓t , and the 27
M ODELcandidate D ESCRIPTION
META-LEARNER LSTM
cell state c̃t = r✓t 1 Lt , given how valuable information about the gradient is for opti-
mization. We define parametric forms for it and ft so that the meta-learner can determine optimal
der
iderreview
a single
values as athroughconference
dataset theDcourse 2paper
Dofmeta at updates.
the ICLRtrain2017 . Suppose we have a learner neural net mode
• Training a “gradient descent procedure” applied on some learner M
meters ✓Our Let thatus we
keystart want
with itto
observation train
that
, which we on
leverage
correspondsDtrain to .the
here isThe
that
learningstandard
this update optimization
rate forresembles
the updates. the We algorithms
update let for the cell used
statet
in an LSTM
neural networks are some variant ⇥ ⇤
‣ gradient descent starts from it = of
some Wgradient
initial parameters
· r descent,

L and
0,L ,✓
ct = ft t 1 ct 1 + it c̃t ,
I ✓ t t
then
t
which
performs
1 , i t 1 +uses I
updates
theb following
, of
updates: the form
(2)
ifmeaning
ft = 1, that ct 1the = ✓learning
t 1 , it =rate↵is t , aandfunction
c̃t = rof✓t the Lcurrent
t. parameter value ✓t , the current gradient
✓ = ✓ ↵ 1
r
r✓t Lt , the current loss Lt , and the previous learning rate it 1 . With this information, the meta-
t t 1 t ✓ t 1 L t ,
r key observation
Thus,
learnerweshould thatbewe
propose able leverage
training here
a meta-learner
to finely control is that LSTM
the thistoupdate
learning learn
rate ansoresembles
update
as to train rulethefor update
the for
trainingquickly
learner thewhile
a neural cell
net-
this is quite
‣ work. We similar
set the tocell
LSTM cellof
state state
the updates:

LSTM to be the parameters of the learner, or ct = ✓t , and the
an LSTMavoiding divergence.
e ✓t 1 are 

candidatethe parameters
cell state c̃t =ofrthe ✓t 1 L
c learner
t , given after
= f t
how valuable
c + i 1 updates,
information↵about
c̃ , t is the learning
the gradient rate
is for at
opti-
As
the lossmization. for
optimized f , it seems
t We define by the possible
parametric that
learnerforms t the
for its optimal
t t
for tit and
th choice
1 tisn’t
ft so that
update, r✓t constant
the
t meta-learner
L is 1. Intuitively,
the can
gradient determinewhat
of would
optimal
that los
t
fect
t = 1, c justify
values =
state cshrinking
through

t is model , ithe
M’s the
= courseparameters

parameter of
, and
space c̃of the learner and forgetting part of its previous value would be
t t = r ✓t 1 L t .
the ✓updates.
1
to parameters
t - 1
if the learner
t 1 t , and ✓t inisathe
✓t is1currently t
badupdated
local optima parameters
and needs of the learner.
a large change to escape. This would
Let
us, we propose us start
- state
correspond updatewith it , negative
toc~t ais the
training awhich
situation corresponds
meta-learner
gradient
where ther✓loss
t 1 totisthe
LSTM
L
high learning
tobut learn
the rate anforupdate
gradient theisupdates.
close rule
to We
for
zero. lettraining
Thus, oneaproposal
neural
⇥ ⇤
rk. We set for- ftthe
and forget
the it cell
are LSTM gate
state is of
gates: to have
the it be W
it =LSTM a function
I to
· r be✓t of
the that
L Linformation,
t ,parameters
t , ✓t 1 , it 1of as+the
well as
, the previous
bI learner, value
or ctlearning
adaptive of the
t , and
= ✓rate
2
1

ndidate cell forget stategate:


t 1 rate is a function of the current parameter about
, given how valuable information the gradient is for

meaning that the learning t = r ✓ L t ⇥ ⇤ value ✓ , the current gradient
ft = W F · r ✓t 1 L t , L t , ✓t 1 , f t 1 + b F . t
zation. We r ✓t L define parametric
t , the current loss Lforms
t , and the forprevious
it and flearning t so that ratetheit meta-learner
1 . With this information,can determine the meta-opt
learner
• Optimization
ues through theshould
Additionally, course asbe aof
notice able thetowe
Model
that finely
for
updates. control
Few-Shot
can also learnthe learning
theLearning
initial value rate so the
(2017)

of as cell
to train
statethe learner
c0 for quicklytreating
the LSTM, while
avoiding
it asRavi
Sachin anddivergence.
a parameter Hugo Larochelleof the meta-learning. This corresponds to the initial weights of the learner model
27

META-LEARNER LSTM
• Training a “gradient descent procedure” applied on some learner M
‣ gradient descent starts from some initial parameters $\theta_0$, and then performs updates of the form
$\theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} L_t$
‣ this is quite similar to LSTM cell state updates:
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
if $f_t = 1$, $c_{t-1} = \theta_{t-1}$, $i_t = \alpha_t$, and $\tilde{c}_t = -\nabla_{\theta_{t-1}} L_t$
- state $c_t$ is model M’s parameter space $\theta_t$
- state update $\tilde{c}_t$ is the negative gradient $-\nabla_{\theta_{t-1}} L_t$
- $f_t$ and $i_t$ are LSTM gates:
$i_t = \sigma(W_I \cdot [\nabla_{\theta_{t-1}} L_t, L_t, \theta_{t-1}, i_{t-1}] + b_I)$ (adaptive learning rate)
$f_t = \sigma(W_F \cdot [\nabla_{\theta_{t-1}} L_t, L_t, \theta_{t-1}, f_{t-1}] + b_F)$ (adaptive weight decay)
- $c_0$ becomes a learned initialization

• Optimization as a Model for Few-Shot Learning (2017)
Sachin Ravi and Hugo Larochelle

(Paper excerpt shown on the slide; under review as a conference paper at ICLR 2017)

MODEL DESCRIPTION

Consider a single dataset $D \in \mathscr{D}_{meta-train}$. Suppose we have a learner neural net model M with parameters $\theta$ that we want to train on $D_{train}$. The standard optimization algorithms used to train deep neural networks are some variant of gradient descent, which uses updates of the form

$\theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} L_t$, (1)

where $\theta_{t-1}$ are the parameters of the learner after $t-1$ updates, $\alpha_t$ is the learning rate at time $t$, $L_t$ is the loss optimized by the learner for its $t$-th update, $\nabla_{\theta_{t-1}} L_t$ is the gradient of that loss with respect to parameters $\theta_{t-1}$, and $\theta_t$ is the updated parameters of the learner.

Our key observation that we leverage here is that this update resembles the update for the cell state in an LSTM

$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$, (2)

if $f_t = 1$, $c_{t-1} = \theta_{t-1}$, $i_t = \alpha_t$, and $\tilde{c}_t = -\nabla_{\theta_{t-1}} L_t$.

Thus, we propose training a meta-learner LSTM to learn an update rule for training a neural network. We set the cell state of the LSTM to be the parameters of the learner, or $c_t = \theta_t$, and the candidate cell state $\tilde{c}_t = -\nabla_{\theta_{t-1}} L_t$, given how valuable information about the gradient is for optimization. We define parametric forms for $i_t$ and $f_t$ so that the meta-learner can determine optimal values through the course of the updates.

Let us start with $i_t$, which corresponds to the learning rate for the updates. We let

$i_t = \sigma(W_I \cdot [\nabla_{\theta_{t-1}} L_t, L_t, \theta_{t-1}, i_{t-1}] + b_I)$,

meaning that the learning rate is a function of the current parameter value $\theta_{t-1}$, the current gradient $\nabla_{\theta_{t-1}} L_t$, the current loss $L_t$, and the previous learning rate $i_{t-1}$. With this information, the meta-learner should be able to finely control the learning rate so as to train the learner quickly while avoiding divergence.

As for $f_t$, it seems possible that the optimal choice isn’t the constant 1. Intuitively, what would justify shrinking the parameters of the learner and forgetting part of its previous value would be if the learner is currently in a bad local optimum and needs a large change to escape. This would correspond to a situation where the loss is high but the gradient is close to zero. Thus, one proposal for the forget gate is to have it be a function of that information, as well as the previous value of the forget gate:

$f_t = \sigma(W_F \cdot [\nabla_{\theta_{t-1}} L_t, L_t, \theta_{t-1}, f_{t-1}] + b_F)$.

Additionally, notice that we can also learn the initial value of the cell state $c_0$ for the LSTM, treating it as a parameter of the meta-learner. This corresponds to the initial weights of the learner model.
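The correspondence between the gradient descent update and the LSTM cell-state update can be checked numerically. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `gate` and `meta_lstm_step` are hypothetical, the loss is a toy quadratic, and the gate weights are set by hand rather than meta-learned. With zero gate weights, a forget-gate bias large enough that $f_t \approx 1$, and an input-gate bias chosen so that $i_t = \alpha$, the cell update should reduce to a plain gradient descent step.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate(weights, bias, grad_k, loss, theta_k, prev_gate_k):
    """sigma(W . [grad, loss, theta, previous gate] + b) for one coordinate k.

    The same weights are reused for every coordinate of the learner.
    """
    features = [grad_k, loss, theta_k, prev_gate_k]
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def meta_lstm_step(theta, grads, loss, i_prev, f_prev, w_i, b_i, w_f, b_f):
    """One cell update c_t = f_t * c_{t-1} + i_t * c_tilde_t with c_tilde_t = -grad."""
    new_theta, new_i, new_f = [], [], []
    for th, g, ip, fp in zip(theta, grads, i_prev, f_prev):
        i_t = gate(w_i, b_i, g, loss, th, ip)
        f_t = gate(w_f, b_f, g, loss, th, fp)
        new_theta.append(f_t * th + i_t * (-g))
        new_i.append(i_t)
        new_f.append(f_t)
    return new_theta, new_i, new_f


# Toy learner (an assumption for illustration): L(theta) = sum(theta_k ** 2)
theta = [1.0, -2.0]
loss = sum(p * p for p in theta)
grads = [2.0 * p for p in theta]

alpha = 0.1
zero_weights = [0.0, 0.0, 0.0, 0.0]
b_i = math.log(alpha / (1.0 - alpha))  # sigmoid(b_i) equals alpha
b_f = 20.0                             # sigmoid(b_f) is very close to 1

theta_next, i_t, f_t = meta_lstm_step(
    theta, grads, loss, [alpha] * 2, [1.0] * 2,
    zero_weights, b_i, zero_weights, b_f,
)
sgd_next = [p - alpha * g for p, g in zip(theta, grads)]
```

Here `theta_next` and `sgd_next` agree up to the small gap between `sigmoid(b_f)` and 1. Lowering `b_f` makes the forget gate shrink the parameters, which is the “adaptive weight decay” reading of $f_t$.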
28

META-LEARNER LSTM
• Training a “gradient descent procedure” applied on some learner M

[Figure 1 of the paper (under review as a conference paper at ICLR 2017). Labels in the figure: $D_{train}$, $D_{test}$, (M), $C(\Theta; D_{train}, D_{test})$, (LSTM). Caption, truncated in the source: “Figure 1: Computational graph for the forward pass of the meta-learner. The dashed line divides examples from the training set $D_{train}$ and test set $D_{test}$. Each $(X_i, Y_i)$ is the $i$-th batch from the”]

• Optimization as a Model for Few-Shot Learning (2017)
Sachin Ravi and Hugo Larochelle
29

META-LEARNER LSTM
• Training a “gradient descent procedure” applied on some learner M
‣ LSTM parameters are shared across M’s parameters (i.e. treated like a large minibatch)
‣ can ignore (stop) gradients through the inputs of the LSTM
‣ gradient (and loss) inputs to the Meta-LSTM preprocessed as proposed by Andrychowicz et al. (2016)
‣ we are careful to avoid “leakage” from batchnorm statistics between meta-train / meta-test sets (sometimes referred to as the “transductive setting”)

• Optimization as a Model for Few-Shot Learning (2017)
Sachin Ravi and Hugo Larochelle

(Paper excerpt shown on the slide, from Andrychowicz et al. 2016)

A Gradient preprocessing

One potential challenge in training optimizers is that different input coordinates (i.e. the gradients w.r.t. different optimizee parameters) can have very different magnitudes. This is indeed the case e.g. when the optimizee is a neural network and different parameters correspond to weights in different layers. This can make training an optimizer difficult, because neural networks naturally disregard small variations in input signals and concentrate on bigger input values.

To this aim we propose to preprocess the optimizer’s inputs. One solution would be to give the optimizer $(\log(|\nabla|), \mathrm{sgn}(\nabla))$ as an input, where $\nabla$ is the gradient in the current timestep. This has a problem that $\log(|\nabla|)$ diverges for $\nabla \to 0$. Therefore, we use the following preprocessing formula

$\nabla^k \to \begin{cases} \left( \frac{\log(|\nabla|)}{p}, \mathrm{sgn}(\nabla) \right) & \text{if } |\nabla| \geq e^{-p} \\ (-1, e^{p} \nabla) & \text{otherwise} \end{cases}$

where $p > 0$ is a parameter controlling how small gradients are disregarded (we use $p = 10$ in all our experiments).

We noticed that just rescaling all inputs by an appropriate constant instead also works fine, but the proposed preprocessing seems to be more robust and gives slightly better results on some problems.
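The preprocessing formula quoted from Andrychowicz et al. (2016) maps each scalar gradient to a pair of bounded features. A minimal sketch of that mapping follows; the function name `preprocess_gradient` is hypothetical, and applying it coordinate by coordinate to both gradient and loss inputs is an assumption based on the slide's bullet.

```python
import math


def preprocess_gradient(g, p=10.0):
    """Map a scalar gradient g to two features.

    If |g| >= e^{-p}: (log|g| / p, sign(g)).
    Otherwise:        (-1, e^{p} * g).
    """
    if abs(g) >= math.exp(-p):
        return (math.log(abs(g)) / p, math.copysign(1.0, g))
    return (-1.0, math.exp(p) * g)


# Gradients spanning many orders of magnitude land in a comparable range.
features = [preprocess_gradient(g) for g in (100.0, 1.0, -1e-3, 1e-6, 0.0)]
```

Large gradients are compressed by the logarithm, while gradients below $e^{-p}$ fall into the second branch, which stays finite at zero.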
30

MODEL-AGNOSTIC META-LEARNING
• Training a “gradient descent procedure” applied on some learner M
‣ MAML proposes not to bother with training an LSTM for the gradient descent updates, and instead uses constant step-size updates
‣ better results are also reported by the so-called bias transformation architecture

(One-Shot Visual Imitation Learning via Meta-Learning, Finn et al. 2017)
- concatenates to one of the layers a trainable parameter vector, for instance to the input layer [xi , ✓b ]
- decouples the updates of the bias and weights of that layer
- with it, can be shown that even a single gradient descent update yields a universal approximator over
functions mapping Dtrain and x to any label y, for a sufficiently deep ReLU network and certain losses

(Meta-Learning and Universality: Deep Representations and Gradient Descent can Approximate any Learning
Algorithm, Finn and Levine, 2018)

• Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (2017)



Chelsea Finn, Pieter Abbeel and Sergey Levine
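To make the idea of learning only an initialization concrete, here is a small sketch of a MAML-style meta-gradient for a one-parameter model. Everything in it is an illustrative assumption rather than the authors' code: the learner is the scalar model $\hat{y} = w x$ with squared loss, there is a single inner gradient step with constant step size `alpha`, and the names `adapt`, `meta_gradient` and `make_task` are hypothetical. For this model the derivative of the adapted weight with respect to the initial weight has a closed form, so the exact meta-gradient can be written without an autodiff library.

```python
def mse(w, xs, ys):
    """Mean squared error of the scalar model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_grad(w, xs, ys):
    """Derivative of mse with respect to w."""
    return sum(2.0 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)


def adapt(w, xs, ys, alpha):
    """Inner loop: one gradient step with constant step size alpha."""
    return w - alpha * mse_grad(w, xs, ys)


def meta_gradient(w, task, alpha):
    """Derivative of the test loss after adaptation with respect to the initial w."""
    (x_tr, y_tr), (x_te, y_te) = task
    w_adapted = adapt(w, x_tr, y_tr, alpha)
    # d(w_adapted)/dw = 1 - alpha * d^2 L_train / dw^2 = 1 - alpha * 2 * mean(x^2)
    dadapted_dw = 1.0 - alpha * 2.0 * sum(x * x for x in x_tr) / len(x_tr)
    return mse_grad(w_adapted, x_te, y_te) * dadapted_dw


def make_task(slope):
    """Toy task y = slope * x with a tiny train/test split."""
    x_train, x_test = [1.0, 2.0], [3.0]
    return ((x_train, [slope * x for x in x_train]),
            (x_test, [slope * x for x in x_test]))


tasks = [make_task(1.0), make_task(3.0)]
alpha, beta = 0.05, 0.01  # inner and outer step sizes (assumed values)
w = 0.0                   # the initialization being meta-learned
for _ in range(300):
    g = sum(meta_gradient(w, t, alpha) for t in tasks) / len(tasks)
    w -= beta * g
```

In this toy setting the meta-learned initialization settles between the two task slopes, the point from which one inner step does equally well on both tasks.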
31

CHOOSING A META-LEARNER
• How to parametrize learning algorithms?

• Two approaches to defining a meta-learner


‣ Take inspiration from a known learning algorithm
- kNN/kernel machine: Matching networks (Vinyals et al. 2016)
- Gaussian classifier: Prototypical Networks (Snell et al. 2017)
- Gradient Descent: Meta-Learner LSTM (Ravi & Larochelle, 2017) , MAML (Finn et al. 2017)
‣ Derive it from a black box neural network
- MANN (Santoro et al. 2016)
- SNAIL (Mishra et al. 2018)
32

BLACK-BOX META-LEARNER
• Frame meta-learning as sequence labeling with correct labels as delayed inputs
[Figure from Santoro et al. 2016, panel (a) Task setup. Labels in the figure: $y_t$, $y_{t+1}$, ..., $y_T$, $y_0$, $y_1$, “Reset memory”]

• Learning to learn using gradient descent (2001)

Sepp Hochreiter, A. Steven Younger, and Peter R. Conwell
33

MEMORY-AUGMENTED NEURAL NETWORK
• Training a neural Turing machine to learn a learning algorithm

[Figure from Santoro et al. 2016, panel (b) Network strategy]

• One-shot learning with memory-augmented neural networks (2016)

Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy P. Lillicrap
34

MEMORY-AUGMENTED NEURAL NETWORK


• Training a neural Turing machine to learn a learning algorithm
‣ required adapting the original neural Turing machine, using a fairly non-trivial Least Recently Used Access
(LRUA) writing mechanism
‣ only shown to work on Omniglot dataset (and not on the harder split used by Lake et al. 2015)

• One-shot learning with memory-augmented neural networks (2016)



Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy P. Lillicrap
35

SIMPLE NEURAL ATTENTIVE LEARNER
• Training a convolutional/attentional network to learn
‣ alternates between dilated convolutional layers and attentional layers
‣ when inputs are images, a convolutional embedding network is used to map to a vector space

[Figure, “Supervised Learning”: the network reads the sequence of (Examples, Labels) pairs $(x_{t-3}, y_{t-3})$, $(x_{t-2}, y_{t-2})$, $(x_{t-1}, y_{t-1})$, $(x_t, \text{--})$ and outputs the Predicted Label at time $t$]

• A Simple Neural Attentive Meta-Learner (2018)
Nikhil Mishra, Mostafa Rohaninejad, Xi Chen and Pieter Abbeel

(Paper excerpt shown on the slide)

An attention block performs a single key-value lookup; we style this operation after the self-attention mechanism proposed by Vaswani et al. (2017a):

1: function ATTENTIONBLOCK(inputs, key size K, value size V):
2: keys, query = affine(inputs, K), affine(inputs, K)
3: logits = matmul(query, transpose(keys))
4: probs = CausallyMaskedSoftmax(logits / $\sqrt{K}$)
5: values = affine(inputs, V)
6: read = matmul(probs, values)
7: return concat(inputs, read)

where CausallyMaskedSoftmax(·) zeros out the appropriate probabilities before normalization, so that a particular timestep’s query cannot have access to future keys/values.

[Figure 2 panels. (a) Dense Block (dilation rate R, D filters): inputs, shape [T, C] → causal conv, kernel 2, dilation R, D filters → concatenate → outputs, shape [T, C + D]. (b) Attention Block (key size K, value size V): inputs, shape [T, C] → affine, output size K (query); affine, output size K (keys); affine, output size V (values) → matmul, masked softmax → matmul → concatenate → outputs, shape [T, C + V]]

Figure 2: Two of the building blocks that compose SNAIL architectures. (a) A dense block applies a causal 1D-convolution, and then concatenates the output to its input. A TC block (not pictured) applies a series of dense blocks with exponentially-increasing dilation rates. (b) An attention block
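The attention block pseudocode can be turned into a small runnable sketch. This version uses plain Python lists instead of a tensor library, takes the affine weights as explicit arguments, and implements the causal mask by normalizing over past and current positions only; the helper names (`matmul`, `affine`, `causally_masked_softmax`, `attention_block`) and the hand-picked weights are assumptions for illustration, not the SNAIL authors' code.

```python
import math


def matmul(a, b):
    """Matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def affine(inputs, weight, bias):
    """inputs [T, C] times weight [C, D] plus bias [D]."""
    return [[v + b for v, b in zip(row, bias)] for row in matmul(inputs, weight)]


def causally_masked_softmax(logits):
    """Row t is normalized over positions 0..t only; future positions get 0."""
    probs = []
    for t, row in enumerate(logits):
        visible = row[: t + 1]
        top = max(visible)
        exps = [math.exp(v - top) for v in visible]
        total = sum(exps)
        probs.append([e / total for e in exps] + [0.0] * (len(row) - t - 1))
    return probs


def attention_block(inputs, w_k, b_k, w_q, b_q, w_v, b_v):
    keys = affine(inputs, w_k, b_k)
    query = affine(inputs, w_q, b_q)
    key_size = len(b_k)
    logits = matmul(query, [list(col) for col in zip(*keys)])
    scaled = [[v / math.sqrt(key_size) for v in row] for row in logits]
    probs = causally_masked_softmax(scaled)
    values = affine(inputs, w_v, b_v)
    read = matmul(probs, values)
    # concat(inputs, read): output shape [T, C + V]
    return [x + r for x, r in zip(inputs, read)]


# Toy usage with T = 3 timesteps, C = 2 channels, K = 2, V = 1
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_id, b0 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w_v, b_v = [[1.0], [2.0]], [0.0]
out = attention_block(inputs, w_id, b0, w_id, b0, w_v, b_v)
```

Because of the mask, changing the last timestep of `inputs` leaves the first two output rows unchanged, which is what lets the block be used on a sequence whose final label is still to be predicted.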
36

EXPERIMENT
• Mini-ImageNet (split used in Ravi & Larochelle, 2017)
‣ random subset of 100 classes (64 training, 16 validation, 20 testing)
‣ random sets Dtrain are generated by randomly picking 5 classes from class subset

(Table from the paper, under review as a conference paper at ICLR 2017)

Model                        5-class, 1-shot    5-class, 5-shot
Baseline-finetune            28.86 ± 0.54%      49.79 ± 0.79%
Baseline-nearest-neighbor    41.08 ± 0.70%      51.04 ± 0.65%
Matching Network             43.40 ± 0.78%      51.09 ± 0.71%
Matching Network FCE         43.56 ± 0.84%      55.31 ± 0.73%
Meta-Learner LSTM (OURS)     43.44 ± 0.77%      60.60 ± 0.71%
37

EXPERIMENT
• Mini-ImageNet (split used in Ravi & Larochelle, 2017)
‣ random subset of 100 classes (64 training, 16 validation, 20 testing)
‣ random sets Dtrain are generated by randomly picking 5 classes from class subset

Model                              5-class, 1-shot    5-class, 5-shot
Baseline-finetune                  28.86 ± 0.54%      49.79 ± 0.79%
Baseline-nearest-neighbor          41.08 ± 0.70%      51.04 ± 0.65%
Matching Network                   43.40 ± 0.78%      51.09 ± 0.71%
Matching Network FCE               43.56 ± 0.84%      55.31 ± 0.73%
Meta-Learner LSTM (OURS)           43.44 ± 0.77%      60.60 ± 0.71%
Prototypical Nets (Snell et al.)   49.42 ± 0.78%      68.20 ± 0.66%
MAML (Finn et al.)                 48.70 ± 1.84%      63.10 ± 0.92%
SNAIL (Mishra et al.)              55.71 ± 0.99%      68.88 ± 0.98%
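The episode construction described in the bullets (pick 5 classes at random, then a few examples per class) can be sketched as follows. This is an illustration under assumptions, not the benchmark's official loader: `sample_episode` is a hypothetical name, the dataset is a stand-in dictionary of string ids, and the number of test (query) examples per class, 15 here, is an assumed value.

```python
import random


def sample_episode(examples_by_class, n_way=5, k_shot=1, n_query=15, rng=random):
    """Build one episode: D_train has k_shot examples per class, D_test has n_query."""
    classes = rng.sample(sorted(examples_by_class), n_way)
    d_train, d_test = [], []
    for label, cls in enumerate(classes):
        picked = rng.sample(examples_by_class[cls], k_shot + n_query)
        d_train += [(x, label) for x in picked[:k_shot]]
        d_test += [(x, label) for x in picked[k_shot:]]
    return d_train, d_test


# Stand-in for the 64 meta-training classes: class id -> list of example ids
meta_train = {c: [f"img_{c}_{i}" for i in range(20)] for c in range(64)}
d_train, d_test = sample_episode(meta_train, n_way=5, k_shot=1, rng=random.Random(0))
```

Labels are re-indexed from 0 to 4 inside each episode, so the learner cannot rely on a fixed class identity across episodes.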
38

EXTENSIONS AND VARIATIONS
• Semi-supervised learning (with distractors)
‣ assign soft-labels to unlabeled examples
‣ use soft-labels to refine prototypes

• Meta-Learning for Semi-supervised Few-Shot Classification (2018)
Ren, Triantafillou, Ravi, Snell, Swersky, Tenenbaum, Larochelle and Zemel

(Paper excerpt shown on the slide; published as a conference paper at ICLR 2018)

[...] various statistics of the normalized distances for the prototype:

$[\beta_c, \gamma_c] = \mathrm{MLP}\left(\left[\min_j(\tilde{d}_{j,c}), \max_j(\tilde{d}_{j,c}), \mathrm{var}_j(\tilde{d}_{j,c}), \mathrm{skew}_j(\tilde{d}_{j,c}), \mathrm{kurt}_j(\tilde{d}_{j,c})\right]\right)$ (8)

This allows each threshold to use information on the amount of intra-cluster variation to determine how aggressively it should cut out unlabeled examples.

[...] soft masks $m_{j,c}$ for the contribution of each example to each prototype are computed, by comparing to the threshold the normalized distances, as follows:

$\tilde{p}_c = \frac{\sum_i h(x_i) z_{i,c} + \sum_j h(\tilde{x}_j) \tilde{z}_{j,c} m_{j,c}}{\sum_i z_{i,c} + \sum_j \tilde{z}_{j,c} m_{j,c}}$, where $m_{j,c} = \sigma\left(-\gamma_c \left(\tilde{d}_{j,c} - \beta_c\right)\right)$ (9)

where $\sigma(\cdot)$ is the sigmoid function.

When training with this refinement process, the model can now use its MLP in Equation 8 to learn to include or ignore entirely certain unlabeled examples. The use of soft masks makes this process entirely differentiable. Finally, much like for regular soft k-means (with or without a distractor cluster), while we could recursively repeat the refinement for multiple steps, we found a single step to perform well enough.

[Figure 2 panels: Training and Testing episodes, each with a Support Set (classes 1, 2, 3), an Unlabeled Set, and a Query Set (marked “?”)]

Figure 2: Example of the semi-supervised few-shot learning setup. Training involves iterating through [...] episodes, consisting of a support set S, an unlabeled set R, and a query set Q. The goal is to use the [...] items (shown with their numeric class label) in S and the unlabeled items in R within each episode to ge[...]
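The masked prototype refinement of Equation 9 can be illustrated with a short sketch for a single class $c$. It is a simplification under assumptions: the embeddings $h(\cdot)$ are given directly as small vectors, the threshold $\beta_c$ and slope $\gamma_c$ are passed in as numbers instead of being produced by the MLP of Equation 8, and the names `soft_mask` and `refine_prototype` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_mask(d_norm, beta, gamma):
    """m_jc = sigmoid(-gamma_c * (d_jc - beta_c)) for a normalized distance d_jc."""
    return sigmoid(-gamma * (d_norm - beta))


def refine_prototype(h_labeled, z_labeled, h_unlabeled, z_unlabeled, masks):
    """Weighted mean of labeled and masked unlabeled embeddings for one class."""
    dim = len(h_labeled[0])
    numerator = [0.0] * dim
    denominator = 0.0
    for h, z in zip(h_labeled, z_labeled):
        numerator = [n + z * v for n, v in zip(numerator, h)]
        denominator += z
    for h, z, m in zip(h_unlabeled, z_unlabeled, masks):
        numerator = [n + z * m * v for n, v in zip(numerator, h)]
        denominator += z * m
    return [n / denominator for n in numerator]


# Two labeled embeddings of class c and one unlabeled embedding
h_labeled, z_labeled = [[0.0, 0.0], [2.0, 0.0]], [1.0, 1.0]
h_unlabeled, z_unlabeled = [[1.0, 3.0]], [1.0]

kept = refine_prototype(h_labeled, z_labeled, h_unlabeled, z_unlabeled, [1.0])
ignored = refine_prototype(h_labeled, z_labeled, h_unlabeled, z_unlabeled, [0.0])
borderline = soft_mask(0.7, beta=0.7, gamma=5.0)
```

A mask of 1 lets the unlabeled example pull the prototype toward it, a mask of 0 removes its influence entirely, and intermediate values interpolate smoothly, which is what keeps the refinement differentiable.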
ork various statistics of the normalized distances for the prototype:
✓ ◆ 38

EXTENSIONS AND VARIATIONS


[ c , c ] = MLP min(d˜j,c ), max ( ˜
d
Published as aj,c
j
), var ( ˜
d
conference paper
j
), skew ( ˜
d
j,cat ICLR 2018 j,c
j
), kurt
j
( ˜
d j,c )
j
(8

allows• Semi-supervised
each threshold tolearning
use information on the amount
(with distractors) of intra-cluster variation to determin
aggressively it should cut out unlabeled examples.
‣ assign soft-labels to unlabeled examples 1 2 3
? ?

soft masks mj,c for


‣ use soft-labels the contribution
to refine prototypes of each example to each prototype are computed, b
aring to the threshold the normalized distances, Training as follows: ?
Support Set

P P ⇣ ⇣ ⌘⌘
i h(x )z
i i,c + j h( x̃ )z̃ m
j j,c j,c
p̃c = P P , where mj,c = c
˜
d j,c c (9
Unlabeled Set Query Set

i zi,c + j z̃j,c mj,c


1 2 3 ? ?

e (·) is the sigmoid function.


Support Set

n training with this refinement process, the model can now use its MLP in Equation 8 to lear
Testing ?

lude or ignore entirely certain unlabeled examples. The use of soft masks makes this proces Unlabeled Set Query Set

ly differentiable2 . Finally, much like for regular soft k-means (with or without a distracto
r), while we could recursively repeat the refinement for multiple steps, we found a single ste
Figure 2: Example of the semi-supervised few-shot learning setup. Training involves iterating through
• Meta-Learning for Semi-supervised Few-Shot Classification (2018)

form well enough. episodes, consisting of a support set S, an unlabeled set R, and a query set Q. The goal is to use the
Ren, Triantafillou, Ravi, Snell, Swersky, Tenenbaum, Larochelle and Zemel
items (shown with their numeric class label) in S and the unlabeled items in R within each episode to ge
38

EXTENSIONS AND VARIATIONS

• Semi-supervised learning (with distractors)
‣ assign soft-labels to unlabeled examples
‣ use soft-labels to refine prototypes

From the paper (published as a conference paper at ICLR 2018):

[…] network various statistics of the normalized distances for the prototype:

$$[\beta_c, \gamma_c] = \mathrm{MLP}\Big(\big[\min_j(\tilde{d}_{j,c}),\ \max_j(\tilde{d}_{j,c}),\ \operatorname{var}_j(\tilde{d}_{j,c}),\ \operatorname{skew}_j(\tilde{d}_{j,c}),\ \operatorname{kurt}_j(\tilde{d}_{j,c})\big]\Big) \tag{8}$$

This allows each threshold to use information on the amount of intra-cluster variation to determine how aggressively it should cut out unlabeled examples.

Soft masks $m_{j,c}$ for the contribution of each example to each prototype are computed, by comparing to the threshold the normalized distances, as follows:

$$\tilde{p}_c = \frac{\sum_i h(x_i)\, z_{i,c} + \sum_j h(\tilde{x}_j)\, \tilde{z}_{j,c}\, m_{j,c}}{\sum_i z_{i,c} + \sum_j \tilde{z}_{j,c}\, m_{j,c}}, \quad \text{where } m_{j,c} = \sigma\big(-\gamma_c(\tilde{d}_{j,c} - \beta_c)\big) \tag{9}$$

where $\sigma(\cdot)$ is the sigmoid function.

When training with this refinement process, the model can now use its MLP in Equation 8 to learn to include or ignore entirely certain unlabeled examples. The use of soft masks makes this process entirely differentiable. Finally, much like for regular soft k-means (with or without a distractor cluster), while we could recursively repeat the refinement for multiple steps, we found a single step to perform well enough.

[Figure: training and testing episodes, each with a support set (classes 1, 2, 3), an unlabeled set and a query set (?)]
Figure 2: Example of the semi-supervised few-shot learning setup. Training involves iterating through episodes, consisting of a support set S, an unlabeled set R, and a query set Q. The goal is to use the items (shown with their numeric class label) in S and the unlabeled items in R within each episode to ge[…]

• Meta-Learning for Semi-supervised Few-Shot Classification (2018)
Ren, Triantafillou, Ravi, Snell, Swersky, Tenenbaum, Larochelle and Zemel
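The masked prototype refinement of Equation 9 can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation. It assumes squared Euclidean distances, soft labels from a softmax over negative distances, and distances normalized by their mean per prototype. The per-class threshold `beta` and slope `gamma` are passed in as fixed numbers here, whereas the paper predicts them with the MLP of Equation 8.

```python
import math


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def refine_prototypes(support, labels, unlabeled, n_classes, beta, gamma):
    """One masked soft k-means step (Equation 9).

    support / unlabeled: lists of embedding vectors h(x_i) / h(x~_j).
    labels: class index of each support item (one-hot z_{i,c}).
    beta, gamma: per-class threshold and slope, fixed inputs here (assumption).
    Every class needs at least one support item.
    """
    dim = len(support[0])
    # Initial prototypes: per-class mean of the support embeddings.
    protos = []
    for c in range(n_classes):
        members = [h for h, y in zip(support, labels) if y == c]
        protos.append([sum(h[k] for h in members) / len(members) for k in range(dim)])

    # d[j][c]: squared distance from unlabeled item j to prototype c.
    d = [[sq_dist(h, p) for p in protos] for h in unlabeled]

    # Soft labels z~_{j,c}: softmax over classes of the negative distance (assumption).
    soft = []
    for row in d:
        lo = min(row)
        e = [math.exp(-(v - lo)) for v in row]
        soft.append([v / sum(e) for v in e])

    # Normalized distances: divide by the mean distance to each prototype (assumption).
    mean_d = [max(sum(row[c] for row in d) / len(d), 1e-12) for c in range(n_classes)]
    masks = [[sigmoid(-gamma[c] * (row[c] / mean_d[c] - beta[c]))
              for c in range(n_classes)] for row in d]

    # Equation 9: weighted mean over labeled items and masked unlabeled items.
    refined = []
    for c in range(n_classes):
        num = [0.0] * dim
        den = 0.0
        for h, y in zip(support, labels):
            if y == c:
                num = [n + v for n, v in zip(num, h)]
                den += 1.0
        for j, h in enumerate(unlabeled):
            w = soft[j][c] * masks[j][c]
            num = [n + w * v for n, v in zip(num, h)]
            den += w
        refined.append([n / den for n in num])
    return refined
```

With a steep slope, an unlabeled point far from every prototype gets a mask close to zero and barely moves the prototypes. With `gamma` set to zero every mask is 0.5 and the same point pulls the prototypes toward it.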
39

EXTENSIONS AND VARIATIONS

• Few-shot distribution estimation
‣ given D_train = {x_i}, produce p(x)

[Slide figure: D_train and Attention PixelCNN samples]

From the paper:

To overcome this difficulty, we propose to replace the simple encoder function $f(s)$ with a context-sensitive attention mechanism $f_t(s, x_{<t})$. It produces an encoding of the context that depends on the image generated up until the current step $t$. The weights are shared over $t$.

We will use the following notation. Let the target image be $x \in \mathbb{R}^{H \times W \times 3}$ and the support set images be $s \in \mathbb{R}^{S \times H \times W \times 3}$, where $S$ is the number of supports.

To capture texture information, we encode all supporting images with a shallow convolutional network, typically only two layers. Each hidden unit of the resulting feature map will have a small receptive field, e.g. corresponding to a $10 \times 10$ patch in a support set image. We encode these support images into a set of spatially-indexed key and value vectors.

After encoding the support images in parallel, we reshape the resulting $S \times K \times K \times 2P$ feature maps to squeeze out the spatial dimensions, resulting in a $SK^2 \times 2P$ matrix.

[Diagram: support image s and target image x (W×H×3); keys p_key and values p_value (K×K×P); query q_t; mul; attention weights α (K×K×1); reduce sum; output f_t(s, x<t) (1×1×P)]
Figure 2: The PixelCNN attention mechanism.

Table 2: Omniglot NLL in nats/pixel with four support examples. Attention Meta PixelCNN is a model combining attention with gradient-based weight updates for few-shot learning.

Meta PixelCNN also achieves state-of-the-art likelihoods, only outperformed by Attention PixelCNN (see Table 2). Naively combining attention and meta learning does not seem to help. However, there are likely more effective ways to combine attention and meta learning, such as varying the inner loss function or using multiple meta-gradient steps, which could be future work.

[Panels: Supports, PixelCNN, Attention PixelCNN, Meta PixelCNN]
Figure 4: Typical Omniglot samples from PixelCNN, Attention PixelCNN, and Meta PixelCNN.

Figure 1 shows several key frames of the attention model sampling Omniglot. Within each column, the left part shows the 4 support set images. The red overlay indicates the attention head read […]

• Few-shot Autoregressive Density Estimation: Towards Learning to Learn Distributions (2018)
Reed, Chen, Paine, van den Oord, Eslami, Rezende, Vinyals, de Freitas
ai-on.org/projects
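The attention read described in the excerpt, a query q_t compared against the SK² key vectors followed by a weighted sum of the value vectors, might look like the following sketch. It assumes dot-product scores normalized with a softmax, which the excerpt does not state explicitly, and the function name `attention_read` is hypothetical.

```python
import math


def attention_read(kv, query):
    """Read a context vector f_t(s, x_<t) from encoded support images.

    kv: list of SK^2 rows of length 2P (first P entries = key, last P = value).
    query: length-P vector q_t computed from the pixels generated so far.
    Returns (context, alpha), where alpha are the attention weights.
    """
    p = len(query)
    keys = [row[:p] for row in kv]
    values = [row[p:] for row in kv]

    # Dot-product score between the query and every key ("mul" in the diagram).
    scores = [sum(k_i * q_i for k_i, q_i in zip(k, query)) for k in keys]

    # Softmax over all SK^2 positions (assumption), shifted for stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]

    # Weighted sum of the values ("reduce sum" in the diagram).
    context = [sum(a * v[d] for a, v in zip(alpha, values)) for d in range(p)]
    return context, alpha
```

Because the query changes at every generation step t while the keys and values are computed once per support set, the same encoded matrix can be reused for every pixel.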
41

EXTENSIONS AND VARIATIONS


• Cold-start item recommendation
‣ given positive/negative items for a user, produce parameters of engagement predictor for new item

[Figure, for User 1 and User 2: item embeddings F(t1) … F(t8) from each user's positive class and negative class are aggregated by G]

(a) Linear Classifier with Weight Adaptation. Changes in the shading of each connection with the output unit for two users illustrates that the weights of the classifier vary based on each user’s item history. The output bias indicated by the shades of the circles however remains the same.

(b) Non-linear Classifier with Bias Adaptation. Changes in the shading of each unit between two users illustrates that the biases of these units vary based on each user’s item history. The weights however remain the same.

• A Meta-Learning Perspective on Cold-Start Recommendations for Items (2017)
Manasi Vartak, Arvind Thiagarajan, Conrado Miranda, Jeshua Bratman, Hugo Larochelle
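Panel (a) can be made concrete with a small sketch: the classifier weights are built from a user's item history while the bias stays shared across users. Several details are assumptions made for illustration only: G is taken to be a plain mean of the item embeddings F(t), the two class representatives are combined with hypothetical scalars `w_pos` and `w_neg`, and the output is a logistic unit.

```python
import math


def mean_vector(vectors):
    """Aggregate a set of item embeddings; stands in for G (assumed to be a mean)."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def adapted_weights(pos_items, neg_items, w_pos=1.0, w_neg=-1.0):
    """User-specific weights from the user's positive and negative item embeddings.

    w_pos and w_neg are hypothetical shared scalars; in a trained model they
    would be learned across users.
    """
    r_pos = mean_vector(pos_items)
    r_neg = mean_vector(neg_items)
    return [w_pos * p + w_neg * n for p, n in zip(r_pos, r_neg)]


def predict_engagement(new_item, weights, bias=0.0):
    """Probability that the user engages with a new item; the bias is shared."""
    score = bias + sum(w * x for w, x in zip(weights, new_item))
    return 1.0 / (1.0 + math.exp(-score))
```

Two users with opposite histories get opposite weights and therefore opposite predictions for the same new item, even though no parameter was trained on that item.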
42

EXTENSIONS AND VARIATIONS


• Learning a data-augmentation network
‣ add to D_train examples produced by a generator network that also takes a random z vector as input

[Diagram labels: Sample, Noise, G, h, S_train, (image, heron)]
Figure 2. Meta-learning with hallucination. Given an initial training set S_train, we create an augmented training set S_train^aug by […]

• Low-Shot Learning from Imaginary Data (2018)
Yu-Xiong Wang, Ross Girshick, Martial Hebert, Bharath Hariharan
43

DISCUSSION
• What is the right definition of distributions over problems?
‣ varying number of classes / examples per class ?
‣ semantic differences between meta-training vs. meta-testing classes ?
‣ overlap in meta-training vs. meta-testing classes ? (recent “low-shot” literature)

• Move from static to interactive learning
‣ how should this impact how we generate episodes ?
‣ meta-active learning ? (few successes so far)
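One way to make "a distribution over problems" concrete is an episode sampler in which the number of classes and the number of examples per class vary from episode to episode, and meta-training and meta-testing classes are kept disjoint. The sketch below is only an illustration of those choices. The function names, the ranges and the fixed query size are all assumptions.

```python
import random


def split_classes(class_names, n_test, rng):
    """Split classes into disjoint meta-training and meta-testing pools."""
    names = sorted(class_names)
    rng.shuffle(names)
    return names[n_test:], names[:n_test]


def sample_episode(data_by_class, rng, way_range=(2, 5), shot_range=(1, 5), n_query=2):
    """Sample one episode with a random number of classes and shots per class.

    data_by_class: dict mapping class name -> list of examples; every class
    needs at least shot_range[1] + n_query examples.
    Returns (support, query) lists of (example, episode_label) pairs.
    """
    n_way = rng.randint(*way_range)
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        k_shot = rng.randint(*shot_range)
        items = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in items[:k_shot]]
        query += [(x, label) for x in items[k_shot:]]
    return support, query
```

Restricting `data_by_class` to the meta-training pool during training and to the meta-testing pool at evaluation time gives the no-overlap setting. Allowing the pools to overlap corresponds to the "low-shot" variant mentioned above.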
44

THANK YOU!
