University of Toronto
and
S. Osindero
June 3, 2003
Abstract
minimizing the energy assigned to data-vectors that are actually observed and maximizing the energy assigned to "confabulations". The confabulations are generated by perturbing an observed data-vector in a direction that decreases its energy under the current model, so confabulations are things that the model prefers to real data. Backpropagation of energy derivatives through the multilayer network is used both for computing the derivatives that are needed to adjust the weights and for computing how to perturb an observed data-vector to produce a confabulation that has lower energy.
The backpropagation algorithm (ref) trains the units in the intermediate layers of a feedforward neural net to represent features of the input vector that are useful for predicting the desired output. This is achieved by propagating information about the discrepancy between the actual output and the desired output backwards through the net to compute how to change the connection weights in a direction that reduces the discrepancy. In this paper we show how to use backpropagation to learn features and constraints when each input vector is not accompanied by a supervision signal that specifies the desired output.
When no desired output is specified, it is not immediately obvious what the goal of learning should be. We assume here that the aim is to characterize the observed data in terms of many different features and constraints that can be interpreted as hidden factors. These hidden factors could be used for subsequent decision making, or they could be used to detect highly improbable data-vectors by using the global energy. We define the probability that the network assigns to a data-vector, $\mathbf{x}$, in terms of its global energy, $E(\mathbf{x})$:

$$p(\mathbf{x}) = \frac{e^{-E(\mathbf{x})}}{\sum_{\mathbf{v}} e^{-E(\mathbf{v})}} \qquad (1)$$
The quality of the set of features and constraints discovered by the neural network can then be quantified by the summed log probability that gets assigned to the observed data-vectors. The contribution of a single data-vector to this sum is:

$$\log p(\mathbf{x}) = -E(\mathbf{x}) - \log \sum_{\mathbf{v}} e^{-E(\mathbf{v})} \qquad (2)$$
Differentiating Eq. 2 gives the learning gradient:

$$\frac{\partial \log p(\mathbf{x})}{\partial w_{ij}} = -\frac{\partial E(\mathbf{x})}{\partial w_{ij}} + \sum_{\mathbf{v}} p(\mathbf{v}) \frac{\partial E(\mathbf{v})}{\partial w_{ij}} \qquad (3)$$

where $w_{ij}$ is the weight on the connection from unit $i$ in one layer to unit $j$ in the next layer.
The first term is easy to compute. We assume that each unit, $j$, sums the weighted activities coming from units, $i$, in the layer below to get its total input, $z_j = \sum_i y_i w_{ij}$, where an activity $y_i$ in the layer below is equal to $x_i$ if it is the input layer. A smooth non-linear function of $z_j$ is then used to compute the unit's activity, $y_j$ (see figure ??). The energy contributed by the unit can be any smooth function of its activity.
After performing a forward pass through the network to compute the activities of all the units, we do a backward pass as described in ??. The backward pass uses the chain rule to compute $\partial E(\mathbf{x})/\partial w_{ij}$ for every connection weight, and by backpropagating all the way to the inputs we can also compute $\partial E(\mathbf{x})/\partial x_i$ for each component, $x_i$, of the input vector.
Unfortunately, the second term in Eq. 3 is much harder to deal with. It involves a weighted average of the derivatives from all conceivable data-vectors, so it cannot be computed efficiently except in special cases. We usually expect, however, that the probability mass will be concentrated on a very small fraction of the conceivable data-vectors, so it seems reasonable to approximate the last term by averaging $\partial E(\mathbf{x})/\partial w_{ij}$ over a relatively small number of samples from the distribution $p(\cdot)$. One way to generate samples from this distribution is to run a Markov chain that simulates a physical process with thermal noise. If we think of the dataspace as forming a horizontal plane and we represent the energy of each data-vector as height, the neural network defines a potential energy surface whose height and gradient are easy to compute. We imagine a particle on this surface that tends to move downhill but is also jittered by additional Gaussian noise. After enough steps, the particle will have lost all information about where it started and, if we use small enough steps, its probability of being at any particular point in the dataspace will be given by the Boltzmann distribution in Eq. 1. This is a painfully slow way of generating samples, and even if the equilibrium distribution is reached, the variance created by sampling may mask the true learning signal.
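Such a chain can be sketched as unadjusted Langevin dynamics: repeated small downhill moves on the energy surface, jittered by Gaussian noise. The function name and step size below are illustrative, and energy_grad stands for the gradient $\partial E(\mathbf{x})/\partial \mathbf{x}$ obtained by backpropagation (as in the backward sketch above).

def langevin_chain(x0, energy_grad, n_steps, step=0.01, rng=None):
    # Simulate a particle that tends to move downhill on the energy surface
    # but is jittered by Gaussian noise.  With small enough steps, its
    # stationary distribution approaches the Boltzmann distribution of Eq. 1.
    rng = rng or np.random.default_rng()
    x = x0.copy()
    for _ in range(n_steps):
        x = x - step * energy_grad(x)                               # downhill drift
        x = x + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)  # thermal noise
    return x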
Rather surprisingly, it is unnecessary to allow the simulated physical process to reach the equilibrium distribution. If we start the process at an observed data-vector and just run it for a few steps, we can generate a "confabulation" that works very well for adjusting the weights (Hinton, 2002). Intuitively, if the Markov chain starts to diverge from the data in a systematic way, we already have evidence that the model is imperfect and that it can be improved (in this local region of the dataspace) by reducing the energy of the initial data-vector and raising the energy of the confabulation. So the learning procedure cycles through the observed data-vectors, adjusting each weight by:
$$\Delta w_{ij} = \epsilon \left( -\frac{\partial E(\mathbf{x})}{\partial w_{ij}} + \frac{\partial E(\hat{\mathbf{x}})}{\partial w_{ij}} \right) \qquad (4)$$

where $\hat{\mathbf{x}}$ is the confabulation generated from the observed data-vector $\mathbf{x}$ and $\epsilon$ is a learning rate.
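Putting the pieces together, here is a sketch of one sweep of this learning rule, reusing the backward and langevin_chain sketches above; epsilon and the chain length k are illustrative hyperparameters.

def train_step(x, W, epsilon=0.001, k=5, step=0.01):
    # Start a short Markov chain at the observed data-vector and run it for
    # just a few steps to generate a confabulation x_hat.
    x_hat = langevin_chain(x, lambda v: backward(v, W)[2], n_steps=k, step=step)
    # Lower the energy of the data and raise the energy of the
    # confabulation, following Eq. 4.
    _, dE_dW_data, _ = backward(x, W)
    _, dE_dW_conf, _ = backward(x_hat, W)
    return W + epsilon * (-dE_dW_data + dE_dW_conf)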
$$(x_i - x_{i+1})^2 + (y_i - y_{i+1})^2 + (z_i - z_{i+1})^2 - l^2 = 0 \qquad (5)$$
where $i$ and $i+1$ index neighboring joints. Because the constraints are highly non-linear, dimensionality-reduction methods like principal components analysis or factor analysis are of little help.
We used a neural net with 15 input units and two hidden layers. Each of the 15 units in the first hidden layer computes a weighted average of the inputs and then squares it. Each of the 5 units in the top layer computes a weighted average of the squares provided by the first hidden layer and adds a learned bias. It is clear that with the right weights and biases, each top-layer unit could implement one of the constraints represented by equation 5 by producing an output of exactly zero if and only if the constraint is satisfied. The question is whether the network can discover the appropriate weights and biases just by observing the data.
For this example, the units in the first hidden layer do not contribute to the energy, and the units in the second hidden layer each contribute an energy of $\alpha_j \log(1 + y_j^2)$, where $y_j$ is the activity of unit $j$ in the second hidden layer and $\alpha_j$ is a scale parameter.
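A sketch of the energy computed by this particular net, reusing numpy as np from the sketches above; shapes follow the architecture just described (15 inputs, 15 squaring units, 5 top-layer units), and all names are illustrative.

def constraint_net_energy(x, W1, W2, b2, alpha):
    # x: 15 joint coordinates; W1: (15, 15); W2: (15, 5); b2, alpha: (5,).
    s = (x @ W1) ** 2   # first hidden layer: weighted sum, then square
    y = s @ W2 + b2     # top layer: weighted sum of squares plus learned bias
    # First-layer units contribute no energy; each top-layer unit contributes
    # alpha_j * log(1 + y_j^2), which is minimal exactly when y_j = 0,
    # i.e. when that unit's constraint is satisfied.
    return np.sum(alpha * np.log(1.0 + y ** 2))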
weights and also the energy gradients required for generating a whole confabulated sequence.
Notes
1. We used a simplified version of the Hybrid Monte Carlo procedure in which the particle is given a random initial momentum and its deterministic trajectory along the energy surface is then simulated for 10 time steps. If this simulation has no numerical errors, the increase, $\Delta E$, in the combined potential and kinetic energy will be zero. If $\Delta E$ is positive, the particle is returned to its initial position with a probability of $1 - \exp(-\Delta E)$. The step size is slowly adapted so that only about 10% of the trajectories get rejected. Numerical errors up to second order are eliminated by using a "leapfrog" method (Neal, 19??) which uses the potential energy gradient at time $t$ to compute the velocity increment between time $t - \frac{1}{2}$ and $t + \frac{1}{2}$, and uses the velocity at time $t + \frac{1}{2}$ to compute the position increment between time $t$ and $t + 1$.
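For concreteness, here is a sketch of one step of such a simplified Hybrid Monte Carlo procedure with a leapfrog integrator, again reusing numpy as np. The step size is illustrative, and energy and energy_grad are assumed callables returning $E(\mathbf{x})$ and its gradient.

def hmc_step(x, energy, energy_grad, n_leapfrog=10, step=0.1, rng=None):
    rng = rng or np.random.default_rng()
    p = rng.standard_normal(x.shape)           # random initial momentum
    x_new, p_new = x.copy(), p.copy()
    p_new -= 0.5 * step * energy_grad(x_new)   # half step for the velocity
    for _ in range(n_leapfrog):                # simulate the trajectory
        x_new += step * p_new                  # full step for the position
        p_new -= step * energy_grad(x_new)     # full step for the velocity
    p_new += 0.5 * step * energy_grad(x_new)   # trim the last kick to a half step
    # Increase in combined potential and kinetic energy along the trajectory.
    dE = (energy(x_new) + 0.5 * p_new @ p_new) - (energy(x) + 0.5 * p @ p)
    if dE > 0 and rng.random() > np.exp(-dE):  # reject with probability 1 - exp(-dE)
        return x                               # particle returns to its start
    return x_new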
2. If the confabulations are generated by picking a value for one component of the data-vector from its posterior distribution given all the other components, our method is a version of pseudo-likelihood fitting (Besag, Heckerman) in which the use of a global energy function ensures that all of the conditional distributions are consistent.
Acknowledgments
We would like to thank David MacKay, Radford Neal, Sam Roweis, Zoubin Ghahramani, Chris Williams, Carl Rasmussen, Brian Sallans, Javier Movellan and Tim Marks for helpful discussions. This research was supported by the Gatsby Charitable Foundation and NSERC.
Figure 1: The areas of the small black and white rectangles represent the magnitudes of the negative and positive weights learned by the network. Each column in the lower block represents the weights on the connections to a unit in the middle layer from the joint coordinates $x_1, y_1, z_1, x_2, y_2, z_2, \ldots, x_5, y_5, z_5$. Each row in the higher block represents the weights on the connections to a top-level unit from the middle layer units.