
Unsupervised Discovery of Non-Linear Structure using Contrastive Backpropagation

G. E. Hinton, M. Welling, Y. W. Teh

Department of Computer Science

University of Toronto

Toronto, Canada M5S 3G4

and

S. Osindero

Gatsby Computational Neuroscience Unit

University College, London

London, UK WC1N 3AR

June 3, 2003

Abstract

We describe a way of modelling high-dimensional data-vectors by using an unsupervised, non-linear, multilayer neural network in which the activity of each neuron-like unit makes an additive contribution to a global energy score that indicates how surprised the network is by the data-vector. Units whose activities represent violations of learned constraints contribute positively to the global energy and units whose activities represent the presence of familiar features contribute negatively. The connection weights which determine how the activity of each unit depends on the activities in earlier layers are learned by minimizing the energy assigned to data-vectors that are actually observed and maximizing the energy assigned to "confabulations". The confabulations are generated by perturbing an observed data-vector in a direction that decreases its energy under the current model, so confabulations are things that the model prefers to real data. Backpropagation of energy derivatives through the multilayer network is used both for computing the derivatives that are needed to adjust the weights and for computing how to perturb an observed data-vector to produce a confabulation that has lower energy.

The backpropagation algorithm (ref) trains the units in the intermediate layers of a feedforward neural net to represent features of the input vector that are useful for predicting the desired output. This is achieved by propagating information about the discrepancy between the actual output and the desired output backwards through the net to compute how to change the connection weights in a direction that reduces the discrepancy. In this paper we show how to use backpropagation to learn features and constraints when each input vector is not accompanied by a supervision signal that specifies the desired output.
When no desired output is specified, it is not immediately obvious what the goal of learning should be. We assume here that the aim is to characterize the observed data in terms of many different features and constraints that can be interpreted as hidden factors. These hidden factors could be used for subsequent decision making or they could be used to detect highly improbable data-vectors by using the global energy. We define the probability that the network assigns to a data-vector, $x$, in terms of its global energy, $E(x)$:

$$p(x) = \frac{e^{-E(x)}}{\sum_v e^{-E(v)}} \qquad (1)$$

The quality of the set of features and constraints discovered by the neural network can then be quantified by the summed log probability that gets assigned to the observed data-vectors. The contribution of a single data-vector to this sum is:

$$\log p(x) = -E(x) - \log \sum_v e^{-E(v)} \qquad (2)$$
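To make Eqs. 1 and 2 concrete, here is a minimal sketch (not taken from the paper) that enumerates a tiny binary data space so that the normalizing sum can be computed exactly; the quadratic `energy` and the 4-bit space are arbitrary illustrative choices.

```python
import itertools
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def energy(v, W):
    # Placeholder quadratic energy; in the paper the energy comes from a multilayer net.
    return jnp.dot(v, W @ v)

# Enumerate every conceivable data-vector of a tiny 4-bit space so the
# partition function in Eq. 1 can be computed exactly.
all_v = jnp.array(list(itertools.product([0.0, 1.0], repeat=4)))

def log_p(x, W):
    # Eq. 2: log p(x) = -E(x) - log sum_v exp(-E(v))
    all_energies = jnp.array([energy(v, W) for v in all_v])
    return -energy(x, W) - logsumexp(-all_energies)

W = 0.5 * jnp.eye(4)
print(log_p(jnp.array([1.0, 0.0, 1.0, 0.0]), W))
```

In a realistic, high-dimensional data space this enumeration is impossible, which is what motivates the sampling approximation discussed below.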

The features and constraints can be improved by repeatedly adjusting the weights on the connections so as to maximize the log probability of the observed data. To perform gradient ascent in the log likelihood we would need to compute exact derivatives of the form:

$$\frac{\partial \log p(x)}{\partial w_{ij}} = -\frac{\partial E(x)}{\partial w_{ij}} + \sum_v p(v)\,\frac{\partial E(v)}{\partial w_{ij}} \qquad (3)$$

where $w_{ij}$ is the weight on the connection from unit $i$ in one layer to unit $j$ in the next layer.
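In more detail, Eq. 3 follows by differentiating Eq. 2, since the derivative of the log partition term reintroduces the Boltzmann probabilities of Eq. 1:

$$\frac{\partial \log p(x)}{\partial w_{ij}}
= -\frac{\partial E(x)}{\partial w_{ij}} - \frac{\partial}{\partial w_{ij}}\log\sum_v e^{-E(v)}
= -\frac{\partial E(x)}{\partial w_{ij}} + \sum_v \frac{e^{-E(v)}}{\sum_u e^{-E(u)}}\,\frac{\partial E(v)}{\partial w_{ij}}$$

and the ratio inside the sum is exactly $p(v)$.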
The first term is easy to compute. We assume that each unit, $j$, sums the weighted activities coming from units, $i$, in the layer below to get its total input, $z_j = \sum_i y_i w_{ij}$, where an activity $y_i$ in the layer below is equal to $x_i$ if it is the input layer. A smooth non-linear function of $z_j$ is then used to compute the unit's activity, $y_j$ (see figure ??). The energy contributed by the unit can be any smooth function of its activity.

After performing a forward pass through the network to compute the activities of all the units, we do a backward pass as described in ??. The backward pass uses the chain rule to compute $\partial E(x)/\partial w_{ij}$ for every connection weight, and by backpropagating all the way to the inputs we can also compute $\partial E(x)/\partial x_i$ for each component, $x_i$, of the input vector.
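As a rough sketch of these two passes (assuming a JAX-style implementation; the layer sizes, tanh non-linearities and sum-of-squares energy are illustrative placeholders, not the functions used in the paper):

```python
import jax
import jax.numpy as jnp

def forward_energy(params, x):
    """Forward pass: each unit's activity makes an additive contribution to the
    global energy. The non-linearities and per-unit energies here are placeholders."""
    (W1, b1), (W2, b2) = params
    y1 = jnp.tanh(x @ W1 + b1)      # first hidden layer activities
    y2 = jnp.tanh(y1 @ W2 + b2)     # second hidden layer activities
    return jnp.sum(y2 ** 2)         # global energy: sum of per-unit energies

# Toy parameters for a 15-input network with 15 and 5 hidden units.
key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
params = ((0.1 * jax.random.normal(k1, (15, 15)), jnp.zeros(15)),
          (0.1 * jax.random.normal(k2, (15, 5)), jnp.zeros(5)))
x = jnp.ones(15)

# Backward pass: dE/dw for every weight (to adjust the weights) and dE/dx for
# every input component (to perturb the data-vector towards lower energy).
dE_dparams, dE_dx = jax.grad(forward_energy, argnums=(0, 1))(params, x)
```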
Unfortunately, the second term in Eq. 3 is much harder to deal with. It involves a weighted average of the derivatives from all conceivable data-vectors so it cannot be computed efficiently except in special cases. We usually expect, however, that the probability mass will be concentrated on a very small fraction of the conceivable data-vectors, so it seems reasonable to approximate the last term by averaging $\partial E(x)/\partial w_{ij}$ over a relatively small number of samples from the distribution $p(\cdot)$. One way to generate samples from this distribution is to run a Markov chain that simulates a physical process with thermal noise. If we think of the dataspace as forming a horizontal plane and we represent the energy of each data-vector as height, the neural network defines a potential energy surface whose height and gradient are easy to compute. We imagine a particle on this surface that tends to move downhill but is also jittered by additional Gaussian noise. After enough steps, the particle will have lost all information about where it started and if we use small enough steps, its probability of being at any particular point in the dataspace will be given by the Boltzmann distribution in Eq. 1. This is a painfully slow way of generating samples and even if the equilibrium distribution is reached, the variance created by sampling may mask the true learning signal.
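One such Markov chain step can be sketched as a noisy gradient descent move on the energy surface (a Langevin-style simplification of the Hybrid Monte Carlo procedure described in note 1; `energy_fn` is any differentiable energy, e.g. the `forward_energy` sketch above with its parameters closed over, and the step size is arbitrary):

```python
import jax
import jax.numpy as jnp

def noisy_descent_step(energy_fn, x, key, step_size=0.01):
    """Move downhill on the energy surface and add Gaussian jitter. With small
    enough steps, the chain eventually samples the Boltzmann distribution of Eq. 1."""
    g = jax.grad(energy_fn)(x)                  # dE/dx from the backward pass
    noise = jax.random.normal(key, x.shape)
    return x - step_size * g + jnp.sqrt(2.0 * step_size) * noise
```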
Rather surprisingly, it is unnecessary to allow the simulated physical process to reach the equilibrium distribution. If we start the process at an observed data-vector and just run it for a few steps, we can generate a "confabulation" that works very well for adjusting the weights (Hinton, 2002). Intuitively, if the Markov chain starts to diverge from the data in a systematic way, we already have evidence that the model is imperfect and that it can be improved (in this local region of the dataspace) by reducing the energy of the initial data-vector and raising the energy of the confabulation. So the learning procedure cycles through the observed data-vectors adjusting each weight by:

$$\Delta w_{ij} = \epsilon \left( -\frac{\partial E(x)}{\partial w_{ij}} + \frac{\partial E(\hat{x})}{\partial w_{ij}} \right) \qquad (4)$$

where $\epsilon$ is a learning rate and $\hat{x}$ is a confabulation produced by starting at $x$ and noisily following the gradient of the energy surface for a few steps (see note 1).
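Combining Eq. 4 with the confabulation procedure gives one contrastive update per data-vector. The sketch below reuses the `forward_energy` and `noisy_descent_step` sketches above; the learning rate and the number of Markov-chain steps are arbitrary choices:

```python
import jax

def contrastive_update(params, x, key, eps=0.001, n_steps=3):
    """One cycle of the learning procedure for a single observed data-vector."""
    # Generate a confabulation x_hat by noisily following the energy gradient.
    x_hat = x
    for k in jax.random.split(key, n_steps):
        x_hat = noisy_descent_step(lambda v: forward_energy(params, v), x_hat, k)

    # Eq. 4: lower the energy of the data, raise the energy of the confabulation.
    g_data = jax.grad(forward_energy)(params, x)
    g_confab = jax.grad(forward_energy)(params, x_hat)
    return jax.tree_util.tree_map(
        lambda p, gd, gc: p - eps * (gd - gc), params, g_data, g_confab)
```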
To illustrate the learning procedure, we applied it to the task of discovering the non-linear kinematic constraints in a simulated three-dimensional "arm" that has five rigid links and six ball-joints. The first ball-joint attaches the arm to the origin, and each data-vector consists of the 15 cartesian coordinates of the remaining five ball-joints. This apparently 15-dimensional data really has only 10 degrees of freedom because of the 5 one-dimensional constraints imposed by the 5 rigid links. These constraints are of the form:
$$(x_i - x_{i+1})^2 + (y_i - y_{i+1})^2 + (z_i - z_{i+1})^2 - l^2 = 0 \qquad (5)$$

where $i$ and $i+1$ index neighboring joints and $l$ is the length of the link between them. Because the constraints are highly non-linear, dimensionality-reduction methods like principal components analysis or factor analysis are of little help.
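As an aside on how such data can be generated (this is our illustration, not the paper's simulation code), chaining five fixed-length links with random orientations produces 15-D vectors that satisfy the five constraints of Eq. 5 by construction; the unit link length and the uniform sampling of link directions are arbitrary choices:

```python
import jax
import jax.numpy as jnp

def sample_arm(key, n_links=5, link_length=1.0):
    """Sample one 15-D data-vector: the cartesian coordinates of the five
    ball-joints attached to the origin by a chain of rigid links."""
    d = jax.random.normal(key, (n_links, 3))
    d = d / jnp.linalg.norm(d, axis=1, keepdims=True)   # random unit directions
    joints = jnp.cumsum(link_length * d, axis=0)        # positions of joints 1..5
    return joints.reshape(-1)                           # (x1, y1, z1, ..., x5, y5, z5)
```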
We used a neural net with 15 input units and two hidden layers. Each of the 15 units in the first hidden layer computes a weighted average of the inputs and then squares it. Each of the 5 units in the top layer computes a weighted average of the squares provided by the first hidden layer and adds a learned bias. It is clear that with the right weights and biases, each top-layer unit could implement one of the constraints represented by equation 5 by producing an output of exactly zero if and only if the constraint is satisfied. The question is whether the network can discover the appropriate weights and biases just by observing the data.
For this example, the units in the first hidden layer do not contribute to the energy and the units in the second hidden layer each contribute an energy of $\alpha_j \log(1 + y_j^2)$, where $y_j$ is the activity of unit $j$ in the second hidden layer and $\alpha_j$ is a scale parameter that is also learned by contrastive backpropagation. This "heavy-tailed" energy function penalizes top-level units with non-zero outputs, but changing the output has little effect on the penalty if the output is already large.
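A sketch of this particular network as an energy function (the parameter shapes follow the description above; the initialization and the exact parameterization of the learned scale $\alpha_j$ are our illustrative choices):

```python
import jax.numpy as jnp

def arm_energy(params, x):
    """15 inputs -> 15 squaring units -> 5 top-level units, each contributing
    an energy of alpha_j * log(1 + y_j^2)."""
    W1, W2, b2, alpha = params       # shapes: (15, 15), (15, 5), (5,), (5,)
    h = (x @ W1) ** 2                # first hidden layer: weighted sum, then square
    y = h @ W2 + b2                  # top layer: weighted sum of squares plus bias
    return jnp.sum(alpha * jnp.log1p(y ** 2))
```

Its gradients with respect to `params` and to `x` can be taken with `jax.grad` exactly as in the generic sketch above, so the same contrastive update applies.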
Figure 1 shows the weights and top-level biases that were learned by contrastive backpropagation. For each pair of neighboring joints, there are three units in the first hidden layer that have learned to compute differences between the coordinates of the two joints. These differences are always computed in three orthogonal directions. Each unit in the second hidden layer has learned a linear combination of the 5 constraints, but it uses weights of exactly the same size for the three squared differences in each constraint so that it can exactly cancel the fixed sum of these three squared differences by using its bias.
The contrastive backpropagation learning procedure is quite flexible. It puts no constraints other than smoothness on the activation functions or the functions for converting activations into energy contributions. The procedure can easily be modified to use recurrent neural networks that receive time-varying inputs such as video sequences. The energy of a whole sequence is simply defined to be some function of the time history of the activations of the hidden units. Backpropagation through time (ref) can then be used to obtain the derivatives of the energy w.r.t. the connection weights and also the energy gradients required for generating a whole confabulated sequence.
Notes

1. We used a simplified version of the Hybrid Monte Carlo procedure in which the particle is given a random initial momentum and its deterministic trajectory along the energy surface is then simulated for 10 time steps. If this simulation has no numerical errors the increase, $\Delta E$, in the combined potential and kinetic energy will be zero. If $\Delta E$ is positive, the particle is returned to its initial position with a probability of $1 - \exp(-\Delta E)$. The step size is slowly adapted so that only about 10% of the trajectories get rejected. Numerical errors up to second order are eliminated by using a "leapfrog" method (Neal, 19??) which uses the potential energy gradient at time $t$ to compute the velocity increment between time $t - \frac{1}{2}$ and $t + \frac{1}{2}$ and uses the velocity at time $t + \frac{1}{2}$ to compute the position increment between time $t$ and $t + 1$.
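A rough sketch of this simplified move (the energy function and step size are placeholders, and the slow adaptation of the step size towards a roughly 10% rejection rate is omitted):

```python
import jax
import jax.numpy as jnp

def hmc_move(energy_fn, x, key, step_size=0.05, n_steps=10):
    """Random initial momentum, n_steps leapfrog steps, then return to the
    starting position with probability 1 - exp(-dE) when dE is positive."""
    key_p, key_u = jax.random.split(key)
    grad_E = jax.grad(energy_fn)

    p = jax.random.normal(key_p, x.shape)            # random initial momentum
    start_H = energy_fn(x) + 0.5 * jnp.sum(p ** 2)   # potential + kinetic energy

    x_new = x
    p = p - 0.5 * step_size * grad_E(x_new)          # half-step for the velocity
    for _ in range(n_steps - 1):
        x_new = x_new + step_size * p                # full position step
        p = p - step_size * grad_E(x_new)            # full velocity step
    x_new = x_new + step_size * p
    p = p - 0.5 * step_size * grad_E(x_new)          # final velocity half-step

    dE = energy_fn(x_new) + 0.5 * jnp.sum(p ** 2) - start_H
    accept = jnp.log(jax.random.uniform(key_u)) < -dE   # always accepted if dE <= 0
    return jnp.where(accept, x_new, x)
```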

2. If the confabulations are generated by picking a value for one component of the data-vector from its posterior distribution given all the other components, our method is a version of pseudo-likelihood fitting (Besag; Heckerman) in which the use of a global energy function ensures that all of the conditional distributions are consistent.

Acknowledgments

We would like to thank David MacKay, Radford Neal, Sam Roweis, Zoubin Ghahramani, Chris Williams, Carl Rasmussen, Brian Sallans, Javier Movellan and Tim Marks for helpful discussions. This research was supported by the Gatsby Charitable Foundation and NSERC.

Figure 1: The areas of the small black and white rectangles represent the magnitudes of the negative and positive weights learned by the network. Each column in the lower block represents the weights on the connections to a unit in the middle layer from the joint coordinates $x_1, y_1, z_1, x_2, y_2, z_2, \ldots, x_5, y_5, z_5$. Each row in the higher block represents the weights on the connections to a top-level unit from the middle-layer units.
