
UCL Tutorial on:

Deep Belief Nets

(An updated and extended version of my 2007 NIPS tutorial)

Geoffrey Hinton
Canadian Institute for Advanced Research
& Department of Computer Science, University of Toronto

"&(edule for t(e Tutorial


* 2+00 , -+-0 Tutorial part . * -+-0 , -+/0 1uestions * -+/0 2 /+.0 Tea Brea3

* /+.0 , 0+/0 Tutorial part 2 * 0+/0 , 4+00 1uestions

"ome t(in5s you 6ill learn in t(is tutorial


* %o6 to learn multi2layer 5enerative models of unla7elled data 7y learnin5 one layer of features at a time+ , %o6 to add 8ar3ov 'andom 9ields in ea&( (idden layer+ * %o6 to use 5enerative models to ma3e dis&riminative trainin5 met(ods 6or3 mu&( 7etter for &lassifi&ation and re5ression+ , %o6 to extend t(is approa&( to $aussian !ro&esses and (o6 to learn &omplex: domain2spe&ifi& 3ernels for a $aussian !ro&ess+ * %o6 to perform non2linear dimensionality redu&tion on very lar5e datasets , %o6 to learn 7inary: lo62dimensional &odes and (o6 to use t(em for very fast do&ument retrieval+ * %o6 to learn multilayer 5enerative models of (i5(2 dimensional se;uential data+

A spe&trum of ma&(ine learnin5 tas3s Typi&al "tatisti&s222222222222Artifi&ial ntelli5en&e


* Lo62dimensional data (e+5+ less t(an .00 dimensions# Lots of noise in t(e data T(ere is not mu&( stru&ture in t(e data: and 6(at stru&ture t(ere is: &an 7e represented 7y a fairly simple model+ * * * * * %i5(2dimensional data (e+5+ more t(an .00 dimensions# T(e noise is not suffi&ient to o7s&ure t(e stru&ture in t(e data if 6e pro&ess it ri5(t+ T(ere is a (u5e amount of stru&ture in t(e data: 7ut t(e stru&ture is too &ompli&ated to 7e represented 7y a simple model+ T(e main pro7lem is fi5urin5 out a 6ay to represent t(e &ompli&ated stru&ture so t(at it &an 7e learned+

T(e main pro7lem is distin5uis(in5 true stru&ture from noise+

%istori&al 7a&35round:
9irst 5eneration neural net6or3s
* !er&eptrons (<.=40# used a layer of (and2 &oded features and tried to re&o5ni>e o7?e&ts 7y learnin5 (o6 to 6ei5(t t(ese features+ , T(ere 6as a neat learnin5 al5orit(m for ad?ustin5 t(e 6ei5(ts+ , But per&eptrons are fundamentally limited in 6(at t(ey &an learn to do+
Bom7 Toy

output units e+5+ &lass la7els

non2adaptive (and2&oded features

input units e+5+ pixels

"3et&( of a typi&al per&eptron from t(e .=40@s

"e&ond 5eneration neural net6or3s (<.=A0#


Ba&32propa5ate error si5nal to 5et derivatives for learnin5
Compare outputs 6it( &orre&t ans6er to 5et error si5nal

outputs

(idden layers

input ve&tor

A temporary digression

* Vapnik and his co-workers developed a very clever type of perceptron called a Support Vector Machine.
  - Instead of hand-coding the layer of non-adaptive features, each training example is used to create a new feature using a fixed recipe.
    * The feature computes how similar a test example is to that training example.
  - Then a clever optimization technique is used to select the best subset of the features and to decide how to weight each feature when classifying a test case.
* But it's just a perceptron and has all the same limitations.

In the 1990's, many researchers abandoned neural networks with multiple adaptive hidden layers because Support Vector Machines worked better.

What is wrong with back-propagation?

* It requires labeled training data.
  - Almost all data is unlabeled.
* The learning time does not scale well.
  - It is very slow in networks with multiple hidden layers.
* It can get stuck in poor local optima.
  - These are often quite good, but for deep nets they are far from optimal.

Ever&omin5 t(e limitations of 7a&32 propa5ation


* Feep t(e effi&ien&y and simpli&ity of usin5 a 5radient met(od for ad?ustin5 t(e 6ei5(ts: 7ut use it for modelin5 t(e stru&ture of t(e sensory input+ , Ad?ust t(e 6ei5(ts to maximi>e t(e pro7a7ility t(at a 5enerative model 6ould (ave produ&ed t(e sensory input+ , Learn p(ima5e# not p(la7el G ima5e#
* f you 6ant to do &omputer vision: first learn &omputer 5rap(i&s

* C(at 3ind of 5enerative model s(ould 6e learnD

Belief Nets

* A belief net is a directed acyclic graph composed of stochastic variables.
* We get to observe some of the variables and we would like to solve two problems:
  - The inference problem: Infer the states of the unobserved variables.
  - The learning problem: Adjust the interactions between variables to make the network more likely to generate the observed data.

  [Diagram: stochastic hidden causes with directed connections down to visible effects.]

We will use nets composed of layers of stochastic binary variables with weighted connections. Later, we will generalize to other types of variable.

"to&(asti& 7inary units


(Bernoulli varia7les#
* T(ese (ave a state of . or 0+ * T(e pro7a7ility of turnin5 on is determined 7y t(e 6ei5(ted input from ot(er units (plus a 7ias#
.

p ( si = 1)
0 0

bi +

s j w ji

p ( si = 1) =

1 + exp( bi

s j w ji )
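A minimal sketch of such a unit in Python (the function and variable names are mine, not from the tutorial): the turn-on probability is the logistic of the bias plus the weighted input, and the state is a Bernoulli sample of that probability.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_binary_layer(s_below, W, b, rng=np.random.default_rng(0)):
    """Sample a layer of stochastic binary units given the layer below.

    s_below: states s_j of the units feeding in; W[j, i] holds w_ji; b holds b_i.
    Returns the sampled binary states and the turn-on probabilities.
    """
    p_on = logistic(b + s_below @ W)          # p(s_i = 1)
    states = (rng.random(p_on.shape) < p_on).astype(float)
    return states, p_on
```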

Learning Deep Belief Nets

* It is easy to generate an unbiased example at the leaf nodes, so we can see what kinds of data the network believes in.
* It is hard to infer the posterior distribution over all possible configurations of hidden causes.
* It is hard to even get a sample from the posterior.
* So how can we learn deep belief nets that have millions of parameters?

  [Diagram: stochastic hidden causes above, visible effects below.]

The learning rule for sigmoid belief nets

* Learning is easy if we can get an unbiased sample from the posterior distribution over hidden states given the observed data.
* For each unit, maximize the log probability that its binary state in the sample from the posterior would be generated by the sampled binary states of its parents.

For a unit i with parents j connected by weights $w_{ji}$:

  $p_i \equiv p(s_i = 1) = \frac{1}{1 + \exp(-\sum_j s_j w_{ji})}$

  $\Delta w_{ji} = \varepsilon \, s_j (s_i - p_i)$

where $\varepsilon$ is the learning rate.
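A hedged sketch of this delta rule, assuming we already have a binary sample (parent states and the child's state) drawn from the posterior; the names are illustrative, not from the tutorial.

```python
import numpy as np

def sbn_weight_update(s_parents, s_child, w, epsilon=0.1):
    """One step of the rule above for a single unit i: w_ji += eps * s_j * (s_i - p_i)."""
    p_i = 1.0 / (1.0 + np.exp(-s_parents @ w))   # probability the parents would turn i on
    return w + epsilon * s_parents * (s_child - p_i)
```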

Explaining away (Judea Pearl)

* Even if two hidden causes are independent, they can become dependent when we observe an effect that they can both influence.
  - If we learn that there was an earthquake, it reduces the probability that the house jumped because of a truck.

  [Diagram: "truck hits house" (bias -10) and "earthquake" (bias -10) each connect with weight +20 to "house jumps" (bias -20).]

  posterior over the two causes, given that the house jumped:
  p(1,1) = .0001   p(1,0) = .4999   p(0,1) = .4999   p(0,0) = .0001

Why it is usually very hard to learn sigmoid belief nets one layer at a time

* To learn W, we need the posterior distribution in the first hidden layer.
* Problem 1: The posterior is typically complicated because of "explaining away".
* Problem 2: The posterior depends on the prior as well as the likelihood.
  - So to learn W, we need to know the weights in higher layers, even if we are only approximating the posterior. All the weights interact.
* Problem 3: We need to integrate over all possible configurations of the higher variables to get the prior for the first hidden layer. Yuk!

  [Diagram: stacked layers of hidden variables (the prior) above the first hidden layer, which connects to the data through W (the likelihood).]

"ome met(ods of learnin5 deep 7elief nets


* 8onte Carlo met(ods &an 7e used to sample from t(e posterior+ , But its painfully slo6 for lar5e: deep models+ * n t(e .==0@s people developed variational met(ods for learnin5 deep 7elief nets , T(ese only 5et approximate samples from t(e posterior+ , Nevet(eless: t(e learnin5 is still 5uaranteed to improve a variational 7ound on t(e lo5 pro7a7ility of 5eneratin5 t(e o7served data+

The breakthrough that makes deep learning efficient

* To learn deep nets efficiently, we need to learn one layer of features at a time. This does not work well if we assume that the latent variables are independent in the prior:
  - The latent variables are not independent in the posterior, so inference is hard for non-linear models.
  - The learning tries to find independent causes using one hidden layer, which is not usually possible.
* We need a way of learning one layer at a time that takes into account the fact that we will be learning more hidden layers later.
  - We solve this problem by using an undirected model.

Two types of generative neural network

* If we connect binary stochastic neurons in a directed acyclic graph we get a Sigmoid Belief Net (Radford Neal, 1992).
* If we connect binary stochastic neurons using symmetric connections we get a Boltzmann Machine (Hinton & Sejnowski, 1983).
  - If we restrict the connectivity in a special way, it is easy to learn a Boltzmann machine.

'estri&ted Bolt>mann 8a&(ines


("molens3y :.=A4: &alled t(em K(armoniumsL# * Ce restri&t t(e &onne&tivity to ma3e learnin5 easier+ , Enly one layer of (idden units+
* Ce 6ill deal 6it( more layers later (idden ?

, No &onne&tions 7et6een (idden units+ * n an 'B8: t(e (idden units are &onditionally independent 5iven t(e visi7le states+ , "o 6e &an ;ui&3ly 5et an un7iased sample from t(e posterior distri7ution 6(en 5iven a data2ve&tor+ , T(is is a 7i5 advanta5e over dire&ted 7elief nets

i visi7le

The Energy of a joint configuration (ignoring terms to do with biases)

With binary state $v_i$ of visible unit i, binary state $h_j$ of hidden unit j, and weight $w_{ij}$ between units i and j, the energy with configuration v on the visible units and h on the hidden units is

  $E(v,h) = -\sum_{i,j} v_i h_j w_{ij}$

so the energy gradient for a weight is simply

  $-\frac{\partial E(v,h)}{\partial w_{ij}} = v_i h_j$


* Ha&( possi7le ?oint &onfi5uration of t(e visi7le and (idden units (as an ener5y , T(e ener5y is determined 7y t(e 6ei5(ts and 7iases (as in a %opfield net#+ * T(e ener5y of a ?oint &onfi5uration of t(e visi7le and (idden units determines its pro7a7ility:

p (v, h)

E ( v ,h )

* T(e pro7a7ility of a &onfi5uration over t(e visi7le units is found 7y summin5 t(e pro7a7ilities of all t(e ?oint &onfi5urations t(at &ontain it+

Using energies to define probabilities

* The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations:

  $p(v,h) = \frac{e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$

  (the denominator is the partition function)

* The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it:

  $p(v) = \frac{\sum_h e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$
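For an RBM small enough to enumerate, these definitions can be evaluated exactly by brute force. The sketch below is my own illustration (biases ignored, as on the slide); it computes the partition function and the marginal p(v).

```python
import itertools
import numpy as np

def energy(v, h, W):
    """E(v,h) = -v^T W h, ignoring bias terms as on the slide."""
    return -(v @ W @ h)

def exact_marginal(v, W):
    """p(v) = sum_h e^{-E(v,h)} / sum_{u,g} e^{-E(u,g)}, by enumeration."""
    n_vis, n_hid = W.shape
    hs = [np.array(g) for g in itertools.product([0, 1], repeat=n_hid)]
    us = [np.array(u) for u in itertools.product([0, 1], repeat=n_vis)]
    Z = sum(np.exp(-energy(u, g, W)) for u in us for g in hs)   # partition function
    return sum(np.exp(-energy(np.asarray(v), g, W)) for g in hs) / Z
```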

A pi&ture of t(e maximum li3eli(ood learnin5 al5orit(m for an 'B8


? ? ? ?

< vi h j > 0
i tJ0 i tJ. i tJ2

< vi h j >
i

a fantasy

t J infinity

"tart 6it( a trainin5 ve&tor on t(e visi7le units+ T(en alternate 7et6een updatin5 all t(e (idden units in parallel and updatin5 all t(e visi7le units in parallel+

log p (v) = < vi h j > 0 < vi h j > wij

A ;ui&3 6ay to learn an 'B8


? ?

< vi h j > 0
i tJ0 data

< vi h j > 1
i tJ. re&onstru&tion

"tart 6it( a trainin5 ve&tor on t(e visi7le units+ Update all t(e (idden units in parallel Update t(e all t(e visi7le units in parallel to 5et a Kre&onstru&tionL+ Update t(e (idden units a5ain+

wij = ( < vi h j > 0 < vi h j > 1 )


T(is is not follo6in5 t(e 5radient of t(e lo5 li3eli(ood+ But it 6or3s 6ell+ t is approximately follo6in5 t(e 5radient of anot(er o7?e&tive fun&tion (Carreira2!erpinan ) %inton: 2000#+
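A minimal CD-1 sketch for a binary RBM, with biases omitted to match the slides; the variable names are mine, not the tutorial's.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, epsilon=0.05, rng=np.random.default_rng(0)):
    """One update dw_ij = eps * (<v_i h_j>_0 - <v_i h_j>_1) on a minibatch v0."""
    ph0 = logistic(v0 @ W)                               # positive phase, driven by data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)     # sampled hidden states
    v1 = (rng.random(v0.shape) < logistic(h0 @ W.T)).astype(float)   # reconstruction
    ph1 = logistic(v1 @ W)                               # hidden probabilities again
    n = v0.shape[0]
    return W + epsilon * (v0.T @ ph0 - v1.T @ ph1) / n
```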

How to learn a set of features that are good for reconstructing images of the digit 2

  [Diagram: 50 binary feature neurons above a 16 x 16 pixel image. On the data (reality), increment weights between an active pixel and an active feature; on the reconstruction (better than reality), decrement weights between an active pixel and an active feature.]

The final 50 x 256 weights: each neuron grabs a different feature.

%o6 6ell &an 6e re&onstru&t t(e di5it ima5es from t(e 7inary feature a&tivationsD
'e&onstru&tion from a&tivated 7inary features 'e&onstru&tion from a&tivated 7inary features

Data

Data

Ne6 test ima5es from t(e di5it &lass t(at t(e model 6as trained on

ma5es from an unfamiliar di5it &lass (t(e net6or3 tries to see every ima5e as a 2#

Three ways to combine probability density models (an underlying theme of the tutorial)

* Mixture: Take a weighted average of the distributions.
  - It can never be sharper than the individual distributions. It's a very weak way to combine models.
* Product: Multiply the distributions at each point and then renormalize (this is how an RBM combines the distributions defined by each hidden unit).
  - Exponentially more powerful than a mixture. The normalization makes maximum likelihood learning difficult, but approximations allow us to learn anyway.
* Composition: Use the values of the latent variables of one model as the data for the next model.
  - Works well for learning multiple layers of representation, but only if the individual models are undirected.

Training a deep network (the main reason RBM's are interesting)

* First train a layer of features that receive input directly from the pixels.
* Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer (a sketch follows below).
* It can be proved that each time we add another layer of features we improve a variational lower bound on the log probability of the training data.
  - The proof is slightly complicated.
  - But it is based on a neat equivalence between an RBM and a deep directed model (described later).
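A self-contained sketch of this greedy procedure, assuming CD-1 as the per-layer learner; everything here is illustrative rather than the tutorial's own code.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_stack(data, layer_sizes, n_epochs=10, eps=0.05, rng=np.random.default_rng(0)):
    """Greedy layer-by-layer training: each RBM's hidden activities
    become the 'pixels' for the next RBM."""
    weights, x = [], data
    for n_hid in layer_sizes:
        W = 0.01 * rng.standard_normal((x.shape[1], n_hid))
        for _ in range(n_epochs):                     # CD-1 on this layer
            ph0 = logistic(x @ W)
            h0 = (rng.random(ph0.shape) < ph0).astype(float)
            v1 = logistic(h0 @ W.T)                   # mean-field reconstruction
            ph1 = logistic(v1 @ W)
            W += eps * (x.T @ ph0 - v1.T @ ph1) / x.shape[0]
        x = logistic(x @ W)                           # features of features next
        weights.append(W)
    return weights
```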

The generative model after learning 3 layers

* To generate data:
  1. Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling for a long time.
  2. Perform a top-down pass to get states for all the other layers.

So the lower-level bottom-up connections are not part of the generative model. They are just used for inference.

  [Diagram: h3 ↔ W3 ↔ h2 (the top-level RBM), then directed weights W2 down to h1 and W1 down to the data.]

Why does greedy learning work? An aside: Averaging factorial distributions

* If you average some factorial distributions, you do NOT get a factorial distribution.
  - In an RBM, the posterior over the hidden units is factorial for each visible vector.
  - But the aggregated posterior over all training cases is not factorial (even if the data was generated by the RBM itself).

Why does greedy learning work?

* Each RBM converts its data distribution into an aggregated posterior distribution over its hidden units.
* This divides the task of modeling its data into two tasks:
  - Task 1: Learn generative weights that can convert the aggregated posterior distribution over the hidden units back into the data distribution.
  - Task 2: Learn to model the aggregated posterior distribution over the hidden units.
  - The RBM does a good job of Task 1 and a moderately good job of Task 2.
* Task 2 is easier (for the next RBM) than modeling the original data because the aggregated posterior distribution is closer to a distribution that an RBM can model perfectly.

  [Diagram: Task 2 is modeling $p(h|W)$, the aggregated posterior distribution on the hidden units; Task 1 is $p(v|h,W)$, mapping back to the data distribution on the visible units.]

Why does greedy learning work?

The weights, W, in the bottom-level RBM define p(v|h) and they also, indirectly, define p(h). So we can express the RBM model as

  $p(v) = \sum_h p(h) \, p(v|h)$

If we leave p(v|h) alone and improve p(h), we will improve p(v). To improve p(h), we need it to be a better model of the aggregated posterior distribution over hidden vectors produced by applying W to the data.

C(i&( distri7utions are fa&torial in a dire&ted 7elief netD


* n a dire&ted 7elief net 6it( one (idden layer: t(e posterior over t(e (idden units p((Gv# is non2 fa&torial (due to explainin5 a6ay#+ , T(e a55re5ated posterior is fa&torial if t(e data 6as 5enerated 7y t(e dire&ted model+
* t@s t(e opposite 6ay round from an undire&ted model 6(i&( (as fa&torial posteriors and a non2 fa&torial prior p((# over t(e (iddens+ * T(e intuitions t(at people (ave from usin5 dire&ted models are very misleadin5 for undire&ted models+

Why does greedy learning fail in a directed module?

* A directed module also converts its data distribution into an aggregated posterior.
  - Task 1: The learning is now harder because the posterior for each training case is non-factorial.
  - Task 2 is performed using an independent prior. This is a very bad approximation unless the aggregated posterior is close to factorial.
* A directed module attempts to make the aggregated posterior factorial in one step.
  - This is too difficult and leads to a bad compromise. There is also no guarantee that the aggregated posterior is easier to model than the data distribution.

  [Diagram: Task 2 is modeling $p(h|W_2)$, the aggregated posterior distribution on the hidden units; Task 1 is $p(v|h,W_1)$, the data distribution on the visible units.]

A model of di5it re&o5nition


T(e top t6o layers form an asso&iative memory 6(ose ener5y lands&ape models t(e lo6 dimensional manifolds of t(e di5its+ T(e ener5y valleys (ave names

2000 top2level neurons

.0 la7el neurons

000 neurons

T(e model learns to 5enerate &om7inations of la7els and ima5es+ To perform re&o5nition 6e start 6it( a neutral state of t(e la7el units and do an up2pass from t(e ima5e follo6ed 7y a fe6 iterations of t(e top2level asso&iative memory+

000 neurons 2A x 2A pixel ima5e

Fine-tuning with a contrastive version of the "wake-sleep" algorithm

After learning many layers of features, we can fine-tune the features to improve generation.
1. Do a stochastic bottom-up pass.
   - Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.
2. Do a few iterations of sampling in the top-level RBM.
   - Adjust the weights in the top-level RBM.
3. Do a stochastic top-down pass.
   - Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.

"(o6 t(e movie of t(e net6or3 5eneratin5 di5its


(availa7le at 666+&s+torontoO<(inton#

"amples 5enerated 7y lettin5 t(e asso&iative memory run 6it( one la7el &lamped+ T(ere are .000 iterations of alternatin5 $i77s samplin5 7et6een samples+

Hxamples of &orre&tly re&o5ni>ed (and6ritten di5its t(at t(e neural net6or3 (ad never seen 7efore

ts very 5ood

%o6 6ell does it dis&riminate on 8N "T test set 6it( no extra information a7out 5eometri& distortionsD
* * * * * * * $enerative model 7ased on 'B8@s "upport Be&tor 8a&(ine (De&oste et+ al+# Ba&3prop 6it( .000 (iddens (!latt# Ba&3prop 6it( 000 22Q-00 (iddens F2Nearest Nei5(7or "ee Le Cun et+ al+ .==A for more results .+20P .+/P <.+4P <.+4P < -+-P

ts 7etter t(an 7a&3prop and mu&( more neurally plausi7le 7e&ause t(e neurons only need to send one 3ind of si5nal: and t(e tea&(er &an 7e anot(er sensory input+

Unsupervised "pre-training" also helps for models that have more data and better priors

* Ranzato et al. (NIPS 2006) used an additional 600,000 distorted digits.
* They also used convolutional multilayer neural networks that have some built-in, local translational invariance.

  Back-propagation alone: 0.49%
  Unsupervised layer-by-layer pre-training followed by backprop: 0.39% (record)

Another view of why layer-by-layer learning works (Hinton, Osindero & Teh 2006)

* There is an unexpected equivalence between RBM's and directed networks with many layers that all use the same weights.
  - This equivalence also gives insight into why contrastive divergence learning works.

An infinite sigmoid belief net that is equivalent to an RBM

* The distribution generated by this infinite directed net with replicated weights is the equilibrium distribution for a compatible pair of conditional distributions, p(v|h) and p(h|v), that are both defined by W.
  - A top-down pass of the directed net is exactly equivalent to letting a Restricted Boltzmann Machine settle to equilibrium.
  - So this infinite directed net defines the same distribution as an RBM.

  etc.
  h2
   ↓ W
  v2
   ↓ W^T
  h1
   ↓ W
  v1
   ↓ W^T
  h0
   ↓ W
  v0

nferen&e in a dire&ted net 6it( repli&ated 6ei5(ts


* T(e varia7les in (0 are &onditionally independent 5iven v0+ , nferen&e is trivial+ Ce ?ust multiply v0 7y C transpose+ , T(e model a7ove (0 implements a &omplementary prior+ , 8ultiplyin5 v0 7y C transpose 5ives t(e produ&t of t(e li3eli(ood term and t(e prior term+ * nferen&e in t(e dire&ted net is exa&tly e;uivalent to lettin5 a 'estri&ted Bolt>mann 8a&(ine settle to e;uili7rium startin5 at t(e data+

et&+ (2

WT

v2
WT

(.
W

v. R R R
WT

(0 R W v0

et&+
* T(e learnin5 rule for a si5moid 7elief net is:

WT

i ) wij s j ( si s
+
1 sj)

2 s (2 j

WT

* Cit( repli&ated 6ei5(ts t(is 7e&omes:


0 s0 ( s j i

v2 s
W

2 i

WT
1 s (. j

s1 i)

1 0 si ( s j

+ si2 ) + ...
s j si

WT

v. s
W

1 i

W
WT

1 s1 ( s j i

0 s (0 j

WT

v0 s

0 i

Learnin5 a deep dire&ted net6or3


* 9irst learn 6it( all t(e 6ei5(ts tied , T(is is exa&tly e;uivalent to learnin5 an 'B8 , Contrastive diver5en&e learnin5 is e;uivalent to i5norin5 t(e small derivatives &ontri7uted 7y t(e tied 6ei5(ts 7et6een deeper layers+

et&+ (2

WT

v2
WT

(.
W

v. (0
W
WT

(0
W

v0

v0

* T(en free>e t(e first layer of 6ei5(ts in 7ot( dire&tions and learn t(e remainin5 6ei5(ts (still tied to5et(er#+ , T(is is e;uivalent to learnin5 anot(er 'B8: usin5 t(e a55re5ated posterior distri7ution of (0 as t(e data+

et&+ (2

WT

v2
WT

(.
W

v.
W

v.
WT

(0
T W frozen

(0
W frozen

v0

How many layers should we use and how wide should they be?

* There is no simple answer.
  - Extensive experiments by Yoshua Bengio's group (described later) suggest that several hidden layers is better than one.
  - Results are fairly robust against changes in the size of a layer, but the top layer should be big.
* Deep belief nets give their creator a lot of freedom.
  - The best way to use that freedom depends on the task.
  - With enough narrow layers we can model any distribution over binary vectors (Sutskever & Hinton, 2007).

What happens when the weights in higher layers become different from the weights in the first layer?

* The higher layers no longer implement a complementary prior.
  - So performing inference using the frozen weights in the first layer is no longer correct. But it's still pretty good.
  - Using this incorrect inference procedure gives a variational lower bound on the log probability of the data.
* The higher layers learn a prior that is closer to the aggregated posterior distribution of the first hidden layer.
  - This improves the network's model of the data.
* Hinton, Osindero and Teh (2006) prove that this improvement is always bigger than the loss in the variational bound caused by using less accurate inference.

An improved version of Contrastive Divergence learning (if time permits)

* The main worry with CD is that there will be deep minima of the energy function far away from the data.
  - To find these we need to run the Markov chain for a long time (maybe thousands of steps).
  - But we cannot afford to run the chain for too long for each update of the weights.
* Maybe we can run the same Markov chain over many weight updates? (Neal, 1992)
  - If the learning rate is very small, this should be equivalent to running the chain for many steps and then doing a bigger weight update.

Persistent CD (Tijmen Tieleman, ICML 2008 & 2009)

* Use minibatches of 100 cases to estimate the first term in the gradient. Use a single batch of 100 fantasies to estimate the second term in the gradient.
* After each weight update, generate the new fantasies from the previous fantasies by using one alternating Gibbs update.
  - So the fantasies can get far from the data.
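A sketch of one persistent-CD update under the same simplifications as the earlier CD-1 sketch (binary units, no biases, my names); the key difference is that the negative statistics come from persistent fantasy particles rather than one-step reconstructions.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def pcd_update(v_data, fantasies, W, epsilon=0.01, rng=np.random.default_rng(0)):
    """One PCD step; `fantasies` persists across calls instead of being reset to data."""
    ph_data = logistic(v_data @ W)                          # positive term from a minibatch
    # one alternating Gibbs update of the fantasy particles
    h_f = (rng.random((len(fantasies), W.shape[1])) < logistic(fantasies @ W)).astype(float)
    fantasies = (rng.random(fantasies.shape) < logistic(h_f @ W.T)).astype(float)
    ph_f = logistic(fantasies @ W)                          # negative term from fantasies
    W = W + epsilon * (v_data.T @ ph_data / len(v_data)
                       - fantasies.T @ ph_f / len(fantasies))
    return W, fantasies
```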

Contrastive diver5en&e as an adversarial 5ame


* C(y does persisitent CD 6or3 so 6ell 6it( only .00 ne5ative examples to &(ara&teri>e t(e 6(ole partition fun&tionD , 9or all interestin5 pro7lems t(e partition fun&tion is (i5(ly multi2modal+ , %o6 does it mana5e to find all t(e modes 6it(out startin5 at t(e dataD

T(e learnin5 &auses very fast mixin5


* T(e learnin5 intera&ts 6it( t(e 8ar3ov &(ain+
* !ersisitent Contrastive Diver5en&e &annot 7e analysed 7y vie6in5 t(e learnin5 as an outer loop+

, C(erever t(e fantasies outnum7er t(e positive data: t(e free2ener5y surfa&e is raised+ T(is ma3es t(e fantasies rus( around (ypera&tively+

How persistent CD moves between the modes of the model's distribution

* If a mode has more fantasy particles than data, the free-energy surface is raised until the fantasy particles escape.
  - This can overcome free-energy barriers that would be too high for the Markov chain to jump.
* The free-energy surface is being changed to help mixing in addition to defining the model.

"ummary so far
* 'estri&ted Bolt>mann 8a&(ines provide a simple 6ay to learn a layer of features 6it(out any supervision+ , 8aximum li3eli(ood learnin5 is &omputationally expensive 7e&ause of t(e normali>ation term: 7ut &ontrastive diver5en&e learnin5 is fast and usually 6or3s 6ell+ * 8any layers of representation &an 7e learned 7y treatin5 t(e (idden states of one 'B8 as t(e visi7le data for trainin5 t(e next 'B8 (a &omposition of experts#+ * T(is &reates 5ood 5enerative models t(at &an t(en 7e fine2tuned+ , Contrastive 6a3e2sleep &an fine2tune 5eneration+

BREAK

Overview of the rest of the tutorial

* How to fine-tune a greedily trained generative model to be better at discrimination.
* How to learn a kernel for a Gaussian process.
* How to use deep belief nets for non-linear dimensionality reduction and document retrieval.
* How to learn a generative hierarchy of conditional random fields.
* A more advanced learning module for deep belief nets that contains multiplicative interactions.
* How to learn deep models of sequential data.

9ine2tunin5 for dis&rimination


* 9irst learn one layer at a time 5reedily+ * T(en treat t(is as Kpre2trainin5L t(at finds a 5ood initial set of 6ei5(ts 6(i&( &an 7e fine2tuned 7y a lo&al sear&( pro&edure+ , Contrastive 6a3e2sleep is one 6ay of fine2 tunin5 t(e model to 7e 7etter at 5eneration+ * Ba&3propa5ation &an 7e used to fine2tune t(e model for 7etter dis&rimination+ , T(is over&omes many of t(e limitations of standard 7a&3propa5ation+

C(y 7a&3propa5ation 6or3s 7etter 6it( 5reedy pre2trainin5: T(e optimi>ation vie6
* $reedily learnin5 one layer at a time s&ales 6ell to really 7i5 net6or3s: espe&ially if 6e (ave lo&ality in ea&( layer+ * Ce do not start 7a&3propa5ation until 6e already (ave sensi7le feature dete&tors t(at s(ould already 7e very (elpful for t(e dis&rimination tas3+ , "o t(e initial 5radients are sensi7le and 7a&3prop only needs to perform a lo&al sear&( from a sensi7le startin5 point+

C(y 7a&3propa5ation 6or3s 7etter 6it( 5reedy pre2trainin5: T(e overfittin5 vie6
* 8ost of t(e information in t(e final 6ei5(ts &omes from modelin5 t(e distri7ution of input ve&tors+ , T(e input ve&tors 5enerally &ontain a lot more information t(an t(e la7els+ , T(e pre&ious information in t(e la7els is only used for t(e final fine2tunin5+ , T(e fine2tunin5 only modifies t(e features sli5(tly to 5et t(e &ate5ory 7oundaries ri5(t+ t does not need to dis&over features+ * T(is type of 7a&3propa5ation 6or3s 6ell even if most of t(e trainin5 data is unla7eled+ , T(e unla7eled data is still very useful for dis&overin5 5ood features+

First, model the distribution of digit images

The top two layers form a restricted Boltzmann machine whose free energy landscape should model the low-dimensional manifolds of the digits.

  2000 units
  500 units
  500 units
  28 x 28 pixel image

The network learns a density model for unlabeled digit images. When we generate from the model we get things that look like real digits of all classes. But do the hidden features really help with digit discrimination? Add 10 softmaxed units to the top and do backpropagation.

Results on the permutation-invariant MNIST task

* Very carefully trained backprop net with one or two hidden layers (Platt; Hinton)   1.6%
* SVM (Decoste & Schoelkopf, 2002)                                                    1.4%
* Generative model of joint density of images and labels (+ generative fine-tuning)   1.25%
* Generative model of unlabelled digits followed by gentle backpropagation
  (Hinton & Salakhutdinov, Science 2006)                                              1.15%

Learnin5 Dynami&s of Deep Nets


t(e next / slides des&ri7e 6or3 7y Mos(ua Ben5io@s 5roup

Before fine-tuning

After fine-tuning

Hffe&t of Unsupervised !re2trainin5


Erhan et. al. AISTATS2009

40

Hffe&t of Dept(
w/o pre-training

without pre-training

with pre-training

44

Learnin5 Tra?e&tories in 9un&tion "pa&e


(a 22D visuali>ation produ&ed 6it( t2"NH# Erhan et. al. AISTATS2009 * Ha&( point is a model in fun&tion spa&e * Color J epo&( * Top: tra?e&tories 6it(out pre2trainin5+ Ha&( tra?e&tory &onver5es to a different lo&al min+ * Bottom: Tra?e&tories 6it( pre2trainin5+ * No overlapN

Why unsupervised pre-training makes sense

  [Two generative diagrams. Left: the label is related directly to the image. Right: hidden "stuff" generates the image through a high-bandwidth pathway and the label through a low-bandwidth pathway.]

If image-label pairs were generated the first way, it would make sense to try to go straight from images to labels. For example, do the pixels have even parity?

If image-label pairs are generated the second way, it makes sense to first learn to recover the stuff that caused the image by inverting the high-bandwidth pathway.

Modeling real-valued data

* For images of digits it is possible to represent intermediate intensities as if they were probabilities by using "mean-field" logistic units.
  - We can treat intermediate values as the probability that the pixel is inked.
* This will not work for real images.
  - In a real image, the intensity of a pixel is almost always almost exactly the average of the neighboring pixels.
  - Mean-field logistic units cannot represent precise intermediate values.

'epla&in5 7inary varia7les 7y inte5er2valued varia7les


(Te( and %inton: 200.#

* Ene 6ay to model an inte5er2valued varia7le is to ma3e N identi&al &opies of a 7inary unit+ * All &opies (ave t(e same pro7a7ility: of 7ein5 KonL : p J lo5isti&(x# , T(e total num7er of KonL &opies is li3e t(e firin5 rate of a neuron+ , t (as a 7inomial distri7ution 6it( mean N p and varian&e N p(.2p#

A better way to implement integer values

* Make many copies of a binary unit.
* All copies have the same weights and the same adaptive bias, b, but they have different fixed offsets to the bias:

  b - 0.5,  b - 1.5,  b - 2.5,  b - 3.5, ....

A fast approximation

  $\sum_{n=1}^{\infty} \text{logistic}(x + 0.5 - n) \;\approx\; \log(1 + e^x)$

* Contrastive divergence learning works well for the sum of binary units with offset biases.
* It also works for rectified linear units. These are much faster to compute than the sum of many logistic units:

  output = max(0, x + randn * sqrt(logistic(x)))

How to train a bipartite network of rectified linear units

* Just use contrastive divergence to lower the energy of data and raise the energy of nearby configurations that the model prefers to the data.

  [Diagram: measure $\langle v_i h_j \rangle_{data}$ on the data and $\langle v_i h_j \rangle_{recon}$ on the reconstruction.]

Start with a training vector on the visible units. Update all hidden units in parallel with sampling noise. Update the visible units in parallel to get a "reconstruction". Update the hidden units again.

  $\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{recon} \right)$

3D Object Recognition: The NORB dataset

Stereo pairs of grayscale images of toy objects:
Animals, Humans, Planes, Trucks, Cars (normalized-uniform version of NORB)

* 6 lighting conditions, 162 viewpoints
* Five object instances per class in the training set
* A different set of five instances per class in the test set
* 24,300 training cases, 24,300 test cases

"implifyin5 t(e data


* Ha&( trainin5 &ase is a stereo2pair of =4x=4 ima5es+ , T(e o7?e&t is &entered+ , T(e ed5es of t(e ima5e are mainly 7lan3+ , T(e 7a&35round is uniform and 7ri5(t+ * To ma3e learnin5 faster used simplified t(e data: , T(ro6 a6ay one ima5e+ , Enly use t(e middle 4/x4/ pixels of t(e ot(er ima5e+ , Do6nsample to -2x-2 7y avera5in5 / pixels+

"implifyin5 t(e data even more so t(at it &an 7e modeled 7y re&tified linear units
* T(e intensity (isto5ram for ea&( -2x-2 ima5e (as a s(arp pea3 for t(e 7ri5(t 7a&35round+ * 9ind t(is pea3 and &all it >ero+ * Call all intensities 7ri5(ter t(an t(e 7a&35round >ero+ * 8easure intensities do6n6ards from t(e 7a&35round intensity+

Test set error rates on NORB after greedy learning of one or two hidden layers using rectified linear units

Full NORB (2 images of 96x96):
* Logistic regression on the raw pixels            20.5%
* Gaussian SVM (trained by Leon Bottou)            11.6%
* Convolutional neural net (Le Cun's group)         6.0%
  (convolutional nets have knowledge of translations built in)

Reduced NORB (1 image 32x32):
* Logistic regression on the raw pixels            30.2%
* Logistic regression on the first hidden layer    14.9%
* Logistic regression on the second hidden layer   10.2%

T(e re&eptive fields of some re&tified linear (idden units+

A standard type of real-valued visible unit

* We can model pixels as Gaussian variables. Alternating Gibbs sampling is still easy, though learning needs to be much slower.

  [Figure: a parabolic containment function centered at the bias $b_i$; the hidden units produce an energy-gradient that shifts each visible unit's mean through its total input.]

  $E(v,h) = \sum_{i \in vis} \frac{(v_i - b_i)^2}{2\sigma_i^2} \;-\; \sum_{j \in hid} b_j h_j \;-\; \sum_{i,j} \frac{v_i}{\sigma_i} h_j w_{ij}$

Welling et al. (2005) show how to extend RBM's to the exponential family. See also Bengio et al. (2007).
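A sketch of this energy function in code (my variable names); it is only the energy evaluation, not a full sampler.

```python
import numpy as np

def gaussian_binary_energy(v, h, W, b_vis, b_hid, sigma):
    """E(v,h) = sum_i (v_i-b_i)^2/(2 sigma_i^2) - sum_j b_j h_j - sum_ij (v_i/sigma_i) h_j w_ij."""
    return (np.sum((v - b_vis) ** 2 / (2.0 * sigma ** 2))
            - b_hid @ h
            - (v / sigma) @ W @ h)
```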

A random sample of 10,000 binary filters learned by Alex Krizhevsky on a million 32x32 color images.

Combining deep belief nets with Gaussian processes

* Deep belief nets can benefit a lot from unlabeled data when labeled data is scarce.
  - They just use the labeled data for fine-tuning.
* Kernel methods, like Gaussian processes, work well on small labeled training sets but are slow for large training sets.
* So when there is a lot of unlabeled data and only a little labeled data, combine the two approaches:
  - First learn a deep belief net without using the labels.
  - Then apply a Gaussian process model to the deepest layer of features. This works better than using the raw data.
  - Then use GP's to get the derivatives that are back-propagated through the deep belief net. This is a further win. It allows GP's to fine-tune complicated domain-specific kernels.

Learnin5 to extra&t t(e orientation of a fa&e pat&(


("ala3(utdinov ) %inton: N !" 2007#

The training and test sets for predicting face orientation

* 100, 500, or 1000 labeled cases
* 11,000 unlabeled cases
* face patches from new people

The root mean squared error in the orientation when combining GP's with deep belief nets

                                              100 labels   500 labels   1000 labels
  GP on the pixels                               22.2         17.2         16.3
  GP on top-level features                       17.9         12.7         11.2
  GP on top-level features with fine-tuning      15.2          7.2          6.4

Conclusion: The deep features are much better than the pixels. Fine-tuning helps a lot.

Deep Autoen&oders
(%inton ) "ala3(utdinov: 2004# * T(ey al6ays loo3ed li3e a really ni&e 6ay to do non2linear dimensionality redu&tion: , But it is very diffi&ult to optimi>e deep autoen&oders usin5 7a&3propa5ation+ * Ce no6 (ave a mu&( 7etter 6ay to optimi>e t(em: , 9irst train a sta&3 of / 'B8@s , T(en KunrollL t(em+ , T(en fine2tune 6it( 7a&3prop+

2Ax2A
W1T
.000 neurons
T W2

000 neurons

W3T
T W4

200 neurons -0

W4
200 neurons

linear units

W3
000 neurons

W2
.000 neurons

W1

2Ax2A

A &omparison of met(ods for &ompressin5 di5it ima5es to -0 real num7ers+

real data -02D deep auto -02D lo5isti& !CA -02D !CA

'etrievin5 do&uments t(at are similar to a ;uery do&ument


* Ce &an use an autoen&oder to find lo62 dimensional &odes for do&uments t(at allo6 fast and a&&urate retrieval of similar do&uments from a lar5e set+ * Ce start 7y &onvertin5 ea&( do&ument into a K7a5 of 6ordsL+ T(is a 2000 dimensional ve&tor t(at &ontains t(e &ounts for ea&( of t(e 2000 &ommonest 6ords+

%o6 to &ompress t(e &ount ve&tor


2000 re&onstru&ted &ounts output ve&tor

000 neurons
200 neurons .0 200 neurons

000 neurons 2000 6ord &ounts

* Ce train t(e neural net6or3 to reprodu&e its input ve&tor as its output * T(is for&es it to &ompress as mu&( information as possi7le into t(e .0 num7ers in t(e &entral 7ottlene&3+ * T(ese .0 num7ers are t(en a 5ood 6ay to &ompare do&uments+
input ve&tor

!erforman&e of t(e autoen&oder at do&ument retrieval


* Train on 7a5s of 2000 6ords for /00:000 trainin5 &ases of 7usiness do&uments+ , 9irst train a sta&3 of 'B8@s+ T(en fine2tune 6it( 7a&3prop+ * Test on a separate /00:000 do&uments+ , !i&3 one test do&ument as a ;uery+ 'an3 order all t(e ot(er test do&uments 7y usin5 t(e &osine of t(e an5le 7et6een &odes+ , 'epeat t(is usin5 ea&( of t(e /00:000 test do&uments as t(e ;uery (re;uires 0+.4 trillion &omparisons#+ * !lot t(e num7er of retrieved do&uments a5ainst t(e proportion t(at are in t(e same (and2la7eled &lass as t(e ;uery do&ument+
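A small sketch of the ranking step just described, assuming each document already has a real-valued code vector:

```python
import numpy as np

def rank_by_cosine(query_code, codes):
    """Order documents by the cosine of the angle between codes and the query's code."""
    codes = np.asarray(codes, dtype=float)
    q = np.asarray(query_code, dtype=float)
    sims = (codes @ q) / (np.linalg.norm(codes, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims)            # most similar documents first
```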

  [Plot: proportion of retrieved documents in the same class as the query vs. number of documents retrieved.]

9irst &ompress all do&uments to 2 num7ers usin5 a type of !CA T(en use different &olors for different do&ument &ate5ories

9irst &ompress all do&uments to 2 num7ers+ T(en use different &olors for different do&ument &ate5ories

9indin5 7inary &odes for do&uments


2000 re&onstru&ted &ounts

* Train an auto2en&oder usin5 -0 lo5isti& units for t(e &ode layer+ * Durin5 t(e fine2tunin5 sta5e: add noise to t(e inputs to t(e &ode units+ , T(e KnoiseL ve&tor for ea&( trainin5 &ase is fixed+ "o 6e still 5et a deterministi& 5radient+ , T(e noise for&es t(eir a&tivities to 7e&ome 7imodal in order to resist t(e effe&ts of t(e noise+ , T(en 6e simply round t(e a&tivities of t(e -0 &ode units to . or 0+

000 neurons
200 neurons -0

noise
200 neurons

000 neurons 2000 6ord &ounts

"emanti& (as(in5: Usin5 a deep autoen&oder as a (as(2fun&tion for findin5 approximate mat&(es ("ala3(utdinov ) %inton: 2007#

(as( fun&tion

Ksupermar3et sear&(L

How good is a shortlist found this way?

* We have only implemented it for a million documents with 20-bit codes --- but what could possibly go wrong?
  - A 20-D hypercube allows us to capture enough of the similarity structure of our document set.
* The shortlist found using binary codes actually improves the precision-recall curves of TF-IDF.
  - Locality sensitive hashing (the fastest other method) is 50 times slower and has worse precision-recall curves.
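A toy illustration of the semantic-hashing lookup, assuming the binary codes already exist: documents are stored under the integer address given by their code, and a shortlist is gathered by probing all addresses within a small Hamming distance of the query's address. All names here are mine.

```python
import itertools

def address(bits):
    """Interpret a binary code as a memory address."""
    return int("".join(str(int(b)) for b in bits), 2)

def build_table(codes):
    table = {}
    for doc_id, bits in enumerate(codes):
        table.setdefault(address(bits), []).append(doc_id)
    return table

def shortlist(query_bits, table, radius=2):
    """Fetch documents whose codes are within `radius` bits of the query code."""
    hits, n = [], len(query_bits)
    for r in range(radius + 1):
        for flips in itertools.combinations(range(n), r):
            probe = list(query_bits)
            for f in flips:
                probe[f] = 1 - probe[f]
            hits.extend(table.get(address(probe), []))
    return hits
```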

Generating the parts of an object

* One way to maintain the constraints between the parts is to generate each part very accurately.
  - But this would require a lot of communication bandwidth.
* Sloppy top-down specification of the parts is less demanding,
  - but it messes up relationships between features,
  - so use redundant features and use lateral interactions to clean up the mess.
* Each transformed feature helps to locate the others.
  - This allows a noisy channel.

  [Diagram: "square" → pose parameters → sloppy top-down activation of parts → features with top-down support → clean-up using known interactions.]

It's like soldiers on a parade ground.

"emi2restri&ted Bolt>mann 8a&(ines


* Ce restri&t t(e &onne&tivity to ma3e learnin5 easier+ * Contrastive diver5en&e learnin5 re;uires t(e (idden units to 7e in &onditional e;uili7rium 6it( t(e visi7les+ , But it does not re;uire t(e visi7le units to 7e in &onditional e;uili7rium 6it( t(e (iddens+ , All 6e re;uire is t(at t(e visi7le units are &loser to e;uili7rium in t(e re&onstru&tions t(an in t(e data+ * "o 6e &an allo6 &onne&tions 7et6een t(e visi7les+
(idden ?

i visi7le

Learnin5 a semi2restri&ted Bolt>mann 8a&(ine


.+ "tart 6it( a trainin5 ve&tor on t(e visi7le units+ 2+ Update all of t(e (idden units in parallel -+ 'epeatedly update all of t(e visi7le units in parallel usin5 mean2field updates (6it( t(e (iddens fixed# to 5et a Kre&onstru&tionL+ /+ Update all of t(e (idden units a5ain+

< vi h j > 0
i 3 i 3 i 3 i

< vi h j > 1
3

tJ0 data

tJ. re&onstru&tion

wij = ( < vi h j > 0 < vi h j > 1 )

lik = ( < vi vk > 0 < vi vk > 1 )


update for a lateral 6ei5(t

Learnin5 in "emi2restri&ted Bolt>mann 8a&(ines


* 8et(od .: To form a re&onstru&tion: &y&le t(rou5( t(e visi7le units updatin5 ea&( in turn usin5 t(e top2do6n input from t(e (iddens plus t(e lateral input from t(e ot(er visi7les+ * 8et(od 2: Use Kmean fieldL visi7le units t(at (ave real values+ Update t(em all in parallel+ , Use dampin5 to prevent os&illations
t+ 1 pi

t pi

+ (1 ) ( xi )
total input to i

dampin5

Results on modeling natural image patches using a stack of RBM's (Osindero and Hinton)

* Stack of RBM's learned one at a time. 1000 top-level units, with no MRF at the top level.
* 400 Gaussian visible units that see whitened image patches.
  - Derived from 100,000 Van Hateren image patches, each 20x20.
* The hidden units are all binary.
  - The lateral connections are learned when they are the visible units of their RBM.
* Reconstruction involves letting the visible units of each RBM settle using mean-field dynamics.
  - The already decided states in the level above determine the effective biases during mean-field settling.

  [Architecture: 1000 top-level units — undirected connections — hidden MRF with 500 units — directed connections — hidden MRF with 2000 units — directed connections — 400 Gaussian units.]

  [Figure: without lateral connections, real data vs. samples from the model; with lateral connections, real data vs. samples from the model.]

A funny way to use an MRF

* The lateral connections form an MRF.
* The MRF is used during learning and generation.
* The MRF is not used for inference.
  - This is a novel idea, so vision researchers don't like it.
* The MRF enforces constraints. During inference, constraints do not need to be enforced because the data obeys them.
  - The constraints only need to be enforced during generation.
* Unobserved hidden units cannot enforce constraints.
  - To enforce constraints requires lateral connections or observed descendants.

Why do we whiten data?

* Images typically have strong pair-wise correlations.
* Learning higher order statistics is difficult when there are strong pair-wise correlations.
  - Small changes in parameter values that improve the modeling of higher-order statistics may be rejected because they form a slightly worse model of the much stronger pair-wise statistics.
* So we often remove the second-order statistics before trying to learn the higher-order statistics.

Whitening the learning signal instead of the data

* Contrastive divergence learning can remove the effects of the second-order statistics on the learning without actually changing the data.
  - The lateral connections model the second-order statistics.
  - If a pixel can be reconstructed correctly using second-order statistics, it will be the same in the reconstruction as in the data.
  - The hidden units can then focus on modeling high-order structure that cannot be predicted by the lateral connections.
    * For example, a pixel close to an edge, where interpolation from nearby pixels causes incorrect smoothing.

Towards a more powerful, multi-linear stackable learning module

* So far, the states of the units in one layer have only been used to determine the effective biases of the units in the layer below.
* It would be much more powerful to modulate the pair-wise interactions in the layer below.
  - A good way to design a hierarchical system is to allow each level to determine the objective function of the level below.
* To modulate pair-wise interactions we need higher-order Boltzmann machines.

Higher order Boltzmann machines (Sejnowski, ~1986)

* The usual energy function is quadratic in the states:

  $E = -\text{bias terms} - \sum_{i<j} s_i s_j w_{ij}$

* But we could use higher order interactions:

  $E = -\text{bias terms} - \sum_{i<j<k} s_i s_j s_k w_{ijk}$

* Unit k acts as a switch. When unit k is on, it switches in the pairwise interaction between unit i and unit j.
  - Units i and j can also be viewed as switches that control the pairwise interactions between j and k or between i and k.

Using higher-order Boltzmann machines to model image transformations (the unfactored version)

* A global transformation specifies which pixel goes to which other pixel.
* Conversely, each pair of similar intensity pixels, one in each image, votes for a particular global transformation.

  [Diagram: image(t) and image(t+1) both connect to a layer of "image transformation" units through three-way interactions.]

9a&torin5 t(ree26ay multipli&ative intera&tions


E=

i, j ,h

si s j s h wijh

unfa&tored
6it( &u7i&ally many parameters

E=

si s j s h wif w jf whf

fa&tored
6it( linearly many parameters per fa&tor+

f i, j ,h
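A sketch comparing the two energy functions in code (my names). The factored energy exploits the fact that the triple sum separates into a product of three weighted sums per factor.

```python
import numpy as np

def unfactored_energy(si, sj, sh, W):
    """E = -sum_{i,j,h} s_i s_j s_h w_ijh, with W[i, j, h] = w_ijh."""
    return -np.einsum("i,j,h,ijh->", si, sj, sh, W)

def factored_energy(si, sj, sh, Wi, Wj, Wh):
    """E = -sum_f (sum_i s_i w_if)(sum_j s_j w_jf)(sum_h s_h w_hf);
    column f of Wi, Wj, Wh holds w_if, w_jf, w_hf."""
    return -np.sum((si @ Wi) * (sj @ Wj) * (sh @ Wh))
```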

A pi&ture of t(e lo62ran3 tensor &ontri7uted 7y fa&tor f


w jf whf wif
Ha&( layer is a s&aled version of t(e same matrix+ T(e 7asis matrix is spe&ified as an outer produ&t 6it( typi&al term wif w jf "o ea&( a&tive (idden unit &ontri7utes a s&alar: whf times t(e matrix spe&ified 7y fa&tor f +

nferen&e 6it( fa&tored t(ree26ay multipli&ative intera&tions


Ef =
i, j ,h

si s j sh wif w jf whf

T(e ener5y &ontri7uted 7y fa&tor f+

[ E f ( s h = 0)

E f ( s h = 1)

whf

si wif

s j w jf

%o6 &(an5in5 t(e 7inary state of unit ( &(an5es t(e ener5y &ontri7uted 7y fa&tor f+

This is what unit h needs to know in order to do Gibbs sampling.

Belief propagation

  [Factor graph: factor f connects units i, j and h through weights $w_{if}$, $w_{jf}$, $w_{hf}$.]

The outgoing message at each vertex of the factor is the product of the weighted sums at the other two vertices.

Learnin5 6it( fa&tored t(ree26ay multipli&ative intera&tions


mh f =
messa5e from fa&tor f to unit (

si wif Ef whf

s j w jf Ef whf
model

whf


data

sh m h f

data

sh m h f

model

'oland data

8odelin5 t(e &orrelational stru&ture of a stati& ima5e 7y usin5 t6o &opies of t(e ima5e
h

whf
f

Ha&( fa&tor sends t(e s;uared output of a linear filter to t(e (idden units+ t is exa&tly t(e standard model of simple and &omplex &ells+ t allo6s &omplex &ells to extra&t oriented ener5y+
j

wif
i

w jf

Copy .

Copy 2

T(e standard model drops out of doin5 7elief propa5ation for a fa&tored t(ird2order ener5y fun&tion+

An advantage of modeling correlations between pixels rather than pixels

* During generation, a "vertical edge" unit can turn off the horizontal interpolation in a region without worrying about exactly where the intensity discontinuity will be.
  - This gives some translational invariance.
  - It also gives a lot of invariance to brightness and contrast.
  - So the "vertical edge" unit is like a complex cell.
* By modulating the correlations between pixels rather than the pixel intensities, the generative model can still allow interpolation parallel to the edge.

A prin&iple of (ierar&(i&al systems


* Ha&( level in t(e (ierar&(y s(ould not try to mi&ro2mana5e t(e level 7elo6+ * nstead: it s(ould &reate an o7?e&tive fun&tion for t(e level 7elo6 and leave t(e level 7elo6 to optimi>e it+ , T(is allo6s t(e fine details of t(e solution to 7e de&ided lo&ally 6(ere t(e detailed information is availa7le+ * E7?e&tive fun&tions are a 5ood 6ay to do a7stra&tion+

Time series models

* Inference is difficult in directed models of time series if we use non-linear distributed representations in the hidden units.
  - It is hard to fit Dynamic Bayes Nets to high-dimensional sequences (e.g. motion capture data).
* So people tend to avoid distributed representations and use much weaker methods (e.g. HMM's).

Time series models

* If we really need distributed representations (which we nearly always do), we can make inference much simpler by using three tricks:
  - Use an RBM for the interactions between hidden and visible variables. This ensures that the main source of information wants the posterior to be factorial.
  - Model short-range temporal information by allowing several previous frames to provide input to the hidden units and to the visible units.
    * This leads to a temporal module that can be stacked.
  - So we can use greedy learning to learn deep models of temporal structure.

An appli&ation to modelin5 motion &apture data


(Taylor: 'o6eis ) %inton: 2007# * %uman motion &an 7e &aptured 7y pla&in5 refle&tive mar3ers on t(e ?oints and t(en usin5 lots of infrared &ameras to tra&3 t(e -2D positions of t(e mar3ers+ * $iven a s3eletal model: t(e -2D positions of t(e mar3ers &an 7e &onverted into t(e ?oint an5les plus 4 parameters t(at des&ri7e t(e -2D position and t(e roll: pit&( and ya6 of t(e pelvis+
, Ce only represent &(an5es in ya6 7e&ause p(ysi&s doesn@t &are a7out its value and 6e 6ant to avoid &ir&ular varia7les+

T(e &onditional 'B8 model


(a partially o7served C'9# * "tart 6it( a 5eneri& 'B8+ * Add t6o types of &onditionin5 &onne&tions+ * $iven t(e data: t(e (idden units at time t are &onditionally independent+ * T(e autore5ressive 6ei5(ts &an model most s(ort2term temporal stru&ture very 6ell: leavin5 t(e (idden units to model nonlinear irre5ularities (su&( as 6(en t(e foot (its t(e 5round#+
j
(

i
v

t22

t2.

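A hedged sketch of the two conditioning pathways, assuming binary hidden units; the matrices A (past → visible) and B (past → hidden) and all other names are mine, chosen only to illustrate how the past frames act as dynamic biases.

```python
import numpy as np

def crbm_hidden_probs(v_t, v_past, W, B, b_hid):
    """p(h = 1 | v_t, history): the concatenated past frames shift the hidden biases."""
    return 1.0 / (1.0 + np.exp(-(v_t @ W + v_past @ B + b_hid)))

def crbm_visible_bias(v_past, A, b_vis):
    """Time-dependent visible bias contributed by the autoregressive weights."""
    return v_past @ A + b_vis
```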
Causal generation from a learned model

* Keep the previous visible states fixed.
  - They provide a time-dependent bias for the hidden units.
* Perform alternating Gibbs sampling for a few iterations between the hidden units and the most recent visible units.
  - This picks new hidden and visible states that are compatible with each other and with the recent history.

Higher level models

* Once we have trained the model, we can add layers like in a Deep Belief Network.
* The previous layer CRBM is kept, and its output, while driven by the data, is treated as a new kind of "fully observed" data.
* The next level CRBM has the same architecture as the first (though we can alter the number of units it uses) and is trained the same way.
* Upper levels of the network model more "abstract" concepts.
* This greedy learning procedure can be justified using a variational bound.

Learning with "style" labels

* As in the generative model of handwritten digits (Hinton et al. 2006), style labels can be provided as part of the input to the top layer.
* The labels are represented by turning on one unit in a group of units, but they can also be blended.

  [Diagram: label units l and k feed the top-level CRBM along with the past frames at t-2 and t-1.]
"(o6 demo@s of multiple styles of 6al3in5


These can be foun at www.cs.toronto.e u/!gwta"lor/

Readings on deep belief nets

A reading list (that is still being updated) can be found at www.cs.toronto.edu/~hinton/deeprefs.html
