
UCL Tutorial on:

Deep Belief Nets

(An updated and extended version of my 2007 NIPS tutorial)

Geoffrey Hinton
Canadian Institute for Advanced Research
& Department of Computer Science, University of Toronto

"&(edule for t(e Tutorial


* 2+00 , -+-0 Tutorial part . * -+-0 , -+/0 1uestions * -+/0 2 /+.0 Tea Brea3

* /+.0 , 0+/0 Tutorial part 2 * 0+/0 , 4+00 1uestions

"ome t(in5s you 6ill learn in t(is tutorial


* %o6 to learn multi2layer 5enerative models of unla7elled data 7y learnin5 one layer of features at a time+ , %o6 to add 8ar3ov 'andom 9ields in ea&( (idden layer+ * %o6 to use 5enerative models to ma3e dis&riminative trainin5 met(ods 6or3 mu&( 7etter for &lassifi&ation and re5ression+ , %o6 to extend t(is approa&( to $aussian !ro&esses and (o6 to learn &omplex: domain2spe&ifi& 3ernels for a $aussian !ro&ess+ * %o6 to perform non2linear dimensionality redu&tion on very lar5e datasets , %o6 to learn 7inary: lo62dimensional &odes and (o6 to use t(em for very fast do&ument retrieval+ * %o6 to learn multilayer 5enerative models of (i5(2 dimensional se;uential data+

A spe&trum of ma&(ine learnin5 tas3s Typi&al "tatisti&s222222222222Artifi&ial ntelli5en&e


* Lo62dimensional data (e+5+ less t(an .00 dimensions# Lots of noise in t(e data T(ere is not mu&( stru&ture in t(e data: and 6(at stru&ture t(ere is: &an 7e represented 7y a fairly simple model+ * * * * * %i5(2dimensional data (e+5+ more t(an .00 dimensions# T(e noise is not suffi&ient to o7s&ure t(e stru&ture in t(e data if 6e pro&ess it ri5(t+ T(ere is a (u5e amount of stru&ture in t(e data: 7ut t(e stru&ture is too &ompli&ated to 7e represented 7y a simple model+ T(e main pro7lem is fi5urin5 out a 6ay to represent t(e &ompli&ated stru&ture so t(at it &an 7e learned+

T(e main pro7lem is distin5uis(in5 true stru&ture from noise+

%istori&al 7a&35round:
9irst 5eneration neural net6or3s
* !er&eptrons (<.=40# used a layer of (and2 &oded features and tried to re&o5ni>e o7?e&ts 7y learnin5 (o6 to 6ei5(t t(ese features+ , T(ere 6as a neat learnin5 al5orit(m for ad?ustin5 t(e 6ei5(ts+ , But per&eptrons are fundamentally limited in 6(at t(ey &an learn to do+
Bom7 Toy

output units e+5+ &lass la7els

non2adaptive (and2&oded features

input units e+5+ pixels

"3et&( of a typi&al per&eptron from t(e .=40@s

"e&ond 5eneration neural net6or3s (<.=A0#


Ba&32propa5ate error si5nal to 5et derivatives for learnin5
Compare outputs 6it( &orre&t ans6er to 5et error si5nal

outputs

(idden layers

input ve&tor

A temporary digression

* Vapnik and his co-workers developed a very clever type of perceptron called a Support Vector Machine.
  - Instead of hand-coding the layer of non-adaptive features, each training example is used to create a new feature using a fixed recipe.
    * The feature computes how similar a test example is to that training example.
  - Then a clever optimization technique is used to select the best subset of the features and to decide how to weight each feature when classifying a test case.
* But it's just a perceptron and has all the same limitations.

In the 1990's, many researchers abandoned neural networks with multiple adaptive hidden layers because Support Vector Machines worked better.

What is wrong with back-propagation?

* It requires labeled training data.
  - Almost all data is unlabeled.
* The learning time does not scale well.
  - It is very slow in networks with multiple hidden layers.
* It can get stuck in poor local optima.
  - These are often quite good, but for deep nets they are far from optimal.

Ever&omin5 t(e limitations of 7a&32 propa5ation


* Feep t(e effi&ien&y and simpli&ity of usin5 a 5radient met(od for ad?ustin5 t(e 6ei5(ts: 7ut use it for modelin5 t(e stru&ture of t(e sensory input+ , Ad?ust t(e 6ei5(ts to maximi>e t(e pro7a7ility t(at a 5enerative model 6ould (ave produ&ed t(e sensory input+ , Learn p(ima5e# not p(la7el G ima5e#
* f you 6ant to do &omputer vision: first learn &omputer 5rap(i&s

* C(at 3ind of 5enerative model s(ould 6e learnD

Belief Nets

* A belief net is a directed acyclic graph composed of stochastic variables.
* We get to observe some of the variables and we would like to solve two problems:
  - The inference problem: Infer the states of the unobserved variables.
  - The learning problem: Adjust the interactions between variables to make the network more likely to generate the observed data.

  [Diagram: stochastic hidden causes with directed connections down to visible effects.]

We will use nets composed of layers of stochastic binary variables with weighted connections. Later, we will generalize to other types of variable.

"to&(asti& 7inary units


(Bernoulli varia7les#
* T(ese (ave a state of . or 0+ * T(e pro7a7ility of turnin5 on is determined 7y t(e 6ei5(ted input from ot(er units (plus a 7ias#
.

p ( si = 1)
0 0

bi +

s j w ji

p ( si = 1) =

1 + exp( bi

s j w ji )
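A minimal sketch of such a unit in Python (the function and variable names are mine, not from the tutorial): the turn-on probability is the logistic of the bias plus the weighted input, and the state is a Bernoulli sample of that probability.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_binary_layer(s_below, W, b, rng=np.random.default_rng(0)):
    """Sample a layer of stochastic binary units given the layer below.

    s_below: states s_j of the units feeding in; W[j, i] holds w_ji; b holds b_i.
    Returns the sampled binary states and the turn-on probabilities.
    """
    p_on = logistic(b + s_below @ W)          # p(s_i = 1)
    states = (rng.random(p_on.shape) < p_on).astype(float)
    return states, p_on
```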

Learning Deep Belief Nets

* It is easy to generate an unbiased example at the leaf nodes, so we can see what kinds of data the network believes in.
* It is hard to infer the posterior distribution over all possible configurations of hidden causes.
* It is hard to even get a sample from the posterior.
* So how can we learn deep belief nets that have millions of parameters?

  [Diagram: stochastic hidden causes above, visible effects below.]

The learning rule for sigmoid belief nets

* Learning is easy if we can get an unbiased sample from the posterior distribution over hidden states given the observed data.
* For each unit, maximize the log probability that its binary state in the sample from the posterior would be generated by the sampled binary states of its parents.

For a unit i with parents j connected by weights $w_{ji}$:

  $p_i \equiv p(s_i = 1) = \frac{1}{1 + \exp(-\sum_j s_j w_{ji})}$

  $\Delta w_{ji} = \varepsilon \, s_j (s_i - p_i)$

where $\varepsilon$ is the learning rate.
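A hedged sketch of this delta rule, assuming we already have a binary sample (parent states and the child's state) drawn from the posterior; the names are illustrative, not from the tutorial.

```python
import numpy as np

def sbn_weight_update(s_parents, s_child, w, epsilon=0.1):
    """One step of the rule above for a single unit i: w_ji += eps * s_j * (s_i - p_i)."""
    p_i = 1.0 / (1.0 + np.exp(-s_parents @ w))   # probability the parents would turn i on
    return w + epsilon * s_parents * (s_child - p_i)
```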

Explaining away (Judea Pearl)

* Even if two hidden causes are independent, they can become dependent when we observe an effect that they can both influence.
  - If we learn that there was an earthquake, it reduces the probability that the house jumped because of a truck.

  [Diagram: "truck hits house" (bias -10) and "earthquake" (bias -10) each connect with weight +20 to "house jumps" (bias -20).]

  posterior over the two causes, given that the house jumped:
  p(1,1) = .0001   p(1,0) = .4999   p(0,1) = .4999   p(0,0) = .0001

Why it is usually very hard to learn sigmoid belief nets one layer at a time

* To learn W, we need the posterior distribution in the first hidden layer.
* Problem 1: The posterior is typically complicated because of "explaining away".
* Problem 2: The posterior depends on the prior as well as the likelihood.
  - So to learn W, we need to know the weights in higher layers, even if we are only approximating the posterior. All the weights interact.
* Problem 3: We need to integrate over all possible configurations of the higher variables to get the prior for the first hidden layer. Yuk!

  [Diagram: stacked layers of hidden variables (the prior) above the first hidden layer, which connects to the data through W (the likelihood).]

"ome met(ods of learnin5 deep 7elief nets


* 8onte Carlo met(ods &an 7e used to sample from t(e posterior+ , But its painfully slo6 for lar5e: deep models+ * n t(e .==0@s people developed variational met(ods for learnin5 deep 7elief nets , T(ese only 5et approximate samples from t(e posterior+ , Nevet(eless: t(e learnin5 is still 5uaranteed to improve a variational 7ound on t(e lo5 pro7a7ility of 5eneratin5 t(e o7served data+

The breakthrough that makes deep learning efficient

* To learn deep nets efficiently, we need to learn one layer of features at a time. This does not work well if we assume that the latent variables are independent in the prior:
  - The latent variables are not independent in the posterior, so inference is hard for non-linear models.
  - The learning tries to find independent causes using one hidden layer, which is not usually possible.
* We need a way of learning one layer at a time that takes into account the fact that we will be learning more hidden layers later.
  - We solve this problem by using an undirected model.

Two types of generative neural network

* If we connect binary stochastic neurons in a directed acyclic graph we get a Sigmoid Belief Net (Radford Neal, 1992).
* If we connect binary stochastic neurons using symmetric connections we get a Boltzmann Machine (Hinton & Sejnowski, 1983).
  - If we restrict the connectivity in a special way, it is easy to learn a Boltzmann machine.

'estri&ted Bolt>mann 8a&(ines


("molens3y :.=A4: &alled t(em K(armoniumsL# * Ce restri&t t(e &onne&tivity to ma3e learnin5 easier+ , Enly one layer of (idden units+
* Ce 6ill deal 6it( more layers later (idden ?

, No &onne&tions 7et6een (idden units+ * n an 'B8: t(e (idden units are &onditionally independent 5iven t(e visi7le states+ , "o 6e &an ;ui&3ly 5et an un7iased sample from t(e posterior distri7ution 6(en 5iven a data2ve&tor+ , T(is is a 7i5 advanta5e over dire&ted 7elief nets

i visi7le

The Energy of a joint configuration (ignoring terms to do with biases)

With binary state $v_i$ of visible unit i, binary state $h_j$ of hidden unit j, and weight $w_{ij}$ between units i and j, the energy with configuration v on the visible units and h on the hidden units is

  $E(v,h) = -\sum_{i,j} v_i h_j w_{ij}$

so the energy gradient for a weight is simply

  $-\frac{\partial E(v,h)}{\partial w_{ij}} = v_i h_j$


* Ha&( possi7le ?oint &onfi5uration of t(e visi7le and (idden units (as an ener5y , T(e ener5y is determined 7y t(e 6ei5(ts and 7iases (as in a %opfield net#+ * T(e ener5y of a ?oint &onfi5uration of t(e visi7le and (idden units determines its pro7a7ility:

p (v, h)

E ( v ,h )

* T(e pro7a7ility of a &onfi5uration over t(e visi7le units is found 7y summin5 t(e pro7a7ilities of all t(e ?oint &onfi5urations t(at &ontain it+

Using energies to define probabilities

* The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations:

  $p(v,h) = \frac{e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$

  (the denominator is the partition function)

* The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it:

  $p(v) = \frac{\sum_h e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$
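For an RBM small enough to enumerate, these definitions can be evaluated exactly by brute force. The sketch below is my own illustration (biases ignored, as on the slide); it computes the partition function and the marginal p(v).

```python
import itertools
import numpy as np

def energy(v, h, W):
    """E(v,h) = -v^T W h, ignoring bias terms as on the slide."""
    return -(v @ W @ h)

def exact_marginal(v, W):
    """p(v) = sum_h e^{-E(v,h)} / sum_{u,g} e^{-E(u,g)}, by enumeration."""
    n_vis, n_hid = W.shape
    hs = [np.array(g) for g in itertools.product([0, 1], repeat=n_hid)]
    us = [np.array(u) for u in itertools.product([0, 1], repeat=n_vis)]
    Z = sum(np.exp(-energy(u, g, W)) for u in us for g in hs)   # partition function
    return sum(np.exp(-energy(np.asarray(v), g, W)) for g in hs) / Z
```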

A pi&ture of t(e maximum li3eli(ood learnin5 al5orit(m for an 'B8


? ? ? ?

< vi h j > 0
i tJ0 i tJ. i tJ2

< vi h j >
i

a fantasy

t J infinity

"tart 6it( a trainin5 ve&tor on t(e visi7le units+ T(en alternate 7et6een updatin5 all t(e (idden units in parallel and updatin5 all t(e visi7le units in parallel+

log p (v) = < vi h j > 0 < vi h j > wij

A ;ui&3 6ay to learn an 'B8


? ?

< vi h j > 0
i tJ0 data

< vi h j > 1
i tJ. re&onstru&tion

"tart 6it( a trainin5 ve&tor on t(e visi7le units+ Update all t(e (idden units in parallel Update t(e all t(e visi7le units in parallel to 5et a Kre&onstru&tionL+ Update t(e (idden units a5ain+

wij = ( < vi h j > 0 < vi h j > 1 )


T(is is not follo6in5 t(e 5radient of t(e lo5 li3eli(ood+ But it 6or3s 6ell+ t is approximately follo6in5 t(e 5radient of anot(er o7?e&tive fun&tion (Carreira2!erpinan ) %inton: 2000#+
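A minimal CD-1 sketch for a binary RBM, with biases omitted to match the slides; the variable names are mine, not the tutorial's.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, epsilon=0.05, rng=np.random.default_rng(0)):
    """One update dw_ij = eps * (<v_i h_j>_0 - <v_i h_j>_1) on a minibatch v0."""
    ph0 = logistic(v0 @ W)                               # positive phase, driven by data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)     # sampled hidden states
    v1 = (rng.random(v0.shape) < logistic(h0 @ W.T)).astype(float)   # reconstruction
    ph1 = logistic(v1 @ W)                               # hidden probabilities again
    n = v0.shape[0]
    return W + epsilon * (v0.T @ ph0 - v1.T @ ph1) / n
```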

How to learn a set of features that are good for reconstructing images of the digit 2

  [Diagram: 50 binary feature neurons above a 16 x 16 pixel image. On the data (reality), increment weights between an active pixel and an active feature; on the reconstruction (better than reality), decrement weights between an active pixel and an active feature.]

The final 50 x 256 weights: each neuron grabs a different feature.

%o6 6ell &an 6e re&onstru&t t(e di5it ima5es from t(e 7inary feature a&tivationsD
'e&onstru&tion from a&tivated 7inary features 'e&onstru&tion from a&tivated 7inary features

Data

Data

Ne6 test ima5es from t(e di5it &lass t(at t(e model 6as trained on

ma5es from an unfamiliar di5it &lass (t(e net6or3 tries to see every ima5e as a 2#

Three ways to combine probability density models (an underlying theme of the tutorial)

* Mixture: Take a weighted average of the distributions.
  - It can never be sharper than the individual distributions. It's a very weak way to combine models.
* Product: Multiply the distributions at each point and then renormalize (this is how an RBM combines the distributions defined by each hidden unit).
  - Exponentially more powerful than a mixture. The normalization makes maximum likelihood learning difficult, but approximations allow us to learn anyway.
* Composition: Use the values of the latent variables of one model as the data for the next model.
  - Works well for learning multiple layers of representation, but only if the individual models are undirected.

Training a deep network (the main reason RBM's are interesting)

* First train a layer of features that receive input directly from the pixels.
* Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer (a sketch follows below).
* It can be proved that each time we add another layer of features we improve a variational lower bound on the log probability of the training data.
  - The proof is slightly complicated.
  - But it is based on a neat equivalence between an RBM and a deep directed model (described later).
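A self-contained sketch of this greedy procedure, assuming CD-1 as the per-layer learner; everything here is illustrative rather than the tutorial's own code.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_stack(data, layer_sizes, n_epochs=10, eps=0.05, rng=np.random.default_rng(0)):
    """Greedy layer-by-layer training: each RBM's hidden activities
    become the 'pixels' for the next RBM."""
    weights, x = [], data
    for n_hid in layer_sizes:
        W = 0.01 * rng.standard_normal((x.shape[1], n_hid))
        for _ in range(n_epochs):                     # CD-1 on this layer
            ph0 = logistic(x @ W)
            h0 = (rng.random(ph0.shape) < ph0).astype(float)
            v1 = logistic(h0 @ W.T)                   # mean-field reconstruction
            ph1 = logistic(v1 @ W)
            W += eps * (x.T @ ph0 - v1.T @ ph1) / x.shape[0]
        x = logistic(x @ W)                           # features of features next
        weights.append(W)
    return weights
```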

The generative model after learning 3 layers

* To generate data:
  1. Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling for a long time.
  2. Perform a top-down pass to get states for all the other layers.

So the lower-level bottom-up connections are not part of the generative model. They are just used for inference.

  [Diagram: h3 ↔ W3 ↔ h2 (the top-level RBM), then directed weights W2 down to h1 and W1 down to the data.]

Why does greedy learning work? An aside: Averaging factorial distributions

* If you average some factorial distributions, you do NOT get a factorial distribution.
  - In an RBM, the posterior over the hidden units is factorial for each visible vector.
  - But the aggregated posterior over all training cases is not factorial (even if the data was generated by the RBM itself).

Why does greedy learning work?

* Each RBM converts its data distribution into an aggregated posterior distribution over its hidden units.
* This divides the task of modeling its data into two tasks:
  - Task 1: Learn generative weights that can convert the aggregated posterior distribution over the hidden units back into the data distribution.
  - Task 2: Learn to model the aggregated posterior distribution over the hidden units.
  - The RBM does a good job of Task 1 and a moderately good job of Task 2.
* Task 2 is easier (for the next RBM) than modeling the original data because the aggregated posterior distribution is closer to a distribution that an RBM can model perfectly.

  [Diagram: Task 2 is modeling $p(h|W)$, the aggregated posterior distribution on the hidden units; Task 1 is $p(v|h,W)$, mapping back to the data distribution on the visible units.]

Why does greedy learning work?

The weights, W, in the bottom-level RBM define p(v|h) and they also, indirectly, define p(h). So we can express the RBM model as

  $p(v) = \sum_h p(h) \, p(v|h)$

If we leave p(v|h) alone and improve p(h), we will improve p(v). To improve p(h), we need it to be a better model of the aggregated posterior distribution over hidden vectors produced by applying W to the data.

C(i&( distri7utions are fa&torial in a dire&ted 7elief netD


* n a dire&ted 7elief net 6it( one (idden layer: t(e posterior over t(e (idden units p((Gv# is non2 fa&torial (due to explainin5 a6ay#+ , T(e a55re5ated posterior is fa&torial if t(e data 6as 5enerated 7y t(e dire&ted model+
* t@s t(e opposite 6ay round from an undire&ted model 6(i&( (as fa&torial posteriors and a non2 fa&torial prior p((# over t(e (iddens+ * T(e intuitions t(at people (ave from usin5 dire&ted models are very misleadin5 for undire&ted models+

Why does greedy learning fail in a directed module?

* A directed module also converts its data distribution into an aggregated posterior.
  - Task 1: The learning is now harder because the posterior for each training case is non-factorial.
  - Task 2 is performed using an independent prior. This is a very bad approximation unless the aggregated posterior is close to factorial.
* A directed module attempts to make the aggregated posterior factorial in one step.
  - This is too difficult and leads to a bad compromise. There is also no guarantee that the aggregated posterior is easier to model than the data distribution.

  [Diagram: Task 2 is modeling $p(h|W_2)$, the aggregated posterior distribution on the hidden units; Task 1 is $p(v|h,W_1)$, the data distribution on the visible units.]

A model of di5it re&o5nition


T(e top t6o layers form an asso&iative memory 6(ose ener5y lands&ape models t(e lo6 dimensional manifolds of t(e di5its+ T(e ener5y valleys (ave names

2000 top2level neurons

.0 la7el neurons

000 neurons

T(e model learns to 5enerate &om7inations of la7els and ima5es+ To perform re&o5nition 6e start 6it( a neutral state of t(e la7el units and do an up2pass from t(e ima5e follo6ed 7y a fe6 iterations of t(e top2level asso&iative memory+

000 neurons 2A x 2A pixel ima5e

Fine-tuning with a contrastive version of the "wake-sleep" algorithm

After learning many layers of features, we can fine-tune the features to improve generation.
1. Do a stochastic bottom-up pass.
   - Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.
2. Do a few iterations of sampling in the top-level RBM.
   - Adjust the weights in the top-level RBM.
3. Do a stochastic top-down pass.
   - Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.

"(o6 t(e movie of t(e net6or3 5eneratin5 di5its


(availa7le at 666+&s+torontoO<(inton#

"amples 5enerated 7y lettin5 t(e asso&iative memory run 6it( one la7el &lamped+ T(ere are .000 iterations of alternatin5 $i77s samplin5 7et6een samples+

Hxamples of &orre&tly re&o5ni>ed (and6ritten di5its t(at t(e neural net6or3 (ad never seen 7efore

ts very 5ood

%o6 6ell does it dis&riminate on 8N "T test set 6it( no extra information a7out 5eometri& distortionsD
* * * * * * * $enerative model 7ased on 'B8@s "upport Be&tor 8a&(ine (De&oste et+ al+# Ba&3prop 6it( .000 (iddens (!latt# Ba&3prop 6it( 000 22Q-00 (iddens F2Nearest Nei5(7or "ee Le Cun et+ al+ .==A for more results .+20P .+/P <.+4P <.+4P < -+-P

ts 7etter t(an 7a&3prop and mu&( more neurally plausi7le 7e&ause t(e neurons only need to send one 3ind of si5nal: and t(e tea&(er &an 7e anot(er sensory input+

Unsupervised "pre-training" also helps for models that have more data and better priors

* Ranzato et al. (NIPS 2006) used an additional 600,000 distorted digits.
* They also used convolutional multilayer neural networks that have some built-in, local translational invariance.

  Back-propagation alone: 0.49%
  Unsupervised layer-by-layer pre-training followed by backprop: 0.39% (record)

Another view of why layer-by-layer learning works (Hinton, Osindero & Teh 2006)

* There is an unexpected equivalence between RBM's and directed networks with many layers that all use the same weights.
  - This equivalence also gives insight into why contrastive divergence learning works.

An infinite sigmoid belief net that is equivalent to an RBM

* The distribution generated by this infinite directed net with replicated weights is the equilibrium distribution for a compatible pair of conditional distributions, p(v|h) and p(h|v), that are both defined by W.
  - A top-down pass of the directed net is exactly equivalent to letting a Restricted Boltzmann Machine settle to equilibrium.
  - So this infinite directed net defines the same distribution as an RBM.

  etc.
  h2
   ↓ W
  v2
   ↓ W^T
  h1
   ↓ W
  v1
   ↓ W^T
  h0
   ↓ W
  v0

nferen&e in a dire&ted net 6it( repli&ated 6ei5(ts


* T(e varia7les in (0 are &onditionally independent 5iven v0+ , nferen&e is trivial+ Ce ?ust multiply v0 7y C transpose+ , T(e model a7ove (0 implements a &omplementary prior+ , 8ultiplyin5 v0 7y C transpose 5ives t(e produ&t of t(e li3eli(ood term and t(e prior term+ * nferen&e in t(e dire&ted net is exa&tly e;uivalent to lettin5 a 'estri&ted Bolt>mann 8a&(ine settle to e;uili7rium startin5 at t(e data+

et&+ (2

WT

v2
WT

(.
W

v. R R R
WT

(0 R W v0

et&+
* T(e learnin5 rule for a si5moid 7elief net is:

WT

i ) wij s j ( si s
+
1 sj)

2 s (2 j

WT

* Cit( repli&ated 6ei5(ts t(is 7e&omes:


0 s0 ( s j i

v2 s
W

2 i

WT
1 s (. j

s1 i)

1 0 si ( s j

+ si2 ) + ...
s j si

WT

v. s
W

1 i

W
WT

1 s1 ( s j i

0 s (0 j

WT

v0 s

0 i

Learnin5 a deep dire&ted net6or3


* 9irst learn 6it( all t(e 6ei5(ts tied , T(is is exa&tly e;uivalent to learnin5 an 'B8 , Contrastive diver5en&e learnin5 is e;uivalent to i5norin5 t(e small derivatives &ontri7uted 7y t(e tied 6ei5(ts 7et6een deeper layers+

et&+ (2

WT

v2
WT

(.
W

v. (0
W
WT

(0
W

v0

v0

* T(en free>e t(e first layer of 6ei5(ts in 7ot( dire&tions and learn t(e remainin5 6ei5(ts (still tied to5et(er#+ , T(is is e;uivalent to learnin5 anot(er 'B8: usin5 t(e a55re5ated posterior distri7ution of (0 as t(e data+

et&+ (2

WT

v2
WT

(.
W

v.
W

v.
WT

(0
T W frozen

(0
W frozen

v0

How many layers should we use and how wide should they be?

* There is no simple answer.
  - Extensive experiments by Yoshua Bengio's group (described later) suggest that several hidden layers is better than one.
  - Results are fairly robust against changes in the size of a layer, but the top layer should be big.
* Deep belief nets give their creator a lot of freedom.
  - The best way to use that freedom depends on the task.
  - With enough narrow layers we can model any distribution over binary vectors (Sutskever & Hinton, 2007).

What happens when the weights in higher layers become different from the weights in the first layer?

* The higher layers no longer implement a complementary prior.
  - So performing inference using the frozen weights in the first layer is no longer correct. But it's still pretty good.
  - Using this incorrect inference procedure gives a variational lower bound on the log probability of the data.
* The higher layers learn a prior that is closer to the aggregated posterior distribution of the first hidden layer.
  - This improves the network's model of the data.
* Hinton, Osindero and Teh (2006) prove that this improvement is always bigger than the loss in the variational bound caused by using less accurate inference.

An improved version of Contrastive Divergence learning (if time permits)

* The main worry with CD is that there will be deep minima of the energy function far away from the data.
  - To find these we need to run the Markov chain for a long time (maybe thousands of steps).
  - But we cannot afford to run the chain for too long for each update of the weights.
* Maybe we can run the same Markov chain over many weight updates? (Neal, 1992)
  - If the learning rate is very small, this should be equivalent to running the chain for many steps and then doing a bigger weight update.

Persistent CD (Tijmen Tieleman, ICML 2008 & 2009)

* Use minibatches of 100 cases to estimate the first term in the gradient. Use a single batch of 100 fantasies to estimate the second term in the gradient.
* After each weight update, generate the new fantasies from the previous fantasies by using one alternating Gibbs update.
  - So the fantasies can get far from the data.
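A sketch of one persistent-CD update under the same simplifications as the earlier CD-1 sketch (binary units, no biases, my names); the key difference is that the negative statistics come from persistent fantasy particles rather than one-step reconstructions.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def pcd_update(v_data, fantasies, W, epsilon=0.01, rng=np.random.default_rng(0)):
    """One PCD step; `fantasies` persists across calls instead of being reset to data."""
    ph_data = logistic(v_data @ W)                          # positive term from a minibatch
    # one alternating Gibbs update of the fantasy particles
    h_f = (rng.random((len(fantasies), W.shape[1])) < logistic(fantasies @ W)).astype(float)
    fantasies = (rng.random(fantasies.shape) < logistic(h_f @ W.T)).astype(float)
    ph_f = logistic(fantasies @ W)                          # negative term from fantasies
    W = W + epsilon * (v_data.T @ ph_data / len(v_data)
                       - fantasies.T @ ph_f / len(fantasies))
    return W, fantasies
```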

Contrastive diver5en&e as an adversarial 5ame


* C(y does persisitent CD 6or3 so 6ell 6it( only .00 ne5ative examples to &(ara&teri>e t(e 6(ole partition fun&tionD , 9or all interestin5 pro7lems t(e partition fun&tion is (i5(ly multi2modal+ , %o6 does it mana5e to find all t(e modes 6it(out startin5 at t(e dataD

T(e learnin5 &auses very fast mixin5


* T(e learnin5 intera&ts 6it( t(e 8ar3ov &(ain+
* !ersisitent Contrastive Diver5en&e &annot 7e analysed 7y vie6in5 t(e learnin5 as an outer loop+

, C(erever t(e fantasies outnum7er t(e positive data: t(e free2ener5y surfa&e is raised+ T(is ma3es t(e fantasies rus( around (ypera&tively+

How persistent CD moves between the modes of the model's distribution

* If a mode has more fantasy particles than data, the free-energy surface is raised until the fantasy particles escape.
  - This can overcome free-energy barriers that would be too high for the Markov chain to jump.
* The free-energy surface is being changed to help mixing in addition to defining the model.

"ummary so far
* 'estri&ted Bolt>mann 8a&(ines provide a simple 6ay to learn a layer of features 6it(out any supervision+ , 8aximum li3eli(ood learnin5 is &omputationally expensive 7e&ause of t(e normali>ation term: 7ut &ontrastive diver5en&e learnin5 is fast and usually 6or3s 6ell+ * 8any layers of representation &an 7e learned 7y treatin5 t(e (idden states of one 'B8 as t(e visi7le data for trainin5 t(e next 'B8 (a &omposition of experts#+ * T(is &reates 5ood 5enerative models t(at &an t(en 7e fine2tuned+ , Contrastive 6a3e2sleep &an fine2tune 5eneration+

BREAK

Overview of the rest of the tutorial

* How to fine-tune a greedily trained generative model to be better at discrimination.
* How to learn a kernel for a Gaussian process.
* How to use deep belief nets for non-linear dimensionality reduction and document retrieval.
* How to learn a generative hierarchy of conditional random fields.
* A more advanced learning module for deep belief nets that contains multiplicative interactions.
* How to learn deep models of sequential data.

9ine2tunin5 for dis&rimination


* 9irst learn one layer at a time 5reedily+ * T(en treat t(is as Kpre2trainin5L t(at finds a 5ood initial set of 6ei5(ts 6(i&( &an 7e fine2tuned 7y a lo&al sear&( pro&edure+ , Contrastive 6a3e2sleep is one 6ay of fine2 tunin5 t(e model to 7e 7etter at 5eneration+ * Ba&3propa5ation &an 7e used to fine2tune t(e model for 7etter dis&rimination+ , T(is over&omes many of t(e limitations of standard 7a&3propa5ation+

C(y 7a&3propa5ation 6or3s 7etter 6it( 5reedy pre2trainin5: T(e optimi>ation vie6
* $reedily learnin5 one layer at a time s&ales 6ell to really 7i5 net6or3s: espe&ially if 6e (ave lo&ality in ea&( layer+ * Ce do not start 7a&3propa5ation until 6e already (ave sensi7le feature dete&tors t(at s(ould already 7e very (elpful for t(e dis&rimination tas3+ , "o t(e initial 5radients are sensi7le and 7a&3prop only needs to perform a lo&al sear&( from a sensi7le startin5 point+

C(y 7a&3propa5ation 6or3s 7etter 6it( 5reedy pre2trainin5: T(e overfittin5 vie6
* 8ost of t(e information in t(e final 6ei5(ts &omes from modelin5 t(e distri7ution of input ve&tors+ , T(e input ve&tors 5enerally &ontain a lot more information t(an t(e la7els+ , T(e pre&ious information in t(e la7els is only used for t(e final fine2tunin5+ , T(e fine2tunin5 only modifies t(e features sli5(tly to 5et t(e &ate5ory 7oundaries ri5(t+ t does not need to dis&over features+ * T(is type of 7a&3propa5ation 6or3s 6ell even if most of t(e trainin5 data is unla7eled+ , T(e unla7eled data is still very useful for dis&overin5 5ood features+

First, model the distribution of digit images

The top two layers form a restricted Boltzmann machine whose free energy landscape should model the low-dimensional manifolds of the digits.

  2000 units
  500 units
  500 units
  28 x 28 pixel image

The network learns a density model for unlabeled digit images. When we generate from the model we get things that look like real digits of all classes. But do the hidden features really help with digit discrimination? Add 10 softmaxed units to the top and do backpropagation.

Results on the permutation-invariant MNIST task

* Very carefully trained backprop net with one or two hidden layers (Platt; Hinton)   1.6%
* SVM (Decoste & Schoelkopf, 2002)                                                    1.4%
* Generative model of joint density of images and labels (+ generative fine-tuning)   1.25%
* Generative model of unlabelled digits followed by gentle backpropagation
  (Hinton & Salakhutdinov, Science 2006)                                              1.15%

Learnin5 Dynami&s of Deep Nets


t(e next / slides des&ri7e 6or3 7y Mos(ua Ben5io@s 5roup

Before fine-tuning

After fine-tuning

Hffe&t of Unsupervised !re2trainin5


Erhan et. al. AISTATS2009

40

Hffe&t of Dept(
w/o pre-training

without pre-training

with pre-training

44

Learnin5 Tra?e&tories in 9un&tion "pa&e


(a 22D visuali>ation produ&ed 6it( t2"NH# Erhan et. al. AISTATS2009 * Ha&( point is a model in fun&tion spa&e * Color J epo&( * Top: tra?e&tories 6it(out pre2trainin5+ Ha&( tra?e&tory &onver5es to a different lo&al min+ * Bottom: Tra?e&tories 6it( pre2trainin5+ * No overlapN

Why unsupervised pre-training makes sense

  [Two generative diagrams. Left: the label is related directly to the image. Right: hidden "stuff" generates the image through a high-bandwidth pathway and the label through a low-bandwidth pathway.]

If image-label pairs were generated the first way, it would make sense to try to go straight from images to labels. For example, do the pixels have even parity?

If image-label pairs are generated the second way, it makes sense to first learn to recover the stuff that caused the image by inverting the high-bandwidth pathway.

Modeling real-valued data

* For images of digits it is possible to represent intermediate intensities as if they were probabilities by using "mean-field" logistic units.
  - We can treat intermediate values as the probability that the pixel is inked.
* This will not work for real images.
  - In a real image, the intensity of a pixel is almost always almost exactly the average of the neighboring pixels.
  - Mean-field logistic units cannot represent precise intermediate values.

'epla&in5 7inary varia7les 7y inte5er2valued varia7les


(Te( and %inton: 200.#

* Ene 6ay to model an inte5er2valued varia7le is to ma3e N identi&al &opies of a 7inary unit+ * All &opies (ave t(e same pro7a7ility: of 7ein5 KonL : p J lo5isti&(x# , T(e total num7er of KonL &opies is li3e t(e firin5 rate of a neuron+ , t (as a 7inomial distri7ution 6it( mean N p and varian&e N p(.2p#

A better way to implement integer values

* Make many copies of a binary unit.
* All copies have the same weights and the same adaptive bias, b, but they have different fixed offsets to the bias:

  b - 0.5,  b - 1.5,  b - 2.5,  b - 3.5, ....

A fast approximation

  $\sum_{n=1}^{\infty} \text{logistic}(x + 0.5 - n) \;\approx\; \log(1 + e^x)$

* Contrastive divergence learning works well for the sum of binary units with offset biases.
* It also works for rectified linear units. These are much faster to compute than the sum of many logistic units:

  output = max(0, x + randn * sqrt(logistic(x)))

How to train a bipartite network of rectified linear units

* Just use contrastive divergence to lower the energy of data and raise the energy of nearby configurations that the model prefers to the data.

  [Diagram: measure $\langle v_i h_j \rangle_{data}$ on the data and $\langle v_i h_j \rangle_{recon}$ on the reconstruction.]

Start with a training vector on the visible units. Update all hidden units in parallel with sampling noise. Update the visible units in parallel to get a "reconstruction". Update the hidden units again.

  $\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{recon} \right)$

3D Object Recognition: The NORB dataset

Stereo pairs of grayscale images of toy objects:
Animals, Humans, Planes, Trucks, Cars (normalized-uniform version of NORB)

* 6 lighting conditions, 162 viewpoints
* Five object instances per class in the training set
* A different set of five instances per class in the test set
* 24,300 training cases, 24,300 test cases

"implifyin5 t(e data


* Ha&( trainin5 &ase is a stereo2pair of =4x=4 ima5es+ , T(e o7?e&t is &entered+ , T(e ed5es of t(e ima5e are mainly 7lan3+ , T(e 7a&35round is uniform and 7ri5(t+ * To ma3e learnin5 faster used simplified t(e data: , T(ro6 a6ay one ima5e+ , Enly use t(e middle 4/x4/ pixels of t(e ot(er ima5e+ , Do6nsample to -2x-2 7y avera5in5 / pixels+

"implifyin5 t(e data even more so t(at it &an 7e modeled 7y re&tified linear units
* T(e intensity (isto5ram for ea&( -2x-2 ima5e (as a s(arp pea3 for t(e 7ri5(t 7a&35round+ * 9ind t(is pea3 and &all it >ero+ * Call all intensities 7ri5(ter t(an t(e 7a&35round >ero+ * 8easure intensities do6n6ards from t(e 7a&35round intensity+

Test set error rates on NORB after greedy learning of one or two hidden layers using rectified linear units

Full NORB (2 images of 96x96):
* Logistic regression on the raw pixels            20.5%
* Gaussian SVM (trained by Leon Bottou)            11.6%
* Convolutional neural net (Le Cun's group)         6.0%
  (convolutional nets have knowledge of translations built in)

Reduced NORB (1 image 32x32):
* Logistic regression on the raw pixels            30.2%
* Logistic regression on the first hidden layer    14.9%
* Logistic regression on the second hidden layer   10.2%

T(e re&eptive fields of some re&tified linear (idden units+

A standard type of real-valued visible unit

* We can model pixels as Gaussian variables. Alternating Gibbs sampling is still easy, though learning needs to be much slower.

  [Figure: a parabolic containment function centered at the bias $b_i$; the hidden units produce an energy-gradient that shifts each visible unit's mean through its total input.]

  $E(v,h) = \sum_{i \in vis} \frac{(v_i - b_i)^2}{2\sigma_i^2} \;-\; \sum_{j \in hid} b_j h_j \;-\; \sum_{i,j} \frac{v_i}{\sigma_i} h_j w_{ij}$

Welling et al. (2005) show how to extend RBM's to the exponential family. See also Bengio et al. (2007).
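A sketch of this energy function in code (my variable names); it is only the energy evaluation, not a full sampler.

```python
import numpy as np

def gaussian_binary_energy(v, h, W, b_vis, b_hid, sigma):
    """E(v,h) = sum_i (v_i-b_i)^2/(2 sigma_i^2) - sum_j b_j h_j - sum_ij (v_i/sigma_i) h_j w_ij."""
    return (np.sum((v - b_vis) ** 2 / (2.0 * sigma ** 2))
            - b_hid @ h
            - (v / sigma) @ W @ h)
```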

A random sample of 10,000 binary filters learned by Alex Krizhevsky on a million 32x32 color images.

Combining deep belief nets with Gaussian processes

* Deep belief nets can benefit a lot from unlabeled data when labeled data is scarce.
  - They just use the labeled data for fine-tuning.
* Kernel methods, like Gaussian processes, work well on small labeled training sets but are slow for large training sets.
* So when there is a lot of unlabeled data and only a little labeled data, combine the two approaches:
  - First learn a deep belief net without using the labels.
  - Then apply a Gaussian process model to the deepest layer of features. This works better than using the raw data.
  - Then use GP's to get the derivatives that are back-propagated through the deep belief net. This is a further win. It allows GP's to fine-tune complicated domain-specific kernels.

Learnin5 to extra&t t(e orientation of a fa&e pat&(


("ala3(utdinov ) %inton: N !" 2007#

The training and test sets for predicting face orientation

* 100, 500, or 1000 labeled cases
* 11,000 unlabeled cases
* face patches from new people

The root mean squared error in the orientation when combining GP's with deep belief nets

                                              100 labels   500 labels   1000 labels
  GP on the pixels                               22.2         17.2         16.3
  GP on top-level features                       17.9         12.7         11.2
  GP on top-level features with fine-tuning      15.2          7.2          6.4

Conclusion: The deep features are much better than the pixels. Fine-tuning helps a lot.

Deep Autoen&oders
(%inton ) "ala3(utdinov: 2004# * T(ey al6ays loo3ed li3e a really ni&e 6ay to do non2linear dimensionality redu&tion: , But it is very diffi&ult to optimi>e deep autoen&oders usin5 7a&3propa5ation+ * Ce no6 (ave a mu&( 7etter 6ay to optimi>e t(em: , 9irst train a sta&3 of / 'B8@s , T(en KunrollL t(em+ , T(en fine2tune 6it( 7a&3prop+

2Ax2A
W1T
.000 neurons
T W2

000 neurons

W3T
T W4

200 neurons -0

W4
200 neurons

linear units

W3
000 neurons

W2
.000 neurons

W1

2Ax2A

A &omparison of met(ods for &ompressin5 di5it ima5es to -0 real num7ers+

real data -02D deep auto -02D lo5isti& !CA -02D !CA

'etrievin5 do&uments t(at are similar to a ;uery do&ument


* Ce &an use an autoen&oder to find lo62 dimensional &odes for do&uments t(at allo6 fast and a&&urate retrieval of similar do&uments from a lar5e set+ * Ce start 7y &onvertin5 ea&( do&ument into a K7a5 of 6ordsL+ T(is a 2000 dimensional ve&tor t(at &ontains t(e &ounts for ea&( of t(e 2000 &ommonest 6ords+

%o6 to &ompress t(e &ount ve&tor


2000 re&onstru&ted &ounts output ve&tor

000 neurons
200 neurons .0 200 neurons

000 neurons 2000 6ord &ounts

* Ce train t(e neural net6or3 to reprodu&e its input ve&tor as its output * T(is for&es it to &ompress as mu&( information as possi7le into t(e .0 num7ers in t(e &entral 7ottlene&3+ * T(ese .0 num7ers are t(en a 5ood 6ay to &ompare do&uments+
input ve&tor

!erforman&e of t(e autoen&oder at do&ument retrieval


* Train on 7a5s of 2000 6ords for /00:000 trainin5 &ases of 7usiness do&uments+ , 9irst train a sta&3 of 'B8@s+ T(en fine2tune 6it( 7a&3prop+ * Test on a separate /00:000 do&uments+ , !i&3 one test do&ument as a ;uery+ 'an3 order all t(e ot(er test do&uments 7y usin5 t(e &osine of t(e an5le 7et6een &odes+ , 'epeat t(is usin5 ea&( of t(e /00:000 test do&uments as t(e ;uery (re;uires 0+.4 trillion &omparisons#+ * !lot t(e num7er of retrieved do&uments a5ainst t(e proportion t(at are in t(e same (and2la7eled &lass as t(e ;uery do&ument+
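A small sketch of the ranking step just described, assuming each document already has a real-valued code vector:

```python
import numpy as np

def rank_by_cosine(query_code, codes):
    """Order documents by the cosine of the angle between codes and the query's code."""
    codes = np.asarray(codes, dtype=float)
    q = np.asarray(query_code, dtype=float)
    sims = (codes @ q) / (np.linalg.norm(codes, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims)            # most similar documents first
```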

  [Plot: proportion of retrieved documents in the same class as the query vs. number of documents retrieved.]

9irst &ompress all do&uments to 2 num7ers usin5 a type of !CA T(en use different &olors for different do&ument &ate5ories

9irst &ompress all do&uments to 2 num7ers+ T(en use different &olors for different do&ument &ate5ories

9indin5 7inary &odes for do&uments


2000 re&onstru&ted &ounts

* Train an auto2en&oder usin5 -0 lo5isti& units for t(e &ode layer+ * Durin5 t(e fine2tunin5 sta5e: add noise to t(e inputs to t(e &ode units+ , T(e KnoiseL ve&tor for ea&( trainin5 &ase is fixed+ "o 6e still 5et a deterministi& 5radient+ , T(e noise for&es t(eir a&tivities to 7e&ome 7imodal in order to resist t(e effe&ts of t(e noise+ , T(en 6e simply round t(e a&tivities of t(e -0 &ode units to . or 0+

000 neurons
200 neurons -0

noise
200 neurons

000 neurons 2000 6ord &ounts

"emanti& (as(in5: Usin5 a deep autoen&oder as a (as(2fun&tion for findin5 approximate mat&(es ("ala3(utdinov ) %inton: 2007#

(as( fun&tion

Ksupermar3et sear&(L

How good is a shortlist found this way?

* We have only implemented it for a million documents with 20-bit codes --- but what could possibly go wrong?
  - A 20-D hypercube allows us to capture enough of the similarity structure of our document set.
* The shortlist found using binary codes actually improves the precision-recall curves of TF-IDF.
  - Locality sensitive hashing (the fastest other method) is 50 times slower and has worse precision-recall curves.
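A toy illustration of the semantic-hashing lookup, assuming the binary codes already exist: documents are stored under the integer address given by their code, and a shortlist is gathered by probing all addresses within a small Hamming distance of the query's address. All names here are mine.

```python
import itertools

def address(bits):
    """Interpret a binary code as a memory address."""
    return int("".join(str(int(b)) for b in bits), 2)

def build_table(codes):
    table = {}
    for doc_id, bits in enumerate(codes):
        table.setdefault(address(bits), []).append(doc_id)
    return table

def shortlist(query_bits, table, radius=2):
    """Fetch documents whose codes are within `radius` bits of the query code."""
    hits, n = [], len(query_bits)
    for r in range(radius + 1):
        for flips in itertools.combinations(range(n), r):
            probe = list(query_bits)
            for f in flips:
                probe[f] = 1 - probe[f]
            hits.extend(table.get(address(probe), []))
    return hits
```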

Generating the parts of an object

* One way to maintain the constraints between the parts is to generate each part very accurately.
  - But this would require a lot of communication bandwidth.
* Sloppy top-down specification of the parts is less demanding,
  - but it messes up relationships between features,
  - so use redundant features and use lateral interactions to clean up the mess.
* Each transformed feature helps to locate the others.
  - This allows a noisy channel.

  [Diagram: "square" → pose parameters → sloppy top-down activation of parts → features with top-down support → clean-up using known interactions.]

It's like soldiers on a parade ground.

"emi2restri&ted Bolt>mann 8a&(ines


* Ce restri&t t(e &onne&tivity to ma3e learnin5 easier+ * Contrastive diver5en&e learnin5 re;uires t(e (idden units to 7e in &onditional e;uili7rium 6it( t(e visi7les+ , But it does not re;uire t(e visi7le units to 7e in &onditional e;uili7rium 6it( t(e (iddens+ , All 6e re;uire is t(at t(e visi7le units are &loser to e;uili7rium in t(e re&onstru&tions t(an in t(e data+ * "o 6e &an allo6 &onne&tions 7et6een t(e visi7les+
(idden ?

i visi7le

Learnin5 a semi2restri&ted Bolt>mann 8a&(ine


.+ "tart 6it( a trainin5 ve&tor on t(e visi7le units+ 2+ Update all of t(e (idden units in parallel -+ 'epeatedly update all of t(e visi7le units in parallel usin5 mean2field updates (6it( t(e (iddens fixed# to 5et a Kre&onstru&tionL+ /+ Update all of t(e (idden units a5ain+

< vi h j > 0
i 3 i 3 i 3 i

< vi h j > 1
3

tJ0 data

tJ. re&onstru&tion

wij = ( < vi h j > 0 < vi h j > 1 )

lik = ( < vi vk > 0 < vi vk > 1 )


update for a lateral 6ei5(t

Learnin5 in "emi2restri&ted Bolt>mann 8a&(ines


* 8et(od .: To form a re&onstru&tion: &y&le t(rou5( t(e visi7le units updatin5 ea&( in turn usin5 t(e top2do6n input from t(e (iddens plus t(e lateral input from t(e ot(er visi7les+ * 8et(od 2: Use Kmean fieldL visi7le units t(at (ave real values+ Update t(em all in parallel+ , Use dampin5 to prevent os&illations
t+ 1 pi

t pi

+ (1 ) ( xi )
total input to i

dampin5

Results on modeling natural image patches using a stack of RBM's (Osindero and Hinton)

* Stack of RBM's learned one at a time. 1000 top-level units, with no MRF at the top level.
* 400 Gaussian visible units that see whitened image patches.
  - Derived from 100,000 Van Hateren image patches, each 20x20.
* The hidden units are all binary.
  - The lateral connections are learned when they are the visible units of their RBM.
* Reconstruction involves letting the visible units of each RBM settle using mean-field dynamics.
  - The already decided states in the level above determine the effective biases during mean-field settling.

  [Architecture: 1000 top-level units — undirected connections — hidden MRF with 500 units — directed connections — hidden MRF with 2000 units — directed connections — 400 Gaussian units.]

  [Figure: without lateral connections, real data vs. samples from the model; with lateral connections, real data vs. samples from the model.]

A funny way to use an MRF

* The lateral connections form an MRF.
* The MRF is used during learning and generation.
* The MRF is not used for inference.
  - This is a novel idea, so vision researchers don't like it.
* The MRF enforces constraints. During inference, constraints do not need to be enforced because the data obeys them.
  - The constraints only need to be enforced during generation.
* Unobserved hidden units cannot enforce constraints.
  - To enforce constraints requires lateral connections or observed descendants.

Why do we whiten data?

* Images typically have strong pair-wise correlations.
* Learning higher order statistics is difficult when there are strong pair-wise correlations.
  - Small changes in parameter values that improve the modeling of higher-order statistics may be rejected because they form a slightly worse model of the much stronger pair-wise statistics.
* So we often remove the second-order statistics before trying to learn the higher-order statistics.

Whitening the learning signal instead of the data

* Contrastive divergence learning can remove the effects of the second-order statistics on the learning without actually changing the data.
  - The lateral connections model the second-order statistics.
  - If a pixel can be reconstructed correctly using second-order statistics, it will be the same in the reconstruction as in the data.
  - The hidden units can then focus on modeling high-order structure that cannot be predicted by the lateral connections.
    * For example, a pixel close to an edge, where interpolation from nearby pixels causes incorrect smoothing.

Towards a more powerful, multi-linear stackable learning module

* So far, the states of the units in one layer have only been used to determine the effective biases of the units in the layer below.
* It would be much more powerful to modulate the pair-wise interactions in the layer below.
  - A good way to design a hierarchical system is to allow each level to determine the objective function of the level below.
* To modulate pair-wise interactions we need higher-order Boltzmann machines.

Higher order Boltzmann machines (Sejnowski, ~1986)

* The usual energy function is quadratic in the states:

  $E = -\text{bias terms} - \sum_{i<j} s_i s_j w_{ij}$

* But we could use higher order interactions:

  $E = -\text{bias terms} - \sum_{i<j<k} s_i s_j s_k w_{ijk}$

* Unit k acts as a switch. When unit k is on, it switches in the pairwise interaction between unit i and unit j.
  - Units i and j can also be viewed as switches that control the pairwise interactions between j and k or between i and k.

Using higher-order Boltzmann machines to model image transformations (the unfactored version)

* A global transformation specifies which pixel goes to which other pixel.
* Conversely, each pair of similar intensity pixels, one in each image, votes for a particular global transformation.

  [Diagram: image(t) and image(t+1) both connect to a layer of "image transformation" units through three-way interactions.]

9a&torin5 t(ree26ay multipli&ative intera&tions


E=

i, j ,h

si s j s h wijh

unfa&tored
6it( &u7i&ally many parameters

E=

si s j s h wif w jf whf

fa&tored
6it( linearly many parameters per fa&tor+

f i, j ,h
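A sketch comparing the two energy functions in code (my names). The factored energy exploits the fact that the triple sum separates into a product of three weighted sums per factor.

```python
import numpy as np

def unfactored_energy(si, sj, sh, W):
    """E = -sum_{i,j,h} s_i s_j s_h w_ijh, with W[i, j, h] = w_ijh."""
    return -np.einsum("i,j,h,ijh->", si, sj, sh, W)

def factored_energy(si, sj, sh, Wi, Wj, Wh):
    """E = -sum_f (sum_i s_i w_if)(sum_j s_j w_jf)(sum_h s_h w_hf);
    column f of Wi, Wj, Wh holds w_if, w_jf, w_hf."""
    return -np.sum((si @ Wi) * (sj @ Wj) * (sh @ Wh))
```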

A pi&ture of t(e lo62ran3 tensor &ontri7uted 7y fa&tor f


w jf whf wif
Ha&( layer is a s&aled version of t(e same matrix+ T(e 7asis matrix is spe&ified as an outer produ&t 6it( typi&al term wif w jf "o ea&( a&tive (idden unit &ontri7utes a s&alar: whf times t(e matrix spe&ified 7y fa&tor f +

nferen&e 6it( fa&tored t(ree26ay multipli&ative intera&tions


Ef =
i, j ,h

si s j sh wif w jf whf

T(e ener5y &ontri7uted 7y fa&tor f+

[ E f ( s h = 0)

E f ( s h = 1)

whf

si wif

s j w jf

%o6 &(an5in5 t(e 7inary state of unit ( &(an5es t(e ener5y &ontri7uted 7y fa&tor f+

This is what unit h needs to know in order to do Gibbs sampling.

Belief propagation

  [Factor graph: factor f connects units i, j and h through weights $w_{if}$, $w_{jf}$, $w_{hf}$.]

The outgoing message at each vertex of the factor is the product of the weighted sums at the other two vertices.

Learnin5 6it( fa&tored t(ree26ay multipli&ative intera&tions


mh f =
messa5e from fa&tor f to unit (

si wif Ef whf

s j w jf Ef whf
model

whf


data

sh m h f

data

sh m h f

model

'oland data

8odelin5 t(e &orrelational stru&ture of a stati& ima5e 7y usin5 t6o &opies of t(e ima5e
h

whf
f

Ha&( fa&tor sends t(e s;uared output of a linear filter to t(e (idden units+ t is exa&tly t(e standard model of simple and &omplex &ells+ t allo6s &omplex &ells to extra&t oriented ener5y+
j

wif
i

w jf

Copy .

Copy 2

T(e standard model drops out of doin5 7elief propa5ation for a fa&tored t(ird2order ener5y fun&tion+

An advantage of modeling correlations between pixels rather than pixels

* During generation, a "vertical edge" unit can turn off the horizontal interpolation in a region without worrying about exactly where the intensity discontinuity will be.
  - This gives some translational invariance.
  - It also gives a lot of invariance to brightness and contrast.
  - So the "vertical edge" unit is like a complex cell.
* By modulating the correlations between pixels rather than the pixel intensities, the generative model can still allow interpolation parallel to the edge.

A prin&iple of (ierar&(i&al systems


* Ha&( level in t(e (ierar&(y s(ould not try to mi&ro2mana5e t(e level 7elo6+ * nstead: it s(ould &reate an o7?e&tive fun&tion for t(e level 7elo6 and leave t(e level 7elo6 to optimi>e it+ , T(is allo6s t(e fine details of t(e solution to 7e de&ided lo&ally 6(ere t(e detailed information is availa7le+ * E7?e&tive fun&tions are a 5ood 6ay to do a7stra&tion+

Time series models

* Inference is difficult in directed models of time series if we use non-linear distributed representations in the hidden units.
  - It is hard to fit Dynamic Bayes Nets to high-dimensional sequences (e.g. motion capture data).
* So people tend to avoid distributed representations and use much weaker methods (e.g. HMM's).

Time series models

* If we really need distributed representations (which we nearly always do), we can make inference much simpler by using three tricks:
  - Use an RBM for the interactions between hidden and visible variables. This ensures that the main source of information wants the posterior to be factorial.
  - Model short-range temporal information by allowing several previous frames to provide input to the hidden units and to the visible units.
    * This leads to a temporal module that can be stacked.
  - So we can use greedy learning to learn deep models of temporal structure.

An appli&ation to modelin5 motion &apture data


(Taylor: 'o6eis ) %inton: 2007# * %uman motion &an 7e &aptured 7y pla&in5 refle&tive mar3ers on t(e ?oints and t(en usin5 lots of infrared &ameras to tra&3 t(e -2D positions of t(e mar3ers+ * $iven a s3eletal model: t(e -2D positions of t(e mar3ers &an 7e &onverted into t(e ?oint an5les plus 4 parameters t(at des&ri7e t(e -2D position and t(e roll: pit&( and ya6 of t(e pelvis+
, Ce only represent &(an5es in ya6 7e&ause p(ysi&s doesn@t &are a7out its value and 6e 6ant to avoid &ir&ular varia7les+

T(e &onditional 'B8 model


(a partially o7served C'9# * "tart 6it( a 5eneri& 'B8+ * Add t6o types of &onditionin5 &onne&tions+ * $iven t(e data: t(e (idden units at time t are &onditionally independent+ * T(e autore5ressive 6ei5(ts &an model most s(ort2term temporal stru&ture very 6ell: leavin5 t(e (idden units to model nonlinear irre5ularities (su&( as 6(en t(e foot (its t(e 5round#+
j
(

i
v

t22

t2.

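A hedged sketch of the two conditioning pathways, assuming binary hidden units; the matrices A (past → visible) and B (past → hidden) and all other names are mine, chosen only to illustrate how the past frames act as dynamic biases.

```python
import numpy as np

def crbm_hidden_probs(v_t, v_past, W, B, b_hid):
    """p(h = 1 | v_t, history): the concatenated past frames shift the hidden biases."""
    return 1.0 / (1.0 + np.exp(-(v_t @ W + v_past @ B + b_hid)))

def crbm_visible_bias(v_past, A, b_vis):
    """Time-dependent visible bias contributed by the autoregressive weights."""
    return v_past @ A + b_vis
```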
Causal generation from a learned model

* Keep the previous visible states fixed.
  - They provide a time-dependent bias for the hidden units.
* Perform alternating Gibbs sampling for a few iterations between the hidden units and the most recent visible units.
  - This picks new hidden and visible states that are compatible with each other and with the recent history.

Higher level models

* Once we have trained the model, we can add layers like in a Deep Belief Network.
* The previous layer CRBM is kept, and its output, while driven by the data, is treated as a new kind of "fully observed" data.
* The next level CRBM has the same architecture as the first (though we can alter the number of units it uses) and is trained the same way.
* Upper levels of the network model more "abstract" concepts.
* This greedy learning procedure can be justified using a variational bound.

Learning with "style" labels

* As in the generative model of handwritten digits (Hinton et al. 2006), style labels can be provided as part of the input to the top layer.
* The labels are represented by turning on one unit in a group of units, but they can also be blended.

  [Diagram: label units l and k feed the top-level CRBM along with the past frames at t-2 and t-1.]
"(o6 demo@s of multiple styles of 6al3in5


These can be foun at www.cs.toronto.e u/!gwta"lor/

Readings on deep belief nets

A reading list (that is still being updated) can be found at www.cs.toronto.edu/~hinton/deeprefs.html
