Historical background: First generation neural networks

* Perceptrons (~1960) used a layer of hand-coded features and tried to recognize objects by learning how to weight these features.
  - There was a neat learning algorithm for adjusting the weights.
  - But perceptrons are fundamentally limited in what they can learn to do.
[Figure: a perceptron deciding between two objects (e.g. "bomb" vs. "toy") from hand-coded features; a second-generation network mapping an input vector through hidden layers to outputs.]
A temporary digression

* Vapnik and his co-workers developed a very clever type of perceptron called a Support Vector Machine.
  - Instead of hand-coding the layer of non-adaptive features, each training example is used to create a new feature using a fixed recipe.
* The feature computes how similar a test example is to that training example.
  - Then a clever optimization technique is used to select the best subset of the features and to decide how to weight each feature when classifying a test case.
* But it's just a perceptron and has all the same limitations.

In the 1990's, many researchers abandoned neural networks with multiple adaptive hidden layers because Support Vector Machines worked better.
Belief Nets

* A belief net is a directed acyclic graph composed of stochastic variables.
* We get to observe some of the variables and we would like to solve two problems:
  - The inference problem: Infer the states of the unobserved variables.
  - The learning problem: Adjust the interactions between variables to make the network more likely to generate the observed data.
[Figure: a belief net with stochastic hidden causes above visible effects.]
We will use nets composed of layers of stochastic binary variables with weighted connections. Later, we will generalize to other types of variable.
A stochastic binary unit turns on with a probability given by the logistic function of its bias plus its weighted input from other units:

$$p(s_i = 1) = \frac{1}{1 + \exp\!\left(-b_i - \sum_j s_j w_{ji}\right)}$$

[Figure: the logistic curve, with $p(s_i = 1)$ rising from 0 to 1 as the total input $b_i + \sum_j s_j w_{ji}$ increases; unit $i$ receives the state $s_j$ of unit $j$ through weight $w_{ji}$.]
For a unit in a sigmoid belief net, the probability of turning on given the binary states $s_j$ of its parents is

$$p_i \equiv p(s_i = 1) = \frac{1}{1 + \exp\!\left(-\sum_j s_j w_{ji}\right)}$$

and the maximum-likelihood learning rule is a delta rule driven by the difference between the unit's sampled state and this probability:

$$\Delta w_{ji} = \varepsilon\, s_j (s_i - p_i)$$

where $\varepsilon$ is the learning rate.
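As a concrete illustration of this delta rule, here is a minimal sketch for a single unit with three parents; the network size, values, and names are illustrative, not from the slides:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
s_parents = np.array([1.0, 0.0, 1.0])   # sampled binary states of the parents
w = rng.normal(0.0, 0.1, size=3)        # weights w_ji into unit i

p_i = logistic(s_parents @ w)           # p(s_i = 1 | parent states)
s_i = float(rng.random() < p_i)         # sample the binary state of unit i

eps = 0.1                               # learning rate
w += eps * s_parents * (s_i - p_i)      # delta rule: follows the gradient of
                                        # the log probability of the sample
```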
An example of "explaining away": two independent, rare hidden causes can both produce the same visible effect, "house jumps".

[Figure: each hidden cause connects to the visible effect with a weight of +20, and the visible effect has a bias of -20.]

Although the causes are independent in the prior, observing the effect makes them anti-correlated in the posterior: once one cause is known to be active, it explains the effect and the other cause becomes unlikely.

posterior: p(1,1) = .0001, p(1,0) = .4999, p(0,1) = .4999, p(0,0) = .0001
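This posterior can be checked by brute force. A small sketch follows; the bias of -10 on each hidden cause is an assumption chosen to make the causes suitably rare, while the +20 weights and -20 visible bias come from the diagram:

```python
import numpy as np
from itertools import product

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

b_hidden = -10.0   # assumed bias on each hidden cause (makes them rare)
w = 20.0           # weight from each cause to the visible effect
b_visible = -20.0  # bias on the visible effect

# Unnormalized posterior p(h1, h2 | v = 1) over the two hidden causes.
joint = {}
for h1, h2 in product([0, 1], repeat=2):
    prior = (logistic(b_hidden) ** (h1 + h2)
             * (1 - logistic(b_hidden)) ** (2 - h1 - h2))
    likelihood = logistic(b_visible + w * (h1 + h2))
    joint[(h1, h2)] = prior * likelihood

z = sum(joint.values())
for config in sorted(joint):
    print(config, joint[config] / z)
# The two "exactly one cause on" states get essentially all the mass;
# (1,1) and (0,0) are vanishingly unlikely, i.e. the causes anti-correlate.
```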
Why it is usually very hard to learn sigmoid belief nets one layer at a time

* To learn W, we need the posterior distribution in the first hidden layer.
* Problem 1: The posterior is typically complicated because of "explaining away".
* Problem 2: The posterior depends on the prior as well as the likelihood.
  - So to learn W, we need to know the weights in higher layers, even if we are only approximating the posterior. All the weights interact.
* Problem 3: We need to integrate over all possible configurations of the higher variables to get the prior for the first hidden layer. Yuk!

[Figure: layers of hidden variables above the data, with weights W between the data and the first hidden layer.]
Restricted Boltzmann Machines

* We restrict the connectivity to make inference easy:
  - No connections between hidden units.
* In an RBM, the hidden units are conditionally independent given the visible states.
  - So we can quickly get an unbiased sample from the posterior distribution when given a data-vector.
  - This is a big advantage over directed belief nets.

[Figure: a layer of hidden units j above a layer of visible units i, with no within-layer connections.]
The energy of a joint configuration, with binary vector v on the visible units and h on the hidden units, is

$$E(v,h) = -\sum_{i,j} v_i h_j w_{ij}$$

where $v_i$ and $h_j$ are the binary states of visible unit $i$ and hidden unit $j$, and $w_{ij}$ is the weight between units $i$ and $j$. This energy determines the probability of every joint configuration: $p(v,h) \propto e^{-E(v,h)}$.
* The probability of a joint configuration is its Boltzmann factor divided by the partition function, which sums over all joint configurations:

$$p(v,h) = \frac{e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$$

* The probability of a configuration over the visible units is found by summing the probabilities of all the joint configurations that contain it:

$$p(v) = \frac{\sum_h e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$$
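For a network small enough to enumerate, these sums can be computed directly. A minimal sketch (biases omitted; sizes and values illustrative):

```python
import numpy as np
from itertools import product

def energy(v, h, W):
    # E(v, h) = -sum_ij v_i h_j w_ij  (biases omitted for brevity)
    return -v @ W @ h

rng = np.random.default_rng(0)
n_vis, n_hid = 3, 2
W = rng.normal(0.0, 1.0, size=(n_vis, n_hid))

vis_configs = [np.array(c, dtype=float) for c in product([0, 1], repeat=n_vis)]
hid_configs = [np.array(c, dtype=float) for c in product([0, 1], repeat=n_hid)]

# Partition function: sum over every joint configuration (u, g).
Z = sum(np.exp(-energy(u, g, W)) for u in vis_configs for g in hid_configs)

# Marginal probability of one visible vector: sum over hidden configurations.
v = vis_configs[5]
p_v = sum(np.exp(-energy(v, g, W)) for g in hid_configs) / Z
print(p_v)
```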
Maximum likelihood learning for an RBM: start with a training vector on the visible units, then alternate between updating all the hidden units in parallel and updating all the visible units in parallel. The learning rule compares the pairwise statistics measured on the data (t = 0) with those of a "fantasy" obtained by running the alternating chain to equilibrium:

$$\Delta w_{ij} \propto \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty$$

[Figure: the alternating Gibbs chain at t = 0, 1, 2, ..., infinity.]
A quick way to learn an RBM (contrastive divergence): start with a training vector on the visible units. Update all the hidden units in parallel. Update all the visible units in parallel to get a "reconstruction". Update the hidden units again. Then use

$$\Delta w_{ij} \propto \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1$$

where $\langle \cdot \rangle^0$ is measured on the data (t = 0) and $\langle \cdot \rangle^1$ on the reconstruction (t = 1).
How to learn a set of features that are good for reconstructing images of the digit 2

[Figure: 50 binary feature neurons connected to a 16 x 16 pixel image; on the data (reality), increment weights between an active pixel and an active feature.]
How well can we reconstruct the digit images from the binary feature activations?

[Figure: pairs of panels showing data next to the reconstruction from activated binary features — first for new test images from the digit class that the model was trained on, then for images from an unfamiliar digit class (the network tries to see every image as a 2).]
Three ways to combine probability density models (an underlying theme of the tutorial)

* Mixture: Take a weighted average of the distributions.
  - It can never be sharper than the individual distributions. It's a very weak way to combine models.
* Product: Multiply the distributions at each point and then renormalize (this is how an RBM combines the distributions defined by each hidden unit).
  - Exponentially more powerful than a mixture. The normalization makes maximum likelihood learning difficult, but approximations allow us to learn anyway. The numeric sketch below contrasts the two.
* Composition: Use the values of the latent variables of one model as the data for the next model.
  - Works well for learning multiple layers of representation, but only if the individual models are undirected.
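A tiny numeric contrast between a mixture and a product of two experts over four outcomes (the distributions are made up for illustration):

```python
import numpy as np

# Two expert distributions over the same four discrete outcomes.
p1 = np.array([0.4, 0.4, 0.1, 0.1])
p2 = np.array([0.1, 0.4, 0.4, 0.1])

mixture = 0.5 * p1 + 0.5 * p2   # never sharper than the individual experts
product = p1 * p2
product /= product.sum()        # renormalize: the expensive step at scale

print(mixture)   # [0.25 0.4  0.25 0.1 ]
print(product)   # [0.16 0.64 0.16 0.04] -- the shared mode is sharpened
```

The product concentrates mass where the experts agree, which is why it can be exponentially sharper than any mixture of the same experts.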
[Figure: a greedy stack — data at the bottom, then hidden layers h1, h2, h3 connected by weight matrices W1, W2, W3.]
Writing the model in terms of a prior over hidden vectors, $p(h \mid W)$, and a conditional over the visibles, $p(v \mid h, W)$:

$$p(v) = \sum_h p(h)\, p(v \mid h)$$

To improve $p(h)$, we need it to be a better model of the aggregated posterior distribution over hidden vectors produced by applying $W$ to the data.
[Figure: the learning problem splits in two — the next model takes over $p(h \mid W_2)$ while the first layer keeps $p(v \mid h, W_1)$.]

[Figure: the digit model's architecture, with 10 label neurons joined to a 500-neuron hidden layer beneath the top-level associative memory.]
The model learns to generate combinations of labels and images. To perform recognition we start with a neutral state of the label units and do an up-pass from the image followed by a few iterations of the top-level associative memory. A sketch of this procedure is given below.
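A compact sketch of that recognition procedure; the single up-pass matrix, the top-level weight matrices, and the neutral label initialization are all simplifying assumptions (a real DBN has several up-pass layers and softmax label units):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def recognize(image, W_up, W_img_top, W_lab_top, n_iters=10, rng=None):
    """Up-pass to the penultimate layer, then a few iterations of the
    top-level associative memory with the image-side input held fixed."""
    rng = rng or np.random.default_rng(0)
    h = logistic(image @ W_up)                        # deterministic up-pass
    n_labels = W_lab_top.shape[0]
    label = np.full(n_labels, 1.0 / n_labels)         # neutral label state
    for _ in range(n_iters):
        top_p = logistic(h @ W_img_top + label @ W_lab_top)
        top = (rng.random(top_p.shape) < top_p).astype(float)
        label = logistic(top @ W_lab_top.T)           # only the labels update;
                                                      # the image input is clamped
    return int(np.argmax(label))
```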
"amples 5enerated 7y lettin5 t(e asso&iative memory run 6it( one la7el &lamped+ T(ere are .000 iterations of alternatin5 $i77s samplin5 7et6een samples+
Examples of correctly recognized handwritten digits that the neural network had never seen before. It's very good.
How well does it discriminate on the MNIST test set with no extra information about geometric distortions?

* Generative model based on RBM's: 1.25%
* Support Vector Machine (Decoste et al.): 1.4%
* Backprop with 1000 hiddens (Platt): ~1.6%
* Backprop with 500 --> 300 hiddens: ~1.6%
* K-Nearest Neighbor: ~3.3%
* See Le Cun et al. 1998 for more results.
It's better than backprop and much more neurally plausible because the neurons only need to send one kind of signal, and the teacher can be another sensory input.
Unsupervised "pre-training" also helps for models that have more data and better priors

* Ranzato et al. (NIPS 2006) used an additional 600,000 distorted digits.
* They also used convolutional multilayer neural networks that have some built-in, local translational invariance.
  - Back-propagation alone: 0.49%
Another view of why layer-by-layer learning works (Hinton, Osindero & Teh 2006)

* There is an unexpected equivalence between RBM's and directed networks with many layers that all use the same weights.
  - This equivalence also gives insight into why contrastive divergence learning works.
[Figure: an infinite directed net with tied weights — an endless alternation ... h2, v2, h1, v1, h0, v0, with W connecting each hidden layer to the visible layer below and W^T used for inference upwards.]
* The learning rule for a sigmoid belief net is $\Delta w_{ij} \propto s_j (s_i - p_i)$.
* With tied weights, the full derivative for a weight is the sum of the derivatives it receives at every layer of the infinite net, and successive terms cancel in pairs:

$$\Delta w_{ij} \propto s_j^0 (s_i^0 - s_i^1) + s_i^1 (s_j^0 - s_j^1) + s_j^1 (s_i^1 - s_i^2) + \cdots = s_j^0 s_i^0 - s_j^\infty s_i^\infty$$

Everything telescopes away except the first and last terms, leaving exactly the Boltzmann machine learning rule.
* First learn with all the weight matrices tied together.
  - Learning the tied weights is exactly equivalent to learning an RBM on v0.

[Figure: the infinite tied-weight net collapses to a single RBM between v0 and h0.]
* Then freeze the first layer of weights in both directions and learn the remaining weights (still tied together).
  - This is equivalent to learning another RBM, using the aggregated posterior distribution of h0 as the data. A code sketch of this stacking recipe follows the figure below.
[Figure: the stack after freezing — W_frozen (and its transpose for inference) fixed between v0 and h0, while the tied weights above are trained on h0's aggregated posterior.]
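A greedy layer-by-layer training sketch using CD-1 inside each layer; the layer sizes, data, and the use of hidden probabilities (rather than samples) to drive the next layer are illustrative choices:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, n_epochs=10, eps=0.05, rng=None):
    """Train one RBM on `data` with CD-1 (biases omitted) and return W."""
    rng = rng or np.random.default_rng(0)
    W = rng.normal(0.0, 0.01, size=(data.shape[1], n_hidden))
    for _ in range(n_epochs):
        ph0 = logistic(data @ W)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = logistic(h0 @ W.T)
        v1 = (rng.random(pv1.shape) < pv1).astype(float)
        ph1 = logistic(v1 @ W)
        W += eps * (data.T @ ph0 - v1.T @ ph1) / data.shape[0]
    return W

# Greedy stacking: the hidden activities of one RBM (its aggregated
# posterior over h) become the training data for the next RBM.
rng = np.random.default_rng(0)
images = (rng.random((100, 784)) < 0.3).astype(float)  # stand-in binary data
weights, layer_input = [], images
for n_hidden in [500, 500, 2000]:                      # sizes are illustrative
    W = train_rbm(layer_input, n_hidden, rng=rng)
    weights.append(W)
    layer_input = logistic(layer_input @ W)            # drive the next layer
```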
How many layers should we use and how wide should they be?

* There is no simple answer.
  - Extensive experiments by Yoshua Bengio's group (described later) suggest that several hidden layers is better than one.
  - Results are fairly robust against changes in the size of a layer, but the top layer should be big.
* Deep belief nets give their creator a lot of freedom.
  - The best way to use that freedom depends on the task.
  - With enough narrow layers we can model any distribution over binary vectors (Sutskever & Hinton, 2007).
What happens when the weights in higher layers become different from the weights in the first layer?

* The higher layers no longer implement a complementary prior.
  - So performing inference using the frozen weights in the first layer is no longer correct. But it's still pretty good.
  - Using this incorrect inference procedure gives a variational lower bound on the log probability of the data.
* The higher layers learn a prior that is closer to the aggregated posterior distribution of the first hidden layer.
  - This improves the network's model of the data.
* Hinton, Osindero and Teh (2006) prove that this improvement is always bigger than the loss in the variational bound caused by using less accurate inference.
Persistent CD (Tijmen Tieleman, ICML 2008 & 2009)

* Use minibatches of 100 cases to estimate the first term in the gradient. Use a single batch of 100 fantasies to estimate the second term in the gradient.
* After each weight update, generate the new fantasies from the previous fantasies by using one alternating Gibbs update.
  - So the fantasies can get far from the data.
  - Wherever the fantasies outnumber the positive data, the free-energy surface is raised. This makes the fantasies rush around hyperactively. A sketch of the update loop is given below.
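A minimal persistent-CD sketch (biases omitted; batch sizes, learning rate, and initialization are illustrative). The substantive difference from CD-1 is that the negative statistics come from a persistent fantasy batch, advanced by one Gibbs update per weight update rather than restarted at the data:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def pcd_update(v_data, fantasies, W, eps, rng):
    """One persistent-CD step; returns the new weights and fantasies."""
    ph_data = logistic(v_data @ W)                        # positive statistics
    # Advance the persistent chain by one alternating Gibbs update.
    h = (rng.random((fantasies.shape[0], W.shape[1]))
         < logistic(fantasies @ W)).astype(float)
    fantasies = (rng.random(fantasies.shape)
                 < logistic(h @ W.T)).astype(float)
    ph_fant = logistic(fantasies @ W)                     # negative statistics
    grad = (v_data.T @ ph_data / v_data.shape[0]
            - fantasies.T @ ph_fant / fantasies.shape[0])
    return W + eps * grad, fantasies

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.01, size=(16, 8))
data = (rng.random((100, 16)) < 0.3).astype(float)        # stand-in minibatch
fantasies = (rng.random((100, 16)) < 0.5).astype(float)   # persistent chain
for step in range(100):
    W, fantasies = pcd_update(data, fantasies, W, eps=0.01, rng=rng)
```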
"ummary so far
* 'estri&ted Bolt>mann 8a&(ines provide a simple 6ay to learn a layer of features 6it(out any supervision+ , 8aximum li3eli(ood learnin5 is &omputationally expensive 7e&ause of t(e normali>ation term: 7ut &ontrastive diver5en&e learnin5 is fast and usually 6or3s 6ell+ * 8any layers of representation &an 7e learned 7y treatin5 t(e (idden states of one 'B8 as t(e visi7le data for trainin5 t(e next 'B8 (a &omposition of experts#+ * T(is &reates 5ood 5enerative models t(at &an t(en 7e fine2tuned+ , Contrastive 6a3e2sleep &an fine2tune 5eneration+
BREAK
Why backpropagation works better with greedy pre-training: The optimization view

* Greedily learning one layer at a time scales well to really big networks, especially if we have locality in each layer.
* We do not start backpropagation until we already have sensible feature detectors that should already be very helpful for the discrimination task.
  - So the initial gradients are sensible and backprop only needs to perform a local search from a sensible starting point.
Why backpropagation works better with greedy pre-training: The overfitting view

* Most of the information in the final weights comes from modeling the distribution of input vectors.
  - The input vectors generally contain a lot more information than the labels.
  - The precious information in the labels is only used for the final fine-tuning.
  - The fine-tuning only modifies the features slightly to get the category boundaries right. It does not need to discover features.
* This type of backpropagation works well even if most of the training data is unlabeled.
  - The unlabeled data is still very useful for discovering good features.
[Figure: a 2000-unit top layer above the lower layers of the digit model.]

The network learns a density model for unlabeled digit images. When we generate from the model we get things that look like real digits of all classes. But do the hidden features really help with digit discrimination? Add 10 softmaxed units to the top and do backpropagation: 1.6% test error.
[Figure: the learned features shown before fine-tuning and after fine-tuning.]
Effect of Depth

[Figure: test error with and without pre-training as the number of layers varies.]
[Figure: two causal stories for an image-label pair. In the first, the label is generated directly from the image. In the second, hidden "stuff" generates the image through a high-bandwidth pathway and the label through a low-bandwidth one.]

If image-label pairs were generated in the first way, it would make sense to try to go straight from images to labels. For example, do the pixels have even parity?

If image-label pairs are generated in the second way, it makes sense to first learn to recover the stuff that caused the image by inverting the high-bandwidth pathway.
* One way to model an integer-valued variable is to make N identical copies of a binary unit.
* All copies have the same probability of being "on": p = logistic(x).
  - The total number of "on" copies is like the firing rate of a neuron.
  - It has a binomial distribution with mean Np and variance Np(1-p).
A fast approximation:

$$\sum_{n=1}^{\infty} \operatorname{logistic}(x + 0.5 - n) \approx \log(1 + e^x)$$
* Contrastive divergence learning works well for the sum of binary units with offset biases.
* It also works for rectified linear units. These are much faster to compute than the sum of many logistic units:

  output = max(0, x + randn * sqrt(logistic(x)))
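A quick numeric check of the softplus approximation, followed by the noisy rectified linear unit from the recipe above (the test value of x is arbitrary):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 2.3   # arbitrary test input

# Sum of logistic units with offset biases vs. softplus log(1 + e^x):
approx = sum(logistic(x + 0.5 - n) for n in range(1, 1000))
print(approx, np.log(1.0 + np.exp(x)))   # ~2.392 vs ~2.396

# Noisy rectified linear unit: max(0, x + Gaussian noise whose
# variance is logistic(x)), as in the slide's recipe.
rng = np.random.default_rng(0)
output = max(0.0, x + rng.standard_normal() * np.sqrt(logistic(x)))
```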
For learning, the same contrastive divergence recipe applies: start with a training vector on the visible units; update all hidden units in parallel with sampling noise; update the visible units in parallel to get a "reconstruction"; update the hidden units again. Then use

$$\Delta w_{ij} \propto \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}}$$
The NORB dataset: 6 lighting conditions, 162 viewpoints.
  - Five object instances per class in the training set.
  - A different set of five instances per class in the test set.
  - 24,300 training cases, 24,300 test cases.
"implifyin5 t(e data even more so t(at it &an 7e modeled 7y re&tified linear units
* T(e intensity (isto5ram for ea&( -2x-2 ima5e (as a s(arp pea3 for t(e 7ri5(t 7a&35round+ * 9ind t(is pea3 and &all it >ero+ * Call all intensities 7ri5(ter t(an t(e 7a&35round >ero+ * 8easure intensities do6n6ards from t(e 7a&35round intensity+
Test set error rates on NORB after greedy learning of one or two hidden layers using rectified linear units

Full NORB (2 images of 96x96):
* Logistic regression on the raw pixels: 20.5%
* Gaussian SVM (trained by Leon Bottou): 11.6%
* Convolutional neural net (Le Cun's group): 6.0% (convolutional nets have knowledge of translations built in)
* Greedily learned layers of rectified linear units: 14.9% and 15.2%
With Gaussian visible units, each visible unit gets a parabolic containment function that keeps it near its bias $b_i$, and the hidden units contribute a linear shift:

$$E(v,h) = \sum_{i \in \text{vis}} \frac{(v_i - b_i)^2}{2\sigma_i^2} \;-\; \sum_{j \in \text{hid}} b_j h_j \;-\; \sum_{i,j} \frac{v_i}{\sigma_i}\, h_j w_{ij}$$
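A direct transcription of that energy function (a sketch; the shapes are assumptions — v, b_vis, sigma over the visibles, h, b_hid over the hiddens, and W of size n_vis x n_hid):

```python
import numpy as np

def gaussian_binary_energy(v, h, b_vis, b_hid, sigma, W):
    """Energy of a Gaussian-visible / binary-hidden RBM, as above."""
    containment = np.sum((v - b_vis) ** 2 / (2.0 * sigma ** 2))
    return containment - b_hid @ h - (v / sigma) @ W @ h

# Illustrative values only:
rng = np.random.default_rng(0)
v = rng.normal(size=4)
h = (rng.random(3) < 0.5).astype(float)
print(gaussian_binary_energy(v, h, np.zeros(4), np.zeros(3),
                             np.ones(4), rng.normal(0, 0.1, (4, 3))))
```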
Welling et al. (2005) show how to extend RBM's to the exponential family. See also Bengio et al. (2007).
A random sample of 10,000 binary filters learned by Alex Krizhevsky on a million 32x32 color images.
The root mean squared error in the orientation when combining GP's with deep belief nets:

* GP on the pixels
* GP on top-level features
* GP on top-level features with fine-tuning

Conclusion: The deep features are much better than the pixels. Fine-tuning helps a lot.
Deep Autoencoders (Hinton & Salakhutdinov, 2006)

* They always looked like a really nice way to do non-linear dimensionality reduction:
  - But it is very difficult to optimize deep autoencoders using backpropagation.
* We now have a much better way to optimize them:
  - First train a stack of 4 RBM's.
  - Then "unroll" them.
  - Then fine-tune with backprop.
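A sketch of the "unrolling" step: the encoder reuses the RBM weight matrices and the decoder reuses their transposes, giving a single deep net that backprop can then fine-tune (not shown). Using a logistic code layer here is a simplification; the 30-unit code layer is linear in the original architecture:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def autoencode(v, rbm_weights):
    """Map an input through the unrolled stack and back; returns the
    low-dimensional code and the reconstruction."""
    code = v
    for W in rbm_weights:             # encoder: W1, W2, W3, W4
        code = logistic(code @ W)
    recon = code
    for W in reversed(rbm_weights):   # decoder: W4^T, W3^T, W2^T, W1^T
        recon = logistic(recon @ W.T)
    return code, recon
```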
[Figure: the unrolled autoencoder — a 28x28 image feeds 1000 neurons (W1), then 500 neurons (W2), then 250 neurons (W3), then 30 linear units (W4), and back out through W4^T, W3^T, W2^T, W1^T to a 28x28 reconstruction.]
[Figure: real data alongside 30-D deep autoencoder, 30-D logistic PCA, and 30-D PCA reconstructions.]
[Figure: a document autoencoder whose 500- and 250-neuron layers narrow to a central bottleneck of 10 units, applied to an input vector.]

* We train the neural network to reproduce its input vector as its output.
* This forces it to compress as much information as possible into the 10 numbers in the central bottleneck.
* These 10 numbers are then a good way to compare documents.
First compress all documents to 2 numbers using a type of PCA. Then use different colors for different document categories.

First compress all documents to 2 numbers with the deep autoencoder. Then use different colors for different document categories.
* Train an autoencoder using 30 logistic units for the code layer.
* During the fine-tuning stage, add noise to the inputs to the code units.
  - The "noise" vector for each training case is fixed. So we still get a deterministic gradient.
  - The noise forces their activities to become bimodal in order to resist the effects of the noise.
  - Then we simply round the activities of the 30 code units to 1 or 0. A sketch of the resulting binary codes and hash lookup follows.
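A sketch of the rounding trick and the hash-style lookup it enables; the stand-in code-unit inputs, sizes, and bimodality are all assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_bits = 1000, 30
code_inputs = rng.normal(0.0, 4.0, size=(n_docs, n_bits))  # stand-in inputs
activities = 1.0 / (1.0 + np.exp(-code_inputs))
# After noise-trained fine-tuning the activities are bimodal, so rounding
# to 1 or 0 loses little information:
codes = (activities > 0.5).astype(np.uint64)

# Treat each 30-bit code as a memory address; similar documents should
# land at nearby addresses.
addresses = codes @ (np.uint64(1) << np.arange(n_bits, dtype=np.uint64))
query = int(addresses[0])
# Candidate matches: flip one bit at a time (a Hamming ball of radius 1).
neighbours = [query ^ (1 << bit) for bit in range(n_bits)]
```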
[Figure: the code layer — 250 neurons feeding 30 code units, with noise added to their inputs, feeding 250 neurons.]
"emanti& (as(in5: Usin5 a deep autoen&oder as a (as(2fun&tion for findin5 approximate mat&(es ("ala3(utdinov ) %inton: 2007#
(as( fun&tion
Ksupermar3et sear&(L
[Figure: recognizing a "square" from its parts — pose parameters, sloppy top-down activation of parts, features with top-down support, and clean-up using known interactions.]
With lateral connections between the visible units, the hidden units can still be updated in parallel, but the reconstruction must respect the lateral interactions, so the visible units are settled with damped mean-field updates:

$$p_i^{t+1} = \lambda\, p_i^t + (1 - \lambda)\,\sigma(x_i^t)$$

where $x_i$ is the total input to unit $i$ and $\lambda$ is a damping coefficient. The learning rule still compares $\langle v_i h_j \rangle$ on the data (t = 0) with the reconstruction (t = 1).
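A minimal sketch of the damped mean-field settling; the lateral weights, bias, and damping value are illustrative:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def damped_mean_field(p, total_input, lam=0.5, n_steps=10):
    """Settle unit probabilities with p <- lam * p + (1 - lam) * sigma(x),
    where `total_input` maps current probabilities to each unit's total
    input (lateral weights and biases are baked into that function)."""
    for _ in range(n_steps):
        p = lam * p + (1.0 - lam) * logistic(total_input(p))
    return p

# Two visible units with a symmetric lateral weight and biases:
L = np.array([[0.0, 0.5],
              [0.5, 0.0]])
b = np.array([-0.2, 0.1])
p = damped_mean_field(np.full(2, 0.5), lambda p: p @ L + b)
print(p)
```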
Results on modeling natural image patches using a stack of RBM's (Osindero and Hinton)

* Stack of RBM's learned one at a time.
* 400 Gaussian visible units that see whitened image patches.
  - Derived from 100,000 Van Hateren image patches, each 20x20.
* The hidden units are all binary.
  - The lateral connections are learned when they are the visible units of their RBM.
* Reconstruction involves letting the visible units of each RBM settle using mean-field dynamics.
  - The already decided states in the level above determine the effective biases during mean-field settling.

[Figure: the stack — 400 Gaussian units (no MRF), a hidden MRF with 500 units, a hidden MRF with 2000 units, and 1000 top-level units.]
[Figure: Gaussian units at the bottom of the stack; the top level keeps undirected connections while the lower levels become directed.]
A standard Boltzmann machine has an energy built from pairwise interactions:

$$E = -(\text{bias terms}) - \sum_{i<j} s_i s_j w_{ij}$$

A higher-order Boltzmann machine adds three-way interaction terms:

$$E = -(\text{bias terms}) - \sum_{i<j<k} s_i s_j s_k w_{ijk}$$
* Unit k acts as a switch. When unit k is on, it switches in the pairwise interaction between unit i and unit j.
  - Units i and j can also be viewed as switches that control the pairwise interactions between j and k or between i and k.
The unfactored third-order energy has cubically many parameters,

$$E = -\sum_{i,j,h} s_i s_j s_h\, w_{ijh}$$

whereas the factored form has linearly many parameters per factor:

$$E = -\sum_f \Big(\sum_i s_i w_{if}\Big)\Big(\sum_j s_j w_{jf}\Big)\Big(\sum_h s_h w_{hf}\Big)$$
How changing the binary state of unit h changes the energy contributed by factor f:

$$E_f(s_h = 0) - E_f(s_h = 1) = w_{hf}\Big(\sum_i s_i w_{if}\Big)\Big(\sum_j s_j w_{jf}\Big)$$
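A direct transcription of the factored energy and that energy gap (a sketch; the weight-matrix shapes (n_units, n_factors) are assumptions):

```python
import numpy as np

def factored_energy(s_i, s_j, s_h, W_i, W_j, W_h):
    """E = -sum_f (sum_i s_i w_if)(sum_j s_j w_jf)(sum_h s_h w_hf)."""
    return -np.sum((s_i @ W_i) * (s_j @ W_j) * (s_h @ W_h))

def energy_gap(h_index, s_i, s_j, W_i, W_j, W_h):
    """E(s_h = 0) - E(s_h = 1) for one h unit, summed over factors f;
    this is the quantity belief propagation sends to unit h."""
    return np.sum(W_h[h_index] * (s_i @ W_i) * (s_j @ W_j))
```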
Belief propagation

[Figure: a factor f connected to units i, j, and h through weights w_if, w_jf, w_hf.]

The outgoing message at each vertex of the factor is the product of the weighted sums at the other two vertices; for example, the message from factor f to unit h is

$$m_{f \to h} = w_{hf}\Big(\sum_i s_i w_{if}\Big)\Big(\sum_j s_j w_{jf}\Big)$$
The learning rule compares message statistics on the data with the same statistics under the model:

$$\Delta w_{hf} \propto \langle s_h\, m_{h \to f} \rangle_{\text{data}} - \langle s_h\, m_{h \to f} \rangle_{\text{model}}$$

[Figure: filters learned on the "Roland" data.]
Modeling the correlational structure of a static image by using two copies of the image

[Figure: units i and j, one in each copy of the image, connected to hidden unit h through factor f with weights w_if, w_jf, w_hf.]

Each factor sends the squared output of a linear filter to the hidden units: when the two copies are identical and carry the same factor weights, the product of the two weighted sums is just the square of a single filter's output. It is exactly the standard model of simple and complex cells. It allows complex cells to extract oriented energy.

The standard model drops out of doing belief propagation for a factored third-order energy function.