
Foundations of Deep Learning

Hugo Larochelle ( @hugo_larochelle )


Twitter / Université de Sherbrooke

NEURAL NETWORK ONLINE COURSE

Topics: online videos for a more detailed description of neural networks, and much more!

http://info.usherbrooke.ca/hlarochelle/neural_networks


FOUNDATIONS OF DEEP LEARNING

What we'll cover:

- how neural networks take an input x and make a prediction f(x)
  - forward propagation
  - types of units
- how to train neural nets (classifiers) on data
  - loss function
  - backpropagation
  - gradient descent algorithms
  - tricks of the trade
- deep learning
  - dropout
  - batch normalization
  - unsupervised pre-training

Notation used throughout:

- neuron pre-activation: a(x) = b + sum_i w_i x_i = b + w^T x
- neuron output: h(x) = g(a(x)) = g(b + sum_i w_i x_i)
- activation functions: g(a) = a, g(a) = sigm(a) = 1 / (1 + exp(-a)),
  g(a) = tanh(a) = (exp(a) - exp(-a)) / (exp(a) + exp(-a)) = (exp(2a) - 1) / (exp(2a) + 1),
  g(a) = reclin(a) = max(0, a)
- output probabilities: f(x)_c = p(y = c|x)
- loss: l(f(x), y) = -sum_c 1_(y=c) log f(x)_c
- average loss gradient: Delta = -(1/T) sum_t grad_theta l(f(x^(t); theta), y^(t))

Foundations of Deep Learning


Making predictions with feedforward neural networks

Hugo Larochelle
Département d'informatique
Université de Sherbrooke
hugo.larochelle@usherbrooke.ca

NEURAL NETWORK

Topics: multilayer neural network

Could have L hidden layers:

- layer pre-activation (for k > 0):
  a^(k)(x) = b^(k) + W^(k) h^(k-1)(x)   (where h^(0)(x) = x)
- hidden layer activation (k from 1 to L):
  h^(k)(x) = g(a^(k)(x))
- output layer activation (k = L+1):
  h^(L+1)(x) = o(a^(L+1)(x)) = f(x)

For classification, the output activation o is the softmax:

o(a) = softmax(a) = [ exp(a_1) / sum_c exp(a_c), ..., exp(a_C) / sum_c exp(a_c) ]^T

so that f(x)_c estimates p(y = c|x).
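The recursion above can be sketched in a few lines of NumPy. This is a minimal sketch, not the slides' code: the layer sizes, the tanh hidden activation, and the random parameters are arbitrary choices for illustration.

```python
import numpy as np

def forward(x, weights, biases):
    """h^(0) = x; a^(k) = b^(k) + W^(k) h^(k-1); g at hidden layers, softmax at the output."""
    h = x
    for k, (W, b) in enumerate(zip(weights, biases)):
        a = b + W @ h                    # layer pre-activation
        if k < len(weights) - 1:
            h = np.tanh(a)               # hidden layer activation g
        else:
            e = np.exp(a - a.max())      # softmax output activation o
            h = e / e.sum()
    return h                             # f(x), estimates p(y = c|x)

# Tiny 2-3-2 network with arbitrary parameters
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((3, 2)), rng.standard_normal((2, 3))]
bs = [np.zeros(3), np.zeros(2)]
f = forward(np.array([0.5, -1.0]), Ws, bs)
```

The same loop works for any number of hidden layers, since each layer only consumes the activation of the layer below.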

ACTIVATION FUNCTION

Topics: sigmoid activation function

(Recall the single neuron: a(x) = b + sum_i w_i x_i = b + w^T x, h(x) = g(a(x)).)

g(a) = sigm(a) = 1 / (1 + exp(-a))

- Squashes the neuron's pre-activation between 0 and 1
- Always positive
- Bounded
- Strictly increasing

ACTIVATION FUNCTION

Topics: hyperbolic tangent (tanh) activation function

g(a) = tanh(a) = (exp(a) - exp(-a)) / (exp(a) + exp(-a)) = (exp(2a) - 1) / (exp(2a) + 1)

- Squashes the neuron's pre-activation between -1 and 1
- Can be positive or negative
- Bounded
- Strictly increasing

ACTIVATION FUNCTION

Topics: rectified linear activation function

g(a) = reclin(a) = max(0, a)

- Bounded below by 0 (always non-negative)
- Not upper bounded
- Monotonically increasing (strictly for a > 0)
- Tends to give neurons with sparse activities


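The activation functions above, written out directly for scalar inputs (a sketch; `reclin` is the slides' name for what is now usually called ReLU):

```python
import math

def linear(a):  return a                                     # g(a) = a
def sigm(a):    return 1.0 / (1.0 + math.exp(-a))            # in (0, 1)
def tanh(a):    return (math.exp(2*a) - 1) / (math.exp(2*a) + 1)  # in (-1, 1)
def reclin(a):  return max(0.0, a)                           # non-negative

# sigm squashes into (0, 1), tanh into (-1, 1), reclin clips negatives
print(sigm(0.0), tanh(0.0), reclin(-3.0))  # 0.5 0.0 0.0
```

Note the `exp(2a)` form of tanh overflows for large a; the two-sided form on the slide (or `math.tanh`) is safer in practice.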
ACTIVATION FUNCTION

Topics: softmax activation function

For multi-class classification:

- we need multiple outputs (1 output per class)
- we would like to estimate the conditional probability p(y = c|x)

We use the softmax activation function at the output:

o(a) = softmax(a) = [ exp(a_1) / sum_c exp(a_c), ..., exp(a_C) / sum_c exp(a_c) ]^T

- strictly positive
- sums to one
- Predicted class is the one with highest estimated probability

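A direct implementation of the softmax formula. Subtracting max(a) before exponentiating is a standard numerical-stability trick (not from the slides); it leaves the result unchanged because the shift cancels in the ratio.

```python
import math

def softmax(a):
    """o(a)_c = exp(a_c) / sum_c' exp(a_c'), computed with a max-shift to avoid overflow."""
    m = max(a)
    exps = [math.exp(ac - m) for ac in a]
    z = sum(exps)
    return [e / z for e in exps]

p = softmax([1.0, 2.0, 3.0])
# strictly positive, sums to one; predicted class is the argmax
print(max(range(3), key=lambda c: p[c]))  # 2
```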
FLOW GRAPH

Topics: flow graph

- Forward propagation can be represented as an acyclic flow graph
- It's a nice way of implementing forward propagation in a modular way
  - each box could be an object with an fprop method, that computes the value of the box given its parents
  - calling the fprop method of each box in the right order yields forward propagation
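The "boxes with an fprop method" idea can be sketched like this. The class names and the chain-of-boxes representation are illustrative (a real flow graph is a DAG with possibly several parents per box, not just a chain):

```python
import numpy as np

class Affine:
    """Box computing a(x) = b + W h from its parent h."""
    def __init__(self, W, b):
        self.W, self.b = W, b
    def fprop(self, h):
        return self.b + self.W @ h

class Tanh:
    """Box applying the activation g elementwise to its parent."""
    def fprop(self, a):
        return np.tanh(a)

def forward(boxes, x):
    """Calling fprop on each box in topological order is forward propagation."""
    h = x
    for box in boxes:
        h = box.fprop(h)
    return h

net = [Affine(np.eye(2), np.zeros(2)), Tanh()]
out = forward(net, np.array([0.0, 1.0]))
```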

CAPACITY OF NEURAL NETWORK

Topics: universal approximation

- Universal approximation theorem (Hornik, 1991): a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units
- The result applies for sigmoid, tanh and many other hidden layer activation functions
- This is a good result, but it doesn't mean there is a learning algorithm that can find the necessary parameter values!

Foundations of Deep Learning


Training feedforward neural networks

MACHINE LEARNING

Topics: empirical risk minimization, regularization

- Supervised learning example: (x, y), input x, target y
- Training set: D^train = {(x^(t), y^(t))}
- Model: f(x; theta); we also use a validation set D^valid and a test set D^test

Empirical (structural) risk minimization is a framework to design learning algorithms:

arg min_theta  (1/T) sum_t l(f(x^(t); theta), y^(t)) + lambda Omega(theta)

- l(f(x^(t); theta), y^(t)) is a loss function
- Omega(theta) is a regularizer (penalizes certain values of theta)

Learning is cast as optimization:

- ideally, we'd optimize classification error, but it's not smooth
- the loss function is a surrogate for what we truly should optimize (e.g. an upper bound)
MACHINE LEARNING

Topics: stochastic gradient descent (SGD)

Algorithm that performs updates after each example:

- initialize theta  (theta = {W^(1), b^(1), ..., W^(L+1), b^(L+1)})
- for N epochs
  - for each training example (x^(t), y^(t)):
    Delta = -grad_theta l(f(x^(t); theta), y^(t)) - lambda grad_theta Omega(theta)
    theta <- theta + alpha Delta

(a training epoch is one iteration over all examples)

To apply this algorithm to neural network training, we need:

- the loss function l(f(x^(t); theta), y^(t))
- a procedure to compute the parameter gradients grad_theta l(f(x^(t); theta), y^(t))
- the regularizer Omega(theta) (and the gradient grad_theta Omega(theta))
- an initialization method for theta
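The SGD loop above, made concrete on a deliberately tiny problem. The model f(x; theta) = theta * x with squared loss and an L2 regularizer Omega(theta) = theta^2 are toy stand-ins (not the slides' neural network) chosen so the whole update rule fits in a few lines:

```python
# Toy data generated by theta = 2, so SGD should recover theta close to 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
theta, alpha, lam = 0.0, 0.05, 0.0     # parameter, learning rate, weight decay

for epoch in range(100):               # one epoch = pass over all examples
    for x, y in data:
        grad_loss = 2 * (theta * x - y) * x   # d/dtheta (theta*x - y)^2
        grad_reg = 2 * theta                  # d/dtheta theta^2
        delta = -grad_loss - lam * grad_reg   # Delta = -grad l - lambda grad Omega
        theta = theta + alpha * delta         # theta <- theta + alpha Delta
```

With lam = 0 the loop recovers theta = 2; a positive lam would shrink the solution toward 0, which is the regularizer doing its job.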

LOSS FUNCTION

Topics: loss function for classification

Neural network estimates f(x)_c = p(y = c|x)

- we could maximize the probabilities of y^(t) given x^(t) in the training set

To frame as minimization, we minimize the negative log-likelihood:

l(f(x), y) = -sum_c 1_(y=c) log f(x)_c = -log f(x)_y

- we take the (natural) log to simplify for numerical stability and math simplicity
- sometimes referred to as cross-entropy

Gradient at the output:

partial (-log f(x)_y) / partial f(x)_c = -1_(y=c) / f(x)_y

grad_f(x) (-log f(x)_y) = -(1 / f(x)_y) e(y)
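The negative log-likelihood in one line, given the predicted probability vector f(x) and the target class y:

```python
import math

def nll(f_x, y):
    """l(f(x), y) = -sum_c 1_(y=c) log f(x)_c = -log f(x)_y."""
    return -math.log(f_x[y])

# A confident correct prediction has low loss, a wrong one high loss.
probs = [0.7, 0.2, 0.1]
print(nll(probs, 0) < nll(probs, 2))  # True
```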

BACKPROPAGATION

Topics: backpropagation algorithm

Use the chain rule to efficiently compute gradients, top to bottom. For example, for the biases:

partial (-log f(x)_y) / partial b_i^(k)
  = (partial (-log f(x)_y) / partial a^(k)(x)_i) (partial a^(k)(x)_i / partial b_i^(k))

- compute output gradient (before activation):
  grad_{a^(L+1)(x)} -log f(x)_y  <=  -(e(y) - f(x))
- for k from L+1 to 1:
  - compute gradients of hidden layer parameters:
    grad_{W^(k)} -log f(x)_y  <=  (grad_{a^(k)(x)} -log f(x)_y) h^(k-1)(x)^T
    grad_{b^(k)} -log f(x)_y  <=  grad_{a^(k)(x)} -log f(x)_y
  - compute gradient of hidden layer below:
    grad_{h^(k-1)(x)} -log f(x)_y  <=  W^(k)T (grad_{a^(k)(x)} -log f(x)_y)
  - compute gradient of hidden layer below (before activation):
    grad_{a^(k-1)(x)} -log f(x)_y  <=  (grad_{h^(k-1)(x)} -log f(x)_y)
    elementwise product with [..., g'(a^(k-1)(x)_j), ...]

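The recursions above, instantiated for one hidden layer. The sizes, the random seed, and the tanh hidden activation are arbitrary illustration choices; each line mirrors one step of the slide's algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(4); y = 1                  # input and target class
W1, b1 = rng.standard_normal((5, 4)), np.zeros(5)  # hidden layer (tanh)
W2, b2 = rng.standard_normal((3, 5)), np.zeros(3)  # output layer (softmax)

# forward propagation
a1 = b1 + W1 @ x; h1 = np.tanh(a1)
a2 = b2 + W2 @ h1
f = np.exp(a2 - a2.max()); f /= f.sum()            # f(x)

# backpropagation, top to bottom
e_y = np.eye(3)[y]
grad_a2 = -(e_y - f)                               # output gradient (before activation)
grad_W2 = np.outer(grad_a2, h1)                    # parameter gradients
grad_b2 = grad_a2
grad_h1 = W2.T @ grad_a2                           # gradient of hidden layer below
grad_a1 = grad_h1 * (1 - h1**2)                    # elementwise, g'(a) = 1 - tanh(a)^2
grad_W1 = np.outer(grad_a1, x)
grad_b1 = grad_a1
```

A finite-difference check on any single weight (perturb it, rerun the forward pass, compare the change in -log f(x)_y) is the standard way to validate such an implementation.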
ACTIVATION FUNCTION

Topics: sigmoid activation function gradient

Partial derivative:

g'(a) = g(a)(1 - g(a))

ACTIVATION FUNCTION

Topics: tanh activation function gradient

Partial derivative:

g'(a) = 1 - g(a)^2

ACTIVATION FUNCTION

Topics: rectified linear activation function gradient

Partial derivative:

g'(a) = 1_(a>0)
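The three derivative formulas above can be verified numerically with a central difference (the test point 0.3 is arbitrary, chosen away from the rectified-linear kink at 0):

```python
import math

def sigm(a):
    return 1 / (1 + math.exp(-a))

# (activation, claimed derivative) pairs from the slides
derivs = {
    "sigm":   (sigm,                lambda a: sigm(a) * (1 - sigm(a))),
    "tanh":   (math.tanh,           lambda a: 1 - math.tanh(a) ** 2),
    "reclin": (lambda a: max(0, a), lambda a: 1.0 if a > 0 else 0.0),
}

eps = 1e-6
ok = all(
    abs((g(0.3 + eps) - g(0.3 - eps)) / (2 * eps) - gprime(0.3)) < 1e-6
    for g, gprime in derivs.values()
)
print(ok)  # True
```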

FLOW GRAPH

Topics: automatic differentiation

- Each object also has a bprop method
  - it computes the gradient of the loss with respect to each parent
  - fprop depends on the fprop of a box's parents, while bprop depends on the bprop of a box's children
- By calling bprop in the reverse order, we get backpropagation
  - only need to reach the parameters

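Adding a bprop method to each box and calling the boxes in reverse order gives backpropagation. This is an illustrative class design (a chain of two boxes, scalar values), not a real autodiff library:

```python
import math

class Sigm:
    """A box: fprop computes its value, bprop passes the loss gradient to its parent."""
    def fprop(self, a):
        self.h = 1 / (1 + math.exp(-a))
        return self.h
    def bprop(self, grad_out):
        return grad_out * self.h * (1 - self.h)   # chain rule through sigm

class Scale:
    """A box with a parameter w; bprop also stores the gradient w.r.t. w."""
    def __init__(self, w):
        self.w = w
    def fprop(self, x):
        self.x = x
        return self.w * x
    def bprop(self, grad_out):
        self.grad_w = grad_out * self.x           # gradient reaches the parameter
        return grad_out * self.w

boxes = [Scale(2.0), Sigm()]
h = 0.5
for box in boxes:                  # fprop in topological order
    h = box.fprop(h)
grad = 1.0
for box in reversed(boxes):        # bprop in reverse order = backpropagation
    grad = box.bprop(grad)
```

Each box only knows its own local derivative; the reverse sweep composes them, which is exactly why the scheme is modular.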
(2)
(1)
(2)
(1)
it computes the gradient of(3)
(1)
(1)
(1)
the
loss
with

sign(W
)
=
1
1W(k) <0
(2)
(2)
(k)
i,j
(x)
b
b
b
=
b
+
W
x
W
W
W x af (x)

h
=
g(a
(x))
W
>0
i,j
i,j
respect to each parent (3)
(3)
(3)
(3)
(2)
(3)
(2)
a (1)(x) =
b +W
h p
@
h = o(a (x))
(2)
(1)
(1)
(3)
(2)
(1)

h
=
g(a
(x))
(3)
(3)

a
=
b
+
W
x
(1)
(1)

W
W
W
x 6
(k)

h
(x)
=
o(a
(x))
(k)
fprop depends on the fprop of a boxs parents,

h
=
g(a
(x))
p
W
U
[
b,
b]
b
=
H
h
(x)
k
f
(x)
(2)
(2)
(2) H(1)+H
i,j

a
(x)
=
b
+
W
h
while bprop depends the bprop of a boxs children
k
k 1

By

(x) = o(a

(2)

FLOW GRAPH

(1) (3)(2)
(2) h(1)h(2)
(3)g(a(3)
=
(x))
(1)
=
o(a
g(a (x))
h (x)
=
g(a
(x))
b
b(2)(x))
b

(3) (x) = b(1)


(3) + W(1)
(3) x (2)
calling bprop in the reverse order, a(1)
h
(3) (1)
(2)
(1)
(2)
(2)(1)

b
b
b
(3)
(2)
(1)

h
=
g(a
(x))

h
(x)
=
g(a
(x))
(1)
(1)
we get backpropagation

W
W
W
x
h = g(a (x)) (3)
(3)

(2)
(2) (x))(2) (1)
h
o(a
a
(x)
=
b
+W h
(3)
(2)
(1)
only need to reach the parameters
(1)
(1)
W
W
W
x
bh(3)
b=(2)g(a
b
(x))
(2)
(1)
(1)(2) (x))(1)
(3)
(2)
(1) h
(x)
=
g(a
a
b +W x
b
b
b
(3)
(1)
(3)
(2)(2) (1)
W
W
W
b(1) b
b (1) x f (x)
(3) (x) = g(a(3) (x))

h
o(a
(3)
(2)
(1)

(3)
(2)
(1)
x
W
W
W
x
(3)
(2)
(1)(2)
(2)

b
h

b =b
(x)
g(a

(x))

k ||W
()
= ki,j i W
W
=
||F
i,j
(k)
(x))
>0
W
<0
i,j
H
+H
j
k
k
k
1
i,j
i,j
()
=
|W
|
(k)k
20
(3)
(3)
(3)
(2) 1 (k)
i,j
i
j
sign(W
)
=
1
(1)
(1)
(1)
(k)
i,j
a=a p(y
=b=
b c|x)
+
W
hp Wi,j <0
(2)
=
+
W
x
W
>0

f
(x)
(3)
(3)
(3)
i,j
c (3)
(k)
(2)W (3) h
(k) p
6
(3)
(2)
(k)
ah(2) (x)
=b
+
= g(a (x))
()
=
2W
(k)
(k)
a r
=
b
+
W
h
W
U
[
b,
b]
b
=
H
h
(x)
k
i,j
p
rW
()
=
sign(W
)
(k)
(2)
(2)
(1)
(3)
(3) + W(2)
Hh
W
k +Hk 1

a
=
b
(k)

h
=
o(a
(x))
6
(t)
(t)
(k)
p

W
U
[
b,
b]
b
=
H
h
(x

x
y
P
P
P
k
i,j
(k)
(2)
(2)
(2)
(1)
(1)
(1) a(2)
(2)
(2)
(1)
H|k +Hk 1
(k)
=
b
+
W
h
()
=
|W

h
(x)
=
g(a
(x))
(3)
(3)
(3)
(2)
sign(W
)k=
=
1W(k) <0

a
=
b
+
W
h
(k) (1)
(2)
(2)1W
(1)
i,j
i(1)
j W
i,j

a
(x)
=
b
+
h
P
W
>0

h
=
g(a
(x))

a
b
+
x
Topics: automatic differentiation
i,j
i,j

l(f
(x),
y)
=
1
log
f
(x)
=
log
f
c
(3)
(3)
(3)
(2)
(y=c)
(1)
(1)
(1)
c
(3)
(2)
(1) a r
(k)
p
a
=
b
+
W
h
=
b
+
W
x
(2)
(2)
(2)
(1)
(1)
(1)
()
=
sign(W
)
(3)
(3)
(k)

b
b
b
(k)
(1)
(1)
(1)
6
W
(k)

a
(x)
=
b
+
W
h
Each object also

h
=
g(a
(x))
has
a
bprop
method

h
=
o(a
(x))
p

W
U
[
b,
b]
b
=
H
h
(x)
a = b + W xi,j
k
Hk +Hk 1

(3) (2)
(3)
(2)
(2)
(1)
(k)

h
=
o(a
(x))

a
=
b
+
W
h
(3)
(2)
(1)
(2)
(1)
it computes the gradient of(3)
(1)
(1)
(1)
the
loss
with

sign(W
)
=
1
1W(k) <0
(2)
(2)
(k)
i,j
(x)
b
b
b
=
b
+
W
x
W
W
W x af (x)

h
=
g(a
(x))
W
>0
i,j
i,j
respect to each parent (3)
(3)
(3)
(3)
(2)
(3)
(2)
a (1)(x) =
b +W
h p
@
h = o(a (x))
(2)
(1)
(1)
(3)
(2)
(1)

h
=
g(a
(x))
(3)
(3)

a
=
b
+
W
x
(1)
(1)

W
W
W
x 6
(k)

h
(x)
=
o(a
(x))
(k)
fprop depends on the fprop of a boxs parents,

h
=
g(a
(x))
p
W
U
[
b,
b]
b
=
H
h
(x)
k
f
(x)
(2)
(2)
(2) H(1)+H
i,j

a
(x)
=
b
+
W
h
while bprop depends the bprop of a boxs children
k
k 1

By

(x) = o(a

(2)

FLOW GRAPH

(1) (3)(2)
(2) h(1)h(2)
(3)g(a(3)
=
(x))
(1)
=
o(a
g(a (x))
h (x)
=
g(a
(x))
b
b(2)(x))
b

(3) (x) = b(1)


(3) + W(1)
(3) x (2)
calling bprop in the reverse order, a(1)
h
(3) (1)
(2)
(1)
(2)
(2)(1)

b
b
b
(3)
(2)
(1)

h
=
g(a
(x))

h
(x)
=
g(a
(x))
(1)
(1)
we get backpropagation

W
W
W
x
h = g(a (x)) (3)
(3)

(2)
(2) (x))(2) (1)
h
o(a
a
(x)
=
b
+W h
(3)
(2)
(1)
only need to reach the parameters
(1)
(1)
W
W
W
x
bh(3)
b=(2)g(a
b
(x))
(2)
(1)
(1)(2) (x))(1)
(3)
(2)
(1) h
(x)
=
g(a
a
b +W x
b
b
b
(3)
(1)
(3)
(2)(2) (1)
W
W
W
b(1) b
b (1) x f (x)
(3) (x) = g(a(3) (x))

h
o(a
(3)
(2)
(1)

(3)
(2)
(1)
x
W
W
W
x
(3)
(2)
(1)(2)
(2)

b
h

b =b
(x)
g(a

(x))

k ||W
()
= ki,j i W
W
=
||F
i,j
(k)
(x))
>0
W
<0
i,j
H
+H
j
k
k
k
1
i,j
i,j
()
=
|W
|
(k)k
20
(3)
(3)
(3)
(2) 1 (k)
i,j
i
j
sign(W
)
=
1
(1)
(1)
(1)
(k)
i,j
a=a p(y
=b=
b c|x)
+
W
hp Wi,j <0
(2)
=
+
W
x
W
>0

f
(x)
(3)
(3)
(3)
i,j
c (3)
(k)
(2)W (3) h
(k) p
6
(3)
(2)
(k)
ah(2) (x)
=b
+
= g(a (x))
()
=
2W
(k)
(k)
a r
=
b
+
W
h
W
U
[
b,
b]
b
=
H
h
(x)
k
i,j
p
rW
()
=
sign(W
)
(k)
(2)
(2)
(1)
(3)
(3) + W(2)
Hh
W
k +Hk 1

a
=
b
(k)

h
=
o(a
(x))
6
(t)
(t)
(k)
p

W
U
[
b,
b]
b
=
H
h
(x

x
y
P
P
P
k
i,j
(k)
(2)
(2)
(2)
(1)
(1)
(1) a(2)
(2)
(2)
(1)
H|k +Hk 1
(k)
=
b
+
W
h
()
=
|W

h
(x)
=
g(a
(x))
(3)
(3)
(3)
(2)
sign(W
)k=
=
1W(k) <0

a
=
b
+
W
h
(k) (1)
(2)
(2)1W
(1)
i,j
i(1)
j W
i,j

a
(x)
=
b
+
h
P
W
>0

h
=
g(a
(x))

a
b
+
x
Topics: automatic differentiation
i,j
i,j

l(f
(x),
y)
=
1
log
f
(x)
=
log
f
c
(3)
(3)
(3)
(2)
(y=c)
(1)
(1)
(1)
c
(3)
(2)
(1) a r
(k)
p
a
=
b
+
W
h
=
b
+
W
x
(2)
(2)
(2)
(1)
(1)
(1)
()
=
sign(W
)
(3)
(3)
(k)

b
b
b
(k)
(1)
(1)
(1)
6
W
(k)

a
(x)
=
b
+
W
h
Each object also

h
=
g(a
(x))
has
a
bprop
method

h
=
o(a
(x))
p

W
U
[
b,
b]
b
=
H
h
(x)
a = b + W xi,j
k
Hk +Hk 1

(3) (2)
(3)
(2)
(2)
(1)
(k)

h
=
o(a
(x))

a
=
b
+
W
h
(3)
(2)
(1)
(2)
(1)
it computes the gradient of(3)
(1)
(1)
(1)
the
loss
with

sign(W
)
=
1
1W(k) <0
(2)
(2)
(k)
i,j
(x)
b
b
b
=
b
+
W
x
W
W
W x af (x)

h
=
g(a
(x))
W
>0
i,j
i,j
respect to each parent (3)
(3)
(3)
(3)
(2)
(3)
(2)
a (1)(x) =
b +W
h p
@
h = o(a (x))
(2)
(1)
(1)
(3)
(2)
(1)

h
=
g(a
(x))
(3)
(3)

a
=
b
+
W
x
(1)
(1)

W
W
W
x 6
(k)

h
(x)
=
o(a
(x))
(k)
fprop depends on the fprop of a boxs parents,

h
=
g(a
(x))
p
W
U
[
b,
b]
b
=
H
h
(x)
k
f
(x)
(2)
(2)
(2) H(1)+H
i,j

a
(x)
=
b
+
W
h
while bprop depends the bprop of a boxs children
k
k 1

By

(x) = o(a

(2)

FLOW GRAPH

(1) (3)(2)
(2) h(1)h(2)
(3)g(a(3)
=
(x))
(1)
=
o(a
g(a (x))
h (x)
=
g(a
(x))
b
b(2)(x))
b

(3) (x) = b(1)


(3) + W(1)
(3) x (2)
calling bprop in the reverse order, a(1)
h
(3) (1)
(2)
(1)
(2)
(2)(1)

b
b
b
(3)
(2)
(1)

h
=
g(a
(x))

h
(x)
=
g(a
(x))
(1)
(1)
we get backpropagation

W
W
W
x
h = g(a (x)) (3)
(3)

(2)
(2) (x))(2) (1)
h
o(a
a
(x)
=
b
+W h
(3)
(2)
(1)
only need to reach the parameters
(1)
(1)
W
W
W
x
bh(3)
b=(2)g(a
b
(x))
(2)
(1)
(1)(2) (x))(1)
(3)
(2)
(1) h
(x)
=
g(a
a
b +W x
b
b
b
(3)
(1)
(3)
(2)(2) (1)
W
W
W
b(1) b
b (1) x f (x)
(3) (x) = g(a(3) (x))

h
o(a
(3)
(2)
(1)

(3)
(2)
(1)
x
W
W
W
x
(3)
(2)
(1)(2)
(2)

b
h

b =b
(x)
g(a

(x))

k ||W
()
= ki,j i W
W
=
||F
i,j
(k)
(x))
>0
W
<0
i,j
H
+H
j
k
k
k
1
i,j
i,j
()
=
|W
|
(k)k
20
(3)
(3)
(3)
(2) 1 (k)
i,j
i
j
sign(W
)
=
1
(1)
(1)
(1)
(k)
i,j
a=a p(y
=b=
b c|x)
+
W
hp Wi,j <0
(2)
=
+
W
x
W
>0

f
(x)
(3)
(3)
(3)
i,j
c (3)
(k)
(2)W (3) h
(k) p
6
(3)
(2)
(k)
ah(2) (x)
=b
+
= g(a (x))
()
=
2W
(k)
(k)
a r
=
b
+
W
h
W
U
[
b,
b]
b
=
H
h
(x)
k
i,j
p
rW
()
=
sign(W
)
(k)
(2)
(2)
(1)
(3)
(3) + W(2)
Hh
W
k +Hk 1

a
=
b
(k)

h
=
o(a
(x))
6
(t)
(t)
(k)
p

W
U
[
b,
b]
b
=
H
h
(x

x
y
P
P
P
k
i,j
(k)
(2)
(2)
(2)
(1)
(1)
(1) a(2)
(2)
(2)
(1)
H|k +Hk 1
(k)
=
b
+
W
h
()
=
|W

h
(x)
=
g(a
(x))
(3)
(3)
(3)
(2)
sign(W
)k=
=
1W(k) <0

a
=
b
+
W
h
(k) (1)
(2)
(2)1W
(1)
i,j
i(1)
j W
i,j

a
(x)
=
b
+
h
P
W
>0

h
=
g(a
(x))

a
b
+
x
Topics: automatic differentiation
i,j
i,j

l(f
(x),
y)
=
1
log
f
(x)
=
log
f
c
(3)
(3)
(3)
(2)
(y=c)
(1)
(1)
(1)
c
(3)
(2)
(1) a r
(k)
p
a
=
b
+
W
h
=
b
+
W
x
(2)
(2)
(2)
(1)
(1)
(1)
()
=
sign(W
)
(3)
(3)
(k)

b
b
b
(k)
(1)
(1)
(1)
6
W
(k)

a
(x)
=
b
+
W
h
Each object also

h
=
g(a
(x))
has
a
bprop
method

h
=
o(a
(x))
p

W
U
[
b,
b]
b
=
H
h
(x)
a = b + W xi,j
k
Hk +Hk 1

(3) (2)
(3)
(2)
(2)
(1)
(k)

h
=
o(a
(x))

a
=
b
+
W
h
(3)
(2)
(1)
(2)
(1)
it computes the gradient of(3)
(1)
(1)
(1)
the
loss
with

sign(W
)
=
1
1W(k) <0
(2)
(2)
(k)
i,j
(x)
b
b
b
=
b
+
W
x
W
W
W x af (x)

h
=
g(a
(x))
W
>0
i,j
i,j
respect to each parent (3)
(3)
(3)
(3)
(2)
(3)
(2)
a (1)(x) =
b +W
h p
@
h = o(a (x))
(2)
(1)
(1)
(3)
(2)
(1)

h
=
g(a
(x))
(3)
(3)

a
=
b
+
W
x
(1)
(1)

W
W
W
x 6
(k)

h
(x)
=
o(a
(x))
(k)
fprop depends on the fprop of a boxs parents,

h
=
g(a
(x))
p
W
U
[
b,
b]
b
=
H
h
(x)
k
f
(x)
(2)
(2)
(2) H(1)+H
i,j

a
(x)
=
b
+
W
h
while bprop depends the bprop of a boxs children
k
k 1

By

(x) = o(a

(2)

FLOW GRAPH

(1) (3)(2)
(2) h(1)h(2)
(3)g(a(3)
=
(x))
(1)
=
o(a
g(a (x))
h (x)
=
g(a
(x))
b
b(2)(x))
b

(3) (x) = b(1)


(3) + W(1)
(3) x (2)
calling bprop in the reverse order, a(1)
h
(3) (1)
(2)
(1)
(2)
(2)(1)

b
b
b
(3)
(2)
(1)

h
=
g(a
(x))

h
(x)
=
g(a
(x))
(1)
(1)
we get backpropagation

W
W
W
x
h = g(a (x)) (3)
(3)

(2)
(2) (x))(2) (1)
h
o(a
a
(x)
=
b
+W h
(3)
(2)
(1)
only need to reach the parameters
(1)
(1)
W
W
W
x
bh(3)
b=(2)g(a
b
(x))
(2)
(1)
(1)(2) (x))(1)
(3)
(2)
(1) h
(x)
=
g(a
a
b +W x
b
b
b
(3)
(1)
(3)
(2)(2) (1)
W
W
W
b(1) b
b (1) x f (x)
(3) (x) = g(a(3) (x))

h
o(a
(3)
(2)
(1)

(3)
(2)
(1)
x
W
W
W
x
(3)
(2)
(1)(2)
(2)

b
h

b =b
(x)
g(a

(x))

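The flow-graph idea can be sketched in a few lines of Python. This is a minimal illustration, not any particular library's API: class and method names (Node, fprop, bprop) are chosen to mirror the slide's vocabulary.

```python
# Minimal flow-graph sketch: each node computes its value from its parents
# (fprop) and accumulates the loss gradient into its parents (bprop).

class Node:
    def __init__(self, *parents):
        self.parents = parents
        self.value = None
        self.grad = 0.0

class Input(Node):
    def fprop(self): pass          # value is set externally
    def bprop(self): pass          # has no parents

class Mul(Node):
    def fprop(self):
        a, b = self.parents
        self.value = a.value * b.value
    def bprop(self):
        a, b = self.parents
        a.grad += self.grad * b.value   # d(ab)/da = b
        b.grad += self.grad * a.value   # d(ab)/db = a

class Add(Node):
    def fprop(self):
        a, b = self.parents
        self.value = a.value + b.value
    def bprop(self):
        for p in self.parents:          # d(a+b)/da = d(a+b)/db = 1
            p.grad += self.grad

# Example graph: f(w, x, b) = w*x + b
w, x, b = Input(), Input(), Input()
w.value, x.value, b.value = 2.0, 3.0, 1.0
prod = Mul(w, x)
out = Add(prod, b)
order = [prod, out]                 # topological order
for node in order:                  # fprop: parents before children
    node.fprop()
out.grad = 1.0                      # d(loss)/d(out)
for node in reversed(order):        # bprop in reverse order
    node.bprop()
```

Calling bprop in reverse topological order is exactly the backpropagation pass described above; the gradients reach the parameters w and b.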
REGULARIZATION
Topics: L2 and L1 regularization

 L2 regularization: Ω(θ) = Σ_k Σ_i Σ_j (W^(k)_i,j)² = Σ_k ||W^(k)||²_F
- Gradient: ∇_W^(k) Ω(θ) = 2 W^(k)
- Only applied on weights, not on biases (weight decay)
- Can be interpreted as having a Gaussian prior over the weights
 L1 regularization: Ω(θ) = Σ_k Σ_i Σ_j |W^(k)_i,j|
- Gradient: ∇_W^(k) Ω(θ) = sign(W^(k)), where sign(W^(k))_i,j = 1_(W^(k)_i,j > 0) − 1_(W^(k)_i,j < 0)
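The two penalty gradients above can be checked directly. A small sketch (helper names are illustrative):

```python
def sign(w):
    return (w > 0) - (w < 0)  # +1, -1, or 0

def l2_penalty_and_grad(W):
    """Omega = sum_ij W_ij^2; gradient is 2*W (weight decay)."""
    omega = sum(w * w for row in W for w in row)
    grad = [[2.0 * w for w in row] for row in W]
    return omega, grad

def l1_penalty_and_grad(W):
    """Omega = sum_ij |W_ij|; gradient is sign(W)."""
    omega = sum(abs(w) for row in W for w in row)
    grad = [[float(sign(w)) for w in row] for row in W]
    return omega, grad

W = [[0.5, -2.0], [0.0, 1.0]]
o2, g2 = l2_penalty_and_grad(W)
o1, g1 = l1_penalty_and_grad(W)
```

In practice the penalty gradient is simply added (scaled by the regularization strength) to the loss gradient of each weight matrix.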
INITIALIZATION
Topics: initialization

 For biases: initialize all to 0
 For weights:
- can't initialize weights to 0 with tanh activation
  - we can show that all gradients would then be 0 (saddle point)
- can't initialize all weights to the same value
  - we can show that all hidden units in a layer will always behave the same
  - need to break symmetry
 Recipe: sample W^(k)_i,j from U[−b, b], where b = √6 / √(H_k + H_(k−1)) and H_k is the size of h^(k)(x)
- the idea is to sample around 0 but break symmetry
- other values of b could work well (not an exact science) (see Glorot & Bengio, 2010)
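The recipe above can be written out directly. A minimal sketch of the sampling step (function name is illustrative):

```python
import math
import random

def glorot_uniform(h_out, h_in, rng=random):
    """Sample W^(k) ~ U[-b, b] with b = sqrt(6) / sqrt(H_k + H_{k-1})
    (Glorot & Bengio, 2010). Biases would be initialized to 0 separately."""
    bound = math.sqrt(6.0) / math.sqrt(h_out + h_in)
    return [[rng.uniform(-bound, bound) for _ in range(h_in)]
            for _ in range(h_out)]

W = glorot_uniform(100, 50)
bound = math.sqrt(6.0) / math.sqrt(100 + 50)
biases = [0.0] * 100
```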
MODEL SELECTION
Topics: grid search, random search

 To search for the best configuration of the hyper-parameters:
- you can perform a grid search
  - specify a set of values you want to test for each hyper-parameter
  - try all possible configurations of these values
- you can perform a random search (Bergstra and Bengio, 2012)
  - specify a distribution over the values of each hyper-parameter (e.g. uniform in some range)
  - sample each hyper-parameter independently to get configurations
- or use Bayesian optimization / sequential model-based optimization
 Use the performance on a validation set (not the test set) to select the best configuration
 You can go back and refine the grid/distributions if needed
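Random search is only a few lines. A sketch, where the hyper-parameter names and ranges are illustrative choices, not prescriptions:

```python
import random

def sample_config(rng):
    """Sample one hyper-parameter configuration, each value independently."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),   # log-uniform in [1e-4, 1e-1]
        "n_hidden": rng.choice([100, 200, 500]),
        "dropout": rng.uniform(0.0, 0.5),
    }

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(20)]
# each config would be trained and scored on the VALIDATION set;
# keep the configuration with the best validation performance
```

Sampling the learning rate log-uniformly (rather than uniformly) is a common choice, since useful values span several orders of magnitude.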
KNOWING WHEN TO STOP
Topics: early stopping

 To select the number of epochs, stop training when the validation set error increases (with some look ahead)
 [Figure: training and validation error vs. number of epochs — training error keeps decreasing, while validation error eventually rises; underfitting on the left, overfitting on the right]
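The "with some look ahead" rule can be made concrete. A sketch, assuming a per-epoch list of validation errors is available:

```python
def early_stop_epoch(val_errors, lookahead=3):
    """Return the epoch with the lowest validation error, stopping the scan
    once the error has not improved for `lookahead` consecutive epochs."""
    best_epoch, best_err = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_epoch, best_err = epoch, err
        elif epoch - best_epoch >= lookahead:
            break  # no improvement within the look-ahead window
    return best_epoch

errs = [0.5, 0.4, 0.3, 0.32, 0.35, 0.4, 0.45]
chosen = early_stop_epoch(errs)
```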
OTHER TRICKS OF THE TRADE
Topics: normalization of data, decaying learning rate

 Normalizing your (real-valued) data
- for each dimension x_i, subtract its training set mean
- divide each dimension x_i by its training set standard deviation
- this can speed up training (in number of epochs)
 Decaying the learning rate
- as we get closer to the optimum, it makes sense to take smaller update steps
(i) start with a large learning rate (e.g. 0.1)
(ii) maintain it until the validation error stops improving
(iii) divide the learning rate by 2 and go back to (ii)
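The normalization step above, as a small sketch; note that the test data must be scaled with the training-set statistics:

```python
def standardize(train, test):
    """Per-dimension: subtract the TRAINING-set mean and divide by the
    TRAINING-set standard deviation (test data reuses the training stats)."""
    n, d = len(train), len(train[0])
    mean = [sum(x[j] for x in train) / n for j in range(d)]
    std = [(sum((x[j] - mean[j]) ** 2 for x in train) / n) ** 0.5 or 1.0
           for j in range(d)]  # guard: a constant dimension keeps std = 1
    scale = lambda data: [[(x[j] - mean[j]) / std[j] for j in range(d)]
                          for x in data]
    return scale(train), scale(test)

train = [[0.0, 2.0], [2.0, 2.0]]
test = [[1.0, 3.0]]
train_s, test_s = standardize(train, test)
```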
OTHER TRICKS OF THE TRADE
Topics: mini-batch, momentum

 Can update based on a mini-batch of examples (instead of a single example):
- the gradient is the average regularized loss for that mini-batch
- can give a more accurate estimate of the risk gradient
- can leverage matrix/matrix operations, which are more efficient
 Can use an exponential average of previous gradients:
  ∇̄_θ^(t) = ∇_θ l(f(x^(t)), y^(t)) + β ∇̄_θ^(t−1)
- can get through plateaus more quickly, by gaining "momentum"
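The momentum update above, applied per parameter. A sketch with illustrative values for the learning rate and β:

```python
def sgd_momentum_step(theta, grad, velocity, lr=0.1, beta=0.9):
    """v_t = g_t + beta * v_{t-1};  theta <- theta - lr * v_t  (per parameter)."""
    new_v = [g + beta * v for g, v in zip(grad, velocity)]
    new_theta = [t - lr * nv for t, nv in zip(theta, new_v)]
    return new_theta, new_v

theta, v = [1.0, -1.0], [0.0, 0.0]
# two steps with the same gradient: the effective step size grows
theta, v = sgd_momentum_step(theta, [0.5, -0.5], v)
theta, v = sgd_momentum_step(theta, [0.5, -0.5], v)
```

After the second step the velocity is 0.5 + 0.9·0.5 = 0.95, larger than the raw gradient of 0.5: repeated gradients in the same direction accumulate, which is how momentum speeds progress through plateaus.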
OTHER TRICKS OF THE TRADE
Topics: Adagrad, RMSProp, Adam

 Updates with adaptive learning rates (one learning rate per parameter)
 Adagrad: learning rates are scaled by the square root of the cumulative sum of squared gradients
  γ^(t) = γ^(t−1) + (∇_θ l(f(x^(t)), y^(t)))²
  ∇̄_θ^(t) = ∇_θ l(f(x^(t)), y^(t)) / (√(γ^(t)) + ε)
 RMSProp: instead of a cumulative sum, use an exponential moving average
  γ^(t) = β γ^(t−1) + (1 − β) (∇_θ l(f(x^(t)), y^(t)))²
  ∇̄_θ^(t) = ∇_θ l(f(x^(t)), y^(t)) / (√(γ^(t)) + ε)
 Adam: essentially combines RMSProp with momentum
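A sketch of the RMSProp update above (hyper-parameter values are illustrative). The example shows the key effect: parameters with very different gradient magnitudes end up taking similar-sized steps.

```python
def rmsprop_step(theta, grad, avg_sq, lr=0.01, beta=0.9, eps=1e-8):
    """gamma_t = beta*gamma_{t-1} + (1-beta)*g^2;
    theta <- theta - lr * g / (sqrt(gamma_t) + eps), per parameter."""
    new_avg = [beta * a + (1 - beta) * g * g for a, g in zip(avg_sq, grad)]
    new_theta = [t - lr * g / (a ** 0.5 + eps)
                 for t, g, a in zip(theta, grad, new_avg)]
    return new_theta, new_avg

# one parameter with a huge gradient, one with a tiny gradient
theta, avg = [0.0, 0.0], [0.0, 0.0]
theta, avg = rmsprop_step(theta, [100.0, 0.01], avg)
```

Both parameters move by roughly lr/√(1−β) on the first step, regardless of the raw gradient scale, which is exactly the per-parameter adaptation the slide describes.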
GRADIENT CHECKING
Topics: finite difference approximation

 To debug your implementation of fprop/bprop, you can compare with a finite-difference approximation of the gradient:
  ∂f(x)/∂x ≈ (f(x + ε) − f(x − ε)) / (2ε)
- f(x) would be the loss
- x would be a parameter
- f(x + ε) would be the loss if you add ε to the parameter
- f(x − ε) would be the loss if you subtract ε from the parameter
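The central-difference check above, as a sketch; the toy loss here is illustrative (sum of squares, whose analytic gradient is 2x), standing in for a network's loss:

```python
def finite_diff(loss, params, i, eps=1e-5):
    """Central-difference approximation of d loss / d params[i]."""
    params[i] += eps
    up = loss(params)            # f(x + eps)
    params[i] -= 2 * eps
    down = loss(params)          # f(x - eps)
    params[i] += eps             # restore the parameter
    return (up - down) / (2 * eps)

loss = lambda p: sum(x * x for x in p)   # toy loss; analytic grad is 2*x
params = [3.0, -2.0]
analytic = [2 * x for x in params]
numeric = [finite_diff(loss, params, i) for i in range(len(params))]
```

If the analytic (bprop) gradient and the numeric one disagree beyond a small tolerance, the fprop/bprop implementation has a bug.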
DEBUGGING ON SMALL DATASET
Topics: debugging on small dataset

 Next, make sure your model is able to (over)fit a very small dataset (~50 examples)
 If not, investigate the following situations:
- Are some of the units saturated, even before the first update?
  - scale down the initialization of your parameters for these units
  - properly normalize the inputs
- Is the training error bouncing up and down?
  - decrease the learning rate
 Note that this isn't a replacement for gradient checking
- you could still overfit with some of the gradients being wrong
Foundations of Deep Learning


Training deep feed-forward neural networks

DEEP LEARNING
Topics: inspiration from visual cortex
(Slide text in French: "Le système visuel humain — Pourquoi ne pas s'inspirer du cerveau pour faire de la vision!" — "The human visual system — why not draw inspiration from the brain to do vision!")

 The visual system processes images through a hierarchy of features: edges first, then object parts (nose, mouth, eyes), then whole objects (face)
DEEP LEARNING
Topics: theoretical justification

 A deep architecture can represent certain functions (exponentially) more compactly
 Example: Boolean functions
- a Boolean circuit is a sort of feed-forward network where hidden units are logic gates (i.e. AND, OR or NOT functions of their arguments)
- any Boolean function can be represented by a single hidden layer Boolean circuit
  - however, it might require an exponential number of hidden units
- it can be shown that there are Boolean functions which
  - require an exponential number of hidden units in the single layer case
  - require a polynomial number of hidden units if we can adapt the number of layers
 See "Exploring Strategies for Training Deep Neural Networks" for a discussion
DEEP LEARNING
Topics: success story: speech recognition

DEEP LEARNING
Topics: success story: computer vision
DEEP LEARNING
Topics: why training is hard

 First hypothesis: optimization is harder (underfitting)
- vanishing gradient problem
- saturated units block gradient propagation
- this is a well known problem in recurrent neural networks
DEEP LEARNING
Topics: why training is hard

 Second hypothesis: overfitting
- we are exploring a space of complex functions
- deep nets usually have lots of parameters
 Might be in a high variance / low bias situation
 [Figure: fits of increasing capacity on the same data — low variance / high bias, good trade-off, high variance / low bias]
DEEP LEARNING
Topics: why training is hard

 Depending on the problem, one or the other situation will tend to dominate
 If first hypothesis (underfitting): better optimize
- use better optimization methods
- use GPUs
 If second hypothesis (overfitting): use better regularization
- unsupervised pre-training
- stochastic "dropout" training
DROPOUT
Topics: dropout

 Idea: "cripple" the neural network by removing hidden units stochastically
- each hidden unit is set to 0 with probability 0.5
- hidden units cannot co-adapt to other units
- hidden units must be more generally useful
 Could use a different dropout probability, but 0.5 usually works well
Hugo Larochelle
Hugo Laroc
p(y = c|x) h
p(y = c|x)
i>
D
e
partement
dinformatiq
D
e
partement
dinf
p(y = c|x)
exp(a
)
exp(a
)
1
C
h
i
>
P
P

{
i

o(a)
=
softmax(a)
exp(a=
)
exp(a ) h . . .
p(y
=
c|x)
>
P
P
exp(a
)
exp(a
)
Universit
e
de
Sherbrooke
h
i
o(a) = softmax(a) =
.
.
.
c
c
Universit
e
de
Sh
exp(a
exp(a
)
c exp(a )
c)
>
exp(a )
P
P
exp(a )
exp(a
)

o(a)
=
softmax(a)
=
.
.
.
Topics: dropout
P
P
exp(a
)
exp(a )
o(a) = softmax(a)
=
...
h
i
exp(a
)
exp(a
hugo.larochelle@usherbrook
>)
hugo.larochelle@us

g(a)
=
a
f (x)
exp(a
)
exp(a
)
1
C
f (x)
p(y = c|x)

DROPOUT
w

P (k)
P

o(a)
=
softmax(a)
=
.
.
.

p(y
=
c|x)

p(y
=
c|x)
Use
exp(a
exp(ac ) ...

f
(x)
binary
masks
m
c )(2)
(1) random
(2)
(1)
(2)
(3)
(1)
(3)

p(y
=
c|x)
c
c
f (x)W W W b b b
h (x) h (x)
1
(1)
(2)
(1)
(2)
(3)
(1)
(2)
(3)
h b
i

g(a)
=
=
September
6,
2012
hsigm(a)
i>
September
6,
>

h
(x)
h
(x)
W
W
W
b
b
h
i
1+exp(
a)

p(y
=
c|x)
exp(a
)
exp(a
)
>
(k)
(k)
(k)
(k
1)
(0)
(1)for
(2)
(1) (2) (2)
(3)
(1)
(2)
exp(a
) .b
exp(a
)(3)
P
P
1
C

o(a)
=
softmax(a)
=
.
.
a layer
(x)pre-activation
= b h
+
W
h
x
(h
(x)
=
x)
(1)
(2)
(1)
(3)
(1)
(2)
(3)
k>0
P
P
exp(a
)
exp(a
)
h(x)
(x)
h
(x)
W
W
W
b
b
f
(x)
exp(a
)
exp(a
)
1
Cb
o(a)
=P
softmax(a)
=
.
.
.
h =(x)
W
W
W
b
b
P
o(a) = softmax(a)
.
.
.
exp(a
)
exp(ac )
c
hc )
i
c
c
>
exp(a
exp(a
)
c
(k) (k)
(k) (k)
(k) (k
c 1)
exp(a(0)
)c
exp(a
)
exp(a) exp( a)
exp(2a) 1

h
(x)
=
g(a
(x))
P
P
f
(x)

a
(x)
=
b
+
W
h
(x)
(h
(x)
=
x)

o(a)
=
softmax(a)
=
.
.
.
(k)
(k)
(k)
(k
1)
(0)

(k)b(1) +(k)
(k h
1) exp(a

g(a)
=
tanh(a)
=
(1) a(k)(2)
(2)
(3)
(1) =
(2)==
)(0)
exp(a
) x)(3)
a
(x)
=
W
x
(h
(x)
1
(x)
=
b
+
W
h
x
(h
(x)
x)
exp(a)+exp(
a)
exp(2a)+1
= c|x)
(L+1)
h (x) h(L+1)(x)
W p(y
W
W
b ... (2)b (3) b
p(y
= c|x) (1)
...

f
(x)
(1)
(2)
(1)
(2)
(3)
h (x) h (x) W
W
W
b
b
b
h f (x)(x) = o(a
(x)) =p(y
f (x)= c|x)
Abstract
Abstract
h
i
(k)
(k)
>
(k)
f (x) (x))(k) (k)
h
i
(k)
h(k) (x)
=
g(a
exp(a
)
exp(a
)
>
P
P
(k)
(k)
(k)
(k
1)
(0)

h
(x)
=
g(a
(x))

h
=
g(a
(k) (x)(1)
(k)
(ka (x))
1)
(0)

o(a)
=
softmax(a)
=
.
.
.
exp(a
)
exp(a
)
(2)
(1)
(2)
(3)
(1)
(2)
(3)

(x)
=
b
+
W
h
x
(h
(x)
=
x)
exp(a
exp(a ) C
(x)
g(a)
=
max(0,
a)
1 )
hidden
a(1) (x)
=activation
b(2) (1)
+h(k
Wfrom
h
x
(h
=
x)
P
P
layer
1
to
L):
h
i

(x)
h
W
W
W
b
b
b

o(a)
=
softmax(a)
=
.
.
.
>
Math
for
my
slides
Feedforward
neural
network.
Math
for
my
slides
Feedforward
neural
n
(1)
(2)
(3)
(1)
(2)
(3)
(2)
(1)
(2)
(3)
(1)
(2)
(3)
exp(a
)
exp(a
)
exp(a
)
exp(a
)
c
c
h (L+1)
(x) h(L+1)
(x)
W
W
W
b
b
b
Cb
c Pb
c
(L+1)
h (x)
h =
(x)
W
b
(k)W
(k) P W 1
(L+1)

o(a)
softmax(a)
=
.
.
.
(L+1)
h (L+1)
(x)
=f =
g(af (x)
(x)) exp(ac )
(x)
(x)
h(k) (x)
(x))
=
f
h =
(x)
=
o(a
(x))
h o(a
(x)
=
o(a
(x))
=
f
(x)
c
c exp(ac )
(k) (k)
P
P
(k)
(k)= (k
1)
(0)
1

g(a)
reclin(a)
=
max(0,
a)
(k)
(k)
(k)
(k
1)
(0)
h(k) (x) = g(a
(x))

a
(x)
=
b
+
W
h
x
(h
(x)
=
x)
>
...
...
(L+1)
(L+1)
(1)o(a
(2)
(1)
(2) w(3)
(1)
(2)
(3)> x
(k)
(k)
(k
1) (x)
(0)

f
(x)

a
(x)
=
b
+
W
h
x
(h
(x)
=
x)

h
=
(x))
=
f
(x)

a(x)
=
b
+
x
=
b
+
w

a(x)
=
b
+
w
x
=
b
+
w
x

h
(x)
h
(x)
W
W
W
b
b
b
i
i
i
i
a (x) = b + W
h
x
(h
=
x)
i
i
f (x)
(k) (2)
(k)
(k) (k 1)
(0) (3)
(k) (L+1)
(k)
(k)
(k)
(L+1)
P b(2) b(3)P
(1)
(1)
(2)
(1)

a
(x)
=
b
+
W
h
x
(h
(x) = x)

h
(x)
=
g(a
(x))

h
(x)
=
g(a
(x))

h
(x)
h
(x)
W
W
W
b
h(k) (x) = o(a
(x))
=
f
(x)

g()
b
(k) h(1) (x) h(2) (x) W(1)h(x)
(2)= g(a(x))
(3) h(x)
(1) = =
(2)g(a(x))
(3)
=
g(b
+
w
x
)
g(b
+
w
x
)
W
W
b
b
b
i
i
i
i
output
h (x)
=
g(a
(x))
layer activation (k=L+1):
i
i
h(k) (x) = g(a(k) (x))
(L+1)
(L+1)
h
(x) = o(a
(x))
=
f (x)
(k)
(k)
(k) (k 1)
(0)
(L+1)
(L+1)
a
(x)
=
b
+
W
h
x
(h
(x)1 = x)
(1)
(1)
(k) (x) =(k)
(k) (x))
(k 1)
(0)
(L+1)
h
o(a
=
f(h
(x)
...
...
(L+1)
(L+1)

a
(x)
=
b
+
W
h
x
(x)
=
x)
(L+1)
hW (x)x=1 o(a
f (x)
x1 xi d
bixd (x))
xj =h(x)
h
(x) = o(a
(x)) = f(x)
i,j
1

(k)
(k)
(k)= g(a
h
(x)
(x))
(x) = g(a (x))

(k)

w
w

h(x)
=
g(a(x))
(L+1)
(L+1)

40
Hugo Larochelle
p(y = c|x)
Hugo
Larochelle
Hugo Laroc
D
w
epartement dinformatique
p(y = c|x) h
p(y = c|x)
i
D
e
partement
dinformatiq
D
e
partement
dinf
>
p(y = c|x)
Universit
e
de
Sherbrooke
exp(a
)
exp(a
)
1
C
h
i
>
P
P

{
i

o(a)
=
softmax(a)
exp(a=
)
exp(a ) h . . .
p(y
=
c|x)
>
P
P
exp(a
)
exp(a
)
Universit
e
de
Sherbrooke
h
i
o(a) = softmax(a) =
.
.
.
c
c
Universit
e
de
Sh
exp(a
)
c exp(a
hugo.larochelle@usherbrooke.ca
>
C
exp(a )
) Pexp(a1c)
P
exp(a1 )
exp(a

o(a)
=
softmax(a)
=
.
.
.
C)
Topics: dropout
P
P
exp(a
)
o(a) = softmax(a)
=
.
.
c.
c)
c
c exp(a
h
i
exp(a
)
exp(a
)
hugo.larochelle@usherbrook
>
c = a c
c
hugo.larochelle@us
c g(a)

f (x)
exp(a
)
exp(a
)
C
f (x) = softmax(a)
P (k)1
P
September
28,
2013

o(a)
=
.
.
.

p(y
=
c|x)
...
p(y
=
c|x)
Use
exp(a
)
exp(a
)
fW
(x)
random
c (2)
c
(1)
(2)
(1)c|x)
p(yb=
c
f (x)binary
h(1) (x)
h(2) (x)
Wmasks
W(3)m
b
b(3) c
1
(1)
(2)
(1)
(2)
(3)
(1)
(2)
(3)
h b
i

g(a)
=
=
September
6,
2012
hsigm(a)
i>
September
6,
>

h
(x)
h
(x)
W
W
W
b
b
h
i
1+exp(
a)

p(y
=
c|x)
exp(a
)
exp(a
)
>
(k)
(k)
(k)
(k
1)
(0)
(1) (2) (2)
(3)
(2)
exp(a
) . . .(1)
P
P
1
o(a)
softmax(a)
= (3)
a (x) = b +
W(1)h (2)x (2)
(h (x)
==x)
(1)
(1)
(1)
(2) exp(a
(3) C )(3)

DROPOUT

layer
pre-activation
k>0h
PW
P
exp(a
exp(a
) ) b.b. . exp(a
h(x)for
(x)
(x)
W
bb) b
f
(x)
exp(a
1) W W
C
o(a)
=
softmax(a)
=

h
h
(x)
W
W
b
P
P
o(a) = softmax(a) =
exp(a
exp(ac )
c)
hc ) . . .
i
c Abstract
c
>
exp(a
exp(a
)
c
(k) (k)
(k) (k)
(k) (k
c 1)
exp(a(0)
exp(a
1) c
C)
exp(a) exp( a)
exp(2a) 1

h
(x)
=
g(a
(x))
P
P
f
(x)

a
(x)
=
b
+
W
h
(x)
(h
(x)
=
x)

o(a)
=
softmax(a)
=
.
.
.
(k)
(k)
(k)
(k
1)
(0)

(k)b(1) +(k)
(k h
1)

g(a)
=
tanh(a)
=
(1) a(k)(2)
(2)
(3)
(1) c=
(2)=
exp(a
exp(a
) x)(3)
c )(0)
c=
a
(x)
=
W
x
(h
(x)
1
c
(x)
=
b
+
W
h
x
(h
(x)
x)
exp(a)+exp(
a)
exp(2a)+1
p(y
= c|x)
(L+1)
hMath(x)
h(L+1)
(x)
Wlearning.
W
W
b ... (2)b (3) b
p(y
= c|x) (1)
...
for
my
slides
Deep

f
(x)
(1)
(2)
(1)
(2)
(3)
h (x) h (x) W
W
W
b
b
b
h f (x)(x) = o(a
(x)) =p(y
f (x)= c|x)
Abstract
Abstract
h
i
(k)
(k)
>
(k)
f (x) (x))(k) (k)
h
i
(k)
h(k) (x)
=
g(a
exp(a
)
exp(a
)
1
C
>
P
P
(k)
(k)
(k)
(k
1)
(0)

h
(x)
=
g(a
(x))

h
(x)
=
g(a
(x))
(k)
(k)
(k
1)
(0)

o(a)
=
softmax(a)
=
.
.
.
exp(a
)
exp(a
)
(1)
(2)
(1)
(2)
(3)
(1)
(2)
(3)

a
(x)
=
b
+
W
h
x
(h
(x)
=
x)
exp(a
exp(ac ) C
(k) +h
(k)h
(x)
g(a)
=
max(0,
1 c)
a(1)
(x)
=activation
b(2)
W
x
(h
=
x)
ca)
cP
P
hidden
layer
(k
from
1
to
L):
h
i
(x)
h
W
W
W
b
b
b
h(k)
(x)
=
g(a
(x)
m
)

o(a)
=
softmax(a)
=
.
.
.
>
Math
for
my
slides
Feedforward
neural
network.
Math
for
my
slides
Feedforward
neural
n
(1)
(2)
(3)
(1)
(2)
(3)
(1)
(2)
(1)
(2)
(3)
(1)
(2)
(3)
exp(a
)
exp(a
)
exp(a
)
exp(a
)
c
c
h (L+1)
(x) h(L+1)
(x)
W
W
W
b
b
b
Cb
c Pb
c
(L+1)
h (x)
h =
(x)
W
b
(k)W
(k) P W 1
(L+1)

o(a)
softmax(a)
=
.
.
.
(L+1)
h (L+1)
(x)
=f =
g(af (x)
(x)) exp(ac )
(x)
exp(ac )
h
(x))
=(x))
f(x)
(x)
h = o(a
(x) = o(a

h (L+1)
(x)
= o(a
(x)) =
f (x)
c
c
(L+1)
(k)
P
P
(k)
(k)
(k)
(k
1)
(0)
(k)
1
h h(k) (x) =
(x)
m
)
=
f
(x)

g(a)
=
reclin(a)
=
max(0,
a)
(k)
(k)
(k)
(k
1)
(0)
= o(a
g(a
(x))
m

a
(x)
=
b
+
W
h
x
(h
(x)
=
x)
...
...
(1)o(a
(3)> x
1) (x)
(0)
f+
(x)
(k)
a (x) =(k)
b (k
W
xa(x)
(h
(x)
h(L+1)
(x))
f (x)
(L+1)
=
b(1)=+x)
w(3)x=bb(1)
=+bb(2)
+w
(2)a(x)
x = b + w> x
h
h=
(x)
h(2)
(x) =
W
W
W
bw
(L+1)
(k)

a (x) = b + W
x (h = x)
i i i
i i i
f (x) h
(1)
(3) (k) (L+1)
(k) (2)
(k)
(k) (k 1)
(0) (3)
(k)
(k)
(k)
P b(2) b(3)P
(1)
(1)
(2)
(1)
x hh(L+1)
h(2)
h

a
(x)
=
b
+
W
h
x
(h
(x) = x)

h
(x)
=
g(a
(x))

h
(x)
=
g(a
(x))

h
(x)
h
(x)
W
W
W
b
(x)
=
o(a
(x))
=
f
(x)

g()
b
(k) h(1) (x) h(2) (x) W(1)h(x)
(2)= g(a(x))
(3) h(x)
(1) = =
(2)g(a(x))
(3)
=
g(b
+
w
x
)
g(b
+
w
x
)
W
W
b
b
b
i
i
i
i
output
h(k) (x)
=
g(a
(x))
layer activation (k=L+1):
i
i
h(k) (x) = g(a(k) (x))
(L+1)
(L+1)
(2)
(3)
(x) = o(a
(x))
=
f (x)
(k)
(k)
(k) (k 1)
(0)
p(h , h ) h h(L+1)
(L+1)
a
(x)
=
b
+
W
h
x
(h
(x)1 = x)
(1)
(1)
(k) (x) =(k)
(k) (x))
(k 1)
(0)
o(a
=
f(h
(x)
...
...
(L+1)
(L+1)

a
(x)
=
b
+
W
h
x
(x)
=
x)
(L+1)
(L+1)
(x)
hW (x)x=1 o(a
f (x)
x1 xi d
bixd (x))
xj =h(x)
h
(x) = o(a
(x)) = f
i,j
>
(0)
(1)
(k)(1) (k)
(k)
p(xi = 1|h(1) ) = sigm(b
W
h
)
(k) +h
(x)
=
g(a
(x))
h (x) = g(a (x))
(1)

w
w

h(x)
=
g(a(x))
(L+1) >
(L+1)
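The masked forward pass can be sketched in NumPy. This is an illustrative sketch, not the course's code: the layer sizes, the ReLU choice for g, and all variable names are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, weights, biases, p_drop=0.5, train=True):
    """Forward pass with dropout: every hidden layer's activation is
    multiplied by a fresh binary mask; the output layer is left unmasked."""
    h = x
    masks = []
    for W, b in zip(weights[:-1], biases[:-1]):
        a = W @ h + b                                   # pre-activation a^(k)(x)
        h = np.maximum(0.0, a)                          # g(a), here a ReLU
        if train:
            m = (rng.random(h.shape) >= p_drop).astype(h.dtype)
            h = h * m                                   # h^(k)(x) = g(a^(k)(x)) * m^(k)
            masks.append(m)
    a_out = weights[-1] @ h + biases[-1]                # output pre-activation a^(L+1)(x)
    return a_out, masks

# tiny network with two hidden layers (sizes are arbitrary)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((4, 4)),
           rng.standard_normal((2, 4))]
biases = [np.zeros(4), np.zeros(4), np.zeros(2)]
out, masks = dropout_forward(rng.standard_normal(3), weights, biases)
```

The sampled masks are returned because backpropagation needs them again.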

DROPOUT
Topics: dropout backpropagation

- Backpropagation is as usual, except that h^(k)(x) = g(a^(k)(x)) * m^(k) now includes the mask m^(k)
- This assumes a forward propagation has been made before
- compute output gradient (before activation):
    grad_{a^(L+1)(x)} -log f(x)_y  <=  -(e(y) - f(x))
- for k from L+1 to 1:
  - compute gradients of hidden layer parameters:
      grad_{W^(k)} -log f(x)_y  <=  (grad_{a^(k)(x)} -log f(x)_y) h^(k-1)(x)^T
      grad_{b^(k)} -log f(x)_y  <=  grad_{a^(k)(x)} -log f(x)_y
  - compute gradient of hidden layer below:
      grad_{h^(k-1)(x)} -log f(x)_y  <=  W^(k)T (grad_{a^(k)(x)} -log f(x)_y)
  - compute gradient of hidden layer below (before activation), which now includes the mask m^(k-1):
      grad_{a^(k-1)(x)} -log f(x)_y  <=  (grad_{h^(k-1)(x)} -log f(x)_y) * [..., g'(a^(k-1)(x)_j), ...] * m^(k-1)
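This loop can be checked numerically for a single hidden layer. The sketch below is our own (softmax output, ReLU hidden layer, arbitrary sizes); the only difference from plain backprop is the mask factor in grad_a1.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# one hidden layer with dropout, softmax output; shapes are illustrative
x = rng.standard_normal(3)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)
W2, b2 = rng.standard_normal((2, 4)), np.zeros(2)
y = 1
m1 = (rng.random(4) > 0.5).astype(float)    # sampled mask m^(1)

# forward pass (must be run before backprop, as the slide notes)
a1 = W1 @ x + b1
h1 = np.maximum(0.0, a1) * m1               # h^(1)(x) = g(a^(1)(x)) * m^(1)
f = softmax(W2 @ h1 + b2)

# backward pass, following the slide's loop
e_y = np.eye(2)[y]
grad_a2 = -(e_y - f)                        # output gradient (before activation)
grad_W2 = np.outer(grad_a2, h1)             # parameter gradients
grad_b2 = grad_a2
grad_h1 = W2.T @ grad_a2                    # gradient of hidden layer below
grad_a1 = grad_h1 * (a1 > 0) * m1           # ... times g'(a^(1)) and the mask m^(1)
grad_W1 = np.outer(grad_a1, x)
grad_b1 = grad_a1
```

Units that were dropped (mask 0) receive exactly zero gradient, so their parameters are not updated on this example.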

DROPOUT
Topics: test time classification

- At test time, we replace the masks by their expectation
  - this is simply the constant vector 0.5 if the dropout probability is 0.5
  - for a single hidden layer, one can show this is equivalent to taking the geometric average of all neural networks, with all possible binary masks
- Beats regular backpropagation on many datasets, but is slower (~2x)

  Improving neural networks by preventing co-adaptation of feature detectors.
  Hinton, Srivastava, Krizhevsky, Sutskever and Salakhutdinov, 2012.
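A quick sanity check of the expectation trick (our own sketch, arbitrary sizes): for a linear output on top of one masked hidden layer, scaling the hidden units by the keep probability matches the average pre-activation over many sampled masks. The geometric-averaging statement on the slide is a stronger result for a sigmoid output layer; this only illustrates the linear part.

```python
import numpy as np

rng = np.random.default_rng(2)

W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)
w2, b2 = rng.standard_normal(4), 0.0
x = rng.standard_normal(3)
p_keep = 0.5

h = np.maximum(0.0, W1 @ x + b1)     # hidden layer, no mask applied yet

# test time: replace the mask by its expectation (the constant vector 0.5)
a_test = w2 @ (h * p_keep) + b2

# average the masked pre-activation over many sampled binary masks
samples = [w2 @ (h * (rng.random(4) < p_keep)) + b2 for _ in range(50000)]
a_avg = float(np.mean(samples))
```

The Monte Carlo average converges to a_test, which is why one deterministic forward pass suffices at test time.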

DEEP LEARNING
Topics: why training is hard

- Depending on the problem, one or the other situation will tend to dominate
- If first hypothesis (underfitting): better optimize
  - use better optimization methods
  - use GPUs
- If second hypothesis (overfitting): use better regularization
  - unsupervised pre-training
  - stochastic "dropout" training
  - batch normalization

BATCH NORMALIZATION
Topics: batch normalization

- Normalizing the inputs will speed up training (LeCun et al., 1998)
  - could normalization also be useful at the level of the hidden layers?
- Batch normalization (Ioffe and Szegedy, 2014) is an attempt to do that
  - each unit's pre-activation is normalized (mean subtraction, stddev division)
  - during training, the mean and stddev are computed on each minibatch
  - backpropagation takes the normalization into account
  - at test time, the global mean / stddev is used

- The Batch Normalizing Transform BN_{gamma,beta}, applied to each unit over a mini-batch B = {x_1, ..., x_m}, where epsilon is a small constant added to the mini-batch variance for numerical stability:

    mu_B      = (1/m) sum_i x_i                         // mini-batch mean
    sigma^2_B = (1/m) sum_i (x_i - mu_B)^2              // mini-batch variance
    x_hat_i   = (x_i - mu_B) / sqrt(sigma^2_B + eps)    // normalize
    y_i       = gamma * x_hat_i + beta  =  BN_{gamma,beta}(x_i)   // scale and shift

- Learned linear transformation (gamma and beta are trained) to adapt to the non-linear activation function

UNSUPERVISED PRE-TRAINING
Topics: unsupervised pre-training

- Solution: initialize hidden layers using unsupervised learning
  - force the network to represent the latent structure of the input distribution
    [figure: a character image vs. a random image -- why is one a character and the other is not?]
  - encourage hidden layers to encode that structure
- this is a harder task than supervised learning (classification)
  - hence we expect less overfitting

AUTOENCODER
Topics: autoencoder, encoder, decoder, tied weights

- Feed-forward neural network trained to reproduce its input at the output layer
- Encoder:
    h(x) = g(a(x)) = sigm(b + W x)
- Decoder:
    x_hat = o(a_hat(x)) = sigm(c + W* h(x))
- for binary inputs (cross-entropy loss):
    l(f(x)) = - sum_k ( x_k log(x_hat_k) + (1 - x_k) log(1 - x_hat_k) )
- for real-valued inputs (squared error):
    l(f(x)) = (1/2) sum_k (x_hat_k - x_k)^2
- tied weights: W* = W^T
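A minimal tied-weights autoencoder in NumPy (our own sketch; the input and hidden sizes and the initialization scale are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

# tied-weights autoencoder: the decoder uses W* = W^T
d, k = 8, 3
W = rng.standard_normal((k, d)) * 0.1
b = np.zeros(k)   # encoder bias
c = np.zeros(d)   # decoder bias

def encode(x):
    return sigm(b + W @ x)        # h(x) = sigm(b + W x)

def decode(h):
    return sigm(c + W.T @ h)      # x_hat = sigm(c + W^T h)

def cross_entropy(x, x_hat):
    # reconstruction loss for binary inputs
    return -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

x = (rng.random(d) > 0.5).astype(float)
x_hat = decode(encode(x))
loss = cross_entropy(x, x_hat)
```

With sigmoid outputs, x_hat stays strictly inside (0, 1), so the cross-entropy loss is well defined and positive.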

UNSUPERVISED PRE-TRAINING
Topics: unsupervised pre-training

- We will use a greedy, layer-wise procedure
  - train one layer at a time, from first to last, with an unsupervised criterion
  - fix the parameters of the previous hidden layers
  - previous layers are viewed as feature extraction
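The greedy loop can be sketched as follows; the tiny tied-weights autoencoder trainer inside it is entirely our own illustration (sigmoid units, squared reconstruction error, arbitrary layer sizes, learning rate and step count), not the course's recipe.

```python
import numpy as np

rng = np.random.default_rng(5)

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_autoencoder(H, n_hidden, lr=0.1, steps=500):
    """Fit one tied-weights autoencoder (W* = W^T) to the representations H
    of shape (n, d) by SGD on squared reconstruction error."""
    n, d = H.shape
    W = rng.standard_normal((n_hidden, d)) * 0.1
    b, c = np.zeros(n_hidden), np.zeros(d)
    for t in range(steps):
        x = H[t % n]
        h = sigm(b + W @ x)                     # encoder
        x_hat = sigm(c + W.T @ h)               # decoder (tied weights)
        g_ahat = (x_hat - x) * x_hat * (1 - x_hat)
        g_a = (W @ g_ahat) * h * (1 - h)
        W -= lr * (np.outer(g_a, x) + np.outer(h, g_ahat))  # both tied contributions
        b -= lr * g_a
        c -= lr * g_ahat
    return W, b

def greedy_pretrain(X, layer_sizes):
    """Train one layer at a time, from first to last; once trained, a layer's
    parameters are fixed and it is used as a feature extractor for the next."""
    H, params = X, []
    for n_hidden in layer_sizes:
        W, b = train_autoencoder(H, n_hidden)
        params.append((W, b))
        H = sigm(b + H @ W.T)                   # fixed feature extraction
    return params

X = (rng.random((50, 6)) > 0.5).astype(float)
params = greedy_pretrain(X, [5, 4])
```

The returned parameters would then initialize the hidden layers of the supervised network before fine-tuning.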

FINE-TUNING
Topics: fine-tuning

- Once all layers are pre-trained:
  - add the output layer
  - train the whole network using supervised learning
- Supervised learning is performed as in a regular feed-forward network:
  - forward propagation, backpropagation and update
- We call this last phase fine-tuning:
  - all parameters are "tuned" for the supervised task at hand
  - the representation is adjusted to be more discriminative

NEURAL NETWORK ONLINE COURSE
Topics: online videos

- for a more detailed description of neural networks... and much more!

http://info.usherbrooke.ca/hlarochelle/neural_networks


MERCI! (THANK YOU!)
