http://info.usherbrooke.ca/hlarochelle/neural_networks
Hugo Larochelle
Département d'informatique
Université de Sherbrooke
hugo.larochelle@usherbrooke.ca
September 6, 2012
We'll cover:
- forward propagation
- types of units
- loss function
- backpropagation
- deep learning:
  - dropout
  - batch normalization
  - unsupervised pre-training
Activation functions g(a):
- linear: g(a) = a
- sigmoid: g(a) = sigm(a) = 1 / (1 + exp(-a))
- hyperbolic tangent: g(a) = tanh(a) = (exp(a) - exp(-a)) / (exp(a) + exp(-a)) = (exp(2a) - 1) / (exp(2a) + 1)
- rectified linear: g(a) = reclin(a) = max(0, a)
Artificial neuron (inputs x_1, ..., x_d, weights w, bias b):
- neuron pre-activation: a(x) = b + Σ_i w_i x_i = b + w^T x
- neuron (output) activation: h(x) = g(a(x)) = g(b + Σ_i w_i x_i)

where w are the connection weights, b is the neuron bias and g(·) is the activation function. For classification, the output f(x)_c estimates p(y = c | x).
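The two equations above can be sketched directly in numpy; a minimal sketch, assuming a sigmoid activation g (any of the activation functions above would do):

```python
import numpy as np

def g(a):
    # sigmoid activation: sigm(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def neuron(x, w, b):
    a = b + w @ x   # pre-activation: a(x) = b + w^T x
    return g(a)     # activation:     h(x) = g(a(x))

h = neuron(np.array([1.0, -2.0]), np.array([0.5, 0.25]), 0.1)
```

With these values the pre-activation is 0.1, so h = sigm(0.1) ≈ 0.525.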
For multi-class classification, the output activation is the softmax:
o(a) = softmax(a) = [ exp(a_1) / Σ_c exp(a_c) , ... , exp(a_C) / Σ_c exp(a_c) ]^T
so that f(x)_c = p(y = c | x).
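A softmax sketch in numpy; subtracting max(a) before exponentiating leaves the result unchanged but avoids overflow — a standard stability trick, not shown on the slides:

```python
import numpy as np

def softmax(a):
    # subtracting the max is safe: it cancels in the ratio
    e = np.exp(a - np.max(a))
    return e / e.sum()

o = softmax(np.array([1.0, 2.0, 3.0]))
```

The output is strictly positive and sums to one, as required of a probability vector.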
Topics: multilayer neural network
NEURAL NETWORK

Could have L hidden layers:
- layer pre-activation for k > 0 (with h^(0)(x) = x):
  a^(k)(x) = b^(k) + W^(k) h^(k-1)(x)
- hidden layer activation (k from 1 to L):
  h^(k)(x) = g(a^(k)(x))
- output layer activation (k = L+1):
  h^(L+1)(x) = o(a^(L+1)(x)) = f(x)
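The layer recursion above translates into a short loop; a sketch assuming sigmoid hidden activations, a softmax output, and illustrative layer sizes:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def forward(x, Ws, bs):
    h = x                                  # h^(0)(x) = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = sigm(b + W @ h)                # hidden: h^(k) = g(a^(k)(x))
    return softmax(bs[-1] + Ws[-1] @ h)    # output: f(x) = o(a^(L+1)(x))

# one hidden layer of size 4, input size 3, two classes (arbitrary choices)
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
bs = [np.zeros(4), np.zeros(2)]
f = forward(np.array([0.5, -1.0, 2.0]), Ws, bs)
```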
ACTIVATION FUNCTION

Sigmoid: g(a) = sigm(a) = 1 / (1 + exp(-a))
- Squashes the neuron's pre-activation between 0 and 1
- Always positive
- Bounded
- Strictly increasing
Topics: hyperbolic tangent (tanh) activation function

g(a) = tanh(a) = (exp(a) - exp(-a)) / (exp(a) + exp(-a)) = (exp(2a) - 1) / (exp(2a) + 1)
- Squashes the neuron's pre-activation between -1 and 1
- Can be positive or negative
- Bounded
- Strictly increasing
ACTIVATION FUNCTION

Rectified linear: g(a) = reclin(a) = max(0, a)
- Bounded below by 0 (always non-negative)
- Not upper bounded
- Strictly increasing
- Tends to give neurons with sparse activities
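The four activation functions above, vectorized with numpy:

```python
import numpy as np

def linear(a):
    return a

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    return np.tanh(a)            # equals (exp(2a) - 1) / (exp(2a) + 1)

def reclin(a):
    return np.maximum(0.0, a)    # rectified linear

a = np.array([-2.0, 0.0, 3.0])
```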
Single hidden layer neural network (weights W^(1)_{i,j}, biases b^(1)_i, output weights w^(2)):
- hidden layer pre-activation: a(x)_i = b^(1)_i + Σ_j W^(1)_{i,j} x_j
- hidden layer activation: h(x) = g(a(x))
- output: f(x) = o( b^(2) + w^(2)^T h^(1)(x) )

Topics: multi-class classification
- we need multiple outputs (1 output per class)
- we would like to estimate the conditional probability p(y = c | x)
- we use the softmax activation function at the output:
  o(a) = softmax(a) = [ exp(a_1) / Σ_c exp(a_c) , ... , exp(a_C) / Σ_c exp(a_c) ]^T
- strictly positive, sums to one
- predicted class: the one with the highest estimated probability
FLOW GRAPH

Topics: flow graph
- Forward propagation can be represented as an acyclic flow graph
- It's a nice way of implementing forward propagation in a modular way
  - each box could be an object with an fprop method, that computes the value of the box given its parents
  - calling the fprop method of each box in the right order yields forward propagation
Universal approximation: a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units.
MACHINE LEARNING

Supervised learning example: (x, y), with input x and target y
Training set: D^train = {(x^(t), y^(t))} (with held-out sets D^valid and D^test)

Empirical risk minimization: a framework to design learning algorithms
arg min_θ (1/T) Σ_t l(f(x^(t); θ), y^(t)) + λ Ω(θ)
- l(f(x^(t); θ), y^(t)) is a loss function
- Ω(θ) is a regularizer (penalizes certain values of θ)
- Learning is cast as optimization
- the loss function is a surrogate for what we truly should optimize (e.g. an upper bound)
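A minimal sketch of evaluating the objective above; the quadratic loss here is only an illustrative stand-in for the generic l, and Ω(θ) is taken to be an L2 penalty:

```python
import numpy as np

def empirical_risk(preds, targets, theta, lam):
    loss = np.mean((preds - targets) ** 2)   # (1/T) sum_t l(f(x;theta), y)
    omega = np.sum(theta ** 2)               # Omega(theta): L2 regularizer
    return loss + lam * omega

risk = empirical_risk(np.array([0.9, 0.2]), np.array([1.0, 0.0]),
                      np.array([0.5, -0.5]), lam=0.1)
```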
Topics: stochastic gradient descent (SGD)

Algorithm that performs updates after each example:
- initialize θ  (θ ≡ {W^(1), b^(1), ..., W^(L+1), b^(L+1)})
- for N epochs
  - for each training example (x^(t), y^(t))
    Δ = -∇_θ l(f(x^(t); θ), y^(t)) - λ ∇_θ Ω(θ)
    θ ← θ + α Δ
- each pass over all the examples is called a training epoch
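The SGD loop above, sketched on a toy 1-D least-squares problem; the model and data are illustrative assumptions, while the update rule mirrors the slide (Δ = -grad l - λ grad Ω, then θ ← θ + αΔ):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.standard_normal(100)
ys = 3.0 * xs                    # ground truth parameter is 3

theta, alpha, lam = 0.0, 0.1, 0.0
for epoch in range(5):                       # for N epochs
    for x, y in zip(xs, ys):                 # for each training example
        grad = 2.0 * (theta * x - y) * x     # grad of l = (theta*x - y)^2
        delta = -grad - lam * 2.0 * theta    # Delta, with L2 regularizer
        theta = theta + alpha * delta        # theta <- theta + alpha*Delta
```

After a few epochs theta is very close to the true value 3.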
To apply this algorithm to neural network training, we need:
- the loss function l(f(x^(t); θ), y^(t))
- a procedure to compute the parameter gradients ∇_θ l(f(x^(t); θ), y^(t))
- the regularizer Ω(θ) (and the gradient ∇_θ Ω(θ))
- an initialization method for θ
LOSS FUNCTION

Neural network estimates f(x)_c = p(y = c | x)
- we could maximize the probabilities of y^(t) given x^(t) in the training set
To frame learning as a minimization, we minimize the negative log-likelihood:
l(f(x), y) = -Σ_c 1_(y=c) log f(x)_c = -log f(x)_y
- we take the (natural) log to simplify for numerical stability and math simplicity
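The negative log-likelihood in code, assuming f(x) is the vector of predicted class probabilities and y is the target class index:

```python
import numpy as np

def nll(fx, y):
    # l(f(x), y) = -log f(x)_y
    return -np.log(fx[y])

loss = nll(np.array([0.7, 0.2, 0.1]), y=0)
```

The loss is small when the probability assigned to the correct class is close to 1.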
Partial derivative of the loss with respect to an output probability:
∂/∂f(x)_c [ -log f(x)_y ] = -1_(y=c) / f(x)_y

Gradient:
∇_f(x) [ -log f(x)_y ] = -1/f(x)_y [1_(y=0), ..., 1_(y=C-1)]^T = -e(y) / f(x)_y

where e(y) is the one-hot vector for class y.
BACKPROPAGATION

Compute output (pre-activation) gradient:
∇_a^(L+1)(x) [ -log f(x)_y ]  ⟸  -(e(y) - f(x))

for k from L+1 to 1:
- compute gradients of the hidden layer parameters:
  ∇_W^(k) [ -log f(x)_y ]  ⟸  ( ∇_a^(k)(x) [ -log f(x)_y ] ) h^(k-1)(x)^T
  ∇_b^(k) [ -log f(x)_y ]  ⟸  ∇_a^(k)(x) [ -log f(x)_y ]
- compute the gradient of the hidden layer below:
  ∇_h^(k-1)(x) [ -log f(x)_y ]  ⟸  W^(k)^T ( ∇_a^(k)(x) [ -log f(x)_y ] )
- compute the gradient of the hidden layer below (before activation):
  ∇_a^(k-1)(x) [ -log f(x)_y ]  ⟸  ( ∇_h^(k-1)(x) [ -log f(x)_y ] ) ⊙ [ ..., g'(a^(k-1)(x)_j), ... ]
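The backpropagation recursions above, sketched for a single sigmoid hidden layer with a softmax output; the layer sizes and inputs are illustrative assumptions:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def backprop(x, y, W1, b1, W2, b2):
    # forward pass
    a1 = b1 + W1 @ x
    h1 = sigm(a1)
    fx = softmax(b2 + W2 @ h1)
    # output pre-activation gradient: -(e(y) - f(x))
    e = np.zeros_like(fx)
    e[y] = 1.0
    grad_a2 = -(e - fx)
    # parameter gradients: grad_W = grad_a h^(k-1)^T, grad_b = grad_a
    grad_W2 = np.outer(grad_a2, h1)
    grad_b2 = grad_a2
    # hidden layer below: grad_h = W^T grad_a, then elementwise * g'(a1)
    grad_h1 = W2.T @ grad_a2
    grad_a1 = grad_h1 * h1 * (1.0 - h1)   # g'(a) = g(a)(1 - g(a))
    grad_W1 = np.outer(grad_a1, x)
    grad_b1 = grad_a1
    return grad_W1, grad_b1, grad_W2, grad_b2

rng = np.random.default_rng(1)
W1 = rng.standard_normal((3, 2)); b1 = np.zeros(3)
W2 = rng.standard_normal((2, 3)); b2 = np.zeros(2)
gW1, gb1, gW2, gb2 = backprop(np.array([0.5, -1.0]), 1, W1, b1, W2, b2)
```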
ACTIVATION FUNCTION

Partial derivatives:
- linear: g'(a) = 1
- sigmoid: g'(a) = g(a)(1 - g(a))
- tanh: g'(a) = 1 - g(a)^2
- rectified linear: g'(a) = 1_(a>0)
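The derivatives above in code, together with a centered finite-difference check (the check itself is an addition, not slide material):

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def d_sigm(a):
    return sigm(a) * (1.0 - sigm(a))      # g'(a) = g(a)(1 - g(a))

def d_tanh(a):
    return 1.0 - np.tanh(a) ** 2          # g'(a) = 1 - g(a)^2

def d_reclin(a):
    return (a > 0).astype(float)          # g'(a) = 1 when a > 0, else 0

# centered finite difference agrees with the analytic sigmoid derivative
a, eps = 0.7, 1e-6
num = (sigm(a + eps) - sigm(a - eps)) / (2 * eps)
```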
Topics: automatic differentiation
- Each object also has a bprop method
  - it computes the gradient of the loss with respect to each parent
  - fprop depends on the fprop of a box's parents, while bprop depends on the bprop of a box's children
- By calling bprop in the reverse order, we get backpropagation
  - only need to reach the parameters

Regularization:
- L2: Ω(θ) = Σ_k ||W^(k)||_F^2 = Σ_k Σ_i Σ_j (W^(k)_{i,j})^2, with gradient ∇_W^(k) Ω(θ) = 2 W^(k)
- L1: Ω(θ) = Σ_k Σ_i Σ_j |W^(k)_{i,j}|, with gradient ∇_W^(k) Ω(θ) = sign(W^(k)), where sign(W^(k))_{i,j} = 1_(W^(k)_{i,j} > 0) - 1_(W^(k)_{i,j} < 0)

Initialization:
- sample W^(k)_{i,j} from U[-b, b], with b = √6 / √(H_k + H_(k-1)), where H_k is the size of layer k
k ||W
()
= ki,j i W
W
=
||F
i,j
(k)
(x))
>0
W
<0
i,j
H
+H
j
k
k
k
1
i,j
i,j
()
=
|W
|
(k)k
20
(3)
(3)
(3)
(2) 1 (k)
i,j
i
j
sign(W
)
=
1
(1)
(1)
(1)
(k)
i,j
a=a p(y
=b=
b c|x)
+
W
hp Wi,j <0
(2)
=
+
W
x
W
>0
f
(x)
(3)
(3)
(3)
i,j
c (3)
(k)
(2)W (3) h
(k) p
6
(3)
(2)
(k)
ah(2) (x)
=b
+
= g(a (x))
()
=
2W
(k)
(k)
a r
=
b
+
W
h
W
U
[
b,
b]
b
=
H
h
(x)
k
i,j
p
rW
()
=
sign(W
)
(k)
(2)
(2)
(1)
(3)
(3) + W(2)
Hh
W
k +Hk 1
a
=
b
(k)
h
=
o(a
(x))
6
(t)
(t)
(k)
p
W
U
[
b,
b]
b
=
H
h
(x
x
y
P
P
P
k
i,j
(k)
(2)
(2)
(2)
(1)
(1)
(1) a(2)
(2)
(2)
(1)
H|k +Hk 1
(k)
=
b
+
W
h
()
=
|W
h
(x)
=
g(a
(x))
(3)
(3)
(3)
(2)
sign(W
)k=
=
1W(k) <0
a
=
b
+
W
h
(k) (1)
(2)
(2)1W
(1)
i,j
i(1)
j W
i,j
a
(x)
=
b
+
h
P
W
>0
h
=
g(a
(x))
a
b
+
x
Topics: automatic differentiation
i,j
i,j
l(f
(x),
y)
=
1
log
f
(x)
=
log
f
c
(3)
(3)
(3)
(2)
(y=c)
(1)
(1)
(1)
c
(3)
(2)
(1) a r
(k)
p
a
=
b
+
W
h
=
b
+
W
x
(2)
(2)
(2)
(1)
(1)
(1)
()
=
sign(W
)
(3)
(3)
(k)
b
b
b
(k)
(1)
(1)
(1)
6
W
(k)
a
(x)
=
b
+
W
h
Each object also
h
=
g(a
(x))
has
a
bprop
method
h
=
o(a
(x))
p
W
U
[
b,
b]
b
=
H
h
(x)
a = b + W xi,j
k
Hk +Hk 1
(3) (2)
(3)
(2)
(2)
(1)
(k)
h
=
o(a
(x))
a
=
b
+
W
h
(3)
(2)
(1)
(2)
(1)
it computes the gradient of(3)
(1)
(1)
(1)
the
loss
with
sign(W
)
=
1
1W(k) <0
(2)
(2)
(k)
i,j
(x)
b
b
b
=
b
+
W
x
W
W
W x af (x)
h
=
g(a
(x))
W
>0
i,j
i,j
respect to each parent (3)
(3)
(3)
(3)
(2)
(3)
(2)
a (1)(x) =
b +W
h p
@
h = o(a (x))
(2)
(1)
(1)
(3)
(2)
(1)
h
=
g(a
(x))
(3)
(3)
a
=
b
+
W
x
(1)
(1)
W
W
W
x 6
(k)
h
(x)
=
o(a
(x))
(k)
fprop depends on the fprop of a boxs parents,
h
=
g(a
(x))
p
W
U
[
b,
b]
b
=
H
h
(x)
k
f
(x)
(2)
(2)
(2) H(1)+H
i,j
a
(x)
=
b
+
W
h
while bprop depends the bprop of a boxs children
k
k 1
By
(x) = o(a
(2)
FLOW GRAPH
(1) (3)(2)
(2) h(1)h(2)
(3)g(a(3)
=
(x))
(1)
=
o(a
g(a (x))
h (x)
=
g(a
(x))
b
b(2)(x))
b
b
b
b
(3)
(2)
(1)
h
=
g(a
(x))
h
(x)
=
g(a
(x))
(1)
(1)
we get backpropagation
W
W
W
x
h = g(a (x)) (3)
(3)
(2)
(2) (x))(2) (1)
h
o(a
a
(x)
=
b
+W h
(3)
(2)
(1)
only need to reach the parameters
(1)
(1)
W
W
W
x
bh(3)
b=(2)g(a
b
(x))
(2)
(1)
(1)(2) (x))(1)
(3)
(2)
(1) h
(x)
=
g(a
a
b +W x
b
b
b
(3)
(1)
(3)
(2)(2) (1)
W
W
W
b(1) b
b (1) x f (x)
(3) (x) = g(a(3) (x))
h
o(a
(3)
(2)
(1)
(3)
(2)
(1)
x
W
W
W
x
(3)
(2)
(1)(2)
(2)
b
h
b =b
(x)
g(a
(x))
k ||W
()
= ki,j i W
W
=
||F
i,j
(k)
(x))
>0
W
<0
i,j
H
+H
j
k
k
k
1
i,j
i,j
()
=
|W
|
(k)k
20
(3)
(3)
(3)
(2) 1 (k)
i,j
i
j
sign(W
)
=
1
(1)
(1)
(1)
(k)
i,j
a=a p(y
=b=
b c|x)
+
W
hp Wi,j <0
(2)
=
+
W
x
W
>0
f
(x)
(3)
(3)
(3)
i,j
c (3)
(k)
(2)W (3) h
(k) p
6
(3)
(2)
(k)
ah(2) (x)
=b
+
= g(a (x))
()
=
2W
(k)
(k)
a r
=
b
+
W
h
W
U
[
b,
b]
b
=
H
h
(x)
k
i,j
p
rW
()
=
sign(W
)
(k)
(2)
(2)
(1)
(3)
(3) + W(2)
Hh
W
k +Hk 1
a
=
b
(k)
h
=
o(a
(x))
6
(t)
(t)
(k)
p
W
U
[
b,
b]
b
=
H
h
(x
x
y
P
P
P
k
i,j
(k)
(2)
(2)
(2)
(1)
(1)
(1) a(2)
(2)
(2)
(1)
H|k +Hk 1
(k)
=
b
+
W
h
()
=
|W
h
(x)
=
g(a
(x))
(3)
(3)
(3)
(2)
sign(W
)k=
=
1W(k) <0
a
=
b
+
W
h
(k) (1)
(2)
(2)1W
(1)
i,j
i(1)
j W
i,j
a
(x)
=
b
+
h
P
W
>0
h
=
g(a
(x))
a
b
+
x
Topics: automatic differentiation
i,j
i,j
l(f
(x),
y)
=
1
log
f
(x)
=
log
f
c
(3)
(3)
(3)
(2)
(y=c)
(1)
(1)
(1)
c
(3)
(2)
(1) a r
(k)
p
a
=
b
+
W
h
=
b
+
W
x
(2)
(2)
(2)
(1)
(1)
(1)
()
=
sign(W
)
(3)
(3)
(k)
b
b
b
(k)
(1)
(1)
(1)
6
W
(k)
a
(x)
=
b
+
W
h
Each object also
h
=
g(a
(x))
has
a
bprop
method
h
=
o(a
(x))
p
W
U
[
b,
b]
b
=
H
h
(x)
a = b + W xi,j
k
Hk +Hk 1
(3) (2)
(3)
(2)
(2)
(1)
(k)
h
=
o(a
(x))
a
=
b
+
W
h
(3)
(2)
(1)
(2)
(1)
it computes the gradient of(3)
(1)
(1)
(1)
the
loss
with
sign(W
)
=
1
1W(k) <0
(2)
(2)
(k)
i,j
(x)
b
b
b
=
b
+
W
x
W
W
W x af (x)
h
=
g(a
(x))
W
>0
i,j
i,j
respect to each parent (3)
(3)
(3)
(3)
(2)
(3)
(2)
a (1)(x) =
b +W
h p
@
h = o(a (x))
(2)
(1)
(1)
(3)
(2)
(1)
h
=
g(a
(x))
(3)
(3)
a
=
b
+
W
x
(1)
(1)
W
W
W
x 6
(k)
h
(x)
=
o(a
(x))
(k)
fprop depends on the fprop of a boxs parents,
h
=
g(a
(x))
p
W
U
[
b,
b]
b
=
H
h
(x)
k
f
(x)
(2)
(2)
(2) H(1)+H
i,j
a
(x)
=
b
+
W
h
while bprop depends the bprop of a boxs children
k
k 1
By
(x) = o(a
(2)
FLOW GRAPH
(1) (3)(2)
(2) h(1)h(2)
(3)g(a(3)
=
(x))
(1)
=
o(a
g(a (x))
h (x)
=
g(a
(x))
b
b(2)(x))
b
b
b
b
(3)
(2)
(1)
h
=
g(a
(x))
h
(x)
=
g(a
(x))
(1)
(1)
we get backpropagation
W
W
W
x
h = g(a (x)) (3)
(3)
(2)
(2) (x))(2) (1)
h
o(a
a
(x)
=
b
+W h
(3)
(2)
(1)
only need to reach the parameters
(1)
(1)
W
W
W
x
bh(3)
b=(2)g(a
b
(x))
(2)
(1)
(1)(2) (x))(1)
(3)
(2)
(1) h
(x)
=
g(a
a
b +W x
b
b
b
(3)
(1)
(3)
(2)(2) (1)
W
W
W
b(1) b
b (1) x f (x)
(3) (x) = g(a(3) (x))
h
o(a
(3)
(2)
(1)
(3)
(2)
(1)
x
W
W
W
x
(3)
(2)
(1)(2)
(2)
b
h
b =b
(x)
g(a
(x))
k ||W
()
= ki,j i W
W
=
||F
i,j
(k)
(x))
>0
W
<0
i,j
H
+H
j
k
k
k
1
i,j
i,j
()
=
|W
|
(k)k
20
(3)
(3)
(3)
(2) 1 (k)
i,j
i
j
sign(W
)
=
1
(1)
(1)
(1)
(k)
i,j
a=a p(y
=b=
b c|x)
+
W
hp Wi,j <0
(2)
=
+
W
x
W
>0
f
(x)
(3)
(3)
(3)
i,j
c (3)
(k)
(2)W (3) h
(k) p
6
(3)
(2)
(k)
ah(2) (x)
=b
+
= g(a (x))
()
=
2W
(k)
(k)
a r
=
b
+
W
h
W
U
[
b,
b]
b
=
H
h
(x)
k
i,j
p
rW
()
=
sign(W
)
(k)
(2)
(2)
(1)
(3)
(3) + W(2)
Hh
W
k +Hk 1
a
=
b
(k)
h
=
o(a
(x))
6
(t)
(t)
(k)
p
W
U
[
b,
b]
b
=
H
h
(x
x
y
P
P
P
k
i,j
(k)
(2)
(2)
(2)
(1)
(1)
(1) a(2)
(2)
(2)
(1)
H|k +Hk 1
(k)
=
b
+
W
h
()
=
|W
h
(x)
=
g(a
(x))
(3)
(3)
(3)
(2)
sign(W
)k=
=
1W(k) <0
a
=
b
+
W
h
(k) (1)
(2)
(2)1W
(1)
i,j
i(1)
j W
i,j
a
(x)
=
b
+
h
P
W
>0
h
=
g(a
(x))
a
b
+
x
Topics: automatic differentiation
i,j
i,j
l(f
(x),
y)
=
1
log
f
(x)
=
log
f
c
(3)
(3)
(3)
(2)
(y=c)
(1)
(1)
(1)
c
(3)
(2)
(1) a r
(k)
p
a
=
b
+
W
h
=
b
+
W
x
(2)
(2)
(2)
(1)
(1)
(1)
()
=
sign(W
)
(3)
(3)
(k)
b
b
b
(k)
(1)
(1)
(1)
6
W
(k)
a
(x)
=
b
+
W
h
Each object also
h
=
g(a
(x))
has
a
bprop
method
h
=
o(a
(x))
p
W
U
[
b,
b]
b
=
H
h
(x)
a = b + W xi,j
k
Hk +Hk 1
(3) (2)
(3)
(2)
(2)
(1)
(k)
h
=
o(a
(x))
a
=
b
+
W
h
(3)
(2)
(1)
(2)
(1)
it computes the gradient of(3)
(1)
(1)
(1)
the
loss
with
sign(W
)
=
1
1W(k) <0
(2)
(2)
(k)
i,j
(x)
b
b
b
=
b
+
W
x
W
W
W x af (x)
h
=
g(a
(x))
W
>0
i,j
i,j
respect to each parent (3)
(3)
(3)
(3)
(2)
(3)
(2)
a (1)(x) =
b +W
h p
@
h = o(a (x))
(2)
(1)
(1)
(3)
(2)
(1)
h
=
g(a
(x))
(3)
(3)
a
=
b
+
W
x
(1)
(1)
W
W
W
x 6
(k)
h
(x)
=
o(a
(x))
(k)
fprop depends on the fprop of a boxs parents,
h
=
g(a
(x))
p
W
U
[
b,
b]
b
=
H
h
(x)
k
f
(x)
(2)
(2)
(2) H(1)+H
i,j
a
(x)
=
b
+
W
h
while bprop depends the bprop of a boxs children
k
k 1
By
(x) = o(a
(2)
FLOW GRAPH
(1) (3)(2)
(2) h(1)h(2)
(3)g(a(3)
=
(x))
(1)
=
o(a
g(a (x))
h (x)
=
g(a
(x))
b
b(2)(x))
b
b
b
b
(3)
(2)
(1)
h
=
g(a
(x))
h
(x)
=
g(a
(x))
(1)
(1)
we get backpropagation
W
W
W
x
h = g(a (x)) (3)
(3)
(2)
(2) (x))(2) (1)
h
o(a
a
(x)
=
b
+W h
(3)
(2)
(1)
only need to reach the parameters
(1)
(1)
W
W
W
x
bh(3)
b=(2)g(a
b
(x))
(2)
(1)
(1)(2) (x))(1)
(3)
(2)
(1) h
(x)
=
g(a
a
b +W x
b
b
b
(3)
(1)
(3)
(2)(2) (1)
W
W
W
b(1) b
b (1) x f (x)
(3) (x) = g(a(3) (x))
h
o(a
(3)
(2)
(1)
(3)
(2)
(1)
x
W
W
W
x
(3)
(2)
(1)(2)
(2)
b
h
b =b
(x)
g(a
(x))
k ||W
()
= ki,j i W
W
=
||F
i,j
(k)
(x))
>0
W
<0
i,j
H
+H
j
k
k
k
1
i,j
i,j
()
=
|W
|
(k)k
20
(3)
(3)
(3)
(2) 1 (k)
i,j
i
j
sign(W
)
=
1
(1)
(1)
(1)
(k)
i,j
a=a p(y
=b=
b c|x)
+
W
hp Wi,j <0
(2)
=
+
W
x
W
>0
f
(x)
(3)
(3)
(3)
i,j
c (3)
(k)
(2)W (3) h
(k) p
6
(3)
(2)
(k)
ah(2) (x)
=b
+
= g(a (x))
()
=
2W
(k)
(k)
a r
=
b
+
W
h
W
U
[
b,
b]
b
=
H
h
(x)
k
i,j
p
rW
()
=
sign(W
)
(k)
(2)
(2)
(1)
(3)
(3) + W(2)
Hh
W
k +Hk 1
a
=
b
(k)
h
=
o(a
(x))
6
(t)
(t)
(k)
p
W
U
[
b,
b]
b
=
H
h
(x
x
y
P
P
P
k
i,j
(k)
(2)
(2)
(2)
(1)
(1)
(1) a(2)
(2)
(2)
(1)
H|k +Hk 1
(k)
=
b
+
W
h
()
=
|W
h
(x)
=
g(a
(x))
(3)
(3)
(3)
(2)
sign(W
)k=
=
1W(k) <0
a
=
b
+
W
h
(k) (1)
(2)
(2)1W
(1)
i,j
i(1)
j W
i,j
a
(x)
=
b
+
h
P
W
>0
h
=
g(a
(x))
a
b
+
x
Topics: automatic differentiation
i,j
i,j
l(f
(x),
y)
=
1
log
f
(x)
=
log
f
c
(3)
(3)
(3)
(2)
(y=c)
(1)
(1)
(1)
c
(3)
(2)
(1) a r
(k)
p
a
=
b
+
W
h
=
b
+
W
x
(2)
(2)
(2)
(1)
(1)
(1)
()
=
sign(W
)
(3)
(3)
(k)
b
b
b
(k)
(1)
(1)
(1)
6
W
(k)
a
(x)
=
b
+
W
h
Each object also
h
=
g(a
(x))
has
a
bprop
method
h
=
o(a
(x))
p
W
U
[
b,
b]
b
=
H
h
(x)
a = b + W xi,j
k
Hk +Hk 1
(3) (2)
(3)
(2)
(2)
(1)
(k)
h
=
o(a
(x))
a
=
b
+
W
h
(3)
(2)
(1)
(2)
(1)
it computes the gradient of(3)
(1)
(1)
(1)
the
loss
with
sign(W
)
=
1
1W(k) <0
(2)
(2)
(k)
i,j
(x)
b
b
b
=
b
+
W
x
W
W
W x af (x)
h
=
g(a
(x))
W
>0
i,j
i,j
respect to each parent (3)
(3)
(3)
(3)
(2)
(3)
(2)
a (1)(x) =
b +W
h p
@
h = o(a (x))
(2)
(1)
(1)
(3)
(2)
(1)
h
=
g(a
(x))
(3)
(3)
a
=
b
+
W
x
(1)
(1)
W
W
W
x 6
(k)
h
(x)
=
o(a
(x))
(k)
fprop depends on the fprop of a boxs parents,
h
=
g(a
(x))
p
W
U
[
b,
b]
b
=
H
h
(x)
k
f
(x)
(2)
(2)
(2) H(1)+H
i,j
a
(x)
=
b
+
W
h
while bprop depends the bprop of a boxs children
k
k 1
By
(x) = o(a
(2)
FLOW GRAPH
(1) (3)(2)
(2) h(1)h(2)
(3)g(a(3)
=
(x))
(1)
=
o(a
g(a (x))
h (x)
=
g(a
(x))
b
b(2)(x))
b
b
b
b
(3)
(2)
(1)
h
=
g(a
(x))
h
(x)
=
g(a
(x))
(1)
(1)
we get backpropagation
W
W
W
x
h = g(a (x)) (3)
(3)
(2)
(2) (x))(2) (1)
h
o(a
a
(x)
=
b
+W h
(3)
(2)
(1)
only need to reach the parameters
(1)
(1)
W
W
W
x
bh(3)
b=(2)g(a
b
(x))
(2)
(1)
(1)(2) (x))(1)
(3)
(2)
(1) h
(x)
=
g(a
a
b +W x
b
b
b
(3)
(1)
(3)
(2)(2) (1)
W
W
W
b(1) b
b (1) x f (x)
(3) (x) = g(a(3) (x))
h
o(a
(3)
(2)
(1)
(3)
(2)
(1)
x
W
W
W
x
(3)
(2)
(1)(2)
(2)
b
h
b =b
(x)
g(a
(x))
k ||W
()
= ki,j i W
W
=
||F
i,j
(k)
(x))
>0
W
<0
i,j
H
+H
j
k
k
k
1
i,j
i,j
()
=
|W
|
(k)k
20
(3)
(3)
(3)
(2) 1 (k)
i,j
i
j
sign(W
)
=
1
(1)
(1)
(1)
(k)
i,j
a=a p(y
=b=
b c|x)
+
W
hp Wi,j <0
(2)
=
+
W
x
W
>0
f
(x)
(3)
(3)
(3)
i,j
c (3)
(k)
(2)W (3) h
(k) p
6
(3)
(2)
(k)
ah(2) (x)
=b
+
= g(a (x))
()
=
2W
(k)
(k)
a r
=
b
+
W
h
W
U
[
b,
b]
b
=
H
h
(x)
k
i,j
p
rW
()
=
sign(W
)
(k)
(2)
(2)
(1)
(3)
(3) + W(2)
Hh
W
k +Hk 1
a
=
b
(k)
h
=
o(a
(x))
6
(t)
(t)
(k)
p
W
U
[
b,
b]
b
=
H
h
(x
x
y
P
P
P
k
i,j
(k)
(2)
(2)
(2)
(1)
(1)
(1) a(2)
(2)
(2)
(1)
H|k +Hk 1
(k)
=
b
+
W
h
()
=
|W
h
(x)
=
g(a
(x))
(3)
(3)
(3)
(2)
sign(W
)k=
=
1W(k) <0
a
=
b
+
W
h
(k) (1)
(2)
(2)1W
(1)
i,j
i(1)
j W
i,j
a
(x)
=
b
+
h
P
W
>0
h
=
g(a
(x))
a
b
+
x
Topics: automatic differentiation
i,j
i,j
l(f
(x),
y)
=
1
log
f
(x)
=
log
f
c
(3)
(3)
(3)
(2)
(y=c)
(1)
(1)
(1)
c
(3)
(2)
(1) a r
(k)
p
a
=
b
+
W
h
=
b
+
W
x
(2)
(2)
(2)
(1)
(1)
(1)
()
=
sign(W
)
(3)
(3)
(k)
b
b
b
(k)
(1)
(1)
(1)
6
W
(k)
a
(x)
=
b
+
W
h
Each object also
h
=
g(a
(x))
has
a
bprop
method
h
=
o(a
(x))
p
W
U
[
b,
b]
b
=
H
h
(x)
a = b + W xi,j
k
Hk +Hk 1
(3) (2)
(3)
(2)
(2)
(1)
(k)
h
=
o(a
(x))
a
=
b
+
W
h
(3)
(2)
(1)
(2)
(1)
it computes the gradient of(3)
(1)
(1)
(1)
the
loss
with
sign(W
)
=
1
1W(k) <0
(2)
(2)
(k)
i,j
(x)
b
b
b
=
b
+
W
x
W
W
W x af (x)
h
=
g(a
(x))
W
>0
i,j
i,j
respect to each parent (3)
(3)
(3)
(3)
(2)
(3)
(2)
a (1)(x) =
b +W
h p
@
h = o(a (x))
(2)
(1)
(1)
(3)
(2)
(1)
h
=
g(a
(x))
(3)
(3)
a
=
b
+
W
x
(1)
(1)
W
W
W
x 6
(k)
h
(x)
=
o(a
(x))
(k)
fprop depends on the fprop of a boxs parents,
h
=
g(a
(x))
p
W
U
[
b,
b]
b
=
H
h
(x)
k
f
(x)
(2)
(2)
(2) H(1)+H
i,j
a
(x)
=
b
+
W
h
while bprop depends the bprop of a boxs children
k
k 1
By
(x) = o(a
(2)
FLOW GRAPH
(1) (3)(2)
(2) h(1)h(2)
(3)g(a(3)
=
(x))
(1)
=
o(a
g(a (x))
h (x)
=
g(a
(x))
b
b(2)(x))
b
b
b
b
(3)
(2)
(1)
h
=
g(a
(x))
h
(x)
=
g(a
(x))
(1)
(1)
we get backpropagation
W
W
W
x
h = g(a (x)) (3)
(3)
(2)
(2) (x))(2) (1)
h
o(a
a
(x)
=
b
+W h
(3)
(2)
(1)
only need to reach the parameters
(1)
(1)
W
W
W
x
bh(3)
b=(2)g(a
b
(x))
(2)
(1)
(1)(2) (x))(1)
(3)
(2)
(1) h
(x)
=
g(a
a
b +W x
b
b
b
(3)
(1)
(3)
(2)(2) (1)
W
W
W
b(1) b
b (1) x f (x)
(3) (x) = g(a(3) (x))
h
o(a
(3)
(2)
(1)
(3)
(2)
(1)
x
W
W
W
x
(3)
(2)
(1)(2)
(2)
b
h
b =b
(x)
g(a
(x))
k ||W
()
= ki,j i W
W
=
||F
i,j
(k)
(x))
>0
W
<0
i,j
H
+H
j
k
k
k
1
i,j
i,j
()
=
|W
|
(k)k
20
(3)
(3)
(3)
(2) 1 (k)
i,j
i
j
sign(W
)
=
1
(1)
(1)
(1)
(k)
i,j
a=a p(y
=b=
b c|x)
+
W
hp Wi,j <0
(2)
=
+
W
x
W
>0
f
(x)
(3)
(3)
(3)
i,j
c (3)
(k)
(2)W (3) h
(k) p
6
(3)
(2)
(k)
ah(2) (x)
=b
+
= g(a (x))
()
=
2W
(k)
(k)
a r
=
b
+
W
h
W
U
[
b,
b]
b
=
H
h
(x)
k
i,j
p
rW
()
=
sign(W
)
(k)
(2)
(2)
(1)
(3)
(3) + W(2)
Hh
W
k +Hk 1
a
=
b
(k)
h
=
o(a
(x))
6
(t)
(t)
(k)
p
W
U
[
b,
b]
b
=
H
h
(x
x
y
P
P
P
k
i,j
(k)
(2)
(2)
(2)
(1)
(1)
(1) a(2)
(2)
(2)
(1)
H|k +Hk 1
(k)
=
b
+
W
h
()
=
|W
h
(x)
=
g(a
(x))
(3)
(3)
(3)
(2)
sign(W
)k=
=
1W(k) <0
a
=
b
+
W
h
(k) (1)
(2)
(2)1W
(1)
i,j
i(1)
j W
i,j
a
(x)
=
b
+
h
P
W
>0
h
=
g(a
(x))
a
b
+
x
Topics: automatic differentiation
i,j
i,j
l(f
(x),
y)
=
1
log
f
(x)
=
log
f
c
(3)
(3)
(3)
(2)
(y=c)
(1)
(1)
(1)
c
(3)
(2)
(1) a r
(k)
p
a
=
b
+
W
h
=
b
+
W
x
(2)
(2)
(2)
(1)
(1)
(1)
()
=
sign(W
)
(3)
(3)
(k)
b
b
b
(k)
(1)
(1)
(1)
6
W
(k)
a
(x)
=
b
+
W
h
Each object also
h
=
g(a
(x))
has
a
bprop
method
h
=
o(a
(x))
p
W
U
[
b,
b]
b
=
H
h
(x)
a = b + W xi,j
k
Hk +Hk 1
(3) (2)
(3)
(2)
(2)
(1)
(k)
h
=
o(a
(x))
a
=
b
+
W
h
(3)
(2)
(1)
(2)
(1)
it computes the gradient of(3)
(1)
(1)
(1)
the
loss
with
sign(W
)
=
1
1W(k) <0
(2)
(2)
(k)
i,j
(x)
b
b
b
=
b
+
W
x
W
W
W x af (x)
h
=
g(a
(x))
W
>0
i,j
i,j
respect to each parent (3)
(3)
(3)
(3)
(2)
(3)
(2)
a (1)(x) =
b +W
h p
@
h = o(a (x))
(2)
(1)
(1)
(3)
(2)
(1)
h
=
g(a
(x))
(3)
(3)
a
=
b
+
W
x
(1)
(1)
W
W
W
x 6
(k)
h
(x)
=
o(a
(x))
(k)
fprop depends on the fprop of a boxs parents,
h
=
g(a
(x))
p
W
U
[
b,
b]
b
=
H
h
(x)
k
f
(x)
(2)
(2)
(2) H(1)+H
i,j
a
FLOW GRAPH
Topics: automatic differentiation
Forward propagation can be represented as an acyclic flow graph of boxes computing
- a^(k)(x) = b^(k) + W^(k) h^(k-1)(x) (with h^(0)(x) = x)
- h^(k)(x) = g(a^(k)(x))
- h^(3)(x) = o(a^(3)(x)) = f(x)
with loss l(f(x), y) = - sum_c 1_(y=c) log f(x)_c = - log f(x)_y.
Each box is an object with an fprop method, computing its value given its parents. Each object also has a bprop method:
- it computes the gradient of the loss with respect to each parent
- fprop depends on the fprop of a box's parents, while bprop depends on the bprop of a box's children
By calling bprop in the reverse order of fprop, we get backpropagation:
- we only need to reach the parameters
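The flow-graph view above can be sketched in code. This is a minimal illustration, not the course's implementation: class and variable names are mine, and a toy squared-error gradient stands in for the loss box.

```python
# Minimal flow-graph sketch: each "box" has fprop (value given its parents)
# and bprop (gradient of the loss with respect to each parent).
import numpy as np

class Linear:
    def __init__(self, W, b):
        self.W, self.b = W, b
    def fprop(self, h):            # a = b + W h
        self.h = h
        return self.W @ h + self.b
    def bprop(self, grad_a):       # store parameter gradients, pass grad to parent
        self.grad_W = np.outer(grad_a, self.h)
        self.grad_b = grad_a
        return self.W.T @ grad_a

class ReLU:
    def fprop(self, a):
        self.a = a
        return np.maximum(0, a)
    def bprop(self, grad_h):
        return grad_h * (self.a > 0)

rng = np.random.default_rng(0)
boxes = [Linear(rng.normal(size=(3, 4)), np.zeros(3)), ReLU(),
         Linear(rng.normal(size=(2, 3)), np.zeros(2))]

out = rng.normal(size=4)
for box in boxes:                  # fprop in topological order
    out = box.fprop(out)
grad = out - np.array([1.0, 0.0])  # toy squared-error loss gradient
for box in reversed(boxes):        # bprop in reverse order: backpropagation
    grad = box.bprop(grad)
```

Calling bprop in reverse order is exactly the "reverse order of fprop" point above; the parameter gradients are collected along the way.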
REGULARIZATION
Topics: L2 regularization
- Omega(theta) = sum_k sum_i sum_j (W^(k)_{i,j})^2 = sum_k ||W^(k)||_F^2
- Gradient: grad_{W^(k)} Omega(theta) = 2 W^(k)
- Only applied on weights, not on biases (weight decay)
- Can be interpreted as having a Gaussian prior over the weights
L1 regularization: Omega(theta) = sum_k sum_i sum_j |W^(k)_{i,j}|, with gradient grad_{W^(k)} Omega(theta) = sign(W^(k)), where sign(W^(k))_{i,j} = 1_{W^(k)_{i,j} > 0} - 1_{W^(k)_{i,j} < 0}
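The two regularizer gradients above are one-liners in numpy; a quick sketch on a random weight matrix (numpy assumed, not part of the slides):

```python
# L2 and L1 regularizers and their gradients for one weight matrix.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))

omega_l2 = np.sum(W ** 2)      # Omega(theta) = ||W||_F^2
grad_l2 = 2 * W                # gradient: 2 W (weight decay direction)

omega_l1 = np.sum(np.abs(W))   # Omega(theta) = sum |W_ij|
grad_l1 = np.sign(W)           # gradient: sign(W), pushes weights to exactly 0
```

In a training loop these gradients are simply added (scaled by the regularization strength) to the loss gradient of each weight matrix, and not to the bias gradients.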
INITIALIZATION
Topics: initialization
- For biases: initialize all to 0
- For weights:
- can't initialize weights to 0 with tanh activation: we can show that all gradients would then be 0 (saddle point)
- can't initialize all weights to the same value: we can show that all hidden units in a layer will always behave the same
- need to break symmetry
- Recipe: sample W^(k)_{i,j} from U[-b, b], where b = sqrt(6) / sqrt(H_k + H_{k-1}) and H_k is the size of h^(k)(x)
- the idea is to sample around 0 but break symmetry
- other values of b could work well (not an exact science) (see Glorot & Bengio, 2010)
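The recipe above is a few lines of numpy. A sketch (function name is mine; numpy assumed):

```python
# Initialize one layer: biases at 0, weights ~ U[-b, b] with
# b = sqrt(6) / sqrt(H_k + H_{k-1}), which breaks symmetry around 0.
import numpy as np

def init_layer(h_k, h_km1, rng):
    b = np.sqrt(6.0) / np.sqrt(h_k + h_km1)    # bound from the two layer sizes
    W = rng.uniform(-b, b, size=(h_k, h_km1))  # centred on 0, all units differ
    bias = np.zeros(h_k)                       # biases: all zeros is fine
    return W, bias

rng = np.random.default_rng(0)
W1, b1 = init_layer(100, 50, rng)
```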
MODEL SELECTION
Topics: grid search, random search
To do a random search, specify a distribution over the values of each hyper-parameter (e.g. uniform in some range) and sample independent configurations. Use a validation set to estimate each configuration's generalization performance. You can then keep the configuration with the best validation performance.
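A minimal random-search sketch. `train_and_validate` is a hypothetical stand-in for training a model under a configuration and returning its validation error; the hyper-parameter names and ranges are illustrative, not from the slides:

```python
# Random search: sample configurations from per-hyper-parameter distributions,
# keep the one with the lowest validation error.
import random

def random_search(train_and_validate, n_trials=20, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        config = {
            "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform
            "n_hidden": rng.choice([100, 200, 500]),
            "dropout": rng.uniform(0.0, 0.5),
        }
        err = train_and_validate(config)
        if best is None or err < best[0]:
            best = (err, config)
    return best

# toy "validation error" with a minimum near learning_rate = 0.01
best_err, best_config = random_search(lambda c: (c["learning_rate"] - 0.01) ** 2)
```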
[Figure: classification error vs. number of epochs (y-axis 0.0 to 0.5) — validation error first decreases (underfitting) and then increases (overfitting).]
Decaying the learning rate: as we get closer to the optimum, it makes sense to take smaller update steps: (i) start with a large learning rate, (ii) maintain it until the validation error stops improving, (iii) divide the learning rate by a constant and continue.
Topics: mini-batch, momentum
- Can update based on a mini-batch of examples (instead of 1 example):
- the gradient is the average regularized loss for that mini-batch
- can give a more accurate estimate of the risk gradient
- For convergence, the learning rates alpha_t should satisfy sum_t alpha_t = infinity and sum_t alpha_t^2 < infinity, e.g. alpha_t = alpha / (1 + delta t), or alpha_t = alpha t^(-delta) with 0.5 < delta <= 1
- Can use an exponential average of previous gradients:
- grad_bar^(t) = grad_theta l(f(x^(t)), y^(t)) + beta grad_bar^(t-1)
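The decaying learning rate and the exponential average of past gradients (momentum) combine into a short update loop. A sketch on a toy quadratic loss (names and constants are illustrative):

```python
# SGD step with momentum accumulator and learning rate alpha/(1 + delta*t).
import numpy as np

def sgd_momentum(theta, grad_fn, n_steps=200, alpha=0.1, delta=0.01, beta=0.5):
    avg = np.zeros_like(theta)
    for t in range(n_steps):
        g = grad_fn(theta)
        avg = g + beta * avg                        # exponential average of gradients
        theta = theta - alpha / (1 + delta * t) * avg
    return theta

# minimize f(theta) = theta^2, whose gradient is 2*theta
theta_final = sgd_momentum(np.array([5.0]), lambda th: 2 * th)
```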
Adagrad: learning rates are scaled by the square root of the cumulative sum of squared gradients:
- gamma^(t) = gamma^(t-1) + (grad_theta l(f(x^(t)), y^(t)))^2
A variant uses an exponential moving average of squared gradients instead (RMSProp):
- gamma^(t) = beta gamma^(t-1) + (1 - beta) (grad_theta l(f(x^(t)), y^(t)))^2
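Both accumulators above fit in a few lines; the actual parameter step divides by the square root of the accumulator. A sketch (step sizes and names are illustrative):

```python
# Per-parameter learning-rate scaling: Adagrad (cumulative sum of squared
# gradients) and the exponential-moving-average variant.
import numpy as np

def adagrad_step(theta, g, accum, alpha=0.5, eps=1e-8):
    accum = accum + g ** 2                        # gamma^(t) = gamma^(t-1) + g^2
    return theta - alpha * g / (np.sqrt(accum) + eps), accum

def rmsprop_step(theta, g, accum, alpha=0.01, beta=0.9, eps=1e-8):
    accum = beta * accum + (1 - beta) * g ** 2    # exponential moving average
    return theta - alpha * g / (np.sqrt(accum) + eps), accum

theta, accum = np.array([5.0]), np.zeros(1)
for _ in range(50):                               # minimize theta^2
    theta, accum = adagrad_step(theta, 2 * theta, accum)
```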
GRADIENT CHECKING
Topics: finite difference approximation
To debug your implementation of fprop/bprop, you can compare with a finite-difference approximation of the gradient:
df(x)/dx ~= (f(x + epsilon) - f(x - epsilon)) / (2 epsilon)
where
- f(x) would be the loss
- x would be a parameter
- f(x + epsilon) would be the loss if you add epsilon to the parameter
- f(x - epsilon) would be the loss if you subtract epsilon from the parameter
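The check above, coordinate by coordinate, on a toy loss whose analytic gradient is known (2x); only numpy is assumed:

```python
# Central finite-difference approximation of a gradient, used to verify bprop.
import numpy as np

def finite_difference_grad(f, x, eps=1e-5):
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps                                  # perturb one parameter
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps) # central difference
    return grad

loss = lambda x: np.sum(x ** 2)   # toy "loss"; analytic gradient is 2x
x = np.array([1.0, -2.0, 0.5])
assert np.allclose(finite_difference_grad(loss, x), 2 * x, atol=1e-6)
```

In practice you would run this on a few randomly chosen parameters of each layer and compare against the gradients your bprop produced.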
DEEP LEARNING
Topics: inspiration from visual cortex (the human visual system)
Why not take inspiration from the brain to do vision?
The visual cortex processes images hierarchically: edges, then object parts (nose, mouth, eyes), then whole objects (face).
DEEP LEARNING
Topics: theoretical justification
A deep architecture can represent some functions much more compactly than a shallow one.
Example: Boolean functions
- a Boolean circuit is a sort of feed-forward network where hidden units are logic gates (i.e. AND, OR or NOT functions of their arguments)
- any Boolean function can be represented by a single hidden layer Boolean circuit, but some functions only require a polynomial number of hidden units if we can adapt the number of layers
See "Exploring Strategies for Training Deep Neural Networks" for a discussion.
DEEP LEARNING
Topics: success story: speech recognition
DEEP LEARNING
Topics: success story: computer vision
Hugo Larochelle
Département d'informatique, Université de Sherbrooke
hugo.larochelle@usherbrooke.ca
DEEP LEARNING
Topics: why training is hard
First hypothesis: optimization is harder (underfitting)
- this is a well known problem in recurrent neural networks
Second hypothesis: overfitting
- deep nets usually have lots of parameters
- might be in a high variance / low bias situation
[Figure: the set of possible functions f around the training data (x^(t), y^(t)) — low variance / high bias, good trade-off, high variance / low bias.]
DEEP LEARNING
Topics: why training is hard
Depending on the problem, one or the other situation will tend to dominate:
- if optimization is the problem: use GPUs
- if overfitting is the problem: use unsupervised pre-training
DROPOUT
Topics: dropout
Idea: cripple the neural network by removing hidden units stochastically:
- each hidden unit is set to 0 with probability 0.5
- hidden units cannot co-adapt to other units
- hidden units must be more generally useful
Could use a different dropout probability, but 0.5 usually works well.
DROPOUT
Topics: dropout
Use random binary masks m^(k):
- layer pre-activation (for k > 0): a^(k)(x) = b^(k) + W^(k) h^(k-1)(x) (with h^(0)(x) = x)
- hidden layer activation (k from 1 to L): h^(k)(x) = g(a^(k)(x)) * m^(k)
- output layer activation (k = L+1): h^(L+1)(x) = o(a^(L+1)(x)) = f(x)
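The masked forward pass above can be sketched directly. This is an illustration only (names are mine, tanh and softmax stand in for g and o, p = 0.5), not the course's code:

```python
# Forward propagation with dropout masks: mask applied after the nonlinearity
# on hidden layers; the output (softmax) layer is not masked.
import numpy as np

def fprop_dropout(x, weights, biases, rng, p=0.5):
    h = x                                      # h^(0)(x) = x
    for k, (W, b) in enumerate(zip(weights, biases)):
        a = b + W @ h                          # pre-activation a^(k)(x)
        if k < len(weights) - 1:               # hidden layers 1..L
            m = rng.random(a.shape) > p        # binary mask m^(k), zeros w.p. p
            h = np.tanh(a) * m                 # h^(k) = g(a^(k)) * m^(k)
        else:                                  # output layer (k = L+1)
            e = np.exp(a - a.max())            # stable softmax, no mask
            h = e / e.sum()
    return h

rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 4)), rng.normal(size=(3, 5))]
biases = [np.zeros(5), np.zeros(3)]
probs = fprop_dropout(rng.normal(size=4), weights, biases, rng)
```

A fresh mask is drawn per training example; at test time the masks are dropped (with an appropriate rescaling of the activations).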
DROPOUT
Topics: backpropagation
- This assumes a forward propagation has already been made, computing every a^(k)(x) and h^(k)(x)
- compute output gradient (before activation):
  ∇_{a^(L+1)(x)} -log f(x)_y  ⟸  -(e(y) - f(x))
- for k from L+1 to 1:
  - compute gradients of hidden layer parameters:
    ∇_{W^(k)} -log f(x)_y  ⟸  (∇_{a^(k)(x)} -log f(x)_y) h^(k-1)(x)^T
    ∇_{b^(k)} -log f(x)_y  ⟸  ∇_{a^(k)(x)} -log f(x)_y
  - compute gradient of hidden layer below:
    ∇_{h^(k-1)(x)} -log f(x)_y  ⟸  W^(k)^T (∇_{a^(k)(x)} -log f(x)_y)
  - compute gradient of hidden layer below (before activation):
    ∇_{a^(k-1)(x)} -log f(x)_y  ⟸  (∇_{h^(k-1)(x)} -log f(x)_y) ⊙ [..., g'(a^(k-1)(x)_j), ...]
41
DROPOUT
Topics: backpropagation
- with dropout, the forward propagation includes the binary masks m^(k):
  h^(k)(x) = g(a^(k)(x)) ⊙ m^(k)
  h^(L+1)(x) = o(a^(L+1)(x)) = f(x)
- the backpropagation is as before, except that the mask also appears in the gradient before activation:
  ∇_{a^(k-1)(x)} -log f(x)_y  ⟸  (∇_{h^(k-1)(x)} -log f(x)_y) ⊙ [..., g'(a^(k-1)(x)_j), ...] ⊙ m^(k-1)
41
DROPOUT
Topics: test time classification
- At test time, we replace the masks by their expectation
- for a single hidden layer, can show this is equivalent to taking the geometric average of all neural networks, with all possible binary masks
- Beats regular backpropagation on many datasets
42
DEEP LEARNING
Topics: why training is hard
- Depending on the problem, one of two situations will tend to dominate:
  - If the problem is one of optimization (underfitting): use better optimization, e.g. train on GPUs
  - If the problem is one of regularization (overfitting): use better regularization, e.g. unsupervised pre-training
- Batch normalization is another way of making training easier
43
BATCH NORMALIZATION
Topics: batch normalization
- Normalizing the inputs is known to speed up training (Lecun et al., 1998)
- Could normalization also be useful at the level of the hidden layers?
- Batch normalization (Ioffe and Szegedy, 2014) is an attempt to do that
44
BATCH NORMALIZATION
Topics: batch normalization
- We refer to the transform BN_{γ,β} : x_1...m → y_1...m as the Batch Normalizing Transform. In the algorithm, ε is a constant added to the mini-batch variance for numerical stability.

  Input: values of x over a mini-batch: B = {x_1...m}; parameters to be learned: γ, β
  Output: {y_i = BN_{γ,β}(x_i)}

  μ_B ← (1/m) Σ_{i=1..m} x_i             // mini-batch mean
  σ²_B ← (1/m) Σ_{i=1..m} (x_i - μ_B)²   // mini-batch variance
  x̂_i ← (x_i - μ_B) / √(σ²_B + ε)        // normalize
  y_i ← γ x̂_i + β ≡ BN_{γ,β}(x_i)        // scale and shift
45
UNSUPERVISED PRE-TRAINING
Topics: unsupervised pre-training
- Solution: initialize the hidden layers using unsupervised learning
- Why is one a character and the other is not? (character image vs. random image)
47
(Math for my slides: Autoencoders. Hugo Larochelle, Département d'informatique, Université de Sherbrooke, October 17, 2012)

AUTOENCODER
Topics: autoencoder, encoder, decoder, tied weights
- Feed-forward neural network trained to reproduce its input at the output layer
- Encoder:
  h(x) = g(a(x)) = sigm(b + Wx)
- Decoder:
  x̂ = o(â(x)) = sigm(c + W* h(x))
  with tied weights: W* = W^T
- loss for real-valued inputs:
  l(f(x)) = ½ Σ_k (x̂_k - x_k)²
- loss for binary inputs:
  l(f(x)) = -Σ_k ( x_k log(x̂_k) + (1 - x_k) log(1 - x̂_k) )
49
UNSUPERVISED PRE-TRAINING
Topics: unsupervised pre-training
- We will use a greedy, layer-wise procedure:
  - train one layer at a time, from first to last, with an unsupervised criterion
  - fix the parameters of the previous hidden layers
  - previous layers can be viewed as feature extraction
51
FINE-TUNING
Topics: fine-tuning
- Once all layers are pre-trained, the whole network is trained using supervised learning
- Supervised learning is performed as in a regular feed-forward network
- We call this last phase fine-tuning:
  - all parameters are "tuned" for the supervised task at hand
  - representation is adjusted to be more discriminative
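As a sketch under stated assumptions (tanh hidden units, softmax output, an illustrative learning rate; the parameter lists `Ws`, `bs` are hypothetical, standing for the pre-trained layers plus a freshly initialized output layer), one fine-tuning update could look like:

```python
import numpy as np

def fine_tune_step(x, y, Ws, bs, lr=0.1):
    """One supervised SGD step on pre-trained parameters (fine-tuning).

    Ws, bs are updated in place; returns the loss -log f(x)_y
    measured before the update.
    """
    # forward propagation
    hs, pre = [x], []
    for W, b in zip(Ws[:-1], bs[:-1]):
        a = b + W @ hs[-1]
        pre.append(a)
        hs.append(np.tanh(a))
    a_out = bs[-1] + Ws[-1] @ hs[-1]
    e = np.exp(a_out - a_out.max())
    f = e / e.sum()

    # backpropagation of -log f(x)_y, with parameter updates
    grad_a = f.copy()
    grad_a[y] -= 1.0
    for k in reversed(range(len(Ws))):
        gW, gb = np.outer(grad_a, hs[k]), grad_a
        if k > 0:
            # propagate through the old weights before updating them
            grad_a = (Ws[k].T @ grad_a) * (1 - np.tanh(pre[k - 1]) ** 2)
        Ws[k] -= lr * gW
        bs[k] -= lr * gb
    return -np.log(f[y])
```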
http://info.usherbrooke.ca/hlarochelle/neural_networks
52
THANK YOU!