IE 643
Lecture 6
September 1, 2020.
Perceptron - Caveat
Not suitable when the linear separability assumption fails.
Example: Classical XOR problem

x1  x2  y = x1 ⊕ x2   ŷ = sign(w1 x1 + w2 x2 − θ)
0   0   -1            sign(−θ)
0   1   +1            sign(w2 − θ)
1   0   +1            sign(w1 − θ)
1   1   -1            sign(w1 + w2 − θ)
For ŷ to match y on all four inputs, we need:
sign(−θ) = −1 =⇒ θ > 0
sign(w2 − θ) = 1 =⇒ w2 − θ ≥ 0
sign(w1 − θ) = 1 =⇒ w1 − θ ≥ 0
sign(w1 + w2 − θ) = −1 =⇒ −w1 − w2 + θ > 0
These four conditions cannot hold simultaneously: adding the second and third gives w1 + w2 ≥ 2θ > θ (using θ > 0), which contradicts the fourth condition w1 + w2 < θ. Hence no choice of (w1, w2, θ) makes the perceptron reproduce XOR; the data are not linearly separable.
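This can also be checked numerically. The sketch below (an illustration, not part of the lecture; the grid range and resolution are arbitrary choices) scans candidate values of w1, w2 and θ and confirms that no combination reproduces all four XOR labels with ŷ = sign(w1 x1 + w2 x2 − θ):

import numpy as np

# XOR inputs and labels in {-1, +1}, as in the table above.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, 1, 1, -1])

def predict(w1, w2, theta):
    # Perceptron rule yhat = sign(w1*x1 + w2*x2 - theta), treating sign(0) as +1.
    z = w1 * X[:, 0] + w2 * X[:, 1] - theta
    return np.where(z >= 0, 1, -1)

# Scan a coarse grid of candidate parameters.
grid = np.linspace(-2, 2, 41)
found = any(
    np.array_equal(predict(w1, w2, theta), y)
    for w1 in grid for w2 in grid for theta in grid
)
print("separating perceptron found:", found)  # prints False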
Idea: The separating surface need not be linear and can be assumed to take some non-linear form.
Hence for an input space X and output space Y, the learned map h : X → Y can take some non-linear form.
This forms the core idea behind kernel methods. (Will not be pursued in this course!)
Some notations
n_k^ℓ denotes the k-th neuron at layer ℓ.
a_k^ℓ denotes the activation of the neuron n_k^ℓ.
Moving on from Perceptron

x1  x2  a_1^1                 a_2^1                 ŷ                                   y
0   0   max{b1, 0}            max{b2, 0}            sign(t a_1^1 + u a_2^1 + b3)        -1
0   1   max{q + b1, 0}        max{s + b2, 0}        sign(t a_1^1 + u a_2^1 + b3)        +1
1   0   max{p + b1, 0}        max{r + b2, 0}        sign(t a_1^1 + u a_2^1 + b3)        +1
1   1   max{p + q + b1, 0}    max{r + s + b2, 0}    sign(t a_1^1 + u a_2^1 + b3)        -1
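The table reads a_1^1 = max{p x1 + q x2 + b1, 0} and a_2^1 = max{r x1 + s x2 + b2, 0}. The lecture leaves p, q, r, s, t, u and the biases unspecified; the sketch below uses one hypothetical choice (p = q = r = s = 1, b1 = 0, b2 = −1, t = 1, u = −2, b3 = −0.5) that happens to fit XOR, just to verify the table row by row:

# Hypothetical weights (not given in the lecture) that make this small network fit XOR.
p, q, b1 = 1.0, 1.0, 0.0     # first hidden neuron:  a1 = max(p*x1 + q*x2 + b1, 0)
r, s, b2 = 1.0, 1.0, -1.0    # second hidden neuron: a2 = max(r*x1 + s*x2 + b2, 0)
t, u, b3 = 1.0, -2.0, -0.5   # output: yhat = sign(t*a1 + u*a2 + b3)

for (x1, x2), y in zip([(0, 0), (0, 1), (1, 0), (1, 1)], [-1, 1, 1, -1]):
    a1 = max(p * x1 + q * x2 + b1, 0.0)
    a2 = max(r * x1 + s * x2 + b2, 0.0)
    yhat = 1 if t * a1 + u * a2 + b3 >= 0 else -1
    print(x1, x2, a1, a2, yhat, y)   # yhat matches y on all four rows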
Notable features:
Multiple layers stacked together.
Zero-th layer usually called the input layer.
Final layer usually called the output layer.
Intermediate layers are called hidden layers.
Each neuron in the hidden and output layers is like a perceptron.
However, unlike the perceptron, different activation functions are used.
max{x, 0} has a special name: ReLU (Rectified Linear Unit).
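For concreteness, a one-line ReLU in NumPy (a standard elementwise definition matching max{x, 0} above):

import numpy as np

def relu(x):
    # Rectified Linear Unit: elementwise max{x, 0}.
    return np.maximum(x, 0)

print(relu(np.array([-2.0, 0.0, 3.5])))  # [0.  0.  3.5]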
Recall:
n_k^ℓ denotes the k-th neuron at the ℓ-th layer.
a_k^ℓ denotes the activation of neuron n_k^ℓ.
At layer L_1:
At neuron n_1^1: a_1^1 = φ(w_{11}^1 x_1 + w_{12}^1 x_2) =: φ(z_1^1).
At neuron n_2^1: a_2^1 = φ(w_{21}^1 x_1 + w_{22}^1 x_2) =: φ(z_2^1).
In vector form,
(a_1^1, a_2^1)^T = (φ(z_1^1), φ(z_2^1))^T = (φ(w_{11}^1 x_1 + w_{12}^1 x_2), φ(w_{21}^1 x_1 + w_{22}^1 x_2))^T.
Letting W^1 = [w_{11}^1  w_{12}^1 ; w_{21}^1  w_{22}^1] and x = (x_1, x_2)^T, we have at layer L_1:
(a_1^1, a_2^1)^T = φ((z_1^1, z_2^1)^T) = φ(W^1 x).
Letting a^1 = (a_1^1, a_2^1)^T, we have at layer L_1:
a^1 = φ(W^1 x).
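The vectorized rule a^1 = φ(W^1 x) translates directly to code. A minimal sketch, using ReLU for φ and illustrative weights (the lecture does not fix specific values for W^1 or x):

import numpy as np

def relu(z):
    # phi(z) = max{z, 0}, applied elementwise.
    return np.maximum(z, 0)

# Illustrative weight matrix W^1 and input x.
W1 = np.array([[1.0,  1.0],
               [1.0, -1.0]])
x = np.array([0.5, 1.0])

z1 = W1 @ x      # pre-activations z^1 = W^1 x
a1 = relu(z1)    # activations    a^1 = phi(z^1)
print(z1, a1)    # [ 1.5 -0.5] [1.5 0. ]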
At layer L_2:
At neuron n_1^2: a_1^2 = φ(w_{11}^2 a_1^1 + w_{12}^2 a_2^1) =: φ(z_1^2).
At neuron n_2^2: a_2^2 = φ(w_{21}^2 a_1^1 + w_{22}^2 a_2^1) =: φ(z_2^2).
In vector form,
a^2 = (a_1^2, a_2^2)^T = (φ(z_1^2), φ(z_2^2))^T = (φ(w_{11}^2 a_1^1 + w_{12}^2 a_2^1), φ(w_{21}^2 a_1^1 + w_{22}^2 a_2^1))^T.
Letting W^2 = [w_{11}^2  w_{12}^2 ; w_{21}^2  w_{22}^2], we have at layer L_2:
a^2 = φ((z_1^2, z_2^2)^T) = φ(W^2 (a_1^1, a_2^1)^T) = φ(W^2 a^1).
At layer L_3:
At neuron n_1^3: a_1^3 = φ(w_{11}^3 a_1^2 + w_{12}^3 a_2^2) =: φ(z_1^3).
Letting W^3 = [w_{11}^3  w_{12}^3], we have at layer L_3:
a^3 = a_1^3 = φ(z_1^3) = φ(w_{11}^3 a_1^2 + w_{12}^3 a_2^2) = φ(W^3 (a_1^2, a_2^2)^T) = φ(W^3 a^2).
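Putting the three layers together, the whole forward pass is a^3 = φ(W^3 φ(W^2 φ(W^1 x))). A minimal sketch, again with ReLU standing in for φ and illustrative weight values (the lecture keeps W^1, W^2, W^3 generic):

import numpy as np

def relu(z):
    return np.maximum(z, 0)

def mlp_forward(x, W1, W2, W3, phi=relu):
    a1 = phi(W1 @ x)    # layer L1: a^1 = phi(W^1 x)
    a2 = phi(W2 @ a1)   # layer L2: a^2 = phi(W^2 a^1)
    a3 = phi(W3 @ a2)   # layer L3: a^3 = phi(W^3 a^2), a single output here
    return a3

# Illustrative shapes matching the derivation: W1, W2 are 2x2 and W3 is 1x2.
W1 = np.array([[1.0, 1.0], [1.0, -1.0]])
W2 = np.array([[0.5, -0.5], [1.0, 1.0]])
W3 = np.array([[1.0, -2.0]])
print(mlp_forward(np.array([1.0, 0.0]), W1, W2, W3))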
Optimization perspective
Given training data D = {(x^s, y^s)}_{s=1}^S, the network is trained by minimizing (over its weights) the total error

min Σ_{s=1}^S e_s = Σ_{s=1}^S E(y^s, ŷ^s) = Σ_{s=1}^S E(y^s, MLP(x^s)),

where E measures the discrepancy between the true label y^s and the prediction ŷ^s = MLP(x^s).
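In code, the objective is just a sum of per-sample errors. The sketch below uses a squared error for E, which is only one possible choice (the lecture keeps E generic), along with the same illustrative weights as in the earlier sketches:

import numpy as np

def relu(z):
    return np.maximum(z, 0)

def mlp_forward(x, W1, W2, W3):
    # Three-layer forward pass a^3 = relu(W^3 relu(W^2 relu(W^1 x))).
    return relu(W3 @ relu(W2 @ relu(W1 @ x)))

def squared_error(y, yhat):
    # One possible per-sample error E(y, yhat); the lecture keeps E generic.
    return (y - yhat) ** 2

# Illustrative weights and a toy training set D = {(x^s, y^s)}.
W1 = np.array([[1.0, 1.0], [1.0, -1.0]])
W2 = np.array([[0.5, -0.5], [1.0, 1.0]])
W3 = np.array([[1.0, -2.0]])
D = [((0.0, 0.0), -1.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), -1.0)]

total = sum(squared_error(y, mlp_forward(np.array(x), W1, W2, W3).item()) for x, y in D)
print("total loss:", total)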