IE 643
Lecture 6
September 1, 2020.
Perceptron - Caveat
Not suitable when the linear separability assumption fails.
Example: Classical XOR problem

x1  x2  y = x1 ⊕ x2   ŷ = sign(w1 x1 + w2 x2 − θ)
0   0   -1            sign(−θ)
0   1   +1            sign(w2 − θ)
1   0   +1            sign(w1 − θ)
1   1   -1            sign(w1 + w2 − θ)
For ŷ to match y on all four inputs, we need:
sign(−θ) = −1 =⇒ θ > 0
sign(w2 − θ) = 1 =⇒ w2 − θ ≥ 0
sign(w1 − θ) = 1 =⇒ w1 − θ ≥ 0
sign(w1 + w2 − θ) = −1 =⇒ −w1 − w2 + θ > 0
These four conditions cannot hold simultaneously: adding the second and third gives w1 + w2 ≥ 2θ > θ (using θ > 0), which contradicts the fourth condition w1 + w2 < θ. Hence no choice of (w1, w2, θ) makes the perceptron reproduce XOR; the data are not linearly separable.
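This can also be checked numerically. The sketch below (an illustration, not part of the lecture; the grid range and resolution are arbitrary choices) scans candidate values of w1, w2 and θ and confirms that no combination reproduces all four XOR labels with ŷ = sign(w1 x1 + w2 x2 − θ):

import numpy as np

# XOR inputs and labels in {-1, +1}, as in the table above.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, 1, 1, -1])

def predict(w1, w2, theta):
    # Perceptron rule yhat = sign(w1*x1 + w2*x2 - theta), treating sign(0) as +1.
    z = w1 * X[:, 0] + w2 * X[:, 1] - theta
    return np.where(z >= 0, 1, -1)

# Scan a coarse grid of candidate parameters.
grid = np.linspace(-2, 2, 41)
found = any(
    np.array_equal(predict(w1, w2, theta), y)
    for w1 in grid for w2 in grid for theta in grid
)
print("separating perceptron found:", found)  # prints False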
Idea: The separating surface need not be linear and can be assumed to take some non-linear form.
Hence for an input space X and output space Y, the learned map h : X → Y can take some non-linear form.
This forms the core idea behind kernel methods. (Will not be pursued in this course!)
Some notations
n_k^ℓ denotes the k-th neuron at layer ℓ.
a_k^ℓ denotes the activation of the neuron n_k^ℓ.
Moving on from Perceptron

x1  x2  a_1^1                 a_2^1                 ŷ                                   y
0   0   max{b1, 0}            max{b2, 0}            sign(t a_1^1 + u a_2^1 + b3)        -1
0   1   max{q + b1, 0}        max{s + b2, 0}        sign(t a_1^1 + u a_2^1 + b3)        +1
1   0   max{p + b1, 0}        max{r + b2, 0}        sign(t a_1^1 + u a_2^1 + b3)        +1
1   1   max{p + q + b1, 0}    max{r + s + b2, 0}    sign(t a_1^1 + u a_2^1 + b3)        -1
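The table reads a_1^1 = max{p x1 + q x2 + b1, 0} and a_2^1 = max{r x1 + s x2 + b2, 0}. The lecture leaves p, q, r, s, t, u and the biases unspecified; the sketch below uses one hypothetical choice (p = q = r = s = 1, b1 = 0, b2 = −1, t = 1, u = −2, b3 = −0.5) that happens to fit XOR, just to verify the table row by row:

# Hypothetical weights (not given in the lecture) that make this small network fit XOR.
p, q, b1 = 1.0, 1.0, 0.0     # first hidden neuron:  a1 = max(p*x1 + q*x2 + b1, 0)
r, s, b2 = 1.0, 1.0, -1.0    # second hidden neuron: a2 = max(r*x1 + s*x2 + b2, 0)
t, u, b3 = 1.0, -2.0, -0.5   # output: yhat = sign(t*a1 + u*a2 + b3)

for (x1, x2), y in zip([(0, 0), (0, 1), (1, 0), (1, 1)], [-1, 1, 1, -1]):
    a1 = max(p * x1 + q * x2 + b1, 0.0)
    a2 = max(r * x1 + s * x2 + b2, 0.0)
    yhat = 1 if t * a1 + u * a2 + b3 >= 0 else -1
    print(x1, x2, a1, a2, yhat, y)   # yhat matches y on all four rows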
Notable features:
Multiple layers stacked together.
Zero-th layer usually called the input layer.
Final layer usually called the output layer.
Intermediate layers are called hidden layers.
Each neuron in the hidden and output layers is like a perceptron.
However, unlike the perceptron, different activation functions are used.
max{x, 0} has a special name: ReLU (Rectified Linear Unit).
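For concreteness, a one-line ReLU in NumPy (a standard elementwise definition matching max{x, 0} above):

import numpy as np

def relu(x):
    # Rectified Linear Unit: elementwise max{x, 0}.
    return np.maximum(x, 0)

print(relu(np.array([-2.0, 0.0, 3.5])))  # [0.  0.  3.5]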
Recall:
n_k^ℓ denotes the k-th neuron at the ℓ-th layer.
a_k^ℓ denotes the activation of neuron n_k^ℓ.
At layer L_1:
At neuron n_1^1: a_1^1 = φ(w_{11}^1 x_1 + w_{12}^1 x_2) =: φ(z_1^1).
At neuron n_2^1: a_2^1 = φ(w_{21}^1 x_1 + w_{22}^1 x_2) =: φ(z_2^1).
In vector form,
(a_1^1, a_2^1)^T = (φ(z_1^1), φ(z_2^1))^T = (φ(w_{11}^1 x_1 + w_{12}^1 x_2), φ(w_{21}^1 x_1 + w_{22}^1 x_2))^T.
Letting W^1 = [w_{11}^1  w_{12}^1 ; w_{21}^1  w_{22}^1] and x = (x_1, x_2)^T, we have at layer L_1:
(a_1^1, a_2^1)^T = φ((z_1^1, z_2^1)^T) = φ(W^1 x).
Letting a^1 = (a_1^1, a_2^1)^T, we have at layer L_1:
a^1 = φ(W^1 x).
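The vectorized rule a^1 = φ(W^1 x) translates directly to code. A minimal sketch, using ReLU for φ and illustrative weights (the lecture does not fix specific values for W^1 or x):

import numpy as np

def relu(z):
    # phi(z) = max{z, 0}, applied elementwise.
    return np.maximum(z, 0)

# Illustrative weight matrix W^1 and input x.
W1 = np.array([[1.0,  1.0],
               [1.0, -1.0]])
x = np.array([0.5, 1.0])

z1 = W1 @ x      # pre-activations z^1 = W^1 x
a1 = relu(z1)    # activations    a^1 = phi(z^1)
print(z1, a1)    # [ 1.5 -0.5] [1.5 0. ]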
At layer L_2:
At neuron n_1^2: a_1^2 = φ(w_{11}^2 a_1^1 + w_{12}^2 a_2^1) =: φ(z_1^2).
At neuron n_2^2: a_2^2 = φ(w_{21}^2 a_1^1 + w_{22}^2 a_2^1) =: φ(z_2^2).
In vector form,
a^2 = (a_1^2, a_2^2)^T = (φ(z_1^2), φ(z_2^2))^T = (φ(w_{11}^2 a_1^1 + w_{12}^2 a_2^1), φ(w_{21}^2 a_1^1 + w_{22}^2 a_2^1))^T.
Letting W^2 = [w_{11}^2  w_{12}^2 ; w_{21}^2  w_{22}^2], we have at layer L_2:
a^2 = φ((z_1^2, z_2^2)^T) = φ(W^2 (a_1^1, a_2^1)^T) = φ(W^2 a^1).
At layer L_3:
At neuron n_1^3: a_1^3 = φ(w_{11}^3 a_1^2 + w_{12}^3 a_2^2) =: φ(z_1^3).
Letting W^3 = [w_{11}^3  w_{12}^3], we have at layer L_3:
a^3 = a_1^3 = φ(z_1^3) = φ(w_{11}^3 a_1^2 + w_{12}^3 a_2^2) = φ(W^3 (a_1^2, a_2^2)^T) = φ(W^3 a^2).
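Putting the three layers together, the whole forward pass is a^3 = φ(W^3 φ(W^2 φ(W^1 x))). A minimal sketch, again with ReLU standing in for φ and illustrative weight values (the lecture keeps W^1, W^2, W^3 generic):

import numpy as np

def relu(z):
    return np.maximum(z, 0)

def mlp_forward(x, W1, W2, W3, phi=relu):
    a1 = phi(W1 @ x)    # layer L1: a^1 = phi(W^1 x)
    a2 = phi(W2 @ a1)   # layer L2: a^2 = phi(W^2 a^1)
    a3 = phi(W3 @ a2)   # layer L3: a^3 = phi(W^3 a^2), a single output here
    return a3

# Illustrative shapes matching the derivation: W1, W2 are 2x2 and W3 is 1x2.
W1 = np.array([[1.0, 1.0], [1.0, -1.0]])
W2 = np.array([[0.5, -0.5], [1.0, 1.0]])
W3 = np.array([[1.0, -2.0]])
print(mlp_forward(np.array([1.0, 0.0]), W1, W2, W3))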
Optimization perspective
Given training data D = {(x^s, y^s)}_{s=1}^S, the network is trained by minimizing (over its weights) the total error

min Σ_{s=1}^S e_s = Σ_{s=1}^S E(y^s, ŷ^s) = Σ_{s=1}^S E(y^s, MLP(x^s)),

where E measures the discrepancy between the true label y^s and the prediction ŷ^s = MLP(x^s).
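In code, the objective is just a sum of per-sample errors. The sketch below uses a squared error for E, which is only one possible choice (the lecture keeps E generic), along with the same illustrative weights as in the earlier sketches:

import numpy as np

def relu(z):
    return np.maximum(z, 0)

def mlp_forward(x, W1, W2, W3):
    # Three-layer forward pass a^3 = relu(W^3 relu(W^2 relu(W^1 x))).
    return relu(W3 @ relu(W2 @ relu(W1 @ x)))

def squared_error(y, yhat):
    # One possible per-sample error E(y, yhat); the lecture keeps E generic.
    return (y - yhat) ** 2

# Illustrative weights and a toy training set D = {(x^s, y^s)}.
W1 = np.array([[1.0, 1.0], [1.0, -1.0]])
W2 = np.array([[0.5, -0.5], [1.0, 1.0]])
W3 = np.array([[1.0, -2.0]])
D = [((0.0, 0.0), -1.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), -1.0)]

total = sum(squared_error(y, mlp_forward(np.array(x), W1, W2, W3).item()) for x, y in D)
print("total loss:", total)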