
Deep Learning - Theory and Practice

IE 643
Lecture 6

September 1, 2020.

P. Balamurugan


Outline

1 Moving on from Perceptron

2 Multi Layer Perceptron
    MLP-Data Perspective


Moving on from Perceptron

Perceptron - Caveat

Not suitable when the linear separability assumption fails.


Example: Classical XOR problem



Heavily criticized by M. Minsky and S. Papert in their book Perceptrons, MIT Press, 1969.


XOR truth table:

x1  x2  y = x1 ⊕ x2
0   0   -1
0   1   +1
1   0   +1
1   1   -1


With a single perceptron having weights w1, w2 and threshold θ, the predictions would be:

x1  x2  y = x1 ⊕ x2   ŷ = sign(w1 x1 + w2 x2 − θ)
0   0   -1            sign(−θ)
0   1   +1            sign(w2 − θ)
1   0   +1            sign(w1 − θ)
1   1   -1            sign(w1 + w2 − θ)


Matching these predictions to the labels requires:

sign(−θ) = −1            ⟹  θ > 0
sign(w2 − θ) = 1         ⟹  w2 − θ ≥ 0
sign(w1 − θ) = 1         ⟹  w1 − θ ≥ 0
sign(w1 + w2 − θ) = −1   ⟹  −w1 − w2 + θ > 0

Note: This system is inconsistent. (Homework!)


Recall: We verified this using code for a linear separability check.
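A minimal sketch of such a check (the feasibility formulation below is an assumption, not necessarily the code used in class): the data points (x, y) are linearly separable iff there exist w, θ with y (w · x − θ) ≥ 1 for every point, which can be tested as a linear program.

    import numpy as np
    from scipy.optimize import linprog

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # XOR inputs
    y = np.array([-1, 1, 1, -1], dtype=float)                    # XOR labels

    # Assumed feasibility LP.  Variables: [w1, w2, theta].
    # The constraint y * (w . x - theta) >= 1 becomes -y * (x1*w1 + x2*w2 - theta) <= -1.
    A_ub = -y[:, None] * np.hstack([X, -np.ones((len(X), 1))])
    b_ub = -np.ones(len(X))

    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 3)
    print("Linearly separable:", res.success)   # prints False for XOR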
Moving away from perceptron - Dealing with XOR problem


Assume that the sample features x ∈ R^d.


Idea: Use a transformation φ : R^d → R^q, where q ≫ d, to lift the data samples x ∈ R^d into φ(x) ∈ R^q, hoping to see a separating hyperplane in the transformed space.


Forms the core idea behind kernel methods. (Will not be pursued in this course!)
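For instance (an illustrative lift, not one prescribed by the slides), adding the product feature x1·x2 already makes XOR linearly separable in R^3; a quick check under that assumption:

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([-1, 1, 1, -1])

    # Illustrative lift phi(x) = (x1, x2, x1*x2); then w = (1, 1, -2), theta = 0.5
    # gives a separating hyperplane in the lifted space.
    phi = np.hstack([X, (X[:, 0] * X[:, 1])[:, None]])
    w, theta = np.array([1.0, 1.0, -2.0]), 0.5
    print(np.sign(phi @ w - theta) == y)   # [ True  True  True  True]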




Idea: The separating surface need not be linear and can be assumed to take some non-linear form.


Hence for an input space X and output space Y, the learned map h : X → Y can take some non-linear form.


Forms the idea behind multi-layer perceptrons!


[Figure: a small network for XOR with inputs x1, x2, hidden neurons n_1^1 and n_2^1, and output neuron n_1^2.]


Some notations

n_k^ℓ denotes the k-th neuron at layer ℓ.
a_k^ℓ denotes the activation of the neuron n_k^ℓ.
Activation at neuron n_1^1:

    a_1^1 = max{p x1 + q x2 + b1, 0}.


Activation at neuron n_2^1:

    a_2^1 = max{r x1 + s x2 + b2, 0}.


Activation at neuron n_1^2:

    a_1^2 = sign(t a_1^1 + u a_2^1 + b3).


Note: The activation a_1^2 is the output of the network, denoted by ŷ.
x1  x2  a_1^1               a_2^1               ŷ                               y
0   0   max{b1, 0}          max{b2, 0}          sign(t a_1^1 + u a_2^1 + b3)    -1
0   1   max{q + b1, 0}      max{s + b2, 0}      sign(t a_1^1 + u a_2^1 + b3)    +1
1   0   max{p + b1, 0}      max{r + b2, 0}      sign(t a_1^1 + u a_2^1 + b3)    +1
1   1   max{p + q + b1, 0}  max{r + s + b2, 0}  sign(t a_1^1 + u a_2^1 + b3)    -1
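One concrete choice of weights that makes this table come out right (assumed values for illustration; the values on the lecture's figure may differ) is p = q = 1, b1 = 0, r = s = 1, b2 = -1, t = 1, u = -2, b3 = -0.5, as the short check below shows.

    import numpy as np

    p, q, b1 = 1.0, 1.0, 0.0     # hidden neuron n_1^1 (assumed values)
    r, s, b2 = 1.0, 1.0, -1.0    # hidden neuron n_2^1 (assumed values)
    t, u, b3 = 1.0, -2.0, -0.5   # output neuron n_1^2 (assumed values)

    for (x1, x2), y in zip([(0, 0), (0, 1), (1, 0), (1, 1)], [-1, 1, 1, -1]):
        a11 = max(p * x1 + q * x2 + b1, 0.0)      # ReLU activation
        a21 = max(r * x1 + s * x2 + b2, 0.0)      # ReLU activation
        yhat = int(np.sign(t * a11 + u * a21 + b3))
        print((x1, x2), yhat, y)                  # prediction matches label on every row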




A different Multi Layer Perceptron (MLP) architecture for the XOR problem is given in:

David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams. Learning Internal Representations by Error Propagation, Technical Report, UCSD, 1985.


Multi Layer Perceptron


Notable features:

Multiple layers stacked together.
The zero-th layer is usually called the input layer.
The final layer is usually called the output layer.
Intermediate layers are called hidden layers.
Each neuron in the hidden and output layers acts like a perceptron.
However, unlike the perceptron, different activation functions are used.
max{x, 0} has a special name: ReLU (Rectified Linear Unit).
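As a small sketch, the two activations seen so far written out explicitly (the vectorised forms below are for illustration only):

    import numpy as np

    def relu(z):                      # ReLU: max{z, 0}, applied elementwise
        return np.maximum(z, 0.0)

    def sign(z):                      # perceptron-style activation, with sign(0) = +1
        return np.where(z >= 0, 1.0, -1.0)

    print(relu(np.array([-1.5, 0.0, 2.0])))   # [0. 0. 2.]
    print(sign(np.array([-0.3, 0.0, 0.7])))   # [-1.  1.  1.]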




Multi Layer Perceptron - More notations


This MLP contains an input layer L0, 2 hidden layers denoted by L1, L2, and an output layer L3.


Recall:
n_k^ℓ denotes the k-th neuron at the ℓ-th layer.
a_k^ℓ denotes the activation of neuron n_k^ℓ.


w_ij^ℓ denotes the weight of the connection from neuron n_j^(ℓ−1) to neuron n_i^ℓ.


In this particular case, the inputs are x1 and x2 at input layer L0.


At layer L1:

At neuron n_1^1:

    a_1^1 = φ(w_11^1 x1 + w_12^1 x2) =: φ(z_1^1).

At neuron n_2^1:

    a_2^1 = φ(w_21^1 x1 + w_22^1 x2) =: φ(z_2^1).


At layer L1:

    (a_1^1, a_2^1)ᵀ = (φ(z_1^1), φ(z_2^1))ᵀ = (φ(w_11^1 x1 + w_12^1 x2), φ(w_21^1 x1 + w_22^1 x2))ᵀ


Letting W^1 = [w_11^1  w_12^1 ; w_21^1  w_22^1] and x = (x1, x2)ᵀ, we have at layer L1:

    (a_1^1, a_2^1)ᵀ = φ((z_1^1, z_2^1)ᵀ) = φ((w_11^1 x1 + w_12^1 x2, w_21^1 x1 + w_22^1 x2)ᵀ) = φ(W^1 x),

where φ is applied elementwise.


Letting a^1 = (a_1^1, a_2^1)ᵀ, we have at layer L1:

    a^1 = φ(W^1 x)


At layer L2:

At neuron n_1^2:

    a_1^2 = φ(w_11^2 a_1^1 + w_12^2 a_2^1) =: φ(z_1^2).

At neuron n_2^2:

    a_2^2 = φ(w_21^2 a_1^1 + w_22^2 a_2^1) =: φ(z_2^2).


At layer L2:

    a^2 = (a_1^2, a_2^2)ᵀ = (φ(z_1^2), φ(z_2^2))ᵀ = (φ(w_11^2 a_1^1 + w_12^2 a_2^1), φ(w_21^2 a_1^1 + w_22^2 a_2^1))ᵀ


Letting W^2 = [w_11^2  w_12^2 ; w_21^2  w_22^2], we have at layer L2:

    a^2 = (a_1^2, a_2^2)ᵀ = φ((z_1^2, z_2^2)ᵀ) = φ(W^2 (a_1^1, a_2^1)ᵀ) = φ(W^2 a^1)


At layer L3:

At neuron n_1^3:

    a_1^3 = φ(w_11^3 a_1^2 + w_12^3 a_2^2) =: φ(z_1^3).


Letting W^3 = [w_11^3  w_12^3], we have at layer L3:

    a^3 = a_1^3 = φ(z_1^3) = φ(w_11^3 a_1^2 + w_12^3 a_2^2) = φ(W^3 a^2)


Unrolling the layers, the network output is

    ŷ = a^3 = φ(W^3 a^2) = φ(W^3 φ(W^2 a^1)) = φ(W^3 φ(W^2 φ(W^1 x)))
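A minimal sketch of this forward pass in code (the concrete weight values and the choice φ = ReLU are assumptions for illustration, not values from the slides):

    import numpy as np

    def phi(z):                        # activation, here taken to be ReLU
        return np.maximum(z, 0.0)

    W1 = np.array([[1.0, 1.0],         # rows: (w_11^1, w_12^1), (w_21^1, w_22^1) - assumed values
                   [1.0, 1.0]])
    W2 = np.array([[1.0, -2.0],        # assumed values
                   [0.5,  0.5]])
    W3 = np.array([[1.0,  1.0]])       # (w_11^3, w_12^3) - assumed values

    def mlp(x):
        a1 = phi(W1 @ x)               # a^1 = phi(W^1 x)
        a2 = phi(W2 @ a1)              # a^2 = phi(W^2 a^1)
        a3 = phi(W3 @ a2)              # a^3 = phi(W^3 a^2) = yhat
        return a3

    print(mlp(np.array([0.0, 1.0])))   # forward pass on one input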


Multi Layer Perceptron - Data Perspective

Given data (x, y), the multi layer perceptron predicts:

    ŷ = φ(W^3 φ(W^2 φ(W^1 x))) =: MLP(x)

Similar to the perceptron, if y ≠ ŷ an error E(y, ŷ) is incurred.

Aim: To change the weights W^1, W^2, W^3 such that the error E(y, ŷ) is minimized.

Leads to an error minimization problem.


Input: training data D = {(x^s, y^s)}_{s=1}^S.

For each sample x^s, the prediction is ŷ^s = MLP(x^s).
Error: e^s = E(y^s, ŷ^s).
Aim: minimize Σ_{s=1}^S e^s.


Optimization perspective

Given training data D = {(x^s, y^s)}_{s=1}^S,

    min  Σ_{s=1}^S e^s = Σ_{s=1}^S E(y^s, ŷ^s) = Σ_{s=1}^S E(y^s, MLP(x^s))

Note: The minimization is over the weights W^1, ..., W^L of the MLP, where L denotes the number of layers in the MLP.
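A minimal sketch of this objective in code (the squared-error choice E(y, ŷ) = (y − ŷ)² and the toy data are assumptions for illustration; the lecture has not yet fixed a particular E):

    import numpy as np

    def E(y, yhat):                           # assumed per-sample error
        return float((y - yhat) ** 2)

    def total_error(D, mlp):
        # D is a list of (x, y) pairs; mlp maps an input array to a prediction.
        return sum(E(y, mlp(x)) for x, y in D)

    # Toy usage with the XOR data and the hand-crafted network from earlier:
    D = [(np.array([0., 0.]), -1), (np.array([0., 1.]), 1),
         (np.array([1., 0.]), 1), (np.array([1., 1.]), -1)]
    xor_mlp = lambda x: np.sign(max(x[0] + x[1], 0.0)
                                - 2.0 * max(x[0] + x[1] - 1.0, 0.0) - 0.5)
    print(total_error(D, xor_mlp))            # 0.0 when the network fits all samples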
