
Deep Learning & Neural Networks

Lecture 1
Kevin Duh
Graduate School of Information Science
Nara Institute of Science and Technology

Jan 14, 2014


What is Deep Learning?


A family of methods that uses deep architectures to learn high-level
feature representations



Example of Trainable Features


Hierarchical object-parts features in Computer Vision [Lee et al., 2009]



Course Outline

Goal:
To understand the foundations of neural networks and deep learning, at a level sufficient for reading recent research papers

Schedule:
- Lecture 1 (Jan 14): Machine Learning background & Neural Networks
- Lecture 2 (Jan 16): Deep Architectures (DBN, SAE)
- Lecture 3 (Jan 21): Applications in Vision, Speech, Language
- Lecture 4 (Jan 23): Advanced topics in optimization

Prerequisites:
- Basic calculus, probability, linear algebra

Course Material

Course Website: http://cl.naist.jp/~kevinduh/a/deep2014/

Useful References:
1. Yoshua Bengio's short book Learning Deep Architectures for AI [Bengio, 2009]: http://www.iro.umontreal.ca/~bengioy/papers/ftml.pdf
2. Yann LeCun & Marc'Aurelio Ranzato's ICML 2013 tutorial: http://techtalks.tv/talks/deep-learning/58122/
3. Richard Socher et al.'s NAACL 2013 tutorial: http://www.socher.org/index.php/DeepLearningTutorial/
4. Geoff Hinton's Coursera course: https://www.coursera.org/course/neuralnets
5. Theano code samples: http://deeplearning.net/tutorial/
6. Chris Bishop's book Pattern Recognition and Machine Learning (PRML): http://research.microsoft.com/en-us/um/people/cmbishop/prml/

Grading

The only criterion for grading: Are you actively participating and asking questions in class?
- If you ask (or answer) 3+ questions, grade = A
- If you ask (or answer) 2 questions, grade = B
- If you ask (or answer) 1 question, grade = C
- If you don't ask (or answer) any questions, you get no credit.


Best Advice I got while in Grad School

Always Ask Questions!

- If you don't understand, you must ask questions in order to understand.
- If you understand, you will naturally have questions.
- Having no questions is a sign that you are not thinking.

Today's Topics

Machine Learning background
- Why is Machine Learning needed?
- Main Concepts: Generalization, Model Expressiveness, Overfitting
- Formal Notation

Neural Networks
- 1-Layer Nets (Logistic Regression)
- 2-Layer Nets and Model Expressiveness
- Training by Backpropagation



Write a Program to Recognize the Digit 2

- This is hard to do manually!
    bool recognizeDigitAs2(int** imagePixels){...}

- Machine Learning solution:
  1. Assume you have a database (training data) of 2s and non-2s.
  2. Automatically learn this function from the data.

*example from Hinton's Coursera course


A Machine Learning Solution

Training data are represented as pixel matrices:
- The classifier is parameterized by a weight matrix of the same dimension.
- Training procedure:
  1. When we observe a 2, add 1 to the corresponding matrix elements
  2. When we observe a non-2, subtract 1 from the corresponding matrix elements

Example pixel matrix of a 2 (the weight matrix starts at all zeros):

   0 0 0 0 0 0 0 0 0 0
   0 1 1 1 1 1 1 0 0 0
   0 0 0 0 0 0 1 0 0 0
   0 0 0 0 0 0 1 0 0 0
   0 0 0 0 0 1 0 0 0 0
   0 0 0 0 1 0 0 0 0 0
   0 0 0 1 0 0 0 0 0 0
   0 0 1 0 0 0 0 0 0 0
   0 1 1 1 1 1 1 1 1 0
   0 0 0 0 0 0 0 0 0 0

Learned weight matrix (after the +1/−1 updates):

   0  0  0  0  0  0  0  0  0  0
   0  1  1  1  1  1  0  0  0  0
   0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  1 -1  0  0  0
   0  0  0  0  1  0 -1  0  0  0
   0  0  0  1  0  0 -1  0  0  0
   0  0  1  0  0  0 -1  0  0  0
   0  1  1  1  1  1 -1  1  1  0
   0  0  0  0  0  0  0  0  0  0

Test procedure: given a new image, take the sum of the element-wise product.
- If positive, predict 2; else predict non-2.
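The add/subtract training rule and the element-wise product test fit in a few lines. A minimal NumPy sketch (the function names and the assumption that images arrive as 10×10 binary arrays are illustrative, not from the slides):

    import numpy as np

    def train_template(twos, non_twos, shape=(10, 10)):
        """Build the weight matrix by counting: +1 for 2s, -1 for non-2s."""
        W = np.zeros(shape)
        for img in twos:          # each img: binary array of `shape`
            W += img              # observe a 2: add 1 to matching pixels
        for img in non_twos:
            W -= img              # observe a non-2: subtract 1
        return W

    def predict_is_two(W, img):
        return np.sum(W * img) > 0   # element-wise product; positive => predict "2"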



Generalization ≠ Memorization

- Key issue in Machine Learning: training data is limited
- If the classifier just memorizes the training data, it may perform poorly on new data
- Generalization is the ability to extend accurate predictions to new data

E.g. consider a shifted image of a 2. Will this classifier generalize?
[The slide repeats the learned weight matrix from the previous slide.]

Generalization ≠ Memorization

One potential way to increase generalization ability: discretize the weight matrix with larger grids (fewer weights to train).

E.g. consider the shifted image again: now will this classifier generalize?

[The slide shows the 10×10 learned weight matrix alongside a coarser 5×5 weight grid:]

   1  1  1  0  0
   0  0  0  0  0
   0  0  1 -1  0
   0  1  0 -1  0
   1  1  1  0  1

Model Expressiveness and Overfitting

- A model with more weight parameters may fit the training data better
- But since training data is limited, more expressive models run the risk of overfitting to peculiarities of the data

  Less Expressive Model (fewer weights): underfits the training data
  More Expressive Model (more weights):  overfits the training data

Model Expressiveness and Overfitting

Fitting the training data (blue points: x_n) with a polynomial model
  f(x) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M
under the squared error objective (1/2) Σ_n (f(x_n) − t_n)^2

[Fits for polynomial order M = 0, 1, 3, 9; from PRML Chapter 1 [Bishop, 2006]]
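A small sketch of this polynomial-fitting experiment; the sine-plus-noise data follows the spirit of the PRML example, but the exact sample size and noise level are assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.linspace(0.0, 1.0, 10)
    t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, size=10)  # noisy targets

    for M in (0, 1, 3, 9):
        w = np.polyfit(x_train, t_train, deg=M)              # least-squares polynomial of order M
        train_err = 0.5 * np.sum((np.polyval(w, x_train) - t_train) ** 2)
        print(M, train_err)  # training error keeps shrinking with M; M=9 fits the noise (overfitting)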



Basic Problem Setup in Machine Learning

- Training Data: a set of pairs (x^(m), y^(m)), m = 1, 2, ..., M, where input x^(m) ∈ R^d and output y^(m) ∈ {0, 1}
  - e.g. x = vectorized image pixels, y = 2 or non-2
- Goal: learn a function f: x → y that predicts correctly on new inputs x
  - Step 1: Choose a function model family F, e.g. logistic regression, support vector machines, neural networks
  - Step 2: Optimize parameters w on the Training Data, e.g. minimize the loss function min_w Σ_{m=1}^{M} (f_w(x^(m)) − y^(m))^2



1-Layer Nets (Logistic Regression)

- Function model: f(x) = σ(w^T x + b)
  - Parameters: weight vector w ∈ R^d and scalar bias term b
  - σ is a non-linearity, e.g. the sigmoid: σ(z) = 1/(1 + exp(−z))
  - For simplicity, we sometimes write f(x) = σ(w^T x), where w = [w; b] and x = [x; 1]
- The non-linearity will be important for the expressiveness of multi-layer nets
  - Other non-linearities exist, e.g. tanh(z) = (e^z − e^(−z))/(e^z + e^(−z))
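A minimal NumPy sketch of this model (variable names are illustrative assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict(w, b, x):
        # f(x) = sigmoid(w^T x + b)
        return sigmoid(np.dot(w, x) + b)

    def predict_folded(w_tilde, x):
        # Folded form: w = [w; b], x = [x; 1]
        return sigmoid(np.dot(w_tilde, np.append(x, 1.0)))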


Training 1-Layer Nets: Gradient

- Assume the Squared-Error Loss(w) = (1/2) Σ_m (σ(w^T x^(m)) − y^(m))^2
- Gradient: ∇_w Loss = Σ_m (σ(w^T x^(m)) − y^(m)) σ'(w^T x^(m)) x^(m)
  - General form of the gradient: Σ_m Error^(m) · σ'(in^(m)) · x^(m)
- Derivative of the sigmoid σ(z) = 1/(1 + exp(−z)):

    σ'(z) = d/dz [1 + exp(−z)]^(−1)
          = −[1 + exp(−z)]^(−2) · d/dz [1 + exp(−z)]
          = −[1 + exp(−z)]^(−2) · exp(−z) · (−1)
          = [1/(1 + exp(−z))] · [exp(−z)/(1 + exp(−z))]
          = σ(z)(1 − σ(z))

*An alternative is the Cross-Entropy loss:
  −Σ_m [ y^(m) log σ(w^T x^(m)) + (1 − y^(m)) log(1 − σ(w^T x^(m))) ]
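A sketch of these formulas in code, including a finite-difference check of σ'(z) = σ(z)(1 − σ(z)); the helper names and test points are assumptions:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def dsigmoid(z):
        s = sigmoid(z)
        return s * (1.0 - s)          # sigma'(z) = sigma(z)(1 - sigma(z))

    # Finite-difference check of the derivative identity at a few points
    for z in (-2.0, 0.0, 3.0):
        numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6
        assert abs(numeric - dsigmoid(z)) < 1e-6

    def squared_error_gradient(w, X, y):
        # X: (M, d) inputs, y: (M,) labels
        z = X @ w                       # in^(m) = w^T x^(m)
        err = sigmoid(z) - y            # Error^(m)
        return X.T @ (err * dsigmoid(z))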


Training 1-Layer Nets: Gradient Descent Algorithm

General form of the gradient: Σ_m Error^(m) · σ'(in^(m)) · x^(m)

Gradient descent algorithm:
1. Initialize w
2. Compute ∇_w Loss = Σ_m Error^(m) · σ'(in^(m)) · x^(m)
3. w ← w − η (∇_w Loss)
4. Repeat steps 2-3 until some condition is satisfied

Stochastic gradient descent (SGD) algorithm:
1. Initialize w
2. For each sample (x^(m), y^(m)) in the training set:
3.   w ← w − η (Error^(m) · σ'(in^(m)) · x^(m))
4. Repeat loop 2-3 until some condition is satisfied

The learning rate η > 0 and the stopping condition are important in practice
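A sketch of both loops for the squared-error loss (the learning rate, epoch count, and zero initialization are arbitrary assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def dsigmoid(z):
        s = sigmoid(z)
        return s * (1.0 - s)

    def gradient_descent(X, y, eta=0.1, epochs=100):
        w = np.zeros(X.shape[1])                           # 1. initialize w
        for _ in range(epochs):                            # 4. repeat
            z = X @ w
            grad = X.T @ ((sigmoid(z) - y) * dsigmoid(z))  # 2. full gradient over all samples
            w = w - eta * grad                             # 3. w <- w - eta * grad
        return w

    def sgd(X, y, eta=0.1, epochs=100):
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for x_m, y_m in zip(X, y):                     # one update per training sample
                z = w @ x_m
                w = w - eta * (sigmoid(z) - y_m) * dsigmoid(z) * x_m
        return w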


Intuition of SGD update

For some sample (x^(m), y^(m)):
  w ← w − η ((σ(w^T x^(m)) − y^(m)) · σ'(w^T x^(m)) · x^(m))

  σ(w^T x^(m)) | y^(m) | Error | new w                    | new prediction
  0            | 0     |  0    | no change                | 0
  1            | 1     |  0    | no change                | 1
  0            | 1     | -1    | w + η σ'(in^(m)) x^(m)   | moves toward 1
  1            | 0     | +1    | w − η σ'(in^(m)) x^(m)   | moves toward 0

- (w + η σ'(in^(m)) x^(m))^T x^(m) = w^T x^(m) + η σ'(in^(m)) ||x^(m)||^2 ≥ w^T x^(m)
- σ'(w^T x^(m)) is near 0 when the prediction is confident, near 0.25 when it is uncertain
- Large η = more aggressive updates; small η = more conservative updates
- SGD improves classification for the current sample, but there is no guarantee about the others

Geometric view of SGD update

[Contour plot of the loss objective (1/2) Σ_m (σ(w^T x^(m)) − y^(m))^2 + λ||w||]

- Gradient descent moves in the steepest-descent direction, but is slower to compute per iteration for large datasets
- SGD can be viewed as noisy descent, but is faster per iteration
- In practice, a good tradeoff is mini-batch SGD
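A sketch of the mini-batch tradeoff (batch size, shuffling, and other details are assumptions; the slides only name the idea):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def minibatch_sgd(X, y, eta=0.1, batch_size=32, epochs=100, seed=0):
        rng = np.random.default_rng(seed)
        M, d = X.shape
        w = np.zeros(d)
        for _ in range(epochs):
            order = rng.permutation(M)                  # visit the data in a fresh random order
            for start in range(0, M, batch_size):
                idx = order[start:start + batch_size]
                s = sigmoid(X[idx] @ w)
                grad = X[idx].T @ ((s - y[idx]) * s * (1.0 - s))   # gradient on this mini-batch only
                w = w - eta * grad
        return w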

Effect of Learning Rate on Convergence Speed

- SGD update: w ← w − η (Error^(m) · σ'(in^(m)) · x^(m))
  - Ideally, η should be as large as possible without causing divergence
  - Common heuristic: η(t) = η0 / (1 + t), i.e. O(1/t)
- Analysis by [Schaul et al., 2013]
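A tiny sketch of that heuristic (the value of η0 is an arbitrary assumption):

    def learning_rate(t, eta0=0.5):
        # eta(t) = eta0 / (1 + t): large (aggressive) steps early, small (conservative) steps later
        return eta0 / (1.0 + t)

    # e.g. inside an SGD loop:  w = w - learning_rate(t) * grad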


Generalization issues: Regularization and Early-stopping

- Optimizing Loss(w) = (1/2) Σ_m (σ(w^T x^(m)) − y^(m))^2 on the training data does not necessarily lead to generalization
- Adding regularization: Loss(w) = (1/2) Σ_m (σ(w^T x^(m)) − y^(m))^2 + λ||w||
  - reduces sensitivity to the training input and decreases the risk of overfitting
- Early Stopping:
  - Prepare separate training and validation (development) data
  - Optimize Loss(w) on the training data, but stop when Loss(w) on the validation data stops improving

[Training and validation error curves from Chapter 5, [Bishop, 2006]]
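A sketch of early stopping wrapped around a gradient-descent loop (the patience counter is an added convenience, not something the slides specify):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def loss(w, X, y, lam=0.0):
        return 0.5 * np.sum((sigmoid(X @ w) - y) ** 2) + lam * np.linalg.norm(w)

    def train_early_stopping(X_tr, y_tr, X_val, y_val, eta=0.1, max_epochs=1000, patience=5):
        w = np.zeros(X_tr.shape[1])
        best_w, best_val, bad = w.copy(), np.inf, 0
        for _ in range(max_epochs):
            s = sigmoid(X_tr @ w)
            w = w - eta * (X_tr.T @ ((s - y_tr) * s * (1.0 - s)))  # gradient step on training data
            val = loss(w, X_val, y_val)
            if val < best_val:
                best_w, best_val, bad = w.copy(), val, 0           # validation still improving
            else:
                bad += 1
                if bad >= patience:                                # validation stopped improving
                    break
        return best_w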


Summary

1. Given Training Data: (x^(m), y^(m)), m = 1, 2, ..., M
2. Optimize a model f(x) = σ(w^T x + b) under Loss(w) = (1/2) Σ_m (σ(w^T x^(m)) − y^(m))^2
3. General form of the gradient: Σ_m Error^(m) · σ'(in^(m)) · x^(m)
4. SGD algorithm: for each sample (x^(m), y^(m)) in the training set, w ← w − η (Error^(m) · σ'(in^(m)) · x^(m))

Important issues:
- Optimization speed/convergence: batch vs. mini-batch, learning rate
- Generalization ability: regularization, early-stopping



2-Layer Neural Networks

[Network diagram: inputs x_1, ..., x_4 connect to hidden units h_1, h_2, h_3 via weights w_ij; the hidden units connect to the output y via weights w_j]

f(x) = σ(Σ_j w_j h_j) = σ(Σ_j w_j σ(Σ_i w_ij x_i))

- The hidden units h_j can be viewed as new features obtained by combining the inputs x_i
- Called a Multilayer Perceptron (MLP), but it is more like multilayer logistic regression
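A sketch of this forward pass (layer sizes and random weights are illustrative assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def mlp_forward(x, W_in, w_out):
        # x: (d,), W_in: (H, d) input-to-hidden weights w_ij, w_out: (H,) hidden-to-output weights w_j
        h = sigmoid(W_in @ x)        # hidden units h_j = sigma(sum_i w_ij x_i)
        return sigmoid(w_out @ h)    # output f(x) = sigma(sum_j w_j h_j)

    # Example with 4 inputs and 3 hidden units, as in the diagram
    rng = np.random.default_rng(0)
    W_in, w_out = rng.normal(size=(3, 4)), rng.normal(size=3)
    print(mlp_forward(np.array([1.0, 0.0, -1.0, 0.5]), W_in, w_out))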

Modeling complex non-linearities

- Given the same number of units (with non-linear activation), a deeper architecture is more expressive than a shallow one [Bishop, 1995]
  - 1-layer nets only model linear hyperplanes
  - 2-layer nets are universal function approximators: given infinitely many hidden nodes, they can express any continuous function
  - ≥3-layer nets can do so with fewer nodes/weights


Training a 2-Layer Net with Backpropagation

[Network diagram: the forward pass predicts f(x^(m)) from inputs x_1, ..., x_4 through hidden units h_1, h_2, h_3; the backward pass adjusts the weights]

1. For each sample, compute f(x^(m)) = σ(Σ_j w_j σ(Σ_i w_ij x_i^(m)))
2. If f(x^(m)) ≠ y^(m), back-propagate the error and adjust the weights {w_ij, w_j}


Derivatives of the weights

Assume two outputs (y_1, y_2) per input x, and loss per sample: Loss = Σ_k (1/2) [σ(in_k) − y_k]^2

[Network diagram: inputs x_i → hidden units h_j (weights w_ij) → outputs y_k (weights w_jk)]

∂Loss/∂w_jk = (∂Loss/∂in_k)(∂in_k/∂w_jk) = δ_k · ∂(Σ_j w_jk h_j)/∂w_jk = δ_k h_j
∂Loss/∂w_ij = (∂Loss/∂in_j)(∂in_j/∂w_ij) = δ_j · ∂(Σ_i w_ij x_i)/∂w_ij = δ_j x_i

δ_k = ∂Loss/∂in_k = ∂/∂in_k Σ_k (1/2) [σ(in_k) − y_k]^2 = [σ(in_k) − y_k] σ'(in_k)
δ_j = Σ_k (∂Loss/∂in_k)(∂in_k/∂in_j) = Σ_k δ_k · ∂(Σ_j w_jk σ(in_j))/∂in_j = [Σ_k δ_k w_jk] σ'(in_j)

Backpropagation Algorithm

All updates involve some scaled error from the output × an input feature:

∂Loss/∂w_jk = δ_k h_j, where δ_k = [σ(in_k) − y_k] σ'(in_k)
∂Loss/∂w_ij = δ_j x_i, where δ_j = [Σ_k δ_k w_jk] σ'(in_j)

After the forward pass, compute δ_k for the final layer, then δ_j for the previous layer. For deeper nets, iterate backwards further.

[Network diagram: inputs x_i → hidden units h_j (weights w_ij) → outputs y_k (weights w_jk)]
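A sketch of one backpropagation update for a single sample, following the δ notation above (the layer sizes and the omission of bias terms are simplifying assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_step(x, y, W_in, W_out, eta=0.1):
        # Forward pass
        in_j = W_in @ x                      # hidden pre-activations
        h = sigmoid(in_j)
        in_k = W_out @ h                     # output pre-activations
        out = sigmoid(in_k)

        # Backward pass: output-layer deltas first, then hidden-layer deltas
        delta_k = (out - y) * out * (1.0 - out)         # [sigma(in_k) - y_k] * sigma'(in_k)
        delta_j = (W_out.T @ delta_k) * h * (1.0 - h)   # [sum_k delta_k w_jk] * sigma'(in_j)

        # Gradients and SGD-style updates: dLoss/dw_jk = delta_k h_j, dLoss/dw_ij = delta_j x_i
        W_out = W_out - eta * np.outer(delta_k, h)
        W_in = W_in - eta * np.outer(delta_j, x)
        return W_in, W_out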


Summary

1. By extending from a 1-layer to a 2-layer net, we get a dramatic increase in model expressiveness
2. Backpropagation is an efficient way to train 2-layer nets:
   - Similar to SGD for a 1-layer net, just with more chaining in the gradient
   - General form: update w_ij by δ_j x_i, where δ_j is a scaled/weighted sum of the errors from the outgoing layers
3. Ideally, we want even deeper architectures
   - But Backpropagation becomes ineffective due to vanishing gradients
   - Deep Learning comes to the rescue! (next lecture)

References

Bengio, Y. (2009). Learning Deep Architectures for AI. Foundations and Trends in Machine Learning. NOW Publishers.
Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press.
Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer.
Lee, H., Grosse, R., Ranganath, R., and Ng, A. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML.
Schaul, T., Zhang, S., and LeCun, Y. (2013). No more pesky learning rates. In Proc. International Conference on Machine Learning (ICML 2013).
