Lecture 1
Kevin Duh
Graduate School of Information Science
Nara Institute of Science and Technology
Course Outline

Goal: To understand the foundations of neural networks and deep learning, at a level sufficient for reading recent research papers

Schedule:
- Lecture 1 (Jan 14): …
- Lecture 2 (Jan 16): …
- Lecture 3 (Jan 21): …
- Lecture 4 (Jan 23): …

Prerequisites:
- …
Course Material

Course Website: http://cl.naist.jp/~kevinduh/a/deep2014/

Useful References:
1. http://www.iro.umontreal.ca/~bengioy/papers/ftml.pdf
2. http://techtalks.tv/talks/deep-learning/58122/
3. http://www.socher.org/index.php/DeepLearningTutorial/
4. https://www.coursera.org/course/neuralnets
5. http://deeplearning.net/tutorial/
6. http://research.microsoft.com/en-us/um/people/cmbishop/prml/
Grading
- If you …
- If you …
- If you …
- If you …
Today's Topics

Neural Networks
1-Layer Nets (Logistic Regression)
2-Layer Nets and Model Expressiveness
Training by Backpropagation
[Figure: an example input image encoded as a grid of binary (0/1) pixel values, together with a weight vector whose entries include −1, 0, and +1; shown across several slide builds]
Today's Topics

Neural Networks
1-Layer Nets (Logistic Regression)
2-Layer Nets and Model Expressiveness
Training by Backpropagation
Generalization ≠ Memorization

Key issue in machine learning: training data is limited.

If the classifier just memorizes the training data, it may perform poorly on new data. Generalization is the ability to extend accurate predictions to new data.

[Figure: the binary pixel image and a weight vector with entries in {−1, 0, +1} that memorizes specific pixel positions]
Generalization ≠ Memorization

One potential way to increase generalization ability: discretize the weight matrix with larger grids (fewer weights to train). E.g., consider a shifted image:

[Figure: the original and a shifted copy of the binary pixel image, as grids of 0/1 values]
A model with more weight parameters may fit the training data better. But since training data is limited, expressive models run the risk of overfitting to peculiarities of the data.

- Less expressive model (fewer weights): may underfit the training data
- More expressive model (more weights): may overfit the training data
[Figure: polynomial fits of order M = 1, 3, and 9 to the same data points (axes: t vs. x); M = 1 underfits while M = 9 overfits (cf. Bishop, 2006)]
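To make the under/overfitting pattern concrete, here is a minimal numpy sketch (an assumed setup, not from the lecture) that fits polynomials of degree M = 1, 3, and 9 to ten noisy points and prints the training error:

```python
# A sketch of under/overfitting with polynomial regression; the data
# (noisy sine samples) and all names here are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)  # noisy targets

for M in (1, 3, 9):
    coeffs = np.polyfit(x, t, deg=M)                  # least-squares fit
    train_mse = np.mean((np.polyval(coeffs, x) - t) ** 2)
    print(f"M={M}: training MSE = {train_mse:.5f}")
# Training error falls as M grows (M=9 can pass through all ten points),
# but the high-degree curve oscillates between points: it has overfit.
```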
Today's Topics

Neural Networks
1-Layer Nets (Logistic Regression)
2-Layer Nets and Model Expressiveness
Training by Backpropagation
Train the weights $w$ of the 1-layer net to minimize the squared error over the $M$ training samples:
$$\sum_{m=1}^{M} \big(f_w(x^{(m)}) - y^{(m)}\big)^2$$
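As a minimal sketch of this setup (the names `sigma`, `forward`, and `squared_loss` are mine, not the lecture's), a 1-layer net and its squared-error loss in numpy:

```python
# 1-layer net (logistic regression): f_w(x) = sigma(w^T x),
# trained to minimize sum_m (f_w(x^(m)) - y^(m))^2.
import numpy as np

def sigma(z):
    """Logistic sigmoid 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(w, X):
    """f_w(x) for each row x of X (shape: samples x features)."""
    return sigma(X @ w)

def squared_loss(w, X, y):
    """Squared error summed over the M training samples."""
    return np.sum((forward(w, X) - y) ** 2)
```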
Today's Topics

Neural Networks
1-Layer Nets (Logistic Regression)
2-Layer Nets and Model Expressiveness
Training by Backpropagation
The squared-error loss is
$$\frac{1}{2}\sum_m \big(\sigma(w^T x^{(m)}) - y^{(m)}\big)^2$$

An alternative is the cross-entropy loss:
$$-\sum_m y^{(m)} \log \sigma(w^T x^{(m)}) + \big(1 - y^{(m)}\big)\log\big(1 - \sigma(w^T x^{(m)})\big)$$
The derivative of the sigmoid has a convenient form:
$$\sigma'(z) = \frac{d}{dz}\,\frac{1}{1+\exp(-z)} = \left(\frac{1}{1+\exp(-z)}\right)^{2}\exp(-z) = \frac{1}{1+\exp(-z)}\cdot\frac{\exp(-z)}{1+\exp(-z)} = \sigma(z)\,\big(1-\sigma(z)\big)$$
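A quick numerical check of this identity via central finite differences (my own addition, not part of the slides):

```python
# Verify sigma'(z) = sigma(z) * (1 - sigma(z)) numerically.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4.0, 4.0, 9)
eps = 1e-6
numeric = (sigma(z + eps) - sigma(z - eps)) / (2 * eps)  # central difference
analytic = sigma(z) * (1 - sigma(z))
print(np.max(np.abs(numeric - analytic)))  # tiny (~1e-11), so they agree
```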
General form of the gradient: $\sum_m \text{Error}^{(m)}\,\sigma'(\text{in}^{(m)})\,x^{(m)}$

Gradient descent algorithm:
1. Initialize $w$
2. Compute $\nabla_w \text{Loss} = \sum_m \text{Error}^{(m)}\,\sigma'(\text{in}^{(m)})\,x^{(m)}$
3. $w \leftarrow w - \gamma\,(\nabla_w \text{Loss})$, where $\gamma$ is the learning rate
4. Repeat steps 2-3 until some condition is satisfied
Stochastic gradient descent (SGD) instead updates on one sample at a time:
1. Initialize $w$
2. For each sample $(x^{(m)}, y^{(m)})$ in the training set:
3. $\quad w \leftarrow w - \gamma\,\big(\text{Error}^{(m)}\,\sigma'(\text{in}^{(m)})\,x^{(m)}\big)$
4. Repeat loop 2-3 until some condition is satisfied
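A sketch of the SGD loop above for squared-error logistic regression (the learning rate `gamma`, the epoch count, and the stopping rule are my assumptions; the slides leave them unspecified):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd(X, y, gamma=0.1, epochs=100):
    w = np.zeros(X.shape[1])               # 1. initialize w
    for _ in range(epochs):                # 4. repeat until some condition holds
        for x_m, y_m in zip(X, y):         # 2. for each sample (x^(m), y^(m))
            in_m = x_m @ w
            error = sigma(in_m) - y_m      # Error^(m)
            grad = error * sigma(in_m) * (1.0 - sigma(in_m)) * x_m
            w = w - gamma * grad           # 3. w <- w - gamma * (gradient)
    return w
```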
Behavior of the update for each case (Error $=$ prediction $-\ y^{(m)}$):

- Prediction 0, $y^{(m)} = 0$: Error $= 0$, $w$ unchanged, new prediction 0
- Prediction 1, $y^{(m)} = 1$: Error $= 0$, $w$ unchanged, new prediction 1
- Prediction 0, $y^{(m)} = 1$: Error $= -1$, new $w = w + \sigma'(\text{in}^{(m)})\,x^{(m)}$, prediction moves toward 1
- Prediction 1, $y^{(m)} = 0$: Error $= +1$, new $w = w - \sigma'(\text{in}^{(m)})\,x^{(m)}$, prediction moves toward 0

The update moves the prediction in the right direction; e.g., for the Error $= -1$ case:
$$\big(w + \sigma'(\text{in}^{(m)})\,x^{(m)}\big)^T x^{(m)} = w^T x^{(m)} + \sigma'(\text{in}^{(m)})\,\|x^{(m)}\|^2 \;\geq\; w^T x^{(m)}$$
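Checking the inequality numerically on made-up values (a sketch, not from the slides):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.2, -0.5, 0.1])
x = np.array([1.0, 0.0, 1.0])
in_m = w @ x
w_new = w + sigma(in_m) * (1.0 - sigma(in_m)) * x   # the Error = -1 update
print(w @ x, w_new @ x)  # 0.30 -> ~0.79: w^T x grew, prediction moves toward 1
```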
- The loss can be regularized by penalizing large weights: $\frac{1}{2}\sum_m \big(\sigma(w^T x^{(m)}) - y^{(m)}\big)^2 + \lambda\,\|w\|^2$
- Gradient descent goes in the steepest-descent direction, but is slower to compute per iteration for large datasets
- SGD can be viewed as noisy descent, but is faster per iteration
- In practice, a good tradeoff is mini-batch SGD (see the sketch below)
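A sketch of the mini-batch variant (the batch size, shuffling scheme, and gradient averaging are my own choices; the slides only name the idea):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_sgd(X, y, gamma=0.1, batch_size=16, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))           # shuffle sample order
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            in_b = X[idx] @ w
            error = sigma(in_b) - y[idx]
            grad = (error * sigma(in_b) * (1.0 - sigma(in_b))) @ X[idx]
            w = w - gamma * grad / len(idx)       # average gradient over batch
    return w
```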
[Figure: two panels of training curves plotting loss against iteration (x-axis 10 to 50)]
Summary
1. …
2. …
3. …
4. …
Today's Topics

Neural Networks
1-Layer Nets (Logistic Regression)
2-Layer Nets and Model Expressiveness
Training by Backpropagation
[Figure: a 2-layer network with inputs $x_1, \dots, x_4$, hidden units $h_1, h_2, h_3$, input-to-hidden weights $w_{ij}$, and hidden-to-output weights $w_j$]

$$f(x) = \sigma\Big(\sum_j w_j\,h_j\Big) = \sigma\Big(\sum_j w_j\,\sigma\Big(\sum_i w_{ij}\,x_i\Big)\Big)$$

The hidden units $h_j$ can be viewed as new features formed by combining the $x_i$.

This is called a Multilayer Perceptron (MLP), but it is more like multilayer logistic regression.
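A sketch of this forward computation (the weight shapes and the absence of bias terms follow the slide's formula; the function names are mine):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(W1, w2, x):
    """f(x) = sigma(sum_j w_j * sigma(sum_i w_ij * x_i)).

    W1: hidden-by-input matrix of w_ij; w2: vector of w_j; x: input vector.
    """
    h = sigma(W1 @ x)    # hidden units h_j: new features combining the x_i
    return sigma(w2 @ h)
```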
Today's Topics

Neural Networks
1-Layer Nets (Logistic Regression)
2-Layer Nets and Model Expressiveness
Training by Backpropagation
[Figure: the 2-layer network, predicting $f(x^{(m)})$ on the forward pass and adjusting the weights on the backward pass]

1. For each sample, compute $f(x^{(m)}) = \sigma\big(\sum_j w_j\,\sigma\big(\sum_i w_{ij}\,x_i^{(m)}\big)\big)$
2. If $f(x^{(m)}) \neq y^{(m)}$, back-propagate the error and adjust the weights $\{w_{ij}, w_j\}$.
[Figure: a network with inputs $x_i$, hidden units $h_j$, multiple outputs $y_k$, and weights $w_{ij}$ and $w_{jk}$]

$$\frac{\partial\,\text{Loss}}{\partial w_{jk}} = \frac{\partial\,\text{Loss}}{\partial\,\text{in}_k}\,\frac{\partial\,\text{in}_k}{\partial w_{jk}} = \delta_k\,\frac{\partial\big(\sum_j w_{jk}\,h_j\big)}{\partial w_{jk}} = \delta_k\,h_j$$

$$\frac{\partial\,\text{Loss}}{\partial w_{ij}} = \frac{\partial\,\text{Loss}}{\partial\,\text{in}_j}\,\frac{\partial\,\text{in}_j}{\partial w_{ij}} = \delta_j\,\frac{\partial\big(\sum_i w_{ij}\,x_i\big)}{\partial w_{ij}} = \delta_j\,x_i$$

where

$$\delta_k = \frac{\partial\,\text{Loss}}{\partial\,\text{in}_k} = \frac{\partial}{\partial\,\text{in}_k}\,\frac{1}{2}\big[\sigma(\text{in}_k) - y_k\big]^2 = \big[\sigma(\text{in}_k) - y_k\big]\,\sigma'(\text{in}_k)$$

$$\delta_j = \frac{\partial\,\text{Loss}}{\partial\,\text{in}_j} = \sum_k \frac{\partial\,\text{Loss}}{\partial\,\text{in}_k}\,\frac{\partial\,\text{in}_k}{\partial\,\text{in}_j} = \Big[\sum_k \delta_k\,w_{jk}\Big]\,\sigma'(\text{in}_j)$$
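A sketch of these delta computations for one sample with a vector of outputs (the weight-shape conventions, W1 as hidden-by-input and W2 as output-by-hidden, are my own):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(W1, W2, x, y):
    # Forward pass, keeping the pre-activations in_j and in_k.
    in_j = W1 @ x
    h = sigma(in_j)
    in_k = W2 @ h
    out = sigma(in_k)
    # delta_k = [sigma(in_k) - y_k] * sigma'(in_k)
    delta_k = (out - y) * out * (1.0 - out)
    # delta_j = [sum_k delta_k * w_jk] * sigma'(in_j)
    delta_j = (W2.T @ delta_k) * h * (1.0 - h)
    # Gradients: dLoss/dw_ij = delta_j * x_i and dLoss/dw_jk = delta_k * h_j
    return np.outer(delta_j, x), np.outer(delta_k, h)
```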
Backpropagation Algorithm

All updates involve some scaled error from the output times an input feature:
$$\frac{\partial\,\text{Loss}}{\partial w_{jk}} = \delta_k\,h_j \qquad\qquad \frac{\partial\,\text{Loss}}{\partial w_{ij}} = \delta_j\,x_i$$

After the forward pass, compute $\delta_k$ from the final layer, then $\delta_j$ for the previous layer. For deeper nets, iterate backwards further.

[Figure: the multi-output network with outputs $y_k$, hidden units $h_j$, and inputs $x_i$]
Summary
1. …
References I

Bengio, Y. (2009). Learning Deep Architectures for AI. Foundations and Trends in Machine Learning. NOW Publishers.

Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press.

Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer.

Lee, H., Grosse, R., Ranganath, R., and Ng, A. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML.