
Deep Learning & Neural Networks

Lecture 1
Kevin Duh
Graduate School of Information Science
Nara Institute of Science and Technology

Jan 14, 2014


What is Deep Learning?


A family of methods that uses deep architectures to learn high-level
feature representations



Example of Trainable Features


Hierarchical object-parts features in Computer Vision [Lee et al., 2009]



Course Outline

Goal:
To understand the foundations of neural networks and deep learning, at a level sufficient for reading recent research papers

Schedule:
- Lecture 1 (Jan 14): Machine Learning background & Neural Networks
- Lecture 2 (Jan 16): Deep Architectures (DBN, SAE)
- Lecture 3 (Jan 21): Applications in Vision, Speech, Language
- Lecture 4 (Jan 23): Advanced topics in optimization

Prerequisites:
- Basic calculus, probability, linear algebra

Course Material

Course Website: http://cl.naist.jp/~kevinduh/a/deep2014/

Useful References:
1. Yoshua Bengio's short book Learning Deep Architectures for AI [Bengio, 2009]: http://www.iro.umontreal.ca/~bengioy/papers/ftml.pdf
2. Yann LeCun & Marc'Aurelio Ranzato's ICML 2013 tutorial: http://techtalks.tv/talks/deep-learning/58122/
3. Richard Socher et al.'s NAACL 2013 tutorial: http://www.socher.org/index.php/DeepLearningTutorial/
4. Geoff Hinton's Coursera course: https://www.coursera.org/course/neuralnets
5. Theano code samples: http://deeplearning.net/tutorial/
6. Chris Bishop's book Pattern Recognition and Machine Learning (PRML): http://research.microsoft.com/en-us/um/people/cmbishop/prml/

Grading

The only criterion for grading: Are you actively participating and asking questions in class?
- If you ask (or answer) 3+ questions, grade = A
- If you ask (or answer) 2 questions, grade = B
- If you ask (or answer) 1 question, grade = C
- If you don't ask (or answer) any questions, you get no credit.


Best Advice I got while in Grad School

Always Ask Questions!

- If you don't understand, you must ask questions in order to understand.
- If you understand, you will naturally have questions.
- Having no questions is a sign that you are not thinking.

Today's Topics

Machine Learning background
- Why is Machine Learning needed?
- Main Concepts: Generalization, Model Expressiveness, Overfitting
- Formal Notation

Neural Networks
- 1-Layer Nets (Logistic Regression)
- 2-Layer Nets and Model Expressiveness
- Training by Backpropagation



Write a Program to Recognize the Digit 2

- This is hard to do manually!
    bool recognizeDigitAs2(int** imagePixels){...}

- Machine Learning solution:
  1. Assume you have a database (training data) of 2s and non-2s.
  2. Automatically learn this function from the data.

*example from Hinton's Coursera course


A Machine Learning Solution

Training data are represented as pixel matrices:
- The classifier is parameterized by a weight matrix of the same dimension.
- Training procedure:
  1. When we observe a 2, add 1 to the corresponding matrix elements
  2. When we observe a non-2, subtract 1 from the corresponding matrix elements

Example pixel matrix of a 2 (the weight matrix starts at all zeros):

   0 0 0 0 0 0 0 0 0 0
   0 1 1 1 1 1 1 0 0 0
   0 0 0 0 0 0 1 0 0 0
   0 0 0 0 0 0 1 0 0 0
   0 0 0 0 0 1 0 0 0 0
   0 0 0 0 1 0 0 0 0 0
   0 0 0 1 0 0 0 0 0 0
   0 0 1 0 0 0 0 0 0 0
   0 1 1 1 1 1 1 1 1 0
   0 0 0 0 0 0 0 0 0 0

Learned weight matrix (after the +1/−1 updates):

   0  0  0  0  0  0  0  0  0  0
   0  1  1  1  1  1  0  0  0  0
   0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  1 -1  0  0  0
   0  0  0  0  1  0 -1  0  0  0
   0  0  0  1  0  0 -1  0  0  0
   0  0  1  0  0  0 -1  0  0  0
   0  1  1  1  1  1 -1  1  1  0
   0  0  0  0  0  0  0  0  0  0

Test procedure: given a new image, take the sum of the element-wise product.
- If positive, predict 2; else predict non-2.
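The add/subtract training rule and the element-wise product test fit in a few lines. A minimal NumPy sketch (the function names and the assumption that images arrive as 10×10 binary arrays are illustrative, not from the slides):

    import numpy as np

    def train_template(twos, non_twos, shape=(10, 10)):
        """Build the weight matrix by counting: +1 for 2s, -1 for non-2s."""
        W = np.zeros(shape)
        for img in twos:          # each img: binary array of `shape`
            W += img              # observe a 2: add 1 to matching pixels
        for img in non_twos:
            W -= img              # observe a non-2: subtract 1
        return W

    def predict_is_two(W, img):
        return np.sum(W * img) > 0   # element-wise product; positive => predict "2"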



Generalization ≠ Memorization

- Key issue in Machine Learning: training data is limited
- If the classifier just memorizes the training data, it may perform poorly on new data
- Generalization is the ability to extend accurate predictions to new data

E.g. consider a shifted image of a 2. Will this classifier generalize?
[The slide repeats the learned weight matrix from the previous slide.]

Generalization ≠ Memorization

One potential way to increase generalization ability: discretize the weight matrix with larger grids (fewer weights to train).

E.g. consider the shifted image again: now will this classifier generalize?

[The slide shows the 10×10 learned weight matrix alongside a coarser 5×5 weight grid:]

   1  1  1  0  0
   0  0  0  0  0
   0  0  1 -1  0
   0  1  0 -1  0
   1  1  1  0  1

Model Expressiveness and Overfitting

- A model with more weight parameters may fit the training data better
- But since training data is limited, more expressive models run the risk of overfitting to peculiarities of the data

  Less Expressive Model (fewer weights): underfits the training data
  More Expressive Model (more weights):  overfits the training data

Model Expressiveness and Overfitting

Fitting the training data (blue points: x_n) with a polynomial model
  f(x) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M
under the squared error objective (1/2) Σ_n (f(x_n) − t_n)^2

[Fits for polynomial order M = 0, 1, 3, 9; from PRML Chapter 1 [Bishop, 2006]]
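A small sketch of this polynomial-fitting experiment; the sine-plus-noise data follows the spirit of the PRML example, but the exact sample size and noise level are assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.linspace(0.0, 1.0, 10)
    t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, size=10)  # noisy targets

    for M in (0, 1, 3, 9):
        w = np.polyfit(x_train, t_train, deg=M)              # least-squares polynomial of order M
        train_err = 0.5 * np.sum((np.polyval(w, x_train) - t_train) ** 2)
        print(M, train_err)  # training error keeps shrinking with M; M=9 fits the noise (overfitting)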



Basic Problem Setup in Machine Learning

- Training Data: a set of pairs (x^(m), y^(m)), m = 1, 2, ..., M, where input x^(m) ∈ R^d and output y^(m) ∈ {0, 1}
  - e.g. x = vectorized image pixels, y = 2 or non-2
- Goal: learn a function f: x → y that predicts correctly on new inputs x
  - Step 1: Choose a function model family F, e.g. logistic regression, support vector machines, neural networks
  - Step 2: Optimize parameters w on the Training Data, e.g. minimize the loss function min_w Σ_{m=1}^{M} (f_w(x^(m)) − y^(m))^2



1-Layer Nets (Logistic Regression)

- Function model: f(x) = σ(w^T x + b)
  - Parameters: weight vector w ∈ R^d and scalar bias term b
  - σ is a non-linearity, e.g. the sigmoid: σ(z) = 1/(1 + exp(−z))
  - For simplicity, we sometimes write f(x) = σ(w^T x), where w = [w; b] and x = [x; 1]
- The non-linearity will be important for the expressiveness of multi-layer nets
  - Other non-linearities exist, e.g. tanh(z) = (e^z − e^(−z))/(e^z + e^(−z))
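A minimal NumPy sketch of this model (variable names are illustrative assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict(w, b, x):
        # f(x) = sigmoid(w^T x + b)
        return sigmoid(np.dot(w, x) + b)

    def predict_folded(w_tilde, x):
        # Folded form: w = [w; b], x = [x; 1]
        return sigmoid(np.dot(w_tilde, np.append(x, 1.0)))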


Training 1-Layer Nets: Gradient

- Assume the Squared-Error Loss(w) = (1/2) Σ_m (σ(w^T x^(m)) − y^(m))^2
- Gradient: ∇_w Loss = Σ_m (σ(w^T x^(m)) − y^(m)) σ'(w^T x^(m)) x^(m)
  - General form of the gradient: Σ_m Error^(m) · σ'(in^(m)) · x^(m)
- Derivative of the sigmoid σ(z) = 1/(1 + exp(−z)):

    σ'(z) = d/dz [1 + exp(−z)]^(−1)
          = −[1 + exp(−z)]^(−2) · d/dz [1 + exp(−z)]
          = −[1 + exp(−z)]^(−2) · exp(−z) · (−1)
          = [1/(1 + exp(−z))] · [exp(−z)/(1 + exp(−z))]
          = σ(z)(1 − σ(z))

*An alternative is the Cross-Entropy loss:
  −Σ_m [ y^(m) log σ(w^T x^(m)) + (1 − y^(m)) log(1 − σ(w^T x^(m))) ]
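A sketch of these formulas in code, including a finite-difference check of σ'(z) = σ(z)(1 − σ(z)); the helper names and test points are assumptions:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def dsigmoid(z):
        s = sigmoid(z)
        return s * (1.0 - s)          # sigma'(z) = sigma(z)(1 - sigma(z))

    # Finite-difference check of the derivative identity at a few points
    for z in (-2.0, 0.0, 3.0):
        numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6
        assert abs(numeric - dsigmoid(z)) < 1e-6

    def squared_error_gradient(w, X, y):
        # X: (M, d) inputs, y: (M,) labels
        z = X @ w                       # in^(m) = w^T x^(m)
        err = sigmoid(z) - y            # Error^(m)
        return X.T @ (err * dsigmoid(z))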


Training 1-Layer Nets: Gradient Descent Algorithm

General form of the gradient: Σ_m Error^(m) · σ'(in^(m)) · x^(m)

Gradient descent algorithm:
1. Initialize w
2. Compute ∇_w Loss = Σ_m Error^(m) · σ'(in^(m)) · x^(m)
3. w ← w − η (∇_w Loss)
4. Repeat steps 2-3 until some condition is satisfied

Stochastic gradient descent (SGD) algorithm:
1. Initialize w
2. For each sample (x^(m), y^(m)) in the training set:
3.   w ← w − η (Error^(m) · σ'(in^(m)) · x^(m))
4. Repeat loop 2-3 until some condition is satisfied

The learning rate η > 0 and the stopping condition are important in practice
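A sketch of both loops for the squared-error loss (the learning rate, epoch count, and zero initialization are arbitrary assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def dsigmoid(z):
        s = sigmoid(z)
        return s * (1.0 - s)

    def gradient_descent(X, y, eta=0.1, epochs=100):
        w = np.zeros(X.shape[1])                           # 1. initialize w
        for _ in range(epochs):                            # 4. repeat
            z = X @ w
            grad = X.T @ ((sigmoid(z) - y) * dsigmoid(z))  # 2. full gradient over all samples
            w = w - eta * grad                             # 3. w <- w - eta * grad
        return w

    def sgd(X, y, eta=0.1, epochs=100):
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for x_m, y_m in zip(X, y):                     # one update per training sample
                z = w @ x_m
                w = w - eta * (sigmoid(z) - y_m) * dsigmoid(z) * x_m
        return w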


Intuition of SGD update

For some sample (x^(m), y^(m)):
  w ← w − η ((σ(w^T x^(m)) − y^(m)) · σ'(w^T x^(m)) · x^(m))

  σ(w^T x^(m)) | y^(m) | Error | new w                    | new prediction
  0            | 0     |  0    | no change                | 0
  1            | 1     |  0    | no change                | 1
  0            | 1     | -1    | w + η σ'(in^(m)) x^(m)   | moves toward 1
  1            | 0     | +1    | w − η σ'(in^(m)) x^(m)   | moves toward 0

- (w + η σ'(in^(m)) x^(m))^T x^(m) = w^T x^(m) + η σ'(in^(m)) ||x^(m)||^2 ≥ w^T x^(m)
- σ'(w^T x^(m)) is near 0 when the prediction is confident, near 0.25 when it is uncertain
- Large η = more aggressive updates; small η = more conservative updates
- SGD improves classification for the current sample, but there is no guarantee about the others

Geometric view of SGD update

[Contour plot of the loss objective (1/2) Σ_m (σ(w^T x^(m)) − y^(m))^2 + λ||w||]

- Gradient descent moves in the steepest-descent direction, but is slower to compute per iteration for large datasets
- SGD can be viewed as noisy descent, but is faster per iteration
- In practice, a good tradeoff is mini-batch SGD
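A sketch of the mini-batch tradeoff (batch size, shuffling, and other details are assumptions; the slides only name the idea):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def minibatch_sgd(X, y, eta=0.1, batch_size=32, epochs=100, seed=0):
        rng = np.random.default_rng(seed)
        M, d = X.shape
        w = np.zeros(d)
        for _ in range(epochs):
            order = rng.permutation(M)                  # visit the data in a fresh random order
            for start in range(0, M, batch_size):
                idx = order[start:start + batch_size]
                s = sigmoid(X[idx] @ w)
                grad = X[idx].T @ ((s - y[idx]) * s * (1.0 - s))   # gradient on this mini-batch only
                w = w - eta * grad
        return w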

Effect of Learning Rate on Convergence Speed

- SGD update: w ← w − η (Error^(m) · σ'(in^(m)) · x^(m))
  - Ideally, η should be as large as possible without causing divergence
  - Common heuristic: η(t) = η0 / (1 + t), i.e. O(1/t)
- Analysis by [Schaul et al., 2013]
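A tiny sketch of that heuristic (the value of η0 is an arbitrary assumption):

    def learning_rate(t, eta0=0.5):
        # eta(t) = eta0 / (1 + t): large (aggressive) steps early, small (conservative) steps later
        return eta0 / (1.0 + t)

    # e.g. inside an SGD loop:  w = w - learning_rate(t) * grad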


Generalization issues: Regularization and Early-stopping

- Optimizing Loss(w) = (1/2) Σ_m (σ(w^T x^(m)) − y^(m))^2 on the training data does not necessarily lead to generalization
- Adding regularization: Loss(w) = (1/2) Σ_m (σ(w^T x^(m)) − y^(m))^2 + λ||w||
  - reduces sensitivity to the training input and decreases the risk of overfitting
- Early Stopping:
  - Prepare separate training and validation (development) data
  - Optimize Loss(w) on the training data, but stop when Loss(w) on the validation data stops improving

[Training and validation error curves from Chapter 5, [Bishop, 2006]]
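A sketch of early stopping wrapped around a gradient-descent loop (the patience counter is an added convenience, not something the slides specify):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def loss(w, X, y, lam=0.0):
        return 0.5 * np.sum((sigmoid(X @ w) - y) ** 2) + lam * np.linalg.norm(w)

    def train_early_stopping(X_tr, y_tr, X_val, y_val, eta=0.1, max_epochs=1000, patience=5):
        w = np.zeros(X_tr.shape[1])
        best_w, best_val, bad = w.copy(), np.inf, 0
        for _ in range(max_epochs):
            s = sigmoid(X_tr @ w)
            w = w - eta * (X_tr.T @ ((s - y_tr) * s * (1.0 - s)))  # gradient step on training data
            val = loss(w, X_val, y_val)
            if val < best_val:
                best_w, best_val, bad = w.copy(), val, 0           # validation still improving
            else:
                bad += 1
                if bad >= patience:                                # validation stopped improving
                    break
        return best_w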


Summary

1. Given Training Data: (x^(m), y^(m)), m = 1, 2, ..., M
2. Optimize a model f(x) = σ(w^T x + b) under Loss(w) = (1/2) Σ_m (σ(w^T x^(m)) − y^(m))^2
3. General form of the gradient: Σ_m Error^(m) · σ'(in^(m)) · x^(m)
4. SGD algorithm: for each sample (x^(m), y^(m)) in the training set, w ← w − η (Error^(m) · σ'(in^(m)) · x^(m))

Important issues:
- Optimization speed/convergence: batch vs. mini-batch, learning rate
- Generalization ability: regularization, early-stopping



2-Layer Neural Networks

[Network diagram: inputs x_1, ..., x_4 connect to hidden units h_1, h_2, h_3 via weights w_ij; the hidden units connect to the output y via weights w_j]

f(x) = σ(Σ_j w_j h_j) = σ(Σ_j w_j σ(Σ_i w_ij x_i))

- The hidden units h_j can be viewed as new features obtained by combining the inputs x_i
- Called a Multilayer Perceptron (MLP), but it is more like multilayer logistic regression
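A sketch of this forward pass (layer sizes and random weights are illustrative assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def mlp_forward(x, W_in, w_out):
        # x: (d,), W_in: (H, d) input-to-hidden weights w_ij, w_out: (H,) hidden-to-output weights w_j
        h = sigmoid(W_in @ x)        # hidden units h_j = sigma(sum_i w_ij x_i)
        return sigmoid(w_out @ h)    # output f(x) = sigma(sum_j w_j h_j)

    # Example with 4 inputs and 3 hidden units, as in the diagram
    rng = np.random.default_rng(0)
    W_in, w_out = rng.normal(size=(3, 4)), rng.normal(size=3)
    print(mlp_forward(np.array([1.0, 0.0, -1.0, 0.5]), W_in, w_out))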

Modeling complex non-linearities

- Given the same number of units (with non-linear activation), a deeper architecture is more expressive than a shallow one [Bishop, 1995]
  - 1-layer nets only model linear hyperplanes
  - 2-layer nets are universal function approximators: given infinitely many hidden nodes, they can express any continuous function
  - ≥3-layer nets can do so with fewer nodes/weights


Training a 2-Layer Net with Backpropagation

[Network diagram: the forward pass predicts f(x^(m)) from inputs x_1, ..., x_4 through hidden units h_1, h_2, h_3; the backward pass adjusts the weights]

1. For each sample, compute f(x^(m)) = σ(Σ_j w_j σ(Σ_i w_ij x_i^(m)))
2. If f(x^(m)) ≠ y^(m), back-propagate the error and adjust the weights {w_ij, w_j}


Derivatives of the weights

Assume two outputs (y_1, y_2) per input x, and loss per sample: Loss = Σ_k (1/2) [σ(in_k) − y_k]^2

[Network diagram: inputs x_i → hidden units h_j (weights w_ij) → outputs y_k (weights w_jk)]

∂Loss/∂w_jk = (∂Loss/∂in_k)(∂in_k/∂w_jk) = δ_k · ∂(Σ_j w_jk h_j)/∂w_jk = δ_k h_j
∂Loss/∂w_ij = (∂Loss/∂in_j)(∂in_j/∂w_ij) = δ_j · ∂(Σ_i w_ij x_i)/∂w_ij = δ_j x_i

δ_k = ∂Loss/∂in_k = ∂/∂in_k Σ_k (1/2) [σ(in_k) − y_k]^2 = [σ(in_k) − y_k] σ'(in_k)
δ_j = Σ_k (∂Loss/∂in_k)(∂in_k/∂in_j) = Σ_k δ_k · ∂(Σ_j w_jk σ(in_j))/∂in_j = [Σ_k δ_k w_jk] σ'(in_j)

Backpropagation Algorithm

All updates involve some scaled error from the output × an input feature:

∂Loss/∂w_jk = δ_k h_j, where δ_k = [σ(in_k) − y_k] σ'(in_k)
∂Loss/∂w_ij = δ_j x_i, where δ_j = [Σ_k δ_k w_jk] σ'(in_j)

After the forward pass, compute δ_k for the final layer, then δ_j for the previous layer. For deeper nets, iterate backwards further.

[Network diagram: inputs x_i → hidden units h_j (weights w_ij) → outputs y_k (weights w_jk)]
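A sketch of one backpropagation update for a single sample, following the δ notation above (the layer sizes and the omission of bias terms are simplifying assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_step(x, y, W_in, W_out, eta=0.1):
        # Forward pass
        in_j = W_in @ x                      # hidden pre-activations
        h = sigmoid(in_j)
        in_k = W_out @ h                     # output pre-activations
        out = sigmoid(in_k)

        # Backward pass: output-layer deltas first, then hidden-layer deltas
        delta_k = (out - y) * out * (1.0 - out)         # [sigma(in_k) - y_k] * sigma'(in_k)
        delta_j = (W_out.T @ delta_k) * h * (1.0 - h)   # [sum_k delta_k w_jk] * sigma'(in_j)

        # Gradients and SGD-style updates: dLoss/dw_jk = delta_k h_j, dLoss/dw_ij = delta_j x_i
        W_out = W_out - eta * np.outer(delta_k, h)
        W_in = W_in - eta * np.outer(delta_j, x)
        return W_in, W_out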


Summary

1. By extending from a 1-layer to a 2-layer net, we get a dramatic increase in model expressiveness
2. Backpropagation is an efficient way to train 2-layer nets:
   - Similar to SGD for a 1-layer net, just with more chaining in the gradient
   - General form: update w_ij by δ_j x_i, where δ_j is a scaled/weighted sum of the errors from the outgoing layers
3. Ideally, we want even deeper architectures
   - But Backpropagation becomes ineffective due to vanishing gradients
   - Deep Learning comes to the rescue! (next lecture)

References

Bengio, Y. (2009). Learning Deep Architectures for AI. Foundations and Trends in Machine Learning. NOW Publishers.
Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press.
Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer.
Lee, H., Grosse, R., Ranganath, R., and Ng, A. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML.
Schaul, T., Zhang, S., and LeCun, Y. (2013). No more pesky learning rates. In Proc. International Conference on Machine Learning (ICML 2013).
