
Deep Learning Tutorial

Hung-yi Lee
Deep learning attracts lots of attention.
I believe you have seen lots of exciting results before.

Deep learning trends at Google. Source: SIGMOD/Jeff Dean

This talk focuses on the basic techniques.


Outline

Lecture I: Introduction of Deep Learning

Lecture II: Tips for Training Deep Neural Network

Lecture III: Variants of Neural Network

Lecture IV: Next Wave


Lecture I: Introduction of Deep Learning

Outline of Lecture I

Introduction of Deep Learning


(Let's start with general machine learning.)
Why Deep?

Hello World for Deep Learning


Machine Learning ≈ Looking for a Function
Speech Recognition: f(audio clip) = "How are you"
Image Recognition: f(image) = "Cat"
Playing Go: f(board position) = "5-5" (next move)
Dialogue System: f("Hi") = "Hello" (what the user said → system response)
Image Recognition: Framework
Goal: f(image) = "cat"

Model: a set of functions f1, f2, ...
  f1(cat image) = "cat", f1(dog image) = "dog"
  f2(cat image) = "money", f2(dog image) = "snake"
Goodness of function f: judged on the training data. Here f1 is better!

Supervised Learning
Training data: function inputs (images) paired with function outputs ("monkey", "cat", "dog").

Training
  Step 1: a set of functions f1, f2, ... (the model)
  Step 2: goodness of function f, measured with the training data
  Step 3: pick the "best" function f*
Testing
  Using f*: f*(image) = "cat"
Three Steps for Deep Learning

Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function

Deep Learning is so simple


Three Steps for Deep Learning

Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function

Deep Learning is so simple


Human Brains
Neural Network
Neuron: a simple function
$z = a_1 w_1 + \cdots + a_k w_k + \cdots + a_K w_K + b$
$a = \sigma(z)$
Inputs $a_1, \dots, a_K$; weights $w_1, \dots, w_K$; bias $b$; activation function $\sigma$.
Neural Network
Neuron with Sigmoid Function
$\sigma(z) = \dfrac{1}{1 + e^{-z}}$
Example: inputs (2, -1, 1), weights (1, -2, -1), and bias 1 give z = 4, so the activation function outputs σ(4) ≈ 0.98.
Neural Network
Different connections lead to different network structures.
Each neuron can have different values of weights and biases.
Weights and biases are the network parameters $\theta$.
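Fully Connected Feedforward Network
Example with sigmoid activations, $\sigma(z) = \dfrac{1}{1 + e^{-z}}$, two inputs, and three layers of two neurons each:
  Layer 1: neuron 1 has weights (1, -2) and bias 1; neuron 2 has weights (-1, 1) and bias 0.
  Layer 2: neuron 1 has weights (2, -1) and bias 0; neuron 2 has weights (-2, -1) and bias 0.
  Layer 3: neuron 1 has weights (3, -1) and bias -2; neuron 2 has weights (-1, 4) and bias 2.
Input (1, -1): layer 1 has z = (4, -2) and outputs (0.98, 0.12); layer 2 outputs (0.86, 0.11); layer 3 outputs (0.62, 0.83).
Input (0, 0): layer 1 outputs (0.73, 0.5); layer 2 outputs (0.72, 0.12); layer 3 outputs (0.51, 0.85).

This is a function. Input vector, output vector:
$f([1, -1]) = [0.62, 0.83]$, $f([0, 0]) = [0.51, 0.85]$
Given parameters $\theta$, define a function.
Given network structure, define a function set.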
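The forward pass of the example network can be checked with a few lines of code. This is a minimal sketch, not the lecture's own code: the weights and biases below are a reading of the numbers on the example slides (an assumption), and the names `layer`, `forward`, and `NETWORK` are made up for illustration.

```python
import math


def sigmoid(z):
    """Sigmoid activation: 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer; weights[i] holds the weights of neuron i."""
    return [
        sigmoid(sum(w * a for w, a in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# (weights, biases) for each layer, as read off the example slides (assumption).
NETWORK = [
    ([[1, -2], [-1, 1]], [1, 0]),
    ([[2, -1], [-2, -1]], [0, 0]),
    ([[3, -1], [-1, 4]], [-2, 2]),
]


def forward(x, network=NETWORK):
    """Feed the input vector through every layer in turn."""
    a = x
    for weights, biases in network:
        a = layer(a, weights, biases)
    return a


print([round(v, 2) for v in forward([1, -1])])  # [0.62, 0.83]
print([round(v, 2) for v in forward([0, 0])])   # [0.51, 0.85]
```

Changing any number in `NETWORK` gives a different function; keeping the structure and varying the numbers spans the function set.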
Fully Connected Feedforward Network
[Figure: input layer $x_1, x_2, \dots, x_N$; hidden layers Layer 1, Layer 2, ..., Layer L, each made of neurons; output layer $y_1, y_2, \dots, y_M$]
Deep means many hidden layers.
Output Layer (Option)
Softmax layer as the output layer

Ordinary Layer
$y_1 = \sigma(z_1)$, $y_2 = \sigma(z_2)$, $y_3 = \sigma(z_3)$
In general, the output of the network can be any value, which may not be easy to interpret.

Softmax Layer
$y_i = \dfrac{e^{z_i}}{\sum_{j=1}^{3} e^{z_j}}$
Probability: $1 > y_i > 0$ and $\sum_i y_i = 1$
Example: $z = (3, 1, -3)$ gives $e^{z} \approx (20, 2.7, 0.05)$ and $y \approx (0.88, 0.12, 0)$.
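The softmax example can be reproduced directly. A minimal sketch in plain Python; subtracting the maximum before exponentiating is a common numerical-stability trick that is not on the slides and does not change the result.

```python
import math


def softmax(z):
    """Turn arbitrary scores z into probabilities that sum to 1."""
    m = max(z)  # shift for numerical stability; cancels out in the ratio
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


y = softmax([3, 1, -3])
print([round(v, 2) for v in y])  # [0.88, 0.12, 0.0]
```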
Example Application: Handwriting Digit Recognition

Input: a 16 x 16 = 256 pixel image as a 256-dim vector $x_1, \dots, x_{256}$ (ink → 1, no ink → 0).
Output: a 10-dim vector $y_1, \dots, y_{10}$. Each dimension represents the confidence of a digit ($y_1$: is 1, $y_2$: is 2, ..., $y_{10}$: is 0).
Example: $y_1 = 0.1$, $y_2 = 0.7$, ..., $y_{10} = 0.2$, so the image is "2".

What is needed is a function (the neural network) with
Input: 256-dim vector, output: 10-dim vector.

The network (input layer, hidden layers Layer 1 ... Layer L, output layer) is a function set containing the candidates for Handwriting Digit Recognition.

You need to decide the network structure so that a good function is in your function set.
FAQ

Q: How many layers? How many neurons for each layer?

Trial and Error + Intuition

Q: Can the structure be automatically determined?


Three Steps for Deep Learning

Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function

Deep Learning is so simple


Training Data
Preparing training data: images and their labels

[Figure: example digit images with the labels "5", "0", "4", "1", "9", "2", "1", "3"]

The learning target is defined on the training data.
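Learning Target
[Figure: network with inputs $x_1, \dots, x_{256}$ (16 x 16 = 256 pixels; ink → 1, no ink → 0), a softmax output layer, and outputs $y_1$ (is 1), $y_2$ (is 2), ..., $y_{10}$ (is 0)]
The learning target is:
Input: image of "1" → $y_1$ has the maximum value
Input: image of "2" → $y_2$ has the maximum value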


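Loss
A good function should make the loss of all examples as small as possible.
[Figure: given a set of parameters, an image of "1" is fed to the network; the outputs $y_1, y_2, \dots, y_{10}$ should be as close as possible to the target (1, 0, ..., 0), and the loss $l$ measures the gap]
Loss can be the distance between the network output and target.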
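Total Loss
For all training data: $L = \sum_{r=1}^{R} l_r$
[Figure: each training example $x^r$ is fed to the network (NN); its output $y^r$ is compared with the target $\hat{y}^r$ to give the loss $l_r$, for $r = 1, \dots, R$]
Make $L$ as small as possible:
Find a function in the function set that minimizes total loss L.
Find the network parameters $\theta^*$ that minimize total loss L.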
Three Steps for Deep Learning

Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function

Deep Learning is so simple


How to pick the best function
Find network parameters $\theta^*$ that minimize total loss L.
Network parameters $\theta = \{w_1, w_2, w_3, \dots, b_1, b_2, b_3, \dots\}$
Enumerate all possible values? Not feasible: there are millions of parameters.
E.g. speech recognition: 8 layers and 1000 neurons each layer, so there are $10^6$ weights between Layer l (1000 neurons) and Layer l+1 (1000 neurons).
Gradient Descent
Network parameters $\theta = \{w_1, w_2, \dots, b_1, b_2, \dots\}$
Find network parameters $\theta^*$ that minimize total loss L.
[Figure: total loss L plotted against one parameter w]
Pick an initial value for w. Random or RBM pre-train; random is usually good enough.
Compute $\partial L / \partial w$. Negative → increase w. Positive → decrease w.
Update $w \leftarrow w - \eta \, \partial L / \partial w$, where $\eta$ is called the "learning rate".
Repeat until $\partial L / \partial w$ is approximately small (when the update is little).
Image source: http://chico386.pixnet.net/album/photo/171572850
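The update rule for a single parameter can be sketched in code. The quadratic loss $L(w) = (w - 3)^2$, the starting point, and the stopping tolerance below are illustrative assumptions, not values from the slides.

```python
def gradient_descent(grad, w0, eta=0.1, tol=1e-6, max_steps=10000):
    """Repeat w <- w - eta * dL/dw until the derivative is approximately zero."""
    w = w0
    steps_taken = 0
    for _ in range(max_steps):
        g = grad(w)
        if abs(g) < tol:  # the update would be tiny, so stop
            break
        w = w - eta * g   # negative derivative increases w, positive decreases it
        steps_taken += 1
    return w, steps_taken


def toy_grad(w):
    """Derivative of the hypothetical loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


w_star, steps = gradient_descent(toy_grad, w0=-4.0)
print(round(w_star, 4), steps)
```

On this toy loss there is a single minimum at w = 3, so the starting point does not matter; on a real network's loss surface it does.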
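Gradient Descent
Compute $\partial L / \partial w_1$, $\partial L / \partial w_2$, ..., $\partial L / \partial b_1$, ... and update every parameter, e.g. $w_1 \leftarrow w_1 - \eta \, \partial L / \partial w_1$.
$\nabla L = \left[ \partial L / \partial w_1, \; \partial L / \partial w_2, \; \dots, \; \partial L / \partial b_1, \; \dots \right]^T$ is the gradient.
Example values over two updates:
$w_1$: 0.2 → 0.15 → 0.09
$w_2$: -0.1 → 0.05 → 0.15
$b_1$: 0.3 → 0.2 → 0.10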
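Gradient Descent
[Figure: contour plot over $(w_1, w_2)$; color is the value of the total loss L]
Randomly pick a starting point $(w_1, w_2)$.
Compute $\partial L / \partial w_1$, $\partial L / \partial w_2$ and move by $(-\eta \, \partial L / \partial w_1, \; -\eta \, \partial L / \partial w_2)$.
Hopefully, we will reach a minimum.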
Gradient Descent - Difficulty
Gradient descent never guarantees reaching the global minimum.
[Figure: total loss L over $(w_1, w_2)$ with several minima]
Different initial points reach different minima, so the results differ.
There are some tips to help you avoid local minima, but no guarantee.
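Gradient Descent
You are playing Age of Empires: you cannot see the whole map.
[Figure: loss surface over $(w_1, w_2)$ where only the neighborhood of the current point is visible; compute $\partial L / \partial w_1$, $\partial L / \partial w_2$ and move by $(-\eta \, \partial L / \partial w_1, \; -\eta \, \partial L / \partial w_2)$]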
Gradient Descent
This is the "learning" of machines in deep learning.
Even AlphaGo uses this approach.
[Figure: what people imagine vs. what it actually is]
I hope you are not too disappointed :p


Backpropagation
Backpropagation: an efficient way to compute $\partial L / \partial w$
Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html

Don't worry about $\partial L / \partial w$; the toolkits will handle it.


Concluding Remarks

Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function

Deep Learning is so simple


Outline of Lecture I

Introduction of Deep Learning

Why Deep?

Hello World for Deep Learning


Deeper is Better?
Layer X Size   Word Error Rate (%)   |   Layer X Size   Word Error Rate (%)
1 X 2k         24.2                  |
2 X 2k         20.4                  |
3 X 2k         18.4                  |
4 X 2k         17.8                  |
5 X 2k         17.2                  |   1 X 3772       22.5
7 X 2k         17.1                  |   1 X 4634       22.6
                                     |   1 X 16k        22.1
Not surprising: more parameters, better performance.
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Universality Theorem
Any continuous function $f : R^N \to R^M$ can be realized by a network with one hidden layer (given enough hidden neurons).
Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html

Why "Deep" neural network, not "Fat" neural network?


Fat + Short v.s. Thin + Tall
With the same number of parameters, which one is better?
[Figure: a shallow (fat + short) network and a deep (thin + tall) network, both taking inputs $x_1, x_2, \dots, x_N$]
Fat + Short v.s. Thin + Tall
Layer X Size   Word Error Rate (%)   |   Layer X Size   Word Error Rate (%)
1 X 2k         24.2                  |
2 X 2k         20.4                  |
3 X 2k         18.4                  |
4 X 2k         17.8                  |
5 X 2k         17.2                  |   1 X 3772       22.5
7 X 2k         17.1                  |   1 X 4634       22.6
                                     |   1 X 16k        22.1
Why?
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Analogy
Logic circuits:
  Logic circuits consist of gates.
  Two layers of logic gates can represent any Boolean function.
  Using multiple layers of logic gates to build some functions is much simpler (fewer gates needed).
Neural network:
  A neural network consists of neurons.
  A network with one hidden layer can represent any continuous function.
  Using multiple layers of neurons to represent some functions is much simpler (fewer parameters, less data?).

This page is for readers with an EE background.


Modularization
Deep → Modularization
[Figure: an image is fed to four classifiers. Classifier 1: girls with long hair; Classifier 2: boys with long hair; Classifier 3: girls with short hair; Classifier 4: boys with short hair]
Classifier 2 is weak: there are few examples of boys with long hair.

Modularization
Deep → Modularization
[Figure: the image is first fed to basic classifiers, the classifiers for the attributes: "Boy or Girl?" and "Long or short?"]
Each basic classifier can have sufficient training examples.

Modularization
Deep → Modularization
[Figure: the basic classifiers "Boy or Girl?" and "Long or short?" are shared by the following classifiers as modules: Classifier 1 (girls with long hair), Classifier 2 (boys with long hair), Classifier 3 (girls with short hair), Classifier 4 (boys with short hair)]
The following classifiers can be trained with little data: Classifier 2 is fine even with little data.

Modularization
Deep → Modularization → Less training data?
[Figure: network with inputs $x_1, x_2, \dots, x_N$. The first layer holds the most basic classifiers; the second layer uses the 1st layer as modules to build classifiers; the third layer uses the 2nd layer as modules; and so on]
The modularization is automatically learned from data.
Reference: Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision - ECCV 2014 (pp. 818-833).
Outline of Lecture I

Introduction of Deep Learning

Why Deep?

Hello World for Deep Learning


Keras
TensorFlow or Theano: very flexible, but need some effort to learn.
If you want to learn Theano:
http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html
http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/RNN%20training%20(v6).ecm.mp4/index.html
Keras: an interface of TensorFlow or Theano. Easy to learn and use (still has some flexibility). You can modify it if you can write TensorFlow or Theano.

Keras
François Chollet is the author of Keras.
He currently works for Google as a deep learning engineer and researcher.
Keras means horn in Greek.
Documentation: http://keras.io/
Example: https://github.com/fchollet/keras/tree/master/examples

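Keras
Example Application: Handwriting Digit Recognition
[Figure: a 28 x 28 image is fed to the machine, which outputs "1"]
MNIST Data: http://yann.lecun.com/exdb/mnist/
"Hello world" for deep learning
Keras provides data set loading functions: http://keras.io/datasets/

Keras
[Figure: network with a 28 x 28 input, two hidden layers of 500 neurons each, and a softmax output layer $y_1, y_2, \dots, y_{10}$]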
Keras
Keras

Step 3.1: Configuration


0.1
Step 3.2: Find the optimal network parameters

Training data Labels Next lecture


(Images) (digits)
Keras
Step 3.2: Find the optimal network parameters
Training data (images): numpy array of shape (number of training examples, 28 x 28 = 784)
Labels (digits): numpy array of shape (number of training examples, 10)
https://www.tensorflow.org/versions/r0.8/tutorials/mnist/beginners/index.html
Keras

Save and load models


http://keras.io/getting-started/faq/#how-can-i-save-a-keras-model

How to use the neural network (testing):

case 1:

case 2:
Keras
Using GPU to speed training
Way 1
THEANO_FLAGS=device=gpu0 python YourCode.py
Way 2 (in your code)
import os
os.environ["THEANO_FLAGS"] = "device=gpu0"
Live Demo
Lecture II: Tips for Training DNN
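Recipe of Deep Learning
Three steps (neural network): Step 1: define a set of functions; Step 2: goodness of function; Step 3: pick the best function.
Good results on training data? If NO, go back and revise the three steps. If YES, check the testing data.
Good results on testing data? If NO, it is overfitting: go back and revise the three steps. If YES, done.

Do not always blame overfitting
[Figure: results of two networks on training data and on testing data]
Poor results on testing data look like overfitting, but if the results on training data are also poor, the network is not well trained.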
Recipe of Deep Learning
Different approaches for different problems,
e.g. dropout is for good results on testing data.
[Figure: the two checks of the recipe: good results on training data? Good results on testing data?]
Recipe of Deep Learning
For good results on training data:
Choosing proper loss
Mini-batch
New activation function
Adaptive Learning Rate
Momentum
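Choosing Proper Loss
[Figure: an image of "1" is fed to a network with inputs $x_1, \dots, x_{256}$ and a softmax output layer; the outputs $y_1, y_2, \dots, y_{10}$ are compared with the target $\hat{y} = (1, 0, \dots, 0)$ to give the loss]
Square Error: $\sum_{i=1}^{10} (y_i - \hat{y}_i)^2$, which is 0 when the output equals the target.
Cross Entropy: $-\sum_{i=1}^{10} \hat{y}_i \ln y_i$, which is 0 when the output equals the target.
Which one is better?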
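The two losses can be compared on a small example. This is a sketch with made-up 3-class outputs (the slides use 10 classes); the clipping constant `eps` is an assumption added only to avoid taking the log of 0.

```python
import math


def square_error(y, target):
    """Sum of squared differences between output and target."""
    return sum((yi - ti) ** 2 for yi, ti in zip(y, target))


def cross_entropy(y, target, eps=1e-12):
    """-sum(target_i * ln(y_i)); eps guards against ln(0)."""
    return -sum(ti * math.log(max(yi, eps)) for yi, ti in zip(y, target))


target = [1.0, 0.0, 0.0]
good = [0.98, 0.01, 0.01]  # hypothetical output close to the target
bad = [0.01, 0.98, 0.01]   # hypothetical output far from the target

for name, y in (("good", good), ("bad", bad)):
    print(name, round(square_error(y, target), 3), round(cross_entropy(y, target), 3))
```

Both losses are 0 when the output equals the target, but far from the target the cross entropy grows much larger than the square error, which bounds how bad an output can look.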
Let's try it
Testing accuracy:
Square Error: 0.11
Cross Entropy: 0.84
[Figure: training curves for cross entropy and square error]
Choosing Proper Loss
When using softmax output layer, choose cross entropy.
[Figure: total loss surface over $(w_1, w_2)$ for cross entropy and for square error]
http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
Recipe of Deep Learning
For good results on training data:
Choosing proper loss
Mini-batch
New activation function
Adaptive Learning Rate
Momentum
Mini-batch
We do not really minimize total loss!
[Figure: the training examples are split into mini-batches, e.g. one holding $x^1, x^{31}, \dots$ and another holding $x^2, x^{16}, \dots$; each example $x^r$ goes through the network (NN), and $y^r$ is compared with $\hat{y}^r$ to give $l_r$]
Randomly initialize network parameters.
Pick the 1st batch: $L' = l_1 + l_{31} + \cdots$. Update parameters once.
Pick the 2nd batch: $L'' = l_2 + l_{16} + \cdots$. Update parameters once.
Continue until all mini-batches have been picked: that is one epoch.
Repeat the above process.
L is different each time when we update parameters!

Mini-batch
In the Keras example: 100 examples in a mini-batch; repeat 20 times (20 epochs).
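The epoch and mini-batch loop can be sketched on a toy problem. Everything specific here is an assumption for illustration: a one-parameter model y = w x, 20 synthetic examples, a batch size of 4, and the function name `minibatch_sgd`. The examples are reshuffled at the start of every epoch.

```python
import random


def minibatch_sgd(data, w=0.0, eta=0.01, batch_size=4, epochs=50, seed=0):
    """Fit y = w * x by updating once per mini-batch instead of once per epoch."""
    rng = random.Random(seed)
    data = list(data)
    updates = 0
    for _ in range(epochs):
        rng.shuffle(data)  # a new split into mini-batches every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the batch loss only: sum of (w*x - y)^2 over this batch.
            g = sum(2.0 * (w * x - y) * x for x, y in batch)
            w -= eta * g
            updates += 1
    return w, updates


# Synthetic training data from the hypothetical target function y = 2x.
data = [(i / 10.0, 2.0 * i / 10.0) for i in range(1, 21)]
w, updates = minibatch_sgd(data)
print(round(w, 4), updates)
```

With 20 examples and 4 per batch there are 5 updates per epoch, so 50 epochs give 250 updates, where full-batch gradient descent would give only 50.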

Mini-batch
[Figure: trajectories on the total loss surface (the colors represent the total loss) for original gradient descent and for gradient descent with mini-batch]
With mini-batch, the trajectory is unstable!

Mini-batch is Faster (not always true with parallel computing)
Original gradient descent: update after seeing all examples.
With mini-batch: update after seeing only one batch. If there are 20 batches, update 20 times in one epoch.
With parallel computing, the two can have the same speed for 1 epoch (if the data set is not super large).
Mini-batch has better performance!

Mini-batch is Better!
Testing accuracy:
Mini-batch: 0.84
No batch: 0.12
[Figure: training accuracy vs. epoch for mini-batch and no batch]
Shuffle the training examples for each epoch
[Figure: the examples are grouped into different mini-batches in Epoch 1 and Epoch 2]
Don't worry. This is the default of Keras.



Recipe of Deep Learning
For good results on training data:
Choosing proper loss
Mini-batch
New activation function
Adaptive Learning Rate
Momentum
Hard to get the power of Deep
[Figure: results on training data]
Deeper usually does not imply better.

Let's try it
Testing accuracy:
3 layers: 0.84
9 layers: 0.11
[Figure: training curves for 3 layers and 9 layers]
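Vanishing Gradient Problem
[Figure: network from inputs $x_1, x_2, \dots, x_N$ to outputs $y_1, y_2, \dots, y_M$]
Layers near the input: smaller gradients, learn very slowly, almost random.
Layers near the output: larger gradients, learn very fast, already converged (based on random!?).

Vanishing Gradient Problem
Intuitive way to compute the derivatives: $\dfrac{\partial l}{\partial w} = ? \approx \dfrac{\Delta l}{\Delta w}$
[Figure: a change $+\Delta w$ to a weight near the input passes through many sigmoid functions, each turning a large change at its input into a small change at its output, so the resulting $+\Delta l$ is small: smaller gradients]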

Hard to get the power of Deep

In 2006, people used RBM pre-training.


In 2015, people use ReLU.
ReLU
Rectified Linear Unit (ReLU)
$a = z$ for $z > 0$; $a = 0$ for $z \le 0$
Reason:
1. Fast to compute
2. Biological reason
3. Infinite sigmoid with different biases
4. Vanishing gradient problem
[Xavier Glorot, AISTATS'11]
[Andrew L. Maas, ICML'13]
[Kaiming He, arXiv'15]

ReLU
[Figure: network with inputs $x_1, x_2$ and outputs $y_1, y_2$; the neurons with $a = 0$ output 0 and can be removed]
What remains is a thinner linear network ($a = z$), which does not have smaller gradients.
Let's try it
9 layers. Testing accuracy:
Sigmoid: 0.11
ReLU: 0.96
[Figure: training curves for ReLU and sigmoid with 9 layers]
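ReLU - variant
Leaky ReLU: $a = z$ for $z > 0$; $a = 0.01 z$ otherwise
Parametric ReLU: $a = z$ for $z > 0$; $a = \alpha z$ otherwise, where $\alpha$ is also learned by gradient descent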
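The ReLU variants differ only in what they do for negative inputs. A minimal sketch; the function names are illustrative, and in a real network `alpha` would be a trained parameter rather than an argument.

```python
def relu(z):
    """a = z for z > 0, otherwise 0."""
    return z if z > 0 else 0.0


def leaky_relu(z, slope=0.01):
    """a = z for z > 0, otherwise a small fixed slope times z."""
    return z if z > 0 else slope * z


def parametric_relu(z, alpha):
    """a = z for z > 0, otherwise alpha * z, where alpha is learned."""
    return z if z > 0 else alpha * z


for z in (-2.0, 3.0):
    print(z, relu(z), leaky_relu(z), parametric_relu(z, alpha=0.5))
```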
Maxout
Learnable activation function [Ian J. Goodfellow, ICML'13]
ReLU is a special case of Maxout.
[Figure: inputs $x_1, x_2$ feed linear units that are grouped in pairs, and each group outputs its maximum. In the first layer, the group (5, 7) outputs 7 and the other group outputs 1; in the second layer, the group (1, 2) outputs 2 and the group (4, 3) outputs 4]
You can have more than 2 elements in a group.

Maxout
The activation function in a maxout network can be any piecewise linear convex function.
How many pieces depends on how many elements are in a group.
[Figure: example activation functions with 2 elements in a group and with 3 elements in a group]
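A maxout unit takes the maximum over a group of linear pieces. The sketch below uses hypothetical names (`maxout_unit`, `relu_as_maxout`) and shows why ReLU is a special case: it is a group of two pieces where one piece is fixed at zero.

```python
def maxout_unit(x, group):
    """Max over the linear pieces in a group; each piece is (weights, bias)."""
    return max(
        sum(w * xi for w, xi in zip(weights, x)) + bias
        for weights, bias in group
    )


def relu_as_maxout(x, weights, bias):
    """ReLU(w.x + b) written as a maxout group of two pieces: w.x + b and 0."""
    zero_piece = ([0.0] * len(x), 0.0)
    return maxout_unit(x, [(weights, bias), zero_piece])


print(relu_as_maxout([2.0], [1.0], 0.0))   # 2.0
print(relu_as_maxout([-2.0], [1.0], 0.0))  # 0.0

# Two learned pieces can give other convex shapes, e.g. |x| from x and -x.
abs_group = [([1.0], 0.0), ([-1.0], 0.0)]
print(maxout_unit([-3.0], abs_group))      # 3.0
```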


Recipe of Deep Learning
For good results on training data:
Choosing proper loss
Mini-batch
New activation function
Adaptive Learning Rate
Momentum
Learning Rates
Set the learning rate $\eta$ carefully.
[Figure: trajectories on the total loss surface over $(w_1, w_2)$]
If the learning rate is too large, the total loss may not decrease after each update.
If the learning rate is too small, training would be too slow.
Learning Rates
Popular & simple idea: reduce the learning rate by some factor every few epochs.
At the beginning, we are far from the destination, so we use a larger learning rate.
After several epochs, we are close to the destination, so we reduce the learning rate.
E.g. 1/t decay: $\eta^t = \eta / \sqrt{t + 1}$
Learning rate cannot be one-size-fits-all.
Giving different parameters different learning rates.
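Adagrad
Original: $w \leftarrow w - \eta \, \partial L / \partial w$
Adagrad: $w \leftarrow w - \eta_w \, \partial L / \partial w$, with a parameter dependent learning rate
$\eta_w = \dfrac{\eta}{\sqrt{\sum_{i=0}^{t} (g^i)^2}}$
where $\eta$ is a constant and $g^i$ is $\partial L / \partial w$ obtained at the i-th update. The denominator is the summation of the square of the previous derivatives.

Adagrad
Example for two parameters:
$w_1$ with $g^0 = 0.1$, $g^1 = 0.2$: learning rates $\dfrac{\eta}{\sqrt{0.1^2}} = \dfrac{\eta}{0.1}$, then $\dfrac{\eta}{\sqrt{0.1^2 + 0.2^2}} = \dfrac{\eta}{0.22}$
$w_2$ with $g^0 = 20.0$, $g^1 = 10.0$: learning rates $\dfrac{\eta}{\sqrt{20^2}} = \dfrac{\eta}{20}$, then $\dfrac{\eta}{\sqrt{20^2 + 10^2}} = \dfrac{\eta}{22}$
Observation:
1. Learning rate is smaller and smaller for all parameters.
2. Smaller derivatives, larger learning rate, and vice versa. Why?
[Figure: on a loss surface, the direction with larger derivatives gets a smaller learning rate, and the direction with smaller derivatives gets a larger learning rate]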
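The Adagrad learning rates in the two-parameter example can be recomputed in a few lines. A minimal sketch with $\eta = 1$ so that the printed values are just the reciprocals of the denominators; `adagrad_rates` is an illustrative name, not a library function.

```python
import math


def adagrad_rates(gradients, eta=1.0):
    """Learning rate before each update: eta / sqrt(sum of squared past gradients)."""
    rates = []
    sum_sq = 0.0
    for g in gradients:
        sum_sq += g * g
        rates.append(eta / math.sqrt(sum_sq))
    return rates


rates_w1 = adagrad_rates([0.1, 0.2])    # small derivatives, large rates
rates_w2 = adagrad_rates([20.0, 10.0])  # large derivatives, small rates

# Print the denominators sqrt(sum g^2) used in the example.
print([round(1.0 / r, 2) for r in rates_w1])  # [0.1, 0.22]
print([round(1.0 / r, 2) for r in rates_w2])  # [20.0, 22.36]
```

Both sequences of rates shrink over time, and the parameter with the smaller derivatives keeps the larger rate.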
Not the whole story ...
Adagrad [John Duchi, JMLR'11]
RMSprop: https://www.youtube.com/watch?v=O3sxAc4hxZU
Adadelta [Matthew D. Zeiler, arXiv'12]
"No more pesky learning rates" [Tom Schaul, arXiv'12]
AdaSecant [Caglar Gulcehre, arXiv'14]
Adam [Diederik P. Kingma, ICLR'15]
Nadam: http://cs229.stanford.edu/proj2015/054_report.pdf
Recipe of Deep Learning
For good results on training data:
Choosing proper loss
Mini-batch
New activation function
Adaptive Learning Rate
Momentum
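Hard to find optimal network parameters
[Figure: total loss plotted against the value of a network parameter w]
Very slow at the plateau: $\partial L / \partial w \approx 0$
Stuck at saddle point: $\partial L / \partial w = 0$
Stuck at local minima: $\partial L / \partial w = 0$

In physical world
Momentum
How about putting this phenomenon in gradient descent?

Momentum
Still no guarantee of reaching the global minimum, but it gives some hope.
[Figure: cost curve with arrows for the negative of $\partial L / \partial w$, the momentum, and the real movement]
Movement = negative of $\partial L / \partial w$ + momentum
Even where $\partial L / \partial w = 0$, the momentum keeps the parameter moving.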
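The effect of momentum on a flat region can be sketched as follows. The loss shape is a made-up assumption: a slope until w = 1 and a perfectly flat plateau afterwards. The decay factor `lam` and the name `momentum_descent` are also illustrative; setting `lam=0` gives plain gradient descent.

```python
def momentum_descent(grad, w0, eta=0.1, lam=0.9, steps=50):
    """Movement = lam * previous movement - eta * gradient."""
    w, v = w0, 0.0
    for _ in range(steps):
        v = lam * v - eta * grad(w)  # momentum plus negative of the gradient
        w = w + v                    # real movement
    return w


def plateau_grad(w):
    """Hypothetical loss: slopes down until w = 1, then is completely flat."""
    return -1.0 if w < 1.0 else 0.0


w_plain = momentum_descent(plateau_grad, 0.0, lam=0.0)
w_momentum = momentum_descent(plateau_grad, 0.0, lam=0.9)
print(round(w_plain, 2), round(w_momentum, 2))
```

Plain gradient descent stops as soon as the gradient is zero, while the momentum version keeps coasting across the plateau for a while.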
Adam: RMSProp (Advanced Adagrad) + Momentum
Let's try it
ReLU, 3 layers. Testing accuracy:
Original: 0.96
Adam: 0.97
[Figure: training curves for original and Adam]
Recipe of Deep Learning
For good results on testing data:
Early Stopping
Regularization
Dropout
Network Structure
Why Overfitting?
Training data and testing data can be different.
[Figure: examples of training data and of testing data]
Learning target is defined by the training data.
The parameters achieving the learning target do not necessarily have good results on the testing data.

Panacea for Overfitting
Have more training data.
Create more training data (?)
Handwriting recognition: [Figure: original training data and created training data (Shift 15)]

Why Overfitting?
For experiments, we added some noise to the testing data.
Testing accuracy:
Clean: 0.97
Noisy: 0.50
Training is not influenced.

Recipe of Deep Learning
For good results on testing data:
Early Stopping
Weight Decay
Dropout
Network Structure
Early Stopping
[Figure: total loss vs. epochs for the training set and for the testing set (validation set); stop at the epoch where the validation loss stops decreasing]
Keras: http://keras.io/getting-started/faq/#how-can-i-interrupt-training-when-the-validation-loss-isnt-decreasing-anymore
Recipe of Deep Learning
For good results on testing data:
Early Stopping
Weight Decay
Dropout
Network Structure
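Weight Decay
Our brain prunes out the useless links between neurons.
Doing the same thing to the machine's brain improves the performance.

Weight Decay
Weight decay is one kind of regularization.
[Figure: useless weights end up close to zero]

Weight Decay
Implementation
Original: $w \leftarrow w - \eta \, \partial L / \partial w$
Weight Decay: $w \leftarrow (1 - 0.01) \, w - \eta \, \partial L / \partial w = 0.99 \, w - \eta \, \partial L / \partial w$
The weight gets smaller and smaller.
Keras: http://keras.io/regularizers/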
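The two update rules can be put side by side. A minimal sketch: the function names are illustrative, and the gradient is held at zero in the demo (an assumption) to isolate the shrinking effect of the 0.99 factor.

```python
def sgd_step(w, g, eta=0.1):
    """Original update: w <- w - eta * g."""
    return w - eta * g


def weight_decay_step(w, g, eta=0.1, decay=0.01):
    """Weight decay update: w <- (1 - decay) * w - eta * g."""
    return (1.0 - decay) * w - eta * g


w_plain = w_decay = 1.0
for _ in range(100):
    # A "useless" weight: the loss does not depend on it, so its gradient is 0.
    w_plain = sgd_step(w_plain, 0.0)
    w_decay = weight_decay_step(w_decay, 0.0)

print(w_plain, round(w_decay, 3))  # the decayed weight has shrunk to 0.99 ** 100
```

A weight that the loss keeps pushing on stays away from zero; a weight with no gradient signal fades out.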
Recipe of Deep Learning
For good results on testing data:
Early Stopping
Weight Decay
Dropout
Network Structure
Dropout
Training:
Each time before updating the parameters, each neuron has p% to dropout.
The structure of the network is changed: it becomes thinner!
Use the new network for training.
For each mini-batch, we resample the dropout neurons.

Dropout
Testing:
No dropout.
If the dropout rate at training is p%, all the weights are multiplied by (1-p)%.
Assume that the dropout rate is 50%. If a weight w = 1 by training, set w = 0.5 for testing.
Dropout - Intuitive Reason
When people team up, if everyone expects the partner to do the work, nothing will be done in the end.
However, if you know your partner will drop out, you will do better.
When testing, no one actually drops out, so good results are obtained eventually.

Dropout - Intuitive Reason
Why should the weights be multiplied by (1-p)% (p% is the dropout rate) when testing?
Training of Dropout: assume the dropout rate is 50%. About half of the inputs to a neuron are dropped, giving z.
Testing of Dropout: no dropout. With the weights from training, $w_1, \dots, w_4$, we get $z' \approx 2z$.
With the weights multiplied by (1-p)%, i.e. $0.5 \times w_1, \dots, 0.5 \times w_4$, we get $z' \approx z$.

Dropout is a kind of ensemble.
Ensemble, training: sample Set 1, Set 2, Set 3, Set 4 from the training set and train Network 1, Network 2, Network 3, Network 4 on them.
Train a bunch of networks with different structures.

Dropout is a kind of ensemble.
Ensemble, testing: feed the testing data x to Network 1, ..., Network 4 and average their outputs $y_1, y_2, y_3, y_4$.

Dropout is a kind of ensemble.
Training of Dropout: minibatch 1, minibatch 2, minibatch 3, minibatch 4, ... each train a different thinned network.
With M neurons, there are $2^M$ possible networks.
Using one mini-batch to train one network.
Some parameters in the network are shared.

Dropout is a kind of ensemble.
Testing of Dropout: ideally, feed the testing data x to all the thinned networks and average their outputs $y_1, y_2, y_3, \dots$
Instead, use the full network with all the weights multiplied by (1-p)%. Its output y ≈ the average (?????).
More about dropout
More reference for dropout [Nitish Srivastava, JMLR'14] [Pierre Baldi, NIPS'13] [Geoffrey E. Hinton, arXiv'12]
Dropout works better with Maxout [Ian J. Goodfellow, ICML'13]
Dropconnect [Li Wan, ICML'13]
  Dropout deletes neurons.
  Dropconnect deletes the connections between neurons.
Annealed dropout [S.J. Rennie, SLT'14]
  Dropout rate decreases by epochs.
Standout [J. Ba, NIPS'13]
  Each neuron has a different dropout rate.
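Let's try it
[Figure: network with two hidden layers of 500 neurons each and a softmax output layer $y_1, y_2, \dots, y_{10}$; a dropout layer is added after each hidden layer]
model.add( Dropout(0.8) )
model.add( Dropout(0.8) )

Let's try it
Testing accuracy:
Noisy: 0.50
+ dropout: 0.63
[Figure: training accuracy vs. epoch with dropout and with no dropout]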
Recipe of Deep Learning
For good results on testing data:
Early Stopping
Regularization
Dropout
Network Structure: CNN is a very good example! (next lecture)
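Concluding Remarks of Lecture II
Recipe of Deep Learning
Three steps (neural network): Step 1: define a set of functions; Step 2: goodness of function; Step 3: pick the best function.
Good results on training data? If NO, go back and revise the three steps. If YES, check the testing data.
Good results on testing data? If NO, go back and revise the three steps. If YES, done.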
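Let's try another task
Document Classification
[Figure: a document is described by features such as "stock" in document and "president" in document, and fed to the machine, which outputs the class of the document]
http://top-breaking-news.com/

Data

MSE, ReLU, Adaptive Learning Rate
Accuracy:
MSE: 0.36
CE: 0.55
+ ReLU: 0.75
+ Adam: 0.77

Dropout
Accuracy:
Adam: 0.77
+ dropout: 0.79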
Lecture III: Variants of Neural Networks

Variants of Neural Networks
Convolutional Neural Network (CNN): widely used in image processing
Recurrent Neural Network (RNN)
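Why CNN for Image?
When processing an image, the first layer of a fully connected network would be very large.
[Figure: a 100 x 100 x 3 image fed to a first layer of 1000 neurons needs 100 x 100 x 3 x 1000 = $3 \times 10^7$ weights; further layers and a softmax output follow]
Can the fully connected network be simplified by considering the properties of image recognition?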
Why CNN for Image
Some patterns are much smaller than the whole image.
A neuron does not have to see the whole image to discover the pattern.
Connecting to a small region needs fewer parameters.
[Figure: a "beak" detector looking at a small region of a bird image]

Why CNN for Image
The same patterns appear in different regions.
[Figure: an "upper-left beak" detector and a "middle beak" detector]
They do almost the same thing, so they can use the same set of parameters.

Why CNN for Image
Subsampling the pixels will not change the object.
[Figure: a bird image and its subsampled version; both show a bird]
We can subsample the pixels to make the image smaller.
Fewer parameters for the network to process the image.
Three Steps for Deep Learning

Step 1: define a set of functions (Convolutional Neural Network)
Step 2: goodness of function
Step 3: pick the best function

Deep Learning is so simple


The whole CNN
Image → Convolution → Max Pooling → Convolution → Max Pooling → Flatten → Fully Connected Feedforward network → cat, dog, ...
Convolution and Max Pooling can repeat many times.

The whole CNN
Property 1: Some patterns are much smaller than the whole image (Convolution).
Property 2: The same patterns appear in different regions (Convolution).
Property 3: Subsampling the pixels will not change the object (Max Pooling).
CNN - Convolution
6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1 (matrix):
 1 -1 -1
-1  1 -1
-1 -1  1

Filter 2 (matrix):
-1  1 -1
-1  1 -1
-1  1 -1

The filter values are the network parameters to be learned.
Property 1: each filter detects a small pattern (3 x 3).

With stride = 1, Filter 1 gives 3 and then -1 at the first two positions.
If stride = 2, it gives 3 and then -3. We set stride = 1 below.

Filter 1, stride = 1, output:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
Property 2: the same pattern in the upper-left and lower-left regions gives the same response (3).

Do the same process for every filter. Filter 2, stride = 1, output:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3
Together the outputs form the Feature Map: two 4 x 4 images.
CNN - Zero Padding
Pad the 6 x 6 image with a border of zeros before applying Filter 1.
You will get another 6 x 6 image in this way.
CNN - Colorful image
[Figure: a colorful image is a stack of 6 x 6 matrices, one per color channel; Filter 1 and Filter 2 are correspondingly stacks of 3 x 3 matrices]
The whole CNN
Image → Convolution → Max Pooling → Convolution → Max Pooling → Flatten → Fully Connected Feedforward network → cat, dog, ...
Convolution and Max Pooling can repeat many times.
CNN - Max Pooling
Take the maximum of each 2 x 2 block of each filter's output.

Filter 1 output:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
after Max Pooling:
3 0
3 1

Filter 2 output:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3
after Max Pooling:
-1 1
 0 3

CNN - Max Pooling
6 x 6 image → Conv → Max Pooling → new image but smaller (2 x 2 image).
Each filter is a channel.
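The convolution and max pooling numbers can be recomputed from the 6 x 6 image and the two filters. A minimal sketch in plain Python with no padding; `convolve` and `max_pool` are illustrative names, not Keras functions, and the matrices are a reading of the slides.

```python
IMAGE = [
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
]
FILTER_1 = [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
FILTER_2 = [[-1, 1, -1], [-1, 1, -1], [-1, 1, -1]]


def convolve(image, kernel, stride=1):
    """Slide the kernel over the image and take the element-wise product sum."""
    k = len(kernel)
    out = []
    for r in range(0, len(image) - k + 1, stride):
        row = []
        for c in range(0, len(image[0]) - k + 1, stride):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(k) for j in range(k)))
        out.append(row)
    return out


def max_pool(fmap, size=2):
    """Keep the maximum of each non-overlapping size x size block."""
    return [
        [max(fmap[r + i][c + j] for i in range(size) for j in range(size))
         for c in range(0, len(fmap[0]), size)]
        for r in range(0, len(fmap), size)
    ]


for kernel in (FILTER_1, FILTER_2):
    feature_map = convolve(IMAGE, kernel)  # 4 x 4
    print(feature_map, max_pool(feature_map))  # pooled result is 2 x 2
```

The same 9 filter values are reused at all 16 positions, which is where the parameter saving over a fully connected layer comes from.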
The whole CNN
After Convolution and Max Pooling we get a new image, smaller than the original image.
The number of channels is the number of filters.
Convolution and Max Pooling can repeat many times.

The whole CNN
Image → Convolution → Max Pooling → a new image → Convolution → Max Pooling → a new image → Flatten → Fully Connected Feedforward network → cat, dog, ...

Flatten
The two 2 x 2 channels (3 0 / 3 1 and -1 1 / 0 3) are flattened into one 8-dim vector, which is the input of the Fully Connected Feedforward network.
The whole CNN

Convolution

Max Pooling
Can repeat
many times
Convolution

Max Pooling
Convolution v.s. Fully Connected
(Ignoring the non-linear activation function after the convolution.)
[Figure: the 6 x 6 image, Filter 1, and Filter 2, with convolution followed by max pooling, drawn next to a fully connected layer whose neurons are grouped and followed by Max]

Number the pixels of the 6 x 6 image from 1 to 36. The output 3 of Filter 1 at the top-left position is a neuron connected only to inputs 1, 2, 3, 7, 8, 9, 13, 14, 15, with the 9 filter values as its weights.
Fewer parameters! Each neuron only connects to 9 inputs, not fully connected.

The output -1 at the next position is a neuron connected to inputs 2, 3, 4, 8, 9, 10, 14, 15, 16, with the same 9 weights.
Shared weights: even fewer parameters!

Max pooling takes the maximum over a group of neurons, e.g. the output of Filter 1
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
becomes
3 0
3 1

Input dim = 6 x 6 = 36; convolution output dim = 4 x 4 x 2 = 32.
Fully connected: 36 x 32 = 1152 parameters.
Convolution: only 9 x 2 = 18 parameters.
Convolutional Neural Network
Step 1: define a set of functions (Convolutional Neural Network)
Step 2: goodness of function
Step 3: pick the best function
[Figure: an image is fed to a CNN (convolution, max pooling, fully connected); the outputs for "monkey", "cat", "dog" are compared with the target (0, 1, 0)]
Learning: Nothing special, just gradient descent.
Playing Go
Network input: the board as a 19 x 19 matrix (image), i.e. a 19 x 19 vector. Black: 1, white: -1, none: 0.
Network output: the next move (19 x 19 positions), a 19 x 19 vector.
A fully connected feedforward network can be used, but CNN performs much better.

Playing Go
Training: records of previous plays.
[Figure: for each position in a recorded game, the target of the network is 1 at the move that was actually played next and 0 elsewhere]
Why CNN for playing Go?
Some patterns are much smaller than the whole image.
AlphaGo uses 5 x 5 for the first layer.
The same patterns appear in different regions.

Why CNN for playing Go?
Subsampling the pixels will not change the object → Max Pooling. How to explain this???
AlphaGo does not use Max Pooling.


Variants of Neural Networks
Convolutional Neural Network (CNN)
Recurrent Neural Network (RNN): Neural Network with Memory
Example Application
Slot Filling

I would like to arrive Taipei on November 2nd.

ticket booking system

Destination: Taipei
Slot
time of arrival: November 2nd
Example Application
y1 y2
Solving slot filling by
Feedforward network?
Input: a word
(Each word is represented
as a vector)

Taipei x1 x2
1-of-N encoding

How to represent each word as a vector?


1-of-N Encoding
lexicon = {apple, bag, cat, dog, elephant}
The vector is lexicon size.
Each dimension corresponds to a word in the lexicon.
The dimension for the word is 1, and others are 0.
apple = [ 1 0 0 0 0]
bag = [ 0 1 0 0 0]
cat = [ 0 0 1 0 0]
dog = [ 0 0 0 1 0]
elephant = [ 0 0 0 0 1]
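The encoding can be written in a couple of lines. This is a sketch using the five-word lexicon from the slide; the function name is illustrative.

```python
LEXICON = ["apple", "bag", "cat", "dog", "elephant"]


def one_of_n(word, lexicon=LEXICON):
    """Return a vector of lexicon size with a 1 in the word's dimension."""
    return [1 if entry == word else 0 for entry in lexicon]


cat_vector = one_of_n("cat")
```

A word outside the lexicon would get an all-zero vector here, which is the gap the next slide addresses.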
Beyond 1-of-N encoding
Dimension for "Other": apple 0, bag 0, cat 0, dog 0, elephant 0, other 1 (w = Gandalf, w = Sauron)
Word hashing: 26 X 26 X 26 dimensions, one per letter trigram: a-a-a 0, a-a-b 0, ..., a-p-p 1, ..., p-l-e 1, ..., p-p-l 1 (w = apple)
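Both ideas can be sketched briefly: an extra "other" dimension for out-of-lexicon words, and word hashing over letter trigrams. The sketch assumes lowercase a-z letters only and no word-boundary markers; the helper names are hypothetical.

```python
from itertools import product
import string

LEXICON = ["apple", "bag", "cat", "dog", "elephant"]


def encode_with_other(word, lexicon=LEXICON):
    """1-of-N vector with one extra final dimension for unknown words."""
    vector = [1 if entry == word else 0 for entry in lexicon]
    vector.append(0 if word in lexicon else 1)
    return vector


def letter_trigrams(word):
    """All runs of three consecutive letters in the word."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


# One dimension per possible trigram: 26 x 26 x 26 = 17576.
TRIGRAM_INDEX = {
    "".join(letters): index
    for index, letters in enumerate(product(string.ascii_lowercase, repeat=3))
}


def word_hash(word):
    """Set the dimension of every trigram that occurs in the word to 1."""
    vector = [0] * len(TRIGRAM_INDEX)
    for trigram in letter_trigrams(word.lower()):
        if trigram in TRIGRAM_INDEX:
            vector[TRIGRAM_INDEX[trigram]] = 1
    return vector


apple_hash = word_hash("apple")
```

With word hashing, an unseen word such as "Gandalf" still gets a distinctive vector, whereas the "other" dimension maps every unknown word to the same vector.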
Example Application
y1: dest, y2: time of departure
Solving slot filling by
Feedforward network?
Input: a word
(Each word is represented
as a vector)
Output:
Probability distribution over the slots
that the input word belongs to
Taipei x1 x2
Example Application
y1: dest, y2: time of departure
arrive Taipei on November 2nd

other dest other time time


Problem?
leave Taipei on November 2nd

place of departure

Neural network Taipei x1 x2


needs memory!
Three Steps for Deep Learning

Step 1: define a set of function (Recurrent Neural Network)
Step 2: goodness of function
Step 3: pick the best function

Deep Learning is so simple


Recurrent Neural Network (RNN)
The output of the hidden layer is stored in the memory.
Memory can be considered as another input.
[Figure: inputs x1, x2; hidden layer outputs a1, a2 stored in the memory; outputs y1, y2]
RNN The same network is used again and again.

Probability of Probability of Probability of


arrive in each slot Taipei in each slot on in each slot
y1 y2 y3
store store
a1 a2 a3
a1 a2

x1 x2 x3

arrive Taipei on November 2nd


RNN Different

Prob of leave Prob of Taipei Prob of arrive Prob of Taipei


in each slot in each slot in each slot in each slot
y1 y2 y1 y2
store store
a1 a2 a1 a2
a1 a1

x1 x2 x1 x2
leave Taipei arrive Taipei

The values stored in the memory are different.
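A toy recurrent network makes the effect of memory concrete. This is a sketch under strong assumptions that are not from the slide: every weight is 1, there is no bias, all activations are linear, and the memory starts at 0.

```python
def run_rnn(inputs, hidden_size=2, output_size=2):
    """Process a sequence of input vectors with a toy recurrent network.

    Assumptions: all weights are 1, no bias, linear activations.
    """
    memory = [0] * hidden_size            # memory starts empty
    outputs = []
    for x in inputs:
        # Each hidden unit sees the current input and the stored memory.
        total = sum(x) + sum(memory)
        hidden = [total] * hidden_size
        outputs.append([sum(hidden)] * output_size)
        memory = hidden                   # store the hidden layer output
    return outputs


first = run_rnn([[1, 1], [1, 1], [2, 2]])
second = run_rnn([[2, 2], [1, 1]])
```

The input `[1, 1]` produces a different output in each position because the memory differs, which is the same reason "Taipei" can be tagged differently after "leave" and after "arrive".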


Of course it can be deep
yt yt+1 yt+2

xt xt+1 xt+2
Bidirectional RNN
xt xt+1 xt+2

yt yt+1 yt+2

xt xt+1 xt+2
Long Short-term Memory (LSTM)
Special Neuron: 4 inputs, 1 output
Memory Cell
Input Gate: a signal from other part of the network controls the input gate
Forget Gate: a signal from other part of the network controls the forget gate
Output Gate: a signal from other part of the network controls the output gate
The input comes from other part of the network, and the output goes to other part of the network.
Activation function f is usually a sigmoid function
Between 0 and 1
Mimic open and close gate

c' = g(z) f(z_i) + c f(z_f)
a = h(c') f(z_o)
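One step of a single LSTM cell can be sketched as follows, assuming the usual update: the new memory is the gated input plus the gated old memory, and the output is the gated memory. The names z, z_i, z_f, z_o follow the later slides; using tanh for g and h by default is an assumption.

```python
import math


def sigmoid(value):
    """Gate activation f: squashes the control signal to between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-value))


def lstm_cell_step(z, z_i, z_f, z_o, c_prev, g=math.tanh, h=math.tanh):
    """Return (new memory, output) for one cell.

    z is the input, z_i / z_f / z_o control the input, forget and output
    gates, and c_prev is the value currently stored in the memory cell.
    """
    input_gate = sigmoid(z_i)
    forget_gate = sigmoid(z_f)    # close to 1 means "keep the memory"
    output_gate = sigmoid(z_o)
    c_new = g(z) * input_gate + c_prev * forget_gate
    a = h(c_new) * output_gate
    return c_new, a


def identity(value):
    return value


# Input gate open, memory kept, output gate closed (linear g and h).
c_new, a = lstm_cell_step(3.0, 100.0, 100.0, -100.0, 7.0, g=identity, h=identity)
```

In this call the input 3 is added to the stored 7, so the memory becomes 10, while the closed output gate keeps the output near 0.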


[Figure: two worked LSTM examples with numeric inputs, gate signals and memory values]
LSTM

ct-1

vector

zf zi z zo 4 vectors

xt
LSTM
yt
zo
ct-1

zf

zi
zf zi z zo

xt
z
Extension: peephole
LSTM
yt yt+1

ct-1 ct ct+1

zf zi z zo zf zi z zo

ct-1 ht-1 xt ct ht xt+1


Multiple-layer
LSTM

Don't worry if you cannot understand this.


Keras can handle it.
Keras supports
LSTM, GRU, SimpleRNN layers

This is quite
standard now.

https://img.komicolle.org/2015-09-20/src/14426967627131.gif
Three Steps for Deep Learning

Step 1: define a set of function
Step 2: goodness of function
Step 3: pick the best function

Deep Learning is so simple


Learning Target
other dest other
0 1 0 0 1 0 0 1 0

y1 y2 y3
copy copy
a1 a2 a3
a1 a2

Wi
x1 x2 x3

Training
Sentences: arrive Taipei on November 2nd
other dest other time time
Three Steps for Deep Learning

Step 1: define a set of function
Step 2: goodness of function
Step 3: pick the best function

Deep Learning is so simple


Learning y1 y2

Backpropagation
through time (BPTT)
copy

a1 a2


x1 x2

RNN Learning is very difficult in practice.



Unfortunately
RNN-based network is not always easy to learn
Real experiments on Language modeling

sometimes
Total Loss

Lucky

Epoch
The error surface is rough.
The error surface is either
very flat or very steep.

Clipping
[Figure: total loss (cost) surface over w1 and w2] [Razvan Pascanu, ICML13]
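Clipping can be sketched as rescaling the gradient whenever its norm exceeds a threshold, so that one step on a very steep part of the error surface cannot throw the parameters far away. Norm-based rescaling and the threshold value are assumptions for illustration.

```python
import math


def clip_gradient(gradient, threshold):
    """Rescale the gradient so that its L2 norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= threshold:
        return list(gradient)             # small gradients pass unchanged
    scale = threshold / norm
    return [g * scale for g in gradient]


steep = clip_gradient([30.0, 40.0], threshold=5.0)   # norm 50 -> norm 5
flat = clip_gradient([0.3, 0.4], threshold=5.0)      # unchanged
```

The direction of the update is kept; only its length is limited.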


Why?
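Toy Example: a 1000-step linear RNN with one recurrent weight w. The input is 1 at the first time step and 0 afterwards, and all other weights are 1, so y^{1000} = w^{999}.

w = 1    -> y^{1000} = 1
w = 1.01 -> y^{1000} ≈ 20000    Large ∂L/∂w: small learning rate?
w = 0.99 -> y^{1000} ≈ 0
w = 0.01 -> y^{1000} ≈ 0        Small ∂L/∂w: large learning rate?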
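The toy example is easy to reproduce. The sketch below assumes a linear network with a single recurrent weight w, input 1 at the first step and 0 afterwards, so the last output equals w to the power 999.

```python
def last_output(w, steps=1000):
    """Run the toy linear RNN and return the output at the final step."""
    y = 0.0
    for t in range(steps):
        x = 1.0 if t == 0 else 0.0   # input is 1 only at the first step
        y = x + w * y                # memory is multiplied by w every step
    return y


for w in (1.0, 1.01, 0.99, 0.01):
    print(w, last_output(w))
```

A change of 0.01 in w moves the output from 1 to roughly twenty thousand in one direction and to almost nothing in the other, which is why no single learning rate suits the whole error surface.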
Helpful Techniques
Long Short-term Memory (LSTM)
Can deal with gradient vanishing (not gradient explosion)
Memory and input are added.
The influence never disappears unless the forget gate is closed.
No gradient vanishing (if the forget gate is open).
Gated Recurrent Unit (GRU):
simpler than LSTM [Cho, EMNLP14]
Helpful Techniques
Clockwork RNN [Jan Koutnik, JMLR14]
Structurally Constrained Recurrent Network (SCRN) [Tomas Mikolov, ICLR15]
Vanilla RNN initialized with identity matrix + ReLU activation function [Quoc V. Le, arXiv15]
Outperforms or is comparable with LSTM in 4 different tasks
More Applications
Probability of Probability of Probability of
arrive in each slot Taipei in each slot on in each slot
y1 y2 y3
Input and output are both sequences with the same length
store store
a1 a2 a3
a1 a2
RNN can do more than that!
x1 x2 x3

arrive Taipei on November 2nd


Many to one
Keras Example: https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py

Input is a vector sequence, but output is only one vector

Sentiment Analysis


. . .


Positive Negative Positive


Many to Many (Output is shorter)
Input and output are both sequences, but the output
is shorter.
E.g. Speech Recognition

Input: (vector sequence)
Output: (character sequence)
Trimming
Problem? Why can't it be
Many to Many (Output is shorter)
Input and output are both sequences, but the output
is shorter.
Connectionist Temporal Classification (CTC) [Alex Graves,
ICML06][Alex Graves, ICML14][Haim Sak, Interspeech15][Jie Li,
Interspeech15][Andrew Senior, ASRU15]

Add an extra symbol


representing null
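The role of the null symbol can be sketched with the usual CTC read-out rule: merge repeated labels, then drop the nulls. The underscore used for null and the function name are assumptions for illustration.

```python
NULL = "_"  # hypothetical stand-in for the extra null symbol


def ctc_collapse(frame_labels, null=NULL):
    """Turn per-frame labels into the output sequence."""
    merged = []
    previous = None
    for label in frame_labels:
        if label != previous:        # merge runs of the same label
            merged.append(label)
        previous = label
    return [label for label in merged if label != null]


single = ctc_collapse(list("aa_b__"))
doubled = ctc_collapse(list("aa_b_b_"))
```

Plain trimming of repeats could never output the same character twice in a row; a null between two identical labels keeps them apart, as in the second call.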


Many to Many (No Limitation)
Input and output are both sequences with different
lengths. Sequence to sequence learning
E.g. Machine Translation (machine learning)
machine

learning

Containing all
information about
input sequence
Many to Many (No Limitation)
Input and output are both sequences with different
lengths. Sequence to sequence learning
E.g. Machine Translation (machine learning)


machine

learning

Don't know when to stop


Many to Many (No Limitation)

tlkagk: ===================
Ref: http://zh.pttpedia.wikia.com/wiki/%E6%8E%A5%E9%BE%8D%E6%8E%A8%E6%96%87
Many to Many (No Limitation)
Input and output are both sequences with different
lengths. Sequence to sequence learning
E.g. Machine Translation (machine learning)

===

machine

learning

Add a symbol "===" that means stop


[Ilya Sutskever, NIPS14][Dzmitry Bahdanau, arXiv15]
One to Many
Input an image, but output a sequence of words
[Kelvin Xu, arXiv15][Li Yao, ICCV15]
A vector
for whole ===
image a woman is

CNN

Input
image Caption Generation
Application:
Video Caption Generation

A girl is running.

Video

A group of people is A group of people is


knocked by a tree. walking in the forest.
Video Caption Generation
Can a machine describe what it sees in a video?
Demo:
Concluding Remarks

Convolutional Neural
Network (CNN)

Recurrent Neural Network


(RNN)
Lecture IV:
Next Wave
Outline
Supervised Learning
Ultra Deep Network
New network structure
Attention Model
Reinforcement Learning

Unsupervised Learning
Image: Realizing what the World Looks Like
Text: Understanding the Meaning of Words
Audio: Learning human language without supervision
Skyscraper

https://zh.wikipedia.org/wiki/%E9%9B%99%E5%B3%B0%E5%A1%94#/media/File:BurjDubaiHeight.svg
Ultra Deep Network
http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf

AlexNet (2012): 8 layers, 16.4% error
VGG (2014): 19 layers, 7.3% error
GoogleNet (2014): 22 layers, 6.7% error


Ultra Deep Network

AlexNet (2012): 16.4%
VGG (2014): 7.3%
GoogleNet (2014): 6.7%
Residual Net (2015): 152 layers, 3.57%
Taipei 101: 101 layers
Ultra Deep Network
Worry about overfitting? Worry about training first!
This ultra deep network has a special structure.

AlexNet (2012): 16.4%
VGG (2014): 7.3%
GoogleNet (2014): 6.7%
Residual Net (2015): 152 layers, 3.57%
Ultra Deep Network
An ultra deep network is the
ensemble of many networks
with different depths.

6 layers
Ensemble 4 layers
2 layers
Ultra Deep Network
FractalNet

Resnet in Resnet

Good Initialization?
Ultra Deep Network

copy Gate
controller copy
output layer output layer output layer

Highway Network automatically


determines the layers needed!
Input layer Input layer Input layer
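The copy path and gate controller can be sketched with the usual highway layer combination, where a gate between 0 and 1 mixes the transformed value with a copy of the input. The sketch takes the transformed values and gate signals as given; how they are computed from the input is left out, and the names are illustrative.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def highway_combine(x, transformed, gate_signals):
    """Mix each transformed value with a copy of the input.

    gate near 1: use the transformed value (the layer is "used")
    gate near 0: copy the input through (the layer is effectively skipped)
    """
    gates = [sigmoid(signal) for signal in gate_signals]
    return [
        gate * h + (1.0 - gate) * xi
        for xi, h, gate in zip(x, transformed, gates)
    ]


skipped = highway_combine([1.0, 2.0], [10.0, 20.0], [-100.0, -100.0])
used = highway_combine([1.0, 2.0], [10.0, 20.0], [100.0, 100.0])
```

Because the gates are learned, training can close them for layers that are not needed, which is one way to read "automatically determines the layers needed".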
Outline
Supervised Learning
Ultra Deep Network
New network structure
Attention Model
Reinforcement Learning

Unsupervised Learning
Image: Realizing what the World Looks Like
Text: Understanding the Meaning of Words
Audio: Learning human language without supervision
Attention-based Model
What you learned Lunch today
in these lectures

What is deep
learning?

summer
vacation 10
Answer Organize years ago

http://henrylo1605.blogspot.tw/2015/05/blog-post_56.html
Attention-based Model
Input DNN/RNN output

Reading Head
Controller

Reading Head


Machine's Memory
Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Attain%20(v3).ecm.mp4/index.html
Attention-based Model v2

Input DNN/RNN output

Reading Head Writing Head


Controller Controller

Writing Head Reading Head


Machine's Memory

Neural Turing Machine


Reading Comprehension

Query DNN/RNN answer

Reading Head
Controller

Semantic
Analysis

Each sentence becomes a vector.
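A soft reading head can be sketched as a weighted sum over the sentence vectors, with weights given by how well each sentence matches the query. Dot-product matching followed by a softmax is an assumption here; the slides do not fix the matching function, and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(score - largest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attend(query, memory):
    """Return (attention weights, vector read from memory)."""
    scores = [sum(q * m for q, m in zip(query, sentence)) for sentence in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    read = [
        sum(weight * sentence[d] for weight, sentence in zip(weights, memory))
        for d in range(dim)
    ]
    return weights, read


# Three sentence vectors (toy values) and a query that matches the second one.
memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights, read = attend([0.0, 4.0], memory)
```

The weights play the role of the position of the reading head: the sentence with the largest weight contributes most to what is read.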


Reading Comprehension
End-To-End Memory Networks. S. Sukhbaatar, A. Szlam, J.
Weston, R. Fergus. NIPS, 2015.
The position of reading head:

Keras has example:
https://github.com/fchollet/keras/blob/master/examples/babi_memnn.py
Visual Question Answering

source: http://visualqa.org/
Visual Question Answering

Query DNN/RNN answer

Reading Head
Controller

CNN A vector for


each region
Visual Question Answering
Huijuan Xu, Kate Saenko. Ask, Attend and Answer: Exploring
Question-Guided Spatial Attention for Visual Question
Answering. arXiv Pre-Print, 2015
Speech Question Answering
TOEFL Listening Comprehension Test by Machine
Example:
Audio Story: (The original story is 5 min long.)
Question: What is a possible origin of Venus' clouds?
Choices:
(A) gases released as a result of volcanic activity
(B) chemical reactions caused by high surface temperatures
(C) bursts of radio energy from the planet's surface
(D) strong winds that blow dust into the atmosphere
Simple Baselines
Experimental setup: 717 for training, 124 for validation, 122 for testing
(2) select the shortest choice as answer
(4) select the choice semantically most similar to the others
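Baseline (2) needs no learning at all. The sketch below applies it to the example question; measuring length in words rather than characters is an assumption, and ties go to the earliest choice.

```python
CHOICES = {
    "A": "gases released as a result of volcanic activity",
    "B": "chemical reactions caused by high surface temperatures",
    "C": "bursts of radio energy from the planet's surface",
    "D": "strong winds that blow dust into the atmosphere",
}


def select_shortest(choices):
    """Return the label of the choice with the fewest words."""
    return min(choices, key=lambda label: len(choices[label].split()))


guess = select_shortest(CHOICES)
```

Baselines like this ignore the audio story entirely, which is why they serve as a floor for the learned models.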
Accuracy (%)

random

(1) (2) (3) (4) (5) (6) (7)


Naive Approaches
Model Architecture
Everything is learned from training examples
Question: "What is a possible origin of Venus' clouds?" -> Semantic Analysis -> Question Semantics
Audio Story -> Speech Recognition -> Semantic Analysis
Attention over the story, guided by the question semantics -> Answer
Select the choice most similar to the answer

Story text: "It be quite possible that this be due to volcanic eruption because volcanic eruption often emit gas. If that be the case volcanism could very well be the root cause of Venus 's thick cloud cover. And also we have observe burst of radio energy from the planet 's surface. These burst be similar to what we see when volcano erupt on earth"
Model Architecture
Word-based Attention
Model Architecture
Sentence-based Attention
(A) (A) (A) (A)

(A)

(B) (B) (B)


Supervised Learning

Memory Network: 39.2%


Accuracy (%)

(proposed by FB AI group)

(1) (2) (3) (4) (5) (6) (7)


Naive Approaches
[Tseng & Lee, Interspeech 16]
Supervised Learning [Fang & Hsu & Lee, SLT 16]

Word-based Attention: 48.8%

Memory Network: 39.2%


Accuracy (%)

(proposed by FB AI group)

(1) (2) (3) (4) (5) (6) (7)


Naive Approaches
Outline
Supervised Learning
Ultra Deep Network
New network structure
Attention Model
Reinforcement Learning

Unsupervised Learning
Image: Realizing what the World Looks Like
Text: Understanding the Meaning of Words
Audio: Learning human language without supervision
Scenario of Reinforcement
Learning
Observation Action

Agent

Don't do that
Reward

Environment
Scenario of Reinforcement
Learning Agent learns to take actions to
maximize expected reward.
Observation Action

Agent

Thank you. Reward

Environment
http://www.sznews.com/news/content/2013-11/26/content_8800180.htm
Supervised v.s. Reinforcement
Supervised Hello Say Hi
Learning from
teacher Bye bye Say Good bye

Reinforcement

. .

Hello Bad
Learning from
critics Agent Agent
Scenario of Reinforcement
Learning Agent learns to take actions to
maximize expected reward.
Observation Action

Reward Next Move

If win, reward = 1
If loss, reward = -1
Otherwise, reward = 0
Environment
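The reward scheme for Go is sparse, as a short sketch shows: every move receives 0 except the one that ends the game. The outcome labels and function names are hypothetical.

```python
def reward(outcome):
    """outcome is "win", "loss", or None while the game is still going."""
    if outcome == "win":
        return 1
    if outcome == "loss":
        return -1
    return 0


def episode_rewards(num_moves, final_outcome):
    """Reward received after each of the agent's moves in one game."""
    return [reward(None)] * (num_moves - 1) + [reward(final_outcome)]


rewards = episode_rewards(4, "win")
```

The agent has to work out which of the earlier zero-reward moves actually led to the final result.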
Supervised v.s. Reinforcement
Supervised:

Next move: Next move:


5-5 3-3

Reinforcement Learning

First move many moves Win!

Alpha Go is supervised learning + reinforcement learning.


Difficulties of Reinforcement
Learning
It may be better to sacrifice immediate reward to
gain more long-term reward
E.g. Playing Go
Agent's actions affect the subsequent data it
receives
E.g. Exploration
Deep Reinforcement Learning
DNN
Observation Action


Function Function
Input Output

Used to pick the


best function Reward

Environment
Application: Interactive Retrieval
Interactive retrieval is helpful. [Wu & Lee, INTERSPEECH 16]

Deep Learning

user

Deep Learning related to Machine Learning?


Deep Learning related to Education?
Deep Reinforcement Learning
Different network depth
Some depth is needed.
The task cannot be addressed by a linear model.
Better retrieval performance, less user labor

More Interaction
More applications
Alpha Go, Playing Video Games, Dialogue
Flying Helicopter
https://www.youtube.com/watch?v=0JL04JJjocc
Driving
https://www.youtube.com/watch?v=0xo1Ldx3L5Q
Google Cuts Its Giant Electricity Bill With DeepMind-Powered AI
http://www.bloomberg.com/news/articles/2016-07-19/google-cuts-its-giant-electricity-bill-with-deepmind-powered-ai
To learn deep reinforcement learning
Lectures of David Silver
http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html
10 lectures (1:30 each)
Deep Reinforcement Learning
http://videolectures.net/rldm2015_silver_reinforcement_learning/
Outline
Supervised Learning
Ultra Deep Network
New network structure
Attention Model
Reinforcement Learning

Unsupervised Learning
Image: Realizing what the World Looks Like
Text: Understanding the Meaning of Words
Audio: Learning human language without supervision
Does the machine know what the
world looks like?
Ref: https://openai.com/blog/generative-models/

Draw something!
Deep Dream
Given a photo, machine adds what it sees

http://deepdreamgenerator.com/
Deep Style
Given a photo, make its style like famous paintings

https://dreamscopeapp.com/
Deep Style

CNN CNN

content style

CNN

?
Generating Images by RNN

color of color of color of


2nd pixel 3rd pixel 4th pixel

color of color of color of


1st pixel 2nd pixel 3rd pixel
Generating Images by RNN
Pixel Recurrent Neural Networks
https://arxiv.org/abs/1601.06759

Real
World
Generating Images
Training a decoder to generate images is
unsupervised

? code Training data is a lot of images

Neural Network
Auto-encoder (not a state-of-the-art approach)
NN Encoder -> code
code -> NN Decoder
Learn together

[Figure: deep auto-encoder. Input Layer -> Layers (Encoder) -> bottle Layer (Code) -> Layers (Decoder) -> Output Layer. The output should be as close as possible to the input.]
Generating Images
Training a decoder to generate images is
unsupervised
Variational Auto-encoder (VAE)
Ref: Auto-Encoding Variational Bayes,
https://arxiv.org/abs/1312.6114
Generative Adversarial Network (GAN)
Ref: Generative Adversarial Networks,
http://arxiv.org/abs/1406.2661

code NN
Decoder
Which one is machine-generated?

Ref: https://openai.com/blog/generative-models/
!!! https://github.com/mattya/chainer-DCGAN
Outline
Supervised Learning
Ultra Deep Network
New network structure
Attention Model
Reinforcement Learning

Unsupervised Learning
Image: Realizing what the World Looks Like
Text: Understanding the Meaning of Words
Audio: Learning human language without supervision
Machine Reading
Machines learn the meaning of words from reading
a lot of documents without supervision

http://top-breaking-news.com/
Machine Reading
Machines learn the meaning of words from reading
a lot of documents without supervision

Word Vector / Embedding


tree
flower
dog rabbit
run
jump cat
Machine Reading
Generating Word Vector/Embedding is
unsupervised

Apple Training data is a lot of text

Neural Network

?
https://garavato.files.wordpress.com/2011/11/stacksdocuments.jpg?w=490
Machine Reading
Machines learn the meaning of words from reading
a lot of documents without supervision
A word can be understood by its context
"You shall know a word by the company it keeps"
Words that appear in the same context (here, next to "520") are something very similar
Word Vector

Source: http://www.slideshare.net/hustwj/cikm-keynotenov2014

Word Vector +

Characteristics


Solving analogies

Rome : Italy = Berlin : ?

Compute V(Berlin) - V(Rome) + V(Italy)
Find the word w with the closest V(w)
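Solving an analogy with word vectors can be sketched as vector arithmetic followed by a nearest-neighbour search. The two-dimensional vectors below are made up for illustration; real embeddings are learned from text and have many more dimensions.

```python
import math

# Hypothetical toy embeddings: first axis = which country, second = city vs. country.
VECTORS = {
    "Rome": (1.0, 0.0),
    "Italy": (1.0, 1.0),
    "Berlin": (2.0, 0.0),
    "Germany": (2.0, 1.0),
    "Paris": (3.0, 0.0),
    "France": (3.0, 1.0),
}


def solve_analogy(a, b, c, vectors=VECTORS):
    """Answer "a : b = c : ?" by finding the word closest to V(c) - V(a) + V(b)."""
    target = [vc - va + vb for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return min(candidates, key=lambda word: math.dist(vectors[word], target))


answer = solve_analogy("Rome", "Italy", "Berlin")
```

The query words themselves are excluded from the search, since one of them is often the nearest vector to the target.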
Machine Reading
Machines learn the meaning of words from reading
a lot of documents without supervision
Demo
Model used in demo is provided by
Part of the project done by
TA:
Training data is from PTT (collected by )

Outline
Supervised Learning
Ultra Deep Network
New network structure
Attention Model
Reinforcement Learning

Unsupervised Learning
Image: Realizing what the World Looks Like
Text: Understanding the Meaning of Words
Audio: Learning human language without supervision
Learning from Audio Book

Machine does not have any prior knowledge
Machine listens to lots of audio books
Like an infant

[Chung, Interspeech 16]


Audio Word to Vector
Audio segment corresponding to an unknown word
Fixed-length vector
Audio Word to Vector
The audio segments corresponding to words with
similar pronunciations are close to each other.
dog
never
dog
never

dogs
never

ever ever
Sequence-to-sequence
Auto-encoder
vector
audio segment

RNN Encoder The values in the memory


represent the whole audio
segment
The vector we want

How to train RNN Encoder?

x1 x2 x3 x4 acoustic features

audio segment
Sequence-to-sequence
Input acoustic features
Auto-encoder
x1 x2 x3 x4
The RNN encoder and
decoder are jointly trained.
y1 y2 y3 y4
RNN Encoder

RNN Decoder
x1 x2 x3 x4 acoustic features

audio segment
Audio Word to Vector
- Results
Visualizing embedding vectors of the words

fear

fame

name near
WaveNet (DeepMind)

https://deepmind.com/blog/wavenet-generative-model-raw-audio/
Concluding Remarks

Lecture I: Introduction of Deep Learning

Lecture II: Tips for Training Deep Neural Network

Lecture III: Variants of Neural Network

Lecture IV: Next Wave


AI ?
New Job in AI Age AI
(
)

http://www.express.co.uk/news/science/651202/First-step-towards-The-Terminator-becoming-reality-AI-beats-champ-of-world-s-oldest-game
AI


AI



AI

AI
step 1AI



step 3
E.g. best function
E.g. Deep Learning

AI
AI AI
AI

http://www.gvm.com.tw/webonly_content_10787.html