Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 5, 20 Jan 2016
Administrative
A1 is due today (midnight)
I’m holding make-up office hours today: 5pm @ Gates 259
Also:
- We are shuffling the course schedule around a bit
- the grading scheme is subject to a few percentage-point changes
Things you should know for your Project Proposal
1. Train on ImageNet
2. Finetune network on your own data
Transfer Learning with CNNs
1. Train on ImageNet.
2. If you have a small dataset: fix all weights (treat the CNN as a fixed feature extractor) and retrain only the classifier, i.e. swap out the Softmax layer at the end.
3. If you have a medium-sized dataset, “finetune” instead: use the old weights as initialization and train the full network or only some of the higher layers, i.e. retrain a bigger portion of the network, or even all of it.
E.g. Caffe Model Zoo: Lots of pretrained ConvNets
https://github.com/BVLC/caffe/wiki/Model-Zoo
...
Mini-batch SGD
Loop:
1. Sample a batch of data
2. Forward prop it through the graph, get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
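A minimal sketch of this loop in numpy-style Python. sample_batch and forward_backward are hypothetical helpers standing in for your data pipeline and network code; the vanilla SGD update is shown explicitly:

    learning_rate = 1e-3
    for step in range(10000):
        x_batch, y_batch = sample_batch(data, batch_size=256)      # 1. sample a batch (hypothetical helper)
        loss, grads = forward_backward(params, x_batch, y_batch)   # 2.-3. forward pass + backprop (hypothetical helper)
        for name in params:                                        # 4. vanilla SGD parameter update
            params[name] -= learning_rate * grads[name]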
Where we are now...
(image credits to Alec Radford)
Neural Turing Machine
[Figure: a large computational graph mapping an input tape to a loss]
[Figure: a gate in the computational graph; activations flow forward, and during backprop each gate multiplies the incoming gradient by its “local gradient” to produce gradients for its inputs]
Implementation: forward/backward API
Graph (or Net) object (rough pseudocode):
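The slide’s pseudocode is not reproduced in this transcript; a minimal sketch of what such a Graph (Net) object might look like (names are illustrative, not the course’s actual code):

    class ComputationalGraph:
        def __init__(self, nodes):
            self.nodes = nodes                 # gates/layers in topological order
        def forward(self, x):
            for node in self.nodes:            # run each gate's forward, feeding outputs onward
                x = node.forward(x)
            return x                           # final output, e.g. the loss
        def backward(self):
            grad = 1.0                         # d(loss)/d(loss) = 1
            for node in reversed(self.nodes):  # chain rule, gate by gate, in reverse order
                grad = node.backward(grad)
            return grad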
Implementation: forward/backward API
[Figure: a single multiply gate with inputs x, y and output z = x*y]
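Under the same hypothetical API, a single multiply gate: forward caches its inputs, backward multiplies the upstream gradient by the local gradients (which for z = x*y are just the other input):

    class MultiplyGate:
        def forward(self, x, y):
            self.x, self.y = x, y     # cache inputs for the backward pass
            return x * y              # z = x * y
        def backward(self, dz):
            dx = self.y * dz          # dL/dx = y * dL/dz
            dy = self.x * dz          # dL/dy = x * dL/dz
            return dx, dy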
Example: Torch Layers
Neural Network: without the brain stuff
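The slide shows a complete numpy training script that is not reproduced in this transcript; here is a two-layer network sketch in the same spirit (random data, sigmoid nonlinearity, L2 loss; all sizes and the learning rate are illustrative):

    import numpy as np
    N, D_in, H, D_out = 64, 1000, 100, 10              # batch size and layer sizes (illustrative)
    x, y = np.random.randn(N, D_in), np.random.randn(N, D_out)
    w1, w2 = 0.01 * np.random.randn(D_in, H), 0.01 * np.random.randn(H, D_out)
    for t in range(500):
        h = 1.0 / (1.0 + np.exp(-x.dot(w1)))           # hidden layer with sigmoid nonlinearity
        y_pred = h.dot(w2)                              # linear output layer
        loss = np.square(y_pred - y).sum()              # L2 loss
        grad_y_pred = 2.0 * (y_pred - y)                # backprop: d(loss)/d(y_pred)
        grad_w2 = h.T.dot(grad_y_pred)
        grad_h = grad_y_pred.dot(w2.T)
        grad_w1 = x.T.dot(grad_h * h * (1.0 - h))       # sigmoid local gradient is h*(1-h)
        w1 -= 1e-4 * grad_w1                            # plain SGD updates
        w2 -= 1e-4 * grad_w2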
Neural Networks: Architectures
Training Neural Networks
A bit of history...
A bit of history
The Mark I Perceptron machine (Frank Rosenblatt, ~1957) was the first implementation of the perceptron algorithm. It recognized letters of the alphabet.
Update rule:
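The rule itself is only in the slide image; for reference, the classic perceptron update with learning rate α, target d_j and thresholded prediction y_j is:

    w_i(t+1) = w_i(t) + α (d_j - y_j(t)) x_(j,i)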
A bit of history
A bit of history
[Figure: backpropagation written out in recognizable maths]
Reinvigorated research in Deep Learning
First strong results
Overview
1. One time setup: activation functions, preprocessing, weight initialization, regularization, gradient checking
2. Training dynamics: babysitting the learning process, parameter updates, hyperparameter optimization
3. Evaluation: model ensembles
Activation Functions
Activation Functions (overview):
- Sigmoid
- tanh(x)
- ReLU: max(0,x)
- Leaky ReLU: max(0.1x, x)
- Maxout
- ELU
Activation Functions: Sigmoid
σ(x) = 1 / (1 + e^(-x))
- Squashes numbers to range [0,1]
- Historically popular since it has a nice interpretation as a saturating “firing rate” of a neuron
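Why saturation matters (a short note, not spelled out in this transcript): the sigmoid’s local gradient is σ'(x) = σ(x)(1 - σ(x)), which peaks at 1/4 at x = 0 and approaches 0 for large |x|, so a saturated neuron passes back almost no gradient.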
3 problems:
1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered
3. exp() is a bit compute expensive
[Figure: a sigmoid gate receiving input x in a computational graph; when x is far from zero the local gradient is ≈ 0 and the backpropagated gradient is “killed”]
Consider what happens when the input x to a neuron is always positive: the gradients on w are then always all positive or all negative (they all share the sign of the upstream gradient). The allowed gradient update directions are restricted to two quadrants, so the weights zig-zag toward the hypothetical optimal w vector instead of heading there directly. (This is also why you want zero-mean data!)
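A one-line justification, assuming the usual neuron f = w·x + b: backprop gives dL/dw_i = x_i · dL/df, so if every x_i > 0 then all the dL/dw_i share the sign of dL/df, i.e. they are all positive or all negative.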
Activation Functions: tanh(x)
- Squashes numbers to range [-1, 1]
- Zero-centered (nice)
- Still kills gradients when saturated :(
Activation Functions: ReLU (Rectified Linear Unit) [Krizhevsky et al., 2012]
- Computes f(x) = max(0,x)
- Does not saturate (in + region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)
Problems with ReLU:
- Not zero-centered output
- An annoyance: what is the gradient when x < 0?
[Figure: a ReLU gate receiving input x; for x < 0 the local gradient is 0, so no gradient flows back]
[Figure: the training data cloud in input space; an “active ReLU” cuts through the data, while a “dead ReLU” lies off the data cloud, will never activate and therefore never update]
Activation Functions: Leaky ReLU [Maas et al., 2013] and PReLU [He et al., 2015]
- Leaky ReLU: f(x) = max(0.1x, x); does not saturate in either region and will not “die”
- Parametric ReLU (PReLU): the slope of the negative region is a parameter learned by backprop
Activation Functions: Exponential Linear Units (ELU) [Clevert et al., 2015]
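The defining formula is only in the slide image; for reference, the ELU is

    f(x) = x                 if x > 0
    f(x) = α (exp(x) - 1)    if x <= 0

It keeps the benefits of ReLU while pushing outputs closer to zero mean, at the cost of computing exp().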
Maxout “Neuron” [Goodfellow et al., 2013]
- Does not have the basic form of dot product -> nonlinearity
- Generalizes ReLU and Leaky ReLU
- Linear regime! Does not saturate! Does not die!
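The formula itself is only in the slide image; a maxout unit computes

    max(w1·x + b1, w2·x + b2)

The tradeoff is that it doubles the number of parameters per neuron.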
TLDR: In practice:
- Use ReLU (be careful with your learning rates)
- Try out Leaky ReLU / Maxout / ELU
- Try out tanh, but don’t expect much
- Don’t use sigmoid
Data Preprocessing
Step 1: Preprocess the data
TLDR: In practice for Images: center only
e.g. consider CIFAR-10 example with [32,32,3] images
- Subtract the mean image (e.g. AlexNet)
(mean image = [32,32,3] array)
- Subtract per-channel mean (e.g. VGGNet)
(mean along each channel = 3 numbers)
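A minimal numpy sketch of the two options (X here is assumed to be an [N, 32, 32, 3] float array of training images; the variable names are illustrative):

    import numpy as np
    mean_image = X.mean(axis=0)             # [32,32,3] mean image (AlexNet-style)
    X_centered = X - mean_image
    channel_mean = X.mean(axis=(0, 1, 2))   # 3 numbers, one per channel (VGGNet-style)
    X_centered = X - channel_mean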
Weight Initialization
- Q: what happens when W=0 init is used?
- First idea: Small random numbers
(gaussian with zero mean and 1e-2 standard deviation)
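In numpy this is one line (the slide’s code is not legible in this transcript; D and H are the layer’s input and output sizes):

    W = 0.01 * np.random.randn(D, H)   # zero-mean Gaussian, std 0.01

This works okay for small networks, but causes problems with deeper networks, as the activation statistics below show.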
Let’s look at some activation statistics
[Experiment: a deep network of tanh layers initialized with these small random weights; the plots show the mean, std, and histograms of the activations at each layer]
All activations become zero!
Q: think about the backward pass. What do the gradients look like?
With *1.0 instead of *0.01, almost all neurons are completely saturated at either -1 or 1. Gradients will be all zero.
“Xavier initialization” [Glorot et al., 2010]
Reasonable initialization. (Mathematical derivation assumes linear activations)
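In numpy this is typically written as the one-liner below (fan_in and fan_out are the layer’s input and output sizes), so that the variance of a neuron’s output roughly matches the variance of its inputs:

    W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)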
...but when using the ReLU nonlinearity it breaks.
He et al., 2015 (note the additional /2)
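The ReLU-aware fix, in the same numpy style (the /2 accounts for ReLU zeroing out half of its inputs):

    W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2.0)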
Proper initialization is an active area of research…
- Understanding the difficulty of training deep feedforward neural networks, Glorot and Bengio, 2010
- Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, Saxe et al., 2013
- Random walk initialization for training very deep feedforward networks, Sussillo and Abbott, 2014
Batch Normalization [Ioffe and Szegedy, 2015]
“you want unit gaussian activations? just make them so.”
(this is a vanilla differentiable function...)
Consider a batch of activations at some layer: N examples × D features. Compute the empirical mean and variance independently for each of the D dimensions and normalize each one.
The BN layer is usually inserted after fully connected (or convolutional) layers and before the nonlinearity.
Problem: do we necessarily want a unit gaussian input to a tanh layer?
Batch Normalization [Ioffe and Szegedy, 2015]
Normalize, and then allow the network to squash and shift the range if it wants to (it can even learn to recover the identity mapping):
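A minimal numpy sketch of the two steps for a batch x of shape [N, D] (gamma and beta are learned per-feature parameters; eps avoids division by zero):

    eps = 1e-5
    mu = x.mean(axis=0)                     # per-feature mean over the mini-batch
    var = x.var(axis=0)                     # per-feature variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to roughly unit gaussian
    y = gamma * x_hat + beta                # learned scale and shift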
Batch Normalization [Ioffe and Szegedy, 2015]
- Improves gradient flow through the network
- Allows higher learning rates
- Reduces the strong dependence on initialization
- Acts as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe
Batch Normalization [Ioffe and Szegedy, 2015]
Note: at test time the BatchNorm layer functions differently: the mean and variance are not computed over the test batch; instead, fixed empirical estimates accumulated during training (e.g. with running averages) are used.
Babysitting the Learning Process
Step 1: Preprocess the data
Step 2: Choose the architecture, e.g. an input layer (CIFAR-10 images, 3072 numbers), one hidden layer, and an output layer with 10 output neurons, one per class
Double check that the loss is reasonable:
With regularization disabled, the softmax loss comes out to ~2.3, which is “correct” for 10 classes: -ln(1/10) ≈ 2.3. (The training function returns the loss and the gradient for all parameters.)
Double check that the loss is reasonable:
Crank up regularization: the loss should go up, since the regularization term is added to the data loss.
Let’s try to train now…
- Tip: first make sure you can overfit a very small portion of the training data (take a tiny subset and turn off regularization); you should be able to reach ~100% training accuracy.
- Then start with small regularization and find the learning rate that makes the loss go down: if the loss barely moves, the learning rate is too low; if the loss explodes or goes to NaN, it is too high.
Cross-validation strategy
I like to do coarse -> fine cross-validation in stages
First stage: only a few epochs to get a rough idea of what params work
Second stage: longer running time, finer search
… (repeat as necessary)
For example: run a coarse search for 5 epochs.
Note: it’s best to optimize in log space!
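The sampling code in the screenshot is not legible here; a sketch of what log-space sampling looks like (the ranges are illustrative, not the lecture’s exact values):

    import numpy as np
    for trial in range(100):
        lr = 10 ** np.random.uniform(-6, -3)    # learning rate, sampled in log space
        reg = 10 ** np.random.uniform(-5, 5)    # regularization strength, also log space
        # train for ~5 epochs with (lr, reg) and record the validation accuracy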
Now run finer search… adjust the range around the best results (and be suspicious if the best value sits at the edge of the range you searched).
Random Search vs. Grid Search
Hyperparameters to play with:
- network architecture
- learning rate, its decay schedule, update type
- regularization (L2/Dropout strength)
My cross-validation
“command center”
Monitor and visualize the loss curve
[Plot: loss vs. time]
[Plot: loss vs. time, flat for a long stretch before it starts to drop; bad initialization is a prime suspect]
Loss function specimens: lossfunctions.tumblr.com
Monitor and visualize the train/validation accuracy:
- big gap between training and validation accuracy => overfitting (increase regularization strength?)
- no gap => underfitting; increase model capacity?
Track the ratio of weight updates / weight magnitudes:
ratio between the values and updates: ~ 0.0002 / 0.02 = 0.01 (about okay)
want this to be somewhere around 0.001 or so
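A sketch of how this ratio can be computed for one parameter tensor (W, dW and learning_rate are assumed to already exist in your training loop):

    param_scale = np.linalg.norm(W.ravel())
    update = -learning_rate * dW                # the step actually applied to W
    update_scale = np.linalg.norm(update.ravel())
    print(update_scale / param_scale)           # want this around ~1e-3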
Summary TLDRs
We looked in detail at:
- Activation functions (use ReLU)
- Data preprocessing (for images: subtract the mean)
- Weight initialization (use Xavier / He init)
- Batch normalization (use it)
- Babysitting the learning process
- Hyperparameter optimization (random search, sample in log space when appropriate)
TODO
Look at the remaining items from the overview (covered next): parameter updates, regularization (dropout), gradient checking, model ensembles.