INTRODUCTION

Learning rate & model performance: the learning rate significantly affects model
performance
-> It's one of the most difficult hyperparameters to set
Learning rate & cost function: the cost is often highly sensitive to some
directions in parameter space & insensitive to others
Method of momentum: can mitigate these issues somewhat, but it introduces
another hyperparameter
-> It's natural to ask if there's another way
Separate learning rates: if we believe that the directions of sensitivity are
somewhat axis-aligned
-> It makes sense to use a separate learning rate for each parameter &
to automatically adapt these learning rates throughout the learning process
Delta-bar-delta algorithm: an early heuristic approach to adapting individual
learning rates for model parameters during training
Base idea:
1. If the partial derivative of the loss with respect to a given model
parameter keeps the same sign
-> The learning rate for that parameter should increase
2. If the partial derivative changes sign
-> The learning rate for that parameter should decrease
#NOTE: this rule can only be applied to full-batch optimization (see the
sketch below)
Explain: the gradient computed in full-batch optimization is the exact
gradient, so a sign change reflects the true shape of the loss rather than
minibatch sampling noise
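A minimal NumPy sketch of this sign rule, under stated assumptions: the
increase/decrease constants kappa & phi, the toy quadratic loss & the function
name are illustrative choices, not from the source, and the full delta-bar-delta
algorithm compares against a smoothed ("bar") average of past gradients rather
than just the previous sign.

import numpy as np

def delta_bar_delta_sign_rule(grad_fn, theta, lr=0.01, kappa=0.01, phi=0.5, steps=200):
    """Simplified delta-bar-delta heuristic: each parameter keeps its own
    learning rate, which grows additively (by kappa) while its gradient keeps
    the same sign and shrinks multiplicatively (by phi) when the sign flips.
    Full-batch (exact) gradients are assumed, as the note above requires."""
    lrs = np.full_like(theta, lr)        # one learning rate per parameter
    prev_sign = np.zeros_like(theta)     # sign of the previous gradient
    for _ in range(steps):
        g = grad_fn(theta)               # exact full-batch gradient
        sign = np.sign(g)
        lrs = np.where(sign == prev_sign, lrs + kappa, lrs * phi)
        theta = theta - lrs * g
        prev_sign = sign
    return theta

# toy full-batch quadratic loss 0.5 * sum(a * theta^2), gradient a * theta
a = np.array([2.0, 100.0])
print(delta_bar_delta_sign_rule(lambda th: a * th, np.array([1.0, 1.0])))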

ADAGRAD
Idea: individually adapt the learning rates of all model parameters by
scaling them inversely proportional to the square root of the sum of all the
historical squared values of the gradient
Pseudo-code:
Require: global learning rate eps (initially applied to all parameters)
Require: initial parameter theta
Require: a small constant delta (maybe 10^(-7)) for numerical stability

L(f(x(i); theta), y(i)) is the loss for example i

initialize gradient accumulation variable r = 0


while stopping criterion not met:
    sample a minibatch of m examples from the training set
    {x(1), ..., x(m)} with corresponding targets y(i)
    compute gradient: g = 1/m * grad(sum(L(f(x(i); theta), y(i)) over all i), theta)
    accumulate squared gradient: r = r + g .* g   (element-wise square)
    compute update: Dtheta = - eps / (delta + sqrt(r)) .* g   (applied element-wise)
    apply update: theta = theta + Dtheta   (Dtheta already carries the minus sign)
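A runnable NumPy version of the pseudo-code above; the toy quadratic objective,
the step count & the hyperparameter values are illustrative assumptions, not
part of the original notes.

import numpy as np

def adagrad(grad_fn, theta, eps=0.01, delta=1e-7, steps=500):
    """AdaGrad: scale each parameter's learning rate inversely proportional to
    the square root of its accumulated element-wise squared gradients."""
    r = np.zeros_like(theta)                      # gradient accumulation variable
    for _ in range(steps):                        # stand-in stopping criterion
        g = grad_fn(theta)                        # minibatch (here: full) gradient
        r += g * g                                # accumulate squared gradient
        dtheta = -eps / (delta + np.sqrt(r)) * g  # element-wise scaled update
        theta = theta + dtheta                    # apply update
    return theta

# toy quadratic loss with very different sensitivity per axis (assumed example)
a = np.array([100.0, 1.0])
print(adagrad(lambda th: a * th, np.array([1.0, 1.0])))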
Effect of the parameter update:
The parameters with the largest partial derivatives of the loss: have a
rapid decrease in their effective learning rate
The parameters with small partial derivatives of the loss: have a
relatively small decrease in their effective learning rate
#NOTE: for training deep models, the accumulation of squared gradients from
the beginning of training can result in a premature & excessive decrease in the
effective learning rate (see the numeric illustration below)
-> AdaGrad performs well for some but not all deep models
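A small numeric illustration of how the effective per-parameter learning rate
eps / (delta + sqrt(r)) decays; the gradient magnitudes & step count are made-up
numbers: after t steps with a roughly constant partial derivative g, r is about
t * g^2, so the rate shrinks fastest for the parameters with the largest
gradients, and keeps shrinking for as long as training runs.

import numpy as np

eps, delta, t = 0.01, 1e-7, 100           # hypothetical settings, 100 steps in
for g in (10.0, 0.1):                     # large vs. small partial derivative
    r = t * g**2                          # accumulated squared gradient
    print(g, eps / (delta + np.sqrt(r)))  # effective per-parameter learning rate
# -> about 1e-4 for the large-gradient parameter vs. 1e-2 for the small one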

RMSPROP
RMSProp: a modified version of AdaGrad that performs well in the non-convex
setting
Idea: change the gradient accumulation into an exponentially weighted moving
average
Recall AdaGrad:
Convex function: AdaGrad is designed to converge rapidly when applied
to a convex function
Non-convex function:
1. The learning trajectory may pass through many different
structures
-> It may eventually arrive at a region that is a locally
convex bowl
2. AdaGrad shrinks the learning rate according to the entire
history of the squared gradient
-> The learning rate may become too small before arriving at such
a convex structure
RMSProp vs. AdaGrad: RMSProp uses an exponentially decaying average to discard
history from the extreme past, so it can converge rapidly after finding a convex bowl
-> RMSProp can be seen as an instance of the AdaGrad algorithm
initialized within that bowl
Pseudo-code:
Require: global learning rate eps, decay rate rho
Require: initial parameter theta
Require: small constant delta (maybe 10^(-6)) used to stabilize
division by small numbers

initialize accumulation variables r = 0


while stopping criterion not met:
    sample a minibatch of m examples from the training set
    {x(1), ..., x(m)} with corresponding targets y(i)
    compute gradient: g = 1/m * grad(sum(L(f(x(i); theta), y(i)) over all i), theta)
    accumulate squared gradient: r = rho * r + (1 - rho) * g .* g   (element-wise square)
    compute parameter update: Dtheta = - eps / sqrt(delta + r) .* g   (applied element-wise)
    apply update: theta = theta + Dtheta
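A runnable NumPy version of the pseudo-code above, reusing the toy objective
from the AdaGrad sketch; the hyperparameter values & step count are illustrative
assumptions, not part of the original notes.

import numpy as np

def rmsprop(grad_fn, theta, eps=0.001, rho=0.9, delta=1e-6, steps=500):
    """RMSProp: like AdaGrad, but the squared-gradient accumulator is an
    exponentially weighted moving average with decay rate rho, so history
    from the extreme past is gradually discarded."""
    r = np.zeros_like(theta)                    # accumulation variable
    for _ in range(steps):                      # stand-in stopping criterion
        g = grad_fn(theta)                      # minibatch (here: full) gradient
        r = rho * r + (1.0 - rho) * g * g       # exponentially weighted moving average
        dtheta = -eps / np.sqrt(delta + r) * g  # element-wise scaled update
        theta = theta + dtheta                  # apply update
    return theta

# same toy quadratic objective as in the AdaGrad sketch (assumed example)
a = np.array([100.0, 1.0])
print(rmsprop(lambda th: a * th, np.array([1.0, 1.0])))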
#NOTE:
1. Empirically, RMSProp has been shown to be an effective & practical
optimization algorithm for deep neural nets
2. It's currently one of the go-to optimization methods being employed
routinely by deep learning practitioners

NEW WORD
Accumulation (n): sự tích lũy