
# INTRODUCTION

Learning rate & model performance: the learning rate significantly affects model
performance
-> It's one of the most difficult hyperparameters to set
Learning rate & cost function: the cost is often highly sensitive to some
directions & insensitive to others in parameter space
Method of momentum: can mitigate these issues somewhat, but it introduces
another hyperparameter
-> It's natural to ask if there's another way
Separate learning rates: if we believe that the directions of sensitivity
are somewhat axis-aligned
-> It makes sense to use a separate learning rate for each
parameter & automatically adapt these learning rates throughout the learning process
Delta-bar-delta algorithm: an early heuristic approach to adapting individual
learning rates for model parameters during training
Base idea:
1. If the partial derivative of the loss, with respect to a given
model parameter, remains the same sign
-> The learning rate should increase
2. If the partial derivative changes sign
-> The learning rate should decrease
#NOTE: this rule can only be applied to full-batch optimization
Explain: the gradient computed in full-batch optimization is the exact gradient
of the cost, so a sign change is a meaningful signal; with minibatches, the sign
of a partial derivative can flip due to sampling noise alone
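The sign-based rule above can be sketched as a per-parameter update (a minimal illustration only; the increase amount `kappa` and decrease factor `phi` are assumed values, not part of the original notes):

```python
import numpy as np

def delta_bar_delta_step(grad, prev_grad, lrs, kappa=0.01, phi=0.1):
    """Adapt per-parameter learning rates from the signs of two consecutive
    full-batch gradients: same sign -> additive increase, sign flip ->
    multiplicative decrease."""
    same_sign = np.sign(grad) == np.sign(prev_grad)
    return np.where(same_sign, lrs + kappa, lrs * (1 - phi))

lrs = np.full(3, 0.1)
g_prev = np.array([1.0, -2.0, 0.5])
g_now = np.array([0.5, 2.0, 0.1])  # the second component flips sign
lrs = delta_bar_delta_step(g_now, g_prev, lrs)
# first & third learning rates grow, second shrinks
```

In the full algorithm the comparison uses a smoothed ("bar") average of past gradients rather than just the previous one; the sketch keeps only the sign heuristic the notes describe.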

# ADAGRAD

Idea: individually adapt the learning rates of all model parameters by
scaling them inversely proportionally to the square root of the sum of all the
historical squared values of the gradient
Pseudo-code:
Require: global learning rate epsilon (initially applied to all parameters)
Require: initial parameter theta
Require: a small constant delta (maybe 10^(-7)) for numerical stability

## initialize gradient accumulation variable r = 0

while stopping criterion not met:
    sample a minibatch of m examples from the training set
    {x(1), ..., x(m)} with corresponding targets y(i)
    compute gradient: g = (1/m) * sum_i grad_theta L(f(x(i); theta), y(i))
    accumulate squared gradient: r = r + g .* g (elementwise)
    compute update: Dtheta = - epsilon / (delta + sqrt(r)) .* g
    (division & square root applied elementwise)
    apply update: theta = theta + Dtheta
Effect of the parameter update:
The parameters with the largest partial derivatives of the loss: have a
rapid decrease in their learning rate
The parameters with small partial derivatives of the loss: have a
relatively small decrease in their learning rate
#NOTE: for training deep models, the accumulation of squared gradients from
the beginning of training can result in a premature & excessive decrease in the
effective learning rate

# RMSPROP

Idea: modify AdaGrad to perform better in the non-convex setting by changing
the gradient accumulation into an exponentially weighted moving average
Convex function: AdaGrad is designed to converge rapidly when applied
to a convex function
Non-convex function:
1. The learning trajectory may pass through many different
structures
-> It may eventually arrive at a region that is a locally
convex bowl
2. AdaGrad shrinks the learning rate according to the entire history
of the squared gradient
-> The learning rate may be too small before arriving at such
a convex structure
RMSProp uses an exponentially decaying average to discard history
from the extreme past so that it can converge rapidly after finding a convex bowl
-> RMSProp can be seen as an instance of the AdaGrad algorithm
initialized within that bowl
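The switch from AdaGrad's full-history accumulation to an exponentially weighted moving average is a one-line difference, sketched below (a minimal illustration; the decay rate rho = 0.9 and the constant example gradient are assumed values):

```python
import numpy as np

rho = 0.9                       # decay rate of the moving average
g = np.array([0.5, -2.0])       # example minibatch gradient, held constant here
r_adagrad = np.zeros_like(g)
r_rmsprop = np.zeros_like(g)

for _ in range(3):
    r_adagrad = r_adagrad + g * g                     # AdaGrad: history piles up forever
    r_rmsprop = rho * r_rmsprop + (1 - rho) * g * g   # RMSProp: old history decays away
```

For a constant gradient, `r_adagrad` grows without bound (so the effective learning rate keeps shrinking), while `r_rmsprop` converges toward `g * g`, which is why RMSProp behaves like AdaGrad freshly initialized inside the convex bowl it eventually finds.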
Pseudo-code:
Require: global learning rate epsilon, decay rate p
Require: initial parameter theta
Require: small constant delta (maybe 10^(-6)) used to stabilize
division by small numbers

## initialize accumulation variable r = 0

while stopping criterion not met:
sample a minibatch of m examples from the training set
{x(1), ..., x(m)} with corresponding targets y(i)