
A Quick Guide on Basic Regularization Methods for Neural Networks
L1 / L2, Weight Decay, Dropout, Batch Normalization, Data Augmentation and Early Stopping

Jaime Durán
Oct 29, 2019 · 7 min read

The time has come to put our neural network on a diet [Photo: Gesina Kunkel]


Also available in Spanish | También disponible en español

The following story will sound familiar to many of you. Imagine for a moment that we
have to solve a data problem using a classic Machine Learning algorithm. A typical first
approach is to apply dimensionality reduction to improve the model's performance
and reduce its training time.

Nowadays, with Deep Learning, the laborious task of optimizing our pipelines has been
pushed a bit into the background, thanks especially to the arrival of GPUs and easy
access to them. Since we can count on more resources, we are no longer willing to
sacrifice any detail of the input data, so we feed absolutely all the features into the
network, no matter how many there are. We also strive to reduce the error to a
minimum, which translates into an uncontrolled increase in the network's layers and
parameters, always encouraged by the Universal Approximation Theorem.

This growth increases the network's complexity, and with it the risk of overfitting,
especially when we have few training samples: the number of input samples becomes
much smaller than the number of parameters the network uses to fit itself to so little
information.

Solving this problem is precisely what regularization techniques try to accomplish. This
article briefly reviews the ones most widely used today, mainly because of their good
performance.


. . .

L2 regularization
The main idea behind this kind of regularization is to shrink the parameters' values,
which translates into a reduction in variance.

This technique introduces an extra penalty term into the original loss function (L),
adding the sum of the squared parameters (ω).

The bad news is that this new term could be so large that the network would try to
minimize the loss function by pushing its parameters very close to zero, which wouldn't
be convenient at all. That's why we multiply the sum by a small constant (λ), whose
value is chosen somewhat arbitrarily (0.1, 0.01, …).

The loss function then becomes:

L_reg = L + λ · Σᵢ ωᵢ²

And the weight update on each iteration of the gradient descent algorithm would be:

ω ← ω − α · (∂L/∂ω + 2λ · ω)

Increasing the value of λ would be enough to apply more regularization.

This is the regularization used in Ridge regression.
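As a minimal illustration, here is a plain Python/NumPy sketch of both the penalized loss and the corresponding update; the function names and the default λ are my own, chosen only for the example:

```python
import numpy as np

def l2_loss(loss, weights, lam=0.01):
    # Original loss plus lambda times the sum of squared parameters.
    return loss + lam * sum(np.sum(w ** 2) for w in weights)

def l2_update(w, grad, lr, lam=0.01):
    # Gradient-descent step; the penalty contributes 2 * lambda * w to the gradient.
    return w - lr * (grad + 2 * lam * w)
```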

L1 regularization
There is another, very similar technique called L1 regularization, where the network
parameters in the penalty term are not squared; their absolute values are used instead:

L_reg = L + λ · Σᵢ |ωᵢ|


This variant pushes the parameters towards smaller values, even completely canceling
the influence of some input features on the network's output, which amounts to
automatic feature selection. The result is better generalization, but only up to a point
(the choice of λ becomes more significant in this case).

This is the regularization applied by Lasso regression.
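Following the same sketch as above, the only change is the penalty term (again, names and λ are illustrative):

```python
import numpy as np

def l1_loss(loss, weights, lam=0.01):
    # Same as the L2 version, but with absolute values instead of squares.
    return loss + lam * sum(np.sum(np.abs(w)) for w in weights)
```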

Weight decay
This technique is identical to L2 regularization, but applied at a different point:
instead of introducing the penalty as a sum in the loss function, it is added as an extra
term in the weight update formula:

ω ← ω − α · ∂L/∂ω − α · λ · ω

As we can see, this update is practically the same as in L2 regularization, except that
here the λ constant is not multiplied by 2 (so if we code both methods, λ should be
twice as large in this case to get the same result as with L2).
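A sketch of this update follows; in practice most deep learning libraries expose it directly as an optimizer argument (for instance, the weight_decay parameter of torch.optim.SGD), so the helper below is only illustrative:

```python
def weight_decay_update(w, grad, lr, lam=0.02):
    # The decay term lam * w is added directly in the update (no factor of 2),
    # so lam here plays the role of 2 * lambda in the L2 formulation.
    return w - lr * grad - lr * lam * w
```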

Dropout
This technique differs from the previous ones. The procedure is simple: for each new
input to the network during the training stage, a percentage of the neurons in each
hidden layer is randomly deactivated, according to a previously defined discard
probability. This probability can be the same for the entire network, or different for
each layer.

[Figure: dropout illustration, from the original paper]

By applying dropout we prevent neurons from memorizing part of the input; that is
precisely what happens when the network overfits.

Once the model is fitted, we must somehow compensate for the fact that part of the
network was inactive during training. At prediction time every neuron is active, so
there are more activations (layer outputs) contributing to the network's output,
increasing its range. One easy solution is to multiply all parameters by their probability
of not being discarded.
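A minimal NumPy sketch of this scheme for a single hidden layer (the layer, its weights and the 0.5 discard probability are made up for the example; note that many libraries instead implement the equivalent "inverted" dropout, scaling by 1/(1 − p) during training so no compensation is needed at prediction time):

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.5                          # discard probability for this layer

def hidden_layer(x, W, b, training):
    a = np.maximum(0, x @ W + b)      # ReLU activations of the hidden layer
    if training:
        # Randomly deactivate a fraction p_drop of the neurons.
        mask = rng.random(a.shape) >= p_drop
        return a * mask
    # At prediction time all neurons are active, so we compensate by
    # scaling with the probability of not being discarded.
    return a * (1.0 - p_drop)
```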

Batch normalization
The history of this technique is pretty strange. It was presented as a solution to reduce
something called Internal Covariate Shift, but recent research shows that's not what
it really does. Nevertheless, it's an essential technique for training neural networks, as
explained below.

Batch norm basically consists of adding an extra step, usually between the neurons and
the activation function, with the purpose of normalizing the output activations. Ideally,
the normalization would use the mean and variance of the entire training set, but if we
are training the network with mini-batch gradient descent, the mean and variance of
each mini-batch are used instead.

Note: each output of each neuron will be normalized independently, meaning that on
each iteration the mean and variance of each output will be calculated for the current
mini-batch.

After normalization, two additional trainable parameters are introduced: one acting as
a bias that is added, and another, similar to a bias, that multiplies each activation.
Their purpose is to scale and shift the normalized values back to a suitable range,
which significantly helps the network adjust its parameters to the input data and
reduces the oscillations of the loss function. As a consequence, we can also increase the
learning rate (there is less risk of ending up stuck in a local minimum) and convergence
towards the global minimum is reached more quickly.
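A NumPy sketch of the step just described, writing the two trainable parameters as gamma (the multiplying constant) and beta (the added bias), the usual names in the literature:

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    # z: pre-activation outputs for one mini-batch, shape (batch, features).
    mu = z.mean(axis=0)                    # per-feature mean over the mini-batch
    var = z.var(axis=0)                    # per-feature variance over the mini-batch
    z_hat = (z - mu) / np.sqrt(var + eps)  # normalization
    return gamma * z_hat + beta            # trainable scale and shift
```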



Batch norm is more of a training aid than a regularization strategy in itself. The
regularization is actually achieved by combining it with an optimization technique
known as Momentum, already introduced in the previous article. Taken together, they
significantly mitigate overfitting.

Data augmentation
The idea here is to apply diverse transformations to the original dataset, obtaining
slightly different but essentially equivalent samples, which allows the network to
perform better at inference time.

This technique is widely used in the field of computer vision because it works
extremely well (in other fields it still remains to be fully exploited). In this context, a
single input image is processed by the network as many times as the number of epochs
we run, which lets the network memorize parts of the image if we train for too long.
The solution: apply random transformations each time we feed the image back into the
network (see the sketch after the list below).

Examples of these transformations are:

Flipping the image horizontally / vertically

Rotating the image X degrees.

Cropping, expanding, resizing, …

Applying perspective deformations

Adjusting brightness, contrast, saturation, …

Adding some noise, defects, …

A combination of the above :)
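As a sketch of such a pipeline, here is a possible set of random transformations written with torchvision (the article's own examples use fastai, which offers equivalent transforms; the specific parameter values are arbitrary):

```python
from torchvision import transforms

train_tfms = transforms.Compose([
    transforms.RandomHorizontalFlip(),          # random horizontal flip
    transforms.RandomRotation(degrees=10),      # small random rotation
    transforms.RandomResizedCrop(224),          # random crop, then resize
    transforms.ColorJitter(brightness=0.2,      # brightness / contrast /
                           contrast=0.2,        # saturation adjustments
                           saturation=0.2),
    transforms.ToTensor(),
])
# Each epoch, the same source image goes through a different random
# combination of these transformations before reaching the network.
```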


[Figure: data augmentation example using fastai]

By applying data augmentation we have more information available for learning,
without the need to collect additional samples and without extending the training
time. Moreover, whether our network classifies images or detects objects, this
technique will make the model capable of achieving good results with images taken
from different angles or under different lighting conditions (less overfitting, so it will
generalize better!).

Early Stopping
And finally, a set of rules to decide when it's time to stop training, so that we avoid
overfitting (and underfitting).

The most common approach is to train the model while monitoring its performance
and saving its parameters at the end of each epoch. We keep doing so until we see that
the validation error increases steadily (occasional worsening is expected due to the
stochastic nature of the algorithm). We then keep the model obtained just before that
point.
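A minimal sketch of such a loop, assuming hypothetical train_one_epoch and validate helpers, a PyTorch-style model, and a patience of a few epochs before giving up:

```python
import copy

best_val_loss = float("inf")
best_state = None
patience, epochs_without_improvement = 5, 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)        # hypothetical helper
    val_loss = validate(model, valid_loader)    # hypothetical helper
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_state = copy.deepcopy(model.state_dict())  # save the best parameters
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                               # validation error keeps rising

model.load_state_dict(best_state)  # recover the model from just before overfitting
```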


[Figure: training example using fastai]

At this point it's worth remembering the importance of choosing a good validation
set :)

. . .

I hope you liked it! Subscribe to #yottabytes so you don’t miss articles like this one :)


