
Phase Plane Analysis of Vanishing Gradient Problem

Calvin Godfrey
2015-12-21

1 Background Information

1.1 Neural Networks

The purpose of this project was to see whether any trends appear when the weights of a neural network are graphed as a phase portrait. A neural network is set up as multiple layers of artificial neurons, made to simulate what happens in the brain. It starts with a layer of input neurons, followed by any number of layers of hidden neurons (which do the actual math), and ends with an output layer. Every neuron is connected to all the neurons in the next layer, and each connection has a weight that adjusts the input. Every neuron (other than the ones in the input layer) also has one bias that further adjusts its input. When a neuron gets its input, it multiplies the input by the weight, adds the bias (wx + b), and then plugs the new number into the sigmoid function. The sigmoid function is

S(x) = \frac{1}{1 + e^{-x}}

where e is the irrational number 2.71828.... The values go all the way through the network, and an output is given based on the final values of the neurons in the output layer. Neural networks are trained with data sets and expected outputs, which allows the network to learn and become more accurate. All the weights and biases are changed based on the difference between the expected and actual values. The error \delta^l_j of neuron j in layer l is defined as

\delta^l_j = \frac{\partial C}{\partial z^l_j}

where z^l_j is the input to the neuron after the weight and bias are applied but before the sigmoid function, and \partial C / \partial z^l_j is the partial derivative of C (the cost function) with respect to z^l_j. The cost function C is defined as


C = \frac{1}{2n} \sum_x \| y(x) - a^L(x) \|^2

where n is the total number of training examples, x is an individual training example, y(x) is the desired output, a^L(x) is a vector signifying the output of each neuron in the output layer, and \| \cdot \| signifies the magnitude (length) of the vector. In words, the cost is the average error of the network over all the training examples, divided by 2. The network uses this value to determine how to change each value in the network, and each time the values are adjusted, an epoch has passed. The values are adjusted going backwards through the network, first tweaking the output neurons; the amounts they change by are then used to adjust the neurons in the previous layers, and so on all the way through the network.
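
The two formulas above translate directly into code. The sketch below is a minimal NumPy version of the sigmoid and the quadratic cost, assuming the desired outputs y(x) and the network outputs a^L(x) are stored as arrays with one row per training example; the function and variable names are illustrative and not taken from the project's actual code.

    import numpy as np

    def sigmoid(x):
        """S(x) = 1 / (1 + e^(-x)), applied element-wise."""
        return 1.0 / (1.0 + np.exp(-x))

    def quadratic_cost(outputs, targets):
        """C = 1/(2n) * sum over x of ||y(x) - a^L(x)||^2.

        outputs: array of shape (n, output_neurons), one row per training example
        targets: array of the same shape holding the desired outputs y(x)
        """
        n = outputs.shape[0]
        # Squared Euclidean distance per example, summed, then averaged and halved.
        return np.sum(np.linalg.norm(targets - outputs, axis=1) ** 2) / (2 * n)
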
The problem this project is trying to help solve is the vanishing gradient problem. The cost function for a given weight or bias changes that weight based on the accuracy of values closer to the output. When those weights and biases reach their optimal values before the weights earlier in the network, those earlier weights can no longer improve much even though they aren't at their best position. The only accepted workaround right now is simply waiting longer and letting those values settle very slowly, which limits what neural networks can be used for. However, one potential method implemented here is DropIn, based on a paper released 22 November. This method attempts to make learning more natural and human-like by starting with a more basic network with fewer layers. Using a probability variable p, layers are slowly added in until the network is as connected and deep as necessary. This probability changes based on the epoch the network is on, which allows values near the input to reach their optimal value, meaning the vanishing gradient problem doesn't have as much of an effect, without sacrificing the speed of the network.
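
The exact schedule for p is not reproduced here, so the sketch below only illustrates the idea with an assumed linear ramp: early epochs mostly train the shallower network, and later epochs increasingly include the extra layer. The function names, the ramp, and the network arguments are placeholders, not the schedule used in the DropIn paper or in this project.

    import random

    def dropin_probability(epoch, total_epochs):
        """Assumed linear ramp: the probability of training the deeper network
        grows from 0 at the first epoch to 1 at the final epoch."""
        return min(1.0, epoch / float(total_epochs))

    def pick_network(epoch, total_epochs, shallow_net, deep_net):
        """Choose which of the two networks to train this epoch, based on p."""
        p = dropin_probability(epoch, total_epochs)
        return deep_net if random.random() < p else shallow_net
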

1.2 Phase Portraits

A phase portrait represents the trajectories of a dynamical system in the phase plane. A dynamical system is a relationship between two or more quantities which change over time based on their current value. The change can be either deterministic, which means only one series of events occurs and is the same every time, or stochastic, which means random events affect the system. The phase portraits generated in this project are stochastic (see step 2 under Methods). These dynamical systems can be solved to help find special characteristics of the graph. To do this, the equations of the system

\frac{dx}{dt} = Ax + By

\frac{dy}{dt} = Cx + Dy

can be rewritten as a single matrix equation,

\frac{d}{dt}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} A & B \\ C & D \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix}
This can be solved by referring to the coefficient matrix \begin{pmatrix} A & B \\ C & D \end{pmatrix} as A and finding the characteristic polynomial of that matrix. This is found by calculating det(tI - A), where t is the variable being solved for and I is the identity matrix \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}. The determinant (det) of a 2x2 matrix is the upper-left entry times the lower-right minus the upper-right times the lower-left, so det(A) = AD - BC. det(tI - A) will return a polynomial of the form t^2 + bt + c, and its roots can be found using the
quadratic formula. These roots determine specific characteristics of the phase
portrait. For example, if one root is positive and the other is negative, the
portrait will have a saddle point, meaning the different trajectories of the
dynamical system approach a specific point (called a node) but never reach it.
If both are positive, the system diverges away from the node, and if both are
negative, the system converges to the node.
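
The classification described above can be checked numerically. The sketch below builds the coefficient matrix, finds the roots of its characteristic polynomial (its eigenvalues) with NumPy, and reports the node type for the real-root cases discussed in this section; it is only a sketch for 2x2 real matrices, and the function name is illustrative.

    import numpy as np

    def classify_node(A, B, C, D):
        """Classify the fixed point of dx/dt = Ax + By, dy/dt = Cx + Dy
        from the roots of det(tI - M) = t^2 - (A + D)t + (AD - BC)."""
        M = np.array([[A, B], [C, D]], dtype=float)
        roots = np.linalg.eigvals(M)  # roots of the characteristic polynomial
        if np.iscomplexobj(roots):
            return "complex roots (spiral or center, not covered above)"
        r1, r2 = roots
        if r1 * r2 < 0:
            return "saddle point"
        if r1 > 0 and r2 > 0:
            return "unstable node (trajectories diverge)"
        if r1 < 0 and r2 < 0:
            return "stable node (trajectories converge)"
        return "degenerate case (a zero root)"

    # Example: one positive and one negative root gives a saddle point.
    print(classify_node(1.0, 0.0, 0.0, -2.0))
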

2 Methods

The procedure for this project is as follows:


1. Create a neural network.
(a) Decide what it needs to be trained to do. The network designed
was made to recognize hand-written digits because there is already
a large dataset available in MNIST (see bibliography).
(b) Decide how many layers and how many neurons in each layer. The
network was designed to have 4 layers with 784, 30, 30, and 10 neurons respectively. When DropIn was added, a second network with
three layers (784, 30, 10) was added.
(c) In order to save time for the various tests done, the network was trained for 50 epochs, which took approximately 20 sec. per epoch.
2. Implement a method to reset the network to its original starting values (which were randomly chosen at the very start) while changing one weight (specifically weight 1, neuron 1, layer 1) to a new, random value; a minimal sketch of this reset step appears after this list. This means that the plots show how changing one weight affects all the others in the network.
(a) This was done 30 times, with 50 epochs each, before plots were created of the data.
3. Implement a searchable way to store every value in the network in order
to create the plots.
4. Implement a way to create phase portraits based on the data collected.
(a) Find a way to distinguish the different trajectories. The built-in
library used made everything the same color, and therefore indistinguishable.
(b) Each phase portrait created has 30 trajectories (representing the 30 times the network was reset) with 50 points (representing the 50 epochs per training).
5. Implement DropIn capabilities.
(a) Find a way to calculate the probability p of using 3 or 4 layers for
the training.
(b) Implement a method to copy all the values in one of the two networks
to the other.
(c) Adjust the graphing process so that only one network's values were used to create the graphs.
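
The reset step in item 2 can be sketched as follows, assuming the weights are kept as a list of per-layer NumPy arrays indexed as weights[layer][neuron, input]; that storage convention, the indexing of "weight 1, neuron 1, layer 1", and the random range are assumptions, not the project's actual code.

    import copy
    import numpy as np

    def reset_with_one_random_weight(initial_weights):
        """Restore the network's original starting weights, then give
        weight 1 of neuron 1 in layer 1 a new random value.

        initial_weights: list of per-layer arrays saved at the very start.
        """
        weights = copy.deepcopy(initial_weights)          # keep the saved originals intact
        weights[0][0, 0] = np.random.uniform(-0.1, 0.1)   # assumed range for the new value
        return weights
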

3 Data Collection

It is impossible to include all the data here, since there are over 2 GB of raw data and over 6 GB of phase portraits. The raw data is stored in a format similar to the sample below:
 0.06454214   0.0488859    0.0083646   ...  -0.02503039   0.05112216   0.03703298
-0.00923222  -0.00948839  -0.09542767  ...   0.03434509   0.05538697   0.06440752
-0.008556    -0.0033378   -0.0275531   ...   0.01002194  -0.02595811   0.0114625
 ...          ...          ...         ...   ...          ...          ...
 0.03980881   0.00888486   0.04386231  ...   0.0201772   -0.01932619  -0.03113669
-0.00060694  -0.05098261  -0.07543377  ...  -0.005953    -0.02836276   0.03444392
-0.03255616   0.00906832   0.03867584  ...   0.00858764  -0.01971178  -0.0256726
The full set of data is stored in a six-dimensional array of shape 30 x 2 x 51 x 3 x 784 x 30. The dimensions represent the number of times the network was reset, the number of networks once DropIn was implemented, the number of epochs + 1, the number of layers, the maximum number of neurons, and the maximum number of weights, respectively.
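
As a concrete illustration, the array and the lookup of a single weight's trajectory could look like the sketch below. The axis order follows the description above; the names and the zero-filled stand-in array are placeholders for the real recorded data.

    import numpy as np

    # Axis order, following the description above:
    # (resets, networks, epochs + 1, layers, max neurons, max weights)
    DATA_SHAPE = (30, 2, 51, 3, 784, 30)

    def weight_trajectory(data, reset, network, layer, neuron, weight):
        """Pull out the 51 recorded values of one weight across the epochs
        of a single training run; data is expected to have shape DATA_SHAPE."""
        assert data.shape == DATA_SHAPE
        return data[reset, network, :, layer, neuron, weight]

    # The real array is roughly 2 GB; float32 zeros serve as a stand-in here.
    data = np.zeros(DATA_SHAPE, dtype=np.float32)
    trajectory = weight_trajectory(data, reset=0, network=0, layer=0, neuron=0, weight=0)
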
A typical phase portrait looks like the one below:

while a typical phase portrait with DropIn looks like this:

The title of each graph indicates which weight in the network is graphed. The x-axis is the actual value of the weight, while the y-value is Δw, the change in the weight value between consecutive epochs. Each line represents a new training of the network with a new random value for the one weight and everything else starting the same. Each line ends at a differently colored 'x' and starts at the other end. There are no units listed on the graphs because the data has no units; they're just numbers. A total of 78,160 phase portraits were created.
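
The phase portraits described above plot each weight's value against its change between epochs. A minimal matplotlib version of that idea is sketched below; it assumes trajectories is a list of 1-D arrays of one weight's recorded values (one array per reset, 51 values each), and it is not the project's actual plotting code.

    import numpy as np
    import matplotlib.pyplot as plt

    def phase_portrait(trajectories, title):
        """Plot weight value (x-axis) against its change between epochs
        (y-axis), one line per training run, with an 'x' at the end point."""
        for values in trajectories:
            w = np.asarray(values)
            dw = np.diff(w)                       # change between consecutive epochs
            plt.plot(w[1:], dw, linewidth=0.8)
            plt.plot(w[-1], dw[-1], marker="x")   # mark where the trajectory ends
        plt.title(title)
        plt.xlabel("weight value")
        plt.ylabel("change in weight per epoch")
        plt.show()
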

4 Data Analysis

4.1 Without DropIn

The vanishing gradient can be seen visually in the different phase portraits created. The x's on a diagram mark the value each weight ends at during that training run, and its value can be read off the x-axis. Although it cannot be known whether this is the optimal value, the effect of the vanishing gradient problem can be seen based on whether the point has settled. A point is considered settled if its value does not change much, which means its position on the y-axis is approximately zero. The vanishing gradient can therefore be seen because, in general, the points later in the network settle faster, due to the way the cost function was implemented. For example, in the diagrams below, the one on the left is from Layer 0 and therefore earlier in the network. Its weights still aren't close to settled, which can be seen because the x's are moving downward, meaning they are changing at an increasing rate. On the other hand, the x's on the right are almost exactly at zero, and therefore changing very little. That weight is in Layer 2, so it is more likely to reach its optimal position, and in this case it most likely has reached the optimal value.
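
The notion of a point being "settled" can be made concrete with a small check like the one below; the tolerance and the five-epoch window are arbitrary placeholders, since in this project the judgment was made visually rather than with a fixed threshold.

    import numpy as np

    def is_settled(weight_values, tolerance=1e-3):
        """Treat a weight as settled if its change over the last few epochs
        stays near zero, i.e. the trajectory ends near the x-axis."""
        recent_changes = np.diff(weight_values)[-5:]
        return bool(np.all(np.abs(recent_changes) < tolerance))
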
The original plan for analyzing the data was to do the math described in section 1.2, but in reverse: given the graph, figure out whether the roots of the characteristic polynomial were positive or negative. However, the trajectories were too clustered together to extract enough information about the system. There were only a few exceptions, like the purple trajectory in the diagram on page 5, which wasn't enough to do the math. An attempted solution was increasing the range of the randomly set weight; however, this did not have any discernible effect on the diagrams.

4.2 With DropIn

An interesting trend in the DropIn diagrams was the similarity between the diagrams for different weights on the same neuron. Even though the actual values of the different weights ranged from -0.3 to 0.4, the graphs had almost identical shapes, as seen below.

This shows that weights in the network are more closely connected than expected, given the math behind them. It means that for the weights shown above, the change in a weight in any given epoch is proportional to its value, which isn't apparent from the math. It also means that the different weights for a specific neuron follow similar functions, which seems logical. Although the different weights for the same neuron have similar shapes, different neurons have completely different appearances. The following diagrams are from different neurons, and though they all look different, any weight for each specific neuron could have been picked to demonstrate the fact.

The way DropIn was implemented is what causes the triangle shapes shown. A probability p decides which network is trained, so when the network not used for the diagrams is trained, the weight value doesn't change, causing the y-value to drop to 0. Another important thing to note is that for almost all of the DropIn diagrams, the x's are on or extremely close to 0.0, regardless of which layer the neuron and weight are in. This can't be caused by the other network happening to be trained for the last epoch, because the same is true for all the neurons. This demonstrates that DropIn succeeded in its goal of making it more likely that weights settle to an optimal value.

5 Results and Conclusion

5.1 Discussion

The original purpose of this experiment was to make phase portraits of the weights in a neural network in order to find a solution to the vanishing gradient problem: by finding relationships in the diagrams, the effect of the vanishing gradient could be lessened and neural networks made more efficient. In practice, DropIn was implemented as a potential solution to see if there were any improvements over the original neural network, because the phase portraits didn't contain enough information to locate nodes. It appears that DropIn does allow more weights to settle to their optimal value faster without sacrificing too much overall time. This can be seen below, with the same weight being charted; only the DropIn one ends at zero.
The way DropIn is implemented causes the diagrams to look very different, which inhibits easy comparison, though it can be seen that DropIn allows the change in weights to end near 0 much of the time.

5.2 Error Analysis

This was a partially stochastic project, because each time the neural network reset its values, it changed a single weight to a new random value. This new value is what determines the trajectories on the phase portraits, so while diagrams from the same training process can be compared, the experiment is not completely reproducible. Another source of error comes from sacrificing accuracy for speed. It took over one hour to train the network each time a change was made, such as when DropIn was implemented. From there, the data could be saved and reused, but only once the training had completely finished. In all the experiments, 50 epochs were used, while when the network was being designed for accuracy, it didn't reach its full accuracy until over 150 epochs in. Additionally, with fifty epochs it already took over 3 hours to make the full set of phase portraits, so increasing the amount of data to load would also increase the time to create the additional diagrams. Given more time to train the network, it would be clearer when each weight had settled, and easier to compare how that changed with DropIn added.

5.3 Future Research

Given more time to continue this project, there is a lot that could be added. One useful tool would be a way to tell which way the trajectories are going, rather than just marking the end point; this could be done by adding arrows periodically along each trajectory. It would also be interesting to overlay all the weights on a single phase portrait for one training session and see how they compare to each other, or whether patterns emerge. If the network started out with a bigger range of starting values, there might be enough distinct trajectories to predict information about the dynamical system. Finally, it would be interesting to see the effect DropIn has on networks with a larger number of layers. With only four and three layers used, there isn't a big difference, but in the original paper three and eleven layers were used, which is a very large difference. It would most likely be easier to see the impact DropIn has on the larger network.

5.4 Conclusions

In conclusion, the vanishing gradient is definitely a problem for neural networks even as small as four layers deep, and it gets worse the more layers there are. For comparison, the network with the highest accuracy on the MNIST data set has over fifteen layers. The vanishing gradient can be seen in the phase portraits because the portraits for weights later in the network settle to a value (and therefore have a y-value of approximately 0) more often than weights earlier in the network. An attempted solution was implementing DropIn, which caused the network to start with only three layers and then sometimes train with four layers, based on a probability p that changed with the epoch the network was currently on. When DropIn was implemented, weights that didn't settle in the original network settled to a y-value near 0 with DropIn, as seen above. Overall, there wasn't enough time or data to draw a decisive conclusion about any patterns formed, but other researchers can build on this work to make big improvements in neural networks.

Bibliography

Axtell, Travis. "Science Fair." 20 Sept. 2015. E-mail.

Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2).

Hochreiter, S. (n.d.). The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 107-116.

LeCun, Y., Cortes, C., & Burges, C. (1998). The MNIST Database of Handwritten Digits [Handwritten digit images]. Available from: http://yann.lecun.com/exdb/mnist/

Nielsen, M. A. (2015). Neural Networks and Deep Learning. Determination Press.

Smith, L., Hand, E., & Doster, T. (2015). Gradual DropIn of Layers to Train Very Deep Neural Networks. Retrieved November 29, 2015, from arxiv.org

Sundermeyer, M., Ney, H., & Schluter, R. (n.d.). From Feedforward to Recurrent LSTM Neural Networks for Language Modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 517-529.
