Mark K Cowan
This book is based partly on content from the 2013
session of the on-line Machine Learning course run
by Andrew Ng (Stanford University). The on-line
course is provided for free via the Coursera platform
(www.coursera.org). The author is in no way affiliated
with Coursera, Stanford University or Andrew Ng.
DRAFT
14/08/2013
2 Neural networks 11
2.1 Regression visualised as a building block . . . . . . . . . . . . . . . . . . . . 11
2.2 Hidden layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Notation for neural network elements . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Bias nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Logic gates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Feed-forward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.7 Cost-function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.8 Gradient via back-propagation . . . . . . . . . . . . . . . . . . . . . . . . . . 16
List of Figures 17
Viewing
This ebook is designed to be viewed like a book, with two (facing) pages at a time and a
separate cover page.
Notation
Matrices are in bold-face, vectors are under arrows. Vectors which could be expanded into
matrices to process several data at once are shown in bold-face.
Functions of matrices/vectors are applied element-wise, with a result that is of the same
type as the parameter. The Hadamard matrix product is denoted by A ∘ B. The norm of a
matrix A is the norm of the vector produced by unrolling the matrix, i.e.

    ‖A‖² = Σ_{i,j} A²_{ij}
Preface
TODO
1 Regression
Regression is a simple, yet powerful technique for identifying trends in datasets and
predicting properties of new data points. It forms a building block of many other more
complex and more powerful techniques¹.
Objective:
Create a model:

    Θ ∈ ℝ^{p×q},  g : ℝ → ℝ

that predicts output:

    y⃗ ∈ ℝ^q

given input:

    x⃗ ∈ ℝ^p.
Usage:
The output y⃗ can be a continuous value, a discrete value (e.g. a boolean), or a vector of
such values. The input x⃗ is a vector, and can also be either discrete or continuous. The
individual properties of one data row are called features. Typically, many inputs are to be
processed at once, so the feature vectors are augmented to form matrices, and the model can
be rewritten for m data rows as:
Model: Θ ∈ ℝ^{p×q}, g : ℝ → ℝ
Predicts: Y ∈ ℝ^{m×q}
For input: X ∈ ℝ^{m×p}
Discrete:
For discrete problems, threshold value(s) is/are chosen between the various levels of
output. The phase-space surface defined by these levels is called the decision boundary.
Categorisation problems are slightly more complicated, as the categories do not necessarily
have a logical order (with respect to the < and > comparison operators). The index of the
predicted category c is given by c = argmaxᵢ(y⃗ᵢ), where y⃗ is a vector with length equal to the
number of categories.
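As a concrete sketch (not from the original text, and with a hypothetical function name), the argmax rule for picking a category can be written directly:

```python
import numpy as np

def predict_category(y):
    """Index of the predicted category: c = argmax_i(y_i)."""
    return int(np.argmax(y))

# One-vs-all scores for three categories; the second category scores highest.
c = predict_category(np.array([0.1, 0.7, 0.2]))
```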
1.1 Hypothesis
The prediction is done via a hypothesis function h_θ(x⃗). This function is typically of the
form:

    h_θ(x⃗) = g(θᵀx⃗)
Common choices for the transfer function are² the sigmoid function and the inverse hyperbolic tangent
function:

    g(k) = sigmoid(k) = 1/(1 + e⁻ᵏ),   0 ≤ g(k) ≤ +1

    g(k) = artanh(k) = ½ ln((1 + k)/(1 − k)),   −1 ≤ g(k) ≤ +1
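These two transfer functions can be sketched in a few lines (NumPy is used so that they apply element-wise, as the notation section requires):

```python
import numpy as np

def sigmoid(k):
    """Sigmoid transfer function: 1 / (1 + e^-k), with range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-k))

def artanh(k):
    """Inverse hyperbolic tangent: 0.5 * ln((1 + k) / (1 - k))."""
    return 0.5 * np.log((1.0 + k) / (1.0 - k))
```

Both accept scalars or arrays, since NumPy ufuncs broadcast element-wise.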
1.2 Learning
The learning is driven by a set of m examples, i.e. m values of x⃗ for which the corresponding
y values are known. The set of x⃗ vectors can therefore be grouped into a matrix X, and the
corresponding y values/vectors may also be grouped into a vector/matrix Y. The hypothesis
becomes h_Θ(X) = g(XΘ) (g is applied element-wise), and the learning stage may now be
expressed as the minimisation of a cost-function J.
This non-negative valued function gives some indication of the accuracy of the current
hypothesis h on the learning dataset [Y, X], and the objective of the learning stage is
to minimise the value of this J.
Gradient descent updates the parameter estimate by stepping against the gradient of the
cost-function, θ⃗ ← θ⃗ − α ∇⃗_θ J, where α is the learning rate, and defines the speed at which the learning algorithm converges.
A small value of α will result in slow learning, while a large value of α will result in oscillations
about a minimum (and will therefore prevent convergence). A steadily decreasing function
may be used in place of a constant for α in order to provide fast convergence towards a
minimum, but to also reduce the final distance from the minimum if the algorithm oscillates.
The initial estimate, θ⃗₀, is usually initialised to normally distributed random values with
mean = 0. Note that the gradient descent method will not always find the global minimum
of the cost function; it can get trapped in local minima.
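As an illustrative sketch (the dataset and constants here are invented, not from the text), batch gradient descent for a linear hypothesis h_θ(X) = Xθ:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Minimise the least-squares cost J by stepping against its gradient.

    theta is initialised to normally distributed random values (mean 0),
    as described above; alpha is the learning rate.
    """
    m, p = X.shape
    rng = np.random.default_rng(0)
    theta = rng.normal(0.0, 0.01, p)
    for _ in range(iterations):
        residual = X @ theta - y          # h_theta(X) - y
        gradient = (X.T @ residual) / m   # gradient of J
        theta = theta - alpha * gradient  # step downhill
    return theta

# Noiseless data generated from theta = [2, -3]; gradient descent recovers it.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, -3.0])
theta = gradient_descent(X, y)
```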
1.3 Data normalisation

    x_normalised = (x − μ) / σ

where the mean μ and the standard deviation σ of a list x containing N values are respectively:

    μ = (1/N) Σᵢ₌₁ᴺ xᵢ

    σ = √( (1/N) Σᵢ₌₁ᴺ (xᵢ − μ)² )
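The normalisation above is straightforward to express in code; this sketch uses the population standard deviation (the 1/N form), matching the formula:

```python
import numpy as np

def normalise(x):
    """Zero-mean, unit-variance scaling: (x - mu) / sigma."""
    mu = x.mean()
    sigma = x.std()   # population standard deviation (1/N)
    return (x - mu) / sigma

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z = normalise(x)
```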
TODO
Graphical examples to show why this works

TODO
Mention Tikhonov regularisation and use PhD examples to demonstrate the power of
image/fourier operators. Maybe a mention of the cool stuff that can occur when other AI
techniques such as evolutionary algorithms are added to the mix
1.4 Regularisation
The previous solvers will minimise the residual ‖h_θ(X) − Y‖, but consequently may
attempt to fit the noise and errors in the learning data too. In order to prevent this, other
terms can be added into the minimisation process. A simple extra term to minimise is the
magnitude of the learning values θ, by modifying the residual as follows:

    ‖h_θ(X) − Y‖² + λ‖θ⃗‖²
The non-negative variable λ is the regularisation parameter, and it determines whether the
learning process primarily minimises the magnitude of the parameter vector θ⃗ (which
results in under-fitting and may be fixed by decreasing λ), or whether the learning process
over-fits the training data (X, Y), which may be remedied by increasing λ.
If one of the parameters (θ₁ in θ⃗) represents a constant offset (e.g. X₁ ≝ 1⃗) then this
parameter (and any associated weightings) are excluded from the regularisation³, i.e:

    ‖h_θ(X) − Y‖² + λ‖θ⃗_{2..N}‖²
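A sketch of the modified residual, with the offset parameter excluded from the penalty (indices here are zero-based, so theta[0] plays the role of θ₁; the function name is invented):

```python
import numpy as np

def regularised_residual(theta, X, y, lam):
    """||h(X) - y||^2 + lambda * ||theta_2..N||^2, excluding the offset."""
    residual = X @ theta - y                # linear hypothesis h(X) = X theta
    penalty = lam * np.sum(theta[1:] ** 2)  # offset parameter not regularised
    return np.sum(residual ** 2) + penalty
```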
TODO
Prove it

TODO
Prove it
³ Note that the bias nodes of a neural network (chapter 2) are extra features, not constant offsets
1.6 Logistic regression

Passing the output of a linear regression element through a non-linear transfer function will
produce a non-linear hypothesis. This is particularly useful for problems with discrete output
such as categorisation and decision problems, as these kind of problems imply non-linear
hypotheses. The output may then be passed through a threshold function in order to produce
a discrete value4 , for example:
    threshold(y) = true if y ≥ 0.5, false if y < 0.5
[Plot: f(k) against k for the threshold, sigmoid and linear transfer functions, with the 5%/95% levels marked]
Figure 1.1: Plot of the sigmoid transfer function
The sigmoid function has a range of [0, 1], which is ideal for binary problems with yes/no
outputs such as categorisation. For ternary problems with positive/neutral/negative outputs,
a function with a range of [−1, +1] may be more applicable. One such function is the artanh
function, although the sigmoid function can also be scaled to this new range.
⁴ For example, yes/no
1.7 Multi-class classification
TODO
Simplify and shorten, add examples
1.8 Formulae
Linear regression:

    g(k) = k

    J = (1/2m) ‖h_θ(X) − Y‖² + (λ/2m) ‖θ⃗‖²

    ∇⃗_θ J = (1/m) Xᵀ(h_θ(X) − Y) + (λ/m) θ⃗

Logistic regression:

    g(k) = 1/(1 + e⁻ᵏ)   for yes/no
    or g(k) = tanh⁻¹(k)   for positive/neutral/negative

    J = −(1/m) [ y⃗ · log h_θ(x⃗) + (1 − y⃗) · log(1 − h_θ(x⃗)) ] + (λ/2m) ‖θ⃗‖²

    ∇⃗_θ J = (1/m) Xᵀ(h_θ(X) − Y) + (λ/m) θ⃗
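These formulae can be transcribed almost directly into code. The sketch below uses zero-based indexing and regularises every parameter (no special offset handling), purely for brevity:

```python
import numpy as np

def sigmoid(k):
    return 1.0 / (1.0 + np.exp(-k))

def linear_cost_grad(theta, X, Y, lam):
    """Regularised linear regression: J and its gradient, as in section 1.8."""
    m = X.shape[0]
    residual = X @ theta - Y
    J = np.sum(residual ** 2) / (2 * m) + lam * np.sum(theta ** 2) / (2 * m)
    grad = X.T @ residual / m + lam * theta / m
    return J, grad

def logistic_cost_grad(theta, X, Y, lam):
    """Regularised logistic regression: J and its gradient, as in section 1.8."""
    m = X.shape[0]
    h = sigmoid(X @ theta)
    J = -np.sum(Y * np.log(h) + (1 - Y) * np.log(1 - h)) / m \
        + lam * np.sum(theta ** 2) / (2 * m)
    grad = X.T @ (h - Y) / m + lam * theta / m
    return J, grad
```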
1.9 Fine-tuning the learning process

[Figure 1.2: Learning curves (J against iterations) for various learning rates: (a) no convergence (increase α), (b) oscillation (decrease α), (c) convergence (good)]
A simple optimisation is to start with a high value of , then decrease it with each
oscillation. This provides rapid convergence initially with a high chance of oscillations as the
cost-function minimum is approached, but with the oscillations decreasing as the learning
rate is decreased, resulting in convergence close to the cost-function minimum.
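A sketch of that heuristic, detecting oscillation as a sign reversal of the gradient (the function names and the test problem are invented for illustration):

```python
import numpy as np

def adaptive_descent(grad, theta, alpha=1.0, iterations=100, decay=0.5):
    """Gradient descent that halves alpha whenever the gradient reverses.

    A reversal (negative dot-product with the previous gradient) indicates
    oscillation about a minimum, so the learning rate is decreased.
    """
    previous = np.zeros_like(theta)
    for _ in range(iterations):
        g = grad(theta)
        if np.dot(g, previous) < 0.0:
            alpha *= decay          # oscillating: slow down
        theta = theta - alpha * g
        previous = g
    return theta

# J = theta^2: alpha = 1.0 oscillates between +3 and -3 until it is decayed.
theta = adaptive_descent(lambda t: 2.0 * t, np.array([3.0]))
```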
[Figure 1.3: cost-function curves (J against iterations) for the training data and for other data]
Example:
Taking a fifth-order polynomial as an example, we train a linear regression system (using
gradient descent) by using four⁵ known points in the polynomial. Training with many
features (for example, monomials up to the tenth order), we can find an exact or near-exact
match for the training data as shown in figure 1.4a, but this learnt trend may be very different
to the actual trend. Conversely, training with too few features or with high regularisation
produces a learnt trend that is equally bad at matching either the training dataset or future
data, shown in figure 1.4b. The good fit in figure 1.4c doesn't perfectly follow the trend, since
there is insufficient information in just four points to describe a fifth-order polynomial, but
it is a reasonable match. Of course, if the actual trend was a sinusoid then the "good" fit
would have high error instead. When the training data contains insufficient information
to describe the trend, then some properties of the trend must be known beforehand when
designing the features: for example, are we to use sinusoid features, monomial features,
exponential features or some combination of them all?
[Figure 1.4: Examples of over-fitting (a), under-fitting (b) and a good fit (c), showing the training data, learnt trend and actual trend, Y against X]
Cross-validation:
If the actual trend were known, then there would be no need for machine learning. In
practice, the learnt trend cannot be directly compared against the actual trend, and for
many real-world problems there is no actual trend. Instead, the training dataset may be
split into two subsets, the training set and the cross-validation set. The learning algorithm
may then be applied to the training set, and the regularisation parameter λ varied in order
to minimise the error⁶ in the cross-validation set (see below). The amount of regularisation
can therefore be determined intelligently, rather than by guesswork. In figure 1.3, the other
data series becomes the cost-function curve for the cross-validation dataset.

⁵ This is two less than would be required to uniquely describe the polynomial
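The λ search can be sketched as follows; the solver uses the closed-form normal equation rather than gradient descent purely for brevity, and all names and data are illustrative assumptions:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Regularised linear regression via the normal equation (for brevity)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def choose_lambda(X_train, y_train, X_cv, y_cv, candidates):
    """Pick the lambda minimising the (unregularised) cross-validation error."""
    best_lam, best_err = None, np.inf
    for lam in candidates:
        theta = ridge_fit(X_train, y_train, lam)
        err = np.mean((X_cv @ theta - y_cv) ** 2)  # error on the CV set
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam

# Clean linear data: the smallest candidate regularisation wins.
X_train = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y_train = X_train @ np.array([1.0, 2.0])
X_cv = np.array([[1.0, 4.0], [1.0, 5.0]])
y_cv = X_cv @ np.array([1.0, 2.0])
best = choose_lambda(X_train, y_train, X_cv, y_cv, [0.0, 1.0, 10.0])
```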
TODO
Validation dataset and example
Growing the training set:
Another technique to diagnose bias/variance is to plot the cost-function for the training
set and a fixed cross-validation set while varying the size of the training set. Adding more
training data will not reduce error caused by bias, but will reduce error caused by variance.
Therefore, the errors in the training set and in the cross-validation set will both converge
to some high value if the system is suffering from high bias (Figure 1.5a). If the system is
suffering from high variance instead then the errors may appear to converge to two separate
values with a large gap in between (Figure 1.5b), but will converge eventually if a large and
diverse enough training set is used. Over-fitting can therefore be eliminated by getting more
training data when possible, instead of increasing the regularisation parameter ().
[Plots: J against training-set size for the training data and the cross-validation data; in the high-variance case the two curves converge to separate values with a gap between them]
Figure 1.5: Identifying bias and variance by varying the size of the training set
Summary:
Over-fitting can be interpreted as the system trying to infer too much from too little training
data, whereas under-fitting can be interpreted as the system making inefficient use of the
training data, or there not being sufficient training data for the system to create a useful
hypothesis. Therefore, both can be fixed by either adjusting the amount of regularisation
or by altering the amount of training data available. In order to increase the amount of
training data, new data/features⁷ can be collected or new features can be produced by
applying non-linear operations to the existing elements, such as exponentiation (x⃗₇ = x⃗₁²),
multiplication or division (x⃗₈ = x⃗₄x⃗₅/x⃗₆).

⁶ The value of the cost-function, J
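Feature creation by non-linear operations can be sketched as below; the particular combinations chosen are arbitrary illustrations, not from the text:

```python
import numpy as np

def expand_features(X):
    """Append new features built from non-linear combinations of existing ones."""
    x1, x2 = X[:, 0], X[:, 1]
    square = x1 ** 2    # exponentiation of a feature
    product = x1 * x2   # multiplication of two features
    return np.column_stack([X, square, product])

X = np.array([[2.0, 3.0], [4.0, 5.0]])
expanded = expand_features(X)
```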
TODO
This section is high priority

⁷ Recall that features are properties of a data item
2 Neural networks
Neural networks are complex non-linear models, built from components that individually
behave similarly to a regression model. They can be visualised as graphs1 , and some
sub-graphs may exist with behaviour similar to that of logic gates2 . Although the structure
of a neural network is explicitly designed beforehand, the processing that the network does
in order to produce a hypothesis (and therefore, the various logic gates and other processing
structures within the network) evolves during the learning process3 . This allows a neural
network to be used as a solver that programs itself, in contrast to typical algorithms that
must be designed and coded explicitly.
Evaluating the hypothesis defined by a neural network may be achieved via feed-forward,
which amounts to setting the input nodes, then propagating the values through the
connections in the network until all output nodes have been calculated completely. The
learning can be accomplished by using gradient descent, where the error in the output nodes
is pushed back through the network via back-propagation, in order to estimate the error in
the hidden nodes, which allows calculation of the gradient of the cost-function.
[Diagram: a processing unit transforming input data into output data]

2.1 Regression visualised as a building block

[Diagram: linear regression as a graph: the inputs x1..x4 feed a θ-weighted summation node, which produces the output y]
Similarly, logistic regression may be visualised as a graph, with one extra node to represent
the transfer function. A logistic regression element may also be described using a linear
regression element and a transfer node, by recognising that the first two stages of a logistic
regression element form a linear regression element.
[Diagram: logistic regression as a graph: the inputs x1..x4 feed a θ-weighted summation node, followed by the transfer function g, producing the output y]
Since the last three stages of the pipeline are dependent only on the first stage, we will
condense them into one non-linear mixing operation at the output:
[Diagram: condensed notation: the inputs x1..x4, with weights θ1..θ4, feed a single non-linear node that produces the output y]
Using this notation, a network that performs classification via several one-vs-all classifiers has
the following form, where the parameter vectors have been combined to form a parameter
matrix , with a separate column to produce each column of the output vector:
[Diagram: one-vs-all classification network: the inputs x1..x4 are fully connected to the outputs y1..y3]
2.2 Hidden layers

[Diagram: a network with a hidden layer between the inputs x1..x4 and the outputs y1..y3]
Although layers of linear regression nodes could be used in the network, there is no point,
since each logistic regression element transforms a linear combination of the inputs, and a
linear combination of a linear combination is itself a linear combination⁵.

⁴ Neural networks can also contain feedback loops and other features that are not possible by stacking alone
⁵ i.e. multiple linear combination layers can be combined to give one single linear combination element
2.4 Bias nodes

[Diagram: the inputs x1..x4, with weights θ1..θ4, feed a weighted summation z, then the transfer function g, giving the activation a and the output y]
For convenience, the bias node will be given index 0, such that aˡ₀ = 1. There is a separate parameter
vector for each layer, so we now have a set of matrices. A biased network with several
hidden layers is shown below to illustrate the structure and notation for such a network:
[Diagram: layers of nodes a⁰₁..a⁰₅, a¹₁..a¹₅, a²₁..a²₄ and outputs a³₁, a³₂, where the superscript indexes the layer and the subscript indexes the node]
Figure 2.8: Example of the notation used to number the nodes of a neural network
2.5 Logic gates

[Diagram: (a) logical NOT gate (bias +1, weight −2), (b) logical AND gate (bias −3, weights +2, +2), (c) logical OR gate (bias −1, weights +2, +2), each built from a single logistic node]
From these, it becomes trivial to construct other gates. Negating the values produces
the inverted gates, and these can be used to construct more complex gates. Thus, neural
networks may be understood as self-designing microchips, capable of both digital and
analogue processing.
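A sketch of such gates in Python: the weights are scaled-up versions of those in the figures so that the sigmoid saturates to a clean boolean, and the XOR here is composed from AND/OR/NOT rather than the four NAND gates of figure 2.10:

```python
import numpy as np

def sigmoid(k):
    return 1.0 / (1.0 + np.exp(-k))

def neuron(bias, weights, inputs):
    """A single logistic node; large |z| saturates the output towards 0 or 1."""
    z = bias + np.dot(weights, inputs)
    return sigmoid(z) > 0.5

# Weights scaled up from the figure's values so the outputs saturate.
def NOT(a):    return neuron(+10.0, [-20.0], [a])
def AND(a, b): return neuron(-30.0, [20.0, 20.0], [a, b])
def OR(a, b):  return neuron(-10.0, [20.0, 20.0], [a, b])

def XOR(a, b):
    """XOR composed from the gates above (an alternative to the NAND form)."""
    return AND(OR(a, b), NOT(AND(a, b)))
```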
[Diagram: the inputs x1, x2 feed two layers of NAND-style nodes with bias +3 and weights −2, producing the output y]
Figure 2.10: Logical XOR gate, constructed from four NAND gates
2.6 Feed-forward
To evaluate the hypothesis h_Θ(x⃗) for some input data x⃗, the data is fed forward through the
layers:

    a⁰ = x⃗,   zˡ = Θˡ⁻¹ aˡ⁻¹,   aˡ = g(zˡ)   (with the bias activation aˡ₀ = 1)

and the hypothesis is the activation of the output layer.
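A feed-forward sketch, under the assumption that each layer's weight matrix has one row per output node and a leading column for the bias weight; the example weights are invented for illustration (they happen to behave like XNOR), not taken from the text:

```python
import numpy as np

def sigmoid(k):
    return 1.0 / (1.0 + np.exp(-k))

def feed_forward(thetas, x):
    """Propagate input x through the layers to evaluate the hypothesis.

    Assumed layout: one weight matrix per layer, of shape
    (nodes_out, 1 + nodes_in), with column 0 weighting the bias activation.
    """
    a = np.asarray(x, dtype=float)
    for theta in thetas:
        a = np.concatenate(([1.0], a))  # prepend the bias activation a_0 = 1
        a = sigmoid(theta @ a)          # z = Theta a, then transfer function
    return a

# Illustrative 2-2-1 network.
thetas = [np.array([[-30.0, 20.0, 20.0],     # hidden node: x1 AND x2
                    [10.0, -20.0, -20.0]]),  # hidden node: x1 NOR x2
          np.array([[-10.0, 20.0, 20.0]])]   # output node: OR of the two
h = feed_forward(thetas, [1.0, 1.0])
```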
2.7 Cost-function
The cost-function remains the same as that for logistic regression:
    J = −(1/m) [ y⃗ · log h_Θ(x⃗) + (1 − y⃗) · log(1 − h_Θ(x⃗)) ] + (λ/2m) Σₗ₌₁ᴸ ‖Θˡ‖²
2.8 Gradient via back-propagation
Note the dot-product in the expression for the cost-function; this represents a summation of
the cost-functions for each individual one-vs-all classifier. Since neural networks have hidden
layers between the input and output nodes, the learning process is slightly more complex
than that of the logistic regression multi-class network.
With the error at each node calculated, the amount that each node contributes to the over-all
cost can be estimated, leading to the gradient:
    Δⁿ = aⁿ (eⁿ⁺¹)ᵀ   (error contributed to the next layer)

    ∇⃗_{Θⁿ} J = (1/m) Δⁿ + (λ/m) Θⁿ   (gradient of J)
Note that the gradient shown here includes the regularisation term. Also, although the bias
nodes have constant values for their activations, the error contributed by bias nodes is not
necessarily zero⁶, since the weights of these nodes (Θˡ₀ⱼ) are not constant. Therefore the
weights of the bias nodes should be excluded from regularisation, as described in section 1.4,
except for rare cases where it may be useful to minimise the amount of bias.
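The whole procedure can be sketched for a layered sigmoid network; the error accumulation and the exclusion of bias weights from regularisation follow the text, while the array shapes and names are assumptions of this sketch:

```python
import numpy as np

def sigmoid(k):
    return 1.0 / (1.0 + np.exp(-k))

def backprop_gradients(thetas, X, Y, lam):
    """Gradient of the cost-function via back-propagation.

    thetas: one weight matrix per layer, shape (nodes_out, 1 + nodes_in),
    with column 0 holding the bias weights (not regularised, per section 1.4).
    """
    m = X.shape[0]
    # Feed-forward, retaining every layer's activations (bias prepended).
    activations = [np.hstack([np.ones((m, 1)), X])]
    for theta in thetas:
        a = sigmoid(activations[-1] @ theta.T)
        activations.append(np.hstack([np.ones((m, 1)), a]))
    h = activations[-1][:, 1:]
    delta = h - Y                            # error at the output nodes
    grads = [None] * len(thetas)
    for l in range(len(thetas) - 1, -1, -1):
        grads[l] = delta.T @ activations[l] / m
        reg = (lam / m) * thetas[l]
        reg[:, 0] = 0.0                      # bias weights excluded
        grads[l] += reg
        if l > 0:                            # push the error back a layer
            a = activations[l][:, 1:]
            delta = (delta @ thetas[l][:, 1:]) * a * (1.0 - a)
    return grads
```

For a single-layer network this reduces to the logistic regression gradient of section 1.8, which gives a convenient sanity check.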
TODO
Real-world examples
⁶ Therefore the gradient of the cost-function with respect to the weights of these nodes may also be non-zero
List of Figures
1.1 Plot of the sigmoid transfer function . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Learning curves for various learning rates . . . . . . . . . . . . . . . . . . . . 7
1.3 Learning curves for various learning rates . . . . . . . . . . . . . . . . . . . . 8
1.4 Examples of over-fitting, under-fitting and a good fit . . . . . . . . . . . . . 8
1.5 Identifying bias and variance by varying the size of the training set . . . . . 9