
Machine Learning

Mark K Cowan

This book is based partly on content from the 2013
session of the on-line Machine Learning course run
by Andrew Ng (Stanford University). The on-line
course is provided for free via the Coursera platform
(www.coursera.org). The author is in no way affiliated with Coursera, Stanford University or Andrew Ng.

DRAFT
14/08/2013

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/.
Contents
1 Regression 1
1.1 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Normal equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.2 Gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Data normalisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Regularisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.6 Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.7 Multi-class classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.8 Formulae . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.9 Fine-tuning the learning process . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.9.1 α: Learning, convergence and oscillation . . . . . . . . . . . . . . . . 7
1.9.2 λ: Bias and variance . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.9.3 Accuracy, precision, F-values . . . . . . . . . . . . . . . . . . . . . . . 10

2 Neural networks 11
2.1 Regression visualised as a building block . . . . . . . . . . . . . . . . . . . . 11
2.2 Hidden layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Notation for neural network elements . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Bias nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Logic gates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Feed-forward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.7 Cost-function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.8 Gradient via back-propagation . . . . . . . . . . . . . . . . . . . . . . . . . . 16

List of Figures 17
Viewing
This ebook is designed to be viewed like a book, with two (facing) pages at a time and a
separate cover page.

Notation
Matrices are in bold-face; vectors are written under arrows (e.g. ~x). Vectors which could be expanded into matrices to process several data at once are shown in bold-face. Functions of matrices/vectors are applied element-wise, with a result that is of the same type as the parameter. The Hadamard matrix product is denoted by A ∘ B. The norm of a matrix A is the norm of the vector produced by unrolling the matrix, i.e.

    ‖A‖² = Σ_{i,j} A_{ij}²

Preface

TODO
1  Regression
Regression is a simple, yet powerful technique for identifying trends in datasets and
predicting properties of new data points. It forms a building block of many other more
complex and more powerful techniques¹.

Objective:
Create a model:

    Θ ∈ R^{p×q},   g: R → R

that predicts output ~y ∈ R^q given input ~x ∈ R^p.

Usage:
The output ~y can be a continuous value, a discrete value (e.g. a boolean), or a vector of
such values. The input ~x is a vector, and can also be either discrete or continuous. The
individual properties of one data row are called features. Typically, many inputs are to be
processed at once so the feature vectors are augmented to form matrices, and the model can
be rewritten for m data rows as:
Model:      Θ ∈ R^{p×q},   g: R → R
Predicts:   Y ∈ R^{m×q}
For input:  X ∈ R^{m×p}

Discrete:
For discrete problems, one or more threshold values are chosen between the various levels of output. The phase-space surface defined by these levels is called the decision boundary. Categorisation problems are slightly more complicated, as the categories do not necessarily have a logical order (with respect to the < and > comparison operators). The index of the predicted category c is given by c = argmax_i(~y_i), and ~y is a vector with length equal to the number of categories.

1.1 Hypothesis
The prediction is done via a hypothesis function h_Θ(~x). This function is typically of the form:

    h_Θ(~x) = g(~xΘ)

where Θ is a matrix defining a linear transformation of the input parameters, ~x is a row-vector of input values, and g(k) is an optional non-linear transfer function. Common non-linear choices for the transfer function are² the sigmoid function and the inverse hyperbolic tangent function:

    g(k) = sigmoid(k) = 1 / (1 + e^{−k}),        0 ≤ g(k) ≤ +1
    g(k) = artanh(k) = ½ ln((1 + k)/(1 − k)),    −1 ≤ g(k) ≤ +1

Where y is continuous, a linear transfer function is used to provide linear regression. In cases where the output is discrete (i.e. classification problems and decision making), the sigmoid transfer function is typically used, providing logistic regression. Where the parameter k of g is a vector or a matrix, g is applied element-wise to the parameter, i.e. for g = g(k), g_{i,j} = g(k_{i,j}).

¹ For example, artificial neural networks (chapter 2)

1.2 Learning
The learning is driven by a set of m examples, i.e. m values of ~x for which the corresponding y values are known. The set of ~x vectors can therefore be grouped into a matrix X, and the corresponding y values/vectors may also be grouped into a vector/matrix Y. The hypothesis becomes h_Θ(X) = g(XΘ) (g is applied element-wise) and the learning stage may now be defined as a process which minimises the cost-function:

    J(Θ) = ‖ε‖²,   using   ε = h_Θ(X) − Y

This non-negative valued function gives some indication of the accuracy of the current hypothesis h_Θ on the learning dataset [Y, X], and the objective of the learning stage is to minimise the value of this J.

1.2.1 Normal equation


In the unlikely event that the X matrix is square and non-singular, the inverse can be used to find Θ. For other cases, the left pseudo-inverse may be used:

    Y = XΘ   ⇒   Θ = (XᵀX)⁻¹ XᵀY

This minimises ‖XΘ − Y‖² (i.e. assumes that h_Θ(X) = XΘ). Since the matrix to be inverted is symmetric and typically positive semi-definite, Cholesky decomposition can be used to rapidly invert it in a stable manner. Singular value decomposition (SVD) is also practical.
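As a concrete sketch (added for illustration; the data is synthetic and not from the text), the normal equation translates to a few lines of NumPy. `lstsq` is the SVD-based route mentioned above, and the numerically safer choice when XᵀX is ill-conditioned:

```python
import numpy as np

# Toy learning set: m = 50 rows, p = 3 features (assumed data, for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_theta = np.array([2.0, -1.0, 0.5])
Y = X @ true_theta

# Normal equation: Theta = (X^T X)^{-1} X^T Y, via a linear solve rather
# than an explicit inverse
theta = np.linalg.solve(X.T @ X, X.T @ Y)

# Equivalent SVD-based least squares
theta_svd, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(theta)   # recovers [2, -1, 0.5] for this noise-free example
```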

1.2.2 Gradient descent


Successive approximations of the minimum of the cost-function may be obtained by walking down the landscape defined by the cost-function. This is achieved by the following iteration:

    Θ_{i+1} = Θ_i − α ∇_Θ J

where α is the learning rate, and defines the speed at which the learning algorithm converges. A small value of α will result in slow learning, while a large value of α will result in oscillations about a minimum (and will therefore prevent convergence). A steadily decreasing function may be used in place of a constant for α in order to provide fast convergence towards a minimum, but to also reduce the final distance from the minimum if the algorithm oscillates. The initial estimate, Θ₀, is usually initialised to normally distributed random values with mean = 0. Note that the gradient descent method will not always find the global minimum of the cost function; it can get trapped in local minima.

² See figure 1.1 on page 5
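A minimal sketch of this iteration for linear regression (illustrative only; the value of α and the iteration count are arbitrary choices, and the data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
Y = X @ np.array([1.5, -0.5])

alpha = 0.1                     # learning rate
theta = rng.normal(size=2)      # Theta_0: random, zero-mean initial estimate

for _ in range(500):
    residual = X @ theta - Y            # h(X) - Y
    grad = X.T @ residual / len(X)      # gradient of the mean squared error
    theta = theta - alpha * grad        # Theta_{i+1} = Theta_i - alpha * grad

print(theta)   # walks down to the minimum at [1.5, -0.5]
```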

1.3 Data normalisation


If different features are on considerably different magnitude scales, then normalisation may be necessary first. Otherwise, the features on larger scales will dominate the cost-function and prevent the other features from being learnt. One simple way to transform all features to a similar scale is:

    x_normalised = (x_raw − μ) / σ

where the mean μ and the standard deviation σ of a list x containing N values are respectively:

    μ = (1/N) Σ_{i=1..N} x_i

    σ = √( (1/N) Σ_{i=1..N} (x_i − μ)² )
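This transformation is easy to apply per feature; a sketch (the example values are invented, e.g. house area against room count):

```python
import numpy as np

def normalise(X):
    """Scale each feature (column) to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)      # population standard deviation, i.e. the 1/N form
    return (X - mu) / sigma, mu, sigma

X = np.array([[2000.0, 3.0],
              [1600.0, 2.0],
              [2400.0, 4.0]])
Xn, mu, sigma = normalise(X)
print(Xn.mean(axis=0))   # [0, 0] up to rounding
print(Xn.std(axis=0))    # [1, 1]
```

The same μ and σ must be kept and applied to any new data fed to the trained model.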

TODO
Graphical examples to show why this works

TODO
Mention Tikhonov regularisation and use PhD examples to demonstrate the power of image/Fourier operators. Maybe a mention of the cool stuff that can occur when other AI techniques such as evolutionary algorithms are added to the mix

1.4 Regularisation

The previous solvers will try to minimise the residual ‖h_Θ(X) − Y‖, but consequently may attempt to fit the noise and errors in the learning data too. In order to prevent this, other terms can be added into the minimisation process. A simple extra term to minimise is the magnitude of the learning values Θ, by modifying the residual as follows:

    ‖h_Θ(X) − Y‖² + λ‖Θ‖²

The non-negative variable λ is the regularisation parameter, and it determines whether the learning process primarily minimises the magnitude of the parameter vector (Θ), which results in under-fitting and may be fixed by decreasing λ, or whether the learning process over-fits the training data (X, Y), which may be remedied by increasing λ.

If one of the parameters (θ₁ in Θ) represents a constant offset (e.g. X₁ ≝ ~1) then this parameter (and any associated weightings) are excluded from the regularisation³, i.e:

    ‖h_Θ(X) − Y‖² + λ‖Θ_{2..N}‖²

TODO
Justify why/when bias features are required, and why they often escape regularisation

Regularisation of the normal equation:

    Θ = (XᵀX + λI)⁻¹ XᵀY

* Remember to set I₁,₁ to zero if θ₁ is an offset.

TODO
Prove it

Regularisation of the gradient descent method:

    J = ‖h_Θ(X) − Y‖² + λ‖Θ‖²

* Remember to exclude θ₁ from the norm calculation if θ₁ is an offset.

TODO
Prove it
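A sketch of the regularised normal equation (illustrative only; it assumes column 1 of X is the constant-offset feature, so the corresponding diagonal entry of I is zeroed as noted above):

```python
import numpy as np

def ridge_normal_equation(X, Y, lam):
    """Theta = (X^T X + lambda*I)^{-1} X^T Y, with I[0,0] = 0 for the offset."""
    n = X.shape[1]
    I = np.eye(n)
    I[0, 0] = 0.0      # do not regularise the constant-offset parameter
    return np.linalg.solve(X.T @ X + lam * I, X.T @ Y)

rng = np.random.default_rng(2)
X = np.hstack([np.ones((40, 1)), rng.normal(size=(40, 2))])  # offset + 2 features
Y = X @ np.array([4.0, 2.0, -1.0])

print(ridge_normal_equation(X, Y, lam=0.0))    # exact fit when lambda = 0
print(ridge_normal_equation(X, Y, lam=10.0))   # the fit is pulled towards smaller weights
```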

³ Note that the bias nodes of a neural network (chapter 2) are extra features, not constant offsets

1.5 Linear regression


By taking the transfer function to be g(k) = k, the model simply represents a linear combination of the input values, parametrised by the matrix Θ. While this results in a linear hypothesis function h_Θ (with respect to X and Y), the output need not necessarily be linear with respect to the parameters of the actual underlying problem: the regression parameters (x values) may be non-linear functions of the problem parameters. For example, given a problem involving two physical parameters a and b, we can construct a set of features for the regression that are non-linear with respect to a and b, e.g:

    ~x = (a², ab, b², a³/b)
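For illustration (code not part of the original text), such a feature map might be built as below; the hypothesis remains linear in Θ even though the features are non-linear in a and b:

```python
import numpy as np

def features(a, b):
    """Non-linear feature vector x = (a^2, ab, b^2, a^3/b) of parameters a, b."""
    return np.array([a**2, a * b, b**2, a**3 / b])

# The model output is still just a linear combination: h = x . theta
theta = np.array([1.0, 0.5, -2.0, 0.1])   # hypothetical learnt parameters
x = features(3.0, 2.0)
print(x)            # [9, 6, 4, 13.5]
print(x @ theta)    # 9 + 3 - 8 + 1.35 = 5.35
```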

1.6 Logistic regression


A non-linear transfer function (e.g. the sigmoid function: g(k) = 1/(1 + e^{−k})) can be used to produce a non-linear hypothesis. This is particularly useful for problems with discrete output such as categorisation and decision problems, as these kinds of problems imply non-linear hypotheses. The output may then be passed through a threshold function in order to produce a discrete value⁴, for example:

    threshold(y) = true    if y ≥ 0.5,
                   false   if y < 0.5

Figure 1.1: Plot of the sigmoid transfer function (f(k) against k, with the linear and threshold transfer functions and the 5%/95% levels shown for comparison)

The sigmoid function has a range of [0, 1], which is ideal for binary problems with yes/no outputs such as categorisation. For ternary problems with positive/neutral/negative outputs, a function with a range of [−1, +1] may be more applicable. One such function is the artanh function, although the sigmoid function can also be scaled to this new range.

⁴ For example, yes/no

1.7 Multi-class classification


In order to classify input vectors ~x into some category c, a separate classifier can be trained for each category. Each individual classifier C_i predicts whether or not the object o (represented by input vector ~x) is in category c_i, i.e. each individual classifier predicts the result of the boolean expression C_i(~x) ⇔ o ∈ c_i. Since each individual classifier predicts only whether the object is in the associated category or is not in it, this approach is often called one-vs-all classification. The cost-function of a multi-class classifier is the sum of the cost-functions for each individual one-vs-all classifier. Consequently, the gradient of the cost-function is the sum of all the gradients of the cost-functions for the individual classifiers.
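A compact sketch of one-vs-all prediction (illustrative only; it assumes an already-trained parameter matrix Theta with one column per category, and the values below are invented):

```python
import numpy as np

def sigmoid(k):
    return 1.0 / (1.0 + np.exp(-k))

def predict_category(x, Theta):
    """One-vs-all: evaluate every per-category classifier, pick c = argmax_i(y_i)."""
    y = sigmoid(x @ Theta)      # one score per category
    return int(np.argmax(y))

# Hypothetical 3-category classifier over 2 features
Theta = np.array([[ 4.0, -2.0, -2.0],
                  [-2.0,  4.0, -2.0]])
print(predict_category(np.array([1.0, 0.0]), Theta))   # category 0
print(predict_category(np.array([0.0, 1.0]), Theta))   # category 1
```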

TODO
Simplify and shorten, add examples

1.8 Formulae
Linear regression:

    g(k) = k

    J = (1/2m) ‖h_Θ(X) − Y‖² + (λ/2m) ‖Θ‖²

    ∇_Θ J = (1/m) Xᵀ(h_Θ(X) − Y) + (λ/m) Θ

Logistic regression:

    g(k) = 1 / (1 + e^{−k})    for yes/no
    or   = tanh⁻¹(k)           for positive/neutral/negative

    J = −(1/m) [ ~y · log(h_Θ(~x)) + (1 − ~y) · log(1 − h_Θ(~x)) ] + (λ/2m) ‖Θ‖²

    ∇_Θ J = (1/m) Xᵀ(h_Θ(X) − Y) + (λ/m) Θ
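These formulae translate almost directly to code; a sketch for the regularised logistic case (illustrative; shapes assumed as X of size m×n with Y and h as m-vectors, and no special handling of the offset parameter for brevity):

```python
import numpy as np

def sigmoid(k):
    return 1.0 / (1.0 + np.exp(-k))

def logistic_cost_grad(theta, X, Y, lam):
    """Regularised logistic cost J and its gradient, per the formulae above."""
    m = len(Y)
    h = sigmoid(X @ theta)
    J = -np.mean(Y * np.log(h) + (1 - Y) * np.log(1 - h)) \
        + lam / (2 * m) * np.sum(theta**2)
    grad = X.T @ (h - Y) / m + lam / m * theta
    return J, grad

X = np.array([[1.0, 2.0], [1.0, -2.0], [1.0, 3.0], [1.0, -1.0]])
Y = np.array([1.0, 0.0, 1.0, 0.0])
J, grad = logistic_cost_grad(np.zeros(2), X, Y, lam=0.1)
print(J)   # log(2) ~ 0.693 at theta = 0, since h = 0.5 everywhere
```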

1.9 Fine-tuning the learning process


The result of iterative learning processes such as gradient descent can be assessed through a variety of techniques, which provide information that can be used to make intelligent adjustments to the learning rate α and to the regularisation parameter λ.

Problems arising from a poor choice of the learning rate may be identified by looking at the learning curves. Bad values for the regularisation parameter may be diagnosed by cross-validation, which identifies bias and variance.


1.9.1 α: Learning, convergence and oscillation


As the learning process iterates, the cost-function can be plotted against the iteration number to produce learning curves. The evolution of the cost-function during the learning process will indicate whether the learning rate α is too high or too low. The cost-function should ideally converge to some value when the learning process is complete (figure 1.2c), resulting from the function reaching a local minimum. A lack of convergence indicates that the learning process is far from complete (figure 1.2a), and oscillation indicates that the learning process is unlikely to complete (figure 1.2b).
Figure 1.2: Learning curves (J against iterations) for various learning rates: (a) no convergence, increase α; (b) oscillation, decrease α; (c) convergence, α is good

A simple optimisation is to start with a high value of α, then decrease it with each oscillation. This provides rapid convergence initially, with a high chance of oscillations as the cost-function minimum is approached, but with the oscillations decreasing as the learning rate is decreased, resulting in convergence close to the cost-function minimum.
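A sketch of that heuristic (illustrative only; the halving factor, the starting α and the data are arbitrary choices): whenever the cost rises between iterations, an oscillation is assumed and the learning rate is cut.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 2))
Y = X @ np.array([1.0, 2.0])

theta = np.zeros(2)
alpha = 2.0                 # deliberately too high to start with
prev_J = np.inf

for _ in range(200):
    residual = X @ theta - Y
    J = np.mean(residual**2)
    if J > prev_J:          # oscillation detected: the cost went up
        alpha *= 0.5        # so decrease the learning rate
    prev_J = J
    theta -= alpha * X.T @ residual / len(X)

print(theta)   # settles close to [1, 2]
```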

1.9.2 λ: Bias and variance


The dataset used to train the machine learning algorithm does not completely represent the
problem that the algorithm is intended to solve, so it is possible for the algorithm to over-fit
the data, by learning the imperfections in the dataset in addition to the desired trends. In
this case the cost-function will be very low when evaluated on the training data, since the
algorithm fits the training data well. When the algorithm is applied to other data though,
the error can be large (figure 1.3a). Over-fitting can be prevented by regularisation; however, this leads to another problem, under-fitting, where the cost-function will be high for both the training dataset and for other data, as a result of the algorithm not learning the trend in enough detail (figure 1.3b). Ideally, the algorithm should perform with a low but similar error value on both the training data and on other data (figure 1.3c).


Figure 1.3: Learning curves (J against iterations) for the training data and for other data, for various regularisation parameters: (a) over-fit/variance, increase λ; (b) under-fit/bias, decrease λ; (c) good

Example:
Taking a fifth-order polynomial as an example, we train a linear regression system (using gradient descent) by using four⁵ known points in the polynomial. Training with many features (for example, monomials up to the tenth order), we can find an exact or near-exact match for the training data as shown in figure 1.4a, but this learnt trend may be very different to the actual trend. Conversely, training with too few features or with high regularisation produces a learnt trend that is equally bad at matching either the training dataset or future data, shown in figure 1.4b. The good fit in figure 1.4c doesn't perfectly follow the trend, since there is insufficient information in just four points to describe a fifth-order polynomial, but it is a reasonable match. Of course, if the actual trend was a sinusoid then the "good" fit would have high error instead. When the training data contains insufficient information to describe the trend, then some properties of the trend must be known beforehand when designing the features; for example, are we to use sinusoid features, monomial features, exponential features or some combination of them all?

Figure 1.4: Examples of over-fitting, under-fitting and a good fit (Y against X, showing the training data, the learnt trend and the actual trend): (a) variance: many features and low regularisation; (b) bias: few features or high regularisation; (c) good

Cross-validation:
If the actual trend were known, then there would be no need for machine learning. In practice, the learnt trend cannot be directly compared against the actual trend, and for many real-world problems there is no actual trend. Instead, the training dataset may be split into two subsets, the training set and the cross-validation set. The learning algorithm may then be applied to the training set, and the regularisation parameter λ varied in order to minimise the error⁶ in the cross-validation set (see below). The amount of regularisation can therefore be determined intelligently, rather than by guesswork. In figure 1.3, the "other data" series becomes the cost-function curve for the cross-validation dataset.

⁵ This is two less than would be required to uniquely describe the polynomial

    Θ_good = argmin_Θ J(Θ, X_train)

    λ_good = argmin_λ J(Θ_good, X_cross)
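A sketch of that selection loop (illustrative only; it reuses the ridge closed form as the inner minimisation, and the candidate list of λ values is a hypothetical choice):

```python
import numpy as np

def fit(X, Y, lam):
    """Inner minimisation: ridge regression, Theta = (X^T X + lambda*I)^{-1} X^T Y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ Y)

def cost(theta, X, Y):
    return np.mean((X @ theta - Y)**2)

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 5))
Y = X @ rng.normal(size=5) + 0.3 * rng.normal(size=80)   # noisy synthetic trend

X_train, Y_train = X[:60], Y[:60]    # training set
X_cross, Y_cross = X[60:], Y[60:]    # cross-validation set

# Outer minimisation: pick the lambda whose fit has the lowest cross-validation cost
candidates = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
lam_good = min(candidates,
               key=lambda lam: cost(fit(X_train, Y_train, lam), X_cross, Y_cross))
print(lam_good)
```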

TODO
Validation dataset and example
Growing the training set:
Another technique to diagnose bias/variance is to plot the cost-function for the training set and a fixed cross-validation set while varying the size of the training set. Adding more training data will not reduce error caused by bias, but will reduce error caused by variance. Therefore, the errors in the training set and in the cross-validation set will both converge to some high value if the system is suffering from high bias (figure 1.5a). If the system is suffering from high variance instead then the errors may appear to converge to two separate values with a large gap in between (figure 1.5b), but will converge eventually if a large and diverse enough training set is used. Over-fitting can therefore be eliminated by getting more training data when possible, instead of increasing the regularisation parameter (λ).

Figure 1.5: Identifying bias and variance by varying the size of the training set (J against training set size, for the training data and the cross-validation data): (a) bias: both errors converge to a high value; (b) variance: a large gap remains between the two errors; (c) good

Summary:
Over-fitting can be interpreted as the system trying to infer too much from too little training data, whereas under-fitting can be interpreted as the system making inefficient use of the training data, or there not being sufficient training data for the system to create a useful hypothesis. Therefore, both can be fixed by either adjusting the amount of regularisation or by altering the amount of training data available. In order to increase the amount of training data, new data/features⁷ can be collected or new features can be produced by applying non-linear operations to the existing elements, such as exponentiation (~x₇ = ~x₁²), multiplication or division (~x₈ = ~x₄~x₅/~x₆).

⁶ The value of the cost-function, J

High bias                  High variance

Decrease λ                 Increase λ
Get more features          Use fewer features
Create more features       Grow the training dataset

Table 1.1: Cheat-sheet for handling bias and variance

1.9.3 Accuracy, precision, F-values


TODO
This section is high priority

⁷ Recall that features are properties of a data item

2  Neural networks
Neural networks are complex non-linear models, built from components that individually behave similarly to a regression model. They can be visualised as graphs¹, and some sub-graphs may exist with behaviour similar to that of logic gates². Although the structure of a neural network is explicitly designed beforehand, the processing that the network does in order to produce a hypothesis (and therefore, the various logic gates and other processing structures within the network) evolves during the learning process³. This allows a neural network to be used as a solver that programs itself, in contrast to typical algorithms that must be designed and coded explicitly.
Evaluating the hypothesis defined by a neural network may be achieved via feed-forward,
which amounts to setting the input nodes, then propagating the values through the
connections in the network until all output nodes have been calculated completely. The
learning can be accomplished by using gradient descent, where the error in the output nodes
is pushed back through the network via back-propagation, in order to estimate the error in
the hidden nodes, which allows calculation of the gradient of the cost-function.

Figure 2.1: An information processing unit, visualised as a graph (input data, processing unit, output data)

2.1 Regression visualised as a building block


Linear regression may be visualised as a graph. The output is simply the weighted sum of
the inputs:

Figure 2.2: Linear regression (inputs x1..x4 feed a Θ-weighted summation that produces the output y)


¹ For example, figure 2.6
² For example, figure 2.10
³ Analogous to biological neural networks, from which the concept of artificial neural networks is derived


Similarly, logistic regression may be visualised as a graph, with one extra node to represent
the transfer function. A logistic regression element may also be described using a linear
regression element and a transfer node, by recognising that the first two stages of a logistic
regression element form a linear regression element.

Figure 2.3: Logistic regression (inputs x1..x4 feed a Θ-weighted summation, whose result passes through the transfer function g to produce the output y)



Since the last three stages of the pipeline are dependent only on the first stage, we will
condense them into one non-linear mixing operation at the output:

Figure 2.4: Simplified anatomy of a logistic regression process (inputs x1..x4, weights θ1..θ4, output y)

Using this notation, a network that performs classification via several one-vs-all classifiers has the following form, where the parameter vectors have been combined to form a parameter matrix Θ, with a separate column to produce each column of the output vector:

Figure 2.5: Simplified anatomy of a multi-class classification network (inputs x1..x4, outputs y1..y3)


2.2 Hidden layers


Logistic regression is a powerful tool, but it can only form simple hypotheses, since it operates on a linear combination of the input values (albeit applying a non-linear function as soon as possible). Neural networks are constructed from layers of such non-linear mixing elements, allowing development of more complex hypotheses. This is achieved by stacking⁴ logistic regression networks to produce more complex behaviour. The inclusion of extra non-linear mixing stages between the input and the output nodes can increase the complexity of the network, allowing it to develop more advanced hypotheses. This is relatively simple:

Figure 2.6: A simple neural network with one hidden layer (input layer x1..x4, one hidden layer, output layer y1..y3)

Although layers of linear regression nodes could be used in the network, there is no point, since each logistic regression element transforms a linear combination of the inputs, and a linear combination of a linear combination is itself a linear combination⁵.

2.3 Notation for neural network elements


The input value to the jth node (or neuron) of the lth layer in a network with L layers is denoted z_j^l, and the output value (or activation) of the node is denoted a_j^l = g(z_j^l). The parameter matrix for the lth layer (which produces ~z^l from ~a^{l−1}) is denoted Θ^{l−1}. The activation of the first (or input) layer is given by the network input values: ~a^1 = ~x. The activation of the last (or output) layer is the output of the network: ~a^L = ~y.

⁴ Neural networks can also contain feedback loops and other features that are not possible by stacking alone
⁵ i.e. Multiple linear combination layers can be combined to give one single linear combination element


Figure 2.7: Θ, g, z and a in a neural network element (inputs x1..x4 are Θ-weighted and summed to give z, whose transfer-function output is the activation a = g(z) = y)

2.4 Bias nodes


Typically, each layer contains an offset term which is set to some constant value (e.g. 1). For convenience, this will be given index 0, such that a_0^l = 1. There is a separate parameter matrix for each layer, so we now have a set of Θ matrices. A biased network with several hidden layers is shown below to illustrate the structure and notation for such a network:

Figure 2.8: Example of the notation used to number the nodes of a neural network (a biased network with an input layer, two hidden layers and an output layer; each layer has a bias node a_0 and a parameter matrix Θ leading to the next layer)

2.5 Logic gates


The presence of multiple layers can be used to construct all the elementary logic gates. This in turn allows construction of advanced digital processing logic in neural networks, and this construction occurs automatically during the learning stage. Some examples are shown below, which take inputs of 0/1 and which return a positive output for true and a non-positive output for false:


Figure 2.9: Elementary logic gates as neural networks: (a) logical NOT gate (y = 1 − 2x); (b) logical AND gate (y = 2x1 + 2x2 − 3); (c) logical OR gate (y = 2x1 + 2x2 − 1)

From these, it becomes trivial to construct other gates. Negating the values produces the inverted gates, and these can be used to construct more complex gates. Thus, neural networks may be understood as self-designing microchips, capable of both digital and analogue processing.
Figure 2.10: Logical XOR gate, constructed from four NAND gates (each NAND computed as y = 3 − 2x1 − 2x2)
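These gate constructions can be checked numerically; a sketch (illustrative, with the weights taken from figures 2.9 and 2.10 and a simple positive/non-positive threshold standing in for the sigmoid):

```python
def nand(x1, x2):
    """NAND as a single neuron: output is true (1) when 3 - 2*x1 - 2*x2 > 0."""
    return 1 if 3 - 2 * x1 - 2 * x2 > 0 else 0

def xor(x1, x2):
    """XOR built from four NAND gates, as in figure 2.10."""
    a = nand(x1, x2)
    return nand(nand(x1, a), nand(a, x2))

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor(x1, x2))   # truth table: 0, 1, 1, 0
```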

2.6 Feed-forward
To evaluate the hypothesis h_Θ(~x) for some input data ~x, the data is fed forward through the layers:

    ~z^{l+1} = Θ^l ~a^l         input value of neuron
    ~a^{l+1} = g(~z^{l+1})      activation of neuron
    ~a^1 = ~x                   network input
    ~a^L = h_Θ(~x)              network output

2.7 Cost-function
The cost-function remains the same as that for logistic regression:

    J = −(1/m) [ ~y · log(h_Θ(~x)) + (1 − ~y) · log(1 − h_Θ(~x)) ] + (λ/2m) Σ_{l=1..L} ‖Θ^l‖²


Note the dot-product in the expression for the cost-function; this represents a summation of
the cost-functions for each individual one-vs-all classifier. Since neural networks have hidden
layers between the input and output nodes, the learning process is slightly more complex
than that of the logistic regression multi-class network.

2.8 Gradient via back-propagation


The gradients of the cost-function (with respect to the various parameter matrices) are calculated by propagating the error in the output back through the network.

    ~e^L = h_Θ(X) − Y                          error at the final layer
    ~e^l = (Θ^l)ᵀ ~e^{l+1} ∘ g′(~z^l)          error at each prior layer

    g(k) = 1/(1 + e^{−k}),  g′(k) = g(k)(1 − g(k))      transfer function and derivative

With the error at each node calculated, the amount that each node contributes to the over-all cost can be estimated, leading to the gradient:

    Δ^n = (~a^n)ᵀ ~e^{n+1}                     error contributed to the next layer

    ∇_{Θ^n} J = (1/m) Δ^n + (λ/m) Θ^n          gradient of J

Note that the gradient shown here includes the regularisation term. Also, although the bias nodes have constant values for their activations, the error contributed by bias nodes is not necessarily zero⁶, since the weights of these nodes (Θ^l_{0,j}) are not constant. Therefore the weights of the bias nodes should be excluded from regularisation, as described in section 1.4, except for rare cases where it may be useful to minimise the amount of bias.
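A minimal numerical sketch of feed-forward followed by back-propagation for one hidden layer (illustrative only: bias nodes and regularisation are omitted for brevity, the data is synthetic, and the analytic gradient is checked against a finite-difference estimate):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(T1, T2, X, Y):
    """Feed-forward, then the logistic cost (no regularisation term)."""
    a2 = sigmoid(X @ T1)    # hidden-layer activations
    h = sigmoid(a2 @ T2)    # network output
    return -np.mean(Y * np.log(h) + (1 - Y) * np.log(1 - h))

def gradients(T1, T2, X, Y):
    """Back-propagate the output error through the hidden layer."""
    m = len(X)
    a2 = sigmoid(X @ T1)
    h = sigmoid(a2 @ T2)
    e3 = h - Y                            # error at the final layer
    e2 = (e3 @ T2.T) * a2 * (1 - a2)      # error at the hidden layer
    return X.T @ e2 / m, a2.T @ e3 / m    # gradients for Theta^1, Theta^2

rng = np.random.default_rng(5)
X = rng.normal(size=(20, 3))
Y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(20, 1)
T1, T2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 1))

G1, G2 = gradients(T1, T2, X, Y)

# Finite-difference check on one element of Theta^1
eps = 1e-6
T1p = T1.copy()
T1p[0, 0] += eps
numeric = (cost(T1p, T2, X, Y) - cost(T1, T2, X, Y)) / eps
print(numeric, G1[0, 0])   # the two estimates should agree closely
```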

TODO
Real-world examples

⁶ Therefore the gradient of the cost-function with respect to the weights of these nodes may also be non-zero
List of Figures
1.1 Plot of the sigmoid transfer function . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Learning curves for various learning rates . . . . . . . . . . . . . . . . . . . . 7
1.3 Learning curves for various regularisation parameters . . . . . . . . . . . . 8
1.4 Examples of over-fitting, under-fitting and a good fit . . . . . . . . . . . . . 8
1.5 Identifying bias and variance by varying the size of the training set . . . . . 9

2.1 An information processing unit, visualised as a graph . . . . . . . . . . . . . 11


2.2 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Simplified anatomy of a logistic regression process . . . . . . . . . . . . . . . 12
2.5 Simplified anatomy of a multi-class classification network . . . . . . . . . . . 12
2.6 A simple neural network with one hidden layer . . . . . . . . . . . . . . . . . 13
2.7 , g, z and a in a neural network element . . . . . . . . . . . . . . . . . . . 14
2.8 Example of the notation used to number the nodes of a neural network . . . 14
2.9 Elementary logic gates as neural networks . . . . . . . . . . . . . . . . . . . 15
2.10 Logical XOR gate, constructed from four NAND gates . . . . . . . . . . . . 15
