
Optimisation and Multilayer Perceptron

Tirtharaj Dash
Perceptron I

I The perceptron is one of the earliest ANNs, developed to solve only
linearly separable pattern classification problems.
I It is a single-layer ANN that produces the target output for the
present inputs, given that it has been trained with input-output
data pairs.
I It typically uses a signum function or a step function as the
activation function.

Figure 1: A typical architecture of Perceptron with a threshold θ at
the output layer (From: Neuro-Fuzzy and Soft Computing)
Perceptron II

Figure 2: A typical architecture of Perceptron with threshold being
represented as a part of the knowledge-base (From: Neuro-Fuzzy
and Soft Computing)

I The activation functions are:

    sgn(x)  = +1 if x > 0; −1 otherwise    (1)
    step(x) = +1 if x > 0;  0 otherwise    (2)
Perceptron III
I It has been proven that there exists a method which can tune
the weights w to provide the required output, provided such
weights exist. This result is called the Perceptron convergence
theorem.
I Can a perceptron solve a simple non-linearly separable problem
such as the Exclusive–OR problem?

Table 1: Exclusive–OR
x1 x2 Class
0 0 0
0 1 1
1 0 1
1 1 0
I A similar table can also be constructed if we consider the
inputs to be bipolar (+1, −1).
I It is quite easy to verify that the XOR problem is not linearly
separable.
Perceptron IV

Figure 3: XOR problem (From: Neuro-Fuzzy and Soft Computing)

I Let us try to construct a straight line that partitions the
two-dimensional input space into two regions, each containing
only data points of the same class.
Perceptron V
I This could possibly be done by a single-layer perceptron, which
would be required to satisfy the following inequalities:

0 × w1 + 0 × w2 + w0 ≤ 0 =⇒ w0 ≤ 0 (3)
0 × w1 + 1 × w2 + w0 > 0 =⇒ w0 > −w2 (4)
1 × w1 + 0 × w2 + w0 > 0 =⇒ w0 > −w1 (5)
1 × w1 + 1 × w2 + w0 ≤ 0 =⇒ w0 ≤ −(w1 + w2 ) (6)

I The above set of inequalities is self-contradictory: no choice
of weights {w0 , w1 , w2 } can satisfy all four simultaneously
(see the brute-force check at the end of this slide).
I Then, what is the solution to such a problem? Instead of
trying to separate the two classes with a single straight line,
we can use two straight lines; a solution then becomes possible.
I To obtain these two straight lines, we need more than one
perceptron – such an architecture is called a Multilayer
Perceptron (MLP).
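As a quick illustration (my own sketch, not from the original slides), the following
Python check scans a coarse grid of candidate weights (w0, w1, w2) for a single
step-activation perceptron. It finds weights that realize AND and OR, but none that
realize XOR, in line with the contradictory inequalities above. The grid range and
step size are arbitrary choices.

import itertools

def perceptron(w0, w1, w2, x1, x2):
    # Step activation: output 1 if w1*x1 + w2*x2 + w0 > 0, else 0
    return 1 if w1 * x1 + w2 * x2 + w0 > 0 else 0

def solvable(targets):
    # Search a coarse grid of weights for one that reproduces `targets`
    grid = [i / 2 for i in range(-10, 11)]            # -5.0 ... +5.0 in steps of 0.5
    inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    for w0, w1, w2 in itertools.product(grid, repeat=3):
        if all(perceptron(w0, w1, w2, x1, x2) == t
               for (x1, x2), t in zip(inputs, targets)):
            return True
    return False

print("AND solvable:", solvable([0, 0, 0, 1]))   # True
print("OR  solvable:", solvable([0, 1, 1, 1]))   # True
print("XOR solvable:", solvable([0, 1, 1, 0]))   # False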
MLP I

I Basic features of multilayer perceptrons:


I The model of each neuron in the network includes a nonlinear
activation function that is differentiable.
I The network contains one or more layers that are hidden from
both the input and output nodes.
I The network exhibits a high degree of connectivity, the extent
of which is determined by synaptic weights of the network.

Figure 4: MLP with two [fully-connected] hidden layers (From:
Simon Haykin, Neural Networks and Learning Machines)
MLP II
I The training of the MLP is usually based on the gradient
descent optimization algorithm and is popularly known as the
Back Propagation algorithm.
I There are two phases in MLP training – (a) forward phase,
(b) backward phase.
I (a) In the forward phase, the synaptic weights of the network
are fixed and the input signal is propagated through the
network, layer by layer, until it reaches the output. Thus, in
this phase, changes are confined to the activation potentials
and outputs of the neurons in the network.
I (b) In the backward phase, an error signal is produced by
comparing the output of the network with a desired
response. The resulting error signal is propagated through the
network, again layer by layer, but this time the propagation is
performed in the backward direction. In this second phase,
successive adjustments are made to the synaptic weights of
MLP III
the network. Calculation of the adjustments for the output
layer is straightforward, but it is much more challenging for
the hidden layers.
I Each hidden or output neuron of an MLP is designed to
perform two computations:
I The computation of the function signal appearing at the
output of each neuron, which is expressed as a continuous
nonlinear function of the input signal and synaptic weights
associated with that neuron;
I (for back propagation) The computation of an estimate of the
gradient vector (i.e., the gradients of the error surface with
respect to the weights connected to the inputs of a neuron),
which is needed for the backward pass through the network.
I Hidden neurons:
I The hidden neurons act as feature detectors. During the
learning process of the MLP, the hidden neurons begin to
gradually “discover” the salient features that characterize the
training data.
MLP IV
I They do so by performing a nonlinear transformation of the
input data into a new space called the feature space.
I For a classification problem, the classes become much more
easily separable in this hidden-layer feature space than in the
original input space.
I The back propagation algorithm assigns more credit to those
output or hidden neurons whose contributions bring the network
output closer to the desired output, and assigns blame to those
neurons whose contributions drive it away from the desired
output. Deciding how to apportion this credit and blame is
called the credit assignment problem.
I Based on the processing of the training data and computation
of the error signal, back propagation based learning of MLP
can be of two types:
I Batch learning – Batch gradient descent
I Online learning – stochastic gradient descent
I Consider an MLP with an input layer of source nodes, a set of
hidden layers and an output layer that may consist of a set of
output neurons.
MLP V
I Let T = {x(n), d(n)}, n = 1, . . . , N, denote the training
samples used to train the network.
I Let yj (n) denote the function signal produced at the output of
neuron j at the output layer by the stimulus x(n) applied at
the input layer.
I Correspondingly, the error signal produced at the output of
neuron j is defined by

ej (n) = dj (n) − yj (n) (7)

where dj (n) is the jth element of the desired-response vector
d(n).
I The instantaneous error energy can be computed as
    Ej (n) = (1/2) ej²(n)    (8)
MLP VI

I Summing the error-energy contributions of all the neurons in
the output layer, we can express the total instantaneous error
energy of the whole network as

    E(n) = Σ_{j∈C} Ej (n)              (9)
         = (1/2) Σ_{j∈C} ej²(n)        (10)

where C is the set of all neurons in the output layer. With the
training dataset consisting of N examples, the error energy
MLP VII

can be averaged over all the training samples – also called
the empirical risk – defined by

    Eav (N) = (1/N) Σ_{n=1}^{N} E(n)                    (11)
            = (1/2N) Σ_{n=1}^{N} Σ_{j∈C} ej²(n)         (12)

I Since the output y of the network is a function of the
adjustable synaptic weights, these error energies are also
functions of these parameters – the free parameters of the
network.
Batch learning – Batch GD I
I In the batch method of supervised learning, adjustments to
the synaptic weights of the MLP are performed after the
presentation of all the N training examples (i.e. T ) that
constitute one epoch of training.
I In other words, the cost function for batch learning is defined
by the average error energy Eav .
I Adjustments to the synaptic weights of the MLP are made on
an epoch-by-epoch basis.
I Correspondingly, one realization of the learning curve is
obtained by plotting Eav versus the number of epochs, where,
for each epoch of training, the examples in the training
sample T are randomly shuffled.
I This batch process is repeated for multiple epochs until a
stopping condition is met.
I Advantages of batch gradient descent:
Batch learning – Batch GD II

I accurate estimation of the gradient vector (i.e., the derivative
of the cost function Eav with respect to the weight vector w),
thereby guaranteeing, under simple conditions, convergence of
the method of steepest descent to a local minimum.
I Parallelization of the learning process.
I However, from a practical perspective, batch learning is rather
demanding in terms of storage requirements.
I In a statistical context, batch learning may be viewed as a
form of statistical inference. It is therefore well suited for
solving nonlinear regression problems – where the outcome
should be averaged over all the examples.
Online learning – Stochastic GD I
I In the on-line method of supervised learning, adjustments to
the synaptic weights of the MLP are performed on an
example-by-example basis.
I The cost function to be minimized is therefore the total
instantaneous error energy E(n).
I Here, the training examples are presented to the network in a
random order. This makes the search in the multidimensional
weight space stochastic in nature; it is for this reason that the
method of on-line learning is sometimes referred to as a
stochastic method.
I This stochasticity has the desirable effect of making it less
likely for the learning process to be trapped in a local
minimum, which is a definite advantage of on-line learning
over batch learning.
I Online learning is well suited for pattern-classification
problems.
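To make the distinction concrete, here is a minimal Python-style sketch (my own,
not from the slides) of the two update schedules for a generic weight vector w;
grad_E is an assumed placeholder that returns the gradient of the per-example
error energy E(n) with respect to w.

import random

# Batch learning: one weight update per epoch, using the gradient of Eav
def batch_epoch(w, training_set, grad_E, eta):
    g = sum(grad_E(w, x, d) for x, d in training_set) / len(training_set)
    return w - eta * g

# On-line (stochastic) learning: one update per example, presented in random order
def online_epoch(w, training_set, grad_E, eta):
    examples = list(training_set)
    random.shuffle(examples)
    for x, d in examples:
        w = w - eta * grad_E(w, x, d)
    return w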
Back Propagation with SGD I

I Consider the following figure that highlights the output


neuron j:

Figure 5: Signal-flow graph for output node j (From: Simon Haykin,
NN and LM)
Back Propagation with SGD II
I Let us focus on the back propagation algorithm developed with
SGD for tuning the free parameters.
I From the figure: The output neuron j is fed by the function
signals produced by a layer of neurons to its left.
I The induced local field vj (n) produced at the input of the
activation function associated with neuron j is therefore
    vj (n) = Σ_{i=0}^{m} wji (n) yi (n)    (13)

where m is the total number of inputs (excluding the bias)


applied to neuron j. The synaptic weight wj0 (corresponding
to the fixed input y0 = +1) equals the bias bj applied to
neuron j. Hence, the function signal yj (n) appearing at the
output of neuron j at iteration n is

yj (n) = fj (vj (n)) (14)
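As a small illustration (my own sketch, not from the slides), the forward
computation of Eqs. (13)–(14) for a single neuron can be written as follows;
a logistic sigmoid is assumed for the activation fj.

import math

def neuron_forward(w, y_prev):
    # w[0] is the bias weight wj0 (its input y0 is fixed at +1);
    # w[1:] are the weights wji for the incoming signals yi in y_prev.
    v = w[0] + sum(wi * yi for wi, yi in zip(w[1:], y_prev))    # Eq. (13)
    y = 1.0 / (1.0 + math.exp(-v))                              # Eq. (14), fj = logistic sigmoid
    return v, y

# Example: a neuron with bias weight 0.5 and two incoming signals
v, y = neuron_forward([0.5, -1.0, 2.0], [0.3, 0.7])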


Back Propagation with SGD III
I The back propagation algorithm applies a correction ∆wji (n)
to the synaptic weight wji (n), which is proportional to the
partial derivative ∂E(n)/∂wji (n).
I According to chain rule of derivatives, we can express this
gradient as

    ∂E(n)/∂wji (n) = [∂E(n)/∂ej (n)] [∂ej (n)/∂yj (n)] [∂yj (n)/∂vj (n)] [∂vj (n)/∂wji (n)]    (15)

I We can obtain every term on the right-hand side of the above
equation using the available definitions.
I From the definition of the instantaneous error energy, we can
get

    ∂E(n)/∂ej (n) = ej (n)    (16)
Back Propagation with SGD IV

I Similarly,

    ∂ej (n)/∂yj (n) = −1    (17)

I Differentiating yj (n) = fj (vj (n)) w.r.t. vj (n) gives

    ∂yj (n)/∂vj (n) = fj′ (vj (n))    (18)

I Finally,

    ∂vj (n)/∂wji (n) = yi (n)    (19)

I So, we can write

    ∂E(n)/∂wji (n) = −ej (n) fj′ (vj (n)) yi (n)    (20)
Back Propagation with SGD V
I The correction ∆wji (n) applied to wji (n) is defined by the
delta rule:

    ∆wji (n) = −η ∂E(n)/∂wji (n)    (21)

where η is the learning-rate parameter of the back propagation
algorithm. Recall that in gradient descent we take a step
opposite to the direction of the gradient; the minus sign in the
delta rule signifies the same thing in the weight space.
I Now, we can write

    ∆wji (n) = η ej (n) fj′ (vj (n)) yi (n)    (22)

I The term ej (n) fj′ (vj (n)) on the right can be written as δj (n),
which is called the local gradient at the jth node:

    ∆wji (n) = η δj (n) yi (n)    (23)
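As a small sketch (my own, assuming a logistic sigmoid activation, for which
fj′(vj) = yj(1 − yj)), Eqs. (20)–(23) for an output neuron translate into:

def output_update(w, y_prev, y_j, d_j, eta):
    # Local gradient of an output neuron: delta_j = e_j * f'(v_j),
    # with f'(v_j) = y_j * (1 - y_j) for the logistic sigmoid (Eq. 22)
    e_j = d_j - y_j
    delta_j = e_j * y_j * (1.0 - y_j)
    # Delta rule (Eq. 23): w_ji <- w_ji + eta * delta_j * y_i
    inputs = [1.0] + list(y_prev)          # y_0 = +1 carries the bias weight
    new_w = [w_ji + eta * delta_j * y_i for w_ji, y_i in zip(w, inputs)]
    return delta_j, new_w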


Back Propagation with SGD VI
I Based on the location of neuron j, the calculation of the
local gradient may differ. The computation of the local
gradient is somewhat easier for the output layer because a
proper error signal is available there to back propagate.
However, for a neuron in a hidden layer, there is no such
direct error signal.
I However, based on the concept of credit assignment, each
hidden neuron shares some amount of responsibility for the
error computed at the next layer.
I Neuron j is an output neuron: When neuron j is located in
the output layer of the network, it is supplied with a desired
response of its own. The computation is straightforward for
this neuron. (See earlier equations)
I Neuron j is a hidden neuron: When neuron j is located in a
hidden layer of the network, there is no specified desired
response for that neuron.
Back Propagation with SGD VII
I The error signal for a hidden neuron has to be determined
recursively, working backwards, in terms of the error signals of
all the neurons to which that hidden neuron is directly
connected.
I See the following figure:
Back Propagation with SGD VIII
Figure 6: Signal-flow graph for output node k connected to hidden
neuron j (From: Simon Haykin, NN and LM)

I We now redefine the local gradient δj (n) as

    δj (n) = −[∂E(n)/∂yj (n)] [∂yj (n)/∂vj (n)]    (24)
           = −[∂E(n)/∂yj (n)] fj′ (vj (n))          (25)

where neuron j is hidden.
I Now, to compute ∂E(n)/∂yj (n), we have to use the figure given
above:

    E(n) = (1/2) Σ_{k∈C} ek²(n)    (26)

where neuron k is an output node. (This is our previous
equation for E(n) with j replaced by k; the change avoids
confusion now that j denotes the hidden neuron.)
Back Propagation with SGD IX
I Differentiating the above equation w.r.t. yj (n), we get

    ∂E(n)/∂yj (n) = Σ_k ek (n) ∂ek (n)/∂yj (n)    (27)

I From the chain rule for partial derivatives, ∂ek (n)/∂yj (n) can
be expanded in the above equation, giving

    ∂E(n)/∂yj (n) = Σ_k ek (n) [∂ek (n)/∂vk (n)] [∂vk (n)/∂yj (n)]    (28)

I However, we know that ek (n) = dk (n) − yk (n), or

    ek (n) = dk (n) − fk (vk (n))    (29)

where k is an output node.


Back Propagation with SGD X
I Hence,

    ∂ek (n)/∂vk (n) = −fk′ (vk (n))    (30)
I We also note from the above figure that for neuron k, the
induced local field is
    vk (n) = Σ_{j=0}^{m} wkj (n) yj (n)    (31)

where m is the total number of inputs (excluding the bias)


applied to the neuron k.
I The synaptic weight wk0 (n) is equal to the bias bk (n) applied
to neuron k, and the corresponding input is fixed at the value
+1. Differentiating vk (n) with respect to yj (n) yields

    ∂vk (n)/∂yj (n) = wkj (n)    (32)
Back Propagation with SGD XI
I Coming back to the term ∂E(n)/∂yj (n), we can now write

    ∂E(n)/∂yj (n) = − Σ_k ek (n) fk′ (vk (n)) wkj (n)    (33)
                  = − Σ_k δk (n) wkj (n)                  (34)

I Therefore, the local gradient of a hidden neuron j can be
obtained by substituting the above equation into the local
gradient expression in Eq. (25):

    δj (n) = fj′ (vj (n)) Σ_k δk (n) wkj (n)    (35)

where neuron j is hidden.


Back Propagation with SGD XII

Figure 7: Signal-flow graph showing error back propagation for
hidden neuron (From: Simon Haykin, NN and LM)

I The outside factor fj′ (vj (n)) involved in the computation of
the local gradient δj (n) in the above equation depends solely on
the activation function associated with hidden neuron j.
I The remaining factor involved in this computation – namely,
the summation over k – depends on two sets of terms:
Back Propagation with SGD XIII
I The first set of terms, the δk (n), requires knowledge of the
error signals ek (n) for all neurons that lie in the layer to the
immediate right of hidden neuron j and that are directly
connected to neuron j
I The second set of terms, the wkj (n), consists of the synaptic
weights associated with these connections.
I Summary of back propagation algorithm:

Figure 8: Computation of weight correction (From: Simon Haykin,
NN and LM)

I Based on the location of neuron j, values have to be


substituted.
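To tie the forward and backward passes together, the following self-contained
sketch (my own illustration, not from the slides) trains a 2-2-1 MLP on the XOR
problem using on-line back propagation with logistic sigmoid activations. The
learning rate, initialization range, and epoch count are arbitrary choices, and,
depending on the random initialization, training can occasionally settle in a poor
local minimum (a different seed then helps).

import math, random

random.seed(0)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# XOR training set: ((x1, x2), d)
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# Weight layout: element 0 of each weight list is the bias (its input is fixed at +1)
w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
w_out = [random.uniform(-0.5, 0.5) for _ in range(3)]
eta = 0.5

for epoch in range(10000):
    random.shuffle(data)                               # on-line (stochastic) presentation
    for (x1, x2), d in data:
        x = [1.0, x1, x2]
        # Forward phase, Eqs. (13)-(14)
        y_hid = [sigmoid(sum(wi * xi for wi, xi in zip(w, x))) for w in w_hidden]
        z = [1.0] + y_hid
        y_out = sigmoid(sum(wi * zi for wi, zi in zip(w_out, z)))
        # Backward phase: local gradient of the output neuron, Eq. (22)
        delta_out = (d - y_out) * y_out * (1.0 - y_out)
        # Local gradients of the hidden neurons, Eq. (35); w_out[j+1] plays the role of w_kj
        delta_hid = [y * (1.0 - y) * delta_out * w_out[j + 1] for j, y in enumerate(y_hid)]
        # Delta-rule weight corrections, Eq. (23)
        w_out = [wi + eta * delta_out * zi for wi, zi in zip(w_out, z)]
        w_hidden = [[wi + eta * dj * xi for wi, xi in zip(w, x)]
                    for w, dj in zip(w_hidden, delta_hid)]

# Check the trained network on the four XOR patterns
for (x1, x2), d in sorted(data):
    x = [1.0, x1, x2]
    y_hid = [sigmoid(sum(wi * xi for wi, xi in zip(w, x))) for w in w_hidden]
    y_out = sigmoid(sum(wi * zi for wi, zi in zip(w_out, [1.0] + y_hid)))
    print((x1, x2), "target:", d, "output:", round(y_out, 3))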
Summary

I Gradient based optimization techniques are popular for


guiding the search towards a better local minimum.
I Gradient descent has been considered for tuning the free
parameters of a neural network by back propagating the error
signal towards the input layers.
I Stochastic gradient descent aids the learning of a neural
network by making the search process effective: it reduces the
cost (error) at every iteration for a randomly picked pattern
from the training set.
I Back Propagation is one of the most effective training
procedures, and it basically works with the gradient descent
optimization algorithm.
Making BackProp perform better I
In practice, making BackProp perform better is largely an
experimental or innovative art.
I Stochastic vs batch update – As mentioned previously, the
stochastic (sequential) mode of back-propagation learning
(involving pattern-by-pattern updating) is computationally
faster than the batch mode. This is especially true when the
training data sample is large and redundant.
I Maximizing information content – As a general rule, every
training example presented to the back-propagation algorithm
should be chosen on the basis that its information content is
the largest possible for the task at hand (LeCun, 1993).
Two ways of realizing this choice are as follows:
I Use an example that results in the largest training error.
I Use an example that is radically different from all those
previously used.
Both heuristics try to make the search cover more of the weight space.
Making BackProp perform better II
I Randomization of input samples (random sampling), so that
successively presented samples within an epoch rarely belong
to the same class.
I Activation function – preferred choice is a sigmoid function
(s-shaped function).
I Target values – make the target values (i.e. the desired
response di of a sample) separated from the limiting values of
the sigmoid function by some gap ε, so that the error does not
saturate the hidden neurons. The basic idea is that at every
iteration there should be some minimal error so that the MLP
will keep learning and improving its present knowledge.
I Normalization of the inputs – Making the mean value,
averaged over the entire training samples, close to 0. Many
things are usually done such as PCA... (out of discussion here)
Making BackProp perform better III

Figure 9: Illustrating the operation of mean removal, decorrelation,
and covariance equalization for a two-dimensional input space
Making BackProp perform better IV
I Initialization of the synaptic weights – should these be large?
or very small? If neither, then what?
Large values – quick saturation during learning because of the
resulting small local gradients.
Small values – learning operates over a very small (flat) area of
the error surface.
Desirable – the weights should be selected from a normal
distribution with 0 mean and variance equal to the reciprocal
of the number of synaptic connections of a neuron (study the
proof of this statement in Section 4.6, derivation of Equation
4.48, of Haykin). A small sketch of such an initialization is
given at the end of this subsection.
I Learning from hints – use prior information about the unknown
input–output mapping function f (·).
However, it is experimentally unusual to have such prior
knowledge. What we can do instead is to assume a function f
and try to find an fˆ which minimizes the error.
Usually, this is done by assuming properties of the weights of a
neuron. These properties could be a probability distribution,
Making BackProp perform better V

symmetry, etc. That is, instead of assuming a function as a
whole, we can assume properties of the arguments of that
function.
I Learning rate – the last layer usually has larger local
gradients, so a smaller learning rate makes the weight updates
of the last layer smooth. For the layers counted back from the
last layer, the net has already partially corrected itself, so a
larger learning rate might be required (though not always).
A net with a higher number of inputs might not need a high
learning rate.
According to research (LeCun, 1993), for a given neuron the
learning rate should be inversely proportional to the square
root of the number of synaptic connections made to that
neuron.
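A minimal sketch (my own, assuming a fully connected layer and the fan-in rule
quoted above) of such an initialization:

import numpy as np

def init_layer_weights(n_in, n_out, seed=0):
    # Draw each weight from a zero-mean normal distribution whose variance is
    # the reciprocal of the number of synaptic connections (fan-in) of the neuron.
    rng = np.random.default_rng(seed)
    std = 1.0 / np.sqrt(n_in)
    return rng.normal(loc=0.0, scale=std, size=(n_out, n_in))

W1 = init_layer_weights(n_in=10, n_out=5)   # hidden layer: 10 inputs, 5 neurons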
Generalization

I A network is said to generalize well when the input–output


mapping computed by the network is correct (or nearly so) for
test data never used in creating or training the network.
I It is assumed that the test data are drawn from the same
population used to generate the training data.
I A network which is over-trained is prone to overfitting and
generalizes poorly – and vice versa. The idea is that a NN
which has been trained with a limited set of samples should be
able to recognize samples which are slightly different from the
ones used during training.
I It is often desirable to find the simplest smooth function that
fits the available training data (Occam's razor).
Factors affecting generalization

I The size of the training sample and how representative the


training sample is of the environment of interest – The size of
the training sample is fixed, and the issue of interest is that of
determining the best architecture of network for achieving
good generalization.
I The architecture of the neural network – The architecture of
the network is fixed (hopefully in accordance with the physical
complexity of the underlying problem), and the issue to be
resolved is that of determining the size of the training sample
needed for a good generalization to occur.
I The physical complexity of the problem at hand – we can’t
control this.
Widrow’s rule of thumb

I The network architecture and the number of training samples
are related in some ways, and each has its own importance for
the generalization problem.
I According to this rule of thumb, in practice it seems that all
we really need for good generalization is to have the size of
the training sample, N, satisfy the condition

    N = O(W/ε),

where W is the number of free parameters (i.e. weights and
biases), ε denotes the fraction of classification errors permitted
on test data (as in pattern classification), and O(·) denotes the
order of the quantity enclosed within (·).
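As a hypothetical worked example (my own numbers, not from the slides): an MLP
with W = 10,000 free parameters and a permitted test-error fraction ε = 0.1 would,
by this rule, need on the order of N = W/ε = 100,000 training examples; tightening
the requirement to ε = 0.01 raises this to about 1,000,000 examples.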
Approximation of functions
I A multilayer perceptron trained with the back-propagation
algorithm may be viewed as a practical vehicle for performing
a nonlinear input–output mapping of a general nature.
I To be specific, let m0 denote the number of input (source)
nodes of a multilayer perceptron, and let M = mL denote the
number of neurons in the output layer of the network.
I The input–output relationship of the network defines a
mapping from an m0 −dimensional Euclidean input space to
an M−dimensional Euclidean output space, which is infinitely
continuously differentiable when the activation functions are.
I A fundamental question arises out of this input–output
mapping:
What is the minimum number of hidden layers in a multilayer
perceptron with an input–output mapping that provides an
approximate realization of any continuous mapping?
Universal approximation theorem I
The question on the previous slide can be answered using the
universal approximation theorem:
Let ϕ(·) be a nonconstant, bounded, and monotone-increasing
continuous function. Let Im0 denote the m0 −dimensional unit
hypercube [0, 1]^m0 . The space of continuous functions on Im0 is
denoted by C (Im0 ). Then, given any function f ∈ C (Im0 ) and
ε > 0, there exists an integer m1 and sets of real constants αi , bi ,
and wji , where i = 1, . . . , m1 and j = 1, . . . , m0 , such that we may
define

    F (x1 , . . . , xm0 ) = Σ_{i=1}^{m1} αi ϕ( Σ_{j=1}^{m0} wji xj + bi )

as an approximate realization of the function f (·); that is,

    |F (x1 , . . . , xm0 ) − f (x1 , . . . , xm0 )| < ε

for all x1 , . . . , xm0 that lie in the input space.


Universal approximation theorem II

The theorem states that a single hidden layer is sufficient for a


multilayer perceptron to compute a uniform approximation to a
given training set represented by the set of inputs x1 , . . . , xm0 and
a desired (target) output f (x1 , . . . , xm0 )
However, the theorem does not say that a single hidden layer is
optimum in the sense of learning time, ease of implementation, or
(more importantly) generalization.
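As an illustrative sketch (my own, not from the slides), one can see the
construction of the theorem "in action" by fitting a one-hidden-layer network
F(x) = Σi αi ϕ(wi x + bi) to a simple one-dimensional target such as
f(x) = sin(2πx) on [0, 1]. The hidden-layer size, learning rate, and iteration
count below are arbitrary choices, and plain gradient descent only gives a rough
fit; the point is the form of F, not the training method.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)              # inputs in the unit interval [0, 1]
f = np.sin(2 * np.pi * x)                   # target function f(x)

m1 = 20                                      # number of hidden neurons
w = rng.normal(0.0, 1.0, m1)                 # hidden weights w_i
b = rng.normal(0.0, 1.0, m1)                 # hidden biases b_i
alpha = rng.normal(0.0, 0.1, m1)             # output weights alpha_i
phi = lambda v: 1.0 / (1.0 + np.exp(-v))     # bounded, monotone-increasing sigmoid

eta = 0.05
for _ in range(20000):                       # plain batch gradient descent on the MSE
    h = phi(np.outer(x, w) + b)              # hidden activations, shape (200, m1)
    F = h @ alpha                            # F(x) = sum_i alpha_i * phi(w_i*x + b_i)
    err = F - f
    g_h = np.outer(err, alpha) * h * (1 - h) # error propagated back to the hidden layer
    alpha -= eta * (h.T @ err) / len(x)
    w -= eta * (g_h.T @ x) / len(x)
    b -= eta * g_h.mean(axis=0)

print("max |F(x) - f(x)| =", np.abs(F - f).max())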
Effect of dimension–curse of dimensionality

I If the approximation function f (x) is arbitrarily complex and


(for the most part) completely unknown, we need dense
sample (data) points to learn it well.
I Unfortunately, dense samples are hard to find in “high
dimensions”–hence the curse of dimensionality. In particular,
there is an exponential growth in complexity as a result of an
increase in dimensionality, which, in turn, leads to the
deterioration of the space-filling properties for uniformly
randomly distributed points in higher dimension spaces.
I Reason cited by (Friedman, 1995):
A function defined in high-dimensional space is likely to be
much more complex than a function defined in a
lower-dimensional space, and those complications are harder
to discern.
How to mitigate the curse of dimensionality problem

I Incorporate prior knowledge about the unknown function to


be approximated – This knowledge is provided over and above
the training data. Naturally, the acquisition of knowledge is
problem dependent. In pattern classification, for example,
knowledge may be acquired from understanding the particular
classes (categories) of the input data.
I Design the network so as to provide increasing smoothness of
the unknown function with increasing input dimensionality.
Cross validation (CV) I
I BackProp based learning encodes an input-output mapping,
represented by a set of labeled examples, into the synaptic
weights and/or any other parameters of an MLP.
I The hope is that the network becomes well trained so that it
learns enough about the past to generalize to the future.
From such a perspective, the learning process amounts to a
choice of network parameterization for a given set of data.
Meaning–we may view the network selection problem as
choosing, within a set of candidate model structures
(parameterizations), the “best” one according to a certain
criterion.
I To evaluate any model, the available data set is first randomly
partitioned into a training sample and a test set.
I The training sample is further partitioned into two disjoint
subsets:
I an estimation subset, used to select the model;
Cross validation (CV) II
I a validation subset, used to test or validate the model.
I The motivation here is to validate the model on a data set
different from the one used for parameter estimation. In this
way, we may use the training sample to assess the
performance of various candidate models and thereby choose
the “best” one.
I There is, however, a distinct possibility that the model with
the best-performing parameter values so selected may end up
overfitting the validation subset. To guard against this
possibility, the generalization performance of the selected
model is measured on the test set, which is different from the
validation subset.
I Validation could be useful for estimating three different
things:
I Estimating architecture of a model (e.g. estimating number of
hidden layers, number of neurons in a hidden layer, so on)
Cross validation (CV) III
I Estimating the training parameters of a model (e.g. learning
rate, momentum factor, thresholds, if any)
I Assessing generalization with validation sets, to make the
model suitable for a real-world test with independent test data
I See the following figure to understand the CV:

Figure 10: This figure could be understood as an example of 4-fold CV
where each box is a partition (shaded part: test data)
Cross validation (CV) IV

I So generally, for a K −fold CV, the dataset is partitioned into
K partitions:

    for i = 1:K
        Train with the K−1 partitions excluding the i-th partition
        Test with the i-th partition
    end

(A runnable sketch of this loop is given at the end of this slide.)
I Usually, K = 5 or K = 10 is preferred. However, if the dataset
is large enough, one should also take care of the resources
available for the job.
I When K = N (where, N is the number of samples available
for training), it is called Leave-one-out-Cross-Validation, or
popularly LOOCV.
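A minimal Python sketch of the K-fold loop above (my own illustration;
train_model and evaluate are assumed placeholders for whatever training and
scoring routines are in use):

import numpy as np

def k_fold_cv(X, y, K, train_model, evaluate, seed=0):
    # Randomly partition the sample indices into K roughly equal folds
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), K)
    scores = []
    for i in range(K):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != i])
        model = train_model(X[train_idx], y[train_idx])           # train on K-1 folds
        scores.append(evaluate(model, X[test_idx], y[test_idx]))  # test on the i-th fold
    return float(np.mean(scores))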
Complexity regularization I

I In designing a multilayer perceptron by whatever method, we
are in effect building a nonlinear model of the physical
phenomenon responsible for generating the input–output
examples used to train the network.
I We trade off the reliability of the training data against the
goodness of the model (i.e., we need a method for resolving
the bias-variance dilemma).
I The bias-variance dilemma, which means that in parameter
estimation (involving the use of a finite sample size) we have
the inevitable task of trading off the variance of the estimate
with the bias; the bias is defined as the difference between the
expected value of the parameter estimate and the true value,
and the variance is a measure of the “volatility” of the
estimate around the expected value.(Note that, this bias is
not the network bias or synaptic weights)
Complexity regularization II
I In the context of back-propagation learning, or any other
supervised learning procedure for that matter, we may realize
this tradeoff by minimizing the total risk, expressed as a
function of the parameter vector w, b, as follows:

R(w, b) = Jav (w, b) + λJc (w, b)

I The term Jav (w, b) is the average error measure such as MSE
whose evaluation extends over the output neurons of the
network and is carried out for all the training examples on an
epoch-by-epoch basis.
I The second term, Jc (w, b) is a complexity penalty, where the
notion of complexity is measured in terms of the network
(weights w, b) alone; its inclusion imposes on the solution
prior knowledge that we may have on the models being
considered.
Complexity regularization III
I For the present situation, λ could be considered as a
regularization parameter, which represents the relative
importance of the complexity-penalty term with respect to the
performance metric term.
I When λ is zero, the back-propagation learning process is
unconstrained, with the network being completely determined
from the training examples.
I When λ is made infinitely large, on the other hand, the
implication is that the constraint imposed by the complexity
penalty is by itself sufficient to specify the network, which is
another way of saying that the training examples are
unreliable.
I In practical applications of complexity regularization, the
regularization parameter λ is assigned a value somewhere
between these two limiting cases.
Complexity regularization IV
I In its simplest form, the term Jc (w, b) could be written as

    Jc (w, b) = ||f (w, b)||²,    (36)

where || · ||² is a squared norm.


I The function f is considered to be a linear function of w, b to
keep the term simple; for example, f (w, b) = Σ(Aw + Bb).
I This procedure operates by forcing some of the synaptic
weights in the network to take values close to zero, while
permitting other weights to retain their relatively large values.
I Accordingly, the weights of the network are grouped roughly
into two categories:
I weights that have a significant influence on the network's
performance;
I weights that have practically little or no influence on the
network's performance.
Complexity regularization V

I The weights in the latter category are referred to as excess


weights.
I In the absence of complexity regularization, these weights
result in poor generalization by virtue of their high likelihood
of taking on completely arbitrary values or causing the
network to overfit the data in order to produce a slight
reduction in the training error.
I The use of complexity regularization encourages the excess
weights to assume values close to zero and thereby improve
generalization.
I This way of representing the complexity term in Equation (36)
is called the weight-decay procedure.
Example of weight decay I
I Let us denote the set of w, b as W.
I We know that the learning rate, η is a parameter that
determines how much an updating step influences the current
value of the weights.
I From the above slides, we understand that weight decay is an
additional term in the weight update rule that causes the
weights to exponentially decay to zero, if no other update is
scheduled.
I So let’s say that we have a cost or error function J(W) that
we want to minimize using BackProp which uses Gradient
descent.
I GD tells us to update the weights W in the direction of
steepest descent in J:

    Wi = Wi − η ∂J/∂Wi    (37)
Example of weight decay II
I If η is large you will have a correspondingly large modification
of the weights Wi (at least this much we know)
I In order to effectively limit the number of free parameters in
our MLP model and so avoid overfitting, we can regularize the
cost function by changing it to something like this:

    Jreg (W) = J(W) + (λ/2) ||W||²

I Applying gradient descent to this new cost function, we obtain:

    Wi = Wi − η ∂J/∂Wi − ηλWi
I The new term ηλWi coming from the regularization causes
the weight to decay in proportion to its size.
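A minimal numpy sketch of this regularized update (my own; grad_J is an assumed
placeholder for the gradient of the unregularized cost, and the values of η and λ
are arbitrary):

import numpy as np

def weight_decay_step(W, grad_J, eta=0.1, lam=1e-3):
    # Gradient descent on J(W) + (lambda/2)*||W||^2:
    # the extra term -eta*lam*W shrinks every weight towards zero.
    return W - eta * grad_J(W) - eta * lam * W

# Toy example with a quadratic cost J(W) = 0.5*||W - target||^2
target = np.array([1.0, -2.0, 0.0])
W = np.zeros(3)
for _ in range(200):
    W = weight_decay_step(W, grad_J=lambda W: W - target)
print(W)   # close to `target`, but shrunk slightly towards zero by the decay term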
Network committee (towards generalization) I

I A network committee is a set of different neural network


architectures that work together to generate an estimate of
the underlying function f (X ).
I Each network is assumed to have been trained on the same
data distribution although not necessarily the same dataset.
I Consider a set of network architectures, each trained on the
same data distribution, whose outputs in response to the
presentation of a specific input Xk are simply averaged.
I Intuitively we can expect that since each network has a
different noise component in prediction, an averaging out of
noise components might actually reduce the overall noise in
prediction.
Network committee (towards generalization) II

I In fact, at a minimal (additional) computational cost the


performance can actually improve when using a committee of
networks, and in any case the performance cannot get worse.
[can you mathematically show this?]
I Consider a set of single-output neural networks {Ni },
i = 1, . . . , N, that have been trained on some noisy dataset
generated from an underlying deterministic function f (X ).
Each network realizes a function fi (X ) that essentially adds a
noise component εi (X ) to the function f (X ):

    fi (X ) = f (X ) + εi (X )    (38)
Network committee (towards generalization) III

I For a network Ni , we construct the sum-squared error
function using the expectation operator:

    Ei = E[ (fi (X ) − f (X ))² ]    (39)

Putting Eq. (38) in Eq. (39),

    Ei = E[ εi²(X ) ]    (40)

Considering a continuous distribution,

    Ei = ∫ εi²(X ) p(X ) dX    (41)
Network committee (towards generalization) IV
I The average error Eav obtained by averaging the errors of the
networks, each acting in isolation, is given by

    Eav = (1/N) Σ_{i=1}^{N} Ei                  (42)
or, Eav = (1/N) Σ_{i=1}^{N} E[ εi²(X ) ]        (43)

I However, if we were to form a committee that generates its
output by averaging the outputs of the individual networks Ni ,
we would have the committee network function

    fC (X ) = (1/N) Σ_{i=1}^{N} fi (X )    (44)
Network committee (towards generalization) V

I Now, the corresponding committee error is

    EC = E[ ( (1/N) Σ_{i=1}^{N} fi (X ) − f (X ) )² ]    (45)

or,

    EC = E[ ( (1/N) Σ_{i=1}^{N} εi (X ) )² ]             (46)
Network committee (towards generalization) VI
I If we assume that the noise components εi (X ) are
uncorrelated and have zero means, EC reduces to

    EC = (1/N²) E[ Σ_i εi² + Σ_i Σ_{j≠i} εi εj ]            (48)
       = (1/N²) ( Σ_i E[εi²] + Σ_i Σ_{j≠i} E[εi εj ] )      (49)
       = (1/N²) Σ_i E[εi²]                                   (50)
       = (1/N) Eav                                           (51)

I Here we have substituted E[εi εj ] = 0 for i ≠ j, due to the
non-correlation of the noise components.
Network committee (towards generalization) VII
I We see that, theoretically, the committee error is reduced by a
factor of N. In practice the reduction will be smaller, since the
noise components are in fact more correlated than uncorrelated.
I In any case, the committee error cannot be more than the
average individual error. To see this rather intuitive result, we
recall the Cauchy–Schwarz inequality:

    ( Σ_{i=1}^{N} εi )² ≤ N Σ_{i=1}^{N} εi²    (52)

I Therefore,
EC ≤ Eav (53)
I We should note that at a marginal increase in computational
cost we obtain some reduction in error due to the reduced
variance achieved by averaging over different networks.
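A small numerical sketch (my own, using synthetic zero-mean noise to stand in for
the individual networks' prediction errors) that mirrors this argument: averaging
the outputs of N independently noisy predictors reduces the mean-squared error by
roughly the factor N.

import numpy as np

rng = np.random.default_rng(0)
f_true = np.sin(np.linspace(0, 2 * np.pi, 500))      # underlying function f(X)

N = 10                                                # committee size
# Each "network" predicts f(X) plus its own zero-mean noise component eps_i(X)
predictions = f_true + rng.normal(0.0, 0.2, size=(N, f_true.size))

E_individual = ((predictions - f_true) ** 2).mean(axis=1)   # E_i for each network
E_av = E_individual.mean()                                   # average individual error
f_committee = predictions.mean(axis=0)                       # committee output f_C(X)
E_C = ((f_committee - f_true) ** 2).mean()                   # committee error

print(f"E_av = {E_av:.4f}, E_C = {E_C:.4f}  (roughly E_av / N = {E_av / N:.4f})")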
Network committee (towards generalization) VIII

I Another way of reducing the error further is to give more
weight to the outputs of those networks that have better
performance.

Figure 11: Committee Network
