∂E_pat/∂W_ij = (∂E_pat/∂Y_i)(∂Y_i/∂W_ij) = -(D_i - Y_i) ∂Y_i/∂W_ij
and in principle eq. 2.9 cannot be applied. So, in effect Widrow and Hoff proposed to
use, during training, a linear activation function or Y = W X + bias. Such a modification
makes learning quicker because it changes the weights even when the output
classification is almost correct, in contrast to the perceptron rule that changes the
weights only when there is a gross classification error. Another important difference is
the use of bipolar inputs instead of binary inputs. Using binary inputs, when the input
is 0, the weights associated with such an input do not change. Using bipolar inputs, the
weights change even when the inputs are inactivated (-1 in this case).
The training procedure of the single-layer perceptron with the delta rule can be
summarized as:
1) Initialize the matrix W and the bias vector with small random numbers.
2) Select an input/desired output (X,D) vector from the training set.
3) Calculate the network output as: Y = W X + bias
4) Change the weight matrix and the bias vector using:

ΔW_ij = η (D_i - Y_i) X_j    (2.10)

Δbias_i = η (D_i - Y_i)    (2.11)

where η is the learning rate.
5) Repeat steps 2-4 until the output error vector D - Y is sufficiently
small for all input vectors in the training set.
After training, the output of the network Y for any input vector X is calculated in two
steps:
1) Calculate the net input to each output unit: net = W X + bias
2) The network output is given by:

Y_i = +1 if net_i > 0, -1 otherwise    (2.12)

This is known in the ANN literature as the recall phase.
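The training and recall steps above can be sketched as follows (a minimal illustration, not from the original text; the learning rate, the number of epochs, and the bipolar AND task are our own choices, with a fixed epoch count standing in for the stopping test of step 5):

```python
import numpy as np

def train_adaline(X, D, eta=0.05, epochs=100):
    """Delta-rule training (eqs. 2.10-2.11); Y = W X + bias during training."""
    rng = np.random.default_rng(0)
    W = rng.uniform(-0.1, 0.1, (D.shape[1], X.shape[1]))  # step 1: small random W
    bias = rng.uniform(-0.1, 0.1, D.shape[1])             #         and bias
    for _ in range(epochs):
        for x, d in zip(X, D):             # step 2: pick a training pair (X, D)
            y = W @ x + bias               # step 3: linear output during training
            W += eta * np.outer(d - y, x)  # step 4: dW_ij   = eta (D_i - Y_i) X_j
            bias += eta * (d - y)          #         dbias_i = eta (D_i - Y_i)
    return W, bias                         # step 5: the epochs loop above

def recall(W, bias, x):
    """Recall phase (eq. 2.12): threshold the net input to +/-1."""
    return np.where(W @ x + bias > 0, 1, -1)

# Bipolar AND problem: inputs and desired outputs in {-1, +1}
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
D = np.array([[-1], [-1], [-1], [1]])
W, bias = train_adaline(X, D)
print([int(recall(W, bias, x)[0]) for x in X])  # → [-1, -1, -1, 1]
```

Note that the weights keep moving even when every pattern is already classified correctly, since the rule minimizes the error of the linear output, not of the thresholded one.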
The above training procedure directly minimizes the average of the difference
between the desired output and the net input for each output unit (what Widrow and
Hoff in [WiHo60] call measured error). However, it is possible to show that, by doing
this, we are also minimizing the average of the output error (what Widrow and Hoff call
neuron error). Since the introduction of the ADALINE/MADALINE model, Widrow and
Hoff were well aware that it could only be used to solve linearly separable problems.
In relation to the network capacity, Widrow and Lehr [WiLe90] and Nilsson
[Nil65] show that, on average, an ADALINE with p inputs can store up to 2p random
patterns with random binary desired responses. The value 2p is the upper limit, reached
when p → ∞.
By comparing eq. 2.9 with eq. 2.5, we can see that the Perceptron learning rule
and the Delta rule are in principle identical, with the only major difference being the
omission of the threshold function during training in the case of the Delta rule.
However, they are based on different principles: the Perceptron rule is based upon the
placement of a hyperplane and the Delta rule is based upon the minimization of the
mean-squared-error between the desired and computed outputs.
It is also interesting to see that if we train the Linear Associator Y = W X using
the Delta rule instead of the Hebbian rule, the input vectors do not need to be
orthogonal to each other; they only need to be linearly independent. However, for p
input units, the Linear Associator is still limited to store up to p linear associations,
since it is not possible to have more than p independent vectors in an input space with
dimension p [Per92]. In particular if: a) the learning rate is small enough; b) all
training pairs are presented with the same probability; c) there are p input training
patterns and p network inputs; and d) the input training patterns form a linearly
independent set; then the weight matrix converges to the optimal solution W* where:

W* = [D_1 D_2 ... D_p] [X_1 X_2 ... X_p]^(-1)    (2.13)

For convergence results, which also apply to the ADALINE/Perceptron ANN, see
[Sim90], [Luo91], [WiSt85] and [ShRo90].
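As a quick numerical illustration of eq. 2.13 (the two training pairs below are invented for the example; the input vectors are linearly independent but deliberately not orthogonal):

```python
import numpy as np

# p = 2: two linearly independent (not orthogonal) input vectors as columns
X = np.array([[1.0, 1.0],
              [0.0, 1.0]])
D = np.array([[1.0, -1.0],   # desired outputs, one column per pattern
              [0.0,  2.0]])

W_star = D @ np.linalg.inv(X)  # eq. 2.13: W* = [D_1 ... D_p][X_1 ... X_p]^(-1)

# W* maps every stored input exactly onto its desired output
print(np.allclose(W_star @ X, D))  # → True
```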
Widrow applied the LMS algorithm and its variants to train the Linear Associator
(what he called an Adaptive Linear Combiner or a non-recursive adaptive linear filter,
i.e. an ADALINE with a linear output function) to a large range of signal processing
problems. For examples of applications see Widrow and Stearns [WiSt85].
In the early 60s Widrow also proposed a heuristic algorithm to adapt the
weights of a multi-layer ANN. The first layer was composed of ADALINEs and the
output layer had a single fixed logic unit, for instance an OR, AND or majority-vote
taker. Only the weights arriving at the ADALINEs were adapted. The learning rule,
called MRI for Madaline Rule I, uses the minimal disturbance principle, i.e. no more
ADALINEs are adapted than necessary to correct the output decision, therefore causing
the minimal disturbance to the responses already learned.
In 1987 Widrow and Winter developed the MRII, Madaline Rule II, an extension
of MRI to allow the use of more than one logic unit at the output layer. However, up
to now neither MRI nor MRII has been used much in the ANN literature. In 1988
David Andes modified MRII into MRIII by replacing the threshold logic function used
in the ADALINE by sigmoid functions. However, Widrow and his students later realised
that MRIII is mathematically equivalent to the Back-Propagation algorithm to be
presented in the next section. For more details on the MRI and MRII rules see [WiLe90]
and [Sim90].
2.4.4 - The Multi-Layer Perceptron and the role of hidden units
Figure 2.9 - The minimum configuration for a Multi-Layer Perceptron (MLP)

Figure 2.9 shows the minimum configuration for a Multi-Layer Perceptron. At
least one layer of hidden units with nonlinear activation functions is needed. An ANN
with hidden layers of linear units can be represented by an equivalent ANN without
hidden layers. The output units can have linear or nonlinear activation functions. It is
also possible to have direct connections from the input to the output units. In general,
if we draw the ANN with the input layer at the bottom and the output layer at the top
of the diagram (as in fig. 2.9), a layer of units can send connections to any layer that
is above it, since we assume that the MLP is by definition a feedforward ANN model.
The use of hidden units makes it possible to reencode the input patterns, thereby
creating a different representation. Each hidden layer reencodes its input. Some authors
refer to the hidden units creating internal representations or extracting the hidden
features from the data. Depending on the number of hidden units, the new representation
can correspond to vectors that are then linearly separable. If there are too few units in
a hidden layer to make possible the necessary reencoding, perhaps another layer of
hidden units is necessary. Because of this, the designer has to decide, for instance,
between using a) only one hidden layer with several units; or b) two hidden layers with
fewer units in each hidden layer. Normally no more than two hidden layers of units are
used, firstly because the representation power added by up to 2 hidden layers is likely
to be enough to solve the problem and secondly because for most of the algorithms used
nowadays the simulation results indicate that the training time increases rapidly with the
number of hidden layers.
The power of an algorithm that can adapt the weights of a MLP originates from
the fact that such an algorithm can find such a reencoding automatically by using the given
set of examples of the desired input-output mapping. It is possible to see such internal
reencoding, or internal representation, as a set of rules (or micro-rules as some authors
prefer to refer to them). So, using an analogy with expert systems, such an algorithm
would "extract" the rules or features from the set of examples, what is referred to by
some authors as the property of performing feature extraction from the data set.
Figure 2.10 - The first possible solution for the XOR problem
Figure 2.11 - The second possible solution for the XOR problem

Figures 2.10, 2.11 and 2.12 illustrate three different solutions for the XOR
problem using TLUs in the hidden and output layers. Note that for the ANNs illustrated
in fig. 2.10, 2.11 and 2.12, the output unit can also be linear with a zero bias, i.e.
respectively y = x_3 + x_4, y = x_4 - x_3 and y = x_1 + x_2 - 2x_3. In figures 2.10
and 2.11 the two hidden units reencode the input variables x_1 and x_2 as the variables
x_3 and x_4. The four input patterns are mapped to three points in the x_3-x_4 space.
These three points are then linearly separable as illustrated. Observe that the solution
illustrated in figure 2.11 is a combination of the AND and OR functions.
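The AND/OR reencoding of figure 2.11 can be sketched as follows (a minimal illustration; the weight and bias values are our own choices, not read off the figure):

```python
def tlu(net):
    # threshold logic unit: fires 1 when the net input is positive
    return 1 if net > 0 else 0

def xor(x1, x2):
    # Hidden layer reencodes (x1, x2) as (x3, x4) = (AND, OR);
    # the three incompatible patterns land on three linearly separable points.
    x3 = tlu(x1 + x2 - 1.5)   # AND unit
    x4 = tlu(x1 + x2 - 0.5)   # OR unit
    # Linear output unit with zero bias: y = x4 - x3
    return x4 - x3

print([xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # → [0, 1, 1, 0]
```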
Figure 2.12 illustrates that if connections from the input to the output units are
used, the XOR problem can be solved using only one hidden unit which implements the
AND function. Then if we consider the expanded input space x_1-x_2-x_3, the 4 patterns
are now linearly separable since it is possible to find a plane that separates the points
which should produce a "0" output from the points that should produce a "1" output. If
the output unit is kept as the TLU, the decision surface in the space x_1-x_2 changes from
a line to an ellipse (see figure 2.8) [WiLe90].
Since the AND function can be defined for binary variables as the product of the
variables, from figure 2.12 we can see that if we have as input to the network the value
of the variable x_1*x_2, one layer of units would be enough to solve the problem and there
would be no need of hidden units.

Figure 2.12 - The third possible solution for the XOR problem

Generalizing this idea, when the unit itself uses products of its input variables, it is
called a higher-order unit and the network a higher-order ANN. In general, higher-order
units implement the function [GiMa87]:
y_i = F( bias_i + Σ_j w^(1)_ij x_j + Σ_jk w^(2)_ijk x_j x_k + Σ_jkl w^(3)_ijkl x_j x_k x_l + ... )    (2.14)

From this definition, the perceptron is a first-order ANN since it uses only the first input
term of the above equation. Widrow [WiLe90] refers to such units as units with
Polynomial Discriminant Functions. The problem with higher-order ANNs is the very
rapid increase in the number of weights with the number of inputs, as was noted early
on by Minsky and Papert [MiPa69]. However, recently such networks have successfully
been used for classification of images irrespective of their translation, rotation and
scaling ([RSO89], [SpRe92]), where the weight number explosion is kept under control
by grouping the weights. For some problems, as was the case for the XOR, one layer
of higher-order units may be enough since they use more complex decision surfaces than
the MLP's hyperplanes. A MLP can only implement more complex decision surfaces
by a combination of such hyperplanes.
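As an illustration of this point, a single second-order unit (eq. 2.14 truncated after the w^(2) term) solves the XOR problem on its own; the weight values below are hand-picked for the example, not taken from any of the cited papers:

```python
def tlu(net):
    return 1 if net > 0 else 0

def xor_higher_order(x1, x2):
    # One second-order unit: net = bias + w1*x1 + w2*x2 + w12*(x1*x2).
    # The product term bends the decision surface, so no hidden layer is needed.
    return tlu(-0.5 + x1 + x2 - 2 * (x1 * x2))

print([xor_higher_order(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # → [0, 1, 1, 0]
```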
Finally, on the subject of units that use products of inputs, Durbin and Rumelhart
proposed to use what they called product units [DuRu89]. Instead of calculating a
weighted sum, each product unit calculates a weighted product, where each unit is raised
to a power determined by a variable weight. Therefore such a unit can learn an arbitrary
polynomial term. They argue that such units are biologically plausible and correspond to
processing done locally at synapses.
2.4.5 - The Back-Propagation Algorithm
We have seen that the advantage of using hidden units is that the ANN can then
implement more complex decision surfaces, i.e the representation power is greatly
increased. The disadvantage of using hidden units is that learning becomes much harder
since the learning procedure has to decide which features it should extract from the
training data. Basically the dimension of the solution space is also greatly increased
since we need to determine a larger number of weights.
The Back-Propagation algorithm (BP) has been independently derived by several
people working in different fields. Werbos [Wer74] discovered the BP algorithm while
working on his doctoral thesis in statistics and called it the dynamical feedback
algorithm. Parker ([Par82], [Par85]) rediscovered the BP algorithm in 1982 and called
it the learning logic algorithm. Finally, in 1986, Rumelhart, Hinton and Williams
[RHW86] rediscovered the algorithm and the technique became widely known. The BP
algorithm is today the most popular supervised learning rule to train feedforward multi-
layered ANNs and it is responsible, with Hopfield networks (presented in the next
chapter), for the return of a general interest in ANNs.
The BP algorithm uses the same principle as the Delta Rule, i.e. minimize the
sum of the squares of the output error, averaged over the training set, using a gradient-
descent search. For this reason, the BP algorithm is also called the Generalized Delta
Rule. The crucial modification was to use smooth continuous activation functions in all
units instead of using TLUs. This allows the application of a gradient-descent search
even through the hidden units. The standard activation function for the hidden units are
the so called squashing or S-shaped functions, such as the sigmoid,
sig(x) = [1 + exp(-x)]^(-1), and the hyperbolic tangent, tanh(x) = 2*sig(2x) - 1. Sometimes
the general class of squashing functions is also referred to as sigmoidal functions.
The sigmoid function increases monotonically from 0 to 1 while the hyperbolic
tangent increases from -1 to 1. Note that the sigmoid function can be seen as a smooth
approximation to the threshold function defined in eq. 2.1, while the hyperbolic tangent
can be seen as the approximation of a bipolar TLU with a -1/+1 output as used by
Widrow in the ADALINE. The function sig(x/T) tends to the threshold function when
T tends to 0; the parameter T is called the temperature and is sometimes used to change
the inclination of the sigmoid or hyperbolic tangent functions around their middle point.
In some applications, especially pattern classification where we need or want to limit
the range of the output units, squashing functions are also used in those units.
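These relations are easy to check numerically (a small illustrative script; the sample points and temperature values are arbitrary choices):

```python
import math

def sig(x):
    # sigmoid squashing function: monotonic from 0 to 1
    return 1.0 / (1.0 + math.exp(-x))

# The hyperbolic tangent is a rescaled sigmoid: tanh(x) = 2*sig(2x) - 1
print(all(abs(math.tanh(x) - (2 * sig(2 * x) - 1)) < 1e-12
          for x in [-3.0, -0.5, 0.0, 0.5, 3.0]))  # → True

# sig(x/T) approaches the 0/1 threshold function as the temperature T -> 0
for T in (1.0, 0.1, 0.01):
    print(round(sig(1.0 / T), 6))  # climbs toward 1 for a net input of x = 1
```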
The difficulty in training a MLP is that there is no pre-defined error for the
hidden units. Since the BP algorithm is a supervised rule, we have the target for the
output units but not for the hidden units. As in the case of the Delta rule we want to
change the weights in the direction that decreases the output error.
Without loss of generality, let a feedforward ANN be numbered from input to
output such that unit 1 is the first input unit and unit N is the last output unit. Assuming
that the ANN has p input units, H hidden units distributed over one or more hidden
layers, and q output units, making a total of N units (p + H + q = N), then:
E_pat = (1/2) Σ_{r=p+H+1}^{N} (D_r - out_r)^2    (2.15)

As in the case of the Delta rule, we apply the chain rule from variational calculus:

ΔW_ij = -η ∂E_pat/∂W_ij = -η (∂E_pat/∂out_i) (dout_i/dnet_i) (∂net_i/∂W_ij)    (2.16)

where dout_i/dnet_i is the derivative of the activation function of unit i with respect to its
argument net_i, and ∂net_i/∂W_ij = out_j. However, to calculate the term ∂E_pat/∂out_i we need
to consider whether the unit i is an output unit (p+H+1 ≤ i ≤ N) or a hidden unit
(p+1 ≤ i ≤ p+H). If the unit i is an output unit then, as in the Delta rule, we have:
∂E_pat/∂out_i = -(D_i - out_i)    (2.17)

If the unit i is a hidden unit, then:

∂E_pat/∂out_i = Σ_{L=i+1}^{N} (∂E_pat/∂net_L)(∂net_L/∂out_i)    (2.18)

but ∂net_L/∂out_i = W_Li. If we define -∂E_pat/∂net_L = δ_L, then:

∂E_pat/∂out_i = -Σ_{L=i+1}^{N} δ_L W_Li    (2.19)

Equations 2.18 and 2.19 simply state that the effect of the output of a hidden unit on
the output error is defined as the summation of the effect of the units that receive
connections from the hidden unit multiplied by the value of each connection. In other
words, the output error is "back-propagated" from the output layer to the hidden layers
through the weights and through the nonlinear activation functions. Observe that, in
relation to the Delta rule, the only new equation is really eq. 2.18 since the new problem
created by the hidden units is to find how a change in a weight received by a hidden
unit affects the output error.
Summarizing, we have:

∂E_pat/∂W_ij = -δ_i out_j    (2.20)

where for output units (p+H+1 ≤ i ≤ N):

δ_i = (D_i - out_i) dout_i/dnet_i    (2.21)

and for hidden units (p+1 ≤ i ≤ p+H):

δ_i = [Σ_{L=i+1}^{N} δ_L W_Li] dout_i/dnet_i    (2.22)

As usual, the above equations are also applied to adjust the bias by simply considering
them as additional weights that come from units with a constant unit output, i.e. in eq.
2.20, out_j = 1.
Observe that in the above derivation of the BP algorithm, only the following
constraints are included in relation to the network: 1) the network is a feedforward
ANN; 2) all units have differentiable activation functions f(net_i); and 3) the combination
function is defined in a vectorial notation as net = W out + bias. Some possible cases
are: the use of different activation functions in the hidden layer; use of several hidden
layers; and feedforward networks that are not strictly layered.
Another reason for using the sigmoid function or the hyperbolic tangent in a
multi-layered ANN is that their derivatives can be calculated simply from their output
value (dsig(x)/dx = sig(x)[1 - sig(x)]; dtanh(x)/dx = [1 + tanh(x)][1 - tanh(x)]), without the
need of more complex calculations. This is very useful since it reduces the overall
number of calculations needed to train the network.
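The backward pass of eqs. 2.20-2.22 can be sketched for a single hidden layer as follows (a minimal illustration; the network sizes, the random seed, and the finite-difference check at the end are our own choices):

```python
import numpy as np

def sig(x):
    return 1.0 / (1.0 + np.exp(-x))

def bp_deltas(W1, b1, W2, b2, x, d):
    """One pattern, one sigmoid hidden layer, sigmoid output units.
    Returns the deltas of eqs. 2.21-2.22; eq. 2.20 then gives
    dE_pat/dW_ij = -delta_i * out_j."""
    h = sig(W1 @ x + b1)                        # hidden outputs
    out = sig(W2 @ h + b2)                      # network outputs
    # output units (eq. 2.21): delta_i = (D_i - out_i) * dout_i/dnet_i
    delta_out = (d - out) * out * (1 - out)
    # hidden units (eq. 2.22): delta_i = (sum_L delta_L W_Li) * dout_i/dnet_i
    delta_hid = (W2.T @ delta_out) * h * (1 - h)
    return delta_out, delta_hid, h

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)
x, d = np.array([0.3, -0.7]), np.array([1.0])

delta_out, delta_hid, h = bp_deltas(W1, b1, W2, b2, x, d)
grad_W1 = -np.outer(delta_hid, x)               # eq. 2.20 for the first layer

# sanity check: compare one entry against a numerical difference of
# E_pat = 1/2 * sum (d - out)^2  (eq. 2.15)
def E(W1_):
    out = sig(W2 @ sig(W1_ @ x + b1) + b2)
    return 0.5 * np.sum((d - out) ** 2)

eps = 1e-6
W1p = W1.copy(); W1p[0, 1] += eps
print(abs((E(W1p) - E(W1)) / eps - grad_W1[0, 1]) < 1e-5)  # → True
```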
2.4.6 - Using the Back-Propagation Algorithm
In relation to the initialization of the weights and biases, Rumelhart et al.
[RHW86] suggested using small random values. Concerning the learning rate η, they
point out that, although larger learning rates will result in more rapid learning, they can
also lead to oscillation. They suggested that one way to use larger learning rates without
leading to oscillations is to modify eq. 2.16 by adding a momentum term:
ΔW_ij(k+1) = η δ_i out_j + α ΔW_ij(k)    (2.23)

where the index k indicates the presentation number and α is a small positive constant
selected by the user. A larger α increases the influence of the last weight change on the
current weight change. Such a modification in effect filters out the high-frequency
oscillations in the weight changes since it tends to cancel weight changes in opposite
directions and reinforces the predominant direction of change. This can be useful when
the error surface contains long ravines with a sharp curvature across the ravine and a
floor with a small inclination. For more details about the use of the momentum term see
[Zur92].
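The filtering effect of the momentum term in eq. 2.23 can be illustrated directly (η, α and the gradient sequences below are made-up values; the step is written in terms of a generic gradient g, whose gradient-descent part corresponds to η δ_i out_j):

```python
eta, alpha = 0.1, 0.9   # learning rate and momentum constant (made-up values)

def momentum_step(g, prev_dw):
    # eq. 2.23 with a generic gradient g: the plain gradient-descent step
    # (-eta * g) plus alpha times the previous weight change.
    return -eta * g + alpha * prev_dw

# an oscillating gradient (+1, -1, +1, ...) is largely cancelled out
dw = 0.0
for g in [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]:
    dw = momentum_step(g, dw)
print(abs(dw) < eta)   # → True: the high-frequency component is damped

# a constant gradient (the floor of a ravine) is reinforced toward the
# asymptotic step -eta*g/(1 - alpha), a tenfold amplification here
dw = 0.0
for _ in range(200):
    dw = momentum_step(1.0, dw)
print(round(dw, 4))    # → -1.0, i.e. -eta/(1 - alpha)
```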
In the case of the Delta rule, when applied to networks without hidden layers and
with output units with linear activation functions, the error surface will always have a
bowl shape and the local minima points are also global minima. If the learning rate is
small enough, the Delta rule will converge to one of these minima. In the case of the
MLP, the error surface can be much more complex with many local minima. Since the
BP is, as the Delta rule, a gradient-descent procedure, there is the possibility for the
algorithm to get trapped in one of these local minima and therefore not converge to the
best possible solution, the global minimum ([WiLe90], [Zur92], [McRu88]).
Whenever we have a pre-determined set of training data with a fixed number of
patterns, we can define an epoch as a single presentation of all training patterns to the
network. We will normally adopt a random-order presentation of the training patterns
during an epoch and adjust the weights after the presentation of each single pattern.
This is called random incremental updating, as opposed to sequential cumulative
updating, in which the patterns are presented to the network in a constant order, the
weight changes are summed and the weights are only updated at the end of the epoch.
Simulation results indicate that random incremental updating tends to work better than
sequential cumulative updating, since it injects some "noise" into the search procedure
[Zur92] and therefore helps the network to settle to a better local minimum.
It is interesting to know that, as Widrow points out [WiLe90], the idea of error
backpropagation through nonlinear systems has been used for centuries in the field of
variational calculus and has also been widely used since the 60s in the field of optimal
control. Le Cun [LeC89] and Simpson [Sim90] point out that Bryson and Ho [BrHo69]
developed an algorithm very similar to the BP algorithm for nonlinear adaptive control.
Le Cun [LeC89] also shows how, using a Lagrangian formalism, the BP algorithm can
be derived as a solution to an optimization problem with nonlinear constraints and that
from such interpretation some extensions can easily be derived.
Although the BP algorithm was proposed for feedforward ANNs, Almeida
[Alm89] has extended it to feedback networks by using a linearization technique, where
he assumes that each input pattern is presented to the network long enough for it to
reach a stable equilibrium. Only then are the outputs compared to the desired ones. Also
he assumes that the desired outputs depend only on the present inputs, not on the past
ones. Rumelhart et al. [RHW86] also considered applying the BP algorithm to feedback
networks but they used different assumptions. They simply expand the feedback network
as a feedforward network with several layers. This is possible because, as Minsky and
Papert [MiPa69] point out, for every feedback network, there is a feedforward network
with identical behaviour over a finite period of time. The BP algorithm is then applied
on this equivalent feedforward network and the weights are averaged after each change
to avoid violating the constraint that certain weights should be equal.
Another multi-layered learning algorithm that was presented before the
popularization of the BP algorithm in 1986 was the Boltzmann Machine (BM),
introduced in 1984 by Hinton, Ackley and Sejnowski ([HAS84], [HiSe86]). It uses a
much more complicated procedure than the BP algorithm in which the activations of the
hidden units are probabilistically adjusted using gradually decreasing amounts of noise
to escape local minima in favour of the global minimum. The idea of using noise to
escape local minima is called simulated annealing [KGV83]. The combination of
simulated annealing with the probabilistic adjustment of the hidden layers is called
stochastic learning [Sim90]. The main disadvantage of the Boltzmann Machine is its
excessively long training time. Later on, in 1986, Szu introduced a modified version of
the Boltzmann Machine called the Cauchy Machine (CM) that uses a fast simulated
annealing procedure [Szu86]. Although faster than the Boltzmann Machine, the Cauchy
Machine still suffers from very long training times [Sim90].
2.5 - Representation, Learning and Generalization
The first problem to be solved when applying feedforward ANNs trained using
supervised learning is the training data selection problem, i.e. to select a data set to be
used when training the ANN. Such training data set must contain the underlying
relationship that the ANN should acquire. Since in most cases this underlying
relationship is unknown this may not be a trivial problem.
Once a training data set has been selected, the subsequent problems, in the
sequence that they have to be solved, can be classified in three main areas:
representation, learning and generalization.
The representation problem is how to design the ANN structure such that there
is at least one solution (set of network weights) that learns the training set. The learning
problem is how to find one of these possible sets of weights, i.e. training the ANN. This
is also referred to by some authors as the loading problem, based on the concept that
we are "loading" the training data set onto the ANN [Jud90]. Once training is finished,
the generalization problem is concerned with the network response when presented with
data that was not in the training set. A measure of generalization is normally obtained
by verifying the network performance using a test data set.
2.5.1 - The Representation Problem
The representation problem concerns: a) how many hidden layers we use; b) how
many units in each hidden layer; and c) which functions we use for the hidden units.
Normally the particular application in hand will specify how many input and output
units the ANN should have.
Particularly in classification problems (to determine the class to which the input
pattern belongs) the designer has some freedom to decide how to code the output, e.g.
using binary coding or 1-of-N coding. Sometimes, the designer may even decide to
preprocess the input data. Here we will assume that the designer has already decided the
input and output representation.
Once the designer has decided the network input and output representation, to
solve the particular problem in hand it is still necessary to look for the network internal
representation. The representation problem is then to choose the ANN structure such that
an internal representation exists, i.e. that there is at least one set of parameters (weights)
that can reproduce the training data set with a small error. At this moment there is very
little theory to help in this task.
Hornik et al. [HSW89] established that a feedforward ANN with as few as one
hidden layer using arbitrary squashing activation functions (such as sigmoids) and no
squashing functions at the output layer is capable of approximating virtually any
function of interest from one finite multi-dimensional space to another to any degree of
accuracy, provided sufficiently many hidden units are available. Later Stinchcombe and
White [StWh89] extended this result and showed that even if the activation function
used in the hidden layer is a rather general nonlinear function, the same type of FF
ANN is still a universal approximator. More or less at the same time, Funahashi
[Fun89], Cybenko [Cyb89], Kreinovich [Kro91] and Ito [Ito91] proved similar results.
White [Whi92] edited a book with a collection of his papers on this subject of ANNs
and approximation and learning theory.
From a theoretical point of view such results are important but they are existence
proofs, i.e. they prove that there is a FF ANN with just one hidden layer using
squashing or non-squashing functions in the hidden layer that solves the input-output
mapping problem. However, it is not possible to deduce from these proofs the ANN
topology (number of hidden layers and number of units in each hidden layer) or, once
the network topology is chosen, how to determine the network free parameters (the
weights).
Another important point not clarified by the proofs mentioned above is, given a
specific criterion such as minimum number of hidden units, which function is more
suitable to be used as the activation function for the hidden units. In general, these
functions belong to two classes: local or global (also called nonlocal) functions.
Units that use local functions have a constant output (normally zero) outside a
closed region of the unit input space and a different set of values within the closed
region. Units that use functions that can not be characterized as local function are said
to use global functions.
The classical example of a FF ANN that uses local functions in the hidden layer
is the so called gaussian Radial Basis Function (RBF) network, where
out_i = exp(-net_i^2) and net_i = ||x - C_i||, where C_i is a vector which determines the
position of the centre of the unit. In this case the regions where the unit output is above
or below a certain value are respectively closed and open regions and the decision
surfaces are in general ellipsoids.
When using the usual combining function net_i = W_i x + bias_i, where x is the unit
input vector, the squashing and step functions are examples of global functions. In this
case there is a hyperplane that divides the unit input space into two regions where the
unit output has a high constant value in one region and a low constant value in the other
region. In this case, if we consider the input space to be unbounded, the regions where
the unit output is above or below a certain value are open regions and the decision
surfaces are hyperplanes. Note that the use of higher order units (see eq. 2.14) with a
squashing or step function makes it possible for the unit to implement global or local
functions by varying the unit weight values.
Park and Sandberg ([PaSa91],[PaSa93]) proved that RBF networks with just one
hidden layer and linear output units are also universal approximators.
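A single gaussian RBF unit of the kind just described can be written as (a minimal sketch; the centre, width and test points are arbitrary choices):

```python
import numpy as np

def rbf_unit(x, C, width=1.0):
    # Gaussian radial basis unit: net_i = ||x - C_i|| / width, out_i = exp(-net_i^2).
    # The output is ~1 near the centre C and falls to ~0 away from it,
    # i.e. the unit implements a local function.
    net = np.linalg.norm(x - C) / width
    return np.exp(-net ** 2)

C = np.array([1.0, 2.0])
print(rbf_unit(np.array([1.0, 2.0]), C))           # → 1.0 at the centre
print(rbf_unit(np.array([10.0, 10.0]), C) < 1e-6)  # → True far from the centre
```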
2.5.2 - The Learning Problem
Once we have decided the network topology and the type of units to be used, the
next step is to determine the network free parameters, i.e. the network weights. The
range of applicable algorithms depends on the particular functions used in the hidden
units. Typically the Back-Propagation algorithm is used for FF ANNs with squashing
functions.
The BP algorithm can also be used for RBF networks but Moody and Darken
[MoDa89] have proposed a hybrid algorithm with two stages. In the first stage the
hidden units centres and the widths of the gaussian functions used by the hidden units
are determined in an unsupervised manner, i.e. by using only the input data and not the
correspondent desired outputs. The centres are determined by using a k-means clustering
algorithm and the widths by nearest-neighbour heuristics. In the second stage just the
output weights, i.e. the weights between the hidden and output units, which correspond
to the amplitudes of the gaussians, are modified in order to minimize the standard
least-squares error using a supervised algorithm such as the delta rule. The authors found
that, in comparison with networks with sigmoid units trained by BP, the convergence
is very rapid, possibly because the first unsupervised stage has done most of the work
necessary for the correct classification. However, a possible drawback is the need of a
larger number of hidden units (and therefore network weights) to achieve the same
accuracy when approximating certain functions, in comparison with a network which
uses squashing functions.
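The two-stage procedure can be sketched as follows (a simplified illustration: plain k-means for the centres, widths fixed at 1.0 instead of the nearest-neighbour heuristics, and a direct least-squares solve in place of an iterative delta rule; the data and sizes are made up):

```python
import numpy as np

def train_rbf_two_stage(X, D, n_hidden=4, iters=20, seed=0):
    """Stage 1: unsupervised centres via k-means; stage 2: supervised
    output weights via linear least squares (widths fixed at 1.0)."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), n_hidden, replace=False)]  # initial centres
    for _ in range(iters):                              # k-means updates
        labels = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for k in range(n_hidden):
            if np.any(labels == k):
                C[k] = X[labels == k].mean(axis=0)
    # hidden activations of the gaussian units
    H = np.exp(-((X[:, None] - C[None]) ** 2).sum(-1))
    # stage 2: only the hidden-to-output weights are fitted
    W, *_ = np.linalg.lstsq(H, D, rcond=None)
    return C, W

# XOR-like data: four corners of the unit square, targets 0/1
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
D = np.array([0., 1., 1., 0.])
C, W = train_rbf_two_stage(X, D)
pred = np.exp(-((X[:, None] - C[None]) ** 2).sum(-1)) @ W
print(np.allclose(pred, D, atol=1e-6))  # → True
```

With four centres and four patterns the unsupervised stage places one gaussian on each pattern, so the supervised stage can fit the targets exactly; real problems would use fewer centres than patterns.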
The algorithms used to train a FF ANN can be classified into two main classes:
a) the algorithms that try to converge to the global minimum solution, and b) the
algorithms that try to converge rapidly. Unfortunately, it seems that the two classes do
not overlap. Consequently the algorithms that try to converge rapidly can still be trapped
in local minima (as BP does) while the algorithms that try to converge to the global
minimum tend to converge very slowly when compared, for instance, with the BP
algorithm.
Examples of algorithms that look for the global minimum are the Boltzman
Machine, already mentioned in the previous section, and genetic algorithms
([MoDav89], [HKP91]). Another possible problem with the use of genetic algorithms to
train FF ANNs is the need for a large amount of processing power and memory.
Jacobs [Jac88] and Silva and Almeida [SiAl90] proposed to adapt the learning
rate (the step size) when executing the BP algorithm in order to speed the convergence.
This modification has the advantage that it does not increase significantly the
computational and memory requirements in relation to the standard BP algorithm.
The BP algorithm is a first-order algorithm since it uses only the first derivative
of the cost function to search for the minimum. Several researchers have proposed
second-order algorithms to perform such a search, for instance, Becker and le Cun
[BeCu88] and Kollias and Anastassiou [KoAn89]. Battiti [Bat92] published a review of
the application of first- and second-order methods for the training of FF ANN.
The main problems of using such second-order algorithms are: 1) a large increase
in the number of operations performed and in the memory requirements, especially for
large networks; and 2) not all implementations use local computations. Furthermore,
Saarinen et al. [SBC91] argue that many network training problems are ill-conditioned,
i.e. have ill-conditioned or indefinite Hessians, and therefore may not be solved more
efficiently by higher-order optimization algorithms.
A more recent approach has been suggested by Shah et al. [SPD92], in which optimal stochastic filtering techniques are used to train the ANN while keeping the computational and storage costs in check. Tepedelenlioglu et al. [TRSR91]
and Singhal and Wu [SiWu89] have proposed to use the Extended Kalman Filtering
algorithm to train FF ANNs.
There have also been a few approaches that try to reduce the network training
time and at the same time determine the number of units in the hidden layer, i.e. they
try to adapt the network topology. Normally such approaches start with a small ANN and add hidden units. Fahlman and Lebiere proposed the Cascade-Correlation Learning Architecture [FaLe90] and studied the two-spirals problem (the training points are arranged in two interlocking spirals).
Hirose et al. [HYH91] also suggest adapting the number of hidden units during training, with the aim of escaping local minima. Training is performed as standard by the BP algorithm, and they propose adding an extra hidden unit whenever the network seems to be trapped in a local minimum. Since the addition of such an extra hidden unit distorts the error surface, that point in weight space is no longer a local minimum. Later on, after satisfactory convergence is achieved, they propose a way of eliminating some of the hidden units.
2.5.3 - The Generalization Problem
Even if the training algorithm manages to find a satisfactory solution for the training patterns, the ANN still needs to produce "reasonable" outputs when presented with input patterns that were not used in the training set, i.e. the ANN needs to be able to "generalize" what it has learned.
Poggio and Girosi ([PoGi90a],[PoGi90b]) state that, from the point of view that
FF ANN are trying to learn an input-output mapping from a set of examples, such a
form of learning is closely related to classical approximation techniques, for instance,
generalized splines and regularization theory [TiAr77]. In this case learning can be seen as solving the problem of hypersurface reconstruction and is an ill-posed problem, since in general there are infinitely many solutions. A priori assumptions are then necessary to make the problem well-posed. Possibly the simplest assumption is that the input-output mapping is smooth, that is, small changes in the inputs cause small changes in the output.
Training a FF ANN can be seen as a generalized multi-dimensional version of finding the parameters of a polynomial that fits a set of points drawn from a uni-dimensional space. Too many degrees of freedom (too many weights in the ANN) can result in overfitting the training data and in poor performance on the test data set [HKP91]. Therefore the ideal situation would be to find the minimum number of hidden units that can produce the desired input-output mapping. This should result in the smoothest possible mapping. Since it is very difficult and time-consuming to determine the minimum number of hidden units, one approach that is frequently used is to train the network using a small training data set and periodically to test the network using a larger test data set. Training is then stopped when the error cost function measured over the test data achieves its minimum value. If we continue training the network after such a minimum is achieved, the error cost function measured over the training data will continue to decrease, but it will increase if measured over the test data.
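The train/test procedure just described is essentially early stopping. The sketch below is a hedged illustration (the function names and the `patience` parameter are assumptions, not from the text): training halts once the test-set error has stopped improving, and the best weights seen so far are returned.

```python
def train_with_early_stopping(weights, grad_fn, test_error_fn,
                              lr=0.1, patience=10, max_epochs=1000):
    """Stop training when the error measured over the test data has not
    improved for `patience` consecutive epochs; return the weights that
    achieved the lowest test error."""
    best_error = float("inf")
    best_weights = list(weights)
    since_best = 0
    for _ in range(max_epochs):
        grads = grad_fn(weights)
        weights = [w - lr * g for w, g in zip(weights, grads)]
        err = test_error_fn(weights)
        if err < best_error:
            best_error, best_weights, since_best = err, list(weights), 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_weights, best_error

# Toy usage: minimize (w - 2)^2, using the same function as "test error".
w, e = train_with_early_stopping([0.0],
                                 grad_fn=lambda ws: [2 * (ws[0] - 2)],
                                 test_error_fn=lambda ws: (ws[0] - 2) ** 2)
```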
Baum and Haussler [BaHa89] proved theoretical bounds relating the appropriate training sample size to the network size in terms of network generalization.
One possible approach that can be used to improve network generalization is to
somehow constrain, during training, the degrees of freedom available to the network
trying to obtain a near-optimal network topology. The ANN should then be large enough
to contain the desired knowledge (assumed to be contained in the examples of the
training data set) but small enough to generalize well. A simple approach is just to add
to the normal cost function (the mean-squared-output error) a penalty for network
complexity. One possibility is to use the weight decay idea, i.e. we add to the cost function the term Σ_i w_i² [HKP91]. The application of the BP algorithm to this new cost function results in a weight decay term which discourages very large weights. Another possibility is to use the extra cost function term Σ_ij [w_ij² / (K + w_ij²)]. For small weights this can be approximated by Σ_ij [w_ij² / K], while for large weights each term saturates at 1, so large weights incur only a bounded penalty. After training, the ANN can then be tested with the smallest-magnitude weights removed, which is known as "pruning" the network. When all the incoming weights of a hidden unit are removed, the hidden unit is effectively removed as well. Therefore the weight-elimination stage can also affect the network topology.
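Both penalty terms can be written down directly. The fragment below is an illustrative sketch; the coefficients `lam` and `K` are assumed values, not taken from the cited papers.

```python
def weight_decay_penalty(weights, lam=1e-3):
    """Plain weight-decay penalty lam * sum(w^2); its gradient 2*lam*w
    pushes every weight toward zero."""
    return lam * sum(w * w for w in weights)

def weight_elimination_penalty(weights, K=1.0, lam=1e-3):
    """Penalty lam * sum(w^2 / (K + w^2)): behaves like w^2/K for small
    weights but saturates near 1 for large ones, so small weights are
    driven toward zero while large weights are not punished indefinitely."""
    return lam * sum((w * w) / (K + w * w) for w in weights)
```

The saturating form is what makes pruning work: weights that matter can grow large at a bounded cost, while unimportant weights are pushed into the small-magnitude region where they can later be removed.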
Nowlan and Hinton ([NoHi92a],[NoHi92b]) propose an approach where the network degrees of freedom are constrained by encouraging clustering of the weight values. While the weight decay approach encourages clustering around the zero value, their approach is aimed at encouraging clustering around a finite set of arbitrary real values, which is sometimes called weight-sharing. Kendall and Hall propose the minimum description length (MDL) approach [KeHa93], aimed at minimizing the information content in the network weights. They claim that the MDL approach also encourages weight elimination and weight-sharing.
More recently Green, Nascimento and York [GNY93] proposed to add
competition within the hidden layer of a FF ANN in order to eliminate unnecessary
hidden units. The addition of the competition turns the network into a feedback ANN.
However, the BP algorithm is applied as normal since they proposed to ignore the
competition weights during the backward pass of the BP algorithm.
One drawback that all these approaches have in common is the need to select
some extra parameters during training.
2.6 - Limitations of Feedforward ANNs
The basic concepts of Artificial Neural Networks and the differences in relation
to traditional computation were introduced in this chapter. Also the more important
feedforward ANN models were presented and the role of hidden units was discussed.
The majority of feedforward ANN models currently in use are sigmoid based and
have the following limitations:
1) Current ANN models take a long time to train, there is no guarantee of convergence, and learning is inconsistent, i.e. the mean-squared error can remain high for many iterations and then suddenly decrease to a lower value. Therefore, without previous experience with a particular problem, it is very difficult to estimate how long training will take.
2) When an ANN produces an output that corresponds to a decision, for instance in a pattern classification problem, in general it is very difficult to trace how the network reached such a decision, that is, to get an "explanation" from the network. An ANN, by being trained on a training data set, extracts the knowledge from the set of examples and creates its own internal representation. To extract the knowledge coded into the network we need to understand this internal coding, a difficult task.
3) In general, an ANN does not give confidence intervals for its outputs.
However, Richard and Lippman [RiLi91] show that when an FF ANN is
trained to solve an M-class problem (one output unit corresponding to the
correct class, all other zero) using a mean-squared-error cost function
such as in the BP algorithm, the network outputs provide estimates of
Bayesian probabilities.
4) Without prior experience with the problem at hand, the network topology is determined by trial and error. Too small a network will make learning impossible and too large a network will generalize badly.
5) It is not possible, in the general case, to encode prior information in the network. If this were possible, training times could be reduced considerably.
While the models presented in this chapter were of the feedforward type, the next
chapter concerns feedback networks, their theory and applications.
Chapter 3 - Feedback Neural Networks:
the Hopfield and IAC Models
The main feedforward ANN models were presented in chapter 2. In this chapter
the principles behind the use of feedback ANNs are introduced, and two models, the Hopfield and IAC (Interactive Activation and Competition) neural networks, are presented and analyzed.
Because of the presence of the feedback connections, feedback ANNs are nonlinear dynamical systems which can exhibit very complex behaviour. They are used in two areas: 1) as associative memories, or 2) to solve some hard optimization problems. The basic idea in using a feedback ANN as an associative memory is to design the network such that the patterns that should be memorized correspond to stable equilibrium points. To use a feedback ANN to solve optimization problems, the network is designed so that it converges to stable equilibrium points that correspond to good (perhaps not necessarily optimal) solutions of the problem at hand.
In this chapter we show how the IAC network can be used to solve certain
optimization problems, much like the Hopfield network. As an example we show in detail how to implement a 2-bit analog-to-digital converter using the IAC network.
3.1 - Associative Memories
To work as an associative memory, a network has to solve the following
problem:
"Store M patterns S such that when presented with a new pattern Y, the
network returns the stored pattern S that is closest in some sense to Y".
Such an associative memory can work as a content-addressable memory, since we should be able to retrieve a stored pattern by using as input an incomplete or corrupted version of it (pattern completion and pattern recognition). Possible applications are in
hand-written digit and face recognition tasks and retrieval of information in general
databases.
For mathematical convenience we will assume that the components of the stored patterns S and the test patterns Y can only be +1 or −1, instead of the usual binary values 0 and 1.
Figure 3.1 shows the general model of a one-layer feedback ANN that can be
used as an associative memory. In this particular case each unit is a TLU (Threshold
Logic Unit) with a bipolar output. The output of each unit is calculated as:

    Y_i = sgn(net_i) = { +1 if net_i > 0 ; −1 if net_i < 0 }        (3.1)

where the net input net_i is calculated as:

    net_i = Σ_{j=1}^{N} W_ij Y_j + bias_i + ext_i        (3.2)

where N is the number of units in the network. The terms bias_i and ext_i represent respectively the fixed internal and variable external inputs. These terms could be grouped together, but in most models one or both of them are zero.
For simplicity, let's consider for the moment that the bias term bias_i and the external input ext_i are zero.
The network is operated as follows: 1) an input pattern is loaded into the network as the initial values for the network output Y; 2) the network output values are updated asynchronously and stochastically, i.e. at each time step a unit is selected randomly from among the N units with equal probability 1/N, independently of which units were updated previously, and eqs. 3.1 and 3.2 are used to update its output. We will show later that, under some conditions, after a sufficiently large number of time steps the network will converge to a stable equilibrium point (EP), called a "memory". The outputs of the units are then interpreted as the network's classification of the input pattern.
Three important issues in such applications are: 1) how the network weights should be adjusted such that the network is stable, that is, such that the network converges to an EP for any initial condition; 2) for a network with N units, how many patterns can be stored; and 3) under what conditions the network will converge to the closest stored pattern.
Note that: 1) the units are simultaneously input and output units; 2) since there are no hidden units, such a network cannot encode the patterns, or in other words, the network cannot change the pattern representation; and 3) the network always occupies the corners of the hypercube [−1, +1]^N.

Figure 3.1 - A one-layer feedback ANN
3.1.1 - Storing one pattern
Let's first consider the simple case where we want to store just one pattern. A pattern Y is a stable EP if:

    sgn( Σ_{j=1}^{N} W_ij Y_j ) = Y_i   for all i        (3.3)

since when eq. 3.1 is applied to update the unit output, no change will be produced. Representing by S the pattern that we want to store, this can be achieved by setting the network weights to:
    W_ij = k S_i S_j        (3.4)

where k > 0, since then:

    sgn( Σ_{j=1}^{N} k S_i S_j S_j ) = sgn( k N S_i ) = S_i        (3.5)

given that S_j S_j = 1. For later convenience, let k = 1/N. Then, in vectorial notation we have that:
    W = (1/N) S S^T        (3.6)

where S is a column vector and W is a symmetric matrix.
Note that even if almost half of the bits of the initial condition (the starting pattern) are wrong, the stored pattern will still be retrieved, since the correct bits, which are in the majority, will force the sign of the net input to be equal to S_i. This can be proved by combining eqs. 3.3 and 3.6:

    sgn( Σ_{j=1}^{N} W_ij Y_j ) = sgn( (S_i / N) Σ_{j=1}^{N} S_j Y_j ) = sgn( S_i (N_c − N_w) / N ) = S_i        (3.7)

where N_c and N_w are respectively the numbers of correct and wrong bits in the starting pattern Y in relation to the stored pattern S. Observe also that if the starting pattern has more than half of its bits different from the stored pattern (N_w > N_c), then the network will retrieve the inverse of the stored pattern, i.e. −S. Therefore there are two stable EPs, sometimes also called attractors. The set of patterns that converge to one of the EPs constitutes what is called the basin of attraction or region of convergence of that EP. For this particular case, the entire input space is symmetrically divided into the two basins of attraction.
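The single-pattern case can be checked numerically. The following pure-Python sketch implements eqs. 3.4/3.6 and the threshold update of eqs. 3.1 and 3.2 (with bias and external input set to zero); the 5-bit pattern is an illustrative choice.

```python
def store_one_pattern(S):
    """Weight matrix W_ij = S_i * S_j / N (eq. 3.6) storing one bipolar
    pattern S as a stable equilibrium point."""
    N = len(S)
    return [[S[i] * S[j] / N for j in range(N)] for i in range(N)]

def update_unit(W, Y, i):
    """Threshold update of unit i (eqs. 3.1 and 3.2, bias = ext = 0)."""
    net = sum(W[i][j] * Y[j] for j in range(len(Y)))
    return 1 if net > 0 else -1

S = [1, -1, 1, 1, -1]
W = store_one_pattern(S)
Y = [-1, -1, 1, 1, -1]          # starting pattern with one wrong bit
recovered = [update_unit(W, Y, i) for i in range(len(S))]
# With N_c = 4 correct bits against N_w = 1 wrong bit, eq. 3.7
# predicts that every unit update returns the stored value S_i
```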
3.1.2 - Storing several patterns
One simple way to store more than one pattern in the network is to generalize eq. 3.4 and try to superimpose the patterns by using:

    W_ij = (1/N) Σ_{pat=1}^{M} S_i^pat S_j^pat        (3.8)

or, in vectorial notation,

    W = (1/N) Σ_{pat=1}^{M} S^pat (S^pat)^T        (3.9)

where M is the total number of patterns that we want to store in the network and the weight matrix W is still symmetric.
Equations 3.8 and 3.9 are implementations of the Hebbian rule, already
introduced in chapter 2. A feedback network operating as an associative memory, using
the Hebbian rule to store all patterns and being updated asynchronously is usually called
a discrete-time Hopfield network, after J. J. Hopfield who emphasized the concept of
using the equilibrium points of nonlinear dynamical systems as stored memories [Hop82].
The patterns S will be stored as stable EPs, i.e. fixed attractors, if they satisfy the condition that:

    sgn( Σ_{j=1}^{N} W_ij S_j ) = S_i        (3.10)

By combining eqs. 3.8 and 3.10 we have that:

    sgn( (1/N) Σ_{j=1}^{N} Σ_{pat=1}^{M} S_i^pat S_j^pat S_j ) = S_i        (3.11)

Let's suppose that we want to test such a condition for the stored pattern S^1. The interior of the function sgn( ) can be separated into the terms pat = 1 and pat > 1:
    (1/N) Σ_{j=1}^{N} S_i^1 S_j^1 S_j^1 + (1/N) Σ_{j=1}^{N} Σ_{pat=2}^{M} S_i^pat S_j^pat S_j^1 = S_i^1 + c.t.        (3.12)

where c.t. stands for the crosstalk term, the second term of the left side of eq. 3.12. Therefore if the magnitude of the crosstalk term is less than 1, it will not change the sign of S_i^1, and the condition for stability of the pattern S^1 will be satisfied. The magnitude of the crosstalk term is a function of the type and number of patterns to be stored.
For many cases of interest, provided that the number of patterns to be stored is
much less than the number of units (M << N, see next section about storage capacity),
the crosstalk term is small enough and all stored patterns are stable. Moreover, as in the
single pattern case, if the network is initialized with a version of one of the stored
patterns that is corrupted with a few wrong bits, the network will retrieve the correct
stored version [HKP91].
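The Hebbian storage rule and the asynchronous updating procedure can be sketched as follows (pure Python; the two stored patterns in the usage example are an illustrative choice, picked orthogonal so the crosstalk term vanishes):

```python
import random

def hebbian_weights(patterns):
    """Superimpose M bipolar patterns with the Hebbian rule (eq. 3.8):
    W_ij = (1/N) * sum over patterns of S_i * S_j."""
    N = len(patterns[0])
    return [[sum(S[i] * S[j] for S in patterns) / N for j in range(N)]
            for i in range(N)]

def recall(W, Y, steps=1000, seed=0):
    """Asynchronous stochastic updating: pick a unit at random with
    probability 1/N and apply the threshold rule (eqs. 3.1 and 3.2)."""
    rng = random.Random(seed)
    Y = list(Y)
    N = len(Y)
    for _ in range(steps):
        i = rng.randrange(N)
        net = sum(W[i][j] * Y[j] for j in range(N))
        if net != 0:
            Y[i] = 1 if net > 0 else -1
    return Y

P1 = [1, 1, 1, 1, -1, -1, -1, -1]
P2 = [1, -1, 1, -1, 1, -1, 1, -1]
W = hebbian_weights([P1, P2])
noisy = [-1] + P1[1:]            # P1 with its first bit corrupted
restored = recall(W, noisy)      # converges back to the stored P1
```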
3.1.3 - Storage Capacity
Hertz et al. [HKP91] show that if: a) the patterns to be stored are random (each bit has equal probability of being −1 or +1) and independent; and b) M and N are large, then the crosstalk term can be approximated by a random variable with a gaussian distribution, zero mean and variance M/N. Therefore the ratio M/N determines the probability of the crosstalk term being greater than +1 for S_i = −1 or less than −1 for S_i = +1. From this modelling we can estimate, for instance, that if we choose M = 0.185 N and the network is initialized with one of the S patterns, no more than 1% of the bits will change. However, these few bits that change can cause more bits to change, and so on, in what is known as the "avalanche" effect.
Hertz et al. [HKP91] show, using an analogy to spin glass models and mean field theory, that this avalanche occurs if M > 0.138 N, and therefore we could not use the network as a "memory". They also show that, using the previous modelling, for M = 0.138 N, 0.37% of the bits will change initially and 1.6% of them will change before an attractor is reached. So, if we choose M ≤ 0.138 N, there will be an attractor "close" to each pattern S that we want to store, i.e. the patterns will be retrieved but the final result will have a few bits wrong. As an example for this case, for N = 256, M ≈ 35.
If we want to recall all stored patterns S without error (perfect recall), i.e. to force the patterns S to be the attractors (not only "close" to the attractors as in the previous case), then McEliece et al. [MPRV87] show that M ≤ N / (4 ln N). Moreover, they show that perfect recall will happen if the initial pattern has fewer than N/2 different bits when compared with a stored pattern ([HKP91],[HuHo3]). In this case, for N = 256, M ≈ 11.
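These capacity estimates are easy to reproduce; the sketch below simply evaluates the two bounds quoted above:

```python
import math

def capacity_with_errors(N):
    """Patterns storable with a few wrong bits in the retrieved result
    (avalanche threshold M <= 0.138 N)."""
    return math.floor(0.138 * N)

def capacity_perfect(N):
    """McEliece et al. bound for perfect recall: M <= N / (4 ln N)."""
    return math.floor(N / (4 * math.log(N)))

# For N = 256 these give 35 and 11, the values quoted in the text.
```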
From these arguments we can see that, when using the Hebbian rule (eqs. 3.8 and
3.9), the storage capacity of the Hopfield network is rather limited. Other design
techniques have been proposed that improve the storage capacity ([VePs89],[FaMi90])
to a value closer to M = N, the limit for the storage capacity of the Hopfield network
[AbJa85].
Note as well that if the patterns to be stored are all orthogonal to each other, i.e.

    (S^l)^T S^k = { 0 for l ≠ k ; N for l = k }        (3.13)

apparently the memory capacity would be N, since the crosstalk term is zero in this case (see eq. 3.12). However, if we use the Hebbian rule (eqs. 3.8 or 3.9) to store N orthogonal patterns, the weight matrix W will be equal to the identity matrix, i.e. each unit feeds back only to itself. Such an arrangement is useless as a memory, since it makes all initial patterns stable, that is, the network never changes its initial pattern. This can be interpreted as making attractors of all points of the discrete configuration space, whose basins of attraction contain only the attractors themselves. Therefore to make the network useful in this case we need to store fewer than N orthogonal patterns.
We can prove that the weight matrix will be equal to the identity matrix if we try to store N patterns using the Hebbian rule, by defining a square and in general non-symmetric matrix X where each row of X is defined as the transpose of one of the patterns to be stored, i.e. X_ij = S_j^i. Consequently, from eq. 3.13 we have that:
    X X^T = N I        (3.14)

where I is the identity matrix. Then, we can rewrite eq. 3.9 as:

    W = (1/N) Σ_{pat=1}^{M} S^pat (S^pat)^T = (1/N) X^T X        (3.15)

By the definition of orthogonality, no row of the matrix X can be written as a linear combination of the other rows, and therefore the inverse of X and the inverse of its transpose X^T exist. Then if we pre-multiply both sides of eq. 3.14 by X^T and post-multiply them by (X^T)^{-1} we have:

    X^T X X^T (X^T)^{-1} = N I X^T (X^T)^{-1}        (3.16)

and consequently X^T X = N I and W = I, as we wanted to show.
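The W = I result can be verified numerically. The sketch below stores N = 4 mutually orthogonal bipolar patterns (rows of a 4 x 4 Hadamard-type matrix, an illustrative choice) with the Hebbian rule of eq. 3.9:

```python
def hebbian_W(patterns):
    """W = (1/N) * X^T X, where the rows of X are the stored patterns."""
    N = len(patterns[0])
    return [[sum(S[i] * S[j] for S in patterns) / N for j in range(N)]
            for i in range(N)]

# N = 4 mutually orthogonal bipolar patterns (Hadamard-matrix rows)
patterns = [[1, 1, 1, 1],
            [1, -1, 1, -1],
            [1, 1, -1, -1],
            [1, -1, -1, 1]]
W = hebbian_W(patterns)
# Storing N orthogonal patterns yields W = I, so every initial state
# is left unchanged and the network is useless as a memory
```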
3.1.4 - Minimizing an energy function
One important contribution made by Hopfield [Hop82] was to propose a lower- and upper-bounded scalar-valued function, a so-called "energy function", that reflects the state of the whole network, i.e. such a function involves all the network outputs. He then showed that whenever one of the network outputs Y_i is updated, the value of this function decreases if Y_i changes, or remains constant if Y_i does not change. Therefore the network will evolve until it reaches a state that is a locally stable equilibrium point. To prove this, Hopfield defined the energy function as the following quadratic function:

    H(k) = −(1/2) Y^T(k) W Y(k) = −(1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} W_ij Y_i(k) Y_j(k)        (3.17)

where H(k) is the value of the energy function for the whole network at time step k. The lower and upper limits for H(k) for any k are given respectively by −(1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} |W_ij| and +(1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} |W_ij|, since the outputs Y_i are −1 or +1.
Let's assume that at time k the unit L was selected to be updated, where 1 ≤ L ≤ N. Isolating the energy terms due to unit L, we can rewrite eq. 3.17 as:

    H(k) = −(1/2) Σ_{i=1, i≠L}^{N} Σ_{j=1, j≠L}^{N} W_ij Y_i(k) Y_j(k) − (1/2) Y_L(k) Σ_{j=1, j≠L}^{N} W_Lj Y_j(k) − (1/2) Y_L(k) Σ_{i=1, i≠L}^{N} W_iL Y_i(k) − (1/2) W_LL [Y_L(k)]²        (3.18)
The variation in the energy is given by ΔH(k) = H(k+1) − H(k). Note that: 1) since the updating is asynchronous, only unit L may change at time k, and consequently Y_i(k+1) = Y_i(k) for i ≠ L; 2) since all units have bipolar outputs, [Y_i]² = 1 for all i. Therefore, when calculating ΔH(k), the first and fourth terms on the right side of eq. 3.18 cancel out and we can write:

    ΔH(k) = −(1/2) Y_L(k+1) [ Σ_{j=1, j≠L}^{N} W_Lj Y_j(k) + Σ_{i=1, i≠L}^{N} W_iL Y_i(k) ] + (1/2) Y_L(k) [ Σ_{j=1, j≠L}^{N} W_Lj Y_j(k) + Σ_{i=1, i≠L}^{N} W_iL Y_i(k) ]        (3.19)

If unit L changes its output then Y_L(k+1) = −Y_L(k) and, using the fact that the weight matrix W is symmetric (see eq. 3.8), we have that:
    ΔH(k) = 2 Y_L(k) Σ_{j=1, j≠L}^{N} W_Lj Y_j(k) = 2 Y_L(k) net_L(k) − 2 W_LL        (3.20)

Due to the rule used to update the network outputs (eqs. 3.1 and 3.2), whenever a unit changes its output the product Y_L(k) net_L(k) is negative. Due to the Hebbian rule (eq. 3.8), W_LL = M/N > 0. Therefore whenever a unit changes its output, the overall energy of the network decreases. In other words, the energy is a monotonically decreasing function with respect to time.
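The monotonic decrease of H can be observed directly in simulation. The sketch below records the energy of eq. 3.17 after every asynchronous update of a small Hebbian network (patterns and starting state are illustrative choices):

```python
import random

def hebbian(patterns):
    """Hebbian weights (eq. 3.8): W_ij = (1/N) * sum over patterns."""
    N = len(patterns[0])
    return [[sum(S[i] * S[j] for S in patterns) / N for j in range(N)]
            for i in range(N)]

def energy(W, Y):
    """H(k) = -(1/2) * sum_ij W_ij * Y_i * Y_j (eq. 3.17)."""
    N = len(Y)
    return -0.5 * sum(W[i][j] * Y[i] * Y[j]
                      for i in range(N) for j in range(N))

def run_and_trace(W, Y, steps=200, seed=1):
    """Asynchronous stochastic updating, recording H after every step."""
    rng = random.Random(seed)
    Y, N = list(Y), len(Y)
    trace = [energy(W, Y)]
    for _ in range(steps):
        i = rng.randrange(N)
        net = sum(W[i][j] * Y[j] for j in range(N))
        if net != 0:
            Y[i] = 1 if net > 0 else -1
        trace.append(energy(W, Y))
    return Y, trace

W = hebbian([[1, 1, 1, 1, -1, -1, -1, -1],
             [1, -1, 1, -1, 1, -1, 1, -1]])
final, trace = run_and_trace(W, [1, 1, -1, 1, -1, 1, -1, -1])
# the recorded energies never increase, as eq. 3.20 predicts
```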
Note that we used the fact that the weight matrix is symmetric, an assumption that is not biologically plausible in terms of networks of real neurons. McEliece et al. [MPRV87] speculate, however, that maybe all that is necessary is a "little" symmetry, such as a lot of zeros at symmetric positions in the weight matrix, which is common in real neural networks. Moreover, asymmetric weight matrices can be used to generate a cyclical sequence of patterns ([HKP91],[Kle86]), and Kleinfeld and Sompolinsky [KlSo89] even found a mollusc that apparently uses this mechanism. In this case the attractors are stable limit cycles.
In chapter 2 we mentioned that learning in feedforward networks could be seen
as an optimization process. This is also the case here for feedback networks and such
an interpretation will be very useful later. The problem can be stated as follows: how
should the weights be set such that the patterns to be stored are deep minima of the
energy function given by eq. 3.17? Let's start with storing just one pattern. If we want to store just the pattern S, since its components are −1 or +1, we can make each energy term depend on [S_i]² [S_j]², which is always positive, so that the energy will be as small as possible [BeJa90], or:

    H = −(1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} W_ij S_i S_j = −(1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} [S_i]² [S_j]²        (3.21)

From this we see that we just need to define the weight matrix as W_ij = S_i S_j. Again, to store several patterns, we just sum this equation over all patterns (eqs. 3.8 and 3.9). Adding all patterns together in this way will distort the energy levels of each stored pattern because of the crosstalk term. However, as stated before, if M << N the distortion will not be significant.
3.1.5 - Spurious States
We have shown that if the crosstalk term is small enough, the patterns S to be stored are attractors (stable equilibrium points) and they will be local minima of the energy function. Such attractors are sometimes called retrieval states or retrieval memories. This situation is very likely to happen, as stated before, if the number of patterns M to be stored is much less than the number of units N. However, these are not the only attractors that the network has.
Firstly, the reverse −S of an attractor S is also an attractor, since it also satisfies eq. 3.10 and it will have the same energy H.
Secondly, Hertz et al. [HKP91] and Amit et al. [AGS85a] show that patterns defined as a linear combination of an odd number of attractors are also attractors. They call such attractors mixture states.
Thirdly, Amit et al. [AGS85b] show that if the number M of patterns to be stored is relatively large (compared to N), then there are attractors that are not correlated with any linear combination of the original patterns S. They call such attractors spin glass states, from the spin glass models in statistical mechanics.
The second and third types of attractors are called spurious states, spurious minima or spurious memories. Their existence means that there is the possibility that the network will not work perfectly as an associative memory, since it can converge to "memories" that were not previously defined.
Some measures, however, can be taken to decrease the size of the basins of attraction of these spurious states. For instance, as Hopfield did in his original paper
[Hop82], we can force the constraint that a unit does not feedback to itself, i.e. W_ii = 0 for 1 ≤ i ≤ N [KaSo87]. It is possible to show that this modification does not affect the stability of the patterns that we want to store (the retrieval memories), although it affects the dynamics of the network [HKP91].
A second possible improvement, proposed by Hopfield et al. [HFP83], is to try to "unlearn" some of the spurious states. To do this, the network weights are determined by applying eq. 3.8, the network state is initialized at a random position and the network output is updated until it achieves convergence. If the state to which the network converged is one of the spurious memories, represented by X^F, then the Hebbian rule is applied with the sign reversed:

    ΔW_ij = −ε X_i^F X_j^F        (3.22)

where 0 < ε << 1. One possible interpretation is that such a procedure changes the shape of the energy function by raising the energy level at the local minimum X^F, therefore reducing its basin of attraction. The assumption is that memories with the deepest energy valleys tend to have the largest basins of attraction. However, too much "unlearning" will result in perturbing and even destroying the retrieval memories that we intended to store [HFP83].
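As a sketch, eq. 3.22 is a one-line weight update; the value of ε below is purely illustrative, and X_F stands for whatever spurious state the convergence run produced:

```python
def unlearn(W, X_F, eps=0.01):
    """Reversed-sign Hebbian update (eq. 3.22):
    W_ij -> W_ij - eps * X_F_i * X_F_j, raising the energy level at the
    spurious state X_F and shrinking its basin of attraction."""
    N = len(X_F)
    return [[W[i][j] - eps * X_F[i] * X_F[j] for j in range(N)]
            for i in range(N)]
```

Applying this repeatedly, or with a large ε, would also perturb the retrieval memories, which is the over-unlearning danger noted above.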
3.1.6 - Synchronous Updating
The asynchronous updating used in the Hopfield network can be seen as a simple
way to model the random propagation delays of the signals in a network of real neurons.
If synchronous updating is used (all unit outputs are updated simultaneously in a discrete-time formulation), there will be no significant changes in terms of memory capacity or position of the equilibrium points ([HKP91],[AGS85a],[MPRV87]). However, the network dynamics will be different, e.g. it will take many fewer iterations to converge to a fixed attractor (EP), and there is the possibility of stable limit cycles that are not present if asynchronous updating is used. Zurada [Zur92] shows an example of this last case.
Another difference is that using synchronous updating the trajectory in the output
space is always the same for a given starting point. When using asynchronous updating
this is not the case because the units are randomly selected to be updated, as explained
before.
3.2 - Solving Optimization Problems
After proposing to use ANNs with binary or bipolar units and random asynchronous updating as associative (or content-addressable) memories [Hop82], Hopfield realized that he could obtain the same computational properties by using a deterministic continuous-time version with units that have a continuous and monotonically increasing activation function, such as a squashing function [Hop84]. This network is sometimes referred to as the gradient-type Hopfield network [Zur92] or the Hopfield network with continuous updating [HKP91].
By making such modifications he realized that he could also propose an analog hardware implementation of the above network using electrical components such as amplifiers, resistors and capacitances. The capacitances were introduced for each unit such that they would have an integrative time delay. Consequently, the time evolution of the network should be represented by a nonlinear differential equation.
3.2.1 - An analog implementation
The behaviour of each unit in this analog version is closer to the behaviour of a real neuron. Figure 3.2 illustrates such a unit. The variables net_i and Y_j are voltages, bias_i is a current, W_ij and g_i are conductances, C_i is a capacitance, and the triangle represents a voltage amplifier with a function f, i.e. V_out = f(V_in) or Y_i = f_i(net_i). We will assume that the voltage amplifier has an infinite input impedance such that it does not absorb any current. Figure 3.3 illustrates the implementation of a feedback network using this type of unit. In order to avoid the need for negative resistances, we have to assume that the voltage amplifiers provide a negated output −Y_i as well, or use an additional amplifier for each unit with constant gain −1.
Adding all the currents for a unit, which are illustrated by arrows in fig. 3.2, the dynamic behaviour of the unit can be described by:
    i_c = C_i (dnet_i / dt) = bias_i + Σ_{j=1}^{N} W_ij (Y_j − net_i) − g_i net_i        (3.23)

Let's define the parameter G_i as G_i = g_i + Σ_{j=1}^{N} W_ij, the external input vector as ext = [ext_1 ... ext_N]^T, and the matrices G and C as G = diag[G_1 ... G_N] and C = diag[C_1 ... C_N]. Then the dynamical behaviour of the whole network can be described by the following set of differential equations:

    C (dnet/dt) = −G net + W f(net) + bias        (3.24)

where net and bias are column vectors, and the function f( ) is applied to each component of the vector net. Note that, by definition, Y = f(net).

Figure 3.2 - The analog implementation of a unit using electrical components

Figure 3.3 - The analog implementation of a continuous-time Hopfield network with 4 units
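A forward-Euler integration of eq. 3.24 illustrates the continuous dynamics. This is an illustrative sketch with f = tanh and arbitrary two-unit parameter values, not a circuit-level simulation:

```python
import math

def simulate(W, bias, G, C, net0, dt=0.01, steps=5000):
    """Forward-Euler integration of eq. 3.24:
    C_i * dnet_i/dt = -G_i * net_i + sum_j W_ij * f(net_j) + bias_i,
    with f = tanh as the squashing activation. Returns Y = f(net)."""
    net = list(net0)
    N = len(net)
    for _ in range(steps):
        Y = [math.tanh(n) for n in net]
        dnet = [(-G[i] * net[i]
                 + sum(W[i][j] * Y[j] for j in range(N))
                 + bias[i]) / C[i] for i in range(N)]
        net = [n + dt * d for n, d in zip(net, dnet)]
    return [math.tanh(n) for n in net]

Y = simulate(W=[[0.0, 2.0], [2.0, 0.0]], bias=[0.0, 0.0],
             G=[1.0, 1.0], C=[1.0, 1.0], net0=[0.1, 0.05])
```

For these symmetric weights the two outputs settle together at a nonzero fixed point, consistent with the stability result proved in the next subsection.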
3.2.2 - An energy function
Assuming that the weight matrix W is symmetric and that the activation function f_i is a monotonically increasing function bounded by lower and upper limits for all units, Hopfield ([Hop84],[Zur92]) proposed the following energy function in order to prove the stability of the network:

    H(t) = −(1/2) Y^T W Y − bias^T Y + Σ_{i=1}^{N} G_i ∫_0^{Y_i} f_i^{-1}(z) dz        (3.25)

Applying the chain rule we have:

    dH(Y(t))/dt = Σ_{i=1}^{N} [∂H(Y(t))/∂Y_i] (dY_i/dt) = [∇_Y H(Y)]^T (dY/dt)        (3.26)

where by definition

    ∇_Y H(Y) = [ ∂H(Y)/∂Y_1 .... ∂H(Y)/∂Y_N ]^T        (3.27)

Using the Leibnitz rule we have:

    d/dY_i [ Σ_{j=1}^{N} G_j ∫_0^{Y_j} f_j^{-1}(z) dz ] = G_i f_i^{-1}(Y_i) = G_i net_i        (3.28)

From this relation and since the matrix W is symmetric, we can write that:

    dH/dt = −[ W Y + bias − G net ]^T (dY/dt) = −[ C (dnet/dt) ]^T (dY/dt)        (3.29)

Comparing eqs. 3.26 and 3.29, we can see that:

    ∇_Y H(Y) = −C (dnet/dt)        (3.30)

or for each component:

    ∂H(Y)/∂Y_i = −C_i (dnet_i/dt)        (3.31)

Since Y_i = f_i(net_i) and f_i( ) is a monotonically increasing function, we can write that net_i = f_i^{-1}(Y_i) and

    dnet_i/dt = [ d f_i^{-1}(Y_i) / dY_i ] (dY_i/dt)        (3.32)

where d f_i^{-1}(Y_i)/dY_i > 0. Finally, by substituting eqs. 3.31 and 3.32 in eq. 3.26:

    dH/dt = −Σ_{i=1}^{N} C_i [ d f_i^{-1}(Y_i) / dY_i ] (dY_i/dt)²        (3.33)

Therefore dH/dt ≤ 0, and dH/dt = 0 if and only if dY_i/dt = 0 for all units, 1 ≤ i ≤ N. Since the network "energy" is a bounded function, this proves that the network will evolve until it settles to an equilibrium point, a local minimum of the energy function. In other words, the network "searches" for a minimum of the energy function and stops there. Note that the possibility of limit cycles is excluded, since in a limit cycle we would have dY_i/dt ≠ 0 together with dH/dt = 0, contradicting eq. 3.33.
It is also interesting to investigate the effect of the steepness of the activation function f_i. This is easily done by replacing Y_i = f_i(net_i) by Y_i = f_i(λ net_i) and net_i = f_i^{-1}(Y_i) by net_i = f_i^{-1}(Y_i)/λ, where λ > 0 is the gain. The energy function H(t) becomes:
(3.34)   H(t) = -(1/2) Y^T W Y - bias^T Y + (1/λ) Σ_{i=1}^{N} G_i ∫_0^{Y_i} f_i^{-1}(z) dz

As the gain λ increases, the activation function f_i tends to a threshold function. Suppose,
for instance, that f( ) = tanh( ). The integral in the third term on the right-hand side of eq. 3.34 is zero for Y_i = 0 and positive otherwise, becoming very large as Y_i approaches its bounds -1 or +1, since such bounds are approached very slowly. In the limiting case when the gain λ → +∞, the contribution of the third term is negligible and the locations of the equilibrium points are given by the maxima and minima of:

(3.35)   H(t) = -(1/2) Y^T W Y - bias^T Y = -(1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} W_ij Y_i(t) Y_j(t) - Σ_{i=1}^{N} Y_i(t) bias_i

The same arguments are valid if f_i(net_i) = sig(net_i) = 1/[1 + exp(-net_i)].
For a large but finite gain λ, the third term on the right-hand side of eq. 3.34 begins to contribute, but only when Y_i approaches its bounds, i.e. when the network is near one of the surfaces, edges or corners of the hypercube that contains the network dynamics. When all Y_i are far from their limits, the contribution of the third term is still negligible. Consequently, for large but finite λ the maxima of the complete energy function given by eq. 3.34 remain at the corners and the minima are slightly displaced toward the interior of the hypercube [Hop84]. Therefore in this case it can be assumed that the energy function being minimized is the one given by eq. 3.35 and that the equilibrium points will be located at the corners of the hypercube.
Note that if the gain λ is sufficiently large, it is reasonable to assume that net_i ≈ 0, and consequently in figures 3.2 and 3.3 the current sources ext_i can be substituted by an equivalent voltage source VExt_i in series with the appropriate resistor RExt_i such that VExt_i/RExt_i = ext_i.
Hopfield and Tank [HoTa85] then realized that, if the cost function of an optimization problem could be expressed as a quadratic with the same form as eq. 3.35, then a network like the one illustrated in fig. 3.3, using units with large but finite gains in their activation functions, could be used to search for a minimum of the cost function. They proposed solving the optimization problem with analog hardware, an approach radically different from implementing an algorithm in a digital computer. The weights and bias can be determined by comparing the cost function for the problem at hand with the energy function given by eq. 3.35.
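This comparison step can be sketched numerically. In the toy problem below the cost is already written in the form of eq. 3.35, so W and bias are read off directly and eq. 3.24 is integrated with a high-gain tanh activation. The 2-unit weights, biases, gain and initial state are invented values for illustration, not parameters from the text.

```python
import math

# Minimal sketch of a continuous-time Hopfield network used as an optimizer.
# The quadratic cost E(Y) = -1/2 Y^T W Y - bias^T Y (the form of eq. 3.35)
# is encoded directly in W and bias; all numbers are invented for illustration.
W = [[0.0, 1.0],
     [1.0, 0.0]]              # symmetric weight matrix, zero diagonal
bias = [0.1, -0.1]
G = [1.0, 1.0]                # leak conductances
C = [1.0, 1.0]                # capacitances
gain = 20.0                   # large but finite activation gain
dt, steps = 0.01, 5000

def energy(Y):
    q = -0.5 * sum(W[i][j] * Y[i] * Y[j] for i in range(2) for j in range(2))
    return q - sum(bias[i] * Y[i] for i in range(2))

net = [0.05, -0.02]           # arbitrary initial internal state
for _ in range(steps):
    Y = [math.tanh(gain * n) for n in net]
    # Euler step of eq. 3.24:  C dnet/dt = -G net + W Y + bias
    net = [net[i] + dt * (-G[i] * net[i]
                          + sum(W[i][j] * Y[j] for j in range(2))
                          + bias[i]) / C[i]
           for i in range(2)]

Y = [math.tanh(gain * n) for n in net]
print([round(y, 2) for y in Y], round(energy(Y), 3))
```

After convergence the outputs settle near a corner of the hypercube where the quadratic cost is locally minimal, in line with the high-gain argument above.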
Hopfield and Tank ([HoTa85], [TaHo86], [HoTa86]) showed, as examples, how
such a network could be used to propose solutions to: 1) analog/digital conversion
problems; 2) decomposition/decision signal problems (to determine the decomposition
of a particular signal given the knowledge of its individual components); 3) linear
programming problems [Per92]; and 4) the travelling salesman problem (TSP). Other
possible applications investigated by other researchers are: 1) job shop scheduling
optimization [Zur92]; 2) economic electric power dispatch problems [Zur92]; and 3)
graph bipartitioning (important for chip design, where we want to divide a group of interconnected components into 2 subsets with roughly the same number of components in each set while minimizing the wire length between the two sets).
It is important to emphasize that one can only prove that, given the proper constraints, the network converges to a local minimum of the energy function. However, in general, such a local minimum is not the global minimum. Therefore the
Hopfield approach is best suited to problems where there are several local minima that
give satisfactory solutions and it is more important to rapidly approach a "good" solution
than to take much longer and to have the best possible solution. One could argue that
these are the kind of problems that biological systems have to solve [Per92]. It is not
always easy to decide if a particular optimization problem with a particular set of
parameters will be well suited to be solved using the Hopfield approach.
3.3 - The IAC Neural Network
The Interactive Activation and Competition (IAC) Neural Network was proposed by the psychologists McClelland and Rumelhart to model visual word recognition [McRu81] and the retrieval of general and specific information about individual exemplars previously stored in the network [McRu88]. The network takes noisy clues as inputs; for instance, it can be used to recognize a word that was partially obscured, or to retrieve the specific information stored about an item from a partial or incorrect version of its description.
The IAC network is also a feedback network that operates in discrete or continuous time, and the outputs of the units are continuous real numbers. The principle of operation is the same as in the Hopfield network, i.e. there is no learning phase and the designer sets the topology and the initial state of the network. The network then evolves to an equilibrium state (equilibrium point, EP) that represents the network's answer to the problem.
As in the Hopfield network, the network topology is selected in order to satisfy the specific constraints of the problem at hand. The major difference in operation between the Hopfield network and the IAC network is the activation function used.
McClelland and Rumelhart [McRu88] define an IAC network as consisting of a set of units organized into pools. The units in a pool compete against each other such that, ideally, when the network settles to an EP there is only one activated unit in each pool. Units situated in different pools can excite or be indifferent to each other, but normally they do not inhibit each other. Figure 3.4 illustrates the typical topology of the IAC network. All connections are assumed to be bidirectional, and therefore the weight matrix W is symmetric. All units also have an external input, not shown in figure 3.4.

Figure 3.4 - Typical topology for the IAC network, where dashed and solid lines represent respectively inhibitory and excitatory connections. Black squares represent activated units.
According to McClelland and Rumelhart's conception, each pool represents a specific property (or characteristic) and each unit in the pool represents one of the mutually exclusive possibilities for that property. For example, in figure 3.4 pool 1 could represent the gender of an individual, while pools 3 and 4 could represent his education level, marital status or profession. Pool 2 could contain the names of the individuals.
We can use the above example, where the network stores specific information about a set of individuals, to show three possible cases of information retrieval by the network [McRu88].
In the first case, information about an individual can be retrieved by activating the unit with his name in pool 2; we then want just one unit activated in each pool after convergence.
In the second case we can initialize the network with the description of an
individual by activating the corresponding units in pools 1, 3 and 4 and, after
convergence is achieved, look for the winner unit in pool 2. To be useful, the network
should retrieve the correct individual even if the description is partial or slightly
incorrect. It is possible to have units that are partially activated in the pool for names
if there is no perfect match and several units have a close match. The amount of
activation should be related to the number of matches with the given description.
In the third case we can retrieve general information about a property by
activating the corresponding unit, for example, to retrieve the general properties of
married individuals. In this case it is also possible to have units that are partially
activated.
McClelland and Rumelhart showed, by using simulations, that the network works well in the above three cases [McRu88]. However, in order to operate the network the designer has to adjust several parameters, and McClelland and Rumelhart did not provide guidelines for selecting them.
In this section we derive a few results that are applicable to networks of this type with any number of units, including the proof that, given certain conditions, the IAC network is a stable system and that it also minimizes an energy function, much like the Hopfield network. Extensive results are then derived for the case where the network has 2 units: we analyse mathematically the dynamics of a 2-unit IAC network, and more specifically we are interested in how the parameters of the model affect the number, type, location and zone of attraction (or basin of attraction) of the equilibrium points. In most cases stability around the EPs is proved using Lyapunov functions.
3.3.1 - The Mathematical Model
McClelland and Rumelhart used the standard form for the combining, activation and output functions (these terms are defined in section 2.3.2) to define the mathematical model of the IAC network. Assuming that the IAC network is operating in discrete time and in synchronous mode, we have:

(3.36)   net_i(k) = Σ_{j=1}^{N} W_ij Y_j(k) + ext_i(k)

(3.37)   a_i(k+1) = f[ a_i(k), net_i(k) ]

(3.38)   Y_i(k+1) = g[ a_i(k+1) ]

where the variables a_i(k) and Y_i(k) represent respectively the activation and output values for unit i at iteration k, N represents the number of units in the network, 1 ≤ i ≤ N, f[ , ] and g[ ] are respectively the activation and output functions, and the weight matrix W is assumed to be symmetric.
McClelland and Rumelhart wanted a model with the following properties:
1) the activation values must be kept between two limits given by the parameters max and min, where min < 0 < max;
2) when the network is initialized, all the activation values are at the rest value given by the parameter rest, where min ≤ rest ≤ 0;
3) when the net input for a particular unit is positive, its activation value must be driven towards the upper limit max;
4) when the net input for a particular unit is negative, its activation value must be driven towards the lower limit min;
5) when the net input for a particular unit is zero, its activation value must be driven towards the rest value given by the parameter rest, with an adjustable speed that is given by the parameter decay ≥ 0.
To satisfy the above requirements, Rumelhart and McClelland proposed the following functions f( , ) and g( ):

(3.39)   Δa_i(k) = [max - a_i(k)] net_i(k) - decay [a_i(k) - rest]    if net_i(k) ≥ 0

(3.40)   Δa_i(k) = [a_i(k) - min] net_i(k) - decay [a_i(k) - rest]    otherwise

where Δa_i(k) = a_i(k+1) - a_i(k), and

(3.41)   Y_i(k) = a_i(k) if a_i(k) > 0;  Y_i(k) = 0 otherwise

Typical parameters used in simulations by McClelland and Rumelhart [McRu88] are: max = 1, min = -0.2, rest = -0.1, decay = 0.1, ext_i = 0 or 0.4, and W_ij = -0.1, 0 or 0.1. However, such parameters were found through trial and error and not from mathematical analysis.
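As a concrete illustration, the update rules above can be transcribed almost literally into code. The 4-unit topology below (two pools of two mutually competing units, with cross-pool excitation) is an invented example; the parameter values are the typical ones quoted above.

```python
# Sketch of synchronous discrete-time IAC iterations (eqs. 3.36-3.41),
# using the typical parameter values quoted above. The 4-unit topology
# (two pools of two mutually inhibiting units) is an invented example.
MAX, MIN, REST, DECAY = 1.0, -0.2, -0.1, 0.1

# W[i][j]: -0.1 within a pool (competition), +0.1 across pools (excitation)
W = [[ 0.0, -0.1,  0.1,  0.0],
     [-0.1,  0.0,  0.0,  0.1],
     [ 0.1,  0.0,  0.0, -0.1],
     [ 0.0,  0.1, -0.1,  0.0]]
ext = [0.4, 0.0, 0.0, 0.0]          # external input activates unit 0
N = len(W)

def output(a):                       # eq. 3.41
    return a if a > 0.0 else 0.0

def step(a):
    Y = [output(ai) for ai in a]
    new_a = []
    for i in range(N):
        net = sum(W[i][j] * Y[j] for j in range(N)) + ext[i]   # eq. 3.36
        if net >= 0.0:               # eq. 3.39
            da = (MAX - a[i]) * net - DECAY * (a[i] - REST)
        else:                        # eq. 3.40
            da = (a[i] - MIN) * net - DECAY * (a[i] - REST)
        new_a.append(a[i] + da)
    return new_a

a = [REST] * N                       # all units start at rest
for _ in range(200):
    a = step(a)
print([round(ai, 3) for ai in a])
```

With these values the externally driven unit and its cross-pool partner win their pools, while the competing units are pushed below zero, so their outputs are cut off by eq. 3.41.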
3.3.2 - Initial Considerations
Without loss of generality, we can consider that each one of the units is connected to at least one of the other units (for each unit i, W_ij ≠ 0 for at least one j), since we are not interested in the case where a unit is completely isolated from the other units.
If we assume that min < 0 < max and min = -max, eqs. 3.39 and 3.40 can be combined into just one equation:

(3.42)   Δa_i(k) = net_i(k) max - |net_i(k)| a_i(k) - decay [a_i(k) - rest]

If the network is operating in continuous time, the above equation is simply replaced by:

(3.43)   da_i/dt = net_i max - |net_i| a_i - decay (a_i - rest)

The equilibrium points a_i^e of the system can be found by solving eq. 3.42 for Δa_i(k) = 0 or eq. 3.43 for da_i/dt = 0. So:

(3.44)   a_i^e = (max net_i^e + decay rest) / (|net_i^e| + decay)

where net_i^e represents the value of the net input for unit i when the network reaches an EP. Since net_i^e is in general unknown, eq. 3.44 does not help to find the position of the EP in the general case. But we can still use it to state that, if decay = 0:
1) when net_i^e ≠ 0, the EP will be characterized by a_i^e = max if net_i^e > 0 or a_i^e = -max if net_i^e < 0;
2) when net_i^e = 0, eq. 3.44 cannot be used to find the EP, but the points where net_i = 0 for all units are also equilibrium points, since Δa_i (or da_i/dt) = 0 for all i. One point where this is possible, but not the only one, is to have ext_i = 0 for all units, and consequently the point a_i^e = 0 for all i is also an EP.
Moreover, for rest = 0 and small values of decay, as long as decay << |net_i^e|, the EP will still be located near max or -max, and the condition net_i = 0 is not enough to cause an EP.
Observe that if W_ij = 0 for all j, i.e. the unit i is completely isolated from the other units, then net_i = ext_i and (considering the discrete-time update, eq. 3.42) the condition for stability is that -decay < |ext_i| < 2 - decay, which can also be written as -|ext_i| < decay < 2 - |ext_i|. Therefore such a unit can form a stable 1-dimensional system even if decay < 0. The position of the EP is given by eq. 3.44, replacing net_i^e by ext_i.
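For such an isolated unit, the discrete-time update of eq. 3.42 with rest = 0 reduces to a scalar linear map, so the stability condition above can be checked directly. This is a minimal sketch under that assumption; the numeric values of ext and decay below are arbitrary test values.

```python
# Sketch: an isolated unit (W_ij = 0 for all j) under eq. 3.42 with rest = 0
# follows the scalar linear map a(k+1) = (1 - |ext| - decay) a(k) + ext*max,
# which is stable iff |1 - |ext| - decay| < 1, i.e. -|ext| < decay < 2 - |ext|.
def simulate(ext, decay, a0=0.5, max_=1.0, steps=500):
    a = a0
    for _ in range(steps):
        a += ext * max_ - abs(ext) * a - decay * a   # eq. 3.42, rest = 0
    return a

# Stable example: decay may even be slightly negative if |ext| offsets it.
a_stable = simulate(ext=0.4, decay=-0.1)
# Predicted EP from eq. 3.44 (rest = 0): a^e = max*ext / (|ext| + decay)
print(a_stable, 0.4 / 0.3)

# Unstable example: decay < -|ext| violates the condition and a(k) diverges.
a_unstable = simulate(ext=0.1, decay=-0.2)
print(abs(a_unstable) > 1e6)
```

The stable run converges to the EP predicted by eq. 3.44 even though decay is negative, illustrating the remark above.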
3.3.3 - Minimizing an Energy Function
In this section we show that under certain constraints the continuous time version
of the IAC network, like the Hopfield network, also minimizes a bounded energy
function. Therefore, we can prove that the network is stable and can be used to solve
the same kind of minimization problems for which the Hopfield network has been used.
First, let's assume that decay = 0 and that the network is within or at the border of the hypercube [-max, max]^N, where N is the number of units in the network, i.e. -max ≤ a_i ≤ max for all i. We can define the following quadratic function as the energy function:

(3.45)   H(t) = -(1/2) Y^T W Y - ext^T Y

As in the case of the Hopfield network we can write that:

(3.46)   dH[Y(t)]/dt = Σ_{i=1}^{N} (∂H[Y(t)]/∂Y_i)(dY_i/dt) = ∇_Y H(Y)^T (dY(t)/dt)

Since the matrix W is symmetric:

(3.47)   dH/dt = -[ W Y + ext ]^T (dY/dt) = -net^T (dY/dt) = -Σ_{i=1}^{N} net_i (dY_i/dt)

But Y_i = g(a_i), so we have:

(3.48)   dH/dt = -Σ_{i=1}^{N} net_i [ dg(a_i)/da_i ] (da_i/dt)

Using eq. 3.43, finally:

(3.49)   dH/dt = -Σ_{i=1}^{N} [ dg(a_i)/da_i ] net_i² (max - a_i sign(net_i))

i.e. each term carries the factor (max - a_i) when net_i ≥ 0 and (max + a_i) when net_i < 0. Therefore dH/dt ≤ 0 for decay = 0, -max ≤ a_i ≤ max and dg(a_i)/da_i ≥ 0 for all i (g( ) is a monotonically increasing function). From the above we can also state that dH/dt = 0 if and only if dY_i/dt = da_i/dt = 0 for all i, i.e. the network has reached an EP. Note that net_i = 0 for all i implies not only dH/dt = 0 but also da_i/dt = 0 for all i (see eq. 3.43).
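The monotonicity of eq. 3.45 along trajectories can be spot-checked numerically. The sketch below integrates eq. 3.43 with decay = 0, rest = 0 and the identity output function; the 3-unit weights and external inputs are arbitrary assumptions chosen only for the check.

```python
# Numerical spot-check that the IAC energy H = -1/2 Y^T W Y - ext^T Y
# (eq. 3.45) is non-increasing along eq. 3.43 trajectories when decay = 0,
# rest = 0 and the output is the identity (Y_i = a_i). The 3-unit weights
# and inputs are arbitrary assumptions for the check.
MAXV = 1.0
W = [[0.0, 0.2, -0.1],
     [0.2, 0.0, 0.3],
     [-0.1, 0.3, 0.0]]        # symmetric weight matrix
ext = [0.05, -0.1, 0.2]
N, dt = 3, 0.001

def H(a):
    q = -0.5 * sum(W[i][j] * a[i] * a[j] for i in range(N) for j in range(N))
    return q - sum(ext[i] * a[i] for i in range(N))

a = [0.3, -0.4, 0.1]          # start inside the hypercube [-max, max]^3
prev = H(a)
ok = True
for _ in range(20000):
    net = [sum(W[i][j] * a[j] for j in range(N)) + ext[i] for i in range(N)]
    # Euler step of eq. 3.43 with decay = 0
    a = [a[i] + dt * (net[i] * MAXV - abs(net[i]) * a[i]) for i in range(N)]
    cur = H(a)
    ok = ok and cur <= prev + 1e-9
    prev = cur
print(ok, [round(x, 3) for x in a])
```

The energy never increases (up to Euler integration tolerance) and the state stays inside the hypercube, as the derivation predicts.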
Now we need to deal with the case when the network is initialized outside the hypercube [-max, max]^N, i.e. |a_i| > max for at least one i. If a_i ≥ 0, eq. 3.43 can be written as:

(3.50)   da_i/dt = -net_i (a_i - max) - decay (a_i - rest)   if net_i ≥ 0
         da_i/dt = net_i (a_i + max) - decay (a_i - rest)    if net_i < 0

On the other hand, if a_i < 0, eq. 3.43 can be written as:

(3.51)   da_i/dt = -net_i (a_i - max) - decay (a_i - rest)   if net_i ≥ 0
         da_i/dt = net_i (a_i + max) - decay (a_i - rest)    if net_i < 0

Equations 3.50 and 3.51 show respectively that, given that decay > 0 and |rest| < max: 1) if a_i > max, then da_i/dt < 0; and 2) if a_i < -max, then da_i/dt > 0. In other words, considering the activation space, if the network is outside the hypercube [-max, max]^N and decay > 0, the changes in the activation values are such that, given enough time, the network will reach the borders of the hypercube and we will end up with |a_i| ≤ max. Note that even in the case when decay = 0, the changes in the activation will still drive the network to the borders of the hypercube [-max, max]^N, with the only exception that the network can be trapped in the condition where net_i = 0 (section 3.3.6 shows an example of this case). Once inside or at the borders of the hypercube, the network then seeks the minima of the energy function given by eq. 3.45, given that, among other conditions, decay = 0.
One way to ensure that the energy function given by eq. 3.45 is minimized would be to have decay > 0 whenever |a_i| > max for at least one i and, when |a_i| ≤ max for all i, to set decay to 0. A less complicated way would be to set decay to some small positive value and rest to 0, without having to consider whether the network is inside the hypercube or not. From eq. 3.44 we can see that this will cause only a small perturbation in the position of the EPs that are at the locations where a_i^e = max or -max, assuming that for such EPs the condition decay << |net_i^e| is satisfied. If the EPs that are the solution for the problem satisfy such a condition (in general such information is not available a priori), then we can still consider that the energy function given by eq. 3.45 is being minimized. However, the location and number of the other EPs (the EPs that are not at the corners of the hypercube [-max, max]^N) can change significantly.
A possible interpretation of why decay > 0 brings the network to the borders of the hypercube is that it stops the points where net_i = 0 from being EPs, and from eq. 3.44 we can see that it also forces |a_i^e| < max. However, some of the points that were EPs for decay = 0 can suffer large perturbations if the condition decay << |net_i^e| is not satisfied.
Note that, like the Hopfield network, the IAC network suffers from the possibility of being trapped in a local minimum (instead of converging to the global minimum of the energy function).
A simple modification that makes it easier to analyse the network's dynamic behaviour is to have dg(a_i)/da_i > 0 for all i, instead of dg(a_i)/da_i ≥ 0 (see eq. 3.41), for instance, using the identity function as the output rule: Y_i = a_i for all i. This modification will be used in the next section.
3.3.4 - Considering two units
Consider an IAC network with two units, with min < 0 < max, min = -max, decay ≥ 0, rest = 0, and the output function being the identity function, Y_i = a_i for all i. As usual we will assume that the units do not feed back to themselves, i.e. W_ii = 0, and use W_12 = W_21 = c, where c is the factor of cooperation (c > 0) or competition (c < 0). Figure 3.5 illustrates such a network. From eq. 3.43 we can write:

(3.52)   da_1/dt = (ext_1 + c a_2) max - |ext_1 + c a_2| a_1 - decay a_1

(3.53)   da_2/dt = (ext_2 + c a_1) max - |ext_2 + c a_1| a_2 - decay a_2

Figure 3.5 - The IAC Neural Network with 2 units

We can now consider three main cases ([Nas90], [NaZa92]):
1) external inputs = 0, decay ≥ 0;
2) external inputs ≠ 0, decay = 0;
3) external inputs ≠ 0, decay > 0.
3.3.5 - Case Positive Decay and No External Inputs
Solving eqs. 3.52 and 3.53 for da_i/dt = 0, we have that the EP [a_1^e a_2^e]/max is given by:

(3.54)   a_i^e/max = (c/|c|) (a_j^e/max) / ( |a_j^e/max| + decay/(|c| max) )

where (i,j) = (1,2) or (2,1). Solving the above pair of equations by direct substitution, for 0 ≤ dec ≤ 1 the normalized EPs [ā_1 ā_2] are:
for c > 0: [0 0], [β β], [-β -β]
for c < 0: [0 0], [β -β], [-β β]
where ā_i = a_i/max, i = 1 or 2, β = 1 - dec, and dec = decay/(|c| max) is the normalized decay. Using linearization around the EPs, it is possible to show that the origin is an EP of type saddle while the other 2 EPs are of type stable node. If dec ≥ 1, all 3 EPs collapse
into the origin, which becomes a stable node. Figure 3.6 shows the phase-plane for ext_1 = ext_2 = 0, c > 0 and dec = 0.15, together with some trajectories for different initial activation values. As expected, the EPs are at positions [0 0], [0.85 0.85] and [-0.85 -0.85].

Figure 3.6 - Phase-plane for ext_1 = ext_2 = 0, decay/(c max) = 0.15, c > 0
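The predicted equilibrium points can be reproduced by integrating eqs. 3.52 and 3.53 directly. In this sketch, c = max = 1 is an assumed normalization and the initial states are arbitrary.

```python
# Sketch: integrate the two-unit IAC equations (eqs. 3.52 and 3.53) with
# ext_1 = ext_2 = 0, c = max = 1 and dec = 0.15, and check that trajectories
# converge to the predicted EPs [1-dec, 1-dec] = [0.85, 0.85] or its mirror.
def run(a1, a2, c=1.0, dec=0.15, dt=0.01, steps=20000):
    for _ in range(steps):
        n1, n2 = c * a2, c * a1               # net inputs (ext = 0)
        a1 += dt * (n1 - abs(n1) * a1 - dec * a1)
        a2 += dt * (n2 - abs(n2) * a2 - dec * a2)
    return a1, a2

print(run(0.3, 0.9))      # starts in the 1st quadrant
print(run(-0.2, -0.5))    # starts below the separatrix a2 = -a1
```

A start in the 1st quadrant reaches [0.85 0.85]; a start on the other side of the separatrix reaches [-0.85 -0.85], as the Lyapunov argument of this section predicts.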
We can study the stability and zones of convergence of the stable EPs by defining a Lyapunov function. For instance, assuming c > 0 and 0 ≤ dec ≤ 1 (0 ≤ β ≤ 1, with β = 1 - dec), for the EP [β β] we can define the following Lyapunov function:

(3.55)   V(ā_1, ā_2) = (x_1² + x_2²) / (2 c max)

where x_i = ā_i - β, i = 1,2. Therefore:

(3.56)   dV/dt = [x_1/(c max)] dx_1/dt + [x_2/(c max)] dx_2/dt

Assuming that ā_i > 0, we have that x_i + β > 0, i = 1,2. From eqs. 3.52 and 3.53, for (i,j) = (1,2) and (2,1):

(3.57)   [1/(c max)] dx_i/dt = (1 - β) x_j - x_i x_j - x_i

(3.58)   dV/dt = Σ_{(i,j)=(1,2),(2,1)} [ -x_i² (x_j + 1) + (1 - β) x_i x_j ]

(3.59)   dV/dt = -x_1² (x_2 + 1) - x_2² (x_1 + 1) + 2 (1 - β) x_1 x_2

and therefore dV/dt ≤ 0 for β ≤ 1, i.e. for dec ≥ 0, and dV/dt = 0 implies that x_1 = x_2 = 0 (the possibility x_1 = x_2 = -β is excluded since we assumed that x_i + β > 0). This proves that all trajectories in the 1st quadrant will converge to the EP [β β] if c > 0 and 0 ≤ decay/(c max) ≤ 1.
We can also easily see from eqs. 3.52 and 3.53 that for points in the 2nd and 4th quadrants situated above the line ā_2 = -ā_1, the property dā_2/dā_1 > -1 is valid. Therefore their respective trajectories will enter the 1st quadrant and converge to the EP [β β] (β = 1 - dec), as fig. 3.6 illustrates. The same procedure can be used to prove the stability of the EP [-β -β] and of the EPs for the case c < 0.
From the above we can conclude that the separatrix (the curve that divides the zones of convergence of the 2 stable EPs) for c > 0 is the line ā_2 = -ā_1, and for c < 0 it is the line ā_2 = ā_1. Observe that, in the absence of any external disturbances, if the network is initialized exactly on the separatrix, the activation values will converge to the unstable EP situated at the origin (see fig. 3.6).
3.3.6 - Case of Non-Zero External Inputs With No Decay
Let's assume for simplicity that c > 0 (the case c < 0 is completely analogous) and define the normalized external inputs ēxt_1 and ēxt_2, where ēxt_i = ext_i/(c max), i = 1 or 2. From eqs. 3.52 and 3.53 we can see that, if decay = 0, then da_i/dt = 0 in the following cases:
a) when ā_j = -ēxt_i, the main switching line;
b) if ā_j > -ēxt_i, when ā_i = 1;
c) if ā_j < -ēxt_i, when ā_i = -1;
where (i,j) = (1,2) or (2,1). Therefore an increase of ēxt_1 shifts its associated main switching line ā_2 = -ēxt_1 downwards. Analogously, an increase of ēxt_2 shifts its associated main switching line ā_1 = -ēxt_2 sideways to the left.
The EPs are the points that are common to the above switching lines. The positioning of these switching lines gives rise to 3 main sub-cases that correspond to different regions in figure 3.7:
a) if |ēxt_1| < 1 and |ēxt_2| < 1;
b) if |ēxt_i| > 1 and |ēxt_j| ≠ 1, (i,j) = (1,2) or (2,1);
c) if |ēxt_1| = 1 and/or |ēxt_2| = 1.
Now let's consider each of these cases and their sub-cases:
a) if |ēxt_1| < 1 and |ēxt_2| < 1 (region A in fig. 3.7):
1 EP at [-ēxt_2 -ēxt_1], type saddle,
2 EPs at [ā_1^e ā_2^e] = {[1 1], [-1 -1]}, type stable node.
The phase-plane and the trajectories in this case are similar to those in fig. 3.6, with the difference that the position of the unstable EP is not necessarily located at the origin.

Figure 3.7 - Location of the stable E.P.s
b) if |ēxt_i| > 1 and |ēxt_j| ≠ 1, (i,j) = (1,2) or (2,1) (regions B, C, D, E in fig. 3.7, not including the dashed lines in the middle of regions B and C):
1 unstable EP at [ā_1^e ā_2^e] = [-ēxt_2 -ēxt_1],
1 EP of type stable node, whose location is given by fig. 3.7: [1 1] in region B, [-1 -1] in region C, and the mixed-sign corners [-1 1] and [1 -1] in regions D and E.
Figure 3.8 shows the phase-plane when ēxt_1 = ēxt_2 = 1.5, together with some trajectories for different initial activation values. The EPs are at [1 1] and [-1.5 -1.5].

Figure 3.8 - Phase-plane for ēxt_1 = ēxt_2 = 1.5, decay = 0
c) if |ēxt_1| = 1 and/or |ēxt_2| = 1:
c.1) if |ēxt_i| = 1 and |ēxt_j| ≠ 1, (i,j) = (1,2) or (2,1):
c.1.1) {if ēxt_i = 1 and ēxt_j > -1 and |ēxt_j| ≠ 1} OR {if ēxt_i = -1 and ēxt_j < 1 and |ēxt_j| ≠ 1} (dashed lines in fig. 3.7, not including the circles nor the black squares):
1 EP of type stable node, whose location is given by region B or C in fig. 3.7 (the nearest region),
a semi-line of non-isolated EPs.
Figure 3.9 shows the phase-plane for ēxt_1 = 0, ēxt_2 = -1, together with some trajectories for different initial conditions. The EPs are [-1 -1] and the semi-line ā_1 = 1, ā_2 ≥ 0.
Figure 3.9 - Phase-plane for ēxt_1 = 0, ēxt_2 = -1, decay = 0

c.1.2) {if ēxt_i = 1 and ēxt_j < -1} OR {if ēxt_i = -1 and ēxt_j > 1} (borders of regions D and E in fig. 3.7 (solid lines), not including the circles):
no stable isolated EPs,
a semi-line of non-isolated EPs.
Figure 3.10 shows the phase-plane for ēxt_1 = -1, ēxt_2 = 1.5 and some trajectories.
Figure 3.10 - Phase-plane for ēxt_1 = -1, ēxt_2 = 1.5, decay = 0

c.2) if |ēxt_1| = |ēxt_2| = 1:
c.2.1) if (ēxt_1, ēxt_2) = (1,1) or (-1,-1) (black squares in fig. 3.7):
1 EP of type stable node, whose location is given by region B or C in fig. 3.7 (the nearest region),
2 orthogonal semi-lines of non-isolated EPs.
Figure 3.11 shows the phase-plane for ēxt_1 = ēxt_2 = 1 and some trajectories.

Figure 3.11 - Phase-plane for ēxt_1 = ēxt_2 = 1, decay = 0

c.2.2) if (ēxt_1, ēxt_2) = (1,-1) or (-1,1) (circles in fig. 3.7):
no stable isolated EP,
2 orthogonal semi-lines of non-isolated EPs.
Figure 3.12 shows the phase-plane for ēxt_1 = 1, ēxt_2 = -1 and some trajectories.

Figure 3.12 - Phase-plane for ēxt_1 = 1, ēxt_2 = -1, decay = 0
We can determine the zones of attraction of the stable EPs by calculating analytically the equation of the trajectory that converges to the unstable EP. This equation can be obtained by combining eqs. 3.52 and 3.53, using decay = 0, and solving the ordinary differential equation dā_2/dā_1 = F(ā_1, ā_2, ēxt_1, ēxt_2). In this case it is easy to find the equation of the trajectory, since the variables ā_1 and ā_2 are separable. If we define

S_1 = sign(ēxt_2 + ā_1),   S_2 = sign(ēxt_1 + ā_2)

(sign(0) = 0) and assume that S_1 ≠ 0 and S_2 ≠ 0, the equation of the trajectory is:

(3.60)   S_1 [ā_2 - ā_2(0)] + (1 + S_1 ēxt_1) ln[ (1 - S_1 ā_2) / (1 - S_1 ā_2(0)) ] =
         S_2 [ā_1 - ā_1(0)] + (1 + S_2 ēxt_2) ln[ (1 - S_2 ā_1) / (1 - S_2 ā_1(0)) ]
The curves that separate the zones of convergence are asymptotes that can be calculated by solving the above nonlinear equation, i.e. finding ā_2(0) given ā_1(0), ā_1 = ā_1*, ā_2 = ā_2*, ēxt_1 and ēxt_2, where a* is the unstable EP. The asymptotes that leave the unstable EP and converge to the stable EP can also be calculated by the same method, with a* taken as the stable EP. Figure 3.13 shows the asymptotes for some cases with |ēxt_1| < 1 and |ēxt_2| < 1, where eq. 3.60 was solved numerically.
Figure 3.13 - Asymptotes for ēxt_1 ≠ 0 and/or ēxt_2 ≠ 0, with decay = 0

As before, we can prove the stability of the EPs by defining a Lyapunov function. For instance, fig. 3.7 shows that if we assume that c > 0, ēxt_1 > 1 and ēxt_2 > -1 (region B), then the point [ā_1 ā_2] = [1 1] is an EP of type stable node. To prove its stability, we can define the same Lyapunov function defined in eq. 3.55, where x_i is now defined as x_i = ā_i - 1, i = 1,2. Assuming that ā_1 > -ēxt_2 and ā_2 > -ēxt_1 (our region of interest), we can easily show that:
(3.61)   dV/dt = -x_1² (ēxt_1 + x_2 + 1) - x_2² (ēxt_2 + x_1 + 1) ≤ 0

Since V > 0 and dV/dt < 0 in our region of interest, except at the origin x_1 = x_2 = 0 where V = dV/dt = 0, this proves the asymptotic stability of the EP [1 1]. The same procedure can be used to prove the stability of the EP [-1 -1] and of the EPs for the case c < 0.
3.3.7 - Case of Non-Zero External Inputs and Positive Decay
Again, without loss of generality, let's assume that c > 0. From eqs. 3.52 and 3.53 we have that da_i/dt = 0 when:

(3.62)   ā_i = (ēxt_i + ā_j) / ( |ēxt_i + ā_j| + dec )

where (i,j) = (1,2) or (2,1). Since the EPs are the points where da_1/dt = 0 and da_2/dt = 0, they can be calculated by combining eq. 3.62 for (i,j) = (1,2) with eq. 3.62 for (i,j) = (2,1). This means that we need to find the real-valued roots of the following quadratic polynomial:
(3.63)   P(ā_1) = (S_1 S_2 ēxt_1 + S_2 dec + S_1) ā_1²
                + (S_1 S_2 ēxt_1 ēxt_2 + S_1 ēxt_1 dec + S_2 ēxt_2 dec + dec² + S_1 ēxt_2 - S_2 ēxt_1 - 1) ā_1
                - (S_2 ēxt_1 ēxt_2 + ēxt_1 dec + ēxt_2) = 0

where:

(3.64)   S_i = 1 if ēxt_i + ā_j ≥ 0;  S_i = -1 otherwise.

However, we don't know a priori the values of S_1 and S_2. Therefore we apply the
following algorithm:
Step 1) Assume that S_1 = 1 and S_2 = 1.
Step 2) Find the roots of P(ā_1) and reject the complex roots.
Step 3) For each real-valued root, check if ēxt_2 + root ≥ 0. If YES, accept this root; otherwise reject it.
Step 4) For each accepted root, use eq. 3.62 to calculate the corresponding value of ā_2.
Step 5) Check if ēxt_1 + ā_2 ≥ 0. If YES, accept this value; otherwise reject it.
Step 6) Assume the other combinations for (S_1, S_2), calculate the possible values of ā_1 and ā_2, and check whether the assumptions for S_1 and S_2 are satisfied.
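A sketch of this sign-enumeration algorithm is given below. Rather than transcribing eq. 3.63 literally, the quadratic coefficients A, B, C are re-derived from eq. 3.62 (substituting one equation into the other), so they should be read as an assumed-equivalent form of the polynomial; the helper name find_eps is ours.

```python
import math

def find_eps(ext1, ext2, dec, tol=1e-9):
    """Enumerate (S1, S2) sign assumptions and keep the consistent roots."""
    eps = set()
    for S1 in (1, -1):
        for S2 in (1, -1):
            # Quadratic A*a1^2 + B*a1 + C = 0, re-derived from eq. 3.62
            # (an assumed-equivalent form of eq. 3.63).
            A = S1 * S2 * ext1 + S2 * dec + S1
            B = (S1 * S2 * ext1 * ext2 + S1 * ext1 * dec + S2 * ext2 * dec
                 + dec * dec + S1 * ext2 - S2 * ext1 - 1)
            C = -(S2 * ext1 * ext2 + ext1 * dec + ext2)
            disc = B * B - 4 * A * C
            if abs(A) < tol or disc < 0:     # degenerate or complex roots
                continue
            for a1 in ((-B + math.sqrt(disc)) / (2 * A),
                       (-B - math.sqrt(disc)) / (2 * A)):
                s = ext2 + a1                # Step 3: verify assumed sign S2
                if not (s >= -tol if S2 == 1 else s <= tol):
                    continue
                a2 = (ext2 + a1) / (S2 * (ext2 + a1) + dec)   # Step 4, eq. 3.62
                t = ext1 + a2                # Step 5: verify assumed sign S1
                if not (t >= -tol if S1 == 1 else t <= tol):
                    continue
                eps.add((round(a1, 6), round(a2, 6)))
    return sorted(eps)

print(find_eps(0.0, 0.0, 0.15))
```

For zero external inputs and dec = 0.15 this recovers the three EPs of section 3.3.5: the saddle at the origin and the stable nodes at ±(1 - dec) = ±0.85.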
Figure 3.14 illustrates how the values of ēxt_1 and ēxt_2 affect the curves da_1/dt = 0 and da_2/dt = 0 (dec = 0.15). We can see that an increasing ēxt_1 shifts the curve da_1/dt = 0 to the left, and an increasing ēxt_2 shifts the curve da_2/dt = 0 downwards. Figure 3.14 also shows that there are only three possible cases for the EPs, since the curves da_1/dt = 0 and da_2/dt = 0 cross each other 1, 2 or 3 times:

Figure 3.14 - Curves da_1/dt = 0 and da_2/dt = 0 for several external inputs and dec = 0.15
1) If the curves da
1
/dt = 0, da
2
/dt = 0 cross each other three times, then we
-1.5
-1
-0.5
0
0.5
1
1.5
-1.5 -1 -0.5 0 0.5 1 1.5
o o o o o o o o o o o o o
o o o o o o o o o o o o o
o o o o o o o o o o o o o
o o o o o o o o o o o o o
o o o o o o o o o o o o o
o o o o o o o o o o o o o
o o o o o o o o o o o o o
o o o o o o o o o o o o o
o o o o o o o o o o o o o
o o o o o o o o o o o o o
o o o o o o o o o o o o o
o o o o o o o o o o o o o
o o o o o o o o o o o o o
a1/max
a
2
/
m
a
x
-1.5
-1
-0.5
0
0.5
1
1.5
-1.5 -1 -0.5 0 0.5 1 1.5
Figure 3.15 - The case where one of the E.P.s is
over the separatrix
have 2 EPs type stable node and 1 EP type saddle. This is the case if ext
1
, ext
2
and
dec are not large.
2) If the curves da1/dt = 0, da2/dt = 0 cross each other only once, then we have only 1 EP, of type stable node. This is the case if ext1 or ext2 or dec is large.
3) The curves da1/dt = 0, da2/dt = 0 cross each other once (the EP of type stable node) and touch each other at another point (the point that is on the separatrix). In this case all trajectories with initial conditions on one side of the separatrix will converge to the EP that is on the separatrix. All trajectories with initial conditions on the other side of the separatrix will converge to the stable EP. Figure 3.15 illustrates this case, when ext1 = ext2 = 0.3754 and dec = 0.15. The EPs are a1e = a2e = 0.894 and a1e = a2e = 0.613.
Some important points are:
a) There will always be 1 or 2 stable EPs and not more than 1 unstable EP, since the curves da1/dt = 0, da2/dt = 0 cross each other 1, 2 or 3 times;
b) All EPs will be such that |a1e| < 1 and |a2e| < 1 (see eq. 3.62);
c) if dec ≥ 1, then there is only 1 EP and it is an EP of type stable node. This can be verified through visual inspection of figure 3.16, which shows the curves da1/dt = 0, da2/dt = 0 for dec = 1.

Figure 3.16 - Curves da1/dt = 0 and da2/dt = 0 for several external inputs and dec = 1

The theoretical way to prove that there is only one EP in this case would be to show that if dec ≥ 1, then for any real values of ext1 and ext2: 1) the quadratic polynomial expressed in eq. 3.63 has only one real-valued root, which is a1e; and 2) the application of the algorithm proposed in this section also gives a valid value for a2e.
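The EP structure described above can also be probed numerically. The sketch below uses an assumed form of the two-unit IAC dynamics (the update rule, the mutually excitatory weight c = 1 and all function names are illustrative assumptions, not the exact equations of this chapter) and shows that, for ext1 = ext2 = 0 and dec = 0.15, trajectories settle on one of two symmetric stable EPs, with the saddle at the origin between them:

```python
import numpy as np

def iac_step(a, ext, c=1.0, dec=0.15, mx=1.0, dt=0.05):
    # One Euler step of an assumed IAC update for a 2-unit network:
    # da_i/dt = net_i*(max - a_i) - dec*a_i   if net_i > 0
    #         = net_i*(a_i + max) - dec*a_i   otherwise  (min = -max)
    net = c * a[::-1] + ext                  # net_i = c*a_j + ext_i
    da = np.where(net > 0, net * (mx - a), net * (a + mx)) - dec * a
    return a + dt * da

def settle(a0, ext, steps=4000):
    a = np.array(a0, dtype=float)
    for _ in range(steps):
        a = iac_step(a, ext)
    return a

ext = np.array([0.0, 0.0])
ep_pos = settle([0.5, 0.5], ext)    # -> stable EP near (0.85, 0.85)
ep_neg = settle([-0.5, -0.5], ext)  # -> stable EP near (-0.85, -0.85)
# For this assumed update the symmetric EPs satisfy a*(1 - a) = dec*a,
# i.e. a = 1 - dec = 0.85; the origin itself is the unstable saddle.
print(ep_pos, ep_neg)
```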
3.3.8 - A Two-Bit Analog-Digital Converter using the IAC network
As in the case of the Hopfield network, the IAC network can be used as an analog-digital converter, since this task can be posed as an optimization problem. Using the IAC network, the solution can be obtained by following the same procedure proposed by Tank and Hopfield for Hopfield networks ([TaHo86], [Zur92]).
In this example we will use an IAC network with 2 units, and therefore the A/D converter has a 2-bit resolution. However, the same principle can be used with more units to increase the resolution of the A/D converter. The parameters in this case are max = 1, decay = 0, Wii = 0 for i = 1,2 and Wij = c for (i,j) = (1,2) and (2,1), i.e. W is symmetrical with zero diagonal entries. We will assume that the network is always initialized within the hypercube [-max max]^2. As the output function we can use Yi(ai) = ai, and therefore the network output will be bipolar (-1 or 1) instead of binary (0 or 1). If desired, we could easily force the network to have binary outputs by defining Yi(ai) = (ai + 1)/2.
Denoting by x the input analog value, the desired input-final output mapping that the network should produce is: (x → a2, a1) = (0 → -1,-1), (1 → -1,1), (2 → 1,-1) and (3 → 1,1). The corresponding decimal value d for a network output is given by: d = 1.5 + a2 + a1/2. The network should minimize the square of the conversion error E1(t), where:

(3.65)  E1(t) = (x - d)^2 = x^2 - 2x(1.5 + a2 + a1/2) + (1.5 + a2 + a1/2)^2
To determine the weights and external inputs of the network we compare the function E1(t) that we want to minimize with the "energy" function H(t) given by eq. 3.50. In this case the function H(t) is given by:

(3.66)  H(t) = -c a1 a2 - ext1 a1 - ext2 a2

Since H(t) does not contain terms in ai^2, we need to modify E1(t) in order to eliminate such terms, but in such a way that the resultant function is still non-negative and has the correct local minima. This also has to be done when using the Hopfield network, and we just need to adapt the procedure adopted there to this case [TaHo86]. The solution is to define the function to be minimized E(t) as E(t) = E1(t) + E2(t) with:
(3.67)  E2(t) = - Σ_{i=1}^{2} 4^(i-2) (ai + 1)(ai - 1)

Since we assume that the network was initialized within the hypercube [-1 1]^2 and will remain within or at the borders of the hypercube, the function E2(t) is always positive except at the corners of the hypercube, where it is zero. The coefficients of E2(t) could be any negative values, but in this case they were chosen in order to cancel the terms in ai^2 in E1(t). Therefore:
(3.68)  E(t) = x^2 - 3x + 14/4 + a1 a2 + (1.5 - x) a1 + (3 - 2x) a2

Finally, comparing H(t) with E(t) and ignoring the term x^2 - 3x + 14/4 since it is a constant, we have that: c = -1, ext1 = x - 1.5, ext2 = 2x - 3.
In section 3.3.6 we analyzed such a network, but for c > 0. In order to use those results here we just need to rotate our coordinate system 90 degrees, so that the line a2 = -a1 (the position of the EPs for ext1 = ext2 = 0) becomes a2 = a1. If we rotate -90 degrees then we have a2NEW = a1OLD and a1NEW = -a2OLD. The desired input-final output mapping is then: (x → a2NEW, a1NEW) = (0 → -1,1), (1 → 1,1), (2 → -1,-1) and (3 → 1,-1). Dropping the superscript "NEW", the corresponding decimal value d for a network output is now given by: d = 1.5 - a1 + a2/2. The function E(t) is now:
(3.69)  E(t) = (x - d)^2 - Σ_{i=1}^{2} 4^(1-i) (ai + 1)(ai - 1)

Again comparing H(t) with this new definition of E(t), and ignoring the constant term that is a function only of x, we have that: c = 1, ext1 = 3 - 2x, ext2 = x - 1.5.
From section 3.3.6 we know that such a network will produce the stable EPs in the desired locations. Note that: a) ext1 and ext2 can be seen as lines parametrized in x, and therefore we can write ext2 = -ext1/2; and b) if 1 ≤ x ≤ 2, then |ext1| ≤ 1 and |ext2| ≤ 1. Referring to fig. 3.7, we will have an EP of type saddle when in region A (|ext1| < 1), or semi-lines of EPs when over the dashed lines that form the borders between regions A-B (ext1 = 1) and A-C (ext1 = -1). The semi-lines of EPs can be eliminated, without moving the position of the stable EPs significantly, by using a very small decay such as 0.01.
The existence of the saddle point results in the problem that the stable EP to which the network converges is determined by the point at which the network was initialized. For instance, for x = 1.5, the line a2 = -a1 divides the two zones of convergence (also called basins of attraction) for the two stable EPs at (1,1) and (-1,-1). Lee and Sheu ([LeSh91],[LeSh92],[YuNe93]), when using a Hopfield network as an A/D converter, showed how to modify the Hopfield network in order to eliminate such saddle points. Consequently the EP to which the network converges does not depend on where the network was initialized, and there is therefore only one possible network response. Perhaps an equivalent modification can be proposed for the IAC network.
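The converter can be sketched numerically. The code below is an illustration under an assumed IAC update rule (the update form, the Euler step and the helper name are assumptions; c = 1, ext1 = 3 - 2x, ext2 = x - 1.5 and the decoding d = 1.5 - a1 + a2/2 follow the derivation above, with the small decay suggested to remove the semi-lines of EPs):

```python
import numpy as np

def iac_ad_convert(x, c=1.0, dec=0.01, mx=1.0, dt=0.05, steps=6000):
    # Settle a 2-unit network wired as a 2-bit A/D converter, starting
    # from rest (a1 = a2 = 0), under an assumed IAC update:
    # da_i/dt = net_i*(max - a_i) - dec*a_i if net_i > 0, else
    #           net_i*(a_i + max) - dec*a_i, with net_i = c*a_j + ext_i.
    ext = np.array([3.0 - 2.0 * x, x - 1.5])
    a = np.zeros(2)
    for _ in range(steps):
        net = c * a[::-1] + ext
        da = np.where(net > 0, net * (mx - a), net * (a + mx)) - dec * a
        a = a + dt * da
    a1, a2 = a
    return 1.5 - a1 + a2 / 2.0          # decoded decimal value d

for x in range(4):
    print(x, round(iac_ad_convert(x)))  # each integer input decodes to itself
```

Note that for analog inputs near the half-points (e.g. x = 1.5) the response depends on the initial conditions, as discussed above.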
3.4 - Conclusions
In this chapter we demonstrated how feedback networks can be used as
associative memories or to solve minimization problems. The Hopfield and IAC neural
networks were presented and analyzed.
The main contribution of this chapter is to show that the IAC network can also
be used to solve minimization problems, and as such it is an alternative to Hopfield
networks. As an example we showed how to implement a 2-bit analog-digital converter.
Chapter 4 - Faster Learning by
Constraining Decision Surfaces
In chapter 2 we pointed out that one of the main problems with the current
feedforward ANN models is that they take too long to be trained by the training
algorithms in use today. Therefore one active area of research is the development of new
methods to increase the learning speed of feedforward ANN models, mainly the multi-
layer Perceptron since this is the most popular feedforward model. At the end of chapter
2 we mentioned some methods that can be used to try to speed up learning in
feedforward ANNs, i.e. a) without adapting the network topology, for instance, by using
adaptive learning rates or second-order algorithms; or b) adapting the network topology,
such as the Cascade-Correlation Learning algorithm [FaLe90].
In this chapter we propose an alternative method that aims to speed up learning
by constraining the weights arriving at the hidden units of a multi-layer feedforward
ANN. The method is concerned with the case where the hidden units have sigmoidal
functions, such as in the multi-layer Perceptron.
The basic idea of the proposed method is based on the observation that one condition that is necessary, but not sufficient, for a feedforward multi-layer ANN to learn a specific mapping is to have the decision surfaces defined by the hidden units within or close to the boundaries of the network input space. The hidden units then will not have a constant output value and cannot simply be substituted by the addition of a bias to the output units. Since it is quite reasonable to know beforehand the range of the network input values, we can assume that the network input space is also known. The proposed method then simply checks the above condition and resets those hidden units with decision surfaces outside a valid region. This approach also leads to a new method for initializing the weights of the ANN.
We show different methods for initializing, and constraining during training, the locations of the decision surfaces. We also show how one can adjust the inclination of the decision surfaces. In the simulation section the proposed method is illustrated for the case where an ANN is trained to perform the nonlinear mapping sin(x) over the range -2π to 2π. This example uses the Back-Propagation algorithm to train the network, but the proposed method can be used with any other algorithm that adjusts the weights directly without imposing constraints on the decision surfaces.
The proposed method can be applied to any unit as long as the decision surface
associated with such a unit is a hyperplane, e.g. sigmoid or hyperbolic tangent units.
Therefore the ANN can have more than one hidden layer of units and it does not need
to be a strictly feedforward ANN. Note that by constraining the decision surfaces we are
in effect indirectly constraining the hidden unit weights.
4.1 - Initialization Procedures
In chapter 2 we saw that if a unit has its output defined as: 1) an increasing function of its net input with saturation above an upper limit and below a lower limit (a sigmoidal unit, for instance sigmoid and hyperbolic tangent units); and 2) its net input defined as a linear combination of the unit inputs; then the decision surface of this unit is the hyperplane defined by:

w_i1 x1 + w_i2 x2 + .... + w_iNx xNx + bias_i = 0

where w_ij and bias_i are the unit's incoming weights and bias, x_j, j=1,...,Nx, are the unit inputs and Nx is the number of inputs received by the unit. A fundamental component of the learning process is the correct placement and inclination of these decision surfaces in the network input space.
The simplest case is a network with just a single layer of sigmoidal hidden units and an output layer of linear units. In order to perform the desired mapping correctly, the hidden unit weights, i.e. the weights received by the hidden units, have to be such that the decision surface of each hidden unit has the correct position and inclination. The role of the output unit weights is then to form the correct combination of the hidden unit outputs.
The problems to which this type of ANN can be applied can be divided into two classes: a) pattern recognition, where the inputs and desired outputs are binary (0 or 1) or bipolar (-1 or 1); and b) function mapping, where the inputs and desired outputs are real numbers. An important difference is that, in general, there is more freedom in the placement and inclination of the decision surfaces in the former case (vide the XOR problem) than in the latter. In other words, the input-output mapping is less sensitive to the decision surfaces in the former case than in the latter.
4.1.1 - The Standard Initialization Procedure
The standard and widely used procedure to initialize all the weights and biases of a feedforward multi-layer ANN, such as the Multi-Layer Perceptron, is simply to set all weights and biases to small random values [RHW86] using a normal or uniform distribution. The justification for using small values is to avoid saturation of the units, since saturated units operate in the regions where the derivative of the unit output function is very small; consequently, if the network is trained by the BP algorithm (or another algorithm that uses first-derivative information), training will be very slow.

One problem with such a procedure is that it does not take into consideration the size of the network inputs when choosing how spread out the random weights should be. The ANN literature contains a few alternative procedures, such as the ones proposed by Nguyen and Widrow [NgWi89] and Drago and Ridella [DrRi92].
Assuming that the network input space has dimension 1, if we use a gaussian distribution with zero mean to generate the weight from the input unit to each hidden unit and the bias of each hidden unit, the position of the decision surface (in this case a point on a horizontal line) will be given by: x_DS = -bias_i / w_i, where here i specifies the hidden unit number. Assuming that bias_i and w_i are independent random variables, x_DS has a Cauchy distribution centred at zero [Pap84].
Figure 4.1 shows the histogram of x_DS calculated as above by using the quotient of 1000 computer-generated samples of two (assumed independent) gaussian random variables with zero mean and variance 1.
If the network has 2 inputs then the decision surface (in this case a line) for hidden unit i will be described by w_i1 x1 + w_i2 x2 + bias_i = 0. Figure 4.2 shows 100 lines generated by defining the coefficients w_i1, w_i2 and bias_i as gaussian random variables with zero mean and standard deviation 0.1.
From figures 4.1 and 4.2 we can see that most of the decision surfaces will be concentrated around the origin. It is also very important to note that, depending on how large the input space is, even if we consider it centred around the origin, there is the possibility that some of the decision surfaces will fall outside the input space. In this case, these units will produce a near-constant output for inputs within the valid input space, especially if their decision surfaces have a steep inclination. Such units will therefore not be performing any useful computation and can be substituted by a constant term added to the bias of the output units. If we train a network initialized in this way with a gradient-based algorithm (such as the Back-Propagation algorithm), such a hidden unit will take a long time to change its weights, since the unit will be operating in a region where its derivative is very low.

Fig. 4.1 - Histogram of a variable defined as the quotient of two gaussian random variables with zero mean and variance 1 (965 of the samples displayed; min = -1518.5099, max = 779.7392)

Fig. 4.2 - 100 decision surfaces generated by the standard initialization procedure
If we consider in fig. 4.2 that the valid range for the variables x1 and x2 is [-10 10], most of the decision surfaces will be within a small distance from the origin. If we have no prior knowledge about the correct positioning of the decision surfaces, this seems difficult to justify. On the other hand, if the valid range for x1 and x2 is [-1 1], there will be several hidden units with their decision surfaces outside the valid network input space. We get very similar results if, instead of a gaussian distribution, we use a uniform distribution.
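A quick way to see this heavy-tailed behaviour is to sample the quotient -bias/w directly (an illustrative sketch; the variable names are assumptions). For a standard Cauchy variable, half of the probability mass lies outside [-1, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
w = rng.standard_normal(n)      # weight from the single input to each unit
bias = rng.standard_normal(n)   # bias of each unit
x_ds = -bias / w                # position of each 1-D decision surface

# The quotient of two independent zero-mean gaussians is Cauchy distributed:
# P(|x_ds| > 1) = 0.5, so half the surfaces fall outside a [-1, 1] input range.
frac_outside = np.mean(np.abs(x_ds) > 1.0)
print(frac_outside)
```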
It is possible to explore the concept of decision surfaces to get better initialization procedures. In this case we only need to make use of information that is normally available, that is, the valid range of the network input variables.

If no previous information is available about the desired location of the decision surfaces for a particular problem (the normal case), then it is reasonable to argue that the available decision surfaces should be uniformly distributed over the valid input space. Therefore, instead of generating the weights and biases using a particular random distribution and then calculating from them the positions of the decision surfaces over the input space, we propose to do the opposite. We generate the locations of the decision surfaces using an appropriate random distribution and then calculate the weights and biases associated with each decision surface. Finally, we adjust the inclination of each decision surface, also considering the size of the input space, to avoid very large inclinations that would result in slow adaptation.
If the output units are linear we cannot associate a decision surface with them
and therefore we propose to still generate their weights and biases using a random
distribution (gaussian or uniform) with zero mean. On the other hand, if the output units
are sigmoidal, the same method that is used to generate the weights and biases can be
applied with the only difference that the input space for the output units is now the
output space of the hidden units. If there are direct connections from the network input
units to the output units the input space of the output units also includes the network
input space.
We propose two new procedures to initialize the network based on this idea of
initializing the decision surfaces.
4.1.2 - The First Initialization Procedure
One way to initialize the decision surfaces over the valid input space is to select (using a uniform distribution) a sufficient number of points to define a decision surface, in this case a hyperplane. Since each point inside the valid input space has the same probability of being chosen as any other valid point, the locations of the corresponding decision surfaces will also be uniformly distributed over the input space. The first step is therefore to obtain, from the set of selected points, the equation of the decision surface, and from that the values for the weights and the bias.

Since the network input space has dimension Nx, we select at random, from a uniform distribution, Nx points within the valid input range, since Nx points are needed to define a decision surface uniquely. Since these points belong to the decision surface,
they must satisfy the equation of the hyperplane (for simplicity we drop the subscript i):

w1 x1 + w2 x2 + .... + wNx xNx + bias = 0
Since we have Nx points, we have to solve a system of linear equations X̃W̃ = 0, where each selected point (augmented with a 1 for the bias) defines a row of the matrix X̃, W̃ = [w1 ... wNx bias]^T and 0 = [0 ... 0]^T. We then have Nx equations and Nx+1 unknowns, so it is necessary to add another constraint. There are several possibilities, such as forcing one of the weights or the bias to be equal to 1 by adding the constraint w1 = 1 (or bias = 1). Here we use:

w1 + w2 + .... + wNx + bias = Nx

The particular constraint used is not relevant, as long as it is a valid one (bias = 0 is not a valid constraint), since it only determines the inclination, and we propose to normalize the inclination in a later step.
Using the above procedure, the following steps are used to initialize a
feedforward ANN with one hidden layer of sigmoidal units and linear output units:
Step 1) Initialize the weights that the output units receive and the bias of
the output units as small random values. Observe that the output units
can receive weights directly from the input units as well.
Step 2) For each unit in the hidden layer:
2.1 - Select at random Nx points within the valid input range (Nx = number of input units). All points within the valid input space have the same probability of being selected. Let us use the following notation to denote each of these points and their components:

X^i = [ x^i_1  x^i_2  ...  x^i_Nx ]
2.2 - The weights that connect the input units to this hidden unit and the
bias of this hidden unit are calculated as the solution of the following set
of linear equations:
(4.1)

| x^1_1   x^1_2   ...  x^1_Nx   1 |   | w_i1   |   | 0  |
| x^2_1   x^2_2   ...  x^2_Nx   1 |   | w_i2   |   | 0  |
|  ...     ...    ...   ...    ...| . |  ...   | = | ...|
| x^Nx_1  x^Nx_2  ...  x^Nx_Nx  1 |   | w_iNx  |   | 0  |
|   1       1     ...    1      1 |   | bias_i |   | Nx |

Figure 4.3 shows the locations of 100 decision surfaces obtained by using this procedure. It is important to notice that the decision surfaces are equally spread over the input space and that, due to the method, all of them are guaranteed to cross the valid input space.
Using this method we have to solve a system of linear equations of dimension Nx+1 for each network unit. If the unit receives inputs from several other units (for instance if the network has a large number of inputs) or if the number of units to be initialized is very large, this method can be too computationally demanding. In order to minimize this problem we propose a second initialization procedure. The method used to initialize the weights received by the linear output units is, however, the same.

Fig. 4.3 - 100 decision surfaces generated by the first initialization procedure.
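The procedure can be sketched in a few lines (a minimal illustration; the function name and the input range are assumptions, and the last row of the system implements the constraint w1 + w2 + ... + wNx + bias = Nx):

```python
import numpy as np

def init_hidden_unit(rng, nx, lo=-10.0, hi=10.0):
    # Sample Nx points uniformly in the valid input range; the decision
    # surface is the hyperplane through them.
    pts = rng.uniform(lo, hi, size=(nx, nx))
    A = np.ones((nx + 1, nx + 1))
    A[:nx, :nx] = pts                  # each point (augmented with 1) is a row
    rhs = np.zeros(nx + 1)
    rhs[-1] = nx                       # constraint row: sum of unknowns = Nx
    sol = np.linalg.solve(A, rhs)
    return sol[:nx], sol[nx], pts      # weights, bias, sampled points

rng = np.random.default_rng(0)
w, b, pts = init_hidden_unit(rng, nx=2)
print(pts @ w + b)                     # ~ [0, 0]: the points lie on the surface
```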
4.1.3 - The Second Initialization Procedure
Instead of selecting Nx points within the valid input range and then using these points to generate a decision surface, another method is simply to select at random only one point (all points within the valid input space being equally probable) and then to select a vector with random components such that the direction in which this vector points is random. The decision surface is then defined as the hyperplane that passes through the selected point and to which the selected vector is normal.
One way of generating this vector normal to the decision surface is firstly to fix its length; then, each time a new component of the vector is to be defined, the limit for the size of that component is calculated from the total length decreased by the sum of the squares of the components already generated. That is, if Ns is the desired length for the vector V:

(4.2)  Ns^2 = V1^2 + V2^2 + ... + VNx^2

The first component V1 can be generated as a uniform random variable in the interval [-Ns Ns]. The second component V2 is chosen in the range [-(Ns^2 - V1^2)^(1/2)  (Ns^2 - V1^2)^(1/2)], and so on until V(Nx-1) is generated. The last component of V is calculated such that V has the desired length:

(4.3)  VNx = ±(Ns^2 - V1^2 - V2^2 - ... - V(Nx-1)^2)^(1/2)

where the positive and negative signs each have a probability of 50%.

Figure 4.4 - Geometrical interpretation of the initialization procedure
A more direct way of generating this vector normal to the decision surface is to use the concept of circular symmetry of random variables [Pap84]. This concept states that if we have several independent normal random variables with zero mean and equal variance, then these random variables are circularly symmetric and their joint statistics depend only on the distance from the origin. Therefore all points that have the same distance from the origin are equally probable. This implies that, if the components of V are generated using such a concept, for a given magnitude all directions will be equally probable.
Suppose for the moment that the vector V has been scaled to an arbitrary non-zero magnitude. Denoting the selected point by X*, all points X that belong to the decision surface satisfy the equation: (X - X*)^T V = 0. Figure 4.4 gives a geometrical interpretation of this equation. Comparing this equation with the equation for the decision surface (which is defined by the incoming weights to the unit and the associated bias as W^T X + bias = 0, where W = [w1 w2 ... wNx]^T) we have:

W = V,  bias = -X*^T V
Comparing this method with the standard method (assuming that the standard method uses a gaussian distribution), we can see that the difference is the way in which the bias term is initialized. Figure 4.5 shows the locations of 100 decision surfaces where V was generated using random values with gaussian distribution, zero mean and 0.1 as the standard deviation. Again, using the above method we guarantee that the decision surfaces will cross the valid input space, since the selected point X* was chosen from the set of points that are within the valid input space.

Figure 4.5 - 100 decision surfaces generated by the second initialization procedure.
Note that in this procedure we have assumed that the vector V (and therefore the
weight vector W for the unit as well) has been scaled to an arbitrary non-zero
magnitude. It is such a magnitude that dictates the inclination of the decision surface.
In the next sub-section we propose to adjust this inclination as the last step (Step 3) for
both initialization procedures suggested here.
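The second procedure reduces to a few lines (an illustrative sketch; the names and the input range are assumptions, and the gaussian components give the circularly symmetric direction discussed above):

```python
import numpy as np

def init_unit_second(rng, nx, lo=-10.0, hi=10.0):
    # Pick one point uniformly in the valid input space; the decision
    # surface is the hyperplane through it with a random normal direction.
    x_star = rng.uniform(lo, hi, size=nx)
    v = rng.standard_normal(nx)      # circularly symmetric => random direction
    w = v
    bias = -x_star @ v               # so that (x - x_star)^T v = 0 on the surface
    return w, bias, x_star

rng = np.random.default_rng(1)
w, bias, x_star = init_unit_second(rng, nx=3)
print(w @ x_star + bias)             # ~ 0: the chosen point lies on the surface
```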
4.1.4 - Adjusting the Inclinations of the Decision Surfaces
In order to adjust the inclination of the decision surface we simply adjust the
variation of the output of the unit for a given variation of the unit input.
The variation of the unit input is specified by choosing any point X1 on the decision surface and another point X2 such that ΔX = X2 - X1 is orthogonal to the decision surface.
For convenience, assuming without loss of generality that the center of the valid input space is the origin, we can choose X1 to be the point belonging to the decision surface that is closest to the origin. As we have seen in the previous sub-section, the unit weight vector (the incoming weights) W is orthogonal to the decision surface, and consequently X1 = γW for some scalar γ. Since X1 belongs to the decision surface, W^T X1 + bias = 0. Combining these two equations, we have that:

(4.4)  γ = -bias / ||W||^2

where ||W|| = (W^T W)^(1/2) is the length of the weight vector, so that:

(4.5)  X1 = -(bias / ||W||^2) W

The point X2 is then defined as:

(4.6)  X2 = X1 + Ks Us W / ||W||

where Ks and Us are scalar parameters such that Ks > 0, Us > 0. The parameter Us is set to the distance from the origin to the most distant corner of the valid input space and therefore gives a measure of the size of the input space. The parameter Ks is therefore the length of the vector ΔX = X2 - X1 in "Us" units.
Once we have selected a value for Ks, for a given unit weight vector and bias the variation of the output of the unit is simply calculated as:

(4.7)  ΔF = F(net2 / T) - F(net1 / T)

where F(net/T) is the unit function, e.g. sigmoid or hyperbolic tangent, T is the fixed parameter called temperature, T > 0, net1 = W^T X1 + bias and net2 = W^T X2 + bias. Note that: a) net1 is by definition 0, since X1 is on the decision surface; and b) for a unit with sigmoid function or hyperbolic tangent, F(net1) = 0.5 or 0 respectively.
The objective is to find a scalar positive gain Kw such that, when it is used to scale the unit weight vector and the unit bias, the unit will have the desired output variation ΔFdes > 0 for a given input variation specified by Ks. From eq. 4.7, Kw can be calculated for a sigmoid unit using:

(4.8)  Kw = (T / net2) ln[ (0.5 + ΔFdes) / (0.5 - ΔFdes) ]

Using the expression tanh(x/T) = 2 sig(2x/T) - 1, Kw can be calculated for a hyperbolic tangent unit using:

(4.9)  Kw = (T / (2 net2)) ln[ (1 + ΔFdes) / (1 - ΔFdes) ]

The unit weight vector and bias are finally replaced by Kw W and Kw bias respectively. Note that for sigmoid units 0 < ΔFdes < 0.5, and for hyperbolic tangent units 0 < ΔFdes < 1.
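The gain computation, as reconstructed here, can be checked numerically for a sigmoid unit (the function and parameter names are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adjust_inclination(w, bias, ks, us, df_des, T=1.0):
    # X1: point on the decision surface closest to the origin (eq. 4.5);
    # X2: X1 moved a distance Ks*Us along the unit normal W/||W|| (eq. 4.6).
    norm_w = np.linalg.norm(w)
    x1 = -(bias / norm_w**2) * w
    x2 = x1 + ks * us * w / norm_w
    net2 = w @ x2 + bias                 # net1 = 0 by construction
    kw = (T / net2) * np.log((0.5 + df_des) / (0.5 - df_des))   # eq. 4.8
    return kw * w, kw * bias             # scaled weights and bias

w, bias = np.array([0.7, -0.3]), 0.4
df_des = 0.4
w2, bias2 = adjust_inclination(w, bias, ks=0.5, us=np.sqrt(2.0), df_des=df_des)
# After scaling, the output variation over the same relative input step
# (from the surface to Ks*Us along the new normal) equals df_des.
```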
4.2 - Constraining the Decision Surfaces during Training
The knowledge that the decision surfaces have to be within or close to the boundaries of the network input space can also be exploited during training. A simple (though probably not optimal) way to do this is to periodically check whether the decision surfaces are within the boundaries of a permissible region. The units with decision surfaces outside such a region are then reinitialized using the methods proposed in the previous section. This permissible region is in general defined so as to enclose the network input space.
In order to perform a sufficiently close approximation to some mappings, some of the decision surfaces may have to be outside the network input region, but still close to its boundaries. If a decision surface is situated very far from the boundary of the network input region, the variation of the unit output over the network input region will be small (if the inclination of the decision surface is small) or zero, and therefore this unit will be operating as a linear unit. This unit can be replaced by another unit with a decision surface located within the network input space with a small inclination, such that this unit also operates as a linear unit. A constant term representing the averaged output of the original unit over the input space should then be added to the bias of the output units.
We propose two methods to check if the decision surface of a sigmoidal unit is
within the boundaries of the permissible region. In general such a region is defined as
a hypercube if we can define hard limits for each network input.
In the first method we calculate the unit output for each corner of this hypercube. If, for all corners of the hypercube, the unit output is always less than or always greater than its output midpoint (defined as 0.5 for sigmoid units and 0 for hyperbolic tangent units), then the decision surface of this unit is outside the hypercube.
In the second method we define a hypersphere that encloses the hypercube. If we assume that all sides of the hypercube have the same length 2u and its center is the origin, all corners will be equally distant from the origin and the radius of the hypersphere is equal to the distance from the corners to the origin, that is u(Nx)^(1/2). If the distance from the decision surface to the origin (|bias| / ||W||, see eq. 4.5) is greater than the radius of the hypersphere, then the decision surface is outside the hypersphere and hence outside the hypercube.
The first method is more restrictive than the second one since, for input spaces
with dimension greater than 1, if the decision surface (the hyperplane) is nearly parallel
to one of the sides of the hypercube, it is possible that the decision surface is inside the
hypersphere but outside the hypercube. On the other hand, the number of calculations
is much greater in the first method than in the second method. For input spaces with
dimension 1, the two methods are the same.
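Both checks take only a few lines (a sketch with assumed function names; the hypercube is [-u, u]^Nx centred at the origin):

```python
import numpy as np
from itertools import product

def outside_by_corners(w, bias, u):
    # Method 1: evaluate the net input at every corner of the hypercube;
    # if its sign never changes, the decision surface misses the hypercube.
    corners = np.array(list(product([-u, u], repeat=len(w))))
    nets = corners @ w + bias
    return bool(np.all(nets > 0) or np.all(nets < 0))

def outside_by_sphere(w, bias, u):
    # Method 2: compare |bias|/||w|| (distance from surface to origin,
    # eq. 4.5) with the radius u*sqrt(Nx) of the enclosing hypersphere.
    return abs(bias) / np.linalg.norm(w) > u * np.sqrt(len(w))

# A surface through the origin is inside; one at x1 = 5 is outside [-1, 1]^2.
print(outside_by_corners(np.array([1.0, 0.0]), -5.0, 1.0))
```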
4.3 - Simulations
In this section we illustrate the application of the proposed method in the case
where it is desired to train a FF ANN to learn the mapping y = sin(x) for x in the
interval [-2 2]. The ANN has 5 sigmoid hidden units and the output unit is linear.
In order to perform the desired mapping the learning algorithm has to position
a decision surface where the target function crosses the line y = 0. Therefore 5 hidden
units is the minimum number of hidden units necessary to produce a good
approximation of the desired mapping. Moreover, since the network input and output
variables are continuous, the decision surfaces for the hidden units have to be positioned
at [2,1,0,1,2] with a good precision and have approximately the correct inclination.
The weights for the output units should then provide the correct linear combination of
the outputs of the hidden units. In other words, this is a demanding problem since the
solution space is very limited and contains only a certain combination of weights.
Figure 4.6 shows that the function F(x) can approximate the sine function very well, where:

F(x) = 1.15 \sum_{i=1}^{5} (-1)^{i+1} \tanh\big(x - (i-3)\pi\big)    (4.10)

or, using the relation tanh(x/T) = 2 sig(2x/T) - 1:

F(x) = -1.15 + 2.3 \sum_{i=1}^{5} (-1)^{i+1} \,\mathrm{sig}\big(2x - 2(i-3)\pi\big)    (4.11)

[Figure 4.6 - The function sin(x) and its approximation F(x); the curves sin(x), F(x) and the error are plotted against x/pi.]

The degree of approximation can be measured by calculating the Root-Mean-Squared (RMS) error. The expression for the RMS error is:

\mathrm{RMS\ error} = \sqrt{\frac{1}{N_p} \sum_{i=1}^{N_p} \big(\sin(x_i) - F(x_i)\big)^2}    (4.12)

where N_p = number of selected points. Using a set of 40 equally spaced points in the range [-2π, 2π], the RMS error is 0.004539.
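Eqs. (4.10) and (4.12) can be checked numerically. The sketch below reads eq. (4.10) as an alternating sum of tanh steps centered at x = (i-3)π, i = 1..5 (a reconstruction from the text, so the exact coefficient should be taken as indicative):

```python
import math

def F(x):
    # Eq. (4.10), read as F(x) = 1.15 * sum_{i=1}^{5} (-1)^(i+1)
    # * tanh(x - (i-3)*pi): one tanh "step" per hidden unit, with the
    # decision surfaces at x = -2*pi, -pi, 0, pi, 2*pi.
    return 1.15 * sum((-1) ** (i + 1) * math.tanh(x - (i - 3) * math.pi)
                      for i in range(1, 6))

def rms_error(n_points=40):
    # Eq. (4.12) over n_points equally spaced samples in [-2*pi, 2*pi].
    xs = [-2 * math.pi + 4 * math.pi * k / (n_points - 1)
          for k in range(n_points)]
    return math.sqrt(sum((math.sin(x) - F(x)) ** 2 for x in xs) / n_points)
```

Like the sine, F is odd and vanishes at integer multiples of π.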
In this section we compare the simulations for 3 cases: 1) in the first case, all the network weights and biases are initialized using the standard initialization method, that is, as random values; 2) in the second case, the network is initialized using the second initialization method presented in section 4.1.3, and the inclinations of the decision surfaces for the hidden units are then adjusted as explained in section 4.1.4; 3) in the third case, the network is initialized as in the previous case and the positions of the decision surfaces for the hidden units are reinitialized during training whenever they are found to be outside a pre-defined permissible region.
In all cases the network is trained with the Back-Propagation algorithm using the following parameters: learning rate = 0.125, momentum = 0, temperature = 0.5. Each epoch is defined as the presentation of 50 points selected with uniform distribution in the range [-2.5π, 2.5π] (a new set of points is selected in every epoch). Care was taken to ensure that the same training data, in the same order, was used in all 3 cases. The network input is defined as x/(2π), and the desired network output is defined as sin(x) and is presented uncorrupted. The RMS error is calculated every 5 epochs using 40 points equally spaced in [-2π, 2π].
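The sampling scheme for one epoch can be sketched as follows (an illustrative sketch; the function name and the use of Python's `random` generator are assumptions, not the thesis code):

```python
import math
import random

def make_epoch(rng, n_points=50):
    # One epoch: n_points inputs drawn uniformly in [-2.5*pi, 2.5*pi],
    # presented to the network as x/(2*pi) with uncorrupted target sin(x).
    xs = [rng.uniform(-2.5 * math.pi, 2.5 * math.pi) for _ in range(n_points)]
    return [(x / (2 * math.pi), math.sin(x)) for x in xs]
```

Seeding the generator once and replaying the same stream in all three cases reproduces the "same training data, in the same order" condition.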
In the first case all the network weights and biases are initialized as random values with gaussian distribution, zero mean and standard deviation 0.3. In the second and third cases: a) the output unit weights and bias were initialized to the same values used in the first case, but the hidden unit weights and biases were initialized such that the decision surfaces were located in the range [-2π, 2π] (in network units [-1, 1]) using the method presented in section 4.1.3; b) the inclination of the initial decision surfaces was adjusted as explained in section 4.1.4 using the user-defined parameters K_s = 1 and F_des = 0.4 (in our simulations U_s = 1). In all 3 cases the location of the decision surfaces was verified every 5 epochs.
In the third case, whenever the decision surface of a hidden unit was detected to be outside the permissible region, defined to be [-4π, 4π] (in network units [-2, 2]), the following procedure was adopted:
a) The amount w_OH (MaxHU + MinHU)/2 was added to the output unit bias, where w_OH = weight connecting the hidden unit to the output unit, and MaxHU and MinHU = maximum and minimum values of the hidden unit output when the network inputs are at the corners of the permissible region [-4π, 4π]. The basic idea is to transfer to the output unit bias the "average" contribution of the hidden unit that is being reset.
b) The weight w_OH was set to zero.
c) The hidden unit incoming weights and bias were reinitialized such that the decision surface was moved back into the range [-2π, 2π] (in network units [-1, 1]) using the initialization method presented in section 4.1.3. The hidden unit incoming weights were generated as random gaussian numbers with zero mean and unit variance.
Note that, once a hidden unit is reset, the inclination of its new decision surface
was not readjusted, although this is a possible alternative.
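For a one-dimensional input, reset steps a)-c) can be sketched as follows (an illustrative sketch, not the thesis code: all names are hypothetical, the input u is in network units, and the "average" contribution is taken as the midpoint of the unit's extreme outputs over the permissible region):

```python
import math
import random

def sig(z):
    # logistic activation
    return 1.0 / (1.0 + math.exp(-z))

def reset_hidden_unit(w_in, b_hid, w_oh, b_out, region=2.0, rng=random):
    # a) transfer the "average" contribution of the unit to the output
    #    bias, using its outputs at the two corners of the permissible
    #    region [-region, region] (network units); the sigmoid is
    #    monotone, so in 1-D its extremes occur at the corners
    corners = [sig(w_in * u + b_hid) for u in (-region, region)]
    b_out += w_oh * (max(corners) + min(corners)) / 2.0
    # b) disconnect the unit from the output
    w_oh = 0.0
    # c) reinitialize the incoming weight as N(0, 1) and place the new
    #    decision surface (the zero of w_in*u + b_hid) inside [-1, 1]
    w_in = rng.gauss(0.0, 1.0)
    b_hid = -w_in * rng.uniform(-1.0, 1.0)
    return w_in, b_hid, w_oh, b_out
```

If the unit is saturated (output roughly a constant c over the region), the network output is unchanged by the reset, since the bias absorbs w_OH * c before the connection is zeroed.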
Figure 4.7 shows the RMS error history (sometimes also referred to as the learning curve) for the 3 cases.

[Figure 4.7 - The RMS error history for the 3 simulation cases; the RMS error (0 to 0.14) is plotted against the number of epochs (0 to 2500) for curves (1), (2) and (3).]

Considering that convergence is obtained when the RMS error remains below 0.02, the first case takes 2060 epochs to converge, the
second case 345 epochs and the third case 230 epochs. The learning speed in the third
case is almost 9 times faster than in the first case.
Figures 4.8-4.11 show the evolution of the location of the decision surfaces for the 3 cases. Figures 4.8 and 4.9 refer to the first case, plotted using different vertical scales, while figures 4.10 and 4.11 refer to the second and third cases, respectively. Note that the learning curve for case 1 in figure 4.7 has a staircase shape: whenever one of the decision surfaces converges to its correct final value, there is a sharp decrease in the RMS error. Note in figure 4.9 that a large number of epochs is wasted while the decision surfaces are still very far from their correct locations.
Figure 4.12 shows for case 3 the history of the output unit weights and bias, also
sampled every 5 epochs. Finally figure 4.13 shows for case 3 the approximation
provided by the network after being trained for 500 epochs. At the end of the training
session the RMS error is 0.004807 and the decision surfaces are located at [-1.9306π, -1.0350π, -0.0311π, 1.0455π, 1.9645π] while they were expected to be located at [-2π, -π, 0, π, 2π].
[Fig. 4.8 - Decision surfaces for case 1; x/pi (vertical scale -300 to 500) vs. number of epochs (0 to 2500).]

[Fig. 4.9 - Decision surfaces for case 1; x/pi (vertical scale -3 to 4) vs. number of epochs (0 to 2500).]

[Fig. 4.10 - Decision surfaces for case 2; x/pi (-12 to 6) vs. number of epochs (0 to 500).]

[Fig. 4.11 - Decision surfaces for case 3; x/pi (-12 to 6) vs. number of epochs (0 to 500).]
[Fig. 4.12 - Output unit weights and bias for case 3, plotted against the number of epochs (0 to 500).]

[Fig. 4.13 - The function sin(x) and its network approximation for case 3; sin(x), the ANN output and the error are plotted against x/pi.]
4.4 - Conclusion
In this chapter we presented a technique that can be used with the Back-Propagation algorithm in order to speed up learning. We propose to use the knowledge about the range of the network inputs to initialize and constrain the location of the network decision surfaces during training. We also propose to adjust the inclination of the decision surfaces during the weight initialization process.
The simulation results demonstrate that, once the decision surfaces converge to their correct locations, the adjustment of the second layer of weights is very fast. This seems to indicate that learning occurs in the bottom-to-top direction (the input layer is at the bottom and the output layer is at the top).
During training the user has to define the permissible region for the decision surfaces. Too small a permissible region will lead to a large number of unnecessary reinitializations of the decision surfaces. On the other hand, too large a permissible region will tend to slow down convergence. A possible alternative that avoids the need to specify a permissible region is to treat the location of the decision surface as a "soft" constraint rather than a "hard" constraint.
In the next chapter we are concerned with how to improve the fault tolerance of the feedforward ANN, so as to increase network robustness to the loss of hidden units.