

Chapter 2 - Artificial Neural Networks: Basic Concepts
The great majority of digital computers in use today are based around the
principle of using one very powerful processor through which all computations are
channelled. This is the so-called von Neumann architecture, named after John von Neumann,
one of the pioneers of modern computing. The power of such a processor can be
measured in terms of its speed (number of instructions that it can execute in a unit of
time) and complexity (the number of different instructions that it can execute).
The traditional way to use such computers has been to write a precise sequence
of steps (a computer program or an algorithm) to be executed by the computer. This is
the algorithmic approach. Such programs can be written in different computer
languages; higher-level languages have commands that, when translated to the machine level, correspond to several instructions at the processor level.
Researchers in Artificial Intelligence (AI) follow the algorithmic approach and
try to capture the knowledge of an expert in some specific domain as a set of rules to
create so-called expert systems. This is based on the hypothesis that the expert's thought
process can be modelled by using a set of symbols and a set of logical rules which
manipulate such symbols. This is the symbolic approach. It is still necessary to have
someone that understands the process (the expert) and someone to program the
computer.
The algorithmic and symbolic approaches can be very useful for certain problems
where it is possible to find a precise sequence of mathematical operations (e.g. inversion
of matrices) or a precise sequence of rules (e.g. medical diagnosis of certain well-understood diseases). However, such approaches have the following weaknesses:
a) Sequential (or Serial) Computation: as a consequence of the
centralization around the processor, the instructions have to be executed sequentially,
even if two sets of instructions are unrelated. This creates a bottleneck around the
central processor. Sometimes, instead of just one, a small number of very powerful
central processors are used, but this has to be weighed against the increase in the
complexity of programming the management of these processors so that they are used
effectively. Also, sooner or later, the physical limits for signal propagation times within
the computer will be reached. The current approach of reducing the processor size is
also constrained by physical limits.
b) Local Representation: the knowledge is localized in the sense that a
concept or a rule can be traced to a precise area in the computer memory. Such
representation is not resistant to damage. A very small corruption in one of the instructions to be executed by the processor (a single-bit error) can easily ruin the sequential computation. Moreover, as the complexity of the program increases, its reliability
decreases, since it is more likely that the programmers will make mistakes. Recently
developed programming styles such as object-oriented programming (OOP) aim to make
it easier to manage these complex programs.
c) "Learning" is difficult: if we define computational "learning" as the
construction or modification of some computational representation or model [Tho92],
it is difficult to simulate "learning" using the algorithmic and symbolic approaches. This
happens because it is not straightforward to incorporate the data acquired from
interaction with the environment into the model.
In general, it can be said that digital computers can solve problems that are
difficult for humans, but it is often very difficult to use them to automate tasks that
humans can solve with little effort, such as driving a car or recognizing faces and
sounds in a real-world situation.
Artificial Neural Networks (ANN), also called neurocomputing, connectionism,
or parallel distributed processing (PDP), provide an alternative approach to be applied
to problems where the algorithmic and symbolic approaches are not well suited.
Artificial Neural Networks are inspired by our present knowledge of biological nervous
systems, although they do not try to be realistic in every detail (the area of ANN is not
concerned with biological modelling, a different field). Some ANN models may
therefore be totally unrealistic from a biological modelling point of view [HKP91].
In contrast to the conventional digital computer, ANN perform their computation
using a large number of very simple and highly interconnected processors operating in
parallel. The representation of knowledge is distributed over these connections and
"learning" is performed by changing certain values associated with such connections, not
by programming. The learning methods still have to be programmed, however, and for
each problem we must choose a suitable learning algorithm, but the same general
approach is kept.
Current ANN models are such crude approximations of biological nervous systems that it is hard to justify the use of the word neural. The word survives today largely for historical reasons, since most of the earlier researchers came from biological or psychological backgrounds, not engineering or computer science.
It is generally believed that knowledge about real biological neural networks can
help by providing insights about how to improve the artificial neural network models
and clarifying their limitations and weaknesses. The next section presents a simplified
introduction to the human nervous system and human brain. The human brain is the
most complex organ we have and is a structure still poorly understood, despite intense
research and much progress since Santiago Ramon y Cajal showed that the human
nervous system is made of an assembly of well-defined cells.
A general framework for ANN models is later introduced and some important
ANN models are presented in the subsequent sections. Finally some limitations of the
current ANN models are highlighted in the conclusions.
2.1 - The Human Nervous System and the Brain
The human nervous system consists of the Central Nervous System (CNS) and
the Peripheral Nervous System (PNS). The CNS is composed of the brain and the spinal
cord. The PNS is composed of the nervous system outside the brain and spinal cord.
The human nervous system can be seen as a vast electrical switching network.
The top-level behaviour of such a network can be approximately described by figure 2.1.
The inputs to this network are provided by sensory receptors. Such receptors act as
transducers and generate signals from within the body or from sense organs that observe
the external environment. The information is then conveyed by the PNS to the CNS,
where it is then analyzed and processed. If necessary, the CNS sends signals to the
effectors and the related motor organs that will execute the desired actions. From the
above description we can see that the human nervous system can be described as a
closed-loop control system, with feedback from within the body (in order to regulate
some bodily functions such as the heart rate) and from outside the body (so we are aware of our interactions with the external environment) [Zur92].
Figure 2.1 - The human nervous system as a closed-loop control system
Most of the information processing done by the CNS is performed in the brain.
In contrast to other organs in the human body, the brain does not process metabolic
products, but instead processes "information". In order to process such information the
brain is the most concentrated consumer of energy in the body, being responsible, with
the body at rest, for over 20% of the body's oxygen consumption despite being only 2% of the body mass. Despite such high energy consumption the brain dissipates very little
heat. Since the brain is very mechanically and chemically sensitive and cannot
regenerate itself if damaged, the brain is the most protected organ in the body. A bony
skull provides the brain with a strong mechanical protection while chemical protection
is provided by a highly effective filtration system called the blood-brain barrier, a dense
network of blood vessels that isolates the brain from potentially toxic substances found
in the bloodstream [Was89]. The brain and spinal cord are also immersed in the
cerebrospinal fluid, which provides further protection against damage.
2.1.1 - Neurons
The human brain contains approximately 10^11 elementary nerve cells called neurons (10^11 is around 20 times the current world's population and the estimated number of stars in our galaxy). Each of these neurons is connected to around 10^3 to 10^4 other neurons, and therefore the human brain is estimated to have 10^14 to 10^15 connections. The neuron is the basic building block of the nervous system and most neurons are in the brain.
Neurons can be classified into two main classes: 1) output cells, that connect
different regions of the brain to each other, connect the brain to the effectors (motor
neurons), or connect the sensory receptors to the brain (sensory neurons); and 2)
interneurons, that are confined to the region where they occur [BeJa90].
There are hundreds of neuron types, each with its characteristic function, shape
and location, but the main features of a neuron of any type are its cell body, called
soma, dendrites and the axon, as figure 2.2 illustrates.
The cell body is usually 5 to 100 µm in diameter and contains the normally large
nucleus of the neuron. Most of the biochemical activities necessary to maintain the life
of the neuron, such as synthesis of enzymes and other molecules, take place within its
cell body.
The dendrites act as the input channels of external signals to the neuron and the
axon acts as the output channel. Dendrites form a dendritic tree, which is a bushy tree
that spreads out around the cell body within a region of up to 400 µm in radius. An
axon extends away from the cell body and is relatively uniform in diameter. It can be
as short as 100 µm for interneurons or as long as 1 meter for sensory and motor
neurons, such as the neurons that connect the toe to the spinal cord. The axon also
branches but only at its end, in contrast to dendrites that split much closer to the cell
body.
The end of a branch of an axon has a button shape, with a diameter of around 1 µm, and connects to the dendrite of another neuron. Such a connection is called a synapse (from the Greek verb "to join"). Usually this is not a physical connection (the axon and the dendrite do not touch) but there is a small gap, called the synapse gap or synapse cleft, that is normally between 200 and 500 Å across (1 Å = 10^-10 m; the diameter of a water molecule is around 3 Å). The point where the axon is connected to its cell body is called the Hillock zone.
2.1.2 - The Action Potential
The cell body can generate electrical activity in the form of a voltage pulse
called an action potential (AP) or electrical spike.
The axon carries the action potential from the cell body to the synapses where
chemical molecules, called neurotransmitters, are then released. These diffuse across the
synapse gap to the dendrite at the other side of the synapse and modify the dendrite's membrane potential. It takes around 0.6 ms for the neurotransmitters to cross the synapse gap.

Figure 2.2 - A schematic representation of a neuron

According to the predominant type of neurotransmitter present at the
synapse, the membrane potential of the dendrite is increased (an excitatory synapse) or
decreased (an inhibitory synapse). These signals received by the dendrites from many
different neurons are then sent to the cell body where they are, roughly speaking,
averaged. If this average over a short time interval is above a certain threshold at the
Hillock zone, the neuron "fires", i.e. the cell body generates an action potential that is
then transmitted by its axon to other neurons.
The AP has a peak value of about 100 mV and a duration of around 1 ms. At rest (no input to the neuron) the cell body has a potential of -70 mV in relation to its outside (this generates an electric field of about 10^4 V/mm across the membrane of the neuron), the threshold value for the generation of the AP is around -60 mV to -30 mV (depending on the sensitivity of the neuron) and when the AP is at its peak the interior of the neuron is 30 mV above the potential of its external environment [DeDe93].
After the AP is generated and extinguished, there is a refractory period during which the neuron does not fire even if it receives very large inputs. The refractory period lasts 3 to 4 ms and is important since it sets an upper limit on the maximum firing frequency of the neuron. The firing cycle duration is defined as the duration of the pulse added to the duration of the refractory period. Taking the minimum cycle duration as 4 ms (1 ms as the minimum duration of the pulse + 3 ms as the minimum duration of the refractory period), the maximum firing frequency is 250 Hz.
Once the AP is generated, it is transmitted along the axon, like an electrical
signal is transmitted along an electric cable. There is a chemical regenerating mechanism,
provided by exchange of ions, that ensures that the AP is transmitted along the axon
without much distortion in its shape and duration. The velocity of propagation of the AP
along the axon can vary from 0.5 to 130 m/s since it is proportional to the square root
of the diameter of the axon and it increases when the axon is covered by myelin, a
relatively thick and fatty insulating layer. Two-thirds of the axons in the human body have small diameters (between 0.0003 and 0.0013 mm) and are unmyelinated.
the low-speed (up to 1.5 m/s or 3.2 mph) nerve fibres (group of axons) and they carry
"routine" information such as body temperature, where this low speed is adequate. The
other one-third consists of high-speed nerve fibres (up to 130 m/s or 290 mph), the
axons have relatively large diameter (0.001 to 0.022 mm) and they are myelinated. They
are used for transmission of vital information that needs to be processed rapidly, for
instance, when there is danger to the organism [DeDe93].
At regular intervals there are breaks in the myelin cover, the so-called nodes of Ranvier, which play a vital role in the transmission of the pulses along the axon.
Without the myelin cover, the diameter of the mammalian optic nerve, which contains about 1 million axons, would have to be increased from 2 mm to around 100 mm to carry the same information at the same speed. On the other hand, information that has low priority is carried by the lower-speed axons since they occupy less space in the organism. Victims of multiple sclerosis are believed to suffer from deterioration of this myelin cover, probably caused by an autoimmune attack.
In contrast to axons the dendrites do not have a myelin cover and there is no
regenerating mechanism to transmit the signal received by a dendrite at the synapse to
the cell body. Therefore a greater distance between a synapse and the cell body means that the signal such a synapse sends along the dendrite takes longer to arrive at the cell body and suffers greater attenuation and distortion. This is the reason why dendrites cannot be very long. A possible model for a dendritic tree is a passive RCG network (resistors in series with capacitors and resistances in parallel), much like the models used for transmission lines in studies of the distribution of electric energy [DeDe93].
In cases when there is imminent danger or damage, for instance, when we
unintentionally touch a very hot object, the brain is not directly involved in the
immediate reaction. In such cases a simple decision and very fast reaction is needed, i.e. move the hand away from the hot object, a so-called reflex arc. The brain is excluded from this decision process to avoid slowing the reaction. If our ancestors had had to think in order to react to such situations, they would probably not have survived the harsh
environment in which they lived. In these cases, when the signal reaches the spinal cord,
a signal is very rapidly sent to the proper muscles to perform a corrective action. In the
above example, the sensory neurons that received the stimulus are directly linked,
possibly through some interneurons in the spinal cord, to some motor neurons. The brain
receives the signal as well since the same sensory neurons will also be connected to
other interneurons that have a path to the brain. This is important to make the brain
aware of the environment.
From the above, we can see that the signal transmitted by the neuron is modulated in frequency: the information content lies not in the fact that a neuron has "fired", but in the number of pulses fired per unit of time. Using such a frequency modulation (FM) method, the signal generated in the cell body of the neuron can be transmitted by the axon to other neurons over long distances. Interestingly, telecommunication engineering has shown that the FM technique has significant advantages in noise rejection over other techniques.
The main sources of noise at the neuronal level are a consequence of the
chemical mechanism involved in the transmission of the AP across the synapse and
along the axon. The cause of the noise in the first case is the random movements of the
molecules of the neurotransmitters. In the second case the cause is the random
movement of the ions that are involved in the transmission of the AP along the axon.
2.1.3 - Structure of the Brain
The human brain is hierarchically structured and the higher levels of the structure
are believed to be specified by the genetic code. The values of the synapses of all
neurons are the lowest level of such structure and are believed to be determined not by the
genetic code, but by the interactions of each individual with the environment, i.e.
"learned".
In the scale of evolution, the lower-level animals have their nervous systems completely specified by the genetic code. Man is at the top of the scale and has the
highest brain volume in relation to total body weight. The genetic code cannot specify
all 10^14 to 10^15 synapses in the human brain since this is beyond its capacity [RMS92].
However, this turns out to be a decisive advantage since it enables short-term adaptation
to the environment while the evolutionary process provides long-term adaptation,
increasing the probability of survival of the species that use this strategy.
The human brain can be divided into smaller regions, according to appearance
and organization. One possibility is to divide the brain into 3 main regions: the cerebral
cortex, the thalamus and the perithalamic structures [Per92].
The cerebral cortex is the "central" processor of the brain and it is unique to
mammals. It is the youngest brain region in the evolutionary sense and constitutes the
outer part of the brain (the word cortex means the outer layer of an organ). The cerebral
cortex is a flat, thin, layered two-dimensional structure of the order of 0.2 m^2 in area and on average 2-3 mm in thickness, i.e. about 50 to 100 neurons in depth [RMS92]. It is
extensively folded in higher mammals with several fissures in order to fit inside a skull
of reasonable size. The cerebral cortex can be divided into several subareas, which seem
to be functional areas. Such areas are specialized for specific tasks such as visual
perception (the visual cortex), motor control (motor cortex), or touch (somatosensory
cortex). The body is represented unequally in the somatosensory cortex, with face and
hands having proportionally larger representation than other parts. There are also
association areas that help in the interpretation of the signals received by the sensory
areas [And83].
The thalamus is the "frontal" computer of the brain. All information which flows
to and from the cortex is processed by the thalamus. It can also be divided into regions
and is centrally located in the brain.
The perithalamic structures are the "peripheral" computers that play auxiliary but vital and not fully understood roles, like "slave" computers. Some of these structures are the hypothalamus, which controls hormonal secretions and other activities such as breathing and digestion; the hippocampus, which is involved in long-term memory; and the cerebellum, which is mainly involved in storing, retrieving and learning sequences of coordinated movements [Per92].
More details about the human nervous system and the human brain can be
obtained from any neurobiology textbook. A good readable introduction to the subject
is given in [Sci79]. For an engineering perspective of the nervous system see [DeDe93].
For an introduction to the mathematical modelling of biological neurons see [Hop86].
2.2 - Brain versus Digital Computer
The human brain can be seen as a flexible analog processor with enormous
memory capacity that has been engineered and fine-tuned by evolution over several million years to execute tasks that are important for survival in our particular world.
The more important a task was for our survival, the more optimized our whole body is
for that particular task, satisfying certain biological and physical constraints such as
body size, energy consumption and energy dissipation. The human nervous system and the brain are a particularly good example of this.
We are very good at recognizing faces and understanding speech, very rapidly and
accurately and far better than any digital computer, probably because it was very
important to our survival to differentiate between friends and enemies and to
communicate with each other. We can perform such tasks so effortlessly that we do not
realize how hard they are until we try to program a digital computer to perform them.
On the other hand, we are easily outperformed by a pocket calculator when executing
arithmetic tasks since such tasks are unnatural to us and had very little importance to
our survival. Hinton [Hin89] suggests that: a) considering arithmetic operations, 1 brain
has the processing power of 1/10 of a pocket calculator; b) but if we consider vision,
1 brain has the processing power of 1000 supercomputers; c) considering memorization
of arbitrary facts brains are much worse than digital computers; d) but if we consider
associative memory for real-world facts, such as recalling the name of a person given
a partial description with possibly few wrong clues (the so called content-addressable
memory), the brain is much better than computers (that use address-addressable
memory).
The important point is to realize that certain problems are suitable to be solved
by the conventional algorithmic procedure implemented in digital computers while other
problems are not. Artificial Neural Networks provide possible methods for trying to
solve some of the problems that are not suitable for digital computation.
The main differences in the mode of information processing between the brain
and a digital computer can be summarized as follows [Sim90]:
2.2.1 - Processing Speed
Nowadays it is common to have digital computers operating with clock frequencies in the range 16 to 33 MHz, such as the Intel 80386 and 80486 microprocessors. Therefore they will take between 30 and 40 ns to execute a single instruction (supercomputers can take as little as 3 ns; 1 ns = 1 nanosecond = 10^-9 s). As we have seen in the previous section, neurons operate in the millisecond range and will take at least 4 ms to complete a firing cycle. So a digital computer can have components that are 10^5 times faster than a neuron.
2.2.2 - Processing Order (Serial/Parallel)
A digital computer processes information serially while the brain processes
information in parallel. Furthermore, consider that we take around 500 milliseconds to
recognize a face and that the average processing time of a neuron is 5 milliseconds.
Therefore, if the brain is executing "parallel programs", such programs cannot have more
than 100 steps. This has been called the "100-step program" constraint. So, one possible
explanation for the brain outperforming a very much faster digital computer in certain
tasks such as vision is that, instead of executing a very large program serially as digital
computers do, the brain executes in parallel a large number of small programs. This
shows that to perform certain tasks well it is not enough to execute a single instruction
of the "program" very fast.
2.2.3 - Number and Complexity of Processors
A digital computer can execute each single instruction of its program much faster than one neuron can change its state, but the human brain has a much larger number (10^11) of processors (neurons) operating at the same time. The brain also has a high interconnectivity, since each neuron is connected to around 10^4 other neurons. Another difference is that, while in a digital computer the processor is very complex (since it has to be able to interpret a large number of different instructions) and has a high precision (in terms of the number of significant digits of the response), it is believed that the neuron is in comparison a very simple processor with a low precision.[1]

[1] It is very difficult to model very accurately the behaviour of a real neuron. Recent studies seem to indicate that some signal processing is also done by the dendrites and at the synapse level, instead of just at the cell body. However, the general assumption today is that such phenomena do not contribute significantly to the computational power of the brain. Much more research is needed to clarify this issue.
2.2.4 - Knowledge Storage and Tolerance to Damage
In a digital computer a particular item of information, or datum, is stored in a
specific memory location. This type of memory is referred to as a localized memory,
since a memory unit holds an entire piece of information. Moreover, a digital computer
uses address-addressable memory. In contrast, information in the brain is thought to be
located in the synapses in a distributed manner, such that no synapse holds an entire
datum and each synapse can contribute to the representation of several pieces of
information. This is called a distributed memory and the brain uses content-addressable
memory, i.e. a memory is retrieved by using parts of its contents as clues.
Distributed memories have the advantage that they are more resistant to damage
(faults). This means that the human brain is relatively tolerant to the loss of a few neurons, i.e. the information stored (the memory) is not severely distorted when a few neurons die. Also, because of the intrinsic parallelism, the loss of a few computational
units (the neurons in this case) will not result in a total failure. Such tolerance to
damage is sometimes referred to as graceful degradation and means that performance
decreases smoothly with increase in damage. Compare this with a digital computer
where the corruption of a memory location or failure of any processing element can
result in a total machine failure.
On the other hand, distributed memories have the possible disadvantage that
when it is necessary to update some information, much more work is necessary since
several physical locations of the memory need to be updated. For this reason, sometimes
it is said that knowledge in a digital computer is strictly replaceable while knowledge
in the brain is adaptable.
2.2.5 - Processing Control
In a digital computer there is a clock signal that is used to synchronize all
components. A central processor uses the clock signal to control the activities of all the
other components. In contrast there is no specific area in the brain responsible for
control or for synchronization of all neurons. For this reason, the brain is sometimes
called an "anarchic" system since there is no "homunculus" that monitors the activities
of each neuron.
Given all these differences, it is ironic to realize that today's achievements in ANN research are a direct consequence of the vast progress in the areas of hardware and software for digital computers in recent decades.
use today are simulated in digital computers since specific hardware for ANN is not yet
easily available or affordable. The current digital computers provide a suitable
framework that is used by researchers to carry out experiments with their ANN models.
2.3 - The Basics of an Artificial Neural Network Model
In this section a formal definition of Artificial Neural Networks is introduced and
a general framework for ANN models is presented.
2.3.1 - A Formal Definition
The following formal definition of an Artificial Neural Network was proposed
by Hecht-Nielsen [Hec90]:
"An Artificial Neural Network is a parallel, distributed information
processing structure consisting of processing units (which can possess
a local memory and can carry out localized information processing
operations) interconnected via unidirectional signal channels called
connections. Each processing unit has a single output connection that
branches ("fans out") into as many collateral connections as desired;
each carries the same signal - the processing unit output signal. The
processing unit output signal can be of any mathematical type desired.
The information processing that goes on within each processing unit can
be defined arbitrarily with the restriction that it must be completely local;
that is, it must depend only on the current values of the input signals
arriving at the processing element via impinging connections and on
values stored in the processing unit's local memory."
The above definition contains two slight changes in relation to the nomenclature used
in the original one proposed by Hecht-Nielsen. The term "Neural Networks" was
changed to "Artificial Neural Networks" to emphasize that we are not dealing with
biological neural networks. Also the term processing element (PE) was changed to
processing unit and most of the time we will simply use the term unit.
From the above definition ANN can be seen as a subclass of a general computing
architecture known as Multiple Instruction Multiple Data (MIMD) parallel processing
architecture. Hecht-Nielsen points out [Hec90] that general MIMD architectures may be too general to be efficient, and that ANN may be a good compromise between an efficient structure with considerable information processing capability and a general-purpose implementation.
2.3.2 - A General Framework for ANN models
There are many different ANN models but each model can be precisely specified
by the following eight major aspects [RHM86]:
A set of processing units
A state of activation for each unit
An output function for each unit
A pattern of connectivity among units or topology of the network
A propagation rule, or combining function, to propagate the activities
of the units through the network
An activation rule to update the activities of each unit by using the
current activation value and the inputs received from other units
An external environment that provides information to the network
and/or interacts with it.
A learning rule to modify the pattern of connectivity by using
information provided by the external environment.
Figure 2.3 illustrates the general model of a processing unit.

Figure 2.3 - The general model of a processing unit

The state of the unit is given by its activation value a_i. The state of an ANN with N units at time instant t can be represented by the vector [a_1(t) a_2(t) ... a_N(t)]. Such a vector is sometimes referred to as the short-term memory (STM) of the network. The output function takes the activation value as its argument and calculates the output of the unit, denoted out_i. Such an output value is then transmitted to the other units in the network. Some possibilities for the output function are:
1) The linear function: out_i = Gain * a_i;
2) The threshold function: if a_i > Threshold, then out_i = 1; otherwise out_i = 0;
3) The sigmoid function: out_i = 1/[1 + exp(-a_i)].
The pattern of connectivity, or the network topology, specifies how each unit is connected to the other units in the network. The pattern of connectivity also specifies which units (or groups of units) are allowed to receive connections from a particular unit. The strength of each connection is normally represented by a real number w. We adopt the notation that w_ij means the weight to unit i from unit j. The pattern of connectivity for a whole network with N units can be represented by the weight matrix W with dimensions N by N. Row i of W contains all weights received by unit i from the other units in the network (the input weights of unit i) and column j contains all weights sent by unit j to the other units (the output weights of unit j). The weight matrix has a very important role since it represents the knowledge that is encoded in the network and, because of this, it is said that the matrix W contains the long-term memory (LTM) of the network.
Each unit can send its output value to several other units and each unit can
receive as input the output values of several other units. The propagation rule specifies
how such output values from other units are combined into a much smaller set of values,
normally only one value called the net input of the unit. For this reason the propagation
rule is sometimes called the combining function. A frequently used combining function
simply defines a net input value for each unit as a weighted summation or
net(t) = W(t) out(t), where net(t) and out(t) are respectively the net input and output
vectors. Another possibility would be to define a net excitatory input vector and net
inhibitory vector as netE(t) = WE(t) out(t) and netI(t) = WI(t) out(t) where the matrix
WE uses only the positive elements of W and WI uses only the negative elements.
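The weighted-sum combining function and the excitatory/inhibitory split can be sketched as follows (the 3-unit weight matrix and output vector are arbitrary values chosen for illustration):

    import numpy as np

    # w[i, j] is the weight to unit i from unit j
    W = np.array([[ 0.0,  0.5, -0.3],
                  [ 0.2,  0.0,  0.8],
                  [-0.6,  0.1,  0.0]])
    out = np.array([1.0, 0.5, -1.0])   # current output vector of the units

    net = W @ out                      # net(t) = W(t) out(t)

    # WE keeps only the positive weights, WI only the negative ones
    WE = np.where(W > 0, W, 0.0)
    WI = np.where(W < 0, W, 0.0)
    netE = WE @ out                    # netE(t) = WE(t) out(t)
    netI = WI @ out                    # netI(t) = WI(t) out(t)

    assert np.allclose(net, netE + netI)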
Such net input values and the current activation value are then used by the activation rule to define the activation value of the unit at the next time step or, using a discrete time notation:

    a_i(t+1) = F[a_i(t), net_i(t)]   or   a_i(t+1) = F[a_i(t), netE_i(t), netI_i(t)].

In some models the activation function will simply be the identity function, a_i(t+1) = net_i(t), and if we have at the same time out_i(t) = a_i(t) and net(t) = W out(t), the whole network can be viewed as a linear discrete-time dynamical system, since we will have a(t+1) = W a(t). In most models either the activation function or the output function is the identity function, but not both simultaneously.
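This linear special case is easy to simulate; a minimal sketch (the weight matrix and initial state are arbitrary, with the eigenvalues of W chosen to have magnitude below 1 so that the state decays to zero):

    import numpy as np

    W = np.array([[0.5, 0.2],
                  [0.1, 0.4]])      # arbitrary weights (eigenvalues 0.6 and 0.3)
    a = np.array([1.0, -1.0])       # initial state, i.e. the short-term memory

    for t in range(5):
        a = W @ a                   # a(t+1) = W a(t)
        print(t + 1, a)             # the state vector decays towards zero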
The external environment interacts with the network to provide its inputs and to receive its outputs. The units that receive signals from the external
environment are called input units and units that send signals to the external
environment are called output units. Units that are neither input nor output units, i.e. that
are not connected directly to the external environment, are called hidden units. In some
ANN models some units are at the same time input and output units. The units can be
grouped in layers according to some property. Therefore in some models we can have
an input layer, an output layer and one, several or no hidden layers. Some authors refer
to layers as layers of weights but in this work we use layers to refer to layers of units.
2.3.3 - Learning
The external environment also interacts with the network during "learning" or
training of ANN. In this phase a learning rule is used to change the elements of the
matrix W and other adaptable parameters that the network may have. In this context
"learning" and "adaptation" are seen as simply changes in the network parameters. In
ANN models the external environment will normally provide a set of "training" input
vectors. There are two main types of learning: supervised and unsupervised.
In supervised learning the external environment also provides a desired output
for each one of the training input vectors and it is said that the external environment
acts as a "teacher". A special case of supervised learning is reinforcement learning
where the external environment only provides the information that the network output
is "good" or "bad", instead of giving the correct output. In the case of reinforcement
learning it is said that the external environment acts as a "critic". Some authors prefer
to classify reinforcement learning not as a special case of supervised learning but as a
third type of learning rule.
In unsupervised learning the external environment does not provide the desired
network output nor classifies it as good or bad. By using the correlations of the input vectors, the learning rule changes the network weights in order to group the input vectors into "clusters" such that similar input vectors will produce similar network outputs since
they will belong to the same cluster. Ideally, the learning rule finds the number of
clusters and their respective centres, if they exist, for the training data. This learning
method is also called self-organization. Sometimes, it is improperly said that in
unsupervised learning the network learns without a teacher, but this is not absolutely
correct. The teacher is not involved in every step but he still has to set goals even in an
unsupervised learning mode. Zurada [Zur92] proposes the following analogy to clarify
this point. An ANN being trained using supervised learning corresponds to a student
learning by answering the questions posed by the teacher and comparing his answers to
the correct answers given by the teacher. The unsupervised learning case corresponds
to the student learning the subject from a videotape lecture provided by the teacher but
the teacher is not available to answer any questions. The teacher provides the methods
(the learning rule) and questions (the input training vectors) but not the answers to the
questions (the output training vectors).
Considering hardware implementations of an ANN with a large number of units, it is preferable to have learning rules that use only information local to the unit whose weights are being updated. Without such a constraint, especially for a large ANN, inter-unit communication can impose a considerable burden.
2.3.4 - Network Topology
According to its topology, an ANN can be classified as a feedforward or
feedback (also called recurrent) ANN. In a feedforward ANN a unit only sends its
output to units from which it does not receive an input directly or indirectly (via other
units). In other words, there are no feedback loops.
In general, given a feedforward ANN, by properly numbering the units we can
define a weight matrix W which is lower-triangular and has a zero diagonal (the unit
does not feedback into itself). A feedforward ANN arranged in layers, where the units
are connected only to the units situated in the next consecutive layer, is called a strictly
feedforward ANN. In a feedback network feedback loops are allowed to exist. A
feedforward ANN implements a static mapping from its input space to its output space
while a feedback ANN is, in general, a nonlinear dynamical system and therefore the
stability of the network is one of the main concerns. Figure 2.4 shows examples of feedforward, strictly feedforward and feedback networks.

Figure 2.4 - Examples of feedforward (a, b) and feedback (c) ANN. However, only (a) is a strictly feedforward ANN.
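The lower-triangular property gives a simple mechanical test for feedforwardness, assuming the units have already been numbered so that connections run from lower to higher indices (a general test would first attempt such a numbering, i.e. a topological ordering). A small sketch with arbitrary weights:

    import numpy as np

    def is_feedforward(W):
        # A properly numbered feedforward ANN has a strictly lower-triangular
        # weight matrix: zero diagonal (no self-feedback) and no weight from
        # a higher-numbered unit to a lower-numbered one.
        return np.allclose(W, np.tril(W, k=-1))

    W_ff = np.array([[0.0,  0.0, 0.0],
                     [0.7,  0.0, 0.0],
                     [0.2, -0.5, 0.0]])   # weights only below the diagonal

    W_fb = W_ff.copy()
    W_fb[0, 2] = 0.3                      # feedback from unit 3 to unit 1

    print(is_feedforward(W_ff))           # True
    print(is_feedforward(W_fb))           # False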
A typical application of feedforward ANNs is to develop nonlinear models that
are then used for pattern recognition/classification. In this case a feedforward ANN can
be seen as another tool for performing nonlinear regression analysis. A typical
application of feedback ANNs is as content-addressable memories. The state vectors that
correspond to the information that we want to record (the specific "memories") are set to be stable equilibrium points. Another possible area of application is
unconstrained/constrained optimization where it is hoped that the network will converge
to a stable equilibrium point that represents a satisfactory near-optimum solution to the
problem.
ANNs can also be classified as synchronized or asynchronized according to the
timing of the application of the activation rule. In synchronized ANNs, we can imagine
the equivalent of a central clock that is used by all units in the network such that all of
them, simultaneously, sample their inputs, calculate their net input and their activation
and output values, i.e. a synchronous update is used. Such an update can be seen as a
discrete difference equation that approximates an underlying continuous differential
equation. In asynchronized ANNs, at each point in time at most one unit is updated. Normally, whenever updating is allowed, a unit is selected at random to be updated and the activation values of the other units are kept constant. In some models of feedback ANNs such a procedure helps, but does not guarantee, the stability of the network, and it is generally used when a synchronous update could result in stability problems.
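The two update schemes can be contrasted in a short sketch, using bipolar threshold units and an arbitrary weight matrix:

    import numpy as np

    rng = np.random.default_rng(0)

    def synchronous_step(W, a):
        # All units sample their inputs and update simultaneously.
        return np.where(W @ a > 0, 1.0, -1.0)

    def asynchronous_step(W, a):
        # Only one randomly selected unit updates; the others keep their values.
        i = rng.integers(len(a))
        a = a.copy()
        a[i] = 1.0 if W[i] @ a > 0 else -1.0
        return a

    W = np.array([[ 0.0, 1.0, -0.5],
                  [ 1.0, 0.0,  0.3],
                  [-0.5, 0.3,  0.0]])
    a = np.array([1.0, -1.0, 1.0])
    print(synchronous_step(W, a))
    print(asynchronous_step(W, a))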
2.4 - Artificial Neural Network Models
In this section we review the basic characteristics of the most important
feedforward ANN models, more or less in chronological order. Such models provide the
foundation of most of the many feedforward ANN models available and in use today.
Anderson and Rosenfeld [AnRo88] have edited a very interesting book which contains
a collection of several classical papers in the ANN area. Nilsson [Nil65] has published
a theoretical view of the state of the field in the mid-1960s. Lippmann [Lip87] and, more
recently, Hush and Horne [HuHo93] published updated reviews of several ANN models
and Simpson [Sim90] has published an extensive compilation of ANN models. White
[Whi89] and Levin et al. [LTS90] provide a statistical interpretation of the methods used
to train feedforward ANNs. Nerrand et al. [NRPD93] show that ANNs can be
considered as general nonlinear filters that can be trained adaptively. They also show
that several algorithms used in linear and nonlinear adaptive filtering can be seen as
special cases of algorithms used to train ANNs.
2.4.1 - Early Models
In 1943 the neurophysiologist Warren McCulloch and the logician Walter Pitts [McPi43] proposed to describe the biological neuron as a Threshold Logic Unit (TLU) with L binary (0 or 1) inputs x_j and 1 binary output y. The weights associated with such inputs are w_j = ±1. The output of such a unit is high (1) when the linear summation of all the inputs is greater than a certain threshold value and low (0) otherwise. This is equivalent to defining the activation rule as the threshold function and the output value as equal to the activation value. The threshold value is mathematically represented by a variable bias. Therefore the output y of the TLU is:

    (2.1)    y = F_T( Σ_j w_j x_j + bias )

with F_T(z) = 1 if z ≥ 0 and F_T(z) = 0 if z < 0 (the Heaviside or unit step function). The variable bias can simply be seen as another weight which originates from a unit whose output is always 1.
McCulloch and Pitts showed that it is possible to construct any arbitrary logical function by using a combination of such units, i.e. such a network is capable of universal computation. For example, the logical function AND can be implemented by using one unit with both weights set to 1 and the bias set to -1.5. Still with weights set to 1, if the bias is set to -0.5 we have the logical function OR. To obtain an inverter, the weight is set to -1 and the bias to 0.5. By using a combination of these basic functions AND, OR and inverter it is possible to construct any arbitrary logical function. However, such networks are not fault-tolerant, in that all components must always function properly, in contrast to the fault-tolerance of biological neural networks. Another problem was that McCulloch and Pitts did not explain how to obtain the values of the weights and biases for the network and, in particular, they did not propose a learning rule.
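These constructions are easy to verify with a minimal sketch, using the convention of eq. 2.1 (output 1 when the weighted sum plus bias is greater than or equal to zero):

    def tlu(inputs, weights, bias):
        # McCulloch-Pitts unit: fires (1) if the weighted sum plus bias is >= 0
        z = sum(w * x for w, x in zip(weights, inputs)) + bias
        return 1 if z >= 0 else 0

    AND = lambda x1, x2: tlu([x1, x2], [1, 1], -1.5)
    OR  = lambda x1, x2: tlu([x1, x2], [1, 1], -0.5)
    NOT = lambda x:      tlu([x], [-1], 0.5)

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, AND(x1, x2), OR(x1, x2))   # reproduces the truth tables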
In relation to the fault-tolerance problem, von Neumann [Neu56] realized that,
by using redundancy, a more reliable network could be built from unreliable components.
In relation to the learning problem, in 1949 the psychologist Donald O. Hebb
proposed a method of determining the network weights. In his book The Organization
of Behavior [Heb49] he states:
"When an axon of cell A is near enough to excite a cell B and repeatedly
or persistently takes part in firing it, some growth process or metabolic
change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
This is known today as the Hebbian rule and most of the learning rules can be seen as variants of this rule. The Hebbian rule can be formulated mathematically as:

    (2.2)    Δw_ij = η y_i(x) x_j

where x = [x_1 x_2 ... x_p]^T is the input vector, y = [y_1 y_2 ... y_q]^T is the output vector, η > 0, called the learning rate, controls the size of each learning step, and p and q are respectively the numbers of input and output units.
A simple example of an application of the Hebbian rule can be illustrated by the Linear Associative Matrix (LAM) model, sometimes also called the Linear Associator, introduced by J. Anderson [And68]. In such models the output is a linear function of the inputs, or simply Y = W X, i.e. a feedforward linear ANN. Figure 2.5 illustrates the LAM network that is used to associate a set of input vectors X, [X_1 X_2 ... X_M], to a set of desired output vectors D, [D_1 D_2 ... D_M], where M is the number of desired associations. The weight matrix has dimension q x p, each input vector has dimension p x 1 and the output vectors have dimension q x 1.

Figure 2.5 - The Linear Associative Memory (LAM)
Sometimes such vectors are referred to as patterns (from the nomenclature used
in the area of pattern recognition) since a pattern can be seen as a point in a multi-
dimensional input space, i.e. a vector. Therefore one can refer to the LAM and several
other ANN models as pattern associators.
In the training phase, if the input vectors X form an orthonormal set, i.e. they are orthogonal to each other and each one has unit length (we can easily force the unit length by dividing each component of the vector by its length), we can initialize the weight matrix as a null matrix (W_ij(0) = 0 for all i and j) and set the learning rate to 1. The correct input-output association can be encoded in the weight matrix using only a single presentation of each input/desired output pair. Since we have for each presentation:

    (2.3)    W(k) = W(k-1) + D_k X_k^T ,   1 ≤ k ≤ M

after the presentation of all M pairs the weight matrix W will be:

    (2.4)    W(M) = D_1 X_1^T + D_2 X_2^T + ... + D_M X_M^T

The advantages of the LAM model are: 1) its real-time response, 2) its fault-tolerance in relation to corruption of the weight matrix, and 3) its interpolated response. The limitations of such a model are: 1) the number of patterns stored has to be less than or equal to the number of input units, M ≤ p; 2) the input patterns have to be orthogonal to each other; and 3) it is not possible to store nonlinear input-output relationships. Later on we will see that, by using the Delta Rule, the association can be learned even for non-orthogonal input patterns.
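The storage prescription of eqs. 2.3-2.4 and the perfect recall obtained with orthonormal inputs can be checked with a minimal sketch (the input patterns here are simply the standard basis vectors of R^3, which are trivially orthonormal, and the desired outputs are arbitrary):

    import numpy as np

    X = np.eye(3)                       # input patterns X_k as columns (p = M = 3)
    D = np.array([[1.0,  0.0, 2.0],     # desired outputs D_k as columns (q = 2)
                  [0.0, -1.0, 1.0]])

    # Hebbian storage with eta = 1 and W(0) = 0: W(M) = sum_k D_k X_k^T
    W = np.zeros((2, 3))
    for k in range(3):
        W += np.outer(D[:, k], X[:, k])

    # Recall: since the X_k are orthonormal, W X_k = D_k exactly
    print(np.allclose(W @ X, D))        # True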
2.4.2 - The Perceptron and Linear Separability
In 1958 Frank Rosenblatt proposed the Perceptron model ([Ros58], [Ros62]) that
can also be used as a pattern associator or pattern classifier. The single-layer perceptron
model consists of one layer of input binary units and one layer of binary output units.
There are no hidden layers and therefore there is only one layer of modifiable weights.
The output units use a hard-limiting threshold output function as in eq. 2.1. Figure 2.6
illustrates the single-layer perceptron.
Figure 2.6 - The Single-Layer Perceptron
The single-layer perceptron is a special case of Rosenblatt's elementary
perceptron. His original perceptron was proposed as a model for visual pattern
recognition and has 3 types of units: sensory, association and response units (see figure
2.7). The sensory units, or S-units, form a "retina" and act as transducers responding to
physical signals such as light, pressure or heat. The association units, or A-units, receive
random, localized and fixed connections from the sensory units. The association units
send variable connections to the response units, or R-units, that act as the output units.
All units are TLUs but, while the sensory and association units have 0 or 1 as outputs
(binary outputs) and non-zero fixed threshold values, the response units have a -1 or 1
output (bipolar output) and a zero fixed threshold value. Since such a model is
equivalent to a network of McCulloch and Pitts units, such a network can also compute
any logical function and therefore it can also perform universal computation [Ros62].
The computation performed by the sensory and association units can be viewed as a fixed pre-processing stage, since the connections between them are non-adaptable.

Figure 2.7 - Rosenblatt's elementary perceptron
For training the perceptron, Rosenblatt proposed the following supervised learning error-correction rule:

1) Apply an input pattern and calculate the output Y.
2) a) If the output is CORRECT, go to step 3;
   b) if the output is INCORRECT and is -1, ADD each input to its corresponding weight, ΔW_ij = X_j; or
   c) if the output is INCORRECT and is 1, SUBTRACT each input from its corresponding weight, ΔW_ij = -X_j.
3) Select another input pattern from the training set and return to step 1.

Mathematically, step 2 can be expressed as:

    (2.5)    ΔW_ij = (1/2) (D_i - Y_i) X_j

making explicit the error-correction part. Rosenblatt [Ros62] proved that, if a solution exists, i.e. if there is a weight matrix W that gives the correct classification for the set of training patterns, then the above algorithm will find such a solution after a finite number of iterations. This proof is known today as the Perceptron Convergence Theorem and it can also be found in [Zur92], [MiPa69], [HKP91] and [BeJa90]. Therefore it is also very important to understand in which cases such a solution exists.
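A minimal sketch of the rule for a single bipolar output unit, trained on the linearly separable AND problem (the data encoding and epoch limit are arbitrary choices; the convergence theorem guarantees that the loop below terminates):

    import numpy as np

    def predict(w, bias, x):
        return 1 if w @ x + bias > 0 else -1        # bipolar threshold output

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    D = np.array([-1, -1, -1, 1])                   # AND with bipolar targets

    w, bias = np.zeros(2), 0.0
    for epoch in range(20):
        errors = 0
        for x, d in zip(X, D):
            y = predict(w, bias, x)
            if y != d:
                w += 0.5 * (d - y) * x              # ΔW_ij = 1/2 (D_i - Y_i) X_j
                bias += 0.5 * (d - y)               # bias as a weight from a constant 1
                errors += 1
        if errors == 0:
            break

    print(w, bias, [predict(w, bias, x) for x in X])   # outputs [-1, -1, -1, 1]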
Without loss of generality, since all output units operate independently of each
other, let's consider just one of the output units of the single-layer perceptron. Such an
output unit divides its input space into 2 regions or classes (one region has high output
and the other low output, where "high" normally means 1 and "low" means 0 or -1, depending on whether we are using binary or bipolar output units). These 2 regions are
separated by a hyperplane (a line for 2 inputs and a plane for 3 inputs) and the
hyperplane is the decision surface in this case. The position of such a hyperplane is
determined by the weights and bias received by this output unit. The equation of the
hyperplane of output unit i is:

    (2.6)    Σ_{j=1}^{p} W_ij X_j + bias_i = 0

Marvin Minsky and Seymour Papert [MiPa69] analysed in detail the capabilities and
limitations of the single-layer perceptron model (chapter 3 of [AlMo90] contains a good
explanation of Minsky and Papert's arguments and Block [Blo70] summarizes their main
results). One of the most important limitations proved by Minsky and Papert was that
the single-layer perceptron can only solve problems that are linearly separable, i.e
problems where, for each output unit, a hyperplane exists that correctly divides the input
space into the two correct classes. Unfortunately, many interesting problems are not
linearly separable problems. Moreover, Peretto ([Per92], chap. 6) shows that the fraction of logical functions that are linearly separable tends to zero as the number of arguments increases.
Figure 2.8 illustrates the logical boolean functions AND, OR and XOR, which have 2 inputs and 1 output.

Figure 2.8 - The AND, OR and XOR problems and possible locations for the decision surfaces

From figure 2.8 we can see that the functions AND and OR
are linearly separable (see the position of the decision surfaces, i.e. lines in this case).
However the XOR function is not linearly separable since it is not possible to position
a single line to separate the points that should produce a "0" (or "-1") output from the
points that should produce a "1" output. Another way to illustrate that the perceptron
cannot solve the XOR problem is to write the 4 inequalities that need to be satisfied by
the network weights and bias. The inequalities are:
a) 0·W_1 + 0·W_2 < bias   =>   bias > 0
b) 1·W_1 + 0·W_2 > bias   =>   W_1 > bias
c) 0·W_1 + 1·W_2 > bias   =>   W_2 > bias
d) 1·W_1 + 1·W_2 < bias   =>   W_1 + W_2 < bias

These inequalities cannot be simultaneously satisfied, since the weights W_1 and W_2 cannot both be greater than the bias, which has a positive value, while at the same time their sum is less than the bias.
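The impossibility can also be verified exhaustively: a brute-force search over a grid of weight and bias values finds no single unit that satisfies all four conditions (a sketch; the grid range and resolution are arbitrary, although the conclusion holds for any values):

    import numpy as np

    patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]   # XOR

    def solves_xor(w1, w2, bias):
        # The unit fires (output 1) when w1*x1 + w2*x2 > bias,
        # exactly as in the inequalities above
        return all((w1 * x1 + w2 * x2 > bias) == bool(d)
                   for (x1, x2), d in patterns)

    grid = np.linspace(-2.0, 2.0, 41)
    found = any(solves_xor(w1, w2, b)
                for w1 in grid for w2 in grid for b in grid)
    print(found)    # False: no single hyperplane separates XOR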
It is important to understand that for such problems, for the single-layer
perceptron, there is no solution to be found, i.e. it is a representation problem (where
we are interested in knowing whether there is at least one solution), not a learning problem
(where we know that there is at least one solution and want to find one of the solutions).
Figure 2.8 also illustrates that a possible solution for the XOR problem is to
change the shape of the decision surfaces, for instance from hyperplanes to ellipsoids.
In such a case, there are two possible solutions. In the first solution all points in the
input space X outside D.S.1 produce a "1" output and the points inside D.S.1 produce
a "0" output. In the second solution all points inside D.S.2 produce a "1" output and the
points outside produce a "0" output.
The way to overcome the limitation of linear separability is to use multi-layer
networks, such as the so called Multi-Layer Perceptron (MLP), that introduce extra
layers of units (the so called hidden units) between the input and output layers. It is
possible to show that this is equivalent to defining new shapes for the decision surfaces
by combining several hyperplanes. However, partly as a consequence of the publication
of Minsky and Papert's book, the interest of the research community in the late 1960s
was quickly diverted from ANN to other areas, mostly to the then new area of Artificial
Intelligence. Minsky and Papert state in their book ([MiPa69], pages 231-232):
"The perceptron has shown itself worthy of study despite (and even
because of!) its severe limitations. It has many features to attract
attention: its linearity; its intriguing learning theorem; its clear
paradigmatic simplicity as a kind of parallel computation. There is no
reason to suppose that any of these virtues carry over to the many-
layered version. Nevertheless, we consider it to be an important research
problem to elucidate (or reject) our intuitive judgement that the extension
is sterile. Perhaps some powerful convergence theorem will be
discovered, or some profound reason for the failure to produce an
interesting "learning theorem" for the multi-layered machine will be
found."
At that time there was no reliable algorithm to train a multi-layer ANN and Minsky and
Papert judged that it was not worthwhile to try to find one. The interest in ANN would
only resurface again in the mid/late 1980s partly because of the popularization of
Hopfield's work and the Back-Propagation Algorithm (which will be presented in the
next section), but also because of the perceived limitations of the AI approach.
Simpson [Sim90] argues that Rosenblatt was aware of the limitations of the
single-layer perceptron model. Rosenblatt ([Ros62], [Sim90]) also proposed extensions
to the perceptron model illustrated in figure 2.7, that he called the series-coupled
perceptron (a feedforward network). Such extensions were made by adding extra
feedback connections to the series-coupled perceptron. He proposed the cross-coupled
perceptron with parallel connections within the association units, and the back-coupled
perceptron with added connections back from the response units to the association units.
However, in all of these models the first layer of weights (the weights from the sensory
units to the association units) were randomly preset and non-adaptable. The extra
feedback connections made it very difficult to analyze mathematically such models and
Minsky and Papert's book is not concerned with them. Rosenblatt also came very close
to discovering the key to training multi-layer perceptrons when he proposed a heuristic
algorithm to adapt both layers of weights. On page 292 of [Ros62] Rosenblatt states:
"The procedure to be described here is called the "back-propagating
error correction procedure" since it takes its cue from the error of the R-
units (the output units), propagating corrections back towards the sensory
end of the network (the input units) if it fails to make a satisfactory
correction quickly at the response end (output units). The actual
correction procedure for the connections to a given unit, whether it is an
A-unit (hidden unit) or an R-unit (output unit), is perfectly identical to
the correction procedure employed for an elementary perceptron (the
single-layer perceptron), based on the error-indication assigned to the
terminal unit."
Again, Minsky and Papert's book did not discuss this algorithm, possibly because
Rosenblatt could not, apparently, make much progress with it.
2.4.3 - ADALINE/MADALINE and the Delta Rule
In 1960 Widrow and Hoff [WiHo60] introduced the ADALINE, initially an
abbreviation for ADAptive LInear NEuron and later, when ANN models became less
popular, ADAptive LINear Element. The ADALINE is a TLU with a bipolar output and
bipolar inputs (-1 or +1). As usual, such a unit computes a weighted sum of the inputs plus
a bias. If the sum is greater than zero, the output is +1. If the sum is equal to or less than
zero, the output is -1. Later such a model was also used with continuous real-valued
inputs. A MADALINE (Multiple ADALINE) is basically the single-layer perceptron with
bipolar inputs and bipolar outputs.
To train the ADALINE/MADALINE network, Widrow and Hoff proposed the
Delta Rule, also known today as the Widrow-Hoff Rule or Least-Mean-Square (LMS)
algorithm. The Delta Rule is also an error-correction rule, i.e. a supervised learning rule.
The learning speed of the single-layer perceptron, when trained with the
Perceptron rule (eq. 2.5), could be very slow since the weights were changed only when
there was a gross classification error. The basic principle of the Delta Rule is to change,
for each presentation of an input/desired output pattern from the training set, the network
weights in the direction that decreases the squared output error $E_{pat}$, defined as:

$$E_{pat} = \sum_{i=1}^{q} E_{pat,i} = \sum_{i=1}^{q} \frac{1}{2} (D_i - Y_i)^2 \qquad (2.7)$$

In other words, the Delta rule is a gradient-descent search procedure executed at each
iteration. By repeating this procedure over the set of training patterns, we minimize the
"average" output error $E_{av}$, where:

$$E_{av} = \frac{1}{M} \sum_{pat=1}^{M} E_{pat} \qquad (2.8)$$

and consequently we have:

$$\Delta W_{ij} = -\eta \frac{\partial E_{pat}}{\partial W_{ij}} = -\eta \frac{\partial E_{pat}}{\partial Y_i} \frac{\partial Y_i}{\partial W_{ij}} = \eta (D_i - Y_i) \frac{\partial Y_i}{\partial W_{ij}} \qquad (2.9)$$

However, the activation/output function is not continuous and therefore not differentiable,
and in principle eq. 2.9 cannot be applied. So, in effect Widrow and Hoff proposed to
use, during training, a linear activation function or Y = W X + bias. Such a modification
makes learning quicker because it changes the weights even when the output
classification is almost correct, in contrast to the perceptron rule that changes the
weights only when there is a gross classification error. Another important difference is
the use of bipolar inputs instead of binary inputs. Using binary inputs, when the input
is 0, the weights associated with such an input do not change. Using bipolar inputs, the
weights change even when the inputs are inactivated (-1 in this case).
The training procedure of the single-layer perceptron with the delta rule can be
summarized as:
1) Initialize the matrix W and the bias vector with small random numbers.
2) Select an input/desired output (X,D) vector from the training set.
3) Calculate the network output as: Y = W X + bias
4) Change the weight matrix and the bias vector using:
$$\Delta W_{ij} = \eta (D_i - Y_i) X_j \qquad (2.10)$$

$$\Delta bias_i = \eta (D_i - Y_i) \qquad (2.11)$$

5) Repeat steps 2-4 until the output error vector $D - Y$ is sufficiently
small for all input vectors in the training set.
After training, the output of the network Y for any input vector X is calculated in two
steps:
1) Calculate the net input to each output unit: net = W X + bias
2) The network output is given by:
$$Y_i = \begin{cases} +1 & \text{if } net_i > 0 \\ -1 & \text{otherwise} \end{cases} \qquad (2.12)$$

This is known in the ANN literature as the recall phase.
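As a concrete illustration, the sketch below implements the training procedure (eqs. 2.10 and 2.11) and the recall phase (eq. 2.12) for a single-layer network of ADALINE-style units. It is a minimal sketch in Python; the learning rate, epoch count and the bipolar AND task are illustrative assumptions, not part of [WiHo60].

```python
import numpy as np

def train_delta(X, D, eta=0.05, epochs=100, seed=0):
    """Delta rule (LMS): linear output during training, eqs. 2.10 and 2.11."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-0.1, 0.1, (D.shape[1], X.shape[1]))  # step 1: small random weights
    bias = rng.uniform(-0.1, 0.1, D.shape[1])
    for _ in range(epochs):
        for x, d in zip(X, D):              # step 2: present a training pair
            y = W @ x + bias                # step 3: linear output Y = W X + bias
            W += eta * np.outer(d - y, x)   # step 4: eq. 2.10
            bias += eta * (d - y)           # step 4: eq. 2.11
    return W, bias

def recall(W, bias, x):
    """Recall phase: threshold the net input, eq. 2.12."""
    return np.where(W @ x + bias > 0, 1, -1)

# Bipolar AND, a linearly separable example
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
D = np.array([[-1], [-1], [-1], [1]])
W, bias = train_delta(X, D)
print([int(recall(W, bias, x)[0]) for x in X])  # expected: [-1, -1, -1, 1]
```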
The above training procedure directly minimizes the average of the difference
between the desired output and the net input for each output unit (what Widrow and
Hoff in [WiHo60] call measured error). However, it is possible to show that, by doing
this, we are also minimizing the average of the output error (what Widrow and Hoff call
neuron error). Since the introduction of the ADALINE/MADALINE model, Widrow and
Hoff were well aware that it could only be used to solve linearly separable problems.
In relation to the network capacity, Widrow and Lehr [WiLe90] and Nilsson
[Nil65] show that, on average, an ADALINE with p inputs can store up to 2p random
patterns with random binary desired responses. The value 2p is the upper limit, reached
as $p \to \infty$.
By comparing eq. 2.9 with eq. 2.5, we can see that the Perceptron learning rule
and the Delta rule are in principle identical, with the only major difference being the
omission of the threshold function during training in the case of the Delta rule.
However, they are based on different principles: the Perceptron rule is based upon the
placement of a hyperplane and the Delta rule is based upon the minimization of the
mean-squared-error between the desired and computed outputs.
It is also interesting to see that if we train the Linear Associator Y = W X using
the Delta rule instead of the Hebbian rule, the input vectors do not need to be
orthogonal to each other; they only need to be linearly independent. However, for p
input units, the Linear Associator is still limited to storing up to p linear associations,
since it is not possible to have more than p independent vectors in an input space with
dimension p [Per92]. In particular, if: a) the learning rate is small enough; b) all
training pairs are presented with the same probability; c) there are p input training
patterns and p network inputs; and d) the input training patterns form a linearly
independent set; then the weight matrix converges to the optimal solution $W^*$ where:

$$W^* = [D_1\; D_2\; \ldots\; D_p]\, [X_1\; X_2\; \ldots\; X_p]^{-1} \qquad (2.13)$$

For convergence results, which also apply to the ADALINE/Perceptron ANN, see
[Sim90], [Luo91], [WiSt85] and [ShRo90].
Widrow applied the LMS algorithm and its variants to train the Linear Associator
(what he called an Adaptive Linear Combiner or a non-recursive adaptive linear filter,
i.e. an ADALINE with a linear output function) on a large range of signal processing
problems. For examples of applications see Widrow and Stearns [WiSt85].
In the early 60s Widrow also proposed a heuristic algorithm to adapt the
weights of a multi-layer ANN. The first layer was composed of ADALINEs and the
output layer had a single fixed logic unit, for instance an OR, an AND or a majority-vote
taker. Only the weights arriving at the ADALINEs were adapted. The learning rule,
called MRI for Madaline Rule I, uses the minimal disturbance principle, i.e. no more
ADALINEs are adapted than necessary to correct the output decision, therefore causing
the minimal disturbance to the responses already learned.
In 1987 Widrow and Winter developed the MRII, Madaline Rule II, an extension
of MRI to allow the use of more than one logic unit at the output layer. However, up
to now neither MRI nor MRII has been used much in the ANN literature. In 1988
David Andes modified MRII into MRIII by replacing the threshold logic function used
in the ADALINE by sigmoid functions. However, Widrow and his students later realised
that MRIII is mathematically equivalent to the Back-Propagation algorithm to be
presented in the next section. For more details on the MRI and MRII rules see [WiLe90]
and [Sim90].
2.4.4 - The Multi-Layer Perceptron and the role of hidden units
Figure 2.9 - The minimum configuration for a Multi-Layer Perceptron (MLP)

Figure 2.9 shows the minimum configuration for a Multi-Layer Perceptron. At
least one layer of hidden units with nonlinear activation functions is needed. An ANN
with hidden layers of linear units can be represented by an equivalent ANN without
hidden layers. The output units can have linear or nonlinear activation functions. It is
also possible to have direct connections from the input to the output units. In general,
if we draw the ANN with the input layer at the bottom and the output layer at the top
of the diagram (as in fig. 2.9), a layer of units can send connections to any layer that
is above it, since we assume that the MLP is by definition a feedforward ANN model.
The use of hidden units makes it possible to reencode the input patterns, therefore
creating a different representation. Each hidden layer reencodes its input. Some authors
refer to the hidden units as creating internal representations or extracting the hidden
features from the data. Depending on the number of hidden units, the new representation
can correspond to vectors that are then linearly separable. If there are too few units in
a hidden layer to make the necessary reencoding possible, perhaps another layer of
hidden units is necessary. Because of this, the designer has to decide, for instance,
between using a) only one hidden layer with several units; or b) two hidden layers with
fewer units in each hidden layer. Normally no more than two hidden layers of units are
used, firstly because the representation power added by up to 2 hidden layers is likely
to be enough to solve the problem and secondly because for most of the algorithms used
nowadays the simulation results indicate that the training time increases rapidly with the
number of hidden layers.
The power of an algorithm that can adapt the weights of an MLP originates from
the fact that such an algorithm can find such a reencoding automatically by using the
given set of examples of the desired input-output mapping. It is possible to see such an
internal reencoding, or internal representation, as a set of rules (or micro-rules, as some
authors prefer to call them). So, using an analogy with expert systems, such an algorithm
would "extract" the rules or features from the set of examples, which is referred to by
some authors as the property of performing feature extraction from the data set.
Figure 2.10 - The first possible solution for the XOR problem
Figure 2.11 - The second possible solution for the XOR problem

Figures 2.10, 2.11 and 2.12 illustrate three different solutions for the XOR
problem using TLUs in the hidden and output layers. Note that for the ANNs illustrated
in fig. 2.10, 2.11 and 2.12, the output unit can also be linear with a zero bias, i.e.
respectively $y = x_3 + x_4$, $y = x_4 - x_3$ and $y = x_1 + x_2 - 2 x_3$. In figures 2.10 and 2.11 the
two hidden units reencode the input variables $x_1$ and $x_2$ as the variables $x_3$ and $x_4$. The
four input patterns are mapped to three points in the $x_3$-$x_4$ space. These three points are
then linearly separable as illustrated. Observe that the solution illustrated in figure 2.11
is a combination of the AND and OR functions.
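To make this reencoding concrete, the sketch below implements the figure 2.11 style solution with two hidden TLUs computing AND and OR and a linear output $y = x_4 - x_3$. Since the figure itself is not reproduced here, the particular weight and bias values are our own illustrative choice; any weights implementing AND and OR would do.

```python
import numpy as np

def tlu(x, w, bias):
    # Threshold logic unit: output 1 if the weighted sum plus bias is positive
    return 1 if np.dot(w, x) + bias > 0 else 0

def xor_net(x1, x2):
    x3 = tlu([x1, x2], [1, 1], -1.5)   # hidden unit: AND(x1, x2)
    x4 = tlu([x1, x2], [1, 1], -0.5)   # hidden unit: OR(x1, x2)
    return x4 - x3                     # linear output unit with zero bias

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, '->', xor_net(x1, x2))   # prints 0, 1, 1, 0
```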
Figure 2.12 illustrates that if connections from the input to the output units are
used, the XOR problem can be solved using only one hidden unit, which implements the
AND function. Then, if we consider the expanded input space $x_1$-$x_2$-$x_3$, the 4 patterns
are now linearly separable, since it is possible to find a plane that separates the points
which should produce a "0" output from the points that should produce a "1" output. If
the output unit is kept as a TLU, the decision surface in the space $x_1$-$x_2$ changes from
a line to an ellipse (see figure 2.8) [WiLe90].
Figure 2.12 - The third possible solution for the XOR problem

Since the AND function can be defined for binary variables as the product of the
variables, from figure 2.12 we can see that if we have as input to the network the value
of the variable $x_1 x_2$, one layer of units would be enough to solve the problem and there
would be no need of hidden units. Generalizing this idea, when the unit itself uses
products of its input variables, it is called a higher-order unit and the network a higher-order
ANN. In general, higher-order units implement the function [GiMa87]:

$$y_i = F\left( bias_i + \sum_j w^{(1)}_{ij} x_j + \sum_{j,k} w^{(2)}_{ijk} x_j x_k + \sum_{j,k,l} w^{(3)}_{ijkl} x_j x_k x_l + \ldots \right) \qquad (2.14)$$

From this definition, the perceptron is a first-order ANN since it uses only the first
term of the above equation. Widrow [WiLe90] refers to such units as units with
Polynomial Discriminant Functions. The problem with higher-order ANNs is the very
rapid increase in the number of weights with the number of inputs, as was noted early
on by Minsky and Papert [MiPa69]. However, recently such networks have successfully
been used for classification of images irrespective of their translation, rotation and
scaling ([RSO89], [SpRe92]), where the weight number explosion is kept under control
by grouping the weights. For some problems, as was the case for the XOR, one layer
of higher-order units may be enough since they use more complex decision surfaces than
the MLP's hyperplanes. An MLP can only implement more complex decision surfaces
by a combination of such hyperplanes.
Finally, on the subject of units that use products of inputs, Durbin and Rumelhart
proposed to use what they called product units [DuRu89]. Instead of calculating a
weighted sum, each product unit calculates a weighted product, where each input is raised
to a power determined by a variable weight. Therefore such a unit can learn an arbitrary
polynomial term. They argue that such units are biologically plausible and correspond to
processing done locally at synapses.
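A minimal sketch of such a product unit follows; the particular inputs and exponents are illustrative assumptions.

```python
import numpy as np

def product_unit(x, w):
    # Weighted product: each input x_j is raised to the power w_j, so the
    # unit computes prod_j x_j**w_j, i.e. an arbitrary polynomial term.
    return np.prod(np.power(x, w))

x = np.array([2.0, 3.0])
print(product_unit(x, np.array([1.0, 1.0])))   # x1 * x2 = 6.0
print(product_unit(x, np.array([2.0, 0.5])))   # x1**2 * sqrt(x2)
```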
2.4.5 - The Back-Propagation Algorithm
We have seen that the advantage of using hidden units is that the ANN can then
implement more complex decision surfaces, i.e the representation power is greatly
increased. The disadvantage of using hidden units is that learning becomes much harder
since the learning procedure has to decide which features it should extract from the
training data. Basically the dimension of the solution space is also greatly increased
since we need to determine a larger number of weights.
The Back-Propagation algorithm (BP) has been independently derived by several
people working in different fields. Werbos [Wer74] discovered the BP algorithm while
working on his doctoral thesis in statistics and called it the dynamical feedback
algorithm. Parker ([Par82], [Par85]) rediscovered the BP algorithm in 1982 and called
it the learning logic algorithm. Finally, in 1986, Rumelhart, Hinton and Williams
[RHW86] rediscovered the algorithm and the technique became widely known. The BP
algorithm is today the most popular supervised learning rule to train feedforward multi-
layered ANNs and it is responsible, with Hopfield networks (presented in the next
chapter), for the return of a general interest in ANNs.
The BP algorithm uses the same principle as the Delta Rule, i.e. minimize the
sum of the squares of the output error, averaged over the training set, using a gradient-
descent search. For this reason, the BP algorithm is also called the Generalized Delta
Rule. The crucial modification was to use smooth continuous activation functions in all
units instead of using TLUs. This allows the application of a gradient-descent search
even through the hidden units. The standard activation functions for the hidden units are
the so called squashing or S-shaped functions, such as the sigmoid,
$sig(x) = [1 + \exp(-x)]^{-1}$, and the hyperbolic tangent, $\tanh(x) = 2\, sig(2x) - 1$. Sometimes
the general class of squashing functions is also referred to as sigmoidal functions.
The sigmoid function increases monotonically from 0 to 1 while the hyperbolic
tangent increases from -1 to +1. Note that the sigmoid function can be seen as a smooth
approximation to the threshold function defined in eq. 2.1, while the hyperbolic tangent
can be seen as the approximation of a bipolar TLU with a -1/+1 output as used by
Widrow in the ADALINE. The function $sig(x/T)$ tends to the threshold function when
T tends to 0; the parameter T is called the temperature and is sometimes used to change
the inclination of the sigmoid or hyperbolic tangent functions around their middle point.
In some applications, especially pattern classification where we need or want to limit
the range of the output units, squashing functions are also used in those units.
The difficulty in training a MLP is that there is no pre-defined error for the
hidden units. Since the BP algorithm is a supervised rule, we have the target for the
output units but not for the hidden units. As in the case of the Delta rule we want to
change the weights in the direction that decreases the output error.
Without loss of generality, let a feedforward ANN be numbered from input to
output such that unit 1 is the first input unit and unit N is the last output unit. Assuming
that the ANN has p input units, H hidden units distributed over one or more hidden
layers, and q output units, making a total of N units (p + H + q = N), then:
$$E_{pat} = \frac{1}{2} \sum_{r=p+H+1}^{N} (D_r - out_r)^2 \qquad (2.15)$$

As in the case of the Delta rule, we apply the chain rule:

$$\Delta W_{ij} = -\eta \frac{\partial E_{pat}}{\partial W_{ij}} = -\eta \frac{\partial E_{pat}}{\partial out_i} \frac{d\, out_i}{d\, net_i} \frac{\partial net_i}{\partial W_{ij}} \qquad (2.16)$$

where $d\, out_i / d\, net_i$ is the derivative of the activation function of unit i with respect to its
argument $net_i$, and $\partial net_i / \partial W_{ij} = out_j$. However, to calculate the term $\partial E_{pat} / \partial out_i$ we need
to consider whether the unit i is an output unit ($p+H+1 \le i \le N$) or a hidden unit
($p+1 \le i \le p+H$). If the unit i is an output unit, then, as in the Delta rule, we have:

$$\frac{\partial E_{pat}}{\partial out_i} = -(D_i - out_i) \qquad (2.17)$$

If the unit i is a hidden unit, then:

$$\frac{\partial E_{pat}}{\partial out_i} = \sum_{L=i+1}^{N} \frac{\partial E_{pat}}{\partial net_L} \frac{\partial net_L}{\partial out_i} \qquad (2.18)$$

but $\partial net_L / \partial out_i = W_{Li}$. If we define $-\partial E_{pat} / \partial net_L = \delta_L$, then:

$$\frac{\partial E_{pat}}{\partial out_i} = -\sum_{L=i+1}^{N} \delta_L W_{Li} \qquad (2.19)$$

Equations 2.18 and 2.19 simply state that the effect of the output of a hidden unit on
the output error is defined as the summation of the effects on the units that receive
connections from the hidden unit, each multiplied by the value of the corresponding
connection. In other words, the output error is "back-propagated" from the output layer
to the hidden layers through the weights and through the nonlinear activation functions.
Observe that, in relation to the Delta rule, the only new equation is really eq. 2.18,
since the new problem created by the hidden units is to find how a change in a weight
received by a hidden unit affects the output error.
Summarizing, we have:

$$\frac{\partial E_{pat}}{\partial W_{ij}} = -\delta_i\, out_j \qquad (2.20)$$

where for output units ($p+H+1 \le i \le N$):

$$\delta_i = (D_i - out_i) \frac{d\, out_i}{d\, net_i} \qquad (2.21)$$

and for hidden units ($p+1 \le i \le p+H$):

$$\delta_i = \left[ \sum_{L=i+1}^{N} \delta_L W_{Li} \right] \frac{d\, out_i}{d\, net_i} \qquad (2.22)$$

As usual, the above equations are also applied to adjust the biases, by simply considering
them as additional weights that come from units with a constant unit output, i.e. in eq.
2.20, $out_j = 1$.
Observe that in the above derivation of the BP algorithm, only the following
constraints are imposed on the network: 1) the network is a feedforward
ANN; 2) all units have differentiable activation functions $f(net_i)$; and 3) the combination
function is defined in vectorial notation as net = W out + bias. Some possible cases
are: the use of different activation functions in the hidden layer; the use of several hidden
layers; and feedforward networks that are not strictly layered (for instance, with direct
connections from the input to the output units).
Another reason for using the sigmoid function or the hyperbolic tangent in a
multi-layered ANN is that their derivatives can be calculated simply from their output
values ($d\, sig(x)/dx = sig(x)\,[1 - sig(x)]$; $d\tanh(x)/dx = [1 + \tanh(x)]\,[1 - \tanh(x)]$), without the
need for more complex calculations. This is very useful since it reduces the overall
number of calculations needed to train the network.
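These identities are easy to check numerically; a small illustrative sketch:

```python
import numpy as np

def sig(x):
    return 1.0 / (1.0 + np.exp(-x))

x, eps = 0.7, 1e-6
num_dsig = (sig(x + eps) - sig(x - eps)) / (2 * eps)   # numerical derivative
print(np.isclose(num_dsig, sig(x) * (1 - sig(x))))     # expected: True
num_dtanh = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
print(np.isclose(num_dtanh, (1 + np.tanh(x)) * (1 - np.tanh(x))))  # expected: True
```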
2.4.6 - Using the Back-Propagation Algorithm
In relation to the initialization of the weights and biases, Rumelhart et al.
[RMW86] suggested using small random values. Concerning the learning rate $\eta$, they
point out that, although larger learning rates will result in more rapid learning, they can
also lead to oscillation. They suggested that one way to use larger learning rates without
leading to oscillations is to modify eq. 2.16 by adding a momentum term:

$$\Delta W_{ij}(k+1) = \eta\, \delta_i\, out_j + \alpha\, \Delta W_{ij}(k) \qquad (2.23)$$

where the index k indicates the presentation number and $\alpha$ is a small positive constant
selected by the user. A larger $\alpha$ increases the influence of the last weight change on the
current weight change. Such a modification in effect filters out the high-frequency
oscillations in the weight changes since it tends to cancel weight changes in opposite
directions and reinforces the predominant direction of change. This can be useful when
the error surface contains long ravines with a sharp curvature across the ravine and a
floor with a small inclination. For more details about the use of the momentum term see
[Zur92].
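The sketch below gathers eqs. 2.16 to 2.23 into a minimal single-hidden-layer BP implementation with sigmoid units, incremental updating and a momentum term. The network size, learning rate, momentum and epoch count are illustrative assumptions, and convergence on a given run depends on the random initial weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_bp(X, D, n_hidden=2, eta=0.5, alpha=0.9, epochs=5000, seed=0):
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-0.5, 0.5, (n_hidden, X.shape[1]))   # input -> hidden
    b1 = np.zeros(n_hidden)
    W2 = rng.uniform(-0.5, 0.5, (D.shape[1], n_hidden))   # hidden -> output
    b2 = np.zeros(D.shape[1])
    dW1, dW2 = np.zeros_like(W1), np.zeros_like(W2)       # previous weight changes
    for _ in range(epochs):
        for x, d in zip(X, D):   # incremental updating (fixed order, for brevity)
            h = sigmoid(W1 @ x + b1)                      # forward pass
            y = sigmoid(W2 @ h + b2)
            delta_out = (d - y) * y * (1 - y)             # eq. 2.21
            delta_hid = (W2.T @ delta_out) * h * (1 - h)  # eq. 2.22
            dW2 = eta * np.outer(delta_out, h) + alpha * dW2   # eq. 2.23
            dW1 = eta * np.outer(delta_hid, x) + alpha * dW1
            W2 += dW2; b2 += eta * delta_out
            W1 += dW1; b1 += eta * delta_hid
    return W1, b1, W2, b2

# XOR as a test problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
W1, b1, W2, b2 = train_bp(X, D)
for x in X:
    # should approach 0, 1, 1, 0 after successful training
    print(x, sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2))
```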
In the case of the Delta rule, when applied to networks without hidden layers and
with output units with linear activation functions, the error surface will always have a
bowl shape and the local minima points are also global minima. If the learning rate is
small enough, the Delta rule will converge to one of these minima. In the case of the
MLP, the error surface can be much more complex with many local minima. Since the
BP is, as the Delta rule, a gradient-descent procedure, there is the possibility for the
algorithm to get trapped in one of these local minima and therefore not converge to the
best possible solution, the global minimum ([WiLe90], [Zur92], [McRu88]).
Whenever we have a pre-determined set of training data with a fixed number of
patterns, we can define an epoch as a single presentation of all training patterns to the
network. We will normally adopt a random order presentation of the training patterns
during an epoch and to adjust the weights after the presentation of each single pattern.
This is called random incremental updating as opposed to sequential cumulative
updating, when the patterns are presented to the network with a constant ordering, the
weight changes are summed and the weights are only updated at the end of the epoch.
Simulations results indicate that random incremental updating tends to work better than
sequential cumulative updating, since it injects some "noise" into the search procedure
[Zur92] and therefore helps the network to settle to a better local minimum.
It is interesting to know that, as Widrow points out [WiLe90], the idea of error
backpropagation through nonlinear systems has been used for centuries in the field of
variational calculus and has also been widely used since the 60s in the field of optimal
control. Le Cun [LeC89] and Simpson [Sim90] point out that Bryson and Ho [BrHo69]
developed an algorithm very similar to the BP algorithm for nonlinear adaptive control.
Le Cun [LeC89] also shows how, using a Lagrangian formalism, the BP algorithm can
be derived as a solution to an optimization problem with nonlinear constraints and that
from such interpretation some extensions can easily be derived.
Although the BP algorithm was proposed for feedforward ANN, Almeida
[Alm89] has extended it to feedback networks by using a linearization technique, where
he assumes that each input pattern is presented to the network long enough for it to
reach a stable equilibrium. Only then are the outputs compared to the desired ones. Also
he assumes that the desired outputs depend only on the present inputs, not on the past
ones. Rumelhart et al. [RHW86] also considered applying the BP algorithm to feedback
networks but they used different assumptions. They simply expand the feedback network
as a feedforward network with several layers. This is possible because, as Minsky and
Papert [MiPa69] point out, for every feedback network, there is a feedforward network
with identical behaviour over a finite period of time. The BP algorithm is then applied
on this equivalent feedforward network and the weights are averaged after each change
to avoid violating the constraint that certain weights should be equal.
Another multi-layered learning algorithm that was presented before the
popularization of the BP algorithm in 1986 was the Boltzmann Machine (BM),
introduced in 1984 by Hinton, Ackley and Sejnowski ([HAS84], [HiSe86]). It uses a
much more complicated procedure than the BP algorithm in which the activations of the
hidden units are probabilistically adjusted using gradually decreasing amounts of noise
to escape local minima in favour of the global minimum. The idea of using noise to
escape local minima is called simulated annealing [KGV83]. The combination of
simulated annealing with the probabilistic adjustment of the hidden layers is called
stochastic learning [Sim90]. The main disadvantage of the Boltzmann Machine is its
excessively long training time. Later on, in 1986, Szu introduced a modified version of
the Boltzmann Machine called the Cauchy Machine (CM) that uses a fast simulated
annealing procedure [Szu86]. Although faster than the Boltzmann Machine, the Cauchy
Machine still suffers from very long training times [Sim90].
2.5 - Representation, Learning and Generalization
The first problem to be solved when applying feedforward ANNs trained using
supervised learning is the training data selection problem, i.e. to select a data set to be
used when training the ANN. Such a training data set must contain the underlying
relationship that the ANN should acquire. Since in most cases this underlying
relationship is unknown, this may not be a trivial problem.
Once a training data set has been selected, the subsequent problems, in the
sequence that they have to be solved, can be classified in three main areas:
representation, learning and generalization.
The representation problem is how to design the ANN structure such that there
is at least one solution (set of network weights) that learns the training set. The learning
problem is how to find one of these possible sets of weights, i.e. training the ANN. This
is also referred to by some authors as the loading problem, based on the concept that
we are "loading" the training data set onto the ANN [Jud90]. Once training is finished,
the generalization problem is concerned with the network response when presented with
data that was not in the training set. A measure of generalization is normally obtained
by verifying the network performance using a test data set.
2.5.1 - The Representation Problem
The representation problem concerns: a) how many hidden layers we use; b) how
many units in each hidden layer; and c) which functions we use for the hidden units.
Normally the particular application in hand will specify how many input and output
units the ANN should have.
Particularly in classification problems (to determine the class to which the input
pattern belongs) the designer has some freedom to decide how to code the output, e.g.
using binary coding or 1-of-N coding. Sometimes, the designer may even decide to
preprocess the input data. Here we will assume that the designer has already decided the
input and output representation.
Once the designer has decided the network input and output representation, to
solve the particular problem in hand it is still necessary to look for the network internal
representation. The representation problem is then to choose the ANN structure such that
an internal representation exists, i.e. that there is at least one set of parameters (weights)
that can reproduce the training data set with a small error. At this moment there is very
little theory to help in this task.
Hornik et al. [HSW89] established that a feedforward ANN with as few as one
hidden layer using arbitrary squashing activation functions (such as sigmoids) and no
squashing functions at the output layer is capable of approximating virtually any
function of interest from one finite multi-dimensional space to another to any degree of
accuracy, provided sufficiently many hidden units are available. Later Stinchcombe and
White [StWh89] extended this result and showed that even if the activation function
used in the hidden layer is a rather general nonlinear function, the same type of FF
ANN is still a universal approximator. More or less at the same time, Funahashi
[Fun89], Cybenko [Cyb89], Kreinovich [Kro91] and Ito [Ito91] proved similar results.
White [Whi92] edited a book with a collection of his papers on this subject of ANNs
and approximation and learning theory.
From a theoretical point of view such results are important but they are existence
proofs, i.e. they prove that there is a FF ANN with just one hidden layer using
squashing or non-squashing functions in the hidden layer that solves the input-output
mapping problem. However, it is not possible to deduce from these proofs the ANN
topology (number of hidden layers and number of units in each hidden layer) or, once
the network topology is chosen, how to determine the network free parameters (the
weights).
Another important point not clarified by the proofs mentioned above is, given a
specific criterion such as minimum number of hidden units, which function is more
suitable to be used as the activation function for the hidden units. In general, these
functions belong to two classes: local or global (also called nonlocal) functions.
Units that use local functions have a constant output (normally zero) outside a
closed region of the unit input space and a different set of values within the closed
region. Units that use functions that cannot be characterized as local functions are said
to use global functions.
The classical example of a FF ANN that uses local functions in the hidden layer is
the so called gaussian Radial Basis Function (RBF) network, where $out_i = \exp(-net_i^2)$ and
$net_i = \lVert x - C_i \rVert$, and $C_i$ is a vector which determines the position of the centre of the unit.
In this case the regions where the unit output is above or below a certain value are
respectively closed and open regions, and the decision surfaces are in general ellipsoids.
When using the usual combining function $net_i = W_i x + bias_i$, where x is the unit
input vector, the squashing and step functions are examples of global functions. In this
case there is a hyperplane that divides the unit input space into two regions, where the
unit output has a high constant value in one region and a low constant value in the other.
If we consider the input space to be unbounded, the regions where
the unit output is above or below a certain value are open regions and the decision
surfaces are hyperplanes. Note that the use of higher-order units (see eq. 2.14) with a
squashing or step function makes it possible for the unit to implement global or local
functions by varying the unit weight values.
Park and Sandberg ([PaSa91],[PaSa93]) proved that RBF networks with just one
hidden layer and linear output units are also universal approximators.
2.5.2 - The Learning Problem
Once we have decided the network topology and the type of units to be used, the
next step is to determine the network free parameters, i.e. the network weights. The
range of applicable algorithms depends on the particular functions used in the hidden
units. Typically the Back-Propagation algorithm is used for FF ANNs with squashing
functions.
The BP algorithm can also be used for RBF networks but Moody and Darken
[MoDa89] have proposed a hybrid algorithm with two stages. In the first stage the
hidden units centres and the widths of the gaussian functions used by the hidden units
are determined in an unsupervised manner, i.e. by using only the input data and not the
correspondent desired outputs. The centres are determined by using a k-means clustering
algorithm and the widths by nearest-neighbour heuristics. In the second stage just the
output weights, i.e. the weights between the hidden and output units, which correspond
to the amplitudes of the gaussians, are modified in order to minimize the standard least-
squares error using a supervised algorithm such as the delta rule. The authors found out
that, in comparison with networks with sigmoid units trained by BP, the convergence
is very rapid, possibly because the first unsupervised stage has done most of the work
necessary for the correct classification. However, a possible drawback is the need for a
larger number of hidden units (and therefore network weights) to achieve the same
accuracy when approximating certain functions, in comparison with a network which
uses squashing functions.
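A rough sketch of this two-stage hybrid scheme follows: centres from a simple k-means pass, widths from a nearest-centre heuristic, and output weights from linear least squares. The number of centres, the iteration count and the test function are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]          # initial centres
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - C) ** 2).sum(-1), axis=1)
        C = np.array([X[labels == j].mean(0) if np.any(labels == j) else C[j]
                      for j in range(k)])
    return C

def train_rbf(X, D, k=6):
    C = kmeans(X, k)                                     # stage 1: unsupervised
    d = np.sqrt(((C[:, None] - C) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)
    sigma = d.min(1)                                     # width: nearest-centre distance
    Phi = np.exp(-((X[:, None] - C) ** 2).sum(-1) / (2 * sigma ** 2))
    W, *_ = np.linalg.lstsq(Phi, D, rcond=None)          # stage 2: least squares
    return C, sigma, W

def rbf_out(C, sigma, W, x):
    phi = np.exp(-((x - C) ** 2).sum(-1) / (2 * sigma ** 2))
    return phi @ W

# Fit a one-dimensional function as a quick example
X = np.random.default_rng(1).uniform(-1, 1, (50, 1))
D = np.sin(np.pi * X)
C, sigma, W = train_rbf(X, D)
print(rbf_out(C, sigma, W, np.array([0.5])), np.sin(np.pi * 0.5))
```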
The algorithms used to train a FF ANN can be classified into two main classes:
a) the algorithms that try to converge to the global minimum solution, and b) the
algorithms that try to converge rapidly. Unfortunately, it seems that the two classes do
not overlap. Consequently the algorithms that try to converge rapidly can still be trapped
in local minima (as BP does) while the algorithms that try to converge to the global
minimum tend to converge very slowly when compared, for instance, with the BP
algorithm.
Examples of algorithms that look for the global minimum are the Boltzmann
Machine, already mentioned in the previous section, and genetic algorithms
([MoDav89], [HKP91]). Another possible problem with the use of genetic algorithms to
train FF ANNs is the need for large amounts of processing power and memory.
Jacobs [Jac88] and Silva and Almeida [SiAl90] proposed to adapt the learning
rate (the step size) when executing the BP algorithm in order to speed up convergence.
This modification has the advantage that it does not significantly increase the
computational and memory requirements in relation to the standard BP algorithm.
The BP algorithm is a first-order algorithm since it uses only the first derivative
of the cost function to search for the minimum. Several researchers have proposed
second-order algorithms to perform such a search, for instance, Becker and le Cun
[BeCu88] and Kollias and Anastassiou [KoAn89]. Battiti [Bat92] published a review of
the application of first- and second-order methods for the training of FF ANN.
The main problems of using such second-order algorithms are: 1) a large increase
in the number of operations performed and in the memory requirements, especially for
large networks; and 2) not all implementations use local computations. Furthermore,
Saarinen et al. [SBC91] argue that many network training problems are ill-conditioned,
i.e. have ill-conditioned or indefinite Hessians, and therefore may not be solved more
efficiently by higher-order optimization algorithms.
A more recent approach has been suggested by Shah et al. [SPD92] where they
use optimal stochastic filtering techniques to train the ANN and at the same time they
pay attention to the computational and storage costs. Tepedelenlioglu et al. [TRSR91]
and Singhal and Wu [SiWu89] have proposed to use the Extended Kalman Filtering
algorithm to train FF ANNs.
There have also been a few approaches that try to reduce the network training
time and at the same time determine the number of units in the hidden layer, i.e. they
try to adapt the network topology. Normally such approaches start with an ANN with
a small size and add hidden units. Fahlman and Lebiere proposed the Cascade-
Correlation Learning Architecture [FaLe90] and studied the two-spirals problem (the
training points are arranged in two interlocking spirals).
Hirose et al. [HYH91] also suggest adapting during training the number of
hidden units with the aim of escaping local minima. Training is performed as standard
by the BP algorithm and they proposed adding an extra hidden unit whenever the
network seems to be trapped in a local minimum. Since the addition of such an extra
hidden unit distorts the error surface, that point in the weight space is not a local
minimum anymore. Later on, after satisfactory convergence is achieved, they proposed
a way of eliminating some of the hidden units.
2.5.3 - The Generalization Problem
Even if the training algorithm manages to find a satisfactory solution for the
training patterns, the ANN still needs to produce "reasonable" outputs when presented
with input patterns that were not in the training set, i.e. the ANN needs to be able
to "generalize" what it has learned.
Poggio and Girosi ([PoGi90a],[PoGi90b]) state that, from the point of view that
FF ANN are trying to learn an input-output mapping from a set of examples, such a
form of learning is closely related to classical approximation techniques, for instance,
generalized splines and regularization theory [TiAr77]. In this case learning can be seen
as solving the problem of hypersurface reconstruction and is an ill-posed problem, since
in general there are infinitely many solutions. A priori assumptions are then necessary to
make the problem well-posed. Possibly the simplest assumption is that the input-output
mapping is smooth, that is, small changes in the inputs cause a small change in the
output.
Training a FF ANN can be seen as a generalized multi-dimensional version of
finding the parameters of a polynomial that fits a set of points drawn from a uni-
dimensional space. Too many degrees of freedom (too many weights in the ANN) can
result in overfitting the training data and in poor performance on the test data set
[HKP91]. Therefore the ideal situation would be to find the minimum number of hidden
units that can produce the desired input-output mapping. This should result in the
smoothest possible mapping. Since it is very difficult and time-consuming to determine
the minimum number of hidden units, one approach that is frequently used is to train
the network using a small training data set and periodically to test the network using a
larger test data set. Training is then stopped when the error cost function measured over
the test data achieves the minimum value. If we continue training the network after such
a minimum is achieved, the error cost function measured over the training data will
continue to decrease, but the error measured over the test data will increase.
Baum and Haussler [BaHa89] proved some theoretical bounds relating the
appropriate sample size to the network size in terms of the network generalization.
One possible approach that can be used to improve network generalization is to
somehow constrain, during training, the degrees of freedom available to the network
trying to obtain a near-optimal network topology. The ANN should then be large enough
to contain the desired knowledge (assumed to be contained in the examples of the
training data set) but small enough to generalize well. A simple approach is just to add
to the normal cost function (the mean-squared-output error) a penalty for network
complexity. One possibility is to use the weight decay idea, i.e. we add to the cost
function the term $\lambda \sum_i w_i^2$ [HKP91]. The application of the BP algorithm to this new cost
function results in a weight decay term which discourages very large weights. Another
possibility is to use the extra cost function term $\lambda \sum_{ij} w_{ij}^2/(K + w_{ij}^2)$. For small weights each
term can be approximated by $w_{ij}^2/K$ and for large weights by a constant. After training, the ANN
can then be tested with the weights with the smallest magnitudes removed, which
is known as "pruning" the network. When all the incoming weights of a hidden unit are
removed, the hidden unit is effectively removed as well. Therefore the weight-elimination
stage can also affect the network topology.
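As a small sketch, and assuming the $\lambda \sum w^2$ form above, the penalty only adds a term proportional to each weight to the gradient step:

```python
import numpy as np

def decay_step(W, grad_E, eta=0.1, lam=1e-3):
    # Gradient descent on E + lam * sum(W**2): the penalty contributes
    # 2*lam*W to the gradient, which shrinks every weight toward zero.
    return W - eta * (grad_E + 2 * lam * W)

W = np.array([3.0, -0.02])
for _ in range(200):
    W = decay_step(W, grad_E=np.zeros_like(W))  # zero error gradient, decay only
print(W)   # both weights decay exponentially toward zero
```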
Nowlan and Hinton ([NoHi92a],[NoHi92b]) propose an approach where the
network degrees of freedom are constrained by encouraging clustering of the weight
values. While the weight decay approach encourages clustering around the zero value,
their approach is aimed at encouraging clustering around a finite set of arbitrary real
values, which is sometimes called weight-sharing. Kendall and Hall propose the minimum
description length (MDL) approach [KeHa93], aimed at minimizing the information
content of the network weights. They claim that the MDL approach also encourages weight
elimination and weight-sharing.
More recently Green, Nascimento and York [GNY93] proposed to add
competition within the hidden layer of a FF ANN in order to eliminate unnecessary
hidden units. The addition of the competition turns the network into a feedback ANN.
However, the BP algorithm is applied as normal since they proposed to ignore the
competition weights during the backward pass of the BP algorithm.
One drawback that all these approaches have in common is the need to select
some extra parameters during training.
2.6 - Limitations of Feedforward ANNs
The basic concepts of Artificial Neural Networks and the differences in relation
to traditional computation were introduced in this chapter. Also the more important
feedforward ANN models were presented and the role of hidden units was discussed.
The majority of feedforward ANN models currently in use are sigmoid based and
have the following limitations:
1) Current ANN models take a long time to train; there is no
guarantee of convergence and the learning is inconsistent, i.e. the mean-squared
error can remain high for many iterations and suddenly decrease
to a lower value. Therefore, without previous experience with a particular
problem, it is very difficult to estimate how long training will take.
2) When an ANN produces an output that corresponds to a decision, for
instance in a pattern classification problem, in general it is very difficult
to trace how the network reached such a decision, that is, to get an
"explanation" from the network. An ANN, by being trained using a
training data set, extracts the knowledge from the set of examples and
creates its own internal representation. To extract the knowledge coded
into the network we need to understand this internal coding, a difficult
task.
3) In general, an ANN does not give confidence intervals for its outputs.
However, Richard and Lippmann [RiLi91] show that when an FF ANN is
trained to solve an M-class problem (one output unit corresponding to the
correct class, all others zero) using a mean-squared-error cost function,
such as in the BP algorithm, the network outputs provide estimates of
Bayesian probabilities.
4) Without prior experience with the problem in hand, the network
topology is determined by trial and error. Too small a network will make
learning impossible and too large a network will generalize badly.
5) It is not possible, in the general case, to encode prior information into
the network. If this were possible, training times could be reduced
considerably.
While the models presented in this chapter were of the feedforward type, the next
chapter concerns feedback networks, their theory and applications.
Chapter 3 - Feedback Neural Networks:
the Hopfield and IAC Models
The main feedforward ANN models were presented in chapter 2. In this chapter
the principles behind the use of feedback ANNs are introduced and two models, the
Hopfield and IAC (Interactive Activation and Competition) neural networks are
presented and analyzed.
Because of the presence of the feedback connections, feedback ANNs are
nonlinear dynamical systems which can exhibit very complex behaviour. They are used
in two areas: 1) as associative memories or 2) to solve some hard optimization
problems. The basic idea in using a feedback ANN as an associative memory is to
design the network such that the patterns that should be memorized correspond to stable
equilibrium points. To use feedback ANN to solve optimization problems the network
is designed so that it converges to the stable equilibrium points that correspond to good
(perhaps not necessarily optimal) solutions of the problem in hand.
In this chapter we show how the IAC network can be used to solve certain
optimization problems, much like the Hopfield network. As an example we show in
detail how to implement a 2-bit analog-digital converter using the IAC network.
3.1 - Associative Memories
To work as an associative memory, a network has to solve the following
problem:
"Store M patterns S such that when presented with a new pattern Y, the
network returns the stored pattern S that is closest in some sense to Y".
Such an associative memory can work as a content-addressable memory, since we should
be able to retrieve a stored pattern by using as input an incomplete or corrupted
version of it (pattern completion and pattern recognition). Possible applications are in
hand-written digit and face recognition tasks and retrieval of information in general
databases.
For mathematical convenience we will assume that the components of the stored
patterns S and the test patterns Y can only be -1 or +1, instead of the usual binary values
0 and 1.
Figure 3.1 shows the general model of a one-layer feedback ANN that can be
used as an associative memory. In this particular case each unit is a TLU (Threshold
Logic Unit) with a bipolar output. The output of each unit is calculated as:

$$Y_i = \text{sgn}(net_i) = \begin{cases} +1 & \text{if } net_i > 0 \\ -1 & \text{if } net_i < 0 \end{cases} \qquad (3.1)$$

where the net input $net_i$ is calculated as:

$$net_i = \sum_{j=1}^{N} W_{ij} Y_j + bias_i + ext_i \qquad (3.2)$$

where N is the number of units in the network. The terms $bias_i$ and $ext_i$ represent
respectively the fixed internal and the variable external inputs. These terms could be
grouped together but in most models one or both of them are zero.
For simplicity, let's consider for the moment that the bias term $bias_i$ and the
external input $ext_i$ are zero.
The network is operated as follows: 1) an input pattern is loaded into the network
as the initial values for the network output Y; 2) the network output values are updated
asynchronously and stochastically, i.e. at each time step a unit is selected randomly from
among the N units with equal probability 1/N, independently of which units were
updated previously, and eqs. 3.1 and 3.2 are used to update its output. We will show
later that, under some conditions, after a sufficiently large number of time steps, the
network will converge to a stable equilibrium point (EP), called a "memory". The outputs
of the units are then interpreted as the network classification of the input pattern.
Three important issues in such applications are: 1) how the network weights
should be adjusted such that the network is stable, that is, such that the network converges
to an EP for any initial condition; 2) for a network with N units, how many patterns can
be stored; and 3) under what conditions will the network converge to the closest stored
pattern.
Figure 3.1 - A one-layer feedback ANN

Note that: 1) the units are simultaneously input and output units; 2) since there
are no hidden units, such a network cannot encode the patterns, or in other words, the
network cannot change the pattern representation; and 3) the network state always occupies
the corners of the hypercube $[-1, +1]^N$.
3.1.1 - Storing one pattern
Let's first consider the simple case where we want to store just one pattern. A
pattern Y is a stable EP if:

$$\text{sgn}\left( \sum_{j=1}^{N} W_{ij} Y_j \right) = Y_i \qquad (3.3)$$

for all i, since then, when eq. 3.1 is applied to update the unit output, no change will be
produced. Representing by S the pattern that we want to store, this can be achieved by
setting the network weights to:

$$W_{ij} = k\, S_i S_j \qquad (3.4)$$

where k > 0, since then:

$$\text{sgn}\left( \sum_{j=1}^{N} k\, S_i S_j S_j \right) = \text{sgn}(k N S_i) = S_i \qquad (3.5)$$

given that $S_j S_j = 1$. For later convenience, let k = 1/N. Then, in vectorial notation, we
have that:

$$W = \frac{1}{N} S\, S^T \qquad (3.6)$$

where S is a column vector and W is a symmetric matrix.
Note that even if almost half of the bits of the initial condition (the starting
pattern) are wrong, the stored pattern will still be retrieved, since the correct bits, which
are in the majority, will force the sign of the net input to be equal to $S_i$. This can be
proved by combining eqs. 3.3 and 3.6:

$$\text{sgn}\left( \sum_{j=1}^{N} W_{ij} Y_j \right) = \text{sgn}\left( \frac{S_i}{N} \sum_{j=1}^{N} S_j Y_j \right) = \text{sgn}\left( S_i\, \frac{N_c - N_w}{N} \right) = S_i \qquad (3.7)$$

where $N_c$ and $N_w$ are respectively the number of correct and wrong bits in the starting
pattern Y in relation to the stored pattern S. Observe also that if the starting pattern has
more than half of its bits different from the stored pattern ($N_w > N_c$) then the network will
retrieve the inverse of the stored pattern, i.e. $-S$. Therefore there are two stable EPs,
sometimes also called attractors. The set of patterns that converge to one of the EPs
constitutes what is called the basin of attraction or region of convergence of that EP.
For this particular case, the entire input space is symmetrically divided into the two
basins of attraction.
3.1.2 - Storing several patterns
One simple way to store more than one pattern in the network is to generalize
eq. 3.4 and try to superimpose the patterns by using:

$$W_{ij} = \frac{1}{N} \sum_{pat=1}^{M} S^{pat}_i S^{pat}_j \qquad (3.8)$$

or, in vectorial notation,

$$W = \frac{1}{N} \sum_{pat=1}^{M} S^{pat} (S^{pat})^T \qquad (3.9)$$

where M is the total number of patterns that we want to store in the network and the
weight matrix W is still symmetric.
Equations 3.8 and 3.9 are implementations of the Hebbian rule, already
introduced in chapter 2. A feedback network operating as an associative memory, using
the Hebbian rule to store all patterns and being updated asynchronously is usually called
a discrete-time Hopfield network, after J. J. Hopfield who emphasized the concept of
using the equilibrium points of nonlinear dynamical systems as stored memories [Hop82].
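A minimal sketch of such a discrete-time Hopfield network follows: Hebbian storage (eq. 3.9) and random asynchronous updating (eqs. 3.1 and 3.2, with zero bias and external input). The pattern sizes, number of patterns and corruption level are illustrative.

```python
import numpy as np

def store(S):
    """Hebbian storage, eq. 3.9: S holds one bipolar pattern per row."""
    return (S.T @ S) / S.shape[1]

def recall(W, y, steps=2000, seed=0):
    """Asynchronous stochastic updating, eqs. 3.1 and 3.2."""
    rng = np.random.default_rng(seed)
    y = y.copy()
    for _ in range(steps):
        i = rng.integers(len(y))        # pick one unit at random
        net = W[i] @ y
        if net != 0:
            y[i] = 1 if net > 0 else -1
    return y

rng = np.random.default_rng(1)
S = rng.choice([-1, 1], size=(3, 100))  # M = 3 random patterns, N = 100 units
W = store(S)
probe = S[0].copy()
probe[:10] *= -1                        # corrupt 10 of the 100 bits
print(np.array_equal(recall(W, probe), S[0]))  # expected: True, since M << N
```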
The patterns S will be stored as stable EPs, i.e. fixed attractors, if they satisfy
the condition that:

$$\text{sgn}\left( \sum_{j=1}^{N} W_{ij} S_j \right) = S_i \qquad (3.10)$$

By combining eqs. 3.8 and 3.10 we have that:

$$\text{sgn}\left( \frac{1}{N} \sum_{j=1}^{N} \sum_{pat=1}^{M} S^{pat}_i S^{pat}_j S_j \right) = S_i \qquad (3.11)$$

Let's suppose that we want to test such a condition for the stored pattern $S^1$. The interior
of the function sgn( ) can be separated into the term pat = 1 and the terms pat > 1:

$$\frac{1}{N} \sum_{j=1}^{N} S^1_i S^1_j S^1_j + \frac{1}{N} \sum_{j=1}^{N} \sum_{pat=2}^{M} S^{pat}_i S^{pat}_j S^1_j = S^1_i + \text{c.t.} \qquad (3.12)$$

where c.t. stands for the crosstalk term, the second term of the left side of eq. 3.12.
Therefore, if the magnitude of the crosstalk term is less than 1, it will not change the
sign of $S^1_i$ and the condition for stability of the pattern $S^1$ will be satisfied. The
magnitude of the crosstalk term is a function of the type and number of patterns to be
stored.
For many cases of interest, provided that the number of patterns to be stored is
much less than the number of units (M << N, see next section about storage capacity),
the crosstalk term is small enough and all stored patterns are stable. Moreover, as in the
single pattern case, if the network is initialized with a version of one of the stored
patterns that is corrupted with a few wrong bits, the network will retrieve the correct
stored version [HKP91].
3.1.3 - Storage Capacity
Hertz et al. [HKP91] show that if: a) the patterns to be stored are random
(each bit has equal probability of being -1 or +1) and independent; and b) M and N are
large, then the crosstalk term can be approximated by a random variable with gaussian
distribution, zero mean and variance M/N. Therefore the ratio M/N determines the
probability of the crosstalk term being greater than 1 for $S_i = -1$, or less than -1 for
$S_i = +1$. From this modelling we can estimate, for instance, that if we choose
M = 0.185 N and the network is initialized with one of the S patterns, no more than 1%
of the bits will change. However, these few bits that change can cause more bits to
change, and so on: what is known as the "avalanche" effect.
Hertz et al. [HKP91] show, using an analogy to spin glass models and mean
field theory, that this avalanche occurs if M > 0.138 N, and therefore we could not use
the network as a "memory". They also show that, using the previous modelling, for
M = 0.138 N, 0.37% of the bits will change initially and 1.6% of them will change
before an attractor is reached. So, if we choose M ≤ 0.138 N, there will be an attractor
"close" to each of the patterns S that we want to store, i.e. they will be retrieved but the
final result will have a few bits wrong. As an example for this case, for N = 256, M ≤ 35.
If we want to recall all stored patterns S without error (perfect recall), i.e. to
force the patterns S to be the attractors (not only "close" to the attractors as in the
previous case), then McEliece et al. [MPRV87] show that M ≤ N/(4 ln N). Moreover,
they show that perfect recall will happen if the initial pattern has fewer than N/2 different
bits when compared with a stored pattern ([HKP91],[HuHo3]). In this case, for N = 256,
M ≤ 11.
From these arguments we can see that, when using the Hebbian rule (eqs. 3.8 and
3.9), the storage capacity of the Hopfield network is rather limited. Other design
techniques have been proposed that improve the storage capacity ([VePs89],[FaMi90])
to a value closer to M = N, the limit for the storage capacity of the Hopfield network
[AbJa85].
Note as well that if the patterns to be stored are all orthogonal to each other, i.e.

$$S_l^T S_k = \begin{cases} 0 & \text{for } l \ne k \\ N & \text{for } l = k \end{cases} \qquad (3.13)$$

apparently the memory capacity would be N, since the crosstalk term is zero in this case
(see eq. 3.12). However, if we use the Hebbian rule (eqs. 3.8 or 3.9) to store N
orthogonal patterns, the weight matrix W will be equal to the identity matrix, i.e. each
unit feeds back only to itself. Such an arrangement is useless as a memory, since it makes
all initial patterns stable, that is, the network never changes its initial pattern. This can
be interpreted as making attractors of all points of the discrete configuration space, so
that their basins of attraction contain only the attractors themselves. Therefore, to make the
network useful in this case, we need to store fewer than N orthogonal patterns.
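This degenerate case is easy to verify numerically; a sketch using N = 4 mutually orthogonal bipolar patterns (a Hadamard-style set, chosen for illustration):

```python
import numpy as np

# Four mutually orthogonal bipolar patterns (one per row), N = 4
X = np.array([[ 1,  1,  1,  1],
              [ 1, -1,  1, -1],
              [ 1,  1, -1, -1],
              [ 1, -1, -1,  1]])
W = (X.T @ X) / 4   # Hebbian rule (eq. 3.9) applied to all N patterns
print(W)            # the identity matrix: every initial state is then stable
```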
We can prove that the weight matrix will be equal to the identity matrix if we
try to store N patterns using the Hebbian rule, by defining a square, in general non-symmetric,
matrix X where each row of X is defined as the transpose of one of the
patterns to be stored, i.e. $X_{ij} = S^i_j$. Consequently, from eq. 3.13 we have that:

$$X X^T = N I \qquad (3.14)$$

where I is the identity matrix. Then, we can rewrite eq. 3.9 as:

$$W = \frac{1}{N} \sum_{pat=1}^{M} S^{pat} (S^{pat})^T = \frac{1}{N} X^T X \qquad (3.15)$$

By the definition of orthogonality, no row of the matrix X can be written as a linear
combination of the other rows, and therefore the inverse of X and the inverse of its
transpose $X^T$ exist. Then, if we pre-multiply both sides of eq. 3.14 by $X^T$ and post-multiply
them by $(X^T)^{-1}$, we have:

$$X^T X X^T (X^T)^{-1} = N I X^T (X^T)^{-1} \qquad (3.16)$$

and consequently $X^T X = N I$ and W = I, as we wanted to show.
3.1.4 - Minimizing an energy function
One important contribution made by Hopfield [Hop82] was to propose a lower
and upper bounded scalar-valued function, a so-called "energy function", that reflects
the state of the whole network, i.e. such a function involves all the network outputs. He
then showed that whenever one of the network outputs $Y_i$ is updated, the value of this
function decreases if $Y_i$ changes or remains constant if $Y_i$ does not change. Therefore
the network will evolve until it reaches a state that is a locally stable equilibrium point.
To prove this, Hopfield defined the energy function as the following quadratic function:
$$H(k) = -\frac{1}{2} Y^T(k)\, W\, Y(k) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} W_{ij}\, Y_i(k)\, Y_j(k) \qquad (3.17)$$

where H(k) is the value of the energy function for the whole network at time step k. The
lower and upper limits for H(k), for any k, are given respectively by
$-\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} |W_{ij}|$ and $\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} |W_{ij}|$, since the outputs Y are -1 or +1.
Let's assume that at time k the unit L was selected to be updated, where
$1 \le L \le N$. Isolating the energy terms due to unit L, we can rewrite eq. 3.17 as:

$$H(k) = -\frac{1}{2} \sum_{\substack{i=1 \\ i \ne L}}^{N} \sum_{\substack{j=1 \\ j \ne L}}^{N} W_{ij} Y_i(k) Y_j(k) - \frac{1}{2} Y_L(k) \sum_{\substack{j=1 \\ j \ne L}}^{N} W_{Lj} Y_j(k) - \frac{1}{2} Y_L(k) \sum_{\substack{i=1 \\ i \ne L}}^{N} W_{iL} Y_i(k) - \frac{1}{2} W_{LL} [Y_L(k)]^2 \qquad (3.18)$$
The variation in the energy is given by $\Delta H(k) = H(k+1) - H(k)$. Note that: 1) since the
updating is asynchronous, only unit L may change at time k, and consequently
$Y_i(k+1) = Y_i(k)$ for $i \ne L$; and 2) since all units have bipolar outputs, $[Y_i]^2 = 1$ for all i.
Therefore, when calculating $\Delta H(k)$, the first and fourth terms of the right side of eq. 3.18
cancel out and we can write:

$$\Delta H(k) = -\frac{1}{2} Y_L(k+1) \left[ \sum_{\substack{j=1 \\ j \ne L}}^{N} W_{Lj} Y_j(k) + \sum_{\substack{i=1 \\ i \ne L}}^{N} W_{iL} Y_i(k) \right] + \frac{1}{2} Y_L(k) \left[ \sum_{\substack{j=1 \\ j \ne L}}^{N} W_{Lj} Y_j(k) + \sum_{\substack{i=1 \\ i \ne L}}^{N} W_{iL} Y_i(k) \right] \qquad (3.19)$$

If unit L changes its output, then $Y_L(k+1) = -Y_L(k)$ and, using the fact that the weight
matrix W is symmetric (see eq. 3.8), we have that:
$$\Delta H(k) = 2\, Y_L(k) \sum_{\substack{j=1 \\ j \ne L}}^{N} W_{Lj} Y_j(k) = 2\, Y_L(k)\, net_L(k) - 2\, W_{LL} \qquad (3.20)$$

Due to the rule used to update the network outputs (eqs. 3.1 and 3.2), whenever a unit
changes its output the product $Y_L(k)\, net_L(k)$ is negative. Due to the Hebbian rule (eq. 3.8),
$W_{LL} = M/N > 0$. Therefore, whenever a unit changes its output, the overall energy of the
network decreases. In other words, the energy is a monotonically decreasing function
with respect to time.
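This monotonic decrease is easy to check numerically. The sketch below tracks H(k) from eq. 3.17 across random asynchronous updates of a Hebbian-weight network; the sizes are illustrative.

```python
import numpy as np

def energy(W, y):
    # Eq. 3.17: H = -(1/2) * y^T W y
    return -0.5 * y @ W @ y

rng = np.random.default_rng(2)
S = rng.choice([-1, 1], size=(3, 100))
W = (S.T @ S) / S.shape[1]              # Hebbian weights, eq. 3.9
y = rng.choice([-1, 1], size=100)       # random initial state
energies = [energy(W, y)]
for _ in range(1000):
    i = rng.integers(100)               # asynchronous update, eqs. 3.1 and 3.2
    net = W[i] @ y
    if net != 0:
        y[i] = 1 if net > 0 else -1
    energies.append(energy(W, y))
print(np.all(np.diff(energies) <= 1e-12))   # expected: True, H never increases
```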
Note that we use the fact that the weight matrix is symmetric, an assumption that
is not biologically plausible in terms of networks of real neurons. McEliece et al.
[MPRV87] speculate, however, that maybe all that is necessary is a "little" symmetry,
such as a lot of zeros at symmetric positions in the weight matrix, which is common in real
neural networks. Moreover, asymmetric weight matrices can be used to generate a
cyclical sequence of patterns ([HKP91],[Kle86]) and Kleinfeld and Sompolinsky
[KlSo89] even found a mollusc that apparently uses this mechanism. In this case the
attractors are stable limit cycles.
In chapter 2 we mentioned that learning in feedforward networks could be seen as an optimization process. This is also the case here for feedback networks, and such an interpretation will be very useful later. The problem can be stated as follows: how should the weights be set such that the patterns to be stored are deep minima of the energy function given by eq. 3.17? Let's start with storing just one pattern. If we want to store just pattern S, since its components are -1 or +1, we can make the energy term dependent on [S_i]² [S_j]², which is always positive, so that the energy is as small as possible [BeJa90], or:
(3.21)  H = -(1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} W_ij S_i S_j = -(1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} S_i² S_j²

From this we see that we just need to define the weight matrix as W_ij = S_i S_j. Again, to store several patterns, we just sum this equation over all patterns (eqs. 3.8 and 3.9). Adding all patterns together in this way will distort the energy levels of each stored pattern because of the crosstalk term. However, as stated before, if M << N the distortion will not be significant.
3.1.5 - Spurious States
We have shown that if the crosstalk term is small enough, the patterns S^i to be stored are attractors (stable equilibrium points) and they will be local minima of the energy function. Such attractors are sometimes called retrieval states or retrieval memories. This situation is very likely to happen, as stated before, if the number of patterns M to be stored is much smaller than the number of units N. However, these are not the only attractors that the network has.

Firstly, the reverse -S^i of an attractor S^i is also an attractor, since it also satisfies eq. 3.10, and it has the same energy H.

Secondly, Hertz et al. [HKP91] and Amit et al. [AGS85a] show that patterns defined as a linear combination of an odd number of attractors are also attractors. They call such attractors mixture states.

Thirdly, Amit et al. [AGS85b] show that if the number M of patterns to be stored is relatively large (compared to N), then there are attractors that are not correlated with any linear combination of the original patterns S^i. They call such attractors spin glass states, from the spin glass models in statistical mechanics.

The second and third types of attractors are called spurious states, spurious minima or spurious memories. Their existence means that there is the possibility that the network will not work perfectly as an associative memory, since it can converge to "memories" that were not previously defined.
Some measures, however, can be taken to decrease the size of the basins of attraction of these spurious states. For instance, as Hopfield did in his original paper [Hop82], we can enforce the constraint that a unit does not feed back to itself, i.e. W_ii = 0 for 1 ≤ i ≤ N [KaSo87]. It is possible to show that this modification does not affect the stability of the patterns that we want to store (the retrieval memories), although it affects the dynamics of the network [HKP91].
A second possible improvement, proposed by Hopfield et al. [HFP83], is to try to "unlearn" some of the spurious states. To do this, the network weights are determined by applying eq. 3.8, the network state is initialized at a random position, and the network output is updated until convergence is achieved. If the state to which the network converged is one of the spurious memories, represented by X^F, then the Hebbian rule is applied with the sign reversed:

(3.22)  ΔW_ij = -ε X_i^F X_j^F

where 0 < ε << 1. One possible interpretation is that such a procedure changes the shape of the energy function by raising the energy level at the local minimum X^F, therefore reducing its basin of attraction. The assumption is that the memories with the deepest energy valleys tend to have the largest basins of attraction. However, too much "unlearning" will result in perturbing, and even destroying, the retrieval memories that we intended to store [HFP83].
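A rough sketch of one unlearning pass might look as follows (hypothetical code; the fixed relaxation budget, the stored-pattern test and the value ε = 0.01 are illustrative assumptions):

```python
import numpy as np

def unlearn_pass(W, S, rng, eps=0.01, steps=2000):
    """Relax from a random state; if the network settles on a state that is
    not a stored pattern (or its reverse), apply eq. 3.22: W_ij -= eps*Y_i*Y_j."""
    N = W.shape[0]
    Y = rng.choice([-1, 1], size=N)
    for _ in range(steps):                 # asynchronous updating, fixed budget
        L = rng.integers(N)
        net = W[L] @ Y
        if net != 0:
            Y[L] = 1 if net > 0 else -1
    spurious = not any(np.array_equal(Y, s) or np.array_equal(Y, -s) for s in S)
    if spurious:                           # Y plays the role of X^F
        W = W - eps * np.outer(Y, Y)       # raises the energy level at X^F
    return W
```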
3.1.6 - Synchronous Updating
The asynchronous updating used in the Hopfield network can be seen as a simple
way to model the random propagation delays of the signals in a network of real neurons.
If synchronous updating is used (all unit outputs are updated simultaneously in
a discrete time formulation), there will be no significant changes in terms of memory
capacity or position of the equilibrium points ([HKP91],[AGS85a],[MPRV87]).
However, the network dynamics will be different, e.g. it will take much less iterations
to converge to a fixed attractor (EP), and there is the possibility for existence of stable
limit cycles that are not present if asynchronous update is used. Zurada [Zur92] shows
an example of this last case.
Another difference is that using synchronous updating the trajectory in the output
space is always the same for a given starting point. When using asynchronous updating
this is not the case because the units are randomly selected to be updated, as explained
before.
3.2 - Solving Optimization Problems
After proposing the use of ANNs with binary or bipolar units and random asynchronous updating as associative (or content-addressable) memories [Hop82], Hopfield realized that he could obtain the same computational properties by using a deterministic continuous-time version with units that have a continuous and monotonically increasing activation function, such as a squashing function [Hop84]. This network is sometimes referred to as the gradient-type Hopfield network [Zur92] or the Hopfield network with continuous updating [HKP91].
By making such modifications he realized that he could also propose an analog hardware implementation of the above network using electrical components such as amplifiers, resistors and capacitors. The capacitors were introduced for each unit so that it would have an integrative time delay. Consequently, the time evolution of the network should be represented by a nonlinear differential equation.
3.2.1 - An analog implementation
The behaviour of each unit in this analog version is closer to the behaviour of a real neuron. Figure 3.2 illustrates such a unit. The variables net_i and Y_j are voltages, bias_i is a current, W_ij and g_i are conductances, C_i is a capacitance, and the triangle represents a voltage amplifier with a function f, i.e. V_out = f(V_in) or Y_i = f_i(net_i). We will assume that the voltage amplifier has an infinite input impedance, such that it does not absorb any current. Figure 3.3 illustrates the implementation of a feedback network using this type of unit. In order to avoid the need for negative resistances, we have to assume that the voltage amplifiers provide a negated output -Y_i as well, or use an additional amplifier for each unit with constant gain -1.
Adding all the currents for a unit, which are illustrated by arrows in fig. 3.2, the dynamic behaviour of the unit can be described by:

(3.23)  i_c = C_i (dnet_i/dt) = bias_i + Σ_{j=1}^{N} W_ij (Y_j - net_i) - g_i net_i

Let's define the parameter G_i as G_i = g_i + Σ_{j=1}^{N} W_ij, the external input vector as ext = [ext_1 ... ext_N]^T, and the matrices G and C as G = diag[G_1 ... G_N] and C = diag[C_1 ... C_N]. Then the dynamical behaviour of the whole network can be described by the following set of differential equations:

(3.24)  C (dnet/dt) = -G net + W f(net) + bias

where net and bias are column vectors, and the function f( ) is applied to each component of the vector net. Note that, by definition, Y = f(net).

Figure 3.2 - The analog implementation of a unit using electrical components

Figure 3.3 - The analog implementation of a continuous-time Hopfield network with 4 units
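Numerically, eq. 3.24 can be integrated with a simple forward-Euler scheme. The sketch below is a hypothetical illustration (all parameter values are arbitrary, and the conductances G_i are formed here with |W_ij| so that they stay positive, an assumption motivated by the physical circuit, where negative weights are realized with the inverted outputs):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4
W = rng.normal(size=(N, N))
W = (W + W.T) / 2                        # symmetric weight matrix
np.fill_diagonal(W, 0.0)
g = np.full(N, 0.1)                      # leakage conductances g_i
G = g + np.abs(W).sum(axis=1)            # G_i = g_i + sum_j |W_ij| (assumption, see lead-in)
C = np.ones(N)                           # capacitances C_i
bias = rng.normal(scale=0.1, size=N)
f = lambda net: np.tanh(10.0 * net)      # high-gain squashing activation

net = rng.normal(scale=0.01, size=N)     # initial unit input voltages
dt = 0.01
for _ in range(5000):                    # eq. 3.24: C dnet/dt = -G net + W f(net) + bias
    net += dt * (-G * net + W @ f(net) + bias) / C

print("equilibrium outputs:", f(net))
```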
3.2.2 - An energy function
Assuming that the weight matrix W is symmetric and that the activation function f_i is a monotonically increasing function bounded by lower and upper limits for all units, Hopfield ([Hop84],[Zur92]) proposed the following energy function in order to prove the stability of the network:

(3.25)  H(t) = -(1/2) Y^T W Y - bias^T Y + Σ_{i=1}^{N} G_i ∫_0^{Y_i} f_i^-1(z) dz

Applying the chain rule we have:

(3.26)  dH(Y(t))/dt = Σ_{i=1}^{N} [∂H(Y(t))/∂Y_i] (dY_i/dt) = [∇_Y H(Y)]^T (dY/dt)

where by definition

(3.27)  ∇_Y H(Y) = [ ∂H(Y)/∂Y_1 .... ∂H(Y)/∂Y_N ]^T

Using the Leibnitz rule we have:

(3.28)  d/dY_i [ Σ_{j=1}^{N} G_j ∫_0^{Y_j} f_j^-1(z) dz ] = G_i f_i^-1(Y_i) = G_i net_i

From this relation, and since the matrix W is symmetric, we can write that:

(3.29)  dH/dt = -[W Y + bias - G net]^T (dY/dt) = -[C (dnet/dt)]^T (dY/dt)

Comparing eqs. 3.26 and 3.29, we can see that:

(3.30)  ∇_Y H(Y) = -C (dnet/dt)

or for each component:

(3.31)  ∂H(Y)/∂Y_i = -C_i (dnet_i/dt)

Since Y_i = f_i(net_i) and f_i( ) is a monotonically increasing function, we can write that net_i = f_i^-1(Y_i) and

(3.32)  dnet_i/dt = [d f_i^-1(Y_i) / dY_i] (dY_i/dt)

where d f_i^-1(Y_i) / dY_i > 0. Finally, by substituting eqs. 3.31 and 3.32 in eq. 3.26:
(3.33)  dH/dt = -Σ_{i=1}^{N} C_i [d f_i^-1(Y_i) / dY_i] (dY_i/dt)²

Therefore dH/dt ≤ 0, and dH/dt = 0 if and only if dY_i/dt = 0 for all units, 1 ≤ i ≤ N. Since the network "energy" is a bounded function, this proves that the network will evolve until it settles at an equilibrium point, a local minimum of the energy function. In other words, the network "searches" for a minimum of the energy function and stops there. Note that the possibility of limit cycles is excluded, since a limit cycle would require dY_i/dt ≠ 0 with dH/dt = 0, which contradicts eq. 3.33.
It is also interesting to investigate the effect of the steepness of the activation function f_i. This is easily done by replacing Y_i = f_i(net_i) by Y_i = f_i(λ net_i) and net_i = f_i^-1(Y_i) by net_i = f_i^-1(Y_i)/λ, where λ is the gain. The energy function H(t) becomes:

(3.34)  H(t) = -(1/2) Y^T W Y - bias^T Y + (1/λ) Σ_{i=1}^{N} G_i ∫_0^{Y_i} f_i^-1(z) dz

As the gain λ increases, the activation function f_i tends to a threshold function. Suppose, for instance, that f( ) = tanh( ). The integral in the third term on the right-hand side of eq. 3.34 is zero for Y_i = 0 and positive otherwise, becoming very large as Y_i approaches its bounds -1 or +1, since such bounds are approached very slowly. In the limiting case λ → +∞, the contribution of the third term is negligible and the locations of the equilibrium points are given by the maxima and minima of:

(3.35)  H(t) = -(1/2) Y^T W Y - bias^T Y = -(1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} W_ij Y_i(t) Y_j(t) - Σ_{i=1}^{N} Y_i(t) bias_i

The same arguments are valid if f_i(net_i) = sig(λ net_i) = 1/[1 + exp(-λ net_i)].
For large but finite λ, the third term on the right-hand side of eq. 3.34 begins to contribute, but only when Y_i approaches its bounds, i.e. when the network is near one of the surfaces, edges or corners of the hypercube that contains the network dynamics. When all Y_i are far from their limits, the contribution of the third term is still negligible. Consequently, for large but finite λ, the maxima of the complete energy function given by eq. 3.34 remain at the corners, and the minima are slightly displaced toward the interior of the hypercube [Hop84]. Therefore, in this case, it can be assumed that the energy function being minimized is the one given by eq. 3.35 and that the equilibrium points will be located at the corners of the hypercube.

Note that if λ is sufficiently large, it is reasonable to assume that net_i ≈ 0 and, consequently, in figures 3.2 and 3.3 the current sources ext_i can be substituted by an equivalent voltage source VExt_i in series with an appropriate resistor RExt_i such that VExt_i/RExt_i = ext_i.
Hopfield and Tank [HoTa85] then realized that if the cost function of an optimization problem can be expressed as a quadratic function with the same form as eq. 3.35, then a network like the one illustrated in fig. 3.3, using units with large finite gains in their activation functions, can be used to search for a minimum of that cost function. They thus proposed solving optimization problems using analog hardware, an approach radically different from implementing an algorithm in a digital computer. The weights and biases can be determined by comparing the cost function of the problem in hand with the energy function given by eq. 3.35.
Hopfield and Tank ([HoTa85], [TaHo86], [HoTa86]) showed, as examples, how
such a network could be used to propose solutions to: 1) analog/digital conversion
problems; 2) decomposition/decision signal problems (to determine the decomposition
of a particular signal given the knowledge of its individual components); 3) linear
programming problems [Per92]; and 4) the travelling salesman problem (TSP). Other
possible applications investigated by other researchers are: 1) job shop scheduling optimization [Zur92]; 2) economic electric power dispatch problems [Zur92]; and 3) graph bipartitioning (important for chip design, where we want to divide a group of interconnected components into 2 subsets with roughly the same number of components in each subset while minimizing the wire length between the two sets).
It is important to emphasize that it is only possible to prove that, given the proper constraints, the network converges to a local minimum of the energy function. In general, such a local minimum is not the global minimum. Therefore the Hopfield approach is best suited to problems where several local minima give satisfactory solutions and where it is more important to rapidly reach a "good" solution than to take much longer to obtain the best possible one. One could argue that these are the kind of problems that biological systems have to solve [Per92]. It is not always easy to decide whether a particular optimization problem, with a particular set of parameters, is well suited to being solved using the Hopfield approach.
3.3 - The IAC Neural Network
The Interactive Activation and Competition (IAC) Neural Network was proposed by the psychologists McClelland and Rumelhart to model visual word recognition [McRu81] and the retrieval of general and specific information from specific information about individual exemplars previously stored in the network [McRu88]. The network uses noisy clues as inputs; for instance, it can be used to recognize a word that was partially obscured, or to retrieve the specific information stored about an item using a partial or incorrect version of its description.

The IAC network is also a feedback network that operates in discrete or continuous time, and the outputs of the units are continuous real numbers. The principle of operation is the same as for the Hopfield network, i.e. there is no learning phase and the designer sets the topology and the initial state of the network. The network then evolves to an equilibrium state (equilibrium point, EP) that represents the network's answer to the problem.
As in the Hopfield network, the network topology is selected in order to satisfy
the specific constraints of the problem in hand. The major difference in operation
between the Hopfield network and the IAC network is the activation function used.
McClelland and Rumelhart [McRu88] define an IAC network as consisting of a set of units organized into pools. The units in a pool compete against each other such that ideally, when the network settles to an EP, there is only one activated unit in each pool. Units situated in different pools can excite or be indifferent to each other, but normally they do not inhibit each other. Figure 3.4 illustrates the typical topology of the IAC network. All connections are assumed to be bidirectional and therefore the weight matrix W is symmetric. All units also have an external input, not shown in figure 3.4.

Figure 3.4 - Typical topology for the IAC network, where dashed and solid lines represent respectively inhibitory and excitatory connections. Black squares represent activated units.
According to McClelland and Rumelhart's conception, each pool represents a specific property (or characteristic), and each unit in the pool represents one of the mutually exclusive possibilities for that property. For example, in figure 3.4 pool 1 could represent the gender of an individual, while pools 3 and 4 could represent education level, marital status or profession. Pool 2 could contain the names of the individuals.
We can use the above example where the network is used to store specific
information about a set of individuals to show three possible cases of information
retrieval by the network [McRu88].
In the first case information about an individual could then be retrieved by
activating the unit with his name in pool 2 and we want just one unit activated in each
pool after convergence.
In the second case we can initialize the network with the description of an
individual by activating the corresponding units in pools 1, 3 and 4 and, after
convergence is achieved, look for the winner unit in pool 2. To be useful, the network
should retrieve the correct individual even if the description is partial or slightly
incorrect. It is possible to have units that are partially activated in the pool for names
if there is no perfect match and several units have a close match. The amount of
activation should be related to the number of matches with the given description.
In the third case we can retrieve general information about a property by
activating the corresponding unit, for example, to retrieve the general properties of
married individuals. In this case it is also possible to have units that are partially
activated.
McClelland and Rumelhart showed, by using simulations, that the network works
well in the above three cases [McRu88]. However, in order to operate the network the designer has to adjust some parameters, and McClelland and Rumelhart did not provide guidelines for selecting them.
In this section we derive a few results that are applicable to networks of this type with any number of units, including the proof that, given certain conditions, the IAC network is a stable system and that it also minimizes an energy function, much like the Hopfield network. Extensive results are then derived for the case where the network has 2 units: we analyse mathematically the dynamics of a 2-unit IAC network and, more specifically, we are interested in how the parameters of the model affect the number, type, location and zone of attraction (or basin of attraction) of the equilibrium points. In most cases stability around the EPs is proved using Lyapunov functions.
3.3.1 - The Mathematical Model
McClelland and Rumelhart used the standard forms for the combining, activation and output functions (these terms are defined in section 2.3.2) to define the mathematical model of the IAC network. Assuming that the IAC network is operating in discrete time and in synchronous mode, we have:

(3.36)  net_i(k) = Σ_{j=1}^{N} W_ij Y_j(k) + ext_i(k)

(3.37)  a_i(k+1) = f[ a_i(k), net_i(k) ]

(3.38)  Y_i(k+1) = g[ a_i(k+1) ]

where the variables a_i(k) and Y_i(k) represent respectively the activation and output values of unit i at iteration k, N is the number of units in the network, 1 ≤ i ≤ N, f[ , ] and g[ ] are respectively the activation and output functions, and the weight matrix W is assumed to be symmetric.
McClelland and Rumelhart wanted a model with the following properties:
1) the activation values must be kept between two limits given by the parameters max and min, where min ≤ 0 < max;
2) when the network is initialized, all the activation values are at the rest value given by the parameter rest, where min ≤ rest ≤ 0;
3) when the net input of a particular unit is positive, its activation value must be driven towards the upper limit max;
4) when the net input of a particular unit is negative, its activation value must be driven towards the lower limit min;
5) when the net input of a particular unit is zero, its activation value must be driven towards the rest value given by the parameter rest, with an adjustable speed given by the parameter decay ≥ 0.
To satisfy the above requirements, Rumelhart and McClelland proposed the following functions f( , ) and g( ): if net_i(k) ≥ 0,

(3.39)  Δa_i(k) = [max - a_i(k)] net_i(k) - decay [a_i(k) - rest]

otherwise

(3.40)  Δa_i(k) = [a_i(k) - min] net_i(k) - decay [a_i(k) - rest]

where Δa_i(k) = a_i(k+1) - a_i(k), and

(3.41)  Y_i(k) = a_i(k) if a_i(k) > 0, and 0 otherwise

Typical parameter values used in simulations by McClelland and Rumelhart [McRu88] are: max = 1, min = -0.2, rest = -0.1, decay = 0.1, ext_i = 0 or 0.4, and W_ij = -0.1, 0 or 0.1. However, such parameters were found through trial and error and not through mathematical analysis.
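The discrete-time update of eqs. 3.36-3.41 can be sketched as follows (hypothetical code; the 4-unit topology, the pool layout and the input pattern are illustrative assumptions, while the parameter values are the typical ones quoted above):

```python
import numpy as np

max_, min_, rest, decay = 1.0, -0.2, -0.1, 0.1   # typical values from [McRu88]

def iac_step(a, W, ext):
    """One synchronous IAC update (eqs. 3.36-3.41)."""
    Y = np.maximum(a, 0.0)                        # eq. 3.41: output = max(a, 0)
    net = W @ Y + ext                             # eq. 3.36
    da = np.where(net >= 0,
                  (max_ - a) * net,               # eq. 3.39: drive towards max
                  (a - min_) * net)               # eq. 3.40: drive towards min
    da -= decay * (a - rest)                      # decay towards rest
    return a + da

# two pools of two mutually inhibitory units, weak excitation across pools
W = np.array([[ 0.0, -0.1,  0.1,  0.0],
              [-0.1,  0.0,  0.0,  0.1],
              [ 0.1,  0.0,  0.0, -0.1],
              [ 0.0,  0.1, -0.1,  0.0]])
ext = np.array([0.4, 0.0, 0.0, 0.0])              # external evidence for unit 0
a = np.full(4, rest)
for _ in range(200):
    a = iac_step(a, W, ext)
print(a)                                          # units 0 and 2 end up activated
```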
3.3.2 - Initial Considerations
Without loss of generality, we can consider that each one of the units is connected to at least one of the other units (for each unit i, W_ij ≠ 0 for at least one j), since we are not interested in the case where a unit is completely isolated from the other units.

If we assume that min < 0 < max and min = -max, eqs. 3.39 and 3.40 can be combined into just one equation:
(3.42)  Δa_i(k) = -|net_i(k)| a_i(k) + net_i(k) max - decay [a_i(k) - rest]

If the network is operating in continuous time, the above equation is simply replaced by:

(3.43)  da_i/dt = -|net_i| a_i + net_i max - decay (a_i - rest)

The equilibrium points of the system, a_i^e, can be found by solving eq. 3.42 for Δa_i(k) = 0 or eq. 3.43 for da_i/dt = 0. So:

(3.44)  a_i^e = (max net_i^e + decay rest) / (|net_i^e| + decay)

where net_i^e represents the value of the net input of unit i when the network reaches an EP. Since net_i^e is in general unknown, eq. 3.44 does not help to find the position of the EP in the general case. But we can still use it to state that, if decay = 0:
1) when net_i^e ≠ 0, the EP will be characterized by a_i^e = max if net_i^e > 0, or a_i^e = -max if net_i^e < 0;
2) when net_i^e = 0, eq. 3.44 cannot be used to find the EP, but the points where net_i = 0 for all units are also equilibrium points, since Δa_i (or da_i/dt) = 0 for all i. One situation where this is possible, but not the only one, is to have ext_i = 0 for all units, and consequently the point a_i^e = 0 for all i is also an EP.

Moreover, for rest = 0 and small values of decay, as long as decay << |net_i^e|, the EP will still be located near max or -max, and the condition net_i = 0 is not enough to cause an EP.
Observe that if W_ij = 0 for all j, i.e. unit i is completely isolated from the other units, then net_i = ext_i and the condition for stability is -decay < |ext_i| < 2 - decay, which can also be written as -|ext_i| < decay < 2 - |ext_i|. Therefore such a unit can form a stable 1-dimensional system even if decay < 0. The position of the EPs is given by eq. 3.44, replacing net_i^e by ext_i.
3.3.3 - Minimizing an Energy Function
In this section we show that, under certain constraints, the continuous-time version of the IAC network, like the Hopfield network, also minimizes a bounded energy function. Therefore, we can prove that the network is stable and can be used to solve the same kind of minimization problems for which the Hopfield network has been used.

First, let's assume that decay = 0 and that the network is within or at the border of the hypercube [-max max]^N, where N is the number of units in the network, i.e. -max ≤ a_i ≤ max for all i. We can define the following quadratic function as the energy function:

(3.45)  H(t) = -(1/2) Y^T W Y - ext^T Y

As in the case of the Hopfield network we can write that:
(3.46)  dH(Y(t))/dt = Σ_{i=1}^{N} [∂H(Y(t))/∂Y_i] (dY_i/dt) = [∇_Y H(Y)]^T (dY/dt)

Since the matrix W is symmetric:

(3.47)  dH/dt = -[W Y + ext]^T (dY/dt) = -net^T (dY/dt) = -Σ_{i=1}^{N} net_i (dY_i/dt)

But Y_i = g(a_i), so we have:
(3.48)  dH/dt = -Σ_{i=1}^{N} net_i [dg(a_i)/da_i] (da_i/dt)

Using eq. 3.43, finally:

(3.49)  dH/dt = -Σ_{i: net_i ≥ 0} [dg(a_i)/da_i] net_i² (max - a_i) - Σ_{i: net_i < 0} [dg(a_i)/da_i] net_i² (max + a_i)

Therefore dH/dt ≤ 0 for decay = 0, -max ≤ a_i ≤ max and dg(a_i)/da_i ≥ 0 for all i (g( ) is a monotonically increasing function). From the above we can also state that dH/dt = 0 if and only if dY_i/dt = da_i/dt = 0 for all i, i.e. the network has reached an EP. Note that net_i = 0 for all i implies not only dH/dt = 0 but also da_i/dt = 0 for all i (see eq. 3.43).
Now we need to deal with the case where the network is initialized outside the hypercube [-max max]^N, i.e. |a_i| > max for at least one i. For a_i ≥ 0, eq. 3.43 can be written as:

(3.50)  da_i/dt = net_i (max - a_i) - decay (a_i - rest)   if net_i ≥ 0
        da_i/dt = net_i (max + a_i) - decay (a_i - rest)   if net_i < 0

On the other hand, for a_i < 0, eq. 3.43 can be written in the same form:

(3.51)  da_i/dt = net_i (max - a_i) - decay (a_i - rest)   if net_i ≥ 0
        da_i/dt = net_i (max + a_i) - decay (a_i - rest)   if net_i < 0

Equations 3.50 and 3.51 show respectively that, given that decay > 0 and rest < max: 1) if a_i > max, then da_i/dt < 0; and 2) if a_i < -max, then da_i/dt > 0. In other words, considering the activation space, if the network is outside the hypercube [-max max]^N and decay > 0, the changes in the activation values are such that, given enough time, the network will reach the borders of the hypercube and we will end up with |a_i| ≤ max. Note that even in the case decay = 0, the changes in the activations will still drive the network to the borders of the hypercube [-max max]^N, with the only exception that the network can be trapped in the condition where net_i = 0 (section 3.3.6 shows an example of this case). Once inside or at the borders of the hypercube, the network then seeks the minima of the energy function given by eq. 3.45, given that, among other conditions, decay = 0.
One way to ensure that the energy function given by eq. 3.45 is minimized would be to have decay > 0 whenever |a_i| > max for at least one i, and to set decay to 0 when |a_i| ≤ max for all i. A less complicated way would be to set decay to some small positive value and rest to 0, without having to consider whether the network is inside the hypercube or not. From eq. 3.44 we can see that this will cause only a small perturbation in the position of the EPs located at a_i^e = max or -max, assuming that for such EPs the condition decay << |net_i^e| is satisfied. If the EPs that are the solution of the problem satisfy such a condition (in general such information is not available a priori), then we can still consider that the energy function given by eq. 3.45 is being minimized. However, the location and number of the other EPs (the EPs that are not at the corners of the hypercube [-max max]^N) can change significantly.

A possible explanation for why decay > 0 brings the network to the borders of the hypercube is that it stops the points where net_i = 0 from being EPs; from eq. 3.44 we can also see that it forces |a_i^e| < max. However, some of the points that were EPs for decay = 0 can suffer a large perturbation if the condition decay << |net_i^e| is not satisfied.
Note that, like the Hopfield network, the IAC network suffers from the possibility of being trapped in a local minimum of the energy function (instead of converging to the global minimum).
A simple modification that makes it easier to analyse the network's dynamic behaviour is to have dg(a_i)/da_i > 0 for all i, instead of dg(a_i)/da_i ≥ 0 (see eq. 3.41), for instance by using the identity function as the output rule: Y_i = a_i for all i. This modification will be used in the next section.
3.3.4 - Considering two units
Consider an IAC network with two units, with min < 0 < max, min = -max, decay ≥ 0, rest = 0, and the output function being the identity function, Y_i = a_i for all i. As usual, we will assume that the units do not feed back to themselves, i.e. W_ii = 0, and use W_12 = W_21 = c, where c is a factor of cooperation (c > 0) or competition (c < 0). Figure 3.5 illustrates such a network. From eq. 3.43 we can write:

(3.52)  da_1/dt = -|ext_1 + c a_2| a_1 + (ext_1 + c a_2) max - decay a_1

(3.53)  da_2/dt = -|ext_2 + c a_1| a_2 + (ext_2 + c a_1) max - decay a_2

Figure 3.5 - The IAC Neural Network with 2 units

We can now consider three main cases ([Nas90], [NaZa92]):
1) external inputs = 0, decay ≥ 0;
2) external inputs ≠ 0, decay = 0;
3) external inputs ≠ 0, decay > 0.
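These cases can be explored numerically. The sketch below (hypothetical code) integrates eqs. 3.52 and 3.53 with forward Euler and, for case 1 with c > 0 and normalized decay 0.15, reproduces the stable EPs near ±0.85 derived in the next subsection:

```python
import numpy as np

max_ = 1.0

def rhs(a, c, ext, decay):
    """Right-hand sides of eqs. 3.52 and 3.53 (rest = 0)."""
    net = ext + c * a[::-1]                 # net_1 = ext_1 + c a_2, net_2 = ext_2 + c a_1
    return -np.abs(net) * a + net * max_ - decay * a

def trajectory(a0, c=1.0, ext=(0.0, 0.0), decay=0.15, dt=0.01, steps=5000):
    a = np.array(a0, dtype=float)
    for _ in range(steps):
        a += dt * rhs(a, c, np.asarray(ext), decay)
    return a

print(trajectory([ 0.3,  0.2]))   # -> approximately [ 0.85  0.85]
print(trajectory([-0.3, -0.2]))   # -> approximately [-0.85 -0.85]
```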
3.3.5 - Case Positive Decay and No External Inputs
Solving eqs. 3.52 and 3.53 for da_i/dt = 0, we have that the EPs [a_1^e a_2^e]/max are given by:

(3.54)  a_i^e/max = (c/|c|) (a_j^e/max) / [ decay/(|c| max) + |a_j^e/max| ]

where (i,j) = (1,2) or (2,1). Solving the above pair of equations by direct substitution, for 0 ≤ dec < 1 the normalized EPs [ā_1 ā_2] are:

for c > 0: [0 0], [β β], [-β -β]
for c < 0: [0 0], [β -β], [-β β]

where ā_i = a_i/max, i = 1 or 2, β = 1 - dec, and dec = decay/(|c| max) is the normalized decay. Using linearization around the EPs, it is possible to show that the origin is an EP of type saddle, while the other 2 EPs are of type stable node. If dec ≥ 1, all 3 EPs collapse into the origin, which becomes a stable node.

Figure 3.6 shows the phase plane for ext_1 = ext_2 = 0, c > 0 and dec = 0.15, and some trajectories for different initial activation values. As expected, the EPs are at positions [0 0], [0.85 0.85] and [-0.85 -0.85].

Figure 3.6 - Phase plane (a_1/max vs a_2/max) for ext_1 = ext_2 = 0, decay/(c max) = 0.15, c > 0
We can study the stability and zones of convergence of the stable EPs by defining a Lyapunov function. For instance, assuming c > 0 and 0 ≤ dec ≤ 1 (0 ≤ β ≤ 1), for the EP [β β] we can define the following Lyapunov function:

(3.55)  V(ā_1, ā_2) = (x_1² + x_2²) / (2 c max)

where x_i = ā_i - β, i = 1,2. Therefore:

(3.56)  dV/dt = [x_1/(c max)] (dx_1/dt) + [x_2/(c max)] (dx_2/dt)

Assuming that ā_i > 0, we have that x_i + β > 0, i = 1,2. From eqs. 3.52 and 3.53, for (i,j) = (1,2) and (2,1):

(3.57)  [1/(c max)] (dx_i/dt) = (1 - β) x_j - x_i x_j - x_i

(3.58)  dV/dt = Σ_{(i,j) = (1,2),(2,1)} x_i [ (1 - β) x_j - x_i x_j - x_i ]

(3.59)  dV/dt = -x_1² (1 + x_2) - x_2² (1 + x_1) + 2 (1 - β) x_1 x_2

and therefore dV/dt ≤ 0 for β ≤ 1, i.e. for dec ≥ 0, and dV/dt = 0 implies that x_1 = x_2 = 0 (the possibility x_1 = x_2 = -β is excluded, since we assumed that x_i + β > 0). This proves that all trajectories in the 1st quadrant will converge to the EP [β β] if c > 0 and 0 ≤ decay/(c max) ≤ 1.
We can also easily see from eqs. 3.52 and 3.53 that, for points in the 2nd and 4th quadrants situated above the line ā_2 = -ā_1, the property dā_2/dā_1 > -1 is valid. Therefore their respective trajectories will enter the 1st quadrant and converge to the EP [β β], as fig. 3.6 illustrates. The same procedure can be used to prove the stability of the EP [-β -β] and of the EPs for the case c < 0.

From the above we can conclude that the separatrix (the curve that divides the zones of convergence of the 2 stable EPs) is the line ā_2 = -ā_1 for c > 0 and the line ā_2 = ā_1 for c < 0. Observe that, in the absence of any external disturbances, if the network is initialized exactly on the separatrix, the activation values will converge to the unstable EP situated at the origin (see fig. 3.6).
3.3.6 - Case of Non-Zero External Inputs With No Decay
Let's assume for simplicity that c > 0 (the case c < 0 is completely analogous) and define the normalized external inputs ēxt_1 and ēxt_2, where ēxt_i = ext_i/(c max), i = 1 or 2. From eqs. 3.52 and 3.53 we can see that, if decay = 0, then da_i/dt = 0 in the following cases:
a) when ā_j = -ēxt_i, the main switching line;
b) if ā_j > -ēxt_i, when ā_i = 1;
c) if ā_j < -ēxt_i, when ā_i = -1;
where (i,j) = (1,2) or (2,1). Therefore an increase of ēxt_1 shifts its associated main switching line ā_2 = -ēxt_1 downwards. Analogously, an increase of ēxt_2 shifts its associated main switching line ā_1 = -ēxt_2 sideways to the left.

The EPs are the points that are common to the above switching lines. The positioning of these switching lines gives rise to 3 main sub-cases that correspond to different regions in figure 3.7:
a) if |ēxt_1| < 1 and |ēxt_2| < 1;
b) if |ēxt_i| > 1 and |ēxt_j| ≠ 1, (i,j) = (1,2) or (2,1);
c) if |ēxt_1| = 1 and/or |ēxt_2| = 1.
Now let's consider each one of these cases and their sub-cases:

a) if |ēxt_1| < 1 and |ēxt_2| < 1;
region A in fig. 3.7:
1 EP at [-ēxt_2 -ēxt_1], type saddle;
2 EPs at [ā_1^e ā_2^e] = {[1 1], [-1 -1]}, type stable node.
The phase plane and the trajectories in this case are similar to those in fig. 3.6, with the difference that the position of the unstable EP is not necessarily at the origin.

Figure 3.7 - Location of the stable E.P.s
b) if |ēxt_i| > 1 and |ēxt_j| ≠ 1, (i,j) = (1,2) or (2,1);
regions B, C, D and E in fig. 3.7, not including the dashed lines in the middle of regions B and C:
1 unstable EP at [ā_1^e ā_2^e] = [-ēxt_2 -ēxt_1];
1 EP of type stable node, whose location is given by fig. 3.7 according to:
Region B: [1 1], Region C: [-1 -1], Region D: [-1 1], Region E: [1 -1].
Figure 3.8 shows the phase plane when ēxt_1 = ēxt_2 = 1.5, and some trajectories for different initial activation values. The EPs are at [1 1] and [-1.5 -1.5].

Figure 3.8 - ēxt_1 = ēxt_2 = 1.5, decay = 0
c) if |ēxt_1| = 1 and/or |ēxt_2| = 1;

c.1) if |ēxt_i| = 1 and |ēxt_j| ≠ 1, (i,j) = (1,2) or (2,1);

c.1.1) {if ēxt_i = 1 and ēxt_j > -1 and ēxt_j ≠ 1} OR {if ēxt_i = -1 and ēxt_j < 1 and ēxt_j ≠ -1};
dashed lines in fig. 3.7, not including the circles nor the black squares:
1 EP of type stable node, whose location is given by region B or C in fig. 3.7 (the nearest region);
a semi-line of non-isolated EPs.
Figure 3.9 shows the phase plane for ēxt_1 = 0, ēxt_2 = -1 and some trajectories for different initial conditions. The EPs are [-1 -1] and the semi-line at ā_1 = 1.

Figure 3.9 - ēxt_1 = 0, ēxt_2 = -1, decay = 0

c.1.2) {if ēxt_i = 1 and ēxt_j < -1} OR {if ēxt_i = -1 and ēxt_j > 1};
border of regions D and E in fig. 3.7 (solid lines), not including the circles:
no stable isolated EPs;
a semi-line of non-isolated EPs.
Figure 3.10 shows the phase plane for ēxt_1 = -1, ēxt_2 = 1.5 and some trajectories.

Figure 3.10 - ēxt_1 = -1, ēxt_2 = 1.5, decay = 0

c.2) if |ēxt_1| = |ēxt_2| = 1;

c.2.1) if (ēxt_1, ēxt_2) = (1,1) or (-1,-1);
black squares in fig. 3.7:
1 EP of type stable node, whose location is given by region B or C in fig. 3.7 (the nearest region);
2 orthogonal semi-lines of non-isolated EPs.
Figure 3.11 shows the phase plane for ēxt_1 = ēxt_2 = 1 and some trajectories.

Figure 3.11 - ēxt_1 = ēxt_2 = 1, decay = 0

c.2.2) if (ēxt_1, ēxt_2) = (1,-1) or (-1,1);
circles in fig. 3.7:
no stable isolated EP;
2 orthogonal semi-lines of non-isolated EPs.
Figure 3.12 shows the phase plane for ēxt_1 = 1, ēxt_2 = -1 and some trajectories.

Figure 3.12 - ēxt_1 = 1, ēxt_2 = -1, decay = 0
We can determine the zones of attraction of the stable EPs by calculating analytically the equation of the trajectory that converges to the unstable EP. This equation can be obtained by combining eqs. 3.52 and 3.53 with decay = 0 and solving the ordinary differential equation dā_2/dā_1 = F(ā_1, ā_2, ēxt_1, ēxt_2). In this case it is easy to find the equation of the trajectory, since the variables ā_1 and ā_2 are separable. If we define:

S_1 = sign(ēxt_2 + ā_1),  S_2 = sign(ēxt_1 + ā_2)

(sign(0) = 0) and assume that S_1 ≠ 0 and S_2 ≠ 0, the equation of the trajectory is:

(3.60)  S_1 [ā_2 - ā_2(0)] + (1 + S_1 ēxt_1) ln[ (1 - S_1 ā_2) / (1 - S_1 ā_2(0)) ] = S_2 [ā_1 - ā_1(0)] + (1 + S_2 ēxt_2) ln[ (1 - S_2 ā_1) / (1 - S_2 ā_1(0)) ]
The curves that separate the zones of convergence are asymptotes that can be calculated by solving the above nonlinear equation, i.e. finding ā_2(0) given ā_1(0), ā_1 = ā_1*, ā_2 = ā_2*, ēxt_1 and ēxt_2, where a* is the unstable EP. The asymptotes that leave the unstable EP and converge to the stable EP can also be calculated by the same method, with a* as the stable EP. Figure 3.13 shows the asymptotes for some cases with |ēxt_1| < 1 and |ēxt_2| < 1, where eq. 3.60 was solved numerically.
As before, we can prove the stability of the EPs by defining a Lyapunov function. For instance, fig. 3.7 shows that if we assume that c > 0, ēxt_1 > 1 and ēxt_2 > -1 (region B), then the point [ā_1 ā_2] = [1 1] is an EP of type stable node. To prove its stability, we can define the same Lyapunov function defined in eq. 3.55, where x_i is now defined as x_i = ā_i - 1, i = 1,2. Assuming that ā_1 > -ēxt_2 and ā_2 > -ēxt_1 (our region of interest), we can easily show that:

(3.61)  dV/dt = -x_1² (ēxt_1 + x_2 + 1) - x_2² (ēxt_2 + x_1 + 1) ≤ 0

Since V > 0 and dV/dt < 0 in our region of interest, except at the origin x_1 = x_2 = 0 where V = dV/dt = 0, this proves the asymptotic stability of the EP [1 1]. The same procedure can be used to prove the stability of the EP [-1 -1] and of the EPs for the case c < 0.

Figure 3.13 - Asymptotes for ēxt_1 ≠ 0 and/or ēxt_2 ≠ 0, with decay = 0
3.3.7 - Case of Non-Zero External Inputs and Positive Decay
Again, without loss of generality, let's assume that c > 0. From eqs. 3.52 and 3.53 we have that da_i/dt = 0 when:

(3.62)  ā_i = (ēxt_i + ā_j) / ( |ēxt_i + ā_j| + dec )

where (i,j) = (1,2) or (2,1). Since the EPs are the points where da_1/dt = 0 and da_2/dt = 0, they can be calculated by combining eq. 3.62 for (i,j) = (1,2) with eq. 3.62 for (i,j) = (2,1). This means that we need to find the real-valued roots of the following quadratic polynomial:
(3.63)  P(ā_1) = ā_1² [ S_1 S_2 ēxt_1 + S_2 dec + S_1 ]
              + ā_1 [ S_1 S_2 ēxt_1 ēxt_2 + (S_1 ēxt_1 + S_2 ēxt_2) dec + dec² + S_1 ēxt_2 - S_2 ēxt_1 - 1 ]
              - [ S_2 ēxt_1 ēxt_2 + dec ēxt_1 + ēxt_2 ] = 0

where:

(3.64)  S_i = 1 if ēxt_i + ā_j ≥ 0, and S_i = -1 otherwise

However, we do not know a priori the values of S_1 and S_2. Therefore we apply the following algorithm:

Step 1) Assume that S_1 = 1 and S_2 = 1.
Step 2) Find the roots of P(ā_1) and reject the complex roots.
Step 3) Check for each real-valued root whether ēxt_2 + root ≥ 0. If YES, accept this root, otherwise reject it.
Step 4) For each accepted root, use eq. 3.62 to calculate the corresponding value of ā_2.
Step 5) Check whether ēxt_1 + ā_2 ≥ 0. If YES, accept this value, otherwise reject it.
Step 6) Assume the other combinations for (S_1, S_2), calculate the possible values of ā_1 and ā_2, and check if the assumptions for S_1 and S_2 are satisfied.
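A direct implementation of this algorithm might look as follows (hypothetical code; the coefficients follow the reconstruction of eq. 3.63 given above, and the printed values match the example discussed below in connection with fig. 3.15):

```python
import numpy as np

def equilibria(e1, e2, dec, eps=1e-9):
    """EPs of the 2-unit IAC network (eqs. 3.62-3.64), for normalized
    external inputs e1, e2 and normalized decay dec (c > 0 assumed)."""
    found = []
    for S1 in (1, -1):                       # steps 1 and 6: try all sign assumptions
        for S2 in (1, -1):
            coeffs = [S1*S2*e1 + S2*dec + S1,                       # a1^2 term of eq. 3.63
                      S1*S2*e1*e2 + (S1*e1 + S2*e2)*dec + dec**2
                      + S1*e2 - S2*e1 - 1,                          # a1 term
                      -(S2*e1*e2 + dec*e1 + e2)]                    # constant term
            for a1 in np.roots(coeffs):
                if abs(a1.imag) > eps:                 # step 2: reject complex roots
                    continue
                a1 = a1.real
                if S2 * (e2 + a1) < 0:                 # step 3: is S2 = sign(e2 + a1)?
                    continue
                a2 = (e2 + a1) / (abs(e2 + a1) + dec)  # step 4: eq. 3.62, (i,j) = (2,1)
                if S1 * (e1 + a2) < 0:                 # step 5: is S1 = sign(e1 + a2)?
                    continue
                found.append((round(a1, 6), round(a2, 6)))
    return sorted(set(found))

print(equilibria(0.3754, 0.3754, 0.15))  # EPs near (0.894, 0.894) and (-0.613, -0.613)
```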
Figure 3.14 illustrates how the values of ēxt_1 and ēxt_2 affect the curves da_1/dt = 0 and da_2/dt = 0 (for dec = 0.15). We can see that increasing ēxt_1 shifts the curve da_1/dt = 0 to the left, and increasing ēxt_2 shifts the curve da_2/dt = 0 downwards.

Figure 3.14 - Curves da_1/dt = 0 and da_2/dt = 0 for several external inputs and dec = 0.15

Figure 3.14 also shows that there are only three possible cases for the EPs, since the curves da_1/dt = 0 and da_2/dt = 0 cross each other 1, 2 or 3 times:
1) If the curves da_1/dt = 0 and da_2/dt = 0 cross each other three times, then we have 2 EPs of type stable node and 1 EP of type saddle. This is the case if |ēxt_1|, |ēxt_2| and dec are not large.

2) If the curves da_1/dt = 0 and da_2/dt = 0 cross each other only once, then we have only 1 EP, of type stable node. This is the case if |ēxt_1| or |ēxt_2| or dec is large.

3) The curves da_1/dt = 0 and da_2/dt = 0 cross each other once (the EP of type stable node) and touch each other at another point (a point that is on the separatrix). In this case, all trajectories with initial conditions on one side of the separatrix will converge to the EP that is on the separatrix, and all trajectories with initial conditions on the other side will converge to the stable EP. Figure 3.15 illustrates this case, with ēxt_1 = ēxt_2 = 0.3754 and dec = 0.15. The EPs are ā_1^e = ā_2^e = 0.894 and ā_1^e = ā_2^e = -0.613.

Figure 3.15 - The case where one of the E.P.s is on the separatrix
Some important points are:
a) there will always be 1 or 2 stable EPs, and not more than 1 unstable EP, since the curves da_1/dt = 0 and da_2/dt = 0 cross each other 1, 2 or 3 times;
b) all EPs will be such that |ā_1^e| < 1 and |ā_2^e| < 1 (see eq. 3.62);
c) if dec ≥ 1, then there is only 1 EP, and it is an EP of type stable node. This can be verified through visual inspection of figure 3.16, which shows the curves da_1/dt = 0 and da_2/dt = 0 for dec = 1. The theoretical way to prove that there is only one EP in this case would be to show that, if dec ≥ 1, then for any real values of ēxt_1 and ēxt_2: 1) the quadratic polynomial expressed in eq. 3.63 has only one real-valued root, which is ā_1^e, and 2) the application of the algorithm proposed in this section also gives a valid value for ā_2^e.

Figure 3.16 - Curves da_1/dt = 0 and da_2/dt = 0 for several external inputs and dec = 1
3.3.8 - A Two-Bit Analog-Digital Converter using the IAC network
As in the case of the Hopfield network, the IAC network can be used as an analog-digital converter, since this task can be posed as an optimization problem. Using the IAC network, the solution can be obtained following the same procedure proposed by Tank and Hopfield for Hopfield networks ([TaHo86], [Zur92]).

In this example we will use an IAC network with 2 units, and therefore the A/D converter has a 2-bit resolution; however, the same principle can be used with more units to increase the resolution of the A/D converter. The parameters in this case are max = 1, decay = 0, W_ii = 0, i = 1,2, and W_ij = c, (i,j) = (1,2) and (2,1), i.e. W is symmetric with zero diagonal entries. We will assume that the network is always initialized within the hypercube [-max max]². As the output function we can use Y_i(a_i) = a_i, and therefore the network output will be bipolar (-1 or 1) instead of binary (0 or 1). If desired, we could easily force the network to have binary outputs by defining Y_i(a_i) = (a_i + 1)/2.
Denoting by x the analog input value, the desired input-to-final-output mapping that the network should produce is: (x → a_2, a_1) = (0 → -1,-1), (1 → -1,1), (2 → 1,-1) and (3 → 1,1). The corresponding decimal value d for a network output is given by d = 1.5 + a_2 + a_1/2. The network should minimize the square of the conversion error, E_1(t), where:

(3.65)  E_1(t) = (x - d)² = x² - 2x (1.5 + a_2 + a_1/2) + (1.5 + a_2 + a_1/2)²
To determine the weights and external inputs of the network, we compare the function E_1(t) that we want to minimize with the "energy" function H(t) given by eq. 3.45. In this case the function H(t) is given by:

(3.66)  H(t) = -c a_1 a_2 - ext_1 a_1 - ext_2 a_2

Since H(t) does not contain terms [a_i]², we need to modify E_1(t) in order to eliminate such terms, but in such a way that the resulting function is still non-negative and has the correct local minima. This also has to be done when using the Hopfield network, and we just need to adapt the procedure adopted there to this case [TaHo86]. The solution is to define the function to be minimized, E(t), as E(t) = E_1(t) + E_2(t), with:

(3.67)  E_2(t) = -Σ_{i=1}^{2} 2^(2i-4) (a_i + 1)(a_i - 1)

Since we assume that the network is initialized within the hypercube [-1 1]² and will remain within or at the borders of the hypercube, the function E_2(t) is always positive except at the corners of the hypercube, where it is zero. The coefficients of E_2(t) could be any negative values, but in this case they were chosen in order to cancel the terms [a_i]² in E_1(t). Therefore:

(3.68)  E(t) = x² - 3x + 14/4 + a_1 a_2 + (1.5 - x) a_1 + (3 - 2x) a_2

Finally, comparing H(t) with E(t) and ignoring the term x² - 3x + 14/4, since it is a constant, we have that: c = -1, ext_1 = x - 1.5, ext_2 = 2x - 3.
In section 3.3.6 we analysed such a network, but for c > 0. In order to use those results here, we just need to rotate our coordinate system by 90 degrees, so that the line a_2 = -a_1 (position of the EPs for ext_1 = ext_2 = 0) becomes a_2 = a_1. If we rotate by -90 degrees, then a_2^NEW = a_1^OLD and a_1^NEW = -a_2^OLD. The desired input-to-final-output mapping is then: (x → a_2^NEW, a_1^NEW) = (0 → -1,1), (1 → 1,1), (2 → -1,-1) and (3 → 1,-1). Dropping the superscript "NEW", the corresponding decimal value d for a network output is now given by d = 1.5 - a_1 + a_2/2. The function E(t) is now:

(3.69)  E(t) = (x - d)² - Σ_{i=1}^{2} 2^(2-2i) (a_i + 1)(a_i - 1)

Again comparing H(t) with this new definition of E(t), and ignoring the constant term that is a function only of x, we have that: c = 1, ext_1 = 3 - 2x, ext_2 = x - 1.5.

From section 3.3.6 we know that such a network will produce the stable EPs at the desired locations. Note that: a) ext_1 and ext_2 can be seen as lines parametrized by x, and therefore we can write that ext_2 = -ext_1/2; and b) if 1 ≤ x ≤ 2, then |ēxt_1| ≤ 1 and |ēxt_2| ≤ 1. Referring to fig. 3.7, we will have an EP of type saddle when in region A (|ēxt_1| < 1), or semi-lines of EPs when on the dashed lines that are the borders between regions A-B (ēxt_1 = 1) and A-C (ēxt_1 = -1). The semi-lines of EPs can be eliminated, without moving the position of the stable EPs significantly, by using a very small decay such as 0.01.

The existence of the saddle point results in the problem that the stable EP to which the network converges is determined by the point at which the network was initialized. For instance, for x = 1.5, the line a_2 = -a_1 divides the two zones of convergence (also called zones of attraction) of the two stable EPs at (1,1) and (-1,-1). Lee and Sheu ([LeSh91],[LeSh92],[YuNe93]), when using a Hopfield network as an A/D converter, showed how to modify the Hopfield network in order to eliminate such saddle points. Consequently, the EP to which the network converges does not depend on where the network is initialized, and therefore there is only one possible network response. Perhaps an equivalent modification could be proposed for the IAC network.
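The resulting converter can be checked numerically. The sketch below (hypothetical code) integrates eqs. 3.52 and 3.53 with the parameters derived above (c = 1, ext_1 = 3 - 2x, ext_2 = x - 1.5, small decay) and decodes the final state with d = 1.5 - a_1 + a_2/2; the initial state at the origin is an illustrative assumption that works as long as x is not exactly midway between two codes:

```python
import numpy as np

def adc2bit(x, decay=0.01, dt=0.01, steps=20000):
    """2-bit A/D conversion with a 2-unit IAC network (section 3.3.8)."""
    c, ext = 1.0, np.array([3.0 - 2.0 * x, x - 1.5])
    a = np.zeros(2)                             # origin: off the separatrix for tested x
    for _ in range(steps):
        net = ext + c * a[::-1]                 # net_1 = ext_1 + a_2, net_2 = ext_2 + a_1
        a += dt * (-np.abs(net) * a + net - decay * a)
    bits = np.sign(a)                           # activations settle near the corners
    d = 1.5 - bits[0] + bits[1] / 2.0           # decoded decimal value
    return bits, d

for x in [0.2, 1.1, 1.9, 2.8]:
    bits, d = adc2bit(x)
    print(f"x = {x:.1f} -> a = {bits}, d = {d:.0f}")
```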
3.4 - Conclusions
In this chapter we demonstrated how feedback networks can be used as
associative memories or to solve minimization problems. The Hopfield and IAC neural
networks were presented and analyzed.
The main contribution of this chapter is to show that the IAC network can also
be used to solve minimization problems, and as such it is an alternative to Hopfield
networks. As an example we showed how to implement a 2-bit analog-digital converter.
Chapter 4 - Faster Learning by
Constraining Decision Surfaces
In chapter 2 we pointed out that one of the main problems with the current
feedforward ANN models is that they take too long to be trained by the training
algorithms in use today. Therefore one active area of research is the development of new
methods to increase the learning speed of feedforward ANN models, mainly the multi-
layer Perceptron since this is the most popular feedforward model. At the end of chapter
2 we mentioned some methods that can be used to try to speed up learning in
feedforward ANNs, i.e. a) without adapting the network topology, for instance, by using
adaptive learning rates or second-order algorithms; or b) adapting the network topology,
such as the Cascade-Correlation Learning algorithm [FaLe90].
In this chapter we propose an alternative method that aims to speed up learning
by constraining the weights arriving at the hidden units of a multi-layer feedforward
ANN. The method is concerned with the case where the hidden units have sigmoidal
functions, such as in the multi-layer Perceptron.
The basic idea of the proposed method is based on the observation that one
condition that is necessary, but not sufficient, for a feedforward multi-layer ANN to
learn a specific mapping is to have the decision surfaces defined by the hidden units
within or close to the boundaries of the network input space. The hidden units then will
not have a constant output value and cannot be simply substituted by the addition of a
bias to the output unit. Since it is quite reasonable to know beforehand the range of
the network input values, we can assume that the network input space is also known.
The proposed method then simply checks the above condition and resets those hidden
units with decision surfaces outside a valid region. This approach also leads to a new
method for initializing the weights of the ANN.
We show different methods for initializing and constraining during training the
locations of the decision surfaces. We also show how one can adjust the inclination of
the decision surfaces. In the simulation section the proposed method is illustrated for the
case where an ANN is trained to perform the nonlinear mapping sin(x) over the range -2π to 2π. This example uses the Back-Propagation algorithm to train the network, but
the proposed method can be used with any other algorithm that adjusts the weights
directly without imposing constraints on the decision surfaces.
The proposed method can be applied to any unit as long as the decision surface
associated with such a unit is a hyperplane, e.g. sigmoid or hyperbolic tangent units.
Therefore the ANN can have more than one hidden layer of units and it does not need
to be a strictly feedforward ANN. Note that by constraining the decision surfaces we are
in effect indirectly constraining the hidden unit weights.
4.1 - Initialization Procedures
In chapter 2 we saw that if a unit has its output defined as: 1) an increasing function of its net input, with saturation above an upper limit and below a lower limit (a sigmoidal unit, for instance a sigmoid or hyperbolic tangent unit); and 2) its net input defined as a linear combination of the unit inputs; then the decision surface of this unit is the hyperplane defined by:

w_i1 x_1 + w_i2 x_2 + .... + w_iNx x_Nx + bias_i = 0

where w_ij and bias_i are the unit's incoming weights and bias, x_j, j = 1,...,Nx, are the unit inputs, and Nx is the number of inputs received by the unit. A fundamental component of the learning process is the correct placement and inclination of these decision surfaces in the network input space.
The simplest case is a network with just a single layer of sigmoidal hidden units
and an output layer of linear units. In order to perform correctly the desired mapping
the hidden unit weights, i.e. the weights received by the hidden units, have to be such
that the decision surfaces of each hidden unit have the correct position and inclination.
Then the role of the output unit weights is to perform the correct combination of such
weights.
The problems to which this type of ANN can be applied can be divided into two classes: a) pattern recognition, where the inputs and desired outputs are binary (0 or 1) or bipolar (-1 or 1); and b) function mapping, where the inputs and desired outputs are real numbers. An important difference is that, in general, in the former case there is more freedom for the placement and inclination of the decision surfaces (vide the XOR problem) than in the latter case. In other words, the input-output mapping is less sensitive with respect to the decision surfaces in the former case than in the latter.
4.1.1 - The Standard Initialization Procedure
The standard and widely used procedure to initialize all the weights and biases of a feedforward multi-layer ANN, such as the Multi-Layer Perceptron, is simply to set all weights and biases to small random values [RHW86] using a normal or uniform distribution. The justification for using small values is to avoid saturation of the units, since saturated units operate in the regions where the derivative of the unit output function is very small; consequently, if the network is trained by the BP algorithm (or another algorithm that uses first-derivative information), training will be very slow.

One problem with such a procedure is that it does not take into consideration the size of the network inputs when choosing how spread out the random weights should be. The ANN literature contains a few alternative procedures, such as the ones proposed by Nguyen and Widrow [NgWi89] and Drago and Ridella [DrRi92].
Assuming that the network input space has dimension 1, if we use a gaussian distribution with zero mean to generate the weight from the input unit to each hidden unit and the bias of each hidden unit, the position of the decision surface (in this case a point on a horizontal line) will be given by x_DS = -bias_i / w_i, where i specifies the hidden unit number. Assuming that bias_i and w_i are independent random variables, x_DS has a Cauchy distribution with zero mean [Pap84].

Figure 4.1 shows the histogram of x_DS calculated as above, using the quotient of 1000 computer-generated samples of two (assumed independent) gaussian random variables with zero mean and variance 1.
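A figure like fig. 4.1 can be reproduced with a few lines of (hypothetical) code:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
bias = rng.normal(0.0, 1.0, 1000)        # gaussian bias, zero mean, variance 1
w = rng.normal(0.0, 1.0, 1000)           # gaussian weight, zero mean, variance 1
x_ds = -bias / w                         # decision-surface positions (Cauchy distributed)

plt.hist(x_ds, bins=np.arange(-15, 15.5, 0.5))   # heavy tails fall outside the plot range
plt.xlabel("x_DS"); plt.ylabel("Number of samples")
plt.show()
```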
If the network has 2 inputs, then the decision surface (in this case a line) for hidden unit i will be described by w_i1 x_1 + w_i2 x_2 + bias_i = 0. Figure 4.2 shows 100 lines generated by defining the coefficients w_i1, w_i2 and bias_i as gaussian random variables with zero mean and standard deviation 0.1.
From figures 4.1 and 4.2 we can see that most of the decision surfaces will be concentrated around the origin. It is also very important to note that, depending on how large the input space is, even if we consider it centered around the origin, there is the possibility that some of the decision surfaces will fall outside the input space. In this case, these units will produce a near-constant output for inputs within the valid input space, especially if their decision surfaces have a steep inclination. Therefore such units will not be performing any useful computation and can be substituted by a constant term added to the bias of the output units. If we train a network with a hidden unit initialized in this way using a gradient-based algorithm (such as the Back-Propagation algorithm), this hidden unit will take a long time to change its weights, since it will be operating in a region where its derivative is very low.

Fig. 4.1 - Histogram of a variable defined as the quotient of two gaussian random variables with zero mean and variance 1

Fig. 4.2 - 100 decision surfaces generated by the standard initialization procedure
If we consider in fig. 4.2 that the valid range for the variables x_1 and x_2 is [-10 10], most of the decision surfaces will be within a small distance from the origin. If we have no pre-knowledge about the correct positioning of the decision surfaces, this seems difficult to justify. On the other hand, if the valid range for x_1 and x_2 is [-1 1], there will be several hidden units with their decision surfaces outside the valid network input space. We get very similar results if, instead of a gaussian distribution, we use a uniform distribution.
It is possible to explore the concept of decision surfaces to get better
initialization procedures. In this case we only need to make use of information that is
normally available, that is, the valid range of the network input variables.
If no previous information is available about the desired location of the decision
surfaces for a particular problem (the normal case), then it is reasonable to argue that
the available decision surfaces should be uniformly distributed over the valid input
space. Therefore, instead of generating the weights and biases using a particular random
distribution, and then calculating from them the positioning of the decision surfaces and
looking at their distribution over the input space, we propose to do the opposite. We can
generate the location of the decision surfaces using some appropriate random distribution
and then calculate the weights and biases associated with each decision surface.
Finally we adjust the inclination of each decision surface, also taking into account the
size of the input space, to avoid the possibility of very large inclinations that would
result in slow adaptation.
If the output units are linear we cannot associate a decision surface with them
and therefore we propose to still generate their weights and biases using a random
distribution (gaussian or uniform) with zero mean. On the other hand, if the output units
are sigmoidal, the same method that is used to generate the weights and biases can be
applied with the only difference that the input space for the output units is now the
output space of the hidden units. If there are direct connections from the network input
units to the output units the input space of the output units also includes the network
input space.
We propose two new procedures to initialize the network based on this idea of
initializing the decision surfaces.
4.1.2 - The First Initialization Procedure
One way to initialize the decision surfaces over the valid input space is to select
(using a uniform distribution) a sufficient number of points to define a decision surface,
in this case a hyperplane. Since each point inside the valid input space has the same
probability of being chosen as all the other valid points, the location of the
corresponding decision surfaces will also be uniformly distributed over the input space.
Therefore the first step is to obtain the equation of the decision surface from the set of
selected points, and from that the values of the weights and the bias.
Since the network input space has dimension Nx, we select at random from a
uniform distribution Nx points within the valid input range, since we need Nx points to
define a decision surface uniquely. Since these points belong to the decision surface,
they must satisfy the equation of the hyperplane (for simplicity we drop the subscript i):

w_1 x_1 + w_2 x_2 + ... + w_Nx x_Nx + bias = 0

Since we have Nx points, we have to solve a system of linear equations X W = 0, where
each selected point (augmented with a 1 in the last column) defines a row of the matrix X,
W = [w_1 ... w_Nx bias]^T and 0 = [0 ... 0]^T. We have then Nx equations and Nx+1
unknowns. Therefore it is necessary to add another constraint. There are several
possibilities, like forcing one of the weights or the bias to be equal to 1 by adding the
constraint w_1 = 1 (or bias = 1). At this point we use:

w_1 + w_2 + ... + w_Nx + bias = Nx
The particular constraint used is not relevant, as long as it is a valid one (bias = 0 is
not a valid constraint, since it is also satisfied by the trivial solution W = 0); the
constraint merely determines the inclination, and we propose to normalize the inclination
in a later step.
Using the above procedure, the following steps are used to initialize a
feedforward ANN with one hidden layer of sigmoidal units and linear output units:
Step 1) Initialize the weights that the output units receive and the bias of
the output units as small random values. Observe that the output units
can receive weights directly from the input units as well.
Step 2) For each unit in the hidden layer:
2.1 - Select at random Nx points within the valid input range (Nx =
number of input units). All points within the valid input space have the
same probability of being selected. Let us use the following notation to
denote each of these points and their components:

X^j = [ x^j_1  x^j_2  ...  x^j_Nx ],   j = 1, ..., Nx

2.2 - The weights that connect the input units to this hidden unit and the
bias of this hidden unit are calculated as the solution of the following set
of linear equations:
(4.1)
    [ x^1_1    x^1_2    ...   x^1_Nx    1 ]   [ w_i1    ]     [ 0  ]
    [ x^2_1    x^2_2    ...   x^2_Nx    1 ]   [ w_i2    ]     [ 0  ]
    [  ...      ...     ...    ...     ... ]   [  ...    ]  =  [ ...]
    [ x^Nx_1   x^Nx_2   ...   x^Nx_Nx   1 ]   [ w_iNx   ]     [ 0  ]
    [ 1        1        ...   1         1 ]   [ bias_i  ]     [ Nx ]

Figure 4.3 shows the locations of 100 decision surfaces obtained by using this
procedure. It is important to notice that the decision surfaces are equally spread over the
input space and, due to the method, it is possible to guarantee that all of them cross the
valid input space.
Using this method we have to solve a system of linear equations of dimension
Nx+1 for each network unit. If the unit receives inputs from many other units (for
instance if the network has a large number of inputs) or if the number of units to be
initialized is very large, this method can be too computationally demanding. In order to
minimize this problem we propose a second initialization procedure. However, the
method used to initialize the weights received by the linear output units is the same.

Fig. 4.3 - 100 decision surfaces generated by the first initialization procedure.
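As a concrete illustration, the first procedure can be sketched as follows in Python
(assuming NumPy; the function name and argument conventions are ours, not from the
text):

import numpy as np

def init_hidden_weights_p1(n_hidden, input_range, rng=np.random.default_rng()):
    # input_range: one (min, max) pair per network input.
    lo, hi = np.asarray(input_range, dtype=float).T
    nx = len(lo)
    W = np.empty((n_hidden, nx))
    bias = np.empty(n_hidden)
    for i in range(n_hidden):
        # Nx points drawn uniformly from the valid input space; with
        # probability 1 they are in general position, so eq. 4.1 is solvable.
        pts = rng.uniform(lo, hi, size=(nx, nx))
        A = np.vstack([np.hstack([pts, np.ones((nx, 1))]),  # rows: w.x + bias = 0
                       np.ones((1, nx + 1))])               # constraint row: sum = Nx
        rhs = np.concatenate([np.zeros(nx), [float(nx)]])
        sol = np.linalg.solve(A, rhs)
        W[i], bias[i] = sol[:nx], sol[nx]
    return W, bias

W, b = init_hidden_weights_p1(100, [(-10, 10), (-10, 10)])   # as in figure 4.3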
4.1.3 - The Second Initialization Procedure
Instead of selecting Nx points within the valid input range and then using these
points to generate a decision surface, another method is simply to select at random only
1 point (all points within the valid input space are equally probable) and then to select
a vector with random components such that the direction in which this vector points is
random. The decision surface is then defined as the hyperplane that passes through the
selected point and such that the selected vector is normal to it.
One way of generating this vector normal to the decision surface is first to fix
its length and then, each time a new component of the vector is to be defined, to
calculate the limit for the squared size of that component as the total squared length
decreased by the sum of the squares of the components already generated. If Ns is the
desired length for the vector V, then:

(4.2)  Ns² = V_1² + V_2² + ... + V_Nx²

The first component V_1 can be generated as a uniform random variable in the interval
[-Ns Ns]. The second component V_2 is chosen in the range
[-(Ns² - V_1²)^{1/2}  (Ns² - V_1²)^{1/2}], and so on until V_{Nx-1} is generated. The
last component of V is calculated such that V has the desired length:

(4.3)  V_Nx = ±(Ns² - V_1² - V_2² - ... - V_{Nx-1}²)^{1/2}

where the positive sign and the negative sign have the same probability of 50%.
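A sketch of this component-wise construction (eqs. 4.2 and 4.3), assuming NumPy; the
function name is ours:

import numpy as np

def random_normal_vector(nx, Ns=1.0, rng=np.random.default_rng()):
    # Eq. 4.2: the squared length still unallocated bounds each new component.
    V = np.empty(nx)
    remaining = Ns ** 2
    for k in range(nx - 1):
        limit = np.sqrt(remaining)
        V[k] = rng.uniform(-limit, limit)
        remaining -= V[k] ** 2
    # Eq. 4.3: the last component takes the remaining length, with a random
    # sign (positive and negative equally probable).
    V[-1] = rng.choice([-1.0, 1.0]) * np.sqrt(remaining)
    return V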
A more direct way of generating this vector normal to the decision surface is to
use the concept of circular symmetry of random variables [Pap84]. This concept states
that if we have several independent normal random variables with zero mean and equal
variance, then these random variables are circular symmetrical and their joint statistics
depends only on the distance from the origin. Therefore all points that have the same
distance from the origin are equally probable. This implies that, if the components of
V are generated using such a concept, for a given magnitude all directions will be
equally probable.
Suppose for the moment that the vector V has been scaled to an arbitrary non-zero
magnitude. Denoting the selected point by X*, all points X that belong to the decision
surface satisfy the equation: (X - X*)^T V = 0. Figure 4.4 gives a geometrical
interpretation for this equation.

Figure 4.4 - Geometrical interpretation of the initialization procedure

Comparing this equation with the equation for the decision surface (which is defined by
the incoming weights to the unit and the associated bias as W^T X + bias = 0, where
W = [w_1 w_2 ... w_Nx]^T) we have:

W = V,    bias = -X*^T V
Comparing this method with the standard method (assuming that the standard method
uses a gaussian distribution), we can see that the difference is the way that the bias term
is initialized. Figure 4.5 shows the locations of 100 decision surfaces where V was
generated using random values with gaussian distribution, zero mean and 0.1 as the
standard deviation. Again, using the above method we guarantee that the decision
surfaces will cross the valid input space, since the selected point X* was chosen from the
set of points that are within the valid input space.

Figure 4.5 - 100 decision surfaces generated by the second initialization procedure.
Note that in this procedure we have assumed that the vector V (and therefore the
weight vector W for the unit as well) has been scaled to an arbitrary non-zero
magnitude. It is this magnitude that dictates the inclination of the decision surface.
In the next sub-section we propose to adjust this inclination as the last step (Step 3) for
both initialization procedures suggested here.
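A minimal sketch of the second procedure, assuming NumPy (the function name and
conventions are ours, not from the text):

import numpy as np

def init_hidden_weights_p2(n_hidden, input_range, rng=np.random.default_rng()):
    # input_range: one (min, max) pair per network input.
    lo, hi = np.asarray(input_range, dtype=float).T
    nx = len(lo)
    X_star = rng.uniform(lo, hi, size=(n_hidden, nx))  # one point per unit, inside the space
    V = rng.normal(0.0, 1.0, size=(n_hidden, nx))      # circularly symmetric normal vectors
    bias = -np.einsum('ij,ij->i', X_star, V)           # bias = -X*^T V, so V.X* + bias = 0
    return V, bias                                     # W = V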
4.1.4 - Adjusting the Inclinations of the Decision Surfaces
In order to adjust the inclination of the decision surface we simply adjust the
variation of the output of the unit for a given variation of the unit input.
The variation of the unit input is specified by choosing any point X_1 on the
decision surface and another point X_2 such that ΔX = X_2 - X_1 is orthogonal to the
decision surface.
For convenience, assuming without loss of generality that the center of the valid
input space is the origin, we can choose X_1 to be the point belonging to the decision
surface that is closest to the origin. As we have seen in the previous sub-section, the unit
weight vector (the incoming weights) W is the vector orthogonal to the decision surface
and consequently X_1 = αW for some scalar α. Since X_1 belongs to the decision surface,
W^T X_1 + bias = 0. Combining these two equations, we have that:

(4.4)  α = -bias / ‖W‖²

(4.5)  X_1 = -(bias / ‖W‖²) W

where ‖W‖ = (W^T W)^{1/2} is the length of the weight vector. The point X_2 is then
defined as:
(4.6)  X_2 = X_1 + K_s U_s (W / ‖W‖)

where K_s and U_s are scalar parameters such that K_s > 0 and U_s > 0. The parameter U_s
is set to the distance from the origin to the most distant corner of the valid input space
and therefore gives a measure of the size of the input space. The parameter K_s is
therefore the length of the vector ΔX in "U_s" units.
Once we have selected a value for K_s, for a given unit weight vector and bias,
the variation of the output of the unit is simply calculated as:

(4.7)  ΔF = F(net_2 / T) - F(net_1 / T)

where F(net/T) is the unit function, e.g. sigmoid or hyperbolic tangent, T is the fixed
parameter called temperature, T > 0, net_1 = W^T X_1 + bias and net_2 = W^T X_2 + bias. Note
that: a) net_1 is by definition 0 since X_1 is on the decision surface; and b) for a unit with
a sigmoid function or hyperbolic tangent, F(net_1) = 0.5 or 0 respectively.
The objective is to find a positive scalar gain K_w such that, when it is used to scale
the unit weight vector and the unit bias, the unit has the desired output variation
ΔF_des > 0 for a given input variation specified by K_s. From eq. 4.7, K_w can be
calculated for a sigmoid unit using:

(4.8)  K_w = (T / net_2) ln[ (0.5 + ΔF_des) / (0.5 - ΔF_des) ]

Using the expression tanh(x/T) = 2 sig(2x/T) - 1, K_w can be calculated for a hyperbolic
tangent unit using:

(4.9)  K_w = (T / (2 net_2)) ln[ (1 + ΔF_des) / (1 - ΔF_des) ]

The unit weight vector and bias are finally replaced by K_w W and K_w bias respectively.
Note that for sigmoid units: 0 < ΔF_des < 0.5, and for hyperbolic tangent units:
0 < ΔF_des < 1.
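A sketch of this inclination adjustment for sigmoid hidden units (eq. 4.8), assuming
NumPy; the default parameter values mirror those used in the simulations of section 4.3,
and the function name is ours:

import numpy as np

def adjust_inclinations(W, bias, U_s, K_s=1.0, delta_F_des=0.4, T=0.5):
    # With X_2 = X_1 + K_s*U_s*W/||W|| and net_1 = 0, the net input at X_2
    # is net_2 = K_s * U_s * ||W|| for each hidden unit.
    norms = np.linalg.norm(W, axis=1)
    net2 = K_s * U_s * norms
    # Eq. 4.8: gain that makes the sigmoid output change by delta_F_des
    # (0 < delta_F_des < 0.5) over the specified input variation.
    K_w = (T / net2) * np.log((0.5 + delta_F_des) / (0.5 - delta_F_des))
    return K_w[:, None] * W, K_w * bias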
4.2 - Constraining the Decision Surfaces during Training
The knowledge that the decision surfaces will have to be within or close to the
boundaries of the network input space can also be exploited during training. A simple
(and probably not optimal) way to do this is to periodically check whether the
decision surfaces are within the boundaries of a permissible region. The units with
decision surfaces outside such a region are then reinitialized using the methods proposed
in the previous section. This permissible region is in general defined as enclosing the
network input space.
In order to perform a sufficiently close approximation to some mappings some
of the decision surfaces may have to be outside the network input region, but still close
to its boundaries. If a decision surface is situated very far from the boundary of the
network input region, the variation of the unit output over the network input region will
be small (if the inclination of the decision surface is small) or zero and therefore this
unit will be operating as a linear unit. This unit can be replaced by another unit with a
decision surface located within the network input space with a small inclination, such
that this unit also operates as a linear unit. A constant term representing the averaged
output of the original unit over the input space should then be added to the bias of the
output units.
We propose two methods to check if the decision surface of a sigmoidal unit is
within the boundaries of the permissible region. In general such a region is defined as
a hypercube if we can define hard limits for each network input.
In the first method we calculate the unit output for each corner of this hypercube.
If, for all corners of the hypercube, the unit output is always less than or always greater
than its output midpoint (defined as 0.5 for sigmoid units and 0 for hyperbolic tangent
units), then the decision surface of this unit is outside the hypercube.
In the second method we define a hypersphere such that it encloses the
hypercube. If we assume that all sides of the hypercube have the same length 2u and its
center is the origin, all corners will be equally distant from the origin and the radius of
the hypersphere is equal to the distance from the corners to the origin, that is u(Nx)^{1/2}.
If the distance from the decision surface to the origin (|bias| / ‖W‖, see eq. 4.5) is
greater than the radius of the hypersphere, then the decision surface is outside the
hypersphere and outside the hypercube.
The first method is more restrictive than the second one since, for input spaces
with dimension greater than 1, if the decision surface (the hyperplane) is nearly parallel
to one of the sides of the hypercube, it is possible that the decision surface is inside the
hypersphere but outside the hypercube. On the other hand, the number of calculations
is much greater in the first method than in the second method. For input spaces with
dimension 1, the two methods are the same.
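Both checks can be sketched as follows, assuming NumPy (the function names are ours;
lo and hi bound the permissible hypercube, and u is the half-side of a cube centred at
the origin):

import itertools
import numpy as np

def surface_outside_cube(w, bias, lo, hi):
    # Method 1: the unit output is above (below) its midpoint exactly when
    # the net input is positive (negative), so a constant sign of the net
    # input at all 2^Nx corners means the surface does not cross the cube.
    corners = np.array(list(itertools.product(*zip(lo, hi))))
    nets = corners @ w + bias
    return np.all(nets > 0) or np.all(nets < 0)

def surface_outside_sphere(w, bias, u):
    # Method 2: compare the distance |bias|/||w|| from the origin to the
    # surface with the radius u*sqrt(Nx) of the enclosing hypersphere.
    return abs(bias) / np.linalg.norm(w) > u * np.sqrt(len(w))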
4.3 - Simulations
In this section we illustrate the application of the proposed method in the case
where it is desired to train a FF ANN to learn the mapping y = sin(x) for x in the
interval [-2π 2π]. The ANN has 5 sigmoid hidden units and the output unit is linear.
In order to perform the desired mapping the learning algorithm has to position
a decision surface where the target function crosses the line y = 0. Therefore 5 hidden
units is the minimum number of hidden units necessary to produce a good
approximation of the desired mapping. Moreover, since the network input and output
variables are continuous, the decision surfaces for the hidden units have to be positioned
at [-2π, -π, 0, π, 2π] with good precision and have approximately the correct inclination.
The weights for the output units should then provide the correct linear combination of
the outputs of the hidden units. In other words, this is a demanding problem since the
solution space is very limited and contains only a certain combination of weights.
Figure 4.6 shows that the function F(x) can approximate the sine function very
well, where:

(4.10)  F(x) = 1.15 Σ_{i=1}^{5} (-1)^{i-1} tanh( x - (i-3)π )

or, using the relation tanh(x/T) = 2 sig(2x/T) - 1:

(4.11)  F(x) = -1.15 + 2.3 Σ_{i=1}^{5} (-1)^{i-1} sig( 2x - 2(i-3)π )

The degree of approximation can be measured by calculating the Root-Mean-Squared
(RMS) error. The expression for the RMS error is:
(4.12)  RMS error = [ (1/Np) Σ_{i=1}^{Np} ( sin(x_i) - F(x_i) )² ]^{1/2}

where Np = number of selected points. Using a set of 40 equally spaced points in the
range [-2π 2π], the RMS error is 0.004539.

Figure 4.6 - The function sin(x), its approximation F(x) and the error, plotted against x/π.
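For reference, eq. 4.12 together with F(x) as reconstructed in eq. 4.10 can be evaluated
with a few lines of Python (a sketch assuming NumPy; since the signs and centres in F
are reconstructions, the value printed is only indicative):

import numpy as np

def rms_error(f, n_points=40):
    # Eq. 4.12 over n_points equally spaced points in [-2*pi, 2*pi].
    x = np.linspace(-2 * np.pi, 2 * np.pi, n_points)
    return np.sqrt(np.mean((np.sin(x) - f(x)) ** 2))

def F(x):
    # Eq. 4.10: alternating tanh steps centred at (i - 3)*pi, i = 1..5.
    i = np.arange(1, 6)[:, None]
    return 1.15 * np.sum((-1.0) ** (i - 1) * np.tanh(x - (i - 3) * np.pi), axis=0)

print(rms_error(F))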
In this section we compare the simulations for 3 cases: 1) in the first case all the
network weights and biases are initialized using the standard initialization method, that
is as random values; 2) in the second case the network is initialized using the second
initialization method presented in section 4.1.3 and the inclinations of the decision
surfaces for the hidden units are then adjusted as explained in section 4.1.4; 3) in the
third case the network is initialized as in the previous case and the decision surfaces
of the hidden units are reinitialized during training whenever they are found to be
outside a pre-defined permissible region.
In all cases the network is trained using the Back-Propagation algorithm with
the following parameters: learning rate = 0.125, momentum = 0, temperature = 0.5. Each
epoch is defined as the presentation of 50 points selected with uniform distribution in
the range [-2.5π 2.5π] (a new set of points is selected in every epoch). Care was taken
to ensure that the same training data, in the same order, was used in all 3 cases. The
network input is defined as x/(2π) and the desired network output is defined as sin(x)
and is presented uncorrupted. The RMS error is calculated every 5 epochs using 40
points equally spaced in the range [-2π 2π].
In the first case all the network weights and biases are initialized as random values
with gaussian distribution, zero mean and 0.3 as the standard deviation. In the second
and third cases: a) the output unit weights and bias were initialized to the same values
used in the first case, but the hidden unit weights and biases were initialized such that
the decision surfaces were located in the range [-2π 2π] (in network units [-1 1]) using
the method presented in section 4.1.3; b) the inclination of the initial decision surfaces
was adjusted as explained in section 4.1.4 using the user-defined parameters K_s = 1 and
ΔF_des = 0.4 (in our simulations U_s = 1). In all 3 cases the location of the decision
surfaces was verified every 5 epochs.
In the third case, whenever the decision surface of a hidden unit was detected to
be outside the permissible region defined to be [-4π 4π] (in network units [-2 2]), the
following procedure was adopted:
a) The amount w_OH (MaxHU - MinHU)/2 was added to the output unit
bias, where w_OH = weight connecting the hidden unit to the output unit, and MaxHU and
MinHU = maximum and minimum values of the hidden unit output when the network
inputs are at the corners of the permissible region [-4π 4π]. The basic idea is to transfer
to the output unit bias the "average" contribution of the hidden unit that is being reset.
b) The weight w_OH was set to zero.
c) The hidden unit incoming weights and bias were reinitialized such that
the decision surface went back to the range [-2π 2π] (in network units [-1 1]) using
the initialization method presented in section 4.1.3. The hidden unit incoming
weights were generated as random gaussian numbers with zero mean and unit variance.
Note that, once a hidden unit is reset, the inclination of its new decision surface
was not readjusted, although this is a possible alternative.
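A sketch of this reset step, reusing the helpers sketched earlier (surface_outside_cube
from section 4.2 and init_hidden_weights_p2 from section 4.1.3) and assuming NumPy;
array shapes and names are ours, and step (a) follows the text as printed:

import itertools
import numpy as np

def reset_stray_units(W, b, w_oh, b_out, lo, hi, input_range, T=0.5):
    # W: (H, Nx) hidden weights, b: (H,) hidden biases, w_oh: (H,) hidden-to-
    # output weights, b_out: output unit bias; lo, hi bound the permissible
    # region, input_range bounds the network input space.
    corners = np.array(list(itertools.product(*zip(lo, hi))))
    for h in range(len(b)):
        if surface_outside_cube(W[h], b[h], lo, hi):
            outs = 1.0 / (1.0 + np.exp(-(corners @ W[h] + b[h]) / T))  # MaxHU, MinHU
            b_out += w_oh[h] * (outs.max() - outs.min()) / 2           # step a
            w_oh[h] = 0.0                                              # step b
            W[h:h+1], b[h:h+1] = init_hidden_weights_p2(1, input_range)  # step c
    return W, b, w_oh, b_out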
Figure 4.7 shows the RMS error history (sometimes also referred to as the
learning curve) for the 3 cases. Considering that convergence is obtained when the
RMS error remains below 0.02, the first case takes 2060 epochs to converge, the
second case 345 epochs and the third case 230 epochs. The learning speed in the third
case is almost 9 times faster than in the first case.

Figure 4.7 - The RMS error history for the 3 simulation cases.
Figures 4.8-4.11 show the evolution of the location of the decision surfaces for
the 3 cases. Figures 4.8 and 4.9 refer to the first case, plotted using different vertical
scales, while figures 4.10 and 4.11 refer to the second and third cases respectively. Note
that the learning curve for case 1 in figure 4.7 has a staircase shape and that, whenever
one of the decision surfaces converges to its correct final value, there is a sharp decrease
in the RMS error in the learning curve. Note in figure 4.9 that a large number of epochs
is wasted since the decision surfaces are very far from their correct locations.
Figure 4.12 shows for case 3 the history of the output unit weights and bias, also
sampled every 5 epochs. Finally figure 4.13 shows for case 3 the approximation
provided by the network after being trained for 500 epochs. At the end of the training
session the RMS error is 0.004807 and the decision surfaces are located at
[-1.9306 -1.0350 -0.0311 1.0455 1.9645] while they were expected to be located at
[-2 -1 0 1 2].
Fig. 4.8 - Decision surfaces (x/π) versus number of epochs for case 1

Fig. 4.9 - Decision surfaces (x/π) versus number of epochs for case 1, on a smaller
vertical scale

Fig. 4.10 - Decision surfaces (x/π) versus number of epochs for case 2

Fig. 4.11 - Decision surfaces (x/π) versus number of epochs for case 3

Fig. 4.12 - Output unit weights and bias for case 3

Fig. 4.13 - The function sin(x) and its network approximation for case 3
4.4 - Conclusion
In this chapter we presented a technique that can be used with the Back-
Propagation algorithm in order to speed up learning. We propose to use the knowledge about the
range of the network inputs to initialize and constrain the location of the network
decision surfaces during training. We also propose to adjust the inclination of the
decision surfaces during the weight initialization process.
The simulation results demonstrate that once the decision surfaces converge to
their correct location, the adjustment of the second layer of weights is very fast. This
seems to indicate that learning occurs in the bottom-to-top direction (the input layer is
at the bottom and the output layer is at the top).
During training the user has to define the permissible region for the decision
surfaces. A permissible region that is too small will lead to a large number of
unnecessary reinitializations of the decision surfaces. On the other hand, a permissible
region that is too large will tend to slow down the convergence. A possible alternative,
which avoids the need to specify a permissible region, is to treat the location of the
decision surface as a "soft" constraint rather than a "hard" constraint.
In the next chapter we are concerned with how to improve the fault-tolerance
of the feedforward ANN, so as to increase network robustness to the loss of hidden units.