
Gradient Descent

The Perceptron Learning Rule is an algorithm for adjusting the network weights wij to minimise the difference between the actual outputs outj and the desired outputs targj.
We can define an Error Function to quantify this difference:

For obvious reasons this is known as the Sum Squared Error function. It is the total squared error summed over all output units j and all training patterns p.
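The Sum Squared Error formula itself did not survive in these notes. A standard reconstruction (the 1/2 factor is a common convention that simplifies the derivative, and is an assumption here):

```latex
E(w_{ij}) = \frac{1}{2} \sum_{p} \sum_{j} \left( \text{targ}_{j}^{\,p} - \text{out}_{j}^{\,p} \right)^{2}
```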
Gradient Descent
The aim of learning is to minimise this error by adjusting the weights wij. Typically we make a series of small adjustments to the weights, wij → wij + Δwij, until the error E(wij) is small enough.
A systematic procedure for doing this requires knowledge of how the error E(wij) varies as we change the weights wij, i.e. the gradient of E with respect to wij.
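As an illustration of this procedure, here is a minimal sketch of gradient descent on the sum-squared error of a single linear unit. The learning rate, toy dataset, and epoch count are assumptions for the example, not values from the notes:

```python
# Minimal gradient-descent sketch for a single linear unit minimising the
# sum-squared error. Learning rate, data and epoch count are illustrative.
def train(patterns, targets, w, eta=0.1, epochs=100):
    for _ in range(epochs):
        for x, targ in zip(patterns, targets):
            out = sum(wi * xi for wi, xi in zip(w, x))   # actual output
            err = targ - out                             # targ - out
            # dE/dw_i = -(targ - out) * x_i, so step against the gradient
            w = [wi + eta * err * xi for wi, xi in zip(w, x)]
    return w

# Toy task: learn out = 2*x1 - x2
patterns = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
targets = [2.0, -1.0, 1.0, 0.5]
w = train(patterns, targets, [0.0, 0.0])   # w approaches [2, -1]
```

Each small adjustment moves the weights a step down the error surface; after enough passes the error becomes negligible.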
Backpropagation Networks

Backpropagation networks, and multi-layered perceptrons in general, are feedforward networks with distinct input, output, and hidden layers. The units function basically like perceptrons, except that the transition (output) rule and the weight update (learning) mechanism are more complex.

The figure on the next page presents the architecture of backpropagation networks. There
may be any number of hidden layers, and any number of hidden units in any given
hidden layer. Input and output units can be binary {0, 1}, bi-polar {-1, +1}, or may
have real values within a specific range such as [-1, 1]. Note that units within the same
layer are not interconnected.
Backpropagation Networks
In feedforward activation, units of hidden layer 1 compute their activation and output values and pass these on to the next layer, and so on, until the output units have produced the network's actual response to the current input. The activation value ak of unit k is computed as the weighted sum of its incoming signals:

ak = Σi wki xi

This is basically the same activation function as that of linear threshold units (the McCulloch and Pitts model).
As illustrated above, xi is the input signal coming from unit i at the other end of the incoming connection, and wki is the weight of the connection between unit k and unit i. Unlike in the linear threshold unit, the output of a unit in a backpropagation network is no longer based on a threshold. The output yk of unit k is computed as follows:

yk = f(ak)
The function f(x) is referred to as the output function. It is a continuously increasing function of the sigmoid type, asymptotically approaching 0 as x decreases and 1 as x increases. At x = 0, f(x) is equal to 0.5.
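A small sketch of the activation and output computations just described (the weight and input values are made up for illustration):

```python
import math

def f(x):
    # Sigmoid output function: rises from 0 toward 1, with f(0) = 0.5
    return 1.0 / (1.0 + math.exp(-x))

def activation(weights, inputs):
    # a_k: weighted sum of the incoming signals x_i (threshold omitted)
    return sum(w * x for w, x in zip(weights, inputs))

a = activation([0.5, -0.3], [1.0, 1.0])   # a_k = 0.2
y = f(a)                                  # y_k = f(a_k), between 0 and 1
```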
Backpropagation Networks
In some implementations of the backpropagation model, it is convenient to have input and output values that are bi-polar. In this case, the output function uses the hyperbolic tangent, which has basically the same shape but is asymptotic to −1 as x decreases and to +1 as x increases. This function has value 0 when x is 0.

Once activation is fed forward all the way to the output units, the network's response is compared to the desired output ydi which accompanies the training pattern. There are two types of error. The first is the error at the output layer, which can be directly computed as follows:
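The output-error formula is missing here; in the usual notation (matching the ydi above), it is simply the difference between the desired and the actual output:

```latex
e_i = y_{d_i} - y_i
```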

The second type of error is the error at the hidden layers. This cannot be computed directly, since there is no available information on the desired outputs of the hidden layers. This is where the retropropagation of error is called for.
Backpropagation Networks
Essentially, the error at the output layer is used to compute the error at the hidden layer immediately preceding the output layer. Once computed, this is used in turn to compute the error of the hidden layer immediately preceding that one. This continues sequentially until the error at the very first hidden layer is computed. The retropropagation of error is illustrated in the figure below:
Backpropagation Networks

Computation of errors ei at a hidden layer is done as follows:

The errors at the other end of the outgoing connections of the hidden unit h have been earlier
computed. These could be error values at the output layer or at a hidden layer. These error signals are
multiplied by their corresponding outgoing connection weights and the sum of these is taken.
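The hidden-error formula did not survive extraction. Reconstructed from the description above (downstream errors weighted by the outgoing connection weights and summed), it reads:

```latex
e_h = \sum_{k} w_{kh} \, e_k
```

where k ranges over the units at the other end of hidden unit h's outgoing connections.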
Backpropagation Networks
After computing the error for each unit, whether it is at a hidden unit or at an output unit, the network then fine-tunes its connection weights wkj. The weight update rule is uniform for all connection weights.
The learning rate α is typically a small value between 0 and 1. It controls the size of weight adjustments, and has some bearing on the speed of the learning process as well as on the precision with which the network can eventually operate. The derivative f'(x) also controls the size of weight adjustments, depending on the actual output f(x). In the case of the sigmoid function above, the first derivative (slope) is easily computed as:

f'(x) = f(x) (1 − f(x))

We note that the change in weight is directly proportional to the error term computed for the unit at the output end of the incoming connection. However, this weight change is also scaled by the output signal coming from the input end of the incoming connection. We can infer that very little weight change (learning) occurs when this input signal is almost zero.
The weight change is further controlled by the term f'(ak). Because this term measures the slope of the output function, and knowing the shape of that function, we can infer that there will likewise be little weight change when the output of the unit at the other end of the connection is close to 0 or 1. Thus, learning takes place mainly at those connections with high pre-synaptic signals and non-committed (hovering around 0.5) post-synaptic signals.
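The two observations above can be checked numerically. This sketch assumes the common update form Δw = α · ek · f'(ak) · xi with a sigmoid unit; the numeric values are illustrative:

```python
import math

def f(x):                 # sigmoid output function
    return 1.0 / (1.0 + math.exp(-x))

def f_prime(a):           # slope f'(a) = f(a) * (1 - f(a))
    y = f(a)
    return y * (1.0 - y)

def weight_update(alpha, e_k, a_k, x_i):
    # delta_w = alpha * error * slope * input signal
    return alpha * e_k * f_prime(a_k) * x_i

# Little learning when the input signal x_i is almost zero ...
small = weight_update(0.25, 0.8, 0.0, 0.001)
# ... or when the unit is committed (f(a_k) close to 0 or 1)
committed = weight_update(0.25, 0.8, 10.0, 1.0)
# Largest change with a strong input and a non-committed unit (f ~ 0.5)
large = weight_update(0.25, 0.8, 0.0, 1.0)
```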
Learning Process

One of the most important aspects of a Neural Network is the learning process. The
learning process of a Neural Network can be viewed as reshaping a sheet of metal,
which represents the output (range) of the function being mapped. The training set
(domain) acts as energy required to bend the sheet of metal such that it passes through
predefined points. However, the metal, by its nature, will resist such reshaping. So the
network will attempt to find a low energy configuration (i.e. a flat/non-wrinkled shape)
that satisfies the constraints (training data).

Learning can be done in supervised or unsupervised training.

In supervised training, both the inputs and the outputs are provided.
The network then processes the inputs and compares its resulting outputs against the
desired outputs. Errors are then calculated, causing the system to adjust the weights
which control the network. This process occurs over and over as the weights are
continually tweaked.
Backpropagation Learning Math
(Figures: the backpropagation learning equations and a visualization of backpropagation learning, including the output-layer update; not reproduced here.)
Summary
The following properties of nervous systems will be of particular interest in our neurally-inspired models:
Parallel, distributed information processing
High degree of connectivity among basic units
Connections are modifiable based on experience
Learning is a constant process, and usually unsupervised
Learning is based only on local information
Performance degrades gracefully if some units are removed

In unsupervised training, the network is provided with inputs but not with desired outputs. The system itself must then decide what features it will use to group the input data. This is often referred to as self-organization or adaptation.
The following geometrical interpretations will demonstrate the learning process within different Neural Models:

Supervised and Unsupervised
Neural Networks
Understanding Supervised and
Unsupervised Learning

(Figure: data points labelled A and B, with two possible solutions for separating the two classes.)
Supervised Learning
It is based on a labeled training set: the class of each piece of data in the training set is known. Class labels are pre-determined and provided in the training phase.
(Figure: training data points marked as Class A and Class B.)
Supervised Vs Unsupervised
Supervised: tasks performed are classification and pattern recognition. NN models: Perceptron, Feed-forward NN. Question answered: what is the class of this data point?
Unsupervised: task performed is clustering. NN model: Self-Organizing Maps. Questions answered: what groupings exist in this data, and how is each data point related to the data set as a whole?
Unsupervised Learning
Input : set of patterns P, from n-dimensional space S,
but little/no information about their classification,
evaluation, interesting features, etc.
It must learn these by itself! : )

Tasks:
Clustering - Group patterns based on similarity
Vector Quantization - Fully divide up S into a small
set of regions (defined by codebook vectors) that
also helps cluster P.
Feature Extraction - Reduce dimensionality of S by
removing unimportant features (i.e. those that do not
help in clustering P)
Unsupervised Learning
Basic ISOdata Algorithm
Choose some initial values for the means μ1, ..., μc.
Classify the m samples by assigning each to the class of the closest mean.
Re-compute each mean as the average of the samples in its class.
Repeat until no mean changes value.
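The four steps above can be sketched directly. The 1-D samples and the initial means below are assumptions for the example:

```python
# Basic ISOdata loop: assign samples to the closest mean, recompute the
# means, repeat until no mean changes value.
def isodata(samples, means):
    while True:
        classes = [[] for _ in means]
        for s in samples:
            j = min(range(len(means)), key=lambda k: abs(s - means[k]))
            classes[j].append(s)
        new_means = [sum(c) / len(c) if c else m
                     for c, m in zip(classes, means)]
        if new_means == means:
            return means, classes
        means = new_means

samples = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
means, classes = isodata(samples, [0.0, 10.0])   # means settle near 1 and 5
```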
Unsupervised Learning
Similarity Measures
The objective of the similarity measure approach is to try to find natural groupings. We will now assume that x is an n-dimensional column vector.
1. Normalized inner product
2. Fraction of shared attributes
3. Ratio of shared attributes
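The three formulas are missing from the notes; the standard definitions from the pattern-recognition literature (a reconstruction, not taken from this document) are, for column vectors x and x':

```latex
s(\mathbf{x},\mathbf{x}') = \frac{\mathbf{x}^{T}\mathbf{x}'}{\lVert\mathbf{x}\rVert\,\lVert\mathbf{x}'\rVert}
\qquad
s(\mathbf{x},\mathbf{x}') = \frac{\mathbf{x}^{T}\mathbf{x}'}{n}
\qquad
s(\mathbf{x},\mathbf{x}') = \frac{\mathbf{x}^{T}\mathbf{x}'}{\mathbf{x}^{T}\mathbf{x} + \mathbf{x}'^{T}\mathbf{x}' - \mathbf{x}^{T}\mathbf{x}'}
```

where the last two apply to binary-valued vectors.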


Unsupervised Learning
Criterion Functions
Criterion functions measure the quality of a partition of the data. The objective is to find a partition that extremizes a criterion function.
1. Sum of Squared Error Criteria.
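The formula for this criterion is missing; the standard form (a reconstruction) sums, over the c clusters, the squared distances of the samples from their cluster means:

```latex
J_e = \sum_{i=1}^{c} \sum_{\mathbf{x} \in X_i} \lVert \mathbf{x} - \mathbf{m}_i \rVert^{2}
```

where m_i is the mean of the samples in cluster X_i.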
Unsupervised Learning
2. Minimum Error Criteria

3. Scattering Criteria

4. Iterative Optimization
Select a criterion function.
Find the sets that extremize the criterion function (solved by exhaustive enumeration).
Hebbian learning
Hebb's Law states that if neuron i is near enough to excite neuron j and repeatedly participates in its activation, the synaptic connection between these two neurons is strengthened and neuron j becomes more sensitive to stimuli from neuron i.
Hebb's Law can be represented in the form of two rules:
1. If two neurons on either side of a connection are activated synchronously, then the weight of that connection is increased.
2. If two neurons on either side of a connection are activated asynchronously, then the weight of that connection is decreased.
(Figure: a neural network with input signals, output signals, and a connection from neuron i to neuron j.)
Hebbian learning in a neural network

Using Hebb's Law we can express the adjustment applied to the weight wij at iteration p in the following general form:

Δwij(p) = F[ yj(p), xi(p) ]

As a special case, we can represent Hebb's Law as follows:

Δwij(p) = α yj(p) xi(p)

where α is the learning rate parameter. This equation is referred to as the activity product rule.

Hebbian learning implies that weights can only increase. To resolve this problem, we might impose a limit on the growth of synaptic weights. This can be done by introducing a non-linear forgetting factor φ into Hebb's Law:

Δwij(p) = α yj(p) xi(p) − φ yj(p) wij(p)
Hebbian learning algorithm
Step 1: Initialisation.
Set initial synaptic weights and thresholds to small random values, say in an interval [0, 1].
Step 2: Activation.
Compute the neuron output at iteration p:

yj(p) = Σ(i=1..n) xi(p) wij(p) − θj

where n is the number of neuron inputs, and θj is the threshold value of neuron j.
Step 3: Learning.
Update the weights in the network:

wij(p + 1) = wij(p) + Δwij(p)

where Δwij(p) is the weight correction at iteration p, determined by the generalised activity product rule:

Δwij(p) = φ yj(p) [ λ xi(p) − wij(p) ]

where φ is the forgetting factor and λ = α/φ, with α the learning rate.
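A minimal numeric sketch of Hebbian learning with a forgetting term; the learning rate α = 0.1, forgetting factor φ = 0.1, and the constant unit signals are assumptions:

```python
# Activity product rule with forgetting: delta_w = alpha*y*x - phi*y*w.
# The forgetting term keeps the weight from growing without bound.
def hebb_update(w, x, y, alpha=0.1, phi=0.1):
    return w + alpha * y * x - phi * y * w

w = 0.0
for _ in range(200):
    w = hebb_update(w, x=1.0, y=1.0)
# With constantly active pre- and post-synaptic signals, the weight
# saturates where alpha*y*x == phi*y*w, i.e. near alpha/phi * x = 1.0
```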
Competitive learning
In competitive learning, neurons compete among
themselves to be activated.
While in Hebbian learning, several output neurons can be
activated simultaneously, in competitive learning, only a
single output neuron is active at any time.
The output neuron that wins the competition is called
the winner-takes-all neuron.
The basic idea of competitive learning was introduced in
the early 1970s.
In the late 1980s, Teuvo Kohonen introduced a special
class of artificial neural networks called self-organising
feature maps. These maps are based on competitive
learning.
Feature-mapping Kohonen model
(Figure: an input layer feeding a Kohonen layer; panels (a) and (b) show the mapping for the inputs 1 0 and 0 1.)
The Kohonen network
The Kohonen model provides a topological mapping. It
places a fixed number of input patterns from the input
layer into a higher-dimensional output or Kohonen layer.
Training in the Kohonen network begins with the winner's neighbourhood of a fairly large size. Then, as training proceeds, the neighbourhood size gradually decreases.
The lateral connections are used to create a competition
between neurons. The neuron with the largest activation
level among all neurons in the output layer becomes the
winner. This neuron is the only neuron that produces an
output signal. The activity of all other neurons is
suppressed in the competition.
Architecture of the Kohonen
Network
y1

O u t p u t Si g n a l s
I n p u t Si g n a l s

x1
y2
x2
y3

Input Output
layer layer
The Kohonen network

The lateral feedback connections produce excitatory or inhibitory effects, depending on the distance from the winning neuron. This is achieved by the use of a Mexican hat function, which describes the synaptic weights between neurons in the Kohonen layer.
In the Kohonen network, a neuron learns by shifting its
weights from inactive connections to active ones. Only
the winning neuron and its neighbourhood are allowed to
learn. If a neuron does not respond to a given input
pattern, then learning cannot occur in that particular
neuron.
The Mexican hat function of lateral connection
(Figure: connection strength as a function of distance from the winning neuron; excitatory effect close to the winner, inhibitory effect at greater distances.)
The Kohonen network
The competitive learning rule defines the change Δwij applied to synaptic weight wij as

Δwij = α (xi − wij), if neuron j wins the competition
Δwij = 0, if neuron j loses the competition

where xi is the input signal and α is the learning rate parameter.
The Kohonen network
Suppose, for instance, that the 2-dimensional input vector

X = [0.52, 0.12]T

is presented to the three-neuron Kohonen network. The initial weight vectors Wj are given for each of the three neurons; in particular, W3(p) = [0.43, 0.21]T.
The Kohonen network
We find the winning (best-matching) neuron jX using the minimum-distance Euclidean criterion. Neuron 3 is the winner, and its weight vector W3 is updated according to the competitive learning rule.
The Kohonen network
The updated weight vector W3 at iteration (p + 1) is determined as:

W3(p + 1) = W3(p) + ΔW3(p) = [0.43, 0.21]T + [0.01, −0.01]T = [0.44, 0.20]T

The weight vector W3 of the winning neuron 3 becomes closer to the input vector X with each iteration.
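The update in this example can be reproduced with a short sketch. The learning rate α = 0.1 is an assumption (the slide only shows the rounded correction ΔW3 ≈ [0.01, −0.01]):

```python
# Competitive learning update for the winning neuron only:
# delta_w_i = alpha * (x_i - w_i)
def competitive_update(w, x, alpha=0.1):
    return [wi + alpha * (xi - wi) for wi, xi in zip(w, x)]

X = [0.52, 0.12]
W3 = [0.43, 0.21]
W3_new = competitive_update(W3, X)   # [0.439, 0.201], the slide's [0.44, 0.20]
```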
The Kohonen network
Step 1: Initialisation.
Set initial synaptic weights to small random values, say in an interval [0, 1], and assign a small positive value to the learning rate parameter α.
Step 2: Activation and Similarity Matching.
Activate the Kohonen network by applying the input vector X, and find the winner-takes-all (best matching) neuron jX at iteration p, using the minimum-distance Euclidean criterion

jX(p) = min over j of ||X − Wj(p)|| = [ Σ(i=1..n) (xi − wij(p))^2 ]^(1/2), j = 1, 2, ..., m

where n is the number of neurons in the input layer, and m is the number of neurons in the Kohonen layer.
The Kohonen network

Step 3: Learning.
Update the synaptic weights

wij(p + 1) = wij(p) + Δwij(p)

where Δwij(p) is the weight correction at iteration p, determined by the competitive learning rule:

Δwij(p) = α [ xi − wij(p) ], if j ∈ Λj(p)
Δwij(p) = 0, if j ∉ Λj(p)

where α is the learning rate parameter, and Λj(p) is the neighbourhood function centred around the winner-takes-all neuron jX at iteration p.
The Kohonen network
Step 4: Iteration.
Increase iteration p by one, go back to Step 2 and
continue until the minimum-distance Euclidean criterion is
satisfied, or no noticeable changes occur in the feature
map.
Adaptive Resonance Theory (ART)
Stability: system behaviour doesn't change after irrelevant events
Plasticity: System adapts its behaviour
according to significant events
Dilemma: how to achieve stability without
rigidity and plasticity without chaos?
Ongoing learning capability
Preservation of learned knowledge
ART Architecture
Bottom-up (normalised) weights bij: forward matching of the input
Top-down weights tij: store the class template
Input nodes: input normalisation and the vigilance test
Output nodes
Long-term memory: the ANN weights
Short-term memory: the ANN activation pattern
ART Algorithm
A new pattern goes through recognition, comparison, and categorisation: the incoming pattern is matched with the stored cluster templates.
If it is close enough to a stored template (known), it joins the best-matching cluster and the winner node's weights are adapted.
If not (unknown), an uncommitted node is initialised and a new cluster is created with the pattern as its template.
ART1 Architecture
(Figure: the input pattern feeds the input layer; a gain control module and a reset module interact with the input and output layers; the output layer produces the categorisation result.)
Reset Module
Fixed connection weights
Implements the vigilance test
Excitatory connection from F1(b)
Inhibitory connection from F1(a)
Output of reset module inhibitory to
output layer
Disables firing output node if match
with pattern is not close enough
Reset signal lasts as long as the pattern is present
Gain module
Fixed connection weights
Controls activation cycle of input
layer
Excitatory connection from input
lines
Inhibitory connection from output
layer
Output of gain module excitatory to
input layer
2/3 rule for input layer
Recognition Phase
Forward transmission via bottom-up weights
Input pattern matched with bottom-up weights (normalised template) of output nodes
Inner product x · bi
Best matching node fires (winner-take-all layer)
Similar to Kohonen's SOM algorithm: the pattern is associated with the closest matching template
ART1: fraction of bits of the template also in the input pattern
Comparison Phase
Backward transmission via top-down weights
Vigilance test: class template matched with input pattern
If the pattern is close enough to the template, categorisation was successful and resonance is achieved
If not close enough, reset the winner neuron and try the next best matching node
Repeat until the vigilance test is passed, or all committed neurons are exhausted
ART1 Algorithm
Step 0: Initialise parameters:

L > 1, 0 < ρ ≤ 1

and initialise weights:

0 < bij(0) < L / (L − 1 + n), tji(0) = 1

Step 1: While the stopping condition is false, do Steps 2-13.
Step 2: For each training input, do Steps 3-12.
Step 3: Set activations of all F2 units to zero. Set activations of F1(a) units to the input vector s.
ART1 Algorithm (cont.)
Step 4: Compute the norm of s:

||s|| = Σi si

Step 5: Send the input signal from F1(a) to the F1(b) layer:

xi = si

Step 6: For each F2 node that is not inhibited:
if yj ≠ −1, then yj = Σi bij xi

Step 7: While reset is true, do Steps 8-11.
ART1 Algorithm (cont.)
Step 8: Find J such that yJ ≥ yj for all nodes j.
If yJ = −1, then all nodes are inhibited and this pattern cannot be clustered.
Step 9: Recompute the activation x of F1(b):

xi = si tJi

Step 10: Compute the norm of vector x:

||x|| = Σi xi

Step 11: Test for reset:

If ||x|| / ||s|| < ρ, then set yJ = −1 (inhibit node J) and continue executing Step 7 again.
If ||x|| / ||s|| ≥ ρ, then proceed to Step 12.
ART1 Algorithm (cont.)
Step 12: Update the weights for node J (fast learning):

biJ(new) = L xi / (L − 1 + ||x||), tJi(new) = xi

Step 13: Test for the stopping condition.
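Steps 0-13 can be condensed into a short sketch. The parameter values (L = 2, ρ = 0.7), the uniform initial bottom-up weights, the fixed number of epochs standing in for the stopping condition, and the binary patterns are all assumptions for illustration:

```python
# Compact ART1 sketch: winner selection, vigilance test, fast learning.
def art1(patterns, n_clusters, rho=0.7, L=2.0):
    n = len(patterns[0])
    b = [[1.0 / (1.0 + n)] * n for _ in range(n_clusters)]  # bottom-up b_ij
    t = [[1] * n for _ in range(n_clusters)]                # top-down t_ji
    assignment = {}
    for _ in range(10):                      # fixed epochs as stopping rule
        for p, s in enumerate(patterns):
            inhibited = set()
            while True:
                # Step 6: activations y_j of the non-inhibited F2 nodes
                scores = {j: sum(bij * si for bij, si in zip(b[j], s))
                          for j in range(n_clusters) if j not in inhibited}
                if not scores:               # all nodes inhibited (Step 8)
                    assignment[p] = None
                    break
                J = max(scores, key=scores.get)              # winner J
                x = [si * tJi for si, tJi in zip(s, t[J])]   # Step 9
                if sum(x) / sum(s) >= rho:   # Step 11: vigilance test
                    # Step 12: fast learning
                    b[J] = [L * xi / (L - 1 + sum(x)) for xi in x]
                    t[J] = x
                    assignment[p] = J
                    break
                inhibited.add(J)             # reset node J, try next best

    return assignment

patterns = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 0, 1]]
clusters = art1(patterns, n_clusters=3)
```

With the vigilance value assumed here, the first two overlapping patterns end up in separate clusters, since only half the bits of the first template survive in the second pattern.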



Vigilance Threshold
The vigilance threshold ρ sets the granularity of clustering: it defines the amount of attraction of each prototype.
Low threshold (small ρ): large mismatches are accepted, so clustering is imprecise; few large clusters form and misclassifications are more likely.
High threshold (large ρ): only small mismatches are accepted, so clustering is fragmented into many small clusters, with higher precision.
