where the input of the activation function, called the activation n, is:
$$n = w_1 p_1 + \dots + w_R p_R + b = [\,w_1 \ \dots \ w_R\,]\begin{bmatrix} p_1 \\ \vdots \\ p_R \end{bmatrix} + b = \mathbf{W}\mathbf{p} + b,$$
or
$$n = \sum_{i=1}^{R} w_i p_i + b,$$
with $b \in \mathbb{R}$, $\mathbf{W} \in \mathbb{R}^{1 \times R}$, $\mathbf{p} \in \mathbb{R}^{R \times 1}$.
b can be treated as a supplementary weight w0 = b for the input p0 = 1 :
$$n = \sum_{i=0}^{R} w_i p_i = \tilde{\mathbf{W}}\tilde{\mathbf{p}}$$
notations: $\tilde{\mathbf{W}} = [\,w_0 \ w_1 \ \dots \ w_R\,]$ = extended weight vector, $\tilde{\mathbf{p}} = [\,1 \ p_1 \ \dots \ p_R\,]^T$ = extended input vector.
[Figure: neuron with inputs $p_1, \dots, p_R$, weights $w_1, \dots, w_R$, activation $n$ and output $y = f(n)$.]
Equivalently, with the bias placed last:
$$\tilde{\mathbf{W}} = [\,w_1 \ \dots \ w_R \ b\,], \quad \tilde{\mathbf{p}}^T = [\,p_1 \ \dots \ p_R \ 1\,], \quad n = \sum_{i=1}^{R+1} \tilde{w}_i \tilde{p}_i = \tilde{\mathbf{W}}\tilde{\mathbf{p}}.$$
Remark: comparison between the artificial neuron (AN) and the biological one (BN).
I. Deterministic functions
1) $y = f(n) = \begin{cases} 1, & n \ge 0 \\ 0, & n < 0 \end{cases}$ — hard limiter
[Plot: $f$ versus $\sum_{i=1}^{R} w_i p_i$; since $n = \sum_{i=1}^{R} w_i p_i + b$, the step from 0 to +1 occurs at $\sum_{i=1}^{R} w_i p_i = -b$.]
2) $y = f(n) = \begin{cases} +1, & n \ge 0 \\ -1, & n < 0 \end{cases}$ — symmetric hard limiter
[Plot: $f$ versus $\sum_{i=1}^{R} w_i p_i$; the step from $-1$ to $+1$ occurs at $-b$.]
3) $y = f(n) = \dfrac{1}{1 + \exp(-cn)}$, $c > 0$ — sigmoid. For $c = 1$:
[Plot: $f$ versus $\sum_{i=1}^{R} w_i p_i$; $f = 0.5$ at $-b$, saturating towards 0 and $+1$.]
[Figure: the sigmoid activation function $a = 1/(1 + \exp(-(wp + b)))$ plotted versus the input $p$ for the four sign combinations $(w = 2, b = 3)$, $(w = 2, b = -3)$, $(w = -2, b = 3)$, $(w = -2, b = -3)$. At the inflection point: $a = 0.5$, $p = -b/w$, and the tangent to the graph has slope $w/4$.]
4) $y = f(n) = \dfrac{1 - \exp(-2cn)}{1 + \exp(-2cn)}$, $c > 0$ — hyperbolic tangent. For $c = 1$:
[Plot: $f$ versus $\sum_{i=1}^{R} w_i p_i$; $f = 0$ at $-b$, saturating towards $-1$ and $+1$.]
5) $y = f(n) = \exp\!\left(-\dfrac{(n - c)^2}{2\sigma^2}\right)$ — Gaussian function
[Plot: bell-shaped $f$ versus $n$, with maximum $+1$ at $n = c$.]
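As a quick illustration, the five deterministic activation functions above can be written in a few lines of Python/NumPy (a minimal sketch; the function names and test values are chosen for this example only):

```python
import numpy as np

def hard_limiter(n):
    """1) hard limiter: 1 for n >= 0, 0 otherwise."""
    return np.where(n >= 0, 1.0, 0.0)

def symmetric_hard_limiter(n):
    """2) symmetric hard limiter: +1 for n >= 0, -1 otherwise."""
    return np.where(n >= 0, 1.0, -1.0)

def sigmoid(n, c=1.0):
    """3) sigmoid: 1 / (1 + exp(-c*n)), c > 0."""
    return 1.0 / (1.0 + np.exp(-c * n))

def hyperbolic_tangent(n, c=1.0):
    """4) hyperbolic tangent: (1 - exp(-2*c*n)) / (1 + exp(-2*c*n))."""
    return (1.0 - np.exp(-2 * c * n)) / (1.0 + np.exp(-2 * c * n))

def gaussian(n, center=0.0, sigma=1.0):
    """5) Gaussian: exp(-(n - center)^2 / (2*sigma^2))."""
    return np.exp(-((n - center) ** 2) / (2 * sigma ** 2))

n = np.linspace(-5, 5, 11)
print(sigmoid(n), hyperbolic_tangent(n), gaussian(n))
```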
II. 2. ANN architectures
The nodes are connected by links which ensure unidirectional and instant
communication.
The neurons are organized in layers; within a layer, the neurons are considered to work in parallel.
[Figure: general ANN with an input layer ($u_1, \dots, u_m$), hidden layers and an output layer ($y_1, \dots, y_k$). Legend: lateral links (between the nodes of the same layer); feedback links (from the output of a neuron to its input); backward links (to the neurons of the previous layers); feedforward links (to the neurons of the next layers).]
Remark:
- the input layer does not perform any processing; it is not counted;
- the hidden layers and the output layer include neurons.
Types of ANN
[Figure: single-layer feedforward ANN. The inputs $u_1, \dots, u_m$ are connected to the $k$ neurons of the output layer; neuron $k$ has weights $w_{k,1}, \dots, w_{k,m}$, bias $b_k$, activation $n_k$ and output $y_k = f_k(n_k)$.]
Let us assume that all activation functions are identical; then the layer can be described compactly by $\mathbf{y} = f(\tilde{\mathbf{W}}\tilde{\mathbf{u}})$, with $\tilde{\mathbf{u}}$ the extended input vector.
Remark:
$\tilde{\mathbf{W}} = [w_{i,j}]$, $i = 1, \dots, k$, $j = 1, \dots, m+1$;
for $w_{i,j}$: the first index indicates the neuron, the second index indicates the link.
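A minimal NumPy sketch of this single-layer computation, assuming the extended-matrix convention above (biases stored in the last column of the extended weight matrix; the function name and the numerical values are illustrative only):

```python
import numpy as np

def layer_output(W_ext, u, f=np.tanh):
    """Single-layer feedforward ANN: y = f(W_ext @ u_ext).

    W_ext : (k, m+1) extended weight matrix; column j = link j, last column = biases
    u     : (m,)     input vector
    """
    u_ext = np.append(u, 1.0)   # extended input [u_1 ... u_m 1]^T
    n = W_ext @ u_ext           # activations n_1, ..., n_k
    return f(n)                 # outputs y_1, ..., y_k

W_ext = np.array([[0.5, -1.0, 0.2],    # neuron 1: w_{1,1}, w_{1,2}, b_1
                  [1.5,  0.3, -0.4]])  # neuron 2: w_{2,1}, w_{2,2}, b_2
print(layer_output(W_ext, np.array([1.0, 2.0])))
```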
Simplified diagram for feedforward ANNs with 1 layer
[Simplified diagram: the inputs $u_1, \dots, u_m$ feed the blocks "Neuron 1", ..., "Neuron k", which produce the outputs $y_1, \dots, y_k$.]
Remark:
The layers can be:
- fully connected: all feedforward links are used;
- partially connected: some feedforward connections are missing.
Example 2: feedforward architectures with two layers
[Figure: detailed diagram of a two-layer feedforward ANN. Layer 1 contains $s$ neurons with weights $w^1_{i,j}$, biases $b^1_i$, activations $n^1_i$ and outputs $y^1_i = f^1_i(n^1_i)$, fed by the inputs $u_1, \dots, u_m$. Layer 2 contains $k$ neurons with weights $w^2_{i,j}$, biases $b^2_i$, activations $n^2_i$ and outputs $y^2_i = f^2_i(n^2_i)$, fed by the outputs of layer 1.]
[Simplified diagram: the inputs $u_1, \dots, u_m$ feed the blocks "Neuron 1, layer 1", ..., "Neuron s, layer 1"; their outputs $y^1_1, \dots, y^1_s$ feed the blocks "Neuron 1, layer 2", ..., "Neuron k, layer 2", which produce $y^2_1, \dots, y^2_k$.]
[Figure: architecture including unit-delay ($q^{-1}$) elements: the delayed outputs are fed back to the neurons together with the inputs $u_1, \dots, u_m$ (recurrent links).]
ANN parameters:
- for sigmoid/linear/step activation functions: weights and biases;
- for Gaussian activation functions: centers and spreads.
III. 3. Multi-layer Perceptron (MLP) - revision
III. 3. 1. MLP architecture
- all the neurons have sigmoidal activation function (linear, sigmoid, tanh)
[Figure: MLP with two layers. Layer 1: $s$ neurons with weights $w^1_{i,j}$, biases $b^1_i$, activations $v^1_i$ and outputs $y^1_i = f^1(v^1_i)$, fed by the inputs $u_1, \dots, u_m$. Layer 2: $k$ neurons with weights $w^2_{i,j}$, biases $b^2_i$, activations $v^2_i$ and outputs $y^2_i = f^2(v^2_i)$, fed by the outputs of layer 1.]
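The forward pass of the two-layer MLP sketched above can be written as follows (a minimal sketch assuming a sigmoid hidden layer and a linear output layer; all names and dimensions are illustrative):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def mlp_forward(u, W1, b1, W2, b2):
    """Two-layer MLP forward pass.

    W1 : (s, m) layer-1 weights, b1 : (s,) layer-1 biases
    W2 : (k, s) layer-2 weights, b2 : (k,) layer-2 biases
    """
    v1 = W1 @ u + b1        # layer-1 activations v^1_i
    y1 = sigmoid(v1)        # layer-1 outputs y^1_i = f^1(v^1_i)
    v2 = W2 @ y1 + b2       # layer-2 activations v^2_i
    y2 = v2                 # linear output layer: y^2_i = v^2_i
    return y1, y2

rng = np.random.default_rng(0)
m, s, k = 3, 4, 2
W1, b1 = rng.normal(size=(s, m)), np.zeros(s)
W2, b2 = rng.normal(size=(k, s)), np.zeros(k)
print(mlp_forward(rng.normal(size=m), W1, b1, W2, b2)[1])
```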
2. Batch learning
All the training samples $(\mathbf{u}_i, \mathbf{d}_i)$, $i = 1, \dots, N$, are presented within a single iteration (epoch).
$$I(n) = \frac{1}{2N}\sum_{j=1}^{N}\sum_{i=1}^{k} e_i^2(n, j),$$
with $e_i(n, j)$ = the error of the $i$th output neuron for the $j$th training sample.
$$w^l_{ij}(n+1) = w^l_{ij}(n) - \eta\,\frac{\partial I}{\partial w^l_{ij}}(n) = w^l_{ij}(n) + \Delta w^l_{ij}(n),$$
$\eta > 0$ influences the convergence speed.
Parameter variation
= learning rate ($\eta$) x local gradient ($\delta$) x input (corresponding to the link)
- For batch learning ($I(n) = \frac{1}{2N}\sum_{j=1}^{N}\sum_{i=1}^{k} e_i^2(n, j)$), where $e_i(n, j)$ = the error of the $i$th output neuron for the $j$th training sample presented at the $n$th epoch:
$$\Delta w^l_{ik}(n) = \frac{1}{N}\sum_{j=1}^{N} \Delta w^l_{ik}(n, j)$$
- the mean of the variations separately computed for each sample.
Backpropagation adaptation equations
k
For the sake of simplicity, online learning is considered I (n) = 0.5 ei 2 (n)
i =1
ei (n) = d i (n) yil (n) , the error produced by the output neuron i for a certain sample
s
vil (n) = wijl y lj1 , with wil0 = bil i y 0l 1 = 1 .
j =0
I I ei ei yil vil
( n) = ( n) (n) = ei (n) ( n) ( n) ( n)
wijl ei wijl yil vil wijl
I
l
(n) = ei (n) (1) f ' i (vil (n)) y lj1 (n) = il (n) y lj1 (n)
wij
I
wijl ( n) = il ( n) y lj1 ( n) with il = ei (n) f 'i (vil (n)) = (n) = local gradient
vijl
Parameter variation
= learning rate ( ) x local gradient ( ) x input (corresponding to the link)
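A minimal NumPy sketch of the output-layer update above, assuming sigmoid output neurons with $a = 1$ (so $f'(v) = f(v)[1 - f(v)]$); the function and variable names are illustrative:

```python
import numpy as np

def output_layer_update(d, y_out, y_prev, eta):
    """Online backpropagation update for the output layer l (sigmoid neurons, a = 1).

    d      : (k,) desired outputs d_i(n)     y_out : (k,) actual outputs y^l_i(n)
    y_prev : (s,) outputs y^{l-1}_j of the previous layer
    Returns (delta, dW, db), with dW[i, j] = eta * delta_i * y_prev[j].
    """
    e = d - y_out                        # e_i(n) = d_i(n) - y^l_i(n)
    f_prime = y_out * (1.0 - y_out)      # f'_i(v^l_i(n)) for the sigmoid, a = 1
    delta = e * f_prime                  # local gradients delta^l_i(n)
    dW = eta * np.outer(delta, y_prev)   # Delta w^l_{ij}(n) = eta * delta^l_i * y^{l-1}_j
    db = eta * delta                     # bias update (the bias sees the input y^{l-1}_0 = 1)
    return delta, dW, db
```

The weights are then updated as $w^l_{ij}(n+1) = w^l_{ij}(n) + \Delta w^l_{ij}(n)$.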
2. For the hidden layers
Considering the hidden layer l, the local gradients within the layers l +1, l +2, .. must be
available from previous computations.
The output of the neuron i belonging to layer l is an input for the neurons belonging to layer l + 1.
For the sake of simplicity: the layer l +1 is considered the output layer.
$y^{l+1}_z(n) = f_{l+1}(v^{l+1}_z(n))$, $z = 1, \dots, k$,
$v^{l+1}_z(n) = \sum_{j=0}^{s} w^{l+1}_{z,j}\, y^l_j(n)$,
$s$ = the number of input connections of neuron $z$ (the number of neurons within the previous layer, $l$), $w^{l+1}_{z,0} = b^{l+1}_z$, $y^l_0 = 1$.
$\delta^{l+1}_z$ is known for $z = 1, \dots, k$.
Layer $l$, with $s$ neurons:
$y^l_i(n) = f_l(v^l_i(n))$, $i = 1, \dots, s$,
$v^l_i(n) = \sum_{j=0}^{q} w^l_{i,j}\, y^{l-1}_j(n)$,
$q$ = the number of input connections of neuron $i$ (the number of neurons belonging to the previous layer, $l-1$).
$$\frac{\partial I}{\partial w^l_{i,j}}(n) = \sum_{z=1}^{k}\frac{\partial I}{\partial e^{l+1}_z}(n)\frac{\partial e^{l+1}_z}{\partial y^{l+1}_z}(n)\frac{\partial y^{l+1}_z}{\partial v^{l+1}_z}(n)\frac{\partial v^{l+1}_z}{\partial y^l_i}(n)\frac{\partial y^l_i}{\partial v^l_i}(n)\frac{\partial v^l_i}{\partial w^l_{i,j}}(n)$$
$$\frac{\partial I}{\partial w^l_{i,j}}(n) = \sum_{z=1}^{k} e^{l+1}_z(n)\cdot(-1)\cdot f'_{l+1}(v^{l+1}_z(n))\cdot w^{l+1}_{z,i}\cdot f'_l(v^l_i(n))\cdot y^{l-1}_j(n)$$
$$\frac{\partial I}{\partial w^l_{i,j}}(n) = -\Big[\sum_{z=1}^{k}\delta^{l+1}_z\, w^{l+1}_{z,i}\Big]\, f'_l(v^l_i(n))\, y^{l-1}_j(n) = -y^{l-1}_j(n)\sum_{z=1}^{k}\delta^{l+1}_z\, w^{l+1}_{z,i}\, f'_l(v^l_i(n)).$$
$$\Delta w^l_{i,j}(n) = \eta\, y^{l-1}_j(n)\, \delta^l_i(n), \quad \text{with } \delta^l_i(n) = \Big[\sum_{z=1}^{k}\delta^{l+1}_z\, w^{l+1}_{z,i}\Big]\, f'_l(v^l_i(n)) = \text{local gradient}$$
Parameter variation
= learning rate ($\eta$) x local gradient ($\delta$) x input (corresponding to the link)
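The corresponding hidden-layer update can be sketched in the same way (again assuming sigmoid neurons with $a = 1$; delta_next and W_next come from the already-processed layer $l + 1$; names are illustrative):

```python
import numpy as np

def hidden_layer_update(delta_next, W_next, y_hidden, y_prev, eta):
    """Online backpropagation update for a hidden layer l (sigmoid neurons, a = 1).

    delta_next : (k,)   local gradients delta^{l+1}_z of layer l+1
    W_next     : (k, s) weights w^{l+1}_{z,i} from layer l to layer l+1 (biases excluded)
    y_hidden   : (s,)   outputs y^l_i of layer l
    y_prev     : (q,)   outputs y^{l-1}_j feeding layer l
    """
    f_prime = y_hidden * (1.0 - y_hidden)       # f'_l(v^l_i(n)) for the sigmoid, a = 1
    delta = (W_next.T @ delta_next) * f_prime   # delta^l_i = [sum_z delta^{l+1}_z w^{l+1}_{z,i}] f'_l(v^l_i)
    dW = eta * np.outer(delta, y_prev)          # Delta w^l_{i,j}(n) = eta * delta^l_i * y^{l-1}_j
    db = eta * delta                            # bias update (input y^{l-1}_0 = 1)
    return delta, dW, db
```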
Remarks
o Sigmoid: $f(v) = \dfrac{1}{1 + \exp(-av)}$, $a > 0$; $\quad f'(v) = \dfrac{a\exp(-av)}{[1 + \exp(-av)]^2} = a\, f(v)\,[1 - f(v)]$
o Hyperbolic tangent: $f(v) = a\tanh(bv)$, $a, b > 0$; $\quad f'(v) = \dfrac{b}{a}\,[a - f(v)]\,[a + f(v)]$
2) Learning rate $\eta > 0$
- For small values: low convergence speed; a quite smooth trajectory is followed within the
search space
- For large values: risk of unstable behavior
Improvements:
$$\Delta w^l_{ij}(n) = -\eta\Big[\alpha^n \frac{\partial I}{\partial w^l_{ij}}(0) + \alpha^{n-1}\frac{\partial I}{\partial w^l_{ij}}(1) + \dots + \frac{\partial I}{\partial w^l_{ij}}(n)\Big], \quad \alpha \in [0, 1].$$
- when $\frac{\partial I}{\partial w^l_{ij}}(t)$ keeps the same sign at successive iterations, the absolute value of $\Delta w^l_{ij}$ increases;
- when $\frac{\partial I}{\partial w^l_{ij}}(t)$ changes the sign at successive iterations, the absolute value of $\Delta w^l_{ij}$ decreases.
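The sum above is usually applied in its equivalent recursive form, $\Delta w(n) = \alpha\,\Delta w(n-1) - \eta\,\partial I/\partial w(n)$; a minimal sketch (names are illustrative):

```python
def momentum_update(grad, prev_dW, eta, alpha):
    """Weight variation with momentum: Delta w(n) = alpha * Delta w(n-1) - eta * dI/dw(n).

    grad    : current gradient dI/dw (scalar or array of the same shape as the weights)
    prev_dW : previous variation Delta w(n-1)
    """
    dW = alpha * prev_dW - eta * grad
    return dW   # apply as w = w + dW and keep dW as prev_dW for the next iteration
```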
3) Online or batch learning?
- For online learning: the training samples must be randomly presented to avoid cycling.
Online learning - recommendations for stopping the training:
o The norm of the gradient becomes close to 0. Disadvantage: numerous epochs can be involved.
o The variation of the criterion $I$ becomes insignificant. Disadvantage: premature stop.
Weight initialization: use uniformly distributed or normally distributed values (mean 0, with a spread chosen to avoid the saturation of the neurons).
6) Efficient exploitation of the training samples
- For online learning, the successive samples should be very different ones. When the training samples are randomly presented, this condition is frequently met.
- Outliers can impede the convergence and can lead to bad generalization capabilities.
- Prefer odd (antisymmetric) activation functions: $f(-v) = -f(v)$.
- Usually, the gradients in the output layer are larger, so $\eta$ should be smaller for the output neurons.
If N is too large or the ANN architecture is too complex, the model becomes overfitted.
1. Function approximations
Theorem:
Any continuous bounded function can be approximated with any desired degree of accuracy $\varepsilon > 0$ by means of an MLP containing:
- a hidden layer with $m$ neurons, characterized by continuous, bounded, monotonic activation functions;
- an output layer with a linear neuron (or a sigmoidal neuron working within its linear region):
$$F(\mathbf{u}) = \sum_{i=1}^{m} \alpha_i\, f\Big(\sum_{j=1}^{R} w_{ij} u_j + b_i\Big),$$
$m$ = the number of hidden neurons, $R$ = the number of inputs, $\alpha_i$ = the output-layer weights.
Remarks regarding the content of this theorem:
- the existence of the MLP is guaranteed;
- the theorem does not give any indication concerning the resulting generalization capacity of the model or the time required for learning;
- the optimal structure is not given;
- the value of m:
o if m is small, the risk of learning the noise captured by the training samples is reduced;
o if m is large, a good accuracy can be obtained;
For ANNs with 2 hidden layers: hidden layer 1 extracts the local properties; hidden layer 2 extracts the global properties.
III. 4. ANN with Radial basis functions - RBF
[Figure: RBF neuron with inputs $p_1, \dots, p_R$, center components $c_1, \dots, c_R$, activation $n$ and output $y = f(n)$.]
$$y = f(\|\mathbf{p} - \mathbf{c}\|) = f\Big(\sqrt{(p_1 - c_1)^2 + \dots + (p_R - c_R)^2}\Big), \quad \mathbf{p} = \begin{bmatrix} p_1 \\ \vdots \\ p_R \end{bmatrix}, \ \mathbf{c} = \begin{bmatrix} c_1 \\ \vdots \\ c_R \end{bmatrix}$$
See demorb1
The neuron is activated only if the input (vector) is similar to the center (vector).
o The accepted similarity level is given by the spread $\sigma$.
o If $\sigma$ is large, the neuron is activated even for a reduced similarity between the input and the center.
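A minimal sketch of a Gaussian RBF neuron illustrating this behavior (the names and the numerical values are chosen for this example only):

```python
import numpy as np

def rbf_neuron(p, c, sigma):
    """Gaussian RBF neuron: y = exp(-||p - c||^2 / (2*sigma^2))."""
    return np.exp(-np.sum((p - c) ** 2) / (2 * sigma ** 2))

c = np.array([1.0, 1.0])
print(rbf_neuron(np.array([1.0, 1.1]), c, sigma=0.5))   # input close to the center -> output close to 1
print(rbf_neuron(np.array([4.0, -3.0]), c, sigma=0.5))  # very dissimilar input -> output close to 0
print(rbf_neuron(np.array([4.0, -3.0]), c, sigma=5.0))  # large spread -> the neuron is activated even here
```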
For inputs which are very dissimilar to the center, the RBF neuron is inactive.
[Figure, for comparison: standard neuron with inputs $p_1, \dots, p_R$, weights $w_1, \dots, w_R$, bias $b$, activation $n$ and output $y = f(n)$.]
For the standard neuron, the input $p_j$ has a more significant influence on the activation of the neuron if the absolute value of $p_j w_j$ is larger.
III. 4.2. RBF architecture
- Because a single hidden layer is considered, the upper index will be dropped for most of the notations (it is kept only to distinguish between the linear and the radial basis activation functions).
[Figure: RBF network. The inputs $u_1, \dots, u_m$ feed $s$ hidden RBF neurons with centers $\mathbf{c}_i = [c_{i1} \ \dots \ c_{im}]^T$, activations $n_i$ and outputs $y_i = f_1(n_i)$; the hidden outputs are combined by a linear output neuron with weights $w_1, \dots, w_s$ and bias $b$: $y = f_2(n)$.]
$$y = f_2(w_1 y_1 + \dots + w_s y_s + b) = w_1 y_1 + \dots + w_s y_s + b = [\,w_1 \ \dots \ w_s\,]\begin{bmatrix} y_1 \\ \vdots \\ y_s \end{bmatrix} + b = \sum_{i=1}^{s} w_i y_i + b$$
Let us consider $\mathbf{f}: \mathbb{R}^m \to \mathbb{R}^s$, with $s$ large, where $\mathbf{f}(\mathbf{u}) = [f_1(\mathbf{u}) \ \dots \ f_s(\mathbf{u})]^T$, $f_1, \dots, f_s: \mathbb{R}^m \to \mathbb{R}$
(e.g., $f_1, \dots, f_s$ indicate the mappings provided by the $s$ hidden neurons).
Definition
The classes $C_1, C_2$ are $\mathbf{f}$-separable if there exists $\mathbf{w} = [w_1 \ \dots \ w_s]^T \in \mathbb{R}^s$ with $\mathbf{w}^T\mathbf{f}(\mathbf{u}) > 0$ for $\mathbf{u} \in C_1$ and $\mathbf{w}^T\mathbf{f}(\mathbf{u}) < 0$ for $\mathbf{u} \in C_2$.
Remarks:
- according to Cover's theorem: choose a large value of $s$ and non-linear $f_1, \dots, f_s$;
- the hyper-plane delimiting the classes (in the $\mathbf{f}$-space) is given by $\mathbf{w}^T\mathbf{f}(\mathbf{u}) = 0$.
For function approximations (interpolation)
Find $F: \mathbb{R}^m \to \mathbb{R}$ that fits the pairs $(\mathbf{u}(i), d(i))$, $i = 1, \dots, N$, i.e., $F(\mathbf{u}(i)) = d(i)$.
a) $f_i(\mathbf{u}) = \sqrt{\|\mathbf{u} - \mathbf{c}_i\|^2 + q_i^2}$, $q_i > 0$: non-local, unbounded;
b) $f_i(\mathbf{u}) = \dfrac{1}{\sqrt{\|\mathbf{u} - \mathbf{c}_i\|^2 + q_i^2}}$, $q_i > 0$: local, bounded;
c) $f_i(\mathbf{u}) = \exp\Big(-\dfrac{\|\mathbf{u} - \mathbf{c}_i\|^2}{2\sigma_i^2}\Big)$: local, bounded.
Knowing that $d(i) = F(\mathbf{u}(i))$, it results the linear system below. Let us consider:
$$\Phi = \begin{bmatrix} f_1(\mathbf{u}(1)) & \dots & f_N(\mathbf{u}(1)) \\ \vdots & \ddots & \vdots \\ f_1(\mathbf{u}(N)) & \dots & f_N(\mathbf{u}(N)) \end{bmatrix} = \text{interpolation matrix}.$$
$$\Phi\begin{bmatrix} w_1 \\ \vdots \\ w_N \end{bmatrix} = \begin{bmatrix} d(1) \\ \vdots \\ d(N) \end{bmatrix} \ \Rightarrow\ \begin{bmatrix} w_1 \\ \vdots \\ w_N \end{bmatrix} = \Phi^{-1}\begin{bmatrix} d(1) \\ \vdots \\ d(N) \end{bmatrix} \quad \text{- if } \Phi \text{ is nonsingular.}$$
Micchelli's theorem (1986)
If the $f_i$ are radial and the samples $\mathbf{u}(i) \in \mathbb{R}^m$ are distinct, then $\Phi$ is nonsingular.
Remarks:
o For $f_i$ of types b) and c), $\Phi$ is positive definite.
o For $f_i$ of type a), $\Phi$ admits $N - 1$ negative eigenvalues and one positive eigenvalue.
Remarks:
- large N (many samples) → many radial basis functions → complex model (over-fitting);
- large N → risk of a poorly conditioned interpolation matrix and large execution times.
It is desirable to use fewer radial basis functions than training samples.
Instead of $F(\mathbf{u}) = \sum_{i=1}^{N} w_i f_i(\|\mathbf{u} - \mathbf{u}(i)\|)$, one has to consider $F(\mathbf{u}) = b + \sum_{i=1}^{s} w_i f_i(\|\mathbf{u} - \mathbf{c}_i\|)$:
- The centers of the radial basis functions and the input training samples are different.
Therefore, it results:
$$G\begin{bmatrix} w_1 \\ \vdots \\ w_s \\ b \end{bmatrix} = \begin{bmatrix} d(1) \\ \vdots \\ d(N) \end{bmatrix} \ \Rightarrow\ \begin{bmatrix} w_1 \\ \vdots \\ w_s \\ b \end{bmatrix} = G^{+}\begin{bmatrix} d(1) \\ \vdots \\ d(N) \end{bmatrix}, \quad G^{+} = (G^T G)^{-1} G^T$$
Example: XOR problem
Classify the samples: $\mathbf{u}(1) = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \in C_1$, $\mathbf{u}(2) = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \in C_2$, $\mathbf{u}(3) = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \in C_2$, $\mathbf{u}(4) = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \in C_1$.
Let us consider:
$$f_1(\mathbf{u}) = \exp\Big(-\Big\|\mathbf{u} - \begin{bmatrix} 1 \\ 1 \end{bmatrix}\Big\|^2\Big) = \exp(-(u_1 - 1)^2 - (u_2 - 1)^2),$$
$$f_2(\mathbf{u}) = \exp\Big(-\Big\|\mathbf{u} - \begin{bmatrix} 0 \\ 0 \end{bmatrix}\Big\|^2\Big) = \exp(-u_1^2 - u_2^2).$$
Therefore, it results:
- $G^{+}$ (given by the MATLAB function pinv):
$$G^{+} = \begin{bmatrix} 1.7942 & -1.2195 & -1.2195 & 0.6448 \\ 0.6448 & -1.2195 & -1.2195 & 1.7942 \\ -0.8780 & 1.3780 & 1.3780 & -0.8780 \end{bmatrix}$$
$$\begin{bmatrix} w_1 \\ w_2 \\ b \end{bmatrix} = G^{+}\begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} -2.439 \\ -2.439 \\ 2.7561 \end{bmatrix}$$
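The same computation can be carried out with a short NumPy sketch (np.linalg.pinv plays the role of the MATLAB pinv; the variable names are chosen for this example only, and small numerical differences with respect to the rounded values above are possible):

```python
import numpy as np

# XOR samples u(1)..u(4) and desired outputs (class C1 -> 0, class C2 -> 1)
U = np.array([[1.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 0.0]])
d = np.array([0.0, 1.0, 1.0, 0.0])

# Gaussian basis functions centred at [1, 1] and [0, 0]
f1 = np.exp(-np.sum((U - np.array([1.0, 1.0])) ** 2, axis=1))
f2 = np.exp(-np.sum((U - np.array([0.0, 0.0])) ** 2, axis=1))

G = np.column_stack([f1, f2, np.ones(4)])   # N x (s+1) matrix [f1(u(i)) f2(u(i)) 1]
w = np.linalg.pinv(G) @ d                   # [w1, w2, b] given by the pseudo-inverse
print(w)                                    # output weights and bias
print(G @ w)                                # network outputs at the 4 samples: [0, 1, 1, 0]
```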
Theorem:
Any continuous function can be approximated (with any desired accuracy) by
$$F(\mathbf{u}) = b + \sum_{i=1}^{s} w_i\, f\Big(\frac{\mathbf{u} - \mathbf{c}_i}{\sigma}\Big), \quad \sigma > 0,$$
if $f: \mathbb{R}^m \to \mathbb{R}$ is bounded and $\int_{\mathbb{R}^m} |f(\mathbf{u})|\, d\mathbf{u} < \infty$.
- The requirements imposed by this theorem are met for the radial functions b), c).
- The radial basis function a) can be used for $s = N$ only.
- $f$ is not necessarily symmetric!
Remark:
the ANN with hidden radial basis activation functions and a linear output neuron is
compliant with the requirements of the previous theorem.
Step 2: assuming the centers and the spread are known, compute the output weights:
$$\begin{bmatrix} w_1 \\ \vdots \\ w_s \\ b \end{bmatrix} = G^{+}\begin{bmatrix} d(1) \\ \vdots \\ d(N) \end{bmatrix}, \quad \text{with } G = \begin{bmatrix} f_1(\mathbf{u}(1)) & \dots & f_s(\mathbf{u}(1)) & 1 \\ \vdots & \ddots & \vdots & \vdots \\ f_1(\mathbf{u}(N)) & \dots & f_s(\mathbf{u}(N)) & 1 \end{bmatrix} \in \mathbb{R}^{N \times (s+1)}$$
If the centers are known, the output weights can be computed in a single
step.
- generalization = interpolation
For Gaussian functions:
$$F(\mathbf{u}) = b + \sum_{i=1}^{s} w_i \exp\Big(-\frac{\|\mathbf{u} - \mathbf{c}_i\|^2}{2\sigma^2}\Big),$$
when the same spread is employed for all the hidden neurons.
Recommendation: choose s = 3 N
Comparison between MLP and RBF
[Comparison table: MLP vs. RBF]
2. Self-organized (unsupervised) selection of the centers
Step 1-1: For the training sample $\mathbf{u}(n)$, compute $\|\mathbf{u}(n) - \mathbf{c}_i\|$, $i = 1, \dots, s$, and find the minimum distance, which indicates the nearest center for this sample. Consider $i$, with $i \in \{1, \dots, s\}$, the index of the nearest center.
Step 1-2: Update the nearest center, moving it toward the sample:
$$\mathbf{c}_i(n+1) = \mathbf{c}_i(n) + \eta\,[\mathbf{u}(n) - \mathbf{c}_i(n)], \quad \text{with } 1 > \eta > 0$$
Step 1-3: $n \leftarrow n + 1$
Step 1-4: If some training samples have not been used yet or the change made at step 1-2 is too large, go to step 1-1.
Drawback: the result depends on the initial values
Step 2. Compute the spread
$$\sigma = \frac{d_{\max}}{\sqrt{2s}},$$
with
$d_{\max}$ = the maximum distance between the selected centers,
$s$ = the number of hidden neurons.
Step 3. Compute the weights and the bias.
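A compact sketch of this center-selection procedure, assuming the update rule and the spread formula reconstructed above (the function name, the initialization and the stopping rule by a fixed number of epochs are illustrative simplifications):

```python
import numpy as np

def select_centers(U, s, eta=0.1, epochs=20, seed=0):
    """Self-organized selection of the RBF centers (steps 1-1 .. 1-4) and of the spread (step 2).

    U : (N, m) training inputs; s : number of hidden neurons.
    """
    rng = np.random.default_rng(seed)
    centers = U[rng.choice(len(U), size=s, replace=False)].astype(float)  # initial centers
    for _ in range(epochs):
        for u in U[rng.permutation(len(U))]:                     # random presentation order
            i = np.argmin(np.linalg.norm(u - centers, axis=1))   # step 1-1: nearest center
            centers[i] += eta * (u - centers[i])                 # step 1-2: move it toward the sample
    d_max = max(np.linalg.norm(ci - cj) for ci in centers for cj in centers)
    sigma = d_max / np.sqrt(2 * s)                               # step 2: spread
    return centers, sigma                                        # step 3: weights/bias via the pseudo-inverse
```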
3. Supervised centers selection
LMS algorithm:
- for weights:
$$w_i \leftarrow w_i - \eta_1 \frac{\partial I}{\partial w_i}, \quad \text{with}$$
$$\frac{\partial I}{\partial w_i} = -\sum_{j=1}^{N} e(j)\, f\Big(\frac{\|\mathbf{u}(j) - \mathbf{c}_i\|}{\sigma_i}\Big) = -\sum_{j=1}^{N} e(j)\exp\Big(-\frac{\|\mathbf{u}(j) - \mathbf{c}_i\|^2}{2\sigma_i^2}\Big)$$
- for centers:
$$\mathbf{c}_i \leftarrow \mathbf{c}_i - \eta_2 \frac{\partial I}{\partial \mathbf{c}_i}, \quad \text{with}$$
$$\frac{\partial I}{\partial (\mathbf{c}_i)_k} = -\frac{w_i}{\sigma_i^2}\sum_{j=1}^{N} e(j)\exp\Big(-\frac{\|\mathbf{u}(j) - \mathbf{c}_i\|^2}{2\sigma_i^2}\Big)\,[\mathbf{u}(j)_k - (\mathbf{c}_i)_k],$$
with $(\mathbf{c}_i)_k$, $\mathbf{u}(j)_k$ indicating the $k$th component of the vectors $\mathbf{c}_i$, $i = 1, \dots, s$, and $\mathbf{u}(j)$, $j = 1, \dots, N$ (having the length $m$)
- for spreads:
$$\sigma_i \leftarrow \sigma_i - \eta_3 \frac{\partial I}{\partial \sigma_i}, \quad \text{with}$$
$$\frac{\partial I}{\partial \sigma_i} = -\sum_{j=1}^{N} e(j)\,\frac{\partial f}{\partial \sigma_i}\Big(\frac{\|\mathbf{u}(j) - \mathbf{c}_i\|}{\sigma_i}\Big) = -\frac{w_i}{\sigma_i^3}\sum_{j=1}^{N} e(j)\exp\Big(-\frac{\|\mathbf{u}(j) - \mathbf{c}_i\|^2}{2\sigma_i^2}\Big)\,\|\mathbf{u}(j) - \mathbf{c}_i\|^2$$
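A vectorized sketch of one batch gradient step, assuming the Gaussian basis and the derivatives above (the function name is illustrative, and the bias update, which is not given explicitly in the notes, is added here for completeness):

```python
import numpy as np

def rbf_gradient_step(U, d, w, b, C, sigmas, eta1, eta2, eta3):
    """One batch gradient step for the supervised adjustment of the RBF parameters.

    U : (N, m) inputs, d : (N,) desired outputs, w : (s,) output weights,
    b : bias, C : (s, m) centers, sigmas : (s,) spreads.
    """
    diff = U[:, None, :] - C[None, :, :]                    # u(j) - c_i, shape (N, s, m)
    dist2 = np.sum(diff ** 2, axis=2)                       # ||u(j) - c_i||^2, shape (N, s)
    Phi = np.exp(-dist2 / (2 * sigmas ** 2))                # hidden outputs, shape (N, s)
    e = d - (Phi @ w + b)                                   # errors e(j)

    dI_dw = -Phi.T @ e                                      # dI/dw_i
    dI_dC = -(w / sigmas ** 2)[:, None] * np.einsum('j,ji,jik->ik', e, Phi, diff)
    dI_dsig = -(w / sigmas ** 3) * ((Phi * dist2).T @ e)    # dI/dsigma_i
    dI_db = -np.sum(e)                                      # bias treated as a weight with constant input 1

    return (w - eta1 * dI_dw, b - eta1 * dI_db,
            C - eta2 * dI_dC, sigmas - eta3 * dI_dsig)
```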
4. Constructive algorithm
- the hidden neurons are inserted sequentially; the center vector of each new neuron copies the input training sample that produces the highest output squared error for the current architecture.