
II.

Artificial Neural Networks

II.1. Artificial neuron


= the basic computational unit of artificial neural networks (ANN)

(Diagram: a single neuron with inputs $p_1, \dots, p_R$, weighted links $w_1, \dots, w_R$, bias $b$, a summing block producing the activation $n$, and an activation function $f$ giving the output $y = f(n)$.)

- $p_1, \dots, p_R$ - neuron inputs
- $w_1, w_2, \dots, w_R$ - weights of the incoming links
- $b$ - bias
- $f: \mathbb{R} \to \mathbb{R}$, $y = f(n)$ - activation function, usually nonlinear
Perceptron
Components:
- Synapses or links characterized by weights (also called strengths)
- Summing block
- Activation function (usually nonlinear)

Input-output mapping: static model

The output of the model is computed as $y = f(n)$,

where the argument of the activation function, called the activation $n$, is:

$$n = w_1 p_1 + \dots + w_R p_R + b = [w_1 \;\dots\; w_R]\begin{bmatrix} p_1 \\ \vdots \\ p_R \end{bmatrix} + b = Wp + b,$$

or

$$n = \sum_{i=1}^{R} w_i p_i + b,$$

with $b \in \mathbb{R}$, $W \in \mathbb{R}^{1 \times R}$, $p \in \mathbb{R}^{R \times 1}$.
$b$ can be treated as a supplementary weight $w_0 = b$ associated with a constant input $p_0 = 1$:

$$n = \sum_{i=0}^{R} w_i p_i = \tilde{W}\tilde{p}$$

Notations:
$\tilde{W} = [w_0 \; w_1 \; \cdots \; w_R]$ - extended weight (row) vector,
$\tilde{p} = [1 \; p_1 \; \cdots \; p_R]^T$ - extended input (column) vector.

So, the diagram can be redrawn:

(Diagram: the same neuron, with an additional input $p_0 = 1$ weighted by $w_0 = b$ entering the summing block together with $p_1, \dots, p_R$; the output is $y = f(n)$.)

Usually: $y \in [0, 1]$ or $y \in [-1, 1]$.

Another alternative for the extended weight vector:

$$\tilde{W} = [w_1 \; \dots \; w_R \; b], \quad \tilde{p}^T = [p_1 \; \dots \; p_R \; 1], \quad n = \sum_{i=1}^{R+1} \tilde{w}_i \tilde{p}_i = \tilde{W}\tilde{p}.$$
Remarks - comparison between the artificial neuron (AN) and the biological neuron (BN):

1) AN admits negative weights !!! (unlike BN).


positive weights model an excitatory effect;
negative or null weights model an inhibitory effect.

2) time constants: BN ≈ 1 ms, AN ≈ 1 ns

(hence fewer links and fewer neurons are needed for an ANN)

3) energetic efficiency: BN ≈ $10^{-16}$ J/s per operation, AN ≈ $10^{-6}$ J/s per operation.

4) BNN works asynchronously, without a master clock (continuous time domain).

5) BNN involves random connectivity; ANN uses specified connectivity (usually).

6) BNN is tolerant to errors (fault tolerant).


Typical activation functions:

I. Deterministic functions
1) Hard limiter:

$$y = f(n) = \begin{cases} 1, & n \ge 0 \\ 0, & n < 0 \end{cases}, \quad \text{with } n_n = \sum_{i=1}^{R} w_i p_i \text{ (the weighted sum without the bias)}$$

(Plots: the hard limiter and the symmetric hard limiter as functions of $n_n$; both switch at $n_n = -b$, the symmetric version taking the values $-1$ and $+1$ instead of $0$ and $+1$.)


2) Linear:

$$y = f(n) = n, \quad \text{with } n_n = \sum_{i=1}^{R} w_i p_i$$

(Plot: a straight line in the $(n_n, f)$ plane, shifted by the bias.)
3) Sigmoid:

$$y = f(n) = \frac{1}{1 + \exp(-cn)}, \quad c > 0$$

For $c = 1$: (Plot: an S-shaped curve versus $n_n$, saturating at $0$ and $+1$ and passing through $0.5$ at $n_n = -b$.)

with $n_n = \sum_{i=1}^{R} w_i p_i$
(Figure: SIGMOID ACTIVATION FUNCTION - four panels plotting the output $a$ versus the input for $w=2, b=3$; $w=2, b=-3$; $w=-2, b=3$; $w=-2, b=-3$. At the inflection point $a = 0.5$, the input is $p = -b/w$ and the tangent to the graph has slope $w/4$.)
4) Hyperbolic tangent:

$$y = f(n) = \frac{1 - \exp(-2cn)}{1 + \exp(-2cn)} = \tanh(cn), \quad c > 0$$

For $c = 1$: (Plot: an S-shaped curve versus $n_n$, saturating at $-1$ and $+1$ and crossing zero at $n_n = -b$.)

with $n_n = \sum_{i=1}^{R} w_i p_i$

5) Gaussian function:

$$y = f(n) = \exp\!\left(-\frac{(n - c)^2}{2\sigma^2}\right)$$

>> see RBF (a different input-output mapping)

(Plot: a bell-shaped curve with maximum $+1$ at $n = c$.)
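For quick experimentation, a minimal NumPy sketch of these activation functions (the function names are illustrative, not part of the notes):

import numpy as np

def hard_limiter(n):
    # 1 for n >= 0, 0 otherwise
    return np.where(n >= 0, 1.0, 0.0)

def symmetric_hard_limiter(n):
    return np.where(n >= 0, 1.0, -1.0)

def linear(n):
    return n

def sigmoid(n, c=1.0):
    # 1 / (1 + exp(-c*n)), c > 0
    return 1.0 / (1.0 + np.exp(-c * n))

def hyperbolic_tangent(n, c=1.0):
    # (1 - exp(-2cn)) / (1 + exp(-2cn)) = tanh(c*n)
    return np.tanh(c * n)

def gaussian(n, c=0.0, sigma=1.0):
    # exp(-(n - c)^2 / (2*sigma^2))
    return np.exp(-(n - c) ** 2 / (2.0 * sigma ** 2))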
II. 2. ANN architectures

- The ANN structure allows parallel and distributed processing.

- Each ANN can be represented by a directed graph.

- The nodes of the graph correspond to neurons.

- The nodes are connected by links which ensure unidirectional and instantaneous communication.

- A processing unit (neuron) admits any number of input links.

- A processing unit has local memory.

- A processing unit can be modeled in terms of an input-output formalism.

The neurons are organized in layers; within a layer, the neurons are considered to work in parallel.
(Diagram: a layered ANN with inputs $u_1, \dots, u_m$ (input layer), one or more hidden layers, and outputs $y_1, \dots, y_k$ (output layer). Legend: lateral links - between the nodes of the same layer; feedback links - from the output of a neuron to its own input; backward links - to the neurons of the previous layers; feedforward links - to the neurons of the next layers.)

Remark:
- the input layer does not perform any processing; it is not counted;
- the hidden layers and the output layer consist of neurons.
Types of ANN:

- feed-forward: contain feed-forward links only;

- dynamic/recurrent: contain at least one lateral, backward or feedback link.

Example 1: Feed-forward architectures with one layer:

Only feed-forward links!!!

Input layer: m inputs, no processing

Output layer: k neurons, characterized by:

$$y_l = f_l(w_{l,1} u_1 + \dots + w_{l,m} u_m + b_l), \quad l = \overline{1,k}$$
(Diagram: one-layer feedforward architecture. Each input $u_j$, $j = \overline{1,m}$, is connected to every output neuron $l$, $l = \overline{1,k}$, through the weight $w_{l,j}$; neuron $l$ adds the bias $b_l$, forms the activation $n_l$ and applies $f_l$ to produce $y_l$. Input layer - m inputs; output layer - k neurons.)
Let us assume that all activation functions are identical.

One can write:


$$y = \begin{bmatrix} y_1 \\ \vdots \\ y_l \\ \vdots \\ y_k \end{bmatrix} = \begin{bmatrix} f(w_{1,1}u_1 + \dots + w_{1,m}u_m + b_1) \\ \vdots \\ f(w_{l,1}u_1 + \dots + w_{l,m}u_m + b_l) \\ \vdots \\ f(w_{k,1}u_1 + \dots + w_{k,m}u_m + b_k) \end{bmatrix} = f(\tilde{W}\tilde{u})$$

with

$$\tilde{W} = \begin{bmatrix} w_{1,1} & \dots & w_{1,m} & b_1 \\ \vdots & & \vdots & \vdots \\ w_{l,1} & \dots & w_{l,m} & b_l \\ \vdots & & \vdots & \vdots \\ w_{k,1} & \dots & w_{k,m} & b_k \end{bmatrix} = \text{extended weight matrix (row } l \text{ holds the extended weights of neuron } l\text{)},$$

$$\tilde{u} = \begin{bmatrix} u_1 \\ \vdots \\ u_m \\ 1 \end{bmatrix} = \text{extended input vector}, \qquad y = \begin{bmatrix} y_1 \\ \vdots \\ y_k \end{bmatrix} = \text{output vector}.$$

Remark:
$\tilde{W} = [w_{i,j}]$, $i = \overline{1,k}$, $j = \overline{1,m+1}$,
where for $w_{i,j}$:
- the first index indicates the neuron,
- the second index indicates the link.
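A minimal NumPy sketch of this one-layer mapping $y = f(\tilde{W}\tilde{u})$ (the helper name and the choice of a sigmoid activation are illustrative assumptions):

import numpy as np

def one_layer_forward(W_ext, u, f=lambda n: 1.0 / (1.0 + np.exp(-n))):
    # W_ext: (k, m+1) extended weight matrix, last column = biases
    # u:     (m,) input vector
    u_ext = np.append(u, 1.0)       # extended input [u1 ... um 1]
    n = W_ext @ u_ext               # activations of the k neurons
    return f(n)                     # outputs y1 ... yk

# example: k = 2 neurons, m = 3 inputs
W_ext = np.array([[0.5, -1.0, 0.2, 0.1],
                  [1.5,  0.3, -0.7, -0.2]])
u = np.array([1.0, 0.0, 2.0])
print(one_layer_forward(W_ext, u))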
Simplified diagram for feedforward ANNs with 1 layer

(Simplified diagram: the inputs $u_1, \dots, u_m$ feed a block of k neurons (the output layer), which produces $y_1, \dots, y_k$.)

Remark:
The layers can be:
- fully connected: all feedforward links are used;
- partially connected: some feedforward connections are missing.
Example 2: Feed-forward architectures with two layers

(Diagram: the inputs $u_1, \dots, u_m$ feed the hidden layer (layer 1) with s neurons; neuron i of layer 1 uses the weights $w^1_{i,j}$, the bias $b^1_i$ and the activation function $f^1$ to produce $y^1_i$. The hidden outputs $y^1_1, \dots, y^1_s$ feed the output layer (layer 2) with k neurons; neuron l of layer 2 uses the weights $w^2_{l,i}$, the bias $b^2_l$ and the activation function $f^2$ to produce $y^2_l$.)

Identical activation functions within each layer!
- Layer 1:

$$y^1 = f^1(\tilde{W}^1 \tilde{u}), \quad \text{with } y^1 = \begin{bmatrix} y^1_1 \\ \vdots \\ y^1_s \end{bmatrix}_{s \times 1}, \quad \tilde{W}^1 = \begin{bmatrix} w^1_{1,1} & \dots & w^1_{1,m} & b^1_1 \\ \vdots & & \vdots & \vdots \\ w^1_{s,1} & \dots & w^1_{s,m} & b^1_s \end{bmatrix}_{s \times (m+1)}, \quad \tilde{u} = \begin{bmatrix} u_1 \\ \vdots \\ u_m \\ 1 \end{bmatrix}_{(m+1) \times 1}$$

- Layer 2:

$$y^2 = f^2(\tilde{W}^2 \tilde{y}^1), \quad \text{with } y^2 = \begin{bmatrix} y^2_1 \\ \vdots \\ y^2_k \end{bmatrix}_{k \times 1}, \quad \tilde{W}^2 = \begin{bmatrix} w^2_{1,1} & \dots & w^2_{1,s} & b^2_1 \\ \vdots & & \vdots & \vdots \\ w^2_{k,1} & \dots & w^2_{k,s} & b^2_k \end{bmatrix}_{k \times (s+1)}, \quad \tilde{y}^1 = \begin{bmatrix} y^1_1 \\ \vdots \\ y^1_s \\ 1 \end{bmatrix}_{(s+1) \times 1}$$
Remark: the upper index indicates the layer; for the lower indices, the first indicates the neuron and the second indicates the link of that neuron.
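A sketch of the two-layer composition $y^1 = f^1(\tilde{W}^1\tilde{u})$, $y^2 = f^2(\tilde{W}^2\tilde{y}^1)$ in NumPy (function names and the choice of tanh hidden / linear output activations are illustrative assumptions):

import numpy as np

def layer_forward(W_ext, x, f):
    # W_ext: (neurons, inputs+1), last column = biases; x: (inputs,)
    return f(W_ext @ np.append(x, 1.0))

def two_layer_forward(W1_ext, W2_ext, u, f1=np.tanh,
                      f2=lambda n: n):          # tanh hidden layer, linear output
    y1 = layer_forward(W1_ext, u, f1)           # hidden layer (layer 1), s outputs
    y2 = layer_forward(W2_ext, y1, f2)          # output layer (layer 2), k outputs
    return y2

# example: m = 2 inputs, s = 3 hidden neurons, k = 1 output
rng = np.random.default_rng(0)
W1_ext = rng.normal(size=(3, 3))   # 3 x (2+1)
W2_ext = rng.normal(size=(1, 4))   # 1 x (3+1)
print(two_layer_forward(W1_ext, W2_ext, np.array([0.5, -1.0])))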
(Simplified diagram: input layer with $u_1, \dots, u_m$; hidden layer with s neurons producing $y^1_1, \dots, y^1_s$; output layer with k neurons producing $y^2_1, \dots, y^2_k$.)
Example 3: Feed-forward architectures with two layers - simplified diagram

(Diagram: simplified block representation with inputs $u_1, \dots, u_m$, a block of k neurons producing $y_1, \dots, y_k$, and unit-delay blocks $q^{-1}$ on some of the links. Input layer; output layer - k neurons.)
ANN architecture = number of inputs and number of outputs
                   + number of layers
                   + number of neurons within each layer
                   + map of links
                   + type of the activation functions

ANN parameters:
- for sigmoid/linear/step activation functions, the parameters are the weights and the biases;
- for Gaussian activation functions, the parameters are the centers and the spreads.
III. 3. Multi-layer Perceptron (MLP) - revision
III. 3. 1. MLP architecture
- all the neurons have sigmoidal activation function (linear, sigmoid, tanh)

(Diagram: the same two-layer structure as in Example 2, with the activation of neuron i in layer 1 denoted $v^1_i$ and the activation of neuron l in layer 2 denoted $v^2_l$: input layer $u_1, \dots, u_m$; hidden layer (layer 1) with s neurons and outputs $y^1_i = f_1(v^1_i)$; output layer (layer 2) with k neurons and outputs $y^2_l = f_2(v^2_l)$.)
The layers are linked in series: the outputs of the neurons belonging to a layer are inputs for
the neurons of the next layer.

Within a layer, the neurons work in parallel.

The MLP can have any number of hidden layers


III.3.2. Criteria for learning algorithms based on error correction
Let us consider k neurons within the output layer.
1. On-line

The training samples $(u(i), d(i))$, $i = \overline{1,N}$, are presented sequentially, one sample per iteration (the number of iterations is a multiple of N).

$$I(n) = 0.5\sum_{i=1}^{k} e_i^2(n), \quad \text{with } e_i(n) = d_i(n) - y_i(n) = \text{the error of the } i\text{th output neuron}$$

2. Batch

All the training samples $(u_i, d_i)$, $i = \overline{1,N}$, are presented at a single iteration (epoch).

$$I(n) = \frac{1}{2N}\sum_{j=1}^{N}\sum_{i=1}^{k} e_i^2(n, j), \quad \text{with } e_i(n,j) = \text{the error of the } i\text{th output neuron for the } j\text{th training sample presented at the } n\text{th epoch}$$
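In code form, the two criteria can be written as follows (a NumPy sketch with illustrative names and array shapes):

import numpy as np

def online_criterion(d_n, y_n):
    # d_n, y_n: (k,) desired and actual outputs for the sample shown at iteration n
    e = d_n - y_n
    return 0.5 * np.sum(e ** 2)

def batch_criterion(D, Y):
    # D, Y: (N, k) desired and actual outputs for all N samples at epoch n
    E = D - Y
    return np.sum(E ** 2) / (2.0 * E.shape[0])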


III.3.3. Backpropagation learning algorithm

= the steepest descent (gradient) method

$$w_{ij}^l(n+1) = w_{ij}^l(n) - \eta\,\frac{\partial I}{\partial w_{ij}^l}(n) = w_{ij}^l(n) + \Delta w_{ij}^l(n),$$

$\eta > 0$ - influences the convergence speed.

Overview - steps carried out at each epoch:

- Evaluate the ANN output and the error: feedforward, IN → OUT.
- Adapt the parameters backward, OUT → IN (backpropagation).

For online learning ($I(n) = 0.5\sum_{i=1}^{k} e_i^2(n)$):

Parameter variation = learning rate ($\eta$) x local gradient ($\delta$) x input (corresponding to the link)

For batch learning ($I(n) = \frac{1}{2N}\sum_{j=1}^{N}\sum_{i=1}^{k} e_i^2(n,j)$),

where $e_i(n,j)$ = the error of the ith output neuron for the jth training sample presented at the nth epoch:

$$\Delta w_{ik}^l(n) = \frac{1}{N}\sum_{j=1}^{N} \Delta w_{ik}^l(n, j) \quad \text{- the mean of the variations computed separately for each sample.}$$
Backpropagation adaptation equations

For the sake of simplicity, online learning is considered: $I(n) = 0.5\sum_{i=1}^{k} e_i^2(n)$.

1. For the output layer (denoted l)

Let us consider k output neurons and s neurons in the preceding layer.

$e_i(n) = d_i(n) - y_i^l(n)$, the error produced by the output neuron i for a certain sample,

$$v_i^l(n) = \sum_{j=0}^{s} w_{ij}^l y_j^{l-1}, \quad \text{with } w_{i0}^l = b_i^l \text{ and } y_0^{l-1} = 1.$$

$$\frac{\partial I}{\partial w_{ij}^l}(n) = \frac{\partial I}{\partial e_i}(n)\frac{\partial e_i}{\partial w_{ij}^l}(n) = e_i(n)\,\frac{\partial e_i}{\partial y_i^l}(n)\,\frac{\partial y_i^l}{\partial v_i^l}(n)\,\frac{\partial v_i^l}{\partial w_{ij}^l}(n)$$

$$\frac{\partial I}{\partial w_{ij}^l}(n) = e_i(n)\,(-1)\,f_i'(v_i^l(n))\,y_j^{l-1}(n) = -\delta_i^l(n)\,y_j^{l-1}(n)$$

$$\Delta w_{ij}^l(n) = \eta\,\delta_i^l(n)\,y_j^{l-1}(n), \quad \text{with } \delta_i^l = e_i(n)\,f_i'(v_i^l(n)) = -\frac{\partial I}{\partial v_i^l}(n) = \text{local gradient}$$

Parameter variation = learning rate ($\eta$) x local gradient ($\delta$) x input (corresponding to the link)
2. For the hidden layers

Problem: find the contribution of a hidden neuron to the total error.


The parameters will be adapted starting from the output layer to the input layer.

Considering the hidden layer l, the local gradients within the layers l +1, l +2, .. must be
available from previous computations.

The output of the neuron i belonging to layer l is input for the neurons belonging to l +1
(output).
For the sake of simplicity: the layer l +1 is considered the output layer.

Layer l +1: with k neurons

$$y_z^{l+1}(n) = f_{l+1}(v_z^{l+1}(n)), \quad z = \overline{1,k},$$
$$v_z^{l+1}(n) = \sum_{j=0}^{s} w_{z,j}^{l+1} y_j^l(n),$$

s = the number of input connections of each neuron in layer l+1 (the number of neurons in the previous layer, l), $w_{z,0}^{l+1} = b_z^{l+1}$, $y_0^l = 1$.

$\delta_z^{l+1}$ is known for $z = \overline{1,k}$.
Layer l: with s neurons

$$y_i^l(n) = f_l(v_i^l(n)), \quad i = \overline{1,s},$$
$$v_i^l(n) = \sum_{j=0}^{q} w_{i,j}^l y_j^{l-1}(n),$$

q = the number of input connections of neuron i (the number of neurons belonging to the previous layer), $w_{i,0}^l = b_i^l$, $y_0^{l-1} = 1$.

If l is the first hidden layer ($l = 1$), then $y_i^{l-1}(n) = u_i(n)$.


$$\frac{\partial I}{\partial w_{i,j}^l}(n) = \sum_{z=1}^{k} \frac{\partial I}{\partial e_z^{l+1}}(n)\,\frac{\partial e_z^{l+1}}{\partial w_{i,j}^l}(n)$$

$$\frac{\partial I}{\partial w_{i,j}^l}(n) = \sum_{z=1}^{k} e_z^{l+1}(n)\,\frac{\partial e_z^{l+1}}{\partial y_z^{l+1}}(n)\,\frac{\partial y_z^{l+1}}{\partial v_z^{l+1}}(n)\,\frac{\partial v_z^{l+1}}{\partial y_i^l}(n)\,\frac{\partial y_i^l}{\partial v_i^l}(n)\,\frac{\partial v_i^l}{\partial w_{i,j}^l}(n)$$

$$\frac{\partial I}{\partial w_{i,j}^l}(n) = \sum_{z=1}^{k} e_z^{l+1}(n)\,(-1)\,f_{l+1}'(v_z^{l+1}(n))\,w_{z,i}^{l+1}\,f_l'(v_i^l(n))\,y_j^{l-1}(n)$$

$$\frac{\partial I}{\partial w_{i,j}^l}(n) = -\sum_{z=1}^{k} \delta_z^{l+1}\,w_{z,i}^{l+1}\,f_l'(v_i^l(n))\,y_j^{l-1}(n) = -y_j^{l-1}(n)\,f_l'(v_i^l(n))\sum_{z=1}^{k} \delta_z^{l+1}\,w_{z,i}^{l+1}.$$

$$\Delta w_{i,j}^l(n) = \eta\,y_j^{l-1}(n)\,\delta_i^l(n), \quad \text{with } \delta_i^l(n) = f_l'(v_i^l(n))\sum_{z=1}^{k} \delta_z^{l+1}\,w_{z,i}^{l+1} = \text{local gradient}$$

Parameter variation = learning rate ($\eta$) x local gradient ($\delta$) x input (corresponding to the link)
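To make the update rules concrete, here is a compact online backpropagation sketch for a two-layer MLP (sigmoid hidden layer, linear output) in NumPy; the function and variable names are illustrative, and this is a didactic sketch of the rules above, not an optimized implementation:

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def train_mlp_online(U, D, s, eta=0.1, epochs=1000, seed=0):
    # U: (N, m) training inputs, D: (N, k) desired outputs, s: hidden neurons
    rng = np.random.default_rng(seed)
    N, m = U.shape
    k = D.shape[1]
    W1 = rng.normal(scale=0.5, size=(s, m + 1))      # hidden layer, extended weights
    W2 = rng.normal(scale=0.5, size=(k, s + 1))      # output layer, extended weights
    for _ in range(epochs):
        for j in rng.permutation(N):                 # random presentation order
            u_ext = np.append(U[j], 1.0)
            v1 = W1 @ u_ext                          # hidden activations v^1
            y1 = sigmoid(v1)
            y1_ext = np.append(y1, 1.0)
            y2 = W2 @ y1_ext                         # linear output neurons
            e = D[j] - y2                            # output errors e_i = d_i - y_i
            delta2 = e                               # f'(v) = 1 for the linear output
            delta1 = y1 * (1.0 - y1) * (W2[:, :s].T @ delta2)  # f'(v^1)*sum_z delta_z*w_{z,i}
            W2 += eta * np.outer(delta2, y1_ext)     # Delta w = eta * delta * input
            W1 += eta * np.outer(delta1, u_ext)
    return W1, W2

# example call on the XOR data
U = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = train_mlp_online(U, D, s=4, eta=0.5)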
Remarks

1) The derivatives of the activation functions

o Sigmoid:

$$f(v) = \frac{1}{1 + \exp(-av)}, \; a > 0 \;\Rightarrow\; f'(v) = \frac{a\exp(-av)}{[1 + \exp(-av)]^2} = a\,f(v)\,[1 - f(v)]$$

o Hyperbolic tangent:

$$f(v) = a\tanh(bv), \; a, b > 0 \;\Rightarrow\; f'(v) = \frac{b}{a}\,[a - f(v)]\,[a + f(v)]$$
2) Learning rate $\eta > 0$

- For small values: low convergence speed; a quite smooth trajectory is followed within the
search space
- For large values: risk of unstable behavior

Improvements:

2a) use inertial back-propagation (with momentum)

2b) use a distinct learning rate for each link


2a) use inertial back-propagation (with momentum) - explanations
Generalized delta rule:

$$\Delta w_{ij}^l(n) = \alpha\,\Delta w_{ij}^l(n-1) + \eta\,\delta_i^l(n)\,y_j^{l-1}(n), \quad \alpha > 0 \text{ - momentum constant}$$

Unfolding the recursion:

$$\Delta w_{ij}^l(n) = \eta\sum_{t=0}^{n} \alpha^{\,n-t}\,\delta_i^l(t)\,y_j^{l-1}(t) = -\eta\sum_{t=0}^{n} \alpha^{\,n-t}\,\frac{\partial I}{\partial w_{ij}^l}(t),$$

$$\Delta w_{ij}^l(n) = -\eta\left[\alpha^n\,\frac{\partial I}{\partial w_{ij}^l}(0) + \alpha^{n-1}\,\frac{\partial I}{\partial w_{ij}^l}(1) + \dots + \frac{\partial I}{\partial w_{ij}^l}(n)\right].$$

$\alpha$ has a stabilizing effect:

- when $\frac{\partial I}{\partial w_{ij}^l}(t)$ keeps its sign at successive iterations, the absolute value of $\Delta w_{ij}^l$ increases;

- when $\frac{\partial I}{\partial w_{ij}^l}(t)$ changes sign at successive iterations, the absolute value of $\Delta w_{ij}^l$ decreases.

$\alpha \in [0,1]$.
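In code form, the generalized delta rule is a one-line change with respect to the basic update (a NumPy sketch with illustrative names):

import numpy as np

def momentum_update(delta_w_prev, local_gradient, layer_input, eta=0.1, alpha=0.9):
    # generalized delta rule: Δw(n) = alpha*Δw(n-1) + eta*delta(n)*y(n-1)
    delta_w = alpha * delta_w_prev + eta * np.outer(local_gradient, layer_input)
    return delta_w   # add this to the weight matrix, then keep it for the next call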
3) Online or batch learning?

- For online learning, the training samples must be randomly presented to avoid cycling.

Online learning:
- reduced memory consumption;
- faster learning for large training data sets;
- convergence is hard to analyze (the examples must be randomly presented in order to avoid stagnation in local optima);
- good results for training data sets containing similar samples.
4) Stop criteria
- there is no available proof of algorithm convergence
- only some recommendations can be made:

Recommendations:
o The norm of the gradient becomes close to 0
Disadvantage: numerous epochs can be involved.
o The variation of the criterion I becomes insignificant
Disadvantage: premature stop.

5) The initialization of weights


- The result is dependent on the initial ANN parameters

Use uniformly or normally distributed initial values (mean 0, with a spread small enough to avoid driving the neurons into saturation).
6) Efficient exploitation of the training samples

- For online learning, successive samples should be very different from one another.
When the training samples are randomly presented, this condition is frequently met.
- Outliers can impede the convergence and can lead to bad generalization capabilities.

7) Learning is faster for antisymmetric (odd) activation functions:

$$f(-v) = -f(v)$$

Ex.: the symmetric hard limiter; the hyperbolic tangent $f(v) = a\tanh(bv)$,
recommended values (LeCun): $a = 1.7159$, $b = 2/3$, giving $\frac{\partial f}{\partial v}(0) \approx 1.14$.
8) Learning rate

The neurons should learn at the same speed.

- Usually, the gradients in the output layer are larger, so $\eta$ should be smaller for the output neurons.

- The neurons having more links can work with a smaller $\eta$.

LeCun suggests: $\eta = \frac{1}{\sqrt{m}}$, m = the number of input links of the neuron.
ATTENTION!!! - Generalization capacity

Training = approximation in terms of the training data set.

Generalization = approximation in terms of another data set (the validation data set).

If N is too large or the ANN architecture is too complex, the model becomes over-fitted

→ select the simplest function possible, if there is no information to invalidate this choice.

III. 3. 4. Applications

1. Function approximations

- MLP = universal approximator

Theorem:
Any continuous bounded function can be approximated with any desired degree of accuracy $\varepsilon > 0$ by means of an MLP containing:
- a hidden layer with m neurons, characterized by continuous, bounded, monotonic activation functions;
- an output layer with a linear neuron (or a sigmoidal neuron working within its linear region):

$$F(u) = \sum_{i=1}^{m} \alpha_i\, f\!\left(\sum_{j=1}^{R} w_{ij} u_j + b_i\right)$$

m = the number of hidden neurons
R = the number of inputs
Remarks regarding the content of this theorem:
- the existence of such an MLP is guaranteed;
- the theorem does not give any indication concerning the resulting generalization capacity of the model or the time required for learning;
- the optimal structure is not given.

Remarks regarding the applicability of this theorem:

- the value of m:
  o if m is small, the risk of learning the noise captured by the training samples (over-fitting) is reduced;
  o if m is large, a good accuracy can be obtained;

- when a single hidden layer is used, the parameters of the neurons tend to interact: the approximation of some samples can be improved only by accepting a worse approximation for other samples;

- for ANNs with 2 hidden layers: hidden layer 1 extracts the local properties, hidden layer 2 extracts the global properties.
III. 4. ANN with Radial basis functions - RBF

III.4.1. The neuron of RBFs

The structure of the hidden neuron of an RBF network:

(Diagram: inputs $p_1, \dots, p_R$ with associated centers $c_1, \dots, c_R$; the neuron computes $n = \|p - c\|$ and outputs $y = f(n)$.)

$$y = f(\|p - c\|) = f\!\left(\sqrt{(p_1 - c_1)^2 + \dots + (p_R - c_R)^2}\right)$$

$$p = \begin{bmatrix} p_1 \\ \vdots \\ p_R \end{bmatrix}, \quad c = \begin{bmatrix} c_1 \\ \vdots \\ c_R \end{bmatrix}$$

c = center vector (a center for each input connection)

See demorb1

Usually, the Gaussian activation function is used:

$$y = \exp\!\left(-\frac{\|p - c\|^2}{2\sigma^2}\right) = \exp\!\left(-\frac{(p_1 - c_1)^2 + \dots + (p_R - c_R)^2}{2\sigma^2}\right)$$

$$p = \begin{bmatrix} p_1 \\ \vdots \\ p_R \end{bmatrix}, \quad c = \begin{bmatrix} c_1 \\ \vdots \\ c_R \end{bmatrix}$$

c = vector of centers of the hidden neuron,
$\sigma$ = spread.
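A one-function NumPy sketch of this Gaussian neuron (the function name is an illustrative assumption):

import numpy as np

def gaussian_rbf_neuron(p, c, sigma):
    # y = exp(-||p - c||^2 / (2*sigma^2))
    return np.exp(-np.sum((p - c) ** 2) / (2.0 * sigma ** 2))

# example: the output is 1 when the input coincides with the center
print(gaussian_rbf_neuron(np.array([1.0, 1.0]), np.array([1.0, 1.0]), 1.0))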
Remarks:

- The neuron is activated only if the input vector is similar to the center vector.
  o The accepted similarity level is given by $\sigma$.
  o If $\sigma$ is large, the neuron is activated even for reduced similarity between the input and the center.

- For inputs which are very dissimilar to the center, the neuron is inactive:

  $y \approx 0$ for $\|p - c\| \gg 0$, i.e. p and c very different.


Comparison between Gaussian neuron and perceptron

(Diagram: the perceptron from II.1, with inputs $p_1, \dots, p_R$, weights $w_1, \dots, w_R$, bias b and output $y = f(n)$.)

For the perceptron, the input $p_j$ has a more significant influence on the activation of the neuron if the absolute value of $p_j w_j$ is larger.
III. 4.2. RBF architecture

The standard architecture includes:
- a linear output neuron;
- a single hidden layer with s Gaussian neurons.

Because a single hidden layer is considered, the upper index will be dropped from most of the notation (it is kept only for distinguishing between the radial basis and the linear activation functions, $f_1$ and $f_2$).

(Diagram: the inputs $u_1, \dots, u_m$ feed s hidden Gaussian neurons with centers $c_{i1}, \dots, c_{im}$ and outputs $y_i = f_1(n_i)$, $i = \overline{1,s}$; the hidden outputs are combined by the linear output neuron with weights $w_1, \dots, w_s$ and bias b: $y = f_2(n)$.)
$$y = f_2(w_1 y_1 + \dots + w_s y_s + b) = w_1 y_1 + \dots + w_s y_s + b = [w_1 \;\dots\; w_s]\begin{bmatrix} y_1 \\ \vdots \\ y_s \end{bmatrix} + b = \sum_{i=1}^{s} w_i y_i + b$$

$$y_i = f_1(\|u - c_i\|) = f_1\!\left(\sqrt{(u_1 - c_{i1})^2 + \dots + (u_m - c_{im})^2}\right)$$

$$u = \begin{bmatrix} u_1 \\ \vdots \\ u_m \end{bmatrix}, \quad c_i = \begin{bmatrix} c_{i1} \\ \vdots \\ c_{im} \end{bmatrix}$$

$c_i$ = center vector of the hidden neuron i.

For Gaussian activation functions within the hidden layer:

$$y = b + \sum_{i=1}^{s} w_i \exp\!\left(-\frac{\|u - c_i\|^2}{2\sigma_i^2}\right) = b + \sum_{i=1}^{s} w_i \exp\!\left(-\frac{(u_1 - c_{i1})^2 + \dots + (u_m - c_{im})^2}{2\sigma_i^2}\right)$$

$c_i$ = center vector of the hidden neuron i,
$\sigma_i$ = spread of the hidden neuron i.
III. 4. 3. RBF = universal approximator
Necessary theoretical background:

Premises for solving classification problems

Cover's Theorem for pattern classification:

A complex (nonlinearly separable) classification problem has great chances to become linearly separable via a nonlinear mapping into a space of high dimension.

Let us consider the samples $u(i) = [u_1(i) \;\dots\; u_m(i)]^T$ belonging to $\mathbb{R}^m$ (e.g. the training samples).

Let us consider $f: \mathbb{R}^m \to \mathbb{R}^s$, with s large, $f(u) = \begin{bmatrix} f_1(u) \\ \vdots \\ f_s(u) \end{bmatrix}$, $f_1, \dots, f_s: \mathbb{R}^m \to \mathbb{R}$

(e.g. $f_1, \dots, f_s$ indicate the mappings provided by s hidden neurons).

Definition:
The classes $C_1, C_2$ are f-separable if there exists $w = [w_1 \;\dots\; w_s]^T \in \mathbb{R}^s$ with:

$$w^T f(u) > 0, \text{ for } u \in C_1$$
$$w^T f(u) \le 0, \text{ for } u \in C_2$$

Remarks:
- according to Cover's theorem: choose a large value of s and nonlinear $f_1, \dots, f_s$;
- the hyperplane delimiting the classes in the f-space is given by $w^T f(u) = 0$;
- the $f_i$ could be radial basis functions.

Example: the XOR problem

Classify the samples: $u(1) = \begin{bmatrix}1\\1\end{bmatrix} \in C_1$, $u(2) = \begin{bmatrix}0\\1\end{bmatrix} \in C_2$, $u(3) = \begin{bmatrix}1\\0\end{bmatrix} \in C_2$, $u(4) = \begin{bmatrix}0\\0\end{bmatrix} \in C_1$.

Let us consider:

$$f_1(u) = \exp\!\left(-\left\|u - \begin{bmatrix}1\\1\end{bmatrix}\right\|^2\right) = \exp(-(u_1-1)^2 - (u_2-1)^2),$$
$$f_2(u) = \exp\!\left(-\left\|u - \begin{bmatrix}0\\0\end{bmatrix}\right\|^2\right) = \exp(-u_1^2 - u_2^2).$$

Knowing the input samples, it results:

$f_1(u(1)) = 1,\; f_2(u(1)) = 0.13$
$f_1(u(2)) = 0.36,\; f_2(u(2)) = 0.36$
$f_1(u(3)) = 0.36,\; f_2(u(3)) = 0.36$
$f_1(u(4)) = 0.13,\; f_2(u(4)) = 1$

(Figure: the four samples plotted in the original $(u_1, u_2)$ plane, where they are not linearly separable, and in the transformed $(f_1(u), f_2(u))$ plane, where they become linearly separable.)
For function approximation (interpolation):

Find $F: \mathbb{R}^m \to \mathbb{R}$ accepting $(u(i), d(i))$, $i = \overline{1,N}$,

with $u(i) = [u_1(i) \;\dots\; u_m(i)]^T \in \mathbb{R}^m$ and $d(i) \in \mathbb{R}$,

$d(i) = F(u(i))$ = the desired output of the function corresponding to the input $u(i)$

(these samples could be used for training).

Find the interpolation:

$$F(u) = \sum_{i=1}^{N} w_i f_i(\|u - u(i)\|)$$

- the number of radial basis functions = the number of training samples;
- the functions $f_i$ have the centers $c_i = u(i)$.
The radial basis functions could be chosen as follows:

a) $f_i(u) = \sqrt{\|u - c_i\|^2 + q_i^2}$, $q_i > 0$: non-local, unbounded (multiquadric);

b) $f_i(u) = \dfrac{1}{\sqrt{\|u - c_i\|^2 + q_i^2}}$, $q_i > 0$: local, bounded (inverse multiquadric);

c) $f_i(u) = \exp\!\left(-\dfrac{\|u - c_i\|^2}{2\sigma_i^2}\right)$: local, bounded (Gaussian).
Knowing that $d(i) = F(u(i))$, it results:

$$\begin{bmatrix} f_1(u(1)) & \dots & f_N(u(1)) \\ \vdots & & \vdots \\ f_1(u(N)) & \dots & f_N(u(N)) \end{bmatrix}\begin{bmatrix} w_1 \\ \vdots \\ w_N \end{bmatrix} = \begin{bmatrix} d(1) \\ \vdots \\ d(N) \end{bmatrix}.$$

Let us consider:

$$\Phi = \begin{bmatrix} f_1(u(1)) & \dots & f_N(u(1)) \\ \vdots & & \vdots \\ f_1(u(N)) & \dots & f_N(u(N)) \end{bmatrix} = \text{interpolation matrix}.$$

Using this notation, the equation can be rewritten as follows:

$$\Phi\begin{bmatrix} w_1 \\ \vdots \\ w_N \end{bmatrix} = \begin{bmatrix} d(1) \\ \vdots \\ d(N) \end{bmatrix} \;\Rightarrow\; \begin{bmatrix} w_1 \\ \vdots \\ w_N \end{bmatrix} = \Phi^{-1}\begin{bmatrix} d(1) \\ \vdots \\ d(N) \end{bmatrix}, \quad \text{if } \Phi \text{ is nonsingular}.$$
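A small NumPy sketch of this exact interpolation with Gaussian radial basis functions centered at the training samples (function names and the common spread are illustrative assumptions):

import numpy as np

def gaussian_interpolation_weights(U, d, sigma=1.0):
    # U: (N, m) training inputs used as centers, d: (N,) desired outputs
    # Phi[j, i] = f_i(u(j)) = exp(-||u(j) - u(i)||^2 / (2*sigma^2))
    diff = U[:, None, :] - U[None, :, :]
    Phi = np.exp(-np.sum(diff ** 2, axis=2) / (2.0 * sigma ** 2))
    return np.linalg.solve(Phi, d)   # w = Phi^{-1} d (Phi nonsingular for distinct samples)

def gaussian_interpolant(u, U, w, sigma=1.0):
    # F(u) = sum_i w_i * exp(-||u - u(i)||^2 / (2*sigma^2))
    return w @ np.exp(-np.sum((U - u) ** 2, axis=1) / (2.0 * sigma ** 2))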
Micchelli's Theorem (1986)
If the $f_i$ are radial basis functions and the samples $u(i) \in \mathbb{R}^m$ are distinct,
then $\Phi$ is nonsingular.

Remarks:
o For $f_i$ of types b) and c), $\Phi$ is positive definite.
o For $f_i$ of type a), $\Phi$ has N-1 positive eigenvalues and one negative eigenvalue.

Remarks:
- large N (many samples) → many radial basis functions → complex model (over-fitting);
- large N → risk of a poorly conditioned interpolation matrix and large execution times.

It is desirable to use fewer radial basis functions than training samples:

$s < N$, s = the number of hidden neurons.

Instead of $F(u) = \sum_{i=1}^{N} w_i f_i(\|u - u(i)\|)$, one has to consider $F(u) = b + \sum_{i=1}^{s} w_i f_i(\|u - c_i\|)$:

- the centers of the radial basis functions and the input training samples are different;

- the output neuron accepts a nonzero bias.

Knowing that $d(i) = F(u(i))$, $i = \overline{1,N}$, it results:

$$\begin{bmatrix} f_1(u(1)) & \dots & f_s(u(1)) & 1 \\ \vdots & & \vdots & \vdots \\ f_1(u(N)) & \dots & f_s(u(N)) & 1 \end{bmatrix}\begin{bmatrix} w_1 \\ \vdots \\ w_s \\ b \end{bmatrix} = \begin{bmatrix} d(1) \\ \vdots \\ d(N) \end{bmatrix}.$$

Let us denote:

$$G = \begin{bmatrix} f_1(u(1)) & \dots & f_s(u(1)) & 1 \\ \vdots & & \vdots & \vdots \\ f_1(u(N)) & \dots & f_s(u(N)) & 1 \end{bmatrix} \in \mathbb{R}^{N \times (s+1)}.$$

Therefore, it results:

$$G\begin{bmatrix} w_1 \\ \vdots \\ w_s \\ b \end{bmatrix} = \begin{bmatrix} d(1) \\ \vdots \\ d(N) \end{bmatrix} \;\Rightarrow\; \begin{bmatrix} w_1 \\ \vdots \\ w_s \\ b \end{bmatrix} = G^{+}\begin{bmatrix} d(1) \\ \vdots \\ d(N) \end{bmatrix},$$

$$G^{+} = (G^T G)^{-1} G^T.$$
Example: the XOR problem

Classify the samples: $u(1) = \begin{bmatrix}1\\1\end{bmatrix} \in C_1$, $u(2) = \begin{bmatrix}0\\1\end{bmatrix} \in C_2$, $u(3) = \begin{bmatrix}1\\0\end{bmatrix} \in C_2$, $u(4) = \begin{bmatrix}0\\0\end{bmatrix} \in C_1$.

Let us consider:

$$f_1(u) = \exp\!\left(-\left\|u - \begin{bmatrix}1\\1\end{bmatrix}\right\|^2\right) = \exp(-(u_1-1)^2 - (u_2-1)^2),$$
$$f_2(u) = \exp\!\left(-\left\|u - \begin{bmatrix}0\\0\end{bmatrix}\right\|^2\right) = \exp(-u_1^2 - u_2^2).$$

For the above mentioned input training samples it results:

$f_1(u(1)) = 1,\; f_2(u(1)) = 0.13$
$f_1(u(2)) = 0.36,\; f_2(u(2)) = 0.36$
$f_1(u(3)) = 0.36,\; f_2(u(3)) = 0.36$
$f_1(u(4)) = 0.13,\; f_2(u(4)) = 1$

$$G = \begin{bmatrix} 1 & 0.13 & 1 \\ 0.36 & 0.36 & 1 \\ 0.36 & 0.36 & 1 \\ 0.13 & 1 & 1 \end{bmatrix}.$$

Let us define:
$d(1) = 0,\; d(2) = 1,\; d(3) = 1,\; d(4) = 0.$

Therefore, it results:

- $G^{+}$ (given by the MATLAB function pinv):

$$G^{+} = \begin{bmatrix} 1.7942 & -1.2195 & -1.2195 & 0.6448 \\ 0.6448 & -1.2195 & -1.2195 & 1.7942 \\ -0.8780 & 1.3780 & 1.3780 & -0.8780 \end{bmatrix}$$

$$\begin{bmatrix} w_1 \\ w_2 \\ b \end{bmatrix} = G^{+}\begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} -2.4390 \\ -2.4390 \\ 2.7561 \end{bmatrix}.$$
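The same computation can be reproduced in NumPy (using the rounded values of $f_1$, $f_2$ listed above; using exact Gaussian values instead shifts the weights slightly):

import numpy as np

# interpolation matrix built from the rounded values above, plus the bias column
G = np.array([[1.00, 0.13, 1.0],
              [0.36, 0.36, 1.0],
              [0.36, 0.36, 1.0],
              [0.13, 1.00, 1.0]])
d = np.array([0.0, 1.0, 1.0, 0.0])

G_plus = np.linalg.pinv(G)          # Moore-Penrose pseudo-inverse, as MATLAB's pinv
w1, w2, b = G_plus @ d              # least-squares weights and bias
print(w1, w2, b)                    # approx. -2.439, -2.439, 2.756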
Theorem:

Any continuous bounded function $F: \mathbb{R}^m \to \mathbb{R}$ can be approximated with any desired degree of accuracy by means of:

$$F(u) = b + \sum_{i=1}^{s} w_i f\!\left(\frac{u - c_i}{\sigma}\right), \quad \sigma > 0, \text{ if}$$

$f: \mathbb{R}^m \to \mathbb{R}$ is bounded and $\int_{\mathbb{R}^m} f(u)\,du < \infty$.

- The requirements imposed by this theorem are met by the radial basis functions b) and c).
- The radial basis function a) can be used only for $s = N$.
- f is not necessarily symmetric!
Remark:
- the ANN with hidden radial basis activation functions and a linear output neuron is compliant with the requirements of the previous theorem.

- training = optimization carried out in terms of the training data set:

step 1: select the centers and the spread;

step 2: assuming the centers and the spread are known, compute the output weights:

$$\begin{bmatrix} w_1 \\ \vdots \\ w_s \\ b \end{bmatrix} = G^{+}\begin{bmatrix} d(1) \\ \vdots \\ d(N) \end{bmatrix}, \quad \text{with } G = \begin{bmatrix} f_1(u(1)) & \dots & f_s(u(1)) & 1 \\ \vdots & & \vdots & \vdots \\ f_1(u(N)) & \dots & f_s(u(N)) & 1 \end{bmatrix} \in \mathbb{R}^{N \times (s+1)}.$$

Challenge: what centers to choose?

If the centers are known, the output weights can be computed in a single step.
- generalization = interpolation

For Gaussian functions:

$$F(u) = b + \sum_{i=1}^{s} w_i \exp\!\left(-\frac{\|u - c_i\|^2}{2\sigma^2}\right), \quad \text{when the same spread is employed for all the hidden neurons}$$

Recommendation: choose s = 3 N.
Comparison between MLP and RBF

- MLP: any number of hidden layers; RBF: one hidden layer.
- MLP: input operator = scalar product; RBF: input operator = Euclidean distance.
- MLP: nonlinear in terms of the neural parameters; RBF: linear in terms of the output parameters, if fixed centers and spreads are assumed.
- MLP: large training time required; RBF: small training time if the centers are known (useful for on-line training).
- MLP: global action; RBF: local action.
- MLP: fewer parameters for the same degree of accuracy (usually).
Learning strategies

1. Random centers selection

Step 1. Choose the centers randomly.

Step 2. Compute the spread $\sigma = \dfrac{d_{max}}{\sqrt{2s}}$,
with
$d_{max}$ = the maximum distance between the selected centers,
s = the number of hidden neurons.

Step 3. Compute the weights and the bias.
2. Centers self-organization

Step 1. The centers are chosen via a clustering algorithm applied to the training input samples (e.g. K-means clustering).

K-means clustering (type: learning via competition):

Step 1-0: Choose random, distinct initial values for all s centers, denoted $c_i(n)$, with $n = 0$ and $i = \overline{1,s}$.

Step 1-1: For the training sample $u(n)$, compute $\|u(n) - c_i\|$, $i = \overline{1,s}$, and find the minimum distance, which indicates the nearest center for this sample. Denote by $i^*$, with $i^* \in \overline{1,s}$, the index of the nearest center.

Step 1-2: Update the nearest center, moving it toward the sample:
$$c_{i^*}(n+1) = c_{i^*}(n) + \eta\,[u(n) - c_{i^*}(n)], \quad \text{with } 1 > \eta > 0.$$

Step 1-3: $n \leftarrow n + 1$.

Step 1-4: If some training samples have not been used yet or the change made at step 1-2 is too large, go to step 1-1.

Drawback: the result depends on the initial values.

Step 2. Compute the spread $\sigma = \dfrac{d_{max}}{\sqrt{2s}}$,
with
$d_{max}$ = the maximum distance between the selected centers,
s = the number of hidden neurons.

Step 3. Compute the weights and the bias.
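A compact NumPy sketch of this center-selection strategy (illustrative function names; the spread formula $d_{max}/\sqrt{2s}$ is used as reconstructed above):

import numpy as np

def kmeans_centers(U, s, eta=0.1, epochs=10, seed=0):
    # U: (N, m) training inputs; s: number of hidden neurons (centers)
    rng = np.random.default_rng(seed)
    centers = U[rng.choice(len(U), size=s, replace=False)].copy()  # distinct initial centers
    for _ in range(epochs):
        for u in U[rng.permutation(len(U))]:
            i_star = np.argmin(np.linalg.norm(u - centers, axis=1))  # nearest center
            centers[i_star] += eta * (u - centers[i_star])           # move it toward the sample
    return centers

def spread_from_centers(centers):
    # sigma = d_max / sqrt(2*s), d_max = maximum distance between the selected centers
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    return d.max() / np.sqrt(2.0 * len(centers))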
3. Supervised centers selection

The parameters of the RBF network are adapted by using error correction:

LMS algorithm:

- let us consider batch learning;

- criterion:

$$I = \frac{1}{2}\sum_{j=1}^{N} e(j)^2 = \frac{1}{2}\sum_{j=1}^{N}\left[d(j) - \sum_{i=1}^{s} w_i f\!\left(\frac{\|u(j) - c_i\|}{\sigma_i}\right)\right]^2$$

- convex in terms of the weights;
- non-convex in terms of the centers (the optimization of the centers can get stuck in local optima).

For Gaussian activation functions:

$$I = \frac{1}{2}\sum_{j=1}^{N}\left[d(j) - \sum_{i=1}^{s} w_i \exp\!\left(-\frac{\|u(j) - c_i\|^2}{2\sigma_i^2}\right)\right]^2$$

At each iteration, the parameters of the RBF network (weights, centers, spreads) are updated according to the following rules:

- for weights:

$$w_i \leftarrow w_i - \eta_1 \frac{\partial I}{\partial w_i}, \quad \text{with}$$

$$\frac{\partial I}{\partial w_i} = -\sum_{j=1}^{N} e(j)\, f\!\left(\frac{\|u(j) - c_i\|}{\sigma_i}\right) = -\sum_{j=1}^{N} e(j)\exp\!\left(-\frac{\|u(j) - c_i\|^2}{2\sigma_i^2}\right)$$

- for centers:

$$c_i \leftarrow c_i - \eta_2 \frac{\partial I}{\partial c_i}, \quad \text{with}$$

$$\frac{\partial I}{\partial (c_i)_k} = -\frac{w_i}{\sigma_i^2}\sum_{j=1}^{N} e(j)\exp\!\left(-\frac{\|u(j) - c_i\|^2}{2\sigma_i^2}\right)[u(j)_k - (c_i)_k],$$

with $(c_i)_k$, $u(j)_k$ indicating the kth component of the vectors $c_i$, $i = \overline{1,s}$, and $u(j)$, $j = \overline{1,N}$ (both of length m);

- for spreads:

$$\sigma_i \leftarrow \sigma_i - \eta_3 \frac{\partial I}{\partial \sigma_i}, \quad \text{with}$$

$$\frac{\partial I}{\partial \sigma_i} = -\frac{w_i}{\sigma_i^3}\sum_{j=1}^{N} e(j)\exp\!\left(-\frac{\|u(j) - c_i\|^2}{2\sigma_i^2}\right)\|u(j) - c_i\|^2$$
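A vectorized NumPy sketch of one batch update with these gradients (illustrative names; the output bias is omitted for brevity):

import numpy as np

def rbf_lms_step(U, d, w, C, sigma, etas=(0.01, 0.01, 0.01)):
    # one batch gradient step; U: (N,m), d: (N,), w: (s,), C: (s,m), sigma: (s,)
    sqdist = np.sum((U[:, None, :] - C[None, :, :]) ** 2, axis=2)   # ||u(j) - c_i||^2, (N, s)
    Phi = np.exp(-sqdist / (2.0 * sigma ** 2))                      # hidden outputs
    e = d - Phi @ w                                                 # errors e(j)
    weighted = e[:, None] * Phi                                     # e(j) * phi_i(j)
    grad_w = -weighted.sum(axis=0)
    grad_C = -(w / sigma ** 2)[:, None] * (weighted.T @ U - C * weighted.sum(axis=0)[:, None])
    grad_sigma = -(w / sigma ** 3) * np.sum(weighted * sqdist, axis=0)
    return (w - etas[0] * grad_w,
            C - etas[1] * grad_C,
            sigma - etas[2] * grad_sigma)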

4. Constructive algorithm

- insert the hidden neurons sequentially; the center vector of each new neuron copies the input training sample that produces the highest squared output error for the current architecture

> see MATLAB
