where the input of the activation function, called the activation n, is:
$$n = w_1 p_1 + \dots + w_R p_R + b = [\,w_1 \ \dots \ w_R\,]\begin{bmatrix} p_1 \\ \vdots \\ p_R \end{bmatrix} + b = \mathbf{W}\mathbf{p} + b,$$
or
$$n = \sum_{i=1}^{R} w_i p_i + b,$$
with $b \in \mathbb{R}$, $\mathbf{W} \in \mathbb{R}^{1 \times R}$, $\mathbf{p} \in \mathbb{R}^{R \times 1}$.
b can be treated as a supplementary weight w0 = b for the input p0 = 1 :
$$n = \sum_{i=0}^{R} w_i p_i = \tilde{\mathbf{W}}\tilde{\mathbf{p}}$$
notations: $\tilde{\mathbf{W}} = [\,w_0 \ w_1 \ \dots \ w_R\,]$ = extended weight vector, $\tilde{\mathbf{p}} = [\,1 \ p_1 \ \dots \ p_R\,]^T$ = extended input vector.
[Figure: neuron with inputs $p_1, \dots, p_R$, weights $w_1, \dots, w_R$, activation $n$ and output $y = f(n)$.]
Equivalently, with the bias placed last:
$$\tilde{\mathbf{W}} = [\,w_1 \ \dots \ w_R \ b\,], \quad \tilde{\mathbf{p}}^T = [\,p_1 \ \dots \ p_R \ 1\,], \quad n = \sum_{i=1}^{R+1} \tilde{w}_i \tilde{p}_i = \tilde{\mathbf{W}}\tilde{\mathbf{p}}.$$
Remark: comparison between the artificial neuron (AN) and the biological one (BN).
I. Deterministic functions
1) $y = f(n) = \begin{cases} 1, & n \ge 0 \\ 0, & n < 0 \end{cases}$ — hard limiter
[Plot: $f$ versus $\sum_{i=1}^{R} w_i p_i$; since $n = \sum_{i=1}^{R} w_i p_i + b$, the step from 0 to +1 occurs at $\sum_{i=1}^{R} w_i p_i = -b$.]
2) $y = f(n) = \begin{cases} +1, & n \ge 0 \\ -1, & n < 0 \end{cases}$ — symmetric hard limiter
[Plot: $f$ versus $\sum_{i=1}^{R} w_i p_i$; the step from $-1$ to $+1$ occurs at $-b$.]
3) $y = f(n) = \dfrac{1}{1 + \exp(-cn)}$, $c > 0$ — sigmoid. For $c = 1$:
[Plot: $f$ versus $\sum_{i=1}^{R} w_i p_i$; $f = 0.5$ at $-b$, saturating towards 0 and $+1$.]
[Figure: the sigmoid activation function $a = 1/(1 + \exp(-(wp + b)))$ plotted versus the input $p$ for the four sign combinations $(w = 2, b = 3)$, $(w = 2, b = -3)$, $(w = -2, b = 3)$, $(w = -2, b = -3)$. At the inflection point: $a = 0.5$, $p = -b/w$, and the tangent to the graph has slope $w/4$.]
4) $y = f(n) = \dfrac{1 - \exp(-2cn)}{1 + \exp(-2cn)}$, $c > 0$ — hyperbolic tangent. For $c = 1$:
[Plot: $f$ versus $\sum_{i=1}^{R} w_i p_i$; $f = 0$ at $-b$, saturating towards $-1$ and $+1$.]
5) $y = f(n) = \exp\!\left(-\dfrac{(n - c)^2}{2\sigma^2}\right)$ — Gaussian function
[Plot: bell-shaped $f$ versus $n$, with maximum $+1$ at $n = c$.]
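As a quick illustration, the five deterministic activation functions above can be written in a few lines of Python/NumPy (a minimal sketch; the function names and test values are chosen for this example only):

```python
import numpy as np

def hard_limiter(n):
    """1) hard limiter: 1 for n >= 0, 0 otherwise."""
    return np.where(n >= 0, 1.0, 0.0)

def symmetric_hard_limiter(n):
    """2) symmetric hard limiter: +1 for n >= 0, -1 otherwise."""
    return np.where(n >= 0, 1.0, -1.0)

def sigmoid(n, c=1.0):
    """3) sigmoid: 1 / (1 + exp(-c*n)), c > 0."""
    return 1.0 / (1.0 + np.exp(-c * n))

def hyperbolic_tangent(n, c=1.0):
    """4) hyperbolic tangent: (1 - exp(-2*c*n)) / (1 + exp(-2*c*n))."""
    return (1.0 - np.exp(-2 * c * n)) / (1.0 + np.exp(-2 * c * n))

def gaussian(n, center=0.0, sigma=1.0):
    """5) Gaussian: exp(-(n - center)^2 / (2*sigma^2))."""
    return np.exp(-((n - center) ** 2) / (2 * sigma ** 2))

n = np.linspace(-5, 5, 11)
print(sigmoid(n), hyperbolic_tangent(n), gaussian(n))
```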
II. 2. ANN architectures
The nodes are connected by links which ensure unidirectional and instant
communication.
The neurons are organized in layers; within a layer, the neurons are considered to work in parallel.
[Figure: general ANN with an input layer ($u_1, \dots, u_m$), hidden layers and an output layer ($y_1, \dots, y_k$). Legend: lateral links (between the nodes of the same layer); feedback links (from the output of a neuron to its input); backward links (to the neurons of the previous layers); feedforward links (to the neurons of the next layers).]
Remark:
- the input layer does not perform any processing; it is not counted;
- the hidden layers and the output layer include neurons.
Types of ANN
[Figure: single-layer feedforward ANN. The inputs $u_1, \dots, u_m$ are connected to the $k$ neurons of the output layer; neuron $k$ has weights $w_{k,1}, \dots, w_{k,m}$, bias $b_k$, activation $n_k$ and output $y_k = f_k(n_k)$.]
Let us assume that all activation functions are identical; then the layer can be described compactly by $\mathbf{y} = f(\tilde{\mathbf{W}}\tilde{\mathbf{u}})$, with $\tilde{\mathbf{u}}$ the extended input vector.
Remark:
$\tilde{\mathbf{W}} = [w_{i,j}]$, $i = 1, \dots, k$, $j = 1, \dots, m+1$;
for $w_{i,j}$: the first index indicates the neuron, the second index indicates the link.
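A minimal NumPy sketch of this single-layer computation, assuming the extended-matrix convention above (biases stored in the last column of the extended weight matrix; the function name and the numerical values are illustrative only):

```python
import numpy as np

def layer_output(W_ext, u, f=np.tanh):
    """Single-layer feedforward ANN: y = f(W_ext @ u_ext).

    W_ext : (k, m+1) extended weight matrix; column j = link j, last column = biases
    u     : (m,)     input vector
    """
    u_ext = np.append(u, 1.0)   # extended input [u_1 ... u_m 1]^T
    n = W_ext @ u_ext           # activations n_1, ..., n_k
    return f(n)                 # outputs y_1, ..., y_k

W_ext = np.array([[0.5, -1.0, 0.2],    # neuron 1: w_{1,1}, w_{1,2}, b_1
                  [1.5,  0.3, -0.4]])  # neuron 2: w_{2,1}, w_{2,2}, b_2
print(layer_output(W_ext, np.array([1.0, 2.0])))
```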
Simplified diagram for feedforward ANNs with 1 layer
[Simplified diagram: the inputs $u_1, \dots, u_m$ feed the blocks "Neuron 1", ..., "Neuron k", which produce the outputs $y_1, \dots, y_k$.]
Remark:
The layers can be:
- fully connected: all feedforward links are used;
- partially connected: some feedforward connections are missing.
Example 2: feedforward architectures with two layers
[Figure: detailed diagram of a two-layer feedforward ANN. Layer 1 contains $s$ neurons with weights $w^1_{i,j}$, biases $b^1_i$, activations $n^1_i$ and outputs $y^1_i = f^1_i(n^1_i)$, fed by the inputs $u_1, \dots, u_m$. Layer 2 contains $k$ neurons with weights $w^2_{i,j}$, biases $b^2_i$, activations $n^2_i$ and outputs $y^2_i = f^2_i(n^2_i)$, fed by the outputs of layer 1.]
[Simplified diagram: the inputs $u_1, \dots, u_m$ feed the blocks "Neuron 1, layer 1", ..., "Neuron s, layer 1"; their outputs $y^1_1, \dots, y^1_s$ feed the blocks "Neuron 1, layer 2", ..., "Neuron k, layer 2", which produce $y^2_1, \dots, y^2_k$.]
[Figure: architecture including unit-delay ($q^{-1}$) elements: the delayed outputs are fed back to the neurons together with the inputs $u_1, \dots, u_m$ (recurrent links).]
ANN parameters:
- for sigmoid/linear/step activation functions: weights and biases;
- for Gaussian activation functions: centers and spreads.
III. 3. Multi-layer Perceptron (MLP) - revision
III. 3. 1. MLP architecture
- all the neurons have sigmoidal activation function (linear, sigmoid, tanh)
[Figure: MLP with two layers. Layer 1: $s$ neurons with weights $w^1_{i,j}$, biases $b^1_i$, activations $v^1_i$ and outputs $y^1_i = f^1(v^1_i)$, fed by the inputs $u_1, \dots, u_m$. Layer 2: $k$ neurons with weights $w^2_{i,j}$, biases $b^2_i$, activations $v^2_i$ and outputs $y^2_i = f^2(v^2_i)$, fed by the outputs of layer 1.]
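The forward pass of the two-layer MLP sketched above can be written as follows (a minimal sketch assuming a sigmoid hidden layer and a linear output layer; all names and dimensions are illustrative):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def mlp_forward(u, W1, b1, W2, b2):
    """Two-layer MLP forward pass.

    W1 : (s, m) layer-1 weights, b1 : (s,) layer-1 biases
    W2 : (k, s) layer-2 weights, b2 : (k,) layer-2 biases
    """
    v1 = W1 @ u + b1        # layer-1 activations v^1_i
    y1 = sigmoid(v1)        # layer-1 outputs y^1_i = f^1(v^1_i)
    v2 = W2 @ y1 + b2       # layer-2 activations v^2_i
    y2 = v2                 # linear output layer: y^2_i = v^2_i
    return y1, y2

rng = np.random.default_rng(0)
m, s, k = 3, 4, 2
W1, b1 = rng.normal(size=(s, m)), np.zeros(s)
W2, b2 = rng.normal(size=(k, s)), np.zeros(k)
print(mlp_forward(rng.normal(size=m), W1, b1, W2, b2)[1])
```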
2. Batch learning
All the training samples $(\mathbf{u}_i, \mathbf{d}_i)$, $i = 1, \dots, N$, are presented within a single iteration (epoch).
$$I(n) = \frac{1}{2N}\sum_{j=1}^{N}\sum_{i=1}^{k} e_i^2(n, j),$$
with $e_i(n, j)$ = the error of the $i$th output neuron for the $j$th training sample.
$$w^l_{ij}(n+1) = w^l_{ij}(n) - \eta\,\frac{\partial I}{\partial w^l_{ij}}(n) = w^l_{ij}(n) + \Delta w^l_{ij}(n),$$
$\eta > 0$ influences the convergence speed.
Parameter variation
= learning rate ($\eta$) x local gradient ($\delta$) x input (corresponding to the link)
- For batch learning ($I(n) = \frac{1}{2N}\sum_{j=1}^{N}\sum_{i=1}^{k} e_i^2(n, j)$), where $e_i(n, j)$ = the error of the $i$th output neuron for the $j$th training sample presented at the $n$th epoch:
$$\Delta w^l_{ik}(n) = \frac{1}{N}\sum_{j=1}^{N} \Delta w^l_{ik}(n, j)$$
- the mean of the variations separately computed for each sample.
Backpropagation adaptation equations
k
For the sake of simplicity, online learning is considered I (n) = 0.5 ei 2 (n)
i =1
ei (n) = d i (n) yil (n) , the error produced by the output neuron i for a certain sample
s
vil (n) = wijl y lj1 , with wil0 = bil i y 0l 1 = 1 .
j =0
I I ei ei yil vil
( n) = ( n) (n) = ei (n) ( n) ( n) ( n)
wijl ei wijl yil vil wijl
I
l
(n) = ei (n) (1) f ' i (vil (n)) y lj1 (n) = il (n) y lj1 (n)
wij
I
wijl ( n) = il ( n) y lj1 ( n) with il = ei (n) f 'i (vil (n)) = (n) = local gradient
vijl
Parameter variation
= learning rate ( ) x local gradient ( ) x input (corresponding to the link)
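A minimal NumPy sketch of the output-layer update above, assuming sigmoid output neurons with $a = 1$ (so $f'(v) = f(v)[1 - f(v)]$); the function and variable names are illustrative:

```python
import numpy as np

def output_layer_update(d, y_out, y_prev, eta):
    """Online backpropagation update for the output layer l (sigmoid neurons, a = 1).

    d      : (k,) desired outputs d_i(n)     y_out : (k,) actual outputs y^l_i(n)
    y_prev : (s,) outputs y^{l-1}_j of the previous layer
    Returns (delta, dW, db), with dW[i, j] = eta * delta_i * y_prev[j].
    """
    e = d - y_out                        # e_i(n) = d_i(n) - y^l_i(n)
    f_prime = y_out * (1.0 - y_out)      # f'_i(v^l_i(n)) for the sigmoid, a = 1
    delta = e * f_prime                  # local gradients delta^l_i(n)
    dW = eta * np.outer(delta, y_prev)   # Delta w^l_{ij}(n) = eta * delta^l_i * y^{l-1}_j
    db = eta * delta                     # bias update (the bias sees the input y^{l-1}_0 = 1)
    return delta, dW, db
```

The weights are then updated as $w^l_{ij}(n+1) = w^l_{ij}(n) + \Delta w^l_{ij}(n)$.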
2. For the hidden layers
Considering the hidden layer l, the local gradients within the layers l +1, l +2, .. must be
available from previous computations.
The output of the neuron i belonging to layer l is an input for the neurons belonging to layer l + 1.
For the sake of simplicity: the layer l +1 is considered the output layer.
$y^{l+1}_z(n) = f_{l+1}(v^{l+1}_z(n))$, $z = 1, \dots, k$,
$v^{l+1}_z(n) = \sum_{j=0}^{s} w^{l+1}_{z,j}\, y^l_j(n)$,
$s$ = the number of input connections of neuron $z$ (the number of neurons within the previous layer, $l$), $w^{l+1}_{z,0} = b^{l+1}_z$, $y^l_0 = 1$.
$\delta^{l+1}_z$ is known for $z = 1, \dots, k$.
Layer $l$, with $s$ neurons:
$y^l_i(n) = f_l(v^l_i(n))$, $i = 1, \dots, s$,
$v^l_i(n) = \sum_{j=0}^{q} w^l_{i,j}\, y^{l-1}_j(n)$,
$q$ = the number of input connections of neuron $i$ (the number of neurons belonging to the previous layer, $l-1$).
$$\frac{\partial I}{\partial w^l_{i,j}}(n) = \sum_{z=1}^{k}\frac{\partial I}{\partial e^{l+1}_z}(n)\frac{\partial e^{l+1}_z}{\partial y^{l+1}_z}(n)\frac{\partial y^{l+1}_z}{\partial v^{l+1}_z}(n)\frac{\partial v^{l+1}_z}{\partial y^l_i}(n)\frac{\partial y^l_i}{\partial v^l_i}(n)\frac{\partial v^l_i}{\partial w^l_{i,j}}(n)$$
$$\frac{\partial I}{\partial w^l_{i,j}}(n) = \sum_{z=1}^{k} e^{l+1}_z(n)\cdot(-1)\cdot f'_{l+1}(v^{l+1}_z(n))\cdot w^{l+1}_{z,i}\cdot f'_l(v^l_i(n))\cdot y^{l-1}_j(n)$$
$$\frac{\partial I}{\partial w^l_{i,j}}(n) = -\Big[\sum_{z=1}^{k}\delta^{l+1}_z\, w^{l+1}_{z,i}\Big]\, f'_l(v^l_i(n))\, y^{l-1}_j(n) = -y^{l-1}_j(n)\sum_{z=1}^{k}\delta^{l+1}_z\, w^{l+1}_{z,i}\, f'_l(v^l_i(n)).$$
$$\Delta w^l_{i,j}(n) = \eta\, y^{l-1}_j(n)\, \delta^l_i(n), \quad \text{with } \delta^l_i(n) = \Big[\sum_{z=1}^{k}\delta^{l+1}_z\, w^{l+1}_{z,i}\Big]\, f'_l(v^l_i(n)) = \text{local gradient}$$
Parameter variation
= learning rate ($\eta$) x local gradient ($\delta$) x input (corresponding to the link)
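The corresponding hidden-layer update can be sketched in the same way (again assuming sigmoid neurons with $a = 1$; delta_next and W_next come from the already-processed layer $l + 1$; names are illustrative):

```python
import numpy as np

def hidden_layer_update(delta_next, W_next, y_hidden, y_prev, eta):
    """Online backpropagation update for a hidden layer l (sigmoid neurons, a = 1).

    delta_next : (k,)   local gradients delta^{l+1}_z of layer l+1
    W_next     : (k, s) weights w^{l+1}_{z,i} from layer l to layer l+1 (biases excluded)
    y_hidden   : (s,)   outputs y^l_i of layer l
    y_prev     : (q,)   outputs y^{l-1}_j feeding layer l
    """
    f_prime = y_hidden * (1.0 - y_hidden)       # f'_l(v^l_i(n)) for the sigmoid, a = 1
    delta = (W_next.T @ delta_next) * f_prime   # delta^l_i = [sum_z delta^{l+1}_z w^{l+1}_{z,i}] f'_l(v^l_i)
    dW = eta * np.outer(delta, y_prev)          # Delta w^l_{i,j}(n) = eta * delta^l_i * y^{l-1}_j
    db = eta * delta                            # bias update (input y^{l-1}_0 = 1)
    return delta, dW, db
```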
Remarks
o Sigmoid: $f(v) = \dfrac{1}{1 + \exp(-av)}$, $a > 0$; $\quad f'(v) = \dfrac{a\exp(-av)}{[1 + \exp(-av)]^2} = a\, f(v)\,[1 - f(v)]$
o Hyperbolic tangent: $f(v) = a\tanh(bv)$, $a, b > 0$; $\quad f'(v) = \dfrac{b}{a}\,[a - f(v)]\,[a + f(v)]$
2) Learning rate $\eta > 0$
- For small values: low convergence speed; a quite smooth trajectory is followed within the
search space
- For large values: risk of unstable behavior
Improvements:
$$\Delta w^l_{ij}(n) = -\eta\Big[\alpha^n \frac{\partial I}{\partial w^l_{ij}}(0) + \alpha^{n-1}\frac{\partial I}{\partial w^l_{ij}}(1) + \dots + \frac{\partial I}{\partial w^l_{ij}}(n)\Big], \quad \alpha \in [0, 1].$$
- when $\frac{\partial I}{\partial w^l_{ij}}(t)$ keeps the same sign at successive iterations, the absolute value of $\Delta w^l_{ij}$ increases;
- when $\frac{\partial I}{\partial w^l_{ij}}(t)$ changes the sign at successive iterations, the absolute value of $\Delta w^l_{ij}$ decreases.
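The sum above is usually applied in its equivalent recursive form, $\Delta w(n) = \alpha\,\Delta w(n-1) - \eta\,\partial I/\partial w(n)$; a minimal sketch (names are illustrative):

```python
def momentum_update(grad, prev_dW, eta, alpha):
    """Weight variation with momentum: Delta w(n) = alpha * Delta w(n-1) - eta * dI/dw(n).

    grad    : current gradient dI/dw (scalar or array of the same shape as the weights)
    prev_dW : previous variation Delta w(n-1)
    """
    dW = alpha * prev_dW - eta * grad
    return dW   # apply as w = w + dW and keep dW as prev_dW for the next iteration
```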
3) Online or batch learning?
- For online learning: the training samples must be randomly presented to avoid cycling.
Online learning - recommendations for stopping the training:
o The norm of the gradient becomes close to 0. Disadvantage: numerous epochs can be involved.
o The variation of the criterion $I$ becomes insignificant. Disadvantage: premature stop.
Weight initialization: use uniformly distributed or normally distributed values (mean 0, with a spread chosen to avoid the saturation of the neurons).
6) Efficient exploitation of the training samples
- For online learning, the successive samples should be very different ones. When the training samples are randomly presented, this condition is frequently met.
- Outliers can impede the convergence and can lead to bad generalization capabilities.
- Prefer odd (antisymmetric) activation functions: $f(-v) = -f(v)$.
- Usually, the gradients in the output layer are larger, so $\eta$ should be smaller for the output neurons.
If N is too large or the ANN architecture is too complex, the model becomes overfitted.
1. Function approximations
Theorem:
Any continuous bounded function can be approximated with any desired degree of accuracy $\varepsilon > 0$ by means of an MLP containing:
- a hidden layer with $m$ neurons, characterized by continuous, bounded, monotonic activation functions;
- an output layer with a linear neuron (or a sigmoidal neuron working within its linear region):
$$F(\mathbf{u}) = \sum_{i=1}^{m} \alpha_i\, f\Big(\sum_{j=1}^{R} w_{ij} u_j + b_i\Big),$$
$m$ = the number of hidden neurons, $R$ = the number of inputs, $\alpha_i$ = the output-layer weights.
Remarks regarding the content of this theorem:
- the existence of the MLP is guaranteed;
- the theorem does not give any indication concerning the resulting generalization capacity of the model or the time required for learning;
- the optimal structure is not given;
- the value of m:
o if m is small, the risk of learning the noise captured by the training samples is reduced;
o if m is large, a good accuracy can be obtained;
For ANNs with 2 hidden layers: hidden layer 1 extracts the local properties; hidden layer 2 extracts the global properties.
III. 4. ANN with Radial basis functions - RBF
[Figure: RBF neuron with inputs $p_1, \dots, p_R$, center components $c_1, \dots, c_R$, activation $n$ and output $y = f(n)$.]
$$y = f(\|\mathbf{p} - \mathbf{c}\|) = f\Big(\sqrt{(p_1 - c_1)^2 + \dots + (p_R - c_R)^2}\Big), \quad \mathbf{p} = \begin{bmatrix} p_1 \\ \vdots \\ p_R \end{bmatrix}, \ \mathbf{c} = \begin{bmatrix} c_1 \\ \vdots \\ c_R \end{bmatrix}$$
See demorb1
The neuron is activated only if the input (vector) is similar to the center (vector).
o The accepted similarity level is given by the spread $\sigma$.
o If $\sigma$ is large, the neuron is activated even for a reduced similarity between the input and the center.
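A minimal sketch of a Gaussian RBF neuron illustrating this behavior (the names and the numerical values are chosen for this example only):

```python
import numpy as np

def rbf_neuron(p, c, sigma):
    """Gaussian RBF neuron: y = exp(-||p - c||^2 / (2*sigma^2))."""
    return np.exp(-np.sum((p - c) ** 2) / (2 * sigma ** 2))

c = np.array([1.0, 1.0])
print(rbf_neuron(np.array([1.0, 1.1]), c, sigma=0.5))   # input close to the center -> output close to 1
print(rbf_neuron(np.array([4.0, -3.0]), c, sigma=0.5))  # very dissimilar input -> output close to 0
print(rbf_neuron(np.array([4.0, -3.0]), c, sigma=5.0))  # large spread -> the neuron is activated even here
```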
For inputs which are very dissimilar to the center, the RBF neuron is inactive.
[Figure, for comparison: standard neuron with inputs $p_1, \dots, p_R$, weights $w_1, \dots, w_R$, bias $b$, activation $n$ and output $y = f(n)$.]
For the standard neuron, the input $p_j$ has a more significant influence on the activation of the neuron if the absolute value of $p_j w_j$ is larger.
III. 4.2. RBF architecture
- Because a single hidden layer is considered, the upper index will be dropped for most of the notations (it is kept only to distinguish between the linear and the radial basis activation functions).
[Figure: RBF network. The inputs $u_1, \dots, u_m$ feed $s$ hidden RBF neurons with centers $\mathbf{c}_i = [c_{i1} \ \dots \ c_{im}]^T$, activations $n_i$ and outputs $y_i = f_1(n_i)$; the hidden outputs are combined by a linear output neuron with weights $w_1, \dots, w_s$ and bias $b$: $y = f_2(n)$.]
$$y = f_2(w_1 y_1 + \dots + w_s y_s + b) = w_1 y_1 + \dots + w_s y_s + b = [\,w_1 \ \dots \ w_s\,]\begin{bmatrix} y_1 \\ \vdots \\ y_s \end{bmatrix} + b = \sum_{i=1}^{s} w_i y_i + b$$
Let us consider $\mathbf{f}: \mathbb{R}^m \to \mathbb{R}^s$, with $s$ large, where $\mathbf{f}(\mathbf{u}) = [f_1(\mathbf{u}) \ \dots \ f_s(\mathbf{u})]^T$, $f_1, \dots, f_s: \mathbb{R}^m \to \mathbb{R}$
(e.g., $f_1, \dots, f_s$ indicate the mappings provided by the $s$ hidden neurons).
Definition
The classes $C_1, C_2$ are $\mathbf{f}$-separable if there exists $\mathbf{w} = [w_1 \ \dots \ w_s]^T \in \mathbb{R}^s$ with $\mathbf{w}^T\mathbf{f}(\mathbf{u}) > 0$ for $\mathbf{u} \in C_1$ and $\mathbf{w}^T\mathbf{f}(\mathbf{u}) < 0$ for $\mathbf{u} \in C_2$.
Remarks:
- according to Cover's theorem: choose a large value of $s$ and non-linear $f_1, \dots, f_s$;
- the hyper-plane delimiting the classes (in the $\mathbf{f}$-space) is given by $\mathbf{w}^T\mathbf{f}(\mathbf{u}) = 0$.
For function approximations (interpolation)
Find $F: \mathbb{R}^m \to \mathbb{R}$ that fits the pairs $(\mathbf{u}(i), d(i))$, $i = 1, \dots, N$, i.e., $F(\mathbf{u}(i)) = d(i)$.
a) $f_i(\mathbf{u}) = \sqrt{\|\mathbf{u} - \mathbf{c}_i\|^2 + q_i^2}$, $q_i > 0$: non-local, unbounded;
b) $f_i(\mathbf{u}) = \dfrac{1}{\sqrt{\|\mathbf{u} - \mathbf{c}_i\|^2 + q_i^2}}$, $q_i > 0$: local, bounded;
c) $f_i(\mathbf{u}) = \exp\Big(-\dfrac{\|\mathbf{u} - \mathbf{c}_i\|^2}{2\sigma_i^2}\Big)$: local, bounded.
Knowing that $d(i) = F(\mathbf{u}(i))$, it results the linear system below. Let us consider:
$$\Phi = \begin{bmatrix} f_1(\mathbf{u}(1)) & \dots & f_N(\mathbf{u}(1)) \\ \vdots & \ddots & \vdots \\ f_1(\mathbf{u}(N)) & \dots & f_N(\mathbf{u}(N)) \end{bmatrix} = \text{interpolation matrix}.$$
$$\Phi\begin{bmatrix} w_1 \\ \vdots \\ w_N \end{bmatrix} = \begin{bmatrix} d(1) \\ \vdots \\ d(N) \end{bmatrix} \ \Rightarrow\ \begin{bmatrix} w_1 \\ \vdots \\ w_N \end{bmatrix} = \Phi^{-1}\begin{bmatrix} d(1) \\ \vdots \\ d(N) \end{bmatrix} \quad \text{- if } \Phi \text{ is nonsingular.}$$
Micchelli's theorem (1986)
If the $f_i$ are radial and the samples $\mathbf{u}(i) \in \mathbb{R}^m$ are distinct, then $\Phi$ is nonsingular.
Remarks:
o For $f_i$ of types b) and c), $\Phi$ is positive definite.
o For $f_i$ of type a), $\Phi$ admits $N - 1$ negative eigenvalues and one positive eigenvalue.
Remarks:
- large N (many samples) → many radial basis functions → complex model (over-fitting);
- large N → risk of a poorly conditioned interpolation matrix and large execution times.
It is desirable to use fewer radial basis functions than training samples.
Instead of $F(\mathbf{u}) = \sum_{i=1}^{N} w_i f_i(\|\mathbf{u} - \mathbf{u}(i)\|)$, one has to consider $F(\mathbf{u}) = b + \sum_{i=1}^{s} w_i f_i(\|\mathbf{u} - \mathbf{c}_i\|)$:
- The centers of the radial basis functions and the input training samples are different.
Therefore, it results:
$$G\begin{bmatrix} w_1 \\ \vdots \\ w_s \\ b \end{bmatrix} = \begin{bmatrix} d(1) \\ \vdots \\ d(N) \end{bmatrix} \ \Rightarrow\ \begin{bmatrix} w_1 \\ \vdots \\ w_s \\ b \end{bmatrix} = G^{+}\begin{bmatrix} d(1) \\ \vdots \\ d(N) \end{bmatrix}, \quad G^{+} = (G^T G)^{-1} G^T$$
Example: XOR problem
Classify the samples: $\mathbf{u}(1) = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \in C_1$, $\mathbf{u}(2) = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \in C_2$, $\mathbf{u}(3) = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \in C_2$, $\mathbf{u}(4) = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \in C_1$.
Let us consider:
$$f_1(\mathbf{u}) = \exp\Big(-\Big\|\mathbf{u} - \begin{bmatrix} 1 \\ 1 \end{bmatrix}\Big\|^2\Big) = \exp(-(u_1 - 1)^2 - (u_2 - 1)^2),$$
$$f_2(\mathbf{u}) = \exp\Big(-\Big\|\mathbf{u} - \begin{bmatrix} 0 \\ 0 \end{bmatrix}\Big\|^2\Big) = \exp(-u_1^2 - u_2^2).$$
Therefore, it results:
- $G^{+}$ (given by the MATLAB function pinv):
$$G^{+} = \begin{bmatrix} 1.7942 & -1.2195 & -1.2195 & 0.6448 \\ 0.6448 & -1.2195 & -1.2195 & 1.7942 \\ -0.8780 & 1.3780 & 1.3780 & -0.8780 \end{bmatrix}$$
$$\begin{bmatrix} w_1 \\ w_2 \\ b \end{bmatrix} = G^{+}\begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} -2.439 \\ -2.439 \\ 2.7561 \end{bmatrix}$$
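The same computation can be carried out with a short NumPy sketch (np.linalg.pinv plays the role of the MATLAB pinv; the variable names are chosen for this example only, and small numerical differences with respect to the rounded values above are possible):

```python
import numpy as np

# XOR samples u(1)..u(4) and desired outputs (class C1 -> 0, class C2 -> 1)
U = np.array([[1.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 0.0]])
d = np.array([0.0, 1.0, 1.0, 0.0])

# Gaussian basis functions centred at [1, 1] and [0, 0]
f1 = np.exp(-np.sum((U - np.array([1.0, 1.0])) ** 2, axis=1))
f2 = np.exp(-np.sum((U - np.array([0.0, 0.0])) ** 2, axis=1))

G = np.column_stack([f1, f2, np.ones(4)])   # N x (s+1) matrix [f1(u(i)) f2(u(i)) 1]
w = np.linalg.pinv(G) @ d                   # [w1, w2, b] given by the pseudo-inverse
print(w)                                    # output weights and bias
print(G @ w)                                # network outputs at the 4 samples: [0, 1, 1, 0]
```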
Theorem:
Any continuous function can be approximated (with any desired accuracy) by
$$F(\mathbf{u}) = b + \sum_{i=1}^{s} w_i\, f\Big(\frac{\mathbf{u} - \mathbf{c}_i}{\sigma}\Big), \quad \sigma > 0,$$
if $f: \mathbb{R}^m \to \mathbb{R}$ is bounded and $\int_{\mathbb{R}^m} |f(\mathbf{u})|\, d\mathbf{u} < \infty$.
- The requirements imposed by this theorem are met for the radial functions b), c).
- The radial basis function a) can be used for $s = N$ only.
- $f$ is not necessarily symmetric!
Remark:
the ANN with hidden radial basis activation functions and a linear output neuron is
compliant with the requirements of the previous theorem.
Step 2: assuming the centers and the spread are known, compute the output weights:
$$\begin{bmatrix} w_1 \\ \vdots \\ w_s \\ b \end{bmatrix} = G^{+}\begin{bmatrix} d(1) \\ \vdots \\ d(N) \end{bmatrix}, \quad \text{with } G = \begin{bmatrix} f_1(\mathbf{u}(1)) & \dots & f_s(\mathbf{u}(1)) & 1 \\ \vdots & \ddots & \vdots & \vdots \\ f_1(\mathbf{u}(N)) & \dots & f_s(\mathbf{u}(N)) & 1 \end{bmatrix} \in \mathbb{R}^{N \times (s+1)}$$
If the centers are known, the output weights can be computed in a single
step.
- generalization = interpolation
For Gaussian functions:
$$F(\mathbf{u}) = b + \sum_{i=1}^{s} w_i \exp\Big(-\frac{\|\mathbf{u} - \mathbf{c}_i\|^2}{2\sigma^2}\Big),$$
when the same spread is employed for all the hidden neurons.
Recommendation: choose s = 3 N
Comparison between MLP and RBF
[Comparison table: MLP vs. RBF]
2. Self-organized (unsupervised) selection of the centers
Step 1-1: For the training sample $\mathbf{u}(n)$, compute $\|\mathbf{u}(n) - \mathbf{c}_i\|$, $i = 1, \dots, s$, and find the minimum distance, which indicates the nearest center for this sample. Consider $i$, with $i \in \{1, \dots, s\}$, the index of the nearest center.
Step 1-2: Update the nearest center, moving it toward the sample:
$$\mathbf{c}_i(n+1) = \mathbf{c}_i(n) + \eta\,[\mathbf{u}(n) - \mathbf{c}_i(n)], \quad \text{with } 1 > \eta > 0$$
Step 1-3: $n \leftarrow n + 1$
Step 1-4: If some training samples have not been used yet or the change made at step 1-2 is too large, go to step 1-1.
Drawback: the result depends on the initial values
Step 2. Compute the spread
$$\sigma = \frac{d_{\max}}{\sqrt{2s}},$$
with
$d_{\max}$ = the maximum distance between the selected centers,
$s$ = the number of hidden neurons.
Step 3. Compute the weights and the bias.
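A compact sketch of this center-selection procedure, assuming the update rule and the spread formula reconstructed above (the function name, the initialization and the stopping rule by a fixed number of epochs are illustrative simplifications):

```python
import numpy as np

def select_centers(U, s, eta=0.1, epochs=20, seed=0):
    """Self-organized selection of the RBF centers (steps 1-1 .. 1-4) and of the spread (step 2).

    U : (N, m) training inputs; s : number of hidden neurons.
    """
    rng = np.random.default_rng(seed)
    centers = U[rng.choice(len(U), size=s, replace=False)].astype(float)  # initial centers
    for _ in range(epochs):
        for u in U[rng.permutation(len(U))]:                     # random presentation order
            i = np.argmin(np.linalg.norm(u - centers, axis=1))   # step 1-1: nearest center
            centers[i] += eta * (u - centers[i])                 # step 1-2: move it toward the sample
    d_max = max(np.linalg.norm(ci - cj) for ci in centers for cj in centers)
    sigma = d_max / np.sqrt(2 * s)                               # step 2: spread
    return centers, sigma                                        # step 3: weights/bias via the pseudo-inverse
```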
3. Supervised centers selection
LMS algorithm:
- for weights:
$$w_i \leftarrow w_i - \eta_1 \frac{\partial I}{\partial w_i}, \quad \text{with}$$
$$\frac{\partial I}{\partial w_i} = -\sum_{j=1}^{N} e(j)\, f\Big(\frac{\|\mathbf{u}(j) - \mathbf{c}_i\|}{\sigma_i}\Big) = -\sum_{j=1}^{N} e(j)\exp\Big(-\frac{\|\mathbf{u}(j) - \mathbf{c}_i\|^2}{2\sigma_i^2}\Big)$$
- for centers:
$$\mathbf{c}_i \leftarrow \mathbf{c}_i - \eta_2 \frac{\partial I}{\partial \mathbf{c}_i}, \quad \text{with}$$
$$\frac{\partial I}{\partial (\mathbf{c}_i)_k} = -\frac{w_i}{\sigma_i^2}\sum_{j=1}^{N} e(j)\exp\Big(-\frac{\|\mathbf{u}(j) - \mathbf{c}_i\|^2}{2\sigma_i^2}\Big)\,[\mathbf{u}(j)_k - (\mathbf{c}_i)_k],$$
with $(\mathbf{c}_i)_k$, $\mathbf{u}(j)_k$ indicating the $k$th component of the vectors $\mathbf{c}_i$, $i = 1, \dots, s$, and $\mathbf{u}(j)$, $j = 1, \dots, N$ (having the length $m$)
- for spreads:
$$\sigma_i \leftarrow \sigma_i - \eta_3 \frac{\partial I}{\partial \sigma_i}, \quad \text{with}$$
$$\frac{\partial I}{\partial \sigma_i} = -\sum_{j=1}^{N} e(j)\,\frac{\partial f}{\partial \sigma_i}\Big(\frac{\|\mathbf{u}(j) - \mathbf{c}_i\|}{\sigma_i}\Big) = -\frac{w_i}{\sigma_i^3}\sum_{j=1}^{N} e(j)\exp\Big(-\frac{\|\mathbf{u}(j) - \mathbf{c}_i\|^2}{2\sigma_i^2}\Big)\,\|\mathbf{u}(j) - \mathbf{c}_i\|^2$$
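A vectorized sketch of one batch gradient step, assuming the Gaussian basis and the derivatives above (the function name is illustrative, and the bias update, which is not given explicitly in the notes, is added here for completeness):

```python
import numpy as np

def rbf_gradient_step(U, d, w, b, C, sigmas, eta1, eta2, eta3):
    """One batch gradient step for the supervised adjustment of the RBF parameters.

    U : (N, m) inputs, d : (N,) desired outputs, w : (s,) output weights,
    b : bias, C : (s, m) centers, sigmas : (s,) spreads.
    """
    diff = U[:, None, :] - C[None, :, :]                    # u(j) - c_i, shape (N, s, m)
    dist2 = np.sum(diff ** 2, axis=2)                       # ||u(j) - c_i||^2, shape (N, s)
    Phi = np.exp(-dist2 / (2 * sigmas ** 2))                # hidden outputs, shape (N, s)
    e = d - (Phi @ w + b)                                   # errors e(j)

    dI_dw = -Phi.T @ e                                      # dI/dw_i
    dI_dC = -(w / sigmas ** 2)[:, None] * np.einsum('j,ji,jik->ik', e, Phi, diff)
    dI_dsig = -(w / sigmas ** 3) * ((Phi * dist2).T @ e)    # dI/dsigma_i
    dI_db = -np.sum(e)                                      # bias treated as a weight with constant input 1

    return (w - eta1 * dI_dw, b - eta1 * dI_db,
            C - eta2 * dI_dC, sigmas - eta3 * dI_dsig)
```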
4. Constructive algorithm
- the hidden neurons are inserted sequentially; the center vector of each new neuron copies the input training sample that produces the highest output squared error for the current architecture.