J. Kubalk, Gerstner Laboratory for Intelligent Decision Making and Control
Steepest Descent Direction
The direction is determined by the derivative of E with
respect to each component of w.
Gradient of E:
    ∇E(w) = (∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_n)
Training rule for gradient descent:
    w_i ← w_i + Δw_i, where Δw_i = −η ∂E/∂w_i
Differentiating E from the equation (X):
    ∂E/∂w_i = ∂/∂w_i (1/2 Σ_{k∈Train} (d_k − y_k)²) = Σ_{k∈Train} (d_k − y_k)(−x_{ki})
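The differentiation result can be verified numerically. A minimal sketch, assuming a linear unit y = w · x and illustrative random data (all names are hypothetical), checks the analytic gradient against central finite differences:

```python
import numpy as np

# Error over the training set for a linear unit y = w . x
def error(w, X, d):
    y = X @ w
    return 0.5 * np.sum((d - y) ** 2)

# Analytic gradient from the slide: dE/dw_i = sum_k (d_k - y_k)(-x_ki)
def grad(w, X, d):
    y = X @ w
    return -X.T @ (d - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 examples, 3 weights
d = rng.normal(size=5)
w = rng.normal(size=3)

# Central finite-difference approximation of each component
eps = 1e-6
num = np.array([
    (error(w + eps * np.eye(3)[i], X, d) - error(w - eps * np.eye(3)[i], X, d)) / (2 * eps)
    for i in range(3)
])
print(np.allclose(num, grad(w, X, d), atol=1e-5))  # True
```

The same check pattern works for any differentiable error function, which makes it a handy sanity test when deriving gradients by hand.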
Gradient-Descent Algorithm
Initialise each w_i to some small random value
Until the termination condition is met, do
    initialise each Δw_i to zero
    for each (x_k, d_k) in training examples do
        compute the output y_k
        for each weight w_i do
            Δw_i ← Δw_i + η(d_k − y_k) x_{ki}
    for each w_i do
        w_i ← w_i + Δw_i
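The pseudocode above can be sketched in Python. This assumes a linear unit y = w · x; the function name, learning rate, and fixed-epoch termination condition are illustrative choices:

```python
import numpy as np

def gradient_descent(X, d, eta=0.05, epochs=500):
    """Batch gradient descent for a linear unit y = w . x."""
    rng = np.random.default_rng(1)
    w = rng.uniform(-0.05, 0.05, size=X.shape[1])  # small random initial weights
    for _ in range(epochs):                        # termination: fixed epoch count
        dw = np.zeros_like(w)                      # initialise each Δw_i to zero
        for x_k, d_k in zip(X, d):                 # for each training example
            y_k = w @ x_k                          # compute the output
            dw += eta * (d_k - y_k) * x_k          # Δw_i ← Δw_i + η(d_k − y_k)x_ki
        w += dw                                    # w_i ← w_i + Δw_i
    return w

# Fit a realisable linear target d = 2*x1 - 3*x2
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
d = 2.0 * X[:, 0] - 3.0 * X[:, 1]
w = gradient_descent(X, d)
print(np.round(w, 2))  # ≈ [ 2. -3.]
```

Because the weights are updated only after the whole training set has been seen, this is the "true" (batch) form of gradient descent referred to later.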
Remarks on Gradient-Descent Algorithm
It can be used whenever
search space of weight vectors is continuous
the error can be differentiated with respect to the weights
Difficulties in applying the algorithm are
converging to an optimum can be slow
no guarantee that the global optimum will be found
Delta Rule: w_i ← w_i + η(d_k − y_k) x_{ki}
weights are updated upon examining each training example
less computation time per weight update step
can sometimes avoid falling into local optima
Delta rule converges toward the minimum error weights
regardless of whether the training data are linearly separable
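The incremental delta rule differs from the batch version only in applying the update immediately after each example. A minimal sketch (hyperparameters are illustrative; on realisable linear data the per-example updates converge to the minimum-error weights):

```python
import numpy as np

def delta_rule(X, d, eta=0.1, epochs=1000):
    """Incremental (stochastic) delta rule: weights updated after each example."""
    rng = np.random.default_rng(2)
    w = rng.uniform(-0.05, 0.05, size=X.shape[1])
    for _ in range(epochs):
        for x_k, d_k in zip(X, d):
            y_k = w @ x_k
            w += eta * (d_k - y_k) * x_k   # w_i ← w_i + η(d_k − y_k)x_ki
    return w

# Realisable linear target d = 2*x1 - 3*x2
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
d = 2.0 * X[:, 0] - 3.0 * X[:, 1]
print(np.round(delta_rule(X, d), 2))
```

Each pass performs one update per example instead of one per epoch, which is the "less computation time per weight update step" mentioned above.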
Rosenblatt's Simple Perceptron
Designed for the task of pattern recognition
Single layer net of perceptrons
Perceptron learning rule
Limitations - works only when the objects (classes) are separable by a hyperplane.
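A minimal sketch of the perceptron learning rule, assuming a threshold unit with the bias folded in as a constant input x_0 = 1 (the learning rate and the AND task, which is linearly separable, are illustrative choices):

```python
import numpy as np

def perceptron_train(X, d, eta=0.1, max_epochs=100):
    """Perceptron learning rule for a single threshold unit."""
    Xb = np.hstack([np.ones((len(X), 1)), X])   # prepend constant input for the bias
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_k, d_k in zip(Xb, d):
            y_k = 1 if w @ x_k > 0 else 0       # threshold activation
            if y_k != d_k:
                w += eta * (d_k - y_k) * x_k    # update only on misclassification
                errors += 1
        if errors == 0:                          # converged: all examples correct
            break
    return w

# Logical AND is linearly separable, so the rule is guaranteed to converge
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
d = np.array([0, 0, 0, 1])
w = perceptron_train(X, d)
print([(1 if w @ np.r_[1, x] > 0 else 0) for x in X])  # [0, 0, 0, 1]
```

On XOR, which is not linearly separable, the same loop would never reach `errors == 0` — exactly the limitation stated above.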
Neural Networks
Interconnected net of formal neurons
input (receptors), hidden, output (effectors)
State and configuration of NN
Dynamics:
organisational - topology and its change
recurrent, feed-forward, multi-layer
activation - initialisation of the state and its change
continuous, discrete, sequential, parallel, activation function
adaptive - initial configuration and learning algorithm
supervised vs. unsupervised learning
Multi-Layer Networks
Feed-forward networks with intermediate "hidden" layer(s)
n-layer network (n hidden layers)
Geometric Interpretation of the Function
of the Multi-Layer Networks
XOR by means of 2-Layer Network
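One concrete 2-layer solution of XOR can be written down by hand. The particular weights below are an assumption (one of many valid choices): the hidden units compute OR and NAND, and the output unit ANDs them, since XOR(x1, x2) = OR(x1, x2) AND NAND(x1, x2):

```python
import numpy as np

def step(z):
    return (z > 0).astype(int)   # threshold activation

# Hand-set weights; first column is the bias (constant input 1)
W_hidden = np.array([[-0.5, 1.0, 1.0],    # OR:   x1 + x2 - 0.5 > 0
                     [ 1.5, -1.0, -1.0]]) # NAND: 1.5 - x1 - x2 > 0
w_out = np.array([-1.5, 1.0, 1.0])        # AND of the two hidden outputs

def xor_net(x1, x2):
    h = step(W_hidden @ np.array([1, x1, x2]))
    return int(step(w_out @ np.r_[1, h]))

print([xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```

Geometrically, each hidden unit contributes one separating hyperplane, and the output unit intersects the two half-planes — the construction a single perceptron cannot express.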
Multi-Layer Networks and
Backpropagation Algorithm
Requires units whose output is
a nonlinear function of their inputs
a differentiable function of their inputs
Activation function:
    σ(ξ) = 1 / (1 + e^(−ξ))
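The sigmoid satisfies σ′(ξ) = σ(ξ)(1 − σ(ξ)), an identity the backpropagation derivation relies on because the derivative can be computed from the unit's output alone. A small sketch with a numerical check:

```python
import numpy as np

def sigmoid(xi):
    """Logistic activation: maps any potential into (0, 1), differentiable everywhere."""
    return 1.0 / (1.0 + np.exp(-xi))

# The derivative expressed through the output itself: σ'(ξ) = σ(ξ)(1 − σ(ξ))
def sigmoid_prime(xi):
    y = sigmoid(xi)
    return y * (1.0 - y)

# Central finite-difference check at a few points
xs = np.array([-2.0, 0.0, 3.0])
eps = 1e-6
num = (sigmoid(xs + eps) - sigmoid(xs - eps)) / (2 * eps)
print(np.allclose(num, sigmoid_prime(xs), atol=1e-8))  # True
```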
Adaptation of Weights in Multi-Layer
Networks
Error function:
    E(w) = Σ_{k=1..p} E_k(w)
    E_k(w) = 1/2 Σ_{j∈Y} (y_j(w, x_k) − d_{kj})²
Adaptation step:
    w_ji^(t) = w_ji^(t−1) + Δw_ji^(t)
where
    Δw_ji^(t) = −c · ∂E/∂w_ji (w^(t−1))
and 0 < c < 1 is the learning rate
Visualisation of the process
Backpropagation Algorithm
We can write:
    ∂E/∂w_ji = Σ_{k=1..p} ∂E_k/∂w_ji
In order to get the derivative we use the chain rule:
    ∂E_k/∂w_ji = (∂E_k/∂y_j) · (∂y_j/∂ξ_j) · (∂ξ_j/∂w_ji)
where we get as derivative of the potential:
    ∂ξ_j/∂w_ji = y_i
and the derivative of the activation function is:
    ∂y_j/∂ξ_j = y_j (1 − y_j)
Backpropagation Algorithm
Substituting expressions, we obtain
    ∂E_k/∂w_ji = (∂E_k/∂y_j) · y_j (1 − y_j) · y_i
Derivation of Training Rule for Output
vs. Hidden Units
Output unit:
    ∂E_k/∂y_j = y_j − d_{kj} ,
that is, the error of the j-th neuron on the k-th example
Hidden unit:
    ∂E_k/∂y_j = Σ_{r∈j→} (∂E_k/∂y_r)(∂y_r/∂ξ_r)(∂ξ_r/∂y_j) = Σ_{r∈j→} (∂E_k/∂y_r) · y_r (1 − y_r) · w_rj
because y_j is used in the calculation of the internal potential of all
units whose inputs include the output of unit j
In this way the errors are propagated backwards from the
output layer to the first hidden layer
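These rules — the output error y_j − d_kj, hidden errors summed backwards over the outgoing weights w_rj, and the step Δw_ji = −c · δ_j · y_i — can be assembled into a minimal backpropagation sketch. The network size, learning rate, epoch count, and the XOR task are assumptions for illustration:

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def train_backprop(X, D, n_hidden=4, c=0.5, epochs=10000, seed=0):
    """2-layer network trained with the backpropagation rules derived here."""
    rng = np.random.default_rng(seed)
    # Weight matrices carry the bias as an extra first column (constant input 1)
    W1 = rng.uniform(-0.5, 0.5, size=(n_hidden, X.shape[1] + 1))
    W2 = rng.uniform(-0.5, 0.5, size=(D.shape[1], n_hidden + 1))
    for _ in range(epochs):
        for x_k, d_k in zip(X, D):
            a0 = np.r_[1.0, x_k]                 # forward pass
            h = sigmoid(W1 @ a0)
            a1 = np.r_[1.0, h]
            y = sigmoid(W2 @ a1)
            # output units: dE_k/dy_j = y_j - d_kj, times y_j(1 - y_j)
            delta_out = (y - d_k) * y * (1.0 - y)
            # hidden units: errors propagated backwards through the weights w_rj
            delta_hid = (W2[:, 1:].T @ delta_out) * h * (1.0 - h)
            W2 -= c * np.outer(delta_out, a1)    # Δw_ji = -c · δ_j · y_i
            W1 -= c * np.outer(delta_hid, a0)
    return W1, W2

# XOR: the classic task a single-layer net cannot solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = train_backprop(X, D)
for x_k in X:
    y = sigmoid(W2 @ np.r_[1.0, sigmoid(W1 @ np.r_[1.0, x_k])])
    print(x_k, np.round(y, 2))
```

The updates are applied per example, i.e. this is the stochastic variant of gradient descent discussed in the remarks that follow.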
Visualisation of the Backpropagation in
a Three-Layer Network
Remarks on Multi-Layer Networks and
Backpropagation
Speed of Backpropagation
Convergence to local optimum
the more weights the more dimensions that might
provide "escape routes" from the local optimum
Stochastic gradient descent rather than true gradient
descent - less likely to get stuck in a local optimum
Training multiple networks using the same data but
different initial weight vectors
Remarks on Multi-Layer Networks and
Backpropagation
Some hints on how to choose the topology:
complexity of the net should reflect the complexity of
the problem
overfitting
heuristics:
1. layer: a few more units than the number of inputs
2. layer: (N_outputs + N_layer1) / 2
try and adjust
Constructive algorithms - Cascade Correlation Algorithm
Overfitting in Multi-Layer Networks