
J. Kubalík, Gerstner Laboratory for Intelligent Decision Making and Control
Artificial Neural Networks: Motivation
Machine learning paradigm motivated by biological learning systems
Human brain
interconnected network of 10^11 neurons
each connected to 10^4 others
neuron switching time 10^-3 seconds
surprisingly complex decisions, surprisingly quickly (10^-1 s)
information processing abilities follow from highly parallel processing
robust & fault tolerant
flexible
deals with fuzzy, noisy, or inconsistent information
Schematic Drawing of a Biological Neuron
dendrites - inputs
axon - output
synaptic junctions - excitatory or inhibitory
activation potential, threshold and firing of cell
transmission of signals from receptors (which receive stimuli
from the environment) to effectors (executive units)
Appropriate Problems for ANN
Problems:
instances are represented by many attribute-value pairs
target function output may be discrete- or real- valued
training examples may contain errors
long training times are acceptable
fast evaluation of the learned function is required
generalisation abilities
Implementation issues:
What is the best architecture - number of neurons, layers,
connections between units, ...
How to train the ANN - how many examples, how many learning
cycles, type of training examples, ...
What can the network do - what tasks, how well, how fast, ...
Perceptron - Formal Model of Neuron





x_1, ..., x_n - inputs
w_1, ..., w_n - synaptic weights
w_0 - threshold
ξ = Σ_{i=0}^{n} w_i x_i - internal potential of the neuron (x_0 = 1 by convention)
o() - activation function: o(ξ) = sgn(ξ)
y - output: y = o(ξ)
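A minimal sketch of this formal neuron in Python (the function name and the x_0 = 1 bias convention are illustrative assumptions, not part of the slides):

import numpy as np

def perceptron_output(w, x):
    """Formal neuron: y = o(xi) = sgn(xi), xi = sum_{i=0..n} w_i * x_i.
    w[0] plays the role of the threshold; the constant input x_0 = 1 is prepended."""
    x_ext = np.concatenate(([1.0], x))   # x_0 = 1, x_1, ..., x_n
    xi = np.dot(w, x_ext)                # internal potential of the neuron
    return 1 if xi >= 0 else -1          # sgn() activation (sgn(0) taken as +1 here)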
Geometric Interpretation of the Function of a Single Perceptron
A single perceptron performs classification into 2 classes
with the decision hyperplane
w.x = 0
only linearly separable sets of examples
Perceptron can represent all primitive boolean functions
AND, OR, NAND, NOR (but not XOR).
every boolean function can be represented by some
network of perceptrons
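For instance, reusing perceptron_output from the sketch above, one possible choice of weights (the particular values are illustrative assumptions) realises AND and OR, while no single weight vector can realise XOR:

# Inputs are in {0, 1}; output +1 is read as "true" and -1 as "false".
AND_WEIGHTS = np.array([-1.5, 1.0, 1.0])   # fires only when both inputs are 1
OR_WEIGHTS  = np.array([-0.5, 1.0, 1.0])   # fires when at least one input is 1

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_output(AND_WEIGHTS, np.array(x)),
             perceptron_output(OR_WEIGHTS, np.array(x)))

# XOR would have to fire for (0,1) and (1,0) but not for (0,0) and (1,1);
# no single hyperplane w.x = 0 separates these points, hence "not XOR" above.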
Perceptron Learning Rule
How to learn the weights for a single perceptron?
1. Begin with random weights
2. Apply the perceptron to each training example (x_k, d_k) and
modify the weights whenever it misclassifies an example:
w_i ← w_i + Δw_i
where Δw_i = η(d_k - y_k) x_i and η is the learning rate
3. If at least one example was misclassified, continue with step 2;
otherwise end.
The convergence of this procedure is assured if the training
examples are linearly separable
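A sketch of this procedure (the data format, learning rate value, and the max_epochs safety cap for non-separable data are assumptions of this sketch):

import numpy as np

def train_perceptron(examples, eta=0.1, max_epochs=1000):
    """Perceptron learning rule.  examples: list of (x, d) pairs, d in {-1, +1}.
    Implements w_i <- w_i + eta * (d_k - y_k) * x_i on every misclassified example."""
    n = len(examples[0][0])
    w = np.random.uniform(-0.05, 0.05, n + 1)            # 1. random initial weights
    for _ in range(max_epochs):
        misclassified = False
        for x, d in examples:                             # 2. visit every training example
            x_ext = np.concatenate(([1.0], np.asarray(x, dtype=float)))
            y = 1 if np.dot(w, x_ext) >= 0 else -1
            if y != d:
                w += eta * (d - y) * x_ext                # update only on mistakes
                misclassified = True
        if not misclassified:                             # 3. stop when all are correct
            return w
    return w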
Gradient Descent and Delta Rule
Gradient descent used to search the space of possible weight vectors
linear unit: y(x) = w.x
training error:
E(w) = 1/2 Σ_{k∈Train} (d_k - y_k)^2     (1)
Weight vector is altered in order to find the minimum error.
Converges to the best approximation of the target function.
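Stated as code, the training error of the linear unit over a set of examples (the array-based data format is an assumption of this sketch):

import numpy as np

def training_error(w, X, d):
    """E(w) = 1/2 * sum_k (d_k - y_k)^2 for the linear unit y_k = w . x_k.
    X: matrix whose rows are the (bias-extended) input vectors x_k; d: targets."""
    y = X @ w                          # outputs of the linear unit on all examples
    return 0.5 * np.sum((d - y) ** 2)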
Steepest Descent Direction
The direction is determined by the derivative of E with
respect to each component of w.
Gradient of E:
∇E(w) = [∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n]
Training rule for gradient descent:
w_i ← w_i + Δw_i, where Δw_i = -η ∂E/∂w_i
Differentiating E from equation (1):
∂E/∂w_i = ∂/∂w_i [ 1/2 Σ_{k∈Train} (d_k - y_k)^2 ] = Σ_{k∈Train} (d_k - y_k)(-x_ki)
so that Δw_i = η Σ_{k∈Train} (d_k - y_k) x_ki
Gradient-Descent Algorithm
Initialise each w_i to some small random value
Until the termination condition is met, do
  initialise each Δw_i to zero
  for each (x_k, d_k) in training examples do
    compute the output y_k
    for each weight w_i do
      Δw_i ← Δw_i + η(d_k - y_k) x_ki
  for each weight w_i do
    w_i ← w_i + Δw_i

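A runnable sketch of the batch procedure above for the linear unit (using a fixed number of epochs as the termination condition is a simplification of this sketch):

import numpy as np

def gradient_descent(X, d, eta=0.01, epochs=1000):
    """Batch gradient descent for the linear unit y = w . x.
    X: rows are bias-extended input vectors x_k; d: desired outputs d_k."""
    w = np.random.uniform(-0.05, 0.05, X.shape[1])   # small random initial weights
    for _ in range(epochs):                          # "until termination condition"
        delta_w = np.zeros_like(w)                   # initialise each Delta w_i to zero
        for x_k, d_k in zip(X, d):
            y_k = np.dot(w, x_k)                     # compute the output y_k
            delta_w += eta * (d_k - y_k) * x_k       # accumulate eta * (d_k - y_k) * x_ki
        w += delta_w                                 # apply the summed update once per pass
    return w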
Remarks on Gradient-Descent Algorithm
It can be used whenever
search space of weight vectors is continuous
the error can be differentiated with respect to the weights
Difficulties in applying the algorithm are
converging to an optimum can be slow
no guarantee that the global optimum will be found
Delta Rule: w_i ← w_i + η(d_k - y_k) x_ki

weights are updated upon examining each training example
less computation time per weight update step
can sometimes avoid falling into local optima
Delta rule converges toward the minimum error weights
regardless of whether the training data are linearly separable
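The incremental (stochastic) variant of the previous sketch applies the update inside the loop over examples instead of accumulating it over the whole set, e.g.:

# Delta rule: update w immediately after examining each training example.
for x_k, d_k in zip(X, d):
    y_k = np.dot(w, x_k)
    w += eta * (d_k - y_k) * x_k      # w_i <- w_i + eta * (d_k - y_k) * x_ki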
Rosenblatt's Simple Perceptron
Designed for the task of pattern recognition
Single layer net of perceptrons







Perceptron learning rule
Limitation - only objects separable by a hyperplane can be recognised.
Neural Networks
Interconnected net of formal neurons
input (receptors), hidden, output (effectors)
State and configuration of NN
Dynamics:
organisational - topology and its change
recurrent, feed-forward, multi-layer
activation - initialisation of the state and its change
continuous, discrete, sequential, parallel, activation
function
adaptive - initial configuration and learning algorithm
supervised vs. unsupervised learning
Multi-Layer Networks
Feed-forward networks with intermediate "hidden" layer(s)
n-layer network (n hidden layers)


Geometric Interpretation of the Function of Multi-Layer Networks




XOR by means of 2-Layer Network
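As a concrete illustration, a minimal sketch with one possible choice of hand-picked weights (the particular values are illustrative assumptions, not necessarily those of the original figure): the two hidden units compute OR and NAND, and the output unit combines them with AND.

def step(xi):
    return 1 if xi >= 0 else 0                 # threshold unit with outputs in {0, 1}

def xor_net(x1, x2):
    """Two-layer perceptron network computing XOR."""
    h1 = step(-0.5 + 1.0 * x1 + 1.0 * x2)      # hidden unit 1: OR(x1, x2)
    h2 = step( 1.5 - 1.0 * x1 - 1.0 * x2)      # hidden unit 2: NAND(x1, x2)
    return step(-1.5 + 1.0 * h1 + 1.0 * h2)    # output unit: AND(h1, h2)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_net(a, b))                 # prints 0, 1, 1, 0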




Multi-Layer Networks and
Backpropagation Algorithm
Requires units whose output is
a nonlinear function of its inputs
a differentiable function of its inputs

Activation function:
o(ξ) = 1 / (1 + e^(-ξ))
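In code (a straightforward sketch):

import numpy as np

def sigmoid(xi):
    """o(xi) = 1 / (1 + e^(-xi)) - nonlinear and differentiable, unlike sgn()."""
    return 1.0 / (1.0 + np.exp(-xi))

def sigmoid_derivative(y):
    """Derivative expressed through the output itself: o'(xi) = y * (1 - y);
    this is the factor y_j * (1 - y_j) used in the backpropagation formulas below."""
    return y * (1.0 - y)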
Adaptation of Weights in Multi-Layer
Networks
Error function:
E(w) = Σ_{k=1}^{p} E_k(w)
E_k(w) = 1/2 Σ_{j∈Y} (y_j(w, x_k) - d_kj)^2
Adaptation step:
w_ji^(t) = w_ji^(t-1) + Δw_ji^(t)
where Δw_ji^(t) = -ε ∂E/∂w_ji (w^(t-1))
and 0 < ε < 1 is the learning rate
Visualisation of the process
Backpropagation Algorithm
We can write:
∂E/∂w_ji = Σ_{k=1}^{p} ∂E_k/∂w_ji
In order to get the derivative we use the chain rule:
∂E_k/∂w_ji = ∂E_k/∂y_j · ∂y_j/∂ξ_j · ∂ξ_j/∂w_ji
where the derivative of the internal potential is:
∂ξ_j/∂w_ji = y_i
and the derivative of the activation function is:
∂y_j/∂ξ_j = y_j (1 - y_j)
Backpropagation Algorithm
Substituting expressions, we obtain
∂E_k/∂w_ji = ∂E_k/∂y_j · y_j (1 - y_j) · y_i
Derivation of Training Rule for Output
vs. Hidden Units
Output unit:
∂E_k/∂y_j = y_j - d_kj
that is, the error of the j-th neuron on the k-th example
Hidden unit:
∂E_k/∂y_j = Σ_r ∂E_k/∂y_r · ∂y_r/∂ξ_r · ∂ξ_r/∂y_j = Σ_r ∂E_k/∂y_r · y_r (1 - y_r) · w_rj
(the sum runs over all units r that receive the output of unit j)
because y_j is used in the calculation of the internal potentials of all
units whose inputs include the output of unit j
In this way the errors are propagated backwards from the
output layer to the first hidden layer
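Putting the derived formulas together, a minimal sketch of backpropagation for a single hidden layer of sigmoid units (the network sizes, data format, and on-line update scheme are assumptions of this sketch):

import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def train_backprop(X, D, n_hidden=4, eta=0.5, epochs=5000):
    """Backpropagation for one hidden layer.
    X: rows are input vectors x_k; D: rows are desired output vectors d_k."""
    rng = np.random.default_rng(0)
    V = rng.uniform(-0.5, 0.5, (n_hidden, X.shape[1] + 1))   # input  -> hidden weights
    W = rng.uniform(-0.5, 0.5, (D.shape[1], n_hidden + 1))   # hidden -> output weights

    for _ in range(epochs):
        for x_k, d_k in zip(X, D):
            # forward pass
            x_ext = np.concatenate(([1.0], x_k))
            h = sigmoid(V @ x_ext)                   # hidden outputs y_j
            h_ext = np.concatenate(([1.0], h))
            y = sigmoid(W @ h_ext)                   # network outputs

            # backward pass: delta_j = dE_k/dy_j * y_j * (1 - y_j)
            delta_out = (y - d_k) * y * (1.0 - y)                   # output units: dE_k/dy_j = y_j - d_kj
            delta_hid = (W[:, 1:].T @ delta_out) * h * (1.0 - h)    # hidden units: sum_r delta_r * w_rj

            # adaptation step: w_ji <- w_ji - eta * delta_j * y_i
            W -= eta * np.outer(delta_out, h_ext)
            V -= eta * np.outer(delta_hid, x_ext)
    return V, W

Applied, for example, to the XOR data from the earlier sketch (X = [[0,0],[0,1],[1,0],[1,1]], D = [[0],[1],[1],[0]]), the outputs typically settle near the desired 0/1 values after a few thousand epochs.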
Visualisation of the Backpropagation in
a Three-Layer Network


Remarks on Multi-Layer Networks and
Backpropagation
Speed of Backpropagation
Convergence to local optimum
the more weights the more dimensions that might
provide "escape routes" from the local optimum
Stochastic gradient descent rather than true gradient
descent - less likely to get stuck in a local optimum
Training multiple networks using the same data but
different initial weight vectors
Remarks on Multi-Layer Networks and
Backpropagation
Some hints for choosing the topology:
complexity of the net should reflect the complexity of
the problem
overfitting
heuristics:
1st layer: a few more units than the number of inputs
2nd layer: (N_outputs + N_layer1) / 2
try and adjust
Constructive algorithms - Cascade Correlation Algorithm
Overfitting in Multi-Layer Networks
