Dan Simon
Cleveland State University
Neural Networks
Artificial Neural Network (ANN): An
information processing paradigm that is
inspired by biological neurons
Distinctive structure: Large number of
simple, highly interconnected processing
elements (neurons); parallel processing
Inductive learning, that is, learning by
example; an ANN is configured for a
specific application through a learning
process
Learning involves adjustments to
connections between the neurons
Inductive Learning
Sometimes we can't explain how we know
something; we rely on our experience
An ANN can generalize from expert
knowledge and re-create expert behavior
Example: An ER doctor considers a
patient's age, blood pressure, heart rate,
ECG, etc., and makes an educated guess
about whether or not the patient had a
heart attack
The Birth of ANNs
The first artificial neuron was
proposed in 1943 by
neurophysiologist Warren McCulloch
and the psychologist/logician Walter
Pitts
No computing resources at that time
Biological Neurons
A Simple Artificial Neuron
A Simple ANN
Pattern recognition: T versus H
[Figure: a 3×3 pixel image is the input; each row of pixels (x_i1, x_i2, x_i3) feeds a hidden neuron f_i(.), and the three hidden outputs feed a single output neuron g(.), whose output (1 or 0) labels the image T or H.]
[Truth tables define each f_i as a function of its row of pixels and g as a function of (f_1, f_2, f_3); entries marked "?" are not determined by the example patterns and may be assigned either way.]
Feedforward ANN
Perceptrons
A simple ANN introduced by Frank
Rosenblatt in 1958
Discredited by Marvin Minsky and
Seymour Papert in 1969:
"Perceptrons have been widely publicized
as 'pattern recognition' or 'learning
machines' and as such have been discussed
in a large number of books, journal articles,
and voluminous 'reports'. Most of this
writing ... is without scientific value"
Perceptrons
[Figure: a single neuron with bias input x0 = 1 (weight w0) and inputs x1, x2, x3 (weights w1, w2, w3).]
f(x) = 1 if w·x ≥ 0, and f(x) = 0 otherwise
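The threshold unit above is easy to sketch in code. The deck's own demos are MATLAB files (Backprop.m), but here is a hypothetical Python sketch; the AND weights are an illustrative assumption, not from the slides:

```python
def perceptron(x, w):
    # w[0] is the bias weight, paired with the constant input x0 = 1;
    # the unit fires (outputs 1) iff the weighted sum is >= 0
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s >= 0 else 0

# Example: weights chosen by hand so the perceptron computes logical AND
w_and = [-1.5, 1.0, 1.0]
outputs = [perceptron(x, w_and) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)  # [0, 0, 0, 1]
```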
From Perceptrons to Backpropagation
Perceptrons were dismissed because of:
Limitations of single-layer perceptrons
The threshold function is not differentiable
Multi-layer ANNs with differentiable
activation functions allow much richer
behaviors.
A multi-layer perceptron (MLP) is a
feedforward ANN with at least one
hidden layer.
Backpropagation
Derivative-based method for optimizing
ANN weights.
1969: First described by Arthur Bryson
and Yu-Chi Ho.
1970s-80s: Popularized by David
Rumelhart, Geoffrey Hinton, Ronald
Williams, and Paul Werbos; led to a
renaissance in ANN research.
The Credit Assignment Problem
[Figure: a multi-layer network produced output 1 when 0 was wanted; which weights, in which layers, should get the credit or blame for the error?]
Backpropagation
[Figure: a feedforward network with input neurons x1, x2; hidden-neuron weights v_ij, activations a_j, and outputs y_j; output-neuron weights w_ij, activations z_k, and outputs o_k.]

E = (1/2) Σ_{k=1}^{no} (t_k − o_k)²
  = (1/2) Σ_{k=1}^{no} (t_k − f(z_k))²

Sigmoid transfer function:
f(x) = (1 + e^{−x})^{−1}
df/dx = e^{−x} (1 + e^{−x})^{−2} = [1 − f(x)] f(x)

[Figure: plot of f(x), rising from 0 to 1 with f(0) = 0.5.]
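The derivative identity df/dx = [1 − f(x)] f(x) is what makes the sigmoid convenient for backpropagation. A quick numerical check (a sketch, not part of the slides):

```python
import math

def f(x):
    # sigmoid transfer function f(x) = (1 + e^-x)^-1
    return 1.0 / (1.0 + math.exp(-x))

h = 1e-6
for x in (-2.0, 0.0, 3.0):
    analytic = (1 - f(x)) * f(x)               # closed-form derivative
    numeric = (f(x + h) - f(x - h)) / (2 * h)  # central difference
    assert abs(analytic - numeric) < 1e-8
print(f(0.0))  # 0.5, matching the midpoint of the plot
```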
Output Neurons
Define δ_j ≡ dE/dz_j.

E = (1/2) Σ_{k=1}^{no} (t_k − f(z_k))²

dE/dw_ij = (dE/dz_j)(dz_j/dw_ij) = δ_j y_i

δ_j = dE/dz_j = d/dz_j [ (1/2)(t_j − o_j)² ]
    = −(t_j − o_j) do_j/dz_j
    = −(t_j − o_j) df(z_j)/dz_j
    = −(t_j − o_j)(1 − f(z_j)) f(z_j)
    = −(t_j − o_j)(1 − o_j) o_j
Hidden Neurons
D(j) = {output neurons whose inputs come from the j-th middle-layer neuron}

dE/dv_ij = Σ_{k∈D(j)} (dE/dz_k)(dz_k/dy_j)(dy_j/da_j)(da_j/dv_ij)
         = x_i Σ_{k∈D(j)} (dE/dz_k)(dz_k/dy_j)(dy_j/da_j)
         = x_i Σ_{k∈D(j)} δ_k w_jk y_j (1 − y_j)
         = x_i y_j (1 − y_j) Σ_{k∈D(j)} δ_k w_jk
The Backpropagation Training
Algorithm
1. Randomly initialize weights {w} and {v}.
2. Input sample x to get output o. Compute error E.
3. Compute derivatives of E with respect to output
weights {w} (two pages previous).
4. Compute derivatives of E with respect to hidden
weights {v} (previous page). Note that the results of
step 3 are used for this computation (hence the term
backpropagation).
5. Repeat step 4 for additional hidden layers as needed.
6. Use gradient descent to update weights {w} and
{v}. Go to step 2 for the next sample/iteration.
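Steps 1-6 can be sketched for a single hidden layer. This is a hypothetical Python illustration (the deck's own demo is Backprop.m); the network size, learning rate, and iteration count are assumptions:

```python
import math
import random

random.seed(1)
sig = lambda x: 1.0 / (1.0 + math.exp(-x))

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # XOR
nh, eta = 3, 0.5        # hidden neurons and learning rate (assumed)

# Step 1: randomly initialize weights {v} (input->hidden) and {w} (hidden->output);
# row 2 of v and w[nh] are bias weights fed by a constant input of 1.
v = [[random.uniform(-1, 1) for _ in range(nh)] for _ in range(3)]
w = [random.uniform(-1, 1) for _ in range(nh + 1)]

def forward(x):
    y = [sig(x[0] * v[0][j] + x[1] * v[1][j] + v[2][j]) for j in range(nh)]
    o = sig(sum(w[j] * y[j] for j in range(nh)) + w[nh])
    return y, o

def total_error():
    return sum(0.5 * (t - forward(x)[1]) ** 2 for x, t in data)

e_before = total_error()
for _ in range(20000):
    for x, t in data:                  # step 2: forward pass, error
        y, o = forward(x)
        # Step 3: output delta, dE/dz = -(t - o)(1 - o)o
        d_out = -(t - o) * (1 - o) * o
        # Step 4: hidden deltas reuse d_out (the "backpropagation")
        d_hid = [d_out * w[j] * y[j] * (1 - y[j]) for j in range(nh)]
        # Step 6: gradient descent on all weights
        for j in range(nh):
            w[j] -= eta * d_out * y[j]
            v[0][j] -= eta * d_hid[j] * x[0]
            v[1][j] -= eta * d_hid[j] * x[1]
            v[2][j] -= eta * d_hid[j]
        w[nh] -= eta * d_out
e_after = total_error()
print(e_after < e_before)  # training reduces the error
```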
XOR Example
[Figure: the XOR problem in the (x1, x2) plane; the two classes occupy opposite corners of the unit square (outputs 1, 0 on one diagonal and 0, 1 on the other), so no single line separates them.]
y = sign(x1 x2)
XOR Example
[Figure: a 2-2-1 network for XOR. Inputs x1, x2 and a constant bias input of 1 feed two hidden neurons (outputs y1, y2) through weights v_11, v_12, v_21, v_22, v_31, v_32; the hidden outputs and a bias feed the output neuron o1 through weights w_11, w_21, w_31.]
Backprop.m
XOR Example
[Figure: the XOR data in the (x1, x2) plane, as on the earlier slide.]
Backpropagation Issues
Momentum: Δw_ij = −η δ_j y_i + α Δw_ij,previous
What value of α should we use?
Backpropagation is a local optimizer
Combine it with a global optimizer (e.g., BBO)
Run backprop with multiple initial conditions
Add random noise to input data and/or
weights to improve generalization
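The momentum term can be illustrated on a one-dimensional quadratic error surface; E(w) = w² and the values of η and α below are assumptions for the sketch:

```python
# Gradient descent with momentum: dw = -eta*grad + alpha*dw_previous
eta, alpha = 0.1, 0.9       # learning rate and momentum coefficient (typical values)
w, dw_prev = 5.0, 0.0
for _ in range(300):
    grad = 2 * w            # gradient of E(w) = w^2
    dw = -eta * grad + alpha * dw_prev   # momentum reuses the previous step
    w += dw
    dw_prev = dw
print(abs(w) < 1e-3)  # settles near the minimum at w = 0
```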
Backpropagation Issues
Batch backpropagation:
  Randomly initialize weights {w} and {v}
  While not (termination criteria)
    For i = 1 to (number of training samples)
      Input sample x_i to get output o_i. Compute error E_i
      Compute dE_i/dw and dE_i/dv
    Next sample
    dE/dw = Σ_i dE_i/dw and dE/dv = Σ_i dE_i/dv
    Use gradient descent to update weights {w} and {v}
Don't forget to adjust the learning rate!
Backpropagation Issues
Weight decay:
w_ij ← w_ij − η δ_j y_i − d w_ij
This tends to decrease weight magnitudes
unless they are reinforced with backprop
d ≈ 0.001
This corresponds to adding a term to the
error function that penalizes the weight
magnitudes
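To see the decay effect in isolation, suppose backprop contributes no gradient for some weight; the subtracted d·w term then shrinks it geometrically. A sketch with assumed values:

```python
eta, d = 0.1, 0.001
w = 2.0
grad = 0.0                        # pretend backprop does not reinforce this weight
for _ in range(1000):
    w = w - eta * grad - d * w    # weight-decay update
# Each step multiplies w by (1 - d), so after 1000 steps:
print(w)  # 2 * 0.999**1000, about 0.736
```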
Backpropagation Issues
Quickprop (Scott Fahlman, 1988)
Backpropagation is notoriously slow.
Quickprop has the same philosophy as
Newton-Raphson: assume the error
surface is quadratic and jump in one
step to the minimum of the quadratic.
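For a single weight, the quadratic assumption gives the secant step Δw = Δw_prev · g / (g_prev − g), which lands on the vertex of the parabola through the last two gradient measurements. A one-weight sketch with an assumed quadratic error:

```python
def grad(w):
    return 2 * (w - 3)      # gradient of E(w) = (w - 3)^2, minimum at w = 3

w_prev, w = 0.0, 1.0        # two previous weight values (assumed)
g_prev, g = grad(w_prev), grad(w)
dw = (w - w_prev) * g / (g_prev - g)   # quickprop/secant step
w += dw
print(w)  # 3.0: one jump to the quadratic's minimum
```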
Backpropagation Issues
Other activation functions:
Sigmoid: f(x) = (1 + e^{−x})^{−1}
Hyperbolic tangent: f(x) = tanh(x)
Step: f(x) = U(x)
Tan sigmoid: f(x) = (e^{cx} − e^{−cx}) / (e^{cx} + e^{−cx})
for some positive constant c
How many hidden layers should we use?
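Note that the tan sigmoid listed above is algebraically tanh(cx). A short sketch of these activations, with a numerical check of that identity and of tanh(x) = 2·sigmoid(2x) − 1:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def step(x):                 # f(x) = U(x), the unit step
    return 1.0 if x >= 0 else 0.0

def tan_sigmoid(x, c):       # (e^cx - e^-cx) / (e^cx + e^-cx)
    return (math.exp(c * x) - math.exp(-c * x)) / (math.exp(c * x) + math.exp(-c * x))

for x in (-1.5, 0.0, 0.7):
    assert abs(tan_sigmoid(x, 2.0) - math.tanh(2.0 * x)) < 1e-12
    assert abs(math.tanh(x) - (2 * sigmoid(2 * x) - 1)) < 1e-12
print("identities hold")
```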
Universal Approximation
Theorem
A feed-forward ANN with one hidden layer
and a finite number of neurons can
approximate any continuous function to any
desired accuracy.
The ANN activation functions can be any
continuous, nonconstant, bounded,
monotonically increasing functions.
The desired weights may not be obtainable
via backpropagation.
George Cybenko, 1989; Kurt Hornik, 1991
Termination Criterion
[Figure: error versus training iterations; the training-set error keeps decreasing, while the validation/test-set error reaches a minimum and then rises, which signals when to stop training.]
Adaptive Backpropagation
Recall the standard weight update:
w_ij ← w_ij − η δ_j y_i
With adaptive learning rates, each weight
w_ij has its own rate η_ij
If the sign of Δw_ij is the same over several
backprop updates, then increase η_ij
If the sign of Δw_ij is not the same over
several backprop updates, then decrease η_ij
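The sign rule can be sketched for a single weight; the growth and shrink factors below are assumptions (schemes such as delta-bar-delta and Rprop use variations of this idea):

```python
eta, grow, shrink = 0.01, 1.2, 0.5
w, prev_sign = 4.0, 0
for _ in range(60):
    grad = 2 * w                        # gradient of E(w) = w^2 again
    sign = (grad > 0) - (grad < 0)
    if sign == prev_sign:
        eta *= grow                     # consistent sign: speed up
    elif prev_sign != 0:
        eta *= shrink                   # sign flip: slow down
    prev_sign = sign
    w -= eta * grad
print(abs(w) < 1.0)  # the per-weight rate adapts as w nears the minimum
```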
Double Backpropagation
In addition to minimizing the training error:
E1 = (1/2) Σ_{k=1}^{no} (t_k − o_k)²
Also minimize the sensitivity of the training
error to the input data:
E2 = (1/2) Σ_k (∂E1/∂x_k)²
P = number of input training patterns. We want
an ANN that can generalize. So input changes
should not result in large error changes.
Other ANN Training Methods
Gradient-free approaches (GAs, BBO, etc.)
Global optimization
BBO.m
Combination with gradient descent
We can train the structure as well as the
weights
We can use non-differentiable activation
functions
We can use non-differentiable cost
functions
Classification Benchmarks
The Iris classification problem
150 data samples
Four input feature values (sepal length
and width, and petal length and
width)
Three types of irises: Setosa,
Versicolour, and Virginica
Classification Benchmarks
The two-spirals classification problem
UC Irvine Machine Learning Repository
http://archive.ics.uci.edu/ml
194 benchmarks!
Radial Basis Functions
N middle-layer neurons
Inputs x
Activation functions f(x, c_i)
Output weights w_ik
y_k = Σ_i w_ik f(x, c_i) = Σ_i w_ik φ(||x − c_i||)
φ(.) is a basis function
lim_{||x−c_i||→∞} φ(||x − c_i||) = 0
{c_i} are the N RBF centers
Universal approximators
J. Moody and C. Darken, 1989
Radial Basis Functions
Common basis functions:
Gaussian: φ(||x − c_i||) = exp(−||x − c_i||² / σ²)
σ is the width of the basis function
Many other proposed basis functions
Radial Basis Functions
Suppose we have the data set (x_i, y_i), i = 1, …, N
Each x_i is multidimensional, each y_i is scalar
Set c_i = x_i, i = 1, …, N
Define g_ik = φ(||x_i − x_k||)
Input each x_i to the RBF to obtain:

[ g_11 … g_1N ] [ w_1 ]   [ y_1 ]
[  ⋮   ⋱  ⋮  ] [  ⋮  ] = [  ⋮  ]    that is, Gw = y
[ g_N1 … g_NN ] [ w_N ]   [ y_N ]

G is nonsingular if the {x_i} are distinct
Solve for w
Global minimum (assuming fixed c and σ)
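A sketch of this interpolation construction with NumPy; the 1-D data points and the Gaussian width are made up for illustration:

```python
import numpy as np

sigma = 1.0
phi = lambda r: np.exp(-r**2 / sigma**2)   # Gaussian basis function

x = np.array([0.0, 1.0, 2.5, 4.0])         # distinct inputs (centers c_i = x_i)
y = np.array([1.0, -1.0, 0.5, 2.0])

G = phi(np.abs(x[:, None] - x[None, :]))   # g_ik = phi(|x_i - x_k|)
w = np.linalg.solve(G, y)                  # G is nonsingular for distinct x_i

rbf = lambda t: phi(np.abs(t - x)) @ w     # the trained RBF network
print(np.allclose([rbf(t) for t in x], y))  # True: exact interpolation
```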
Radial Basis Functions
We again have the data set (x_i, y_i), i = 1, …, N
Each x_i is multidimensional, each y_i is scalar
c_k are given for k = 1, …, m, and m < N
Define g_ik = φ(||x_i − c_k||)
Input each x_i to the RBF to obtain:

[ g_11 … g_1m ] [ w_1 ]   [ y_1 ]
[  ⋮   ⋱  ⋮  ] [  ⋮  ] = [  ⋮  ]    Gw = y
[ g_N1 … g_Nm ] [ w_m ]   [ y_N ]

Least-squares solution: w = (GᵀG)⁻¹Gᵀy = G⁺y
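With fewer centers than data points, the same setup becomes an overdetermined least-squares problem. Again a made-up 1-D example (fitting sin with m = 3 Gaussian centers):

```python
import numpy as np

sigma = 1.0
phi = lambda r: np.exp(-r**2 / sigma**2)   # Gaussian basis function

x = np.linspace(0.0, 4.0, 9)               # N = 9 data points
y = np.sin(x)
c = np.array([0.5, 2.0, 3.5])              # m = 3 given centers, m < N

G = phi(np.abs(x[:, None] - c[None, :]))   # N x m matrix, g_ik = phi(|x_i - c_k|)
w, *_ = np.linalg.lstsq(G, y, rcond=None)  # w = G^+ y, the pseudoinverse solution

mse = float(np.mean((G @ w - y) ** 2))
print(mse < 0.1)  # the 3-center fit captures sin reasonably well
```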
Radial Basis Functions
How can we choose the RBF centers?
Randomly select them from the inputs
Use a clustering algorithm
Other options (BBO?)
How can we choose the RBF widths?
Other Types of ANNs
Many other types of ANNs
Cerebellar Model Articulation Controller (CMAC)
Spiking neural networks
Self-organizing map (SOM)
Recurrent neural network (RNN)
Hopfield network
Boltzmann machine
Cascade-Correlation
and many others
Sources
Neural Networks, by C. Stergiou and D. Siganos,
www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
The Backpropagation Algorithm, by A. Venkataraman,
www.speech.sri.com/people/anand/771/html/node37.html
CS 478 Course Notes, by Tony Martinez,
http://axon.cs.byu.edu/~martinez