Dan Simon
Cleveland State University
Neural Networks
Artificial Neural Network (ANN): An
information processing paradigm that is
inspired by biological neurons
Distinctive structure: Large number of
simple, highly interconnected processing
elements (neurons); parallel processing
Inductive learning, that is, learning by
example; an ANN is configured for a
specific application through a learning
process
Learning involves adjustments to
connections between the neurons
Inductive Learning
Sometimes we can't explain how we know
something; we rely on our experience
An ANN can generalize from expert
knowledge and re-create expert behavior
Example: An ER doctor considers a
patient's age, blood pressure, heart rate,
ECG, etc., and makes an educated guess
about whether or not the patient had a
heart attack
The Birth of ANNs
The first artificial neuron was
proposed in 1943 by
neurophysiologist Warren McCulloch
and the psychologist/logician Walter
Pitts
No computing resources at that time
Biological Neurons
A Simple Artificial Neuron
A Simple ANN
Pattern recognition: T versus H
[Figure: a 3×3 pixel image is the input; each row of pixels (x_i1, x_i2, x_i3) feeds a hidden neuron f_i(.), and the three hidden outputs feed a single output neuron g(.), whose output (1 or 0) labels the image T or H.]
[Truth tables define each f_i as a function of its row of pixels and g as a function of (f_1, f_2, f_3); entries marked "?" are not determined by the example patterns and may be assigned either way.]
Feedforward ANN
Perceptrons
A simple ANN introduced by Frank
Rosenblatt in 1958
Discredited by Marvin Minsky and
Seymour Papert in 1969:
"Perceptrons have been widely publicized
as 'pattern recognition' or 'learning
machines' and as such have been discussed
in a large number of books, journal articles,
and voluminous 'reports'. Most of this
writing ... is without scientific value"
Perceptrons
[Figure: a single neuron with bias input x0 = 1 (weight w0) and inputs x1, x2, x3 (weights w1, w2, w3).]
f(x) = 1 if w·x ≥ 0, and f(x) = 0 otherwise
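The threshold unit above is easy to sketch in code. The deck's own demos are MATLAB files (Backprop.m), but here is a hypothetical Python sketch; the AND weights are an illustrative assumption, not from the slides:

```python
def perceptron(x, w):
    # w[0] is the bias weight, paired with the constant input x0 = 1;
    # the unit fires (outputs 1) iff the weighted sum is >= 0
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s >= 0 else 0

# Example: weights chosen by hand so the perceptron computes logical AND
w_and = [-1.5, 1.0, 1.0]
outputs = [perceptron(x, w_and) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)  # [0, 0, 0, 1]
```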
From Perceptrons to Backpropagation
Perceptrons were dismissed because of:
Limitations of single-layer perceptrons
The threshold function is not differentiable
Multi-layer ANNs with differentiable
activation functions allow much richer
behaviors.
A multi-layer perceptron (MLP) is a
feedforward ANN with at least one
hidden layer.
Backpropagation
Derivative-based method for optimizing
ANN weights.
1969: First described by Arthur Bryson
and Yu-Chi Ho.
1970s-80s: Popularized by David
Rumelhart, Geoffrey Hinton, Ronald
Williams, and Paul Werbos; led to a
renaissance in ANN research.
The Credit Assignment Problem
[Figure: a multi-layer network produced output 1 when 0 was wanted; which weights, in which layers, should get the credit or blame for the error?]
Backpropagation
[Figure: a feedforward network with input neurons x1, x2; hidden-neuron weights v_ij, activations a_j, and outputs y_j; output-neuron weights w_ij, activations z_k, and outputs o_k.]

E = (1/2) Σ_{k=1}^{no} (t_k − o_k)²
  = (1/2) Σ_{k=1}^{no} (t_k − f(z_k))²

Sigmoid transfer function:
f(x) = (1 + e^{−x})^{−1}
df/dx = e^{−x} (1 + e^{−x})^{−2} = [1 − f(x)] f(x)

[Figure: plot of f(x), rising from 0 to 1 with f(0) = 0.5.]
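The derivative identity df/dx = [1 − f(x)] f(x) is what makes the sigmoid convenient for backpropagation. A quick numerical check (a sketch, not part of the slides):

```python
import math

def f(x):
    # sigmoid transfer function f(x) = (1 + e^-x)^-1
    return 1.0 / (1.0 + math.exp(-x))

h = 1e-6
for x in (-2.0, 0.0, 3.0):
    analytic = (1 - f(x)) * f(x)               # closed-form derivative
    numeric = (f(x + h) - f(x - h)) / (2 * h)  # central difference
    assert abs(analytic - numeric) < 1e-8
print(f(0.0))  # 0.5, matching the midpoint of the plot
```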
Output Neurons
Define δ_j ≡ dE/dz_j.

E = (1/2) Σ_{k=1}^{no} (t_k − f(z_k))²

dE/dw_ij = (dE/dz_j)(dz_j/dw_ij) = δ_j y_i

δ_j = dE/dz_j = d/dz_j [ (1/2)(t_j − o_j)² ]
    = −(t_j − o_j) do_j/dz_j
    = −(t_j − o_j) df(z_j)/dz_j
    = −(t_j − o_j)(1 − f(z_j)) f(z_j)
    = −(t_j − o_j)(1 − o_j) o_j
Hidden Neurons
D(j) = {output neurons whose inputs come from the j-th middle-layer neuron}

dE/dv_ij = Σ_{k∈D(j)} (dE/dz_k)(dz_k/dy_j)(dy_j/da_j)(da_j/dv_ij)
         = x_i Σ_{k∈D(j)} (dE/dz_k)(dz_k/dy_j)(dy_j/da_j)
         = x_i Σ_{k∈D(j)} δ_k w_jk y_j (1 − y_j)
         = x_i y_j (1 − y_j) Σ_{k∈D(j)} δ_k w_jk
The Backpropagation Training
Algorithm
1. Randomly initialize weights {w} and {v}.
2. Input sample x to get output o. Compute error E.
3. Compute derivatives of E with respect to output
weights {w} (two pages previous).
4. Compute derivatives of E with respect to hidden
weights {v} (previous page). Note that the results of
step 3 are used for this computation (hence the term
backpropagation).
5. Repeat step 4 for additional hidden layers as needed.
6. Use gradient descent to update weights {w} and
{v}. Go to step 2 for the next sample/iteration.
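Steps 1-6 can be sketched for a single hidden layer. This is a hypothetical Python illustration (the deck's own demo is Backprop.m); the network size, learning rate, and iteration count are assumptions:

```python
import math
import random

random.seed(1)
sig = lambda x: 1.0 / (1.0 + math.exp(-x))

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # XOR
nh, eta = 3, 0.5        # hidden neurons and learning rate (assumed)

# Step 1: randomly initialize weights {v} (input->hidden) and {w} (hidden->output);
# row 2 of v and w[nh] are bias weights fed by a constant input of 1.
v = [[random.uniform(-1, 1) for _ in range(nh)] for _ in range(3)]
w = [random.uniform(-1, 1) for _ in range(nh + 1)]

def forward(x):
    y = [sig(x[0] * v[0][j] + x[1] * v[1][j] + v[2][j]) for j in range(nh)]
    o = sig(sum(w[j] * y[j] for j in range(nh)) + w[nh])
    return y, o

def total_error():
    return sum(0.5 * (t - forward(x)[1]) ** 2 for x, t in data)

e_before = total_error()
for _ in range(20000):
    for x, t in data:                  # step 2: forward pass, error
        y, o = forward(x)
        # Step 3: output delta, dE/dz = -(t - o)(1 - o)o
        d_out = -(t - o) * (1 - o) * o
        # Step 4: hidden deltas reuse d_out (the "backpropagation")
        d_hid = [d_out * w[j] * y[j] * (1 - y[j]) for j in range(nh)]
        # Step 6: gradient descent on all weights
        for j in range(nh):
            w[j] -= eta * d_out * y[j]
            v[0][j] -= eta * d_hid[j] * x[0]
            v[1][j] -= eta * d_hid[j] * x[1]
            v[2][j] -= eta * d_hid[j]
        w[nh] -= eta * d_out
e_after = total_error()
print(e_after < e_before)  # training reduces the error
```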
XOR Example
[Figure: the XOR problem in the (x1, x2) plane; the two classes occupy opposite corners of the unit square (outputs 1, 0 on one diagonal and 0, 1 on the other), so no single line separates them.]
y = sign(x1 x2)
XOR Example
[Figure: a 2-2-1 network for XOR. Inputs x1, x2 and a constant bias input of 1 feed two hidden neurons (outputs y1, y2) through weights v_11, v_12, v_21, v_22, v_31, v_32; the hidden outputs and a bias feed the output neuron o1 through weights w_11, w_21, w_31.]
Backprop.m
XOR Example
[Figure: the XOR data in the (x1, x2) plane, as on the earlier slide.]
Backpropagation Issues
Momentum: Δw_ij = −η δ_j y_i + α Δw_ij,previous
What value of α should we use?
Backpropagation is a local optimizer
Combine it with a global optimizer (e.g., BBO)
Run backprop with multiple initial conditions
Add random noise to input data and/or
weights to improve generalization
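The momentum term can be illustrated on a one-dimensional quadratic error surface; E(w) = w² and the values of η and α below are assumptions for the sketch:

```python
# Gradient descent with momentum: dw = -eta*grad + alpha*dw_previous
eta, alpha = 0.1, 0.9       # learning rate and momentum coefficient (typical values)
w, dw_prev = 5.0, 0.0
for _ in range(300):
    grad = 2 * w            # gradient of E(w) = w^2
    dw = -eta * grad + alpha * dw_prev   # momentum reuses the previous step
    w += dw
    dw_prev = dw
print(abs(w) < 1e-3)  # settles near the minimum at w = 0
```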
Backpropagation Issues
Batch backpropagation:
  Randomly initialize weights {w} and {v}
  While not (termination criteria)
    For i = 1 to (number of training samples)
      Input sample x_i to get output o_i. Compute error E_i
      Compute dE_i/dw and dE_i/dv
    Next sample
    dE/dw = Σ_i dE_i/dw and dE/dv = Σ_i dE_i/dv
    Use gradient descent to update weights {w} and {v}
Don't forget to adjust the learning rate!
Backpropagation Issues
Weight decay:
w_ij ← w_ij − η δ_j y_i − d w_ij
This tends to decrease weight magnitudes
unless they are reinforced with backprop
d ≈ 0.001
This corresponds to adding a term to the
error function that penalizes the weight
magnitudes
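To see the decay effect in isolation, suppose backprop contributes no gradient for some weight; the subtracted d·w term then shrinks it geometrically. A sketch with assumed values:

```python
eta, d = 0.1, 0.001
w = 2.0
grad = 0.0                        # pretend backprop does not reinforce this weight
for _ in range(1000):
    w = w - eta * grad - d * w    # weight-decay update
# Each step multiplies w by (1 - d), so after 1000 steps:
print(w)  # 2 * 0.999**1000, about 0.736
```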
Backpropagation Issues
Quickprop (Scott Fahlman, 1988)
Backpropagation is notoriously slow.
Quickprop has the same philosophy as
Newton-Raphson: assume the error
surface is quadratic and jump in one
step to the minimum of the quadratic.
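For a single weight, the quadratic assumption gives the secant step Δw = Δw_prev · g / (g_prev − g), which lands on the vertex of the parabola through the last two gradient measurements. A one-weight sketch with an assumed quadratic error:

```python
def grad(w):
    return 2 * (w - 3)      # gradient of E(w) = (w - 3)^2, minimum at w = 3

w_prev, w = 0.0, 1.0        # two previous weight values (assumed)
g_prev, g = grad(w_prev), grad(w)
dw = (w - w_prev) * g / (g_prev - g)   # quickprop/secant step
w += dw
print(w)  # 3.0: one jump to the quadratic's minimum
```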
Backpropagation Issues
Other activation functions:
Sigmoid: f(x) = (1 + e^{−x})^{−1}
Hyperbolic tangent: f(x) = tanh(x)
Step: f(x) = U(x)
Tan sigmoid: f(x) = (e^{cx} − e^{−cx}) / (e^{cx} + e^{−cx})
for some positive constant c
How many hidden layers should we use?
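Note that the tan sigmoid listed above is algebraically tanh(cx). A short sketch of these activations, with a numerical check of that identity and of tanh(x) = 2·sigmoid(2x) − 1:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def step(x):                 # f(x) = U(x), the unit step
    return 1.0 if x >= 0 else 0.0

def tan_sigmoid(x, c):       # (e^cx - e^-cx) / (e^cx + e^-cx)
    return (math.exp(c * x) - math.exp(-c * x)) / (math.exp(c * x) + math.exp(-c * x))

for x in (-1.5, 0.0, 0.7):
    assert abs(tan_sigmoid(x, 2.0) - math.tanh(2.0 * x)) < 1e-12
    assert abs(math.tanh(x) - (2 * sigmoid(2 * x) - 1)) < 1e-12
print("identities hold")
```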
Universal Approximation
Theorem
A feed-forward ANN with one hidden layer
and a finite number of neurons can
approximate any continuous function to any
desired accuracy.
The ANN activation functions can be any
continuous, nonconstant, bounded,
monotonically increasing functions.
The desired weights may not be obtainable
via backpropagation.
George Cybenko, 1989; Kurt Hornik, 1991
Termination Criterion
[Figure: error versus training iterations; the training-set error keeps decreasing, while the validation/test-set error reaches a minimum and then rises, which signals when to stop training.]
Adaptive Backpropagation
Recall the standard weight update:
w_ij ← w_ij − η δ_j y_i
With adaptive learning rates, each weight
w_ij has its own rate η_ij
If the sign of Δw_ij is the same over several
backprop updates, then increase η_ij
If the sign of Δw_ij is not the same over
several backprop updates, then decrease η_ij
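The sign rule can be sketched for a single weight; the growth and shrink factors below are assumptions (schemes such as delta-bar-delta and Rprop use variations of this idea):

```python
eta, grow, shrink = 0.01, 1.2, 0.5
w, prev_sign = 4.0, 0
for _ in range(60):
    grad = 2 * w                        # gradient of E(w) = w^2 again
    sign = (grad > 0) - (grad < 0)
    if sign == prev_sign:
        eta *= grow                     # consistent sign: speed up
    elif prev_sign != 0:
        eta *= shrink                   # sign flip: slow down
    prev_sign = sign
    w -= eta * grad
print(abs(w) < 1.0)  # the per-weight rate adapts as w nears the minimum
```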
Double Backpropagation
In addition to minimizing the training error:
E1 = (1/2) Σ_{k=1}^{no} (t_k − o_k)²
Also minimize the sensitivity of the training
error to the input data:
E2 = (1/2) Σ_k (∂E1/∂x_k)²
P = number of input training patterns. We want
an ANN that can generalize. So input changes
should not result in large error changes.
Other ANN Training Methods
Gradient-free approaches (GAs, BBO, etc.)
Global optimization
BBO.m
Combination with gradient descent
We can train the structure as well as the
weights
We can use non-differentiable activation
functions
We can use non-differentiable cost
functions
Classification Benchmarks
The Iris classification problem
150 data samples
Four input feature values (sepal length
and width, and petal length and
width)
Three types of irises: Setosa,
Versicolour, and Virginica
Classification Benchmarks
The two-spirals classification problem
UC Irvine Machine Learning Repository
http://archive.ics.uci.edu/ml
194 benchmarks!
Radial Basis Functions
N middle-layer neurons
Inputs x
Activation functions f(x, c_i)
Output weights w_ik
y_k = Σ_i w_ik f(x, c_i) = Σ_i w_ik φ(||x − c_i||)
φ(.) is a basis function
lim_{||x−c_i||→∞} φ(||x − c_i||) = 0
{c_i} are the N RBF centers
Universal approximators
J. Moody and C. Darken, 1989
Radial Basis Functions
Common basis functions:
Gaussian: φ(||x − c_i||) = exp(−||x − c_i||² / σ²)
σ is the width of the basis function
Many other proposed basis functions
Radial Basis Functions
Suppose we have the data set (x_i, y_i), i = 1, …, N
Each x_i is multidimensional, each y_i is scalar
Set c_i = x_i, i = 1, …, N
Define g_ik = φ(||x_i − x_k||)
Input each x_i to the RBF to obtain:

[ g_11 … g_1N ] [ w_1 ]   [ y_1 ]
[  ⋮   ⋱  ⋮  ] [  ⋮  ] = [  ⋮  ]    that is, Gw = y
[ g_N1 … g_NN ] [ w_N ]   [ y_N ]

G is nonsingular if the {x_i} are distinct
Solve for w
Global minimum (assuming fixed c and σ)
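A sketch of this interpolation construction with NumPy; the 1-D data points and the Gaussian width are made up for illustration:

```python
import numpy as np

sigma = 1.0
phi = lambda r: np.exp(-r**2 / sigma**2)   # Gaussian basis function

x = np.array([0.0, 1.0, 2.5, 4.0])         # distinct inputs (centers c_i = x_i)
y = np.array([1.0, -1.0, 0.5, 2.0])

G = phi(np.abs(x[:, None] - x[None, :]))   # g_ik = phi(|x_i - x_k|)
w = np.linalg.solve(G, y)                  # G is nonsingular for distinct x_i

rbf = lambda t: phi(np.abs(t - x)) @ w     # the trained RBF network
print(np.allclose([rbf(t) for t in x], y))  # True: exact interpolation
```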
Radial Basis Functions
We again have the data set (x_i, y_i), i = 1, …, N
Each x_i is multidimensional, each y_i is scalar
c_k are given for k = 1, …, m, and m < N
Define g_ik = φ(||x_i − c_k||)
Input each x_i to the RBF to obtain:

[ g_11 … g_1m ] [ w_1 ]   [ y_1 ]
[  ⋮   ⋱  ⋮  ] [  ⋮  ] = [  ⋮  ]    Gw = y
[ g_N1 … g_Nm ] [ w_m ]   [ y_N ]

Least-squares solution: w = (GᵀG)⁻¹Gᵀy = G⁺y
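With fewer centers than data points, the same setup becomes an overdetermined least-squares problem. Again a made-up 1-D example (fitting sin with m = 3 Gaussian centers):

```python
import numpy as np

sigma = 1.0
phi = lambda r: np.exp(-r**2 / sigma**2)   # Gaussian basis function

x = np.linspace(0.0, 4.0, 9)               # N = 9 data points
y = np.sin(x)
c = np.array([0.5, 2.0, 3.5])              # m = 3 given centers, m < N

G = phi(np.abs(x[:, None] - c[None, :]))   # N x m matrix, g_ik = phi(|x_i - c_k|)
w, *_ = np.linalg.lstsq(G, y, rcond=None)  # w = G^+ y, the pseudoinverse solution

mse = float(np.mean((G @ w - y) ** 2))
print(mse < 0.1)  # the 3-center fit captures sin reasonably well
```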
Radial Basis Functions
How can we choose the RBF centers?
Randomly select them from the inputs
Use a clustering algorithm
Other options (BBO?)
How can we choose the RBF widths?
Other Types of ANNs
Many other types of ANNs
Cerebellar Model Articulation Controller (CMAC)
Spiking neural networks
Self-organizing map (SOM)
Recurrent neural network (RNN)
Hopfield network
Boltzmann machine
Cascade-Correlation
and many others
Sources
Neural Networks, by C. Stergiou and D. Siganos,
www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
The Backpropagation Algorithm, by A. Venkataraman,
www.speech.sri.com/people/anand/771/html/node37.html
CS 478 Course Notes, by Tony Martinez,
http://axon.cs.byu.edu/~martinez