
CS451/CS551/EE565

ARTIFICIAL
INTELLIGENCE
Neural Nets
11-17-2008
Prof. Janice T. Searleman
jets@clarkson.edu, jetsza
Outline
Artificial Neural Networks: ANNs
Types of ANNs: feedforward & recurrent
Perceptrons
Hopfield Nets
Multilayer Feedforward Nets

Exam#2 Tuesday, 11/18, 7:00 pm, SC 344

Reading Assignment: AIMA
Chapter 18, Learning from Observations
Chapter 19, section 19.1, Logical Formulation of Learning
Chapter 20, section 20.5, Neural Networks
Artificial Neural Networks (ANN)
Distributed representational and computational
mechanism based (very roughly) on neurophysiology.
A collection of simple interconnected processors
(neurons) that can learn complex behaviors & solve
difficult problems.
Wide range of applications:
Supervised Learning
Function Learning (Correct mapping from inputs to outputs)
Time-Series Analysis, Forecasting, Controller Design
Concept Learning
Standard Machine Learning Classification tasks: Features => Class
Unsupervised Learning
Pattern Recognition (Associative Memory models)
Words, Sounds, Faces, etc.
Data Clustering
Unsupervised Concept Learning
Recap: Connectionist models
Key intuition: Much of intelligence is in the connections
between the 10 billion neurons in the human brain.
Neuron switching time is roughly 0.001 second; scene
recognition time is about 0.1 second. This suggests that
the brain is massively parallel because 100
computational steps are simply not sufficient to
accomplish scene recognition.
Development: Formation of basic connection topology
Learning: Fine-tuning of topology + Major
synaptic-efficiency changes.

The matrix IS the intelligence!
Connectionist Models: Artificial Neurons
Characteristics
Large number of simple neuron-like processing
elements
Large number of weighted connections between the
elements (the weights encode the knowledge)
Highly parallel, distributed control
Fault-tolerant.
Degrades gracefully.
Inductive learning of internal representation
Weights are tuned automatically
Simple Computing Elements
Each unit (node) receives signals from its input
links and computes a new activation level that it
sends along all output links.
Computation is split into two steps:
(1) in_i = Σ_j W_{j,i} a_j  (the linear step), and then
(2) a_i = g(in_i)  (the nonlinear step).
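A minimal sketch of this two-step computation in Python (the function and argument names are ours, not from the slides):

    def unit_output(weights, activations, g):
        # linear step: in_i = sum_j W_{j,i} * a_j
        in_i = sum(w * a for w, a in zip(weights, activations))
        # nonlinear step: a_i = g(in_i)
        return g(in_i)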
Neural Networks
[Figure: a network of nodes (node = unit) joined by links; each link has a weight, and each node has an activation level.]
A Node
[Figure: input links deliver activations a_j, weighted by W_{j,i}; the input function computes in_i, the activation function g is applied, and the output a_i = g(in_i) is sent along the output links.]
Possibilities for g
Step function, sign function, sigmoid (logistic) function:
step(x) = 1 if x >= threshold; 0 if x < threshold (in the figure, threshold = 0)
sign(x) = +1 if x >= 0; -1 if x < 0
sigmoid(x) = 1 / (1 + e^-x)
Adding an extra input with activation a_0 = -1 and weight W_{0,i} = t is equivalent to having a threshold at t. This way we can always assume a 0 threshold.
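The three functions in Python, as a quick reference (a sketch; the names are ours):

    import math

    def step(x, threshold=0.0):
        return 1 if x >= threshold else 0

    def sign(x):
        return +1 if x >= 0 else -1

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Threshold trick: add a fixed input a_0 = -1 with weight t,
    # then use step() with its default threshold of 0.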
Real vs Artificial Neurons
[Figure: a biological neuron (cell body, dendrites, axon, synapse) next to an artificial threshold unit with inputs x_0, ..., x_n, weights w_0, ..., w_n, and output o.]
The threshold unit computes:
o = 1 if Σ_{i=0}^{n} w_i x_i > 0, and 0 otherwise.
Similarities with neurons
Neurons            | ANN
Neuron             | Unit
Synapse            | Connection
Action potential   | Output
Analogue input     | Analogue input
Discrete output    | Discrete output
Differences
Neurons                            | ANN
Different neurotransmitters        | Ignored
Complex calculation of excitation  | Simple summation of inputs
?                                  | Learning algorithm
Large, complex                     | Small, simple
Neural Nets: A Brief History
McCulloch & Pitts, 1943: showed how neural-like networks could compute
Rosenblatt, ~1950s: perceptrons
Minsky & Papert, 1969: perceptron deficiencies
Hopfield, 1982: Hopfield nets
Hinton & Sejnowski, 1986: Boltzmann machines
Rumelhart et al., 1986: multilayer nets with backpropagation

Universal computing elements
In 1943, McCulloch and Pitts showed that a synchronous assembly of such neurons is a universal computing machine. That is, any Boolean function can be implemented with threshold (step function) units.
[Figure: a threshold unit with inputs x_1, x_2 (weights w_1, w_2), a bias input of -1 with weight w = ?, and output o(x_1, x_2).]
Implementing AND
[Figure: inputs x_1 and x_2 with weights 1 and 1, a bias input of -1 with weight W = 1.5, and output o(x_1, x_2).]
o(x_1, x_2) = 1 if x_1 + x_2 - 1.5 > 0
            = 0 otherwise
Implementing OR
[Figure: inputs x_1 and x_2 with weights 1 and 1, a bias input of -1 with weight W = 0.5, and output o(x_1, x_2).]
o(x_1, x_2) = 1 if x_1 + x_2 - 0.5 > 0
            = 0 otherwise
Implementing NOT
[Figure: input x_1 with weight -1, a bias input of -1 with weight W = -0.5, and output o(x_1).]
o(x_1) = 1 if 0.5 - x_1 > 0
       = 0 otherwise
Implementing more complex Boolean functions
[Figure: a two-layer circuit. First unit: inputs x_1 and x_2 with weights 1 and 1, bias input -1 with weight 0.5, computing x_1 or x_2. Second unit: inputs (x_1 or x_2) and x_3 with weights 1 and 1, bias input -1 with weight 1.5, computing (x_1 or x_2) and x_3.]
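A small sketch of these gates as threshold units in Python (the helper is ours; the weights and thresholds are the ones on the slides):

    def threshold_unit(weights, bias_weight):
        # fires iff sum_i w_i x_i - bias_weight > 0 (bias input is -1)
        return lambda *xs: 1 if sum(w * x for w, x in zip(weights, xs)) - bias_weight > 0 else 0

    AND = threshold_unit([1, 1], 1.5)   # x1 + x2 - 1.5 > 0
    OR  = threshold_unit([1, 1], 0.5)   # x1 + x2 - 0.5 > 0
    NOT = threshold_unit([-1], -0.5)    #      0.5 - x1 > 0

    # composing units implements more complex functions:
    f = lambda x1, x2, x3: AND(OR(x1, x2), x3)   # (x1 or x2) and x3
    assert f(1, 0, 1) == 1 and f(1, 0, 0) == 0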
Types of Neural Networks
Feedforward: Links are unidirectional, and
there are no cycles, i.e., the network is a
directed acyclic graph (DAG). Units are
arranged in layers, and each unit is linked only
to units in the next layer. There is no internal
state other than the weights.
Recurrent: Links can form arbitrary topologies,
which can implement memory. Behavior can
become unstable, oscillatory, or chaotic.
Feedforward Neural Net
A recurrent network topology
Hopfield net: every unit i is connected to every other unit j by weight W_ij.
Weights are assumed to be symmetric: W_ij = W_ji.
Useful for associative memory: after training on a set of examples, a new stimulus will cause the network to settle into an activation pattern corresponding to the example in the training set that most closely resembles the new stimulus.
Perceptrons
Hopfield Nets
Multilayer Feedforward Nets
Perceptrons
Perceptrons are single-layer feedforward
networks
Each output unit is independent of the others
Can assume a single output unit
Activation of the output unit is calculated by:
O = Step_0( Σ_j W_j x_j )
where x_j is the activation of input unit j, and we assume an additional weight and input to represent the threshold.
Perceptron
Multiple Perceptrons
How can perceptrons be designed?
The Perceptron Learning Theorem (Rosenblatt,
1960): Given enough training examples, there is
an algorithm that will learn any linearly separable
function.
Learning algorithm:
If the perceptron fires when it should not, make each weight w_i smaller by an amount proportional to x_i.
If it fails to fire when it should, make each w_i proportionally larger.
The perceptron learning algorithm
Inputs: training set {(x_1, x_2, ..., x_n, t)}, where t is the correct output
Method:
Randomly initialize weights w_i in [-0.5, 0.5]
Repeat for several epochs until convergence:
  for each example:
    Calculate network output o
    Adjust weights (the perceptron training rule):
      Δw_i = η (t - o) x_i      (η = learning rate, (t - o) = error)
      w_i ← w_i + Δw_i
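A runnable sketch of this algorithm in Python (the function name and the OR example are ours):

    import random

    def train_perceptron(examples, n_inputs, eta=0.1, epochs=100):
        # extra weight w_0 on a fixed input of -1 represents the threshold
        w = [random.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
        for _ in range(epochs):
            converged = True
            for x, t in examples:
                xb = [-1.0] + list(x)                  # prepend bias input
                o = 1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else 0
                if o != t:                             # adjust only on error
                    converged = False
                    for i in range(len(w)):
                        w[i] += eta * (t - o) * xb[i]  # w_i += eta (t - o) x_i
            if converged:
                break
        return w

    # OR is linearly separable, so training converges:
    weights = train_perceptron([((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)], 2)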
Expressive limits of perceptrons
Can the XOR function be represented by a perceptron (a network without a hidden layer)?
[Figure: a unit with bias input x_0 = 1 (weight w_0) and inputs x_1, x_2 (weights w_1, w_2), output o(x_1, x_2).]
Representing XOR(x_1, x_2) would require:
w_0 + w_1*0 + w_2*0 <= 0    (XOR(0,0) = 0)
w_0 + w_1*0 + w_2*1 >  0    (XOR(0,1) = 1)
w_0 + w_1*1 + w_2*0 >  0    (XOR(1,0) = 1)
w_0 + w_1*1 + w_2*1 <= 0    (XOR(1,1) = 0)
Adding the two middle inequalities gives 2w_0 + w_1 + w_2 > 0, while adding the first and last gives 2w_0 + w_1 + w_2 <= 0: a contradiction.
There is no assignment of values to w_0, w_1 and w_2 that satisfies these inequalities. XOR cannot be represented!
So what can be represented using perceptrons?
[Figure: decision boundaries for AND and OR; in each case a single straight line separates the positive from the negative examples.]
Representation theorem: 1-layer feedforward networks can only represent linearly separable functions. That is, the decision surface separating positive from negative examples has to be a hyperplane (a line, in two dimensions).
Why does the method work?
The perceptron learning rule performs gradient descent in
weight space.
Error surface: The surface that describes the error on each
example as a function of all the weights in the network. A set of
weights defines a point on this surface.
We look at the partial derivative of the surface with respect to each weight (i.e., the gradient: how much the error would change if we made a small change in each weight). Each weight is then altered by an amount proportional to the slope in its direction. Thus the network as a whole moves in the direction of steepest descent on the error surface.
The error surface in weight space has a single global minimum and no local minima. Gradient descent is guaranteed to find the global minimum, provided the learning rate is not so large that you overshoot it.
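To make the gradient concrete, here is the standard derivation for a single unthresholded (linear) unit with squared error (a sketch we added; the slides state the rule without this step):

    E = (1/2)(t - o)^2, with o = Σ_i w_i x_i
    ∂E/∂w_i = (t - o) * ∂(t - o)/∂w_i = -(t - o) x_i
    Δw_i = -η ∂E/∂w_i = η (t - o) x_i

which is exactly the training rule above.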
Perceptrons
Hopfield Nets
Multilayer Feedforward Nets
Hopfield Nets: John Hopfield, 1982
distributed representation
memory is stored as a pattern of activation
different memories are different patterns on the SAME PEs
distributed, asynchronous control
each processor makes decisions based on local situation
content-addressable memory
a number of patterns can be stored in a net;
to retrieve a pattern, specify some (or all) of it & it will find
the closest match
fault tolerance
the network works even if a few PEs misbehave or fail
(graceful degradation)
also handles novel inputs well (robust)
Distributed Information Storage & Processing
[Figure: weights w_i, w_j, w_k spread across a network.]
Information is stored in the weights:
Concepts/patterns are spread over many weights and nodes.
Individual weights can hold info for many different concepts.
Parallel Relaxation
Choose an arbitrary unit. If any neighbors are active, compute the sum of the weights on the connections to its active neighbors.
If the sum is positive, then activate the unit; else, deactivate it.
Continue until a stable state is achieved (all units have been considered & no more units can change).
Hopfield showed that given any set of weights and any initial state, the parallel relaxation algorithm would eventually settle into a stable state.
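A minimal sketch of parallel relaxation in Python (the data layout and names are ours, not from the slides):

    import random

    def parallel_relaxation(weights, state, max_sweeps=100):
        # weights: dict mapping a pair (i, j) to the weight of that link
        # state: list of 0/1 activations (1 = active unit)
        n = len(state)
        get_w = lambda i, j: weights.get((i, j)) or weights.get((j, i)) or 0
        for _ in range(max_sweeps):
            changed = False
            for i in random.sample(range(n), n):        # arbitrary order
                # sum of weights on links to active neighbors
                s = sum(get_w(i, j) for j in range(n) if j != i and state[j])
                new = 1 if s > 0 else 0
                if new != state[i]:
                    state[i], changed = new, True
            if not changed:                             # stable state reached
                break
        return state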
Example Hopfield Net
[Figure: a network of units joined by links with weights -1, -1, -1, -1, -2, +3, +3, +1, +1, +1, +2; filled circles mark active units, open circles inactive units.]
Note that this is a stable state.
Test Input:
[Figure: the same network with a different pattern of active and inactive units.]
What steady state does this converge to?
Another Test Input:
[Figure: the same network with yet another pattern of active and inactive units.]
What steady state does this converge to?
Four Stable States
Perceptrons
Hopfield Nets
Multilayer Feedforward Nets
Multilayer Feedforward Net
Multi-layer networks
Multi-layer feedforward networks are trainable
by backpropagation provided the activation
function g is a differentiable function.
Threshold units don't qualify, but the logistic function does.
Sigmoid units
Soft threshold units.
[Figure: a unit with inputs x_0, ..., x_n, weights w_0, ..., w_n, and output o; the input function computes a = Σ_{i=0}^{n} w_i x_i.]
Sigmoid unit for g, a "squashing" function:
σ(a) = 1 / (1 + e^-a)
Its derivative:
∂σ(a)/∂a = σ(a)(1 - σ(a))
This is g′, the basis for gradient descent.
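A quick numerical check of this identity in Python (a sketch we added, not from the slides):

    import math

    def sigmoid(a):
        return 1.0 / (1.0 + math.exp(-a))

    # central finite difference vs. the analytic form sigma(a)(1 - sigma(a))
    a, h = 0.7, 1e-6
    numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)
    analytic = sigmoid(a) * (1 - sigmoid(a))
    assert abs(numeric - analytic) < 1e-9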
Weight adjustment with backpropagation in
multi-layer networks
Learning is similar to learning with perceptrons:
Example inputs are presented to the network. If the network
computes an output vector that matches the target, do nothing.
If there is a difference between output and target (i.e., an error),
then the weights are adjusted to reduce this error.
The key is to assess the blame for the error and divide it among
the contributing weights.
The error term (t - o) is known for the units in the output
layer. To adjust the weights between the hidden and the
output layer, we can use the gradient descent rule as
done for perceptrons.
To adjust weights between the input and hidden layer, we
need some way of estimating the errors made by the
hidden units.
Error back-propagation
Key idea: each hidden node is responsible for some
fraction of the error in each of the output nodes.
This fraction equals the strength of the connection
(weight) between the hidden node and the output
node.
error at hidden node j = Σ_{i ∈ outputs} w_{ji} δ_i
where δ_i is the error at output node i.
The backpropagation algorithm for multi-layer networks with sigmoid units
Initialize all weights in the network to small random numbers.
Until weights converge do
  For each training example
    Compute network output vector o
    For each output unit k:
      δ_k = o_k (1 - o_k)(t_k - o_k)
    Update each network weight:
      w_hk ← w_hk + η δ_k x_h
    For each hidden unit h:
      δ_h = o_h (1 - o_h) Σ_{k ∈ outputs} w_hk δ_k      (error backpropagation)
    Update each network weight:
      w_ih ← w_ih + η δ_h x_i
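A compact runnable sketch of this algorithm for one hidden layer in Python (the structure and names are ours; it follows the update rules above):

    import math, random

    def sigmoid(a):
        return 1.0 / (1.0 + math.exp(-a))

    def train_backprop(examples, n_in, n_hid, n_out, eta=0.5, epochs=2000):
        rnd = lambda: random.uniform(-0.5, 0.5)
        w_ih = [[rnd() for _ in range(n_in + 1)] for _ in range(n_hid)]   # +1 bias weight
        w_ho = [[rnd() for _ in range(n_hid + 1)] for _ in range(n_out)]
        for _ in range(epochs):
            for x, t in examples:
                xb = list(x) + [-1.0]                                     # bias input
                h = [sigmoid(sum(w * xi for w, xi in zip(ws, xb))) for ws in w_ih]
                hb = h + [-1.0]
                o = [sigmoid(sum(w * hi for w, hi in zip(ws, hb))) for ws in w_ho]
                # delta_k = o_k (1 - o_k)(t_k - o_k)
                d_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
                # delta_h = o_h (1 - o_h) sum_k w_hk delta_k
                d_h = [h[j] * (1 - h[j]) * sum(w_ho[k][j] * d_o[k] for k in range(n_out))
                       for j in range(n_hid)]
                for k in range(n_out):                      # w_hk += eta delta_k x_h
                    for j in range(n_hid + 1):
                        w_ho[k][j] += eta * d_o[k] * hb[j]
                for j in range(n_hid):                      # w_ih += eta delta_h x_i
                    for i in range(n_in + 1):
                        w_ih[j][i] += eta * d_h[j] * xb[i]
        return w_ih, w_ho

With a hidden layer this can learn XOR, which the perceptron could not, e.g. train_backprop([([0,0],[0]), ([0,1],[1]), ([1,0],[1]), ([1,1],[0])], 2, 2, 1).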
Momentum
Sometimes we add a momentum term to Δw_ij: the previous update multiplied by a momentum factor α. This allows us to use a high learning rate, but prevents the oscillatory behavior that can sometimes result from a high learning rate.
Δw_ij(n) = η δ_j x_ij + α Δw_ij(n - 1)
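In the backpropagation sketch above, this would change the weight update like so (illustrative; prev_dw is a new table we assume, initialized to zeros):

    # keep the previous update for each weight, then:
    dw = eta * d_o[k] * hb[j] + alpha * prev_dw[k][j]   # momentum term
    w_ho[k][j] += dw
    prev_dw[k][j] = dw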
More on backpropagation
Performs gradient descent over the entire
network weight vector.
Will find a local, not necessarily global error
minimum.
Minimizes error over training set; need to guard
against overfitting just as with decision tree
learning.
Training takes thousands of iterations (epochs) -- slow!
When neural nets are appropriate for
learning problems
Instances are represented by attribute-value pairs.
Pre-processing required: continuous input values must be scaled into the [0, 1] range, and discrete values need to be converted to Boolean features.
Training examples are noisy.
Long training times are acceptable.
Human understandability of learned function is
unimportant.
However, there is some work on converting NNs to rules.
What is actually learned by the NN?
Network topology
Designing network topology is an art.
We can learn the network topology using genetic algorithms, but using GAs is very CPU-intensive. An alternative that people use is hill-climbing.
Applications of neural networks
Alvinn (the neural network that learns to drive a van
from camera inputs).
NETtalk: a network that learns to pronounce English
text.
Recognizing hand-written zip codes.
Lots of applications in financial time series analysis.
Driving Miss ANN
ALVINN (Pomerleau, 1993)
[Figure: a 30 x 32 camera image feeds 960 input nodes, through 4 hidden nodes, to 30 output nodes for steering, ranging from sharp left to sharp right.]
Training = watch a human driver for a while
Testing = the ANN drives alone, for up to 90 miles
Drove 98% alone across the USA
Symbolic methods for this task can't compare!
