
Artificial Neural Networks

(ANNs)

Background and Motivation


Computers and the Brain: A Contrast

Arithmetic: 1 brain ≈ 1/10 of a pocket calculator
Vision: 1 brain ≈ 1000 supercomputers
Memory of arbitrary details: the computer wins
Memory of real-world (related) facts: the brain wins
A computer must be programmed explicitly
The brain can learn by experiencing the world

What are Artificial Neural Networks?


A system loosely modeled on the human brain, built from multiple layers of
simple processing elements called neurons.
There are various classes of ANN models, depending on:
- Problem type: prediction, classification, clustering
- Design / structure of the model
- Model-building algorithm

A bit of biology . . .
The most important functional unit in the human brain is a class of cells called
NEURONS.

[Figure: an actual neuron and its schematic, labeled with dendrites, cell body, axon and synapse]

Dendrites - receive information
Cell Body - processes information
Axon - carries processed information to other neurons
Synapse - junction between the axon end of one neuron and the dendrites of other neurons

Artificial Neurons
- Simulated in hardware or in software
- Input connections: the receivers
- Node (unit, Processing Unit or Processing Element): simulates the neuron body
- Output connection and transfer function: the transmitter (axon)
- Activation function: employs a threshold or bias
- Connection weights: act as synaptic junctions
Learning (for a particular model) occurs via changes in the values of the
various weights and the choice of functions.

An Artificial Neuron: Structure


[Figure: inputs X1, X2, ..., Xp enter the cell body through connections with
weights w1, w2, ..., wp; the axon carries the output in the direction of
information flow]

I = w1*X1 + w2*X2 + ... + wp*Xp
V = f(I)

Receives inputs X1, X2, ..., Xp from other neurons or the environment
Inputs are fed in through connections with weights
Total input = weighted sum of inputs from all sources
Transfer function and activation function convert the input to output
Output goes to other neurons or the environment
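As a concrete illustration, here is a minimal sketch of such an artificial neuron in Python: a weighted sum of the inputs followed by a logistic activation. The function name, the optional bias argument and the example numbers are illustrative, not taken from the slides.

```python
import math

def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of inputs (plus an optional bias) passed through a logistic activation."""
    total_input = sum(w * x for w, x in zip(weights, inputs)) + bias  # I = w1*X1 + ... + wp*Xp
    return 1.0 / (1.0 + math.exp(-total_input))                      # V = f(I)

# Example: three inputs with three weights
print(neuron_output([1.0, -1.0, 2.0], [0.5, -0.1, -0.2]))  # total input 0.2, output ~ 0.55
```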

Behaviour of an Artificial Neural Network

The response of an ANN to any particular input depends upon:
- the structure of each node (threshold / activation / transfer function)
- the structure of the network (architecture)
- the weights on each of the connections
.... and these must be learned!

Activity in an ANN Node

Mathematical Model of a Node

[Figure: incoming activations a0 ... ai ... an enter the node through weights
w0 ... wi ... wn, pass through an adder function and an activation function,
and produce the outgoing activation]

Activity in an ANN Node

Adder function: the total input is the weighted sum of the incoming activations,

I = Σ (i = 0 to n) a_i * w_i

[Figure: the same node diagram, with the adder function highlighted]

Activity in an ANN Node

Activation function: the outgoing activation is obtained by applying a threshold t to the total input I,

f(I) = 1 if I > t, 0 if I <= t

[Figure: the same node diagram, with the activation function highlighted]

Transfer Functions
There are various choices for transfer / activation functions:

Tanh
f(x) = (e^x - e^-x) / (e^x + e^-x)

Logistic
f(x) = e^x / (1 + e^x)

Threshold
f(x) = 0 if x < 0, 1 if x >= 0

[Figure: plots of the three functions]
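A small Python sketch of the three functions listed above, with the threshold cutoff taken at 0 (the usual convention):

```python
import math

def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def logistic(x):
    return math.exp(x) / (1.0 + math.exp(x))

def threshold(x):
    return 0.0 if x < 0 else 1.0

for f in (tanh, logistic, threshold):
    print(f.__name__, f(-1.0), f(0.2), f(2.0))
```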

Artificial Neural Network Architecture


The most common neural network architecture has three layers:
- 1st layer: input layer
- 2nd layer: hidden layer
- 3rd layer: output layer

More complex ANN Architectures


The number of hidden layers can be:
- None
- One
- More than one

Feed-forward Artificial Neural Network Architecture


A collection of neurons forms a layer.

[Figure: inputs X1-X4 feed an input layer, which feeds a hidden layer, which
feeds an output layer producing y1 and y2; information flows from input to output]

Input Layer
- Each neuron gets ONLY one input, directly from outside

Hidden Layer
- Connects the input and output layers

Output Layer
- The output of each neuron goes directly to the outside

Feed-forward Artificial Neural Network Architecture


A few restrictions:

- Within a layer, neurons are NOT connected to each other.
- A neuron in one layer is connected ONLY to neurons in the NEXT layer (feed-forward).
- Jumping over a layer is NOT allowed.

[Figure: the same four-input, two-output network, illustrating the allowed connections]

An ANN model
What do we mean by "a particular model"?

For an ANN:
Input: X1, X2, X3
Output: Y
Model: Y = f(X1, X2, X3)

The algebraic form of f(.) is too complicated to write down.
However, it is characterized by:
- # input neurons
- # hidden layers
- # neurons in each hidden layer
- # output neurons
- the adder, activation and transfer functions
- the WEIGHTS for all the connections
Fitting an ANN model = specifying values for all these parameters

ANN Model: an Example

Input: X1, X2, X3
Output: Y
Model: Y = f(X1, X2, X3)

[Figure: a network with 3 input neurons, one hidden layer of 2 neurons and 1
output neuron; the connections carry the weights 0.6, -0.1, -0.2, 0.1, 0.7,
0.5, 0.1 and -0.2]

Parameters and how they are decided in this example:
- # input neurons = # of Xs (here 3) - decided by the structure of the problem
- # output neurons = # of Ys (here 1) - decided by the structure of the problem
- # hidden layers and # neurons in each hidden layer - free parameters
- weights - learnt

Prediction using a particular ANN Model

Input: X1 = 1, X2 = -1, X3 = 2
Output: Y
Model: Y = f(X1, X2, X3)

[Figure: the example network with these inputs fed in; the hidden neurons
produce 0.55 and 0.71, and the output neuron produces 0.478]

Hidden neuron 1: total input 0.2 = 0.5*1 - 0.1*(-1) - 0.2*2
Logistic activation: f(x) = e^x / (1 + e^x), so f(0.2) = e^0.2 / (1 + e^0.2) = 0.55
Hidden neuron 2: total input 0.9, f(0.9) = 0.71
Output neuron: total input -0.087, f(-0.087) = 0.478

Predicted Y = 0.478
Suppose actual Y = 2
Then prediction error = (2 - 0.478) = 1.522
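The forward pass of this worked example can be reproduced in a few lines of Python. The weights for hidden neuron 1 (0.5, -0.1, -0.2) are stated on the slide; the assignment of the remaining diagram weights to hidden neuron 2 (0.1, 0.6, 0.7) and to the output neuron (0.1, -0.2) is inferred from the intermediate sums 0.9 and -0.087, so treat it as a reconstruction.

```python
import math

def logistic(x):
    return math.exp(x) / (1.0 + math.exp(x))

x = [1.0, -1.0, 2.0]                               # X1, X2, X3
h1 = logistic(0.5*x[0] - 0.1*x[1] - 0.2*x[2])      # f(0.2)    ~ 0.55
h2 = logistic(0.1*x[0] + 0.6*x[1] + 0.7*x[2])      # f(0.9)    ~ 0.71
y  = logistic(0.1*h1 - 0.2*h2)                     # f(-0.087) ~ 0.478

print(round(h1, 2), round(h2, 2), round(y, 3))     # 0.55 0.71 0.478
print("prediction error:", round(2 - y, 3))        # actual Y = 2 -> ~1.522
```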

Real Estate Appraisal by Neural Networks

Building an ANN Model

Input: X1, X2, X3
Output: Y
Model: Y = f(X1, X2, X3)

# input neurons = # inputs = 3
# output neurons = # outputs = 1
# hidden layers = ???
# neurons in the hidden layer = ???
No fixed strategy; decided by trial and error (Try 1, Try 2, ...)

The architecture is now defined. How do we get the weights?

Given the architecture, there are 8 weights to decide:
W = (W1, W2, ..., W8)
Training data: (Yi, X1i, X2i, ..., Xpi), i = 1, 2, ..., n
Given a particular choice of W, we get predicted Ys: (V1, V2, ..., Vn)
They are functions of W.
Choose W such that the overall prediction error E is minimized:

E = Σ (Yi - Vi)^2
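For concreteness, a minimal sketch of the overall prediction error as a Python function (the helper name and usage are illustrative, not from the slides):

```python
def total_error(Y, V):
    """E = sum over all observations of (Yi - Vi)^2."""
    return sum((y - v) ** 2 for y, v in zip(Y, V))

# Training aims to choose W so that total_error(Y, predictions_for(W)) is as small as possible.
```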

Training the Model: Feed forward and Back Propagation

E = Σ (Yi - Vi)^2

Start with a random set of weights.
Feed forward the first observation through the net:
X1 -> Network -> V1 ; Error = (Y1 - V1)
Adjust the weights so that this error is reduced
(the network fits the first observation well).
Feed forward the second observation.
Adjust the weights to fit the second observation well.
Keep repeating till you reach the last observation.
This finishes one CYCLE through the data.
Randomize the order of the input observations.
Perform many such training cycles till the overall prediction error E is small.

Back Propagation
Each weight "shares the blame" for the prediction error and adjusts along with
the other weights.
The Back Propagation algorithm decides how to distribute the blame among all
the weights and adjusts the weights accordingly.
A small portion of the blame leads to a small adjustment.
A large portion of the blame leads to a large adjustment.

E = Σ (Yi - Vi)^2

Weight adjustment during Back Propagation

Weight adjustment formula in Back Propagation:
Vi, the prediction for the i-th observation, is a function of the network
weight vector W = (W1, W2, ...).
Hence E, the total prediction error, is also a function of W:

E(W) = Σ [ Yi - Vi(W) ]^2

Gradient Descent Method:
For every individual weight Wi, the update formula looks like

W_new = W_old - η * (∂E/∂W) | W_old

η = learning parameter (between 0 and 1)

Another slight variation is also used sometimes:

W(t+1) = W(t) - η * (∂E/∂W) | W(t) + α * ( W(t) - W(t-1) )

α = momentum (between 0 and 1)
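A small Python sketch of the two update rules above for a single weight. The gradient dE_dw is whatever back propagation supplies; the default values of eta (learning parameter) and alpha (momentum) are illustrative.

```python
def update_weight(w_old, dE_dw, eta=0.1):
    # plain gradient descent: step against the gradient of E
    return w_old - eta * dE_dw

def update_weight_momentum(w_t, w_t_minus_1, dE_dw, eta=0.1, alpha=0.5):
    # gradient step plus a momentum term that reuses the previous change
    return w_t - eta * dE_dw + alpha * (w_t - w_t_minus_1)
```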

Geometric interpretation of the weight adjustment

Consider a very simple network with 2 inputs, 1 output and no hidden layer.
There are only two weights whose values need to be specified:

E(w1, w2) = Σ [ Yi - Vi(w1, w2) ]^2

A pair (w1, w2) is a point on the 2-D plane.
For any such point we can get a value of E.
Plot E vs (w1, w2): a 3-D surface, the "error surface".
The aim is to identify the pair for which E is minimum,
i.e. the pair for which the height of the error surface is minimum.

Gradient Descent Algorithm

Start with a random point (w1, w2).
Move to a better point (w1', w2') where the height of the error surface is lower.
Keep moving till you reach (w1*, w2*), where the error is minimum.

Crawling the Error Surface

[Figure: a 3-D error surface plotted over the weight space (W1, W2), showing
the starting point w0, a local minimum, and the global minimum w*]

Training Algorithm
Decide the network architecture
(# hidden layers, # neurons in each hidden layer)
Decide the learning parameter and momentum
Initialize the network with random weights
Do until the convergence criterion is met
  For i = 1 to # training data points
    Feed forward the i-th observation through the net
    Compute the prediction error on the i-th observation
    Back propagate the error and adjust the weights
  Next i
  Check for convergence
End Do

E = Σ (Yi - Vi)^2
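A runnable Python sketch of this training loop for a fixed 3-2-1 architecture (3 inputs, one hidden layer of 2 logistic neurons, 1 logistic output), trained one observation at a time with back propagation. The toy data, the absence of bias terms, and the parameter values are illustrative only.

```python
import math
import random

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w, x):
    """w holds two input->hidden weight vectors and one hidden->output weight vector."""
    h = [logistic(sum(wi * xi for wi, xi in zip(w[j], x))) for j in range(2)]
    y = logistic(sum(wi * hi for wi, hi in zip(w[2], h)))
    return h, y

def train(data, eta=0.5, cycles=1000):
    random.seed(0)
    w = [[random.uniform(-0.5, 0.5) for _ in range(3)],   # input -> hidden neuron 1
         [random.uniform(-0.5, 0.5) for _ in range(3)],   # input -> hidden neuron 2
         [random.uniform(-0.5, 0.5) for _ in range(2)]]   # hidden -> output
    for _ in range(cycles):
        random.shuffle(data)                              # randomize observation order
        for x, target in data:
            h, y = forward(w, x)
            # back propagate: output delta, then hidden deltas
            delta_out = (y - target) * y * (1 - y)
            delta_hid = [delta_out * w[2][j] * h[j] * (1 - h[j]) for j in range(2)]
            # adjust weights (gradient descent on the squared error)
            for j in range(2):
                w[2][j] -= eta * delta_out * h[j]
                for i in range(3):
                    w[j][i] -= eta * delta_hid[j] * x[i]
    return w

# toy training data: Y = 1 if X1 + X2 + X3 > 0, else 0 (illustrative)
data = [([1, -1, 2], 1), ([-1, -1, 0], 0), ([0, 1, 1], 1), ([-2, 1, -1], 0)]
w = train(data)
for x, target in data:
    print(target, round(forward(w, x)[1], 2))
```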

Convergence Criterion
When do we stop training the network?
Ideally, when we reach the global minimum of the error surface.
How do we know we have reached there? We don't.

Suggestions:
1. Stop if the decrease in total prediction error (since the last cycle) is small.
2. Stop if the overall changes in the weights (since the last cycle) are small.
Drawback:
The error keeps on decreasing, and we get a very good fit to the training data,
BUT the network thus obtained has poor generalizing power on unseen data.
This phenomenon is known as over-fitting of the training data.
The network is said to "memorize" the training data:
- when an X from the training set is given, the network faithfully produces the corresponding Y,
- however, for Xs which the network didn't see before, it predicts poorly.

Convergence Criterion
Modified suggestion:
Partition the training data into a Training set and a Validation set.
Use:
- the Training set to build the model
- the Validation set to test the performance of the model on unseen data
Typically, as we go through more and more training cycles,
- the error on the Training set keeps on decreasing,
- the error on the Validation set first decreases and then increases.

[Figure: error vs. training cycle, with the training-set curve decreasing and
the validation-set curve turning upward]

Stop training when the error on the Validation set starts increasing.
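A minimal sketch of this early-stopping rule in Python. The callables train_one_cycle and validation_error are assumed helpers (for example, built along the lines of the earlier sketches), not functions defined in the slides.

```python
def train_with_early_stopping(weights, train_set, valid_set,
                              train_one_cycle, validation_error, max_cycles=10000):
    best_weights, best_error = weights, float("inf")
    for _ in range(max_cycles):
        weights = train_one_cycle(weights, train_set)   # one pass over the training data
        error = validation_error(weights, valid_set)     # error on unseen data
        if error >= best_error:                          # validation error started increasing
            break
        best_weights, best_error = weights, error
    return best_weights
```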

Choice of Training Parameters

Learning parameter and momentum
- need to be supplied by the user from outside; both should be between 0 and 1.
What should be the optimal values of these training parameters?
- No clear consensus on any fixed strategy.
- However, the effects of wrongly specifying them are well studied.
Learning parameter
- Too big: large leaps in weight space, with the risk of missing the global minimum.
- Too small:
  - takes a long time to converge to the global minimum,
  - once stuck in a local minimum, it is difficult to get out of it.
Suggestion (see the sketch below):
Trial and error. Try various choices of learning parameter and momentum and see
which choice leads to the minimum prediction error.
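One way to carry out this trial-and-error suggestion is a small grid search; a hedged sketch is below. The helpers train_network and validation_error, and the candidate values, are assumptions for illustration.

```python
import itertools

def pick_training_parameters(train_set, valid_set, train_network, validation_error):
    best = None
    for eta, alpha in itertools.product([0.01, 0.1, 0.5], [0.0, 0.5, 0.9]):
        w = train_network(train_set, eta=eta, alpha=alpha)   # train with this parameter pair
        err = validation_error(w, valid_set)                 # evaluate on held-out data
        if best is None or err < best[0]:
            best = (err, eta, alpha)
    return best  # (lowest error, chosen learning parameter, chosen momentum)
```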

Summary

- Artificial Neural Network (ANN)
  - A class of models inspired by biological neurons.
  - Used for various modeling problems: prediction, classification, clustering, ...
- One particular subclass of ANNs: feed-forward back-propagation networks
  - Organized in layers: input, hidden, output.
  - Each layer is a collection of artificial neurons.
  - Neurons in one layer are connected to neurons in the next layer.
  - Connections have weights.
- Fitting an ANN model means finding the values of these weights.
- Given a training data set, the weights are found by the feed-forward back-propagation
  algorithm, which is a form of the Gradient Descent Method, a popular technique for
  function minimization.
- The network architecture as well as the training parameters are decided upon by
  trial and error: try various choices and pick the one that gives the lowest
  prediction error.

Self-Organizing Maps

Overview
A Self-Organizing Map (SOM) is a way to represent higher-dimensional data in a
(usually 2-D or 3-D) form, such that similar data are grouped together.
It runs unsupervised and performs the grouping on its own.
Once the SOM converges, it can classify new data.
SOMs run in two phases:
- Training phase: the map is built; the network organizes itself through a
  competitive process and is trained using a large number of inputs.
- Mapping phase: new vectors are quickly given a location on the converged map,
  easily classifying or categorizing the new data.

Example

Data on poverty levels in different countries.

The SOM does not show poverty levels directly; rather, it shows how similar the
poverty levels of different countries are to each other (similar color = similar
poverty level).

Forming the Map

The Basic Process

1) Each pixel is a node / unit.
2) Initialize each node's weights with random values.
3) Take the next vector from the training data and present it to the SOM.
4) Every node is examined to find the Best Matching Unit / Node (BMU).
5) Adjust the BMU's weights to make it come closer to the input vector. The rate
   of adjustment is determined by the learning rate, which decreases with each
   iteration.
6) The radius of the neighborhood around the BMU is calculated. The size of the
   neighborhood decreases with each iteration.
7) Each node in the BMU's neighborhood has its weights adjusted to become more
   like the BMU. Nodes closest to the BMU are altered more than nodes further
   away in the neighborhood.
8) Repeat from step 3 to 7 for each vector in the training data.
9) Repeat from step 3 to 8 for enough iterations for convergence.
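A compact Python sketch of this process for a 2-D grid of nodes. It simplifies steps 8-9 by sampling a random training vector each iteration instead of sweeping the whole data set per cycle; all constants (grid size, L0, sigma0, lambda) are illustrative.

```python
import math
import random

def train_som(data, rows=10, cols=10, iterations=1000, L0=0.5, sigma0=5.0, lam=300.0):
    dim = len(data[0])
    grid = {(r, c): [random.random() for _ in range(dim)]        # steps 1-2: random weights
            for r in range(rows) for c in range(cols)}
    for t in range(iterations):
        v = random.choice(data)                                   # step 3: present an input vector
        bmu = min(grid, key=lambda n: sum((w - x) ** 2            # step 4: best matching unit
                                          for w, x in zip(grid[n], v)))
        L = L0 * math.exp(-t / lam)                               # step 5: decaying learning rate
        sigma = sigma0 * math.exp(-t / lam)                       # step 6: shrinking radius
        for node, weights in grid.items():                        # step 7: update BMU and neighbours
            d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
            theta = math.exp(-d2 / (2 * sigma ** 2))              # Gaussian neighbourhood influence
            grid[node] = [w + theta * L * (x - w) for w, x in zip(weights, v)]
    return grid

# usage sketch: som = train_som([[0.1, 0.9, 0.2], [0.8, 0.1, 0.1]])
```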

Calculating the Best Matching Unit

Calculating the BMU is done according to the Euclidean distance between the
node's weights (W1, W2, ..., Wn) and the input vector's values (V1, V2, ..., Vn):

dist = sqrt( Σ (Vi - Wi)^2 )

This gives a good measurement of how similar the two sets of data are to each other.
The node closest to the input vector is chosen as the BMU (or "winning" node).

Determining the BMU


Neighborhood

Size of the neighborhood: an exponential decay function that shrinks on each
iteration, until eventually the neighborhood is just the BMU itself.

Effect of location within the neighborhood: the neighborhood is defined by a
Gaussian curve, so that nodes closer to the BMU are influenced more than nodes
farther away.

Modifying the Node's Weights

The new weight for a node is the old weight, plus a fraction (the learning rate L)
of the difference between the old weight and the input vector, scaled (by theta)
according to the node's distance from the BMU:

W(t+1) = W(t) + theta(t) * L(t) * ( V(t) - W(t) )

The learning rate L is also an exponential decay function:

L(t) = L0 * exp( -t / lambda )

This ensures that the SOM will converge.
Here lambda is a constant and t is the time step.
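The same update, written directly from the formulas above as a small Python sketch (function names and default constants are illustrative):

```python
import math

def learning_rate(t, L0=0.5, lam=300.0):
    # L(t) = L0 * exp(-t / lambda): exponential decay over time step t
    return L0 * math.exp(-t / lam)

def updated_weights(W, V, theta, L):
    # move each weight a fraction theta * L of the way toward the input vector
    return [w + theta * L * (v - w) for w, v in zip(W, V)]
```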

SOM Application: Text Clustering
