
Machine Learning Srihari

Neural Network Training

Sargur Srihari

Topics

•  Neural network parameters
•  Probabilistic problem formulation
•  Determining the error function
   •  Regression
   •  Binary classification
   •  Multi-class classification
•  Parameter optimization
   •  Local quadratic approximation
   •  Use of gradient information
   •  Gradient descent optimization


Neural Network parameters

•  Linear models for regression and classification can be represented as

       y(x, w) = f\Big( \sum_{j=1}^{M} w_j \phi_j(x) \Big)

   which are linear combinations of basis functions φ_j(x)
•  In a neural network the basis functions φ_j(x) themselves depend on parameters
•  During training these parameters are adjusted along with the coefficients w_j


Network Training: Sum of squared errors


•  Neural networks perform a transformation
   •  from a vector x of input variables to a vector y of output variables
•  For a sigmoid output activation function

       y_k(x, w) = \sigma\Big( \sum_{j=1}^{M} w_{kj}^{(2)} \, h\Big( \sum_{i=1}^{D} w_{ji}^{(1)} x_i \Big) \Big)

   with D input variables and M hidden units
•  where the vector w consists of all weight and bias parameters


•  To determine w, use a simple analogy with curve fitting
   •  minimize the sum-of-squares error function
•  Given input vectors {x_n}, n = 1,..,N, and target vectors {t_n}, minimize the error function over the N training vectors (a short numpy sketch follows)

       E(w) = \frac{1}{2} \sum_{n=1}^{N} \| y(x_n, w) - t_n \|^2
•  Consider a more general probabilistic interpretation
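
As an illustration (not part of the slides), a minimal numpy sketch of the two-layer transformation and the sum-of-squares error above; the tanh hidden activation h and all variable names are assumptions made for this example:

import numpy as np

def forward(x, W1, b1, W2, b2):
    # Two-layer network: D inputs -> M hidden units -> K sigmoid outputs.
    a1 = W1 @ x + b1                     # hidden pre-activations, shape (M,)
    z1 = np.tanh(a1)                     # hidden-unit outputs h(.) (tanh assumed)
    a2 = W2 @ z1 + b2                    # output pre-activations, shape (K,)
    return 1.0 / (1.0 + np.exp(-a2))     # sigmoid outputs y_k(x, w)

def sum_of_squares_error(X, T, W1, b1, W2, b2):
    # E(w) = 1/2 * sum_n || y(x_n, w) - t_n ||^2 over the N training vectors.
    Y = np.array([forward(x, W1, b1, W2, b2) for x in X])
    return 0.5 * np.sum((Y - T) ** 2)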

Probabilistic View: From activation function f, determine error function E (as defined by the likelihood function)

1.  Regression
    •  f: activation function is the identity,  y(x, w) = w^T \phi(x) = \sum_{j=1}^{M} w_j \phi_j(x)
    •  E: sum-of-squares error / maximum likelihood,  E(w) = \frac{1}{2} \sum_{n=1}^{N} \| y(x_n, w) - t_n \|^2

2.  (Multiple independent) binary classifications
    •  f: activation function is the logistic sigmoid,  y(x, w) = \sigma(w^T \phi(x)) = \frac{1}{1 + \exp(-w^T \phi(x))}
    •  E: cross-entropy error function,  E(w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}

3.  Multiclass classification
    •  f: softmax outputs,  y_k(x, w) = \frac{\exp(w_k^T \phi(x))}{\sum_j \exp(w_j^T \phi(x))}
    •  E: cross-entropy error function,  E(w) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_k(x_n, w)

1. Probabilistic View: Regression

•  Output is a single target variable t that can take any real value
•  Assume t is Gaussian distributed with an x-dependent mean:

       p(t | x, w) = \mathcal{N}(t \mid y(x, w), \beta^{-1})

•  Likelihood function:

       p(\mathbf{t} | \mathbf{x}, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, w), \beta^{-1})

•  Taking the negative logarithm, we get the negative log-likelihood

       \frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 - \frac{N}{2} \ln \beta + \frac{N}{2} \ln(2\pi)

•  which can be used to learn the parameters w and β (a small sketch follows)
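
A small illustrative sketch (an assumption, not from the slides) of the negative log-likelihood above, given network outputs Y, targets T and precision beta:

import numpy as np

def gaussian_negative_log_likelihood(Y, T, beta):
    # beta/2 * sum_n (y_n - t_n)^2 - N/2 * ln(beta) + N/2 * ln(2*pi)
    N = len(T)
    squared_error = np.sum((np.asarray(Y) - np.asarray(T)) ** 2)
    return 0.5 * beta * squared_error - 0.5 * N * np.log(beta) + 0.5 * N * np.log(2.0 * np.pi)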


Regression Error Function

•  The likelihood function could be used to learn the parameters w and β
•  This is usually done in a Bayesian treatment
•  In the neural network literature, minimizing an error function is used instead
•  The two approaches are equivalent here
•  The sum-of-squares error is

       E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2

•  Its smallest value occurs when ∇E(w) = 0
•  Since E(w) is non-convex:
   •  the solution w_ML is found by iterative optimization,  w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)}
   •  e.g., gradient descent (discussed later in this lecture)
   •  the required gradients can be evaluated efficiently by back-propagation
•  For regression the output activation is the identity, y_k = a_k, where

       a_k = \sum_{i=1}^{M} w_{ki}^{(2)} z_i + w_{k0}^{(2)},   k = 1,..,K

   (z_i are the hidden-unit outputs), so the error derivative takes the simple form

       \frac{\partial E}{\partial a_k} = y_k - t_k

•  Having found w_ML, the value of β_ML can also be found (sketched below) using

       \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{ y(x_n, w_{ML}) - t_n \}^2
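
A hypothetical helper illustrating this estimate of β_ML from the residuals of the maximum-likelihood fit (names are illustrative):

import numpy as np

def beta_ml(Y_ml, T):
    # 1/beta_ML = (1/N) * sum_n { y(x_n, w_ML) - t_n }^2, so beta_ML is the
    # inverse of the mean squared residual.
    residuals = np.asarray(Y_ml) - np.asarray(T)
    return len(T) / np.sum(residuals ** 2)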

2. Binary Classification
•  Single target variable t, where t = 1 denotes C1 and t = 0 denotes C2
•  Consider a network with a single output whose activation function is the logistic sigmoid

       y = \sigma(a) = \frac{1}{1 + \exp(-a)}

   so that 0 < y(x, w) < 1
•  Interpret y(x, w) as the conditional probability p(C1 | x)
•  Conditional distribution of targets given inputs:

       p(t | x, w) = y(x, w)^{t} \{ 1 - y(x, w) \}^{1 - t}



Binary Classification Error Function


•  The error function is the negative log-likelihood, which in this case is the cross-entropy error function

       E(w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}

•  where y_n denotes y(x_n, w)


•  Using cross-entropy error function instead of sum of
squares leads to faster training and improved
generalization
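
A minimal numpy sketch of this cross-entropy error; the clipping constant is an assumption added to avoid log(0) for saturated sigmoid outputs:

import numpy as np

def binary_cross_entropy(Y, T, eps=1e-12):
    # E(w) = -sum_n [ t_n * ln(y_n) + (1 - t_n) * ln(1 - y_n) ]
    Y = np.clip(Y, eps, 1.0 - eps)       # keep the logarithms finite
    return -np.sum(T * np.log(Y) + (1.0 - T) * np.log(1.0 - Y))

The same expression applied to N x K arrays of outputs and targets gives the error for the K separate binary classifications on the next slide.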


2. K Separate Binary Classifications

•  The network has K outputs, each with a logistic sigmoid activation function
•  Associated with each output is a binary class label t_k

       p(\mathbf{t} | x, w) = \prod_{k=1}^{K} y_k(x, w)^{t_k} \, [1 - y_k(x, w)]^{1 - t_k}

•  Taking the negative logarithm of the likelihood function gives

       E(w) = -\sum_{n=1}^{N} \sum_{k=1}^{K} \{ t_{nk} \ln y_{nk} + (1 - t_{nk}) \ln(1 - y_{nk}) \}

   where y_nk denotes y_k(x_n, w)



3. Multiclass Classification
•  Each input assigned to one of K classes
•  Binary target variables t_k ∈ {0, 1} use a 1-of-K coding scheme
•  Network outputs are interpreted as y_k(x, w) = p(t_k = 1 | x)
•  This leads to the following error function:

       E(w) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_k(x_n, w)

•  The output-unit activation function is given by the softmax (a numerical sketch follows):

       y_k(x, w) = \frac{\exp(a_k(x, w))}{\sum_j \exp(a_j(x, w))}

   which for a linear model becomes  y_k(x, w) = \frac{\exp(w_k^T \phi(x))}{\sum_j \exp(w_j^T \phi(x))}
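
A small numerical sketch of the softmax outputs and the multiclass cross-entropy; the max-shift and the clipping are standard stability tricks added here as assumptions:

import numpy as np

def softmax(A):
    # y_k = exp(a_k) / sum_j exp(a_j), computed row-wise; subtracting the row
    # maximum leaves the result unchanged but avoids overflow in exp.
    A = A - A.max(axis=1, keepdims=True)
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def multiclass_cross_entropy(A, T, eps=1e-12):
    # E(w) = -sum_n sum_k t_kn * ln y_k(x_n, w), with T in 1-of-K coding.
    Y = np.clip(softmax(A), eps, 1.0)
    return -np.sum(T * np.log(Y))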

Parameter Optimization
•  Task: find the weight vector w which minimizes the chosen error function E(w)
•  Geometrical picture of the error function:
   •  the error function has a highly nonlinear dependence on the weights and bias parameters


Parameter Optimization: Geometrical View


E(w) can be viewed as a surface sitting over weight space
•  w_A: a local minimum
•  w_B: the global minimum
•  Need to find the minimum
•  At a point w_C the local gradient is given by the vector ∇E(w)
   •  it points in the direction of the greatest rate of increase of E(w)
   •  the negative gradient points in the direction of the greatest rate of decrease


Finding w where E(w) is smallest


•  A small step from w to w + δw leads to a change in the error function

       \delta E \approx \delta w^T \nabla E(w)

•  A minimum of E(w) will occur when ∇E(w) = 0
•  Points at which the gradient vanishes are stationary points: minima, maxima, and saddle points
•  The error surface is complex, so there is no hope of finding an analytical solution to ∇E(w) = 0

Iterative Numerical Procedure for Minima


•  Since there is no analytical solution, choose an initial value w^{(0)} and update it using

       w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)}

   where τ is the iteration step
•  Different algorithms involve different choices for the weight vector update Δw^{(τ)}
•  Many algorithms require the gradient ∇E(w) to be evaluated at the new weight vector w^{(τ+1)} after each update
•  To understand the importance of gradient information, consider the Taylor series expansion of the error function
   •  this leads to the local quadratic approximation


Discussion Overview

Preliminary concepts for backpropagation:

1.  Local quadratic approximation
    •  provides insight into the optimization problem
    •  O(W³), where W is the dimensionality of w
    •  based on a Taylor series expansion, in which E(w) is approximated by a quadratic
2.  Use of gradient information
    •  leads to significant improvements in the speed of locating minima of the error function
    •  with gradients (evaluated by backpropagation) the minimum can be found in O(W²) steps
3.  Gradient descent optimization
    •  the simplest approach to using gradient information



Definitions of Gradient and Hessian


•  The first derivative of a scalar function E(w) with respect to a vector w = [w_1, w_2]^T is a vector called the gradient of E(w):

       \nabla E(w) = \frac{d}{dw} E(w) = \begin{bmatrix} \partial E / \partial w_1 \\ \partial E / \partial w_2 \end{bmatrix}

   If there are M elements in the vector, the gradient is an M x 1 vector
•  The second derivative of E(w) is a matrix called the Hessian:

       H = \nabla \nabla E(w) = \frac{d^2}{dw^2} E(w) = \begin{bmatrix} \partial^2 E / \partial w_1^2 & \partial^2 E / \partial w_1 \partial w_2 \\ \partial^2 E / \partial w_2 \partial w_1 & \partial^2 E / \partial w_2^2 \end{bmatrix}

   The Hessian is a matrix with M² elements (a finite-difference sketch follows)
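
To make the definitions concrete, a finite-difference sketch (illustrative only; the step sizes and function names are assumptions):

import numpy as np

def numerical_gradient(E, w, h=1e-5):
    # Central-difference estimate of the gradient dE/dw, an M x 1 vector.
    w = np.asarray(w, dtype=float)
    g = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = h
        g[i] = (E(w + step) - E(w - step)) / (2.0 * h)
    return g

def numerical_hessian(E, w, h=1e-4):
    # Central-difference estimate of the Hessian d^2E/dw^2, an M x M matrix.
    w = np.asarray(w, dtype=float)
    M = w.size
    H = np.zeros((M, M))
    for i in range(M):
        ei = np.zeros(M)
        ei[i] = h
        for j in range(M):
            ej = np.zeros(M)
            ej[j] = h
            H[i, j] = (E(w + ei + ej) - E(w + ei - ej)
                       - E(w - ei + ej) + E(w - ei - ej)) / (4.0 * h * h)
    return H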


1. Local Quadratic Optimization


•  Taylor series expansion of E(w) around some point ŵ in weight space (with cubic and higher-order terms omitted):

       E(w) \simeq E(\hat{w}) + (w - \hat{w})^T b + \frac{1}{2} (w - \hat{w})^T H (w - \hat{w})

   where b ≡ ∇E|_{w = ŵ} is the gradient of E evaluated at ŵ
   •  b is a vector of W elements
   and H = ∇∇E is the Hessian matrix, with elements (H)_{ij} = ∂²E / ∂w_i ∂w_j evaluated at w = ŵ
   •  H is a W x W matrix
•  Consider the local quadratic approximation around w*, a minimum of the error function:

       E(w) \simeq E(w^*) + \frac{1}{2} (w - w^*)^T H (w - w^*)

   where H is evaluated at w* and the linear term vanishes
•  To interpret this geometrically, consider the eigenvalue equation for the Hessian matrix,  H u_i = \lambda_i u_i,
   where the eigenvectors u_i are orthonormal:  u_i^T u_j = \delta_{ij}
•  Expand (w - w*) as a linear combination of the eigenvectors:  w - w^* = \sum_i \alpha_i u_i  (see the sketch below)
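
A short sketch, under the assumption that the Hessian H at the minimum w* is available as a symmetric numpy array, of the quadratic approximation and the eigenvector expansion above:

import numpy as np

def quadratic_approximation(E_wstar, H, w, w_star):
    # E(w) ~= E(w*) + 1/2 (w - w*)^T H (w - w*); the linear term vanishes at a minimum.
    d = np.asarray(w, dtype=float) - np.asarray(w_star, dtype=float)
    return E_wstar + 0.5 * d @ H @ d

def eigen_expansion(H, w, w_star):
    # H u_i = lambda_i u_i with orthonormal u_i; expanding w - w* = sum_i alpha_i u_i
    # gives E(w) = E(w*) + 1/2 sum_i lambda_i alpha_i^2.
    eigenvalues, U = np.linalg.eigh(H)   # columns of U are the eigenvectors u_i
    alpha = U.T @ (np.asarray(w, dtype=float) - np.asarray(w_star, dtype=float))
    return eigenvalues, alpha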

Error Function Approximation by a Quadratic

•  Error functions:
   •  Linear regression:  y(x, w) = w^T \phi(x) = \sum_{j=1}^{M} w_j \phi_j(x),   E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2
   •  Binary classification:  y(x, w) = \sigma(w^T \phi(x)) = \frac{1}{1 + \exp(-w^T \phi(x))},   E(w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}
   •  Multiclass classification:  y_k(x, w) = \frac{\exp(w_k^T \phi(x))}{\sum_j \exp(w_j^T \phi(x))},   E(w) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_k(x_n, w)

•  Each is approximated by the quadratic

       E(w) \simeq E(w^*) + \frac{1}{2} (w - w^*)^T H (w - w^*)

   where H = ∇∇E is a W x W matrix whose contours of constant error are ellipses with axes aligned with the eigenvectors u_i of H, and with lengths inversely proportional to the square roots of the corresponding eigenvalues

Neighborhood of a minimum w*

•  w - w* defines a coordinate transformation
   •  the origin is translated to w*
   •  the axes are rotated to align with the eigenvectors of the Hessian
•  The error function can then be written as

       E(w) = E(w^*) + \frac{1}{2} \sum_i \lambda_i \alpha_i^2

•  The matrix H = ∇∇E is positive definite iff  v^T H v > 0  for all v
•  Since the eigenvectors form a complete set, an arbitrary vector v can be written as  v = \sum_i c_i u_i,  so that

       v^T H v = \sum_i c_i^2 \lambda_i

•  The stationary point w* will therefore be a minimum if the Hessian matrix is positive definite (equivalently, if all its eigenvalues are positive); a short check is sketched below
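
A one-function sketch of this test (illustrative; the tolerance is an assumption):

import numpy as np

def is_minimum(H, tol=1e-10):
    # w* is a minimum iff the Hessian there is positive definite,
    # i.e. all of its eigenvalues are strictly positive.
    return bool(np.all(np.linalg.eigvalsh(H) > tol))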



Condition for a point w* to be a minimum

•  For a one-dimensional weight space, a stationary point w* will be a minimum if

       \left. \frac{\partial^2 E}{\partial w^2} \right|_{w^*} > 0

•  The corresponding result in D dimensions is that the Hessian matrix evaluated at w* is positive definite
•  A matrix H is positive definite iff  v^T H v > 0  for all v


Complexity of Quadratic Approximation


       E(w) \simeq E(\hat{w}) + (w - \hat{w})^T b + \frac{1}{2} (w - \hat{w})^T H (w - \hat{w})

where b ≡ ∇E|_{w = ŵ} is the gradient of E evaluated at ŵ
•  b is a vector of W elements
and H = ∇∇E is the Hessian matrix
•  H is a W x W matrix

•  The error surface is specified by b and H
   •  together they contain W(W+3)/2 independent elements
   •  W is the total number of adaptive parameters in the network
•  The location of the minimum therefore depends on O(W²) parameters
•  We need to perform O(W²) function evaluations, each requiring O(W) steps
•  So the computational effort needed is O(W³)
•  A 10 x 10 x 10 network has about 100 + 100 = 200 weights, which means roughly 200³ ≈ 8 million steps (the arithmetic is repeated in the sketch below)
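
The counting on this slide, reproduced as a purely arithmetic sketch; bias parameters are ignored, as in the slide's 100 + 100 = 200 figure:

# Parameter and step counts for a D-M-K network, following the slide's figures.
D, M, K = 10, 10, 10
W = D * M + M * K                  # 100 + 100 = 200 weights (biases ignored)
independent = W * (W + 3) // 2     # independent elements in b and H
quadratic_steps = W ** 3           # O(W^3): about 8 million steps
gradient_steps = W ** 2            # O(W^2): about 40,000 steps (next slide)
print(W, independent, quadratic_steps, gradient_steps)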

2. Use of Gradient Information


•  The gradient of the error function can be evaluated efficiently using back-propagation
   •  using gradient information can lead to significant improvements in the speed with which the minimum of the error function can be located
•  In the quadratic approximation to the error function
   •  the computational effort needed is O(W³)
•  By using gradient information the minimum can be found in O(W²) steps
   •  roughly 200² = 40,000 steps for the 10 x 10 x 10 network with 200 weights


3. Gradient Descent Optimization

•  Simplest approach to using gradient information


•  Take a small step in the direction of the negative gradient:

       w^{(\tau+1)} = w^{(\tau)} - \eta \, \nabla E(w^{(\tau)})

•  η is the learning rate
•  There are batch and on-line versions (a minimal sketch follows)
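
A minimal sketch of batch gradient descent, assuming a callable grad_E that returns ∇E(w); the learning rate, step count and the quadratic example are illustrative assumptions:

import numpy as np

def gradient_descent(grad_E, w0, eta=0.1, n_steps=1000):
    # w^(tau+1) = w^(tau) - eta * grad E(w^(tau))
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - eta * grad_E(w)      # small step against the local gradient
    return w

# Example: E(w) = 1/2 w^T H w has gradient H w and its minimum at w = 0.
H = np.array([[3.0, 0.5], [0.5, 1.0]])
w_hat = gradient_descent(lambda w: H @ w, w0=[2.0, -1.5], eta=0.2)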

Summary
•  Neural networks have many parameters
   •  these can be determined analogously to linear-regression parameters
•  A probabilistic formulation leads to appropriate error functions for regression, binary and multi-class classification
•  Parameter optimization can be viewed as minimizing the error function in weight space
   •  at a minimum the Hessian is positive definite
•  Local quadratic optimization needs O(W³) steps
•  Using gradient information, a more efficient O(W²) algorithm can be designed
