
Machine Learning Srihari

Neural Network Training

Sargur Srihari

Topics

•  Neural network parameters
•  Probabilistic problem formulation
•  Determining the error function
   •  Regression
   •  Binary classification
   •  Multi-class classification
•  Parameter optimization
   •  Local quadratic approximation
   •  Use of gradient information
   •  Gradient descent optimization


Neural Network parameters

•  Linear models for regression and classification can be represented as

       y(x, w) = f\Big( \sum_{j=1}^{M} w_j \phi_j(x) \Big)

   which are linear combinations of basis functions φ_j(x)
•  In a neural network the basis functions φ_j(x) themselves depend on parameters
•  During training these parameters are adjusted along with the coefficients w_j


Network Training: Sum of squared errors


•  Neural networks perform a transformation
   •  from a vector x of input variables to a vector y of output variables
•  For a sigmoid output activation function

       y_k(x, w) = \sigma\Big( \sum_{j=1}^{M} w_{kj}^{(2)} \, h\Big( \sum_{i=1}^{D} w_{ji}^{(1)} x_i \Big) \Big)

   with D input variables and M hidden units
•  where the vector w consists of all weight and bias parameters


•  To determine w, use a simple analogy with curve fitting
   •  minimize the sum-of-squares error function
•  Given input vectors {x_n}, n = 1,..,N, and target vectors {t_n}, minimize the error function over the N training vectors (a short numpy sketch follows)

       E(w) = \frac{1}{2} \sum_{n=1}^{N} \| y(x_n, w) - t_n \|^2
•  Consider a more general probabilistic interpretation
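
As an illustration (not part of the slides), a minimal numpy sketch of the two-layer transformation and the sum-of-squares error above; the tanh hidden activation h and all variable names are assumptions made for this example:

import numpy as np

def forward(x, W1, b1, W2, b2):
    # Two-layer network: D inputs -> M hidden units -> K sigmoid outputs.
    a1 = W1 @ x + b1                     # hidden pre-activations, shape (M,)
    z1 = np.tanh(a1)                     # hidden-unit outputs h(.) (tanh assumed)
    a2 = W2 @ z1 + b2                    # output pre-activations, shape (K,)
    return 1.0 / (1.0 + np.exp(-a2))     # sigmoid outputs y_k(x, w)

def sum_of_squares_error(X, T, W1, b1, W2, b2):
    # E(w) = 1/2 * sum_n || y(x_n, w) - t_n ||^2 over the N training vectors.
    Y = np.array([forward(x, W1, b1, W2, b2) for x in X])
    return 0.5 * np.sum((Y - T) ** 2)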

Probabilistic View: From activation function f, determine error function E (as defined by the likelihood function)

1.  Regression
    •  f: activation function is the identity,  y(x, w) = w^T \phi(x) = \sum_{j=1}^{M} w_j \phi_j(x)
    •  E: sum-of-squares error / maximum likelihood,  E(w) = \frac{1}{2} \sum_{n=1}^{N} \| y(x_n, w) - t_n \|^2

2.  (Multiple independent) binary classifications
    •  f: activation function is the logistic sigmoid,  y(x, w) = \sigma(w^T \phi(x)) = \frac{1}{1 + \exp(-w^T \phi(x))}
    •  E: cross-entropy error function,  E(w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}

3.  Multiclass classification
    •  f: softmax outputs,  y_k(x, w) = \frac{\exp(w_k^T \phi(x))}{\sum_j \exp(w_j^T \phi(x))}
    •  E: cross-entropy error function,  E(w) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_k(x_n, w)

1. Probabilistic View: Regression

•  Output is a single target variable t that can take any real value
•  Assume t is Gaussian distributed with an x-dependent mean:

       p(t | x, w) = \mathcal{N}(t \mid y(x, w), \beta^{-1})

•  Likelihood function:

       p(\mathbf{t} | \mathbf{x}, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, w), \beta^{-1})

•  Taking the negative logarithm, we get the negative log-likelihood

       \frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 - \frac{N}{2} \ln \beta + \frac{N}{2} \ln(2\pi)

•  which can be used to learn the parameters w and β (a small sketch follows)
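
A small illustrative sketch (an assumption, not from the slides) of the negative log-likelihood above, given network outputs Y, targets T and precision beta:

import numpy as np

def gaussian_negative_log_likelihood(Y, T, beta):
    # beta/2 * sum_n (y_n - t_n)^2 - N/2 * ln(beta) + N/2 * ln(2*pi)
    N = len(T)
    squared_error = np.sum((np.asarray(Y) - np.asarray(T)) ** 2)
    return 0.5 * beta * squared_error - 0.5 * N * np.log(beta) + 0.5 * N * np.log(2.0 * np.pi)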


Regression Error Function

•  The likelihood function could be used to learn the parameters w and β
•  This is usually done in a Bayesian treatment
•  In the neural network literature, minimizing an error function is used instead
•  The two approaches are equivalent here
•  The sum-of-squares error is

       E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2

•  Its smallest value occurs when ∇E(w) = 0
•  Since E(w) is non-convex:
   •  the solution w_ML is found by iterative optimization,  w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)}
   •  e.g., gradient descent (discussed later in this lecture)
   •  the required gradients can be evaluated efficiently by back-propagation
•  For regression the output activation is the identity, y_k = a_k, where

       a_k = \sum_{i=1}^{M} w_{ki}^{(2)} z_i + w_{k0}^{(2)},   k = 1,..,K

   (z_i are the hidden-unit outputs), so the error derivative takes the simple form

       \frac{\partial E}{\partial a_k} = y_k - t_k

•  Having found w_ML, the value of β_ML can also be found (sketched below) using

       \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{ y(x_n, w_{ML}) - t_n \}^2
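
A hypothetical helper illustrating this estimate of β_ML from the residuals of the maximum-likelihood fit (names are illustrative):

import numpy as np

def beta_ml(Y_ml, T):
    # 1/beta_ML = (1/N) * sum_n { y(x_n, w_ML) - t_n }^2, so beta_ML is the
    # inverse of the mean squared residual.
    residuals = np.asarray(Y_ml) - np.asarray(T)
    return len(T) / np.sum(residuals ** 2)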

2. Binary Classification
•  Single target variable t, where t = 1 denotes C1 and t = 0 denotes C2
•  Consider a network with a single output whose activation function is the logistic sigmoid

       y = \sigma(a) = \frac{1}{1 + \exp(-a)}

   so that 0 < y(x, w) < 1
•  Interpret y(x, w) as the conditional probability p(C1 | x)
•  Conditional distribution of targets given inputs:

       p(t | x, w) = y(x, w)^{t} \{ 1 - y(x, w) \}^{1 - t}



Binary Classification Error Function


•  The error function is the negative log-likelihood, which in this case is the cross-entropy error function

       E(w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}

•  where y_n denotes y(x_n, w)


•  Using cross-entropy error function instead of sum of
squares leads to faster training and improved
generalization
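
A minimal numpy sketch of this cross-entropy error; the clipping constant is an assumption added to avoid log(0) for saturated sigmoid outputs:

import numpy as np

def binary_cross_entropy(Y, T, eps=1e-12):
    # E(w) = -sum_n [ t_n * ln(y_n) + (1 - t_n) * ln(1 - y_n) ]
    Y = np.clip(Y, eps, 1.0 - eps)       # keep the logarithms finite
    return -np.sum(T * np.log(Y) + (1.0 - T) * np.log(1.0 - Y))

The same expression applied to N x K arrays of outputs and targets gives the error for the K separate binary classifications on the next slide.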


2. K Separate Binary Classifications

•  The network has K outputs, each with a logistic sigmoid activation function
•  Associated with each output is a binary class label t_k

       p(\mathbf{t} | x, w) = \prod_{k=1}^{K} y_k(x, w)^{t_k} \, [1 - y_k(x, w)]^{1 - t_k}

•  Taking the negative logarithm of the likelihood function gives

       E(w) = -\sum_{n=1}^{N} \sum_{k=1}^{K} \{ t_{nk} \ln y_{nk} + (1 - t_{nk}) \ln(1 - y_{nk}) \}

   where y_nk denotes y_k(x_n, w)



3. Multiclass Classification
•  Each input assigned to one of K classes
•  Binary target variables t_k ∈ {0, 1} use a 1-of-K coding scheme
•  Network outputs are interpreted as y_k(x, w) = p(t_k = 1 | x)
•  This leads to the following error function:

       E(w) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_k(x_n, w)

•  The output-unit activation function is given by the softmax (a numerical sketch follows):

       y_k(x, w) = \frac{\exp(a_k(x, w))}{\sum_j \exp(a_j(x, w))}

   which for a linear model becomes  y_k(x, w) = \frac{\exp(w_k^T \phi(x))}{\sum_j \exp(w_j^T \phi(x))}
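
A small numerical sketch of the softmax outputs and the multiclass cross-entropy; the max-shift and the clipping are standard stability tricks added here as assumptions:

import numpy as np

def softmax(A):
    # y_k = exp(a_k) / sum_j exp(a_j), computed row-wise; subtracting the row
    # maximum leaves the result unchanged but avoids overflow in exp.
    A = A - A.max(axis=1, keepdims=True)
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def multiclass_cross_entropy(A, T, eps=1e-12):
    # E(w) = -sum_n sum_k t_kn * ln y_k(x_n, w), with T in 1-of-K coding.
    Y = np.clip(softmax(A), eps, 1.0)
    return -np.sum(T * np.log(Y))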

Parameter Optimization
•  Task: find the weight vector w which minimizes the chosen error function E(w)
•  Geometrical picture of the error function:
   •  the error function has a highly nonlinear dependence on the weights and bias parameters


Parameter Optimization: Geometrical View


E(w) can be viewed as a surface sitting over weight space
•  w_A: a local minimum
•  w_B: the global minimum
•  Need to find the minimum
•  At a point w_C the local gradient is given by the vector ∇E(w)
   •  it points in the direction of the greatest rate of increase of E(w)
   •  the negative gradient points in the direction of the greatest rate of decrease


Finding w where E(w) is smallest


•  A small step from w to w + δw leads to a change in the error function

       \delta E \approx \delta w^T \nabla E(w)

•  A minimum of E(w) will occur when ∇E(w) = 0
•  Points at which the gradient vanishes are stationary points: minima, maxima, and saddle points
•  The error surface is complex, so there is no hope of finding an analytical solution to ∇E(w) = 0

Iterative Numerical Procedure for Minima


•  Since there is no analytical solution, choose an initial value w^{(0)} and update it using

       w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)}

   where τ is the iteration step
•  Different algorithms involve different choices for the weight vector update Δw^{(τ)}
•  Many algorithms require the gradient ∇E(w) to be evaluated at the new weight vector w^{(τ+1)} after each update
•  To understand the importance of gradient information, consider the Taylor series expansion of the error function
   •  this leads to the local quadratic approximation


Discussion Overview

Preliminary concepts for backpropagation:

1.  Local quadratic approximation
    •  provides insight into the optimization problem
    •  O(W³), where W is the dimensionality of w
    •  based on a Taylor series expansion, in which E(w) is approximated by a quadratic
2.  Use of gradient information
    •  leads to significant improvements in the speed of locating minima of the error function
    •  with gradients (evaluated by backpropagation) the minimum can be found in O(W²) steps
3.  Gradient descent optimization
    •  the simplest approach to using gradient information



Definitions of Gradient and Hessian


•  The first derivative of a scalar function E(w) with respect to a vector w = [w_1, w_2]^T is a vector called the gradient of E(w):

       \nabla E(w) = \frac{d}{dw} E(w) = \begin{bmatrix} \partial E / \partial w_1 \\ \partial E / \partial w_2 \end{bmatrix}

   If there are M elements in the vector, the gradient is an M x 1 vector
•  The second derivative of E(w) is a matrix called the Hessian:

       H = \nabla \nabla E(w) = \frac{d^2}{dw^2} E(w) = \begin{bmatrix} \partial^2 E / \partial w_1^2 & \partial^2 E / \partial w_1 \partial w_2 \\ \partial^2 E / \partial w_2 \partial w_1 & \partial^2 E / \partial w_2^2 \end{bmatrix}

   The Hessian is a matrix with M² elements (a finite-difference sketch follows)
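
To make the definitions concrete, a finite-difference sketch (illustrative only; the step sizes and function names are assumptions):

import numpy as np

def numerical_gradient(E, w, h=1e-5):
    # Central-difference estimate of the gradient dE/dw, an M x 1 vector.
    w = np.asarray(w, dtype=float)
    g = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = h
        g[i] = (E(w + step) - E(w - step)) / (2.0 * h)
    return g

def numerical_hessian(E, w, h=1e-4):
    # Central-difference estimate of the Hessian d^2E/dw^2, an M x M matrix.
    w = np.asarray(w, dtype=float)
    M = w.size
    H = np.zeros((M, M))
    for i in range(M):
        ei = np.zeros(M)
        ei[i] = h
        for j in range(M):
            ej = np.zeros(M)
            ej[j] = h
            H[i, j] = (E(w + ei + ej) - E(w + ei - ej)
                       - E(w - ei + ej) + E(w - ei - ej)) / (4.0 * h * h)
    return H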


1. Local Quadratic Optimization


•  Taylor series expansion of E(w) around some point ŵ in weight space (with cubic and higher-order terms omitted):

       E(w) \simeq E(\hat{w}) + (w - \hat{w})^T b + \frac{1}{2} (w - \hat{w})^T H (w - \hat{w})

   where b ≡ ∇E|_{w = ŵ} is the gradient of E evaluated at ŵ
   •  b is a vector of W elements
   and H = ∇∇E is the Hessian matrix, with elements (H)_{ij} = ∂²E / ∂w_i ∂w_j evaluated at w = ŵ
   •  H is a W x W matrix
•  Consider the local quadratic approximation around w*, a minimum of the error function:

       E(w) \simeq E(w^*) + \frac{1}{2} (w - w^*)^T H (w - w^*)

   where H is evaluated at w* and the linear term vanishes
•  To interpret this geometrically, consider the eigenvalue equation for the Hessian matrix,  H u_i = \lambda_i u_i,
   where the eigenvectors u_i are orthonormal:  u_i^T u_j = \delta_{ij}
•  Expand (w - w*) as a linear combination of the eigenvectors:  w - w^* = \sum_i \alpha_i u_i  (see the sketch below)
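
A short sketch, under the assumption that the Hessian H at the minimum w* is available as a symmetric numpy array, of the quadratic approximation and the eigenvector expansion above:

import numpy as np

def quadratic_approximation(E_wstar, H, w, w_star):
    # E(w) ~= E(w*) + 1/2 (w - w*)^T H (w - w*); the linear term vanishes at a minimum.
    d = np.asarray(w, dtype=float) - np.asarray(w_star, dtype=float)
    return E_wstar + 0.5 * d @ H @ d

def eigen_expansion(H, w, w_star):
    # H u_i = lambda_i u_i with orthonormal u_i; expanding w - w* = sum_i alpha_i u_i
    # gives E(w) = E(w*) + 1/2 sum_i lambda_i alpha_i^2.
    eigenvalues, U = np.linalg.eigh(H)   # columns of U are the eigenvectors u_i
    alpha = U.T @ (np.asarray(w, dtype=float) - np.asarray(w_star, dtype=float))
    return eigenvalues, alpha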

Error Function Approximation by a Quadratic

•  Error functions:
   •  Linear regression:  y(x, w) = w^T \phi(x) = \sum_{j=1}^{M} w_j \phi_j(x),   E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2
   •  Binary classification:  y(x, w) = \sigma(w^T \phi(x)) = \frac{1}{1 + \exp(-w^T \phi(x))},   E(w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}
   •  Multiclass classification:  y_k(x, w) = \frac{\exp(w_k^T \phi(x))}{\sum_j \exp(w_j^T \phi(x))},   E(w) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_k(x_n, w)

•  Each is approximated by the quadratic

       E(w) \simeq E(w^*) + \frac{1}{2} (w - w^*)^T H (w - w^*)

   where H = ∇∇E is a W x W matrix whose contours of constant error are ellipses with axes aligned with the eigenvectors u_i of H, and with lengths inversely proportional to the square roots of the corresponding eigenvalues

Neighborhood of a minimum w*

•  w - w* defines a coordinate transformation
   •  the origin is translated to w*
   •  the axes are rotated to align with the eigenvectors of the Hessian
•  The error function can then be written as

       E(w) = E(w^*) + \frac{1}{2} \sum_i \lambda_i \alpha_i^2

•  The matrix H = ∇∇E is positive definite iff  v^T H v > 0  for all v
•  Since the eigenvectors form a complete set, an arbitrary vector v can be written as  v = \sum_i c_i u_i,  so that

       v^T H v = \sum_i c_i^2 \lambda_i

•  The stationary point w* will therefore be a minimum if the Hessian matrix is positive definite (equivalently, if all its eigenvalues are positive); a short check is sketched below
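
A one-function sketch of this test (illustrative; the tolerance is an assumption):

import numpy as np

def is_minimum(H, tol=1e-10):
    # w* is a minimum iff the Hessian there is positive definite,
    # i.e. all of its eigenvalues are strictly positive.
    return bool(np.all(np.linalg.eigvalsh(H) > tol))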



Condition for a point w* to be a minimum

•  For a one-dimensional weight space, a stationary point w* will be a minimum if

       \left. \frac{\partial^2 E}{\partial w^2} \right|_{w^*} > 0

•  The corresponding result in D dimensions is that the Hessian matrix evaluated at w* is positive definite
•  A matrix H is positive definite iff  v^T H v > 0  for all v


Complexity of Quadratic Approximation


       E(w) \simeq E(\hat{w}) + (w - \hat{w})^T b + \frac{1}{2} (w - \hat{w})^T H (w - \hat{w})

where b ≡ ∇E|_{w = ŵ} is the gradient of E evaluated at ŵ
•  b is a vector of W elements
and H = ∇∇E is the Hessian matrix
•  H is a W x W matrix

•  The error surface is specified by b and H
   •  together they contain W(W+3)/2 independent elements
   •  W is the total number of adaptive parameters in the network
•  The location of the minimum therefore depends on O(W²) parameters
•  We need to perform O(W²) function evaluations, each requiring O(W) steps
•  So the computational effort needed is O(W³)
•  A 10 x 10 x 10 network has about 100 + 100 = 200 weights, which means roughly 200³ ≈ 8 million steps (the arithmetic is repeated in the sketch below)
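
The counting on this slide, reproduced as a purely arithmetic sketch; bias parameters are ignored, as in the slide's 100 + 100 = 200 figure:

# Parameter and step counts for a D-M-K network, following the slide's figures.
D, M, K = 10, 10, 10
W = D * M + M * K                  # 100 + 100 = 200 weights (biases ignored)
independent = W * (W + 3) // 2     # independent elements in b and H
quadratic_steps = W ** 3           # O(W^3): about 8 million steps
gradient_steps = W ** 2            # O(W^2): about 40,000 steps (next slide)
print(W, independent, quadratic_steps, gradient_steps)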

2. Use of Gradient Information


•  The gradient of the error function can be evaluated efficiently using back-propagation
   •  using gradient information can lead to significant improvements in the speed with which the minimum of the error function can be located
•  In the quadratic approximation to the error function
   •  the computational effort needed is O(W³)
•  By using gradient information the minimum can be found in O(W²) steps
   •  roughly 200² = 40,000 steps for the 10 x 10 x 10 network with 200 weights


3. Gradient Descent Optimization

•  Simplest approach to using gradient information


•  Take a small step in the direction of the negative gradient:

       w^{(\tau+1)} = w^{(\tau)} - \eta \, \nabla E(w^{(\tau)})

•  η is the learning rate
•  There are batch and on-line versions (a minimal sketch follows)
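
A minimal sketch of batch gradient descent, assuming a callable grad_E that returns ∇E(w); the learning rate, step count and the quadratic example are illustrative assumptions:

import numpy as np

def gradient_descent(grad_E, w0, eta=0.1, n_steps=1000):
    # w^(tau+1) = w^(tau) - eta * grad E(w^(tau))
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - eta * grad_E(w)      # small step against the local gradient
    return w

# Example: E(w) = 1/2 w^T H w has gradient H w and its minimum at w = 0.
H = np.array([[3.0, 0.5], [0.5, 1.0]])
w_hat = gradient_descent(lambda w: H @ w, w0=[2.0, -1.5], eta=0.2)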

Summary
•  Neural networks have many parameters
   •  these can be determined analogously to linear-regression parameters
•  A probabilistic formulation leads to appropriate error functions for regression, binary and multi-class classification
•  Parameter optimization can be viewed as minimizing the error function in weight space
   •  at a minimum the Hessian is positive definite
•  Local quadratic optimization needs O(W³) steps
•  Using gradient information, a more efficient O(W²) algorithm can be designed
