
Part II 1

CSE 5526: Introduction to Neural Networks


Linear Regression
Problem statement
Linear regression with one variable
Given a set of N pairs of data $\langle x_i, d_i \rangle$, approximate d by a
linear function of x (regressor), i.e.

$$d_i \approx w x_i + b$$

or

$$d_i = y_i + \varepsilon_i = \varphi(w x_i + b) + \varepsilon_i = w x_i + b + \varepsilon_i$$

where the activation function $\varphi(x) = x$ is a linear function, and it
corresponds to a linear neuron. y is the output of the neuron, and

$$\varepsilon_i = d_i - y_i$$

is called the regression (expectational) error
Linear regression (cont.)
The problem of regression with one variable is how to
choose w and b to minimize the regression error

The least squares method aims to minimize the square error
E:

$$E = \frac{1}{2}\sum_{i=1}^{N} \varepsilon_i^2 = \frac{1}{2}\sum_{i=1}^{N} (d_i - y_i)^2$$
Linear regression (cont.)
To minimize the two-variable square function, set

$$\frac{\partial E}{\partial b} = 0, \qquad \frac{\partial E}{\partial w} = 0$$
Linear regression (cont.)

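$$\frac{\partial E}{\partial b} = \frac{\partial}{\partial b}\,\frac{1}{2}\sum_i (d_i - w x_i - b)^2 = -\sum_i (d_i - w x_i - b) = 0$$

$$\frac{\partial E}{\partial w} = \frac{\partial}{\partial w}\,\frac{1}{2}\sum_i (d_i - w x_i - b)^2 = -\sum_i (d_i - w x_i - b)\,x_i = 0$$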
Linear regression (cont.)
Hence

$$b = \frac{\sum_i x_i^2 \sum_i d_i - \sum_i x_i \sum_i x_i d_i}{N\left[\sum_i (x_i - \bar{x})^2\right]}$$

$$w = \frac{\sum_i (x_i - \bar{x})(d_i - \bar{d})}{\sum_i (x_i - \bar{x})^2}$$

where an overbar (i.e. $\bar{x}$) indicates the mean

Derive yourself!
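The closed-form solution can be checked numerically. The following is a minimal sketch, not part of the original slides: the function name `fit_line` and the toy data are made up for illustration, and the code simply evaluates the two formulas for w and b.

```python
def fit_line(xs, ds):
    """Closed-form least-squares fit of d ~ w*x + b for one regressor."""
    n = len(xs)
    x_bar = sum(xs) / n
    d_bar = sum(ds) / n
    # Sum of squared deviations of x from its mean (denominator of w).
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    # Slope: sum of products of deviations divided by sxx.
    w = sum((x - x_bar) * (d - d_bar) for x, d in zip(xs, ds)) / sxx
    # Intercept from the sums; algebraically equal to d_bar - w * x_bar.
    sum_x = sum(xs)
    sum_d = sum(ds)
    sum_x2 = sum(x * x for x in xs)
    sum_xd = sum(x * d for x, d in zip(xs, ds))
    b = (sum_x2 * sum_d - sum_x * sum_xd) / (n * sxx)
    return w, b


# Toy data generated from d = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ds = [1.0, 3.0, 5.0, 7.0]
w, b = fit_line(xs, ds)
print(w, b)
```

Because the toy data lie exactly on a line, the fit recovers the slope and intercept used to generate them. Note that the whole data set must be available before w and b can be computed, which is what makes this a batch solution.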
Linear regression (cont.)
This method gives an optimal solution, but it can be time-
and memory-consuming as a batch solution





Finding optimal parameters via search
Without loss of generality, set b = 0

$$E(w) = \frac{1}{2}\sum_{i=1}^{N} (d_i - w x_i)^2$$

E(w) is called a cost function
Cost function
[Figure: cost function E(w) plotted against w; the minimum $E_{\min}$ is attained at $w = w^*$]
Question: how can we update w to minimize E?
Gradient and directional derivatives
Without loss of generality, consider a two-variable function
$f(x, y)$. The gradient of $f(x, y)$ at a given point $(x_0, y_0)^T$ is

$$\nabla f = \left.\left(\frac{\partial f(x, y)}{\partial x}, \frac{\partial f(x, y)}{\partial y}\right)^T\right|_{x = x_0,\; y = y_0} = f_x(x_0, y_0)\,\mathbf{u}_x + f_y(x_0, y_0)\,\mathbf{u}_y$$

where $\mathbf{u}_x$ and $\mathbf{u}_y$ are unit vectors in the x and y directions, and
$f_x = \partial f / \partial x$ and $f_y = \partial f / \partial y$
Gradient and directional derivatives (cont.)
For any given direction $\mathbf{u} = a\,\mathbf{u}_x + b\,\mathbf{u}_y$, with $a^2 + b^2 = 1$, the
directional derivative at $(x_0, y_0)^T$ along the unit vector $\mathbf{u}$ is

$$D_{\mathbf{u}} f(x_0, y_0) = \lim_{h \to 0} \frac{f(x_0 + ha,\, y_0 + hb) - f(x_0, y_0)}{h}$$

$$= \lim_{h \to 0} \frac{[f(x_0 + ha,\, y_0 + hb) - f(x_0,\, y_0 + hb)] + [f(x_0,\, y_0 + hb) - f(x_0, y_0)]}{h}$$

$$= a f_x(x_0, y_0) + b f_y(x_0, y_0)$$

$$= \nabla f(x_0, y_0)^T \mathbf{u}$$

Which direction has the greatest slope? The direction of the gradient, because the
dot product $\nabla f^T \mathbf{u}$ is largest when $\mathbf{u}$ is aligned with $\nabla f$!
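The dot-product form of the directional derivative can be verified numerically. This is a small sketch under stated assumptions: the function `f` is a hypothetical example chosen only for illustration, and the derivative is estimated with a finite step `h` rather than a true limit.

```python
import math


def f(x, y):
    # Hypothetical example function, chosen only for illustration.
    return x * x + 3.0 * x * y


def grad_f(x, y):
    # Analytic partial derivatives (f_x, f_y) of the example function.
    return (2.0 * x + 3.0 * y, 3.0 * x)


def directional_derivative(x0, y0, a, b, h=1e-6):
    # Finite-difference estimate of D_u f along the unit vector u = (a, b).
    return (f(x0 + h * a, y0 + h * b) - f(x0, y0)) / h


x0, y0 = 1.0, 2.0
gx, gy = grad_f(x0, y0)
norm = math.hypot(gx, gy)

# Scan unit directions in 0.1 degree steps and keep the one with the largest slope.
best_angle = max(
    (2.0 * math.pi * k / 3600 for k in range(3600)),
    key=lambda t: directional_derivative(x0, y0, math.cos(t), math.sin(t)),
)
best_u = (math.cos(best_angle), math.sin(best_angle))
grad_u = (gx / norm, gy / norm)  # unit vector along the gradient
print(best_u, grad_u)
```

Up to the resolution of the angular scan, the steepest direction found by brute force coincides with the normalized gradient.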
Gradient and directional derivatives (cont.)
Example: see blackboard












Gradient and directional derivatives (cont.)
To find the gradient at a particular point $(x_0, y_0)^T$, first find
the level curve or contour of $f(x, y)$ at that point, $C(x_0, y_0)$. A
tangent vector $\mathbf{u}$ to C satisfies

$$D_{\mathbf{u}} f(x_0, y_0) = \nabla f(x_0, y_0)^T \mathbf{u} = 0$$

because $f(x, y)$ is constant on a level curve. Hence the
gradient vector is perpendicular to the tangent vector
An illustration of level curves
Gradient and directional derivatives (cont.)
The gradient of a cost function is a vector with the
dimension of w that points in the direction of maximum increase
of E, with a magnitude equal to the slope of the
tangent of the cost function along that direction
Can the slope be negative?
Gradient illustration
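[Figure: cost function E(w) plotted against w, with minimum $E_{\min}$ at $w^*$; the gradient at $w_0$ is the slope of the tangent there, estimated over an interval $\Delta w$ on either side of $w_0$]

Gradient:

$$\frac{\partial E(w_0)}{\partial w} = \lim_{\Delta w \to 0} \frac{E(w_0 + \Delta w) - E(w_0 - \Delta w)}{2\,\Delta w}$$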
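The symmetric difference quotient above can be compared against the analytic derivative of the one-variable cost $E(w) = \frac{1}{2}\sum_i (d_i - w x_i)^2$. A minimal sketch, assuming b = 0 and made-up toy data; the function names are illustrative only.

```python
def cost(w, xs, ds):
    # E(w) = 1/2 * sum_i (d_i - w * x_i)^2, with the bias fixed at b = 0.
    return 0.5 * sum((d - w * x) ** 2 for x, d in zip(xs, ds))


def cost_gradient(w, xs, ds):
    # Analytic derivative dE/dw = -sum_i (d_i - w * x_i) * x_i.
    return -sum((d - w * x) * x for x, d in zip(xs, ds))


def central_difference(w0, xs, ds, dw=1e-4):
    # Symmetric finite-difference estimate of dE/dw at w0 (finite dw, not a limit).
    return (cost(w0 + dw, xs, ds) - cost(w0 - dw, xs, ds)) / (2.0 * dw)


# Toy data, made up for illustration.
xs = [1.0, 2.0, 3.0]
ds = [2.1, 3.9, 6.2]
numeric = central_difference(0.5, xs, ds)
analytic = cost_gradient(0.5, xs, ds)
print(numeric, analytic)
```

At w = 0.5, which lies to the left of the minimizer for this data, both values are negative: the slope is negative wherever increasing w decreases E.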
Gradient descent
Minimize the cost function via gradient (steepest) descent,
a case of hill-climbing

$$w(n+1) = w(n) - \eta\,\nabla E(n)$$

n: iteration number
$\eta$: learning rate

See previous figure
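As an illustration of the update $w(n+1) = w(n) - \eta \nabla E(n)$, the sketch below applies it to the batch cost $E(w) = \frac{1}{2}\sum_i (d_i - w x_i)^2$ with b = 0. The data, the learning rate and the step count are assumptions chosen for the example, not values from the slides.

```python
def gradient_descent(xs, ds, eta=0.05, w0=0.0, steps=100):
    """Batch gradient descent on E(w) = 1/2 * sum_i (d_i - w * x_i)^2."""
    w = w0
    for _ in range(steps):
        # Gradient of the batch cost with respect to w.
        grad = -sum((d - w * x) * x for x, d in zip(xs, ds))
        # w(n+1) = w(n) - eta * grad E(n)
        w = w - eta * grad
    return w


# Toy data, made up for illustration.
xs = [1.0, 2.0, 3.0]
ds = [2.1, 3.9, 6.2]

# Closed-form minimizer for b = 0, used only as a reference value.
w_star = sum(x * d for x, d in zip(xs, ds)) / sum(x * x for x in xs)
w_gd = gradient_descent(xs, ds)
print(w_gd, w_star)
```

For this quadratic cost the iterates approach the closed-form minimizer; the comparison value `w_star` is obtained by setting dE/dw = 0.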
Gradient descent (cont.)
For the mean-square-error cost function:

$$E(n) = \frac{1}{2} e^2(n) = \frac{1}{2}[d(n) - y(n)]^2 = \frac{1}{2}[d(n) - w(n)\,x(n)]^2 \qquad \text{(linear neurons)}$$

$$\nabla E(n) = \frac{\partial E}{\partial w(n)} = \frac{1}{2}\,\frac{\partial e^2(n)}{\partial w(n)} = -e(n)\,x(n)$$
Gradient descent (cont.)
Hence

$$w(n+1) = w(n) + \eta\, e(n)\, x(n) = w(n) + \eta\,[d(n) - y(n)]\, x(n)$$

This is the least-mean-square (LMS) algorithm, or the Widrow-Hoff
rule
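A minimal sketch of the one-variable LMS update, processing one sample at a time. The function name `lms_1d`, the learning rate, the number of passes and the noise-free toy data are all assumptions made for illustration.

```python
def lms_1d(samples, eta=0.05, w0=0.0, epochs=200):
    """Online LMS (Widrow-Hoff) for a single weight, with the bias fixed at 0."""
    w = w0
    for _ in range(epochs):
        for x, d in samples:
            y = w * x            # output of the linear neuron
            e = d - y            # error for this sample
            w = w + eta * e * x  # w(n+1) = w(n) + eta * e(n) * x(n)
    return w


# Noise-free toy data from d = 2x (made up for illustration).
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = lms_1d(samples)
print(w)
```

Unlike the closed-form batch solution, each update uses only the current sample, so the data never has to be held in memory all at once.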
Multi-variable case
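The analysis for the one-variable case extends to the multi-variable case

$$E(n) = \frac{1}{2}[d(n) - \mathbf{w}^T(n)\,\mathbf{x}(n)]^2$$

$$\nabla E(\mathbf{w}) = \left(\frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_m}\right)^T$$

where $w_0 = b$ (bias) and $x_0 = 1$, as done for perceptron learning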
Multi-variable case (cont.)
The LMS algorithm

$$\mathbf{w}(n+1) = \mathbf{w}(n) - \eta\,\nabla E(n)$$

$$= \mathbf{w}(n) + \eta\, e(n)\,\mathbf{x}(n)$$

$$= \mathbf{w}(n) + \eta\,[d(n) - y(n)]\,\mathbf{x}(n)$$
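The vector form of LMS can be sketched in plain Python by prepending the constant input $x_0 = 1$ so that $w_0$ plays the role of the bias. The function name `lms`, the hyperparameters and the toy target are assumptions for illustration only.

```python
def lms(samples, eta=0.1, epochs=1000):
    """Online LMS for a linear neuron with m inputs plus a bias weight w[0]."""
    m = len(samples[0][0])
    w = [0.0] * (m + 1)
    for _ in range(epochs):
        for x, d in samples:
            x_aug = [1.0] + list(x)                       # x_0 = 1 carries the bias
            y = sum(wj * xj for wj, xj in zip(w, x_aug))  # y(n) = w^T(n) x(n)
            e = d - y
            # w(n+1) = w(n) + eta * e(n) * x(n), applied component-wise
            w = [wj + eta * e * xj for wj, xj in zip(w, x_aug)]
    return w


# Noise-free toy data from d = 1 + 2*x1 - x2 (made up for illustration).
samples = [
    ((0.0, 0.0), 1.0),
    ((1.0, 0.0), 3.0),
    ((0.0, 1.0), 0.0),
    ((1.0, 1.0), 2.0),
]
w = lms(samples)
print(w)
```

With noise-free data and a sufficiently small learning rate, the weights approach the bias and coefficients used to generate the targets.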
LMS algorithm
Remarks
The LMS rule is exactly the same in math form as the perceptron
learning rule
Perceptron learning is for McCulloch-Pitts neurons, which are
nonlinear, whereas LMS learning is for linear neurons. In other
words, perceptron learning is for classification and LMS is for
function approximation
LMS should be less sensitive to noise in the input data than
perceptrons. On the other hand, LMS learning converges slowly
Newton's method changes weights in the direction of the minimum of
E(w) and leads to fast convergence. But it is not an online method
and is computationally expensive
Stability of adaptation
When $\eta$ is too small,
learning converges slowly
When $\eta$ is too large, learning
doesn't converge
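The effect of the learning rate can be seen on a small example. The sketch below is an illustration under assumptions, not the slide's figure: it uses batch gradient descent on $E(w) = \frac{1}{2}\sum_i (d_i - w x_i)^2$ with b = 0, for which each step multiplies the distance to the minimizer by $1 - \eta \sum_i x_i^2$, so the iteration converges only when $0 < \eta < 2 / \sum_i x_i^2$. The three rates are arbitrary choices.

```python
def run_gd(xs, ds, eta, steps=50, w0=0.0):
    """Batch gradient descent on E(w) = 1/2 * sum_i (d_i - w * x_i)^2."""
    w = w0
    for _ in range(steps):
        grad = -sum((d - w * x) * x for x, d in zip(xs, ds))
        w -= eta * grad
    return w


# Noise-free toy data from d = 2x, so the minimizer is w* = 2 (made up for illustration).
xs = [1.0, 2.0, 3.0]
ds = [2.0, 4.0, 6.0]
s = sum(x * x for x in xs)  # here s = 14, so the stability bound is eta < 2/14

w_slow = run_gd(xs, ds, eta=0.001)    # far below the bound: slow progress
w_good = run_gd(xs, ds, eta=0.05)     # inside the bound: converges
w_diverged = run_gd(xs, ds, eta=0.2)  # above the bound: the iterates blow up
print(w_slow, w_good, w_diverged)
```

After the same number of steps, the small rate is still far from w* = 2, the moderate rate has converged, and the large rate has diverged.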
Learning rate annealing
Basic idea: start with a large rate but gradually decrease it
Stochastic approximation

$$\eta(n) = \frac{c}{n}$$

c is a positive parameter
Learning rate annealing (cont.)
Search-then-converge

$$\eta(n) = \frac{\eta_0}{1 + (n/\tau)}$$

$\eta_0$ and $\tau$ are positive parameters

When n is small compared to $\tau$, the learning rate is approximately
constant
When n is large compared to $\tau$, the learning rate schedule roughly
follows stochastic approximation
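The two annealing schedules are simple to write down as functions of the iteration number. In this sketch the function names and the default parameter values are assumptions chosen for illustration.

```python
def stochastic_approximation(n, c=1.0):
    # eta(n) = c / n, defined for n = 1, 2, ...
    return c / n


def search_then_converge(n, eta0=0.1, tau=100.0):
    # eta(n) = eta0 / (1 + n / tau)
    return eta0 / (1.0 + n / tau)


for n in (1, 10, 100, 1000, 10000):
    print(n, search_then_converge(n), stochastic_approximation(n, c=0.1 * 100.0))
```

For n much smaller than tau the search-then-converge rate stays close to eta0, and for n much larger than tau it approaches c / n with c = eta0 * tau, which is why the second column is compared against that choice of c.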
Rate annealing illustration
Nonlinear neurons
To extend the LMS algorithm to nonlinear neurons, consider a
differentiable activation function $\varphi$ at iteration n

$$E(n) = \frac{1}{2}[d(n) - y(n)]^2 = \frac{1}{2}\Big[d(n) - \varphi\Big(\sum_j w_j x_j(n)\Big)\Big]^2$$
Nonlinear neurons (cont.)
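By the chain rule of differentiation

$$\frac{\partial E}{\partial w_j} = \frac{\partial E}{\partial y}\,\frac{\partial y}{\partial v}\,\frac{\partial v}{\partial w_j}$$

$$= -[d(n) - y(n)]\,\varphi'(v(n))\,x_j(n)$$

$$= -e(n)\,\varphi'(v(n))\,x_j(n)$$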
Nonlinear neurons (cont.)
The gradient descent gives

$$w_j(n+1) = w_j(n) + \eta\, e(n)\,\varphi'(v(n))\, x_j(n) = w_j(n) + \eta\,\delta(n)\, x_j(n)$$

The above is called the delta ($\delta$) rule
If we choose a logistic sigmoid for $\varphi$

$$\varphi(v) = \frac{1}{1 + \exp(-av)}$$

then

$$\varphi'(v) = a\,\varphi(v)[1 - \varphi(v)]$$

(see textbook)
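The delta rule with a logistic sigmoid can be sketched as follows. This is an illustrative example, not the course's reference implementation: the function names, the learning rate, the number of passes and the toy task (logical OR, with a constant first input acting as the bias) are all assumptions.

```python
import math


def sigmoid(v, a=1.0):
    # Logistic sigmoid phi(v) = 1 / (1 + exp(-a * v)).
    return 1.0 / (1.0 + math.exp(-a * v))


def sigmoid_prime(v, a=1.0):
    # phi'(v) = a * phi(v) * (1 - phi(v)).
    s = sigmoid(v, a)
    return a * s * (1.0 - s)


def delta_rule_step(w, x, d, eta=0.5, a=1.0):
    """One delta-rule update for a single sigmoid neuron; returns (new_w, error)."""
    v = sum(wj * xj for wj, xj in zip(w, x))  # induced local field v(n)
    e = d - sigmoid(v, a)                     # e(n) = d(n) - y(n)
    delta = e * sigmoid_prime(v, a)           # delta(n) = e(n) * phi'(v(n))
    return [wj + eta * delta * xj for wj, xj in zip(w, x)], e


# Hypothetical toy task: logical OR; x[0] = 1 is the constant bias input.
data = [
    ([1.0, 0.0, 0.0], 0.0),
    ([1.0, 0.0, 1.0], 1.0),
    ([1.0, 1.0, 0.0], 1.0),
    ([1.0, 1.0, 1.0], 1.0),
]
w = [0.0, 0.0, 0.0]
for _ in range(2000):
    for x, d in data:
        w, _ = delta_rule_step(w, x, d)

outputs = [sigmoid(sum(wj * xj for wj, xj in zip(w, x))) for x, _ in data]
print(outputs)
```

The only difference from the linear LMS update is the extra factor $\varphi'(v(n))$, which scales each correction by the local slope of the activation function.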
Role of activation function
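[Figure: the activation function $\varphi(v)$ and its derivative $\varphi'(v)$, each plotted against v]

The role of $\varphi'$: weight update is most sensitive when v is near zero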
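The sensitivity claim can be illustrated by tabulating the derivative of the logistic sigmoid at a few points. A brief sketch, assuming the logistic form with slope parameter a = 1; the sample values of v are arbitrary.

```python
import math


def sigmoid_prime(v, a=1.0):
    # Derivative of the logistic sigmoid: a * phi(v) * (1 - phi(v)).
    s = 1.0 / (1.0 + math.exp(-a * v))
    return a * s * (1.0 - s)


for v in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"v = {v:+.1f}   phi'(v) = {sigmoid_prime(v):.4f}")
```

The derivative peaks at v = 0, where it equals a/4, and falls toward zero as |v| grows, so updates are scaled down when the neuron is saturated.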
