
World Applied Programming, Vol. 5, No. 4, April 2015, pp. 68-72
TI Journals, ISSN: 2222-2510
www.tijournals.com
Copyright 2015. All rights reserved for TI Journals.

Application of Newton-Raphson Algorithm in Common Random Numbers for Finding the Optimal Solution in Simultaneous Perturbation Stochastic Approximation

Behrouz Fathi-Vajargah*
Faculty of Mathematical Sciences, Department of Statistics, University of Guilan, Rasht, Iran

Sina Darjazi
Faculty of Mathematical Sciences, Department of Statistics, University of Guilan, Rasht, Iran

*Corresponding author: Behrouz.fathi@gmail.com

Keywords
common random numbers, gradient-free algorithm, inverse transform method, Newton-Raphson method, simultaneous perturbation, stochastic approximation

Abstract
The use of common random numbers (CRN) is known as a good way to reduce the variability of the gradient estimates of gradient-free algorithms such as simultaneous perturbation stochastic approximation (SPSA). It has been proven that CRN is the optimal choice when the inverse transform method is used to generate the random variables. In this paper, we show that using the Newton-Raphson method to generate random variables within CRN also provides an appropriate solution for SPSA.

1. Introduction

Finite difference stochastic approximation (FDSA) and simultaneous perturbation stochastic approximation (SPSA) are efficient gradient-free methods for minimizing a loss function $L(\theta)$ in the presence of random noise. SPSA requires only 2 evaluations of the loss function per gradient estimate, $p$ times fewer than the $2p$ required by the FDSA method; thus SPSA is superior to FDSA in terms of speed and memory. SPSA is applicable in a variety of fields such as passive filter planning, wireless sensor networks, and neural networks [4, 5, 6]. An important component of SPSA is the gradient estimate, which has the form of a difference; using CRN, the variance of this difference can be reduced. The CRN method, independent of optimization, is expressed in the simulation context as a way to reduce the variability of the difference of two random vectors [2, 3, 8, 9]. The main idea comes from minimizing $\mathrm{var}(X - Y)$, where $X$, $Y$ are random variables [9]. Given the distributions of $X$ and $Y$, the minimum occurs when $\mathrm{cov}(X, Y)$ is maximized, i.e., when $X$ and $Y$ behave as similarly as possible. The CRN method achieves this goal by taking the same path in the generation of $X$ and $Y$. CRN is therefore proposed as a simulation-based optimization method for SPSA. Reducing the variance of the gradient estimate reduces the variability of the SPSA iterates and ultimately leads to faster convergence. Theoretically, CRN is the optimal choice when the inverse transform method is used to generate the random variables [8, 9]. This requirement seems somewhat restrictive, because for many random variables no simple form of the inverse distribution function can be found, and other ways to generate these variables are needed. In this paper, we use the Newton-Raphson method to solve the equation $F(X) = U$, instead of the exact inverse transform, within the theory of CRN.

2. Formulation of the problem

Let $L(\theta)$ be a loss function observed in the presence of random noise as $y(\theta) = L(\theta) + \varepsilon$, where $\theta$ is the vector of parameters of interest, $\varepsilon$ represents the noise term, and $\Theta$ is the domain of allowable values for $\theta$. Our problem is as follows:

$$\min_{\theta \in \Theta} L(\theta). \qquad (1)$$

The stochastic optimization algorithm for solving (1) is given by the following iterative form:

$$\hat{\theta}_{k+1} = \hat{\theta}_k - a_k \hat{g}_k(\hat{\theta}_k), \quad k = 0, 1, 2, \ldots \qquad (2)$$

where $\hat{\theta}_k$ is the estimate of $\theta$ at the $k$th iteration and $\hat{g}_k(\cdot)$ represents an estimate of the gradient of $L$ at the $k$th iteration. The step-size sequence $\{a_k\}$ is nonnegative, decreasing, converging to zero, and satisfies $\sum_{k=0}^{\infty} a_k = \infty$.

In the finite difference (FD) method, the $l$th element of the gradient estimate is calculated as follows:

$$\hat{g}_{kl}(\hat{\theta}_k) = \frac{y_{kl}^{(+)} - y_{kl}^{(-)}}{2 c_k}, \quad l = 1, \ldots, p, \qquad (3)$$

$$y_{kl}^{(\pm)} = y(\hat{\theta}_k \pm c_k e_l),$$

where $\{c_k\}$ is a sequence of positive numbers converging to zero with the condition $\sum_{k=0}^{\infty} a_k^2 / c_k^2 < \infty$, and $e_l$ is a unit vector with a 1 in the $l$th place and a 0 elsewhere. Therefore, we need $2p$ loss function evaluations for a gradient estimate in the FD method.
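As an illustration of (3), the following Python sketch computes an FD gradient estimate; the noisy loss measurement `y` is a placeholder of our own, not something defined in the paper.

```python
import numpy as np

def fd_gradient(y, theta, c_k):
    """Finite-difference gradient estimate (3): one pair of noisy loss
    measurements y(theta +/- c_k * e_l) per coordinate, 2p in total."""
    p = len(theta)
    g = np.empty(p)
    for l in range(p):
        e_l = np.zeros(p)
        e_l[l] = 1.0  # unit vector with a 1 in the lth place
        g[l] = (y(theta + c_k * e_l) - y(theta - c_k * e_l)) / (2.0 * c_k)
    return g
```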


In the simultaneous perturbation (SP) method, the $l$th element of the gradient estimate is calculated as follows:

$$\hat{g}_{kl}(\hat{\theta}_k) = \frac{y_k^{(+)} - y_k^{(-)}}{2 c_k \Delta_{kl}}, \quad l = 1, \ldots, p, \qquad (4)$$

$$y_k^{(\pm)} = y(\hat{\theta}_k \pm c_k \Delta_k),$$

where $\Delta_k$ is a vector of mutually independent random variables satisfying the conditions in [1]. Sadegh and Spall have proven that the symmetric Bernoulli distribution for the elements of $\Delta_k$ is asymptotically optimal [12]. Cao has introduced an effective distribution for small samples [14].
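A corresponding sketch of the SP estimate (4): a single pair of loss measurements serves all $p$ coordinates, and the symmetric Bernoulli perturbation follows [12] (the loss `y` is again an assumed placeholder).

```python
import numpy as np

def sp_gradient(y, theta, c_k, rng):
    """Simultaneous perturbation gradient estimate (4): one pair of
    loss measurements serves all p coordinates at once."""
    p = len(theta)
    delta = rng.choice([-1.0, 1.0], size=p)  # symmetric Bernoulli perturbation [12]
    y_plus = y(theta + c_k * delta)
    y_minus = y(theta - c_k * delta)
    return (y_plus - y_minus) / (2.0 * c_k * delta)
```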

3. Background of common random numbers

Proposition 1: Consider random vectors $X = (X_1, \ldots, X_n)$ and $Y = (Y_1, \ldots, Y_n)$ with independent components of given distributions. Our problem is to find the $2n$-dimensional distribution function $F_{XY}$ such that $\mathrm{var}(g(X) - h(Y))$ is minimal, where the real functions $g$ and $h$ are monotonic in the same direction with respect to the $i$th component, $i = 1, \ldots, n$. Suppose $U_1 = (u_1^{(1)}, \ldots, u_1^{(n)})$ and $U_2 = (u_2^{(1)}, \ldots, u_2^{(n)})$ are vectors with independent components uniformly distributed on $[0,1]$. Then by rewriting $g(U_1) = g(F_{X_1}^{-1}(u_1^{(1)}), \ldots, F_{X_n}^{-1}(u_1^{(n)}))$ and $h(U_2) = h(F_{Y_1}^{-1}(u_2^{(1)}), \ldots, F_{Y_n}^{-1}(u_2^{(n)}))$, the problem is equivalent to finding the minimum of $\mathrm{var}(g(U_1) - h(U_2))$. Then the choice of VCRN (vector of CRN) $U_1 = U_2 = U$ is an optimal choice for our problem.

Proof: Clearly, the problem of finding the minimum variance is equivalent to the problem of finding the maximum of $E(g(U_1) h(U_2))$. We want to prove:

$$\max_{F_{U_1 U_2}} E(g(U_1) h(U_2)) = E(g(U) h(U)). \qquad (5)$$

The proof is by induction on $n$. For $n = 1$ the proposition is correct [13]. Assuming that VCRN is an optimal choice for some $m \geq 1$, then for $m + 1$ we have:

$$E(g(U_1) h(U_2)) = \int \cdots \int g(U_1) h(U_2) \, du_1^{(1)} \cdots du_1^{(m+1)} \, du_2^{(1)} \cdots du_2^{(m+1)}$$
$$= E_{u_1^{(m+1)}, u_2^{(m+1)}}\!\left( E_{u_1^{(1)}, \ldots, u_1^{(m)}, u_2^{(1)}, \ldots, u_2^{(m)}}\big[ g(U_1) h(U_2) \big] \right)$$
$$\leq E_{u_1^{(m+1)}, u_2^{(m+1)}}\!\left( E_U\big[ g(U, u_1^{(m+1)}) \, h(U, u_2^{(m+1)}) \big] \right),$$

where $U = (u^{(1)}, \ldots, u^{(m)})$ is the common vector of the first $m$ components given by the induction hypothesis. Thus we conclude that, for the given index $m+1$, the other elements must be common. The same statement is true for any other index, which means that $u_1^{(m+1)} = u_2^{(m+1)} = u^{(m+1)}$. Q.E.D. [9].
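The effect of Proposition 1 is easy to check numerically. Below is a minimal sketch in which the exponential distributions, rates, and sample size are our own illustrative choices: both variables are produced by the inverse transform, once from independent uniforms and once from a common uniform vector.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100_000
u1 = rng.uniform(size=m)
u2 = rng.uniform(size=m)

# Inverse transform for exponential variables: F^{-1}(u) = -ln(1 - u) / rate
x_ind = -np.log(1.0 - u1) / 1.0   # X from its own uniforms
y_ind = -np.log(1.0 - u2) / 1.2   # Y from independent uniforms
y_crn = -np.log(1.0 - u1) / 1.2   # Y reusing the SAME uniforms as X (CRN)

print(np.var(x_ind - y_ind))  # variance without CRN (larger)
print(np.var(x_ind - y_crn))  # variance under CRN (smaller)
```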

4. Obtaining the numerical solution of $F(X) = U$

The inverse transform method is exact when an explicit form of $F^{-1}$ is known, but sometimes we must solve $F(X) = U$ numerically, which requires more computation time. There are several factors in selecting an appropriate algorithm, such as:
1) speed of convergence,
2) guaranteed convergence,
3) knowledge of the density function,
4) prior knowledge.
4.1. Newton-Raphson method
This method converges when $F$ is convex or concave. If $f$ has no explicit form, this method should not be used: approximating $f(x)$ by $(F(x + h) - F(x))/h$ is relatively inaccurate due to the elimination of error terms [10]. The algorithm is as follows:
1) Choose an initial point $X$.
2) $X \leftarrow X - \dfrac{F(X) - U}{f(X)}$.
3) If the stopping criterion is satisfied, take $X$ as the answer; otherwise go to 2.


4.2. Stopping criterion
Any stopping criterion used in the algorithm leads to an inexact answer; however, the accuracy of the algorithm can be controlled according to need. Examples of stopping criteria are given below; accuracy can be increased with a smaller $\delta$:
1) $|X - X^*| \leq \delta$,


2) $|F(X) - U| \leq \delta$,
3) $|X - X^*| \leq \delta |X|$,

where $X^*$ is the exact solution of $F(X) = U$. Since $X^*$ is not known in practice, the second criterion is the appropriate one [10].
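A minimal Python sketch of the method of Section 4.1 with stopping criterion 2); here `F` and `f` are assumed to be the distribution and density functions of the target variable, and the tolerance and iteration cap are our own choices.

```python
def newton_raphson_inverse(F, f, u, x0, delta=1e-10, max_iter=100):
    """Solve F(x) = u by Newton-Raphson using stopping criterion 2):
    |F(x) - u| <= delta."""
    x = x0                              # 1) choose an initial point
    for _ in range(max_iter):           # guard against non-convergence
        if abs(F(x) - u) <= delta:      # 3) stopping criterion satisfied
            return x
        x = x - (F(x) - u) / f(x)       # 2) Newton update
    return x                            # accept the last iterate
```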

5. Simulation-based optimization: common random numbers

5.1. Optimization with SP gradient estimators
Let $L(\theta) = E(Q(\theta, V))$, where $Q$ is the output of a simulation as a function of the chosen $\theta$ and random effects $V$; the expectation is with respect to all randomness embodied in $V$. The $l$th element of the SP gradient estimate is calculated as follows:

$$\hat{g}_{kl}(\hat{\theta}_k) = \frac{y_k^{(+)} - y_k^{(-)}}{2 c_k \Delta_{kl}}, \quad l = 1, \ldots, p, \qquad (6)$$

$$y_k^{(\pm)} = Q(\hat{\theta}_k \pm c_k \Delta_k, V_k^{(\pm)}),$$

where $V_k^{(\pm)}$ are sequences of random vectors generated by the Monte Carlo method which, in general, can be directly or indirectly related to $\theta$ (in the indirect case, the distribution function of $V$ depends on $\theta$).
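In code, the estimator (6) differs from (4) only in that the random effects enter $Q$ explicitly; with the CRN choice of Proposition 2 below, the two measurements reuse one uniform vector. A sketch, where `Q` and `F_inv` are assumed placeholders for the simulation output and the componentwise inverse transform:

```python
import numpy as np

def sp_gradient_sim(Q, F_inv, theta, c_k, rng, crn=True):
    """Simulation-based SP gradient estimate (6); crn=True enforces
    U_k^(+) = U_k^(-) = U_k as in Proposition 2."""
    p = len(theta)
    delta = rng.choice([-1.0, 1.0], size=p)
    u_plus = rng.uniform(size=p)
    u_minus = u_plus if crn else rng.uniform(size=p)  # common vs. independent uniforms
    y_plus = Q(theta + c_k * delta, F_inv(u_plus))
    y_minus = Q(theta - c_k * delta, F_inv(u_minus))
    return (y_plus - y_minus) / (2.0 * c_k * delta)
```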
Proposition 2: Suppose $V_k^{(\pm)}$ are vectors of random effects with independent elements; $V_{kl}^{(+)}$ and $V_{kl}^{(-)}$ may be dependent for each $l$, but $V_{kl}^{(\pm)}$ and $V_{km}^{(\pm)}$, $l \neq m$, must be independent. Suppose $V_k^{(+)}$ and $V_k^{(-)}$ are generated by the inverse transform method from vectors $U_k^{(+)}$ and $U_k^{(-)}$ respectively (where $U_k^{(+)}$, $U_k^{(-)}$ are vectors with independent components uniformly distributed on $[0,1]$). Suppose $y_k^{(+)}$ and $y_k^{(-)}$ are monotonic in the same direction with respect to the $l$th element of $V_k^{(+)}$ and $V_k^{(-)}$, for almost all values of $\hat{\theta}_k$. Then $\mathrm{var}[\hat{g}_{kl}(\hat{\theta}_k) \mid \hat{\theta}_k]$ is minimized at each $l$ when $U_k^{(+)} = U_k^{(-)} = U_k$.

Proof: According to Proposition 1, $\mathrm{var}(y_k^{(+)} - y_k^{(-)} \mid \hat{\theta}_k, \Delta_k)$ and hence $\mathrm{var}\!\left(\frac{y_k^{(+)} - y_k^{(-)}}{2 c_k \Delta_{kl}} \,\middle|\, \hat{\theta}_k, \Delta_k\right)$ is minimized when $U_k^{(+)} = U_k^{(-)} = U_k$. On the other hand:

$$E\!\left[\left(\frac{y_k^{(+)} - y_k^{(-)}}{2 c_k \Delta_{kl}}\right)^{2} \,\middle|\, \hat{\theta}_k\right] = E\!\left[ E\!\left(\left(\frac{y_k^{(+)} - y_k^{(-)}}{2 c_k \Delta_{kl}}\right)^{2} \,\middle|\, \hat{\theta}_k, \Delta_k\right) \,\middle|\, \hat{\theta}_k \right],$$

which implies $\mathrm{var}(\hat{g}_{kl}(\hat{\theta}_k) \mid \hat{\theta}_k)$ is minimized when $U_k^{(+)} = U_k^{(-)} = U_k$. Q.E.D. [11].


5.2. Rate of convergence of SPSA using CRN
Under Proposition 2 of [1], for large $k$, $\hat{\theta}_k - \theta^*$ moves toward zero proportionally to $O(k^{-1/3})$. Given the constraints $\beta = \alpha - 2\gamma > 0$ and $3\gamma - \alpha/2 \geq 0$, and the forms $a_k = \frac{a}{(k+1)^{\alpha}}$, $c_k = \frac{c}{(k+1)^{\gamma}}$ ($a$, $c$ positive constants), the fastest possible stochastic rate is proportional to $k^{-1/3}$ for large $k$. Thus the maximum rate of stochastic convergence of $\hat{\theta}_k$ to $\theta^*$ in the SPSA algorithm without CRN is $O(k^{-1/3})$. Under Theorem 2.1 of [7] and the analogue of Proposition 2 of [1], for large $k$, $\hat{\theta}_k - \theta^*$ moves toward zero proportionally to $k^{-\alpha/2}$, $0 < \alpha \leq 1$. Given the constraints $\alpha - 4\gamma \geq 0$ and using CRN in the gradient estimate of the loss function $L(\theta)$, the maximum rate of stochastic convergence of $\hat{\theta}_k$ to $\theta^*$ will be proportional to $\frac{1}{\sqrt{k}}$. Note that the progress made by using CRN comes from the elimination of the $O(1)$ terms in the Taylor expansion of $Q(\hat{\theta}_k + c_k \Delta_k, V_k) - Q(\hat{\theta}_k - c_k \Delta_k, V_k)$ in the proof of Theorem 2.1 of [7]; if CRN is not used, so that $V_k^{(+)}$ and $V_k^{(-)}$ replace the common $V_k$, the convergence rate is not increased.

6. Algorithm

Suppose $N$ is the allowed or required number of iterations:

i. Set $k = 0$.
ii. Guess the initial $\hat{\theta}_0$.
iii. Choose $a$, $c$, $\alpha$, $\gamma$ in the sequences $a_k = \frac{a}{(k+1)^{\alpha}}$, $c_k = \frac{c}{(k+1)^{\gamma}}$.
iv. Compute $a_k = \frac{a}{(k+1)^{\alpha}}$ and $c_k = \frac{c}{(k+1)^{\gamma}}$.
v. Simulate the $p$-dimensional random perturbation vector $\Delta_k$.
vi. Generate the random vector $V$ with independent elements of a given distribution $F$.
vii. Compute $Q(\hat{\theta}_k + c_k \Delta_k, V)$ and $Q(\hat{\theta}_k - c_k \Delta_k, V)$.
viii. Compute $\hat{g}_k(\hat{\theta}_k)$ according to $\hat{g}_{kl}(\hat{\theta}_k) = \frac{y_k^{(+)} - y_k^{(-)}}{2 c_k \Delta_{kl}}$, $l = 1, \ldots, p$, with $y_k^{(\pm)} = Q(\hat{\theta}_k \pm c_k \Delta_k, V)$.
ix. Update $\hat{\theta}_k$ to the new value $\hat{\theta}_{k+1} = \hat{\theta}_k - a_k \hat{g}_k(\hat{\theta}_k)$.
x. Stop if $k = N$, or if there is little change in the last few iterations; otherwise set $k = k + 1$ and go to step iv.
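A compact Python sketch of steps i-x; the simulation output `Q`, the componentwise inverse transform `F_inv`, and all tuning constants are placeholders to be supplied by the user, not values from the paper.

```python
import numpy as np

def spsa_crn(Q, F_inv, theta0, N, a, c, alpha, gamma, seed=0):
    """SPSA with common random numbers, following steps i-x: the same
    random vector V enters both loss measurements of each iteration."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)      # ii. initial guess
    p = len(theta)
    for k in range(N):                           # i./x. iteration counter
        a_k = a / (k + 1) ** alpha               # iv. gain sequences
        c_k = c / (k + 1) ** gamma
        delta = rng.choice([-1.0, 1.0], size=p)  # v. perturbation vector
        V = F_inv(rng.uniform(size=p))           # vi. common random vector V
        y_plus = Q(theta + c_k * delta, V)       # vii. both measurements
        y_minus = Q(theta - c_k * delta, V)      #      share the SAME V (CRN)
        g = (y_plus - y_minus) / (2.0 * c_k * delta)  # viii. SP gradient
        theta = theta - a_k * g                  # ix. update
    return theta
```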

7. Numerical example

Consider the following loss function:

$$L(\theta) = \sum_{i=1}^{10} t_i^2 + E\left( \sum_{i=1}^{10} \frac{1}{v_i + t_i} \right),$$

where $\theta = (t_1, \ldots, t_{10})^T$, $t_i \geq 0$ $(i = 1, \ldots, 10)$, and the $v_i$ are independent random variables from Rayleigh distributions with parameters $\sigma_i$. We generate the $\sigma_i$ according to the uniform distribution on $(0.2, 2)$. $Q(\theta, V)$ is monotonically non-increasing in each increasing $v_i$ for any value of $t_i \geq 0$. The parameters $\sigma_i$ and the corresponding elements of $\theta^*$ are given in Table 1. According to Proposition 2, CRN provides the minimum variance for the elements of $\hat{g}_k(\hat{\theta}_k)$. Here $n$ is the total number of iterations, and the sequences $a_k$, $c_k$ are defined as $a_k = \frac{0.7}{k+1}$, $c_k = \frac{0.5}{(k+1)^{0.49}}$. In all runs, we used the initial guess $\hat{\theta}_0 = (1.2, 1.2, \ldots, 1.2)^T$. The difference between the CRN method and the non-CRN method in the sense of convergence is shown in Figure 1. We define the following two states:

I: $v_i$ generated by the inverse transform method. II: $v_i$ generated by the Newton-Raphson method.

The results of 50 independent runs are given in Tables 2 and 3. We tested the following statistical hypothesis: there is no significant difference between I and II in the sense of their estimated losses and rate of convergence.
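For the Rayleigh distribution used here, $F(x) = 1 - e^{-x^2/(2\sigma_i^2)}$, so state I has the closed-form inverse $F^{-1}(u) = \sigma_i \sqrt{-2 \ln(1-u)}$, while state II ignores the closed form and solves $F(x) = u$ with the Newton-Raphson iteration of Section 4. A sketch of both generators, in which the tolerance and starting point are our own choices:

```python
import numpy as np

def rayleigh_inverse_transform(u, sigma):
    """State I: closed-form inverse of the Rayleigh CDF."""
    return sigma * np.sqrt(-2.0 * np.log(1.0 - u))

def rayleigh_newton(u, sigma, delta=1e-12, max_iter=100):
    """State II: solve F(x) = u for the Rayleigh CDF by Newton-Raphson."""
    F = lambda x: 1.0 - np.exp(-x * x / (2.0 * sigma ** 2))               # CDF
    f = lambda x: (x / sigma ** 2) * np.exp(-x * x / (2.0 * sigma ** 2))  # density
    x = sigma                          # start near the mode of the density
    for _ in range(max_iter):
        if abs(F(x) - u) <= delta:     # stopping criterion 2) of Section 4.2
            break
        x = x - (F(x) - u) / f(x)
    return x
```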

Figure 1. Convergence of CRN method and non-CRN method


Table 1
i     $\sigma_i \sim U(0.2, 2)$     $i$th element of $\theta^*$
1     1.8509                        0.8167
2     0.7145                        0.6752
3     1.563                         0.7896
4     1.5567                        0.7889
5     0.8848                        0.7046
6     1.2221                        0.7517
7     0.3365                        0.5808
8     0.2971                        0.5665
9     1.1554                        0.7433
10    1.6025                        0.7935


Table 2. Comparison of I, II in terms of terminal loss values

Total iterations n    Mean $L(\hat{\theta}_n)$ in state I    Mean $L(\hat{\theta}_n)$ in state II    Sig (level of significance)
100                   28.60587177                            28.22759923                             0.7918535057
1000                  26.572366593                           26.48061284                             0.7001811655
10000                 26.380245345                           26.37294471                              0.53092978
100000                26.355868184                           26.354989426                            0.21001431

Table 3. Comparison of I, II in terms of terminal $n^{1/2}\|\hat{\theta}_n - \theta^*\|$ values

Total iterations n    Mean $n^{1/2}\|\hat{\theta}_n - \theta^*\|$ in state I    Mean $n^{1/2}\|\hat{\theta}_n - \theta^*\|$ in state II    Sig (level of significance)
100                   10.500910173437314                                        8.802750217276586                                          0.569360983918349
1000                  7.164886153732708                                         6.785387557160711                                          0.630234680928987
10000                 6.858212185078264                                         7.676914389347618                                          0.535493477324933
100000                7.848472662713310                                         7.704537753341972                                          0.769379323778296

It is clear that in all cases of Tables 2 and 3, since Sig is high, there is no significant difference; therefore we accept the stated statistical hypothesis.

8. Conclusion

The results of Tables 2 and 3 indicate that there is no significant difference between the two methods in the sense of their estimated losses and rate of convergence. So, in the CRN method for the SPSA algorithm, there is no significant difference between the Newton-Raphson method and the inverse transform method. Of course, the run time of Newton-Raphson is longer. Some algorithms such as the bisection and secant methods have higher execution speed, because they search only within a selected interval $[a, b]$, but they may converge to a false value. These algorithms can also be used when an explicit form of $f$ does not exist.

References
[1] J. C. Spall, Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation, IEEE Transactions on Automatic Control, vol. 37, pp. 332-341, (1992).
[2] S. Ehrlichman; S. G. Henderson, Comparing two systems: Beyond common random numbers, Simulation Conference (WSC), 2008 Winter, pp. 245-251, 7-10 (Dec. 2008).
[3] Xi Chen; B. Ankenman; B. L. Nelson, Common random numbers and stochastic kriging, Simulation Conference (WSC), Proceedings of the 2010 Winter, pp. 947-956, 5-8 (Dec. 2010).
[4] Ying-Yi Hong; Ching-Sheng Chiu, Passive Filter Planning Using Simultaneous Perturbation Stochastic Approximation, IEEE Transactions on Power Delivery, vol. 25, no. 2, pp. 939-946, (April 2010).
[5] M. A. Azim; Z. Aung; Weidong Xiao; V. Khadkikar; A. Jamalipour, Localization in wireless sensor networks by constrained simultaneous perturbation stochastic approximation technique, Signal Processing and Communication Systems (ICSPCS), 2012 6th International Conference on, pp. 1-9, 12-14 (Dec. 2012).
[6] J. C. Spall; Qing Song; Yeng Chai Soh; Jie Ni, Robust Neural Network Tracking Controller Using Simultaneous Perturbation Stochastic Approximation, IEEE Transactions on Neural Networks, vol. 19, no. 5, pp. 817-835, (May 2008).
[7] N. L. Kleinman; J. C. Spall; D. Q. Naiman, Simulation-Based Optimization with Stochastic Approximation Using Common Random Numbers, Management Science, vol. 45, pp. 1570-1578, (1999).
[8] R. Y. Rubinstein; G. Samorodnitsky; M. Shaked, Antithetic Variates, Multivariate Dependence, and Simulation of Stochastic Systems, Management Science, vol. 31, pp. 66-77, (1985).
[9] R. Y. Rubinstein; G. Samorodnitsky, Variance reduction by the use of common and antithetic random variables, Journal of Statistical Computation and Simulation, vol. 22, pp. 161-180, (1985).
[10] L. Devroye, Non-Uniform Random Variate Generation, Springer-Verlag, New York, (1986).
[11] J. C. Spall, Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, Wiley, Hoboken, NJ, (2003).
[12] P. Sadegh; J. C. Spall, Optimal Random Perturbations for Stochastic Approximation with a Simultaneous Perturbation Gradient Approximation, IEEE Transactions on Automatic Control, vol. 43, pp. 1480-1484, (1998).
[13] P. Bratley; B. L. Fox; L. E. Schrage, A Guide to Simulation, Springer-Verlag, New York, NY, (1983).
[14] Xumeng Cao, Effective perturbation distributions for small samples in simultaneous perturbation stochastic approximation, Information Sciences and Systems (CISS), 2011 45th Annual Conference on, pp. 1-5, 23-25 (March 2011).
