Desymm

CASE STUDY: STOCHASTIC SIMULATION VIA RADEMACHER
BOOTSTRAP
MAXIM RAGINSKY
In this lecture, we will look at an application of statistical learning theory to the problem of
efficient stochastic simulation, which arises frequently in engineering design. The basic question is
as follows. Suppose we have a system with input space Z. The system has a tunable parameter
that lies in some set . We have a performance index ` : Z [0, 1], where we assume
that the lower the value of `, the better the performance. Thus, if we use the parameter setting
and apply input z Z, the performance of the corresponding system is given by the scalar
`(z, ) [0, 1]. Now lets suppose that the input to the system is actually a random variable Z Z
with some distribution P P(Z). Then we can define the operating characteristic
Z
(1) L() , EP [`(Z, )] `(z, )PZ (dz), .
Z
The goal is to find an optimal operating point that achieves (or comes arbitrarily close to)
inf L().
In practice, the problem of minimizing L() is quite difficult for large-scale systems. First of all,
computing the integral in (1) may be a challenge. Secondly, we may not even know the distribution
PZ . Thirdly, there may be more than one distribution of the input, each corresponding to different
operating regimes and/or environments. For this reason, engineers often resort to Monte Carlo
simulation techniques: Assuming we can efficiently sample from PZ , we draw a large number of
independent samples Z1 , Z2 , . . . , Zn and compute
n
1X
bn = arg min Ln () arg min `(Zi , ),
n
i=1
where Ln () denotes the empirical version of the operating characteristic (1). Given an accuracy
parameter > 0 and a confidence parameter (0, 1), we simply need to draw enough samples,
so that
L(bn ) inf L() +

with probability at least 1 , regardless of what the true distribution PZ happens to be.
This is, of course, just another instance of the ERM algorithm we have been studying extensively.
However, there are two issues. One is how many samples we need to guarantee that the empirically
optimal operating point will be good. The other is the complexity of actually computing an
empirical minimizer.
The first issue has already come up in the course under the name of sample complexity of
learning. The second issue is often handled by relaxing the problem a bit: We choose a probability
distribution Q over (assuming it can be equipped with an appropriate -algebra) and, instead
of minimizing L() over , set some level parameter (0, 1), and seek any b , for which
there exists some exceptional set with Q() , such that
(2) inf L() L()
b inf L() +
\
Date: April 4, 2011.

1
with probability at least 1 . Unless the actual optimal operating point happens to lie in the
exceptional set , we will come to within of the optimum with confidence at least 1 . Then
we just need to draw a large enough number n of samples Z1 , . . . , Zn from PZ and a large enough
number m of samples 1 , . . . , m from Q, and then compute
b = arg min Ln ().
{1 ,...,m }
In the next several lectures, we will see how statistical learning theory can be used to develop such
simulation procedures. Moreover, we will learn how to use Rademacher averages1 to determine how
many samples we need in the process of learning. The use of statistical learning theory for simulation
has been pioneered in the context of control by M. Vidyasagar [Vid98, Vid01]; the refinement of
his techniques using Rademacher averages is due to Koltchinskii et al. [KAA+ 00a, KAA+ 00b]. We
will essentially follow their presentation, but with slightly better constants.
We will follow the following plan. First, we will revisit the abstract ERM problem and its
sample complexity. Then we will introduce a couple of refined tools pertaining to Rademacher
averages. Next, we will look at sequential algorithms for empirical approximation, in which the
sample complexity is not set a priori, but is rather determined by a data-driven stopping rule.
And, finally, we will see how these sequential algorithms can be used to develop robust and efficient
stochastic simulation strategies.
1. Empirical Risk Minimization: a quick review

Recall the abstract Empirical Risk Minimization problem: We have a space Z, a class P of
probability distributions over Z, and a class F of measurable functions f : Z [0, 1]. Given an
i.i.d. sample Z n drawn according to some unknown P P, we compute
n
1X
fbn , arg min Pn (f ) arg min f (Zi ).
f F f F n
i=1
We would like for P (fbn ) to be close to inf f F P (f ) with high probability. To that end, we have
derived the bound
P (fbn ) inf P (f ) 2kPn P kF ,
f F
where, as before, we have defined the uniform deviation
n
1 X
kPn P kF , sup |Pn (f ) P (f )| = sup f (Zi ) EP f (Z) .

f F f F n i=1

Hence, if n is sufficiently large so that, for every P P, kPn P kF /2 with P -probability at
least 1 , then P (fbn ) will be -close to inf f F P (f ) with probability at least 1 . This motivates
the following definition:
Definition 1. Given the pair (F, P), an accuracy parameter > 0, and a confidence parameter
(0, 1), the sample complexity of empirical approximation is

(3) N (; ) , min n N : sup P {kPn P kF } .
P P
In other words, for any > 0 and any (0, 1), N (/2; ) is an upper bound on the number of
samples needed to guarantee that P (fbn ) inf f F P (f ) + with probability (confidence) at least
1 .
1More precisely, their stochastic counterpart, in which we do not take the expectation over the Rademacher
sequence, but rather use it as a resource to aid the simulation.
2
2. Empirical Rademacher averages
As before, let Z n be an i.i.d. sample of length n from some P P(Z). On multiple occasions we
have seen that the performance of the ERM algorithm is controlled by the Rademacher average
" n #
n 1 X
(4) Rn (F(Z )) , E sup i f (Zi ) ,
n

n f F
i=1
where n = (1 , . . . , n ) is an n-tuple of i.i.d. Rademacher random variables independent of Z n .

More precisely, we have stablished the fundamental symmetrization inequality
(5) EkPn P kF 2ERn (F(Z n )),
as well as the concentration bounds

2
(6) P {kPn P kF EkPn P kF + } e2n
2
(7) P {kPn P kF EkPn P kF } e2n
These results show two things:

(1) The uniform deviation kPn P kF tightly concentrates around its expected value.
(2) The expected value EkPn P kF is bounded from above by ERn (F(Z n )).
It turns out that the expected Rademacher average ERn (F(Z n )) also furnishes a lower bound on
EkPn P kF :
Lemma 1 (Desymmetrization inequality). For any class F of measurable functions f : Z [0, 1],
we have
#
n
"
1 n 1 1 X
(8) ERn (F(Z )) E sup i [f (Zi ) P (f )] EkPn P kF .

2 2 n 2n f F
i=1
Proof. We will first prove the second inequality in (8). To that end, for each 1 i n and each
f F, let us define Ui (f ) , f (Zi ) P (f ). Then EUi (f ) = 0. Let Z 1 , . . . , Z n be an independent
copy of Z1 , . . . , Zn . Then we can define U i (f ), 1 i n, similarly. Moreover, since EUi (f ) = 0,
we can write
# #
n n
" "
X X
E sup i [f (Zi ) P (f )] = E sup i Ui (f )

f F i=1 f F i=1
#
n
"
X
= E sup i Ui (f ) EU i (f )

f F i=1
" n #
X
E sup i [Ui (f ) U i (f )] .

f F
i=1

Since, for each i, Ui (f ) and U i (f ) are i.i.d., the difference Ui (f ) U i (f ) is a symmetric random
variable. Therefore,
(d)
i [Ui (f ) U i (f )] : 1 i n = Ui (f ) U i (f ) : 1 i n .
3
Using this fact and the triangle inequality, we get
# #
n n
" "
X X
E sup i [Ui (f ) U i (f )] = E sup [Ui (f ) U i (f )]

f F i=1
f F i=1

" n #
X
2E sup Ui (f )

f F i=1
" n #
X
= 2E sup |f (Zi ) P (f )

f F
i=1

= 2n EkPn P kF .
To prove the first inequality in (8), we write
#
n
"
n 1 X
ERn (F(Z )) = E sup i [f (Zi ) P (f ) + P (f )]

n f F i=1

" n # " n #
1 X 1 X
E sup i [f (Zi ) P (f )] + E sup P (f ) i

n f F i=1 n f F
i=1
" n # n
1 X 1 X
= E sup i [f (Zi ) P (f )] + E i

n f F i=1 n
i=1
#
n
"
1 X 1
E sup i [f (Zi ) P (f )] + .

n f F n
i=1
Rearranging, we get the desired inequality.
In this section, we will see that we can get a lot of mileage out of the stochastic version of the
Rademacher average. To that end, let us define
n
1 X
(9) rn (F(Z n )) , sup i f (Zi ) .

n f F
i=1
The key difference between (4) and (9) is that, in the latter, we do not take the expectation over the
Rademacher sequence n . In other words, both Rn (F(Z n )) and rn (F(Z n )) are random variables,
but the former depends only on the training data Z n , while the latter also depends on the n
Rademacher random variables 1 , . . . , n . We see immediately that Rn (F(Z n )) = E[rn (F(Z n ))|Z n ]
and ERn (F(Z n )) = Ern (F(Z n )), where the expectation on the right-hand side is over both Z n and
n . The following result will be useful:
Lemma 2 (Concentration inequalities for Rademacher averages). For any > 0,
2 /2
(10) P {rn (F(Z n )) ERn (F(Z n )) + } en
and
2 /2
(11) P {rn (F(Z n )) ERn (F(Z n )) } en .
Proof. For each 1 i n, let Ui , (Zi , i ). Then rn (F(Z n )) can be represented as a real-
valued function g(U n ). Moreover, it is easy to see that this function has bounded differences with
c1 = . . . = cn = 2/n. Hence, McDiarmids inequality tells us that for any > 0
2 /2
P {g(U n ) Eg(U n ) + } en ,
and the same holds for the probability that g(U n ) Eg(U n ) . This completes the proof.
4
3. Sequential learning algorithms
In a sequential learning algorithm, the sample complexity is a random variable. It is not known
in advance, but rather is computed from data in the process of learning. In other words, instead
of using a training sequence of fixed length, we keep drawing independent samples until we decide
that we have acquired enough of them, and then compute an empirical risk minimizer.
To formalize this idea, we need the notion of a stopping time. Let {Un } n=1 be a random process.
A random variable taking values in N is called a stopping time if and only if, for each n 1, the
occurrence of the event { = n} is determined by U n = (U1 , . . . , Un ). More precisely:
Definition 2. For each n, let n denote the -algebra generated by U n (in other words, n consists
of all events that occur by time n). Then a random variable taking values in N is a stopping time
if and only if, for each n 1, the event { = n} n .
In other words, denoting by U the entire sample path (U1 , U2 , . . .) of our random process, we
can view as a function that maps U into N. For each n, the indicator function of the event
{ = n} is a function of U :
1{ =n} 1{ (U )=n} .
Then is a stopping time if and only if, for each n and for all U , V with U n = V n we have
1{ (U )=n} = 1{ (V )=n} .
Our sequential learning algorithms will work as follows. Given a desired accuracy parameter
> 0 and a confidence parameter > 0, let n(, ) be the initial sample size; we will assume that
n(, ) is a nonincreasing function of both and . Let T (, ) denote the set of all stopping times
such that
sup P {kP P kF } .
P P
Now if T (, ) and we let

1X
fb , arg min P (f ) arg min f (Zi ),
f F f F
i=1
then we immediately see that

sup P (f ) inf P (f ) + 2 .
b
P P f F
Of course, the whole question is how to construct an appropriate stopping time without knowing
P.
Definition 3. A parametric family of stopping times {(, ) : > 0, (0, 1)} is called strongly
efficient (SE) (w.r.t. F and P) if there exist constants K1 , K2 , K3 1, such that for all > 0,
(0, 1)
(12) (, ) T (K1 , )
and for all T (, )
(13) sup P {(K2 , ) > } K3 .
P P
In other words, Eq. (12) says that any SE stopping time {(, )} guarantees that we can approx-
imate statistical expectations by empirical expectations with accuracy K1 and confidence 1 ;
similarly, Eq. (13) says that, with probability at least 1 K3 , we will require at most as many
samples as would be needed by any sequential algorithm for empirical approximation with accuracy
/K2 and confidence 1 .
5
Definition 4. A family of stopping times {(, ) : > 0, (0, 1)} is weakly efficient (WE) for
(F, P) if there exist constants K1 , K2 , K3 1, such that for all > 0, (0, 1)
(14) (, ) T (K1 , )
and
(15) sup P {(K2 , ) > N (; )} K3 .
P P
If (, ) is a WE stopping time, then Eq. (14) says that we can solve the empirical approximation
problem with accuracy K1 and confidence 1 ; Eq. (15) says that, with probability at most 1 ,
the sample complexity will be less than the sample complexity of empirical approximation with
accuracy /K2 and confidence 1 .
If N (; ) n(, ), then N (, ) T (, ). Hence, any WE stopping time is also SE. The
converse, however, is not true.
3.1. A strongly efficient sequential learning algorithm. Let {Zn } n=1 be an infinite sequence

of i.i.d. draws from some P P; let {n }n=1 be an i.i.d. Rademacher sequence independent of
{Zn }. Choose

2 2
(16) n(, ) 2 log +1
(1 e2 /2 )
and let
(17) (, ) , min {n n(, ) : rn (F(Z n )) } .
This is clearly a stopping time for each > 0 and each (0, 1).
Theorem 1. The family {(, ) : > 0, (0, 1)} defined in (17) with n(, ) set according to
(16) is SE for any class F of measurable functions f : Z [0, 1] and P = P(Z) with K1 = 5, K2 =
6, K3 = 1.
Proof. Let n = n(, ). We will first show that, for any P P(Z),
(18) kPn P kF 2rn (F(Z n )) + 3, n n
with probability at least 1. Since for n = (, ) n we have rn (F(Z n )) , we will immediately

be able to conclude that

P kP(,) P kF 5 ,
which will imply that (, ) T (5, ). Now we prove (18). First of all, applying Lemma 2 and
the union bound, we can write

[ X 2
P {rn (F(Z n )) ERn (F(Z n )) + } en /2

nn nn
2 2
X
= en /2 en /2
n0
2
en /2
=
1 e2 /2
/2.
6
From the symmetrization inequality (5), we know that EkPn P kF 2ERn (F(Z n )). Moreover,
using (6) and the union bound, we can write

[ X 2
P {kPn P kF EkPn P kF + } e2n

nn nn
2
X
en /2
nn
/2.
Therefore, with probability at least 1 ,
kPn P kF EkPn P kF + 2ERn (F(Z n )) + 2rn (F(Z n )) + 3, n n
which is (18). This shows that (12) holds for (, ) with K1 = 5.

Next, we will prove that, for any P P(Z),

(19) P min kPn P kF < .
nn<(6,)
In other words, (19) says that, with probability at least 1, kPn P kF for all n n < (6, ).
This means that, for any T (, ), (6, ) with probability at least 1 , which will give
us (13) with K2 = 6 and K3 = 1.
To prove (19), we have by (7) and the union bound that

[
P {kPn P kF EkPn P kF } /2.

nn
By the desymmetrization inequality (8), we have

1 1
EkPn P kF ERn (F(Z n )) , n.
2 2 n
Finally, by the concentration inequality (10) and the union bound,

[
n n
P {rn (F(Z )) ERn (F(Z )) + } /2.

nn
Therefore, with probability at least 1 ,

1 1 3
kPn P kF rn (F(Z n )) , n n.
2 2 n 2
If n n < (6, ), then rn (F(Z n )) > 6. Therefore, using the fact that n n and n(, )1/2 ,
we see that, with probability at least 1 ,
3 1 3 1
kPn P kF > , n n < (6, ).
2 2 n 2 2 n
This proves (19), and we are done.
7
3.2. A weakly efficient sequential learning algorithm. Now choose

2 4
(20) n(, ) 2 log + 1,

for each k = 0, 1, 2, . . . let nk , 2k n(, ), and let
(21) (, ) , min {nk : rnk (F(Z nk )) } .
Theorem 2. The family {(, ) : > 0, (0, 1/2)} defined in (21) with n(, ) set according
to (20) is WE for any class F of measurable functions f : Z [0, 1] and P = P(Z) with K1 = 5,
K2 = 18, K3 = 3.
Proof. As before, let n = n(, ). The proof of (14) is similar to what we have done in the proof of
Theorem 1, except we use the bounds
( )
k 2
[ X
P nk nk
{rnk (F(Z )) ERnk (F(Z )) + } e2 n /2
k=0 k=0

n2
n2 /2 n2 /2 (2k 1)
X
=e +e e 2
k=1

2 /2 2 /2 k 1)
X
en + en e(2
k=1

2 /2 2 /2
X
en + en ek
k=1
n2 /2
2e
/2,
where in the third step we have used the fact that n2 /2 1. Similarly,
( )
[
P {kPnk P kF EkPnk P kF + } 2 .
k=0
Therefore,
kPnk P kF 2rnk (F(Z nk )) + 3, k = 0, 1, 2, . . .
and consequently

P kP(,) P kF 5 ,
which proves (14).
Now we prove (15). Let N = N (, ), the sample complexity of empirical approximation that
we have defined in (3). Let us choose k so that nk N < nk+1 , which is equivalent to 2k n N <
2k+1 n. Then
P {(18, ) > N } P {(18, ) > nk } .
We will show that the probability on the right-hand side is less than 3. First of all, since N n
(by hypothesis), we have nk n/2 1/2 . Therefore, with probability at least 1
1 1 9 1
(22) kPnk P kF rnk (F(Z nk )) rnk (F(Z nk )) 5.
2 2 nk 2 2
8
If (18, ) > nk , then by definition rnk (F(Z nk )) > 18. Writing rnk = rnk (F(Z nk )) for brevity,
we see get
P {(18, ) > nk } P {rnk > 18}
= P {rnk > 18, kPnk P kF 18} + P {rnk > 18, kPnk P kF < 4}
P {kPnk P kF 4} + P {rnk > 18, kPnk P kF < 4} .
If rnk > 18 but kPnk P kF < 4, the event in (22) cannot occur. Indeed, suppose it does. Then
it must be the case that 4 > 9 5 = 4, which is a contradiction. Therefore,
P {rnk > 18, kPnk P kF < 4} ,
and hence
P {(18, ) > nk } P {kPnk P kF 4} + .
For each f F and each n N define
n
X
Sn (f ) , [f (Zi ) P (f )]
i=1
and let kSn kF , supf F |Sn (f )|. Then

P {kPnk P kF 4} = P {kSnk kF 4nk } P {kSnk kF 2N } .
Since nk N , the F-indexed stochastic processes Snk (f ) and SN (f ) Snk (f ) are independent.
Therefore, we use a technical result stated as Lemma 4 in the appendix with 1 = Snk and 2 =
SN (f ) Snk (f ) to write
P {kSN kF N }
P {kSnk kF 2N } .
inf f F P {|SN (f ) Snk (f )| N }
By definition of N = N (, ), the probability in the numerator is at most . To analyze the
probability in the denominator, we use Hoeffdings inequality to get
inf P {|SN (f ) Snk (f )| N } = 1 sup P {|SN (f ) Snk (f )| > N }
f F f F
2 /2
1 2eN
1 .
Therefore,

P {(18, ) > nk } + 3
1
for < 1/2. Therefore, {(, ) : (0, 1), (0, 1/2)} is WE with K1 = 5, K2 = 18, K3 = 3.
4. A sequential algorithm for stochastic simulation

Armed with these results on sequential learning algorithms, we can take up the question of
constructing efficient simulation strategies. We fix an accuracy parameter > 0, a confidence
parameter (0, 1), and a level parameter (0, 1). Given two probability distributions, P on
the input space Z and Q on the parameter space , we draw a large i.i.d. sample Z1 , . . . , Zn from
P and a large i.i.d. sample 1 , . . . , m from Q. We then compute
b = arg min Ln (),
{1 ,...,m
9
where
n
1X
Ln () , `(Zi , ).
n
i=1
The goal is to pick n and m large enough so that, with probability at least 1 , b is an -minimizer
of L to level , i.e., with probability at least 1 there exists some set with Q() ,
such that Eq. (2) holds with probability at least 1 .
To that end, consider the following algorithm based on Theorem 2, proposed by Koltchinskii et
al. [KAA+ 00a, KAA+ 00b]:
Algorithm 1
choose positive integers m and n such that
log(2/)
m log[1/(1)] and n b 50 2
log 8 c + 1
draw m independent samples 1 , . . . , m from Q
draw n independent samples Z1 , . . . , Zn from PZ
evaluate the stopping P variable
= max1jm n1 ni=1 i `(Zi , j )

where 1 , . . . , n are i.i.d. Rademacher r.v.s independent of m and Z n

if > /5, then
add n more i.i.d. samples from PZ and repeat
else stop and output
b = arg min{1 ,...,n } Ln ()
Then we claim that, with probability at least 1 , b is an -minimizer of L to level . To see this,
we need the following result [Vid03, Lemma 11.1]:
Lemma 3. Let Q be a probability distribution on the parameter set , and let h : R be a
(measurable) real-valued function on , bounded from above, i.e., h() < + for all . Let
1 , . . . , m be m i.i.d. samples from Q, and let
h(m ) , max h(m ).
1jm
Then for any (0, 1)

: h() > h(m )

(23) Q
with probability at least 1 (1 )m .
Proof. For each c R, let
F (c) , P ({ : h() c}) .
Note that F is the CDF of the random variable = h() with Q. Therefore, it is right-
continuous, i.e., limc0 &c F (c0 ) = F (c). Now define
c , inf {c : F (c) 1 } .
Since F is right-continuous, F (c ) 1 . Moreover, if c < c , then F (c) < 1 . Now let us
suppose that h(m ) c . Then, since F is monotone nondecreasing,
P : h() h(m ) = F h(m ) F (c ) 1 ,

or, equivalently, if h(m ) c , then

P : h() > h(m ) .

Therefore, if m is such that

: h() > h(m ) > ,

P
10
then it must be the case that h(m ) < c , which in turn implies that F (h(m )) < 1 , the
complement of the event in (23). But h(m ) < c means that h(j ) < c for every 1 j m.
Since the j s are independent, the events {h(j ) < c } are independent, and each occurs with
probability at most 1 . Therefore,
P m m : Q : h() > h(m ) (1 )m ,

which is what we intended to prove.
We apply this lemma to the function h() = L(). Then, provided m is chosen as described in
Algorithm 1, we will have

Q : L() < min L(m ) /2.
1jj
Now consider the finite class of functions F = {fj (z) = `(z, j ) : 1 j m}. By Theorem 2, the
final output b {1 , . . . , m } will satisfy

b min L(j )
L()
1jm
with probability at least 1 /2. Hence, with probability at least 1 there exists a set
with Q() , such that (2) holds. Moreover, the total number of samples used up by Algorithm 1
will be, with probability at least 1 3/2, no more than
NF ,PZ (/18, /2) min {n N : P (kPn PZ kF > /18) < /2} .
We can estimate NF ,PZ (/18, /2) as follows. First of all, the function
(Z n ) , kPn PZ kF max |Pn (fj ) PZ (fj )|

1jm
has bounded differences with c1 = . . . = cn = 1/n. Therefore, by McDiarmids inequality

2
P ((Z n ) E(Z n ) + t) e2nt , t > 0.
Secondly, since the class F is finite with |F| = m, the symmetrization inequality (5) and the Finite
Class Lemma give the bound
r
log m
EkPn PZ kF 4 .
n
p
Therefore, if we choose t = /18 4 n1 log m and n is large enough so that t > /20 (say), then
2 /200
P (kPn P kF > /18) en .
Hence, a fairly conservative estimate is

( $ % )
720 2

200 2
NF ,PZ (/18, /2) max log + 1, log m + 1
2
It is instructive to compare Algorithm 1 with a simple Monte Carlo strategy:
11
Algorithm 0
choose positive integers m and n such that
log(2/)
m log[1/(1)] and n 212 log 4m
draw m independent samples 1 , . . . , m from Q
draw n independent samples Z1 , . . . , Zn from PZ
for j = 1 to m
compute Ln (j ) = n1 ni=1 `(Zi , j )
P
end for
output b = arg min{1 ,...,m Ln (j )
The selection of m is guided by the same considerations as in Algorithm 1. Moreover, for each
1 j m, Ln (j ) is an average of n independent random variables `(Zi , j ) [0, 1], and L(j ) =
ELn (j ). Hence, Hoeffdings inequality says that
2
P ({Z n Zn : |Ln (j ) L(j )| > }) 2e2n .
If we choose n as described in Algorithm 0, then

m
[

b min L(j ) > P
P Ln () |Ln (j ) L(j )| >
1jm
j=1
m
X
P (|Ln (j ) L(j )| > )
j=1
/2.
Hence, with probability at least 1 there exists a set with Q() , so that (2) holds.
It may seem at first glance that Algorithm 0 is more efficient than Algorithm 1. However, this is
not the case in high-dimensional situations. There, one can actually show that, with probability
practically equal to one, the empirical minimum of L can be much larger than the true minimum
(cf. [KAA+ 00b] for a very vivid numerical illustration). This is an instance of the so-called Curse
of Dimensionality, which adaptive schemes like Algorithm 1 can often avoid.
Appendix A. Technical lemma

Lemma 4. Let {1 (f ) : f F} and {2 (f ) : f F} be two independent F-indexed stochastic
processes with
kj kF , sup |j (f )| < , j = 1, 2.
f F
Then for all t > 0, c > 0

P {k1 2 kF t}
(24) P {k1 kF t + c} .
inf f F P {|2 (f )| c}
Proof. If k1 kF t + c, then there exists some f F, such that |1 (f )| t + c. Then for this
particular f by the triangle inequality we see that
|2 (f )| c |1 (f ) 2 (f )| t
Therefore,
n o n o n o n o
inf P2 |2 (f )| c P2 |2 (f )| c P2 |1 (f ) 2 (f )| t P2 k1 2 kF t .
f F
12
The leftmost and the rightmost terms in the above inequality do not depend on the particular f ,
and the inequality between them is valid on the event {k1 kF t + c}. Therefore, integrating the
two sides w.r.t. 1 on this event, we get
n o n o n o
inf P2 |2 (f )| c P1 k1 kF t + c P1 ,2 k1 2 kF t .
f F
Rearranging, we get (24).
References
+
[KAA 00a] V. Koltchinskii, C. T. Abdallah, M. Ariola, P. Dorato, and D. Panchenko. Improved sample complexity
estimates for statistical learning control of uncertain systems. IEEE Transactions on Automatic Control,
45(12):23832388, 2000.
[KAA+ 00b] V. Koltchinskii, C. T. Abdallah, M. Ariola, P. Dorato, and D. Panchenko. Statistical learning control
of uncertain systems: it is better than it seems. Technical Report EECE-TR-00-001, University of New
Mexico, April 2000.
[Vid98] M. Vidyasagar. Statistical learning theory and randomized algorithms for control. IEEE Control Maga-
zine, 18(6):162190, 1998.
[Vid01] M. Vidyasagar. Randomized algorithms for robust controller synthesis using statistical learning theory.
Automatica, 37:15151528, 2001.
[Vid03] M. Vidyasagar. Learning and Generalization. Springer, 2 edition, 2003.
13

Desymm

Загружено:

Сведения о документе

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

Desymm

Загружено:

Авторское право:

Доступные форматы

CASE STUDY: STOCHASTIC SIMULATION VIA RADEMACHER

Date: April 4, 2011.

1. Empirical Risk Minimization: a quick review

where n = (1 , . . . , n ) is an n-tuple of i.i.d. Rademacher random variables independent of Z n .

(5) EkPn P kF 2ERn (F(Z n )),

as well as the concentration bounds

These results show two things:

with probability at least 1. Since for n = (, ) n we have rn (F(Z n )) , we will immediately

Therefore, with probability at least 1 ,

kPn P kF EkPn P kF + 2ERn (F(Z n )) + 2rn (F(Z n )) + 3, n n

which is (18). This shows that (12) holds for (, ) with K1 = 5.

By the desymmetrization inequality (8), we have

Finally, by the concentration inequality (10) and the union bound,

Therefore, with probability at least 1 ,

and let kSn kF , supf F |Sn (f )|. Then

4. A sequential algorithm for stochastic simulation

where 1 , . . . , n are i.i.d. Rademacher r.v.s independent of m and Z n

Then for any (0, 1)

or, equivalently, if h(m ) c , then

Therefore, if m is such that

P m m : Q : h() > h(m ) (1 )m ,

which is what we intended to prove.

NF ,PZ (/18, /2) min {n N : P (kPn PZ kF > /18) < /2} .

(Z n ) , kPn PZ kF max |Pn (fj ) PZ (fj )|

has bounded differences with c1 = . . . = cn = 1/n. Therefore, by McDiarmids inequality

Hence, a fairly conservative estimate is

It is instructive to compare Algorithm 1 with a simple Monte Carlo strategy:

Appendix A. Technical lemma

Then for all t > 0, c > 0

Rearranging, we get (24).

Вам также может понравиться