Paul Deheuvels
Topics on Empirical Processes

1 Introduction, notation and preliminaries . . . . . . 93
1.1 Introduction . . . . . . 93
1.2 Distribution and quantile functions . . . . . . 94
1.3 Topologies on spaces of measures and functions . . . . . . 98
1.4 The quantile transform . . . . . . 108
Abstract. This text contains the material presented at the European Mathematical Society Summer School on Theory and Statistical Applications of Empirical Processes, held in Laredo, Spain, in August-September 2004.
1. Introduction
It was in 1900 when Karl Pearson proposed the first test of goodness of fit: the $\chi^2$ test. The subsequent research devoted to enhancements of this elementary goodness-of-fit procedure became a major source of motivation for the development of key areas in Probability and Statistics, such as the theory of weak convergence in general spaces and the asymptotic theory of empirical processes. In this course we will analyze some suggestive aspects which arise from the development of the asymptotic theory of goodness-of-fit tests over the last century.
We will pay special attention to stressing the parallel evolution of the theory of empirical processes and the asymptotic theory of goodness-of-fit tests. Doubtless, this evolution is a good indicator of the vast transformation that Probability and Statistics experienced over this century. Certainly, the names that contributed to the theory are the main guarantee for this assertion. Pearson, Fisher, Cramér, von Mises, Kolmogorov, Smirnov, Feller, . . . laid the foundations of the theory. In some cases, the mathematical derivation of the asymptotic distribution of goodness-of-fit tests in that period had the added merit that, in a certain sense, the limit law was blindly pursued. In Mathematics the main difficulty in showing convergence consists, no doubt, in obtaining a convincing candidate for the limit. Thus, proofs in that period can be considered major pieces of precision and inventiveness.
A systematic method of producing adequate candidates for the limit law began in 1950 with the heuristic work of Doob [36], made precise by Donsker through the Invariance Principle. The subsequent construction of adequate metric spaces and the development of the corresponding weak convergence theory as the right
2 E. del Barrio
probabilistic setup for the study of asymptotic distributions had a wide and fast diffusion, with notable advances due to Prohorov and Skorohod, among others. The contribution of Billingsley's book [7] to this diffusion must also be pointed out.
The study of Probability in Banach spaces has been another source of useful results for goodness-of-fit theory. The names of Varadhan, Dudley, Araujo, Giné, Zinn, Ledoux, Talagrand, . . . are necessary references for anyone interested in asymptotics in Statistics. For example, the Central Limit Theorem in Hilbert spaces played a main role in the derivation of the asymptotic behavior of Cramér-von Mises type statistics.
Last, we must indicate the significance of the Hungarian school, which developed the strong approximation techniques initiated by Skorohod with his embedding. Breiman's book [8] had the merit of initially spreading Skorohod's embedding. Now, the strong approximations due to Komlós, Major, Tusnády, M. and S. Csörgő, Révész, Deheuvels, Horváth, Mason, . . . , are an invaluable tool in the study of asymptotics in Statistics, as we will point out in this course.
more adequate for tests against one-sided alternatives. The statistics $D_n$, $D_n^+$ or $D_n^-$ are known as Kolmogorov-Smirnov statistics and present the advantage of being distribution-free: for any continuous d.f. $F_0$, $D_n$ has (under $H_0$) the same distribution as $\sup_{0<t<1} |\alpha_n(t)|$, with $\alpha_n$ being the uniform empirical process:
$$\alpha_n(t) = \sqrt{n}\,(G_n(t) - t), \qquad 0 \le t \le 1,$$
where $G_n(t) = \frac{1}{n}\sum_{i=1}^n I(U_i \le t)$ and the $U_i$ are i.i.d. uniform r.v.s. Similar statements hold for $D_n^+$ and $D_n^-$. Thus, the same p-values can be used to obtain the significance level when testing fit to any continuous distribution. This desirable property is not
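The distribution-free property above is just the quantile transform at work: if $X_i = F_0^{-1}(U_i)$ with $F_0$ continuous, then $F_0(X_i) = U_i$, so the Kolmogorov-Smirnov statistic computed from the $X_i$ under $F_0$ coincides with the uniform statistic computed from the $U_i$. A minimal numerical sketch (the code and all names in it are illustrative, not from the text):

```python
import numpy as np

def ks_statistic(v):
    """Uniform Kolmogorov-Smirnov statistic sup_t |G_n(t) - t|,
    computed from v_i = F0(X_i) via the order-statistics formula."""
    v = np.sort(v)
    n = len(v)
    i = np.arange(1, n + 1)
    return max(np.max(i / n - v), np.max(v - (i - 1) / n))

rng = np.random.default_rng(1)
u = rng.uniform(size=200)
x = -np.log(1.0 - u)              # X = F0^{-1}(U) for F0 = Exp(1)
F0 = lambda t: 1.0 - np.exp(-t)   # the continuous null d.f.

d_exp = ks_statistic(F0(x))       # statistic for testing fit of x to Exp(1)
d_unif = ks_statistic(u)          # uniform statistic from the U_i themselves
print(abs(d_exp - d_unif))        # agree up to rounding
```

The two values coincide because $F_0(X_i) = U_i$ exactly under the inverse transform, so the law of the statistic does not depend on the continuous $F_0$ chosen.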
which was proposed by Smirnov [88], [89]. All the statistics which can be obtained by varying $\psi$ are usually referred to as statistics of Cramér-von Mises type. Consideration of different weight functions $\psi$ allows the statistician to put special emphasis on the detection of particular sets of alternatives. For this reason, some weighted versions of Kolmogorov's statistics have also been proposed, namely,
$$K_n(\psi) = \sqrt{n}\,\sup_{-\infty<x<\infty} \frac{|F_n(x) - F_0(x)|}{\psi(F_0(x))}. \qquad (2.2)$$
The convenience of employing $W_n^2(\psi)$ instead of $D_n$ as a test statistic can be understood by taking into account that $D_n$ accounts only for the largest deviation between $F_n(t)$ and $F(t)$, while $W_n^2(\psi)$ is a weighted average of all the deviations between $F_n(t)$ and $F(t)$. Thus, as observed by Stephens [96], $W_n^2(\psi)$ should have more chance to detect alternatives that, while not presenting a very large deviation with respect to $F$ at any point $t$, are moderately far from $F$ for a large range of points $t$ (think of location alternatives). These heuristic considerations are confirmed by simulation studies (see [96] for references).
Two particular statistics have received special attention in the literature. When $\psi = 1$,
$$W_n^2 = n \int_{-\infty}^{\infty} (F_n(x) - F_0(x))^2\,dF_0(x)$$
is called the Cramér-von Mises statistic; when $\psi(t) = (t(1-t))^{-1}$, then
$$A_n^2 = n \int_{-\infty}^{\infty} \frac{(F_n(x) - F_0(x))^2}{F_0(x)(1 - F_0(x))}\,dF_0(x)$$
is called the Anderson-Darling statistic. $A_n^2$ has the additional appeal of weighting the deviations according to their expected value, and this results in a more powerful statistic for testing fit to a fixed distribution, see [96].
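After the substitution $t = F_0(x)$ both statistics become functionals of $u_{(i)} = F_0(X_{(i)})$, and the usual closed-form computational formulas can be cross-checked against a direct numerical integration of the definitions. A sketch (illustrative code; the closed forms are the standard computing formulas, not derived in the text):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 25
u = np.sort(rng.uniform(size=n))    # u_i = F0(X_i) under H0
i = np.arange(1, n + 1)

# Standard closed forms in terms of the uniform order statistics:
W2 = 1.0 / (12 * n) + np.sum((u - (2 * i - 1) / (2 * n)) ** 2)     # Cramer-von Mises
A2 = -n - np.mean((2 * i - 1) * (np.log(u) + np.log1p(-u[::-1])))  # Anderson-Darling

# Direct numerical versions: n * int (G_n(t) - t)^2 w(t) dt on a fine midpoint grid
t = (np.arange(400000) + 0.5) / 400000
Gn = np.searchsorted(u, t, side="right") / n
W2_num = n * np.mean((Gn - t) ** 2)
A2_num = n * np.mean((Gn - t) ** 2 / (t * (1.0 - t)))
print(W2, W2_num)   # closed form vs. numerical integral
print(A2, A2_num)
```

The integrand of $A_n^2$ vanishes linearly at both endpoints, so the midpoint rule behaves well despite the $1/(t(1-t))$ weight.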
For the use in practice of any of these appealing statistics we should be able to obtain the corresponding significance levels. Smirnov [91], using combinatorial techniques, obtained an explicit expression for the exact distribution of $D_n^+$. Kolmogorov [53] also gave an expression that enabled the tabulation of the distribution of $D_n$. Some more difficulties were found when dealing with the exact distributions of statistics of Cramér-von Mises type. But even in those cases in which some formula allowed one to compute the exact p-values, the interest in obtaining the asymptotic distribution of the test statistic was clear, for it would greatly decrease the computational effort needed to obtain the (approximate) p-values (and this was of crucial importance by the time these tests were proposed). The celebrated first asymptotic results about $D_n$ and $D_n^+$ are summarized in the following theorem:
Empirical and Quantile Processes 5
and
$$P\Big(\sup_{0\le t\le 1} B(t) > x\Big) = e^{-2x^2}. \qquad (2.4)$$
Theorem 2.3. If we consider $\alpha_n$ and $B$ as random elements taking values in $D[0,1]$, then
$$\alpha_n \xrightarrow{w} B.$$
Theorem 2.3 enabled one to rederive Theorem 2.2 in a very natural way. Notice that $D_n = \|\alpha_n\|_\infty$ and that the map $x \mapsto \|x\|_\infty$ is continuous for the Skorohod topology outside a set of $B$-measure zero. Thus, we can conclude that $D_n \xrightarrow{w} \|B\|_\infty$ and this, combined with (2.3), gives a proof of the first statement in Theorem 2.2. The same method works for $D_n^+$.
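The convergence $D_n \xrightarrow{w} \|B\|_\infty$ can be illustrated by simulation against the Kolmogorov distribution $P(\|B\|_\infty \le x) = 1 - 2\sum_{k\ge1}(-1)^{k-1}e^{-2k^2x^2}$. A rough Monte Carlo sketch (illustrative):

```python
import numpy as np

def kolmogorov_cdf(x, terms=100):
    """P(sup_t |B(t)| <= x), the Kolmogorov distribution."""
    k = np.arange(1, terms + 1)
    return 1.0 - 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * k**2 * x**2))

rng = np.random.default_rng(2)
n, reps = 500, 4000
u = rng.uniform(size=(reps, n))
u.sort(axis=1)
i = np.arange(1, n + 1)
# D_n = sqrt(n) sup_t |G_n(t) - t| for each replication:
d = np.sqrt(n) * np.maximum((i / n - u).max(axis=1), (u - (i - 1) / n).max(axis=1))

emp = np.mean(d <= 1.358)            # empirical P(D_n <= 1.358)
print(emp, kolmogorov_cdf(1.358))    # both should be close to 0.95
```

The value 1.358 is (approximately) the asymptotic 95% point of $\|B\|_\infty$; the finite-sample probability agrees with it up to Monte Carlo and $O(n^{-1/2})$ error.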
The use of the Skorohod space is not the only possibility to circumvent the difficulty posed by the nonmeasurability of the empirical process. A different approach to the problem could be based on the following scheme. If we could define, on a rich enough probability space, a sequence of i.i.d. r.v.s uniformly distributed on $(0,1)$ with associated empirical process $\alpha_n(t)$ and a Brownian bridge $B(t)$ such that
$$\sup_{0\le t\le 1} |\alpha_n(t) - B(t)| \xrightarrow{P} 0, \qquad (2.5)$$
then we would trivially obtain that, for any functional $H$ defined on $D[0,1]$ and continuous on $C[0,1]$, $H(\alpha_n) \xrightarrow{P} H(B)$, obtaining a new proof of Theorem 2.2. The study of results of type (2.5), generically known as strong approximations, began with the Skorohod embedding, consisting of imitating the partial sum process by
using a Brownian motion evaluated at random times (see [8]). Successive refinements of this idea became one of the most important methodologies in the research related to empirical processes.
Turning back to the applications of Theorem 2.3 in the asymptotic theory of tests of fit, we should note that the functional $x \mapsto \int_0^1 x(t)^2\,dt$ is also continuous for the Skorohod topology outside a set of $B$-measure zero. We can use this fact to obtain the asymptotic distribution of the Cramér-von Mises statistic. Namely,
$$W_n^2 \xrightarrow{w} \int_0^1 B(t)^2\,dt. \qquad (2.6)$$
$$\frac{1}{\sqrt{n}} \sum_{i=1}^n Y_i(t) \xrightarrow{w} Y(t)$$
if and only if $\int_0^1 E(Y_1(t))^2\,\psi(t)\,dt < \infty$ and, in that case, $Y$ is a Gaussian random element with the same covariance function as $Y_1$.
Therefore, if we set
$$Y_i(t) = I\{U_i \le t\} - t, \qquad i = 1, \dots, n,$$
then $\alpha_n(t) = \frac{1}{\sqrt{n}}\sum_{i=1}^n Y_i(t)$ and $Y_1(t)$ has the same covariance function as the Brownian bridge $B(t)$. Hence, $\alpha_n \xrightarrow{w} B$ in $L_2((0,1),\psi)$ if and only if $\int_0^1 t(1-t)\,\psi(t)\,dt < \infty$.
While the development of the theory of probability in Banach spaces provides this final result for quadratic statistics (and we will have a better example of this later in Chapter 6), the use of strong approximations produces a similar result for supremum norm statistics. Chibisov [12] and O'Reilly [70] used the Skorohod embedding and a special representation of the uniform empirical process in terms of a Poisson process (see, e.g., [86], p. 339) to obtain necessary and sufficient conditions for the weak convergence of the empirical process to the Brownian bridge in weighted uniform metrics. If $\psi$ is a positive function on $(0,1)$, nondecreasing in a neighborhood of 0 and nonincreasing in a neighborhood of 1, and we consider the norm given by $\|x\|_\psi = \sup_{0<t<1} |x(t)|/\psi(t)$ on $D[0,1]$, then $\alpha_n \xrightarrow{w} B$ in $\|\cdot\|_\psi$ norm (with the necessary modifications in the definition of weak convergence to avoid measurability problems) if and only if
$$\int_0^1 \frac{1}{t(1-t)} \exp\Big(-\frac{\epsilon\,\psi(t)^2}{t(1-t)}\Big)\,dt < \infty \qquad (2.8)$$
for every $\epsilon > 0$. An immediate corollary of the Chibisov-O'Reilly theorem is that (2.8) is a sufficient condition for ensuring the convergence
$$K_n(\psi) \xrightarrow{w} \sup_{0<t<1} \frac{|B(t)|}{\psi(t)}.$$
A modification of the so-called Hungarian construction, due to Komlós, Major and Tusnády [55], [56] and to Csörgő and Révész [21], was used in [17] to give the following final result for statistics of Kolmogorov-Smirnov type:
where $\{Y_j\}_j$ are i.i.d. $N(0,1)$ and $\lambda_j \ge 0$. This shows that the characteristic function of the limiting distribution of $W_n^2(\psi)$ can be written as
$$\phi(\theta) = \prod_{j=1}^{\infty} (1 - 2i\theta\lambda_j)^{-1/2}.$$
This can be used in some cases to find a useful expression of the limiting characteristic function and, hopefully, to find, via an inversion formula, exact expressions for the limiting distribution functions.
Example. For the Cramér-von Mises statistic, $\lambda_j = (j\pi)^{-2}$, $f_j(t) = \sqrt{2}\sin(j\pi t)$ and, therefore,
$$W_n^2 \xrightarrow{d} \omega^2 := \sum_{j=1}^{\infty} \frac{Y_j^2}{\pi^2 j^2}.$$
To see this, we observe that in this case equation (2.9) becomes
$$\lambda f(t) = \int_0^t s(1-t)f(s)\,ds + \int_t^1 t(1-s)f(s)\,ds,$$
from which, differentiating twice, we obtain
$$\lambda f''(t) = -f(t).$$
Noting that $f(0) = f(1) = 0$, we prove the above claim. It can be used to obtain that
$$E\big(e^{i\theta\omega^2}\big) = \sqrt{\frac{\sqrt{2i\theta}}{\sin\sqrt{2i\theta}}}.$$
This gives
$$P(\omega^2 > x) = \frac{1}{\pi} \sum_{j=1}^{\infty} (-1)^{j+1} \int_{((2j-1)\pi)^2}^{((2j)\pi)^2} \sqrt{\frac{-\sqrt{y}}{\sin\sqrt{y}}}\;\frac{e^{-xy/2}}{y}\,dy.$$
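Truncating the series gives a quick numerical check of the representation: since $EY_j^2 = 1$ and $\operatorname{Var} Y_j^2 = 2$, the limit satisfies $E\,\omega^2 = \sum_j (\pi j)^{-2} = 1/6$ and $\operatorname{Var}\omega^2 = 2\sum_j (\pi j)^{-4} = 1/45$. A simulation sketch (illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
J, reps = 200, 20000
Y = rng.standard_normal(size=(reps, J))
lam = 1.0 / (np.pi**2 * np.arange(1, J + 1) ** 2)  # eigenvalues (pi j)^{-2}
omega2 = (Y**2 * lam).sum(axis=1)                   # truncated series for omega^2

print(omega2.mean(), omega2.var())  # ~ 1/6 and ~ 1/45
```

The truncation bias of the mean is $\sum_{j>J}(\pi j)^{-2} \approx (\pi^2 J)^{-1}$, negligible here next to the Monte Carlo error.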
$$H_0: G = G_{\theta_0}; \qquad H_n: G = G_{\theta_n}, \quad \theta_n = \theta_0 + \delta/\sqrt{n};$$
$$\Delta(t) = \begin{cases} g(G^{-1}(t)) & \text{if } \theta_0 = 0,\ G_\theta = G(\cdot - \theta), \\ G^{-1}(t)\,g(G^{-1}(t)) & \text{if } \theta_0 = 1,\ G_\theta = G(\cdot/\theta). \end{cases}$$
Turning back to statistics of Cramér-von Mises type, we obtain
$$W_n^2(\psi) \xrightarrow{w} \int_0^1 (B(t) + \Delta(t))^2\,\psi(t)\,dt = \sum_{j=1}^{\infty} \lambda_j Y_j^2,$$
with $Y_j$ independent $N(\lambda_j^{-1/2}\langle\Delta, f_j\rangle, 1)$ r.v.s. This suggests that we define the principal component in direction $f_j$:
$$W_n^2(\psi) = \sum_{j=1}^{\infty} Y_{n,j}^2, \qquad Y_{n,j} := \int_0^1 \alpha_n(t) f_j(t)\,\psi(t)\,dt.$$
Under $H_0$, the $Y_{n,j}$ are approximately $N(0, \lambda_j)$ independent r.v.s; under local alternatives
$$Y_{n,j} \xrightarrow{d} N(\langle\Delta, f_j\rangle, \lambda_j).$$
Thus, $Y_{n,j}$ measures deviations in direction $f_j$ and tests based on $Y_{n,j}$ can be expected to be powerful against alternatives in direction $f_j$.
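In the Cramér-von Mises case ($\psi \equiv 1$, $f_j(t) = \sqrt{2}\sin(j\pi t)$) the components have the closed form $Y_{n,j} = \frac{\sqrt{2}}{j\pi}\,n^{-1/2}\sum_i \cos(j\pi U_i)$ (obtained by integrating $f_j$ explicitly), and Parseval's identity gives $W_n^2 = \sum_j Y_{n,j}^2$. A sketch checking this numerically (illustrative code):

```python
import numpy as np

rng = np.random.default_rng(5)
n, J = 100, 500
u = rng.uniform(size=n)
j = np.arange(1, J + 1)

# Principal components Y_{n,j} = int_0^1 alpha_n(t) f_j(t) dt in closed form:
Y = (np.sqrt(2) / (j * np.pi)) * np.cos(np.pi * np.outer(j, u)).sum(axis=1) / np.sqrt(n)

# Parseval: sum_j Y_{n,j}^2 should reproduce W_n^2 = int alpha_n(t)^2 dt
t = (np.arange(200000) + 0.5) / 200000
Gn = np.searchsorted(np.sort(u), t, side="right") / n
W2_num = n * np.mean((Gn - t) ** 2)
print(np.sum(Y**2), W2_num)   # close; the tail beyond J is O(1/J)
```

The agreement confirms both the closed form for $Y_{n,j}$ and the orthonormality of the sine system in $L_2(0,1)$.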
and $S_n^* = \max_{1\le i\le n} \|S_i\|$. The proofs of these results can be found in [27].
We begin with the Lévy maximal inequalities, bounding the tail probabilities of $S_n^*$ by tail probabilities of $\|S_n\|$.
Theorem 3.1. Let $X_i$ be independent symmetric r.v.s. Then
$$P\Big(\max_{1\le k\le n} \Big\|\sum_{i=1}^k X_i\Big\| > t\Big) \le 2\,\Pr\Big(\Big\|\sum_{i=1}^n X_i\Big\| > t\Big), \qquad t > 0.$$
The symmetry hypothesis can be dropped if the $X_i$ are i.i.d., at the price of worse constants:
Theorem 3.2. Let $X_i$ be i.i.d. r.v.s. Then
$$P\Big(\max_{1\le k\le n} \Big\|\sum_{i=1}^k X_i\Big\| > t\Big) \le 9\,\Pr\Big(\Big\|\sum_{i=1}^n X_i\Big\| > t/30\Big), \qquad t > 0.$$
The Hoffmann-Jørgensen inequalities bound moments of sums by the corresponding moment of a maximum plus a quantile of the sum:
Theorem 3.3. For each $p > 0$ there exist constants $K_p$, $c_p$ such that, if the $X_i$ are i.i.d. or independent and symmetric r.v.s, then
$$\|S_n\|_p \le K_p\Big[t_0 + \big\|\max_{i\le n}\|X_i\|\big\|_p\Big],$$
where $\|Y\|_p = (E\|Y\|^p)^{1/p}$ and
$$t_0 = \inf\{t > 0 : \Pr(\|S_n\| > t) \le c_p\}.$$
This inequality gives the following result on comparison of moments:
Theorem 3.4. For each $0 < q < p$ there exists a constant $K$ such that, if the $X_i$ are i.i.d. or independent and symmetric r.v.s, then
$$\|S_n\|_p \le K\Big[\|S_n\|_q + \big\|\max_{i\le n}\|X_i\|\big\|_p\Big].$$
The following randomization/symmetrization result by Rademacher variables (symmetric r.v.s taking values in $\{-1, 1\}$) shows that the above theorem is also valid for centered, independent r.v.s.
Theorem 3.5. Let $X_i$ be independent, centered r.v.s in $L_p$, $p \ge 1$, and let $\{\epsilon_i\}$ be independent Rademacher r.v.s, independent of the $X_i$. Then
$$2^{-p}\,E\Big\|\sum_{i=1}^n \epsilon_i X_i\Big\|^p \le E\Big\|\sum_{i=1}^n X_i\Big\|^p \le 2^p\,E\Big\|\sum_{i=1}^n \epsilon_i X_i\Big\|^p$$
and
$$E(S_n^*)^p \le 2^{p+1}\,E\Big\|\sum_{i=1}^n \epsilon_i X_i\Big\|^p.$$
Proof. The sufficiency part follows easily from Markov's inequality. For the necessity part we can assume, w.l.o.g., that $X \ge 0$. It is now convenient to write
$$Z(t) := I_{X>t}, \qquad Z_i(t) := I_{X_i>t}, \qquad i \in \mathbb{N},\ t \in \mathbb{R},$$
so that $Y(t) = Z(t) - EZ(t)$ and likewise for $Y_i$. The stochastic boundedness hypothesis simply asserts
$$\lim_{M\to\infty} \sup_n \Pr\Big(\Big\|\frac{1}{\sqrt{n}}\sum_{i=1}^n (Z_i - EZ_i)\Big\|_{L_1} > M\Big) = 0.$$
The Lévy type inequality for i.i.d. random vectors then implies that
$$\lim_{M\to\infty} \sup_n \Pr\Big(\frac{1}{\sqrt{n}} \max_{1\le i\le n} \|Z_i - EZ_i\|_{L_1} > M\Big) = 0,$$
or, equivalently,
$$\Lambda_{2,\infty}(\|Z - EZ\|_{L_1}) := \sup_{t>0}\, t^2\, \Pr\big(\|Z - EZ\|_{L_1} > t\big) < \infty.$$
where
$$t_{0,n} = \inf\Big\{t : \Pr\Big(\Big\|\frac{\sum_{i=1}^n (Z_i - EZ_i)}{\sqrt{n}}\Big\|_{L_1} > t\Big) \le c_2\Big\}.$$
On one hand, the stochastic boundedness hypothesis implies $\sup_n t_{0,n} < \infty$, and on the other, inequality (3.2) asserts the finiteness of the sup over $n$ of the first summand on the right-hand side of the last display for $r = 1$. We thus conclude that
$$\sup_n E\Big\|\frac{\sum_{i=1}^n (Z_i - EZ_i)}{\sqrt{n}}\Big\|_{L_1} < \infty. \qquad (3.3)$$
If now $\beta$ is a binomial $(n, p)$ r.v., then there exist positive finite constants $C_1$ and $C_2$ such that
$$\mathcal{L}(\beta) = \mathrm{Bin}(n, p) \text{ with } \frac{C_1}{n} \le p \le \frac{1}{2} \text{ implies } E|\beta - E\beta| \ge C_2\sqrt{np}$$
(this follows, for instance, from symmetrization and Corollary 3.4 in Giné and Zinn, 1983). Applying this to the empirical process yields
$$\Pr\{X > t\} \le \frac{1}{C_2}\, E\Big|\frac{\sum_{i=1}^n (I_{X_i>t} - \Pr\{X_i > t\})}{\sqrt{n}}\Big| \qquad \text{for } \operatorname{med}(X) < t < Q(1 - C_1/n).$$
Integrating and applying inequality (3.3), we obtain
$$\sup_n \int_{\operatorname{med}(X)}^{Q(1-C_1/n)} \Pr\{X > t\}\,dt \le \frac{1}{C_2} \sup_n E\int_{\operatorname{med}(X)}^{Q(1-C_1/n)} \Big|\frac{\sum_{i=1}^n (I_{X_i>t} - \Pr\{X_i>t\})}{\sqrt{n}}\Big|\,dt < \infty.$$
Since $Q(1 - C_1/n) \to \operatorname{ess\,sup} X$ as $n \to \infty$, this last inequality gives
$$\int_0^\infty \Pr\{X > t\}\,dt < \infty,$$
$$P(M_n) \le \sum_k (b_k^r - b_{k+1}^r)\,E|S_k|^r = \sum_k b_k^r\,\big[E|S_k|^r - E|S_{k-1}|^r\big].$$
$$\sum_{k=1}^n \sum_{j=k}^n (b_j - b_{j+1}) \int_{A_k} |S_j|\,dP = \sum_{j=1}^n (b_j - b_{j+1}) \int_{M_j} |S_j|\,dP \le \sum_{k=1}^n (b_k - b_{k+1})\,E|S_k|.$$
Lemma 3.8 (Birnbaum and Marshall). Let $\{|S_t|, \mathcal{F}_t\}_{0\le t\le\tau}$ be a submartingale with right-continuous sample paths. Assume $S(0) = 0$ and $\mu(t) = ES^2(t) < \infty$ on $[0,\tau]$. Let $q > 0$ be nondecreasing and right-continuous on $[0,\tau]$. Then
$$P\Big(\sup_{0\le t\le\tau} \frac{|S(t)|}{q(t)} \ge 1\Big) \le \int_0^\tau \frac{1}{q(t)^2}\,d\mu(t).$$
$$\ge 1 - \lim_n \sum_{i=1}^{2^n} \frac{E\big(S^2(i/2^n) - S^2((i-1)/2^n)\big)}{q^2(i/2^n)} = 1 - \int_0^\tau \frac{1}{q(t)^2}\,d\mu(t).$$
3.2. The Central Limit Theorem
The classical central limit problem consists in studying conditions for tightness and weak convergence of sequences of laws of sums of small independent random variables and determining their limits. The theory is built around two main results, the Gaussian and the Poisson convergence theorems. Here we collect the main facts on this central limit problem, both for real and for Banach valued random variables, and refer to [3] for a complete treatment of the subject.
If $\{X_n\}_{n=1}^{\infty}$ are i.i.d. centered r.v.s with unit variance, then the Lévy-Lindeberg CLT states that
$$\frac{X_1 + \cdots + X_n}{\sqrt{n}} \xrightarrow{w} N(0,1).$$
Without the finite variance assumption we have, for instance, that, if $\{X_n\}_{n=1}^{\infty}$ are i.i.d. Cauchy, then
$$\frac{X_1 + \cdots + X_n}{n} \stackrel{d}{=} X_1$$
and we get, trivially, a different limit distribution.
More generally, we could consider the limiting distributions of sums of triangular arrays of row-wise independent r.v.s; that is, we consider r.v.s $\{X_{n,k} : n \in \mathbb{N},\ 1 \le k \le k_n\}$, where $X_{n,1}, \dots, X_{n,k_n}$ are independent for each $n$, and call $S_n = \sum_{k=1}^{k_n} X_{n,k}$. If we now consider, for instance, i.i.d. $X_{n,k}$, $k = 1, \dots, n$, having Bernoulli distribution with parameter $p_n$ such that $np_n \to \lambda \in (0, \infty)$, we have that $S_n$ converges weakly to the Poisson distribution with mean $\lambda$.
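The Bernoulli-array example can be checked exactly, without simulation, by comparing binomial probabilities with their Poisson limit; the pointwise gap is of order $\lambda^2/n$. A sketch (parameters illustrative):

```python
from math import comb, exp, factorial

lam, n = 3.0, 2000
p = lam / n
# Exact Bin(n, p) probabilities versus the Poisson(lam) limit:
binom = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(11)]
poiss = [exp(-lam) * lam**k / factorial(k) for k in range(11)]
err = max(abs(b - q) for b, q in zip(binom, poiss))
print(err)   # small: the classical bound gives total variation <= lam^2 / n
```

Here $\lambda^2/n = 0.0045$, and the observed pointwise discrepancy sits below that bound.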
Obviously, without some restriction on the relative weight of the summands in $S_n$, any kind of (trivial) limit distribution can result. The infinitesimality condition ensures that this is not the case. We say that the triangular array $\{X_{n,k} : n \in \mathbb{N},\ 1 \le k \le k_n\}$ of row-wise independent random variables is infinitesimal if, for every $\epsilon > 0$,
$$\max_{1\le k\le k_n} P(|X_{n,k}| > \epsilon) \to 0, \qquad \text{as } n \to \infty.$$
It turns out that, under infinitesimality, the class of possible limit laws of $S_n - a_n$, where $a_n$ are possibly needed centering constants, is the class of the so-called infinitely divisible laws. A law $\mu$ is said to be infinitely divisible if, for every $n$, it can be expressed as an $n$th convolution power: $\mu = \mu_n^{*n)}$ (that is, $\mu$ is the law of the sum of $n$ i.i.d. r.v.s). Infinitely divisible laws can be characterized as the convolution of a Gaussian law and a Poissonization of a Lévy measure:
$$\mu = N(m, \sigma^2) * c\,\mathrm{Pois}\,\nu.$$
If $\nu$ is a finite measure on $\mathbb{R}$, then the associated Poissonization is
$$\mathrm{Pois}\,\nu := e^{-\nu(\mathbb{R})} \sum_{n=0}^{\infty} \frac{\nu^{*n)}}{n!}.$$
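The Poissonization is the law of a compound Poisson sum: draw $N \sim \mathrm{Poisson}(\nu(\mathbb{R}))$ and add $N$ i.i.d. variables with law $\nu/\nu(\mathbb{R})$; in particular its mean is $\int x\,d\nu(x)$. A simulation sketch with the illustrative choice $\nu = 2\cdot\mathrm{Exp}(1)$, so that $\nu(\mathbb{R}) = 2$ and $\int x\,d\nu = 2$:

```python
import numpy as np

rng = np.random.default_rng(7)
mass, reps = 2.0, 200000           # nu(R) = 2; normalized nu is Exp(1)
N = rng.poisson(mass, size=reps)   # number of summands per replication
# Given N = k >= 1, a sum of k i.i.d. Exp(1) draws is Gamma(k, 1):
samples = np.where(N > 0, rng.gamma(np.maximum(N, 1), 1.0), 0.0)

print(samples.mean())  # ~ int x dnu(x) = 2
```

The gamma shortcut just avoids an explicit inner loop; it is distributionally identical to summing $N$ exponential draws.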
The only possible constants $a_n$ in the above expression are of the type $a_n = n^{1/\alpha}$ for some $\alpha \in (0, 2]$. This $\alpha$ is called the stability index of the stable law $\mu$. The case $\alpha = 2$ corresponds to the Gaussian law. A stable law with index $\alpha < 2$ is an infinitely divisible law without normal part and with Lévy measure $\nu(c_1, c_2; \alpha)$ defined by
$$d\nu(c_1, c_2; \alpha)(x) = \begin{cases} c_1\,x^{-\alpha-1}\,dx & \text{for } x > 0 \\ c_2\,|x|^{-\alpha-1}\,dx & \text{for } x < 0 \end{cases}$$
for some $c_1, c_2 \ge 0$.
Assume $\{X_n\}_n$ is a sequence of i.i.d. r.v.s having law $\mu$. Then we say that $\mu$ belongs to the domain of attraction of the stable law $\tilde\mu$ if there exist constants $a_n > 0$, $b_n \in \mathbb{R}$ such that
$$\frac{X_1 + \cdots + X_n}{a_n} - b_n \xrightarrow{w} \tilde\mu.$$
Stable laws are the only laws having a nonvoid domain of attraction. Domains of attraction can be characterized in terms of regular variation of the tails and the truncated variances of a law. A function $L$ is regularly varying (at $\infty$) with exponent $\rho \in \mathbb{R}$ if
$$\lim_{t\to\infty} \frac{L(tx)}{L(t)} = x^{\rho}.$$
We say that it is slowly varying if the above exponent equals 0.
Under infinitesimality, the class of possible weak limits of $\mathcal{L}(S_n - a_n)$ is, still in this new setup, the class of infinitely divisible laws (the probability measures expressible as $n$th convolution powers for every $n$). Infinitely divisible laws, $\mu$, on $B$ can be characterized as convolutions of Gaussian measures and Poissonizations of Lévy measures:
$$\mu = \gamma * c\,\mathrm{Pois}\,\nu.$$
The characterization of Lévy measures, though, is not so straightforward as on the real line. A Lévy measure on $B$ is a $\sigma$-finite positive measure, $\nu$, for which there exist $\delta > 0$ and a probability measure $\mu$ having characteristic functional
$$\hat\mu(f) = \exp\Big(\int \big(e^{if(x)} - 1 - if(x)\,I\{\|x\| \le \delta\}\big)\,d\nu(x)\Big)$$
for all $f \in B^*$ (here $B^*$ denotes the topological dual of $B$). In this case we define $c\,\mathrm{Pois}\,\nu := \mu$. Integrability of $\min(1, \|x\|^2)$ is neither a necessary nor a sufficient condition for a measure to be a Lévy measure on a general separable Banach space.
The following result is a general CLT in separable Banach spaces.
Theorem 3.11 (CLT in separable Banach spaces). Let $\{X_{n,k}\}$ be infinitesimal. Then $\mathcal{L}(S_n - a_n)$ is weakly convergent iff
Then
(a) $\nu$ is a Lévy measure and there exists a centered Gaussian p.m. $\gamma$ such that
$$\sigma^2(f) = \int f^2\,d\gamma, \qquad f \in B^*.$$
(b) For every $\delta > 0$ such that $\nu(\partial B_\delta) = 0$,
$$\mathcal{L}(S_n - ES_{n,\delta}) \xrightarrow{w} \gamma * c\,\mathrm{Pois}\,\nu.$$
w
In the particular case of a separable Hilbert space (H, , ) the CLT can be
slightly simplied. In Hilbert space Levy measures are positive Borel measures
integrating min(1,
x
2 ). Condition (iii) in Theorem 3.11 can be replaced by
(iii ) There exists a c.o.n.s {i }i1 in H such that, for some (all)
> 0,
n
k
Theorem 3.14 is a deep result with a long, difficult proof. It has, though, many important consequences for important processes in Statistics. A first, direct result in this line can be obtained for the partial sum process:
$$S^{(n)}(t) := \frac{1}{\sqrt{n}} \sum_{k=1}^{[nt]} X_k, \qquad 0 \le t \le 1.$$
We will assume that $\{W(t)\}_{t\ge0}$ is a Brownian motion and $W^{(n)}(t) = \frac{1}{\sqrt{n}} W(nt)$ (observe that $W^{(n)}(t)$ is itself a Brownian motion).
Theorem 3.15. We can define, on a sufficiently rich probability space, versions of $\{X_n\}_n$ and $\{W(t)\}_{t\ge0}$ such that
$$P\Big(\sqrt{n} \sup_{0\le t\le1} \big|S^{(n)}(t) - W^{(n)}(t)\big| > C\log n + x\Big) \le Ke^{-\lambda x},$$
Now
$$P\Big(\sqrt{n} \max_{1\le k\le n} \Big|u_n\Big(\tfrac{k}{n}\Big) - B_n\Big(\tfrac{k}{n}\Big)\Big| > A\log n + t\Big) \le 2P\Big(\sqrt{n} \sup_{0\le t\le1} \big|S^{(n)}(t) - W^{(n)}(t)\big| > \frac{A\log n + t}{5}\Big)$$
$$+\; P\Big((1 + \epsilon_n)^{n+1} > \frac{A\log n + t}{5}\Big) + 2P\Big(\sup_{1\le k\le n} |S_k - k| > \frac{(A\log n + t)\,n^{1/2}}{5}\Big) + 2P\Big(|\epsilon_n| > \frac{A\log n + t}{5\,n^{1/2}}\Big).$$
Using Theorem 3.15 and some other standard techniques we can give exponential bounds for all terms on the right-hand side of this last inequality to conclude (see [21] for details):
Theorem 3.16. We can define, on a sufficiently rich probability space, versions of $\{U_n\}_n$ and $\{W(t)\}_{t\ge0}$ such that for every $n \ge 1$ and $|x| \le c\sqrt{n}$,
$$P\Big(\sqrt{n} \sup_{0\le t\le1} |u_n(t) - B_n(t)| > C\log n + x\Big) \le Ke^{-\lambda x},$$
Komlós, Major and Tusnády [55, 56] also gave a construction, similar to the one in Theorem 3.14, for the uniform empirical process:
Theorem 3.17. We can define, on a sufficiently rich probability space, a sequence of Brownian bridges $\{B_n(t) : 0 \le t \le 1\}$ such that
$$P\Big(\sup_{0\le t\le1} |\alpha_n(t) - B_n(t)| > n^{-1/2}(x + C\log n)\Big) \le K\exp(-\lambda x) \qquad (3.5)$$
$$P\Big(\sup_{1-d/n\le t\le1} |u_n(t) - B_n(t)| > n^{-1/2}(x + C\log d)\Big) \le K\exp(-\lambda x) \qquad (3.9)$$
and
$$n^{1/2} \sup_{\frac{\lambda}{n}\le t\le 1-\frac{\lambda}{n}} \frac{|\alpha_n(t) - B_n(t)|}{(t(1-t))^{\nu}} = O_P(1). \qquad (3.11)$$
Proof. We use the construction in Theorem 3.18. (3.10) follows from (3.6) (taking $d = n$ and $x = (2/\lambda)\log n$) and the Borel-Cantelli lemma:
$$P\Big(\sup_{0\le t\le1} |\alpha_n(t) - B_n(t)| > n^{-1/2}\log n\,(2/\lambda + C)\Big) \le \frac{1}{n^2}.$$
$$\beta^{(2)}_{n,\nu} := n^{1/2} \sup_{0\le t\le 1-\frac{1}{n}} \frac{|\alpha_n(t) - B_n(t)|}{(1-t)^{\nu}}.$$
It is enough to prove that $\beta^{(i)}_{n,\nu} = O_P(1)$, $i = 1, 2$. By symmetry it reduces to showing that $\beta^{(1)}_{n,\nu} = O_P(1)$. Now, for $e < d < \infty$ let $d_i = id$, $i = 1, 2, \dots, i_n - 1$, and $d_{i_n} = n$, where $i_n = \max\{i : d_{i-1} \le n\}$. Set $I_1 = [1/n, d_1/n]$, $I_i = [d_{i-1}/n, d_i/n]$, $i = 2, \dots, i_n$, and $\eta_{n,i} = \sup\{|\alpha_n(t) - B_n(t)| : t \in [0, d_i/n]\}$, $i = 2, \dots, i_n$. Now
$$P\big(\beta^{(1)}_{n,\nu} > (C+1)\log d\big) \le P\big(\eta_{n,1} > n^{-1/2}(C+1)\log d\big) + \sum_{i=2}^{i_n} P\big(\eta_{n,i} > n^{-1/2}(C+1)d_{i-1}\big) =: \sum_{i=1}^{i_n} P_{i,n}(d).$$
Hence, using (3.6) and the fact that $d_{i-1} \ge \log d_i$ for $d$ large enough, we have, for large $d$, $P_{1,n}(d) \le K\exp(-\lambda\log d) = Kd^{-\lambda}$ and $P_{i,n}(d) \le K\exp(-\lambda d_{i-1}) \le K((i-1)d)^{-2}$. Thus
$$\sum_{i=1}^{i_n} P_{i,n}(d) \le K\Big(\frac{1}{d^{\lambda}} + \frac{1}{d^2}\sum_{i=1}^{\infty} \frac{1}{i^2}\Big),$$
which can be made arbitrarily small by taking $d$ large enough. This shows $\beta^{(1)}_{n,\nu} = O_P(1)$ for $\lambda = 1$. To prove this for general $\lambda$ it suffices to cover the range $[\lambda/n, 1/n]$ for fixed $0 < \lambda < 1$, but
$$n^{1/2} \sup_{\frac{\lambda}{n}\le t\le\frac{1}{n}} |\alpha_n(t)| \le \big(nG_n(1/n) + 1\big) = O_P(1).$$
The proof of the corresponding result for the quantile process is similar to
the proof of Theorem 3.20 and, hence, omitted.
and
$$n^{1/2} \sup_{\frac{\lambda}{n}\le t\le 1-\frac{\lambda}{n}} \frac{|u_n(t) - B_n(t)|}{(t(1-t))^{\nu}} = O_P(1) \qquad (3.15)$$
does not require integrability of $1/q^2$. We will see next how we can characterize finiteness of $\|B/q\|$.
Lemma 3.25. The following statements are equivalent:
(i) $\limsup_{t\to0} |W(t)|/q(t) < \infty$ a.s.
(ii) $\limsup_{t\to0} |B(t)|/q(t) < \infty$ a.s.
(iii) $\limsup_{t\to0} |W(t)|/q(t) = \gamma$ a.s. for some $0 \le \gamma < \infty$
(iv) $\limsup_{t\to0} |B(t)|/q(t) = \gamma$ a.s. for some $0 \le \gamma < \infty$
Proof. It follows easily from Blumenthal's 0-1 law.
$$\mathcal{F}C_0 := \Big\{q : \inf_{\delta\le t\le1/2} q(t) > 0 \text{ for all } \delta > 0,\ q \text{ nondecreasing in a neighbourhood of } 0\Big\},$$
$$E(q, c) = \int_0^{1/2} s^{-3/2}\,q(s)\,\exp(-c\,q^2(s)/s)\,ds,$$
$$I(q, c) = \int_0^{1/2} \frac{1}{s}\,\exp(-c\,q^2(s)/s)\,ds.$$
Since $\gamma$ can be taken arbitrarily close to 1, this proves our claim. It remains to be shown that, if $\limsup_{t\to0} |W(t)|/q(t) = \gamma$ for some $0 \le \gamma < \infty$, then $I(q, c) < \infty$ for any $c > \gamma^2/2$. The proof can be found in [19], pp. 181-188.
We summarize the consequences of the above results for the finiteness of $\|B/q\|$ in the following corollary. Now $\mathcal{F}C_{0,1}$ is the class of functions nondecreasing in a neighborhood of 0, nonincreasing in a neighborhood of 1 and such that
$$\inf_{\delta<x<1-\delta} q(x) > 0, \qquad \delta \in (0, 1/2).$$
Corollary 3.29. Assume $q \in \mathcal{F}C_{0,1}$; then the following statements are equivalent:
(i) $\sup_{0<t<1} |B(t)|/q(t) < \infty$ a.s.
(ii) For some $c > 0$,
$$I(q, c) := \int_0^1 \frac{1}{s(1-s)}\,\exp\big(-c\,q^2(s)/(s(1-s))\big)\,ds < \infty.$$
(a similar result also holds for the upper tail). Otherwise we can take a sequence $t_k \downarrow 0$ such that $k\,t_k \to \tau \in (0, \infty)$ and $\lim_k q(t_k)/t_k^{1/2} = \gamma < \infty$. Then,
$$\lim_k P\big(|B_k(t_k) + k^{1/2} t_k|/q(t_k) < \epsilon\big) = \Phi\big(\epsilon\gamma - \tau^{1/2}\big) - \Phi\big(-\epsilon\gamma - \tau^{1/2}\big).$$
(i) $$\sup_{0<t<1} \frac{|\alpha_n(t)|}{q(t)} \xrightarrow{w} \sup_{0<t<1} \frac{|B(t)|}{q(t)}$$
(ii) For some $c > 0$,
$$\tilde I(q, c) := \int_0^1 \frac{1}{s(1-s)}\,\exp\big(-c\,q^2(s)/(s(1-s))\big)\,ds < \infty.$$
Proof. Finiteness of the limit in (i) implies (ii). Thus, it suffices to show that (ii) implies (i). Under (ii) we have that $\lim_{t\to0} q(t)/t^{1/2} = \lim_{t\to1} q(t)/(1-t)^{1/2} = \infty$ and from this we get
$$\sup_{0<t<1} \frac{|\alpha_n(t)|}{q(t)} = \sup_{U_{1:n}<t<U_{n:n}} \frac{|B_n(t)|}{q(t)} + o_P(1).$$
Other tests of normality are the u-test [25], based on the ratio between the range and the standard deviation of the sample, and the a-test [45], which studies the ratio of the sample mean deviation to the standard deviation. These tests are broadly considered as not being too powerful against a wide range of alternatives (although it is known that the u-test has good power against alternatives with light tails [85]; in fact, see [99, 100], the u-test is the most powerful against the uniform distribution, while the a-test is the most powerful against the double exponential distribution).
For these reasons, some other tests, focusing on features that characterize completely (or, at least, more completely) the family under consideration, have been proposed. These tests can be divided, broadly speaking, into three categories. A first, more general, category consists of tests that adapt other tests devised in the fixed-distribution setup. When we specialize to location-scale families, new types of tests, which try to take advantage of the particular structure of $\mathcal{F}$, can be employed. Tests based on the analysis of probability plots, usually referred to as correlation and regression tests, lie in this class. A third category, whose representatives combine some of the most interesting features exhibited by goodness-of-fit tests in the first two categories, is composed of tests based on a suitable $L_2$-distance between the empirical quantile function and the quantile functions of the distributions in $\mathcal{F}$, the so-called Wasserstein distance.
Tests based on the Wasserstein distance are related to tests in the first category in the sense that all of them depend on functional distances. On the other hand, the study of Wasserstein tests gives some hints about several properties of the probability-plot tests. Both facts have led us to present them separately. Our approach will try to show that tests based on the Wasserstein distance provide the right setup for applying the empirical and quantile process theory to the study of probability plot-based tests.
from the grouped data $(O_1, \dots, O_k)$, then $\chi^2$ has asymptotic $\chi^2_{k-d-1}$ distribution (see, e.g., [13] for a detailed review of Pearson's and Fisher's contributions).
Fisher also observed that estimating $\theta$ from the grouped data instead of using the complete sample (e.g., estimating from the complete likelihood) could produce a loss of information resulting in lack of power. Further, estimating from the original data is often computationally simpler. Fisher studied the asymptotic distribution of $\chi^2$ when $\theta$ is unidimensional and $\hat\theta$ is its maximum likelihood estimator from the ungrouped data. His result was extended by Chernoff and Lehmann in [11] to a general $d$-dimensional parameter, showing that, under regularity conditions (essentially conditions to ensure the consistency and asymptotic normality of the maximum likelihood estimator),
$$\chi^2 \xrightarrow{w} \sum_{j=1}^{k-d-1} Y_j^2 + \sum_{j=k-d}^{k-1} \lambda_j Y_j^2, \qquad (4.1)$$
where the $Y_j$ are i.i.d. standard normal r.v.s and the $\lambda_i \in [0,1]$ may depend on the parameter $\theta$. This dependence is a serious drawback for the use of $\chi^2$ for testing fit to some families of distributions, the normal family being one of them (see [11]).
The practical use of $\chi^2$ for testing fit presented another difficulty: the choice of cells. The asymptotic $\chi^2_{k-1}$ distribution of Pearson's statistic was a consequence of the asymptotic normality of the cell frequencies. A cell with a very low expected frequency would cause a very slow convergence to normality, and this could result in a poor approximation of the distribution of $\chi^2$. This (somewhat oversimplifying) observation led to the diffusion of rules of thumb such as "use cells with at least 10 observations". Hence, combining neighboring cells with few observations became a common practice (see, e.g., [13]).
From a more theoretical point of view, in the setup of testing fit to a fixed distribution, Mann and Wald [63] and Gumbel [47] suggested using equally likely intervals under the null hypothesis as a reasonable way to reduce the arbitrariness in the choice of cells (this choice offers some good properties; for instance, it makes the $\chi^2$ test unbiased, see, e.g., [14]). Trying to adapt this idea to the case of testing fit to parametric families poses the problem that different distributions in the null hypothesis lead to different partitions into equiprobable cells. A natural solution to this problem is choosing as cells equally likely intervals under $F(\cdot, \hat\theta)$, where $\hat\theta$ is some suitable estimator of $\theta$. A consequence of this procedure is that, again, the cells are chosen at random.
Allowing the cells to be chosen at random introduces a deep modification of the statistical structure of $\chi^2$ because the distribution of the random vector $(O_1, \dots, O_k)$ is no longer multinomial; remarkably, however, it can, in some important cases, eliminate the dependence on the parameter of the asymptotic distribution in (4.1). Watson ([104], [105]) noted that if $\hat\theta$ is the maximum likelihood estimator of $\theta$ (from the ungrouped data) and cell $j$ has boundaries $F^{-1}(\frac{j-1}{k}, \hat\theta)$ and $F^{-1}(\frac{j}{k}, \hat\theta)$, then the convergence in (4.1) remains true. Further, if $F$ is a location-scale family, then the $\lambda_i$'s do not depend on $\theta$, but only on the family $F$.
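Watson's recipe is straightforward to implement for the normal family: estimate $(\mu, \sigma)$ by maximum likelihood, cut the line at the fitted quantiles $F^{-1}(j/k; \hat\mu, \hat\sigma)$, and form Pearson's statistic with expected count $n/k$ in every (random) cell. A sketch (all code and names illustrative, not from the text):

```python
import numpy as np
from statistics import NormalDist

def chi2_random_cells(x, k=10):
    """Pearson chi-square with k equiprobable cells under the fitted normal."""
    n = len(x)
    mu, sigma = x.mean(), x.std()   # ML estimates for the normal family
    # Cell boundaries at the fitted quantiles F^{-1}(j/k; mu, sigma):
    edges = [NormalDist(mu, sigma).inv_cdf(j / k) for j in range(1, k)]
    counts = np.bincount(np.searchsorted(edges, x), minlength=k)
    return ((counts - n / k) ** 2 / (n / k)).sum(), counts

rng = np.random.default_rng(8)
x = rng.normal(loc=1.0, scale=2.0, size=500)
chi2, counts = chi2_random_cells(x)
print(chi2, counts.sum())
```

Under the null, the statistic falls (asymptotically) between a $\chi^2_{k-3}$ and a $\chi^2_{k-1}$, per the Chernoff-Lehmann result quoted above.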
and
$$K_n(\psi) = \sqrt{n}\,\sup_{-\infty<x<\infty} \frac{|F_n(x) - F(x; \hat\theta_n)|}{\psi(F(x; \hat\theta_n))},$$
and use them as test statistics, rejecting the null hypothesis when large values of $W_n^2(\psi)$ or $K_n(\psi)$ are observed. However, it took a long time until these statistics were considered serious competitors to the $\chi^2$-test; little was known about these versions of the Cramér-von Mises or Kolmogorov-Smirnov tests until the 50's (see, e.g., [13]).
The property exhibited by $W_n^2$ and $K_n$ of being distribution-free does not carry over to $W_n^2(\psi)$ or $K_n(\psi)$. If we set $Z_i = F(X_i; \hat\theta_n)$ and $G_n(t)$ denotes the empirical d.f. associated to $Z_1, \dots, Z_n$ then, obviously,
$$W_n^2(\psi) = n \int_0^1 \psi(t)(G_n(t) - t)^2\,dt \qquad (4.2)$$
$$K_n(\psi) = \sqrt{n}\,\sup_{0<t<1} \frac{|G_n(t) - t|}{\psi(t)} \qquad (4.3)$$
but, unlike in the fixed distribution case, $Z_1, \dots, Z_n$ are not i.i.d. uniform r.v.s.
However, in some important cases the distribution of $Z_1, \dots, Z_n$ does not depend on $\theta$, but only on $F$. In those cases, the distribution of $W_n^2(\psi)$ or $K_n(\psi)$ is parameter free. This happens if $F$ is a location-scale family and $\hat\theta_n$ is an equivariant estimator, a fact noted by David and Johnson [26]. Therefore $W_n^2(\psi)$ or $K_n(\psi)$ can be used in a straightforward way as test statistics in this situation. Lilliefors
[59] took advantage of this property and, from a simulation study, constructed his popular table for using the Kolmogorov-Smirnov statistic when testing normality.
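Lilliefors' table is easy to reproduce: by equivariance, the null law of $D_n = \sup_x |F_n(x) - \Phi((x - \bar X)/s)|$ does not depend on $(\mu, \sigma)$, so critical values can be obtained by Monte Carlo from standard normal samples. A rough sketch (illustrative; the replication count here is far smaller than in a serious tabulation):

```python
import numpy as np
from statistics import NormalDist

def lilliefors_stat(x):
    """sup_x |F_n(x) - Phi((x - xbar)/s)| with estimated mean and sd."""
    n = len(x)
    z = np.sort((x - x.mean()) / x.std(ddof=1))
    u = np.array([NormalDist().cdf(v) for v in z])
    i = np.arange(1, n + 1)
    return max(np.max(i / n - u), np.max(u - (i - 1) / n))

rng = np.random.default_rng(9)
n, reps = 20, 2000
stats = [lilliefors_stat(rng.normal(size=n)) for _ in range(reps)]
crit = float(np.quantile(stats, 0.95))  # parameter-free by equivariance
print(crit)   # around 0.19 for n = 20 (cf. Lilliefors' published table)
```

Note that these critical values are markedly smaller than the ones for the fixed-distribution Kolmogorov-Smirnov test, precisely because the parameters are fitted to the sample.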
The first attempt to derive the asymptotic distribution of any statistic of $W_n^2(\psi)$ or $K_n(\psi)$ type was due to Darling [24]. His study concerned the Cramér-von Mises statistic
$$W_n^2 = n \int_{-\infty}^{\infty} \big(F_n(x) - F(x; \hat\theta_n)\big)^2\,dF(x; \hat\theta_n) = n \int_0^1 (G_n(t) - t)^2\,dt, \qquad (4.4)$$
nite-dimensional distributions of { n(Gn (t) t)}t converge weakly to those of a
centered Gaussian process Z(t) with covariance function
1 1 1 1 (s) 1 (t)
K(s, t) = s t st , (4.7)
(1 (s)) (1 (t)) 2 (1 (s)) (1 (t))
where $\varphi$ denotes the density function of the standard normal distribution, $\Phi$ its d.f. and $\Phi^{-1}$ is the quantile inverse of $\Phi$ (notice that the difference between Darling's result and (4.7) is the introduction of an extra term corresponding to the second parameter to be estimated). Although they did not prove weak convergence of the estimated empirical process itself, they used this result (combined with a particular invariance result due to Kac) to conclude that $W_n^2 \to_w \int_0^1 (Z(t))^2\,dt$, providing thus the asymptotic distribution of the Cramér-von Mises test of normality.
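The covariance (4.7) can be checked numerically: simulate normal samples, transform by the fitted normal d.f., and compare the empirical covariance of $\sqrt{n}(G_n(t)-t)$ with $K(s,t)$. A sketch under assumed simulation parameters (grid points, sample size, replication count):

```python
import numpy as np
from math import erf, exp, pi, sqrt
from statistics import NormalDist

Phi = np.vectorize(lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0))))  # standard normal d.f.
phi = lambda q: exp(-q * q / 2.0) / sqrt(2.0 * pi)              # its density
inv = NormalDist().inv_cdf                                      # quantile function

def K(s, t):
    """Covariance (4.7) of the limiting process for testing normality."""
    qs, qt = inv(s), inv(t)
    return min(s, t) - s * t - phi(qs) * phi(qt) \
        - 0.5 * qs * qt * phi(qs) * phi(qt)

rng = np.random.default_rng(1)
n, reps, grid = 200, 3000, [0.25, 0.5, 0.75]
vals = np.empty((reps, len(grid)))
for r in range(reps):
    x = rng.normal(size=n)
    z = Phi((x - x.mean()) / x.std(ddof=1))   # Z_i = F(X_i; estimated parameters)
    vals[r] = [sqrt(n) * (np.mean(z <= t) - t) for t in grid]
emp = np.cov(vals, rowvar=False)
print(abs(emp[0, 1] - K(0.25, 0.5)) < 0.03)
```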
4.2. The empirical process with estimated parameters
A general study of the weak convergence of the estimated empirical process was
carried out by Durbin [39]. We present here an approach to his main results using
strong approximations. We will assume F is a parametric family,
F = {F (, ), },
where is some open set in R . The empirical process with estimated parameters is
k
nn (x) = n(Fn (x) F (x, n ), x R,
where n is a sequence of estimators.
We will assume this sequence to be efficient in the sense that
$$\sqrt{n}(\hat\theta_n - \theta) = \frac{1}{\sqrt{n}}\sum_{i=1}^n l(X_i;\theta) + o_P(1),$$
where $l(X_1;\theta)$ is centered and has finite second moments.
we obtain, from the Law of Large Numbers, that $\frac{1}{n}v''(\theta) \to -I(\theta)$ a.s. Now, a Taylor expansion of $v'$ around $\theta$ gives
$$\frac{1}{\sqrt{n}}\bigl(v'(\hat\theta_n) - v'(\theta)\bigr) = \frac{1}{n}v''(\theta)\,\sqrt{n}(\hat\theta_n-\theta) + o_P(1) = -I(\theta)\,\sqrt{n}(\hat\theta_n-\theta) + o_P(1),$$
hence
$$\sqrt{n}(\hat\theta_n - \theta) = \frac{1}{\sqrt{n}}\sum_{i=1}^n l(X_i,\theta) + o_P(1),$$
with $l(x,\theta) = I(\theta)^{-1}\nabla_\theta\log f(x,\theta)$. Clearly $\int l(x,\theta)\,dF(x,\theta) = 0$, while
$$\int l(x,\theta)\,l(x,\theta)^T\,dF(x,\theta) = I(\theta)^{-1}I(\theta)I(\theta)^{-1} = I(\theta)^{-1}.$$
4.2.1. Some notes on stochastic integration. Equation (4.8) suggests that $\hat\alpha_n(t) \to_w B(t) - H(t,\theta)^T\int_0^1 L(s,\theta)\,dB(s)$, where $B$ is a Brownian bridge. We cannot give $\int_0^1 L(s,\theta)\,dB(s)$ the meaning of a Stieltjes integral since the trajectories of $B$ are not of bounded variation. It is possible, though, to make sense of expressions like $\int_0^1 f(s)\,dB(s)$ with $f\in L^2(0,1)$ through the following construction.
Assume first that $f$ is simple: $f(t) = \sum_{i=1}^n a_i I_{(t_{i-1},t_i]}(t)$ with $a_i\in\mathbb{R}$ and $0 = t_0 < t_1 < \cdots < t_n = 1$. Then
$$\int_0^1 f(s)\,dB(s) := \sum_{i=1}^n a_i\bigl(B(t_i) - B(t_{i-1})\bigr) = \sum_{i=1}^n a_i\,\Delta B_i,$$
In fact, if $f_1,\ldots,f_k\in L^2(0,1)$, then $\bigl(\int_0^1 f_1(s)\,dB(s),\ldots,\int_0^1 f_k(s)\,dB(s)\bigr)$ has a joint centered Gaussian law, and from the isometry defining the integrals we see that
$$\operatorname{Cov}\Bigl(\int_0^1 f(s)\,dB(s),\int_0^1 g(s)\,dB(s)\Bigr) = \int_0^1 f(s)g(s)\,ds - \int_0^1 f(s)\,ds\int_0^1 g(s)\,ds. \qquad (4.9)$$
We can similarly check that $\bigl(\{B(t)\}_{t\in[0,1]},\ \int_0^1 f_1(s)\,dB(s),\ldots,\int_0^1 f_k(s)\,dB(s)\bigr)$ is Gaussian and
$$\operatorname{Cov}\Bigl(B(t),\int_0^1 f(s)\,dB(s)\Bigr) = \int_0^t f(s)\,ds - t\int_0^1 f(s)\,ds.$$
This result can be easily extended to any $h$ of bounded variation and continuous on $[0,1]$:
$$\int_0^1 h(t)\,dB(t) = -\int_0^1 B(t)\,dh(t).$$
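As a quick numerical sanity check of the covariance formula (4.9), one can approximate the stochastic integrals by Riemann sums against Brownian-bridge increments on a fine grid (the grid size, replication count, and the test functions $f(s)=s$, $g(s)=s^2$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
m, reps = 1000, 5000
t = np.arange(1, m + 1) / m

def bridge_increments():
    """Increments of a Brownian bridge over a grid of mesh 1/m."""
    dw = rng.normal(scale=1.0 / np.sqrt(m), size=m)   # Brownian increments
    w = np.cumsum(dw)
    b = w - t * w[-1]                                  # bridge: B(t) = W(t) - t W(1)
    return np.diff(np.concatenate(([0.0], b)))

f, g = t, t ** 2
I_f = np.empty(reps)
I_g = np.empty(reps)
for r in range(reps):
    db = bridge_increments()
    I_f[r] = np.sum(f * db)   # ~ integral of f dB
    I_g[r] = np.sum(g * db)   # ~ integral of g dB
emp = np.cov(I_f, I_g)[0, 1]
exact = 1 / 4 - (1 / 2) * (1 / 3)   # per (4.9): 1/4 - (1/2)(1/3) = 1/12
print(abs(emp - exact) < 0.01)
```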
This integration-by-parts formula can be used to bound the difference between stochastic integrals and the corresponding integrals with respect to the empirical process:
$$\Bigl|\int_0^1 h(t)\,d\alpha_n(t) - \int_0^1 h(t)\,dB_n(t)\Bigr| \le \|\alpha_n - B_n\|_\infty \int_0^1 d|h|(t).$$
We can summarize now the above arguments in the following theorem.
Theorem 4.1. Provided $H(t,\theta)$ is continuous on $[0,1]$ and $L(t,\theta)$ is continuous and of bounded variation on $[0,1]$, we can define $\hat\alpha_n$ and Brownian bridges $B_n$ such that
$$\|\hat\alpha_n - \hat B_n\|_\infty = O\Bigl(\frac{\log n}{\sqrt{n}}\Bigr)\quad a.s.,$$
where $\hat B_n(t) = B_n(t) - H(t,\theta)^T\int_0^1 L(s,\theta)\,dB_n(s)$. $\hat B_n$ is a centered Gaussian process with covariance
$$K(s,t) = s\wedge t - st - H(t,\theta)^T\int_0^s L(x,\theta)\,dx - H(s,\theta)^T\int_0^t L(x,\theta)\,dx + H(s,\theta)^T\Bigl(\int_0^1 L(x,\theta)L(x,\theta)^T\,dx\Bigr)H(t,\theta). \qquad (4.10)$$
and
$$A_n^2 = n\int_{-\infty}^{\infty}\frac{\bigl(F_n(x)-F(x;\hat\theta_n)\bigr)^2}{F(x;\hat\theta_n)\bigl(1-F(x;\hat\theta_n)\bigr)}\,dF(x;\hat\theta_n),$$
which are known, as in the fixed-distribution setup, as the Kolmogorov-Smirnov and Anderson-Darling statistics, respectively. Also, as in the fixed-distribution case, quadratic statistics offer in general better power properties than $K_n$, with $A_n^2$ outperforming $W_n^2$. Any of these statistics offers considerable gain in power with respect to the $\chi^2$ test (see, e.g., [94] or [96]).
Let us conclude this subsection by commenting, briefly, that subsequent advances in the theory of empirical processes have made it possible to develop other goodness-of-fit procedures.
For instance, in [41] the asymptotic distribution of the empirical characteristic function is obtained. This was applied in [69] and [49] to propose two normality tests (notice that the modulus of the characteristic function does not depend on the mean of the parent distribution). Simulations in [49] suggest that those tests behave quite well against symmetric alternatives.
A different way to adapt the fixed-distribution tests is the minimum distance method. Assume that $\rho(F,G)$ is a distance between d.f.s. Set $\rho(F_n,\mathcal{F}) := \inf_\theta \rho(F_n,F(\cdot;\theta))$. $\rho(F_n,\mathcal{F})$ is a reasonable measure of the discrepancy between the sample distribution and the family $\mathcal{F}$ that can also be used for testing fit to $\mathcal{F}$. Dudley's theory of weak convergence of empirical processes can be used for deriving the limiting distribution of $\rho(F_n,\mathcal{F})$ when $\rho(F,G) = \|F-G\|$ with $\|\cdot\|$ being some norm on $D[0,1]$ or $D[-\infty,\infty]$ (see, e.g., [74]). An alternative derivation can be based on Skorohod embedding (see [86], pp. 254-257).
An equally important concern about $W$ was the tabulation of its null distribution. Except in the case $n=3$, when the $W$-test is equivalent to the $u$-test [82], the exact distribution of $W$ is unknown. Percentiles of $W$ were computed by simulation in [82] for sample sizes up to 50. However, the asymptotic distribution of $W$ remained unknown for a long time. In fact, it was not obtained until 20 years later, in [58], by showing the asymptotic equivalence, under normality, of $W$ and another correlation test whose distribution was already known at that time (see the considerations concerning the de Wet-Venter test below).
Some transformations of $W$ that made its distribution approximately Gaussian were proposed (see [83] or [78]). However, these results must be used with some caution because, as shown in [57], they rely on some approximations which do not hold with the necessary accuracy.
An additional weakness of the Shapiro-Wilk test is that the procedure may not be consistent for testing fit to non-normal families of distributions. For instance, if $\mathcal{F}$ is the exponential location-scale family then the Shapiro-Wilk statistic becomes
$$W_E = \frac{(\bar X_n - X_{(1)})^2}{(n-1)S_n^2},$$
which is a function of the coefficient of variation. There are some families of distributions with the same coefficient of variation as the exponential family (see [79, 93]). Thus, the $W_E$-test is not consistent when applied for testing exponentiality. In particular, simulations in [93] suggest that the power of the $W_E$-test against the beta$(1/4, 12/5)$ distribution decreases with the sample size.
These limitations of the Shapiro-Wilk test led to the introduction of modifications of $W$ aimed at easing them. The first examples were the d'Agostino test [23] and the Shapiro-Francia test [81]. They were intended to replace the $W$-test for sample sizes greater than 50. Both tests are easier to compute than the $W$-test. The d'Agostino test employs an estimator of $\sigma$ proposed in [37] to get the statistic
$$D = \frac{\sum_i\bigl(i - (n+1)2^{-1}\bigr)X_{(i)}}{n^2 S}.$$
The Shapiro-Francia test is based on an idea suggested (without proof) in [48] (see also [95]) according to which the matrix $V^{-1}$ in (4.13) can be replaced by the identity $I$, obtaining thus the statistic
$$W' = \frac{(m^T X_0)^2}{m^T m\,\sum_i (X_i-\bar X)^2}. \qquad (4.14)$$
Both tests are correlation tests. The plotting positions are $(1,2,\ldots,n)$ for the $D$-test and $m$ for the $W'$-test. Simulation studies in [23] and [81], respectively, suggested that the proposed tests are approximately equivalent to the $W$-test. The $D$-test has the advantage of being asymptotically normal and its distribution can be approximated by Cornish-Fisher expansions for moderate sample sizes.
Apart from the ease of computation, an interesting feature of the $W'$ Shapiro-Francia test seemed to be its consistency for testing fit to any location-scale family with finite second-order moment, a fact shown in [79]. The meaning of this consistency needs some explanation. The fact really shown in [79] is that, if $W'_n$ denotes the Shapiro-Francia statistic for a sample of size $n$ and we fix $\lambda\in(0,1)$, then, under any fixed alternative,
$$P(W'_n < \lambda) \to 1. \qquad (4.15)$$
If we try to choose the critical value so that the test has significance level $\alpha$, that critical value, $\lambda_n(\alpha)$, will depend on $n$. We cannot conclude $P(W'_n < \lambda_n(\alpha)) \to 1$ from (4.15) and, in fact, the Shapiro-Francia test fails to detect departures from some location-scale families (an example of this is given by the exponential family; see the comments about the power of the Shapiro-Francia test below).
A further simplification of the $W'$-test was proposed by Weisberg and Bingham in [106] by replacing $m$ by the vector $\tilde m = (\tilde m_1,\ldots,\tilde m_n)$, where
$$\tilde m_i = \Phi^{-1}\Bigl(\frac{i-3/8}{n+1/4}\Bigr), \quad i = 1,\ldots,n,$$
and $\Phi$ denotes the standard Gaussian d.f. This statistic is easier to compute than $W'$, while a Monte Carlo study included in [106] suggests that both tests are equivalent.
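The Weisberg-Bingham statistic is a one-liner once the plotting positions $\tilde m_i$ are available. The following sketch (sample sizes, distributions and the comparison itself are illustrative assumptions) computes the correlation-type statistic $(\tilde m^T x)^2/(\tilde m^T\tilde m\,\sum_i(x_i-\bar x)^2)$ for a normal and an exponential sample:

```python
import numpy as np
from statistics import NormalDist

inv = NormalDist().inv_cdf

def weisberg_bingham(x):
    """Correlation statistic with Blom plotting positions
    m_i = Phi^{-1}((i - 3/8)/(n + 1/4))."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    m = np.array([inv((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)])
    # m sums to zero by symmetry, so this is a squared correlation, <= 1
    return (m @ x) ** 2 / ((m @ m) * np.sum((x - x.mean()) ** 2))

rng = np.random.default_rng(3)
w_norm = weisberg_bingham(rng.normal(size=200))       # close to 1 under normality
w_exp = weisberg_bingham(rng.exponential(size=200))   # markedly smaller
print(w_exp < w_norm <= 1.0)
```

Under normality the statistic is close to 1, while under a skewed alternative it drops markedly; this gap is what correlation tests exploit.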
Another modification of $W$ was proposed in [32] by de Wet and Venter. It seems that the concept of correlation test was introduced for the first time in that paper. The de Wet and Venter test is the correlation test with plotting positions
$$\Bigl(\Phi^{-1}\Bigl(\frac{1}{n+1}\Bigr),\ldots,\Phi^{-1}\Bigl(\frac{n}{n+1}\Bigr)\Bigr),$$
or, equivalently, the test which rejects normality when large values of
$$\tilde W = \sum_i\Bigl(\frac{X_{(i)}-\bar X_n}{S_n} - \Phi^{-1}\bigl(i/(n+1)\bigr)\Bigr)^2 \qquad (4.16)$$
are observed.
Some other tests continued this line. For instance, in [42], Filliben proposed a correlation test with the medians of the order statistics of $Z_0$ as plotting positions. Some simulations comparing this and the $W$ and $W'$ tests are offered. The distribution of this statistic was also computed via the Monte Carlo method.
An interesting feature of the $\tilde W$-test is that it was the first correlation normality test with known asymptotic distribution. To be precise, it was shown in [32] that, if $\{Z_i\}$ is a sequence of independent standard Gaussian r.v.s., then
$$\tilde W - a_n \to_w \sum_{i=3}^\infty \frac{Z_i^2-1}{i}$$
for a certain sequence of constants $\{a_n\}$. The key of the proof relied on showing, through rather involved calculations, the asymptotic equivalence, under normality, of $\tilde W$ and a certain quadratic form, and then using the asymptotic theory for quadratic forms given in [33].
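The limit law $\sum_{i\ge 3}(Z_i^2-1)/i$ is easy to sample from by truncating the series. A short sketch (truncation level and replication count are illustrative assumptions) confirms the mean-zero centering and the variance $\sum_{i\ge 3} 2/i^2 = 2(\pi^2/6 - 5/4) \approx 0.79$:

```python
import numpy as np

rng = np.random.default_rng(4)
reps, terms = 4000, 1500
i = np.arange(3, terms + 3)
# (Z_i^2 - 1)/i with Z_i standard normal, i.e. centered chi-square(1) summands
z2 = rng.chisquare(1, size=(reps, terms))
draws = ((z2 - 1) / i).sum(axis=1)        # truncated series approximating the limit
var_target = 2 * (np.pi ** 2 / 6 - 1 - 1 / 4)
print(abs(draws.mean()) < 0.05, abs(draws.var() - var_target) < 0.08)
```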
Since the publication of [32], the possibility of obtaining the asymptotic distribution of other correlation tests of normality by showing their asymptotic equivalence with the $\tilde W$-test was considered. An important paper in this program was [102], where the asymptotic equivalence of correlation tests under some general conditions (satisfied by most of the correlation tests in the literature) is shown. In particular, it is shown that the Shapiro-Francia, the Weisberg-Bingham and the Filliben tests are asymptotically equivalent to the de Wet-Venter test, having, consequently, the same asymptotic distribution.
The asymptotic distribution of the Shapiro-Wilk test could then be obtained using its asymptotic equivalence with the Shapiro-Francia test, shown in [58]. This solved an important problem which had been open for around twenty years. It would be unfair not to mention the paper [57], by Leslie, which proved the validity of the key step in previous heuristic reasonings based on assuming that the vector $m$ is an asymptotic eigenvector of $V^{-1}$. More precisely, the main result in that paper is that there exists a constant $C$ which does not depend on $n$ such that
$$\|V^{-1}m - 2m\|_2 \le C(\log n)^{1/2},$$
where, given a matrix $B = (b_{ij})$, $\|B\|_2^2 = \sum_{ij} b_{ij}^2$.
The possibility of extending the use of correlation tests to cover goodness-of-fit to other families of distributions has been explored, for instance, in [92] for the exponential distribution or in [46] for the extreme-value distributions. In this setup correlation tests do not present the same nice properties exhibited when testing normality. In [60] the asymptotic normality of the Shapiro-Francia test when applied to the exponential family is obtained. The rate of convergence is extremely slow: $(\log n)^{-1/2}$. This result was generalized in [65] to cover extreme-value and logistic distributions with the same rate and the same asymptotic distribution as in the exponential case. However, the asymptotic efficiency of the Shapiro-Francia test in these situations was found to be 0 when compared with tests based on the empirical distribution function, since it was possible to find a sequence of contiguous alternatives such that the asymptotic power coincides with the nominal significance level of the test (see also [61] on this question).
is defined as
$$\mathcal{W}(P_1,P_2) := \inf\Bigl\{\bigl(E(X_1-X_2)^2\bigr)^{1/2} : \mathcal{L}(X_1)=P_1,\ \mathcal{L}(X_2)=P_2\Bigr\}.$$
the correlation coefficient; see Lockhart and Stephens (1998), Section 5). The null asymptotics of $R_n$ provides a good insight into the cause of this inefficiency. To study this asymptotic distribution under $H_0$ we can assume $F = G_0$ (by the location-scale invariance of $R_n$) and denote the empirical quantile process as $v_n(t) = \sqrt{n}(F_n^{-1}(t) - F^{-1}(t))$, to get (see del Barrio et al. (1999)) that
$$nR_n = \frac{1}{\sigma^2(F_n)}\Bigl[\int_0^1 v_n^2(t)\,dt - \Bigl(\int_0^1 v_n(t)\,dt\Bigr)^2 - \Bigl(\int_0^1 v_n(t)F^{-1}(t)\,dt\Bigr)^2\Bigr]$$
$$= \frac{1}{\sigma^2(F_n)}\int_0^1\bigl(v_n(t) - \langle v_n,1\rangle 1 - \langle v_n,F^{-1}\rangle F^{-1}(t)\bigr)^2\,dt = \frac{1}{\sigma^2(F_n)}\int_0^1 \tilde v_n^2(t)\,dt,$$
where $\langle f,g\rangle = \int_0^1 fg$ and $\tilde v_n = v_n - \langle v_n,1\rangle 1 - \langle v_n,F^{-1}\rangle F^{-1}$. It is shown in del Barrio et al. (1999) that under normality there exist constants $a_n$ such that $nR_n - a_n$ converges in law to a nondegenerate distribution. More precisely, if $\Phi$ (resp. $\varphi$) denote the standard normal distribution (resp. density) function and $B$ is a Brownian bridge, then
$$nR_n - a_n \to_d \int_0^1\frac{B^2(t)-EB^2(t)}{\varphi^2(\Phi^{-1}(t))}\,dt - \Bigl(\int_0^1\frac{B(t)}{\varphi(\Phi^{-1}(t))}\,dt\Bigr)^2 - \Bigl(\int_0^1\frac{B(t)\,\Phi^{-1}(t)}{\varphi(\Phi^{-1}(t))}\,dt\Bigr)^2$$
$$= \int_0^1\bigl(\tilde B^2(t) - E\tilde B^2(t)\bigr)\,dt,$$
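The statistic $nR_n$ itself can be computed from the empirical quantile function. The sketch below approximates the quantile integrals on a uniform grid (grid size, sample sizes, and the minimization over the normal family through the inner products $\langle F_n^{-1},1\rangle$ and $\langle F_n^{-1},\Phi^{-1}\rangle$ are illustrative of the construction, not the text's exact order-statistic formulas):

```python
import numpy as np
from statistics import NormalDist

inv = NormalDist().inv_cdf

def nRn(x, grid=20000):
    """n * W^2(F_n, normal location-scale family) / sigma^2(F_n),
    with quantile integrals approximated on a uniform grid."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    t = (np.arange(grid) + 0.5) / grid
    Fq = x[np.minimum((t * n).astype(int), n - 1)]   # F_n^{-1}(t)
    q = np.array([inv(u) for u in t])                # Phi^{-1}(t)
    mu = Fq.mean()                                   # best normal location
    sig = np.mean(Fq * q) / np.mean(q * q)           # best normal scale
    w2 = np.mean((Fq - mu - sig * q) ** 2)           # W^2(F_n, normal family)
    s2 = np.mean((Fq - mu) ** 2)                     # sigma^2(F_n)
    return n * w2 / s2

rng = np.random.default_rng(5)
r_norm = nRn(rng.normal(2.0, 5.0, 500))    # stays moderate under normality
r_exp = nRn(rng.exponential(1.0, 500))     # grows like n off the family
print(0 <= r_norm < r_exp)
```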
$$\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\Bigl(1 + \frac{\delta_1}{6\sqrt{n}}H_3(x) + \frac{\delta_2}{24n}H_4(x)\Bigr),$$
it can be shown that
$$Y_{n,3} \to_d N(\delta_1,3) \quad\text{and}\quad Y_{n,4} \to_d N(\delta_2,4).$$
$Y_{n,3}$ and $Y_{n,4}$ being the components with larger weight in the decomposition of $R_n$, this explains the good performance of the Wasserstein (or the Shapiro-Wilk) test against deviations from normality in skewness or kurtosis.
In the exponential case $nR_n$ is not shift tight, but it can be shown that, for some constants $a_n$, $(n/\log n)R_n - a_n$ is asymptotically normal. However, if we fix $\delta\in(0,1/2)$ then
$$\frac{1}{\log n}\int_\delta^{1-\delta}(\tilde v_n(t))^2\,dt \to_{\Pr} 0,$$
hence the asymptotic distribution of $R_n$ depends only on the tails of $F$: the Wasserstein exponentiality test cannot detect alternatives that have approximately exponential tails.
As a possible remedy to this inefficiency, de Wet (2000, 2002) and Csörgő (2002) proposed to replace the Wasserstein distance $\mathcal{W}$ by a weighted version
$$\mathcal{W}_w(F,G) := \Bigl(\int_0^1\bigl(F^{-1}(t)-G^{-1}(t)\bigr)^2 w(t)\,dt\Bigr)^{1/2},$$
for some positive measurable function $w$, and the test statistic $R_n$ by
$$R_n^w = \frac{\mathcal{W}_w^2(F_n,\mathcal{H})}{\sigma_w^2(F_n)}.$$
$R_n^w$ is location-scale invariant, hence its null distribution can be studied assuming, as above, that $F = G_0$. Under the assumptions
$$\int_0^1 w(t)\,dt = 1, \qquad (5.3)$$
$$\int_0^1 G_0^{-1}(t)\,w(t)\,dt = 0 \qquad (5.4)$$
and
$$\int_0^1\bigl(G_0^{-1}(t)\bigr)^2 w(t)\,dt = 1, \qquad (5.5)$$
we can mimic, step by step, the computations leading to (1.6) and obtain that
$$nR_n^w = \frac{1}{\sigma_w^2(F_n)}\Bigl[\int_0^1 v_n^2(t)w(t)\,dt - \Bigl(\int_0^1 v_n(t)w(t)\,dt\Bigr)^2 - \Bigl(\int_0^1 v_n(t)F^{-1}(t)w(t)\,dt\Bigr)^2\Bigr]$$
$$= \frac{1}{\sigma_w^2(F_n)}\int_0^1\bigl(v_n(t) - \langle v_n,1\rangle_w 1 - \langle v_n,F^{-1}\rangle_w F^{-1}(t)\bigr)^2 w(t)\,dt$$
$$= \frac{1}{\sigma_w^2(F_n)}\int_0^1 \tilde v_n^2(t)\,w(t)\,dt, \qquad (5.6)$$
where, now, $\langle f,g\rangle_w = \int_0^1 (fg)w$ and $\tilde v_n = v_n - \langle v_n,1\rangle_w 1 - \langle v_n,F^{-1}\rangle_w F^{-1}$. Thus, the asymptotic distribution of $nR_n^w$ under the null hypothesis can be obtained through the analysis of weighted $L^2$-functionals of the quantile process $v_n$.
$$u_n(t) \overset{d}{=} \frac{n}{S_{n+1}}\sum_{j=1}^{n+1} Z_{n,j}(t), \qquad (6.2)$$
where
$$Z_{n,j}(t) = n^{-1/2}a_{n,j}(t)\,\xi_j \quad\text{and}\quad a_{n,j}(t) = (1-t)\,I_{\{j-1<nt\}} - t\,I_{\{j-1\ge nt\}}. \qquad (6.3)$$
So, if
$$L_n := \int_{1/n}^{1-1/n}\Bigl(\frac{u_n(t)}{g(t)}\Bigr)^2\,dt \qquad (6.4)$$
for some weight function $g$ non-vanishing on $(0,1)$, then, with $\|\cdot\|_2$ denoting the $L_2$ norm with respect to Lebesgue measure on the unit interval,
$$L_n \overset{d}{=} \Bigl(\frac{n}{S_{n+1}}\Bigr)^2\Bigl\|\sum_{i=1}^{n+1}c_{n,i}\,\xi_i\Bigr\|_2^2 \qquad (6.5)$$
for certain functions $c_{n,i}(t)$ which we assume in $L_2(0,1)$ (in the case of (6.5), but not always below, $c_{n,i} = n^{-1/2}a_{n,i}(t)I_{[1/n,1-1/n]}(t)/g(t)$). By the law of large numbers, weak convergence of the statistic $a_nL_n - b_n$ then reduces to weak convergence of
$$a_n\Bigl\|\sum_{i=1}^{n+1}c_{n,i}\,\xi_i\Bigr\|_2^2 - b_n\Bigl(\frac{S_{n+1}}{n}\Bigr)^2,$$
and the second variable is almost a constant if $b_n$ does not grow too fast.
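The identity behind (6.2)-(6.5) is in fact pointwise: with $U_{(k)} = S_k/S_{n+1}$ (the Rényi representation of uniform order statistics), $\sqrt{n}(G_n^{-1}(t)-t)$ coincides with $(n/S_{n+1})\sum_j Z_{n,j}(t)$. The sketch below (sample size, weight $g$ and integration grid are illustrative assumptions) checks that both sides of (6.5) agree for one draw of the exponentials:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
xi = rng.exponential(size=n + 1)      # xi_1, ..., xi_{n+1}
S = xi.cumsum()                        # S_k = xi_1 + ... + xi_k
g = lambda t: np.sqrt(t * (1 - t))    # an assumed weight, positive on (0,1)

t = np.linspace(1 / n, 1 - 1 / n, 5000)
dt = t[1] - t[0]

# Left side of (6.5): L_n computed from U_(k) = S_k / S_{n+1}.
k = np.ceil(n * t).astype(int)        # G_n^{-1}(t) = U_(k) for (k-1)/n < t <= k/n
un = np.sqrt(n) * (S[k - 1] / S[-1] - t)
Ln_direct = np.sum((un / g(t)) ** 2) * dt

# Right side: (n/S_{n+1})^2 * || sum_i c_{n,i} xi_i ||^2, using
# a_{n,j}(t) = (1-t) 1{j-1 < nt} - t 1{j-1 >= nt} from (6.3).
j = np.arange(1, n + 2)[:, None]
a = np.where(j - 1 < n * t, 1 - t, -t)
comb = (xi[:, None] * a).sum(axis=0) / (np.sqrt(n) * g(t))
Ln_repr = (n / S[-1]) ** 2 * np.sum(comb ** 2) * dt

print(abs(Ln_direct - Ln_repr) < 1e-8)
```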
$$\frac{1}{2}(s\wedge t - st) \le K_n(s,t) \le \frac{1}{n}\sum_{j=1}^{n+1}\bigl|a_{n,j}(s)\,a_{n,j}(t)\bigr| \le 3(s\wedge t - st).$$
Proof. To prove i) fix $t$ and observe that each term in $\sum_{j=1}^{n+1}a_{n,j}(t)$ equals either $1-t$ (the first $n-[n(1-t)]$ terms) or $-t$ (the remaining $[n(1-t)]+1$ terms). Hence
$$m_n(t) = \bigl(n-[n(1-t)]\bigr)(1-t) - \bigl([n(1-t)]+1\bigr)t = \operatorname{frac}\bigl(n(1-t)\bigr) - t.$$
The identity in ii) can be proved in a similar way. Fix $s\le t$. In the corresponding sum for $K_n(s,t)$ there are three types of summands: the first $n-[n(1-s)]$, each equal to $(1-s)(1-t)$; the next $[n(1-s)]-[n(1-t)]$, each equal to $-s(1-t)$; and the remaining $[n(1-t)]+1$, each equal to $st$. This gives ii) and the right-hand side inequality in iii). The left-hand side inequality in iii) is a trivial consequence of ii).
We will also make use of the following simple but useful observation about hypercontractivity of linear combinations of exponential variables with coefficients in $L_2$. Its proof, based on standard symmetrization/randomization techniques, can be found in [31].
Lemma 6.2. Let $Y(t) = \sum_{k=1}^n c_k(t)\,\xi_k$ for some $n\in\mathbb{N}$ and $c_k\in L_2(0,1)$, where the variables $\xi_k$ are independent exponential with parameter 1. Then there exists an absolute constant $C<\infty$ such that
$$E\|Y\|_2^4 \le C\bigl(E\|Y\|_2^2\bigr)^2. \qquad (6.6)$$
Next we consider the general quantile process. Let $F$ be a twice differentiable distribution function such that $f := F'$ is non-vanishing on the interior of $\operatorname{supp}F$, $(a_F,b_F) := \{x : 0 < F(x) < 1\}$, and
$$r := \sup_{0<t<1}\frac{t(1-t)\,\bigl|f'\bigl(F^{-1}(t)\bigr)\bigr|}{f^2\bigl(F^{-1}(t)\bigr)} < \infty, \qquad (6.7)$$
where $F^{-1}(t)$ is the corresponding quantile function. Condition (6.7), which comes from Csörgő and Révész (1978), is a natural condition to have if we wish to relate general and uniform quantile processes: see Lemma 1.1, Ch. 6, and the comments after its proof in this reference. Let $X_i$ be i.i.d. with common distribution $F$, let $F_n$ be the empirical distribution of $X_1,\ldots,X_n$, $n\in\mathbb{N}$, and let $F_n^{-1}$ denote the empirical quantile function. Since we are considering only distributional results, there is no loss of generality in taking $X_i = F^{-1}(U_i)$, where $U_i$ are i.i.d. uniform on $[0,1]$. In this case,
$$F_n^{-1}(t) = F^{-1}\bigl(G_n^{-1}(t)\bigr), \qquad (6.8)$$
where $G_n^{-1}$ is the quantile function corresponding to the uniform variables $U_1,\ldots,U_n$. We let $v_n$ be the quantile process associated to the sequence $X_i$,
$$v_n(t) := \sqrt{n}\bigl(F_n^{-1}(t) - F^{-1}(t)\bigr), \quad n\in\mathbb{N}. \qquad (6.9)$$
$v_n$ and $u_n$ are related by the limited Taylor expansion
$$v_n(t) = \sqrt{n}\bigl(F^{-1}(G_n^{-1}(t)) - F^{-1}(t)\bigr) = \frac{\sqrt{n}\bigl(G_n^{-1}(t)-t\bigr)}{f\bigl(F^{-1}(t)\bigr)} - \frac{1}{2\sqrt{n}}\,n\bigl(G_n^{-1}(t)-t\bigr)^2\,\frac{f'\bigl(F^{-1}(\tau)\bigr)}{f^3\bigl(F^{-1}(\tau)\bigr)}$$
$$= \frac{u_n(t)}{f\bigl(F^{-1}(t)\bigr)} - \frac{1}{2\sqrt{n}}\,\frac{f'\bigl(F^{-1}(\tau)\bigr)}{f^3\bigl(F^{-1}(\tau)\bigr)}\,u_n^2(t) \qquad (6.10)$$
for some $\tau$ between $t$ and $G_n^{-1}(t)$. This Taylor expansion can be used to relate the (weighted) $L_2$ norms of $v_n$ and $u_n/f(F^{-1})$. This is the content of our next result; see [31] for the proof. Here $w$ is a non-negative measurable function on $(0,1)$, and $\|\cdot\|_{2,w,n}$ and $\langle\cdot,\cdot\rangle_{w,n}$ denote, respectively, the norm and the inner product in the space $L_2\bigl((1/n,1-1/n),w(t)dt\bigr)$. Then we have:
Lemma 6.3. Let $F$ be a distribution function which is twice differentiable on its open support $(a_F,b_F)$, with $f(x) := F'(x) > 0$ for all $a_F<x<b_F$, and which satisfies condition (6.7). Assume further that $w$ is a non-negative measurable function such that
$$\lim_n\frac{1}{\sqrt{n}}\int_{1/n}^{1-1/n}\frac{t^{1/2}(1-t)^{1/2}}{f^2\bigl(F^{-1}(t)\bigr)}\,w(t)\,dt = 0. \qquad (6.11)$$
Then, if $u_n$ is the uniform quantile process and $v_n$ is the quantile process defined by (6.8) and (6.9),
$$\|v_n\|_{2,w,n}^2 - \Bigl\|\frac{u_n}{f\circ F^{-1}}\Bigr\|_{2,w,n}^2 \to 0 \quad\text{and}\quad \Bigl\|v_n - \frac{u_n}{f\circ F^{-1}}\Bigr\|_{2,w,n} \to 0 \qquad (6.12)$$
in probability.
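The content of (6.10) — that $u_n/(f\circ F^{-1})$ carries the weight of $v_n$ — can be seen numerically. The sketch below uses the standard logistic distribution (an assumed example, chosen because $F^{-1}(t) = \log(t/(1-t))$ and $f(F^{-1}(t)) = t(1-t)$ are explicit and (6.7) holds with $r=1$) and the assumed weight $w(t) = t(1-t)$:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
t = np.linspace(1 / n, 1 - 1 / n, 8000)
dt = t[1] - t[0]
w = t * (1 - t)                        # an assumed weight, damping the tails
k = np.ceil(n * t).astype(int) - 1
logit = np.log(t / (1 - t))            # F^{-1} for the standard logistic

err = norm = 0.0
for _ in range(20):                    # average over a few replications
    u = np.sort(rng.uniform(size=n))
    x = np.log(u / (1 - u))            # X_i = F^{-1}(U_i)
    un = np.sqrt(n) * (u[k] - t)       # uniform quantile process
    vn = np.sqrt(n) * (x[k] - logit)   # quantile process (6.9)
    lead = un / (t * (1 - t))          # leading term u_n/(f o F^{-1}) in (6.10)
    err += np.sum((vn - lead) ** 2 * w) * dt
    norm += np.sum(lead ** 2 * w) * dt
print(err < 0.05 * norm)               # the remainder is negligible in weighted L2
```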
In fact, as mentioned in the introduction, we are interested not in $\|v_n\|_{2,w,n}$ but rather in $\|v_n\|_{2,w}$, where $\|\cdot\|_{2,w}$ denotes the $L_2$-norm with respect to the measure $w(t)dt$ over the whole interval $(0,1)$. So, we must deal next with the integrals at the extremes. The problem can be solved by imposing conditions which are related to, but weaker than, the usual von Mises conditions on domains of attraction (e.g., Parzen (1979) and Schuster (1984)). Again, we refer to [31] for a proof.
Lemma 6.4. Let $F$ be a distribution function which is twice differentiable on its open support $(a_F,b_F)$, with $f(x) := F'(x) > 0$ for all $a_F<x<b_F$. Assume that $F$ satisfies condition (6.7), that
$$\text{either } a_F > -\infty \quad\text{or}\quad \liminf_{x\to 0^+}\frac{|f'(F^{-1}(x))|\,x}{f^2(F^{-1}(x))} > 0, \qquad (6.13)$$
and
$$\text{either } b_F < \infty \quad\text{or}\quad \liminf_{x\to 0^+}\frac{|f'(F^{-1}(1-x))|\,x}{f^2(F^{-1}(1-x))} > 0. \qquad (6.14)$$
Assume further that $w$ is a bounded non-negative measurable function such that
$$\lim_{x\to 0^+}\frac{x\int_0^x w(t)\,dt}{f^2\bigl(F^{-1}(x)\bigr)} = 0 \quad\text{and}\quad \lim_{x\to 0^+}\frac{x\int_{1-x}^1 w(t)\,dt}{f^2\bigl(F^{-1}(1-x)\bigr)} = 0. \qquad (6.15)$$
Then,
$$\|v_n\|_{2,w}^2 - \|v_n\|_{2,w,n}^2 \to 0 \qquad (6.16)$$
in probability.
As a consequence of these two lemmas we have:
Proposition 6.5. Let $F$ be a distribution function which is twice differentiable on its open support $(a_F,b_F)$, with $f(x) := F'(x) > 0$ for all $a_F<x<b_F$. Assume that $F$ satisfies conditions (6.7), (6.13) and (6.14). Let $w$ be a non-negative measurable function for which the limits (6.15) hold. Assume further that
$$\lim_n\frac{1}{\sqrt{n}}\int_{1/n}^{1-1/n}\frac{t^{1/2}(1-t)^{1/2}}{f^2\bigl(F^{-1}(t)\bigr)}\,w(t)\,dt = 0.$$
Then,
$$\|v_n\|_{2,w}^2 - \Bigl\|\frac{u_n}{f\circ F^{-1}}\Bigr\|_{2,w,n}^2 \to 0 \qquad (6.17)$$
in probability. If, moreover, $h\in L_2(w(t)dt)$ and the sequence $\langle u_n/(f\circ F^{-1}),h\rangle_{w,n}$ is stochastically bounded, then
$$\langle v_n,h\rangle_w^2 - \Bigl\langle\frac{u_n}{f\circ F^{-1}},h\Bigr\rangle_{w,n}^2 \to 0 \qquad (6.18)$$
in probability.
Proof. The conclusion (6.17) is a direct consequence of the last two lemmas. The limit (6.18) follows from the same two lemmas, stochastic boundedness of $\langle u_n/(f\circ F^{-1}),h\rangle_{w,n}$, Hölder's inequality and the identities
$$\langle v_n,h\rangle_{w,n}^2 - \Bigl\langle\frac{u_n}{f\circ F^{-1}},h\Bigr\rangle_{w,n}^2 = \Bigl\langle v_n - \frac{u_n}{f\circ F^{-1}},h\Bigr\rangle_{w,n}\Bigl(\Bigl\langle v_n - \frac{u_n}{f\circ F^{-1}},h\Bigr\rangle_{w,n} + 2\Bigl\langle\frac{u_n}{f\circ F^{-1}},h\Bigr\rangle_{w,n}\Bigr)$$
and
$$\langle v_n,h\rangle_w^2 - \langle v_n,h\rangle_{w,n}^2 = \int_{(0,1/n]\cup[1-1/n,1)} v_n h\,w\;\Bigl(\int_{(0,1/n]\cup[1-1/n,1)} v_n h\,w + 2\langle v_n,h\rangle_{w,n}\Bigr).$$
that is,
$$EV^2 < 2a^2 \quad\text{whenever}\quad \Pr\{|V|>a\} < \frac{1}{4C}.$$
Proof. It follows immediately upon observing that
$$EV^2 \le t^2 + E\bigl(V^2 I_{\{|V|>t\}}\bigr) \le t^2 + (EV^4)^{1/2}\bigl(\Pr\{|V|>t\}\bigr)^{1/2}.$$
and
$$\sup_n\|m_n\|_2^2 = \sup_n\sum_{1\le i,j\le n}\langle c_{n,i},c_{n,j}\rangle < \infty. \qquad (6.24)$$
Since
$$E\|Y_n\|_2^2 = E\int_0^1\Bigl(\sum_k c_{n,k}(t)\,\xi_k\Bigr)^2 dt = \int_0^1\Bigl(\sum_k c_{n,k}^2(t) + \sum_{i,j}c_{n,i}(t)\,c_{n,j}(t)\Bigr)dt = \sum_k c_{k,k} + \sum_{i,j}c_{i,j}, \qquad (6.26)$$
where $c_{i,j} := \langle c_{n,i},c_{n,j}\rangle$, it follows that the conditions (6.23) and (6.24) are sufficient for tightness of the sequence $\{\|Y_n\|_2\}$ and that (6.25) is sufficient for its convergence to zero in probability. Sufficiency for tightness and convergence of $\{\|Z_n\|_2\}$ follows from this and the law of large numbers. Necessity in both cases follows immediately from Lemmas 6.6 and 6.2.
One can say more about the way $Y_n$ converges. In fact, stochastic boundedness of $\|Y_n - m_n\|_2$ implies uniform boundedness of moments of any order. This fact is proved in [31]. With these preliminaries on integrability out of the way, we consider now convergence in law of the sequence $\{\|Y_n\|_2\}$. We consider several cases, corresponding to the different cases for convergence of the square integral of the quantile process described above.
a) Convergence of the processes $Y_n$. Here we obtain necessary and sufficient conditions for weak convergence of $Y_n$ as $L_2$-valued random vectors; then convergence of $\|Y_n\|_2^2$ will be an immediate consequence of the continuous mapping theorem for weak convergence. Note that
$$P\bigl(\|c_{n,i}\,\xi_i\|_2 > \varepsilon\bigr) = \exp\bigl(-\varepsilon/\|c_{n,i}\|_2\bigr)$$
and therefore the triangular array $\{c_{n,i}\xi_i : i=1,\ldots,n;\ n\in\mathbb{N}\}$ is infinitesimal if and only if
$$\max_i\|c_{n,i}\|_2 \to 0 \qquad (6.27)$$
as $n\to\infty$. The next theorem gives necessary and sufficient conditions for the convergence in law in $L_2(0,1)$ of $\{Y_n\}$ under (6.27). Under infinitesimality, the only possible limits of $\{Y_n\}$ are Gaussian, with a trace-class covariance operator. $K_n$ and $m_n$ are defined as in (6.21).
Theorem 6.8. Assuming condition (6.27) holds, the sequence $\{Y_n\}$ converges in law in $L_2(0,1)$ if and only if the following conditions hold:
i) There exists a symmetric, positive semi-definite, trace-class kernel $K(s,t)\in L_2\bigl((0,1)\times(0,1)\bigr)$ such that
$$K_n \to_{L_2} K. \qquad (6.28)$$
ii) If $\lambda_i \ge 0$ are the eigenvalues of $K$ then
$$\sum_{i=1}^n\|c_{n,i}\|_2^2 \to \sum_{i=1}^\infty\lambda_i. \qquad (6.29)$$
iii) There exists $m\in L_2(0,1)$ such that
$$m_n \to_{L_2} m. \qquad (6.30)$$
If i), ii) and iii) hold, then $Y_n$ converges in law in $L_2(0,1)$ to an $L_2(0,1)$-valued Gaussian random variable $Y$ with mean function $m$ and covariance operator $\tilde K$ given by
$$\tilde K(f,g) = \int_0^1\int_0^1 K(s,t)\,f(s)\,g(t)\,ds\,dt$$
for $f,g\in L_2(0,1)$.
Proof. Necessity. Let us assume first that the $L_2$-valued random vectors $Y_n$ converge in law. Then $\{\|Y_n\|_2\}$ also converges in law and, moreover, by Proposition 6.3, its moments converge as well (to the moments of the limit). This implies, in particular, that the sequence
$$E\|Y_n\|_2^2 = \sum_{i=1}^n\|c_{n,i}\|_2^2 + \Bigl\|\sum_{i=1}^n c_{n,i}\Bigr\|_2^2, \quad n\in\mathbb{N}, \qquad (6.31)$$
converges. Note also that convergence in law of $Y_n$ to $Y$ plus uniform integrability of $\{\|Y_n\|_2\}$, which is a consequence of moment convergence, ensures that $EY_n \to_{L_2} EY$ and, therefore, that
$$\sum_{i=1}^n c_{n,i} \to_{L_2} m := EY. \qquad (6.32)$$
Now, (6.31) and (6.32) imply (6.30) and also that the left-hand side in (6.29) converges to a finite limit. We have also proved that $\{Y_n - EY_n\}$ converges in law, namely,
$$Y_n - EY_n = \sum_{i=1}^n c_{n,i}(\xi_i - 1) \to_d Y - EY. \qquad (6.33)$$
Also, (6.33) and uniform integrability imply
$$\sum_{i=1}^n\|c_{n,i}\|_2^2 = E\|Y_n - EY_n\|_2^2 \to E\|Y - EY\|_2^2. \qquad (6.34)$$
Let now $\lambda_i$, $\varphi_i$ be, respectively, the eigenvalues and the corresponding eigenfunctions associated to the kernel $K$. Then
$$\lambda_i = \int_0^1\int_0^1 K(s,t)\,\varphi_i(s)\,\varphi_i(t)\,ds\,dt = E\langle Y-EY,\varphi_i\rangle^2.$$
Therefore,
$$E\|Y-EY\|_2^2 = \sum_{i=1}^\infty E\langle Y-EY,\varphi_i\rangle^2 = \sum_{i=1}^\infty\lambda_i,$$
which combined with (6.34) yields (6.28) and (6.29).
Sufficiency. Assume now that (6.27), (6.28), (6.29) and (6.30) hold. Let us denote $\xi_{i,\varepsilon} = \xi_i I_{\{\|c_{n,i}\|_2\xi_i\le\varepsilon\}}$ and $\xi'_{i,\varepsilon} = \xi_i I_{\{\|c_{n,i}\|_2\xi_i>\varepsilon\}}$ for $\varepsilon>0$ and $i\in\mathbb{N}$. Since $E\xi_i I_{\{\xi_i>t\}} = (t+1)e^{-t} \le 1/t^2$ for large enough $t$, condition (6.29) implies that, for large enough $n$,
$$\Bigl\|E\sum_{i=1}^n c_{n,i}\,\xi'_{i,\varepsilon}\Bigr\|_2 = \Bigl\|\sum_{i=1}^n c_{n,i}\,E\xi'_{i,\varepsilon}\Bigr\|_2 \le \sum_{i=1}^n\|c_{n,i}\|_2\,E\xi'_{i,\varepsilon} \le \frac{1}{\varepsilon^2}\Bigl(\sum_{i=1}^n\|c_{n,i}\|_2^2\Bigr)\max_i\|c_{n,i}\|_2 \to 0.$$
Hence, by the Central Limit Theorem in Hilbert spaces (see, e.g., Araujo and Giné (1980), Corollary 3.7.8), if $\{Y_n\}$ is shift convergent in law, we can take the shifts to be the expected values $EY_n$ and therefore, by the same central limit theorem, the proof reduces to showing that
i) $\sum_{j=1}^n P\bigl(\|c_{n,j}\,\xi_j\|_2 > \varepsilon\bigr) \to 0$ as $n\to\infty$ for every $\varepsilon>0$,
ii) for every $\varepsilon>0$ and every $f\in L_2(0,1)$,
$$\sum_{j=1}^n\operatorname{Var}\bigl(\langle c_{n,j}\,\xi_{j,\varepsilon},f\rangle\bigr) \to \tilde K(f,f),$$
and
$$\sum_{j=1}^n E\bigl\|c_{n,j}\,\xi'_{j,\varepsilon}\bigr\|_2^2 = \sum_{j=1}^n\|c_{n,j}\|_2^2\,E(\xi'_{j,\varepsilon})^2 \le \sum_{j=1}^n\|c_{n,j}\|_2^2\;E\xi_1^2\,I_{\{\xi_1 > \varepsilon/\max_i\|c_{n,i}\|_2\}} \to 0.$$
Then, using (6.29) and the fact that, as shown above, $\sum_{j=1}^n E\|c_{n,j}\,\xi'_{j,\varepsilon}\|_2^2 \to 0$, we have
$$\lim_n\sum_{j=1}^n E\bigl\|c_{n,j}\,\xi_{j,\varepsilon} - Ec_{n,j}\,\xi_{j,\varepsilon}\bigr\|_2^2 = \lim_n\sum_{j=1}^n E\bigl\|c_{n,j}\,\xi_j - Ec_{n,j}\,\xi_j\bigr\|_2^2 = \lim_n\sum_{j=1}^n\|c_{n,j}\|_2^2 = \sum_{i=1}^\infty\lambda_i$$
and, by (6.36),
$$\lim_n\sum_{j=1}^n\sum_{i=1}^k E\bigl\langle c_{n,j}\,\xi_{j,\varepsilon} - Ec_{n,j}\,\xi_{j,\varepsilon},\varphi_i\bigr\rangle^2 = \sum_{i=1}^k \tilde K(\varphi_i,\varphi_i) = \sum_{i=1}^k\lambda_i,$$
b) Shift convergence of $\|Y_n\|_2^2$, I. It can be proved that shift tightness of $\{\|Y_n\|_2^2\}$ implies tightness of the sequence centered at expectations, and even tightness of the sequence $\{\|Z_n\|_2^2 - E\|Y_n\|_2^2\}$. We comment on this to help appreciate the sharpness of the results that follow, but omit any proof of it. In the previous subsection, we
and, clearly, in order to make sense of this limit it suffices (and is necessary as well) that $\sum_i\lambda_i^2 < \infty$, a weaker condition. We deal here with this situation, that is, we relax the assumptions on $K$ in Theorem 6.8 by only assuming that
$K\in L_2\bigl((0,1)\times(0,1)\bigr)$. In this case, the operator induced by $K$ on $L_2(0,1)$ is Hilbert-Schmidt, that is, its eigenvalues $\{\lambda_k\}$ satisfy $\sum_k\lambda_k^2 < \infty$ (e.g., Dunford and Schwartz (1963), XI.6 and XI.8.44). Then, with considerable abuse of notation, we define
$$\|Y-EY\|_2^2 - E\|Y-EY\|_2^2 := \sum_k\lambda_k\bigl(Z_k^2-1\bigr), \qquad (6.37)$$
where the variables $Z_k$ are independent standard normal ($Y$ may not exist but the series does converge a.s.). In fact, $\|Y-EY\|_2^2 - E\|Y-EY\|_2^2$ makes sense as a multiple Wiener integral, but this is of marginal interest for the sequel.
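Definition (6.37) can be illustrated numerically (the eigenvalue sequence $\lambda_k = 1/k$ and the truncation level are illustrative assumptions: $\sum\lambda_k = \infty$, so no trace-class $Y$ exists, yet $\sum\lambda_k^2 < \infty$ and the centered series converges):

```python
import numpy as np

rng = np.random.default_rng(8)
reps, K = 2000, 4000
lam = 1.0 / np.arange(1, K + 1)          # lambda_k = 1/k: square-summable only
z2 = rng.chisquare(1, size=(reps, K))    # Z_k^2 with Z_k standard normal
partial = np.cumsum(lam * (z2 - 1), axis=1)   # partial sums of (6.37)
# The tail is negligible: increments beyond K/2 have variance
# 2 * sum_{k > K/2} lambda_k^2, which is tiny even though sum lambda_k diverges.
tail_var = np.var(partial[:, -1] - partial[:, K // 2 - 1])
full_var = np.var(partial[:, -1])
print(tail_var < 0.01, abs(full_var - 2 * np.sum(lam ** 2)) < 0.6)
```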
We state now a useful lemma on the asymptotic normality of sums of independent exponential random variables. The proof is omitted, but it is just an easy exercise on the CLT on the line if one uses the fact that the Gaussian family is factor closed.
The main argument in this subsection is contained in the proof of the following proposition.
$$\|Y_n-EY_n\|_2^2 - E\|Y_n-EY_n\|_2^2 \to_d \|Y-EY\|_2^2 - E\|Y-EY\|_2^2,$$
where $\|Y-EY\|_2^2 - E\|Y-EY\|_2^2$ is as defined in (6.37).
where $\|Y-EY\|_2^2 - E\|Y-EY\|_2^2$ is defined as in (6.37) and $\{Z_k\}$ is an ortho-Gaussian sequence.
Proof. Formally we require the proof of Proposition 6.11 rather than its statement. First we note
$$\|Y_n\|_2^2 - E\|Y_n\|_2^2 = \bigl(\|Y_n-EY_n\|_2^2 - E\|Y_n-EY_n\|_2^2\bigr) + 2\langle Y_n-EY_n, EY_n\rangle. \qquad (6.44)$$
As in the previous proof,
$$E\langle Y_n-EY_n,EY_n\rangle\langle Y_n-EY_n,\varphi_k\rangle = \langle K_n, m_n\otimes\varphi_k\rangle \to \langle K, m\otimes\varphi_k\rangle = \lambda_k\langle m,\varphi_k\rangle.$$
This implies that for each $M$ we have convergence in law of the vector
$$\bigl(\langle Y_n-EY_n,\varphi_1\rangle,\ldots,\langle Y_n-EY_n,\varphi_M\rangle,\ \langle Y_n-EY_n,EY_n\rangle\bigr)$$
to the Gaussian vector
$$\Bigl(\sqrt{\lambda_1}\,Z_1,\ldots,\sqrt{\lambda_M}\,Z_M,\ \sum_{k=1}^\infty\sqrt{\lambda_k}\,\langle m,\varphi_k\rangle Z_k\Bigr).$$
This gives weak convergence, for every $M<\infty$, of the random variables
$$\sum_{k=1}^M\bigl(\langle Y_n-EY_n,\varphi_k\rangle^2 - E\langle Y_n-EY_n,\varphi_k\rangle^2\bigr) + 2\langle Y_n-EY_n,EY_n\rangle,$$
$$= \sum_{i=1}^n\Bigl(2\Bigl(\sum_{j=1}^{i-1}c_{n,i,j}(\xi_j-1)\Bigr)(\xi_i-1) + c_{n,i,i}\bigl((\xi_i-1)^2-1\bigr)\Bigr) = \sum_{i=1}^n x_{n,i},$$
where
$$x_{n,i} = 2\Bigl(\sum_{j=1}^{i-1}c_{n,i,j}(\xi_j-1)\Bigr)(\xi_i-1) + c_{n,i,i}\bigl((\xi_i-1)^2-1\bigr)$$
(and we use the convention that $\sum_{j=1}^0 a_j = 0$). If $\{\xi'_i\}$ denotes an independent copy of the sequence $\{\xi_i\}$ and we set
$$x'_{n,i} = 2\Bigl(\sum_{j=1}^{i-1}c_{n,i,j}(\xi_j-1)\Bigr)(\xi'_i-1) + c_{n,i,i}\bigl((\xi'_i-1)^2-1\bigr)$$
for $i=1,\ldots,n$ and $\mathcal{F}_{n,i} = \sigma(\xi_1,\ldots,\xi_i)$, then, for each $n\in\mathbb{N}$, $\{x_{n,i}\}$ and $\{x'_{n,i}\}$ are tangent sequences with respect to $\{\mathcal{F}_{n,i}\}$, that is, $\mathcal{L}(x_{n,i}|\mathcal{F}_{n,i-1}) = \mathcal{L}(x'_{n,i}|\mathcal{F}_{n,i-1})$, and the random variables $x'_{n,i}$ are conditionally independent given the sequence $\{\xi_i\}$. Hence, $\{x'_{n,i}\}$ is a decoupled tangent sequence to $\{x_{n,i}\}$ (see, e.g., de la Peña and Giné (1999), Chapter 6). Decoupling introduces enough independence among the summands in $\sum_{i=1}^n x'_{n,i}$ to enable us to use the CLT in order to obtain their asymptotic distribution. The principle of conditioning (Theorem 1.1 in Jakubowski (1986), reproduced in de la Peña and Giné (1999), Theorem 7.1.4) can then be used to conclude convergence in law of $\sum_{i=1}^n x_{n,i}$ itself. The proof of our next result follows this approach.
Theorem 6.13. Let $Z$ be a standard normal random variable. If
$$\max_i\|c_{n,i}\|_2 \to 0, \qquad (6.45)$$
$$2\|K_n\|_2^2 + 6\sum_{i=1}^n\|c_{n,i}\|_2^4 \to \sigma^2, \qquad (6.46)$$
and
$$\sum_{j\ne k}\Bigl(\sum_{i:i>j\vee k}\langle c_{n,i},c_{n,j}\rangle\langle c_{n,i},c_{n,k}\rangle\Bigr)^2 + \sum_j\Bigl(\sum_{i:i>j}\langle c_{n,i},c_{n,j}\rangle\langle c_{n,i},c_{n,i}+c_{n,j}\rangle\Bigr)^2 \to 0, \qquad (6.47)$$
then
$$\|Y_n-EY_n\|_2^2 - E\|Y_n-EY_n\|_2^2 \to_d \sigma Z. \qquad (6.48)$$
If, instead of conditions (6.45), (6.46) and (6.47), we have
$$\max_i\bigl(\|c_{n,i}\|_2^2 + |\langle c_{n,i},m_n\rangle|\bigr) \to 0, \qquad (6.49)$$
$$2\|K_n\|_2^2 + 2\sum_{i=1}^n\|c_{n,i}\|_2^4 + 4\sum_i\langle c_{n,i},c_{n,i}+m_n\rangle^2 \to \sigma^2, \qquad (6.50)$$
and
$$\sum_{j,k}\Bigl(\sum_{i:i>j\vee k}\langle c_{n,i},c_{n,j}\rangle\langle c_{n,i},c_{n,k}\rangle\Bigr)^2 + \sum_j\Bigl(\sum_{i:i>j}\langle c_{n,i},c_{n,j}\rangle\langle c_{n,i},c_{n,i}+m_n+c_{n,j}\rangle\Bigr)^2 \to 0, \qquad (6.51)$$
then
$$\|Y_n\|_2^2 - E\|Y_n\|_2^2 \to_d \sigma Z. \qquad (6.52)$$
Proof. We first prove the limit in (6.48). If we set $U_n = \sum_{i=1}^n x'_{n,i}$, with $x'_{n,i}$ defined as above, the principle of conditioning (Jakubowski (1986)) reduces the proof to showing that
$$\mathcal{L}\bigl(U_n\,\big|\,\{\xi_i\}\bigr) \to_w N(0,\sigma^2)$$
in probability. Arguing as in the proof of Lemma 6.10, we can see that this is equivalent to proving that
$$\max_i\Bigl(\sum_{j=1}^{i-1}c_{i,j}(\xi_j-1)\Bigr)^2 \to_{\Pr} 0 \qquad (6.53)$$
and
$$B_n := \sum_i E\bigl((x'_{n,i})^2\,\big|\,\{\xi_j\}\bigr) = 4\sum_i c_{i,i}^2 + 4\sum_i\Bigl(c_{i,i} + \sum_{j=1}^{i-1}c_{i,j}(\xi_j-1)\Bigr)^2 \to_{\Pr} \sigma^2. \qquad (6.54)$$
$$\frac{27}{t^2}\max_i\sum_{j=1}^{i-1}c_{i,j}^2 \le \frac{27}{t^2}\max_i\sum_{j=1}^n c_{i,j}^2 = \frac{27}{t^2}\max_i\langle c_{n,i}, K_n c_{n,i}\rangle \le \frac{27}{t^2}\|K_n\|_2\max_i\|c_{n,i}\|_2^2 \to 0.$$
This concludes the proof of the limit (6.48). We turn now to the limit (6.52). The fact that
$$\|Y_n\|_2^2 - E\|Y_n\|_2^2 = \sum_{1\le i\ne j\le n}c_{i,j}(\xi_i-1)(\xi_j-1) + \sum_{i=1}^n\bigl[c_{i,i}\bigl((\xi_i-1)^2-1\bigr) + 2\langle c_{n,i},m_n\rangle(\xi_i-1)\bigr] = \sum_{i=1}^n y_{n,i},$$
where
$$y_{n,i} = 2\Bigl(\sum_{j=1}^{i-1}c_{n,i,j}(\xi_j-1)\Bigr)(\xi_i-1) + c_{n,i,i}\bigl((\xi_i-1)^2-1\bigr) + 2\langle c_{n,i},m_n\rangle(\xi_i-1),$$
can be used to conclude (6.52) by reproducing the proof of (6.48) almost verbatim.
The tool for Theorem 6.13, namely the principle of conditioning, which could be easily replaced by the Brown-Eagleson central limit theorem for martingales, has been used before in analogous situations. We will just mention P. Hall (1984), who uses it in density estimation, in order to prove a limit theorem for degenerate $U$-statistics with varying kernels. His result is different from ours and does not apply here, but there are similarities in the proofs.
The assumptions in the above theorem are quite tight (for instance, it can be shown that they are necessary for the limits (6.53) and (6.54)). An easier to check set of (stronger) sufficient conditions, more adapted to the quantile process case, can be stated with the following notation. We define
$$(K_n * K_n)(s,t) := \int_0^1 K_n(s,u)\,K_n(t,u)\,du = \sum_{i,j}\langle c_{n,i},c_{n,j}\rangle\,c_{n,i}(s)\,c_{n,j}(t).$$
It can be easily checked that $\|K_n*K_n\|_2^2 = \sum_{j,k}\bigl(\sum_i c_{n,i,j}\,c_{n,i,k}\bigr)^2$ and also that
$$\|K_n*K_n\|_2^2 = \int_0^1\int_0^1\int_0^1\int_0^1 K_n(s,t)\,K_n(u,v)\,K_n(s,u)\,K_n(t,v)\,ds\,dt\,du\,dv.$$
The assumptions in the following result are often easier to deal with than (6.51). The proof can be found in [31].
Corollary 6.14. If
$$\sum_{i=1}^n\langle c_{n,i},m_n\rangle^2 \to 0, \quad \sum_{i=1}^n\|c_{n,i}\|_2^4 \to 0 \quad\text{and}\quad \|K_n*K_n\|_2 \to 0, \qquad (6.55)$$
$$\bar K_n(s,t) := \sum_i\bigl|c_{n,i}(s)\,c_{n,i}(t)\bigr| \le C\,K_n(s,t) \qquad (6.56)$$
and
$$\|K_n\|_2^2 \to \sigma^2/2, \qquad (6.57)$$
then
$$\|Y_n\|_2^2 - E\|Y_n\|_2^2 \to_d \sigma Z.$$
Note that if $\|K_n*K_n\|_2 \to 0$ we cannot have $K_n \to K$ in $L_2$ unless $K = 0$.
($\|c_{n,i}\| \to 0$). Of course, if conditions of this type are removed, other asymptotic distributions can be obtained. It is straightforward to see, for instance, that if
$$\sum_{1\le i,j\le n}\bigl(c_{n,i,j} - \gamma_{i,j}\bigr)^2 \to 0 \qquad (6.58)$$
66 E. del Barrio
for some real numbers {i,j } satisfying i,j
2
i,j < then
n
!
Yn EYn
22 E
Yn EYn
22 = cn,i,j (i 1)(j 1) i,j
i,j=1
!
i,j (i 1)(j 1) i,j .
L2
i,j=1
Note that the limiting random variable is well defined because the condition
$\sum_{i,j}\gamma_{i,j}^2 < \infty$ implies that the associated partial sums are $L_2$ convergent. If,
further,
\[
\sum_{i=1}^n\Big(\sum_{j=1}^n c_{n,i,j} - \beta_i\Big)^2 \to 0 \tag{6.59}
\]
for some real numbers $\beta_i$ such that $\sum_{i=1}^\infty \beta_i^2 < \infty$, then we also have that
\[
\|Y_n\|_2^2 - E\|Y_n\|_2^2
\to_{L_2} \sum_{i,j=1}^\infty \gamma_{i,j}\big((\xi_i-1)(\xi_j-1)-\delta_{i,j}\big) + 2\sum_{i=1}^\infty \beta_i(\xi_i-1). \tag{6.60}
\]
d) Shift convergence of $\|Z_n\|_2^2$. $Y_n$ can be replaced by $Z_n$ in Theorem 6.8 as an
immediate consequence of the law of large numbers, whereas it can be replaced in
Theorem 6.12 and Corollary 6.14 because of the following proposition.
Proposition 6.15. Suppose $\|Y_n\|_2^2 - E\|Y_n\|_2^2$ converges in law. Then $\|Z_n\|_2^2 - E\|Y_n\|_2^2$
converges in law to the same limit if and only if
\[
\frac{1}{\sqrt n}\,E\|Y_n\|_2^2 \to 0. \tag{6.61}
\]
In particular, this condition is satisfied if both conditions
\[
\frac{1}{\sqrt n}\sum_i c_{n,i,i} \to 0 \qquad\text{and}\qquad \sum_i\langle c_{n,i}, m_n\rangle^2 \to 0, \tag{6.62}
\]
hold. If (6.61) holds, we also have $\langle Z_n - EY_n, h\rangle - \langle Y_n - EY_n, h\rangle \to 0$ for any
$h \in L_2(0,1)$.
Proof. Since
\[
\|Z_n\|_2^2 - \|Y_n\|_2^2 = \Big[\Big(\frac{n-1}{S_n}\Big)^2 - 1\Big]\,\|Y_n\|_2^2 = O_P\big(n^{-1/2}\big)\,\|Y_n\|_2^2,
\]
by the central limit theorem and the law of large numbers, the necessity and
sufficiency of condition (6.61) follows from Lemmas 2.2 and 3.1. Now, by (6.26)
and Cauchy-Schwarz,
\[
\frac{1}{\sqrt n}\,E\|Y_n\|_2^2
= \frac{1}{\sqrt n}\sum_i c_{i,i} + \frac{1}{\sqrt n}\sum_i\sum_j c_{i,j}
\le \frac{1}{\sqrt n}\sum_i c_{i,i} + \Big(\sum_i\langle c_{n,i}, m_n\rangle^2\Big)^{1/2},
\]
so that (6.62) implies (6.61).
The uniform quantile process. Recall from the beginning of this section that if $u_n$
is the uniform quantile process, then
\[
L_n := \int_{1/n}^{1-1/n}\frac{u_n^2(t)}{g^2(t)}\,dt
=_d \Big\|\frac{n}{S_{n+1}}\sum_{i=1}^{n+1}c_{n,i}\,\xi_i\Big\|_2^2,
\]
where $a_{n,i}$ are as defined in (6.3) and $c_{n,i} = n^{-1/2}\,a_{n,i}(t)\,I_{[1/n,1-1/n]}(t)/g(t)$. With
the help of Lemma 6.1, the results of Section 6.1 can be easily specialized to this
situation.
a) The infinitesimality condition (6.27): It follows from the definitions that
\[
\frac{1}{2n}\int_{1/n}^{1-1/n}\frac{t^2+(1-t)^2}{g^2(t)}\,dt
\le \max_i\|c_{n,i}\|_2^2
\le \frac{2}{n}\int_{1/n}^{1-1/n}\frac{dt}{g^2(t)},
\]
and from this we conclude that condition (6.27) is equivalent to
\[
\frac{1}{n}\int_{1/n}^{1-1/n}\frac{dt}{g^2(t)} \to 0. \tag{6.63}
\]
b) Convergence of $K_n$ and definition of $K$. Also from the definitions (in (6.21) and
Lemma 6.1), we have
\[
K_n(s,t) = \frac{1}{n}\,\frac{\hat K_n(s,t)}{g(s)\,g(t)}\,I\{1/n\le s,t\le 1-1/n\},
\]
where $\hat K_n$ denotes the kernel in (6.21), so that by Lemma 6.1 iii),
\[
K_n(s,t) \to K(s,t) := \frac{s\wedge t - st}{g(s)\,g(t)} \tag{6.64}
\]
pointwise; hence, by Lemma 6.1 iii) and dominated convergence, $K_n \to_{L_2} K$ if
and only if $K \in L_2((0,1)\times(0,1))$, if and only if
\[
\int_0^1\!\int_0^1\frac{(s\wedge t-st)^2}{g^2(s)\,g^2(t)}\,ds\,dt < \infty. \tag{6.65}
\]
Next we see that the limiting kernel $K$ is trace-class and the limit (6.29) holds if
and only if
\[
\int_0^1\frac{t(1-t)}{g^2(t)}\,dt < \infty. \tag{6.66}
\]
In fact, by Lemma 6.1 iii),
\[
\sum_{i=1}^{n+1}\|c_{n,i}\|_2^2
= \frac{1}{n}\int_{1/n}^{1-1/n}\frac{\hat K_n(t,t)}{g^2(t)}\,dt
\to \int_0^1\frac{t(1-t)}{g^2(t)}\,dt,
\]
regardless of whether the limiting integral is finite or not. Thus (6.66) is necessary
in order to get a finite limit in (6.29). On the other hand, if (6.66) holds and
$\{\phi_i\}$, $\{\lambda_i\}$ denote the eigenfunctions and eigenvalues of $K$, then
\[
\sum_{i=1}^\infty E\Big(\int_0^1\frac{B(t)}{g(t)}\,\phi_i(t)\,dt\Big)^2
= \sum_{i=1}^\infty\int_0^1\!\int_0^1\frac{s\wedge t - st}{g(s)\,g(t)}\,\phi_i(s)\,\phi_i(t)\,ds\,dt
= \sum_{i=1}^\infty \lambda_i < \infty.
\]
Hence, if (6.66) holds then $K$ is trace-class and (6.29) holds.
c) Convergence of $m_n$ to $m = 0$ assuming infinitesimality. If the infinitesimality
condition (6.63) holds, then
\[
\|m_n\|_2^2 = \Big\|\frac12\sum_{i=1}^{n+1}c_{n,i}\Big\|_2^2
= \frac1n\int_{1/n}^{1-1/n}\frac{\hat m_n(t)^2}{g^2(t)}\,dt
\le \frac1n\int_{1/n}^{1-1/n}\frac{dt}{g^2(t)} \to 0,
\]
showing that $m_n \to 0$ in $L_2$.
Finally, note that condition (6.66) implies conditions (6.63) and (6.65): the
first, by dominated convergence, and condition (6.65) because, since $(s\wedge t - st)^2 \le
s(1-s)\,t(1-t)$, we have
\[
\int_0^1\!\int_0^1\frac{(s\wedge t-st)^2}{g^2(s)\,g^2(t)}\,ds\,dt
\le \int_0^1\!\int_0^1\frac{s(1-s)\,t(1-t)}{g^2(s)\,g^2(t)}\,ds\,dt
= \Big(\int_0^1\frac{t(1-t)}{g^2(t)}\,dt\Big)^2.
\]
Summarizing, the above computations and the law of large numbers for $S_n/n$ give the following:
Theorem 6.16. Let $u_n(g)$ denote the weighted uniform quantile process, that is,
$u_n(g)(t) = (u_n(t)/g(t))\,I\{1/n\le t\le 1-1/n\}$, $0<t<1$, where $g$ is a non-zero
measurable function. Assume
\[
\frac1n\int_{1/n}^{1-1/n}\frac{dt}{g^2(t)} \to 0.
\]
Then the sequence of processes $\{u_n(g)\}$ is weakly convergent in $L_2(0,1)$ to a non-degenerate
limit if and only if
\[
\int_0^1\frac{t(1-t)}{g^2(t)}\,dt < \infty.
\]
In this case,
\[
u_n(g) \to_d B_g
\]
in $L_2(0,1)$, where $B_g(t) = B(t)/g(t)$ and $B$ is a Brownian bridge. In particular,
\[
\int_{1/n}^{1-1/n}\frac{u_n^2(t)}{g^2(t)}\,dt \to_d \int_0^1\frac{B^2(t)}{g^2(t)}\,dt.
\]
Only the necessity part of this theorem may be considered new; the sufficiency is
well known (see e.g., Mason (1984), Csörgő and Horváth (1988) and (1993) p. 354).
For $g(t) = \varphi(\Phi^{-1}(t))$, where $\varphi$ and $\Phi$ denote the standard normal density and
distribution function respectively, this result goes back, in one form or other, to de
Wet and Venter (1972), but it seems to be new in the generality it is given here.
See also Gregory (1977) and del Barrio, Cuesta-Albertos, Matrán and Rodríguez-
Rodríguez (1999).
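Theorem 6.16 lends itself to a quick numerical illustration. The sketch below is ours, not from the text (the helper `quantile_stat` is a hypothetical name): it takes $g \equiv 1$, for which both conditions of the theorem hold trivially, and checks that the average of $\int_0^1 u_n^2(t)\,dt$ over many samples is close to $E\int_0^1 B^2(t)\,dt = \int_0^1 t(1-t)\,dt = 1/6$.

```python
import random
import statistics

def quantile_stat(n, rng):
    """Approximate the integral of u_n(t)^2 over (0,1) for one uniform sample,
    where u_n(t) = sqrt(n)*(G_n^{-1}(t) - t) is the uniform quantile process."""
    u = sorted(rng.random() for _ in range(n))
    m = 400  # grid size for the midpoint rule
    total = 0.0
    for k in range(m):
        t = (k + 0.5) / m
        q = u[min(int(t * n), n - 1)]  # empirical quantile G_n^{-1}(t)
        total += n * (q - t) ** 2
    return total / m

rng = random.Random(0)
vals = [quantile_stat(200, rng) for _ in range(300)]
# By Theorem 6.16 (with g = 1) the statistic converges in law to the
# integral of B^2 over (0,1), whose mean is 1/6 ~ 0.167.
print(statistics.mean(vals))
```

Replacing $g\equiv1$ by a weight violating the integrability condition (for instance $g(t) = t\wedge(1-t)$) makes the simulated means grow with $n$ instead of stabilizing, in line with the necessity part of the theorem.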
Next we examine the normal convergence case (as a consequence of Corollary
6.14). We will further relax integrability (so, condition (6.65) will not hold), but,
for convenience, will impose regular variation of $g$ at at least one of the end points
0 and 1. Standard use of the basic properties of regular variation shows that
\[
\frac{1}{\sqrt{L(1/n)}}\Big(\int_{1/n}^{1-1/n}\frac{u_n^2(t)}{g^2(t)}\,dt - \int_{1/n}^{1-1/n}\frac{t(1-t)}{g^2(t)}\,dt\Big) \to_d Z.
\]
Proof. We only consider the case when $g$ is symmetric about 1/2. We can apply
Corollary 6.14 with
\[
c_{n,i}(t) = \frac{1}{n^{1/2}\,L^{1/4}(1/n)}\,\frac{a_{n,i}(t)}{g(t)}\,I\{1/n\le t\le 1-1/n\}.
\]
Now we have that
\[
2\|K_n\|_2^2 = \frac{2}{n^2 L(1/n)}\int_{1/n}^{1-1/n}\!\int_{1/n}^{1-1/n}\frac{\hat K_n^2(s,t)}{g^2(s)\,g^2(t)}\,ds\,dt
\]
and
\[
\sum_{i=1}^{n+1}\|c_{n,i}\|_2^4
= \frac{1}{n^2 L(1/n)}\int_{1/n}^{1-1/n}\!\int_{1/n}^{1-1/n}\frac{\sum_{i=1}^{n+1}a_{n,i}^2(s)\,a_{n,i}^2(t)}{g^2(s)\,g^2(t)}\,ds\,dt.
\]
We claim that
\[
2\|K_n\|_2^2 \to 1, \qquad \sum_{i=1}^{n+1}\|c_{n,i}\|_2^4 \to 0 \qquad\text{and}\qquad \sum_{i=1}^{n+1}\langle c_{n,i}, m_n\rangle^2 \to 0. \tag{6.71}
\]
In fact, from Lemma 6.1 we obtain that $\sum_{i=1}^{n+1}a_{n,i}^2(s)\,a_{n,i}^2(t) \le 3n\,(s\wedge t - st)$, which,
by Lemma 6.18, implies that
\[
\sum_{i=1}^{n+1}\|c_{n,i}\|_2^4 \le 6\,\frac{M(1/n)}{n\,L(1/n)} \to 0
\]
and proves the second part of (6.71). The first part can be obtained using Lemma
6.1 ii) and iii) to see that $|\hat K_n(s,t)^2 - n^2(s\wedge t-st)^2| = |\hat K_n(s,t)+n(s\wedge t-st)|\,|\hat K_n(s,t)-n(s\wedge t-st)| \le 8n\,(s\wedge t-st)$ and, consequently, that
\[
\big|2\|K_n\|_2^2 - 1\big|
= \frac{2}{n^2 L(1/n)}\Big|\int_{1/n}^{1-1/n}\!\int_{1/n}^{1-1/n}\frac{\hat K_n^2(s,t) - n^2(s\wedge t-st)^2}{g^2(s)\,g^2(t)}\,ds\,dt\Big|
\le 16\,\frac{M(1/n)}{n\,L(1/n)} \to 0.
\]
Finally, the third part of claim (6.71) is a consequence of
\[
\sum_{i=1}^{n+1}\langle c_{n,i}, m_n\rangle^2
= \langle K_n, m_n\otimes m_n\rangle
= \frac{1}{n^2 L(1/n)}\int_{1/n}^{1-1/n}\!\int_{1/n}^{1-1/n}\frac{\hat K_n(s,t)\,\hat m_n(s)\,\hat m_n(t)}{g^2(s)\,g^2(t)}\,ds\,dt
\le C\,\frac{M(1/n)}{n\,L(1/n)} \to 0,
\]
since $(a+b)^2 \le 2a^2 + 2b^2$. The limits (6.71) prove the first two limits in (6.55) and
the limit in (6.57) (with $\sigma^2 = 1$). Lemma 6.1 iii) gives that (6.56) is also satisfied
(with $C = 6$). Finally, the third limit in (6.55) follows from Lemma 6.18 since
\[
\|K_n\circ K_n\|_2^2 \le \frac{81\,R(1/n)}{n^4 L^2(1/n)} \to 0.
\]
Corollary 6.14 implies now that $\|Y_n\|_2^2 - E\|Y_n\|_2^2 \to_w N(0,1)$. The conditions
(6.62) from Proposition 6.15 are also satisfied because of the last two limits in
(6.71) (see the argument immediately before Theorem 6.17) and therefore we also
have $\|Z_n\|_2^2 - E\|Y_n\|_2^2 \to_w N(0,1)$. Now we are only left with showing that we
can replace $E\|Y_n\|_2^2$ by $L^{-1/2}(1/n)\int_{1/n}^{1-1/n} t(1-t)\,g^{-2}(t)\,dt$ as centering constants.
Arguing as in the proof of Theorem 6.17 we see that
\[
\Big|E\|Y_n\|_2^2 - \frac{1}{L^{1/2}(1/n)}\int_{1/n}^{1-1/n}\frac{t(1-t)}{g^2(t)}\,dt\Big|
\le \frac{4}{n\,L^{1/2}(1/n)}\int_{1/n}^{1-1/n}\frac{dt}{g^2(t)} \to 0,
\]
where the last limit is a consequence of (6.59).
\[
\cdots \to_d \frac{c^2}{1+c^2}\int_1^\infty\frac{(S^{(1)}_{[y]}-y)^2}{y^2}\,dy
+ \frac{1}{1+c^2}\int_1^\infty\frac{(S^{(2)}_{[y]}-y)^2}{y^2}\,dy,
\]
where $\{S^{(1)}_{[y]} : y\ge1\}$ is the partial sum process associated to the sequence $\{\xi_j^{(1)}\}$ of
independent exponential random variables, that is, $S^{(1)}_{[y]} = \sum_{j=1}^{[y]}\xi_j^{(1)}$, and $\{S^{(2)}_{[y]}\}$ is
an independent copy of $\{S^{(1)}_{[y]}\}$.
Proof. As above, we only consider the case when $g$ is symmetric. Symmetry of $g$
and the fact that $a_{n,j}(1-t) = a_{n,n+2-j}(t)$ show that
\[
\frac{1}{E(1/n)}\int_{1/n}^{1-1/n}\frac{u_n^2(t)}{g^2(t)}\,dt
=_d \Big(\frac{n}{S_{n+1}}\Big)^2\frac{1}{n\,E(1/n)}\int_{1/n}^{1-1/n}\frac{\big(\sum_{j=1}^{n+1}a_{n,j}(t)\,\xi_j\big)^2}{g^2(t)}\,dt
= \Big(\frac{n}{S_{n+1}}\Big)^2\big(V_n^{(1)} + V_n^{(2)}\big),
\]
where
\[
V_n^{(1)} = \frac{1}{n\,E(1/n)}\int_{1/n}^{1/2}\frac{\big(\sum_{j=1}^{n+1}a_{n,j}(t)\,\xi_j\big)^2}{g^2(t)}\,dt
\quad\text{and}\quad
V_n^{(2)} = \frac{1}{n\,E(1/n)}\int_{1/n}^{1/2}\frac{\big(\sum_{j=1}^{n+1}a_{n,j}(t)\,\xi_{n+2-j}\big)^2}{g^2(t)}\,dt.
\]
We set $b_{n,j}(t) = I\{j-1<nt\}$ and define
\[
W_n^{(1)} = \frac{1}{n\,E(1/n)}\int_{1/n}^{1/2}\frac{\big(\sum_{j=1}^{n+1}b_{n,j}(t)\,\xi_j - nt\big)^2}{g^2(t)}\,dt
\]
and, similarly, $W_n^{(2)}$, replacing $\xi_j$ with $\xi_{n+2-j}$. Now, since
\[
\big|(V_n^{(1)})^{1/2} - (W_n^{(1)})^{1/2}\big|^2
\le n\Big(1-\frac{S_{n+1}}{n}\Big)^2\frac{1}{n\,E(1/n)}\int_{1/n}^{1/2}\frac{t^2}{g^2(t)}\,dt
= O_P(1)\,o(1) \to_{Pr} 0
\]
with
\[
c_{n,i,j} = \frac{1}{n^2 E(1/n)}\int_{i\vee j-1}^{n/2}\frac{dy}{g^2(y/n)}\ \text{ if } i\wedge j>1,
\qquad c_{n,1,1} = c_{n,2,2},
\]
\[
d_{n,i} = \frac{1}{n^2 E(1/n)}\int_{i-1}^{n/2}\frac{([y]-y)\,dy}{g^2(y/n)}\ \text{ if } i>1,
\qquad d_{n,1} = d_{n,2},
\qquad
e_n = \frac{1}{n^2 E(1/n)}\int_1^{n/2}\frac{([y]-y)^2}{g^2(y/n)}\,dy.
\]
Similarly,
\[
\int_1^\infty\frac{(S^{(1)}_{[y]}-y)^2}{y^2}\,dy
= \sum_{i,j}\gamma_{i,j}(\xi_i-1)(\xi_j-1) + 2\sum_i\beta_i(\xi_i-1) + \epsilon,
\]
where $\gamma_{i,j} = \int_{i\vee j-1}^\infty\frac{dy}{y^2}$ if $i\wedge j>1$, $\gamma_{1,1} = \gamma_{2,2}$,
$\beta_i = \int_{i-1}^\infty\frac{([y]-y)\,dy}{y^2}$ if $i>1$, $\beta_1 = \beta_2$, and
$\epsilon = \int_1^\infty\frac{([y]-y)^2}{y^2}\,dy$. Standard regular variation techniques
show that $\sum_{i,j}(c_{n,i,j}-\gamma_{i,j})^2 \to 0$, $\sum_i(d_{n,i}-\beta_i)^2 \to 0$ and $e_n \to \epsilon$, yielding
the result as in the previous subsection.
The general quantile process. By Proposition 6.5, we can transfer the results in
the previous subsection to the general quantile process just by taking $g(t) = f(F^{-1}(t))/\sqrt{w(t)}$.
Let us denote as General Hypotheses or (GH) the following
conditions on the cdf $F$ and the weight $w$:
(GH) $F$ is twice differentiable on its open support $(a_F, b_F)$ with $f(x) = F'(x) >
0$ there, and satisfies conditions (6.7), (6.13) and (6.14); $w$ is a non-negative
measurable function on $(0,1)$ and satisfies condition (6.15).
These, together with (6.11), are the conditions under which we can transfer results
on $u_n$ to $v_n$ by Proposition 2.5. (6.11) is not included because it will be subsumed
by other conditions; in fact, because, by dominated convergence,
\[
\int_0^1\frac{t(1-t)}{f^2(F^{-1}(t))}\,w(t)\,dt < \infty
\ \Longrightarrow\
\frac1n\int_{1/n}^{1-1/n}\frac{w(t)\,dt}{f^2(F^{-1}(t))} \to 0, \quad\text{which is (6.11)}.
\]
We then have:

Theorem 6.21. Let $B$ be a Brownian bridge on $(0,1)$ and let $Z$ be a standard
normal random variable.
a) If $F$ and $w$ satisfy (GH) and
\[
\int_0^1\frac{t(1-t)}{f^2(F^{-1}(t))}\,w(t)\,dt < \infty \tag{6.73}
\]
then
\[
v_n(t)\sqrt{w(t)} \to \frac{B(t)\sqrt{w(t)}}{f(F^{-1}(t))} \quad\text{in law in } L_2(0,1);
\]
in particular,
\[
\int_0^1 v_n^2(t)\,w(t)\,dt \to \int_0^1\frac{B^2(t)}{f^2(F^{-1}(t))}\,w(t)\,dt \quad\text{in distribution.}
\]
b) If $F$ and $w$ satisfy (GH) and
\[
\frac1n\int_{1/n}^{1-1/n}\frac{t^{1/2}(1-t)^{1/2}}{f^2(F^{-1}(t))}\,w(t)\,dt \to 0
\]
and
\[
\int_0^1\!\int_0^1\frac{(s\wedge t-st)^2}{f^2(F^{-1}(s))\,f^2(F^{-1}(t))}\,w(s)\,w(t)\,ds\,dt < \infty \tag{6.74}
\]
then
\[
\int_0^1 v_n^2(t)\,w(t)\,dt - \int_{1/n}^{1-1/n}\frac{t(1-t)}{f^2(F^{-1}(t))}\,w(t)\,dt
\to_w \int_0^1\frac{B^2(t)-EB^2(t)}{f^2(F^{-1}(t))}\,w(t)\,dt.
\]
c) Assume $F$ is twice differentiable on its open support $(a_F, b_F)$ with $f(x) =
F'(x) > 0$ there, that $F$ satisfies condition (6.7) and that the function $g :=
f(F^{-1})$ is regularly varying with exponent one at at least one of the two points
0 and 1, and with exponent not larger than one in the other. Assume also
that
\[
L(x) := 2\int_x^{1-x}\!\int_x^{1-x}\frac{(s\wedge t-st)^2}{f^2(F^{-1}(s))\,f^2(F^{-1}(t))}\,ds\,dt \to \infty \tag{6.75}
\]
as $x\to0$. Then,
\[
\frac{1}{\sqrt{L(1/n)}}\Big(\int_0^1 v_n^2(t)\,dt - \int_{1/n}^{1-1/n}\frac{t(1-t)}{f^2(F^{-1}(t))}\,dt\Big) \to Z
\quad\text{in distribution.} \tag{6.76}
\]
Proof. By Proposition ?? and the remark above on condition (6.11), the statements
a) and b) of the theorem do not require proof. But part c) does (Proposition ??
does not apply in this case). As usual, we assume $f(F^{-1})$ symmetric about 1/2. If
we can replace $\|v_n\|_2^2$ in (6.76) by $\|u_n/f(F^{-1})\|_{2,n}^2 = \int_{1/n}^{1-1/n}u_n^2(t)/f^2(F^{-1}(t))\,dt$,
the result will follow from Theorem 6.19. By the proof of Lemma 6.4, we can
replace $\|v_n\|_2^2$ by $\|v_n\|_{2,n}^2$ if we show that
\[
\lim_{x\to0}\frac{x^2}{f^2(F^{-1}(x))\sqrt{L(x)}} = 0
\quad\text{and}\quad
\lim_{x\to0}\frac{1}{x\sqrt{L(x)}}\int_0^x\big(F^{-1}(x)-F^{-1}(t)\big)^2\,dt = 0,
\]
and, by the proof of Lemma 6.3 (see (2.20) and (2.21)), we can replace $\|v_n\|_{2,n}^2$ by
$\|u_n/f(F^{-1})\|_{2,n}^2$ if
\[
\lim_n\frac{1}{n\sqrt{L(1/n)}}\int_{1/n}^{1-1/n}\frac{t^{1/2}(1-t)^{1/2}}{f^2(F^{-1}(t))}\,dt = 0.
\]
The first and third of these limits follow just as the limit (??) in the proof of
Lemma 6.18, using L'Hôpital and the equivalence (??). To show that the second
limit also holds, let us set $h(x) = \int_0^x\big(F^{-1}(x)-F^{-1}(t)\big)^2\,dt$ and observe that
\[
h'(x) = \frac{2}{f(F^{-1}(x))}\int_0^x\big(F^{-1}(x)-F^{-1}(t)\big)\,dt
= \frac{2}{f(F^{-1}(x))}\int_0^x\!\int_t^x\frac{du\,dt}{f(F^{-1}(u))}
= \frac{2}{f(F^{-1}(x))}\int_0^x\frac{u\,du}{f(F^{-1}(u))}
\sim \frac{2x^2}{f^2(F^{-1}(x))},
\]
the last equivalence being a consequence of regular variation. This, (??) and
regular variation imply, in turn,
\[
h(x) = \int_0^x h'(y)\,dy \sim 2\int_0^x\frac{y^2}{f^2(F^{-1}(y))}\,dy
\sim \frac{2x^3}{f^2(F^{-1}(x))}
\sim 2x^2\int_x^{1/2}\frac{dy}{f^2(F^{-1}(y))}
\]
and, consequently, that
\[
\lim_{x\to0}\frac{1}{x\sqrt{L(x)}}\int_0^x\big(F^{-1}(x)-F^{-1}(t)\big)^2\,dt
= 2\lim_{x\to0}\frac{x\int_x^{1/2} f^{-2}(F^{-1}(y))\,dy}{\sqrt{L(x)}} = 0
\]
by (??).
Examples. a) Consider the distribution functions
\[
F_\alpha(x) = \begin{cases} \tfrac12 e^{-|x|^\alpha} & \text{if } x\le0\\[2pt] 1-\tfrac12 e^{-x^\alpha} & \text{if } x\ge0\end{cases}
\qquad\text{for } \alpha>0,
\]
and take $w\equiv1$. Let $f_\alpha$ be the corresponding densities, which are symmetric about
zero. Then, it is easy but somewhat cumbersome to see that
\[
f_\alpha\big(F_\alpha^{-1}(x)\big) \approx \alpha\,x(1-x)\,\big|\log x(1-x)\big|^{(\alpha-1)/\alpha},
\qquad
f_\alpha'\big(F_\alpha^{-1}(x)\big) \approx \alpha^2\,x(1-x)\,\big|\log x(1-x)\big|^{(2\alpha-2)/\alpha},
\]
where $a(x)\approx b(x)$ means that $0<\lim_{x\to0}a(x)/b(x)<\infty$ and likewise for $x\to1$,
whereas $0<\inf_{t\in I}a(t)/b(t)\le\sup_{t\in I}a(t)/b(t)<\infty$ for any closed interval $I$ con-
tained in $(0,1)$ (for instance, $f_\alpha(F_\alpha^{-1}(x)) = \alpha x|\log 2x|^{(\alpha-1)/\alpha}$ for $x\le1/2$ and
$f_\alpha(F_\alpha^{-1}(x)) = \alpha(1-x)|\log 2(1-x)|^{(\alpha-1)/\alpha}$ for $x\ge1/2$). So, $f_\alpha(F_\alpha^{-1})$ is symmetric
about 1/2 and of regular variation with exponent 1 at 0 (and at 1). It then follows
easily that $F_\alpha$ is in case a) of the above theorem iff $\alpha>2$, in case b) iff $4/3<\alpha\le2$,
hence for the normal distribution, and in case c) iff $0<\alpha\le4/3$, in particular for the
symmetric exponential distribution. As mentioned above, if the tail probabilities
are of different order, the largest dominates and these theorems still hold, so that
the same conclusions apply to the one-sided families.
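For the symmetric exponential (Laplace) case $\alpha=1$ the formulas above become exact, and a few lines of code (ours, purely illustrative) verify both the identity $f_1(F_1^{-1}(t)) = t\wedge(1-t)$ and the divergence of $\int_0^1 t(1-t)/f_1^2(F_1^{-1}(t))\,dt$ that places the Laplace family in case c):

```python
import math

def F(x):
    """Standard Laplace cdf (the alpha = 1 member of the family above)."""
    return 0.5 * math.exp(x) if x <= 0 else 1.0 - 0.5 * math.exp(-x)

def Finv(t):
    return math.log(2.0 * t) if t <= 0.5 else -math.log(2.0 * (1.0 - t))

def f(x):
    """Laplace density."""
    return 0.5 * math.exp(-abs(x))

for t in (0.01, 0.2, 0.5, 0.9, 0.999):
    assert abs(F(Finv(t)) - t) < 1e-12
    # f(F^{-1}(t)) = min(t, 1-t): regular variation with exponent 1 at 0 and 1
    assert abs(f(Finv(t)) - min(t, 1.0 - t)) < 1e-12

def truncated_integral(eps, m=100_000):
    """Midpoint-rule value of the integral of t(1-t)/f(F^{-1}(t))^2 on (eps, 1-eps)."""
    h = (0.5 - eps) / m
    s = sum((t := eps + (k + 0.5) * h) * (1 - t) / min(t, 1 - t) ** 2 * h
            for k in range(m))
    return 2.0 * s  # the integrand is symmetric about 1/2

print(truncated_integral(1e-3), truncated_integral(1e-6))  # keeps growing as eps -> 0
```

The divergence is only logarithmic, which is consistent with the normalization $L(1/n)$ of Theorem 6.21 c) growing slowly for this family.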
b) Likewise, if … and for the extreme value distribution $H(x) = \exp(-e^{-x})$, also in case c),
\[
\frac{1}{2\sqrt{\log n}}\Big(\int_0^1 v_n^2(t)\,dt - \log n\Big) \to Z \quad\text{in distribution.}
\]
where $c_n = \int_0^{1-1/n}(F^{-1}(t))^2\,dt + \int_{1/n}^1(F^{-1}(t))^2\,dt$. In both cases $\{S^{(1)}_{[y]+1} : y\ge0\}$ is the
partial sum process associated to the sequence $\{\xi_j\}$ of independent exponential random
variables, that is, $S^{(1)}_{[y]+1} = \sum_{j=1}^{[y]+1}\xi_j$, and $\{S^{(2)}_{[y]+1} : y\ge0\}$ is an independent
copy of $\{S^{(1)}_{[y]+1}\}$.
l log
0. b) If l RV (0) and < 0 then limn n1 = 0.
l n
l n
(log n) 2(log n) exp dt = 2(log n)/2 0.
l n1 2 n1 t
Proof of Theorem 6.22. We will assume in this proof that $0 > \rho > -1/2$. The case
$\rho = -1/2$ can be handled with straightforward changes. We set, as in the proof of
Proposition 6.20, $b_{n,j}(t) = I\{j-1<nt\}$ and $H^{-1}(x) = F^{-1}(1-x)$ and observe,
using the fact that $b_{n,j}(1-t) = 1-b_{n,n+2-j}(t)$ except in a null set, that
\[
\frac{1}{E(1/n)}\int_0^1 v_n^2(t)\,dt
= \frac{1}{E(1/n)}\int_0^{\frac{\log n}{n}}\Big(F^{-1}\Big(\frac{1}{S_{n+1}}\sum_{j=1}^{n+1}b_{n,j}(t)\,\xi_j\Big)-F^{-1}(t)\Big)^2 dt
\]
\[
\quad+ \frac{1}{E(1/n)}\int_0^{\frac{\log n}{n}}\Big(H^{-1}\Big(\frac{1}{S_{n+1}}\sum_{j=1}^{n+1}b_{n,j}(t)\,\xi_{n+2-j}\Big)-H^{-1}(t)\Big)^2 dt
+ \frac{1}{E(1/n)}\int_{\frac{\log n}{n}}^{1-\frac{\log n}{n}} v_n^2(t)\,dt
=: V_n^{(1)} + V_n^{(2)} + V_n^{(3)}.
\]
We also set
\[
W_n^{(1)} := \frac{1}{E(1/n)}\int_0^{\frac{\log n}{n}}\Big(F^{-1}\Big(\frac1n\sum_{j=1}^{n+1}b_{n,j}(t)\,\xi_j\Big)-F^{-1}(t)\Big)^2 dt
\]
and
\[
W_n^{(2)} := \frac{1}{E(1/n)}\int_0^{\frac{\log n}{n}}\Big(H^{-1}\Big(\frac1n\sum_{j=1}^{n+1}b_{n,j}(t)\,\xi_{n+2-j}\Big)-H^{-1}(t)\Big)^2 dt.
\]
Observe that $W_n^{(1)}$ and $W_n^{(2)}$ are independent since they are functions of disjoint
sets of independent exponential r.v.'s $\xi_j$. We will proceed now by showing that
the central part, $V_n^{(3)}$, is negligible and that the upper and lower integrals are
asymptotically independent and weakly convergent to the above stated limits.
This will be achieved by proving the following three claims:

Claim 1. $V_n^{(3)} \to_{Pr} 0$.

Claim 2. $(V_n^{(i)})^{1/2} - (W_n^{(i)})^{1/2} \to_{Pr} 0$, $i=1,2$.

Claim 3. $W_n^{(1)} \to_w \dfrac{2c^2}{|\rho|(1+c^2)}\displaystyle\int_0^\infty\big((S^{(1)}_{[y]+1})^{\rho}-y^{\rho}\big)^2\,dy$ and
$W_n^{(2)} \to_w \dfrac{2}{|\rho|(1+c^2)}\displaystyle\int_0^\infty\big((S^{(2)}_{[y]+1})^{\rho}-y^{\rho}\big)^2\,dy$.
Proof of Claim 1. We show first that
\[
V_n^{(3)} - \frac{1}{E(1/n)}\int_{\frac{\log n}{n}}^{1-\frac{\log n}{n}}\frac{u_n^2(t)}{f^2(F^{-1}(t))}\,dt \to_{Pr} 0.
\]
As in the proof of Proposition 6.3 this reduces to showing that
\[
\frac{1}{n\,E(1/n)}\int_{\frac{\log n}{n}}^{1-\frac{\log n}{n}}\frac{dt}{f^2(F^{-1}(t))} \to 0
\quad\text{and}\quad
\frac{1}{n\,E(1/n)}\int_{\frac{\log n}{n}}^{1-\frac{\log n}{n}}\frac{t^{1/2}(1-t)^{1/2}}{f^2(F^{-1}(t))}\,dt \to 0.
\]
To ease the computations we will assume in the remainder of the proof of this
claim that $c=1$ and replace $E(1/n)$ by $F^{-1}(1/n)^2$ in the last two denominators (the
ratio of the two sequences converges, by regular variation, to a positive constant).
Extension to general $c$ is straightforward. Regular variation implies that
\[
\lim_{x\to0}\frac{x/f(F^{-1}(x))}{F^{-1}(x)} = \rho
\quad\text{and}\quad
\lim_{x\to0}\frac{x/f^2(F^{-1}(x))}{\int_x^{1-x} f^{-2}(F^{-1}(t))\,dt} = 1-2\rho,
\]
where $l^{(1)}(x) = x/f^2(F^{-1}(x)) \in RV_{2\rho-1}(0)$ and $2\rho-1 \in (-2,-1)$, and the last
limit follows from Lemma 6.23. Similarly,
\[
\lim_n\frac{1}{n\,F^{-1}(1/n)^2}\int_{\frac{\log n}{n}}^{1-\frac{\log n}{n}}\frac{t^{1/2}(1-t)^{1/2}}{f^2(F^{-1}(t))}\,dt
= \frac{\rho^2}{1/2-2\rho}\,\lim_n\frac{l^{(2)}(\log n/n)}{n^{1/2}\,l^{(2)}(1/n)} = 0,
\]
since $l^{(2)}(x) = x^{3/2}/f^2(F^{-1}(x)) \in RV_{2\rho-1/2}(0)$ and $2\rho-1/2 \in (-3/2,-1/2)$. Now
we can prove Claim 1 by showing that
\[
\frac{1}{F^{-1}(1/n)^2}\int_{\frac{\log n}{n}}^{1-\frac{\log n}{n}}\frac{u_n^2(t)}{f^2(F^{-1}(t))}\,dt \to_{Pr} 0.
\]
But, using again regular variation properties and Lemma 6.23, we have that
\[
\lim_n\frac{1}{F^{-1}(1/n)^2}\int_{\frac{\log n}{n}}^{1-\frac{\log n}{n}}\frac{t(1-t)}{f^2(F^{-1}(t))}\,dt
= \frac{\rho^2}{|2\rho|}\,\lim_n\frac{l^{(3)}(\log n/n)}{l^{(3)}(1/n)} = 0,
\]
now with $l^{(3)}(x) = x^2/f^2(F^{-1}(x)) \in RV_{2\rho}(0)$ and $2\rho \in (-1,0)$. This completes
the proof of Claim 1.
Proof of Claim 2. We will show that $(V_n^{(1)})^{1/2} - (W_n^{(1)})^{1/2} \to_{Pr} 0$. It suffices to
show that
\[
\frac{1}{F^{-1}(1/n)^2}\int_0^{\log n}\big(F^{-1}(S^{(1)}_{[y]+1}/S_{n+1}) - F^{-1}(S^{(1)}_{[y]+1}/n)\big)^2\,dy \to_{Pr} 0.
\]
To ease the notation we will omit the superscript from $S^{(1)}_{[y]+1}$. Similarly as in (6.10)
we consider a Taylor expansion
\[
F^{-1}\Big(\frac{S_j}{S_{n+1}}\Big) - F^{-1}\Big(\frac{S_j}{n}\Big)
= \Big(\frac{S_j}{S_{n+1}}-\frac{S_j}{n}\Big)\frac{1}{f(F^{-1}(S_j/n))}
- \frac12\Big(\frac{S_j}{S_{n+1}}-\frac{S_j}{n}\Big)^2\frac{f'(F^{-1}(\theta))}{f^3(F^{-1}(\theta))},
\]
for some $\theta$ between $S_j/S_{n+1}$ and $S_j/n$, which enables us to write, using the obvious
analogues of (??) and (??), and the fact that $\sup_{n\ge1} nE\big(1-\frac{S_{n+1}}{n}\big)^2 < \infty$, that
\[
\Big|F^{-1}\Big(\frac{S_j}{S_{n+1}}\Big) - F^{-1}\Big(\frac{S_j}{n}\Big)\Big|
\le O_P(1)\Big(\frac{1}{\sqrt n}\,\frac{S_j/n}{f(F^{-1}(S_j/n))} + \frac1n\,\frac{S_j/n}{f(F^{-1}(S_j/n))}\Big)
\le O_P(1)\,\frac{1}{\sqrt n}\,\frac{S_j/n}{f(F^{-1}(S_j/n))},
\]
where $O_P(1)$ stands for a stochastically bounded sequence which does not depend
on $j\in[1,\log n]$. We take now $\varepsilon>0$ such that $2\rho-3+\varepsilon<0$. From this bound
and regular variation (Lemma 6.23 c)) we obtain that
\[
\int_0^{\log n}\big(F^{-1}(S_{[y]+1}/S_{n+1}) - F^{-1}(S_{[y]+1}/n)\big)^2\,dy
\le O_P(1)\,\frac1n\int_0^{\log n}\frac{(S_{[y]+1}/n)^2}{f^2(F^{-1}(S_{[y]+1}/n))}\,dy
\]
\[
\le O_P(1)\,\frac1n\sum_{j=1}^{[\log n+1]}\frac{(S_j/n)^2}{f^2(F^{-1}(S_j/n))}
\le O_P(1)\,n^{2\rho-3+\varepsilon}\sum_{j=1}^{[\log n+1]} j^{2-2\rho} \to 0.
\]
This completes the proof of Claim 2 (note that we need not divide by $F^{-1}(1/n)^2$ to
obtain the equivalence of the two sequences if $\rho > -1/2$; if $\rho = -1/2$ that division
still gives the result).
Proof of Claim 3. We set
\[
W_{n,k}^{(1)} = \frac{1}{E(1/n)}\int_0^{k/n}\Big(F^{-1}\Big(\frac1n\sum_{j=1}^{n+1}b_{n,j}(t)\,\xi_j\Big)-F^{-1}(t)\Big)^2 dt
\quad\text{and}\quad
W_k^{(1)} = \frac{2c^2}{|\rho|(1+c^2)}\int_0^k\big((S^{(1)}_{[y]+1})^{\rho}-y^{\rho}\big)^2\,dy.
\]
With the change of variable $t = y/n$ we can rewrite $W_{n,k}^{(1)}$ as
\[
W_{n,k}^{(1)} = b_n\int_0^k\Big(\frac{F^{-1}(S^{(1)}_{[y]+1}/n) - F^{-1}(y/n)}{F^{-1}(1/n)}\Big)^2\,dy,
\]
where $b_n = F^{-1}(1/n)^2/(n\,E(1/n)) \to \frac{2c^2}{|\rho|(1+c^2)}$, and we conclude that, by regular varia-
tion, $W_{n,k}^{(1)} \to_{Pr} W_k^{(1)}$. To prove that $W_n^{(1)} \to_w \frac{2c^2}{|\rho|(1+c^2)}\int_0^\infty\big((S^{(1)}_{[y]+1})^{\rho}-y^{\rho}\big)^2\,dy$ it
suffices, using a $3\varepsilon$ argument, to show that
\[
\lim_k\limsup_n P\big(|W_{n,k}^{(1)} - W_n^{(1)}| > \varepsilon\big) = 0 \tag{6.77}
\]
for all $\varepsilon>0$. As in the proof of Claim 2 we consider a Taylor expansion
\[
F^{-1}\Big(\frac{S_{[y]+1}}{n}\Big) - F^{-1}\Big(\frac{y}{n}\Big)
= \frac{S_{[y]+1}-y}{n\,f(F^{-1}(y/n))}
- \frac{(S_{[y]+1}-y)^2}{2n^2}\,\frac{f'(F^{-1}(\theta))}{f^3(F^{-1}(\theta))},
\]
for some $\theta$ between $S_{[y]+1}/n$ and $y/n$, which enables us to write, using the obvious
equivalents of (??) and (??), and the fact that $\sup_{y\ge1} E\big((S_{[y]+1}-y)/y^{1/2}\big)^2 < \infty$,
that
\[
\Big|F^{-1}\Big(\frac{S_{[y]+1}}{n}\Big) - F^{-1}\Big(\frac{y}{n}\Big)\Big|
\le O_P(1)\Big(\frac{\sqrt y/n}{f(F^{-1}(y/n))} + \frac1n\,\frac{1}{f(F^{-1}(y/n))}\Big),
\]
where $O_P(1)$ stands for a stochastically bounded sequence which does not depend
on $y\in[k,\log n]$. From this bound we obtain that
\[
|W_{n,k}^{(1)} - W_n^{(1)}|
= \frac{b_n}{F^{-1}(1/n)^2}\int_k^{\log n}\big(F^{-1}(S_{[y]+1}/n) - F^{-1}(y/n)\big)^2\,dy
\]
\[
\le O_P(1)\,\frac{1}{F^{-1}(1/n)^2}\Big(\frac{1}{n^2}\int_k^{\log n}\frac{y\,dy}{f^2(F^{-1}(y/n))} + \frac{1}{n^2}\int_k^{\log n}\frac{dy}{f^2(F^{-1}(y/n))}\Big)
= O_P(1)\,\frac{1}{F^{-1}(1/n)^2}\Big(\frac1n\int_{k/n}^{\frac{\log n}{n}}\frac{t\,dt}{f^2(F^{-1}(t))} + \frac1n\int_{k/n}^{\frac{\log n}{n}}\frac{dt}{f^2(F^{-1}(t))}\Big).
\]
From regular variation we obtain that
\[
\frac{1}{F^{-1}(1/n)^2}\Big(\frac1n\int_{k/n}^{\frac{\log n}{n}}\frac{t\,dt}{f^2(F^{-1}(t))} + \frac1n\int_{k/n}^{\frac{\log n}{n}}\frac{dt}{f^2(F^{-1}(t))}\Big)
\le C_1 k^{2\rho} + C_2 k^{2\rho-1}
\]
and this, combined with the last estimate and the fact that $2\rho < 0$, completes
the proof of (6.77) and, consequently, of Claim 3.
relates to the quantile process vn via equations (5.6) (assuming conditions (5.3)-
(5.5)).
Theorem 6.24. Let $w$ be a non-negative measurable function satisfying condition
(5.3). Let $H$ be a location scale family of distributions as defined in the Introduc-
tion, such that $\int_0^1(F^{-1}(t))^2\,w(t)\,dt < \infty$ for any (hence for all) $F\in H$, let $G_0\in H$
be chosen so as to satisfy conditions (5.4) and (5.5) and let $g_0 = G_0'$. Assume the
distribution functions $F\in H$ and the weight $w$ satisfy conditions (GH) and that,
moreover,
\[
\int_0^1\frac{t(1-t)}{f^2(F^{-1}(t))}\,w(t)\,dt < \infty. \tag{6.78}
\]
Then, under the null hypothesis $F\in H$, we have
\[
nR_n^w \to_d \int_0^1\frac{B^2(t)}{g_0^2(G_0^{-1}(t))}\,w(t)\,dt
- \Big(\int_0^1\frac{B(t)}{g_0(G_0^{-1}(t))}\,w(t)\,dt\Big)^2
- \Big(\int_0^1\frac{B(t)\,G_0^{-1}(t)}{g_0(G_0^{-1}(t))}\,w(t)\,dt\Big)^2. \tag{6.79}
\]
If $H$ in Theorem 6.24 were only a location family or only a scale family then
the limit would exhibit only the loss of one degree of freedom, that is, one of the
last two integrals would be absent from the limit in (6.79): see Csörgő (2002),
where a theorem of this sort for scale families is proved.
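For intuition about this kind of statistic, here is a small simulation (our illustrative code, with $w\equiv1$ and the Gaussian location scale family, i.e., the Shapiro-Wilk-type situation of [29]). It uses the representation $R_n = 1 - \big(\int_0^1 F_n^{-1}(t)\,\Phi^{-1}(t)\,dt\big)^2/S_n^2$ of the minimized, variance-normalized $L_2$-Wasserstein distance to the family, with the cell integrals of $\Phi^{-1}$ approximated by a midpoint rule:

```python
import random
from statistics import NormalDist, fmean

def wasserstein_Rn(xs, m=50):
    """R_n = W_2^2(F_n, H)/S_n^2 for the Gaussian location scale family H
    (sketch: the integral of F_n^{-1} * Phi^{-1} is computed cell by cell)."""
    n = len(xs)
    xs = sorted(xs)
    nd = NormalDist()
    cross = 0.0
    for i, x in enumerate(xs):
        a, b = i / n, (i + 1) / n
        # average of Phi^{-1} over the cell (i/n, (i+1)/n), midpoint rule
        cell = fmean(nd.inv_cdf(a + (b - a) * (k + 0.5) / m) for k in range(m))
        cross += x * cell * (b - a)
    mean = fmean(xs)
    s2 = fmean((x - mean) ** 2 for x in xs)
    return 1.0 - cross ** 2 / s2

rng = random.Random(1)
normal_sample = [rng.gauss(0.0, 2.0) for _ in range(500)]
print(wasserstein_Rn(normal_sample))  # small under the null hypothesis
```

Under the null, $nR_n$ stays bounded up to the slowly growing centering of Theorem 6.25, so $R_n$ itself is small for moderate $n$; for data far from every Gaussian shape, $R_n$ stabilizes at a positive constant.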
Theorem 6.25. Under the hypotheses of Theorem 6.24, except that now condition
(6.78) is replaced by the weaker conditions (6.11) and
\[
\int_0^1\!\int_0^1\frac{(s\wedge t-st)^2}{g_0^2(G_0^{-1}(s))\,g_0^2(G_0^{-1}(t))}\,w(s)\,w(t)\,ds\,dt < \infty, \tag{6.80}
\]
we have
\[
nR_n^w - \int_{1/n}^{1-1/n}\frac{t(1-t)}{g_0^2(G_0^{-1}(t))}\,w(t)\,dt
\to_d \int_0^1\frac{B^2(t)-EB^2(t)}{g_0^2(G_0^{-1}(t))}\,w(t)\,dt
- \Big(\int_0^1\frac{B(t)}{g_0(G_0^{-1}(t))}\,w(t)\,dt\Big)^2
- \Big(\int_0^1\frac{B(t)\,G_0^{-1}(t)}{g_0(G_0^{-1}(t))}\,w(t)\,dt\Big)^2.
\]
A version of this theorem for scale families is proved in Csörgő (2002); however,
the hypotheses there are stronger by factors of the order of $\log n$ or $(\log n)^2$, the
integrals at the end points are not treated analytically and the proof is different
(it relies on strong approximations, which account for the stronger assumptions).
Next we consider convergence to a normal distribution. This case is less
interesting in connection with testing since, as indicated in the Introduction,
$\int_0^1 v_n^2(t)\,w(t)\,dt \to_d \int_0^1\frac{B^2(t)}{f^2(F^{-1}(t))}\,w(t)\,dt$ if $f$ does not vanish on $\operatorname{supp}F$, and
therefore, if we divide by $\sqrt{L(1/n)} \to \infty$, as we must by Theorem 6.21 c), this
part of the statistic has no influence on the limit. So, when a distribution satisfies
the hypotheses of Theorem 6.21 c) (meaning that $g = f(F^{-1})$ is regularly varying
with exponent 1 and $L(x)$ with this $g$ tends to infinity), if one wishes to have a
sensible test of fit, it is probably best to find a weight $w$ so that one can apply
Theorems 6.24 or 6.25. Hence, we will only consider the normal convergence case
with weight $w\equiv1$.
Theorem 6.26. Let $H$ be a location scale family of distributions and assume for
simplicity that the distribution $G_0\in H$ with mean zero and variance one is the
distribution function of a symmetric random variable. Assume that the follow-
ing conditions hold for some (hence for all) $F\in H$: $F$ is twice differentiable on
$(a_F, b_F)$ and satisfies condition (6.7), its density $f$ is strictly positive on $(a_F, b_F)$,
the function $f(F^{-1})$ is regularly varying of exponent 1 at 0 and at 1, and $L(x)\to\infty$
as $x\to0$. Then, under the null hypothesis $F\in H$,
\[
\frac{1}{\sqrt{L(1/n)}}\Big(nR_n - \int_{1/n}^{1-1/n}\frac{t(1-t)}{g_0^2(G_0^{-1}(t))}\,dt\Big) \to_d Z, \tag{6.81}
\]
by the central limit theorem, and $\sigma^2(F_n) \to 1$ a.s. by the law of large numbers.
So, it suffices to show that
\[
\frac{1}{\sqrt{L(1/n)}}\Big(\|v_n\|_2^2 - \langle v_n, F^{-1}\rangle^2 - \int_{1/n}^{1-1/n}\frac{t(1-t)}{f^2(F^{-1}(t))}\,dt\Big) \to_d Z.
\]
The arguments in the proof of Theorem 6.21 c) not only show that here we can
replace $\|v_n\|_2^2$ by $\|u_n/f(F^{-1})\|_{2,n}^2$, but also that $\langle v_n, F^{-1}\rangle^2$ can be replaced by
$\langle u_n/f(F^{-1}), F^{-1}\rangle_n^2$; therefore, the theorem will follow from Theorem 6.21 c) (hence
from Theorem 6.19) if we show that the sequence
\[
\Big\langle\frac{u_n}{f(F^{-1})}, F^{-1}\Big\rangle_n := \int_{1/n}^{1-1/n}\frac{u_n(t)\,F^{-1}(t)}{f(F^{-1}(t))}\,dt, \qquad n\in\mathbb N, \tag{6.82}
\]
is stochastically bounded (as it will then tend to zero upon dividing by $\sqrt{L(1/n)}$).
For this, we show that the product of the $n$th variable in (6.82) by $S_{n+1}/n$ has
expected value tending to zero and variance dominated by a constant independent
of $n$. By (6.2), Lemma 6.1 i) and slow variation of $F^{-1}$ at 0 and 1, we have
of n. By (6.2), Lemma 6.1 i) and slow variation of F 1 at 0 and 1, we have
11/n 1 n+1
Sn+1 un F (t) i=1 an,i i
E , F 1
= E dt
n f (F 1 )
n nf F 1 (t)
1/n
11/n
1 |F 1 (t)|
dt
n 1/n f F 1 (t)
F 1 (11/n)
1
= |u|du
n F 1 (1/n)
(F 1 (1/n))2 + (F 1 (1 1/n))2
= 0.
2 n
Let $X$ be a random variable with distribution $F$. By Lemma 6.1 iii) and finiteness
of the absolute moments of $X$, we have
\[
\operatorname{Var}\Big(\frac{S_{n+1}}{n}\Big\langle\frac{u_n}{f(F^{-1})}, F^{-1}\Big\rangle_n\Big)
= E\Big(\int_{1/n}^{1-1/n}\frac{F^{-1}(t)\sum_{i=1}^{n+1}a_{n,i}(t)(\xi_i-1)}{\sqrt n\,f(F^{-1}(t))}\,dt\Big)^2
= \frac1n\int_{1/n}^{1-1/n}\!\int_{1/n}^{1-1/n}\frac{\hat K_n(s,t)\,F^{-1}(s)\,F^{-1}(t)}{f(F^{-1}(s))\,f(F^{-1}(t))}\,ds\,dt
\]
\[
\le 6\int_{1/n}^{1-1/n}\!\int_{1/n}^{t}\frac{s(1-t)\,|F^{-1}(s)|\,|F^{-1}(t)|}{f(F^{-1}(s))\,f(F^{-1}(t))}\,ds\,dt
= 6\int_{F^{-1}(1/n)}^{F^{-1}(1-1/n)}\!\int_{F^{-1}(1/n)}^{v} F(u)\big(1-F(v)\big)\,|u||v|\,du\,dv
\]
\[
\le 6\Big(\int_{F^{-1}(1/n)}^{0}\!\int_{F^{-1}(1/n)}^{v}F(u)\,|u||v|\,du\,dv
+ \int_0^{F^{-1}(1-1/n)}\!\int_{F^{-1}(1/n)}^{0}F(u)\big(1-F(v)\big)|u||v|\,du\,dv
+ \int_0^{F^{-1}(1-1/n)}\!\int_0^v\big(1-F(v)\big)|u||v|\,du\,dv\Big)
\le \frac34\big(EX^4 + (EX^2)^2\big) < \infty,
\]
where at the last step we use Fubini and integration by parts.
As in Theorem 6.21 c), symmetry of $G_0$ is not necessary. Csörgő (2002) also
proves a result for correlation tests where the limit is normal, but only for the
special case of Weibull scale families.
Likewise, Theorem 6.22 can be used to obtain the limiting distribution of
$nR_n$ when $f(F^{-1})$ is regularly varying at the end points with exponent $>1$, but
we refrain from doing so, to avoid too much repetition.
Example. (Gauss-Laplace location-scale families.) This is a modification of a result
in Csörgő (2002). Consider the distribution functions from the above Example,
\[
F_\alpha(x) = \begin{cases} \tfrac12 e^{-|x|^\alpha} & \text{if } x\le0\\[2pt] 1-\tfrac12 e^{-x^\alpha} & \text{if } x\ge0\end{cases}
\qquad\text{for } \alpha>0.
\]
Then, just as in that example, Theorem 6.24 with $w\equiv1$ holds for the location scale
family based on $F_\alpha$ iff $\alpha>2$, Theorem 6.25 with $w\equiv1$ holds iff $4/3<\alpha\le2$,
hence for the normal distribution (which gives Shapiro-Wilk), and Theorem 6.26
with $w\equiv1$ holds for $0<\alpha\le4/3$, in particular for the symmetric exponential
distribution. As mentioned above, if the tail probabilities are of different order,
the largest dominates and the same conclusions apply to the one-sided families.
Example. (Testing fit to the Laplace location scale family.) It follows from the last
example and the comments immediately before Theorem 6.26 that a weighted
Wasserstein test would be convenient for the Gauss-Laplace location scale family
when the index $\alpha$ is between 0 and 4/3. For any given $\alpha>0$ these families are (in
terms of the densities):
\[
H_\alpha = \Big\{F_{\mu,\sigma} : f_{\mu,\sigma}(x) := \frac{\alpha}{2\sigma\Gamma(1/\alpha)}\,e^{-|(x-\mu)/\sigma|^\alpha},\ x\in\mathbb R,\ \mu\in\mathbb R,\ \sigma>0\Big\}.
\]
The weight should approach zero near 0 and 1. For simplicity we will only present a
test for the Laplace family $H_1$. Simple but tedious computations using the approxi-
mations in the previous example show that a weight of the order $w(t) \approx |\log t(1-t)|^{-\delta}$
will allow us to apply Theorem 6.24 if $\delta>1$ and Theorem 6.25 if $1/2<\delta\le1$
(the determining conditions are (6.78), that holds for all $\delta>1$, and (6.80), that
holds for $1/2<\delta\le1$). If $w$ is too small near 0 and 1, we make the extreme part
of the distribution count less, whereas possibly the limit has more variability as
the integral of $B^2 - EB^2$ is closer to being divergent. de Wet (2000) convincingly
suggests taking $\delta=1$ (see also Csörgő (2002)). Concretely, we define
\[
w(t) := \Big(\frac{1}{e-\log 2t}\,I_{0<t\le1/2} + \frac{1}{e-\log 2(1-t)}\,I_{1/2<t\le1}\Big)\Big/W,
\]
where
\[
W := e^e\int_1^\infty\frac{e^{-eu}}{u}\,du, \qquad\text{and set also}\qquad
V := e^2\int_0^\infty\frac{u^2\,e^{-eu}}{1+u}\,du.
\]
Take $G_0 := F_{0,\sqrt{W/V}}$. Then $w$ and $G_0$ satisfy conditions (5.3)-(5.5), and the
conditions (GH) and (6.80) hold as well (but not (6.78)). Then, Theorem 5.2
gives that, under the null hypothesis $F\in H_1$,
\[
nR_n^w - \frac{2}{V}\log\log\frac{e^e n}{2}
\to_d \frac{1}{V}\Big(\int_0^{1/2}\frac{B^2(t)-EB^2(t)}{t^2(e-\log 2t)}\,dt + \int_{1/2}^1\frac{B^2(t)-EB^2(t)}{(1-t)^2(e-\log 2(1-t))}\,dt\Big)
\]
\[
\quad- \frac{1}{VW}\Big(\int_0^{1/2}\frac{B(t)}{t(e-\log 2t)}\,dt + \int_{1/2}^1\frac{B(t)}{(1-t)(e-\log 2(1-t))}\,dt\Big)^2
\]
\[
\quad- \frac{1}{V^2}\Big(\int_0^{1/2}\frac{B(t)\log 2t}{t(e-\log 2t)}\,dt - \int_{1/2}^1\frac{B(t)\log 2(1-t)}{(1-t)(e-\log 2(1-t))}\,dt\Big)^2.
\]
References
[1] Ali, M.M. (1974) Stochastic ordering and kurtosis measure. J. Amer. Statist. Ass.
69, 543545.
[2] Anderson, T.W. and Darling, D.A. (1952) Asymptotic theory of certain good-
ness of fit criteria based on stochastic processes. Ann. Math. Statist. 23, 193-212.
[3] Araujo, A. and Giné, E. (1980) The Central Limit Theorem for Real and Banach
Valued Random Variables. Wiley, New York.
[4] Balanda, K.P. and McGillivray, H.L. (1988) Kurtosis: A critical review. Amer.
Statistician 42, 111119.
[5] Bickel, P. and Freedman, D. (1981) Some asymptotic theory for the bootstrap.
Ann. Stat. 9 11961217.
[6] Bickel, P. and van Zwet, W. R. (1978) Asymptotic expansions for the power of
distribution free tests in the two-sample problem. Ann. Statist. 6, 9371004.
[7] Billingsley, P. (1968) Convergence of Probability Measures. Wiley, New York.
[8] Breiman, L. (1968) Probability. Addison-Wesley, Reading.
[9] Bretagnolle, J. and Massart, P. (1989) Hungarian constructions from the
nonasymptotic viewpoint. Ann. Probab., 17 239256.
[10] Brown, B. and Hettmansperger, T. (1996) Normal scores, normal plots, and
test for normality. Jour. Amer. Stat. Assoc., 91, 16681675.
[11] Chernoff, H. and Lehmann, E.L. (1954) The use of maximum likelihood esti-
mates in χ2 tests of goodness of fit. Ann. Math. Statist. 25, 579-586.
[12] Chibisov, D.M. (1964) Some theorems on the limiting behavior of empirical dis-
tribution functions. Selected Transl. Math. Statist. Probab. 6, 147-156.
[13] Cochran, W.G. (1952) The χ2 test of goodness of fit. Ann. Math. Statist. 23,
315-345.
[14] Cohen, A. and Sackrowitz, H.B. (1975) Unbiasedness of the chi-square, likeli-
hood ratio and other goodness of fit tests for the equal cell case. Ann. Statist. 3,
959-964.
[15] Cramér, H. (1928) On the composition of elementary errors. Second paper: Sta-
tistical applications. Skand. Aktuartidskr. 11, 141-180.
[16] Csörgő, M. (1983) Quantile Processes with Statistical Applications. SIAM, Phila-
delphia.
[17] Csörgő, M., Csörgő, S., Horváth, L. and Mason, D.M. (1986) Weighted em-
pirical and quantile processes. Ann. Probab. 14, 31-85.
[18] Csörgő, M. and Horváth, L. (1988) On the distributions of Lp norms of weighted
uniform empirical and quantile processes. Ann. Probab., 16, 142-161.
[19] Csörgő, M. and Horváth, L. (1993) Weighted Approximations in Probability and
Statistics. John Wiley and Sons.
[20] Csörgő, M., Horváth, L. and Shao, Q.-M. (1993) Convergence of integrals of
uniform empirical and quantile processes. Stochastic Process. Appl., 45, 283-294.
[21] Csörgő, M. and Révész, P. (1978) Strong approximations of the quantile process.
Ann. Statist. 6, 882-894.
[22] Cuesta-Albertos, J.A., Matrán, C., Rachev, S.T. and Rüschendorf, L.
(1996) Mass transportation problems in Probability Theory. Math. Scientist 21,
34-72.
[23] D'Agostino, R.B. (1971) An omnibus test of normality for moderate and large
sample sizes. Biometrika 58, 341-348.
[24] Darling, D.A. (1955) The Cramér-Smirnov test in the parametric case. Ann.
Math. Statist. 26, 1-20.
[25] David, H.A., Hartley, H.O. and Pearson, E.S. (1954) The distribution of the
ratio, in a single normal sample, of range to standard deviation. Biometrika 41,
482493.
[26] David, F.N. and Johnson, N.L. (1948) The probability integral transformation
when parameters are estimated from the sample. Biometrika 35, 182190.
[27] de la Peña, V. and Giné, E. (1998) Decoupling. From dependence to indepen-
dence. Randomly stopped processes, U-statistics and processes, martingales and be-
yond. Springer.
[28] del Barrio, E. (2000) Asymptotic distribution of statistics of Cramér-von Mises
type. Preprint.
[29] del Barrio, E., Cuesta-Albertos, J.A., Matrán, C. and Rodríguez-Rodrí-
guez, J. (1999) Tests of goodness of fit based on the L2-Wasserstein distance. Ann.
Statist., 27, 1230-1239.
[30] del Barrio, E., Giné, E. and Matrán, C. (1999) Central limit theorems for
the Wasserstein distance between the empirical and the true distributions. Ann.
Probab. 27, 1009-1071.
[31] del Barrio, E., Giné, E. and Utzet, F. (2005) Asymptotics for L2 functionals
of the empirical quantile process, with applications to tests of fit based on weighted
Wasserstein distances. Bernoulli 11, 131-189.
[32] de Wet, T. and Venter, J. (1972) Asymptotic distributions of certain test criteria
of normality. S. Afr. Statist. J. 6, 135-149.
[33] de Wet, T. and Venter, J. (1973) Asymptotic distributions for quadratic forms
with applications to tests of fit. Ann. Statist., 2, 380-387.
[34] Donsker, M.D. (1951) An invariance principle for certain probability limit theo-
rems. Mem. Amer. Math. Soc. 6.
[35] Donsker, M.D. (1952) Justification and extension of Doob's heuristic approach
to the Kolmogorov-Smirnov theorems. Ann. Math. Statist. 23, 277-281.
[36] Doob, J.L. (1949) Heuristic approach to the Kolmogorov-Smirnov theorems. Ann.
Math. Statist. 20, 393-403.
[37] Downton, F. (1966) Linear estimates with polynomial coefficients. Biometrika 53,
129141.
[38] Dudley, R.M. (1978) Central limit theorems for empirical measures. Ann. Probab.
6, 899929.
[39] Durbin, J. (1973) Weak convergence of the sample distribution function when
parameters are estimated. Ann. Statist. 1, 279290.
[40] Feller, W. (1948) On the Kolmogorov-Smirnov limit theorems for empirical dis-
tributions. Ann. Math. Statist. 19, 177189.
[41] Feuerverger, A. and Mureika, R.A. (1977) The empirical characteristic func-
tion and its applications. Ann. Statist. 5, 8897.
[42] Filliben, J.J. (1975) The probability plot correlation coecient test for normality.
Technometrics 17, 111117.
[43] Fisher, R.A. (1930) The moments of the distribution for normal samples of mea-
sures of departure from normality. Proc. Roy. Soc. A 130, 16.
Empirical and Quantile Processes 89
[44] Galambos, J. (1987) The Asymptotic Theory of Extreme Order Statistics, 2nd ed. Krieger, Melbourne, Florida.
[45] Geary, R.C. (1947) Testing for normality. Biometrika 34, 209–242.
[46] Gerlach, B. (1979) A consistent correlation-type goodness-of-fit test; with application to the two parameter Weibull distribution. Math. Operationsforsch. Statist. Ser. Statist. 10, 427–452.
[47] Gumbel, E.J. (1943) On the reliability of the classical chi-square test. Ann. Math. Statist. 14, 253–263.
[48] Gupta, A.K. (1952) Estimation of the mean and the standard deviation of a normal population from a censored sample. Biometrika 39, 260–273.
[49] Hall, P. and Welsh, A.H. (1983) A test for normality based on the empirical characteristic function. Biometrika 70, 485–489.
[50] Jakubowski, A. (1986) Principle of conditioning in limit theorems for sums of random variables. Ann. Probab. 14, 902–915.
[51] Kac, M., Kiefer, J. and Wolfowitz, J. (1955) On tests of normality and other tests of goodness of fit based on distance methods. Ann. Math. Statist. 26, 189–211.
[52] Kale, B.K. and Sebastian, G. (1996) On a class of symmetric nonnormal distributions with kurtosis of three. In Statistical Theory and Applications: Papers in Honor of Herbert A. David. Eds. H.H. Nagaraja, P.K. Sen and D. Morrison. Springer Verlag, New York.
[53] Kolmogorov, A. (1933) Sulla determinazione empirica di una legge di distribuzione. Giorn. Ist. Ital. Attuari 4, 83–91.
[54] Kolmogorov, A.N. and Prohorov, Yu.V. (1949) On sums of a random number of random terms (Russian). Uspehi Matem. Nauk (N.S.) 4, 168–172.
[55] Komlós, J., Major, P. and Tusnády, G. (1975) An approximation of partial sums of independent RVs and the sample DF. I. Z. Wahrsch. verw. Gebiete 32, 111–131.
[56] Komlós, J., Major, P. and Tusnády, G. (1976) An approximation of partial sums of independent RVs and the sample DF. II. Z. Wahrsch. verw. Gebiete 34, 33–58.
[57] Leslie, J.R. (1984) Asymptotic properties and new approximations for both the covariance matrix of normal order statistics and its inverse. Colloquia Mathematica Societatis János Bolyai 45, 317–354. Eds. P. Révész, K. Sarkadi and P.K. Sen. Elsevier, Amsterdam.
[58] Leslie, J.R., Stephens, M.A. and Fotopoulos, S. (1986) Asymptotic distribution of the Shapiro-Wilk W for testing for normality. Ann. Statist. 14, 1497–1506.
[59] Lilliefors, H.W. (1967) On the Kolmogorov-Smirnov test for normality with mean and variance unknown. J. Amer. Statist. Ass. 62, 399–402.
[60] Lockhart, R.A. (1985) The asymptotic distribution of the correlation coefficient in testing fit to the exponential distribution. Canad. J. Statist. 13, 253–256.
[61] Lockhart, R.A. (1991) Overweight tails are inefficient. Ann. Statist. 19, 2254–2258.
90 E. del Barrio
[62] Lockhart, R.A. and Stephens, M.A. (1998) The probability plot: Test of fit based on the correlation coefficient. Order Statistics: Applications, 453–473. Handbook of Statist. 17, North-Holland, Amsterdam.
[63] Mann, H.B. and Wald, A. (1942) On the choice of the number of class intervals in the application of the chi square test. Ann. Math. Statist. 13, 306–317.
[64] Mason, D. and Shorack, G. (1992) Necessary and sufficient conditions for asymptotic normality of L-statistics. Ann. Probab. 20, 1779–1804.
[65] McLaren, C.G. and Lockhart, R.A. (1987) On the asymptotic efficiency of certain correlation tests of fit. Canad. J. Statist. 15, 159–167.
[66] Moore, D.S. (1971) A chi-square statistic with random cell boundaries. Ann. Math. Statist. 42, 147–156.
[67] Moore, D.S. (1978) Chi-square tests. In Studies in Statistics (R.V. Hogg, ed.) 66–106. The Mathematical Association of America.
[68] Moore, D.S. (1986) Tests of chi-squared type. In Goodness-of-Fit Techniques, D'Agostino, R.B. and Stephens, M.A., eds., 63–96. North-Holland, Amsterdam.
[69] Murota, K. and Takeuchi, K. (1981) The studentized empirical characteristic function and its application to test for the shape of distribution. Biometrika 68, 55–65.
[70] O'Reilly, N.E. (1974) On the weak convergence of empirical processes in sup-norm metrics. Ann. Probab. 2, 642–651.
[71] Pearson, E.S. (1930) A further development of tests for normality. Biometrika 22, 239–249.
[72] Pearson, E.S., D'Agostino, R.B. and Bowman, K.O. (1977) Tests for departure from normality: Comparison of powers. Biometrika 64, 231–246.
[73] Pollard, D. (1979) General chi-square goodness-of-fit tests with data-dependent cells. Z. Wahrsch. Verw. Gebiete 50, 317–331.
[74] Pollard, D. (1980) The minimum distance method of testing. Metrika 27, 43–70.
[75] Prohorov, Y.V. (1953) Probability distributions in functional spaces (Russian). Uspehi Matem. Nauk (N.S.) 8, 165–167.
[76] Prohorov, Y.V. (1956) The convergence of random processes and limit theorems in probability. Theor. Probab. and its Applicat. 1, 157–214.
[77] Rachev, S.T. (1991) Probability Metrics and the Stability of Stochastic Models. Wiley.
[78] Royston, J.P. (1982) An extension of Shapiro and Wilk's W test for normality to large samples. Appl. Statist. 31, 115–124.
[79] Sarkadi, K. (1975) The consistency of the Shapiro-Francia test. Biometrika 62, 445–450.
[80] Serfling, R.J. (1980) Approximation Theorems of Mathematical Statistics. Wiley, New York.
[81] Shapiro, S.S. and Francia, R.S. (1972) An approximate analysis of variance test of normality. J. Amer. Statist. Ass. 67, 215–216.
[82] Shapiro, S.S. and Wilk, M.B. (1965) An analysis of variance test for normality (complete samples). Biometrika 52, 591–611.
[83] Shapiro, S.S. and Wilk, M.B. (1968) Approximations for the null distribution of the W statistic. Technometrics 10, 861–866.
[84] Shapiro, S.S. and Wilk, M.B. (1972) An analysis of variance test for the exponential distribution (complete samples). Technometrics 14, 355–370.
[85] Shapiro, S.S., Wilk, M.B. and Chen, H.J. (1968) A comparative study of various tests for normality. J. Amer. Statist. Ass. 63, 1343–1372.
[86] Shorack, G.R. and Wellner, J.A. (1986) Empirical Processes With Applications to Statistics. Wiley, New York.
[87] Skorohod, A.V. (1956) Limit theorems for stochastic processes. Theor. Probab. and its Applicat. 1, 261–290.
[88] Smirnov, N.V. (1936) Sur la distribution de ω² (Critérium de M.R. v. Mises). C. R. Acad. Sci. Paris 202, 449–452.
[89] Smirnov, N.V. (1937) Sur la distribution de ω² (Critérium de M.R. v. Mises) (Russian/French summary). Mat. Sbornik (N.S.) 2, 973–993.
[90] Smirnov, N.V. (1939) Sur les écarts de la courbe de distribution empirique (Russian/French summary). Mat. Sbornik (N.S.) 6, 3–26.
[91] Smirnov, N.V. (1941) Approximate laws of distribution of random variables from empirical data (Russian). Uspekhi Mat. Nauk 10, 179–206.
[92] Smith, R.M. and Bain, L.J. (1976) Correlation-type goodness-of-fit statistics with censored sampling. Comm. Statist. A-Theory and Methods 5, 119–132.
[93] Spinelli, J.J. and Stephens, M.A. (1987) Tests for exponentiality when origin and scale parameters are unknown. Technometrics 29, 471–476.
[94] Stephens, M.A. (1974) EDF statistics for goodness of fit and some comparisons. J. Amer. Statist. Assoc. 69, 730–737.
[95] Stephens, M.A. (1975) Asymptotic properties for covariance matrices of order statistics. Biometrika 62, 23–28.
[96] Stephens, M.A. (1986) Tests based on EDF statistics. In Goodness-of-Fit Techniques (R.B. D'Agostino and M.A. Stephens, eds.) 97–193. Marcel Dekker, New York.
[97] Stephens, M.A. (1986) Tests based on regression and correlation. Goodness-of-Fit Techniques 195–233. Eds. R.B. D'Agostino and M.A. Stephens. Marcel Dekker, New York.
[98] Sukhatme, S. (1972) Fredholm determinant of a positive definite kernel of a special type and its application. Ann. Math. Statist. 43, 1914–1926.
[99] Uthoff, V.A. (1970) An optimum test property of two well known statistics. J. Amer. Statist. Ass. 65, 1597.
[100] Uthoff, V.A. (1973) The most powerful scale and location invariant test of the normal versus the double exponential. Ann. Statist. 1, 170–174.
[101] Vallender, S. (1973) Calculation of the Wasserstein distance between probability distributions on the line. Theo. Prob. Appl. 18, 785–786.
[102] Verrill, S. and Johnson, R. (1987) The asymptotic equivalence of some modified Shapiro-Wilk statistics — Complete and censored sample cases. Ann. Statist. 15, 413–419.
[103] von Mises, R. (1931) Wahrscheinlichkeitsrechnung. Wien, Leipzig.
[104] Watson, G.S. (1957) The χ² goodness-of-fit test for normal distributions. Biometrika 44, 336–348.
[105] Watson, G.S. (1958) On chi-square goodness-of-fit tests for continuous distributions. J. Roy. Statist. Soc. B 20, 44–61.
[106] Weisberg, S. and Bingham, C. (1975) An approximate analysis of variance test for non-normality suitable for machine calculation. Technometrics 17, 133–134.
[107] Williams, P. (1935) Note on the sampling distribution of √β₁, where the population is normal. Biometrika 27, 269–271.
study is a research domain per se, which, I believe, should not be treated as a starting point. In particular, probability theory in Banach spaces, which comes into this field as a basic ingredient, constitutes a universe in itself (refer to [6, 135]).
This introductory course was presented at the Summer School in Laredo, and later taught to graduate students at the University of Paris VI. Of course, the contents of these notes had to be limited by lack of space, and their completion is, as one could expect, left to the reader, with the help of the enclosed exercises and references.
F(x−) := lim_{y↑x} F(y) = lim_{y→x, y<x} F(y) = P(X < x) ≤ F(x)  for x ∈ R ∪ {∞}.  (1.5)

The set D_F of discontinuity points of F, namely, the set of all x ∈ R such that
The quantile function G(·) in (1.7), as well as its right-continuous version, defined, for 0 < u < 1, by G(u+) := inf{x : F(x) > u}, and, for u = 0 or 1, through the relations (1.12) below, fulfills

−∞ ≤ G(u) ≤ G(v) ≤ +∞  for 0 ≤ u ≤ v ≤ 1;  (1.11)

G(0+) := G(0) = ω₋ = lim_{u↓0} G(u);
G(1+) := G(1) = ω₊ = lim_{u↑1} G(u);  (1.12)

G(u) = G(u−) := lim_{ε↓0} G(u − ε)  for 0 < u ≤ 1.  (1.13)
We note that (1.23) is invalid when H(t) = c is constant for t ∈ [A, B]. In this case (1.22) is meaningless, and the definitions, (1.19), of H⁻ and H⁺ reduce to H⁻(s) = A and H⁺(s) = B for s = c, so that

{H⁻}⁻(t) = {H⁺}⁻(t) = c, for t = A,
and  (1.24)
{H⁻}⁺(t) = {H⁺}⁺(t) = c, for t = B.

Both operators, H ↦ H⁻ and H ↦ H⁺, are natural extensions of the inversion operator, H ↦ H⁻¹, with respect to composition of mappings. The interest of H⁻ and H⁺ is that they are always defined, whereas such is not the case for H⁻¹. When H(·) is (strictly) increasing and continuous on [A, B], with a properly defined inverse mapping, H⁻¹(·), (strictly) increasing and continuous on [H(A), H(B)], all three definitions coincide, since then, for each s ∈ [A, B] (resp. t ∈ [H(A), H(B)]),

H ∘ H⁻¹(t) = t,  H⁻¹ ∘ H(s) = s,  H⁻¹(t) = H⁻(t) = H⁺(t).  (1.25)
Proposition 1.1 below provides a variant of (1.25), tailored to the case of distribution and quantile functions. Its proof can be adapted to show that continuity of H and H⁻¹ is sufficient to imply (1.25).
A similar argument as that used to infer (1.30) from (1.7) shows that, for each 0 < u < 1 (this implying that −∞ < G(u) ≤ G(u+) < ∞) and ε > 0, we have

F(G(u) − ε) < u ≤ F(G(u) + ε),
and  (1.31)
F(G(u+) − ε) ≤ u < F(G(u+) + ε).

By letting ε ↓ 0 in (1.31), we obtain readily the inequalities

F(G(u)−) ∨ F(G(u+)−) ≤ u ≤ F(G(u)) ∧ F(G(u+)).  (1.32)

Another application of (1.7) shows readily that, for each ω₋ < x < ω₊,

G(F(x−)) ∨ G(F(x)) ≤ x ≤ G(F(x−)+) ∧ G(F(x)+).  (1.33)
A simple and useful consequence of these inequalities is stated in the next proposition.
Proposition 1.1. Assume that F(·) is continuous on (ω₋, ω₊), and that G(·) is continuous on (0, 1). Then, for each x ∈ [ω₋, ω₊] and u ∈ [0, 1], one has

G(F(x)) = x, and F(G(u)) = u.  (1.34)

Proof. The assumption that F(·) and G(·) are continuous is equivalent to the identities F(·) = F(·−) and G(·) = G(·+). We have therefore F(x−) = F(x) and G(u) = G(u+) in (1.32)–(1.33), which, in turn, entail (1.34) for x ∈ (ω₋, ω₊) and u ∈ (0, 1). To conclude, we observe that (1.34) holds for x = ω₋ or ω₊, and u = 0 or 1, as a direct consequence of the definitions (1.1) and (1.7) of F and G.
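For a concrete instance of Proposition 1.1 (a numerical check of ours, not part of the text): the standard exponential law has df F(x) = 1 − e^{−x}, continuous and strictly increasing on (0, ∞), and qf G(u) = −log(1 − u), continuous on (0, 1), so both compositions in (1.34) reduce to the identity:

```python
import math

def F(x):
    # Standard exponential df, continuous and strictly increasing on (0, inf).
    return 1.0 - math.exp(-x)

def G(u):
    # Its quantile function, continuous on (0, 1).
    return -math.log(1.0 - u)

# G(F(x)) = x and F(G(u)) = u, as asserted in (1.34).
checks_x = [abs(G(F(x)) - x) for x in (0.1, 1.0, 5.0)]
checks_u = [abs(F(G(u)) - u) for u in (0.05, 0.5, 0.95)]
```

Both lists contain only floating-point round-off; for a df with flat pieces or jumps, the same compositions would fail outside the support, which is exactly what the hypotheses of the proposition rule out.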
Exercise 1.1. Let X denote a [0, 1]-valued rv with density f(·), continuous and positive on [0, 1]. Denoting by F(·) and G(·) the df and qf of X, show that the quantile density g(u) = (d/du)G(u) is continuous and positive on [0, 1], and is equal to 1/f(G(u)).
Exercise 1.2. Let G be the qf of a rv X. Prove the equality

E(X) = ∫₀¹ G(u) du.
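The identity of Exercise 1.2 invites a quick numerical check. The sketch below is ours, not the text's (the function names are hypothetical); it approximates the integral of the standard exponential qf G(u) = −log(1 − u) by a midpoint rule and recovers the exponential mean, which equals 1:

```python
import math

def G_exp(u):
    # Quantile function of the standard exponential law: G(u) = -log(1 - u).
    return -math.log(1.0 - u)

def mean_from_qf(G, n=200_000):
    # Midpoint-rule approximation of E(X) = integral of G(u) over (0, 1).
    h = 1.0 / n
    return sum(G((i + 0.5) * h) for i in range(n)) * h

approx_mean = mean_from_qf(G_exp)  # close to 1.0
```

The midpoint rule copes with the logarithmic singularity of G at u = 1 because the singularity is integrable; for heavier-tailed laws a finer partition near the endpoints would be needed.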
µ = µ₊ − µ₋.  (1.35)

Given µ ∈ M(E), the choice of µ₊ and µ₋ in (1.35) is not unique. However, the Hahn-Jordan decomposition theorem (see, e.g., p. 173 in [172] and pp. 178–181 in [83]) allows us to choose these two measures, by setting µ = µ₊ − µ₋, where µ₊ and µ₋ are defined through a decomposition of E = E₋ ∪ E₊ into the union of two disjoint measurable subsets E₋ and E₊, such that, for each relatively compact A ∈ B_E,

µ₊(A) = µ(A ∩ E₊) and µ₋(A) = −µ(A ∩ E₋),  E₋ ∩ E₊ = ∅.  (1.36)

The property (1.36) characterizes orthogonality of µ₊ and µ₋ (denoted hereafter by µ₊ ⊥ µ₋). The measures in (1.36) are uniquely defined for each A ∈ B_E, via the relations

µ₊(A) = sup_{B⊆A, B∈B_E} µ(B ∩ E₊) and µ₋(A) = sup_{B⊆A, B∈B_E} {−µ(B ∩ E₋)}.  (1.37)
In the special case where E = R, we may observe that the condition that µ ∈ M_f(R) is equivalent to the condition that the distribution function H_µ(x) = µ((−∞, x]) of µ is of bounded variation on R, with H_µ(−∞) := lim_{x→−∞} H_µ(x) = 0, and with total variation on R equal to ∫_R |dH_µ| = |µ|(R). In the sequel, the set of functions f of bounded variation on a sub-interval I of R will be denoted by BV(I).
100 P. Deheuvels
for each bounded function f on E with compact support (resp. compact subset K of E). For the set P(E) of probability measures on E, (1.41) is automatic, so that vague and weak convergence are equivalent, for a sequence µ_n ∈ P(E), when the limit is in P(E). However, the set P(E) is not closed in M⁺(E) with respect to the vague topology, so that (1.42) is not sufficient to ensure weak relative compactness of a subset S of P(E). A useful characterization of this property is given by Prohorov's theorem (refer to p. 37 in [19]), which shows that weak relative compactness of S ⊆ P(E) is equivalent to the tightness of S, namely, to the condition that, for each ε > 0, there exists a compact subset K_ε of E such that

inf_{µ∈S} µ(K_ε) ≥ 1 − ε.  (1.43)
1.3.2. Weak convergence and the Lévy metric. We consider briefly in the present section the topology, denoted hereafter by W, of weak convergence on the space P(R) of probability measures on R, identified with the space F(R) of the corresponding (right-continuous) distribution functions. We refer to §1.3.1 for basic definitions, to [19] for more details on weak convergence of probability measures, and to §1.3.3 (resp. §1.3.4) below, for a study of this topology on the set of bounded non-negative (resp. signed) measures on [0, 1]. One important feature of the topological space (P(R), W) is that it is conveniently metricized by the Lévy metric (or distance), defined as follows. Consider two, possibly unbounded, non-decreasing functions H₁ and H₂ on [A, B] ⊆ R. Extend the definition of these functions to R, by setting, for i = 1, 2,

H̃ᵢ(x) = { Hᵢ(A) for x ≤ A,
        { Hᵢ(B) for x ≥ B.

When H₁ and H₂ are bounded, (1.44) implies that L(H₁, H₂) < ∞. It is readily checked that L defines a metric on the set F(R) of all probability dfs on R. Denoting, as usual, by C_F the set of continuity points of F, a necessary and sufficient condition for a sequence {F_n : n ≥ 1} of dfs to converge weakly to the
Consider now two quantile functions, G₁ and G₂, pertaining to the two distribution functions F₁ ∈ F and F₂ ∈ F. One may extend the definition of G₁ and G₂ to R, by setting, for i = 1, 2,

G̃ᵢ(x) = { Gᵢ(0) for x ≤ 0,
        { Gᵢ(1) for x ≥ 1.

Given this notation, one may define the Lévy distance L(G₁, G₂) between G₁ and G₂, by the formal replacement of H̃₁, H̃₂ in (1.44) by G̃₁, G̃₂. We have the following useful proposition.
Proposition 1.2. When G1 , G2 are the quantile functions of the distribution func-
tions F1 , F2 , we have
L (F1 , F2 ) = L (G1 , G2 ). (1.46)
Proof. Combine (1.32)–(1.33) with (1.45).
An interesting application of (1.45) and (1.46) is provided by the following characterization of weak convergence. Let G_n = F_n⁻ denote the quantile function of F_n for n = 1, 2, . . ., and let G = F⁻ denote the quantile function of F. Denote by C_G ⊆ [0, 1] the set of all continuity points of G. Then,

F_n →_W F  ⟺  L(G_n, G) → 0  ⟺  lim_{n→∞} G_n(u) = G(u)  for all u ∈ C_G.  (1.47)
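Since (1.47) says that L metrizes weak convergence on P(R), it can be instructive to compute Lévy distances numerically. The sketch below is ours, not the text's: it takes for granted the usual form of the Lévy distance, L(F₁, F₂) = inf{ε > 0 : F₁(x − ε) − ε ≤ F₂(x) ≤ F₁(x + ε) + ε for all x}, checks the two inequalities only on a finite grid, and locates the infimum by bisection:

```python
import math

def norm_cdf(x, mu=0.0):
    # Normal df with unit variance, expressed through the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0)))

def levy_distance(F1, F2, lo=-10.0, hi=10.0, grid=2000, tol=1e-4):
    # Approximate Levy distance between two dfs: bisect over eps, testing
    # F1(x - eps) - eps <= F2(x) <= F1(x + eps) + eps on a grid of x values.
    xs = [lo + (hi - lo) * i / grid for i in range(grid + 1)]
    def ok(eps):
        return all(F1(x - eps) - eps <= F2(x) <= F1(x + eps) + eps
                   for x in xs)
    a, b = 0.0, 1.0  # eps = 1 always satisfies the inequalities for dfs
    while b - a > tol:
        m = 0.5 * (a + b)
        a, b = (a, m) if ok(m) else (m, b)
    return b
```

For two normal dfs shifted by δ the value lies between 0 and δ, and it shrinks as δ → 0, in line with (1.47); the grid and bisection tolerances control the accuracy of the approximation.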
1.3.3. Weak and uniform topologies for dfs of non-negative measures on [0, 1]. We now specialize to the study of (non-negative bounded) Radon measures with support in a bounded closed interval of R. For convenience, and without loss of generality, we will limit ourselves to the case where this interval is E = [0, 1]. As follows from Definition 1.1, the vague and weak topologies coincide on M[0, 1]. Let M⁺_f[0, 1] denote the set of all bounded non-negative measures on R, with support in [0, 1], namely, fulfilling µ(R − [0, 1]) = 0 and µ(A) ≥ 0 for each A ∈ B_R. We denote by I⁺[0, 1] (resp. I⁻[0, 1]) the set of all right-continuous (resp. left-continuous) distribution functions H_µ(·+) (resp. H_µ(·−)) of measures µ ∈ M⁺_f[0, 1], of the form

H_µ(x+) = µ([0, x]) and H_µ(x−) = µ([0, x))  for x ∈ R.  (1.48)

We let I±[0, 1] denote either one of the sets I⁺[0, 1] or I⁻[0, 1]. The correspondence H_µ(·+) ↔ H_µ(·−) being one-to-one, we may endow I±[0, 1] with the weak topology W of convergence of measures in M⁺_f[0, 1]. This topology is metricized by the Lévy metric L(·, ·) defined via (1.44), and we let (I±[0, 1], W) denote the set I±[0, 1] endowed with the weak topology W. Likewise, we let (I±[0, 1], U) (resp. (M⁺_f[0, 1], U)) denote the set I±[0, 1] (resp. M⁺_f[0, 1]), endowed with the uniform
Topics on Empirical Processes 103
topology U. The latter topology is induced by the sup-norm distance, defined, for H_µ, H_ν ∈ I±[0, 1], with µ, ν ∈ M⁺_f[0, 1], via

∆_U(µ, ν) = ‖H_µ − H_ν‖ = sup_{x∈R} |H_µ(x) − H_ν(x)|.  (1.49)

In view of (1.42), for any (finite) constant C ≥ 0, the set M⁺_{f,C}[0, 1] of all µ ∈ M⁺_f[0, 1] such that µ([0, 1]) = H_µ(1+) ≤ C is weakly compact. The corresponding
1.3.4. Weak convergence of signed measures on [0, 1]. We now concentrate on the study of the weak topology W on the space M_f[0, 1] of totally bounded signed measures on R, with support in [0, 1]. Recalling the Hahn-Jordan decomposition
By (1.42), M⁺_{f,C}[0, 1] is a weakly compact metrizable space. Therefore, for each increasing sequence {n_j : j ≥ 1} of positive integers, there exists an increasing subsequence, along which µ⁺_n →_W µ₊ and µ⁻_n →_W µ₋, for some µ₊ and µ₋ ∈ M⁺_{f,C}[0, 1]. This, in turn, entails that, along this subsequence, ρ_H(µ±_n, µ±) → 0, and hence, ρ_H(µ_n, µ) → 0, where µ := µ₊ − µ₋. Since the so-defined µ is necessarily unique, we infer from this argument that (M_{f,C}[0, 1], ρ_H) is a complete metric space. The same argument shows readily that M_{f,C}[0, 1] is sequentially compact, and therefore compact, with respect to W.
Remark 1.3. As mentioned above in §1.3.1, the fact that M_{f,C}[0, 1] is metrizable with respect to the weak topology W is a general consequence of the Banach-Alaoglu theorem (refer to pp. 67–68 in [173]). The Högnäs metric (1.57) provides here a convenient example of a metric endowing M_{f,C}[0, 1] with W.
1.3.5. Compact sets based upon rate functions. We will consider below a series of examples of compact subsets of M_f[0, 1], endowed either with the weak topology W, or with the uniform topology U, of the corresponding distribution functions. Throughout, we will identify µ ∈ M_f[0, 1] with its distribution function H_µ(t) = µ([0, t]), so that the sets we will consider will be defined equivalently, in terms of measures, or in terms of functions. In particular, the uniform topology U will be defined on M_f[0, 1], by setting, for µ, ν ∈ M_f[0, 1],

d_U(µ, ν) = ‖H_µ − H_ν‖ = sup_{t∈R} |H_µ(t) − H_ν(t)|.  (1.58)

The compact sets we shall consider will be defined through rate functions, which, in the applications discussed later on, will be chosen as Chernoff functions, of the form (2.8). However, in the present section, this restriction is not necessary, and we will work in a more general setup. By rate function is meant a non-negative convex (possibly infinite) function {ψ(t) : t ∈ R}, with the following properties, holding for some specified constant m ∈ R.

(Ψ.1) 0 ≤ ψ(t) ≤ ∞;
The following theorems will have useful consequences. For each H ∈ AC[0, 1] (the set of absolutely continuous functions on [0, 1], see §1.3.3), we denote by Ḣ(t) = (d/dt)H(t) the Lebesgue derivative of H. By (AC[0, 1], U) is meant the set AC[0, 1] endowed with the uniform topology U. The set BV₀[0, 1] collects all distribution functions H_µ of totally bounded signed measures µ ∈ M_f[0, 1] with support in [0, 1].

Theorem 1.1. Let ψ, fulfilling (Ψ.1–2), be such that t₁ = −∞ and t₀ = ∞. Introduce the set

Γ_{ψ,c} = { H ∈ AC[0, 1] : H(0) = 0 and ∫₀¹ cψ(c⁻¹Ḣ(u)) du ≤ 1 }.  (1.62)
ψ( (H(b) − H(a)) / (b − a) ) = ψ( (1/(b − a)) ∫_a^b Ḣ(u) du ) ≤ (1/(b − a)) ∫_a^b ψ(Ḣ(u)) du ≤ M/(b − a),

where we have used the convexity of ψ. Thus, we have |H(b) − H(a)| ≤ (b − a) ψ⁻(M/(b − a)), whence, for an appropriate δ_ε > 0 depending only on ε, M and ψ,

|b − a| ≤ δ_ε  ⟹  |H(b) − H(a)| ≤ ε.

This establishes the uniform equicontinuity of Γ_{ψ,c}, the boundedness of Γ_{ψ,c} following trivially, in turn, from the fact that H(0) = 0 for all H ∈ Γ_{ψ,c}. Finally, the fact that Γ_{ψ,c} is closed follows from the observation (see, e.g., Lynch and Sethuraman [144]) that the mapping

H ∈ AC[0, 1] ↦ ∫₀¹ ψ(Ḣ(u)) du  (1.63)

is lower semi-continuous.
Theorem 1.2. Let ψ, fulfilling (Ψ.1–2), be such that

ψ(t) = ∞ for t < 0,  t₁ = −∞  and  t₀ < ∞.
is a compact subset of the set (C[0, 1], U) of continuous functions on [0, 1], endowed with the uniform topology U. This follows from Theorem 1.1, taken with ψ(u) = u². We note further that ψ_X(u) = ½u² is the Chernoff function, (2.8), of a rv X following a standard normal N(0, 1) law (see, e.g., §2.3.3).

2°) The Finkelstein set of functions (refer to Finkelstein (1971)), defined by

F = { f ∈ AC[0, 1] : f(0) = f(1) = 0 and ∫₀¹ ḟ(u)² du ≤ 1 },  (1.66)

is a compact subset of the set (C[0, 1], U) of continuous functions on [0, 1], endowed with the uniform topology. This follows from 1°) and the observation that F = { f ∈ S : f(1) = 0 } is closed in (C[0, 1], U).
3°) The set of functions (refer to Deheuvels and Mason (1990, 1992)), defined for c > 0 by

Γ_c = { f ∈ AC[0, 1] : f(0) = 0 and ∫₀¹ c h(c⁻¹ḟ(u)) du ≤ 1 },  (1.67)

is a compact subset of the set (C[0, 1], U) of continuous functions on [0, 1], endowed with the uniform topology U. This follows from Theorem 1.1, taken with ψ(u) = h(u), which is nothing else but the Chernoff function, (2.8), of a Poisson rv with expectation equal to 1 (see, e.g., (2.60)).

4°) The set of functions (refer to Deheuvels and Mason (1990, 1992)), defined for c > 0 by

Λ_c = { f ∈ I±[0, 1] : f(0) = 0 and ∫₀¹ c h(c⁻¹ḟ(u)) du + f_S(1+) ≤ 1 },  (1.68)
The next theorem extends Theorems 1.1 and 1.2 to the case where t₀ and t₁ are arbitrary.

Theorem 1.3. Let ψ, fulfilling (Ψ.1–2), be such that −∞ ≤ t₁ ≤ t₀ ≤ ∞. Recall (1.53), and set BV₀[0, 1] = {H_µ : µ ∈ M_f[0, 1]}. For each c > 0, introduce the set

Γ_{ψ,c} = { H ∈ BV₀[0, 1] : ∫₀¹ cψ(c⁻¹Ḣ(u)) du + t₀ H_S⁺(1+) − t₁ H_S⁻(1+) ≤ 1 }.  (1.69)

Then Γ_{ψ,c} is a compact subset of (BV₀[0, 1], W).

Proof. See, e.g., Deheuvels [51].

In the forthcoming §2.1, we will give some examples of rate functions based upon large deviation theory.
In the forthcoming 2.1, we will give some examples of rate functions based upon
large deviation theory.
Exercise 1.3.
1 ) Let h() and () be as in (1.60)(1.61). Check the relations
h(1/) = () and (1/) = h() and > 0.
2 ) Recall the results of Exercise 1.1. Let F denote the set of df s F (x) of rvs
d
with support in [0, 1], having density f (x) = dx F (x) of X continuous and
positive on [0, 1]. Let likewise G denote the set of qf s of random variables,
d
with density quantile function g(u) = du G(u) continuous and positive on
[0, 1]. Show that the mapping F F G = F G denes a one-to-
one mapping of F onto G, continuous with respect to the weak topology W.
Show likewise that the mapping G G F = G F denes a one-to-one
mapping of G onto F , continuous with respect to the weak topology W. Check
that {F } = F and {G } = G, for F F and G G.
3 ) Show that the sets F1 and G1 are homeomorphic (for additional details,
see, e.g., [65, 66]).
1.4. The quantile transform

1.4.1. The univariate quantile transform.

Theorem 1.4. For any random variable X ∈ R, with distribution function F and quantile function G, the following properties hold.
(i) Whenever F is continuous, the random variable U := F(X) is uniformly distributed on (0, 1);
(ii) When F is arbitrary, and V is a uniform (0, 1) rv, one has the distributional identity

X =_d G(V).  (1.70)

Proof. The implications F(X) < F(x) ⟹ X < x and X ≤ x ⟹ F(X) ≤ F(x) entail that

P(F(X) < F(x)) ≤ P(X < x) = F(x−)
and  (1.71)
F(x) = P(X ≤ x) ≤ P(F(X) ≤ F(x)).
When F is continuous, u = F(x) reaches all possible values in (0, 1) when x varies in (ω₋, ω₊). Thus, by (1.71), the df H(u) := P(F(X) ≤ u) of F(X), together with its left-continuous version H(u−) = P(F(X) < u), fulfills, for all u ∈ (0, 1),

H(u−) ≤ u ≤ H(u).

This, in turn, readily implies that H(u) = u for all u ∈ C_H, the set of continuity points of H. Since this, in turn, implies that H(u) = u for all u ∈ [0, 1], we obtain readily Assertion (i) of the theorem.
To establish Assertion (ii), we define a random variable Y =_d N(0, 1), following a standard normal law, and independent of X (in fact, any rv with a continuous and positive density on R will do). For each n ≥ 1, we observe that the rv X_n := X + n⁻¹Y ∈ R has a density f_n, continuous and positive on R. This, in turn, implies that the restriction of the df F_n of X_n to R, and the restriction of the qf G_n of X_n to (0, 1), are continuous and increasing. By the just-proven Assertion (i), this implies that, for each n ≥ 1, U_n := F_n(X_n) is uniformly distributed on (0, 1). Moreover, an application of (1.34) in Proposition 1.1 shows that X_n = G_n(F_n(X_n)) = G_n(U_n) for each n ≥ 1. Now, as n → ∞, X_n = X + n⁻¹Y → X a.s., whence F_n →_W F (recall that almost sure [a.s.] convergence implies convergence in probability, and hence, weak convergence of the underlying probability measures). By (1.47), this implies, in turn, that G_n(u) → G(u) for each u ∈ C_G, the set of continuity points of G. We next observe that X_n =_d G_n(U), where U is a fixed, uniformly distributed on (0, 1), random variable. Since D_G := [0, 1] − C_G is, at most, countable, we have P(U ∈ C_G) = 1. By all this, it follows that, as n → ∞, X_n =_d G_n(U) → G(U), and hence, X_n →_W G(U). Since X_n → X a.s., and hence, X_n →_W X, as n → ∞, it follows that G(U) =_d X, which was to be proved.
The quantile transform Theorem 1.4 plays an essential role in empirical process theory. By this theorem, it is always possible, without loss of generality [w.l.o.g.], to define an arbitrary sequence X₁, X₂, . . . of independent and identically distributed [iid] random variables [rvs], with df F(·) and qf G(·), on a probability space (Ω, A, P), carrying an iid sequence U₁, U₂, . . . of uniform (0, 1) rvs, such that

X_n = G(U_n)  for each n ≥ 1.  (1.72)

Because of (1.72), it is, most of the time, more convenient to work directly on the sequence U₁, U₂, . . ., rather than on X₁, X₂, . . ., given that any result which can be obtained in this case has a counterpart for X₁, X₂, . . ., which often follows by book-keeping arguments based upon the mapping u ↦ G(u).
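Identity (1.72) is also the basis of inverse-df ("quantile transform") random number generation. A small sketch of ours (the function names are hypothetical) draws uniforms U_n and maps them through G, here for the standard exponential law, then checks the sample against the target mean and df:

```python
import math
import random

def G_exp(u):
    # Quantile function of the standard exponential law.
    return -math.log(1.0 - u)

def sample_via_quantile(G, n, seed=12345):
    # X_n = G(U_n), as in (1.72): U_n iid uniform (0,1) yields X_n iid with qf G.
    rng = random.Random(seed)
    return [G(rng.random()) for _ in range(n)]

xs = sample_via_quantile(G_exp, 100_000)
mean = sum(xs) / len(xs)                                   # near 1
frac_below_ln2 = sum(x <= math.log(2.0) for x in xs) / len(xs)  # near 1/2
```

The second check exploits F(log 2) = 1/2 for the exponential df; the same two lines, with G replaced by any other qf, turn a uniform sample into a sample from an arbitrary law, exactly as the paragraph above describes.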
1.4.2. The multivariate quantile transform. First, we need to recall some classical (but far from obvious) properties of conditional distributions. Recall (see, e.g., p. 209 in [10]) that a set E is a Polish space when there is a complete metric defining its topology, with a countable dense subset in E. We will be concerned here with the case of R^d, which is Polish with respect to the usual topology. Consider now two E-valued rvs, X and Y, defined on the same probability space (Ω, A, P). Denote by B_E the σ-algebra of all Borel subsets of E (the smallest σ-algebra generated by the open (and closed) subsets of E). We assume that X and Y are measurable in the sense that the events X⁻¹(A) = {X ∈ A} and Y⁻¹(A) = {Y ∈ A} belong to A for all A ∈ B_E. We denote by A_Y the σ-algebra of A generated by the sets Y⁻¹(A), for A ∈ B_E. In words, A_Y is the smallest σ-algebra of A such that Y⁻¹(A) = {Y ∈ A} = {ω ∈ Ω : Y(ω) ∈ A} ∈ A_Y for all A ∈ B_E. Then, it is possible to define a regular conditional probability P(X ∈ · | Y = y), with the following properties.
(i) For each y ∈ E, A ∈ B_E ↦ P(X ∈ A | Y = y) defines a probability measure on (E, B_E);
(ii) For each A ∈ B_E, the mapping y = Y(ω) ∈ E ↦ P(X ∈ A | Y = y) is A_Y-measurable;
(iii) The conditional probability P(X ∈ A | Y = y), of A ∈ B_E given y ∈ E, is defined uniquely, with respect to y, up to an a.e. equivalence over a set B ∈ B_E such that P(Y ∈ B) = 0;
(iv) For each A ∈ B_E and B ∈ B_E,

∫_B P(X ∈ A | Y = y) dP(Y = y) = P(X ∈ A, Y ∈ B).  (1.73)

Defining the support, denoted by supp(Y), of Y as the set of all y ∈ E such that P(Y ∈ V_y) > 0 for all open neighborhoods V_y of y, we see that the definition of P(X ∈ A | Y = y) is meaningful only for y ∈ supp(Y). When y ∉ supp(Y), the conditional probability P(X ∈ · | Y = y) is, therefore, arbitrary.
For t ∈ (0, 1), set H₁(t) = inf{x : F₁(x) ≥ t}, and, for t ∈ (0, 1), j = 2, . . . , d, and x ∈ R^d, set

H_j(t | x₁, . . . , x_{j−1}) = inf{ x : F_j(x | x₁, . . . , x_{j−1}) ≥ t }.

Also, set, for u = (u₁, . . . , u_d) ∈ (0, 1)^d, G₁(u₁) = H₁(u₁), and, for j = 2, . . . , d, G_j(u₁, . . . , u_j) = H_j(u_j | G₁(u₁), . . . , G_{j−1}(u₁, . . . , u_{j−1})). Finally, set

G(u) = ( G₁(u₁), G₂(u₁, u₂), . . . , G_d(u₁, . . . , u_d) ).  (1.81)
U =_d H(X).  (1.83)

(ii) If H(·) is continuously (resp. twice continuously) differentiable in a neighborhood of x ∈ R^d, then G(·) is continuously (resp. twice continuously) differentiable in the neighborhood of t = H(x), and such that

|G(u) − G(t) − DG(t)(u − t)| = o(|u − t|)  as |u − t| → 0.  (1.84)

Moreover,

DG(t) = { DH(x) }⁻¹,  (1.85)

where DG(t) (resp. DH(x)) denotes the differential of G(·) at t = H(x) (resp. of H(·) at x).

(iii) If X has a density f(·), continuous and positive in a neighborhood of x ∈ R^d, then the Jacobian of G(·) at t = H(x) is equal to

1/f(G(t)).  (1.86)
Proof. See, e.g., Deheuvels and Mason [68].
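The construction (1.81) can be exercised on a toy bivariate example (ours, not from the text): take X₁ uniform on (1, 2), so that G₁(u) = 1 + u, and let X₂ given X₁ = x be exponential with rate x, so that the conditional qf is H₂(u | x) = −log(1 − u)/x. Mapping a uniform point of (0, 1)² through these conditional quantiles produces a dependent pair whose second-coordinate mean E(X₂) = E(1/X₁) = log 2 can be checked by simulation:

```python
import math
import random

def G1(u1):
    # Quantile function of X1, uniform on (1, 2).
    return 1.0 + u1

def G2(u1, u2):
    # Conditional quantile of X2 given X1 = G1(u1): here X2 | X1 = x is
    # exponential with rate x (an illustrative choice of F2(.|x)).
    return -math.log(1.0 - u2) / G1(u1)

def sample_pair(rng):
    # Multivariate quantile transform (1.81) for d = 2: a uniform point
    # of (0,1)^2 is pushed coordinate-by-coordinate through G1, G2.
    u1, u2 = rng.random(), rng.random()
    return G1(u1), G2(u1, u2)

rng = random.Random(0)
pairs = [sample_pair(rng) for _ in range(200_000)]
mean_x2 = sum(x2 for _, x2 in pairs) / len(pairs)  # near log 2 = 0.6931...
```

The same recursion, with d coordinates and each H_j fed the previously generated components, implements (1.81) in general; it is the inverse of the map H of (1.83).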
The next theorem gives a useful local version of the quantile transformation in Theorem 1.5.

Theorem 1.6. Assume that X has a density f, continuously (resp. twice continuously) differentiable in a neighborhood of x ∈ R^d, and that f(x) > 0. Then, there exist a t ∈ (0, 1)^d, a neighborhood V_x of x, a neighborhood V_t of t, and a one-to-one mapping G of V_t onto V_x such that x = G(t). Moreover, on a suitable probability space, we may define X jointly with a random vector U uniformly distributed on (0, 1)^d, and such that

X 1I_{X∈V_x} = G(U) 1I_{U∈V_t}.  (1.87)

The function G(·) is of the form
Remark 1.4.
1°) The multivariate quantile theorem was generalized as follows by Skorohod [188] (see, e.g., Dudley and Philipp [81]). Consider two Polish spaces X₁ and X₂, and let L denote a probability law on X₁ × X₂, with marginal distribution L₁ on X₁. Then, there exists a measurable map φ : [0, 1] × X₁ → X₂, such that, if V ∈ X₁ is a rv with distribution L₁, and if U is a uniform (0, 1) rv independent of V, then (V, φ(U, V)) ∈ X₁ × X₂ has distribution L.
2°) The following argument (see, e.g., Lemma A1 in Berkes and Philipp [14]) is often useful in the application of strong invariance principles. Consider three Polish spaces X₁, X₂ and X₃. We consider a probability law L₁,₂ on X₁ × X₂, and a probability law L₂,₃ on X₂ × X₃. We assume that the marginals of L₁,₂ and L₂,₃ on X₂ coincide. Then, there exists a probability space on which sits a rv (X₁, X₂, X₃), where (X₁, X₂) has law L₁,₂, and (X₂, X₃) has law L₂,₃.

Exercise 1.4. Show that, for any probability law L on a Polish space X, there exists a measurable map φ : [0, 1] → X, such that φ(U) has probability law L when U is uniformly distributed on (0, 1).
The mgf φ(s) of X is always finite for s = 0, and such that φ(0) = 1. To characterize the domain of finiteness I := {s ∈ R : φ(s) < ∞} of φ, set

t₁ = inf{t ∈ R : φ(t) < ∞} ≤ 0 ≤ t₀ = sup{t ∈ R : φ(t) < ∞} ≤ ∞.  (2.2)
The following proposition has a more or less straightforward analytic proof.
Proposition 2.1. The mgf φ(·) of a random variable X is always finite and C^∞ in the neighborhood of each s ∈ (t₁, t₀). Moreover, when s ∈ (t₁, t₀), for each m = 1, 2, . . ., we have

φ^(m)(s) = (d^m/ds^m) φ(s) = E(X^m e^{sX}) = ∫ x^m e^{sx} dF(x) ∈ R.  (2.3)

Proof. Omitted.
Remark 2.1. Recall the definition (2.2) of t₀ and t₁. When t₀ < ∞, the value of ψ_X(t₀) may be either finite or infinite, depending upon the distribution of X. The same is true for ψ_X(t₁). The domain of finiteness of ψ = ψ_X, denoted by
I = { s ∈ ℝ : ψ(s) < ∞ },   (2.5)
is, depending upon the distribution of X, one among the four intervals (t₁, t₀), (t₁, t₀], [t₁, t₀), [t₁, t₀].
Proposition 2.2. The mgf ψ(·) of X is always convex on its domain of finiteness. Moreover, the convexity of ψ is always strict, with the exception of the case where X = 0 almost surely.
Proof. The result is trivial for t₁ = t₀ = 0. When t₁ < t₀, observe, via (2.3), that, for each s ∈ (t₁, t₀),
ψ''(s) = E( X² e^{sX} ) ≥ 0,   (2.6)
with equality only possible when X = 0 a.s.
The mgf of X satisfies a property stronger than convexity, being log-convex (meaning that log ψ is convex), as shown in the next proposition.
Topics on Empirical Processes 115
and we may limit ourselves to giving the proof when t₁ < t₀. We then make use of the Schwarz inequality to observe that, for each s ∈ (t₁, t₀),
(d²/ds²) log ψ(s) = { E(X² e^{sX}) E(e^{sX}) − ( E(X e^{sX}) )² } / ( E(e^{sX}) )² ≥ 0,   (2.7)
with equality only possible when X is a.s. constant.
The Chernoff function of X (or Legendre transform of the distribution function of X) is defined by
Λ(λ) = Λ_X(λ) = sup_{s : ψ(s)<∞} { sλ − log ψ(s) },   (2.8)
which is (2.11). The remaining part of the proof of Proposition 2.4, relative to (2.10), is technical and omitted. We refer to Lemma 2.1, p. 266, in Deheuvels [51] for details.
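The Legendre transform (2.8) can be approximated numerically by maximizing sλ − log ψ(s) over a grid of s values. A small sketch, using the standard normal law (for which log ψ(s) = s²/2 and Λ(λ) = λ²/2 exactly) as an illustrative test case:

```python
def log_mgf_normal(s):
    # log-mgf of the standard normal law: log E[e^{sX}] = s^2 / 2.
    return 0.5 * s * s

def chernoff_function(lam, log_mgf, s_grid):
    # Legendre transform (2.8): sup over s of { s*lam - log psi(s) },
    # approximated by a maximum over a finite grid of s values.
    return max(s * lam - log_mgf(s) for s in s_grid)

grid = [i / 1000.0 for i in range(-5000, 5001)]  # s in [-5, 5], step 0.001
val = chernoff_function(1.5, log_mgf_normal, grid)
print(val)  # the exact value is 1.5^2 / 2 = 1.125
```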
In the remainder of this section, we consider a sequence X₁, X₂, ... of independent copies of the rv X, with mgf ψ and Chernoff function Λ as defined above. We set S₀ = 0 and S_n = X₁ + ··· + X_n for n ≥ 1. The following hypotheses will be assumed, at times (recall (1.10)).
(C.1) t₀ = sup{ s : ψ(s) < ∞ } > 0;
(C.2) t₁ = inf{ s : ψ(s) < ∞ } < 0;
(C.3) The distribution of X is nondegenerate, or, equivalently, sup_{x∈ℝ} P(X = x) < 1;
(C.4) m = E(X) ∈ ℝ exists.
Under (C.4) only, Chernoff's theorem (see, e.g., [9, 30]) may be stated as follows.
Theorem 2.1. Under (C.4), we have
P( S_n ≥ nλ ) ≤ exp( −nΛ(λ) ) for each λ ≥ m;   (2.13)
P( S_n ≤ nλ ) ≤ exp( −nΛ(λ) ) for each λ ≤ m.   (2.14)
Moreover,
lim_{n→∞} (1/n) log P( S_n ≥ nλ ) = −Λ(λ) for each λ ≥ m;   (2.15)
lim_{n→∞} (1/n) log P( S_n ≤ nλ ) = −Λ(λ) for each λ ≤ m.   (2.16)
Proof. The proof of (2.13)–(2.14) relies on the simple, but nevertheless powerful, Markov inequality. Below, we establish this result in a slightly more general form. Consider a random variable Y for which ψ_Y(t) = E(exp(tY)) < ∞ for some t ≥ 0. The inequality e^{ty} ≤ e^{tu} for y ≤ u entails that
ψ_Y(t) = E(exp(tY)) = ∫_ℝ e^{tu} dP(Y ≤ u) ≥ ∫_{[y,∞)} e^{tu} dP(Y ≤ u) ≥ e^{ty} ∫_{[y,∞)} dP(Y ≤ u) = e^{ty} P(Y ≥ y),
or, equivalently,
P(Y ≥ y) ≤ e^{−ty} ψ_Y(t),   (2.17)
whence
P(Y ≥ y) ≤ exp( − sup_{t≥0 : ψ_Y(t)<∞} { ty − log ψ_Y(t) } ).   (2.18)
Observe that the inequalities (2.17)–(2.18) are always true for all y ∈ ℝ, even when the only value of t ≥ 0 for which ψ_Y(t) < ∞ is t = 0. Now, consider a random variable Z with finite expectation m_Z := E(Z), and set Y = Z − m_Z, so that E(Y) = 0. In this case, we see that (2.17)–(2.18) readily imply corresponding bounds for the tails of Z.
A crucial step in the proof relies on Proposition 2.1, which implies that, whenever E(Y) = 0, we have
y > 0 ⟹ sup_{t≥0 : ψ_Y(t)<∞} { ty − log ψ_Y(t) } = sup_{t : ψ_Y(t)<∞} { ty − log ψ_Y(t) },   (2.21)
y < 0 ⟹ sup_{t≤0 : ψ_Y(t)<∞} { ty − log ψ_Y(t) } = sup_{t : ψ_Y(t)<∞} { ty − log ψ_Y(t) }.   (2.22)
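The exponential Markov bound (2.17) can be checked in a case where both sides are explicit. The sketch below uses Y ~ Exp(1), an illustrative choice for which ψ_Y(t) = 1/(1−t) for t < 1 and P(Y ≥ y) = e^{−y}:

```python
import math

def exp_mgf(t):
    # mgf of the Exp(1) law, finite for t < 1: psi_Y(t) = 1 / (1 - t).
    assert t < 1.0
    return 1.0 / (1.0 - t)

def markov_bound(y, t):
    # Right-hand side of (2.17): e^{-ty} * psi_Y(t).
    return math.exp(-t * y) * exp_mgf(t)

for y in (1.0, 2.0, 5.0):
    tail = math.exp(-y)  # exact P(Y >= y)
    best = min(markov_bound(y, t / 100.0) for t in range(0, 100))
    print(y, tail, best)  # the optimized bound always dominates the exact tail
```

Optimizing over t, as in (2.18), gives the tightest exponential bound the method can produce.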
Proof. As follows from (2.31), we have, under the assumptions of the theorem,
P( max_{0≤k≤n} S_k ≥ nλ ) ≤ exp( −n sup_{s≥0 : s∈I} { sλ − log ψ(s) } ).   (2.27)
Whenever t₀ > 0, we observe from the proof of Proposition 2.4 that, for each λ > 0, the function L_λ(s) = sλ − log ψ(s) has derivative L′_λ(s) = λ − (d/ds) log ψ(s), decreasing on (0, s₀), with a right derivative at 0 given by L′_λ(0+) ≥ 0. This implies in turn that
sup_{s≥0 : s∈I} { sλ − log ψ(s) } = sup_{s∈I} { sλ − log ψ(s) }.   (2.28)
We conclude (2.25) by combining (2.27) with (2.28). The proof of (2.26) is similar and omitted.
Here, as usual, AC[0,1] denotes the set of all absolutely continuous functions f on [0,1], with Lebesgue derivative ḟ = (d/dx)f. The RKHS of {W(t) : 0 ≤ t ≤ 1} is, accordingly,
H = { f ∈ AC[0,1] : |f|_H < ∞ }.   (2.33)
Its unit ball, often referred to as the Strassen set (see, e.g., [199]), is denoted by S or K, and defined by
S = { f ∈ AC[0,1] : |f|_H ≤ 1 }.   (2.34)
The sup-norm of a function f ∈ C[0,1] (the vector space of continuous functions on [0,1]) is denoted by
‖f‖ = sup_{0≤t≤1} |f(t)|.   (2.35)
For each f ∈ C[0,1] and ε > 0, we set
N_ε(f) = { g ∈ C[0,1] : ‖g − f‖ < ε }.   (2.36)
For any A ⊆ C[0,1] and ε > 0, we set
A^ε = ∪_{f∈A} N_ε(f) = { g ∈ C[0,1] : ∃ f ∈ A, ‖g − f‖ < ε }.   (2.37)
Proof. See, e.g., Schilder [174], and Deuschel and Stroock [73], p. 12.
The next lemma turns out to be useful to derive functional laws of the iterated logarithm.
Lemma 2.1. For each ε ∈ (0,1) and f ∈ S such that 0 < ε < |f|_H ≤ 1, we have
J( C[0,1] ∖ S^ε ) ≥ (1+ε)², and J( N_ε(f) ) ≤ ( |f|_H − ε )².   (2.41)
Proof. A proof of Theorem 2.4 based upon Schilder's Theorem 2.3 is given in Exercises 2.2 and 2.3 below.
A typical application of Strassen's Theorem 2.4 is stated in the following corollary of this theorem.
Corollary 2.1. Let Φ : C₀[0,1] → ℝ denote a continuous functional, with respect to the uniform topology U. Then, we have, with probability 1,
lim sup_{T→∞} Φ(ξ_T) = sup_{f∈S} Φ(f).   (2.43)
Proof. Since S is a compact subset of (C[0,1], U) (see, e.g., Example 1.3), by continuity of Φ, there exists a g ∈ S such that L₀ := Φ(g) = sup_{f∈S} Φ(f). As follows from (2.42), there exists a.s. a sequence T_n ↑ ∞ such that ‖ξ_{T_n} − g‖ → 0, whence
L₁ := lim sup_{T→∞} Φ(ξ_T) ≥ lim_{n→∞} Φ(ξ_{T_n}) = Φ(g) = L₀.
T > 0} such that 0 < a_T ≤ T for T > 0. For each T > 0 and t ≥ 0, consider the function of s ∈ [0,1] defined by
ξ_{T,t}(s) = { W(t + a_T s) − W(t) } / √( 2a_T { log(T/a_T) + log₂ T } ),   (2.45)
where log₂ T = log₊ log₊ T, and log₊ u = log(u ∨ e).
Theorem 2.5. Let a_T ↑ ∞ and T⁻¹ a_T ↓. Then, with probability 1,
lim_{T→∞} sup_{0≤t≤T−a_T} inf_{f∈S} ‖ ξ_{T,t} − f ‖ = sup_{f∈S} lim inf_{T→∞} inf_{0≤t≤T−a_T} ‖ ξ_{T,t} − f ‖ = 0.   (2.46)
Proof. Following the lines of Exercises 2.2 and 2.3 in the sequel, it is not too difficult to infer the above results from Schilder's Theorem 2.3. A more refined proof of Theorem 2.5, valid for possibly non-uniform norms, and based on the isoperimetric inequality (see, e.g., [20]), is to be found in [61].
Exercise 2.1.
1°) Show that the conclusion of Theorem 2.4 may be reformulated into the following two complementary statements.
(i) For each ε > 0, there exists a.s. a T_ε < ∞ such that ξ_T ∈ S^ε for all T ≥ T_ε;
(ii) For each f ∈ S and ε > 0, there exists a.s. a sequence T_n ↑ ∞, such that ξ_{T_n} ∈ N_ε(f) for each n.
2°) Assume that a_T ↑ ∞, T⁻¹ a_T ↓ and (log(T/a_T))/log₂ T → ∞ as T → ∞. Set F_T = { ξ_{T,t} : 0 ≤ t ≤ T − a_T }. Show that, under these assumptions and notation, the conclusion of Theorem 2.5 may be reformulated as follows. For each ε > 0, there exists a.s. a T_ε < ∞ such that, for all T ≥ T_ε, F_T ⊆ S^ε and S ⊆ F_T^ε.
Exercise 2.2.
1°) Fix any ρ > 0, and set T_n = (1+ρ)ⁿ for n ≥ 0. Let {W(t) : t ≥ 0} denote a standard Wiener process, and set ξ_T(u) = W(Tu)/√(2T log₂ T) for 0 ≤ u ≤ 1 and T ≥ 0. Here and elsewhere, log₂ u = log₊ log₊ u and log₊ u = log(u ∨ e). Let S be as in (2.34). Making use of (2.41) and (2.39), show that, for each ε > 0,
Σ_{n=0}^∞ P( ξ_{T_n} ∉ S^ε ) < ∞.
2°) Infer from the results of (1°) that, with probability 1, for each 0 ≤ δ ≤ 1,
lim sup_{n→∞} ‖ξ_{T_n}‖ ≤ 1 and lim sup_{n→∞} sup_{0≤u,v≤1, |u−v|≤δ} | ξ_{T_n}(u) − ξ_{T_n}(v) | ≤ √δ.
3°) Recall the notation (2.35). Show that, for all n sufficiently large, uniformly over T_{n−1} ≤ T ≤ T_n,
1 ≤ √( (2T_n log₂ T_n)/(2T log₂ T) ) ≤ √( (2T_n log₂ T_n)/(2T_{n−1} log₂ T_{n−1}) ) ≤ 1 + ρ.
4°) Establish the inequality, for all large n,
sup_{T_{n−1}≤T≤T_n} ‖ ξ_T − ξ_{T_n} ‖ ≤ A_n + B_n,
where
A_n = (1+ρ) sup_{0≤u,v≤1, |u−v|≤1−1/(1+ρ)} | ξ_{T_n}(u) − ξ_{T_n}(v) | and B_n = ρ ‖ξ_{T_n}‖.
5°) Show that, for each ε > 0, there exists a.s. a T_ε < ∞, such that, for all T ≥ T_ε, ξ_T ∈ S^ε.
6°) Show that, a.s. for each 0 ≤ δ ≤ 1,
lim sup_{T→∞} ‖ξ_T‖ ≤ 1 and lim sup_{T→∞} sup_{0≤u,v≤1, |u−v|≤δ} | ξ_T(u) − ξ_T(v) | ≤ √δ.   (2.48)
Exercise 2.3. Recall the notation (2.32) and (2.41). Fix any ρ > 0, and set T_n = (1+ρ)ⁿ for n ≥ 0. Let ξ_T(·) be as in Exercise 2.2, and set, for n ≥ 1,
θ_{T_n}(u) = { W(T_{n−1} + (T_n − T_{n−1})u) − W(T_{n−1}) } / √(2T_n log₂ T_n).
1°) Making use of (2.48), show that the processes {θ_{T_n}(u) : 0 ≤ u ≤ 1} are independent, and such that
lim sup_{n→∞} ‖ ξ_{T_n} − θ_{T_n} ‖ ≤ 2/√(1+ρ) a.s.
2°) Let f ∈ S and ε ∈ (0,1) be such that |f|_H < 1 − ε. Show that there exists a ρ₀ such that, for all ρ ≥ ρ₀,
Σ_{n=1}^∞ P( θ_{T_n} ∈ N_ε(f) ) = ∞.
3°) Making use of the fact that S is compact in (C[0,1], U), show that, for each f ∈ S,
lim inf_{T→∞} ‖ ξ_T − f ‖ = 0 a.s.
Exercise 2.4.
1°) Check that, whenever W(·) is a standard Wiener process, the process W̃ defined by W̃(0) = 0 and W̃(t) = tW(1/t) for t > 0 is, again, a Wiener process.
2°) Making use of Corollary 2.1, show that Strassen's Theorem 2.4 implies the law of the iterated logarithm [LIL] for the Wiener process (due to Lévy [137])
lim sup_{T↓0} TW(1/T)/√(2T log₂(1/T)) = lim sup_{T→∞} W(T)/√(2T log₂ T) = 1 a.s.   (2.49)
3°) Let {φ(t) : t > 0} be a measurable positive function such that
φ(t)/√(2t log₂(1/t)) → ∞ as t ↓ 0.
Define a norm on C[0,1] by setting |f|_φ = sup_{0<t≤1} |f(t)|/φ(t). Show that the conclusion of Theorem 2.4 holds when ‖·‖ is replaced by |·|_φ [see, e.g., [61, 62]].
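The time-inversion claim in part 1°) of Exercise 2.4 can be verified at the level of covariances: since Cov(W(a), W(b)) = min(a, b), it suffices that the inverted process reproduces this covariance. A small sketch of the computation:

```python
def inverted_cov(s, t):
    # Covariance of the time-inverted process at times s and t:
    # Cov(s*W(1/s), t*W(1/t)) = s * t * min(1/s, 1/t), which equals min(s, t).
    return s * t * min(1.0 / s, 1.0 / t)

for s, t in [(0.5, 2.0), (1.0, 3.0), (0.25, 0.75)]:
    print(inverted_cov(s, t), min(s, t))  # the two columns agree
```

Together with the Gaussianity and continuity of t·W(1/t) at 0, this identifies the inverted process as a Wiener process.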
Assume further that ψ(s) < ∞ for all s ∈ ℝ, or, equivalently (see, e.g., (2.10)), that
(C) lim_{|λ|→∞} Λ(λ)/|λ| = ∞.
Remark 2.3.
1°) (D[0,M], S) is a Polish space, and D[0,M] ⊇ C[0,M]. The uniform topology U is stronger on D[0,M] than the Skorohod topology S. On the other hand, the restriction of the Skorohod topology S to C[0,M] coincides there with the uniform topology (see, e.g., p. 112 in [19]). As a consequence of this, a compact subset of (C[0,M], U) is also a compact subset of (D[0,M], S). Moreover, for any f ∈ C[0,M], a uniform neighborhood of f is a Skorohod neighborhood of f. For characterizations of compactness in D[0,M], with respect to S, we refer to Ch. 3 in [19].
2°) In view of (1.59) and (2.10), the assumption (C) is equivalent to the condition that
t₀ = sup{ t : ψ(t) < ∞ } = ∞ and t₁ = inf{ t : ψ(t) < ∞ } = −∞.   (2.51)
As follows from Theorem 1.1, and (1.67) in Example 1.3, the set defined, for each c > 0, by
Γ_{Λ,c} = { f ∈ AC[0,M] : f(0) = 0 and c ∫₀^M Λ( c⁻¹ ḟ(u) ) du ≤ 1 },   (2.52)
is, under (C), a compact subset of (C[0,M], U).
For any function f ∈ D[0,M], define a functional J_{X,M}(f) by setting
J_{X,M}(f) = ∫₀^M Λ( ḟ(u) ) du if f ∈ AC[0,M] and f(0) = 0, and J_{X,M}(f) = ∞ otherwise.   (2.53)
Define further, for each A ⊆ D[0,M],
J_{X,M}(A) = inf_{f∈A} J_{X,M}(f) if A ≠ ∅, and J_{X,M}(A) = ∞ if A = ∅.   (2.54)
Theorem 2.6. Under (C), for each closed subset F of (D[0,M], S), we have
1 X()
2.3.2. Inequalities for binomial distributions. Let X follow a Bernoulli Be(p) distribution, namely,
P(X = 1) = 1 − P(X = 0) = p ∈ (0,1).   (2.71)
The moment generating function of X is then given by
ψ(t) = E(exp(tX)) = 1 + p(e^t − 1) for t ∈ ℝ.   (2.72)
Proposition 2.9. The Chernoff function of the Bernoulli Be(p) distribution is given by
Λ(λ) = log( 1/(1−p) ) if λ = 0; Λ(λ) = λ log(λ/p) + (1−λ) log( (1−λ)/(1−p) ) if 0 < λ < 1; Λ(λ) = log(1/p) if λ = 1; and Λ(λ) = ∞ if λ ∉ [0,1].   (2.73)
Proof. Straightforward.
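Formula (2.73) is easy to cross-check against the defining Legendre transform (2.8), using the Bernoulli mgf (2.72). In the sketch below, the values p = 0.3 and λ = 0.5 and the grid are arbitrary illustrative choices:

```python
import math

def bernoulli_chernoff(lam, p):
    # Chernoff function (2.73) of the Be(p) law; infinite outside [0, 1].
    if lam < 0.0 or lam > 1.0:
        return math.inf
    if lam == 0.0:
        return math.log(1.0 / (1.0 - p))
    if lam == 1.0:
        return math.log(1.0 / p)
    return lam * math.log(lam / p) + (1.0 - lam) * math.log((1.0 - lam) / (1.0 - p))

p, lam = 0.3, 0.5
grid = [i / 1000.0 for i in range(-8000, 8001)]  # s in [-8, 8], step 0.001
sup = max(s * lam - math.log(1.0 + p * (math.exp(s) - 1.0)) for s in grid)
print(bernoulli_chernoff(lam, p), sup)  # agree up to the grid resolution
```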
The following theorem turns out to be quite useful. Recall the definition (2.60) of h(·).
Theorem 2.7. Let S_n follow a binomial B(n, p) distribution. Then
Proof. It is well known (see, for example, Csörgő and Révész (1981)) that
P( sup_{0≤t≤T} W(t) ≥ u√T ) = 2(1 − Φ(u)) = (2/√(2π)) ∫_u^∞ e^{−t²/2} dt.   (2.80)
Instead of using (2.80) for proving (2.78), we apply (2.31) to obtain directly that
We so obtain (2.78) in the "+" case. The proof of (2.78) in the "−" case is identical, and omitted. Finally, (2.79) is a trivial consequence of (2.78).
2.4. Strong approximations of partial sums of i.i.d. random variables
We limit ourselves here to a few results which are relevant in our discussion, and refer to the monograph [39] of Csörgő and Révész for additional details and references. In particular, we will concentrate on the case where the random variables have exponential moments. Set S₀ = 0 and S_n = X₁ + ··· + X_n for n ≥ 1, where X₁, X₂, ... are iid copies of a rv X. We assume below that the mgf ψ(t) = E(exp(tX)) of X fulfills the condition (C) = (C.1–2), where:
(C.1) t₀ = sup{ s : ψ(s) < ∞ } > 0; and
(C.2) t₁ = inf{ s : ψ(s) < ∞ } < 0.
The assumption (C) = (C.1–2) implies the existence of E(X^k) for each k ∈ ℕ. We will set, in particular,
m = E(X) and σ² = Var(X) = E( (X − m)² ).
We set, further, S(t) = S_⌊t⌋ for t ≥ 0, where ⌊t⌋ ≤ t < ⌊t⌋ + 1 denotes the integer part of t. The following Theorem 2.8(1°) states a variant of the celebrated Komlós, Major and Tusnády strong invariance principle ([129, 130]), with a refinement due to Major [146] for (2°). Part (3°) of the theorem is due to Strassen [199] (see also [147, 148]). We set log₊ u = log(u ∨ e), and log₂ u = log₊(log₊ u), for u ∈ ℝ.
Theorem 2.8.
1°) Under (C), it is possible to define {S(t) : t ≥ 0} jointly with a standard Wiener process {W(t) : t ≥ 0}, in such a way that, for constants (depending upon the distribution of X) a₁ > 0, a₂ > 0 and a₃ > 0, we have, for all T ≥ 0 and x ∈ ℝ,
P( sup_{0≤t≤T} | S(t) − mt − σW(t) | ≥ a₁ log₊ T + x ) ≤ a₂ exp(−a₃ x).   (2.81)
2°) If we only assume that E(|X|^r) < ∞ for some r > 2, then we may write, as T → ∞,
| S(T) − mT − σW(T) | = o( T^{1/r} ) a.s.   (2.82)
3°) Finally, under the assumption that E(|X|²) < ∞ only, we have, as T → ∞,
| S(T) − mT − σW(T) | = o( √(T log₂ T) ) a.s.   (2.83)
By combining (2.81) with the Borel-Cantelli lemma (see the Exercises 2.6 and 2.7 below), we obtain readily that, under (C),
| S(T) − mT − σW(T) | = O(log T) a.s. as T → ∞.   (2.84)
Let {a_T : T > 0} be such that 0 < a_T ≤ T for T > 0, a_T ↑ and T⁻¹ a_T ↓. Set, for each T > 0 and t ≥ 0,
ζ_{t,T}(s) = { S(t + s a_T) − S(t) − m s a_T } / ( σ √( 2a_T { log(T/a_T) + log₂ T } ) ) for 0 ≤ s ≤ 1.   (2.85)
Recall the definition (1.65) of the Strassen set S. Then, with probability 1,
lim_{T→∞} sup_{0≤t≤T−a_T} inf_{f∈S} ‖ ζ_{t,T} − f ‖ = sup_{f∈S} lim inf_{T→∞} inf_{0≤t≤T−a_T} ‖ ζ_{t,T} − f ‖ = 0.   (2.86)
The limit law (2.88) breaks down when a_T/log T = O(1). This range is covered by the Erdős-Rényi new law of large numbers (see, e.g., [97]) discussed in 2.5 below. For further refinements, we refer to Deheuvels and Steinebach [71]. When combined, the above results readily yield Strassen's proof ([199]) of the classical law of the iterated logarithm, due originally to Hartman and Wintner [112].
Exercise 2.7.
1°) Let X₁, X₂, ..., be iid copies of a random variable X fulfilling (C). Show that there exist constants c₁ > 0 and c₂ > 0, such that P(|X| ≥ t) ≤ c₁ exp(−c₂ t) for all t ≥ 0.
2°) Show that, as n → ∞,
|X_n| = O(log n) a.s.   (2.92)
for the Chernoff function pertaining to ψ. For each c ∈ (0, ∞] (with the convention 1/∞ = 0), we set
α_c⁺ = inf{ λ > m : Λ(λ) ≥ 1/c } and α_c⁻ = sup{ λ < m : Λ(λ) ≥ 1/c }.   (2.93)
Remark 2.5. Let X be degenerate with P(X = m) = 1. Then, ψ(t) = e^{tm} for t ∈ ℝ, and
Λ(λ) = 0 for λ = m, and Λ(λ) = ∞ for λ ≠ m.
In this case, (C.1–2–4) hold, and, for each c ∈ (0, ∞], α_c^± = m. As follows from the next proposition, the assumption that X is nondegenerate turns out to imply, under (C.1–2–4), that α_c^± ≠ m for c ∈ (0, ∞).
For each integer k such that 0 ≤ k ≤ n, consider the maximal and minimal increments
I_n⁺(k) := max_{0≤i≤n−k} ( S_{i+k} − S_i ) and I_n⁻(k) := min_{0≤i≤n−k} ( S_{i+k} − S_i ).   (2.95)
Fix a c > 0, and set k_n = ⌊c log n⌋ for n ≥ 1, where ⌊u⌋ ≤ u < ⌊u⌋ + 1 denotes the integer part of u.
Theorem 2.10. (Erdős and Rényi) We have, almost surely, under (C.1–4),
lim_{n→∞} k_n⁻¹ I_n⁺(k_n) = α_c⁺,   (2.96)
and, under (C.2–4),
lim_{n→∞} k_n⁻¹ I_n⁻(k_n) = α_c⁻.   (2.97)
Proof. In view of Remark 2.5, the statements (2.96)–(2.97) hold trivially when X is degenerate. Therefore, we may limit ourselves to establishing these statements in the remaining case, where (C.3) is satisfied.
Step 1. (Outer bounds) Assume (C.1–3–4). We define an integer sequence {n_k : k ≥ k₀}, where
k₀ := inf{ k ≥ 1 : e^{(k+1)/c} − e^{k/c} ≥ 1 },
and, for each k ≥ k₀, by setting n_k := sup{ n ≥ 1 : ⌊c log n⌋ = k }. The inequalities
c log n − 1 < ⌊c log n⌋ = k_n ≤ c log n,
readily imply that, for each k ≥ k₀,
e^{k/c} ≤ n_k < e^{(k+1)/c}.   (2.98)
Moreover, it is easy to see that, for all k ≥ k₀,
max_{n_{k−1} < n ≤ n_k} k_n⁻¹ I_n⁺(k_n) = k⁻¹ I_{n_k}⁺(k).   (2.99)
Thus, by (2.13), (2.95), and (2.98)–(2.99), we obtain that, for each λ ≥ m and k ≥ k₀,
By all this, we see that, for k ≥ k₀, P_k⁺(α_c⁺ + ε) ≤ e · exp(−θk) for some θ > 0, and hence Σ_{k≥k₀} P_k⁺(α_c⁺ + ε) < ∞. The Borel-Cantelli lemma implies therefore that, for each ε > 0,
lim sup_{n→∞} k_n⁻¹ I_n⁺(k_n) ≤ α_c⁺ + ε a.s.
Since this relation holds for each specified ε > 0, it implies, in turn, that
lim sup_{n→∞} k_n⁻¹ I_n⁺(k_n) ≤ α_c⁺ a.s.   (2.100)
Step 2. (Inner bounds) Assuming (C.1–3–4), we set L_n⁺(k) = max{ S_k, S_{2k} − S_k, ..., S_{kN} − S_{k(N−1)} }, where N = N(n, k) := ⌊n/k⌋ ≥ 1 and k = k_n = ⌊c log n⌋. We have, for each λ > m = E(X),
We next make use of Chernoff's theorem (refer to (2.15)) to show that, for any λ₁ ∈ (m, λ),
lim_{k→∞} (1/k) log P( S_k ≥ kλ₁ ) = −Λ(λ₁),
so that, ultimately in k → ∞, for each ε₂ > 0,
P( S_k ≥ kλ₁ ) ≥ exp( −k{ Λ(λ₁) + ε₂ } ).   (2.101)
Now, letting λ = α_c⁺, with λ₁ ∈ (m, α_c⁺), we infer from (2.94) and the definition (2.93) of α_c⁺ that
2δ := (1/c) − Λ(λ₁) > 0.
By all this, if we set ε₂ = δ and k = k_n = ⌊c log n⌋ ≤ c log n in (2.101), we obtain that the inequality
P( S_{k_n} ≥ k_n λ₁ ) ≥ exp( −k_n { Λ(λ₁) + δ } ) ≥ (1/n) n^{cδ},
holds for all large n. This, in turn, readily entails that, for all n sufficiently large,
Q_{n,k_n}(λ₁) ≤ exp( −N(n, k_n) n^{cδ}/n ) ≤ exp( −n^{cδ/2} ),
which is summable in n. Keeping in mind that L_n⁺(k) ≤ I_n⁺(k), an application of the Borel-Cantelli lemma yields that, a.s. ultimately, k_n⁻¹ I_n⁺(k_n) ≥ λ₁.
Since this relation holds for an arbitrary λ₁ ∈ (m, α_c⁺), we obtain therefore that
lim inf_{n→∞} k_n⁻¹ I_n⁺(k_n) ≥ α_c⁺ a.s.   (2.102)
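Theorem 2.10 can be observed in simulation. The sketch below uses fair Bernoulli increments, an illustrative choice; for Be(1/2) one has m = 1/2 and, solving Λ(α) = 1/c with c = 2 via (2.73), α_c⁺ ≈ 0.95, so the maximal increment over windows of length k_n = ⌊c log n⌋ should sit well above the law-of-large-numbers value k_n/2:

```python
import math
import random

random.seed(2)
n, c = 200_000, 2.0
k = int(c * math.log(n))  # window length k_n = floor(c log n)
x = [random.randint(0, 1) for _ in range(n)]  # iid Be(1/2) increments

# Prefix sums S_0, ..., S_n, then the maximal increment I_n^+(k) of (2.95).
s = [0]
for xi in x:
    s.append(s[-1] + xi)
best = max(s[i + k] - s[i] for i in range(n - k + 1))
print(best / k)  # noticeably above the mean m = 1/2, near alpha_c^+ ~ 0.95
```

Unlike a classical law of large numbers, the normalized maximal increment does not converge to m but to the Chernoff-function root α_c⁺.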
Exercise 2.9. Assume that (C.1–4) is satisfied, and that m = E(X) = 0. Set
J_n⁺(k) = max_{0≤i≤j≤i+k≤n} ( S_j − S_i ) and J_n⁻(k) = min_{0≤i≤j≤i+k≤n} ( S_j − S_i ).
1°) Show that the conclusion (2.96)–(2.97) of Theorem 2.10 holds, with the formal replacements of I_n⁺(k) and I_n⁻(k) by J_n⁺(k) and J_n⁻(k), respectively.
2°) Show that these results fail to hold when m = E(X) ≠ 0 [for additional details, see, e.g., [58, 59]].
Exercise 2.10. Let {W(t) : t ≥ 0} be a Wiener process. Set
I_n⁺(k) = max_{0≤i≤n−k} ( W(i+k) − W(i) ) and I_n⁻(k) = min_{0≤i≤n−k} ( W(i+k) − W(i) ).
1°) Letting k_n = ⌊c log n⌋, for c > 0, find the a.s. limits
lim_{n→∞} k_n⁻¹ I_n⁺(k_n) and lim_{n→∞} k_n⁻¹ I_n⁻(k_n).
2°) Show that if, on a suitable probability space, |S_n − W(n)| = o(log n) a.s., as n → ∞, then the Chernoff function of X is given by Λ(λ) = ½λ² for λ ∈ ℝ. [As follows from a result of Bártfai [12] (see, e.g., p. 101 in [40]), this implies that X =_d N(0,1), so that the relation |S_n − W(n)| = o(log n) is impossible, unless S_n coincides identically with W(n), for some Wiener process W(·).]
2.5.2. Functional Erdős-Rényi laws. The original Erdős-Rényi law [97] has given rise to a number of refinements and variants in the literature (see, e.g., [58, 59] and the references therein). Below, we give a functional version of this theorem due to Deheuvels [51]. We will work in the framework and assumptions of 2.5.1, assuming throughout that (C.3–4) hold. We introduce the following additional notation. We set, for each t ≥ 0, S(t) = S_⌊t⌋ (which is right-continuous). For an arbitrary function 0 < a_T ≤ T of T ≥ 0, we set, for x ≥ 0,
η_{x,T}(s) = a_T⁻¹ { S(x + s a_T) − S(x) } for 0 ≤ s ≤ 1,   (2.103)
and set, for each T > 0,
F_T = { η_{x,T} : 0 ≤ x ≤ T − a_T }.   (2.104)
It is noteworthy that, for each T > 0, F_T ⊆ BV₀[0,1], where BV₀[0,1] is as in (1.55). Letting Λ and t₀, t₁ be as in (2.8) and (2.10), we set, for each c > 0,
Γ_{Λ,c} = { H ∈ BV₀[0,1] : c ∫₀¹ Λ( c⁻¹ Ḣ(u) ) du + t₀ H_S⁺(1) − t₁ H_S⁻(1) ≤ 1 }.   (2.105)
Note that, when t₀ = ∞ and t₁ = −∞, the set defined by (2.105) reduces to
Γ_{Λ,c} = { H ∈ AC[0,1] : H(0) = 0 and c ∫₀¹ Λ( c⁻¹ Ḣ(u) ) du ≤ 1 }.   (2.106)
In view of Theorems 1.1, 1.2 and 1.3, we see that:
– When t₀, t₁ are arbitrary, Γ_{Λ,c} ⊆ BV₀[0,1] is compact with respect to the topology W of weak convergence of distribution functions of signed measures in M_f[0,1], defined via the Högnäs metric (1.57). The set Γ_{Λ,c} may be identified, through the mapping H ↦ μ_H, with a weakly compact metrizable (via d_H) subset of the (non-metrizable) topological space (M_f[0,1], W);
– When t₁ = −∞ and t₀ is arbitrary, Γ_{Λ,c} ⊆ I⁺[0,1] is compact with respect to the topology of weak convergence of distribution functions, defined by the Lévy distance L(·,·). This set of functions may be identified, through the mapping H ↦ μ_H, with a weakly compact subset of the metrizable (via L) topological space (M_f⁺[0,1], W);
Theorem 2.11. Fix any c > 0, and assume that a_T/log T → c as T → ∞. Then, under either (C.1–3–4) or (C.2–3–4), there exists a 0 < C < ∞ such that Γ_{Λ,c} ⊆ BV₀,C[0,1], and, a.s. ultimately as T → ∞, F_T ⊆ BV₀,C[0,1]. Moreover, if we set, for each ε > 0 and A ⊆ BV₀,C[0,1],
3. Empirical Processes
3.1. Uniform empirical distribution and quantile functions
Let U₁, U₂, ... be a sequence of independent and identically distributed [iid] uniform (0,1) random variables. For each n ≥ 1, the empirical distribution based upon U₁, ..., U_n is defined by
P_n(B) = (1/n) Σ_{i=1}^n 1I_{{U_i ∈ B}},   (3.1)
so that sup_{(i−1)/n < t ≤ i/n} | U_n(V_n(t)) − t | ≤ n⁻¹. Since, by definition, U_n(V_n(0)) = n⁻¹, we conclude readily (3.7).
Proposition 3.3. We have, with probability 1,
‖ U_n − I ‖ = ‖ V_n − I ‖ = max_{1≤i≤n} max( U_{i,n} − (i−1)/n, i/n − U_{i,n} ).   (3.8)
Proof. For each i = 1, ..., n and U_{i−1,n} < t ≤ U_{i,n}, we have U_n(t) − t = (i−1)/n − t, whereas, for (i−1)/n < t ≤ i/n, V_n(t) − t = U_{i,n} − t. By combining these two relations,
Now, the change of variables (u₁, ..., u_n, s) → (u₁/s, ..., u_n/s, s) has Jacobian equal to s^n, so that the joint density of (v₁ := u₁/s, ..., v_n := u_n/s, s) equals
f(v₁, ..., v_n, s) = n! · ( s^n e^{−s}/n! ).
We obtain therefore that (T₁/T_{n+1}, ..., T_n/T_{n+1}) and T_{n+1} are independent. Moreover,
(i) (T₁/T_{n+1}, ..., T_n/T_{n+1}) is uniformly distributed (with constant density, equal to n!) on the set {(v₁, ..., v_n) : 0 ≤ v₁ ≤ ··· ≤ v_n ≤ 1};
(ii) T_{n+1} follows a Γ(n+1) distribution on ℝ⁺.
To conclude (3.10), we observe that the joint distributions of the random vectors (T₁/T_{n+1}, ..., T_n/T_{n+1}) and (U_{1,n}, ..., U_{n,n}) coincide.
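The representation used in this proof, (U_{1,n}, ..., U_{n,n}) =_d (T₁/T_{n+1}, ..., T_n/T_{n+1}) with T_i the partial sums of iid Exp(1) variables, lends itself to a quick Monte Carlo check; the sample sizes below are arbitrary:

```python
import math
import random

random.seed(3)

def uniform_order_stats_via_gamma(n):
    # T_i = E_1 + ... + E_i with E_j iid Exp(1); return (T_1/T_{n+1}, ..., T_n/T_{n+1}),
    # which has the law of the uniform order statistics (U_{1,n}, ..., U_{n,n}).
    e = [-math.log(1.0 - random.random()) for _ in range(n + 1)]
    t, partial = [], 0.0
    for ej in e:
        partial += ej
        t.append(partial)
    return [ti / t[-1] for ti in t[:-1]]

# E(U_{i,n}) = i/(n+1): check the mean of the minimum for n = 4, namely 1/5.
n, trials = 4, 20_000
mean_min = sum(uniform_order_stats_via_gamma(n)[0] for _ in range(trials)) / trials
print(mean_min)  # close to 1/5
```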
Theorem 3.2. For each n ≥ 1, the random variables
( U_{i,n}/U_{i+1,n} )^i, i = 1, ..., n (with the convention U_{n+1,n} = 1),   (3.11)
are independent and uniformly distributed on (0,1).
Proof. We change variables by setting Y_i = −log U_i, for i = 1, 2, ..., so that Y₁, Y₂, ... defines an iid sequence of exponentially distributed random variables. For each n ≥ 1, set Y_{0,n} = 0, and denote by
0 < Y_{1,n} < ··· < Y_{n,n},
the order statistics of Y₁, ..., Y_n. Keeping in mind that
Y_{i,n} = −log U_{n−i+1,n} for i = 0, ..., n,
we see that (3.11) is equivalent to the property that
(n − i + 1){ Y_{i,n} − Y_{i−1,n} }, i = 1, ..., n,   (3.12)
are independent and exponentially distributed. To establish this property, we recall from the proof of Theorem 3.1 that the joint density of (U_{1,n}, ..., U_{n,n}) is constant and equal to n! on the set {(v₁, ..., v_n) : 0 ≤ v₁ ≤ ··· ≤ v_n ≤ 1}. The change of variables (v₁, ..., v_n) → (y₁ := −log v_n, ..., y_n := −log v₁) has Jacobian 1/(v₁ ··· v_n) = exp(y₁ + ··· + y_n), so that the joint density of (Y_{1,n}, ..., Y_{n,n}) equals
f(y₁, ..., y_n) = n! e^{−(y₁+···+y_n)},
on the domain {(y₁, ..., y_n) : 0 ≤ y₁ ≤ ··· ≤ y_n}. We now make the change of variables (y₁, ..., y_n) → (t₁ := ny₁, t₂ := (n−1)(y₂ − y₁), ..., t_n := y_n − y_{n−1}). Since the corresponding Jacobian equals n!, we conclude that the joint density of (nY_{1,n}, (n−1)(Y_{2,n} − Y_{1,n}), ..., Y_{n,n} − Y_{n−1,n}) is equal to
f(t₁, ..., t_n) = Π_{i=1}^n e^{−t_i} for t_i ≥ 0, i = 1, ..., n.
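Theorem 3.2 itself can be checked by a small simulation, pooling the transformed ratios over independent replications; the choices n = 5 and 5 000 replications are arbitrary:

```python
import random

random.seed(4)

def malmquist_ratios(n):
    # (U_{i,n} / U_{i+1,n})^i for i = 1..n, with the convention U_{n+1,n} = 1;
    # by Theorem 3.2 these are iid uniform on (0, 1).
    u = sorted(random.random() for _ in range(n))
    u.append(1.0)
    return [(u[i - 1] / u[i]) ** i for i in range(1, n + 1)]

# Pool the ratios over many replications and check the uniform mean 1/2.
n, trials = 5, 5_000
vals = [r for _ in range(trials) for r in malmquist_ratios(n)]
print(sum(vals) / len(vals))  # close to 1/2
```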
Exercise 3.2. Fix a constant c > 0, and set k_n = ⌊c log n⌋. Letting ℓ(·) be as in (1.61), set
ρ_c⁺ = inf{ x > 1 : ℓ(x) ≥ 1/c } and ρ_c⁻ = sup{ x < 1 : ℓ(x) ≥ 1/c }.
Consider the statistics, for 1 ≤ k ≤ n + 1,
M_{n,k}⁺ = max_{1≤i≤n−k+2} { U_{i+k−1,n} − U_{i−1,n} }
and
M_{n,k}⁻ = min_{1≤i≤n−k+2} { U_{i+k−1,n} − U_{i−1,n} }.
sufficiently large,
P(A_k) ≤ 2P(B_k).   (3.24)
In view of (3.16) and (3.23), we obtain, in turn, that, for all large k,
P(B_k) ≤ 2 exp( −(1+ε)² log₂ n_k ) = 2/(log n_k)^{(1+ε)²}.   (3.25)
Since, as k → ∞, log n_k = (1+o(1)) k log(1+ρ), we infer from (3.24) and (3.25) that
Σ_{k=1}^∞ P(A_k) < ∞.   (3.26)
The Borel-Cantelli lemma, when combined with (3.26), implies that the events A_k hold finitely often with probability 1. This, in turn, implies that, a.s., ultimately for all large n,
‖α_n‖ ≤ ½(1+2ε) { (n_k log₂ n_k)/(n_{k−1} log₂ n_{k−1}) }^{1/2} √(2 log₂ n) ≤ ½(1+2ε)(1+ρ) √(2 log₂ n).   (3.27)
Here, we have used the observation that, ultimately for all large k,
(n_k log₂ n_k)/(n_{k−1} log₂ n_{k−1}) = (1+o(1))(1+ρ) ≤ (1+ρ)².
Since ε > 0 and ρ > 0 in (3.27) may be chosen as small as desired, it follows readily from this limiting statement that, with probability 1,
lim sup_{n→∞} (2 log₂ n)^{−1/2} ‖α_n‖ = lim sup_{n→∞} (2 log₂ n)^{−1/2} ‖β_n‖ ≤ ½.   (3.28)
We conclude (3.19) by combining (3.21) with (3.28).
Exercise 3.3. Fix any 0 ≤ T ≤ 1. Show that, almost surely,
lim sup_{n→∞} (2 log₂ n)^{−1/2} sup_{0≤t≤T} |α_n(t)| = sup_{0≤t≤T} √(t(1−t)).
Exercise 3.4. Making use of (3.64) and (3.65), show that the constant C in the Dvoretzky, Kiefer and Wolfowitz [86] inequality, P( ‖α_n‖ ≥ t ) ≤ C e^{−2t²}, cannot be chosen less than 2.
3.3. Some further martingale inequalities
We have the following useful results concerning empirical processes and Brownian bridges.
Proposition 3.4. For each specified n ≥ 1, the process
{ α_n(t)/(1−t) : 0 ≤ t < 1 },   (3.29)
defines a martingale.
Proof. It is equivalent to prove that the process { (nU_n(t) − nt)/(1−t) : 0 ≤ t < 1 } defines a martingale. Consider any 0 < s < t < 1, and set A = nU_n(s) and
≤ 2 exp( −(ε²(1−T)²/2)( 1 − ε/(3√(nT)) ) ).
Proof. First note that
P( sup_{0≤t≤T} |α_n(t)| ≥ ε√T ) ≤ P( sup_{0≤t≤T} α_n(t)/(1−t) ≥ ε√T ) + P( sup_{0≤t≤T} (−α_n(t))/(1−t) ≥ ε√T ).
Next, we combine Proposition 3.4 with (2.31). We so obtain that
P( sup_{0≤t≤T} α_n(t)/(1−t) ≥ u ) ≤ exp( − sup_{s≥0} { su − log E exp( s α_n(T)/(1−T) ) } ),
and
P( sup_{0≤t≤T} α_n(t)/(1−t) ≥ ε√T ) ≤ exp( −nT h( 1 + ε/√(nT) ) ).
A similar argument allows us to show that
P( sup_{0≤t≤T} (−α_n(t))/(1−t) ≥ ε√T ) ≤ exp( −nT h( 1 − ε(1−T)/√(nT) ) ).
In view of (2.68) and (2.70), we conclude readily (3.33).
Remark 3.4. We will obtain in the forthcoming Proposition 3.9 a refinement of the inequality (3.33), through a completely different argument.
3.4. Relations between empirical and Poisson processes
The empirical process is related to the Poisson process by the following general principle. The latter provides a construction of a Poisson process on a general locally compact separable metric space X (see, e.g., 1.3.1). Consider a sequence X = X₁, X₂, ... of iid X-valued random variables with common distribution P_X(B) = P(X ∈ B), for B belonging to the set B_X of all Borel subsets of X.
Next, we make use of the observation that, if K follows a Poisson distribution with expectation λ, then
sup_{k≥0} P(K = k) = P(K = ⌊λ⌋),   (3.40)
where ⌊λ⌋ ≤ λ < ⌊λ⌋ + 1 is the integer part of λ. To establish this property, observe that, if
p_k := P(K = k) = e^{−λ} λ^k/k! for k = 0, 1, ...,
then
p_k/p_{k−1} − 1 = (λ − k)/k, which is ≥ 0 for k ≤ ⌊λ⌋, and < 0 for k > ⌊λ⌋,
from where (3.40) is straightforward. By all this, we infer from the Stirling formula,
N! = (1 + o(1)) (N/e)^N √(2πN) as N → ∞,
that, as n → ∞,
( 1/P(R⁻ + R⁺ = n) ) sup_{k≥0} P(R⁻ = k) = ( n!/(e^{−n} n^n) ) e^{−n(1−T)} ( n(1−T) )^{n(1−T)} / ⌊n(1−T)⌋! = (1 + o(1))/√(1−T).
This, in turn, suffices to show that
≤ ( 1/√(1−T₀) ) exp( −(ε²/2)( 1 − ε/(3√(nT)) ) ).
Proof. Let {Π(t) : t ≥ 0} denote a standard Poisson process. By combining (3.38) and (3.39) with (2.65), taken with λ = ε/√(nT), we obtain readily that, for all n sufficiently large,
≤ ( (1 + o(1))/√(1−T₀) ) exp( −nT h( 1 + ε/√(nT) ) ),
which, in view of (2.70), readily yields (3.41).
Proof. The original papers of Komlós, Major and Tusnády [129, 130] give hardly any proof of (3.46), and the first self-contained proof of this inequality was given by Mason and van Zwet [155]. Some details are also given in Csörgő and Révész [39], Bretagnolle and Massart [26], and Csörgő and Horváth [38].
Remark 3.5. The optimal choice of a₁, a₂, a₃ in (3.46) (as well as that of a₄, a₅, a₆ in (3.47) below) remains an open problem. The best presently known result in this direction is due to Bretagnolle and Massart [26], who showed that (3.46) holds for n ≥ 2 with a₁ = 12, a₂ = 2 and a₃ = 1/6.
P( ‖ α_n − n^{−1/2} Σ_{i=1}^n B_i ‖ ≥ (log₊ n)( a₁₀ log₊ n + t )/√n ) ≤ a₁₁ exp(−a₁₂ t).   (3.51)
10
P sup n (u) Bi (u) +
+
a11 exp(a12 t).
d
0u n n i=1
n
(3.52)
Proof. See, e.g., Castelle and Laurent-Bonvalot [27] and Castelle [28].
sup_{0≤u≤1} | α_n(u) − n^{−1/2} Σ_{i=1}^n B_i(u) | = O( n^{−1/2} (log n)² ) a.s.,   (3.53)
and
sup_{0≤u≤1} | β_n(u) + n^{−1/2} Σ_{i=1}^n B_i(u) | = O( n^{−1/4} (log n)^{1/2} (log₂ n)^{1/4} ) a.s.,   (3.54)
while it is not possible to achieve
sup_{0≤u≤1} | β_n(u) + n^{−1/2} Σ_{i=1}^n B_i(u) | = o( n^{−1/4} (log n)^{1/2} (log₂ n)^{1/4} ) a.s.   (3.55)
Proof. By combining Theorem 3.9 with the Borel-Cantelli lemma, we obtain (3.53). This, when combined with the Bahadur-Kiefer representation (see, e.g., Theorem 3.4 in the sequel), yields (3.54), first stated by Csörgő and Révész [39]. The fact that (3.54) is optimal was proved by Deheuvels [55], who showed that (3.55) cannot hold a.s., for any possible choice of the iid sequence {B_n : n ≥ 1}.
The following theorem is due to Csörgő, Csörgő, Horváth and Mason [41], and Csörgő and Horváth [37].
Exercise 3.5. Show that, under the assumptions of Theorems 3.5 and 3.6, we have, almost surely,
‖ α_n − B_n ‖ = O( log n/√n ) and ‖ β_n − B_n ‖ = O( log n/√n ).   (3.57)
3.6. Some results for weighted processes
Below, we mention some useful results on weighted empirical processes. We refer to Einmahl and Mason [93, 94] for similar results for quantile processes and multivariate empirical processes. The standardized empirical process is defined by
ω_n(t) = α_n(t)/√(t(1−t)) for 0 < t < 1,   (3.58)
and has the property that, for each 0 < t < 1,
ω_n(t) →_d N(0,1) as n → ∞.
The following limit laws are available for functionals of α_n and ω_n. Let
T_n = sup_{0<t<1} |ω_n(t)|,  A_n = ∫₀¹ ω_n²(t) dt,   (3.59)
D_n = sup_{0≤t≤1} |α_n(t)|,  W_n² = ∫₀¹ α_n²(t) dt,   (3.60)
where B(·) is a Brownian bridge, and {Y_k : k ≥ 1} an iid sequence of N(0,1) r.v.'s, and Anderson and Darling [5] showed likewise that
A_n →_d ∫₀¹ B²(t)/(t(1−t)) dt =_d Σ_{k=1}^∞ Y_k²/(k(k+1)).   (3.63)
k=1
For the other statistics (see, e.g., Kolmogorov [128] and Smirnov [191]), it holds that, for each t > 0,
lim_{n→∞} P( D_n > t ) = P( ‖B‖ > t ) = 2 Σ_{k=1}^∞ (−1)^{k+1} e^{−2k²t²},   (3.64)
lim_{n→∞} P( D_n⁺ > t ) = P( sup_{0≤s≤1} B(s) > t ) = e^{−2t²},   (3.65)
and, for each t ∈ ℝ (see, e.g., Eicker [88] and Jaeschke [116]),
lim_{n→∞} P( √(2 log₂ n) T_n − 2 log₂ n − ½ log₃ n − ½ log π < t ) = e^{−2e^{−t}}.   (3.66)
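The alternating series in (3.64) converges very rapidly and is simple to evaluate numerically; the evaluation point t = 1 below is an illustrative choice:

```python
import math

def kolmogorov_limit_cdf(t, terms=100):
    # Limiting law (3.64): P(||B|| <= t) = 1 - 2 * sum_{k>=1} (-1)^{k+1} exp(-2 k^2 t^2).
    s = sum((-1) ** (k + 1) * math.exp(-2.0 * k * k * t * t) for k in range(1, terms + 1))
    return 1.0 - 2.0 * s

print(kolmogorov_limit_cdf(1.0))  # ~0.7300, the classical Kolmogorov value at t = 1
```

In practice a handful of terms already suffices, since the k-th term decays like e^{−2k²t²}.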
Moreover,
lim inf_{n→∞} T_n/√(2 log₂ n) = 1 a.s.   (3.69)
Theorem 3.12 has been extended by Einmahl and Mason [93] as follows. Consider the statistic
T_{n,ν} = n^{ν−1/2} sup_{0<t<1} |α_n(t)|/{t(1−t)}^ν = n^{ν−1/2} sup_{0<t<1} {t(1−t)}^{−ν} |α_n(t)|,   (3.70)
for 0 ≤ ν ≤ 1/2. Keeping in mind that T_n = T_{n,1/2}, we have the following theorem.
Exercise 3.6.
1°) Show, as a consequence of (3.66), that
lim_{n→∞} T_n/√(2 log₂ n) = 1 in probability.   (3.75)
3°) Show that the upper bound (3.74) is optimal, in the sense that
lim sup_{n→∞} T_n/√(log n) = ∞ a.s.   (3.77)
Exercise 3.7.
1°) Show that, for any fixed c > 0, we have, almost surely,
∫₀¹ α_n²(t)/(t(1−t)) dt − ∫_{c/n}^{1−c/n} α_n²(t)/(t(1−t)) dt → 0.   (3.78)
2°) Show that, for any ε₁ > 0 and ε₂ > 0, there exists a δ = δ_{ε₁,ε₂} ∈ (0, ½) such that
lim sup_{n→∞} P( ∫_{1/n}^{δ} α_n²(t)/(t(1−t)) dt + ∫_{1−δ}^{1−1/n} α_n²(t)/(t(1−t)) dt ≥ ε₁ ) ≤ ε₂.   (3.79)
4°) Show that (3.80) also holds for ν = 0 [Hint: make use of (3.46), (3.56) and (3.73)].
are as follows.
S = { f ∈ AC[0,1] : f(0) = 0, ∫₀¹ ḟ²(t) dt ≤ 1 },   (3.82)
F = { f ∈ AC[0,1] : f(0) = f(1) = 0, ∫₀¹ ḟ²(t) dt ≤ 1 } = { f ∈ S : f(1) = 0 }.   (3.83)
The next theorem is due to Finkelstein [98].
Theorem 3.14. (Finkelstein) We have, with probability 1,
lim_{n→∞} inf_{f∈F} ‖ (2 log₂ n)^{−1/2} α_n − f ‖ = sup_{f∈F} lim inf_{n→∞} ‖ (2 log₂ n)^{−1/2} α_n − f ‖ = 0,   (3.84)
and
lim_{n→∞} inf_{f∈F} ‖ (2 log₂ n)^{−1/2} β_n − f ‖ = sup_{f∈F} lim inf_{n→∞} ‖ (2 log₂ n)^{−1/2} β_n − f ‖ = 0.   (3.85)
Proof. By Theorem 3.8, we may define, without loss of generality, the sequence {α_n : n ≥ 1} on the same probability space as a sequence {B_n : n ≥ 1} of independent Brownian bridges, in such a way that
‖ α_n − n^{−1/2} Σ_{i=1}^n B_i ‖ = O( n^{−1/2} (log n)² ) a.s.   (3.86)
In view of Proposition 4.3, an application of Theorem 4.3 shows that, with probability 1,
lim_{n→∞} inf_{f∈F} ‖ (2n log₂ n)^{−1/2} Σ_{i=1}^n B_i − f ‖ = 0,   (3.87)
sup_{f∈F} lim inf_{n→∞} ‖ (2n log₂ n)^{−1/2} Σ_{i=1}^n B_i − f ‖ = 0.   (3.88)
The first half, (3.84), of Theorem 3.14 is a consequence of (3.86), (3.87) and (3.88). The second half, (3.85), of the theorem follows from (3.84), in combination with the Bahadur-Kiefer representation (3.125).
Exercise 3.8.
1°) Show that
sup_{f∈F} ‖f‖ = ½.
2°) By combining this result with Theorem 3.14 and Theorem 3.10, give a proof of Theorem 3.4.
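For part 1°) of Exercise 3.8, the supremum is attained by the tent function f(t) = min(t, 1 − t), which satisfies f(0) = f(1) = 0, peaks at f(1/2) = 1/2, and has ∫₀¹ ḟ²(t) dt = 1. A quick numerical confirmation:

```python
def tent(t):
    # f(t) = min(t, 1 - t): f(0) = f(1) = 0, peak 1/2 at t = 1/2, slopes +1 and -1.
    return min(t, 1.0 - t)

def energy(f, n=100_000):
    # Riemann approximation of int_0^1 f'(t)^2 dt by forward differences.
    dt = 1.0 / n
    return sum(((f((i + 1) * dt) - f(i * dt)) / dt) ** 2 for i in range(n)) * dt

print(tent(0.5), energy(tent))  # peak 0.5, energy ~1: f lies on the boundary of F
```

Any higher peak h > ½ would force an energy of at least 4h² > 1, which is the content of the exercise.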
When t₀ = 0 and u ∈ [0,1], we obtain the tail empirical process, denoted hereafter by w_n = w_{n,0}, and the tail quantile process, denoted hereafter by v_n = v_{n,0}. The case where t₀ = 1 and u ∈ [−1,0] being similar, by symmetry, the tail processes will be considered below for t₀ = 0 and u ∈ [0,1] only.
We first state a series of results for the tail empirical and quantile processes, corresponding to t₀ ∈ [0,1] and u ∈ [0,1]. Kiefer [126] (see, e.g., [49]) showed the following limit laws. Set, for any c > 0,
γ_c⁺ = inf{ x > 1 : h(x) ≥ 1/c }, γ_c⁻ = sup{ x < 1 : h(x) ≥ 1/c },   (3.91)
ρ_c⁺ = inf{ x > 1 : ℓ(x) ≥ 1/c }, ρ_c⁻ = sup{ x < 1 : ℓ(x) ≥ 1/c },   (3.92)
where ℓ(·) denotes the function in (1.61).
Theorem 3.15. Under (H.1) and (H.4) with nh_n/\log_2 n \to \infty, we have, almost surely,

\limsup_{n\to\infty} (2\log_2 n)^{-1/2} h_n^{-1/2}\alpha_n(h_n) = \limsup_{n\to\infty} (2h_n\log_2 n)^{-1/2}\alpha_n(h_n) = 1,   (3.93)

\limsup_{n\to\infty} (2\log_2 n)^{-1/2} h_n^{-1/2}\beta_n(h_n) = \limsup_{n\to\infty} (2h_n\log_2 n)^{-1/2}\beta_n(h_n) = 1.   (3.94)
Proposition 3.10. Assume (H.1) and (H.4) with nh_n/\log_2 n \to \infty. Then, on a suitable probability space, there exists a sequence W_k(\cdot), k = 1, 2, \ldots, of independent Wiener processes such that, almost surely as n \to \infty,
Theorem 3.18. Under (H.123), we have, almost surely, for each 0 \le c < d \le 1,

\limsup_{n\to\infty} (2h_n\{\log(1/h_n) + \log_2 n\})^{-1/2} \sup_{c\le u,v\le d,\ |u-v|\le h_n} |\alpha_n(u) - \alpha_n(v)|   (3.100)

= \limsup_{n\to\infty} (2h_n\{\log(1/h_n) + \log_2 n\})^{-1/2} \sup_{c\le u\le d} |\alpha_n(u + h_n) - \alpha_n(u)|   (3.101)

= \limsup_{n\to\infty} (2h_n\{\log(1/h_n) + \log_2 n\})^{-1/2} \sup_{c\le u,v\le d,\ |u-v|\le h_n} |\beta_n(u) - \beta_n(v)|   (3.102)

= \limsup_{n\to\infty} (2h_n\{\log(1/h_n) + \log_2 n\})^{-1/2} \sup_{c\le u\le d} |\beta_n(u + h_n) - \beta_n(u)| = 1,   (3.103)

and

\liminf_{n\to\infty} (2h_n\{\log(1/h_n) + \log_2 n\})^{-1/2} \sup_{c\le u,v\le d,\ |u-v|\le h_n} |\alpha_n(u) - \alpha_n(v)|   (3.104)

= \liminf_{n\to\infty} (2h_n\{\log(1/h_n) + \log_2 n\})^{-1/2} \sup_{c\le u\le d} |\alpha_n(u + h_n) - \alpha_n(u)|   (3.105)

= \liminf_{n\to\infty} (2h_n\{\log(1/h_n) + \log_2 n\})^{-1/2} \sup_{c\le u,v\le d,\ |u-v|\le h_n} |\beta_n(u) - \beta_n(v)|   (3.106)

= \liminf_{n\to\infty} (2h_n\{\log(1/h_n) + \log_2 n\})^{-1/2} \sup_{c\le u\le d} |\beta_n(u + h_n) - \beta_n(u)| = \Big\{\frac{\gamma}{\gamma+1}\Big\}^{1/2}.   (3.107)
Theorem 3.18 turns out to be a direct consequence of functional limit laws due
to Deheuvels and Mason [66], Deheuvels [52], Deheuvels [53], and Deheuvels and
Einmahl [60].
Let S denote the Strassen set, as defined in (2.34). Consider the random sets of functions of u \in [0, 1],

E_n = \big\{ (2h_n\{\log(1/h_n) + \log_2 n\})^{-1/2} \{\alpha_n(t + h_n u) - \alpha_n(t)\} : c \le t \le d \big\} =: \{\eta_{n,t}(u) : c \le t \le d\},   (3.112)

F_n = \big\{ (2h_n\{\log(1/h_n) + \log_2 n\})^{-1/2} \{\beta_n(t + h_n u) - \beta_n(t)\} : c \le t \le d \big\} =: \{\zeta_{n,t}(u) : c \le t \le d\}.   (3.113)
Denote the Hausdorff distance (pertaining to the uniform topology U) between subsets of B[0, 1] by setting, for arbitrary A, B \subseteq B[0, 1],

\Delta_U(A, B) = \inf\{\epsilon > 0 : A \subseteq B^{\epsilon} \text{ and } B \subseteq A^{\epsilon}\},   (3.114)

whenever such an \epsilon > 0 exists, and \Delta_U(A, B) = \infty otherwise. Here, A^{\epsilon} denotes the \epsilon-neighborhood of A with respect to the uniform distance.
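As an elementary numerical illustration of (3.114) (our own sketch, not part of the original text: functions are discretized on a common grid, and the helper names are ours):

```python
# Hausdorff distance between two finite sets of functions, each function
# represented by its values on a common grid, with the uniform (sup-norm)
# metric, as in (3.114).

def sup_dist(f, g):
    # uniform distance ||f - g|| between two grid-sampled functions
    return max(abs(a - b) for a, b in zip(f, g))

def hausdorff(A, B):
    # smallest eps such that A lies in B^eps and B lies in A^eps,
    # computed as the larger of the two directed distances
    d_ab = max(min(sup_dist(f, g) for g in B) for f in A)
    d_ba = max(min(sup_dist(f, g) for f in A) for g in B)
    return max(d_ab, d_ba)

# sanity checks on a 3-point grid
assert hausdorff([(0, 0, 0)], [(1, 1, 0)]) == 1
assert hausdorff([(0, 0, 0), (0, 0.5, 1)], [(0, 0.5, 1)]) == 1
```

For singleton sets, \Delta_U reduces to the uniform distance between the two functions, as the first assertion illustrates.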
Theorem 3.19. Under (H.123) with \gamma = \infty, we have, for each 0 \le c < d \le 1,

\lim_{n\to\infty} \Delta_U(E_n, S) = \lim_{n\to\infty} \Delta_U(F_n, S) = 0  a.s.   (3.115)
Recall the definition (1.44) of the Lévy distance \Delta_L(f, g) between two functions f, g \in I[0, T]. We set, for any f \in I[0, T],

N_{\epsilon}^{[L]}(f) = \{ g \in I[0, T] : \Delta_L(f, g) < \epsilon \},   (3.121)

and, for any subset A \subseteq I[0, T],

A^{\epsilon[L]} = \bigcup_{f\in A} N_{\epsilon}^{[L]}(f).   (3.122)
Proof. We refer to Shorack [182] and Deheuvels and Mason [64] for details. Below, we follow the elegant proof of Shorack [182] to establish (3.125). First, we combine Propositions 3.2 and 3.3 to obtain that, almost surely,

\sup_{0\le t\le 1} |\beta_n(t) + \alpha_n(V_n(t))| = \sup_{0\le t\le 1} |\beta_n(t) - n^{1/2}\{V_n(t) - U_n(V_n(t))\}| = n^{-1/2},

and hence, since V_n(t) = t + n^{-1/2}\beta_n(t),

\big\| \{\alpha_n + \beta_n\} - \{\alpha_n - \alpha_n(I + n^{-1/2}\beta_n)\} \big\| = n^{-1/2}  a.s.   (3.126)

In view of (3.126), we see that (3.125) is equivalent to

\limsup_{n\to\infty} n^{1/4}(\log n)^{-1/2}(\log_2 n)^{-1/4} \big\| \alpha_n(I + n^{-1/2}\beta_n) - \alpha_n \big\| = 2^{-1/4}  a.s.   (3.127)
160 P. Deheuvels
Since \epsilon > 0 may be chosen as small as desired, it follows readily from this last statement that, a.s.,

\liminf_{n\to\infty} n^{1/4}(\log n)^{-1/2}(\log_2 n)^{-1/4} \big\| \alpha_n(I + n^{-1/2}\beta_n) - \alpha_n \big\| \ge 2^{-1/4}.   (3.131)
Proof. See, e.g., Kiefer [124]. Below, we give a simple proof of this statement, given in Deheuvels and Mason [67], and in §3.4 of Deheuvels and Mason [68]. We need the following two auxiliary results (see, e.g., Lemmas 3.3 and 3.4 in [68]). Set, for 0 < t_0 < 1 and -1 \le u \le 1,

x_n = (2\log_2 n)^{-1/2} \beta_n(t_0),   (3.133)

f_n(u) = n^{1/4}(2\log_2 n)^{-3/4} \Big\{ \alpha_n(t_0) - \alpha_n\Big(t_0 + \Big(\frac{2\log_2 n}{n}\Big)^{1/2} u\Big) \Big\}.   (3.134)

Then, we have, almost surely, as n \to \infty,

\big| n^{1/4}(2\log_2 n)^{-3/4} \{\alpha_n(t_0) + \beta_n(t_0)\} - f_n(x_n) \big| \to 0.   (3.135)
Moreover, the sequence (x_n, f_n) is almost surely relatively compact in \mathbb{R} \times B[-1, 1], where B[-1, 1] denotes the set of bounded functions on [-1, 1], endowed with the uniform topology. The limit set is given by

K_1 := \Big\{ (x, f) \in \mathbb{R} \times AC_0[-1, 1] : \frac{x^2}{t_0(1-t_0)} + \int_{-1}^{1} \dot f(u)^2\,du \le 1 \Big\}.   (3.136)

In view of (3.133), (3.134), (3.135) and (3.136), the RHS of (3.132) equals, almost surely,

\sup_{(x,f)\in K_1} |f(x)| = \{t_0(1-t_0)\}^{1/4} \sup_{0\le s\le 1} \{s(1-s^2)\}^{1/2} = \{t_0(1-t_0)\}^{1/4}\, 2^{1/2}\, 3^{-3/4},   (3.137)

which yields the theorem.
Exercise 3.9.
1°) Letting x_n be as in (3.133), show, as a consequence of Finkelstein's Theorem 3.14, that, almost surely as n \to \infty, the limit set of \{x_n\} is the interval
[\,-\{t_0(1-t_0)\}^{1/2},\ \{t_0(1-t_0)\}^{1/2}\,],
and show that this result is in agreement with (3.136).
2°) Letting f_n be as in (3.134), show, as a consequence of Theorem 3.17, that, almost surely as n \to \infty, the limit set of \{f_n\} is S, and show that this result is in agreement with (3.136).
f_{n,h}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\Big(\frac{x - X_i}{h}\Big).   (3.138)
The following theorem was proved by Stute [200], under slightly more restrictive
conditions. As stated, it is due to Deheuvels and Mason [66] (see also Deheuvels
[52] and Deheuvels and Einmahl [60]).
Proof. We will establish (3.140) in the special case where J = (0, 1), f(x) = 1 for 0 < x < 1, and K(t) = 0 for t \notin [1/4, 3/4]. The arguments of the proof in the general case are essentially identical, with the addition of minor technicalities, which can be omitted. We let n_0 be such that t h_n \in J for all t \in I, when n \ge n_0. Recall the definition (3.112) of \eta_{n,x}(u). Our assumptions imply that X_n = U_n is uniformly
after integrating by parts. Now, as follows from Theorem 3.19, we have \Delta_U(\{\eta_{n,x} : x \in I\}, S) \to 0 a.s. It is easily checked that, whenever f \in S, we have -f \in S. Therefore, an easy argument shows that, almost surely as n \to \infty,

\Big(\frac{nh_n}{2\log(1/h_n)}\Big)^{1/2} \sup_{x\in I} \{f_{n,h_n}(x) - E f_{n,h_n}(x)\}
= (1 + o(1)) \sup_{f\in S} \Big\{ -\int_0^1 f(u)\,dK(u) \Big\}
= (1 + o(1)) \sup_{f\in S} \int_0^1 f(u)\,dK(u) = (1 + o(1)) \sup_{f\in S} \int_0^1 \dot f(u) K(u)\,du.
At this point, we leave it to the reader to pursue these arguments, which we have used to establish similar results for other nonparametric estimators of interest.
4. Auxiliary results
4.1. Some Gaussian process theory
Let Z be a centered Gaussian random vector taking values in a separable Banach space X, with norm denoted by |\cdot|_X. Throughout, we will work under the assumption that
(G.1) P(|Z|_X < \infty) = 1.
In the sequel, we will mainly consider the case where X = C[0, 1], the set of all continuous functions on [0, 1], endowed with the sup-norm uniform topology U, defined by |f|_X = \|f\| = \sup_{0\le t\le 1}|f(t)|, and, either,
Z = W is (the restriction to [0, 1] of) a standard Wiener process; or
Z = B is a Brownian bridge.
The assumption (G.1) is clearly satisfied in either of these cases.
Denoting by B_X the \sigma-algebra of Borel subsets of X, the distribution of Z is defined on B_X via

P_Z(B) = P(Z \in B), for all B \in B_X.   (4.1)

Denoting by X^* the topological dual of X, that is, the space of all continuous linear forms h^* : X \to \mathbb{R}, one may define a linear mapping J : X^* \to X by the Bochner integral

h^* \in X^* \mapsto J h^* = E(Z h^*(Z)) = \int_X z\, h^*(z)\, P_Z(dz).   (4.2)

The image space H_0 := J(X^*) is pre-Hilbertian, with inner product defined by

\langle J g^*, J h^* \rangle_H = E(g^*(Z) h^*(Z)) = \int_X g^*(z) h^*(z)\, P_Z(dz), for g^*, h^* \in X^*.   (4.3)

The reproducing kernel Hilbert space [RKHS] of Z is then defined as the closure H in X of H_0 with respect to the Hilbert norm |h|_H = \langle h, h \rangle_H^{1/2}. Of special interest here is the unit ball of H, denoted by

K = \{ h \in H : |h|_H \le 1 \}.   (4.4)
The following assumption will be needed, in addition to (G.1). Throughout, we
assume that
(G.2) K is a compact subset of (X, |\cdot|_X).
We will make use of a generalized version of the Cameron-Martin formula stated
in the following theorem.
Theorem 4.1. For each h \in H, there exists a measurable linear form \tilde h on X, fulfilling the equalities
Proof. There is nothing to prove if P_Z(A) = 0. Assuming that P_Z(A) > 0, we set B = A in (4.6), then make use of the Jensen inequality to obtain, by symmetry of A and linearity of \tilde h, that

P_Z(A + h) = \exp\big(-\tfrac12 |h|_H^2\big) \int_A \exp(\tilde h(x))\, P_Z(dx)
= \exp\big(-\tfrac12 |h|_H^2\big) P_Z(A) \int_A \exp(\tilde h(x)) \frac{P_Z(dx)}{P_Z(A)}
\ge \exp\big(-\tfrac12 |h|_H^2\big) P_Z(A) \exp\Big( \int_A \tilde h(x) \frac{P_Z(dx)}{P_Z(A)} \Big) = \exp\big(-\tfrac12 |h|_H^2\big) P_Z(A),

the last equality holding since \int_A \tilde h(x)\,P_Z(dx) = 0, by symmetry of A and linearity of \tilde h. This is (4.7).
The next statement is the celebrated isoperimetric inequality due to Borell (1975) and Sudakov and Tsyrelson (1978). Denote by

\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt,   (4.8)

the df of the standard N(0, 1) law, and by \Phi^{-1} the corresponding quantile function, fulfilling

\Phi(\Phi^{-1}(u)) = u for all 0 < u < 1.   (4.9)

Theorem 4.2. For each A, B \in B_X such that B \cap (A + rK) = \emptyset, we have

P_Z(B) \le 1 - \Phi\big( \Phi^{-1}(P_Z(A)) + r \big).   (4.10)
The following simple inequalities will be useful to evaluate the right-hand
side of (4.10).
Lemma 4.1. We have, for every t > 0,

\frac{e^{-t^2/2}}{t\sqrt{2\pi}} \Big(1 - \frac{1}{t^2}\Big) \le 1 - \Phi(t) \le \frac{e^{-t^2/2}}{t\sqrt{2\pi}}.   (4.11)

Moreover, for every t \ge 0, we have

1 - \Phi(t) \le \tfrac12\, e^{-t^2/2}.   (4.12)
Prior to the proof of Theorem 4.3, we will establish the following useful result, usually referred to as Ottaviani's lemma.
Lemma 4.2. Let \xi_0 and \xi_1, \ldots, \xi_N denote independent random vectors taking values in the Banach space (X, |\cdot|_X). Set S_i = \xi_0 + \xi_1 + \cdots + \xi_i for i = 1, \ldots, N. Consider a Borel subset A of X and select an \epsilon > 0. Suppose that, for each 1 \le i \le N, P(|S_N - S_i|_X \ge \epsilon) \le 1/2. Then, we have

P\big( \exists\, i = 0, \ldots, N : S_i \notin A^{2\epsilon} \big) \le 2\, P\big( S_N \notin A^{\epsilon} \big).   (4.18)
Proof. Introduce the events
A_i(\epsilon) = \{ S_i \notin A^{\epsilon} \} for i = 0, \ldots, N,
B_i(\epsilon) = \{ |S_N - S_i|_X < \epsilon \} for i = 0, \ldots, N.
Set A_{-1}(2\epsilon) = \emptyset. We obtain readily that

P(S_N \notin A^{\epsilon}) \ge \sum_{i=0}^{N} P\big( A_0(2\epsilon)^c \cap \cdots \cap A_{i-1}(2\epsilon)^c \cap A_i(2\epsilon) \cap B_i(\epsilon) \big)
\ge \tfrac12 \sum_{i=0}^{N} P\big( A_0(2\epsilon)^c \cap \cdots \cap A_{i-1}(2\epsilon)^c \cap A_i(2\epsilon) \big) = \tfrac12\, P\Big( \bigcup_{i=0}^{N} A_i(2\epsilon) \Big),

where we use the facts that, on A_i(2\epsilon) \cap B_i(\epsilon), we have S_N \notin A^{\epsilon}, and that B_i(\epsilon) is independent of A_0(2\epsilon), \ldots, A_i(2\epsilon), with P(B_i(\epsilon)) \ge 1/2. This is (4.18).
Proof of Theorem 4.3.
Part 1: Outer bounds. We select a \rho > 0 and an \epsilon > 0. Introduce the sequence of integers n_k = \lfloor (1+\rho)^k \rfloor for k = 0, 1, \ldots, and set a_k = (2n_k \log_2 n_k)^{1/2}. Consider the events

C_k = \big\{ \exists\, n \in (n_{k-1}, n_k] : n^{1/2} Y_n \notin \{(2n_k \log_2 n_k)^{1/2} K\}^{2\epsilon a_k} \big\},   (4.19)

D_k = \big\{ n_k^{1/2} Y_{n_k} \notin \{(2n_k \log_2 n_k)^{1/2} K\}^{\epsilon a_k} \big\}.   (4.20)

For each n_{k-1} < n \le n_k, we have

P\big( |n^{1/2} Y_n - n_k^{1/2} Y_{n_k}|_X \ge \epsilon a_k \big) = P\big( (n_k - n)^{1/2} |Z|_X \ge \epsilon (2n_k \log_2 n_k)^{1/2} \big)
\le P\Big( |Z|_X \ge \epsilon \Big(\frac{n_k}{n_k - n_{k-1}}\Big)^{1/2} (2\log_2 n_k)^{1/2} \Big)
\le P\big( |Z|_X \ge \epsilon (1 + \rho^{-1})^{1/2} (2\log_2 n_k)^{1/2} \big) \le \tfrac12,

for all k sufficiently large.
We may therefore apply Ottaviani's lemma (Lemma 4.2) to show that, for all k sufficiently large,

P(C_k) \le 2\, P(D_k).   (4.21)
We now turn to the evaluation of P(D_k). Toward this end, we make use of the isoperimetric inequality (Theorem 4.2), with choices of A, B and r which are specified below. We set, for a constant m > 0 which will be specified later on,
A = D(0, m) := \{x \in X : |x|_X < m\},
B = X \setminus \{A + rK\} = X \setminus \{x + h : |x|_X < m, |h|_H \le r\}.
In view of (G.1), we may now choose m > 0 so large that P_Z(A) = P(|Z|_X < m) > 1/2, which, in turn, implies that t_0 := \Phi^{-1}(P_Z(A)) > 0. This, when combined with (4.10) and (4.12), shows that, for all k sufficiently large,

P(D_k) \le 1 - \Phi\big( t_0 + 2(\log_2 n_k)^{1/2} \big) \le \tfrac12 \exp(-2\log_2 n_k) = o\Big( \frac{1}{(\log n_k)(\log_2 n_k)^2} \Big).
Since \log n_k = (1 + o(1))\, k \log(1+\rho) as k \to \infty, we infer from this last expression, when combined with (4.21), that

\sum_{k=1}^{\infty} P(C_k) < \infty.
The Borel-Cantelli lemma implies in turn that, with probability 1, for all k sufficiently large, we have, uniformly over all n such that n_{k-1} < n \le n_k,

\frac{Y_n}{(2\log_2 n)^{1/2}} \in \Big\{ \Big(\frac{n_k \log_2 n_k}{n_{k-1} \log_2 n_{k-1}}\Big)^{1/2} K \Big\}^{2\epsilon (n_k \log_2 n_k / \{n_{k-1} \log_2 n_{k-1}\})^{1/2}} \subseteq \{(1+\rho) K\}^{2\epsilon(1+\rho)}.   (4.23)
Part 2: Inner bounds. We first note that the assumption (G.2) implies that

M := \sup_{h\in K} |h|_X < \infty.   (4.24)

In view of (4.24), and making use of Part 1, we obtain readily that, almost surely,

\limsup_{k\to\infty} \big| (2n_k \log_2 n_k)^{-1/2}\, n_{k-1}^{1/2} Y_{n_{k-1}} \big|_X \le (M + 2\epsilon(1+\rho))(1+\rho)^{-1/2}.
P(E_k) \ge \exp\big( -\tfrac12 |h|_H^2 \cdot 2\log_2(n_k - n_{k-1}) \big)\, P\big( |Z|_X < \tfrac14 \epsilon (2\log_2(n_k - n_{k-1}))^{1/2} \big)
\ge \tfrac12 \exp\big( -(1 - 2\epsilon) \log_2(n_k - n_{k-1}) \big) \ge \frac{1}{k},

for all k sufficiently large. The events E_k, k \ge 1, being independent,
the Borel-Cantelli lemma implies, in turn, that P(E_k i.o.) = 1. Now, by the definition (4.27) of E_k, it follows that, almost surely,
\liminf_{k\to\infty} \big\| (2n_k \log_2 n_k)^{-1/2} \{ n_k^{1/2} Y_{n_k} - n_{k-1}^{1/2} Y_{n_{k-1}} \} - h \big\|_X
\le \tfrac14 \epsilon \lim_{k\to\infty} \frac{(2(n_k - n_{k-1}) \log_2(n_k - n_{k-1}))^{1/2}}{(2n_k \log_2 n_k)^{1/2}}
+ |h|_X \Big( 1 - \lim_{k\to\infty} \frac{(2(n_k - n_{k-1}) \log_2(n_k - n_{k-1}))^{1/2}}{(2n_k \log_2 n_k)^{1/2}} \Big)
\le \tfrac14 \epsilon \Big(\frac{\rho}{1+\rho}\Big)^{1/2} + M \Big( 1 - \Big(\frac{\rho}{1+\rho}\Big)^{1/2} \Big).   (4.28)
In view of (4.28), we see that there exists a \rho_2 > 0 such that, whenever \rho \ge \rho_2, we have, almost surely,

\liminf_{k\to\infty} \big\| (2n_k \log_2 n_k)^{-1/2} \{ n_k^{1/2} Y_{n_k} - n_{k-1}^{1/2} Y_{n_{k-1}} \} - h \big\|_X < \tfrac12 \epsilon.   (4.29)

By combining (4.26) with (4.29), we obtain readily that (4.25) holds for all \rho \ge \max(\rho_1, \rho_2). This concludes the proof of Theorem 4.3.
Exercise 4.1. Prove (3.87) and (3.88).
4.3. Karhunen-Loève expansions
We recall the following facts about Karhunen-Loève [KL] expansions (see, e.g., [3], [7], [119] and [122]). Denote by {Z(t) : 0 < t < 1} a centered Gaussian process with covariance function R(s, t) = E(Z(s)Z(t)), for 0 < s, t < 1, fulfilling

0 < \int_0^1 R(t, t)\,dt < \infty.   (4.30)

Then, there exist nonnegative constants \lambda_1 \ge \lambda_2 \ge \cdots \ge 0, together with functions \{e_k(t) : k \ge 1\} \subseteq L^2(0, 1) of t \in (0, 1), such that the properties (K.1234) below hold.
(K.1) For all i \ge 1 and k \ge 1,

\int_0^1 e_i(t) e_k(t)\,dt = 1 if i = k, and = 0 if i \ne k.
(K.2) The \{\lambda_k, e_k(\cdot) : k \ge 1\} form a complete set of solutions of the Fredholm equation in (\lambda, e(\cdot)), with \lambda \ne 0,

\lambda e(t) = \int_0^1 R(s, t) e(s)\,ds for 0 < t < 1, and \int_0^1 e^2(t)\,dt = 1.   (4.31)

The \lambda_k (resp. e_k(\cdot)) are the eigenvalues (resp. eigenfunctions) of the Fredholm transformation

f \in L^2(0, 1) \mapsto Tf \in L^2(0, 1) : Tf(t) = \int_0^1 R(s, t) f(s)\,ds, t \in (0, 1).
of Z() holds, with the series (4.33) converging a.s., and in integrated
mean square on (0, 1).
Remark 4.1.
1°) The sequence \{\lambda_k, e_k(\cdot) : k \ge 1\} in (K.1234) may very well be finite. Below, we will implicitly exclude this case and specialize to infinite KL expansions, with k ranging through \mathbb{N} = \{1, 2, \ldots\}, and \lambda_1 > \lambda_2 > \cdots > 0.
2°) If, in addition to (4.30), Z(\cdot) is a.s. continuous on [0, 1], with covariance function R(\cdot, \cdot) continuous on [0, 1]^2, then we may choose the functions \{e_k(\cdot) : k \ge 1\} in the KL expansion (4.33) to be continuous on [0, 1]. The series (4.32) is then absolutely and uniformly convergent on [0, 1]^2, and the series (4.33) is a.s. uniformly convergent on [0, 1] (see, e.g., [3]).
There are few Gaussian processes of interest with respect to statistics for which the KL expansion is known through explicit values of \{\lambda_k : k \ge 1\}, and with simple forms of the functions \{e_k(\cdot) : k \ge 1\} (see, e.g., [152] for a review). It is useful to have a precise knowledge of the \lambda_k's, since we infer from (4.33) that

D^2 = \int_0^1 Z^2(t)\,dt = \sum_{k\ge 1} \lambda_k Y_k^2.   (4.34)
This readily implies (see, e.g., (6.23), p. 200 in [119]) that the moment-generating function of the distribution of D^2 is given by

\phi_{D^2}(z) = E(\exp(z D^2)) = \prod_{k=1}^{\infty} \Big( \frac{1}{1 - 2z\lambda_k} \Big)^{1/2} for \mathrm{Re}(z) < \frac{1}{2\lambda_1}.   (4.35)
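The product in (4.35) is easy to evaluate numerically. For instance, for the Brownian bridge, whose eigenvalues \lambda_k = 1/(k\pi)^2 appear below in (4.44), Euler's product \sin x / x = \prod_{k\ge 1}(1 - x^2/(k\pi)^2) collapses (4.35) to a closed form; the following sketch (ours, not from the text) checks the two against each other:

```python
import math

# Truncated evaluation of (4.35) for the Brownian bridge, lambda_k = 1/(k pi)^2.
# By Euler's product sin(x)/x = prod_k (1 - x^2/(k pi)^2), the infinite
# product equals (sqrt(2z)/sin(sqrt(2z)))^{1/2} for 0 < z < pi^2/2.

def mgf_truncated(z, K=100000):
    logp = 0.0
    for k in range(1, K + 1):
        lam = 1.0 / (k * math.pi) ** 2
        logp += -0.5 * math.log(1.0 - 2.0 * z * lam)
    return math.exp(logp)

def mgf_closed(z):
    x = math.sqrt(2.0 * z)
    return math.sqrt(x / math.sin(x))

z = 0.1
assert abs(mgf_truncated(z) - mgf_closed(z)) < 1e-3
```

The truncation error of the product decays like z/(\pi^2 K), so a modest number of factors already matches the closed form to several digits.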
In particular, the standard Wiener process W(\cdot) admits on [0, 1] the Karhunen-Loève representation

W(t) = \sum_{k=1}^{\infty} \frac{Y_k}{(k - \frac12)\pi} \sqrt{2} \sin\big( (k - \tfrac12)\pi t \big),   (4.40)

where \{Y_k : k \ge 1\} denotes an iid sequence of normal N(0, 1) r.v.s. The proof of (4.40) is left to the reader in Exercise 4.2, and as a special case of Theorem 4.6 in §4.5, in the sequel.
In view of (4.40), given any two functions f and g of the form

f(t) = \sum_{k=1}^{\infty} a_k \sqrt{2} \sin\big( (k - \tfrac12)\pi t \big) and g(t) = \sum_{k=1}^{\infty} b_k \sqrt{2} \sin\big( (k - \tfrac12)\pi t \big),   (4.41)

the Hilbert product of f and g within the reproducing kernel Hilbert space [RKHS] pertaining to the Wiener process on [0, 1] is given by

\langle f, g \rangle_H = \sum_{k=1}^{\infty} (k - \tfrac12)^2 \pi^2\, a_k b_k = \int_0^1 \dot f(t) \dot g(t)\,dt.   (4.42)
Proposition 4.2. The RKHS pertaining to the Wiener process on [0, 1] is the space H, with Hilbert norm |\cdot|_H, of all absolutely continuous functions f on [0, 1] such that |f|_H < \infty, where

|f|_H = \Big( \int_0^1 \dot f(t)^2\,dt \Big)^{1/2} if f \in AC[0, 1] fulfills f(0) = 0, and |f|_H = \infty otherwise.   (4.43)
The standard Brownian bridge B(\cdot) admits on [0, 1] the Karhunen-Loève representation

B(t) = \sum_{k=1}^{\infty} \frac{Y_k}{k\pi} \sqrt{2} \sin(k\pi t),   (4.44)

where \{Y_k : k \ge 1\} denotes an iid sequence of normal N(0, 1) r.v.s. Therefore, given any two functions f and g of the form

f(t) = \sum_{k=1}^{\infty} a_k \sqrt{2} \sin(k\pi t) and g(t) = \sum_{k=1}^{\infty} b_k \sqrt{2} \sin(k\pi t),   (4.45)

the Hilbert product of f and g within the RKHS pertaining to the Brownian bridge on [0, 1] is given by

\langle f, g \rangle_H = \sum_{k=1}^{\infty} (k\pi)^2 a_k b_k = \int_0^1 \dot f(t) \dot g(t)\,dt.   (4.46)
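With the eigenvalues and eigenfunctions of (4.44) in hand, one can check numerically that the covariance series (4.32) reproduces R(s, t) = \min(s, t) - st for the Brownian bridge (our own sketch, not from the text):

```python
import math

# Truncated covariance series for the Brownian bridge:
# sum_k lambda_k e_k(s) e_k(t), with lambda_k = 1/(k pi)^2 and
# e_k(t) = sqrt(2) sin(k pi t) from (4.44); this should converge to
# R(s, t) = min(s, t) - s t.

def bridge_cov(s, t, K=20000):
    total = 0.0
    for k in range(1, K + 1):
        lam = 1.0 / (k * math.pi) ** 2
        total += lam * 2.0 * math.sin(k * math.pi * s) * math.sin(k * math.pi * t)
    return total

s, t = 0.3, 0.7
assert abs(bridge_cov(s, t) - (min(s, t) - s * t)) < 1e-4
```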
Exercise 4.2. Consider the Fredholm transformation of L^2[0, 1] onto itself, defined by

f \mapsto Tf(x) = \int_0^1 \min\{x, t\} f(t)\,dt,

and let y(x) denote an eigenfunction of T, fulfilling Ty = \lambda y for some \lambda > 0.
1°) Show that y(\cdot) is continuous on [0, 1], then, by a recursion, that y is twice continuously differentiable on (0, 1) and such that y(0) = 0 and y'(1) = 0.
2°) Show that y is a solution of the differential equation \lambda y'' + y = 0, and that the only possible values of \lambda are of the form \lambda = 1/\{(k - \frac12)\pi\}^2 for k = 1, 2, \ldots. Conclude that (4.40) holds.
Proof. The details of the proofs of Theorems 4.4 and 4.5 are given in Deheuvels and Martynov [63].
In the sequel, we will concentrate on the particular case where, for some constant \gamma \in \mathbb{R},

\psi(t) = t^{\gamma} for 0 < t \le 1.   (4.52)

We note that (L.123) hold under (4.52) iff \gamma > -1. In particular,

\gamma > -1 \iff \int_0^1 t\, \psi^2(t)\,dt < \infty \iff \int_0^1 t(1-t)\, \psi^2(t)\,dt < \infty.
For \nu > -1, consider the Bessel function J_{\nu}(\cdot) of the first kind with index \nu (see §4.6 below for details on the definition and properties of J_{\nu}(\cdot)). For \nu > -1, the positive zeros of J_{\nu}(\cdot) (solutions of J_{\nu}(z) = 0) form an infinite sequence, denoted hereafter by 0 < z_{\nu,1} < z_{\nu,2} < \cdots. These zeros are interlaced with the zeros 0 < z_{\nu+1,1} < z_{\nu+1,2} < \cdots of J_{\nu+1}(\cdot) (see, e.g., [207], p. 479), in such a way that

0 < z_{\nu,1} < z_{\nu+1,1} < z_{\nu,2} < z_{\nu+1,2} < z_{\nu,3} < \cdots.   (4.53)
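The interlacing property (4.53) is easy to observe numerically. The sketch below (ours, not from the text) implements J_\nu directly from the series (4.71) given in §4.6, finds the first zeros by bisection, and checks z_{0,1} < z_{1,1} < z_{0,2}:

```python
import math

def bessel_j(nu, x, terms=60):
    # J_nu(x) via the power series (4.71); adequate for moderate x
    s = 0.0
    for k in range(terms):
        s += (-x * x / 4.0) ** k / (math.gamma(nu + k + 1) * math.gamma(k + 1))
    return (x / 2.0) ** nu * s

def zero_between(nu, a, b, iters=80):
    # bisection; assumes J_nu changes sign exactly once on [a, b]
    fa = bessel_j(nu, a)
    for _ in range(iters):
        m = 0.5 * (a + b)
        if fa * bessel_j(nu, m) <= 0:
            b = m
        else:
            a, fa = m, bessel_j(nu, m)
    return 0.5 * (a + b)

# first zeros: z_{0,1} ~ 2.4048, z_{1,1} ~ 3.8317, z_{0,2} ~ 5.5201
z01 = zero_between(0, 2.0, 3.0)
z11 = zero_between(1, 3.0, 4.5)
z02 = zero_between(0, 5.0, 6.0)
assert z01 < z11 < z02            # interlacing, as in (4.53)
assert abs(z01 - 2.404826) < 1e-3
```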
The next theorems provide KL expansions for \{t^{\gamma} W(t) : 0 \le t \le 1\} and \{t^{\gamma} B(t) : 0 \le t \le 1\}.
Theorem 4.6. Let \{W(t) : t \ge 0\} denote a Wiener process. Then, for each \gamma = 1/(2\nu) - 1 > -1, or equivalently, for each \nu = 1/(2(1+\gamma)) > 0, the Karhunen-Loève expansion of \{t^{\gamma} W(t) : 0 \le t \le 1\} holds with eigenvalues

\lambda_k = \frac{1}{(1+\gamma)^2 z_{\nu-1,k}^2}, k = 1, 2, \ldots.
The simple observation that the \{\theta^{1/2} t^{(\theta-1)/2} e_k(t^{\theta}) : k \ge 1\} are orthonormal in L^2(0, 1) whenever such is the case for the \{e_k(t) : k \ge 1\} (see, e.g., [145]), allows one to carry KL expansions over the change of variables t \to t^{\theta}. Recall that, under suitable conditions on the functions f(\cdot) and g(\cdot) on (0, 1), it is possible to expand f(\cdot) into the Fourier-Bessel expansion f(t) = \sum_{k=1}^{\infty} a_k J_{\nu}(z_{\nu,k} t), with

a_k = \frac{2}{J_{\nu-1}^2(z_{\nu,k})} \int_0^1 t f(t) J_{\nu}(z_{\nu,k} t)\,dt, k \ge 1,   (4.65)
and g(\cdot) into a Dini expansion g(t) = \sum_{k=1}^{\infty} b_k J_{\nu}(z_{\nu-1,k} t), with

b_k = \frac{2}{J_{\nu}^2(z_{\nu-1,k})} \int_0^1 t g(t) J_{\nu}(z_{\nu-1,k} t)\,dt, k \ge 1.   (4.66)

By setting f(t) = t^{\mu} B(t^2) and g(t) = t^{\mu} W(t^2) in (4.65)-(4.66), we get

a_k = \frac{\sqrt{2}\, Y_k}{z_{\nu,k}\, J_{\nu-1}(z_{\nu,k})} and b_k = \frac{\sqrt{2}\, Y_k}{z_{\nu-1,k}\, J_{\nu}(z_{\nu-1,k})} for k \ge 1.   (4.67)
Put \mu = 0 and \nu = \gamma/(\gamma+1) in (4.61)-(4.62). Set, for notational simplicity, z_{\gamma,k} = z_{\gamma/(\gamma+1),k} and \tilde z_{\gamma,k} = z_{\gamma/(\gamma+1)-1,k}. We so obtain the KL expansions

t^{\gamma} W(t) = \sum_{k=1}^{\infty} Y_k \frac{\sqrt{2(\gamma+1)}\; t^{\gamma+1/2}\, J_{\gamma/(\gamma+1)}(\tilde z_{\gamma,k}\, t^{\gamma+1})}{\tilde z_{\gamma,k}\,(\gamma+1)\, J_{\gamma/(\gamma+1)}(\tilde z_{\gamma,k})},   (4.68)

t^{\gamma} B(t) = \sum_{k=1}^{\infty} Y_k \frac{\sqrt{2(\gamma+1)}\; t^{\gamma+1/2}\, J_{\gamma/(\gamma+1)}(z_{\gamma,k}\, t^{\gamma+1})}{z_{\gamma,k}\,(\gamma+1)\, J_{\gamma/(\gamma+1)-1}(z_{\gamma,k})}.   (4.69)
The KL expansion (4.69) has been obtained by Li ([140]) (see the proof of Theorem 1.6, pp. 24-25 in [140]), up to the normalizing factor, for k = 1, 2, \ldots,

c_k = \sqrt{2(\gamma+1)} / \{ J_{\gamma/(\gamma+1)-1}(z_{\gamma,k}) \},

of the eigenfunction in (4.69) (with the notation (4.58))

e_k(t) = c_k\, t^{\gamma+1/2}\, J_{\gamma/(\gamma+1)}(z_{\gamma,k}\, t^{\gamma+1}),

left implicit in his work. In spite of the fact that it is possible to revert the previous arguments, starting with (4.69), in order to obtain an alternative proof of Theorem 4.6 based on [140], this only works for the values of \nu = \gamma/(\gamma+1) with 0 < \nu < 1 (since we must have \gamma > 0).
Recall that J_{\nu}(\cdot) denotes the Bessel function of the first kind (see, e.g., 9.1.69 in [1]), explicitly defined, for an arbitrary \nu \in \mathbb{R}, by

J_{\nu}(x) = \frac{(\frac12 x)^{\nu}}{\Gamma(\nu+1)}\; {}_0F_1\big(\nu+1; -\tfrac14 x^2\big) = (\tfrac12 x)^{\nu} \sum_{k=0}^{\infty} \frac{(-\tfrac14 x^2)^k}{\Gamma(\nu+k+1)\,\Gamma(k+1)}.   (4.71)
The roots (or zeros) of J_{\nu}(\cdot) have the following properties, in addition to (4.53) (see, e.g., Ch. XV, pp. 478-521 in [207], and p. 96 in [132]). For any \nu > -1, J_{\nu}(\cdot) has only real roots. Moreover, in this case, the positive roots of J_{\nu}(\cdot) are isolated and form an increasing sequence 0 < z_{\nu,1} < z_{\nu,2} < \cdots, such that, for any fixed k \ge 1, z_{\nu,k} is a continuous and increasing function of \nu > -1. In addition, for any specified \nu > -1, as k \to \infty,

z_{\nu,k} = \pi\big(k + \tfrac12(\nu - \tfrac12)\big) - \frac{4\nu^2 - 1}{8\pi(k + \tfrac12(\nu - \tfrac12))} + O\Big(\frac{1}{k^3}\Big).   (4.76)
For \nu = \pm\frac12, we have 4\nu^2 - 1 = 0, so that, in either of these cases, z_{\nu,k} reduces to the first term in (4.76).
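The accuracy of the McMahon-type expansion (4.76) is easy to examine numerically (our own sketch, reusing the series (4.71) for J_\nu and bisection; already the first correction term is accurate to a few digits for small k):

```python
import math

def bessel_j(nu, x, terms=60):
    # J_nu(x) via the power series (4.71)
    s = 0.0
    for k in range(terms):
        s += (-x * x / 4.0) ** k / (math.gamma(nu + k + 1) * math.gamma(k + 1))
    return (x / 2.0) ** nu * s

def mcmahon(nu, k):
    # first two terms of (4.76)
    beta = math.pi * (k + 0.5 * (nu - 0.5))
    return beta - (4.0 * nu * nu - 1.0) / (8.0 * beta)

# compare with the exact z_{0,k}, located by bisection near the approximation
for k in (2, 3, 4):
    approx = mcmahon(0.0, k)
    a, b = approx - 0.5, approx + 0.5
    for _ in range(60):
        m = 0.5 * (a + b)
        if bessel_j(0.0, a) * bessel_j(0.0, m) <= 0:
            b = m
        else:
            a = m
    assert abs(0.5 * (a + b) - approx) < 0.01

# for nu = 1/2 the correction vanishes and z_{1/2,k} = k pi exactly
assert abs(mcmahon(0.5, 3) - 3 * math.pi) < 1e-12
```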
Moreover, for any \nu > -1, we have the product representation

J_{\nu}(z) = \frac{(\frac12 z)^{\nu}}{\Gamma(\nu+1)} \prod_{k=1}^{\infty} \Big( 1 - \frac{z^2}{z_{\nu,k}^2} \Big) for z > 0.   (4.78)
4.6.2. Some special cases. The expression (4.71) of the Bessel function J_{\nu}(\cdot) of the first kind can be simplified when \nu = m + \frac12 for an integer m = -1, 0, 1, \ldots. In particular, for m = -1 and m = 0,

J_{-1/2}(x) = \sqrt{\frac{2}{\pi x}}\, \cos x and J_{1/2}(x) = \sqrt{\frac{2}{\pi x}}\, \sin x.   (4.79)
For m \ge 0, we get

J_{m+1/2}(x) = (-1)^m \sqrt{\frac{2}{\pi}}\; x^{m+1/2} \Big( \frac{d}{x\,dx} \Big)^m \Big( \frac{\sin x}{x} \Big)   (4.80)

= \sqrt{\frac{2}{\pi x}}\, \big\{ P_m(1/x)\,\cos x + Q_m(1/x)\,\sin x \big\},   (4.81)

where P_m(\cdot) and Q_m(\cdot) are polynomials. The first terms of the sequence are

P_{-1}(u) = 1, Q_{-1}(u) = 0, P_0(u) = 0, Q_0(u) = 1.   (4.82)
Lemma 4.3. For an arbitrary m \ge 0, we have the recurrence formulas

Q_{m+1}(u) = (2m+1)\,u\,Q_m(u) - Q_{m-1}(u),   (4.83)

P_{m+1}(u) = (2m+1)\,u\,P_m(u) - P_{m-1}(u).   (4.84)

Proof. We have

J_{m+\frac32}(x) = \frac{2m+1}{x}\, J_{m+\frac12}(x) - J_{m-\frac12}(x),

so that (4.83)-(4.84) is straightforward.
By combining (4.81)-(4.82) with (4.83)-(4.84), we get

J_{3/2}(x) = \sqrt{\frac{2}{\pi x}} \Big( \frac{\sin x}{x} - \cos x \Big),   (4.85)

J_{5/2}(x) = \sqrt{\frac{2}{\pi x}} \Big( \frac{3\sin x}{x^2} - \frac{3\cos x}{x} - \sin x \Big).   (4.86)
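The closed forms (4.79), (4.85) and (4.86) can be verified against the defining series (4.71) (our own sketch, stdlib only):

```python
import math

def bessel_j(nu, x, terms=60):
    # J_nu(x) via the power series (4.71)
    s = 0.0
    for k in range(terms):
        s += (-x * x / 4.0) ** k / (math.gamma(nu + k + 1) * math.gamma(k + 1))
    return (x / 2.0) ** nu * s

for x in (0.5, 1.0, 2.0, 7.5):
    c = math.sqrt(2.0 / (math.pi * x))
    # (4.79): J_{1/2}(x) = sqrt(2/(pi x)) sin x
    assert abs(bessel_j(0.5, x) - c * math.sin(x)) < 1e-9
    # (4.85): J_{3/2}(x) = sqrt(2/(pi x)) (sin x / x - cos x)
    assert abs(bessel_j(1.5, x) - c * (math.sin(x) / x - math.cos(x))) < 1e-9
    # (4.86): J_{5/2}(x) = sqrt(2/(pi x)) (3 sin x / x^2 - 3 cos x / x - sin x)
    assert abs(bessel_j(2.5, x) - c * (3 * math.sin(x) / x ** 2
                                       - 3 * math.cos(x) / x
                                       - math.sin(x))) < 1e-9
```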
References
[1] Abramowitz, M. and Stegun, I.A. (1965). Handbook of Mathematical Functions.
Dover, New York.
[2] Acosta, A. de and Kuelbs, J. (1983). Limit theorems for moving averages of independent random vectors. Z. Wahrscheinlichkeitstheor. Verw. Geb. 64 67-123.
[3] Adler, R.J. (1990). An Introduction to Continuity, Extrema and Related Topics for
General Gaussian Processes. IMS Lecture Notes-Monograph Series 12. Institute of
Mathematical Statistics, Hayward, California.
[4] Akaike, H. (1954). An approximation of the density function. Ann. Inst. Statist.
Math. 6 127132.
[5] Anderson, T.W. and Darling, D.A. (1952). Asymptotic theory of certain goodness of fit criteria based on stochastic processes. Ann. Math. Statist. 23 193-212.
[6] Araujo, A. and Gine, E. (1980). The Central Limit Theorem for Real and Banach
Valued Random Variables. Wiley, New York.
[7] Ash, R.B. and Gardner, M.F. (1975). Topics in Stochastic Processes. Academic
Press, New York.
[8] Bahadur, R.R. (1967). A note on quantiles in large samples. Ann. Math. Statist. 37
577580.
[9] Bahadur, R.R. (1971). Some Limit Theorems in Statistics. Regional Conference
Series in Applied Mathematics. 4. S.I.A.M., Philadelphia.
[10] Bauer, H. (1981). Probability Theory and Elements of Measure Theory. Academic
Press, New York.
[11] del Barrio, E., Cuesta-Albertos, J.A. and Matran, C. (2000). Contributions of em-
pirical and quantile processes to the asymptotic theory of goodness-of-t tests. Test
9 196.
[12] Bartfai, P. (1966). Die Bestimmung der zu einem wiederkehrenden Prozess gehörenden Verteilungsfunktion aus den mit Fehlern behafteten Daten einer einzigen Realisation. Studia Sci. Math. Hung. 1 161-168.
[13] Bartlett, M.S. (1963). Statistical estimation of density functions. Sankhya. Ser. A
25 245254.
[14] Berkes, I. and Philipp, W. (1979). Approximation theorems for independent and
weakly dependent random vectors. Ann. Probab. 7 2954.
[15] Berlinet, A. (1993). Hierarchies of higher-order kernels. Prob. Theor. Rel. Fields 94
489504.
[16] Berlinet, A. and Devroye, L. (1994). A comparison of kernel density estimates. Publ.
Inst. Statist. Univ. Paris 38 359.
[17] Bickel, P. and Rosenblatt, M. (1973). On some global measures of the deviation of
density function estimates. Ann. Statist. 1 10711095.
[18] Bickel, P. and Rosenblatt, M. (1975). Corrections to "On some global measures of the deviation of density function estimates". Ann. Statist. 3 1370.
[19] Billingsley, P. (1968). Convergence of Probability Measures. John Wiley & Sons,
New York.
[20] Borell, C. (1975). The Brunn-Minkovski inequality in Gauss space. Invent. Math.
30 207216.
[21] Borell, C. (1976). Gaussian Radon measures on locally convex spaces. Math. Scand.
38 265284.
[22] Borell, C. (1977). A note on Gauss measures which agree on balls. Ann. Inst. H.
Poincare Ser. B 13 231238.
[23] Bosq, D. and Lecoutre, J.-P. (1987). Théorie de l'Estimation Fonctionnelle. Economica, Paris.
[24] Bowman, F. (1958). Introduction to Bessel Functions. Dover, New York.
[25] Bowman, A., Hall, P. and Prvan, T. (1998). Bandwidth selection for the smoothing
of distribution functions. Biometrika 85 799808.
[26] Bretagnolle, J. and Massart, P. (1989). Hungarian constructions from the non-
asymptotic viewpoint. Ann. Probab. 17 239256.
[65] Deheuvels, P. and Mason, D.M. (1990b). Nonstandard Functional laws of the iter-
ated logarithm for tail empirical and quantile processes. Ann. Probab. 18 16931722.
[66] Deheuvels, P. and Mason, D.M. (1992a). Functional laws of the iterated logarithm
for the increments of empirical and quantile processes. Ann. Probab. 20 12481287.
[67] Deheuvels, P. and Mason, D.M. (1992b). A functional L.I.L. approach to pointwise
Bahadur-Kiefer theorems. In Probability in Banach Spaces. (R.M. Dudley, M. Hahn
and J. Kuelbs, eds.) 8 255266. Birkhauser, Boston.
[68] Deheuvels, P. and Mason, D.M. (1994a). Functional laws of the iterated logarithm
for local empirical processes indexed by sets. Ann. Probab. 22 16191661.
[69] Deheuvels, P. and Mason, D.M. (1994b). Random fractals generated by oscilla-
tions of processes with stationary and independent increments. Probability in Ba-
nach Spaces. 7390, 9 Homan-Jrgensen, J., Kuelbs, J. and Marcus, M.B. Eds.
Birkhauser, Boston.
[70] Deheuvels, P. and Mason, D.M. (2004). General asymptotic condence bands based
on kernel-type function estimators. Statist. Inference for Stoch. Processes. 7 225
277.
[71] Deheuvels, P. and Steinebach, J. (1987). Exact convergence rates in strong approx-
imation laws for large increments of partial sums. Probab. Theor. Related Fields.
76 369393.
[72] Derzko, G. and Deheuvels, P. (2002). Estimation non-parametrique de la regression
dichotomique - application biomedicale. C. R. Acad. Sci. Paris Ser. I 334 5963.
[73] Deuschel, J.D. and Stroock, D.W. (1989). Large Deviations. Academic Press, New
York.
[74] Devroye, L. (1978). The uniform convergence of the Nadaraya-Watson regression
function estimate. Canad. J. Statist. 6 179191.
[75] Devroye, L. (1977). A uniform bound for the deviation of empirical distribution
functions. J. Multivariate Analysis. 7 594597.
[76] Devroye, L. (1982). Bounds for the uniform deviation of empirical measures. J.
Multivariate Analysis. 12 7279.
[77] Devroye, L. (1987). A Course in Density Estimation. Birkhauser-Verlag, Boston.
[78] Devroye, L. and Lugosi, G. (2001). Combinatorial Methods in Density Estimation.
Springer, New York.
[79] Devroye, L. and Györfi, L. (1985). Nonparametric Density Estimation: The L1 View. Wiley, New York.
[80] Doob, J.L. (1953). Stochastic Processes. Wiley, New York.
[81] Dudley, R.M. and Philipp, W. (1983). Invariance principles for sums of Banach
space valued random elements and empirical processes indexed by sets. Z. Wahrsch.
Verw. Gebiete. 62 509552.
[82] Dudley, R.M. (1999). Uniform Central Limit Theorems. Cambridge University
Press, Cambridge.
[83] Dudley, R.M. (2002). Real Analysis and Probability. Cambridge University Press,
Cambridge.
[84] Dugundji, J. (1966). Topology. Allyn and Bacon, Boston.
[85] Durbin, J. (1973). Distribution Theory for Tests Based on the Sample Distribution
Function. Regional Conference Series in Applied Mathematics, 9 S.I.A.M., Philadel-
phia.
[86] Dvoretzky, A., Kiefer, J. and Wolfowitz, J. (1956). Asymptotic minimax character
of the sample distribution function and of the classical multinomial estimator. Ann.
Math. Statist. 33 642669.
[87] Eggermont, P.P.B. and La Riccia, V.N. (2001). Maximum Penalized Likelihood Es-
timation. Springer, New York.
[88] Eicker, F. (1979). The asymptotic distribution of the suprema of the standardized
empirical processes. Ann. Statist. 7 116138.
[89] Einmahl, J.H.J. (1987). Multivariate Empirical Processes. C.W.I. Tract 32. Math-
ematisch Centrum, Amsterdam.
[90] Einmahl, U. (1986). A refinement of the KMT-inequality for partial sum strong approximation. Technical Report Series of the Laboratory for Research in Statistics and Probability, Carleton University. 88, Ottawa, Canada.
[91] Einmahl, U. (1988). Strong approximations for partial sums of i.i.d. B-valued r.v.s
in the domain of attraction of a Gaussian law. Probab. Theor. Rel. Fields. 77 6585.
[92] Einmahl, U. (1989). Extensions of results of Komlos, Major and Tusnady to the
multivariate case. J. Multivariate Anal. 28 2068.
[93] Einmahl, J.H.J. and Mason, D.M. (1985). Bounds for weighted multivariate empir-
ical distribution functions. Z. Wahrsch. Verw. Gebiete. 70 563571.
[94] Einmahl, J.H.J. and Mason, D.M. (1988). Strong limit theorems for weighted quan-
tile processes. Ann. Probab. 16 16231643.
[95] Einmahl, U. and Mason, D.M. (2000). An empirical process approach to the uniform
consistency of kernel-type function estimators. J. Theoretical Prob., 13, 137.
[96] Epanechnikov, V.A. (1969). Nonparametric estimation of a multivariate probability
density. Theor. Probab. Appl. 14 153158.
[97] Erdos, P. and Renyi, A. (1970). On a new law of large numbers. J. Analyse Math.
23 103111.
[98] Finkelstein, H. (1971). The law of the iterated logarithm for empirical distributions.
Ann. Math. Statist. 42 607615.
[99] Gaenssler, P. (1983). Empirical Processes. Vol. 3, IMS Lecture Notes-Monograph
Series, Institute of Mathematical Statistics, Hayward.
[100] Gaenssler, P. and Stute, W. (1979). Empirical process: a survey of results for inde-
pendent and identically distributed random variables. Ann. Probab. 7 193243.
[101] Gasser, T., Muller, H.G. and Mammitzsch, V. (1985). Kernels for nonparametric
curve estimation. J. R. Statist. Soc. Ser. B 47 238252.
[102] Gikhman, I.I. (1957). On a nonparametric criterion of homogeneity for k samples.
Theor. Probab. Appl. 2 369373.
[103] Goodman, V., Kuelbs, J. and Zinn, J. (1981). Some results on the LIL in Banach
space with application of weighted empirical process. Ann. Probab. 8 713752.
[104] Györfi, L., Kohler, M., Krzyzak, A., and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer, New York.
[105] Hall, P. (1991). On iterated logarithm laws for linear arrays and nonparametric
regression estimators. Ann. Probab. 19 740757.
[106] Hall, P. and Heyde, C.C. (1980). Martingale Limit Theory and its Application.
Academic Press, New York.
[107] Hall, P. and Marron, J.S. (1991). Lower bounds for bandwidth selection in density
estimation. Probab. Theor. Rel. Fields. 90 149173.
[108] Hall, P., Marron, J.S. and Park, B.U. (1992). Smoothed cross-validation. Probab.
Theor. Rel. Fields 92 120.
[109] Hardle, W. (1984). A law of the iterated logarithm for nonparametric regression
function estimators. Ann. Statist. 12 624635.
[110] Hardle, W. (1990). Applied Nonparametric Regression. Cambridge University Press,
Cambridge.
[111] Hardle, W., Janssen, P. and Serfling, R. (1988). Strong uniform consistency rates of estimators of conditional functionals. Ann. Statist. 16 1428-1449.
[112] Hartman, P. and Wintner, A. (1941). On the law of iterated logarithm. Amer. J.
Math. 63 169176.
[113] Hognas, G. (1977). Characterization of weak convergence of signed measures on
[0, 1]. Math. Scand. 41 175184.
[114] Hurewicz, W. and Wallman, H. (1948). Dimension Theory. Princeton University Press, Princeton.
[115] Izenman, A.J. (1991). Recent developments in nonparametric density estimation.
J. Amer. Statist. Assoc. 86 205224.
[116] Jaeschke, D. (1979). The asymptotic distribution of the supremum of the standard-
ized empirical distribution on subintervals. Ann. Statist. 7 108115.
[117] Jones, M.C., Marron, J.S. and Park, B.U. (1991). A simple root n bandwidth se-
lector. Ann. Statist. 19 19191932.
[118] Jones, M.C., Marron, J.S. and Sheather, S.J. (1996). A brief survey of bandwidth
selection for density estimation. J. Amer. Statist. Assoc. 91 401407.
[119] Kac, M. (1951). On some connections between probability theory and differential and integral equations. Proc. Second Berkeley Sympos. Math. Statist. Probab. 180-215.
[120] Kac, M. (1980). Integration in Function Spaces and Some of its Applications.
Lezioni Ferniane, Academia Nazionale dei Lincei. Pisa.
[121] Kac, M. and Siegert, A.J.F. (1947). On the theory of noise in radio receivers with
square law detectors. J. Appl. Physics. 18 383397.
[122] Kac, M. and Siegert, A.J.F. (1947). An explicit representation of a stationary Gauss-
ian process. Ann. Math. Statist. 18 438442.
[123] Kiefer, J. (1959). K-sample analogues of the Kolmogorov-Smirnov and Cramer-V.
Mises tests. Ann. Math. Statist. 30 420447.
[124] Kiefer, J. (1967). On Bahadur's representation of sample quantiles. Ann. Math. Statist. 38 1323-1342.
[125] Kiefer, J. (1970). Deviations between the sample quantile process and the sample
d.f. In Nonparametric Techniques in Statistical Inference. (M. Puri, ed.) 299319.
Cambridge Univ. Press.
[126] Kiefer, J. (1972a). Iterated logarithm analogues for sample quantiles when p_n → 0.
Proc. Sixth Berkeley Symp. Math. Statist. Probab. 1 227244. Univ. California Press,
Berkeley.
[127] Kiefer, J. (1972b). Skorohod embedding of multivariate rvs and the sample df. Z.
Wahrsch. Verw. Gebiete. 24 135.
[128] Kolmogorov, A.N. (1933). Sulla determinazione empirica di una legge di distribuzione. Giorn. Ist. Ital. Attuari. 4 83-91.
[129] Komlos, J., Major, P. and Tusnady, G. (1975). An approximation of partial sums
of independent r.v.s and the sample df. I. Z. Wahrsch. Verw. Gebiete. 32 111131.
[130] Komlos, J., Major, P. and Tusnady, G. (1975). An approximation of partial sums
of independent r.v.s and the sample df. II. Z. Wahrsch. Verw. Gebiete. 34 3358.
[131] Konakov, V.D. and Piterbarg, V.I. (1984). On the convergence rate of maximal
deviation distribution of kernel regression estimates. J. Mult. Appl. 15 279294.
[132] Korenev, B.G. (2002). Bessel Functions and their Applications. Taylor & Francis,
London.
[133] Krzyzak, A., and Pawlak, M. (1984). Distribution-free consistency of a nonpara-
metric kernel regression estimate and classication. IEEE Trans. on Information
Theory. 30 7881.
[134] Lai, T.L. (1974). Reproducing kernel Hilbert spaces and the law of the iterated
logarithm for Gaussian processes. Z. Wahrscheinlichkeitstheorie Verw. Gebiete. 29
719.
[135] Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry
and Processes. Springer, Berlin.
[136] Ledoux, M. (1996). On Talagrands deviation inequalities for product measures.
ESAIM: Probab. Statist. 1 6387. http//www.emath.fr/ps/
[137] Levy, P. (1937). Theorie de lAddition des Variables Aleatoires. Gauthier-Villars,
Paris.
[138] Levy, P. (1951). Wiener random functions and other Laplacian random functions.
Proc. 6th Berkeley Sympos. Probab. Theory Math. Statist. 2 171186.
[139] Levy, P. (1953). La mesure de Hausdor de la courbe du mouvement brownien.
Giorn. ist. Ital. Attuari. 16 137.
[140] Li, W.V. (1992a). Comparison results for the lower tail of Gaussian seminorms. J.
Theor. Probab. 5 131.
[141] Li, W.V. (1992b). Limit theorems for the square integral of Brownian motion and
its increments. Stoch. Processes Appl. 41 223239.
[142] Li, W.V. (1992c). Lim inf results for the Wiener process and its increments under
the L2 -norm. Prob. Th. Rel. Fields. 92 6990.
[143] Loader, C.R. (1999). Bandwidth selection: classical or plug-in? Ann. Statist. 27
415438.
[144] Lynch, J. and Sethuraman, J. (1987). Large deviations for processes with indepen-
dent increments. Ann. Probab. 15 610627.
[145] Maccone, C. (1984). Eigenfunctions and energy for time-rescaled Gaussian pro-
cesses. Boll. Un. Mat. ital. 6 213219.
Topics on Empirical Processes 187
[190] Smirnov, N.V. (1937). On the distribution of the 2 criterion. Rec. Math. (Mat.
Sbornik). 6 326.
[191] On the estimation of the discrepancy between empirical curves of distribution for
two independent samples. Bull. Math. de lUniversite de Moscou. 2.
[192] Smirnov, N.V. (1948). Table for estimating the goodness of t of empirical distri-
butions. Ann. Math. Statist. 19 279281.
[193] Spiegelman, J. and Sacks, J. (1980). Consistent window estimation of nonparametric
regression. Ann. Statist. 8 240246.
[194] Steen, L.A. and Seebach, J.A. (1978). Counterexamples in Topology. 2nd Ed.
Springer, New York.
[195] Stein, E.M. (1970). Singular Integrals and Dierentiability Properties of Functions,
Princeton University Press, Princeton, New Jersey.
[196] Stone, C.J. (1977). Consistent nonparametric regression. Ann. Statist. 5 595645.
[197] Stone, C.J. (1980). Optimal rates of convergence for nonparametric estimators.
Ann. Statist. 8 13481360.
[198] Stone, C.J. (1982). Optimal global rates of convergence for nonparametric regres-
sion. Ann. Statist. 10 10401053.
[199] Strassen, V. (1964). An invariance principle for the law of the iterated logarithm.
Z. Wahrsch. Verw. Gebiete. 3 211226.
[200] Stute, W. (1982). The oscillation behaviour of empirical processes. Ann. Probab. 10
86107.
[201] Talagrand, M. (1994). Sharper bounds for Gaussian and empirical processes. Ann.
Probability 22, 2876.
[202] Tapia, R.A. and Thompson, J.R. (1978). Nonparametric Probability Density Esti-
mation. Johns Hopkins University Press, Baltimore.
[203] Terrel, G.R. (1990). The maximal smoothing principle in density estimation. J.
Amer. Statist. Assoc. 85 470477.
[204] van der Vaart, A.W. and Wellner, J.A. (1996). Weak Convergence and Empirical
Processes. Springer Verlag, New York.
[205] Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. Chapman and Hall, Lon-
don.
[206] Varadhan, S.R.S. (1966). Asymptotic probabilities and dierential equations.
Comm. Pure Appl. Math. 19 261286.
[207] Watson, G.N. (1952). A Treatise on the Theory of Bessel Functions. Cambridge
University Press, Cambridge.
[208] Watson, G.S. (1964). Smooth regression analysis. Sankhya A26 359372.
[209] Watson, G.S. and Leadbetter, M.R. (1963). On the estimation of probability den-
sity, I. Ann. Math. Statist. 34 480491.
[210] Wertz, W. (1972). Fehlerabschatzung fur eine Klasse von nichtparametrischen
Schatzfolgen. Metrika. 19 132139.
[211] Wertz, W. (1978). Statistical Density Estimation: A Survey. Vandenhoeck &
Ruprecht, Gottingen.
[212] Woodroofe, M. (1966). On the maximum deviation of the sample density. Ann.
Math. Statist. 41 16651671.
190 P. Deheuvels
[213] Zolotarev, V.M. (1961). Concerning a certain probability problem. Theor. Probab.
Appl. 6 201204.
[214] Zolotarev, V.M. (1983). Probability metrics. Theory Probab. Appl. 28 278302.
Paul Deheuvels
L.S.T.A., Université Paris VI
Oracle Inequalities and Regularization
Sara van de Geer
1. Statistical models
In this chapter, the construction of a statistical model is discussed. We contemplate deviations from the model on the one hand, and simplicity of a model on the other. We introduce the concepts of approximation error and estimation error. The idea of complexity regularization is illustrated in two situations: histograms in density estimation and smoothing splines in regression.
Here is a brief sketch of the contents of the other chapters. In Chapter 2, we introduce penalized M-estimators or penalized empirical risk estimators. These are obtained by minimizing a loss function (e.g., least squares loss, minus the log-likelihood, or, in classification, support vector machine loss). A roughness penalty is added to the loss function to avoid overfitting. We study the behavior of the estimators in a general context. The excess risk of an estimator is a global measure of its performance. We consider so-called oracle inequalities for the excess risk. These inequalities relate the performance of the estimator to the procedure that chooses the optimal model by trading off bias and variance (or, more generally, approximation error and estimation error). In Chapter 2, we highlight the role of empirical process theory in this context.
As an important particular case, we investigate high-dimensional linear spaces. The approximation error then comes from approximating curves or images by elements of a high-dimensional parameter space. Chapter 3 studies oracle inequalities in a regression framework, using the least squares estimators of the coefficients. That setup has the advantage that everything can be calculated explicitly. It serves as a preparation for more complicated situations.
When there are infinitely many parameters in the model, it is called nonparametric. We will however not be very strict in our distinction between parametric and nonparametric, for the following reason. Throughout, the number of observations n is assumed to be large. We will allow that the choice of the model P depends on n, and is in fact richer for larger n. This is only natural, since when we have many observations, we may want to use more flexible models and get more information out of the data. Thus, in a parametric model, the parameter space Θ may depend on n, and in particular its dimension N may depend on n, and in fact grow without limit as n → ∞. This means, strictly speaking, that we deal with a sequence of parametric models with a nonparametric limiting model. We think of such a situation as a nonparametric one.
Parametric models (with N small) are in a sense less rich than nonparametric models, and there is also a range in the complexity of various nonparametric models. The more complex a model, the larger the inaccuracy will be. On the other hand, models that are too simple have a large systematic error. (Here, we use a generic terminology; we will be more precise in our definitions later on, e.g., in Section 2.3.) Both inaccuracy and systematic error depend on the model, and on the truth P. The optimal model trades off the inaccuracy and systematic error (see Figure 1). However, since P is unknown, it is also not known which model this will be. Only an oracle can tell you that. Our aim will be to mimic this oracle.
To evaluate the inaccuracy of a model, we will use empirical process theory.
Empirical process theory is about comparing the theoretical distribution P with
its empirical counterpart, the empirical distribution Pn , introduced in the next
section.
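As a minimal illustration of this comparison of P with Pn (our own sketch, not from the text; the function name is ours), the following code draws an i.i.d. uniform sample and computes the Kolmogorov–Smirnov distance sup_x |Fn(x) − F0(x)| between the empirical and the true distribution function, which tends to zero at rate 1/√n.

```python
import numpy as np

def ks_distance(sample, cdf):
    """Sup-distance between the empirical d.f. of `sample` and the cdf F0."""
    x = np.sort(sample)
    n = x.size
    # The supremum over x is attained at the jump points of the empirical d.f.
    upper = np.arange(1, n + 1) / n - cdf(x)   # F_n(x_i) - F0(x_i)
    lower = cdf(x) - np.arange(0, n) / n       # F0(x_i) - F_n(x_i^-)
    return max(upper.max(), lower.max())

rng = np.random.default_rng(0)
for n in [100, 10000]:
    d = ks_distance(rng.uniform(size=n), cdf=lambda x: x)
    print(n, round(d, 4))
```

The distance for n = 10000 is roughly a tenth of that for n = 100, in line with the 1/√n rate.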
1.3. Regularization
Many (nonparametric) estimation procedures involve the choice of a tuning pa-
rameter, also called regularization parameter or smoothing parameter. Here are
two examples.
Example 1.3. Histograms. Suppose X ∈ R has a density, say f0, with respect to Lebesgue measure. Our aim is to estimate f0. The density f0(x) at x is defined as the derivative of the distribution function F0 at x:

    f0(x) = lim_{h↓0} [F0(x + h) − F0(x − h)]/(2h) = lim_{h↓0} P(x − h, x + h]/(2h).
Unfortunately, replacing P by Pn here does not work: for h small enough, nPn(x − h, x + h] will be either zero or one. Therefore, instead of taking the limit, one uses a fixed small bandwidth h > 0.
[Figure: the distribution function F0 together with the empirical distribution function Fn.]
The choice of the bandwidth h that is optimal in the sense of IMSE minimizes the integrated mean square error, i.e., it trades off variance and squared bias. However,
to carry out this trade-off, one needs to know certain aspects of f0. But f0 is unknown! In an attempt to mimic an oracle, one often uses part of the data to estimate f0 with various choices of the bandwidth h, and then uses the rest of the data to decide on the choice for h. One may also estimate IMSE(f̂n), applying for example least squares cross-validation. We will not present the details here. Instead of bandwidth selection, we will study regularization using complexity penalties, as illustrated in Example 1.4. In the intermezzo following this example, it is shown that the two approaches can be closely related.
Exercise 1.1. Suppose

    f0(x) = 2/x³, x > 1.

Let h > 0 be the bandwidth. At a given x > 1 + h, calculate the bias and variance of the histogram estimator

    f̂n(x) = Pn(x − h, x + h]/(2h).

Show that for x fixed, h → 0 and nh → ∞, the bias is of order h² (bias(f̂n(x)) = O(h²)) and the variance is of order 1/(nh) (var(f̂n(x)) = O(1/(nh))). The choice of h that is optimal in the sense of MSE is thus h_opt = O(n^{−1/5}). (For a definition of the order symbols, see Section 3.5.)
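A quick numerical check of Exercise 1.1 (our own sketch; the function name is ours): since F0(x) = 1 − x^{−2}, inverse transform sampling gives X = U^{−1/2} for uniform U, and the bias and variance of f̂n(x) can be estimated by simulation.

```python
import numpy as np

def hist_estimator(sample, x, h):
    """Histogram estimator f_n(x) = P_n(x-h, x+h] / (2h)."""
    return np.mean((sample > x - h) & (sample <= x + h)) / (2 * h)

rng = np.random.default_rng(1)
n, x, reps = 50_000, 2.0, 200
f0 = 2 / x**3  # true density at x

for h in [0.4, 0.2, 0.1]:
    # each replication: a fresh sample of size n from f0 via inverse transform
    est = [hist_estimator(rng.uniform(size=n) ** -0.5, x, h) for _ in range(reps)]
    bias, var = np.mean(est) - f0, np.var(est)
    print(f"h={h}: bias={bias:.4f} (h^2={h*h:.3f}), var={var:.2e} (1/(nh)={1/(n*h):.1e})")
```

Halving h divides the bias by roughly four while the variance roughly doubles, as the orders h² and 1/(nh) predict.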
Example 1.4. Penalized least squares. Consider the regression

    EYi = f0(xi), i = 1, . . . , n,

where Yi is a response variable, xi is a co-variable (i = 1, . . . , n), and where f0 is an unknown function. We examine here the case where xi = i/n ∈ [0, 1], i = 1, . . . , n, and f0 is defined on [0, 1]. We suppose f0 is not changing too much, in the sense that the squared first derivative of f0 is small, say in terms of the average ∫₀¹ |f0′(x)|² dx. As estimator of f0 we propose

    f̂n = arg min_f { (1/n) ∑_{i=1}^n |Yi − f(xi)|² + λ² ∫₀¹ |f′(x)|² dx }.
Here, "arg" stands for "argument", i.e., the location where the minimum is attained. Moreover, λ is a tuning or regularization parameter. If λ = 0, the estimator f̂n will just interpolate the data. On the other hand, if λ = ∞, f̂n will be a constant function (namely, constantly equal to the average ∑_{i=1}^n Yi/n of the observations). To the least squares loss function, we have thus added a penalty for choosing a too wiggly function. This is called (complexity) regularization.
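A discrete version of this estimator can be computed in closed form: replacing ∫|f′|² by scaled squared first differences of the vector (f(x₁), . . . , f(xₙ)) turns the criterion into ridge-type least squares, solved by one linear system. The sketch below is our own illustration (not code from the text); it shows the two extremes: λ = 0 interpolates the data, and a very large λ gives an essentially constant fit at the average of the Yi.

```python
import numpy as np

def penalized_ls(y, lam):
    """Minimize (1/n)||y - f||^2 + lam^2 (1/n)||Df||^2, where (Df)_i =
    n (f_{i+1} - f_i) is a discrete stand-in for f' on the grid x_i = i/n."""
    n = y.size
    D = n * (np.eye(n - 1, n, k=1) - np.eye(n - 1, n))
    # Normal equations of the criterion: (I + lam^2 D'D) f = y
    return np.linalg.solve(np.eye(n) + lam**2 * (D.T @ D), y)

rng = np.random.default_rng(2)
x = np.arange(1, 101) / 100
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(100)

f_interp = penalized_ls(y, lam=0.0)    # lambda = 0: interpolates the data
f_const = penalized_ls(y, lam=100.0)   # huge lambda: essentially the mean
print(np.allclose(f_interp, y), np.allclose(f_const, y.mean(), atol=1e-2))
```

Intermediate values of λ interpolate between these extremes, which is exactly the bandwidth-like role of λ mentioned below.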
Figure 4 below plots the true f (which is f0 ) in blue together with the data
(red). The aim is to recover f0 from the data. Figure 5 shows the estimator fn
[Figure 4.]
[Figure 5.]
(in green) for two choices of the tuning parameter λ. The fit of f̂n is defined as

    (1/n) ∑_{i=1}^n |Yi − f̂n(xi)|².

Obviously, the smaller value of λ gives a better fit. Figure 6 plots the estimator f̂n together with f0, for two values of λ. The error (or excess risk, see Chapter 2), which is defined here as

    (1/n) ∑_{i=1}^n |f̂n(xi) − f0(xi)|²,

turns out to be smaller for the smaller value of λ.
Now, in real life situations, it is not possible to make the plots of Figure 6 and/or calculate the error, since the true f is then unknown. Thus, again, we need an oracle to tell us which λ to choose. In Section 4.5, we show that by penalizing small values of λ one may arrive at an oracle inequality.
[Figure 6.]
We show in Lemma 1.1 below that the solution f̂ can be explicitly calculated (using variational calculus). This solution reveals that the tuning parameter λ plays the role of a bandwidth parameter.
2. M-estimators
We will start with two examples, and then present the general definition of an M-estimator. We furthermore give definitions of excess risk, estimation error and approximation error. Next, we highlight how empirical process theory can be invoked to assess the excess risk of an M-estimator.
Recall that "arg" stands for "argument", i.e., f̂n is the density in F where the likelihood of the observations is maximal. Note that we may write

    (1/n) ∑_{i=1}^n log f(Xi) = Pn(log f).
Exercise 2.1. Verify that the true density f0 = dP/dμ is a maximizer of P(log f) over all densities f.
Exercise 2.2. Check that a histogram with cells chosen a priori (see Example 1.3) is a maximum likelihood estimator for the density on R, with model class

    F = { f = ∑_{k=1}^N θk 1_{(a_{k−1}, a_k]} : θk ≥ 0, k = 1, . . . , N, ∫ f(x) dx = 1 }.
Bayes rule f0 is to classify observations X with η0(X) > 1/2 in the class with label 1 and those with η0(X) ≤ 1/2 in the class with label 0, i.e., G0 = {x : η0(x) > 1/2}. Identifying sets and indicator functions, we also call the set G0 Bayes rule.
Let {(Xi, Yi)}_{i=1}^n be a sample from (X, Y). One may estimate Bayes rule in the following way. The number of misclassifications using the classifier G ⊂ X is

    #{Xi ∈ G, Yi = 0} + #{Xi ∉ G, Yi = 1} = ∑_{i=1}^n |Yi − 1G(Xi)| := nRn(G),

and

    R(G) = P(X ∈ G, Y = 0) + P(X ∉ G, Y = 1) = ERn(G).

Exercise 2.3. Verify that Bayes rule G0 minimizes R(G) over all subsets G ⊂ X.
    0 ≤ R(f̂n) − R(f∗) = R(f̂n) − Rn(f̂n) + Rn(f̂n) − Rn(f∗) + Rn(f∗) − R(f∗)
        ≤ −[(Rn(f̂n) − R(f̂n)) − (Rn(f∗) − R(f∗))].

We write, for f ∈ F,

    νn(f) = √n (Rn(f) − R(f)) = √n (Pn(f) − P(f)).

Let F0 be some subset of F. The empirical process indexed by F0 is

    {νn(f) : f ∈ F0}.

From the above, we know that

    0 ≤ R(f̂n) − R(f∗) ≤ −[νn(f̂n) − νn(f∗)]/√n.

Adding R(f∗) − R(f0) to both sides of this inequality moreover yields

(2.2)    R(f̂n) − R(f0) ≤ −[νn(f̂n) − νn(f∗)]/√n + [R(f∗) − R(f0)] = I + II,

with

    I = −[νn(f̂n) − νn(f∗)]/√n,

and with II the approximation error

    II = R(f∗) − R(f0).
This inequality reveals the two components I and II when studying the excess risk at f̂n. The expression I is a bound for the estimation error. Empirical process theory is invoked to examine this term. Handling the approximation error II involves approximation theory. We refer to inequalities like (2.2) as basic inequalities, because such inequalities will be the starting point for deriving bounds for the excess risk at f̂n.
Exercise 2.5. In Exercise 2.4, define the noise variables

    εi = Yi − f0(xi), i = 1, . . . , n.

Verify that for the model with fixed design

    νn(f) − νn(f̃) = (2/√n) ∑_{i=1}^n εi (f̃(xi) − f(xi)).
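Since the εi² terms cancel in the difference, the increment formula of Exercise 2.5 is an exact algebraic identity, not merely an asymptotic one. The following sketch (our own, with R(f) = E Rn(f) computed for a known f0 and noise variance σ²) checks it numerically.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 50, 0.5
x = np.arange(1, n + 1) / n
f0 = np.sin(2 * np.pi * x)               # true regression function on the design
eps = sigma * rng.standard_normal(n)     # noise variables eps_i = Y_i - f0(x_i)
y = f0 + eps

def nu(fx):
    """nu_n(f) = sqrt(n) (R_n(f) - R(f)) for fixed design, known f0 and sigma."""
    Rn = np.mean((y - fx) ** 2)
    R = sigma**2 + np.mean((f0 - fx) ** 2)   # R(f) = E R_n(f)
    return np.sqrt(n) * (Rn - R)

f, f_tilde = np.cos(2 * np.pi * x), x**2     # two arbitrary candidate functions
lhs = nu(f) - nu(f_tilde)
rhs = (2 / np.sqrt(n)) * np.sum(eps * (f_tilde - f))
print(np.isclose(lhs, rhs))  # True: the identity holds exactly
```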
The two main tools for studying M-estimators are empirical process theory and approximation theory. Empirical process theory is used to investigate the estimation error of an estimator. There is no straight answer on how to invoke which parts of empirical process theory. In Lemma 2.1 below, we give an example which highlights the main idea.
Empirical process theory supplies us with inequalities for suprema of empirical processes indexed by functions. We will in fact need the behavior of increments of the empirical process. This concerns the question: if two functions f and f̃ are close, then how small is the difference |νn(f) − νn(f̃)|? We welcome an answer that holds uniformly over f ∈ F in a neighborhood of f̃, because we can then apply it to a random function (an estimator) in this neighborhood.
Empirical process theory gives us probability or moment bounds for the suprema of empirical processes. These can be directly derived inequalities. However, a strong tool is based on so-called concentration inequalities. Concentration inequalities are (exponential, or even sub-Gaussian) probability inequalities for the concentration of a random variable around its mean. One can apply them to the random variable representing the supremum of an empirical process. Then, the only task left is to find good bounds for the mean of this supremum. This is in some cases as easy as Cauchy-Schwarz, perhaps preceded by a so-called symmetrization and/or a contraction inequality. These concepts (concentration, symmetrization and contraction) will be discussed in more detail in Chapter 6.
Approximation theory will be used to illustrate how the behavior of estimators depends on how well the model approximates the truth. Since in real life we will actually never find out what the truth is, these illustrations are purely theoretical.
The next lemma was used in the proof of Lemma 2.1. It will in fact also be
of help in proofs in Chapter 5.
Technical Lemma. For all positive v, t and ε, and all k > 1,

    vt ≤ ε t^k + ε^{−1/(k−1)} v^{k/(k−1)}.

Proof. Suppose first that v/ε ≤ t^{k−1}. Then obviously

    vt = ε (v/ε) t ≤ ε t^k.

Conversely, if v/ε > t^{k−1}, then t ≤ (v/ε)^{1/(k−1)}. So then

    vt ≤ v (v/ε)^{1/(k−1)} = ε^{−1/(k−1)} v^{k/(k−1)}. □
Suppose that the constant C does not depend on P or on the model class F. Clearly, the estimation error Vn depends on the model F, for example in (2.5) through the parameter. Moreover, Vn may also depend on P, for example as in (2.5) through the parameter. It is clear that also the approximation error, which we write for short as Bn² = R(f∗) − R(f0), depends on F and P. To express these dependencies, let us write Vn = Vn(F, P) and Bn² = Bn²(F, P). Given a collection of models {F}, the optimal model F∗ would now be

    F∗ = arg min_{F ∈ {F}} { Vn(F, P) + Bn²(F, P) }.
But since the optimal model F∗ depends on P, only an oracle knows what F∗ is. The oracle is mimicked if we can construct an estimator with excess risk at most that of the estimator when F∗ were known.
In our theory, we will not be able to arrive at exact oracle behavior. Instead, a trade-off up to constants independent of the sample size n, or possibly up to (log n)-factors, is established. Having essentially the large-n situation in mind, we will not be too much concerned about such constants and (log n)-factors. In conclusion, we look for an estimator f̂n satisfying, up to (log n)-terms,

    ER(f̂n) − R(f0) = O( Vn(F∗, P) + Bn²(F∗, P) ).

The approach to this end will be to add a penalty to the empirical risk, so that complicated models are penalized. Such a method is called penalized empirical risk minimization, and it is a form of complexity regularization.
ε1, . . . , εn are assumed to be i.i.d. N(0, σ²)-distributed.
We may collect the observations Y = (Y1, . . . , Yn) in a (random) vector in Rⁿ. The regression model takes the mean of Y as (possibly partly) unknown vector. The basis functions {ψj} are normalized so that

    (1/n) ∑_{i=1}^n ψj²(xi) = 1, ∀ j.

Exercise 3.1. Show that ε̃1, . . . , ε̃n are i.i.d. and N(0, σ²/n)-distributed.
We call (3.1) the sequence space formulation. If the regression function f0 is completely unknown, the expectation θ0 = (θ1,0, . . . , θn,0) of the random vector Y = (Y1, . . . , Yn) is completely unknown. So then there are n unknowns, that is, as many unknowns as there are observations. Nevertheless, when the signal is sparse, one can estimate the vector θ0 (in a global sense). By sparseness, we mean that most of the elements in θ0 are zero or almost zero. In the literature, the sparseness of a representation is often defined as the number of zero coefficients, so that one representation is sparser than another iff it has fewer non-zero coefficients. Section 3.2 gives a formally somewhat different definition, giving the possibility of a more refined comparison.
We end this section with some remarks. In practice, one has to make a choice for the orthonormal basis {ψj}. This is like choosing a language to interpret a given (noisy) text. Here, the special properties of the data set are of importance. For example, some signals are well described by taking a Fourier basis, others by wavelets or trigonometric series. The problem is closely related to data compression. One hopes to choose a basis such that θ0 is indeed sparse, i.e., that one has an economical approximate representation with only a few non-zero coefficients.
In the literature, most systems of basis functions are orthonormal for Lebesgue measure instead of Qn. We assume orthonormality in L2(Qn) because it makes it possible to avoid technical calculations.
We note moreover that in many applications, sparse representations can only be obtained using non-orthogonal (and perhaps overcomplete) representations (for example for representing sharp edges in a picture).
The mapping x ↦ ψ(x) from X to Rⁿ is sometimes called the feature mapping. For example, if f0 is a picture, then ψ may represent features like angles, directions, and shapes.
In some other situations, ψ is of the form ψj(xi) = K(|xi − xj|/h), where |·| is some metric on X, K is some kernel and h a regularization parameter, often called the width. In that case, sparseness may be less of an issue. It is rather the choice of the parameter h that is of importance here.
If there are many sets of basis functions to choose from, we call them dictionaries. One may use a data dependent method to choose a dictionary, but we will not consider this issue.
In the next section, we take the sequence space formulation as a starting
point, and we will omit the tilde in our notation.
3.2. Estimating the mean of a normal vector
In this section we study the estimation of the vector θ0 ∈ Rⁿ from observations

    Yj = θj,0 + ε̃j, j = 1, . . . , n,

with ε̃1, . . . , ε̃n independent N(0, σ²/n)-distributed. Denote the Euclidean norm on Rⁿ as

    ‖θ‖²n = ∑_{j=1}^n |θj|², θ ∈ Rⁿ.
[Figure 7: a sparse signal.]
that exploits this as much as possible. A signal with only few large coecients
is called sparse. It is like the starry night sky: if you consider the sky as a set of
pixels, then at most pixels there is no star (no signal), or it is too far away, and
the light you see there is mainly background noise.
As mathematical description of sparseness, we take:
Definition. The signal θ0 ∈ Rⁿ is called sparse if for some 0 ≤ r < 2,

(3.2)    ∑_{j=1}^n |θj,0|^r ≤ 1.
Conversely,

    ∑_{j=1}^n |θj,0|^r ≥ ∑_{j: |θj,0| > λ} |θj,0|^r ≥ #{j : |θj,0| > λ} λ^r.
Exercise 3.2. Let us compare this result with the one of Lemma 2.1. With the notation used there, one has model class F = Θ(J), and empirical process

    νn(θ) = 2√n ∑_{j=1}^n ε̃j θj.

Thus, up to constants, Lemma 2.1 produces the same answer as direct calculations.
3.4. The model an oracle would select
An oracle chooses J as the set J∗ which minimizes the mean square error, i.e.,

    J∗ = arg min_{J ⊂ {1,...,n}} { σ²|J|/n + ∑_{j ∉ J} θ²j,0 }.

Exercise 3.3. Show that the index set an oracle would select is

    J∗ = {j : θ²j,0 > σ²/n}.

(You can use similar arguments as in the proof of Lemma 3.5.)
Exercise 3.4. Suppose that θ0 is sparse. Show that for the model Θ(J∗) chosen by the oracle,

    E‖θ̂n(J∗) − θ0‖²n ≤ 2 (σ²/n)^{(2−r)/2}.
3.5. Hard- and soft-thresholding
The oracle estimates only the coefficients θj,0 which are in absolute value bigger than √(σ²/n). The idea is now to replace the unknown coefficients θj,0 by the observations Yj, j = 1, . . . , n. First of all, it should then be noted that the noise level σ² is generally unknown. However, this problem is minor, as one may construct estimators of σ². Here, we assume for simplicity that σ² is known. A more severe problem is that we now estimate the θj,0 in order to construct an estimator of the θj,0! It actually turns out that this procedure works if we make the threshold a bit larger than √(σ²/n), namely, it should be chosen ≍ √(2(log n)σ²/n). Then the oracle is almost mimicked (see Lemma 3.3).
We now first present the definitions of the hard-thresholding and soft-thresholding estimators. Lemmas 3.3 and 3.4 give the oracle inequality for these estimators, and Lemmas 3.5 and 3.6 put them in the framework of penalized M-estimation. We then have a clearer picture of the type of oracle inequalities we may expect for more general M-estimators.
Let λn ≥ 0 be some threshold. This threshold will be the regularization parameter in the present context.
Definition of the hard-thresholding estimator.

    θ̂j,n(hard) = Yj if |Yj| > λn;  0 if |Yj| ≤ λn,   j = 1, . . . , n.

Definition of the soft-thresholding estimator.

    θ̂j,n(soft) = Yj − λn if Yj > λn;  Yj + λn if Yj < −λn;  0 if |Yj| ≤ λn,   j = 1, . . . , n.
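In code, both estimators are one-liners. The sketch below (our own, not from the text) applies them in the sequence space model Yj = θj,0 + ε̃j with ε̃j ~ N(0, σ²/n) and a sparse θ0, using a threshold of the order √(2(log n)σ²/n) discussed above.

```python
import numpy as np

def hard_threshold(y, lam):
    """Keep Y_j when |Y_j| > lam, otherwise set it to zero."""
    return np.where(np.abs(y) > lam, y, 0.0)

def soft_threshold(y, lam):
    """Shrink Y_j towards zero by lam; zero when |Y_j| <= lam."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

rng = np.random.default_rng(4)
n, sigma = 1000, 1.0
theta0 = np.zeros(n)
theta0[:10] = 0.5                              # sparse signal: 10 non-zero coefficients
y = theta0 + (sigma / np.sqrt(n)) * rng.standard_normal(n)

lam = np.sqrt(2 * np.log(n) * sigma**2 / n)    # threshold ~ sqrt(2 (log n) sigma^2 / n)
for est in (hard_threshold(y, lam), soft_threshold(y, lam)):
    print(np.count_nonzero(est), round(np.sum((est - theta0) ** 2), 4))
```

With this threshold, essentially only the ten true signal coordinates survive, and the squared error is far smaller than the error n · σ²/n = σ² of the raw observations.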
It is shown in Lemmas 3.3 and 3.4 that the estimators θ̂n(hard) and θ̂n(soft) have similar oracle properties, i.e., they have up to (log n)-terms the same mean square error as when using the model Θ(J∗) given by the oracle. Lemmas 3.3 and 3.4 are proved in Donoho and Johnstone (1994b), by direct calculations. We will reconsider the oracle properties of the soft-thresholding estimator in Lemma 3.7, in a fashion that allows extension to other M-estimation contexts.
Lemma 3.3 for the hard-thresholding estimator is stated in an asymptotic sense. We use the following notation and terminology. Let {zn} and {ξn} be sequences of positive numbers. We say that zn = o(ξn) (zn is of smaller order than ξn) if zn/ξn → 0 as n → ∞. Moreover, zn = O(ξn) (zn is of order ξn) means that lim sup_{n→∞} zn/ξn < ∞, and zn ≍ ξn (zn is asymptotically equal to ξn) means that both zn = O(ξn) and ξn = O(zn).
Exercise 3.5. Suppose that θ0 is sparse. Show that when the threshold is chosen as in Lemma 3.3 (Lemma 3.4), the squared rate of convergence for the hard-thresholding (soft-thresholding) estimator is

    (σ² log n / n)^{(2−r)/2}.

Compare with Exercise 3.4.
Donoho and Johnstone (1994b) also prove that the soft-thresholding estimator can be improved by choosing the threshold λn more carefully. We will not cite that result here, because our focus is not so much on the constants. In fact, in Lemma 3.7, we will treat the soft-thresholding estimator again, using an indirect method. This gives worse constants, but the approach has the advantage that the method is applicable to other situations as well, and in particular does not rely on the assumption of normally distributed errors.
The hard- and soft-thresholding estimators are penalized M-estimators, as is shown in Lemmas 3.5 and 3.6. This point of view allows one to define hard- and soft-thresholding type estimators for other loss functions as well. The hard-thresholding estimator comes from a penalty on the number of non-zero coefficients, #{θj ≠ 0} = ∑_{j=1}^n |θj|⁰, which we refer to as the ℓ0-penalty. The soft-thresholding case corresponds to a penalty on the ℓ1-norm ∑_{j=1}^n |θj|, and we refer to it as the ℓ1-penalty.
Lemma 3.5. The hard-thresholding estimator θ̂n(hard) minimizes

    ∑_{j=1}^n (Yj − θj)² + λ²n #{θj ≠ 0}.
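Lemma 3.5 (and its ℓ1 counterpart, Lemma 3.6) can be checked coordinate by coordinate: for each j one minimizes (Yj − θ)² + λ²·1{θ ≠ 0}, respectively (Yj − θ)² + 2λ|θ|, over θ ∈ R. The brute-force grid search below (our own sketch, not from the text) confirms that the minimizers are exactly the hard and soft threshold rules.

```python
import numpy as np

lam = 0.5
# integer-based grid so that 0.0 and the test points below are hit exactly
grid = np.arange(-30000, 30001) / 10000.0

def argmin_penalized(y, pen):
    """Brute-force minimizer of (y - theta)^2 + pen(theta) over the grid."""
    crit = (y - grid) ** 2 + pen(grid)
    return grid[np.argmin(crit)]

for y in [-2.0, -0.4, 0.3, 0.7, 2.0]:
    hard = argmin_penalized(y, lambda t: lam**2 * (t != 0.0))  # ell_0 penalty
    soft = argmin_penalized(y, lambda t: 2 * lam * np.abs(t))  # ell_1 penalty
    print(y, hard, soft)
```

The ℓ0 minimizer is y itself when |y| > λ and 0 otherwise (hard thresholding at λ), while the ℓ1 minimizer is sign(y)·max(|y| − λ, 0) (soft thresholding at λ).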
Now, we will reprove Lemma 3.4, albeit with less economic constants. The idea is writing down a basic inequality, similar to (2.1), but now for the penalized case. The basic inequality with penalty takes the form

    ‖θ̂n − θ0‖²n ≤ [νn(θ̂n) − νn(θ∗)]/√n + ‖θ∗ − θ0‖²n + pen(θ∗) − pen(θ̂n).

Here, θ∗ is the best penalized approximation of θ0. It means that θ∗ is defined as in Lemma 3.2 with an appropriate choice of the threshold. The empirical process takes the simple form

    νn(θ) = 2√n ∑_{j=1}^n ε̃j θj.
We will bound the increments by

    |νn(θ) − νn(θ̃)| ≤ 2√n ∑_{j=1}^n |θj − θ̃j| max_{j=1,...,n} |ε̃j|.
Finally, we only consider the soft-thresholding estimator, that is, the penalty considered is the ℓ1-penalty

    pen(θ) = 2λn ∑_{j=1}^n |θj|.

For the regularization parameter λn, a value proportional to √(2 log n / n) can be taken (see Lemma 3.8). Lemma 3.7 then states that the estimator with ℓ1-penalty satisfies an oracle inequality, where the oracle concerns the ℓ0-penalty.
Lemma 3.7. Let θ̂n = θ̂n(soft). Let 0 < δ ≤ 1 be arbitrary. Set

    Vn(θ) = 16 λ²n #{θj ≠ 0}/δ.

On the set Ωn = {max_{1≤j≤n} |ε̃j| ≤ λn}, we have

    ‖θ̂n − θ0‖²n ≤ ((2 + δ)/(2 − δ)) min_θ { Vn(θ) + ‖θ − θ0‖²n }.
Then, on Ωn, writing N(θ∗) = #{θj,∗ ≠ 0},

    ‖θ̂n − θ0‖²n ≤ [νn(θ̂n) − νn(θ∗)]/√n + pen(θ∗) − pen(θ̂n) + ‖θ∗ − θ0‖²n
        = 2 ∑_{j=1}^n ε̃j (θ̂j,n − θj,∗) + pen(θ∗) − pen(θ̂n) + ‖θ∗ − θ0‖²n
        ≤ 2λn ∑_{j=1}^n |θ̂j,n − θj,∗| + pen(θ∗) − pen(θ̂n) + ‖θ∗ − θ0‖²n
        ≤ 4λn ∑_{j: θj,∗ ≠ 0} |θ̂j,n − θj,∗| + ‖θ∗ − θ0‖²n
        ≤ 4λn √(N(θ∗)) ‖θ̂n − θ∗‖n + ‖θ∗ − θ0‖²n
        ≤ 4λn √(N(θ∗)) ‖θ̂n − θ0‖n + 4λn √(N(θ∗)) ‖θ∗ − θ0‖n + ‖θ∗ − θ0‖²n.
Now, we proceed as in Lemma 2.1. Since for a and b non-negative, √(ab) ≤ (a + b)/2 (compare with the technical lemma at the end of Chapter 2), we have

    4λn √(N(θ∗)) ‖θ̂n − θ0‖n ≤ (δ/2) ‖θ̂n − θ0‖²n + 8λ²n N(θ∗)/δ,

and similarly for 4λn √(N(θ∗)) ‖θ∗ − θ0‖n. Hence

    ‖θ̂n − θ0‖²n ≤ (δ/2) ‖θ̂n − θ0‖²n + 16λ²n N(θ∗)/δ + (δ/2) ‖θ∗ − θ0‖²n + ‖θ∗ − θ0‖²n,

or

    (1 − δ/2) ‖θ̂n − θ0‖²n ≤ 16λ²n N(θ∗)/δ + (1 + δ/2) ‖θ∗ − θ0‖²n
        ≤ (1 + δ/2) { 16λ²n N(θ∗)/δ + ‖θ∗ − θ0‖²n }.

To conclude the proof, take θ∗ = arg min_θ { Vn(θ) + ‖θ − θ0‖²n }. □
We thus see that for λn ≍ √(2σ² log n/n), we arrive at the same oracle rates as in Lemmas 3.3 and 3.4, provided that P(maxj |ε̃j| > λn) → 0 for such a choice of λn. Indeed, this is shown to be okay in Lemma 3.8.
    = exp[λ²/2 − λa].

Take λ = a to arrive at

    P(Z > a) ≤ exp[−a²/2].

To prove the second assertion of the lemma, we note that

    P( max_{1≤j≤N} |Zj| > a ) ≤ 2N exp[−a²/2].

Take a ≥ 2√(log(2N)) to get

    P( max_{1≤j≤N} |Zj| > a ) ≤ exp[−a²/4].
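Lemma 3.8 is the source of the √(2 log n) factor in the threshold. A quick simulation (our own sketch) illustrates that the maximum of N standard Gaussians concentrates around √(2 log N), comfortably below the bound 2√(log(2N)) used above.

```python
import numpy as np

rng = np.random.default_rng(5)
N, reps = 10_000, 200
# max_j |Z_j| over N i.i.d. standard normals, repeated `reps` times
maxima = np.abs(rng.standard_normal((reps, N))).max(axis=1)

typical = np.sqrt(2 * np.log(N))      # where max_j |Z_j| concentrates
bound = 2 * np.sqrt(np.log(2 * N))    # level from the tail bound above
print(round(maxima.mean(), 2), round(typical, 2), round(bound, 2))
```

In 200 replications, the maximum essentially never exceeds the bound, matching the exp[−a²/4] tail estimate.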
When this aim is indeed reached, we loosely say that f̂n satisfies an oracle inequality. In fact, what (4.1) says is that f̂n behaves as the noiseless version f∗. That means, so to speak, that we overruled the variance of the noise.
In Section 4.1, we put our objectives of this chapter in the framework of Chapter 2. In particular, we recall there the definitions of estimation and approximation error. Section 4.2 calculates the estimation error when one employs least squares estimation, without penalty, over a finite model class. The estimation error turns out to behave as the log-cardinality of the model class. Section 4.3 shows that when considering a collection of nested finite models, a penalty pen(f) proportional to the log-cardinality of the smallest class containing f will indeed mimic the oracle over this collection of models. In Section 4.4, we consider general penalties. It turns out that the (local) entropy of the model classes plays a crucial role. Recall that in Chapter 3, the hard-thresholding estimator corresponds to a penalty proportional to the dimension (number of non-zero coefficients) of the model class. Indeed, the local entropy of a finite-dimensional space is proportional to its dimension. For a finite class, the entropy is (bounded by) its log-cardinality.
Whether or not (4.1) holds true depends on the choice of the penalty. In Section 4.4, we show that when the penalty is taken too small, there will appear an additional term showing that not all variance was killed. Section 4.5 presents an example.
Throughout this chapter, we assume the noise level σ > 0 to be known. In that case, by a rescaling argument, one can assume without loss of generality that σ = 1. In general, one needs a good estimate of an upper bound for σ, because the penalties considered in this chapter depend on the noise level. When one replaces the unknown noise level by an estimated upper bound, the penalty in fact becomes data dependent.
For f = ∑j θj ψj, orthonormality gives

    ‖f‖²n = ∑_{j=1}^n |θj|² := ‖θ‖²n.

Write f̂n(·, F) = f_{θ̂n(F)}, and likewise f∗ = f_{θ∗(F)}. Then, by Pythagoras' rule,

    ‖f̂n(·, F) − f0‖²n = ‖f̂n(·, F) − f∗(·, F)‖²n + ‖f∗(·, F) − f0‖²n.

The argument goes through for errors of Gaussian type. In fact, at the cost of additional, essentially technical, assumptions, an inequality of exponential type on the errors is sufficient as well (see van de Geer (2000)).
    ‖f̂n − f0‖²n ≤ δ‖f̂n − f0‖²n + 4 log |F|/(nδ) + 4t²/δ + (1 + δ)‖f∗ − f0‖²n.
    ∑_{j=0}^∞ exp[2^{j+1} − (2^{j+2} + nt²)] = ∑_{j=0}^∞ exp[−(2^{j+1} + nt²)]
        ≤ ∑_{j=0}^∞ exp[−(j + 1 + nt²)] ≤ ∫_0^∞ exp[−(x + nt²)] dx = exp[−nt²].
But if

    ∑_{i=1}^n εi (f̂n(xi) − f∗(xi))/n ≤ (8 log |F(f̂n)|/n + 2t²)^{1/2} ‖f̂n − f∗‖n,

the basic inequality gives

    ‖f̂n − f0‖²n ≤ 2 (8 log |F(f̂n)|/n + 2t²)^{1/2} ‖f̂n − f∗‖n + ‖f∗ − f0‖²n + pen(f∗) − pen(f̂n)
        ≤ δ‖f̂n − f0‖²n + [16 log |F(f̂n)|/(nδ) − pen(f̂n)] + 4t²/δ + (1 + δ)‖f∗ − f0‖²n + pen(f∗)
        = δ‖f̂n − f0‖²n + 4t²/δ + (1 + δ)‖f∗ − f0‖²n + pen(f∗),

by the definition of pen(f).
The collection {H(u, F, ‖·‖n) = log N(u, F, ‖·‖n) : u > 0} is called the entropy of F (for the metric induced by the norm ‖·‖n). Here, N(u, F, ‖·‖n) denotes the minimal number of balls with radius u necessary to cover F.
Recall the definition of the estimator

    f̂n = arg min_{f ∈ F} { (1/n) ∑_{i=1}^n |Yi − f(xi)|² + pen(f) }.
We moreover define

$$\mathcal F(t) = \big\{f\in\mathcal F:\ \|f - f^*\|_n^2 + \mathrm{pen}(f)\le t^2\big\},\quad t>0.$$

Consider the entropy $H(\cdot,\mathcal F(t),\|\cdot\|_n)$ of $\mathcal F(t)$. Suppose it is finite for each $t$, and in
fact that the square root of the entropy is integrable, i.e., that for some continuous
upper bound $\bar H(\cdot,\mathcal F(t),\|\cdot\|_n)$ of $H(\cdot,\mathcal F(t),\|\cdot\|_n)$, one has

$$\Psi(t) = \int_0^t\sqrt{\bar H(u,\mathcal F(t),\|\cdot\|_n)}\,du < \infty,\quad t > 0. \tag{4.2}$$
Lemma 4.3. Suppose that $\Psi(t)/t^2$ does not increase as $t$ increases. There exist
constants $c$ and $c'$ such that for

$$\sqrt n\,t_n^2 \ge c\big(\Psi(t_n)\vee t_n\big), \tag{4.3}$$

we have

$$E\big\{\|\hat f_n - f_0\|_n^2 + \mathrm{pen}(\hat f_n)\big\} \le 2\big\{\|f^* - f_0\|_n^2 + \mathrm{pen}(f^*)\big\} + t_n^2 + \frac{c'}{n}.$$
Lemma 4.3 is from van de Geer (2001). Comparing it to, e.g., Lemma 4.2,
one sees that there is no arbitrary $0 < \alpha < 1$ involved in the statement of Lemma
4.3. This is just because van de Geer (2001) has fixed $\alpha$ at $\alpha = 1/3$ for simplicity.
When $\Psi(t)/t^2 \le \sqrt n/C$ for all $t$, for some constant $C$, condition (4.3) is
fulfilled if $t_n \ge cn^{-1/2}$ and, in addition, $C \ge c$. Thus, by choosing the penalty
carefully, one can indeed ensure that the variance is overruled.
so that for $f\in\mathcal F(t)$ one has $\|f - f^*\|_n\le t$ and $I(f)\le(t/\lambda)^{2/p}$. It is now not difficult to show that for some constant $C_1$,

$$H(u,\mathcal F(t),\|\cdot\|_n) \le C_1\Big[\Big(\frac{t}{\lambda}\Big)^{\frac{2}{ps}}u^{-\frac1s} + \log\Big(\frac{t}{(\lambda\wedge1)u}\Big)\Big],\quad 0 < u < t.$$
Corollary 4.5. By applying Lemma 4.3, we find that for some constant $c_1$,

$$E\big\{\|\hat f_n - f_0\|_n^2 + \lambda^2I^p(\hat f_n)\big\} \le 2\min_f\big\{\|f - f_0\|_n^2 + \lambda^2I^p(f)\big\}
+ c_1\Big[\Big(\frac{1}{n\lambda^{2/(ps)}}\Big)^{\frac{2ps}{2ps+p-2}} + \frac{\log(1/\lambda)}{n}\Big].$$
4.5.2. Overruling the variance in this case. For choosing the smoothing parameter
$\lambda$, the above suggests the penalty

$$\mathrm{pen}(f) = \min_\lambda\Big\{\lambda^2I^p(f) + C_0\Big(\frac{1}{n\lambda^{2/(ps)}}\Big)^{\frac{2ps}{2ps+p-2}}\Big\},$$

with $C_0$ a suitable constant. The minimization over $\lambda$ within this penalty yields

$$\mathrm{pen}(f) = C_0'\,n^{-\frac{2s}{2s+1}}I^{\frac{2}{2s+1}}(f),$$
where $C_0'$ depends on $C_0$ and $s$. From the computational point of view (in particular,
when $p = 2$), it may be convenient to carry out the penalized least squares
as in the previous subsection, for all values of $\lambda$, yielding the estimators

$$\hat f_n(\cdot,\lambda) = \arg\min_f\Big\{\frac1n\sum_{i=1}^n|Y_i - f(x_i)|^2 + \lambda^2I^p(f)\Big\}.$$
Corollary 4.6. For an appropriate, large enough, choice of $C_0$ (or $C_0'$), depending
on $c$, $p$ and $s$, we have, for a constant $c_0$ depending on $c$, $c'$, $C_0$ ($C_0'$), $p$ and $s$,

$$E\big\{\|\hat f_n - f_0\|_n^2 + C_0'\,n^{-\frac{2s}{2s+1}}I^{\frac{2}{2s+1}}(\hat f_n)\big\}
\le 2\min_f\big\{\|f - f_0\|_n^2 + C_0'\,n^{-\frac{2s}{2s+1}}I^{\frac{2}{2s+1}}(f)\big\} + \frac{c_0}{n}.$$
Thus, the estimator adapts to small values of $I(f_0)$. For example, when $s = 1$
and $I(f_0) = 0$ (i.e., when $f_0$ is the constant function), the excess risk of the
estimator converges with the parametric rate $1/n$. If we knew that $f_0$ is constant, we
would of course use the average $\sum_{i=1}^nY_i/n$ as estimator. Thus, this penalized estimator
mimics an oracle.
5. The $\ell_1$-penalty
Generally, using almost exactly the same arguments, the results of the previous
chapter for least squares estimation can be extended to other M-estimation procedures,
provided the margin behavior is known. However, when the margin behavior
is not known (i.e., when the parameters $c_2$ and especially $\kappa$ in the margin condition
of Section 2.5 are not known), a simple extension is no longer possible. The reason
is that the estimation error depends on this margin behavior. Overruling the variance
is thus not straightforward, as this variance is not known. In this chapter,
we propose an $\ell_1$-penalty (i.e., a soft-thresholding type penalty) to overcome the
margin problem.
Let $\mathcal F\subset\bar{\mathcal F}$ be a class of functions on $\mathcal X$. We assume that each $f\in\mathcal F$ can be
written as a linear combination

$$f(x) = \sum_{j=1}^m\theta_j\psi_j(x),\quad x\in\mathcal X.$$

Here, $\{\psi_j\}_{j=1}^m$ is a given system of $m$ functions on $\mathcal X$, and $\theta = (\theta_1,\ldots,\theta_m)'$ is an
(in whole or in part) unknown vector. We consider the situation where the number
of parameters $m$ is large, possibly larger than the number of observations $n$.
The $\ell_1$-penalty on $f = \sum_{j=1}^m\theta_j\psi_j$ is

$$\mathrm{pen}(f) = \lambda_n\sum_{j=1}^m|\theta_j|, \tag{5.1}$$

and

$$f_0 = \arg\min_{f\in\bar{\mathcal F}}R(f).$$
Note that our notation in this chapter differs somewhat from the previous chapter,
in the sense that in the definition of $\hat f_n$, we minimize over $f\in\mathcal F$, instead of $f\in\bar{\mathcal F}$.
(It can be brought in line with that notation by taking $\mathrm{pen}(f) = \infty$ for $f\in\bar{\mathcal F}\setminus\mathcal F$.)
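In this linear setup, the $\ell_1$-penalized least squares estimator can be computed by coordinate descent with soft thresholding. The following is a minimal sketch, not the text's algorithm: `Psi` stands for the $n\times m$ matrix of values $\psi_j(x_i)$, and the factor 2 in front of the penalty is a normalization convention of this sketch only.

```python
import numpy as np

def soft_threshold(z, lam):
    # scalar soft-thresholding map associated with the l1-penalty
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def l1_penalized_ls(Psi, y, lam, n_iter=100):
    """Coordinate descent for (1/n)||y - Psi @ theta||^2 + 2*lam*sum_j |theta_j|."""
    n, m = Psi.shape
    theta = np.zeros(m)
    col_norm2 = (Psi ** 2).sum(axis=0) / n    # ||psi_j||_n^2
    r = y - Psi @ theta
    for _ in range(n_iter):
        for j in range(m):
            r = r + Psi[:, j] * theta[j]      # remove coordinate j from the fit
            rho = Psi[:, j] @ r / n           # correlation of psi_j with residual
            theta[j] = soft_threshold(rho, lam) / col_norm2[j]
            r = r - Psi[:, j] * theta[j]      # restore coordinate j
    return theta

# with an orthonormal design the estimator reduces to soft thresholding
Psi = 2.0 * np.eye(4)[:, :2]                  # two columns with ||psi_j||_n^2 = 1
y = np.array([3.0, 0.5, 0.0, 0.0])
theta = l1_penalized_ls(Psi, y, lam=0.5)
assert np.allclose(theta, [1.0, 0.0])
```

The orthonormal special case shows the "soft-thresholding type" character of the penalty: small coefficients are set exactly to zero, large ones are shrunk by $\lambda$.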
The empirical process, indexed by some subset $\mathcal F_0$ of $\bar{\mathcal F}$, is

$$\{\nu_n(f) = \sqrt n(P_n(f) - P(f)):\ f\in\mathcal F_0\}.$$
Throughout, we fix some (arbitrary) $f^*\in\mathcal F$. In the results (Lemmas 5.1, 5.2 and
5.4) one may choose $f^*$ as being an (almost) oracle. Because the exact definition
of this (almost) oracle is somewhat involved, we leave the choice of $f^*$ open in our
general exposition. One can see a particular choice explained after the statement
of the result for general $f^*$ (see below Lemma 5.1).
We use the notation

$$\mathcal E_n = \big\{[\nu_n(\hat f_n) - \nu_n(f^*)]\le\sqrt n\,[\mathrm{pen}(\hat f_n - f^*) + \lambda_n^2]\big\}. \tag{5.3}$$
The set $\mathcal E_n$ will play exactly the same role as in Lemma 3.7.
Recall that in this chapter, $\mathrm{pen}(f) = \lambda_nI(\theta)$. We will choose the smoothing
parameter $\lambda_n$ in such a way that (for all $f^*\in\mathcal F$) the probability of the set $\mathcal E_n$ is large.
The metric $d$ is taken as the one induced by the $L_2(Q_n)$-norm in Section 5.1
(robust regression with fixed design). In Section 5.2, $\mathcal F$ is a class of densities with
respect to a $\sigma$-finite measure $\mu$, and the metric $d$ will be the one induced by the
$L_2(\mu)$-norm. Section 5.3 takes the metric $d$ induced by the $L_1(Q)$-norm. In general,
for some measure $\nu$ on $\mathcal X$, and $1\le p<\infty$, we denote the $L_p(\nu)$-norm by

$$\|f\|_{p,\nu} = \Big(\int|f|^p\,d\nu\Big)^{1/p},\quad f\in L_p(\nu),$$

and we write

$$\|\cdot\| = \|\cdot\|_{2,Q},\quad \|\cdot\|_n = \|\cdot\|_{2,Q_n}.$$
The value of $\kappa$ and $c_2$ is not assumed to be known, although in Section 5.2
(density estimation), $\kappa$ is shown to be generally equal to 2. The latter means that
in density estimation, we get oracle inequalities very similar to those for least
squares estimators in regression. The situation is then completely analogous to
the one of Lemma 3.7.
The $\ell_1$-penalty allows one to adapt to other margin behavior as well (i.e., to
the situation where the excess risk is bounded from below by some more general
increasing function of the appropriate metric).
The proofs of the oracle inequalities all follow from one single argument,
which was basically already used in Lemma 3.7. We give this argument in Section
5.4. It makes use of an assumed relation between the $\ell_1$-norm $I(\theta) = \sum_{j=1}^m|\theta_j|$
and the metric $d$ in the margin condition.
Condition on the metrics. For some constant $c_1$, $\gamma\ge0$ and $0<\alpha\le1$, we have
for all $J\subset\{1,\ldots,m\}$,

$$\sum_{j\in J}|\theta_j - \tilde\theta_j| \le c_1|J|^{\gamma}\,d^{\alpha}(f_\theta,f_{\tilde\theta}),\quad f_\theta,f_{\tilde\theta}\in\mathcal F.$$

In all Sections 5.1, 5.2 and 5.3, the condition on the metrics is met with $\gamma = 1/2$.
Sections 5.1 and 5.2 have $\alpha = 1$, and Section 5.3 has $\alpha = 1/2$.
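For the case $\gamma = 1/2$, $\alpha = 1$, the condition follows from Cauchy–Schwarz combined with a smallest-eigenvalue bound on the Gram matrix of the system. A small numerical check (the simulated system below is arbitrary, not data from the text; the symbol names mirror the discussion):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 50
Psi = rng.standard_normal((n, m))            # values psi_j(x_i)
Sigma = Psi.T @ Psi / n                      # Gram matrix of the system
sigma_min = np.sqrt(np.linalg.eigvalsh(Sigma)[0])  # sqrt of smallest eigenvalue

theta = rng.standard_normal(m)
theta_t = rng.standard_normal(m)
diff = theta - theta_t
d = np.sqrt(diff @ Sigma @ diff)             # = ||f_theta - f_theta~||_n

J = [0, 2, 3]
lhs = np.abs(diff[J]).sum()
# Cauchy-Schwarz: sum_{j in J}|diff_j| <= |J|^{1/2} * ||diff||_2,
# and the eigenvalue bound gives ||diff||_2 <= d / sigma_min,
# i.e. the condition with gamma = 1/2, alpha = 1 and c1 = 1/sigma_min.
assert lhs <= np.sqrt(len(J)) * np.linalg.norm(diff) + 1e-12
assert np.linalg.norm(diff) <= d / sigma_min + 1e-12
```

Both inequalities are mathematical identities of linear algebra, so the check succeeds for any simulated system with a positive definite Gram matrix.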
where $K$ is finite (possibly depending on $n$). We need this condition to handle the
empirical process part of the problem.
Throughout this section, we assume that $\rho$ is Lipschitz:

$$|\rho(z_1) - \rho(z_2)| \le |z_1 - z_2|,\quad z_1,z_2\in\mathbf R.$$

This is related to a certain robustness of the estimator. We call the estimator $\hat f_n$ of
this section the $\ell_1$-penalized robust regression estimator. The Lipschitz assumption
also makes it possible to apply the contraction principle (see Section 6.3). This
means there is enough machinery to obtain a good bound for the empirical process.
the excess risk of the estimator $\hat f_n$ satisfies an inequality involving the estimation
error $V_n(f^*)$ and the approximation error $R(f^*) - R(f_0)$.
Lemma 5.1. Assume the margin condition, and the positive eigenvalue condition
on the system $\{\psi_j\}$. Then on the set $\mathcal E_n$ we have

$$R(\hat f_n) - R(f_0) \le \Big(\frac{1+\delta}{1-\delta}\Big)\big\{V_n(f^*) + R(f^*) - R(f_0) + \lambda_n^2\big\}.$$

Proof. Use the positive eigenvalue condition on the system $\{\psi_j\}$ to see that the
condition on the metrics holds with $d(f,\tilde f) = \|f - \tilde f\|_n$, $\gamma = 1/2$, and $\alpha = 1$. Now
invoke the margin condition and apply Lemma 5.5.
(where we assume for simplicity that the minimum is attained). This choice balances
the estimation error $V_n(f^*)$ and the approximation error $R(f^*) - R(f_0)$, so that
one arrives at an oracle inequality on $\mathcal E_n$. It can be shown that for any $f^*$, the
set $\mathcal E_n$ has large probability for the choice $\lambda_n = c\sqrt{\log n/n}$, with $c$ a large enough
constant depending only on $K$. This can be done using the tools from Section 6
(the concentration inequality of Theorem 6.1, symmetrization (Theorem 6.3), the
contraction principle (Theorem 6.4), and the peeling device described in Section
6.4). The details are in Loubes and van de Geer (2002) and van de Geer (2003).
Thus, a good choice for $\lambda_n$ does not require knowledge of the constants $\kappa$ or $c_2$
appearing in the margin condition. Since the estimation error $V_n(f^*)$ does depend
on these constants, one may conclude that the $\ell_1$-penalty yields a trade-off between
estimation error and approximation error, without (directly) estimating the
estimation error.
Exercise 5.2. Suppose that for all integers $N$,

$$\min_{f\in\mathcal F,\ N(f)\le N}\big[R(f) - R(f_0)\big] \le c_3N^{-s}.$$

Here $s > 0$ is a smoothness parameter. Verify that the trade-off between estimation
error and approximation error gives

$$R(\hat f_n) - R(f_0) \le c_4\,n^{-\frac{s\kappa}{2(\kappa-\alpha)s+\kappa}}.$$

The constant $c_4$ is determined by the values of $c_1$, $c_2$ and $c_3$ and those of $\kappa$, $\alpha$ and $s$.
Take

$$\mathcal F = \Big\{f:\ \int f\,d\mu = 0,\ \int e^f\,d\mu < \infty\Big\}.$$

Define for $f\in\mathcal F$,

$$p_f = \exp[f - b(f)],\quad b(f) = \log\int e^f\,d\mu.$$

Define now

$$\|f\|_{2,\mu}^2 = \int f^2\,d\mu.$$
In fact, the value = 2 is quite natural in this case, as can be seen in the
next exercise.
Exercise 5.4. Let {pf : f F } be a class of densities, with respect to Lebesgue
measure on [0, 1], and suppose that p0 1 is the density of the uniform distribu-
tion. Note that f0 = 0 and b(f0 ) = 0 in this case. Suppose that supf F |f | K.
Show that
R(f ) R(f0 ) = f 2 p1 d [ f p1 d]2 ,
where
p1 = exp[f1 b(f1 )],
and |f1 | |f |. Moreover
f p1 d = f f1 p2 d f p2 d f1 p2 d,
where
p2 = exp[f2 b(f2 )],
and |f2 | |f1 |. Conclude that
R(f ) R(f0 ) e2K f 2 d 2e4K ( f 2 d)2 .
The margin condition is thus met with = 2 when F satises the requirement
that
f
2, is suciently small for all f F. (Convexity arguments can remove
this requirement on F .)
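The computation in Exercise 5.4 can be checked numerically. In the sketch below, $f(x) = \varepsilon\cos(2\pi x)$ is an illustrative choice of test function (so $\int f\,d\mu = 0$), integrals over $\mu$ are approximated by uniform-grid averages, and the excess risk is taken as $b(f) = \log\int e^f\,d\mu$, which is what it reduces to here since $p_0\equiv1$.

```python
import numpy as np

# midpoints of a fine uniform grid on [0,1]; integrals = grid averages
x = (np.arange(10000) + 0.5) / 10000
eps, K = 0.1, 0.1
f = eps * np.cos(2 * np.pi * x)              # int f dmu = 0, |f| <= K

b = np.log(np.mean(np.exp(f)))               # b(f) = log int e^f dmu
f2 = np.mean(f ** 2)                         # int f^2 dmu
excess = b                                   # R(f) - R(f_0) = b(f) when p_0 = 1

# the lower bound from the exercise (second-order / margin behavior)
lower = 0.5 * np.exp(-2 * K) * f2 - 0.5 * np.exp(4 * K) * f2 ** 2
assert excess >= lower
# for small f, the excess risk is close to (1/2) int f^2 dmu, i.e. kappa = 2
assert abs(excess - 0.5 * f2) < 0.1 * f2
```

The second assertion illustrates why $\kappa = 2$ is natural here: the excess risk behaves quadratically in $\|f\|_{2,\mu}$ near $f_0 = 0$.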
Let $\sigma_{\min}^2$ be the smallest eigenvalue of $\Sigma$. Assume that for some constant $0 < c_1 < \infty$,

$$\sigma_{\min} \ge 1/c_1.$$

Now, we proceed exactly as in the previous section. Let, for $f = \sum_{j=1}^m\theta_j\psi_j$,

$$N(f) = \#\{\theta_j\ne0\},$$
Lemma 5.2. Assume the margin condition, and the positive eigenvalue condition
on the system $\{\psi_j\}$. Then on $\mathcal E_n$ we have

$$R(\hat f_n) - R(f_0) \le \Big(\frac{1+\delta}{1-\delta}\Big)\big\{V_n(f^*) + R(f^*) - R(f_0) + \lambda_n^2\big\}.$$

Proof. This is again a straightforward application of Lemma 5.5, as in the proof of
Lemma 5.1.
Proof. This follows from Bernstein's inequality (Bernstein (1924), Bennett (1962)),
which says that for each $a > 0$,

$$P\big(|P_n(\psi_j) - P(\psi_j)| > a\big) \le 2\exp\Big[-\frac{na^2}{2(a|\psi_j|_\infty + \sigma_j^2)}\Big],$$

where $\sigma_j^2 = P|\psi_j|^2 - |P\psi_j|^2$.
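A quick Monte Carlo illustration of a Bernstein-type bound of this shape, with $\psi$ a Rademacher variable (so $|\psi|_\infty = 1$, $\sigma^2 = 1$, $P\psi = 0$); the simulation setup is illustrative, not from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, a = 100, 20000, 0.3
# reps independent empirical means of n Rademacher variables
means = rng.choice([-1.0, 1.0], size=(reps, n)).mean(axis=1)
emp_tail = np.mean(np.abs(means) > a)        # Monte Carlo tail probability

# Bernstein-type bound with |psi|_inf = 1 and sigma^2 = 1
bound = 2 * np.exp(-n * a ** 2 / (2 * (a * 1.0 + 1.0)))
assert emp_tail <= bound
```

Here the bound evaluates to roughly 0.06, while the true tail probability is of order $10^{-3}$; the inequality is loose but of the right exponential shape in $na^2$.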
In view of Lemma 5.3, we now know that for sequences of sets $\mathcal E_n$ with
$\lambda_n = 3c_0\sqrt{\log m/n}$, and with $m = m_n\to\infty$ as $n\to\infty$, we have

$$\lim_{n\to\infty}P(\mathcal E_n) = 1.$$
Exercise 5.5. Verify that the SVM loss is consistent, in the sense that Bayes' rule $f_0$
satisfies

$$f_0 = \arg\min_{\text{all }f}R(f).$$
To handle the empirical process, note first that $l(z) = (1-z)_+$ is Lipschitz.
This allows us to again apply the contraction principle of Section 6.3. It means that
we can invoke the tools of Chapter 6 (in particular, the concentration inequality
of Theorem 6.2, symmetrization, the contraction principle and the peeling device)
to handle the empirical process part of the problem.
Margin condition. For some constants $\kappa\ge1$ and $c_2$, we have for all $f\in\mathcal F$,

$$R(f) - R(f_0) \ge \|f - f_0\|_{1,Q}^{\kappa}/c_2.$$
Note that we assumed the margin condition with an $L_1$-norm, instead of the
$L_2$-norm. It turns out that under some mild assumptions on $\eta_0$ and $Q$, the
margin condition is indeed met with the $L_1(Q)$-norm.
Positive eigenvalue condition on the system $\{\psi_j\}$. Let

$$\Sigma = \int\psi\psi'\,dQ,$$

where $\psi = (\psi_1,\ldots,\psi_m)'$.
and

$$\mathrm{pen}_2(f) = \lambda_n\sum_{j\notin J}|\theta_j|.$$

Then

$$\mathrm{pen}(f) = \mathrm{pen}_1(f) + \mathrm{pen}_2(f),$$

and one easily sees

$$\mathrm{pen}(\hat f_n - f^*) = \mathrm{pen}_1(\hat f_n - f^*) + \mathrm{pen}_2(\hat f_n)$$

and

$$\mathrm{pen}(f^*) - \mathrm{pen}(\hat f_n) \le \mathrm{pen}_1(\hat f_n - f^*) - \mathrm{pen}_2(\hat f_n).$$
So on $\mathcal E_n$,
We review, in Sections 6.1–6.4, some general results for the empirical process

$$\{\sqrt n(P_n - P)(\gamma):\ \gamma\in\Gamma\}.$$

In Section 6.5, we apply the results of Sections 6.1–6.4 to arrive at a uniform
probability bound for the case where the empirical process is indexed by a class
$\{\gamma_f = \gamma\circ f:\ f\in\mathcal F\}$, with $\gamma:\mathbf R\to\mathbf R$ and $\mathcal F$ a subset of a collection of linear
functions $\{f_\theta = \sum_j\theta_j\psi_j\}$. There, we moreover assume the Lipschitz condition
Each term in the sum on the right-hand side can be studied using results
for the non-weighted empirical process. Because the index set $\mathcal F$ is peeled into
annuli $\{f\in\mathcal F:\ 2^{j-1}t < d(f)\le2^jt\}$, this method for obtaining probability bounds
is sometimes referred to as the peeling device (van de Geer (2000)). We apply
the peeling device in the proof of Lemma 6.4. It is also handy for obtaining the
modulus of continuity of the empirical process indexed by functions satisfying an
entropy bound (see Section 6.6).
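The arithmetic behind the peeling device can be sketched as follows. Assume (hypothetically) a tail bound of the form $\exp(-CM^2)$ for the process at each fixed level $M$; peeling into dyadic annuli and applying the union bound then yields a rapidly convergent sum dominated by its first term. The function and constants below are illustrative, not from the text.

```python
import numpy as np

def peeled_bound(t, C, j_max=60):
    """Union bound over annuli with levels 2^j * t, j = 1..j_max, when each
    fixed level M contributes a tail probability exp(-C * M^2)."""
    levels = t * 2.0 ** np.arange(1, j_max + 1)
    return np.exp(-C * levels ** 2).sum()

t, C = 1.0, 0.5
total = peeled_bound(t, C)
# the geometric growth of the levels makes the first annulus dominate
assert total <= 2 * np.exp(-C * (2 * t) ** 2)
```

With $t = 1$, $C = 1/2$, the first annulus contributes $e^{-2}\approx0.135$ and the whole sum is below $2e^{-2}$, which is the typical shape of the bounds obtained by peeling.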
6.5. The case of a Lipschitz transformation of a linear space
Let $\gamma:\mathbf R\to\mathbf R$ be a Lipschitz function, i.e.,

$$|\gamma(z_1) - \gamma(z_2)| \le |z_1 - z_2|,\quad z_1,z_2\in\mathbf R.$$

Let $\mathcal F$ be a collection of functions on $\mathcal X$ and define

$$\gamma_f = \gamma\circ f,\quad f\in\mathcal F.$$

Recall the notation for the empirical process

$$\nu_n(f) = \sqrt n(P_n - P)(f),\quad f\in\mathcal F.$$
Regression and classification example. In regression and classification, we replace
$X$ with values $x\in\mathcal X$ by the pair $(X,Y)$ with values $(x,y)\in\mathcal X\times\mathbf R$. In robust
regression, the loss function is $\rho(y - f(x))$, with $\rho$ a given Lipschitz function.
In binary classification, one has $y\in\{\pm1\}$ and the SVM loss is $l(yf(x))$, where
$l(z) = (1-z)_+$ is the hinge loss, which is clearly also Lipschitz.
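That the hinge loss is 1-Lipschitz, as the contraction principle requires, is elementary; here is a quick numerical sanity check (the sampled points are arbitrary):

```python
import numpy as np

def hinge(z):
    # hinge loss l(z) = (1 - z)_+
    return np.maximum(1.0 - z, 0.0)

rng = np.random.default_rng(2)
z1, z2 = rng.standard_normal(1000), rng.standard_normal(1000)
# 1-Lipschitz property: |l(z1) - l(z2)| <= |z1 - z2|
assert np.all(np.abs(hinge(z1) - hinge(z2)) <= np.abs(z1 - z2) + 1e-12)
```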
Suppose now that $\mathcal F$ is a bounded subset of a collection of linear functions

$$\bar{\mathcal F} = \Big\{f_\theta = \sum_{j=1}^n\theta_j\psi_j:\ |f_\theta|_\infty\le K_0/2\Big\},$$
of $Z_M$ from its mean $EZ_M$. The next task is to obtain an upper bound for $EZ_M$. This
is done in Lemmas 6.2 and 6.3. The combination of Lemmas 6.1–6.3 yields the
probability inequality (6.1) (below Lemma 6.3) for $Z_M$ with $M$ fixed. Lemma 6.4
finally uses this result and the peeling device of Section 6.4 to derive a probability
bound for the weighted empirical process.
We use the notation

$$K_0\vee M = \max(K_0,M),\quad K_0\wedge M = \min(K_0,M).$$
Lemma 6.1. Assume the normalization condition on the system $\{\psi_j\}$. For all $M$
satisfying

$$K_0\sqrt{\frac{\log n}{n}} \le M \le K_0\sqrt{\frac{n}{\log n}},$$

we have

$$P\Big(Z_M \ge 2EZ_M + 36K_0M\sqrt{\frac{\log n}{n}}\Big) \le 2\exp[-(K_0\vee M)^2\log n].$$
Use the symmetrization inequality (Theorem 6.3) and then the contraction principle
(Theorem 6.4). Then we get

$$EZ_M \le 2E\Big(\sup_{f\in\mathcal F_M}\Big|\frac1n\sum_{i=1}^n\varepsilon_i\big(\gamma_f(X_i) - \gamma_{f^*}(X_i)\big)\Big|\Big)
\le 4E\Big(\sup_{f\in\mathcal F_M}\Big|\frac1n\sum_{i=1}^n\varepsilon_i\big(f(X_i) - f^*(X_i)\big)\Big|\Big).$$
But obviously,

$$E\Big(\sup_{f\in\mathcal F_M}\Big|\frac1n\sum_{i=1}^n\varepsilon_i\big(f(X_i) - f^*(X_i)\big)\Big|\Big)
\le M\,E\Big(\max_{1\le j\le n}\Big|\frac1n\sum_{i=1}^n\varepsilon_i\psi_j(x_i)\Big|\Big).$$
Proof. Define

$$Z = \max_{1\le j\le n}\Big|\frac1n\sum_{i=1}^n\varepsilon_i\psi_j(x_i)\Big|\Big/\Big(3\sqrt{\frac{\log n}{n}}\Big).$$
The combination of Lemmas 6.1, 6.2 and 6.3 yields that, under the normalization
condition on the system $\{\psi_j\}$,

$$P\Big(Z_M \ge 228K_0^2M\sqrt{\frac{\log n}{n}}\Big) \le 2\exp[-(K_0\vee M)^2\log n], \tag{6.1}$$

for $K_0\sqrt{\log n/n} \le M \le K_0\sqrt{n/\log n}$.
We now invoke the technique of Section 6.4 to study the weighted empirical
process

$$\frac{\nu_n(\gamma_f) - \nu_n(\gamma_{f^*})}{I(\theta - \theta^*) + 456K_0^2\sqrt{\log n/n}},\quad f\in\mathcal F.$$

Lemma 6.4. Define $\lambda_n = 456K_0^2\sqrt{\log n/n}$. Under the normalization condition on
the system $\{\psi_j\}$ we have

$$P\Big(\sup_{f\in\mathcal F}\frac{|\nu_n(\gamma_f) - \nu_n(\gamma_{f^*})|}{I(\theta - \theta^*) + \lambda_n} \ge \sqrt n\,\lambda_n\Big) \le 3\exp[-K_0^2\log n/2].$$
Proof. To simplify the notation, we assume that $f^*\equiv0$. We split the class $\mathcal F$ into
three subclasses, namely

$$\mathcal F(I) = \Big\{f:\ I(\theta)\le K_0\sqrt{\frac{\log n}{n}}\Big\},$$
$$\mathcal F(II) = \Big\{f:\ K_0\sqrt{\frac{\log n}{n}} < I(\theta)\le K_0\Big\},$$

and

$$\mathcal F(III) = \{f:\ I(\theta) > K_0\}.$$

Then

$$P\Big(\sup_{f\in\mathcal F}\frac{|\nu_n(\gamma_f)|}{I(\theta)+\lambda_n}\ge\sqrt n\lambda_n\Big)
\le P\Big(\sup_{f\in\mathcal F(I)}\frac{|\nu_n(\gamma_f)|}{I(\theta)+\lambda_n}\ge\sqrt n\lambda_n\Big)
+ P\Big(\sup_{f\in\mathcal F(II)}\frac{|\nu_n(\gamma_f)|}{I(\theta)+\lambda_n}\ge\sqrt n\lambda_n\Big)
+ P\Big(\sup_{f\in\mathcal F(III)}\frac{|\nu_n(\gamma_f)|}{I(\theta)+\lambda_n}\ge\sqrt n\lambda_n\Big)
= P_I + P_{II} + P_{III}.$$

Apply (6.1) to $P_I$ to see that

$$P_I \le P\Big(\sup_{f\in\mathcal F(I)}|\nu_n(\gamma_f)| \ge \sqrt n\lambda_n^2\Big)
= P\Big(Z_{K_0\sqrt{\log n/n}} \ge (456)^2K_0^4\frac{\log n}{n}\Big)
\le P\Big(Z_{K_0\sqrt{\log n/n}} \ge 228K_0^3\frac{\log n}{n}\Big)
\le 2\exp[-K_0^2\log n].$$

Now,

$$\mathcal F(II) \subset \bigcup_{j=0}^{j_0}\{f\in\mathcal F:\ M_{j+1} < I(\theta)\le M_j\},$$

where $M_j = 2^{-j}K_0$, $j = 0,\ldots,j_0$, and where $2^{j_0}\le\sqrt{n/\log n}$. Thus

$$P_{II} \le \sum_{j=0}^{j_0}P\big(Z_{M_j}\ge\lambda_nM_j/2\big) \le \sum_{j=0}^{j_0}2\exp[-K_0^2\log n] \le \exp[-K_0^2\log n/2].$$

Finally,

$$\mathcal F(III) = \bigcup_{j=1}^\infty\{f\in\mathcal F:\ M_{j-1} < I(\theta)\le M_j\},$$

where $M_j = 2^jK_0$, $j = 0,1,2,\ldots$. So

$$P_{III} \le \sum_{j=1}^\infty P\big(Z_{M_j}\ge\lambda_nM_j/2\big) \le \sum_{j=1}^\infty 2\exp[-M_j^2\log n] \le \exp[-K_0^2\log n].$$
Let us conclude this section by putting the result of Lemma 6.4 in the context
of Chapter 5. Following the notation of that chapter, let

$$\mathcal E_n = \big\{[\nu_n(\hat f_n) - \nu_n(f^*)] \le \sqrt n\,[\mathrm{pen}(\hat f_n - f^*) + \lambda_n^2]\big\},$$

where $\hat f_n$ is any random function in $\mathcal F$, $\lambda_n = 456K_0^2\sqrt{\log n/n}$ and $\mathrm{pen}(f) = \lambda_nI(\theta)$. Then by Lemma 6.4, under the normalization condition on the system
$\{\psi_j\}$, one has

$$P(\mathcal E_n) \ge 1 - 3\exp[-K_0^2\log n/2].$$

This, in combination with the result of Lemma 5.4, completes, for the case $m = n$,
the proof of adaptivity of the $\ell_1$-penalized SVM loss estimator. (When $m$ is larger
than $n$, say $m = n^D$, the result goes through with a suitably adjusted normalization
condition on the system $\{\psi_j\}$, or alternatively, with suitably adjusted constants
depending on $D$.)
The result of Lemma 5.1, for the $\ell_1$-penalized robust regression estimator, can
be completed analogously. It suffices to replace, in the arguments of this section,
the concentration inequality of Theorem 6.2 by the one of Theorem 6.1.
6.6. Modulus of continuity of the empirical process
Let F be a class of functions on X .
Definition. Let u > 0. The u-entropy H(u, F , d) (for the metric d) is the logarithm
of the minimum number of balls with radius u necessary to cover F .
We define

$$\|f\|^2 = P|f|^2 = \sum_{i=1}^nEf^2(X_i)/n.$$

Let moreover

$$|f|_\infty = \sup_{x\in\mathcal X}|f(x)|,$$

and we suppose that the entropy of $\mathcal F$ for the metric induced by $|\cdot|_\infty$ is finite. We
denote this entropy by $H_\infty(\cdot,\mathcal F)$.
Consider again a transformation

$$\gamma_f = \gamma\circ f,\quad f\in\mathcal F,$$

where $\gamma$ is Lipschitz.
Theorem 6.5 (van de Geer (2000)). Let $\mathcal F$ be a class of functions with $|f - \tilde f|_\infty\le1$
for all $f,\tilde f\in\mathcal F$. Suppose that for some constants $A$ and $s > 1/2$,

$$H_\infty(u,\mathcal F) \le Au^{-1/s},\quad u > 0.$$

Then there exists a constant $c$, depending only on $A$ and $s$, such that for all $f^*\in\mathcal F$
and all $t\ge c$,

$$P\Big(\sup_{f\in\mathcal F,\ \|f-f^*\| > n^{-\frac{s}{2s+1}}}\frac{|\nu_n(\gamma_f) - \nu_n(\gamma_{f^*})|}{\|f - f^*\|^{1-\frac{1}{2s}}} \ge t\Big) \le c\exp\Big[-\frac{t^2}{c^2}\Big].$$
Here f (s) is the s-th derivative of f . Then (Kolmogorov and Tikhomirov (1959)),
the entropy condition of Theorem 6.5 holds.
Exercise 6.1. We first study the regression setup with fixed design. Suppose, for
$i = 1,\ldots,n$, we have observations $(x_i,Y_i)\in\mathcal X\times\mathbf R$, where $x_i$ is a fixed covariable,
and $Y_i$ a response variable. In this setup, we throughout used the notation

$$\|f\|_n^2 = \frac1n\sum_{i=1}^nf^2(x_i),$$
(2005). In van de Geer (2000), moduli of continuity are derived that are more general
than the one cited in Section 6.6. There, their implications for rates of convergence
in regression and maximum likelihood estimation are also explained. Entropy results for
subsets of Besov spaces, which are extensions of the space $\mathcal F$ studied in Example
6.1, can be found in Birgé and Massart (2000).
References
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. Proceedings 2nd International Symposium on Information Theory, B.N. Petrov and F. Csáki, Eds., Akadémiai Kiadó, Budapest, 267–281.
Alexander, K.S. (1985). Rates of growth for weighted empirical processes. In: Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer 2 (L. Le Cam and R.A. Olshen, eds.) 475–493. University of California Press, Berkeley.
Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Prob. Theory Rel. Fields 113, 301–413.
Bartlett, P.L., Jordan, M.I. and McAuliffe, J.D. (2003). Convexity, classification and risk bounds. Techn. Report 638, University of California at Berkeley.
Bennett, G. (1962). Probability inequalities for sums of independent random variables. Journ. Amer. Statist. Assoc. 57, 33–45.
Bernstein, S. (1924). Sur une modification de l'inégalité de Tchebichef. Ann. Sci. Inst. Sav. Ukraine, Sect. Math. I (Russian, French summary).
Birgé, L. and Massart, P. (2000). An adaptive compression algorithm in Besov spaces. Journ. Constr. Approx. 16, 1–36.
Birman, M.S. and Solomjak, M.Z. (1967). Piecewise polynomial approximation of functions in the classes $W_p^\alpha$. Math. USSR Sbornik 73, 295–317.
Boser, B., Guyon, I. and Vapnik, V.N. (1992). A training algorithm for optimal margin classifiers. Fifth Annual Conf. on Comp. Learning Theory, Pittsburgh, ACM, 142–152.
Candès, E.J. and Donoho, D.L. (2004). New tight frames of curvelets and optimal representations of objects with piecewise $C^2$ singularities. Comm. Pure and Applied Math. LVII, 219–266.
Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York, Berlin, Heidelberg.
Donoho, D.L. and Johnstone, I.M. (1994a). Minimax risk for $\ell_1$-losses over $\ell_p$-balls. Prob. Th. Related Fields 99, 277–303.
Donoho, D.L. and Johnstone, I.M. (1994b). Ideal spatial adaptation via wavelet shrinkage. Biometrika 81, 425–455.
Donoho, D.L. (1995). De-noising via soft-thresholding. IEEE Transactions on Information Theory 41, 613–627.
Donoho, D.L. and Johnstone, I.M. (1996). Neo-classical minimax problems, thresholding and adaptive function estimation. Bernoulli 2, 39–62.
Donoho, D.L. (1999). Wedgelets: nearly minimax estimation of edges. Ann. Statist. 27, 859–897.
Donoho, D.L. (2004a). For most large underdetermined systems of equations, the minimal $\ell_1$-norm near-solution approximates the sparsest near-solution. Techn. Report, Stanford University.
Donoho, D.L. (2004b). For most large underdetermined systems of linear equations, the minimal $\ell_1$-norm solution is also the sparsest solution. Techn. Report, Stanford University.
Edmunds, E. and Triebel, H. (1992). Entropy numbers and approximation numbers in function spaces. II. Proceedings of the London Mathematical Society (3) 64, 153–169.
Giné, E. and Koltchinskii, V. (2004). Concentration inequalities and asymptotic results for ratio type empirical processes. Working paper.
Goldstein, H. (1980). Classical Mechanics. 2nd edition, Reading, MA, Addison-Wesley.
Green, P.J. and Silverman, B.W. (1994). Nonparametric Regression and Generalized Lin-
ear Models: A Roughness Penalty Approach. Chapman and Hall, London.
Härdle, W., Kerkyacharian, G., Picard, D. and Tsybakov, A. (1998). Wavelets, Approximation and Statistical Applications. Lecture Notes in Statistics, vol. 129. Springer, New York, Berlin, Heidelberg.
Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning. Data Mining, Inference and Prediction. Springer, New York.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journ. Amer. Statist. Assoc. 58, 13–30.
Koenker, R. and Bassett Jr., G. (1978). Regression quantiles. Econometrica 46, 33–50.
Koenker, R., Ng, P.T. and Portnoy, S.L. (1992). Nonparametric estimation of conditional quantile functions. L1 Statistical Analysis and Related Methods, Ed. Y. Dodge, Elsevier, Amsterdam, 217–229.
Koenker, R., Ng, P.T. and Portnoy, S.L. (1994). Quantile smoothing splines. Biometrika 81, 673–680.
Kolmogorov, A.N. and Tikhomirov, V.M. (1959). $\varepsilon$-entropy and $\varepsilon$-capacity of sets in function spaces. Uspekhi Mat. Nauk 14, 3–86. (English transl. in Amer. Math. Soc. Transl. (2) 17 (1961), 277–364.)
Koltchinskii, V. (2001). Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory 47, 1902–1914.
Koltchinskii, V. and Panchenko, D. (2002). Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist. 30, 1–50.
Koltchinskii, V. (2003). Local Rademacher complexities and oracle inequalities in risk minimization. Manuscript.
Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Springer Verlag, New York.
Ledoux, M. (1996). Talagrand deviation inequalities for product measures. ESAIM: Probab. Statist. 1, 63–87. Available at: www.emath.fr/ps/
Lin, Y. (2002). Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery 6, 259–275.
Loubes, J.-M. and van de Geer, S. (2002). Adaptive estimation in regression, using soft thresholding type penalties. Statistica Neerlandica 56, 453–478.
Lugosi, G. and Wegkamp, M. (2004). Complexity regularization via localized random
penalties. To appear in Ann. Statist.
Mammen, E. and Tsybakov, A.B. (1999). Smooth discrimination analysis. Ann. Statist. 27, 1808–1829.
Massart, P. (2000a). About the constants in Talagrand's concentration inequalities for empirical processes. Ann. Probab. 28, 863–884.
Massart, P. (2000b). Some applications of concentration inequalities to statistics. Ann. Fac. Sci. Toulouse 9, 245–303.
Pareto, V. (1897). Cours d'Économie Politique. Rouge, Lausanne et Paris.
Portnoy, S. (1997). Local asymptotics for quantile smoothing splines. Ann. Statist. 25, 414–434.
Portnoy, S. and Koenker, R. (1997). The Gaussian hare and the Laplacian tortoise: computability of squared-error versus absolute-error estimators, with discussion. Stat. Science 12, 279–300.
Schölkopf, B. and Smola, A. (2002). Learning with Kernels. MIT Press, Cambridge.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461–464.
Silverman, B.W. (1985). Some aspects of the smoothing spline approach to nonparametric regression curve fitting (with discussion). Journ. Royal Statist. Soc. B 47, 1–52.
Talagrand, M. (1995). Concentration of measure and isoperimetric inequalities in product spaces. Publications Mathématiques de l'I.H.É.S. 81, 73–205.
Tarigan, B. and van de Geer, S.A. (2005). Support vector machines with $\ell_1$-complexity regularization. Submitted.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal Royal Statist. Soc. B 58, 267–288.
Tsybakov, A.B. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32, 135–166.
Tsybakov, A.B. and van de Geer, S.A. (2005). Square root penalty: adaptation to the margin in classification and in edge estimation. To appear in Ann. Statist. 33.
van de Geer, S. (2000). Empirical Processes in M-Estimation. Cambridge University
Press.
van de Geer, S. (2001). Least squares estimation with complexity penalties. Mathematical Methods of Statistics 10, 355–374.
van de Geer, S. (2002). M-estimation using penalties or sieves. J. Statist. Planning Inf. 108, 55–69.
van de Geer, S. (2003). Adaptive quantile regression. In: Recent Advances and Trends in Nonparametric Statistics (Eds. M.G. Akritas and D.N. Politis), Elsevier, 235–250.
van der Vaart, A.W. and Wellner, J.A. (1996). Weak Convergence and Empirical Processes, with Applications to Statistics. Springer, New York.
Vapnik, V.N. and Chervonenkis, A.Ya. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Th. Probab. Appl. 16, 264–280.
Wahba, G. (1990). Spline Models for Observational Data. Philadelphia: Soc. for Ind. and
Appl. Math.
Wand, M.P. and Jones, M.L. (1995). Kernel Smoothing. Chapman and Hall, London.