
Contents

Eustasio del Barrio
Empirical and Quantile Processes in the Asymptotic Theory of Goodness-of-fit Tests
1 Introduction
2 Testing fit to a fixed distribution
  2.1 Some notes on principal component decompositions of quadratic statistics
3 Some tools from empirical processes theory
  3.1 Some inequalities
  3.2 The Central Limit Theorem
  3.3 Strong approximations
4 Testing fit to a family of distributions
  4.1 Adaptation of tests coming from the fixed-distribution setup
  4.2 The empirical process with estimated parameters
  4.3 Correlation and regression tests
5 Tests based on Wasserstein distance
6 Asymptotics for L2-functionals of the quantile process
  6.1 Weak convergence of L2 linear combinations of exponential r.v.s
  6.2 Weak convergence of weighted L2 functionals of the quantile process
  6.3 Weighted Wasserstein tests of fit to location-scale families of distributions
References

Paul Deheuvels
Topics on Empirical Processes
1 Introduction, notation and preliminaries
  1.1 Introduction
  1.2 Distribution and quantile functions
  1.3 Topologies on spaces of measures and functions
  1.4 The quantile transform
2 Fluctuations of partial sums
  2.1 Some large deviations theory
  2.2 Martingale inequalities
  2.3 Some useful examples of large deviation inequalities
  2.4 Strong approximations of partial sums of i.i.d. random variables
  2.5 The Erdős-Rényi theorem
3 Empirical Processes
  3.1 Uniform empirical distribution and quantile functions
  3.2 Uniform empirical and quantile processes
  3.3 Some further martingale inequalities
  3.4 Relations between empirical and Poisson processes
  3.5 Strong approximations of empirical and quantile processes
  3.6 Some results for weighted processes
  3.7 Finkelstein's theorem via invariance principles
  3.8 Local and tail empirical process
  3.9 Modulus of continuity of $\alpha_n$ and $\beta_n$
  3.10 The Bahadur-Kiefer representation
  3.11 Application to density estimation
4 Auxiliary results
  4.1 Some Gaussian process theory
  4.2 A functional LIL for superpositions of Gaussian processes
  4.3 Karhunen-Loève expansions
  4.4 The RKHS of the Wiener process and Brownian bridge
  4.5 KL expansions for weighted Wiener processes and Brownian bridges
  4.6 Bessel functions
References

Sara van de Geer
Oracle Inequalities and Regularization
1 Statistical models
  1.1 Parametric models
  1.2 The empirical distribution
  1.3 Regularization
  1.4 Bibliographical remarks
2 M-estimators
  2.1 Some examples
  2.2 General framework
  2.3 Estimation and approximation error
  2.4 Where empirical process theory comes in
  2.5 Some first results, assuming ready-to-use empirical process theory
  2.6 Balancing estimation and approximation error
  2.7 Bibliographical remarks
3 The sequence space formulation
  3.1 Reformulation of the regression problem
  3.2 Estimating the mean of a normal vector
  3.3 A collection of models
  3.4 The model an oracle would select
  3.5 Hard- and soft-thresholding
  3.6 A probability inequality for the empirical process
  3.7 Bibliographical remarks
4 Overruling the variance
  4.1 Estimation and approximation error
  4.2 Finite models
  4.3 Nested, finite models
  4.4 General penalties
  4.5 Application to the "classical" penalty of Chapter 1
  4.6 Bibliographical remarks
5 The $\ell_1$-penalty
  5.1 Robust regression
  5.2 Density estimation
  5.3 Binary classification
  5.4 The behavior on n
  5.5 Bibliographical remarks
6 Tools from empirical process theory
  6.1 Concentration inequalities
  6.2 Symmetrization
  6.3 Contraction principle
  6.4 Weighted empirical processes
  6.5 The case of a Lipschitz transformation of a linear space
  6.6 Modulus of continuity of the empirical process
  6.7 Bibliographical remarks
References
Empirical and Quantile Processes in the
Asymptotic Theory of Goodness-of-fit Tests
Eustasio del Barrio

Abstract. This text contains the material presented at the European Math-
ematical Society Summer School on Theory and Statistical Applications of
Empirical Processes, held in Laredo, Spain, in August-September 2004.

1. Introduction
It was in 1900 when Karl Pearson proposed the first test of goodness of fit: the
$\chi^2$ test. The subsequent research devoted to enhancements of this elementary
goodness-of-fit procedure became a major source of motivation for the development
of key areas in Probability and Statistics such as the theory of weak convergence
in general spaces and the asymptotic theory of empirical processes. In this course
we will analyze some suggestive aspects which arise from the development of the
asymptotic theory of goodness-of-fit tests over the last century.
We will pay special attention to stressing the parallel evolution of the theory
of empirical processes and the asymptotic theory of goodness-of-fit tests. Doubt-
less, this evolution is a good indicator of the vast transformation that Probability
and Statistics experienced during that century. Certainly, the names that con-
tributed to the theory are the main guarantee for this assertion. Pearson, Fisher,
Cramér, von Mises, Kolmogorov, Smirnov, Feller, ... laid the foundations of the
theory. In some cases, the mathematical derivation of the asymptotic distribution
of goodness-of-fit tests in that period had the added merit that, in a certain sense,
the limit law was blindly pursued. In Mathematics the main difficulty in showing
convergence consists, with no doubt, of obtaining a convincing candidate for the limit.
Thus, proofs in that period could be considered as major pieces of precision and
inventiveness.
A systematic method of finding adequate candidates for the limit law begins
in 1950 with the heuristic work by Doob [36], made precise by Donsker through
the Invariance Principle. The subsequent construction of adequate metric spaces
and the development of the corresponding weak convergence theory as the right
probabilistic setup for the study of asymptotic distributions had a wide and fast
diffusion, with notable advances due to Prohorov and Skorohod among others. The
contribution of Billingsley's book [7] to this diffusion must also be pointed out.
The study of Probability in Banach spaces has been another source of useful
results for the goodness-of-fit theory. The names of Varadhan, Dudley, Araujo,
Giné, Zinn, Ledoux, Talagrand, ... are necessary references for anyone interested
in asymptotics in Statistics. For example, the Central Limit Theorem in Hilbert
spaces played a main role in obtaining the asymptotic behavior of Cramér-von
Mises type statistics.
Lastly, we must indicate the significance of the Hungarian school, which developed
the strong approximation techniques initiated by Skorohod with his embedding.
Breiman's book [8] had the merit of initially spreading Skorohod's embedding.
Now, the strong approximations due to Komlós, Major, Tusnády, M. and S. Csörgő,
Révész, Deheuvels, Horváth, Mason, ..., are an invaluable tool in the study of
asymptotics in Statistics, as we will point out in this course.

2. Testing fit to a fixed distribution


The simplest goodness-of-t problem consists of testing t to a single xed distri-
bution, namely, given a random sample of real r.v.s X1 , X2 , . . . , Xn with common
d.f. F , testing the null hypothesis H0 : F = F0 for a xed d.f. F0 . While this
procedure is usually of limited interest in applications, the solutions proposed to
this problem provided the main idea in subsequent generalizations designed for
testing t to composite null hypotheses.
Pearsons chi-squared test can be considered as the rst approach to the
problem of testing t to a xed distribution. The solution proposed by Pearson
consisted of dividing the real line into k disjoint categories or cells C1 , . . . , Ck
into which data would fall, under the null hypothesis, with probabilities p1 , . . . , pk .
That is, if H0 were true, then P (X1 Ci ) = pi , i = 1, . . . , k. If Oi is the number
of observations in cell i, then Oi has a binomial distribution with parameters n
and pi ; hence, the De Moivre-Laplace central limit theorem (C.L.T.) states that
(npi (1 pi ))1/2 (Oi npi ) w N (0, 1).
The multivariate C.L.T. shows that, if l k, then Bl = n1/2 (O1 np1 , . . .,
Ol npl )T has a limit distribution which is centered Gaussian and has covariance
matrix l whose (i, j) element, i,j , satises i,j = pi pj , for i = j, and i,i =
pi (1 pi ). On the other hand, if pi > 0, i = 1, . . . , k, k1 is nondegenerate and
1 1 1 1
k1 has element (i, j), i,j , satisfying i,j = pk , for i = j, and i,i = pi + pk .
Simple matrix algebra shows that Bk1 T
1
k1 Bk1 converges in law to a k1
2

distribution. Further, straightforward computations show that


k
(Oj npj )2
2 := T
= Bk1 1
k1 Bk1
j=1
np j

providing thus a well-known result in the asymptotic theory of tests of t:


Theorem 2.1. Under $H_0$, $\chi^2$ has asymptotic distribution $\chi^2_{k-1}$.

Theorem 2.1 reduces the problem of testing fit to a fixed distribution to
the analysis of a multinomial distribution, thus providing a widely applicable and easy
to use method for testing fit which immediately carries over to the multivariate
setup. Moreover, this test also allows some freedom in choosing the number, the
location or the size of the cells $C_1, \ldots, C_k$. This point will be discussed in the next
section. Though, as pointed out by many authors (see, e.g., [68]), consideration
of only the cell frequencies when $F$ is continuous produces a loss of information
that results in lack of power (the $\chi^2$ statistic will not distinguish two different
distributions sharing the same cell probabilities).
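As a quick illustration, the following minimal Python sketch (our own; it assumes numpy and scipy are available, and the choice of $k$ equiprobable cells for $F_0 = N(0,1)$ is arbitrary) computes Pearson's statistic and the $\chi^2_{k-1}$ p-value of Theorem 2.1:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, k = 500, 5
    # k equiprobable cells for F0 = N(0,1); p_i = 1/k under H0
    edges = stats.norm.ppf(np.linspace(0, 1, k + 1)[1:-1])   # k-1 interior boundaries
    p = np.full(k, 1.0 / k)

    x = rng.normal(size=n)                                   # sample drawn under H0
    O = np.bincount(np.searchsorted(edges, x), minlength=k)  # cell counts O_1,...,O_k
    chi2 = np.sum((O - n * p) ** 2 / (n * p))                # Pearson's statistic
    print(chi2, stats.chi2.sf(chi2, df=k - 1))               # asymptotic p-value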
Therefore, in order to improve our method for testing fit we should try to
make use of the complete information provided by the data. However, the multi-
variate C.L.T. and elementary matrix algebra were the only tools needed in the
derivation of the asymptotic distribution in Theorem 2.1. This will not be the case
when handling more complicated statistics.
A way to improve Pearson's statistic consists of employing a functional dis-
tance to measure the discrepancy between the hypothesized d.f. $F_0$ and the empir-
ical d.f. $F_n$. The first representatives of this method were proposed in the late 20s
and in the 30s. Cramér [15] and, in a more general form, von Mises [103] proposed

    \omega_n^2 = n \int_{-\infty}^{\infty} (F_n(x) - F_0(x))^2 \psi(x) \, dx

for some suitable weight function $\psi$ as an adequate measure of discrepancy. Kol-
mogorov [53] studied

    D_n = \sqrt{n} \sup_{-\infty < x < \infty} |F_n(x) - F_0(x)|

and Smirnov [90], [91] the closely related statistics

    D_n^+ = \sqrt{n} \sup_{-\infty < x < \infty} (F_n(x) - F_0(x)),
    D_n^- = \sqrt{n} \sup_{-\infty < x < \infty} (F_0(x) - F_n(x)),

more adequate for tests against one-sided alternatives. The statistics $D_n$, $D_n^+$ or
$D_n^-$ are known as Kolmogorov-Smirnov statistics and present the advantage of
being distribution-free: for any continuous d.f. $F_0$, $D_n$ has (under $H_0$) the same
distribution as $\sup_{0<t<1} |\alpha_n(t)|$, with $\alpha_n$ being the uniform empirical process:

    \alpha_n(t) = \sqrt{n} (G_n(t) - t), \quad 0 \le t \le 1,

where $G_n(t) = \frac{1}{n} \sum_{i=1}^{n} I(U_i \le t)$ and the $U_i$ are i.i.d. uniform r.v.s. Similar statements
hold for $D_n^+$ and $D_n^-$. Thus, the same p-values can be used to obtain the significance
level when testing fit to any continuous distribution. This desirable property is not
satisfied by $\omega_n^2$, but it also holds for the following modification:

    W_n^2(\psi) = n \int_{-\infty}^{\infty} \psi(F_0(x)) (F_n(x) - F_0(x))^2 \, dF_0(x),    (2.1)

which was proposed by Smirnov [88], [89]. All the statistics which can be obtained
by varying $\psi$ are usually referred to as statistics of Cramér-von Mises type. Con-
sideration of different weight functions $\psi$ allows the statistician to put special
emphasis on the detection of particular sets of alternatives. For this reason, some
weighted versions of Kolmogorov's statistic have also been proposed, namely,

    K_n(w) = \sqrt{n} \sup_{-\infty < x < \infty} \frac{|F_n(x) - F_0(x)|}{w(F_0(x))}.    (2.2)

The convenience of employing $W_n^2(\psi)$ instead of $D_n^2$ as a test statistic can
be understood taking into account that $D_n^2$ accounts only for the largest deviation
between $F_n(t)$ and $F(t)$, while $W_n^2(\psi)$ is a weighted average of all the deviations
between $F_n(t)$ and $F(t)$. Thus, as observed by Stephens [96], $W_n^2(\psi)$ should have
more chance to detect alternatives that, not presenting a very large deviation with
respect to $F$ at any point $t$, are moderately far from $F$ for a large range of points
$t$ (think of location alternatives). These heuristic considerations are confirmed by
simulation studies (see [96] for references).
Two particular statistics have received special attention in the literature.
When $\psi \equiv 1$,

    W_n^2 = n \int_{-\infty}^{\infty} (F_n(x) - F_0(x))^2 \, dF_0(x)

is called the Cramér-von Mises statistic; when $\psi(t) = (t(1-t))^{-1}$ then

    A_n^2 = n \int_{-\infty}^{\infty} \frac{(F_n(x) - F_0(x))^2}{F_0(x)(1 - F_0(x))} \, dF_0(x)

is called the Anderson-Darling statistic. $A_n^2$ has the additional appeal of weighting
the deviations according to their expected value, and this results in a more powerful
statistic for testing fit to a fixed distribution, see [96].
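For concreteness, here is a small Python sketch (our illustration, assuming numpy/scipy) that evaluates $D_n$, $D_n^{\pm}$, $W_n^2$ and $A_n^2$ through the standard closed-form expressions in terms of the order statistics $u_{(i)} = F_0(X_{(i)})$:

    import numpy as np
    from scipy import stats

    def gof_statistics(x, F0=stats.norm.cdf):
        """Kolmogorov-Smirnov, Cramer-von Mises and Anderson-Darling statistics."""
        n = len(x)
        u = np.sort(F0(np.asarray(x)))                  # u_(i) = F0(X_(i))
        i = np.arange(1, n + 1)
        Dp = np.sqrt(n) * np.max(i / n - u)             # D_n^+
        Dm = np.sqrt(n) * np.max(u - (i - 1) / n)       # D_n^-
        W2 = 1 / (12 * n) + np.sum((u - (2 * i - 1) / (2 * n)) ** 2)       # W_n^2
        A2 = -n - np.mean((2 * i - 1) * (np.log(u) + np.log1p(-u[::-1])))  # A_n^2
        return max(Dp, Dm), Dp, Dm, W2, A2

    print(gof_statistics(np.random.default_rng(1).normal(size=100)))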
For the use in practice of any of these appealing statistics we should be able
to obtain the corresponding significance levels. Smirnov [91], using combinato-
rial techniques, obtained an explicit expression for the exact distribution of $D_n^+$.
Kolmogorov [53] also gave an expression that enabled the tabulation of the dis-
tribution of $D_n$. Some more difficulties were found when dealing with the exact
distributions of statistics of Cramér-von Mises type. But even in those cases in
which some formula allowed one to compute the exact p-values, the interest in ob-
taining the asymptotic distribution of the test statistic was clear, for it would
greatly decrease the computational effort needed to obtain the (approximate) p-
values (and this was of crucial importance by the time these tests were proposed).
The celebrated first asymptotic results about $D_n$ and $D_n^+$ are summarized in the
following theorem:
Theorem 2.2. For every $x > 0$:

    i) (Kolmogorov, 1933)  P(D_n \le x) \to \sum_{j=-\infty}^{\infty} (-1)^j e^{-2j^2 x^2};
    ii) (Smirnov, 1941)  P(D_n^+ > x) \to e^{-2x^2}.

Kolmogorov's proof of i) was based on the consideration of a limiting diffu-
sion equation. Smirnov used the exact expression of $P(D_n^+ > x)$ to show ii). Also
Smirnov [88] derived the asymptotic distribution of statistics of Cramér-von Mises
type.
Feller [40] claimed that Kolmogorov's and Smirnov's proofs were very intri-
cate and based on completely different methods, and presented his paper
as an attempt to give unified proofs of those theorems (which could provide a
systematic method of deriving the asymptotic distribution of other test statistics
expressible as a functional of the empirical d.f.). It seemed unnatural that, $D_n$,
$D_n^+$ and $W_n^2$ being measures of discrepancy between $F_n$ and $F_0$ based on the same
object, namely, the empirical process, a particular technique had to be used in the
derivation of the asymptotic distribution of each statistic. Thus, Feller's paper is
a remarkable step in the development of a unified asymptotic theory for tests of
fit based on the empirical process. But still, a study of the empirical process itself
and of its asymptotic distribution (a concept which would have to be made precise) was
not considered and, as claimed in Doob [36], all these proofs (including Feller's)
"conceal to some extent ... the naturalness of the results (qualitatively at least)
and their mutual relations."
It was Doob [36] who, considering the finite-dimensional distributions, con-
jectured the convergence of the uniform empirical process to the Brownian bridge.
A useful consequence of this fact would be that, under some (not explicit) hypothe-
ses, the derivation of the asymptotic distribution of a functional of the uniform
empirical process could be reduced to the derivation of the distribution of the same
functional of the Brownian bridge. Doob proved that

    P\Big( \sup_{0 \le t \le 1} |B(t)| \le x \Big) = \sum_{j=-\infty}^{\infty} (-1)^j e^{-2j^2 x^2}    (2.3)

and

    P\Big( \sup_{0 \le t \le 1} B(t) > x \Big) = e^{-2x^2}.    (2.4)

Thus, justification of Doob's conjecture would provide a new simpler proof
of the results of Kolmogorov and Smirnov.
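Doob's formulas are easy to check by simulation; the following sketch (ours, with arbitrary grid size and number of replications) approximates a Brownian bridge on a grid and compares the empirical frequencies with (2.3) and (2.4):

    import numpy as np

    rng = np.random.default_rng(2)
    m, reps, x = 500, 5000, 1.0
    t = np.arange(1, m + 1) / m
    W = np.cumsum(rng.normal(scale=m ** -0.5, size=(reps, m)), axis=1)
    B = W - t * W[:, -1:]                            # B(t) = W(t) - t W(1) on the grid
    j = np.arange(1, 51)
    series = np.sum((-1.0) ** j * np.exp(-2 * j ** 2 * x ** 2))
    print(np.mean(np.abs(B).max(axis=1) <= x), 1 + 2 * series)   # two-sided, (2.3)
    print(np.mean(B.max(axis=1) > x), np.exp(-2 * x ** 2))       # one-sided, (2.4)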
This justification was given by Donsker by employing his invariance principle
in [34] and [35]. His results showed that the distribution of a continuous func-
tional of the partial sum process (obtained from a sequence of i.i.d. r.v.s with
finite second moment) converges to the distribution of the corresponding func-
tional of a Brownian motion, and that the distribution of a continuous functional
of the uniform empirical process converges to the distribution of the corresponding
functional of a Brownian bridge.
The development of the theory of weak convergence in metric spaces due,
among others, to Kolmogorov, Prohorov and Skorohod in the fifties (see [75], [54],
[76] and [87]) allowed a better understanding of this invariance principle, as
presented in [7]. The space $C[0,1]$ was one of the first metric spaces for which this
theory was developed, through the work of Prohorov [76]. The scheme consisting
of proving the convergence of the finite-dimensional distributions plus a tightness
condition allowed one to obtain distributional limit theorems for slight modifications of
the partial sum and the uniform empirical processes, because both processes could
be approximated by equivalent processes obtained from them by linear interpo-
lation so that all the random objects considered in the limit theorems remained in
$C[0,1]$.
This last approximation is somehow artificial. In order to avoid it, a wider
space had to be considered. A proper study of the weak convergence of the uni-
form empirical process could be attempted in the space $D[0,1]$. The fact that the
empirical process is not measurable when the uniform norm is considered led to
the introduction of a more involved topology, namely the Skorohod topology, which
turned $D[0,1]$ into a separable and complete metric space in which the empiri-
cal process was measurable. In this setup the weak convergence of the empirical
process could be properly stated (see, e.g., [7] p. 141):

Theorem 2.3. If we consider $\alpha_n$ and $B$ as random elements taking values in $D[0,1]$,
then

    \alpha_n \to_w B.

Theorem 2.3 made it possible to rederive Theorem 2.2 in a very natural way. Notice
that $D_n = \|\alpha_n\|_{\infty}$ and that the map $x \mapsto \|x\|_{\infty}$ is continuous for the Skorohod
topology outside a set of $B$-measure zero. Thus, we can conclude that $D_n \to_w \|B\|_{\infty}$
and this, combined with (2.3), gives a proof of the first statement in Theorem 2.2.
The same method works for $D_n^+$.
The use of the Skorohod space is not the only possibility to circumvent the
difficulty posed by the nonmeasurability of the empirical process. A different ap-
proach to the problem could be based on the following scheme. If we could define,
on a rich enough probability space, a sequence of i.i.d. r.v.s uniformly distributed
on $(0,1)$ with associated empirical process $\alpha_n(t)$ and a Brownian bridge $B(t)$ such
that

    \sup_{0 \le t \le 1} |\alpha_n(t) - B(t)| \to_P 0,    (2.5)

then we would trivially obtain that for any functional $H$ defined on $D[0,1]$ and
continuous on $C[0,1]$, $H(\alpha_n) \to_P H(B)$, obtaining a new proof of Theorem 2.2. The
study of results of type (2.5), generically known as strong approximations, began
with the Skorohod embedding, consisting of imitating the partial sum process by
using a Brownian motion evaluated at random times (see [8]). Successive refine-
ments of this idea became one of the most important methodologies in the research
related to empirical processes.
Turning back to the applications of Theorem 2.3 in the asymptotic theory of
tests of fit, we should note that the functional $x \mapsto \int_0^1 x(t)^2 \, dt$ is also continuous
for the Skorohod topology outside a set of $B$-measure zero. We can use this fact
to obtain the asymptotic distribution of the Cramér-von Mises statistic. Namely,

    W_n^2 \to_w \int_0^1 B(t)^2 \, dt.    (2.6)

Then a Karhunen-Loève expansion of $B(t)$ allows one to easily compute the char-
acteristic function of $\int_0^1 B(t)^2 \, dt$, and the inversion of this characteristic function
allows one to tabulate the asymptotic distribution of $W_n^2$ (see, e.g., [86] p. 215 for
details). This methodology made unnecessary the involved arguments used by
Smirnov to derive the asymptotic distribution of $W_n^2$.
A little extra effort allows one to extend this method to derive the asymptotic
distribution of other statistics of Cramér-von Mises type. As a consequence of the
Law of the Iterated Logarithm for the Brownian motion, Anderson and Darling
showed in [2] that, provided

    \int_0^{\varepsilon} \psi(t)\, t \log\log\frac{1}{t} \, dt \quad\text{and}\quad \int_{1-\varepsilon}^{1} \psi(t)(1-t) \log\log\frac{1}{1-t} \, dt

are finite for some $\varepsilon \in (0,1)$, the functional $x \mapsto \int_0^1 \psi(t) x(t)^2 \, dt$ is continuous,
with respect to the Skorohod distance, outside a set of $B$-measure zero and, conse-
quently, $W_n^2(\psi) \to_w \int_0^1 \psi(t) B(t)^2 \, dt$. This result covers the Anderson-Darling statis-
tic $A_n^2$.
Although all the limit theorems for goodness of fit that we have described
so far are based on the weak convergence of the empirical process considered as a
random element taking values in the space of cadlag functions with the Skorohod
topology, plus the continuity of a suitable functional, there is a more natural way
to study the asymptotic properties of statistics of Cramér-von Mises type and,
more generally, of integral functionals of the empirical process.
The uniform empirical process can be viewed as a random element taking
values in the separable Hilbert space $L_2((0,1), \psi)$ of all real, Borel measurable
functions $f$ on $(0,1)$ such that $\int_0^1 \psi(t) f(t)^2 \, dt$ is finite, where we consider the
norm given by

    \|f\|_{2,\psi}^2 = \int_0^1 \psi(t) f(t)^2 \, dt.

In this setup $W_n^2(\psi) = \|\alpha_n\|_{2,\psi}^2$. The theory of probability in Banach spaces,
developed in the 60s and 70s, turned the problem of studying the asymptotic
distribution of $W_n^2(\psi)$ into an easier task, because the C.L.T. for random elements
taking values in $L_2((0,1), \psi)$ (see, e.g., [3] p. 205, ex. 14) asserts that a sequence
$\{Y_n(t)\}_n$ of i.i.d. $L_2((0,1),\psi)$-valued random elements verifies

    \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i(t) \to_w Y(t)

if and only if $\int_0^1 E(Y_1(t))^2 \psi(t) \, dt < \infty$ and, in that case, $Y$ is a Gaussian random
element with the same covariance function as $Y_1$.
Therefore, if we set

    Y_i(t) = I\{U_i \le t\} - t, \quad i = 1, \ldots, n,

then $\alpha_n(t) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i(t)$ and $Y_1(t)$ has the same covariance function as the
Brownian bridge $B(t)$. Hence, $\alpha_n \to_w B$ in $L_2((0,1),\psi)$ if and only if $\int_0^1 t(1-t)\psi(t) \, dt < \infty$.

Theorem 2.4. (Asymptotic distribution of statistics of Cramér-von Mises type)
$\alpha_n$ converges weakly in $L_2((0,1),\psi)$ if and only if $\int_0^1 t(1-t)\psi(t) \, dt < \infty$. In that
case

    W_n^2(\psi) \to_w \int_0^1 \psi(t) B(t)^2 \, dt.    (2.7)

While the development of probability in Banach spaces provides this final
result for quadratic statistics (and we will have a better example of this later in
Chapter 6), the use of strong approximations produces a similar result for supre-
mum norm statistics. Chibisov [12] and O'Reilly [70] used the Skorohod embedding
and a special representation of the uniform empirical process in terms of a Poisson
process (see, e.g., [86] p. 339) to obtain necessary and sufficient conditions for the
weak convergence of the empirical process to the Brownian bridge in weighted uni-
form metrics. If $w$ is a positive function on $(0,1)$, nondecreasing in a neighborhood
of 0 and nonincreasing in a neighborhood of 1, and we consider the norm given
by $\|x\|_w = \sup_{0<t<1} \frac{|x(t)|}{w(t)}$ on $D[0,1]$, then $\alpha_n \to_w B$ in $\|\cdot\|_w$ norm (with the nec-
essary modifications in the definition of weak convergence to avoid measurability
problems) if and only if

    \int_0^1 \frac{1}{t(1-t)} \exp\Big( -\varepsilon \frac{w(t)^2}{t(1-t)} \Big) dt < \infty,    (2.8)

for every $\varepsilon > 0$. An immediate corollary of the Chibisov-O'Reilly theorem is that
(2.8) is a sufficient condition for ensuring the convergence

    K_n(w) \to_w \sup_{0<t<1} \frac{|B(t)|}{w(t)}.

A modification of the so-called Hungarian construction due to Komlós, Major
and Tusnády [55], [56] and to Csörgő and Révész [21] was used in [17] to give the
following final result for statistics of Kolmogorov-Smirnov type:
Theorem 2.5. (Asymptotic distribution of statistics of Kolmogorov-Smirnov ty-
pe) If $w$ is a positive function on $(0,1)$, nondecreasing in a neighborhood of 0 and
nonincreasing in a neighborhood of 1, then $K_n(w)$ converges in distribution to a
nondegenerate limit law if and only if

    \int_0^1 \frac{1}{t(1-t)} \exp\Big( -\varepsilon \frac{w(t)^2}{t(1-t)} \Big) dt < \infty \quad \text{for some } \varepsilon > 0.

In that case,

    K_n(w) \to_w \sup_{0<t<1} \frac{|B(t)|}{w(t)}.
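The integral condition is easy to probe numerically. In the sketch below (our illustration, assuming scipy; the two weights are arbitrary examples) the partial integrals over $(a, 1/2)$ stabilize for $w(t) = (t(1-t))^{1/4}$, while for the borderline weight $w(t) = (t(1-t))^{1/2}$ they grow like $\log(1/a)$, signalling divergence:

    import numpy as np
    from scipy.integrate import quad

    def tail_integral(w, eps, a):
        """Integral in (2.8) over (a, 1/2); the (1/2, 1-a) part behaves symmetrically."""
        f = lambda t: np.exp(-eps * w(t) ** 2 / (t * (1 - t))) / (t * (1 - t))
        return quad(f, a, 0.5, limit=500)[0]

    w_light = lambda t: (t * (1 - t)) ** 0.25        # satisfies the condition
    w_heavy = lambda t: (t * (1 - t)) ** 0.5         # fails it
    for a in (1e-3, 1e-6, 1e-9):
        print(a, tail_integral(w_light, 1.0, a), tail_integral(w_heavy, 1.0, a))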

2.1. Some notes on principal component decompositions of quadratic statistics

We have seen that, provided $\int_0^1 t(1-t)\psi(t) \, dt < \infty$,

    W_n^2(\psi) \to_w \int_0^1 B(t)^2 \psi(t) \, dt.

Thus, if $\{f_j\}_j$ is a c.o.n.s. for $\langle \cdot, \cdot \rangle_{2,\psi}$, we have

    \int_0^1 B(t)^2 \psi(t) \, dt = \|B\|_{2,\psi}^2 = \sum_{j=1}^{\infty} \langle B, f_j \rangle_{2,\psi}^2.

On the other hand, the $\langle B, f_j \rangle_{2,\psi} = \int_0^1 B(t) f_j(t) \psi(t) \, dt$ are centered Gaussian r.v.s and,
if $\{f_j\}_j$ are the eigenfunctions of the covariance operator, namely, if

    \lambda_j f_j(t) = \int_0^1 (s \wedge t - st) \psi(s) f_j(s) \, ds,    (2.9)

then the $\langle B, f_j \rangle_{2,\psi}$ are independent $N(0, \lambda_j)$ r.v.s, since

    \mathrm{Var}(\langle B, f_j \rangle_{2,\psi}) = E \int_0^1 \int_0^1 B(s) f_j(s)\psi(s)\, B(t) f_j(t)\psi(t) \, ds \, dt
      = \int_0^1 \int_0^1 (s \wedge t - st) f_j(s)\psi(s) f_j(t)\psi(t) \, ds \, dt = \lambda_j \int_0^1 f_j(t)^2 \psi(t) \, dt = \lambda_j.

As a consequence, we have that, under $H_0$,

    W_n^2(\psi) \to_w \sum_{j=1}^{\infty} \lambda_j Y_j^2,

where the $\{Y_j\}_j$ are i.i.d. $N(0,1)$ and $\lambda_j \ge 0$. This shows that the characteristic function of
the limiting distribution of $W_n^2(\psi)$ can be written as

    \phi(\theta) = \prod_{j=1}^{\infty} (1 - 2i\theta\lambda_j)^{-1/2}.

This can be used in some cases to find a useful expression of the limiting charac-
teristic function and, hopefully, to find, via an inversion formula, exact expressions
for the limiting distribution functions.

Example. For the Cramér-von Mises statistic, $W_n^2$, we have $\lambda_j = (\pi j)^{-2}$, $f_j(t) = \sqrt{2}\sin(\pi j t)$
and, therefore,

    W_n^2 \to_d \omega^2 := \sum_{j=1}^{\infty} \frac{Y_j^2}{\pi^2 j^2}.

To see this, we observe that in this case equation (2.9) becomes

    \lambda f(t) = \int_0^t s(1-t) f(s) \, ds + \int_t^1 t(1-s) f(s) \, ds,

from which, differentiating twice, we obtain

    \lambda f''(t) = -f(t).

Noting that $f(0) = f(1) = 0$ we prove the above claim. It can be used to obtain
that

    E(e^{i\theta\omega^2}) = \Big( \frac{\sqrt{2i\theta}}{\sin\sqrt{2i\theta}} \Big)^{1/2}.

This gives

    P(\omega^2 > x) = \frac{1}{\pi} \sum_{j=1}^{\infty} (-1)^{j+1} \int_{((2j-1)\pi)^2}^{(2j\pi)^2} \Big( \frac{-\sqrt{y}}{\sin\sqrt{y}} \Big)^{1/2} \frac{e^{-xy/2}}{y} \, dy.
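Before turning to the Anderson-Darling case, note that one can also sample the series directly instead of inverting the characteristic function. This sketch (ours; the truncation level and evaluation points are arbitrary) compares tail probabilities of the truncated series with the finite-$n$ statistic $W_n^2$ under $H_0$:

    import numpy as np

    rng = np.random.default_rng(3)
    reps, terms, n = 5000, 200, 500
    j = np.arange(1, terms + 1)
    omega2 = ((rng.normal(size=(reps, terms)) / (np.pi * j)) ** 2).sum(axis=1)

    u = np.sort(rng.uniform(size=(reps, n)), axis=1)   # under H0, F0(X_i) is uniform
    i = np.arange(1, n + 1)
    W2 = 1 / (12 * n) + ((u - (2 * i - 1) / (2 * n)) ** 2).sum(axis=1)

    for x in (0.2, 0.46, 0.74):
        print(x, (omega2 > x).mean(), (W2 > x).mean())  # limit law vs finite n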

In the case of the Anderson-Darling statistic, $A_n^2$, we can proceed similarly to
obtain $\lambda_j = (j(j+1))^{-1}$, $f_1(t) = \sqrt{6}\, t(1-t)$, $f_2(t) = \sqrt{30}\, t(1-t)(2t-1), \ldots,$
and

    A_n^2 \to_d A^2 := \sum_{j=1}^{\infty} \frac{Y_j^2}{j(j+1)}.

Now

    E(e^{i\theta A^2}) = \Big( \frac{-2\pi i\theta}{\cos\big( (\pi/2)\sqrt{1 + 8i\theta} \big)} \Big)^{1/2}

and a similar expression is available for $P(A^2 > x)$.
The main interest of the orthogonal decomposition of quadratic statistics
does not come from this type of inversions, but from the fact that it provides a
good insight into the power properties of the tests. Let us assume that $H_0$ does
not hold, but, instead, that $X_1, \ldots, X_n$ are i.i.d. $H_n$ and $\sqrt{n}(H_n - G_0) \to \delta(G_0)$
(these are called local alternatives). It can be shown then that

    \alpha_n \to_w B + \delta.

Hence,

    D_n(w) \to_d \sup_{0<t<1} \frac{|B(t) + \delta(t)|}{w(t)} \quad\text{and}\quad W_n^2(\psi) \to_d \int_0^1 (B(t) + \delta(t))^2 \psi(t) \, dt.

Example. Location and scale alternatives:

    X_1, \ldots, X_n \ \text{i.i.d.} \ H_n, \quad \sqrt{n}(H_n - G_0) \to \delta(G_0),
    G_0 = G_{\theta_0}, \quad H_n = G_{\theta_n}, \quad \theta_n = \theta_0 + \eta/\sqrt{n},
    \delta(t) = \begin{cases} g(G^{-1}(t)) & \text{if } \theta_0 = 0, \ G_\theta = G(\cdot - \theta) \\ G^{-1}(t)\, g(G^{-1}(t)) & \text{if } \theta_0 = 1, \ G_\theta = G(\cdot/\theta). \end{cases}

Turning back to statistics of Cramér-von Mises type we obtain

    W_n^2(\psi) \to_w \int_0^1 (B(t) + \delta(t))^2 \psi(t) \, dt =_d \sum_{j=1}^{\infty} \lambda_j Y_j'^2

with $Y_j'$ independent $N(\lambda_j^{-1/2} \langle \delta, f_j \rangle_{2,\psi},\, 1)$ r.v.s. This suggests that we define the
principal component in direction $f_j$:

    W_n^2(\psi) = \sum_{j=1}^{\infty} Y_{n,j}^2, \qquad Y_{n,j} := \int_0^1 \alpha_n(t) f_j(t) \psi(t) \, dt.

Under $H_0$, the $Y_{n,j}$ are approximately $N(0, \lambda_j)$ independent r.v.s; under local alter-
natives

    Y_{n,j} \to_d N(\langle \delta, f_j \rangle_{2,\psi}, \lambda_j).

Thus, $Y_{n,j}$ measures deviations in direction $f_j$ and tests based on $Y_{n,j}$ can be
expected to be powerful against alternatives in direction $f_j$.

Example. Cramér-von Mises components:

    Y_{n,1} = \frac{\sqrt{2}}{\pi\sqrt{n}} \sum_{i=1}^{n} \cos(\pi G_0(X_i)), \qquad Y_{n,2} = \frac{\sqrt{2}}{2\pi\sqrt{n}} \sum_{i=1}^{n} \cos(2\pi G_0(X_i)),

    \delta(t) = \begin{cases} g(G^{-1}(t)) & \text{location alternatives} \\ G^{-1}(t)\, g(G^{-1}(t)) & \text{scale alternatives.} \end{cases}

[Figure: the first two eigenfunctions, $f_1(t) = \sqrt{2}\sin(\pi t)$ and $f_2(t) = \sqrt{2}\sin(2\pi t)$, plotted against $t$ on $[0,1]$.]
If we assume $G$ to be (approximately) symmetric we can expect $Y_{n,1}$ to be good for
detecting location alternatives, while $Y_{n,2}$ should be more powerful against scale
alternatives. The table below provides some confirmation of this idea.

Asymptotic power of $W_n^2$, $Y_{n,1}$, $Y_{n,2}$ and $A_n^2$
(location and scale alternatives; normal model; one-sided tests)

                     location               scale
                  best test power       best test power
    Test           0.50     0.95         0.50     0.95
    Y_{n,1}        0.466    0.930        0.05     0.05
    Y_{n,2}        0.05     0.05         0.336    0.787
    W_n^2          0.342    0.877        0.072    0.205
    A_n^2          0.354    0.890        0.108    0.423
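A rough Monte Carlo rendering of the location column can be obtained as follows (our own sketch; sample size, shift and level are arbitrary choices). It uses that, under $H_0$, $\pi j \, Y_{n,j}$ is approximately $N(0,1)$:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)

    def components(x, F0=stats.norm.cdf):
        """First two principal components Y_{n,1}, Y_{n,2} of W_n^2."""
        u, n = F0(np.asarray(x)), len(x)
        y1 = np.sqrt(2 / n) / np.pi * np.cos(np.pi * u).sum()
        y2 = np.sqrt(2 / n) / (2 * np.pi) * np.cos(2 * np.pi * u).sum()
        return y1, y2

    n, reps, shift = 100, 5000, 0.25                 # location alternative N(0.25, 1)
    Y = np.array([components(rng.normal(loc=shift, size=n)) for _ in range(reps)])
    z = stats.norm.ppf(0.975)                        # 5% two-sided critical value
    print("Y_n1 rejects:", np.mean(np.abs(np.pi * Y[:, 0]) > z))       # high power
    print("Y_n2 rejects:", np.mean(np.abs(2 * np.pi * Y[:, 1]) > z))   # near the level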

Similar considerations apply to other quadratic functionals of empirical, quan-
tile or Gaussian processes, but we will not pursue these issues further here. The
interested reader can find more details in [86], [96].

3. Some tools from empirical processes theory

3.1. Some inequalities
We collect here some inequalities involving tail probabilities and moments of sums
of independent r.v.s. They are useful for our goals in that they can be used to up-
grade stochastic boundedness of some sequences of variables to uniform bounded-
ness of their moments. We will assume that the $X_i$ are independent $B$-valued r.v.s ($B$
being a separable Banach space) and denote $S_n = \sum_{i=1}^{n} X_i$, $X_n^* = \max_{1 \le i \le n} \|X_i\|$
and $S_n^* = \max_{1 \le i \le n} \|S_i\|$. The proofs of these results can be found in [27].
We begin with the Lévy maximal inequalities, bounding the tail probabilities
of $S_n^*$ by tail probabilities of $\|S_n\|$.

Theorem 3.1. Let $X_i$ be independent symmetric r.v.s. Then

    P\Big( \max_{1 \le k \le n} \Big\| \sum_{i=1}^{k} X_i \Big\| > t \Big) \le 2 \Pr\Big( \Big\| \sum_{i=1}^{n} X_i \Big\| > t \Big), \quad t > 0.

The symmetry hypothesis can be dropped if the $X_i$ are i.i.d., at the price of worse
constants:

Theorem 3.2. Let $X_i$ be i.i.d. r.v.s. Then

    P\Big( \max_{1 \le k \le n} \Big\| \sum_{i=1}^{k} X_i \Big\| > t \Big) \le 9 \Pr\Big( \Big\| \sum_{i=1}^{n} X_i \Big\| > t/30 \Big), \quad t > 0.
The Hoffmann-Jørgensen inequalities bound moments of sums by the corre-
sponding moment of a maximum plus a quantile of the sum:

Theorem 3.3. For each $p > 0$ there exist constants $K_p$, $c_p$ such that, if the $X_i$ are
i.i.d. or independent and symmetric r.v.s, then

    \|S_n\|_p \le K_p \big[ t_0 + \|X_n^*\|_p \big],

where $\|Y\|_p = (E\|Y\|^p)^{1/p}$ and

    t_0 = \inf\{ t > 0 : \Pr(\|S_n\| > t) \le c_p \}.

This inequality gives the following result on comparison of moments:

Theorem 3.4. For each $0 < p < q$ there exists a constant $K$ such that, if the $X_i$ are
i.i.d. or independent and symmetric r.v.s, then

    \|S_n\|_q \le K \big[ \|S_n\|_p + \|X_n^*\|_q \big].

The following randomization/symmetrization result by Rademacher variables
(symmetric r.v.s taking values in $\{-1, 1\}$) shows that the above theorem is also
valid for centered, independent r.v.s.

Theorem 3.5. Let $X_i$ be independent, centered r.v.s in $L_p$, $p \ge 1$, and let $\{\varepsilon_i\}$ be
independent Rademacher r.v.s, independent of the $X_i$. Then

    2^{-p} E\Big\| \sum_{i=1}^{n} \varepsilon_i X_i \Big\|^p \le E\Big\| \sum_{i=1}^{n} X_i \Big\|^p \le 2^p E\Big\| \sum_{i=1}^{n} \varepsilon_i X_i \Big\|^p

and

    E (S_n^*)^p \le 2^{p+1} E\Big\| \sum_{i=1}^{n} \varepsilon_i X_i \Big\|^p.
i=1

As an example of the usefulness of these inequalities we give the following
result on the $L_1$-norm of the empirical process on the line.

Theorem 3.6. Let $X, X_i$, $i \in \mathbb{N}$, be i.i.d. random variables with common distribu-
tion $F$. Let

    Y(t) := I_{X > t} - \Pr\{X > t\}, \quad -\infty < t < \infty,    (3.1)

and let $Y_i$, $i \in \mathbb{N}$, denote the processes obtained by replacing $X$ by $X_i$ in (3.1).
Then the sequence

    \Big\| \sum_{i=1}^{n} \frac{Y_i}{\sqrt{n}} \Big\|_{L_1} = \sqrt{n} \int_{-\infty}^{\infty} |F_n(t) - F(t)| \, dt, \quad n \in \mathbb{N},

is stochastically bounded if and only if

    \Lambda_{2,1}(X) := \int_0^{\infty} \sqrt{ \Pr\{|X| > t\} } \, dt < \infty.

Proof. The sufficiency part follows easily from Markov's inequality. For the neces-
sity part we can assume, w.l.o.g., that $X \ge 0$. It is now convenient to write

    Z(t) := I_{X > t}, \quad Z_i(t) := I_{X_i > t}, \quad i \in \mathbb{N}, \ t \in \mathbb{R},
so that $Y(t) = Z(t) - EZ(t)$ and likewise for $Y_i$. The stochastic boundedness
hypothesis simply asserts

    \lim_{M \to \infty} \sup_n \Pr\Big\{ \Big\| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} (Z_i - EZ_i) \Big\|_{L_1} > M \Big\} = 0.

The Lévy type inequality for i.i.d. random vectors then implies that

    \lim_{M \to \infty} \sup_n \Pr\Big\{ \frac{1}{\sqrt{n}} \max_{1 \le i \le n} \| Z_i - EZ_i \|_{L_1} > M \Big\} = 0.

The classical inequality for independent random variables, say $\eta_i$,

    \Pr\Big\{ \max_i |\eta_i| > t \Big\} \ge 1 - \exp\Big( -\sum_i \Pr\{|\eta_i| > t\} \Big)

then gives that there is a constant $M < \infty$ such that

    \sup_n n \Pr\big\{ \| Z - EZ \|_{L_1} > M\sqrt{n} \big\} < \infty,

or, equivalently,

    \Lambda_{2,\infty}(Z - EZ) := \sup_{t > 0} t^2 \Pr\big\{ \| Z - EZ \|_{L_1} > t \big\} < \infty.

A first consequence of this is that

    E \max_{1 \le i \le n} \frac{\| Z_i - EZ_i \|_{L_1}}{\sqrt{n}} = \frac{1}{\sqrt{n}} \int_0^{\infty} \Pr\Big\{ \max_{1 \le i \le n} \| Z_i - EZ_i \|_{L_1} > t \Big\} dt
      \le 1 + \frac{1}{\sqrt{n}} \int_{\sqrt{n}}^{\infty} n \Pr\big\{ \| Z_i - EZ_i \|_{L_1} > t \big\} dt
      \le 1 + \sqrt{n}\, \Lambda_{2,\infty}(Z - EZ) \int_{\sqrt{n}}^{\infty} t^{-2} dt
      = 1 + \Lambda_{2,\infty}(Z - EZ) < \infty.    (3.2)

A second consequence is that $EX = \|EZ\|_{L_1} < \infty$. To wit, if $t_0 < \infty$ is such that
$\Pr\{X > t_0\} \le 1/2$, then, since finiteness of $\Lambda_{2,\infty}$ obviously implies $E\|Z - EZ\|_{L_1} <
\infty$, we have

    \infty > E\|Z - EZ\|_{L_1} = E \int_0^{\infty} \big| I_{X > t} - \Pr\{X > t\} \big| \, dt
      \ge E \int_{t_0}^{\infty} \frac{1}{2}\, I_{X > t} \, dt = \frac{1}{2} E(X - t_0)^+,

so that $EX = t_0 + E(X - t_0) \le t_0 + E(X - t_0)^+ < \infty$. Now, Hoffmann-Jørgensen's
inequality gives that for every $r > 0$ there exist finite positive constants $c_i$, $i = 1, 2$,
depending only on $r$, such that

    E\Big\| \frac{ \sum_{i=1}^{n} (Z_i - EZ_i) }{\sqrt{n}} \Big\|_{L_1}^r \le c_1 \Big[ E \max_{1 \le i \le n} \frac{ \| Z_i - EZ_i \|_{L_1}^r }{ n^{r/2} } + t_{0,n}^r \Big],

where

    t_{0,n} = \inf\Big\{ t : \Pr\Big\{ \Big\| \frac{ \sum_{i=1}^{n} (Z_i - EZ_i) }{\sqrt{n}} \Big\|_{L_1} > t \Big\} \le c_2 \Big\}.

On one hand, the stochastic boundedness hypothesis implies $\sup_n t_{0,n} < \infty$, and
on the other, inequality (3.2) asserts the finiteness of the sup over $n$ of the first
summand on the right-hand side of the last display for $r = 1$. We thus conclude that

    \sup_n E\Big\| \frac{ \sum_{i=1}^{n} (Z_i - EZ_i) }{\sqrt{n}} \Big\|_{L_1} < \infty.    (3.3)

If now $\nu$ is a binomial $(n, p)$ r.v. then there exist positive finite constants $C_1$ and $C_2$
such that

    \mathcal{L}(\nu) = \mathrm{Bin}(n, p) \ \text{with} \ \frac{C_1}{n} \le p \le \frac{1}{2} \ \text{implies} \ E|\nu - E\nu| \ge C_2 \sqrt{np}

(this follows, for instance, from symmetrization and Corollary 3.4 in Giné and
Zinn, 1983). Applying this to the empirical process yields

    \sqrt{ \Pr\{X > t\} } \le \frac{1}{C_2} E\Big| \frac{ \sum_{i=1}^{n} (I_{X_i > t} - \Pr\{X_i > t\}) }{\sqrt{n}} \Big| \quad \text{for } \mathrm{med}(X) < t < Q(1 - C_1/n).

Then, integrating and applying inequality (3.3), we obtain

    \sup_n \int_{\mathrm{med}(X)}^{Q(1 - C_1/n)} \sqrt{ \Pr\{X > t\} } \, dt
      \le \sup_n \frac{1}{C_2} \int_{\mathrm{med}(X)}^{Q(1 - C_1/n)} E\Big| \frac{ \sum_{i=1}^{n} (I_{X_i > t} - \Pr\{X_i > t\}) }{\sqrt{n}} \Big| \, dt < \infty.

Since $Q(1 - C_1/n) \to \mathrm{ess\,sup}\, X$ as $n \to \infty$, this last inequality gives

    \int_0^{\infty} \sqrt{ \Pr\{X > t\} } \, dt < \infty,

that is, $\Lambda_{2,1}(X) < \infty$, proving the theorem. □
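The condition $\Lambda_{2,1}(X) < \infty$ is slightly stronger than a finite second moment and is easy to evaluate numerically. A sketch (ours, assuming scipy; the survival functions are illustrative choices):

    import numpy as np
    from scipy import integrate, stats

    def lambda21(sf):
        """Lambda_{2,1}(X) = int_0^infty sqrt(Pr{|X| > t}) dt for a survival function sf."""
        return integrate.quad(lambda t: np.sqrt(sf(t)), 0, np.inf, limit=200)[0]

    print(lambda21(lambda t: 2 * stats.norm.sf(t)))   # |X| for X ~ N(0,1): finite
    print(lambda21(lambda t: (1 + t) ** -3.0))        # Pareto-type tail: finite
    # A tail like t^{-2} (log t)^{-2} has a finite second moment, but its square
    # root is of order 1/(t log t), so Lambda_{2,1}(X) = infinity for such laws.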

We finally include here a different type of maximal inequality, the Birnbaum-
Marshall inequality for martingales. It can be used to bound the tails of weighted
supremum functionals of the empirical process, as we will see later.

Lemma 3.7. Let $\{|S_k|, \mathcal{F}_k\}_{0 \le k \le n}$ be a submartingale with $S_0 = 0$. Let $b_1 \ge \cdots \ge
b_n \ge b_{n+1} = 0$ and let $r \ge 1$. Call $M_n = \max_{1 \le k \le n} b_k |S_k|$. Then, for all $\lambda > 0$,

    P(M_n \ge \lambda) \le \frac{1}{\lambda^r} \sum_{k=1}^{n} (b_k^r - b_{k+1}^r) E|S_k|^r = \frac{1}{\lambda^r} \sum_{k=1}^{n} b_k^r \big[ E|S_k|^r - E|S_{k-1}|^r \big].
Proof. Assume without loss of generality $r = 1$ and $S_k \ge 0$. Set $A_k = \{ \max_{1 \le j < k} b_j S_j < \lambda \le b_k S_k \}$. Then

    P(M_n \ge \lambda) = \sum_{k=1}^{n} P(A_k) \le \frac{1}{\lambda} \sum_{k=1}^{n} \int_{A_k} b_k S_k \, dP = \frac{1}{\lambda} \sum_{k=1}^{n} \sum_{j=k}^{n} (b_j - b_{j+1}) \int_{A_k} S_k \, dP
      \le \frac{1}{\lambda} \sum_{k=1}^{n} \sum_{j=k}^{n} (b_j - b_{j+1}) \int_{A_k} S_j \, dP = \frac{1}{\lambda} \sum_{j=1}^{n} (b_j - b_{j+1}) \int_{\{M_j \ge \lambda\}} S_j \, dP
      \le \frac{1}{\lambda} \sum_{k=1}^{n} (b_k - b_{k+1}) E|S_k|. \qquad \square

Lemma 3.8 (Birnbaum and Marshall). Let $\{|S_t|, \mathcal{F}_t\}_{0 \le t \le \tau}$ be a submartingale with
right-continuous sample paths. Assume $S(0) = 0$ and $\nu(t) = ES^2(t) < \infty$ on $[0, \tau]$.
Let $q > 0$ be a nondecreasing and right-continuous function on $[0, \tau]$. Then

    P\Big( \sup_{0 \le t \le \tau} \frac{|S(t)|}{q(t)} \ge 1 \Big) \le \int_0^{\tau} \frac{1}{q(t)^2} \, d\nu(t).

Proof. By right-continuity of sample paths and $S(0) = 0$,

    P\Big( \sup_{0 \le t \le \tau} \frac{|S(t)|}{q(t)} \ge 1 \Big) = P\Big( \max_{0 < i \le 2^n} \frac{|S(i\tau/2^n)|}{q(i\tau/2^n)} \ge 1 \ \text{for some } n \Big)
      = \lim_{n \to \infty} P\Big( \max_{0 \le i \le 2^n} \frac{|S(i\tau/2^n)|}{q(i\tau/2^n)} \ge 1 \Big)
      \le \lim_{n \to \infty} \sum_{i=1}^{2^n} \frac{ E\big( S^2(i\tau/2^n) - S^2((i-1)\tau/2^n) \big) }{ q^2(i\tau/2^n) }
      = \int_0^{\tau} \frac{1}{q(t)^2} \, d\nu(t). \qquad \square
3.2. The Central Limit Theorem
The classical central limit problem consists in studying conditions for tightness
and weak convergence of sequences of laws of sums of small independent random
variables and determining their limits. The theory is built up around two main
results, the Gaussian and the Poisson convergence theorems. Here we collect the
main facts on this central limit problem, both for real and for Banach valued
random variables, and refer to [3] for a complete treatment of the subject.
If $\{X_n\}_{n=1}^{\infty}$ are i.i.d. centered r.v.s with unit variance, then the Lévy-
Lindeberg CLT states that

    \frac{X_1 + \cdots + X_n}{\sqrt{n}} \to_w N(0,1).

We might wonder about the possibility of obtaining different limiting distributions
(possibly at different rates) if we drop some of the assumptions. In the i.i.d. case but
without the finite variance assumption we have, for instance, that if $\{X_n\}_{n=1}^{\infty}$ are
i.i.d. Cauchy, then

    \frac{X_1 + \cdots + X_n}{n} =_d X_1

and we get, trivially, a different limit distribution.
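This is easy to see empirically; in the small sketch below (ours, assuming numpy) the quartiles of the sample mean of Cauchy variables do not shrink with $n$:

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.standard_cauchy(size=(20000, 400))
    print(np.percentile(x[:, 0], [25, 50, 75]))         # one Cauchy observation
    print(np.percentile(x.mean(axis=1), [25, 50, 75]))  # mean of n = 400: same law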
More generally we could consider the limiting distributions of sums of trian-
gular arrays of row-wise independent r.v.s, that is, we consider r.v.s $\{X_{n,k} : n \in
\mathbb{N}, 1 \le k \le k_n\}$, where $X_{n,1}, \ldots, X_{n,k_n}$ are independent, and call $S_n = \sum_{k=1}^{k_n} X_{n,k}$.
If we now consider, for instance, i.i.d. $X_{n,k}$, $k = 1, \ldots, n$, having Bernoulli distri-
bution with parameter $p_n$ such that $np_n \to \lambda \in (0, \infty)$, we have that $S_n$ converges
weakly to the Poisson distribution with mean $\lambda$.
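A numeric sanity check of this Poisson convergence (ours; it uses that the row sum $S_n$ is Binomial$(n, \lambda/n)$):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    n, lam = 10000, 3.0
    S = rng.binomial(n, lam / n, size=20000)   # S_n = sum of n Bernoulli(lambda/n)
    k = np.arange(9)
    emp = np.array([(S == v).mean() for v in k])
    print(np.c_[k, emp, stats.poisson.pmf(k, lam)])   # empirical vs Poisson pmf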
Obviously, without some restriction on the relative weight of the summands in
$S_n$ any kind of (trivial) limit distribution can result. The infinitesimality condition
ensures that this is not the case. We say that the triangular array $\{X_{n,k} : n \in
\mathbb{N}, 1 \le k \le k_n\}$ of row-wise independent random variables is infinitesimal if, for
every $\varepsilon > 0$,

    \max_{1 \le k \le k_n} P(|X_{n,k}| > \varepsilon) \to 0, \quad \text{as } n \to \infty.

It turns out that, under infinitesimality, the class of possible limit laws of $S_n - a_n$,
where the $a_n$ are possibly needed centering constants, is the class of the so-called
infinitely divisible laws. A law $\mu$ is said to be infinitely divisible if, for every $n$, it
can be expressed as an $n$th convolution power: $\mu = \nu_n^{*n}$ (that is, $\mu$ is the
law of the sum of $n$ i.i.d. r.v.s). Infinitely divisible laws can be characterized as
the convolution of a Gaussian law and a Poissonization of a Lévy measure:

    \mu = N(a, \sigma^2) * c\,\mathrm{Pois}\,\nu.

If $\nu$ is a finite measure on $\mathbb{R}$, then the associated Poissonization is

    \mathrm{Pois}\,\nu := e^{-\nu(\mathbb{R})} \sum_{n=0}^{\infty} \frac{\nu^{*n}}{n!}.

If $\nu$ is a positive measure, not necessarily finite, but it integrates $\min(1, x^2)$, then
the weak limit

    w\text{-}\lim_{\varepsilon \to 0} \Big( \delta_{-\int_{\varepsilon < |x| \le \delta} x \, d\nu(x)} * \mathrm{Pois}\big( \nu|_{\{|x| > \varepsilon\}} \big) \Big) := c\,\mathrm{Pois}\,\nu

exists and defines the (centered) Poissonization of $\nu$. Positive measures $\nu$ on the
real line satisfying $\int \min(1, x^2) \, d\nu(x) < \infty$ are called Lévy measures. Thus $c\,\mathrm{Pois}\,\nu$
is defined for Lévy measures.
The characteristic function of the infinitely divisible law $\mu = N(a, \sigma^2) *
c\,\mathrm{Pois}\,\nu$ is

    \hat\mu(t) = \exp\Big( ita - \frac{\sigma^2 t^2}{2} + \int \big( e^{itx} - 1 - itx\, I\{|x| \le \delta\} \big) \, d\nu(x) \Big).

We are now ready to state the CLT on the real line. We will denote $X_{n,k,\delta} =
X_{n,k} I\{|X_{n,k}| \le \delta\}$ and $S_{n,\delta} = \sum_{1 \le k \le k_n} X_{n,k,\delta}$.
Theorem 3.9. (CLT on the real line) Let $\{X_{n,k} : n \in \mathbb{N}, 1 \le k \le k_n\}$ be an infini-
tesimal array of row-wise independent real random variables and $\{a_n\}$ a sequence
of constants. Then $S_n - a_n$ converges weakly iff
(i) There exists a Lévy measure $\nu$ such that

    \sum_{1 \le k \le k_n} P(X_{n,k} > \varepsilon) \to \nu(\varepsilon, \infty), \qquad \sum_{1 \le k \le k_n} P(X_{n,k} < -\varepsilon) \to \nu(-\infty, -\varepsilon)

for every $\varepsilon > 0$ such that $\nu\{-\varepsilon, \varepsilon\} = 0$.
(ii) There exists $\sigma \ge 0$ such that

    \lim_{\delta \to 0} \limsup_n \sum_{1 \le k \le k_n} \mathrm{Var}(X_{n,k,\delta}) = \lim_{\delta \to 0} \liminf_n \sum_{1 \le k \le k_n} \mathrm{Var}(X_{n,k,\delta}) = \sigma^2.

(iii) If $\delta > 0$ satisfies $\nu\{-\delta, \delta\} = 0$ then $E(S_{n,\delta}) - a_n \to a_\delta$.
If this is the case, then

    S_n - a_n \to_w N(a_\delta, \sigma^2) * c\,\mathrm{Pois}\,\nu.

If we restrict our interest to triangular arrays coming from i.i.d. sequences,
that is, those arrays with $k_n = n$ and $X_{n,k} = X_k/a_n$ for some i.i.d. r.v.s $\{X_n\}_n$ and
a sequence of normalizing constants $\{a_n\}_n$, then the class of limiting distributions
reduces to the stable laws. A law $\mu$ is said to be stable if it is of the same type as
any of its $n$th convolution powers, that is, if $X_1, \ldots, X_n$ are i.i.d. $\mu$, then

    X_1 + \cdots + X_n =_d a_n X_1 + b_n.

The only possible constants $a_n$ in the above expression are of the type $a_n = n^{1/\alpha}$
for some $\alpha \in (0, 2]$. This $\alpha$ is called the stability index of the stable law $\mu$. The
case $\alpha = 2$ corresponds to the Gaussian law. A stable law with index $\alpha < 2$
is an infinitely divisible law without normal part and with Lévy measure $\nu(c_1, c_2; \alpha)$
defined by

    d\nu(c_1, c_2; \alpha)(x) = \begin{cases} c_1 x^{-\alpha-1} \, dx & \text{for } x > 0 \\ c_2 |x|^{-\alpha-1} \, dx & \text{for } x < 0 \end{cases}

for some $c_1, c_2 \ge 0$.
Assume $\{X_n\}_n$ is a sequence of i.i.d. r.v.s having law $\mu$. Then we say that $\mu$
belongs to the domain of attraction of the stable law $\tilde\mu$ if there exist constants
$a_n > 0$, $b_n \in \mathbb{R}$ such that

    \frac{X_1 + \cdots + X_n}{a_n} - b_n \to_w \tilde\mu.

Stable laws are the only laws having a nonvoid domain of attraction. Domains
of attraction can be characterized in terms of regular variation of the tails and
the truncated variances of a law. A function $L$ is regularly varying (at $\infty$) with
exponent $\beta \in \mathbb{R}$ if

    \lim_{t \to \infty} \frac{L(tx)}{L(t)} = x^{\beta}.

We say that it is slowly varying if the above exponent equals 0.
Theorem 3.10. (Domains of attraction on the line)
(i) A law $\mu$ belongs to the domain of attraction of the normal law iff the truncated
moment

    U(x) = \int_{-x}^{x} y^2 \, d\mu(y)

is slowly varying or, equivalently, iff

    x^2 \frac{ \mu((-x, x)^c) }{ U(x) } \to 0.

(ii) A law $\mu$ belongs to the domain of attraction of $\nu(c_1, c_2; \alpha)$, $\alpha < 2$, iff $\mu((-x, x)^c)$
is regularly varying of order $-\alpha$ and

    \frac{ \mu(x, \infty) }{ \mu((-x, x)^c) } \to \frac{c_1}{c_1 + c_2}, \qquad \frac{ \mu(-\infty, -x) }{ \mu((-x, x)^c) } \to \frac{c_2}{c_1 + c_2}.
We finish this section with some results on the CLT in separable Banach
spaces. We consider a triangular array $\{X_{n,k} : n \in \mathbb{N}, 1 \le k \le k_n\}$, where
$X_{n,1}, \ldots, X_{n,k_n}$ are independent $(B, \|\cdot\|)$-valued r.v.s, call $S_n = \sum_{k=1}^{k_n} X_{n,k}$ and
consider the problem of finding necessary and sufficient conditions for the weak
convergence of $S_n - a_n$ and of characterizing the possible limiting distributions.
The theory parallels to a great extent that for the real line. There are some im-
portant differences, though.
As in the real case, we say that the array is infinitesimal if

    \max_{1 \le k \le k_n} P(\|X_{n,k}\| > \varepsilon) \to 0, \quad \text{as } n \to \infty.

Under infinitesimality, the class of possible weak limits of $\mathcal{L}(S_n - a_n)$ is, still
in this new setup, the class of infinitely divisible laws (the probability measures
expressible as $n$th convolution powers for every $n$). Infinitely divisible laws, $\mu$, on
$B$ can be characterized as convolutions of Gaussian measures and Poissonizations
of Lévy measures:

    \mu = \gamma * c\,\mathrm{Pois}\,\nu.

The characterization of Lévy measures, though, is not so straightforward as on the
real line. A Lévy measure on $B$ is a $\sigma$-finite positive measure, $\nu$, for which there
exist $\delta > 0$ and a probability measure $\mu$ having characteristic functional

    \hat\mu(f) = \exp\Big( \int \big( e^{if(x)} - 1 - if(x)\, I\{\|x\| \le \delta\} \big) \, d\nu(x) \Big)

for all $f \in B'$ (here $B'$ denotes the topological dual of $B$). In this case we define
$c\,\mathrm{Pois}\,\nu := \mu$. Integrability of $\min(1, \|x\|^2)$ is neither a necessary nor a sufficient
condition for a measure to be a Lévy measure on a general separable Banach space.
The following result is a general CLT in Banach spaces.

Theorem 3.11. (CLT in separable Banach spaces) Let $\{X_{n,k}\}$ be infinitesimal.
Then $\mathcal{L}(S_n - a_n)$ is weakly convergent iff
(i) There exists a $\sigma$-finite measure $\nu$ with $\nu\{0\} = 0$ such that

    \sum_j \mathcal{L}(X_{n,j})\big|_{B_\delta^c} \to_w \nu\big|_{B_\delta^c}

for every $\delta > 0$ such that $\nu(\partial B_\delta) = 0$.
(ii) The limit

    \Phi(f) = \lim_{\delta \to 0} \limsup_n \sum_{1 \le k \le k_n} \mathrm{Var}(f(X_{n,k,\delta})) = \lim_{\delta \to 0} \liminf_n \sum_{1 \le k \le k_n} \mathrm{Var}(f(X_{n,k,\delta}))

exists and is finite for every $f \in B'$ (at least for $f \in W \subset B'$, $W$ weak-star
total).
(iii) There exists a (for all) sequence $\{F_k\}$ of finite-dimensional subspaces with
$B = \overline{\cup_k F_k}$ and $\delta > 0$, $p > 0$ (for all $\delta > 0$, $p > 0$) such that

    \lim_k \limsup_n E\, d^p(S_{n,\delta} - ES_{n,\delta}, F_k) = 0.

In that case,
(a) $\nu$ is a Lévy measure and there exists a centered Gaussian p.m. $\gamma$ such that
$\Phi(f) = \int f^2 \, d\gamma$, $f \in B'$.
(b) For every $\delta > 0$ such that $\nu(\partial B_\delta) = 0$,

    \mathcal{L}(S_n - ES_{n,\delta}) \to_w \gamma * c\,\mathrm{Pois}\,\nu.

In the particular case of a separable Hilbert space $(H, \langle \cdot, \cdot \rangle)$ the CLT can be
slightly simplified. In Hilbert space, Lévy measures are the positive Borel measures
integrating $\min(1, \|x\|^2)$. Condition (iii) in Theorem 3.11 can be replaced by
(iii') There exists a c.o.n.s. $\{\phi_i\}_{i \ge 1}$ in $H$ such that, for some (all) $\delta > 0$,

    \lim_k \limsup_n \sum_{j=1}^{k_n} E\Big[ \|X_{n,j,\delta} - EX_{n,j,\delta}\|^2 - \sum_{i=1}^{k} \langle X_{n,j,\delta} - EX_{n,j,\delta}, \phi_i \rangle^2 \Big] = 0.

Thus, convergence of sums to a Gaussian limit can be characterized by the
following theorem:
Theorem 3.12. (CLT in separable Hilbert spaces, normal case) Let $\{X_{n,k}\}$ be an
infinitesimal $H$-valued array. Then $\mathcal{L}(S_n - a_n)$ is weakly convergent to a Gaussian
limit iff
(i) For every $\varepsilon > 0$,

    \sum_j P(\|X_{n,j}\| > \varepsilon) \to 0.

(ii) The limit

    \Phi(h) = \lim_{\delta \to 0} \limsup_n \sum_{1 \le k \le k_n} \mathrm{Var}(\langle X_{n,k,\delta}, h \rangle) = \lim_{\delta \to 0} \liminf_n \sum_{1 \le k \le k_n} \mathrm{Var}(\langle X_{n,k,\delta}, h \rangle)

exists and is finite for every $h \in H$.
(iii) There exists a c.o.n.s. $\{\phi_i\}_{i \ge 1}$ in $H$ such that

    \lim_k \limsup_n \sum_{j=1}^{k_n} E\Big[ \|X_{n,j,\delta} - EX_{n,j,\delta}\|^2 - \sum_{i=1}^{k} \langle X_{n,j,\delta} - EX_{n,j,\delta}, \phi_i \rangle^2 \Big] = 0.

In that case there exists a centered Gaussian p.m. $\gamma$ such that $\Phi(h) = \int \langle x, h \rangle^2 \, d\gamma(x)$, $h \in H$,
and

    \mathcal{L}(S_n - ES_{n,\delta}) \to_w \gamma

for every $\delta > 0$.
The case of sequences of i.i.d. $H$-valued random variables becomes also sim-
pler in the Hilbert space setup. We conclude the subsection with the Lindeberg-
Lévy Theorem in Hilbert space. It is a very easy consequence of Theorem 3.12.

Theorem 3.13. Let $\{X_n\}$ be a sequence of centered i.i.d. $H$-valued r.v.s. Then

    \frac{X_1 + \cdots + X_n}{\sqrt{n}}

converges weakly iff $E\|X_1\|^2 < \infty$. In that case the limit law is centered Gaussian
with covariance

    \Phi(h) = E\langle X_1, h \rangle^2, \quad h \in H.
3.3. Strong approximations
As we said before, a different approach to proving limit theorems for the empirical
process on the line can be based on finding suitable versions of the empirical
process, $\alpha_n(t)$, and a sequence of Brownian bridges $B_n(t)$ such that

    \sup_{0 \le t \le 1} |\alpha_n(t) - B_n(t)| \to 0,    (3.4)

almost surely or in probability. The study of results of type (3.4), generically
known as strong approximations, began with the Skorohod embedding, consisting
of imitating the partial sum process by using a Brownian motion evaluated at
random times (see [8]). Successive refinements of this idea became one of the most
important methodologies in the research related to empirical processes. A major
breakthrough in this line was achieved by the so-called Hungarian construction due
to Komlós, Major and Tusnády. We summarize the main results in this section.
We assume that $\{X_n\}_n$ are i.i.d. centered r.v.s with unit variance and finite
moment generating function in a neighborhood of the origin, call $F$ their distri-
bution function and set $S_k = X_1 + \cdots + X_k$. We also assume that $\{Y_n\}_n$ are i.i.d.
standard normal and set $T_k = Y_1 + \cdots + Y_k$. Then:

Theorem 3.14. We can define, on the probability space on which $\{Y_n\}_n$ are defined,
a sequence $\{X_n\}_n$ of i.i.d. r.v.s with d.f. $F$ such that

    P\Big( \max_{1 \le k \le n} |S_k - T_k| > C \log n + x \Big) \le K e^{-\lambda x}, \quad x > 0, \ n \ge 1,

where $C, K, \lambda$ are constants depending only on $F$.


Theorem 3.14 is a deep result with a long, difficult proof. It has, though,
many important consequences for processes of interest in Statistics. A first, direct
result in this line can be obtained for the partial sum process:

    S^{(n)}(t) := \frac{1}{\sqrt{n}} \sum_{k=1}^{[nt]} X_k, \quad 0 \le t \le 1.

We will assume that $\{W(t)\}_{t \ge 0}$ is a Brownian motion and $W^{(n)}(t) = \frac{1}{\sqrt{n}} W(nt)$
(observe that $W^{(n)}(t)$ is itself a Brownian motion).

Theorem 3.15. We can define, on a sufficiently rich probability space, versions of
$\{X_n\}_n$ and $\{W(t)\}_{t \ge 0}$ such that

    P\Big( \sqrt{n} \sup_{0 \le t \le 1} \big| S^{(n)}(t) - W^{(n)}(t) \big| > C \log n + x \Big) \le K e^{-\lambda x},

where $C, K, \lambda$ are positive constants not depending on $n$.

Proof. It follows easily from Theorem 3.14 and the reflection principle for Brownian
motion. □
Csörgő and Révész [21] used this result to give a strong approximation for
the uniform quantile process:

    u_n(t) := \sqrt{n}\, \big( G_n^{-1}(t) - t \big), \quad 0 < t < 1,

where $G_n(t) = \frac{1}{n} \sum_{i=1}^{n} I(U_i \le t)$, $0 < t < 1$, and $\{U_n\}_n$ are i.i.d. uniform r.v.s on
$(0,1)$. In fact, the well-known distributional equality

    (U_{(1)}, \ldots, U_{(n)}) =_d \Big( \frac{S_1}{S_{n+1}}, \ldots, \frac{S_n}{S_{n+1}} \Big),

with $S_k = \xi_1 + \cdots + \xi_k$ and $\{\xi_k\}_k$ i.i.d. exponentials with mean 1, allows us to
assume

    G_n^{-1}(t) = \frac{S_k}{S_{n+1}}, \quad \text{if } \frac{k-1}{n} < t \le \frac{k}{n}.

Then, taking $X_j = \xi_j - 1$ and $\varepsilon_n = \frac{S_{n+1}}{n} - 1$, we have

    u_n\Big(\frac{k}{n}\Big) = \sqrt{n}\Big( \frac{S_k}{S_{n+1}} - \frac{k}{n} \Big)
      = (1+\varepsilon_n)^{-1} \frac{1}{\sqrt{n}}\Big( (S_k - k) - \frac{k}{n}(S_n - n) \Big) - (1+\varepsilon_n)^{-1} \frac{k}{n\sqrt{n}}\, \xi_{n+1}
      = (1+\varepsilon_n)^{-1} \Big( S^{(n)}\Big(\frac{k}{n}\Big) - \frac{k}{n} S^{(n)}(1) \Big) - (1+\varepsilon_n)^{-1} \frac{k}{n\sqrt{n}}\, \xi_{n+1}.

If we now define $B_n(t) = W^{(n)}(t) - t W^{(n)}(1)$, then the $B_n$ are Brownian bridges and

    u_n\Big(\frac{k}{n}\Big) - B_n\Big(\frac{k}{n}\Big) = \Big( S^{(n)}\Big(\frac{k}{n}\Big) - W^{(n)}\Big(\frac{k}{n}\Big) \Big) - \frac{k}{n}\big( S^{(n)}(1) - W^{(n)}(1) \big)
      - \frac{\varepsilon_n}{1+\varepsilon_n} \Big( S^{(n)}\Big(\frac{k}{n}\Big) - \frac{k}{n} S^{(n)}(1) \Big) - (1+\varepsilon_n)^{-1} \frac{k}{n\sqrt{n}}\, \xi_{n+1}.
Now,

    P\Big( \sqrt{n} \max_{1 \le k \le n} \Big| u_n\Big(\frac{k}{n}\Big) - B_n\Big(\frac{k}{n}\Big) \Big| > A \log n + t \Big)
      \le 2 P\Big( \sqrt{n} \sup_{0 \le t \le 1} \big| S^{(n)}(t) - W^{(n)}(t) \big| > \frac{A \log n + t}{5} \Big)
      + P\Big( (1+\varepsilon_n)^{-1} \xi_{n+1} > \frac{A \log n + t}{5} \Big)
      + 2 P\Big( \sup_{1 \le k \le n} |S_k - k| > \Big( \frac{n(A \log n + t)}{5} \Big)^{1/2} \Big)
      + 2 P\Big( |\varepsilon_n| > \Big( \frac{A \log n + t}{5n} \Big)^{1/2} \Big).

Using Theorem 3.15 and some other standard techniques we can give exponential
bounds for all the terms on the right-hand side of this last inequality to conclude (see
[21] for details):

Theorem 3.16. We can define, on a sufficiently rich probability space, versions of
$\{U_n\}_n$ and $\{W(t)\}_{t \ge 0}$ such that for every $n \ge 1$ and $|x| \le c\sqrt{n}$,

    P\Big( \sqrt{n} \sup_{0 \le t \le 1} |u_n(t) - B_n(t)| > C \log n + x \Big) \le K e^{-\lambda x}.
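The distributional equality behind this construction is easy to verify by simulation; in this sketch (ours, with arbitrary sample sizes) the normalized exponential partial sums reproduce the moments of the uniform order statistics:

    import numpy as np

    rng = np.random.default_rng(7)
    n, reps = 5, 100000
    U = np.sort(rng.uniform(size=(reps, n)), axis=1)        # (U_(1), ..., U_(n))
    S = np.cumsum(rng.exponential(size=(reps, n + 1)), axis=1)
    R = S[:, :n] / S[:, n:]                                 # (S_1/S_{n+1},...,S_n/S_{n+1})
    print(U.mean(axis=0), R.mean(axis=0))                   # both approx k/(n+1)
    print(U.std(axis=0), R.std(axis=0))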

Komlós, Major and Tusnády [55, 56] also gave a construction, similar to the
one in Theorem 3.14, for the uniform empirical process:

Theorem 3.17. We can define, on a sufficiently rich probability space, a sequence
of Brownian bridges $\{B_n(t) : 0 \le t \le 1\}$ such that

    P\Big( \sup_{0 \le t \le 1} |\alpha_n(t) - B_n(t)| > n^{-1/2}(x + C \log n) \Big) \le K \exp(-\lambda x)    (3.5)

for some absolute positive constants $C, K, \lambda$.

The best constants in (3.5) are $C = 12$, $K = 2$, $\lambda = 1/6$, see [9].
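To get a feeling for what (3.5) with these constants delivers, the sketch below (ours) computes the 99% coupling bound $n^{-1/2}(x + 12 \log n)$ with $x$ solving $2e^{-x/6} = 0.01$:

    import numpy as np

    # With C = 12, K = 2, lambda = 1/6 in (3.5):
    #   sup_t |alpha_n(t) - B_n(t)| <= n^{-1/2} (x + 12 log n)
    # holds with probability at least 1 - 2 exp(-x/6).
    x = 6 * np.log(2 / 0.01)                 # 2 exp(-x/6) = 0.01  =>  x ~ 31.8
    for n in (10 ** 2, 10 ** 4, 10 ** 6):
        print(n, (x + 12 * np.log(n)) / np.sqrt(n))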
While the above theorems give the best possible rates of approximation of
the empirical or the quantile process by Brownian bridges, there was still room for
some improvement through a careful study of the approximation at the tails. The
following results come from [17]. They have important applications in the study
of weighted norms of empirical and quantile processes.

Theorem 3.18. We can define, on a sufficiently rich probability space, a sequence
$\{U_n\}_n$ and Brownian bridges $\{B_n(t) : 0 \le t \le 1\}$ such that

    P\Big( \sup_{0 \le t \le d/n} |\alpha_n(t) - B_n(t)| > n^{-1/2}(x + C \log d) \Big) \le K \exp(-\lambda x)    (3.6)

    P\Big( \sup_{1-d/n \le t \le 1} |\alpha_n(t) - B_n(t)| > n^{-1/2}(x + C \log d) \Big) \le K \exp(-\lambda x)    (3.7)

for some absolute positive constants $C, K, \lambda$.

Theorem 3.19. We can define, on a sufficiently rich probability space, a sequence
$\{U_n\}_n$ and Brownian bridges $\{B_n(t) : 0 \le t \le 1\}$ such that

    P\Big( \sup_{0 \le t \le d/n} |u_n(t) - B_n(t)| > n^{-1/2}(x + C \log d) \Big) \le K \exp(-\lambda x)    (3.8)

    P\Big( \sup_{1-d/n \le t \le 1} |u_n(t) - B_n(t)| > n^{-1/2}(x + C \log d) \Big) \le K \exp(-\lambda x)    (3.9)

for some absolute positive constants $C, K, \lambda$.

3.3.1. Weighted approximations of empirical and quantile processes. We will see
here how the improved construction in [17] can be used to give weighted approxi-
mations of empirical and quantile processes.

Theorem 3.20. We can define, on a sufficiently rich probability space, a sequence
of Brownian bridges $\{B_n(t) : 0 \le t \le 1\}$ such that

    \sup_{0 \le t \le 1} |\alpha_n(t) - B_n(t)| = O(n^{-1/2} \log n) \quad \text{a.s.}    (3.10)

and

    n^{\nu - 1/2} \sup_{\lambda/n \le t \le 1-\lambda/n} \frac{|\alpha_n(t) - B_n(t)|}{(t(1-t))^{1-\nu}} = O_P(1)    (3.11)

for all $0 < \nu \le 1/2$ and $0 < \lambda < \infty$.

Proof. We use the construction in Theorem 3.18. (3.10) follows from (3.6) (taking
$d = n$ and $x = (2/\lambda)\log n$) and the Borel-Cantelli lemma:

    P\Big( \sup_{0 \le t \le 1} |\alpha_n(t) - B_n(t)| > n^{-1/2} (2/\lambda + C) \log n \Big) \le \frac{K}{n^2}.

To show (3.11), take first $\lambda = 1$ and $0 < \nu \le 1/2$ and call

    \Delta^{(1)}_{n,\nu} := n^{\nu-1/2} \sup_{1/n \le t \le 1} \frac{|\alpha_n(t) - B_n(t)|}{t^{1-\nu}},
    \Delta^{(2)}_{n,\nu} := n^{\nu-1/2} \sup_{0 \le t \le 1-1/n} \frac{|\alpha_n(t) - B_n(t)|}{(1-t)^{1-\nu}}.

It is enough to prove that $\Delta^{(i)}_{n,\nu} = O_P(1)$, $i = 1, 2$. By symmetry it reduces to
showing that $\Delta^{(1)}_{n,\nu} = O_P(1)$. Now, for $e < d < \infty$ let $d_i = id$, $i = 1, 2, \ldots, i_n - 1$, and
$d_{i_n} = n$, where $i_n = \max\{ i : d_{i-1} \le n \}$. Set $I_1 = [1/n, d_1/n]$, $I_i = [d_{i-1}/n, d_i/n]$,
$i = 2, \ldots, i_n$, and $\eta_{n,i} = \sup\{ |\alpha_n(t) - B_n(t)| : t \in [0, d_i/n] \}$, $i = 1, \ldots, i_n$. Now

    P\big( \Delta^{(1)}_{n,\nu} > (C+1)\log d \big) \le P\big( \eta_{n,1} > n^{-1/2}(C+1)\log d \big)
      + \sum_{i=2}^{i_n} P\big( \eta_{n,i} > n^{-1/2}(C+1) d_{i-1}^{1-\nu} \big) =: \sum_{i=1}^{i_n} P_{i,n}(d).

Hence, using (3.6) and the fact that $d_{i-1}^{1-\nu} \ge \log d_i$ for $d$ large enough, we have,
for large $d$, $P_{1,n}(d) \le K \exp(-\lambda \log d) = K d^{-\lambda}$ and $P_{i,n}(d) \le K \exp(-\lambda d_{i-1}^{1-\nu}) \le
K((i-1)d)^{-2}$. Thus

    \sum_{i=1}^{i_n} P_{i,n}(d) \le K \frac{1}{d^{\lambda}} + K \frac{1}{d^2} \sum_{i=1}^{\infty} \frac{1}{i^2},

which can be made arbitrarily small by taking $d$ large enough. This shows $\Delta^{(1)}_{n,\nu} =
O_P(1)$ for $\lambda = 1$. To prove this for general $\lambda$ it suffices to cover the range $[\lambda/n, 1/n]$
for fixed $0 < \lambda < 1$, but

    n^{\nu-1/2} \sup_{\lambda/n \le t \le 1/n} \frac{|\alpha_n(t)|}{t^{1-\nu}} \le \lambda^{\nu-1} \big( nG_n(1/n) + 1 \big) = O_P(1),

since $nG_n(1/n)$ converges weakly to a Poisson r.v., and

    n^{\nu-1/2} \sup_{\lambda/n \le t \le 1/n} \frac{|B_n(t)|}{t^{1-\nu}} \le \lambda^{\nu-1} n^{1/2} \sup_{0 \le t \le 1/n} |B_n(t)| =_d \lambda^{\nu-1} \sup_{0 \le t \le 1} \Big| W(t) - \frac{t}{n} W(n) \Big|
      = O_P(1). \qquad \square

Corollary 3.21. We can define, on a sufficiently rich probability space, a sequence
of Brownian bridges $\{B_n(t) : 0 \le t \le 1\}$ such that

    n^{\nu-1/2} \sup_{U_{1:n} \le t \le U_{n:n}} \frac{|\alpha_n(t) - B_n(t)|}{(t(1-t))^{1-\nu}} = O_P(1)    (3.12)

for all $0 < \nu \le 1/2$.

Corollary 3.22. We can dene, on a suciently rich probability space, a sequence


of Brownian bridges {Bn (t) : 0 t 1} such that
|n (t) Bn (t)|
n1/2 sup = OP (1) (3.13)
0<t<1 (t(1 t))
for all 0 < < 1/2.

The proof of the corresponding result for the quantile process is similar to
the proof of Theorem 3.20 and, hence, omitted.
26 E. del Barrio

Theorem 3.23. We can dene, on a suciently rich probability space, a sequence


of Brownian bridges {Bn (t) : 0 t 1} such that
sup |un (t) Bn (t)| = O(n1/2 log n) a.s. (3.14)
0t1

and
|un (t) Bn (t)|
n1/2 sup = OP (1) (3.15)
(t(1 t))
n t1 n

for all 0 < 1/2 and 0 < < .


From Inequality 3.8 and the fact that {n (t)/(1 t)}t is a martingale we
obtain the following bound

  1 1
P
n /q
0 x 2

dt.
x 0 q(t)2
And this already gives a simple result on convergence of n in weighted sup metrics:
Theorem 3.24. Suppose that q is symmetric about 1/2, positive and nondecreasing
on [0, /1/2] and satises
 1
1
2
dt < .
0 q(t)
Then there exists, on a suciently rich probability space, versions of n and a
Brownian bridge such that  
 n B 
  P 0.
 q 

There is room for improvement in this Theorem. In fact, niteness of


B/q

does not require integrability of 1/q 2 . We will see next how can we characterize
niteness of
B/q
.
Lemma 3.25. The following statements are equivalent:
(i) lim supt0 |W (t)|/q(t) <
(ii) lim supt0 |B(t)|/q(t) <
(iii) lim supt0 |W (t)|/q(t) = for some 0 <
(iv) lim supt0 |B(t)|/q(t) = for some 0 <
Proof. It follows easily from Blumenthals 0-1 law. 

 
q: inf t1/2 q(t) > 0 for all > 0,
F C0 :=
q nondecreasing in a neighbourhood of 0
 1/2
E(q, c) = s3/2 q(s) exp(cq 2 (s)/s)ds
0
 1/2
1
I(q, c) = exp(cq 2 (s)/s)ds
0 s
Empirical and Quantile Processes 27

Lemma 3.26. Assume q F C0 .


(i) If I(q, c) < then E(q, d) < for any d > c and q(s)/s as t 0.
(ii) If E(q, c) < and q(t)/t1/2 as t 0 then I(q, c) < .
Proof. (ii) Call C = inf 0<t1/2 q(t)/t1/2 > 0. Then E(q, c) CI(q, c).
(i) If I(q, c1 ) < then I(q, c2 ) < for any c2 > c1 . Hence we can assume c > 1.
Fix small enough so that q is non-decreasing on (0, ]. It t (0, /c] then
 ct  ct  ct
1 1 1
exp(cq 2 (s)/s)ds exp(cq 2 (s)/t)ds exp(cq 2 (ct)/t)ds
t s t s t s
= (log c) exp(c2 q 2 (ct)/(ct)).
Hence q(t)/t1/2 . We also have, for d c > 0 and small enough , q(t)/t1/2
exp((d c)q 2 (t)/t) for t (0, ].Consequently,
 
1 1
(q(t)/t ) exp(dq (t)/t)dt
1/2 2
exp(cq 2 (t)/t)dt.
0 t 0 t
This completes the proof. 
 1/2
Lemma 3.27. If q F C0 and 0
1/q 2 (t)dt < then I(q, c) < for all c > 0.
Proof. For all x, c > 0 we have that x exp(cx) 1/(ec). Therefore,
 1/2  1/2
1 1 q 2 (t)
exp(cq 2 (t)/t)dt = 2
exp(cq 2 (t)/t)dt
0 t 0 q (t) t

1 1/2
1/q 2 (t)dt. 
ec 0
Theorem 3.28. Assume q F C0 . Then lim supt0 |W (t)|/q(t) = for some 0
< i 
< for any c > 2 /2
I(q, c) .
= for any c < 2 /2
Proof. We assume that I(q, c) < and show that lim supt0 |W (t)|/q(t)
(2c)1/2 . Assume f is an increasing function on (0, b] and let 0 < a = t0 < t1 <
< tn = b 1. Then
P (W (t) > f (t) for some t [a, b]) P (Tq(a)a )

n
+ P (tm1 < Tq(tm1 ) tm )
m=1
 a
(2t3 )1/2 f (a) exp(f 2 (a)/(2t))dt
0
n 
 tm 2

+ (2t3 )1/2 f (tm1 ) exp f (t2t


m1 )
dt
m=1 tm1
28 E. del Barrio

We x now some > 1, take a = /n and set tm = (1/)nm b. We have now


that, for t [tm1 , tm ]
1/2
t1/2 f (tm1 ) exp(f 2 (tm1 )/(2t)) f (tm1 )tm1 exp(f 2 (tm1 )/(2tm1 )).
Take > then x exp(x2 /(2)) exp(x2 /(2)) for large x. If we dene f (t) =
(22 c)1/2 q(t) then, for small b (recall that q(t)/t1/2 as t 0) we get that,
for t [tm1 , tm ]
t1/2 f (tm1 ) exp(f 2 (tm1 )/(2t)) exp(f 2 (tm1 )/(2tm1 ))
exp(f 2 (tm1 )/(2t))
exp(f 2 (t/)/(2t))
exp(f 2 (t/)/(2t)).
Taking n we obtain that
 b
1 1
P (W (t) > (2 c) /2q(t) for some t (0, b])
2 1
exp(cq 2 (t))dt.
(2)1/2 0 t
Hence, (letting b 0)
lim sup |W (t)|/q(t) (22 c)1 /2a.s.
t0

Since can be taken arbitrarily close to 1, this proves our claim. It remains to be
shown that, if lim supt0 |W (t)|/q(t) = for some 0 < then I(q, c) <
for any c > 2 /2. The proof can be found in [19], pp. 181188. 
We summarize the consequences of the above results for the niteness of

B/q
in the following corollary. Now F C0,1 is the class of functions nonde-
creasing on a neighborhood of 0, nonincreasing in a neighborhood of 1 and such
that
inf q(x) > 0, (0, 1/2).
<x<1

Corollary 3.29. Assume q F C0,1 , then the following statements are equivalent:
(i) sup0<t<1 |B(t)|/q(t) < a.s.
(ii) For some c > 0
 1
1

I(q, c) := exp(cq 2 (s)/(s(1 s)))ds
0 s(1 s)

(iii) limt0,1 q(t)/(t(1 t))1/2 = 1 and, for some c > 0


 1/2
E(q, c) := (s(1 s))3/2 q(s) exp(cq 2 (s)/(s(1 s)))ds < .
0

With this results we can prove the Chibisov-OReilly Theorem.


Theorem 3.30. (Weak convergence of the empirical process in weighted sup norm).
If q F C0,1 the following statements are equivalent:
Empirical and Quantile Processes 29

(i) There is a sequence of Brownian bridges {Bn (t), 0 t 1} such that


|n (t) Bn (t)|
sup = oP (1).
0<t<1 q(t)
(ii) For all c > 0
 1
1
c) :=
I(q, exp(cq 2 (s)/(s(1 s)))ds <
0 s(1 s)
Proof. We show rst that (ii) implies (i). Fix (0, 1/2). Then
|n (t) Bn (t)| t1/2 |n (t) Bn (t)|
sup sup sup
U1:n tUn:n q(t) 0<t q(t) U1:n tUn:n t1/2
(1 t)1/2 |n (t) Bn (t)|
+ sup sup
1t<1 q(t) U1:n tUn:n (1 t)1/2
1 |n (t) Bn (t)|
+ sup sup = oP (1).
t1 q(t) 0t1 t1/2
Now, observing that nU1:n is tight (it has an exponential limiting distribution) we
have
|n (t)| t t1/2
sup = n1/2 sup (nU1:n )1/2 sup = oP (1)
0<t<U1:n q(t) 0<t<U1:n q(t) 0<t<U1:n q(t)

and, by Theorem 3.28 also sup0<t<U1:n |Bq(t)


n (t)|
= oP (1). With similar results for
the upper tail we complete the proof of (i).
Assume now that (i) holds true. We see rst that, in this case,
lim q(t)/t1/2 =
t0

(a similar results holds also for the upper tail). Otherwise we can take sequence
1/2
tk 0 such that ktk (0, ) and limk q(tk )/tk = < . Then,
lim P (|Bk (tk ) + k 1/2 tk |/q(tk ) <
) = (
1/2 ) (
1/2 ).
k

Now, if k = ktk , k /k 0 and


lim inf P (|k (tk ) + k 1/2 tk |/q(tk ) <
) lim P (Gk (k /k) = 0) = e .
k k
Hence,
(
1/2 ) (
1/2 ) e ,
for all
> 0, a contradiction. The remainder of the proof follows easily from
this. 
In fact, with the aid of weighted approximations we can prove this renement
of the Chibisov-OReilly Theorem:
Theorem 3.31 (Asymptotic distributions of Kolmogorov-Smirnov-type statistics).
If q F C0,1 the following statements are equivalent:
30 E. del Barrio

|n (t)| |B(t)|
(i) sup sup
0<t<1 q(t) w 0<t<1 q(t)
(ii) For some c > 0
 1
1
c) :=
I(q, exp(cq 2 (s)/(s(1 s)))ds <
0 s(1 s)
Proof. Finiteness of the limit in (i) implies (ii). Thus, it suces to show that (ii)
implies (i). Under (ii) we have that limt0 q(t)/t1/2 = limt1 q(t)/(1 t)1/2 =
and from this we get
|n (t)| |Bn (t)|
sup = sup + oP (1). 
0<t<1 q(t) U1:n <t<Un:n q(t)

4. Testing fit to a family of distributions


We consider in this section the problem of testing whether the underlying d.f.
of the sample, F , belongs to a given family of distribution functions, F . We will
assume F is a parametric family, i.e.,
F = {F (, ), },
where is some open set in Rd . F 1 (, ) is the quantile function associated to
F (, ).
Perhaps, the most interesting case appears when F is the Gaussian family.
It seems that the rst statistics for detecting possible departures from normality
were introduced in [43, 71, 107] and were basedon the study of the standardized
third and fourth moments, usually denoted by b1 and b2 , respectively.
To strengthen these procedures, some omnibus tests, that intended to take
into account simultaneously both features, were proposed. For instance,in [72] the
K 2 and the R tests, consisting
of handling two suitable
functions of the b1 and b2
statistics, namely, K 2 = K( b1 , b2 ) and R = R( b1 , b2 ), were introduced. In that
paper a Monte Carlo study comparing those tests to the most popular normality
tests was accomplished. The authors select many alternative distributions and the
power of both tests seems to be similar to that of the competitor ones.
However, tests based on kurtosis and skewness are not too reliable because
they are based on properties which do not characterize Gaussian distributions.
For instance, [1] exhibits a sequence of distributions {Pk } which converges to the
standard Gaussian distribution while the kurtosis of Pk goes to innity. Thus,
if we consider a random sample obtained from Pk , the bigger the index k, the
greater chance to reject the normality of the sample. On the opposite side, some
examples of symmetric distributions, with shapes very far from normality (some
of them even multimodal),and 2 = 3 are known (see, for instance, [4, 52]). As a
consequence, none of the b1 , b2 , K 2 or R tests detects the non-normality of the
parent distribution in all cases.
Empirical and Quantile Processes 31

Other tests of normality are the u-test [25], based on the ratio between the
range and the standard deviation in the sample, and the a-test [45], which studies
the ratio of the sample mean to the standard deviation. These tests are broadly
considered as not being too powerful against a wide range of alternatives (although
it is known that the u-test has good power against alternatives with light tails
[85]; in fact, see [99, 100], the u-test is the most powerful against the uniform
distribution while the a-test is the most powerful against the double exponential
distribution).
For these reasons, some other tests, focusing on features that characterize
completely (or, at least, more completely) the family under consideration, have
been proposed. These tests can be divided, broadly speaking, into three categories.
A rst, more general, category consists of tests that adapt other tests devised in
the xed-distribution setup. When we specialize on location scale families, new
types of tests, that try to take advantage of the particular structure of F , can be
employed. Tests based on the analysis of probability plots, usually referred to as
correlation and regression tests, lie in this class. A third category, whose represen-
tatives combine some of the most interesting features exhibited by goodness-of-t
tests lying in the rst two categories, is composed of tests based on a suitable
L2 -distance between the empirical quantile function and the quantile functions of
the distributions in F , the so-called Wasserstein distance.
Tests based on Wasserstein distance are related to tests in the rst category
in the sense that all of them depend on functional distances. On the other hand,
it happens that the the study of Wasserstein-tests gives some hints about several
properties of the probability plot-tests. Both facts have led us to present them
separately. Our approach will try to show that tests based on Wasserstein distance
provide the right setup to apply the empirical and quantile process theory to study
probability plot-based tests.

4.1. Adaptation of tests coming from the fixed-distribution setup


All the procedures considered in Section 2 were based on measuring some distance
between a distribution obtained from the sample and a xed distribution. A way to
adapt this idea for the new setup consists of choosing some adequate estimator of
(assuming the null hypothesis is true) and, then, replacing the xed distribution
by F (, ). This simple idea was suggested by Pearson for his 2 -test (a good survey
on 2 -tests is [67]). That is, Pearson suggested to use the statistic

k
(Oj npj ())2
2
= ,
j=1 npj ()
where pj () denotes the probability, under F (, ), that X1 falls into cell j.
Though, he did not realize the change in the asymptotic distribution of 2 due
to the estimation of parameters. It was Fisher, in the 20s, who pointed out that
the limiting distribution of 2 depends on the method of estimation and showed
that, under regularity conditions, if is the maximum likelihood estimator of
32 E. del Barrio

from the grouped data (O1 , . . . , Ok ), then 2 has asymptotic 2kd1 distribution
(see, e.g., [13] for a detailed review of Pearsons and Fishers contributions).
Fisher also observed that estimating from the grouped data instead of us-
ing the complete sample (e.g., by estimating from the complete likelihood) could
produce a loss of information resulting in lack of power. Further, estimating from
the original data is often computationally simpler. Fisher studied the asymptotic
distribution of 2 when is unidimensional and is its maximum likelihood esti-
mator from the ungrouped data. His result was extended by Cherno and Lehmann
in [11] for a general d-dimensional parameter showing that, under regularity con-
ditions (essentially conditions to ensure the consistency and asymptotic normality
of the maximum likelihood estimator),

w

kd1 
k1
2 Yj2 + j Yj2 , (4.1)
j=1 j=kd

where Yj are i.i.d. standard normal r.v.s and i [0, 1] and may depend on the
parameter . This dependence is a serious drawback for the use of 2 for testing t
to some families of distributions, the normal family being one of them (see [11]).
The practical use of 2 for testing t presented another diculty: the choice
of cells. The asymptotic 2k1 distribution of Pearsons statistic was a consequence
of the asymptotic normality of the cell frequencies. A cell with a very low expected
frequency would cause a very slow convergence to normality and this could result
in a poor approximation of the distribution of 2 . This (somehow oversimplifying)
observation led to the diusion of rules of thumb such as use cells with number of
observations at least 10. Hence, combining neighboring cells with few observations
became a common practice (see, e.g., [13]).
From a more theoretical point of view, in the setup of testing t to a xed
distribution, Mann and Wald [63] and Gumbel [47] suggested to use equally likely
intervals under the null hypothesis as a reasonable way to reduce the arbitrariness
in the choice of cells (this choice oers some good properties, for instance, it makes
the 2 test unbiased, see, e.g., [14]). Trying to adapt this idea to the case of testing
t to parametric families poses the problem that dierent distributions in the null
hypothesis lead to dierent partitions into equiprobable cells. A natural solution
to this problem is choosing for cells equally likely intervals under F (, ), where
is some suitable estimator of . A consequence of this procedure is that, again, the
cells are chosen at random.
Allowing the cells to be chosen at random introduces a deep modication
on the statistical structure of 2 because the distribution of the random vector
(O1 , . . . , Ok ) is no longer multinomial; remarkably, however, it can, in some im-
portant cases, eliminate the dependence on the parameter of the asymptotic dis-
tribution in (4.1). Watson ([104], [105]) noted that if is the maximum likelihood
estimator of (from the ungrouped data) and cell j has boundaries F 1 ( j1 k , )
1 j
and F ( k , ), then the convergence in (4.1) remains true. Further, if F is a lo-
cation scale family, then the i s do not depend on , but only on the family F .
Empirical and Quantile Processes 33

As a consequence, an improved 2 method could be used for testing normality or


exponentiality.
The development of the theory of weak convergence in metric spaces pro-
vided valuable tools for further insights in 2 -testing. Using the weak convergence
of the empirical process in D[0, 1], Moore obtained in [66] a short rigorous proof
of Watsons result which was also valid for multivariate observations and random
rectangular cells. Later Pollard [73], using a general Central Limit Theorem for
empirical measures due to Dudley [38], extended the result to very general ran-
dom cells under the mild assumption that these random cells were chosen from a
Donsker class.
Despite the fact that all these theoretical contributions have widely spread
the applicability and reliability of 2 -tests, the limitations of this procedure, noted
when testing t to a xed distribution, carry over to the case of testing t to a
family (see, e.g., [94] or [96]).
The use of supremum or quadratic statistics based on the empirical d.f. with
parameters estimated from the data could provide more powerful tests, just as in
the xed distribution setup. The adaptation of Wn2 or Kn to this situation can be
easily carried out. Let n be some estimator of . We can dene the statistics

Wn2 () = n (F (x; n ))(Fn (x) F (x; n ))2 dF (x; n )

and
|Fn (x) F (x; n )|
Kn () = n sup ,
<x< (F (x; n ))
and use them as statistical tests, rejecting the null hypothesis when large values of
Wn2 () or Kn () are observed. Though, it took a long time until these statistics
were considered as serious competitors to the 2 -test; little was known about these
versions of Cramer-von Mises or Kolmogorov-Smirnov tests until the 50s (see, e.g.,
[13]).
The property exhibited by Wn2 and Kn of being distribution free does not
carry over to Wn2 () or Kn (). If we set Zi = F (Xi ; n ) and Gn (t) denotes the
empirical d.f. associated to Z1 , . . . , Zn then, obviously,
 1
Wn2 () = n (t)(Gn (t) t)2 dt (4.2)
0
|Gn (t) t|
Kn () = n sup (4.3)
0<t<1 (t)
but, unlike in the xed distribution case, Z1 , . . . , Zn are not i.i.d. uniform r.v.s.
However, in some important cases the distribution of Z1 , . . . , Zn does not
depend on , but only on F . In those cases, the distribution of Wn2 () or Kn () is
parameter free. This happens if F is a location scale family and n is an equivariant
estimator, a fact noted by David and Johnson [26]. Therefore Wn2 () or Kn ()
can be used in a straight forward way as test statistics in this situation. Lilliefors
34 E. del Barrio

[59] took advantage of this property and, from a simulation study, constructed his
popular table for using the Kolmogorov-Smirnov statistic when testing normality.
The rst attempt to derive the asymptotic distribution of any statistic of
Wn2 () or Kn () type was due to Darling [24]. His study concerned the Cramer-
von Mises statistic
  1
Wn2 = n (Fn (x) F (x; n ))2 dF (x; n ) = n (Gn (t) t)2 dt, (4.4)
0

assuming that was one-dimensional. Let us dene


  2

Hn := n Fn (x) F (x; ) (n ) F (x; ) dF (x; )

 1
 2
= n(Gn (t) t) Tn g(t) dt,
0

where Tn = n(n ), and

g(t) = g(t, ) =F (x; )|x=F 1 (t;) . (4.5)

Darlings approach was based on showing that, when the underlying dis-
tribution of the sample is F (, ) and F and satisfy some adequate regularity
conditions, then
Wn2 Hn = oP (1). (4.6)
Thus, the asymptotic distribution of Wn2 can be studied
through that of Hn .
Darling showed that the nite-dimensional distributions of n(Gn (t) t) Tn g(t)
converge weakly to those of a Gaussian process Y (t) with covariance function
K(s, t) = s t st (s)(t), where (t) = g(t) and 2 is the asymptotic
variance of Tn . He showed, further, that, under some additional assumptions on
n , Donskers invariance principle could be applied to conclude that
 1
2 w
Wn (Y (t))2 dt,
0
1
and, as in the xed distribution case, a Karhunen-Loeve expansion for 0 (Y (t))2 dt
can provide a good way to tabulate the limiting distribution of Wn2 . Sukhatme [98]
extended Darlings result to multidimensional parameters and gave very valuable
information for the Karhunen-Loeve expansion of the limiting Gaussian process.

Instead of considering the process { n(Gn (t) t) Tn g(t)}t , a direct study

of the estimated empirical process, { n(Gn (t) t)}t , could yield the asymptotic
distribution of general Wn2 () and Kn () statistics (recall (4.2) and (4.3)) without
having to rely on a dierent asymptotic equivalence as in (4.6) for every dierent
statistic. Kac, Kiefer and Wolfowitz [51] were the rst in studying this estimated
empirical process in a particular case: if we are testing t to the family of normal
distributions N (, 2 ) and we estimate = (, 2 ) by n = (Xn , Sn2 ), then the
Empirical and Quantile Processes 35


nite-dimensional distributions of { n(Gn (t) t)}t converge weakly to those of a
centered Gaussian process Z(t) with covariance function
1 1 1 1 (s) 1 (t)
K(s, t) = s t st , (4.7)
(1 (s)) (1 (t)) 2 (1 (s)) (1 (t))
where denotes the density function of the standard normal distribution, its d.f.
and 1 is the quantile inverse of (notice that the dierence between Darlings
result and (4.7) is the introduction of an extra term corresponding to the second
parameter to be estimated). Although they did not prove weak convergence of the
estimated empirical process itself, they used this result (combined with a particular
w 1
invariance result due to Kac) to conclude that Wn2 0 (Z(t))2 dt, providing thus
the asymptotic distribution of the Cramer-von Mises test of normality.
4.2. The empirical process with estimated parameters
A general study of the weak convergence of the estimated empirical process was
carried out by Durbin [39]. We present here an approach to his main results using
strong approximations. We will assume F is a parametric family,
F = {F (, ), },
where is some open set in R . The empirical process with estimated parameters is
k


nn (x) = n(Fn (x) F (x, n ), x R,
where n is a sequence of estimators.
We will assume this sequence to be ecient in the sense that
1 
n

n(n ) = l(Xi ; ) + oP (1),
n i=1
where l(X1 ; ) is centered and has nite second moments.

Example. Suppose F (x, ) F has density f (x, ) = F


x (x, ). Take n as the
maximum likelihood estimator: the maximizer of

n
v() := log f (Xi , ).
i=1

Under adequate regularity conditions log f (x, )dF (x, ) = 0 and
   T

log f (x, ) log f (x, ) dF (x, )


2
= log f (x, )dF (x, ) =: I().
2
Since
n
n
2
v  () = log f (Xi , ) and v  () = log f (Xi , ),
i=1
i=1
2
36 E. del Barrio

1 
we obtain, from the Law of Large Numbers, that n v () I() a.s.. Now, a
Taylor expansion of v  around gives
1 1
(v  (n ) v  ()) = v  () n( n ) + oP (1) = I() n( n ) + oP (1),
n n

which, taking into account that v  (n ) = 0, gives

1 
n

n( n ) = l(Xi , ) + oP (1),
n i=1

with l(x, ) = I()1
log f (x, ). Clearly l(x, )dF (x, ) = 0 , while

l(x, )l(x, )T dF (x, ) = I()1 I()I()1 = I()1 . 

To obtain the null asymptotic distribution of nn we assume that F = F ()


and write

nn (x) = n(Fn (x) F (x, )) n(F (x, n ) F (x, ))
F
= F
n
(,)
(x) (x, )T n(n ) + oP (1)

F n
T 1
= F
n
(,)
(x) (x, ) l(Xi , ) + oP (1)
n i=1

F
= F
n
(,)
(x) (x, )T
l(x, )dFn
(,)
(x) + oP (1)
R
 1
= n (F (x, ) H(F (x, ), )T L(t, )dn (t) + oP (1)
0
= n (F (x, )) + oP (1),
1
where n is the uniform empirical process, H(t, ) = F (F (t, ), ), L(t, ) =
1
l(F (t, ), ) and
 1
n (t) = n (t) H(t, )T L(s, )dn (s), 0 < t < 1, (4.8)
0

is the uniform estimated empirical process.

4.2.1. Some notes on stochastic integration. Equation 4.8 suggests that n (t) w
1
B(t) H(t, )T 0 L(s, )dB(s), where B is a Brownian bridge. We cannot give
1
0 L(s, )dB(s) the meaning of a Stieltjes integral since the trajectories of B are
not of bounded variation. It is possible, though, to make sense of expressions like
1
0 f (s)dB(s) with f L (0, 1) through the following construction.
2
Empirical and Quantile Processes 37

n
Assume rst that f is simple: f (t) = i=1 ai I(ti1 , ti ] with ai R and
0 = t0 < t1 < < tn = 1. Then
 1 
n n
f (s)dB(s) := ai (B(ti ) B(ti1 )) = ai Bi ,
0 i=1 i=1

where Bi = B(ti ) B(ti1 ).


It can be easily checked that EBi = 0, Var (Bi ) = ti (1 ti ) and
Cov (Bi , Bj ) = ti tj if i = j.
1
The random variable 0 f (s)dB(s) is centered Gaussian with variance

n 
a2i Var (Bi ) + 2 ai aj Cov (Bi , Bj )
i=1 1i<jn

n 
n n
= a2i ti ai aj ti tj
i=1 i=1 j=1
 n 2   2

n  1 1
= a2i ti ai ti = f (t)dt
2
f (t)dt .
i=1 i=1 0 0
1
Thus, f  0 f (s)dB(s) denes an isometry between the subspace of L2 (0, 1)
consisting of centered, simple functions and its range. We can therefore extend the
denition to all centered functions in L2 (0, 1). Finally, for a general f L2 (0, 1),
 1  1
f (s)dB(s) := f  f(s)dB(s),
0 0
1 1
where f(s) = f (s) 0 f (t)dt. The stochastic integral 0 f (s)dB(s) is a centered,
Gaussian r.v. with variance
 1  1 2
f 2 (t)dt f (t)dt .
0 0
 1

1
In fact, if f1 , . . . , fk L2 (0, 1), then f
0 1 (s)dB(s), . . . , f
0 k (s)dB(s) has a
joint centered, Gaussian law and form the isometry dening the integrals we see
that
 1  1   1  1  1
Cov f (s)dB(s), g(s)dB(s) = f (s)g(s)ds f (s)ds g(s)ds.
0 0 0 0 0
(4.9)

1 1
We can similarly check that {B(t)}t[0,1] , 0 f1 (s)dB(s), . . . , 0 fk (s)dB(s) is
Gaussian and
  1   t  1
Cov B(t), f (s)dB(s) = f (s)ds t f (s)ds
0 0 0

(take (g(s) = I(0,t] (s) in (4.9) to check it.


38 E. del Barrio

An integration-by-parts formula. Suppose h is simple. Then


 1 n
h(t)dB(t) = h(ti )(B(ti ) B(ti1 ))
0 i=1
1
  1
= B(ti )(h(ti+1 ) h(ti )) = B(t)dh(t).
i=0 0

This result can be easily extended to any h of bounded variation and continuous
on [0, 1]:
 1  1
h(t)dB(t) = B(t)dh(t).
0 0
This integration-by-parts formula can be used to bound the dierence be-
tween stochastic integrals and the the corresponding integrals with respect to the
empirical process:
 1  1   1
 
 h(t)dn (t) h(t)dBn (t) d|h|(t)
n Bn
.

0 0 0
We can summarize now the above arguments in the following Theorem
Theorem 4.1. Provided H(t, ) is continuous on [0, 1] and L(t, ) is continuous
and of bounded variation on [0, 1] we can dene n and Brownian bridges Bn such
that  
log n

n Bn
= O a.s.,
n
1
where Bn (t) = Bn (t)H(t, )T 0 L(t, )dBn (t). Bn is a centered Gaussian process
with covariance
 s
K(s, t) = s t st H(t, )T L(x, )dx (4.10)
0
 t  1
H(s, )T L(x, )dx + H(s, )T L(x, )L(x, )T dxH(t, ).
0 0

If n is the maximum likelihood estimator, then the covariance function in


(4.10) simplies to
K(s, t) = s t st H(s, )T I()1 H(t, ).
d
Note that this covariance function can be expressed as s t st j=1 j (s)j (t)
for some real functions j . A very complete study of the Karhunen-Loeve expansion
of Gaussian processes with this type of covariance function was carried out in [98].
   
Example. Assume F = G0 : R, > 0 is a location scale family (G0 is
a standardized distribution function with density g0 ). Then
 
1   1
H(t, ) = g0 G1 (t)
0 G1
0 (t)
Empirical and Quantile Processes 39

and   g2 (x)  g2 (x) 


0
1 g0 (x) dx x g00 (x) dx
I() = 2  g2 (x)  2 g02 (x) .
x g00 (x) dx x g0 (x) dx 1
 
12
We can now write I()1 = 2 11 , with ij depending only on G0 , but
12 22
not on or and
K(s, t) = s t st 1 (s)1 (t) 2 (s)2 (t).
Here

2
12
1 (t) = 11 g0 (G1
0 (t)),
22
and
12
2 (t) = g0 (G1
0 (t)) 22 g0 (G1 1
0 (t))G0 (t).
22
If F is the Gaussian family G0 (x) = (x), g0 (x) = (x), g0 (x) = x(x)
and  
1 1 0
I() = 2 .
0 2
1
Hence, 11 = 1, 12 = 0, 22 = 2 and
1
K(s, t) = s t st (1 (s))(1 (t)) (1 (s))1 (s)(1 (t))1 (t).
2
In this Gaussian case L is not of bounded variation on [0, 1], but the above argu-
ment can be modied and still prove that
  1
n (t)}t w {B(t) +(1 (t)) 1 (s)dB(s)
 1  0
1 1 1 1
+ ( (t)) (t) ( (s) 1)dB(s)
2
2 0
t
as random variables in D[0, 1] or L2 (0, 1).
Theorem 4.1 provides, as an easy corollary, the asymptotic distribution of a
variety of Wn2 () and Kn () statistics under the null hypothesis. In fact, Durbins
results also give a valuable tool for studying its asymptotic power because they
include too the asymptotic distribution of the estimated empirical process under
contiguous alternatives. A survey of results connected to Theorem 4.1 as well as a
simple derivation of it based on Skorohod embedding can be found in [86]. Among
the statistics whose asymptotic distribution can be derived from Theorem 4.1 three
representatives have deserved special attention in the literature: the Cramer-Von
Mises statistic, and

Kn = n sup |Fn (x) F (x; n )|
<x<
40 E. del Barrio

and 
(Fn (x) F (x; n ))2
A2n =n dF (x; n ),
F (x; n )(1 F (x; n ))
which are known, as in the xed distribution setup, as Kolmogorov-Smirnov and
Anderson-Darling statistics respectively. Also, as in the xed distribution case,
quadratic statistics oer in general better power properties than Kn , with A2n
outperforming Wn2 . Any of these statistics oers considerable gain in power with
respect to the 2 test (see, e.g., [94] or [96]).
Let us conclude this subsection by commenting, briefly, that the achievements
of subsequent advances in the theory of empirical processes have allowed to develop
other goodness-of-t procedures.
For instance, in [41] the asymptotic distribution of the empirical characteristic
function is obtained. This was applied in [69] and [49] to propose two normality
tests (notice that the modulus of the characteristic function does not depend on
the mean of the parent distribution). Simulations in [49] suggest that those tests
have a quite good behavior against symmetrical alternatives.
A dierent way to adapt the xed-distribution tests is the minimum dis-
tance method. Assume that (F, G) is a distance between d.f.s. Set (Fn , F ) :=
inf (Fn , F (; )). (Fn , F ) is a reasonable measure of the discrepancy between
the sample distribution and the family F that can also be used for testing t to
F . Dudleys theory of weak convergence of empirical processes can be used for
deriving the limiting distribution of (Fn , F ) when (F, G) = ||F G|| with || ||
being some norm on D[0, 1] or D[, ] (see, e.g., [74]). An alternative derivation
can be based on Skorohod embedding (see [86] pp. 254-257).

4.3. Correlation and regression tests


In this subsection we assume that F is a location scale family, i.e., given a proba-
bility measure H0 , we will assume that F is the family of d.f.s obtained from H0
by location or scale changes. We will assume H0 to be standardized.
Goodness-of-t tests in this subsection focus on the analysis of the popular
probability plot. Some reviews on this subject have appeared recently (see, for
instance, [62] or [97]). The idea behind the probability plot is the following.
Let X1 , . . . , Xn be a random sample whose common d.f. belongs to F and has
mean and variance 2 . Let X0 = (X(1) , . . . , X(n) ) be the corresponding ordered
statistic. Let Z0 = (Z(1) , . . . , Z(n) ) be an ordered sample with underlying d.f. H0

and let m = (m1 , . . . , mn ) and V = (vij ) be, respectively, the mean vector and the
covariance matrix of Z0 , that is, mi = EZ(i) and vij = E(Z(i) mi )(Z(j) mj ).
Then,
X(i) = + Z(i) , in distribution, i = 1, . . . , n. (4.11)
Thus, the plot of the ordered values X(1) , . . . , X(n) against the points m1 , . . . , mn
should be approximately linear and lack of linearity in this plot suggests that
the d.f. of X1 does not belong to F . Checking this linearity is often done by
eye. However, some analytical procedures have been devised to test it. They were
Empirical and Quantile Processes 41

proposed according to two dierent criteria, which essentially lead to equivalent


tests, the main dierence being the point of view employed by the proposer to
justify his/her proposal.
The rst criterium relies on the idea of selecting an estimator of assuming
the linear model (4.11) is right and comparing it with Sn2 which, in any case, is
a consistent estimator of . Under the null hypothesis 2 /Sn2 should take values
close to 1. Hence, values of 2 /Sn2 far from 1 would lead to rejection of the null
hypothesis. These procedures are called regression tests.
A second class consists of tests assessing the linearity in (4.11) through the
correlation coecient between vectors X0 and m, (m, X0 ), (notice that here we
have no real correlation coecient because m is not random). When model (4.11)
is true, we expect 2 (m, X0 ) to take values close to 1 and, consequently, small
values of 2 (m, X0 ) would indicate that the null hypothesis is not true. Tests of
this kind are called correlation tests. Vector m can be replaced by other vectors
= (1 , . . . , n ) satisfying, under the null hypothesis, some approximate linear
relation with X0 . Coordinates of vector are usually known as plotting positions.
The rst representative of these tests was the Shapiro-Wilk W -test of nor-
mality, proposed in [82]. There, authors state that they are trying to provide an
analytical procedure to summarize formally indications of probability plots (p.
591). The best linear unbiased estimator, BLUE, of and in model (4.11) is

m V 1 X0
= Xn and = (4.12)
m V 1 m
(this holds for any symmetric H0 ). Hence, under the null hypothesis, 2 /Sn2 should
take values close to 1. The Shapiro-Wilk statistic, W , is a normalized version of
2 /Sn2 , namely,

2
m V 1 X0
W =  1 1  . (4.13)
m V V m i (Xi X)2
The normalization ensures that W always takes values between 0 and 1 (since
W equals 2 (V 1 m, X0 )). Small values of W would lead to rejection of the null
hypothesis. This is a regression test, since it is based on the comparison of and
Sn2 , but, obviously, it can be seen as a correlation test with plotting positions
V 1 m. According to simulations (provided, for instance, in [85]) it seems that
the W -test is one of the most powerful normality tests against a wide range of
alternatives. This fact has made this test very popular and it can be considered as
the gold standard for comparisons. However, employing W for testing normality
presented several diculties of dierent kinds.
A rst problem concerned computational aspects. Computation of W requires
previous computation of m and V 1 . This task is dicult and, in fact, by the time
of W was introduced, elements in V were tabulated only for n 20. For this reason
some numerical approximations that allowed to compute W quite accurately for
sample sizes up to 50 were proposed in [82] .
42 E. del Barrio

An equally important concern about W was the tabulation of its null distri-
bution. Except in case n = 3, when the W -test is equivalent to the u-test [82],
the exact distribution of W is unknown. Percentiles of W were computed by sim-
ulation in [82] for sample sizes up to 50. Though, the asymptotic distribution of
W remained unknown for a long time. In fact, it was not obtained until 20 years
later, in [58] by showing the asymptotic equivalence, under normality, of W and
another correlation test whose distribution was already known at this time (see
the considerations concerning the de Wet-Venter test below).
Some transformations of W that made its distribution approximately Gauss-
ian were proposed (see [83] or [78]). However, these results must be used with some
caution because, as shown in [57], they rely on some approximations which do not
hold with the necessary accuracy.
An additional weakness of the Shapiro-Wilk test is that the procedure may be
not consistent for testing t to non-normal families of distributions. For instance, If
F is the exponential location scale family then the Shapiro-Wilk statistic becomes
(Xn X(1) )2
WE = ,
(n 1)Sn2
which is a function of the coecient of variation. There are some families of dis-
tributions with the same coecient of variation as the exponential family (see
[79, 93]). Thus, the WE -test is not consistent when applied for testing exponen-
tiality. In particular, simulations in [93] suggest that the power of the WE -test
against the beta ( 14 , 12
5
) distribution decrease with the sampling size.
These limitations of the Shapiro-Wilk test led to the introduction of modi-
cations of W aimed to ease them. The rst examples were the dAgostino test [23]
and the Shapiro-Francia test [81]. They were intended to replace the W -test for
sample sizes greater than 50. Both tests are easier to compute than the W -test.
The dAgostino test employs an estimator of proposed in [37] to get the statistic

(i (n + 1)21 )X(i)
D= i .
n2 S
The Shapiro-Francia test is based on an idea suggested (without proof) in [48] (see
also [95]) according to which matrix V 1 in (4.13) can be replaced by the identity
I, obtaining thus the statistic
(m X0 )
2
W =  . (4.14)
m m (Xi X)2
Both tests are correlation tests. The plotting positions are (1, 2, . . . , n) for the
D-test and m for the W  -test. Simulation studies in [23] and [81], respectively,
suggested that the proposed tests are approximately equivalent to the W -test.
The D-test has the advantage of being asymptotically normal and its distribution
can be approximated by Cornish-Fisher expansions for moderate sample sizes.
Apart from the ease of computation, an interesting feature of the W  -Shapiro-
Francia test seemed to be its consistency for testing t to any location scale family
Empirical and Quantile Processes 43

with nite second order moment, a fact shown in [79]. The meaning of this con-
sistency needs some explanation. The fact really shown in [79] is that, if Wn 
denotes the Shapiro-Francia statistic for a sample of size n and we x (0, 1),
then, under any xed alternative,
P (Wn  < ) 1. (4.15)
If we try to choose the critical value so that the test has signicance level , that
critical value, n (), will depend on n. We cannot conclude P (Wn  < n ()) 1
from (4.15) and, in fact, the Shapiro-Francia test fails to detect departures from
some location scale families (an example of this is given by the exponential family,
see the comments about the power of the Shapiro-Francia test below).
A further simplication of the W  -test was proposed by Weisberg and Bing-
ham in [106] by replacing m by vector m = (m1 , . . . , mn ), where
 
i 3/8
mi = 1 , i = 1, . . . , n,
n + 1/4
and denotes the standard Gaussian d.f. This statistic is easier to compute than
W  , while a Monte Carlo study included in [106] suggests that both tests are
equivalent.
Another modication of W was proposed in [32] by de Wet and Venter. It
seems that the concept of correlation test was introduced for the rst time in that
paper. The de Wet and Venter test is the correlation test with plotting positions
    
1 1 1 n
= ,..., ,
n+1 n+1
or, equivalently, the test which rejects normality when large values of
  X(i) Xn 2
1
W = [i/(n + 1)] (4.16)
i
Sn
are observed.
Some other tests continued this line. For instance, in [42], Filliben proposed
a correlation test with the medians of the ordered statistic Z0 as plotting posi-
tions. Some simulations comparing this and the W and W  tests are oered. The
distribution of this statistic was also computed via Monte Carlo method.
An interesting feature of the W -test is that it was the rst correlation nor-
mality test with known asymptotic distribution. To be precise, it was shown in
[32] that, if {Zi } is a sequence of independent standard Gaussian r.v.s., then

 Z2 1
W an i
w
i=3
i
for a certain sequence of constants {an }. The key of the proof relied on showing,
through rather involved calculations, the asymptotic equivalence, under normality,
of W and a certain quadratic form and using then the asymptotic theory for
quadratic forms given in [33].
44 E. del Barrio

Since the publication of [32], the possibility of obtaining the asymptotic dis-
tribution of other correlation tests of normality by showing their asymptotic equiv-
alence with the W -test was considered. An important paper in this program, was
[102], where the asymptotic equivalence of correlation tests under some general
conditions (satised by most of the correlation tests in the literature) is shown.
In particular, it is shown that the Shapiro-Francia, the Weisberg-Bingham and
the Filliben tests are asymptotically equivalent to the de Wet-Venter test, having,
consequently, the same asymptotic distribution.
The asymptotic distribution of the Shapiro-Wilk test could be obtained then
using its asymptotic equivalence with the Shapiro-Francia, shown in [58]. This
solved an important problem which had been open for around twenty years. It
would be unfair not mentioning the paper [57], by Leslie, which proved the validity
of the key step in previous heuristic reasonings based on assuming that vector m is
an asymptotic eigenvector of V 1 . More precisely, the main result in that paper,
is that there exists a constant C which does not depend on n such that


V 1 m 2m
C(log n)1/2 ,

where, given the matrix B = (bij ), then
B
2 = ij b2ij .
The possibility of extending the use of correlation tests to cover goodness-
of-t to other families of distributions has been explored, for instance, in [92] for
the exponential distribution or in [46] for the extreme value distributions. In this
setup correlation tests do not present the same nice properties exhibited when
testing normality. In [60] the asymptotic normality of the Shapiro-Francia test
when applied to the exponential family is obtained. The rate of convergence is
extremely low: (log n)1/2 . This result was generalized in [65] to cover extreme-value
and logistic distributions with the same rate and the same asymptotic distribution
as in the exponential case. However, the asymptotic eciency of the Shapiro-
Francia test in these situations was found to be 0 when compared with tests based
on the empirical distribution function, since it was possible to nd a sequence of
contiguous alternatives such that the asymptotic power coincides with the nominal
level of signicance of the test (see also [61] on this question).

5. Tests based on Wasserstein distance


A dierent approach to correlation tests was suggested in [29] and will be widely
developed in the remainder of this work. The methodology consists of analyzing the
L2 -Wasserstein distance between a xed distribution and a location scale family of
probability distributions in R. Our study will cover dierent kinds of distribution
tails, including as key examples the uniform, normal, exponential and a heavier
tailed law.
Let P2 (R) be the set of probabilities on R with nite second moment. For
probabilities P1 and P2 in P2 (R) the L2 -Wasserstein distance between P1 and P2
Empirical and Quantile Processes 45

is dened as
 1/2 
2
W(P1 , P2 ) := inf E (X1 X2 ) , L(X1 ) = P1 , L(X2 ) = P2 .

For simplication of notation we will identify probability laws with their


d.f.s. In particular, if Fi , i = 1, 2, are the d.f.s associated to the probability laws
Pi P2 (R), we will say that Fi P2 (R), i = 1, 2 and write W(F1 , F2 ) instead of
W(P1 , P2 ).
A main fact which makes W useful for statistics on the line (the multivariate
setting is very dierent) is that it can be explicitly obtained in terms of quantile
functions. If Fi P2 (R), i = 1, 2, then (see, e.g., [101], [5])
 1 1/2
 1 1
2
W(F1 , F2 ) = F1 (t) F2 (t) dt . (5.1)
0
Some relevant well-known properties of the Wasserstein distance are included
in the following proposition for future reference. The reader interested in properties
and uses of the general Lp -Wasserstein distance can resort to [5], [86] or [22].
Proposition 5.1.
(a) Let Fi P2 (R), i = 1, 2. Call mi the mean value of Fi and Fi the d.f. dened
by Fi (x) = Fi (x mi ). Then
W 2 (F1 , F2 ) = W 2 (F1 , F2 ) + (m1 m2 )2 .
(b) Let {Fn }n be a sequence in P2 (R). The following statements are equivalent:
i. Fn F P2 (R)
 in W-distance
 (i.e., W(Fn , F ) 0).
ii. Fn F and |t|2 dFn |t|2 dF < .
w
iii. Fn1 F 1 a.s. and in L2 (0, 1).
As in Subsection 3.2 we assume F to be a location scale family of d.f.s, that is,
F = {H : H(x) = H0 ( x ), R, > 0} for some H0 P2 (R) which we choose,
for simplicity, with zero mean and unit variance (thus, given H(x) = H0 ( x ) in
F , and are its mean and its standard deviation, respectively).
1
Note that the quantile function associated to H(x) = H0 ( x ) veries H (t) =
1
+H0 (t). Therefore, if F is a d.f. in P2 (R) with mean 0 and standard deviation
0 , (5.1) and Proposition 5.1 a) imply that
 1 
 1 2
W 2 (F, F ) = inf{W 2 (F, H), H F } = inf F (t) 0 H01 (t) dt
  1 >0 0 
 1 
= inf 0 + 2
2 2
F (t) 0 H01 (t)dt (5.2)
>0 0
 1 2  1 2
 1 
= 02 F (t) 0 H01 (t)dt = 02 F 1 (t)H01 (t)dt .
0 0
1
Thus, the law in F closest to F is given by = 0 and = 0 F 1 (t)H01 (t)dt,
which is the covariance between F 1 and H01 when seen as r.v.s dened on (0, 1).
The ratio W 2 (F, F )/02 is not aected by location or scale changes on F. Hence,
46 E. del Barrio

it can be considered as a measure of dissimilarity between F and F . For example,


if denotes the standard normal d.f., the best W-approximant to F in the set
F
 1N of normal laws will be the normal law with mean 0 and standard deviation
1
0
F (t)1 (t) dt, and the ratio

2
1 1
W 2 (F, FN ) 0
F (t)1 (t)dt
=1
02 02
measures the nonnormality of F .
The invariance of W 2 (F, F )/02 against location or scale changes on F sug-
gests the convenience of using a sample version of it for testing t to the location
scale family F . More precisely, if X1 , X2 , . . . , Xn is a random sample with under-
lying d.f. F ,
W 2 (Fn , F ) 2
Rn = 2
= 1 n2 ,
Sn Sn
 1 1 1
where n = 0 Fn (t)H0 (t)dt, can be used as a test statistic for the null hypoth-
esis F F. Large values of Rn would lead to rejection of the null hypothesis.
This testing procedure belongs to the class of minimum distance tests de-
scribed in Subsection 3.1. A nice feature of Wasserstein tests is that we have an
explicit expression for the minimum distance estimators and, consequently, for the
minimum distance statistic, opposite to what happens for other metrics (e.g., those
metrics leading to Kolmogorov-Smirnov or Cramer-von Mises statistics).
The connection of Rn to correlation and regression tests can be clearly seen
by noting that large values of Rn correspond to small values of n2 /Sn2 , which, with
the notation employed in Subsection 3.2, can be expressed as
n 2
2 i=1 i X(i)
(, X0 ) = ,
Sn2
 i/n
where = (1 , . . . , n ) and i = (i1)/n H01 (t) dt, i = 1, . . . , n (observe that
is centered and  = 1 since H0 is assumed to be standardized). Hence, the Rn -
test is equivalent to a correlation test with plotting positions = (1 , . . . , n ) .
In fact, the plotting positions in the Shapiro-Wilk, Shapiro-Francia or de Wet-
Venter tests are approximations to the Wasserstein plotting positions. This had
been noticed, in the particular context of normal probability plots, by Brown and
Hettmansperger in [10], which considered the problem of nding the best choice
for the plotting positions. That paper presented a heuristic explanation, based on
an orthogonal expansion of Rn , of the good power properties of the Rn normality
test against general alternatives, observed by Stephens [95]. We will try to justify
those heuristic considerations.
The Wasserstein test of normality turns out to be equivalent to the well-
known Shapiro-Wilk test, sharing its good power properties. However, both are
inecient procedures for testing t to location scale families that, as the expo-
nential family, for instance, have heavier tails (just as with tests of t based on
Empirical and Quantile Processes 47

the correlation coecient: see Lockhart and Stephens (1998), Section 5). The null
asymptotics of Rn provides a good insight into the cause of this ineciency. To
study this asymptotic distribution under H0 we can assume F = G0 (by the
locationscale invariance of Rn ) and denote the empirical quantile process as
vn (t) = n(Fn1 (t) F 1 (t)), to get (see del Barrio et al. (1999)) that
 1  1
2  1
2 
1
nRn = 2 vn2 (t)dt vn (t)dt vn (t)F 1 (t)dt
(Fn ) 0 0 0
 1  1
1 1 1 1
= 2 (vn (t) vn , 1 1 vn , F F (t)) dt = 2
2
v 2 (t)dt,
(Fn ) 0 (Fn ) 0 n
1
where f, g = 0 f g and vn = vn vn , 1 1 vn , F 1 F 1 . It is shown in
del Barrio et al. (1999) that under normality there exist constants an such that
nRn an converges in law to a nondegenerate distribution. More precisely, if
(resp. ) denote the standard normal distribution (resp. density) function and B
is a Brownian bridge, then
 1 2  1 B(t)
2  1 B(t)1 (t)
2
B (t) EB 2 (t)
nRn an dt dt dt
d 0 2 (1 (t)) 0 (
1 (t))
0 (
1 (t))
 1
= B 2 (t) E B 2 (t)dt,
0

where B = (B B, 1 1 B, 1 1 )/(1 ). This last integral admits a


principal component decomposition, namely,
 1

B 2 (t) E B 2 (t)dt = j (Yj2 1),
0 j=3

where j = 1/j, fj (t) = Hj (1 (t)), Yj = B, fj / j and Hj is the Hermite
polynomial of degree j 1. j and fj are, respectively, the eigenvalues and eigen-
vectors of the covariance operator associated to B, hence Yj are i.i.d. N (0, 1). We
can, similarly write
1 2 
nRn = S12 0 (vn ) (t)dt = S12 j=1 Yn,j 2
,
n n
1
where Yn,j = 0 vn (t)fj (t) dt is the principal component in direction fj . Yn,j =
Yn,j /Sn detects deviations in direction fj . For local skewness-kurtosis alternatives:
Xi 2

12 ex /2 1 + 1 /6 n H3 (x) + 2 24
/ n
H4 (x) ,

it can be shown that
Yn,3 N (1 , 3), and Yn,4 N (2 , 4).
d d

Yn,3 and Yn,4 being the components with larger weight in the decomposition of
Rn , this explains the good performance or the Wasserstein (or the Shapiro-Wilk)
test against deviations from normality in skewness or kurtosis.
48 E. del Barrio

In the exponentialcase nRn is not shift tight, but it can be shown that, for
some constants an , (n/ log n)Rn an is asymptotically normal. However, if we
x (0, 1/2) then
 1
1
(vn (t))2 dt 0,
log n Pr

hence the asymptotic distribution of Rn depends only on the tails of F : the Wasser-
stein exponentiality test cannot detect alternatives that have approximately expo-
nential tails.
As a possible remedy to this ineciency, de Wet (2000, 2002) and Csorgo
(2002) proposed to replace the Wasserstein distance W by a weighted version

1/2
1 1 1
Ww (F, G) := 0 (F (t) G (t))2
w(t)dt , for some positive measurable
function w, and the test statistic Rn by
Ww2 (Fn , H)
Rw
n = 2 (F )
,
w n

where, here and in the sequel we set


 1  1
 2
w (F ) = F 1 (t)w(t)dt and w
2
(F ) = (F 1 (t))2 w(t)dt w (F ) .
0 0

Rwn is location scale invariant, hence its null distribution can be studied assuming,
as above, that F = G0 . Under the assumptions
 1
w(t)dt = 1, (5.3)
0
 1
G1
0 (t)w(t)dt = 0 (5.4)
0
and
 1
(G1 2
0 (t)) w(t)dt = 1, (5.5)
0

we can mimic, step by step, the computations leading to (1.6) and obtain that
 1  1
2  1
2 
1 1
w
nRn = 2 v (t)w(t)dt
2
vn (t)w(t)dt vn (t)F (t)w(t)dt
w (Fn ) 0 n 0 0
 1
1
= 2 (vn (t) vn ,1 w 1 vn ,F 1 w F 1 (t))2 w(t)dt
w (Fn ) 0
 1
1
= 2 v 2 (t)w(t)dt, (5.6)
w (Fn ) 0 n
1
where, now, f, g w = 0 (f g)w and vn = vn vn , 1 w 1 vn , F 1 w F 1 . Thus,
the asymptotic distribution of nRw n under the null hypothesis can be obtained
through the analysis of weighted L2 -functionals of the quantile process vn .
Empirical and Quantile Processes 49

6. Asymptotics for L2 -functionals of the quantile process


We devote our last section to the derivation of a complete set of asymptotic re-
sults for weighted L2 -functionals of the quantile process. Our analysis is based on
obtaining CLTs for linear combinations of L2 functions with random exponential
coecients, a simpler problem to which the analysis of the quantile process can
be reduced. This section is largely based on [31].
We will see rst how the above mentioned reduction can be justied. Let
Ui , i N, be i.i.d. uniform (0,1) random variables; for each n, let Gn (t) be the
empirical c.d.f. associated to U1 , . . . , Un , let G1
n (t) be the quantile function and
let un (t) be the associated uniform quantile process, that is,

un (t) = n(t G1 n (t)), 0 < t < 1. (6.1)
If {n }
n=1 are i.i.d. r.v.s with common exponential distribution of mean 1 and
Sn = 1 + + n , then, the well known distributional identity
 
d S1 Sn
(Un:1 , . . . , Un:n )= ,...,
Sn+1 Sn+1
allows us to rewrite G1
n (t) as Sj /Sn+1 if (j 1)/n < t j/n and, consequently,

d n 
n+1
un (t)= Zn,j (t), (6.2)
Sn+1 j=1

where
Zn,j (t) = n1/2 an,j (t)j and an,j (t) = (1 t)I{j1<nt} tI{j1nt} . (6.3)
So, if
 11/n  2
un (t)
Ln := dt (6.4)
1/n g(t)
for some weight function g non-vanishing on (0, 1), then, with

2 denoting the
L2 norm with respect to Lebesgue measure on the unit interval,
 2  n+1 2
n  
d
Ln =  c  (6.5)
Sn+1  n,i i 
i=1 2

for certain functions cn,i (t) which we assume in L2 (0, 1) (in the case of (6.5),
but not always below, cn,i = n1/2 an,i (t)I[1/n,11/n] (t)/g(t)). By the law of large
numbers, weak convergence of the statistic an Ln bn then reduces to weak con-
vergence of
 n+1 2  2
  Sn+1
an  c 
n,i i  b n ,
i=1 2 n
and the second variable is almost a constant if bn does not grow too fast.
50 E. del Barrio

Since the an,j (t) have a relatively complicated expression, it is convenient to


isolate here as a lemma some estimates on an,j (t) to be used below. It is also con-
venient to introduce the notation f g for the function of two variables f (x)g(y).
We recall that the map(f, g)  f  g is a continuous bilinear map between
L2 (0, 1)L2(0, 1) and L2 (0, 1)(0, 1) and that f1 g1 , f2 g2 = f1 , g1 f2 , g2 .
n+1 n+1
Lemma 6.1. Set mn = j=1 an,j and Kn = j=1 an,j an,j . Then for every
0 < s, t < 1,
i) mn (t) = frac(n(1 t)) t; also, if 1/n t 1 1/n then
n
|mn (t)| 1 nt(1 t);
n1
ii) if s t then Kn (s, t) = [n(1 t)]s + frac(n(1 s))(1 t) + st;
iii) n1 Kn (s, t) s t st; further, if 1/n s, t 1 1/n, then

1
n+1
1 1
(s t st) Kn (s, t) |an,j (s)an,j (t)| 3(s t st).
2 n n j=1
n+1
Proof. To prove i) x t and observe that each term in j=1 an,j (t) equals either
1 t (the rst n [n(1 t)] terms) or t (the remaining [n(1 t)] + 1 terms).
Hence
mn (t) = (n [n(1 t)])(1 t) ([n(1 t)] + 1)t = frac(n(1 t)) t.
The identity in ii) can be proved in a similar way. Fix s t. In the corresponding
sum for Kn (s, t) there are three types of summands: the rst n [n(1 s)], each
equal to (1 s)(1 t); the next [n(1 s)] [n(1 t)], each equal to s(1 t);
and the remaining [n(1 t)] + 1, each equal to st. This gives ii) and the right-hand
side inequality in iii). The left-hand side inequality in iii) is a trivial consequence
of ii). 
We will also make use of the following simple but useful observation about
hypercontractivity of linear combinations of exponential variables with coecients
in L2 . Its proof, based on standard symmetrization/randomization techniques, can
be found in [31].
n
Lemma 6.2. Let Y (t) = k=1 ck (t)k for some n N and ck L2 (0, 1), and
where the variables k are independent exponential with parameter 1. Then, there
exists an absolute constant C < such that
 2
E
Y
42 C E
Y
22 . (6.6)
Next we consider the general quantile process. Let F be a twice dierentiable
distribution function such that f := F  is non-vanishing on supp F := {F =
0, 1} := (aF , bF ) and
  
t(1 t)f  F 1 (t) 
r := sup   < , (6.7)
0<t<1 f 2 F 1 (t)
Empirical and Quantile Processes 51

where F 1 (t) is the corresponding quantile function. Condition (6.7), that comes
from Csorgo and Revesz (1978), is a natural condition to have if we wish to relate
general and uniform quantile processes: see Lemma 1.1, Ch. 6 and comments after
its proof in this reference. Let Xi be i.i.d. with common distribution F , let Fn be
the empirical distribution of X1 , . . . , Xn , n N, and let Fn1 denote the empirical
quantile function. Since we are considering only distributional results, there is no
loss of generality in taking Xi = F 1 (Ui ), where Ui are i.i.d. uniform on [0, 1]. In
this case,
 
Fn1 (t) = F 1 G1 n (t) , (6.8)
where G1
n is the quantile function corresponding to the uniform variables U1 , . . . ,
Un . We let vn to be the quantile processes associated to the sequence Xi ,
 
vn (t) := n Fn1 (t) F 1 (t) , n N. (6.9)
vn and un are related by the limited Taylor expansion
 1  1  
vn (t) = n F Gn (t) F 1 (t)
 1   
n Gn (t) t 1  1 2 f  F 1 ()
=   + n G n (t) t  
f F 1 (t) 2 n f 3 F 1 ()
 
un (t) 1 f  F 1 () 2
=  +   u (t) (6.10)
f F 1 (t) 2 n f 3 F 1 () n
for some between t and G1 n (t). This Taylor expansion can be used to relate the
(weighted) L2 norms of vn and un /f (F 1 ). This is the content of our next result,
see [31] for the proof. Here w is a non-negative measurable function on (0, 1) and


2,w,n and , w,n denote, respectively, the norm and the inner product in the
space L2 ((1/n, 1 1/n), w(t)dt). Then we have:
Lemma 6.3. Let F be a distribution function which is twice dierentiable on its
open support (aF , bF ), with f (x) := F  (x) > 0 for all aF < x < bF , and which sat-
ises condition (6.7). Assume further that w is a non-negative measurable function
such that
 11/n 1/2
1 t (1 t)1/2
lim   w(t)dt = 0. (6.11)
n n 1/n f 2 F 1 (t)
Then, if un is the uniform quantile process and vn is the quantile process dened
by (6.8) and (6.9),
 u 2  un 
 n   

vn
22,w,n   1   0 and vn  1   0 (6.12)
f F 2,w,n f F 2,w,n

in probability.
In fact, as mentioned in the introduction, we are interested not in
vn
2,w,n
but rather in
vn
2,w , where

2,w denotes the L2 -norm with respect to the
measure w(t)dt over the whole interval (0, 1). So, we must deal next with the
integrals at extremes. The problem can be solved imposing conditions which are
52 E. del Barrio

related to, but weaker than the usual von Mises conditions on domains of attraction
(e.g., Parzen (1979) and Schuster (1984)). Again, we refer to [31] for a proof.
Lemma 6.4. Let F be a distribution function which is twice dierentiable on its
open support (aF , bF ), with f (x) := F  (x) > 0 for all aF < x < bF . Assume that
F satises condition (6.7), that
|f  (F 1 (x))|x
either aF > or lim inf > 0, (6.13)
x0+ f 2 (F 1 (x))
and
|f  (F 1 (1 x))|x
either bF < or lim inf > 0. (6.14)
x0+ f 2 (F 1 (1 x))
Assume further that w is a bounded non-negative measurable function such that
x 1
x 0 w(t)dt x 1x w(t)dt
lim   = 0 and lim   = 0. (6.15)
x0+ f 2 F 1 (x) x0+ f 2 F 1 (1 x)

Then,

vn
22,w
vn
22,w,n 0 (6.16)
in probability.
As a consequence of these two lemmas we have:
Proposition 6.5. Let F be a distribution function which is twice dierentiable on its
open support (aF , bF ), with f (x) := F  (x) > 0 for all aF < x < bF . Assume that
F satises conditions (6.7), (6.13) and (6.14). Let w be a non-negative measurable
function for which the limits (6.15) hold. Assume further that
 11/n 1/2
1 t (1 t)1/2
lim   w(t)dt = 0.
n n 1/n f 2 F 1 (t)
Then,  u 2
 

vn
22,w   1  
n
0 (6.17)
f F 2,w,n
   
in probability. If, moreover, h L2 (w(t)dt) and the sequence un /f F 1 , h w,n
is stochastically bounded, then
un
vn , h 2w   1  , h 2w,n 0 (6.18)
f F
in probability.
Proof. The conclusion (6.17) is a direct consequence of the last two Lemmas.
The
 limit
 (6.18)
 follows
 from the same two lemmas, stochastic boundedness of
un /f F 1 , h w,n , Holders inequality and the identities
un
vn , h 2w,n   1  , h 2w,n
f F
 
un un un
= vn  1  , h w,n vn  1  , h w,n + 2  1  , h w,n
f F f F f F
Empirical and Quantile Processes 53

and
vn , h 2w vn , h 2w,n
  
= vn hw vn hw + 2vn , h w,n .
(0,1/n][11/n,1) (0,1/n][11/n,1)

(Note that, by the rst identity, {vn , h w,n } is stochastically bounded.) 

Obviously, combining this proposition with (6.3) and (6.4) for g = f (F 1 ),


reduces convergence in distribution of
vn
22,w and of
vn
22,w to convergence in
distribution of Ln , which is a function of exponential random variables that will
be relatively easy to handle. The conditions under which this has been established
here are weaker than those usually found in the literature.

6.1. Weak convergence of L2 linear combinations of exponential r.v.s.


We should point out that the results that follow do not require the variables
i to be exponential, but only to be integrable enough, however, we stay with
exponential variables, which is what we need. In this section, the functions cn,i in
the expression (2.4 ) of Ln are allowed to be arbitrary functions in L2 (0, 1). Given
a triangular array cn,i , i n, n N, of functions in L2 (0, 1), we set
n  2
n1
Yn (t) := cn,i i , Zn (t) = Yn (t), t [0, 1]. (6.19)
i=1
Sn
Dene cn,i,j = ci,j as
 1
cn,i,j = ci,j = cn,i (t)cn,j (t)dt := cn,i , cn,j , i, j = 1, . . . , n, n N. (6.20)
0
It will also be convenient to introduce the functions
n 
n
Kn (s, t) = cn,i (s)cn,i (t) and mn (t) = cn,i (t), t [0, 1], n N, (6.21)
i=1 i=1

which are the covariance and the mean functions,


 respectively, of the random
processes Yn (t). With tensor notation Kn = i cn,i cn,i . Obviously
   

Kn
22 = cn,i , cn,j 2 = c2n,i,j and
mn
22 = cn,i , cn,j = cn,i,j .
i,j i,j i,j i,j
(6.22)
Before getting into convergence we examine some interesting integrability
issues. The rst result of this subsection is based on the Paley-Zygmund argument,
e.g., de la Pena and Gine (1999), pp. 119124, in particular Corollary 3.3.4 there,
that we restate here for ease of reference:
Lemma 6.6. Let V be a random variable such that EV 4 C(EV 2 )2 . Then, for
all t > 0,
IEV 2 2t2 4C Pr{|V | > t}
54 E. del Barrio

that is,
1
EV 2 < 2a2 whenever Pr{|V | > a} < .
4C
Proof. It follows immediately upon observing that
   1/2
EV 2 t2 + E V 2 I|V |>t t2 + (EV 4 )1/2 Pr{|V | > t} . 

Proposition 6.7. The sequence {


Yn
2 }, with Yn as dened in (6.19), is stochasti-
cally bounded if and only if both conditions

n
sup
cn,k
22 < (6.23)
n
k=1

and 
sup
mn
22 = sup cn,i , cn,j < (6.24)
n n
1i,jn

are satised, and the same is true for the sequence {


Zn
2 }.
Moreover, limn
Yn
2 = 0 in probability if and only if

n 
lim
cn,k
22 = lim cn,i , cn,j = 0, (6.25)
n n
k=1 1i,jn

and the same is true for the sequence {


Zn
2 }.
Proof. We will use the abbreviated notation ci,j for cn,i,j . Since
 1 
2  1  

E
Yn
22 = E cn,k (t)k dt = c2n,k (t) + cn,i (t)cn,j (t) dt
0 k 0 k i,j
 
= ck,k + ci,j , (6.26)
k i,j

it follows that the conditions (6.23) and (6.24) are sucient for tightness of the
sequence {
Yn
2 } and that (6.25) is sucient for its convergence to zero in prob-
ability. Suciency for tightness and convergence of {
Zn
2 } follows from this and
the law of large numbers. Necessity in both cases follows immediately from Lem-
mas 6.6 and 6.2. 

One can say more about the way Yn converges. In fact, stochastic bounded-
ness of
Yn
m
2 implies uniform boundedness of moments of any order. This fact is
proved in [31]. With these preliminaries on integrability out of the way, we con-
sider now convergence in law of the sequence {
Yn
2 }. We consider several cases,
corresponding to the dierent cases for convergence of the square integral of the
quantile process described
a) Convergence of the processes Yn . Here we obtain necessary and sucient condi-
tions for weak convergence of Yn as L2 -valued random vectors; then convergence
Empirical and Quantile Processes 55

of
Yn
22 will be an immediate consequence of the continuous mapping theorem
for weak convergence. Note that
 
P (
cn,i i
2 >
) = exp
/
cn,i
2
and therefore, the triangular array {cn,i : i = 1, . . . , n; n N} is innitesimal if
and only if
max
cn,i
2 0 (6.27)
i
as n . The next theorem gives necessary and sucient conditions for the
convergence in law in L2 (0, 1) of {Yn } under (6.27). Under innitesimality, the
only possible limits of {Yn } are Gaussian, with a trace-class covariance operator.
Kn and mn are dened as in (6.21).
Theorem 6.8. Assuming condition (6.27) holds, the sequence {Yn } converges in
law in L2 (0, 1) if and only if the following conditions hold:
i) There
 exists a symmetric,
 positive semi-denite, trace-class kernel K(s, t)
L2 (0, 1) (0, 1) such that
Kn K. (6.28)
L2
ii) If i 0 are the eigenvalues of K then
n


cn,i
2
2
i . (6.29)
i=1 i=1
iii) There exists m L2 (0, 1) such that
mn m (6.30)
L2
If i), ii) and iii) hold, then Yn converges in law in L2 (0, 1) to an L2 (0, 1)-valued
Gaussian random variable Y with mean function m and covariance operator K
given by   1 1
K (f, g) = K(s, t)f (s)g(t)dsdt
0 0
for f, g L2 (0, 1).
Proof. Necessity. Let us assume rst that the L2 -valued random vectors Yn con-
verge in law. Then {
Yn
2 } also converges in law and, moreover, by Proposition
6.3, its moments converge as well (to the moments of the limit). This implies, in
particular, that the sequence
n n 2
 
E
Yn
22 =
cn,i
22 +  cn,i  , n N, (6.31)
2
i=1 i=1
converges. Note also that convergence in law of Yn to Y plus uniform integrability
of {
Yn
2 }, which is a consequence of moment convergence, ensure that EYn L2
EY and, therefore, that
n
cn,i m := EY. (6.32)
L2
i=1
56 E. del Barrio

Now, (6.31) and (6.32) imply (6.30) and also that the left-hand side in (6.29)
converges to a nite limit. We have also proved that {Yn EYn } converges in law,
namely,
n
Yn EYn = cn,i (i 1) Y EY. (6.33)
d
i=1
Also, (6.33) and uniform integrability imply

n

cn,i
22 = E
Yn EYn
22 E
Y EY
22 . (6.34)
i=1

Another consequence of (6.33) is that


(Yn EYn ) (Yn EYn ) (Y EY ) (Y EY ) (6.35)
d
 
in L2 (0, 1) (0, 1) ((6.35) follows from (6.33) and continuity of the map (f, g) 
f g). Now, since
f g
2 =
f
2
g
2 , convergence of moments of
Yn
2 ensures
also uniform integrability of (Yn EYn ) (Yn EYn ) and, just as above, we obtain
that
   
Kn = E (Yn EYn ) (Yn EYn ) K := E (Y EY ) (Y EY ) .
L2

Let now i , i be, respectively, the eigenvalues and the corresponding eigenfunc-
tions associated to the kernel K. Then
 1 1
i = K(s, t)i (s)i (t)dsdt = EY EY, i 2 .
0 0
Therefore,



E
Y EY
22 = E Y EY, i 2 = i ,
i=1 i=1
which combined with (6.34) yields (6.28) and (6.29).
Suciency. Assume now that (6.27), (6.28), (6.29) and (6.30) hold. Let us de-
note i, = i I{ cn,i i 2 } and i = i I{ cn,i i 2 >} for
> 0 and i N. Since
Ei I{i >t} = (t + 1)et 1/t2 for large enough t, condition (6.29) implies that,
for large enough n,
  n  n  n
   
E cn,i i  =  cn,i Ei 
cn,i
2 Ei
2 2
i=1 i=1 i=1

1 

n

cn,i
22 max
cn,i
2 0.

2 i=1
i

Hence, by the Central Limit Theorem in Hilbert spaces (see, e.g., Araujo and Gine
(1980), Corollary 3.7.8), if {Yn } is shift convergent in law, we can take the shifts
to be the expected values EYn and therefore, by the same central limit theorem,
the proof reduces to showing that
Empirical and Quantile Processes 57

n
i) j=1 P (
cn,j j
2 >
) 0 as n for every
> 0,
ii) for every
> 0 and every f L2 (0, 1)

n
Var(cn,j j, , f ) K (f, f ) and
j=1

iii) there exists a c.o.n.s {i }i1 in L2 (0, 1) such that


n
 
k

lim lim sup E


cn,j j, Ecn,j j,
2 Ecn,j j, Ecn,j j, , i 2 = 0.
k n
j=1 i=1

To check i), we see that, as a consequence of (6.27) and (6.28),



n
1 
n
1 
n
P (
cn,j j
>
) E
c n,j

 2
j 2 =
cn,j
22 E(j )2
j=1

2 j=1
2 j=1

1 

n

c n,j
2
2 E12 I{1 >/ maxi cn,i 2 } 0

2 j=1

as n . This calculation shows also that for every f L2 (0, 1)



n 
n 
n
Var(cn,j j , f ) Ecn,j j , f 2
f
22 E
cn,j j
22 0,
j=1 j=1 j=1

which implies that ii) is equivalent to



n
Var(cn,j j , f ) K (f, f ). (6.36)
j=1

In order to prove (6.36), we recall that


n 
n
Cov( cn,j (s)j , cn,j (t)j ) = Kn (s, t),
j=1 j=1

which, combined with (6.28), implies that


n  1
n


Var(cn,j j , f ) = Var cn,j (t)j f (t)dt


j=1 0 j=1
 1 1 
n 
n

= Cov cn,j (s)j , cn,j (t)j f (s)f (t)dsdt


0 0 j=1 j=1
 1  1  1  1
= Kn (s, t)f (s)f (t)dsdt K(s, t)f (s)f (t)dsdt
0 0 0 0
= K (f, f ),
proving (6.36). Finally, to show that iii) holds, let {i } be a c.o.n.s. of eigenfunc-
tions of the covariance operator K and let {i } be the associated eigenvalues.
58 E. del Barrio

n
Then, using (6.29) and the fact that, as shown above, j=1 E
cn,j j
22 0, we
have
n 
n
lim E
cn,j j, Ecn,j j,
2 = lim E
cn,j j Ecn,j j
2
n n
j=1 j=1

n

= lim
cn,j
2 = i
n
j=1 i=1

and, by (6.36),

n 
k 
k 
k
lim Ecn,j j, Ecn,j j, , i 2 = K (i , i ) = i ,
n
j=1 i=1 i=1 i=1

which completes the proof. 


The above proof shows that the suciency part of Theorem 6.8 holds as well
if we only assume that the random variables i are i.i.d. and square integrable (with
trivial adjustments to account for mean and variance possibly dierent from 1).
As an immediate consequence of Theorem 6.8 we obtain sucient conditions
for convergence in law of the L2 norms of linear combinations of independent
exponential random variables:
Corollary 6.9. Suppose (6.27), (6.28), (6.29) and (6.30) hold. Let S be a metric
space and let H : L2 (0, 1)  S be a continuous function. Then
H(Yn ) H(Y ),
d
where Y is an L2 (0, 1)-valued Gaussian random variable with mean function m
and covariance operator K given by
 1 1
K (f, g) = K(s, t)f (s)g(t)dsdt
0 0
for f, g L2 (0, 1).
2
Below, we will apply this corollary for H(f ) =
f
22 k=1 f, hk
2
with
hk L 2 .
Remark 6.9.1. The limiting random  process Y in Theorem 6.8 and Corollary 6.9
is centered if and only if
mn
22 = i,j cn,i,j 0. The type of argument employed
in the proof of Theorem 6.8 shows that, under the innitesimality condition (6.27),
conditions (6.28) and (6.29) are necessary and sucient for convergence in law of
the processes Yn EYn and that the limiting random process has then a centered
Gaussian distribution with covariance operator K .

b) Shift convergence of
Yn
22 , I. It can be proved that shift tightness of {
Yn
22 }
implies tightness of the sequence centered at expectations, and even tightness of
the sequence {
Zn
22 E
Yn
22 }. We comment this to help to appreciate sharpness
of the results that follow, but omit any proof of it. In the previous subsection, we
Empirical and Quantile Processes 59

examined the case when the kernels Kn associated to Y n converge


in L2 to a
trace class kernel. In that case, Yn EYn Y m = i i i Zi and
Y
 d
m
22 = i i Zi2 , where {Zi } is an ortho-Gaussian sequence (a sequence of i.i.d.
standard
 normal random variables). Of course, convergence of this series requires
i i < . However, if we allow centering, then


Yn EYn
22 E
Yn EYn
22 i (Zi2 1)
d
i

and, clearly,in order to make sense of this limit it suces (and is necessary as
i i < , a weaker condition. We deal here with this situation,
2
well) that
that is, we relax the assumptions on K in Theorem 6.8 by only assuming that
K L2 ((0, 1) (0, 1)). In this case, the operator induced
 by K on L2 (0, 1) is
Hilbert-Schmidt, that is, its eigenvalues {k } satisfy k 2k < (e.g., Dunford
and Schwartz (1963), XI.6 and XI.8.44). Then, with considerable abuse of notation,
we dene


Y EY
22 E
Y EY
22 := k (Zk2 1), (6.37)
k

where the variables Zk are independent standard normal (Y may not exist but the
series does converge a.s.). In fact,
Y EY
22 E
Y EY
22 makes sense as a
multiple Wiener integral, but this is of marginal interest for the sequel.
We state now a useful lemma on the asymptotic normality of sums of inde-
pendent exponential random variables. The proof is omitted, but it is just an easy
exercise on the CLT on the line if one uses the fact the Gaussian family is factor
closed.

Lemma 6.10. If {an,i : i = 1, . . . , n; n N} is a triangular array of real numbers


then ni=1 an,i (i 1) Z, where Z is a standard normal random variable, if
d
and only if
n
max |an,i | 0 and a2n,i 2 . (6.38)
i
i=1
n
Moreover, if maxi |an,i | 0 then the only possible limit laws of 
i=1 an,i (i 1)
are normal and convergence in law is equivalent to convergence of ni=1 a2n,i .

The main argument in this subsection is contained in the proof of the follow-
ing proposition.

Proposition 6.11. If maxi


cn,i
2 0 and Kn L2 K (so that K is necessarily
in L2 ((0, 1) (0, 1))),then


Yn EYn
22 E
Yn EYn
22
Y EY
22 E
Y EY
22 ,
d

where
Y EY
22 E
Y EY
22 is as dened in (6.37).
60 E. del Barrio

Proof. Let {k } be a c.o.n.s. of eigenfunctions of K, k of eigenvalue k for each


k. Then

  

Yn EYn
22 E
Yn EYn
22 = Yn EYn , k 2 EYn EYn , k 2 ,
k=1
n
where Yn EYn , k = i=1 cn,i , k (i 1). By Lemma 6.10 under (6.38) the
limit laws of Yn EYn , k are normal, and there is convergence if
only possible
and only if { ni=1 cn,i , k 2 } converges. Since Kn L2 K, we have in particular,
 1 1
 
E Yn EYn , k Yn EYn , l = Kn (s, t)k (s)l (t)dsdt
0 0
 1  1 
k if k=l
K(s, t)k (s)l (t)dsdt = ,
0 0 0 if k = l

and therefore (Yn EYn , k )M M


k=1 d (k Zk )k=1 . Then, by the continuous mapping
theorem for weak convergence,

M
  
M
Yn EYn , k 2 EYn EYn , k 2 M := k (Zj2 1). (6.39)
d
k=1 k=1

Since k=1 2k =
K
22 < , M converges a.s. and in L2 and, with some abuse of
notation, as explained above, we denote this limit as
Y EY
22 E
Y EY
22 ,
that is, we have,
M
Y EY
22 E
Y EY
22 . (6.40)
L2
M n n
Set cM= cn,i
n,i k=1 cn,i , k k , YnM = M
i=1 cn,i i and KnM = M
i=1 cn,i cM
n,i .
Observe that

  
Yn EYn , k 2 EYn EYn , k 2 =
YnM EYnM
22 E
YnM EYnM
22 .
k=M+1
(6.41)
We claim that, under (6.28),
lim lim sup Var(
YnM EYnM
22 ) = 0. (6.42)
M n
 
We recall that K = k=1 k k k and dene K
M
= k=M+1 k k k . We

can easily see that
K M
22 = k=M+1 2k . Now, since {k k }k { 12 (k l +
k l )}k =l is an orthonormal basis for L2 ((0, 1) (0, 1)) we have that
 

KnM K M
22 = KnM K M , k k 2 + 2 KnM K M , k l 2 (6.43)
k k =l

(here we have used the fact that f f, g h = f f, h g ). Observe that


K M , k l = K, k k if k = l > M and K M , k l = 0, otherwise
Empirical and Quantile Processes 61

and, also, that KnM , k l = Kn , k l if k, l > M and KnM , k l = 0


otherwise. Combining this with (6.43) we obtain that
 

KnM K M
22 = Kn K, k k 2 + 2 Kn K, k l 2
k>M k =l;k,l>M
 
Kn K, k k + 2
2
Kn K, k l 2 =
Kn K
22 .
k k =l

The last inequality implies that L2 K M and also that


KnM
2
K M
2 .
KnM
Now this convergence combined with the fact that Var(
YnM EYnM
22 ) 8
KnM
22
prove claim (6.42). Now, the proposition follows from (6.39), (6.40), (6.41) and
(6.42) through a standard 3
argument. 
We should remark that this proposition also holds true if we replace the
sequence of exponential random variables by an i.i.d. sequence of square integrable
random variables, with only formal changes in the proof.
Both, Theorem 6.8 and Proposition 6.11 are practically exercises on the cen-
tral limit theorem in Hilbert space, however, Proposition 6.11 can be seen as a
limit theorem for quadratic forms, and this subject has a long history, reviewed
e.g. in Guttorp and Lockhart (1988). Theorem 1 in de Wet and Venter (1973) and
Theorem 5 in Guttorp and Lockhart (1988) could seemingly apply to give Propo-
sition 6.11, however the conditions in either theorem are quite dicult to verify
and we could not check them in our case, whereas the conditions in Proposition
3.5 are very easy to decide in general.
We conclude this subsection giving sucient conditions for convergence in
law of {
Yn
22 E
Yn
22 }. The result is not directly applicable to Wasserstein
distances, but the changes needed for that arestraightforward
and omitted here. A
warning on notation:
 2 we write Y EY, h = 
k k , h Z k for k orthonormal,
h L2 (0, 1) and < although Y EY may not make sense.
Theorem 6.12. If maxi
cn,i
2 0, Kn L2 K and mn L2 m then

Yn
22 E
Yn
22
Y EY
22 E
Y EY
22 + 2Y EY, m
d



:= k (Zk2 1) + 2 k m, k Zk
k=1 k=1

where
Y E
22 E
Y EY
22 is dened as in (6.37) and {Zk } is an ortho-
Gaussian sequence.
Proof. Formally we require the proof of Proposition 6.11 rather than its statement.
First we note

Yn
22 E
Yn
22 = (
Yn EYn
22 E
Yn EYn
22 ) + 2Yn EYn , EYn . (6.44)
As in the previous proof,
EYn EYn , EYn Yn EYn , k = Kn , mn k
K, m k = k m, k .
62 E. del Barrio

This implies that for each M we have convergence in law of the vector
 
Yn EYn , 1 , . . . , Yn EYn , M , Yn EYn , EYn
to the Gaussian vector


 
1 Z1 , . . . , M ZM , k m, k Zk , .
k=1
This gives weak convergence, for every M < , of the random variables

M
 
Yn EYn , k 2 EYn EYn , k 2 + 2Yn EYn , EYn ,
k=1

in analogy with (6.39). By (6.44) these random variables are nite-dimensional


approximations of the sequence of interest, and the result now follows by the
approximation argument in the previous proof, as a consequence of the limit (6.42).
Now, this and (6.42) gives the result. 
The hypotheses in this theorem are very natural. We will not deal with the
question of whether they are necessary (given innitesimality) however, note that
the existence of K and m are necessary in order to dene the limit.
c) Shift convergence of
Yn
22 , II. There are some situations in which Kn is not
convergent in L2 but, nevertheless, {
Yn EYn
22 E
Yn EYn
22 } is weakly
convergent. From the denitions we see that

Yn EYn
22 E
Yn EYn
22
 
n
!
= cn,i,j (i 1)(j 1) + cn,i,i (i 1)2 1
1i =jn i=1
n  

 i1
  
n
= 2 cn,i,j (j 1) (i 1) + cn,i,i (i 1)2 1 = xn,i ,
i=1 j=1 i=1

where

i1
 
xn,i = 2 cn,i,j (j 1) (i 1) + cn,i,i (i 1)2 1
j=1

(and we use the convention that 0j=1 aj = 0). If {i } denotes an independent
copy of the sequence {i } and we set

i1
 
xn,i = 2 cn,i,j (j 1) (i 1) + cn,i,i (i 1)2 1
j=1

for i = 1, . . . , n and Fn,i = (1 , . . . , i ), then, for each n N, {xn,i } and {xn,i } are
tangent sequences with respect to {Fi }, that is, L(xn,i |Fn,i1 ) = L(xn,i |Fn,i1 )
and the random variables xn,i are conditionally independent given the sequence
{i }. Hence, {xn,i } is a decoupled tangent sequence to {xn,i } (see, e.g., de la Pena
and Gine (1999), Chapter 6). Decoupling introduces enough independence among
Empirical and Quantile Processes 63

n
the summands in i=1 xn,i to enable us to use the CLT in order to obtain their
asymptotic distribution. The principle of conditioning (Theorem 1.1 in Jakubowski
(1986), reproduced in de la Pena and Ginen(1999), Theorem 7.1.4) can then be
used to conclude convergence in law of i=1 xn,i itself. The proof of our next
result follows this approach.
Theorem 6.13. Let Z be a standard normal random variable. If
max
cn,i
0, (6.45)
i


n
2
Kn
22 + 6
cn,i
42 2 , (6.46)
i=1
and
 
2  
2
cn,i , cn,j cn,i cn,k + cn,i , cn,j cn,i , cn,i + cn,j 0,
j =k i:i>jk j i:i>j
(6.47)
then

Yn EYn
22 E
Yn EYn
22 Z. (6.48)
d
If, instead of conditions (6.45), (6.46) and (6.47), we have

max
cn,i
22 + |cn,i , mn | 0, (6.49)
i


n 
2
Kn
22 + 2
cn,i
42 + 4 cn,i , cn,i + mn 2 2 , (6.50)
i=1 i
and
 
2
cn,i , cn,j cn,i cn,k
j,k i:i>jk
 
2
+ cn,i , cn,j cn,i , cn,i + mn + cn,j 0, (6.51)
j i:i>j

then

Yn
22 E
Yn
22 Z. (6.52)
d
n
Proof. We rst prove the limit in (6.48). If we set Un = i=1 xn,i , with xn,i
dened as above, the principle of conditioning (Jakubowski (1986)) reduces the
proof to showing that
L(Un |{i }) N (0, 2 )
w
in probability. Arguing as in the proof of Lemma 6.10, we can see that this is
equivalent to proving that
 
i1
2

An := max E(x2n,i |{j }) = 4 max c2i,i + ci,i + ci,j (j 1) 0 (6.53)


i i Pr
j=1
64 E. del Barrio

and
   
i1
2
Bn := E(x2n,i |{j }) = 4 c2i,i + 4 ci,i + ci,j (j 1) 2 . (6.54)
Pr
i i i j=1

After a straightforward but cumbersome computation that we omit, we can see


that

n
EBn = 2
Kn
22 + 6
cn,i
42
i=1
and
 
2  
2 
Var(Bn ) = 16 ci,j ci,k +4 ci,j (ci,i + ci,j ) ,
j =k i:i>jk j i:i>j

which, by (6.46) and (6.47), immediately give (6.54). We


 i1  check now (6.53), which is
equivalent to maxi ci,i 0 and maxi  j=1 ci,j (j 1) 0. This last convergence
Pr
follows from (6.47) and the use of a rened Octavianis maximal inequality (e.g.,
Proposition 1.1.2 in de la Pena and Gine (1999)):
 i1    i1  
   
P max  ci,j (j 1) > t 3 max P  ci,j (j 1) > t/3
i i
j=1 j=1

27 
i1
27 27 
n
max c2 2 max c2i,j = 2 maxcn,i cn,i , Kn
t2 i j=1 i,j t i
j=1
t i

27


Kn
2 max
cn,i
22 0.
t2 i

This concludes the proof of the limit (6.48). We pay attention now to the limit
(6.52). The fact that


Yn
22 E
Yn
22 = ci,j (i 1)(j 1)
1i =jn
 n
 ! 
+ ci,i (i 1)2 1 + 2cn,i , mn (i 1)
i=1
where

i1
 
yn,i = 2 cn,i,j (j 1) (i 1) + cn,i,i (i 1)2 1 + 2cn,i , mn (i 1),
j=1

can be used to conclude (6.52) by reproducing the proof of (6.48) almost verbatim.

The tool for Theorem 6.13, namely the principle of conditioning, which could
be easily replaced by the Brown-Eagleson central limit theorem for martingales,
has been used before in analogous situations. We will just mention P. Hall (1984),
who uses it in density estimation, in order to prove a limit theorem for degenerate
Empirical and Quantile Processes 65

U -statistics with varying kernels. His result is dierent from ours and does not
apply here, but there are similarities in the proofs.
The assumptions in the above theorem are quite tight (for instance, it can
be shown that they are necessary for the limits (6.53) and (6.54)). An easier to
check set of (stronger) sucient conditions, more adapted to the quantile process
case can be stated with the following notation. We dene
 1 
(Kn Kn )(s, t) := Kn (s, u)Kn (t, u)du = cn,i , cn,j cn,i cn,j .
0 i,j

 
2
It can be easily checked that
Kn Kn
22 = j,k i cn,i,j cn,i,k and also that
 1  1  1  1

Kn Kn
22 = Kn (s, t)Kn (u, v)Kn (s, u)Kn (t, v)dsdtdudv.
0 0 0 0

The assumptions in the following result are often easier to deal with than (6.51).
The proof can be found in [31].

Corollary 6.14. If

n 
n
cn,i , mn 2 0,
cn,i
42 0 and
Kn Kn
2 0, (6.55)
i=1 i=1


Kn (s, t) := |cn,i (s)cn,i (t)| CKn (s, t) (6.56)
i

for some absolute constant C < , and


Kn
22 2 /2, (6.57)

then

Yn
22 E
Yn
22 Z,
d

where Z is standard normal.

Note that if
Kn Kn
2 0 we cannot have Kn K in L2 unless K = 0.

Remark 6.14.1. The results on convergence or shift convergence in law of


Yn
22
derived so far in this article assume innitesimality on the coecients cn,i (maxi

cn,i
0). Of course, if conditions of this type are removed, other asymptotic
distributions can be obtained. It is straightforward to see, for instance, that if

(cn,i,j i,j )2 0, (6.58)
1i,jn
66 E. del Barrio


for some real numbers {i,j } satisfying i,j
2
i,j < then

n
!

Yn EYn
22 E
Yn EYn
22 = cn,i,j (i 1)(j 1) i,j
i,j=1

 !
i,j (i 1)(j 1) i,j .
L2
i,j=1

Note
 that the limiting random variable is well dened because the condition
i,j i,j < implies that the associated partial sums are L2 convergent. If,
2

further,
n n
2
cn,i,j i 0 (6.59)
i=1 j=1

for some real numbers i such that i=1 i2 < , then we also have that



!

Yn
22 E
Yn
22 i,j (i 1)(j 1) i,j + 2 i (i 1). (6.60)
L2
i,j=1 i=1

d) Shift convergence of
Zn
22 . Yn can be replaced by Zn in Theorem 6.8 as an
immediate consequence of the law of large numbers, whereas it can be replaced in
Theorem 6.12 and Corollary 6.14 because of the following proposition.
Proposition 6.15. Suppose
Yn
22 E
Yn
22 converges in law. Then,
Zn
22 E
Yn
22
converges in law to the same limit if and only if
E
Yn
22
0. (6.61)
n
In particular, this condition is satised if both conditions,
 
i cn,i,i
0 and cn,i , mn 2 0, (6.62)
n i

hold. If (6.61) holds, we also have Zn EYn , h Yn EYn , h 0 for any
h L2 (0, 1).
Proof. Since
 2   
n1 Sn Sn  

Zn
22
Yn
22 = 1+ 1
Yn
22 = OP n1/2
Yn
22 ,
Sn n1 n1
by the central limit theorem and the law of large numbers, the necessity and
suciency of condition (6.61) follows from Lemmas 2.2 and 3.1. Now, by (6.26)
and Cauchy-Schwartz,
  
1 1    1   
E
Yn
22 =  ci,i + ci,j  ci,i + cn,i , cn,j 2 ,
n n i j
n i i j

which gives the suciency of (6.62). 


Empirical and Quantile Processes 67

6.2. Weak convergence of weighted L2 functionals of the quantile process


We distinguish between the uniform and the general quantile processes.

The uniform quantile process. Recall from the beginning of this section that if un
is the uniform quantile process, then
 11/n  2  2  n+1 2
un (t) n  
Ln :=
d
dt=  cn,i i 
g(t) S n+1
 
1/n i=1 2

where an,i are as dened in (6.3) and cn,i = n1/2 an,i (t)I[1/n,11/n] (t)/g(t). With
the help of Lemma 6.1, the results of Section 6.1 can be easily specialized to this
situation.
a) The innitesimality condition (6.27): It follows from the denitions that
 11/n 2 
1 t + (1 t)2 1 11/n 1
dt max
c

n,i 2
2
dt,
2n 1/n g 2 (t) i n 1/n g 2 (t)
and from this we conclude that condition (6.27) is equivalent to

1 11/n 1
dt 0. (6.63)
n 1/n g 2 (t)
b) Convergence of Kn and denition of K. Also from the denitions (in (6.21) and
Lemma 6.1), we have
1 Kn (s, t)
Kn (s, t) = I{1/n s, t 1 1/n},
n g(s)g(t)
so that by Lemma 6.1 iii),
(s t st)
Kn (s, t) K(s, t) := (6.64)
g(s)g(t)
pointwise, hence, by Lemma 6.1 iii) and dominated convergence, Kn L2 K if
and only if K L2 ((0, 1) (0, 1)), if and only if
 1 1
(s t st)2
ds dt < . (6.65)
0 0 g 2 (s)g 2 (t)
Next we see that the limiting kernel K is trace-class and the limit (6.29) holds if
and only if
 1
t(1 t)
dt < . (6.66)
0 g 2 (t)
In fact, by Lemma 6.12 iii),

n+1   1
1 11/n Kn (t, t) t(1 t)

cn,i
22 = 2 (t)
dt dt
i=1
n 1/n g 0 g 2 (t)
regardless of whether the limiting integral is nite or not. Thus (6.66) is necessary
in order to get a nite limit in (6.29). On the other hand, if (6.66) holds and
68 E. del Barrio

Bg (t) = B(t)/g(t), where B(t), 0 < t < 1, is a Brownian bridge, then Bg is a


centered, L2 (0, 1)-valued Gaussian process with covariance function K(s, t). Thus,
if i and i denote the eigenvalues and eigenfunctions, respectively, of the kernel
K, then
 1  1  1 2   1
2
t(1 t) 2 B (t) B(t)
dt = EB g (t) dt = E dt = E i (t)dt
0 g 2 (t) 0
2
0 g (t) i=1 0 g(t)

  1 B(t)
2   1 1
s t st 
= E i (t)dt = i (s)i (t)dsdt = i < .
i=1 0 g(t) i=1 0 0 g(s)g(t) i=1
Hence, if (6.66) holds then K is trace-class and (6.29) holds.
c) Convergence of mn to m = 0 assuming innitesimality. If the innitesimality
condition (6.63) holds, then
 n+1 2  
  1 1 mn (t)2 1 1 1
 cn,i  = dt dt 0,
i=1
2 n 0 g 2 (t) n 0 g 2 (t)
showing that mn 0 in L2 .
Finally, note that condition (6.66) implies conditions (6.63) and (6.65): the
rst, by dominated convergence, and condition (6.65) because, since (s t st)2
s(1 s)t(1 t), we have
 1 1  1 1  1 t(1 t)
2
(s t st)2 s(1 s)t(1 t)
ds dt ds dt = dt .
0 0 g 2 (s)g 2 (t) 0 0 g 2 (s)g 2 (t) 0 g 2 (t)
Summarizing, and the law of large numbers for Sn /n give the following:
Theorem 6.16. Let un (g) denote the weighted uniform quantile process, that is,
un (g)(t) = (un (t)/g(t))I{1/n t 1 1/n}}, 0 < t < 1, where g is a non-zero
measurable function. Assume

1 11/n 1
dt 0.
n 1/n g 2 (t)
Then the sequence of processes {un (g)} is weakly convergent in L2 (0, 1) to a non-
degenerate limit if and only if
 1
t(1 t)
dt < .
0 g 2 (t)
In this case,
un (g) Bg
d
in L2 (0, 1), where Bg (t) = B(t)/g(t) and B is a Brownian bridge. In particular
 11/n 2  1 2
un (t) B (t)
2 (t)
dt 2
dt.
1/n g d 0 g (t)

Only the necessity part of this theorem may be considered new; the suciency is
well known (see e.g., Mason (1984), Csorgo and Horvath (1988) and (1993) p. 354).
Empirical and Quantile Processes 69

Since, under innitesimality, mn 0 in L2 , we have that the analogue of


Theorem 6.12 for Yn = (Sn+1 /n)un holds under conditions (6.63) and (6.65). In
order to get
rid of the factor Sn+1 /n, according to Proposition 6.15 we must have
E
Yn
22 / n 0, which follows from conditions (6.62). Now, if conditions (6.63)
and (6.65) hold, then so do conditions (6.62): the second condition in (6.62) is
obvious because mn 0 in L2 and supn
Kn
2 < (as Kn converges in L2 ), so

cn,i , mn 2 = Kn , mn mn
Kn
2
mn
22 0,
i

and the rst follows because, by Lemma 6.1 iii),


 11/n  11/n
1  1 Kn (t, t) 3 t(1 t)
cn,k , cn,k = dt dt,
n i n 1/n ng 2 (t) n 1/n g 2 (t)
and it is easy to see that this last expression tends to zero
as a consequence
of
(6.63) (divide the domain of integration at the points 1/ n, 1/2 and 1 1/ n).
Hence, Theorem 6.12 together with Proposition 6.15 give:
Theorem 6.17. If conditions
  
1 11/n 1 1 1
(s t st)2
dt 0. and ds dt < .
n 1/n g 2 (t) 0 0 g 2 (s)g 2 (t)
hold then
 11/n  
u2n (t) 11/n
t(1 t) 1
B 2 (t) EB 2 (t)
dt dt dt, (6.67)
1/n g(t)2 1/n g 2 (t) d 0 g 2 (t)
where the integral of (B 2 EB 2 )/g 2 is dened in a limiting L2 sense.
Proof. By Theorem 6.12 and the above observations, it only remains to show that
we can actually replace E
Yn
22 by the centering constants in (6.67). By (6.63)
and Lemma 6.1, we have
  11/n 
 t(1 t)  1 11/n |nt(1 t) Kn (t, t) m2n (t, t)|
E
Yn
22 dt  dt
1/n g 2 (t) n 1/n g 2 (t)

4 11/n 1
dt 0. 
n 1/n g 2 (t)

For g(t) = (1 (t)), where and denote the standard normal density and
distribution function respectively, this result goes back, in one form or other, to de
Wet and Venter (1972), but it seems to be new in the generality it is given here.
See also Gregory (1977) and del Barrio, Cuesta-Albertos, Matran and Rodrguez-
Rodrguez (1999).
Next we examine the normal convergence case (as a consequence of Corollary
6.14). We will further relax integrability (so, condition (6.65) will not hold), but,
for convenience, will impose regular variation of g at at least one of the end points
0 and at 1. Standard use of the basic properties of regular variation shows that,
70 E. del Barrio

if g is regularly varying at 0 and at 1 with exponent , then the hypotheses of


Theorem 6.17, (6.63) and (6.65), both hold for < 1 and fail if > 1. We study
now the borderline case in which = 1 and (6.65) fails, that is,
 1x  1x
(s t st)2
L(x) := 2 dsdt as x 0. (6.68)
x x g 2 (s)g 2 (t)
This case will fall within the scope of Corollary 6.14 and we will obtain normal
convergence. Besides the function L(x) just dened, it is convenient to introduce
two more functions,
 1x  1x
(s t st)
M (x) = 2 dsdt and
x x g 2 (s)g 2 (t)

(s t st)(s u su)(t v tv)(u v uv)
R(x) = dsdtdudv,
x g 2 (s)g 2 (t)g 2 (u)g 2 (v)
where x = (x, 1 x)4 , and establish their relationship with L. The next lemma
will be helpful in this goal. The proof (cumbersome, but routine calculations based
on regular variation) is omitted. It can be found in [31].
Lemma 6.18. Assume g > 0 is regularly varying at 0 and at 1 with exponent 1,
or that g is regularly varying with exponent one at one of these points and with
smaller exponent in the other. Assume also that L(x) as x 0. Then
xM (x)
lim =0 (6.69)
x0 L(x)
and
R(x)
lim = 0. (6.70)
x0 L2 (x)
Theorem 6.19. Assume g > 0 is regularly varying at 0 and at 1 with exponent 1,
or that g is regularly varying with exponent one at one of these points and with
smaller exponent at the other. Assume also that L(x) as x 0. Let Z denote
a standard normal random variable. Then
1  11/n u2 (t)  11/n
t(1 t)

n
dt dt Z.
L(1/n) 1/n g 2 (t) 1/n g 2 (t) d

Proof. We only consider the case when g is symmetric about 1/2. We can apply
Corollary 6.14 with cn,i (t) = 1/2
an,i (t)
g(t) I{1/n t 1 1/n}. Now we
1
nL (1/n)
have that  
11/n 11/n
2 Kn2 (s, t)
2
Kn
22 = 2
ds dt
n L(1/n) 1/n 1/n g 2 (s)g 2 (t)
and
  n+1

n+1
2 11/n 11/n
i=1 a2n,i (s)a2n,i (t)

cn,i
42 = 2 ds dt.
i=1
n L(1/n) 1/n 1/n g 2 (s)g 2 (t)
Empirical and Quantile Processes 71

We claim that

n+1 
n+1
2
Kn
22 1,
cn,i
42 0 and cn,i , cn,i + mn 2 0. (6.71)
i=1 i=1
n+1
In fact, from Lemma 6.1 we obtain that i=1 a2n,i (s)a2n,i (t) 3n(stst), which,
by Lemma 6.18, implies that

n+1 1
M (1/n)

cn,i
42 6 n 0
i=1
L(1/n)

and proves the second part of (6.71). The rst part can be obtained using Lemma
6.1 ii) and iii) to see that |Kn (s, t)2 n2 (s t st)2 | = |Kn (s, t) + n(s t
st)||Kn (s, t) n(s t st)| 8n(s t st) and, consequently, that
    11/n  11/n K 2 (s, t) n2 (s t st)2 
2
Kn
2 1 = 2  n 
2  ds dt 
n2 L(1/n) 1/n 1/n g 2 (s)g 2 (t)

1
M (1/n)
16 n 0.
L(1/n)
Finally, the third part of claim (6.71) is a consequence of

n+1
cn,i , mn 2 = Kn , mn mn
i=1
 11/n  11/n 1
1 Kn (s, t)mn (s)mn (t) M (1/n)
= ds dt n 0.
n2 L(1/n) 1/n 1/n
2 2
g (s)g (t) L(1/n)
since (a + b) 2a + 2b . The limits (6.71) prove the rst two limits in (6.55) and
2 2 2

the limit in (6.57) (with 2 = 1). Lemma 6.1 iii) gives that (6.56) is also satised
(with C = 6). Finally, the third limit in (6.55) follows from Lemma 6.18 since
81R((1/n)

Kn Kn
22 0.
n4 L2 (1/n)
Corollary 6.14 implies now that
Yn
22 E
Yn
22 w N (0, 1). The conditions
(6.62) from Proposition 6.15 are also satised because of the last two limits in
(6.71) (see the argument immediately before Theorem 6.17) and therefore we also
have
Zn
22 E
Yn
22 w N (0, 1). Now we are only left with showing that we
 11/n
can replace E
Yn
22 by L1/2 (1/n) 1/n t(1 t)g 2 (t)dt as centering constants.
Arguing as in the proof of Theorem 6.17 we see that
  11/n  11/n
 1 t(1 t)  4 1
E
Yn
2 1/2
2
dt dt 0,
L (1/n) 1/n g 2 (t) nL1/2 (1/n) 1/n g 2 (t)
where the last limit is a consequence of (6.59). 
72 E. del Barrio

Finally, we consider smaller functions g, corresponding to Remark 6.16.1.


The following result is only given for completeness and only symmetric weights
are considered in the proof. The one sided analogue is already contained in Csorgo
and Horvath (1988).
Proposition 6.20. Let g be a positive function on (0, 1), regularly varying at 0 and
at 1 with exponent > 1 and with equal or smaller exponent at the other extreme,
and such that limx0 g(x)/g(1 x) := c [0, ]. Set
 1x
t(1 t)
E(x) := dt.
x g 2 (t)
Then,
 11/n 2
1 un (t)
dt
E(1/n) 1/n g 2 (t)
 (S (1) y)2  (S (2) y)2

1 c2 [y] 1 [y]
dy + dy ,
d 1 1 + c2 1 y 2 1 + c2 1 y 2
(1)
where {S[y] : y 1} is the partial sum process associated to the sequence {j } of
(1) [y] (2)
independent exponential random variables, that is, S[y] = j=1 j , and {S[y] } is
(1)
an independent copy of {S[y] }.
Proof. As above, we only consider the case when g is symmetric. Symmetry of g
and the fact that an,j (1 t) = an,n+2j (t) show that
 11/n 2 n
2  11/n  n+1 2
1 un (t) 1 j=1 an,j (t)j
dt = dt
E(1/n) 1/n g 2 (t) Sn+1 nE(1/n) 1/n g 2 (t)
n
2
= (Vn(1) + Vn(2) ),
Sn+1
where

n+1 2
 1/2 a n,j (t)j
1 j=1
Vn(1) = 1 dt
nE( n ) 1/n g 2 (t)
and

n+1 2
 an,j (t)n+2j
1/2
1 j=1
Vn(2) = dt.
nE( n1 ) 1/n g 2 (t)
We set bn,j (t) = I{j 1 < nt} and dene
 1/2  n+1 2
(1) 1 j=1 bn,j (t)j nt
Wn = dt
nE( n1 ) 1/n g 2 (t)
Empirical and Quantile Processes 73

(2)
and, similarly, Wn , replacing j with n+2j . Now, since

2 1  1/2 t2
|(Vn(1) )1/2 (Wn(1) )1/2 |2 n(1 Sn+1 ) dt = OP (1) 0
n
E( n1 ) 1/n g 2 (t) Pr

(1) (1) (1) (2) (2)


and Vn = OP (1), we see that Vn Wn = oP (1). Analogously, Vn Wn =
(1) (2)
oP (1), showing that (6.72) is equivalent to convergence in law of Wn + Wn
(1) d (2) (1)
to the right-hand side there. Obviously, Wn = Wn and, moreover, Wn and
(2)
Wn are asymptotically independent: they are indeed independent if n is odd
(1)
since bn,j (t) = 0 if t 1/2 and j > (n + 1)/2 and therefore Wn depends only
(2)
on 1 , . . . , (n+1)/2 while Wn depends only on (n+3)/2 , . . . , n+1 ; the overlapping
that arises if n is even is negligible. Hence, in order to prove (6.72) if suces to
show that
 (S (1) y)2
1 [y]
Wn(1) dy. (6.72)
d 1 1 y 2
To see this, we note that
 
Wn(1) = cn,i,j (i 1)(j 1) + 2 dn,i (i 1) + en
i,j i

with
 n/2
1 1
cn,i,j = 2 1 2 (y/n)
dy if i j > 1,
n E( n ) ij1 g
 n/2
1 ([y] y)
cn,1,1 = cn,2,2 , dn,i = 2 1 dy if i > 1,
n E( n ) i1 g 2 (y/n)
 n/2
1 ([y] y)2
dn,1 = dn,2 and en = 2 1 dy.
n E( n ) 1 g 2 (y/n)
Similarly,
 (1)
(S[y] y)2  
1
dy = i,j (i 1)(j 1) + 2 i (i 1) +
,
1 1 y 2 i,j i
  ([y]y)
where i,j = 1 1
ij1 y 2
1 dy if i j > 1, 1,1 = 2,2 , i = 1
1
i1 y 2
dy if
1
 ([y]y)2
i > 1, 1 = 2 and
= 1 1 2 dy. Standard regular variation techniques
 y
show that i,j (cn,i,j i,j ) 0, i (dn,i i )2 0 and en
, yielding as in
2

Remark 3.10 that


 (1)
(S[y] y)2
1
Wn(1) dy
L2 1 1 y 2
and proving (6.72). 
74 E. del Barrio

The general quantile process. By Proposition 6.5, we can transfer the results in
the previous subsection to the general quantile process just by taking g(t) =
f (F 1 (t))/ w(t). Let us denote as General Hypotheses or (GH) the following
conditions on the cdf F and the weight w:
GH F is twice dierentiable on its open support (aF , bF ) with f (x) = F  (x) >
0 there, and satises conditions (6.7), 6.13 and 6.14. w is a non-negative
measurable function on (0, 1) and satises conditions 6.15.
These, together with (6.11), are the conditions under which we can transfer results
on un to vn by Proposition 2.5. (6.11) is not included because it will be subsumed
by other conditions, in fact, because, by dominated convergence,
 1 
t(1 t) 1 11/n w(t)dt
1 (t))
w(t)dt < (6.11) 0.
2
0 f (F n 1/n f 2 (F 1 (t))
We then have:
Theorem 6.21. Let B be a Brownian bridge on (0, 1) and let Z be a standard
normal random variable.
a) If F and w satisfy (GH) and
 1
t(1 t)
2 (F 1 (t))
w(t)dt < (6.73)
0 f
then
B(t) w(t)
vn (t) in law in L2 (0, 1),
f 2 (F 1 (t))
in particular,
 1  1
B 2 (t)
vn (t)w(t)dt
2
w(t)dt in distribution.
0 0 f 2 (F 1 (t))
b) If F and w satisfy (GH) and
 11/n 1/2
1 t (1 t)1/2
w(t)dt 0
n 1/n f 2 (F 1 (t))
and  
1 1
(s t st)2
w(s)w(t)dsdt < (6.74)
0 0 f 2 (F 1 (s))f 2 (F 1 (t))
then
  
1 11/n
t(1 t) 1
B 2 (t) EB 2 (t)
vn2 (t)w(t)dt w(t)dt w(t)dt.
0 1/n f 2 (F 1 (t)) w 0 f 2 (F 1 (t))
c) Assume F is twice dierentiable on its open support (aF , bF ) with f (x) =
F  (x) > 0 there, that F satises condition (6.7) and that the function g :=
f (F 1 ) is regularly varying with exponent one at at least one of the two points
Empirical and Quantile Processes 75

0 and at 1, and with exponent not larger than one in the other. Assume also
that
 1x  1x
(s t st)2
L(x) := 2 dsdt (6.75)
x x f 2 (F 1 (s)f 2 (F 1 (t))
as x 0. Then,
 1  1 
1 t(1 t)
vn2 (t)dt 2 1 (t))
dt Z in distribution. (6.76)
L(1/n) 0 0 f (F

Proof. By Proposition ?? and the remark above on condition (6.11), the statements
a) and b) of the theorem do not require proof. But part c) does (Proposition ??)
does not apply in this case). As usual, we assume f (F 1 symmetric about 1/2. If
 11/n
we can replace
vn
22 in (6.76) by
un /f (F 1 )
22,n = 1/n u2n (t)/f (F 1 (t))dt,
the result will follow from Theorem 6.19. By the proof of Lemma 6.4, we can
replace
vn
22 by
vn
22,n if we show that
 x
x2 1
lim   = 0, and lim (F 1 (x) F 1 (t))2 dt = 0,
x0 f 2 F 1 (x) L(x) x0 x L(x) 0

and, by the proof of Lemma 6.3 (see (2.20) and (2.21)), we can replace
vn
22,n by

un /f (F 1 )
22,n if
 11/n 1/2
1 t (1 t)1/2
lim   dt = 0.
n nL(1/n) 1/n f 2 F 1 (t)
The rst and third of these limits follows just as the limit (??) in the proof of
Lemma 6.18, using LHopital andthe equivalence (??). To show that the second
x
limit also holds, let us set h(x) = 0 (F 1 (x) F 1 (t))2 dt and observe that
 x  x x
2 2 1
h (x) = (F 1
(x) F 1
(t))dt = dudt
f (F 1 (x)) 0 f (F 1 (x)) 0 t f (F 1 (u))
 x
2 u 2x2
= du 
f (F 1 (x)) 0 f (F 1 (u)) f 2 (F 1 (x))
the last equivalence being a consequence of regular variation.
x This, (??) and
x
regular variation imply, in turn, h(x) = 0 h (y)dy  2 0 y 2 /f 2 (F 1 (y))dy 

2x3 /f 2 (F 1 (x))  2x2 x 1/f 2(F 1 (y))dy and, consequently, that
 x  1
1 x 0 f 2 (F 1 (y)) dy
1 1
lim (F (x) F (t)) dt = 2 lim
2
=0
x0 x L(x) 0 x0 L(x)
by (??). 
Examples a) Consider the distribution functions
 1 |x|
2e if x 0
F (x) = for > 0,
1 12 ex

if x 0
76 E. del Barrio

and take w 1. Let f be the corresponding densities, which are symmetric about
zero. Then, it is easy but somewhat cumbersome to see that
  1
f F1 (x) x(1 x) log(1)/ ,
x(1 x)
  1
f F1 (x) x(1 x) log(2)/ ,
x(1 x)

where a(x) b(x) means that 0 < limx0 a(x)/b(x) < and likewise for x 1,
whereas 0 < inf tI a(t)/b(t) sup
 xI a(t)/b(t)
 < for any closed interval I con-
tained in (0, 1) (for instance f F1 (x) = x| log 2x|(1)/ + (1 x)| log 2(1
 
x)|(1)/ ). So, f F1 is symmetric about 1/2 and of regular variation with
exponent 1 at 0 (and at 1). It then follows easily that F is in case a) of the above
theorem i > 2, in case b) i 4/3 < 2, hence for the normal distribution,
and in case c) i 0 < 4/3, in particular for the symmetric exponential dis-
tribution. As mentioned above, if the tail probabilities are of dierent order, the
largest dominates and these theorems still hold, so that the same conclusions apply
to the one sided families.
b) Likewise, if

f (x) = x1 ex , x > 0, c > 0,


is the Weibull family of densities, then f (F 1 (u)) = (1 u) log(1)/ 1u


1
, and,
as in the above example, f is in cases a), b) or c) according as to whether > 3,
4/3 < 2 or 0 < 4/3.
c) (McLaren and Lockhart (1987)). For the logistic distribution F (x) = (1 +
ex )1 , x R, and the exponential with parameter one, both falling in case c),
computation of L and the centering gives
 1 
1
3/2
vn2 (t)dt 2 log n Z in distribution,
2 log n 0

and for the extreme value distribution H(x) = exp(ex ), also in case c),
 1 
1
vn2 (t)dt log n Z in distribution.
2 log n 0

We cannot apply Proposition ?? when f (F 1 (t) is too large at 0 and 1 (reg-


ularly varying of exponent 1 or lower), as we saw in Case c) of the above theorem
(exponent 1). Here is a situation where the exponent is larger than one.

Theorem 6.22. Assume F satises conditions (6.7), f (F 1 ) varies regularly at 0


or at 1 with exponent (1, 3/2) and with equal or smaller exponent at the other
Empirical and Quantile Processes 77

extreme, and limx0 |F 1 (x)|/F 1 (1 x) = c [0, ]. Then, denoting = 1 ,


 1  
1 2 c2
 (1) 2
v 2
(t)dt w 2 (S[y]+1 ) y dy
1
E( n ) 0 n
|| 1+c
0
 
 (2) 
2
1
+ 1+c2 (S[y]+1 ) y dy .
0
1
If F satises conditions (6.7), f (F ) varies regularly at 0 or at 1 with exponent
= 3/2 (i.e., = 1/2) and with equal or smaller exponent at the other extreme,
1
limx0 |F 1 (x)|/ F 1 (1 x) = c [0, ] and 0 (F 1 (t))2 dt < , then
 1 
1
v 2
(t)dt c n
E( n1 ) 0 n
   
c2 1 4  (1) 1/2 
1/2 2
w 4 1+c 2
(1)
" + (S [y]+1 ) y dy
S1 S1
(1) 1
  
1 4  (2) 1/2 
1/2 2
+ 1+c 1
2
(2)
" + (S [y]+1 ) y dy ,
S1 S1
(2) 1

 n1 1  1
where cn = 0 (F (t))2 dt+ 1 1 (F 1 (t))2 dt. In both cases {S[y]+1 : y 0} is the
(1)
n
partial sum process associated to the sequence {j } of independent exponential ran-
(1) [y]+1 (2)
dom variables, that is, S[y]+1 = j=1 j , and {S[y]+1 : y 0} is an independent
(1)
copy of {S[y]+1 }.

Remark 6.22.1. Regular variation of f (F 1 ) with exponent is, essentially, equiv-


alent to regular variation of F 1 with exponent (1/2, 0). In fact, if f (F 1 )
RV (0), then F 1 RV (0) and, provided f is monotone in a neighborhood of
, if F 1 RV (0) then f (F 1 ) RV (0) (see, e.g., Resnick (1987), Proposi-
1
tions 0.6 and 0.7). With the assumption of regular variation, niteness of 0 vn2 (t)dt
requires 1/2. Thus, Theorem 6.22 completes the picture of all the possible
1
limiting distributions of 0 vn2 (t)dt for distributions with regularly varying tails.
Remark 6.22.2. It follows easily from the law of the iterated logarithm that
 
 (1) 
(S[y]+1 ) y  1
lim sup 1/2 = a.s.
y y 2 log log y ||
for all < 0, see, e.g., Samorodnitsky and Taqqu (1994), p. 31. This implies that
 (1)
the limiting integrals in Theorem 6.22 are a.s. nite. Integrability of (S[y]+1 )
2
y at 0 needs > 1/2. When = 1/2 the eect of the centering constants, cn ,
is to remove this lack of integrability, still leading to a limiting distribution.
Next we collect some elementary properties of regularly varying functions
that will be useful in our proof of Theorem 6.22.
78 E. del Barrio
 log n 
 l n =
Lemma 6.23. a) If l RV0 (0) is positive and
> 0 then limn (log n) 1
l
 n n

l log
0. b) If l RV (0) and < 0 then limn  n1  = 0.
l n

c) If l RV (0) and > then x l(x) 0 as x 0.


Proof. b) is a trivial consequence of a) and the proof c) can be found, e.g., in
Resnick (1987), so we only prove a). By Karamatas  theorem
(see, e.g., Resnick
1
(1987), p. 17) l can be written as l(x) = c(x) exp x (
(t)/t)dt with c(x) c
(0, ) and
(x) 0 as x 0. Therefore, taking n0 large enough to ensure that
|
(t)| <
/2 for t log n0 /n0 and n n0 we have that
 log n 
 logn n 1

 l  n  
(log n) 2(log n) exp dt = 2(log n)/2 0. 
l n1 2 n1 t
Proof of Theorem 6.22. We will assume in this proof that 0 > > 1/2. The case
= 1/2 can be handled with straightforward changes. We set, as in the proof of
Proposition 6.20, bn,j = I{j 1 < nt} and H 1 (x) = F 1 (1 x) and observe,
using the fact that bn,j (1 t) = 1 bn,n+2j (t) except in a null set, that
 1  logn n 
n+1

2
1 n 1
1 vn
2
(t)dt = 1 F 1
Sn+1 bn,j (t)j F 1 (t) dt
E( n ) 0 E( n ) 0 j=1
 1 
n+1

2  1 log n
n 1 n
+ F 1 Sn+1
1
bn,j (t)j F 1 (t) dt + vn2 (t)dt
E( n1 ) 1 log
n
n
j=1
E( n1 ) log n
n

 log n

n+1

2
n n
= F 1 Sn+1
1
bn,j (t)j F 1 (t) dt
E( n1 ) 0 j=1
 log n

n+1

2
n n
+ H 1 1
bn,j (t)n+2j H 1 (t) dt
E( n1 ) 0
Sn+1
j=1
 1 log n
1 n
+ vn2 (t)dt =: Vn(1) + Vn(2) + Vn(3) .
E( n1 ) log n
n

We also set
 log n
1 n+1


2
n n
Wn(1) := F 1 bn,j (t)j F 1 (t) dt and
E( n1 ) 0 n j=1
 log n
1 n+1


2
n n
Wn(2) := H 1 bn,j (t)n+2j H 1 (t) dt.
E( n1 ) 0 n j=1
(1) (2)
Observe that Wn and Wn are independent since they are functions of disjoint
sets of independent exponential r.v.s j . We will proceed now by showing that
Empirical and Quantile Processes 79

(3)
the central part, Vn , is negligible and that the upper and lower integrals are
asymptotically independent and weakly convergent to the above stated limits.
This will be achieved by proving the following three claims:
Claim 1. Vn(3) Pr 0.
Claim 2. (Vn(i) )1/2 (Wn(i) )1/2 Pr 0, i = 1, 2.

2c2
 (1) 2
Claim 3. Wn(1) w ||(1+c2 ) (S[y]+1 ) y dy,
0
 (2) 2
Wn(2) w 2
||(1+c2 ) (S[y]+1 ) y dy.
0
Proof of Claim 1. We show rst that
 1 logn n
1 u2n (t)
Vn
(3)
dt Pr 0.
E( n1 ) logn n f 2 (F 1 (t))
As in the proof of Proposition 6.3 this reduces to showing that
 1 logn n
1 1
dt 0
1
nE( n ) nlog n f (F 1 (t))
2

and
 1 log n
1 n t1/2 (1 t)1/2
dt 0.
nE( n1 ) log n
n
f 2 (F 1 (t))
To ease the computations we will assume in the remainder of the proof of this
claim that c = 1 and replace E( n1 ) by F 1 ( n1 )2 in the last two denominators (the
ratio of the two sequences converges, by regular variation, to a positive constant).
Extension to general c is straightforward. Regular variation implies that
x/f (F 1 (x)) x/f 2 (F 1 (x)) 1
lim = and lim  1x = ,
x0 F 1 (x) x0 1 2
x f 2 (F 1 (t)) dt

which implies, in turn, that


 1 logn n
1 1 2 l(1) (log n/n)
lim 1
dt = lim = 0,
n nF 1 ( )2 log n
1
n n
2
f (F (t)) 1/2 n l(1) (1/n)

where l(1) (x) = x/f 2 (F 1 (x)) RV21 (0) and 2 1 (2, 1) and the last
limit follows from Lemma 6.23. Similarly,
 1 logn n 1/2
1 t (1 t)1/2 2 l(2) (log n/n)
lim 1 1 2 dt = lim = 0,
n nF ( n ) logn n f 2 (F 1 (t)) 1/4 n l(2) (1/n)

since l(2) (x) = x3/2 /f 2 (F 1 (x)) RV21/2 (0) and 21/2 (3/2, 1/2). Now
 1log n/n u2n (t)
f 2 (F 1 (t)) dt Pr 0. But
1
we can prove Claim 1 by showing that F 1 (1/n) 2 log n/n
80 E. del Barrio

taking expectations we can see that it suces to show that


 1 logn n
1 t(1 t)
lim 2 (F 1 (t))
dt = 0.
n F 1 ( 1 )2 log n f
n n

Using again regular variation properties and Lemma 6.23 we have that
 1 logn n
1 t(1 t) l(3) (log n/n)
lim 1 1 2 2 1
dt = || lim = 0,
n F ( n ) logn n f (F (t)) n l(3) (1/n)
now with l(3) (x) = x2 /f 2 (F 1 (x)) RV2 (0) and 2 (1, 0). This completes
the proof of Claim 1.
(1) (1)
Proof of Claim 2. We will show that (Vn )1/2 (Wn )1/2 Pr 0. It suces to
show that
 log n
1
(F 1 (S[y]+1 /Sn+1 ) F 1 (S[y]+1 /n))2 dy Pr 0.
(1) (1)
F 1 ( n1 )2 0
(1)
To ease the notation we will omit the superscript from S[y]+1 . Similarly as in (6.10)
we consider a Taylor expansion

F 1 Sn+1 F 1 nj
Sj S

Sn+1
Sj 1 1 Sn+1
2 Sj
2 f  (F 1 ())
= 1 + 1 ,
n S
Sn+1 f (F 1 ( j )) 2 n Sn+1 f 3 (F 1 ())
n
for some between Sj /Sn+1 and Sj /n, which enables us to write, using the obvious

2
analogues of (??) and (??), and the fact that supn1 nE 1 Sn+1n < , that


  Sj Sj 
 1 Sj 1 Sj  1 1
F Sn+1 F n  OP (1)
n
+ n
n f (F 1 ( Sj )) n f (F 1 ( Sj ))
n n
S
j
1
OP (1) n
,
n f (F 1 ( Sj ))
n
where OP (1) stands for a stochastically bounded sequence which does not depend
on j [1, log n]. We take now
> 0 such that 2 3 +
< 0. From this bound
and regular variation (Lemma 6.23, c)) we obtain that
 log n
(F 1 ( S[y]+1 ) F 1 ( [y]+1
S S 2
n+1 n )) dy
0
  S[y]+1 2  Sj 2
1 log n
1 
[log n+1]
OP (1) n
S
dy OP (1) n
Sj
n 0 f 2 (F 1 ( [y]+1
n ))
n j=1 f 2 (F 1 ( n ))

1 
[log n+1]
 Sj 22 
[log n+1]
OP (1) n OP (1)n23+ j 22 0.
n j=1 j=1
Empirical and Quantile Processes 81

This completes the proof of Claim 2 (note that we need not divide by F 1 ( n1 )2 to
obtain the equivalence of the two sequences if < 3/2; if = 3/2 that division
still gives the result).
(1)  k/n 1  1 n+1  1

2
Proof of Claim 3. We set Wn,k = n1 0 F n j=1 bn,j (t)j F (t) dt
E( )
2c2
 k  (1) n 2
and Wk = ||(1+c2 ) 0 (S[y]+1 ) y dy. With the change of variable t = y/n
(1)
we can rewrite Wn,k as
  2
F 1 (S[y]+1 /n) F 1 (y/n)
k (1)
(1)
Wn,k = bn dy,
0 F 1 (1/n)
2c2
where bn = F 1 ( n1 )2 /E( n1 ) ||(1+c2 ) , and we conclude that, by regular varia-
(1) (1) 2c2
 (1)
tion, Wn,k Pr Wk . To prove that Wn w ||(1+c 2) 0 ((S[y]+1 ) y ) dy it
2

suces, using a 3
argument, to show that
(1)
lim lim sup P (|Wn,k Wn(1) | >
) = 0 (6.77)
k n

for all
> 0. As in the proof of Claim 2 we consider a Taylor expansion
S
y
S[y]+1 y 1 (S[y]+1 y)2 f  (F 1 ())
F 1 F 1
[y]+1
= + ,
n n nf (F 1 (y/n)) 2 n2 f 3 (F 1 ())
for some between S[y]+1 /n and y/n, which enables us to write, using the obvious
2
equivalents of (??) and (??), and the fact that supy1 E (S[y]+1 y)/y 1/2 < ,
that


  
 1 S[y]+1 1 y  1 y/ n 1 1
F F  OP (1) + ,
n n n f (F 1 (y/n)) n f (F 1 (y/n))
where OP (1) stands for a stochastically bounded sequence which does not depend
on y [k, log n]. From this bound we obtain that
(1)
|Wn,k Wn(1) |
 log n
2
bn
F 1 (S[y]+1 /n) F 1 (y/n) dy
(1)
= 1
F ( n1 )2 k
  log n  log n 
1 1 y/n 1 1
OP (1) 1 1 2 dy + 2 dy
F (n) n k f 2 (F 1 (y/n)) n k f 2 (F 1 (y/n))
 log n  logn n 
1 n t 1 1
= OP (1) dt + dt .
F 1 ( n1 )2 k
n
f 2 (F 1 (t)) n nk f 2 (F 1 (t))
From regular variation we obtain that
  log n  logn n 
1 n t 1 1
2 (F 1 (t))
dt + 2 (F 1 (t))
dt C1 k 22 + C2 k 12
F 1 ( n1 )2 k
n
f n k
n
f
82 E. del Barrio

and this, combined with the last estimate and the fact that 2 2 < 0 completes
the proof of (6.77) and, consequently of Claim 3 . 

6.3. Weighted Wasserstein tests of fit to location-scale families of distributions


Finally, we apply the foregoing to weighted Wassserstein tests. Recall that
Ww2 (Fn , H)
Rw
n = 2 (F )
w n

relates to the quantile process vn via equations (5.6) (assuming conditions (5.3)-
(5.5)).
Theorem 6.24. Let w be a non-negative measurable function satisfying condition
(5.3). Let H be a location scale family of distributions as dened in the Introduc-
1
tion, such that 0 (F 1 (t))2 w(t)dt < for any (hence for all ) F H, let G0 H
be chosen so as to satisfy conditions (5.4) and (5.5) and let g0 = G0 . Assume the
distribution functions F H and the weight w satises conditions (GH) and that
moreover  1
t(1 t)
  w(t)dt < . (6.78)
f 2 F 1 (t)
0
Then, under the null hypothesis F H, we have
 1  1 2
w B 2 (t) B(t)
nRn 2 1 w(t)dt 1 w(t)dt
d 0 g0 (G0 (t)) 0 g0 (G0 (t))
 1 2
B(t)G1
0 (t)
1 w(t)dt . (6.79)
0 g0 (G0 (t))

Note that the hypotheses on F H are either satised by all or by none of


the functions in H.

Proof. By equivariance we can assume F = G0 . The result follows directly from


Theorem 4.5 a) as soon as we show that w 2
(Fn ) 1 in probability. By Theorem
4.5 a)
vn
2,w = OP (1), and therefore (recall F = G0 and (5.5))
  1
|w (Fn ) 1| = 
Fn1
2,w
F 1
2,w 
Fn1 F 1
2,w =
vn
2,w 0. 
n Pr

If H in Theorem 6.24 were only a location family or only a scale family then
the limit would exhibit only the loss of one degree of freedom, that is, one of the
last two integrals would be absent from the limit in (6.79): see Csorgo (2002),
where a theorem of this sort for scale families is proved.
Theorem 6.25. Under the hypotheses of Theorem 6.24, except that now condition
(6.78) is replaced by the weaker conditions (6.11) and
 1 1
(s t st)2
2
 1
 2  1  w(s)w(t)dsdt < , (6.80)
0 0 g0 G0 (t) g0 G0 (s)
Empirical and Quantile Processes 83

we have
  1 2
11/n
t(1 t) B (t) EB 2 (t)
nRw 1 w(t)dt w(t)dt
g02 (G1
n
1/n g02 (G0 (t)) d 0 0 (t))
 1 2   1 2
B(t) B(t)G1
0 (t)
1 w(t)dt 1 w(t)dt .
0 g0 (G0 (t)) 0 g0 (G0 (t))

Proof. As above, we can take F = G0 . By Theorem 4.5 b) properly modied to


account for the weighted integrals of the Brownian bridge (as done in Theorem
6.17), it suces to prove that
   11/n
1 t(1 t)
w (Fn ) 1 and 2 (F )
1 2 (F 1 (t))
w(t)dt 0,
Pr w n 1/n f Pr

which, by condition (6.11), reduce to proving n(w 2
(Fn ) 1) = OP (1). We have
1
2
n(w (Fn ) 1) =
vn
2,w + 2vn , F 1 w .
n
By checking the proof of Theorem 6.17 (by way of Theorem 6.12), it is easy to see
that
un /f (F 1 ), F 1 w,n B/f (F 1 ), F 1 w,n
d
(note that (5.5) implies F 1 L2 (w(t)dt)).
Hence Proposition 2.5 gives vn , F 1 w = OP (1). Likewise, by Theorem 4.5
 11/n t(1t)
b),
vn
2,w is shift convergent in law with shifts 1/n f 2 (F 1 (t)) w(t)dt which, by

(6.11) are o( n), so that
vn
2,w / n 0. 
Pr

A version of this theorem for scale families is proved in Csorgo (2002), however
the hypotheses there are stronger by factors of the order of log n or (log n)2 , the
integrals at the end points are not treated analytically and the proof is dierent
(it relies on strong approximations, which account for the stronger assumptions).
Next we consider convergence to a normal distribution. This case is less
interesting in connection with testing since, as indicated in the Introduction,
 1 2  1 B 2 (t)
vn (t)w(t)dt f 2 (F 1 (t)) w(t)dt if f does not vanish on suppF , and
d
therefore, if we divide by L(1/n) , as we must by Theorem 6.21 c), this
part of the statistic has no inuence on the limit. So, when a distribution satises
the hypotheses of Theorem 6.21 c) (meaning that g = f (F 1 ) is regularly varying
with exponent 1 and L(x) with this g tends to innity), if one wishes to have a
sensible test of t, it is probably best to nd a weight w so that one can apply
Theorems 6.24 or 6.25. Hence, we will only consider the normal convergence case
with weight w 1.
Theorem 6.26. Let H be a location scale family of distributions and assume for
simplicity that the distribution G0 H with mean zero and variance one is the
distribution function of a symmetric random variable. Assume that the follow-
ing conditions hold for some (hence for all ) F H: F is twice dierentiable on
84 E. del Barrio

(aF , bF ) and satises condition (6.7), its density f is strictly positive on (aF , bF ),
the function f (F 1 ) is regularly varying of exponent 1 at 0 and at 1, and L(x)
as x 0. Then, under the null hypothesis F H,
  11/n 
1 t(1 t)
nRn   dt Z, (6.81)
L(1/n) 1/n g02 G10 (t)
d

n is as dened in (5.6) for w 1.


where Z is standard normal and nRn := nRw
1
Proof. As in the previous two theorems we can assume F = G0 . Since  x f (F dt ) is
1
regularly varying of exponent 1 at 0 and at 1, it follows that F (x) = 1/2 f (F 1 (t))
1
is slowly varying at 0 and at 1. Hence, 0 |F 1 (t)|r dt < for all r R and there-
 r
fore, all the moments 0 |x| dF (x), r > 0, are nite. In particular, if Xi are i.i.d.
with distribution F ,
 1
1 
n
vn (t)dt = (Xi EXi ) = OP (1)
0 n i=1

by the central limit theorem, and 2 (Fn ) 1 a.s. by the law of large numbers.
So, it suces to show that
  11/n 
1 t(1 t)

vn
22 vn , F 1 2   dt Z.
L(1/n) 1/n f 2 F 1 (t) d

The arguments in the proof of Theorem 6.21 c) not only show that here we can
replace
vn
22 by
un /f (F 1
2,n , but also that vn , F 1 2 can be replaced by
un /f (F 1 ), F 1 n ; therefore, the theorem will follow from Theorem 6.21 c) (hence
from Theorem 6.19) if we show that the sequence
 11/n
un 1 un (t)F 1 (t)
 1
, F n := dt, n N, (6.82)
f (F ) 1/n f (F 1 (t))

is stochastically bounded (as it will then tend to zero upon dividing by L(1/n)).
For this, we show that the product of the nth variable in (6.82) by Sn+1 /n has
expected value tending to zero and variance dominated by a constant independent
of n. By (6.2), Lemma 6.1 i) and slow variation of F 1 at 0 and 1, we have
    11/n 1 n+1 
 Sn+1 un   F (t) i=1 an,i i 
E , F 1 
= E   dt
 n f (F 1 )
n  nf F 1 (t) 
1/n
 11/n
1 |F 1 (t)|
  dt
n 1/n f F 1 (t)
 F 1 (11/n)
1
= |u|du
n F 1 (1/n)
(F 1 (1/n))2 + (F 1 (1 1/n))2
= 0.
2 n
Empirical and Quantile Processes 85

Let $X$ be a random variable with distribution $F$. By Lemma 6.1 iii) and finiteness of the absolute moments of $X$, we have
$$\operatorname{Var}\Big(\frac{S_{n+1}}{n}\Big\langle\frac{u_n}{f(F^{-1})},F^{-1}\Big\rangle_n\Big)
=E\Big(\frac{1}{n}\int_{1/n}^{1-1/n}\frac{F^{-1}(t)\sum_{i=1}^{n+1}a_{n,i}(\xi_i-1)}{f(F^{-1}(t))}\,dt\Big)^2$$
$$=\frac{1}{n}\int_{1/n}^{1-1/n}\int_{1/n}^{1-1/n}\frac{K_n(s,t)F^{-1}(s)F^{-1}(t)}{f(F^{-1}(s))\,f(F^{-1}(t))}\,ds\,dt
\le 6\int_{1/n}^{1-1/n}\int_{1/n}^{t}\frac{s(1-t)\,|F^{-1}(s)||F^{-1}(t)|}{f(F^{-1}(s))\,f(F^{-1}(t))}\,ds\,dt$$
$$=6\int_{F^{-1}(1/n)}^{F^{-1}(1-1/n)}\int_{F^{-1}(1/n)}^{v}F(u)(1-F(v))|u||v|\,du\,dv$$
$$\le 6\Big(\int_{F^{-1}(1/n)}^{0}\int_{F^{-1}(1/n)}^{v}F(u)|u||v|\,du\,dv
+\int_{0}^{F^{-1}(1-1/n)}\int_{F^{-1}(1/n)}^{0}F(u)(1-F(v))|u||v|\,du\,dv
+\int_{0}^{F^{-1}(1-1/n)}\int_{0}^{v}(1-F(v))|u||v|\,du\,dv\Big)$$
$$\le\frac{3}{4}\big(EX^4+(EX^2)^2\big)<\infty,$$
where at the last step we use Fubini and integration by parts. $\square$
As in Theorem 6.21 c), symmetry of $G_0$ is not necessary. Csörgő (2002) also proves a result for correlation tests where the limit is normal, but only for the special case of Weibull scale families.
Likewise, Theorem 6.22 can be used to obtain the limiting distribution of $nR_n$ when $f(F^{-1})$ is regularly varying at the end points with exponent $>1$, but we refrain from doing so, to avoid too much repetition.
Example. (Gauss-Laplace location-scale families.) This is a modification of a result in Csörgő (2002). Consider the distribution functions from the above Example,
$$F_\alpha(x)=\begin{cases}\tfrac12 e^{-|x|^\alpha}&\text{if }x\le 0\\[2pt] 1-\tfrac12 e^{-x^\alpha}&\text{if }x\ge 0\end{cases}\qquad\text{for }\alpha>0.$$
Then, just as in that example, Theorem 6.24 with $w\equiv 1$ holds for the location-scale family based on $F_\alpha$ iff $\alpha>2$, Theorem 6.25 with $w\equiv 1$ holds iff $4/3<\alpha\le 2$, hence for the normal distribution (which gives Shapiro-Wilk), and Theorem 6.26 with $w\equiv 1$ holds for $0<\alpha\le 4/3$, in particular for the symmetric exponential distribution. As mentioned above, if the tail probabilities are of different order, the largest dominates and the same conclusions apply to the one-sided families.
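The distribution functions $F_\alpha$ are easy to evaluate numerically. The following short Python sketch (ours, not part of the original development; it assumes only NumPy) implements $F_\alpha$ and prints a few values for the Laplace case $\alpha=1$.

    # Sketch (ours): the df F_alpha of the Gauss-Laplace family above.
    import numpy as np

    def F_alpha(x, alpha):
        """F_alpha(x) = (1/2)exp(-|x|^alpha) for x <= 0, 1 - (1/2)exp(-x^alpha) for x >= 0."""
        x = np.asarray(x, dtype=float)
        ax = np.abs(x) ** alpha
        return np.where(x <= 0.0, 0.5 * np.exp(-ax), 1.0 - 0.5 * np.exp(-ax))

    print(F_alpha([-1.0, 0.0, 1.0], alpha=1.0))   # Laplace: [0.1839..., 0.5, 0.8161...]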
Example. (Testing fit to the Laplace location-scale family.) It follows from the last example and the comments immediately before Theorem 6.26 that a weighted Wasserstein test would be convenient for the Gauss-Laplace location-scale family when the index $\alpha$ is between 0 and 4/3. For any given $\alpha>0$ these families are (in terms of the densities):
$$\mathcal H_\alpha=\Big\{F_{\mu,\sigma}:\ f_{\mu,\sigma}(x):=\frac{\alpha}{2\sigma\Gamma(1/\alpha)}\,e^{-|(x-\mu)/\sigma|^\alpha},\ x\in\mathbb R,\ \mu\in\mathbb R,\ \sigma>0\Big\}.$$
The weight should approach zero near 0 and 1. For simplicity we will only present a test for the Laplace family $\mathcal H_1$. Simple but tedious computations using the approximations in the previous example show that a weight of the order $w(t)\approx|\log t(1-t)|^{-\delta}$ will allow us to apply Theorem 6.24 if $\delta>1$ and Theorem 6.25 if $1/2<\delta\le 1$ (the determining conditions are (6.78), which holds for all $\delta>1$, and (6.81), which holds for $1/2<\delta\le 1$). If $w$ is too small near 0 and 1, we make the extreme part of the distribution count less, whereas possibly the limit has more variability as the integral of $B^2-EB^2$ is closer to being divergent. de Wet (2000) convincingly suggests taking $\delta=1$ (see also Csörgő (2002)). Concretely, we define
$$w(t):=\Big(\frac{1}{\log\frac{e}{2t}}\,I_{0<t\le 1/2}+\frac{1}{\log\frac{e}{2(1-t)}}\,I_{1/2<t\le 1}\Big)\Big/W$$
where
$$W:=e\int_1^\infty u^{-1}e^{-u}\,du,\qquad\text{and set also}\qquad V=e\int_0^\infty\frac{u^2}{1+u}\,e^{-u}\,du.$$
Take $G_0:=F_{0,\sqrt{W/V}}$. Then $w$ and $G_0$ satisfy conditions (5.3)-(5.5), and the conditions (GH) and (6.80) hold as well (but not (6.78)). Then, Theorem 5.2 gives that, under the null hypothesis $F\in\mathcal H_1$,
$$nR_n^w-\frac{2}{V}\log\log\frac{ne}{2}-\frac{W}{2}$$
$$\xrightarrow{d}\ \frac{1}{V}\left(\int_0^{1/2}\frac{B^2(t)-EB^2(t)}{t^2\log^2\frac{e}{2t}}\,dt+\int_{1/2}^1\frac{B^2(t)-EB^2(t)}{(1-t)^2\log^2\frac{e}{2(1-t)}}\,dt\right)$$
$$-\frac{1}{VW}\left(\int_0^{1/2}\frac{B(t)}{t\log\frac{e}{2t}}\,dt+\int_{1/2}^1\frac{B(t)}{(1-t)\log\frac{e}{2(1-t)}}\,dt\right)^{2}$$
$$-\frac{1}{V^2}\left(\int_0^{1/2}\frac{B(t)\log 2t}{t\log\frac{e}{2t}}\,dt+\int_{1/2}^1\frac{B(t)\log 2(1-t)}{(1-t)\log\frac{e}{2(1-t)}}\,dt\right)^{2}.$$
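The constants of this example are readily obtained by quadrature. The following Python sketch (ours; it assumes NumPy and SciPy, and the names W_const, V_const are our own) evaluates $W$ and $V$ and checks that the weight $w$ integrates to one over $(0,1)$, as it must by the choice of $W$.

    # Sketch (ours): numerical evaluation of W, V and the weight w(t) above.
    import numpy as np
    from scipy.integrate import quad

    # W = e * int_1^inf e^{-u}/u du,   V = e * int_0^inf u^2 e^{-u}/(1+u) du
    W_const = np.e * quad(lambda u: np.exp(-u) / u, 1.0, np.inf)[0]
    V_const = np.e * quad(lambda u: u**2 * np.exp(-u) / (1.0 + u), 0.0, np.inf)[0]

    def w(t):
        """w(t) = (1/W) / log(e/(2t)) on (0,1/2], symmetrically on (1/2,1]."""
        s = t if t <= 0.5 else 1.0 - t
        return 1.0 / (W_const * np.log(np.e / (2.0 * s)))

    print(W_const, V_const)
    print(quad(w, 0.0, 1.0)[0])   # ~ 1.0: w integrates to one by construction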
References
[1] Ali, M.M. (1974) Stochastic ordering and kurtosis measure. J. Amer. Statist. Assoc. 69, 543–545.
[2] Anderson, T.W. and Darling, D.A. (1952) Asymptotic theory of certain goodness of fit criteria based on stochastic processes. Ann. Math. Statist. 23, 193–212.
[3] Araujo, A. and Giné, E. (1980) The Central Limit Theorem for Real and Banach Valued Random Variables. Wiley, New York.
[4] Balanda, K.P. and McGillivray, H.L. (1988) Kurtosis: A critical review. Amer. Statistician 42, 111–119.
[5] Bickel, P. and Freedman, D. (1981) Some asymptotic theory for the bootstrap. Ann. Statist. 9, 1196–1217.
[6] Bickel, P. and van Zwet, W.R. (1978) Asymptotic expansions for the power of distribution free tests in the two-sample problem. Ann. Statist. 6, 937–1004.
[7] Billingsley, P. (1968) Convergence of Probability Measures. Wiley, New York.
[8] Breiman, L. (1968) Probability. Addison-Wesley, Reading.
[9] Bretagnolle, J. and Massart, P. (1989) Hungarian constructions from the nonasymptotic viewpoint. Ann. Probab. 17, 239–256.
[10] Brown, B. and Hettmansperger, T. (1996) Normal scores, normal plots, and tests for normality. J. Amer. Statist. Assoc. 91, 1668–1675.
[11] Chernoff, H. and Lehmann, E.L. (1954) The use of maximum likelihood estimates in chi-square tests of goodness of fit. Ann. Math. Statist. 25, 579–586.
[12] Chibisov, D.M. (1964) Some theorems on the limiting behavior of empirical distribution functions. Selected Transl. Math. Statist. Probab. 6, 147–156.
[13] Cochran, W.G. (1952) The chi-square test of goodness of fit. Ann. Math. Statist. 23, 315–345.
[14] Cohen, A. and Sackrowitz, H.B. (1975) Unbiasedness of the chi-square, likelihood ratio and other goodness of fit tests for the equal cell case. Ann. Statist. 3, 959–964.
[15] Cramér, H. (1928) On the composition of elementary errors. Second paper: Statistical applications. Skand. Aktuartidskr. 11, 141–180.
[16] Csörgő, M. (1983) Quantile Processes with Statistical Applications. SIAM, Philadelphia.
[17] Csörgő, M., Csörgő, S., Horváth, L. and Mason, D.M. (1986) Weighted empirical and quantile processes. Ann. Probab. 14, 31–85.
[18] Csörgő, M. and Horváth, L. (1988) On the distributions of Lp norms of weighted uniform empirical and quantile processes. Ann. Probab. 16, 142–161.
[19] Csörgő, M. and Horváth, L. (1993) Weighted Approximations in Probability and Statistics. John Wiley and Sons.
[20] Csörgő, M., Horváth, L. and Shao, Q.-M. (1993) Convergence of integrals of uniform empirical and quantile processes. Stochastic Process. Appl. 45, 283–294.
[21] Csörgő, M. and Révész, P. (1978) Strong approximations of the quantile process. Ann. Statist. 6, 882–894.
[22] Cuesta-Albertos, J.A., Matrán, C., Rachev, S.T. and Rüschendorf, L. (1996) Mass transportation problems in Probability Theory. Math. Scientist 21, 34–72.
[23] D'Agostino, R.B. (1971) An omnibus test of normality for moderate and large sample sizes. Biometrika 58, 341–348.
[24] Darling, D.A. (1955) The Cramér-Smirnov test in the parametric case. Ann. Math. Statist. 26, 1–20.
[25] David, H.A., Hartley, H.O. and Pearson, E.S. (1954) The distribution of the ratio, in a single normal sample, of range to standard deviation. Biometrika 41, 482–493.
[26] David, F.N. and Johnson, N.L. (1948) The probability integral transformation when parameters are estimated from the sample. Biometrika 35, 182–190.
[27] de la Peña, V. and Giné, E. (1998) Decoupling. From Dependence to Independence. Randomly stopped processes, U-statistics and processes, martingales and beyond. Springer.
[28] del Barrio, E. (2000) Asymptotic distribution of statistics of Cramér-von Mises type. Preprint.
[29] del Barrio, E., Cuesta-Albertos, J.A., Matrán, C. and Rodríguez-Rodríguez, J. (1999) Tests of goodness of fit based on the L2-Wasserstein distance. Ann. Statist. 27, 1230–1239.
[30] del Barrio, E., Giné, E. and Matrán, C. (1999) Central limit theorems for the Wasserstein distance between the empirical and the true distributions. Ann. Probab. 27, 1009–1071.
[31] del Barrio, E., Giné, E. and Utzet, F. (2005) Asymptotics for L2 functionals of the empirical quantile process, with applications to tests of fit based on weighted Wasserstein distances. Bernoulli 11, 131–189.
[32] de Wet, T. and Venter, J. (1972) Asymptotic distributions of certain test criteria of normality. S. Afr. Statist. J. 6, 135–149.
[33] de Wet, T. and Venter, J. (1973) Asymptotic distributions for quadratic forms with applications to tests of fit. Ann. Statist. 2, 380–387.
[34] Donsker, M.D. (1951) An invariance principle for certain probability limit theorems. Mem. Amer. Math. Soc. 6.
[35] Donsker, M.D. (1952) Justification and extension of Doob's heuristic approach to the Kolmogorov-Smirnov theorems. Ann. Math. Statist. 23, 277–281.
[36] Doob, J.L. (1949) Heuristic approach to the Kolmogorov-Smirnov theorems. Ann. Math. Statist. 20, 393–403.
[37] Downton, F. (1966) Linear estimates with polynomial coefficients. Biometrika 53, 129–141.
[38] Dudley, R.M. (1978) Central limit theorems for empirical measures. Ann. Probab. 6, 899–929.
[39] Durbin, J. (1973) Weak convergence of the sample distribution function when parameters are estimated. Ann. Statist. 1, 279–290.
[40] Feller, W. (1948) On the Kolmogorov-Smirnov limit theorems for empirical distributions. Ann. Math. Statist. 19, 177–189.
[41] Feuerverger, A. and Mureika, R.A. (1977) The empirical characteristic function and its applications. Ann. Statist. 5, 88–97.
[42] Filliben, J.J. (1975) The probability plot correlation coefficient test for normality. Technometrics 17, 111–117.
[43] Fisher, R.A. (1930) The moments of the distribution for normal samples of measures of departure from normality. Proc. Roy. Soc. A 130, 16.
[44] Galambos, J. (1987) The Asymptotic Theory of Extreme Order Statistics, 2nd ed. Krieger, Melbourne, Florida.
[45] Geary, R.C. (1947) Testing for normality. Biometrika 34, 209–242.
[46] Gerlach, B. (1979) A consistent correlation-type goodness-of-fit test; with application to the two parameter Weibull distribution. Math. Operationsforsch. Statist. Ser. Statist. 10, 427–452.
[47] Gumbel, E.J. (1943) On the reliability of the classical chi-square test. Ann. Math. Statist. 14, 253–263.
[48] Gupta, A.K. (1952) Estimation of the mean and the standard deviation of a normal population from a censored sample. Biometrika 39, 260–273.
[49] Hall, P. and Welsh, A.H. (1983) A test for normality based on the empirical characteristic function. Biometrika 70, 485–489.
[50] Jakubowski, A. (1986) Principle of conditioning in limit theorems for sums of random variables. Ann. Probab. 14, 902–915.
[51] Kac, M., Kiefer, J. and Wolfowitz, J. (1955) On tests of normality and other tests of goodness of fit based on distance methods. Ann. Math. Statist. 26, 189–211.
[52] Kale, B.K. and Sebastian, G. (1996) On a class of symmetric nonnormal distributions with kurtosis of three. In Statistical Theory and Applications: Papers in Honor of Herbert A. David. Eds. H.H. Nagaraja, P.K. Sen and D. Morrison. Springer Verlag, New York.
[53] Kolmogorov, A. (1933) Sulla determinazione empirica di una legge di distribuzione. Giorn. Ist. Ital. Attuari. 4, 83–91.
[54] Kolmogorov, A.N. and Prohorov, Yu.V. (1949) On sums of a random number of random terms (Russian). Uspehi Matem. Nauk (N.S.) 4, 168–172.
[55] Komlós, J., Major, P. and Tusnády, G. (1975) An approximation of partial sums of independent RVs and the sample DF. I. Z. Wahrsch. verw. Gebiete 32, 111–131.
[56] Komlós, J., Major, P. and Tusnády, G. (1976) An approximation of partial sums of independent RVs and the sample DF. II. Z. Wahrsch. verw. Gebiete 34, 33–58.
[57] Leslie, J.R. (1984) Asymptotic properties and new approximations for both the covariance matrix of normal order statistics and its inverse. Colloquia Mathematica Societatis János Bolyai 45, 317–354. Eds. P. Révész, K. Sarkadi and P.K. Sen. Elsevier, Amsterdam.
[58] Leslie, J.R., Stephens, M.A. and Fotopoulos, S. (1986) Asymptotic distribution of the Shapiro-Wilk W for testing for normality. Ann. Statist. 14, 1497–1506.
[59] Lilliefors, H.W. (1967) On the Kolmogorov-Smirnov test for normality with mean and variance unknown. J. Amer. Statist. Assoc. 62, 399–402.
[60] Lockhart, R.A. (1985) The asymptotic distribution of the correlation coefficient in testing fit to the exponential distribution. Canad. J. Statist. 13, 253–256.
[61] Lockhart, R.A. (1991) Overweight tails are inefficient. Ann. Statist. 19, 2254–2258.
[62] Lockhart, R.A. and Stephens, M.A. (1998) The probability plot: Test of fit based on the correlation coefficient. Order statistics: applications, 453–473, Handbook of Statist. 17, North-Holland, Amsterdam.
[63] Mann, H.B. and Wald, A. (1942) On the choice of the number of class intervals in the application of the chi square test. Ann. Math. Statist. 13, 306–317.
[64] Mason, D. and Shorack, G. (1992) Necessary and sufficient conditions for asymptotic normality of L-statistics. Ann. Probab. 20, 1779–1804.
[65] McLaren, C.G. and Lockhart, R.A. (1987) On the asymptotic efficiency of certain correlation tests of fit. Canad. J. Statist. 15, 159–167.
[66] Moore, D.S. (1971) A chi-square statistic with random cell boundaries. Ann. Math. Statist. 42, 147–156.
[67] Moore, D.S. (1978) Chi-square tests. In Studies in Statistics (R.V. Hogg, ed.) 66–106. The Mathematical Association of America.
[68] Moore, D.S. (1986) Tests of chi-squared type. In Goodness-of-Fit Techniques, D'Agostino, R.B. and Stephens, M.A., eds., 63–96. North-Holland, Amsterdam.
[69] Murota, K. and Takeuchi, K. (1981) The studentized empirical characteristic function and its application to test for the shape of distribution. Biometrika 68, 55–65.
[70] O'Reilly, N.E. (1974) On the weak convergence of empirical processes in sup-norm metrics. Ann. Probab. 2, 642–651.
[71] Pearson, E.S. (1930) A further development of tests for normality. Biometrika 22, 239–249.
[72] Pearson, E.S., D'Agostino, R.B. and Bowman, K.O. (1977) Tests for departure from normality: Comparison of powers. Biometrika 64, 231–246.
[73] Pollard, D. (1979) General chi-square goodness-of-fit tests with data-dependent cells. Z. Wahrsch. Verw. Gebiete 50, 317–331.
[74] Pollard, D. (1980) The minimum distance method of testing. Metrika 27, 43–70.
[75] Prohorov, Y.V. (1953) Probability distributions in functional spaces (Russian). Uspehi Matem. Nauk (N.S.) 8, 165–167.
[76] Prohorov, Y.V. (1956) The convergence of random processes and limit theorems in probability. Theor. Probab. and its Applicat. 1, 157–214.
[77] Rachev, S.T. (1991) Probability Metrics and the Stability of Stochastic Models. Wiley.
[78] Royston, J.P. (1982) An extension of Shapiro and Wilk's W test for normality to large samples. Appl. Statist. 31, 115–124.
[79] Sarkadi, K. (1975) The consistency of the Shapiro-Francia test. Biometrika 62, 445–450.
[80] Serfling, R.J. (1980) Approximation Theorems of Mathematical Statistics. Wiley, New York.
[81] Shapiro, S.S. and Francia, R.S. (1972) An approximate analysis of variance test of normality. J. Amer. Statist. Assoc. 67, 215–216.
[82] Shapiro, S.S. and Wilk, M.B. (1965) An analysis of variance test for normality (complete samples). Biometrika 52, 591–611.
[83] Shapiro, S.S. and Wilk, M.B. (1968) Approximations for the null distribution of the W statistic. Technometrics 10, 861–866.
[84] Shapiro, S.S. and Wilk, M.B. (1972) An analysis of variance test for the exponential distribution (complete samples). Technometrics 14, 355–370.
[85] Shapiro, S.S., Wilk, M.B. and Chen, H.J. (1968) A comparative study of various tests for normality. J. Amer. Statist. Assoc. 63, 1343–1372.
[86] Shorack, G.R. and Wellner, J.A. (1986) Empirical Processes With Applications to Statistics. Wiley, New York.
[87] Skorohod, A.V. (1956) Limit theorems for stochastic processes. Theor. Probab. and its Applicat. 1, 261–290.
[88] Smirnov, N.V. (1936) Sur la distribution de $\omega^2$ (Critérium de M.R. v. Mises). C. R. Acad. Sci. Paris 202, 449–452.
[89] Smirnov, N.V. (1937) Sur la distribution de $\omega^2$ (Critérium de M.R. v. Mises) (Russian/French summary). Mat. Sbornik (N.S.) 2, 973–993.
[90] Smirnov, N.V. (1939) Sur les écarts de la courbe de distribution empirique (Russian/French summary). Mat. Sbornik (N.S.) 6, 3–26.
[91] Smirnov, N.V. (1941) Approximate laws of distribution of random variables from empirical data (Russian). Uspekhi Mat. Nauk. 10, 179–206.
[92] Smith, R.M. and Bain, L.J. (1976) Correlation-type goodness-of-fit statistics with censored sampling. Comm. Statist. A - Theory and Methods 5, 119–132.
[93] Spinelli, J.J. and Stephens, M.A. (1987) Tests for exponentiality when origin and scale parameters are unknown. Technometrics 29, 471–476.
[94] Stephens, M.A. (1974) EDF statistics for goodness of fit and some comparisons. J. Amer. Statist. Assoc. 69, 730–737.
[95] Stephens, M.A. (1975) Asymptotic properties for covariance matrices of order statistics. Biometrika 62, 23–28.
[96] Stephens, M.A. (1986) Tests based on EDF statistics. In Goodness-of-Fit Techniques (R.B. D'Agostino and M.A. Stephens, eds.) 97–193. Marcel Dekker, New York.
[97] Stephens, M.A. (1986) Tests based on regression and correlation. Goodness-of-Fit Techniques 195–233. Eds. R.B. D'Agostino and M.A. Stephens. Marcel Dekker, New York.
[98] Sukhatme, S. (1972) Fredholm determinant of a positive definite kernel of a special type and its application. Ann. Math. Statist. 43, 1914–1926.
[99] Uthoff, V.A. (1970) An optimum test property of two well known statistics. J. Amer. Statist. Assoc. 65, 1597.
[100] Uthoff, V.A. (1973) The most powerful scale and location invariant test of the normal versus the double exponential. Ann. Statist. 1, 170–174.
[101] Vallender, S. (1973) Calculation of the Wasserstein distance between probability distributions on the line. Theor. Probab. Appl. 18, 785–786.
[102] Verrill, S. and Johnson, R. (1987) The asymptotic equivalence of some modified Shapiro-Wilk statistics: Complete and censored sample cases. Ann. Statist. 15, 413–419.
[103] von Mises, R. (1931) Wahrscheinlichkeitsrechnung. Wien, Leipzig.
[104] Watson, G.S. (1957) The chi-square goodness-of-fit test for normal distributions. Biometrika 44, 336–348.
[105] Watson, G.S. (1958) On chi-square goodness-of-fit tests for continuous distributions. J. Roy. Statist. Soc. B 20, 44–61.
[106] Weisberg, S. and Bingham, C. (1975) An approximate analysis of variance test for non-normality suitable for machine calculation. Technometrics 17, 133–134.
[107] Williams, P. (1935) Note on the sampling distribution of $\sqrt{\beta_1}$, where the population is normal. Biometrika 27, 269–271.

Eustasio del Barrio
Departamento de Estadística
e Investigación Operativa
Facultad de Ciencias
Universidad de Valladolid
C/ Prado de la Magdalena S/N
E-47005 Valladolid, Spain
e-mail: tasio@eio.uva.es
Topics on Empirical Processes
Paul Deheuvels

1. Introduction, notation and preliminaries


1.1. Introduction
The present notes, starting at an elementary level, collect some basic mathematical
facts and technical arguments which should be useful for research students working
in the field of empirical processes and nonparametric statistics. There are many
fine reviews and monographs to be recommended (such as [38, 40, 44, 77, 78, 79, 89,
87, 99, 100, 164, 178, 183, 204]) on this and related subjects. Obviously, one cannot
expect a scholar to have read all these references prior to the start of his research
activities, and it is of interest to provide a minimal selection of what one should
know at the beginning. We have chosen here to focus our study on functional limit
laws generated by empirical processes, for which there has been continuous interest
in the last decades (see, e.g., [52, 60, 65, 66, 62, 98, 156, 199]). Such results provide
a number of direct applications to statistics which motivate our interest. This is a
rather delicate subject, mixing topological arguments and large deviation theory
(see, e.g., [73]), together with other sophisticated probabilistic techniques. A basic
knowledge of Gaussian processes (refer to [135]) is a natural prerequisite, as well as
some understanding of weak and strong invariance principles (see, e.g., [19, 40, 155,
154, 156]). We have chosen not to speak much of the so-called abstract empirical
processes theory, which studies processes, indexed by sets or functions, on general
spaces, and we advise motivated readers to orient their interests towards this
direction, in a second step (refer to [6, 82, 99, 136, 164]). Our taste for functional
limit laws takes root in the fact that these results have turned out to be most useful
in practice, and await yet even more extensive investigations to be exploited in full.
A second point is that abstract empirical process theory has remained mostly on
the mathematical side, with relatively few explicit statistical applications. There
are some exceptions, though, and, in fact, the recent evolution of statistical science
has shown that one could solve several difficult problems of interest by using
deviation inequalities obtained in this general framework (see, e.g., [95, 70, 201]).
However, the relevant mathematical tools are, at times, so involved, that their
94 P. Deheuvels

study is a research domain in se which, I believe, should not be treated as a starting
point. In particular, probability theory in Banach spaces, which comes into this
field as a basic ingredient, constitutes a universe in itself (refer to [6, 135]).

This introductory course has been presented at the summer school in Laredo, and
later, taught to graduate students at the University of Paris VI. Of course, the
contents of these notes had to be limited by lack of space, and their completion
is, as one could expect, left to the reader, with the help of the enclosed exercises
and references.

1.2. Distribution and quantile functions


1.2.1. Left- and right-continuous distribution and quantile functions. Let $X\in\mathbb R$ denote a real-valued random variable [rv] with probability law defined by $P_X(A)=P(X\in A)$, for all $A\in\mathcal B_{\mathbb R}$ belonging to the set $\mathcal B_{\mathbb R}$ of all Borel subsets of $\mathbb R$ (endowed with the usual topology). By definition, $\mathcal B_{\mathbb R}$ is the $\sigma$-algebra generated by the open (or closed) subsets of $\mathbb R$. Thanks to the properties of the Lebesgue-Stieltjes integral, the probability law $P_X$ is uniquely determined (see, e.g., pp. 17-18 in [19]) by the (right-continuous version of the) distribution function [df] of $X$, defined by
$$F(x)=P(X\le x)\quad\text{for }x\in\overline{\mathbb R}:=\mathbb R\cup\{-\infty\}\cup\{\infty\}.\eqno(1.1)$$
The df $F(\cdot)$ of a rv may be any nondecreasing, right-continuous function on $\mathbb R$, such that
$$x\le y\ \Rightarrow\ 0\le F(x)\le F(y)\le 1;\eqno(1.2)$$
$$F(-\infty)=0=\lim_{x\to-\infty}F(x);\qquad F(\infty)=1=\lim_{x\to\infty}F(x);\eqno(1.3)$$
$$F(x+):=\lim_{y\downarrow x}F(y)=F(x)\quad\text{for }x\in\mathbb R\cup\{-\infty\}.\eqno(1.4)$$
The left-continuous version $F(x-):=P(X<x)$ of the distribution function $F(\cdot)$ of $X$ fulfills (1.2) and (1.3), but not (1.4). The latter relation is replaced by
$$F(x-):=\lim_{y\uparrow x}F(y)=P(X<x)\le F(x)\quad\text{for }x\in\mathbb R\cup\{\infty\}.\eqno(1.5)$$

The set $D_F$ of discontinuity points of $F$, namely, the set of all $x\in\mathbb R$ such that
$$P_X(\{x\})=P(X=x)=F(x)-F(x-)>0,\eqno(1.6)$$
is at most countable, so that its complement, the set $C_F:=\mathbb R-D_F$ of all continuity points of $F$, is always dense in $\mathbb R$. Since $F$ is completely determined by the knowledge of the values of $F(x)$ when $x$ varies within a dense subset of $\mathbb R$, this property holds when the dense subset is chosen equal to $C_F$.
The (left-continuous version of the) quantile function [qf] $G(\cdot)$ of $X$ is defined on $[0,1]$ by
$$G(u)=\begin{cases}\alpha_-:=\inf\{x:F(x)>0\}&\text{for }u=0,\\ \inf\{x\in\mathbb R:F(x)\ge u\}=\sup\{x\in\mathbb R:F(x)<u\}&\text{for }0<u<1,\\ \alpha_+:=\sup\{x:F(x)<1\}&\text{for }u=1.\end{cases}\eqno(1.7)$$
The values of the distribution endpoints, $\operatorname{ess\,inf}(X):=\alpha_-$ and $\operatorname{ess\,sup}(X):=\alpha_+$, of $P_X$ are possibly infinite. The distribution of $X$ is said to be degenerate when $X$ is almost surely [a.s.] constant, namely, when there exists an $x_0\in\mathbb R$ such that $P(X=x_0)=1$. In this case, one has
$$-\infty<\alpha_-=\alpha_+=x_0<\infty.\eqno(1.8)$$
On the other hand, for non-degenerate distributions, the following inequalities hold:
$$-\infty\le\alpha_-<\alpha_+\le\infty.\eqno(1.9)$$
Non-degeneracy of $P_X$ is conveniently characterized by the property that
$$\sup_{x\in\mathbb R}P(X=x)<1.\eqno(1.10)$$
The quantile function $G(\cdot)$ in (1.7), as well as its right-continuous version, defined, for $0<u<1$, by $G(u+):=\inf\{x:F(x)>u\}$, and, for $u=0$ or $1$, through the relations (1.12) below, fulfills
$$0\le u\le v\le 1\ \Rightarrow\ \alpha_-\le G(u)\le G(v)\le\alpha_+;\eqno(1.11)$$
$$G(0+):=G(0)=\alpha_-=\lim_{u\downarrow 0}G(u);\qquad G(1+):=G(1)=\alpha_+=\lim_{u\uparrow 1}G(u);\eqno(1.12)$$
$$G(u)=G(u-):=\lim_{\varepsilon\downarrow 0}G(u-\varepsilon)\quad\text{for }0<u\le 1.\eqno(1.13)$$
The right-continuous version $G(\cdot+)$ of the quantile function $G(\cdot)$ of $X$ fulfills (1.11) and (1.12), but not (1.13). The latter relation is replaced by
$$G(u+):=\lim_{\varepsilon\downarrow 0}G(u+\varepsilon)=\inf\{x:F(x)>u\}=\sup\{x:F(x)\le u\}\quad\text{for }0<u<1.\eqno(1.14)$$
One has the following refinement of (1.12) for $G(\cdot+)$:
$$G(0+)=\alpha_-=\lim_{u\downarrow 0}G(u+);\qquad G(1+)=\alpha_+=\lim_{u\uparrow 1}G(u+).\eqno(1.15)$$
Example 1.1. Let $F$ and $G$ be, respectively, the df and qf of a rv $X$.
1°) Show that the set of all $x\in\mathbb R$ such that $F(x+)-F(x-)\ge 1/n$ is finite or void, for each $n\ge 1$. Conclude that the set $D_F$ of discontinuity points of $F$ is, at most, countable.
2°) Show likewise that the set $D_G$ of discontinuity points of $G$ is, at most, countable.
1.2.2. Generalized inverses of nondecreasing functions. The distribution and quantile functions are mutual inverses in a sense made precise below. Consider a non-decreasing function $\{H(t):A\le t\le B\}$, with $-\infty\le A\le B\le\infty$ and $-\infty\le H(A)\le H(B)\le\infty$. We do not make continuity assumptions on $H$, except at end-points when $A<B$, in which case we require that
$$H(A)=\lim_{t\downarrow A}H(t)\quad\text{and}\quad H(B)=\lim_{t\uparrow B}H(t).\eqno(1.16)$$
Then, it is always possible to define a left-continuous inverse $H^{\leftarrow}(s)$ (resp. a right-continuous inverse $H^{\rightarrow}(s)$) of $H(\cdot)$, over $H(A)\le s\le H(B)$, as follows.
(i) When $H(A)<s\le H(B)$ (resp. $H(A)\le s<H(B)$), we set
$$H^{\leftarrow}(s)=\sup\{t:H(t)<s\}\quad(\text{resp. }H^{\rightarrow}(s)=\inf\{t:H(t)>s\}).\eqno(1.17)$$
(ii) When $H(A)=s<H(B)$ (resp. $H(A)<s=H(B)$), we define $H^{\leftarrow}(s)$ (resp. $H^{\rightarrow}(s)$) via (1.17) and
$$H^{\leftarrow}(H(A))=\lim_{v\downarrow H(A)}H^{\leftarrow}(v)\quad(\text{resp. }H^{\rightarrow}(H(B))=\lim_{v\uparrow H(B)}H^{\rightarrow}(v)).\eqno(1.18)$$
(iii) There is one possibility uncovered by (1.17)-(1.18), when $H(A)=H(B)$. In this case, we define $H^{\leftarrow}(s)$ (resp. $H^{\rightarrow}(s)$) for $H(A)=s=H(B)$ by setting
$$H^{\leftarrow}(H(A))=A\quad(\text{resp. }H^{\rightarrow}(H(B))=B).\eqno(1.19)$$
If, in addition to the previous assumptions, $H(\cdot)$ is right-continuous, then the definition (1.17)-(1.18)-(1.19) of the left-continuous inverse $H^{\leftarrow}(\cdot)$ of $H(\cdot)$ may be simplified into
$$H^{\leftarrow}(s)=\inf\{t:H(t)\ge s\}\quad\text{for }H(A)\le s\le H(B).\eqno(1.20)$$
Likewise, when $H(\cdot)$ is left-continuous, the definition (1.17)-(1.18)-(1.19) of the right-continuous inverse $H^{\rightarrow}(\cdot)$ of $H(\cdot)$ may be simplified into
$$H^{\rightarrow}(s)=\sup\{t:H(t)\le s\}\quad\text{for }H(A)\le s\le H(B).\eqno(1.21)$$
Under the additional assumption that $H(A)<H(B)$, the right-continuous (resp. left-continuous) inversion operators, mapping $H$ onto $H^{\rightarrow}$ (resp. $H$ onto $H^{\leftarrow}$), are involutions. The condition $H(A)<H(B)$ (which entails $A<B$) allows to define intrinsic end-points, $A^*$ and $B^*$, pertaining to $H(\cdot)$, by setting
$$A^*=\inf\{t\in(A,B):H(t)>H(A)\}\quad\text{and}\quad B^*=\sup\{t\in(A,B):H(t)<H(B)\}.\eqno(1.22)$$
We have $A\le A^*<B^*\le B$, and $-\infty\le H(A)=H(A^*)<H(B^*)=H(B)\le\infty$. It is readily checked that, whenever $H(\cdot)$ is right-continuous (resp. left-continuous), for all $A^*\le t\le B^*$,
$$H(t)=\{H^{\leftarrow}\}^{\rightarrow}(t)=\{H^{\rightarrow}\}^{\rightarrow}(t)\quad(\text{resp. }H(t)=\{H^{\leftarrow}\}^{\leftarrow}(t)=\{H^{\rightarrow}\}^{\leftarrow}(t)).\eqno(1.23)$$
We note that (1.23) is invalid when $H(t)=c$ is constant for $t\in[A,B]$. In this case (1.22) is meaningless, and the definitions (1.19) of $H^{\leftarrow}$ and $H^{\rightarrow}$ reduce to $H^{\leftarrow}(s)=A$ and $H^{\rightarrow}(s)=B$ for $s=c$, so that
$$\{H^{\leftarrow}\}^{\rightarrow}(t)=\{H^{\rightarrow}\}^{\rightarrow}(t)=c\ \text{ for }t=A,\quad\text{and}\quad\{H^{\leftarrow}\}^{\leftarrow}(t)=\{H^{\rightarrow}\}^{\leftarrow}(t)=c\ \text{ for }t=B.\eqno(1.24)$$
Both operators, $H\to H^{\leftarrow}$ and $H\to H^{\rightarrow}$, are natural extensions of the inversion operator, $H\to H^{-1}$, with respect to composition of applications. The interest of $H^{\leftarrow}$ and $H^{\rightarrow}$ is to be always defined, whereas such is not the case for $H^{-1}$. When $H(\cdot)$ is (strictly) increasing and continuous on $[A,B]$, with a properly defined inverse mapping $H^{-1}(\cdot)$, (strictly) increasing and continuous on $[H(A),H(B)]$, all three definitions coincide, since then $A=A^*$, $B=B^*$, and, for each $s\in[A,B]$ (resp. $t\in[H(A),H(B)]$),
$$H(H^{-1}(t))=t,\qquad H^{-1}(H(s))=s,\qquad H^{-1}(t)=H^{\leftarrow}(t)=H^{\rightarrow}(t).\eqno(1.25)$$
Proposition 1.1 below provides a variant of (1.25), tailored to the case of distribution and quantile functions. Its proof can be adapted to show that continuity of $H$ and $H^{-1}$ is sufficient to imply (1.25).
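As a concrete illustration of (1.20), the following Python sketch (ours, not from the text; the step function and the grid are our choices) computes the left-continuous inverse $H^{\leftarrow}(s)=\inf\{t:H(t)\ge s\}$ of a right-continuous nondecreasing step function, and shows how the flat pieces of $H$ become jumps of $H^{\leftarrow}$.

    # Sketch (ours): left-continuous generalized inverse of a right-continuous
    # nondecreasing step function H, via (1.20) on a fine grid of [0, 1].
    import numpy as np

    def H(t):
        """Right-continuous step function: jumps to 0.3 at t = 0.2 and to 0.8 at t = 0.5."""
        return 0.0 if t < 0.2 else (0.3 if t < 0.5 else 0.8)

    def H_inv_left(s, grid=np.linspace(0.0, 1.0, 10001)):
        """Approximates inf{t : H(t) >= s}; returns 1.0 if no grid point qualifies."""
        hits = grid[np.array([H(t) for t in grid]) >= s]
        return hits[0] if hits.size else 1.0

    for s in (0.1, 0.3, 0.31, 0.8):
        print(s, H_inv_left(s))   # 0.2, 0.2, 0.5, 0.5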
1.2.3. Some useful properties of distribution and quantile functions. As follows from (1.7) and (1.20)-(1.21)-(1.23), in the special case where $G$ is the qf of the df $F$, we have
$$G(u)=F^{\leftarrow}(u)\ \text{for }0\le u\le 1\quad\text{and}\quad F(x)=G^{\rightarrow}(x)\ \text{for }\alpha_-\le x\le\alpha_+,\eqno(1.26)$$
$$G(u+)=F^{\rightarrow}(u)\ \text{for }0\le u\le 1\quad\text{and}\quad F(x-)=G^{\leftarrow}(x)\ \text{for }\alpha_-\le x\le\alpha_+.\eqno(1.27)$$
In view of (1.26)-(1.27) and (1.20)-(1.21), we obtain that, for each $x$ fulfilling $\alpha_-\le x\le\alpha_+$,
$$F(x)=\sup\{u:G(u)\le x\}=\sup\{u:G(u+)\le x\},\eqno(1.28)$$
$$F(x-)=\inf\{u:G(u)\ge x\}=\inf\{u:G(u+)\ge x\}.\eqno(1.29)$$
As follows from §1.2.2, the relations (1.28)-(1.29) become invalid when $x\notin[\alpha_-,\alpha_+]$. We note, further, that the definition (1.7) of $G(\cdot)$ entails that, for each $u\in[0,1]$ such that $-\infty<G(u)<\infty$, and each $\varepsilon>0$,
$$F(G(u)-\varepsilon)<u\le F(G(u)+\varepsilon),\quad\text{whence}\quad F(G(u)+\varepsilon)-F(G(u)-\varepsilon)>0.\eqno(1.30)$$
This implies that the image set of $[0,1]$ by either $G(\cdot)$ or $G(\cdot+)$ is nothing else but the support of the distribution of $X$. The latter, denoted by $\operatorname{supp}(X)$, is a closed subset of $\mathbb R$ (see, e.g., p. 277 in [10]), collecting all $x\in\mathbb R$ such that $P(V_x)>0$ for each open neighborhood $V_x$ of $x$ in $\mathbb R$. It is noteworthy that $P(X\in\operatorname{supp}(X))=1$. The complement in $\mathbb R$ of the support of $X$, namely $\mathbb R-\operatorname{supp}(X)$, is the union of all open subsets $O$ of $\mathbb R$ such that $P(X\in O)=0$.
A similar argument as that used to infer (1.30) from (1.7) shows that, for each $0<u<1$ (this implying that $-\infty<G(u)\le G(u+)<\infty$) and $\varepsilon>0$, we have
$$F(G(u)-\varepsilon)<u\le F(G(u)+\varepsilon),\quad\text{and}\quad F(G(u+)-\varepsilon)\le u<F(G(u+)+\varepsilon).\eqno(1.31)$$
By letting $\varepsilon\downarrow 0$ in (1.31), we obtain readily the inequalities
$$F(G(u)-)\le F(G(u+)-)\le u\le F(G(u))\le F(G(u+)).\eqno(1.32)$$
Another application of (1.7) shows readily that, for each $\alpha_-<x<\alpha_+$,
$$G(F(x-))\le G(F(x))\le x\le G(F(x-)+)\le G(F(x)+).\eqno(1.33)$$
A simple and useful consequence of these inequalities is stated in the next proposition.
Proposition 1.1. Assume that $F(\cdot)$ is continuous on $(\alpha_-,\alpha_+)$, and that $G(\cdot)$ is continuous on $(0,1)$. Then, for each $x\in[\alpha_-,\alpha_+]$ and $u\in[0,1]$, one has
$$G(F(x))=x,\quad\text{and}\quad F(G(u))=u.\eqno(1.34)$$
Proof. The assumption that $F(\cdot)$ and $G(\cdot)$ are continuous is equivalent to the identities $F(\cdot)=F(\cdot-)$ and $G(\cdot)=G(\cdot+)$. We have therefore $F(x)=F(x-)$ and $G(u)=G(u+)$ in (1.32)-(1.33), which, in turn, entail (1.34) for $x\in(\alpha_-,\alpha_+)$ and $u\in(0,1)$. To conclude, we observe that (1.34) holds for $x=\alpha_{\mp}$ and $u=0$ or $1$, as a direct consequence of the definitions (1.1) and (1.7) of $F$ and $G$. $\square$
Exercise 1.1. Let $X$ denote a $[0,1]$-valued rv with density $f(\cdot)$, continuous and positive on $[0,1]$. Denoting by $F(\cdot)$ and $G(\cdot)$ the df and qf of $X$, show that the quantile density $g(u)=\frac{d}{du}G(u)$ is continuous and positive on $[0,1]$, and is equal to $1/f(G(u))$.
Exercise 1.2. Let $G$ be the qf of a rv $X$. Prove the equality
$$E(X)=\int_0^1 G(u)\,du.$$
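For a quick numerical illustration of Exercise 1.2 (a sketch of ours, assuming NumPy and SciPy): for an Exponential(1) random variable, $G(u)=-\log(1-u)$ and both sides of the identity equal 1.

    # Sketch (ours): E(X) = int_0^1 G(u) du for X ~ Exponential(1).
    import numpy as np
    from scipy.integrate import quad

    G = lambda u: -np.log(1.0 - u)     # quantile function of Exponential(1)
    print(quad(G, 0.0, 1.0)[0])        # ~ 1.0 = E(X)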

1.3. Topologies on spaces of measures and functions


1.3.1. Weak and vague convergence of measures. We first recall some basic topological facts. The notions of Polish, and of locally compact, topological space play here an essential role. A topological space $E$ is Polish iff its topology can be defined by a metric $\rho$, for which it is complete and separable (meaning, see, e.g., p. 209 in [10], that there exists a countable dense subset of $E$, or, equivalently, that there exists a countable base of open sets). A locally compact space is a Hausdorff topological space (such that two distinct points have disjoint neighborhoods), for which each point has a compact neighborhood. A locally compact space $E$ has a countable base (see, e.g., p. 209 in [10]) iff there exists a countable base of open sets of $E$ (meaning that each open set of $E$ is the union of some of these subsets). Both structures are closely related. In fact, the topology of an arbitrary separable metric space $(E,\rho)$ can be defined by a metric $\tilde\rho$, such that the completion $\tilde E$ of $E$ with respect to $\tilde\rho$ is a separable compact metric space (see, e.g., p. 72 in [83]). Conversely, a locally compact space is metrizable if and only if it has a countable base, in which case it is Polish (see, e.g., p. 225 in [10]). Because of this, and since, in the applications we consider, $E=\mathbb R$ or $\mathbb R^d$, we will let, unless otherwise specified, $E$ denote a metrizable locally compact space, which, as follows from the previous arguments, is necessarily a Polish space. To work on locally compact spaces simplifies matters with respect to measure theory, since we may restrict our interest to Radon measures, by definition finite on compact subsets.
We recall below the usual definitions and properties of vague (resp. weak) convergence in the space $M^+(E)$ of (non-negative) Radon measures (resp. $M_f^+(E)$ of (non-negative) finite Radon measures) on a metrizable separable locally compact space $E$. In words, $M^+(E)$ (resp. $M_f^+(E)$) consists of all non-negative measures $\mu$ defined on the Borel $\sigma$-algebra $\mathcal B_E$ of $E$, and such that $\mu(A)<\infty$ for each relatively compact Borel subset $A\in\mathcal B_E$ of $E$ (resp. $\mu(E)<\infty$). We denote by $M(E)$ (resp. $M_f(E)$) the set of all signed (Radon) measures (resp. totally bounded signed (Radon) measures) on $E$, and keep in mind that $\mu\in M(E)$ (resp. $M_f(E)$) iff there exist two non-negative measures $\mu^+\in M^+(E)$ and $\mu^-\in M^+(E)$ (resp. $\mu^+\in M_f^+(E)$ and $\mu^-\in M_f^+(E)$), such that
$$\mu=\mu^+-\mu^-.\eqno(1.35)$$
Given $\mu\in M(E)$, the choice of $\mu^+$ and $\mu^-$ in (1.35) is not unique. However, the Hahn-Jordan decomposition theorem (see, e.g., p. 173 in [172] and pp. 178-181 in [83]) allows to choose these two measures, by setting $\mu=\mu^+-\mu^-$, where $\mu^+$ and $\mu^-$ are defined through a decomposition of $E=E_-\cup E_+$ into the union of two disjoint measurable subsets $E_-$ and $E_+$, such that, for each relatively compact $A\in\mathcal B_E$,
$$\mu^+(A)=\mu(A\cap E_+)\quad\text{and}\quad\mu^-(A)=-\mu(A\cap E_-),\qquad E_-\cap E_+=\emptyset.\eqno(1.36)$$
The property (1.36) characterizes orthogonality of $\mu^+$ and $\mu^-$ (denoted hereafter by $\mu^+\perp\mu^-$). The measures in (1.36) are uniquely defined for each $A\in\mathcal B_E$, via the relations
$$\mu^+(A)=\sup_{B\subseteq A,\ B\in\mathcal B_E}\mu(B\cap E_+)\quad\text{and}\quad\mu^-(A)=\sup_{B\subseteq A,\ B\in\mathcal B_E}\{-\mu(B\cap E_-)\}.\eqno(1.37)$$
In view of (1.37), we may define the total variation measure $|\mu|\in M_f(E)$ of $\mu\in M_f(E)$, via
$$|\mu|(A)=\mu^+(A)+\mu^-(A)=\sup_{B\subseteq A,\ B\in\mathcal B_E}\{\mu(B)-\mu(A-B)\}\quad\text{for each }A\in\mathcal B_E.\eqno(1.38)$$
In the special case where $E=\mathbb R$, we may observe that the condition that $\mu\in M_f(\mathbb R)$ is equivalent to the condition that the distribution function $H_\mu(x)=\mu((-\infty,x])$ of $\mu$ is of bounded variation on $\mathbb R$, with $H_\mu(-\infty):=\lim_{x\to-\infty}H_\mu(x)=0$, and with total variation on $\mathbb R$ equal to $|dH_\mu|(\mathbb R)=|\mu|(\mathbb R)$. In the sequel the set of functions $f$ of bounded variation on a sub-interval $I$ of $\mathbb R$ will be denoted by $BV(I)$.
Definition 1.1. A family of measures $\mu_\alpha\in M(E)$, indexed by an oriented net $\alpha\in A$, is said to be vaguely convergent to $\mu\in M(E)$ (resp. a net-oriented family $\mu_\alpha\in M_f(E)$, $\alpha\in A$, is said to be weakly convergent to $\mu\in M_f(E)$), when
$$\int_E f\,d\mu_\alpha\to\int_E f\,d\mu\eqno(1.39)$$
for all $f$ continuous on $E$ with compact support (resp. continuous and bounded on $E$).
Vague convergence of $\mu_\alpha\in M^+(E)$ to $\mu\in M^+(E)$ is equivalent to the property that
$$\mu_\alpha(K)\to\mu(K),\eqno(1.40)$$
for each compact subset $K$ of $E$, with boundary $\partial K$ such that $\mu(\partial K)=0$. It is noteworthy (see, e.g., p. 235 in [10]) that weak convergence of $\mu_\alpha\in M_f^+(E)$ to $\mu\in M_f^+(E)$ is equivalent to vague convergence, with the additional condition that
$$\mu_\alpha(E)\to\mu(E).\eqno(1.41)$$
Remark 1.1. The weak convergence of probability measures on a separable metric space can be defined by the Prohorov metric (see, e.g., pp. 393-399 in [82]). When $E$ is locally compact, the vague topology on $M^+(E)$ is metrizable iff $E$ has a countable base, or, equivalently, iff $E$ is Polish (see, e.g., p. 243 in [10]).
Remark 1.2. Neither of the nice characterizations, (1.40) and (1.41), of vague and weak convergence holds when $M^+(E)$ and $M_f^+(E)$ are replaced, respectively, by $M(E)$ and $M_f(E)$. The vague and weak topologies are rather nasty to handle for signed measures, since they are not metrizable on $M(E)$. This (far from obvious) fact can be shown through the following arguments. First, the space $C_b(E)$ of bounded continuous functions on $E$, endowed with the sup-norm, is a Banach space. Its topological dual $C_b(E)^*$, endowed with the weak$^*$ topology, coincides with the set $M_f(E)$ of finite signed measures on $E$, endowed with the weak topology (see, e.g., p. 16 in [19]). Second, it turns out (see, e.g., p. 68 in [173]) that the topological dual $X^*$ of an infinite-dimensional Banach space $X$ is not metrizable (and such is the case, therefore, in general for $(M_f(E),\mathcal W)$). On the other hand, it can be shown (see, e.g., p. 68 in [173]), as a consequence of the Banach-Alaoglu theorem, that any weak$^*$-compact subset of $X^*$ is metrizable. This last fact underlies some of the results of the forthcoming §1.3.4.
Because of the mathematical difficulties underlined in Remark 1.2, in the remainder of this section we will limit ourselves to the study of vague and weak convergence on the spaces of non-negative measures $M^+(E)$ and $M_f^+(E)$. In §1.3.3-1.3.4, we will discuss again the case of signed measures, when $E=[0,1]$.
A subset $S\subseteq M^+(E)$ is vaguely relatively compact if and only if (see, e.g., p. 241 in [10])
$$\sup_{\mu\in S}\int f\,d\mu<\infty\quad\Big(\text{resp. }\sup_{\mu\in S}\mu(K)<\infty\Big),\eqno(1.42)$$
for each bounded function $f$ on $E$ with compact support (resp. compact subset $K$ of $E$). For the set $P(E)$ of probability measures on $E$, (1.41) is automatic, so that vague and weak convergence are equivalent, for a sequence $\mu_n\in P(E)$, when the limit is in $P(E)$. However, the set $P(E)$ is not closed in $M^+(E)$ with respect to the vague topology, so that (1.42) is not sufficient to ensure weak relative compactness of a subset $S$ of $P(E)$. A useful characterization of this property is given by Prohorov's theorem (refer to p. 37 in [19]), which shows that weak relative compactness of $S\subseteq P(E)$ is equivalent to the tightness of $S$, namely, to the condition that, for each $\varepsilon>0$, there exists a compact subset $K_\varepsilon$ of $E$ such that
$$\inf_{\mu\in S}\mu(K_\varepsilon)\ge 1-\varepsilon.\eqno(1.43)$$
In general (see, e.g., pp. 237-238 in [10]), weak convergence of $\mu_n\in M_f^+(E)$ to $\mu\in M_f^+(E)$ is equivalent to the convergence of $\mu_n(Q)$ to $\mu(Q)$ for each $Q\in\mathcal B_E$ with boundary $\partial Q$ of null measure. For the probability distributions on $E=\mathbb R$, considered in §1.3.2, this last condition is equivalent to the pointwise convergence (1.45) of dfs at each continuity point of the limiting df. The same property holds when $E=\mathbb R^d$ for an arbitrary $d\ge 1$.

1.3.2. Weak convergence and the Lévy metric. We consider briefly in the present section the topology, denoted hereafter by $\mathcal W$, of weak convergence on the space $P(\mathbb R)$ of probability measures on $\mathbb R$, identified to the space $F(\mathbb R)$ of the corresponding (right-continuous) distribution functions. We refer to §1.3.1 for basic definitions, to [19] for more details on weak convergence of probability measures, and to §1.3.3 (resp. §1.3.4) below, for a study of this topology on the set of bounded non-negative (resp. signed) measures on $[0,1]$. One important feature of the topological space $(P(\mathbb R),\mathcal W)$ is that it is conveniently metricized by the Lévy metric (or distance), defined as follows. Consider two, possibly unbounded, nondecreasing functions $H_1$ and $H_2$ on $[A,B]\subseteq\overline{\mathbb R}$. Extend the definition of these functions to $\mathbb R$, by setting, for $i=1,2$,
$$\widetilde H_i(x)=\begin{cases}H_i(A)&\text{for }x\le A,\\ H_i(B)&\text{for }x\ge B.\end{cases}$$
Given this notation, the Lévy distance between $H_1$ and $H_2$ is defined by
$$L(H_1,H_2)=\begin{cases}\inf\big\{\varepsilon\ge 0:\widetilde H_1(x-\varepsilon)-\varepsilon\le\widetilde H_2(x)\le\widetilde H_1(x+\varepsilon)+\varepsilon\ \ \forall x\in\mathbb R\big\},\\ \qquad\text{whenever such an }\varepsilon\ge 0\text{ exists},\\ \infty\quad\text{otherwise}.\end{cases}\eqno(1.44)$$
When $H_1$ and $H_2$ are bounded, (1.44) implies that $L(H_1,H_2)<\infty$. It is readily checked that $L$ defines a metric on the set $F(\mathbb R)$ of all probability dfs on $\mathbb R$. Denoting, as usual, by $C_F$ the set of continuity points of $F$, a necessary and sufficient condition for a sequence $\{F_n:n\ge 1\}$ of dfs to converge weakly to the
limiting df $F$ is that (see, e.g., p. 18 in [19])
$$F_n\xrightarrow{\mathcal W}F\iff L(F_n,F)\to 0\iff\lim_{n\to\infty}F_n(x)=F(x)\ \ \forall x\in C_F.\eqno(1.45)$$
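The metric (1.44) is easy to approximate in practice. The following Python sketch (ours; the grid, the tolerance, and the function names are our choices) scans $\varepsilon$ over a grid and returns the smallest value satisfying the sandwich inequalities, here for two vectorized normal dfs from SciPy.

    # Sketch (ours): grid approximation of the Levy distance (1.44) between
    # two dfs F1, F2, assumed vectorized over NumPy arrays.
    import numpy as np
    from scipy.stats import norm

    def levy_distance(F1, F2, xs=np.linspace(-10.0, 10.0, 2001)):
        f2 = F2(xs)
        for eps in np.linspace(0.0, 1.0, 1001):
            if np.all(F1(xs - eps) - eps <= f2) and np.all(f2 <= F1(xs + eps) + eps):
                return eps
        return 1.0

    print(levy_distance(norm(0, 1).cdf, norm(0, 1).cdf))      # 0.0
    print(levy_distance(norm(0, 1).cdf, norm(0.3, 1).cdf))    # small positive value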

Consider now two quantile functions, $G_1$ and $G_2$, pertaining to the two distribution functions $F_1\in F$ and $F_2\in F$. One may extend the definition of $G_1$ and $G_2$ to $\mathbb R$, by setting, for $i=1,2$,
$$\widetilde G_i(x)=\begin{cases}G_i(0)&\text{for }x\le 0,\\ G_i(1)&\text{for }x\ge 1.\end{cases}$$
Given this notation, one may define the Lévy distance $L(G_1,G_2)$ between $G_1$ and $G_2$ by the formal replacement of $H_1,H_2$ in (1.44) by $\widetilde G_1,\widetilde G_2$. We have the following useful proposition.
Proposition 1.2. When $G_1,G_2$ are the quantile functions of the distribution functions $F_1,F_2$, we have
$$L(F_1,F_2)=L(G_1,G_2).\eqno(1.46)$$
Proof. Combine (1.32)-(1.33) with (1.45). $\square$
An interesting application of (1.45) and (1.46) is provided by the following characterization of weak convergence. Let $G_n=F_n^{\leftarrow}$ denote the quantile function of $F_n$ for $n=1,2,\ldots$, and let $G=F^{\leftarrow}$ denote the quantile function of $F$. Denote by $C_G\subseteq[0,1]$ the set of all continuity points of $G$. Then,
$$F_n\xrightarrow{\mathcal W}F\iff L(G_n,G)\to 0\iff\lim_{n\to\infty}G_n(u)=G(u)\ \ \forall u\in C_G.\eqno(1.47)$$

1.3.3. Weak and uniform topologies for dfs of non-negative measures on [0,1]. We now specialize in the study of (non-negative bounded) Radon measures with support in a bounded closed interval of $\mathbb R$. For convenience, and without loss of generality, we will limit ourselves to the case where this interval is $E=[0,1]$. As follows from Definition 1.1, the vague and weak topologies coincide on $M[0,1]$.
Let $M_f^+[0,1]$ denote the set of all bounded non-negative measures on $\mathbb R$, with support in $[0,1]$, namely, fulfilling $\mu(\mathbb R-[0,1])=0$ and $\mu(A)\ge 0$ for each $A\in\mathcal B_{\mathbb R}$. We denote by $I^+[0,1]$ (resp. $I^-[0,1]$) the set of all right-continuous (resp. left-continuous) distribution functions $H(\cdot+)$ (resp. $H(\cdot-)$) of measures $\mu\in M_f^+[0,1]$, of the form
$$H_\mu(x+)=\mu([0,x])\quad\text{and}\quad H_\mu(x-)=\mu([0,x))\quad\text{for }x\in\mathbb R.\eqno(1.48)$$
We let $I^\pm[0,1]$ denote either one of the sets $I^+[0,1]$ or $I^-[0,1]$. The correspondence $H_\mu(\cdot+)\leftrightarrow H_\mu(\cdot-)\leftrightarrow\mu$ being one-to-one, we may endow $I^\pm[0,1]$ with the weak topology $\mathcal W$ of convergence of measures in $M_f^+[0,1]$. This topology is metricized by the Lévy metric $L(\cdot,\cdot)$ defined via (1.44), and we let $(I^\pm[0,1],\mathcal W)$ denote the set $I^\pm[0,1]$ endowed with the weak topology $\mathcal W$. Likewise, we let $(I^\pm[0,1],\mathcal U)$ (resp. $(M_f^+[0,1],\mathcal U)$) denote the set $I^\pm[0,1]$ (resp. $M_f^+[0,1]$), endowed with the uniform topology $\mathcal U$. The latter topology is induced by the sup-norm distance, defined, for $H_\mu,H_\nu\in I^\pm[0,1]$, with $\mu,\nu\in M_f^+[0,1]$, via
$$d_{\mathcal U}(\mu,\nu)=\|H_\mu-H_\nu\|=\sup_{x\in\mathbb R}|H_\mu(x)-H_\nu(x)|.\eqno(1.49)$$
In view of (1.42), for any (finite) constant $C\ge 0$, the set $M_{f,C}^+[0,1]$ of all $\mu\in M_f^+[0,1]$ such that $\mu([0,1])=H_\mu(1+)\le C$ is weakly compact. The corresponding set of distribution functions is
$$I_C^\pm[0,1]:=\{H\in I^\pm[0,1]:H(1+)\le C\}=\{H_\mu(\cdot):\mu([0,1])\le C\}.\eqno(1.50)$$
The Lebesgue decomposition of $\mu\in M_f^+[0,1]$ shows that the df $H_\mu(\cdot+)\in I^+[0,1]$ (resp. $H_\mu(\cdot-)\in I^-[0,1]$) of $\mu$ has a unique decomposition into the sum of two components, as follows. For all $t\in\mathbb R$,
$$H_\mu(t-)=\int_{[0,t)}h(u)\,du+H_{\mu_S}(t-)\quad\Big(\text{resp. }H_\mu(t+)=\int_{[0,t]}h(u)\,du+H_{\mu_S}(t+)\Big).\eqno(1.51)$$
Here, $h$ is a non-negative function, Lebesgue-integrable on $[0,1]$, defined uniquely up to an almost everywhere [a.e.] equivalence, and $\mu_S:=dH_{\mu_S}$ is a singular non-negative measure on $[0,1]$, with support of null Lebesgue measure, and such that
$$\mu([0,1])=H_\mu(1+)=\int_0^1 h(u)\,du+H_{\mu_S}(1+).\eqno(1.52)$$
The function $h$ in (1.51)-(1.52) is the Lebesgue derivative of $H=H_\mu$, denoted hereafter by $\dot H=\frac{d}{dx}H$. By definition, $H$ is absolutely continuous (with respect to Lebesgue measure) iff $H_{\mu_S}(1+)=\mu_S([0,1])=0$. In the sequel, the set of non-decreasing, absolutely continuous distribution functions $H_\mu$ of measures $\mu\in M_f^+[0,1]$ will be denoted by $ACI_0[0,1]$. Naturally, we may combine the Lebesgue and Hahn-Jordan decompositions of a totally bounded signed measure $\mu\in M_f[0,1]$, to write
$$H_\mu(t)=\mu([0,t])=\int_{[0,t]}h(u)\,du+H_{\mu_S^+}(t)-H_{\mu_S^-}(t),\eqno(1.53)$$
where $\mu_S^+\perp\mu_S^-$ are singular non-negative measures on $[0,1]$, with supports of null Lebesgue measure, and $h=\dot H_\mu=\frac{d}{dx}H_\mu$ is the Lebesgue derivative of $H_\mu$. The set of all absolutely continuous functions $H_\mu$ on $[0,1]$, fulfilling $H_{\mu_S^\pm}(1+)=0$, will be denoted by $AC_0[0,1]$. A function $H\in AC_0[0,1]$ (resp. $ACI_0[0,1]$) is such that $H(0)=0$. If we let $H(0)$ be arbitrary, we obtain a general absolutely continuous (resp. absolutely continuous nondecreasing) function on $[0,1]$. The corresponding sets of functions will be denoted by $AC[0,1]$ (resp. $ACI[0,1]$).

1.3.4. Weak convergence of signed measures on [0,1]. We now concentrate on the study of the weak topology $\mathcal W$ on the space $M_f[0,1]$ of totally bounded signed measures on $\mathbb R$, with support in $[0,1]$. Recalling the Hahn-Jordan decomposition $\mu=\mu^+-\mu^-$ of $\mu$ in (1.36), we see that $\mu\in M_f[0,1]$ and $|\mu|=\mu^++\mu^-\in M_f^+[0,1]$ fulfill, in view of (1.38),
$$|\mu|([0,1])=\mu^+([0,1])+\mu^-([0,1])<\infty\quad\text{and}\quad\operatorname{supp}(\mu):=\operatorname{supp}(|\mu|)\subseteq[0,1].\eqno(1.54)$$
The right-continuous distribution function [df] $H_\mu(x)=\mu([0,x])$ of $\mu\in M_f[0,1]$ fulfills $H_\mu(0-)=0$, and is of bounded variation $|dH_\mu|(\mathbb R)=|\mu|([0,1])<\infty$ on $\mathbb R$. Conversely, via the Lebesgue-Stieltjes integral, any right-continuous function $H$ of bounded variation on $\mathbb R$, with $H(x)=0$ for $x<0$, and $H(x)=H(1)$ for $x\ge 1$, is the df of some $\mu\in M_f[0,1]$. Here, and elsewhere, we set
$$BV_0[0,1]=\{H_\mu:\mu\in M_f[0,1]\},\eqno(1.55)$$
and, for an arbitrary $C\ge 0$,
$$M_{f,C}[0,1]=\{\mu\in M_f[0,1]:|\mu|([0,1])\le C\}\quad\text{and}\quad BV_{0,C}[0,1]=\{H_\mu:\mu\in M_{f,C}[0,1]\}.\eqno(1.56)$$
Introduce now the metric (see, e.g., Högnäs [113]), defined, for $\mu,\nu\in M_f[0,1]$, by
$$d_H(\mu,\nu)=d_H(H_\mu,H_\nu)=\int_0^1|H_\mu(t)-H_\nu(t)|\,dt+|H_\mu(1)-H_\nu(1)|.\eqno(1.57)$$
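The Högnäs metric (1.57) is equally simple to compute. Here is a small Python sketch (ours, assuming SciPy) that evaluates $d_H$ between two measures given through their dfs; the example compares Lebesgue measure on $[0,1]$ with the point mass $\frac12\delta_{1/2}$.

    # Sketch (ours): the metric d_H of (1.57), via numerical integration of dfs.
    from scipy.integrate import quad

    def d_H(H_mu, H_nu):
        """d_H(mu, nu) = int_0^1 |H_mu - H_nu| dt + |H_mu(1) - H_nu(1)|."""
        integral, _ = quad(lambda t: abs(H_mu(t) - H_nu(t)), 0.0, 1.0, limit=200)
        return integral + abs(H_mu(1.0) - H_nu(1.0))

    H_lebesgue = lambda t: t                       # df of Lebesgue measure on [0,1]
    H_atom = lambda t: 0.5 if t >= 0.5 else 0.0    # df of the mass 1/2 at t = 1/2

    print(d_H(H_lebesgue, H_atom))                 # 0.25 + 0.5 = 0.75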
As mentioned in Remark 1.2, the weak topology on $M(\mathbb R)$ is not metrizable. For this reason, one should discuss the weak convergence of $\mu_\alpha\in M[0,1]$ to $\mu\in M[0,1]$ along an oriented net $A$, rather than on a sequence. We have the following characterization of weak convergence in $M_f[0,1]$, due to Högnäs [113].
Proposition 1.3. A net-oriented family $\mu_\alpha\in M_f[0,1]$, with $\alpha\in A$, of totally bounded signed measures on $[0,1]$ is weakly convergent to $\mu\in M_f[0,1]$, if and only if:
1°) There exists a constant $0<C<\infty$ such that, ultimately along the net $A$, $\mu_\alpha\in M_{f,C}[0,1]$;
2°) We have, along the net $A$, $d_H(\mu_\alpha,\mu)\to 0$.
Proof. See, e.g., Högnäs [113]. $\square$
A direct consequence of Proposition 1.3 is that, for each $C\in[0,\infty)$, the restriction of the weak topology $\mathcal W$ to $M_{f,C}[0,1]$ is metrizable. This is captured in the following corollary of Proposition 1.3.
Corollary 1.1. For each $0<C<\infty$, the metric $d_H(\cdot,\cdot)$ endows $M_{f,C}[0,1]$ with the weak topology.
We infer readily from the above arguments the following proposition.
Proposition 1.4. For each $0<C<\infty$, the set $M_{f,C}[0,1]$ is weakly compact.
Proof. Consider any sequence $\{\mu_n:n\ge 1\}\subseteq M_{f,C}[0,1]$ such that $d_H(\mu_m,\mu_n)\to 0$ as $m\wedge n\to\infty$. For each $n\ge 1$, denote by $\mu_n=\mu_n^+-\mu_n^-$ the Hahn-Jordan decomposition of $\mu_n$. Observe, via (1.38), that, for each $n\ge 1$, $\mu_n^\pm\in M_{f,C}^+[0,1]$. By (1.42), $M_{f,C}^+[0,1]$ is a weakly compact metrizable space. Therefore, for each increasing sequence $\{n_j:j\ge 1\}$ of positive integers, there exists an increasing subsequence along which $\mu_n^\pm\xrightarrow{\mathcal W}\mu^\pm$, for some $\mu^+$ and $\mu^-\in M_{f,C}^+[0,1]$. This, in turn, entails that, along this subsequence, $d_H(\mu_n^\pm,\mu^\pm)\to 0$, and hence, $d_H(\mu_n,\mu)\to 0$, where $\mu:=\mu^+-\mu^-$. Since the so-defined $\mu$ is necessarily unique, we infer from this argument that $(M_{f,C}[0,1],d_H)$ is a complete metric space. The same argument shows readily that $M_{f,C}[0,1]$ is sequentially compact, and therefore compact, with respect to $\mathcal W$. $\square$
Remark 1.3. As mentioned above in §1.3.1, the fact that $M_{f,C}[0,1]$ is metrizable with respect to the weak topology $\mathcal W$ is a general consequence of the Banach-Alaoglu theorem (refer to pp. 67-68 in [173]). The Högnäs metric (1.57) provides here a convenient example of metric endowing $M_{f,C}[0,1]$ with $\mathcal W$.
1.3.5. Compact sets based upon rate functions. We will consider below a series of examples of compact subsets of $M_f[0,1]$, endowed either with the weak topology $\mathcal W$, or with the uniform topology $\mathcal U$, of the corresponding distribution functions. Throughout, we will identify $\mu\in M_f[0,1]$ with its distribution function $H_\mu(t)=\mu([0,t])$, so that the sets we will consider will be defined equivalently, in terms of measures, or in terms of functions. In particular, the uniform topology $\mathcal U$ will be defined on $M_f[0,1]$ by setting, for $\mu,\nu\in M_f[0,1]$,
$$d_{\mathcal U}(\mu,\nu)=\|H_\mu-H_\nu\|=\sup_{t\in\mathbb R}|H_\mu(t)-H_\nu(t)|.\eqno(1.58)$$
The compact sets we shall consider will be defined through rate functions, which, in the applications discussed later on, will be chosen as Chernoff functions, of the form (2.8). However, in the present section, this restriction is not necessary, and we will work in a more general setup. By rate function is meant a non-negative convex (possibly infinite) function $\{\psi(\lambda):\lambda\in\mathbb R\}$, with the following properties, holding for some specified constant $m\in\mathbb R$:
$(\psi.1)$ $0\le\psi(\lambda)\le\infty$;
$(\psi.2)$ $\psi(m)=0$ and $\psi$ is convex on $\mathbb R$.
We set further, for $\psi$ fulfilling $(\psi.1{-}2)$,
$$t_0:=\lim_{\lambda\uparrow\infty}\frac{\psi(\lambda)}{\lambda}\ \ge\ 0\ \ge\ t_1:=\lim_{\lambda\downarrow-\infty}\frac{\psi(\lambda)}{\lambda}.\eqno(1.59)$$
Example 1.2. The following three examples of rate functions deserve special attention; a short numerical sketch follows this list.
1°) $\psi(\lambda)=\lambda^2$, with $m=0$, $t_1=-\infty$ and $t_0=\infty$;
2°) $\psi(\lambda)=h(\lambda)$, with $m=1$, $t_1=-\infty$ and $t_0=\infty$, where $h$ is defined by
$$h(\lambda)=\begin{cases}\lambda\log\lambda-\lambda+1&\text{for }\lambda>0;\\ 1&\text{for }\lambda=0;\\ \infty&\text{for }\lambda<0.\end{cases}\eqno(1.60)$$
3°) $\psi(\lambda)=\ell(\lambda)$, with $m=1$, $t_1=-\infty$ and $t_0=1$, where $\ell$ is defined by
$$\ell(\lambda)=\begin{cases}\lambda-\log\lambda-1&\text{for }\lambda>0;\\ \infty&\text{for }\lambda\le 0.\end{cases}\eqno(1.61)$$
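The three rate functions above are straightforward to code. The following Python sketch (ours) uses the conventions of (1.60)-(1.61), checks that each function vanishes exactly at its minimizer $m$, and verifies the identity of Exercise 1.3 below.

    # Sketch (ours): the rate functions of Example 1.2, with their conventions.
    import numpy as np

    def psi_sq(lam):              # 1): psi(lam) = lam^2, m = 0
        return lam ** 2

    def h(lam):                   # 2): Poisson-type rate function, m = 1
        if lam < 0.0:
            return np.inf
        if lam == 0.0:
            return 1.0
        return lam * np.log(lam) - lam + 1.0

    def ell(lam):                 # 3): exponential-type rate function, m = 1
        return lam - np.log(lam) - 1.0 if lam > 0.0 else np.inf

    print(psi_sq(0.0), h(1.0), ell(1.0))    # 0.0 0.0 0.0: each vanishes at m
    lam = 2.0
    print(lam * h(1.0 / lam), ell(lam))     # equal: lam*h(1/lam) = ell(lam)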

The following theorems will have useful consequences. For each $H\in AC[0,1]$ (the set of absolutely continuous functions on $[0,1]$, see §1.3.3), we denote by $\dot H(t)=\frac{d}{dt}H(t)$ the Lebesgue derivative of $H$. By $(AC[0,1],\mathcal U)$ is meant the set $AC[0,1]$ endowed with the uniform topology $\mathcal U$. The set $BV_0[0,1]$ collects all distribution functions $H_\mu$ of totally bounded signed measures $\mu\in M_f[0,1]$ with support in $[0,1]$.
Theorem 1.1. Let $\psi$, fulfilling $(\psi.1{-}2)$, be such that $t_1=-\infty$ and $t_0=\infty$. Introduce the set
$$\Gamma_{\psi,c}=\Big\{H\in AC[0,1]:H(0)=0\ \text{and}\ \int_0^1 c\,\psi(c^{-1}\dot H(u))\,du\le 1\Big\}.\eqno(1.62)$$
Then $\Gamma_{\psi,c}$ is a compact subset of $(AC[0,1],\mathcal U)$.
Proof. We limit ourselves to $c=1$, and set $\Gamma=\Gamma_{\psi,1}$. By the Arzelà-Ascoli theorem (see, e.g., p. 369 in [173]), it is enough to show that $\Gamma$ is closed, uniformly equi-continuous and bounded. To establish this property, we fix an arbitrary $\varepsilon>0$. The assumption that $\psi(\lambda)/|\lambda|\to\infty$ as $|\lambda|\to\infty$ ensures the existence of an $M<\infty$ such that $\psi(\lambda)\ge\varepsilon^{-1}|\lambda|$ for all $|\lambda|\ge M$. Now, this ensures that, for each $0\le a<b\le 1$ and $H\in\Gamma$,
$$\frac{|H(b)-H(a)|}{b-a}\le M\vee\Big\{\varepsilon\,\psi\Big(\frac{H(b)-H(a)}{b-a}\Big)\Big\}\le M\vee\Big\{\frac{\varepsilon}{b-a}\int_a^b\psi(\dot H(u))\,du\Big\}\le M\vee\frac{\varepsilon}{b-a},$$
where we have used the convexity of $\psi$. Thus, we have $|H(b)-H(a)|\le\varepsilon\vee\{M(b-a)\}$, whence
$$|b-a|\le\varepsilon/M\ \Rightarrow\ |H(b)-H(a)|\le\varepsilon.$$
This establishes the uniform equicontinuity of $\Gamma$, the boundedness of $\Gamma$ following trivially, in turn, from the fact that $H(0)=0$ for all $H\in\Gamma$. Finally, the fact that $\Gamma$ is closed follows from the observation (see, e.g., Lynch and Sethuraman [144]) that the mapping
$$H\in AC[0,1]\ \to\ \int_0^1\psi(\dot H(u))\,du,\eqno(1.63)$$
is lower semi-continuous. $\square$
Theorem 1.2. Let $\psi$, fulfilling $(\psi.1{-}2)$, be such that
$$\psi(\lambda)=\infty\ \text{for }\lambda<0,\qquad t_1=-\infty\quad\text{and}\quad t_0<\infty.$$
For each $c>0$, introduce the set
$$\Gamma_{\psi,c}=\Big\{H\in I^\pm[0,1]:H(0)=0,\ \int_0^1 c\,\psi(c^{-1}\dot H(u))\,du+t_0 H_S(1+)\le 1\Big\}.\eqno(1.64)$$
Then $\Gamma_{\psi,c}$ is a compact subset of $(I^\pm[0,1],\mathcal W)$.
Proof. See, e.g., Lynch and Sethuraman [144]. $\square$
Example 1.3.
1°) The Strassen set of functions (see, e.g., Strassen [199] and (2.34) in the sequel), defined by
$$S=\Big\{f\in AC[0,1]:f(0)=0\ \text{and}\ \int_0^1\dot f(u)^2\,du\le 1\Big\},\eqno(1.65)$$
is a compact subset of the set $(C[0,1],\mathcal U)$ of continuous functions on $[0,1]$, endowed with the uniform topology $\mathcal U$. This follows from Theorem 1.1, taken with $\psi(u)=u^2$. We note further that $\psi_X(u)=\frac12 u^2$ is the Chernoff function, (2.8), of a rv $X$ following a standard normal $N(0,1)$ law (see, e.g., §2.3.3).
2°) The Finkelstein set of functions (refer to Finkelstein (1971)), defined by
$$F=\Big\{f\in AC[0,1]:f(0)=f(1)=0\ \text{and}\ \int_0^1\dot f(u)^2\,du\le 1\Big\},\eqno(1.66)$$
is a compact subset of the set $(C[0,1],\mathcal U)$ of continuous functions on $[0,1]$, endowed with the uniform topology. This follows from 1°) and the observation that $F=\{f\in S:f(1)=0\}$ is closed in $(C[0,1],\mathcal U)$.
3°) The set of functions (refer to Deheuvels and Mason (1990, 1992)) defined for $c>0$ by
$$\Gamma_c=\Big\{f\in AC[0,1]:f(0)=0\ \text{and}\ \int_0^1 c\,h\big(c^{-1}\dot f(u)\big)\,du\le 1\Big\},\eqno(1.67)$$
is a compact subset of the set $(C[0,1],\mathcal U)$ of continuous functions on $[0,1]$, endowed with the uniform topology $\mathcal U$. This follows from Theorem 1.1, taken with $\psi(u)=h(u)$, which is nothing else but the Chernoff function, (2.8), of a Poisson rv with expectation equal to 1 (see, e.g., (2.60)).
4°) The set of functions (refer to Deheuvels and Mason (1990, 1992)) defined for $c>0$ by
$$\Delta_c=\Big\{f\in I^\pm[0,1]:f(0)=0\ \text{and}\ \int_0^1 c\,\ell\big(c^{-1}\dot f(u)\big)\,du+f_S(1+)\le 1\Big\},\eqno(1.68)$$
is a compact subset of the set $(I^\pm[0,1],\mathcal W)$ of non-decreasing non-negative functions on $[0,1]$, endowed with the weak topology. This follows from Theorem 1.2, taken with $\psi(u)=\ell(u)$. This last function is the Chernoff function, (2.8), of an exponentially distributed rv with mean 1.
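Membership in the Strassen set (1.65) can be checked numerically by discretizing the energy integral. A short Python sketch of ours (finite differences on a uniform grid; not from the text):

    # Sketch (ours): approximate the Strassen energy int_0^1 fdot(u)^2 du.
    import numpy as np

    def strassen_energy(f, n=100000):
        u = np.linspace(0.0, 1.0, n + 1)
        fdot = np.diff(f(u)) / np.diff(u)   # finite-difference derivative
        return np.mean(fdot ** 2)           # ~ int_0^1 fdot(u)^2 du

    print(strassen_energy(lambda u: u))                    # 1.0: boundary case of S
    print(strassen_energy(lambda u: np.sin(np.pi * u)))    # ~ pi^2/2 > 1: not in S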
The next theorem extends Theorems 1.1 and 1.2 to the case where $t_0$ and $t_1$ are arbitrary.
Theorem 1.3. Let $\psi$, fulfilling $(\psi.1{-}2)$, be such that $-\infty\le t_1\le t_0\le\infty$. Recall (1.53), and set $BV_0[0,1]=\{H_\mu:\mu\in M_f[0,1]\}$. For each $c>0$, introduce the set
$$\Gamma_{\psi,c}=\Big\{H_\mu\in BV_0[0,1]:\int_0^1 c\,\psi\big(c^{-1}\dot H_\mu(u)\big)\,du+t_0 H_{\mu_S^+}(1+)-t_1 H_{\mu_S^-}(1+)\le 1\Big\}.\eqno(1.69)$$
Then $\Gamma_{\psi,c}$ is a compact subset of $(BV_0[0,1],\mathcal W)$.
Proof. See, e.g., Deheuvels [51]. $\square$
In the forthcoming §2.1, we will give some examples of rate functions based upon large deviation theory.
Exercise 1.3.
1°) Let $h(\cdot)$ and $\ell(\cdot)$ be as in (1.60)-(1.61). Check the relations
$$\lambda h(1/\lambda)=\ell(\lambda)\quad\text{and}\quad\lambda\,\ell(1/\lambda)=h(\lambda)\quad\text{for }\lambda>0.$$
2°) Recall the results of Exercise 1.1. Let $\mathcal F$ denote the set of dfs $F(x)$ of rvs $X$ with support in $[0,1]$, having density $f(x)=\frac{d}{dx}F(x)$ continuous and positive on $[0,1]$. Let likewise $\mathcal G$ denote the set of qfs of such random variables, with density-quantile function $g(u)=\frac{d}{du}G(u)$ continuous and positive on $[0,1]$. Show that the mapping $F\in\mathcal F\to G=F^{\leftarrow}\in\mathcal G$ defines a one-to-one mapping of $\mathcal F$ onto $\mathcal G$, continuous with respect to the weak topology $\mathcal W$. Show likewise that the mapping $G\in\mathcal G\to F=G^{\rightarrow}\in\mathcal F$ defines a one-to-one mapping of $\mathcal G$ onto $\mathcal F$, continuous with respect to the weak topology $\mathcal W$. Check that $\{F^{\leftarrow}\}^{\rightarrow}=F$ and $\{G^{\rightarrow}\}^{\leftarrow}=G$, for $F\in\mathcal F$ and $G\in\mathcal G$.
3°) Show that the sets $\mathcal F$ and $\mathcal G$ are homeomorphic (for additional details, see, e.g., [65, 66]).
1.4. The quantile transform
1.4.1. The univariate quantile transform.
Theorem 1.4. For any random variable $X\in\mathbb R$, with distribution function $F$ and quantile function $G$, the following properties hold.
(i) Whenever $F$ is continuous, the random variable $U:=F(X)$ is uniformly distributed on $(0,1)$;
(ii) When $F$ is arbitrary, and $V$ is a uniform $(0,1)$ rv, one has the distributional identity
$$X\stackrel{d}{=}G(V).\eqno(1.70)$$
Proof. The implications $F(X)<F(x)\Rightarrow X<x$ and $X\le x\Rightarrow F(X)\le F(x)$ entail that
$$P(F(X)<F(x))\le P(X<x)=F(x-)\le F(x)=P(X\le x)\le P(F(X)\le F(x)).\eqno(1.71)$$
When $F$ is continuous, $u=F(x)$ reaches all possible values in $(0,1)$ when $x$ varies in $(\alpha_-,\alpha_+)$. Thus, by (1.71), the dfs $H(u):=P(F(X)\le u)$ and $H(u-):=P(F(X)<u)$ of $F(X)$ fulfill, for all $u\in(0,1)$,
$$H(u-)\le u\le H(u).$$
This, in turn, readily implies that $H(u)=u$ for all $u\in C_H$, the set of continuity points of $H$. Since this, in turn, implies that $H(u)=u$ for all $u\in[0,1]$, we obtain readily Assertion (i) of the theorem.
To establish Assertion (ii), we define a random variable $Y\stackrel{d}{=}N(0,1)$, following a standard normal law, and independent of $X$ (in fact, any rv with a continuous and positive density on $\mathbb R$ will do). For each $n\ge 1$, we observe that the rv $X_n:=X+n^{-1}Y\in\mathbb R$ has a density $f_n$, continuous and positive on $\mathbb R$. This, in turn, implies that the restriction of the df $F_n$ of $X_n$ to $\mathbb R$, and the restriction of the qf $G_n$ of $X_n$ to $(0,1)$, are continuous and increasing. By the just-proven Assertion (i), this implies that, for each $n\ge 1$, $U_n:=F_n(X_n)$ is uniformly distributed on $(0,1)$. Moreover, an application of (1.34) in Proposition 1.1 shows that $X_n=G_n(F_n(X_n))=G_n(U_n)$ for each $n\ge 1$. Now, as $n\to\infty$, $X_n=X+n^{-1}Y\to X$, whence $F_n\xrightarrow{\mathcal W}F$ (recall that almost sure [a.s.] convergence implies convergence in probability, and hence, weak convergence of the underlying probability measures). By (1.47), this implies, in turn, that $G_n(u)\to G(u)$ for each $u\in C_G$, the set of continuity points of $G$. We next observe that $X_n\stackrel{d}{=}G_n(U)$, where $U$ is a fixed, uniformly distributed on $(0,1)$, random variable. Since $D_G:=[0,1]-C_G$ is, at most, countable, we have $P(U\in C_G)=1$. By all this, it follows that, as $n\to\infty$, $X_n\stackrel{d}{=}G_n(U)\to G(U)$, and hence, $X_n\xrightarrow{\mathcal W}G(U)$. Since $X_n\to X$ a.s., and hence, $X_n\xrightarrow{\mathcal W}X$, as $n\to\infty$, it follows that $G(U)\stackrel{d}{=}X$, which was to be proved. $\square$
The quantile transform Theorem 1.4 plays an essential role in empirical process theory. By this theorem, it is always possible without loss of generality [w.l.o.g.] to define an arbitrary sequence $X_1,X_2,\ldots$ of independent and identically distributed [iid] random variables [rvs], with df $F(\cdot)$ and qf $G(\cdot)$, on a probability space $(\Omega,\mathcal A,P)$ carrying an iid sequence $U_1,U_2,\ldots$ of uniform $(0,1)$ rvs, such that
$$X_n=G(U_n)\quad\text{for each }n\ge 1.\eqno(1.72)$$
Because of (1.72), it is, most of the time, more convenient to work directly on the sequence $U_1,U_2,\ldots$, rather than on $X_1,X_2,\ldots$, given that any result which can be obtained in this case has a counterpart for $X_1,X_2,\ldots$, which often follows by book-keeping arguments based upon the mapping $u\to G(u)$.
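In simulation, (1.72) is precisely the inversion method for random variate generation. A minimal Python sketch (ours; it assumes NumPy and SciPy) generates an Exponential(1) sample through $X_n=G(U_n)$ and checks the fit with a Kolmogorov-Smirnov test.

    # Sketch (ours): the quantile transform (1.72) used as the inversion method.
    import numpy as np
    from scipy.stats import kstest

    rng = np.random.default_rng(0)
    U = rng.uniform(size=10000)
    X = -np.log(1.0 - U)          # X_n = G(U_n) with G the Exponential(1) qf

    print(kstest(X, "expon"))     # large p-value: sample consistent with Exp(1)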

An inconvenience of Theorem 1.4 is that it is specific to dimension 1, being only valid for real-valued random variables. In the next section, we show that this result can yet be extended to $\mathbb R^d$-valued random vectors, at the price of some additional technicalities.
1.4.2. The multivariate quantile transform. First, we need to recall some classical (but far from obvious) properties of conditional distributions. Recall (see, e.g., p. 209 in [10]) that a set $E$ is a Polish space when there is a complete metric defining its topology, with a countable dense subset in $E$. We will be concerned here with the case of $\mathbb R^d$, which is Polish with respect to the usual topology. Consider now two $E$-valued rvs, $X$ and $Y$, defined on the same probability space $(\Omega,\mathcal A,P)$. Denote by $\mathcal B_E$ the set of all Borel subsets of $E$ (the smallest $\sigma$-algebra generated by open (and closed) subsets of $E$). We assume that $X$ and $Y$ are measurable in the sense that the events $X^{-1}(A)=\{X\in A\}$ and $Y^{-1}(A)=\{Y\in A\}$ belong to $\mathcal A$ for all $A\in\mathcal B_E$. We denote by $\mathcal A_Y$ the $\sigma$-algebra of $\mathcal A$ generated by the sets $Y^{-1}(A)$, for $A\in\mathcal B_E$. In words, $\mathcal A_Y$ is the smallest $\sigma$-algebra of $\mathcal A$ such that $Y^{-1}(A)=\{Y\in A\}=\{\omega\in\Omega:Y(\omega)\in A\}\in\mathcal A_Y$ for all $A\in\mathcal B_E$. Then, it is possible to define a regular conditional probability $P(X\in\cdot\mid Y=y)$, with the following properties.
(i) For each $y\in E$, $A\in\mathcal B_E\to P(X\in A\mid Y=y)$ defines a probability measure on $(E,\mathcal B_E)$;
(ii) For each $A\in\mathcal B_E$, the mapping $y=Y(\omega)\in E\to P(X\in A\mid Y=y)$ is $\mathcal A_Y$-measurable;
(iii) The conditional probability $P(X\in A\mid Y=y)$, of $A\in\mathcal B_E$ given $y\in E$, is defined uniquely, with respect to $y$, up to an a.e. equivalence over a set $B\in\mathcal B_E$ such that $P(Y\in B)=0$;
(iv) For each $A\in\mathcal B_E$ and $B\in\mathcal B_Y$,
$$\int_B P(X\in A\mid Y=y)\,dP(Y=y)=P(X\in A,\ Y\in B).\eqno(1.73)$$
Defining the support, denoted by $\operatorname{supp}(Y)$, of $Y$ as the set of all $y\in E$ such that $P(Y\in V_y)>0$ for all open neighborhoods $V_y$ of $y$, we see that the definition of $P(X\in A\mid Y=y)$ is meaningful only for $y\in\operatorname{supp}(Y)$. When $y\notin\operatorname{supp}(Y)$, the conditional probability $P(X\in\cdot\mid Y=y)$ is, therefore, arbitrary.

Let X = (X1 , . . . , Xd ) Rd be a multivariate random vector. The distribution,


PX = P(X ), of X on BRd is then (see, e.g., p. 17 in [19]), uniquely determined
by its distribution function

F (x) = F (x1 , . . . , xd ) = P(X x)


d (1.74)
= P(X1 x1 , . . . , Xd xd ) for x = (x1 , . . . , xd ) R .
d
We set here x = (x1 , . . . , xd ) y = (y1 , . . . , yd ), for x, y R , when xj yj for
j = 1, . . . , d. Of course, F () in (1.74) fullls

0 F (x) F (y) 1. (1.75)


Topics on Empirical Processes 111

Moreover, setting y x (resp. y x) when yj xj (resp. yj xj ) for j = 1, . . . , d,


we have
d
lim F (y) = F (x) for x R ; (1.76)
yx
F (, . . . , ) = 0 = lim F (x);
x(,...,)
(1.77)
F (, . . . , ) = 1 = lim F (x).
x(,...,)

When d = 1, the relations (1.75)(1.76)(1.77) reduce to (1.2)(1.3)(1.4), and


are sucient to dene the distribution function of a probability measure on R.
On the other hand, this property does not hold for d 2. Introduce the following
notation. For j = 1, . . . , d and u = (u1 , . . . , ud ) Rd , set
j;u F (x) = F (x1 , . . . , xj1 , xj + uj , xj+1 , . . . , xd )
(1.78)
F (x1 , . . . , xj1 , xj , xj+1 , . . . , xd ).
A necessary and sucient for F to dene the df of a multivariate distribution is
that, in addition to (1.75), (1.76) and (1.77), we have the 2d d + 1 inequalities
j1 ;u jk ;u F (x) 0, (1.79)
which must hold for all x Rd , u Rd , with u1 0, . . . , ud 0, and 1 j1 <
< jk d.
Theorem 1.5 below gives a multivariate version of Theorem 1.4, by combining
results of Rosenblatt [169], Einmahl [92] (see his Theorem 6), and Deheuvels and
Mason [68] (see their Lemma 3.1). The following notation will be needed. Let
U = (U1 , . . . , Ud ) be a uniform (0, 1)d random vector. Let F1 (x1 ) = P(X1 x1 )
denote the distribution function of X1 , and let, for j = 2, . . . , d,



Fj xj  x1 , . . . , xj1 = P Xj xj  X1 = x1 , . . . , Xj1 = xj1 ,


denote a regular conditional distribution function of Xj given X1 = x1 , . . . , Xj1 =
xj . Dene a function H(x) [0, 1]d of x = (x1 , . . . , xd ) Rd , by setting

H(x) = F1 (x1 ), F2 (x1 |x2 ), . . . , Fd (xd |x1 , . . . , xd1 ) . (1.80)

For t (0, 1), set H1 (t) = inf{x : F1 (x) t}, and, for t (0, 1), j = 2, . . . , d, and
x Rd , set

 
Hj t  x1 , . . . , xj1 = inf x : Fj (x|x1 , . . . , xj1 ) t .

Also, set for u = (u1 , . . . , ud ) (0, 1)d , G1 (u1 ) = H1 (u1 ), and, for j = 2, . . . , d,


Gj (u1 , . . . , uj ) = Hj uj  G1 (u1 ), . . . , Gj1 (u1 , . . . , uj1 ) .

Dene a function G(u) Rd of u = (u1 , . . . , ud ) (0, 1)d , by setting


     

G(u) = G1 u1 , G2 u1 , u2 , . . . , Gd u1 , . . . , ud . (1.81)
112 P. Deheuvels

Theorem 1.5. Under the above assumptions and notation, we have


d  
X=G U . (1.82)
(i) If, in addition, H() is continuous on R , then, we have
d

d  
U=H X . (1.83)
(ii) If H() is continuously (resp. twice continuously dierentiable) in a neighbor-
hood of x Rd , then G() is continuously (resp. twice continuously dierenti-
able) in the neighborhood of t = H(x), and such that
|G(u) G(t) DG(t)(u t)| = o(|u t|) as |u t| 0. (1.84)
Moreover,
 1
DG(t) = DH(x) , (1.85)
where DG(t) (resp. DH(x)) denote the dierential of G() at t = H(x) (resp.
x).
(iii) If X has a density (), continuous and positive in a neighborhood of x Rd ,
then, the Jacobian of G() at t = H(x) is equal to
 
1/f G(t) . (1.86)
Proof. See, e.g., Deheuvels and Mason [68]. 
The next theorem gives a useful local version of the quantile transformation in
Theorem 1.5.
Theorem 1.6. Assume that X has a density f , continuously (resp. twice continu-
ously) dierentiable in a neighborhood of x Rd , and that f (x) > 0. Then, there
exists a t (0, 1)d , a neighborhood Vx of x, a neighborhood Vt of t, and a one-
to-one mapping G of Vt onto Vx such that x = G(t). Moreover, on a suitable
probability space, we may dene X jointly with a random vector U uniformly dis-
tributed on (0, 1)d , and such that
 
X1I{XVx } = G U 1I{UVt } . (1.87)
The function G () is of the form
 

G u = G1 (u1 ), G2 (u1 , u2 ), . . . , Gd (u1 , . . . , ud ) , (1.88)


where, for j = 1, . . . , j, Gd (u1 , . . . , uj ) is an increasing function of uj . In addition,
G () is continuously (resp. twice continuously) dierentiable on Vt , and such that
        
G u G t f (x) 1/d (u t) = o |u t| as |u t| 0. (1.89)
Proof. See, e.g., Deheuvels and Mason [68]. 
Theorems 1.51.6 allow to identify, without loss of generality, an arbitrary iid
sequence X1 , X2 , . . ., of Rd -valued random vectors, with G(U1 ), G(U1 ), . . ., where
U1 , U2 , . . ., are iid uniformly distributed on (0, 1)d random vectors, for some appro-
priate invertible mapping G of (0, 1)d onto Rd . This is, of course, the best possible
result of the kind since there is no homeomorphism (a continuous mapping with
Topics on Empirical Processes 113

a continuous inverse) of an open subset of Rd onto an open subset of Rq unless


d = q (see, e.g., [114]). On the other hand (see, Remark 1.4 below), if we let G
be non-invertible, the dimension argument breaks down. Unfortunately, Theorems
1.51.6 are dicult to apply for d 2, because of the inherent technicalities due
to the conditioning used for the construction of G in (1.82), making it dicult to
relate the smoothness properties of F and G. Moreover, the proof of these the-
orems relies on the choice of a basis in Rd , which leads to technical diculties.
Because of this, we will limit ourselves in the sequel to d = 1. The extension of the
presently available results from dimension 1 to higher dimensions yields a series of
challenging problems.

Remark 1.4.
1 ) The multivariate quantile theorem was generalized as follows by Skorohod
[188] (see, e.g., Dudley and Phillip [81]). Consider two Polish spaces X1 and
X2 , and let L denote a probability law on X1 X2 , with marginal distribution
L1 on X1 . Then, there exists a measurable map : [0, 1] X1 X1 X2 ,
such that, if V X1 is a rv with distribution L1 , and if U is a uniform (0, 1)
rv independent of V , then (V, (U, V )) X1 X2 has distribution function
L.
2 ) The following argument (see, e.g., Lemma A1 in Berkes and Phillip [14]) is
often useful in the application of Strong invariance principles. Consider three
Polish spaces X1 , X2 and X3 . We consider a probability law L1,2 on X1 X2 ,
and a probability law L2,3 on X2 X3 . We assume that the marginals of L1,2
and L2,3 on X2 coincide. Then, there exists a probability space on which sits
a rv (X1 , X2 , X3 ), where (X1 , X2 ) has law L1,2 , and (X2 , X3 ) has law L2,3 .

Exercise 1.4. Show that, for any probability law L in a Polish space, there exists
a measurable map : [0, 1] X, such that (U ) has probability law L when U is
uniformly distributed on (0, 1).

Exercise 1.5. Let d 2 be integer. A copula, C(u) = C(u1 , . . . , ud ), in Rd is, by


denition, the df of a random vector U = (U1 , . . . , Ud ) with uniformly distributed
on [0, 1] marginals. Given any random vector X = (X1 , . . . , Xd ) with df F (x) =
P(X x) and marginal df s Fj (x) = P(Xj x), for j = 1, . . . , d, a copula C() is
associated with X whenever the identity
   
F x1 , . . . , xd = C F1 (x1 ), . . . , Fd (xd ) , (1.90)
holds for all x = (x1 , . . . , xd ).
1 ) Show that, whenever F1 , . . . , Fd are continuous, the copula C() of F () is
dened uniquely by (1.90).
2 ) Show that the set Cd of copulas in Rd is weakly compact. Adapt the proof of
Theorem 1.4 to show that there exists a copula C() for each df F () [This is
known as Sklars theorem (see, e.g., [187, 42, 48]).
114 P. Deheuvels

2. Fluctuations of partial sums


2.1. Some large deviations theory
Let X denote a random variable with df F (x) = P(X x). The corresponding
Laplace transform, or moment-generating function [mgf], denoted by = X , is
dened, for s R, by

(s) = E(e ) =
sX
esx dF (x) (0, ]. (2.1)

The mgf (s) of X is always nite for s = 0, and such that (0) = 1. To charac-
terize the domain of niteness I := {s R : (s) < } of , set
t1 = inf{t R : (t) < } 0 t0 = sup{t R : (t) < } . (2.2)
The following proposition has a more or the less straightforward analytic proof.
Proposition 2.1. The mgf () of a random variable X is always nite and C
in the neighborhood of each s (t1 , t0 ). Moreover, when s (t0 , t1 ), for each
m = 1, 2, . . ., we have

dm  
(m) (s) = m (s) = E X m esX = xm esx dF (x) R. (2.3)
ds

When (s) < in a right (resp. left) neighborhood of 0, or equivalently, when


t0 > 0 (resp. t1 < 0), then, the niteness of the mth moment, E(X m ), of X, is
equivalent to the existence of a nite limit
 
lim (m) (s) = E(X m ) resp. lim (m) (s) = E(X m ) . (2.4)
s0 s0

Proof. Omitted. 
Remark 2.1. Recall the denition (2.2) of t0 and t1 . When t0 < , the value of
X (t0 ) may be either nite on innite, depending upon the distribution of X. The
same is true for X (t1 ). The domain of niteness of = X , denoted by
I = {s R : (s) < }, (2.5)
is, depending upon the distribution of X, one among the four intervals (t1 , t0 ),
(t1 , t0 ], [t1 , t0 ), [t1 , t0 ].
Proposition 2.2. The mgf () of X is always convex on its domain of niteness.
Moreover, the convexity of is always strict, with the exception of the case where
X = 0 almost surely.
Proof. The result is trivial for t1 = t0 = 0. When t1 < t0 , observe, via (2.3), that,
for each s (t1 , t0 ),  
 (s) = E X 2 esX 0, (2.6)
with equality only possible when X = 0 a.s. 
The mgf of X follows a stronger property than convexity, being log-convex
(meaning that log is convex), as showed in the next proposition.
Topics on Empirical Processes 115

Proposition 2.3. The mgf () is always log-convex on its domain of niteness.


Moreover, the log-convexity of () is always strict with the exception of the case
where the distribution of X is degenerate.
Proof. When X is degenerate (see, e.g., (1.10)) with X = c a.s., we have (s) = ecs ,
d2
and ds 2 log (s) = 0. In the other cases, the property is trivial for t1 = t0 = 0,

and we may limit ourselves to give proof when t1 < t0 . We then make use of the
Schwarz inequality, to observe that, for each s (t1 , t0 ),
   2
d2 E X 2 esX E XesX
log (s) = 2 0, (2.7)
ds2 E (esX )
with equality only possible when X is a.s. constant. 
The Cherno function of X (or Legendre transform of the distribution function of
X), is dened by
() = X () = sup {s log (s)} (2.8)
s:(s)<

Proposition 2.4. The Cherno function () is a (possibly innite) convex func-


tion on R, fullling
0 () , (2.9)
and
() ()
lim = t1 and lim = t0 . (2.10)

Moreover, when m = E(X) exists, we have
(m) = 0. (2.11)
Proof. The inequality (2.9) follows readily from the denition (2.8) of , and the
observation that
0 log (0) = 0 ().
The fact that is convex follows from the fact that the supremum of an arbitrary
family of convex functions is convex. Since, for each s I = {s R : (s) < },
the function s log (s) is linear, and hence, convex, the conclusion is
straightforward.
To establish (2.10)(2.11), consider rst the case where t1 = t0 = 0, so that the
domain of denition of the mgf reduces to I = {0}. In this case () = 0 for
all R (and in particular for = m when m = E(X) R exists), and we see
that (2.10)(2.11) are trivial.
Assume now that t1 < 0 < t0 , in which case the existence of m = E(X) =  (0) is
guaranteed by Proposition 2.1. We observe that the function
L (s) = s log (s), (2.12)
has, via Proposition 2.3, a rst derivative L (s)
= log (s), and a second
d
ds
 d2
derivative L (s) = ds2 log (s) < 0 on I . Now, because of (2.3), when = m,
116 P. Deheuvels

we have Lm (0) = m dsd


log (0) = 0, so that Lm (s) > 0 for s < 0 and Lm (s) < 0
for s > 0. This, in turn, implies that
(m) = sup Lm (s) = Lm (0) = 0,
sI

which is (2.11). The remainder of the part of the proof of Proposition 2.4 relative
to (2.10) is technical and omitted. We refer to Lemma 2.1, p. 266 in Deheuvels
[51] for details. 
In the remainder of this section, we consider a sequence X1 , X2 , . . . of independent
replic of the rv X with mgf and Cherno function as dened above. We
set S0 = 0 and Sn = X1 + + Xn for n 1. The following hypotheses will be
assumed, at times (recall (1.10)).
(C.1) t0 = sup{s : (s) < } > 0;
(C.2) t1 = inf {s : (s) < } < 0;
(C.3) The distribution of X is nondegenerate, or, equivalently,
supxR P(X = x) < 1;
(C.4) m = E(X) R exists.
Under (C.4) only, Chernos theorem (see, e.g., [9, 30]) may be stated as follows.
Theorem 2.1. Under (C.4), we have
P(Sn n) exp(n()) for each m; (2.13)
P(Sn n) exp(n()) for each m. (2.14)
Moreover,
1
lim log P(Sn n) = () for each m; (2.15)
n
n
1
lim log P(Sn n) = () for each m. (2.16)
n n

Proof. The proof of (2.13)(2.14) relies on the simple, but nevertheless powerful,
Markov inequality. Below, we establish this result in a slightly more general form.
Consider a random variable Y for which Y (t) = E(exp(tY )) < for some t 0.
The inequality ety etu for y u entails that

Y (t) = E(exp(tY )) = etu dP(Y u)

 
e dP(Y u) e
tu ty
dP(Y u) = ety P(Y y),
[y,) [y,)

or, equivalently,
 

P(Y y) exp sup ty log Y (t) . (2.17)


t0:(t)<

A similar argument, which we omit shows that


 

P(Y y) exp sup ty log Y (t) . (2.18)


t0:(t)<
Topics on Empirical Processes 117

Observe that the inequalities (2.17)(2.18) are always true for all y R, even when
the only value of t for which Y (t) < is t = 0. Now, consider a random variable
Z with nite expectation mZ := E(Z), and set Y = Z mZ , so that E(Y ) = 0.
In this case, we see that (2.17)(2.18) readily imply that
 

P(Y y) exp sup ty log Y (t) for y 0, (2.19)


t:Y (t)<
 

P(Y y) exp sup ty log Y (t) for y 0. (2.20)


t:Y (t)<

A crucial step in the proof relies on Proposition 2.1, which implies that, whenever
E(Y ) = 0, we have
   
y>0 sup ty log Y (t) = sup ty log Y (t) , (2.21)
t0:Y (t)< t:Y (t)<
   
y<0 sup ty log Y (t) = sup ty log Y (t) . (2.22)
t0:Y (t)< t:Y (t)<

By setting Y = Z mZ and Z (t) = E(exp(tZ)) = etmZ Y (t) in the above


formul, we get
Y (t) = E(exp(tY )) = E(exp(t{Z mZ })) = Z (t)etmZ .
This, when combined with (2.19)(2.20) and (2.21)(2.22), implies readily that
 

P(Z z) exp sup tz log Z (t) for z mZ , (2.23)


t:Z (t)<
 

P(Z z) exp sup tz log Z (t) for z mZ . (2.24)


t:Z (t)<

By setting Z = Sn in (2.23)(2.24), we obtain (2.13)(2.14). The remainder of the


proof is omitted. 
The next theorem relies on the fact that, when m = E(X) = 0, {Sn : n 0} is a
martingale (see, e.g., 2.2 below for details).
Theorem 2.2. Assume (C.4), with m = E(X) = 0. Then, under (C.1), for any
0, we have
 
P max Sk n exp(n()), (2.25)
0kn

whereas, under (C.2) for each 0, we have


 
P min Sk n exp(n()). (2.26)
0kn

Proof. As follows from (2.31) we have, under the assumptions of the theorem,
   
P max Sk n exp n sup {s log (s)} . (2.27)
0kn s0:sI
118 P. Deheuvels

Whenever t0 > 0, we observe from the proof of Proposition 2.4 that, for each
> 0, the function L (s) = s log (s) has derivative L (s) = ds
d
log (s)

decreasing on (0, s0 ) with a right derivative at 0 given by L (0+) = 0.
This implies in turn that
sup {s log (s)} = sup {s log (s)} . (2.28)
s0:sI sI

We conclude (2.25) by combining (2.27) with (2.28). The proof of (2.26) is similar
and omitted. 

2.2. Martingale inequalities


The theory of martingales provides some very useful inequalities. We present be-
low a minimal selection of some essential facts which are needed in the present
framework, and refer to [80] and [106] for details.
Let I = [0, T ] be an interval of either R or Z. We consider a random process
{Zt : t I} dened on a probability space (, A, P). We suppose given a family
of -algebras of events {At : t I} such that, for each s, t I with s t,
As At A. The family {Zt , At : t I} denes a martingale, whenever the
following properties hold.
(M.1) Zt is At measurable for each t I;
(M.2) E(Zt ) exists for each t I;
(M.3) E(Zt |As ) = Zs a.s. for each s, t I with s t.
The family {Zt , At : t I} denes a sub-martingale, whenever the following prop-
erties hold.
(M.1) Zt is At measurable for each t I;
(M.2) E(Zt ) exists for each t I;
(M.3) E(Zt |As ) Zs a.s. for each s, t I with s t.
In many cases of interest, in the above denition of martingales (resp. sub-martin-
gales), the -algebras reduce to At = {Zs : s I, s t} for t I. When such
is the case, we will abbreviate the denition, by saying that {Zt : t I} denes
a martingale (resp. sub-martingale). Otherwise, {Zt : t I} will be said to be a
martingale (resp. submartingale) adapted to {At : t I}.
We have the following useful properties of martingales and sub-martingales.
Proposition 2.5.
(i) Whenever {Zt , At : t I} is a martingale, and is a convex function such
that E((Zt )) exists for all t I, {(Zt ), At : t I} is a sub-martingale.
(ii) Whenever {Zt , At : t I} is a sub-martingale, and is a convex nondecreas-
ing function such that E((Zt )) exists for all t I, {(Zt ), At : t I} is a
sub-martingale.
Proof. Omitted. 
Topics on Empirical Processes 119

Proposition 2.6. (Doobs inequality) Whenever {Zt , At : t I} is a sub-martingale,


for each u > 0,  
1
P sup Zt u E (|ZT |) . (2.29)
tI u
Proof. Omitted. 
The following examples of interest provide some useful applications of Propositions
2.5 and 2.6.
Example 2.1.
1 ) Let {Zt , At : t I} be a martingale. Since (t) = |t|r is convex for each
r 1, we have, for each u > 0,
 
1
P sup |Zt | u r E (|ZT |r ) . (2.30)
tI u
To establish (2.30), we infer from Proposition 2.5 that {|Zt |r , At : t I} is a
sub-martingale for each r 1, then, we make use of Proposition 2.6 to show
that
   
1
P sup |Zt | u = P sup |Zt |r ur r E (|ZT |r ) .
tI tI u
2 ) Let {Zt , At : t I} be a martingale, and let s 0 be such that E(exp(sZt )) <
for all t I. Since s (s) = es is convex and nondecreasing, we have,
for each R,
   

P sup Zt exp(s) E (exp(sZT )) = exp s log ZT (s) . (2.31)


tI

To establish (2.31), we infer from Proposition 2.5 that {exp(sZt ), At : t I}


is a sub-martingale. Then, we make use of Proposition 2.6 with u = es to
show that
   
P sup Xt = P sup{exp(sXt )} exp(s) exp(s) E (exp(sZT )) .
tI tI
Note here that the assumption s 0 is necessary for the validity of the above
equality.
2.2.1. Functional large deviations and limit theorems for the Wiener process.
We start by Schilders theorem (see, e.g., [174]), for the statement of which the
following notation and denitions will be needed. Let {W (t) : t 0} denote a
standard Wiener process. The reproducing kernel Hilbert space [RKHS] (see, e.g.,
4.4) associated to the restriction, {W (t) : 0 t 1}, of W () to [0, 1], is then
dened by (2.33) below. The corresponding Hilbert norm, is dened, for each
f B[0, 1] (the set of all bounded functions on [0, 1]), by
 1/2
1 2
f (t) dt if f AC[0, 1] and f (0) = 0,
|f |H = 0 (2.32)
otherwise.
120 P. Deheuvels

Here, as usual, AC[0, 1] denotes the set of all absolutely continuous functions f
on [0, 1], with Lebesgue derivative f = dx
d
f . The RKHS of {W (t) : 0 t 1} is,
accordingly  
H = f AC[0, 1] : |f |H < . (2.33)
Its unit ball, often referred to as the Strassen set (see, e.g., [199]), is denoted by S
or K, and dened by  
S = f AC[0, 1] : |f |H 1 . (2.34)
The sup-norm of a function f C[0, 1] (the vector space of continuous functions
on [0, 1]), is denoted by

f
= sup |f (t)|. (2.35)
0t1
For each f C[0, 1] and > 0, we set
 
N (f ) = g C[0, 1] :
g f
< . (2.36)
For any A C[0, 1] and > 0, we set
(  
A = N (f ) = g C[0, 1] : f A,
g f
< . (2.37)
f A

Note here that, by convention, we set () = , so that = . Introduce a


functional J() on the subsets of the set B[0, 1] of bounded functions on [0, 1] by
setting, for each A B[0, 1],

inf f A |f |2H if A = ,
J(A) = (2.38)
if A = .
For each > 0, we denote by W{} the process dened by W{} (s) = (2)1/2 W (s)
for s [0, 1]. Recall that (C[0, 1], U) denotes the set C[0, 1] of continuous functions
on [0, 1], endowed with the uniform topology U, induced by

, as in (2.35).
Theorem 2.3. (Schilders theorem). For each closed subset F of (C[0, 1], U), we
have
1
lim sup log P(W{} F ) J(F ), (2.39)

and, for each open subset G of (C[0, 1], U), we have
1
lim inf log P(W{} G) J(G). (2.40)

Proof. See, e.g., Schilder [174], and Deuschel and Stroock [73], p. 12. 
The next lemma turns out to be useful to derive functional laws of the iterated
logarithm.
Lemma 2.1. For each (0, 1) and f S such that 0 < < |f |H 1, we have
     2  2
J C[0, 1] S (1 + )2 , and J N (f ) |f |H |f |2H 1 .(2.41)
Topics on Empirical Processes 121

Proof. See, e.g., Lemma 2.5 in Deheuvels [54]. 


In 1964, Strassen [199] provided the rst example of functional law of the iterated
logarithm [FLIL]. Below, we state the version of his theorem for the Wiener process.
Theorem 2.4. (Strassens FLIL) Let {W (t) : t 0} denote a standard Wiener
process. Set T (u) = W (T u)/ 2T log2 T for 0 u 1 and T 0, where log2 u =
log+ log+ u and log+ u = log(u e). Then, the following two results hold with
probability 1.
   
lim inf
T f
= 0 and sup lim inf
T f
= 0. (2.42)
T f S f S T

Proof. A proof of Theorem 2.4 based upon Schilders Theorem 2.3 is given in
Exercises 2.2 and 2.3 below. 
A typical application of Strassens Theorem 2.4 is stated in the following corollary
of this theorem.
Corollary 2.1. Let : f C0 [0, 1] R denote a continuous functional, with
respect to the uniform topology U. Then, we have, with probability 1,
lim sup (T ) = sup (f ). (2.43)
T f S

Proof. Since S is a compact subset of (C[0, 1], U) (see, e.g., Example 1.3), by con-
tinuity of , there exists a g S such that L0 := (g) = supf S (f ). As follows
from (2.42), there exists a.s. a sequence Tn such that
Tn g
0, whence
L1 := lim sup (T ) lim (Tn ) = (g) = L0 .
T n

On the other hand, if {n : n 1} is a sequence such that n and (n )


L1 , there exists a.s. a subsequence {n : n 1} {n : n 1}, together with an
h S such that
n h
0, whence,
L1 = lim (n ) = (h) sup (f ) = L0 .
n f S

We have therefore L0 = L1 , as sought. 


Remark 2.2. The sup-norm

topology U, used in the statement of Theorem 2.4
and Corollary 2.1, can be replaced by the topology generated by a general norm | |0
on C[0, 1], such that | |0 is measurable with respect to (C[0, 1], U), as long as the
(necessary and sucient) condition
 

P sup |W (1 ) |0 < > 0, (2.44)


0

holds for some > 0 (see, e.g., [61, 62]).


The next result states a functional limit law for the increments of Wiener
processes due to Revesz [167] (see also [2] and [61]). Consider a function {aT :
122 P. Deheuvels

T > 0} such that 0 < aT < T for T > 0. For each T > 0 and t 0, consider the
function of s [0, 1] dened by

T,t (s) = {W (t + aT s) W (t)}/ 2aT {log(T /aT ) + log2 T }, (2.45)
where log2 T = log+ log+ T , and log+ u = log(u e).
Theorem 2.5. Let aT and T 1 aT . Then, with probability 1,


lim sup inf
T,t f

T 0tT aT f S

 (2.46)
= sup lim inf inf
T,t f
= 0.
f S T 0tT aT

If, in addition to the above assumptions, ((log(T /aT ))/ log2 T as T ,


then, with probability 1,


lim sup inf
T,t f
= 0. (2.47)
T f S 0tT aT

Proof. Following the lines of Exercises 2.2 and 2.3 in the sequel, it is not too dicult
to infer the above results from Schilders Theorem 2.3. A more rened proof of
Theorem 2.5, valid for possibly non-uniform norms, and based on the isoperimetric
inequality (see, e.g., [20]) is to be found in [61]. 
Exercise 2.1.
1 ) Show that the conclusion of Theorem 2.4 may be reformulated into the fol-
lowing two complementary statements.
(i) For each > 0, there exists a.s. a T < such that T S for all
T ;
(ii) For each f S and > 0, there exists a.s. a sequence Tn , such
that Tn N (f ) for each n.
2 ) Assume, that aT , T 1 aT and ((log(T /aT ))/ log2 T as T . Set
FT = {T,t : 0 t T }. Show that, under these assumptions and notation,
the conclusion of Theorem 2.5 may be reformulated as follows. For each > 0,
there exists a.s. a T < such that, for all T T , FT S and S FT .
Exercise 2.2.
1 ) Fix any > 0, and set Tn = (1 + )n for n 0. Let {W (t) : t 0} denote a
standard Wiener process, and set T (u) = W (T u)/ 2T log2 T for 0 u 1
and T 0. Here and elsewhere, log2 u = log+ log+ u and log+ u = log(u e).
Let S be as in (2.34). Making use of (2.41) and (2.39), show that, for each
> 0,

 
P Tn  S < .
n=0
2 ) Infer from the results of (1 ) that, with probability 1, for each 0 1,
 
lim sup
Tn
1 and lim sup sup |Tn (u) Tn (v)| .
n n 0u,v1, |uv|
Topics on Empirical Processes 123

3 ) Recall the notation (2.35). Show that, for all n suciently large, uniformly
over Tn1 T Tn ,

2Tn log2 Tn 2Tn log2 Tn
1 1 + .
2T log2 T 2Tn1 log2 Tn1
4 ) Establish the inequality, for all large n,
sup
T Tn
An + Bn ,
Tn1 T Tn

where
 
An = (1 + ) sup sup |Tn (u) Tn (u)| and Bn =
Tn
.
1
1+ 1 0u1

5 ) Show that, for each > 0, with probability 1,


 
lim sup sup
T Tn
2.
n Tn1 T Tn

Show that, for each > 0, there exists a.s. a T < , such that, for all
T T , T S .
6 ) Show that, a.s. for each 0 1,
 
lim sup
T
1 and lim sup sup |T (u) T (v)| .
T T 0u,v1, |uv|
(2.48)

Exercise 2.3. Recall the notation (2.32) and (2.41). Fix any > 0, and set Tn =
(1 + )n for n 0. Let T () be as in Exercise 2.2, and set, for n 1,

Tn (u) = {W (Tn1 + (Tn Tn1 )u) W (Tn1 )}/ 2Tn log2 Tn .
1 ) Making use of (2.48), show that the processes {Tn (u) : 0 u 1} are
independent, and such that
2
lim sup
Tn Tn
a.s.
n 1+
2 ) Let f and (0, 1) be such that |f |H < 1 . Show that there exists a
such that, for all ,



P Tn N (f ) = .
n=1

3 ) Making use of the fact that S is compact in (C[0, 1], U), show that, for each
f S,
lim inf
T f
= 0 a.s.
T
124 P. Deheuvels

Exercise 2.4.
1 ) Check that, whenever W () is a standard Wiener process, then the process
dened by W (0) = 0 and W (t) = tW (1/t) for t > 0 is, again, a Wiener
process.
2 ) Making use of Corollary 2.1, show that Strassens Theorem 2.4 implies the
law of the iterated logarithm [LIL] for the Wiener process (due to Levy [137])

lim sup W (1/T )/ 2T log2 (1/T ) = lim sup W (T )/ 2T log2 T = 1 a.s.
T 0 T
(2.49)
3 ) Let {(t) : t > 0} be a measurable positive function such that

(t)/ 2t log2 (1/t) as t 0.
Dene a norm on C[0, 1] by setting |f |0 = sup0<t1 |f (t)|/(t). Show that
the conclusion of Theorem 2.4 holds when

is replaced by | |0 [see, e.g.,
[61, 62]].

2.2.2. Functional large deviations for processes with independent increments.


Schilders Theorem 2.3 turns out to be the particular case of a more general theory
(see, e.g., [73] and the references therein). The following extension of this theorem
will be useful in the present setup. Consider a stochastic process {X(t) : t
0} with stationary and independent increments. For each > 0, assume that
1 X(t), for t [0, M ] takes values in the cadlag space (D[0, M ], S) of right-
continuous functions with left-hand limits, endowed with the Skorohod topology
S (see, e.g., Ch. 3 in [19]). Set (t) = E(exp(sX(1))), and, for each R, dene
the associated Cherno function (refer to (2.8)), via
 
() = sup s log (s) . (2.50)
s:(s)<

Assume further that (s) < for all s R, or, equivalently (see, e.g., (2.10)),
that
()
(C) lim = .
|| ||

Remark 2.3.
1 ) (D[0, M ], S) is a Polish space, and D[0, M ] C[0, M ]. The uniform topology
U is stronger on D[0, M ] than the Skorohod topology S. On the other hand,
the restriction of the Skorohod topology S to C[0, M ] coincides there with
the uniform topology (see, e.g., p. 112 in [19]). As a consequence of this,
a compact subset of (C[0, M ], U) is also a compact subset of (D[0, M ], S).
Moreover, for any f C[0, M ], a uniform neighborhood of f is a Skorohod
neighborhood of f . For characterizations of compactness in D[0, M ], with
respect to S, we refer to Ch. 3 in [19].
Topics on Empirical Processes 125

2 ) In view of (1.59) and (2.10), the assumption (C) is equivalent to the condition
that
t0 = sup{t : (t) < } = and t1 = inf{t : (t) < } = . (2.51)
As follows from Theorem 1.1, and (1.67) in Example 1.3, the set dened, for
each c > 0, by
  1 
 
,c = f AC[0, M ] : f (0) = 0 and c c1 f(u) du 1 , (2.52)
0
is, under (C), a compact subset of (C[0, 1], U).
For any function f of D[0, M ], dene a functional JX,M (f ) by setting
 M  
f(u) du, if f AC[0, M ] and f (0) = 0,
JX,M (f ) = 0 (2.53)
otherwise.
Dene further, for each A D[0, M ]

inf f A JX,M (f ) if A = ,
JX,M (A) = (2.54)
if A = .

Theorem 2.6. Under (C), for each closed subset F of (D[0, M ], S), we have
1 X()

lim sup log P F JX,M (F ), (2.55)



and, for each open subset G of (D[0, M ], S), we have
1 X()

lim inf log P G JX,M (G). (2.56)



Proof. See, e.g., Varadhan [206] and Lynch and Sethuraman [144]. 
Exercise 2.5.
1 ) Let {(t) : t 0} denote a standard Poisson process (see, e.g., 2.3.1).
Making use of (2.60), show that this process has stationary and independent
increments, with Cherno function (refer to (2.50)), given by () = h(),
and fullling (C).

2 ) Prove that the set of functions dened by
  M 
(M ) = f C[0, M ] : f (0) = 0, h(f(u))du 1. ,
0
is a compact subset of (C[0, M ], U). Recalling (2.37), show that, for each > 0,
there exists an > 0 such that, for all T suciently large,
(log T ) 
1
P
/ (T ) 1+ . (2.57)
log T T
[Consult Lemma 2.8 in [65] for an example of application of (2.57).]
126 P. Deheuvels

2.3. Some useful examples of large deviation inequalities


2.3.1. Inequalities for the Poisson process. By a standard Poisson process (see,
e.g., 3.4) {(t) : t 0} is meant a homogeneous right-continuous Poisson process
on R+ with E((t)) = t for t 0. Set X = (1) to denote a Poisson random
variable with expectation = 1. We have
e1
P(X = k) = for k = 0, 1, 2, . . . , (2.58)
k!
so that  
X (t) = E(etX ) = exp et 1 for t R. (2.59)
The corresponding Cherno function is denoted by h() := X (). Some calcula-
tions show that

log + 1 when > 0,
h() = 1 when = 0, (2.60)


when < 0.
A direct application of (2.23)(2.24), in combination with (2.60), shows that, for
an arbitrary T 0,
P((T ) T ) exp(T h()) for 1, (2.61)
P((T ) T ) exp(T h()) for 1. (2.62)
Since {(t) t : t 0} is a martingale, we infer readily from (2.31) that, for an
arbitrary T 0,
 
P sup {(t) t} T exp(T h(1 + )) for 0, (2.63)
0tT
 
P inf {(t) t} T exp(T h(1 )) for 0. (2.64)
0tT

The following two propositions are simple consequences of (2.63)(2.64).


Proposition 2.7. For each T 0 and each 0, we have
 
P sup |(t) t| T 2 exp(T h(1 + )). (2.65)
0tT

Proof. Observe that, for 0 1,



 k
h(1 + ) = (1 + ) log(1 + ) = (1)k . (2.66)
k(k 1)
k=2

k
h(1 ) = (1 ) log(1 ) + = . (2.67)
k(k 1)
k=2

By combining (2.66)(2.67) with the observation that h(1 ) = for > 1, we


obtain the inequality
h(1 + ) h(1 ) for all 0. (2.68)
Topics on Empirical Processes 127

By combining (2.68) with (2.63)(2.64), we see that, for all 0,


 
P sup |(t) t| T exp(T h(1 + )) + exp(T h(1 ))
0tT
2 exp(T h(1 + )),
which is (2.65). 
Proposition 2.8. For each T 0 and each 0 1, we have
   
T 2  
P sup |(t) t| T 2 exp 1 . (2.69)
0tT 2 3
Proof. By (2.67), we see that, for each 0 1,

 k 2 3
h(1 + ) = (1)k , (2.70)
k(k 1) 2 6
k=2

which, in view of (2.65), suces for (2.69). 

2.3.2. Inequalities for binomial distributions. Let X follow a Bernoulli Be(p) dis-
tribution, namely,
P(X = 1) = 1 P(X = 0) = p (0, 1). (2.71)
The moment generating function of X is then given by
 
(t) = E(exp(tX)) = 1 + p et 1) for t R. (2.72)
Proposition 2.9. The Cherno function of the Bernoulli Be(p) distribution is
given by
 

log 1
if = 0,

 1p
  

log + (1 ) log 1 if 0 < < 1,
() = p  1p
(2.73)

1

log if = 1,


p
if  [0, 1].
Proof. Straightforward. 
The following theorem turns out to be quite useful. Recall the denition (2.60) of
h().
Theorem 2.7. Let Sn follow a binomial B(n, p) distribution. Then

P (Sn n) exp np h for p, (2.74)


p
whereas

P (Sn n) exp np h for p. (2.75)


p
128 P. Deheuvels

Proof. Let X = X1 , . . . , Xn be iid Bernoulli Be(p) distributed rvs. Then, Sn =


X1 + + Xn follows a binomial B(n, p) distribution, and, via an application of
(2.73) and (2.13)(2.14), we obtain that
P(Sn n) exp(n()) for p,
P(Sn n) exp(n()) for p,
where = X is as in (2.73). Therefore, the proof of (2.74)(2.75) boils down to
show that, for any 0 < < 1, we have the inequality

  1 
ph = log + p () = log + (1 ) log . (2.76)
p p p 1p
Now, (2.76) is a straightforward consequence of the inequality
  
1 1  1
1 
h = log + 1 0.
1p 1p 1p 1p
This completes the proof of the theorem. 
d
Corollary 2.2. Let n (tp) = n1/2 (Sn np), where, for 0 < p < 1, Sn follow a

binomial B(n, p) distribution. Then, for each 0 u np,

P |n (p)| u p = P |Sn np| u np


   2 
u
u u  (2.77)
2 exp np h 1 + 2 exp 1 .
np 2 3 np
Proof. Combine (2.74)(2.75) with (2.70). 

2.3.3. Inequalities for Wiener processes. Let {W (t) : t 0} denote a standard


Wiener process. Obviously, {W (t) : t [0, 1]} denes a martingale. Moreover, if
we set X = W (1), then (u) = E(exp(uX)) = exp( 12 u2 ), and the corresponding
Cherno function (refer to (2.9)) is then equal to () = X () = exp( 12 2 ) for
all R.
Proposition 2.10. For each T > 0, we have

P sup W (t) u T exp 12 u2 , (2.78)


0tT

P sup |W (t)| u T 2 exp 12 u2 (2.79)


0tT

Proof. It is well known (see, for example, Csorgo and Revesz (1981)), that


2 2
P sup W (t) u T = 2(1 (u)) = et /2 dt. (2.80)
0tT 2 u
Instead of using (2.80) for proving (2.78), we apply (2.31) to obtain directly that

u

P sup W (t) u T exp T = exp 12 u2 .


0tT T
Topics on Empirical Processes 129

We so obtain (2.78) in the + case. The proof of (2.78) in the case is identical,
and omitted. Finally, (2.79) is a trivial consequence of (2.78). 
2.4. Strong approximations of partial sums of i.i.d. random variables
We limit ourselves here to a few results which are relevant in our discussion, and
refer to the monograph [39] of Csorgo and Revesz for additional details and refer-
ences. In particular, we will concentrate on the case where the random variables
have exponential moments. Set S0 = 0 and Sn = X1 + + Xn for n 1, where
X1 , X2 , . . . are iid random replic of a rv X. We assume below that the mgf
(t) = E(exp(tX)) of X fullls the condition (C) = (C.12), where.
(C.1) t0 = sup{s : (s) < } > 0; and
(C.2) t1 = inf {s : (s) < } < 0.
The assumption (C) = (C.12) implies the existence of E(X k ) for each k N. We
will set, in particular
 
m = E(X) and 2 = Var(x) = E (X m)2 .
We set, further, S(t) = St for t 0, where #t$ t < #t$ + 1 denotes the integer
part of t. The following Theorem 2.8(1 ) states a variant of the celebrated Komlos,
Major and Tusnady strong invariance principle ([129, 130]), with a renement due
to Major [146] for (2 ). Part (3 ) of the theorem is due to Strassen [199] (see also
[147, 148]). We set log+ u = log(u e), and log2 u = log+ (log+ u), for u R.
Theorem 2.8.
1 ) Under (C), it is possible to dene {S(t) : t 0} jointly with a standard
Wiener process {W (t) : t 0}, in such a way that, for universal constants
(depending upon the distribution of X) a1 > 0, a2 > 0 and a3 > 0, we have,
for all T 0 and x R,

 
P sup |S(t) mt W (t)| a1 log+ T + x a2 exp a3 x . (2.81)
0tT

2 ) If we only assume that E(|X|r ) < for some r > 2, then we may write, as
T ,
|S(T ) mT W (T )| = o(T 1/r ). (2.82)

3 ) Finally, under the assumption that E(|X| ) < only, we have, as T ,
2

|S(T ) mT W (T )| = o(T 1/2 (log2 T )1/2 ). (2.83)


Proof. The original papers of Komlos, Major and Tusnady [129, 130] were short on
some important technical arguments, useful for the understanding of the proofs.
The details were provided (in a multivariate framework) in the monograph of
Einmahl [90] (see also [90, 92]). A routine argument (see, e.g., Exercises 2.6 and
2.7) allows to write sup0tT () in (2.81), and let T vary in R in (2.82)(2.83),
instead of restricting T to the integers 1, 2, . . . . The conclusion of the theorem
holds when S(t) is taken as a process with independent increments (such as a
Poisson process). The necessary details are given in Lemma 2.1, p. 84 in [69]. 
130 P. Deheuvels

By combining (2.81) with the Borel-Cantelli lemma (see the Exercises 2.6 and 2.7
below), we obtain readily that, under (C),
|S(T ) mT W (T )| = O(log T ) a.s. as T . (2.84)
Let {aT : T > 0} be such that 0 < aT T for T > 0, aT and T 1 aT . Set, for
each T > 0 and t 0,
S(t + saT ) S(t) msaT
t,T (s) = for 0 s 1. (2.85)
2aT {log(T /aT ) + log2 T }
Recall the denition (1.65) of the Strassen set S.

Theorem 2.9. Under (E.12), whenever aT / log T , we have, with probabil-


ity 1,


lim sup inf
T,t f

T 0tT aT f S

 (2.86)
= sup lim inf inf
T,t f
= 0.
f S T 0tT aT

If, in addition to the above assumptions, ((log(T /aT ))/ log2 T as T ,


then, with probability 1,


lim sup inf
T,t f
= 0. (2.87)
T f S 0tT aT

Proof. Combine Theorem 2.5 with (2.84) and Theorem 2.8. 


A direct consequence of Theorem 2.9 is that, whenever aT / log T , we have,
a.s. as T ,
   
sup a1 T S(t + aT ) S(t) m
0tT aT
"
(2.88)
=O a1
T {log(T /aT ) + log2 T } = o(1).

The limit law (2.88) breaks down when aT / log T = O(1). This range is covered
by the Erdos-Renyi new law of large numbers (see, e.g., [97]) discussed in 2.5
below. For further renements, we refer to Deheuvels and Steinebach [71]. When
combined, the above results yield readily Strassens proof ([199]) of the classical
law of the iterated logarithm, due originally to Hartman and Wintner [112]

Corollary 2.3. (The Hartman-Wintner-Strassen LIL) Under the assumption that


Var(X) = 2 < , the sequence (Sn nm)/ 2n log2 n is almost surely relatively
compact in R (endowed with the usual topology), with limit set equal to the interval
[, ]. We have
S nm Sn nm
n  [, ] and lim sup = a.s. (2.89)
2n log2 n n 2n log2 n
Topics on Empirical Processes 131

Proof. Combine (2.49) with (2.83). 


We will use the convenient notation
xn  L, (2.90)
to express that the sequence {xn } is relatively compact (in an appropriate topo-
logical space) with limit set equal to L.
Remark 2.4.
1 ) Some basic knowledge of topology is needed to understand the notion of limit
set for a relatively compact sequence. The following facts should be kept
in mind. A topological space is Hausdor if two distinct points have nonin-
tersecting neighborhoods (see, e.g., p. 137 in [84]). A metric space is always
Hausdor. A Hausdor space is compact, i it is paracompact, meaning that
each open cover has a nite sub-cover (see, e.g., pp. 162 & 222 in [84]). A
topological space is sequentially compact (resp. countably compact), when-
ever each sequence contains an innite convergent subsequence (resp. each
countable open cover has a nite sub-cover). Sequential compactness (as well
as compactness) implies countable compactness, but, in general, the converse
is not true (refer to p. 20 in [194]). However, the notions of compactness,
sequential compactness and countable compactness, are equivalent on a sep-
arable metric space (see, e.g., p. 209 in [10]). Thus, a separable metric space
is compact i it is sequentially compact and complete.

2 ) One should distinguish the conditions of paracompactness and precompact-
ness on a metric space. A metric space is said to be precompact or totally
bounded, if, for each > 0, there exists a nite covering with balls of radius
less than . A metrizable space is compact, i there exists a metric for which
it is both complete and precompact.

3 ) A part of a topological space is said to be relatively compact whenever it is
included in a compact set. By denition, the limit set of a relatively com-
pact sequence in a metric space is the set of all limits of innite convergent
subsequences.
Exercise 2.6. Let {W (t) : t 0} denote a (standard) Wiener process. Making use
of (2.79), show that, as n ,

sup |W (t + n) W (t)| = O log n a.s. (2.91)


0t1

Exercise 2.7.
1 ) Let X1 , X2 , . . ., be iid replic of a random variable X fullling (C). Show that
there exists constants c1 > 0 and c2 > 0, such that P(|X| t) c1 exp(c2 t)
for all t 0.

2 ) Show that, as n ,
|Xn | = O (log n) a.s. (2.92)
132 P. Deheuvels

Exercise 2.8. Let {xn : n 1} be a relatively compact sequence in a metric space.


Show that the limit set of {xn : n 1} (that is, the set of all limits of convergent
subsequences) is compact.
2.5. The Erdos-Renyi theorem
2.5.1. The Classical Erdos-Renyi theorem. Set S0 = 0 and Sn = X1 + + Xn for
n 1, where X1 , X2 , . . . are iid random observations. The classical Erdos-Renyi
[ER] theorem (see, e.g., [97]), stated in Theorem 2.10 below, plays an important
role in the general theory of strong invariance principles for partial sums. In par-
ticular, it allows to give a lower bound to the best rates of strong approximations
of partial sums by Gaussian processes (see Exercise 2.10 below and, e.g., 2.4, pp.
97101 in [40]). The ER theorem describes the uctuations of order #c log n$ of the
partial sum process {Si : 0 i n}, in a sense made precise below. The following
assumptions will be assumed, at times on the mgf (t) = E(exp(tX)) of X = X1 .
(C.1) t0 = sup{s : (s) < } > 0;
(C.2) t1 = inf {s : (s) < } < 0;
(C.3) The distribution of X is nondegenerate, or, equivalently,
supx P(X = x) < 1;
(C.4) m = E(X) R exists.
denote by (C) the joint assumption (C.12), and set
 
() = sup t log (t) ,
t:(t)<

for the Cherno function pertaining to . For each c (0, ] (with the convention
1/ = 0), we set

c = inf { : () 1/c} and c = sup{ : () 1/c}. (2.93)
+
Remark 2.5. Let X be degenerate with P(X = m) = 1. Then, (t) = etm for t R,
and 
0 for = m,
() =
for = m.
In this case, (C.124) hold, and, for each c (0, ],
c = m. As follows from
the next proposition, the assumption that X is nondegenerate turns out to imply,
under (C.124), that c = m for c (0, ).

Proposition 2.11. Under (C.134) (resp. (C.234)), we have


= m and, for
each c (0, ),
c <
m < + (resp. <
c < m). (2.94)
Proof. Under (C.134), Proposition 2.4 shows that () is a convex nonnegative
function, fullling (m) = 0 and ()/ t0 > 0 as . This, in turn,
readily implies the rst half of (2.94). The proof of the second half, under (C.23
4), is similar and omitted. 
Topics on Empirical Processes 133

For each integer k such that 0 k n, consider the maximal and minimal
increments
   
In+ (k) := max Si+k Si and In (k) := min Si+k Si . (2.95)
0ink 0ink

Fix a c > 0, and set kn = #c log n$ for n 1, where #u$ u < #u$ + 1 denotes the
integer part of u.
Theorem 2.10. (Erdos and Renyi) We have, almost surely, under (C.14),
lim k 1 I + (kn ) = +
c , (2.96)
n n n
and, under (C.24),
lim kn1 In (kn ) =
c . (2.97)
n

Proof. In view of Remark 2.5, the statements (2.96)(2.97) hold trivially when X
is degenerate. Therefore, we may limit ourselves establish these statements in the
remaining case, where (C.3) is satised.
Step 1. (Outer bounds) Assume (C.134). We dene an integer sequence {nk :
k k0 }, where
k+1 k
k0 := inf{k 1, e c e c 1},
and, for each k k0 , by setting nk := sup{n 1 : #c log n$ = k}. The inequalities
c log n 1 < #c log n$ = kn c log n,
readily imply that, for each k k0 ,
k k+1
e c nk < e c . (2.98)
Moreover, it is easy to see that, for all k k0 ,
max kn1 In+ (kn ) = k 1 Ink (k). (2.99)
nk1 <nnk

Thus, by (2.13), (2.95), and (2.98)(2.99), we obtain that, for each m and
k k0 ,
(

Pk () := P {kn1 In+ (kn ) } = P k 1 Ink (k) }


nk1 <nnk
k k 
n( 

= P Si+k Si k (nk k + 1)P Sk k


i=0

1
 1 

nk exp k() e c exp k () .


c
Now, as follows from (2.94) and the denition (2.93) of +
c , for each > 0, we
have
  1
:= +c + > 0.
c
134 P. Deheuvels

1  
 all this, +we see that, for k k0 , Pk (c + ) e exp k , and hence,
+
By c

kk0 Pk (c + ) < . The Borel-Cantelli lemma implies therefore that, for each
> 0,
lim sup kn1 In+ (kn ) +
c + a.s.
n
Since this relation holds for each specied > 0, it implies, in turn, that
lim sup kn1 In+ (kn ) +
c a.s. (2.100)
n

Step 2. (Inner bounds)  Assuming (C.134), we set L +
n (k) = max Sk , S2k
Sk , . . . , SkN Sk(N 1) , where N = N (n, k) := #n/k$ 1 and k = kn = #c log n$.
We have, for each > m = E(X),


N

Qn,k () := P L+ n (k) < k = 1 P Sk k exp N P Sk k .

We next make use of Chernos theorem (refer to (2.15)) to show that, for any
1 (m, ),
1  

lim log P Sk k 1 = ( 1 ),
k k
so that, ultimately in k , for each 2 > 0
 
 

P Sk k 1 exp k ( 1 ) + 2 . (2.101)

Now, letting = + c , with 1 (m, c ), we infer from (2.94) and the denition
+
+
(2.93) of c that
1
2 := (+ c 1 ) > 0.
c
By all this, if we set 2 = and k = kn = #c log n$ c log n in (2.101), we obtain
that the inequality
 
1 
nc
P Skn kn + c 1 exp kn ,
c n
holds for all large n. This, in turn, readily entails that, for all n suciently large,
 N (n, k ) 

n
Qn,kn (+c 1 ) exp n c
exp n c/2
,
n
which is summable in n. Keeping in mind that L+ n (k) In (k), an application of
+

the Borel-Cantelli lemma shows that, almost surely,


lim inf kn1 In+ (kn ) lim inf kn1 L+
n (kn ) c 1 .
+
n n

Since this relation holds for an arbitrary 1 > 0, we obtain therefore that
lim inf kn1 In+ (kn ) +
c a.s. (2.102)
n

By combining (2.100) with (2.102), we conclude (2.96). The proof of (2.97) is


similar, and omitted. 
Topics on Empirical Processes 135

Exercise 2.9. Assume that (C.14) is satised, and that m = E(X) = 0. Set
   
Jn+ (k) = max Sj Si and Jn (k) = min Sj Si .
0iji+kn 0iji+kn

1 ) Show that the conclusion (2.96)(2.97) of Theorem 2.10 holds, with the formal
replacements of In+ (k) and In (k) by Jn+ (k) and Jn (k), respectively.
2 ) Show that these results fail to hold when m = E(X) = 0 [for additional
details, see, e.g., [58, 59]].
Exercise 2.10. Let {W (t) : t 0} be a Wiener process. Set
   
In+ (k) = max W (i + k) W (i) and In (k) = min W (i + k) W (i) .
0ink 0ink

1 ) Letting kn = #c log n$, for c > 0, nd the a.s. limits
lim k 1 In+ (kn ) and lim k 1 In (kn ).
n n n n

2 ) Show that if, on a suitable probability space, |Sn W (n)| = o(log n) a.s., as
n , then the Cherno function of X is given by () = 12 2 for R.
[As follows from a result of Bartfai [12] (see, e.g., p. 101 in [40]), this implies
d
that X = N (0, 1), so that the relation |Sn W (n)| = o(log n) is impossible,
unless Sn coincides initially with W (n), for some Wiener process W ().]
2.5.2. Functional Erdos-Renyi laws. The original Erdos-Renyi law [97] has given
rise to a number of renements and variants in the literature (see, e.g., [58, 59]
and the references therein). Below, we give a functional version of this theorem
due to Deheuvels [51]. We will work in the framework and assumptions of 2.5.1,
assuming throughout that (C.34) hold. We introduce the following additional
notation. We set, for each t 0, S(t) = St (which is right-continuous). For an
arbitrary function 0 < aT T of T 0, we set, for x 0,
 
x,T (s) = a1
T S(x + saT ) S(x) for 0 s 1, (2.103)
and set, for each T > 0,
 
FT = x,T : 0 x T aT . (2.104)
It is noteworthy that, for each T > 0, FT BV0 [0, 1], where BV0 [0, 1] is as in
(1.55). Letting and t0 , t1 be as in (2.8) and (2.10), we set, for each c > 0,
  1 
 
,c = H BV0 [0, 1] : c c1 H(u) du + t0 HS (1+) t1 HS+ (1+) 1 .
0
(2.105)
Note that, when t0 = and t1 = , the set dened by (2.105) reduces to
  1 
 
,c = H AC[0, 1] : H(0) = 0 and c c1 H(u) du 1 . (2.106)
0
In view of Theorems 1.1, 1.2 and 1.3, we see that:
When t0 , t1 are arbitrary, ,c BV0 [0, 1] is compact with respect to the
topology W, of weak convergence of distribution functions of signed measures
136 P. Deheuvels

in Mf [0, 1], dened via the Hognas metric (1.57). The set ,c may be iden-
tied, through the mapping H , to a weakly compact metrizable (via
dH ) subset of the (non-metrizable) topological space (Mf [0, 1], W);
When t1 = and t0 is arbitrary, ,c I+ [0, 1] is compact with respect
to the topology of weak convergence of distribution functions, dened by the
Levy distance L (, ). This set of functions may be identied, through the
mapping H , to a weakly compact subset of the metrizable (via L )
topological space (M+ f [0, 1], W);

When t0 = and t1 = , ,c AC0 [0, 1] is compact with respect to the


uniform topology.
Recall the denition (1.56) of BV0,C [0, 1]. We have the following result (see,
e.g., [51]).

Theorem 2.11. Fix any c > 0, and assume that aT / log T c as T . Then,
under either (C.134) or (C.234), there exists a 0 < C < such that ,c
BV0,C [0, 1], and, a.s. ultimately as T , FT BV0,C [0, 1]. Moreover, if we
set, for each > 0 and A BV0,C [0, 1],

A[];C = {f BV0,C [0, 1] : dH (f, g) < for some g A}, (2.107)


then, there exists a.s. a T < such that, for all T T , we have
[];C [];C
FT ,c and ,c FT . (2.108)

Exercise 2.11. Show that, under (C), (2.107) may be replaced by


FT ,c and ,c FT . (2.109)

3. Empirical Processes
3.1. Uniform empirical distribution and quantile functions
Let U1 , U2 , . . . be a sequence of independent and identically distributed [iid] uni-
form (0, 1) random variables. For each n 1, the empirical distribution based
upon U1 , . . . , Un is dened by
1
n
Pn (B) = 1I{Ui B} , (3.1)
n i=1

for each Borel set B BR . Denote by #E the cardinality of E. The corresponding


uniform empirical distribution function will be denoted hereafter by
1
n
1
Un (t) = Pn ((, u]) = 1I{Ui x} = #{Ui t : 1 i n} for t R.
n i=1 n
(3.2)
Topics on Empirical Processes 137

Remark 3.1. In general, each sequence X1 , . . . , Xn of random variables, with values


ncompact separable metric space X, denes an empirical measure by
in a locally
n = n1 i=1 Xi , where x stands for the Dirac measure at x. In this framework,
n is a Radon measure on X. Among other problems of interest, one may consider
convergence properties, as n , of the restriction of n to a class of subsets of
BX , or to a class of Borel-measurable functions on X. One of the main issues to be
considered is then the central limit theorem, namely, the asymptotic convergence
of such an empirical processes, indexed by sets or functions, to a limiting Gaussian
process. Also, it is of interest to characterize the cases where this empirical process
fullls a Glivenko-Cantelli-type law of large numbers, or a functional law of the
iterated logarithm. We refer to [6, 82, 99, 164, 204] for advanced courses on the
subject. In what follows, we shall limit ourselves to the case of X = R, which is
of interest in and of itself. In particular, the notion of empirical quantile process
(see, e.g., 3.2 below) gives rise to developments which are very much specic to
X = R, and dicult to extend properly in higher dimensions.
Remark 3.2. Let X1 , . . . , Xn be iid replic of a random variable X with continuous
df F (x) = P(X x) and quantile function G(t) = inf{x : F (x) t}. By Theorem
1.4, the rvs U1 = F (X1 ), . . . , Un = F (Xn ) are iid uniform (0, 1) rvs. The em-
pirical measure n based upon X1 , . . . , Xn has a df given by Fn (x) = Un (F (x)),
for x R, and a qf given by Gn (x) = G(Vn (t)), for t (0, 1). For this reason,
the study of the limiting behavior of Fn and Gn can be reduced to the study of the
limiting behavior of Un and Vn . In the sequel, we will deal mostly with uniform
empirical and quantile processes. We refer to [40, 38, 85, 183], and the references
therein for details on the general case.
For each n 0, set U0,n = 0 and Un+1,n = 1, and, for each n 1, denote
by U1,n Un,n the order statistics of U1 , . . . , Un , obtained by sorting these
random variables increasingly.
Proposition 3.1. For each n 0, we have, with probability 1,
0 = U0,n < U1,n < < Un,n < Un+1,n = 1. (3.3)
Proof. Obviously, for each i 1 and j 1 with i = j, we have
P(Ui = 0 or 1) = 0 and P(Ui = Uj ) = 0,
so that U1 , . . . , Un are a.s. distinct and in (0, 1). The conclusion (3.3) is then
straightforward. 
Unless otherwise specied, we will assume, without loss of generality, that (3.3)
holds on the probability space (, A, P) on which sit U1 , U2 , . . . . This allows us to
write

0 for t U1,n ,
Un (t) = ni for Ui1,n < t U1,n , i = 1, . . . , n, (3.4)


1 for t Un,n .
138 P. Deheuvels

The uniform empirical quantile function is dened by



U1,n for t = 0,
Vn (t) = (3.5)
Ui,n for i1 n < t n,
i
i = 1, . . . , n.
Let I(t) = t denote identity, and, for each bounded function f on [0, 1], set

f
= sup |f (t)|. (3.6)
0t1

Proposition 3.2. We have, with probability 1,


1

Un (Vn ) I
= . (3.7)
n

n < t n , we have Un (Vn (t)) = Un (Ui,n ) = n ,


Proof. For each i = 1, . . . , n and i1 i i

so that sup i1 <t i |Un (Vn (t)) t| = n1 . Since, by denition, Un (Vn (0)) = n1 , we
n n
conclude readily (3.7). 
Proposition 3.3. We have, with probability 1,
   
 i 1   i 


Un I
=
Vn I
= max Ui,n , Ui,n  . (3.8)
1in n   n
Proof. For each i = 1, . . . , n and Ui1,n < t Ui,n , we have Un (t) t = ni t,
n < t n , Vn (t) t = Ui,n t. By combining these two relations,
whereas, for i1 i

the conclusion (3.8) is straightforward. 


The following properties of order statistics play an essential role in the study of
empirical processes.
Theorem 3.1. Let {i : i 1} denote a sequence of independent and identically
distributed random variables with a standard exponential distribution. Namely, for
each i 1,
P(i > t) = et for t 0. (3.9)
Set T0 = 0 and Tm = 1 + + m for m 1. Then, for each n 0, Tn+1 and
the random vector {Ti /Tn+1 : i = 0, . . . , n + 1} are independent. Moreover, one
has the distributional identity
d
{Ui,n : i = 0, . . . , n + 1} = {Ti /Tn+1 : i = 0, . . . , n + 1} . (3.10)
Proof. The joint density of 1 , . . . , n+1 is
f (u1 , . . . , un+1 ) = es ,
with s = u1 + + un+1 .
The linear change of variables (u1 , . . . , un+1 ) (u1 , . . . , un , s) has Jacobian
equal to 1, so that the joint density of (u1 , . . . , un , s) is equal to
f (u1 , . . . , un , s) = es .
Topics on Empirical Processes 139

Now, the change of variables (u1 , . . . , un , s) (u1 /s, . . . , un /s, s) has Jacobian
equal to sn , so that the joint density of (v1 := u1 /s, . . . , vn := un /s, s) equals
sn es
f (v1 , . . . , vn , s) = n! .
n!
We obtain therefore that (T1 /Tn+1 , . . . , Tn /Tn+1 ) and Tn+1 are independent. More-
over,
(i) (T1 /Tn+1 , . . . , Tn /Tn+1 ) is uniformly distributed (with constant density, equal
to n!) on the set {(v1 , . . . , vn ) = 0 v1 vn 1};
(ii) Tn+1 follows a (n + 1) distribution on R+ .
To conclude (3.10) we observe that the joint distributions of the random vectors
(T1 /Tn+1 , . . . , Tn /Tn+1 ), and (U1,n , . . . , Un,n ), coincide. 
Theorem 3.2. For each n 1, the random variables
 U i
i,n
, i = 1, . . . , n, (3.11)
Ui+1,n
are independent and uniformly distributed on (0, 1).
Proof. We change variables by setting Yi = log Ui , for i = 1, 2, . . ., so that
Y1 , Y2 , . . . denes an iid sequence of exponentially distributed random variables.
For each n 1, set Y0,n = 0, and denote by
0 < Y1,n < < Yn,n ,
the order statistics of Y1 , . . . , Yn . Keeping in mind that
Yi,n = log Uni+1,n for i = 0, . . . , n,
we see that (3.11) is equivalent to the property that
(n i + 1){Yi,n Yi1,n }, i = 1, . . . , n, (3.12)
are independent and exponentially distributed. To establish this property, we re-
call from the proof of Theorem 3.1 that the joint density of (U1,n , . . . , Un,n ) is
constant and equal to n! on the set {(v1 , . . . , vn ) = 0 v1 vn 1}. The
change of variables (v1 , . . . , vn ) (y1 := log vn , . . . , yn := log v1 ) has Jaco-
bian 1/(y1 . . . yn ) = exp(y1 + + yn ), so that the joint density of (Y1,n , . . . , Yn,n )
equals
f (y1 , . . . , yn ) = n! e(y1 ++yn ) ,
on the domain {(y1 , . . . , yn ) : 0 y1 yn }. We now make the change
of variables (y1 , . . . , yn ) (t1 := ny1 , (n 1)(y2 y1 ), . . . , tn := yn yn1 ).
Since the corresponding Jacobian equals n!, we conclude that the joint density of
(nY1,n , (n 1)(Y2,n Y1,n ), . . . , Yn,n Yn1,n ) is equal to

n
f (t1 , . . . , tn ) = eti for ti 0, i = 1, . . . , n.
i=1

This completes the proof of (3.12), and hence, of (3.12). 


140 P. Deheuvels

Dene the uniform spacings by


Si,n = Ui,n Ui1,n for i = 1, . . . , n + 1. (3.13)
Exercise 3.1. Denote, respectively, the maximal (resp. minimal) uniform spacing
by
Mn+ = max Si,n and Mn = min Si,n .
1in+1 1in+1

Establish the limit laws, for x R and y 0,



x

lim P nMn log n x = ee and lim P n2 mn y = 1 ey .


n n

Exercise 3.2. Fix a constant c > 0, and set kn = #c log n$. Letting () be as in
(1.61), set
c+ = inf{x 1 : (x) 1/c} and c = sup{x 1 : (x) 1/c}.
Consider the statistics, for 1 k n + 1,
+
Mn,k = max {Ui+k1,n Ui1,n }
1ink+2
and
Mn,k = min {Ui+k1,n Ui1,n }.
1ink+2

Show that, in probability, as n ,


+
Mn,k Mn,k
n
c+ and n
c .
kn kn
3.2. Uniform empirical and quantile processes
The uniform empirical process is dened by
n (t) = n1/2 (Un (t) t) for t R, (3.14)
and the uniform quantile process is dened by
n (t) = n1/2 (Un (t) t) for t [0, 1]. (3.15)
It is convenient to set n (t) = 0 for t  [0, 1].
The following results are well known, and will be discussed in the sequel.
Theorem 3.3. We have, for each n 1 and t 0,
2
P (
n
t) = P (
n
t) 2e2t . (3.16)
2
Proof. Dvoretzky, Kiefer and Wolfowitz [86] showed that P (
n
t) Ce2t ,
for some universal constant C 2, proved equal to 2 by Massart [157]. We invoke
(3.8) for the equality
n
=
n
. 
Remark 3.3. The extension of (3.16) to dimension d 2 is delicate. For x =
(x1 , . . . , xd ) and y = (y1 , . . . , yd ) Rd , write x y when xj yj for j = 1, . . . , d.
Topics on Empirical Processes 141

Next, given an iid sequence U1 , U2 , . . . of random vectors, uniformly distributed on


[0, 1]d , dene the d-dimensional uniform empirical process by
n 

d 
n (x) = n1/2 1I{Ui x} xj , (3.17)
i=1 j=1

for x = (x1 , . . . , xd ) [0, 1]d. An example of Dvoretzky-Kiefer-Wolfowitz-type


bound which can be obtained in this framework is given by Devroye [75, 76], who
established the inequality
2
P (
n
t) 2e2 (2n)d e2t , (3.18)
for all n 1 and t n1/2 d2 .
Theorem 3.4. (Chung) We have, with probability 1,
1
lim sup(2 log2 n)1/2
n
= lim sup(2 log2 n)1/2
n
= . (3.19)
n n 2
Proof. This result is due to Chung [31]. Below, we give a simplied proof of (3.19)
to illustrate the use of some important technical tools. We rst observe that, for
each specied t [0, 1], we may write

n
  n
n1/2 n (t) = 1I{Ui t} t = (Yi E(Yi )),
i=1 i=1
d
where Yi = Be(t), i = 1, 2, . . ., is an iid sequence of random variables following
Bernoulli distributions (denoted by Be(t)) with probability of success P(Yi = 1) =
t. Since Var(Yi ) = t(1 t), the law of the iterated logarithm [LIL] (see, e.g., (2.89))
implies that (for each specied t [0, 1])
n (t)
lim sup = t(1 t) a.s. (3.20)
n 2 log2 n
By choosing t = 1/2 in (3.20), and recalling (3.8), we obtain readily that
n (1/2) 1
lim sup(2 log2 n)1/2
n
= lim sup(2 log2 n)1/2
n
lim sup = .
n n n 2 log2 n 2
(3.21)
To obtain the reverse inequality, we select > 0 and
> 0, and set nk = #(1 + )k $
for k = 0, 1, 2, . . ., where #u$ u < #u$ + 1 denotes the integer part of u. Consider
the events, for k = 1, 2, . . .,
 " 
Ak = n (nk1 , nk ] :
n1/2 n
(1 + 2) 12 nk log2 nk , (3.22)
 " 
1/2
Bk =
nk nk
(1 + ) 12 nk log2 nk . (3.23)

Making use of Ottavianis lemma, stated in a general framework in Lemma 4.2


in the sequel, we obtain readily that, for all k suciently large, one has, for all k
142 P. Deheuvels

suciently large,
P(Ak ) 2P(Bk ). (3.24)
In view of (3.16) and (3.23), we obtain, in turn, that, for all large k,
  2
P(Bk ) 2 exp (1 + )2 log2 nk = . (3.25)
(log nk )(1+)2
Since, as k , log nk = (1 + o(1))k log(1 + ), we infer from (3.24) and (3.25)
that


P(Ak ) < . (3.26)
k=1
The Borel-Cantelli lemma, when combined with (3.26), implies that the events Ak
hold nitely often with probability 1. This, in turn, implies that, a.s., ultimately
for all large n,
"  1/2
nk log2 nk

n
(1 + 2) 2 log2 n
1
nk1 log2 nk1
"
(1 + 2)(1 + ) 2 log2 n.
1
(3.27)
Here, we have used the observation that, ultimately for all large k,
nk log2 nk
= (1 + o(1))(1 + ) (1 + )2 .
nk1 log2 nk1
Since > 0 and > 0 in (3.27) may be chosen as small as desired, it follows
readily from this limiting statement that, with probability 1,
1
lim sup(2 log2 n)1/2
n
= lim sup(2 log2 n)1/2
n
. (3.28)
n n 2
We conclude (3.19), by combining (3.21) with (3.28). 
Exercise 3.3. Fix any 0 T 1. Show that, almost surely,

lim sup(2 log2 n)1/2 sup |n (t)| = sup t(1 t).
n 0tT 0tT

Exercise 3.4. Making use of (3.64) and (3.65), show that the constant C in the
2
Dvoretzky, Kiefer and Wolfowitz [86] inequality, P (
n
t) Ce2t , cannot be
chosen less that 2.
3.3. Some further martingale inequalities
We have the following useful results concerning empirical processes and Brownian
bridges.
Proposition 3.4. For each specied n 1, the process
{n (t)/(1 t) : 0 t < 1} , (3.29)
denes a martingale.
 
Proof. It is equivalent to prove that the process (nUn (t)nt)/(1t) : 0 t < 1 ,
denes a martingale. Consider any 0 < s < t < 1, and set A = nUn (s) and
Topics on Empirical Processes 143

B = nUn (t) nUn (s). We obtain readily that



P B = b| {Un (u) : 0 u s} = P B = b|A = a


 )  
n! n!
= sa (t s)b (1 t)nab sa (1 s)na
a! b! (n a b)! a! (n a)!
  
(n a)! (t s) (1 t)
b nab
=
b! (n a b)! (1 s)na
  b  na
(n a)! ts 1t
= .
b! (n a b)! 1t 1s
Next, we write

 1 t na na
  (n a)!   t s b 
ts

E B|A = a = b = (na) ,
1s b! (n a b)! 1t 1s
b=0
whence

n(t s) + a(1 t) 1 t
E B + A nt|A = a = nt = n(1 t) (n a) ,
1s 1s
and, nally
B + A nt 
a ns

E A = a = ,
1t 1s
which suces for our needs. 
Proposition 3.5. Let {N (t) : 0 t 1} dene a Brownian bridge. Then, the
process  
B(t)/(1 t) : 0 t < 1 , (3.30)
denes a martingale.
Proof. Omitted. 
Proposition 3.4 has some interesting consequences. First, we obtain the following
useful corollary.
Corollary 3.1. For each n 1, 0 < T < 1 and > 0, we have

1
P sup |n (t)| T (1 T ) 2 . (3.31)
0tT
Proof. As follows from Proposition and (2.30) taken with r = 2, we obtain that,
for u > 0 and 0 < T < 1,
 

 n (t) 

P sup |n (t)| u(1 T ) T P sup   u T


0tT 0tT 1 t

1  
n (T ) 2
1 T (1 T ) 1
2 E = 2 = 2 .
u T 1T u T (1 T ) 2 u (1 T )

We obtain readily (3.31) by setting u = / 1 T in this last inequality. 
144 P. Deheuvels

Corollary 3.2. For each n 1, 0 < T < 1 and > 0, we have



(1 T )

P sup |n (t)| T 2 exp nT h 1 + . (3.32)


0tT nT

Moreover, if 0 < N T /(1 T ), we have

(1 T )

P sup |n (t)| T 2 exp nT h 1 +


0tT nT
2  (3.33)
(1 T ) 

2 exp (1 T )2 1 .
2 3 nT
Proof. First note that

P sup |n (t)| T
0tT
 (t) 
 (t) 

n n
P sup T + P sup T .
0tT 1t 0tT 1t
Next, we combine Proposition 3.4 with (2.31). We so obtain that
 (t) 

n
P sup u
0tT 1t
  s (T )


n
exp sup su log E exp
s 1T
 s  s


= exp sup (u(1 T )) log E exp n (T )


s 1T 1T
   

= exp sup r(u(1 T )) log E exp rn (T ) .


r
We combine this last inequality with Theorem 2.7, and (2.70), to obtain that
 (t) 
(1 T )

n
P sup T exp nT h 1 + .
0tT 1t nT
A similar argument allows to show that
 (t) 
(1 T )

n
P sup T exp nT h 1 .
0tT 1t nT
In view of (2.68) and (2.70), we conclude readily (3.33). 
Remark 3.4. We will obtain in the forthcoming Proposition 3.9, a renement of
the inequality (3.33), through a completely dierent argument.
3.4. Relations between empirical and Poisson processes
The empirical process is related to the Poisson process by the following gen-
eral principle. The latter provides a construction of a Poisson process on a gen-
eral locally compact separable metric space X (see, e.g., 1.3.1). Consider a se-
quence X = X1 , X2 , . . . of iid X-valued random variables with common distribu-
tion PX (B) = P(X B), for B belonging to the set BX of all Borel subsets of X.
Topics on Empirical Processes 145

Let further K denote a random variable, independent of X1 , X2 , . . ., and following


a Poisson distribution with expectation E(K) = . We have namely
k
P(K = k) = e for k = 0, 1, . . . . (3.34)
k!
By denition, the Poisson process on X with mean measure () = PX () is the
random measure (), dened on the set BX of Borel subsets of X by

K
(B) = 1I{Xi B} , (3.35)
i=1

where, by convention () := 0. One of the important features of the Poisson
process is that:
1 ) For each B BX , (B) follows a Poisson distribution with expectation (B);
2 ) For each sequence B1 , B2 , . . . BX , with (Bi Bj ) = 0 i = j, (B1 ),
(B2 ), . . . are independent;

3 ) Whenever 1 , 2 , . . . are independent Poisson processes with mean measures
1 , 2 , . . ., the superposed process 1 +2 + is Poisson with mean measure
1 + 2 + .
Strictly speaking, the denition (3.35) is valid only for Poisson processes with
bounded mean measures M+ f (X), namely such that (X) = P(X) = < . To
obtain a more general form of this denition, compatible with the above property
(3 ), we use the convention that a Poisson process with a possibly unbounded
mean measure M+ (X) is dened as the superposition 1 + 2 + of an
innite sequence of independent Poisson processes (in the sense of (3.35)), with
j (X) < for j = 1, 2, . . ., and such that = 1 + 2 + . At this point, we see
the usefulness to assume that X is a locally compact separable metric space, to
render such a construction possible for an arbitrary nonnegative Radon measure
on X.
In the particular case where X = R+ and is the Lebesgue measure on R+ , we ob-
tain the standard Poisson process, commonly dened through its right-continuous
distribution function  
N (t) = (0, t] for t 0. (3.36)
The standard Poisson process {N (t) : t 0} has the following important property,
which yields an alternative construction of this process.
Proposition 3.6. Let Tk = inf{t : N (t) k} for k = 0, 1, . . .. Then, T0 = 0 and the
random variables k = Tk Tk1 , k = 1, 2, . . . are independent and exponentially
distributed with unit mean.
Proof. Omitted. 
For each > 0, the homogeneous Poisson process dened by N (t) = N (t)
for t 0, with N () as in (3.36), is such that E(N (t)) = t for t 0. The next
proposition is a straightforward consequence of the above denitions. Let Un (t) be,
146 P. Deheuvels

as in (3.2), the uniform empirical distribution function pertaining to the rst n 1


observations from an iid sequence U1 , U2 , . . . of uniform (0, 1) random variables.
Proposition 3.7. For an arbitrary > 0 and n 1, we have the distributional
equality
  d   
L {nUn (t) : 0 t 1} = L {N (t) : 0 t 1}N (1) = n , (3.37)
 
between the distribution L {nUn (t) : 0 t 1}
 of {nUn(t) : 0 t 1}, and the
conditional distribution L {N (t) : 0 t 1}N (1) = n of {N (t) : 0 t 1},
given that N (1) = n.
Proof. Omitted. 
The fact that a homogeneous Poisson process has independent increment renders
this process more tractable than the empirical process. The next proposition gives
a useful trick to replace empirical processes by Poisson processes in a series of
calculations.
Proposition 3.8. For each 0 < T < 1, there exists a universal constant CT with
the following property. For any measurable set B of the set (I+ [0, T ], W) of nonde-
creasing right-continuous nonnegative functions on [0, T ], endowed with the weak
topology, and each n 1, we have

P {nUn (t) : 0 t T } B CT P {Nn (t) : 0 t T } B . (3.38)


In addition, for each > 0, there exists an n < such that (3.38) holds for all
n 1 with
CT = (1 + )/ 1 T . (3.39)
Proof. This elementary result has been used repeatedly in a series of papers in the
literature (see, e.g., Lemma 2.7 in Deheuvels [54]). Let N () be as in (3.37), and
set = n. Introduce the events
   
E1 = {nUn (t) : 0 t T } B and E2 = {Nn (t) : 0 t T } B .

Set R = Nn (T ) and R = Nn (1) Nn (T ). By (3.37), and the independence of


(E2 , R) and R, we have
P(E2 {R + R = n})
P(E1 ) = P(E2 |Nn (1) = n) = P(E2 |R + R = n) =
P(R + R = n)
1 n
= P(E2 {R = n k})P(R = k)
P(R + R = n) k=0
1   n
sup P(R = k) P(E2 {R = n k})
P(R + R = n) k0 k=0
1  
sup P(R = k) P(E2 ).
P(R + R = n) k0
Topics on Empirical Processes 147

Next, we make use of the observation that, if K follows a Poisson distribution with
expectation , then
sup P(K = k) = P(K = #$), (3.40)
k0

where #$ < #$+1 is the integer part of . To establish this property, observe
that, if
k
pk := P(K = k) = e for k = 0, 1, . . . ,
k!
then 
pk 0 for k k #$,
=
pk1 k < 0 for k > k > #$,
from where (3.40) is straightforward. By all this, we infer from the Stirling formula,

N ! = (1 + o(1))(N/e)N 2N as N ,
that, as n ,
1    n!  (n(1 T ))n(1T ) 
n(1T )
sup P(R = k) = e n
e
P(R + R = n) k0 nn #n(1 T )$!
1 + o(1)
= .
1T
This, in turn, suces to show that
1  

CT := sup sup P(R = k) < .


n1 P(R + R = n) k0

Given this relation, we obtain readily (3.38) and (3.39), as sought. 


The next proposition gives an example of how (3.38) may be applied.
Proposition 3.9. For any 0 < T0 < 1 and > 0, there exists an n< such that,
for all n n , we have, uniformly over 0 < T T0 and 0 nT ,

2(1 + )

P sup |n (t)| T exp nT h 1 +


0tT 1 T0 nT
2
(3.41)
2(1 + ) 

exp 1 .
1 T0 2 3 nT
Proof. Let {(t) : t 0} denote a standardPoisson process. By combining (3.38)
and (3.39) with (2.65), taken with = / nT , we obtain readily that, for all n
suciently large,

1+  

P sup |n (t)| T P sup |(t) t| nT


0tT 1 T0 0tnT nT
2(1 + )

exp nT h 1 + ,
1 T0 nT
which, in view of (2.70), readily yields (3.41). 
148 P. Deheuvels

3.5. Strong approximations of empirical and quantile processes


The uniform empirical process n () fullls
E(n (t)) = 0 and E({m1/2 m (s)}{n1/2 n (t)}) = {m n}{s t st},
for all 0 s, t 1 and m, n 1. For each n 1, the Gaussian process with
the same covariance structure as {n (t) : 0 t 1} is the Brownian bridge,
conveniently dened ny B(t) = W (t) tW (1), where W () is a (standard one-
dimensional) Wiener process. The two-parameter Gaussian process with the same
covariance structure as {n1/2 n (t) : 0 t 1, n 0} (with the convention that
n1/2 n (t) = 0 for n = 0) is the Kiefer process {K(t, s) : s 0, t 0} (refer to
Kiefer [127]). For integer values of n 0, the Kiefer process K(t, n) is conveniently
dened by setting
 n
K(t, n) = Bi (t) for 0 t 1, (3.42)
i=1

where B1 , B2 , . . . is an iid sequence of Brownian bridges. Given a (standard two-


dimensional) Wiener process {W (t, s), t 0, s 0}, such that, for s, t, s , t , s ,
t 0,
 
E(W (t, s)) = 0 and E W (t , s )W (t , s ) = {t s }{t s }, (3.43)
one denes a Kiefer process, via
K(t, s) = W (t, s) tW (1, s) for t0 and s 0. (3.44)
The Doob-Donsker weak invariance principle (see, e.g., Billingsley [19], and the
references therein) establishes the weak convergence
W
n B, (3.45)
which holds for the probability measures induced by n and B on the set D[0, 1]
endowed with the Skorohod topology (see, e.g., 2.2.2 and Ch. 3 in [19]). In many
applications, it is more appropriate and convenient to make use of strong invariance
principles, where the empirical processes and Brownian bridges are dened on the
same probability space. Below, we cite the most useful theorems of the kind, in
the framework of our study. We start with the celebrated Komlos, Major and
Tusnady [129, 130] invariance principle for the uniform empirical process (stated
in Theorem 3.5), and the Csorgo and Revesz variant for the empirical quantile
process (stated in Theorem 3.6).

Theorem 3.5. On a suitable probability space, it is possible to dene {Un : n 1}


jointly with a sequence {Bn : n 1} of Brownian bridges, in such a way that, for
appropriate constants a1 > 0, a2 > 0 and a3 > 0, we have, for all n 1 and t R,
 
a1 log n + t
P
n Bn
a2 exp(a3 t). (3.46)
n
Topics on Empirical Processes 149

Proof. The original papers of Komlos, Major and Tusnady [129, 130] give hardly
any proof for (3.46), and the rst self-contained proof of this inequality was given
by Mason and van Zwet [155]. Some details are also given in in Csorgo and Revesz
[39], Bretagnolle and Massart [26] and Csorgo and Horvath [38]. 
Remark 3.5. The optimal choice of a1 , a2 , a3 in (3.46) (as well as that of a4 , a5 , a6
in (3.47) below) remains an open problem. The best presently known result in this
direction is due to Bretagnolle and Massart [26], who showed that (3.46) holds for
n 2 with a1 = 12, a2 = 2 and = 1/6.

Theorem 3.6. On a suitable probability space, it is possible to dene {Un : n 1}


jointly with a sequence {Bn : n 1} of Brownian bridges, in such a way that, for
appropriate constants a4 > 0, a5 > 0 and a6 > 0, we have, for all n 1 and t R,
 
a4 log n + t
P
n Bn
a5 exp(a6 t). (3.47)
n
Proof. See, e.g., Csorgo and Revesz [39]. 
A very useful renement of Theorem 3.5, due to Mason and van Zwet [155], is
cited in the next theorem. Introduce the following notation. Set
log+ u = log1 u = log(u e) for u R, (3.48)
and dene recursively, for each integer p 2,
logp u = log+ u(logp1 (u)) for u R. (3.49)
Theorem 3.7. On a suitable probability space, it is possible to dene {Un : n 1}
jointly with a sequence {Bn : n 1} of Brownian bridges, in such a way that, for
appropriate constants a7 > 0, a8 > 0 and a9 > 0, we have, for all n 1, t R
and 0 < d n
 
a7 log+ d + t
P sup |n (u) Bn (u)| a8 exp(a9 t). (3.50)
0u d n
n

Proof. See, e.g., Mason and van Zwet [155]. 


The following two theorems provide (very likely non-optimal) versions of Theorems
3.5 and 3.7 with respect to the approximation of the empirical process by Kiefer
processes.
Theorem 3.8. On a suitable probability space, it is possible to dene {Un : n 1}
jointly with a sequence {B 'n : n 1} of independent Brownian bridges, in such a
way that, for appropriate constants a10 > 0, a11 > 0 and a12 > 0, we have, for all
n 1 and t R,
 1 '

n
  (log+ n)(a10 log+ n + t)
P n Bi  a11 exp(a12 t). (3.51)
n i=1 n

Proof. See, e.g., Komlos, Major and Tusnady [129, 130]. 


150 P. Deheuvels

Theorem 3.9. On a suitable probability space, it is possible to dene {Un : n 1}


'n : n 1} of independent Brownian bridges, in such a
jointly with a sequence {B
way that, for appropriate constants a10 > 0, a11 > 0 and a12 > 0, we have, for all
n 1 and t R,
 1 '
n  (log n)(a log n + t)

  10
P sup n (u) Bi (u) +
+
a11 exp(a12 t).
d
0u n n i=1
n
(3.52)

Proof. See, e.g., Castelle and Laurent-Bonvalot [27] and Castelle [28]. 

Theorem 3.10. On a suitable probability space, we have, almost surely as n ,


 1 '
n  (log n)2

 
n Bi (u) = O , (3.53)
n i=1 n
and
 1 '
n 

 
n + Bi (u) = O n1/4 (log n)1/2 (log2 n)1/4 , (3.54)
n i=1

where {B 'n : n 1} is a sequence of independent Brownian bridges. In addition,


*n : n 1} of independent Brownian bridges such
there does not exist a sequence {B
that, almost surely,
 1 *
n 

 
n + Bi (u) = o n1/4 (log n)1/2 (log2 n)1/4 . (3.55)
n i=1

Proof. By combining Theorem 3.9 with the Borel-Cantelli lemma we obtain (3.53).
This, when combined with the Bahadur-Kiefer representation (see, e.g., Theorem
3.4 in the sequel) yields (3.54), rst stated by Csorgo and Revesz [39]. The fact that
(3.54) is optimal, was proved by Deheuvels [55], who showed that (3.55) cannot
hold a.s., for any possible choice of the iid sequence {B *n : n 1}. 

The following theorem is due to Csorgo, Csorgo, Horvath and Mason [41], and
Csorgo, and Horvath [37].

Theorem 3.11. On a suitable probability space, it is possible to dene {Un : n 1}


jointly with a sequence {Bn : n 1} of Brownian bridges, in such a way that the
following limiting property holds. For any 0 < 12 , 0 < 14 and c > 0, we
have, as n ,
|n (t) Bn (t)|
n sup 1 = OP (1)
n t1 n
c c
{t(1 t)} 2
and (3.56)
|n (t) + Bn (t)|
n sup 1 = OP (1)
n t1 n
c c
{t(1 t)} 2
Topics on Empirical Processes 151

Exercise 3.5. Show that, under the assumptions of Theorems 3.5 and 3.6, we have,
almost surely,
log n
log n


n Bn
= O and
n Bn
= O . (3.57)
n n
3.6. Some results for weighted processes
Below, we mention some useful results on weighted empirical processes. We re-
fer to Einmahl and Mason [93, 94] for similar results for quantile processes and
multivariate empirical processes. The standardized empirical process is dened by
n (t)
n (t) = for 0 < t < 1, (3.58)
t(1 t)
and has the property that, for each 0 < t < 1,
d
n (t) N (0, 1) as n .
The following limit laws are available for functionals of n and n . Let
 1
Tn = sup |n (t)|, An = n2 (t)dt, (3.59)
0<t<1 0
 1
Dn = sup |n (t)|, n2 = 2n (t)dt, (3.60)
0t1 0

Dn+ = sup n (t), Dn = sup {n (t)}. (3.61)


0t1 0t1

Smirnov [189, 190, 192] showed that


 1

d B 2 (t) d Yk2
A2n dt = , (3.62)
0 t(1 t) k2 2
k=1

where B() is a Brownian bridge, and {Yk : k 1} an iid sequence of N (0, 1) r.v.s,
and Anderson and Darling [5] showed likewise that
 1

d d Yk2
n2 B 2 (t)dt = . (3.63)
0 k(k + 1)
k=1

For the other statistics (see, e.g., Kolmogorov [128] and Smirnov [191]), it holds
that, for each t > 0,


2 2
lim P(Dn t) = P
B
t = 2 (1)k+1 e2k t , (3.64)
n
k=1

2
lim P(Dn t) = P sup B(s) t = e2t . (3.65)
n 0s1

and, for each t R (see, e.g., Eicker [88] and Jaeschke [116]),

t
lim P Tn 2 log2 n 2 log2 n 12 log3 n 12 log < t = e2e . (3.66)
n
152 P. Deheuvels

Theorem 3.12. (Csaki) Let {an : n 1} be a sequence of positive constants. Then




an = lim sup Tn nan = a.s. (3.67)
n
n=1

If, in addition, nan , then




an < lim Tn nan = 0 a.s. (3.68)
n
n=1

Moreover,
Tn
lim inf =1 a.s. (3.69)
n 2 log2 n

Theorem 3.12 has been extended by Einmahl and Mason [93] as follows. Consider
the statistic
1 |n (t)| 1 1
Tn, = n 2 sup = n 2 sup {t(1 t)} 2 |n (t)|, (3.70)
0<t<1 {t(1 t)}
1
0<t<1

for 0 1/2. Keeping in mind that Tn = Tn, 12 , we have the following theorem.

Theorem 3.13. (Einmahl and Mason) Let {an : n 1} be a sequence of positive


constants. Then


an = lim sup Tn, (nan )1 = a.s. (3.71)
n
n=1

If, in addition, nan , then




an < lim Tn, (nan )1 = 0. (3.72)
n
n=1

Proof. See, e.g., Einmahl and Mason [93]. 


We have the following easy corollary of Theorem 3.13.

Corollary 3.3. For each 0 1/2, we have, almost surely,


log Tn,
lim sup = 1 . (3.73)
n log2 n
Proof. Set an = 1/{n(log n)1+ } in (3.71)(3.72). 
The following useful consequence of Corollary 3.3 has some interesting applica-
tions. For each > 0, we have, almost surely for all large n,
|n (t)| 1
Tn = Tn, 12 = sup 1 (log n) 2 + . (3.74)
0<t<1 {t(1 t)} 2
Topics on Empirical Processes 153

Exercise 3.6.
1 ) Show, as a consequence of (3.66), that
Tn
lim =1 in probability. (3.75)
n 2 log2 n

2 ) Show, as a consequence of Theorem 3.12, that


Tn Tn
lim inf =1 and lim sup = a.s. (3.76)
n 2 log2 n n 2 log2 n

3 ) Show that the upper bound (3.74) is optimal, in the sense that
Tn
lim sup = a.s. (3.77)
n log n
Exercise 3.7.
1 ) Show that, for any xed c > 0, we have, almost surely,
 1
2n (t)
 1c/n
2n (t) 

 dt dt 0. (3.78)
0 t(1 t) c/n t(1 t)

2 ) Show that, for any 1 > 0 and 2 > 0, there exists an = 1 ,2 (0, 12 ) such
that
  2 (t)    11/n 2 (t) 

   
lim sup P  n
dt +  n
dt 1 1 2 . (3.79)
n 1/n t(1 t) 1 t(1 t)

3 ) Show that, for each (0, 12 ), we have


 1  1
2n (t) d B 2 (t)
dt dt. (3.80)
t(1 t) t(1 t)

4 ) Show that (3.80) also holds for = 0 [Hint: make use of (3.46), (3.56) and
(3.73)].

3.7. Finkelsteins theorem via invariance principles


In this section, we will denote by (C[0, 1], U) (resp. (B[0, 1], U) ) the set on con-
tinuous (resp. bounded) functions on [0, 1], endowed with the uniform topology U,
dened by the sup-norm

f
= sup |f (t)|. (3.81)
0t1

We denote by AC[0, 1] the set of all absolutely continuous functions f on [0, 1]


with Lebesgue derivative f = dx
df
. The Strassen set, S, and the Finkelstein set, F,
154 P. Deheuvels

are as follows.
  1 
S = f AC[0, 1] : f (0) = 0, f2 (t)dt 1 , (3.82)
0
  1 
F = f AC[0, 1] : f (0) = f (1) = 0, f2 (t)dt 1
  0 (3.83)
= f S : f (1) = 0 .
The next theorem is due to Finkelstein [98].
Theorem 3.14. (Finkelstein) We have, with probability 1,
  
lim inf (2 log2 n)1/2 n f 
n f F
   (3.84)
= sup lim inf (2 log2 n)1/2 n f  = 0,
f F n

and  
lim inf (2 log2 n)1/2 n f 
n f F
   (3.85)
= sup lim inf (2 log2 n)1/2 n f  = 0
f F n

Proof. By Theorem 3.8, we may dene, without loss of generality, the sequence
{n : n 1} on the same probability space as a sequence {Bn : n 1} of inde-
pendent Brownian bridges, in such a way that
 n   
 1/2 
n n Bi  = O (log n)2 a.s. (3.86)
i=1
In view of Proposition 4.3, an application of Theorem 4.3 shows that, with prob-
ability 1,
  
n 
 
lim inf (2n log2 n)1/2 Bi f  = 0, (3.87)
n f F
i=1
  
n 
 
sup lim inf (2n log2 n)1/2 Bi f  = 0. (3.88)
f F n
i=1

The rst half, (3.84), of Theorem 3.14 is a consequence of (3.86), (3.87) and (3.88).
The second half, (3.84), of the theorem follows from (3.84), in combination with
the Bahadur-Kiefer representation (3.125). 
Exercise 3.8.
1 ) Show that
sup
f
= 12 .
f F
2 ) By combining this result with Theorem 3.14 and Theorem 3.10, give a proof
of Theorem 3.4.
Topics on Empirical Processes 155

3.8. Local and tail empirical process


Fix any t0 [0, 1], and select a sequence of positive constants {hn : n 1} fullling
(H.1) hn 0, nhn ;
(H.4) nhn / log2 n r [0, ].
The local empirical and quantile processes at t0 are the random functions of u
dened, respectively, by
 
n,t0 (u) = h1/2
n n (t0 + hn u) n (t0 ) , (3.89)
and  
n,t0 (u) = h1/2
n n (t 0 + h n u) n (t 0 ) , (3.90)

When t0 = 0 and u [0, 1], we obtain the tail empirical process, denoted hereafter
by n = n,0 , for n , and tail quantile process, denoted hereafter by n = n,0 , for
n . The case where t0 = 1 and u [1, 0] being similar, by symmetry, the tail
processes will be considered below for t0 = 0 and u [0, 1] only.
We rst state a series of results for the tail empirical and quantile processes,
corresponding to t0 [0, 1] and u [0, 1]. Kiefer [126] (see, e.g., [49]) showed the
following limit laws. Set for any c > 0
c+ = inf{x > 1 : h(x) 1/c}, c = sup{x < 1 : h(x) 1/c}, (3.91)
c+ = inf{x > 1 : (x) 1/c}, c = sup{x < 1 : (x) 1/c}. (3.92)

Theorem 3.15. Under (H.1) and (H.4) with nhn / log2 n , we have, almost
surely,
lim sup (2 log2 n)1/2 n (hn ) = lim sup (2hn log2 n)1/2 n (hn ) = 1, (3.93)
n n

lim sup (2 log2 n)1/2 n (hn ) = lim sup (2hn log2 n)1/2 n (hn ) = 1. (3.94)
n n

On the other hand, when nhn / log2 n c, we have, almost surely,

lim sup (2 log2 n)1/2 n (hn )


n
(3.95)
= lim sup (2hn log2 n)1/2 n (hn ) = (c 1)(c/2)1/2 ,
n

lim sup (2 log2 n)1/2 n (hn )


n
(3.96)
= lim sup (2hn log2 n)1/2 n (hn ) = (c 1)(c/2)1/2 .
n

Proof. See, e.g., Kiefer [126], and Deheuvels [49]. 


Mason [156] established a functional version of (3.94), stated in Theorem 3.16
below. This result is a consequence of the following invariance principle, when
combined with a general LIL due to Lai [134].
156 P. Deheuvels

Proposition 3.10. Assume (H.1) and (H.4) with nhn / log2 n . Then, on a
suitable probability space, there exists a sequence Wk (), k = 1, 2, . . . of independent
Wiener processes such that, almost surely as n ,
  n


n (hn ) 1 Wi (hn ) = o hn log2 n . (3.97)


n i=1

Proof. See, e.g., Mason [156]. 


As a consequence of Proposition 3.10, we obtain the following result.
Theorem 3.16. Assume (H.1) and (H.4) with nhn / log2 n . Then, as n ,
the sequences
(2 log2 n)1/2 n (hn ) = (2hn log2 n)1/2 n (hn )  S, (3.98)
1/2 1/2
(2 log2 n) n (hn ) = (2hn log2 n) n (hn )  S, (3.99)
are almost surely relatively compact in (B[0, 1], U), with limit set equal to S.
Proof. See, e.g., Mason [156], and Einmahl and Mason [94]. 
When 0 < t0 < 1, the limiting behavior of the local empirical process, and that of
the local quantile process, become distinct. The following theorem holds.
Theorem 3.17.
1 ) Let 0 < t0 < 1. Assume (H.1) and (H.4) with nhn / log2 n . Then, as
n , the sequence
(2 log2 n)1/2 n,t0 (hn ) = (2hn log2 n)1/2 {n (t0 + hn ) n (t0 )}  S,
is almost surely relatively compact in (B[1, 1], U), with limit set equal to S.

2 ) Let 0 < t0 < 1. Assume (H.1), together with:
 
(H.4) nhn / log+ 1/(hn n) ;
  
(H.5) log+ 1/(hn n) / log2 n d [, ].
Then, the sequence
     
(2hn log+ 1/(hn n) + log2 n )1/2 {n (t0 + hn ) n (t0 )} : n 1 ,
is almost surely relatively compact in (B[1, 1], U), with limit set equal to S.
Proof. See, e.g., Deheuvels and Mason [68], Mason [156], and Deheuvels [54]. 
3.9. Modulus of continuity of n and n
Let {hn : n 1} denote a sequence of constants fullling the conditions
(H.1) hn 0, nhn ;
(H.2) nhn / log n ;
(H.3) (log(1/hn ))/ log2 n [0, ].
The following theorem collects a series of limiting results concerning the modulus
of continuity of n and n . Below, we let 0 c < d 1 denote specied constants.
Topics on Empirical Processes 157

Theorem 3.18. Under (H.123), we have, almost surely for each 0 c < d 1,
1/2
lim sup (2hn {log(1/hn ) + log2 n}) sup |n (u) n (v)| (3.100)
n cu,vd
|uv|hn
1/2
= lim sup (2hn {log(1/hn ) + log2 n}) sup |n (u + hn ) n (u)| (3.101)
n cud
1/2
= lim sup (2hn {log(1/hn ) + log2 n}) sup |n (u) n (v)| (3.102)
n cu,vd
|uv|hn
1/2
= lim sup (2hn {log(1/hn ) + log2 n}) sup |n (u + hn ) n (u)| = 1,
n cud
(3.103)

and, setting /( + 1) = 1 when = ,

lim inf (2hn {log(1/hn ) + log2 n})1/2 sup |n (u) n (v)| (3.104)
n cu,vd
|uv|hn
1/2
= lim inf (2hn {log(1/hn ) + log2 n}) sup |n (u + hn ) n (u)| (3.105)
n cud
1/2
= lim inf (2hn {log(1/hn ) + log2 n}) sup |n (u) n (v)| (3.106)
n cu,vd
|uv|hn
1/2
= lim inf (2hn {log(1/hn ) + log2 n}) sup |n (u + hn ) n (u)|
n cud

1/2
= . (3.107)
+1

In addition, we have, in probability,

lim (2hn {log(1/hn ) + log2 n})1/2 sup |n (u) n (v)| (3.108)


n cu,vd
|uv|hn

= lim (2hn {log(1/hn ) + log2 n})1/2 sup |n (u + hn ) n (u)| (3.109)


n cud

= lim (2hn {log(1/hn ) + log2 n})1/2 sup |n (u) n (v)| (3.110)


n cu,vd
|uv|hn

1/2
1/2
= lim (2hn {log(1/hn ) + log2 n}) sup |n (u + hn ) n (u)| = .
n cud +1
(3.111)

Proof. To establish (3.100)(3.111) combine results of Stute [200], Mason, Shorack


and Wellner [154], Mason [153], Deheuvels and Mason [66], Deheuvels [52], and
Deheuvels and Einmahl [60]. 
158 P. Deheuvels

Theorem 3.18 turns out to be a direct consequence of functional limit laws due
to Deheuvels and Mason [66], Deheuvels [52], Deheuvels [53], and Deheuvels and
Einmahl [60].
Let S denote the Strassen set, as dened in (2.34). Consider the random sets of
functions of u [0, 1],
 
1/2
En = (2hn {log(1/hn ) + log2 n}) {n (t + hn u) n (t)} : c t d
 
=: n,t (u) : c t d , (3.112)
 
1/2
Fn = (2hn {log(1/hn ) + log2 n}) {n (t + hn u) n (t)} : c t d
 
=: n,t (u) : c t d . (3.113)
Denote the Hausdor distance (pertaining to the uniform topology U) between
subsets of B[0, 1] by setting, for arbitrary A, B B[0, 1],
 
U (A, B) = inf > 0 : A B and B A , (3.114)
whenever such an > 0 exists, and U (A, B) = otherwise.
Theorem 3.19. Under (H.123) with = , we have, for each 0 c < d 1,
lim U (En , S) = lim U (Fn , S) = 0 a.s. (3.115)
n n

Proof. See, e.g., Deheuvels and Mason [66]. 


1
Fix now a T > 0, set hn = n log n, and consider the random sets of functions of
u [0, T ] dened by
 n  
Gn = {Un (t + un1 log n) Un (t)} : c t d , (3.116)
log n
 n  
Hn = {Vn (t + un1 log n) Vn (t)} : c t d . (3.117)
log n
Introduce the sets of functions of u [0, T ] dened by
  T 
T = f AC[0, T ] : f (0) = 0, h(f(u))du 1 , (3.118)
0
  T 
T = f I[0, T ] : f (0) = 0, (f(u))du + fS (T +) 1 . (3.119)
0
Here, I[0, T ] denotes the restriction to [0, T ] of left-continuous nondecreasing func-
tions on R. Each of these functions, say f I[0, 1] with f (0) = 0, denes, via the
Lebesgue-Stieltjes integral, a nonnegative measure on [0, T ]. We may therefore
decompose this measure into an absolutely continuous component with Lebesgue
derivative f, and a singular component dfS . We then dene fS (T +) by

fS (T +) = dfS . (3.120)
[0,T ]
Topics on Empirical Processes 159

Recall the denition (1.44) of the Levy distance L (f, g) between two functions
f, g I[0, T ]. We set, for any f I[0, T ],
 
N[] (f ) = g I[0, T ] : L (f, g) < , (3.121)
and, for any subset A I[0, T ],
(
A[] = N[] (f ), (3.122)
f A

when A = , and A[] = [] = otherwise. Denote the Hausdor distance (per-


taining to the weak topology) between subsets of I[0, T ] by setting, for arbitrary
A, B I[0, T ],
 
W (A, B) = inf > 0 : A B [] and B A[] , (3.123)
whenever such an > 0 exists, and W (A, B) = otherwise.
Theorem 3.20. We have, for each T > 0 and 0 c < d 1,
lim U (Gn , T ) = lim W (Hn , T ) = 0 a.s. (3.124)
n n

Proof. See, e.g., Deheuvels and Mason [66]. 


3.10. The Bahadur-Kiefer representation
The classical Bahadur-Kiefer representation, due to Bahadur [8], and Kiefer [124,
125], is stated in the following theorems 3.21 and 3.22. It allows to replace the
quantile process n by (1) times the empirical process n with an almost sure
uniform error of order n1/4+o(1) . At times (see, e.g., Deheuvels [55, 57]), this yields
the best possible rates of approximation. The strange constants in (3.125) and
(3.132) below are generated in a quite natural way by functional limit theorems,
and their derivation illustrates well the main topic of the present course.
Theorem 3.21. We have, with probability 1,
lim sup n1/4 (log n)1/2 (log2 n)1/4
n + n
= 21/4 . (3.125)
n

Proof. We refer to Shorack [182] and Deheuvels and Mason [64] for details. Below,
we follow the elegant proof of Shorack [182] to establish (3.125). First, we combine
Propositions 3.2 and 3.3, to obtain that, almost surely,
1
sup |n (t) n (Vn (t))| = sup |n (t) n1/2 {Vn (t) Un (Vn (t))}| = ,
0t1 0t1 n
and hence, since Vn (t) = t + n1/2 n (t),
1

{n + n } {n n (I + n1/2 n )}
= a.s. (3.126)
n
In view of (3.126), we see that (3.125) is equivalent to
 
lim sup n1/4 (log n)1/2 (log2 n)1/4 n (I + n1/2 n ) n  = 21/4 a.s. (3.127)
n
160 P. Deheuvels

The proof of (3.127) is achieved in two complementary parts.


Part 1. (Outer bounds). Fix any > 0. By Chungs Theorem 3.4, we have, a.s.
for all n suciently large,
n1/2 n
hn := (1 + )21/2 log2 n. This, when
combined with (3.100), shows that, a.s.,
 
lim sup(2hn {log(1/hn ) + log2 n})1/2 n (I + n1/2 n ) n 
n

lim sup(2hn {log(1/hn ) + log2 n})1/2 sup |n (u) n (v)| = 1.


n 0u,v1
|uv|hn

We obtain readily from this last result that, a.s.,


 
lim sup n1/4 (log n)1/2 (log2 n)1/4 n (I + n1/2 n ) n  (1 + )1/2 21/4 .
n

Since > 0 may be chosen as small as desired, it follows that, a.s.,


 
lim sup n1/4 (log n)1/2 (log2 n)1/4 n (I + n1/2 n ) n  21/4 . (3.128)
n

Part 2. (Inner bounds). In this part, we invoke Finkelsteins Theorem 3.14. As


follows from (3.83), the function f (t) = min{t, 1 t} for t [0, 1] belongs to the
Finkelstein set F. Thus, a.s. for each
> 0, there exists an increasing sequence
{nj : j 1} along which
(2 log2 n)1/2 n f
14
. Set now c = 12 14
and
d = 12 + 14
. Obviously, for any t [c, d], we have |f (t) 12 | 14
. Therefore, along
{nj : j 1}, we have, uniformly over t [c, d], |(2 log2 n)1/2 n (t) 12 | 12
.
Setting hn = 21/2 n1/2 (log2 n)1/2 , we have therefore, along {nj : j 1}, and
uniformly over t [c, d], |n1/2 n (t)hn | hn
. Now, by an application of (3.101)
and (3.105), we have
1/2
lim (2hn log(1/hn )) sup |n (u + hn ) n (u)| = 1 a.s., (3.129)
n cud

whereas we infer from (3.101) that


1/2
lim sup (2hn log(1/hn )) sup |n (u + hn ) n (u + hn + v)| =
a.s.,
n cud
|v|hn 
(3.130)
Making use of (3.129)(3.130), we obtain therefore that
 
1/2  
lim inf (2hn log(1/hn )) sup n (u + n1/2 n ) n (u) 1
a.s.,
n cud

Since > 0 may be chosen as small as desired, it follows readily from this last
statement that, a.s.,
 
lim inf n1/4 (log n)1/2 (log2 n)1/4 n (I + n1/2 n ) n  21/4 . (3.131)
n

The proof of (3.127) is achieved by combining (3.128) with (3.130). 


Topics on Empirical Processes 161

The pointwise Bahadur-Kiefer representation is stated in the following theorem.


Theorem 3.22. For each specied t0 [0, 1], we have, with probability 1,
lim sup n1/4 (2 log2 n)3/4 |n (t0 ) + n (t0 )| = {t0 (1 t0 )}1/4 21/2 33/4 . (3.132)
n

Proof. See, e.g., Kiefer [124]). Below, we give a simple proof of this statement,
given in Deheuvels and Mason [67], and 3.4 of Deheuvels and Mason [68]. We
need the following two auxiliary results (see, e.g., Lemmas 3.3 and 3.4 in [68]).
Set, for 0 < t0 < 1 and 1 u 1,
xn = (2 log2 n)1/2 n (t0 ), (3.133)
2 1/2
1/2
fn (t) = 2 log2 n log2 n
n (3.134)
2 1/2

n (t0 ) n t0 log2 n u .
n
Then, we have, almost surely, as n ,
|n1/4 (2 log2 n)3/4 {n (t0 ) + n (t0 )} fn (xn )| 0. (3.135)
Moreover, the sequence (xn , fn ) is almost surely relatively compact in RB[1, 1],
where B[1, 1] denotes the set of bounded functions on [1, 1], endowed with the
unform topology. The limit set is given by
  1 
x2
(x, fn )  K1 := (x, f ) R AC0 [1, 1] : + f(u)2 du 1 .
t0 (1 t0 ) 1
(3.136)
In view of (3.133), (3.134), (3.135) and (3.136), the RHS of (3.132) equals, almost
surely,
 1/4
sup |f (x)| = t0 (1 t0 ) sup s(1 s2 ) = {t0 (1 t0 )}1/4 21/2 33/4 ,
(x,f )K1 0s1
(3.137)
which yields the theorem. 
Exercise 3.9.
1 ) Letting xn be as in (3.133), show, as a consequence from Finkelsteins Theo-
rem 3.14, that, almost surely as n ,

xn  [ t0 (1 t0 ), t0 (1 t0 ) ],
and show that this result is in agreement with (3.136).

2 ) Letting fn be as in (3.134), show, as a consequence from Theorem 3.17, that,
almost surely as n ,
fn  S,
and show that this result is in agreement with (3.136).
162 P. Deheuvels

3.11. Application to density estimation


The theorems of the previous sections provide a number of applications to non-
parametric functional estimation (see, e.g., [4, 13, 45, 46, 47, 15, 16, 17, 18, 23, 25,
29, 33, 34, 35, 45, 46, 50, 52, 53, 56, 72, 74, 77, 75, 78, 79, 87, 96, 105, 107, 108, 109,
110, 111, 115, 117, 118, 131, 133, 143, 149, 158, 159, 160, 161, 162, 165, 168, 169,
170, 171, 175, 176, 177, 178, 180, 196, 197, 198, 200, 202, 203, 208, 209, 210, 212]).
We will limit ourselves to one example, showing how the preceding functional limit
laws can be applied. Consider an iid sequence X1 , X2 , . . . of rvs with common
distribution F (x) = P(X1 x), having density f (x), assumed to be positive and
continuous on J = (c , d ) R. We estimate f (x), for x I := [c, d] J, by the
Akaike-Parzen-Rosenblatt kernel estimator (refer to [4, 163, 170])

1  x Xi

n
fn,h (x) = K . (3.138)
nh i=1 h

Here, h > 0 is a positive bandswidth, and K() a kernel, namely a function of


bounded variation on R, such that, for some 0 < M < ,

(i) K(t) = 0 for |t| M ;



(ii) K(t)dt = 1.
R

The following theorem was proved by Stute [200], under slightly more restrictive
conditions. As stated, it is due to Deheuvels and Mason [66] (see also Deheuvels
[52] and Deheuvels and Einmahl [60]).

Theorem 3.23. Assume that {hn : n 1} is a sequence of positive constants,


fullling
nhn log(1/hn )
hn 0; nhn ; ; . (3.139)
log n log2 n
Then, we have, almost surely,
 1/2   1/2
nhn
lim sup {fn,hn (x) Efn,hn (x)} = sup f (x) K 2 (t)dt .
n 2 log(1/hn ) xI xI R
(3.140)

Proof. We will establish (3.140) in the special case where J = (0, 1), f (x) = 1 for
0 < x < 1, and K(t) = 0 for t  [ 14 , 34 ]. The arguments of the proof in the general
case are essentially identical, with the addition of minor technicalities, which can
be omitted. We let n0 be such that thn J for all t I, when n n0 . Recall the
denition (3.112) of n,x (u). Our assumptions imply that Xn = Un is uniformly
Topics on Empirical Processes 163

distributed on (0, 1) for n = 1, 2, . . ., so that we may write, for n n0 ,


 1
1 x t

fn,hn (x) Efn,hn (x) = K d{Un (t) t}
0 hn hn
 1
1
= K(u)d{n (x hn u) n (x)}
hn n 0
 1
1
= d{n (x hn u) n (x)}dK(u)
hn n 0
 2(log(1/h ) + log n) 1/2  1
n
= 2
n,x (u)dK(u)
nhn 0
 2 log(1/h )) 1/2  1
n
= (1 + o(1)) n,x (u)dK(u),
nhn 0

after integrating by parts. Now, as follows from Theorem 3.19, we have U ({n,x :
x I}, S) 0 a.s. It is easily checked that, whenever f (u) S, we have f (u) S.
Therefore, an easy argument shows that, almost surely as n ,
 nhn 1/2
sup {fn,hn (x) Efn,hn (x)}
2 log(1/hn ) xI
 1
= (1 + o(1)) sup f (u)dK(u)
f S 0
 1  1
sup f (u)dK(u) = sup f(u)K(u)du,
f S 0 f S 0

after inegrating by parts. Finally the Schwarz inequality shows that


 1  1 
sup
f (u)K(u)du = K 2 (u)du ,
f S 0 0

which suces for proving (3.140) under our (simplied) assumptions. 

At this point, we leave it to the reader to continue these arguments we have used
to establish similar results for other non-parametric estimators of interest.

Exercise 3.10. Making use of Theorem 3.17, show that, whenever


nhn
hn 0; nhn ; ,
log2 n
we have, for each specied x0 in the neighborhood of which f is continuous, almost
surely,
 nh 1/2   1/2
n
lim sup {fn,hn (x) Efn,hn (x)} = f (x) K 2 (t)dt .
n 2 log2 n R
164 P. Deheuvels

4. Auxiliary results
4.1. Some Gaussian process theory
Let Z be a centered Gaussian random vector taking values in a separable Ba-
nach space X, with norm denoted by | |X . Throughout, we will work under the
assumption that
(G.1) P (|Z|X < ) = 1.
In the sequel, we will mainly consider the case where X = C[0, 1], the set of all
continuous functions on [0, 1], endowed with the sup-norm uniform topology U,
dened by | |X =

, and, either,
Z =W is (the restriction to [0, 1] of) a standard Wiener process; or
Z =B is a Brownian bridge.
The assumption (G.1) is clearly satised in either of these cases.
Denoting by BX the -algebra of Borel subsets of X, the distribution of Z is dened
on BX via
PZ (B) = P(Z B) B BX . (4.1)

Denoting by X the topological dual of X, that is, the space of all continuous linear
forms h : X R, one may dene a linear mapping J : X X, by the Bochner
integral 
h X J h = E (Zh (Z)) = zh (z)PZ (dz). (4.2)
X
The image space H := J (X ) is pre-Hilbertian with inner product dened by


J g , J h H = E (g (Z)h (Z)) = g (z)h (z)PZ (dz), for g , h X . (4.3)
X
The reproducing kernel Hilbert space [RKHS] of Z is then dened as the closure
H in X of H with respect to the Hilbert norm | |H = , H . Of special interest
here is the unit ball of H, denoted by
K = {h H : |h|H 1} . (4.4)
The following assumption will be needed, in addition to (G.1). Throughout, we
assume that
(G.2) K is a compact subset of (X, | |X ).
We will make use of a generalized version of the Cameron-Martin formula stated
in the following theorem.
Theorem 4.1. For each h H, there exists a measurable linear form ' h on X ,
fullling the equalities


PZ (B + h) = exp ' h(x) 12 |h|2H PZ (dx) for each B BX , (4.5)




B

E 'h2 (Z) = '
h2 (x)PZ (dx) = |h|2H . (4.6)
X
Topics on Empirical Processes 165

Proof. See, e.g., Borell (1976). 


The next proposition (see, e.g., Borell (1977)) states a useful consequence of (4.6).
Proposition 4.1. Let h H and let A BX denote a symmetric Borel subset of
X . Then,

PZ (A + h) exp 12 |h|2H PZ (A). (4.7)

Proof. There is noting to prove if PZ (A) = 0. Assuming that PZ (A) > 0, we set
B = A in (4.7), then make use of the Jensen inequality to obtain, by symmetry of
A and linearity of '
h, that



PZ (A + h) = exp 2 |h|H1 2
exp ' h(x) PZ (dx)
A


P (dx) 
exp '
Z
= exp 2 |h|H PZ (A)
1 2
h(x)
A P Z (A)

  PZ (dx)

exp 12 |h|2H PZ (A) exp '
h(x)
A PZ (A)

exp 12 |h|H PZ (A),


2

which is (4.7). 
The next statement is the celebrated isoperimetric inequality due to Borell (1975)
and Sudakov and Tsyrelson (1978). Denote by
 x
1 2
(x) = et /2 dt, (4.8)
2
the df of the standard N (0, 1) law, and by 1 = the corresponding quantile
function, fullling
 
1 (u) = u for all 0 < u < 1. (4.9)
Theorem 4.2. For each A, B BX such that B (A + rK) = , we have
 
PZ (B) 1 1 (PZ (A)) + r . (4.10)
The following simple inequalities will be useful to evaluate the right-hand
side of (4.10).
Lemma 4.1. We have, for every t > 0,
et /2  1
2 2
et /2
1 2 1 (t) . (4.11)
t 2 t t 2
Moreover, for every t 0, we have
2 2 2
1 (t) et /2 et /2 . (4.12)
2
166 P. Deheuvels

Proof. By integrating by parts, we rst observe that


 
1 x2 /2 1 1 2
1 (t) = e dx = d{ex /2 }
2 t 2 t x
2  2
et /2 1 1 x2 /2 et /2 2
= 2
d{e } d{ex /2 }.
t 2 2 t x t 2
By integrating once more by parts, we obtain likewise that

et /2  1 et /2  1
2 2
1 2 x2 /2
1 (t) = 1 2 + 3
d{e } 1 2 .
t 2 t 2 t x t 2 t
The above inequalities readily yield (4.11). To establish (4.12), we infer from (4.11)
that the inequality
2
et /2 2 2
1 (t) et /2 ,
t 2 2
always holds when t 1. On the other hand, when 0 t 1, we may write
 
1  1 x2 /2 2
 2 2
1 (t) = e dx + ex /2 dx et /2 .
2 t 1 2
By combining these two cases, we obtain readily (4.12) as sought. 
4.2. A functional LIL for superpositions of Gaussian processes
To illustrate the Gaussian process methodology of 4.1, we give below a proof of
an elementary version of the functional law of the iterated logarithm [FLIL] for
superpositions of independent Gaussian processes. We refer to [93], [103], [134],
for more rened results of the kind. We inherit the notation of 4.1, and consider
a sequence Z1 , Z2 , . . . of independent replicae of Z. We set further
1 
n
Yn = Zi . (4.13)
n i=1
The following notation will be needed. For each > 0 and x X, we set
V,x = {z X :
z x
< } . (4.14)
For each > 0 and A BX , A = , we set
(
A = {z X : x A,
z x
< } = V,x . (4.15)
xA

For convenience, we set = .


Theorem 4.3. Under (G.12), with probability 1 for any > 0, there exists an n
such that, for all n n ,
(2 log2 n)1/2 Yn K . (4.16)
Moreover, for each z K, we have, with probability 1,
lim inf |(2 log2 n)1/2 Yn z|X = 0. (4.17)
n
Topics on Empirical Processes 167

Prior to the proof of Theorem 4.3, we will establish the following useful result,
usually referred to as Ottavianis lemma.
Lemma 4.2. Let 0 and 1 , . . . , N denote independent random vectors taking value
in the Banach space (X, | |X ). Set i = 0 + 1 + + i for i = 1, . . . , N . Consider
a Borel subset A of X and select a > 0. Suppose that, for each 1 i N ,
P(|N i |X ) 1/2. Then, we have
 
P i = 0, . . . , N : i  A2 2P (N  A ) . (4.18)
Proof. Introduce the events
Ai () = {i  A } for i = 0, . . . , N,
Bi () = {|N i |X < } for i = 0, . . . , N,
Denote by A = A the complement of A, and set A1 (2) = . We obtain
readily that
(
N

N

P Ai (2) = P Ai (2) Ai1 (2)


i=0 i=0

N

2 P Ai (2) Ai1 (2) P |N i |X


i=0

N

= 2 P Ai (2) Ai1 (2) Bi ()


i=0

N

2 P Ai (2) Ai1 (2) AN () 2P (AN ()) ,


i=0

which is (4.18). 
Proof of Theorem 4.3.
Part 1 Outer Bounds. We select a > 0 and an > 0. Introduce the sequence
of integers nk = #(1 + )k $ for k = 0, 1, . . ., and set ak = 2nk log2 nk . Consider
the events
 
1/2
Ck = n (nk1 , nk ] : n1/2 Yn  {nk 2 log2 nk K}2ak , (4.19)
 
1/2 1/2
Dk = nk Ynk  {nk 2 log2 nk K}ak . (4.20)

Our assumptions imply that, ultimately as k , for each n (nk1 , nk ],


 

 1/2 
P n1/2 Yn nk Ynk  ak = P (nk n)1/2 |Z|X 2nk log2 nk
X
 n 1/2

k
P |Z|X 2 log2 nk
nk1

1
P |Z|X (1 + 12 ) 2 log2 nk .
2
168 P. Deheuvels

We may therefore apply Ottavianis lemma (Lemma 4.2) to show that, for all k
suciently large,
P(Ck ) 2P(Dk ). (4.21)
We now turn to the evaluation of P(Dk ). Toward this end, me make use of the
Isoperimetric Inequality (Theorem 4.2) with choices of A, B and r which will are
specied below. We set, for a constant m > 0 which will be precised later on,
A = D(0, m) := {x X : |x|X < m},
B = X {A + rK} = X {x + h : |x|X < m, |h|H r}.
In view of (G.1), we may now choose m > 0 so large that PZ (A) = P(|Z|X < m) >
1/2, which, in turn, implies that := 1 (PZ (A)) > 0. This, when combined with
(4.10) and (4.12), shows that, for all r > 0,
  1

PZ (B) 1 1 (PZ (A)) + r exp ( + r)2 . (4.22)


2

We now choose r = 2 log2 nk in (4.22). In view of the denition (4.20) of Dk , we
observe that, ultimately as k ,
 2 log2 nk
 m

P(Dk ) = P Z  2 log nk K P Z  2 log nk K


1
2

= PZ (B) exp + 2 log2 nk


2
1 2
1

= exp 2 log2 nk = o .
log nk 2 (log nk )(log2 nk )2
Since log nk = (1 + o(1))k log(1 + ) as k , we infer from this last expression,
when combined with (4.21), that


P (Ck ) < .
k=1

The Borel-Cantelli lemma implies in turn that, with probability 1 for all k su-
ciently large, we have, uniformly over all n such that nk1 < n nk ,
2 nk log2 nk /{nk1 log2 nk1 }
 

Yn nk log2 nk
1/2
K
2 log2 n nk1 log2 nk1
 2(1+)
(1 + )K . (4.23)

Now, we make use of (G.2) to show that, for any xed


> 0, we may select > 0
and > 0 so small that
 2(1+)
(1 + )K K .

This, when combined with (4.23), suces for (4.16).


Topics on Empirical Processes 169

Part 2 Inner Bounds. We rst note that the assumption (G.2) implies that

M := sup |g|X < . (4.24)


gK

Let, as in Part 1, nk = #(1 + )k $, for k = 0, 1, . . . . Under the assumptions of


the theorem, it is enough to show that, for each specied > 0 and h K with
1 2 := |h|2H < 1, there exists a > 0 such that, almost surely,

lim inf |(2 log2 nk )1/2 Ynk h|X < . (4.25)


k

In view of (4.24), and making use of Part 1, we obtain readily that, almost surely,

lim sup |(2nk log2 nk )1/2 nk1 Ynk1 |X
k

 (2nk log2 nk )1/2  


1/2
= lim lim sup |(2 log n k1 ) Yn h|X
k (2nk1 log2 nk1 )1/2
2 k1
k

(1 + )1/2 sup |g h|X 2M (1 + )1/2 .


gK

Therefore, there exists a 1 > 0 such that, whenever 1 , almost surely,


 
 
lim sup (2nk log2 nk )1/2 nk1 Ynk1  < 12 . (4.26)
k X

Consider next the mutually independent events



1/2    
 
Ek :=  2(nk nk1 )log2 (nk nk1 ) nk Ynk nk1 Ynk1 h < 18 .
X
(4.27)
As follows from (4.7) in combination with (G.1), we obtain therefore that, ulti-
mately as k ,
 1/2  1/2

P(Ek ) = P |Z 2 log2 (nk nk1 ) h|X < 14 2 log2 (nk nk1 )



 1/2

exp 12 |h|2H 2 log2 (nk nk1 ) P |Z|X < 14 2 log2 (nk nk1 )

1
12 exp 12 (1 2) 2 log2 (nk nk1 ) 1 .
k

Since this entails that




P(Ek ) = ,
k=1
170 P. Deheuvels

the Borel-Cantelli lemma implies, in turn, that P(Ek i.o.) = 1. Now, by the de-
nition (4.27) of Ek , it follows that, almost surely,
 1/2   
 
lim inf  2nk log2 nk nk Ynk nk1 Ynk1 h
k X
 (2nk log2 nk )1/2  
lim 1
8
k (2(nk nk1 ) log2 (nk nk1 ))1/2
 (2nk log2 nk )1/2 
 
+|h|X 1 lim 
k (2(nk nk1 ) log2 (nk nk1 ))1/2
 +  +

14 +M 1 . (4.28)
1+ 1+
In view of (4.28), we see that there exists a 2 > 0 such that, whenever 2 ,
we have, almost surely,
 1/2   
 
lim inf  2nk log2 nk nk Ynk nk1 Ynk1 h < 12 .(4.29)
k X

By combining (4.26) with (4.29), we obtain readily that (4.25) holds for all
1 2 . This concludes the proof of Theorem 4.3. 
Exercise 4.1. Prove (3.87) and (3.88).
4.3. Karhunen-Loeve expansions
We recall the following facts about Karhunen-Loeve [KL] expansions, (see, e.g.,
[3], [7], [119] and [122]). Denote by {Z(t) : 0 < t < 1} a centered Gaussian process
with covariance function {R(s, t) = E(Z(s)Z(t)) : 0 < s, t < 1}, fullling
 1
0< R(t, t)dt < . (4.30)
0
Then, there exist nonnegative constants {k : k 1}, k , together with functions
{ek (t) : k 1} L2 (0, 1) of t (0, 1) such that the properties (K.1234) below
hold.
(K.1) For all i 1 and k 1,
 1 
1 if i = k,
ei (t)ek (t)dt =
0 0 if i = k.
(K.2) The {k , ek () : k 1} form a complete set of solutions of the Fredholm
equation in (, e()), = 0.
 1  1
e(t) = R(s, t)e(s)ds for 0 < t < 1 and e2 (t)dt = 1. (4.31)
0 0
The k (resp. ek ()) are the eigenvalues (resp. eigenfunctions) of the
Fredholm transformation
 1
f L (0, 1) T f L (0, 1) : T f (t) =
2 2
R(s, t)f (s)ds, t (0, 1).
0
Topics on Empirical Processes 171

(K.3) The series expansion



R(s, t) = k ek (s)ek (t) for 0 < s, t < 1, (4.32)
k1
 
is convergent in L2 (0, 1)2 .
(K.4) There exist independent and identically distributed [i.i.d.] N (0, 1) ran-
dom variables {k : k 1} such that the Karhunen-Loeve [KL] expan-
sion 
Z(t) = k k ek (t) 0 < t < 1, (4.33)
k1

of Z() holds, with the series (4.33) converging a.s., and in integrated
mean square on (0, 1).
Remark 4.1.
1 ) The sequence {k , ek () : k 1} in (K.1234) may very well be nite.
Below, we will implicitly exclude this case and specialize in innite KL ex-
pansions with k ranging through IN = {1, 2, . . .}, with 1 > 2 > > 0.
2 ) If, in addition to (4.30), Z() is a.s. continuous on [0, 1] with covariance
function R(, ) continuous on [0, 1]2 , then, we may choose the functions
{ek () : k 1} in the KL expansion (4.33) to be continuous on [0, 1]. The
series (4.32) is then absolutely and uniformly convergent on [0, 1]2 , and the
series (4.33) is a.s. uniformly convergent on [0, 1] (see, e.g., [3]).
There are few Gaussian processes of interest with respect to statistics for which
the KL expansion is known through explicit values of {k : k 1}, and with
simple forms of the functions {ek () : k 1} (see, e.g., [152] for a review). It is
useful to have a precise knowledge of the k s, since we infer from (4.33) that
 1 
D2 = Z 2 (t)dt = k k2 . (4.34)
0 k1

This readily implies (see, e.g., (6.23), p. 200 in [119]), that the moment-generating
function of the distribution of D2 is given by
 1/2
1 1
D2 (z) = E(exp(zD2 )) = for Re(z) < . (4.35)
1 2zk 21
k=1

We have |E(exp(zD2 ))| < for all z C with Re(z) < 1


21 , subject to the
additional conditions that
(i) 1 > 2 > > 0,
and
(4.36)
  1 1
(ii) k = < , where k = for k 1.
k k
k=1 k=1
172 P. Deheuvels

Since D2 is a weighted sum of independent 21 components, its distribution is easy


to compute under (4.36) via the Smirnov formula (see, e.g., [43], [150], [151], [189],
[192]). For t > 0,

 2k tu/2
1 e du
P D2 > t = (1)k+1 (4.37)

k=1 2k1 u |F(u)|

1 + o(1) 2 etu/2 du
= as t ,
1 u |F(u)|

where F(u) is the Fredholm determinant dened, under (4.36), by


  
u
F(u) = 1 uk = 1 for u 0. (4.38)
k
k=1 k=1

In view of (4.35)(4.38), we note that


1
D2 (z) = {F(2z)}1/2 for Re(z) <
. (4.39)
21
We refer to Martynov ([152], [150]) for a study of the convergence of the series
(4.37), together with versions of this formula holding when some the consecutive
terms of the sequence 1 2 > 0 are equal.
4.4. The RKHS of the Wiener process and Brownian bridge
The standard Wiener process W (t) admits on [0, 1] the Karhunen-Loeve represen-
tation

1   
W (t) = Yk   2 sin k 12 t , (4.40)
k=1
k 2
1

where {Yk : k 1} denotes an iid sequence of normal N (0, 1) r.vs. The proof of
(4.40) is left to the reader in Exercise 4.2, and as a special case of Theorem 4.6 in
4.5, in the sequel.
In view of (4.40), given any two functions f and g of the form

   
f (t) = ak 2 sin k 12 t
k=1
and
(4.41)
   
g(t) = bk 2 sin k 12 t ,
k=1

the Hilbert product of f and g within the reproducing kernel Hilbert space [RKHS]
pertaining to the Wiener process on [0, 1] is given by

  1
  2
f, g H = = k 2 ak b k =
1
f(t)g(t)dt, (4.42)
k=1 0

where f and g denote, respectively, the Lebesgue derivatives of f and g. We obtain


the proposition:
Topics on Empirical Processes 173

Proposition 4.2. The RKHS pertaining to the Wiener process on [0, 1] is the space
H , with Hilbert norm | |H , of all absolutely continuous functions f on [0, 1] such
that |f |H < , where
 1/2
1 2
f (t) dt if f AC[0, 1] fullls f (0) = 0,
|f |H = 0 (4.43)
otherwise.

The standard Brownian bridge B(t) admits on [0, 1] the Karhunen-Loeve repre-
sentation

1
B(t) = Yk 2 sin (kt) , (4.44)
k
k=1
where {Yk : k 1} denotes an iid sequence of normal N (0, 1) r.vs. Therefore,
given any two functions f and g of the form




f (t) = ak 2 sin (kt) and g(t) = bk 2 sin (kt) , (4.45)
k=1 k=1

the Hilbert product of f and g within the RKHS pertaining to the Brownian bridge
on [0, 1] is given by

  1
2
f, g H = = (k) ak bk = f(t)g(t)dt, (4.46)
k=1 0

where f and g denote, respectively, the Lebesgue derivatives of f and g. We obtain


the proposition:
Proposition 4.3. The RKHS pertaining to the Brownian bridge on [0, 1] is the space
H , with Hilbert norm | |H , of all absolutely continuous functions f on [0, 1] such
that |f |H < , where
 1/2
1 2
f (t) dt if f AC[0, 1] fullls f (0) = f (1) = 0,
|f |H = 0 (4.47)
otherwise.

Exercise 4.2. Consider the Fredholm transformation of L2 [0, 1] onto itself, de-
ned by
 1
f T f (x) = min{x, t}f (t)dt,
0
and let y(x) denote an eigenvalue of T , fullling T y = y for some > 0.
1 ) Show that y() is continuous on [0, 1], then, by a recursion, that y is twice
continuously dierentiable on (0, 1) and such that y(0) = 0 and y  (1) = 0.
2 ) Show that y is a solution of the dierential equation y  + y = 0, and that the
only possible values of are of the form = 1/{(k 12 )}2 for k = 1, 2, . . .
. Conclude to the validity of (4.40).
174 P. Deheuvels

4.5. KL expansions for weighted Wiener processes and Brownian bridges


In this section, we provide KL expansions for weighted Wiener processes and
Brownian bridges due to Deheuvels and Martynov [63]. Throughout, {W (t) : t
0} and {B(t) : 0 t 1} denote, respectively, a Wiener process, and a Brownian
bridge. These processes are centered with covariance functions
E(W (s)W (t))s t for s, t 0,
and
E(B(s)B(t)) = s t st for 0 s, t 1.
Denote by {(t) : 0 < t < 1} a positive and continuous function on (0, 1), whose
denition will, at times, be extended by continuity to (0, 1] or [0, 1]. Below, we will
work under additional conditions taken among the following.
(L.1) () is continuous on (0, 1];
 1
(L.2) (i) lim t(t) = 0; (ii) t 2 (t)dt < ;
t0 0
 1
(C.3) (i) lim t(t) = lim(1 t)(t) = 0; (ii) t(1 t) 2 (t)dt < .
t0 t0 0
It is readily checked that (L.2)(ii) (resp. (L.3)(ii)) is the version of (4.30) corre-
sponding to Z(t) = Z1 (t) (resp. Z(t) = Z2 (t)), where
Z1 (t) = (t)W (t) and Z2 (t) = (t)B(t) for 0 < t < 1.
To obtain the KL expansions of Z1 (), Z2 (), we will use the following theorems, in
the spirit of Kac and Siegert ([121], [122]), and Kac (see, e.g., pp. 199-200 in [119]
and Section 2 in [120]).
Theorem 4.4. Assume (C.1-2). Set Z(t) = Z1 (t) = (t)W (t) for 0 < t 1. Then,
the {(k , ek ()) : k 1} in the KL expansion of Z() are obtained by setting =
1/ and e(t) = y(t)(t), where y() is a continuous on [0, 1] and twice continuously
dierentiable on (0, 1] solution of the dierential equation
y  (t) + 2 (t)y(t) = 0, (4.48)
subject to > 0 and with limit conditions
y(0) = 0 and y  (1) = 0. (4.49)
Theorem 4.5. Assume (L.3). Set Z(t) = Z1 (t) = (t)B(t) for 0 < t < 1. Then, the
{(k , ek ()) : k 1} in the KL expansion of Z() are obtained by setting = 1/
and e(t) = y(t)(t), where y() is a continuous on [0, 1] and twice continuously
dierentiable on (0, 1) solution of the dierential equation
y  (t) + 2 (t)y(t) = 0, (4.50)
subject to > 0 and with limit conditions
y(0) = 0 and y(1) = 0. (4.51)
Topics on Empirical Processes 175

Proof. The details of proofs of Theorems 4.4 and 4.5 are given in Deheuvels and
Martynov [63]. 
In the sequel, we will concentrate on the particular case where, for some constant
R,
(t) = t for 0 < t 1. (4.52)
We note that (L.123) hold under (4.52) i > 1. In particular,
 1  1
> 1 t 2 (t)dt < t(1 t) 2 (t)dt < .
0 0
For $\nu>-1$, consider the Bessel function $J_\nu(\cdot)$ of the first kind and index $\nu$ (see §4.6 below for details on the definition and properties of $J_\nu(\cdot)$). For $\nu>-1$, the positive zeros of $J_\nu(\cdot)$ (solutions of $J_\nu(z)=0$) form an infinite sequence, denoted hereafter by $0<z_{\nu,1}<z_{\nu,2}<\cdots$. These zeros are interlaced with the zeros $0<z_{\nu+1,1}<z_{\nu+1,2}<\cdots$ of $J_{\nu+1}(\cdot)$ (see, e.g., [207], p. 479), in such a way that

$$0<z_{\nu,1}<z_{\nu+1,1}<z_{\nu,2}<z_{\nu+1,2}<z_{\nu,3}<\cdots. \qquad (4.53)$$

The next theorems provide KL expansions for $\{t^\nu W(t): 0\le t\le 1\}$ and $\{t^\nu B(t): 0\le t\le 1\}$.
Theorem 4.6. Let $\{W(t): t\ge 0\}$ denote a Wiener process. Then, for each $\nu = \frac{1}{2\gamma}-1>-1$, or equivalently, for each $\gamma = 1/(2(1+\nu))>0$, the Karhunen–Loève expansion of $\{t^\nu W(t): 0<t\le 1\}$ is given by

$$t^\nu W(t) = t^{\frac{1}{2\gamma}-1}\,W(t) = \sum_{k=1}^{\infty} \sqrt{\lambda_k}\,\omega_k\,e_k(t), \qquad (4.54)$$

where $\{\omega_k: k\ge 1\}$ are i.i.d. $N(0,1)$ random variables, and, for $k=1,2,\ldots$,

$$\lambda_k = \frac{(2\gamma)^2}{z_{\gamma-1,k}^2}, \qquad e_k(t) = \frac{t^{\frac{1}{2\gamma}-\frac12}}{\sqrt{\gamma}\,\big|J_\gamma(z_{\gamma-1,k})\big|}\; J_\gamma\big(z_{\gamma-1,k}\,t^{\frac{1}{2\gamma}}\big) \quad\text{for } 0<t\le 1. \qquad (4.55)$$
Theorem 4.7. Let $\{B(t): 0\le t\le 1\}$ denote a Brownian bridge. Then, for each $\nu = \frac{1}{2\gamma}-1>-1$, or equivalently, for each $\gamma = 1/(2(1+\nu))>0$, the Karhunen–Loève expansion of $\{t^\nu B(t): 0<t\le 1\}$ is given by

$$t^\nu B(t) = t^{\frac{1}{2\gamma}-1}\,B(t) = \sum_{k=1}^{\infty} \sqrt{\lambda_k}\,\omega_k\,e_k(t), \qquad (4.56)$$

where $\{\omega_k: k\ge 1\}$ are i.i.d. $N(0,1)$ random variables, and, for $k=1,2,\ldots$,

$$\lambda_k = \frac{(2\gamma)^2}{z_{\gamma,k}^2}, \qquad e_k(t) = \frac{t^{\frac{1}{2\gamma}-\frac12}}{\sqrt{\gamma}\,\big|J_{\gamma-1}(z_{\gamma,k})\big|}\; J_\gamma\big(z_{\gamma,k}\,t^{\frac{1}{2\gamma}}\big) \quad\text{for } 0<t\le 1. \qquad (4.57)$$
The KL expansion of $\{t^\nu B(t): 0<t\le 1\}$ in Theorem 4.7 is known for $\nu=0$ ($\gamma=1/2$) (see, e.g., [5] and §4.4, pp. 30–32 in [85]). For $\nu=-1/2$ ($\gamma=1$), we refer to Scott [179]. In the general case where $\nu>-1/2$ (or, equivalently, $0<\gamma<1$), this KL expansion turns out to be equivalent to a KL expansion given in Li [140].
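
Theorem 4.7 can be illustrated numerically. In the sketch below (Python with NumPy/SciPy; the grid size and the bracketing device used to locate the Bessel zeros are ad-hoc choices of this illustration), the eigenvalues $(2\gamma/z_{\gamma,k})^2$ are compared with those of the discretized covariance $s^\nu t^\nu(s\wedge t - st)$ of $\{t^\nu B(t)\}$:

```python
# Numerical sanity check of Theorem 4.7 for nu = 0.3.
import numpy as np
from scipy.special import jv
from scipy.optimize import brentq

nu = 0.3
gamma = 1.0 / (2.0 * (1.0 + nu))

# locate the first zeros of J_gamma by bracketing sign changes of jv(gamma, .)
x = np.linspace(0.1, 30.0, 3000)
s = np.sign(jv(gamma, x))
z = np.array([brentq(lambda u: jv(gamma, u), x[i], x[i + 1])
              for i in np.where(s[:-1] * s[1:] < 0)[0]])

n = 2000
t = (np.arange(1, n + 1) - 0.5) / n
C = np.outer(t**nu, t**nu) * (np.minimum.outer(t, t) - np.outer(t, t))
eig = np.sort(np.linalg.eigvalsh(C / n))[::-1]   # Nystrom eigenvalue estimates

print(np.c_[eig[:5], (2 * gamma / z[:5]) ** 2])  # columns agree closely
```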
The simple observation that the $\{\sqrt{\theta}\,t^{\frac12(\theta-1)}\,e_k(t^\theta): k\ge 1\}$ are orthonormal in $L^2(0,1)$ whenever such is the case for the $\{e_k(t): k\ge 1\}$ (see, e.g., [145]), allows us to derive, via (4.58)–(4.60) below, through the change of scale $t\to t^\theta$, a series of variants of the KL expansions of Theorems 4.6–4.7. We will make use of the convention of writing these KL expansions under the form

$$Z(t) = \sum_{k=1}^{\infty} \lambda_k^{1/2}\,\omega_k\,e_k(t), \qquad 0<t\le 1, \qquad (4.58)$$

where, as in (4.33), $\lambda_k$ and $e_k(\cdot)$, for $k\ge 1$, are, respectively, the eigenvalues and eigenfunctions pertaining to $\{Z(t): 0<t\le 1\}$. The arguments above show that, whenever the KL expansion (4.58) holds, then, for each choice of $\theta>0$, the KL expansion of $t^{\frac12(\theta-1)}Z(t^\theta)$ on $(0,1]$ is given by

$$t^{\frac12(\theta-1)}\,Z(t^\theta) = \sum_{k=1}^{\infty} (\lambda_k/\theta)^{1/2}\,\omega_k\,\Big(\sqrt{\theta}\,t^{\frac12(\theta-1)}\,e_k(t^\theta)\Big), \qquad 0<t\le 1, \qquad (4.59)$$

with eigenvalues $\lambda_k^*$ and eigenfunctions $e_k^*(\cdot)$, for $k=1,2,\ldots$, given by

$$\lambda_k^* = \lambda_k/\theta \quad\text{and}\quad e_k^*(t) = \sqrt{\theta}\,t^{\frac12(\theta-1)}\,e_k(t^\theta). \qquad (4.60)$$

By combining (4.58)–(4.60) with Theorems 4.6–4.7, we obtain readily in (4.61)–(4.62) below the KL expansions of $t^\eta W(t^\theta)$ and $t^\eta B(t^\theta)$ (under the form (4.58)). For any $\theta>0$, $\eta>-\frac12(\theta+1)$, and $\gamma = \theta/(2\eta+\theta+1)>0$, we get namely

$$t^\eta W(t^\theta) = \sum_{k=1}^{\infty} \omega_k\,\frac{2\sqrt{\gamma}}{z_{\gamma-1,k}}\; t^{\frac{\theta}{2\gamma}-\frac12}\;\frac{J_\gamma\big(z_{\gamma-1,k}\,t^{\frac{\theta}{2\gamma}}\big)}{\big|J_\gamma(z_{\gamma-1,k})\big|}, \qquad (4.61)$$

$$t^\eta B(t^\theta) = \sum_{k=1}^{\infty} \omega_k\,\frac{2\sqrt{\gamma}}{z_{\gamma,k}}\; t^{\frac{\theta}{2\gamma}-\frac12}\;\frac{J_\gamma\big(z_{\gamma,k}\,t^{\frac{\theta}{2\gamma}}\big)}{\big|J_{\gamma-1}(z_{\gamma,k})\big|}. \qquad (4.62)$$

In particular, by setting $\theta=1$, $\eta=\nu$ and $\gamma = 1/(2(\nu+1))$ in (4.61)–(4.62), we get (4.54)–(4.55) and (4.56)–(4.57).
Of special interest here is the choice of $\theta = 2\gamma$ and $\eta = \frac12-\gamma$ in (4.61)–(4.62). For these values of the constants $\theta, \eta$, we obtain that, for each $\gamma>0$, the KL expansions of $t^{\frac12-\gamma}W(t^{2\gamma})$ and $t^{\frac12-\gamma}B(t^{2\gamma})$ are given by

$$t^{\frac12-\gamma}\,W(t^{2\gamma}) = \sum_{k=1}^{\infty} \omega_k\,\frac{2\sqrt{\gamma}}{z_{\gamma-1,k}}\; t^{\frac12}\;\frac{J_\gamma(z_{\gamma-1,k}\,t)}{\big|J_\gamma(z_{\gamma-1,k})\big|}, \qquad (4.63)$$

$$t^{\frac12-\gamma}\,B(t^{2\gamma}) = \sum_{k=1}^{\infty} \omega_k\,\frac{2\sqrt{\gamma}}{z_{\gamma,k}}\; t^{\frac12}\;\frac{J_\gamma(z_{\gamma,k}\,t)}{\big|J_{\gamma-1}(z_{\gamma,k})\big|}. \qquad (4.64)$$

By multiplying both sides of (4.63) by $t^{-1/2}$, we get a Dini series expansion of $t^{-\gamma}W(t^{2\gamma})$ on $(0,1)$. Proceeding likewise with (4.64) one obtains the Fourier–Bessel series expansion of $t^{-\gamma}B(t^{2\gamma})$ on $(0,1)$ (see, e.g., pp. 96–103 in [132]). Recall that, under suitable conditions on the functions $f(\cdot)$ and $g(\cdot)$ on $(0,1)$, it is possible to expand $f(\cdot)$ into the Fourier–Bessel expansion $f(t) = \sum_{k=1}^\infty a_k J_\nu(z_{\nu,k}\,t)$, with

$$a_k = \frac{2}{J_{\nu+1}^2(z_{\nu,k})}\int_0^1 t\,f(t)\,J_\nu(z_{\nu,k}\,t)\,dt, \qquad k\ge 1, \qquad (4.65)$$

and $g(\cdot)$ into a Dini expansion $g(t) = \sum_{k=1}^\infty b_k J_\nu(z_{\nu-1,k}\,t)$, with

$$b_k = \frac{2}{J_\nu^2(z_{\nu-1,k})}\int_0^1 t\,g(t)\,J_\nu(z_{\nu-1,k}\,t)\,dt, \qquad k\ge 1. \qquad (4.66)$$

By setting $\nu=\gamma$, $f(t) = t^{-\gamma}B(t^{2\gamma})$ and $g(t) = t^{-\gamma}W(t^{2\gamma})$ in (4.65)–(4.66), we get

$$a_k = \frac{2\sqrt{\gamma}\,\omega_k}{z_{\gamma,k}\,\big|J_{\gamma-1}(z_{\gamma,k})\big|} \quad\text{and}\quad b_k = \frac{2\sqrt{\gamma}\,\omega_k}{z_{\gamma-1,k}\,\big|J_\gamma(z_{\gamma-1,k})\big|} \qquad\text{for } k\ge 1. \qquad (4.67)$$
Put $\eta=0$ and $\gamma = \nu/(\nu+1)$ (i.e., $\theta=\nu$) in (4.61)–(4.62). Set, for notational simplicity, $\tilde z_{\nu,k} = z_{\nu/(\nu+1),k}$ and $\tilde z_{\nu-1,k} = z_{\nu/(\nu+1)-1,k}$. We so obtain the KL expansions

$$W(t^\nu) = \sum_{k=1}^{\infty} \omega_k\,\frac{2}{\tilde z_{\nu-1,k}}\sqrt{\frac{\nu}{\nu+1}}\; t^{\nu/2}\;\frac{J_{\nu/(\nu+1)}\big(\tilde z_{\nu-1,k}\,t^{\frac{\nu+1}{2}}\big)}{\big|J_{\nu/(\nu+1)}(\tilde z_{\nu-1,k})\big|}, \qquad (4.68)$$

$$B(t^\nu) = \sum_{k=1}^{\infty} \omega_k\,\frac{2}{\tilde z_{\nu,k}}\sqrt{\frac{\nu}{\nu+1}}\; t^{\nu/2}\;\frac{J_{\nu/(\nu+1)}\big(\tilde z_{\nu,k}\,t^{\frac{\nu+1}{2}}\big)}{\big|J_{\nu/(\nu+1)-1}(\tilde z_{\nu,k})\big|}. \qquad (4.69)$$

The KL expansion (4.69) has been obtained by Li ([140]) (see the proof of Theorem 1.6, pp. 24–25 in [140]), up to the normalizing factor, for $k=1,2,\ldots$,

$$c_k = \sqrt{\nu+1}\,\Big/\,\big\{J_{\nu/(\nu+1)-1}(\tilde z_{\nu,k})\big\},$$

of the eigenfunction in (4.69) (with the notation (4.58))

$$e_k(t) = c_k\,t^{\nu/2}\,J_{\nu/(\nu+1)}\big(\tilde z_{\nu,k}\,t^{\frac{\nu+1}{2}}\big),$$

left implicit in his work. In spite of the fact that it is possible to revert the previous arguments, starting with (4.69), in order to obtain an alternative proof of Theorem 4.7 based on [140], this only works for the values of $\gamma = \nu/(\nu+1)$ with $0<\gamma<1$ (since we must have $\nu>0$).

4.6. Bessel functions

4.6.1. Definition of Bessel functions. For each real constant $\nu\in\mathbb{R}$, we denote by $J_\nu(\cdot)$ the Bessel function of the first kind of index $\nu$. For our needs, it will be useful to recall some important properties of these functions (refer to [132] and [207] for details). The second order homogeneous differential equation

$$x^2y'' + xy' + (x^2-\nu^2)\,y = 0, \qquad (4.70)$$

has a fundamental set of solutions on $(0,\infty)$ of the form $y = Cx^\nu\sum_{k=0}^\infty a_k x^k$, where $C$ is a constant. These solutions are proportional to the Bessel function of the first kind (see, e.g., 9.1.69 in [1]), explicitly defined, for an arbitrary $\nu\in\mathbb{R}$, by

$$J_\nu(x) = \frac{(\frac12 x)^\nu}{\Gamma(\nu+1)}\;{}_0F_1\big(\nu+1;\,-\tfrac14 x^2\big) = \big(\tfrac12 x\big)^\nu\sum_{k=0}^{\infty}\frac{(-\tfrac14 x^2)^k}{\Gamma(\nu+k+1)\,\Gamma(k+1)}. \qquad (4.71)$$

When $\nu = -n$ is a negative integer, $\Gamma(\nu+k+1) = \Gamma(-n+k+1) = \infty$ for $k = 0,\ldots,n-1$, so that, making use of the convention $a/\infty = 0$ when $a\in\mathbb{R}$, the $n$ first terms in the series (4.71) vanish. In this case, we have the relation

$$J_{-n}(x) = (-1)^n\,J_n(x). \qquad (4.72)$$

In (4.71), we made use of the generalized hypergeometric function

$$ {}_0F_1(b;z) = \sum_{k=0}^{\infty}\frac{1}{(b)_k}\,\frac{z^k}{k!} \qquad\text{for } z\in\mathbb{C}, \qquad (4.73)$$

where the Pochhammer symbol $(b)_k$ is defined for $k\in\mathbb{N}$ by $(b)_k = \Gamma(b+k)/\Gamma(b)$ when $b\ne 0,-1,-2,\ldots$, and, for an arbitrary $b\in\mathbb{R}$, by

$$(b)_0 = 1 \quad\text{and}\quad (b)_k = b(b+1)\cdots(b+k-1) \quad\text{for } k\ge 1. \qquad (4.74)$$

The roots (or zeros) of $J_\nu(\cdot)$ have the following properties, in addition to (4.53) (see, e.g., Ch. XV, pp. 478–521 in [207], p. 96 in [132]). For any $\nu>-1$, $J_\nu(\cdot)$ has only real roots. Moreover, in this case, the positive roots of $J_\nu(\cdot)$ are isolated and form an increasing sequence

$$0<z_{\nu,1}<z_{\nu,2}<\ldots, \qquad (4.75)$$

such that, for any fixed $k\ge 1$, $z_{\nu,k}$ is a continuous and increasing function of $\nu>-1$. In addition, for any specified $\nu>-1$, as $k\to\infty$,

$$z_{\nu,k} = \Big(k+\tfrac12\big(\nu-\tfrac12\big)\Big)\pi \;-\; \frac{4\nu^2-1}{8\big(k+\tfrac12(\nu-\tfrac12)\big)\pi} \;+\; O\Big(\frac{1}{k^3}\Big). \qquad (4.76)$$

Remark 4.2. For $\nu=-\frac12$ and $\nu=\frac12$, we have (see (4.79) below)

$$z_{-\frac12,k} = \big(k-\tfrac12\big)\pi \quad\text{and}\quad z_{\frac12,k} = k\pi \quad\text{for } k=1,2,\ldots, \qquad (4.77)$$

so that, in either of these cases, $z_{\nu,k}$ reduces to the first term in (4.76).
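
The quality of the approximation (4.76) is readily checked numerically. The following sketch (Python; locating the zeros by bracketing sign changes of scipy.special.jv is a device of this illustration, not part of the text) compares the first zeros of $J_\nu$ with the two leading terms of (4.76):

```python
# McMahon-type asymptotics (4.76) for the zeros of J_nu, nu = 1.7.
import numpy as np
from scipy.special import jv
from scipy.optimize import brentq

def bessel_zeros(nu, kmax, xmax=100.0, ngrid=20000):
    x = np.linspace(1e-3, xmax, ngrid)
    s = np.sign(jv(nu, x))
    idx = np.where(s[:-1] * s[1:] < 0)[0][:kmax]
    return np.array([brentq(lambda u: jv(nu, u), x[i], x[i + 1]) for i in idx])

nu = 1.7
k = np.arange(1, 11)
approx = (k + 0.5 * (nu - 0.5)) * np.pi \
         - (4 * nu**2 - 1) / (8 * (k + 0.5 * (nu - 0.5)) * np.pi)
print(np.abs(bessel_zeros(nu, 10) - approx))     # errors decay like k**(-3)
```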

An alternative definition of the Bessel function $J_\nu(\cdot)$ makes use of Euler's formula (see, e.g., (2)–(3) p. 498 in [207])

$$J_\nu(z) = \frac{(\frac12 z)^\nu}{\Gamma(\nu+1)}\prod_{k=1}^{\infty}\Big(1-\frac{z^2}{z_{\nu,k}^2}\Big) \qquad\text{for } z>0. \qquad (4.78)$$
4.6.2. Some special cases. The expression (4.71) of the Bessel function $J_\nu(\cdot)$ of the first kind can be simplified when $\nu = m+\frac12$ for an integer $m = -1, 0, 1, \ldots$. In particular, for $m=-1$ and $m=0$,

$$J_{-\frac12}(x) = \sqrt{\frac{2}{\pi x}}\,\cos x \quad\text{and}\quad J_{\frac12}(x) = \sqrt{\frac{2}{\pi x}}\,\sin x. \qquad (4.79)$$

For $m\ge 0$, we get

$$J_{m+\frac12}(x) = (-1)^m\sqrt{\frac{2}{\pi}}\;x^{m+\frac12}\,\Big(\frac{1}{x}\,\frac{d}{dx}\Big)^{m}\Big(\frac{\sin x}{x}\Big). \qquad (4.80)$$

In general, for an arbitrary integer $m\ge -1$, $J_{m+\frac12}(\cdot)$ is of the form

$$J_{m+\frac12}(x) = \sqrt{\frac{2}{\pi x}}\,\Big\{Q_m\Big(\frac1x\Big)\sin x - P_m\Big(\frac1x\Big)\cos x\Big\}, \qquad (4.81)$$

where $P_m(\cdot)$ and $Q_m(\cdot)$ are polynomials. The first terms of the sequence are

$$P_{-1}(u) = -1,\quad Q_{-1}(u) = 0,\quad P_0(u) = 0,\quad Q_0(u) = 1. \qquad (4.82)$$
Lemma 4.3. For an arbitrary $m\ge 0$, we have the recurrence formulas

$$Q_{m+1}(u) = (2m+1)\,u\,Q_m(u) - Q_{m-1}(u), \qquad (4.83)$$
$$P_{m+1}(u) = (2m+1)\,u\,P_m(u) - P_{m-1}(u). \qquad (4.84)$$

Proof. We have

$$J_{m+\frac32}(x) = \frac{2m+1}{x}\,J_{m+\frac12}(x) - J_{m-\frac12}(x),$$

so that (4.83)–(4.84) is straightforward. □
By combining (4.81)–(4.82) with (4.83)–(4.84), we get

$$J_{\frac32}(x) = \sqrt{\frac{2}{\pi x}}\,\Big(\frac{\sin x}{x}-\cos x\Big), \qquad (4.85)$$

$$J_{\frac52}(x) = \sqrt{\frac{2}{\pi x}}\,\Big(\frac{3\sin x}{x^2}-\sin x-\frac{3\cos x}{x}\Big). \qquad (4.86)$$
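
As a quick sanity check of (4.85)–(4.86), one may compare these closed forms with a reference implementation (SciPy assumed available):

```python
# Check (4.85)-(4.86) against scipy.special.jv.
import numpy as np
from scipy.special import jv

x = np.linspace(0.5, 20.0, 7)
j32 = np.sqrt(2.0 / (np.pi * x)) * (np.sin(x) / x - np.cos(x))
j52 = np.sqrt(2.0 / (np.pi * x)) * ((3.0 / x**2 - 1.0) * np.sin(x)
                                    - 3.0 * np.cos(x) / x)
print(np.max(np.abs(jv(1.5, x) - j32)),
      np.max(np.abs(jv(2.5, x) - j52)))          # both of the order 1e-16
```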

References
[1] Abramowitz, M. and Stegun, I.A. (1965). Handbook of Mathematical Functions.
Dover, New York.
[2] Acosta, A. de and Kuelbs, J. (1983). Limit theorems for moving averages of independent random vectors. Z. Wahrscheinlichkeitstheor. Verw. Geb. 64 67–123.
[3] Adler, R.J. (1990). An Introduction to Continuity, Extrema and Related Topics for
General Gaussian Processes. IMS Lecture Notes-Monograph Series 12. Institute of
Mathematical Statistics, Hayward, California.
[4] Akaike, H. (1954). An approximation of the density function. Ann. Inst. Statist.
Math. 6 127132.
[5] Anderson, T.W. and Darling, D.A. (1952). Asymptotic theory of certain goodness of fit criteria based on stochastic processes. Ann. Math. Statist. 23 193–212.
[6] Araujo, A. and Gine, E. (1980). The Central Limit Theorem for Real and Banach
Valued Random Variables. Wiley, New York.
[7] Ash, R.B. and Gardner, M.F. (1975). Topics in Stochastic Processes. Academic
Press, New York.
[8] Bahadur, R.R. (1967). A note on quantiles in large samples. Ann. Math. Statist. 37
577580.
[9] Bahadur, R.R. (1971). Some Limit Theorems in Statistics. Regional Conference
Series in Applied Mathematics. 4. S.I.A.M., Philadelphia.
[10] Bauer, H. (1981). Probability Theory and Elements of Measure Theory. Academic
Press, New York.
[11] del Barrio, E., Cuesta-Albertos, J.A. and Matrán, C. (2000). Contributions of empirical and quantile processes to the asymptotic theory of goodness-of-fit tests. Test 9 1–96.
[12] Bartfai, P. (1966). Die Bestimmung der zu einem wiederkehrenden Prozess gehörenden Verteilungsfunktion aus den mit Fehlern behafteten Daten einer einzigen Relation. Studia Sci. Math. Hung. 1 161–168.
[13] Bartlett, M.S. (1963). Statistical estimation of density functions. Sankhya. Ser. A
25 245254.
[14] Berkes, I. and Philipp, W. (1979). Approximation theorems for independent and
weakly dependent random vectors. Ann. Probab. 7 2954.
[15] Berlinet, A. (1993). Hierarchies of higher-order kernels. Prob. Theor. Rel. Fields 94
489504.
[16] Berlinet, A. and Devroye, L. (1994). A comparison of kernel density estimates. Publ.
Inst. Statist. Univ. Paris 38 359.
[17] Bickel, P. and Rosenblatt, M. (1973). On some global measures of the deviation of
density function estimates. Ann. Statist. 1 10711095.
[18] Bickel, P. and Rosenblatt, M. (1975). Corrections to On some global measures of
the deviation of density function estimates. Ann. Statist. 3 1370.
[19] Billingsley, P. (1968). Convergence of Probability Measures. John Wiley & Sons,
New York.
[20] Borell, C. (1975). The Brunn-Minkovski inequality in Gauss space. Invent. Math.
30 207216.
[21] Borell, C. (1976). Gaussian Radon measures on locally convex spaces. Math. Scand.
38 265284.
[22] Borell, C. (1977). A note on Gauss measures which agree on balls. Ann. Inst. H.
Poincare Ser. B 13 231238.
[23] Bosq, D. and Lecoutre, J.-P. (1987). Theorie de lEstimation Fonctionnelle. Eco-
nomica, Paris.
[24] Bowman, F. (1958). Introduction to Bessel Functions. Dover, New York.
[25] Bowman, A., Hall, P. and Prvan, T. (1998). Bandwidth selection for the smoothing
of distribution functions. Biometrika 85 799808.
[26] Bretagnolle, J. and Massart, P. (1989). Hungarian constructions from the non-
asymptotic viewpoint. Ann. Probab. 17 239256.
[27] Castelle, N. and Laurent-Bonvalot, F. (1998). Strong approximations of bivariate uniform empirical processes. Ann. Inst. Henri Poincaré 34 425–480.
[28] Castelle, N. (2002). Approximations fortes pour des processus bivariés. Canad. J. Math. 54 533–553.
[29] Cheng, P.E., and Bai, Z. (1995). Optimal strong convergence rates in nonparametric
regression. Math. Meth. of Statist. 4 405420.
[30] Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sums of observations. Ann. Math. Statist. 23 493–507.
[31] Chung, K.L. (1949). An estimate concerning the Kolmogorov limit distributions.
Trans. Amer. Math. Soc. 67 3650.
[32] Ciesielski, Z. and Taylor, S.J. (1962). First passage times and sojourn times for Brownian motion in space and the exact Hausdorff measure of the sample path. Trans. Amer. Math. Soc. 103 434–450.
[33] Collomb, G. (1977). Quelques propriétés de la méthode du noyau pour l'estimation non-paramétrique de la régression en un point fixe. C. R. Acad. Sci. Paris 285 A 289–292.
[34] Collomb, G. (1979). Conditions nécessaires et suffisantes de convergence uniforme d'un estimateur de la régression, estimation des dérivées de la régression. C. R. Acad. Sci. Paris 288 161–163.
[35] Collomb, G. (1981). Estimation non-paramétrique de la régression: revue bibliographique. Int. Statist. Rev. 49 75–93.
[36] Csaki, E. (1980). On the standardized empirical distribution function. Coll. Math. Soc. Janos Bolyai 32 123–138. Nonparametric Statistical Inference. Akadémiai Kiadó, Budapest.
[37] Csorgo, M. and Horvath, L. (1986). Approximations of weighted empirical and
quantile processes. Statist. and Probab. Letters. 4 275280.
[38] Csorgo, M. and Horvath, L. (1993). Weighted Approximations in Probability and
Statistics. Wiley, New York.
[39] Csorgo, M. and Revesz, P. (1978). Strong approximation of the quantile process.
Ann. Statist. 6 882894.
[40] Csorgo, M. and Revesz, P. (1981). Strong Approximations in Probability and Sta-
tistics. Academic Press, New York.
[41] Csorgo, M., Csorgo, S., Horvath, L. and Mason, D.M. (1986). Weighted empirical
and quantile processes. Ann. Probab. 14 86118.
[42] DallAglio, G., Kotz, S. and Salinetti, G. (1991). Advances in Probability Distribu-
tions with Given Marginals. Kluwer, Dordrecht.
[43] Darling, D.A. (1957). The Kolmogorov-Smirnov, Cramer-von Mises tests. Ann.
Math. Statist. 28 823838.
[44] David, H.A. (1981). Order Statistics. 2nd Ed. Wiley, New York.
[45] Deheuvels, P. (1974). Conditions nécessaires et suffisantes de convergence presque sûre et uniforme presque sûre des estimateurs de la densité. C. R. Acad. Sci. Paris, Sér. A 278 1217–1220.
[46] Deheuvels, P. (1977a). Estimation non paramétrique de la densité par histogrammes généralisés. Publications de l'Institut de Statistique de l'Université de Paris 22 1–24.
[47] Deheuvels, P. (1977b). Estimation non paramétrique de la densité par histogrammes généralisés (II). Revue de Statistique Appliquée 25 5–42.
[48] Deheuvels, P. (1979). Propriétés d'existence et propriétés topologiques des fonctions de dépendance. C. R. Acad. Sci. Paris, Sér. A 288 217–220.
[49] Deheuvels, P. (1986). Strong laws for the k-th order statistic when k c log2 n.
Probab. Theory Related Fields. 72 133154.
[50] Deheuvels, P. (1990). Laws of the iterated logarithm for density estimators. In:
Nonparametric Functional Estimation and Related Topics. Roussas, G. (Ed.) 19
20. Kluwer, Dordrecht.
[51] Deheuvels, P. (1991). Functional Erdos-Renyi laws. Studia Sci. Math. Hungar. 26
261295.
[52] Deheuvels, P. (1992). Functional laws of the iterated logarithm for large increments
of empirical and quantile processes. Stochastic Processes Appl. 43 133163.
[53] Deheuvels, P. (1996). Functional laws of the iterated logarithm for small increments
of empirical processes. Statistica Neerlandica. 50 261280.
[54] Deheuvels, P. (1997). Strong laws for local quantile processes. Ann. Probab. 25
20072054.
[55] Deheuvels, P. (1998). On the approximation of quantile processes by Kiefer pro-
cesses. J. Theor. Probab. 11 9971018.
[56] Deheuvels, P. (2000). Limit laws for kernel density estimators for kernels with un-
bounded supports. Asymptotics in Statistics and Probability. M.L. Puri (Ed.) 117
132. VSP. International Science Publishers, Amsterdam.
[57] Deheuvels, P. (2000). Strong approximations of quantile processes by iterated Kiefer
processes. Ann. Probab. 28 909945.
[58] Deheuvels, P., Devroye, L. and Lynch, J. (1986). Exact convergence rates in the
limit theorems of Erdos-Renyi and Shepp. Ann. Probab. 14 209223.
[59] Deheuvels, P. and Devroye, L. (1987). Limit laws of Erdos-Renyi-Shepp type. Ann. Probab. 15 1363–1386.
[60] Deheuvels, P. and Einmahl, J. H. J. (2000). Functional limit laws for the increments
of Kaplan-Meier product-limit processes and applications. Ann. Probab. 28 1301
1335.
[61] Deheuvels, P. and Lifshits, M.A. (1993). Strassen-type functional laws for strong
topologies. Probab. Theory Related Fields. 97 151167.
[62] Deheuvels, P. and Lifshits, M.A. (1994). Necessary and sufficient conditions for the Strassen law of the iterated logarithm in nonuniform topologies. Ann. Probab. 22 1838–1856.
[63] Deheuvels, P. and Martynov, G. (2003). Karhunen-Loeve expansions for weighted
Wiener processes and Brownian bridges via Bessel functions. Progress in Probability.
55 5793.
[64] Deheuvels, P. and Mason, D.M. (1990a). Bahadur-Kiefer-type processes. Ann.
Probab. 18 669697.
[65] Deheuvels, P. and Mason, D.M. (1990b). Nonstandard Functional laws of the iter-
ated logarithm for tail empirical and quantile processes. Ann. Probab. 18 16931722.
[66] Deheuvels, P. and Mason, D.M. (1992a). Functional laws of the iterated logarithm
for the increments of empirical and quantile processes. Ann. Probab. 20 12481287.
[67] Deheuvels, P. and Mason, D.M. (1992b). A functional L.I.L. approach to pointwise
Bahadur-Kiefer theorems. In Probability in Banach Spaces. (R.M. Dudley, M. Hahn
and J. Kuelbs, eds.) 8 255266. Birkhauser, Boston.
[68] Deheuvels, P. and Mason, D.M. (1994a). Functional laws of the iterated logarithm
for local empirical processes indexed by sets. Ann. Probab. 22 16191661.
[69] Deheuvels, P. and Mason, D.M. (1994b). Random fractals generated by oscillations of processes with stationary and independent increments. In: Probability in Banach Spaces 9 (Hoffmann-Jørgensen, J., Kuelbs, J. and Marcus, M.B., Eds.) 73–90. Birkhäuser, Boston.
[70] Deheuvels, P. and Mason, D.M. (2004). General asymptotic confidence bands based on kernel-type function estimators. Statist. Inference for Stoch. Processes 7 225–277.
[71] Deheuvels, P. and Steinebach, J. (1987). Exact convergence rates in strong approx-
imation laws for large increments of partial sums. Probab. Theor. Related Fields.
76 369393.
[72] Derzko, G. and Deheuvels, P. (2002). Estimation non-parametrique de la regression
dichotomique - application biomedicale. C. R. Acad. Sci. Paris Ser. I 334 5963.
[73] Deuschel, J.D. and Stroock, D.W. (1989). Large Deviations. Academic Press, New
York.
[74] Devroye, L. (1978). The uniform convergence of the Nadaraya-Watson regression
function estimate. Canad. J. Statist. 6 179191.
[75] Devroye, L. (1977). A uniform bound for the deviation of empirical distribution
functions. J. Multivariate Analysis. 7 594597.
[76] Devroye, L. (1982). Bounds for the uniform deviation of empirical measures. J.
Multivariate Analysis. 12 7279.
[77] Devroye, L. (1987). A Course in Density Estimation. Birkhauser-Verlag, Boston.
[78] Devroye, L. and Lugosi, G. (2001). Combinatorial Methods in Density Estimation.
Springer, New York.
[79] Devroye, L. and Györfi, L. (1985). Nonparametric Density Estimation: The L1 View. Wiley, New York.
[80] Doob, J.L. (1953). Stochastic Processes. Wiley, New York.
[81] Dudley, R.M. and Philipp, W. (1983). Invariance principles for sums of Banach
space valued random elements and empirical processes indexed by sets. Z. Wahrsch.
Verw. Gebiete. 62 509552.
[82] Dudley, R.M. (1999). Uniform Central Limit Theorems. Cambridge University
Press, Cambridge.
[83] Dudley, R.M. (2002). Real Analysis and Probability. Cambridge University Press,
Cambridge.
[84] Dugundji, J. (1966). Topology. Allyn and Bacon, Boston.
[85] Durbin, J. (1973). Distribution Theory for Tests Based on the Sample Distribution
Function. Regional Conference Series in Applied Mathematics, 9 S.I.A.M., Philadel-
phia.
[86] Dvoretzky, A., Kiefer, J. and Wolfowitz, J. (1956). Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Ann. Math. Statist. 27 642–669.
[87] Eggermont, P.P.B. and La Riccia, V.N. (2001). Maximum Penalized Likelihood Es-
timation. Springer, New York.
[88] Eicker, F. (1979). The asymptotic distribution of the suprema of the standardized
empirical processes. Ann. Statist. 7 116138.
[89] Einmahl, J.H.J. (1987). Multivariate Empirical Processes. C.W.I. Tract 32. Math-
ematisch Centrum, Amsterdam.
[90] Einmahl, U. (1986). A refinement of the KMT-inequality for partial sum strong approximation. Technical Report Series of the Laboratory for Research in Statistics and Probability, Carleton University 88, Ottawa, Canada.
[91] Einmahl, U. (1988). Strong approximations for partial sums of i.i.d. B-valued r.v.s
in the domain of attraction of a Gaussian law. Probab. Theor. Rel. Fields. 77 6585.
[92] Einmahl, U. (1989). Extensions of results of Komlos, Major and Tusnady to the
multivariate case. J. Multivariate Anal. 28 2068.
[93] Einmahl, J.H.J. and Mason, D.M. (1985). Bounds for weighted multivariate empir-
ical distribution functions. Z. Wahrsch. Verw. Gebiete. 70 563571.
[94] Einmahl, J.H.J. and Mason, D.M. (1988). Strong limit theorems for weighted quan-
tile processes. Ann. Probab. 16 16231643.
[95] Einmahl, U. and Mason, D.M. (2000). An empirical process approach to the uniform consistency of kernel-type function estimators. J. Theoretical Probab. 13 1–37.
[96] Epanechnikov, V.A. (1969). Nonparametric estimation of a multivariate probability
density. Theor. Probab. Appl. 14 153158.
[97] Erdos, P. and Renyi, A. (1970). On a new law of large numbers. J. Analyse Math.
23 103111.
[98] Finkelstein, H. (1971). The law of the iterated logarithm for empirical distributions.
Ann. Math. Statist. 42 607615.
[99] Gaenssler, P. (1983). Empirical Processes. Vol. 3, IMS Lecture Notes-Monograph
Series, Institute of Mathematical Statistics, Hayward.
[100] Gaenssler, P. and Stute, W. (1979). Empirical process: a survey of results for inde-
pendent and identically distributed random variables. Ann. Probab. 7 193243.
[101] Gasser, T., Muller, H.G. and Mammitzsch, V. (1985). Kernels for nonparametric
curve estimation. J. R. Statist. Soc. Ser. B 47 238252.
[102] Gikhman, I.I. (1957). On a nonparametric criterion of homogeneity for k samples.
Theor. Probab. Appl. 2 369373.
[103] Goodman, V., Kuelbs, J. and Zinn, J. (1981). Some results on the LIL in Banach
space with application of weighted empirical process. Ann. Probab. 8 713752.
[104] Györfi, L., Kohler, M., Krzyzak, A. and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer, New York.
[105] Hall, P. (1991). On iterated logarithm laws for linear arrays and nonparametric
regression estimators. Ann. Probab. 19 740757.
[106] Hall, P. and Heyde, C.C. (1980). Martingale Limit Theory and its Application.
Academic Press, New York.
[107] Hall, P. and Marron, J.S. (1991). Lower bounds for bandwidth selection in density
estimation. Probab. Theor. Rel. Fields. 90 149173.
[108] Hall, P., Marron, J.S. and Park, B.U. (1992). Smoothed cross-validation. Probab.
Theor. Rel. Fields 92 120.
[109] Hardle, W. (1984). A law of the iterated logarithm for nonparametric regression
function estimators. Ann. Statist. 12 624635.
[110] Hardle, W. (1990). Applied Nonparametric Regression. Cambridge University Press,
Cambridge.
[111] Hardle, W., Janssen, P. and Serfling, R. (1988). Strong uniform consistency rates of estimators of conditional functionals. Ann. Statist. 16 1428–1449.
[112] Hartman, P. and Wintner, A. (1941). On the law of iterated logarithm. Amer. J.
Math. 63 169176.
[113] Hognas, G. (1977). Characterization of weak convergence of signed measures on
[0, 1]. Math. Scand. 41 175184.
[114] Hurewicz, W. and Wallman, H. (1948). Dimension Theory. Princeton University Press, Princeton.
[115] Izenman, A.J. (1991). Recent developments in nonparametric density estimation.
J. Amer. Statist. Assoc. 86 205224.
[116] Jaeschke, D. (1979). The asymptotic distribution of the supremum of the standard-
ized empirical distribution on subintervals. Ann. Statist. 7 108115.
[117] Jones, M.C., Marron, J.S. and Park, B.U. (1991). A simple root n bandwidth se-
lector. Ann. Statist. 19 19191932.
[118] Jones, M.C., Marron, J.S. and Sheather, S.J. (1996). A brief survey of bandwidth
selection for density estimation. J. Amer. Statist. Assoc. 91 401407.
[119] Kac, M. (1951). On some connections between probability theory and differential and integral equations. Proc. Second Berkeley Sympos. Math. Statist. Probab. 180–215.
[120] Kac, M. (1980). Integration in Function Spaces and Some of its Applications.
Lezioni Ferniane, Academia Nazionale dei Lincei. Pisa.
[121] Kac, M. and Siegert, A.J.F. (1947). On the theory of noise in radio receivers with
square law detectors. J. Appl. Physics. 18 383397.
[122] Kac, M. and Siegert, A.J.F. (1947). An explicit representation of a stationary Gauss-
ian process. Ann. Math. Statist. 18 438442.
[123] Kiefer, J. (1959). K-sample analogues of the Kolmogorov-Smirnov and Cramer-V.
Mises tests. Ann. Math. Statist. 30 420447.
[124] Kiefer, J. (1967). On Bahadurs representation of sample quantiles. Ann. Math.
Statist. 38 13231342.
[125] Kiefer, J. (1970). Deviations between the sample quantile process and the sample
d.f. In Nonparametric Techniques in Statistical Inference. (M. Puri, ed.) 299319.
Cambridge Univ. Press.
[126] Kiefer, J. (1972a). Iterated logarithm analogues for sample quantiles when pn 0.
Proc. Sixth Berkeley Symp. Math. Statist. Probab. 1 227244. Univ. California Press,
Berkeley.
[127] Kiefer, J. (1972b). Skorohod embedding of multivariate rvs and the sample df. Z.
Wahrsch. Verw. Gebiete. 24 135.
[128] Kolmogorov, A.N. (1933). Sulla determinazione empirica di una legge di distribuzione. Giorn. Ist. Ital. Attuari 4 83–91.
[129] Komlos, J., Major, P. and Tusnady, G. (1975). An approximation of partial sums
of independent r.v.s and the sample df. I. Z. Wahrsch. Verw. Gebiete. 32 111131.
[130] Komlos, J., Major, P. and Tusnady, G. (1975). An approximation of partial sums
of independent r.v.s and the sample df. II. Z. Wahrsch. Verw. Gebiete. 34 3358.
[131] Konakov, V.D. and Piterbarg, V.I. (1984). On the convergence rate of maximal deviation distribution of kernel regression estimates. J. Multivariate Anal. 15 279–294.
[132] Korenev, B.G. (2002). Bessel Functions and their Applications. Taylor & Francis,
London.
[133] Krzyzak, A., and Pawlak, M. (1984). Distribution-free consistency of a nonpara-
metric kernel regression estimate and classication. IEEE Trans. on Information
Theory. 30 7881.
[134] Lai, T.L. (1974). Reproducing kernel Hilbert spaces and the law of the iterated
logarithm for Gaussian processes. Z. Wahrscheinlichkeitstheorie Verw. Gebiete. 29
719.
[135] Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry
and Processes. Springer, Berlin.
[136] Ledoux, M. (1996). On Talagrand's deviation inequalities for product measures. ESAIM: Probab. Statist. 1 63–87. http://www.emath.fr/ps/
[137] Lévy, P. (1937). Théorie de l'Addition des Variables Aléatoires. Gauthier-Villars, Paris.
[138] Lévy, P. (1951). Wiener random functions and other Laplacian random functions. Proc. Second Berkeley Sympos. Probab. Theory Math. Statist. 2 171–186.
[139] Lévy, P. (1953). La mesure de Hausdorff de la courbe du mouvement brownien. Giorn. Ist. Ital. Attuari 16 1–37.
[140] Li, W.V. (1992a). Comparison results for the lower tail of Gaussian seminorms. J. Theor. Probab. 5 1–31.
[141] Li, W.V. (1992b). Limit theorems for the square integral of Brownian motion and
its increments. Stoch. Processes Appl. 41 223239.
[142] Li, W.V. (1992c). Lim inf results for the Wiener process and its increments under
the L2 -norm. Prob. Th. Rel. Fields. 92 6990.
[143] Loader, C.R. (1999). Bandwidth selection: classical or plug-in? Ann. Statist. 27
415438.
[144] Lynch, J. and Sethuraman, J. (1987). Large deviations for processes with indepen-
dent increments. Ann. Probab. 15 610627.
[145] Maccone, C. (1984). Eigenfunctions and energy for time-rescaled Gaussian pro-
cesses. Boll. Un. Mat. ital. 6 213219.
[146] Major, P. (1976a). The approximation of partial sums of independent r.v.'s. Z. Wahrscheinlichkeitstheorie Verw. Gebiete 35 213–220.
[147] Major, P. (1976b). The approximation of partial sums of i.i.d.r.v.s. when the sum-
mands have only two moments. Z. Wahrscheinlichkeitstheorie Verw. Gebiete. 35
221230.
[148] Major, P. (1979). An improvement of Strassen's invariance principle. Ann. Probab. 7 55–61.
[149] Marron, J.S. and Nolan, D. (1988). Canonical kernels for density estimation. Statist.
Probab. Letters. 7 195199.
[150] Martynov, G.V. (1975). Computation of the distribution function of quadratic forms
in normal random variables. Theor. Probab. Appl. 20 797809.
[151] Martynov, G.V. (1977). Generalization of the Smirnov formula for the quadratic
forms distributions. Theor. Probab. Appl. 22 614620.
[152] Martynov, G.V. (1992). Statistical tests based on empirical processes and related
questions. J. Soviet. Math. 61 21952275.
[153] Mason, D.M. (1984). A strong limit theorem for the oscillation modulus of the
uniform empirical quantile process. Stoch. Processes Appl. 17 126136.
[154] Mason, D.M., Shorack, G. and Wellner, J.A. (1983). Strong limit theorems for
oscillation moduli of the empirical process. Z. Wahrscheinlichkeit. verw. Gebiete.
61 369373.
[155] Mason, D.M. and van Zwet, W.R. (1987). A refinement of the KMT inequality for the uniform empirical process. Ann. Probab. 15 871–884.
[156] Mason, D.M. (1988). A strong invariance theorem for the tail empirical process.
Ann. Inst. H. Poincare Probab. Statist. 24 491506.
[157] Massart, P. (1990). About the constant in the DKW inequality. Ann. Probab. 18
12691283.
[158] Muller, H.G. (1988). Nonparametric Regression Analysis of Longitudinal Data.
Springer, New York.
[159] Nadaraya, E.A. (1964). On estimating regression. Theor. Probab. Appl. 9 141142.
[160] Nadaraya, E.A. (1970). Remarks on nonparametric estimates for density function
and regression curves. Theor. Probab. Appl. 15 134137.
[161] Nadaraya, E.A. (1989). Nonparametric Estimation of Probability Densities and Re-
gression Curves. Kluwer, Dordrecht.
[162] Park, B.U. and Marron, J.S. (1990). Comparison of data-driven bandwidth selec-
tors. J. Amer. Statist. Assoc. 85 6672.
[163] Parzen, E. (1962). On estimation of a probability density function and mode. Ann.
Math. Statist. 33 10651076.
[164] Pollard, D. (1984). Convergence of Stochastic Processes. Springer-Verlag, New York.
[165] Prakasa Rao, B.L.S. (1983). Nonparametric Functional Estimation. Academic
Press, New York.
[166] Revesz, P. (1979a). On nonparametric estimation of the regression function. Prob-
lems of Control and Information Theory. 8 297302.
[167] Revesz, P. (1979b). A generalization of Strassen's functional law of the iterated logarithm for Gaussian processes. Z. Wahrscheinlichkeitstheor. Verw. Geb. 50 257–264.
[168] Rice, J. (1984). Bandwidth choice for nonparametric regression. Ann. Statist. 12
12151230.
[169] Rosenblatt, M. (1952). Remarks on a multivariate transformation. Ann. Math.
Statist. 23 470472.
[170] Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density
function. Ann. Math. Statist. 27 832837.
[171] Roussas, G. (1990). Nonparametric Functional Estimation and Related Topics.
NATO ASI Series 355. Kluwer, Dordrecht.
[172] Rudin, W. (1979). Real and Complex Analysis. 3rd ed., McGraw Hill, New York.
[173] Rudin, W. (1973). Functional Analysis. Tata McGraw-Hill, New Delhi.
[174] Schilder, M. (1966). Asymptotic formulas for Wiener integrals. Trans. Amer. Math.
Soc. 125 6385.
[175] Schucany, W.R. (1989a). On nonparametric regression with higher-order kernels.
J. Stat. Plann. Inference. 23 145151.
[176] Schucany, W.R. (1989b). Locally optimal window widths for kernel density estima-
tion with large samples. Statist. Probab. Letters 7 401405.
[177] Schucany, W.R. and Sommers, J.P. (1977). Improvements of kernel type density
estimators. J. Amer. Statist. Assoc. 72 420423.
[178] Scott, D.W. (1992). Multivariate Density Estimation Theory, Practice and Visu-
alization. Wiley, New York.
[179] Scott, W.F. (1999). A weighted Cramer-von Mises statistic, with some applications
to clinical trials. Commun. Statist. Theor. Methods. 28 30013008.
[180] Scott, D.W. and Terrel, G.R. (1987). Biased and unbiased cross-validation in density
estimation. J. Amer. Statist. Assoc. 82 11311146.
[181] Sheather, S.J. and Jones, M.C. (1991). A reliable data-based bandwidth selection
method for kernel density estimation. J. Roy. Statist. Soc. Ser.B 53 683690.
[182] Shorack, G.R. (1982). Kiefer's theorem via the Hungarian construction. Z. Wahrsch. Verw. Gebiete. 34 3358.
[183] Shorack, G.R. and Wellner, J.A. (1986). Empirical Processes with Applications to
Statistics. Wiley, New York.
[184] Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chap-
man and Hall, London.
[185] Singh, R.S. (1979). Mean squared errors of estimates of a density and its derivatives.
Biometrika 66 177180.
[186] Singh, R.S. (1987). MISE of kernel estimates of a density and its derivatives. Stat.
and Probab. Letters 5 153159.
[187] Sklar, A. (1959). Fonctions de repartition a n dimensions et leurs marges. Publ.
Inst. Statist. Univ. Paris. 8 229231.
[188] Skorohod, A.V. (1976). On a representation of random variables. Theory Probab.
Appl. 21 628632.
[189] Smirnov, N.V. (1936). Sur la distribution de ω². C. R. Acad. Sci. Paris 202 449–452.
[190] Smirnov, N.V. (1937). On the distribution of the ω² criterion. Rec. Math. (Mat. Sbornik) 6 3–26.
[191] Smirnov, N.V. (1939). On the estimation of the discrepancy between empirical curves of distribution for two independent samples. Bull. Math. de l'Université de Moscou 2.
[192] Smirnov, N.V. (1948). Table for estimating the goodness of t of empirical distri-
butions. Ann. Math. Statist. 19 279281.
[193] Spiegelman, J. and Sacks, J. (1980). Consistent window estimation of nonparametric
regression. Ann. Statist. 8 240246.
[194] Steen, L.A. and Seebach, J.A. (1978). Counterexamples in Topology. 2nd Ed.
Springer, New York.
[195] Stein, E.M. (1970). Singular Integrals and Dierentiability Properties of Functions,
Princeton University Press, Princeton, New Jersey.
[196] Stone, C.J. (1977). Consistent nonparametric regression. Ann. Statist. 5 595645.
[197] Stone, C.J. (1980). Optimal rates of convergence for nonparametric estimators.
Ann. Statist. 8 13481360.
[198] Stone, C.J. (1982). Optimal global rates of convergence for nonparametric regres-
sion. Ann. Statist. 10 10401053.
[199] Strassen, V. (1964). An invariance principle for the law of the iterated logarithm.
Z. Wahrsch. Verw. Gebiete. 3 211226.
[200] Stute, W. (1982). The oscillation behaviour of empirical processes. Ann. Probab. 10
86107.
[201] Talagrand, M. (1994). Sharper bounds for Gaussian and empirical processes. Ann. Probab. 22 28–76.
[202] Tapia, R.A. and Thompson, J.R. (1978). Nonparametric Probability Density Esti-
mation. Johns Hopkins University Press, Baltimore.
[203] Terrel, G.R. (1990). The maximal smoothing principle in density estimation. J.
Amer. Statist. Assoc. 85 470477.
[204] van der Vaart, A.W. and Wellner, J.A. (1996). Weak Convergence and Empirical
Processes. Springer Verlag, New York.
[205] Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. Chapman and Hall, Lon-
don.
[206] Varadhan, S.R.S. (1966). Asymptotic probabilities and dierential equations.
Comm. Pure Appl. Math. 19 261286.
[207] Watson, G.N. (1952). A Treatise on the Theory of Bessel Functions. Cambridge
University Press, Cambridge.
[208] Watson, G.S. (1964). Smooth regression analysis. Sankhya A26 359372.
[209] Watson, G.S. and Leadbetter, M.R. (1963). On the estimation of probability den-
sity, I. Ann. Math. Statist. 34 480491.
[210] Wertz, W. (1972). Fehlerabschatzung fur eine Klasse von nichtparametrischen
Schatzfolgen. Metrika. 19 132139.
[211] Wertz, W. (1978). Statistical Density Estimation: A Survey. Vandenhoeck &
Ruprecht, Gottingen.
[212] Woodroofe, M. (1970). On the maximum deviation of the sample density. Ann. Math. Statist. 41 1665–1671.
[213] Zolotarev, V.M. (1961). Concerning a certain probability problem. Theor. Probab.
Appl. 6 201204.
[214] Zolotarev, V.M. (1983). Probability metrics. Theory Probab. Appl. 28 278302.

Paul Deheuvels
L.S.T.A., Université Paris VI
Oracle Inequalities and Regularization
Sara van de Geer

Keywords. Classification, complexity penalties, complexity regularization, density estimation, empirical processes, empirical risk minimization, M-estimation, model selection, nonparametric regression, oracle inequalities.

1. Statistical models

In this chapter, the construction of a statistical model is discussed. We weigh deviations from the model on the one hand against the simplicity of the model on the other. We introduce the concepts of approximation error and estimation error. The idea of complexity regularization is illustrated in two situations: histograms in density estimation and smoothing splines in regression.

Here is a brief sketch of the contents of the other chapters. In Chapter 2, we introduce penalized M-estimators, or penalized empirical risk estimators. These are obtained by minimizing a loss function (e.g., least squares loss, minus the log-likelihood, or, in classification, support vector machine loss). A roughness penalty is added to the loss function to avoid overfitting. We study the behavior of the estimators in a general context. The excess risk of an estimator is a global measure of its performance. We consider so-called oracle inequalities for the excess risk. These inequalities relate the performance of the estimator to the procedure that chooses the optimal model by trading off bias and variance (or, more generally, approximation error and estimation error). In Chapter 2, we highlight the role of empirical process theory in this context.

As an important particular case, we investigate high-dimensional linear spaces. The approximation error then comes from approximating curves or images by elements of a high-dimensional parameter space. Chapter 3 studies oracle inequalities in a regression framework, using the least squares estimators of the coefficients. That setup has the advantage that everything can be calculated explicitly. It serves as a preparation for more complicated situations.
Chapter 4 considers general least squares estimators. In Chapter 5, we look at robust regression estimators, density estimation, and binary classification. We consider there a penalty of $\ell_1$-type. Chapter 6 summarizes some tools from empirical process theory. Each chapter ends with bibliographical remarks.

Most of the work will be on estimation theory and on where this theory can make use of inequalities for empirical processes. Approximation theory (for example, the approximating properties of truncated series expansions) will be touched upon only briefly.

We consider a data set consisting of $n$ observations on a variable, say $X$, with values in some space $\mathcal X$. These observations are denoted by $X_1,\ldots,X_n$. We assume that the observations are independent, and that each observation follows the same probability law as $X$, say $P$ (independent, non-identically distributed observations will also be considered).

The probability distribution $P$ is in whole or in part unknown. Our aim is to estimate $P$ or certain aspects of it. A statistical model is a set of candidate distributions $\mathcal P$ for $P$. If nothing is known, one might want to choose $\mathcal P$ as the set of all distributions on $\mathcal X$. However, if one has some idea about the form of $P$, one may want to incorporate this information into the model set $\mathcal P$. In that way the estimation problem is made easier, i.e., the accuracy is greater. If $P\notin\mathcal P$, the model is misspecified. In that case one usually has a systematic error (bias) in the estimator.
1.1. Parametric models

A parametric model is of the form $\mathcal P = \{P_\theta: \theta\in\Theta\}$, where the parameter space $\Theta$ is a subset of Euclidean space $\mathbb R^N$. For example, if the $X_i$ are yes/no answers to a certain question (the binary case), we know that $P$ allows only two possibilities, say 1 and 0 (yes = 1, no = 0). There is only one parameter, say the probability $\theta$ of a "yes" answer.
Here is another example.

Example 1.1. Vilfredo Pareto (1897) noticed that the number of people whose income exceeds level $x$ is often approximately proportional to $x^{-\theta}$, where $\theta$ is a parameter which differs from country to country. Therefore, as a model for the distribution of incomes, one may propose the Pareto distribution function

$$P_\theta(X\le x) = 1 - \frac{1}{x^\theta}, \quad x>1,$$

with density

$$f_\theta(x) = \frac{\theta}{x^{\theta+1}}, \quad x>1.$$
More generally, in a parametric model, there are only a finite number of parameters $\theta = (\theta_1,\ldots,\theta_N)$. When the model is well specified, one has $P = P_{\theta_0}$ for some $\theta_0\in\Theta$. However, a low-dimensional parametric model is often just a mathematical idealization. For example, there seems to be little physical reason for incomes to follow the Pareto law. I.e., a model is only an approximation.
Figure 1. The trade-off between inaccuracy and systematic error (inaccuracy and systematic error plotted against model complexity; the oracle marks the optimal complexity).

When there are infinitely many parameters in the model, it is called nonparametric. We will however not be very strict in our distinction between parametric and nonparametric, for the following reason. Throughout, the number of observations $n$ is assumed to be large. We will allow the choice of the model $\mathcal P$ to depend on $n$, and to be in fact richer for larger $n$. This is only natural, since when we have many observations, we may want to use more flexible models and get more information out of the data. Thus, in a parametric model, $\Theta$ may depend on $n$, and in particular its dimension $N$ may depend on $n$, and in fact grow without limit as $n\to\infty$. Strictly speaking, this means that we deal with a sequence of parametric models with a nonparametric limiting model. We think of such a situation as a nonparametric one.

Parametric models (with $N$ small) are in a sense less rich than nonparametric models, and there is also a range in the complexity of various nonparametric models. The more complex a model, the larger the inaccuracy will be. On the other hand, too simple models have a large systematic error. (Here, we use a generic terminology; we will be more precise in our definitions later on, e.g., in Section 2.3.) Both inaccuracy and systematic error depend on the model, and on the truth $P$. The optimal model trades off the inaccuracy and systematic error (see Figure 1). However, since $P$ is unknown, it is also not known which model this will be. Only an oracle can tell you that. Our aim will be to mimic this oracle.

To evaluate the inaccuracy of a model, we will use empirical process theory. Empirical process theory is about comparing the theoretical distribution $P$ with its empirical counterpart, the empirical distribution $P_n$, introduced in the next section.
1.2. The empirical distribution

The unknown $P$ can be estimated from the data in the following way. Suppose first that we are interested in the probability that an observation falls in $A$, where $A\subset\mathcal X$ is a certain set chosen by the researcher. We denote this probability by $P(A)$. It can be estimated by the frequency of $A$, i.e., by

$$P_n(A) = \frac{\text{number of times an observation } X_i \text{ falls in } A}{\text{total number of observations}} = \frac{\#\{X_i\in A\}}{n}.$$

In view of the law of large numbers, $P_n(A)$ should be close to $P(A)$ for $n$ large. We now define the empirical distribution $P_n$ as the probability law that assigns to a set $A$ the probability $P_n(A)$. We regard $P_n$ as an estimator of the unknown $P$.

More generally, for a function $f:\mathcal X\to\mathbb R$, we denote the expectation of $f(X)$ by $P(f) = Ef(X)$. Its empirical counterpart is denoted by $P_n(f) = \sum_{i=1}^n f(X_i)/n$. Thus, when $f$ is the indicator function of the set $A$, which is denoted by $l_A$, we have $P(f) = P(A)$ and $P_n(f) = P_n(A)$.
Example 1.2. The empirical distribution function. Suppose that $X\in\mathbb R$. The distribution function of $X$ is defined as

$$F_0(x) = P(-\infty, x],$$

and the empirical distribution function is

$$F_n(x) = \frac{\#\{X_i\le x\}}{n}.$$

(Here, we have a slight abuse of notation: $F_0$ is the truth; it is not the empirical distribution based on $0$ observations, a nonsensical object.) Figure 2 plots the distribution function $F_0(x) = 1 - 1/x^2$, $x\ge 1$ (smooth curve), and the empirical distribution function $F_n$ (stair function) of a sample from $X$ with sample size $n = 200$.
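
The setting of Example 1.2 and Figure 2 is easy to reproduce. The following minimal sketch (Python with NumPy; the random seed and evaluation points are arbitrary choices) simulates a sample of size $n = 200$ from $F_0(x) = 1 - 1/x^2$ by inversion and evaluates $F_n$ against $F_0$:

```python
# Empirical versus theoretical distribution function (Example 1.2).
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = (1.0 - rng.uniform(size=n)) ** (-0.5)   # inversion: F0^{-1}(u) = (1-u)^(-1/2)

def F_n(x, sample):
    """Empirical distribution function: #{X_i <= x} / n."""
    return np.mean(sample[:, None] <= x, axis=0)

x = np.linspace(1.0, 11.0, 6)
print(np.c_[x, F_n(x, X), 1.0 - 1.0 / x**2])  # F_n tracks F0 for n large
```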

1.3. Regularization

Many (nonparametric) estimation procedures involve the choice of a tuning parameter, also called regularization parameter or smoothing parameter. Here are two examples.

Example 1.3. Histograms. Suppose $X\in\mathbb R$ has a density, say $f_0$, with respect to Lebesgue measure. Our aim is to estimate $f_0$. The density $f_0(x)$ at $x$ is defined as the derivative of the distribution function $F_0$ at $x$:

$$f_0(x) = \lim_{h\downarrow 0}\frac{F_0(x+h)-F_0(x-h)}{2h} = \lim_{h\downarrow 0}\frac{P(x-h,x+h]}{2h}.$$

Unfortunately, replacing $P$ by $P_n$ here does not work: for $h$ small enough, the interval $(x-h,x+h]$ contains either zero or one observation. Therefore, instead of taking the limit as $h\downarrow 0$, we fix $h$ at a (small) positive value, called the bandwidth.
Figure 2. Theoretical and empirical distribution function ($F_0$: smooth curve; $F_n$: stair function).

The estimator of $f_0(x)$ becomes

$$f_n(x) = \frac{P_n(x-h,x+h]}{2h} = \frac{\#\{X_i\in(x-h,x+h]\}}{2nh}.$$

A histogram is a plot of this estimator at points $x\in\{x_0, x_0+2h, x_0+4h,\ldots\}$.

Figure 3 shows the histogram, with bandwidth $h = 0.25$, for the sample of size $n = 200$ from the Pareto distribution with parameter $\theta_0 = 2$ (i.e., with some abuse of notation, $f_0 = f_{\theta_0}$). The solid line is the density of this distribution.
The bandwidth $h$ is an example of a tuning parameter. Choosing a value for it is a complicated matter, as it leads to considering variance, bias, and related concepts. Such considerations will be the major topic in these notes. The variance, $\mathrm{var}(f_n(x))$, quantifies the inaccuracy of the histogram at the point $x$. The systematic error can be measured by the squared bias

$$\mathrm{bias}^2(f_n(x)) = \big(Ef_n(x) - f_0(x)\big)^2.$$

The mean square error is

$$\mathrm{MSE}(f_n(x)) = \mathrm{var}(f_n(x)) + \mathrm{bias}^2(f_n(x)).$$

The integrated mean square error is

$$\mathrm{IMSE}(f_n) = \int \mathrm{MSE}(f_n(x))\,dx.$$

The optimal (in the sense of IMSE) choice of the bandwidth $h$ minimizes the integrated mean square error, i.e., trades off variance and squared bias.
Figure 3. True density and a histogram.

However, to carry out this trade-off, one needs to know certain aspects of $f_0$. But $f_0$ is unknown! In an attempt to mimic an oracle, one often uses part of the data to estimate $f_0$ with various choices of the bandwidth $h$, and then uses the rest of the data to decide on the choice of $h$. One may also estimate $\mathrm{IMSE}(f_n)$, applying for example least squares cross-validation. We will not present the details here. Instead of bandwidth selection, we will study regularization using complexity penalties, as illustrated in Example 1.4. In the intermezzo following this example, it is shown that the two approaches can be closely related.
Exercise 1.1. Suppose

$$f_0(x) = 2/x^3, \quad x>1.$$

Let $h>0$ be the bandwidth. At a given $x>1+h$, calculate the bias and variance of the histogram estimator

$$f_n(x) = \frac{P_n(x-h,x+h]}{2h}.$$

Show that for $x$ fixed, and $h\to 0$ and $nh\to\infty$, the bias is of order $h^2$ ($\mathrm{bias}(f_n(x)) = O(h^2)$) and the variance is of order $1/(nh)$ ($\mathrm{var}(f_n(x)) = O(1/(nh))$). The optimal (in the sense of MSE) choice for $h$ is thus $h_{\mathrm{opt}} = O(n^{-1/5})$. (For a definition of the order symbols, see Section 3.5.)
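
The orders of magnitude in Exercise 1.1 can be seen in a small Monte Carlo experiment (a sketch; the point $x_0 = 2$, the sample size, the bandwidth grid and the number of replications are arbitrary illustration choices):

```python
# Monte Carlo MSE of the histogram estimator at x0 = 2 for f0(x) = 2/x^3.
import numpy as np

rng = np.random.default_rng(1)
f0 = lambda x: 2.0 / x**3                    # density of F0(x) = 1 - 1/x^2
x0, n, reps = 2.0, 200, 2000

for h in (0.05, 0.1, 0.25, 0.5, 1.0):
    est = np.empty(reps)
    for r in range(reps):
        X = (1.0 - rng.uniform(size=n)) ** (-0.5)
        est[r] = np.mean((x0 - h < X) & (X <= x0 + h)) / (2.0 * h)
    print(h, np.mean((est - f0(x0)) ** 2))   # MSE smallest at a moderate h
```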
Example 1.4. Penalized least squares. Consider the regression

$$EY_i = f_0(x_i), \quad i=1,\ldots,n,$$

where $Y_i$ is a response variable, $x_i$ is a co-variable ($i=1,\ldots,n$), and where $f_0$ is an unknown function. We examine here the case where $x_i = i/n\in[0,1]$, $i=1,\ldots,n$, and $f_0$ is defined on $[0,1]$. We suppose $f_0$ is not changing too much, in the sense that the squared first derivative of $f_0$ is small, say in terms of the average $\int|f'(x)|^2\,dx$. As estimator of $f_0$ we propose

$$f_n = \arg\min_f\Big\{\frac1n\sum_{i=1}^n|Y_i - f(x_i)|^2 + \lambda^2\int_0^1|f'(x)|^2\,dx\Big\}.$$

Here, "arg" stands for argument, i.e., the location where the minimum is attained. Moreover, $\lambda$ is a tuning or regularization parameter. If $\lambda = 0$, the estimator $f_n$ will just interpolate the data. On the other hand, if $\lambda = \infty$, $f_n$ will be a constant function (namely, constantly equal to the average $\sum_{i=1}^n Y_i/n$ of the observations). To the least squares loss function, we have thus added a penalty for choosing a too wiggly function. This is called (complexity) regularization.
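
A discrete version of this estimator is straightforward to compute. In the sketch below (an assumption of this illustration: the penalty integral is replaced by a sum of squared first differences on the design grid, turning the minimization into a ridge-type linear system; the true function and the noise level are arbitrary):

```python
# Discretized penalized least squares (Example 1.4).
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = np.arange(1, n + 1) / n
f0 = -0.1 * np.sin(2 * np.pi * x) - 0.05          # some smooth truth
Y = f0 + 0.01 * rng.standard_normal(n)

lam = 0.1
D = (np.eye(n - 1, n, 1) - np.eye(n - 1, n)) * n  # discrete d/dx
# minimizing (1/n)||Y - f||^2 + lam^2 ||D f||^2 / n leads to a linear system:
f_n = np.linalg.solve(np.eye(n) + lam**2 * D.T @ D, Y)
print(np.mean((f_n - f0) ** 2))                   # average squared error
```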
Figure 4 below plots the true f (which is f0 ) in blue together with the data
(red). The aim is to recover f0 from the data. Figure 5 shows the estimator fn

0 0.02

0
-0.02

-0.02
-0.04
-0.04

-0.06
-0.06

-0.08 -0.08

-0.1
-0.1

-0.12
-0.12
-0.14

-0.14
-0.16

-0.16 -0.18
0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100

True f Noise added, noise level = 0.01

Figure 4

0.02 0.02

0 0

-0.02 -0.02

-0.04 -0.04

-0.06 -0.06

-0.08 -0.08

-0.1 -0.1

-0.12 -0.12

-0.14 -0.14

-0.16 -0.16

-0.18 -0.18
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Denoised, lambda=0.2 Denoised, lambda=0.1


Fit=9.0531e-04 Fit=3.4322e-04
Figure 5
198 S. van de Geer

(in green) for two choices of the tuning parameter $\lambda$. The fit of $f_n$ is defined as

$$\sum_{i=1}^n |Y_i - f_n(x_i)|^2/n.$$

Obviously, the smaller value of $\lambda$ gives a better fit. Figure 6 plots the estimator $f_n$ together with $f_0$, for two values of $\lambda$. The error (or excess risk, see Chapter 2), which is defined here as

$$\sum_{i=1}^n |f_n(x_i) - f_0(x_i)|^2/n,$$

turns out to be smaller for the smaller value of $\lambda$.

Now, in real life situations, it is not possible to make the plots of Figure 6 and/or calculate the error, since the true $f$ is then unknown. Thus, again, we need an oracle to tell us which $\lambda$ to choose. In Section 4.5, we show that by penalizing small values of $\lambda$ one may arrive at an oracle inequality.
Figure 6. Denoised with λ = 0.1, error = 2.8119e-04 (left); denoised with λ = 0.05, error = 7.8683e-05 (right).

Intermezzo. As a continuous version of the problem studied in Example 1.4, consider

$$\hat f = \arg\min_f\Big\{\int_0^1|y(x)-f(x)|^2\,dx + \lambda^2\int_0^1|f'(x)|^2\,dx\Big\}.$$

In fact, let us formulate an extension, namely a continuous version corresponding to the so-called white noise model

$$dY(x) = f(x)\,dx + dW(x),$$

where $W$ is standard Brownian motion. In that case, the derivative $y(x) = dY(x)/dx$ does not exist, as Brownian motion is nowhere differentiable. We therefore use a formulation avoiding this derivative:

$$\hat f = \arg\min_f\Big\{-2\int_0^1 f(x)\,dY(x) + \int_0^1 f^2(x)\,dx + \lambda^2\int_0^1|f'(x)|^2\,dx\Big\}.$$
We show in Lemma 1.1 below that the solution $\hat f$ can be explicitly calculated (using variational calculus). This solution reveals that the tuning parameter $\lambda$ plays the role of a bandwidth parameter.

Lemma 1.1. The solution of the continuous version is

$$\hat f(x) = C\cosh\Big(\frac{x}{\lambda}\Big) + \frac{1}{\lambda}\int_0^x \sinh\Big(\frac{u-x}{\lambda}\Big)\,dY(u),$$

where

$$C = \Big\{Y(1) + \frac{1}{\lambda}\int_0^1 Y(u)\,\sinh\Big(\frac{1-u}{\lambda}\Big)\,du\Big\}\Big/\Big\{\lambda\sinh\Big(\frac{1}{\lambda}\Big)\Big\}.$$
Proof. Partial integration shows that we have to minimize

$$2\int Yf' + \int f^2 - 2Y(1)f(1) + \lambda^2\int |f'|^2.$$

Now, replace $f'$ by the function $g$. We then have the restriction $f' = g$. We can express this restriction by adding a Lagrangian term to the function to be minimized, with Lagrange parameters in the function $h$, say. We then arrive at the problem of minimizing

$$L(f,g) = 2\int Yg + \int f^2 - 2Y(1)f(1) + \lambda^2\int g^2 - 2\int h(g - f').$$

Invoking again partial integration, rewrite this as

$$L(f,g) = 2\int Yg + \int f^2 - 2Y(1)f(1) + \lambda^2\int g^2 - 2\int gh - 2\int fh' + 2h(1)f(1).$$

The Euler equations (see, e.g., Goldstein (1980)) now become

$$f - h' = 0$$

and

$$Y - h + \lambda^2 g = 0.$$

Since $g = f'$, this gives the equation

$$h'' = (h - Y)/\lambda^2,$$

with boundary condition $h(1) = Y(1)$. Solving this differential equation yields the result of the lemma. □
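
Lemma 1.1 (in the form reconstructed above) can be verified numerically by comparing the closed-form solution with a discrete minimizer of the penalized criterion (a sketch; grid size, noise level and λ are arbitrary choices of this illustration):

```python
# Closed form of Lemma 1.1 versus a discrete minimizer, white noise model.
import numpy as np

rng = np.random.default_rng(3)
n, lam = 400, 0.2
x = np.arange(1, n + 1) / n
f = -0.1 * np.sin(2 * np.pi * x)                          # true signal
dY = f / n + 0.05 * rng.standard_normal(n) / np.sqrt(n)   # dY = f dx + s dW
Y = np.cumsum(dY)

# closed form: f(x) = C cosh(x/lam) + (1/lam) int_0^x sinh((u-x)/lam) dY(u)
C = (Y[-1] + np.sum(Y * np.sinh((1 - x) / lam)) / (lam * n)) \
    / (lam * np.sinh(1 / lam))
g = np.array([np.sum(np.sinh((x[:i + 1] - x[i]) / lam) * dY[:i + 1])
              for i in range(n)]) / lam
f_closed = C * np.cosh(x / lam) + g

# discrete minimizer of -2 int f dY + int f^2 + lam^2 int |f'|^2
D = (np.eye(n - 1, n, 1) - np.eye(n - 1, n)) * n
f_disc = np.linalg.solve(np.eye(n) + lam**2 * D.T @ D, n * dY)
print(np.max(np.abs(f_closed - f_disc)))          # small, up to discretization
```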
1.4. Bibliographical remarks

Histograms are special cases of kernel estimators. There are many books on kernel density estimation. We refer to Wand and Jones (1995). The penalized least squares estimator we have considered in Example 1.4 was chosen for ease of exposition. A penalized smoothing spline with a penalty on the second derivative, instead of the first, is used more frequently. The penalty is then

$$\lambda^2\int_0^1|f''(x)|^2\,dx.$$

We refer to Silverman (1985) and Wahba (1990). A nice discussion of more general (Sobolev) penalties (involving for instance higher derivatives) can be found in the book of Green and Silverman (1994). The proof of Lemma 1.1 is a simple example of variational calculus, which is generally developed in the context of classical mechanics (Goldstein (1980)).

2. M-estimators

We will start with two examples, and then present the general definition of an M-estimator. We furthermore give definitions of excess risk, estimation error and approximation error. Next, we highlight how empirical process theory can be invoked to assess the excess risk of an M-estimator.

2.1. Some examples

Examples of M-estimators are least squares estimators, maximum likelihood estimators, and estimators in binary classification using 0/1 loss. Let us consider the latter two.

Example 2.1. Maximum likelihood. Let $\mathcal P$ be a collection of probability measures dominated by a $\sigma$-finite measure $\mu$. Write $F = \{f = d\tilde P/d\mu: \tilde P\in\mathcal P\}$ for the collection of densities. Using the model class $F$, the maximum likelihood estimator is

$$f_n = \arg\max_{f\in F}\sum_{i=1}^n \log f(X_i).$$

Recall that "arg" stands for argument, i.e., $f_n$ is the density in $F$ where the likelihood of the observations is maximal. Note that we may write

$$\sum_{i=1}^n \log f(X_i)/n = P_n(\log f).$$

This is the empirical counterpart of

$$\int \log f\,dP = P(\log f).$$

The maximum likelihood estimator $f_n$ maximizes $P_n(\log f)$ over all $f\in F$.
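
For the Pareto model of Example 1.1, the maximum likelihood estimator even has a closed form, $\hat\theta = n/\sum_i\log X_i$ (the closed form follows by differentiating the log-likelihood; it is stated here as part of the illustration, not quoted from the text). The sketch below checks it against a direct grid maximization of $P_n(\log f_\theta)$:

```python
# Maximum likelihood for the Pareto family f_theta(x) = theta / x^(theta+1).
import numpy as np

rng = np.random.default_rng(4)
theta0, n = 2.0, 200
X = (1.0 - rng.uniform(size=n)) ** (-1.0 / theta0)   # Pareto(theta0) sample

loglik = lambda th: np.sum(np.log(th) - (th + 1) * np.log(X))
grid = np.linspace(0.5, 5.0, 2001)
print(grid[np.argmax([loglik(th) for th in grid])],  # grid maximizer
      n / np.sum(np.log(X)))                         # closed-form MLE
```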


Exercise 2.1. Verify that the true density $f_0 = dP/d\mu$ is a maximizer of $P(\log f)$ over all densities $f$.

Exercise 2.2. Check that a histogram with cells chosen a priori (see Example 1.3) is a maximum likelihood estimator for the density on $\mathbb R$, with model class

$$F = \Big\{f = \sum_{k=1}^N \theta_k\,l_{(a_{k-1},a_k]},\ \theta_k\ge 0,\ k=1,\ldots,N,\ \int f(x)\,dx = 1\Big\}.$$

Here, the cell boundaries $a_0<a_1<\cdots<a_N$ are fixed.

Example 2.2. Minimum empirical risk estimation in classification. Let $(X,Y)$ be random variables, with $X\in\mathcal X$ a co-variable and $Y\in\{0,1\}$ a label. For example, $X$ may be the features of a mushroom (size, shape, color), and $Y$ indicates whether it is edible or not. A classifier is a function $f:\mathcal X\to\{0,1\}$. So $f$ is the indicator function of a set $G\subset\mathcal X$, and instances $x\in G$ are classified with the label 1, whereas $x\notin G$ are classified with the label 0. We will usually identify sets with indicator functions.

Define the regression

$$\eta_0(x) = P(Y=1\,|\,X=x).$$

Bayes rule $f_0$ is to classify observations $X$ with $\eta_0(X)>1/2$ in the class with label 1 and those with $\eta_0(X)\le 1/2$ in the class with label 0, i.e.,

$$f_0 = l_{G_0}, \qquad G_0 = \{x: \eta_0(x)>1/2\}.$$

Identifying sets and indicator functions, we also call the set $G_0$ Bayes rule.

Let $\{(X_i,Y_i)\}_{i=1}^n$ be a sample from $(X,Y)$. One may estimate Bayes rule in the following way. The number of misclassifications using the classifier $G\subset\mathcal X$ is

$$\#\{X_i\in G,\,Y_i=0\} + \#\{X_i\notin G,\,Y_i=1\} = \sum_{i=1}^n|Y_i - l_G(X_i)| := nR_n(G),$$

where $R_n(G)$ is the proportion of misclassifications. The prediction error of a classifier $G$ is the theoretical counterpart $R(G)$ of $R_n(G)$:

$$R(G) = P(X\in G,\,Y=0) + P(X\notin G,\,Y=1) = ER_n(G).$$

The empirical risk minimizer using the model class $\mathcal G$ is defined as

$$\hat G_n = \arg\min_{G\in\mathcal G} R_n(G).$$
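
A toy illustration of this empirical risk minimizer (the model class of threshold classifiers $G_c = \{x: x>c\}$ and the regression $\eta_0(x) = x$ are assumptions made for this sketch, not taken from the text):

```python
# Empirical risk minimization over threshold classifiers.
import numpy as np

rng = np.random.default_rng(5)
n = 500
X = rng.uniform(size=n)
Y = (rng.uniform(size=n) < X).astype(int)   # eta_0(x) = x; Bayes rule: x > 1/2

def R_n(c):
    pred = (X > c).astype(int)              # classify with G_c = {x : x > c}
    return np.mean(Y != pred)               # proportion of misclassifications

grid = np.linspace(0.0, 1.0, 201)
print(grid[np.argmin([R_n(c) for c in grid])])  # close to the Bayes cut 1/2
```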

Exercise 2.3. Verify that Bayes rule $G_0$ minimizes $R(G)$ over all subsets $G\subset\mathcal X$.
2.2. General framework

Examples 2.1 and 2.2 are two examples of M-estimation. The general principle is as follows. We have observed i.i.d. copies $\{X_i\}_{i=1}^n$ of a random variable $X$ with distribution $P$. We choose a model class $F\subset\bar F$ for the (possibly infinite-dimensional) parameter of interest $f_0\in\bar F$. Here $\bar F$ is a large class of candidate $f$'s, and we refer to it as the class of all $f$'s. Let, for all $f$, $\gamma_f:\mathcal X\to\mathbb R$ be a loss function. Define the theoretical risk

$$R(f) = P(\gamma_f).$$

The loss function is chosen in such a way that the risk is smallest at the parameter of interest $f_0$:

$$f_0 = \arg\min_{\text{all } f} R(f).$$

Let the empirical risk be

$$R_n(f) = P_n(\gamma_f).$$

Let $F\subset\bar F$ be a model class. The M-estimator $f_n$ ("M" stands for Minimization (or Maximization)) is then

$$f_n = \arg\min_{f\in F} R_n(f). \qquad (2.1)$$

A model class $F$ is chosen here in the empirical risk minimization, because minimizing the empirical risk over all $f$ generally leads to overfitting. The class $F$ should on the other hand not be chosen too small, as that may result in a large systematic error. This problem is exactly our object of study, and it will be presented more formally in Section 2.7. Complexity regularization will be used as a procedure for choosing $F$ not too large and not too small.

In Example 2.1, the loss function is $\gamma_f = -\log f$, with $f$ a density. In Example 2.2, or a more general regression set-up with i.i.d. co-variables (random design), the situation is within our framework, but we use a different notation. We let $\{(X_i,Y_i)\}_{i=1}^n$ be i.i.d. copies of a random variable $(X,Y)$ with distribution $P$, and consider a loss function $\gamma_f:\mathcal X\times\mathbb R\to\mathbb R$, $f\in\bar F$. (For instance, in Example 2.2, the loss function is $\gamma_f(x,y) = |y - f(x)|$, with $f = l_G$ the indicator function of a subset $G\subset\mathcal X$.) The distribution of the co-variable $X$ is denoted by $Q$, and the empirical distribution of $X_1,\ldots,X_n$ is denoted by $Q_n$. When the co-variables are fixed (non-random design), the regression model concerns a situation with independent but not identically distributed observations. However, this will not really alter the general theory. The reason why we formulated this introduction for the i.i.d. case is merely for ease of exposition.
2.3. Estimation and approximation error

We will now be somewhat more precise in our definitions of accuracy and systematic error. We will refer to the inaccuracy as estimation error and to the systematic error as approximation error. Let $\gamma_f$ be a loss function, $R(f) = P(\gamma_f)$, and $f_0$ the overall minimizer of $R(f)$. Let $\mathcal F$ be the model class, and $f^*$ the minimizer of $R(f)$ over $\mathcal F$.

The empirical risk is $R_n(f) = P_n(\gamma_f)$. Using the M-estimator or empirical risk minimizer $\hat f_n$ given in (2.1), its estimation error is defined as
$$R(\hat f_n) - R(f^*).$$
Note that this is a random variable. Below, we will extend the concept, and for example refer to the average value of $R(\hat f_n) - R(f^*)$ (or a quantity of this order) as the estimation error.

The approximation error is defined as
$$R(f^*) - R(f_0).$$
This is a non-random quantity.

The difference $R(f) - R(f_0)$ is called the excess risk at $f$. Note that it is always non-negative. We will investigate the behavior of the excess risk $R(\hat f_n) - R(f_0)$. This will lead to probability inequalities for $R(\hat f_n) - R(f_0)$, or bounds for, e.g., the average excess risk $E R(\hat f_n) - R(f_0)$.

Exercise 2.4. Consider a regression model
$$E(Y \mid X = x) = f_0(x).$$
Let $\gamma_f(x, y) = (y - f(x))^2$ be the least squares loss function. Consider first the case of fixed design (i.e., work conditionally on $X_i = x_i$, $i = 1, \dots, n$). Let
$$R_n(f) = \frac 1n \sum_{i=1}^n (Y_i - f(x_i))^2$$
and
$$R(f) = E R_n(f)$$
(where the expectation is conditional given $x_1, \dots, x_n$). Check that the excess risk $R(f) - R(f_0)$ is equal to $\|f - f_0\|_n^2$, where $\|\cdot\|_n$ is the $L_2(Q_n)$-norm. Suppose that $\mathcal F$ is a (finite-dimensional) linear space. Show that $f^*$ is the projection of $f_0$ (in $L_2(Q_n)$) on $\mathcal F$ and that hence
$$\|\hat f_n - f_0\|_n^2 = \|\hat f_n - f^*\|_n^2 + \|f^* - f_0\|_n^2.$$
Conclude similarly for the random design case, where $\|\cdot\|_n$ is to be replaced by the $L_2(Q)$-norm $\|\cdot\|$.
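A numerical sanity check of Exercise 2.4 may help (an added sketch under the stated fixed-design assumptions; the sine curve and the two-dimensional model class are arbitrary choices): the projection $f^*$ makes the two squared $L_2(Q_n)$-norms add up.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = np.linspace(0, 1, n)
f0 = np.sin(2 * np.pi * x)            # true regression function at the design points

# Model class F: the span of {1, x}, a 2-dimensional linear space.
Phi = np.stack([np.ones(n), x], axis=1)

# f*: projection of f0 on F in L2(Q_n); with equally weighted design points
# this is ordinary least squares of f0 on Phi.
beta_star, *_ = np.linalg.lstsq(Phi, f0, rcond=None)
f_star = Phi @ beta_star

# Least squares estimator from noisy data Y_i = f0(x_i) + eps_i.
Y = f0 + 0.3 * rng.standard_normal(n)
beta_hat, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
f_hat = Phi @ beta_hat

sq_norm = lambda g: np.mean(g ** 2)   # ||g||_n^2, the squared L2(Q_n) norm
lhs = sq_norm(f_hat - f0)
rhs = sq_norm(f_hat - f_star) + sq_norm(f_star - f0)
print(lhs, rhs)                       # equal up to rounding error (Pythagoras)
```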
2.4. Where empirical process theory comes in

To derive bounds for the excess risk of the M-estimator $\hat f_n$ given in (2.1), we use the inequalities
$$R(\hat f_n) - R(f^*) \ge 0,$$
and
$$R_n(\hat f_n) - R_n(f^*) \le 0.$$
These inequalities are true because $f^*$ minimizes the theoretical risk over $\mathcal F$ and $\hat f_n$ minimizes the empirical risk over $\mathcal F$.

Thus we may write
$$0 \le R(\hat f_n) - R(f^*) = -\big[ R_n(\hat f_n) - R(\hat f_n) - (R_n(f^*) - R(f^*)) \big] + R_n(\hat f_n) - R_n(f^*)$$
$$\le -\big[ R_n(\hat f_n) - R(\hat f_n) - (R_n(f^*) - R(f^*)) \big].$$
We write for $f \in \bar{\mathcal F}$,
$$\nu_n(f) = \sqrt n\,(R_n(f) - R(f)) = \sqrt n\,(P_n(\gamma_f) - P(\gamma_f)).$$
Let $\mathcal F_0$ be some subset of $\bar{\mathcal F}$. The empirical process indexed by $\mathcal F_0$ is
$$\{\nu_n(f) : f \in \mathcal F_0\}.$$
From the above, we know that
$$0 \le R(\hat f_n) - R(f^*) \le -[\nu_n(\hat f_n) - \nu_n(f^*)]/\sqrt n.$$
Adding $R(f^*) - R(f_0)$ to both sides of this inequality yields moreover
$$(2.2)\qquad R(\hat f_n) - R(f_0) \le -[\nu_n(\hat f_n) - \nu_n(f^*)]/\sqrt n + [R(f^*) - R(f_0)] = I + II,$$
with
$$I = -[\nu_n(\hat f_n) - \nu_n(f^*)]/\sqrt n$$
and with $II$ the approximation error
$$II = [R(f^*) - R(f_0)].$$
This inequality reveals the two components $I$ and $II$ when studying the excess risk at $\hat f_n$. The expression $I$ is a bound for the estimation error. Empirical process theory is invoked to examine this term. Handling the approximation error $II$ involves approximation theory. We refer to inequalities like (2.2) as basic inequalities, because such inequalities will be the starting point for deriving bounds for the excess risk at $\hat f_n$.

Exercise 2.5. In Exercise 2.4, define the noise variables
$$\epsilon_i = Y_i - f_0(x_i),\quad i = 1, \dots, n.$$
Verify that for the model with fixed design,
$$\nu_n(f) - \nu_n(\tilde f) = -\frac{2}{\sqrt n} \sum_{i=1}^n \epsilon_i \big(f(x_i) - \tilde f(x_i)\big).$$
In the model with random design, it becomes
$$\nu_n(f) - \nu_n(\tilde f) = -\frac{2}{\sqrt n} \sum_{i=1}^n \epsilon_i \big(f(X_i) - \tilde f(X_i)\big) + \sqrt n \Big( \big(\|f - f_0\|_n^2 - \|f - f_0\|^2\big) - \big(\|\tilde f - f_0\|_n^2 - \|\tilde f - f_0\|^2\big) \Big).$$

The two main tools for studying M-estimators are empirical process theory and approximation theory. Empirical process theory is used to investigate the estimation error of an estimator. There is no straight answer on how to invoke which parts of empirical process theory. In Lemma 2.1 below, we give an example which highlights the main idea.

Empirical process theory supplies us with inequalities for suprema of empirical processes indexed by functions. We will in fact need the behavior of increments of the empirical process. This concerns the question: if two functions $f$ and $\tilde f$ are close, then how small is the difference $|\nu_n(f) - \nu_n(\tilde f)|$? We welcome an answer that holds uniformly over $f \in \mathcal F$ in a neighborhood of $\tilde f$, because we can then apply it to a random function (an estimator) in this neighborhood.

Empirical process theory gives us probability or moment bounds for the suprema of empirical processes. These can be directly derived inequalities. However, a strong tool is based on so-called concentration inequalities. Concentration inequalities are (exponential, or even sub-Gaussian) probability inequalities for the concentration of a random variable around its mean. One can apply them to the random variable representing the supremum of an empirical process. Then, the only task left is to find good bounds for the mean of this supremum. This is in some cases as easy as Cauchy-Schwarz, perhaps preceded by a so-called symmetrization and/or a contraction inequality. These concepts (concentration, symmetrization and contraction) will be discussed in more detail in Chapter 6.

Approximation theory will be used to illustrate how the behavior of estimators depends on how well the model approximates the truth. Since in real life we will actually never find out what the truth is, these illustrations are purely theoretical.

2.5. Some first results, assuming ready-to-use empirical process theory

We will assume two inequalities. The first one, the margin condition, is purely deterministic. It depends very much on the problem under consideration. The second one, the empirical process condition, is an assumption on the increments of the empirical process $\{\nu_n(f) : f \in \mathcal F\}$. It is in a ready-to-use form. In the literature, such empirical process inequalities have indeed been established (see also Section 6.6). A very simple example can be established in Exercise 3.2.

We assume that the two inequalities hold for some metric $d$ on $\bar{\mathcal F}$. The inequalities are tied up in a technical condition on the parameters involved.

Margin condition. For some constants $c_2$ and $\kappa \ge 1$ and for all $f \in \mathcal F$,
$$R(f) - R(f_0) \ge d^\kappa(f, f_0)/c_2.$$

In the margin condition, $\kappa$ is an identifiability parameter. If it is small, then $f_0$ is well-identified for the metric $d$. Frequently, the condition only holds for those $f$ with $d(f, f_0)$ bounded by some constant. We skip this issue here to avoid digressions.

Exercise 2.6. In Exercise 2.4, check that the margin condition is met, with $\kappa = 2$ and $d(f, f_0) = \|f - f_0\|_n$ (fixed design) or $d(f, f_0) = \|f - f_0\|$ (random design).

Empirical process condition. For some constants $c_1$, $r > 0$, and $0 \le \beta \le 1$, we have
$$E \sup_{f \in \mathcal F} \left( \frac{|\nu_n(f) - \nu_n(f^*)|}{d^\beta(f, f^*)} \right)^r \le c_1^r.$$

In the empirical process condition, $\beta$ is a complexity parameter. If it is small, the class $\mathcal F$ is complex, and the increments of the empirical process can be large. When $f = f^*$, the weighted empirical process $(\nu_n(f) - \nu_n(f^*))/d^\beta(f, f^*)$ is defined to be zero. We remark now that meeting the empirical process condition frequently means that the class $\mathcal F$ does not contain $f$'s with distance $d(f, f^*)$ not zero but very small. This may not be true for our original $\mathcal F$. But actually, as we want to prove that $d(\hat f_n, f^*)$ is small, this generally turns out to be only a minor technical issue.

Technical condition. We have $\kappa > \beta$ and $r(\kappa - \beta) \ge \kappa$.

In Lemma 2.1, we first consider the well specified case, i.e., the case where $f_0 \in \mathcal F$. We then show that in the misspecified case, the approximation error occurs as an additional term.

Lemma 2.1. Let $\hat f_n$ be the M-estimator defined in (2.1). Suppose the margin condition, the empirical process condition, and the technical condition are met. If $f_0 \in \mathcal F$ (well specified case), we have
$$(2.3)\qquad E R(\hat f_n) - R(f_0) \le c_1^{\frac{\kappa}{\kappa - \beta}} c_2^{\frac{\beta}{\kappa - \beta}}\, n^{-\frac{\kappa}{2(\kappa - \beta)}}.$$
More generally, if possibly $f_0 \notin \mathcal F$ (misspecified case), we have for any $0 < \delta < 1$,
$$(2.4)\qquad E R(\hat f_n) - R(f_0) \le \Big(\frac{1 + \delta}{1 - \delta}\Big) \{V_n + R(f^*) - R(f_0)\},$$
where
$$(2.5)\qquad V_n = 2\, c_1^{\frac{\kappa}{\kappa - \beta}} (c_2/\delta)^{\frac{\beta}{\kappa - \beta}}\, n^{-\frac{\kappa}{2(\kappa - \beta)}}.$$

Proof. Define $Z_n = |\nu_n(\hat f_n) - \nu_n(f^*)|/d^\beta(\hat f_n, f^*)$. From basic inequality (2.2) it follows that
$$(2.6)\qquad R(\hat f_n) - R(f_0) \le Z_n\, d^\beta(\hat f_n, f^*)/\sqrt n + R(f^*) - R(f_0).$$
Suppose first that $f_0 \in \mathcal F$, so that $f_0 = f^*$. By the margin condition, we know that
$$d^\kappa(\hat f_n, f_0) \le c_2 \big(R(\hat f_n) - R(f_0)\big).$$
Hence, from (2.6),
$$R(\hat f_n) - R(f_0) \le c_2^{\beta/\kappa} \big(R(\hat f_n) - R(f_0)\big)^{\beta/\kappa} Z_n/\sqrt n.$$
This implies
$$R(\hat f_n) - R(f_0) \le c_2^{\frac{\beta}{\kappa - \beta}} Z_n^{\frac{\kappa}{\kappa - \beta}}\, n^{-\frac{\kappa}{2(\kappa - \beta)}}.$$
Result (2.3) now follows from
$$(2.7)\qquad E Z_n^{\frac{\kappa}{\kappa - \beta}} \le (E Z_n^r)^{\frac{\kappa}{r(\kappa - \beta)}} \le c_1^{\frac{\kappa}{\kappa - \beta}}.$$
More generally, in the possibly misspecified case, we use that
$$d^\beta(\hat f_n, f^*) \le d^\beta(\hat f_n, f_0) + d^\beta(f^*, f_0) \le c_2^{\beta/\kappa} \big(R(\hat f_n) - R(f_0)\big)^{\beta/\kappa} + c_2^{\beta/\kappa} \big(R(f^*) - R(f_0)\big)^{\beta/\kappa}.$$
Now, use the Technical Lemma below. Use it twice. We find
$$R(\hat f_n) - R(f_0) \le c_2^{\beta/\kappa} \big(R(\hat f_n) - R(f_0)\big)^{\beta/\kappa} Z_n/\sqrt n + c_2^{\beta/\kappa} \big(R(f^*) - R(f_0)\big)^{\beta/\kappa} Z_n/\sqrt n + R(f^*) - R(f_0)$$
$$\le \delta \big(R(\hat f_n) - R(f_0)\big) + \delta \big(R(f^*) - R(f_0)\big) + 2 (c_2/\delta)^{\frac{\beta}{\kappa - \beta}} Z_n^{\frac{\kappa}{\kappa - \beta}}\, n^{-\frac{\kappa}{2(\kappa - \beta)}} + R(f^*) - R(f_0).$$
Conclude invoking (2.7). □

The next lemma was used in the proof of Lemma 2.1. It will in fact also be of help in proofs in Chapter 5.

Technical Lemma. We have for all positive $v$, $t$ and $\delta$, and $\kappa > \beta \ge 0$,
$$v\, t^{\beta/\kappa} \le \delta t + v^{\frac{\kappa}{\kappa - \beta}}/\delta^{\frac{\beta}{\kappa - \beta}}.$$

Proof. Suppose first that $v/\delta \le t^{(\kappa - \beta)/\kappa}$. Then obviously
$$v\, t^{\beta/\kappa} = \frac v\delta\, \delta\, t^{\beta/\kappa} \le \delta t.$$
Conversely, if $v/\delta \ge t^{(\kappa - \beta)/\kappa}$, then $t \le (v/\delta)^{\kappa/(\kappa - \beta)}$. So then
$$v\, t^{\beta/\kappa} \le v\, (v/\delta)^{\beta/(\kappa - \beta)} = v^{\frac{\kappa}{\kappa - \beta}}/\delta^{\frac{\beta}{\kappa - \beta}}. \qquad\Box$$
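Since the Technical Lemma is used repeatedly later on, a quick numerical probe of it can be reassuring (an added check, not part of the original argument): the inequality is verified on a grid of $(v, t, \delta)$ for a few pairs $\kappa > \beta$.

```python
import numpy as np

def lhs(v, t, beta, kappa):
    # left-hand side of the Technical Lemma: v * t^(beta/kappa)
    return v * t ** (beta / kappa)

def rhs(v, t, delta, beta, kappa):
    # right-hand side: delta*t + v^(kappa/(kappa-beta)) / delta^(beta/(kappa-beta))
    return delta * t + v ** (kappa / (kappa - beta)) / delta ** (beta / (kappa - beta))

grid = np.linspace(0.05, 5.0, 40)
for beta, kappa in [(1.0, 2.0), (0.5, 1.5), (1.0, 3.0)]:
    worst = max(
        lhs(v, t, beta, kappa) - rhs(v, t, d, beta, kappa)
        for v in grid for t in grid for d in np.linspace(0.1, 0.9, 9)
    )
    print(beta, kappa, worst <= 1e-12)   # True: the inequality holds on the grid
```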



Recall that we defined the estimation error as $R(\hat f_n) - R(f^*)$. Now, in a general context, the estimation error and approximation error cannot be studied separately. This is primarily because the margin condition is assumed at $f_0$, not at $f^*$. Also, we will generally bound the estimation error by the term $I$ in (2.2) involving the empirical process. For this term we will in turn consider a (probability) bound. To understand more clearly the meaning of the formulas, we will be flexible in our use of terminology, and refer to such upper bounds as estimation error. For example, in the context of Lemma 2.1, $V_n$ in (2.5) is (an upper bound for) the (average) estimation error, and we simply refer to it as estimation error.

2.6. Balancing estimation and approximation error

Lemma 2.1 gives an upper bound for the average excess risk, involving the (bound for the) estimation error and the approximation error. More generally, suppose we have shown that for some constants $C$ and $V_n$,
$$E R(\hat f_n) - R(f_0) \le C \{V_n + R(f^*) - R(f_0)\}.$$
Suppose that the constant $C$ does not depend on $P$ or on the model class $\mathcal F$. Clearly, the estimation error $V_n$ depends on the model $\mathcal F$, for example in (2.5) through the parameter $\beta$. Moreover, $V_n$ may also depend on $P$, for example as in (2.5) through the parameter $\kappa$. It is clear that also the approximation error, which we write for short as $B_n^2 = R(f^*) - R(f_0)$, depends on $\mathcal F$ and $P$. To express these dependencies, let us write $V_n = V_n(\mathcal F, P)$ and $B_n^2 = B_n^2(\mathcal F, P)$. Given a collection of models $\{\mathcal F\}$, the optimal model $\mathcal F^*$ would now be
$$\mathcal F^* = \arg\min_{\{\mathcal F\}} \big\{ V_n(\mathcal F, P) + B_n^2(\mathcal F, P) \big\}.$$
But since the optimal model $\mathcal F^*$ depends on $P$, only an oracle knows what $\mathcal F^*$ is. The oracle is mimicked if we can construct an estimator with excess risk at most that of the estimator when $\mathcal F^*$ were known.

In our theory, we will not be able to arrive at exact oracle behavior. Instead, a trade off up to constants independent of the sample size $n$, or possibly up to $(\log n)$-factors, is established. Having essentially the large-$n$ situation in mind, we will not be too much concerned about such constants and $(\log n)$-factors. In conclusion, we look for an estimator $\hat f_n$ satisfying, up to $(\log n)$-terms,
$$E R(\hat f_n) - R(f_0) = O\big( V_n(\mathcal F^*, P) + B_n^2(\mathcal F^*, P) \big).$$
The approach to this end will be to add a penalty to the empirical risk, so that complicated models are penalized. Such a method is called penalized empirical risk minimization, and it is a form of complexity regularization.

2.7. Bibliographical remarks

For the application of empirical process theory to evaluate M-estimators, we refer to van der Vaart and Wellner (1996) and van de Geer (2000), and the references in these books. We remark that much recent work uses refined concentration inequalities (e.g., Massart (2000b), see also Section 6.1). The empirical process condition is a condition on the weighted empirical process. See Section 6.4 for techniques. The margin condition is from Tsybakov (2004) (see also Mammen and Tsybakov (1999)). It in particular plays an important role in the classification problem (see Example 2.2), but also in other contexts (see van de Geer (2003) and Chapter 5). The Technical Lemma is as in Tsybakov and van de Geer (2005), albeit with a larger constant but simpler proof.

3. The sequence space formulation

The aim of this chapter is to reveal the main ingredients of complexity regularization, yet keeping technicalities to a minimum. We will consider a situation where the average excess risk can be computed exactly and without much effort. We examine orthogonal series expansions of a regression function. The coefficients are estimated by least squares. The question we address is: which coefficients should we keep? Keeping many coefficients results in an estimator of the regression curve with large variance. On the other hand, killing many coefficients might lead to a large bias. An oracle can choose the optimal trade off between bias and variance. Hard- and soft-thresholding are presented as estimation methods that mimic this oracle. In Section 3.2 the concept of sparseness is introduced as motivation for considering the collection of models given in Section 3.3. Section 3.4 presents the model an oracle would select from this collection. Section 3.5 states that hard- and soft-thresholding estimators mimic the oracle, and presents a proof of this for the soft-thresholding case, using the empirical process result of Section 3.6. The consequences of the theory for sparse signals are left as exercises (Exercises 3.4 and 3.5).

3.1. Reformulation of the regression problem

Consider the regression
$$Y_i = f_0(x_i) + \epsilon_i,\quad i = 1, \dots, n,$$
where $\epsilon_1, \dots, \epsilon_n$ are independent and centered noise variables, and $f_0$ is an unknown function on $\mathcal X$. Our objective is to put forward the basic idea of an oracle inequality. We want to facilitate the exposition as much as possible. In this spirit, we make the simplifying assumption of normally distributed errors, that is, $\epsilon_1, \dots, \epsilon_n$ are assumed to be i.i.d. $N(0, \sigma^2)$-distributed.

We may collect the observations $Y = (Y_1, \dots, Y_n)'$ in a (random) vector in $\mathbb R^n$. The regression model takes the mean of $Y$ as a (possibly partly) unknown vector in $\mathbb R^n$. The co-variables $x_1, \dots, x_n$ are considered as fixed (nonrandom design) in this chapter.

Recall that $Q_n$ denotes the empirical distribution of $x_1, \dots, x_n$. Let $(\psi_1, \dots, \psi_n)$ be certain given functions, chosen in such a way that they form an orthonormal basis in $L_2(Q_n)$. Thus $\psi_j : \mathcal X \to \mathbb R$, $j = 1, \dots, n$, and
$$\frac 1n \sum_{i=1}^n \psi_j^2(x_i) = 1,\quad \forall\, j,$$
i.e., the functions have length 1, and
$$\frac 1n \sum_{i=1}^n \psi_j(x_i)\psi_k(x_i) = 0,\quad k \ne j,$$
i.e., the functions are orthogonal to each other.

Define now the inner products
$$\tilde Y_j = \frac 1n \sum_{i=1}^n Y_i \psi_j(x_i),\quad j = 1, \dots, n,$$
and
$$\tilde\theta_{j,0} = \frac 1n \sum_{i=1}^n f_0(x_i) \psi_j(x_i),\quad j = 1, \dots, n.$$
We then have
$$(3.1)\qquad \tilde Y_j = \tilde\theta_{j,0} + \tilde\epsilon_j,\quad j = 1, \dots, n,$$
where
$$\tilde\epsilon_j = \frac 1n \sum_{i=1}^n \epsilon_i \psi_j(x_i),\quad j = 1, \dots, n.$$

Exercise 3.1. Show that $\tilde\epsilon_1, \dots, \tilde\epsilon_n$ are i.i.d. and $N(0, \sigma^2/n)$-distributed.
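The passage to sequence space is a linear transformation, and can be sketched as follows (an added illustration; the random orthonormal basis and the step function $f_0$ are arbitrary choices, used only to exhibit Exercise 3.1 numerically):

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 128, 1.0
x = np.arange(n) / n
f0 = (x > 0.5).astype(float)          # some unknown regression function

# Any orthonormal basis of R^n, rescaled by sqrt(n) so that
# (1/n) sum_i psi_j(x_i) psi_k(x_i) = 1{j = k}, i.e. orthonormal in L2(Q_n).
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
Psi = np.sqrt(n) * Q                  # column j holds psi_j(x_1), ..., psi_j(x_n)

Y = f0 + sigma * rng.standard_normal(n)

Y_tilde = Psi.T @ Y / n               # Y~_j = (1/n) sum_i Y_i psi_j(x_i)
theta0 = Psi.T @ f0 / n               # theta~_{j,0}
eps_tilde = Y_tilde - theta0          # the transformed noise

print(eps_tilde.std() * np.sqrt(n))   # approx sigma, since eps~_j ~ N(0, sigma^2/n)
```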
We call (3.1) the sequence space formulation. If the regression function $f_0$ is completely unknown, the expectation $\tilde\theta_0 = (\tilde\theta_{1,0}, \dots, \tilde\theta_{n,0})'$ of the random vector $\tilde Y = (\tilde Y_1, \dots, \tilde Y_n)'$ is completely unknown. So then there are $n$ unknowns, that is, as many unknowns as there are observations. Nevertheless, when the signal is sparse, one can estimate the vector $\tilde\theta_0$ (in a global sense). With sparseness, we mean that most of the elements in $\tilde\theta_0$ are zero or almost zero. In the literature, the sparseness of a representation is often defined as the number of zero coefficients, so that one representation is sparser than another iff it has fewer non-zero coefficients. Section 3.2 gives a formally somewhat different definition, giving the possibility of a more refined comparison.

We end this section with some remarks. In practice, one has to make a choice for the orthonormal basis $\{\psi_j\}$. This is like choosing a language to interpret a given (noisy) text. Here, the special properties of the data set are of importance. For example, some signals are well described by taking a Fourier basis, others by wavelets or trigonometric series. The problem is closely related to data compression. One hopes to choose a basis such that $\tilde\theta_0$ is indeed sparse, i.e., that one has an economical approximate representation with only a few non-zero coefficients.

In the literature, most systems of basis functions are orthonormal for Lebesgue measure instead of $Q_n$. We assume orthonormality in $L_2(Q_n)$ because it makes it possible to avoid technical calculations.

We note moreover that in many applications, sparse representations can only be obtained using non-orthogonal (and perhaps overcomplete) representations (for example for representing sharp edges in a picture).

The mapping $x \mapsto \psi(x)$ from $\mathcal X$ to $\mathbb R^n$ is sometimes called the feature mapping. For example, if $f_0$ is a picture, then $\psi$ may represent features like angles, directions, and shapes.

In some other situations $\psi$ is of the form $\psi_j(x_i) = K(|x_i - x_j|/h)$, where $|\cdot|$ is some metric on $\mathcal X$, $K$ is some kernel and $h$ a regularization parameter, often called the width. In that case, sparseness may be less of an issue. It is rather the choice of the parameter $h$ that is of importance here.

If there are many sets of basis functions to choose from, we call them dictionaries. One may use a data dependent method to choose a dictionary, but we will not consider this issue.

In the next section, we take the sequence space formulation as a starting point, and we will omit the tilde in our notation.
3.2. Estimating the mean of a normal vector

In this section we study the estimation of the vector $\theta_0 \in \mathbb R^n$ from observations
$$Y_j = \theta_{j,0} + \epsilon_j,\quad j = 1, \dots, n,$$
with $\epsilon_1, \dots, \epsilon_n$ independent $N(0, \sigma^2/n)$-distributed.

Denote the Euclidean norm of $\theta \in \mathbb R^n$ by
$$\|\theta\|_n^2 = \sum_{j=1}^n |\theta_j|^2.$$

[Figure 7. A sparse signal.]

Now, we want to find an estimator $\hat\theta_n$ such that $\|\hat\theta_n - \theta_0\|_n$ is small. In general, one cannot guarantee that a good estimator exists. But if we believe that most of the coefficients $\theta_{j,0}$ are zero or at least not too big, we can construct an estimator

that exploits this as much as possible. A signal with only few large coefficients is called sparse. It is like the starry night sky: if you consider the sky as a set of pixels, then at most pixels there is no star (no signal), or it is too far away, and the light you see there is mainly background noise.

As a mathematical description of sparseness, we take:

Definition. The signal $\theta_0 \in \mathbb R^n$ is called sparse if for some $0 \le r < 2$,
$$(3.2)\qquad \sum_{j=1}^n |\theta_{j,0}|^r \le 1.$$

Note that sparseness depends on the parameter $r$, which is a roughness parameter. Very small values of $r$ correspond to very sparse signals. The extreme case is $r = 0$, in which case there is at most one non-zero coefficient (using the convention $0^0 = 0$ and $z^0 = 1$ for $z > 0$).

The constant 1 in the right-hand side of (3.2) is chosen for ease of exposition. It could be replaced by any other constant by a rescaling argument.

Lemma 3.1 below states that our definition of sparseness implies that a sparse signal has only a few large coefficients. In Lemma 3.2, it is shown that such a signal can be well approximated by one with only a few non-zero coefficients.
Lemma 3.1. Suppose $\theta_0$ is sparse. Then for all $\lambda > 0$,
$$\#\{|\theta_{j,0}| > \lambda\} \le \lambda^{-r}.$$
Proof. By the definition of sparseness,
$$\sum_{j=1}^n |\theta_{j,0}|^r \le 1.$$
Conversely,
$$\sum_{j=1}^n |\theta_{j,0}|^r \ge \sum_{|\theta_{j,0}| > \lambda} |\theta_{j,0}|^r \ge \#\{|\theta_{j,0}| > \lambda\}\, \lambda^r. \qquad\Box$$

Lemma 3.2. Let
$$\theta_{j,\lambda}^* = \begin{cases} \theta_{j,0} & \text{if } |\theta_{j,0}| > \lambda \\ 0 & \text{if } |\theta_{j,0}| \le \lambda \end{cases}.$$
If $\theta_0$ is sparse, we have
$$\|\theta_\lambda^* - \theta_0\|_n^2 \le \lambda^{2-r}.$$
Proof. By the definition of $\theta_\lambda^*$, we have
$$\|\theta_\lambda^* - \theta_0\|_n^2 = \sum_{|\theta_{j,0}| \le \lambda} |\theta_{j,0}|^2 \le \lambda^{2-r} \sum_{j=1}^n |\theta_{j,0}|^r \le \lambda^{2-r},$$
using in the last inequality the assumed sparseness of $\theta_0$. □
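Both lemmas are easy to check numerically (an added illustration; the polynomially decaying signal is an arbitrary choice satisfying (3.2)):

```python
import numpy as np

n, r = 10_000, 0.5
# A signal with sum_j |theta_{j,0}|^r <= 1: polynomially decaying coefficients.
j = np.arange(1, n + 1)
theta0 = j ** (-3.0)
theta0 /= (np.abs(theta0) ** r).sum() ** (1 / r)   # normalize: sum |theta_j|^r = 1

for lam in [0.1, 0.01, 0.001]:
    n_large = np.sum(np.abs(theta0) > lam)               # Lemma 3.1: <= lam^{-r}
    approx = np.sum(theta0[np.abs(theta0) <= lam] ** 2)  # Lemma 3.2: <= lam^{2-r}
    print(lam, n_large <= lam ** -r, approx <= lam ** (2 - r))
```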



3.3. A collection of models

Recall that
$$Y_j = \theta_{j,0} + \epsilon_j,\quad j = 1, \dots, n,$$
with $\epsilon_1, \dots, \epsilon_n$ i.i.d. $N(0, \sigma^2/n)$. Now, we have in mind the situation where we believe that most coefficients in $\theta_0$ are small. But we do not know which coefficients are small, and we certainly do not know whether the signal is sparse with a given roughness parameter $r$. We stress that here the purpose of some definition of sparseness is to motivate our choice of model class. In the rest of this chapter, sparseness is only used as an illustration of the implications of an oracle inequality; see Exercises 3.4 and 3.5.

So we believe and hope most coefficients can be best estimated as being zero, but we don't know which ones. Suppose we decide to set all coefficients with index $j \notin J$ ($J \subset \{1, \dots, n\}$) to zero. In the language of the previous chapter, our model class is then
$$\mathcal F = \Theta(J) = \{\theta \in \mathbb R^n : \theta_j = 0\ \forall\, j \notin J\}.$$
As empirical risk, we take the least squares loss
$$R_n(\theta) = \sum_{j=1}^n (Y_j - \theta_j)^2,$$
with theoretical counterpart
$$R(\theta) = E R_n(\theta) = \|\theta - \theta_0\|_n^2 + \sigma^2.$$
Note thus that $R(\theta_0) = \sigma^2$, and that the excess risk is equal to
$$R(\theta) - R(\theta_0) = \|\theta - \theta_0\|_n^2.$$
The least squares estimator over the model $\Theta(J)$ is
$$\hat\theta_n(J) = \arg\min_{\theta \in \Theta(J)} R_n(\theta).$$
It is clear that this estimator has coefficients
$$\hat\theta_{j,n}(J) = \begin{cases} Y_j & \text{if } j \in J \\ 0 & \text{if } j \notin J \end{cases}.$$
We will now show that the (average) estimation error of the estimator is in this case its variance, and its approximation error equals its squared bias. The excess risk is therefore the mean square error.

Define the best approximation within the class,
$$\theta^*(J) = \arg\min_{\theta \in \Theta(J)} R(\theta).$$
This vector has coefficients
$$\theta_j^*(J) = \begin{cases} \theta_{j,0} & \text{if } j \in J \\ 0 & \text{if } j \notin J \end{cases}.$$

The estimation error is
$$E R(\hat\theta_n(J)) - R(\theta^*(J)) = E\|\hat\theta_n(J) - \theta^*(J)\|_n^2 = \sum_{j \in J} E(Y_j - \theta_{j,0})^2 = \frac{\sigma^2 |J|}{n}.$$
Indeed, this is the variance
$$E\|\hat\theta_n(J) - E\hat\theta_n(J)\|_n^2.$$
The approximation error is
$$R(\theta^*(J)) - R(\theta_0) = \|\theta^*(J) - \theta_0\|_n^2 = \sum_{j \notin J} \theta_{j,0}^2.$$
Here, we indeed recognize the squared bias
$$\|E\hat\theta_n(J) - \theta_0\|_n^2.$$
The average excess risk is thus the mean square error
$$E R(\hat\theta_n(J)) - R(\theta_0) = \frac{\sigma^2 |J|}{n} + \sum_{j \notin J} \theta_{j,0}^2.$$

Exercise 3.2. Let us compare this result with the one of Lemma 2.1. With the notation used there, one has model class $\mathcal F = \Theta(J)$, and empirical process
$$\nu_n(\theta) = -2\sqrt n \sum_{j=1}^n \epsilon_j \theta_j.$$
The metric we take is $d(\theta, \tilde\theta) = \|\theta - \tilde\theta\|_n$. Check that the margin condition holds with $\kappa = 2$ and $c_2 = 1$. The empirical process condition is met for $r = 2$, and for $\beta = 1$ and $c_1^2 = 4\sigma^2 |J|$. The estimation error $V_n$ defined in (2.5) becomes with these values
$$V_n = \frac{8\sigma^2 |J|}{\delta n}.$$
Now, verify the technical condition. The result (2.4) of Lemma 2.1 becomes
$$E R(\hat\theta_n(J)) - R(\theta_0) \le \Big(\frac{1 + \delta}{1 - \delta}\Big) \Big( \frac{8\sigma^2 |J|}{\delta n} + \sum_{j \notin J} \theta_{j,0}^2 \Big).$$
Thus, up to constants, Lemma 2.1 produces the same answer as direct calculations.
3.4. The model an oracle would select

An oracle chooses $J$ as the set $J^*$ which minimizes the mean square error, i.e.,
$$J^* = \arg\min_{J \subset \{1, \dots, n\}} \Big\{ \frac{\sigma^2 |J|}{n} + \sum_{j \notin J} \theta_{j,0}^2 \Big\}.$$

Exercise 3.3. Show that the index set an oracle would select is
$$J^* = \{j : |\theta_{j,0}| > \sqrt{\sigma^2/n}\}.$$
(You can use similar arguments as in the proof of Lemma 3.5.)

Exercise 3.4. Suppose that $\theta_0$ is sparse. Show that for the model $\Theta(J^*)$ chosen by the oracle,
$$E\|\hat\theta_n(J^*) - \theta_0\|_n^2 \le 2 \Big(\frac{\sigma^2}{n}\Big)^{\frac{2-r}{2}}.$$
3.5. Hard- and soft-thresholding

The oracle estimates only the coefficients $\theta_{j,0}$ which are in absolute value bigger than $\sqrt{\sigma^2/n}$. The idea is now to replace the unknown coefficients $\theta_{j,0}$ by the observations $Y_j$, $j = 1, \dots, n$. First of all, it should then be noted that the noise level $\sigma^2$ is generally unknown. However, this problem is minor, as one may construct estimators of $\sigma^2$. Here, we assume for simplicity that $\sigma^2$ is known. A more severe problem is that we now estimate the $\theta_{j,0}$ in order to construct an estimator of the $\theta_{j,0}$! It actually turns out that this procedure works if we make the threshold a bit larger than $\sqrt{\sigma^2/n}$, namely, it should be chosen $\asymp \sqrt{2(\log n)\sigma^2/n}$. Then the oracle is almost mimicked (see Lemma 3.3).

We now first present the definitions of the hard-thresholding and soft-thresholding estimator. Lemmas 3.3 and 3.4 give the oracle inequality for these estimators, and Lemmas 3.5 and 3.6 put them in the framework of penalized M-estimation. We then have a clearer picture of the type of oracle inequalities we may expect for more general M-estimators.

Let $\lambda_n \ge 0$ be some threshold. This threshold will be the regularization parameter in the present context.

Definition of the hard-thresholding estimator.
$$\hat\theta_{j,n}(\text{hard}) = \begin{cases} Y_j & \text{if } |Y_j| > \lambda_n \\ 0 & \text{if } |Y_j| \le \lambda_n \end{cases},\quad j = 1, \dots, n.$$

Definition of the soft-thresholding estimator.
$$\hat\theta_{j,n}(\text{soft}) = \begin{cases} Y_j - \lambda_n & \text{if } Y_j > \lambda_n \\ Y_j + \lambda_n & \text{if } Y_j < -\lambda_n \\ 0 & \text{if } |Y_j| \le \lambda_n \end{cases},\quad j = 1, \dots, n.$$
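In code, both estimators are one-line maps applied coordinatewise. The sketch below (added; the sparse mean vector is an arbitrary example) uses the threshold of Lemma 3.4:

```python
import numpy as np

def hard_threshold(y, lam):
    # keep Y_j if |Y_j| > lam, set to zero otherwise
    return np.where(np.abs(y) > lam, y, 0.0)

def soft_threshold(y, lam):
    # shrink towards zero by lam; identically zero on [-lam, lam]
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

rng = np.random.default_rng(3)
n, sigma = 1024, 1.0
theta0 = np.zeros(n)
theta0[:10] = 3.0                                   # a sparse mean vector
Y = theta0 + sigma * rng.standard_normal(n) / np.sqrt(n)

lam = np.sqrt(2 * np.log(n) * sigma**2 / n)         # threshold of Lemma 3.4
for est in (hard_threshold, soft_threshold):
    err = np.sum((est(Y, lam) - theta0) ** 2)       # ||theta_hat - theta_0||_n^2
    print(est.__name__, err)
```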

It is shown in Lemmas 3.3 and 3.4 that the estimators $\hat\theta_n(\text{hard})$ and $\hat\theta_n(\text{soft})$ have similar oracle properties, i.e., they have up to $(\log n)$-terms the same mean square error as when using the model $\Theta(J^*)$ given by the oracle. Lemmas 3.3 and 3.4 are proved in Donoho and Johnstone (1994b), by direct calculations. We will reconsider the oracle properties of the soft-thresholding estimator in Lemma 3.7, in a fashion that allows extension to other M-estimation contexts.

Lemma 3.3 for the hard-thresholding estimator is stated in an asymptotic sense. We use the following notation and terminology. Let $\{z_n\}$ and $\{\delta_n\}$ be sequences of positive numbers. We say that $z_n = o(\delta_n)$ ($z_n$ is of smaller order than $\delta_n$) if $z_n/\delta_n \to 0$ as $n \to \infty$. Moreover, $z_n = O(\delta_n)$ ($z_n$ is of order $\delta_n$) means that $\limsup_{n\to\infty} z_n/\delta_n < \infty$, and $z_n \asymp \delta_n$ ($z_n$ is asymptotically equal to $\delta_n$) means $z_n/\delta_n \to 1$. If $\{\hat\theta_n\}$ is a sequence of estimators of some parameter $\theta_0$ in some metric space with metric $d$, we write $d(\hat\theta_n, \theta_0) = O_P(\delta_n)$ ($\hat\theta_n$ converges (to $\theta_0$) with rate $\delta_n$) if $d(\hat\theta_n, \theta_0)/\delta_n$ remains bounded in probability. Note that sufficient for the latter is $E d(\hat\theta_n, \theta_0) = O(\delta_n)$.
Lemma 3.3. Take for some $0 < \eta < 1$,
$$(1 - \eta)(\log\log n)\,\sigma^2/n \;\le\; \lambda_n^2 - 2(\log n)\,\sigma^2/n \;=\; o\big((\log n)\,\sigma^2/n\big).$$
Then
$$E\|\hat\theta_n(\text{hard}) - \theta_0\|_n^2 \le L_n \Big( \frac{\sigma^2(|J^*| + 1)}{n} + \sum_{j \notin J^*} \theta_{j,0}^2 \Big),$$
where $L_n \asymp 2\log n$.

Lemma 3.4. Take $\lambda_n^2 = 2(\log n)\,\sigma^2/n$. Then
$$E\|\hat\theta_n(\text{soft}) - \theta_0\|_n^2 \le (2\log n + 1) \Big( \frac{\sigma^2(|J^*| + 1)}{n} + \sum_{j \notin J^*} \theta_{j,0}^2 \Big).$$

Exercise 3.5. Suppose that $\theta_0$ is sparse. Show that when the threshold is chosen as in Lemma 3.3 (Lemma 3.4), the squared rate of convergence for the hard-thresholding (soft-thresholding) estimator is
$$\Big( \frac{(\log n)\,\sigma^2}{n} \Big)^{\frac{2-r}{2}}.$$
Compare with Exercise 3.4.
Donoho and Johnstone (1994b) also prove that the soft-thresholding estimator can be improved by choosing the threshold $\lambda_n$ more carefully. We will not cite that result here, because our focus is not so much on the constants. In fact, in Lemma 3.7, we will treat the soft-thresholding estimator again, using an indirect method. This gives worse constants, but the approach has the advantage that the method is applicable to other situations as well, and in particular does not rely on the assumption of normally distributed errors.

The hard- and soft-thresholding estimators are penalized M-estimators, as is shown in Lemmas 3.5 and 3.6. This point of view allows one to define hard- and soft-thresholding type estimators for other loss functions as well. The hard-thresholding estimator comes from a penalty on the number of non-zero coefficients, $\#\{\theta_j \ne 0\} = \sum_{j=1}^n |\theta_j|^0$, which we refer to as the $\ell_0$-penalty. The soft-thresholding case corresponds to a penalty on the $\ell_1$-norm $\sum_{j=1}^n |\theta_j|$, and we refer to it as the $\ell_1$-penalty.

Lemma 3.5. The hard-thresholding estimator $\hat\theta_n(\text{hard})$ minimizes
$$\sum_{j=1}^n (Y_j - \theta_j)^2 + \lambda_n^2\, \#\{\theta_j \ne 0\}.$$

Proof. Let $\theta_n^*$ be the minimizer of
$$\sum_{j=1}^n (Y_j - \theta_j)^2 + \lambda_n^2\, \#\{\theta_j \ne 0\}.$$
We can carry out the minimization coefficient-by-coefficient. So (for $j = 1, \dots, n$), $\theta_{j,n}^*$ minimizes
$$S_j^2(\theta_j) = (Y_j - \theta_j)^2 + \lambda_n^2\, 1\{\theta_j \ne 0\}.$$
Clearly, if $\theta_{j,n}^* \ne 0$, we must take it equal to $Y_j$, and then $S_j^2(\theta_{j,n}^*) = \lambda_n^2$. If $\theta_{j,n}^* = 0$, we have $S_j^2(\theta_{j,n}^*) = Y_j^2$. So
$$S_j^2(\theta_{j,n}^*) = \begin{cases} \lambda_n^2 & \text{if } \theta_{j,n}^* = Y_j \\ Y_j^2 & \text{if } \theta_{j,n}^* = 0 \end{cases}.$$
Since $S_j^2(\theta_j)$ is minimized over $\theta_j$, we have
$$S_j^2(\theta_{j,n}^*) = \min(\lambda_n^2, Y_j^2).$$
Hence, $\theta_{j,n}^* = \hat\theta_{j,n}(\text{hard})$. □

Lemma 3.6. The soft-thresholding estimator $\hat\theta_n(\text{soft})$ minimizes
$$\sum_{j=1}^n (Y_j - \theta_j)^2 + 2\lambda_n \sum_{j=1}^n |\theta_j|.$$

Proof. Let $\theta_n^*$ be the minimizer of
$$\sum_{j=1}^n (Y_j - \theta_j)^2 + 2\lambda_n \sum_{j=1}^n |\theta_j|.$$
The minimization can be done coefficient-by-coefficient, as in the previous lemma. So (for $j = 1, \dots, n$), $\theta_{j,n}^*$ minimizes
$$S_j^2(\theta_j) = (Y_j - \theta_j)^2 + 2\lambda_n |\theta_j|.$$
If $\theta_{j,n}^* > 0$, the function $S_j^2(\theta_j)$ is differentiable near $\theta_{j,n}^*$, with derivative $-2(Y_j - \theta_j) + 2\lambda_n$. Setting this derivative to zero shows that $\theta_{j,n}^* = Y_j - \lambda_n$. So we conclude that $\theta_{j,n}^* > 0$ can only happen when $Y_j - \lambda_n > 0$, and in that case $\theta_{j,n}^* = Y_j - \lambda_n$. Similarly when $\theta_{j,n}^* < 0$. When $\theta_{j,n}^* = 0$, we find $S_j^2(\theta_{j,n}^*) = Y_j^2$. Hence
$$S_j^2(\theta_{j,n}^*) = \begin{cases} 2\lambda_n Y_j - \lambda_n^2 & \text{if } \theta_{j,n}^* = Y_j - \lambda_n > 0 \\ -2\lambda_n Y_j - \lambda_n^2 & \text{if } \theta_{j,n}^* = Y_j + \lambda_n < 0 \\ Y_j^2 & \text{if } \theta_{j,n}^* = 0 \end{cases}.$$
Obviously, we always have
$$2\lambda_n |Y_j| - \lambda_n^2 \le Y_j^2,$$
so $\theta_{j,n}^* = 0$ whenever $|Y_j| \le \lambda_n$. In other words, $\theta_{j,n}^* = \hat\theta_{j,n}(\text{soft})$. □

Now, we will reprove Lemma 3.4, albeit with less economic constants. The idea is writing down a basic inequality, similar to (2.2), but now for the penalized case. The basic inequality with penalty takes the form
$$\|\hat\theta_n - \theta_0\|_n^2 \le -[\nu_n(\hat\theta_n) - \nu_n(\theta^*)]/\sqrt n + \|\theta^* - \theta_0\|_n^2 + \text{pen}(\theta^*) - \text{pen}(\hat\theta_n).$$
Here, $\theta^*$ is the best penalized approximation of $\theta_0$. It means that $\theta^*$ is defined as in Lemma 3.2, with an appropriate choice of the threshold $\lambda$. The empirical process takes the simple form
$$\nu_n(\theta) = -2\sqrt n \sum_{j=1}^n \epsilon_j \theta_j.$$
We will bound the increments by
$$|\nu_n(\theta) - \nu_n(\theta^*)| \le 2\sqrt n \sum_{j=1}^n |\theta_j - \theta_j^*| \max_{j=1,\dots,n} |\epsilon_j|.$$
Finally, we only consider the soft-thresholding estimator, that is, the penalty considered is the $\ell_1$-penalty
$$\text{pen}(\theta) = 2\lambda_n \sum_{j=1}^n |\theta_j|.$$
For the regularization parameter $\lambda_n$, a value proportional to $\sqrt{2\log n/n}$ can be taken (see Lemma 3.8). Lemma 3.7 then states that the estimator with $\ell_1$-penalty satisfies an oracle inequality, where the oracle concerns the $\ell_0$-penalty.
Lemma 3.7. Let $\hat\theta_n = \hat\theta_n(\text{soft})$. Let $0 < \delta \le 1$ be arbitrary. Set
$$V_n(\theta) = 16\lambda_n^2\, \#\{\theta_j \ne 0\}/\delta.$$
On the set $\Omega_n = \{\max_{1 \le j \le n} |\epsilon_j| \le \lambda_n\}$, we have
$$\|\hat\theta_n - \theta_0\|_n^2 \le \Big(\frac{2 + \delta}{2 - \delta}\Big) \min_\theta \big\{ V_n(\theta) + \|\theta - \theta_0\|_n^2 \big\}.$$

Proof. At first, let $\theta^*$ be arbitrary. Write
$$\text{pen}(\theta) = 2\lambda_n \sum_{j=1}^n |\theta_j| = \text{pen}_1(\theta) + \text{pen}_2(\theta),$$
with
$$\text{pen}_1(\theta) = 2\lambda_n \sum_{j:\, \theta_j^* \ne 0} |\theta_j|,\qquad \text{pen}_2(\theta) = 2\lambda_n \sum_{j:\, \theta_j^* = 0} |\theta_j|.$$
We use the shorthand notation
$$N(\theta) = \#\{\theta_j \ne 0\}.$$

Then, on $\Omega_n$,
$$\|\hat\theta_n - \theta_0\|_n^2 \le -[\nu_n(\hat\theta_n) - \nu_n(\theta^*)]/\sqrt n + \text{pen}(\theta^*) - \text{pen}(\hat\theta_n) + \|\theta^* - \theta_0\|_n^2$$
$$= 2\sum_{j=1}^n \epsilon_j (\hat\theta_{j,n} - \theta_j^*) + \text{pen}(\theta^*) - \text{pen}(\hat\theta_n) + \|\theta^* - \theta_0\|_n^2$$
$$\le 2\lambda_n \sum_{j=1}^n |\hat\theta_{j,n} - \theta_j^*| + \text{pen}(\theta^*) - \text{pen}(\hat\theta_n) + \|\theta^* - \theta_0\|_n^2$$
$$= \text{pen}_1(\hat\theta_n - \theta^*) + \text{pen}_2(\hat\theta_n) + \text{pen}_1(\theta^*) - \text{pen}_1(\hat\theta_n) - \text{pen}_2(\hat\theta_n) + \|\theta^* - \theta_0\|_n^2$$
$$\le 2\,\text{pen}_1(\hat\theta_n - \theta^*) + \|\theta^* - \theta_0\|_n^2$$
$$\le 4\lambda_n \sqrt{N(\theta^*)}\, \|\hat\theta_n - \theta^*\|_n + \|\theta^* - \theta_0\|_n^2$$
$$\le 4\lambda_n \sqrt{N(\theta^*)}\, \|\hat\theta_n - \theta_0\|_n + 4\lambda_n \sqrt{N(\theta^*)}\, \|\theta^* - \theta_0\|_n + \|\theta^* - \theta_0\|_n^2.$$
Now, we proceed as in Lemma 2.1. Since for $a$ and $b$ non-negative, $ab \le (a^2 + b^2)/2$ (compare with the Technical Lemma at the end of Chapter 2), we have
$$4\lambda_n \sqrt{N(\theta^*)}\, \|\hat\theta_n - \theta_0\|_n \le \frac\delta 2 \|\hat\theta_n - \theta_0\|_n^2 + \frac{8\lambda_n^2 N(\theta^*)}{\delta}.$$
Here, we may also replace $\|\hat\theta_n - \theta_0\|_n$ by $\|\theta^* - \theta_0\|_n$. So we find on $\Omega_n$,
$$\|\hat\theta_n - \theta_0\|_n^2 \le \frac\delta 2 \|\hat\theta_n - \theta_0\|_n^2 + \frac\delta 2 \|\theta^* - \theta_0\|_n^2 + \frac{16\lambda_n^2 N(\theta^*)}{\delta} + \|\theta^* - \theta_0\|_n^2,$$
or
$$\Big(1 - \frac\delta 2\Big) \|\hat\theta_n - \theta_0\|_n^2 \le \Big(1 + \frac\delta 2\Big) \|\theta^* - \theta_0\|_n^2 + \frac{16\lambda_n^2 N(\theta^*)}{\delta} \le \Big(1 + \frac\delta 2\Big) \Big\{ \frac{16\lambda_n^2 N(\theta^*)}{\delta} + \|\theta^* - \theta_0\|_n^2 \Big\}.$$
To conclude the proof, take
$$\theta^* = \arg\min_\theta \big\{ 16\lambda_n^2 N(\theta)/\delta + \|\theta - \theta_0\|_n^2 \big\}. \qquad\Box$$


We thus see that for n % 2 log n/n, we arrive at the same oracle rates as
in Lemmas 3.3 and 3.4, provided the set P(maxj |
j | > n ) 0 for such a choice
of n . Indeed, this is shown to be okay in Lemma 3.8.

3.6. A probability inequality for the empirical process

Recall that we proved Lemma 3.7 on the set $\{\max_{1 \le j \le n} |\epsilon_j| \le \lambda_n\}$. In order to finish our oracle inequality, we need to show that for appropriate thresholds $\lambda_n$, this set has large probability.

Lemma 3.8. Let $Z$ be $N(0,1)$-distributed. Then for all $a > 0$,
$$P(Z > a) \le \exp\big[-\tfrac12 a^2\big].$$
Moreover, if $Z_1, \dots, Z_N$ are $N(0,1)$-distributed (and not necessarily independent), then for all $a \ge 2\sqrt{\log N}$,
$$P\Big( \max_{1 \le j \le N} Z_j > a \Big) \le \exp\big[-\tfrac14 a^2\big].$$

Proof. We have for all $a > 0$ and $\gamma > 0$, by Chebyshev's inequality,
$$P(Z > a) \le \frac{E\exp[\gamma Z]}{\exp[\gamma a]} = \exp\big[\tfrac12 \gamma^2 - \gamma a\big].$$
Take $\gamma = a$ to arrive at
$$P(Z > a) \le \exp\big[-\tfrac12 a^2\big].$$
To prove the second assertion of the lemma, we note that
$$P\Big( \max_{1 \le j \le N} Z_j > a \Big) \le N\exp\big[-\tfrac12 a^2\big].$$
Take $a \ge 2\sqrt{\log N}$ to get
$$P\Big( \max_{1 \le j \le N} Z_j > a \Big) \le \exp\big[-\tfrac14 a^2\big]. \qquad\Box$$

It clearly follows from Lemma 3.8 that if $\epsilon_1, \dots, \epsilon_n$ are independent $N(0, \sigma^2/n)$, we have for all $t \ge \sqrt 2$,
$$P\Big( \max_{1 \le j \le n} |\epsilon_j| > t\sigma\sqrt{2\log n/n} \Big) \le 2\exp\Big[ -\frac{t^2 \log n}{4} \Big].$$

Exercise 3.6. Let $Z_1, Z_2, \dots$ be independent $N(0,1)$-distributed, and $\epsilon_j = \sigma Z_j/\sqrt n$. Take $\lambda_n = t\sigma\sqrt{2\log n/n}$, with $t > 2$. Verify that $\max_{1 \le j \le n} |\epsilon_j| \le \lambda_n$ almost surely for all $n$ sufficiently large.
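A short Monte Carlo experiment (added here; the parameter values are arbitrary) illustrates the resulting bound on $\max_j |\epsilon_j|$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma, t, reps = 1000, 1.0, 1.5, 2000
lam = t * sigma * np.sqrt(2 * np.log(n) / n)

# eps_j ~ N(0, sigma^2/n), reps independent replications of the vector
eps = sigma * rng.standard_normal((reps, n)) / np.sqrt(n)
freq = np.mean(np.abs(eps).max(axis=1) > lam)    # P(max_j |eps_j| > lam)

print(freq, 2 * np.exp(-t**2 * np.log(n) / 4))   # empirical frequency vs. bound
```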

3.7. Bibliographical remarks

In Donoho and Johnstone (1994a), one can find minimax theory for $\ell_p$-spaces ($0 < p < \infty$). In Donoho and Johnstone (1994b), the relation with wavelet shrinkage is also there: it explicitly takes the step from the original regression problem via wavelets to the sequence space formulation. Roughly speaking, when using an appropriate wavelet basis to expand a function $f_0$ with coefficients $\theta_0$, the roughness parameter $r$ corresponds to the effective smoothness $s$ of $f_0$ via the relation $r = 2/(2s + 1)$. For more details we refer to Donoho and Johnstone (1996) and wavelet theory (e.g., Edmunds and Triebel (1992), Härdle, Kerkyacharian, Picard and Tsybakov (1998)). The oracle terminology is from Donoho and Johnstone (1994b). Soft-thresholding is also studied in Donoho (1995). The $\ell_1$-penalty corresponding to soft-thresholding is called the LASSO in Tibshirani (1996). It is used for instance for regularization of least squares estimators in linear models with many co-variables, possibly with linear dependencies between the variables. We refer to the book of Hastie, Tibshirani and Friedman (2001). Indeed, linearly dependent systems can often provide better sparse approximations, for instance wedgelets (see Donoho (1999)) or curvelets (see Candès and Donoho (2004)), which are sparse systems for representing edges or other singularities. The similarity between the behavior of hard- and soft-thresholding type penalties is investigated in Donoho (2004a,b). There, it is shown that the $\ell_1$-penalized solution of an overdetermined least squares problem is often also (an approximation of) the solution of the $\ell_0$-penalized problem.

4. Overruling the variance

We revisit the regression problem of the previous chapter. One has observations $\{(x_i, Y_i)\}_{i=1}^n$, with $x_1, \dots, x_n$ fixed co-variables, and $Y_1, \dots, Y_n$ response variables, satisfying the regression
$$Y_i = f_0(x_i) + \epsilon_i,\quad i = 1, \dots, n,$$
where $\epsilon_1, \dots, \epsilon_n$ are independent and centered noise variables, and $f_0$ is an unknown function on $\mathcal X$. The errors are assumed to be $N(0, \sigma^2)$-distributed.

Let $\bar{\mathcal F}$ be a collection of regression functions. The penalized least squares estimator is
$$\hat f_n = \arg\min_{f \in \bar{\mathcal F}} \Big\{ \frac 1n \sum_{i=1}^n |Y_i - f(x_i)|^2 + \text{pen}(f) \Big\}.$$
Here $\text{pen}(f)$ is a penalty on the complexity of the function $f$. Let $Q_n$ be the empirical distribution of $x_1, \dots, x_n$ and $\|\cdot\|_n$ be the $L_2(Q_n)$-norm. Define
$$f^* = \arg\min_{f \in \bar{\mathcal F}} \big\{ \|f - f_0\|_n^2 + \text{pen}(f) \big\}.$$
Our aim is to show that
$$(4.1)\qquad E\|\hat f_n - f_0\|_n^2 \le \text{const.} \big\{ \|f^* - f_0\|_n^2 + \text{pen}(f^*) \big\}.$$

When this aim is indeed reached, we loosely say that $\hat f_n$ satisfies an oracle inequality. In fact, what (4.1) says is that $\hat f_n$ behaves as the noiseless version $f^*$. That means, so to speak, that we overruled the variance of the noise.

In Section 4.1, we put the objectives of this chapter in the framework of Chapter 2. In particular, we recall there the definitions of estimation and approximation error. Section 4.2 calculates the estimation error when one employs least squares estimation, without penalty, over a finite model class. The estimation error turns out to behave as the log-cardinality of the model class. Section 4.3 shows that when considering a collection of nested finite models, a penalty $\text{pen}(f)$ proportional to the log-cardinality of the smallest class containing $f$ will indeed mimic the oracle over this collection of models. In Section 4.4, we consider general penalties. It turns out that the (local) entropy of the model classes plays a crucial role. Recall that in Chapter 3, the hard-thresholding estimator corresponds to a penalty proportional to the dimension (number of non-zero coefficients) of the model class. Indeed, the local entropy of a finite-dimensional space is proportional to its dimension. For a finite class, the entropy is (bounded by) its log-cardinality. Whether or not (4.1) holds true depends on the choice of the penalty. In Section 4.4, we show that when the penalty is taken too small, there will appear an additional term showing that not all variance was killed. Section 4.5 presents an example.

Throughout this chapter, we assume the noise level $\sigma > 0$ to be known. In that case, by a rescaling argument, one can assume without loss of generality that $\sigma = 1$. In general, one needs a good estimate of an upper bound for $\sigma$, because the penalties considered in this chapter depend on the noise level. When one replaces the unknown noise level by an estimated upper bound, the penalty in fact becomes data dependent.

4.1. Estimation and approximation error

Let $\mathcal F$ be a model class. Consider the least squares estimator without penalty,
$$\hat f_n(\cdot, \mathcal F) = \arg\min_{f \in \mathcal F} \frac 1n \sum_{i=1}^n |Y_i - f(x_i)|^2.$$
The excess risk $\|\hat f_n(\cdot, \mathcal F) - f_0\|_n^2$ of this estimator is the sum of estimation error and approximation error.

Now, if we have a collection of models $\{\mathcal F\}$, a penalty is usually some measure of the complexity of the model class $\mathcal F$. With some abuse of notation, write this penalty as $\text{pen}(\mathcal F)$. The corresponding penalty on the functions $f$ is then
$$\text{pen}(f) = \min_{\mathcal F :\, f \in \mathcal F} \text{pen}(\mathcal F).$$
We may then write
$$\hat f_n = \arg\min_{\hat f_n(\cdot, \mathcal F) :\, \mathcal F \in \{\mathcal F\}} \Big\{ \frac 1n \sum_{i=1}^n |Y_i - \hat f_n(x_i, \mathcal F)|^2 + \text{pen}(\mathcal F) \Big\},$$

where $\hat f_n(\cdot, \mathcal F)$ is the least squares estimator over $\mathcal F$. Similarly,
$$f^* = \arg\min_{f^*(\cdot, \mathcal F) :\, \mathcal F \in \{\mathcal F\}} \big\{ \|f^*(\cdot, \mathcal F) - f_0\|_n^2 + \text{pen}(\mathcal F) \big\},$$
where $f^*(\cdot, \mathcal F)$ is the best approximation of $f_0$ in the model $\mathcal F$.

As we will see, taking $\text{pen}(\mathcal F)$ proportional to (an estimate of) a bound for the estimation error of $\hat f_n(\cdot, \mathcal F)$ will (up to constants and possibly $(\log n)$-factors) balance estimation error and approximation error.

Exercise 4.1. Let $\{\psi_j\}_{j=1}^n$ be an orthonormal basis in $L_2(Q_n)$. Define $f_\theta = \sum_{j=1}^n \theta_j \psi_j$, where $\theta = (\theta_1, \dots, \theta_n)'$ is a vector in $\mathbb R^n$. Check that
$$\|f_\theta\|_n^2 = \sum_{j=1}^n |\theta_j|^2 := \|\theta\|_n^2.$$
Let $\bar{\mathcal F} = \{f_\theta : \theta \in \mathbb R^n\}$, and let $\mathcal F \subset \bar{\mathcal F}$ be a linear subspace, so that, as in Exercise 2.4,
$$\|\hat f_n(\cdot, \mathcal F) - f_0\|_n^2 = \|\hat f_n(\cdot, \mathcal F) - f^*(\cdot, \mathcal F)\|_n^2 + \|f^*(\cdot, \mathcal F) - f_0\|_n^2.$$
Write $\hat f_n(\cdot, \mathcal F) = f_{\hat\theta_n(\mathcal F)}$, and likewise $f^*(\cdot, \mathcal F) = f_{\theta^*(\mathcal F)}$. Then by Pythagoras' rule, the excess risk at $\hat f_n(\cdot, \mathcal F)$ can be written as
$$\|\hat\theta_n(\mathcal F) - \theta_0\|_n^2 = \|\hat\theta_n(\mathcal F) - \theta^*(\mathcal F)\|_n^2 + \|\theta^*(\mathcal F) - \theta_0\|_n^2.$$
Consider now the hard-thresholding estimator $f_{\hat\theta_n}$ with $\hat\theta_n = \hat\theta_n(\text{hard})$ defined in Section 3.5. Recall that in that case, $\text{pen}(f_\theta) = \lambda_n^2\, \#\{\theta_j \ne 0\}$, with $\lambda_n$ the threshold (see Lemma 3.5). Compare the oracle inequality (4.1) with the oracle inequality of Lemma 3.3.

Exercise 4.1 highlights again that the hard-thresholding estimator, which is based on an $\ell_0$-penalty overruling the variance, has oracle behavior.

We remark here that the $\ell_1$-penalty is of a different nature, yet can still yield oracle behavior. Indeed, we will see in Chapter 5 that $\ell_1$-type penalties can lead to the right trade off between estimation error and approximation error. The $\ell_1$-penalty kills the variance in a rather implicit way. It is not based on (an estimate of) the estimation error. This is very useful in contexts other than penalized least squares, because generally, the estimation error can depend on the unknown underlying probability measure $P$ in a rather complicated way.

In this chapter, the empirical process takes the form
$$\nu_n(f) = \frac{1}{\sqrt n} \sum_{i=1}^n \epsilon_i f(x_i),$$
with the function $f$ ranging over (some subclass of) $\bar{\mathcal F}$. Probability inequalities for the empirical process are derived using Lemma 3.8. The latter is for normally distributed random variables. It is exactly at this place where our assumption of normally distributed noise comes in. Relaxing the normality assumption is straightforward, provided a proper probability inequality, an inequality of sub-Gaussian

type, goes through. In fact, at the cost of additional, essentially technical, assumptions, an inequality of exponential type on the errors is sufficient as well (see van de Geer (2000)).

4.2. Finite models

Let $\mathcal F$ be a finite collection of functions, with cardinality $|\mathcal F| \ge 2$. Consider the least squares estimator over $\mathcal F$,
$$\hat f_n = \arg\min_{f \in \mathcal F} \frac 1n \sum_{i=1}^n |Y_i - f(x_i)|^2.$$
In this section, $\mathcal F$ is fixed, and we do not explicitly express the dependency of $\hat f_n$ on $\mathcal F$. Define
$$\|f^* - f_0\|_n = \min_{f \in \mathcal F} \|f - f_0\|_n.$$
The dependence of $f^*$ on $\mathcal F$ is also not expressed in the notation of this section. Alternatively stated, we take here
$$\text{pen}(f) = \begin{cases} 0 & f \in \mathcal F \\ \infty & f \notin \mathcal F \end{cases}.$$
The result of Lemma 4.1 below implies that the estimation error is proportional to $\log|\mathcal F|/n$, i.e., it is logarithmic in the number of elements in the parameter space. We present the result in terms of a probability inequality. An inequality for, e.g., the average excess risk follows from this (see Exercise 4.2).

Lemma 4.1. We have for all $t > 0$ and $0 < \delta < 1$,
$$P\bigg( \|\hat f_n - f_0\|_n^2 \ge \Big(\frac{1 + \delta}{1 - \delta}\Big) \Big\{ \|f^* - f_0\|_n^2 + \frac{4\log|\mathcal F|}{n\delta} + \frac{4t^2}{\delta} \Big\} \bigg) \le \exp[-nt^2].$$

Proof. We have the basic inequality
$$\|\hat f_n - f_0\|_n^2 \le \frac 2n \sum_{i=1}^n \epsilon_i \big(\hat f_n(x_i) - f^*(x_i)\big) + \|f^* - f_0\|_n^2.$$
By Lemma 3.8, for all $t > 0$,
$$P\bigg( \max_{f \in \mathcal F,\, \|f - f^*\|_n > 0} \frac{\frac 1n \sum_{i=1}^n \epsilon_i (f(x_i) - f^*(x_i))}{\|f - f^*\|_n} > \sqrt{2\log|\mathcal F|/n + 2t^2} \bigg)$$
$$\le |\mathcal F| \exp[-(\log|\mathcal F| + nt^2)] = \exp[-nt^2].$$
If $\frac 1n \sum_{i=1}^n \epsilon_i (\hat f_n(x_i) - f^*(x_i)) \le (2\log|\mathcal F|/n + 2t^2)^{1/2} \|\hat f_n - f^*\|_n$, we have, using $2\sqrt{ab} \le \delta a + b/\delta$ for all non-negative $a$ and $b$,
$$\|\hat f_n - f_0\|_n^2 \le 2\big(2\log|\mathcal F|/n + 2t^2\big)^{1/2} \|\hat f_n - f^*\|_n + \|f^* - f_0\|_n^2$$
$$\le \delta \|\hat f_n - f_0\|_n^2 + 4\log|\mathcal F|/(n\delta) + 4t^2/\delta + (1 + \delta)\|f^* - f_0\|_n^2. \qquad\Box$$

Exercise 4.2. Using Lemma 4.1 and the formula
$$EZ = \int_0^\infty P(Z \ge t)\,dt$$
for a non-negative random variable $Z$, derive bounds for the average excess risk $E\|\hat f_n - f_0\|_n^2$.
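The $\log|\mathcal F|/n$ behaviour of the estimation error can be made visible in a toy simulation (an added sketch with arbitrary choices, not from the notes; placing the candidate functions at the critical distance $\sqrt{2\log|\mathcal F|/n}$ from $f_0$ is a deliberate choice that makes the effect visible, and $\sigma = 1$ as in the text):

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 100, 300

def mean_excess_risk(m):
    # Candidates at squared distance 2*log(m)/n from f0 = 0, the scale at
    # which the estimation error of Lemma 4.1 shows up.
    a = np.sqrt(2 * np.log(m) / n)
    risks = []
    for _ in range(reps):
        F = rng.standard_normal((m, n))
        F *= a / np.sqrt(np.mean(F**2, axis=1, keepdims=True))  # ||f||_n = a
        F = np.vstack([np.zeros(n), F])                         # include f0 itself
        Y = rng.standard_normal(n)                              # Y_i = eps_i, f0 = 0
        rn = np.mean((Y[None, :] - F) ** 2, axis=1)             # R_n(f), f in F
        risks.append(np.mean(F[np.argmin(rn)] ** 2))            # ||f_hat - f0||_n^2
    return np.mean(risks)

for m in [4, 32, 256, 2048]:
    # the excess risk grows roughly like log|F|/n, as Lemma 4.1 predicts
    print(m, round(mean_excess_risk(m), 4), round(np.log(m) / n, 4))
```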
4.3. Nested, finite models

Let $\mathcal F_1 \subset \mathcal F_2 \subset \cdots$ be a collection of nested, finite models, and let $\mathcal F = \bigcup_{m=1}^\infty \mathcal F_m$. We assume $\log|\mathcal F_1| > 1$.

As indicated in Section 4.1, it is a good strategy to take the penalty proportional to the estimation error. In the present context, this works as follows. Define
$$\mathcal F(f) = \mathcal F_{m(f)},\qquad m(f) = \min\{m : f \in \mathcal F_m\},$$
and for some $0 < \delta < 1$,
$$\text{pen}(f) = \frac{16\log|\mathcal F(f)|}{n\delta}.$$
In coding theory, a similar penalty is used. When encoding a message using an encoder from $\mathcal F_m$, one needs to send, in addition to the encoded message, $\log_2|\mathcal F_m|$ bits to tell the receiver which encoder was applied.

Let
$$\hat f_n = \arg\min_{f \in \mathcal F} \Big\{ \frac 1n \sum_{i=1}^n |Y_i - f(x_i)|^2 + \text{pen}(f) \Big\},$$
and
$$f^* = \arg\min_{f \in \mathcal F} \big\{ \|f - f_0\|_n^2 + \text{pen}(f) \big\}.$$
Lemma 4.2. We have, for all $t > 0$ and $0 < \delta < 1$,
$$P\bigg( \|\hat f_n - f_0\|_n^2 > \Big(\frac{1 + \delta}{1 - \delta}\Big) \Big\{ \|f^* - f_0\|_n^2 + \text{pen}(f^*) + \frac{4t^2}{\delta} \Big\} \bigg) \le \exp[-nt^2].$$

Proof. Write down the basic inequality
$$\|\hat f_n - f_0\|_n^2 + \text{pen}(\hat f_n) \le \frac 2n \sum_{i=1}^n \epsilon_i \big(\hat f_n(x_i) - f^*(x_i)\big) + \|f^* - f_0\|_n^2 + \text{pen}(f^*).$$
Define $\mathcal F^j = \{f \in \mathcal F : 2^j < \log|\mathcal F(f)| \le 2^{j+1}\}$, $j = 0, 1, \dots$. We have for all $t > 0$, using Lemma 3.8,
$$P\bigg( \exists\, f \in \mathcal F : \frac 1n \sum_{i=1}^n \epsilon_i (f(x_i) - f^*(x_i)) > \big(8\log|\mathcal F(f)|/n + 2t^2\big)^{1/2} \|f - f^*\|_n \bigg)$$
$$\le \sum_{j=0}^\infty P\bigg( \exists\, f \in \mathcal F^j : \frac 1n \sum_{i=1}^n \epsilon_i (f(x_i) - f^*(x_i)) > \big(2^{j+3}/n + 2t^2\big)^{1/2} \|f - f^*\|_n \bigg)$$




$$\le \sum_{j=0}^\infty \exp\big[2^{j+1} - (2^{j+2} + nt^2)\big] = \sum_{j=0}^\infty \exp\big[-(2^{j+1} + nt^2)\big]$$
$$\le \sum_{j=0}^\infty \exp[-(j + 1 + nt^2)] \le \int_0^\infty \exp[-(x + nt^2)]\,dx = \exp[-nt^2].$$
But if $\frac 1n \sum_{i=1}^n \epsilon_i (\hat f_n(x_i) - f^*(x_i)) \le (8\log|\mathcal F(\hat f_n)|/n + 2t^2)^{1/2} \|\hat f_n - f^*\|_n$, the basic inequality gives
$$\|\hat f_n - f_0\|_n^2 \le 2\big(8\log|\mathcal F(\hat f_n)|/n + 2t^2\big)^{1/2} \|\hat f_n - f^*\|_n + \|f^* - f_0\|_n^2 + \text{pen}(f^*) - \text{pen}(\hat f_n)$$
$$\le \delta\|\hat f_n - f_0\|_n^2 + 16\log|\mathcal F(\hat f_n)|/(n\delta) - \text{pen}(\hat f_n) + 4t^2/\delta + (1 + \delta)\|f^* - f_0\|_n^2 + \text{pen}(f^*)$$
$$= \delta\|\hat f_n - f_0\|_n^2 + 4t^2/\delta + (1 + \delta)\|f^* - f_0\|_n^2 + \text{pen}(f^*),$$
by the definition of $\text{pen}(\hat f_n)$. □

4.4. General penalties

In the general case with possibly infinite model classes $\mathcal F$, we may replace the log-cardinality of a class by its entropy.

Definition. Let $u > 0$ be arbitrary and let $N(u, \mathcal F, \|\cdot\|_n)$ be the minimum number of balls with radius $u$ necessary to cover $\mathcal F$. Then $\{H(u, \mathcal F, \|\cdot\|_n) = \log N(u, \mathcal F, \|\cdot\|_n) : u > 0\}$ is called the entropy of $\mathcal F$ (for the metric induced by the norm $\|\cdot\|_n$).

Recall the definition of the estimator
$$\hat f_n = \arg\min_{f \in \bar{\mathcal F}} \Big\{ \frac 1n \sum_{i=1}^n |Y_i - f(x_i)|^2 + \text{pen}(f) \Big\},$$
and of the noiseless version
$$f^* = \arg\min_{f \in \bar{\mathcal F}} \big\{ \|f - f_0\|_n^2 + \text{pen}(f) \big\}.$$
We moreover define
$$\mathcal F(t) = \{f \in \bar{\mathcal F} : \|f - f^*\|_n^2 + \text{pen}(f) \le t^2\},\quad t > 0.$$
Consider the entropy $H(\cdot, \mathcal F(t), \|\cdot\|_n)$ of $\mathcal F(t)$. Suppose it is finite for each $t$, and in fact that the square root of the entropy is integrable, i.e., that for some continuous upper bound $\bar H(\cdot, \mathcal F(t), \|\cdot\|_n)$ of $H(\cdot, \mathcal F(t), \|\cdot\|_n)$ one has
$$(4.2)\qquad \Psi(t) = \int_0^t \sqrt{\bar H(u, \mathcal F(t), \|\cdot\|_n)}\,du < \infty,\quad t > 0.$$
This means that near $u = 0$, the entropy $H(u, \mathcal F(t), \|\cdot\|_n)$ is not allowed to grow faster than $1/u^2$. Assumption (4.2) is related to asymptotic continuity of the empirical process $\{\nu_n(f) : f \in \mathcal F(t)\}$. If (4.2) does not hold, one can still prove inequalities for the excess risk. To avoid digressions we will skip that issue here.

Lemma 4.3. Suppose that $\Psi(t)/t^2$ does not increase as $t$ increases. There exist constants $c$ and $c'$ such that for
$$(4.3)\qquad \sqrt n\, t_n^2 \ge c\,\big( \Psi(t_n) \vee t_n \big),$$
we have
$$E\big( \|\hat f_n - f_0\|_n^2 + \text{pen}(\hat f_n) \big) \le 2\big( \|f^* - f_0\|_n^2 + \text{pen}(f^*) + t_n^2 \big) + \frac{c'}{n}.$$

Lemma 4.3 is from van de Geer (2001). Comparing it to, e.g., Lemma 4.2, one sees that there is no arbitrary $0 < \delta < 1$ involved in the statement of Lemma 4.3. This is just because van de Geer (2001) has fixed $\delta$ at $\delta = 1/3$ for simplicity. When $\Psi(t)/t^2 \le \sqrt n/C$ for all $t$, for some constant $C$, condition (4.3) is fulfilled if $t_n \ge c\, n^{-1/2}$ and, in addition, $C \ge c$. Thus, by choosing the penalty carefully, one can indeed ensure that the variance is overruled.

4.5. Application to the classical penalty of Chapter 1

We now come to an extension of Example 1.4. Suppose $\mathcal X = [0, 1]$. Let $\bar{\mathcal F}$ be the class of functions on $[0, 1]$ which have derivatives of all orders. The $s$th derivative of a function $f \in \bar{\mathcal F}$ on $[0, 1]$ is denoted by $f^{(s)}$. Define for a given $1 \le p < \infty$ and a given smoothness $s \in \{1, 2, \dots\}$,
$$I^p(f) = \int_0^1 |f^{(s)}(x)|^p\,dx,\quad f \in \bar{\mathcal F}.$$
We consider two cases. In Subsection 4.5.1, we fix a smoothing parameter $\lambda > 0$ and take the penalty $\text{pen}(f) = \lambda^2 I^p(f)$. After some calculations, we then show that in general the variance has not been overruled, i.e., we do not arrive at an estimator that behaves as a noiseless version, because there still is an additional term. However, this additional term can now be killed by including it in the penalty. It all boils down, in Subsection 4.5.2, to a data dependent choice for $\lambda$, or alternatively viewed, a penalty of the form $\text{pen}(f) = \lambda^2 I^{\frac{2}{2s+1}}(f)$, with $\lambda > 0$ depending on $s$ and $n$. This penalty allows one to adapt to small values of $I(f_0)$.

4.5.1. Fixed smoothing parameter. For a function $f \in \bar{\mathcal F}$, we define the penalty
$$\text{pen}(f) = \lambda^2 I^p(f),$$
with $\lambda > 0$ given.

Lemma 4.4. The entropy integral can be bounded by
$$\Psi(t) \le c_0 \Big( t^{\frac{2ps + 2 - p}{2ps}}\, \lambda^{-\frac{1}{ps}} + t\,\sqrt{\log\big(\tfrac 1\lambda \vee 1\big)} \Big),\quad t > 0.$$
Here, $c_0$ is a constant depending on $s$ and $p$.

Proof. This follows from the fact that
$$H_\infty\big( u, \{f \in \bar{\mathcal F} : I(f) \le 1,\ |f|_\infty \le 1\} \big) \le A u^{-1/s},\quad u > 0,$$
where the constant $A$ depends on $s$ and $p$ (see Birman and Solomjak (1967)). Here, $H_\infty$ denotes the entropy for the sup-norm (see Section 6.6 for a definition). For $f \in \mathcal F(t)$, we have
$$I(f) \le \Big(\frac t\lambda\Big)^{2/p},$$
and
$$\|f - f^*\|_n \le t.$$
We may therefore write $f \in \mathcal F(t)$ as $f_1 + f_2$, with $|f_1|_\infty \le I(f_1) = I(f)$ and $\|f_2 - f^*\|_n \le t + I(f)$. It is now not difficult to show that for some constant $C_1$,
$$H(u, \mathcal F(t), \|\cdot\|_n) \le C_1 \Big( \big(\tfrac t\lambda\big)^{\frac{2}{ps}} u^{-1/s} + \log\big(\tfrac{t}{(\lambda \wedge 1)u}\big) \Big),\quad 0 < u < t. \qquad\Box$$
( 1)u
Corollary 4.5. By applying Lemma 4.3, we find that for some constant $c_1$,
$$E\big\{ \|\hat f_n - f_0\|_n^2 + \lambda^2 I^p(\hat f_n) \big\} \le 2\min_f \big\{ \|f - f_0\|_n^2 + \lambda^2 I^p(f) \big\} + c_1 \bigg( \Big(\frac{1}{n\,\lambda^{2/(ps)}}\Big)^{\frac{2ps}{2ps + p - 2}} + \frac{\log\big(\tfrac 1\lambda \vee 1\big)}{n} \bigg).$$

4.5.2. Overruling the variance in this case. For choosing the smoothing parameter $\lambda$, the above suggests the penalty
$$\text{pen}(f) = \min_{\lambda > 0} \bigg\{ \lambda^2 I^p(f) + C_0 \Big(\frac{1}{n\,\lambda^{2/(ps)}}\Big)^{\frac{2ps}{2ps + p - 2}} \bigg\},$$
with $C_0$ a suitable constant. The minimization within this penalty yields
$$\text{pen}(f) = C_0'\, n^{-\frac{2s}{2s+1}}\, I^{\frac{2}{2s+1}}(f),$$
where $C_0'$ depends on $C_0$ and $s$. From the computational point of view (in particular, when $p = 2$), it may be convenient to carry out the penalized least squares as in the previous subsection, for all values of $\lambda$, yielding the estimators
$$\hat f_n(\cdot, \lambda) = \arg\min_f \Big\{ \frac 1n \sum_{i=1}^n |Y_i - f(x_i)|^2 + \lambda^2 I^p(f) \Big\}.$$
Then the estimator with the penalty of this subsection is $\hat f_n(\cdot, \hat\lambda_n)$, where
$$\hat\lambda_n = \arg\min_{\lambda > 0} \bigg\{ \frac 1n \sum_{i=1}^n |Y_i - \hat f_n(x_i, \lambda)|^2 + C_0 \Big(\frac{1}{n\,\lambda^{2/(ps)}}\Big)^{\frac{2ps}{2ps + p - 2}} \bigg\}.$$
From the same calculations as in the proof of Lemma 4.4, one arrives at the following corollary.

Corollary 4.6. For an appropriate, large enough, choice of $C_0$ (or $C_0'$), depending on $c$, $p$ and $s$, we have, for a constant $c_0'$ depending on $c$, $c'$, $C_0$ ($C_0'$), $p$ and $s$,
$$E\Big\{ \|\hat f_n - f_0\|_n^2 + C_0'\, n^{-\frac{2s}{2s+1}} I^{\frac{2}{2s+1}}(\hat f_n) \Big\} \le 2\min_f \Big\{ \|f - f_0\|_n^2 + C_0'\, n^{-\frac{2s}{2s+1}} I^{\frac{2}{2s+1}}(f) \Big\} + \frac{c_0'}{n}.$$
Thus, the estimator adapts to small values of $I(f_0)$. For example, when $s = 1$ and $I(f_0) = 0$ (i.e., when $f_0$ is the constant function), the excess risk of the estimator converges with the parametric rate $1/n$. If we knew that $f_0$ is constant, we would of course use $\sum_{i=1}^n Y_i/n$ as estimator. Thus, this penalized estimator mimics an oracle.

4.6. Bibliographical remarks

When $\{\mathcal F\}$ is a collection of finite-dimensional linear models, a classical penalty is the dimension of $\mathcal F$, properly normalized. This also has an information-theoretic meaning as code length. More generally, one may think of taking the number of parameters in the penalty, or some other measure of "degrees of freedom". Important ideas go back to Akaike (1973) and Schwarz (1978). There is much literature on penalized least squares and other loss functions. We mention in particular Barron, Birgé and Massart (1999) for very general results.

5. The $\ell_1$-penalty

Generally, using almost exactly the same arguments, the results of the previous chapter for least squares estimation can be extended to other M-estimation procedures, provided the margin behavior is known. However, when the margin behavior is not known (i.e., when the parameters $c_2$ and especially $\kappa$ in the margin condition of Section 2.5 are not known), a simple extension is no longer possible. The reason is that the estimation error depends on this margin behavior. Overruling the variance is thus not straightforward, as this variance is not known. In this chapter, we propose an $\ell_1$-penalty (i.e., a soft-thresholding type penalty) to overcome the margin problem.

Let $\mathcal F \subset \bar{\mathcal F}$ be a class of functions on $\mathcal X$. We assume that each $f_\theta \in \mathcal F$ can be written as a linear combination
$$f_\theta(x) = \sum_{j=1}^m \theta_j \psi_j(x),\quad x \in \mathcal X.$$
Here, $\{\psi_j\}_{j=1}^m$ is a given system of $m$ functions on $\mathcal X$, and $\theta = (\theta_1, \dots, \theta_m)'$ is an (in whole or in part) unknown vector. We consider the situation where the number of parameters $m$ is large, possibly larger than the number of observations $n$.

The $\ell_1$-penalty on $f_\theta = \sum_{j=1}^m \theta_j \psi_j$ is
$$(5.1)\qquad \text{pen}(f_\theta) = \lambda_n \sum_{j=1}^m |\theta_j|,$$
with $\lambda_n$ a regularization parameter. We study M-estimation with this penalty. We will denote the $\ell_1$-norm of a vector $\theta \in \mathbb R^m$ by
$$I(\theta) = \sum_{j=1}^m |\theta_j|.$$
To simplify the exposition, we assume all coefficients $\theta_j$ are penalized. In practice, the first few coefficients are often left free (for instance the constant term, or constant + linear term, etc.).
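For orientation, here is how the $\ell_1$-penalized estimator looks computationally in the special case of least squares loss, where it coincides with the LASSO mentioned in Section 3.7. Cyclic coordinate descent, in which each coordinate update is a soft-thresholding step, is a standard algorithm for this problem; it is not discussed in these notes, and the choice of $\lambda_n$ below is only indicative.

```python
import numpy as np

def lasso_cd(Psi, Y, lam, n_iter=200):
    """Cyclic coordinate descent for
       (1/n) sum_i (Y_i - sum_j theta_j psi_j(x_i))^2 + lam * sum_j |theta_j|,
       assuming normalized columns: (1/n) sum_i psi_j(x_i)^2 = 1."""
    n, m = Psi.shape
    theta = np.zeros(m)
    r = Y.copy()                                   # residual Y - Psi @ theta
    for _ in range(n_iter):
        for j in range(m):
            rho = theta[j] + Psi[:, j] @ r / n     # partial-residual correlation
            new = np.sign(rho) * max(abs(rho) - lam / 2, 0.0)   # soft threshold
            r += Psi[:, j] * (theta[j] - new)
            theta[j] = new
    return theta

# Toy data with a sparse truth; m > n is allowed here.
rng = np.random.default_rng(7)
n, m = 100, 300
Psi = rng.standard_normal((n, m))
Psi /= np.sqrt((Psi ** 2).mean(axis=0))            # normalize in L2(Q_n)
theta0 = np.zeros(m); theta0[:5] = 1.0
Y = Psi @ theta0 + 0.5 * rng.standard_normal(n)

lam = 2 * np.sqrt(np.log(m) / n)                   # order of lambda_n in the text
theta_hat = lasso_cd(Psi, Y, lam)
# typically recovers the five true coordinates, plus perhaps a few small spurious ones
print(np.nonzero(theta_hat)[0][:10])
```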
In Section 5.1, robust estimation procedures in regression are studied, such as least absolute deviations. There, we consider the case of fixed design. Section 5.2 investigates exponential family approximations of a density. Section 5.3 considers the classification problem with random design, and we apply the support vector machine loss.

We recall the notation of Chapter 2. The loss function considered is denoted by $\gamma_f$, $f \in \bar{\mathcal F}$. We write for each $f \in \bar{\mathcal F}$, $R_n(f) = P_n(\gamma_f)$ and $R(f) = E R_n(f)$. We define
$$(5.2)\qquad \hat f_n = \arg\min_{f \in \mathcal F} \{R_n(f) + \text{pen}(f)\},$$
and
$$f_0 = \arg\min_{f \in \bar{\mathcal F}} R(f).$$
Note that our notation in this chapter differs somewhat from the previous chapter, in the sense that in the definition of $\hat f_n$, we minimize over $f \in \mathcal F$, instead of $f \in \bar{\mathcal F}$. (It can be brought in line with that notation by taking $\text{pen}(f) = \infty$ for $f \in \bar{\mathcal F} \setminus \mathcal F$.)

The empirical process, indexed by some subset $\mathcal F_0$ of $\bar{\mathcal F}$, is
$$\{\nu_n(f) = \sqrt n\,(P_n(\gamma_f) - P(\gamma_f)) : f \in \mathcal F_0\}.$$
Throughout, we fix some (arbitrary) $f^* \in \mathcal F$. In the results (Lemmas 5.1, 5.2 and 5.4) one may choose $f^*$ as being an (almost) oracle. Because the exact definition of this (almost) oracle is somewhat involved, we leave the choice of $f^*$ open in our general exposition. One can see a particular choice explained after the statement of the result for general $f^*$ (see below Lemma 5.1).

We use the notation
$$(5.3)\qquad \Omega_n = \big\{ -[\nu_n(\hat f_n) - \nu_n(f^*)] \le \sqrt n\,[\text{pen}(\hat f_n - f^*) + \lambda_n^2] \big\}.$$
The set $\Omega_n$ will play exactly the same role as in Lemma 3.7.
Recall that in this chapter, $\text{pen}(f_\theta) = \lambda_n I(\theta)$. We will choose the smoothing parameter $\lambda_n$ in such a way that (for all $f^* \in \mathcal F$) the probability of the set $\Omega_n$ is very large (tending to 1 as $n$ tends to $\infty$). Under the given conditions on the system $\{\psi_j\}$, this is true with the choice
$$\lambda_n = c\sqrt{\frac{\log m}{n}}.$$
But what is the value of $c$ here? Let
$$|f|_\infty = \sup_x |f(x)|.$$
The constant $c$ depends in Sections 5.1 and 5.3 on an assumed finite upper bound $K$ for $\sup_{f \in \mathcal F} |f|_\infty$ and, in Sections 5.2 and 5.3, on an upper bound for the density of $X$. To show that with such a choice for $c$, the set $\Omega_n$ has large probability, we need empirical process theory. Chapter 6 is devoted to that. In fact, Section 6.5 uses Theorem 6.2 and then presents all details for (robust) regression or classification with random design. In Section 5.1, we consider fixed design. One may then argue similarly, replacing Theorem 6.2 by Theorem 6.1. This is elaborated on in Loubes and van de Geer (2002) and van de Geer (2003). We remark here that in the latter two papers, the dependency of $c$ on $K$ is removed using convexity arguments. This however necessitates more subtle formulations. We skip the issue here to keep the exposition transparent.

The margin condition of Section 2.5 also appears in all three sections. It is a lower bound for the excess risk $R(f) - R(f_0)$, $f \in \mathcal F$, in terms of a metric $d$ on $\bar{\mathcal F}$. Let us repeat this margin condition here.

Margin condition. For some constants $c_2$ and $\kappa \ge 1$, we have for all $f \in \mathcal F$,
$$R(f) - R(f_0) \ge d^\kappa(f, f_0)/c_2.$$

The metric $d$ is taken as the one induced by the $L_2(Q_n)$-norm in Section 5.1 (robust regression with fixed design). In Section 5.2, $\mathcal F$ is a class of densities with respect to a $\sigma$-finite measure $\mu$, and the metric $d$ will be the one induced by the $L_2(\mu)$-norm. Section 5.3 takes the metric $d$ induced by the $L_1(Q)$-norm. In general, for some measure $\mu$ on $\mathcal X$ and $1 \le p < \infty$, we denote the $L_p(\mu)$-norm by
$$\|f\|_{p,\mu} = \Big( \int |f|^p\,d\mu \Big)^{1/p},\quad f \in L_p(\mu),$$
and we write
$$\|\cdot\| = \|\cdot\|_{2,Q},\qquad \|\cdot\|_n = \|\cdot\|_{2,Q_n}.$$
The values of $\kappa$ and $c_2$ are not assumed to be known, although in Section 5.2 (density estimation), $\kappa$ is shown to be generally equal to 2. The latter means that in density estimation, we get oracle inequalities very similar to those for least squares estimators in regression. The situation is then completely analogous to the one of Lemma 3.7.

The $\ell_1$-penalty allows one to adapt to other margin behavior as well (i.e., to the situation where the excess risk is bounded from below by some more general increasing function of the appropriate metric).

The proofs of the oracle inequalities all follow from one single argument, which was basically already used in Lemma 3.7. We give this argument in Section 5.4. It makes use of an assumed relation between the $\ell_1$-norm $I(\theta) = \sum_{j=1}^m |\theta_j|$ and the metric $d$ in the margin condition.

Condition on the metrics. For some constant $c_1$, some $\alpha \ge 0$ and some $0 < \beta < \kappa$, we have for all $J \subset \{1, \dots, m\}$,
$$\sum_{j \in J} |\theta_j - \tilde\theta_j| \le c_1 |J|^\alpha\, d^\beta(f_\theta, f_{\tilde\theta}),\qquad f_\theta, f_{\tilde\theta} \in \mathcal F.$$
In all of Sections 5.1, 5.2 and 5.3, the condition on the metrics is met with $\alpha = 1/2$. Sections 5.1 and 5.2 have $\beta = 1$, and Section 5.3 has $\beta = 1/2$.

5.1. Robust regression

Consider independent observations $Y_1, \dots, Y_n$, satisfying, for $i = 1, \dots, n$, the regression
$$Y_i = f_0(x_i) + \epsilon_i,$$
where
$$f_0(x_i) = \arg\min_{c \in \mathbb R} E\gamma(Y_i - c),$$
with $\gamma : \mathbb R \to [0, \infty)$ a given loss function. The error is defined as $\epsilon_i = Y_i - f_0(x_i)$, $i = 1, \dots, n$. We examine the estimator
$$\hat f_n = \arg\min_{f \in \mathcal F} \Big\{ \frac 1n \sum_{i=1}^n \gamma(Y_i - f(x_i)) + \text{pen}(f) \Big\},$$
where
$$\mathcal F \subset \{f_\theta : \theta \in \mathbb R^m\},$$
with $f_\theta = \sum_{j=1}^m \theta_j \psi_j$, and where
$$\text{pen}(f_\theta) = \lambda_n I(\theta),\qquad I(\theta) = \sum_{j=1}^m |\theta_j|.$$
For technical reasons, we assume
$$\sup_{f \in \mathcal F} |f|_\infty \le K,$$
where $K$ is finite (possibly depending on $n$). We need this condition to handle the empirical process part of the problem.

Throughout this section, we assume that $\gamma$ is Lipschitz:
$$|\gamma(z_1) - \gamma(z_2)| \le |z_1 - z_2|,\qquad z_1, z_2 \in \mathbb R.$$
This is related to a certain robustness of the estimator. We call the estimator $\hat f_n$ of this section the $\ell_1$-penalized robust regression estimator. The Lipschitz assumption also makes it possible to apply the contraction principle (see Section 6.3). This means there is enough machinery to obtain a good bound for the empirical process.
Example 5.1. Least absolute deviations. If
\[ \gamma(z) = |z|, \quad z\in\mathbb R, \]
then $f_0(x_i)$ is the median of $Y_i$ (whenever it exists). Hence, in that case $\epsilon_i$ has median zero. We call $\hat f_n$ the $\ell_1$-penalized least absolute deviations estimator.

Example 5.2. Quantile regression. More generally, if for a given $0 < \alpha < 1$,
\[ \gamma(z) = (1-\alpha)|z|\,{\rm l}\{z < 0\} + \alpha|z|\,{\rm l}\{z \ge 0\}, \quad z\in\mathbb R, \]
then $f_0(x_i)$ is the $\alpha$-quantile of the distribution of $Y_i$. The estimator $\hat f_n$ is then called the $\ell_1$-penalized quantile estimator.
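Computationally, the $\ell_1$-penalized quantile estimator is a linear program. The following sketch (in Python; the basis, data and penalty level are illustrative choices, not prescriptions from the theory above) makes this concrete by rewriting the check loss via positive and negative parts of the residuals.

```python
import numpy as np
from scipy.optimize import linprog

def l1_quantile_fit(Psi, y, alpha=0.5, lam=0.1):
    """l1-penalized quantile regression as an LP (illustrative sketch).

    Minimizes (1/n) * sum rho_alpha(y_i - (Psi beta)_i) + lam * ||beta||_1,
    where rho_alpha(u) = alpha * u_+ + (1 - alpha) * u_-.
    Variables: beta = b_pos - b_neg, residual = u_pos - u_neg (all >= 0).
    """
    n, m = Psi.shape
    # objective: lam*(b_pos + b_neg) + (alpha*u_pos + (1-alpha)*u_neg)/n
    c = np.concatenate([lam * np.ones(2 * m),
                        (alpha / n) * np.ones(n),
                        ((1 - alpha) / n) * np.ones(n)])
    # equality constraints: Psi(b_pos - b_neg) + u_pos - u_neg = y
    A_eq = np.hstack([Psi, -Psi, np.eye(n), -np.eye(n)])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    z = res.x
    return z[:m] - z[m:2 * m]

# toy illustration with a cosine basis on fixed design points
rng = np.random.default_rng(0)
n, m = 200, 30
x = np.linspace(0, 1, n)
Psi = np.column_stack([np.sqrt(2) * np.cos(np.pi * j * x) for j in range(1, m + 1)])
y = np.cos(np.pi * x) + 0.3 * rng.standard_normal(n)   # f_0 is sparse in this basis
beta_hat = l1_quantile_fit(Psi, y, alpha=0.5, lam=2 * np.sqrt(np.log(n) / n))
print("nonzero coefficients:", np.sum(np.abs(beta_hat) > 1e-6))
```

At the optimum the auxiliary variables equal the positive and negative residual parts, so the LP value coincides with the penalized empirical risk.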
As usual, we employ the notation
\[ R_n(f) = \frac1n\sum_{i=1}^n \gamma(Y_i - f(x_i)), \qquad R(f) = E R_n(f), \]
\[ \nu_n(f) = \sqrt n\,[R_n(f) - R(f)], \]
and
\[ \|f\|_n^2 = \frac1n\sum_{i=1}^n f^2(x_i) . \]

Margin condition. For some constants $c_2$ and $\kappa > 1$ and for all $f\in\mathcal F$,
\[ R(f) - R(f_0) \ge \|f - f_0\|_n^{\kappa}/c_2 . \]
The value of $\kappa$ and $c_2$ may depend on the density of the errors $\epsilon_i$, $i = 1,\dots,n$.
Typically $\kappa = 2$, as is shown in Exercise 5.1.

Exercise 5.1. Consider least absolute deviations estimation, i.e., the case $\gamma(z) = |z|$. Suppose that $\epsilon_1,\dots,\epsilon_n$ are i.i.d. and that their common distribution has density g with respect to Lebesgue measure. Suppose also that $g(u) \ge t > 0$ for all $|u| \le 2K$. Show that when also $|f_0|_\infty \le K$, the margin condition holds with $\kappa = 2$ and $c_2 = 2/t$.
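A sketch of the computation behind Exercise 5.1: with $h_i = (f - f_0)(x_i)$, so that $|h_i| \le 2K$, and with G the distribution function of $\epsilon_i$ (which has median zero),
\[ R(f) - R(f_0) = \frac1n\sum_{i=1}^n E\bigl( |\epsilon_i - h_i| - |\epsilon_i| \bigr) = \frac1n\sum_{i=1}^n 2\int_0^{h_i} \bigl( G(u) - G(0) \bigr)\,du . \]
Since $g \ge t$ on the range of integration, the i-th summand is at least $t h_i^2$ (for either sign of $h_i$), whence $R(f) - R(f_0) \ge t\,\|f - f_0\|_n^2$; in particular the margin condition holds with $\kappa = 2$ and $c_2 = 2/t$.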

Positive eigenvalue condition on the system $\{\psi_j\}$. Define
\[ \Sigma_n = \frac1n\sum_{i=1}^n \psi(x_i)\psi(x_i)^\top , \]
with $\psi(x_i) = (\psi_1(x_i),\dots,\psi_m(x_i))^\top$, $i = 1,\dots,n$. Let $\sigma_{\min}^2$ be the smallest eigenvalue of $\Sigma_n$. Assume that for some constant $0 < c_1 < \infty$,
\[ \sigma_{\min} \ge 1/c_1 . \]
(Note that this condition on the system $\{\psi_j\}_{j=1}^m$ implies $m \le n$.)

Let for $f_\beta = \sum_{j=1}^m \beta_j\psi_j$,
\[ N(f_\beta) = \#\{\beta_j \ne 0\} . \]
Define for $0 < \delta < 1$,
\[ V_n(f_\beta) = 2(c_2/\delta)^{\frac{1}{\kappa-1}} \bigl( 4 c_1^2 \lambda_n^2 N(f_\beta) \bigr)^{\frac{\kappa}{2(\kappa-1)}} . \]
We will show in Lemma 5.1 that on the set
\[ \mathcal A_n = \bigl\{ -[\nu_n(\hat f_n) - \nu_n(f^*)] \le \sqrt n\,[\mathrm{pen}(\hat f_n - f^*) + \lambda_n^2] \bigr\}, \]
the excess risk of the estimator $\hat f_n$ satisfies an inequality involving the estimation error $V_n(f^*)$ and the approximation error $R(f^*) - R(f_0)$.
Lemma 5.1. Assume the margin condition, and the positive eigenvalue condition on the system $\{\psi_j\}$. Then on the set $\mathcal A_n$ we have
\[ R(\hat f_n) - R(f_0) \le \Bigl( \frac{1+\delta}{1-\delta} \Bigr) \bigl\{ V_n(f^*) + R(f^*) - R(f_0) + \lambda_n^2 \bigr\} . \]
Proof. Use the positive eigenvalue condition on the system $\{\psi_j\}$ to see that the condition on the metrics holds with $d(f,\tilde f) = \|f - \tilde f\|_n$, $\psi = 1/2$, and $\nu = 1$. Now, invoke the margin condition and apply Lemma 5.5. $\square$

In Lemma 5.1, we now choose
\[ f^* = \arg\min_{f\in\mathcal F} \{ V_n(f) + R(f) - R(f_0) \} \]
(where we assume for simplicity that the minimum is attained). This choice balances the estimation error $V_n(f^*)$ and the approximation error $R(f^*) - R(f_0)$, so that one arrives at an oracle inequality on $\mathcal A_n$. It can be shown that for any $f^*$, the set $\mathcal A_n$ has large probability for the choice $\lambda_n = c\sqrt{\log n/n}$, with c a large enough constant depending only on K. This can be done using the tools from Section 6 (the concentration inequality of Theorem 6.1, symmetrization (Theorem 6.3), the contraction principle (Theorem 6.4), and the peeling device described in Section 6.4). The details are in Loubes and van de Geer (2002) and van de Geer (2003). Thus, a good choice for $\lambda_n$ does not require knowledge of the constants $\kappa$ or $c_2$ appearing in the margin condition. Since the estimation error $V_n(f^*)$ does depend on these constants, one may conclude that the $\ell_1$-penalty yields a trade-off between estimation error and approximation error, without (directly) estimating the estimation error.
Exercise 5.2. Suppose that for all integers N,
\[ \min_{f\in\mathcal F,\, N(f)\le N} [R(f) - R(f_0)] \le c_3 N^{-s\kappa} . \]
Here $s > 0$ is a smoothness parameter. Verify that the trade-off between estimation error and approximation error gives
\[ R(\hat f_n) - R(f_0) \le c_4\, \lambda_n^{\frac{2s\kappa}{2(\kappa-1)s+1}} . \]
The constant $c_4$ is determined by the values of $c_1$, $c_2$ and $c_3$ and those of $\kappa$, $\delta$ and s.
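One way to organize the trade-off computation (with the notation above): up to constants, the oracle bound is
\[ \min_N \Bigl\{ (\lambda_n^2 N)^{\frac{\kappa}{2(\kappa-1)}} + c_3 N^{-s\kappa} \Bigr\} , \]
and the two terms balance at the N solving $(\lambda_n^2 N)^{\kappa/(2(\kappa-1))} \asymp N^{-s\kappa}$, at which point their common value is of order $\lambda_n^{2s\kappa/(2(\kappa-1)s+1)}$. For $\kappa = 2$ and $\lambda_n \asymp \sqrt{\log n/n}$, this is the familiar nonparametric rate $(\log n/n)^{2s/(2s+1)}$.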
5.2. Density estimation

Let $X_1,\dots,X_n$ be i.i.d., with distribution P on $\mathcal X$. Let $p_0 = dP/d\mu$ be the density with respect to some given $\sigma$-finite measure $\mu$. Define
\[ f_0 = \log p_0 - \int \log p_0 \, d\mu . \]
Take
\[ \bar{\mathcal F} = \Bigl\{ f : \int f\,d\mu = 0, \ \int e^f\,d\mu < \infty \Bigr\} . \]
Define for $f\in\bar{\mathcal F}$,
\[ p_f = \exp[f - b(f)], \qquad b(f) = \log\int e^f\,d\mu . \]
Then for each $f\in\bar{\mathcal F}$, the function $p_f$ is a density with respect to $\mu$. We examine the penalized maximum likelihood estimator $\hat f_n$ over the class $\mathcal F \subset \bar{\mathcal F}$:
\[ \hat f_n = \arg\max_{f\in\mathcal F} \Bigl\{ \frac1n\sum_{i=1}^n \log p_f(X_i) - \mathrm{pen}(f) \Bigr\} . \]
This means we take the loss function
\[ \gamma_f(x) = -f(x) + b(f) . \]
The empirical risk is
\[ R_n(f) = -P_n(f) + b(f), \]
and the theoretical counterpart is
\[ R(f) = -P(f) + b(f) . \]
Exercise 5.3. Verify that
\[ R(f) - R(f_0) = \mathcal K(p_f \mid p_{f_0}), \]
where $\mathcal K(p \mid p_0)$ is the Kullback–Leibler information between the densities p and $p_0$:
\[ \mathcal K(p \mid p_0) = \int \log(p_0/p)\, p_0 \, d\mu . \]
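The verification is a one-line computation from the definitions: since $\log p_f = f - b(f)$ and $p_{f_0} = p_0$,
\[ R(f) - R(f_0) = -P(f - f_0) + b(f) - b(f_0) = \int \log\frac{p_{f_0}}{p_f}\, p_0\, d\mu = \mathcal K(p_f \mid p_{f_0}), \]
using $P(g) = \int g\, p_0\, d\mu$.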

As stated in the beginning of this chapter, we assume that each $f\in\mathcal F$ can be written in the form
\[ f_\beta = \sum_{j=1}^m \beta_j\psi_j , \]
for some coefficients $\beta_j$, $j = 1,\dots,m$. In this section this means we are considering an m-parameter exponential family. To have an identifiable representation, the functions $\psi_j$ can be assumed to be centered with respect to $\mu$:
\[ \int \psi_j \, d\mu = 0, \quad j = 1,\dots,m . \]
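For concreteness, here is a small numerical sketch of the $\ell_1$-penalized maximum likelihood estimator in such an exponential family, with $\mu$ Lebesgue measure on [0,1] and a cosine basis (so that $\int\psi_j\,d\mu = 0$ automatically). All concrete choices below (basis size, step size, penalty level) are illustrative assumptions, not prescriptions from the text. The smooth part of minus the criterion, $\beta \mapsto -P_n(f_\beta) + b(\beta)$, is convex with gradient $-P_n\psi + E_{p_\beta}\psi$, so proximal gradient steps with soft-thresholding apply.

```python
import numpy as np

# midpoint-grid quadrature on [0,1] for the base measure mu (Lebesgue)
grid = np.linspace(0.0, 1.0, 1000, endpoint=False) + 0.0005
dx = 1.0 / 1000
J = 25
# centered cosine basis: int psi_j dmu = 0, int psi_j^2 dmu = 1
Psi_grid = np.stack([np.sqrt(2) * np.cos(np.pi * j * grid) for j in range(1, J + 1)])

def b_and_mean(beta):
    """b(beta) = log int exp(f_beta) dmu and the mean E_{p_beta} psi."""
    f = beta @ Psi_grid
    w = np.exp(f - f.max())                       # stabilized exponential
    Z = w.sum() * dx
    p = w / Z                                     # density p_beta on the grid
    return np.log(Z) + f.max(), (Psi_grid * p).sum(axis=1) * dx

def soft(v, thr):
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def l1_mle(X, lam, eta=0.1, iters=1000):
    """ISTA for beta -> -Pn(f_beta) + b(beta) + lam * I(beta)."""
    Pn_psi = np.stack([np.sqrt(2) * np.cos(np.pi * j * X)
                       for j in range(1, J + 1)]).mean(axis=1)
    beta = np.zeros(J)
    for _ in range(iters):                        # small fixed step, no line search
        _, Ep_psi = b_and_mean(beta)
        beta = soft(beta - eta * (Ep_psi - Pn_psi), eta * lam)
    return beta

rng = np.random.default_rng(1)
X = rng.beta(2.0, 2.0, size=400)                  # a non-uniform sample on [0,1]
beta_hat = l1_mle(X, lam=2 * np.sqrt(np.log(400) / 400))
print("active coefficients:", np.flatnonzero(np.abs(beta_hat) > 1e-6))
```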
The penalty is again taken to be the $\ell_1$-penalty
\[ \mathrm{pen}(f_\beta) = \lambda_n I(\beta), \qquad I(\beta) = \sum_{j=1}^m |\beta_j| . \]
Define now
\[ \|f\|_{2,\mu}^2 = \int f^2 \, d\mu . \]

Margin condition. For some $\kappa > 1$ and $c_2$, and for all $f\in\mathcal F$,
\[ \mathcal K(p_f \mid p_0) \ge \|f - f_0\|_{2,\mu}^{\kappa}/c_2 . \]

In fact, the value $\kappa = 2$ is quite natural in this case, as can be seen in the next exercise.

Exercise 5.4. Let $\{p_f : f\in\mathcal F\}$ be a class of densities, with respect to Lebesgue measure $\mu$ on [0,1], and suppose that $p_0 \equiv 1$ is the density of the uniform distribution. Note that $f_0 = 0$ and $b(f_0) = 0$ in this case. Suppose that $\sup_{f\in\mathcal F} |f|_\infty \le K$. Show that
\[ R(f) - R(f_0) = \int f^2 p_1 \, d\mu - \Bigl[ \int f p_1 \, d\mu \Bigr]^2 , \]
where
\[ p_1 = \exp[f_1 - b(f_1)], \]
and $|f_1| \le |f|$. Moreover,
\[ \int f p_1 \, d\mu = \int f f_1 p_2 \, d\mu - \int f p_2 \, d\mu \int f_1 p_2 \, d\mu , \]
where
\[ p_2 = \exp[f_2 - b(f_2)], \]
and $|f_2| \le |f_1|$. Conclude that
\[ R(f) - R(f_0) \ge e^{-2K} \int f^2 \, d\mu - 2 e^{4K} \Bigl( \int f^2 \, d\mu \Bigr)^2 . \]
The margin condition is thus met with $\kappa = 2$ when $\mathcal F$ satisfies the requirement that $\|f\|_{2,\mu}$ is sufficiently small for all $f\in\mathcal F$. (Convexity arguments can remove this requirement on $\mathcal F$.)

Positive eigenvalue condition on the system $\{\psi_j\}$. Let
\[ \Sigma = \int \psi\psi^\top \, d\mu . \]
Let $\sigma_{\min}^2$ be the smallest eigenvalue of $\Sigma$. Assume that for some constant $0 < c_1 < \infty$,
\[ \sigma_{\min} \ge 1/c_1 . \]
Now, we proceed exactly as in the previous section. Let for $f_\beta = \sum_{j=1}^m \beta_j\psi_j$,
\[ N(f_\beta) = \#\{\beta_j \ne 0\}, \]
and for $0 < \delta < 1$,
\[ V_n(f_\beta) = 2(c_2/\delta)^{\frac{1}{\kappa-1}} \bigl( 4 c_1^2 \lambda_n^2 N(f_\beta) \bigr)^{\frac{\kappa}{2(\kappa-1)}} . \]
Moreover, let $f^*$ be any function in $\mathcal F$. Recall the set
\[ \mathcal A_n = \bigl\{ -[\nu_n(\hat f_n) - \nu_n(f^*)] \le \sqrt n\,[\mathrm{pen}(\hat f_n - f^*) + \lambda_n^2] \bigr\} . \]

Lemma 5.2. Assume the margin condition, and the positive eigenvalue condition on the system $\{\psi_j\}$. Then on $\mathcal A_n$ we have
\[ R(\hat f_n) - R(f_0) \le \Bigl( \frac{1+\delta}{1-\delta} \Bigr) \bigl\{ V_n(f^*) + R(f^*) - R(f_0) + \lambda_n^2 \bigr\} . \]
Proof. This is again a straightforward application of Lemma 5.5, as in the proof of Lemma 5.1. $\square$

It is in this case easy to handle the set $\mathcal A_n$. Note that
\[ [R_n(\hat f_n) - R(\hat f_n)] - [R_n(f^*) - R(f^*)] = -\bigl[ P_n(\hat f_n - f^*) - P(\hat f_n - f^*) \bigr] \le \max_j |P_n(\psi_j) - P(\psi_j)|\, I(\hat\beta - \beta^*) . \]

Normalization condition on the system $\{\psi_j\}$. Assume that for all $j = 1,\dots,m$,
\[ |\psi_j|_\infty \le \sqrt{\frac{n}{\log m}} \]
(where we recall the notation $|\psi_j|_\infty = \sup_x |\psi_j(x)|$), and
\[ \int |\psi_j|^2 \, d\mu \le 1 . \]
Also assume that $p_0 \le c_0$, where $c_0 \ge 1$ is given.


Lemma 5.3. Suppose the normalization condition on $\{\psi_j\}$. Then for all $t \ge 1$,
\[ P\Bigl( \max_j |P_n(\psi_j) - P(\psi_j)| \ge 3 t c_0 \sqrt{\frac{\log m}{n}} \Bigr) \le 2\exp\Bigl[ -\frac27 t c_0 \log m \Bigr] . \]

Proof. This follows from Bernstein's inequality (Bernstein (1924), Bennett (1962)), which says that for each $a > 0$,
\[ P\bigl( |P_n(\psi_j) - P(\psi_j)| > a \bigr) \le 2\exp\Bigl[ -\frac{n a^2}{2 a |\psi_j|_\infty + \sigma_j^2} \Bigr], \]
where $\sigma_j^2 = P|\psi_j|^2 - |P\psi_j|^2$. $\square$

In view of Lemma 5.3, we now know that for sequences of sets $\mathcal A_n$ with $\lambda_n = 3 c_0 \sqrt{\log m/n}$, with $m = m_n \to \infty$ as $n \to \infty$, we have
\[ \lim_{n\to\infty} P(\mathcal A_n) = 1 . \]
5.3. Binary classification

Let (X, Y) be random variables, with $X\in\mathcal X$ a feature and $Y\in\{-1,+1\}$ a label. A classifier is a function $f : \mathcal X \to \mathbb R$. Using the classifier f, we predict the label +1 when $f(X) \ge 0$, and the label −1 when $f(X) < 0$. Thus, a classification error occurs when $Y f(X) < 0$.

We regard (X, Y) as random variables with distribution P, and denote the distribution of X by Q. Moreover, we write
\[ \eta_0(x) = P(Y = 1 \mid X = x), \quad x\in\mathcal X . \]
Bayes' rule is the classifier
\[ f_0 = \begin{cases} +1 & \text{if } \eta_0 \ge 1/2 \\ -1 & \text{if } \eta_0 < 1/2 \end{cases} . \]
In Exercise 2.3, we have seen that $f_0$ minimizes the probability of a classification error.
Let $(X_1,Y_1),\dots,(X_n,Y_n)$ be observed i.i.d. copies of (X, Y). These observations are called the training set. Let $\mathcal F$ be a collection of classifiers. In empirical risk minimization (see Vapnik (1995, 1998)), one chooses the classifier in $\mathcal F$ that has the smallest number of classification errors. However, if $\mathcal F$ is a rich set, this classifier will be hard to compute. Indeed, we will again consider a very high-dimensional class $\mathcal F$ in this section. We will use support vector machine (SVM) loss instead of the number of misclassifications, to overcome computational problems. The empirical SVM loss function is
\[ R_n(f) = \frac1n\sum_{i=1}^n \gamma(Y_i f(X_i)), \]
where $\gamma(z) = (1-z)_+$ is the hinge function, with $z_+$ denoting the positive part of $z\in\mathbb R$. Thus, our loss function $\gamma_f$ is now
\[ \gamma_f(x,y) = (1 - y f(x))_+ . \]
The $\ell_1$-penalized SVM loss estimator $\hat f_n$ is
\[ \hat f_n = \arg\min_{f\in\mathcal F} \Bigl\{ \frac1n\sum_{i=1}^n (1 - Y_i f(X_i))_+ + \mathrm{pen}(f) \Bigr\}, \]
where
\[ \mathcal F \subset \{f_\beta : \beta\in\mathbb R^m\}, \]
and
\[ \mathrm{pen}(f_\beta) = \lambda_n I(\beta), \qquad I(\beta) = \sum_{j=1}^m |\beta_j| . \]
As in Section 5.1, we need to assume, for technical reasons, that
\[ \sup_{f\in\mathcal F} |f|_\infty \le K, \]
where K is a given finite constant.
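A minimal computational sketch (again in Python, with illustrative basis, data and penalty level): since the hinge loss is convex but nonsmooth, plain subgradient steps combined with soft-thresholding for the $\ell_1$-part already suffice for a demonstration.

```python
import numpy as np

def l1_svm(Psi, y, lam, iters=2000, eta0=1.0):
    """l1-penalized hinge-loss minimization by subgradient + soft-thresholding.

    Objective: (1/n) * sum (1 - y_i * (Psi beta)_i)_+ + lam * ||beta||_1.
    """
    n, m = Psi.shape
    beta = np.zeros(m)
    for t in range(1, iters + 1):
        margin = y * (Psi @ beta)
        active = (margin < 1).astype(float)          # subgradient of the hinge
        grad = -(Psi * (y * active)[:, None]).mean(axis=0)
        eta = eta0 / np.sqrt(t)                      # diminishing step size
        beta = beta - eta * grad
        beta = np.sign(beta) * np.maximum(np.abs(beta) - eta * lam, 0.0)
    return beta

# toy illustration: labels generated from a sparse "true" classifier
rng = np.random.default_rng(2)
n, m = 500, 40
X = rng.uniform(0, 1, size=n)
Psi = np.column_stack([np.sqrt(2) * np.cos(np.pi * j * X) for j in range(1, m + 1)])
f0 = Psi[:, 0] - 0.5 * Psi[:, 2]
y = np.where(rng.uniform(size=n) < 1 / (1 + np.exp(-4 * f0)), 1.0, -1.0)
beta_hat = l1_svm(Psi, y, lam=np.sqrt(np.log(n) / n))
print("selected basis functions:", np.flatnonzero(np.abs(beta_hat) > 1e-3))
```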


Define the theoretical SVM loss
\[ R(f) = E R_n(f) = \int (1 - y f(x))_+ \, dP(x,y) . \]

Exercise 5.5. Verify that SVM loss is consistent, in the sense that Bayes' rule $f_0$ satisfies
\[ f_0 = \arg\min_{\text{all } f} R(f) . \]

To handle the empirical process, note first that $\gamma(z) = (1-z)_+$ is Lipschitz. This allows us to again apply the contraction principle of Section 6.3. It means that we can invoke the tools of Chapter 6 (in particular, the concentration inequality of Theorem 6.2, symmetrization, the contraction principle and the peeling device) to handle the empirical process part of the problem.

Margin condition. For some constants $\kappa \ge 1$ and $c_2$, we have for all $f\in\mathcal F$,
\[ R(f) - R(f_0) \ge \|f - f_0\|_{1,Q}^{\kappa}/c_2 . \]
Note that we assumed the margin condition with an $L_1(Q)$-norm, instead of the $L_2$-norm. It turns out that under some mild assumptions on $\eta_0$ and Q, the margin condition is indeed met with the $L_1(Q)$-norm.
Positive eigenvalue condition on the system $\{\psi_j\}$. Let
\[ \Sigma = \int \psi\psi^\top \, dQ . \]
Let $\sigma_{\min}^2$ be the smallest eigenvalue of $\Sigma$. Assume that
\[ \sigma_{\min} > 0 . \]
The rest is as in the previous two sections, albeit that the condition on the metrics is now met with the value $\nu = 1/2$ (instead of the value $\nu = 1$ of Sections 5.1 and 5.2). Let for $f_\beta = \sum_{j=1}^m \beta_j\psi_j$,
\[ N(f_\beta) = \#\{\beta_j \ne 0\}, \]
and for $0 < \delta < 1$,
\[ V_n(f_\beta) = 2(c_2/\delta)^{\frac{1}{2\kappa-1}} \bigl( 8 K \lambda_n^2 N(f_\beta)/\sigma_{\min}^2 \bigr)^{\frac{\kappa}{2\kappa-1}} . \]
Moreover, let $f^*$ be any function in $\mathcal F$. Recall the set
\[ \mathcal A_n = \bigl\{ -[\nu_n(\hat f_n) - \nu_n(f^*)] \le \sqrt n\,[\mathrm{pen}(\hat f_n - f^*) + \lambda_n^2] \bigr\} . \]
Assuming a proper normalization condition on the system $\{\psi_j\}$ (see Section 6.5), this set has large probability for the choice $\lambda_n = c\sqrt{\log n/n}$ with c a suitable constant (see the conclusion at the end of Section 6.5).

Lemma 5.4. Assume the margin condition, and the positive eigenvalue condition on the system $\{\psi_j\}$. Then on $\mathcal A_n$ we have
\[ R(\hat f_n) - R(f_0) \le \Bigl( \frac{1+\delta}{1-\delta} \Bigr) \bigl\{ V_n(f^*) + R(f^*) - R(f_0) + \lambda_n^2 \bigr\} . \]
Proof. This follows from the fact that for $f_\beta\in\mathcal F$ and $f_{\tilde\beta}\in\mathcal F$,
\[ \sum_{j\in J} |\beta_j - \tilde\beta_j| \le |J|^{1/2} \Bigl( \sum_{j=1}^m |\beta_j - \tilde\beta_j|^2 \Bigr)^{1/2} \le |J|^{1/2} \|f_\beta - f_{\tilde\beta}\|_{2,Q}/\sigma_{\min} \le |J|^{1/2} (2K)^{1/2} \|f_\beta - f_{\tilde\beta}\|_{1,Q}^{1/2}/\sigma_{\min} . \]
So the condition on the metrics holds with
\[ d(f, \tilde f) = \|f - \tilde f\|_{1,Q} \]
and with $\psi = \nu = 1/2$ and $c_1 = \sqrt{2K}/\sigma_{\min}$. Apply Lemma 5.5. $\square$
5.4. The behavior on $\mathcal A_n$

We use straightforward calculus to show the inequality for the excess risk on $\mathcal A_n$. The arguments follow exactly those in the proof of Lemma 3.7.

Lemma 5.5. Assume the margin condition and the condition on the metrics, formulated in the beginning of this chapter. Then on the set $\mathcal A_n$, we have for any $0 < \delta < 1$,
\[ R(\hat f_n) - R(f_0) \le \Bigl( \frac{1+\delta}{1-\delta} \Bigr) \bigl\{ V_n + R(f^*) - R(f_0) + \lambda_n^2 \bigr\}, \]
where
\[ V_n = 2(c_2/\delta)^{\frac{\nu}{\kappa-\nu}} \bigl( 2 c_1 |J_*|^{\psi} \lambda_n \bigr)^{\frac{\kappa}{\kappa-\nu}}, \]
with $J_* = \{j : \beta^*_j \ne 0\}$.

Proof. The basic inequality says that
\[ R(\hat f_n) - R(f_0) \le -[\nu_n(\hat f_n) - \nu_n(f^*)]/\sqrt n + \mathrm{pen}(f^*) - \mathrm{pen}(\hat f_n) + R(f^*) - R(f_0), \]
so that on $\mathcal A_n$,
\[ R(\hat f_n) - R(f_0) \le \mathrm{pen}(\hat f_n - f^*) + \lambda_n^2 + \mathrm{pen}(f^*) - \mathrm{pen}(\hat f_n) + R(f^*) - R(f_0) . \]
Now, use the same arguments as in the proof of Lemma 3.7. Let
\[ \mathrm{pen}_1(f_\beta) = \lambda_n \sum_{j\in J_*} |\beta_j|, \]
and
\[ \mathrm{pen}_2(f_\beta) = \lambda_n \sum_{j\notin J_*} |\beta_j| . \]
Then
\[ \mathrm{pen}(f_\beta) = \mathrm{pen}_1(f_\beta) + \mathrm{pen}_2(f_\beta), \]
and one easily sees
\[ \mathrm{pen}(\hat f_n - f^*) = \mathrm{pen}_1(\hat f_n - f^*) + \mathrm{pen}_2(\hat f_n) \]
and
\[ \mathrm{pen}(f^*) - \mathrm{pen}(\hat f_n) \le \mathrm{pen}_1(\hat f_n - f^*) - \mathrm{pen}_2(\hat f_n) . \]
So on $\mathcal A_n$,
\[ R(\hat f_n) - R(f_0) \le 2\,\mathrm{pen}_1(\hat f_n - f^*) + \lambda_n^2 + R(f^*) - R(f_0) . \]
By the condition on the metrics, on $\mathcal A_n$,
\[ R(\hat f_n) - R(f_0) \le 2 c_1 \lambda_n |J_*|^{\psi} d^{\nu}(\hat f_n, f^*) + \lambda_n^2 + R(f^*) - R(f_0) . \]
After application of the triangle inequality
\[ d(\hat f_n, f^*) \le d(\hat f_n, f_0) + d(f^*, f_0), \]
the margin condition gives us that on $\mathcal A_n$,
\[ R(\hat f_n) - R(f_0) \le 2 c_1 \lambda_n |J_*|^{\psi} \bigl( c_2 (R(\hat f_n) - R(f_0)) \bigr)^{\nu/\kappa} + 2 c_1 \lambda_n |J_*|^{\psi} \bigl( c_2 (R(f^*) - R(f_0)) \bigr)^{\nu/\kappa} + \lambda_n^2 + R(f^*) - R(f_0) . \]
Next, invoke the Technical Lemma of Chapter 2. Then, on $\mathcal A_n$,
\[ R(\hat f_n) - R(f_0) \le 2(c_2/\delta)^{\frac{\nu}{\kappa-\nu}} \bigl( 2 c_1 |J_*|^{\psi} \lambda_n \bigr)^{\frac{\kappa}{\kappa-\nu}} + \delta(R(\hat f_n) - R(f_0)) + \lambda_n^2 + (1+\delta)(R(f^*) - R(f_0)) . \qquad \square \]

5.5. Bibliographical remarks

For quantile regression, we refer to Koenker and Bassett (1978) and, in the nonparametric case, Koenker, Ng and Portnoy (1992, 1994) and Portnoy (1997). A nice comparison of least squares and least absolute deviations is in Portnoy and Koenker (1997). The result of Section 5.1 is from Loubes and van de Geer (2002) and van de Geer (2003). There, also convexity is used to show that one may take $\lambda_n$ not depending on the bound K for the functions f in $\mathcal F$. See also van de Geer (2002) for convexity arguments in density estimation (and high-dimensional exponential families). A good reference for classification is the book by Devroye, Györfi and Lugosi (1996). Also the book of Hastie, Tibshirani and Friedman (2001) has classification as one of its subjects, including SVMs, but also other methods and algorithms. Adaptive results for empirical risk minimizers in classification, using for instance Rademacher complexities, are in, e.g., Koltchinskii (2001), Koltchinskii and Panchenko (2002), Koltchinskii (2003), and Lugosi and Wegkamp (2004). Support vector machines (SVMs) were introduced by Boser, Guyon and Vapnik (1992). An important book on SVMs is Schölkopf and Smola (2002). Lin (2002) shows that the SVM is consistent. See Bartlett, Jordan and McAuliffe (2003) for results for more general loss functions. The $\ell_1$-penalty is often referred to as the LASSO (see Tibshirani (1996)). Adaptivity of the SVM with $\ell_1$-penalty is studied in Tarigan and van de Geer (2005).
6. Tools from empirical process theory

Let $X_1,\dots,X_n$ be independent random variables with values in $\mathcal X$ and let $\Gamma$ be some family of real-valued functions on $\mathcal X$. Define
\[ P_n(\gamma) = \sum_{i=1}^n \gamma(X_i)/n, \qquad P(\gamma) = E P_n(\gamma) . \]
We review, in Sections 6.1–6.4, some general results for the empirical process
\[ \bigl\{ \sqrt n (P_n - P)(\gamma) : \gamma\in\Gamma \bigr\} . \]
In Section 6.5, we apply the results of Sections 6.1–6.4 to arrive at a uniform probability bound for the case where the empirical process is indexed by a class $\{\gamma_f = \gamma\circ f : f\in\mathcal F\}$ with $\gamma : \mathbb R\to\mathbb R$ and $\mathcal F$ a subset of a collection of linear functions $\{f_\beta = \sum_j \beta_j\psi_j\}$. There, we moreover assume the Lipschitz condition
\[ |\gamma(z_1) - \gamma(z_2)| \le |z_1 - z_2|, \quad z_1, z_2\in\mathbb R . \]
The probability bound of Section 6.5 can be used to handle the set $\mathcal A_n$ introduced in Chapter 5. Section 6.6 studies empirical processes indexed by functions in a function class satisfying an entropy bound. It considers the modulus of continuity of the empirical process. The result says that a version of the empirical process condition of Section 2.5 holds.

6.1. Concentration inequalities

Concentration inequalities are exponential probability inequalities for the concentration of the supremum of the empirical process around its mean. Remarkable is that the amount of concentration does not depend on the dimensionality of the problem.

Define
\[ Z = \sup_{\gamma\in\Gamma} |P_n(\gamma) - P(\gamma)| . \]
We start out with a Hoeffding-type concentration inequality.

Theorem 6.1 (Massart (2000a)). Suppose $\Gamma$ is a finite collection, satisfying $a_{i,\gamma} \le \gamma(X_i) \le b_{i,\gamma}$ for some real numbers $a_{i,\gamma}$ and $b_{i,\gamma}$ and for all $1 \le i \le n$ and $\gamma\in\Gamma$. Define
\[ L^2 = \sup_{\gamma\in\Gamma} \sum_{i=1}^n (b_{i,\gamma} - a_{i,\gamma})^2/n . \]
Then for any positive t,
\[ P(Z \ge E Z + t) \le \exp\Bigl[ -\frac{n t^2}{2 L^2} \Bigr] . \qquad \square \]
The next theorem is a Bernstein-type concentration inequality.
Theorem 6.2 (Massart (2000a)). Let $\Gamma$ be a countable family of functions, satisfying $|\gamma|_\infty$ ($= \sup_x |\gamma(x)|$) $\le b < \infty$ for every $\gamma\in\Gamma$. Let
\[ \sigma^2 = \sup_{\gamma\in\Gamma} \sum_{i=1}^n \mathrm{var}(\gamma(X_i))/n . \]
Then for any positive t,
\[ P\bigl( Z \ge 2 E Z + \sigma t\sqrt 8 + 69 b t^2/2 \bigr) \le \exp[-n t^2] . \qquad \square \]
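As a quick illustration of the phenomenon (a simulation sketch; the class of half-line indicators below is a finite collection to which Theorem 6.1 applies with $b_{i,\gamma} - a_{i,\gamma} = 1$, hence $L \le 1$):

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 500, 2000
grid = np.linspace(0, 1, 50)     # finite collection gamma_theta = 1{x <= theta}

# Z = sup_theta |Pn(gamma_theta) - P(gamma_theta)| for uniform X_i, so P(gamma_theta) = theta
Z = np.empty(reps)
for r in range(reps):
    X = rng.uniform(size=n)
    Pn = (X[:, None] <= grid).mean(axis=0)
    Z[r] = np.abs(Pn - grid).max()

print("mean of Z              :", Z.mean())
print("sd of Z                :", Z.std())
# Theorem 6.1 with t = 1/sqrt(n), L = 1 predicts at most exp(-1/2)
print("P(Z >= EZ + 1/sqrt(n)) :", (Z >= Z.mean() + 1 / np.sqrt(n)).mean(),
      "<= exp(-1/2) =", np.exp(-0.5))
```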
6.2. Symmetrization

Definition. A Rademacher sequence $\epsilon_1,\dots,\epsilon_n$ is a sequence of i.i.d. random variables with values in $\{\pm 1\}$, and with $P(\epsilon_i = 1) = P(\epsilon_i = -1) = 1/2$ ($i = 1,\dots,n$).

Theorem 6.3 (van der Vaart and Wellner (1996)). Let $\epsilon_1,\dots,\epsilon_n$ be a Rademacher sequence independent of $X_1,\dots,X_n$. Then
\[ E\Bigl( \sup_{\gamma\in\Gamma} |P_n(\gamma) - P(\gamma)| \Bigr) \le 2\, E\Bigl( \sup_{\gamma\in\Gamma} \Bigl| \frac1n\sum_{i=1}^n \epsilon_i \gamma(X_i) \Bigr| \Bigr) . \qquad \square \]

6.3. Contraction principle

Theorem 6.4 (Ledoux and Talagrand (1991)). Let $x_1,\dots,x_n$ be non-random elements of $\mathcal X$, and let $\gamma : \mathbb R\to\mathbb R$ be Lipschitz, i.e.,
\[ |\gamma(z_1) - \gamma(z_2)| \le |z_1 - z_2|, \quad z_1, z_2\in\mathbb R . \]
Furthermore, let $\mathcal F$ be a class of functions on $\mathcal X$. Then for any $\tilde f\in\mathcal F$,
\[ E\Bigl( \sup_{f\in\mathcal F} \Bigl| \sum_{i=1}^n \epsilon_i [\gamma(f(x_i)) - \gamma(\tilde f(x_i))] \Bigr| \Bigr) \le 2\, E\Bigl( \sup_{f\in\mathcal F} \Bigl| \sum_{i=1}^n \epsilon_i (f(x_i) - \tilde f(x_i)) \Bigr| \Bigr) . \qquad \square \]

6.4. Weighted empirical processes

Let $\Gamma = \{\gamma_f : f\in\mathcal F\}$ be a class of functions indexed by a set $\mathcal F$. Define the empirical process
\[ \nu_n(f) = \sqrt n (P_n - P)(\gamma_f), \quad f\in\mathcal F . \]
Consider a function d on $\mathcal F$ taking non-negative values. Suppose we are interested in the weighted empirical process
\[ \frac{\nu_n(f)}{d^{\alpha}(f)}, \]
where $\alpha > 0$ is given. We have for all $a > 0$ and $t > 0$,
\[ P\Bigl( \sup_{f\in\mathcal F,\ d(f) > t} \frac{|\nu_n(f)|}{d^{\alpha}(f)} > a \Bigr) \le \sum_{j=1}^\infty P\Bigl( \sup_{f\in\mathcal F,\ d(f)\le 2^j t} |\nu_n(f)| > a\, 2^{\alpha(j-1)} t^{\alpha} \Bigr) . \]
Each term in the sum on the right-hand side can be studied using results for the non-weighted empirical process. Because the index set $\mathcal F$ is peeled into annuli $\{f\in\mathcal F : 2^{j-1} t < d(f) \le 2^j t\}$, this method for obtaining probability bounds is sometimes referred to as the peeling device (van de Geer (2000)). We apply
the peeling device in the proof of Lemma 6.4. It is also handy for obtaining the
modulus of continuity of the empirical process indexed by functions satisfying an
entropy bound (see Section 6.6).
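Spelled out, the display above is nothing but a union bound over the annuli $A_j = \{f : 2^{j-1} t < d(f) \le 2^j t\}$, $j \ge 1$, whose union is $\{f : d(f) > t\}$: on $A_j$ one has $d^{\alpha}(f) > 2^{\alpha(j-1)} t^{\alpha}$, so that
\[ P\Bigl( \sup_{A_j} \frac{|\nu_n(f)|}{d^{\alpha}(f)} > a \Bigr) \le P\Bigl( \sup_{f :\, d(f)\le 2^j t} |\nu_n(f)| > a\, 2^{\alpha(j-1)} t^{\alpha} \Bigr), \]
and summing over j gives the stated bound.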
6.5. The case of a Lipschitz transformation of a linear space

Let $\gamma : \mathbb R\to\mathbb R$ be a Lipschitz function, i.e.,
\[ |\gamma(z_1) - \gamma(z_2)| \le |z_1 - z_2|, \quad z_1, z_2\in\mathbb R . \]
Let $\mathcal F$ be a collection of functions on $\mathcal X$ and define
\[ \gamma_f = \gamma\circ f, \quad f\in\mathcal F . \]
Recall the notation for the empirical process
\[ \nu_n(f) = \sqrt n (P_n - P)(\gamma_f), \quad f\in\mathcal F . \]

Regression and classification example. In regression and classification, we replace X with values $x\in\mathcal X$ by the pair (X, Y) with values $(x,y)\in\mathcal X\times\mathbb R$. In robust regression, the loss function is $\gamma(y - f(x))$ with $\gamma$ a given Lipschitz function. In binary classification, one has $y\in\{\pm1\}$ and the SVM loss is $\gamma(y f(x))$ where $\gamma(z) = (1-z)_+$ is the hinge loss, which is clearly also Lipschitz.

Suppose now that $\mathcal F$ is a bounded subset of a collection of linear functions
\[ \bar{\mathcal F} = \Bigl\{ f_\beta = \sum_{j=1}^n \beta_j\psi_j : |f_\beta|_\infty \le K_0/2 \Bigr\}, \]
where $\{\psi_j\}_{j=1}^n$ is a given system of functions. The constant $K_0$ is assumed to be finite and satisfy $K_0 \ge 1$. We assume there are exactly n functions in the system to facilitate the exposition.

Denote the $\ell_1$-norm of a vector $\beta\in\mathbb R^n$ by
\[ I(\beta) = \sum_{j=1}^n |\beta_j| . \]

Normalization condition on the system $\{\psi_j\}$. Assume that for all $j = 1,\dots,n$,
\[ \max_x |\psi_j(x)| \le \sqrt{\frac{n}{\log n}}, \]
and
\[ P|\psi_j|^2 \le 1 . \]
Fix an arbitrary $f^* = f_{\beta^*}$ in $\mathcal F$. Define for all $M > 0$,
\[ \mathcal F_M = \{f_\beta\in\mathcal F : I(\beta - \beta^*) \le M\}, \]
and
\[ Z_M = \sup_{f\in\mathcal F_M} |\nu_n(f) - \nu_n(f^*)|/\sqrt n . \]
We first apply, in Lemma 6.1, the concentration inequality of Theorem 6.2 to $Z_M$, with M fixed. The result is a probability inequality for the (one-sided) deviation
of $Z_M$ from its mean $E Z_M$. The next task is to obtain an upper bound for $E Z_M$. This is done in Lemmas 6.2 and 6.3. The combination of Lemmas 6.1–6.3 yields the probability inequality (6.1) (below Lemma 6.3) for $Z_M$ with M fixed. Lemma 6.4 finally uses this result and the peeling device of Section 6.4, to derive a probability bound for the weighted empirical process.

We use the notation
\[ K_0\vee M = \max(K_0, M), \qquad K_0\wedge M = \min(K_0, M) . \]
Lemma 6.1. Assume the normalization condition on the system $\{\psi_j\}$. For all M satisfying
\[ K_0\sqrt{\frac{\log n}{n}} \le M \le K_0\sqrt{\frac{n}{\log n}}, \]
we have
\[ P\Bigl( Z_M \ge 2 E Z_M + 36 K_0 (K_0\vee M)^2 \frac{\log n}{n} \Bigr) \le \exp[-(K_0\vee M)^2 \log n] . \]

Proof. By the second part of the normalization condition on the system $\{\psi_j\}$, we know that
\[ P|f_\beta - f_{\beta^*}|^2 \le I^2(\beta - \beta^*) . \]
In the concentration inequality of Theorem 6.2, we may take
\[ \sigma^2 \le (K_0\wedge M)^2 . \]
We moreover take there $t^2 = (K_0\vee M)^2 \log n/n$, and $b = K_0$. Then one checks that, for M in the assumed range,
\[ \sigma t\sqrt 8 + 69 b t^2/2 \le 36 K_0 (K_0\vee M)^2 \log n/n . \qquad \square \]
Lemma 6.2. We have for all $M > 0$,
\[ E Z_M \le 4 M\, E\Bigl( \max_{1\le j\le n} \Bigl| \frac1n\sum_{i=1}^n \epsilon_i \psi_j(X_i) \Bigr| \Bigr), \]
where $\epsilon_1,\dots,\epsilon_n$ is a Rademacher sequence (see Section 6.2 for a definition), independent of $X_1,\dots,X_n$.
Proof. By the definition of $Z_M$,
\[ E Z_M = E\Bigl( \sup_{f\in\mathcal F_M} |(P_n - P)(\gamma_f - \gamma_{f^*})| \Bigr) . \]
Use the symmetrization inequality (Theorem 6.3) and then the contraction principle (Theorem 6.4). Then we get
\[ E Z_M \le 2\, E\Bigl( \sup_{f\in\mathcal F_M} \Bigl| \frac1n\sum_{i=1}^n \epsilon_i (\gamma_f(X_i) - \gamma_{f^*}(X_i)) \Bigr| \Bigr) \le 4\, E\Bigl( \sup_{f\in\mathcal F_M} \Bigl| \frac1n\sum_{i=1}^n \epsilon_i (f(X_i) - f^*(X_i)) \Bigr| \Bigr) . \]
But obviously,
\[ E\Bigl( \sup_{f\in\mathcal F_M} \Bigl| \frac1n\sum_{i=1}^n \epsilon_i (f(X_i) - f^*(X_i)) \Bigr| \Bigr) \le M\, E\Bigl( \max_{1\le j\le n} \Bigl| \frac1n\sum_{i=1}^n \epsilon_i \psi_j(X_i) \Bigr| \Bigr) . \qquad \square \]

From now on we assume $\log n \ge 1$.


Lemma 6.3. Assume the normalization condition on the system $\{\psi_j\}$. Let $\epsilon_1,\dots,\epsilon_n$ be a Rademacher sequence, independent of $X_1,\dots,X_n$. We have
\[ E\Bigl( \max_{1\le j\le n} \Bigl| \frac1n\sum_{i=1}^n \epsilon_i \psi_j(X_i) \Bigr| \Bigr) \le 24\sqrt{\frac{\log n}{n}} . \]

Proof. Define
\[ Z' = \Bigl( \max_{1\le j\le n} \Bigl| \frac1n\sum_{i=1}^n \epsilon_i \psi_j(X_i) \Bigr| \Bigr) \Big/ \Bigl( 3\sqrt{\frac{\log n}{n}} \Bigr) . \]
Using the same argument as in the proof of Lemma 5.3, we find for all $t \ge 1$,
\[ P(Z' \ge t) \le 2\exp\Bigl[ -\frac27 t\log n \Bigr] . \]
Hence,
\[ E(Z') = \int_0^\infty P(Z' \ge t)\,dt \le 1 + \int_1^\infty 2\exp\Bigl[ -\frac27 t\log n \Bigr]\,dt = 1 + \frac{7}{\log n}\exp\Bigl[ -\frac27\log n \Bigr] \le 8 . \]
(Here, we used that $\log n \ge 1$.) $\square$

The combination of Lemmas 6.1, 6.2 and 6.3 yields that under the normalization condition on the system $\{\psi_j\}$,
\[ (6.1)\qquad P\Bigl( Z_M \ge 228 K_0 (K_0\vee M)^2 \frac{\log n}{n} \Bigr) \le \exp[-(K_0\vee M)^2 \log n], \]
for $K_0\sqrt{\log n/n} \le M \le K_0\sqrt{n/\log n}$.

We now invoke the technique of Section 6.4, to study the weighted empirical process
\[ \frac{\nu_n(f) - \nu_n(f^*)}{I(\beta - \beta^*) + 456 K_0^2\sqrt{\log n/n}}, \quad f\in\mathcal F . \]

Lemma 6.4. Define $\lambda_n = 456 K_0^2\sqrt{\log n/n}$. Under the normalization condition on the system $\{\psi_j\}$ we have
\[ P\Bigl( \sup_{f\in\mathcal F} \frac{|\nu_n(f) - \nu_n(f^*)|}{I(\beta - \beta^*) + \lambda_n} \ge \sqrt n\,\lambda_n \Bigr) \le 3\exp[-K_0^2\log n/2] . \]
Proof. To simplify the notation, we assume that $f^* \equiv 0$. We split the class $\mathcal F$ into three subclasses, namely
\[ \mathcal F(I) = \Bigl\{ f_\beta : I(\beta) \le K_0\sqrt{\frac{\log n}{n}} \Bigr\}, \]
\[ \mathcal F(II) = \Bigl\{ f_\beta : K_0\sqrt{\frac{\log n}{n}} < I(\beta) \le K_0 \Bigr\}, \]
and
\[ \mathcal F(III) = \{ f_\beta : I(\beta) > K_0 \} . \]
Then
\[ P\Bigl( \sup_{f\in\mathcal F} \frac{|\nu_n(f)|}{I(\beta) + \lambda_n} \ge \sqrt n\,\lambda_n \Bigr) \le P\Bigl( \sup_{f\in\mathcal F(I)} \frac{|\nu_n(f)|}{I(\beta) + \lambda_n} \ge \sqrt n\,\lambda_n \Bigr) + P\Bigl( \sup_{f\in\mathcal F(II)} \frac{|\nu_n(f)|}{I(\beta) + \lambda_n} \ge \sqrt n\,\lambda_n \Bigr) + P\Bigl( \sup_{f\in\mathcal F(III)} \frac{|\nu_n(f)|}{I(\beta) + \lambda_n} \ge \sqrt n\,\lambda_n \Bigr) \]
\[ = P_I + P_{II} + P_{III} . \]
Apply (6.1) to $P_I$ to see that
\[ P_I \le P\Bigl( \sup_{f\in\mathcal F(I)} |\nu_n(f)| \ge \sqrt n\,\lambda_n^2 \Bigr) = P\Bigl( Z_{K_0\sqrt{\log n/n}} \ge (456)^2 K_0^4 \frac{\log n}{n} \Bigr) \le P\Bigl( Z_{K_0\sqrt{\log n/n}} \ge 228 K_0^3 \frac{\log n}{n} \Bigr) \le \exp[-K_0^2\log n] . \]
Now,
\[ \mathcal F(II) \subset \bigcup_{j=0}^{j_0} \{ f_\beta\in\mathcal F : M_{j+1} < I(\beta) \le M_j \}, \]
where $M_j = 2^{-j} K_0$, $j = 0,\dots,j_0$, and where $2^{j_0} \le \sqrt{n/\log n}$. Thus
\[ P_{II} \le \sum_{j=0}^{j_0} P\bigl( Z_{M_j} \ge \lambda_n M_j/2 \bigr) \le \sum_{j=0}^{j_0} \exp[-K_0^2\log n] \le \exp[-K_0^2\log n/2] . \]

Finally,
\[ \mathcal F(III) = \bigcup_{j=1}^{\infty} \{ f_\beta\in\mathcal F : M_{j-1} < I(\beta) \le M_j \}, \]
where $M_j = 2^j K_0$, $j = 0, 1, 2,\dots$. So
\[ P_{III} \le \sum_{j=1}^{\infty} P\bigl( Z_{M_j} \ge \lambda_n M_j/2 \bigr) \le \sum_{j=1}^{\infty} \exp[-M_j^2\log n] \le \exp[-K_0^2\log n] . \qquad \square \]
Let us conclude this section by putting the result of Lemma 6.4 in the context of Chapter 5. Following the notation of that chapter, let
\[ \mathcal A_n = \bigl\{ -[\nu_n(\hat f_n) - \nu_n(f^*)] \le \sqrt n\,[\mathrm{pen}(\hat f_n - f^*) + \lambda_n^2] \bigr\}, \]
where $\hat f_n$ is any random function in $\mathcal F$, $\lambda_n = 456 K_0^2\sqrt{\log n/n}$ and $\mathrm{pen}(f_\beta) = \lambda_n I(\beta)$. Then by Lemma 6.4, under the normalization condition on the system $\{\psi_j\}$, one has
\[ P(\mathcal A_n) \ge 1 - 3\exp[-K_0^2\log n/2] . \]
This, in combination with the result of Lemma 5.4, completes, for the case $m = n$, the proof of adaptivity of the $\ell_1$-penalized SVM loss estimator. (When m is larger than n, say $m = n^D$, the result goes through with a suitably adjusted normalization condition on the system $\{\psi_j\}$, or alternatively, with suitably adjusted constants depending on D.)

The result of Lemma 5.1 for the $\ell_1$-penalized robust regression estimator can be completed analogously. It suffices to replace, in the arguments of this section, the concentration inequality of Theorem 6.2 by the one of Theorem 6.1.
6.6. Modulus of continuity of the empirical process

Let $\mathcal F$ be a class of functions on $\mathcal X$.

Definition. Let $u > 0$. The u-entropy $H(u, \mathcal F, d)$ (for the metric d) is the logarithm of the minimum number of balls with radius u necessary to cover $\mathcal F$.

We define
\[ \|f\|^2 = P|f|^2 = \sum_{i=1}^n E f^2(X_i)/n . \]
Let moreover
\[ |f|_\infty = \sup_{x\in\mathcal X} |f(x)|, \]
and we suppose that the entropy of $\mathcal F$ for the metric induced by $|\cdot|_\infty$ is finite. We denote this entropy by $H_\infty(\cdot, \mathcal F)$.

Consider again a transformation
\[ \gamma_f = \gamma\circ f, \quad f\in\mathcal F, \]
where $\gamma$ is Lipschitz.

Theorem 6.5 (van de Geer (2000)). Let $\mathcal F$ be a class of functions with $|f - \tilde f|_\infty \le 1$ for all $f, \tilde f\in\mathcal F$. Suppose that for some constants A and $s > 1/2$,
\[ H_\infty(u, \mathcal F) \le A u^{-\frac1s}, \quad u > 0 . \]
Then there exists a constant c, depending only on A and s, such that for all $\tilde f\in\mathcal F$ and all $t \ge c$,
\[ P\Bigl( \sup_{f\in\mathcal F,\ \|f - \tilde f\| > n^{-\frac{s}{2s+1}}} \frac{|\nu_n(\gamma_f) - \nu_n(\gamma_{\tilde f})|}{\|f - \tilde f\|^{1 - \frac{1}{2s}}} \ge t \Bigr) \le c\exp\Bigl[ -\frac{t^2}{c^2} \Bigr] . \qquad \square \]
Example 6.1. Let $\mathcal X = [0,1]$ and
\[ \mathcal F = \Bigl\{ f : |f|_\infty \le 1, \ \int_0^1 |f^{(s)}(x)|^2\,dx \le 1 \Bigr\} . \]
Here $f^{(s)}$ is the s-th derivative of f. Then (Kolmogorov and Tikhomirov (1959)), the entropy condition of Theorem 6.5 holds.
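Indeed, by the results of Kolmogorov and Tikhomirov (1959) (see also Birman and Solomjak (1967)), such a Sobolev-type ball has
\[ H_\infty(u, \mathcal F) \asymp u^{-1/s}, \quad u \downarrow 0 ; \]
for instance, for $s = 1$ this is the same entropy order as for the classical ball of Lipschitz functions on [0,1].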
Exercise 6.1. We first study the regression setup with fixed design. Suppose, for $i = 1,\dots,n$, we have observations $(x_i, Y_i)\in\mathcal X\times\mathbb R$, where $x_i$ is a fixed co-variable, and $Y_i$ a response variable. In this setup, we throughout used the notation
\[ \|f\|_n^2 = \frac1n\sum_{i=1}^n f^2(x_i), \]
for a function $f : \mathcal X\to\mathbb R$. Note that for such a function, $\|f\| = \|f\|_n$. Consider the M-estimator over the class $\mathcal F$ defined in Example 6.1:
\[ \hat f_n = \arg\min_{f\in\mathcal F} \frac1n\sum_{i=1}^n \gamma(Y_i - f(x_i)) . \]
Assume the margin condition of Section 2.5 holds, with d the metric corresponding to the $\|\cdot\|_n$-norm, and with margin parameter $\kappa \ge 2$. By applying Theorem 6.5 (with $\mathcal F$ there replaced by $\tilde{\mathcal F} = \{(x,y)\mapsto y - f(x) : f\in\mathcal F\}$), show that the bound as given in (2.4) for the average risk becomes
\[ E R(\hat f_n) - R(f_0) \le \Bigl( \frac{1+\delta}{1-\delta} \Bigr) \Bigl( c_0\, n^{-\frac{s\kappa}{2(\kappa-1)s+1}} + R(f^*) - R(f_0) \Bigr), \]
where $\alpha = 1 - 1/(2s)$. Here, $c_0$ is a constant depending on the constant c of Theorem 6.5, and on $\delta$, $\kappa$ and s. Compare with the result of Exercise 5.2. This illustrates that, indeed, up to $\log n$-terms, the estimator with $\ell_1$-penalty of Section 5.1 adapts to the smoothness (= number of derivatives in this case) s.
Exercise 6.2. Generalize the situation of Exercise 6.1 to the case of random design, assuming the margin condition of Section 2.5 is met with d the metric corresponding to the $\|\cdot\|$-norm.
6.7. Bibliographical remarks

Hoeffding's inequality and Bernstein's inequality can be found throughout the literature. Original references are Hoeffding (1963), Bernstein (1924) and Bennett (1962). Concentration inequalities have been derived by Talagrand (e.g., Talagrand (1995)) and further developed by Ledoux (e.g., Ledoux (1996)). Massart (2000a) studies the constants in these inequalities. Symmetrization is an important randomization technique in empirical process theory, which goes back to Vapnik and Chervonenkis (1971). The method described in Section 6.4 for weighted empirical processes is from Alexander (1985), and is referred to as the peeling device in van de Geer (2000). A recent paper on weighted empirical processes is Giné and Koltchinskii (2004). The results in Section 6.5 are from Tarigan and van de Geer (2005). In van de Geer (2000), moduli of continuity are derived that are more general than the one cited in Section 6.6. There, also their implications for rates of convergence in regression and maximum likelihood estimation are explained. Entropy results for subsets of Besov spaces, which are extensions of the space $\mathcal F$ studied in Example 6.1, can be found in Birgé and Massart (2000).

References
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. Proceedings 2nd International Symposium on Information Theory, B.N. Petrov and F. Csáki, Eds., Akadémiai Kiadó, Budapest, 267–281.
Alexander, K.S. (1985). Rates of growth for weighted empirical processes. In: Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer 2 (L. Le Cam and R.A. Olshen, eds.), 475–493. University of California Press, Berkeley.
Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Prob. Theory Rel. Fields 113, 301–413.
Bartlett, P.L., Jordan, M.I. and McAuliffe, J.D. (2003). Convexity, classification and risk bounds. Techn. Report 638, University of California at Berkeley.
Bennett, G. (1962). Probability inequalities for sums of independent random variables. Journ. Amer. Statist. Assoc. 57, 33–45.
Bernstein, S. (1924). Sur une modification de l'inégalité de Tchebichef. Ann. Sci. Inst. Sav. Ukraine Sect. Math. I (Russian, French summary).
Birgé, L. and Massart, P. (2000). An adaptive compression algorithm in Besov spaces. Journ. Constr. Approx. 16, 1–36.
Birman, M.S. and Solomjak, M.Z. (1967). Piecewise polynomial approximation of functions in the classes $W_p^\alpha$. Math. USSR Sbornik 73, 295–317.
Boser, B., Guyon, I. and Vapnik, V.N. (1992). A training algorithm for optimal margin classifiers. Fifth Annual Conf. on Comp. Learning Theory, Pittsburgh, ACM, 142–152.
Candès, E.J. and Donoho, D.L. (2004). New tight frames of curvelets and optimal representations of objects with piecewise $C^2$ singularities. Comm. Pure and Applied Math. LVII, 219–266.
Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York, Berlin, Heidelberg.
Donoho, D.L. and Johnstone, I.M. (1994a). Minimax risk over $\ell_p$-balls for $\ell_q$-error. Prob. Th. Related Fields 99, 277–303.
Donoho, D.L. and Johnstone, I.M. (1994b). Ideal spatial adaptation via wavelet shrinkage. Biometrika 81, 425–455.
Donoho, D.L. (1995). De-noising by soft-thresholding. IEEE Transactions in Information Theory 41, 613–627.
Donoho, D.L. and Johnstone, I.M. (1996). Neo-classical minimax problems, thresholding and adaptive function estimation. Bernoulli 2, 39–62.
Donoho, D.L. (1999). Wedgelets: nearly minimax estimation of edges. Ann. Statist. 27, 859–897.
Donoho, D.L. (2004a). For most large underdetermined systems of equations, the minimal $\ell_1$-norm near-solution approximates the sparsest near-solution. Techn. Report, Stanford University.
Donoho, D.L. (2004b). For most large underdetermined systems of linear equations, the minimal $\ell_1$-norm solution is also the sparsest solution. Techn. Report, Stanford University.
Edmunds, D.E. and Triebel, H. (1992). Entropy numbers and approximation numbers in function spaces. II. Proceedings of the London Mathematical Society (3) 64, 153–169.
Giné, E. and Koltchinskii, V. (2004). Concentration inequalities and asymptotic results for ratio type empirical processes. Working paper.
Goldstein, H. (1980). Classical Mechanics. 2nd edition, Reading, MA, Addison-Wesley.
Green, P.J. and Silverman, B.W. (1994). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall, London.
Härdle, W., Kerkyacharian, G., Picard, D. and Tsybakov, A. (1998). Wavelets, Approximation and Statistical Applications. Lecture Notes in Statistics, vol. 129. Springer, New York, Berlin, Heidelberg.
Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning. Data Mining, Inference and Prediction. Springer, New York.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journ. Amer. Statist. Assoc. 58, 13–30.
Koenker, R. and Bassett Jr., G. (1978). Regression quantiles. Econometrica 46, 33–50.
Koenker, R., Ng, P.T. and Portnoy, S.L. (1992). Nonparametric estimation of conditional quantile functions. L1 Statistical Analysis and Related Methods, Ed. Y. Dodge, Elsevier, Amsterdam, 217–229.
Koenker, R., Ng, P.T. and Portnoy, S.L. (1994). Quantile smoothing splines. Biometrika 81, 673–680.
Kolmogorov, A.N. and Tikhomirov, V.M. (1959). $\varepsilon$-entropy and $\varepsilon$-capacity of sets in function spaces. Uspekhi Mat. Nauk 14, 3–86. (English transl. in Amer. Math. Soc. Transl. (2) 17 (1961), 277–364.)
Koltchinskii, V. (2001). Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory 47, 1902–1914.
Koltchinskii, V. and Panchenko, D. (2002). Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist. 30, 1–50.
Koltchinskii, V. (2003). Local Rademacher complexities and oracle inequalities in risk minimization. Manuscript.
Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Springer Verlag, New York.
Ledoux, M. (1996). Talagrand deviation inequalities for product measures. ESAIM: Probab. Statist. 1, 63–87. Available at: www.emath.fr/ps/
Lin, Y. (2002). Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery 6, 259–275.
Loubes, J.-M. and van de Geer, S. (2002). Adaptive estimation in regression, using soft thresholding type penalties. Statistica Neerlandica 56, 453–478.
Lugosi, G. and Wegkamp, M. (2004). Complexity regularization via localized random penalties. To appear in Ann. Statist.
Mammen, E. and Tsybakov, A.B. (1999). Smooth discrimination analysis. Ann. Statist. 27, 1808–1829.
Massart, P. (2000a). About the constants in Talagrand's concentration inequalities for empirical processes. Ann. Probab. 28, 863–884.
Massart, P. (2000b). Some applications of concentration inequalities to statistics. Ann. Fac. Sci. Toulouse 9, 245–303.
Pareto, V. (1897). Cours d'Économie Politique. Rouge, Lausanne et Paris.
Portnoy, S. (1997). Local asymptotics for quantile smoothing splines. Ann. Statist. 25, 414–434.
Portnoy, S. and Koenker, R. (1997). The Gaussian hare and the Laplacian tortoise: computability of squared error versus absolute-error estimators, with discussion. Stat. Science 12, 279–300.
Schölkopf, B. and Smola, A. (2002). Learning with Kernels. MIT Press, Cambridge.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461–464.
Silverman, B.W. (1985). Some aspects of the smoothing spline approach to nonparametric regression curve fitting (with discussion). Journ. Royal Statist. Soc. B 47, 1–52.
Talagrand, M. (1995). Concentration of measure and isoperimetric inequalities in product spaces. Publications Mathématiques de l'I.H.É.S. 81, 73–205.
Tarigan, B. and van de Geer, S.A. (2005). Support vector machines with $\ell_1$-complexity regularization. Submitted.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal Royal Statist. Soc. B 58, 267–288.
Tsybakov, A.B. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32, 135–166.
Tsybakov, A.B. and van de Geer, S.A. (2005). Square root penalty: adaptation to the margin in classification and in edge estimation. To appear in Ann. Statist. 33.
van de Geer, S. (2000). Empirical Processes in M-Estimation. Cambridge University Press.
van de Geer, S. (2001). Least squares estimation with complexity penalties. Mathematical Methods of Statistics 10, 355–374.
van de Geer, S. (2002). M-estimation using penalties or sieves. J. Statist. Planning Inf. 108, 55–69.
van de Geer, S. (2003). Adaptive quantile regression. In: Recent Advances and Trends in Nonparametric Statistics (Eds. M.G. Akritas and D.N. Politis), Elsevier, 235–250.
van der Vaart, A.W. and Wellner, J.A. (1996). Weak Convergence and Empirical Processes, with Applications to Statistics. Springer, New York.
Vapnik, V.N. and Chervonenkis, A.Ya. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Th. Probab. Appl. 16, 264–280.
Wahba, G. (1990). Spline Models for Observational Data. Philadelphia: Soc. for Ind. and Appl. Math.
Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. Chapman and Hall, London.

Sara van de Geer

((Please insert complete address))