arXiv:1204.0147v1 [cs.IT] 31 Mar 2012
SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY, 2012
Covering Numbers for Convex Functions
Adityanand Guntuboyina and Bodhisattva Sen
Abstract: In this paper we study the covering numbers of the space of convex and uniformly bounded functions in multidimension. We find optimal upper and lower bounds for the $\epsilon$-covering number $M(\mathcal{C}([a,b]^d, B), \epsilon; L_p)$, in the $L_p$-metric, $1 \le p < \infty$, in terms of the relevant constants, where $d \ge 1$, $a < b \in \mathbb{R}$, $B > 0$, and $\mathcal{C}([a,b]^d, B)$ denotes the set of all convex functions on $[a,b]^d$ that are uniformly bounded by $B$. We summarize previously known results on covering numbers for convex functions and also provide alternate proofs of some known results. Our results have direct implications in the study of rates of convergence of empirical minimization procedures as well as optimal convergence rates in the numerous convexity constrained function estimation problems.

Index Terms: convexity constrained function estimation, empirical risk minimization, Hausdorff distance, Kolmogorov entropy, $L_p$-metric, metric entropy, packing numbers.
I. INTRODUCTION
Ever since the work of [1], covering numbers (and their logarithms, known as metric entropy numbers) have been studied extensively in a variety of disciplines. For a subset $T$ of a metric space $(\mathcal{A}, \rho)$, the $\epsilon$-covering number $M(T, \epsilon; \rho)$ is defined as the smallest number of balls of radius $\epsilon$ whose union contains $T$. Covering numbers capture the size of the underlying metric space and play a central role in a number of areas in information theory and statistics, including nonparametric function estimation, density estimation, empirical processes and machine learning.
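For intuition, the covering number of a finite point set can be bounded from above by a simple greedy procedure. The Python sketch below is purely illustrative (the point set and metric are arbitrary choices, not taken from the paper): it picks centers until every point lies within $\epsilon$ of one of them, and the number of chosen centers witnesses an upper bound on $M(T, \epsilon; \rho)$.

```python
# Illustrative greedy covering: the chosen centers witness an upper bound on
# the covering number M(T, eps) of a finite set T; the data are arbitrary.
def greedy_cover(points, eps, dist):
    centers = []
    for pt in points:
        if all(dist(pt, c) > eps for c in centers):
            centers.append(pt)     # pt is not yet covered: open a ball at pt
    return centers

T = [i / 10 for i in range(10)]    # ten points on the real line
centers = greedy_cover(T, 0.15, lambda s, t: abs(s - t))
# every point of T lies within eps = 0.15 of some chosen center
assert all(any(abs(pt - c) <= 0.15 for c in centers) for pt in T)
```

Since the greedy centers are pairwise more than $\epsilon$ apart, the same set also witnesses a lower bound via packing numbers.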
In this paper we study the covering numbers of the space of convex and uniformly bounded functions in multidimension. Specifically, we find optimal upper and lower bounds for the covering number $M(\mathcal{C}([a,b]^d, B), \epsilon; L_p)$, in the $L_p$-metric, $1 \le p < \infty$, in terms of the relevant constants, where $d \ge 1$, $a < b \in \mathbb{R}$, $B > 0$, and $\mathcal{C}([a,b]^d, B)$ denotes the set of all convex functions on $[a,b]^d$ that are uniformly bounded by $B$. We also summarize previously known results on covering numbers for convex functions. The special case of the problem when $d = 1$ has been recently established by Dryanov in [2, Theorem 3.1]. Prior to [2], the only other result on the covering numbers of convex functions is due to Bronshtein in [3] (see also [4, Chapter 8]), who considered convex functions that are uniformly bounded and uniformly Lipschitz with a known Lipschitz constant under the $L_\infty$ metric.
In recent years there has been an upsurge of interest in nonparametric function estimation under convexity-based constraints, especially in multidimension. In general function estimation, it is well-known (see e.g., [5]-[8]) that the covering numbers of the underlying function space can be used to characterize optimal rates of convergence. They are also useful for studying the rates of convergence of empirical minimization procedures (see e.g., [9], [10]). Our results have direct implications in this regard in the context of understanding the rates of convergence of the numerous convexity constrained function estimators, e.g., the nonparametric least squares estimator of a convex regression function studied in [11], [12], and the maximum likelihood estimator of a log-concave density in multidimension studied in [13]-[15]. Also, similar problems that crucially use convexity/concavity constraints to estimate sets have received recent attention in the statistical and machine learning literature (see e.g., [16], [17]), and our results can be applied in such settings.

A. Guntuboyina is with the Department of Statistics, University of California, Berkeley, CA 94720 USA (email: aditya@stat.berkeley.edu). B. Sen is with the Department of Statistics, Columbia University, New York, NY 10027 USA (email: bodhi@stat.columbia.edu).
The paper is organized as follows. In Section II, we set
up notation and provide motivation for our main results,
which are proved in Section III. In Section IV, we draw
some connections to previous results on covering numbers for
convex functions and prove a related auxiliary result along
with some inequalities of possible independent interest.
II. MOTIVATION
The first result on covering numbers for convex functions was proved by Bronshtein in [3], who considered convex functions defined on a cube in $\mathbb{R}^d$ that are uniformly bounded and uniformly Lipschitz. Specifically, let $\mathcal{C}([a,b]^d, B, \Gamma)$ denote the class of real-valued convex functions defined on $[a,b]^d$ that are uniformly bounded in absolute value by $B$ and uniformly Lipschitz with constant $\Gamma$. In Theorem 6 of [3], Bronshtein proved that, for $\epsilon$ sufficiently small, the logarithm of $M(\mathcal{C}([a,b]^d, B, \Gamma), \epsilon; L_\infty)$ is of the order $\epsilon^{-d/2}$, where the $L_\infty$ distance is defined by
$$\|f - g\|_\infty := \sup_{x \in [a,b]^d} |f(x) - g(x)|.$$
Bronshtein worked with the class $\mathcal{C}([a,b]^d, B, \Gamma)$ where the functions are uniformly Lipschitz with constant $\Gamma$. However, in convexity-based function estimation problems, one usually does not have a known uniform Lipschitz bound on the unknown function class. This leads to difficulties in the analysis of empirical minimization procedures via Bronshtein's result. To the best of our knowledge, there does not exist any other result on the covering numbers of convex functions that deals with all $d \ge 1$ and does not require the Lipschitz constraint.
In the absence of the uniformly Lipschitz constraint (i.e., if one works with the class $\mathcal{C}([a,b]^d, B)$ instead of $\mathcal{C}([a,b]^d, B, \Gamma)$), the covering numbers under the $L_\infty$ metric are infinite. In other words, the space $\mathcal{C}([a,b]^d, B)$ is not totally bounded under the $L_\infty$ metric. This can be seen, for instance, from the convex functions $f_k(x) := \max(0, 1 - 2^k x_1)$, $k \ge 1$, on $[0,1]^d$, which are uniformly bounded by 1 and satisfy (writing $e_1$ for the first coordinate vector)
$$\left| f_j(2^{-k} e_1) - f_k(2^{-k} e_1) \right| = 1 - 2^{j-k} \ge 1/2, \quad \text{for all } j < k.$$
This motivated us to study the covering numbers of the class $\mathcal{C}([a,b]^d, B)$ under a different metric, namely the $L_p$-metric for $1 \le p < \infty$. We recall that under the $L_p$-metric, $1 \le p < \infty$, the distance between two functions $f$ and $g$ on $[a,b]^d$ is defined as
$$\|f - g\|_p := \left( \int_{x \in [a,b]^d} |f(x) - g(x)|^p \, dx \right)^{1/p}.$$
Our main result in this paper shows that if one works with the $L_p$-metric as opposed to $L_\infty$, then the covering numbers of $\mathcal{C}([a,b]^d, B)$ are finite. Before stating the results, we record a scaling identity. For $f \in \mathcal{C}([a,b]^d, B)$, let $\tilde f(x) := f(a\mathbf{1} + (b-a)x)/B$ for $x \in [0,1]^d$, where $\mathbf{1} := (1, \ldots, 1)$. A change of variable gives
$$\int_{x \in [0,1]^d} |\tilde f(x) - g(x)|^p \, dx = B^{-p} (b-a)^{-d} \int_{y \in [a,b]^d} \left| f(y) - B g\!\left( \frac{y - a\mathbf{1}}{b-a} \right) \right|^p dy$$
for $g \in \mathcal{C}([0,1]^d, 1)$. It follows that covering $f$ to within $\epsilon$ in the $L_p$-metric on $[a,b]^d$ is equivalent to covering $\tilde f$ to within $\epsilon (b-a)^{-d/p}/B$ in the $L_p$-metric on $[0,1]^d$. Therefore, for $1 \le p < \infty$,
$$M(\mathcal{C}([a,b]^d, B), \epsilon; L_p) = M(\mathcal{C}([0,1]^d, 1), \tilde\epsilon; L_p), \qquad (1)$$
where $\tilde\epsilon := \epsilon (b-a)^{-d/p} / B$.
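The change-of-variable identity behind (1) can be sanity-checked numerically. The sketch below is illustrative only (the particular convex functions are arbitrary choices satisfying the stated bounds): it compares the two sides for $d = 1$, $p = 2$, using the rescaling $\tilde f(x) = f(a + (b-a)x)/B$.

```python
# Sanity check of the change-of-variable identity behind (1), for d = 1, p = 2.
# The particular convex functions below are arbitrary illustrative choices.
a, b, B, p = -1.0, 3.0, 2.0, 2
f = lambda y: 0.25 * y * y - 1.0      # convex on [a, b], |f| <= B there
g = lambda x: x * x - 0.5             # convex on [0, 1], |g| <= 1 there

def riemann(h, lo, hi, n=20000):      # midpoint-rule quadrature
    w = (hi - lo) / n
    return sum(h(lo + (i + 0.5) * w) for i in range(n)) * w

ft = lambda x: f(a + (b - a) * x) / B # rescaled version of f, defined on [0, 1]
lhs = riemann(lambda x: abs(ft(x) - g(x)) ** p, 0.0, 1.0)
rhs = riemann(lambda y: abs(f(y) - B * g((y - a) / (b - a))) ** p, a, b) / (B ** p * (b - a))
assert abs(lhs - rhs) < 1e-6          # the two sides agree
```

Taking $p$-th roots of the identity gives exactly the equivalence of $\epsilon$-covers stated above.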
III. MAIN RESULTS

A. Upper Bound for $M(\mathcal{C}([a,b]^d, B), \epsilon; L_p)$

Theorem 3.1: Fix $1 \le p < \infty$. There exist positive constants $c$ and $\epsilon_0$, depending only on the dimension $d$ and $p$, such that, for every $B > 0$ and $b > a$, we have
$$\log M\left( \mathcal{C}([a,b]^d, B), \epsilon; L_p \right) \le c \left( \frac{B (b-a)^{d/p}}{\epsilon} \right)^{d/2},$$
for every $\epsilon \le \epsilon_0 B (b-a)^{d/p}$.
The main ingredient in our proof of the above theorem is an extension of Bronshtein's theorem to uniformly bounded convex functions having different Lipschitz constraints in different directions. Specifically, for $B \in (0, \infty)$, $\lambda_i \in (0, \infty]$ and $a_i < b_i$ for $i = 1, \ldots, d$, let $\mathcal{C}\left( \prod_{i=1}^d [a_i, b_i]; B; \lambda_1, \ldots, \lambda_d \right)$ denote the set of all real-valued convex functions $f$ on the rectangle $[a_1, b_1] \times \cdots \times [a_d, b_d]$ that are uniformly bounded by $B$ and satisfy
$$|f(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_d) - f(x_1, \ldots, x_{i-1}, y_i, x_{i+1}, \ldots, x_d)| \le \lambda_i |x_i - y_i| \qquad (2)$$
for every $i = 1, \ldots, d$; $x_i, y_i \in [a_i, b_i]$ and $x_j \in [a_j, b_j]$ for $j \neq i$. In other words, the function $x \mapsto f(x_1, \ldots, x_{i-1}, x, x_{i+1}, \ldots, x_d)$ is Lipschitz on $[a_i, b_i]$ with constant $\lambda_i$ for all $x_j \in [a_j, b_j]$, $j \neq i$.
Clearly, the class $\mathcal{C}([a,b]^d, B, \Gamma)$ that Bronshtein studied is contained in $\mathcal{C}([a,b]^d; B; \Gamma, \ldots, \Gamma)$. Also, it is easy to check that every function $f$ in $\mathcal{C}\left( \prod_i [a_i, b_i]; B; \lambda_1, \ldots, \lambda_d \right)$ is Lipschitz with respect to the Euclidean norm on $\prod_i [a_i, b_i]$ with Lipschitz constant $\sqrt{\lambda_1^2 + \cdots + \lambda_d^2}$.
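The Euclidean Lipschitz claim follows by telescoping over coordinates and the Cauchy-Schwarz inequality. A quick randomized spot-check for $d = 2$, with an arbitrary illustrative test function satisfying the coordinate-wise constraints:

```python
import math, random

# Coordinate-wise Lipschitz constants (l1, l2) imply a Euclidean Lipschitz
# constant sqrt(l1^2 + l2^2); the test function below is an arbitrary choice.
l1, l2 = 3.0, 0.5
f = lambda x: l1 * abs(x[0] - 0.4) + l2 * abs(x[1] - 0.7)   # convex on [0,1]^2

L = math.sqrt(l1 ** 2 + l2 ** 2)
random.seed(0)
for _ in range(1000):
    x = (random.random(), random.random())
    y = (random.random(), random.random())
    # |f(x) - f(y)| <= l1|x1-y1| + l2|x2-y2| <= L * |x - y|  (Cauchy-Schwarz)
    assert abs(f(x) - f(y)) <= L * math.hypot(x[0] - y[0], x[1] - y[1]) + 1e-12
```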
Note that for $\lambda_i = \infty$, the inequality (2) is satisfied by every function $f$. As a result, we have the equality $\mathcal{C}([a,b]^d, B) = \mathcal{C}([a,b]^d; B; \infty, \ldots, \infty)$. The following result gives an upper bound for the covering number of $\mathcal{C}\left( \prod_i [a_i, b_i]; B; \lambda_1, \ldots, \lambda_d \right)$ and is the main ingredient in the proof of Theorem 3.1. Its proof is similar to Bronshtein's proof [3, Proof of Theorem 6] of his upper bound on $\mathcal{C}([a,b]^d, B, \Gamma)$ and is included in Section IV.
Theorem 3.2: There exist positive constants $c$ and $\epsilon_0$, depending only on the dimension $d$, such that for every positive $B, \lambda_1, \ldots, \lambda_d$ and rectangle $[a_1, b_1] \times \cdots \times [a_d, b_d]$, we have
$$\log M\left( \mathcal{C}\left( \prod_{i=1}^d [a_i, b_i]; B; \lambda_1, \ldots, \lambda_d \right), \epsilon; L_\infty \right) \le c \left( \frac{B + \sum_{i=1}^d \lambda_i (b_i - a_i)}{\epsilon} \right)^{d/2}, \qquad (3)$$
for all $0 < \epsilon \le \epsilon_0 \left( B + \sum_{i=1}^d \lambda_i (b_i - a_i) \right)$.
Remark 3.1: Note that the right hand side of (3) equals $\infty$ unless $\lambda_i < \infty$ for all $i = 1, \ldots, d$. Thus, Theorem 3.2 is only meaningful when $\lambda_i < \infty$ for all $i = 1, \ldots, d$.
Remark 3.2: Because $\mathcal{C}([a,b]^d, B, \Gamma)$ is contained in $\mathcal{C}([a,b]^d; B; \Gamma, \ldots, \Gamma)$, Theorem 3.2 includes Bronshtein's upper bound on $\mathcal{C}([a,b]^d, B, \Gamma)$ as a special case. Moreover, it gives explicit dependence of the upper bound on the constants $a$, $b$, $B$ and $\Gamma$. Bronshtein did not state the dependence on these constants.
We are now ready to prove Theorem 3.1 using Theorem 3.2. Here is the intuition behind the proof. The class $\mathcal{C}([a,b]^d, B)$ can be thought of as an expansion of the class $\mathcal{C}([a,b]^d; B; \lambda_1, \ldots, \lambda_d)$ formed by the removal of the $d$ Lipschitz constraints $\lambda_1, \ldots, \lambda_d$ (or equivalently, by setting $\lambda_1 = \cdots = \lambda_d = \infty$). Instead of removing all these $d$ Lipschitz constraints at the same time, we remove them sequentially, one at a time. This is formally accomplished by induction on the number of indices $i$ for which $\lambda_i = \infty$. Each step of the induction argument focuses on the removal of one finite $\lambda_i$ and is thus like solving a one-dimensional problem. We consequently use Dryanov's ideas from [2, Theorem 3.1] to solve this quasi one-dimensional problem, which allows us to complete the induction step.

Proof of Theorem 3.1: The scaling identity (1) lets us take $a = 0$, $b = 1$ and $B = 1$.
We shall prove that there exist positive constants $c$ and $\epsilon_0$, depending only on $d$ and $p$, such that for every $\lambda_i \in (0, \infty]$, we have
$$\log M\left( \mathcal{C}\left( [0,1]^d; 1; \lambda_1, \ldots, \lambda_d \right); \epsilon; L_p \right) \le c\, \epsilon^{-d/2} \left( 2 + \sum_{i=1}^d \lambda_i 1\{\lambda_i < \infty\} \right)^{d/2}, \qquad (4)$$
for $0 < \epsilon \le \epsilon_0$. Note that this proves the theorem because we can set $\lambda_i = \infty$ for all $i = 1, \ldots, d$. Our proof will involve induction on $l$: the number of indices $i$ for which $\lambda_i = \infty$.

For $l = 0$, i.e., when $\lambda_i < \infty$ for all $i = 1, \ldots, d$, (4) is a direct consequence of Theorem 3.2. In fact, in this case, (4) also holds for $p = \infty$. Suppose now that (4) holds for all $l < k$ for some $k \in \{1, \ldots, d\}$. We shall then verify it for $l = k$. Fix $\lambda_i \in (0, \infty]$ such that exactly $k$ of them equal infinity. Without loss of generality, we assume that $\lambda_1 = \cdots = \lambda_k = \infty$ and $\lambda_i < \infty$ for $i > k$. For every sufficiently small $\epsilon > 0$, we shall exhibit an $\epsilon$-cover of $\mathcal{C}([0,1]^d; 1; \infty, \ldots, \infty, \lambda_{k+1}, \ldots, \lambda_d)$ in the $L_p$-metric whose cardinality has logarithm bounded from above by a constant multiple of $\left( \sum_{i>k} \lambda_i + 2 \right)^{d/2} \epsilon^{-d/2}$. Note that for $k = d$, the term $\sum_{i>k} \lambda_i$ equals zero. For convenience, let us denote the class $\mathcal{C}([0,1]^d; 1; \infty, \ldots, \infty, \lambda_{k+1}, \ldots, \lambda_d)$ by $\mathcal{C}$ in the rest of this proof.
Let
$$u := \exp\left( -2(p+1)^2 (p+2) \log 2 \right) \quad \text{and} \quad v := 1 - u. \qquad (5)$$
Fix $\epsilon > 0$ and choose an integer $A$ and $\delta_1, \ldots, \delta_{A+1}$ such that
$$\epsilon^p = \delta_1 < \cdots < \delta_A < u \le \delta_{A+1}.$$
For every two functions $f$ and $g$ on $[0,1]^d$, we can obviously decompose the integral $\int |f - g|^p$ as
$$\int_{[0,1]^d} |f-g|^p = \int_{[0,u] \times [0,1]^{d-1}} |f-g|^p + \int_{[u,v] \times [0,1]^{d-1}} |f-g|^p + \int_{[v,1] \times [0,1]^{d-1}} |f-g|^p.$$
Also,
$$\int_{[0,u] \times [0,1]^{d-1}} |f-g|^p \le \int_{[0,\delta_1] \times [0,1]^{d-1}} |f-g|^p + \sum_{m=1}^A \int_{[\delta_m, \delta_{m+1}] \times [0,1]^{d-1}} |f-g|^p.$$
For a fixed $m = 1, \ldots, A$, consider the problem of covering the functions in $\mathcal{C}$ on the rectangular strip $[\delta_m, \delta_{m+1}] \times [0,1]^{d-1}$. Clearly,
$$\int_{[\delta_m, \delta_{m+1}] \times [0,1]^{d-1}} |f-g|^p = (\delta_{m+1} - \delta_m) \int_{[0,1]^d} |\tilde f - \tilde g|^p \qquad (6)$$
where, for $x = (x_1, \ldots, x_d) \in [0,1]^d$,
$$\tilde f(x) := f(\delta_m + (\delta_{m+1} - \delta_m) x_1, x_2, \ldots, x_d), \quad \text{and} \quad \tilde g(x) := g(\delta_m + (\delta_{m+1} - \delta_m) x_1, x_2, \ldots, x_d).$$
By convexity, the restriction of every function $f$ in $\mathcal{C}$ to $[\delta_m, \delta_{m+1}] \times [0,1]^{d-1}$ belongs to the class
$$\mathcal{C}\left( [\delta_m, \delta_{m+1}] \times [0,1]^{d-1}; 1; 2/\delta_m, \infty, \ldots, \infty, \lambda_{k+1}, \ldots, \lambda_d \right).$$
Consequently, the corresponding function $\tilde f$ belongs to $\mathcal{C}([0,1]^d; 1; 2(\delta_{m+1} - \delta_m)/\delta_m, \infty, \ldots, \infty, \lambda_{k+1}, \ldots, \lambda_d)$. Because $2(\delta_{m+1} - \delta_m)/\delta_m < \infty$, we can use the induction hypothesis to assert the existence of positive constants $\epsilon_0$ and $c$, depending only on $d$ and $p$, such that for every positive real number $\epsilon_m \le \epsilon_0$, there exists an $\epsilon_m$-cover of $\mathcal{C}([0,1]^d; 1; 2(\delta_{m+1} - \delta_m)/\delta_m, \infty, \ldots, \infty, \lambda_{k+1}, \ldots, \lambda_d)$ in the $L_p$-metric on $[0,1]^d$ of size smaller than
$$\exp\left( c\, \epsilon_m^{-d/2} \left( 2 + \frac{2(\delta_{m+1} - \delta_m)}{\delta_m} + \sum_{i>k} \lambda_i \right)^{d/2} \right) \le \exp\left( c \left( 2 + \sum_{i>k} \lambda_i \right)^{d/2} \left( \frac{\delta_{m+1}}{\epsilon_m \delta_m} \right)^{d/2} \right).$$
By covering the functions in $\mathcal{C}$ by the constant function 0 on $[0, \delta_1] \times [0,1]^{d-1}$ and up to $\epsilon_m$ in the $L_p$-metric on $[\delta_m, \delta_{m+1}] \times [0,1]^{d-1}$ for $m = 1, \ldots, A$, we obtain a cover of the restriction of the functions in $\mathcal{C}$ to the set $[0,u] \times [0,1]^{d-1}$ in the $L_p$-metric having coverage $S_1^{1/p}$ and cardinality bounded from above by $\exp(S_2)$, where
$$S_1 := \delta_1 + \sum_{m=1}^A \epsilon_m^p (\delta_{m+1} - \delta_m) \quad \text{and} \quad S_2 := c \left( \sum_{i>k} \lambda_i + 2 \right)^{d/2} \sum_{m=1}^A \left( \frac{\delta_{m+1}}{\epsilon_m \delta_m} \right)^{d/2}. \qquad (7)$$
Suppose now that
$$\delta_m := \exp\left( p \left( \frac{p+1}{p+2} \right)^{m-1} \log \epsilon \right) \quad \text{and} \quad \epsilon_m := \exp\left( \left( 1 - \frac{p (p+1)^{m-2}}{(p+2)^{m-1}} \right) \log \epsilon \right) = \epsilon\, \delta_m^{-1/(p+1)},$$
for $m = 1, \ldots, A+1$, where $A$ is the largest integer such that
$$\exp\left( p \left( \frac{p+1}{p+2} \right)^{A-1} \log \epsilon \right) < u.$$
Then,
$$S_1 = \delta_1 + \sum_{m=1}^A \epsilon_m^p (\delta_{m+1} - \delta_m) \le \delta_1 + \sum_{m=1}^A \epsilon_m^p \delta_{m+1} = \epsilon^p \left( 1 + \sum_{m=1}^A \theta_m^2 \right),$$
and
$$S_2 = c \left( \sum_{i>k} \lambda_i + 2 \right)^{d/2} \epsilon^{-d/2} \sum_{m=1}^A \theta_m^d,$$
where
$$\theta_m := \sqrt{\frac{\epsilon\, \delta_{m+1}}{\epsilon_m \delta_m}} = \exp\left( \frac{p}{2(p+1)^2} \left( \frac{p+1}{p+2} \right)^m \log \epsilon \right).$$
Note that if $\epsilon \le 1$, then $\log \epsilon \le 0$, which implies $\theta_m \le 1$.
Also, for $m = 2, \ldots, A$, we have
$$\frac{\theta_m}{\theta_{m-1}} = \exp\left( \frac{-p \log \epsilon}{2(p+1)^2 (p+2)} \left( \frac{p+1}{p+2} \right)^{m-1} \right) \ge \exp\left( \frac{-p \log \epsilon}{2(p+1)^2 (p+2)} \left( \frac{p+1}{p+2} \right)^{A-1} \right) = \exp\left( \frac{-\log \delta_A}{2(p+1)^2 (p+2)} \right) > \exp\left( \frac{-\log u}{2(p+1)^2 (p+2)} \right) = 2,$$
where we have used $\delta_A < u$ and the fact that $u$ has the expression (5). Therefore $\theta_m \ge 2 \theta_{m-1}$, which can be rewritten as
$$\theta_m^r \le \frac{2^r}{2^r - 1} \left( \theta_m^r - \theta_{m-1}^r \right) \quad \text{for every } r \ge 1.$$
Thus,
$$\sum_{m=1}^A \theta_m^r \le \theta_1^r + \frac{2^r}{2^r - 1} \sum_{m=2}^A \left( \theta_m^r - \theta_{m-1}^r \right) = \theta_1^r + \frac{2^r}{2^r - 1} \left( \theta_A^r - \theta_1^r \right) \le \frac{2^r}{2^r - 1} \theta_A^r \le \frac{2^r}{2^r - 1},$$
where the last inequality uses $\theta_A \le 1$.
Using this for $r = 2$ and $r = d$, we deduce that
$$S_1 \le \frac{7}{3}\, \epsilon^p \quad \text{and} \quad S_2 \le \frac{2^d c}{2^d - 1} \left( \sum_{i>k} \lambda_i + 2 \right)^{d/2} \epsilon^{-d/2}.$$
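The bookkeeping above can be checked numerically. The sketch below assumes the choices $\delta_m = \exp(p((p+1)/(p+2))^{m-1} \log \epsilon)$ and $\epsilon_m = \epsilon\, \delta_m^{-1/(p+1)}$ as in the proof, with sample values $p = 2$ and $\epsilon = 10^{-20}$ (the threshold $u$ of (5) is extremely small, so $\epsilon$ must be tiny for the partition to be nontrivial); it verifies the closed form of $\theta_m$, the bounds $\theta_m \le 1$ and $\theta_m \ge 2\theta_{m-1}$, and the coverage bound $S_1 \le (7/3)\epsilon^p$.

```python
import math

# Sample check of the proof's bookkeeping, assuming the choices
# delta_m = exp(p*((p+1)/(p+2))**(m-1)*log(eps)) and eps_m = eps*delta_m**(-1/(p+1)).
p, eps = 2, 1e-20
r = (p + 1) / (p + 2)
u = math.exp(-2 * (p + 1) ** 2 * (p + 2) * math.log(2))    # the constant u of (5)

def delta(m):
    return math.exp(p * r ** (m - 1) * math.log(eps))      # delta_1 = eps**p

def epsm(m):
    return eps * delta(m) ** (-1 / (p + 1))

def theta(m):
    return math.sqrt(eps * delta(m + 1) / (epsm(m) * delta(m)))

assert delta(1) < u                                        # eps is small enough
A = 1
while delta(A + 1) < u:                                    # A = largest m with delta_A < u
    A += 1

for m in range(1, A + 1):
    closed = math.exp(p * r ** m * math.log(eps) / (2 * (p + 1) ** 2))
    assert abs(theta(m) - closed) < 1e-12                  # closed form for theta_m
    assert theta(m) <= 1.0                                 # theta_m <= 1 since log(eps) <= 0
    if m >= 2:
        assert theta(m) >= 2 * theta(m - 1)                # geometric growth used above

S1 = delta(1) + sum(epsm(m) ** p * (delta(m + 1) - delta(m)) for m in range(1, A + 1))
assert S1 <= (7 / 3) * eps ** p                            # the coverage bound on S_1
```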
An exactly similar analysis can be done to cover the restrictions of the functions in $\mathcal{C}$ to the set $[v,1] \times [0,1]^{d-1}$, having the same coverage $S_1^{1/p}$ and the same cardinality bound $\exp(S_2)$. For $[u,v] \times [0,1]^{d-1}$, we note, by convexity, that the restrictions of functions in $\mathcal{C}$ to the set $[u,v] \times [0,1]^{d-1}$ belong to $\mathcal{C}([u,v] \times [0,1]^{d-1}; 1; 2/u, \infty, \ldots, \infty, \lambda_{k+1}, \ldots, \lambda_d)$. By the induction hypothesis, there exist constants $c$ and $\epsilon_0$, depending only on $d$ and $p$, such that for all $\epsilon \le \epsilon_0$, one can get an $\epsilon$-cover of $\mathcal{C}([u,v] \times [0,1]^{d-1}; 1; 2/u, \infty, \ldots, \infty, \lambda_{k+1}, \ldots, \lambda_d)$ in the $L_p$-metric having cardinality smaller than
$$\exp\left( c\, \epsilon^{-d/2} \left( 2 + \frac{2}{u} + \sum_{i>k} \lambda_i \right)^{d/2} \right) \le \exp\left( c \left( \frac{2}{u \epsilon} \right)^{d/2} \left( \sum_{i>k} \lambda_i + 2 \right)^{d/2} \right).$$
Observe that $u$ only depends on $p$. By combining the covers of the restrictions of functions in $\mathcal{C}$ to the three strips $[0,u] \times [0,1]^{d-1}$, $[u,v] \times [0,1]^{d-1}$ and $[v,1] \times [0,1]^{d-1}$, we obtain, for $\epsilon \le \epsilon_0$, a cover of $\mathcal{C}$ in the $L_p$-metric having coverage at most
$$\left( \frac{7}{3}\, \epsilon^p + \frac{7}{3}\, \epsilon^p + \epsilon^p \right)^{1/p} = \left( \frac{17}{3} \right)^{1/p} \epsilon.$$
By relabelling $(17/3)^{1/p} \epsilon$ as $\epsilon$, we have proved that for $\epsilon \le (3/17)^{1/p} \epsilon_0$,
$$\log M(\mathcal{C}; \epsilon; L_p) \le c \left( \frac{17}{3} \right)^{d/(2p)} \left( \frac{2^{d+1}}{2^d - 1} + \frac{2^{d/2}}{u^{d/2}} \right) \left( \sum_{i>k} \lambda_i + 2 \right)^{d/2} \epsilon^{-d/2}.$$
This proves (4) for all $\lambda_1, \ldots, \lambda_d$ such that exactly $k$ of them equal $\infty$. The proof is complete by induction.
Remark 3.3: The argument used in the induction step above involved splitting the interval $[0,1]$ into the three intervals $[0,u]$, $[u,v]$ and $[v,1]$, and then subsequently splitting the interval $[0,u]$ into smaller subintervals. We have borrowed this idea from Dryanov [2, Proof of Theorem 3.1]. We must mention, however, that Dryanov uses a more elaborate argument to bound sums of the form $S_1$ and $S_2$. Our way of controlling $S_1$ and $S_2$ is much simpler, which shortens the argument considerably.
B. Lower Bound for $M(\mathcal{C}([a,b]^d, B), \epsilon; L_p)$

Theorem 3.3: There exist positive constants $c$ and $\epsilon_0$, depending only on the dimension $d$, such that for every $p \ge 1$, $B > 0$ and $b > a$, we have
$$\log M\left( \mathcal{C}([a,b]^d, B), \epsilon; L_p \right) \ge c \left( \frac{B (b-a)^{d/p}}{\epsilon} \right)^{d/2},$$
for $\epsilon \le \epsilon_0 B (b-a)^{d/p}$.
Proof: As before, by the scaling identity (1), we take $a = 0$, $b = 1$ and $B = 1$. For functions defined on $[0,1]^d$, the $L_p$-metric, $p > 1$, is larger than $L_1$. We will thus take $p = 1$ in the rest of this proof. We prove that for $\epsilon$ sufficiently small, there exists an $\epsilon$-packing subset of $\mathcal{C}([0,1]^d, 1)$, under the $L_1$-metric, whose cardinality has logarithm larger than a constant multiple of $\epsilon^{-d/2}$. By an $\epsilon$-packing subset of $\mathcal{C}([0,1]^d, 1)$, we mean a subset $F$ satisfying $\|f - g\|_1 \ge \epsilon$ whenever $f, g \in F$ with $f \neq g$.

Fix $0 < \epsilon \le \left[ 4 (2 + \sqrt{d-1})^2 \right]^{-1}$ and let $k := k(\epsilon)$ be the positive integer satisfying
$$k \le \frac{\epsilon^{-1/2}}{2 + \sqrt{d-1}} < k + 1 \le 2k. \qquad (8)$$
Consider the intervals $I(i) = [u(i), v(i)]$ for $i = 1, \ldots, k$, such that
1) $0 \le u(1) < v(1) \le u(2) < v(2) \le \cdots \le u(k) < v(k) \le 1$,
2) $v(i) - u(i) = \sqrt{\epsilon}$, for $i = 1, \ldots, k$,
3) $u(i+1) - v(i) = \frac{1}{2} \sqrt{\epsilon (d-1)}$ for $i = 1, \ldots, k-1$.
Let $\mathcal{S}$ denote the set of all $d$-dimensional cubes of the form $I(i_1) \times \cdots \times I(i_d)$ where $i_1, \ldots, i_d \in \{1, \ldots, k\}$. The cardinality of $\mathcal{S}$, denoted by $|\mathcal{S}|$, is clearly $k^d$.
For each $S \in \mathcal{S}$ with $S = I(i_1) \times \cdots \times I(i_d)$, where $I(i_j) = [u(i_j), v(i_j)]$, let us define the function $h_S : [0,1]^d \to \mathbb{R}$ as
$$h_S(x) = h_S(x_1, \ldots, x_d) := \frac{1}{d} \sum_{j=1}^d \left[ -u(i_j) v(i_j) + \left( v(i_j) + u(i_j) \right) x_j \right] = f_0(x) + \frac{1}{d} \sum_{j=1}^d \left( x_j - u(i_j) \right) \left( v(i_j) - x_j \right), \qquad (9)$$
where $f_0(x) := \frac{1}{d} \left( x_1^2 + \cdots + x_d^2 \right)$, for $x \in [0,1]^d$. The functions $h_S$, $S \in \mathcal{S}$, have the following four key properties:
1) $h_S$ is affine and hence convex.
2) For every $x \in [0,1]^d$, we have $h_S(x) \le h_S(1, \ldots, 1) \le 1$.
3) For every $x \in S$, we have $h_S(x) \ge f_0(x)$. This is because whenever $x \in S$, we have $u(i_j) \le x_j \le v(i_j)$ for each $j$, which implies $(x_j - u(i_j))(v(i_j) - x_j) \ge 0$.
4) Let $S, S' \in \mathcal{S}$ with $S \neq S'$. For every $x \in S'$, we have $h_S(x) \le f_0(x)$. To see this, let $S' = I(i'_1) \times \cdots \times I(i'_d)$ with $I(i'_j) = [u(i'_j), v(i'_j)]$. Let $x \in S'$ and fix $1 \le j \le d$. If $I(i_j) = I(i'_j)$, then $x_j \in I(i_j) = [u(i_j), v(i_j)]$ and hence
$$\left( x_j - u(i_j) \right) \left( v(i_j) - x_j \right) \le \frac{\left( v(i_j) - u(i_j) \right)^2}{4} = \frac{\epsilon}{4}.$$
If $I(i_j) \neq I(i'_j)$ and $u(i_j) < v(i_j) \le u(i'_j) < v(i'_j)$, then
$$\left( x_j - u(i_j) \right) \left( v(i_j) - x_j \right) \le -\left( u(i'_j) - v(i_j) \right)^2 = -\frac{\epsilon (d-1)}{4}.$$
The same bound holds if $u(i'_j) < v(i'_j) \le u(i_j) < v(i_j)$. Because $S \neq S'$, at least one of the indices $i_j$ and $i'_j$ will be different. Consequently,
$$h_S(x) = f_0(x) + \frac{1}{d} \sum_j \left( x_j - u(i_j) \right) \left( v(i_j) - x_j \right) \le f_0(x) + \frac{1}{d} \left( \sum_{j : i_j = i'_j} \frac{\epsilon}{4} - \sum_{j : i_j \neq i'_j} \frac{\epsilon (d-1)}{4} \right) \le f_0(x).$$
Let $\{0,1\}^{\mathcal{S}}$ denote the collection of all $\{0,1\}$-valued functions on $\mathcal{S}$. The cardinality of $\{0,1\}^{\mathcal{S}}$ clearly equals $2^{|\mathcal{S}|}$ (recall that $|\mathcal{S}| = k^d$).
For each $\eta \in \{0,1\}^{\mathcal{S}}$, let
$$g_\eta(x) := \max\left( \max_{S \in \mathcal{S} : \eta(S) = 1} h_S(x),\; f_0(x) \right).$$
The first two properties of $h_S$, $S \in \mathcal{S}$, ensure that $g_\eta \in \mathcal{C}([0,1]^d, 1)$. The last two properties imply that
$$g_\eta(x) = h_S(x)\, \eta(S) + f_0(x) \left( 1 - \eta(S) \right) \quad \text{for } x \in S.$$
We now bound from below the $L_1$ distance between $g_\eta$ and $g_{\eta'}$ for $\eta, \eta' \in \{0,1\}^{\mathcal{S}}$. Because the interiors of the cubes in $\mathcal{S}$ are all disjoint, we can write
$$\|g_\eta - g_{\eta'}\|_1 \ge \sum_{S \in \mathcal{S}} \int_{x \in S} |g_\eta(x) - g_{\eta'}(x)| \, dx = \sum_{S \in \mathcal{S} : \eta(S) \neq \eta'(S)} \int_{x \in S} |h_S(x) - f_0(x)| \, dx.$$
Note that from (9) and by symmetry, the value of the integral
$$\chi := \int_{x \in S} |h_S(x) - f_0(x)| \, dx = \int_{x \in S} \frac{1}{d} \sum_{j=1}^d \left( x_j - u(i_j) \right) \left( v(i_j) - x_j \right) dx_d \ldots dx_1$$
is the same for all $S \in \mathcal{S}$. We have thus shown that
$$\|g_\eta - g_{\eta'}\|_1 \ge \chi\, \rho(\eta, \eta') \quad \text{for all } \eta, \eta' \in \{0,1\}^{\mathcal{S}}, \qquad (10)$$
where $\rho(\eta, \eta') := \sum_{S \in \mathcal{S}} 1\{\eta(S) \neq \eta'(S)\}$.
By the change of variable $y_j = (x_j - u(i_j))/(v(i_j) - u(i_j))$ for $j = 1, \ldots, d$, we get
$$\chi = \prod_{j=1}^d \left( v(i_j) - u(i_j) \right) \int_{[0,1]^d} \frac{1}{d} \sum_{j=1}^d \left( v(i_j) - u(i_j) \right)^2 y_j (1 - y_j) \, dy.$$
Recalling that $v(i) - u(i) = \sqrt{\epsilon}$ for all $i = 1, \ldots, k$, we get $\chi = \epsilon^{1 + d/2} \kappa_d$, where
$$\kappa_d := \int_{[0,1]^d} \frac{1}{d} \sum_{j=1}^d y_j (1 - y_j) \, dy.$$
Note that $\kappa_d$ is a constant that depends on the dimension $d$ alone. Thus, from (10), we deduce
$$\|g_\eta - g_{\eta'}\|_1 \ge \kappa_d\, \epsilon^{1 + d/2}\, \rho(\eta, \eta') \qquad (11)$$
for all $\eta, \eta' \in \{0,1\}^{\mathcal{S}}$. We now use the Varshamov-Gilbert lemma (see e.g., [18, Lemma 4.7]), which asserts the existence of a subset $W$ of $\{0,1\}^{\mathcal{S}}$ with cardinality $|W| \ge \exp(|\mathcal{S}|/8)$ such that $\rho(\eta, \eta') \ge |\mathcal{S}|/4$ for all $\eta, \eta' \in W$ with $\eta \neq \eta'$.
Thus, from (11) and (8), we get that for every $\eta, \eta' \in W$ with $\eta \neq \eta'$,
$$\|g_\eta - g_{\eta'}\|_1 \ge \kappa_d\, \epsilon^{1 + d/2}\, \frac{|\mathcal{S}|}{4} = \frac{\kappa_d}{4}\, \epsilon^{1 + d/2}\, k^d \ge c_1 \epsilon,$$
where $c_1 := \kappa_d / \left( 2^{d+2} (2 + \sqrt{d-1})^d \right)$; here we used $k \ge \epsilon^{-1/2} / (2 (2 + \sqrt{d-1}))$, which follows from (8). Taking $\tilde\epsilon := c_1 \epsilon$, we have obtained, for $\tilde\epsilon \le \epsilon_0 := c_1 / (4 (2 + \sqrt{d-1})^2)$, an $\tilde\epsilon$-packing subset of $\mathcal{C}([0,1]^d, 1)$ of size $M := |W|$, where
$$\log M \ge \frac{|\mathcal{S}|}{8} = \frac{k^d}{8} \ge \frac{\epsilon^{-d/2}}{8 \cdot 2^d (2 + \sqrt{d-1})^d} = \frac{c_1^{d/2}}{8 \cdot 2^d (2 + \sqrt{d-1})^d}\, \tilde\epsilon^{-d/2} = c\, \tilde\epsilon^{-d/2},$$
where $c$ depends only on the dimension $d$. This completes the proof.
Remark 3.4: The explicit packing subset constructed in the above proof consists of functions that can be viewed as perturbations of the quadratic function $f_0$. Previous lower bounds on the covering numbers of convex functions in [3, Proof of Theorem 6] and [2, Section 2] (for $d = 1$) are based on perturbations of a function whose graph is a subset of a sphere, a more complicated convex function than $f_0$. The perturbations of $f_0$ in the above proof can also be used to simplify the lower bound arguments in those papers.
IV. DISTANCES BETWEEN CONVEX FUNCTIONS AND THEIR EPIGRAPHS

One of the aims of this section is to provide the proof of Theorem 3.2. Our strategy for the proof of Theorem 3.2 is similar to Bronshtein's proof of the upper bound on $M(\mathcal{C}([a,b]^d, B, \Gamma), \epsilon; L_\infty)$: we relate distances between convex functions to the Hausdorff distance between their truncated epigraphs, and then appeal to a result of Bronshtein on covering numbers of convex sets. For $f \in \mathcal{C}([0,1]^d, B)$, let
$$V_f(B) := \left\{ (x, t) : x \in [0,1]^d,\; f(x) \le t \le B \right\}$$
denote the epigraph of $f$ truncated at level $B$; it is a compact convex subset of $\mathbb{R}^{d+1}$. The Hausdorff distance between two compact sets $C$ and $D$ is defined by
$$\ell_H(C, D) := \max\left( \sup_{x \in C} \inf_{y \in D} |x - y|,\; \sup_{x \in D} \inf_{y \in C} |x - y| \right),$$
where $| \cdot |$ denotes Euclidean distance.
Lemma 4.1: For every pair of functions $f$ and $g$ in $\mathcal{C}([0,1]^d; B; \lambda_1, \ldots, \lambda_d)$, we have
$$\|f - g\|_\infty \le \ell_H\left( V_f(B), V_g(B) \right) \sqrt{1 + \lambda_1^2 + \cdots + \lambda_d^2}.$$
Proof: We can clearly assume that $\lambda_i < \infty$ for all $i = 1, \ldots, d$. Fix $f, g \in \mathcal{C}([0,1]^d; B; \lambda_1, \ldots, \lambda_d)$ and let $\ell_H(V_f(B), V_g(B)) = \epsilon$. Fix $x \in [0,1]^d$ with $f(x) \neq g(x)$. Suppose, without loss of generality, that $f(x) < g(x)$. Now $(x, f(x)) \in V_f(B)$ and, because $\ell_H(V_f(B), V_g(B)) = \epsilon$, there exists $(x', y') \in V_g(B)$ with $|(x, f(x)) - (x', y')| \le \epsilon$. Because $f(x) < g(x)$, the point $(x, f(x))$ lies outside $V_g(B)$ and, using the convexity of $V_g(B)$, we can take $y' = g(x')$. Therefore,
$$0 \le g(x) - f(x) = g(x) - g(x') + g(x') - f(x) \le |x - x'| \sqrt{\lambda_1^2 + \cdots + \lambda_d^2} + |g(x') - f(x)| \le \sqrt{\lambda_1^2 + \cdots + \lambda_d^2 + 1}\, \sqrt{|x - x'|^2 + |g(x') - f(x)|^2} = \sqrt{\lambda_1^2 + \cdots + \lambda_d^2 + 1}\; |(x, f(x)) - (x', y')| \le \epsilon \sqrt{\lambda_1^2 + \cdots + \lambda_d^2 + 1},$$
where the second-to-last inequality follows from the Cauchy-Schwarz (CS) inequality. Lemma 4.1 now follows because $x \in [0,1]^d$ is arbitrary in the above argument.
The proof of Theorem 3.2, given below, is based on Lemma 4.1 and the following result on covering numbers of convex sets proved in [3]. For $\Lambda > 0$, let $\mathcal{K}_{d+1}(\Lambda)$ denote the set of all compact, convex subsets of the ball in $\mathbb{R}^{d+1}$ of radius $\Lambda$ centered at the origin. In Theorem 3 (and Remark 1) of [3], Bronshtein proved that there exist positive constants $c$ and $\epsilon_0$, depending only on $d$, such that
$$\log M\left( \mathcal{K}_{d+1}(\Lambda), \epsilon; \ell_H \right) \le c \left( \frac{\Lambda}{\epsilon} \right)^{d/2} \quad \text{for } \epsilon \le \Lambda \epsilon_0. \qquad (12)$$
A more detailed account of Bronshtein's proof of (12) can be found in Section 8.4 of [4].
Proof of Theorem 3.2: The conclusion of the theorem is clearly only meaningful in the case when $\lambda_i < \infty$ for all $i = 1, \ldots, d$. We therefore assume this in the rest of this proof. For every $f \in \mathcal{C}\left( \prod_{i=1}^d [a_i, b_i]; B; \lambda_1, \ldots, \lambda_d \right)$, let us define the function $\tilde f$ on $[0,1]^d$ by
$$\tilde f(t_1, \ldots, t_d) := f\left( a_1 + (b_1 - a_1) t_1, \ldots, a_d + (b_d - a_d) t_d \right),$$
for $t_1, t_2, \ldots, t_d \in [0,1]$. Clearly the function $\tilde f$ belongs to the class $\mathcal{C}\left( [0,1]^d; B; \lambda_1 (b_1 - a_1), \ldots, \lambda_d (b_d - a_d) \right)$, and covering $\tilde f$ to within $\epsilon$ in the $L_\infty$ metric is equivalent to covering $f$. Thus
$$M\left( \mathcal{C}(\textstyle\prod_i [a_i, b_i]; B; \lambda_1, \ldots, \lambda_d), \epsilon; L_\infty \right) = M\left( \mathcal{C}([0,1]^d; B; \lambda_1 (b_1 - a_1), \ldots, \lambda_d (b_d - a_d)), \epsilon; L_\infty \right). \qquad (13)$$
We thus take, without loss of generality, $a_i = 0$ and $b_i = 1$ for all $i = 1, \ldots, d$.
From Lemma 4.1 and the observation that $V_f(B) \in \mathcal{K}_{d+1}(\sqrt{d + B^2})$ for all $f \in \mathcal{C}([0,1]^d, B)$, it follows that
$$M\left( \mathcal{C}([0,1]^d; B; \lambda_1, \ldots, \lambda_d), \epsilon; L_\infty \right) \le M\left( \mathcal{K}_{d+1}(\sqrt{d + B^2}), \frac{\epsilon}{2 \sqrt{1 + \lambda_1^2 + \cdots + \lambda_d^2}}; \ell_H \right).$$
Thus from (12), we deduce the existence of two positive constants $c$ and $\epsilon_0$, depending only on $d$, such that
$$\log M\left( \mathcal{C}([0,1]^d; B; \lambda_1, \ldots, \lambda_d), \epsilon; L_\infty \right) \le c \left( \frac{\sqrt{(d + B^2)(1 + \lambda_1^2 + \cdots + \lambda_d^2)}}{\epsilon} \right)^{d/2},$$
if $\epsilon \le \epsilon_0 \sqrt{(d + B^2)(1 + \lambda_1^2 + \cdots + \lambda_d^2)}$. By the scaling identity (13), we obtain
$$\log M\left( \mathcal{C}(\textstyle\prod_i [a_i, b_i]; B; \lambda_1, \ldots, \lambda_d), \epsilon; L_\infty \right) \le c \left( \frac{\sqrt{(d + B^2) \left( 1 + \sum_i \lambda_i^2 (b_i - a_i)^2 \right)}}{\epsilon} \right)^{d/2}$$
if $\epsilon \le \epsilon_0 \sqrt{(d + B^2) \left( 1 + \sum_i \lambda_i^2 (b_i - a_i)^2 \right)}$. By another scaling argument, it follows that
$$M\left( \mathcal{C}(\textstyle\prod_i [a_i, b_i]; B; \lambda_1, \ldots, \lambda_d), \epsilon; L_\infty \right) = M\left( \mathcal{C}\left( \textstyle\prod_i [a_i, b_i]; \frac{B}{\gamma}; \frac{\lambda_1}{\gamma}, \ldots, \frac{\lambda_d}{\gamma} \right), \frac{\epsilon}{\gamma}; L_\infty \right)$$
for every $\gamma > 0$ and, as a consequence, we get, for every $\gamma > 0$,
$$\log M\left( \mathcal{C}(\textstyle\prod_i [a_i, b_i]; B; \lambda_1, \ldots, \lambda_d), \epsilon; L_\infty \right) \le c \left( \frac{\sqrt{(d \gamma^2 + B^2) \left( 1 + \sum_i \lambda_i^2 (b_i - a_i)^2 / \gamma^2 \right)}}{\epsilon} \right)^{d/2}$$
if $\epsilon \le \epsilon_0 \sqrt{(d \gamma^2 + B^2) \left( 1 + \sum_i \lambda_i^2 (b_i - a_i)^2 / \gamma^2 \right)}$. Choosing (by differentiation)
$$\gamma^4 = \frac{B^2 \sum_i \lambda_i^2 (b_i - a_i)^2}{d},$$
we finally deduce
$$\log M\left( \mathcal{C}(\textstyle\prod_i [a_i, b_i]; B; \lambda_1, \ldots, \lambda_d), \epsilon; L_\infty \right) \le c \left( \frac{B + \sqrt{d \sum_i \lambda_i^2 (b_i - a_i)^2}}{\epsilon} \right)^{d/2}$$
if $\epsilon \le \epsilon_0 \left( B + \sqrt{d \sum_i \lambda_i^2 (b_i - a_i)^2} \right)$. The proof of the theorem will now be complete by noting that
$$\sum_i \lambda_i^2 (b_i - a_i)^2 \le \left( \sum_i \lambda_i (b_i - a_i) \right)^2.$$
The terms involving $d$ can be absorbed in the constants $c$ and $\epsilon_0$.
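The effect of the choice of $\gamma$ can be verified directly: with $\gamma^4 = B^2 S / d$ and $S := \sum_i \lambda_i^2 (b_i - a_i)^2$, the product $(d\gamma^2 + B^2)(1 + S/\gamma^2)$ collapses to $(B + \sqrt{dS})^2$. A numeric spot-check with arbitrary sample constants:

```python
import math

# Arbitrary sample constants; S plays the role of sum_i lambda_i^2 (b_i - a_i)^2.
B = 2.5
lam = [1.0, 3.0, 0.5]
width = [2.0, 0.5, 4.0]
d = len(lam)

S = sum((l * w) ** 2 for l, w in zip(lam, width))
g2 = B * math.sqrt(S / d)                # gamma^2, from gamma^4 = B^2 * S / d
lhs = (d * g2 + B ** 2) * (1 + S / g2)
rhs = (B + math.sqrt(d * S)) ** 2
assert abs(lhs - rhs) < 1e-9             # the product collapses to a perfect square
```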
One might wonder whether a version of Lemma 4.1 can be proved for an $L_p$ metric instead of the $L_\infty$ metric, without any Lipschitz constraints. The following lemma provides such a result for $p = 1$.

Lemma 4.2: For every pair of functions $f$ and $g$ in $\mathcal{C}([0,1]^d, 1)$, we have
$$\|f - g\|_1 \le (1 + 20d)\, \ell_H\left( V_f(1), V_g(1) \right). \qquad (14)$$

Proof: Fix $f, g \in \mathcal{C}([0,1]^d, 1)$ and let $\ell_H(V_f(1), V_g(1)) = \epsilon > 0$. Our first step is to observe that
$$|f(x) - g(x)| \le \epsilon \left( 1 + |m_f(x)| + |m_g(x)| \right) \qquad (15)$$
for every $x \in (0,1)^d$, where $|m_f(x)|$ denotes the Euclidean norm of a subgradient vector $m_f(x) \in \mathbb{R}^d$ of $f$ at $x$.
To see this, fix $x \in (0,1)^d$ with $f(x) \neq g(x)$. We assume, without loss of generality, that $f(x) < g(x)$. Clearly $(x, f(x)) \in V_f(1)$ and, because $\ell_H(V_f(1), V_g(1)) = \epsilon$, there exists $(x', y') \in V_g(1)$ with $|(x, f(x)) - (x', y')| \le \epsilon$; as in the proof of Lemma 4.1, we can take $y' = g(x')$. By the definition of the subgradient,
$$g(x') \ge g(x) + \langle m_g(x), x' - x \rangle.$$
Therefore,
$$0 \le g(x) - f(x) = g(x) - g(x') + g(x') - f(x) \le \langle m_g(x), x - x' \rangle + |g(x') - f(x)| \le |m_g(x)| |x - x'| + |g(x') - f(x)| \le \sqrt{|m_g(x)|^2 + 1}\; |(x, f(x)) - (x', y')| \le \epsilon \sqrt{|m_g(x)|^2 + 1} \le \epsilon \left( 1 + |m_g(x)| \right).$$
Note that the Cauchy-Schwarz inequality has been used twice in the above chain of inequalities. We have thus shown that $g(x) - f(x) \le \epsilon (1 + |m_g(x)|)$ in the case when $f(x) < g(x)$. One would have a similar inequality in the case when $f(x) > g(x)$. Combining these two, we obtain (15).
As a consequence of (15), we get
$$\|f - g\|_1 = \int_{[0,1]^d \setminus [\epsilon, 1-\epsilon]^d} |f - g| + \int_{[\epsilon, 1-\epsilon]^d} |f - g| \le 2 \left( 1 - (1 - 2\epsilon)^d \right) + \epsilon \left( 1 + \int_{[\epsilon, 1-\epsilon]^d} |m_f(x)| \, dx + \int_{[\epsilon, 1-\epsilon]^d} |m_g(x)| \, dx \right) \le \epsilon \left( 1 + 4d + \int_{[\epsilon, 1-\epsilon]^d} \left( |m_f(x)| + |m_g(x)| \right) dx \right),$$
where we have used the inequality $(1 - 2\epsilon)^d \ge 1 - 2d\epsilon$.
To complete the proof of (14), we show that
$$\int_{[\epsilon, 1-\epsilon]^d} |m_f(x)| \, dx \le 8d \quad \text{for every } f \in \mathcal{C}([0,1]^d, 1).$$
We write $m_f(x) = (m_f(x)(1), \ldots, m_f(x)(d)) \in \mathbb{R}^d$ and use the definition of the subgradient to note that for every $x \in [\epsilon, 1-\epsilon]^d$ and $1 \le i \le d$,
$$f(x + t e_i) - f(x) \ge t\, m_f(x)(i) \qquad (16)$$
for $t \neq 0$ sufficiently small, where $e_i$ is the unit vector in the $i$th coordinate direction, i.e., $e_i(j) := 1$ if $i = j$ and $0$ otherwise. Dividing both sides by $t > 0$ and letting $t \downarrow 0$, we get $m_f(x)(i) \le f'(x; e_i)$ (we use $f'(x; v)$ to denote the directional derivative of $f$ in the direction $v$; directional derivatives exist as $f$ is convex). Using (16) for $t < 0$, we get $m_f(x)(i) \ge -f'(x; -e_i)$. Combining these two inequalities, we get
$$|m_f(x)(i)| \le |f'(x; e_i)| + |f'(x; -e_i)| \quad \text{for } i = 1, \ldots, d.$$
As a result,
$$\int_{[\epsilon, 1-\epsilon]^d} |m_f(x)| \, dx \le \sum_{i=1}^d \int_{[\epsilon, 1-\epsilon]^d} |m_f(x)(i)| \, dx \le \sum_{i=1}^d \left( \int_{[\epsilon, 1-\epsilon]^d} |f'(x; e_i)| \, dx + \int_{[\epsilon, 1-\epsilon]^d} |f'(x; -e_i)| \, dx \right).$$
We now show that, for each $i$, both the integrals $\int_{[\epsilon, 1-\epsilon]^d} |f'(x; e_i)|$ and $\int_{[\epsilon, 1-\epsilon]^d} |f'(x; -e_i)|$ are bounded
from above by 4. Assume, without loss of generality, that $i = 1$ and notice
$$\int_{[\epsilon, 1-\epsilon]^d} |f'(x; e_1)| \, dx = \int_{u \in [\epsilon, 1-\epsilon]^{d-1}} \left( \int_\epsilon^{1-\epsilon} |f'((x_1, u); e_1)| \, dx_1 \right) du. \qquad (17)$$
We fix $u = (x_2, \ldots, x_d) \in [\epsilon, 1-\epsilon]^{d-1}$ and focus on the inner integral. Let $v(z) := f(z, x_2, \ldots, x_d)$ for $z \in [0,1]$. Clearly $v$ is a convex function on $[0,1]$ and its right derivative $v'_r(x_1)$ at the point $z = x_1 \in (0,1)$ equals $f'(x; e_1)$, where $x = (x_1, \ldots, x_d)$. The inner integral thus equals $\int_\epsilon^{1-\epsilon} |v'_r(z)| \, dz$.
Because of the convexity of $v$, its right derivative $v'_r(z)$ is nondecreasing and satisfies
$$v(y_2) - v(y_1) = \int_{y_1}^{y_2} v'_r(z) \, dz \quad \text{for } 0 < y_1 < y_2 < 1.$$
Consequently,
$$\int_\epsilon^{1-\epsilon} |v'_r(z)| \, dz \le \sup_{\epsilon \le c \le 1-\epsilon} \left( -\int_\epsilon^c v'_r(z) \, dz + \int_c^{1-\epsilon} v'_r(z) \, dz \right) = \sup_{\epsilon \le c \le 1-\epsilon} \left( v(\epsilon) + v(1 - \epsilon) - 2 v(c) \right).$$
The function $v(\cdot)$ clearly satisfies $|v(z)| \le 1$ because $f \in \mathcal{C}([0,1]^d, 1)$. This implies that $\int_\epsilon^{1-\epsilon} |v'_r(z)| \, dz \le 4$. The
identity (17) therefore gives
$$\int_{[\epsilon, 1-\epsilon]^d} |f'(x; e_1)| \, dx = \int_{(x_2, \ldots, x_d) \in [\epsilon, 1-\epsilon]^{d-1}} \left( \int_\epsilon^{1-\epsilon} |v'_r(z)| \, dz \right) dx_2 \ldots dx_d \le 4.$$
Similarly, by working with left derivatives of $v$ as opposed to right, we can prove that $\int_{[\epsilon, 1-\epsilon]^d} |f'(x; -e_1)| \, dx \le 4$.
Therefore, the integral $\int_{[\epsilon, 1-\epsilon]^d} |m_f|$ is at most $8d$, because it is less than or equal to
$$\sum_{i=1}^d \left( \int_{[\epsilon, 1-\epsilon]^d} |f'(x; e_i)| \, dx + \int_{[\epsilon, 1-\epsilon]^d} |f'(x; -e_i)| \, dx \right).$$
This completes the proof of Lemma 4.2.
Remark 4.1: Lemma 4.2 is not true if $L_1$ is replaced by $L_p$, for $p > 1$. Indeed, take $d = 1$, $f \equiv 0$ and $g(x) := \max(0, 1 - x/\epsilon)$ for a small $\epsilon > 0$; then
$$\|f - g\|_p = \epsilon^{1/p} (1 + p)^{-1/p} \quad \text{and} \quad \ell_H\left( V_f(1), V_g(1) \right) = \frac{\epsilon}{\sqrt{1 + \epsilon^2}}.$$
As $\epsilon$ can be arbitrarily close to zero, this clearly rules out any inequality of the form (14) with the $L_1$ metric replaced by $L_p$, for $1 < p \le \infty$.
Remark 4.2: Lemma 4.2 and Bronshtein's result (12) can be used to give an alternative proof of Theorem 3.1 for the special case $p = 1$. Indeed, the scaling identity (1) lets us take $a = 0$, $b = 1$ and $B = 1$. Inequality (14) implies that the covering number $M\left( \mathcal{C}([0,1]^d, 1), \epsilon; L_1 \right)$ is less than or equal to
$$M\left( \mathcal{K}_{d+1}(\sqrt{d+1}), \frac{\epsilon}{2(1 + 20d)}; \ell_H \right).$$
Thus from (12), we deduce the existence of two positive constants $c$ and $\epsilon_0$, depending only on $d$, such that
$$\log M\left( \mathcal{C}([0,1]^d, 1), \epsilon; L_1 \right) \le c\, \epsilon^{-d/2}$$
whenever $\epsilon \le \epsilon_0$. Note that, by Remark 4.1, this method of proof does not work in the case of $L_p$, for $1 < p < \infty$.
REFERENCES
[1] A. N. Kolmogorov and V. M. Tihomirov, "$\epsilon$-entropy and $\epsilon$-capacity of sets in function spaces," Amer. Math. Soc. Transl. (2), vol. 17, pp. 277-364, 1961.
[2] D. Dryanov, "Kolmogorov entropy for classes of convex functions," Constructive Approximation, vol. 30, pp. 137-153, 2009.
[3] E. M. Bronshtein, "$\epsilon$-entropy of convex sets and functions," Siberian Mathematical Journal, vol. 17, pp. 393-398, 1976.
[4] R. M. Dudley, Uniform Central Limit Theorems. Cambridge University Press, 1999.
[5] L. Birgé, "Approximation dans les espaces métriques et théorie de l'estimation," Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, vol. 65, pp. 181-237, 1983.
[6] L. Le Cam, "Convergence of estimates under dimensionality restrictions," Annals of Statistics, vol. 1, pp. 38-53, 1973.
[7] Y. Yang and A. Barron, "Information-theoretic determination of minimax rates of convergence," Annals of Statistics, vol. 27, pp. 1564-1599, 1999.
[8] A. Guntuboyina, "Lower bounds for the minimax risk using f-divergences, and applications," IEEE Transactions on Information Theory, vol. 57, pp. 2386-2399, 2011.
[9] S. van de Geer, Applications of Empirical Process Theory. Cambridge University Press, 2000.
[10] L. Birgé and P. Massart, "Rates of convergence for minimum contrast estimators," Probability Theory and Related Fields, vol. 97, pp. 113-150, 1993.
[11] E. Seijo and B. Sen, "Nonparametric least squares estimation of a multivariate convex regression function," Annals of Statistics, vol. 39, pp. 1633-1657, 2011.
[12] L. A. Hannah and D. Dunson, "Bayesian nonparametric multivariate convex regression," 2011, submitted.
[13] A. Seregin and J. A. Wellner, "Nonparametric estimation of multivariate convex-transformed densities," Annals of Statistics, vol. 38, pp. 3751-3781, 2010.
[14] M. L. Cule, R. J. Samworth, and M. I. Stewart, "Maximum likelihood estimation of a multi-dimensional log-concave density (with discussion)," Journal of the Royal Statistical Society, Series B, vol. 72, pp. 545-600, 2010.
[15] L. Dümbgen, R. J. Samworth, and D. Schuhmacher, "Approximation by log-concave distributions with applications to regression," Annals of Statistics, vol. 39, pp. 702-730, 2011.
[16] A. Guntuboyina, "Optimal rates of convergence for the reconstruction of convex bodies from noisy support function measurements," Annals of Statistics, 2011, to appear.
[17] R. J. Gardner, M. Kiderlen, and P. Milanfar, "Convergence of algorithms for reconstructing convex bodies and directional measures," Annals of Statistics, vol. 34, pp. 1331-1374, 2006.
[18] P. Massart, Concentration Inequalities and Model Selection, Lecture Notes in Mathematics, vol. 1896. Berlin: Springer, 2007.