
arXiv:1204.0147v1 [cs.IT] 31 Mar 2012
Covering Numbers for Convex Functions
Adityanand Guntuboyina and Bodhisattva Sen
Abstract—In this paper we study the covering numbers of the space of convex and uniformly bounded functions in multi-dimension. We find optimal upper and lower bounds for the $\epsilon$-covering number of $\mathcal{C}([a,b]^d, B)$, in the $L_p$-metric, $1 \le p < \infty$, in terms of the relevant constants, where $d \ge 1$, $a < b \in \mathbb{R}$, $B > 0$, and $\mathcal{C}([a,b]^d, B)$ denotes the set of all convex functions on $[a,b]^d$ that are uniformly bounded by $B$. We summarize previously known results on covering numbers for convex functions and also provide alternate proofs of some known results. Our results have direct implications in the study of rates of convergence of empirical minimization procedures as well as optimal convergence rates in the numerous convexity constrained function estimation problems.

Index Terms—convexity constrained function estimation, empirical risk minimization, Hausdorff distance, Kolmogorov entropy, $L_p$-metric, metric entropy, packing numbers.
I. INTRODUCTION
Ever since the work of [1], covering numbers (and their logarithms, known as metric entropy numbers) have been studied extensively in a variety of disciplines. For a subset $T$ of a metric space $(\mathcal{A}, \rho)$, the $\epsilon$-covering number $M(T, \epsilon; \rho)$ is defined as the smallest number of balls of radius $\epsilon$ whose union contains $T$. Covering numbers capture the size of the underlying metric space and play a central role in a number of areas in information theory and statistics, including nonparametric function estimation, density estimation, empirical processes and machine learning.
In this paper we study the covering numbers of the space of convex and uniformly bounded functions in multi-dimension. Specifically, we find optimal upper and lower bounds for the $\epsilon$-covering number $M(\mathcal{C}([a,b]^d, B), \epsilon; L_p)$, in the $L_p$-metric, $1 \le p < \infty$, in terms of the relevant constants, where $d \ge 1$, $a < b \in \mathbb{R}$, $B > 0$, and $\mathcal{C}([a,b]^d, B)$ denotes the set of all convex functions on $[a,b]^d$ that are uniformly bounded by $B$. We also summarize previously known results on covering numbers for convex functions. The special case of the problem when $d = 1$ has been recently established by Dryanov in [2, Theorem 3.1]. Prior to [2], the only other result on the covering numbers of convex functions is due to Bronshtein in [3] (see also [4, Chapter 8]), who considered convex functions that are uniformly bounded and uniformly Lipschitz with a known Lipschitz constant under the $L_\infty$ metric.
In recent years there has been an upsurge of interest in nonparametric function estimation under convexity based constraints, especially in multi-dimension. In general function estimation, it is well-known (see e.g., [5]-[8]) that the covering numbers of the underlying function space can be used to characterize optimal rates of convergence. They are also useful for studying the rates of convergence of empirical minimization procedures (see e.g., [9], [10]). Our results have direct implications in this regard in the context of understanding the rates of convergence of the numerous convexity constrained function estimators, e.g., the nonparametric least squares estimator of a convex regression function studied in [11], [12], and the maximum likelihood estimator of a log-concave density in multi-dimension studied in [13]-[15]. Also, similar problems that crucially use convexity/concavity constraints to estimate sets have received recent attention in the statistical and machine learning literature, see e.g., [16], [17], and our results can be applied in such settings.

A. Guntuboyina is with the Department of Statistics, University of California, Berkeley, CA 94720 USA; e-mail: aditya@stat.berkeley.edu. B. Sen is with the Department of Statistics, Columbia University, New York, NY 10027 USA; e-mail: bodhi@stat.columbia.edu.
The paper is organized as follows. In Section II, we set
up notation and provide motivation for our main results,
which are proved in Section III. In Section IV, we draw
some connections to previous results on covering numbers for
convex functions and prove a related auxiliary result along
with some inequalities of possible independent interest.
II. MOTIVATION
The first result on covering numbers for convex functions was proved by Bronshtein in [3], who considered convex functions defined on a cube in $\mathbb{R}^d$ that are uniformly bounded and uniformly Lipschitz. Specifically, let $\mathcal{C}([a,b]^d, B, \Gamma)$ denote the class of real-valued convex functions defined on $[a,b]^d$ that are uniformly bounded in absolute value by $B$ and uniformly Lipschitz with constant $\Gamma$. In Theorem 6 of [3], Bronshtein proved that for $\epsilon$ sufficiently small, the logarithm of $M(\mathcal{C}([a,b]^d, B, \Gamma), \epsilon; L_\infty)$ can be bounded from above and below by a positive constant (not depending on $\epsilon$) multiple of $\epsilon^{-d/2}$. Note that the $L_\infty$ distance between two functions $f$ and $g$ on $[a,b]^d$ is defined as $\|f - g\|_\infty := \sup_{x \in [a,b]^d} |f(x) - g(x)|$.
Bronshtein worked with the class $\mathcal{C}([a,b]^d, B, \Gamma)$, where the functions are uniformly Lipschitz with constant $\Gamma$. However, in convexity-based function estimation problems, one usually does not have a known uniform Lipschitz bound on the unknown function class. This leads to difficulties in the analysis of empirical minimization procedures via Bronshtein's result. To the best of our knowledge, there does not exist any other result on the covering numbers of convex functions that deals with all $d \ge 1$ and does not require the Lipschitz constraint.

In the absence of the uniformly Lipschitz constraint (i.e., if one works with the class $\mathcal{C}([a,b]^d, B)$ instead of $\mathcal{C}([a,b]^d, B, \Gamma)$), the covering numbers under the $L_\infty$ metric are infinite. In other words, the space $\mathcal{C}([a,b]^d, B)$ is not totally bounded under the $L_\infty$ metric. This can be seen, for example,
by noting that the functions
$$f_j(t) := \max\left(0,\, 1 - 2^j t\right), \quad \text{for } t \in [0,1],$$
are in $\mathcal{C}([0,1], 1)$, for all $j \ge 1$, and satisfy
$$\|f_j - f_k\|_\infty \ge |f_j(2^{-k}) - f_k(2^{-k})| = 1 - 2^{j-k} \ge 1/2, \quad \text{for all } j < k.$$
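A small numerical illustration of this separation (ours, not part of the paper; plain Python with NumPy, and the loop ranges are arbitrary choices) evaluates each pair $(f_j, f_k)$ at the witness point $t = 2^{-k}$ used in the display above:

```python
import numpy as np

# f_j(t) = max(0, 1 - 2^j t): a "spike" at the origin that narrows as j grows.
def f(j, t):
    return np.maximum(0.0, 1.0 - 2.0**j * t)

# For j < k, the sup-norm separation is witnessed at t = 2^{-k},
# where f_k vanishes while f_j still equals 1 - 2^{j-k} >= 1/2.
for j in range(1, 20):
    for k in range(j + 1, 21):
        witness = 2.0 ** (-k)
        gap = f(j, witness) - f(k, witness)   # a lower bound on ||f_j - f_k||_inf
        assert gap == 1.0 - 2.0 ** (j - k) and gap >= 0.5
print("infinitely many functions, pairwise >= 1/2 apart: no finite L_inf cover")
```

Since the $f_j$ form an infinite $1/2$-separated family, no finite collection of $L_\infty$-balls of radius $1/4$ can cover the class.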
This motivated us to study the covering numbers of the class $\mathcal{C}([a,b]^d, B)$ under a different metric, namely the $L_p$-metric for $1 \le p < \infty$. We recall that under the $L_p$-metric, $1 \le p < \infty$, the distance between two functions $f$ and $g$ on $[a,b]^d$ is defined as
$$\|f - g\|_p := \left( \int_{x \in [a,b]^d} |f(x) - g(x)|^p \, dx \right)^{1/p}.$$
Our main result in this paper shows that if one works with the $L_p$-metric as opposed to $L_\infty$, then the covering numbers of $\mathcal{C}([a,b]^d, B)$ are finite. Moreover, they are bounded from above and below by constant multiples of $\epsilon^{-d/2}$ for sufficiently small $\epsilon$.
III. $L_p$ COVERING NUMBER BOUNDS FOR $\mathcal{C}([a,b]^d, B)$
In this section, we prove upper and lower bounds for the $\epsilon$-covering number of $\mathcal{C}([a,b]^d, B)$ under the $L_p$-metric, $1 \le p < \infty$. Let us start by noting a simple scaling identity that allows us to take $a = 0$, $b = 1$ and $B = 1$, without loss of generality. For each $f \in \mathcal{C}([a,b]^d, B)$, let us define $\tilde{f}$ on $[0,1]^d$ by $\tilde{f}(x) := f(a\mathbf{1} + (b-a)x)/B$, where $\mathbf{1} = (1, \dots, 1) \in \mathbb{R}^d$. Clearly $\tilde{f} \in \mathcal{C}([0,1]^d, 1)$ and, for $1 \le p < \infty$,
$$B^p \int_{x \in [0,1]^d} \left| \tilde{f}(x) - g(x) \right|^p dx = (b-a)^{-d} \int_{y \in [a,b]^d} \left| f(y) - B g\left( \frac{y - a\mathbf{1}}{b-a} \right) \right|^p dy,$$
for $g \in \mathcal{C}([0,1]^d, 1)$. It follows that covering $f$ to within $\epsilon$ in the $L_p$-metric on $[a,b]^d$ is equivalent to covering $\tilde{f}$ to within $\epsilon (b-a)^{-d/p}/B$ in the $L_p$-metric on $[0,1]^d$. Therefore, for $1 \le p < \infty$,
$$M(\mathcal{C}([a,b]^d, B), \epsilon; L_p) = M(\mathcal{C}([0,1]^d, 1), \tilde{\epsilon}; L_p), \qquad (1)$$
where $\tilde{\epsilon} := \epsilon (b-a)^{-d/p}/B$.
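As a quick sanity check of the change of variables behind (1) (our own illustration, not the authors'; the specific $f$, $g$ and the parameter values below are arbitrary choices), the following Python snippet confirms that the two integrands above agree sample-for-sample under the map $y = a\mathbf{1} + (b-a)x$:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, B, d, p = -1.0, 3.0, 2.0, 2, 2.0

f = lambda y: B * np.mean(((y - a) / (b - a)) ** 2, axis=-1)  # convex on [a,b]^d, |f| <= B
g = lambda x: np.mean(x, axis=-1)                             # convex on [0,1]^d, |g| <= 1
f_tilde = lambda x: f(a + (b - a) * x) / B                    # the rescaled function

x = rng.uniform(0.0, 1.0, size=(10**5, d))  # points in [0,1]^d
y = a + (b - a) * x                          # the same points mapped into [a,b]^d

lhs = B**p * np.abs(f_tilde(x) - g(x)) ** p           # integrand on [0,1]^d
rhs = np.abs(f(y) - B * g((y - a) / (b - a))) ** p    # integrand on [a,b]^d
assert np.allclose(lhs, rhs)  # pointwise equal; (b-a)^{-d} is just the Jacobian factor
print("scaling identity verified on", len(x), "sample points")
```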
A. Upper Bound for $M(\mathcal{C}([a,b]^d, B), \epsilon; L_p)$
Theorem 3.1: Fix $1 \le p < \infty$. There exist positive constants $c$ and $\epsilon_0$, depending only on the dimension $d$ and $p$, such that, for every $B > 0$ and $b > a$, we have
$$\log M\left( \mathcal{C}([a,b]^d, B), \epsilon; L_p \right) \le c \left( \frac{B (b-a)^{d/p}}{\epsilon} \right)^{d/2},$$
for every $\epsilon \le \epsilon_0\, B (b-a)^{d/p}$.
The main ingredient in our proof of the above theorem is an extension of Bronshtein's theorem to uniformly bounded convex functions having different Lipschitz constraints in different directions. Specifically, for $B \in (0, \infty)$, $\Gamma_i \in (0, \infty]$ and $a_i < b_i$ for $i = 1, \dots, d$, let $\mathcal{C}\left( \prod_{i=1}^d [a_i, b_i]; B; \Gamma_1, \dots, \Gamma_d \right)$ denote the set of all real-valued convex functions $f$ on the rectangle $[a_1, b_1] \times \dots \times [a_d, b_d]$ that are uniformly bounded by $B$ and satisfy
$$|f(x_1, \dots, x_{i-1}, x_i, x_{i+1}, \dots, x_d) - f(x_1, \dots, x_{i-1}, y_i, x_{i+1}, \dots, x_d)| \le \Gamma_i |x_i - y_i| \qquad (2)$$
for every $i = 1, \dots, d$; $x_i, y_i \in [a_i, b_i]$ and $x_j \in [a_j, b_j]$ for $j \ne i$. In other words, the function $x \mapsto f(x_1, \dots, x_{i-1}, x, x_{i+1}, \dots, x_d)$ is Lipschitz on $[a_i, b_i]$ with constant $\Gamma_i$ for all $x_j \in [a_j, b_j]$, $j \ne i$.

Clearly, the class $\mathcal{C}([a,b]^d, B, \Gamma)$ that Bronshtein studied is contained in $\mathcal{C}([a,b]^d; B; \Gamma, \dots, \Gamma)$. Also, it is easy to check that every function $f$ in $\mathcal{C}(\prod_i [a_i, b_i]; B; \Gamma_1, \dots, \Gamma_d)$ is Lipschitz with respect to the Euclidean norm on $\prod_i [a_i, b_i]$ with Lipschitz constant $\sqrt{\Gamma_1^2 + \dots + \Gamma_d^2}$.
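The Euclidean Lipschitz claim follows by applying (2) one coordinate at a time and then Cauchy-Schwarz. A quick numerical illustration (ours; the test function and the constants $\Gamma_i$ below are arbitrary members of the class, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
gammas = np.array([0.5, 2.0, 1.0])   # coordinatewise Lipschitz constants Gamma_i
d = len(gammas)

# f is convex with |partial_i f| <= Gamma_i, so it satisfies (2) in every direction.
f = lambda x: np.sum(gammas * np.abs(x - 0.5), axis=-1)

x, y = rng.uniform(0, 1, (10**5, d)), rng.uniform(0, 1, (10**5, d))
ratio = np.abs(f(x) - f(y)) / np.linalg.norm(x - y, axis=-1)
assert ratio.max() <= np.sqrt(np.sum(gammas**2)) + 1e-12
print("observed Lipschitz ratio %.4f <= sqrt(sum Gamma_i^2) = %.4f"
      % (ratio.max(), np.sqrt(np.sum(gammas**2))))
```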
Note that for $\Gamma_i = \infty$, the inequality (2) is satisfied by every function $f$. As a result, we have the equality $\mathcal{C}([a,b]^d, B) = \mathcal{C}([a,b]^d; B; \infty, \dots, \infty)$. The following result gives an upper bound for the $\epsilon$-covering number of $\mathcal{C}(\prod_i [a_i, b_i]; B; \Gamma_1, \dots, \Gamma_d)$ and is the main ingredient in the proof of Theorem 3.1. Its proof is similar to Bronshtein's proof [3, Proof of Theorem 6] of his upper bound on $\mathcal{C}([a,b]^d, B, \Gamma)$ and is included in Section IV.
Theorem 3.2: There exist positive constants $c$ and $\epsilon_0$, depending only on the dimension $d$, such that for every positive $B, \Gamma_1, \dots, \Gamma_d$ and rectangle $[a_1, b_1] \times \dots \times [a_d, b_d]$, we have
$$\log M\left( \mathcal{C}\left( \prod_{i=1}^d [a_i, b_i]; B; \Gamma_1, \dots, \Gamma_d \right), \epsilon; L_\infty \right) \le c \left( \frac{B + \sum_{i=1}^d \Gamma_i (b_i - a_i)}{\epsilon} \right)^{d/2}, \qquad (3)$$
for all $0 < \epsilon \le \epsilon_0 \left( B + \sum_{i=1}^d \Gamma_i (b_i - a_i) \right)$.
Remark 3.1: Note that the right hand side of (3) equals $\infty$ unless $\Gamma_i < \infty$ for all $i = 1, \dots, d$. Thus, Theorem 3.2 is only meaningful when $\Gamma_i < \infty$ for all $i = 1, \dots, d$.
Remark 3.2: Because $\mathcal{C}([a,b]^d, B, \Gamma)$ is contained in $\mathcal{C}([a,b]^d; B; \Gamma, \dots, \Gamma)$, Theorem 3.2 includes Bronshtein's upper bound on the covering numbers of $\mathcal{C}([a,b]^d, B, \Gamma)$ as a special case. Moreover, it gives the explicit dependence of the upper bound on the constants $a$, $b$, $B$ and $\Gamma$; Bronshtein did not state the dependence on these constants.
We are now ready to prove Theorem 3.1 using Theorem 3.2. Here is the intuition behind the proof. The class $\mathcal{C}([a,b]^d, B)$ can be thought of as an expansion of the class $\mathcal{C}([a,b]^d; B; \Gamma_1, \dots, \Gamma_d)$ formed by the removal of the $d$ Lipschitz constraints $\Gamma_1, \dots, \Gamma_d$ (or equivalently, by setting $\Gamma_1 = \dots = \Gamma_d = \infty$). Instead of removing all these $d$ Lipschitz constraints at the same time, we remove them sequentially, one at a time. This is formally accomplished by induction on the number of indices $i$ for which $\Gamma_i = \infty$. Each step of the induction argument focuses on the removal of one finite $\Gamma_i$ and is thus like solving a one-dimensional problem. We consequently use Dryanov's ideas from [2, Theorem 3.1] to solve this quasi one-dimensional problem, which allows us to complete the induction step.
Proof of Theorem 3.1: The scaling identity (1) lets us take $a = 0$, $b = 1$ and $B = 1$.
We shall prove that there exist positive constants $c$ and $\epsilon_0$, depending only on $d$ and $p$, such that for every $\Gamma_i \in (0, \infty]$, we have
$$\log M\left( \mathcal{C}\left( [0,1]^d; 1; \Gamma_1, \dots, \Gamma_d \right), \epsilon; L_p \right) \le c \left( \frac{2 + \sum_{i=1}^d \Gamma_i\, \mathbb{1}\{\Gamma_i < \infty\}}{\epsilon} \right)^{d/2}, \qquad (4)$$
for $0 < \epsilon \le \epsilon_0$. Note that this proves the theorem because we can set $\Gamma_i = \infty$ for all $i = 1, \dots, d$. Our proof will involve induction on $l$: the number of indices $i$ for which $\Gamma_i = \infty$.

For $l = 0$, i.e., when $\Gamma_i < \infty$ for all $i = 1, \dots, d$, (4) is a direct consequence of Theorem 3.2. In fact, in this case, (4) also holds for $p = \infty$. Suppose now that (4) holds for all $l < k$ for some $k \in \{1, \dots, d\}$. We shall then verify it for $l = k$. Fix $\Gamma_i \in (0, \infty]$ such that exactly $k$ of them equal infinity. Without loss of generality, we assume that $\Gamma_1 = \dots = \Gamma_k = \infty$ and $\Gamma_i < \infty$ for $i > k$. For every sufficiently small $\epsilon > 0$, we shall exhibit an $\epsilon$-cover of $\mathcal{C}([0,1]^d; 1; \infty, \dots, \infty, \Gamma_{k+1}, \dots, \Gamma_d)$ in the $L_p$-metric whose cardinality has logarithm bounded from above by a constant multiple of $\left( \sum_{i>k} \Gamma_i + 2 \right)^{d/2} \epsilon^{-d/2}$. Note that for $k = d$, the term $\sum_{i>k} \Gamma_i$ equals zero. For convenience, let us denote the class $\mathcal{C}([0,1]^d; 1; \infty, \dots, \infty, \Gamma_{k+1}, \dots, \Gamma_d)$ by $\mathcal{C}$ in the rest of this proof.
Let
$$u := \exp\left( -2(p+1)^2 (p+2) \log 2 \right) \quad \text{and} \quad v := 1 - u. \qquad (5)$$
Fix $\epsilon > 0$ and choose an integer $A$ and $\lambda_1, \dots, \lambda_{A+1}$ such that
$$\epsilon^p = \lambda_1 < \dots < \lambda_A < u \le \lambda_{A+1}.$$
For every two functions $f$ and $g$ on $[0,1]^d$, we can obviously decompose the integral $\int |f-g|^p$ as
$$\int_{[0,1]^d} |f-g|^p = \int_{[0,u] \times [0,1]^{d-1}} |f-g|^p + \int_{[u,v] \times [0,1]^{d-1}} |f-g|^p + \int_{[v,1] \times [0,1]^{d-1}} |f-g|^p.$$
Also,
$$\int_{[0,u] \times [0,1]^{d-1}} |f-g|^p \le \int_{[0,\lambda_1] \times [0,1]^{d-1}} |f-g|^p + \sum_{m=1}^{A} \int_{[\lambda_m, \lambda_{m+1}] \times [0,1]^{d-1}} |f-g|^p.$$
For a fixed $m = 1, \dots, A$, consider the problem of covering the functions in $\mathcal{C}$ on the rectangular strip $[\lambda_m, \lambda_{m+1}] \times [0,1]^{d-1}$. Clearly,
$$\int_{[\lambda_m, \lambda_{m+1}] \times [0,1]^{d-1}} |f-g|^p = (\lambda_{m+1} - \lambda_m) \int_{[0,1]^d} |\tilde{f} - \tilde{g}|^p \qquad (6)$$
where, for $x = (x_1, \dots, x_d) \in [0,1]^d$,
$$\tilde{f}(x) := f(\lambda_m + (\lambda_{m+1} - \lambda_m) x_1, x_2, \dots, x_d), \quad \text{and} \quad \tilde{g}(x) := g(\lambda_m + (\lambda_{m+1} - \lambda_m) x_1, x_2, \dots, x_d).$$
By convexity, the restriction of every function $f$ in $\mathcal{C}$ to $[\lambda_m, \lambda_{m+1}] \times [0,1]^{d-1}$ belongs to the class
$$\mathcal{C}\left( [\lambda_m, \lambda_{m+1}] \times [0,1]^{d-1};\, 1;\, 2/\lambda_m, \infty, \dots, \infty, \Gamma_{k+1}, \dots, \Gamma_d \right).$$
Consequently, the corresponding function $\tilde{f}$ belongs to $\mathcal{C}\left( [0,1]^d;\, 1;\, 2(\lambda_{m+1} - \lambda_m)/\lambda_m, \infty, \dots, \infty, \Gamma_{k+1}, \dots, \Gamma_d \right)$.
Because $2(\lambda_{m+1} - \lambda_m)/\lambda_m < \infty$, we can use the induction hypothesis to assert the existence of positive constants $\epsilon_0$ and $c$, depending only on $d$ and $p$, such that for every positive real number $\delta_m \le \epsilon_0$, there exists a $\delta_m$-cover of $\mathcal{C}([0,1]^d; 1; 2(\lambda_{m+1} - \lambda_m)/\lambda_m, \infty, \dots, \infty, \Gamma_{k+1}, \dots, \Gamma_d)$ in the $L_p$-metric on $[0,1]^d$ of size smaller than
$$\exp\left( c\, \delta_m^{-d/2} \left( 2 + \frac{2(\lambda_{m+1} - \lambda_m)}{\lambda_m} + \sum_{i>k} \Gamma_i \right)^{d/2} \right) \le \exp\left( c \left( 2 + \sum_{i>k} \Gamma_i \right)^{d/2} \left( \frac{\lambda_{m+1}}{\delta_m \lambda_m} \right)^{d/2} \right).$$
By covering the functions in $\mathcal{C}$ by the constant function $0$ on $[0, \lambda_1] \times [0,1]^{d-1}$ and up to $\delta_m$ in the $L_p$-metric on $[\lambda_m, \lambda_{m+1}] \times [0,1]^{d-1}$ for $m = 1, \dots, A$, we obtain a cover of the restriction of the functions in $\mathcal{C}$ to the set $[0,u] \times [0,1]^{d-1}$ in the $L_p$-metric having coverage $S_1^{1/p}$ and cardinality bounded from above by $\exp(S_2)$, where
$$S_1 := \lambda_1 + \sum_{m=1}^{A} \delta_m^p (\lambda_{m+1} - \lambda_m) \quad \text{and} \quad S_2 := c \left( \sum_{i>k} \Gamma_i + 2 \right)^{d/2} \sum_{m=1}^{A} \left( \frac{\lambda_{m+1}}{\delta_m \lambda_m} \right)^{d/2}. \qquad (7)$$
Suppose now that
$$\lambda_m := \exp\left( p \left( \frac{p+1}{p+2} \right)^{m-1} \log \epsilon \right) \quad \text{and} \quad \delta_m := \exp\left( p\, \frac{(p+1)^{m-2}}{(p+2)^{m-1}} \log \epsilon \right),$$
for $m = 1, \dots, A+1$, where $A$ is the largest integer such that
$$\exp\left( p \left( \frac{p+1}{p+2} \right)^{A-1} \log \epsilon \right) < u.$$
Then,
$$S_1 = \lambda_1 + \sum_{m=1}^{A} \delta_m^p (\lambda_{m+1} - \lambda_m) \le \lambda_1 + \sum_{m=1}^{A} \delta_m^p \lambda_{m+1} = \epsilon^p \left( 1 + \sum_{m=1}^{A} \eta_m^2 \right),$$
and
$$S_2 = c \left( \frac{\sum_{i>k} \Gamma_i + 2}{\epsilon} \right)^{d/2} \sum_{m=1}^{A} \eta_m^d,$$
where
$$\eta_m := \sqrt{\frac{\lambda_{m+1}}{\delta_m}} = \exp\left( \frac{p}{2(p+1)^2} \left( \frac{p+1}{p+2} \right)^{m} \log \epsilon \right).$$
Note that if $\epsilon \le 1$, then $\log \epsilon \le 0$, which implies $\eta_m \le 1$.
Also, for $m = 2, \dots, A$, we have
$$\frac{\eta_m}{\eta_{m-1}} = \exp\left( \frac{-p \log \epsilon}{2(p+1)^2 (p+2)} \left( \frac{p+1}{p+2} \right)^{m-1} \right) \ge \exp\left( \frac{-p \log \epsilon}{2(p+1)^2 (p+2)} \left( \frac{p+1}{p+2} \right)^{A-1} \right) = \exp\left( \frac{-\log \lambda_A}{2(p+1)^2 (p+2)} \right) > \exp\left( \frac{-\log u}{2(p+1)^2 (p+2)} \right) = 2,$$
where we have used $\lambda_A < u$ and the fact that $u$ has the expression (5). Therefore $\eta_m \ge 2 \eta_{m-1}$, which can be rewritten as
$$\eta_m^r \le \frac{2^r}{2^r - 1} \left( \eta_m^r - \eta_{m-1}^r \right) \quad \text{for every } r \ge 1.$$
Thus,
$$\sum_{m=1}^{A} \eta_m^r \le \eta_1^r + \frac{2^r}{2^r - 1} \sum_{m=2}^{A} \left( \eta_m^r - \eta_{m-1}^r \right) = \frac{1}{2^r - 1} \left( 2^r \eta_A^r - \eta_1^r \right) \le \frac{2^r}{2^r - 1}\, \eta_A^r.$$
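The geometric-growth bound above is elementary but easy to get wrong, so here is a small check (our sketch, not part of the paper; the random sequences below are arbitrary) that verifies it for sequences satisfying $\eta_m \ge 2\eta_{m-1}$:

```python
import numpy as np

rng = np.random.default_rng(2)
for trial in range(1000):
    A, r = rng.integers(1, 12), rng.integers(1, 6)
    # Build a positive sequence whose consecutive ratios are all >= 2.
    eta = np.cumprod(rng.uniform(2.0, 4.0, size=A)) * rng.uniform(1e-6, 1e-3)
    lhs = np.sum(eta**r)
    rhs = (2.0**r / (2.0**r - 1.0)) * eta[-1] ** r
    assert lhs <= rhs * (1 + 1e-12)
print("sum_m eta_m^r <= 2^r/(2^r - 1) * eta_A^r held in every trial")
```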
Using this for $r = 2$ and $r = d$, we deduce that
$$S_1 \le \frac{7}{3}\, \epsilon^p \quad \text{and} \quad S_2 \le \frac{2^d c}{2^d - 1} \left( \frac{\sum_{i>k} \Gamma_i + 2}{\epsilon} \right)^{d/2}.$$
An exactly similar analysis can be done to cover the restrictions of the functions in $\mathcal{C}$ to the set $[v,1] \times [0,1]^{d-1}$, having the same coverage $S_1^{1/p}$ and the same cardinality bounded by $\exp(S_2)$. For $[u,v] \times [0,1]^{d-1}$, we note, by convexity, that the restrictions of functions in $\mathcal{C}$ to the set $[u,v] \times [0,1]^{d-1}$ belong to $\mathcal{C}([u,v] \times [0,1]^{d-1}; 1; 2/u, \infty, \dots, \infty, \Gamma_{k+1}, \dots, \Gamma_d)$. By the induction hypothesis, there exist constants $c$ and $\epsilon_0$, depending only on $d$ and $p$, such that for all $\epsilon \le \epsilon_0$, one can get an $\epsilon$-cover of $\mathcal{C}([u,v] \times [0,1]^{d-1}; 1; 2/u, \infty, \dots, \infty, \Gamma_{k+1}, \dots, \Gamma_d)$ in the $L_p$-metric having cardinality smaller than
$$\exp\left( c\, \epsilon^{-d/2} \left( 2 + \frac{2}{u} + \sum_{i>k} \Gamma_i \right)^{d/2} \right) \le \exp\left( c \left( \frac{2}{u} \right)^{d/2} \left( \frac{\sum_{i>k} \Gamma_i + 2}{\epsilon} \right)^{d/2} \right).$$
Observe that $u$ only depends on $p$. By combining the covers of the restrictions of functions in $\mathcal{C}$ to these three strips $[0,u] \times [0,1]^{d-1}$, $[u,v] \times [0,1]^{d-1}$ and $[v,1] \times [0,1]^{d-1}$, we obtain, for $\epsilon \le \epsilon_0$, a cover of $\mathcal{C}$ in the $L_p$-metric having coverage at most
$$\left( \frac{7}{3}\, \epsilon^p + \frac{7}{3}\, \epsilon^p + \epsilon^p \right)^{1/p} = \left( \frac{17}{3} \right)^{1/p} \epsilon$$
and cardinality at most
$$\exp\left( c \left( \frac{2^{d+1}}{2^d - 1} + \frac{2^{d/2}}{u^{d/2}} \right) \left( \frac{\sum_{i>k} \Gamma_i + 2}{\epsilon} \right)^{d/2} \right).$$
By relabelling $(17/3)^{1/p} \epsilon$ as $\epsilon$, we have proved that for $\epsilon \le (3/17)^{1/p} \epsilon_0$,
$$\log M(\mathcal{C}; \epsilon; L_p) \le c \left( \frac{17}{3} \right)^{d/(2p)} \left( \frac{2^{d+1}}{2^d - 1} + \frac{2^{d/2}}{u^{d/2}} \right) \left( \frac{\sum_{i>k} \Gamma_i + 2}{\epsilon} \right)^{d/2}.$$
This proves (4) for all $\Gamma_1, \dots, \Gamma_d$ such that exactly $k$ of them equal $\infty$. The proof is complete by induction.
Remark 3.3: The argument used in the induction step above involved splitting the interval $[0,1]$ into the three intervals $[0,u]$, $[u,v]$ and $[v,1]$, and then subsequently splitting the interval $[0,u]$ into smaller subintervals. We have borrowed this idea from Dryanov [2, Proof of Theorem 3.1]. We must mention, however, that Dryanov uses a more elaborate argument to bound sums of the form $S_1$ and $S_2$. Our way of controlling $S_1$ and $S_2$ is much simpler, which shortens the argument considerably.
B. Lower Bound for $M(\mathcal{C}([a,b]^d, B), \epsilon; L_p)$
Theorem 3.3: There exist positive constants $c$ and $\epsilon_0$, depending only on the dimension $d$, such that for every $p \ge 1$, $B > 0$ and $b > a$, we have
$$\log M\left( \mathcal{C}([a,b]^d, B), \epsilon; L_p \right) \ge c \left( \frac{B (b-a)^{d/p}}{\epsilon} \right)^{d/2},$$
for $\epsilon \le \epsilon_0\, B (b-a)^{d/p}$.
Proof: As before, by the scaling identity (1), we take $a = 0$, $b = 1$ and $B = 1$. For functions defined on $[0,1]^d$, the $L_p$-metric, $p > 1$, is larger than $L_1$. We will thus take $p = 1$ in the rest of this proof. We prove that for $\epsilon$ sufficiently small, there exists an $\epsilon$-packing subset of $\mathcal{C}([0,1]^d, 1)$, under the $L_1$-metric, whose cardinality has logarithm larger than a constant multiple of $\epsilon^{-d/2}$. By a packing subset of $\mathcal{C}([0,1]^d, 1)$, we mean a subset $F$ satisfying $\|f - g\|_1 \ge \epsilon$ whenever $f, g \in F$ with $f \ne g$.

Fix $0 < \epsilon \le \frac{1}{4} (2 + \sqrt{d-1})^{-2}$ and let $k := k(\epsilon)$ be the positive integer satisfying
$$k \le \frac{\epsilon^{-1/2}}{2 + \sqrt{d-1}} < k+1 \le 2k. \qquad (8)$$
Consider the intervals $I(i) = [u(i), v(i)]$ for $i = 1, \dots, k$, such that
1) $0 \le u(1) < v(1) \le u(2) < v(2) \le \dots \le u(k) < v(k) \le 1$,
2) $v(i) - u(i) = \sqrt{\epsilon}$, for $i = 1, \dots, k$,
3) $u(i+1) - v(i) = \frac{1}{2} \sqrt{\epsilon (d-1)}$ for $i = 1, \dots, k-1$.
Let $\mathcal{S}$ denote the set of all $d$-dimensional cubes of the form $I(i_1) \times \dots \times I(i_d)$ where $i_1, \dots, i_d \in \{1, \dots, k\}$. The cardinality of $\mathcal{S}$, denoted by $|\mathcal{S}|$, is clearly $k^d$.
For each $S \in \mathcal{S}$ with $S = I(i_1) \times \dots \times I(i_d)$ where $I(i_j) = [u(i_j), v(i_j)]$, let us define the function $h_S : [0,1]^d \to \mathbb{R}$ as
$$h_S(x) = h_S(x_1, \dots, x_d) := \frac{1}{d} \sum_{j=1}^{d} \left[ -u(i_j) v(i_j) + \left( v(i_j) + u(i_j) \right) x_j \right] = f_0(x) + \frac{1}{d} \sum_{j=1}^{d} \left( x_j - u(i_j) \right) \left( v(i_j) - x_j \right), \qquad (9)$$
where $f_0(x) := \frac{1}{d} \left( x_1^2 + \dots + x_d^2 \right)$, for $x \in [0,1]^d$. The functions $h_S$, $S \in \mathcal{S}$, have the following four key properties:
1) $h_S$ is affine and hence convex.
2) For every $x \in [0,1]^d$, we have $h_S(x) \le h_S(1, \dots, 1) \le 1$.
3) For every $x \in S$, we have $h_S(x) \ge f_0(x)$. This is because whenever $x \in S$, we have $u(i_j) \le x_j \le v(i_j)$ for each $j$, which implies $(x_j - u(i_j))(v(i_j) - x_j) \ge 0$.
4) Let $S, S' \in \mathcal{S}$ with $S \ne S'$. For every $x \in S'$, we have $h_S(x) \le f_0(x)$. To see this, let $S' = I(i'_1) \times \dots \times I(i'_d)$ with $I(i'_j) = [u(i'_j), v(i'_j)]$. Let $x \in S'$ and fix $1 \le j \le d$. If $I(i_j) = I(i'_j)$, then $x_j \in I(i_j) = [u(i_j), v(i_j)]$ and hence
$$\left( x_j - u(i_j) \right) \left( v(i_j) - x_j \right) \le \frac{\left( v(i_j) - u(i_j) \right)^2}{4} = \frac{\epsilon}{4}.$$
If $I(i_j) \ne I(i'_j)$ and $u(i'_j) < v(i'_j) \le u(i_j) < v(i_j)$, then
$$\left( x_j - u(i_j) \right) \left( v(i_j) - x_j \right) \le -\left( u(i_j) - v(i'_j) \right)^2 \le -\frac{\epsilon (d-1)}{4}.$$
The same bound holds if $u(i_j) < v(i_j) \le u(i'_j) < v(i'_j)$. Because $S \ne S'$, at least one of $i_j$ and $i'_j$ will be different. Consequently,
$$h_S(x) = f_0(x) + \frac{1}{d} \sum_{j} \left( x_j - u(i_j) \right) \left( v(i_j) - x_j \right) \le f_0(x) + \frac{1}{d} \left[ \sum_{j : i_j = i'_j} \frac{\epsilon}{4} - \frac{\epsilon (d-1)}{4} \right] \le f_0(x).$$
Let $\{0,1\}^{\mathcal{S}}$ denote the collection of all $\{0,1\}$-valued functions on $\mathcal{S}$. The cardinality of $\{0,1\}^{\mathcal{S}}$ clearly equals $2^{|\mathcal{S}|}$ (recall that $|\mathcal{S}| = k^d$).

For each $\theta \in \{0,1\}^{\mathcal{S}}$, let
$$g_\theta(x) := \max\left( \max_{S \in \mathcal{S} : \theta(S) = 1} h_S(x),\; f_0(x) \right).$$
The first two properties of $h_S$, $S \in \mathcal{S}$, ensure that $g_\theta \in \mathcal{C}([0,1]^d, 1)$. The last two properties imply that
$$g_\theta(x) = h_S(x)\, \theta(S) + f_0(x) \left( 1 - \theta(S) \right) \quad \text{for } x \in S.$$
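To make the construction concrete, the following sketch (our illustration, not part of the proof; grid-based, so only approximate, and specialized to $d = 1$, where the gap in condition 3) vanishes) builds the perturbations $g_\theta$ and compares the $L_1$ distance of two random sign patterns with the bound computed below in (10) and (11):

```python
import numpy as np

eps = 1e-4
k = int(eps**-0.5 / 2)                 # a choice consistent with (8) when d = 1
u = np.arange(k) * np.sqrt(eps)        # left endpoints u(i); intervals touch since sqrt(d-1)=0
v = u + np.sqrt(eps)                   # right endpoints v(i), so v(i) - u(i) = sqrt(eps)

f0 = lambda x: x**2                                  # base parabola f_0 for d = 1
h = lambda i, x: f0(x) + (x - u[i]) * (v[i] - x)     # affine bump h_S from (9)

x = np.linspace(0.0, 1.0, 200001)
def g(theta):                                        # g_theta = max(f_0, chosen bumps)
    vals = f0(x)
    for i in np.flatnonzero(theta):
        vals = np.maximum(vals, h(i, x))
    return vals

rng = np.random.default_rng(3)
th1, th2 = rng.integers(0, 2, k), rng.integers(0, 2, k)
l1 = np.mean(np.abs(g(th1) - g(th2)))                # Riemann approximation of the L1 norm
mu = eps**1.5 / 6                                    # mu = kappa_1 * eps^{(1/2)+1}, kappa_1 = 1/6
print(l1, mu * np.sum(th1 != th2))                   # the two agree up to grid error
```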
We now bound from below the $L_1$ distance between $g_\theta$ and $g_{\theta'}$ for $\theta, \theta' \in \{0,1\}^{\mathcal{S}}$. Because the interiors of the cubes in $\mathcal{S}$ are all disjoint, we can write
$$\|g_\theta - g_{\theta'}\|_1 \ge \sum_{S \in \mathcal{S}} \int_{x \in S} |g_\theta(x) - g_{\theta'}(x)|\, dx = \sum_{S \in \mathcal{S} : \theta(S) \ne \theta'(S)} \int_{x \in S} |h_S(x) - f_0(x)|\, dx.$$
Note that from (9) and by symmetry, the value of the integral
$$\mu := \int_{x \in S} |h_S(x) - f_0(x)|\, dx$$
is the same for all $S \in \mathcal{S}$. We have thus shown that
$$\|g_\theta - g_{\theta'}\|_1 \ge \mu\, \varrho(\theta, \theta') \quad \text{for all } \theta, \theta' \in \{0,1\}^{\mathcal{S}}, \qquad (10)$$
where $\varrho(\theta, \theta') := \sum_{S \in \mathcal{S}} \mathbb{1}\{\theta(S) \ne \theta'(S)\}$ denotes the Hamming distance.
The quantity $\mu$ can be computed in the following way. Let $S = I(i_1) \times \dots \times I(i_d)$ where $I(i_j) = [u(i_j), v(i_j)]$. We write
$$\mu = \int_{u(i_1)}^{v(i_1)} \dots \int_{u(i_d)}^{v(i_d)} \frac{1}{d} \sum_{j=1}^{d} \left( x_j - u(i_j) \right) \left( v(i_j) - x_j \right) dx_d \dots dx_1.$$
By the change of variable $y_j = \left( x_j - u(i_j) \right) / \left( v(i_j) - u(i_j) \right)$ for $j = 1, \dots, d$, we get
$$\mu = \prod_{j=1}^{d} \left( v(i_j) - u(i_j) \right) \int_{[0,1]^d} \frac{1}{d} \sum_{j=1}^{d} \left( v(i_j) - u(i_j) \right)^2 y_j (1 - y_j)\, dy.$$
Recalling that $v(i) - u(i) = \sqrt{\epsilon}$ for all $i = 1, \dots, k$, we get $\mu = \epsilon^{(d/2)+1} \kappa_d$, where
$$\kappa_d := \int_{[0,1]^d} \frac{1}{d} \sum_{j=1}^{d} y_j (1 - y_j)\, dy.$$
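As a side check (ours, not in the paper), $\kappa_d$ is explicit: each coordinate contributes $\int_0^1 y(1-y)\,dy = 1/6$, so the average equals $1/6$ for every $d$. A two-line Monte Carlo confirmation:

```python
import numpy as np

d, rng = 5, np.random.default_rng(4)
y = rng.uniform(0.0, 1.0, size=(10**6, d))
kappa_d = np.mean(np.mean(y * (1.0 - y), axis=1))  # E[(1/d) sum_j y_j (1 - y_j)] = 1/6
print(kappa_d)  # approximately 0.1667, for any choice of d
```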
Note that $\kappa_d$ is a constant that depends on the dimension $d$ alone. Thus, from (10), we deduce
$$\|g_\theta - g_{\theta'}\|_1 \ge \kappa_d\, \epsilon^{(d/2)+1}\, \varrho(\theta, \theta') \qquad (11)$$
for all $\theta, \theta' \in \{0,1\}^{\mathcal{S}}$. We now use the Varshamov-Gilbert lemma (see e.g., [18, Lemma 4.7]), which asserts the existence of a subset $W$ of $\{0,1\}^{\mathcal{S}}$ with cardinality $|W| \ge \exp(|\mathcal{S}|/8)$ such that $\varrho(\theta, \theta') \ge |\mathcal{S}|/4$ for all $\theta, \theta' \in W$ with $\theta \ne \theta'$.
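The Varshamov-Gilbert lemma is nonconstructive, but a brute-force greedy search illustrates it (our sketch, feasible only for a tiny $N = |\mathcal{S}|$ of our choosing): repeatedly keep any remaining string that is at Hamming distance at least $N/4$ from everything kept so far.

```python
import numpy as np
from itertools import product

N = 12                                   # plays the role of |S|; brute force, so keep it tiny
codebook = []
for cand in product((0, 1), repeat=N):
    cand = np.array(cand)
    if all(np.sum(cand != w) >= N / 4 for w in codebook):
        codebook.append(cand)            # greedy: keep anything N/4-separated from the rest
print(len(codebook), ">=", int(np.exp(N / 8)))  # the greedy codebook already exceeds exp(N/8)
```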
Thus, from (11) and (8), we get that for every $\theta, \theta' \in W$ with $\theta \ne \theta'$,
$$\|g_\theta - g_{\theta'}\|_1 \ge \kappa_d\, \epsilon^{(d/2)+1}\, \frac{|\mathcal{S}|}{4} = \frac{\kappa_d}{4}\, \epsilon^{(d/2)+1}\, k^d \ge c_1 \epsilon,$$
where $c_1 := \frac{\kappa_d}{4} \left[ 2(2 + \sqrt{d-1}) \right]^{-d}$, using $k > \epsilon^{-1/2}/\left[2(2+\sqrt{d-1})\right]$ from (8). Taking $\delta := c_1 \epsilon$, we have obtained, for $\delta \le \delta_0 := \frac{c_1}{4} (2 + \sqrt{d-1})^{-2}$, a $\delta$-packing subset of $\mathcal{C}([0,1]^d, 1)$ of size $M := |W|$, where
$$\log M \ge \frac{|\mathcal{S}|}{8} = \frac{k^d}{8} > \frac{\epsilon^{-d/2}}{8 \left[ 2(2 + \sqrt{d-1}) \right]^d} = \frac{c_1^{d/2}}{8 \left[ 2(2 + \sqrt{d-1}) \right]^d}\, \delta^{-d/2} = c\, \delta^{-d/2},$$
where $c$ depends only on the dimension $d$. This completes the proof.
Remark 3.4: The explicit packing subset constructed in the above proof consists of functions that can be viewed as perturbations of the quadratic function $f_0$. Previous lower bounds on the covering numbers of convex functions in [3, Proof of Theorem 6] and [2, Section 2] (for $d = 1$) are based on perturbations of a function whose graph is a subset of a sphere, a more complicated convex function than $f_0$. The perturbations of $f_0$ in the above proof can also be used to simplify the lower bound arguments in those papers.
IV. DISTANCES BETWEEN CONVEX FUNCTIONS AND THEIR EPIGRAPHS
One of the aims of this section is to provide the proof of Theorem 3.2. Our strategy for the proof of Theorem 3.2 is similar to Bronshtein's proof of the upper bound on $M(\mathcal{C}([a,b]^d, B, \Gamma), \epsilon; L_\infty)$. The proof involves the following ingredients:
1) An inequality between the $L_\infty$ distance between two convex functions and the Hausdorff distance between their epigraphs.
2) The result of Bronshtein [3] for the covering numbers of convex sets in the Hausdorff metric.
For a convex function $f$ on $[0,1]^d$ and $B > 0$, let us define the epigraph $V_f(B)$ of $f$ by
$$V_f(B) := \left\{ (x_1, \dots, x_d, x_{d+1}) : (x_1, \dots, x_d) \in [0,1]^d \text{ and } f(x_1, \dots, x_d) \le x_{d+1} \le B \right\}.$$
If $f \in \mathcal{C}([0,1]^d, B)$, then clearly
$$x_1^2 + \dots + x_d^2 + x_{d+1}^2 \le 1 + \dots + 1 + B^2 = d + B^2$$
for every $(x_1, \dots, x_{d+1}) \in V_f(B)$. Therefore, for every $f \in \mathcal{C}([0,1]^d, B)$, its epigraph $V_f(B)$ is contained in the $(d+1)$-dimensional ball of radius $\sqrt{d + B^2}$ centered at the origin.
The following inequality relates the $L_\infty$ distance between two functions in $\mathcal{C}([0,1]^d; B; \Gamma_1, \dots, \Gamma_d)$ to the Hausdorff distance between their epigraphs. The Hausdorff distance between two compact, convex sets $C$ and $D$ in Euclidean space is defined by
$$\ell_H(C, D) := \max\left( \sup_{x \in C} \inf_{y \in D} |x - y|,\; \sup_{x \in D} \inf_{y \in C} |x - y| \right),$$
where $|\cdot|$ denotes Euclidean distance.
Lemma 4.1: For every pair of functions $f$ and $g$ in $\mathcal{C}([0,1]^d; B; \Gamma_1, \dots, \Gamma_d)$, we have
$$\|f - g\|_\infty \le \ell_H\left( V_f(B), V_g(B) \right) \sqrt{1 + \Gamma_1^2 + \dots + \Gamma_d^2}.$$
Proof: We can clearly assume that $\Gamma_i < \infty$ for all $i = 1, \dots, d$. Fix $f, g \in \mathcal{C}([0,1]^d; B; \Gamma_1, \dots, \Gamma_d)$ and let $\ell_H(V_f(B), V_g(B)) = \delta$. Fix $x \in [0,1]^d$ with $f(x) \ne g(x)$. Suppose, without loss of generality, that $f(x) < g(x)$. Now $(x, f(x)) \in V_f(B)$ and because $\ell_H(V_f(B), V_g(B)) = \delta$, there exists $(x', y') \in V_g(B)$ with $|(x, f(x)) - (x', y')| \le \delta$. Because $f(x) < g(x)$, the point $(x, f(x))$ lies outside $V_g(B)$ and using the convexity of $V_g(B)$ we can take $y' = g(x')$. Therefore,
$$0 \le g(x) - f(x) = g(x) - g(x') + g(x') - f(x) \le |x - x'| \sqrt{\Gamma_1^2 + \dots + \Gamma_d^2} + |g(x') - f(x)| \le \sqrt{\Gamma_1^2 + \dots + \Gamma_d^2 + 1} \sqrt{|x - x'|^2 + |g(x') - f(x)|^2} = \sqrt{\Gamma_1^2 + \dots + \Gamma_d^2 + 1}\; |(x, f(x)) - (x', y')| \le \delta \sqrt{\Gamma_1^2 + \dots + \Gamma_d^2 + 1},$$
where the second-to-last inequality follows from the Cauchy-Schwarz inequality. Lemma 4.1 now follows because $x \in [0,1]^d$ is arbitrary in the above argument.
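A numerical illustration of Lemma 4.1 (ours, not part of the paper; it relies on the standard fact that for compact convex sets the Hausdorff distance equals the sup-norm distance between their support functions, approximated below over a finite grid of directions, and the functions $f$, $g$ are arbitrary members of the class with $d = 1$, $B = \Gamma_1 = 1$):

```python
import numpy as np

B, Gamma = 1.0, 1.0
f = lambda x: np.abs(x - 0.3)          # convex, |f| <= B, Lipschitz constant <= Gamma
g = lambda x: 0.8 * (x - 0.5) ** 2     # convex, |g| <= B, Lipschitz constant <= Gamma

xs = np.linspace(0.0, 1.0, 1001)
def epigraph_vertices(fn):             # a dense point cloud whose convex hull is V_fn(B)
    lower = np.column_stack([xs, fn(xs)])
    return np.vstack([lower, [[0.0, B], [1.0, B]]])

# For compact convex C, D: l_H(C, D) = sup over unit u of |h_C(u) - h_D(u)|,
# where h_C(u) = max_{z in C} <z, u> is the support function.
angles = np.linspace(0.0, 2 * np.pi, 2001)
dirs = np.column_stack([np.cos(angles), np.sin(angles)])
hf = (epigraph_vertices(f) @ dirs.T).max(axis=0)
hg = (epigraph_vertices(g) @ dirs.T).max(axis=0)
hausdorff = np.max(np.abs(hf - hg))

sup_dist = np.max(np.abs(f(xs) - g(xs)))
print(sup_dist, hausdorff * np.sqrt(1.0 + Gamma**2))  # first <= second, up to grid error
```

For this pair the inequality is essentially tight, since the largest gap occurs where the steeper function has slope equal to $\Gamma$.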
The proof of Theorem 3.2, given below, is based on Lemma 4.1 and the following result on covering numbers of convex sets proved in [3]. For $\Lambda > 0$, let $\mathcal{K}_{d+1}(\Lambda)$ denote the set of all compact, convex subsets of the ball in $\mathbb{R}^{d+1}$ of radius $\Lambda$ centered at the origin. In Theorem 3 (and Remark 1) of [3], Bronshtein proved that there exist positive constants $c$ and $\epsilon_0$, depending only on $d$, such that
$$\log M\left( \mathcal{K}_{d+1}(\Lambda), \epsilon; \ell_H \right) \le c \left( \frac{\Lambda}{\epsilon} \right)^{d/2} \quad \text{for } \epsilon \le \epsilon_0 \Lambda. \qquad (12)$$
A more detailed account of Bronshtein's proof of (12) can be found in Section 8.4 of [4].
Proof of Theorem 3.2: The conclusion of the theorem is clearly only meaningful in the case when $\Gamma_i < \infty$ for all $i = 1, \dots, d$. We therefore assume this in the rest of this proof. For every $f \in \mathcal{C}\left( \prod_{i=1}^d [a_i, b_i]; B; \Gamma_1, \dots, \Gamma_d \right)$, let us define the function $\tilde{f}$ on $[0,1]^d$ by
$$\tilde{f}(t_1, \dots, t_d) := f\left( a_1 + (b_1 - a_1) t_1, \dots, a_d + (b_d - a_d) t_d \right),$$
for $t_1, t_2, \dots, t_d \in [0,1]$. Clearly the function $\tilde{f}$ belongs to the class $\mathcal{C}\left( [0,1]^d; B; \Gamma_1 (b_1 - a_1), \dots, \Gamma_d (b_d - a_d) \right)$ and covering $\tilde{f}$ to within $\epsilon$ in the $L_\infty$-metric is equivalent to covering $f$. Thus
$$M\left( \mathcal{C}(\textstyle\prod_i [a_i, b_i]; B; \Gamma_1, \dots, \Gamma_d), \epsilon; L_\infty \right) = M\left( \mathcal{C}([0,1]^d; B; \Gamma_1 (b_1 - a_1), \dots, \Gamma_d (b_d - a_d)), \epsilon; L_\infty \right). \qquad (13)$$
We thus take, without loss of generality, $a_i = 0$ and $b_i = 1$ for all $i = 1, \dots, d$.
From Lemma 4.1 and the observation that $V_f(B) \in \mathcal{K}_{d+1}(\sqrt{d + B^2})$ for all $f \in \mathcal{C}([0,1]^d, B)$, it follows that
$$M\left( \mathcal{C}([0,1]^d; B; \Gamma_1, \dots, \Gamma_d), \epsilon; L_\infty \right) \le M\left( \mathcal{K}_{d+1}(\sqrt{d + B^2}),\; \frac{\epsilon}{2\sqrt{1 + \Gamma_1^2 + \dots + \Gamma_d^2}};\; \ell_H \right).$$
Thus from (12), we deduce the existence of two positive constants $c$ and $\epsilon_0$, depending only on $d$, such that
$$\log M\left( \mathcal{C}([0,1]^d; B; \Gamma_1, \dots, \Gamma_d), \epsilon; L_\infty \right) \le c \left( \frac{\sqrt{(d + B^2)(1 + \Gamma_1^2 + \dots + \Gamma_d^2)}}{\epsilon} \right)^{d/2},$$
if $\epsilon \le \epsilon_0 \sqrt{(d + B^2)(1 + \Gamma_1^2 + \dots + \Gamma_d^2)}$. By the scaling identity (13), we obtain
$$\log M\left( \mathcal{C}(\textstyle\prod_i [a_i, b_i]; B; \Gamma_1, \dots, \Gamma_d), \epsilon; L_\infty \right) \le c \left( \frac{\sqrt{(d + B^2)\left( 1 + \sum_i \Gamma_i^2 (b_i - a_i)^2 \right)}}{\epsilon} \right)^{d/2}$$
if $\epsilon \le \epsilon_0 \sqrt{(d + B^2)\left( 1 + \sum_i \Gamma_i^2 (b_i - a_i)^2 \right)}$.
By another scaling argument, it follows that
$$M\left( \mathcal{C}(\textstyle\prod_i [a_i, b_i]; B; \Gamma_1, \dots, \Gamma_d), \epsilon; L_\infty \right) = M\left( \mathcal{C}\left( \textstyle\prod_i [a_i, b_i]; \frac{B}{\gamma}; \frac{\Gamma_1}{\gamma}, \dots, \frac{\Gamma_d}{\gamma} \right), \frac{\epsilon}{\gamma}; L_\infty \right)$$
for every $\gamma > 0$ and, as a consequence, we get, for every $\gamma > 0$,
$$\log M\left( \mathcal{C}(\textstyle\prod_i [a_i, b_i]; B; \Gamma_1, \dots, \Gamma_d), \epsilon; L_\infty \right) \le c \left( \frac{\sqrt{(d \gamma^2 + B^2)\left( 1 + \sum_i \Gamma_i^2 (b_i - a_i)^2 / \gamma^2 \right)}}{\epsilon} \right)^{d/2},$$
if $\epsilon \le \epsilon_0 \sqrt{(d \gamma^2 + B^2)\left( 1 + \sum_i \Gamma_i^2 (b_i - a_i)^2 / \gamma^2 \right)}$. Choosing (by differentiation)
$$\gamma^4 = \frac{B^2 \sum_i \Gamma_i^2 (b_i - a_i)^2}{d},$$
we finally deduce
$$\log M\left( \mathcal{C}(\textstyle\prod_i [a_i, b_i]; B; \Gamma_1, \dots, \Gamma_d), \epsilon; L_\infty \right) \le c \left( \frac{B + \sqrt{d \sum_i \Gamma_i^2 (b_i - a_i)^2}}{\epsilon} \right)^{d/2}$$
if $\epsilon \le \epsilon_0 \left( B + \sqrt{d \sum_i \Gamma_i^2 (b_i - a_i)^2} \right)$. The proof of the theorem will now be complete by noting that
$$\sqrt{\sum_i \Gamma_i^2 (b_i - a_i)^2} \le \sum_i \Gamma_i (b_i - a_i) \le \sqrt{d} \sqrt{\sum_i \Gamma_i^2 (b_i - a_i)^2}.$$
The terms involving $d$ can be absorbed in the constants $c$ and $\epsilon_0$.
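The "choosing by differentiation" step can be verified symbolically (our check, not in the paper, using SymPy; $S$ abbreviates $\sum_i \Gamma_i^2 (b_i - a_i)^2$): minimizing $(d\gamma^2 + B^2)(1 + S/\gamma^2)$ over $\gamma > 0$ gives $\gamma^4 = B^2 S / d$ and the minimum value $(B + \sqrt{dS})^2$, which is exactly the square of the numerator in the final bound.

```python
import sympy as sp

gamma, B, S, d = sp.symbols("gamma B S d", positive=True)
expr = (d * gamma**2 + B**2) * (1 + S / gamma**2)

crit = sp.solve(sp.diff(expr, gamma), gamma)   # stationary points with gamma > 0
gamma_star = crit[0]
assert sp.simplify(gamma_star**4 - B**2 * S / d) == 0     # gamma^4 = B^2 S / d

minimum = sp.simplify(expr.subs(gamma, gamma_star))
assert sp.simplify(minimum - (B + sp.sqrt(d * S)) ** 2) == 0
print("gamma^4 =", sp.simplify(gamma_star**4), " minimum =", sp.factor(minimum))
```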
One might wonder if a version of Lemma 4.1 can be proved for the $L_p$-metric instead of the $L_\infty$-metric, and without any Lipschitz constraints. Such an inequality would, in particular, yield an alternative, simpler proof of Theorem 3.1. It turns out that one can prove such a bound for the $L_1$-metric but not for $L_p$ for any $p > 1$. The inequality for $L_1$ is presented next. This inequality could possibly be of independent interest. The reason why such an inequality cannot be proved for $L_p$, $p > 1$, is explained in Remark 4.1.
Lemma 4.2: For every pair of functions $f$ and $g$ in $\mathcal{C}([0,1]^d, 1)$, we have
$$\|f - g\|_1 \le (1 + 20d)\, \ell_H\left( V_f(1), V_g(1) \right). \qquad (14)$$
Proof: For $f \in \mathcal{C}([0,1]^d, 1)$ and $x \in (0,1)^d$, let $m_f(x)$ denote any subgradient of the convex function $f$ at $x$. Let $\ell_H(V_f(1), V_g(1)) = \delta > 0$. Our first step is to observe that
$$|f(x) - g(x)| \le \delta \left( 1 + |m_f(x)| + |m_g(x)| \right) \qquad (15)$$
for every $x \in (0,1)^d$, where $|m_f(x)|$ denotes the Euclidean norm of the subgradient vector $m_f(x) \in \mathbb{R}^d$. To see this, fix $x \in (0,1)^d$ with $f(x) \ne g(x)$. We assume, without loss of generality, that $f(x) < g(x)$. Clearly $(x, f(x)) \in V_f(1)$ and because $\ell_H(V_f(1), V_g(1)) = \delta$, there exists $(x', y') \in V_g(1)$ with $|(x, f(x)) - (x', y')| \le \delta$. Since $f(x) < g(x)$, the point $(x, f(x))$ lies outside the convex set $V_g(1)$ and we can thus take $y' = g(x')$. By the definition of the subgradient, we have
$$g(x') \ge g(x) + \langle m_g(x), x' - x \rangle.$$
Therefore,
$$0 \le g(x) - f(x) = g(x) - g(x') + g(x') - f(x) \le \langle m_g(x), x - x' \rangle + |g(x') - f(x)| \le |m_g(x)|\, |x - x'| + |g(x') - f(x)| \le \sqrt{|m_g(x)|^2 + 1}\; |(x, f(x)) - (x', y')| \le \delta \left( 1 + |m_g(x)| \right).$$
Note that the Cauchy-Schwarz inequality has been used twice in the above chain of inequalities. We have thus shown that $g(x) - f(x) \le \delta (1 + |m_g(x)|)$ in the case when $f(x) < g(x)$. One would have a similar inequality in the case when $f(x) > g(x)$. Combining these two, we obtain (15).
As a consequence of (15), we get
$$\|f - g\|_1 = \int_{[0,1]^d \setminus [\delta, 1-\delta]^d} |f - g| + \int_{[\delta, 1-\delta]^d} |f - g| \le 2\left( 1 - (1 - 2\delta)^d \right) + \delta \left( 1 + \int_{[\delta, 1-\delta]^d} |m_f(x)|\, dx + \int_{[\delta, 1-\delta]^d} |m_g(x)|\, dx \right) \le \delta \left( 1 + 4d + \int_{[\delta, 1-\delta]^d} \left( |m_f(x)| + |m_g(x)| \right) dx \right),$$
where we have used the inequality $(1 - 2\delta)^d \ge 1 - 2d\delta$.
To complete the proof of (14), we show that
$$\int_{[\delta, 1-\delta]^d} |m_f(x)|\, dx \le 8d \quad \text{for every } f \in \mathcal{C}([0,1]^d, 1).$$
We write $m_f(x) = (m_f(x)(1), \dots, m_f(x)(d)) \in \mathbb{R}^d$ and use the definition of the subgradient to note that for every $x \in [\delta, 1-\delta]^d$ and $1 \le i \le d$,
$$f(x + t e_i) - f(x) \ge t\, m_f(x)(i) \qquad (16)$$
for $t > 0$ sufficiently small, where $e_i$ is the unit vector in the $i$th coordinate direction, i.e., $e_i(j) := 1$ if $i = j$ and $0$ otherwise. Dividing both sides by $t$ and letting $t \downarrow 0$, we get $m_f(x)(i) \le f'(x; e_i)$ (we use $f'(x; v)$ to denote the directional derivative of $f$ in the direction $v$; directional derivatives exist as $f$ is convex). Using (16) for $t < 0$, we get $m_f(x)(i) \ge -f'(x; -e_i)$. Combining these two inequalities, we get
$$|m_f(x)(i)| \le |f'(x; e_i)| + |f'(x; -e_i)| \quad \text{for } i = 1, \dots, d.$$
As a result,
$$\int_{[\delta, 1-\delta]^d} |m_f(x)|\, dx \le \sum_{i=1}^{d} \int_{[\delta, 1-\delta]^d} |m_f(x)(i)|\, dx \le \sum_{i=1}^{d} \left( \int_{[\delta, 1-\delta]^d} |f'(x; e_i)|\, dx + \int_{[\delta, 1-\delta]^d} |f'(x; -e_i)|\, dx \right).$$
We now show that for each $i$, both the integrals $\int_{[\delta, 1-\delta]^d} |f'(x; e_i)|$ and $\int_{[\delta, 1-\delta]^d} |f'(x; -e_i)|$ are bounded
from above by 4. Assume, without loss of generality, that $i = 1$ and notice
$$\int_{[\delta, 1-\delta]^d} |f'(x; e_1)|\, dx = \int_{u \in [\delta, 1-\delta]^{d-1}} \left( \int_{\delta}^{1-\delta} |f'((x_1, u); e_1)|\, dx_1 \right) du. \qquad (17)$$
We fix $u = (x_2, \dots, x_d) \in [\delta, 1-\delta]^{d-1}$ and focus on the inner integral. Let $v(z) := f(z, x_2, \dots, x_d)$ for $z \in [0,1]$. Clearly $v$ is a convex function on $[0,1]$ and its right derivative, $v'_r(x_1)$, at the point $z = x_1 \in (0,1)$ equals $f'(x; e_1)$ where $x = (x_1, \dots, x_d)$. The inner integral thus equals $\int_{\delta}^{1-\delta} |v'_r(z)|\, dz$. Because of the convexity of $v$, its right derivative $v'_r(z)$ is non-decreasing and satisfies
$$v(y_2) - v(y_1) = \int_{y_1}^{y_2} v'_r(z)\, dz \quad \text{for } 0 < y_1 < y_2 < 1.$$
Consequently,
$$\int_{\delta}^{1-\delta} |v'_r(z)|\, dz \le \sup_{\delta \le c \le 1-\delta} \left( -\int_{\delta}^{c} v'_r(z)\, dz + \int_{c}^{1-\delta} v'_r(z)\, dz \right) = \sup_{\delta \le c \le 1-\delta} \left( v(\delta) + v(1-\delta) - 2 v(c) \right).$$
The function $v(\cdot)$ clearly satisfies $|v(z)| \le 1$ because $f \in \mathcal{C}([0,1]^d, 1)$. This implies that $\int_{\delta}^{1-\delta} |v'_r(z)|\, dz \le 4$. The identity (17) therefore gives
$$\int_{[\delta, 1-\delta]^d} |f'(x; e_1)|\, dx = \int_{(x_2, \dots, x_d) \in [\delta, 1-\delta]^{d-1}} \left( \int_{\delta}^{1-\delta} |v'_r(z)|\, dz \right) dx_2 \dots dx_d \le 4.$$
Similarly, by working with left derivatives of $v$ as opposed to right, we can prove that $\int_{[\delta, 1-\delta]^d} |f'(x; -e_1)|\, dx \le 4$.
Therefore, the integral $\int_{[\delta, 1-\delta]^d} |m_f|$ is at most $8d$ because it is less than or equal to
$$\sum_{i=1}^{d} \left( \int_{[\delta, 1-\delta]^d} |f'(x; e_i)|\, dx + \int_{[\delta, 1-\delta]^d} |f'(x; -e_i)|\, dx \right).$$
This completes the proof of Lemma 4.2.
Remark 4.1: Lemma 4.2 is not true if $L_1$ is replaced by $L_p$, for $p > 1$. Indeed, if $d = 1$ and $f_\delta(x) := \max(0, 1 - (x/\delta))$ for $0 < \delta \le 1$ and $g(x) := 0$ for all $x \in [0,1]$, then it can be easily checked that for $1 \le p < \infty$,
$$\|f_\delta - g\|_p = \frac{\delta^{1/p}}{(1+p)^{1/p}} \quad \text{and} \quad \ell_H\left( V_{f_\delta}(1), V_g(1) \right) = \frac{\delta}{\sqrt{1 + \delta^2}}.$$
As $\delta$ can be arbitrarily close to zero, this clearly rules out any inequality of the form (14) with the $L_1$-metric replaced by $L_p$, for $1 < p \le \infty$.
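Both displayed formulas are easy to confirm (our check, not in the paper, using SymPy). The $L_p$ norm is an explicit integral over $[0, \delta]$. For the Hausdorff distance, note that $V_{f_\delta}(1) \subset V_g(1)$, and the point of $V_g(1)$ farthest from $V_{f_\delta}(1)$ is the corner $(0,0)$, whose distance to the boundary line $x/\delta + y = 1$ of $V_{f_\delta}(1)$ is $\delta/\sqrt{1+\delta^2}$.

```python
import sympy as sp

x, delta = sp.symbols("x delta", positive=True)

# ||f_delta - g||_p^p = integral over [0, delta] of (1 - x/delta)^p = delta/(1+p)
for p in (1, 2, 3, 7):
    lp_p = sp.integrate((1 - x / delta) ** p, (x, 0, delta))
    assert sp.simplify(lp_p - delta / (1 + p)) == 0

# Distance from the corner (0, 0) to the line x/delta + y = 1.
dist = 1 / sp.sqrt(1 / delta**2 + 1)
assert sp.simplify(dist - delta / sp.sqrt(1 + delta**2)) == 0
print("||f_delta - g||_p = (delta/(1+p))^(1/p),  l_H = delta/sqrt(1+delta^2)")
```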
Remark 4.2: Lemma 4.2 and Bronshtein's result (12) can be used to give an alternative proof of Theorem 3.1 for the special case $p = 1$. Indeed, the scaling identity (1) lets us take $a = 0$, $b = 1$ and $B = 1$. Inequality (14) implies that the covering number $M\left( \mathcal{C}([0,1]^d, 1), \epsilon; L_1 \right)$ is less than or equal to
$$M\left( \mathcal{K}_{d+1}(\sqrt{d+1}),\; \frac{\epsilon}{2(1 + 20d)};\; \ell_H \right).$$
Thus from (12), we deduce the existence of two positive constants $c$ and $\epsilon_0$, depending only on $d$, such that
$$\log M\left( \mathcal{C}([0,1]^d, 1), \epsilon; L_1 \right) \le c\, \epsilon^{-d/2}$$
whenever $\epsilon \le \epsilon_0$. Note that, by Remark 4.1, this method of proof does not work in the case of $L_p$, for $1 < p < \infty$.
REFERENCES
[1] A. N. Kolmogorov and V. M. Tihomirov, "ε-entropy and ε-capacity of sets in function spaces," Amer. Math. Soc. Transl. (2), vol. 17, pp. 277-364, 1961.
[2] D. Dryanov, "Kolmogorov entropy for classes of convex functions," Constructive Approximation, vol. 30, pp. 137-153, 2009.
[3] E. M. Bronshtein, "ε-entropy of convex sets and functions," Siberian Mathematical Journal, vol. 17, pp. 393-398, 1976.
[4] R. M. Dudley, Uniform Central Limit Theorems. Cambridge University Press, 1999.
[5] L. Birgé, "Approximation dans les espaces métriques et théorie de l'estimation," Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, vol. 65, pp. 181-237, 1983.
[6] L. Le Cam, "Convergence of estimates under dimensionality restrictions," Annals of Statistics, vol. 1, pp. 38-53, 1973.
[7] Y. Yang and A. Barron, "Information-theoretic determination of minimax rates of convergence," Annals of Statistics, vol. 27, pp. 1564-1599, 1999.
[8] A. Guntuboyina, "Lower bounds for the minimax risk using f-divergences, and applications," IEEE Transactions on Information Theory, vol. 57, pp. 2386-2399, 2011.
[9] S. Van de Geer, Applications of Empirical Process Theory. Cambridge University Press, 2000.
[10] L. Birgé and P. Massart, "Rates of convergence for minimum contrast estimators," Probability Theory and Related Fields, vol. 97, pp. 113-150, 1993.
[11] E. Seijo and B. Sen, "Nonparametric least squares estimation of a multivariate convex regression function," Annals of Statistics, vol. 39, pp. 1633-1657, 2011.
[12] L. A. Hannah and D. Dunson, "Bayesian nonparametric multivariate convex regression," 2011, submitted.
[13] A. Seregin and J. A. Wellner, "Nonparametric estimation of multivariate convex-transformed densities," Annals of Statistics, vol. 38, pp. 3751-3781, 2010.
[14] M. L. Cule, R. J. Samworth, and M. I. Stewart, "Maximum likelihood estimation of a multi-dimensional log-concave density (with discussion)," Journal of the Royal Statistical Society, Series B, vol. 72, pp. 545-600, 2010.
[15] L. Dümbgen, R. J. Samworth, and D. Schuhmacher, "Approximation by log-concave distributions with applications to regression," Annals of Statistics, vol. 39, pp. 702-730, 2011.
[16] A. Guntuboyina, "Optimal rates of convergence for the estimation of reconstruction of convex bodies from noisy support function measurements," Annals of Statistics, 2011, to appear.
[17] R. J. Gardner, M. Kiderlen, and P. Milanfar, "Convergence of algorithms for reconstructing convex bodies and directional measures," Annals of Statistics, vol. 34, pp. 1331-1374, 2006.
[18] P. Massart, Concentration Inequalities and Model Selection, ser. Lecture Notes in Mathematics. Berlin: Springer, 2007, vol. 1896.