
Stat 210B: Week 1-2

1 U-Statistic

Definition 1 (Estimable Parameter). Given a probability distribution P, a parameter θ(P) derived from the distribution is estimable if there is some function h of a sample of size r such that h(X_1, ..., X_r) is an unbiased estimator of θ(P). In particular, the smallest integer r satisfying this property is the degree of θ(P).

By definition, the function h does not have to be symmetric, but it can be symmetrized in the following way:

h_sym(X_1, ..., X_r) = (1/r!) Σ_π h(X_{π(1)}, ..., X_{π(r)})

where π ranges over the permutations of {1, ..., r}.

Definition 2 (U-statistics). For an iid sample X_1, ..., X_n ~ P and an estimable parameter θ(P), suppose h(X_1, ..., X_r), where r ≤ n, is unbiased for θ(P). Assume that h is permutation symmetric. Then the U-statistic with kernel h is defined as

U = C(n,r)^{-1} Σ_β h(X_{β_1}, ..., X_{β_r})

where β ranges over all combinations (size-r subsets) of {1, ..., n}. If h is not permutation symmetric, then we modify the definition as follows:

U = ((n-r)!/n!) Σ_π h(X_{π_1}, ..., X_{π_r})

where π ranges over all ordered r-tuples of distinct indices (permutations). Either way, U is unbiased. Moreover, note that

U = E[h(X_1, ..., X_r) | X_(1), ..., X_(n)]

i.e., U is the conditional expectation of the kernel given the order statistics.
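To make the definition concrete, here is a minimal brute-force sketch (added for illustration; the function name u_statistic and the example below are not part of the original notes) that averages a symmetric kernel over all size-r subsets of the sample.

```python
from itertools import combinations
from math import comb

import numpy as np


def u_statistic(x, kernel, r):
    """Average a symmetric kernel of order r over all size-r subsets of the sample."""
    x = np.asarray(x)
    n = len(x)
    total = sum(kernel(*x[list(idx)]) for idx in combinations(range(n), r))
    return total / comb(n, r)


# Example: with the identity kernel and r = 1, the U-statistic is the sample mean.
rng = np.random.default_rng(0)
sample = rng.normal(size=20)
print(u_statistic(sample, lambda x: x, 1), sample.mean())
```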

Observe that U-statistics generalize many classical estimators. We now give a few examples:

Examples

1. Sample mean: Let r = 1 and take h to be the identity function; then U = (1/n) Σ_{i=1}^n h(X_i) is the sample mean.

2. Variance: Use the kernel h(x_1, x_2) = x_1^2 - x_1 x_2 and then construct the corresponding symmetric kernel h_sym(x_1, x_2) = (1/2)(x_1 - x_2)^2. We obtain the corresponding U-statistic for estimating the variance,

U = C(n,2)^{-1} Σ_{i<j} (1/2)(X_i - X_j)^2

which equals the usual unbiased sample variance (a numerical check appears after this list).

3. Signed rank statistic (Wilcoxon): The signed rank test aims to determine whether P is symmetric about 0. Its signed rank statistic can be rewritten as

W_n^+ = Σ_i 1{X_i > 0} + Σ_{i<j} 1{X_i + X_j > 0}

which, up to normalization, is a sum of two U-statistics with kernels of degree 1 and 2.

4. Kendall's τ: For a two-dimensional probability distribution P, a pair of points (x_1, y_1), (x_2, y_2) is called concordant if the slope of the line connecting the two points is positive, and discordant if the slope is negative. For iid pairs (X_1, Y_1), (X_2, Y_2), if the probability of being concordant exceeds the probability of being discordant, then there is a positive association between X and Y. Kendall's τ measures this association as follows:

τ = 2 ( P[X_1 < X_2, Y_1 < Y_2] + P[X_2 < X_1, Y_2 < Y_1] ) - 1

Hence, if we let h((x_1, y_1), (x_2, y_2)) = 2(1{x_1 < x_2, y_1 < y_2} + 1{x_2 < x_1, y_2 < y_1}) - 1, the corresponding U-statistic is an unbiased estimator of Kendall's τ (a numerical check appears after this list).
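As a sanity check on Examples 2 and 4, the following sketch (added; all variable names are illustrative and not from the notes) evaluates the two U-statistics by brute force and compares them with the unbiased sample variance and with a direct concordant-minus-discordant count.

```python
from itertools import combinations
from math import comb

import numpy as np

rng = np.random.default_rng(1)
n = 30
x = rng.normal(size=n)

# Example 2: U-statistic with the symmetric kernel (1/2)(x_i - x_j)^2.
u_var = sum(0.5 * (x[i] - x[j]) ** 2 for i, j in combinations(range(n), 2)) / comb(n, 2)
print(u_var, x.var(ddof=1))  # the two values agree

# Example 4: Kendall's tau kernel 2*(1{x1<x2, y1<y2} + 1{x2<x1, y2<y1}) - 1.
y = x + rng.normal(size=n)

def kendall_kernel(p, q):
    (x1, y1), (x2, y2) = p, q
    return 2 * ((x1 < x2 and y1 < y2) + (x2 < x1 and y2 < y1)) - 1

pairs = list(zip(x, y))
u_tau = sum(kendall_kernel(pairs[i], pairs[j]) for i, j in combinations(range(n), 2)) / comb(n, 2)
# Direct count of concordant minus discordant pairs, normalized by C(n, 2).
direct = sum(np.sign((x[i] - x[j]) * (y[i] - y[j])) for i, j in combinations(range(n), 2)) / comb(n, 2)
print(u_tau, direct)  # again the two values agree
```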

The goal of the discussion below is to derive the asymptotic distribution of the U-statistic. To do so, we first use the idea of the Hájek projection to project U onto a space S of sums of functions of single observations; we then obtain the asymptotic distribution of the projection, and finally show that the (standardized) difference between U and its projection converges to zero in probability, which gives the result.

Definition 3 (Hájek Projection). Let S be the subspace of random variables of the form

S = { Σ_{i=1}^n g_i(X_i) }

where each g_i is any function with finite second moment. For a random variable T, the Hájek projection maps T to an element of S. Moreover, the projected image is given by

Ŝ = Σ_{i=1}^n E(T | X_i) - (n - 1) E[T]
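As a quick worked example (added here for concreteness; it is not in the original notes), take T = X̄, the sample mean of iid X_1, ..., X_n with mean μ. Then E(T | X_i) = X_i/n + (n-1)μ/n, so

Ŝ = Σ_{i=1}^n ( X_i/n + (n-1)μ/n ) - (n-1)μ = X̄

that is, the sample mean already lies in S and is its own Hájek projection.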

Theorem 4. Suppose we are interested in the limiting distribution of {T_n} and can relate {T_n} to a sequence {S_n} whose limit we know. If T_n - S_n → 0 in probability, then by Slutsky's theorem T_n converges to the same limit.

Now, let S be a linear space of random variables with finite second moments. Then Ŝ is the projection of T onto S if and only if Ŝ ∈ S and E[(T - Ŝ)S] = 0 for all S ∈ S. If S contains the constants, then in addition E[T] = E[Ŝ] and cov(T - Ŝ, S) = 0 for all S ∈ S.

The proof of the last statement in Definition 3 goes by verifying that E[(T - Ŝ)S] = 0 for every S ∈ S, using the smoothing (tower) identity E[E(T | X_i) g_i(X_i)] = E[T g_i(X_i)].

Now we use the result above to project U onto S. Since E[U] = θ, the projection is

Û = Σ_{i=1}^n E(U | X_i) - (n - 1)θ

For convenience, assume that h is permutation symmetric. Also, for 1 ≤ c ≤ r, define

h_c(x_1, ..., x_c) = E[h(x_1, ..., x_c, X_{c+1}, ..., X_r)]

and

σ_c^2 = Var(h_c(X_1, ..., X_c))
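For a concrete illustration (added here; it reuses the variance kernel of Example 2), take h(x_1, x_2) = (1/2)(x_1 - x_2)^2 with E[X] = μ and Var(X) = σ^2. Then

h_1(x) = E[(1/2)(x - X_2)^2] = (1/2)((x - μ)^2 + σ^2)

so σ_1^2 = Var(h_1(X_1)) = (1/4) Var((X_1 - μ)^2). For Gaussian data this equals σ^4/2, and the quantity r^2 σ_1^2 = 2σ^4 appearing below is the familiar limiting variance of the sample variance.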

Using this notation and the fact that h is symmetric, we have

Û - θ = (r/n) Σ_{i=1}^n ( h_1(X_i) - θ )

For a proof of this identity, see reference 1.

Lemma 5. Consider the kernel evaluated on two size-r subsets of the sample that have exactly c indices in common, say h(X_{β_1}, ..., X_{β_r}) and h(X_{β'_1}, ..., X_{β'_r}). Then

Cov[ h(X_{β_1}, ..., X_{β_r}), h(X_{β'_1}, ..., X_{β'_r}) ] = σ_c^2

where σ_c^2 is defined as above.

From the explicit form of Û above, the variance of Û is just Var(Û) = r^2 σ_1^2 / n.

Compute the variance of U:

Var(U) = C(n,r)^{-2} Σ_β Σ_{β'} Cov[ h(X_{β_1}, ..., X_{β_r}), h(X_{β'_1}, ..., X_{β'_r}) ]
       = C(n,r)^{-2} Σ_{c=0}^{r} C(n,r) C(r,c) C(n-r, r-c) σ_c^2        (1)
       = C(n,r)^{-1} Σ_{c=0}^{r} C(r,c) C(n-r, r-c) σ_c^2

since, for a fixed subset β, there are C(r,c) C(n-r, r-c) subsets β' sharing exactly c indices with it (note that σ_0^2 = 0, because h_0 = θ is a constant).
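The following sketch (added; not in the notes) evaluates formula (1) for the variance kernel of Example 2 with standard normal data, using σ_1^2 = 1/2 and σ_2^2 = 2 computed as in the worked example above, and compares it with the known exact variance 2/(n - 1) of the unbiased sample variance.

```python
from math import comb

# Variance kernel h(x1, x2) = (x1 - x2)^2 / 2 with X ~ N(0, 1):
# sigma_1^2 = 1/2 and sigma_2^2 = 2 (see the worked example above).
sigmas = {1: 0.5, 2: 2.0}
r = 2

def var_u(n):
    """Evaluate formula (1) for the variance of the U-statistic."""
    total = sum(comb(r, c) * comb(n - r, r - c) * sigmas[c] for c in (1, 2))
    return total / comb(n, r)

for n in (5, 20, 100):
    # For N(0, 1) data the U-statistic is the unbiased sample variance,
    # whose exact variance is 2 / (n - 1).
    print(n, var_u(n), 2 / (n - 1))
```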

Theorem 6. If σ^2(T_n)/σ^2(S_n) → 1, then

(T_n - E[T_n]) / σ(T_n) - (S_n - E[S_n]) / σ(S_n) → 0 in probability.

In our setting, formula (1) gives

Var(U) / Var(Û) = [ r^2 σ_1^2 / n + O(1/n^2) σ_2^2 + ... + O(1/n^r) σ_r^2 ] / [ r^2 σ_1^2 / n ] → 1
Theorem 7 (Asymptotic Convergence). If E[h^2(X_1, ..., X_r)] < ∞, then √n (U - θ) →d N(0, r^2 σ_1^2).

The theorem follows from the results above: Û - θ is an average of iid terms, so √n (Û - θ) →d N(0, r^2 σ_1^2) by the central limit theorem, and by Theorem 6 together with Slutsky's theorem, √n (U - θ) has the same limit.
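A small simulation (added; names illustrative, not from the notes) checks Theorem 7 for the variance kernel: with standard normal data, r = 2 and σ_1^2 = 1/2, so √n (U - θ) should be approximately N(0, 2).

```python
from math import erf, sqrt

import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 20000
theta = 1.0  # true variance of N(0, 1) data

# For the kernel h(x1, x2) = (x1 - x2)^2 / 2 the U-statistic equals the
# unbiased sample variance, so it can be computed directly.
samples = rng.normal(size=(reps, n))
u = samples.var(axis=1, ddof=1)
z = sqrt(n) * (u - theta)

# Theorem 7 predicts the limit N(0, r^2 * sigma_1^2) = N(0, 2).
print("empirical variance of sqrt(n)(U - theta):", z.var(), "(predicted 2)")
phi = 0.5 * (1 + erf(1.0 / sqrt(2 * 2.0)))  # P(N(0, 2) <= 1)
print("empirical P(<= 1):", (z <= 1.0).mean(), "(predicted", phi, ")")
```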

References

1. Peter Bartlett's notes on U-statistics: http://www.stat.berkeley.edu/~bartlett/courses/2013spring-stat210b/notes/7notes.pdf

2. Thomas Ferguson's notes: https://www.math.ucla.edu/~tom/Stat200C/Ustat.pdf

3. Jon Wellner's notes: https://www.stat.washington.edu/jaw/COURSES/580s/581/HO/HajekProj-HoeffdingExp.pdf

2 Tail Bound

Theorem 8 (Markov's Inequality). Let X be a nonnegative random variable and t > 0. If E[X] < ∞, then

P[X ≥ t] ≤ E[X] / t

Applying Markov's inequality to the variable Y = (X - E[X])^2, we obtain Chebyshev's inequality:

Theorem 9 (Chebyshev's Inequality). Suppose X has finite variance and let μ = E[X]. Then

P[|X - μ| ≥ t] ≤ Var(X) / t^2

Now we give a brief review of moments and the moment generating function (MGF).

Definition 10 (Moment). For a one-dimensional random variable X with PDF f(x), the k-th moment about c is

μ_k = ∫ (x - c)^k f(x) dx

whenever it exists. Typical choices of c are 0 and the mean; when c is the mean, the moment is called a central moment.

Suppose the k-th central moment of X exists. Applying Markov's inequality to |X - μ|^k gives

P[|X - μ| ≥ t] ≤ E[|X - μ|^k] / t^k        (2)

for all t > 0.

Definition 11 (Moment Generating Function). The MGF of a random variable X, when it exists, is defined as

M_X(λ) = E[e^{λX}]

If X is multidimensional, we replace λX by the inner product ⟨λ, X⟩. Unlike the CDF or the characteristic function, the MGF does not always exist. However, when it exists, it is a useful tool; for instance, the k-th moment can be extracted as the k-th derivative of M_X at 0.

Assuming that the MGF exists on [-a, a], we can use Markov's inequality to obtain

P(X - μ ≥ t) = P( e^{λ(X-μ)} ≥ e^{λt} ) ≤ E[e^{λ(X-μ)}] / e^{λt},   t > 0, λ ∈ [0, a]

from which we derive the Chernoff bound:

log P(X - μ ≥ t) ≤ inf_{λ ∈ [0, a]} ( log E[e^{λ(X-μ)}] - λt )        (3)

Note that the Chernoff bound is always sub-optimal to the moment bound: the bound (2), optimized over k, is at least as tight as (3).
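To illustrate the last remark, here is an added sketch (not from the notes): take X ~ Exp(1) and bound P[X ≥ t]. The Chernoff bound is inf_{0 ≤ λ < 1} e^{-λt}/(1 - λ) = t e^{1-t} for t > 1, while the moment bound uses E[X^k] = k!; the optimized moment bound comes out smaller.

```python
from math import exp, factorial

# X ~ Exp(1): E[X^k] = k!, E[exp(lam * X)] = 1 / (1 - lam) for lam < 1,
# and the exact tail is P[X >= t] = exp(-t).
for t in (5.0, 10.0, 20.0):
    chernoff = min(exp(-lam * t) / (1 - lam) for lam in [i / 1000 for i in range(0, 999)])
    moment = min(factorial(k) / t ** k for k in range(1, 60))
    print(f"t={t}: Chernoff {chernoff:.3e}, best moment bound {moment:.3e}, exact {exp(-t):.3e}")
```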

2.1 Sub-Gaussian

For a Gaussian X ~ N(μ, σ^2), the MGF exists everywhere. In particular, we have

E[e^{λX}] = e^{λμ + λ^2 σ^2 / 2}

Applying the Chernoff bound gives

inf_{λ ∈ [0, ∞)} ( log E[e^{λ(X-μ)}] - λt ) = - t^2 / (2σ^2)

and so

P[X ≥ μ + t] ≤ e^{-t^2/(2σ^2)}

This gives a probability bound on the upper tail, and we thus refer to this inequality as an upper deviation inequality. It also motivates the study of random variables whose tail probabilities are bounded by an expression of the same Gaussian form.

Definition 12 (Sub-Gaussian). A random variable X with mean μ is sub-Gaussian with parameter σ > 0, written X ∈ SG(σ), if

E[e^{λ(X-μ)}] ≤ e^{λ^2 σ^2 / 2}   for all λ ∈ R.

For a sub-Gaussian RV with parameter σ, the Chernoff argument above gives the upper deviation inequality P[X ≥ μ + t] ≤ e^{-t^2/(2σ^2)}; by symmetry of the definition, -X is also SG(σ), which gives the corresponding lower deviation inequality. Combining the two gives a concentration inequality:

P[|X - μ| ≥ t] ≤ 2 e^{-t^2/(2σ^2)}

Examples of Sub-Gaussian RVs

1. Rademacher ([1], Example 2.2): X ∈ {-1, +1} with equal probability. Then

E[e^{λX}] ≤ e^{λ^2/2}

Hence, the Rademacher variable is SG(1).

2. Bounded RV ([1], Example 2.3): Suppose X ∈ [c_1, c_2]. Then X is SG(σ) with σ = (c_2 - c_1)/2. The argument in the book uses a symmetrization approach and gives the weaker constant σ = c_2 - c_1.

Lemma 13 (Hoeffding). Suppose X ∈ [a, b] almost surely. Then

E[e^{λ(X - E[X])}] ≤ e^{λ^2 (b - a)^2 / 8}

A bound for the sum of independent RVs is given by:

Theorem 14 (Hoeffding Bound). Let X_1, ..., X_n ~ P be iid, where each X_i ∈ SG(σ), and let X̄ be the sample mean. Then X̄ ∈ SG(σ/√n). More generally, for independent X_i ∈ SG(σ_i) with means μ_i,

P[ Σ_{i=1}^n (X_i - μ_i) ≥ t ] ≤ exp( - t^2 / (2 Σ_{i=1}^n σ_i^2) )        (4)

Remark. Combining this with Hoeffding's lemma yields tail bounds for sums of bounded RVs.
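A quick simulation of the remark (added; not from the notes): iid Bernoulli(1/2) variables lie in [0, 1] and are therefore SG(1/2) by Hoeffding's lemma, so bound (4) with σ_i = 1/2 reads P[Σ(X_i - 1/2) ≥ t] ≤ exp(-2t^2/n).

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 100, 200000
x = rng.integers(0, 2, size=(reps, n))      # Bernoulli(1/2) samples
dev = x.sum(axis=1) - n / 2                 # sum of (X_i - 1/2)

for t in (5.0, 10.0, 15.0):
    empirical = (dev >= t).mean()
    hoeffding = np.exp(-2 * t ** 2 / n)     # bound (4) with sigma_i = 1/2
    print(f"t={t}: empirical {empirical:.4f} <= Hoeffding bound {hoeffding:.4f}")
```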

For SG RVs, there are several other equivalent characterizations:

Theorem 15 (Equivalence; Thm 2.1 in [1]). Suppose E[X] = 0. The following are equivalent:

1. There is some σ > 0 such that for all λ ∈ R:

E[e^{λX}] ≤ e^{λ^2 σ^2 / 2}

2. There are some c ≥ 1 and Z ~ N(0, τ^2) such that for all t ≥ 0:

P[|X| ≥ t] ≤ c P[|Z| ≥ t]

3. There is some θ ≥ 0 such that for all integers k ≥ 1:

E[X^{2k}] ≤ ((2k)! / (2^k k!)) θ^{2k}

4. For all λ ∈ [0, 1):

E[e^{λ X^2 / (2σ^2)}] ≤ 1 / √(1 - λ)
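Condition 3 holds with equality for a centered Gaussian: if Z ~ N(0, σ^2), then E[Z^{2k}] = (2k - 1)!! σ^{2k} = (2k)!/(2^k k!) σ^{2k}. The short check below (added; not from the notes) confirms the combinatorial identity and compares it against Monte Carlo moments.

```python
from math import factorial

import numpy as np

rng = np.random.default_rng(4)
z = rng.normal(size=2_000_000)  # sigma = 1

for k in range(1, 5):
    bound = factorial(2 * k) / (2 ** k * factorial(k))    # (2k)! / (2^k k!)
    double_fact = np.prod(np.arange(2 * k - 1, 0, -2))    # (2k - 1)!!
    mc = (z ** (2 * k)).mean()                            # Monte Carlo E[Z^(2k)]
    print(k, bound, double_fact, round(mc, 3))
```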

2.2 Sub-Exponential RVs

A SG RV has tails that decay at least at the rate e^{-cx^2}, where x is the deviation from the mean and c is some constant. Such a decay rate captures only a limited class of RVs of interest, so we now consider RVs whose tails are allowed to decay more slowly:

Definition 16 (Sub-Exponential RVs). A RV X with mean μ is sub-exponential, written X ∈ SE(ν, α), if there are nonnegative parameters (ν, α) such that

E[e^{λ(X-μ)}] ≤ e^{λ^2 ν^2 / 2}   for all |λ| < 1/α.

Comparing this with the definition of SG RVs, one finds that a SE RV bounds the MGF only in a neighborhood of 0, whereas a SG RV bounds the MGF on the entire real line.

Example

1. Chi-squared ([1], Example 2.4): Let Z ~ N(0, 1) and X = Z^2. Then

E[e^{λ(X-1)}] = e^{-λ} / √(1 - 2λ),   λ < 1/2,

which is bounded by e^{4λ^2/2} for |λ| < 1/4, so X ∈ SE(2, 4) (a numerical check follows below). For a general X ∈ SE(ν, α), the Chernoff argument gives

P(X - μ ≥ t) ≤ exp( -λt + λ^2 ν^2 / 2 ),   0 ≤ λ < 1/α.

Unlike the SG case, where the tail bound can be found by solving an unconstrained optimization problem, here the MGF is controlled only in a neighborhood of 0, so the optimization problem becomes a constrained one: find the infimum over λ ∈ [0, 1/α).

Theorem 17 (Sub-Exponential RVs Tail Bound). Suppose X ∈ SE(ν, α). Then

P(X - μ ≥ t) ≤ e^{-t^2/(2ν^2)}   for 0 ≤ t ≤ ν^2/α,

P(X - μ ≥ t) ≤ e^{-t/(2α)}   for t > ν^2/α.

Theorem 18 (Bernstein's Condition). Let X be a random variable with mean μ such that the variance σ^2 exists. Bernstein's condition with parameter b > 0 requires that, for all integers k ≥ 3,

|E[(X - μ)^k]| ≤ (1/2) k! σ^2 b^{k-2}.

This is a sufficient but not necessary condition for sub-exponentiality. In particular, with the parameters above, X ∈ SE(√2 σ, 2b).

Theorem 19 (Bernstein's Inequality). Suppose X satisfies Bernstein's condition with parameters σ and b. Then for all |λ| < 1/b,

E[e^{λ(X-μ)}] ≤ exp( λ^2 σ^2 / (2(1 - b|λ|)) )

which, via the Chernoff argument, leads to a concentration inequality: for 0 ≤ λ < 1/b,

P[|X - μ| > t] ≤ 2 exp( -λt + λ^2 σ^2 / (2(1 - b|λ|)) ).

If we pick λ = t/(bt + σ^2), then we get

P[|X - μ| > t] ≤ 2 e^{- t^2 / (2(bt + σ^2))}.
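A quick numeric check (added; not from the notes) that the choice λ = t/(bt + σ^2) turns the exponent -λt + λ^2 σ^2 / (2(1 - b|λ|)) into -t^2 / (2(bt + σ^2)); the parameter values are arbitrary.

```python
def bernstein_exponent(lam, t, sigma2, b):
    """Exponent -lam*t + lam^2*sigma^2 / (2*(1 - b*|lam|)) from Theorem 19."""
    return -lam * t + lam ** 2 * sigma2 / (2 * (1 - b * abs(lam)))

sigma2, b = 1.5, 0.7
for t in (0.5, 2.0, 10.0):
    lam = t / (b * t + sigma2)                 # the choice made in the text
    lhs = bernstein_exponent(lam, t, sigma2, b)
    rhs = -t ** 2 / (2 * (b * t + sigma2))     # claimed simplified exponent
    print(t, lhs, rhs)                         # the two values coincide
```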

Lemma 20 (Johnson-Lindenstrauss). A deterministic statement of the JL lemma says: given ε ∈ (0, 1) and a set X of n points in R^d, if m > 8 log n / ε^2, then there is a map F : R^d → R^m such that for any x_1, x_2 ∈ X,

1 - ε ≤ ||F(x_1) - F(x_2)||^2 / ||x_1 - x_2||^2 ≤ 1 + ε.

The probabilistic version says that, for a suitable random F and any one of the C(n,2) pairs of points x_1, x_2,

P[ ||F(x_1) - F(x_2)||^2 / ||x_1 - x_2||^2 ∉ [1 - ε, 1 + ε] ] ≤ 2 e^{-m ε^2 / 8},

and the deterministic statement is obtained from a union bound over the pairs (for m of this order).
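A minimal random-projection sketch (added; the Gaussian construction F(x) = Ax/√m with iid N(0, 1) entries is one standard choice, not necessarily the one intended in the notes): project n points from R^d to R^m and report the worst pairwise distortion.

```python
from itertools import combinations
from math import log

import numpy as np

rng = np.random.default_rng(6)
n, d, eps = 50, 1000, 0.5
m = int(8 * log(n) / eps ** 2) + 1          # dimension of the order suggested by the lemma

points = rng.normal(size=(n, d))
A = rng.normal(size=(m, d))
proj = points @ A.T / np.sqrt(m)            # F(x) = A x / sqrt(m)

ratios = [
    np.sum((proj[i] - proj[j]) ** 2) / np.sum((points[i] - points[j]) ** 2)
    for i, j in combinations(range(n), 2)
]
# The ratios concentrate around 1; compare the observed range with the target band.
print("m =", m, "min ratio =", min(ratios), "max ratio =", max(ratios))
print("target band:", 1 - eps, 1 + eps)
```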

