
Stat 210B: Week 1-2

1 U-Statistic

Definition 1 (Estimable Parameter). Given a probability distribution P, a parameter θ(P) derived from the distribution is estimable if there is some function h of a sample of size r such that h(X_1, ..., X_r) is an unbiased estimator of θ(P). In particular, the smallest integer r satisfying this property is the degree of θ(P).

By definition, the function h does not have to be symmetric, but it can be symmetrized in the following way:

h_sym(X_1, ..., X_r) = (1/r!) Σ_π h(X_{π(1)}, ..., X_{π(r)})

where π ranges over the permutations of {1, ..., r}.

Definition 2 (U-statistics). For an iid sample X_1, ..., X_n ~ P and an estimable parameter θ(P), suppose h(X_1, ..., X_r), where r ≤ n, is unbiased for θ(P). Assume that h is permutation symmetric. Then the U-statistic with kernel h is defined as

U = C(n,r)^{-1} Σ_β h(X_{β_1}, ..., X_{β_r})

where β ranges over all combinations (size-r subsets) of {1, ..., n}. If h is not permutation symmetric, then we modify the definition as follows:

U = ((n-r)!/n!) Σ_π h(X_{π_1}, ..., X_{π_r})

where π ranges over all ordered r-tuples of distinct indices (permutations). Either way, U is unbiased. Moreover, note that

U = E[h(X_1, ..., X_r) | X_(1), ..., X_(n)]

i.e., U is the conditional expectation of the kernel given the order statistics.
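To make the definition concrete, here is a minimal brute-force sketch (added for illustration; the function name u_statistic and the example below are not part of the original notes) that averages a symmetric kernel over all size-r subsets of the sample.

```python
from itertools import combinations
from math import comb

import numpy as np


def u_statistic(x, kernel, r):
    """Average a symmetric kernel of order r over all size-r subsets of the sample."""
    x = np.asarray(x)
    n = len(x)
    total = sum(kernel(*x[list(idx)]) for idx in combinations(range(n), r))
    return total / comb(n, r)


# Example: with the identity kernel and r = 1, the U-statistic is the sample mean.
rng = np.random.default_rng(0)
sample = rng.normal(size=20)
print(u_statistic(sample, lambda x: x, 1), sample.mean())
```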

Observe that U-statistics generalize many classical estimators. We now give a few examples:

Examples

1. Sample mean: Let r = 1 and take h to be the identity function; then U = (1/n) Σ_{i=1}^n h(X_i) is the sample mean.

2. Variance: Use the kernel h(x_1, x_2) = x_1^2 - x_1 x_2 and then construct the corresponding symmetric kernel h_sym(x_1, x_2) = (1/2)(x_1 - x_2)^2. We obtain the corresponding U-statistic for estimating the variance,

U = C(n,2)^{-1} Σ_{i<j} (1/2)(X_i - X_j)^2

which equals the usual unbiased sample variance (a numerical check appears after this list).

3. Signed rank statistic (Wilcoxon): The signed rank test aims to determine whether P is symmetric about 0. Its signed rank statistic can be rewritten as

W_n^+ = Σ_i 1{X_i > 0} + Σ_{i<j} 1{X_i + X_j > 0}

which, up to normalization, is a sum of two U-statistics with kernels of degree 1 and 2.

4. Kendall's τ: For a two-dimensional probability distribution P, a pair of points (x_1, y_1), (x_2, y_2) is called concordant if the slope of the line connecting the two points is positive, and discordant if the slope is negative. For iid pairs (X_1, Y_1), (X_2, Y_2), if the probability of being concordant exceeds the probability of being discordant, then there is a positive association between X and Y. Kendall's τ measures this association as follows:

τ = 2 ( P[X_1 < X_2, Y_1 < Y_2] + P[X_2 < X_1, Y_2 < Y_1] ) - 1

Hence, if we let h((x_1, y_1), (x_2, y_2)) = 2(1{x_1 < x_2, y_1 < y_2} + 1{x_2 < x_1, y_2 < y_1}) - 1, the corresponding U-statistic is an unbiased estimator of Kendall's τ (a numerical check appears after this list).
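As a sanity check on Examples 2 and 4, the following sketch (added; all variable names are illustrative and not from the notes) evaluates the two U-statistics by brute force and compares them with the unbiased sample variance and with a direct concordant-minus-discordant count.

```python
from itertools import combinations
from math import comb

import numpy as np

rng = np.random.default_rng(1)
n = 30
x = rng.normal(size=n)

# Example 2: U-statistic with the symmetric kernel (1/2)(x_i - x_j)^2.
u_var = sum(0.5 * (x[i] - x[j]) ** 2 for i, j in combinations(range(n), 2)) / comb(n, 2)
print(u_var, x.var(ddof=1))  # the two values agree

# Example 4: Kendall's tau kernel 2*(1{x1<x2, y1<y2} + 1{x2<x1, y2<y1}) - 1.
y = x + rng.normal(size=n)

def kendall_kernel(p, q):
    (x1, y1), (x2, y2) = p, q
    return 2 * ((x1 < x2 and y1 < y2) + (x2 < x1 and y2 < y1)) - 1

pairs = list(zip(x, y))
u_tau = sum(kendall_kernel(pairs[i], pairs[j]) for i, j in combinations(range(n), 2)) / comb(n, 2)
# Direct count of concordant minus discordant pairs, normalized by C(n, 2).
direct = sum(np.sign((x[i] - x[j]) * (y[i] - y[j])) for i, j in combinations(range(n), 2)) / comb(n, 2)
print(u_tau, direct)  # again the two values agree
```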

The goal of the discussion below is to derive the asymptotic distribution of the U-statistic. To do so, we first use the idea of the Hájek projection to project U onto a space S of sums of functions of single observations; we then obtain the asymptotic distribution of the projection, and finally show that the (standardized) difference between U and its projection converges to zero in probability, which gives the result.

Definition 3 (Hájek Projection). Let S be the subspace of random variables of the form

S = { Σ_{i=1}^n g_i(X_i) }

where each g_i is any function with finite second moment. For a random variable T, the Hájek projection maps T to an element of S. Moreover, the projected image is given by

Ŝ = Σ_{i=1}^n E(T | X_i) - (n - 1) E[T]
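As a quick worked example (added here for concreteness; it is not in the original notes), take T = X̄, the sample mean of iid X_1, ..., X_n with mean μ. Then E(T | X_i) = X_i/n + (n-1)μ/n, so

Ŝ = Σ_{i=1}^n ( X_i/n + (n-1)μ/n ) - (n-1)μ = X̄

that is, the sample mean already lies in S and is its own Hájek projection.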

Theorem 4. Suppose we are interested in the limiting distribution of {T_n} and can relate {T_n} to a sequence {S_n} whose limit we know. If T_n - S_n → 0 in probability, then by Slutsky's theorem T_n converges to the same limit.

Now, let S be a linear space of random variables with finite second moments. Then Ŝ is the projection of T onto S if and only if Ŝ ∈ S and E[(T - Ŝ)S] = 0 for all S ∈ S. If S contains the constants, then in addition E[T] = E[Ŝ] and cov(T - Ŝ, S) = 0 for all S ∈ S.

The proof of the last statement in Definition 3 goes by verifying that E[(T - Ŝ)S] = 0 for every S ∈ S, using the smoothing (tower) identity E[E(T | X_i) g_i(X_i)] = E[T g_i(X_i)].

Now we use the result above to project U onto S. Since E[U] = θ, the projection is

Û = Σ_{i=1}^n E(U | X_i) - (n - 1)θ

For convenience, assume that h is permutation symmetric. Also, for 1 ≤ c ≤ r, define

h_c(x_1, ..., x_c) = E[h(x_1, ..., x_c, X_{c+1}, ..., X_r)]

and

σ_c^2 = Var(h_c(X_1, ..., X_c))
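For a concrete illustration (added here; it reuses the variance kernel of Example 2), take h(x_1, x_2) = (1/2)(x_1 - x_2)^2 with E[X] = μ and Var(X) = σ^2. Then

h_1(x) = E[(1/2)(x - X_2)^2] = (1/2)((x - μ)^2 + σ^2)

so σ_1^2 = Var(h_1(X_1)) = (1/4) Var((X_1 - μ)^2). For Gaussian data this equals σ^4/2, and the quantity r^2 σ_1^2 = 2σ^4 appearing below is the familiar limiting variance of the sample variance.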

Using this notation and the fact that h is symmetric, we have

Û - θ = (r/n) Σ_{i=1}^n ( h_1(X_i) - θ )

For a proof of this identity, see reference 1.

Lemma 5. Consider the kernel evaluated on two size-r subsets of the sample that have exactly c indices in common, say h(X_{β_1}, ..., X_{β_r}) and h(X_{β'_1}, ..., X_{β'_r}). Then

Cov[ h(X_{β_1}, ..., X_{β_r}), h(X_{β'_1}, ..., X_{β'_r}) ] = σ_c^2

where σ_c^2 is defined as above.

From the explicit form of Û above, the variance of Û is just Var(Û) = r^2 σ_1^2 / n.

Compute the variance of U:

Var(U) = C(n,r)^{-2} Σ_β Σ_{β'} Cov[ h(X_{β_1}, ..., X_{β_r}), h(X_{β'_1}, ..., X_{β'_r}) ]
       = C(n,r)^{-2} Σ_{c=0}^{r} C(n,r) C(r,c) C(n-r, r-c) σ_c^2        (1)
       = C(n,r)^{-1} Σ_{c=0}^{r} C(r,c) C(n-r, r-c) σ_c^2

since, for a fixed subset β, there are C(r,c) C(n-r, r-c) subsets β' sharing exactly c indices with it (note that σ_0^2 = 0, because h_0 = θ is a constant).
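The following sketch (added; not in the notes) evaluates formula (1) for the variance kernel of Example 2 with standard normal data, using σ_1^2 = 1/2 and σ_2^2 = 2 computed as in the worked example above, and compares it with the known exact variance 2/(n - 1) of the unbiased sample variance.

```python
from math import comb

# Variance kernel h(x1, x2) = (x1 - x2)^2 / 2 with X ~ N(0, 1):
# sigma_1^2 = 1/2 and sigma_2^2 = 2 (see the worked example above).
sigmas = {1: 0.5, 2: 2.0}
r = 2

def var_u(n):
    """Evaluate formula (1) for the variance of the U-statistic."""
    total = sum(comb(r, c) * comb(n - r, r - c) * sigmas[c] for c in (1, 2))
    return total / comb(n, r)

for n in (5, 20, 100):
    # For N(0, 1) data the U-statistic is the unbiased sample variance,
    # whose exact variance is 2 / (n - 1).
    print(n, var_u(n), 2 / (n - 1))
```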

Theorem 6. If σ^2(T_n)/σ^2(S_n) → 1, then

(T_n - E[T_n]) / σ(T_n) - (S_n - E[S_n]) / σ(S_n) → 0 in probability.

In our setting, formula (1) gives

Var(U) / Var(Û) = [ r^2 σ_1^2 / n + O(1/n^2) σ_2^2 + ... + O(1/n^r) σ_r^2 ] / [ r^2 σ_1^2 / n ] → 1
Theorem 7 (Asymptotic Convergence). If E[h^2(X_1, ..., X_r)] < ∞, then √n (U - θ) →d N(0, r^2 σ_1^2).

The theorem follows from the results above: Û - θ is an average of iid terms, so √n (Û - θ) →d N(0, r^2 σ_1^2) by the central limit theorem, and by Theorem 6 together with Slutsky's theorem, √n (U - θ) has the same limit.
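A small simulation (added; names illustrative, not from the notes) checks Theorem 7 for the variance kernel: with standard normal data, r = 2 and σ_1^2 = 1/2, so √n (U - θ) should be approximately N(0, 2).

```python
from math import erf, sqrt

import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 20000
theta = 1.0  # true variance of N(0, 1) data

# For the kernel h(x1, x2) = (x1 - x2)^2 / 2 the U-statistic equals the
# unbiased sample variance, so it can be computed directly.
samples = rng.normal(size=(reps, n))
u = samples.var(axis=1, ddof=1)
z = sqrt(n) * (u - theta)

# Theorem 7 predicts the limit N(0, r^2 * sigma_1^2) = N(0, 2).
print("empirical variance of sqrt(n)(U - theta):", z.var(), "(predicted 2)")
phi = 0.5 * (1 + erf(1.0 / sqrt(2 * 2.0)))  # P(N(0, 2) <= 1)
print("empirical P(<= 1):", (z <= 1.0).mean(), "(predicted", phi, ")")
```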

References

1. Peter Bartlett's notes on U-statistics: http://www.stat.berkeley.edu/~bartlett/courses/2013spring-stat210b/notes/7notes.pdf

2. Thomas Ferguson's notes: https://www.math.ucla.edu/~tom/Stat200C/Ustat.pdf

3. Jon Wellner's notes: https://www.stat.washington.edu/jaw/COURSES/580s/581/HO/HajekProj-HoeffdingExp.pdf

2 Tail Bound

Theorem 8 (Markov's Inequality). Let X be a nonnegative random variable and t > 0. If E[X] < ∞, then

P[X ≥ t] ≤ E[X] / t

Applying Markov's inequality to the variable Y = (X - E[X])^2, we obtain Chebyshev's inequality:

Theorem 9 (Chebyshev's Inequality). Suppose X has finite variance and let μ = E[X]. Then

P[|X - μ| ≥ t] ≤ Var(X) / t^2

Now we give a brief review of moments and the moment generating function (MGF).

Definition 10 (Moment). For a one-dimensional random variable X with PDF f(x), the k-th moment about c is

μ_k = ∫ (x - c)^k f(x) dx

whenever it exists. Typical choices of c are 0 and the mean; when c is the mean, the moment is called a central moment.

Suppose the k-th central moment of X exists. Applying Markov's inequality to |X - μ|^k gives

P[|X - μ| ≥ t] ≤ E[|X - μ|^k] / t^k        (2)

for all t > 0.

Definition 11 (Moment Generating Function). The MGF of a random variable X, when it exists, is defined as

M_X(λ) = E[e^{λX}]

If X is multidimensional, we replace λX by the inner product ⟨λ, X⟩. Unlike the CDF or the characteristic function, the MGF does not always exist. However, when it exists, it is a useful tool; for instance, the k-th moment can be extracted as the k-th derivative of M_X at 0.

Assuming that the MGF exists on [-a, a], we can use Markov's inequality to obtain

P(X - μ ≥ t) = P( e^{λ(X-μ)} ≥ e^{λt} ) ≤ E[e^{λ(X-μ)}] / e^{λt},   t > 0, λ ∈ [0, a]

from which we derive the Chernoff bound:

log P(X - μ ≥ t) ≤ inf_{λ ∈ [0, a]} ( log E[e^{λ(X-μ)}] - λt )        (3)

Note that the Chernoff bound is always sub-optimal to the moment bound: the bound (2), optimized over k, is at least as tight as (3).
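To illustrate the last remark, here is an added sketch (not from the notes): take X ~ Exp(1) and bound P[X ≥ t]. The Chernoff bound is inf_{0 ≤ λ < 1} e^{-λt}/(1 - λ) = t e^{1-t} for t > 1, while the moment bound uses E[X^k] = k!; the optimized moment bound comes out smaller.

```python
from math import exp, factorial

# X ~ Exp(1): E[X^k] = k!, E[exp(lam * X)] = 1 / (1 - lam) for lam < 1,
# and the exact tail is P[X >= t] = exp(-t).
for t in (5.0, 10.0, 20.0):
    chernoff = min(exp(-lam * t) / (1 - lam) for lam in [i / 1000 for i in range(0, 999)])
    moment = min(factorial(k) / t ** k for k in range(1, 60))
    print(f"t={t}: Chernoff {chernoff:.3e}, best moment bound {moment:.3e}, exact {exp(-t):.3e}")
```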

2.1 Sub-Gaussian

For a Gaussian X ~ N(μ, σ^2), the MGF exists everywhere. In particular, we have

E[e^{λX}] = e^{λμ + λ^2 σ^2 / 2}

Applying the Chernoff bound gives

inf_{λ ∈ [0, ∞)} ( log E[e^{λ(X-μ)}] - λt ) = - t^2 / (2σ^2)

and so

P[X ≥ μ + t] ≤ e^{-t^2/(2σ^2)}

This gives a probability bound on the upper tail, and we thus refer to this inequality as an upper deviation inequality. It also motivates the study of random variables whose tail probabilities are bounded by an expression of the same Gaussian form.

Definition 12 (Sub-Gaussian). A random variable X with mean μ is sub-Gaussian with parameter σ > 0, written X ∈ SG(σ), if

E[e^{λ(X-μ)}] ≤ e^{λ^2 σ^2 / 2}   for all λ ∈ R.

For a sub-Gaussian RV with parameter σ, the Chernoff argument above gives the upper deviation inequality P[X ≥ μ + t] ≤ e^{-t^2/(2σ^2)}; by symmetry of the definition, -X is also SG(σ), which gives the corresponding lower deviation inequality. Combining the two gives a concentration inequality:

P[|X - μ| ≥ t] ≤ 2 e^{-t^2/(2σ^2)}

Examples of Sub-Gaussian RVs

1. Rademacher ([1], Example 2.2): X ∈ {-1, +1} with equal probability. Then

E[e^{λX}] ≤ e^{λ^2/2}

Hence, the Rademacher variable is SG(1).

2. Bounded RV ([1], Example 2.3): Suppose X ∈ [c_1, c_2]. Then X is SG(σ) with σ = (c_2 - c_1)/2. The argument in the book uses a symmetrization approach and gives the weaker constant σ = c_2 - c_1.

Lemma 13 (Hoeffding). Suppose X ∈ [a, b] almost surely. Then

E[e^{λ(X - E[X])}] ≤ e^{λ^2 (b - a)^2 / 8}

A bound for the sum of independent RVs is given by:

Theorem 14 (Hoeffding Bound). Let X_1, ..., X_n ~ P be iid, where each X_i ∈ SG(σ), and let X̄ be the sample mean. Then X̄ ∈ SG(σ/√n). More generally, for independent X_i ∈ SG(σ_i) with means μ_i,

P[ Σ_{i=1}^n (X_i - μ_i) ≥ t ] ≤ exp( - t^2 / (2 Σ_{i=1}^n σ_i^2) )        (4)

Remark. Combining this with Hoeffding's lemma yields tail bounds for sums of bounded RVs.
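A quick simulation of the remark (added; not from the notes): iid Bernoulli(1/2) variables lie in [0, 1] and are therefore SG(1/2) by Hoeffding's lemma, so bound (4) with σ_i = 1/2 reads P[Σ(X_i - 1/2) ≥ t] ≤ exp(-2t^2/n).

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 100, 200000
x = rng.integers(0, 2, size=(reps, n))      # Bernoulli(1/2) samples
dev = x.sum(axis=1) - n / 2                 # sum of (X_i - 1/2)

for t in (5.0, 10.0, 15.0):
    empirical = (dev >= t).mean()
    hoeffding = np.exp(-2 * t ** 2 / n)     # bound (4) with sigma_i = 1/2
    print(f"t={t}: empirical {empirical:.4f} <= Hoeffding bound {hoeffding:.4f}")
```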

For SG RVs, there are several other equivalent characterizations:

Theorem 15 (Equivalence; Thm 2.1 in [1]). Suppose E[X] = 0. The following are equivalent:

1. There is some σ > 0 such that for all λ ∈ R:

E[e^{λX}] ≤ e^{λ^2 σ^2 / 2}

2. There are some c ≥ 1 and Z ~ N(0, τ^2) such that for all t ≥ 0:

P[|X| ≥ t] ≤ c P[|Z| ≥ t]

3. There is some θ ≥ 0 such that for all integers k ≥ 1:

E[X^{2k}] ≤ ((2k)! / (2^k k!)) θ^{2k}

4. For all λ ∈ [0, 1):

E[e^{λ X^2 / (2σ^2)}] ≤ 1 / √(1 - λ)
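Condition 3 holds with equality for a centered Gaussian: if Z ~ N(0, σ^2), then E[Z^{2k}] = (2k - 1)!! σ^{2k} = (2k)!/(2^k k!) σ^{2k}. The short check below (added; not from the notes) confirms the combinatorial identity and compares it against Monte Carlo moments.

```python
from math import factorial

import numpy as np

rng = np.random.default_rng(4)
z = rng.normal(size=2_000_000)  # sigma = 1

for k in range(1, 5):
    bound = factorial(2 * k) / (2 ** k * factorial(k))    # (2k)! / (2^k k!)
    double_fact = np.prod(np.arange(2 * k - 1, 0, -2))    # (2k - 1)!!
    mc = (z ** (2 * k)).mean()                            # Monte Carlo E[Z^(2k)]
    print(k, bound, double_fact, round(mc, 3))
```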

2.2 Sub-Exponential RVs

A SG RV has tails that decay at least at the rate e^{-cx^2}, where x is the deviation from the mean and c is some constant. Such a decay rate captures only a limited class of RVs of interest, so we now consider RVs whose tails are allowed to decay more slowly:

Definition 16 (Sub-Exponential RVs). A RV X with mean μ is sub-exponential, written X ∈ SE(ν, α), if there are nonnegative parameters (ν, α) such that

E[e^{λ(X-μ)}] ≤ e^{λ^2 ν^2 / 2}   for all |λ| < 1/α.

Comparing this with the definition of SG RVs, one finds that a SE RV bounds the MGF only in a neighborhood of 0, whereas a SG RV bounds the MGF on the entire real line.

Example

1. Chi-squared ([1], Example 2.4): Let Z ~ N(0, 1) and X = Z^2. Then

E[e^{λ(X-1)}] = e^{-λ} / √(1 - 2λ),   λ < 1/2,

which is bounded by e^{4λ^2/2} for |λ| < 1/4, so X ∈ SE(2, 4) (a numerical check follows below). For a general X ∈ SE(ν, α), the Chernoff argument gives

P(X - μ ≥ t) ≤ exp( -λt + λ^2 ν^2 / 2 ),   0 ≤ λ < 1/α.

Unlike the SG case, where the tail bound can be found by solving an unconstrained optimization problem, here the MGF is controlled only in a neighborhood of 0, so the optimization problem becomes a constrained one: find the infimum over λ ∈ [0, 1/α).

Theorem 17 (Sub-Exponential RVs Tail Bound). Suppose X ∈ SE(ν, α). Then

P(X - μ ≥ t) ≤ e^{-t^2/(2ν^2)}   for 0 ≤ t ≤ ν^2/α,

P(X - μ ≥ t) ≤ e^{-t/(2α)}   for t > ν^2/α.

Theorem 18 (Bernstein's Condition). Let X be a random variable with mean μ such that the variance σ^2 exists. Bernstein's condition with parameter b > 0 requires that, for all integers k ≥ 3,

|E[(X - μ)^k]| ≤ (1/2) k! σ^2 b^{k-2}.

This is a sufficient but not necessary condition for sub-exponentiality. In particular, with the parameters above, X ∈ SE(√2 σ, 2b).

Theorem 19 (Bernstein's Inequality). Suppose X satisfies Bernstein's condition with parameters σ and b. Then for all |λ| < 1/b,

E[e^{λ(X-μ)}] ≤ exp( λ^2 σ^2 / (2(1 - b|λ|)) )

which, via the Chernoff argument, leads to a concentration inequality: for 0 ≤ λ < 1/b,

P[|X - μ| > t] ≤ 2 exp( -λt + λ^2 σ^2 / (2(1 - b|λ|)) ).

If we pick λ = t/(bt + σ^2), then we get

P[|X - μ| > t] ≤ 2 e^{- t^2 / (2(bt + σ^2))}.
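A quick numeric check (added; not from the notes) that the choice λ = t/(bt + σ^2) turns the exponent -λt + λ^2 σ^2 / (2(1 - b|λ|)) into -t^2 / (2(bt + σ^2)); the parameter values are arbitrary.

```python
def bernstein_exponent(lam, t, sigma2, b):
    """Exponent -lam*t + lam^2*sigma^2 / (2*(1 - b*|lam|)) from Theorem 19."""
    return -lam * t + lam ** 2 * sigma2 / (2 * (1 - b * abs(lam)))

sigma2, b = 1.5, 0.7
for t in (0.5, 2.0, 10.0):
    lam = t / (b * t + sigma2)                 # the choice made in the text
    lhs = bernstein_exponent(lam, t, sigma2, b)
    rhs = -t ** 2 / (2 * (b * t + sigma2))     # claimed simplified exponent
    print(t, lhs, rhs)                         # the two values coincide
```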

Lemma 20 (Johnson-Lindenstrauss). A deterministic statement of the JL lemma says: given ε ∈ (0, 1) and a set X of n points in R^d, if m > 8 log n / ε^2, then there is a map F : R^d → R^m such that for any x_1, x_2 ∈ X,

1 - ε ≤ ||F(x_1) - F(x_2)||^2 / ||x_1 - x_2||^2 ≤ 1 + ε.

The probabilistic version says that, for a suitable random F and any one of the C(n,2) pairs of points x_1, x_2,

P[ ||F(x_1) - F(x_2)||^2 / ||x_1 - x_2||^2 ∉ [1 - ε, 1 + ε] ] ≤ 2 e^{-m ε^2 / 8},

and the deterministic statement is obtained from a union bound over the pairs (for m of this order).
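A minimal random-projection sketch (added; the Gaussian construction F(x) = Ax/√m with iid N(0, 1) entries is one standard choice, not necessarily the one intended in the notes): project n points from R^d to R^m and report the worst pairwise distortion.

```python
from itertools import combinations
from math import log

import numpy as np

rng = np.random.default_rng(6)
n, d, eps = 50, 1000, 0.5
m = int(8 * log(n) / eps ** 2) + 1          # dimension of the order suggested by the lemma

points = rng.normal(size=(n, d))
A = rng.normal(size=(m, d))
proj = points @ A.T / np.sqrt(m)            # F(x) = A x / sqrt(m)

ratios = [
    np.sum((proj[i] - proj[j]) ** 2) / np.sum((points[i] - points[j]) ** 2)
    for i, j in combinations(range(n), 2)
]
# The ratios concentrate around 1; compare the observed range with the target band.
print("m =", m, "min ratio =", min(ratios), "max ratio =", max(ratios))
print("target band:", 1 - eps, 1 + eps)
```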

