10/36-702 Statistical Machine Learning: Homework 1
February 1, 2019
(a) Suppose we used the direct estimator of the loss, namely, we replace the integral with the average to get
\[
\hat L(h) = \int \hat p^2(x)\,dx - \frac{2}{n}\sum_i \hat p(X_i).
\]
Show that this fails in the sense that it is minimized by taking h = 0.
Show that
\[
\hat L(h) = \frac{2}{(n-1)h} - \frac{n+1}{n^2(n-1)h}\sum_j Z_j^2,
\]
where $Z_j$ is the number of observations in bin $j$.
Solution.
Define
\[
\hat\theta_j = \frac{1}{n}\sum_{i=1}^n \mathbf{1}(X_i \in B_j) \quad\text{and}\quad Z_j = n\hat\theta_j
\]
for $j = 1, \dots, m$.
(a) (7 pts.)
\begin{align*}
\hat L(h) &= \int_0^1 \hat p^2(x)\,dx - \frac{2}{n}\sum_i \hat p(X_i) \\
&= \int_0^1 \Big(\sum_{j=1}^m \frac{\hat\theta_j}{h}\,\mathbf{1}(x\in B_j)\Big)^2 dx - \frac{2}{n}\sum_{i=1}^n\sum_{j=1}^m \frac{\hat\theta_j}{h}\,\mathbf{1}(X_i\in B_j) \\
&= \frac{1}{h^2}\int_0^1 \Big(\sum_{k=1}^m\sum_{j=1}^m \hat\theta_j\hat\theta_k\,\mathbf{1}(x\in B_j\cap B_k)\Big)dx - \frac{2}{h}\sum_{j=1}^m \hat\theta_j\cdot\frac{1}{n}\sum_{i=1}^n \mathbf{1}(X_i\in B_j) \\
&= \frac{1}{h^2}\int_0^1 \Big(\sum_{j=1}^m \hat\theta_j^2\,\mathbf{1}(x\in B_j)\Big)dx - \frac{2}{h}\sum_{j=1}^m \hat\theta_j^2 \\
&= \frac{1}{h^2}\sum_{j=1}^m \hat\theta_j^2 \int_0^1 \mathbf{1}(x\in B_j)\,dx - \frac{2}{h}\sum_{j=1}^m \hat\theta_j^2 \\
&= \frac{1}{h}\sum_{j=1}^m \hat\theta_j^2 - \frac{2}{h}\sum_{j=1}^m \hat\theta_j^2 \\
&= -\frac{1}{h}\sum_{j=1}^m \hat\theta_j^2 \\
&= -\frac{1}{hn^2}\sum_{j=1}^m Z_j^2.
\end{align*}
Since $\sum_{j=1}^m Z_j^2 \ge \sum_{j=1}^m Z_j = n > 0$, this quantity decreases without bound as $h \to 0$, so the criterion is minimized by taking $h = 0$.
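As a quick numerical illustration of this degeneracy (not part of the original solution; the Beta(2,5) sample and the bin grid below are assumptions made only for the example), the following Python sketch evaluates $-\frac{1}{hn^2}\sum_j Z_j^2$ for shrinking bin widths and shows it decreasing without bound:
\begin{verbatim}
# Minimal sketch: the naive plug-in score -(1/(h n^2)) sum_j Z_j^2 only
# decreases as the bin width h shrinks, so it is "minimized" at h = 0.
import numpy as np

rng = np.random.default_rng(0)
x = rng.beta(2, 5, size=200)      # illustrative sample supported on [0, 1]
n = len(x)

def naive_score(m):
    h = 1.0 / m                   # bin width for m equal-width bins on [0, 1]
    Z, _ = np.histogram(x, bins=m, range=(0.0, 1.0))   # bin counts Z_j
    return -np.sum(Z**2) / (h * n**2)

for m in [2, 10, 50, 250]:
    print(f"m = {m:4d}  (h = {1/m:7.4f})   naive score = {naive_score(m):12.2f}")
\end{verbatim}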
(b) (8 pts.)
From part (a) we have
\[
\int \hat p^2(x)\,dx = \frac{1}{h}\sum_{j=1}^m \hat\theta_j^2. \tag{1}
\]
\begin{align*}
\hat L(h) &= \frac{1}{h}\sum_{j=1}^m \hat\theta_j^2 - \frac{2}{n(n-1)h}\sum_{j=1}^m \big(n^2\hat\theta_j^2 - n\hat\theta_j\big) \\
&= \frac{2}{(n-1)h}\underbrace{\sum_{j=1}^m \hat\theta_j}_{=1} + \sum_{j=1}^m \hat\theta_j^2\left(\frac{1}{h} - \frac{2n}{(n-1)h}\right) \\
&= \frac{2}{(n-1)h} - \frac{n+1}{(n-1)h}\sum_{j=1}^m \hat\theta_j^2 \\
&= \frac{2}{(n-1)h} - \frac{n+1}{n^2(n-1)h}\sum_{j=1}^m Z_j^2.
\end{align*}
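To make the formula concrete, here is a small Python sketch (not from the original solution; the simulated Beta(2,5) data and the candidate grid of bin counts are assumptions for the example) that evaluates this cross-validation score and selects the bin width minimizing it:
\begin{verbatim}
# Minimal sketch of the cross-validation score derived above, assuming data
# supported on [0, 1] and m equal-width bins of width h = 1/m.
import numpy as np

rng = np.random.default_rng(0)
x = rng.beta(2, 5, size=500)
n = len(x)

def cv_score(m):
    h = 1.0 / m
    Z, _ = np.histogram(x, bins=m, range=(0.0, 1.0))   # bin counts Z_j
    return (2.0 / ((n - 1) * h)
            - (n + 1) / (n**2 * (n - 1) * h) * np.sum(Z**2))

ms = np.arange(2, 101)                 # candidate numbers of bins
scores = [cv_score(m) for m in ms]
m_star = ms[int(np.argmin(scores))]
print(f"selected number of bins: {m_star}  (h = {1/m_star:.4f})")
\end{verbatim}
Unlike the naive score from part (a), this criterion does not collapse at $h = 0$: as $h \to 0$ it behaves like $\frac{1}{nh}$ and blows up, so the minimizer is a nondegenerate bin width.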
Hint: Recall that the Lyapunov central limit theorem says the following: Suppose that $Y_1, Y_2, \dots$ are independent. Let $\mu_i = \mathbb E[Y_i]$ and $\sigma_i^2 = \mathrm{Var}(Y_i)$, and let $s_n^2 = \sum_{i=1}^n \sigma_i^2$. If, for some $\delta > 0$,
\[
\lim_{n\to\infty} \frac{1}{s_n^{2+\delta}}\sum_{i=1}^n \mathbb E\big[\,|Y_i - \mu_i|^{2+\delta}\,\big] = 0,
\]
then $\frac{1}{s_n}\sum_{i=1}^n (Y_i - \mu_i) \rightsquigarrow N(0,1)$.
(b) Assume that the smoothness is $\beta = 2$. Suppose that the bandwidth $h_n$ is chosen optimally. Show that
\[
\frac{\hat p_h(x) - p(x)}{s_n(x)} \rightsquigarrow N\big(b(x), 1\big)
\]
for some constant $b(x)$ which is, in general, not 0.
Solution.
(a) (8 pts.)
Caveat: The classical central limit theorem cannot be applied here, since $h = h_n$ is a function of $n$ and thus the terms $K\big(\frac{\|x - X_i\|}{h}\big)$ are not identically distributed. However, as the hint suggests, the Lyapunov CLT still holds for non-identically distributed random variables.
Now,
\[
\mathbb E\left[\left|\frac{1}{nh}K\Big(\frac{\|x - X_i\|}{h}\Big) - \frac{p_h(x)}{n}\right|^{2+\delta}\right]
= \frac{1}{n^{2+\delta}}\,\mathbb E\left[\left|\frac{1}{h}K\Big(\frac{\|x - X_i\|}{h}\Big) - p_h(x)\right|^{2+\delta}\right]
= \Theta\!\left(\frac{1}{n^{2+\delta}h^{1+\delta}}\right),
\]
and
\[
s_n^2 = \sum_{i=1}^n \mathbb E\left[\left|\frac{1}{nh}K\Big(\frac{\|x - X_i\|}{h}\Big) - \frac{p_h(x)}{n}\right|^{2}\right]
= \frac{1}{n}\,\mathbb E\left[\left|\frac{1}{h}K\Big(\frac{\|x - X_i\|}{h}\Big) - p_h(x)\right|^{2}\right]
= \Theta\!\left(\frac{1}{nh}\right).
\]
Therefore,
\begin{align*}
\frac{1}{s_n^{2+\delta}}\sum_{i=1}^n \mathbb E\left[\left|\frac{1}{nh}K\Big(\frac{\|x - X_i\|}{h}\Big) - \frac{p_h(x)}{n}\right|^{2+\delta}\right]
&= \Theta\big((nh)^{1+\frac{\delta}{2}}\big)\cdot n\cdot\Theta\!\left(\frac{1}{n^{2+\delta}h^{1+\delta}}\right) \\
&= \Theta\big((nh)^{-\frac{\delta}{2}}\big) \to 0.
\end{align*}
Therefore,
\[
\frac{\hat p_h(x) - p(x)}{s_n(x)} \rightsquigarrow N\big(b(x), 1\big).
\]
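A Monte Carlo sketch of this phenomenon (not part of the original solution; the standard normal data, Gaussian kernel, evaluation point $x = 0$, and bandwidth $h = n^{-1/5}$ are assumptions chosen only for the illustration) shows the standardized error centering away from 0:
\begin{verbatim}
# Sketch: with the optimal-rate bandwidth for beta = 2 (h ~ n^(-1/5)),
# the standardized KDE error at a point has a clearly nonzero mean.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, reps, x = 2000, 500, 0.0
h = n ** (-1 / 5)                      # optimal-rate bandwidth for beta = 2

est = np.empty(reps)
for r in range(reps):
    X = rng.standard_normal(n)
    est[r] = np.mean(norm.pdf((x - X) / h)) / h   # Gaussian-kernel KDE at x

sd = est.std(ddof=1)                   # Monte Carlo proxy for s_n(x)
z_mean = (est.mean() - norm.pdf(x)) / sd
print(f"standardized bias (estimate of b(x)): {z_mean:+.2f}")
\end{verbatim}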
Solution.
\begin{align*}
\mathbb E[\hat p(0)] &= \mathbb E\left[\frac{1}{n}\sum_{i=1}^n \frac{1}{h}K\Big(\frac{-X_i}{h}\Big)\right] \\
&= \mathbb E\left[\frac{1}{h}K\Big(\frac{X_i}{h}\Big)\right] \\
&= \frac{1}{h}\int_0^1 K\Big(\frac{u}{h}\Big)p(u)\,du \\
&= \int_0^{1/h} K(t)\,p(ht)\,dt \qquad \text{letting } t = \frac{u}{h} \\
&= \int_0^{1/h} K(t)\Big(p(0) + ht\,\partial_+ p(0) + \frac{h^2t^2}{2}\,\partial_+^2 p(0) + o(h^2)\Big)dt \\
&= p(0)\int_0^{1/h} K(t)\,dt + O(h)\int_0^{1/h} tK(t)\,dt + O(h^2)\underbrace{\int_0^{1/h} t^2K(t)\,dt}_{\le\, \sigma_K^2/2\, <\, \infty} \\
&\le p(0) + O(h),
\end{align*}
where we assumed $K(\cdot)$ is supported on $[-1,1]$, $h \le 1$, and $\int_0^{1/h} tK(t)\,dt$ is bounded.
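The boundary effect this expansion captures can be seen numerically. In the sketch below (not part of the original solution), the Uniform(0,1) sample, the Epanechnikov kernel, and the bandwidth are assumptions for the example; since that kernel is symmetric and supported on $[-1,1]$, $\int_0^{1/h} K(t)\,dt = \tfrac{1}{2}$, so the estimate at $0$ concentrates near $p(0)/2$ rather than $p(0)$:
\begin{verbatim}
# Sketch of the boundary bias at x = 0 for data supported on [0, 1].
import numpy as np

rng = np.random.default_rng(0)

def epanechnikov(t):
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)

n, h, reps = 5000, 0.05, 200
estimates = np.empty(reps)
for r in range(reps):
    X = rng.uniform(0.0, 1.0, size=n)                         # p(0) = 1
    estimates[r] = np.mean(epanechnikov((0.0 - X) / h)) / h   # p_hat(0)

print(f"average p_hat(0) over {reps} replications: {estimates.mean():.3f}")
print("true p(0) = 1.0")
\end{verbatim}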
Solution.
We assume $p$ has $m$ bounded derivatives, so that $p \in \Sigma(m, L)$ for some constant $L > 0$. Let us first analyze the bias $b(x)$:
\begin{align*}
\mathbb E[\hat p(x)] - p(x) &= \mathbb E\left[\frac{1}{n}\sum_{i=1}^n \frac{1}{h}K\Big(\frac{x - X_i}{h}\Big)\right] - p(x) \\
&= \mathbb E\left[\frac{1}{h}K\Big(\frac{x - X_1}{h}\Big)\right] - p(x) \\
&= \int \frac{1}{h}K\Big(\frac{x-u}{h}\Big)p(u)\,du - p(x) \\
&= \int K(t)\,p(x - th)\,dt - p(x) \qquad \text{where } t = \frac{x-u}{h} \\
&= \int K(t)\Big[p(x) - th\,p'(x) + \frac{t^2h^2}{2}p''(x) + \dots + \frac{(-th)^{m-1}}{(m-1)!}p^{(m-1)}(x) \\
&\qquad\qquad + \frac{(-th)^m}{m!}p^{(m)}(w)\Big]dt - p(x), \qquad w \in (x - th,\, x), \ \text{by Taylor expansion.}
\end{align*}
Given that $\int K(y)\,dy = 1$, we have $\int K(t)\,p(x)\,dt = p(x)$; and since the kernel satisfies $\int y^j K(y)\,dy = 0$ for $1 \le j \le m - 1$, we are left with:
\begin{align*}
\big|\mathbb E[\hat p(x)] - p(x)\big| &= \left|\int K(t)\,\frac{(-th)^m}{m!}\,p^{(m)}(w)\,dt\right| \\
&\le \frac{h^m}{m!}\int |K(t)|\,|t|^m\,\big|p^{(m)}(w)\big|\,dt \\
&\le \frac{Lh^m}{m!}\int |K(t)|\,|t|^m\,dt = Ch^m \qquad \text{for some } 0 < C < \infty.
\end{align*}
And so we have that $\int b(x)^2\,dx \le C'h^{2m}$ for some $0 < C' < \infty$.

Turning to the variance, we have:
\begin{align*}
\mathbb V(\hat p(x)) &= \mathbb V\left(\frac{1}{n}\sum_{i=1}^n \frac{1}{h}K\Big(\frac{x - X_i}{h}\Big)\right) \\
&= \frac{1}{nh^2}\,\mathbb V\left(K\Big(\frac{x - X_1}{h}\Big)\right) \\
&\le \frac{1}{nh^2}\,\mathbb E\left[K\Big(\frac{x - X_1}{h}\Big)^2\right] \\
&= \frac{1}{nh^2}\int K\Big(\frac{x-u}{h}\Big)^2 p(u)\,du \\
&= \frac{1}{nh}\int K(t)^2\,p(x - th)\,dt \qquad \text{where } t = \frac{x-u}{h} \\
&\le \frac{\sup_x p(x)}{nh}\int K(t)^2\,dt \le \frac{C''}{nh} \qquad \text{for some } 0 < C'' < \infty,
\end{align*}
since densities in $\Sigma(m, L)$ are uniformly bounded.
The optimal bandwidth is therefore given by
\[
\frac{\partial}{\partial h}\Big(\frac{1}{nh} + h^{2m}\Big) = 0 \;\Longrightarrow\; -\frac{1}{nh^2} + 2m\,h^{2m-1} = 0 \;\Longrightarrow\; h^* = (2mn)^{-\frac{1}{2m+1}} \asymp n^{-\frac{1}{2m+1}}.
\]
And so the convergence rate is:
\[
\frac{1}{nh^*} + (h^*)^{2m} \asymp n^{-\frac{2m}{2m+1}}.
\]
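For instance, in the classical case $m = 2$ the expressions above specialize to
\[
h^* = (4n)^{-\frac{1}{5}} \asymp n^{-\frac{1}{5}}, \qquad \frac{1}{nh^*} + (h^*)^4 \asymp n^{-\frac{4}{5}}.
\]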
(b) Define the Sobolev ellipsoid $E(m, L)$ of order $m$ as the set of densities of the form $p(x) = \sum_{j=1}^\infty \beta_j\varphi_j(x)$ where $\sum_{j=1}^\infty \beta_j^2 j^{2m} < L^2$. Show that the risk for any density in $E(m, L)$ is bounded by $c\,[(k/n) + (1/k)^{2m}]$. Using this bound, find the optimal value of $k$ and find the corresponding risk.
Solution.
(b) (5 pts.)
\begin{align*}
\sup_{p\in E(m,L)} R(\hat p(x)) &\le \frac{C^2 k}{n} + \sum_{j=k+1}^\infty \beta_j^2 \qquad \text{from part (a)} \\
&= \frac{C^2 k}{n} + \frac{k^{2m}\sum_{j=k+1}^\infty \beta_j^2}{k^{2m}} \\
&\le \frac{C^2 k}{n} + \frac{\sum_{j=k+1}^\infty \beta_j^2\, j^{2m}}{k^{2m}} \\
&\le \frac{C^2 k}{n} + \frac{L^2}{k^{2m}} \\
&\le \max\{C^2, L^2\}\left(\frac{k}{n} + \frac{1}{k^{2m}}\right).
\end{align*}
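To address the remaining part of the question, minimizing the bound $\frac{k}{n} + k^{-2m}$ over $k$ gives
\[
\frac{d}{dk}\Big(\frac{k}{n} + k^{-2m}\Big) = \frac{1}{n} - 2m\,k^{-2m-1} = 0
\;\Longrightarrow\; k^* = (2mn)^{\frac{1}{2m+1}} \asymp n^{\frac{1}{2m+1}},
\]
so the corresponding risk is of order
\[
\frac{k^*}{n} + (k^*)^{-2m} \asymp n^{-\frac{2m}{2m+1}}.
\]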
(c) Let $K$ be a kernel. Recall that the convolution of a density $p$ with $K$ is $(p\star K)(x) = \int p(z)K(x - z)\,dz$. Show that
\[
\int |p\star K - q\star K| \le \int |K| \int |p - q|.
\]
Hence, smoothing reduces $L_1$ distance.
(e) Let $\hat p$ be a histogram on $\mathbb R$ with binwidth $h$. Under some regularity conditions it can be shown that
\[
\mathbb E\int |p - p_n| \approx \sqrt{\frac{2}{\pi n h}}\int \sqrt{p} + \frac{h}{4}\int |p'|.
\]
Hence, this risk can be unbounded if $\int\sqrt{p} = \infty$. A density is said to have a regularly varying tail of order $r$ if $\lim_{x\to\infty} p(tx)/p(x) = t^r$ for all $t > 0$ and $\lim_{x\to-\infty} p(tx)/p(x) = t^r$ for all $t > 0$. Suppose that $p$ has a regularly varying tail of order $r$ with $r < -2$. Show that the risk bound above is bounded.
Solution.
\begin{align*}
&= \int_B p(x)\,dx - \int_B q(x)\,dx \\
&= P(B) - Q(B)
\end{align*}
\[
\Longrightarrow\quad \frac{1}{2}\int |p - q| \ge P(B) - Q(B) \qquad \text{for any measurable } B \subseteq \mathbb R.
\]
By noting
\[
\frac{1}{2}\int |p - q| = \frac{1}{2}\int |q - p|,
\]
parallel reasoning shows
\[
\frac{1}{2}\int |p - q| \ge Q(B) - P(B) \qquad \text{for any measurable } B \subseteq \mathbb R.
\]
So together we have
\[
\frac{1}{2}\int |p - q| \ge |P(B) - Q(B)|,
\]
and thus
\[
\frac{1}{2}\int |p - q| \ge \sup_{B\subseteq\mathbb R} |P(B) - Q(B)|. \tag{3}
\]
\begin{align*}
&= \int_{B'} p(x)\,dx - \int_{B'} q(x)\,dx \\
&= P(B') - Q(B') \\
&= |P(B') - Q(B')|.
\end{align*}
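A quick numerical check of the total-variation identity this argument builds toward (not part of the original solution; the two Gaussian densities and the grid are assumptions for the example):
\begin{verbatim}
# Check: (1/2) * int |p - q| equals P(B') - Q(B') for B' = {x : p(x) > q(x)}.
import numpy as np
from scipy.stats import norm

dx = 0.001
grid = np.arange(-12, 12, dx)
p = norm.pdf(grid, loc=0.0, scale=1.0)
q = norm.pdf(grid, loc=1.0, scale=2.0)

half_l1 = 0.5 * np.sum(np.abs(p - q)) * dx
B = p > q                               # the set where p exceeds q
diff_on_B = np.sum(p[B] - q[B]) * dx    # P(B') - Q(B')

print(f"(1/2) L1 distance : {half_l1:.4f}")
print(f"P(B') - Q(B')     : {diff_on_B:.4f}   (agrees up to grid error)")
\end{verbatim}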
(b) (5 pts.)
Let $\mathcal F$ be the $\sigma$-field generated by the sets $A$ on the sample space $\Omega$, and
\[
\le \sup_{A\in\mathcal F} |P(X\in A) - P(Y\in A)|.
\]
(c) (5 pts.)
\begin{align*}
\int |p\star K - q\star K| &= \int \left|\int p(z)K(x - z)\,dz - \int q(z)K(x - z)\,dz\right| dx \\
&= \int \left|\int \big(p(z) - q(z)\big)K(x - z)\,dz\right| dx \\
&\le \int \int |p(z) - q(z)|\,|K(x - z)|\,dz\,dx \\
&= \int |K| \int |p - q|,
\end{align*}
where the last equality uses Fubini's theorem.
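A small numerical check of this inequality (not part of the original solution; the two Gaussian densities, the Gaussian kernel, and the grid are assumptions for the example):
\begin{verbatim}
# Check: smoothing by a kernel with int |K| = 1 does not increase L1 distance
# (up to discretization error from the grid convolution).
import numpy as np
from scipy.stats import norm

dx = 0.01
grid = np.arange(-10, 10, dx)
p = norm.pdf(grid, loc=0.0, scale=1.0)       # density p
q = norm.pdf(grid, loc=0.5, scale=1.5)       # density q
K = norm.pdf(grid, loc=0.0, scale=0.3)       # kernel, integrates to 1

def l1(f, g):
    return np.sum(np.abs(f - g)) * dx

pK = np.convolve(p, K, mode="same") * dx     # p * K on the grid
qK = np.convolve(q, K, mode="same") * dx     # q * K on the grid

print(f"L1(p, q)       = {l1(p, q):.4f}")
print(f"L1(p*K, q*K)   = {l1(pK, qK):.4f}   (no larger, since int |K| = 1)")
\end{verbatim}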
(d) (10 pts.) Here we can further assume that the density has bounded support; see the appendix for a proof without this assumption. By the Cauchy--Schwarz inequality,
\[
\Big(\int |p - p_n|\Big)^2 \le \int (p - p_n)^2 \int 1^2 \to 0,
\]
where all integrals are over the bounded support.
Appendix
Proof of Claim in Problem 2.
From $\frac{1}{2^p}|a|^p - |b|^p \le |a - b|^p \le 2^p|a|^p + 2^p|b|^p$, we have the bounds below. Writing $Z_i = \frac{1}{h}K\big(\frac{\|x - X_i\|}{h}\big)$, so that $\mathbb E[Z_i] = p_h(x)$, we then have
\[
\mathbb E[|Z_i|^p] = \frac{1}{h^p}\int |K|^p\Big(\frac{\|x - u\|}{h}\Big)p(u)\,du = \frac{1}{h^{p-1}}\int |K|^p(\|v\|)\,p(x + hv)\,dv.
\]
So as $h \to 0$, choose any interval $[a, b]$ on which $|K|^p(\|v\|) > 0$; then $\int |K|^p(\|v\|)\,p(x + hv)\,dv \ge \int_a^b |K|^p(\|v\|)\,p(x + hv)\,dv \to \int_a^b |K|^p(\|v\|)\,p(x)\,dv > 0$ by the bounded convergence theorem. Also, $\int |K|^p(\|v\|)\,p(x + hv)\,dv \le \int |K|^p(\|v\|)\,\sup_x p(x)\,dv < \infty$, hence $\int |K|^p(\|v\|)\,p(x + hv)\,dv = \Theta(1)$,
and accordingly,
\[
\mathbb E[|Z_i|^p] = \Theta\Big(\frac{1}{h^{p-1}}\Big).
\]
Then
\[
|p_h(x)| = |\mathbb E[Z_i]| \le \mathbb E[|Z_i|] = O(1).
\]
Hence
\[
\Theta\Big(\frac{1}{h^{p-1}}\Big) = 2^{-p}\,\mathbb E[|Z_i|^p] - p_h(x)^p \le \mathbb E\big[|Z_i - p_h(x)|^p\big] \le 2^p\,\mathbb E[|Z_i|^p] + 2^p\,p_h(x)^p = \Theta\Big(\frac{1}{h^{p-1}}\Big),
\]
which implies
\[
\mathbb E\big[|Z_i - p_h(x)|^p\big] = \Theta\Big(\frac{1}{h^{p-1}}\Big).
\]
For the claim in part (d) without the bounded-support assumption: since
\[
0 \le \int |p - p_n| \le \int (p + p_n) = 2,
\]
a generalized form of the dominated convergence theorem gives
\[
\int |p - p_n| \to \int \big(\lim |p - p_n|\big) = 0.
\]