
Univariate Kernel Density Estimation

Zhi Ouyang

August, 2005

1 Questions:
• What are the statistical properties of kernel estimators?

• What influence does the shape/scaling of the kernel functions have on the estimators?

• How does one choose the scaling parameter in practice?

• How can kernel smoothing ideas be used to make confidence statements?

• How do dependencies in the data affect the kernel regression estimator?

• How can one best deal with multiple predictor variables?

2 The histogram is a kind of density estimator.


• The binwidth is the smoothing parameter.

• Sensitivity to the placement of the bin edges is a problem not shared by other density estimators;
one solution is the averaged shifted histogram [10], which is an appealing motivation for kernel
methods.

• Drawback: the estimate is a step function.

• Multivariate histogram.

• Histograms do not use the data as efficiently as kernel estimators.

3 Why univariate kernel density estimator?


• An effective way to show the structure of the data when we do not want to impose a specific
parametric form on the density.

• Easy to start with.


4 The estimator
Suppose we have a random sample X1 , . . . , Xn taken from a continuous, univariate density f .
\[
\hat f(x;h) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x-X_i}{h}\right);
\quad\text{or}\quad
\hat f(x;h) = \frac{1}{n}\sum_{i=1}^{n} K_h(x-X_i),
\quad\text{where } K_h(u) = \frac{1}{h}K\!\left(\frac{u}{h}\right).
\]

We shall see that the choice of the shape of the kernel function is not particularly important, but
the choice of the value of the bandwidth is crucial.
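
As a concrete illustration of the formula above, here is a minimal sketch in Python (assuming NumPy is available; the function and variable names are our own, not from any particular library) of the estimator with a standard normal kernel. The bandwidth is simply supplied by hand here.

```python
import numpy as np

def kde(x_grid, data, h):
    """Evaluate f_hat(x; h) = (1/(n*h)) * sum_i K((x - X_i)/h) on x_grid,
    with K the standard normal density."""
    data = np.asarray(data, dtype=float)
    n = data.size
    # (len(x_grid), n) matrix of scaled differences (x - X_i)/h
    u = (np.asarray(x_grid, dtype=float)[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

# Example: estimate a standard normal density from n = 200 observations.
rng = np.random.default_rng(0)
sample = rng.standard_normal(200)
grid = np.linspace(-4, 4, 201)
f_hat = kde(grid, sample, h=0.4)   # bandwidth chosen by hand for illustration
```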

5 MSE and MISE criteria


5.1 MSE
If we want to estimate f at a particular point x, consider

\[
E\hat f(x;h) = E K_h(x - X) = (K_h * f)(x);
\]
\[
\begin{aligned}
V\hat f(x;h) &= E\Big\{n^{-1}\sum_i K_h(x - X_i)\Big\}^2 - (K_h*f)^2(x)\\
&= n^{-1}\,E K_h(x-X)^2 + n^{-2}\sum_{i\neq j}E\,K_h(x-X_i)K_h(x-X_j) - (K_h*f)^2(x)\\
&= n^{-1}(K_h^2*f)(x) + \Big\{\frac{n(n-1)}{n^2} - 1\Big\}(K_h*f)^2(x)\\
&= n^{-1}\{(K_h^2*f)(x) - (K_h*f)^2(x)\}.
\end{aligned}
\]
Then the mean squared error (MSE) can be written as
\[
\begin{aligned}
MSE\{\hat f(x;h)\} &= E\{\hat f(x;h) - f(x)\}^2\\
&= V\hat f(x;h) + \{E\hat f(x;h) - f(x)\}^2\\
&= n^{-1}\{(K_h^2*f)(x) - (K_h*f)^2(x)\} + \{(K_h*f)(x) - f(x)\}^2.
\end{aligned}
\]

5.2 MISE
However, we are more interested in estimating f over the whole real line. Consider the integrated
squared error (ISE) and the mean integrated squared error (MISE)
\[
ISE\{\hat f(\cdot;h)\} = \int \{\hat f(x;h) - f(x)\}^2\,dx;
\]
\[
\begin{aligned}
MISE\{\hat f(\cdot;h)\} &= E\int \{\hat f(x;h) - f(x)\}^2\,dx = \int E\{\hat f(x;h) - f(x)\}^2\,dx\\
&= n^{-1}\int \{(K_h^2*f)(x) - (K_h*f)^2(x)\}\,dx + \int \{(K_h*f)(x) - f(x)\}^2\,dx.
\end{aligned}
\]


Notice that
\[
\begin{aligned}
\int (K_h^2*f)(x)\,dx &= \int\!\!\int \frac{1}{h^2}K^2\Big(\frac{x-y}{h}\Big)f(y)\,dy\,dx\\
&= \int\!\!\int \frac{1}{h}K^2(z)f(x-hz)\,dz\,dx\\
&= \frac{1}{h}\int K^2(z)\int f(x-hz)\,dx\,dz\\
&= \frac{1}{h}\int K^2(x)\,dx.
\end{aligned}
\]
Then the MISE can be written as
\[
MISE\{\hat f(\cdot;h)\} = \frac{1}{nh}\int K^2(x)\,dx
+ \Big(1-\frac{1}{n}\Big)\int (K_h*f)^2(x)\,dx
- 2\int (K_h*f)(x)f(x)\,dx + \int f^2(x)\,dx.
\]

5.3 MIAE
We could also work with other criteria, such as the mean integrated absolute error (MIAE)
\[
MIAE\{\hat f(\cdot;h)\} = E\int |\hat f(x;h) - f(x)|\,dx.
\]

The MIAE is always defined whenever \(\hat f(x;h)\) is a density, and it is invariant under monotone
transformations, but it is more complicated to work with.

6 Order and Taylor expansions


6.1 Taylor’s Theorem
Suppose f is a real-valued function defined on R and let x ∈ R. Assume that f has p continuous
derivatives in an interval (x − δ, x + δ) for some δ > 0. Then for any sequence αn converging to zero,
\[
f(x + \alpha_n) = \sum_{j=0}^{p} \frac{\alpha_n^j}{j!}\,f^{(j)}(x) + o(\alpha_n^p).
\]

6.2 Example
Suppose we have a random sample X_1, \dots, X_n from the N(\mu, \sigma^2) distribution, and we are interested in
estimating \(e^{\mu}\). It is known that the maximum likelihood estimator is \(e^{\bar X}\), and
\[
E\{e^{\bar X}\} = e^{\mu + \frac{\sigma^2}{2n}}, \qquad
E\{e^{2\bar X}\} = e^{2\mu + \frac{2\sigma^2}{n}}, \qquad
V\{e^{\bar X}\} = e^{2\mu + \frac{\sigma^2}{n}}\big(e^{\frac{\sigma^2}{n}} - 1\big).
\]
Then the MSE can be approximated as
\[
\begin{aligned}
MSE(e^{\bar X}) &= e^{2\mu}\Big(e^{\frac{2\sigma^2}{n}} - 2e^{\frac{\sigma^2}{2n}} + 1\Big)\\
&= e^{2\mu}\bigg[\Big\{1 + \frac{2\sigma^2}{n} + \frac{1}{2}\Big(\frac{2\sigma^2}{n}\Big)^2 + \dots\Big\}
 - 2\Big\{1 + \frac{\sigma^2}{2n} + \frac{1}{2}\Big(\frac{\sigma^2}{2n}\Big)^2 + \dots\Big\} + 1\bigg]\\
&\sim \frac{1}{n}\,\sigma^2 e^{2\mu}.
\end{aligned}
\]
It is typical of parametric estimators that the MSE has a rate of convergence of order n^{-1}; we shall see that the
rates of convergence of nonparametric kernel estimators are typically slower than n^{-1}.
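
As a quick sanity check on this n^{-1} rate, the following sketch (a hypothetical Monte Carlo experiment; all names and constants are our own choices) estimates MSE(e^{X̄}) by simulation and compares it with the leading-order approximation σ²e^{2µ}/n.

```python
import numpy as np

def mse_exp_mean(mu, sigma, n, n_rep=200_000, seed=1):
    """Monte Carlo estimate of MSE(e^{Xbar}) for N(mu, sigma^2) samples of size n."""
    rng = np.random.default_rng(seed)
    # Xbar ~ N(mu, sigma^2 / n), so simulate Xbar directly.
    xbar = rng.normal(mu, sigma / np.sqrt(n), size=n_rep)
    return np.mean((np.exp(xbar) - np.exp(mu)) ** 2)

mu, sigma = 0.5, 1.0
for n in (50, 200, 800):
    approx = sigma**2 * np.exp(2 * mu) / n          # leading-order approximation
    print(n, mse_exp_mean(mu, sigma, n), approx)    # the two should roughly agree
```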


7 Asymptotic MSE and MISE approximations


7.1 Assumptions and notations
Assume that

(i) The density f has a second derivative f'' which is continuous, square integrable, and ultimately monotone.

(ii) The bandwidth h = h_n is a non-random sequence of positive numbers such that
\[
\lim_{n\to\infty} h = 0 \qquad\text{and}\qquad \lim_{n\to\infty} nh = \infty.
\]

(iii) The kernel K is a bounded probability density function, symmetric about the origin and having finite
fourth moment, i.e.
\[
\int K(z)\,dz = 1, \qquad \int zK(z)\,dz = 0, \qquad \mu_2(K) := \int z^2K(z)\,dz < \infty.
\]

Also, denote \(R(g) := \int g^2(z)\,dz\).

7.2 Calculations
Recall that
\[
E\hat f(x;h) = (K_h*f)(x) = \int \frac{1}{h}K\Big(\frac{x-y}{h}\Big)f(y)\,dy = \int K(z)f(x-hz)\,dz.
\]
Expanding f(x - hz) about x, we obtain
\[
f(x - hz) = f(x) - hzf'(x) + \tfrac{1}{2}h^2z^2f''(x) + o(h^2),
\]
uniformly in z, hence
\[
E\hat f(x;h) - f(x) = \tfrac{1}{2}h^2\mu_2(K)f''(x) + o(h^2).
\]
Similarly,
\[
\begin{aligned}
V\hat f(x;h) &= n^{-1}\{(K_h^2*f)(x) - (K_h*f)^2(x)\}\\
&= \frac{1}{nh}\int K^2(z)f(x-hz)\,dz - n^{-1}\Big\{\int K(z)f(x-hz)\,dz\Big\}^2\\
&= \frac{1}{nh}\int K^2(z)\{f(x)+o(1)\}\,dz - n^{-1}\Big[\int K(z)\{f(x)+o(1)\}\,dz\Big]^2\\
&= \frac{1}{nh}R(K)f(x) + o\Big(\frac{1}{nh}\Big).
\end{aligned}
\]
Therefore,
\[
\begin{aligned}
MSE\{\hat f(x;h)\} &= \frac{1}{nh}R(K)f(x) + \frac{1}{4}h^4\mu_2^2(K)f''(x)^2 + o\Big(\frac{1}{nh} + h^4\Big);\\
MISE\{\hat f(\cdot;h)\} &= AMISE\{\hat f(\cdot;h)\} + o\Big(\frac{1}{nh} + h^4\Big);\\
AMISE\{\hat f(\cdot;h)\} &= \frac{1}{nh}R(K) + \frac{1}{4}h^4\mu_2^2(K)R(f'').
\end{aligned}
\]


Notice that the two leading terms display the variance–bias trade-off: the variance term decreases with h while
the bias term increases. The AMISE is minimized at
\[
h_{\mathrm{AMISE}} = \left\{\frac{R(K)}{n\,\mu_2^2(K)\,R(f'')}\right\}^{1/5}, \qquad
\inf_{h>0} AMISE\{\hat f(\cdot;h)\} = \frac{5}{4}\{\mu_2^2(K)R(K)^4R(f'')\}^{1/5}\,n^{-4/5}.
\]
Equivalently, as n goes to infinity,
\[
h_{\mathrm{MISE}} \sim \left\{\frac{R(K)}{n\,\mu_2^2(K)\,R(f'')}\right\}^{1/5}, \qquad
\inf_{h>0} MISE\{\hat f(\cdot;h)\} \sim \frac{5}{4}\{\mu_2^2(K)R(K)^4R(f'')\}^{1/5}\,n^{-4/5}.
\]
Aside from its dependence on the known K and n, this expression shows that the optimal h decreases as the
curvature of f, measured by R(f''), increases, since \(h_{\mathrm{AMISE}} \propto R(f'')^{-1/5}\). The problem is that we do not know
the curvature, but there are ways to estimate it.
Another observation is that the best obtainable rate of convergence of the MISE of the kernel estimator is of
order n^{-4/5}, which is slower than the parametric rate n^{-1} obtained for the MSE above.
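
One simple way to make h_AMISE operational is to plug in an estimate of R(f''). The sketch below (Python with NumPy; the helper name is ours, and the normal reference R(f'') = 3/(8√π σ⁵) is an assumption about f, not part of the formula above) computes the bandwidth for the Gaussian kernel; under this reference it reduces to the familiar (4/3)^{1/5} σ n^{-1/5} rule.

```python
import numpy as np

def h_amise_normal_reference(data):
    """AMISE-optimal bandwidth for the Gaussian kernel, using a normal
    reference f = N(mu, sigma^2) to approximate the unknown R(f'')."""
    data = np.asarray(data, dtype=float)
    n = data.size
    sigma = data.std(ddof=1)                          # scale estimate (a robust one could be used instead)
    r_k = 1.0 / (2.0 * np.sqrt(np.pi))                # R(K) for the standard normal kernel
    mu2_k = 1.0                                       # mu_2(K) for the standard normal kernel
    r_f2 = 3.0 / (8.0 * np.sqrt(np.pi) * sigma**5)    # R(f'') under the normal reference
    return (r_k / (n * mu2_k**2 * r_f2)) ** 0.2       # equals (4/3)^{1/5} * sigma * n^{-1/5}

rng = np.random.default_rng(2)
print(h_amise_normal_reference(rng.standard_normal(500)))  # roughly 1.06 * n^{-1/5}
```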

7.3 Comparison with histogram


Suppose the knots are x_0, \dots, x_n, where x_k = x_0 + kb. Since f is a density function, denote its c.d.f. by
\(F(x) = \int_{-\infty}^{x} f(t)\,dt\). The histogram can be written as
\[
\hat f(x;b) = \frac{1}{nb}\sum_{i=1}^{n} I_{(x_k,\,x_k+b]}(X_i), \qquad \text{for } x \in (x_k, x_k+b].
\]

Then
\[
\begin{aligned}
E\hat f(x;b) &= \frac{1}{b}\int_{x_k}^{x_k+b} f(t)\,dt = \frac{F(x_k+b)-F(x_k)}{b} = f(x_k) + \frac{b}{2}f'(x_k) + o(b);\\
\mathrm{Bias}\{\hat f(x;b)\} &= f(x_k) + \frac{b}{2}f'(x_k) + o(b) - \{f(x_k) + (x-x_k)f'(x_k) + o(x-x_k)\}\\
&= \Big\{\frac{b}{2} - (x-x_k)\Big\}f'(x_k) + o(b);\\
E\hat f^2(x;b) &= \frac{1}{nb^2}\{F(x_k+b)-F(x_k)\} + \frac{n(n-1)}{n^2b^2}\{F(x_k+b)-F(x_k)\}^2;\\
V\hat f(x;b) &= \frac{1}{nb}\{f(x_k)+o(1)\} - \frac{1}{n}\{f(x_k)+o(1)\}^2.
\end{aligned}
\]
Varying x over the bins and integrating, we have
\[
MISE\{\hat f(\cdot;b)\} = AMISE\{\hat f(\cdot;b)\} + o\{(nb)^{-1} + b^2\}, \qquad
AMISE\{\hat f(\cdot;b)\} = \frac{1}{nb} + \frac{b^2}{12}R(f').
\]
Therefore, the MISE is asymptotically minimized at
\[
b_{\mathrm{MISE}} \sim \{6/R(f')\}^{1/3}n^{-1/3}, \qquad
\inf_{b>0} MISE\{\hat f(\cdot;b)\} \sim \frac{1}{4}\{36R(f')\}^{1/3}n^{-2/3}.
\]
In other words, the MISE of the histogram is asymptotically inferior to that of the kernel estimator, since its
best convergence rate is of order n^{-2/3} compared with the kernel estimator's n^{-4/5}. For further references see
Scott [9].


8 Exact MISE calculations


Recall that \(\phi_\sigma(x-\mu)\) denotes the density of the N(\mu, \sigma^2) distribution, and that
\[
\int \phi_\sigma(x-\mu)\,\phi_{\sigma'}(x-\mu')\,dx = \phi_{\sqrt{\sigma^2+\sigma'^2}}(\mu - \mu').
\]
Also recall that
\[
MISE\{\hat f(\cdot;h)\} = \frac{1}{nh}\int K^2(x)\,dx
+ \Big(1-\frac{1}{n}\Big)\int (K_h*f)^2(x)\,dx
- 2\int (K_h*f)(x)f(x)\,dx + \int f^2(x)\,dx.
\]

8.1 MISE for a single normal distribution


Take K to be the N(0, 1) density and f to be the N(0, \sigma^2) density, so that
\[
K_h(x) = \phi_h(x), \qquad f(x) = \phi_\sigma(x).
\]
It is very easy to show that
\[
\begin{aligned}
\int K^2(x)\,dx &= \phi_{\sqrt2}(0) = \frac{1}{2\sqrt\pi};\\
\int (K_h*f)^2(x)\,dx &= \int \phi^2_{\sqrt{h^2+\sigma^2}}(x)\,dx = \phi_{\sqrt{2(h^2+\sigma^2)}}(0) = \frac{1}{2\sqrt{\pi(h^2+\sigma^2)}};\\
\int (K_h*f)(x)f(x)\,dx &= \int \phi_{\sqrt{h^2+\sigma^2}}(x)\,\phi_\sigma(x)\,dx = \phi_{\sqrt{h^2+2\sigma^2}}(0) = \frac{1}{\sqrt{2\pi(h^2+2\sigma^2)}};\\
\int f^2(x)\,dx &= \phi_{\sqrt2\,\sigma}(0) = \frac{1}{2\sqrt\pi\,\sigma}.
\end{aligned}
\]
Therefore
\[
MISE\{\hat f(\cdot;h)\} = \frac{1}{2\sqrt\pi}\left\{\frac{1}{nh} + \frac{1-n^{-1}}{\sqrt{h^2+\sigma^2}}
- \frac{2^{3/2}}{\sqrt{h^2+2\sigma^2}} + \frac{1}{\sigma}\right\}.
\]
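
This closed form makes it easy to study the exact MISE numerically. The sketch below (helper names are our own; SciPy's minimize_scalar does the one-dimensional minimization) evaluates the formula and locates the MISE-optimal bandwidth, which can be compared with the asymptotic value h_AMISE = (4/(3n))^{1/5} σ for normal f.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def mise_normal(h, n, sigma=1.0):
    """Exact MISE for the Gaussian kernel when f = N(0, sigma^2)."""
    return (1.0 / (2.0 * np.sqrt(np.pi))) * (
        1.0 / (n * h)
        + (1.0 - 1.0 / n) / np.sqrt(h**2 + sigma**2)
        - 2.0 ** 1.5 / np.sqrt(h**2 + 2.0 * sigma**2)
        + 1.0 / sigma
    )

n = 100
res = minimize_scalar(mise_normal, bounds=(1e-3, 2.0), args=(n,), method="bounded")
print(res.x, res.fun)             # MISE-optimal bandwidth and the attained MISE
print((4.0 / (3.0 * n)) ** 0.2)   # asymptotic h_AMISE for comparison (sigma = 1)
```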

8.2 MISE for normal mixtures


Continue using K(x) = φ(x), the standard normal density. Suppose the density can be written as a mixture of
normal densities,
\[
f(x) = \sum_{l=1}^{k} w_l\,\phi_{\sigma_l}(x-\mu_l),
\]
where k ∈ Z_+, w_1, \dots, w_k are positive numbers summing to one, and for each l, \(\mu_l \in \mathbb{R}\) and \(\sigma_l^2 > 0\).
Similarly (almost trivial to verify),
\[
MISE\{\hat f(\cdot;h)\} = \frac{1}{2\sqrt\pi\,nh} + w^{T}\big\{(1-n^{-1})\Omega_2 - 2\Omega_1 + \Omega_0\big\}w,
\]
where \(w = (w_1, \dots, w_k)^T\) and the (l, l') entry of \(\Omega_a\) is
\[
\Omega_a[l, l'] = \phi_{\sqrt{ah^2 + \sigma_l^2 + \sigma_{l'}^2}}(\mu_l - \mu_{l'}).
\]
If we plot the exact and the asymptotic MISE, IV (integrated variance) and ISB (integrated squared bias)
against the bandwidth, we see that IV/AIV decreases fairly "uniformly" as log h increases, whereas ISB/AISB
increases very "non-uniformly". This is because the bias approximation is based on the assumption that h → 0.
Overall, for densities close to normality the bias approximation tends to be quite reasonable, while for densities
with more features, such as multiple modes, the approximation becomes worse; see Marron and Wand [7].
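
The matrix form is straightforward to implement. The sketch below (our own helper names, with a hypothetical bimodal mixture as the example) evaluates the exact MISE for a normal mixture at a given bandwidth.

```python
import numpy as np

def phi(x, sd):
    """N(0, sd^2) density evaluated at x (x and sd may be arrays of matching shape)."""
    return np.exp(-0.5 * (x / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def mise_normal_mixture(h, n, w, mu, sigma):
    """Exact MISE for the Gaussian kernel when f(x) = sum_l w_l * phi_{sigma_l}(x - mu_l)."""
    w, mu, sigma = (np.asarray(a, dtype=float) for a in (w, mu, sigma))
    diff = mu[:, None] - mu[None, :]                  # mu_l - mu_l'
    s2 = sigma[:, None] ** 2 + sigma[None, :] ** 2    # sigma_l^2 + sigma_l'^2
    omega = lambda a: phi(diff, np.sqrt(a * h**2 + s2))   # Omega_a matrix
    quad = w @ ((1.0 - 1.0 / n) * omega(2) - 2.0 * omega(1) + omega(0)) @ w
    return 1.0 / (2.0 * np.sqrt(np.pi) * n * h) + quad

# Bimodal example: 0.5*N(-1.5, 1) + 0.5*N(1.5, 0.5^2), n = 200 observations.
print(mise_normal_mixture(0.3, 200, [0.5, 0.5], [-1.5, 1.5], [1.0, 0.5]))
```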


8.3 Using characteristic functions to simplify MISE


For a real-valued function g, denote its Fourier transform by \(\varphi_g(t) = \int e^{itx}g(x)\,dx\); then
\[
\varphi_{f*g}(t) = \varphi_f(t)\,\varphi_g(t).
\]
Also recall the well-known Parseval identity
\[
\int f(x)g(x)\,dx = \frac{1}{2\pi}\int \varphi_f(t)\,\overline{\varphi_g(t)}\,dt.
\]
From these properties we can rewrite the MISE as
\[
MISE\{\hat f(\cdot;h)\} = \frac{1}{2\pi nh}\int \kappa^2(t)\,dt
+ \frac{1}{2\pi}\int \Big\{\Big(1-\frac{1}{n}\Big)\kappa^2(ht) - 2\kappa(ht) + 1\Big\}|\varphi_f(t)|^2\,dt,
\]
where \(\kappa(t) = \varphi_K(t)\).

8.3.1 sinc kernel


The sinc kernel and its characteristic function are given by
\[
K(x) = \frac{\sin x}{\pi x}, \qquad \kappa(t) = \int e^{itx}\,\frac{\sin x}{\pi x}\,dx = 1_{\{|t|\le 1\}}.
\]
Note that \(|\varphi_f(t)|^2 = \varphi_f(t)\varphi_f(-t)\) is symmetric about 0, hence
\[
MISE\{\hat f(\cdot;h)\} = \frac{1}{\pi nh} - \frac{1+n^{-1}}{\pi}\int_0^{1/h}|\varphi_f(t)|^2\,dt + \int f^2(x)\,dx.
\]
Davis [3] showed that the MISE-optimal bandwidth satisfies
\[
\Big|\varphi_f\Big(\frac{1}{h_{\mathrm{MISE}}}\Big)\Big|^2 = \frac{1}{n+1},
\]
provided \(|\varphi_f(t)| > 0\) for all t. This follows from setting the derivative of the MISE with respect to h
equal to zero,
\[
\frac{1}{\pi nh^2} = \frac{n+1}{n\pi}\,\frac{1}{h^2}\Big|\varphi_f\Big(\frac{1}{h}\Big)\Big|^2.
\]
If f is the normal density, it can be shown that for the sinc kernel
\[
\inf_{h>0} MISE\{\hat f(\cdot;h)\} = O\{(\log n)^{1/2}n^{-1}\},
\]
which is faster than any rate of order n^{-\alpha}, 0 < \alpha < 1, but the MISE is not O(n^{-1}) since R(K)
is infinite. This is an example of a higher-order kernel of "infinite" order, i.e. \(\mu_j(K) = 0\) for all
\(j \in \mathbb{N}\).

8.3.2 Laplace kernel, with exponential density


The Laplace kernel and its characteristic function are given by
\[
K(x) = \frac{1}{2}e^{-|x|}, \qquad \kappa(t) = \int e^{itx}\,\frac{1}{2}e^{-|x|}\,dx = \frac{1}{1+t^2}.
\]
Using the exponential density
\[
f(x) = e^{-x}1_{\{x>0\}}, \qquad \varphi_f(t) = \frac{1}{1-it},
\]
standard calculations give
\[
MISE\{\hat f(\cdot;h)\} = \frac{1}{4nh} + \frac{2nh^2 + (n-1)h - 2}{4n(1+h)^2}.
\]
Taking the derivative with respect to h, it is easy to find \(h_{\mathrm{MISE}} = 1/\sqrt n\), which attains the minimal
value \(MISE = 1/(2 + 2\sqrt n)\).
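
A quick numerical check of this closed form (a sketch; the names are our own): minimizing the MISE expression recovers h_MISE ≈ n^{-1/2} and the minimal value 1/(2 + 2√n).

```python
import numpy as np
from scipy.optimize import minimize_scalar

def mise_laplace_exponential(h, n):
    """Exact MISE for the Laplace kernel and the Exponential(1) density."""
    return 1.0 / (4.0 * n * h) + (2.0 * n * h**2 + (n - 1.0) * h - 2.0) / (4.0 * n * (1.0 + h) ** 2)

n = 400
res = minimize_scalar(mise_laplace_exponential, bounds=(1e-4, 1.0), args=(n,), method="bounded")
print(res.x, 1.0 / np.sqrt(n))                   # numerical minimiser vs. h_MISE = n^{-1/2}
print(res.fun, 1.0 / (2.0 + 2.0 * np.sqrt(n)))   # minimal MISE vs. 1/(2 + 2*sqrt(n))
```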


9 Canonical kernels and optimal kernel theory


Now we investigate how the shape of the kernel could influence the estimator. In order to obtain
admissible estimators, Cline [2] showed that the kernel should be symmetric and unimodal.
Recall the two components of the AMISE,
\[
AMISE\{\hat f(\cdot;h)\} = \frac{1}{nh}R(K) + \frac{1}{4}h^4\mu_2^2(K)R(f'').
\]
If we want to separate h and K, we need \(R(K) = \mu_2^2(K)\). Consider scalings of K of the form
\(K_\delta(\cdot) = K(\cdot/\delta)/\delta\), for which
\[
R(K_\delta) = \delta^{-1}R(K), \qquad \mu_2^2(K_\delta) = \delta^4\mu_2^2(K).
\]
Equating the two gives
\[
\delta_0 = \{R(K)/\mu_2^2(K)\}^{1/5}.
\]
Using the kernel \(K_{\delta_0}\), the AMISE can be rewritten as
\[
AMISE\{\hat f(\cdot;h)\} = C(K)\Big\{\frac{1}{nh} + \frac{1}{4}h^4R(f'')\Big\},
\qquad \text{where } C(K) = \{R(K)^4\mu_2^2(K)\}^{1/5}.
\]
Notice that C(K) is invariant to scaling of K. We call \(K^c = K_{\delta_0}\) the canonical kernel for the class
\(\{K_\delta : \delta > 0\}\). It is the unique member of this class that permits the "decoupling" of K and h; see
Marron and Nolan [6].
For example, let K = φ, the standard normal density; then \(\delta_0 = (4\pi)^{-1/10}\), so
\[
\phi^c(x) = \phi_{(4\pi)^{-1/10}}(x), \qquad C(\phi) = (4\pi)^{-2/5}.
\]
Canonical kernels are very useful for pictorial comparison of density estimates based on differently shaped
kernels, since they are defined in such a way that a single choice of bandwidth gives roughly the same amount
of smoothing for each kernel.
The problem now becomes choosing K to minimize C(K). Since C(K) is scale invariant, the optimal K is the
one that minimizes C(K) subject to
\[
\int K(x)\,dx = 1, \qquad \int xK(x)\,dx = 0, \qquad \int x^2K(x)\,dx = a^2 < \infty, \qquad K(x) \ge 0 \text{ for all } x.
\]

Hodges and Lehmann [4] showed that the solution can be written as
\[
K^a(x) = \frac{3}{4}\Big(1 - \frac{x^2}{5a^2}\Big)\Big/\big(5^{1/2}a\big)\;1_{\{|x|<5^{1/2}a\}}.
\]

A special case is \(a = 1/\sqrt5\) (the Epanechnikov kernel),
\[
K^*(x) = \frac{3}{4}(1-x^2)\,1_{\{|x|<1\}}.
\]
The efficiency of a kernel K relative to K* is defined as \(\{C(K^*)/C(K)\}^{5/4}\). A convenient family is the
symmetric beta family on [-1, 1],
\[
K(x;p) = \{2^{2p+1}B(p+1, p+1)\}^{-1}(1-x^2)^p\,1_{\{|x|<1\}},
\]
where B(·,·) is the beta function.
Table 1 shows that the efficiency does not improve much across different kernel shapes. The uniform kernel
is not very popular in practice since the resulting estimate is piecewise constant. Even the Epanechnikov
kernel is not so attractive, because the estimator has a discontinuous first derivative.


Table 1: Efficiencies of several kernels compared to the optimal kernel.

Kernel          Form        {C(K*)/C(K)}^{5/4}
Epanechnikov    K(x; 1)     1.000
Biweight        K(x; 2)     0.994
Triweight       K(x; 3)     0.987
Normal          K(x; ∞)     0.951
Triangular                  0.986
Uniform         K(x; 0)     0.930
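
A few of the table entries can be reproduced by numerical quadrature. The sketch below (our own helper; the kernel densities are written out explicitly rather than taken from any library) computes {C(K*)/C(K)}^{5/4} directly from R(K) and µ₂(K).

```python
import numpy as np
from scipy.integrate import quad

def c_const(kernel, lo, hi):
    """C(K) = {R(K)^4 * mu_2(K)^2}^{1/5}, computed by numerical quadrature."""
    r_k = quad(lambda x: kernel(x) ** 2, lo, hi)[0]
    mu2 = quad(lambda x: x**2 * kernel(x), lo, hi)[0]
    return (r_k**4 * mu2**2) ** 0.2

epanechnikov = lambda x: 0.75 * (1 - x**2) * (abs(x) < 1)
biweight     = lambda x: (15.0 / 16.0) * (1 - x**2) ** 2 * (abs(x) < 1)
normal       = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
uniform      = lambda x: 0.5 * (abs(x) < 1)

c_star = c_const(epanechnikov, -1, 1)
for name, k, lo, hi in [("Epanechnikov", epanechnikov, -1, 1),
                        ("Biweight", biweight, -1, 1),
                        ("Normal", normal, -10, 10),
                        ("Uniform", uniform, -1, 1)]:
    print(name, round((c_star / c_const(k, lo, hi)) ** 1.25, 3))   # matches Table 1
```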

10 Higher-order kernels
10.1 Why higher-order kernels?
We know that the best obtainable rate of convergence of the kernel estimator is of order n^{-4/5}. If we relax
the condition that K must be a density, the convergence rate can be faster. For example, recall that the
asymptotic bias is given by
\[
E\hat f(x;h) - f(x) = \tfrac{1}{2}h^2\mu_2(K)f''(x) + o(h^2).
\]
If we construct K so that \(\mu_2(K) = 0\), the MSE and MISE can attain the faster optimal convergence rate of
order n^{-8/9}.

10.2 What are higher-order kernels?


We insist that K be symmetric, and say that K is a k-th order kernel if
\[
\mu_0(K) = 1, \qquad \mu_j(K) = 0 \text{ for } j = 1, \dots, k-1, \qquad \mu_k(K) \ne 0.
\]

10.3 How to get higher-order kernels?


One way to generate higher-order kernels is recursively from lower-order ones,
\[
K_{[k+2]}(x) = \frac{3}{2}K_{[k]}(x) + \frac{1}{2}xK_{[k]}'(x).
\]
For example, setting \(K_{[2]}(x) = \phi(x)\) gives \(K_{[4]}(x) = \tfrac{1}{2}(3 - x^2)\phi(x)\).
Another construction, developed for the case where f is a normal mixture density, yields a class of
Gaussian-based higher-order kernels; see Marron and Wand [7]:
\[
G_{[k]}(x) = \sum_{l=0}^{k/2-1}\frac{(-1)^l}{2^l\,l!}\,\phi^{(2l)}(x), \qquad k = 2, 4, 6, \dots
\]
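
A small numerical check of the recursion (a sketch; the names are ours): starting from K_{[2]} = φ, the kernel K_{[4]}(x) = ½(3 − x²)φ(x) should integrate to one, have vanishing second moment, and have a non-zero fourth moment.

```python
import numpy as np
from scipy.integrate import quad

phi = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

# K_[4](x) = (3/2) K_[2](x) + (1/2) x K_[2]'(x) with K_[2] = phi and phi'(x) = -x phi(x)
k4 = lambda x: 1.5 * phi(x) + 0.5 * x * (-x * phi(x))      # = 0.5 * (3 - x**2) * phi(x)

moment = lambda j: quad(lambda x: x**j * k4(x), -np.inf, np.inf)[0]
print(moment(0))   # ~ 1   (integrates to one)
print(moment(2))   # ~ 0   (second moment vanishes: a fourth-order kernel)
print(moment(4))   # ~ -3  (non-zero fourth moment)
```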

10.4 Miscellaneous topics on higher-order kernels


The convergence rate can be made arbitrarily close to the parametric rate n^{-1} as the order increases, which
means that higher-order kernel estimators eventually dominate second-order ones for large n. However, a
large sample size is needed (K_{[4]} would require several thousand observations before it reduces the MISE
compared with the normal kernel). Another price to be paid for higher-order kernels is that their negative
parts may make the estimated density not a density itself.


An extreme case of higher-order kernels is the "infinite"-order kernels, such as the sinc kernel
K(x) = sin x/(πx). The sinc kernel estimator suffers from the same drawbacks as other higher-order kernel
estimators, and its good asymptotic performance is not guaranteed to carry over to finite sample sizes in
practice.

11 Measuring how difficult a density is to estimate


Recall that, for K a symmetric kernel that is a probability density function,
\[
\inf_{h>0} MISE\{\hat f(\cdot;h)\} \sim \frac{5}{4}\,C(K)\,R(f'')^{1/5}\,n^{-4/5},
\]
so the magnitude of R(f'') tells us how well f can be estimated even when h is chosen optimally.
First, we cannot make R(f'') equal to zero: f cannot be constant over the whole real line, and the uniform
density, for which f'' = 0 inside the support, is difficult to estimate at its boundaries. Second, R(f'') is not
scale invariant. To appreciate this, suppose X has density f_X and Y = X/a has density \(f_Y(x) = af_X(ax)\) for
some positive a. Then \(R(f_Y'') = a^5R(f_X'')\), but
\[
D(f) = \{\sigma(f)^5R(f'')\}^{1/4}
\]
is scale invariant, where σ(f) is the population standard deviation of f. The inclusion of the 1/4 power allows
an equivalent-sample-size interpretation, as was done for the comparison of kernel shapes.
One result for D(f) is that it attains its minimum for the Beta(4,4)-type density
\[
f^*(x) = \frac{35}{32}(1-x^2)^3\,1_{\{|x|<1\}}.
\]
A more general observation is that densities close to normality appear to be easier for the kernel estimator to
estimate; the degree of estimation difficulty increases with skewness, kurtosis, and multimodality.
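
As a numerical illustration of the scale-invariant difficulty measure (a sketch; the helper function and the use of quadrature are our own choices), the snippet below computes D(f) for the standard normal density and for f*, confirming that f* attains the smaller value.

```python
import numpy as np
from scipy.integrate import quad

def difficulty(f, f2, lo, hi):
    """D(f) = {sigma(f)^5 * R(f'')}^{1/4}, given the density f and its second derivative f2."""
    mean = quad(lambda x: x * f(x), lo, hi)[0]
    var = quad(lambda x: (x - mean) ** 2 * f(x), lo, hi)[0]
    r_f2 = quad(lambda x: f2(x) ** 2, lo, hi)[0]
    return (var**2.5 * r_f2) ** 0.25

# Standard normal: phi''(x) = (x^2 - 1) * phi(x)
phi = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
phi2 = lambda x: (x**2 - 1) * phi(x)

# Beta(4,4)-type density on [-1, 1]: f*(x) = (35/32)(1 - x^2)^3
fstar = lambda x: (35.0 / 32.0) * (1 - x**2) ** 3
fstar2 = lambda x: (35.0 / 32.0) * (30.0 * x**2 - 6.0) * (1 - x**2)

print(difficulty(phi, phi2, -10, 10))     # D for the normal density
print(difficulty(fstar, fstar2, -1, 1))   # D for f*, the minimiser (slightly smaller)
```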

12 Modification of kernel density estimator


12.1 Local kernel density estimators
One modification is to let the bandwidth h vary along the x-axis,
\[
\hat f_L(x; h(x)) = \frac{1}{n\,h(x)}\sum_{i=1}^{n} K\Big(\frac{x - X_i}{h(x)}\Big).
\]

The optimal h for the asymptotic MSE at x is
\[
h_{\mathrm{AMSE}}(x) = \left\{\frac{R(K)f(x)}{\mu_2^2(K)\,f''(x)^2\,n}\right\}^{1/5}, \qquad \text{provided } f''(x) \ne 0.
\]
When f''(x) = 0, additional terms must be taken into account; see Schucany [8]. If this \(h_{\mathrm{AMSE}}(x)\) is chosen
for every x, then the corresponding value of \(AMISE\{\hat f_L(\cdot;h(\cdot))\}\) can be shown to be
\[
\frac{5}{4}\{\mu_2^2(K)R(K)^4\}^{1/5}\,R\big((f^2f'')^{1/5}\big)\,n^{-4/5}.
\]
4
Although the n^{-4/5} rate shows no improvement, the inequality
\[
R\big((f^2f'')^{1/5}\big) \le R(f'')^{1/5}
\]
holds for all f. The relative sample-size efficiency depends on the magnitude of the ratio
\[
\big\{R\big((f^2f'')^{1/5}\big)/R(f'')^{1/5}\big\}^{5/4}.
\]
There are other approaches, such as the nearest-neighbour density estimator; see Loftsgaarden and
Quesenberry [5]. It uses the distance from x to its k-th nearest data point (for some suitable k) in a pilot
estimation step, which is essentially equivalent to taking h(x) ∝ 1/f(x). However, 1/f(x) is usually not a
satisfactory surrogate for the optimal \(\{f(x)/f''(x)^2\}^{1/5}\).

12.2 Variable kernel density estimators


Instead of a single h or a function h(x), the idea here is to use n values α(X_i), i = 1, \dots, n. The kernel
centred on X_i then has its own scale parameter α(X_i), allowing different degrees of smoothing depending on
location. The estimator has the form
\[
\hat f_V(x;\alpha) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{\alpha(X_i)}\,K\Big(\frac{x - X_i}{\alpha(X_i)}\Big).
\]
Intuition suggests that each α(X_i) should depend on the true density in roughly an inverse way. A theoretical
result of Abramson [1] shows that taking \(\alpha(X_i) = hf^{-1/2}(X_i)\) is a particularly good choice, with which one
can achieve a bias of order h^4 instead of h^2. Pilot estimation of f is needed to obtain the α(X_i) in a practical
implementation of \(\hat f_V(\cdot;\alpha)\); this is the extra expense compared with using a fourth-order kernel, but no
negativity is produced. The optimal MSE can be improved from order n^{-4/5} to n^{-8/9}.
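
A minimal sketch of this two-stage idea (assuming NumPy; the pilot bandwidth, the Gaussian kernel, and all names are our own illustrative choices, not a prescribed implementation): a fixed-bandwidth pilot estimate supplies f̂(X_i), and the final estimate uses α(X_i) = h f̂(X_i)^{-1/2}.

```python
import numpy as np

def gauss_kde(x, data, h):
    """Fixed-bandwidth Gaussian KDE evaluated at the points x."""
    u = (np.asarray(x, float)[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (data.size * h * np.sqrt(2 * np.pi))

def abramson_kde(x, data, h, h_pilot):
    """Abramson's variable-bandwidth estimator: alpha(X_i) = h * f_pilot(X_i)^{-1/2}."""
    data = np.asarray(data, float)
    pilot = gauss_kde(data, data, h_pilot)          # pilot density estimate at each X_i
    alpha = h / np.sqrt(pilot)                      # per-observation bandwidths
    u = (np.asarray(x, float)[:, None] - data[None, :]) / alpha[None, :]
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return (k / alpha[None, :]).mean(axis=1)        # (1/n) sum_i K_{alpha_i}(x - X_i)

rng = np.random.default_rng(3)
sample = rng.standard_normal(300)
grid = np.linspace(-4, 4, 201)
f_hat = abramson_kde(grid, sample, h=0.25, h_pilot=0.5)   # bandwidths chosen by hand
```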

12.3 Transformation kernel density estimator


If the random sample X_i has a density f that is difficult to estimate, we can apply a transformation so that
Y_i = t(X_i) has a density g that is easier to estimate, and then "back-transform" the estimate of g to obtain
an estimate of f.
Suppose that Y_i = t(X_i), where t is an increasing differentiable function defined on the support of f. Then
f(x) = g(t(x))t'(x), and the density can be estimated by
\[
\hat f_T(x; h, t) = \frac{1}{n}\sum_{i=1}^{n}K_h\{t(x) - t(X_i)\}\,t'(x).
\]

Applying the mean value theorem, we can see that this estimator lies somewhere between \(\hat f_L\) and \(\hat f_V\),
\[
\hat f_T(x; h, t) = \frac{1}{n}\sum_{i=1}^{n}\frac{t'(x)}{h}\,K\Big\{\frac{t'(\xi_i)(x - X_i)}{h}\Big\},
\]
where \(\xi_i\) lies between x and X_i.
The best choice of the transformation t depends quite heavily on the shape of f. If f is a skewed unimodal
density, t should be chosen to reduce the skewness; for a right-skewed f this means a concave transformation
(such as the logarithm) on the support of f. If f is close to symmetric but has a high amount of kurtosis,
then t should stretch the centre and compress the tails of f.
One approach for heavily skewed data is the shifted power family
\[
t(x;\lambda_1,\lambda_2) =
\begin{cases}
(x+\lambda_1)^{\lambda_2}\,\mathrm{sign}(\lambda_2), & \lambda_2 \ne 0,\\
\ln(x+\lambda_1), & \lambda_2 = 0,
\end{cases}
\]
where \(\lambda_1 > -\min(X_i)\).
Another approach is to estimate t nonparametrically. If F and G are the c.d.f.s corresponding to the densities
f and g, then \(Y = G^{-1}(F(X))\) has density g. One can choose a G that is easy to estimate and take
\(t = G^{-1}(\hat F(X))\).
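
A minimal sketch of the transformation estimator for positive, right-skewed data, using the λ₂ = 0 member of the shifted power family (a log transform); the bandwidth, shift and names below are our own illustrative choices.

```python
import numpy as np

def gauss_kde(y, data_y, h):
    """Fixed-bandwidth Gaussian KDE on the transformed scale."""
    u = (np.asarray(y, float)[:, None] - data_y[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (data_y.size * h * np.sqrt(2 * np.pi))

def transformation_kde(x, data, h, lam1=0.0):
    """f_hat_T(x; h, t) with t(x) = log(x + lam1): estimate g on the t-scale,
    then back-transform via f(x) = g(t(x)) * t'(x)."""
    data = np.asarray(data, float)
    x = np.asarray(x, float)
    t = lambda u: np.log(u + lam1)
    t_prime = lambda u: 1.0 / (u + lam1)
    g_hat = gauss_kde(t(x), t(data), h)             # KDE of the transformed sample
    return g_hat * t_prime(x)                       # back-transform to the x-scale

rng = np.random.default_rng(4)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=300)   # heavily right-skewed data
grid = np.linspace(0.05, 8.0, 200)
f_hat = transformation_kde(grid, sample, h=0.3, lam1=0.0)
```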


13 Density estimation at boundaries


Estimating a density near a boundary of its support is difficult. Suppose f is a density such that f(x) = 0 for
x < 0 and f(x) > 0 for x ≥ 0, and that f'' is continuous away from x = 0. Also, let K be a kernel with support
confined to [-1, 1].
For x > 0,
\[
E\hat f(x;h) = \int_{-1}^{x/h}K(z)f(x - hz)\,dz,
\]
so for x ≥ h the usual bias expansion applies, while for 0 ≤ x < h part of the kernel mass falls outside the
support of f. Even with boundary-adjusted kernels we still have a relatively large bias at the boundary, but
the performance there is greatly improved.

14 Density derivative estimation


A natural estimator of the r-th derivative \(f^{(r)}(x)\) is
\[
\hat f^{(r)}(x;h) = \frac{1}{nh^{r+1}}\sum_{i=1}^{n}K^{(r)}\Big(\frac{x - X_i}{h}\Big),
\]
provided K is sufficiently differentiable. Its MSE can be shown to satisfy
\[
MSE\{\hat f^{(r)}(x;h)\} = \frac{1}{nh^{2r+1}}R(K^{(r)})f(x)
+ \frac{1}{4}h^4\mu_2^2(K)f^{(r+2)}(x)^2 + o\Big\{\frac{1}{nh^{2r+1}} + h^4\Big\}.
\]
It follows that the MSE-optimal bandwidth for estimating \(f^{(r)}(x)\) is of order \(n^{-1/(2r+5)}\). Therefore,
estimating f'(x) requires a bandwidth of order n^{-1/7}, compared with the optimal n^{-1/5} for estimating f
itself. This reveals the increasing difficulty of estimating higher derivatives.
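
For r = 1 with the Gaussian kernel, K'(u) = −uφ(u), and the estimator can be sketched as follows (the names and the bandwidth are our own choices).

```python
import numpy as np

def kde_derivative(x, data, h):
    """Estimate f'(x) with the Gaussian kernel:
    f_hat'(x; h) = (1/(n*h^2)) * sum_i K'((x - X_i)/h), where K'(u) = -u * phi(u)."""
    data = np.asarray(data, float)
    u = (np.asarray(x, float)[:, None] - data[None, :]) / h
    k_prime = -u * np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return k_prime.sum(axis=1) / (data.size * h**2)

rng = np.random.default_rng(5)
sample = rng.standard_normal(1000)
grid = np.linspace(-3, 3, 121)
fprime_hat = kde_derivative(grid, sample, h=0.45)   # note the larger bandwidth, of order n^{-1/7}
```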


References
[1] I.S. Abramson. On bandwidth variation in kernel estimates – a square root law. Annals of
Statistics, 9:168–76, 1982.

[2] D.B.H. Cline. Admissible kernel estimators of a multivariate density. Annals of Statistics,
16:1421–7, 1988.

[3] K.B. Davis. Mean square error properties of density estimates. Annals of Statistics, 75:1025–30,
1975.

[4] J.L. Hodges and E.L. Lehmann. The efficiency of some nonparametric competitors to the t-test.
Annals of Mathematical Statistics, 13:324–35, 1956.

[5] D.O. Loftsgaarden and C.P. Quesenberry. A nonparametric density estimate of a multivariate
density function. Annals of Mathematical Statistics, 36:1049–51, 1965.

[6] J.S. Marron and D. Nolan. Canonical kernels for density estimation. Statist. Probab. Lett.,
7:195–9, 1989.

[7] J.S. Marron and M.P. Wand. Exact mean integrated squared error. Annals of Statistics, 20:712–
36, 1992.

[8] W.R. Schucany. Locally optimal window width for kernel density estimation with large samples.
Statist. Probab. Lett., 7:401–5, 1989.

[9] D.W. Scott. On optimal and data-based histograms. Biometrika, 66:605–10, 1979.

[10] D.W. Scott. Average shifted histograms: effective nonparametric density estimators in several
dimensions. Annals of Statistics, 13:1024–40, 1985.
