Kernel Smoothing
Zhi Ouyang
August, 2005
1 Questions
• What are the statistical properties of kernel estimators?
• What influence do the shape and scaling of the kernel function have on the estimator?
• Sensitivity to the placement of bin edges is a problem of histograms not shared by other density estimators; one solution is the averaged shifted histogram [10], which is an appealing motivation for kernel methods.
• Multivariate histograms.
• Histograms do not use the data as efficiently as kernel estimators.
4 The estimator
Suppose we have a random sample $X_1, \dots, X_n$ taken from a continuous, univariate density $f$.
$$\hat f(x;h) = \frac{1}{nh}\sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right), \quad\text{or}\quad \hat f(x;h) = \frac{1}{n}\sum_{i=1}^n K_h(x - X_i), \quad\text{where } K_h(u) = \frac{1}{h}K\!\left(\frac{u}{h}\right).$$
We shall see that the choice of the shape of the kernel function is not a particularly important one, but the choice of the bandwidth $h$ is crucial.
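As a concrete illustration, here is a minimal sketch of this estimator in Python (the Gaussian kernel, the sample, the bandwidth, and the grid are illustrative choices, not prescribed by the text):

```python
import numpy as np

def kde(x_grid, data, h, K=lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)):
    """Kernel density estimate f_hat(x; h) = (1/(n h)) * sum_i K((x - X_i)/h)."""
    u = (x_grid[:, None] - data[None, :]) / h   # (grid, n) matrix of scaled differences
    return K(u).sum(axis=1) / (len(data) * h)

# Illustrative use: a small normal sample and a hand-picked bandwidth.
rng = np.random.default_rng(0)
sample = rng.normal(size=200)
grid = np.linspace(-4, 4, 401)
f_hat = kde(grid, sample, h=0.4)
```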
5.2 MISE
However, we are more interested in estimating $f$ on the whole real line. Consider the integrated squared error (ISE) and the mean integrated squared error (MISE):
$$\mathrm{ISE}\{\hat f(\cdot;h)\} = \int \{\hat f(x;h) - f(x)\}^2\,dx;$$
$$\mathrm{MISE}\{\hat f(\cdot;h)\} = \mathbb{E}\int \{\hat f(x;h) - f(x)\}^2\,dx = \int \mathbb{E}\{\hat f(x;h) - f(x)\}^2\,dx$$
$$= n^{-1}\int \{(K_h^2 * f)(x) - (K_h * f)^2(x)\}\,dx + \int \{(K_h * f)(x) - f(x)\}^2\,dx.$$
Notice that
$$\begin{aligned}
\int (K_h^2 * f)(x)\,dx &= \int\!\!\int \frac{1}{h^2} K^2\!\Big(\frac{x-y}{h}\Big) f(y)\,dy\,dx \\
&= \int\!\!\int \frac{1}{h} K^2(z)\,f(x - hz)\,dz\,dx \\
&= \frac{1}{h}\int K^2(z) \int f(x - hz)\,dx\,dz \\
&= \frac{1}{h}\int K^2(x)\,dx.
\end{aligned}$$
Then the MISE can be written as
$$\mathrm{MISE}\{\hat f(\cdot;h)\} = \frac{1}{nh}\int K^2(x)\,dx + \Big(1 - \frac{1}{n}\Big)\int (K_h * f)^2(x)\,dx - 2\int (K_h * f)(x) f(x)\,dx + \int f^2(x)\,dx.$$
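This exact formula is easy to evaluate when the convolutions have closed forms. A minimal numerical sketch, assuming $f = N(0,1)$ and a standard normal kernel, so that $K_h * f$ is the $N(0, 1 + h^2)$ density (these choices are this illustration's assumptions):

```python
import numpy as np

def mise_gaussian(n, h):
    """Exact MISE of the Gaussian-kernel KDE when f is standard normal.

    Uses phi_a * phi_b = phi_sqrt(a^2 + b^2) to evaluate each term of the
    exact MISE expression above in closed form.
    """
    s2 = 1.0 + h**2                                        # variance of K_h * f
    int_K2 = 1.0 / (2.0 * np.sqrt(np.pi))                  # int K^2 = R(K)
    int_conv2 = 1.0 / (2.0 * np.sqrt(np.pi * s2))          # int (K_h * f)^2
    int_cross = 1.0 / np.sqrt(2.0 * np.pi * (s2 + 1.0))    # int (K_h * f) f
    int_f2 = 1.0 / (2.0 * np.sqrt(np.pi))                  # R(f)
    return int_K2 / (n * h) + (1 - 1/n) * int_conv2 - 2 * int_cross + int_f2

# Illustration: MISE as a function of h for n = 100.
for h in (0.1, 0.3, 0.5, 1.0):
    print(h, mise_gaussian(100, h))
```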
5.3 MIAE
We could also work with other criteria, such as the mean integrated absolute error (MIAE):
$$\mathrm{MIAE}\{\hat f(\cdot;h)\} = \mathbb{E}\int |\hat f(x;h) - f(x)|\,dx.$$
The MIAE is always defined whenever $\hat f(x;h)$ is a density, and it is invariant under monotone transformations, but it is more complicated to work with.
6.2 Example
Suppose we have a random sample $X_1, \dots, X_n$ from $N(\mu, \sigma^2)$, and we are interested in estimating $e^\mu$. The maximum likelihood estimator is $e^{\bar X}$, and
$$\mathbb{E}\{e^{\bar X}\} = e^{\mu + \frac{\sigma^2}{2n}}; \quad \mathbb{E}\{e^{2\bar X}\} = e^{2\mu + \frac{2\sigma^2}{n}}; \quad \mathbb{V}\{e^{\bar X}\} = e^{2\mu + \frac{\sigma^2}{n}}\big(e^{\frac{\sigma^2}{n}} - 1\big).$$
Then the MSE can be approximated as
$$\begin{aligned}
\mathrm{MSE}(e^{\bar X}) &= e^{2\mu}\big(e^{\frac{2\sigma^2}{n}} - 2e^{\frac{\sigma^2}{2n}} + 1\big) \\
&= e^{2\mu}\left\{1 + \frac{2\sigma^2}{n} + \frac{1}{2}\Big(\frac{2\sigma^2}{n}\Big)^2 + \cdots - 2\Big(1 + \frac{\sigma^2}{2n} + \frac{1}{2}\Big(\frac{\sigma^2}{2n}\Big)^2 + \cdots\Big) + 1\right\} \\
&\sim \frac{1}{n}\,\sigma^2 e^{2\mu}.
\end{aligned}$$
An MSE convergence rate of order $n^{-1}$ is typical for parametric estimators; we shall see that the rates of convergence of nonparametric kernel estimators are typically slower than $n^{-1}$.
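A quick Monte Carlo check of this rate (the values of $\mu$, $\sigma$, and the replication count are illustrative choices; $\bar X \sim N(\mu, \sigma^2/n)$ is sampled directly):

```python
import numpy as np

# Check empirically that MSE(e^Xbar) ~ sigma^2 * e^{2 mu} / n.
rng = np.random.default_rng(0)
mu, sigma, reps = 0.5, 1.0, 200_000

for n in (50, 200, 800):
    xbar = rng.normal(mu, sigma / np.sqrt(n), size=reps)   # Xbar ~ N(mu, sigma^2/n)
    mse = np.mean((np.exp(xbar) - np.exp(mu))**2)
    print(n, mse, sigma**2 * np.exp(2 * mu) / n)           # empirical vs. asymptotic
```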
(i) The density $f$ has second derivative $f''$ which is continuous, square integrable and monotone.
(ii) The bandwidth $h = h_n$ is a non-random sequence of positive numbers such that
$$\lim_{n\to\infty} h = 0 \quad\text{and}\quad \lim_{n\to\infty} nh = \infty.$$
(iii) The kernel $K$ is a bounded probability density function having finite fourth moment and symmetric about the origin, i.e.
$$\int K(z)\,dz = 1, \quad \int zK(z)\,dz = 0, \quad \mu_2(K) := \int z^2 K(z)\,dz < \infty.$$
Also, denote $R(g) := \int g^2(z)\,dz$.
7.2 Calculations
Recall that
$$\mathbb{E}\hat f(x;h) = (K_h * f)(x) = \int \frac{1}{h} K\!\Big(\frac{x-y}{h}\Big) f(y)\,dy = \int K(z)\,f(x - hz)\,dz.$$
Expanding $f(x - hz)$ about $x$, we obtain
$$f(x - hz) = f(x) - hz f'(x) + \frac{1}{2}h^2 z^2 f''(x) + o(h^2),$$
uniformly in $z$; hence
$$\mathbb{E}\hat f(x;h) - f(x) = \frac{1}{2}h^2 \mu_2(K) f''(x) + o(h^2).$$
Similarly, the variance satisfies $\mathbb{V}\hat f(x;h) = (nh)^{-1}R(K)f(x) + o\{(nh)^{-1}\}$, so that
$$\mathrm{MISE}\{\hat f(\cdot;h)\} = \mathrm{AMISE}\{\hat f(\cdot;h)\} + o\{(nh)^{-1} + h^4\}, \quad \mathrm{AMISE}\{\hat f(\cdot;h)\} = \frac{R(K)}{nh} + \frac{1}{4}h^4 \mu_2^2(K) R(f'').$$
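For completeness, the variance expansion follows from the same change of variables used above (a standard calculation, sketched here rather than taken from the original text):
$$\begin{aligned}
\mathbb{V}\hat f(x;h) &= n^{-1}\{(K_h^2 * f)(x) - (K_h * f)^2(x)\} \\
&= (nh)^{-1}\int K^2(z)\,f(x - hz)\,dz - n^{-1}(K_h * f)^2(x) \\
&= (nh)^{-1}R(K)f(x) + o\{(nh)^{-1}\},
\end{aligned}$$
since $\int K^2(z) f(x - hz)\,dz \to R(K)f(x)$ as $h \to 0$ and the $n^{-1}$ term is $o\{(nh)^{-1}\}$ when $nh \to \infty$.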
Notice that the tail term $o\{(nh)^{-1} + h^4\}$ reflects the variance-bias trade-off, while the AMISE is minimized at
$$h_{\mathrm{AMISE}} = \left\{\frac{R(K)}{n\,\mu_2^2(K)\,R(f'')}\right\}^{1/5}, \quad \inf_{h>0}\mathrm{AMISE}\{\hat f(\cdot;h)\} = \frac{5}{4}\{\mu_2^2(K)R(K)^4 R(f'')\}^{1/5}\, n^{-4/5}.$$
Equivalently, as $n$ goes to infinity,
$$h_{\mathrm{MISE}} \sim \left\{\frac{R(K)}{n\,\mu_2^2(K)\,R(f'')}\right\}^{1/5}, \quad \inf_{h>0}\mathrm{MISE}\{\hat f(\cdot;h)\} \sim \frac{5}{4}\{\mu_2^2(K)R(K)^4 R(f'')\}^{1/5}\, n^{-4/5}.$$
Aside from its dependence on the known quantities $K$ and $n$, this expression shows that the optimal $h$ is inversely proportional to a power of the curvature of $f$, as measured by $R(f'')$: the more curved the density, the smaller the optimal bandwidth. The problem is that we do not know the curvature, but there are ways to estimate it.
It can also be seen that the best obtainable rate of convergence of the MISE of the kernel estimator is of order $n^{-4/5}$, which is less efficient than the $n^{-1}$ rate of the parametric MSE.
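As a concrete instance, plugging the normal reference $f = N(0, \sigma^2)$ (so $R(f'') = 3/(8\sqrt{\pi}\,\sigma^5)$) and the Gaussian kernel into $h_{\mathrm{AMISE}}$ gives the familiar rule $h = (4/3)^{1/5}\,\sigma\, n^{-1/5} \approx 1.06\,\sigma\, n^{-1/5}$. A minimal sketch, with $\sigma$ estimated by the sample standard deviation (an assumption of this illustration):

```python
import numpy as np

def h_normal_reference(data):
    """h_AMISE under a normal reference density and a Gaussian kernel:
    h = (4/3)^(1/5) * sigma * n^(-1/5), with sigma estimated from the data."""
    n = len(data)
    sigma = np.std(data, ddof=1)   # sample estimate of the unknown scale
    return (4.0 / 3.0)**0.2 * sigma * n**(-0.2)
```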
Then, for $x$ in the bin $[x_k, x_k + b)$,
$$\begin{aligned}
\mathbb{E}\hat f(x;b) &= \frac{1}{b}\int_{x_k}^{x_k+b} f(t)\,dt = \frac{F(x_k+b) - F(x_k)}{b} = f(x_k) + \frac{b}{2}f'(x_k) + o(b); \\
\mathrm{Bias}\{\hat f(x;b)\} &= f(x_k) + \frac{b}{2}f'(x_k) + o(b) - \{f(x_k) + (x - x_k)f'(x_k) + o(x - x_k)\} \\
&= \Big\{\frac{b}{2} - (x - x_k)\Big\}f'(x_k) + o(b); \\
\mathbb{E}\hat f^2(x;b) &= \frac{1}{nb^2}\{F(x_k+b) - F(x_k)\} + \frac{n(n-1)}{n^2 b^2}\{F(x_k+b) - F(x_k)\}^2; \\
\mathbb{V}\hat f(x;b) &= \frac{1}{nb}\{f(x_k) + o(1)\} - \frac{1}{n}\{f(x_k) + o(1)\}^2.
\end{aligned}$$
Varying $x$ over the bins and integrating, we obtain
$$\mathrm{MISE}\{\hat f(\cdot;b)\} = \frac{1}{nb} + \frac{b^2}{12}R(f') + o\Big\{\frac{1}{nb} + b^2\Big\}.$$
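Minimizing $(nb)^{-1} + b^2 R(f')/12$ over $b$ gives $b^* = \{6/(nR(f'))\}^{1/3}$; with a normal reference density this becomes the rule $b^* \approx 3.49\,\sigma\, n^{-1/3}$ of Scott [9]. A small sketch (the sample estimate of $\sigma$ is an illustrative choice):

```python
import numpy as np

def scott_bin_width(data):
    """Normal-reference optimal bin width b* = (24 sqrt(pi))^(1/3) * sigma * n^(-1/3)
    (approx. 3.49 sigma n^(-1/3)), from minimizing (nb)^-1 + b^2 R(f')/12."""
    n = len(data)
    sigma = np.std(data, ddof=1)
    return (24.0 * np.sqrt(np.pi))**(1/3) * sigma * n**(-1/3)
```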
Here $\phi_\sigma$ denotes the $N(0, \sigma^2)$ density. For a normal mixture density $f(x) = \sum_{l=1}^k w_l\,\phi_{\sigma_l}(x - \mu_l)$, where $k \in \mathbb{Z}_+$, $w_1, \dots, w_k$ are positive numbers that sum to one, and for each $l$, $\mu_l \in \mathbb{R}$ and $\sigma_l^2 > 0$, it is similarly (almost trivial to verify, for the Gaussian kernel $K = \phi$) the case that
$$\mathrm{MISE}\{\hat f(\cdot;h)\} = \frac{1}{2\sqrt{\pi}\,nh} + w^T\{(1 - n^{-1})\Omega_2 - 2\Omega_1 + \Omega_0\}w,$$
where $w = (w_1, \dots, w_k)^T$ and $\Omega_a[l, l'] = \phi_{\sqrt{ah^2 + \sigma_l^2 + \sigma_{l'}^2}}(\mu_l - \mu_{l'})$.
If we plot the exact and the asymptotic MISE, IV (integrated variance) and ISB (integrated squared bias) against the bandwidth, we see that IV/AIV decreases fairly "uniformly" as $\log h$ increases, but ISB/AISB increases very "non-uniformly". This is because the bias approximation is based on the assumption that $h \to 0$. Overall, for densities close to normality the bias approximation tends to be quite reasonable, while for densities with more features, such as multiple modes, the approximation becomes worse; see Marron and Wand [7].
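The exact formula is straightforward to evaluate. A sketch (SciPy assumed; the bimodal mixture, sample size, and bandwidth grid are illustrative):

```python
import numpy as np
from scipy.stats import norm

def exact_mise(w, mu, sig2, n, h):
    """Exact MISE of the Gaussian-kernel KDE for a normal mixture density:
    MISE = 1/(2 sqrt(pi) n h) + w^T ((1 - 1/n) O2 - 2 O1 + O0) w,
    with O_a[l, l'] = phi_{sqrt(a h^2 + sig_l^2 + sig_l'^2)}(mu_l - mu_l')."""
    w, mu, sig2 = map(np.asarray, (w, mu, sig2))
    dmu = mu[:, None] - mu[None, :]            # pairwise mean differences
    ssum = sig2[:, None] + sig2[None, :]       # pairwise variance sums
    omega = lambda a: norm.pdf(dmu, scale=np.sqrt(a * h**2 + ssum))
    Q = (1 - 1/n) * omega(2) - 2 * omega(1) + omega(0)
    return 1 / (2 * np.sqrt(np.pi) * n * h) + w @ Q @ w

# Illustration: a bimodal mixture, n = 100, over a few bandwidths.
for h in (0.1, 0.25, 0.5):
    print(h, exact_mise([0.5, 0.5], [-1.5, 1.5], [1.0, 1.0], 100, h))
```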
Hodges and Lehmann [4] showed that the solution can be written as
$$K^a(x) = \frac{3}{4}\Big(1 - \frac{x^2}{5a^2}\Big)\frac{1}{5^{1/2}a}\,1\{|x| < 5^{1/2}a\}.$$
A special case is $a = 1/\sqrt{5}$ (the Epanechnikov kernel), where
$$K^*(x) = \frac{3}{4}(1 - x^2)\,1\{|x| < 1\}.$$
The efficiency of a kernel $K$ relative to $K^*$ is defined as $\{C(K^*)/C(K)\}^{5/4}$. The family
$$K(x;p) = \{2^{2p+1}B(p+1, p+1)\}^{-1}(1 - x^2)^p\,1\{|x| < 1\},$$
where $B(\cdot,\cdot)$ is the beta function, is supported on $[-1, 1]$ and contains the common compactly supported kernels: $p = 0$ gives the uniform, $p = 1$ the Epanechnikov, $p = 2$ the biweight and $p = 3$ the triweight kernel.
Table 1 shows that the efficiency improves very little across different kernel shapes. The uniform kernel is not very popular in practice, since the resulting estimate is piecewise constant. Even the Epanechnikov kernel is not entirely attractive, because the estimator then has a discontinuous first derivative.
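These efficiencies are easy to reproduce numerically. A sketch, taking $C(K) = \{R(K)^4\mu_2^2(K)\}^{1/5}$ (an assumption consistent with the definition above; the kernels and integration limits are illustrative):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import beta

def C(K, lo, hi):
    """C(K) = {R(K)^4 mu_2(K)^2}^(1/5), the kernel-dependent AMISE factor."""
    R = quad(lambda x: K(x)**2, lo, hi)[0]
    mu2 = quad(lambda x: x**2 * K(x), lo, hi)[0]
    return (R**4 * mu2**2)**0.2

def K_poly(p):
    """Symmetric beta family K(x; p) = (1 - x^2)^p / (2^(2p+1) B(p+1, p+1)) on [-1, 1]."""
    c = 2.0**(2 * p + 1) * beta(p + 1, p + 1)
    return lambda x: (1 - x**2)**p / c

epan = C(K_poly(1), -1, 1)   # p = 1 is the Epanechnikov kernel K*
for name, K, lo, hi in [("uniform", K_poly(0), -1, 1),
                        ("epanechnikov", K_poly(1), -1, 1),
                        ("biweight", K_poly(2), -1, 1),
                        ("normal", lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi), -8, 8)]:
    print(name, (epan / C(K, lo, hi))**1.25)   # efficiency relative to K*
```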
10 Higher-order kernels
10.1 Why higher-order kernels?
We know that the best obtainable rate of convergence of the kernel estimator is of order $n^{-4/5}$. If we relax the condition that $K$ must be a density, the convergence rate can be faster. For example, recall that the asymptotic bias is given by
$$\mathbb{E}\hat f(x;h) - f(x) = \frac{1}{2}h^2\mu_2(K)f''(x) + o(h^2).$$
If we choose $K$ with $\mu_2(K) = 0$, the MSE and MISE attain the optimal convergence rate of order $n^{-8/9}$.
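One standard example of such a fourth-order kernel (not given in the text) is $K_4(x) = \frac{1}{2}(3 - x^2)\phi(x)$, which integrates to one but has $\mu_2(K_4) = 0$ at the cost of taking negative values. A quick numerical check:

```python
import numpy as np
from scipy.integrate import quad

phi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
K4 = lambda x: 0.5 * (3 - x**2) * phi(x)   # fourth-order Gaussian kernel

print(quad(K4, -np.inf, np.inf)[0])                      # ~1: integrates to one
print(quad(lambda x: x**2 * K4(x), -np.inf, np.inf)[0])  # ~0: second moment vanishes
print(K4(2.5))                                           # negative: K4 is not a density
```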
An extreme case of the higher-order kernels is the "infinite-order" kernel, such as the sinc kernel $K(x) = \sin x/(\pi x)$. The sinc kernel estimator suffers from the same drawback as other higher-order kernel estimators (the estimate can take negative values), and the good asymptotic performance is not guaranteed to carry over to finite sample sizes in practice.
The degree-of-difficulty functional $D(f) = \{\sigma(f)^5 R(f'')\}^{1/4}$ is scale invariant, where $\sigma(f)$ is the population standard deviation of $f$. The inclusion of the $1/4$ power allows an equivalent-sample-size interpretation, as was done for the comparison of kernel shapes. One result for $D(f)$ is that it attains its minimum at the Beta(4,4) shape,
$$f^*(x) = \frac{35}{32}(1 - x^2)^3\,1\{|x| < 1\}.$$
A more general result is that densities close to normality appear to be easier for the kernel estimator to estimate; the degree of estimation difficulty increases with skewness, kurtosis and multimodality.
holds for all $f$. The sample size efficiency depends on the magnitude of the ratio
$$\left\{\frac{R\big((f^2 f'')^{1/5}\big)}{R(f'')^{1/5}}\right\}^{5/4}.$$
There are other approaches, such as the nearest neighbor density estimator; see Loftsgaarden and Quesenberry [5]. It uses the distance from $x$ to the data point that is the $k$-th nearest to $x$ (for some suitable $k$) in a pilot estimation step that is essentially equivalent to taking $h(x) \propto 1/f(x)$. However, $1/f(x)$ is usually not a satisfactory surrogate for the optimal $\{f(x)/f''(x)\}^{1/5}$.
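A minimal sketch of the Loftsgaarden-Quesenberry idea [5] in one dimension, using $\hat f(x) = k/\{2n\,d_k(x)\}$ (up to the $k$ versus $k-1$ convention; the choice of $k$ is left to the user and must satisfy $k \le n$):

```python
import numpy as np

def knn_density(x_grid, data, k):
    """k-th nearest neighbor density estimate f_hat(x) = k / (2 n d_k(x)),
    where d_k(x) is the distance from x to its k-th nearest data point."""
    n = len(data)
    d = np.abs(x_grid[:, None] - data[None, :])   # all |x - X_i| distances
    dk = np.sort(d, axis=1)[:, k - 1]             # k-th smallest per grid point
    return k / (2 * n * dk)
```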
Applying the mean value theorem, we can see that this estimator lies somewhere between $\hat f_L$ and $\hat f_V$:
$$\hat f_T(x; h, t) = \frac{1}{n}\sum_{i=1}^n \frac{t'(x)}{h}\, K\!\left\{\frac{t'(\xi_i)(x - X_i)}{h}\right\},$$
where $\xi_i$ lies between $x$ and $X_i$.
The best choice of the transformation $t$ depends quite heavily on the shape of $f$. If $f$ is a skewed unimodal density, then $t$ should be a convex function on the support of $f$, in order to reduce the skewness of $f$ in some sense. If $f$ is close to being symmetric but has a high amount of kurtosis, then $t$ should be concave to the left and convex to the right of the center of symmetry of $f$.
One approach is to apply the following family, called the shifted power family, to heavily skewed data:
$$t(x; \lambda_1, \lambda_2) = \begin{cases} (x + \lambda_1)^{\lambda_2}\,\mathrm{sign}(\lambda_2), & \lambda_2 \neq 0, \\ \ln(x + \lambda_1), & \lambda_2 = 0, \end{cases}$$
where $\lambda_1 > -\min_i(X_i)$.
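A sketch of transformation kernel estimation with the shifted log transform (the $\lambda_2 = 0$ member of this family): estimate $g$ on the transformed scale and back-transform via $\hat f(x) = \hat g(t(x))\,t'(x)$. The bandwidth, $\lambda_1$, and the lognormal test sample below are illustrative choices:

```python
import numpy as np

def gauss_kde(y_grid, y_data, h):
    u = (y_grid[:, None] - y_data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(y_data) * h * np.sqrt(2 * np.pi))

def transformation_kde(x_grid, data, h, lam1):
    """f_hat(x) = g_hat(t(x)) * t'(x) with the shifted log transform
    t(x) = ln(x + lam1), t'(x) = 1/(x + lam1); requires lam1 > -min(data)."""
    y_data = np.log(data + lam1)                     # transform the sample
    g_hat = gauss_kde(np.log(x_grid + lam1), y_data, h)
    return g_hat / (x_grid + lam1)                   # back-transform via t'(x)

# Illustrative use on a right-skewed sample.
rng = np.random.default_rng(0)
data = rng.lognormal(size=300)
grid = np.linspace(0.05, 8, 200)
f_hat = transformation_kde(grid, data, h=0.3, lam1=0.1)
```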
Another approach is to estimate $t$ nonparametrically. If $F$ and $G$ are the c.d.f.s corresponding to the p.d.f.s $f$ and $g$, then $Y = G^{-1}(F(X))$ has density $g$. One could choose a $G$ that is easy to estimate, and take $t = G^{-1}(\hat F(X))$.
Be aware that there is still a large bias at the boundary, but the performance is greatly improved.
References
[1] I.S. Abramson. On bandwidth variation in kernel estimates - a square root law. Annals of Statistics, 10:1217–23, 1982.
[2] D.B.H. Cline. Admissible kernel estimators of a multivariate density. Annals of Statistics, 16:1421–7, 1988.
[3] K.B. Davis. Mean square error properties of density estimates. Annals of Statistics, 3:1025–30, 1975.
[4] J.L. Hodges and E.L. Lehmann. The efficiency of some nonparametric competitors to the t-test. Annals of Mathematical Statistics, 27:324–35, 1956.
[5] D.O. Loftsgaarden and C.P. Quesenberry. A nonparametric density estimate of a multivariate density function. Annals of Mathematical Statistics, 36:1049–51, 1965.
[6] J.S. Marron and D. Nolan. Canonical kernels for density estimation. Statist. Probab. Lett., 7:195–9, 1989.
[7] J.S. Marron and M.P. Wand. Exact mean integrated squared error. Annals of Statistics, 20:712–36, 1992.
[8] W.R. Schucany. Locally optimal window width for kernel density estimation with large samples. Statist. Probab. Lett., 7:401–5, 1989.
[9] D.W. Scott. On optimal and data-based histograms. Biometrika, 66:605–10, 1979.
[10] D.W. Scott. Average shifted histograms: effective nonparametric density estimators in several dimensions. Annals of Statistics, 13:1024–40, 1985.