Kernel Smoothing
Zhi Ouyang
August, 2005
1 Questions
• What are the statistical properties of kernel estimators?
• What influence do the shape and scaling of the kernel function have on the estimator?
• Sensitivity to the placement of bin edges is a problem of histograms not shared by other density estimators; one solution is the averaged shifted histogram [10], which is an appealing motivation for kernel methods.
• Multivariate histograms.
• Histograms do not use the data as efficiently as kernel estimators.
4 The estimator
Suppose we have a random sample $X_1, \dots, X_n$ taken from a continuous, univariate density $f$.
$$\hat f(x;h) = \frac{1}{nh}\sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right), \quad\text{or}\quad \hat f(x;h) = \frac{1}{n}\sum_{i=1}^n K_h(x - X_i), \quad\text{where } K_h(u) = \frac{1}{h}K\!\left(\frac{u}{h}\right).$$
We shall see that the choice of the shape of the kernel function is not a particularly important one, but the choice of the bandwidth $h$ is crucial.
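As a concrete illustration, here is a minimal sketch of this estimator in Python (the Gaussian kernel, the sample, the bandwidth, and the grid are illustrative choices, not prescribed by the text):

```python
import numpy as np

def kde(x_grid, data, h, K=lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)):
    """Kernel density estimate f_hat(x; h) = (1/(n h)) * sum_i K((x - X_i)/h)."""
    u = (x_grid[:, None] - data[None, :]) / h   # (grid, n) matrix of scaled differences
    return K(u).sum(axis=1) / (len(data) * h)

# Illustrative use: a small normal sample and a hand-picked bandwidth.
rng = np.random.default_rng(0)
sample = rng.normal(size=200)
grid = np.linspace(-4, 4, 401)
f_hat = kde(grid, sample, h=0.4)
```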
5.2 MISE
However, we are more interested in estimating $f$ on the whole real line. Consider the integrated squared error (ISE) and the mean integrated squared error (MISE):
$$\mathrm{ISE}\{\hat f(\cdot;h)\} = \int \{\hat f(x;h) - f(x)\}^2\,dx;$$
$$\mathrm{MISE}\{\hat f(\cdot;h)\} = \mathbb{E}\int \{\hat f(x;h) - f(x)\}^2\,dx = \int \mathbb{E}\{\hat f(x;h) - f(x)\}^2\,dx$$
$$= n^{-1}\int \{(K_h^2 * f)(x) - (K_h * f)^2(x)\}\,dx + \int \{(K_h * f)(x) - f(x)\}^2\,dx.$$
Notice that
$$\begin{aligned}
\int (K_h^2 * f)(x)\,dx &= \int\!\!\int \frac{1}{h^2} K^2\!\Big(\frac{x-y}{h}\Big) f(y)\,dy\,dx \\
&= \int\!\!\int \frac{1}{h} K^2(z)\,f(x - hz)\,dz\,dx \\
&= \frac{1}{h}\int K^2(z) \int f(x - hz)\,dx\,dz \\
&= \frac{1}{h}\int K^2(x)\,dx.
\end{aligned}$$
Then the MISE can be written as
$$\mathrm{MISE}\{\hat f(\cdot;h)\} = \frac{1}{nh}\int K^2(x)\,dx + \Big(1 - \frac{1}{n}\Big)\int (K_h * f)^2(x)\,dx - 2\int (K_h * f)(x) f(x)\,dx + \int f^2(x)\,dx.$$
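This exact formula is easy to evaluate when the convolutions have closed forms. A minimal numerical sketch, assuming $f = N(0,1)$ and a standard normal kernel, so that $K_h * f$ is the $N(0, 1 + h^2)$ density (these choices are this illustration's assumptions):

```python
import numpy as np

def mise_gaussian(n, h):
    """Exact MISE of the Gaussian-kernel KDE when f is standard normal.

    Uses phi_a * phi_b = phi_sqrt(a^2 + b^2) to evaluate each term of the
    exact MISE expression above in closed form.
    """
    s2 = 1.0 + h**2                                        # variance of K_h * f
    int_K2 = 1.0 / (2.0 * np.sqrt(np.pi))                  # int K^2 = R(K)
    int_conv2 = 1.0 / (2.0 * np.sqrt(np.pi * s2))          # int (K_h * f)^2
    int_cross = 1.0 / np.sqrt(2.0 * np.pi * (s2 + 1.0))    # int (K_h * f) f
    int_f2 = 1.0 / (2.0 * np.sqrt(np.pi))                  # R(f)
    return int_K2 / (n * h) + (1 - 1/n) * int_conv2 - 2 * int_cross + int_f2

# Illustration: MISE as a function of h for n = 100.
for h in (0.1, 0.3, 0.5, 1.0):
    print(h, mise_gaussian(100, h))
```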
5.3 MIAE
We could also work with other criteria, such as the mean integrated absolute error (MIAE):
$$\mathrm{MIAE}\{\hat f(\cdot;h)\} = \mathbb{E}\int |\hat f(x;h) - f(x)|\,dx.$$
The MIAE is always defined whenever $\hat f(x;h)$ is a density, and it is invariant under monotone transformations, but it is more complicated to work with.
6.2 Example
Suppose we have a random sample $X_1, \dots, X_n$ from $N(\mu, \sigma^2)$, and we are interested in estimating $e^\mu$. The maximum likelihood estimator is $e^{\bar X}$, and
$$\mathbb{E}\{e^{\bar X}\} = e^{\mu + \frac{\sigma^2}{2n}}; \quad \mathbb{E}\{e^{2\bar X}\} = e^{2\mu + \frac{2\sigma^2}{n}}; \quad \mathbb{V}\{e^{\bar X}\} = e^{2\mu + \frac{\sigma^2}{n}}\big(e^{\frac{\sigma^2}{n}} - 1\big).$$
Then the MSE can be approximated as
$$\begin{aligned}
\mathrm{MSE}(e^{\bar X}) &= e^{2\mu}\big(e^{\frac{2\sigma^2}{n}} - 2e^{\frac{\sigma^2}{2n}} + 1\big) \\
&= e^{2\mu}\left\{1 + \frac{2\sigma^2}{n} + \frac{1}{2}\Big(\frac{2\sigma^2}{n}\Big)^2 + \cdots - 2\Big(1 + \frac{\sigma^2}{2n} + \frac{1}{2}\Big(\frac{\sigma^2}{2n}\Big)^2 + \cdots\Big) + 1\right\} \\
&\sim \frac{1}{n}\,\sigma^2 e^{2\mu}.
\end{aligned}$$
An MSE convergence rate of order $n^{-1}$ is typical for parametric estimators; we shall see that the rates of convergence of nonparametric kernel estimators are typically slower than $n^{-1}$.
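A quick Monte Carlo check of this rate (the values of $\mu$, $\sigma$, and the replication count are illustrative choices; $\bar X \sim N(\mu, \sigma^2/n)$ is sampled directly):

```python
import numpy as np

# Check empirically that MSE(e^Xbar) ~ sigma^2 * e^{2 mu} / n.
rng = np.random.default_rng(0)
mu, sigma, reps = 0.5, 1.0, 200_000

for n in (50, 200, 800):
    xbar = rng.normal(mu, sigma / np.sqrt(n), size=reps)   # Xbar ~ N(mu, sigma^2/n)
    mse = np.mean((np.exp(xbar) - np.exp(mu))**2)
    print(n, mse, sigma**2 * np.exp(2 * mu) / n)           # empirical vs. asymptotic
```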
(i) The density $f$ has second derivative $f''$ which is continuous, square integrable and monotone.
(ii) The bandwidth $h = h_n$ is a non-random sequence of positive numbers such that
$$\lim_{n\to\infty} h = 0 \quad\text{and}\quad \lim_{n\to\infty} nh = \infty.$$
(iii) The kernel $K$ is a bounded probability density function having finite fourth moment and symmetric about the origin, i.e.
$$\int K(z)\,dz = 1, \quad \int zK(z)\,dz = 0, \quad \mu_2(K) := \int z^2 K(z)\,dz < \infty.$$
Also, denote $R(g) := \int g^2(z)\,dz$.
7.2 Calculations
Recall that
$$\mathbb{E}\hat f(x;h) = (K_h * f)(x) = \int \frac{1}{h} K\!\Big(\frac{x-y}{h}\Big) f(y)\,dy = \int K(z)\,f(x - hz)\,dz.$$
Expanding $f(x - hz)$ about $x$, we obtain
$$f(x - hz) = f(x) - hz f'(x) + \frac{1}{2}h^2 z^2 f''(x) + o(h^2),$$
uniformly in $z$; hence
$$\mathbb{E}\hat f(x;h) - f(x) = \frac{1}{2}h^2 \mu_2(K) f''(x) + o(h^2).$$
Similarly, the variance satisfies $\mathbb{V}\hat f(x;h) = (nh)^{-1}R(K)f(x) + o\{(nh)^{-1}\}$, so that
$$\mathrm{MISE}\{\hat f(\cdot;h)\} = \mathrm{AMISE}\{\hat f(\cdot;h)\} + o\{(nh)^{-1} + h^4\}, \quad \mathrm{AMISE}\{\hat f(\cdot;h)\} = \frac{R(K)}{nh} + \frac{1}{4}h^4 \mu_2^2(K) R(f'').$$
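For completeness, the variance expansion follows from the same change of variables used above (a standard calculation, sketched here rather than taken from the original text):
$$\begin{aligned}
\mathbb{V}\hat f(x;h) &= n^{-1}\{(K_h^2 * f)(x) - (K_h * f)^2(x)\} \\
&= (nh)^{-1}\int K^2(z)\,f(x - hz)\,dz - n^{-1}(K_h * f)^2(x) \\
&= (nh)^{-1}R(K)f(x) + o\{(nh)^{-1}\},
\end{aligned}$$
since $\int K^2(z) f(x - hz)\,dz \to R(K)f(x)$ as $h \to 0$ and the $n^{-1}$ term is $o\{(nh)^{-1}\}$ when $nh \to \infty$.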
Notice that the tail term $o\{(nh)^{-1} + h^4\}$ reflects the variance-bias trade-off, while the AMISE is minimized at
$$h_{\mathrm{AMISE}} = \left\{\frac{R(K)}{n\,\mu_2^2(K)\,R(f'')}\right\}^{1/5}, \quad \inf_{h>0}\mathrm{AMISE}\{\hat f(\cdot;h)\} = \frac{5}{4}\{\mu_2^2(K)R(K)^4 R(f'')\}^{1/5}\, n^{-4/5}.$$
Equivalently, as $n$ goes to infinity,
$$h_{\mathrm{MISE}} \sim \left\{\frac{R(K)}{n\,\mu_2^2(K)\,R(f'')}\right\}^{1/5}, \quad \inf_{h>0}\mathrm{MISE}\{\hat f(\cdot;h)\} \sim \frac{5}{4}\{\mu_2^2(K)R(K)^4 R(f'')\}^{1/5}\, n^{-4/5}.$$
Aside from its dependence on the known quantities $K$ and $n$, this expression shows that the optimal $h$ is inversely proportional to a power of the curvature of $f$, as measured by $R(f'')$: the more curved the density, the smaller the optimal bandwidth. The problem is that we do not know the curvature, but there are ways to estimate it.
It can also be seen that the best obtainable rate of convergence of the MISE of the kernel estimator is of order $n^{-4/5}$, which is less efficient than the $n^{-1}$ rate of the parametric MSE.
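As a concrete instance, plugging the normal reference $f = N(0, \sigma^2)$ (so $R(f'') = 3/(8\sqrt{\pi}\,\sigma^5)$) and the Gaussian kernel into $h_{\mathrm{AMISE}}$ gives the familiar rule $h = (4/3)^{1/5}\,\sigma\, n^{-1/5} \approx 1.06\,\sigma\, n^{-1/5}$. A minimal sketch, with $\sigma$ estimated by the sample standard deviation (an assumption of this illustration):

```python
import numpy as np

def h_normal_reference(data):
    """h_AMISE under a normal reference density and a Gaussian kernel:
    h = (4/3)^(1/5) * sigma * n^(-1/5), with sigma estimated from the data."""
    n = len(data)
    sigma = np.std(data, ddof=1)   # sample estimate of the unknown scale
    return (4.0 / 3.0)**0.2 * sigma * n**(-0.2)
```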
Then, for $x$ in the bin $[x_k, x_k + b)$,
$$\begin{aligned}
\mathbb{E}\hat f(x;b) &= \frac{1}{b}\int_{x_k}^{x_k+b} f(t)\,dt = \frac{F(x_k+b) - F(x_k)}{b} = f(x_k) + \frac{b}{2}f'(x_k) + o(b); \\
\mathrm{Bias}\{\hat f(x;b)\} &= f(x_k) + \frac{b}{2}f'(x_k) + o(b) - \{f(x_k) + (x - x_k)f'(x_k) + o(x - x_k)\} \\
&= \Big\{\frac{b}{2} - (x - x_k)\Big\}f'(x_k) + o(b); \\
\mathbb{E}\hat f^2(x;b) &= \frac{1}{nb^2}\{F(x_k+b) - F(x_k)\} + \frac{n(n-1)}{n^2 b^2}\{F(x_k+b) - F(x_k)\}^2; \\
\mathbb{V}\hat f(x;b) &= \frac{1}{nb}\{f(x_k) + o(1)\} - \frac{1}{n}\{f(x_k) + o(1)\}^2.
\end{aligned}$$
Varying $x$ over the bins and integrating, we obtain
$$\mathrm{MISE}\{\hat f(\cdot;b)\} = \frac{1}{nb} + \frac{b^2}{12}R(f') + o\Big\{\frac{1}{nb} + b^2\Big\}.$$
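Minimizing $(nb)^{-1} + b^2 R(f')/12$ over $b$ gives $b^* = \{6/(nR(f'))\}^{1/3}$; with a normal reference density this becomes the rule $b^* \approx 3.49\,\sigma\, n^{-1/3}$ of Scott [9]. A small sketch (the sample estimate of $\sigma$ is an illustrative choice):

```python
import numpy as np

def scott_bin_width(data):
    """Normal-reference optimal bin width b* = (24 sqrt(pi))^(1/3) * sigma * n^(-1/3)
    (approx. 3.49 sigma n^(-1/3)), from minimizing (nb)^-1 + b^2 R(f')/12."""
    n = len(data)
    sigma = np.std(data, ddof=1)
    return (24.0 * np.sqrt(np.pi))**(1/3) * sigma * n**(-1/3)
```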
Here $\phi_\sigma$ denotes the $N(0, \sigma^2)$ density. For a normal mixture density $f(x) = \sum_{l=1}^k w_l\,\phi_{\sigma_l}(x - \mu_l)$, where $k \in \mathbb{Z}_+$, $w_1, \dots, w_k$ are positive numbers that sum to one, and for each $l$, $\mu_l \in \mathbb{R}$ and $\sigma_l^2 > 0$, it is similarly (almost trivial to verify, for the Gaussian kernel $K = \phi$) the case that
$$\mathrm{MISE}\{\hat f(\cdot;h)\} = \frac{1}{2\sqrt{\pi}\,nh} + w^T\{(1 - n^{-1})\Omega_2 - 2\Omega_1 + \Omega_0\}w,$$
where $w = (w_1, \dots, w_k)^T$ and $\Omega_a[l, l'] = \phi_{\sqrt{ah^2 + \sigma_l^2 + \sigma_{l'}^2}}(\mu_l - \mu_{l'})$.
If we plot the exact and the asymptotic MISE, IV (integrated variance) and ISB (integrated squared bias) against the bandwidth, we see that IV/AIV decreases fairly "uniformly" as $\log h$ increases, but ISB/AISB increases very "non-uniformly". This is because the bias approximation is based on the assumption that $h \to 0$. Overall, for densities close to normality the bias approximation tends to be quite reasonable, while for densities with more features, such as multiple modes, the approximation becomes worse; see Marron and Wand [7].
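The exact formula is straightforward to evaluate. A sketch (SciPy assumed; the bimodal mixture, sample size, and bandwidth grid are illustrative):

```python
import numpy as np
from scipy.stats import norm

def exact_mise(w, mu, sig2, n, h):
    """Exact MISE of the Gaussian-kernel KDE for a normal mixture density:
    MISE = 1/(2 sqrt(pi) n h) + w^T ((1 - 1/n) O2 - 2 O1 + O0) w,
    with O_a[l, l'] = phi_{sqrt(a h^2 + sig_l^2 + sig_l'^2)}(mu_l - mu_l')."""
    w, mu, sig2 = map(np.asarray, (w, mu, sig2))
    dmu = mu[:, None] - mu[None, :]            # pairwise mean differences
    ssum = sig2[:, None] + sig2[None, :]       # pairwise variance sums
    omega = lambda a: norm.pdf(dmu, scale=np.sqrt(a * h**2 + ssum))
    Q = (1 - 1/n) * omega(2) - 2 * omega(1) + omega(0)
    return 1 / (2 * np.sqrt(np.pi) * n * h) + w @ Q @ w

# Illustration: a bimodal mixture, n = 100, over a few bandwidths.
for h in (0.1, 0.25, 0.5):
    print(h, exact_mise([0.5, 0.5], [-1.5, 1.5], [1.0, 1.0], 100, h))
```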
Hodges and Lehmann [4] showed that the solution can be written as
$$K^a(x) = \frac{3}{4}\Big(1 - \frac{x^2}{5a^2}\Big)\frac{1}{5^{1/2}a}\,1\{|x| < 5^{1/2}a\}.$$
A special case is $a = 1/\sqrt{5}$ (the Epanechnikov kernel), where
$$K^*(x) = \frac{3}{4}(1 - x^2)\,1\{|x| < 1\}.$$
The efficiency of a kernel $K$ relative to $K^*$ is defined as $\{C(K^*)/C(K)\}^{5/4}$. The family
$$K(x;p) = \{2^{2p+1}B(p+1, p+1)\}^{-1}(1 - x^2)^p\,1\{|x| < 1\},$$
where $B(\cdot,\cdot)$ is the beta function, is supported on $[-1, 1]$ and contains the common compactly supported kernels: $p = 0$ gives the uniform, $p = 1$ the Epanechnikov, $p = 2$ the biweight and $p = 3$ the triweight kernel.
Table 1 shows that the efficiency improves very little across different kernel shapes. The uniform kernel is not very popular in practice, since the resulting estimate is piecewise constant. Even the Epanechnikov kernel is not entirely attractive, because the estimator then has a discontinuous first derivative.
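These efficiencies are easy to reproduce numerically. A sketch, taking $C(K) = \{R(K)^4\mu_2^2(K)\}^{1/5}$ (an assumption consistent with the definition above; the kernels and integration limits are illustrative):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import beta

def C(K, lo, hi):
    """C(K) = {R(K)^4 mu_2(K)^2}^(1/5), the kernel-dependent AMISE factor."""
    R = quad(lambda x: K(x)**2, lo, hi)[0]
    mu2 = quad(lambda x: x**2 * K(x), lo, hi)[0]
    return (R**4 * mu2**2)**0.2

def K_poly(p):
    """Symmetric beta family K(x; p) = (1 - x^2)^p / (2^(2p+1) B(p+1, p+1)) on [-1, 1]."""
    c = 2.0**(2 * p + 1) * beta(p + 1, p + 1)
    return lambda x: (1 - x**2)**p / c

epan = C(K_poly(1), -1, 1)   # p = 1 is the Epanechnikov kernel K*
for name, K, lo, hi in [("uniform", K_poly(0), -1, 1),
                        ("epanechnikov", K_poly(1), -1, 1),
                        ("biweight", K_poly(2), -1, 1),
                        ("normal", lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi), -8, 8)]:
    print(name, (epan / C(K, lo, hi))**1.25)   # efficiency relative to K*
```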
10 Higher-order kernels
10.1 Why higher-order kernels?
We know that the best obtainable rate of convergence of the kernel estimator is of order $n^{-4/5}$. If we relax the condition that $K$ must be a density, the convergence rate can be faster. For example, recall that the asymptotic bias is given by
$$\mathbb{E}\hat f(x;h) - f(x) = \frac{1}{2}h^2\mu_2(K)f''(x) + o(h^2).$$
If we choose $K$ with $\mu_2(K) = 0$, the MSE and MISE attain the optimal convergence rate of order $n^{-8/9}$.
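One standard example of such a fourth-order kernel (not given in the text) is $K_4(x) = \frac{1}{2}(3 - x^2)\phi(x)$, which integrates to one but has $\mu_2(K_4) = 0$ at the cost of taking negative values. A quick numerical check:

```python
import numpy as np
from scipy.integrate import quad

phi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
K4 = lambda x: 0.5 * (3 - x**2) * phi(x)   # fourth-order Gaussian kernel

print(quad(K4, -np.inf, np.inf)[0])                      # ~1: integrates to one
print(quad(lambda x: x**2 * K4(x), -np.inf, np.inf)[0])  # ~0: second moment vanishes
print(K4(2.5))                                           # negative: K4 is not a density
```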
An extreme case of the higher-order kernels is the "infinite-order" kernel, such as the sinc kernel $K(x) = \sin x/(\pi x)$. The sinc kernel estimator suffers from the same drawback as other higher-order kernel estimators (the estimate can take negative values), and the good asymptotic performance is not guaranteed to carry over to finite sample sizes in practice.
The degree-of-difficulty functional $D(f) = \{\sigma(f)^5 R(f'')\}^{1/4}$ is scale invariant, where $\sigma(f)$ is the population standard deviation of $f$. The inclusion of the $1/4$ power allows an equivalent-sample-size interpretation, as was done for the comparison of kernel shapes. One result for $D(f)$ is that it attains its minimum at the Beta(4,4) shape,
$$f^*(x) = \frac{35}{32}(1 - x^2)^3\,1\{|x| < 1\}.$$
A more general result is that densities close to normality appear to be easier for the kernel estimator to estimate; the degree of estimation difficulty increases with skewness, kurtosis and multimodality.
holds for all $f$. The sample size efficiency depends on the magnitude of the ratio
$$\left\{\frac{R\big((f^2 f'')^{1/5}\big)}{R(f'')^{1/5}}\right\}^{5/4}.$$
There are other approaches, such as the nearest neighbor density estimator; see Loftsgaarden and Quesenberry [5]. It uses the distance from $x$ to the data point that is the $k$-th nearest to $x$ (for some suitable $k$) in a pilot estimation step that is essentially equivalent to taking $h(x) \propto 1/f(x)$. However, $1/f(x)$ is usually not a satisfactory surrogate for the optimal $\{f(x)/f''(x)\}^{1/5}$.
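A minimal sketch of the Loftsgaarden-Quesenberry idea [5] in one dimension, using $\hat f(x) = k/\{2n\,d_k(x)\}$ (up to the $k$ versus $k-1$ convention; the choice of $k$ is left to the user and must satisfy $k \le n$):

```python
import numpy as np

def knn_density(x_grid, data, k):
    """k-th nearest neighbor density estimate f_hat(x) = k / (2 n d_k(x)),
    where d_k(x) is the distance from x to its k-th nearest data point."""
    n = len(data)
    d = np.abs(x_grid[:, None] - data[None, :])   # all |x - X_i| distances
    dk = np.sort(d, axis=1)[:, k - 1]             # k-th smallest per grid point
    return k / (2 * n * dk)
```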
Applying the mean value theorem, we can see that this estimator lies somewhere between $\hat f_L$ and $\hat f_V$:
$$\hat f_T(x; h, t) = \frac{1}{n}\sum_{i=1}^n \frac{t'(x)}{h}\, K\!\left\{\frac{t'(\xi_i)(x - X_i)}{h}\right\},$$
where $\xi_i$ lies between $x$ and $X_i$.
The best choice of the transformation $t$ depends quite heavily on the shape of $f$. If $f$ is a skewed unimodal density, then $t$ should be a convex function on the support of $f$, in order to reduce the skewness of $f$ in some sense. If $f$ is close to being symmetric but has a high amount of kurtosis, then $t$ should be concave to the left and convex to the right of the center of symmetry of $f$.
One approach is to apply the following family, called the shifted power family, to heavily skewed data:
$$t(x; \lambda_1, \lambda_2) = \begin{cases} (x + \lambda_1)^{\lambda_2}\,\mathrm{sign}(\lambda_2), & \lambda_2 \neq 0, \\ \ln(x + \lambda_1), & \lambda_2 = 0, \end{cases}$$
where $\lambda_1 > -\min_i(X_i)$.
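A sketch of transformation kernel estimation with the shifted log transform (the $\lambda_2 = 0$ member of this family): estimate $g$ on the transformed scale and back-transform via $\hat f(x) = \hat g(t(x))\,t'(x)$. The bandwidth, $\lambda_1$, and the lognormal test sample below are illustrative choices:

```python
import numpy as np

def gauss_kde(y_grid, y_data, h):
    u = (y_grid[:, None] - y_data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(y_data) * h * np.sqrt(2 * np.pi))

def transformation_kde(x_grid, data, h, lam1):
    """f_hat(x) = g_hat(t(x)) * t'(x) with the shifted log transform
    t(x) = ln(x + lam1), t'(x) = 1/(x + lam1); requires lam1 > -min(data)."""
    y_data = np.log(data + lam1)                     # transform the sample
    g_hat = gauss_kde(np.log(x_grid + lam1), y_data, h)
    return g_hat / (x_grid + lam1)                   # back-transform via t'(x)

# Illustrative use on a right-skewed sample.
rng = np.random.default_rng(0)
data = rng.lognormal(size=300)
grid = np.linspace(0.05, 8, 200)
f_hat = transformation_kde(grid, data, h=0.3, lam1=0.1)
```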
Another approach is to estimate $t$ nonparametrically. If $F$ and $G$ are the c.d.f.s corresponding to the p.d.f.s $f$ and $g$, then $Y = G^{-1}(F(X))$ has density $g$. One could choose a $G$ that is easy to estimate, and take $t = G^{-1}(\hat F(X))$.
Be aware that there is still a large bias at the boundary, but the performance is greatly improved.
References
[1] I.S. Abramson. On bandwidth variation in kernel estimates - a square root law. Annals of Statistics, 10:1217–23, 1982.
[2] D.B.H. Cline. Admissible kernel estimators of a multivariate density. Annals of Statistics, 16:1421–7, 1988.
[3] K.B. Davis. Mean square error properties of density estimates. Annals of Statistics, 3:1025–30, 1975.
[4] J.L. Hodges and E.L. Lehmann. The efficiency of some nonparametric competitors to the t-test. Annals of Mathematical Statistics, 27:324–35, 1956.
[5] D.O. Loftsgaarden and C.P. Quesenberry. A nonparametric density estimate of a multivariate density function. Annals of Mathematical Statistics, 36:1049–51, 1965.
[6] J.S. Marron and D. Nolan. Canonical kernels for density estimation. Statist. Probab. Lett., 7:195–9, 1989.
[7] J.S. Marron and M.P. Wand. Exact mean integrated squared error. Annals of Statistics, 20:712–36, 1992.
[8] W.R. Schucany. Locally optimal window width for kernel density estimation with large samples. Statist. Probab. Lett., 7:401–5, 1989.
[9] D.W. Scott. On optimal and data-based histograms. Biometrika, 66:605–10, 1979.
[10] D.W. Scott. Average shifted histograms: effective nonparametric density estimators in several dimensions. Annals of Statistics, 13:1024–40, 1985.