Outline
1. Motivation: Why Nonparametric Regression?
2. Kernel Regression I: Basics on Kernel Smoothers
3. Kernel Regression II: Local Linear Regression
4. Kernel Regression III: Local Polynomial Regression
5. Bandwidth Selection
Motivation
In a previous lecture we assumed, given data $\{(x_i, y_i)\}_{i=1}^n$, that a linear specification would give a good approximation for describing the link between the covariates $x_i^T = (x_1, \ldots, x_k)$ and the response variable $y_i$, i.e.

$y_i = x_i^T \beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n.$

This approach is parametric in the sense that it only depends on a finite number of parameters, $\beta \in \mathbb{R}^k$. In many contexts of interest this specification may however be too rigid to uncover any structure in the data, so that generalizations are necessary.
Motivation
To have a more flexible framework we replace $x_i^T \beta$ by a general function $m(x_i)$, and hence move into an infinite-dimensional setting. The model is now based on the specification

$y_i = m(x_i) + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n.$

This approach is nonparametric in the sense that we need an infinite number of parameters to describe the regression function $m(\cdot)$. Our objective: estimate the regression function $m(\cdot)$ from the observations $\{(x_i, y_i)\}_{i=1}^n$ by using kernel-based procedures.
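As a running illustration, data from such a model can be simulated directly; the sketch below uses $m(x) = \sin(4x)$ and the noise level from the simulated example that appears in the bandwidth-selection slides later on.

## Simulate y_i = m(x_i) + eps_i with m(x) = sin(4x) (illustrative choice)
set.seed(123)
x <- runif(200, 0, 1)
y <- sin(4 * x) + rnorm(200, 0, 1/3)
plot(x, y, col = gray(0.5))
lines(seq(0, 1, length = 100), sin(4 * seq(0, 1, length = 100)), lty = 2)  # true m(.)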
Kernel Smoothing
Basics
The estimation procedures to be discussed here are based on kernel functions, i.e. functions $K$ such that

$\int K(x)\,dx = 1.$
Next week the problem of kernel density estimation is revisited to provide more background for those who are unfamiliar with the subject.
Kernel Smoothing
Some technical asides
The best obtainable rate of convergence of kernel estimators, in terms of the mean integrated squared error

$\mathrm{MISE}(\hat f_h) = E_f \int (\hat f_h(x) - f(x))^2\,dx,$

is $O(n^{-4/5})$, but it is possible to improve it slightly, to $O(n^{-8/9})$, if we do not restrict our attention to density functions. Since the price to pay is high in terms of interpretation, and the gains are marginal for sample sizes often found in practice, we nevertheless restrict our attention to kernels which are density functions. Typically $K$ is taken to be symmetric and unimodal, and there is a sense in which kernels which do not obey these requisites are inadmissible (Cline, 1988).
Our first method to estimate the regression function $m(\cdot)$ is the Nadaraya-Watson estimator. This is based on approximating $E(Y \mid X = x) = \int y f(x, y)\,dy / f(x)$ using

$\hat m(x) = \hat E(Y \mid X = x) = \frac{n^{-1} \sum_{i=1}^n K_h(x - x_i) Y_i}{\hat f_h(x)} = \frac{\sum_{i=1}^n K_h(x - x_i) Y_i}{\sum_{i=1}^n K_h(x - x_i)},$

where $K_h(u) = h^{-1} K(u/h)$ and $h > 0$ is the bandwidth. The estimator works as a rolling weighted mean, with the weights defined by the kernel function $K$.
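A minimal R sketch of this estimator with a Gaussian kernel (the helper name nw and the evaluation grid are our own choices):

## Nadaraya--Watson estimator at a point x0, Gaussian kernel
nw <- function(x0, x, y, h) {
  w <- dnorm((x0 - x) / h) / h   # K_h(x0 - x_i)
  sum(w * y) / sum(w)            # locally weighted mean of the Y_i
}
## Evaluate on a grid over the duration range of the faithful data used later
grid <- seq(1.5, 5.1, length = 100)
mhat <- sapply(grid, nw, x = faithful$eruptions, y = faithful$waiting, h = 0.5)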
This local mean estimator can be shown to be the solution to an optimization problem of interest:

$\hat m(x) = \arg\min_{\beta(x)} \sum_{i=1}^n K_h(x - x_i)\,(Y_i - \beta(x))^2.$

We now need to think about the questions: Choice of the kernel? Choice of the bandwidth?
Some Kernels
Uniform: $K(u) = \frac{1}{2}\, I_{\{|u| \le 1\}}$
Triangular: $K(u) = (1 - |u|)\, I_{\{|u| \le 1\}}$
Epanechnikov: $K(u) = \frac{3}{4}(1 - u^2)\, I_{\{|u| \le 1\}}$
Biweight: $K(u) = \frac{15}{16}(1 - u^2)^2\, I_{\{|u| \le 1\}}$
Triweight: $K(u) = \frac{35}{32}(1 - u^2)^3\, I_{\{|u| \le 1\}}$
Gaussian: $K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$, $u \in \mathbb{R}$
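These kernels are straightforward to code; a quick sketch (function names are our own) that also checks that each integrates to one:

## The six kernels above as R functions of u
kernels <- list(
  uniform      = function(u) 1/2 * (abs(u) <= 1),
  triangular   = function(u) (1 - abs(u)) * (abs(u) <= 1),
  epanechnikov = function(u) 3/4 * (1 - u^2) * (abs(u) <= 1),
  biweight     = function(u) 15/16 * (1 - u^2)^2 * (abs(u) <= 1),
  triweight    = function(u) 35/32 * (1 - u^2)^3 * (abs(u) <= 1),
  gaussian     = dnorm
)
## Each should integrate to (approximately) 1
sapply(kernels, function(K) integrate(K, -5, 5)$value)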
Some Kernels
[Figure: panels plotting the kernels above as functions of u; panel titles include "Uniform kernel", "Triangular kernel", "Epanechnikov kernel", and "Gaussian kernel".]
A Unifying Kernel
Most of the kernels presented above are particular cases of a more general kernel defined as

$K_q(u) = \{2^{2q+1} B(q+1, q+1)\}^{-1} (1 - u^2)^q\, I_{\{|u| \le 1\}},$

where $B(\cdot, \cdot)$ denotes the Beta function, $B(x, y) = \int_0^1 u^{x-1} (1 - u)^{y-1}\,du$. Specifically we have that:
$q = 0$ gives the uniform kernel;
$q = 1$ gives the Epanechnikov kernel;
$q = 2$ gives the biweight kernel;
$q = 3$ gives the triweight kernel;
$q \to \infty$ gives, after rescaling, the Gaussian kernel.
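A one-line R version of this family (the name Kq is our own), with a check against the Epanechnikov kernel:

## Unifying kernel; q = 0, 1, 2, 3 recover the uniform, Epanechnikov,
## biweight and triweight kernels
Kq <- function(u, q) {
  ifelse(abs(u) <= 1, (1 - u^2)^q, 0) / (2^(2 * q + 1) * beta(q + 1, q + 1))
}
Kq(0.3, q = 1)       # 0.6825
3/4 * (1 - 0.3^2)    # the Epanechnikov kernel at 0.3: also 0.6825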
The choice of bandwidth is however critical to the performance of the estimator and far more important than the choice of kernel.
[Figure: four panels titled "Data", "Undersmoothed", "OK", and "Oversmoothed" — the same scatterplot of y against x, with kernel regression fits at increasing bandwidths in the last three panels.]
The problem of bandwidth selection is a complicated one, and will be discussed later. For now it is important to preserve the idea that the selection of $h$ involves a bias-variance tradeoff: if the bandwidth is too small, the estimator will be too rough; but if it is too large, important features will be smoothed out. The selected bandwidth should of course depend on the sample size, and we assume that

$h = h_n = o(1) \quad \text{and} \quad \lim_{n \to \infty} n h_n = \infty, \quad \text{as } n \to \infty.$
Figure: An eruption of the Old Faithful Geyser, Yellowstone National Park, Wyoming, US.
Data: Waiting time to next eruption; Duration of eruption; Observations: 272; Units: minutes; Source: Härdle (1991).
attach(faithful)
## The Nadaraya--Watson estimator over different bandwidths
## (the faithful columns are 'eruptions' and 'waiting')
par(mfrow = c(1, 3))
plot(eruptions, waiting, main = "h=0.1", xlab = "Duration", ylab = "Waiting time")
lines(ksmooth(eruptions, waiting, "normal", 0.1), lwd = 2, col = "red")
plot(eruptions, waiting, main = "h=0.5", xlab = "Duration", ylab = "Waiting time")
lines(ksmooth(eruptions, waiting, "normal", 0.5), lwd = 2, col = "red")
plot(eruptions, waiting, main = "h=2", xlab = "Duration", ylab = "Waiting time")
lines(ksmooth(eruptions, waiting, "normal", 2), lwd = 2, col = "red")
[Figure: three panels titled "h=0.1", "h=0.5", and "h=2" — waiting time (50-90) against duration (1.5-4.5), each with the corresponding Nadaraya-Watson fit.]
In practice we typically observe that the Nadaraya-Watson estimator behaves more poorly near the edges of the region where we observe the data (say for durations close to 1.6 or 5.1 in the faithful data) and in regions where data are sparse (say for durations between 2.8 and 3.3). Asymptotic arguments can be used to show that the optimal MISE of the Nadaraya-Watson estimator, $O(n^{-4/5})$, deteriorates to $O(n^{-2/3})$ close to the boundary of the support of the covariate (Wand & Jones, 1995, §5.5). In the following we present an alternative to this local constant fit estimator which removes the bias exactly to first order.
Instead of taking a rolling weighted mean, we can consider a rolling weighted regression, with the weights defined by the kernel function $K$. This is exactly what is known as local linear regression; the corresponding estimator is now a solution to an optimization problem which extends the one from which we obtained the Nadaraya-Watson estimator:

$(\hat\beta_0(x), \hat\beta_1(x)) = \arg\min_{(\beta_0(x),\, \beta_1(x))} \sum_{i=1}^n K_h(x - x_i)\,(Y_i - \beta_0(x) - \beta_1(x)\, x_i)^2,$

$\hat m(x) = \hat\beta_0(x) + \hat\beta_1(x)\, x.$
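A minimal sketch of local linear regression via weighted least squares in base R (the helper loclin is our own):

## Local linear fit at a point x0, Gaussian kernel weights
loclin <- function(x0, x, y, h) {
  w <- dnorm((x0 - x) / h) / h                        # K_h(x0 - x_i)
  fit <- lm(y ~ x, weights = w)                       # weighted least squares
  unname(predict(fit, newdata = data.frame(x = x0)))  # beta0(x0) + beta1(x0) * x0
}
grid <- seq(1.5, 5.1, length = 100)
mhat <- sapply(grid, loclin, x = faithful$eruptions, y = faithful$waiting, h = 0.5)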
[Figure: local linear fit to the faithful data — waiting time (50-90) against duration (1.5-5.0).]
Local polynomial regression extends the ideas above to the context where we can have further polynomial terms in the regression. Hence we now perform a rolling weighted regression based on

$y_i = \beta_0(x) + \sum_{j=1}^p \beta_j(x)\, x_i^j + \varepsilon_i,$
where the weights are again controlled by the kernel $K$. This estimator is also the solution to an optimization problem of interest, and

$\hat m_p(x) = \hat\beta_0(x) + \sum_{j=1}^p \hat\beta_j(x)\, x^j, \qquad p \in \mathbb{N}_0.$

Particular cases:
Local constant regression ($p = 0$): $\hat m_0(x)$ is the Nadaraya-Watson estimator.
Local linear regression ($p = 1$):

$\hat m_1(x) = \frac{1}{n} \sum_{i=1}^n \frac{\{\hat s_2(x) - \hat s_1(x)(x_i - x)\}\, K_h(x_i - x)}{\hat s_2(x)\, \hat s_0(x) - \hat s_1(x)^2}\, Y_i,$

where $\hat s_r(x) = \frac{1}{n} \sum_{i=1}^n (x_i - x)^r K_h(x_i - x)$.
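A sketch of a degree-p local polynomial fit, again via weighted least squares in base R (KernSmooth::locpoly provides a fast binned alternative; the helper name below is our own):

## Local polynomial fit of degree p at a point x0, Gaussian kernel weights
locpoly_fit <- function(x0, x, y, h, p = 2) {
  w <- dnorm((x0 - x) / h) / h
  fit <- lm(y ~ poly(x, p, raw = TRUE), weights = w)
  unname(predict(fit, newdata = data.frame(x = x0)))
}
mhat2 <- sapply(seq(1.5, 5.1, length = 100), locpoly_fit,
                x = faithful$eruptions, y = faithful$waiting, h = 0.5, p = 2)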
[Figure: local polynomial fit to the faithful data — waiting time (50-90) against duration (1.5-5.0).]
What to Take: p = 0, 1 or 2?

Local intercept regression ($p = 0$) is nothing else than a weighted moving average. Such a simple local model might work well in some situations, but may not always approximate the underlying function well enough. Local linear fits ($p = 1$) might work better in that case. It can be shown that $\hat m_1(\cdot)$ has some nice MSE features and minimax optimality properties (Fan, 1992, 1993). Higher-degree polynomials would work in theory, but may tend to overfit the data in each subset and may be numerically unstable, making accurate computations difficult. Asymptotic arguments suggest that local polynomials of odd degree dominate those of even degree.
The problem of deciding how much to smooth is always of great importance in nonparametric regression. The selection of the smoothing parameter (bandwidth) is always related to a certain interpretation of the smooth. If the purpose of the smoothing is to increase the "signal to noise ratio", or to suggest a simple model, then a slightly "oversmoothed" curve with a subjectively chosen bandwidth might be desirable. On the other hand, when the interest is purely in estimating the regression curve itself, with an emphasis on local structures, then a slightly "undersmoothed" curve may be appropriate.
Bandwidth Selectors
General comments
To characterize a bandwidth selector $h_o$ as optimal, it is necessary to define relevant criteria. The minimization of the average squared error (ASE) defines one such criterion:

$h_o = \arg\min_{h \in \mathbb{R}^+} \mathrm{ASE}(h) = \arg\min_{h \in \mathbb{R}^+} \frac{1}{n} \sum_{i=1}^n (\hat m_h(x_i) - m(x_i))^2.$
Here we mainly focus on discussing a cross-validatory bandwidth selector, because of its simplicity, but many other strategies such as plug-in rules are available (Wand & Jones, 1995, §5.8).
Cross-Validation in Regression
Definition
Cross-validation entails estimating the regression function using the data with the $i$th observation removed; the resulting leave-one-out estimator is, for example, for $p = 0$,

$\hat m_{-i,h}(x_i) = \frac{\sum_{j \ne i} K_h(x_i - x_j)\, Y_j}{\sum_{j \ne i} K_h(x_i - x_j)}.$

The CV function

$\mathrm{CV}(h) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat m_{-i,h}(x_i))^2$

validates the ability to predict $\{y_i\}_{i=1}^n$ across the subsamples $\{(x_j, y_j)\}_{j=1, j \ne i}^n$. The cross-validated bandwidth is

$\hat h_{CV} = \arg\min_{h \in \mathbb{R}^+} \mathrm{CV}(h), \qquad \hat h_{CV} / h_o = 1 + o_p(1).$
Here $o_p(1)$ is the stochastic version of Landau's little-o notation, and is used to denote convergence to zero in probability.
[Figure: the cross-validation function CV(h) plotted against h, over roughly 0.2 to 0.8.]
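The snippet below uses a cross-validated value hopt; a minimal sketch of how such a value could be computed (our own helper, Gaussian kernel; note that ksmooth's bandwidth argument is scaled so that the kernel quartiles sit at ±0.25 * bandwidth, not as a Gaussian standard deviation):

## Leave-one-out CV for the Nadaraya--Watson estimator
cv <- function(h, x, y) {
  mean(sapply(seq_along(x), function(i) {
    w <- dnorm((x[i] - x[-i]) / h)      # weights with the i-th point left out
    (y[i] - sum(w * y[-i]) / sum(w))^2  # squared prediction error at x_i
  }))
}
hs <- seq(0.1, 1, by = 0.005)
hopt <- hs[which.min(sapply(hs, cv, x = eruptions, y = waiting))]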
## Nadaraya--Watson estimator evaluated at the optimal bandwidth by cross-validation
plot(eruptions, waiting, main = "h=0.4243375", xlab = "Duration", ylab = "Waiting time")
lines(ksmooth(eruptions, waiting, "normal", hopt), lwd = 2, col = "red")
[Figure: Nadaraya-Watson fit to the faithful data at the cross-validated bandwidth h = 0.4243375 — waiting time (50-90) against duration (1.5-5.0).]
Bandwidth Selection
R: ksmooth{stats}
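A sketch consistent with the figure on this slide (our own code, reusing the simulated sin(4x) example from the next slide; the bandwidth 0.1 is an illustrative choice):

set.seed(123)
x <- runif(200, 0, 1)
y <- sin(4 * x) + rnorm(200, 0, 1/3)
plot(x, y, col = gray(0.5))
lines(ksmooth(x, y, "normal", 0.1), col = "red")                           # kernel fit
lines(seq(0, 1, length = 100), sin(4 * seq(0, 1, length = 100)), lty = 2)  # true m(.)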
[Figure: ksmooth fit to simulated data — y (about -1.0 to 1.5) against x on [0, 1].]
Bandwidth Selection
R: h.select{sm} + ksmooth{stats}
library(sm)  # for h.select
set.seed(123)
x <- runif(200, 0, 1)
y <- sin(4 * x) + rnorm(200, 0, 1/3)
h1 <- h.select(x, y, method = "cv")
h2 <- h.select(x, y, method = "aicc")
plot(x, y, col = gray(0.5))
lines(ksmooth(x, y, "normal", h1), col = "red")
lines(ksmooth(x, y, "normal", h2), col = "green")
lines(seq(0, 1, length = 100), sin(4 * seq(0, 1, length = 100)), lty = 2)
[Figure: the resulting plot; legend: "CV bandwidth = 0.09", "AIC bandwidth = 0.13", "true".]
Bandwidth Selection
R: (h.select + sm.regression){sm}
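A minimal sketch consistent with this slide's title, combining h.select with sm.regression on the same simulated data (our own code, not the original slide's):

library(sm)
set.seed(123)
x <- runif(200, 0, 1)
y <- sin(4 * x) + rnorm(200, 0, 1/3)
h <- h.select(x, y, method = "cv")
sm.regression(x, y, h = h)  # local linear fit at the CV bandwidth
lines(seq(0, 1, length = 100), sin(4 * seq(0, 1, length = 100)), lty = 2)  # true m(.)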
[Figure: sm.regression fit to the simulated data — y against x on [0, 1].]