
Applied Statistics

J. Blanchet and J. Wadsworth


Institute of Mathematics, Analysis, and Applications EPF Lausanne

An MSc Course for Applied Mathematicians, Fall 2012

Outline

1 Motivation: Why Nonparametric Regression?
2 Kernel Regression I: Basics on Kernel Smoothers
3 Kernel Regression II: Local Linear Regression
4 Kernel Regression III: Local Polynomial Regression
5 Bandwidth Selection

Part I Motivation: Why Nonparametric Regression?

Motivation

Suppose that we collect a random sample $\{(x_i, y_i) : i = 1, \dots, n\}$.

In a previous lecture we assumed that a linear specification would give a good approximation for describing the link between the covariates $x_i^T = (x_{i1}, \dots, x_{ik})$ and the response variable $y_i$, i.e.
$y_i = x_i^T \beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \dots, n.$

This approach is parametric in the sense that it only depends on a finite number, $k$, of parameters. In many contexts of interest this specification may however be too rigid to uncover any structure in the data, so that generalizations are necessary.

Motivation

To have a more flexible framework we replace $x_i^T \beta$ by a general function $m(x_i)$, and hence move into an infinite-dimensional setting. The model is now based on the specification
$y_i = m(x_i) + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \dots, n.$

This approach is nonparametric in the sense that we need an infinite number of parameters to describe the regression function $m(\cdot)$.

Our objective: estimate the regression function $m(\cdot)$ from the observations $\{(x_i, y_i)\}_{i=1}^n$ using kernel-based procedures.

Why Should I Learn Nonparametric Regression?

To understand the origins of the Universe and test cosmological models.

Figure: Cosmic Microwave Background (CMB) temperature power spectrum as a nonlinear function of the multipole (Genovese et al., 2004). More information on the CMB and reduced-galaxy maps such as the one on the left can be found at http://lambda.gsfc.nasa.gov/

Part II Kernel Regression I: Basic Kernel Smoothers

Kernel Smoothing
Basics

The estimation procedures to be discussed here are based on kernel functions, i.e. functions $K$ such that
$\int K(x)\,dx = 1.$

Remember that kernel density estimation can be performed using
$\hat f_h(x) = \frac{1}{nh} \sum_{i=1}^n K\Big(\frac{x - x_i}{h}\Big) = \frac{1}{n} \sum_{i=1}^n K_h(x - x_i).$

Notation: $h$ is the bandwidth and $K_h(\cdot) = h^{-1} K(\cdot / h)$.

Next week the problem of kernel density estimation is revisited to provide more background for those who are unfamiliar with the subject.
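As a concrete illustration, the following is a minimal hand-rolled version of the estimator above with a Gaussian kernel; the function name kde_at and the simulated data are illustrative only, and in practice density{stats} does the same job.

## Sketch: kernel density estimate \hat f_h(x0) at a single point x0 (Gaussian kernel)
kde_at <- function(x0, x, h) {
  mean(dnorm((x0 - x) / h)) / h   # (1/(n*h)) * sum_i K((x0 - x_i)/h)
}
set.seed(1)
x <- rnorm(100)
grid <- seq(-3, 3, length = 200)
plot(grid, sapply(grid, kde_at, x = x, h = 0.4), type = "l", ylab = "density")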

Kernel Smoothing
Some technical asides

The best obtainable rate of convergence of kernel estimators, in terms of the mean integrated squared error
$\mathrm{MISE}(\hat f_h) = E_f \int \{\hat f_h(x) - f(x)\}^2 \, dx,$
is $O(n^{-4/5})$, but it is possible to improve it slightly, to $O(n^{-8/9})$, if we do not restrict our attention to kernels which are density functions. Since the price to pay is high in terms of interpretation, and the gains are marginal for sample sizes often found in practice, we nevertheless restrict our attention to kernels which are density functions. Typically $K$ is taken to be symmetric and unimodal, and there is a sense in which kernels which do not obey these requisites are inadmissible (Cline, 1988).
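As a rough, purely illustrative way to see such rates empirically, one can approximate the MISE by Monte Carlo for a known density; the sketch below (names mise_hat and nrep are illustrative, not from the lecture) does this for a standard normal sample and a Gaussian kernel.

## Sketch: Monte Carlo approximation of MISE(\hat f_h) when f is standard normal
mise_hat <- function(n, h, nrep = 200, grid = seq(-4, 4, length = 201)) {
  ise <- replicate(nrep, {
    x <- rnorm(n)
    fhat <- sapply(grid, function(g) mean(dnorm((g - x) / h)) / h)
    sum((fhat - dnorm(grid))^2) * diff(grid[1:2])   # Riemann sum of the ISE
  })
  mean(ise)
}
mise_hat(100, h = 0.4)   # compare across n and h to see how the error scales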

Local Intercept Regression

Recall our set-up is
$y_i = m(x_i) + \varepsilon_i.$

Idea: estimate $m(x) = E(Y \mid X = x)$ using a rolling weighted average of points $y$ with similar values of $x$.
Figure: Simulated data (y against x) illustrating the idea of a rolling weighted average.

Local Intercept Regression

R: ksmooth{stats}

Our first method to estimate the regression function $m(\cdot)$ is the Nadaraya–Watson estimator. This is based on approximating
$E(Y \mid X = x) = \int y f(x, y)\, dy / f(x)$
by
$\hat m(x) = \hat E(Y \mid X = x) = \frac{\frac{1}{n}\sum_{i=1}^n K_h(x - x_i)\, Y_i}{\hat f_h(x)} = \frac{\sum_{i=1}^n K_h(x - x_i)\, Y_i}{\sum_{i=1}^n K_h(x - x_i)}.$

The estimator works as a rolling weighted mean, with the weights defined by the kernel function $K$.
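A minimal sketch of the Nadaraya–Watson estimator, written out directly from the formula above with a Gaussian kernel and simulated data (the function name nw is illustrative). Note that ksmooth{stats} rescales its bandwidth argument internally (kernel quartiles at ±0.25·bandwidth), so nominal bandwidths there are not directly comparable with h here.

## Sketch: Nadaraya--Watson estimate at a grid of points (Gaussian kernel)
nw <- function(x0, x, y, h) {
  w <- dnorm((x0 - x) / h)   # kernel weights K_h(x0 - x_i), up to the common factor 1/h
  sum(w * y) / sum(w)        # locally weighted mean of the responses
}
set.seed(1)
x <- runif(200, 0, 10); y <- sin(x) + rnorm(200, 0, 0.3)
x0 <- seq(0, 10, length = 200)
plot(x, y, col = gray(0.5))
lines(x0, sapply(x0, nw, x = x, y = y, h = 0.5), col = "red")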

Local Intercept Regression

R: ksmooth{stats}

This local mean estimator can be shown to be the solution to an optimization problem of interest:
$\hat m(x) = \arg\min_{\beta(x)} \sum_{i=1}^n K_h(x - x_i)\{Y_i - \beta(x)\}^2.$

We now need to think about the questions:
Choice of the kernel?
Choice of the bandwidth?

Some Kernels
Uniform: $K(u) = \tfrac{1}{2}\, 1\{|u| \le 1\}$
Triangular: $K(u) = (1 - |u|)\, 1\{|u| \le 1\}$
Epanechnikov: $K(u) = \tfrac{3}{4}(1 - u^2)\, 1\{|u| \le 1\}$
Biweight: $K(u) = \tfrac{15}{16}(1 - u^2)^2\, 1\{|u| \le 1\}$
Triweight: $K(u) = \tfrac{35}{32}(1 - u^2)^3\, 1\{|u| \le 1\}$
Gaussian: $K(u) = \tfrac{1}{\sqrt{2\pi}} \exp(-u^2/2), \quad u \in \mathbb{R}$
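A short sketch along the following lines reproduces the figure on the next slide; the definitions are transcribed directly from the formulas above.

## Sketch: plotting the six kernels as functions of u
u <- seq(-1.5, 1.5, length = 400)
kernels <- list(
  Uniform      = 0.5 * (abs(u) <= 1),
  Triangular   = (1 - abs(u)) * (abs(u) <= 1),
  Epanechnikov = 0.75 * (1 - u^2) * (abs(u) <= 1),
  Biweight     = 15/16 * (1 - u^2)^2 * (abs(u) <= 1),
  Triweight    = 35/32 * (1 - u^2)^3 * (abs(u) <= 1),
  Gaussian     = dnorm(u))
par(mfrow = c(2, 3))
for (nm in names(kernels)) plot(u, kernels[[nm]], type = "l", ylim = c(0, 1.2), main = nm)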

Some Kernels
Figure: The uniform, triangular, Epanechnikov, biweight, triweight, and Gaussian kernels, plotted as functions of u.

A Unifying Kernel

Most of the kernels presented above are particular cases of a more general kernel defined as
$K_q(u) = \{2^{2q+1} B(q+1, q+1)\}^{-1} (1 - u^2)^q \big\{1\{|u| \le 1,\ q < \infty\} + 1\{u \in \mathbb{R},\ q = \infty\}\big\},$
where $B(\cdot, \cdot)$ denotes the Beta function
$B(x, y) = \int_0^1 u^{x-1} (1 - u)^{y-1}\, du.$

Specifically we have that:
q = 0: Uniform;
q = 1: Epanechnikov;
q = 2: Biweight;
q = 3: Triweight;
q = ∞: Normal (as a limiting case).
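A minimal sketch of this unifying kernel in R for finite q, using the base beta() function for B(q+1, q+1); the name Kq is illustrative.

## Sketch: the unifying kernel K_q; q = 0, 1, 2, 3 give the uniform,
## Epanechnikov, biweight and triweight kernels respectively
Kq <- function(u, q) {
  (1 - u^2)^q * (abs(u) <= 1) / (2^(2 * q + 1) * beta(q + 1, q + 1))
}
## quick check that each K_q integrates to 1
sapply(0:3, function(q) integrate(Kq, -1, 1, q = q)$value)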

Choice of the Kernel

We prefer smooth kernels, to ensure that the resulting estimator $\hat m(\cdot)$ is smooth.

We prefer compactly supported kernels, because this ensures that only data local to the point at which $m$ is estimated are used in the fit.

The optimal choice under some standard assumptions is the Epanechnikov kernel (this is formalized later). It achieves smoothness, compactness, and rapid computation.

Any sensible choice of kernel will produce acceptable results, so the choice of the kernel is not crucially important.

Choice of the Bandwidth?

The choice of bandwidth is however critical to the performance of the estimator and far more important than the choice of kernel.
Figure: Simulated data, together with undersmoothed, adequately smoothed ("OK"), and oversmoothed fits.

Choice of the Bandwidth? Later

The problem of bandwidth selection is a complicated one, and will be discussed later. For now it is important to retain the idea that the selection of h involves a bias–variance tradeoff: if the bandwidth is too small, the estimator will be too rough; but if it is too large, important features will be smoothed out.

The selected bandwidth should of course depend on the sample size, and we assume that
$h = h_n = o(1),$
but that $h$ converges to zero at a slower rate than $1/n$, i.e.
$\lim_{n \to \infty} nh = \infty.$

An Example: The Old Faithful Data


R: faithful{datasets}

Figure: An eruption of the Old Faithful Geyser, Yellowstone National Park, Wyoming, US.

An Example: The Old Faithful Data


R: faithful{datasets}

Data: Waiting time to the next eruption; duration of the eruption. Observations: 272. Units: minutes. Source: Härdle (1991).

The Nadaraya–Watson Estimator in R

R: faithful{datasets} + ksmooth{stats}

attach(faithful)
## The Nadaraya--Watson estimator over different bandwidths
par(mfrow=c(1,3))
plot(eruptions,waiting,main="h=0.1",xlab="Duration",ylab="Waiting time")
lines(ksmooth(eruptions,waiting,"normal",0.1),lwd=2,col="red")
plot(eruptions,waiting,main="h=0.5",xlab="Duration",ylab="Waiting time")
lines(ksmooth(eruptions,waiting,"normal",0.5),lwd=2,col="red")
plot(eruptions,waiting,main="h=2",xlab="Duration",ylab="Waiting time")
lines(ksmooth(eruptions,waiting,"normal",2),lwd=2,col="red")
Figure: Nadaraya–Watson estimates of waiting time between eruptions against duration of the eruption, for h = 0.1, 0.5, and 2.

The Boundary-Bias Problem

In practice we typically observe that the Nadaraya–Watson estimator behaves more poorly near the edges of the region where we observe the data (say for durations close to 1.6 or 5.1 in the faithful data) and in regions where data are sparse (say for durations between 2.8 and 3.3). Asymptotic arguments can be used to show that the optimal MISE of the Nadaraya–Watson estimator, $O(n^{-4/5})$, degrades to $O(n^{-2/3})$ close to the boundary of the support of the covariate (Wand & Jones, 1995, Section 5.5). In the following we present an alternative to this local constant fit which removes the bias exactly, to first order.

Part III Local Linear Regression

Local Linear Regression

Instead of taking a rolling weighted mean, we can consider a rolling weighted regression, with the weights defined by the kernel function K. This is exactly what is known as local linear regression; the corresponding estimator is the solution to an optimization problem which extends the one from which we obtained the Nadaraya–Watson estimator:
$(\hat\beta_0(x), \hat\beta_1(x)) = \arg\min_{(\beta_0(x),\, \beta_1(x))} \sum_{i=1}^n K_h(x - x_i)\{Y_i - \beta_0(x) - \beta_1(x)\, x_i\}^2.$

The estimator for the regression function is now given by
$\hat m(x) = \hat\beta_0(x) + \hat\beta_1(x)\, x.$
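The local linear estimator can also be computed directly as a weighted least-squares fit at each target point. The sketch below does this with lm() and a Gaussian kernel on the Old Faithful data; loclin is an illustrative name, and locpoly{KernSmooth} (used on the next slide) is the practical tool.

## Sketch: local linear regression at a single point x0 via weighted least squares
loclin <- function(x0, x, y, h) {
  w <- dnorm((x0 - x) / h)       # kernel weights K_h(x0 - x_i)
  fit <- lm(y ~ x, weights = w)  # locally weighted straight-line fit
  unname(predict(fit, newdata = data.frame(x = x0)))  # beta0(x0) + beta1(x0)*x0
}
attach(faithful)
x0 <- seq(1.6, 5, length = 100)
plot(eruptions, waiting)
lines(x0, sapply(x0, loclin, x = eruptions, y = waiting, h = 0.4), col = "red")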

Local Linear Fitting in R

R: locpoly{KernSmooth}

## local linear regression
require(KernSmooth)
fitlocpol<-locpoly(eruptions,waiting,bandwidth=0.4)
plot(eruptions,waiting,xlab="Duration",ylab="Waiting time",main="h=0.4")
lines(fitlocpol,type="l",lwd=2,col="red")
Figure: Local linear fit of waiting time against duration for the Old Faithful data.

Part IV Local Polynomial Regression

Local Polynomial Regression

Local polynomial regression extends the ideas above to the context where we can have further polynomial terms in the regression. Hence we now perform a rolling weighted regression based on
$y_i = \beta_0(x) + \sum_{j=1}^p \beta_j(x)\, x_i^j + \varepsilon_i,$
where the weights are again controlled by the kernel K. This estimator is also the solution to an optimization problem of interest:
$(\hat\beta_0(x), \dots, \hat\beta_p(x)) = \arg\min_{(\beta_0(x), \dots, \beta_p(x))} \sum_{i=1}^n K_h(x - x_i)\Big\{y_i - \beta_0(x) - \sum_{j=1}^p \beta_j(x)\, x_i^j\Big\}^2.$

Local Polynomial Regression

The estimator for the regression function is thus
$\hat m_p(x) = \hat\beta_0(x) + \sum_{j=1}^p \hat\beta_j(x)\, x^j, \qquad p \in \mathbb{N}_0.$

Particular cases:

Local intercept regression (p = 0):
$\hat m_0(x) = \frac{\sum_{i=1}^n K_h(x_i - x)\, y_i}{\sum_{i=1}^n K_h(x_i - x)}.$

Local linear regression (p = 1):
$\hat m_1(x) = \frac{n^{-1}\sum_{i=1}^n \{\hat s_2(x) - \hat s_1(x)(x_i - x)\}\, K_h(x_i - x)\, y_i}{\hat s_2(x)\, \hat s_0(x) - \hat s_1(x)^2},$
where
$\hat s_r(x) = n^{-1}\sum_{i=1}^n (x_i - x)^r\, K_h(x_i - x).$
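As a sanity check on the closed form above, the following sketch computes $\hat m_1(x)$ from the moments $\hat s_r(x)$ with a Gaussian kernel on the Old Faithful data (mhat1 is an illustrative name); it should agree with a locally weighted least-squares fit.

## Sketch: local linear estimator computed from the moments s_r(x)
mhat1 <- function(x, xi, yi, h) {
  Kh <- dnorm((xi - x) / h) / h              # K_h(x_i - x), Gaussian kernel
  s <- function(r) mean((xi - x)^r * Kh)     # \hat s_r(x)
  num <- mean((s(2) - s(1) * (xi - x)) * Kh * yi)
  num / (s(2) * s(0) - s(1)^2)
}
x0 <- seq(1.6, 5, length = 100)
plot(eruptions, waiting)
lines(x0, sapply(x0, mhat1, xi = eruptions, yi = waiting, h = 0.4), col = "blue")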

Comparison over Different Degrees

R: locpoly{KernSmooth}
require(KernSmooth)
fitlocpol0<-locpoly(eruptions,waiting,bandwidth=0.5,degree=0)
fitlocpol1<-locpoly(eruptions,waiting,bandwidth=0.5,degree=1)
fitlocpol2<-locpoly(eruptions,waiting,bandwidth=0.5,degree=2)
plot(eruptions,waiting,xlab="Duration",ylab="Waiting time",main="h=0.5")
lines(fitlocpol0,type="l",lwd=2,col="gray")
lines(fitlocpol1,type="l",lwd=2,col="red")
lines(fitlocpol2,type="l",lwd=2,col="blue")
Figure: Local polynomial fits of waiting time against duration of the eruption, h = 0.5; gray, red, and blue respectively correspond to p = 0, 1, 2.

What to Take: p = 0, 1 or 2?

Local intercept regression (p = 0) is nothing else than a weighted moving average. Such a simple local model might work well in some situations, but may not always approximate the underlying function well enough.

Local linear fits (p = 1) might work better in that case. It can be shown that $\hat m_1(\cdot)$ has some nice MSE features and minimax optimality properties (Fan, 1992, 1993).

Higher-degree polynomials would work in theory, but may tend to overfit the data in each subset and may be numerically unstable, making accurate computation difficult. Asymptotic arguments suggest that local polynomials of odd degree dominate those of even degree.

Part V Bandwidth Selection

How much to smooth?

The problem of deciding how much to smooth is of great importance in nonparametric regression. The selection of the smoothing parameter (bandwidth) is always related to a certain interpretation of the smooth. If the purpose of the smoothing is to increase the "signal-to-noise ratio", or to suggest a simple model, then a slightly "oversmoothed" curve with a subjectively chosen bandwidth might be desirable. On the other hand, when the interest is purely in estimating the regression curve itself, with an emphasis on local structure, then a slightly "undersmoothed" curve may be appropriate.

Bandwidth Selectors
General comments

However, a good automatically selected bandwidth is always a good starting point.

To characterize a bandwidth selector $h_o$ as optimal, it is necessary to define relevant criteria. The minimization of the average squared error (ASE) defines one such criterion:
$h_o = \arg\min_{h > 0} \mathrm{ASE}(h) = \arg\min_{h > 0} \frac{1}{n}\sum_{i=1}^n \{\hat m_h(x_i) - m(x_i)\}^2.$

Here we mainly focus on discussing a cross-validatory bandwidth selector, because of its simplicity, but many other strategies, such as plug-in rules, are available (Wand and Jones, 1995, Section 5.8).

Cross-Validation in Regression
Definition

Cross-validation entails estimating the regression function using the data with the i-th observation removed; the resultant leave-one-out estimator is, for example for p = 0,
$\hat m_{-i,h}(x_i) = \frac{\sum_{j=1, j \ne i}^n K_h(x_i - x_j)\, y_j}{\sum_{j=1, j \ne i}^n K_h(x_i - x_j)}.$

We consider the cross-validation function
$\mathrm{CV}(h) = \frac{1}{n}\sum_{i=1}^n \{y_i - \hat m_{-i,h}(x_i)\}^2.$

The CV function validates the ability to predict $\{y_i\}_{i=1}^n$ across the subsamples $\{(x_j, y_j)\}_{j=1, j \ne i}^n$.
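A minimal sketch of this criterion for the local constant (p = 0) estimator, with a Gaussian kernel and a small grid of bandwidths on the Old Faithful data; cv_h is an illustrative name, and hcv{sm} (used on the next slides) is the practical tool. Note that the nominal bandwidth scale here differs from that of ksmooth{stats}.

## Sketch: leave-one-out cross-validation for the Nadaraya--Watson estimator
cv_h <- function(h, x, y) {
  e <- sapply(seq_along(x), function(i) {
    w <- dnorm((x[i] - x[-i]) / h)     # kernel weights with the i-th point left out
    y[i] - sum(w * y[-i]) / sum(w)     # y_i - \hat m_{-i,h}(x_i)
  })
  mean(e^2)
}
hs <- seq(0.1, 1, by = 0.05)
cvs <- sapply(hs, cv_h, x = eruptions, y = waiting)
plot(hs, cvs, type = "l"); hs[which.min(cvs)]   # CV-selected bandwidth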

Bandwidth Selection by Cross-Validation

Definition and properties

The cross-validatory bandwidth selector is then defined as
$\hat h_{CV} = \arg\min_{h > 0} \mathrm{CV}(h).$

These principles apply for all $p \in \mathbb{N}_0$.

It can be shown that (Härdle et al., 1988)
$\hat h_{CV} / h_o - 1 = o_p(1).$

Here $o_p(1)$ is the stochastic version of Landau's little-o notation, and is used here to denote convergence to zero in probability.

Bandwidth Selection by Cross-Validation

R: hcv{sm}

## install.packages("sm")
require(sm)
attach(faithful)
hopt<-hcv(eruptions,waiting,display="lines")
> hopt
[1] 0.4243375

Figure: The cross-validation function CV(h) plotted against h; the minimum is attained near h = 0.42.

Bandwidth Selection by Cross-Validation


R: ksmooth{stats}

## Nadaraya--Watson estimator evaluated at the bandwidth chosen by cross-validation
plot(eruptions,waiting,main="h=0.4243375",xlab="Duration",ylab="Waiting time")
lines(ksmooth(eruptions,waiting,"normal",hopt),lwd=2,col="red")
Figure: Nadaraya–Watson estimate of waiting time against duration of the eruption at the cross-validated bandwidth h = 0.4243375.

Bandwidth Selection
R: ksmooth{stats}

set.seed(123); x<-runif(200,0,1); y<-sin(4*x)+rnorm(200,0,1/3)
plot(x,y,col=gray(0.5))
lines(ksmooth(x,y,"normal",0.08),col="green")
lines(ksmooth(x,y,"normal",0.2),col="red")
lines(ksmooth(x,y,"normal",0.5),col="blue")
lines(seq(0,1,length=100),sin(4*seq(0,1,length=100)),lty=2)

Figure: Nadaraya–Watson fits to simulated data from y = sin(4x) + noise, with bandwidths 0.08 (green), 0.2 (red), and 0.5 (blue); the dashed line is the true regression function.

Bandwidth Selection
R: h.select{sm} + ksmooth{stats}
set.seed(123); x<-runif(200,0,1); y<-sin(4*x)+rnorm(200,0,1/3)
h1<-h.select(x,y,method="cv")
h2<-h.select(x,y,method="aicc")
plot(x,y,col=gray(0.5))
lines(ksmooth(x,y,"normal",h1),col="red")
lines(ksmooth(x,y,"normal",h2),col="green")
lines(seq(0,1,length=100),sin(4*seq(0,1,length=100)),lty=2)
Figure: Nadaraya–Watson fits to the simulated data at the CV-selected bandwidth (0.09, red) and the AICc-selected bandwidth (0.13, green); the dashed line is the true regression function.

Bandwidth Selection
R: (h.select + sm.regression){sm}

set.seed(123); x<-runif(200,0,1); y<-sin(4*x)+rnorm(200,0,1/3)
h1<-h.select(x,y,method="cv")
h2<-h.select(x,y,method="aicc")
sm.regression(x,y,h=h1,col="red")
sm.regression(x,y,h=h2,col="green",add=T)
lines(seq(0,1,length=100),sin(4*seq(0,1,length=100)),lty=2)

Figure: sm.regression fits at the CV-selected (0.09, red) and AICc-selected (0.13, green) bandwidths; the dashed line is the true regression function.
