
Kernel Methods

Lecture Notes for CMPUT 466/551



Nilanjan Ray
Kernel Methods: Key Points
Essentially a local regression (function estimation/fitting) technique

Only the observations (training set) close to the query point are
considered for regression computation

While regressing, an observation point gets a weight that decreases
as its distance from the query point increases

The resulting regression function is smooth

All these features of this regression are made possible by a function
called kernel

Requires very little training (i.e., not many parameters to compute
offline from the training set, not much offline computation needed)

This kind of regression is known as a memory-based technique, as it
requires the entire training set to be available while regressing
One-Dimensional Kernel Smoothers
We have seen that the k-nearest-neighbor
average directly estimates the regression
function E(Y|X=x)
k-nn assigns equal weight to
all points in neighborhood
The average curve is bumpy
and discontinuous
Rather than give equal weight,
assign weights that decrease
smoothly with distance from
the target points
$$\hat{f}(x) = \mathrm{Ave}\left(y_i \mid x_i \in N_k(x)\right)$$
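A minimal NumPy sketch (not from the notes) of this k-nn average, assuming 1D inputs and an illustrative noiseless line for the check:

```python
import numpy as np

def knn_smoother(x_train, y_train, x0, k):
    """k-nearest-neighbor average: f_hat(x0) = Ave(y_i | x_i in N_k(x0)).

    Every neighbor gets equal weight, which is why the fitted curve
    is bumpy and discontinuous as x0 moves."""
    idx = np.argsort(np.abs(x_train - x0))[:k]  # indices of the k closest x_i
    return y_train[idx].mean()

# Toy data: exact line y = 2x sampled at 0, 0.1, ..., 1.0
x = np.linspace(0.0, 1.0, 11)
y = 2.0 * x
print(knn_smoother(x, y, 0.5, 3))  # mean of y at x = 0.4, 0.5, 0.6 -> 1.0
```

As x0 slides past the midpoint between two training points, a neighbor abruptly enters or leaves the set, which produces the discontinuities mentioned above.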

Nadaraya-Watson Kernel-Weighted Average
N-W kernel-weighted average:

$$\hat{f}(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}$$

where $K_\lambda$ is a kernel function:

$$K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{h_\lambda(x_0)}\right)$$
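A sketch of the N-W average in NumPy (an illustration, not the notes' code); the Gaussian kernel's normalizing constant is dropped because it cancels in the ratio:

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x0, lam):
    """f_hat(x0) = sum_i K_lam(x0, x_i) y_i / sum_i K_lam(x0, x_i)."""
    # Unnormalized Gaussian weights; the constant cancels in the ratio.
    w = np.exp(-((x_train - x0) ** 2) / (2.0 * lam ** 2))
    return np.sum(w * y_train) / np.sum(w)

x = np.linspace(0.0, 1.0, 51)
y = np.sin(2.0 * np.pi * x)               # noiseless target for illustration
fhat = nadaraya_watson(x, y, 0.25, 0.05)  # query at the peak of the sine
print(fhat)                                # slightly below 1.0: bias from curvature
```

Note the estimate at the peak is pulled below the true value 1.0, since every neighbor of the peak has a smaller y value: this is exactly the curvature bias discussed later.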

Any smooth function $K$ such that

$$K(x) > 0, \qquad \int K(x)\,dx = 1, \qquad \int x\,K(x)\,dx = 0 \qquad \text{and} \qquad \int x^2 K(x)\,dx > 0$$
Typically K is also symmetric about 0
Some Points About Kernels
$h_\lambda(x_0)$ is a width function, also dependent on $\lambda$
For the N-W kernel average, $h_\lambda(x_0) = \lambda$
For the k-nn average, $h_\lambda(x_0) = |x_0 - x_{[k]}|$, where $x_{[k]}$ is
the k-th closest $x_i$ to $x_0$
$\lambda$ determines the width of the local neighborhood
and the degree of smoothness
$\lambda$ also controls the tradeoff between bias and
variance
Larger $\lambda$ means lower variance but higher bias (Why?)
$\lambda$ is computed from the training data (how?)
Example Kernel Functions
Epanechnikov quadratic kernel (used in the N-W method):

$$K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{\lambda}\right), \qquad D(t) = \begin{cases} \frac{3}{4}(1 - t^2) & \text{if } |t| \le 1; \\ 0 & \text{otherwise.} \end{cases}$$

Tri-cube kernel:

$$D(t) = \begin{cases} (1 - |t|^3)^3 & \text{if } |t| \le 1; \\ 0 & \text{otherwise.} \end{cases}$$

Gaussian kernel:

$$K_\lambda(x_0, x) = \frac{1}{\lambda\sqrt{2\pi}} \exp\!\left(-\frac{(x - x_0)^2}{2\lambda^2}\right)$$

Kernel characteristics:
Compact: vanishes beyond a finite range (such as Epanechnikov, tri-cube)
Everywhere differentiable (Gaussian, tri-cube)
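These three profiles can be written directly in NumPy; a hedged sketch (the tri-cube is left unnormalized, exactly as defined above):

```python
import numpy as np

def epanechnikov(t):
    # D(t) = 3/4 (1 - t^2) on |t| <= 1, zero outside (compact support)
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t ** 2), 0.0)

def tricube(t):
    # D(t) = (1 - |t|^3)^3 on |t| <= 1; compact AND differentiable at the edge
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, (1.0 - np.abs(t) ** 3) ** 3, 0.0)

def gaussian(t):
    # Standard normal density: infinite support, everywhere differentiable
    return np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)

# Compact kernels vanish beyond |t| = 1; the Gaussian never does.
print(epanechnikov(0.0), epanechnikov(2.0), gaussian(2.0))
```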
Local Linear Regression
In the kernel-weighted average method, the estimated function value
has a high bias at the boundary
This high bias is a result of the asymmetry of the kernel at the boundary
The bias can also be present in the interior when the x values
in the training set are not equally spaced
Fitting straight lines rather than constants locally helps us
remove this bias (why?)
Locally Weighted Linear Regression
Least squares solution: at each query point $x_0$, solve

$$\min_{\alpha(x_0),\, \beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\, \big[y_i - \alpha(x_0) - \beta(x_0)\, x_i\big]^2$$

The resulting estimate is

$$\hat{f}(x_0) = b(x_0)^T \big(B^T W(x_0) B\big)^{-1} B^T W(x_0)\, y = \sum_{i=1}^{N} l_i(x_0)\, y_i$$

where
$b(x)^T = (1, x)$: vector-valued function
$B$: $N \times 2$ regression matrix with $i$th row $b(x_i)^T$
$W(x_0)$: $N \times N$ diagonal matrix with $i$th diagonal element $K_\lambda(x_0, x_i)$

Note that the estimate is linear in the $y_i$
The weights $l_i(x_0)$ are sometimes referred to as
the equivalent kernel
Ex.
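This least squares solution is a few lines of NumPy. An illustrative sketch (the Gaussian kernel is chosen here for concreteness) that also returns the equivalent-kernel weights l_i(x0):

```python
import numpy as np

def local_linear(x_train, y_train, x0, lam):
    """Locally weighted linear regression at the query point x0.

    Minimizes sum_i K(x0, x_i) [y_i - alpha - beta x_i]^2 and returns
    f_hat(x0) = b(x0)^T (B^T W B)^{-1} B^T W y plus the weights l_i(x0)."""
    K = np.exp(-((x_train - x0) ** 2) / (2.0 * lam ** 2))   # kernel weights
    B = np.column_stack([np.ones_like(x_train), x_train])   # ith row b(x_i)^T = (1, x_i)
    W = np.diag(K)
    b0 = np.array([1.0, x0])
    l = b0 @ np.linalg.solve(B.T @ W @ B, B.T @ W)          # equivalent kernel l(x0)
    return l @ y_train, l

x = np.linspace(0.0, 1.0, 21)
y = 3.0 * x + 1.0                        # exactly linear data
fhat, l = local_linear(x, y, 0.0, 0.1)   # query at the boundary x0 = 0
print(fhat)                              # recovers the intercept 1.0: no boundary bias
```

By construction the weights satisfy sum_i l_i(x0) = 1 and sum_i (x_i - x0) l_i(x0) = 0, which is exactly why linear data is reproduced without bias even at the boundary.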
Bias Reduction in Local Linear Regression
Local linear regression automatically modifies the kernel to
correct the bias exactly to first order

Write a Taylor series expansion of $f(x_i)$:

$$E[\hat{f}(x_0)] = \sum_{i=1}^{N} l_i(x_0)\, f(x_i)$$
$$= f(x_0) \sum_{i=1}^{N} l_i(x_0) + f'(x_0) \sum_{i=1}^{N} (x_i - x_0)\, l_i(x_0) + \frac{f''(x_0)}{2} \sum_{i=1}^{N} (x_i - x_0)^2\, l_i(x_0) + R$$
$$= f(x_0) + \frac{f''(x_0)}{2} \sum_{i=1}^{N} (x_i - x_0)^2\, l_i(x_0) + R$$

since $\sum_{i=1}^{N} l_i(x_0) = 1$ and $\sum_{i=1}^{N} (x_i - x_0)\, l_i(x_0) = 0$. Hence

$$\text{bias} = E[\hat{f}(x_0)] - f(x_0) = \frac{f''(x_0)}{2} \sum_{i=1}^{N} (x_i - x_0)^2\, l_i(x_0) + R$$

Ex. 6.2 in [HTF]
Local Polynomial Regression
Why have a polynomial for the local fit? What would be
the rationale?

At each $x_0$, solve

$$\min_{\alpha(x_0),\, \beta_j(x_0),\, j=1,\dots,d} \; \sum_{i=1}^{N} K_\lambda(x_0, x_i)\, \Big[y_i - \alpha(x_0) - \sum_{j=1}^{d} \beta_j(x_0)\, x_i^j\Big]^2$$

$$\hat{f}(x_0) = b(x_0)^T \big(B^T W(x_0) B\big)^{-1} B^T W(x_0)\, y = \sum_{i=1}^{N} l_i(x_0)\, y_i$$

where
$b(x)^T = (1, x, x^2, \dots, x^d)$: vector-valued function
$B$: $N \times (d+1)$ regression matrix with $i$th row $b(x_i)^T$
$W(x_0)$: $N \times N$ diagonal matrix with $i$th diagonal element $K_\lambda(x_0, x_i)$

We will gain on bias; however, we will pay the price in
terms of variance (why?)
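Extending the local linear sketch to degree d is a one-line change of basis; `np.vander` builds the rows (1, x_i, ..., x_i^d). Illustrative code, again assuming a Gaussian kernel:

```python
import numpy as np

def local_poly(x_train, y_train, x0, lam, d):
    """Local polynomial regression of degree d at x0 (d = 1 is local linear)."""
    K = np.exp(-((x_train - x0) ** 2) / (2.0 * lam ** 2))
    B = np.vander(x_train, d + 1, increasing=True)            # rows (1, x_i, ..., x_i^d)
    b0 = np.vander(np.array([x0]), d + 1, increasing=True)[0]
    # Solve (B^T W B) beta = B^T W y; W y is just elementwise K * y
    beta = np.linalg.solve(B.T @ np.diag(K) @ B, B.T @ (K * y_train))
    return b0 @ beta

x = np.linspace(-1.0, 1.0, 41)
y = x ** 2                               # curved target
print(local_poly(x, y, 0.0, 0.2, 2))     # quadratic fit captures the curvature
print(local_poly(x, y, 0.0, 0.2, 1))     # local linear is biased at the trough
```

The local quadratic recovers the trough value exactly, while the local linear fit overshoots it: curvature in the interior is exactly where the quadratic term earns its extra variance.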
Bias and Variance Tradeoff
As the degree of local polynomial regression increases, bias
decreases and variance increases
Local linear fits can help reduce bias significantly at the boundaries
at a modest cost in variance
Local quadratic fits tend to be most helpful in reducing bias due to
curvature in the interior of the domain
So, would it be helpful to have a mixture of linear and quadratic local
fits?

Local Regression in Higher Dimensions

We can extend 1D local regression to higher dimensions. In $p$ dimensions, with a local polynomial of degree $d$, solve at each query point $x_0 \in \mathbb{R}^p$:

$$\min_{\beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\, \big(y_i - b(x_i)^T \beta(x_0)\big)^2, \qquad K_\lambda(x_0, x) = D\!\left(\frac{\|x - x_0\|}{\lambda}\right)$$

$$\hat{f}(x_0) = b(x_0)^T \beta(x_0) = b(x_0)^T \big(B^T W(x_0) B\big)^{-1} B^T W(x_0)\, y = \sum_{i=1}^{N} l_i(x_0)\, y_i$$

where
$b(x)$: vector-valued function of all polynomial terms in the components of $x$ up to degree $d$ (e.g., for $p = 2$, $d = 2$: $b(x)^T = (1, x_1, x_2, x_1^2, x_2^2, x_1 x_2)$)
$B$: regression matrix with $i$th row $b(x_i)^T$
$W(x_0)$: $N \times N$ diagonal matrix with $i$th diagonal element $K_\lambda(x_0, x_i)$

Standardize each coordinate in the kernel, because the Euclidean
(square) norm is affected by scaling
Local Regression: Issues in Higher Dimensions
The boundary poses an even greater problem in higher
dimensions
Many training points are required to reduce the bias; the sample
size should increase exponentially in p to match the same
performance
Local regression becomes less useful when dimensions
go beyond 2 or 3
It is impossible to maintain localness (low bias) and
sizeable samples (low variance) at the same time
Combating Dimensions: Structured Kernels
In high dimensions, the input variables (i.e., the x variables)
can be strongly correlated. This correlation can be
a key to reducing the dimensionality while performing
kernel regression.
Let A be a positive semidefinite matrix (what does that
mean?). Let's now consider a kernel that looks like:

$$K_{\lambda, A}(x_0, x) = D\!\left(\frac{\sqrt{(x - x_0)^T A\, (x - x_0)}}{\lambda}\right)$$

If $A = \Sigma^{-1}$, the inverse of the covariance matrix of the input
variables, then the correlation structure is captured
Further, one can take only a few principal components of
A to reduce the dimensionality
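A sketch of such a structured kernel (the function name and the Gaussian profile D are illustrative choices, not from the notes):

```python
import numpy as np

def structured_kernel(x0, x, A, lam):
    """K_{lam,A}(x0, x) = D(sqrt((x - x0)^T A (x - x0)) / lam),
    here with the profile D(t) = exp(-t^2 / 2)."""
    d = x - x0
    t = np.sqrt(d @ A @ d) / lam   # A-weighted (generalized) distance
    return np.exp(-0.5 * t ** 2)

# With A = I this reduces to the ordinary Euclidean radial kernel:
x0 = np.zeros(2)
x = np.array([3.0, 4.0])                         # Euclidean distance 5
print(structured_kernel(x0, x, np.eye(2), 5.0))  # D(1) = exp(-0.5)
```

With A set to the inverse sample covariance (e.g. `np.linalg.inv(np.cov(X.T))` for a sample matrix X), the distance inside D becomes the Mahalanobis distance, so the kernel automatically stretches along directions of strong input correlation.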
Combating Dimensions: Low-Order Additive Models
ANOVA (analysis of variance) decomposition:

$$f(x_1, x_2, \dots, x_p) = \alpha + \sum_{j=1}^{p} g_j(x_j) + \sum_{k<l} g_{kl}(x_k, x_l) + \cdots$$

Keeping only the first-order (additive) terms, one-dimensional local
regression is all that is needed:

$$f(x_1, x_2, \dots, x_p) = \alpha + \sum_{j=1}^{p} g_j(x_j)$$
Probability Density Function Estimation
In many classification or regression problems we desperately
want to estimate probability densities (recall the instances)
So can we not estimate a probability density directly, given
some samples from it?
Local methods of density estimation:

$$\hat{f}_X(x_0) = \frac{\#\{x_i \in \text{Nbhood}(x_0)\}}{N \lambda}$$

This estimate is typically bumpy, non-smooth (why?)
Smooth PDF Estimation using Kernels
Parzen method:

$$\hat{f}(x_0) = \frac{1}{N} \sum_{i=1}^{N} K_\lambda(x_0, x_i)$$

Gaussian kernel:

$$K_\lambda(x_0, x_i) = \frac{1}{\lambda\sqrt{2\pi}} \exp\!\left(-\frac{(x_i - x_0)^2}{2\lambda^2}\right)$$

In p dimensions:

$$\hat{f}_X(x_0) = \frac{1}{N\, (2\lambda^2 \pi)^{p/2}} \sum_{i=1}^{N} e^{-\frac{1}{2}\,(\|x_i - x_0\| / \lambda)^2}$$

This is known as kernel density estimation.
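The p-dimensional Parzen estimate with a Gaussian kernel is a direct transcription of the formula; an illustrative sketch checked against a standard normal sample:

```python
import numpy as np

def parzen_gaussian(x_train, x0, lam):
    """f_hat(x0) = 1/(N (2 pi lam^2)^{p/2}) * sum_i exp(-||x_i - x0||^2 / (2 lam^2))."""
    x_train = np.atleast_2d(x_train)
    N, p = x_train.shape
    sq = np.sum((x_train - x0) ** 2, axis=1)     # squared distances to x0
    norm = (2.0 * np.pi * lam ** 2) ** (p / 2.0)
    return np.exp(-sq / (2.0 * lam ** 2)).sum() / (N * norm)

rng = np.random.default_rng(0)
samples = rng.standard_normal((5000, 1))          # draws from N(0, 1)
fhat = parzen_gaussian(samples, np.zeros(1), 0.2)
print(fhat)  # close to the true density 1/sqrt(2 pi) ~ 0.399 at x = 0
```

Each sample contributes a little Gaussian bump of width lambda; summing the bumps is what smooths out the bumpy counting estimate of the previous slide.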
Using Kernel Density Estimates in Classification
Posterior probability density:

$$\hat{P}(G = j \mid X = x_0) = \frac{\hat{\pi}_j\, \hat{f}_j(x_0)}{\sum_{l=1}^{K} \hat{\pi}_l\, \hat{f}_l(x_0)}$$

where $\hat{f}_j(x) = \hat{p}(x \mid G = j)$ is the $j$th class conditional density and $\hat{\pi}_j$ is the class prior
In order to estimate this density, we can estimate the class conditional densities
using the Parzen method

Class conditional densities determine the ratio of posteriors:

$$\frac{P(G = 1 \mid X = x)}{P(G = 2 \mid X = x)} = \frac{\pi_1 f_1(x)}{\pi_2 f_2(x)}$$
Naive Bayes Classifier
In Bayesian classification we need to
estimate the class conditional densities:

$$f_j(x) = p(x \mid G = j)$$

What if the input space x is multi-
dimensional?
If we apply kernel density estimates, we
will run into the same problems that we
faced in high dimensions
To avoid these difficulties, assume that
the class conditional density factorizes:

$$f_j(x_1, \dots, x_p) = \prod_{i=1}^{p} p(x_i \mid G = j)$$

In other words, we are assuming here
that the features are independent:
the naive Bayes model
Advantages:
Each class density for each feature can
be estimated separately (low variance)
If some of the features are continuous
and some are discrete, this method can
seamlessly handle the situation
The naive Bayes classifier works surprisingly
well for many problems (why?)
The discriminant function is now generalized additive
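A hedged sketch combining the two ideas above: per-feature Parzen estimates multiplied under the independence assumption, then normalized into posteriors (the data, bandwidth, and function names are illustrative):

```python
import numpy as np

def kde_1d(train, q, lam):
    """1D Parzen/Gaussian estimate of one feature's class density at scalar q."""
    return np.exp(-((q - train) ** 2) / (2 * lam ** 2)).mean() / (lam * np.sqrt(2 * np.pi))

def naive_bayes_kde(X_by_class, x0, lam):
    """P(G=j | x0) proportional to prior_j * prod_k p_hat(x0_k | G=j)."""
    n_total = sum(len(X) for X in X_by_class)
    scores = []
    for X in X_by_class:                       # X: (n_j, p) samples of class j
        prior = len(X) / n_total
        dens = 1.0
        for k in range(X.shape[1]):            # independence: product of 1D KDEs
            dens *= kde_1d(X[:, k], x0[k], lam)
        scores.append(prior * dens)
    scores = np.array(scores)
    return scores / scores.sum()               # normalize to posteriors

rng = np.random.default_rng(1)
class0 = rng.standard_normal((200, 2)) - 2.0   # cloud centered at (-2, -2)
class1 = rng.standard_normal((200, 2)) + 2.0   # cloud centered at (+2, +2)
post = naive_bayes_kde([class0, class1], np.array([2.0, 2.0]), 0.5)
print(post)  # posterior mass concentrates on class 1
```

Each class density needs only p one-dimensional estimates rather than one p-dimensional estimate, which is exactly the low-variance advantage listed above.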
Key Points
Local assumption
Usually bandwidth (λ) selection is more important than
kernel function selection
Low bias, low variance usually not guaranteed in high
dimensions
Little training and high online computational complexity
Use sparingly: only when really required, like in the high-
confusion zone
Use when model may not be used again: No need for the
training phase