
WEEK 05

KERNEL MACHINES
Kernel Machines
- Discriminant-based: no need to estimate densities first
- Define the discriminant in terms of support vectors
- The use of kernel functions: application-specific measures of similarity
- No need to represent instances as vectors
- Convex optimization problems with a unique solution
Optimal Separating Hyperplane
X = { (x^t, r^t) }  where  r^t = +1 if x^t ∈ C1  and  r^t = −1 if x^t ∈ C2

Find w and w0 such that
  w^T x^t + w0 ≥ +1  for r^t = +1
  w^T x^t + w0 ≤ −1  for r^t = −1
which can be rewritten as
  r^t (w^T x^t + w0) ≥ +1
(Cortes and Vapnik, 1995; Vapnik, 1995)
Margin
- The margin is the distance from the discriminant to the closest instances on either side
- The distance of x^t to the hyperplane is  |w^T x^t + w0| / ||w||
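A minimal NumPy sketch of this distance computation; the weight vector w, bias w0, and instance x^t below are made-up values chosen only for illustration:

```python
import numpy as np

# Hypothetical hyperplane parameters and one instance (illustrative values only).
w = np.array([2.0, -1.0])    # weight vector
w0 = 0.5                     # bias term
x_t = np.array([1.0, 3.0])   # an instance x^t

g = w @ x_t + w0                        # w^T x^t + w0
distance = abs(g) / np.linalg.norm(w)   # |w^T x^t + w0| / ||w||
print(f"g(x^t) = {g:.3f}, distance to hyperplane = {distance:.3f}")
```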
Margin
- We require  r^t (w^T x^t + w0) / ||w|| ≥ ρ,  ∀t
- For a unique solution, fix ρ||w|| = 1; then maximizing the margin is equivalent to

  min (1/2) ||w||²  subject to  r^t (w^T x^t + w0) ≥ +1,  ∀t
Margin
[Figure: the separating boundary with its margin lines; the circled instances are the support vectors]

  min (1/2) ||w||²  subject to  r^t (w^T x^t + w0) ≥ +1,  ∀t

The primal Lagrangian is

  Lp = (1/2) ||w||² − Σ_t α^t [ r^t (w^T x^t + w0) − 1 ]
     = (1/2) ||w||² − Σ_t α^t r^t (w^T x^t + w0) + Σ_t α^t

Setting the derivatives to zero:

  ∂Lp/∂w  = 0  ⇒  w = Σ_t α^t r^t x^t
  ∂Lp/∂w0 = 0  ⇒  Σ_t α^t r^t = 0

Substituting back gives the dual

  Ld = (1/2) w^T w − w^T Σ_t α^t r^t x^t − w0 Σ_t α^t r^t + Σ_t α^t
     = −(1/2) w^T w + Σ_t α^t
     = −(1/2) Σ_t Σ_s α^t α^s r^t r^s (x^t)^T x^s + Σ_t α^t

  subject to  Σ_t α^t r^t = 0  and  α^t ≥ 0,  ∀t
Most α^t are 0; only a small number have α^t > 0, and these are the support vectors.
The support vectors lie on the margin lines.
Support Vectors

[Figure: the same boundary and margin lines as above; the circled instances are the support vectors]
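As a hedged sketch of what the dual solution looks like in practice, one can fit a linear SVM on a small separable dataset with scikit-learn and check that w = Σ_t α^t r^t x^t is recovered from the support vectors alone; the toy data and the large C used to approximate the hard margin are assumptions for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D linearly separable data (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.5, (20, 2)),
               rng.normal([-2, -2], 0.5, (20, 2))])
r = np.array([1] * 20 + [-1] * 20)

# Hard-margin behaviour is approximated with a very large C.
svm = SVC(kernel="linear", C=1e6).fit(X, r)

# dual_coef_ stores alpha^t * r^t for the support vectors only,
# so w = sum_t alpha^t r^t x^t should match coef_.
w_from_dual = svm.dual_coef_ @ svm.support_vectors_
print("number of support vectors:", len(svm.support_vectors_))
print("w from dual solution:", w_from_dual.ravel())
print("w from coef_:        ", svm.coef_.ravel())
```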
Soft Margin Hyperplane
- Not linearly separable:  r^t (w^T x^t + w0) ≥ 1 − ξ^t
- Soft error:  Σ_t ξ^t
- Two types of deviation: an instance may lie on the wrong side of the hyperplane, or lie on the correct side but inside the margin
- The new primal is

  Lp = (1/2) ||w||² + C Σ_t ξ^t − Σ_t α^t [ r^t (w^T x^t + w0) − 1 + ξ^t ] − Σ_t μ^t ξ^t
Soft Margin Hyperplane

- The last term, Σ_t μ^t ξ^t, guarantees the positivity of ξ^t (the μ^t are the Lagrange multipliers for the constraints ξ^t ≥ 0)
- C is the penalty factor

  Lp = (1/2) ||w||² + C Σ_t ξ^t − Σ_t α^t [ r^t (w^T x^t + w0) − 1 + ξ^t ] − Σ_t μ^t ξ^t
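This soft-margin problem is what standard SVM implementations solve. A minimal sketch with scikit-learn, where the overlapping toy data and the value C = 1.0 are assumptions chosen only for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Overlapping (not linearly separable) toy data -- illustrative only.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([1, 1], 1.0, (30, 2)),
               rng.normal([-1, -1], 1.0, (30, 2))])
r = np.array([1] * 30 + [-1] * 30)

# C is the penalty factor on the slack variables xi^t.
soft_svm = SVC(kernel="linear", C=1.0).fit(X, r)
print("training accuracy:", soft_svm.score(X, r))
print("number of support vectors:", soft_svm.support_vectors_.shape[0])
```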
Choosing C

- For a large value of C, the optimizer chooses a smaller-margin hyperplane if that hyperplane gets all the training points classified correctly (risk of overfitting)
- For a very small value of C, it creates a larger-margin separating hyperplane, even if that hyperplane misclassifies more points (risk of underfitting)

There are four possible cases when classifying an instance under r^t (w^T x^t + w0) ≥ 1 − ξ^t:
- Correctly classified and outside the margin: ξ^t = 0
- Correctly classified and on the margin (a support vector): ξ^t = 0
- Correctly classified but inside the margin: 0 < ξ^t < 1
- Misclassified: ξ^t > 1
The number of misclassifications is #{ ξ^t > 1 }.
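A hedged sketch of how these cases can be read off a fitted linear SVM: the slack is ξ^t = max(0, 1 − r^t g(x^t)), and the instances with ξ^t > 1 are the misclassified ones. The data and C below are assumptions for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([1, 1], 1.0, (40, 2)),
               rng.normal([-1, -1], 1.0, (40, 2))])
r = np.array([1] * 40 + [-1] * 40)

svm = SVC(kernel="linear", C=1.0).fit(X, r)
g = svm.decision_function(X)          # g(x^t) = w^T x^t + w0
xi = np.maximum(0.0, 1.0 - r * g)     # slack xi^t = max(0, 1 - r^t g(x^t))

print("outside/on margin (xi = 0):    ", int(np.sum(xi == 0)))
print("inside margin     (0 < xi <= 1):", int(np.sum((xi > 0) & (xi <= 1))))
print("misclassified     (xi > 1):     ", int(np.sum(xi > 1)))
```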
The nonseparable instances

- We store them as support vectors
- If they were not in the training set, they would be difficult to classify correctly
- They would either be misclassified, or classified correctly but not with enough confidence
Hinge Loss

  L_hinge(y^t, r^t) = 0             if y^t r^t ≥ 1
                      1 − y^t r^t   otherwise
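A minimal NumPy sketch of the hinge loss; the example outputs y^t and labels r^t are purely illustrative:

```python
import numpy as np

def hinge_loss(y, r):
    """Hinge loss: 0 if y*r >= 1, else 1 - y*r (element-wise)."""
    return np.maximum(0.0, 1.0 - y * r)

# Illustrative outputs y^t and labels r^t.
y = np.array([2.3, 0.4, -0.7, 1.0])
r = np.array([1,   1,    1,  -1])
print(hinge_loss(y, r))   # [0.  0.6 1.7 2. ]
```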
Kernel Trick
- Preprocess input x by basis functions:  z = φ(x),  and use a linear model in z-space:  g(z) = w^T z,  i.e.  g(x) = w^T φ(x)
- The SVM solution:

  w = Σ_t α^t r^t z^t = Σ_t α^t r^t φ(x^t)

  g(x) = w^T φ(x) = Σ_t α^t r^t φ(x^t)^T φ(x)

  g(x) = Σ_t α^t r^t K(x^t, x)
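To illustrate that g(x) = Σ_t α^t r^t K(x^t, x) + w0 is exactly what a fitted kernel SVM computes, the following sketch rebuilds scikit-learn's decision function from the stored support vectors; the dataset and the RBF kernel settings are assumptions chosen for the example:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([1, 1], 1.0, (30, 2)),
               rng.normal([-1, -1], 1.0, (30, 2))])
r = np.array([1] * 30 + [-1] * 30)

svm = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, r)

# K(x^t, x) between the support vectors x^t and a few query points x.
X_query = X[:5]
K = rbf_kernel(svm.support_vectors_, X_query, gamma=0.5)

# g(x) = sum_t (alpha^t r^t) K(x^t, x) + w0; dual_coef_ stores alpha^t r^t.
g_manual = (svm.dual_coef_ @ K).ravel() + svm.intercept_
print(np.allclose(g_manual, svm.decision_function(X_query)))  # expected: True
```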
Vectorial Kernels
- Polynomials of degree q:  K(x^t, x) = (x^T x^t + 1)^q
- For example, with q = 2 in two dimensions:

  K(x, y) = (x^T y + 1)²
          = (x1 y1 + x2 y2 + 1)²
          = 1 + 2 x1 y1 + 2 x2 y2 + 2 x1 x2 y1 y2 + x1² y1² + x2² y2²

  which corresponds to the basis functions

  φ(x) = [ 1, √2 x1, √2 x2, √2 x1 x2, x1², x2² ]^T
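A quick numerical check of this expansion for two assumed 2-D vectors: the kernel value (x^T y + 1)² should equal the inner product of the expanded features φ(x) and φ(y).

```python
import numpy as np

def phi(v):
    """Basis expansion for the degree-2 polynomial kernel (x^T y + 1)^2 in 2-D."""
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2,
                     x1 ** 2, x2 ** 2])

x = np.array([0.7, -1.2])   # illustrative vectors
y = np.array([2.0, 0.5])

k_direct = (x @ y + 1.0) ** 2    # kernel computed directly
k_expanded = phi(x) @ phi(y)     # same value via explicit basis functions
print(np.isclose(k_direct, k_expanded))   # expected: True
```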
Vectorial Kernels
- Radial-basis functions:

  K(x^t, x) = exp( − ||x^t − x||² / (2 s²) )
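A short sketch of the RBF kernel as a function; the width s and the two points are arbitrary values used only for illustration:

```python
import numpy as np

def rbf(x_t, x, s=1.0):
    """Radial-basis kernel K(x^t, x) = exp(-||x^t - x||^2 / (2 s^2))."""
    return np.exp(-np.sum((x_t - x) ** 2) / (2.0 * s ** 2))

x_t = np.array([1.0, 2.0])   # illustrative points
x = np.array([1.5, 1.0])
print(rbf(x_t, x, s=1.0))    # close to 1 for nearby points, near 0 far apart
```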