C19 Machine Learning
Primal and dual forms; linear separability revisited; feature mapping; kernels for SVMs; kernel trick, requirements, radial basis functions
Hilary 2013
A. Zisserman
SVM review
We have seen that for an SVM, learning a linear classifier f(x) = w^\top x + b is formulated as solving an optimization problem over w:

\min_{w \in \mathbb{R}^d} \|w\|^2 + C \sum_i^N \max\left(0,\; 1 - y_i f(x_i)\right)
This quadratic optimization problem is known as the primal problem. Instead, the SVM can be formulated to learn a linear classifier

f(x) = \sum_i^N \alpha_i y_i (x_i^\top x) + b

by solving an optimization problem over the \alpha_i. This is known as the dual problem, and we will look at the advantages of this formulation.
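As a concrete reading of the primal objective, a minimal NumPy sketch (the toy data, w, b and C below are arbitrary illustrative values, not from the lecture):

```python
import numpy as np

def primal_cost(w, b, X, y, C):
    """Primal SVM objective: ||w||^2 + C * sum_i max(0, 1 - y_i f(x_i))."""
    f = X @ w + b                            # f(x_i) = w^T x_i + b for every training point
    hinge = np.maximum(0.0, 1.0 - y * f)     # hinge loss, zero for points beyond the margin
    return w @ w + C * hinge.sum()

# Toy example: four points in R^2 with labels +1 / -1
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
print(primal_cost(w=np.array([0.5, 0.5]), b=0.0, X=X, y=y, C=1.0))
```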
It can be shown that

w = \sum_{j=1}^N \alpha_j y_j x_j

Proof: see example sheet.

Substituting for w in the classifier gives

f(x) = \left(\sum_{j=1}^N \alpha_j y_j x_j\right)^{\top} x + b = \sum_{j=1}^N \alpha_j y_j (x_j^\top x) + b

and for w in the cost function \min_w \|w\|^2 subject to y_i (w^\top x_i + b) \ge 1, \forall i:

\|w\|^2 = \left(\sum_j \alpha_j y_j x_j\right)^{\top} \left(\sum_k \alpha_k y_k x_k\right) = \sum_{jk} \alpha_j \alpha_k y_j y_k (x_j^\top x_k)

while the constraints become

y_i \left(\sum_j \alpha_j y_j (x_j^\top x_i) + b\right) \ge 1, \forall i

Hence the optimization can be written entirely in terms of the \alpha_j and the dot products x_j^\top x_k.
Primal problem: for w \in \mathbb{R}^d,

\min_{w \in \mathbb{R}^d} \|w\|^2 + C \sum_i^N \max(0,\; 1 - y_i f(x_i))

Dual problem: for \alpha \in \mathbb{R}^N,

\max_{\alpha_i \ge 0} \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k (x_j^\top x_k) \quad \text{subject to } 0 \le \alpha_i \le C \text{ and } \sum_i \alpha_i y_i = 0

Need to learn d parameters for the primal, and N for the dual. If N << d then it is more efficient to solve for \alpha than for w. The dual form only involves the dot products (x_j^\top x_k); we will return to why this is an advantage when we look at kernels. In either case the classifier is

f(x) = \sum_i \alpha_i y_i (x_i^\top x) + b
At first sight the dual form appears to have the disadvantage of a K-NN classifier: it requires the training data points x_i. However, many of the \alpha_i are zero. The ones that are non-zero define the support vectors x_i.
f(x) = \sum_i \alpha_i y_i (x_i^\top x) + b

where the sum effectively runs only over the support vectors, the points with \alpha_i \ne 0.
[Figure: soft-margin linear SVM with C = 10, showing the data, the support vectors, and the regions f(x) > 0 and f(x) < 0; the classifier is trained by minimising \|w\|^2 + C \sum_i^N \max(0,\; 1 - y_i f(x_i))]
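A sketch of what the dual solution looks like in practice, using scikit-learn (an assumed choice of library, not one made in the lecture): SVC exposes the support vectors, the products \alpha_i y_i (as dual_coef_) and b (as intercept_), so f(x) can be rebuilt from the formula above.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=+2.0, size=(20, 2)),    # class +1
               rng.normal(loc=-2.0, size=(20, 2))])   # class -1
y = np.hstack([np.ones(20), -np.ones(20)])

clf = SVC(kernel="linear", C=10.0).fit(X, y)

# Most alpha_i are zero: only the support vectors are kept.
print("support vectors:", clf.support_vectors_.shape[0], "of", X.shape[0])

alpha_y = clf.dual_coef_.ravel()     # alpha_i * y_i for the support vectors
b = clf.intercept_[0]

# f(x) = sum_i alpha_i y_i (x_i^T x) + b, using only the support vectors
x_new = np.array([0.5, 0.3])
f = alpha_y @ (clf.support_vectors_ @ x_new) + b
print(np.isclose(f, clf.decision_function([x_new])[0]))   # matches sklearn's own decision function
```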
Consider mapping the data from \mathbb{R}^2 to \mathbb{R}^3 with

\Phi: (x_1, x_2) \mapsto (x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2)

[Figure: the data replotted in the new coordinates X = x_1^2, Y = x_2^2, Z = \sqrt{2}\, x_1 x_2]
The data is linearly separable in 3D. This means that the problem can still be solved by a linear classifier.
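A small NumPy sketch of why this works, on toy data labelled by whether a point lies inside a circle (the data and radius are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=(200, 2))
r2 = 0.5                                          # squared radius of the separating circle
y = np.where((x ** 2).sum(axis=1) < r2, 1, -1)    # +1 inside the circle, -1 outside

# Feature map: (x1, x2) -> (x1^2, x2^2, sqrt(2) x1 x2)
phi = np.column_stack([x[:, 0] ** 2, x[:, 1] ** 2, np.sqrt(2) * x[:, 0] * x[:, 1]])

# In the mapped space the plane X + Y = r2, i.e. w = (1, 1, 0), b = -r2, separates the classes.
f = phi @ np.array([1.0, 1.0, 0.0]) - r2
print(np.all(np.sign(-f) == y))                   # True: a linear classifier separates the mapped data
```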
More generally, map the data with

\Phi: x \mapsto \Phi(x), \quad \mathbb{R}^d \to \mathbb{R}^D

[Figure: the mapped data in \mathbb{R}^D, separated by the decision boundary f(x) = 0]

Simply map x to \Phi(x), where the data is separable, and solve for w in the high-dimensional space \mathbb{R}^D. If D >> d then there are many more parameters to learn for w. Can this be avoided?
It can: in the dual form the data enters only through dot products, which after the mapping become \Phi(x_j)^\top \Phi(x_k). Writing k(x_j, x_k) = \Phi(x_j)^\top \Phi(x_k) gives the SVM classifier with kernels

f(x) = \sum_i \alpha_i y_i\, k(x_i, x) + b

learnt by solving

\max_{\alpha_i \ge 0} \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k\, k(x_j, x_k) \quad \text{subject to } 0 \le \alpha_i \le C \text{ and } \sum_i \alpha_i y_i = 0
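Both formulas translate directly into a few lines of code; a NumPy sketch (the kernel, data and \alpha values below are illustrative placeholders, not the output of a solver):

```python
import numpy as np

def dual_objective(alpha, y, K):
    """sum_i alpha_i - 1/2 sum_jk alpha_j alpha_k y_j y_k k(x_j, x_k), with K_jk = k(x_j, x_k)."""
    return alpha.sum() - 0.5 * (alpha * y) @ K @ (alpha * y)

def f(x, alpha, y, X, b, kernel):
    """f(x) = sum_i alpha_i y_i k(x_i, x) + b."""
    return sum(a_i * y_i * kernel(x_i, x) for a_i, y_i, x_i in zip(alpha, y, X)) + b

# Toy values: alpha must satisfy 0 <= alpha_i <= C and sum_i alpha_i y_i = 0.
kernel = lambda u, v: u @ v              # e.g. the linear kernel; any valid kernel can be substituted
X = np.array([[1.0, 2.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.3, 0.3])
K = np.array([[kernel(xj, xk) for xk in X] for xj in X])
print(dual_objective(alpha, y, K), f(np.array([0.5, 0.5]), alpha, y, X, 0.0, kernel))
```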
Special transformations
\Phi: \mathbb{R}^2 \to \mathbb{R}^3, \quad (x_1, x_2) \mapsto (x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2)

For this map the dot product in \mathbb{R}^3 can be computed directly in \mathbb{R}^2:

\Phi(x)^\top \Phi(z) = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 = (x^\top z)^2
Kernel Trick
The classifier can be learnt and applied without explicitly computing \Phi(x). All that is required is the kernel k(x, z) = (x^\top z)^2. The complexity of learning depends on N (typically it is O(N^3)), not on D.
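A quick numerical check of this identity for the map \Phi above (a sketch with arbitrary test points):

```python
import numpy as np

def phi(x):
    """Explicit feature map R^2 -> R^3."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k(x, z):
    """Kernel computed directly in R^2, without forming phi."""
    return (x @ z) ** 2

x = np.array([0.7, -1.2])
z = np.array([2.0, 0.5])
print(np.isclose(phi(x) @ phi(z), k(x, z)))   # True: phi(x)^T phi(z) == (x^T z)^2
```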
Example kernels
Linear kernels: k(x, x') = x^\top x'
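Written as plain functions (a sketch; the linear kernel is the one just stated, and the Gaussian radial basis function kernel with width \sigma is the standard form of the radial basis functions named in the lecture title):

```python
import numpy as np

def linear_kernel(x, x2):
    # k(x, x') = x^T x'
    return x @ x2

def rbf_kernel(x, x2, sigma=1.0):
    # Gaussian RBF: k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - x2) ** 2) / (2.0 * sigma ** 2))

x, z = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(linear_kernel(x, z), rbf_kernel(x, z, sigma=1.0))
```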
The kernel SVM classifier is again

f(x) = \sum_i^N \alpha_i y_i\, k(x_i, x) + b

with the non-zero \alpha_i marking the support vectors.
[Figure: training data plotted against feature x and feature y]
[Figures: RBF-kernel SVM classifiers f(x) = \sum_i^N \alpha_i y_i\, k(x_i, x) + b on these data, showing the decision boundary f(x) = 0 and the margin contours f(x) = 1 and f(x) = -1, for \sigma = 1.0 with varying C (including C = 100 and C = 10) and, for fixed C, with \sigma = 1.0, 0.25 and 0.1; legends distinguish the data points from the positive and negative support vectors]
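An experiment along the lines of these figures can be run with scikit-learn (an assumed choice of library); note that SVC parameterises the RBF kernel by gamma, which corresponds to 1/(2\sigma^2) for the width \sigma used here:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# Toy two-class data that is not linearly separable.
X = rng.uniform(-1.0, 1.0, size=(100, 2))
y = np.where((X ** 2).sum(axis=1) < 0.4, 1, -1)

for sigma in (1.0, 0.25, 0.1):
    for C in (10.0, 100.0, 1e6):       # a very large C approximates a hard margin
        clf = SVC(kernel="rbf", C=C, gamma=1.0 / (2.0 * sigma ** 2)).fit(X, y)
        print(f"sigma={sigma:4}  C={C:g}  support vectors={clf.support_vectors_.shape[0]}"
              f"  training accuracy={clf.score(X, y):.2f}")
```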
For N data points x_i, the Gram matrix is the N \times N matrix K with entries

K_{ij} = k(x_i, x_j)
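Computing a Gram matrix is direct once a kernel is chosen; a NumPy sketch on toy data (for the linear kernel it reduces to X X^\top):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))                 # N = 30 data points x_i in R^2

# K_ij = k(x_i, x_j); for the linear kernel this is just X X^T.
K = X @ X.T
print(K.shape)                               # (30, 30)

# For a general kernel function k(x, z):
k = lambda x, z: (x @ z) ** 2                # e.g. the kernel from the kernel-trick example
K_general = np.array([[k(xi, xj) for xj in X] for xi in X])
print(np.allclose(K_general, K_general.T))   # a Gram matrix is symmetric
```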
k(h, h') = …
We will see other examples of kernels later in regression and unsupervised learning
Background reading
Bishop, chapters 6.2 and 7
Hastie et al., chapter 12
More on the web page: http://www.robots.ox.ac.uk/~az/lectures/ml