Neurocomputing

A review of optimization methodologies in support vector machines

John Shawe-Taylor, Shiliang Sun
Department of Computer Science, University College London, Gower Street, London WC1E 6BT, United Kingdom
Department of Computer Science and Technology, East China Normal University, 500 Dongchuan Road, Shanghai 200241, China
Article info

Article history:
Received 13 March 2011
Received in revised form 24 June 2011
Accepted 26 June 2011
Available online 12 August 2011
Communicated by C. Fyfe

Keywords:
Decision support systems
Duality
Optimization methodology
Pattern classification
Support vector machine (SVM)

Abstract

Support vector machines (SVMs) are theoretically well-justified machine learning techniques, which have also been successfully applied to many real-world domains. The use of optimization methodologies plays a central role in finding solutions of SVMs. This paper reviews representative and state-of-the-art techniques for optimizing the training of SVMs, especially SVMs for classification. The objective of this paper is to give readers an overview of the basic elements of and recent advances in training SVMs, and to enable them to develop and implement new optimization strategies for their own SVM-related research.

© 2011 Elsevier B.V. All rights reserved.
1. Introduction
Support vector machines (SVMs) are state-of-the-art machine learning techniques with their roots in structural risk minimization [51,60]. Roughly speaking, by the theory of structural risk minimization, a function from a function class tends to have a low expected risk on data from an underlying distribution if it has a low empirical risk on a data set sampled from this distribution and, simultaneously, the function class has low complexity, as measured, for example, by the VC-dimension [61]. The well-known large margin principle in SVMs [8] essentially restricts the complexity of the function class, exploiting the fact that for some function classes a larger margin corresponds to a lower fat-shattering dimension [51].

Since their invention [8], research on SVMs has exploded in both theory and applications. Recent theoretical advances in characterizing their generalization errors include Rademacher complexity theory [3,52] and PAC-Bayesian theory [33,39]. In practice, SVMs have been successfully applied to many real-world domains [11]. Moreover, SVMs have been combined with other research directions, for example multitask learning, active learning, multiview learning and semi-supervised learning, and have brought forward fruitful approaches [4,15,16,55,56].
2. Optimization theory

The general constrained optimization problem is

\min f(x)
\text{s.t.}\ f_i(x) \le 0,\ i = 1, \ldots, n,
\quad h_j(x) = 0,\ j = 1, \ldots, e.  (1)

For the problem with inequality constraints only,

\min f(x) \quad \text{s.t.}\ f_i(x) \le 0,\ i = 1, \ldots, n,

the Lagrangian is defined as

L(x, \alpha) = f(x) + \sum_{i=1}^{n} \alpha_i f_i(x), \quad \alpha_i \ge 0,

and a pair (x^*, \alpha^*) is a saddle point of L if

L(x^*, \alpha) \le L(x^*, \alpha^*) \le L(x, \alpha^*) \quad \text{for all } x \in D_f \text{ and } \alpha \succeq 0.

When equality constraints are present, the generalized Lagrangian is

L(x, \alpha, \beta) = f(x) + \sum_{i=1}^{n} \alpha_i f_i(x) + \sum_{j=1}^{e} \beta_j h_j(x), \quad \alpha_i \ge 0,\ \beta_j \in \mathbb{R}.  (10)

A function f is convex if, for all x, y in its domain and all \theta \in [0, 1],

f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y).  (11)

At an optimal x^* with multipliers \alpha^*, the complementary slackness conditions

\alpha_i^* f_i(x^*) = 0, \quad i = 1, \ldots, n,  (12)

hold.
Theorem 5 (Necessity of KKT conditions). For a convex optimization problem defined by (1), if all the inequality constraints f_i satisfy one of the constraint qualifications of Theorem 4, then the KKT saddle point conditions (5) given in Theorem 1 are necessary for optimality.
A convex optimization problem is defined as the minimization of a convex function over a convex set. It can be represented as

\min f(x)
\text{s.t.}\ f_i(x) \le 0,\ i = 1, \ldots, n,
\quad a_j^\top x = b_j,\ j = 1, \ldots, e,  (13)

where f and the f_i are convex and the equality constraints are affine. For differentiable convex problems, the KKT saddle point conditions can be expressed through derivatives of the Lagrangian L(x, \alpha) = f(x) + \sum_{i=1}^{n} \alpha_i f_i(x): at an optimal x^* with multipliers \alpha^*,

\partial_x L(x^*, \alpha^*) = 0, \quad \partial_{\alpha_i} L(x^*, \alpha^*) = f_i(x^*) \le 0, \quad \alpha_i^* f_i(x^*) = 0, \quad \alpha_i^* \ge 0, \quad i = 1, \ldots, n,

and L has a saddle point at (x^*, \alpha^*).
3. Optimization problems of SVMs

The optimization problem for training an SVM on a linearly separable data set \{(x_i, y_i)\}_{i=1}^{n}, y_i \in \{+1, -1\}, is

\min_{w,b}\ F(w) = \tfrac{1}{2}\|w\|^2
\text{s.t.}\ y_i(w^\top x_i + b) \ge 1,\ i = 1, \ldots, n.  (15)

This is a differentiable convex problem with affine constraints, so the KKT conditions above apply. The resulting classifier is

c_{w,b}(x) = \mathrm{sign}(w^\top x + b).  (19)

The Lagrangian of problem (15) is

L(w, b, \lambda) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \lambda_i [y_i(w^\top x_i + b) - 1], \quad \lambda_i \ge 0.  (20)

Setting the derivatives with respect to w and b to zero gives

\partial_w L(w^*, b^*, \lambda^*) = w^* - \sum_{i=1}^{n} \lambda_i^* y_i x_i = 0,  (21)

\partial_b L(w^*, b^*, \lambda^*) = -\sum_{i=1}^{n} \lambda_i^* y_i = 0,  (22)

so the optimal weight vector is an expansion over the training examples,

w^* = \sum_{i=1}^{n} \lambda_i^* y_i x_i.  (23)

Substituting (21) and (22) back into the Lagrangian eliminates w and b:

\tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \lambda_i y_i w^\top x_i + \sum_{i=1}^{n} \lambda_i = \sum_{i=1}^{n} \lambda_i - \tfrac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j x_i^\top x_j.  (24)

In matrix form, the dual optimization problem is therefore

\max_{\lambda}\ g(\lambda) = \lambda^\top \mathbf{1} - \tfrac{1}{2} \lambda^\top D \lambda
\text{s.t.}\ \lambda^\top y = 0,
\quad \lambda \succeq 0,  (25)

where D_{ij} = y_i y_j x_i^\top x_j. By complementary slackness,

\lambda_i^* [y_i(x_i^\top w^* + b^*) - 1] = 0, \quad i = 1, \ldots, n.  (26)

Therefore, for all support vectors (examples with \lambda_i^* > 0) the constraints are active with equality. The bias b^* can thus be calculated from any support vector x_i as

b^* = y_i - x_i^\top w^*.  (27)
When the data are not linearly separable, slack variables \xi_i are introduced, giving the soft-margin SVM

\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\text{s.t.}\ y_i(w^\top x_i + b) \ge 1 - \xi_i,
\quad \xi_i \ge 0,\ i = 1, \ldots, n,  (30)

where the scalar C controls the balance between the margin and the empirical loss. This problem is also a differentiable convex problem with affine constraints. The constraint qualification is satisfied by the refined Slater condition.

The Lagrangian of problem (30) is

L(w, b, \xi, \lambda, \gamma) = \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \lambda_i [y_i(w^\top x_i + b) - 1 + \xi_i] - \sum_{i=1}^{n} \gamma_i \xi_i, \quad \lambda_i \ge 0,\ \gamma_i \ge 0.  (31)

Setting the derivatives to zero yields

\partial_w L(w^*, b^*, \xi^*, \lambda^*, \gamma^*) = w^* - \sum_{i=1}^{n} \lambda_i^* y_i x_i = 0,  (32)

\partial_b L(w^*, b^*, \xi^*, \lambda^*, \gamma^*) = -\sum_{i=1}^{n} \lambda_i^* y_i = 0,  (33)

\partial_{\xi_i} L(w^*, b^*, \xi^*, \lambda^*, \gamma^*) = C - \lambda_i^* - \gamma_i^* = 0,\ i = 1, \ldots, n,  (34)

where (32) and (33) are respectively identical to (21) and (22) for the separable case. The dual optimization problem is derived as

\max_{\lambda}\ g(\lambda) = \lambda^\top \mathbf{1} - \tfrac{1}{2} \lambda^\top D \lambda
\text{s.t.}\ \lambda^\top y = 0,
\quad \lambda \succeq 0,
\quad \lambda \preceq C \mathbf{1}.  (35)

To obtain nonlinear classifiers, the data are mapped into a feature space F by

\phi: x \in X \mapsto \phi(x) \in F,  (39)

and a kernel function k(x, z) = \langle \phi(x), \phi(z) \rangle computes inner products in F. In the associated reproducing kernel Hilbert space, the reproducing property reads

\langle \phi_F, k(x, \cdot) \rangle = \phi_F(x).  (40)

Commonly used kernels include the polynomial kernel

k(x, z) = (x^\top z + 1)^{\mathrm{deg}},  (42)

where deg is the degree of the polynomial, and the Gaussian radial basis function (RBF) kernel (Gaussian kernel for short)

k(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right).  (43)
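Both kernels are straightforward to implement. The sketch below vectorizes (42) and (43) over two sets of points; the +1 offset in the polynomial kernel follows the reconstruction above and is one common convention, not necessarily the paper's exact form.

```python
import numpy as np

def polynomial_kernel(X, Z, deg=3):
    """Polynomial kernel (42): k(x, z) = (x'z + 1)^deg, for all pairs of rows."""
    return (X @ Z.T + 1.0) ** deg

def gaussian_kernel(X, Z, sigma=1.0):
    """Gaussian kernel (43): k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    # ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x'z, expanded for all pairs at once
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-sq / (2 * sigma**2))
```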
Using the kernel trick, the optimization problem (35) for SVMs becomes

\max_{\lambda}\ g(\lambda) = \lambda^\top \mathbf{1} - \tfrac{1}{2} \lambda^\top D \lambda
\text{s.t.}\ \lambda^\top y = 0,
\quad \lambda \succeq 0,
\quad \lambda \preceq C \mathbf{1},  (44)

where the entries of D are D_{ij} = y_i y_j k(x_i, x_j). The solution for the SVM classifier is formulated as

c^*(x) = \mathrm{sign}\left( \sum_{i=1}^{n} y_i \lambda_i^* k(x_i, x) + b^* \right).  (45)
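Once \lambda^* and b^* are available from any solver, evaluating (45) is a single matrix-vector product. A direct transcription (names are ours):

```python
import numpy as np

def kernel_predict(X_train, y_train, lam, b, kernel, X_test):
    """Evaluate the kernel classifier (45) on test points."""
    K = kernel(X_test, X_train)               # k(x, x_i) for every test x and training x_i
    return np.sign(K @ (lam * y_train) + b)   # sign(sum_i y_i l_i k(x_i, x) + b*)
```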
4. Optimization methodologies

4.1. Interior point algorithms

Rewriting the dual (44) as a minimization, interior point methods address the quadratic program

\min_{\lambda}\ \tfrac{1}{2} \lambda^\top D \lambda - \lambda^\top \mathbf{1}
\text{s.t.}\ \lambda^\top y = 0,
\quad \lambda \preceq C \mathbf{1},
\quad \lambda \succeq 0.  (46)

Introducing the slack vector \zeta = C\mathbf{1} - \lambda, a multiplier h for the equality constraint, and nonnegative multipliers s and z for the box constraints, the optimality conditions couple dual feasibility,

D\lambda - \mathbf{1} + h y + s - z = 0,  (47)

primal feasibility and complementarity,

\lambda^\top y = 0,\ \lambda \preceq C\mathbf{1},\ \lambda \succeq 0 \quad \text{(primal feasibility)},
s^\top (C\mathbf{1} - \lambda) = 0,\ z^\top \lambda = 0 \quad \text{(zero KKT gap)},
s \succeq 0,\ z \succeq 0,  (48)

and, in terms of the slack vector,

\lambda + \zeta = C\mathbf{1}, \quad s^\top \zeta = 0, \quad \zeta \succeq 0.  (49)

A primal-dual interior point method perturbs the complementarity conditions with a barrier parameter \mu > 0 and applies Newton's method to the resulting system. For this purpose, expanding the equalities in (48) and (49), e.g., substituting \lambda with \lambda + \Delta\lambda, yields

D\Delta\lambda + \Delta s - \Delta z + y \Delta h = -(D\lambda - \mathbf{1} + h y + s - z) =: \rho_1,
\Delta\lambda + \Delta\zeta = C\mathbf{1} - \lambda - \zeta,  (50)

together with the linearized complementarity equations

\zeta_i^{-1} s_i \Delta\zeta_i + \Delta s_i = \mu \zeta_i^{-1} - s_i - \zeta_i^{-1} \Delta\zeta_i \Delta s_i =: \rho_{kkt1,i},\ i = 1, \ldots, n,
\lambda_i^{-1} z_i \Delta\lambda_i + \Delta z_i = \mu \lambda_i^{-1} - z_i - \lambda_i^{-1} \Delta\lambda_i \Delta z_i =: \rho_{kkt2,i},\ i = 1, \ldots, n.

As it is clear that, for an iterate satisfying \lambda + \zeta = C\mathbf{1},

\Delta\zeta = -\Delta\lambda,  (51)

we only need to further solve for \Delta\lambda and \Delta h. Substituting (51) into the first two equalities of (50), we have the reduced system

\begin{bmatrix} D + \mathrm{diag}(\zeta^{-1} s + \lambda^{-1} z) & y \\ y^\top & 0 \end{bmatrix} \begin{bmatrix} \Delta\lambda \\ \Delta h \end{bmatrix} = \begin{bmatrix} \rho_1 - \rho_{kkt1} + \rho_{kkt2} \\ -y^\top \lambda \end{bmatrix}.  (52)

For small and moderately sized problems (say with less than 5000 examples), interior point algorithms are reliable and accurate optimization methods to choose [49]. For large-scale problems, methods that exploit the sparsity of the dual variables must be adopted; if storage is further limited, compact representations should be considered, for example using an approximation for the kernel matrix.
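The reduced system (52) is just a dense symmetric solve per iteration. The sketch below forms one (undamped) Newton direction under the sign conventions used above, which are our reconstruction; production interior point codes such as LOQO [59] add predictor-corrector steps, step-length control, and a schedule for \mu. All names here are ours.

```python
import numpy as np

def newton_direction(D, y, lam, zeta, s, z, h, mu):
    """One Newton direction for the reduced KKT system (52); a sketch only."""
    rho1 = -(D @ lam - 1.0 + h * y + s - z)    # residual of dual feasibility (47)
    rkkt1 = mu / zeta - s                      # perturbed complementarity for (zeta, s)
    rkkt2 = mu / lam - z                       # perturbed complementarity for (lam, z)
    diag = s / zeta + z / lam                  # diagonal term in (52)
    KKT = np.block([[D + np.diag(diag), y[:, None]],
                    [y[None, :], np.zeros((1, 1))]])
    rhs = np.concatenate([rho1 - rkkt1 + rkkt2, [-(y @ lam)]])
    step = np.linalg.solve(KKT, rhs)
    return step[:-1], step[-1]                 # (Delta lambda, Delta h)
```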
4.2. Chunking and SMO

The chunking algorithm [60] relies on the fact that the value of the quadratic form in (46) is unchanged if we remove the rows and columns of D that correspond to zero entries of \lambda. That is, only the support vectors are needed to represent the final classifier. Thus, the large quadratic program can be divided into a series of smaller subproblems.

Chunking starts with a subset (chunk) of the training data and solves a small quadratic program (QP) subproblem. It then retains the support vectors, replaces the other data in the chunk with a certain number of examples violating the KKT conditions, and solves the second subproblem for a new classifier. Each subproblem is initialized with the solution of the previous one. At the last step, all nonzero entries of \lambda have been identified.
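A schematic version of this loop is below. The chunk-level solver `solve_chunk_qp` is a hypothetical stand-in (any QP approach, such as the interior point sketch above, would do), the KKT test uses the gradient form g = D\lambda - \mathbf{1} with the bias estimated from free variables, and all tolerances and sizes are ours.

```python
import numpy as np

def kkt_violators(D, y, lam, C, tol=1e-3):
    """Indices violating the KKT conditions of (46), using g = D lam - 1."""
    g = D @ lam - 1.0
    free = (lam > tol) & (lam < C - tol)
    b = -np.mean(y[free] * g[free]) if free.any() else 0.0  # bias from free variables
    m = g + y * b                                           # zero at optimality if free
    bad = (((lam <= tol) & (m < -tol)) |                    # lam_i = 0 but constraint violated
           ((lam >= C - tol) & (m > tol)) |                 # lam_i = C but constraint violated
           (free & (np.abs(m) > tol)))
    return np.where(bad)[0]

def chunking(D, y, C, chunk_size=500):
    """Chunking loop: small QPs over a working chunk until no KKT violators remain."""
    lam = np.zeros(len(y))
    chunk = np.arange(min(chunk_size, len(y)))
    while True:
        lam = solve_chunk_qp(D, y, C, chunk, lam)           # hypothetical small-QP solver
        sv = np.where(lam > 1e-8)[0]                        # keep the support vectors
        viol = np.setdiff1d(kkt_violators(D, y, lam, C), sv)
        if viol.size == 0:
            return lam
        chunk = np.concatenate([sv, viol[:max(chunk_size - sv.size, 1)]])
```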
The sequential minimal optimization (SMO) algorithm takes chunking to its extreme: each subproblem optimizes just two variables \lambda_i and \lambda_j, the smallest number compatible with the equality constraint \lambda^\top y = 0. The two-variable subproblem

\min_{\lambda_i, \lambda_j}\ \tfrac{1}{2}(\lambda_i^2 D_{ii} + 2\lambda_i \lambda_j D_{ij} + \lambda_j^2 D_{jj}) - \lambda_i - \lambda_j
\text{s.t.}\ \lambda_i y_i + \lambda_j y_j = r_{ij},
\quad 0 \le \lambda_i \le C,
\quad 0 \le \lambda_j \le C,  (54)

where r_{ij} is fixed by the remaining variables, can be solved analytically, so no numerical QP solver is needed at all.
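For reference, one analytic update in the style of Platt's SMO is sketched below; the clipping bounds L and H come from intersecting the constraint \lambda_i y_i + \lambda_j y_j = r_{ij} with the box [0, C]^2. Working-set selection heuristics and the bias update are omitted, and all names are ours.

```python
import numpy as np

def smo_step(K, y, lam, b, C, i, j):
    """One closed-form solve of the two-variable subproblem (54) for the pair (i, j)."""
    f = (lam * y) @ K + b                        # current decision values f(x_k)
    Ei, Ej = f[i] - y[i], f[j] - y[j]            # prediction errors
    eta = K[i, i] + K[j, j] - 2 * K[i, j]        # curvature of (54) along the constraint
    if eta <= 0:
        return lam                               # skip degenerate pairs in this sketch
    if y[i] == y[j]:                             # box [L, H] induced by r_ij and 0 <= lam <= C
        L, H = max(0.0, lam[i] + lam[j] - C), min(C, lam[i] + lam[j])
    else:
        L, H = max(0.0, lam[j] - lam[i]), min(C, C + lam[j] - lam[i])
    lj = np.clip(lam[j] + y[j] * (Ei - Ej) / eta, L, H)
    lam[i] += y[i] * y[j] * (lam[j] - lj)        # keep lam_i y_i + lam_j y_j = r_ij
    lam[j] = lj
    return lam
```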
4.3. Coordinate descent

Coordinate descent methods [21] apply to the dual without the bias term, where the equality constraint disappears:

\min_{\lambda}\ f(\lambda) = \tfrac{1}{2}\lambda^\top D \lambda - \lambda^\top \mathbf{1}
\text{s.t.}\ \lambda \preceq C\mathbf{1},
\quad \lambda \succeq 0.  (56)

An outer iteration k updates the coordinates one at a time through intermediate iterates

\lambda^{k,1} = \lambda^k, \quad \lambda^{k,n+1} = \lambda^{k+1},
\lambda^{k,i} = (\lambda_1^{k+1}, \ldots, \lambda_{i-1}^{k+1}, \lambda_i^k, \ldots, \lambda_n^k)^\top, \quad i = 2, \ldots, n.

When we update \lambda^{k,i} to \lambda^{k,i+1}, the following one-variable subproblem [21] is solved:

\min_{d}\ f(\lambda^{k,i} + d\, e_i)
\text{s.t.}\ 0 \le \lambda_i^{k,i} + d \le C,  (57)

where e_i is the i-th unit vector. This is a univariate quadratic over an interval and has a closed-form solution: step to the unconstrained minimizer d = -\nabla_i f(\lambda^{k,i}) / D_{ii} and clip the result to the box.
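One sweep of this scheme, directly implementing (57) with the clipped closed-form step; function and tolerance choices are ours, and [21] adds shrinking and careful gradient maintenance on top of this skeleton.

```python
import numpy as np

def cd_sweep(D, C, lam):
    """One pass of coordinate descent on (56): minimize along each e_i, then clip."""
    for i in range(len(lam)):
        g_i = D[i] @ lam - 1.0                        # partial derivative of f at lam
        if D[i, i] > 0:
            lam[i] = np.clip(lam[i] - g_i / D[i, i], 0.0, C)
    return lam
```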
4.4. Active set methods

Active set methods partition the dual variables into a working set \lambda_s and a fixed set \lambda_a whose entries sit at their bounds (0 or C). At each iteration, the dual is optimized over the working set only:

\max_{\lambda_s}\ -\tfrac{1}{2}\lambda_s^\top D_{ss} \lambda_s - \lambda_s^\top D_{sa} \lambda_a + \lambda_s^\top \mathbf{1}
\text{s.t.}\ \lambda_s^\top y_s = -\lambda_a^\top y_a,
\quad \lambda_s \preceq C\mathbf{1},
\quad \lambda_s \succeq 0,  (60)

where D_{ss}, D_{sa} are the corresponding blocks of D and y_s, y_a the corresponding parts of y. The working set is then updated by moving variables whose bound constraints or multiplier signs indicate suboptimality, and the bias is recovered from a max-min condition on the gradient over the two bound sets (61).
4.5. Solving the primal with Newton's method

Although most literature on SVMs deals with the dual optimization problem, there is also some work concentrating on optimization of the primal, because primal and dual are indeed two sides of the same coin. For example, [27,38] studied the primal optimization of linear SVMs, and [14,34] studied the primal optimization of both linear and nonlinear SVMs. These works mainly use Newton-type methods. Below we introduce the optimization problem and the approach adopted in [14].

Consider a kernel function k and an associated reproducing kernel Hilbert space H. The optimization problem is

\min_{f \in H}\ r \|f\|_H^2 + \sum_{i=1}^{n} R(y_i, f(x_i)),  (62)

where r > 0 is a regularization parameter and R is a loss function. Setting the gradient with respect to f to zero,

2 r f^* + \sum_{i=1}^{n} \frac{\partial R}{\partial f}(y_i, f^*(x_i))\, k(x_i, \cdot) = 0,  (63)

shows that the optimal f^* admits the expansion

f^*(x) = \sum_{i=1}^{n} \beta_i k(x_i, x).  (64)

Substituting (64) into (62) turns the problem into an optimization over the coefficient vector \beta:

\min_{\beta}\ r\, \beta^\top K \beta + \sum_{i=1}^{n} R(y_i, K_i^\top \beta),  (65)

where K is the kernel matrix and K_i its i-th column. To make the objective twice differentiable, a huberized hinge loss is used; writing t = f(x), the loss is formulated as

R(y, t) = 0, \quad \text{if } yt > 1 + h,
R(y, t) = (1 + h - yt)^2 / (4h), \quad \text{if } |1 - yt| \le h,
R(y, t) = 1 - yt, \quad \text{if } yt < 1 - h,  (66)

where the parameter h typically varies between 0.01 and 0.5 [14]. The Newton step of the variable update is

\beta \leftarrow \beta - \mathrm{step} \cdot H^{-1} \nabla,  (67)

where \nabla and H are the gradient and Hessian of the objective (65) with respect to \beta; setting the derivative of the local quadratic model to zero, we obtain the full Newton direction d = -H^{-1}\nabla.

It was shown that primal and dual optimizations are equivalent in terms of the solution and time complexity [14]. However, as primal optimization is more focused on minimizing the desired objective function, it may be superior to the dual optimization when one is interested in finding approximate solutions [14]. Here, we would like to point out an alternative solution approach called the augmented Lagrangian method [5], which has approximately the same convergence speed as Newton's method.
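A compact sketch of this scheme with the huberized loss (66) follows; the derivative formulas are ours (obtained by differentiating (66)), and the line search implied by "step" in (67) and efficient factorizations are omitted.

```python
import numpy as np

def huber_derivs(y, t, h):
    """First and second derivatives w.r.t. t of the huberized loss (66)."""
    yt = y * t
    d1 = np.where(yt > 1 + h, 0.0,
                  np.where(yt < 1 - h, -y, -y * (1 + h - yt) / (2 * h)))
    d2 = np.where(np.abs(1 - yt) <= h, 1.0 / (2 * h), 0.0)
    return d1, d2

def primal_newton(K, y, r=1.0, h=0.1, iters=20):
    """Newton iterations on the primal objective (65); full steps, no line search."""
    beta = np.zeros(len(y))
    for _ in range(iters):
        d1, d2 = huber_derivs(y, K @ beta, h)
        grad = 2 * r * (K @ beta) + K @ d1        # gradient of (65) w.r.t. beta
        H = 2 * r * K + K @ (d2[:, None] * K)     # Hessian: 2rK + K diag(d2) K
        beta -= np.linalg.solve(H + 1e-8 * np.eye(len(y)), grad)
    return beta
```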
4.6. Stochastic subgradient with projection
The subgradient method is a simple iterative algorithm for minimizing a convex objective function that can be non-differentiable. The method looks very much like the ordinary gradient method applied to differentiable functions, but with several subtle differences. For instance, the subgradient method is not truly a descent method: the function value often increases. Furthermore, the step lengths in the subgradient method are fixed in advance, instead of being found by a line search as in the gradient method [9].

A subgradient of a function f(x): \mathbb{R}^d \to \mathbb{R} at x is any vector g that satisfies

f(y) \ge f(x) + g^\top (y - x) \quad \text{for all } y.  (70)

The Pegasos algorithm [50] applies a stochastic subgradient method with projection to the SVM primal, written as the unconstrained minimization

\min_{w}\ f(w) = \frac{r}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} R(w; x_i, y_i),  (71)

with the (non-differentiable) hinge loss

R(w; x, y) = \max\{0,\ 1 - y\, w^\top x\}.  (72)

At iteration t, a subset A_t of k training examples is chosen, and the approximate objective

f(w, A_t) = \frac{r}{2}\|w\|^2 + \frac{1}{k} \sum_{(x,y) \in A_t} R(w; x, y)  (73)

is considered. The iterate is updated by

w_{t+\frac{1}{2}} = w_t - \eta_t \nabla_t,  (74)

where the step length \eta_t = 1/(rt), and \nabla_t = r w_t - \frac{1}{k} \sum_{(x,y) \in A_t^+} y\, x is a subgradient of f(w, A_t) at w_t, with A_t^+ the examples in A_t that have nonzero hinge loss. Last, w_{t+\frac{1}{2}} is updated to w_{t+1} by projecting it onto the set

\{w : \|w\| \le 1/\sqrt{r}\},  (75)

which is shown to contain the optimal solution of the SVM. The algorithm finally outputs w_{T+1}.

If k = 1, the algorithm is a variant of the stochastic gradient method; if k is equal to the total number of training examples, the algorithm is the subgradient projection method. Besides simplicity, the algorithm is guaranteed to obtain a solution of accuracy \varepsilon within \tilde{O}(1/\varepsilon) iterations, where an \varepsilon-accurate solution \hat{w} is defined as f(\hat{w}) \le \min_w f(w) + \varepsilon [50].
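A compact sketch of mini-batch Pegasos following (73)-(75); variable names and default parameters are ours.

```python
import numpy as np

def pegasos(X, y, r=0.1, k=8, T=1000, seed=0):
    """Mini-batch Pegasos [50]; returns the final weight vector."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    w = np.zeros(X.shape[1])
    for t in range(1, T + 1):
        idx = rng.choice(n, size=min(k, n), replace=False)   # mini-batch A_t
        Xb, yb = X[idx], y[idx]
        plus = yb * (Xb @ w) < 1                             # A_t^+: hinge-active examples
        grad = r * w - (yb[plus, None] * Xb[plus]).sum(0) / len(idx)  # subgradient of (73)
        w = w - grad / (r * t)                               # step length eta_t = 1/(rt)
        norm, radius = np.linalg.norm(w), 1 / np.sqrt(r)
        if norm > radius:
            w *= radius / norm                               # projection onto the ball (75)
    return w
```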
4.7. Cutting plane algorithms

Cutting plane algorithms [53,57] also work on a primal regularized risk formulation of the form

\min_{w}\ \frac{1}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} R(w; x_i, y_i),  (76)

iteratively building a piecewise linear lower bound on the empirical risk term from subgradients (the cutting planes) and minimizing the regularized lower bound until the gap to the true objective falls below a tolerance [57].

5. SVMs for density estimation and regression

SVMs have also been extended beyond classification, for example to density estimation and regression. The one-class SVM [48] estimates the support of a high-dimensional distribution by solving a primal problem (77) over (w, \xi, \rho) and labels points with the decision function

c(x) = \mathrm{sign}(\langle w, \phi(x) \rangle - \rho).  (78)

Support vector regression adopts an \varepsilon-insensitive loss and solves

\min_{w,b,\xi,\xi^*}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i + C\sum_{i=1}^{n}\xi_i^*
\text{s.t.}\ y_i - w^\top x_i - b \le \varepsilon + \xi_i,
\quad w^\top x_i + b - y_i \le \varepsilon + \xi_i^*,
\quad \xi_i, \xi_i^* \ge 0,\ i = 1, \ldots, n,  (79)

whose dual is again a box-constrained quadratic program that the methods of Section 4 can handle.
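For completeness, the standard dual of (79) (our derivation, following the usual \varepsilon-insensitive SVR dual; the paper does not spell it out here) can again be passed to a generic QP solver by stacking u = [a; a*]:

```python
import numpy as np
from cvxopt import matrix, solvers

def svr_dual_qp(K, y, C=1.0, eps=0.1):
    """Solve the standard SVR dual as one QP in the stacked variable u = [a; a*]."""
    n = len(y)
    # 0.5 (a - a*)'K(a - a*) expands to 0.5 u'P u with the block matrix below
    P = np.block([[K, -K], [-K, K]])
    q = np.concatenate([eps - y, eps + y])                  # eps sum(a + a*) - y'(a - a*)
    G = np.vstack([-np.eye(2 * n), np.eye(2 * n)])          # 0 <= a, a* <= C
    h = np.concatenate([np.zeros(2 * n), C * np.ones(2 * n)])
    A = np.concatenate([np.ones(n), -np.ones(n)])[None, :]  # sum(a - a*) = 0
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h),
                     matrix(A), matrix(0.0))
    u = np.ravel(sol['x'])
    return u[:n] - u[n:]                                    # expansion coefficients a - a*
```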
6. SDP and DC programming for kernel learning

Semidefinite programming is a recently popular class of optimization with wide engineering applications, including kernel learning for SVMs [32,35]. For solving semidefinite programs, there exist interior point algorithms with good theoretical properties and computational efficiency [58].

A semidefinite program (SDP) is a special convex optimization problem with a linear objective function, and linear matrix inequality and affine equality constraints.

Definition 10 (SDP). An SDP is an optimization problem of the form

\min_{u}\ c^\top u
\text{s.t.}\ F^j(u) := F_0^j + u_1 F_1^j + \cdots + u_q F_q^j \preceq 0,\ j = 1, \ldots, L,
\quad A u = b.  (85)
In kernel learning for SVMs, the kernel matrix K is constrained to a linear combination of candidate kernel matrices K_1, \ldots, K_m, either as

K = \sum_{i=1}^{m} \mu_i K_i, \quad K \succeq 0, \quad \mathrm{trace}(K) \le c,  (86)

or with the additional nonnegativity constraints

K = \sum_{i=1}^{m} \mu_i K_i, \quad \mu_i \ge 0,\ i = 1, \ldots, m, \quad K \succeq 0, \quad \mathrm{trace}(K) \le c.  (87)

The resulting kernel learning problems can be formulated as SDPs.
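As a toy illustration of how constraints like (87) are expressed in a modern conic modeling tool, the sketch below learns the mixing weights \mu_i by maximizing the simple alignment objective \langle K, y y^\top \rangle. This objective is our stand-in, not the SDP objectives of [32,35], and CVXPY is our tooling choice, not the paper's.

```python
import cvxpy as cp
import numpy as np

def learn_kernel_weights(Ks, y, c=1.0):
    """Learn mu under the constraints (87), maximizing alignment with y y'."""
    m = len(Ks)
    mu = cp.Variable(m, nonneg=True)              # mu_i >= 0, as in (87)
    K = sum(mu[i] * Ks[i] for i in range(m))      # K = sum_i mu_i K_i
    constraints = [K >> 0, cp.trace(K) <= c]      # K PSD and trace(K) <= c
    objective = cp.Maximize(cp.sum(cp.multiply(K, np.outer(y, y))))
    cp.Problem(objective, constraints).solve()
    return mu.value
```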
7. Conclusion
We have presented a review of optimization techniques used for training SVMs. The KKT optimality conditions are a milestone in optimization theory, and the optimization of many real problems relies heavily on their application. We have shown how to instantiate the KKT conditions for SVMs. Along with the introduction of the SVM algorithms, a characterization of effective kernels has also been presented, which should be helpful for understanding SVMs with nonlinear classifiers.

Among the optimization methodologies applied to SVMs, we have reviewed interior point algorithms, chunking and SMO, coordinate descent, active set methods, Newton's method for solving the primal, stochastic subgradient with projection, and cutting plane algorithms. The first four methods focus on the optimization of dual problems, while the last three directly deal with the optimization of the primal. This reflects current developments in SVM training.

Since their invention, SVMs have been extended to multiple variants to address different applications, such as the briefly introduced density estimation and regression problems. For kernel learning in SVM-type applications, we further introduced SDP and DC programming, which can be very useful for optimizing problems more complex than traditional SVMs with fixed kernels.

We believe the optimization techniques introduced in this paper can be applied to other SVM-related research as well. Based on these techniques, readers can implement their own solvers for different purposes. One future line of research for training SVMs is developing fast and accurate approximation methods for the challenging problem of large-scale applications. Another direction is investigating recent promising optimization methods (e.g., [24] and its references) and applying them to variants of SVMs for particular machine learning applications.
Acknowledgements
[48] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. Smola, R. Williamson, Estimating the support of a high-dimensional distribution, Neural Computation 13 (2001) 1443–1471.
[49] B. Schölkopf, A. Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2002.
[50] S. Shalev-Shwartz, Y. Singer, N. Srebro, Pegasos: primal estimated sub-gradient solver for SVM, in: Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 807–814.
[51] J. Shawe-Taylor, P. Bartlett, R. Williamson, M. Anthony, Structural risk minimization over data-dependent hierarchies, IEEE Transactions on Information Theory 44 (1998) 1926–1940.
[52] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, Cambridge, UK, 2004.
[53] N. Shor, Minimization Methods for Non-Differentiable Functions, Springer-Verlag, Berlin, 1985.
[54] M. Slater, A note on Motzkin's transposition theorem, Econometrica 19 (1951) 185–186.
[55] S. Sun, D. Hardoon, Active learning with extremely sparse labeled examples, Neurocomputing 73 (2010) 2980–2988.
[56] S. Sun, J. Shawe-Taylor, Sparse semi-supervised learning using conjugate functions, Journal of Machine Learning Research 11 (2010) 2423–2455.
[57] C. Teo, Q. Le, A. Smola, S. Vishwanathan, A scalable modular convex solver for regularized risk minimization, in: Proceedings of the 13th ACM Conference on Knowledge Discovery and Data Mining, 2007, pp. 727–736.
[58] L. Vandenberghe, S. Boyd, Semidefinite programming, SIAM Review 38 (1996) 49–95.
[59] R. Vanderbei, LOQO User's Manual, Version 3.10, Technical Report SOR-97-08, Statistics and Operations Research, Princeton University, 1997.
[60] V. Vapnik, Estimation of Dependences Based on Empirical Data, Springer-Verlag, New York, 1982.
[61] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
[62] L. Zanni, T. Serafini, G. Zanghirati, Parallel software for training large scale support vector machines on multiprocessor systems, Journal of Machine Learning Research 7 (2006) 1467–1492.
[63] T. Zhang, Solving large scale linear prediction problems using stochastic gradient descent algorithms, in: Proceedings of the 21st International Conference on Machine Learning, 2004, pp. 919–926.