Abstract. The tutorial starts with an overview of the concepts of VC dimension and structural risk minimization. We then describe linear Support Vector Machines (SVMs) for separable and non-separable data, working through a non-trivial example in detail. We describe a mechanical analogy, and discuss when SVM solutions are unique and when they are global. We describe how support vector training can be practically implemented, and discuss in detail the kernel mapping technique which is used to construct SVM solutions which are nonlinear in the data. We show how Support Vector machines can have very large (even infinite) VC dimension by computing the VC dimension for homogeneous polynomial and Gaussian radial basis function kernels. While very high VC dimension would normally bode ill for generalization performance, and while at present there exists no theory which shows that good generalization performance is guaranteed for SVMs, there are several arguments which support the observed high accuracy of SVMs, which we review. Results of some experiments which were inspired by these arguments are also presented. We give numerous examples and proofs of most of the key theorems. There is new material, and I hope that the reader will find that even old material is cast in a fresh light.

Keywords: Support Vector Machines, Statistical Learning Theory, VC Dimension, Pattern Recognition

Appeared in: Data Mining and Knowledge Discovery 2, 121-167, 1998
1. Introduction

The purpose of this paper is to provide an introductory yet extensive tutorial on the basic ideas behind Support Vector Machines (SVMs). The books (Vapnik, 1995; Vapnik, 1998) contain excellent descriptions of SVMs, but they leave room for an account whose purpose from the start is to teach. Although the subject can be said to have started in the late seventies (Vapnik, 1979), it is only now receiving increasing attention, and so the time appears suitable for an introductory review. The tutorial dwells entirely on the pattern recognition problem. Many of the ideas there carry directly over to the cases of regression estimation and linear operator inversion, but space constraints precluded the exploration of these topics here.

The tutorial contains some new material. All of the proofs are my own versions, where I have placed a strong emphasis on their being both clear and self-contained, to make the material as accessible as possible. This was done at the expense of some elegance and generality: however generality is usually easily added once the basic ideas are clear. The longer proofs are collected in the Appendix.
By way of motivation, and to alert the reader to some of the literature, we summarize some recent applications and extensions of support vector machines. For the pattern recognition case, SVMs have been used for isolated handwritten digit recognition (Cortes and Vapnik, 1995; Scholkopf, Burges and Vapnik, 1995; Scholkopf, Burges and Vapnik, 1996; Burges and Scholkopf, 1997), object recognition (Blanz et al., 1996), speaker identification (Schmidt, 1996), charmed quark detection¹, face detection in images (Osuna, Freund and Girosi, 1997a), and text categorization (Joachims, 1997). For the regression estimation case, SVMs have been compared on benchmark time series prediction tests (Muller et al., 1997; Mukherjee, Osuna and Girosi, 1997), the Boston housing problem (Drucker et al., 1997), and (on artificial data) on the PET operator inversion problem (Vapnik, Golowich and Smola, 1996). In most of these cases, SVM generalization performance (i.e. error rates on test sets) either matches or is significantly better than that of competing methods. The use of SVMs for density estimation (Weston et al., 1997) and ANOVA decomposition (Stitson et al., 1997) has also been studied. Regarding extensions, the basic SVMs contain no prior knowledge of the problem (for example, a large class of SVMs for the image recognition problem would give the same results if the pixels were first permuted randomly (with each image suffering the same permutation), an act of vandalism that would leave the best performing neural networks severely handicapped) and much work has been done on incorporating prior knowledge into SVMs (Scholkopf, Burges and Vapnik, 1996; Scholkopf et al., 1998a; Burges, 1998). Although SVMs have good generalization performance, they can be abysmally slow in test phase, a problem addressed in (Burges, 1996; Osuna and Girosi, 1998). Recent work has generalized the basic ideas (Smola, Scholkopf and Muller, 1998a; Smola and Scholkopf, 1998), shown connections to regularization theory (Smola, Scholkopf and Muller, 1998b; Girosi, 1998; Wahba, 1998), and shown how SVM ideas can be incorporated in a wide range of other algorithms (Scholkopf, Smola and Muller, 1998b; Scholkopf et al, 1998c). The reader may also find the thesis of (Scholkopf, 1997) helpful.
The problem which drove the initial development of SVMs occurs in several guises - the bias variance tradeoff (Geman, Bienenstock and Doursat, 1992), capacity control (Guyon et al., 1992), overfitting (Montgomery and Peck, 1992) - but the basic idea is the same. Roughly speaking, for a given learning task, with a given finite amount of training data, the best generalization performance will be achieved if the right balance is struck between the accuracy attained on that particular training set, and the "capacity" of the machine, that is, the ability of the machine to learn any training set without error. A machine with too much capacity is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything she has seen before; a machine with too little capacity is like the botanist's lazy brother, who declares that if it's green, it's a tree. Neither can generalize well. The exploration and formalization of these concepts has resulted in one of the shining peaks of the theory of statistical learning (Vapnik, 1979).
In the following, bold typeface will indicate vector or matrix quantities; normal typeface will be used for vector and matrix components and for scalars. We will label components of vectors and matrices with Greek indices, and label vectors and matrices themselves with Roman indices. Familiarity with the use of Lagrange multipliers to solve problems with equality or inequality constraints is assumed².
formulae). Now it is assumed that there exists some unknown probability distribution $P(x, y)$ from which these data are drawn, i.e., the data are assumed "iid" (independently drawn and identically distributed). (We will use $P$ for cumulative probability distributions, and $p$ for their densities). Note that this assumption is more general than associating a fixed $y$ with every $x$: it allows there to be a distribution of $y$ for a given $x$. In that case, the trusted source would assign labels $y_i$ according to a fixed distribution, conditional on $x_i$. However, after this Section, we will be assuming fixed $y$ for given $x$.
Now suppose we have a machine whose task it is to learn the mapping $x_i \mapsto y_i$. The machine is actually defined by a set of possible mappings $x \mapsto f(x, \alpha)$, where the functions $f(x, \alpha)$ themselves are labeled by the adjustable parameters $\alpha$. The machine is assumed to be deterministic: for a given input $x$, and choice of $\alpha$, it will always give the same output $f(x, \alpha)$. A particular choice of $\alpha$ generates what we will call a "trained machine." Thus, for example, a neural network with fixed architecture, with $\alpha$ corresponding to the weights and biases, is a learning machine in this sense.

The expectation of the test error for a trained machine is therefore:

$$R(\alpha) = \int \frac{1}{2}\,|y - f(x, \alpha)|\; dP(x, y) \tag{1}$$

Note that, when a density $p(x, y)$ exists, $dP(x, y)$ may be written $p(x, y)\,dx\,dy$. This is a nice way of writing the true mean error, but unless we have an estimate of what $P(x, y)$ is, it is not very useful.
The quantity $R(\alpha)$ is called the expected risk, or just the risk. Here we will call it the actual risk, to emphasize that it is the quantity that we are ultimately interested in. The "empirical risk" $R_{emp}(\alpha)$ is defined to be just the measured mean error rate on the training set (for a fixed, finite number of observations)⁴:

$$R_{emp}(\alpha) = \frac{1}{2l}\sum_{i=1}^{l} |y_i - f(x_i, \alpha)|. \tag{2}$$

Note that no probability distribution appears here. $R_{emp}(\alpha)$ is a fixed number for a particular choice of $\alpha$ and for a particular training set $\{x_i, y_i\}$.
The quantity $\frac{1}{2}|y_i - f(x_i, \alpha)|$ is called the loss. For the case described here, it can only take the values 0 and 1. Now choose some $\eta$ such that $0 \le \eta \le 1$. Then for losses taking these values, with probability $1 - \eta$, the following bound holds (Vapnik, 1995):

$$R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h\,(\log(2l/h) + 1) - \log(\eta/4)}{l}} \tag{3}$$

where $h$ is a non-negative integer called the Vapnik Chervonenkis (VC) dimension, and is a measure of the notion of capacity mentioned above. In the following we will call the right hand side of Eq. (3) the "risk bound." We depart here from some previous nomenclature: the authors of (Guyon et al., 1992) call it the "guaranteed risk", but this is something of a misnomer, since it is really a bound on a risk, not a risk, and it holds only with a certain probability, and so is not guaranteed. The second term on the right hand side is called the "VC confidence."
We note three key points about this bound. First, remarkably, it is independent of $P(x, y)$. It assumes only that both the training data and the test data are drawn independently according to some $P(x, y)$. Second, it is usually not possible to compute the left hand side. Third, if we know $h$, we can easily compute the right hand side. Thus given several different learning machines (recall that "learning machine" is just another name for a family of functions $f(x, \alpha)$), and choosing a fixed, sufficiently small $\eta$, by then taking that machine which minimizes the right hand side, we are choosing that machine which gives the lowest upper bound on the actual risk. This gives a principled method for choosing a learning machine for a given task, and is the essential idea of structural risk minimization (see Section 2.6). Given a fixed family of learning machines to choose from, to the extent that the bound is tight for at least one of the machines, one will not be able to do better than this. To the extent that the bound is not tight for any, the hope is that the right hand side still gives useful information as to which learning machine minimizes the actual risk. The bound not being tight for the whole chosen family of learning machines gives critics a justifiable target at which to fire their complaints. At present, for this case, we must rely on experiment to be the judge.
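Since the right hand side of Eq. (3) is central to everything that follows, a small computational sketch may help fix ideas. The following Python fragment (my illustration, not part of the original paper; the function names are arbitrary) evaluates the VC confidence and the risk bound:

```python
import math

def vc_confidence(h, l, eta):
    """VC confidence term of Eq. (3): h = VC dimension, l = sample size,
    eta = confidence parameter (the bound holds with probability 1 - eta)."""
    return math.sqrt((h * (math.log(2.0 * l / h) + 1.0) - math.log(eta / 4.0)) / l)

def risk_bound(remp, h, l, eta=0.05):
    """Right hand side of Eq. (3): empirical risk plus VC confidence."""
    return remp + vc_confidence(h, l, eta)

# With zero empirical risk, l = 10,000 and eta = 0.05, the bound grows with h:
for h in (100, 500, 2000):
    print(h, risk_bound(0.0, h, 10000))
```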
The VC dimension is a property of a set of functions $\{f(\alpha)\}$ (again, we use $\alpha$ as a generic set of parameters: a choice of $\alpha$ specifies a particular function), and can be defined for various classes of function $f$. Here we will only consider functions that correspond to the two-class pattern recognition case, so that $f(x, \alpha) \in \{-1, 1\}\ \forall x, \alpha$. Now if a given set of $l$ points can be labeled in all possible $2^l$ ways, and for each labeling, a member of the set $\{f(\alpha)\}$ can be found which correctly assigns those labels, we say that that set of points is shattered by that set of functions. The VC dimension for the set of functions $\{f(\alpha)\}$ is defined as the maximum number of training points that can be shattered by $\{f(\alpha)\}$. Note that, if the VC dimension is $h$, then there exists at least one set of $h$ points that can be shattered, but in general it will not be true that every set of $h$ points can be shattered.
Theorem 1 Consider some set of $m$ points in $R^n$. Choose any one of the points as origin. Then the $m$ points can be shattered by oriented hyperplanes⁵ if and only if the position vectors of the remaining points are linearly independent⁶.
The VC dimension thus gives concreteness to the notion of the capacity of a given set of functions. Intuitively, one might be led to expect that learning machines with many parameters would have high VC dimension, while learning machines with few parameters would have low VC dimension. There is a striking counterexample to this, due to E. Levin and J.S. Denker (Vapnik, 1995): a learning machine with just one parameter, but with infinite VC dimension (a family of classifiers is said to have infinite VC dimension if it can shatter $l$ points, no matter how large $l$). Define the step function $\theta(x)$, $x \in R$: $\{\theta(x) = 1\ \forall x > 0;\ \theta(x) = -1\ \forall x \le 0\}$. Consider the one-parameter family of functions, defined by

$$f(x, \alpha) \equiv \theta(\sin(\alpha x)),\quad x, \alpha \in R. \tag{4}$$

You choose some number $l$, and present me with the task of finding $l$ points that can be shattered. I choose them to be:

$$x_i = 10^{-i},\quad i = 1, \dots, l. \tag{5}$$

You specify any labels you like:

$$y_1, y_2, \dots, y_l,\quad y_i \in \{-1, 1\}. \tag{6}$$

Then $f(\alpha)$ gives this labeling if I choose $\alpha$ to be

$$\alpha = \pi\left(1 + \sum_{i=1}^{l} \frac{(1 - y_i)\,10^i}{2}\right). \tag{7}$$

Thus the VC dimension of this machine is infinite.
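The construction is easy to try out. Here is a minimal sketch (my own, not from the paper) that shatters $l = 8$ points with one parameter, following Eqs. (4)-(7); the labels are an arbitrary example:

```python
import math

def f(x, alpha):
    """The one-parameter classifier of Eq. (4): theta(sin(alpha * x))."""
    return 1 if math.sin(alpha * x) > 0 else -1

l = 8
xs = [10.0 ** (-i) for i in range(1, l + 1)]         # Eq. (5)
ys = [1, -1, -1, 1, -1, 1, 1, -1]                    # any labels you like, Eq. (6)
alpha = math.pi * (1 + sum((1 - y) * 10 ** i / 2     # Eq. (7)
                           for i, y in enumerate(ys, start=1)))
print([f(x, alpha) for x in xs] == ys)               # True: the labeling is realized
```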
Interestingly, even though we can shatter an arbitrarily large number of points, we can also find just four points that cannot be shattered. They simply have to be equally spaced, and assigned labels as shown in Figure 2. This can be seen as follows: Write the phase at $x_1$ as $\phi_1 = 2n\pi + \delta$. Then the choice of label $y_1 = 1$ requires $0 < \delta < \pi$. The phase at $x_2$, mod $2\pi$, is $2\delta$; then $y_2 = 1 \Rightarrow 0 < \delta < \pi/2$. Similarly, point $x_3$ forces $\delta > \pi/3$. Then at $x_4$, $\pi/3 < \delta < \pi/2$ implies that $f(x_4, \alpha) = -1$, contrary to the assigned label. These four points are the analogy, for the set of functions in Eq. (4), of the set of three points lying along a line, for oriented hyperplanes in $R^n$. Neither set can be shattered by the chosen family of functions.

Figure 2. Four points that cannot be shattered by $\theta(\sin(\alpha x))$, despite infinite VC dimension.
Figure 3. VC confidence as a function of $h/l$ (VC Dimension / Sample Size).
Figure 3 shows how the second term on the right hand side of Eq. (3) varies with $h$, given a choice of 95% confidence level ($\eta = 0.05$) and assuming a training sample of size 10,000. The VC confidence is a monotonic increasing function of $h$. This will be true for any value of $l$.
Thus, given some selection of learning machines whose empirical risk is zero, one wants to choose that learning machine whose associated set of functions has minimal VC dimension. This will lead to a better upper bound on the actual error. In general, for non zero empirical risk, one wants to choose that learning machine which minimizes the right hand side of Eq. (3).

Note that in adopting this strategy, we are only using Eq. (3) as a guide. Eq. (3) gives (with some chosen probability) an upper bound on the actual risk. This does not prevent a particular machine with the same value for empirical risk, and whose function set has higher VC dimension, from having better performance. In fact an example of a system that gives good performance despite having infinite VC dimension is given in the next Section. Note also that the graph shows that for $h/l > 0.37$ (and for $\eta = 0.05$ and $l = 10{,}000$), the VC confidence exceeds unity, and so for higher values the bound is guaranteed not tight.
2.5. Two Examples

Consider the k'th nearest neighbour classifier, with $k = 1$. This set of functions has infinite VC dimension and zero empirical risk, since any number of points, labeled arbitrarily, will be successfully learned by the algorithm (provided no two points of opposite class lie right on top of each other). Thus the bound provides no information. In fact, for any classifier with infinite VC dimension, the bound is not even valid⁷. However, even though the bound is not valid, nearest neighbour classifiers can still perform well. Thus this first example is a cautionary tale: infinite "capacity" does not guarantee poor performance.
Let's follow the time honoured tradition of understanding things by trying to break them, and see if we can come up with a classifier for which the bound is supposed to hold, but which violates the bound. We want the left hand side of Eq. (3) to be as large as possible, and the right hand side to be as small as possible. So we want a family of classifiers which gives the worst possible actual risk of 0.5, zero empirical risk up to some number of training observations, and whose VC dimension is easy to compute and is less than $l$ (so that the bound is non trivial). An example is the following, which I call the "notebook classifier." This classifier consists of a notebook with enough room to write down the classes of $m$ training observations, where $m \le l$. For all subsequent patterns, the classifier simply says that all patterns have the same class. Suppose also that the data have as many positive ($y = +1$) as negative ($y = -1$) examples, and that the samples are chosen randomly. The notebook classifier will have zero empirical risk for up to $m$ observations; 0.5 training error for all subsequent observations; 0.5 actual error, and VC dimension $h = m$. Substituting these values in Eq. (3), the bound becomes:

$$\frac{m}{4l} \le \ln(2l/m) + 1 - (1/m)\ln(\eta/4) \tag{8}$$

which is certainly met for all $\eta$ if

$$f(z) = \left(\frac{z}{2}\right)\exp^{(z/4 - 1)} \le 1,\quad z \equiv (m/l),\ 0 \le z \le 1 \tag{9}$$

which is true, since $f(z)$ is monotonic increasing, and $f(z = 1) = 0.236$.
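The closing claim is easy to confirm numerically; a two-line check (mine, for illustration only):

```python
import math

# f(z) = (z/2) exp(z/4 - 1) from Eq. (9): monotonic increasing on [0, 1],
# with f(1) ~= 0.236 < 1, so the notebook classifier never violates Eq. (3).
f = lambda z: (z / 2) * math.exp(z / 4 - 1)
zs = [i / 100 for i in range(101)]
print(all(f(a) <= f(b) for a, b in zip(zs, zs[1:])), round(f(1.0), 3))  # True 0.236
```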
2.6. Structural Risk Minimization

We can now summarize the principle of structural risk minimization (SRM) (Vapnik, 1979). Note that the VC confidence term in Eq. (3) depends on the chosen class of functions, whereas the empirical risk and actual risk depend on the one particular function chosen by the training procedure. We would like to find that subset of the chosen set of functions, such that the risk bound for that subset is minimized. Clearly we cannot arrange things so that the VC dimension $h$ varies smoothly, since it is an integer. Instead, introduce a "structure" by dividing the entire class of functions into nested subsets (Figure 4). For each subset, we must be able either to compute $h$, or to get a bound on $h$ itself. SRM then consists of finding that subset of functions which minimizes the bound on the actual risk. This can be done by simply training a series of machines, one for each subset, where for a given subset the goal of training is simply to minimize the empirical risk. One then takes that trained machine in the series whose sum of empirical risk and VC confidence is minimal.
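In code, the SRM recipe is a short selection loop. The sketch below is mine, not the paper's: it reuses the vc_confidence function from the earlier snippet, and assumes a hypothetical list of (train_fn, h) pairs, one per nested subset, where each train_fn trains on the data and returns the resulting empirical risk:

```python
def srm_select(machines, data, l, eta=0.05):
    """Train one machine per nested subset; keep the one whose empirical
    risk + VC confidence (the right hand side of Eq. (3)) is minimal."""
    def bound(train_fn, h):
        remp = train_fn(data)               # within a subset, just minimize R_emp
        return remp + vc_confidence(h, l, eta)
    return min(machines, key=lambda m: bound(*m))
```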
We have now laid the groundwork necessary to begin our exploration of support vector machines.
3. Linear Support Vector Machines

3.1. The Separable Case

We will start with the simplest case: linear machines trained on separable data (as we shall see, the analysis for the general case - nonlinear machines trained on non-separable data - results in a very similar quadratic programming problem). Again label the training data $\{x_i, y_i\}$, $i = 1, \dots, l$, $y_i \in \{-1, 1\}$, $x_i \in R^d$. Suppose we have some hyperplane which separates the positive from the negative examples (a "separating hyperplane"). The points $x$ which lie on the hyperplane satisfy $w \cdot x + b = 0$, where $w$ is normal to the hyperplane, $|b|/\|w\|$ is the perpendicular distance from the hyperplane to the origin, and $\|w\|$ is the Euclidean norm of $w$. Let $d_+$ ($d_-$) be the shortest distance from the separating hyperplane to the closest positive (negative) example. Define the "margin" of a separating hyperplane to be $d_+ + d_-$. For the linearly separable case, the support vector algorithm simply looks for the separating hyperplane with largest margin. This can be formulated as follows: suppose that all the training data satisfy the following constraints:

$$x_i \cdot w + b \ge +1 \quad \text{for } y_i = +1 \tag{10}$$

$$x_i \cdot w + b \le -1 \quad \text{for } y_i = -1 \tag{11}$$

These can be combined into one set of inequalities:

$$y_i(x_i \cdot w + b) - 1 \ge 0 \quad \forall i \tag{12}$$
Now consider the points for which the equality in Eq. (10) holds (requiring that there exists such a point is equivalent to choosing a scale for $w$ and $b$). These points lie on the hyperplane $H_1$: $x_i \cdot w + b = 1$ with normal $w$ and perpendicular distance from the origin $|1 - b|/\|w\|$. Similarly, the points for which the equality in Eq. (11) holds lie on the hyperplane $H_2$: $x_i \cdot w + b = -1$, with normal again $w$, and perpendicular distance from the origin $|-1 - b|/\|w\|$. Hence $d_+ = d_- = 1/\|w\|$ and the margin is simply $2/\|w\|$. Note that $H_1$ and $H_2$ are parallel (they have the same normal) and that no training points fall between them. Thus we can find the pair of hyperplanes which gives the maximum margin by minimizing $\|w\|^2$, subject to constraints (12).
Thus we expect the solution for a typical two dimensional case to have the form shown in Figure 5. Those training points for which the equality in Eq. (12) holds (i.e. those which wind up lying on one of the hyperplanes $H_1$, $H_2$), and whose removal would change the solution found, are called support vectors; they are indicated in Figure 5 by the extra circles.
We will now switch to a Lagrangian formulation of the problem. There are two reasons for doing this. The first is that the constraints (12) will be replaced by constraints on the Lagrange multipliers themselves, which will be much easier to handle. The second is that in this reformulation of the problem, the training data will only appear (in the actual training and test algorithms) in the form of dot products between vectors. This is a crucial property which will allow us to generalize the procedure to the nonlinear case (Section 4).
Figure 5. Linear separating hyperplanes for the separable case. The support vectors are circled.
Note that we have now given the Lagrangian different labels ($P$ for primal, $D$ for dual) to emphasize that the two formulations are different: $L_P$ and $L_D$ arise from the same objective function but with different constraints; and the solution is found by minimizing $L_P$ or by maximizing $L_D$. Note also that if we formulate the problem with $b = 0$, which amounts to requiring that all hyperplanes contain the origin, the constraint (15) does not appear. This is a mild restriction for high dimensional spaces, since it amounts to reducing the number of degrees of freedom by one.
Support vector training (for the separable, linear case) therefore amounts to maximizing $L_D$ with respect to the $\alpha_i$, subject to constraints (15) and positivity of the $\alpha_i$, with solution given by (14). Notice that there is a Lagrange multiplier $\alpha_i$ for every training point. In the solution, those points for which $\alpha_i > 0$ are called "support vectors", and lie on one of the hyperplanes $H_1$, $H_2$. All other training points have $\alpha_i = 0$ and lie either on $H_1$ or $H_2$ (such that the equality in Eq. (12) holds), or on that side of $H_1$ or $H_2$ such that the strict inequality in Eq. (12) holds. For these machines, the support vectors are the critical elements of the training set. They lie closest to the decision boundary; if all other training points were removed (or moved around, but so as not to cross $H_1$ or $H_2$), and training was repeated, the same separating hyperplane would be found.
The Karush-Kuhn-Tucker (KKT) conditions play a central role in both the theory and practice of constrained optimization. For the primal problem above, the KKT conditions may be stated (Fletcher, 1987):

$$\frac{\partial}{\partial w_\nu} L_P = w_\nu - \sum_i \alpha_i y_i x_{i\nu} = 0 \quad \nu = 1, \dots, d \tag{17}$$

$$\frac{\partial}{\partial b} L_P = -\sum_i \alpha_i y_i = 0 \tag{18}$$

$$y_i(x_i \cdot w + b) - 1 \ge 0 \quad i = 1, \dots, l \tag{19}$$

$$\alpha_i \ge 0 \quad \forall i \tag{20}$$

$$\alpha_i\,(y_i(w \cdot x_i + b) - 1) = 0 \quad \forall i \tag{21}$$
The KKT conditions are satisfied at the solution of any constrained optimization problem (convex or not), with any kind of constraints, provided that the intersection of the set of feasible directions with the set of descent directions coincides with the intersection of the set of feasible directions for linearized constraints with the set of descent directions (see Fletcher, 1987; McCormick, 1983). This rather technical regularity assumption holds for all support vector machines, since the constraints are always linear. Furthermore, the problem for SVMs is convex (a convex objective function, with constraints which give a convex feasible region), and for convex problems (if the regularity condition holds), the KKT conditions are necessary and sufficient for $w, b, \alpha$ to be a solution (Fletcher, 1987). Thus solving the SVM problem is equivalent to finding a solution to the KKT conditions. This fact results in several approaches to finding the solution (for example, the primal-dual path following method mentioned in Section 5).
As an immediate application, note that, while $w$ is explicitly determined by the training procedure, the threshold $b$ is not, although it is implicitly determined. However $b$ is easily found by using the KKT "complementarity" condition, Eq. (21), by choosing any $i$ for which $\alpha_i \ne 0$ and computing $b$ (note that it is numerically safer to take the mean value of $b$ resulting from all such equations).
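As a concrete illustration (a minimal numpy sketch of my own, assuming hypothetical arrays alpha, y, X holding a dual solution for $l$ points in $d$ dimensions), $w$ follows from Eq. (17) and $b$ from averaging Eq. (21) over the support vectors:

```python
import numpy as np

def recover_w_b(alpha, y, X, tol=1e-8):
    """w = sum_i alpha_i y_i x_i; b averaged over all support vectors."""
    w = (alpha * y) @ X                 # Eq. (17) rearranged
    sv = alpha > tol                    # support vectors have alpha_i > 0
    b_values = y[sv] - X[sv] @ w        # from y_i (w.x_i + b) = 1 and y_i^2 = 1
    return w, b_values.mean()           # numerically safer than using one point
```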
Notice that all we've done so far is to cast the problem into an optimization problem where the constraints are rather more manageable than those in Eqs. (10), (11). Finding the solution for real world problems will usually require numerical methods. We will have more to say on this later. However, let's first work out a rare case where the problem is nontrivial (the number of dimensions is arbitrary, and the solution certainly not obvious), but where the solution can be found analytically.
3.3. Optimal Hyperplanes: An Example

While the main aim of this Section is to explore a non-trivial pattern recognition problem where the support vector solution can be found analytically, the results derived here will also be useful in a later proof. For the problem considered, every training point will turn out to be a support vector, which is one reason we can find the solution analytically.

Consider $n + 1$ symmetrically placed points lying on a sphere $S^{n-1}$ of radius $R$: more precisely, the points form the vertices of an $n$-dimensional symmetric simplex. It is convenient to embed the points in $R^{n+1}$ in such a way that they all lie in the hyperplane which passes through the origin and which is perpendicular to the $(n+1)$-vector $(1, 1, \dots, 1)$ (in this formulation, the points lie on $S^{n-1}$, they span $R^n$, and are embedded in $R^{n+1}$). Explicitly, recalling that vectors themselves are labeled by Roman indices and their coordinates by Greek, the coordinates are given by:

$$x_{i,\mu} = -(1 - \delta_{i,\mu})\sqrt{\frac{R^2}{n(n+1)}} + \delta_{i,\mu}\sqrt{\frac{R^2\,n}{n+1}} \tag{22}$$
where the Kronecker delta, $\delta_{i,\mu}$, is defined by $\delta_{i,\mu} = 1$ if $\mu = i$, 0 otherwise. Thus, for example, the vectors for three equidistant points on the unit circle (see Figure 12) are:

$$x_1 = \left(\sqrt{\tfrac{2}{3}},\ -\tfrac{1}{\sqrt{6}},\ -\tfrac{1}{\sqrt{6}}\right)$$

$$x_2 = \left(-\tfrac{1}{\sqrt{6}},\ \sqrt{\tfrac{2}{3}},\ -\tfrac{1}{\sqrt{6}}\right)$$

$$x_3 = \left(-\tfrac{1}{\sqrt{6}},\ -\tfrac{1}{\sqrt{6}},\ \sqrt{\tfrac{2}{3}}\right) \tag{23}$$
One consequence of the symmetry is that the angle between any pair of vectors is the same (and is equal to $\arccos(-1/n)$):

$$\|x_i\|^2 = R^2 \tag{24}$$

$$x_i \cdot x_j = -R^2/n \quad (i \ne j) \tag{25}$$

or, more succinctly,

$$\frac{x_i \cdot x_j}{R^2} = \delta_{i,j} - (1 - \delta_{i,j})\,\frac{1}{n}. \tag{26}$$
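A quick numerical check of this construction may reassure the reader; the sketch below (mine, with arbitrary $n$ and $R$) builds the simplex of Eq. (22) and verifies Eq. (26) and the embedding hyperplane:

```python
import numpy as np

n, R = 4, 2.0
delta = np.eye(n + 1)                                   # delta[i, mu]
X = (-np.sqrt(R**2 / (n * (n + 1))) * (1 - delta)
     + np.sqrt(R**2 * n / (n + 1)) * delta)             # rows are the x_i of Eq. (22)
target = R**2 * (np.eye(n + 1) - (1 - np.eye(n + 1)) / n)
print(np.allclose(X @ X.T, target))                     # True: Eqs. (24)-(26)
print(np.allclose(X.sum(axis=1), 0))                    # True: perpendicular to (1,...,1)
```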
Assigning a class label $C \in \{+1, -1\}$ arbitrarily to each point, we wish to find that hyperplane which separates the two classes with widest margin. Thus we must maximize $L_D$ in Eq. (16), subject to $\alpha_i \ge 0$ and also subject to the equality constraint, Eq. (15).
Our strategy is to simply solve the problem as though there were no inequality constraints. If the resulting solution does in fact satisfy $\alpha_i \ge 0\ \forall i$, then we will have found the general solution, since the actual maximum of $L_D$ will then lie in the feasible region, provided the equality constraint, Eq. (15), is also met. In order to impose the equality constraint we introduce an additional Lagrange multiplier $\lambda$. Thus we seek to maximize

$$L_D \equiv \sum_{i=1}^{n+1} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n+1} \alpha_i H_{ij} \alpha_j - \lambda\sum_{i=1}^{n+1} \alpha_i y_i, \tag{27}$$

where we have introduced the Hessian

$$H_{ij} \equiv y_i y_j\, x_i \cdot x_j. \tag{28}$$
Setting $\partial L_D / \partial \alpha_i = 0$ gives

$$(H\alpha)_i + \lambda y_i = 1 \quad \forall i \tag{29}$$

Now $H$ has a very simple structure: the off-diagonal elements are $-y_i y_j R^2/n$, and the diagonal elements are $R^2$. The fact that all the off-diagonal elements differ only by factors of $y_i$ suggests looking for a solution which has the form:

$$\alpha_i = \left(\frac{1 + y_i}{2}\right)a + \left(\frac{1 - y_i}{2}\right)b \tag{30}$$

where $a$ and $b$ are unknowns. Plugging this form in Eq. (29) gives:

$$\left(\frac{n+1}{n}\right)\frac{a+b}{2} - \frac{y_i p}{n}\,\frac{a+b}{2} = \frac{1 - \lambda y_i}{R^2} \tag{31}$$

where $p$ is defined by

$$p \equiv \sum_{i=1}^{n+1} y_i. \tag{32}$$
Thus

$$a + b = \frac{2n}{R^2(n+1)} \tag{33}$$

and substituting this into the equality constraint Eq. (15) to find $a$, $b$ gives

$$a = \frac{n}{R^2(n+1)}\left(1 - \frac{p}{n+1}\right), \qquad b = \frac{n}{R^2(n+1)}\left(1 + \frac{p}{n+1}\right) \tag{34}$$

which gives for the solution

$$\alpha_i = \frac{n}{R^2(n+1)}\left(1 - \frac{y_i p}{n+1}\right) \tag{35}$$

Also,

$$(H\alpha)_i = 1 - \frac{y_i p}{n+1}. \tag{36}$$
Hence

$$\|w\|^2 = \sum_{i,j=1}^{n+1} \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j = \alpha^T H \alpha = \sum_{i=1}^{n+1} \alpha_i\left(1 - \frac{y_i p}{n+1}\right) = \sum_{i=1}^{n+1} \alpha_i = \frac{n}{R^2}\left(1 - \left(\frac{p}{n+1}\right)^2\right) \tag{37}$$
Note that this is one of those cases where the Lagrange multiplier $\lambda$ can remain undetermined (although determining it is trivial). We have now solved the problem, since all the $\alpha_i$ are clearly positive or zero (in fact the $\alpha_i$ will only be zero if all training points have the same class). Note that $\|w\|$ depends only on the number of positive (negative) polarity points, and not on how the class labels are assigned to the points in Eq. (22). This is clearly not true of $w$ itself, which is given by

$$w = \frac{n}{R^2(n+1)}\sum_{i=1}^{n+1}\left(y_i - \frac{p}{n+1}\right)x_i \tag{38}$$

The margin, $M = 2/\|w\|$, is thus given by

$$M = \frac{2R}{\sqrt{n\,\left(1 - (p/(n+1))^2\right)}}. \tag{39}$$
Thus when the number of points $n + 1$ is even, the minimum margin occurs when $p = 0$ (equal numbers of positive and negative examples), in which case the margin is $M_{min} = 2R/\sqrt{n}$. If $n + 1$ is odd, the minimum margin occurs when $p = \pm 1$, in which case $M_{min} = 2R(n+1)/(n\sqrt{n+2})$. In both cases, the maximum margin is given by $M_{max} = R(n+1)/n$. Thus, for example, for the two dimensional simplex consisting of three points lying on $S^1$ (and spanning $R^2$), and with labeling such that not all three points have the same polarity, the maximum and minimum margin are both $3R/2$ (see Figure (12)).

Note that the results of this Section amount to an alternative, constructive proof that the VC dimension of oriented separating hyperplanes in $R^n$ is at least $n + 1$.
3.4. Test Phase

Once we have trained a Support Vector Machine, how can we use it? We simply determine on which side of the decision boundary (that hyperplane lying half way between $H_1$ and $H_2$ and parallel to them) a given test pattern $x$ lies and assign the corresponding class label, i.e. we take the class of $x$ to be $\mathrm{sgn}(w \cdot x + b)$.
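In code, the test phase is one line (continuing the earlier numpy sketch, with $w$ and $b$ the quantities recovered from training; X_test is a hypothetical array of test patterns, one per row):

```python
def predict(X_test, w, b):
    # assign to each test pattern the class sgn(w . x + b)
    return np.sign(X_test @ w + b)
```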
3.5. The Non-Separable Case

The above algorithm for separable data, when applied to non-separable data, will find no feasible solution: this will be evidenced by the objective function (i.e. the dual Lagrangian) growing arbitrarily large. So how can we extend these ideas to handle non-separable data? We would like to relax the constraints (10) and (11), but only when necessary, that is, we would like to introduce a further cost (i.e. an increase in the primal objective function) for doing so. This can be done by introducing positive slack variables $\xi_i$, $i = 1, \dots, l$ in the constraints (Cortes and Vapnik, 1995), which then become:

$$x_i \cdot w + b \ge +1 - \xi_i \quad \text{for } y_i = +1 \tag{40}$$

$$x_i \cdot w + b \le -1 + \xi_i \quad \text{for } y_i = -1 \tag{41}$$

$$\xi_i \ge 0 \quad \forall i. \tag{42}$$
Thus, for an error to occur, the corresponding $\xi_i$ must exceed unity, so $\sum_i \xi_i$ is an upper bound on the number of training errors. Hence a natural way to assign an extra cost for errors is to change the objective function to be minimized from $\|w\|^2/2$ to $\|w\|^2/2 + C\left(\sum_i \xi_i\right)^k$, where $C$ is a parameter to be chosen by the user, a larger $C$ corresponding to assigning a higher penalty to errors. As it stands, this is a convex programming problem for any positive integer $k$; for $k = 2$ and $k = 1$ it is also a quadratic programming problem, and the choice $k = 1$ has the further advantage that neither the $\xi_i$, nor their Lagrange multipliers, appear in the Wolfe dual problem, which becomes:

Maximize:

$$L_D \equiv \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j \tag{43}$$

subject to:

$$0 \le \alpha_i \le C, \tag{44}$$

$$\sum_i \alpha_i y_i = 0. \tag{45}$$
The solution is again given by

$$w = \sum_{i=1}^{N_S} \alpha_i y_i x_i. \tag{46}$$

where $N_S$ is the number of support vectors. Thus the only difference from the optimal hyperplane case is that the $\alpha_i$ now have an upper bound of $C$. The situation is summarized schematically in Figure 6.
We will need the Karush-Kuhn-Tucker conditions for the primal problem. The primal Lagrangian is

$$L_P = \frac{1}{2}\|w\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\{y_i(x_i \cdot w + b) - 1 + \xi_i\} - \sum_i \mu_i \xi_i \tag{47}$$

where the $\mu_i$ are the Lagrange multipliers introduced to enforce positivity of the $\xi_i$. The KKT conditions for the primal problem are therefore (note $i$ runs from 1 to the number of training points, and $\nu$ from 1 to the dimension of the data)

$$\frac{\partial L_P}{\partial w_\nu} = w_\nu - \sum_i \alpha_i y_i x_{i\nu} = 0 \tag{48}$$

$$\frac{\partial L_P}{\partial b} = -\sum_i \alpha_i y_i = 0 \tag{49}$$

$$\frac{\partial L_P}{\partial \xi_i} = C - \alpha_i - \mu_i = 0 \tag{50}$$

$$y_i(x_i \cdot w + b) - 1 + \xi_i \ge 0 \tag{51}$$

$$\xi_i \ge 0 \tag{52}$$

$$\alpha_i \ge 0 \tag{53}$$

$$\mu_i \ge 0 \tag{54}$$

$$\alpha_i\{y_i(x_i \cdot w + b) - 1 + \xi_i\} = 0 \tag{55}$$

$$\mu_i \xi_i = 0 \tag{56}$$
As before, we can use the KKT complementarity conditions, Eqs. (55) and (56), to determine the threshold $b$. Note that Eq. (50) combined with Eq. (56) shows that $\xi_i = 0$ if $\alpha_i < C$. Thus we can simply take any training point for which $0 < \alpha_i < C$ to use Eq. (55) (with $\xi_i = 0$) to compute $b$. (As before, it is numerically wiser to take the average over all such training points.)
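The earlier recovery sketch changes in only one place for the soft-margin machine: $b$ is averaged only over the on-margin support vectors, those with $0 < \alpha_i < C$ (same hypothetical alpha, y, X arrays as before):

```python
def recover_b_soft(alpha, y, X, w, C, tol=1e-8):
    """Average b over points with 0 < alpha_i < C, where xi_i = 0 in Eq. (55)."""
    margin_sv = (alpha > tol) & (alpha < C - tol)
    return (y[margin_sv] - X[margin_sv] @ w).mean()
```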
Figure 6. Linear separating hyperplanes for the non-separable case (the labels $-b/\|w\|$ and $-\xi/\|w\|$ in the figure mark perpendicular distances).
Consider the case in which the data are in $R^2$. Suppose that the i'th support vector exerts a force $F_i = \alpha_i y_i \hat{w}$ on a stiff sheet lying along the decision surface (the "decision sheet") (here $\hat{w}$ denotes the unit vector in the direction $w$). Then the solution (46) satisfies the conditions of mechanical equilibrium:

$$\sum \text{Forces} = \sum_i \alpha_i y_i \hat{w} = 0 \tag{57}$$

$$\sum \text{Torques} = \sum_i s_i \wedge (\alpha_i y_i \hat{w}) = w \wedge \hat{w} = 0. \tag{58}$$

(Here the $s_i$ are the support vectors, and $\wedge$ denotes the vector product.) For data in $R^n$, clearly the condition that the sum of forces vanish is still met. One can easily show that the torque also vanishes.⁹
This mechanical analogy depends only on the form of the solution (46), and therefore holds for both the separable and the non-separable cases. In fact this analogy holds in general (i.e., also for the nonlinear case described below). The analogy emphasizes the interesting point that the "most important" data points are the support vectors with highest values of $\alpha$, since they exert the highest forces on the decision sheet. For the non-separable case, the upper bound $\alpha_i \le C$ corresponds to an upper bound on the force any given point is allowed to exert on the sheet. This analogy also provides a reason (as good as any other) to call these particular vectors "support vectors"¹⁰.
Figure 7 shows two examples of a two-class pattern recognition problem, one separable and one not. The two classes are denoted by circles and disks respectively. Support vectors are identified with an extra circle. The error in the non-separable case is identified with a cross. The reader is invited to use Lucent's SVM Applet (Burges, Knirsch and Haratsch, 1996) to experiment and create pictures like these (if possible, try using 16 or 24 bit colour).
Figure 7. The linear case, separable (left) and not (right). The background colour shows the shape of the decision surface.
or $\mathcal{H}$ to be $R^4$ and

$$\Phi(x) = \begin{pmatrix} x_1^2 \\ x_1 x_2 \\ x_1 x_2 \\ x_2^2 \end{pmatrix}. \tag{64}$$
The literature on SVMs usually refers to the space $\mathcal{H}$ as a Hilbert space, so let's end this Section with a few notes on this point. You can think of a Hilbert space as a generalization of Euclidean space that behaves in a gentlemanly fashion. Specifically, it is any linear space, with an inner product defined, which is also complete with respect to the corresponding norm (that is, any Cauchy sequence of points converges to a point in the space). Some authors (e.g. (Kolmogorov, 1970)) also require that it be separable (that is, it must have a countable subset whose closure is the space itself), and some (e.g. Halmos, 1967) don't. It's a generalization mainly because its inner product can be any inner product, not just the scalar ("dot") product used here (and in Euclidean spaces in general). It's interesting that the older mathematical literature (e.g. Kolmogorov, 1970) also required that Hilbert spaces be infinite dimensional, and that mathematicians are quite happy defining infinite dimensional Euclidean spaces. Research on Hilbert spaces centers on operators in those spaces, since the basic properties have long since been worked out. Since some people understandably blanch at the mention of Hilbert spaces, I decided to use the term Euclidean throughout this tutorial.
4.1. Mercer's Condition

For which kernels does there exist a pair $\{\mathcal{H}, \Phi\}$, with the properties described above, and for which does there not? The answer is given by Mercer's condition (Vapnik, 1995; Courant and Hilbert, 1953): There exists a mapping $\Phi$ and an expansion

$$K(\mathbf{x}, \mathbf{y}) = \sum_i \Phi(\mathbf{x})_i\, \Phi(\mathbf{y})_i \tag{65}$$
set of all hyperplanes $\{w, b\}$ are parameterized by $\dim(\mathcal{H}) + 1$ numbers. Most pattern recognition systems with billions, or even an infinite, number of parameters would not make it past the start gate. How come SVMs do so well? One might argue that, given the form of solution, there are at most $l + 1$ adjustable parameters (where $l$ is the number of training samples), but this seems to be begging the question¹³. It must be something to do with our requirement of maximum margin hyperplanes that is saving the day. As we shall see below, a strong case can be made for this claim.
Since the mapped surface is of intrinsic dimension $\dim(\mathcal{L})$, unless $\dim(\mathcal{L}) = \dim(\mathcal{H})$, it is obvious that the mapping cannot be onto (surjective). It also need not be one to one (bijective): consider $x_1 \to -x_1$, $x_2 \to -x_2$ in Eq. (62). The image of $\Phi$ need not itself be a vector space: again, considering the above simple quadratic example, the vector $-\Phi(x)$ is not in the image of $\Phi$ unless $x = 0$. Further, a little playing with the inhomogeneous kernel

$$K(x_i, x_j) = (x_i \cdot x_j + 1)^2 \tag{71}$$

will convince you that the corresponding $\Phi$ can map two vectors that are linearly dependent in $\mathcal{L}$ onto two vectors that are linearly independent in $\mathcal{H}$.
So far we have considered cases where $\Phi$ is done implicitly. One can equally well turn things around and start with $\Phi$, and then construct the corresponding kernel. For example (Vapnik, 1996), if $\mathcal{L} = R^1$, then a Fourier expansion in the data $x$, cut off after $N$ terms, has the form

$$f(x) = \frac{a_0}{2} + \sum_{r=1}^{N}\left(a_{1r}\cos(rx) + a_{2r}\sin(rx)\right) \tag{72}$$

and this can be viewed as a dot product between two vectors in $R^{2N+1}$: $\mathbf{a} = (\frac{a_0}{\sqrt{2}}, a_{11}, \dots, a_{21}, \dots)$, and the mapped $\Phi(x) = (\frac{1}{\sqrt{2}}, \cos(x), \cos(2x), \dots, \sin(x), \sin(2x), \dots)$. Then the corresponding (Dirichlet) kernel can be computed in closed form:

$$\Phi(x_i) \cdot \Phi(x_j) = \frac{1}{2} + \sum_{r=1}^{N}\cos(r x_i)\cos(r x_j) + \sin(r x_i)\sin(r x_j)$$

$$= -\frac{1}{2} + \sum_{r=0}^{N}\cos(r\delta) = -\frac{1}{2} + \mathrm{Re}\left\{\sum_{r=0}^{N} e^{ir\delta}\right\}$$

$$= -\frac{1}{2} + \mathrm{Re}\{(1 - e^{i(N+1)\delta})/(1 - e^{i\delta})\}$$

$$= \frac{\sin((N + 1/2)\delta)}{2\sin(\delta/2)}, \tag{73}$$

where $\delta \equiv (x_i - x_j)$.
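The closed form is easy to check against the explicit dot product; a small numerical sketch (my illustration, with arbitrary $N$, $x_i$, $x_j$):

```python
import numpy as np

def phi(x, N):
    """The explicit mapping into R^(2N+1) described above."""
    r = np.arange(1, N + 1)
    return np.concatenate(([1 / np.sqrt(2)], np.cos(r * x), np.sin(r * x)))

def dirichlet_kernel(xi, xj, N):
    """Closed form of Eq. (73)."""
    d = xi - xj
    return np.sin((N + 0.5) * d) / (2 * np.sin(d / 2))

N, xi, xj = 5, 0.8, 0.3
print(np.isclose(phi(xi, N) @ phi(xj, N), dirichlet_kernel(xi, xj, N)))  # True
```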
Finally, it is clear that the above implicit mapping trick will work for any algorithm in which the data only appear as dot products (for example, the nearest neighbor algorithm). This fact has been used to derive a nonlinear version of principal component analysis by (Scholkopf, Smola and Muller, 1998b); it seems likely that this trick will continue to find uses elsewhere.
The first kernels investigated for the pattern recognition problem were the following:

$$K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} + 1)^p \tag{74}$$

Figure 9. Degree 3 polynomial kernel. The background colour shows the shape of the decision surface.
Finally, note that although the SVM classifiers described above are binary classifiers, they are easily combined to handle the multiclass case. A simple, effective combination trains $N$ one-versus-rest classifiers (say, "one" positive, "rest" negative) for the $N$-class case and takes the class for a test point to be that corresponding to the largest positive distance (Boser, Guyon and Vapnik, 1992).
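A minimal sketch of that combination follows (mine, not the paper's; svm_train and svm_decision are hypothetical helpers, where the decision value $w \cdot x + b$ is a signed distance up to the factor $1/\|w\|$):

```python
import numpy as np

def train_one_vs_rest(X, labels, classes, **svm_params):
    """One binary machine per class: that class positive, the rest negative."""
    return {c: svm_train(X, np.where(labels == c, 1, -1), **svm_params)
            for c in classes}

def predict_one_vs_rest(x, machines):
    # take the class whose classifier reports the largest decision value
    return max(machines, key=lambda c: svm_decision(machines[c], x))
```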
4.4. Global Solutions and Uniqueness

When is the solution to the support vector training problem global, and when is it unique? By "global", we mean that there exists no other point in the feasible region at which the objective function takes a lower value. We will address two kinds of ways in which uniqueness may not hold: solutions for which $\{w, b\}$ are themselves unique, but for which the expansion of $w$ in Eq. (46) is not; and solutions whose $\{w, b\}$ differ. Both are of interest: even if the pair $\{w, b\}$ is unique, if the $\alpha_i$ are not, there may be equivalent expansions of $w$ which require fewer support vectors (a trivial example of this is given below), and which therefore require fewer instructions during test phase.
It turns out that every local solution is also global. This is a property of any convex programming problem (Fletcher, 1987). Furthermore, the solution is guaranteed to be unique if the objective function (Eq. (43)) is strictly convex, which in our case means that the Hessian must be positive definite (note that for quadratic objective functions $F$, the Hessian is positive definite if and only if $F$ is strictly convex; this is not true for non-quadratic $F$: there, a positive definite Hessian implies a strictly convex objective function, but not vice versa (consider $F = x^4$) (Fletcher, 1987)). However, even if the Hessian is positive semidefinite, the solution can still be unique: consider two points along the real line with coordinates $x_1 = 1$ and $x_2 = 2$, and with polarities $+$ and $-$. Here the Hessian is positive semidefinite, but the solution ($w = -2$, $b = 3$, $\xi_i = 0$ in Eqs. (40), (41), (42)) is unique. It is also easy to find solutions which are not unique in the sense that the $\alpha_i$ in the expansion of $w$ are not unique: for example, consider the problem of four separable points on a square in $R^2$: $x_1 = [1, 1]$, $x_2 = [-1, 1]$, $x_3 = [-1, -1]$ and $x_4 = [1, -1]$, with polarities $[+, -, -, +]$ respectively. One solution is $w = [1, 0]$, $b = 0$, $\alpha = [0.25, 0.25, 0.25, 0.25]$; another has the same $w$ and $b$, but $\alpha = [0.5, 0.5, 0, 0]$ (note that both solutions satisfy the constraints $\alpha_i > 0$ and $\sum_i \alpha_i y_i = 0$). When can this occur in general? Given some solution $\alpha$, choose an $\alpha'$ which is in the null space of the Hessian $H_{ij} = y_i y_j\, x_i \cdot x_j$, and require that $\alpha'$ be orthogonal to the vector all of whose components are 1. Then adding $\alpha'$ to $\alpha$ in Eq. (43) will leave $L_D$ unchanged. If $0 \le \alpha_i + \alpha'_i \le C$ and $\alpha'$ satisfies Eq. (45), then $\alpha + \alpha'$ is also a solution¹⁵.
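The four-points-on-a-square example can be verified directly; the check below (my illustration) shows that both $\alpha$ vectors expand the same $w$ in Eq. (46) and satisfy Eq. (45):

```python
import numpy as np

X = np.array([[1, 1], [-1, 1], [-1, -1], [1, -1]], dtype=float)
y = np.array([1, -1, -1, 1], dtype=float)
for alpha in (np.array([0.25, 0.25, 0.25, 0.25]), np.array([0.5, 0.5, 0.0, 0.0])):
    print((alpha * y) @ X, alpha @ y)   # both print w = [1. 0.] and 0.0
```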
How about solutions where the $\{w, b\}$ are themselves not unique? (We emphasize that this can only happen in principle if the Hessian is not positive definite, and even then, the solutions are necessarily global). The following very simple theorem shows that if non-unique solutions occur, then the solution at one optimal point is continuously deformable into the solution at the other optimal point, in such a way that all intermediate points are also solutions.
Theorem 2 Let the variable $X$ stand for the pair of variables $\{w, b\}$. Let the Hessian for the problem be positive semidefinite, so that the objective function is convex. Let $X_0$ and $X_1$ be two points at which the objective function attains its minimal value. Then there exists a path $X = X(\tau) = (1 - \tau)X_0 + \tau X_1$, $\tau \in [0, 1]$, such that $X(\tau)$ is a solution for all $\tau$.

Proof: Let the minimum value of the objective function be $F_{min}$. Then by assumption, $F(X_0) = F(X_1) = F_{min}$. By convexity of $F$, $F(X(\tau)) \le (1 - \tau)F(X_0) + \tau F(X_1) = F_{min}$. Furthermore, by linearity, the $X(\tau)$ satisfy the constraints Eq. (40), (41): explicitly (again combining both constraints into one):
Although simple, this theorem is quite instructive. For example, one might think that the problems depicted in Figure 10 have several different optimal solutions (for the case of linear support vector machines). However, since one cannot smoothly move the hyperplane from one proposed solution to another without generating hyperplanes which are not solutions, we know that these proposed solutions are in fact not solutions at all. In fact, for each of these cases, the optimal unique solution is at $w = 0$, with a suitable choice of $b$ (which has the effect of assigning the same label to all the points). Note that this is a perfectly acceptable solution to the classification problem: any proposed hyperplane (with $w \ne 0$) will cause the primal objective function to take a higher value.
Figure 10. Two problems, with proposed (incorrect) non-unique solutions.
Finally, note that the fact that SVM training always finds a global solution is in contrast to the case of neural networks, where many local minima usually exist.
5. Methods of Solution

The support vector optimization problem can be solved analytically only when the number of training data is very small, or for the separable case when it is known beforehand which of the training data become support vectors (as in Sections 3.3 and 6.2). Note that this can happen when the problem has some symmetry (Section 3.3), but that it can also happen when it does not (Section 6.2). For the general analytic case, the worst case computational complexity is of order $N_S^3$ (inversion of the Hessian), where $N_S$ is the number of support vectors, although the two examples given both have complexity of O(1).

However, in most real world cases, Equations (43) (with dot products replaced by kernels), (44), and (45) must be solved numerically. For small problems, any general purpose optimization package that solves linearly constrained convex quadratic programs will do. A good survey of the available solvers, and where to get them, can be found¹⁶ in (More and Wright, 1993).
For larger problems, a range of existing techniques can be brought to bear. A full exploration of the relative merits of these methods would fill another tutorial. Here we just describe the general issues, and for concreteness, give a brief explanation of the technique we currently use. Below, a "face" means a set of points lying on the boundary of the feasible region, and an "active constraint" is a constraint for which the equality holds. For more on nonlinear programming techniques see (Fletcher, 1987; Mangasarian, 1969; McCormick, 1983).
The basic recipe is to (1) note the optimality (KKT) conditions which the solution must satisfy, (2) define a strategy for approaching optimality by uniformly increasing the dual objective function subject to the constraints, and (3) decide on a decomposition algorithm so that only portions of the training data need be handled at a given time (Boser, Guyon and Vapnik, 1992; Osuna, Freund and Girosi, 1997a). We give a brief description of some of the issues involved. One can view the problem as requiring the solution of a sequence of equality constrained problems. A given equality constrained problem can be solved in one step by using the Newton method (although this requires storage for a factorization of the projected Hessian), or in at most $l$ steps using conjugate gradient ascent (Press et al., 1992) (where $l$ is the number of data points for the problem currently being solved: no extra storage is required). Some algorithms move within a given face until a new constraint is encountered, in which case the algorithm is restarted with the new constraint added to the list of equality constraints. This method has the disadvantage that only one new constraint is made active at a time. "Projection methods" have also been considered (More, 1991), where a point outside the feasible region is computed, and then line searches and projections are done so that the actual move remains inside the feasible region. This approach can add several new constraints at once. Note that in both approaches, several active constraints can become inactive in one step. In all algorithms, only the essential part of the Hessian (the columns corresponding to $\alpha_i \ne 0$) need be computed (although some algorithms do compute the whole Hessian). For the Newton approach, one can also take advantage of the fact that the Hessian is positive semidefinite by diagonalizing it with the Bunch-Kaufman algorithm (Bunch and Kaufman, 1977; Bunch and Kaufman, 1980) (if the Hessian were indefinite, it could still be easily reduced to 2x2 block diagonal form with this algorithm). In this algorithm, when a new constraint is made active or inactive, the factorization of the projected Hessian is easily updated (as opposed to recomputing the factorization from scratch). Finally, in interior point methods, the variables are essentially rescaled so as to always remain inside the feasible region. An example is the "LOQO" algorithm of (Vanderbei, 1994a; Vanderbei, 1994b), which is a primal-dual path following algorithm. This last method is likely to be useful for problems where the number of support vectors as a fraction of training sample size is expected to be large.
We briefly describe the core optimization method we currently use¹⁷. It is an active set method combining gradient and conjugate gradient ascent. Whenever the objective function is computed, so is the gradient, at very little extra cost. In phase 1, the search directions $s$ are along the gradient. The nearest face along the search direction is found. If the dot product of the gradient there with $s$ indicates that the maximum along $s$ lies between the current point and the nearest face, the optimal point along the search direction is computed analytically (note that this does not require a line search), and phase 2 is entered. Otherwise, we jump to the new face and repeat phase 1. In phase 2, Polak-Ribiere conjugate gradient ascent (Press et al., 1992) is done, until a new face is encountered (in which case phase 1 is re-entered) or the stopping criterion is met. Note the following:

- Search directions are always projected so that the $\alpha_i$ continue to satisfy the equality constraint Eq. (45). Note that the conjugate gradient algorithm will still work; we are simply searching in a subspace. However, it is important that this projection is implemented in such a way that not only is Eq. (45) met (easy), but also so that the angle between the resulting search direction, and the search direction prior to projection, is minimized (not quite so easy).
- We also use a "sticky faces" algorithm: whenever a given face is hit more than once, the search directions are adjusted so that all subsequent searches are done within that face. All "sticky faces" are reset (made "non-sticky") when the rate of increase of the objective function falls below a threshold.

- The algorithm stops when the fractional rate of increase of the objective function $F$ falls below a tolerance (typically 1e-10, for double precision). Note that one can also use as stopping criterion the condition that the size of the projected search direction falls below a threshold. However, this criterion does not handle scaling well.
In my opinion the hardest thing to get right is handling precision problems correctly everywhere. If this is not done, the algorithm may not converge, or may be much slower than it needs to be.
A good way to check that your algorithm is working is to check that the solution satisfies all the Karush-Kuhn-Tucker conditions for the primal problem, since these are necessary and sufficient conditions that the solution be optimal. The KKT conditions are Eqs. (48) through (56), with dot products between data vectors replaced by kernels wherever they appear (note $w$ must be expanded as in Eq. (48) first, since $w$ is not in general the mapping of a point in $\mathcal{L}$). Thus to check the KKT conditions, it is sufficient to check that the $\alpha_i$ satisfy $0 \le \alpha_i \le C$, that the equality constraint (49) holds, that all points for which $0 \le \alpha_i < C$ satisfy Eq. (51) with $\xi_i = 0$, and that all points with $\alpha_i = C$ satisfy Eq. (51) for some $\xi_i \ge 0$. These are sufficient conditions for all the KKT conditions to hold: note that by doing this we never have to explicitly compute the $\xi_i$ or $\mu_i$, although doing so is trivial.
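A sketch of this check in kernel form may be useful (mine, not the paper's; assumed hypothetical inputs are alpha, y, b, C and the $l \times l$ kernel matrix K with $K[i,j] = K(x_i, x_j)$):

```python
import numpy as np

def kkt_satisfied(alpha, y, b, C, K, tol=1e-6):
    box = np.all(alpha > -tol) and np.all(alpha < C + tol)   # Eq. (44)
    equality = abs(alpha @ y) < tol                           # Eq. (49)
    margins = y * ((alpha * y) @ K + b)                       # y_i (w . Phi(x_i) + b)
    interior = alpha < C - tol
    ok_interior = np.all(margins[interior] >= 1 - tol)        # Eq. (51) with xi_i = 0
    # for alpha_i = C, Eq. (55) fixes xi_i = 1 - margin, which must be >= 0:
    ok_bound = np.all(margins[~interior] <= 1 + tol)
    return box and equality and ok_interior and ok_bound
```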
5.1. Complexity, Scalability, and Parallelizability

Support vector machines have the following very striking property. Both training and test functions depend on the data only through the kernel functions $K(x_i, x_j)$. Even though it corresponds to a dot product in a space of dimension $d_H$, where $d_H$ can be very large or infinite, the complexity of computing $K$ can be far smaller. For example, for kernels of the form $K = (x_i \cdot x_j)^p$, a dot product in $\mathcal{H}$ would require of order $\binom{d_L + p - 1}{p}$ operations, whereas the computation of $K(x_i, x_j)$ requires only $O(d_L)$ operations (recall $d_L$ is the dimension of the data). It is this fact that allows us to construct hyperplanes in these very high dimensional spaces yet still be left with a tractable computation. Thus SVMs circumvent both forms of the "curse of dimensionality": the proliferation of parameters causing intractable complexity, and the proliferation of parameters causing overfitting.
5.1.1. Training For concreteness, we will give results for the computational complexity of one of the above training algorithms (Bunch-Kaufman)¹⁸ (Kaufman, 1998). These results assume that different strategies are used in different situations. We consider the problem of training on just one "chunk" (see below). Again let $l$ be the number of training points, $N_S$ the number of support vectors (SVs), and $d_L$ the dimension of the input data. In the case where most SVs are not at the upper bound, and $N_S/l \ll 1$, the number of operations $C$ is $O(N_S^3 + (N_S^2)l + N_S d_L l)$. If instead $N_S/l \approx 1$, then $C$ is $O(N_S^3 + N_S l + N_S d_L l)$ (basically by starting in the interior of the feasible region). For the case where most SVs are at the upper bound, and $N_S/l \ll 1$, then $C$ is $O(N_S^2 + N_S d_L l)$. Finally, if most SVs are at the upper bound, and $N_S/l \approx 1$, we have $C$ of $O(d_L l^2)$.
For larger problems, two decomposition algorithms have been proposed to date. In the "chunking" method (Boser, Guyon and Vapnik, 1992), one starts with a small, arbitrary subset of the data and trains on that. The rest of the training data is tested on the resulting classifier, and a list of the errors is constructed, sorted by how far on the wrong side of the margin they lie (i.e. how egregiously the KKT conditions are violated). The next chunk is constructed from the first $N$ of these, combined with the $N_S$ support vectors already found, where $N + N_S$ is decided heuristically (a chunk size that is allowed to grow too quickly or too slowly will result in slow overall convergence). Note that vectors can be dropped from a chunk, and that support vectors in one chunk may not appear in the final solution. This process is continued until all data points are found to satisfy the KKT conditions.
The above method requires that the number of support vectors $N_S$ be small enough so that a Hessian of size $N_S$ by $N_S$ will fit in memory. An alternative decomposition algorithm has been proposed which overcomes this limitation (Osuna, Freund and Girosi, 1997b). Again, in this algorithm, only a small portion of the training data is trained on at a given time, and furthermore, only a subset of the support vectors need be in the "working set" (i.e. that set of points whose $\alpha$'s are allowed to vary). This method has been shown to be able to easily handle a problem with 110,000 training points and 100,000 support vectors. However, it must be noted that the speed of this approach relies on many of the support vectors having corresponding Lagrange multipliers $\alpha_i$ at the upper bound, $\alpha_i = C$.
These training algorithms may take advantage of parallel processing in several ways. First, all elements of the Hessian itself can be computed simultaneously. Second, each element often requires the computation of dot products of training data, which could also be parallelized. Third, the computation of the objective function, or gradient, which is a speed bottleneck, can be parallelized (it requires a matrix multiplication). Finally, one can envision parallelizing at a higher level, for example by training on different chunks simultaneously. Schemes such as these, combined with the decomposition algorithm of (Osuna, Freund and Girosi, 1997b), will be needed to make very large problems (i.e. $\gg$ 100,000 support vectors, with many not at bound) tractable.
5.1.2. Testing In test phase, one must simply evaluate Eq. (61), which will require $O(M N_S)$ operations, where $M$ is the number of operations required to evaluate the kernel. For dot product and RBF kernels, $M$ is $O(d_L)$, the dimension of the data vectors. Again, both the evaluation of the kernel and of the sum are highly parallelizable procedures. In the absence of parallel hardware, one can still speed up test phase by a large factor, as described in Section 9.
6. The VC Dimension of Support Vector Machines

We now show that the VC dimension of SVMs can be very large (even infinite). We will then explore several arguments as to why, in spite of this, SVMs usually exhibit good generalization performance. However it should be emphasized that these are essentially plausibility arguments. Currently there exists no theory which guarantees that a given family of SVMs will have high accuracy on a given problem.
We will call any kernel that satisfies Mercer's condition a positive kernel, and the corresponding space $\mathcal{H}$ the embedding space. We will also call any embedding space with minimal dimension for a given kernel a "minimal embedding space". We have the following

Theorem 3 Let $K$ be a positive kernel which corresponds to a minimal embedding space $\mathcal{H}$. Then the VC dimension of the corresponding support vector machine (where the error penalty $C$ in Eq. (44) is allowed to take all values) is $\dim(\mathcal{H}) + 1$.
Proof: If the minimal embedding space has dimension $d_H$, then $d_H$ points in the image of $\mathcal{L}$ under the mapping $\Phi$ can be found whose position vectors in $\mathcal{H}$ are linearly independent. From Theorem 1, these vectors can be shattered by hyperplanes in $\mathcal{H}$. Thus by either restricting ourselves to SVMs for the separable case (Section 3.1), or for which the error penalty $C$ is allowed to take all values (so that, if the points are linearly separable, a $C$ can be found such that the solution does indeed separate them), the family of support vector machines with kernel $K$ can also shatter these points, and hence has VC dimension $d_H + 1$.

(The proof is in the Appendix). Thus the VC dimension of SVMs with these kernels is $\binom{d_L + p - 1}{p} + 1$. As noted above, this gets very large very quickly.
Proof: The kernel matrix, $K_{ij} \equiv K(\mathbf{x}_i, \mathbf{x}_j)$, is a Gram matrix (a matrix of dot products: see (Horn, 1985)) in $\mathcal{H}$. Clearly we can choose training data such that all off-diagonal elements $K_{i \ne j}$ can be made arbitrarily small, and by assumption all diagonal elements $K_{i = j}$ are of $O(1)$. The matrix $K$ is then of full rank; hence the set of vectors, whose dot products in $\mathcal{H}$ form $K$, are linearly independent (Horn, 1985); hence, by Theorem 1, the points can be shattered by hyperplanes in $\mathcal{H}$, and hence also by support vector machines with sufficiently large error penalty. Since this is true for any finite number of points, the VC dimension of these classifiers is infinite.
Note that the assumptions in the theorem are stronger than necessary (they were chosen to make the connection to radial basis functions clear). In fact it is only necessary that $l$ training points can be chosen such that the rank of the matrix $K_{ij}$ increases without limit as $l$ increases. For example, for Gaussian RBF kernels, this can also be accomplished (even for training data restricted to lie in a bounded subset of $\mathbf{R}^{d_L}$) by choosing small enough RBF widths. However in general the VC dimension of SVM RBF classifiers can certainly be finite, and indeed, for data restricted to lie in a bounded subset of $\mathbf{R}^{d_L}$, choosing restrictions on the RBF widths is a good way to control the VC dimension.
This case gives us a second opportunity to present a situation where the SVM solution can be computed analytically, which also amounts to a second, constructive proof of the Theorem. For concreteness we will take the case for Gaussian RBF kernels of the form $K(\mathbf{x}_1, \mathbf{x}_2) = e^{-\|\mathbf{x}_1 - \mathbf{x}_2\|^2 / 2\sigma^2}$. Let us choose training points such that the smallest distance between any pair of points is much larger than the width $\sigma$. Consider the decision function evaluated on the support vector $\mathbf{s}_j$:

$f(\mathbf{s}_j) = \sum_i \alpha_i y_i e^{-\|\mathbf{s}_i - \mathbf{s}_j\|^2 / 2\sigma^2} + b$.  (80)
The sum on the right hand side will then be largely dominated by the term $i = j$; in fact the ratio of that term to the contribution from the rest of the sum can be made arbitrarily large by choosing the training points to be arbitrarily far apart. In order to find the SVM solution, we again assume for the moment that every training point becomes a support vector, and we work with SVMs for the separable case (Section 3.1) (the same argument will hold for SVMs for the non-separable case if $C$ in Eq. (44) is allowed to take large enough values). Since all points are support vectors, the equalities in Eqs. (10), (11) will hold for them. Let there be $N_+$ ($N_-$) positive (negative) polarity points. We further assume that all positive (negative) polarity points have the same value $\alpha_+$ ($\alpha_-$) for their Lagrange multiplier. (We will know that this assumption is correct if it delivers a solution which satisfies all the KKT conditions and constraints). Then Eqs. (19), applied to all the training data, and the equality constraint Eq. (18), become

$\alpha_+ + b = +1$
$-\alpha_- + b = -1$
$N_+ \alpha_+ - N_- \alpha_- = 0$  (81)

which are satisfied by

$\alpha_+ = \frac{2 N_-}{N_- + N_+}, \quad \alpha_- = \frac{2 N_+}{N_- + N_+}, \quad b = \frac{N_+ - N_-}{N_- + N_+}$  (82)
Thus, since the resulting $\alpha_i$ are also positive, all the KKT conditions and constraints are satisfied, and we must have found the global solution (with zero training errors). Since the number of training points, and their labeling, is arbitrary, and they are separated without error, the VC dimension is infinite.
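The solution (82) is easy to check numerically. The following sketch (Python/NumPy, added here for illustration) places points far apart compared to $\sigma$, sets $\alpha_\pm$ and $b$ from Eq. (82), and confirms that the decision function (80) gives $f(\mathbf{s}_j) \approx \pm 1$ on every training point:

import numpy as np

# Seven 1-d points spaced far apart compared to the RBF width.
s = 100.0 * np.arange(7, dtype=float)[:, None]
y = np.array([1., 1., 1., 1., -1., -1., -1.])
sigma = 1.0
n_plus, n_minus = int(np.sum(y > 0)), int(np.sum(y < 0))

# Eq. (82): one multiplier value per polarity, plus the threshold b.
alpha = np.where(y > 0, 2.0 * n_minus / (n_minus + n_plus),
                        2.0 * n_plus / (n_minus + n_plus))
b = (n_plus - n_minus) / (n_plus + n_minus)

# Eq. (80): the decision function at each training point.
K = np.exp(-(s - s.T) ** 2 / (2.0 * sigma ** 2))
f = K @ (alpha * y) + b
print(np.allclose(f, y))  # True: f(s_j) = +/-1, so the KKT equalities hold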
The situation is summarized schematically in Figure 11.
Figure 11. Gaussian RBF SVMs of sufficiently small width can classify an arbitrarily large number of training points correctly, and thus have infinite VC dimension.
Now we are left with a striking conundrum. Even though their VC dimension is infinite (if the data is allowed to take all values in $\mathbf{R}^{d_L}$), SVM RBFs can have excellent performance (Scholkopf et al., 1997). A similar story holds for polynomial SVMs. How come?
7. The Generalization Performance of SVMs

In this Section we collect various arguments and bounds relating to the generalization performance of SVMs. We start by presenting a family of SVM-like classifiers for which structural risk minimization can be rigorously implemented, and which will give us some insight as to why maximizing the margin is so important.
7.1. VC Dimension of Gap Tolerant Classifiers

Consider a family of classifiers (i.e. a set of functions $\Phi$ on $\mathbf{R}^d$) which we will call "gap tolerant classifiers." A particular classifier $\Phi \in \Phi$ is specified by the location and diameter of a ball in $\mathbf{R}^d$, and by two hyperplanes, with parallel normals, also in $\mathbf{R}^d$. Call the set of points lying between, but not on, the hyperplanes the "margin set." The decision functions $\Phi$ are defined as follows: points that lie inside the ball, but not in the margin set, are assigned class $\pm 1$, depending on which side of the margin set they fall. All other points are simply defined to be "correct", that is, they are not assigned a class by the classifier, and do not contribute to any risk. The situation is summarized, for $d = 2$, in Figure 12. This rather odd family of classifiers, together with a condition we will impose on how they are trained, will result in systems very similar to SVMs, and for which structural risk minimization can be demonstrated. A rigorous discussion is given in the Appendix.
Label the diameter of the ball $D$ and the perpendicular distance between the two hyperplanes $M$. The VC dimension is defined as before to be the maximum number of points that can be shattered by the family, but by "shattered" we mean that the points can occur as errors in all possible ways (see the Appendix for further discussion). Clearly we can control the VC dimension of a family of these classifiers by controlling the minimum margin $M$ and maximum diameter $D$ that members of the family are allowed to assume. For example, consider the family of gap tolerant classifiers in $\mathbf{R}^2$ with diameter $D = 2$, shown in Figure 12. Those with margin satisfying $M \le 3/2$ can shatter three points; if $3/2 < M < 2$, they can shatter two; and if $M \ge 2$, they can shatter only one. Each of these three families of classifiers corresponds to one of the sets of classifiers in Figure 4, with just three nested subsets of functions, and with $h_1 = 1$, $h_2 = 2$, and $h_3 = 3$.
Figure 12. A gap tolerant classifier in $\mathbf{R}^2$, with ball of diameter $D = 2$ and margin $M = 3/2$; points inside the ball but outside the margin set are assigned $\Phi = \pm 1$, and all other points $\Phi = 0$.
These ideas can be used to show how gap tolerant classifiers implement structural risk minimization. The extension of the above example to spaces of arbitrary dimension is encapsulated in a (modified) theorem of (Vapnik, 1995):

Theorem 6 For data in $\mathbf{R}^d$, the VC dimension $h$ of gap tolerant classifiers of minimum margin $M_{min}$ and maximum diameter $D_{max}$ is bounded above^19 by $\min\{\lceil D_{max}^2 / M_{min}^2 \rceil, d\} + 1$.
For the proof we assume the following lemma, which in (Vapnik, 1979) is held to follow from symmetry arguments^20:

Lemma: Consider $n \le d + 1$ points lying in a ball $B \in \mathbf{R}^d$. Let the points be shatterable by gap tolerant classifiers with margin $M$. Then in order for $M$ to be maximized, the points must lie on the vertices of an $(n-1)$-dimensional symmetric simplex, and must also lie on the surface of the ball.
Proof: We need only consider the case where the number of points $n$ satisfies $n \le d + 1$. ($n > d + 1$ points will not be shatterable, since the VC dimension of oriented hyperplanes in $\mathbf{R}^d$ is $d + 1$, and any distribution of points which can be shattered by a gap tolerant classifier can also be shattered by an oriented hyperplane; this also shows that $h \le d + 1$). Again we consider points on a sphere of diameter $D$, where the sphere itself is of dimension $d - 2$. We will need two results from Section 3.3, namely (1) if $n$ is even, we can find a distribution of $n$ points (the vertices of the $(n-1)$-dimensional symmetric simplex) which can be shattered by gap tolerant classifiers if $D_{max}^2 / M_{min}^2 = n - 1$, and (2) if $n$ is odd, we can find a distribution of $n$ points which can be so shattered if $D_{max}^2 / M_{min}^2 = (n-1)^2 (n+1) / n^2$.
If $n$ is even, at most $n$ points can be shattered whenever

$n - 1 \le D_{max}^2 / M_{min}^2 < n$.  (83)
Thus for $n$ even the maximum number of points that can be shattered may be written $\lfloor D_{max}^2 / M_{min}^2 \rfloor + 1$.
Let's see how we can do structural risk minimization with gap tolerant classifiers. We need only consider that subset of the $\Phi$, call it $\Phi_S$, for which training "succeeds", where by success we mean that all training data are assigned a label $\in \{\pm 1\}$ (note that these labels do not have to coincide with the actual labels, i.e. training errors are allowed). Within $\Phi_S$, find the subset which gives the fewest training errors - call this number of errors $N_{min}$. Within that subset, find the function which gives maximum margin (and hence the lowest bound on the VC dimension). Note the value of the resulting risk bound (the right hand side of Eq. (3), using the bound on the VC dimension in place of the VC dimension). Next, within $\Phi_S$, find that subset which gives $N_{min} + 1$ training errors. Again, within that subset, find the $\Phi$ which gives the maximum margin, and note the corresponding risk bound. Iterate, and take that classifier which gives the overall minimum risk bound.
An alternative approach is to divide the functions into nested subsets $\Phi_i$, $i \in \mathbf{Z}$, $i \ge 1$, as follows: all $\Phi \in \Phi_i$ have $\{D, M\}$ satisfying $\lceil D^2 / M^2 \rceil \le i$. Thus the family of functions in $\Phi_i$ has VC dimension bounded above by $\min(i, d) + 1$. Note also that $\Phi_i \subset \Phi_{i+1}$. SRM then proceeds by taking that $\Phi$ for which training succeeds in each subset and for which the empirical risk is minimized in that subset, and again, choosing that $\Phi$ which gives the lowest overall risk bound.
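Schematically, this selection rule might be coded as follows; the candidate format, and the use of the confidence term of Eq. (3) with the bound of Theorem 6 in place of the VC dimension, are assumptions of this sketch (Python/NumPy):

import numpy as np

def vc_confidence(h, l, eta=0.05):
    # The confidence term of the risk bound Eq. (3), with the VC
    # dimension h replaced by its Theorem 6 upper bound.
    return np.sqrt((h * (np.log(2.0 * l / h) + 1.0) - np.log(eta / 4.0)) / l)

def srm_select(candidates, l, d, eta=0.05):
    # Each candidate dict {'D', 'M', 'emp_risk'} describes one trained
    # gap tolerant classifier (a format invented for this sketch).
    def bound(c):
        h = min(int(np.ceil(c['D'] ** 2 / c['M'] ** 2)), d) + 1
        return c['emp_risk'] + vc_confidence(h, l, eta)
    return min(candidates, key=bound)

candidates = [{'D': 2.0, 'M': 0.2, 'emp_risk': 0.01},
              {'D': 2.0, 'M': 0.5, 'emp_risk': 0.03},
              {'D': 2.0, 'M': 1.0, 'emp_risk': 0.08}]
print(srm_select(candidates, l=1000, d=10))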
Note that it is essential to these arguments that the bound (3) holds for any chosen decision function, not just the one that minimizes the empirical risk (otherwise eliminating solutions for which some training point $\mathbf{x}$ satisfies $\Phi(\mathbf{x}) = 0$ would invalidate the argument).
The resulting gap tolerant classifier is in fact a special kind of support vector machine which simply does not count data falling outside the sphere containing all the training data, or inside the separating margin, as an error. It seems very reasonable to conclude that support vector machines, which are trained with very similar objectives, also gain a similar kind of capacity control from their training. However, a gap tolerant classifier is not an SVM, and so the argument does not constitute a rigorous demonstration of structural risk minimization for SVMs. The original argument for structural risk minimization for SVMs is known to be flawed, since the structure there is determined by the data (see (Vapnik, 1995), Section 5.11). I believe that there is a further subtle problem with the original argument. The structure is defined so that no training points are members of the margin set. However, one must still specify how test points that fall into the margin are to be labeled. If one simply assigns the same, fixed class to them (say +1), then the VC dimension will be higher^21 than the bound derived in Theorem 6. However, the same is true if one labels them all as errors (see the Appendix). If one labels them all as "correct", one arrives at gap tolerant classifiers.
On the other hand, it is known how to do structural risk minimization for systems where the structure does depend on the data (Shawe-Taylor et al., 1996a; Shawe-Taylor et al., 1996b). Unfortunately the resulting bounds are much looser than the VC bounds above, which are already very loose (we will examine a typical case below where the VC bound is a factor of 100 higher than the measured test error). Thus at the moment structural risk minimization alone does not provide a rigorous explanation as to why SVMs often have good generalization performance. However, the above arguments strongly suggest that algorithms that minimize $D^2 / M^2$ can be expected to give better generalization performance. Further evidence for this is found in the following theorem of (Vapnik, 1998), which we quote without proof^22:
Theorem 7 For optimal hyperplanes passing through the origin, we have

$E[P(\mathrm{error})] \le \frac{E[D^2 / M^2]}{l}$  (86)

where $P(\mathrm{error})$ is the probability of error on the test set, the expectation on the left is over all training sets of size $l - 1$, and the expectation on the right is over all training sets of size $l$.
However, in order for these observations to be useful for real problems, we need a way to compute the diameter of the minimal enclosing sphere described above, for any number of training points and for any kernel mapping.
7.3. How to Compute the Minimal Enclosing Sphere

Again let $\Phi$ be the mapping to the embedding space $\mathcal{H}$. We wish to compute the radius of the smallest sphere in $\mathcal{H}$ which encloses the mapped training data: that is, we wish to minimize $R^2$ subject to

$\|\Phi(\mathbf{x}_i) - \mathbf{C}\|^2 \le R^2 \quad \forall i$  (87)

where $\mathbf{C} \in \mathcal{H}$ is the (unknown) center of the sphere. Thus introducing positive Lagrange multipliers $\lambda_i$, the primal Lagrangian is

$L_P = R^2 - \sum_i \lambda_i (R^2 - \|\Phi(\mathbf{x}_i) - \mathbf{C}\|^2)$.  (88)

This is again a convex quadratic programming problem, so we can instead maximize the Wolfe dual

$L_D = \sum_i \lambda_i K(\mathbf{x}_i, \mathbf{x}_i) - \sum_{i,j} \lambda_i \lambda_j K(\mathbf{x}_i, \mathbf{x}_j)$  (89)

(where we have again replaced $\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$ by $K(\mathbf{x}_i, \mathbf{x}_j)$) subject to:

$\sum_i \lambda_i = 1$  (90)

$\lambda_i \ge 0$  (91)
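Since the dual (89)-(91) is a small convex QP, any general purpose solver will do; the sketch below uses scipy.optimize, an arbitrary choice made for this edit. At the optimum the value of $L_D$ equals the squared radius $R^2$ (a standard property of this dual), and the diameter is $D = 2R$.

import numpy as np
from scipy.optimize import minimize

def enclosing_sphere(K):
    # Maximize L_D of Eq. (89) subject to Eqs. (90), (91); K is the
    # kernel matrix of the training data. Returns (lambda, R^2).
    n = K.shape[0]
    res = minimize(lambda lam: -(lam @ np.diag(K) - lam @ K @ lam),
                   np.full(n, 1.0 / n), method='SLSQP',
                   bounds=[(0.0, None)] * n,
                   constraints=[{'type': 'eq',
                                 'fun': lambda lam: np.sum(lam) - 1.0}])
    return res.x, -res.fun

# Linear kernel example; the diameter D = 2R enters the bound of Theorem 7.
X = np.random.randn(20, 2)
lam, R2 = enclosing_sphere(X @ X.T)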
7.4. A Bound from Leave-One-Out

(Vapnik, 1995) gives an alternative bound on the actual risk of support vector machines:

$E[P(\mathrm{error})] \le \frac{E[\mathrm{Number\ of\ support\ vectors}]}{\mathrm{Number\ of\ training\ samples}}$,  (93)

where $P(\mathrm{error})$ is the actual risk for a machine trained on $l - 1$ examples, $E[P(\mathrm{error})]$ is the expectation of the actual risk over all choices of training set of size $l - 1$, and $E[\mathrm{Number\ of\ support\ vectors}]$ is the expectation of the number of support vectors over all choices of training sets of size $l$. It's easy to see how this bound arises: consider the typical situation after training on a given training set, shown in Figure 13.
Figure 13. Support vectors (circles) can become errors (cross) after removal and re-training (the dotted line denotes the new decision surface).
We can get an estimate of the test error by removing one of the training points, re-training, and then testing on the removed point; and then repeating this, for all training points. From the support vector solution we know that removing any training points that are not support vectors (the latter include the errors) will have no effect on the hyperplane found. Thus the worst that can happen is that every support vector will become an error. Taking the expectation over all such training sets therefore gives an upper bound on the actual risk, for training sets of size $l - 1$.
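The argument above is also a recipe. The sketch below (Python, using scikit-learn's SVC purely as a convenient trainer, an assumption of this edit) computes the leave-one-out error on a toy problem and compares it with the right hand side of Eq. (93):

import numpy as np
from sklearn.svm import SVC

# Toy data: label = sign of the first coordinate, plus noise.
rng = np.random.RandomState(0)
X = rng.randn(40, 2)
y = np.sign(X[:, 0] + 0.3 * rng.randn(40))
y[y == 0] = 1.0

errors = 0
for i in range(len(y)):                       # leave one point out
    mask = np.arange(len(y)) != i
    clf = SVC(kernel='rbf', C=10.0, gamma=0.5).fit(X[mask], y[mask])
    errors += int(clf.predict(X[i:i + 1])[0] != y[i])

n_sv = SVC(kernel='rbf', C=10.0, gamma=0.5).fit(X, y).n_support_.sum()
print('LOO error %.3f vs. bound (93) %.3f' % (errors / len(y), n_sv / len(y)))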
Although elegant, I have yet to find a use for this bound. There seem to be many situations where the actual error increases even though the number of support vectors decreases, so the intuitive conclusion (systems that give fewer support vectors give better performance) often seems to fail. Furthermore, although the bound can be tighter than that found using the estimate of the VC dimension combined with Eq. (3), it can at the same time be less predictive, as we shall see in the next Section.
Let us put these observations to some use. As mentioned above, training an SVM RBF classifier will automatically give values for the RBF weights, number of centers, center positions, and threshold. For Gaussian RBFs, there is only one parameter left: the RBF width ($\sigma$ in Eq. (80)) (we assume here only one RBF width for the problem). Can we find the optimal value for that too, by choosing that which minimizes $D^2 / M^2$? Figure 14 shows a series of experiments done on 28x28 NIST digit data, with 10,000 training points and 60,000 test points. The top curve in the left hand panel shows the VC bound (i.e. the bound resulting from approximating the VC dimension in Eq. (3)^23 by Eq. (85)), the middle curve shows the bound from leave-one-out (Eq. (93)), and the bottom curve shows the measured test error. Clearly, in this case, the bounds are very loose. The right hand panel shows just the VC bound (the top curve, for $\sigma^2 > 200$), together with the test error, with the latter scaled up by a factor of 100 (note that the two curves cross). It is striking that the two curves have minima in the same place: thus in this case, the VC bound, although loose, seems to be nevertheless predictive. Experiments on digits 2 through 9 showed that the VC bound gave a minimum for which $\sigma^2$ was within a factor of two of that which minimized the test error (digit 1 was inconclusive). Interestingly, in those cases the VC bound consistently gave a lower prediction for $\sigma^2$ than that which minimized the test error. On the other hand, the leave-one-out bound, although tighter, does not seem to be predictive, since it had no minimum for the values of $\sigma^2$ tested.
Figure 14. Left panel: the VC bound, the leave-one-out ("SV") bound, and the measured test error ("actual risk"), as a function of $\sigma^2$. Right panel: the VC bound together with the actual risk scaled up by a factor of 100, as a function of $\sigma^2$.
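The experiment just described can be reproduced in miniature. The sketch below (Python, with scikit-learn and SciPy standing in for the original training code; all parameter choices are illustrative) scans $\sigma^2$, estimates $D$ from the minimal enclosing sphere of Section 7.3 and $M$ from the trained machine via $M = 2/\|\mathbf{w}\|$ with $\|\mathbf{w}\|^2 = \sum_{i,j} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$, and selects the $\sigma^2$ minimizing $D^2/M^2 = R^2 \|\mathbf{w}\|^2$:

import numpy as np
from scipy.optimize import minimize
from sklearn.svm import SVC

def rbf(X, sigma2):
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T) / (2.0 * sigma2))

def radius2(K):
    # Squared radius of the minimal enclosing sphere, via Eqs. (89)-(91).
    n = K.shape[0]
    res = minimize(lambda l: -(l @ np.diag(K) - l @ K @ l),
                   np.full(n, 1.0 / n), method='SLSQP',
                   bounds=[(0.0, None)] * n,
                   constraints=[{'type': 'eq', 'fun': lambda l: np.sum(l) - 1.0}])
    return -res.fun

rng = np.random.RandomState(0)
X = rng.randn(60, 2)
y = np.sign(X[:, 0] * X[:, 1] + 0.1)   # a toy nonlinear problem

scores = {}
for sigma2 in (0.1, 0.5, 1.0, 5.0, 25.0):
    clf = SVC(kernel='rbf', C=100.0, gamma=1.0 / (2.0 * sigma2)).fit(X, y)
    dc = clf.dual_coef_[0]                        # alpha_i y_i on the SVs
    w2 = dc @ rbf(X[clf.support_], sigma2) @ dc   # ||w||^2, so M = 2/||w||
    scores[sigma2] = radius2(rbf(X, sigma2)) * w2 # D^2/M^2 = R^2 ||w||^2
print('sigma^2 minimizing D^2/M^2:', min(scores, key=scores.get))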
8. Limitations

Perhaps the biggest limitation of the support vector approach lies in choice of the kernel. Once the kernel is fixed, SVM classifiers have only one user-chosen parameter (the error penalty), but the kernel is a very big rug under which to sweep parameters. Some work has been done on limiting kernels using prior knowledge (Scholkopf et al., 1998a; Burges, 1998), but the best choice of kernel for a given problem is still a research issue.
A second limitation is speed and size, both in training and testing. While the speed problem in test phase is largely solved in (Burges, 1996), this still requires two training passes. Training for very large datasets (millions of support vectors) is an unsolved problem. Discrete data presents another problem, although with suitable rescaling excellent results have nevertheless been obtained (Joachims, 1997). Finally, although some work has been done on training a multiclass SVM in one step^24, the optimal design for multiclass SVM classifiers is a further area for research.
9. Extensions

We very briefly describe two of the simplest, and most effective, methods for improving the performance of SVMs.

The virtual support vector method (Scholkopf, Burges and Vapnik, 1996; Burges and Scholkopf, 1997), attempts to incorporate known invariances of the problem (for example, translation invariance for the image recognition problem) by first training a system, and then creating new data by distorting the resulting support vectors (translating them, in the case mentioned), and finally training a new system on the distorted (and the undistorted) data. The idea is easy to implement and seems to work better than other methods for incorporating invariances proposed so far.
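A minimal sketch of this idea for images follows (Python, with scikit-learn's SVC as a stand-in trainer; the one-pixel translations and all names are illustrative choices, and whether the second pass trains on the full data or only on the support vectors plus their distortions is a design choice not pinned down here):

import numpy as np
from sklearn.svm import SVC

def virtual_sv(X, y, shape, C=10.0):
    # Pass 1: train and extract the support vectors.
    clf = SVC(kernel='rbf', C=C, gamma='scale').fit(X, y)
    sv, y_sv = X[clf.support_], y[clf.support_]
    imgs = sv.reshape((-1,) + shape)
    # Virtual support vectors: one-pixel translations in each of the
    # four directions (axis 0 = rows, axis 1 = columns of the image).
    virtual = [np.roll(imgs, shift, axis=ax + 1).reshape(len(sv), -1)
               for ax in (0, 1) for shift in (-1, 1)]
    # Pass 2: retrain on the original data plus the virtual examples.
    X2 = np.vstack([X] + virtual)
    y2 = np.concatenate([y] + [y_sv] * 4)
    return SVC(kernel='rbf', C=C, gamma='scale').fit(X2, y2)

# e.g. virtual_sv(X_digits, y_digits, shape=(28, 28)) for NIST-style images.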
The reduced set method (Burges, 1996; Burges and Scholkopf, 1997) was introduced to address the speed of support vector machines in test phase, and also starts with a trained SVM. The idea is to replace the sum in Eq. (46) by a similar sum, where instead of support vectors, computed vectors (which are not elements of the training set) are used, and instead of the $\alpha_i$, a different set of weights are computed. The number of parameters is chosen beforehand to give the speedup desired. The resulting vector is still a vector in $\mathcal{H}$, and the parameters are found by minimizing the Euclidean norm of the difference between the original vector $\mathbf{w}$ and the approximation to it. The same technique could be used for SVM regression to find much more efficient function representations (which could be used, for example, in data compression).
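One piece of this computation is simple to state: for fixed reduced set vectors $\mathbf{z}_k$, minimizing $\|\mathbf{w} - \sum_k \beta_k \Phi(\mathbf{z}_k)\|^2$ over the $\beta_k$ is linear least squares, since the objective expands entirely in kernel evaluations. A sketch (Python/NumPy; choosing the $\mathbf{z}_k$ themselves in general requires a nonlinear search, and the random selection below is only a placeholder):

import numpy as np

def reduced_set_weights(K_zz, K_zx, a):
    # ||w - w'||^2 = a.Kxx.a - 2 beta.Kzx.a + beta.Kzz.beta, where
    # a_i = alpha_i y_i; the minimizing beta solves Kzz beta = Kzx a.
    return np.linalg.solve(K_zz, K_zx @ a)

def rbf(A, B, sigma=1.0):
    d2 = (np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Approximate a 50-term expansion with 5 vectors chosen at random.
rng = np.random.RandomState(0)
X, a = rng.randn(50, 4), rng.randn(50)
Z = X[rng.choice(50, 5, replace=False)]
beta = reduced_set_weights(rbf(Z, Z), rbf(Z, X), a)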
Combining these two methods gave a factor of 50 speedup (while the error rate increased from 1.0% to 1.1%) on the NIST digits (Burges and Scholkopf, 1997).
10. Conclusions

SVMs provide a new approach to the problem of pattern recognition (together with regression estimation and linear operator inversion) with clear connections to the underlying statistical learning theory. They differ radically from comparable approaches such as neural networks: SVM training always finds a global minimum, and their simple geometric interpretation provides fertile ground for further investigation. An SVM is largely characterized by the choice of its kernel, and SVMs thus link the problems they are designed for with a large body of existing work on kernel based methods. I hope that this tutorial will encourage some to explore SVMs for themselves.
Acknowledgments

I'm very grateful to P. Knirsch, C. Nohl, E. Osuna, E. Rietman, B. Scholkopf, Y. Singer, A. Smola, C. Stenard, and V. Vapnik, for their comments on the manuscript. Thanks also to the reviewers, and to the Editor, U. Fayyad, for extensive, useful comments. Special thanks are due to V. Vapnik, under whose patient guidance I learned the ropes; to A. Smola and B. Scholkopf, for many interesting and fruitful discussions; and to J. Shawe-Taylor and D. Schuurmans, for valuable discussions on structural risk minimization.
Appendix

A.1. Proofs of Theorems

We collect here the theorems stated in the text, together with their proofs. The Lemma has a shorter proof using a "Theorem of the Alternative" (Mangasarian, 1969), but we wished to keep the proofs as self-contained as possible.

Lemma 1 Two sets of points in $\mathbf{R}^n$ may be separated by a hyperplane if and only if the intersection of their convex hulls is empty.
Proof: We allow the notions of points in $\mathbf{R}^n$, and position vectors of those points, to be used interchangeably in this proof. Let $C_A$, $C_B$ be the convex hulls of two sets of points $A$, $B$ in $\mathbf{R}^n$. Let $A - B$ denote the set of points whose position vectors are given by $\mathbf{a} - \mathbf{b}$, $\mathbf{a} \in A$, $\mathbf{b} \in B$ (note that $A - B$ does not contain the origin), and let $C_A - C_B$ have the corresponding meaning for the convex hulls. Then showing that $A$ and $B$ are linearly separable (separable by a hyperplane) is equivalent to showing that the set $A - B$ is linearly separable from the origin $O$. For suppose the latter: then $\exists\, \mathbf{w} \in \mathbf{R}^n, b \in \mathbf{R}, b < 0$ such that $\mathbf{x} \cdot \mathbf{w} + b > 0 \;\forall \mathbf{x} \in A - B$. Now pick some $\mathbf{y} \in B$, and denote the set of all points $\mathbf{a} - \mathbf{b} + \mathbf{y}$, $\mathbf{a} \in A$, $\mathbf{b} \in B$ by $A - B + \mathbf{y}$. Then $\mathbf{x} \cdot \mathbf{w} + b > \mathbf{y} \cdot \mathbf{w} \;\forall \mathbf{x} \in A - B + \mathbf{y}$, and clearly $\mathbf{y} \cdot \mathbf{w} + b < \mathbf{y} \cdot \mathbf{w}$, so the sets $A - B + \mathbf{y}$ and $\mathbf{y}$ are linearly separable. Repeating this process shows that $A - B$ is linearly separable from the origin if and only if $A$ and $B$ are linearly separable.
We now show that, if $C_A \cap C_B = \emptyset$, then $C_A - C_B$ is linearly separable from the origin. Clearly $C_A - C_B$ does not contain the origin. Furthermore $C_A - C_B$ is convex, since $\forall \mathbf{x}_1 = \mathbf{a}_1 - \mathbf{b}_1$, $\mathbf{x}_2 = \mathbf{a}_2 - \mathbf{b}_2$, $\lambda \in [0, 1]$, $\mathbf{a}_1, \mathbf{a}_2 \in C_A$, $\mathbf{b}_1, \mathbf{b}_2 \in C_B$, we have $(1 - \lambda)\mathbf{x}_1 + \lambda \mathbf{x}_2 = ((1 - \lambda)\mathbf{a}_1 + \lambda \mathbf{a}_2) - ((1 - \lambda)\mathbf{b}_1 + \lambda \mathbf{b}_2) \in C_A - C_B$. Hence it is sufficient to show that any convex set $S$, which does not contain $O$, is linearly separable from $O$. Let $\mathbf{x}_{min} \in S$ be that point whose Euclidean distance from $O$, $\|\mathbf{x}_{min}\|$, is minimal. (Note there can be only one such point, since if there were two, the chord joining them, which also lies in $S$, would contain points closer to $O$.) We will show that $\forall \mathbf{x} \in S$, $\mathbf{x} \cdot \mathbf{x}_{min} > 0$. Suppose $\exists\, \mathbf{x} \in S$ such that $\mathbf{x} \cdot \mathbf{x}_{min} \le 0$. Let $L$ be the line segment joining $\mathbf{x}_{min}$ and $\mathbf{x}$. Then convexity implies that $L \subset S$. Thus $O \notin L$, since by assumption $O \notin S$. Hence the three points $O$, $\mathbf{x}$ and $\mathbf{x}_{min}$ form an obtuse (or right) triangle, with obtuse (or right) angle occurring at the point $O$. Define $\hat{\mathbf{n}} \equiv (\mathbf{x} - \mathbf{x}_{min}) / \|\mathbf{x} - \mathbf{x}_{min}\|$. Then the squared distance from the closest point in $L$ to $O$ is $\|\mathbf{x}_{min}\|^2 - (\mathbf{x}_{min} \cdot \hat{\mathbf{n}})^2$, which is less than $\|\mathbf{x}_{min}\|^2$. Hence $\mathbf{x} \cdot \mathbf{x}_{min} > 0$ and $S$ is linearly separable from $O$. Thus $C_A - C_B$ is linearly separable from $O$, and a fortiori $A - B$ is linearly separable from $O$, and thus $A$ is linearly separable from $B$.
It remains to show that, if the two sets of points $A$, $B$ are linearly separable, the intersection of their convex hulls is empty. By assumption there exists a pair $\mathbf{w} \in \mathbf{R}^n$, $b \in \mathbf{R}$, such that $\forall \mathbf{a}_i \in A$, $\mathbf{w} \cdot \mathbf{a}_i + b > 0$ and $\forall \mathbf{b}_i \in B$, $\mathbf{w} \cdot \mathbf{b}_i + b < 0$. Consider a general point $\mathbf{x} \in C_A$. It may be written $\mathbf{x} = \sum_i \lambda_i \mathbf{a}_i$, $\sum_i \lambda_i = 1$, $0 \le \lambda_i \le 1$. Then $\mathbf{w} \cdot \mathbf{x} + b = \sum_i \lambda_i \{\mathbf{w} \cdot \mathbf{a}_i + b\} > 0$. Similarly, for points $\mathbf{y} \in C_B$, $\mathbf{w} \cdot \mathbf{y} + b < 0$. Hence $C_A \cap C_B = \emptyset$, since otherwise we would be able to find a point $\mathbf{x} = \mathbf{y}$ which simultaneously satisfies both inequalities.
Theorem 1: Consider some set of $m$ points in $\mathbf{R}^n$. Choose any one of the points as origin. Then the $m$ points can be shattered by oriented hyperplanes if and only if the position vectors of the remaining points are linearly independent.

Proof: Label the origin $O$, and assume that the $m - 1$ position vectors of the remaining points are linearly independent. Consider any partition of the $m$ points into two subsets, $S_1$ and $S_2$, of order $m_1$ and $m_2$ respectively, so that $m_1 + m_2 = m$. Let $S_1$ be the subset containing $O$. Then the convex hull $C_1$ of $S_1$ is that set of points whose position vectors $\mathbf{x}$ satisfy

$\mathbf{x} = \sum_{i=1}^{m_1} \alpha_i \mathbf{s}_{1i}, \quad \sum_{i=1}^{m_1} \alpha_i = 1, \quad \alpha_i \ge 0$  (A.1)

where the $\mathbf{s}_{1i}$ are the position vectors of the $m_1$ points in $S_1$ (including the null position vector of the origin). Similarly, the convex hull $C_2$ of $S_2$ is that set of points whose position vectors $\mathbf{x}$ satisfy

$\mathbf{x} = \sum_{i=1}^{m_2} \beta_i \mathbf{s}_{2i}, \quad \sum_{i=1}^{m_2} \beta_i = 1, \quad \beta_i \ge 0$  (A.2)

where the $\mathbf{s}_{2i}$ are the position vectors of the $m_2$ points in $S_2$. Now suppose that $C_1$ and $C_2$ intersect. Then there exists an $\mathbf{x} \in \mathbf{R}^n$ which simultaneously satisfies Eq. (A.1) and Eq. (A.2). Subtracting these equations gives a linear combination of the $m - 1$ non-null position vectors which vanishes, which contradicts the assumption of linear independence. By the lemma, since $C_1$ and $C_2$ do not intersect, there exists a hyperplane separating $S_1$ and $S_2$. Since this is true for any choice of partition, the $m$ points can be shattered.
It remains to show that if the $m - 1$ non-null position vectors are not linearly independent, then the $m$ points cannot be shattered by oriented hyperplanes. If the $m - 1$ position vectors are not linearly independent, then there exist $m - 1$ numbers, $\gamma_i$, such that

$\sum_{i=1}^{m-1} \gamma_i \mathbf{s}_i = 0$  (A.3)

If all the $\gamma_i$ are of the same sign, then we can scale them so that $\gamma_i \in [0, 1]$ and $\sum_i \gamma_i = 1$. Eq. (A.3) then states that the origin lies in the convex hull of the remaining points; hence, by the lemma, the origin cannot be separated from the remaining points by a hyperplane, and the points cannot be shattered.

If the $\gamma_i$ are not all of the same sign, place all the terms with negative $\gamma_i$ on the right:

$\sum_{j \in I_1} |\gamma_j| \mathbf{s}_j = \sum_{k \in I_2} |\gamma_k| \mathbf{s}_k$  (A.4)

where $I_1$ ($I_2$) are the indices for which $\gamma_i$ is positive (negative). Scaling Eq. (A.4) so that the larger of the two coefficient sums equals one, the left hand side becomes the position vector of a point lying in the convex hull of the points $\{\cup_{j \in I_1} \mathbf{s}_j\} \cup O$ (or, if the equality holds, of the points $\{\cup_{j \in I_1} \mathbf{s}_j\}$), and the right hand side is the position vector of a point lying in the convex hull of the points $\cup_{k \in I_2} \mathbf{s}_k$, so the convex hulls overlap, and by the lemma, the two sets of points cannot be separated by a hyperplane. Thus the $m$ points cannot be shattered.
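Theorem 1 gives an immediately computable test, sketched below (Python/NumPy, added for illustration): pick one point as origin and check the rank of the remaining position vectors.

import numpy as np

def shatterable(points):
    # Theorem 1: choose the first point as origin; the points can be
    # shattered iff the remaining position vectors are linearly independent.
    rel = points[1:] - points[0]
    return np.linalg.matrix_rank(rel) == len(rel)

print(shatterable(np.array([[0., 0.], [1., 0.], [0., 1.]])))  # True
print(shatterable(np.array([[0., 0.], [1., 1.], [2., 2.]])))  # False: collinear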
Theorem 4: If the data is $d$-dimensional (i.e. $\mathcal{L} = \mathbf{R}^d$), the dimension of the minimal embedding space, for homogeneous polynomial kernels of degree $p$ ($K(\mathbf{x}_1, \mathbf{x}_2) = (\mathbf{x}_1 \cdot \mathbf{x}_2)^p$, $\mathbf{x}_1, \mathbf{x}_2 \in \mathbf{R}^d$), is $\binom{d+p-1}{p}$.

Proof: First we show that the number of components of $\Phi(\mathbf{x})$ is $\binom{p+d-1}{p}$. Label the components of $\Phi$ as in Eq. (79). Then a component is uniquely identified by the choice of the $d$ integers $r_i \ge 0$, $\sum_{i=1}^d r_i = p$. Now consider $p$ objects distributed amongst $d - 1$ partitions (numbered 1 through $d - 1$), such that objects are allowed to be to the left of all partitions, or to the right of all partitions. Suppose $m$ objects fall between partitions $q$ and $q + 1$. Let this correspond to a term $x_{q+1}^m$ in the product in Eq. (79). Similarly, $m$ objects falling to the left of all partitions corresponds to a term $x_1^m$, and $m$ objects falling to the right of all partitions corresponds to a term $x_d^m$. Thus the number of distinct terms of the form $x_1^{r_1} x_2^{r_2} \cdots x_d^{r_d}$, $\sum_{i=1}^d r_i = p$, $r_i \ge 0$ is the number of ways of distributing the objects and partitions amongst themselves, modulo permutations of the partitions and permutations of the objects, which is $\binom{p+d-1}{p}$.
Next we must show that the set of vectors with components $\Phi_{r_1 r_2 \cdots r_d}(\mathbf{x})$ span the space $\mathcal{H}$. This follows from the fact that the components of $\Phi(\mathbf{x})$ are linearly independent functions. For suppose instead that the image of $\Phi$ acting on $\mathbf{x} \in \mathcal{L}$ is a subspace of $\mathcal{H}$. Then there exists a fixed nonzero vector $\mathbf{V} \in \mathcal{H}$ such that

$\sum_{i=1}^{\dim(\mathcal{H})} V_i \Phi_i(\mathbf{x}) = 0 \quad \forall \mathbf{x} \in \mathcal{L}$.  (A.5)

Using the labeling introduced above, consider a particular component of $\Phi$:

$\Phi_{r_1 r_2 \cdots r_d}(\mathbf{x}), \quad \sum_{i=1}^d r_i = p$.  (A.6)

Since Eq. (A.5) holds for all $\mathbf{x}$, and since the mapping $\Phi$ in Eq. (79) certainly has all derivatives defined, we can apply the operator

$\left(\frac{\partial}{\partial x_1}\right)^{r_1} \cdots \left(\frac{\partial}{\partial x_d}\right)^{r_d}$  (A.7)

to Eq. (A.5), which will pick that one term with corresponding powers of the $x_i$ in Eq. (79), giving

$V_{r_1 r_2 \cdots r_d} = 0$.  (A.8)

Since this is true for all choices of $r_1, \cdots, r_d$ such that $\sum_{i=1}^d r_i = p$, every component of $\mathbf{V}$ must vanish. Hence the image of $\Phi$ acting on $\mathbf{x} \in \mathcal{L}$ spans $\mathcal{H}$.
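The count in Theorem 4 is the number of distinct degree-$p$ monomials in $d$ variables, which is easy to evaluate and to cross-check by direct enumeration (Python standard library; a sketch added for this edit):

from itertools import combinations_with_replacement
from math import comb

def embedding_dim(d, p):
    # Dimension of the minimal embedding space for K(x1, x2) = (x1.x2)^p on R^d.
    return comb(d + p - 1, p)

# Cross-check against a direct enumeration of the monomials x1^r1...xd^rd.
d, p = 4, 3
assert embedding_dim(d, p) == sum(1 for _ in combinations_with_replacement(range(d), p))
print(embedding_dim(256, 4))  # about 183 million: very large very quickly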
A.2. Gap Tolerant Classifiers and VC Bounds

The following point is central to the argument. One normally thinks of a collection of points as being "shattered" by a set of functions, if for any choice of labels for the points, a function from the set can be found which assigns those labels to the points. The VC dimension of that set of functions is then defined as the maximum number of points that can be so shattered. However, consider a slightly different definition. Let a set of points be shattered by a set of functions if for any choice of labels for the points, a function from the set can be found which assigns the incorrect labels to all the points. Again let the VC dimension of that set of functions be defined as the maximum number of points that can be so shattered.
It is in fact this second definition (which we adopt from here on) that enters the VC bound proofs (Vapnik, 1979; Devroye, Gyorfi and Lugosi, 1996). Of course for functions whose range is $\{\pm 1\}$ (i.e. all data will be assigned either positive or negative class), the two definitions are the same. However, if all points falling in some region are simply deemed to be "errors", or "correct", the two definitions are different. As a concrete example, suppose we define "gap intolerant classifiers", which are like gap tolerant classifiers, but which label all points lying in the margin or outside the sphere as errors. Consider again the situation in Figure 12, but assign positive class to all three points. Then a gap intolerant classifier with margin width greater than the ball diameter cannot shatter the points if we use the first definition of "shatter", but can shatter the points if we use the second (correct) definition.

With this caveat in mind, we now outline how the VC bounds can apply to functions with range $\{\pm 1, 0\}$, where the label 0 means that the point is labeled "correct". (The bounds will also apply to functions where 0 is defined to mean "error", but the corresponding VC dimension will be higher, weakening the bound, and in our case, making it useless). We will follow the notation of (Devroye, Gyorfi and Lugosi, 1996).
Consider points $\mathbf{x} \in \mathbf{R}^d$, and let $p(\mathbf{x})$ denote a density on $\mathbf{R}^d$. Let $\phi$ be a function on $\mathbf{R}^d$ with range $\{\pm 1, 0\}$, and let $\Phi$ be a set of such functions. Let each $\mathbf{x}$ have an associated label $y_{\mathbf{x}} \in \{\pm 1\}$. Let $\{\mathbf{x}_1, \cdots, \mathbf{x}_n\}$ be any finite number of points in $\mathbf{R}^d$: then we require $\Phi$ to have the property that there exists at least one $\phi \in \Phi$ such that $\phi(\mathbf{x}_i) \in \{\pm 1\} \;\forall \mathbf{x}_i$. For given $\phi$, define the set of points $A$ by

$A = \{\mathbf{x} : y_{\mathbf{x}} = 1, \phi(\mathbf{x}) = -1\} \cup \{\mathbf{x} : y_{\mathbf{x}} = -1, \phi(\mathbf{x}) = 1\}$  (A.9)

that is, the set of points which $\phi$ labels incorrectly. Let $\mathcal{A}$ denote the set of all such $A$ as $\phi$ ranges over $\Phi$, and let $N_{\mathcal{A}}(\mathbf{x}_1, \cdots, \mathbf{x}_n)$ denote the number of distinct subsets of $\{\mathbf{x}_1, \cdots, \mathbf{x}_n\}$ of the form $\{\mathbf{x}_1, \cdots, \mathbf{x}_n\} \cap A$, $A \in \mathcal{A}$. The $n$-th shatter coefficient of $\mathcal{A}$ is defined

$s(\mathcal{A}, n) = \max_{\mathbf{x}_1, \cdots, \mathbf{x}_n \in \{\mathbf{R}^d\}^n} N_{\mathcal{A}}(\mathbf{x}_1, \cdots, \mathbf{x}_n)$.  (A.13)
We also define the VC dimension for the class $\mathcal{A}$ to be the maximum integer $k \ge 1$ for which $s(\mathcal{A}, k) = 2^k$.

Theorem 8 (adapted from Devroye, Gyorfi and Lugosi, 1996, Theorem 12.6): Given $\nu_n(\phi)$ and $\nu(\phi)$ (the empirical and actual risks of $\phi$, i.e. the fraction of the points $\mathbf{x}_i$ lying in the error set $A$, and the probability measure of $A$, respectively) and $s(\mathcal{A}, n)$ defined above, and given $n$ points $(\mathbf{x}_1, \ldots, \mathbf{x}_n) \in \mathbf{R}^d$, let $\Phi'$ denote that subset of $\Phi$ such that all $\phi \in \Phi'$ satisfy $\phi(\mathbf{x}_i) \in \{\pm 1\} \;\forall \mathbf{x}_i$. (This restriction may be viewed as part of the training algorithm). Then for any such $\phi$,

$P\{|\nu_n(\phi) - \nu(\phi)| > \epsilon\} \le 8 s(\mathcal{A}, n) e^{-n \epsilon^2 / 32}$.  (A.14)

The proof is exactly that of (Devroye, Gyorfi and Lugosi, 1996), Sections 12.3, 12.4 and 12.5, Theorems 12.5 and 12.6. We have dropped the "sup" to emphasize that this holds for any of the functions $\phi$. In particular, it holds for those $\phi$ which minimize the empirical error and for which all training data take the values $\{\pm 1\}$. Note however that the proof only holds for the second definition of shattering given above. Finally, note that the usual form of the VC bounds is easily derived from Eq. (A.14) by using $s(\mathcal{A}, n) \le (en/h)^h$ (where $h$ is the VC dimension) (Vapnik, 1995), setting $\eta = 8 s(\mathcal{A}, n) \exp(-n \epsilon^2 / 32)$, and solving for $\epsilon$.
Clearly these results apply to our gap tolerant classifiers of Section 7.1. For them, a particular classifier $\phi \in \Phi$ is specified by a set of parameters $\{B, H, M\}$, where $B$ is a ball in $\mathbf{R}^d$, $D \in \mathbf{R}$ is the diameter of $B$, $H$ is a $d - 1$ dimensional oriented hyperplane in $\mathbf{R}^d$, and $M \in \mathbf{R}$ is a scalar which we have called the margin. $H$ itself is specified by its normal (whose direction specifies which points $H_+$ ($H_-$) are labeled positive (negative) by the function), and by the minimal distance from $H$ to the origin. For a given $\phi \in \Phi$, the margin set $S_M$ is defined as the set consisting of those points whose minimal distance to $H$ is less than $M/2$. Define $Z \equiv \bar{S}_M \cap B$ (the set of points in the ball but not in the margin set), $Z_+ \equiv Z \cap H_+$, and $Z_- \equiv Z \cap H_-$. The function $\phi$ is then defined as follows: $\phi(\mathbf{x}) = \pm 1$ for $\mathbf{x} \in Z_\pm$, and $\phi(\mathbf{x}) = 0$ otherwise.
Notes

4. Given the name "test set", perhaps we should also use "train set"; but the hobbyists got there first.
5. We use the term "oriented hyperplane" to emphasize that the mathematical object considered is the pair $\{H, \mathbf{n}\}$, where $H$ is the set of points which lie in the hyperplane and $\mathbf{n}$ is a particular choice for the unit normal. Thus $\{H, \mathbf{n}\}$ and $\{H, -\mathbf{n}\}$ are different oriented hyperplanes.
6. Such a set of $m$ points (which span an $m - 1$ dimensional subspace of a linear space) are said to be "in general position" (Kolmogorov, 1970). The convex hull of a set of $m$ points in general position defines an $m - 1$ dimensional simplex, the vertices of which are the points themselves.
7. The derivation of the bound assumes that the empirical risk converges uniformly to the actual risk as the number of training observations increases (Vapnik, 1979). A necessary and sufficient condition for this is that $\lim_{l \to \infty} H(l)/l = 0$, where $l$ is the number of training samples and $H(l)$ is the VC entropy of the set of decision functions (Vapnik, 1979; Vapnik, 1995). For any set of functions with infinite VC dimension, the VC entropy is $l \log 2$: hence for these classifiers, the required uniform convergence does not hold, and so neither does the bound.
8. There is a nice geometric interpretation for the dual problem: it is basically finding the two closest points of convex hulls of the two sets. See (Bennett and Bredensteiner, 1998).
9. One can define the torque to be

$\tau_{\alpha_1 \cdots \alpha_{n-2}} = \epsilon_{\alpha_1 \cdots \alpha_n} x_{\alpha_{n-1}} F_{\alpha_n}$  (A.16)

where repeated indices are summed over on the right hand side, and where $\epsilon$ is the totally antisymmetric tensor with $\epsilon_{1 \cdots n} = 1$. (Recall that Greek indices are used to denote tensor components). The sum of torques on the decision sheet is then:

$\sum_i \epsilon_{\alpha_1 \cdots \alpha_n} s_{i \alpha_{n-1}} F_{i \alpha_n} = \sum_i \epsilon_{\alpha_1 \cdots \alpha_n} s_{i \alpha_{n-1}} \alpha_i y_i \hat{w}_{\alpha_n} = \epsilon_{\alpha_1 \cdots \alpha_n} w_{\alpha_{n-1}} \hat{w}_{\alpha_n} = 0$  (A.17)
10. In the original formulation (Vapnik, 1979) they were called "extreme vectors."
11. By "decision function" we mean a function $f(\mathbf{x})$ whose sign represents the class assigned to data point $\mathbf{x}$.
12. By "intrinsic dimension" we mean the number of parameters required to specify a point on the manifold.
13. Alternatively one can argue that, given the form of the solution, the possible $\mathbf{w}$ must lie in a subspace of dimension $l$.
14. Work in preparation.
15. Thanks to A. Smola for pointing this out.
16. Many thanks to one of the reviewers for pointing this out.
17. The core quadratic optimizer is about 700 lines of C++. The higher level code (to handle caching of dot products, chunking, IO, etc.) is quite complex and considerably larger.
19. Re
all that the \
eiling" sign de means \smallest integer greater than or equal to." Also, there
is a typo in the a
tual formula given in (Vapnik, 1995), whi
h I have
orre
ted here.
20. Note, for example, that the distan
e between every pair of verti
es of the symmetri
simplex
is the same: see Eq. (26). However, a rigorous proof is needed, and as far as I know is la
king.
21. Thanks to J. Shawe-Taylor for pointing this out.
22. V. Vapnik, Private Communi
ation.
23. There is an alternative bound one might use, namely that
orresponding to the set of totally
bounded non-negative fun
tions (Equation (3.28) in (Vapnik, 1995)). However, for loss fun
-
tions taking the value zero or one, and if the empiri
al risk is zero, this bound is looser than
that in Eq. (3) whenever h(log(2l=h)+1)
l
log(=4)
> 1=16, whi
h is the
ase here.
24. V. Blanz, Private Communi
ation
References

M.A. Aizerman, E.M. Braverman, and L.I. Rozoner. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837, 1964.
M. Anthony and N. Biggs. PAC learning and neural networks. In The Handbook of Brain Theory and Neural Networks, pages 694-697, 1995.
K.P. Bennett and E. Bredensteiner. Geometry in learning. In Geometry at Work, page to appear, Washington, D.C., 1998. Mathematical Association of America.
C.M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
V. Blanz, B. Scholkopf, H. Bulthoff, C. Burges, V. Vapnik, and T. Vetter. Comparison of view-based object recognition algorithms using realistic 3d models. In C. von der Malsburg, W. von Seelen, J. C. Vorbruggen, and B. Sendhoff, editors, Artificial Neural Networks - ICANN'96, pages 251-256, Berlin, 1996. Springer Lecture Notes in Computer Science, Vol. 1112.
B. E. Boser, I. M. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, 1992. ACM.
James R. Bunch and Linda Kaufman. Some stable methods for calculating inertia and solving symmetric linear systems. Mathematics of Computation, 31(137):163-179, 1977.
James R. Bunch and Linda Kaufman. A computational method for the indefinite quadratic programming problem. Linear Algebra and its Applications, 34:341-370, 1980.
C. J. C. Burges and B. Scholkopf. Improving the accuracy and speed of support vector learning machines. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 375-381, Cambridge, MA, 1997. MIT Press.
C.J.C. Burges. Simplified support vector decision rules. In Lorenza Saitta, editor, Proceedings of the Thirteenth International Conference on Machine Learning, pages 71-77, Bari, Italy, 1996. Morgan Kaufman.
C.J.C. Burges. Geometry and invariance in kernel based methods. In B. Scholkopf, C.J.C. Burges, and A.J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 89-116. MIT Press, 1999.
C.J.C. Burges, P. Knirsch, and R. Haratsch. Support vector web page: http://svm.research.bell-labs.com. Technical report, Lucent Technologies, 1996.
C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.
R. Courant and D. Hilbert. Methods of Mathematical Physics. Interscience, 1953.
Luc Devroye, Laszlo Gyorfi, and Gabor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer Verlag, Applications of Mathematics Vol. 31, 1996.
H. Drucker, C.J.C. Burges, L. Kaufman, A. Smola, and V. Vapnik. Support vector regression machines. Advances in Neural Information Processing Systems, 9:155-161, 1997.
R. Fletcher. Practical Methods of Optimization. John Wiley and Sons, Inc., 2nd edition, 1987.
S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias / variance dilemma. Neural Computation, 4:1-58, 1992.
F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation (to appear); CBCL AI Memo 1606, MIT, 1998.
I. Guyon, V. Vapnik, B. Boser, L. Bottou, and S.A. Solla. Structural risk minimization for character recognition. Advances in Neural Information Processing Systems, 4:471-479, 1992.
P.R. Halmos. A Hilbert Space Problem Book. D. Van Nostrand Company, Inc., 1967.
Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
T. Joachims. Text categorization with support vector machines. Technical report, LS VIII Number 23, University of Dortmund, 1997. ftp://ftp-ai.informatik.uni-dortmund.de/pub/Reports/report23.ps.Z.
L. Kaufman. Solving the QP problem for support vector training. In Proceedings of the 1997 NIPS Workshop on Support Vector Machines (to appear), 1998.
A.N. Kolmogorov and S.V. Fomin. Introductory Real Analysis. Prentice-Hall, Inc., 1970.
O.L. Mangasarian. Nonlinear Programming. McGraw Hill, New York, 1969.
Garth P. McCormick. Non Linear Programming: Theory, Algorithms and Applications. John Wiley and Sons, Inc., 1983.
D.C. Montgomery and E.A. Peck. Introduction to Linear Regression Analysis. John Wiley and Sons, Inc., 2nd edition, 1992.
Jorge J. More and Stephen J. Wright. Optimization Software Guide. SIAM, 1993.
Jorge J. More and Gerardo Toraldo. On the solution of large quadratic programming problems with bound constraints. SIAM J. Optimization, 1(1):93-113, 1991.
S. Mukherjee, E. Osuna, and F. Girosi. Nonlinear prediction of chaotic time series using a support vector machine. In Proceedings of the IEEE Workshop on Neural Networks for Signal Processing 7, pages 511-519, Amelia Island, FL, 1997.
K.-R. Muller, A. Smola, G. Ratsch, B. Scholkopf, J. Kohlmorgen, and V. Vapnik. Predicting time series with support vector machines. In Proceedings, International Conference on Artificial Neural Networks, page 999. Springer Lecture Notes in Computer Science, 1997.
Edgar Osuna, Robert Freund, and Federico Girosi. An improved training algorithm for support vector machines. In Proceedings of the 1997 IEEE Workshop on Neural Networks for Signal Processing, Eds. J. Principe, L. Giles, N. Morgan, E. Wilson, pages 276-285, Amelia Island, FL, 1997.
Edgar Osuna, Robert Freund, and Federico Girosi. Training support vector machines: an application to face detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 130-136, 1997.
Edgar Osuna and Federico Girosi. Reducing the run-time complexity of support vector machines. In International Conference on Pattern Recognition (submitted), 1998.
William H. Press, Brian P. Flannery, Saul A. Teukolsky, and William T. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 2nd edition, 1992.
M. Schmidt. Identifying speaker with support vector networks. In Interface '96 Proceedings, Sydney, 1996.
B. Scholkopf. Support Vector Learning. R. Oldenbourg Verlag, Munich, 1997.
B. Scholkopf, C. Burges, and V. Vapnik. Extracting support data for a given task. In U. M. Fayyad and R. Uthurusamy, editors, Proceedings, First International Conference on Knowledge Discovery & Data Mining. AAAI Press, Menlo Park, CA, 1995.
B. Scholkopf, C. Burges, and V. Vapnik. Incorporating invariances in support vector learning machines. In C. von der Malsburg, W. von Seelen, J. C. Vorbruggen, and B. Sendhoff, editors, Artificial Neural Networks - ICANN'96, pages 47-52, Berlin, 1996. Springer Lecture Notes in Computer Science, Vol. 1112.
B. Scholkopf, P. Simard, A. Smola, and V. Vapnik. Prior knowledge in support vector kernels. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information Processing Systems 10, Cambridge, MA, 1998. MIT Press. In press.
B. Scholkopf, A. Smola, and K.-R. Muller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 1998. In press.
B. Scholkopf, A. Smola, K.-R. Muller, C. Burges, and V. Vapnik. Support vector methods in learning and feature extraction. Australian Journal of Intelligent Information Processing Systems, 5:3-9, 1998. Special issue with selected papers of ACNN'98.
B. Scholkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing support vector machines with gaussian kernels to radial basis function classifiers. IEEE Trans. Sign. Processing, 45:2758-2765, 1997.
John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. A framework for structural risk minimization. In Proceedings, 9th Annual Conference on Computational Learning Theory, pages 68-76, 1996.
John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. Structural risk minimization over data-dependent hierarchies. Technical report, NeuroCOLT Technical Report NC-TR-96-053, 1996.
A. Smola, B. Scholkopf, and K.-R. Muller. General cost functions for support vector regression. In Ninth Australian Congress on Neural Networks (to appear), 1998.
A.J. Smola and B. Scholkopf. On a kernel-based method for pattern recognition, regression, approximation and operator inversion. Algorithmica, 22:211-231, 1998.
Alex J. Smola, Bernhard Scholkopf, and Klaus-Robert Muller. The connection between regularization operators and support vector kernels. Neural Networks, 11:637-649, 1998.
M. O. Stitson, A. Gammerman, V. Vapnik, V. Vovk, C. Watkins, and J. Weston. Support vector ANOVA decomposition. Technical report, Royal Holloway College, Report number CSD-TR-97-22, 1997.
Gilbert Strang. Introduction to Applied Mathematics. Wellesley-Cambridge Press, 1986.
R. J. Vanderbei. Interior point methods: Algorithms and formulations. ORSA J. Computing, 6(1):32-34, 1994.
R.J. Vanderbei. LOQO: An interior point code for quadratic programming. Technical report, Program in Statistics & Operations Research, Princeton University, 1994.
V. Vapnik. Estimation of Dependences Based on Empirical Data [in Russian]. Nauka, Moscow, 1979. (English translation: Springer Verlag, New York, 1982).
V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
V. Vapnik. Statistical Learning Theory. John Wiley and Sons, Inc., New York, 1998.
Grace Wahba. Support vector machines, reproducing kernel hilbert spaces and the GACV. In Proceedings of the 1997 NIPS Workshop on Support Vector Machines (to appear). MIT Press, 1998.
J. Weston, A. Gammerman, M. O. Stitson, V. Vapnik, V. Vovk, and C. Watkins. Density estimation using support vector machines. Technical report, Royal Holloway College, Report number CSD-TR-97-23, 1997.