
A Tutorial on Support Vector Machines for Pattern Recognition

CHRISTOPHER J.C. BURGES    burges@microsoft.com
Microsoft Research (formerly Lucent Technologies)

Abstract. The tutorial starts with an overview of the concepts of VC dimension and structural risk minimization. We then describe linear Support Vector Machines (SVMs) for separable and non-separable data, working through a non-trivial example in detail. We describe a mechanical analogy, and discuss when SVM solutions are unique and when they are global. We describe how support vector training can be practically implemented, and discuss in detail the kernel mapping technique which is used to construct SVM solutions which are nonlinear in the data. We show how Support Vector machines can have very large (even infinite) VC dimension by computing the VC dimension for homogeneous polynomial and Gaussian radial basis function kernels. While very high VC dimension would normally bode ill for generalization performance, and while at present there exists no theory which shows that good generalization performance is guaranteed for SVMs, there are several arguments which support the observed high accuracy of SVMs, which we review. Results of some experiments which were inspired by these arguments are also presented. We give numerous examples and proofs of most of the key theorems. There is new material, and I hope that the reader will find that even old material is cast in a fresh light.

Keywords: Support Vector Machines, Statistical Learning Theory, VC Dimension, Pattern Recognition

Appeared in: Data Mining and Knowledge Discovery 2, 121-167, 1998

1. Introduction

The purpose of this paper is to provide an introductory yet extensive tutorial on the basic ideas behind Support Vector Machines (SVMs). The books (Vapnik, 1995; Vapnik, 1998) contain excellent descriptions of SVMs, but they leave room for an account whose purpose from the start is to teach. Although the subject can be said to have started in the late seventies (Vapnik, 1979), it is only now receiving increasing attention, and so the time appears suitable for an introductory review. The tutorial dwells entirely on the pattern recognition problem. Many of the ideas there carry directly over to the cases of regression estimation and linear operator inversion, but space constraints precluded the exploration of these topics here.
The tutorial contains some new material. All of the proofs are my own versions, where I have placed a strong emphasis on their being both clear and self-contained, to make the material as accessible as possible. This was done at the expense of some elegance and generality: however generality is usually easily added once the basic ideas are clear. The longer proofs are collected in the Appendix.
By way of motivation, and to alert the reader to some of the literature, we summarize some recent applications and extensions of support vector machines. For the pattern recognition case, SVMs have been used for isolated handwritten digit recognition (Cortes and Vapnik, 1995; Schölkopf, Burges and Vapnik, 1995; Schölkopf, Burges and Vapnik, 1996; Burges and Schölkopf, 1997), object recognition (Blanz et al., 1996), speaker identification (Schmidt, 1996), charmed quark detection^1, face detection in images (Osuna, Freund and Girosi, 1997a), and text categorization (Joachims, 1997). For the regression estimation case, SVMs have been compared on benchmark time series prediction tests (Müller et al., 1997; Mukherjee, Osuna and Girosi, 1997), the Boston housing problem (Drucker et al., 1997), and (on artificial data) on the PET operator inversion problem (Vapnik, Golowich and Smola, 1996). In most of these cases, SVM generalization performance (i.e. error rates on test sets) either matches or is significantly better than that of competing methods. The use of SVMs for density estimation (Weston et al., 1997) and ANOVA decomposition (Stitson et al., 1997) has also been studied. Regarding extensions, the basic SVMs contain no prior knowledge of the problem (for example, a large class of SVMs for the image recognition problem would give the same results if the pixels were first permuted randomly (with each image suffering the same permutation), an act of vandalism that would leave the best performing neural networks severely handicapped) and much work has been done on incorporating prior knowledge into SVMs (Schölkopf, Burges and Vapnik, 1996; Schölkopf et al., 1998a; Burges, 1998). Although SVMs have good generalization performance, they can be abysmally slow in test phase, a problem addressed in (Burges, 1996; Osuna and Girosi, 1998). Recent work has generalized the basic ideas (Smola, Schölkopf and Müller, 1998a; Smola and Schölkopf, 1998), shown connections to regularization theory (Smola, Schölkopf and Müller, 1998b; Girosi, 1998; Wahba, 1998), and shown how SVM ideas can be incorporated in a wide range of other algorithms (Schölkopf, Smola and Müller, 1998b; Schölkopf et al, 1998c). The reader may also find the thesis of (Schölkopf, 1997) helpful.
The problem which drove the initial development of SVMs occurs in several guises - the bias variance tradeoff (Geman, Bienenstock and Doursat, 1992), capacity control (Guyon et al., 1992), overfitting (Montgomery and Peck, 1992) - but the basic idea is the same. Roughly speaking, for a given learning task, with a given finite amount of training data, the best generalization performance will be achieved if the right balance is struck between the accuracy attained on that particular training set, and the "capacity" of the machine, that is, the ability of the machine to learn any training set without error. A machine with too much capacity is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything she has seen before; a machine with too little capacity is like the botanist's lazy brother, who declares that if it's green, it's a tree. Neither can generalize well. The exploration and formalization of these concepts has resulted in one of the shining peaks of the theory of statistical learning (Vapnik, 1979).
In the following, bold typeface will indicate vector or matrix quantities; normal typeface will be used for vector and matrix components and for scalars. We will label components of vectors and matrices with Greek indices, and label vectors and matrices themselves with Roman indices. Familiarity with the use of Lagrange multipliers to solve problems with equality or inequality constraints is assumed^2.

2. A Bound on the Generalization Performance of a Pattern Recognition Learning Machine
There is a remarkable family of bounds governing the relation between the capacity of a learning machine and its performance^3. The theory grew out of considerations of under what circumstances, and how quickly, the mean of some empirical quantity converges uniformly, as the number of data points increases, to the true mean (that which would be calculated from an infinite amount of data) (Vapnik, 1979). Let us start with one of these bounds.
The notation here will largely follow that of (Vapnik, 1995). Suppose we are given $l$ observations. Each observation consists of a pair: a vector $\mathbf{x}_i \in \mathbf{R}^n$, $i = 1, \ldots, l$ and the associated "truth" $y_i$, given to us by a trusted source. In the tree recognition problem, $\mathbf{x}_i$ might be a vector of pixel values (e.g. $n = 256$ for a 16x16 image), and $y_i$ would be 1 if the image contains a tree, and -1 otherwise (we use -1 here rather than 0 to simplify subsequent formulae). Now it is assumed that there exists some unknown probability distribution $P(\mathbf{x}, y)$ from which these data are drawn, i.e., the data are assumed "iid" (independently drawn and identically distributed). (We will use $P$ for cumulative probability distributions, and $p$ for their densities). Note that this assumption is more general than associating a fixed $y$ with every $\mathbf{x}$: it allows there to be a distribution of $y$ for a given $\mathbf{x}$. In that case, the trusted source would assign labels $y_i$ according to a fixed distribution, conditional on $\mathbf{x}_i$. However, after this Section, we will be assuming fixed $y$ for given $\mathbf{x}$.
Now suppose we have a machine whose task it is to learn the mapping $\mathbf{x}_i \mapsto y_i$. The machine is actually defined by a set of possible mappings $\mathbf{x} \mapsto f(\mathbf{x}, \alpha)$, where the functions $f(\mathbf{x}, \alpha)$ themselves are labeled by the adjustable parameters $\alpha$. The machine is assumed to be deterministic: for a given input $\mathbf{x}$, and choice of $\alpha$, it will always give the same output $f(\mathbf{x}, \alpha)$. A particular choice of $\alpha$ generates what we will call a "trained machine." Thus, for example, a neural network with fixed architecture, with $\alpha$ corresponding to the weights and biases, is a learning machine in this sense.
The expectation of the test error for a trained machine is therefore:

$$R(\alpha) = \int \tfrac{1}{2} |y - f(\mathbf{x}, \alpha)| \, dP(\mathbf{x}, y) \qquad (1)$$

Note that, when a density $p(\mathbf{x}, y)$ exists, $dP(\mathbf{x}, y)$ may be written $p(\mathbf{x}, y)\,d\mathbf{x}\,dy$. This is a nice way of writing the true mean error, but unless we have an estimate of what $P(\mathbf{x}, y)$ is, it is not very useful.
The quantity $R(\alpha)$ is called the expected risk, or just the risk. Here we will call it the actual risk, to emphasize that it is the quantity that we are ultimately interested in. The "empirical risk" $R_{emp}(\alpha)$ is defined to be just the measured mean error rate on the training set (for a fixed, finite number of observations)^4:

$$R_{emp}(\alpha) = \frac{1}{2l} \sum_{i=1}^{l} |y_i - f(\mathbf{x}_i, \alpha)|. \qquad (2)$$

Note that no probability distribution appears here. $R_{emp}(\alpha)$ is a fixed number for a particular choice of $\alpha$ and for a particular training set $\{\mathbf{x}_i, y_i\}$.
The quantity $\frac{1}{2}|y_i - f(\mathbf{x}_i, \alpha)|$ is called the loss. For the case described here, it can only take the values 0 and 1. Now choose some $\eta$ such that $0 \le \eta \le 1$. Then for losses taking these values, with probability $1 - \eta$, the following bound holds (Vapnik, 1995):

$$R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h(\log(2l/h) + 1) - \log(\eta/4)}{l}} \qquad (3)$$

where $h$ is a non-negative integer called the Vapnik Chervonenkis (VC) dimension, and is a measure of the notion of capacity mentioned above. In the following we will call the right hand side of Eq. (3) the "risk bound." We depart here from some previous nomenclature: the authors of (Guyon et al., 1992) call it the "guaranteed risk", but this is something of a misnomer, since it is really a bound on a risk, not a risk, and it holds only with a certain probability, and so is not guaranteed. The second term on the right hand side is called the "VC confidence."
We note three key points about this bound. First, remarkably, it is independent of $P(\mathbf{x}, y)$. It assumes only that both the training data and the test data are drawn independently according to some $P(\mathbf{x}, y)$. Second, it is usually not possible to compute the left hand side. Third, if we know $h$, we can easily compute the right hand side. Thus given several different learning machines (recall that "learning machine" is just another name for a family of functions $f(\mathbf{x}, \alpha)$), and choosing a fixed, sufficiently small $\eta$, by then taking that machine which minimizes the right hand side, we are choosing that machine which gives the lowest upper bound on the actual risk. This gives a principled method for choosing a learning machine for a given task, and is the essential idea of structural risk minimization (see Section 2.6). Given a fixed family of learning machines to choose from, to the extent that the bound is tight for at least one of the machines, one will not be able to do better than this. To the extent that the bound is not tight for any, the hope is that the right hand side still gives useful information as to which learning machine minimizes the actual risk. The bound not being tight for the whole chosen family of learning machines gives critics a justifiable target at which to fire their complaints. At present, for this case, we must rely on experiment to be the judge.

2.1. The VC Dimension

The VC dimension is a property of a set of functions $\{f(\alpha)\}$ (again, we use $\alpha$ as a generic set of parameters: a choice of $\alpha$ specifies a particular function), and can be defined for various classes of function $f$. Here we will only consider functions that correspond to the two-class pattern recognition case, so that $f(\mathbf{x}, \alpha) \in \{-1, 1\}$ $\forall \mathbf{x}, \alpha$. Now if a given set of $l$ points can be labeled in all possible $2^l$ ways, and for each labeling, a member of the set $\{f(\alpha)\}$ can be found which correctly assigns those labels, we say that that set of points is shattered by that set of functions. The VC dimension for the set of functions $\{f(\alpha)\}$ is defined as the maximum number of training points that can be shattered by $\{f(\alpha)\}$. Note that, if the VC dimension is $h$, then there exists at least one set of $h$ points that can be shattered, but in general it will not be true that every set of $h$ points can be shattered.

2.2. Shattering Points with Oriented Hyperplanes in $\mathbf{R}^n$

Suppose that the space in which the data live is $\mathbf{R}^2$, and the set $\{f(\alpha)\}$ consists of oriented straight lines, so that for a given line, all points on one side are assigned the class 1, and all points on the other side, the class $-1$. The orientation is shown in Figure 1 by an arrow, specifying on which side of the line points are to be assigned the label 1. While it is possible to find three points that can be shattered by this set of functions, it is not possible to find four. Thus the VC dimension of the set of oriented lines in $\mathbf{R}^2$ is three.
Let's now consider hyperplanes in $\mathbf{R}^n$. The following theorem will prove useful (the proof is in the Appendix):

Theorem 1 Consider some set of $m$ points in $\mathbf{R}^n$. Choose any one of the points as origin. Then the $m$ points can be shattered by oriented hyperplanes^5 if and only if the position vectors of the remaining points are linearly independent^6.

Corollary: The VC dimension of the set of oriented hyperplanes in $\mathbf{R}^n$ is $n + 1$, since we can always choose $n + 1$ points, and then choose one of the points as origin, such that the position vectors of the remaining $n$ points are linearly independent, but can never choose $n + 2$ such points (since no $n + 1$ vectors in $\mathbf{R}^n$ can be linearly independent).
An alternative proof of the corollary can be found in (Anthony and Biggs, 1995), and references therein.
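
To make the shattering argument concrete, the brute-force check below (a minimal sketch of my own, not from the paper) enumerates all labelings of a small point set and asks, for each, whether some oriented line can realize it; a linear SVM with a very large C (using scikit-learn, an incidental choice) serves as the separability test. The particular point sets are illustrative only.

```python
import itertools
import numpy as np
from sklearn.svm import SVC

def shatterable_by_lines(points):
    """Return True if every labeling of `points` is realizable by an oriented line."""
    points = np.asarray(points, dtype=float)
    for labels in itertools.product([-1, 1], repeat=len(points)):
        if len(set(labels)) == 1:
            continue  # a single-class labeling is trivially realizable
        clf = SVC(kernel="linear", C=1e9)   # near-hard-margin linear classifier
        clf.fit(points, labels)
        if not np.array_equal(clf.predict(points), labels):
            return False  # found a labeling no line can produce
    return True

three_points = [[0, 0], [1, 0], [0, 1]]           # not collinear: all 8 labelings realizable
four_points = [[0, 0], [1, 0], [0, 1], [1, 1]]    # the XOR labeling defeats any line
print(shatterable_by_lines(three_points))  # True
print(shatterable_by_lines(four_points))   # False
```
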

Figure 1. Three points in $\mathbf{R}^2$, shattered by oriented lines.

2.3. The VC Dimension and the Number of Parameters

The VC dimension thus gives concreteness to the notion of the capacity of a given set of functions. Intuitively, one might be led to expect that learning machines with many parameters would have high VC dimension, while learning machines with few parameters would have low VC dimension. There is a striking counterexample to this, due to E. Levin and J.S. Denker (Vapnik, 1995): A learning machine with just one parameter, but with infinite VC dimension (a family of classifiers is said to have infinite VC dimension if it can shatter $l$ points, no matter how large $l$). Define the step function $\theta(x)$, $x \in \mathbf{R}$: $\{\theta(x) = 1 \;\forall x > 0;\ \theta(x) = -1 \;\forall x \le 0\}$. Consider the one-parameter family of functions, defined by

$$f(x, \alpha) \equiv \theta(\sin(\alpha x)), \quad x, \alpha \in \mathbf{R}. \qquad (4)$$

You choose some number $l$, and present me with the task of finding $l$ points that can be shattered. I choose them to be:

$$x_i = 10^{-i}, \quad i = 1, \ldots, l. \qquad (5)$$

You specify any labels you like:

$$y_1, y_2, \ldots, y_l, \quad y_i \in \{-1, 1\}. \qquad (6)$$

Then $f(\alpha)$ gives this labeling if I choose $\alpha$ to be

$$\alpha = \pi\left(1 + \sum_{i=1}^{l} \frac{(1 - y_i)10^i}{2}\right). \qquad (7)$$

Thus the VC dimension of this machine is infinite.
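
The choice of $\alpha$ in Eq. (7) is easy to check numerically; the short sketch below (my own illustration, keeping $l$ small so double-precision phases stay accurate) reproduces an arbitrary labeling of the points $x_i = 10^{-i}$.

```python
import numpy as np

rng = np.random.default_rng(0)
l = 6                                   # number of points to shatter (small, for float accuracy)
x = 10.0 ** -np.arange(1, l + 1)        # x_i = 10^{-i}, Eq. (5)
y = rng.choice([-1, 1], size=l)         # an arbitrary labeling, Eq. (6)

# Eq. (7): alpha = pi * (1 + sum_i (1 - y_i) 10^i / 2)
alpha = np.pi * (1 + np.sum((1 - y) * 10.0 ** np.arange(1, l + 1) / 2))

predicted = np.where(np.sin(alpha * x) > 0, 1, -1)    # theta(sin(alpha x))
print(y, predicted, np.array_equal(y, predicted))     # the labels are reproduced exactly
```
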
Interestingly, even though we can shatter an arbitrarily large number of points, we can also find just four points that cannot be shattered. They simply have to be equally spaced, and assigned labels as shown in Figure 2. This can be seen as follows: Write the phase at $x_1$ as $\phi_1 = 2n\pi + \delta$. Then the choice of label $y_1 = 1$ requires $0 < \delta < \pi$. The phase at $x_2$, mod $2\pi$, is $2\delta$; then $y_2 = 1 \Rightarrow 0 < \delta < \pi/2$. Similarly, point $x_3$ forces $\delta > \pi/3$. Then at $x_4$, $\pi/3 < \delta < \pi/2$ implies that $f(x_4, \alpha) = -1$, contrary to the assigned label. These four points are the analogy, for the set of functions in Eq. (4), of the set of three points lying along a line, for oriented hyperplanes in $\mathbf{R}^n$. Neither set can be shattered by the chosen family of functions.


Figure 2. Four points that cannot be shattered by $\theta(\sin(\alpha x))$, despite infinite VC dimension.

2.4. Minimizing The Bound by Minimizing h

Figure 3. VC confidence is monotonic in h (plotted against h/l = VC Dimension / Sample Size).

Figure 3 shows how the second term on the right hand side of Eq. (3) varies with $h$, given a choice of 95% confidence level ($\eta = 0.05$) and assuming a training sample of size 10,000. The VC confidence is a monotonic increasing function of $h$. This will be true for any value of $l$.
Thus, given some selection of learning machines whose empirical risk is zero, one wants to choose that learning machine whose associated set of functions has minimal VC dimension. This will lead to a better upper bound on the actual error. In general, for non zero empirical risk, one wants to choose that learning machine which minimizes the right hand side of Eq. (3).
Note that in adopting this strategy, we are only using Eq. (3) as a guide. Eq. (3) gives (with some chosen probability) an upper bound on the actual risk. This does not prevent a particular machine with the same value for empirical risk, and whose function set has higher VC dimension, from having better performance. In fact an example of a system that gives good performance despite having infinite VC dimension is given in the next Section. Note also that the graph shows that for $h/l > 0.37$ (and for $\eta = 0.05$ and $l = 10{,}000$), the VC confidence exceeds unity, and so for higher values the bound is guaranteed not tight.
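
The VC confidence term is simple to evaluate directly; the fragment below (my own sketch) reproduces the curve of Figure 3 and confirms where the confidence crosses unity.

```python
import numpy as np

def vc_confidence(h, l, eta=0.05):
    """Second term on the right hand side of Eq. (3)."""
    return np.sqrt((h * (np.log(2.0 * l / h) + 1.0) - np.log(eta / 4.0)) / l)

l = 10_000
ratios = np.linspace(0.01, 1.0, 100)     # h / l
conf = vc_confidence(ratios * l, l)

print(np.all(np.diff(conf) > 0))         # True: monotonic increasing in h
print(ratios[np.argmax(conf > 1.0)])     # first ratio where the confidence exceeds 1 (just above 0.37)
```
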
2.5. Two Examples

Consider the k'th nearest neighbour classifier, with $k = 1$. This set of functions has infinite VC dimension and zero empirical risk, since any number of points, labeled arbitrarily, will be successfully learned by the algorithm (provided no two points of opposite class lie right on top of each other). Thus the bound provides no information. In fact, for any classifier with infinite VC dimension, the bound is not even valid^7. However, even though the bound is not valid, nearest neighbour classifiers can still perform well. Thus this first example is a cautionary tale: infinite "capacity" does not guarantee poor performance.
Let's follow the time honoured tradition of understanding things by trying to break them, and see if we can come up with a classifier for which the bound is supposed to hold, but which violates the bound. We want the left hand side of Eq. (3) to be as large as possible, and the right hand side to be as small as possible. So we want a family of classifiers which gives the worst possible actual risk of 0.5, zero empirical risk up to some number of training observations, and whose VC dimension is easy to compute and is less than $l$ (so that the bound is non trivial). An example is the following, which I call the "notebook classifier." This classifier consists of a notebook with enough room to write down the classes of $m$ training observations, where $m \le l$. For all subsequent patterns, the classifier simply says that all patterns have the same class. Suppose also that the data have as many positive ($y = +1$) as negative ($y = -1$) examples, and that the samples are chosen randomly. The notebook classifier will have zero empirical risk for up to $m$ observations; 0.5 training error for all subsequent observations; 0.5 actual error, and VC dimension $h = m$. Substituting these values in Eq. (3), the bound becomes:

$$\frac{m}{4l} \le \ln(2l/m) + 1 - (1/m)\ln(\eta/4) \qquad (8)$$

which is certainly met for all $\eta$ if

$$f(z) = \frac{z}{2}\exp(z/4 - 1) \le 1, \quad z \equiv (m/l), \; 0 \le z \le 1 \qquad (9)$$

which is true, since $f(z)$ is monotonic increasing, and $f(z = 1) = 0.236$.
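
A quick numerical check of Eq. (9) (my own sketch) confirms that $f(z) = (z/2)\exp(z/4 - 1)$ stays below 1 on $[0, 1]$, so the notebook classifier never violates the bound:

```python
import numpy as np

z = np.linspace(0.0, 1.0, 1001)           # z = m / l
f = (z / 2.0) * np.exp(z / 4.0 - 1.0)     # Eq. (9)

print(np.all(np.diff(f) > 0))             # True: f is monotonic increasing on (0, 1]
print(f[-1])                              # 0.236...: the maximum, attained at z = 1
```
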
2.6. Structural Risk Minimization

We can now summarize the principle of structural risk minimization (SRM) (Vapnik, 1979). Note that the VC confidence term in Eq. (3) depends on the chosen class of functions, whereas the empirical risk and actual risk depend on the one particular function chosen by the training procedure. We would like to find that subset of the chosen set of functions, such that the risk bound for that subset is minimized. Clearly we cannot arrange things so that the VC dimension $h$ varies smoothly, since it is an integer. Instead, introduce a "structure" by dividing the entire class of functions into nested subsets (Figure 4). For each subset, we must be able either to compute $h$, or to get a bound on $h$ itself. SRM then consists of finding that subset of functions which minimizes the bound on the actual risk. This can be done by simply training a series of machines, one for each subset, where for a given subset the goal of training is simply to minimize the empirical risk. One then takes that trained machine in the series whose sum of empirical risk and VC confidence is minimal.
We have now laid the groundwork necessary to begin our exploration of support vector machines.
Figure 4. Nested subsets of functions, ordered by VC dimension ($h_1 < h_2 < h_3 < \ldots$).

3. Linear Support Vector Machines

3.1. The Separable Case

We will start with the simplest case: linear machines trained on separable data (as we shall see, the analysis for the general case - nonlinear machines trained on non-separable data - results in a very similar quadratic programming problem). Again label the training data $\{\mathbf{x}_i, y_i\}$, $i = 1, \ldots, l$, $y_i \in \{-1, 1\}$, $\mathbf{x}_i \in \mathbf{R}^d$. Suppose we have some hyperplane which separates the positive from the negative examples (a "separating hyperplane"). The points $\mathbf{x}$ which lie on the hyperplane satisfy $\mathbf{w} \cdot \mathbf{x} + b = 0$, where $\mathbf{w}$ is normal to the hyperplane, $|b|/\|\mathbf{w}\|$ is the perpendicular distance from the hyperplane to the origin, and $\|\mathbf{w}\|$ is the Euclidean norm of $\mathbf{w}$. Let $d_+$ ($d_-$) be the shortest distance from the separating hyperplane to the closest positive (negative) example. Define the "margin" of a separating hyperplane to be $d_+ + d_-$. For the linearly separable case, the support vector algorithm simply looks for the separating hyperplane with largest margin. This can be formulated as follows: suppose that all the training data satisfy the following constraints:

$$\mathbf{x}_i \cdot \mathbf{w} + b \ge +1 \quad \text{for } y_i = +1 \qquad (10)$$
$$\mathbf{x}_i \cdot \mathbf{w} + b \le -1 \quad \text{for } y_i = -1 \qquad (11)$$

These can be combined into one set of inequalities:

$$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \ge 0 \quad \forall i \qquad (12)$$

Now consider the points for which the equality in Eq. (10) holds (requiring that there exists such a point is equivalent to choosing a scale for $\mathbf{w}$ and $b$). These points lie on the hyperplane $H_1$: $\mathbf{x}_i \cdot \mathbf{w} + b = 1$ with normal $\mathbf{w}$ and perpendicular distance from the origin $|1 - b|/\|\mathbf{w}\|$. Similarly, the points for which the equality in Eq. (11) holds lie on the hyperplane $H_2$: $\mathbf{x}_i \cdot \mathbf{w} + b = -1$, with normal again $\mathbf{w}$, and perpendicular distance from the origin $|-1 - b|/\|\mathbf{w}\|$. Hence $d_+ = d_- = 1/\|\mathbf{w}\|$ and the margin is simply $2/\|\mathbf{w}\|$. Note that $H_1$ and $H_2$ are parallel (they have the same normal) and that no training points fall between them. Thus we can find the pair of hyperplanes which gives the maximum margin by minimizing $\|\mathbf{w}\|^2$, subject to constraints (12).
Thus we expect the solution for a typical two dimensional case to have the form shown in Figure 5. Those training points for which the equality in Eq. (12) holds (i.e. those which wind up lying on one of the hyperplanes $H_1$, $H_2$), and whose removal would change the solution found, are called support vectors; they are indicated in Figure 5 by the extra circles.
We will now switch to a Lagrangian formulation of the problem. There are two reasons for doing this. The first is that the constraints (12) will be replaced by constraints on the Lagrange multipliers themselves, which will be much easier to handle. The second is that in this reformulation of the problem, the training data will only appear (in the actual training and test algorithms) in the form of dot products between vectors. This is a crucial property which will allow us to generalize the procedure to the nonlinear case (Section 4).
Figure 5. Linear separating hyperplanes for the separable case. The support vectors are circled.

Thus, we introduce positive Lagrange multipliers $\alpha_i$, $i = 1, \ldots, l$, one for each of the inequality constraints (12). Recall that the rule is that for constraints of the form $c_i \ge 0$, the constraint equations are multiplied by positive Lagrange multipliers and subtracted from the objective function, to form the Lagrangian. For equality constraints, the Lagrange multipliers are unconstrained. This gives Lagrangian:

$$L_P \equiv \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{l} \alpha_i y_i(\mathbf{x}_i \cdot \mathbf{w} + b) + \sum_{i=1}^{l} \alpha_i \qquad (13)$$

We must now minimize $L_P$ with respect to $\mathbf{w}$, $b$, and simultaneously require that the derivatives of $L_P$ with respect to all the $\alpha_i$ vanish, all subject to the constraints $\alpha_i \ge 0$ (let's call this particular set of constraints $C_1$). Now this is a convex quadratic programming problem, since the objective function is itself convex, and those points which satisfy the constraints also form a convex set (any linear constraint defines a convex set, and a set of $N$ simultaneous linear constraints defines the intersection of $N$ convex sets, which is also a convex set). This means that we can equivalently solve the following "dual" problem: maximize $L_P$, subject to the constraints that the gradient of $L_P$ with respect to $\mathbf{w}$ and $b$ vanish, and subject also to the constraints that the $\alpha_i \ge 0$ (let's call that particular set of constraints $C_2$). This particular dual formulation of the problem is called the Wolfe dual (Fletcher, 1987). It has the property that the maximum of $L_P$, subject to constraints $C_2$, occurs at the same values of the $\mathbf{w}$, $b$ and $\alpha$, as the minimum of $L_P$, subject to constraints $C_1$^8.
Requiring that the gradient of $L_P$ with respect to $\mathbf{w}$ and $b$ vanish gives the conditions:

$$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i \qquad (14)$$
$$\sum_i \alpha_i y_i = 0. \qquad (15)$$

Since these are equality constraints in the dual formulation, we can substitute them into Eq. (13) to give

$$L_D = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i \cdot \mathbf{x}_j \qquad (16)$$

Note that we have now given the Lagrangian different labels ($P$ for primal, $D$ for dual) to emphasize that the two formulations are different: $L_P$ and $L_D$ arise from the same objective function but with different constraints; and the solution is found by minimizing $L_P$ or by maximizing $L_D$. Note also that if we formulate the problem with $b = 0$, which amounts to requiring that all hyperplanes contain the origin, the constraint (15) does not appear. This is a mild restriction for high dimensional spaces, since it amounts to reducing the number of degrees of freedom by one.
Support vector training (for the separable, linear case) therefore amounts to maximizing $L_D$ with respect to the $\alpha_i$, subject to constraints (15) and positivity of the $\alpha_i$, with solution given by (14). Notice that there is a Lagrange multiplier $\alpha_i$ for every training point. In the solution, those points for which $\alpha_i > 0$ are called "support vectors", and lie on one of the hyperplanes $H_1$, $H_2$. All other training points have $\alpha_i = 0$ and lie either on $H_1$ or $H_2$ (such that the equality in Eq. (12) holds), or on that side of $H_1$ or $H_2$ such that the strict inequality in Eq. (12) holds. For these machines, the support vectors are the critical elements of the training set. They lie closest to the decision boundary; if all other training points were removed (or moved around, but so as not to cross $H_1$ or $H_2$), and training was repeated, the same separating hyperplane would be found.
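
As a concrete illustration of the separable linear case (my own sketch, not from the paper; it uses scikit-learn's SVC with a very large C to approximate the hard-margin problem, and a small made-up data set), the snippet below trains on toy separable data, checks that the learned w matches the expansion of Eq. (14), and checks that the support vectors satisfy Eq. (12) with equality.

```python
import numpy as np
from sklearn.svm import SVC

# A small linearly separable training set in R^2.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.0],
              [-1.0, -1.0], [-2.0, 0.5], [0.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e9)     # huge C approximates the hard-margin problem
clf.fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors; Eq. (14): w = sum_i alpha_i y_i x_i
w_from_alphas = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_alphas, clf.coef_))     # True

# Support vectors lie on H1 or H2: y_i (w . x_i + b) = 1, i.e. Eq. (12) with equality.
margins = y[clf.support_] * (X[clf.support_] @ clf.coef_.ravel() + clf.intercept_)
print(np.round(margins, 3))                      # all approximately 1
```
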

3.2. The Karush-Kuhn-Tucker Conditions

The Karush-Kuhn-Tucker (KKT) conditions play a central role in both the theory and practice of constrained optimization. For the primal problem above, the KKT conditions may be stated (Fletcher, 1987):

$$\frac{\partial}{\partial w_\nu} L_P = w_\nu - \sum_i \alpha_i y_i x_{i\nu} = 0 \quad \nu = 1, \ldots, d \qquad (17)$$
$$\frac{\partial}{\partial b} L_P = -\sum_i \alpha_i y_i = 0 \qquad (18)$$
$$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \ge 0 \quad i = 1, \ldots, l \qquad (19)$$
$$\alpha_i \ge 0 \quad \forall i \qquad (20)$$
$$\alpha_i(y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1) = 0 \quad \forall i \qquad (21)$$

The KKT conditions are satisfied at the solution of any constrained optimization problem (convex or not), with any kind of constraints, provided that the intersection of the set of feasible directions with the set of descent directions coincides with the intersection of the set of feasible directions for linearized constraints with the set of descent directions (see Fletcher, 1987; McCormick, 1983). This rather technical regularity assumption holds for all support vector machines, since the constraints are always linear. Furthermore, the problem for SVMs is convex (a convex objective function, with constraints which give a convex feasible region), and for convex problems (if the regularity condition holds), the KKT conditions are necessary and sufficient for $\mathbf{w}$, $b$, $\alpha$ to be a solution (Fletcher, 1987). Thus solving the SVM problem is equivalent to finding a solution to the KKT conditions. This fact results in several approaches to finding the solution (for example, the primal-dual path following method mentioned in Section 5).
As an immediate application, note that, while $\mathbf{w}$ is explicitly determined by the training procedure, the threshold $b$ is not, although it is implicitly determined. However $b$ is easily found by using the KKT "complementarity" condition, Eq. (21), by choosing any $i$ for which $\alpha_i \ne 0$ and computing $b$ (note that it is numerically safer to take the mean value of $b$ resulting from all such equations).
Notice that all we've done so far is to cast the problem into an optimization problem where the constraints are rather more manageable than those in Eqs. (10), (11). Finding the solution for real world problems will usually require numerical methods. We will have more to say on this later. However, let's first work out a rare case where the problem is nontrivial (the number of dimensions is arbitrary, and the solution certainly not obvious), but where the solution can be found analytically.
3.3. Optimal Hyperplanes: An Example

While the main aim of this Section is to explore a non-trivial pattern recognition problem where the support vector solution can be found analytically, the results derived here will also be useful in a later proof. For the problem considered, every training point will turn out to be a support vector, which is one reason we can find the solution analytically.
Consider $n + 1$ symmetrically placed points lying on a sphere $S^{n-1}$ of radius $R$: more precisely, the points form the vertices of an $n$-dimensional symmetric simplex. It is convenient to embed the points in $\mathbf{R}^{n+1}$ in such a way that they all lie in the hyperplane which passes through the origin and which is perpendicular to the $(n+1)$-vector $(1, 1, \ldots, 1)$ (in this formulation, the points lie on $S^{n-1}$, they span $\mathbf{R}^n$, and are embedded in $\mathbf{R}^{n+1}$). Explicitly, recalling that vectors themselves are labeled by Roman indices and their coordinates by Greek, the coordinates are given by:

$$x_{i\mu} = -\frac{R}{\sqrt{n(n+1)}}\,(1 - \delta_{i,\mu}) + R\sqrt{\frac{n}{n+1}}\,\delta_{i,\mu} \qquad (22)$$

where the Kronecker delta, $\delta_{i,\mu}$, is defined by $\delta_{i,\mu} = 1$ if $\mu = i$, 0 otherwise. Thus, for example, the vectors for three equidistant points on the unit circle (see Figure 12) are:

$$\mathbf{x}_1 = \left(\sqrt{\tfrac{2}{3}},\; -\tfrac{1}{\sqrt{6}},\; -\tfrac{1}{\sqrt{6}}\right)$$
$$\mathbf{x}_2 = \left(-\tfrac{1}{\sqrt{6}},\; \sqrt{\tfrac{2}{3}},\; -\tfrac{1}{\sqrt{6}}\right)$$
$$\mathbf{x}_3 = \left(-\tfrac{1}{\sqrt{6}},\; -\tfrac{1}{\sqrt{6}},\; \sqrt{\tfrac{2}{3}}\right) \qquad (23)$$

One consequence of the symmetry is that the angle between any pair of vectors is the same (and is equal to $\arccos(-1/n)$):

$$\|\mathbf{x}_i\|^2 = R^2 \qquad (24)$$
$$\mathbf{x}_i \cdot \mathbf{x}_j = -R^2/n \qquad (25)$$

or, more succinctly,

$$\frac{\mathbf{x}_i \cdot \mathbf{x}_j}{R^2} = \delta_{i,j} - \frac{(1 - \delta_{i,j})}{n}. \qquad (26)$$

Assigning a class label $C \in \{+1, -1\}$ arbitrarily to each point, we wish to find that hyperplane which separates the two classes with widest margin.

Thus we must maximize $L_D$ in Eq. (16), subject to $\alpha_i \ge 0$ and also subject to the equality constraint, Eq. (15). Our strategy is to simply solve the problem as though there were no inequality constraints. If the resulting solution does in fact satisfy $\alpha_i \ge 0 \; \forall i$, then we will have found the general solution, since the actual maximum of $L_D$ will then lie in the feasible region, provided the equality constraint, Eq. (15), is also met. In order to impose the equality constraint we introduce an additional Lagrange multiplier $\lambda$. Thus we seek to maximize

$$L_D \equiv \sum_{i=1}^{n+1} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n+1} \alpha_i H_{ij} \alpha_j - \lambda \sum_{i=1}^{n+1} \alpha_i y_i, \qquad (27)$$

where we have introduced the Hessian

$$H_{ij} \equiv y_i y_j \mathbf{x}_i \cdot \mathbf{x}_j. \qquad (28)$$

Setting $\partial L_D / \partial \alpha_i = 0$ gives

$$(H\alpha)_i + \lambda y_i = 1 \quad \forall i \qquad (29)$$

Now $H$ has a very simple structure: the off-diagonal elements are $-y_i y_j R^2/n$, and the diagonal elements are $R^2$. The fact that all the off-diagonal elements differ only by factors of $y_i$ suggests looking for a solution which has the form:

$$\alpha_i = \left(\frac{1 + y_i}{2}\right)a + \left(\frac{1 - y_i}{2}\right)b \qquad (30)$$

where $a$ and $b$ are unknowns. Plugging this form in Eq. (29) gives:

$$\left(\frac{n+1}{n}\right)\frac{a+b}{2} - \frac{y_i p}{n}\,\frac{a+b}{2} + \lambda y_i = \frac{1}{R^2} \qquad (31)$$

where $p$ is defined by

$$p \equiv \sum_{i=1}^{n+1} y_i. \qquad (32)$$

Thus

$$a + b = \frac{2n}{R^2(n+1)} \qquad (33)$$

and substituting this into the equality constraint Eq. (15) to find $a$, $b$ gives

$$a = \frac{n}{R^2(n+1)}\left(1 - \frac{p}{n+1}\right), \quad b = \frac{n}{R^2(n+1)}\left(1 + \frac{p}{n+1}\right) \qquad (34)$$

which gives for the solution

$$\alpha_i = \frac{n}{R^2(n+1)}\left(1 - \frac{y_i p}{n+1}\right) \qquad (35)$$

Also,

$$(H\alpha)_i = 1 - \frac{y_i p}{n+1}. \qquad (36)$$

Hence

$$\|\mathbf{w}\|^2 = \sum_{i,j=1}^{n+1} \alpha_i \alpha_j y_i y_j \mathbf{x}_i \cdot \mathbf{x}_j = \alpha^T H \alpha$$
$$= \sum_{i=1}^{n+1} \alpha_i \left(1 - \frac{y_i p}{n+1}\right) = \sum_{i=1}^{n+1} \alpha_i = \frac{n}{R^2}\left(1 - \left(\frac{p}{n+1}\right)^2\right) \qquad (37)$$

Note that this is one of those cases where the Lagrange multiplier $\lambda$ can remain undetermined (although determining it is trivial). We have now solved the problem, since all the $\alpha_i$ are clearly positive or zero (in fact the $\alpha_i$ will only be zero if all training points have the same class). Note that $\|\mathbf{w}\|$ depends only on the number of positive (negative) polarity points, and not on how the class labels are assigned to the points in Eq. (22). This is clearly not true of $\mathbf{w}$ itself, which is given by

$$\mathbf{w} = \frac{n}{R^2(n+1)} \sum_{i=1}^{n+1} \left(y_i - \frac{p}{n+1}\right)\mathbf{x}_i \qquad (38)$$

The margin, $M = 2/\|\mathbf{w}\|$, is thus given by

$$M = \frac{2R}{\sqrt{n\left(1 - (p/(n+1))^2\right)}}. \qquad (39)$$

Thus when the number of points $n + 1$ is even, the minimum margin occurs when $p = 0$ (equal numbers of positive and negative examples), in which case the margin is $M_{min} = 2R/\sqrt{n}$. If $n + 1$ is odd, the minimum margin occurs when $p = \pm 1$, in which case $M_{min} = 2R(n+1)/(n\sqrt{n+2})$. In both cases, the maximum margin is given by $M_{max} = R(n+1)/n$. Thus, for example, for the two dimensional simplex consisting of three points lying on $S^1$ (and spanning $\mathbf{R}^2$), and with labeling such that not all three points have the same polarity, the maximum and minimum margin are both $3R/2$ (see Figure (12)).
Note that the results of this Section amount to an alternative, constructive proof that the VC dimension of oriented separating hyperplanes in $\mathbf{R}^n$ is at least $n + 1$.
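
The closed-form solution above is easy to verify numerically; the sketch below (my own check) builds the simplex points of Eq. (22) for arbitrary n and R, assigns a labeling, evaluates alpha from Eq. (35) and w from Eq. (38), and confirms the margin formula of Eq. (39).

```python
import numpy as np

def simplex_points(n, R=1.0):
    """Vertices of the n-dimensional symmetric simplex, embedded in R^{n+1} (Eq. (22))."""
    X = np.full((n + 1, n + 1), -R / np.sqrt(n * (n + 1)))
    np.fill_diagonal(X, R * np.sqrt(n / (n + 1)))
    return X

n, R = 5, 2.0
X = simplex_points(n, R)
y = np.array([1, 1, -1, 1, -1, -1])       # an arbitrary labeling of the n+1 = 6 points
p = y.sum()

alpha = n / (R**2 * (n + 1)) * (1 - y * p / (n + 1))              # Eq. (35)
w = (alpha * y) @ X                                               # Eq. (14) / (38)

print(alpha.min() >= 0, np.isclose(alpha @ y, 0.0))               # feasibility: alpha_i >= 0 and Eq. (15)
print(np.isclose(2 / np.linalg.norm(w),
                 2 * R / np.sqrt(n * (1 - (p / (n + 1))**2))))    # margin matches Eq. (39): True
```
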
3.4. Test Phase

Once we have trained a Support Vector Machine, how can we use it? We simply determine on which side of the decision boundary (that hyperplane lying half way between $H_1$ and $H_2$ and parallel to them) a given test pattern $\mathbf{x}$ lies and assign the corresponding class label, i.e. we take the class of $\mathbf{x}$ to be $\mathrm{sgn}(\mathbf{w} \cdot \mathbf{x} + b)$.
3.5. The Non-Separable Case

The above algorithm for separable data, when applied to non-separable data, will find no feasible solution: this will be evidenced by the objective function (i.e. the dual Lagrangian) growing arbitrarily large. So how can we extend these ideas to handle non-separable data? We would like to relax the constraints (10) and (11), but only when necessary, that is, we would like to introduce a further cost (i.e. an increase in the primal objective function) for doing so. This can be done by introducing positive slack variables $\xi_i$, $i = 1, \ldots, l$ in the constraints (Cortes and Vapnik, 1995), which then become:

$$\mathbf{x}_i \cdot \mathbf{w} + b \ge +1 - \xi_i \quad \text{for } y_i = +1 \qquad (40)$$
$$\mathbf{x}_i \cdot \mathbf{w} + b \le -1 + \xi_i \quad \text{for } y_i = -1 \qquad (41)$$
$$\xi_i \ge 0 \quad \forall i. \qquad (42)$$
Thus, for an error to occur, the corresponding $\xi_i$ must exceed unity, so $\sum_i \xi_i$ is an upper bound on the number of training errors. Hence a natural way to assign an extra cost for errors is to change the objective function to be minimized from $\|\mathbf{w}\|^2/2$ to $\|\mathbf{w}\|^2/2 + C\left(\sum_i \xi_i\right)^k$, where $C$ is a parameter to be chosen by the user, a larger $C$ corresponding to assigning a higher penalty to errors. As it stands, this is a convex programming problem for any positive integer $k$; for $k = 2$ and $k = 1$ it is also a quadratic programming problem, and the choice $k = 1$ has the further advantage that neither the $\xi_i$, nor their Lagrange multipliers, appear in the Wolfe dual problem, which becomes:
Maximize:

$$L_D \equiv \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i \cdot \mathbf{x}_j \qquad (43)$$

subject to:

$$0 \le \alpha_i \le C, \qquad (44)$$
$$\sum_i \alpha_i y_i = 0. \qquad (45)$$

The solution is again given by

$$\mathbf{w} = \sum_{i=1}^{N_S} \alpha_i y_i \mathbf{x}_i. \qquad (46)$$

where $N_S$ is the number of support vectors. Thus the only difference from the optimal hyperplane case is that the $\alpha_i$ now have an upper bound of $C$. The situation is summarized schematically in Figure 6.
s hemati ally in Figure 6.
We will need the Karush-Kuhn-Tucker conditions for the primal problem. The primal Lagrangian is

$$L_P = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\{y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 + \xi_i\} - \sum_i \mu_i \xi_i \qquad (47)$$

where the $\mu_i$ are the Lagrange multipliers introduced to enforce positivity of the $\xi_i$. The KKT conditions for the primal problem are therefore (note $i$ runs from 1 to the number of training points, and $\nu$ from 1 to the dimension of the data)

$$\frac{\partial L_P}{\partial w_\nu} = w_\nu - \sum_i \alpha_i y_i x_{i\nu} = 0 \qquad (48)$$
$$\frac{\partial L_P}{\partial b} = -\sum_i \alpha_i y_i = 0 \qquad (49)$$
$$\frac{\partial L_P}{\partial \xi_i} = C - \alpha_i - \mu_i = 0 \qquad (50)$$
$$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 + \xi_i \ge 0 \qquad (51)$$
$$\xi_i \ge 0 \qquad (52)$$
$$\alpha_i \ge 0 \qquad (53)$$
$$\mu_i \ge 0 \qquad (54)$$
$$\alpha_i\{y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 + \xi_i\} = 0 \qquad (55)$$
$$\mu_i \xi_i = 0 \qquad (56)$$

As before, we can use the KKT complementarity conditions, Eqs. (55) and (56), to determine the threshold $b$. Note that Eq. (50) combined with Eq. (56) shows that $\xi_i = 0$ if $\alpha_i < C$. Thus we can simply take any training point for which $0 < \alpha_i < C$ to use Eq. (55) (with $\xi_i = 0$) to compute $b$. (As before, it is numerically wiser to take the average over all such training points.)
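
The sketch below (my own illustration, using scikit-learn's SVC and an arbitrary C = 1) trains a soft-margin linear machine on overlapping classes and recovers b by averaging Eq. (55) over the margin support vectors, i.e. those with 0 < alpha_i < C, for which xi_i = 0; it assumes at least one such free support vector exists, which is typically the case here.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian clouds: a non-separable problem.
X = np.vstack([rng.normal(+1.0, 1.5, size=(50, 2)),
               rng.normal(-1.0, 1.5, size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_).ravel()         # dual_coef_ = y_i * alpha_i for the support vectors
w = clf.coef_.ravel()

# Margin support vectors have 0 < alpha_i < C; for them Eq. (55) gives b = y_i - w . x_i.
on_margin = alpha < C - 1e-8
b_values = y[clf.support_][on_margin] - X[clf.support_][on_margin] @ w
print(np.mean(b_values), clf.intercept_[0])    # the averaged estimate of b and the solver's value
```
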

Figure 6. Linear separating hyperplanes for the non-separable case.

3.6. A Mechanical Analogy

Consider the case in which the data are in $\mathbf{R}^2$. Suppose that the $i$'th support vector exerts a force $\mathbf{F}_i = \alpha_i y_i \hat{\mathbf{w}}$ on a stiff sheet lying along the decision surface (the "decision sheet") (here $\hat{\mathbf{w}}$ denotes the unit vector in the direction $\mathbf{w}$). Then the solution (46) satisfies the conditions of mechanical equilibrium:

$$\sum \text{Forces} = \sum_i \alpha_i y_i \hat{\mathbf{w}} = 0 \qquad (57)$$
$$\sum \text{Torques} = \sum_i \mathbf{s}_i \wedge (\alpha_i y_i \hat{\mathbf{w}}) = \mathbf{w} \wedge \hat{\mathbf{w}} = 0. \qquad (58)$$

(Here the $\mathbf{s}_i$ are the support vectors, and $\wedge$ denotes the vector product.) For data in $\mathbf{R}^n$, clearly the condition that the sum of forces vanish is still met. One can easily show that the torque also vanishes.^9
This mechanical analogy depends only on the form of the solution (46), and therefore holds for both the separable and the non-separable cases. In fact this analogy holds in general (i.e., also for the nonlinear case described below). The analogy emphasizes the interesting point that the "most important" data points are the support vectors with highest values of $\alpha$, since they exert the highest forces on the decision sheet. For the non-separable case, the upper bound $\alpha_i \le C$ corresponds to an upper bound on the force any given point is allowed to exert on the sheet. This analogy also provides a reason (as good as any other) to call these particular vectors "support vectors"^10.
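
Both equilibrium conditions can be checked on any trained machine, since they follow from Eqs. (15) and (14); a minimal sketch of my own (reusing a linear SVC on made-up data, and treating alpha_i y_i as the signed force magnitudes):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(+2.0, 1.0, size=(40, 2)),
               rng.normal(-2.0, 1.0, size=(40, 2))])
y = np.hstack([np.ones(40), -np.ones(40)])

clf = SVC(kernel="linear", C=10.0).fit(X, y)
ay = clf.dual_coef_.ravel()                  # alpha_i * y_i for each support vector
s = clf.support_vectors_
w_hat = clf.coef_.ravel() / np.linalg.norm(clf.coef_)

forces = np.sum(ay) * w_hat                                        # Eq. (57)
torques = np.sum(ay * (s[:, 0] * w_hat[1] - s[:, 1] * w_hat[0]))   # Eq. (58), 2-D cross product
print(np.allclose(forces, 0.0, atol=1e-6), np.isclose(torques, 0.0, atol=1e-6))
```
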

3.7. Examples by Pictures

Figure 7 shows two examples of a two-class pattern recognition problem, one separable and one not. The two classes are denoted by circles and disks respectively. Support vectors are identified with an extra circle. The error in the non-separable case is identified with a cross. The reader is invited to use Lucent's SVM Applet (Burges, Knirsch and Haratsch, 1996) to experiment and create pictures like these (if possible, try using 16 or 24 bit color).

Figure 7. The linear case, separable (left) and not (right). The background colour shows the shape of the decision surface.

4. Nonlinear Support Vector Machines

How can the above methods be generalized to the case where the decision function^11 is not a linear function of the data? (Boser, Guyon and Vapnik, 1992) showed that a rather old trick (Aizerman, 1964) can be used to accomplish this in an astonishingly straightforward way. First notice that the only way in which the data appears in the training problem, Eqs. (43) - (45), is in the form of dot products, $\mathbf{x}_i \cdot \mathbf{x}_j$. Now suppose we first mapped the data to some other (possibly infinite dimensional) Euclidean space $\mathcal{H}$, using a mapping which we will call $\Phi$:

$$\Phi: \mathbf{R}^d \mapsto \mathcal{H}. \qquad (59)$$

Then of course the training algorithm would only depend on the data through dot products in $\mathcal{H}$, i.e. on functions of the form $\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$. Now if there were a "kernel function" $K$ such that $K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$, we would only need to use $K$ in the training algorithm, and would never need to explicitly even know what $\Phi$ is. One example is

$$K(\mathbf{x}_i, \mathbf{x}_j) = e^{-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma^2}. \qquad (60)$$

In this particular example, $\mathcal{H}$ is infinite dimensional, so it would not be very easy to work with $\Phi$ explicitly. However, if one replaces $\mathbf{x}_i \cdot \mathbf{x}_j$ by $K(\mathbf{x}_i, \mathbf{x}_j)$ everywhere in the training algorithm, the algorithm will happily produce a support vector machine which lives in an infinite dimensional space, and furthermore do so in roughly the same amount of time it would take to train on the un-mapped data. All the considerations of the previous sections hold, since we are still doing a linear separation, but in a different space.
But how can we use this machine? After all, we need $\mathbf{w}$, and that will live in $\mathcal{H}$ also (see Eq. (46)). But in test phase an SVM is used by computing dot products of a given test point $\mathbf{x}$ with $\mathbf{w}$, or more specifically by computing the sign of

$$f(\mathbf{x}) = \sum_{i=1}^{N_S} \alpha_i y_i \Phi(\mathbf{s}_i) \cdot \Phi(\mathbf{x}) + b = \sum_{i=1}^{N_S} \alpha_i y_i K(\mathbf{s}_i, \mathbf{x}) + b \qquad (61)$$

where the $\mathbf{s}_i$ are the support vectors. So again we can avoid computing $\Phi(\mathbf{x})$ explicitly and use the $K(\mathbf{s}_i, \mathbf{x}) = \Phi(\mathbf{s}_i) \cdot \Phi(\mathbf{x})$ instead.
Let us call the space in which the data live, $\mathcal{L}$. (Here and below we use $\mathcal{L}$ as a mnemonic for "low dimensional", and $\mathcal{H}$ for "high dimensional": it is usually the case that the range of $\Phi$ is of much higher dimension than its domain). Note that, in addition to the fact that $\mathbf{w}$ lives in $\mathcal{H}$, there will in general be no vector in $\mathcal{L}$ which maps, via the map $\Phi$, to $\mathbf{w}$. If there were, $f(\mathbf{x})$ in Eq. (61) could be computed in one step, avoiding the sum (and making the corresponding SVM $N_S$ times faster, where $N_S$ is the number of support vectors). Despite this, ideas along these lines can be used to significantly speed up the test phase of SVMs (Burges, 1996). Note also that it is easy to find kernels (for example, kernels which are functions of the dot products of the $\mathbf{x}_i$ in $\mathcal{L}$) such that the training algorithm and solution found are independent of the dimension of both $\mathcal{L}$ and $\mathcal{H}$.
In the next Section we will discuss which functions $K$ are allowable and which are not. Let us end this Section with a very simple example of an allowed kernel, for which we can construct the mapping $\Phi$.
Suppose that your data are vectors in $\mathbf{R}^2$, and you choose $K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j)^2$. Then it's easy to find a space $\mathcal{H}$, and mapping $\Phi$ from $\mathbf{R}^2$ to $\mathcal{H}$, such that $(\mathbf{x} \cdot \mathbf{y})^2 = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y})$: we choose $\mathcal{H} = \mathbf{R}^3$ and

$$\Phi(\mathbf{x}) = \begin{pmatrix} x_1^2 \\ \sqrt{2}\,x_1 x_2 \\ x_2^2 \end{pmatrix} \qquad (62)$$

(note that here the subscripts refer to vector components). For data in $\mathcal{L}$ defined on the square $[-1, 1] \times [-1, 1] \in \mathbf{R}^2$ (a typical situation, for grey level image data), the (entire) image of $\Phi$ is shown in Figure 8. This Figure also illustrates how to think of this mapping: the image of $\Phi$ may live in a space of very high dimension, but it is just a (possibly very contorted) surface whose intrinsic dimension^12 is just that of $\mathcal{L}$.

Figure 8. Image, in $\mathcal{H}$, of the square $[-1, 1] \times [-1, 1] \in \mathbf{R}^2$ under the mapping $\Phi$.

Note that neither the mapping $\Phi$ nor the space $\mathcal{H}$ are unique for a given kernel. We could equally well have chosen $\mathcal{H}$ to again be $\mathbf{R}^3$ and

$$\Phi(\mathbf{x}) = \frac{1}{\sqrt{2}} \begin{pmatrix} (x_1^2 - x_2^2) \\ 2 x_1 x_2 \\ (x_1^2 + x_2^2) \end{pmatrix} \qquad (63)$$

or $\mathcal{H}$ to be $\mathbf{R}^4$ and

$$\Phi(\mathbf{x}) = \begin{pmatrix} x_1^2 \\ x_1 x_2 \\ x_1 x_2 \\ x_2^2 \end{pmatrix}. \qquad (64)$$
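
It is a one-line check that all three maps reproduce the kernel $(\mathbf{x} \cdot \mathbf{y})^2$; the sketch below (my own) verifies this on random points.

```python
import numpy as np

def phi62(x):   # Eq. (62)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def phi63(x):   # Eq. (63)
    return np.array([x[0]**2 - x[1]**2, 2 * x[0] * x[1], x[0]**2 + x[1]**2]) / np.sqrt(2)

def phi64(x):   # Eq. (64)
    return np.array([x[0]**2, x[0] * x[1], x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
x, y = rng.uniform(-1, 1, size=2), rng.uniform(-1, 1, size=2)
k = (x @ y)**2                                   # the kernel K(x, y) = (x . y)^2
print(all(np.isclose(k, phi(x) @ phi(y)) for phi in (phi62, phi63, phi64)))   # True
```
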
The literature on SVMs usually refers to the space $\mathcal{H}$ as a Hilbert space, so let's end this Section with a few notes on this point. You can think of a Hilbert space as a generalization of Euclidean space that behaves in a gentlemanly fashion. Specifically, it is any linear space, with an inner product defined, which is also complete with respect to the corresponding norm (that is, any Cauchy sequence of points converges to a point in the space). Some authors (e.g. (Kolmogorov, 1970)) also require that it be separable (that is, it must have a countable subset whose closure is the space itself), and some (e.g. Halmos, 1967) don't. It's a generalization mainly because its inner product can be any inner product, not just the scalar ("dot") product used here (and in Euclidean spaces in general). It's interesting that the older mathematical literature (e.g. Kolmogorov, 1970) also required that Hilbert spaces be infinite dimensional, and that mathematicians are quite happy defining infinite dimensional Euclidean spaces. Research on Hilbert spaces centers on operators in those spaces, since the basic properties have long since been worked out. Since some people understandably blanch at the mention of Hilbert spaces, I decided to use the term Euclidean throughout this tutorial.
4.1. Mercer's Condition

For which kernels does there exist a pair $\{\mathcal{H}, \Phi\}$, with the properties described above, and for which does there not? The answer is given by Mercer's condition (Vapnik, 1995; Courant and Hilbert, 1953): There exists a mapping $\Phi$ and an expansion

$$K(\mathbf{x}, \mathbf{y}) = \sum_i \Phi(\mathbf{x})_i \Phi(\mathbf{y})_i \qquad (65)$$

if and only if, for any $g(\mathbf{x})$ such that

$$\int g(\mathbf{x})^2 \, d\mathbf{x} \text{ is finite} \qquad (66)$$

then

$$\int K(\mathbf{x}, \mathbf{y}) g(\mathbf{x}) g(\mathbf{y}) \, d\mathbf{x}\, d\mathbf{y} \ge 0. \qquad (67)$$

Note that for specific cases, it may not be easy to check whether Mercer's condition is satisfied. Eq. (67) must hold for every $g$ with finite $L_2$ norm (i.e. which satisfies Eq. (66)). However, we can easily prove that the condition is satisfied for positive integral powers of the dot product: $K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y})^p$. We must show that

$$\int \left(\sum_{i=1}^{d} x_i y_i\right)^p g(\mathbf{x}) g(\mathbf{y}) \, d\mathbf{x}\, d\mathbf{y} \ge 0. \qquad (68)$$

The typical term in the multinomial expansion of $\left(\sum_{i=1}^{d} x_i y_i\right)^p$ contributes a term of the form

$$\frac{p!}{r_1! r_2! \cdots (p - r_1 - r_2 - \cdots)!} \int x_1^{r_1} x_2^{r_2} \cdots y_1^{r_1} y_2^{r_2} \cdots g(\mathbf{x}) g(\mathbf{y}) \, d\mathbf{x}\, d\mathbf{y} \qquad (69)$$

to the left hand side of Eq. (67), which factorizes:

$$= \frac{p!}{r_1! r_2! \cdots (p - r_1 - r_2 - \cdots)!} \left(\int x_1^{r_1} x_2^{r_2} \cdots g(\mathbf{x}) \, d\mathbf{x}\right)^2 \ge 0. \qquad (70)$$

One simple consequence is that any kernel which can be expressed as $K(\mathbf{x}, \mathbf{y}) = \sum_{p=0}^{\infty} c_p (\mathbf{x} \cdot \mathbf{y})^p$, where the $c_p$ are positive real coefficients and the series is uniformly convergent, satisfies Mercer's condition, a fact also noted in (Smola, Schölkopf and Müller, 1998b).
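
One practical way to spot a kernel that cannot satisfy Mercer's condition is to find a Gram matrix with a negative eigenvalue. The sketch below (my own, with made-up data) checks the polynomial kernel, which always passes as proved above, and a hyperbolic tangent kernel with a parameter choice and data for which the Gram matrix is indefinite (the tanh kernel is discussed further in Section 4.3).

```python
import numpy as np

def min_gram_eigenvalue(kernel, X):
    """Smallest eigenvalue of the Gram matrix K_ij = kernel(x_i, x_j)."""
    G = np.array([[kernel(a, b) for b in X] for a in X])
    return np.linalg.eigvalsh(G).min()

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))

poly = lambda a, b: (a @ b) ** 3                 # (x . y)^p satisfies Mercer's condition
print(min_gram_eigenvalue(poly, X) >= -1e-6)     # True: positive semidefinite up to round-off

# tanh(x . y + 1): a sigmoid kernel with a parameter choice that violates Mercer's condition.
sigmoid = lambda a, b: np.tanh(a @ b + 1.0)
X_bad = np.array([[0.1, 0.0], [10.0, 0.0]])
print(min_gram_eigenvalue(sigmoid, X_bad))       # negative: this Gram matrix is indefinite
```
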
Finally, what happens if one uses a kernel which does not satisfy Mercer's condition? In general, there may exist data such that the Hessian is indefinite, and for which the quadratic programming problem will have no solution (the dual objective function can become arbitrarily large). However, even for kernels that do not satisfy Mercer's condition, one might still find that a given training set results in a positive semidefinite Hessian, in which case the training will converge perfectly well. In this case, however, the geometrical interpretation described above is lacking.
4.2. Some Notes on $\Phi$ and $\mathcal{H}$

Mercer's condition tells us whether or not a prospective kernel is actually a dot product in some space, but it does not tell us how to construct $\Phi$ or even what $\mathcal{H}$ is. However, as with the homogeneous (that is, homogeneous in the dot product in $\mathcal{L}$) quadratic polynomial kernel discussed above, we can explicitly construct the mapping for some kernels. In Section 6.1 we show how Eq. (62) can be extended to arbitrary homogeneous polynomial kernels, and that the corresponding space $\mathcal{H}$ is a Euclidean space of dimension $\binom{d+p-1}{p}$. Thus for example, for a degree $p = 4$ polynomial, and for data consisting of 16 by 16 images ($d = 256$), dim($\mathcal{H}$) is 183,181,376.
Usually, mapping your data to a "feature space" with an enormous number of dimensions would bode ill for the generalization performance of the resulting machine. After all, the set of all hyperplanes $\{\mathbf{w}, b\}$ are parameterized by dim($\mathcal{H}$) + 1 numbers. Most pattern recognition systems with billions, or even an infinite, number of parameters would not make it past the start gate. How come SVMs do so well? One might argue that, given the form of solution, there are at most $l + 1$ adjustable parameters (where $l$ is the number of training samples), but this seems to be begging the question^13. It must be something to do with our requirement of maximum margin hyperplanes that is saving the day. As we shall see below, a strong case can be made for this claim.
Since the mapped surface is of intrinsic dimension dim($\mathcal{L}$), unless dim($\mathcal{L}$) = dim($\mathcal{H}$), it is obvious that the mapping cannot be onto (surjective). It also need not be one to one (bijective): consider $x_1 \to -x_1$, $x_2 \to -x_2$ in Eq. (62). The image of $\Phi$ need not itself be a vector space: again, considering the above simple quadratic example, the vector $-\Phi(\mathbf{x})$ is not in the image of $\Phi$ unless $\mathbf{x} = 0$. Further, a little playing with the inhomogeneous kernel

$$K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^2 \qquad (71)$$

will convince you that the corresponding $\Phi$ can map two vectors that are linearly dependent in $\mathcal{L}$ onto two vectors that are linearly independent in $\mathcal{H}$.
So far we have considered cases where $\Phi$ is done implicitly. One can equally well turn things around and start with $\Phi$, and then construct the corresponding kernel. For example (Vapnik, 1996), if $\mathcal{L} = \mathbf{R}^1$, then a Fourier expansion in the data $x$, cut off after $N$ terms, has the form

$$f(x) = \frac{a_0}{2} + \sum_{r=1}^{N}(a_{1r}\cos(rx) + a_{2r}\sin(rx)) \qquad (72)$$

and this can be viewed as a dot product between two vectors in $\mathbf{R}^{2N+1}$: $\mathbf{a} = (\frac{a_0}{\sqrt{2}}, a_{11}, \ldots, a_{21}, \ldots)$, and the mapped $\Phi(x) = (\frac{1}{\sqrt{2}}, \cos(x), \cos(2x), \ldots, \sin(x), \sin(2x), \ldots)$. Then the corresponding (Dirichlet) kernel can be computed in closed form:

$$\Phi(x_i) \cdot \Phi(x_j) = K(x_i, x_j) = \frac{\sin((N + 1/2)(x_i - x_j))}{2\sin((x_i - x_j)/2)}. \qquad (73)$$
This is easily seen as follows: letting $\delta \equiv x_i - x_j$,

$$\Phi(x_i) \cdot \Phi(x_j) = \frac{1}{2} + \sum_{r=1}^{N} \cos(r x_i)\cos(r x_j) + \sin(r x_i)\sin(r x_j)$$
$$= -\frac{1}{2} + \sum_{r=0}^{N} \cos(r\delta) = -\frac{1}{2} + \mathrm{Re}\left\{\sum_{r=0}^{N} e^{ir\delta}\right\}$$
$$= -\frac{1}{2} + \mathrm{Re}\left\{\frac{1 - e^{i(N+1)\delta}}{1 - e^{i\delta}}\right\}$$
$$= \frac{\sin((N + 1/2)\delta)}{2\sin(\delta/2)}.$$
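
The closed form of Eq. (73) is easily checked against the explicit feature map (a sketch of my own):

```python
import numpy as np

N = 7

def phi(x):
    """Explicit Fourier feature map for Eq. (72)."""
    r = np.arange(1, N + 1)
    return np.concatenate([[1.0 / np.sqrt(2)], np.cos(r * x), np.sin(r * x)])

def dirichlet(xi, xj):
    """Closed-form Dirichlet kernel, Eq. (73)."""
    d = xi - xj
    return np.sin((N + 0.5) * d) / (2.0 * np.sin(d / 2.0))

xi, xj = 0.9, -1.7
print(np.isclose(phi(xi) @ phi(xj), dirichlet(xi, xj)))   # True
```
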
Finally, it is clear that the above implicit mapping trick will work for any algorithm in which the data only appear as dot products (for example, the nearest neighbor algorithm). This fact has been used to derive a nonlinear version of principal component analysis by (Schölkopf, Smola and Müller, 1998b); it seems likely that this trick will continue to find uses elsewhere.

4.3. Some Examples of Nonlinear SVMs

The first kernels investigated for the pattern recognition problem were the following:

$$K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} + 1)^p \qquad (74)$$
$$K(\mathbf{x}, \mathbf{y}) = e^{-\|\mathbf{x} - \mathbf{y}\|^2 / 2\sigma^2} \qquad (75)$$
$$K(\mathbf{x}, \mathbf{y}) = \tanh(\kappa\, \mathbf{x} \cdot \mathbf{y} - \delta) \qquad (76)$$

Eq. (74) results in a classifier that is a polynomial of degree $p$ in the data; Eq. (75) gives a Gaussian radial basis function classifier, and Eq. (76) gives a particular kind of two-layer sigmoidal neural network. For the RBF case, the number of centers ($N_S$ in Eq. (61)), the centers themselves (the $\mathbf{s}_i$), the weights ($\alpha_i$), and the threshold ($b$) are all produced automatically by the SVM training and give excellent results compared to classical RBFs, for the case of Gaussian RBFs (Schölkopf et al, 1997). For the neural network case, the first layer consists of $N_S$ sets of weights, each set consisting of $d_L$ (the dimension of the data) weights, and the second layer consists of $N_S$ weights (the $\alpha_i$), so that an evaluation simply requires taking a weighted sum of sigmoids, themselves evaluated on dot products of the test data with the support vectors. Thus for the neural network case, the architecture (number of weights) is determined by SVM training.
Note, however, that the hyperbolic tangent kernel only satisfies Mercer's condition for certain values of the parameters $\kappa$ and $\delta$ (and of the data $\|\mathbf{x}\|^2$). This was first noticed experimentally (Vapnik, 1995); however some necessary conditions on these parameters for positivity are now known^14.
Figure 9 shows results for the same pattern recognition problem as that shown in Figure 7, but where the kernel was chosen to be a cubic polynomial. Notice that, even though the number of degrees of freedom is higher, for the linearly separable case (left panel), the solution is roughly linear, indicating that the capacity is being controlled; and that the linearly non-separable case (right panel) has become separable.
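
As an illustration of Eq. (61) with the Gaussian kernel of Eq. (75) (my own sketch; the toy ring-versus-blob data and scikit-learn's gamma = 1/2*sigma^2 parameterization are incidental choices), the snippet below trains an RBF machine, reads off the centers, weights and threshold that training produces, and re-evaluates the decision function directly from Eq. (61).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
# A ring of one class around a blob of the other: not linearly separable in L.
r = np.hstack([rng.uniform(0, 1, 60), rng.uniform(2, 3, 60)])
theta = rng.uniform(0, 2 * np.pi, 120)
X = np.c_[r * np.cos(theta), r * np.sin(theta)]
y = np.hstack([np.ones(60), -np.ones(60)])

sigma = 1.0
clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma**2), C=10.0).fit(X, y)
print(len(clf.support_))                       # N_S: number of centers, chosen by training

# Eq. (61): f(x) = sum_i alpha_i y_i K(s_i, x) + b, evaluated by hand for one test point.
x_test = np.array([0.3, -0.2])
K = np.exp(-np.sum((clf.support_vectors_ - x_test)**2, axis=1) / (2 * sigma**2))
f = clf.dual_coef_.ravel() @ K + clf.intercept_[0]
print(np.sign(f), clf.predict([x_test])[0])    # the hand-computed sign matches predict()
```
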

Figure 9. Degree 3 polynomial kernel. The background colour shows the shape of the decision surface.

Finally, note that although the SVM classifiers described above are binary classifiers, they are easily combined to handle the multiclass case. A simple, effective combination trains $N$ one-versus-rest classifiers (say, "one" positive, "rest" negative) for the $N$-class case and takes the class for a test point to be that corresponding to the largest positive distance (Boser, Guyon and Vapnik, 1992).
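
The one-versus-rest combination is a few lines on top of any binary SVM; a minimal sketch of my own, using the decision value w . x + b from each binary machine as its score:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([rng.normal(c, 0.7, size=(40, 2)) for c in centers])
labels = np.repeat([0, 1, 2], 40)

# Train one binary machine per class: "one" positive, "rest" negative.
machines = [SVC(kernel="linear", C=1.0).fit(X, np.where(labels == k, 1, -1)) for k in range(3)]

def predict(x):
    # The class whose machine gives the largest decision value wins.
    scores = [m.decision_function([x])[0] for m in machines]
    return int(np.argmax(scores))

print(predict([3.8, 0.3]), predict([0.2, 4.1]))   # 1 2
```
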
4.4. Global Solutions and Uniqueness

When is the solution to the support vector training problem global, and when is it unique? By "global", we mean that there exists no other point in the feasible region at which the objective function takes a lower value. We will address two kinds of ways in which uniqueness may not hold: solutions for which $\{\mathbf{w}, b\}$ are themselves unique, but for which the expansion of $\mathbf{w}$ in Eq. (46) is not; and solutions whose $\{\mathbf{w}, b\}$ differ. Both are of interest: even if the pair $\{\mathbf{w}, b\}$ is unique, if the $\alpha_i$ are not, there may be equivalent expansions of $\mathbf{w}$ which require fewer support vectors (a trivial example of this is given below), and which therefore require fewer instructions during test phase.
It turns out that every local solution is also global. This is a property of any convex programming problem (Fletcher, 1987). Furthermore, the solution is guaranteed to be unique if the objective function (Eq. (43)) is strictly convex, which in our case means that the Hessian must be positive definite (note that for quadratic objective functions $F$, the Hessian is positive definite if and only if $F$ is strictly convex; this is not true for non-quadratic $F$: there, a positive definite Hessian implies a strictly convex objective function, but not vice versa (consider $F = x^4$) (Fletcher, 1987)). However, even if the Hessian is positive semidefinite, the solution can still be unique: consider two points along the real line with coordinates $x_1 = 1$ and $x_2 = 2$, and with polarities $+$ and $-$. Here the Hessian is positive semidefinite, but the solution ($w = -2$, $b = 3$, $\xi_i = 0$ in Eqs. (40), (41), (42)) is unique. It is also easy to find solutions which are not unique in the sense that the $\alpha_i$ in the expansion of $\mathbf{w}$ are not unique: for example, consider the problem of four separable points on a square in $\mathbf{R}^2$: $\mathbf{x}_1 = [1, 1]$, $\mathbf{x}_2 = [-1, 1]$, $\mathbf{x}_3 = [-1, -1]$ and $\mathbf{x}_4 = [1, -1]$, with polarities $[+, -, -, +]$ respectively. One solution is $\mathbf{w} = [1, 0]$, $b = 0$, $\alpha = [0.25, 0.25, 0.25, 0.25]$; another has the same $\mathbf{w}$ and $b$, but $\alpha = [0.5, 0.5, 0, 0]$ (note that both solutions satisfy the constraints $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$). When can this occur in general? Given some solution $\alpha$, choose an $\alpha'$ which is in the null space of the Hessian $H_{ij} = y_i y_j \mathbf{x}_i \cdot \mathbf{x}_j$, and require that $\alpha'$ be orthogonal to the vector all of whose components are 1. Then adding $\alpha'$ to $\alpha$ in Eq. (43) will leave $L_D$ unchanged. If $0 \le \alpha_i + \alpha'_i \le C$ and $\alpha'$ satisfies Eq. (45), then $\alpha + \alpha'$ is also a solution^15.
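
These two expansions, and the mechanism behind them, are easy to check numerically (my own sketch): the difference of the two alpha vectors lies in the null space of the Hessian and sums to zero, so the dual objective of Eq. (43) is unchanged.

```python
import numpy as np

X = np.array([[1, 1], [-1, 1], [-1, -1], [1, -1]], dtype=float)
y = np.array([1, -1, -1, 1], dtype=float)
H = np.outer(y, y) * (X @ X.T)                       # Hessian H_ij = y_i y_j x_i . x_j

def dual_objective(alpha):                           # Eq. (43)
    return alpha.sum() - 0.5 * alpha @ H @ alpha

a1 = np.array([0.25, 0.25, 0.25, 0.25])
a2 = np.array([0.5, 0.5, 0.0, 0.0])

print(np.allclose((a1 * y) @ X, (a2 * y) @ X))               # both expansions give w = [1, 0]
print(np.isclose(dual_objective(a1), dual_objective(a2)))    # same dual objective value
d = a2 - a1
print(np.allclose(H @ d, 0.0), np.isclose(d.sum(), 0.0))     # d in null space of H, orthogonal to ones
```
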
How about solutions where the $\{\mathbf{w}, b\}$ are themselves not unique? (We emphasize that this can only happen in principle if the Hessian is not positive definite, and even then, the solutions are necessarily global). The following very simple theorem shows that if non-unique solutions occur, then the solution at one optimal point is continuously deformable into the solution at the other optimal point, in such a way that all intermediate points are also solutions.

Theorem 2 Let the variable $X$ stand for the pair of variables $\{\mathbf{w}, b\}$. Let the Hessian for the problem be positive semidefinite, so that the objective function is convex. Let $X_0$ and $X_1$ be two points at which the objective function attains its minimal value. Then there exists a path $X = X(\tau) = (1 - \tau)X_0 + \tau X_1$, $\tau \in [0, 1]$, such that $X(\tau)$ is a solution for all $\tau$.

Proof: Let the minimum value of the objective function be $F_{min}$. Then by assumption, $F(X_0) = F(X_1) = F_{min}$. By convexity of $F$, $F(X(\tau)) \le (1 - \tau)F(X_0) + \tau F(X_1) = F_{min}$. Furthermore, by linearity, the $X(\tau)$ satisfy the constraints Eq. (40), (41): explicitly (again combining both constraints into one):

$$y_i(\mathbf{w}_\tau \cdot \mathbf{x}_i + b_\tau) = y_i((1 - \tau)(\mathbf{w}_0 \cdot \mathbf{x}_i + b_0) + \tau(\mathbf{w}_1 \cdot \mathbf{x}_i + b_1))$$
$$\ge (1 - \tau)(1 - \xi_i) + \tau(1 - \xi_i) = 1 - \xi_i \qquad (77)$$

Although simple, this theorem is quite instructive. For example, one might think that the problems depicted in Figure 10 have several different optimal solutions (for the case of linear support vector machines). However, since one cannot smoothly move the hyperplane from one proposed solution to another without generating hyperplanes which are not solutions, we know that these proposed solutions are in fact not solutions at all. In fact, for each of these cases, the optimal unique solution is at $\mathbf{w} = 0$, with a suitable choice of $b$ (which has the effect of assigning the same label to all the points). Note that this is a perfectly acceptable solution to the classification problem: any proposed hyperplane (with $\mathbf{w} \ne 0$) will cause the primal objective function to take a higher value.

Figure 10. Two problems, with proposed (incorrect) non-unique solutions.

Finally, note that the fact that SVM training always finds a global solution is in contrast to the case of neural networks, where many local minima usually exist.
5. Methods of Solution

The support vector optimization problem can be solved analytically only when the number of training data is very small, or for the separable case when it is known beforehand which of the training data become support vectors (as in Sections 3.3 and 6.2). Note that this can happen when the problem has some symmetry (Section 3.3), but that it can also happen when it does not (Section 6.2). For the general analytic case, the worst case computational complexity is of order $N_S^3$ (inversion of the Hessian), where $N_S$ is the number of support vectors, although the two examples given both have complexity of $O(1)$.
However, in most real world cases, Equations (43) (with dot products replaced by kernels), (44), and (45) must be solved numerically. For small problems, any general purpose optimization package that solves linearly constrained convex quadratic programs will do. A good survey of the available solvers, and where to get them, can be found^16 in (More and Wright, 1993).
For larger problems, a range of existing te hniques an be brought to bear. A full ex-
ploration of the relative merits of these methods would ll another tutorial. Here we just
des ribe the general issues, and for on reteness, give a brief explanation of the te hnique
we urrently use. Below, a \fa e" means a set of points lying on the boundary of the feasible
region, and an \a tive onstraint" is a onstraint for whi h the equality holds. For more
24

on nonlinear programming te hniques see (Flet her, 1987; Mangasarian, 1969; M Cormi k,
1983).
The basic recipe is to (1) note the optimality (KKT) conditions which the solution must
satisfy, (2) define a strategy for approaching optimality by uniformly increasing the dual
objective function subject to the constraints, and (3) decide on a decomposition algorithm
so that only portions of the training data need be handled at a given time (Boser, Guyon
and Vapnik, 1992; Osuna, Freund and Girosi, 1997a). We give a brief description of some
of the issues involved. One can view the problem as requiring the solution of a sequence
of equality constrained problems. A given equality constrained problem can be solved in
one step by using the Newton method (although this requires storage for a factorization of
the projected Hessian), or in at most l steps using conjugate gradient ascent (Press et al.,
1992) (where l is the number of data points for the problem currently being solved: no extra
storage is required). Some algorithms move within a given face until a new constraint is
encountered, in which case the algorithm is restarted with the new constraint added to the
list of equality constraints. This method has the disadvantage that only one new constraint
is made active at a time. "Projection methods" have also been considered (More, 1991),
where a point outside the feasible region is computed, and then line searches and projections
are done so that the actual move remains inside the feasible region. This approach can add
several new constraints at once. Note that in both approaches, several active constraints
can become inactive in one step. In all algorithms, only the essential part of the Hessian
(the columns corresponding to α_i ≠ 0) need be computed (although some algorithms do
compute the whole Hessian). For the Newton approach, one can also take advantage of the
fact that the Hessian is positive semidefinite by diagonalizing it with the Bunch-Kaufman
algorithm (Bunch and Kaufman, 1977; Bunch and Kaufman, 1980) (if the Hessian were
indefinite, it could still be easily reduced to 2x2 block diagonal form with this algorithm).
In this algorithm, when a new constraint is made active or inactive, the factorization of
the projected Hessian is easily updated (as opposed to recomputing the factorization from
scratch). Finally, in interior point methods, the variables are essentially rescaled so as to
always remain inside the feasible region. An example is the "LOQO" algorithm of (Vanderbei,
1994a; Vanderbei, 1994b), which is a primal-dual path following algorithm. This last
method is likely to be useful for problems where the number of support vectors as a fraction
of training sample size is expected to be large.
We briefly describe the core optimization method we currently use17. It is an active set
method combining gradient and conjugate gradient ascent. Whenever the objective function
is computed, so is the gradient, at very little extra cost. In phase 1, the search directions
s are along the gradient. The nearest face along the search direction is found. If the dot
product of the gradient there with s indicates that the maximum along s lies between the
current point and the nearest face, the optimal point along the search direction is computed
analytically (note that this does not require a line search), and phase 2 is entered. Otherwise,
we jump to the new face and repeat phase 1. In phase 2, Polak-Ribiere conjugate gradient
ascent (Press et al., 1992) is done, until a new face is encountered (in which case phase 1 is
re-entered) or the stopping criterion is met. Note the following:

•  Search directions are always projected so that the α_i continue to satisfy the equality
   constraint Eq. (45). Note that the conjugate gradient algorithm will still work; we
   are simply searching in a subspace. However, it is important that this projection is
   implemented in such a way that not only is Eq. (45) met (easy), but also so that the
   angle between the resulting search direction, and the search direction prior to projection,
   is minimized (not quite so easy). (A small sketch of such a projection is given after this
   list.)
•  We also use a "sticky faces" algorithm: whenever a given face is hit more than once,
   the search directions are adjusted so that all subsequent searches are done within that
   face. All "sticky faces" are reset (made "non-sticky") when the rate of increase of the
   objective function falls below a threshold.

•  The algorithm stops when the fractional rate of increase of the objective function F falls
   below a tolerance (typically 1e-10, for double precision). Note that one can also use
   as stopping criterion the condition that the size of the projected search direction falls
   below a threshold. However, this criterion does not handle scaling well.

•  In my opinion the hardest thing to get right is handling precision problems correctly
   everywhere. If this is not done, the algorithm may not converge, or may be much slower
   than it needs to be.
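
As promised in the first bullet, here is a minimal sketch (my own, not the author's code) of
one way to project a search direction so that moving along it preserves Σ_i α_i y_i; since this is
the orthogonal projection onto that subspace, it also changes the direction as little as possible.
It ignores the additional complication, alluded to above, of components pinned at the bounds
0 or C:

    import numpy as np

    def project_direction(d, y):
        """Project search direction d onto the subspace orthogonal to y, so that
        moving along the projected direction leaves sum_i alpha_i y_i unchanged."""
        y = np.asarray(y, dtype=float)
        return d - (d @ y) / (y @ y) * y

    # toy check
    y = np.array([1., 1., -1., -1.])
    d = np.array([0.3, -0.1, 0.4, 0.2])
    print(project_direction(d, y) @ y)   # ~0: the equality constraint is preserved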
A good way to check that your algorithm is working is to check that the solution satisfies
all the Karush-Kuhn-Tucker conditions for the primal problem, since these are necessary
and sufficient conditions that the solution be optimal. The KKT conditions are Eqs. (48)
through (56), with dot products between data vectors replaced by kernels wherever they
appear (note w must be expanded as in Eq. (48) first, since w is not in general the mapping
of a point in L). Thus to check the KKT conditions, it is sufficient to check that the
α_i satisfy 0 ≤ α_i ≤ C, that the equality constraint (49) holds, that all points for which
0 ≤ α_i < C satisfy Eq. (51) with ξ_i = 0, and that all points with α_i = C satisfy Eq. (51)
for some ξ_i ≥ 0. These are sufficient conditions for all the KKT conditions to hold: note
that by doing this we never have to explicitly compute the ξ_i or μ_i, although doing so is
trivial.
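
The fragment below (a sketch of mine, using the commonly quoted equivalent form of these
checks rather than a literal transcription of Eqs. (48)-(56); the function names and tolerance
are not from the original text) illustrates such a KKT check for a candidate solution (alpha, b):

    import numpy as np

    def kkt_violations(alpha, b, X, y, C, kernel, tol=1e-6):
        """Count training points violating the KKT conditions for a candidate
        dual solution (alpha, b) of the soft-margin SVM."""
        K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
        f = (alpha * y) @ K + b                 # decision values f(x_j)
        margins = y * f                         # y_j f(x_j)
        bad  = int(np.sum((alpha < -tol) | (alpha > C + tol)))     # 0 <= alpha_i <= C
        bad += int(abs(alpha @ y) > tol)                           # sum_i alpha_i y_i = 0
        free = (alpha > tol) & (alpha < C - tol)
        bad += int(np.sum(np.abs(margins[free] - 1.0) > tol))      # 0 < alpha_i < C => y_i f(x_i) = 1
        bad += int(np.sum(margins[alpha > C - tol] > 1.0 + tol))   # alpha_i = C     => y_i f(x_i) <= 1
        bad += int(np.sum(margins[alpha < tol] < 1.0 - tol))       # alpha_i = 0     => y_i f(x_i) >= 1
        return bad

    # e.g. kkt_violations(alpha, b, X, y, C, kernel=lambda a, c: float(a @ c))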
5.1. Complexity, Scalability, and Parallelizability

Support vector machines have the following very striking property. Both training and test
functions depend on the data only through the kernel functions K(x_i, x_j). Even though it
corresponds to a dot product in a space of dimension d_H, where d_H can be very large or
infinite, the complexity of computing K can be far smaller. For example, for kernels of the
form K = (x_i · x_j)^p, a dot product in H would require of order (d_L + p − 1 choose p) operations, whereas
the computation of K(x_i, x_j) requires only O(d_L) operations (recall d_L is the dimension of
the data). It is this fact that allows us to construct hyperplanes in these very high dimensional
spaces yet still be left with a tractable computation. Thus SVMs circumvent both
forms of the "curse of dimensionality": the proliferation of parameters causing intractable
complexity, and the proliferation of parameters causing overfitting.
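
A rough numerical illustration of the gap between the two costs (my own example, not from
the original text):

    import math
    import numpy as np

    d_L, p = 256, 4                       # e.g. 16x16 pixel images, quartic kernel
    dim_H = math.comb(d_L + p - 1, p)     # dimension of the minimal embedding space
    print(dim_H)                          # roughly 1.8e8 components

    x, y = np.random.randn(d_L), np.random.randn(d_L)
    K = (x @ y) ** p                      # the kernel value, in only O(d_L) operations
    print(K)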

5.1.1. Training   For concreteness, we will give results for the computational complexity of
one of the above training algorithms (Bunch-Kaufman)18 (Kaufman, 1998). These results
assume that different strategies are used in different situations. We consider the problem of
training on just one "chunk" (see below). Again let l be the number of training points, N_S
the number of support vectors (SVs), and d_L the dimension of the input data. In the case
where most SVs are not at the upper bound, and N_S/l << 1, the number of operations C
is O(N_S³ + (N_S²)l + N_S d_L l). If instead N_S/l ≈ 1, then C is O(N_S³ + N_S l + N_S d_L l) (basically
by starting in the interior of the feasible region). For the case where most SVs are at the
upper bound, and N_S/l << 1, then C is O(N_S² + N_S d_L l). Finally, if most SVs are at the
upper bound, and N_S/l ≈ 1, we have C of O(d_L l²).
For larger problems, two decomposition algorithms have been proposed to date. In the
"chunking" method (Boser, Guyon and Vapnik, 1992), one starts with a small, arbitrary
subset of the data and trains on that. The rest of the training data is tested on the resulting
classifier, and a list of the errors is constructed, sorted by how far on the wrong side of the
margin they lie (i.e. how egregiously the KKT conditions are violated). The next chunk is
constructed from the first N of these, combined with the N_S support vectors already found,
where N + N_S is decided heuristically (a chunk size that is allowed to grow too quickly or
too slowly will result in slow overall convergence). Note that vectors can be dropped from
a chunk, and that support vectors in one chunk may not appear in the final solution. This
process is continued until all data points are found to satisfy the KKT conditions.
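
A schematic of the chunking loop just described might look as follows (a sketch only:
`solve_qp`, the chunk sizes, and the simplified violation measure are placeholders of mine,
not a stated implementation; for brevity every point on the wrong side of the margin is
treated as a violator):

    import numpy as np

    def chunking(X, y, C, kernel, solve_qp, chunk_size=500, n_new=100, tol=1e-3):
        """Train an SVM by chunking: repeatedly solve the dual on a working set
        made of the current support vectors plus the worst KKT violators."""
        work = np.arange(min(chunk_size, len(y)))              # small arbitrary starting chunk
        while True:
            alpha, b = solve_qp(X[work], y[work], C, kernel)   # QP restricted to the chunk
            sv = work[alpha > tol]                             # support vectors found so far
            # decision values of the current classifier on all training points
            f = np.array([sum(alpha[j] * y[work[j]] * kernel(X[work[j]], xi)
                              for j in range(len(work))) + b for xi in X])
            viol = 1.0 - y * f                                 # > 0: wrong side of the margin
            viol[sv] = -np.inf                                 # already in the working set
            if viol.max() <= tol:
                return alpha, b, sv                            # all points (approximately) satisfy KKT
            worst = np.argsort(-viol)[:n_new]                  # the N worst violators
            work = np.unique(np.concatenate([sv, worst]))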
The above method requires that the number of support vectors N_S be small enough so that
a Hessian of size N_S by N_S will fit in memory. An alternative decomposition algorithm has
been proposed which overcomes this limitation (Osuna, Freund and Girosi, 1997b). Again,
in this algorithm, only a small portion of the training data is trained on at a given time, and
furthermore, only a subset of the support vectors need be in the "working set" (i.e. that set
of points whose α's are allowed to vary). This method has been shown to be able to easily
handle a problem with 110,000 training points and 100,000 support vectors. However, it
must be noted that the speed of this approach relies on many of the support vectors having
corresponding Lagrange multipliers α_i at the upper bound, α_i = C.

These training algorithms may take advantage of parallel processing in several ways. First,
all elements of the Hessian itself can be computed simultaneously. Second, each element
often requires the computation of dot products of training data, which could also be parallelized.
Third, the computation of the objective function, or gradient, which is a speed
bottleneck, can be parallelized (it requires a matrix multiplication). Finally, one can envision
parallelizing at a higher level, for example by training on different chunks simultaneously.
Schemes such as these, combined with the decomposition algorithm of (Osuna, Freund and
Girosi, 1997b), will be needed to make very large problems (i.e. >> 100,000 support vectors,
with many not at bound), tractable.

5.1.2. Testing   In test phase, one must simply evaluate Eq. (61), which will require
O(M N_S) operations, where M is the number of operations required to evaluate the kernel.
For dot product and RBF kernels, M is O(d_L), the dimension of the data vectors. Again,
both the evaluation of the kernel and of the sum are highly parallelizable procedures.
In the absence of parallel hardware, one can still speed up test phase by a large factor, as
described in Section 9.
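
For reference, the test-phase computation is just a kernel expansion; a vectorized sketch of
mine for a Gaussian RBF kernel (with gamma standing for 1/(2σ²), a parameterization I have
chosen for convenience) is:

    import numpy as np

    def predict(x, sv_x, sv_alpha, sv_y, b, gamma):
        """Evaluate sign(sum_i alpha_i y_i K(s_i, x) + b) for a Gaussian RBF kernel.
        Cost: O(N_S * d_L); the kernel evaluations and the sum parallelize trivially."""
        k = np.exp(-gamma * np.sum((sv_x - x) ** 2, axis=1))   # K(s_i, x) for all SVs at once
        return np.sign(sv_alpha @ (sv_y * k) + b)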
6. The VC Dimension of Support Vector Machines

We now show that the VC dimension of SVMs can be very large (even infinite). We will
then explore several arguments as to why, in spite of this, SVMs usually exhibit good
generalization performance. However it should be emphasized that these are essentially
plausibility arguments. Currently there exists no theory which guarantees that a given
family of SVMs will have high accuracy on a given problem.

We will call any kernel that satisfies Mercer's condition a positive kernel, and the corresponding
space H the embedding space. We will also call any embedding space with minimal
dimension for a given kernel a "minimal embedding space". We have the following

Theorem 3 Let K be a positive kernel which corresponds to a minimal embedding space
H. Then the VC dimension of the corresponding support vector machine (where the error
penalty C in Eq. (44) is allowed to take all values) is dim(H) + 1.

Proof: If the minimal embedding space has dimension d_H, then d_H points in the image of
L under the mapping Φ can be found whose position vectors in H are linearly independent.
From Theorem 1, these vectors can be shattered by hyperplanes in H. Thus by either
restricting ourselves to SVMs for the separable case (Section 3.1), or for which the error
penalty C is allowed to take all values (so that, if the points are linearly separable, a C can
be found such that the solution does indeed separate them), the family of support vector
machines with kernel K can also shatter these points, and hence has VC dimension d_H + 1.

Let's look at two examples.


6.1. The VC Dimension for Polynomial Kernels

Consider an SVM with homogeneous polynomial kernel, acting on data in R^{d_L}:

    K(x1, x2) = (x1 · x2)^p,   x1, x2 ∈ R^{d_L}                                    (78)

As in the case when d_L = 2 and the kernel is quadratic (Section 4), one can explicitly
construct the map Φ. Letting z_i = x_{1i} x_{2i}, so that K(x1, x2) = (z1 + ... + z_{d_L})^p, we see that
each dimension of H corresponds to a term with given powers of the z_i in the expansion of
K. In fact if we choose to label the components of Φ(x) in this manner, we can explicitly
write the mapping for any p and d_L:

    Φ_{r1 r2 ... r_{d_L}}(x) = sqrt( p! / (r1! r2! ... r_{d_L}!) ) x1^{r1} x2^{r2} ... x_{d_L}^{r_{d_L}},
        where Σ_{i=1}^{d_L} r_i = p,  r_i ≥ 0                                      (79)

This leads to

Theorem 4 If the space in which the data live has dimension d_L (i.e. L = R^{d_L}), the
dimension of the minimal embedding space, for homogeneous polynomial kernels of degree p
(K(x1, x2) = (x1 · x2)^p, x1, x2 ∈ R^{d_L}), is (d_L + p − 1 choose p).

(The proof is in the Appendix). Thus the VC dimension of SVMs with these kernels is
(d_L + p − 1 choose p) + 1. As noted above, this gets very large very quickly.
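
As a quick sanity check of Eq. (79) (my own verification, for one small case), one can
enumerate the components of Φ for d_L = 2, p = 2 and confirm that Φ(x) · Φ(y) = (x · y)²,
with the number of components given by Theorem 4:

    import math
    import itertools
    import numpy as np

    def phi(x, p):
        """Explicit feature map of Eq. (79) for the kernel (x . y)^p."""
        d = len(x)
        comps = []
        for r in itertools.product(range(p + 1), repeat=d):
            if sum(r) != p:
                continue
            coef = math.sqrt(math.factorial(p) / np.prod([math.factorial(k) for k in r]))
            comps.append(coef * np.prod([x[i] ** r[i] for i in range(d)]))
        return np.array(comps)

    x, y, p = np.array([1.5, -0.5]), np.array([0.3, 2.0]), 2
    print(phi(x, p) @ phi(y, p), (x @ y) ** p)       # the two numbers agree
    print(len(phi(x, p)), math.comb(2 + p - 1, p))   # 3 components, as Theorem 4 predicts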

6.2. The VC Dimension for Radial Basis Function Kernels

Theorem 5 Consider the class of Mercer kernels for which K(x1, x2) → 0 as ‖x1 − x2‖ →
∞, and for which K(x, x) is O(1), and assume that the data can be chosen arbitrarily from
R^d. Then the family of classifiers consisting of support vector machines using these kernels,
and for which the error penalty is allowed to take all values, has infinite VC dimension.

Proof: The kernel matrix, K_ij ≡ K(x_i, x_j), is a Gram matrix (a matrix of dot products:
see (Horn, 1985)) in H. Clearly we can choose training data such that all off-diagonal
elements K_{i≠j} can be made arbitrarily small, and by assumption all diagonal elements K_{i=j}
are of O(1). The matrix K is then of full rank; hence the set of vectors, whose dot products
in H form K, are linearly independent (Horn, 1985); hence, by Theorem 1, the points can be
shattered by hyperplanes in H, and hence also by support vector machines with sufficiently
large error penalty. Since this is true for any finite number of points, the VC dimension of
these classifiers is infinite.
Note that the assumptions in the theorem are stronger than necessary (they were chosen
to make the connection to radial basis functions clear). In fact it is only necessary that l
training points can be chosen such that the rank of the matrix K_ij increases without limit
as l increases. For example, for Gaussian RBF kernels, this can also be accomplished (even
for training data restricted to lie in a bounded subset of R^{d_L}) by choosing small enough
RBF widths. However in general the VC dimension of SVM RBF classifiers can certainly be
finite, and indeed, for data restricted to lie in a bounded subset of R^{d_L}, choosing restrictions
on the RBF widths is a good way to control the VC dimension.
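
The full-rank argument is easy to see numerically (an illustration of my own): for points
spaced far apart relative to the RBF width, the Gaussian kernel matrix is nearly the identity:

    import numpy as np

    l, sigma = 20, 0.1
    X = 10.0 * np.arange(l).reshape(-1, 1)       # points spaced 10 apart, width 0.1
    D2 = (X - X.T) ** 2                          # squared pairwise distances
    K = np.exp(-D2 / (2 * sigma ** 2))           # Gaussian RBF kernel matrix: ~identity
    print(np.linalg.matrix_rank(K))              # prints 20: full rank, so the points can be shattered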
This case gives us a second opportunity to present a situation where the SVM solution
can be computed analytically, which also amounts to a second, constructive proof of the
Theorem. For concreteness we will take the case for Gaussian RBF kernels of the form
K(x1, x2) = e^{−‖x1 − x2‖²/2σ²}. Let us choose training points such that the smallest distance
between any pair of points is much larger than the width σ. Consider the decision function
evaluated on the support vector s_j:

    f(s_j) = Σ_i α_i y_i e^{−‖s_i − s_j‖²/2σ²} + b.                               (80)

The sum on the right hand side will then be largely dominated by the term i = j; in fact
the ratio of that term to the contribution from the rest of the sum can be made arbitrarily
large by choosing the training points to be arbitrarily far apart. In order to find the SVM
solution, we again assume for the moment that every training point becomes a support
vector, and we work with SVMs for the separable case (Section 3.1) (the same argument
will hold for SVMs for the non-separable case if C in Eq. (44) is allowed to take large
enough values). Since all points are support vectors, the equalities in Eqs. (10), (11)
will hold for them. Let there be N+ (N−) positive (negative) polarity points. We further
assume that all positive (negative) polarity points have the same value α+ (α−) for their
Lagrange multiplier. (We will know that this assumption is correct if it delivers a solution
which satisfies all the KKT conditions and constraints). Then Eqs. (19), applied to all the
training data, and the equality constraint Eq. (18), become

    α+ + b = 1
    −α− + b = −1
    N+ α+ − N− α− = 0                                                             (81)

which are satisfied by

    α+ = 2 N− / (N− + N+)
    α− = 2 N+ / (N− + N+)
    b  = (N+ − N−) / (N− + N+)                                                    (82)

Thus, since the resulting α_i are also positive, all the KKT conditions and constraints are
satisfied, and we must have found the global solution (with zero training errors). Since the
number of training points, and their labeling, is arbitrary, and they are separated without
error, the VC dimension is infinite.
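
The following snippet (a check of mine of Eqs. (81)-(82); the particular counts N+ and N−
are arbitrary) confirms that the stated α+, α−, b satisfy all three equations and are strictly
positive:

    from fractions import Fraction

    N_plus, N_minus = 7, 3                      # arbitrary numbers of +/- training points
    a_plus  = Fraction(2 * N_minus, N_minus + N_plus)
    a_minus = Fraction(2 * N_plus,  N_minus + N_plus)
    b       = Fraction(N_plus - N_minus, N_minus + N_plus)

    assert a_plus + b == 1                             # margin equality, positive points
    assert -a_minus + b == -1                          # margin equality, negative points
    assert N_plus * a_plus - N_minus * a_minus == 0    # equality constraint
    print(a_plus, a_minus, b)                          # all multipliers strictly positive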
The situation is summarized schematically in Figure 11.

Figure 11. Gaussian RBF SVMs of sufficiently small width can classify an arbitrarily large number of
training points correctly, and thus have infinite VC dimension.

Now we are left with a striking conundrum. Even though their VC dimension is infinite (if
the data is allowed to take all values in R^{d_L}), SVM RBFs can have excellent performance
(Scholkopf et al, 1997). A similar story holds for polynomial SVMs. How come?
7. The Generalization Performance of SVMs

In this Section we collect various arguments and bounds relating to the generalization
performance of SVMs. We start by presenting a family of SVM-like classifiers for which structural
risk minimization can be rigorously implemented, and which will give us some insight as to
why maximizing the margin is so important.
7.1. VC Dimension of Gap Tolerant Classifiers

Consider a family of classifiers (i.e. a set of functions Φ on R^d) which we will call "gap
tolerant classifiers." A particular classifier φ ∈ Φ is specified by the location and diameter
of a ball in R^d, and by two hyperplanes, with parallel normals, also in R^d. Call the set of
points lying between, but not on, the hyperplanes the "margin set." The decision functions
φ are defined as follows: points that lie inside the ball, but not in the margin set, are assigned
class ±1, depending on which side of the margin set they fall. All other points are simply
defined to be "correct", that is, they are not assigned a class by the classifier, and do not
contribute to any risk. The situation is summarized, for d = 2, in Figure 12. This rather
odd family of classifiers, together with a condition we will impose on how they are trained,
will result in systems very similar to SVMs, and for which structural risk minimization can
be demonstrated. A rigorous discussion is given in the Appendix.

Label the diameter of the ball D and the perpendicular distance between the two hyperplanes
M. The VC dimension is defined as before to be the maximum number of points that
can be shattered by the family, but by "shattered" we mean that the points can occur as
errors in all possible ways (see the Appendix for further discussion). Clearly we can control
the VC dimension of a family of these classifiers by controlling the minimum margin M
and maximum diameter D that members of the family are allowed to assume. For example,
consider the family of gap tolerant classifiers in R² with diameter D = 2, shown in Figure
12. Those with margin satisfying M ≤ 3/2 can shatter three points; if 3/2 < M < 2, they
can shatter two; and if M ≥ 2, they can shatter only one. Each of these three families of
classifiers corresponds to one of the sets of classifiers in Figure 4, with just three nested
subsets of functions, and with h1 = 1, h2 = 2, and h3 = 3.

[Figure 12: a ball of diameter D = 2 containing a margin set of width M = 3/2; Φ = +1 and
Φ = −1 on either side of the margin set inside the ball, and Φ = 0 elsewhere.]

Figure 12. A gap tolerant classifier on data in R².

These ideas can be used to show how gap tolerant classifiers implement structural risk
minimization. The extension of the above example to spaces of arbitrary dimension is
encapsulated in a (modified) theorem of (Vapnik, 1995):

Theorem 6 For data in R^d, the VC dimension h of gap tolerant classifiers of minimum
margin M_min and maximum diameter D_max is bounded above19 by min{⌈D_max²/M_min²⌉, d} + 1.
For the proof we assume the following lemma, which in (Vapnik, 1979) is held to follow
from symmetry arguments20:

Lemma: Consider n ≤ d + 1 points lying in a ball B ∈ R^d. Let the points be shatterable
by gap tolerant classifiers with margin M. Then in order for M to be maximized, the points
must lie on the vertices of an (n − 1)-dimensional symmetric simplex, and must also lie on
the surface of the ball.

Proof: We need only consider the case where the number of points n satisfies n ≤ d + 1.
(n > d + 1 points will not be shatterable, since the VC dimension of oriented hyperplanes in
R^d is d + 1, and any distribution of points which can be shattered by a gap tolerant classifier
can also be shattered by an oriented hyperplane; this also shows that h ≤ d + 1). Again we
consider points on a sphere of diameter D, where the sphere itself is of dimension d − 2. We
will need two results from Section 3.3, namely (1) if n is even, we can find a distribution of n
points (the vertices of the (n − 1)-dimensional symmetric simplex) which can be shattered by
gap tolerant classifiers if D_max²/M_min² = n − 1, and (2) if n is odd, we can find a distribution
of n points which can be so shattered if D_max²/M_min² = (n − 1)²(n + 1)/n².

If n is even, at most n points can be shattered whenever

    n − 1 ≤ D_max²/M_min² < n.                                                    (83)

Thus for n even the maximum number of points that can be shattered may be written
⌊D_max²/M_min²⌋ + 1.

If n is odd, at most n points can be shattered when D_max²/M_min² = (n − 1)²(n + 1)/n².
However, the quantity on the right hand side satisfies

    n − 2 < (n − 1)²(n + 1)/n² < n − 1                                            (84)

for all integer n > 1. Thus for n odd the largest number of points that can be shattered
is certainly bounded above by ⌈D_max²/M_min²⌉ + 1, and from the above this bound is also
satisfied when n is even. Hence in general the VC dimension h of gap tolerant classifiers
must satisfy

    h ≤ ⌈D_max²/M_min²⌉ + 1.                                                      (85)

This result, together with h ≤ d + 1, concludes the proof.
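
For a concrete feel for the bound (my own arithmetic, echoing the D = 2 example of Figure
12), Eq. (85) together with h ≤ d + 1 gives:

    import math

    def gap_tolerant_vc_bound(D_max, M_min, d):
        """Upper bound of Theorem 6: min(ceil(D^2/M^2), d) + 1."""
        return min(math.ceil(D_max ** 2 / M_min ** 2), d) + 1

    print(gap_tolerant_vc_bound(2.0, 1.5, 2))   # 3: consistent with shattering three points
    print(gap_tolerant_vc_bound(2.0, 2.0, 2))   # 2 (an upper bound; it need not be tight)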


7.2. Gap Tolerant Classifiers, Structural Risk Minimization, and SVMs

Let's see how we can do structural risk minimization with gap tolerant classifiers. We need
only consider that subset of the Φ, call it Φ_S, for which training "succeeds", where by success
we mean that all training data are assigned a label ∈ {±1} (note that these labels do not
have to coincide with the actual labels, i.e. training errors are allowed). Within Φ_S, find
the subset which gives the fewest training errors - call this number of errors N_min. Within
that subset, find the function φ which gives maximum margin (and hence the lowest bound
on the VC dimension). Note the value of the resulting risk bound (the right hand side of
Eq. (3), using the bound on the VC dimension in place of the VC dimension). Next, within
Φ_S, find that subset which gives N_min + 1 training errors. Again, within that subset, find
the φ which gives the maximum margin, and note the corresponding risk bound. Iterate,
and take that classifier which gives the overall minimum risk bound.

An alternative approach is to divide the functions Φ into nested subsets Φ_i, i ∈ Z, i ≥ 1,
as follows: all φ ∈ Φ_i have {D, M} satisfying ⌈D²/M²⌉ ≤ i. Thus the family of functions
in Φ_i has VC dimension bounded above by min(i, d) + 1. Note also that Φ_i ⊆ Φ_{i+1}. SRM
then proceeds by taking that φ for which training succeeds in each subset and for which
the empirical risk is minimized in that subset, and again, choosing that φ which gives the
lowest overall risk bound.

Note that it is essential to these arguments that the bound (3) holds for any chosen decision
function, not just the one that minimizes the empirical risk (otherwise eliminating solutions
for which some training point x satisfies φ(x) = 0 would invalidate the argument).
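
A minimal sketch of the second (nested-subsets) procedure, under some assumptions of mine:
a hypothetical routine `train_gap_tolerant(data, i)` returns the maximum-margin classifier
in Φ_i together with its empirical risk (or None if training does not "succeed"), and the
confidence term follows the standard form of the bound in Eq. (3):

    import math

    def vc_bound(h, n_train, emp_risk, eta=0.05):
        """Empirical risk plus the VC confidence term, for VC dimension h."""
        conf = math.sqrt((h * (math.log(2 * n_train / h) + 1) - math.log(eta / 4)) / n_train)
        return emp_risk + conf

    def srm_select(data, d, train_gap_tolerant, i_max=50):
        """Over the nested subsets Phi_1 .. Phi_imax, keep the trained classifier
        whose risk bound is smallest."""
        best = None
        for i in range(1, i_max + 1):
            clf, emp_risk = train_gap_tolerant(data, i)     # max-margin classifier in Phi_i
            if clf is None:                                 # training did not "succeed"
                continue
            h = min(i, d) + 1                               # VC bound for Phi_i
            bound = vc_bound(h, len(data), emp_risk)
            if best is None or bound < best[0]:
                best = (bound, clf)
        return best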
The resulting gap tolerant classifier is in fact a special kind of support vector machine
which simply does not count data falling outside the sphere containing all the training data,
or inside the separating margin, as an error. It seems very reasonable to conclude that
support vector machines, which are trained with very similar objectives, also gain a similar
kind of capacity control from their training. However, a gap tolerant classifier is not an
SVM, and so the argument does not constitute a rigorous demonstration of structural risk
minimization for SVMs. The original argument for structural risk minimization for SVMs is
known to be flawed, since the structure there is determined by the data (see (Vapnik, 1995),
Section 5.11). I believe that there is a further subtle problem with the original argument.
The structure is defined so that no training points are members of the margin set. However,
one must still specify how test points that fall into the margin are to be labeled. If one simply
assigns the same, fixed class to them (say +1), then the VC dimension will be higher21 than
the bound derived in Theorem 6. However, the same is true if one labels them all as errors
(see the Appendix). If one labels them all as "correct", one arrives at gap tolerant classifiers.
On the other hand, it is known how to do structural risk minimization for systems where
the structure does depend on the data (Shawe-Taylor et al., 1996a; Shawe-Taylor et al.,
1996b). Unfortunately the resulting bounds are much looser than the VC bounds above,
which are already very loose (we will examine a typical case below where the VC bound is
a factor of 100 higher than the measured test error). Thus at the moment structural risk
minimization alone does not provide a rigorous explanation as to why SVMs often have good
generalization performance. However, the above arguments strongly suggest that algorithms
that minimize D²/M² can be expected to give better generalization performance. Further
evidence for this is found in the following theorem of (Vapnik, 1998), which we quote without
proof22:
Theorem 7 For optimal hyperplanes passing through the origin, we have

    E[P(error)] ≤ E[D²/M²] / l                                                    (86)

where P(error) is the probability of error on the test set, the expectation on the left is over
all training sets of size l − 1, and the expectation on the right is over all training sets of size
l.

However, in order for these observations to be useful for real problems, we need a way to
compute the diameter of the minimal enclosing sphere described above, for any number of
training points and for any kernel mapping.
7.3. How to Compute the Minimal Enclosing Sphere

Again let Φ be the mapping to the embedding space H. We wish to compute the radius
of the smallest sphere in H which encloses the mapped training data: that is, we wish to
minimize R² subject to

    ‖Φ(x_i) − C‖² ≤ R²   ∀i                                                       (87)

where C ∈ H is the (unknown) center of the sphere. Thus introducing positive Lagrange
multipliers λ_i, the primal Lagrangian is

    L_P = R² − Σ_i λ_i (R² − ‖Φ(x_i) − C‖²).                                      (88)

This is again a convex quadratic programming problem, so we can instead maximize the
Wolfe dual

    L_D = Σ_i λ_i K(x_i, x_i) − Σ_{i,j} λ_i λ_j K(x_i, x_j)                       (89)

(where we have again replaced Φ(x_i) · Φ(x_j) by K(x_i, x_j)) subject to:

    Σ_i λ_i = 1                                                                   (90)

    λ_i ≥ 0                                                                       (91)

with solution given by

    C = Σ_i λ_i Φ(x_i).                                                           (92)
Thus the problem is very similar to that of support vector training, and in fact the code
for the latter is easily modified to solve the above problem. Note that we were in a sense
"lucky", because the above analysis shows us that there exists an expansion (92) for the
center; there is no a priori reason why we should expect that the center of the sphere in H
should be expressible in terms of the mapped training data in this way. The same can be
said of the solution for the support vector problem, Eq. (46). (Had we chosen some other
geometrical construction, we might not have been so fortunate. Consider the smallest area
equilateral triangle containing two given points in R². If the points' position vectors are
linearly dependent, the center of the triangle cannot be expressed in terms of them.)
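
For small problems the dual (89)-(91) can be handed to any generic QP solver; the sketch
below (my own, using a simple Badoiu-Clarkson / Frank-Wolfe style update rather than the
modified SVM code mentioned above, and a linear kernel so the center is explicit) illustrates
the idea:

    import numpy as np

    def minimal_enclosing_sphere(X, n_iter=2000):
        """Approximate the smallest enclosing sphere of the rows of X by repeatedly
        moving a little dual weight lambda onto the point farthest from the current
        center (a Frank-Wolfe step on the dual (89)-(91))."""
        lam = np.zeros(len(X))
        lam[0] = 1.0                               # start with all weight on one point
        for t in range(1, n_iter + 1):
            c = lam @ X                            # current center, as in Eq. (92)
            far = np.argmax(np.sum((X - c) ** 2, axis=1))
            gamma = 1.0 / (t + 1)
            lam *= (1 - gamma); lam[far] += gamma  # move toward the farthest point
        c = lam @ X
        R = np.sqrt(np.max(np.sum((X - c) ** 2, axis=1)))
        return c, R, lam

    # e.g. for points on a unit circle the returned radius approaches 1
    ang = np.linspace(0, 2 * np.pi, 50, endpoint=False)
    X = np.stack([np.cos(ang), np.sin(ang)], axis=1)
    print(minimal_enclosing_sphere(X)[1])          # ~1.0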
7.4. A Bound from Leave-One-Out

(Vapnik, 1995) gives an alternative bound on the actual risk of support vector machines:

    E[P(error)] ≤ E[Number of support vectors] / Number of training samples,      (93)

where P(error) is the actual risk for a machine trained on l − 1 examples, E[P(error)]
is the expectation of the actual risk over all choices of training set of size l − 1, and
E[Number of support vectors] is the expectation of the number of support vectors over all
choices of training sets of size l. It's easy to see how this bound arises: consider the typical
situation after training on a given training set, shown in Figure 13.

Figure 13. Support vectors (circles) can become errors (cross) after removal and re-training (the dotted line
denotes the new decision surface).

We can get an estimate of the test error by removing one of the training points, re-training,
and then testing on the removed point; and then repeating this, for all training points. From
the support vector solution we know that removing any training points that are not support
vectors (the latter include the errors) will have no effect on the hyperplane found. Thus
the worst that can happen is that every support vector will become an error. Taking the
expectation over all such training sets therefore gives an upper bound on the actual risk,
for training sets of size l − 1.
Although elegant, I have yet to find a use for this bound. There seem to be many situations
where the actual error increases even though the number of support vectors decreases, so
the intuitive conclusion (systems that give fewer support vectors give better performance)
often seems to fail. Furthermore, although the bound can be tighter than that found using
the estimate of the VC dimension combined with Eq. (3), it can at the same time be less
predictive, as we shall see in the next Section.

7.5. VC, SV Bounds and the Actual Risk

Let us put these observations to some use. As mentioned above, training an SVM RBF
classifier will automatically give values for the RBF weights, number of centers, center
positions, and threshold. For Gaussian RBFs, there is only one parameter left: the RBF
width (σ in Eq. (80)) (we assume here only one RBF width for the problem). Can we
find the optimal value for that too, by choosing that σ which minimizes D²/M²? Figure 14
shows a series of experiments done on 28x28 NIST digit data, with 10,000 training points and
60,000 test points. The top curve in the left hand panel shows the VC bound (i.e. the bound
resulting from approximating the VC dimension in Eq. (3)23 by Eq. (85)), the middle curve
shows the bound from leave-one-out (Eq. (93)), and the bottom curve shows the measured
test error. Clearly, in this case, the bounds are very loose. The right hand panel shows just
the VC bound (the top curve, for σ² > 200), together with the test error, with the latter
scaled up by a factor of 100 (note that the two curves cross). It is striking that the two
curves have minima in the same place: thus in this case, the VC bound, although loose,
seems to be nevertheless predictive. Experiments on digits 2 through 9 showed that the VC
bound gave a minimum for which σ² was within a factor of two of that which minimized the
test error (digit 1 was inconclusive). Interestingly, in those cases the VC bound consistently
gave a lower prediction for σ² than that which minimized the test error. On the other hand,
the leave-one-out bound, although tighter, does not seem to be predictive, since it had no
minimum for the values of σ² tested.

[Figure 14: two panels, plotted against sigma squared. Left: actual risk, SV bound, and VC
bound. Right: VC bound together with the actual risk scaled up by 100.]

Figure 14. The VC bound can be predictive even when loose.

8. Limitations

Perhaps the biggest limitation of the support vector approach lies in choice of the kernel.
Once the kernel is fixed, SVM classifiers have only one user-chosen parameter (the error
penalty), but the kernel is a very big rug under which to sweep parameters. Some work has
been done on limiting kernels using prior knowledge (Scholkopf et al., 1998a; Burges, 1998),
but the best choice of kernel for a given problem is still a research issue.

A second limitation is speed and size, both in training and testing. While the speed
problem in test phase is largely solved in (Burges, 1996), this still requires two training
passes. Training for very large datasets (millions of support vectors) is an unsolved problem.
Discrete data presents another problem, although with suitable rescaling excellent results
have nevertheless been obtained (Joachims, 1997). Finally, although some work has been
done on training a multiclass SVM in one step24, the optimal design for multiclass SVM
classifiers is a further area for research.
9. Extensions

We very briefly describe two of the simplest, and most effective, methods for improving the
performance of SVMs.

The virtual support vector method (Scholkopf, Burges and Vapnik, 1996; Burges and
Scholkopf, 1997), attempts to incorporate known invariances of the problem (for example,
translation invariance for the image recognition problem) by first training a system, and
then creating new data by distorting the resulting support vectors (translating them, in the
case mentioned), and finally training a new system on the distorted (and the undistorted)
data. The idea is easy to implement and seems to work better than other methods for
incorporating invariances proposed so far.
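
A sketch of the data-generation step for translation invariance (my own illustration:
`train_svm` is a placeholder for any SVM trainer, assumed to return a model exposing the
indices of its support vectors as `support_`; the shift below is cyclic, whereas a real
implementation would pad with the image background):

    import numpy as np

    def translate(img, dx, dy):
        """Cyclically shift a 2-d image array by (dx, dy) pixels."""
        return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

    def virtual_sv_training(X, y, train_svm, shifts=((1, 0), (-1, 0), (0, 1), (0, -1))):
        # 1. Train once and keep only the support vectors.
        model = train_svm(X, y)
        X_sv, y_sv = X[model.support_], y[model.support_]
        # 2. Create "virtual" support vectors by applying the known invariance.
        X_virtual = np.concatenate([np.stack([translate(x, dx, dy) for x in X_sv])
                                    for dx, dy in shifts])
        y_virtual = np.tile(y_sv, len(shifts))
        # 3. Retrain on the undistorted support vectors plus the virtual ones.
        return train_svm(np.concatenate([X_sv, X_virtual]),
                         np.concatenate([y_sv, y_virtual]))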
The reduced set method (Burges, 1996; Burges and Scholkopf, 1997) was introduced to
address the speed of support vector machines in test phase, and also starts with a trained
SVM. The idea is to replace the sum in Eq. (46) by a similar sum, where instead of support
vectors, computed vectors (which are not elements of the training set) are used, and instead
of the α_i, a different set of weights are computed. The number of parameters is chosen
beforehand to give the speedup desired. The resulting vector is still a vector in H, and
the parameters are found by minimizing the Euclidean norm of the difference between the
original vector w and the approximation to it. The same technique could be used for SVM
regression to find much more efficient function representations (which could be used, for
example, in data compression).

Combining these two methods gave a factor of 50 speedup (while the error rate increased
from 1.0% to 1.1%) on the NIST digits (Burges and Scholkopf, 1997).
10. Conclusions

SVMs provide a new approach to the problem of pattern recognition (together with
regression estimation and linear operator inversion) with clear connections to the underlying
statistical learning theory. They differ radically from comparable approaches such as neural
networks: SVM training always finds a global minimum, and their simple geometric
interpretation provides fertile ground for further investigation. An SVM is largely characterized
by the choice of its kernel, and SVMs thus link the problems they are designed for with a
large body of existing work on kernel based methods. I hope that this tutorial will encourage
some to explore SVMs for themselves.
Acknowledgments

I'm very grateful to P. Knirsch, C. Nohl, E. Osuna, E. Rietman, B. Scholkopf, Y. Singer, A.
Smola, C. Stenard, and V. Vapnik, for their comments on the manuscript. Thanks also to
the reviewers, and to the Editor, U. Fayyad, for extensive, useful comments. Special thanks
are due to V. Vapnik, under whose patient guidance I learned the ropes; to A. Smola and
B. Scholkopf, for many interesting and fruitful discussions; and to J. Shawe-Taylor and D.
Schuurmans, for valuable discussions on structural risk minimization.
Appendix
A.1. Proofs of Theorems
We collect here the theorems stated in the text, together with their proofs. The Lemma has
a shorter proof using a "Theorem of the Alternative," (Mangasarian, 1969) but we wished
to keep the proofs as self-contained as possible.

Lemma 1 Two sets of points in R^n may be separated by a hyperplane if and only if the
intersection of their convex hulls is empty.

Proof: We allow the notions of points in R^n, and position vectors of those points, to be
used interchangeably in this proof. Let C_A, C_B be the convex hulls of two sets of points
A, B in R^n. Let A − B denote the set of points whose position vectors are given by
a − b, a ∈ A, b ∈ B (note that A − B does not contain the origin), and let C_A − C_B have
the corresponding meaning for the convex hulls. Then showing that A and B are linearly
separable (separable by a hyperplane) is equivalent to showing that the set A − B is linearly
separable from the origin O. For suppose the latter: then ∃ w ∈ R^n, b ∈ R, b < 0 such
that x · w + b > 0 ∀x ∈ A − B. Now pick some y ∈ B, and denote the set of all points
a − b + y, a ∈ A, b ∈ B by A − B + y. Then x · w + b > y · w ∀x ∈ A − B + y, and clearly
y · w + b < y · w, so the sets A − B + y and y are linearly separable. Repeating this process
shows that A − B is linearly separable from the origin if and only if A and B are linearly
separable.

We now show that, if C_A ∩ C_B = ∅, then C_A − C_B is linearly separable from the origin.
Clearly C_A − C_B does not contain the origin. Furthermore C_A − C_B is convex, since
∀x1 = a1 − b1, x2 = a2 − b2, λ ∈ [0, 1], a1, a2 ∈ C_A, b1, b2 ∈ C_B, we have (1 − λ)x1 + λx2 =
((1 − λ)a1 + λa2) − ((1 − λ)b1 + λb2) ∈ C_A − C_B. Hence it is sufficient to show that any
convex set S, which does not contain O, is linearly separable from O. Let x_min ∈ S be
that point whose Euclidean distance from O, ‖x_min‖, is minimal. (Note there can be only
one such point, since if there were two, the chord joining them, which also lies in S, would
contain points closer to O.) We will show that ∀x ∈ S, x · x_min > 0. Suppose ∃ x ∈ S
such that x · x_min ≤ 0. Let L be the line segment joining x_min and x. Then convexity
implies that L ⊂ S. Thus O ∉ L, since by assumption O ∉ S. Hence the three points O, x
and x_min form an obtuse (or right) triangle, with obtuse (or right) angle occurring at the
point O. Define n̂ ≡ (x − x_min)/‖x − x_min‖. Then the distance from the closest point in
L to O is ‖x_min‖² − (x_min · n̂)², which is less than ‖x_min‖². Hence x · x_min > 0 and S is
linearly separable from O. Thus C_A − C_B is linearly separable from O, and a fortiori A − B
is linearly separable from O, and thus A is linearly separable from B.

It remains to show that, if the two sets of points A, B are linearly separable, the intersection
of their convex hulls is empty. By assumption there exists a pair w ∈ R^n, b ∈ R, such
that ∀a_i ∈ A, w · a_i + b > 0 and ∀b_i ∈ B, w · b_i + b < 0. Consider a general point x ∈ C_A. It
may be written x = Σ_i λ_i a_i, Σ_i λ_i = 1, 0 ≤ λ_i ≤ 1. Then w^T x + b = Σ_i λ_i {w · a_i + b} > 0.
Similarly, for points y ∈ C_B, w · y + b < 0. Hence C_A ∩ C_B = ∅, since otherwise we
would be able to find a point x = y which simultaneously satisfies both inequalities.
Theorem 1: Consider some set of m points in R^n. Choose any one of the points as
origin. Then the m points can be shattered by oriented hyperplanes if and only if the
position vectors of the remaining points are linearly independent.

Proof: Label the origin O, and assume that the m − 1 position vectors of the remaining
points are linearly independent. Consider any partition of the m points into two subsets,
S1 and S2, of order m1 and m2 respectively, so that m1 + m2 = m. Let S1 be the subset
containing O. Then the convex hull C1 of S1 is that set of points whose position vectors x
satisfy

    x = Σ_{i=1}^{m1} α_i s_{1i},   Σ_{i=1}^{m1} α_i = 1,   α_i ≥ 0                (A.1)

where the s_{1i} are the position vectors of the m1 points in S1 (including the null position
vector of the origin). Similarly, the convex hull C2 of S2 is that set of points whose position
vectors x satisfy

    x = Σ_{i=1}^{m2} β_i s_{2i},   Σ_{i=1}^{m2} β_i = 1,   β_i ≥ 0                (A.2)

where the s_{2i} are the position vectors of the m2 points in S2. Now suppose that C1 and
C2 intersect. Then there exists an x ∈ R^n which simultaneously satisfies Eq. (A.1) and Eq.
(A.2). Subtracting these equations gives a linear combination of the m − 1 non-null position
vectors which vanishes, which contradicts the assumption of linear independence. By the
lemma, since C1 and C2 do not intersect, there exists a hyperplane separating S1 and S2.
Since this is true for any choice of partition, the m points can be shattered.

It remains to show that if the m − 1 non-null position vectors are not linearly independent,
then the m points cannot be shattered by oriented hyperplanes. If the m − 1 position vectors
are not linearly independent, then there exist m − 1 numbers, γ_i, such that

    Σ_{i=1}^{m−1} γ_i s_i = 0                                                     (A.3)

If all the γ_i are of the same sign, then we can scale them so that γ_i ∈ [0, 1] and Σ_i γ_i = 1.
Eq. (A.3) then states that the origin lies in the convex hull of the remaining points; hence,
by the lemma, the origin cannot be separated from the remaining points by a hyperplane,
and the points cannot be shattered.

If the γ_i are not all of the same sign, place all the terms with negative γ_i on the right:

    Σ_{j∈I1} |γ_j| s_j = Σ_{k∈I2} |γ_k| s_k                                       (A.4)

where I1, I2 are the indices of the corresponding partition of S \ O (i.e. of the set S with the
origin removed). Now scale this equation so that either Σ_{j∈I1} |γ_j| = 1 and Σ_{k∈I2} |γ_k| ≤ 1,
or Σ_{j∈I1} |γ_j| ≤ 1 and Σ_{k∈I2} |γ_k| = 1. Suppose without loss of generality that the latter
holds. Then the left hand side of Eq. (A.4) is the position vector of a point lying in the
convex hull of the points {∪_{j∈I1} s_j} ∪ O (or, if the equality holds, of the points {∪_{j∈I1} s_j}),
and the right hand side is the position vector of a point lying in the convex hull of the points
∪_{k∈I2} s_k, so the convex hulls overlap, and by the lemma, the two sets of points cannot be
separated by a hyperplane. Thus the m points cannot be shattered.

Theorem 4: If the data is d-dimensional (i.e. L = R^d), the dimension of the minimal
embedding space, for homogeneous polynomial kernels of degree p (K(x1, x2) = (x1 ·
x2)^p, x1, x2 ∈ R^d), is (d + p − 1 choose p).

Proof: First we show that the number of components of Φ(x) is (p + d − 1 choose p). Label the
components of Φ as in Eq. (79). Then a component is uniquely identified by the choice
of the d integers r_i ≥ 0, Σ_{i=1}^d r_i = p. Now consider p objects distributed amongst d − 1
partitions (numbered 1 through d − 1), such that objects are allowed to be to the left of
all partitions, or to the right of all partitions. Suppose m objects fall between partitions
q and q + 1. Let this correspond to a term x_{q+1}^m in the product in Eq. (79). Similarly, m
objects falling to the left of all partitions corresponds to a term x_1^m, and m objects falling
to the right of all partitions corresponds to a term x_d^m. Thus the number of distinct terms
of the form x_1^{r1} x_2^{r2} ... x_d^{rd}, Σ_{i=1}^d r_i = p, r_i ≥ 0 is the number of ways of distributing
the objects and partitions amongst themselves, modulo permutations of the partitions and
permutations of the objects, which is (p + d − 1 choose p).
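
(The counting argument is easy to verify directly for small p and d; the following check is
my own and not part of the proof: enumerating the multi-indices gives exactly the binomial
coefficient.)

    import math
    import itertools

    def n_components(d, p):
        """Count multi-indices (r_1, ..., r_d) with r_i >= 0 and sum r_i = p."""
        return sum(1 for r in itertools.product(range(p + 1), repeat=d) if sum(r) == p)

    for d, p in [(2, 2), (3, 4), (5, 3)]:
        print(n_components(d, p), math.comb(p + d - 1, p))   # the two counts agree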
Next we must show that the set of vectors with components Φ_{r1 r2 ... rd}(x) span the space H.
This follows from the fact that the components of Φ(x) are linearly independent functions.
For suppose instead that the image of Φ acting on x ∈ L is a subspace of H. Then there
exists a fixed nonzero vector V ∈ H such that

    Σ_{i=1}^{dim(H)} V_i Φ_i(x) = 0   ∀x ∈ L.                                     (A.5)

Using the labeling introduced above, consider a particular component of Φ:

    Φ_{r1 r2 ... rd}(x),   Σ_{i=1}^d r_i = p.                                     (A.6)

Since Eq. (A.5) holds for all x, and since the mapping Φ in Eq. (79) certainly has all
derivatives defined, we can apply the operator

    (∂/∂x_1)^{r1} ... (∂/∂x_d)^{rd}                                               (A.7)

to Eq. (A.5), which will pick that one term with corresponding powers of the x_i in Eq.
(79), giving

    V_{r1 r2 ... rd} = 0.                                                         (A.8)

Since this is true for all choices of r1, ..., rd such that Σ_{i=1}^d r_i = p, every component of
V must vanish. Hence the image of Φ acting on x ∈ L spans H.
A.2. Gap Tolerant Classifiers and VC Bounds

The following point is central to the argument. One normally thinks of a collection of points
as being "shattered" by a set of functions, if for any choice of labels for the points, a function
from the set can be found which assigns those labels to the points. The VC dimension of that
set of functions is then defined as the maximum number of points that can be so shattered.
However, consider a slightly different definition. Let a set of points be shattered by a set
of functions if for any choice of labels for the points, a function from the set can be found
which assigns the incorrect labels to all the points. Again let the VC dimension of that set
of functions be defined as the maximum number of points that can be so shattered.

It is in fact this second definition (which we adopt from here on) that enters the VC
bound proofs (Vapnik, 1979; Devroye, Gyorfi and Lugosi, 1996). Of course for functions
whose range is {±1} (i.e. all data will be assigned either positive or negative class), the two
definitions are the same. However, if all points falling in some region are simply deemed to
be "errors", or "correct", the two definitions are different. As a concrete example, suppose
we define "gap intolerant classifiers", which are like gap tolerant classifiers, but which label
all points lying in the margin or outside the sphere as errors. Consider again the situation in
Figure 12, but assign positive class to all three points. Then a gap intolerant classifier with
margin width greater than the ball diameter cannot shatter the points if we use the first
definition of "shatter", but can shatter the points if we use the second (correct) definition.

With this caveat in mind, we now outline how the VC bounds can apply to functions with
range {±1, 0}, where the label 0 means that the point is labeled "correct." (The bounds
will also apply to functions where 0 is defined to mean "error", but the corresponding VC
dimension will be higher, weakening the bound, and in our case, making it useless). We will
follow the notation of (Devroye, Gyorfi and Lugosi, 1996).

Consider points x ∈ R^d, and let p(x) denote a density on R^d. Let φ be a function on R^d
with range {±1, 0}, and let Φ be a set of such functions. Let each x have an associated
label y_x ∈ {±1}. Let {x_1, ..., x_n} be any finite number of points in R^d: then we require Φ
to have the property that there exists at least one φ ∈ Φ such that φ(x_i) ∈ {±1} ∀ x_i. For
given φ, define the set of points A_φ by

    A_φ = {x : y_x = 1, φ(x) = −1} ∪ {x : y_x = −1, φ(x) = 1}                     (A.9)

We require that the φ be such that all sets A_φ are measurable. Let A denote the set of all
A_φ.
Definition: Let x_i, i = 1, ..., n be n points. We define the empirical risk for the set
{x_i, φ} to be

    ν_n({x_i, φ}) = (1/n) Σ_{i=1}^n I_{x_i ∈ A_φ}.                                (A.10)

where I is the indicator function. Note that the empirical risk is zero if φ(x_i) = 0 ∀ x_i.

Definition: We define the actual risk for the function φ to be

    ν(φ) = P(x ∈ A_φ).                                                            (A.11)

Note also that those points x for which φ(x) = 0 do not contribute to the actual risk.

Definition: For fixed (x_1, ..., x_n) ∈ R^d, let N_A be the number of different sets in

    {{x_1, ..., x_n} ∩ A : A ∈ A}                                                 (A.12)
where the sets A are defined above. The n-th shatter coefficient of A is defined

    s(A, n) = max_{x_1,...,x_n ∈ {R^d}^n} N_A(x_1, ..., x_n).                     (A.13)

We also define the VC dimension for the class A to be the maximum integer k ≥ 1 for
which s(A, k) = 2^k.

Theorem 8 (adapted from Devroye, Gyorfi and Lugosi, 1996, Theorem 12.6): Given ν_n({x_i, φ}),
ν(φ) and s(A, n) defined above, and given n points (x_1, ..., x_n) ∈ R^d, let Φ_0 denote that subset
of Φ such that all φ ∈ Φ_0 satisfy φ(x_i) ∈ {±1} ∀ x_i. (This restriction may be viewed as
part of the training algorithm). Then for any such φ,

    P(|ν_n({x_i, φ}) − ν(φ)| > ε) ≤ 8 s(A, n) exp(−nε²/32)                        (A.14)

The proof is exactly that of (Devroye, Gyorfi and Lugosi, 1996), Sections 12.3, 12.4 and
12.5, Theorems 12.5 and 12.6. We have dropped the "sup" to emphasize that this holds
for any of the functions φ. In particular, it holds for those φ which minimize the empirical
error and for which all training data take the values {±1}. Note however that the proof
only holds for the second definition of shattering given above. Finally, note that the usual
form of the VC bounds is easily derived from Eq. (A.14) by using s(A, n) ≤ (en/h)^h (where
h is the VC dimension) (Vapnik, 1995), setting η = 8 s(A, n) exp(−nε²/32), and solving for ε.
Clearly these results apply to our gap tolerant classifiers of Section 7.1. For them, a
particular classifier φ ∈ Φ is specified by a set of parameters {B, H, M}, where B is a
ball in R^d, D ∈ R is the diameter of B, H is a d − 1 dimensional oriented hyperplane in
R^d, and M ∈ R is a scalar which we have called the margin. H itself is specified by its
normal (whose direction specifies which points H_+ (H_−) are labeled positive (negative) by
the function), and by the minimal distance from H to the origin. For a given φ ∈ Φ, the
margin set S_M is defined as the set consisting of those points whose minimal distance to H
is less than M/2. Define Z to be the set of points in B but not in S_M, and let Z_+ ≡ Z ∩ H_+
and Z_− ≡ Z ∩ H_−. The function φ is then defined as follows:

    φ(x) = 1 ∀x ∈ Z_+,   φ(x) = −1 ∀x ∈ Z_−,   φ(x) = 0 otherwise                 (A.15)

and the corresponding sets A_φ as in Eq. (A.9).
Notes
1. K. Muller, Private Communi ation
2. The reader in whom this eli its a sinking feeling is urged to study (Strang, 1986; Flet her,
1987; Bishop, 1995). There is a simple geometri al interpretation of Lagrange multipliers: at
a boundary orresponding to a single onstraint, the gradient of the fun tion being extremized
must be parallel to the gradient of the fun tion whose ontours spe ify the boundary. At a
boundary orresponding to the interse tion of onstraints, the gradient must be parallel to a
linear ombination (non-negative in the ase of inequality onstraints) of the gradients of the
fun tions whose ontours spe ify the boundary.
3. In this paper, the phrase \learning ma hine" will be used for any fun tion estimation algo-
rithm, \training" for the parameter estimation pro edure, \testing" for the omputation of the
fun tion value, and \performan e" for the generalization a ura y (i.e. error rate as test set
size tends to in nity), unless otherwise stated.
4. Given the name \test set," perhaps we should also use \train set;" but the hobbyists got there
rst.
5. We use the term \oriented hyperplane" to emphasize that the mathemati al obje t onsidered
is the pair fH; ng, where H is the set of points whi h lie in the hyperplane and n is a parti ular
hoi e for the unit normal. Thus fH; ng and fH; ng are di erent oriented hyperplanes.
6. Su h a set of m points (whi h span an m 1 dimensional subspa e of a linear spa e) are said
to be \in general position" (Kolmogorov, 1970). The onvex hull of a set of m points in general
position de nes an m 1 dimensional simplex, the verti es of whi h are the points themselves.
7. The derivation of the bound assumes that the empirical risk converges uniformly to the actual
risk as the number of training observations increases (Vapnik, 1979). A necessary and sufficient
condition for this is that lim_{l→∞} H(l)/l = 0, where l is the number of training samples and
H(l) is the VC entropy of the set of decision functions (Vapnik, 1979; Vapnik, 1995). For any
set of functions with infinite VC dimension, the VC entropy is l log 2: hence for these classifiers,
the required uniform convergence does not hold, and so neither does the bound.
8. There is a ni e geometri interpretation for the dual problem: it is basi ally nding the two
losest points of onvex hulls of the two sets. See (Bennett and Bredensteiner, 1998).
9. One can define the torque to be

    τ_{μ1...μ_{n−2}} = ε_{μ1...μ_n} x_{μ_{n−1}} F_{μ_n}                           (A.16)

where repeated indices are summed over on the right hand side, and where ε is the totally
antisymmetric tensor with ε_{1...n} = 1. (Recall that Greek indices are used to denote tensor
components). The sum of torques on the decision sheet is then:

    Σ_i ε_{μ1...μ_n} s_{i μ_{n−1}} F_{i μ_n} = Σ_i ε_{μ1...μ_n} s_{i μ_{n−1}} α_i y_i ŵ_{μ_n}
        = ε_{μ1...μ_n} w_{μ_{n−1}} ŵ_{μ_n} = 0                                    (A.17)

10. In the original formulation (Vapnik, 1979) they were alled \extreme ve tors."
11. By \de ision fun tion" we mean a fun tion f (x) whose sign represents the lass assigned to
data point x.
12. By \intrinsi dimension" we mean the number of parameters required to spe ify a point on the
manifold.
13. Alternatively one an argue that, given the form of the solution, the possible w must lie in a
subspa e of dimension l.
14. Work in preparation.
15. Thanks to A. Smola for pointing this out.
16. Many thanks to one of the reviewers for pointing this out.
17. The ore quadrati optimizer is about 700 lines of C++. The higher level ode (to handle
a hing of dot produ ts, hunking, IO, et ) is quite omplex and onsiderably larger.
18. Thanks to L. Kaufman for providing me with these results.
19. Recall that the "ceiling" sign ⌈ ⌉ means "smallest integer greater than or equal to." Also, there
is a typo in the actual formula given in (Vapnik, 1995), which I have corrected here.
20. Note, for example, that the distan e between every pair of verti es of the symmetri simplex
is the same: see Eq. (26). However, a rigorous proof is needed, and as far as I know is la king.
21. Thanks to J. Shawe-Taylor for pointing this out.
22. V. Vapnik, Private Communi ation.
23. There is an alternative bound one might use, namely that corresponding to the set of totally
bounded non-negative functions (Equation (3.28) in (Vapnik, 1995)). However, for loss functions
taking the value zero or one, and if the empirical risk is zero, this bound is looser than
that in Eq. (3) whenever (h(log(2l/h) + 1) − log(η/4))/l > 1/16, which is the case here.
24. V. Blanz, Private Communi ation

Referen es
M.A. Aizerman, E.M. Braverman, and L.I. Rozoner. Theoreti al foundations of the potential fun tion
method in pattern re ognition learning. Automation and Remote Control, 25:821{837, 1964.
M. Anthony and N. Biggs. Pa learning and neural networks. In The Handbook of Brain Theory and
Neural Networks, pages 694{697, 1995.
K.P. Bennett and E. Bredensteiner. Geometry in learning. In Geometry at Work, page to appear,
Washington, D.C., 1998. Mathemati al Asso iation of Ameri a.
C.M. Bishop. Neural Networks for Pattern Re ognition. Clarendon Press, Oxford, 1995.
V. Blanz, B. S holkopf, H. Bultho , C. Burges, V. Vapnik, and T. Vetter. Comparison of view{based
obje t re ognition algorithms using realisti 3d models. In C. von der Malsburg, W. von Seelen, J. C.
Vorbruggen, and B. Sendho , editors, Arti ial Neural Networks | ICANN'96, pages 251 { 256, Berlin,
1996. Springer Le ture Notes in Computer S ien e, Vol. 1112.
B. E. Boser, I. M. Guyon, and V .Vapnik. A training algorithm for optimal margin lassi ers. In Fifth
Annual Workshop on Computational Learning Theory, Pittsburgh, 1992. ACM.
James R. Bun h and Linda Kaufman. Some stable methods for al ulating inertia and solving symmetri
linear systems. Mathemati s of omputation, 31(137):163{179, 1977.
James R. Bun h and Linda Kaufman. A omputational method for the inde nite quadrati programming
problem. Linear Algebra and its Appli ations, 34:341{370, 1980.
C. J. C. Burges and B. S holkopf. Improving the a ura y and speed of support ve tor learning ma hines.
In M. Mozer, M. Jordan, and T. Pets he, editors, Advan es in Neural Information Pro essing Systems 9,
pages 375{381, Cambridge, MA, 1997. MIT Press.
C.J.C. Burges. Simpli ed support ve tor de ision rules. In Lorenza Saitta, editor, Pro eedings of the Thir-
teenth International Conferen e on Ma hine Learning, pages 71{77, Bari, Italy, 1996. Morgan Kaufman.
C.J.C. Burges. Geometry and invariance in kernel based methods. In B. Schölkopf, C.J.C. Burges, and A.J.
Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 89–116. MIT Press, 1999.
C.J.C. Burges, P. Knirsch, and R. Haratsch. Support vector web page: http://svm.research.bell-labs.com.
Technical report, Lucent Technologies, 1996.
C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.
R. Courant and D. Hilbert. Methods of Mathematical Physics. Interscience, 1953.
Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer
Verlag, Applications of Mathematics Vol. 31, 1996.
H. Drucker, C.J.C. Burges, L. Kaufman, A. Smola, and V. Vapnik. Support vector regression machines.
Advances in Neural Information Processing Systems, 9:155–161, 1997.
R. Fletcher. Practical Methods of Optimization. John Wiley and Sons, Inc., 2nd edition, 1987.
S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural
Computation, 4:1–58, 1992.
F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation
(to appear); CBCL AI Memo 1606, MIT, 1998.
I. Guyon, V. Vapnik, B. Boser, L. Bottou, and S.A. Solla. Structural risk minimization for character
recognition. Advances in Neural Information Processing Systems, 4:471–479, 1992.
P.R. Halmos. A Hilbert Space Problem Book. D. Van Nostrand Company, Inc., 1967.
Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
T. Joachims. Text categorization with support vector machines. Technical report, LS VIII Number 23,
University of Dortmund, 1997. ftp://ftp-ai.informatik.uni-dortmund.de/pub/Reports/report23.ps.Z.
L. Kaufman. Solving the QP problem for support vector training. In Proceedings of the 1997 NIPS Workshop
on Support Vector Machines (to appear), 1998.
A.N. Kolmogorov and S.V. Fomin. Introductory Real Analysis. Prentice-Hall, Inc., 1970.
O.L. Mangasarian. Nonlinear Programming. McGraw Hill, New York, 1969.
Garth P. McCormick. Non Linear Programming: Theory, Algorithms and Applications. John Wiley and
Sons, Inc., 1983.
D.C. Montgomery and E.A. Peck. Introduction to Linear Regression Analysis. John Wiley and Sons, Inc.,
2nd edition, 1992.
Jorge J. Moré and Stephen J. Wright. Optimization Software Guide. SIAM, 1993.
Jorge J. Moré and Gerardo Toraldo. On the solution of large quadratic programming problems with bound
constraints. SIAM J. Optimization, 1(1):93–113, 1991.
S. Mukherjee, E. Osuna, and F. Girosi. Nonlinear prediction of chaotic time series using a support vector
machine. In Proceedings of the IEEE Workshop on Neural Networks for Signal Processing 7, pages
511–519, Amelia Island, FL, 1997.
K.-R. Müller, A. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, and V. Vapnik. Predicting time series
with support vector machines. In Proceedings, International Conference on Artificial Neural Networks,
page 999. Springer Lecture Notes in Computer Science, 1997.
Edgar Osuna, Robert Freund, and Federico Girosi. An improved training algorithm for support vector
machines. In Proceedings of the 1997 IEEE Workshop on Neural Networks for Signal Processing, Eds.
J. Principe, L. Giles, N. Morgan, E. Wilson, pages 276–285, Amelia Island, FL, 1997.
Edgar Osuna, Robert Freund, and Federico Girosi. Training support vector machines: an application to
face detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 130–136, 1997.
Edgar Osuna and Federico Girosi. Reducing the run-time complexity of support vector machines. In
International Conference on Pattern Recognition (submitted), 1998.
William H. Press, Brian P. Flannery, Saul A. Teukolsky, and William T. Vetterling. Numerical Recipes in
C: The Art of Scientific Computing. Cambridge University Press, 2nd edition, 1992.
M. Schmidt. Identifying speakers with support vector networks. In Interface '96 Proceedings, Sydney, 1996.
B. Schölkopf. Support Vector Learning. R. Oldenbourg Verlag, Munich, 1997.
B. Schölkopf, C. Burges, and V. Vapnik. Extracting support data for a given task. In U. M. Fayyad and
R. Uthurusamy, editors, Proceedings, First International Conference on Knowledge Discovery & Data
Mining. AAAI Press, Menlo Park, CA, 1995.
B. Schölkopf, C. Burges, and V. Vapnik. Incorporating invariances in support vector learning machines. In
C. von der Malsburg, W. von Seelen, J. C. Vorbrüggen, and B. Sendhoff, editors, Artificial Neural Networks
– ICANN'96, pages 47–52, Berlin, 1996. Springer Lecture Notes in Computer Science, Vol. 1112.
B. Schölkopf, P. Simard, A. Smola, and V. Vapnik. Prior knowledge in support vector kernels. In M. Jordan,
M. Kearns, and S. Solla, editors, Advances in Neural Information Processing Systems 10, Cambridge, MA,
1998. MIT Press. In press.
B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem.
Neural Computation, 1998. In press.
B. Schölkopf, A. Smola, K.-R. Müller, C. Burges, and V. Vapnik. Support vector methods in learning
and feature extraction. Australian Journal of Intelligent Information Processing Systems, 5:3–9, 1998.
Special issue with selected papers of ACNN'98.
B. Schölkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing support
vector machines with Gaussian kernels to radial basis function classifiers. IEEE Trans. Sign. Processing,
45:2758–2765, 1997.
John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. A framework for
structural risk minimization. In Proceedings, 9th Annual Conference on Computational Learning Theory,
pages 68–76, 1996.
John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. Structural risk mini-
mization over data-dependent hierarchies. Technical report, NeuroCOLT Technical Report NC-TR-96-053,
1996.
A. Smola, B. Schölkopf, and K.-R. Müller. General cost functions for support vector regression. In Ninth
Australian Congress on Neural Networks (to appear), 1998.
A.J. Smola and B. Schölkopf. On a kernel-based method for pattern recognition, regression, approximation
and operator inversion. Algorithmica, 22:211–231, 1998.
Alex J. Smola, Bernhard Schölkopf, and Klaus-Robert Müller. The connection between regularization
operators and support vector kernels. Neural Networks, 11:637–649, 1998.
M. O. Stitson, A. Gammerman, V. Vapnik, V. Vovk, C. Watkins, and J. Weston. Support vector ANOVA
decomposition. Technical report, Royal Holloway College, Report number CSD-TR-97-22, 1997.
Gilbert Strang. Introduction to Applied Mathematics. Wellesley-Cambridge Press, 1986.
R. J. Vanderbei. Interior point methods: Algorithms and formulations. ORSA J. Computing, 6(1):32–34,
1994.
R.J. Vanderbei. LOQO: An interior point code for quadratic programming. Technical report, Program in
Statistics & Operations Research, Princeton University, 1994.
V. Vapnik. Estimation of Dependences Based on Empirical Data [in Russian]. Nauka, Moscow, 1979.
(English translation: Springer Verlag, New York, 1982).
V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
V. Vapnik. Statistical Learning Theory. John Wiley and Sons, Inc., New York, 1998.
Grace Wahba. Support vector machines, reproducing kernel Hilbert spaces and the GACV. In Proceedings of
the 1997 NIPS Workshop on Support Vector Machines (to appear). MIT Press, 1998.
J. Weston, A. Gammerman, M. O. Stitson, V. Vapnik, V. Vovk, and C. Watkins. Density estimation using
support vector machines. Technical report, Royal Holloway College, Report number CSD-TR-97-23, 1997.