Intelligent System
Chapter 8a: Support Vector Machine
• Introduction
• Theory
• Implementation
• Tools Comparison
• LIBSVM practical
Abdul Rahim Ahmad
Introduction (1)
SVM is mainly used for classification and regression problems.
In classification,
we want to estimate a decision function f, using a set of labelled training data, such that f will correctly classify unseen test examples.
Definition of SVM:
“The Support Vector Machine is a learning machine
for pattern recognition and regression problems which
constructs its solution (decision function f) in terms of
a subset of the training data, the Support Vectors.”
Introduction (2)
Why the name machine?
Implemented in software – a software machine.
It receives input and produces output – a classification.
What are support vectors?
A (small) subset of the set of input vectors that are needed for the final machine implementation, i.e. they support the final machine functionality.
What is the relation with Neural Networks (NN)?
It performs similar functions to a NN – pattern recognition, function estimation, interpolation, regression, etc.
Only BETTER.
Introduction (3)
History
SVM came from the idea of the "Generalized Portrait" algorithm (1963) for constructing separating hyperplanes with optimal margin.
Introduced as a large margin classifier at the COLT 1992 conference by Boser, Guyon and Vapnik in the paper:
"A Training Algorithm for Optimal Margin Classifiers."
What is an optimal margin classifier?
A classification algorithm that maximizes the margin between the nearest points of the separate classes in the classification.
Introduction (4)
Introduction (5)
What is risk minimization?
Choosing appropriate values for the parameters, e.g. \alpha, that minimize:
R(\alpha) = \int Q(z, \alpha) \, dP(z)
where
\alpha defines the parameterisation,
Q is the loss function,
z belongs to the union of input and output spaces,
P describes the distribution of z.
P can only be estimated – normally avoided (to simplify) by using the empirical risk:
R_{emp}(\alpha) = \frac{1}{l} \sum_{i=1}^{l} Q(z_i, \alpha)
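As a rough illustration (not from the slides), the empirical risk with a 0-1 loss can be computed directly from a candidate decision function; the data, the function f and the loss Q below are hypothetical stand-ins for the symbols above.

import numpy as np

def zero_one_loss(y_true, y_pred):
    # Q(z, alpha): 0 if the prediction is correct, 1 otherwise
    return float(y_true != y_pred)

def empirical_risk(f, X, y):
    # R_emp(alpha) = (1/l) * sum_i Q(z_i, alpha) over the l training pairs
    l = len(y)
    return sum(zero_one_loss(yi, f(xi)) for xi, yi in zip(X, y)) / l

# hypothetical linear decision function f(x) = sign(w.x + b)
w, b = np.array([1.0, -1.0]), 0.0
f = lambda x: np.sign(np.dot(w, x) + b)

X = np.array([[2.0, 1.0], [0.0, 3.0], [1.0, -1.0]])
y = np.array([+1, -1, +1])
print(empirical_risk(f, X, y))   # fraction of misclassified training points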
Introduction (6)
Introduction (7)
Risk minimization – two distinct ways:
Fix the confidence in the risk, optimize the empirical risk – Neural Network.
Fix the empirical risk, optimize the confidence interval – SVM.
Introduction (8)
To implement SRM
-> find the largest margin by either of the following methods
Introduction (9)
Most popular classifiers are trained using a Neural Network (NN).
The NN decision function might not be the same for every training run and for different initial parameter values; it is taken as optimal simply because training stops once convergence is achieved (decision functions A, B, C, D in the figure are all such solutions).
For better generalization, we need the optimal decision function – the one and only: the one with a large margin between the nearest points (the support vectors) of the two classes.
[Figure: several NN decision boundaries A, B, C, D separating two classes, versus the optimal decision function with the largest margin, defined by the support vectors.]
Theory (1)
3 cases of SVM:
Linearly separable case.
Non-linearly separable case (handled via a kernel mapping).
Imperfectly separable case (soft margin).
Theory (2)
Linearly separable case.
Specifically we want to find a plane H: y = w.x + b = 0 and
two planes parallel to it, say H1 and H2 such that they are
equidistant from H and
H1: y = w.x + b = +1 and
H2: y = w.x + b = -1 .
Also there should be no data points between H1 and H2
and the distance M between H1 and H2 is maximized.
Theory (3)
The distance of a point on H1 to H is:
|w.x + b| / ||w|| = 1 / ||w||,
so the distance M between H1 and H2 is 2 / ||w||.
Theory (4)
In order to maximize the distance, we minimize ||w||.
Furthermore, we do not want any data points between the two. Thus we have:
H1: w.x_i + b \ge +1 for positive examples (y_i = +1)
H2: w.x_i + b \le -1 for negative examples (y_i = -1)
These two constraints can be combined into y_i(w.x_i + b) \ge 1 for all i.
Theory (5)
This is a convex, quadratic programming problem (in w, b) in a convex set, which can be solved by introducing N non-negative Lagrange multipliers \alpha_1, \alpha_2, \ldots, \alpha_N \ge 0 associated with the constraints (theory of Lagrange multipliers).
Thus we have the following Lagrangian to solve for the \alpha_i's:
L(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{N} \alpha_i
We have to minimize this function over w and b and maximize it over the \alpha_i's.
We can solve the Wolfe dual of the Lagrangian instead:
Maximize L(w, b, \alpha) w.r.t. \alpha, subject to the constraints that the gradient of L(w, b, \alpha) w.r.t. the primal variables w and b vanishes, i.e. \partial L / \partial w = 0 and \partial L / \partial b = 0, and that \alpha \ge 0.
We thus have w = \sum_{i=1}^{N} \alpha_i y_i x_i and \sum_{i=1}^{N} \alpha_i y_i = 0
Theory (6)
Putting w = \sum_{i=1}^{N} \alpha_i y_i x_i and \sum_{i=1}^{N} \alpha_i y_i = 0 into L(w, b, \alpha), we get the Wolfe dual:
L_d = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
in which the input data only appear in a dot product.
We solve for the \alpha_i's which maximize L_d subject to \alpha_i \ge 0, i = 1, \ldots, N, and
\sum_{i=1}^{N} \alpha_i y_i = 0
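One way to solve this dual numerically (a sketch, not the method prescribed in the slides) is with a generic quadratic programming solver; the example below assumes the third-party cvxopt package and a tiny hand-made 2-D data set, then recovers w and b from the resulting alphas.

import numpy as np
from cvxopt import matrix, solvers

# toy linearly separable data: two points per class
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

# Q_ij = y_i y_j (x_i . x_j); the dual maximizes sum(alpha) - 0.5 alpha' Q alpha
Q = (y[:, None] * X) @ (y[:, None] * X).T

# cvxopt minimizes 0.5 a'P a + q'a  subject to  G a <= h  and  A a = b
P = matrix(Q)
q = matrix(-np.ones(N))
G = matrix(-np.eye(N))        # -alpha_i <= 0, i.e. alpha_i >= 0
h = matrix(np.zeros(N))
A = matrix(y.reshape(1, -1))  # sum_i alpha_i y_i = 0
b = matrix(0.0)

alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

# w = sum_i alpha_i y_i x_i; b from any support vector (alpha_i > 0)
w = ((alpha * y)[:, None] * X).sum(axis=0)
sv = alpha > 1e-6
b0 = np.mean(y[sv] - X[sv] @ w)
print(w, b0, alpha)

The support vectors are exactly the points with non-zero alpha, which is where the "subset of the training data" in the SVM definition comes from.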
Theory (7)
Non-linearly separable case
In this case, we can transform the data points into another, high-dimensional space such that the data points will be linearly separable in the new space. We construct the Optimal Separating Hyperplane in that space.
Let the transformation be \phi(\cdot). In the high-dimensional space, we solve:
L_d = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \phi(x_i) \cdot \phi(x_j)
[Figure: example of a mapping from 2D to 3D; see the numeric sketch below.]
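A small numeric check (an assumption, following the common quadratic feature map rather than whatever mapping the original slide pictured): phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) sends 2-D points into 3-D, and the dot product in that 3-D space equals the polynomial kernel (x . y)^2 evaluated in the original space.

import numpy as np

def phi(x):
    # quadratic feature map from R^2 to R^3
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 1.0])

lhs = np.dot(phi(x), phi(y))  # dot product in the 3-D feature space
rhs = np.dot(x, y) ** 2       # kernel (x . y)^2 in the input space
print(lhs, rhs)               # both equal 25.0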
Theory (8)
Theory (9)
SVM for the Non-linearly Separable Case
An SVM corresponds to a non-linear decision surface in the input space R^2.
[Figure: data points in the input space R^2 are mapped via \phi into R^3, where a separating hyperplane is constructed in the feature space.]
Theory (10)
Non-linearly separable case
To determine whether a dot product in the high-dimensional space is equivalent to a kernel function in the input space, i.e. \phi(x_i) \cdot \phi(x_j) = K(x_i, x_j), use Mercer's condition.
We need not be explicit about the transformation \phi(\cdot) as long as we know that K(x_i, x_j) is equivalent to the dot product in some other high-dimensional space.
Kernel functions that can be used this way:
Linear kernel: K(x, y) = x \cdot y
Polynomial kernel: K(x, y) = (x \cdot y + 1)^d
RBF (Gaussian) kernel: K(x, y) = \exp(-\lVert x - y \rVert^2 / (2\sigma^2))
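These three kernels are simple enough to write down directly; a minimal numpy sketch (the values of d and sigma are hypothetical defaults, not taken from the slides):

import numpy as np

def linear_kernel(x, y):
    # K(x, y) = x . y
    return np.dot(x, y)

def polynomial_kernel(x, y, d=3):
    # K(x, y) = (x . y + 1)^d
    return (np.dot(x, y) + 1.0) ** d

def rbf_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.dot(x - y, x - y) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))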
Theory (11)
Imperfect Separation Case
No strict enforcement that there be no data points between the hyperplanes H1 and H2,
but penalize the data points that are on the wrong side.
The penalty C is finite and has to be chosen by the user; a large C means a higher penalty.
We introduce non-negative slack variables \xi_i \ge 0 so that:
w \cdot x_i + b \ge +1 - \xi_i for y_i = +1
w \cdot x_i + b \le -1 + \xi_i for y_i = -1
\xi_i \ge 0 \;\; \forall i.
Theory (12)
We add to the objective function a penalising term:
\min_{w, b, \xi} \;\; \frac{1}{2} w^T w + C \sum_i (\xi_i)^m
subject to \; y_i (w^T x_i + b) + \xi_i - 1 \ge 0, \; 1 \le i \le N
\xi_i \ge 0, \; 1 \le i \le N
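The effect of C can be seen on a toy overlapping data set; the sketch below assumes scikit-learn's SVC (which itself wraps LIBSVM) rather than any tool named in the slides, and the data are hypothetical.

import numpy as np
from sklearn.svm import SVC

# toy 1-D data with overlapping classes: not perfectly separable
X = np.array([[-3.0], [-2.0], [-1.0], [0.5], [1.0], [2.0], [3.0], [-0.5]])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # larger C = heavier penalty on the slack, fewer margin violations tolerated
    print(C, clf.n_support_, clf.score(X, y))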
Theory (13)
Imperfect Separation Case
Introducing Lagrange multipliers \alpha, \mu, the Lagrangian is:
L(w, b, \xi, \alpha, \mu) = \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i [y_i (w \cdot x_i + b) + \xi_i - 1] - \sum_{i=1}^{N} \mu_i \xi_i
which can be rearranged as
L(w, b, \xi, \alpha, \mu) = \frac{1}{2} w^T w + \sum_{i=1}^{N} (C - \alpha_i - \mu_i) \xi_i - \left( \sum_{i=1}^{N} \alpha_i y_i x_i^T \right) w - \left( \sum_{i=1}^{N} \alpha_i y_i \right) b + \sum_{i=1}^{N} \alpha_i
Theory (14)
Different SVM objective functions lead to different SVM variations.
Using the l1 norm on the slack (most commonly used):
\min_{w, \xi, b} \;\; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i
Using the l2 norm on the slack:
\min_{w, \xi, b} \;\; \frac{1}{2} w^T w + \frac{1}{2} C \sum_{i=1}^{l} \xi_i^2
Using the l1 norm for w – linear programming (LP) SVM:
\min_{w, \xi, b} \;\; \sum_{i=1}^{l} |w_i| + C \sum_{i=1}^{l} \xi_i
Theory (15)
SVM architecture (for Neural Network users)
The kernel function K is chosen a priori (it determines the type of classifier).
Training – solve a quadratic programming problem to find
the number of hidden units (the number of support vectors),
the weights (w),
the threshold (b).
The first-layer weights x_i are a subset of the training set (the support vectors).
The second-layer weights y_i \alpha_i are computed from the Lagrange multipliers.
f(x) = \mathrm{sgn}\left( \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b \right)
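A hedged numpy sketch (not from the slides) of this decision function, with hypothetical support vectors, labels, Lagrange multipliers and threshold:

import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.dot(x - y, x - y) / (2.0 * sigma ** 2))

def svm_decision(x, support_vectors, sv_labels, alphas, b, kernel=rbf_kernel):
    # f(x) = sgn( sum_i alpha_i y_i K(x_i, x) + b )
    s = sum(a * yi * kernel(xi, x)
            for a, yi, xi in zip(alphas, sv_labels, support_vectors))
    return np.sign(s + b)

# hypothetical trained machine with two support vectors
support_vectors = np.array([[1.0, 1.0], [-1.0, -1.0]])
sv_labels = np.array([+1.0, -1.0])
alphas = np.array([0.5, 0.5])
b = 0.0
print(svm_decision(np.array([2.0, 0.5]), support_vectors, sv_labels, alphas, b))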
Application (1)
SVM Applications
SVM has been applied to a number of areas, such as:
Image classification.
Time series prediction
Face recognition
Biological data processing for medical diagnosis
Digit recognition (MLP-SVM)
Text Categorisation
Speech recognition (using a hybrid SVM/HMM)
Implementation (1)
SVM Implementation
High-performance classifiers through the use of kernels.
Implementation (2)
SVM Implementation
Main issues are classification accuracy and speed.
To improve the speed, a number of improvements to the original SVM have been developed:
(1) Chunking – Osuna
(2) Sequential Minimal Optimization (SMO) – Platt
(3) Nearest Point Algorithm – Keerthi
Implementation (3)
SVM Software Implementation
In high-level languages (C, C++, FORTRAN):
SVM light - Thorsten Joachims'.
mySVM -Ruping
SMO in C++ - XiaPing Yi
LIBSVM – Chih-Jen Lin
Matlab toolboxes:
OSU SVM Toolbox - Junshui Ma and Stanley Ahalt.
MATLAB Support Vector Machine Toolbox - Gavin Cawley
Matlab routines for support vector machine classification - Anton
Schwaighofer
MATLAB Support Vector Machine Toolbox - Steve Gunn
LearnSC - Vojislav Kecman
LIBSVM Interface – students of C.J.Lin
Implementation (4)
Implementation (5)
Model Selection:
Minimizing an estimate of the generalization error or some related performance measure.
K-fold cross-validation and leave-one-out (LOO) estimates.
Other recent model selection strategies are based on bounds determined by quantities (through theoretical analysis) that do not require retraining with data points left out (as in cross-validation or LOO):
SV count / Jaakkola-Haussler bound / Opper-Winther bound / radius-margin bound / span bound.
10-fold cross-validation is popularly used, and is used in my work.
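A sketch of model selection by 10-fold cross-validation over a small grid of C and the RBF width gamma; the use of scikit-learn (whose SVC wraps LIBSVM) and the synthetic data are assumptions, not part of the slides.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# hypothetical two-class data standing in for a real training set
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# grid of (C, gamma) values; 10-fold CV picks the pair with the best accuracy
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=10)
search.fit(X, y)
print(search.best_params_, search.best_score_)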
Implementation (6)
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
LIBSVM for Windows
Java
C/C++
LIBSVM in MATLAB
LIBSVM in R package
LIBSVM in WEKA
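A hedged example of calling LIBSVM from its Python interface (the module path follows recent LIBSVM releases and may differ by version; the file names and parameter values are hypothetical):

from libsvm.svmutil import svm_read_problem, svm_train, svm_predict

# data files in LIBSVM's sparse "label index:value" text format (hypothetical names)
y_train, x_train = svm_read_problem('train.txt')
y_test, x_test = svm_read_problem('test.txt')

# -t 2: RBF kernel, -c 10: penalty C, -g 0.5: kernel width gamma
model = svm_train(y_train, x_train, '-t 2 -c 10 -g 0.5')

# predicted labels, accuracy and decision values on the test set
labels, accuracy, values = svm_predict(y_test, x_test, model)
print(accuracy)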