
MITM 613

Intelligent System

Chapter 8a:
Support Vector Machine

Abdul Rahim Ahmad



Chapter Eight (a): SVM

• Introduction
• Theory
• Implementation
• Tools Comparison
• LIBSVM practical


Introduction (1)
• SVM is mainly used for classification and regression problems.
• In classification:
  • We want to estimate a decision function f, using a set of labelled training data, such that f will correctly classify unseen test examples.
• Definition of SVM:
  • "The Support Vector Machine is a learning machine for pattern recognition and regression problems which constructs its solution (decision function f) in terms of a subset of the training data, the Support Vectors."

Introduction (2)
• Why the name "machine"?
  • It is implemented in software – a software machine.
  • It receives input and produces output – a classification.
• What are support vectors?
  • A (small) subset of the set of input vectors that are needed for the final machine implementation, i.e. they support the final machine functionality.
• What is the relation with Neural Networks (NN)?
  • An SVM performs similar functions to a NN – pattern recognition, function estimation, interpolation, regression, etc.
  • Only BETTER.

Introduction (3)
• History
  • SVM came from the idea of the "Generalized Portrait" algorithm (1963) for constructing separating hyperplanes with optimal margin.
  • Introduced as a large margin classifier at the COLT 1992 conference by Boser, Guyon and Vapnik in the paper "A Training Algorithm for Optimal Margin Classifiers".
• What is an optimal margin classifier?
  • A classification algorithm that maximizes the margin between the nearest points of the separate classes in the classification.

Introduction (4)
• Why the need to achieve an optimal margin?
  • An optimal margin leads to better generalization,
  • implying minimization of the overall risk.
• Two kinds of risk minimization:
  • Structural Risk Minimization (SRM) – as in SVM
  • Empirical Risk Minimization (ERM) – as in Neural Networks

Introduction (5)
• What is risk minimization?
  • Choosing appropriate values for the parameters, e.g. α, that minimize:

    R(\alpha) = \int Q(z, \alpha) \, dP(z)

  • where
    • α defines the parameterisation,
    • Q is the loss function,
    • z belongs to the union of the input and output spaces,
    • P describes the distribution of z.
  • P can only be estimated – this is normally avoided (to simplify) by using the empirical risk:

    R_{emp}(\alpha) = \frac{1}{l} \sum_{i=1}^{l} Q(z_i, \alpha)

  • Minimizing this is called empirical risk minimisation (as in NN).
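
To make the empirical risk concrete, here is a minimal Python sketch (the labels and predictions are made up) that evaluates R_emp(α) for the 0-1 loss, i.e. the average loss of a fixed decision function over l training samples.

import numpy as np

# Hypothetical labels and predictions of a fixed decision function f(.; alpha)
y_true = np.array([+1, -1, +1, +1, -1])   # desired outputs
y_pred = np.array([+1, -1, -1, +1, -1])   # f(x_i; alpha) on the same inputs

# 0-1 loss Q(z_i, alpha): 1 if the prediction is wrong, 0 otherwise
losses = (y_true != y_pred).astype(float)

# Empirical risk R_emp(alpha) = (1/l) * sum_i Q(z_i, alpha)
R_emp = losses.mean()
print(R_emp)   # 0.2 -> one error out of five samples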

Introduction (6)
• Vapnik (Vapnik, 1995) proved that the bound on the expected risk is:

    R(\alpha) \le R_{emp}(\alpha) + f(h)

  • where h is the VC dimension, a measure of the capacity of the learning machine, and f(h) provides the confidence in the risk.
• With probability 1 − η the bound takes the form:

    R(\alpha) \le R_{emp}(\alpha) + \Phi\!\left(\frac{h}{l}, \frac{\log(\eta)}{l}\right),
    \qquad
    \Phi\!\left(\frac{h}{l}, \frac{\log(\eta)}{l}\right) = \sqrt{\frac{h\left(\log\frac{2l}{h} + 1\right) - \log\frac{\eta}{4}}{l}}

• SRM identifies the optimal point on the curve of the bound on the expected risk (i.e. a trade-off between the expected risk and the complexity of the approximating function).
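
Purely as an illustration of the bound quoted above, the sketch below evaluates the confidence term Φ for hypothetical values of the VC dimension h, the sample size l and the confidence parameter η; it grows with h and shrinks with l, which is the trade-off that SRM exploits.

import math

def vc_confidence(h, l, eta):
    # Confidence term of the VC bound:
    # sqrt( (h * (log(2l/h) + 1) - log(eta/4)) / l )
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

# Hypothetical values: capacity h, number of samples l, confidence level 1 - eta
print(vc_confidence(h=10,  l=1000,   eta=0.05))   # modest capacity, moderate data
print(vc_confidence(h=100, l=1000,   eta=0.05))   # larger capacity -> looser bound
print(vc_confidence(h=10,  l=100000, eta=0.05))   # more data -> tighter bound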

Introduction (7)
• Risk minimization – two distinct ways:
  • Fix the confidence in the risk, optimize the empirical risk – Neural Network.
  • Fix the empirical risk, optimize the confidence interval – SVM.
• In NN: fix the network structure; learning then minimizes the empirical risk (using gradient descent).
• In SVM: fix the empirical risk (to a minimum, or 0 for a separable data set); learning then optimizes for a minimum confidence interval (by maximizing the margin of the separating hyperplane).

Introduction (8)
• To implement SRM -> find the largest margin by either of the following methods:
  • Find the optimal plane that bisects the closest points in the convex hulls of the two classes.
  • Find the optimal plane that maximizes the margin (more often used).

Introduction (9)
• The most popular classifiers are trained using Neural Networks (NN).
• A NN decision function might not be
  • the same for every training run and for different initial parameter values,
  • optimal, since training stops once convergence is achieved.
• For better generalization, we need the optimal decision function – the one and only.

[Figure: two NN decision boundaries NN1 and NN2 versus the optimal decision function, which has the largest margin between the nearest points of the 2 classes; the points A, B, C, D are the support vectors.]

Theory (1)
• 3 cases of SVM:
  • Linearly separable case.
  • Non-linearly separable case.
  • Non-separable or imperfect separation case (allowing for noise).

Theory (2)
• Linearly separable case.
  • Specifically, we want to find a plane H: y = w·x + b = 0 and two planes parallel to it, say H1 and H2, such that they are equidistant from H:
      H1: y = w·x + b = +1 and
      H2: y = w·x + b = -1.
  • Also, there should be no data points between H1 and H2, and the distance M between H1 and H2 is maximized.

[Figure: separating plane H: w·x + b = 0 between the margin planes H1: w·x + b = +1 and H2: w·x + b = -1.]



Theory (3)
• The distance from a point on H1 to H is:
    |w·x + b| / ||w|| = 1 / ||w||
• Therefore the distance between H1 and H2 is 2 / ||w||.



Theory (4)
• In order to maximize the distance, we minimize ||w||. Furthermore, we do not want any data points between the two planes. Thus we have:
  • H1: w·x_i + b ≥ +1 for positive examples (y_i = +1)
  • H2: w·x_i + b ≤ -1 for negative examples (y_i = -1)
• The two inequalities can be combined: y_i (w·x_i + b) ≥ 1
• The formulation for the optimal hyperplane is thus:
    minimize ||w|| subject to y_i (w·x_i + b) ≥ 1
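
To make the constraint and the margin concrete, here is a minimal NumPy sketch (all values are made up for illustration): it checks y_i(w·x_i + b) ≥ 1 for every training point and reports the margin 2/||w||.

import numpy as np

# Hypothetical separating hyperplane w.x + b = 0 and a few labelled points
w = np.array([2.0, 1.0])
b = -1.0
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, 0.0], [0.0, -1.0]])
y = np.array([+1, +1, -1, -1])

# Constraint y_i (w.x_i + b) >= 1 must hold for every training point
margins = y * (X @ w + b)
print(margins)               # each value should be >= 1
print(np.all(margins >= 1))  # True if no point lies inside the margin

# Distance between H1 and H2 (the margin) is 2 / ||w||
print(2.0 / np.linalg.norm(w))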



Theory (5)
• This is a convex quadratic programming problem (in w, b) over a convex set, which can be solved by introducing N non-negative Lagrange multipliers α_1, α_2, …, α_N ≥ 0 associated with the constraints (theory of Lagrange multipliers).
• Thus we have the following Lagrangian to solve for the α_i's:

    L(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{N} \alpha_i

• We have to minimize this function over w and b and maximize it over the α_i's.
• We can solve the Wolfe dual of the Lagrangian instead:
  • Maximize L(w, b, α) w.r.t. α, subject to the constraints that the gradient of L(w, b, α) w.r.t. the primal variables w and b vanishes, i.e. ∂L/∂w = 0 and ∂L/∂b = 0, and that α ≥ 0.
• We thus have

    w = \sum_{i=1}^{N} \alpha_i y_i x_i \qquad \text{and} \qquad \sum_{i=1}^{N} \alpha_i y_i = 0

Theory (6)
• Putting w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0 into L(w, b, α), we get the Wolfe dual:

    L_d = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)

  in which the input data appear only in a dot product.
• We solve for the α_i's which maximize L_d subject to α_i ≥ 0, i = 1, …, N, and

    \sum_{i=1}^{N} \alpha_i y_i = 0

• The hyperplane decision function is thus:

    f(x) = \mathrm{sgn}\!\left(\sum_{i=1}^{N} \alpha_i y_i (x_i \cdot x) + b\right) \qquad \text{or} \qquad f(x) = \mathrm{sgn}(w \cdot x + b)

• Since α_i > 0 only for points on the margin and α_i = 0 for the others, only those points play a role in the decision function. They are called support vectors.
• The number of support vectors is usually small; thus we say that the solution of the SVM is sparse.
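
A minimal sketch of this decision function with hypothetical values of α, y, X and b: only the training points with α_i > 0, the support vectors, contribute to f(x).

import numpy as np

# Hypothetical dual solution: most alpha_i are 0, only support vectors are non-zero
X     = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.5]])
y     = np.array([+1, +1, -1, -1])
alpha = np.array([0.5, 0.0, 0.5, 0.0])   # sparse: two support vectors
b     = 0.0

def decision(x):
    # f(x) = sgn( sum_i alpha_i * y_i * (x_i . x) + b )
    return np.sign(np.sum(alpha * y * (X @ x)) + b)

print(decision(np.array([1.5, 1.0])))    # +1 side
print(decision(np.array([-1.0, -2.0])))  # -1 side

# Equivalent explicit weight vector w = sum_i alpha_i * y_i * x_i
w = (alpha * y) @ X
print(np.sign(w @ np.array([1.5, 1.0]) + b))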

Theory (7)
• Non-linearly separable case
  • In this case, we can transform the data points into another, high-dimensional space such that the data points will be linearly separable in the new space. We then construct the optimal separating hyperplane in that space.
  • Let the transformation be Φ(·). In the high-dimensional space we solve:

    L_d = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \Phi(x_i) \cdot \Phi(x_j)

[Figure: example of a mapping Φ from 2D to 3D.]

Theory (8)
• Non-linearly separable case
  • In place of the dot product, if we can find a kernel function which performs this dot product implicitly, we can replace it with that kernel (i.e. perform kernel evaluations instead of explicitly mapping the training data):

    L_d = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)

  • The hyperplane decision function is thus now:

    f(x) = \mathrm{sgn}\!\left(\sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b\right)

Theory (9)
• SVM for the non-linearly separable case: the SVM corresponds to a non-linear decision surface in the input space R².

[Figure: data points in the input space R²; mapping from R² via Φ into R³; separating hyperplane in the feature space R³.]

Theory (10)
• Non-linearly separable case
  • To determine whether a dot product in a high-dimensional space is equivalent to a kernel function in the input space, i.e. Φ(x_i)·Φ(x_j) = K(x_i, x_j):
    • use Mercer's condition.
  • We need not be explicit about the transformation Φ(·) as long as we know that K(x_i, x_j) is equivalent to a dot product in some other high-dimensional space.
  • Kernel functions that can be used this way (a code sketch follows below):
    • Linear kernel:  K(x, y) = x \cdot y
    • Polynomial kernel:  K(x, y) = (x \cdot y + 1)^d
    • Radial basis function (Gaussian kernel):  K(x, y) = e^{-\|x - y\|^2 / 2\sigma^2}
    • Hyperbolic tangent kernel:  K(x, y) = \tanh(a \, x \cdot y + b)
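
These kernels can be written directly as code. Below is a minimal Python sketch; the parameter names d, sigma, a and b mirror the formulas above, and their default values are arbitrary.

import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)                      # K(x, y) = x . y

def polynomial_kernel(x, y, d=3):
    return (np.dot(x, y) + 1.0) ** d         # K(x, y) = (x . y + 1)^d

def rbf_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

def tanh_kernel(x, y, a=1.0, b=1.0):
    return np.tanh(a * np.dot(x, y) + b)     # hyperbolic tangent kernel

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y), tanh_kernel(x, y))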



Theory (11)
• Imperfect separation case
  • No strict enforcement that there be no data points between the hyperplanes H1 and H2.
  • Instead, penalize the data points that fall on the wrong side.
  • The penalty C is finite and has to be chosen by the user. A large C means a higher penalty.
  • We introduce non-negative slack variables ξ_i ≥ 0 so that:
      w·x_i + b ≥ +1 - ξ_i for y_i = +1
      w·x_i + b ≤ -1 + ξ_i for y_i = -1
      ξ_i ≥ 0 for all i.

Theory (12)
• We add a penalising term to the objective function:

    \min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_i (\xi_i)^m

• where m is usually set to 1, which gives us:

    \min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i

    subject to  y_i (w^T x_i + b) + \xi_i - 1 \ge 0, \; 1 \le i \le N
                \xi_i \ge 0, \; 1 \le i \le N

Theory (13)
• Imperfect separation case
  • Introducing Lagrange multipliers α and μ, the Lagrangian is:

    L(w, b, \xi, \alpha, \mu) = \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left[ y_i (w \cdot x_i + b) + \xi_i - 1 \right] - \sum_{i=1}^{N} \mu_i \xi_i

    = \frac{1}{2} w^T w + \sum_{i=1}^{N} (C - \alpha_i - \mu_i) \xi_i - \left( \sum_{i=1}^{N} \alpha_i y_i x_i \right)^T w - \left( \sum_{i=1}^{N} \alpha_i y_i \right) b + \sum_{i=1}^{N} \alpha_i

  • Similarly, solving for the Wolfe dual, neither the ξ_i nor their Lagrange multipliers μ_i appear in the dual problem. Maximize:

    L_d = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)

    subject to  0 \le \alpha_i \le C  and  \sum_{i=1}^{N} \alpha_i y_i = 0

  • The only difference from the perfectly separable case is that α_i is now bounded above by C. The solution is again given by:

    w = \sum_{i=1}^{N} \alpha_i y_i x_i

Theory (14)
• Different SVM objective functions lead to different SVM variations:
  • Using the l1 norm of ξ (most commonly used):

    \min_{w, \xi, b} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i

  • Using the l2 norm of ξ:

    \min_{w, \xi, b} \; \frac{1}{2} w^T w + \frac{C}{2} \sum_{i=1}^{l} \xi_i^2

  • Using the l1 norm for w – linear programming (LP) SVM:

    \min_{w, \xi, b} \; \sum_{i=1}^{l} |w_i| + C \sum_{i=1}^{l} \xi_i

  • Using a ν parameter to control the number of support vectors (ν-SVM):

    \min_{w, \xi, b, \rho} \; \frac{1}{2} w^T w - \nu \rho + \frac{1}{l} \sum_{i=1}^{l} \xi_i

Theory (15)
• SVM architecture (for Neural Network users)
  • The kernel function K is chosen a priori (it determines the type of classifier).
  • Training – solve a quadratic programming problem to find:
    • the number of hidden units (the number of support vectors),
    • the weights (w),
    • the threshold (b).
  • The first-layer weights x_i are a subset of the training set (the support vectors).
  • The second-layer weights β_i = y_i α_i are computed from the Lagrange multipliers.

    f(x) = \mathrm{sgn}\!\left(\sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b\right)

Application (1)
• SVM applications
  • SVM has been applied to a number of applications, such as:
    • Image classification
    • Time series prediction
    • Face recognition
    • Biological data processing for medical diagnosis
    • Digit recognition (MLP-SVM)
    • Text categorisation
    • Speech recognition (using a hybrid SVM/HMM)

Implementation (1)
• SVM implementation
  • High-performance classifiers
  • Use of kernels
• Different kernel functions lead to
  • very similar classification accuracies, and
  • similar SV sets
  (that is, the SV set seems to characterize the given task to a certain degree independently of the type of kernel).

Implementation (2)
• SVM implementation
  • The main issues are classification accuracy and speed.
  • To improve the speed, a number of improvements to the original SVM have been developed:
    (1) Chunking – Osuna
    (2) Sequential Minimal Optimization (SMO) – Platt
    (3) Nearest Point Algorithm – Keerthi

Implementation (3)
• SVM software implementations
  • In high-level languages (C, C++, FORTRAN):
    • SVMlight – Thorsten Joachims
    • mySVM – Ruping
    • SMO in C++ – XiaPing Yi
    • LIBSVM – Chih-Jen Lin
  • Matlab toolboxes:
    • OSU SVM Toolbox – Junshui Ma and Stanley Ahalt
    • MATLAB Support Vector Machine Toolbox – Gavin Cawley
    • Matlab routines for support vector machine classification – Anton Schwaighofer
    • MATLAB Support Vector Machine Toolbox – Steve Gunn
    • LearnSC – Vojislav Kecman
    • LIBSVM interface – students of C.J. Lin

Implementation (4)
• Steps in SVM training:
  • Select the parameter C (representing the trade-off between minimizing the training error and maximizing the margin), the kernel function and any kernel parameters.
  • Solve the dual QP problem, or an alternative problem formulation, using an appropriate QP or LP algorithm, to obtain the support vectors.
  • Calculate the threshold b using the support vectors (a sketch of this step follows below).
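
A minimal NumPy sketch of the last step, assuming a hypothetical dual solution α and a linear kernel: every margin support vector (0 < α_i < C) satisfies y_i(w·x_i + b) = 1, so b = y_i - w·x_i; averaging over all margin support vectors is the usual, numerically safer choice.

import numpy as np

C = 10.0
X     = np.array([[1.0, 0.0], [-1.0, 0.0], [2.0, 1.0]])
y     = np.array([+1, -1, +1])
alpha = np.array([0.5, 0.5, 0.0])   # hypothetical dual solution; third point is not a SV

w = (alpha * y) @ X                 # w = sum_i alpha_i y_i x_i  ->  [1, 0]

# Margin support vectors satisfy 0 < alpha_i < C and y_i (w.x_i + b) = 1,
# so b = y_i - w.x_i for each of them; average over all margin SVs.
margin_sv = (alpha > 0) & (alpha < C)
b = np.mean(y[margin_sv] - X[margin_sv] @ w)
print(w, b)                         # [1. 0.] 0.0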

Implementation (5)
• Model selection:
  • Minimize an estimate of the generalization error or some related performance measure.
  • K-fold cross-validation and leave-one-out (LOO) estimates.
  • Other, more recent model selection strategies are based on a bound determined by some quantity (through theoretical analysis) which does not require retraining with data points left out (as in cross-validation or LOO):
    • SV count / Jaakkola-Haussler bound / Opper-Winther bound / radius-margin bound / span bound.
  • 10-fold cross-validation is popularly used, and is used in my work (see the sketch below).
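
A sketch of 10-fold cross-validation for model selection, here using scikit-learn's GridSearchCV on toy data (any K-fold tool would do; LIBSVM's grid.py performs the same kind of search over C and gamma):

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Toy data (two noisy blobs); replace with the real training set
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + 1.0, rng.randn(50, 2) - 1.0])
y = np.array([+1] * 50 + [-1] * 50)

# Logarithmic grid over C and gamma, scored by 10-fold cross-validation
param_grid = {'C': [2 ** k for k in range(-5, 6, 2)],
              'gamma': [2 ** k for k in range(-7, 4, 2)]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=10)
search.fit(X, y)

print(search.best_params_)   # chosen (C, gamma)
print(search.best_score_)    # mean 10-fold CV accuracy for that choice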

Implementation (6)
• Different methods for QP optimization:
  • (a) Techniques in which kernel components are evaluated and discarded during learning:
    • Kernel Adatron
  • (b) Decomposition methods in which an evolving subset of the data is used:
    • Sequential Minimal Optimization (SMO)
    • SVMlight / LIBSVM
  • (c) New optimization approaches that specifically exploit the structure of the SVM problem:
    • Nearest Point Algorithm (NPA)
Tools Comparison – SVMTorch / SVMLight / LIBSVM

• Developer:
  SVMTorch – Ronan Collobert; SVMLight – Thorsten Joachims; LIBSVM – Chih-Jen Lin
• Uses:
  SVMTorch – classification, regression; SVMLight – classification, regression, ranking; LIBSVM – C-SVC / ν-SVC, ε-SVR / ν-SVR, distribution estimation (one-class SVM)
• Language:
  SVMTorch – C++; SVMLight – C; LIBSVM – C/C++/Java, with Python/Matlab/R/Perl interfaces
• Optimization method:
  SVMTorch – decomposition, working set of size 2; SVMLight – decomposition, working set of size 2 or more; LIBSVM – decomposition, working set of size 2 or more
• Internal cache:
  SVMTorch – yes; SVMLight – yes; LIBSVM – yes
• Shrinking:
  SVMTorch – optional; SVMLight – yes; LIBSVM – yes
• Generalization performance estimates:
  SVMTorch – none; SVMLight – LOO and Xi-alpha estimates; LIBSVM – automatic cross-validation functionality
• Multiclass:
  SVMTorch – yes (one against all); SVMLight – no (needs to be added by the user); LIBSVM – yes (one against all, one against one with DAG)
• Extras:
  LIBSVM – weighted SVM for unbalanced data sets
• Shrinking = remove the α that have been equal to the bounds (0 or C) for a long time.
Implementation (III): SVMTorch

Implementation (III): SVMLight

Implementation (III): LIBSVM
LIBSVM

LIBSVM History
• 1.0 : June 2000 – First release.
• 2.0 : Aug 2000 – Major updates: add nu-SVM, one-class SVM, and SVR.
• 2.1 : Dec 2000 – Java version added; regression demonstrated in svm-toy.
• 2.2 : Jan 2001 – Multi-class classification, nu-SVR.
• 2.3 : Mar 2001 – Cross validation; fix some minor bugs.
• 2.31: April 2001 – Fix one bug in one-class SVM; use float for the cache.
• 2.33: Dec 2001 – Python interface added.
• 2.36: Aug 2002 – grid.py added: contour plot of CV accuracy.
• 2.4 : April 2003 – Improvements to scaling.
• 2.5 : Nov 2003 – Some minor updates.
• 2.6 : April 2004 – Probability estimates for classification/regression.
• 2.7 : Nov 2004 – Stratified cross validation.
• 2.8 : April 2005 – New working set selection via second order information.
LIBSVM Current Version
• 2.81: Nov 2005
• 2.82: Apr 2006
• 2.83: Nov 2006
• 2.84: April 2007
• 2.85: Nov 2007
• 2.86: April 2008
• 2.87: October 2008
• 2.88: October 2008
• 2.89: April 2009
• 2.9 : November 2009
• 2.91: April 2010
• 3.0 : September 13, 2010
• 3.12: April Fools' day, 2012

http://www.csie.ntu.edu.tw/~cjlin/libsvm/
LIBSVM for Windows
• Java
• C/C++
• LIBSVM in MATLAB
• LIBSVM in R package
• LIBSVM in WEKA
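
A minimal sketch of a LIBSVM session through its bundled Python interface (svmutil). The exact import path depends on how LIBSVM was installed (some installations use `from libsvm.svmutil import ...`), and heart_scale is the example data file distributed with LIBSVM in the sparse "label index:value" format.

# LIBSVM's Python interface (svmutil), shipped in the python/ directory of LIBSVM
from svmutil import svm_read_problem, svm_train, svm_predict

# Read the sample data set distributed with LIBSVM
y, x = svm_read_problem('heart_scale')

# Train C-SVC with an RBF kernel: -s 0 (C-SVC), -t 2 (RBF), -c cost, -g gamma
model = svm_train(y[:200], x[:200], '-s 0 -t 2 -c 1 -g 0.5')

# Predict on held-out examples; svm_predict also reports the accuracy
labels, accuracy, values = svm_predict(y[200:], x[200:], model)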
