
Supervised Learning - Support Vector Machines and Feature Reduction

Report 3 - Machine Learning and Pattern Recognition
B.M. Oscar De Silva - 201070588 - Winter 2011
Submitted to: Dr. George Mann
April 21, 2011

1 Support Vector Machines

Support vector machines (SVMs) are a class of linear classifiers. Unlike logistic classifiers, which attempt to fit a logistic probability distribution to the training set, SVMs attempt to learn the linear decision boundary with the maximum possible margin between the classes. For this reason the SVM is identified as an optimal margin classifier.

1.1 SVM - Notation

Support vector machine algorithms use a slightly different notation scheme.

Training examples: all positive examples are labeled +1 and the negative examples are labeled -1, i.e. $y \in \{-1, +1\}$.

Discriminant function: parameterized by the vector $w$ and the intercept term $b$; the decision boundary is $w^T x + b = 0$.

Functional margin of an example: $\hat{\gamma}^{(i)}$ is the scaled normal distance from the decision boundary to the training example $x^{(i)}$. It is a scaled distance because the vector $w$ is not a unit vector: $\hat{\gamma}^{(i)} = y^{(i)}(w^T x^{(i)} + b)$.

Functional margin of the training set: the minimum of all functional margins, $\hat{\gamma} = \min_i \hat{\gamma}^{(i)}$.

Geometric margin: the normalized functional margin, i.e. the actual distance from the decision boundary to the closest training vector, $\gamma = \hat{\gamma}/\|w\|$.
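As a small numerical illustration of these definitions, the following NumPy sketch computes the per-example functional margins, the functional margin of the set, and the geometric margin for a given classifier; the function name and array shapes are only illustrative.

```python
import numpy as np

def margins(w, b, X, y):
    """Functional and geometric margins of a linear classifier w^T x + b.

    X : (m, n) array of training examples, y : (m,) array of labels in {-1, +1}.
    """
    gamma_hat_i = y * (X @ w + b)          # per-example functional margins
    gamma_hat = gamma_hat_i.min()          # functional margin of the training set
    gamma = gamma_hat / np.linalg.norm(w)  # geometric margin (w need not be unit length)
    return gamma_hat, gamma
```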

1.2 SVM - Optimization problem

The SVM optimization problem can be formulated as maximizing the geometric margin with respect to the parameters $\hat{\gamma}$, $w$ and $b$, such that every training example lies at least the margin away from the decision boundary:

$$\max_{\hat{\gamma}, w, b} \; \frac{\hat{\gamma}}{\|w\|} \quad \text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \geq \hat{\gamma}, \;\; \forall i$$

The functional margin (the scaled distance from the decision boundary to the nearest training example) is set to 1. This is possible since $w$ and $b$ can be rescaled without affecting the decision boundary. The optimization problem is then no longer non-convex and is solvable by commercial quadratic programming code:

$$\min_{w,b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \geq 1, \;\; \forall i$$

This problem formulation is further refined to exploit non-linear feature spaces and linearly inseparable cases, using Lagrange multipliers and Lagrange duality:

$$\max_\alpha \; W(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} y^{(i)} y^{(j)} \alpha_i \alpha_j \, x^{(i)} \cdot x^{(j)}$$
$$\text{s.t.} \quad C \geq \alpha_i \geq 0, \qquad \sum_i \alpha_i y^{(i)} = 0$$

The $\alpha_i$ are the Lagrange multipliers, which are the parameters of this optimization problem. An $\alpha_i$ only carries a non-zero value for the examples with a functional margin less than or equal to 1. These training examples are termed the support vectors and are the only data points that decide the decision boundary of the classifier. The slack penalty $C$ is in place to handle the linearly inseparable case; the detailed formulation is explained in [1]. The optimized values of $\alpha_i$ relate to the decision boundary as:

$$w = \sum_{i:\,\alpha_i \neq 0} \alpha_i y^{(i)} x^{(i)}$$
$$b = \operatorname{mean}_i\left( y^{(i)} - w^T x^{(i)} \right), \quad \text{taken over the support vectors}$$
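As one way to inspect this relationship in practice, scikit-learn's `SVC` with a linear kernel exposes the dual solution: its `dual_coef_` attribute stores $\alpha_i y^{(i)}$ for the support vectors, so $w$ and $b$ can be recovered as above. A minimal sketch on a made-up toy dataset:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data; labels in {-1, +1} as in the notation above.
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # C is the slack penalty

# dual_coef_ stores alpha_i * y(i) for the support vectors only, so
# w = sum_i alpha_i y(i) x(i) reduces to a product over the support vectors.
w = clf.dual_coef_ @ clf.support_vectors_
b = clf.intercept_
print("support vectors:\n", clf.support_vectors_)
print("w =", w.ravel(), "b =", b)
```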

1.3 SVM - Kernel trick

The kernel trick is a general method used in machine learning algorithms to work efficiently in higher-dimensional feature spaces. Consider a case where linear classification is not possible with the given features, but a non-linear mapping to a higher-dimensional feature space makes the dataset linearly classifiable.

Figure 1: Classification of two datasets, (a) linearly separable and (b) linearly inseparable, by solving the SVM optimization problem using quadratic programming. The color-filled data points are the support vectors identified by the optimization.

For clarity, the original features of the problem are termed attributes, and the feature mapping operation maps these attributes $x$ to the features $\phi(x)$ used by the algorithm. For example, a single attribute $x$ can be transformed to a higher-dimensional feature set of its powers:

$$\phi(x) = \begin{pmatrix} x \\ x^2 \\ x^3 \\ \vdots \end{pmatrix} \qquad (1)$$

The kernel trick bypasses these computationally intensive individual feature mappings $\phi(x)$ and uses kernel functions to compute the dot product of two vectors in the new feature space, $K(x, z) = \phi(x) \cdot \phi(z)$. So any machine learning algorithm which uses only dot products of the attributes, $x^{(i)} \cdot x^{(j)}$, in its steps can efficiently increase its feature dimension by replacing these dot product operations with a kernel function $K(x^{(i)}, x^{(j)}) = \phi(x^{(i)}) \cdot \phi(x^{(j)})$. With this addition, the optimization problem takes the form:
$$\max_\alpha \; W(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} y^{(i)} y^{(j)} \alpha_i \alpha_j \, K(x^{(i)}, x^{(j)})$$
$$\text{s.t.} \quad C \geq \alpha_i \geq 0, \qquad \sum_i \alpha_i y^{(i)} = 0$$
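A small sketch of why the trick works, using the degree-2 polynomial kernel $K(x, z) = (x \cdot z + 1)^2$ on two-dimensional attributes: the explicit feature map below is one choice whose dot product reproduces the kernel value exactly; the map and the test vectors are purely illustrative.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D attribute vector x (illustrative)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, z, d=2):
    """Inhomogeneous polynomial kernel K(x, z) = (x.z + 1)^d."""
    return (x @ z + 1.0) ** d

x = np.array([1.0, 3.0])
z = np.array([2.0, 0.5])
print(phi(x) @ phi(z))    # explicit mapping, then dot product in the new space
print(poly_kernel(x, z))  # same value, without ever forming phi(.)
```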


Figure 2: Classification of a nonlinear dataset using a Gaussian kernel with support vector machines. The support vectors are the colored data points.

Some common kernels that can be used are:

$$K(x^{(i)}, x^{(j)}) = x^{(i)} \cdot x^{(j)} \qquad \text{no mapping}$$
$$K(x^{(i)}, x^{(j)}) = (x^{(i)} \cdot x^{(j)})^d \qquad \text{homogeneous polynomial}$$
$$K(x^{(i)}, x^{(j)}) = (x^{(i)} \cdot x^{(j)} + 1)^d \qquad \text{inhomogeneous polynomial}$$
$$K(x^{(i)}, x^{(j)}) = \exp\left(-\frac{\|x^{(i)} - x^{(j)}\|^2}{2\sigma^2}\right) \qquad \text{Gaussian r.b.f.}$$

Figure 2 shows the result of an SVM using the Gaussian radial basis function mapping. This kernel transforms the feature space to infinite dimensions efficiently. After solving the quadratic programming optimization problem for the $\alpha_i$, new data points are classified as follows:

$$b = \operatorname{mean}\left( y^{(svm)} - \sum_i \alpha_i y^{(i)} K(x^{(i)}, x^{(svm)}) \right), \quad \text{over the support vectors } x^{(svm)}$$
$$\text{Choose class } C_1 \text{ if } \sum_i \alpha_i y^{(i)} K(x^{(i)}, x^{(p)}) + b > 0, \quad \text{for test vectors } x^{(p)}$$
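For a concrete run similar in spirit to Figure 2, scikit-learn's `SVC` with an RBF kernel evaluates exactly this decision rule over the support vectors; note that its `gamma` parameter corresponds to $1/(2\sigma^2)$. The ring-shaped dataset and parameter values below are synthetic and illustrative, not the report's.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy nonlinear problem: class +1 inside a ring of class -1.
X = rng.normal(size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)

sigma = 0.5
clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2), C=10.0).fit(X, y)

# decision_function returns sum_i alpha_i y(i) K(x(i), x) + b; predict thresholds it at 0.
x_new = np.array([[0.2, -0.1], [2.5, 2.0]])
print(clf.decision_function(x_new))
print(clf.predict(x_new))
```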

1.4 SVM - SMO algorithm

The quadratic programming problem of the SVM normally requires a commercial QP solver. The constraint $\sum_i \alpha_i y^{(i)} = 0$ makes the previously discussed gradient ascent and Newton's method infeasible, because each update step of the algorithm would have to satisfy the constraint. The SMO algorithm overcomes this issue by updating two of the $\alpha_i$ in each step so that the constraint remains satisfied. The basic steps of the algorithm can be summarized as follows:

1. Set the $\alpha_i$ to satisfy the inequality and equality constraints.
2. Select two multipliers to update, say $\alpha_u$ and $\alpha_v$.
3. To satisfy the equality constraint, set $\alpha_u = -\left(\sum_{i \neq u,v} \alpha_i y^{(i)} + \alpha_v y^{(v)}\right) y^{(u)}$.
4. Maximize the objective function with respect to $\alpha_v$ using coordinate ascent or Newton's method, keeping $\alpha_i$ fixed for $i \neq u, v$: $\max_{\alpha_v} W(\alpha_v)$.
5. Clip $\alpha_v$ to satisfy the inequality constraints.
6. Repeat from step 2 until convergence (convergence criterion: the KKT conditions [1] are satisfied within a 0.01 tolerance).
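The following is a minimal sketch of a simplified SMO-style loop following these steps, assuming a precomputed kernel matrix and labels in {-1, +1}. The pair-selection heuristics of the full algorithm are replaced here by a random second multiplier, and the tolerances are illustrative, so this is not a production solver.

```python
import numpy as np

def simplified_smo(K, y, C, tol=1e-3, max_passes=5, seed=0):
    """Simplified SMO sketch: K is the (m, m) kernel matrix, y holds labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    m = len(y)
    alpha, b, passes = np.zeros(m), 0.0, 0
    while passes < max_passes:
        changed = 0
        for u in range(m):
            E_u = (alpha * y) @ K[:, u] + b - y[u]        # prediction error on example u
            if (y[u] * E_u < -tol and alpha[u] < C) or (y[u] * E_u > tol and alpha[u] > 0):
                v = rng.integers(m - 1)
                v = v + 1 if v >= u else v                # random second multiplier, v != u
                E_v = (alpha * y) @ K[:, v] + b - y[v]
                a_u_old, a_v_old = alpha[u], alpha[v]
                if y[u] != y[v]:                          # box constraints give the clip range
                    L, H = max(0.0, a_v_old - a_u_old), min(C, C + a_v_old - a_u_old)
                else:
                    L, H = max(0.0, a_u_old + a_v_old - C), min(C, a_u_old + a_v_old)
                eta = 2 * K[u, v] - K[u, u] - K[v, v]     # second derivative along the pair
                if L == H or eta >= 0:
                    continue
                alpha[v] = np.clip(a_v_old - y[v] * (E_u - E_v) / eta, L, H)
                if abs(alpha[v] - a_v_old) < 1e-5:
                    continue
                # Equality constraint sum_i alpha_i y(i) = 0 fixes the paired update.
                alpha[u] = a_u_old + y[u] * y[v] * (a_v_old - alpha[v])
                b1 = (b - E_u - y[u] * (alpha[u] - a_u_old) * K[u, u]
                      - y[v] * (alpha[v] - a_v_old) * K[u, v])
                b2 = (b - E_v - y[u] * (alpha[u] - a_u_old) * K[u, v]
                      - y[v] * (alpha[v] - a_v_old) * K[v, v])
                if 0 < alpha[u] < C:
                    b = b1
                elif 0 < alpha[v] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b
```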

2 Generative Learning Algorithms

The classification methods discussed so far directly learn the posterior probabilities of the classes, i.e. the discriminant functions: in logistic regression a logistic function is fit to the training data, and in the SVM a perceptron-like function is fit to the training set. Such algorithms are discriminative learning algorithms. This section introduces another class of learners, generative learning algorithms. They attempt to model the class-conditional likelihoods $P(x|C_j)$ and the priors $P(C_j)$ by parameter estimation on the training data set, and from Bayes' theorem the discriminant function $g_j(x) = P(C_j|x)$ is found.

2.1 Gaussian discriminant analysis

Gaussian discriminant analysis (GDA) generates the posterior probabilities by assuming the underlying distributions to be the following:

$$y \sim \text{Bernoulli}(\phi)$$
$$x|y=0 \sim \mathcal{N}(\mu_0, \Sigma)$$
$$x|y=1 \sim \mathcal{N}(\mu_1, \Sigma)$$

So the class-conditional likelihoods and the prior are:

$$P(y) = \phi^y (1-\phi)^{1-y}$$
$$P(x|y=0) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_0)^T \Sigma^{-1} (x-\mu_0)\right)$$
$$P(x|y=1) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_1)^T \Sigma^{-1} (x-\mu_1)\right)$$


Figure 3: Gaussian discriminant analysis of a dataset. The classifier boundary is computed as $P(y=1|x^{(p)}) = 0.5$.

The machine learning algorithm should estimate the parameters $\phi, \mu_0, \mu_1, \Sigma$. Following a maximum likelihood approach, the optimization problem becomes:

$$\max_{\phi, \mu_0, \mu_1, \Sigma} \; \ell(\phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{m} P(x^{(i)}, y^{(i)}) = \log \prod_{i=1}^{m} P(x^{(i)}|y^{(i)})\, P(y^{(i)})$$

By maximization, the MLE estimates are found as:

$$\phi = \frac{1}{m} \sum_i 1\{y^{(i)} = 1\}$$
$$\mu_0 = \frac{\sum_i 1\{y^{(i)} = 0\}\, x^{(i)}}{\sum_i 1\{y^{(i)} = 0\}}, \qquad \mu_1 = \frac{\sum_i 1\{y^{(i)} = 1\}\, x^{(i)}}{\sum_i 1\{y^{(i)} = 1\}}$$
$$\Sigma = \frac{1}{m} \sum_i (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T$$

These estimates are used to find the posterior probability $P(y|x^{(p)})$ for classification of a new data point $x^{(p)}$.
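A minimal NumPy sketch of these maximum-likelihood estimates; the function name and the 0/1 label convention are assumptions made for the example.

```python
import numpy as np

def gda_fit(X, y):
    """Maximum-likelihood GDA estimates; X is (m, n), y is (m,) with labels in {0, 1}."""
    m = len(y)
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    mu = np.where((y == 1)[:, None], mu1, mu0)   # mu_{y(i)} for each example
    Sigma = (X - mu).T @ (X - mu) / m            # shared covariance matrix
    return phi, mu0, mu1, Sigma
```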

2.2 Naive Bayes classifier

In the above model it was assumed that the two class-conditional distributions $x|y=0$ and $x|y=1$ share a common covariance matrix $\Sigma$. We can relax this assumption by allowing different covariance matrices for the two classes, which gives a nonlinear decision boundary. In either case, for a given class $y$ a non-diagonal covariance matrix implies that the features of that particular class are mutually dependent. The Naive Bayes classifier imposes a strong assumption on the above case: it assumes that, given a class $y$, the features $x$ are mutually independent. This is termed class-conditional independence of the features, or generally the Naive Bayes assumption. It allows the multiplicative rule to be used for computing joint probabilities:

$$P(x|y) = \prod_{i=1}^{n} P(x_i|y)$$

This classification method is extensively used in designing spam filters. The generative model is formed so that the prior probability of the spam class and the probability of each word given spam and given non-spam are all modeled by Bernoulli distributions. To handle words that are unobserved in the training set, a Laplace smoothing operation is employed [1].
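A minimal sketch of such a Bernoulli Naive Bayes model with Laplace smoothing, assuming a binary word-presence matrix X; the function names are illustrative, and the +1/+2 smoothing counts follow the usual two-outcome Bernoulli formulation.

```python
import numpy as np

def naive_bayes_fit(X, y):
    """Bernoulli Naive Bayes with Laplace smoothing.

    X is an (m, n) binary matrix (word present / absent), y is (m,) with labels in {0, 1}.
    """
    phi_y = y.mean()                                             # P(spam)
    phi_j1 = (X[y == 1].sum(axis=0) + 1) / (np.sum(y == 1) + 2)  # P(word_j = 1 | spam)
    phi_j0 = (X[y == 0].sum(axis=0) + 1) / (np.sum(y == 0) + 2)  # P(word_j = 1 | not spam)
    return phi_y, phi_j1, phi_j0

def naive_bayes_predict(x, phi_y, phi_j1, phi_j0):
    """Class-conditional independence: log P(x|y) is a sum of per-word log probabilities."""
    log_p1 = np.log(phi_y) + np.sum(x * np.log(phi_j1) + (1 - x) * np.log(1 - phi_j1))
    log_p0 = np.log(1 - phi_y) + np.sum(x * np.log(phi_j0) + (1 - x) * np.log(1 - phi_j0))
    return int(log_p1 > log_p0)
```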

3 Feature Selection and Feature Reduction

Machine learning algorithms frequently encounter problems with high feature dimensions. In SVMs the feature space was mapped to higher dimensions to exploit non-linearity for better classification. At the same time, most problems have a large number of features of which only a few carry information relevant to the classifier. For some problems, therefore, certain features can simply be omitted for classification; this process is termed feature selection. Another possibility is to use all the available features but map them to a lower-dimensional space to be used by the machine learning algorithm; this is termed dimensionality reduction.

3.1 Principal Component Analysis

Principal component analysis is an unsupervised dimensionality reduction technique: it does not take into account the class labels of the data points. The method considers the data points of all classes as a whole and attempts to find the vectors in the n-dimensional space onto which the data projects with the largest variance. The first principal component is the n-dimensional vector onto which the data maps with the largest variance (maximum spread); the second principal component is a vector orthogonal to the first, with the next largest variance of the mapping. It is possible to find a maximum of n principal components, but in most cases a large portion of the information in the dataset is captured by a few of them. The designer can therefore limit the feature mapping to the k principal components that capture the majority of the information, which effectively reduces the n-dimensional features to a much simpler k-dimensional representation. Consider a data set $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$, with features $x \in \mathbb{R}^n$, that needs to be mapped onto the unit vector $u$ of largest variance. Mathematically, this is to maximize the variance of the projections of the data points onto the unit vector $u$.


Figure 4: Principal component mapping of a dataset. The two mapping vectors are the two eigenvectors of the covariance of the features. From the singular values, PCA 1 carries 70% of the information in the original data.

$$\max_u \; \frac{1}{m}\sum_i (x^{(i)} \cdot u)^2 \quad \text{s.t.} \quad \|u\|^2 = 1$$

This optimization problem can be refined as:

$$\frac{1}{m}\sum_i (x^{(i)} \cdot u)^2 = \frac{1}{m}\sum_i u^T x^{(i)} x^{(i)T} u = u^T \left(\frac{1}{m}\sum_i x^{(i)} x^{(i)T}\right) u = u^T \Sigma u$$

$$\max_u \; u^T \Sigma u \quad \text{s.t.} \quad \|u\|^2 = 1$$

Using Lagrange multipliers and solving for the critical value, the solution becomes:

$$\Sigma u = \lambda u$$

This is a typical eigenvalue problem, where the first principal component mapping is the eigenvector corresponding to the largest eigenvalue, and the second principal component is the eigenvector corresponding to the next largest eigenvalue. The amount of information captured by the successive vectors can be computed as the ratio between the sum of the retained eigenvalues and the sum of all eigenvalues. The PCA operation is performed after mean-removal and variance-normalization preprocessing steps. For stable computation of the eigenvectors, Singular Value Decomposition (SVD) routines are usually employed.
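A minimal NumPy sketch of this procedure, using SVD after the mean-removal and variance-normalization steps mentioned above; the function name and the printed diagnostic are illustrative.

```python
import numpy as np

def pca(X, k):
    """Top-k principal components via SVD; X is (m, n).

    Returns the (n, k) mapping matrix and the projected (m, k) data.
    """
    X = (X - X.mean(axis=0)) / X.std(axis=0)    # mean removal and variance normalization
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)       # fraction of variance per component
    print("variance captured by the first", k, "components:", explained[:k].sum())
    W = Vt[:k].T                                # principal directions (eigenvectors of the covariance)
    return W, X @ W
```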


Figure 5: Mapping 600 examples of a 1024-dimensional image feature vector (handwritten digits) onto two principal components. The amount of energy captured is 21%; data of different classes appear to cluster in certain regions of the reduced feature space.

3.2 Linear Discriminant Analysis

Linear discriminant analysis (LDA) is a supervised feature reduction method. The objective is to map the n-dimensional features of a k-class supervised learning problem to a (k-1)-dimensional feature space in which the classes are furthest apart. This is termed Fisher's linear discriminant and corresponds to maximization of the objective function:

$$J(w) = \frac{(w \cdot m_1 - w \cdot m_2)^2}{s_1^2 + s_2^2}, \qquad \text{where } s_j^2 = \sum_{i:\,y^{(i)}=j} (w \cdot x^{(i)} - w \cdot m_j)^2$$

This maximizes the separation of the class means in the new space defined by the vector $w$, relative to the within-class scatter.

Figure 6: LDA mapping of a two-dimensional dataset.


For the general case, the optimization problem and its solution are formulated as follows:

$$J(W) = \frac{|W^T S_B W|}{|W^T S_W W|}$$

where:

$$S_i = \sum_{j:\,y^{(j)}=i} (x^{(j)} - m_i)(x^{(j)} - m_i)^T$$
$$m = \frac{1}{K}\sum_{i=1}^{K} m_i$$
$$S_B = \sum_{i=1}^{K} N_i (m_i - m)(m_i - m)^T$$
$$S_W = \sum_{i=1}^{K} S_i$$

Solution: the mapping vectors are the leading eigenvectors of $S_W^{-1} S_B$.
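A minimal NumPy sketch of this construction; it forms $S_W$ and $S_B$ as above (with $m$ taken as the mean of the class means, as in the text) and returns the leading eigenvectors of $S_W^{-1} S_B$. The function name and interface are illustrative.

```python
import numpy as np

def lda_fit(X, y, k):
    """Fisher LDA sketch: returns the k leading discriminant directions (k <= #classes - 1)."""
    classes = np.unique(y)
    n = X.shape[1]
    class_means = np.array([X[y == c].mean(axis=0) for c in classes])
    m_bar = class_means.mean(axis=0)             # mean of the class means, as in the text
    S_W = np.zeros((n, n))
    S_B = np.zeros((n, n))
    for c, m_c in zip(classes, class_means):
        Xc = X[y == c]
        S_W += (Xc - m_c).T @ (Xc - m_c)         # within-class scatter, summed over classes
        d = (m_c - m_bar)[:, None]
        S_B += len(Xc) * (d @ d.T)               # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:k]].real            # leading eigenvectors of S_W^{-1} S_B
```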

3.3 Feature selection

Feature selection methods attempt to eliminate features from the original feature space in order to simplify the machine learning problem. There are two main approaches.

Wrapper selection: the process starts with no features and sequentially adds features, testing the validation error of the machine learning algorithm at each step. Features that do not contribute significantly to the model are dropped from the feature space.

Filter selection: each feature is given a score based on the amount of mutual information it shares with the output variable $y$, and a set of features is then selected according to this score. This method is computationally more efficient than wrapper selection, although a suitable means of scoring the features must be established for it to work (a small sketch follows at the end of this section).

The next sections will cover detailed design aspects of machine learning algorithms: model selection, validation, error analysis and learning aspects. Unsupervised methods, including k-means, Gaussian mixture models and factor analysis, will also be discussed.
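The sketch of filter selection referred to above, assuming scikit-learn's mutual-information scorer is available; the synthetic dataset and the choice of k = 2 are purely illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Toy data: only the first two of ten features carry class information.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Score every feature by mutual information with y, then keep the top two.
selector = SelectKBest(score_func=mutual_info_classif, k=2).fit(X, y)
print("mutual-information scores:", np.round(selector.scores_, 3))
print("selected feature indices:", selector.get_support(indices=True))
```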

References
[1] Ng, A., CS229 Machine Learning - Lecture Notes, Stanford University, 2010.
[2] Duda, R.O., Hart, P.E. and Stork, D.G., Pattern Classification, 2nd ed. (2001).
[3] Alpaydin, E., Introduction to Machine Learning, The MIT Press (2004).

