
Linear Discriminators

Chapter 20 (only the relevant parts)

Concerns
Generalization
Accuracy
Efficiency
Noise
Irrelevant features
Generality: when does this work?

Linear Model
Let f1, ..., fn be the feature values of an example, and let the class be denoted {+1, -1}. Define f0 = -1 (the bias feature). A linear model defines weights w0, w1, ..., wn.
w0 plays the role of the threshold: if w0*f0 + w1*f1 + ... + wn*fn > 0, predict class +, else predict class -.

Classification rule:
Briefly: W*F > 0, where * is the inner product of the weight vector W and the feature vector F, and F has been augmented with the extra component f0 = -1.

Augmentation Trick
Suppose the data has features f1 and f2, and the classifier is 2*f1 + 3*f2 > 4. Equivalently: <4,2,3> * <-1,f1,f2> > 0. Mapping the data <f1,f2> to <-1,f1,f2> allows the threshold to be learned and represented as just another feature weight. Mapping data into higher dimensions is the key idea behind SVMs.
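
A minimal sketch of this exact example in NumPy (not from the slides; the function name and test points are my own): the weight vector <4,2,3> applied to the augmented input <-1,f1,f2> reproduces the rule 2*f1 + 3*f2 > 4.

import numpy as np

# Weights <w0, w1, w2> = <4, 2, 3>; w0 is the threshold made into a weight.
W = np.array([4.0, 2.0, 3.0])

def predict(f):
    """f is the raw feature vector <f1, f2>; returns +1 or -1."""
    f_aug = np.concatenate(([-1.0], f))   # prepend f0 = -1
    return 1 if W @ f_aug > 0 else -1

print(predict(np.array([3.0, 1.0])))      # 2*3 + 3*1 = 9 > 4  -> +1
print(predict(np.array([0.5, 0.5])))      # 2*0.5 + 3*0.5 = 2.5 < 4 -> -1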

Mapping to enable Linear Separation


Let x1, ..., xm be m vectors in R^N. Map each xi into R^{N+m} by xi -> <xi, 0, ..., 1, ..., 0>, where the 1 is in position N+i. For any labelling of the xi by classes +/-, this embedding makes the data linearly separable:
define wj = 0 for j <= N, and w(N+i) = +1 if xi is positive, -1 if xi is negative. Then W * (embedded xi) = +1 or -1, which always has the correct sign.
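
An illustrative check of this construction in NumPy (the sizes, random points, and labelling below are arbitrary choices of mine):

import numpy as np

# m arbitrary points in R^N, with arbitrary +/-1 labels, become linearly
# separable once embedded in R^{N+m}.
rng = np.random.default_rng(0)
m, N = 5, 3
X = rng.normal(size=(m, N))            # original points x_1..x_m
y = np.array([1, -1, 1, 1, -1])        # any labelling works

# Embed: x_i -> <x_i, e_i>, where e_i has a 1 in position N+i.
X_emb = np.hstack([X, np.eye(m)])

# Weights: zero on the original N coordinates, label y_i on coordinate N+i.
W = np.concatenate([np.zeros(N), y.astype(float)])

# W * embedded(x_i) = y_i, so sign(W * x) recovers every label.
print(np.sign(X_emb @ W) == y)          # all True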

Representational Power
OR of n features: wi = 1, threshold = 0

AND of n features: wi = 1, threshold = n - 1

At least k of n features (prototype): wi = 1, threshold = k - 1

A single unit can't do XOR. Combining linear threshold units, however, yields any boolean function.
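
A small sketch of these settings in plain Python (illustrative; the helper names are made up), on 0/1 features, plus the two-layer combination that recovers XOR:

# Linear threshold unit over boolean (0/1) features, using the
# weight/threshold settings from the slide.
def threshold_unit(weights, threshold, features):
    return 1 if sum(w * f for w, f in zip(weights, features)) > threshold else 0

n = 3
OR_  = lambda f: threshold_unit([1] * n, 0,     f)      # at least one feature on
AND_ = lambda f: threshold_unit([1] * n, n - 1, f)      # all n features on
K_of = lambda f, k: threshold_unit([1] * n, k - 1, f)   # at least k features on

print(OR_([0, 0, 1]), AND_([1, 1, 1]), K_of([1, 1, 0], 2))   # 1 1 1

# XOR needs two layers: x1 XOR x2 = (x1 OR x2) AND NOT (x1 AND x2).
def xor(x1, x2):
    h_or  = threshold_unit([1, 1], 0, [x1, x2])
    h_and = threshold_unit([1, 1], 1, [x1, x2])
    # output unit fires when h_or is on and h_and is off
    return threshold_unit([1, -1], 0, [h_or, h_and])

print([xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]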

Classical Perceptron
Goal: any W which separates the data.
Algorithm (X is augmented as above):
W = 0
Repeat over the examples X:
  If X is positive and W*X is wrong (<= 0), W = W + X
  Else if X is negative and W*X is wrong (>= 0), W = W - X
Until no errors are made in a full pass, or a very large number of passes has been done.
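
As a sketch, the loop above in NumPy (the function name, the epoch cap, and the toy AND data are my own choices); X is the augmented example matrix and y the +/-1 labels:

import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Classical perceptron. X: (m, n) augmented examples, y: labels in {+1, -1}.
    Returns a separating W if one is found within max_epochs passes."""
    m, n = X.shape
    W = np.zeros(n)
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (W @ xi) <= 0:      # W classifies xi wrongly
                W += yi * xi            # +xi for positives, -xi for negatives
                errors += 1
        if errors == 0:                 # an error-free pass through the data: done
            break
    return W

# Toy usage: AND of two features, augmented with f0 = -1.
X = np.array([[-1, 0, 0], [-1, 0, 1], [-1, 1, 0], [-1, 1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
W = perceptron(X, y)
print(np.sign(X @ W))                   # matches y once converged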

Classical Perceptron
Theorem: if the concept is linearly separable, then the algorithm finds a solution. Training time can be exponential in the number of features. An epoch is a single pass through the entire data set. Convergence can take exponentially many epochs, but it is guaranteed to work. If |xi| < R and the margin is m, then the number of mistakes is at most R^2/m^2.
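
In LaTeX form, the standard (Novikoff) statement of this bound; the unit-length separating vector W* and the per-example margin condition are assumptions the slide leaves implicit:

\[
\|x_i\| \le R
\quad\text{and}\quad
y_i\,(W^{*}\cdot x_i) \ge m,\ \ \|W^{*}\| = 1,\ \ \text{for all } i
\;\;\Longrightarrow\;\;
\#\text{mistakes} \;\le\; \frac{R^2}{m^2}.
\]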

Hill-Climbing Search
This is an optimization problem, and the solution is found by hill-climbing, so there is no guarantee of finding the optimal solution. While the derivatives tell you the direction to move (the negative gradient), they do not tell you how much to change each wi. On the plus side it is fast; on the minus side, there is no guarantee of separation.

Hill-climbing View
Goal: minimize the squared error. Let the class yi be +1 or -1, and let Err_i = W*X_i - y_i, where X_i is the ith example. The total error sum_i Err_i^2 is a function only of the weights. Use calculus: take partial derivatives with respect to each w_j. To move to a lower value, move in the direction of the negative gradient, i.e. the change in w_j is proportional to -2*Err_i*X_ij.
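
A minimal sketch of this update in NumPy (the learning rate lr and epoch count are my own choices; the slide only gives the gradient direction):

import numpy as np

def squared_error_gradient_descent(X, y, lr=0.01, epochs=200):
    """Hill-climbing view: minimize sum_i (W*X_i - y_i)^2 by repeatedly
    stepping along the negative gradient. X: (m, n) augmented examples,
    y: labels in {+1, -1}. lr may need tuning for other data scales."""
    m, n = X.shape
    W = np.zeros(n)
    for _ in range(epochs):
        err = X @ W - y                 # Err_i = W*X_i - y_i for every example
        grad = 2 * X.T @ err            # dE/dw_j = sum_i 2 * Err_i * X_ij
        W -= lr * grad                  # move in the negative gradient direction
    # note: this finds a least-squares W, which need not separate the data
    return W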

Support Vector Machine


Goal: maximize the margin. Assuming the line separates the data, the margin is the distance from the line to the closest example, positive or negative. Good news: this can be solved as a quadratic program. Implemented in Weka as SMO. If the data are not linearly separable, the SVM will add more features.
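
Illustrative only: the slides point at Weka's SMO, but assuming scikit-learn is available, a maximum-margin linear SVM on toy 2-D data looks roughly like this (the data and the large-C hard-margin approximation are my own choices):

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard margin
clf.fit(X, y)
print(clf.support_vectors_)         # only the margin-defining points remain
print(clf.predict([[2.0, 2.0]]))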

If not Linearly Separable


1. Add more nodes: Neural Nets
   - Can represent any boolean function (why?)
   - No guarantees about learning
   - Slow
   - Incomprehensible

2. Add more features: SVM
   - Can represent any boolean function
   - Learning guarantees
   - Fast
   - Semi-comprehensible

Adding features
Suppose a point (x,y) is positive if it lies in the unit disk and negative otherwise. This is clearly not linearly separable. Map (x,y) -> (x, y, x^2+y^2): now, in 3-space, the data are easily separable. This trick works for any learning algorithm, but an SVM will almost do it for you (by setting its kernel parameters).
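
A quick sketch in NumPy (the sampled data and the explicit weight vector are mine) checking that the mapped points really are linearly separable; the plane x^2 + y^2 = 1 does the split:

import numpy as np

# Points inside the unit disk are +, outside are -.
rng = np.random.default_rng(1)
pts = rng.uniform(-2, 2, size=(200, 2))
labels = np.where((pts ** 2).sum(axis=1) < 1, 1, -1)

# Map (x, y) -> (x, y, x^2 + y^2).
mapped = np.column_stack([pts, (pts ** 2).sum(axis=1)])

# In augmented form, W = <-1, 0, 0, -1> on <-1, x, y, x^2+y^2>
# gives W*F = 1 - (x^2 + y^2), which is > 0 exactly inside the disk.
W = np.array([-1.0, 0.0, 0.0, -1.0])
aug = np.column_stack([-np.ones(len(mapped)), mapped])   # prepend f0 = -1
preds = np.where(aug @ W > 0, 1, -1)
print((preds == labels).all())                           # True: linearly separable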
