

MASSACHUSETTS INSTITUTE OF TECHNOLOGY


Department of Electrical Engineering and Computer Science
6.036 Introduction to Machine Learning
Spring Semester 2015
Assignment 2, Issued: Friday, Feb. 13; Due: Friday, Feb. 20

Perceptron Convergence Rates


1. Mistake Bounds
First, note that $x^{(i)} \perp x^{(j)}$ for $i \neq j$, since $x^{(i)}$ has a 1 in the $i$th coordinate and 0 elsewhere, and $x^{(j)}$ has a 1 in the $j$th coordinate and 0 elsewhere. Thus $x^{(i)} \cdot x^{(j)} = 0$.
(a) The first iteration of the perceptron on data point $(x^{(i)}, y^{(i)})$ is a mistake because of the initialization $\theta^{(0)} = 0$. The first update sets $\theta^{(1)} = y^{(i)} x^{(i)}$. The second iteration, on data point $(x^{(j)}, y^{(j)})$, also yields a mistake since
$$\theta^{(1)} \cdot x^{(j)} = y^{(i)}\, x^{(i)} \cdot x^{(j)} = y^{(i)} \cdot 0 = 0.$$
Thus, for the $d = 2$ example, after 2 updates $\theta^{(2)} = y^{(i)} x^{(i)} + y^{(j)} x^{(j)}$. We now check whether the second pass yields mistakes:
$$
\begin{aligned}
y^{(i)}\, \theta^{(2)} \cdot x^{(i)} &= y^{(i)} \left(y^{(i)} x^{(i)} + y^{(j)} x^{(j)}\right) \cdot x^{(i)} \\
&= (y^{(i)})^2 \|x^{(i)}\|^2 + (y^{(i)} y^{(j)})\, x^{(i)} \cdot x^{(j)} \\
&= (y^{(i)})^2 \|x^{(i)}\|^2 + 0 \\
&> 0,
\end{aligned}
$$
so the first point is classified correctly. Similarly, the second point is also classified correctly. Therefore, it takes only two updates to fit the $d = 2$ dataset, regardless of ordering or labeling. Here are sketches for each of the four labelings of $y^{(1)}$ and $y^{(2)}$:

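As a complement to those sketches, here is a quick numerical check of the same four labelings. This is a minimal sketch assuming NumPy; the explicit basis vectors and the loop are illustrative, not part of the original solution. It forms $\theta^{(2)} = y^{(1)} x^{(1)} + y^{(2)} x^{(2)}$ for each labeling and verifies that both points end up classified correctly.

```python
import numpy as np
from itertools import product

x1, x2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # the two basis vectors for d = 2

for y1, y2 in product([-1, 1], repeat=2):              # all four labelings
    theta2 = y1 * x1 + y2 * x2                         # theta^(2) after the two forced updates
    correct = (y1 * (theta2 @ x1) > 0) and (y2 * (theta2 @ x2) > 0)
    print(f"y = ({y1:+d}, {y2:+d})  ->  theta^(2) = {theta2},  both correct: {correct}")
```

The four resulting $\theta^{(2)}$ vectors are the four corners $(\pm 1, \pm 1)$, one per labeling.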
(b) Using the intuition gained from the $d = 2$ case, we notice that on the first pass through the data, $\theta \cdot x^{(i)} = 0$ at the moment each $x^{(i)}$ is considered. In other words, each data point we consider in sequence lies on the current decision boundary. Since each point therefore starts as a mistake, there are $d$ updates. It remains to show that after $d$ updates, all points are classified correctly. On the $i$th update we add $y^{(i)} x^{(i)}$ to $\theta^{(i-1)}$, so after $d$ updates,
$$\theta^{(d)} = \sum_{i=1}^{d} y^{(i)} x^{(i)}.$$

We check whether $y^{(t)}\, \theta^{(d)} \cdot x^{(t)} > 0$ for all $t$ to ensure there are no mistakes. Notice that the only non-zero term in the dot product occurs when $i = t$. Thus,
$$y^{(t)}\, \theta^{(d)} \cdot x^{(t)} = (y^{(t)})^2 \|x^{(t)}\|^2 > 0$$
for all $t = 1, 2, \dots, d$.
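This argument can also be checked empirically. The snippet below is a minimal sketch assuming NumPy; the dimension $d$ and the random labels are arbitrary choices, not from the assignment. It runs the perceptron for two passes over the standard-basis dataset and confirms that every point triggers an update on the first pass and none on the second.

```python
import numpy as np

d = 7                                         # any d works; 7 is an arbitrary choice
X = np.eye(d)                                 # x^(i) = e_i
y = np.random.choice([-1.0, 1.0], size=d)     # arbitrary +/-1 labels

theta = np.zeros(d)
mistakes_per_pass = []
for _ in range(2):                            # two passes through the data, in order
    mistakes = 0
    for i in range(d):
        if y[i] * (theta @ X[i]) <= 0:        # theta . x^(i) = 0 counts as a mistake
            theta += y[i] * X[i]
            mistakes += 1
    mistakes_per_pass.append(mistakes)

print(mistakes_per_pass)                      # [d, 0]: every point updates once, then none
print(np.array_equal(theta, y))               # True: theta^(d) = sum_i y^(i) x^(i) = y
```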
(c) As shown in part (b), $\theta$ converges to
$$\theta = \sum_{i=1}^{d} y^{(i)} x^{(i)},$$
which does not depend on the ordering, since the terms are simply summed together. However, it does depend on the labels $y^{(i)}$. Notice that each coordinate of $\theta$ is either $+1$ or $-1$.
(d) From the hint, $\|x^{(t)}\| = 1$ for every $t$. Thus $\|x^{(t)}\| \le R$ for all $t$, with $R = 1$. Since each coordinate of $\theta$ is either $+1$ or $-1$, $\|\theta\| = \sqrt{d}$. Also, by construction of $\theta$, $y^{(t)}\, \theta \cdot x^{(t)} = 1$. Thus the margin is
$$\gamma = \frac{y^{(t)}\, \theta \cdot x^{(t)}}{\|\theta\|} = \frac{1}{\sqrt{d}}.$$
The resulting mistake bound is $\frac{R^2}{\gamma^2} = d$. We see that the number of updates required is exactly equal to the bound we proved in class. This agrees with our intuition, since the dataset is constructed adversarially so that every point begins as a mistake.
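The bound can be checked numerically as well. The lines below are a sketch under the same assumptions as the previous snippet (NumPy, arbitrary $d$ and labels); they compute $R$, $\gamma$, and $R^2/\gamma^2$ for the converged $\theta$ and confirm the bound equals $d$.

```python
import numpy as np

d = 7
X = np.eye(d)
y = np.random.choice([-1.0, 1.0], size=d)

theta = X.T @ y                                           # sum_i y^(i) x^(i); here simply y
R = np.linalg.norm(X, axis=1).max()                       # max_t ||x^(t)|| = 1
gamma = (y * (X @ theta)).min() / np.linalg.norm(theta)   # margin = 1/sqrt(d)
print(R**2 / gamma**2)                                    # = d, matching the d updates above
```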
Passive-Aggressive Algorithm

2.

(a) 4. 0-1 loss and a small $\lambda$. Since $\lambda$ is small, the zero-one loss term dominates. The loss can only improve if the example becomes classified correctly, so the example must end up on the positive side. Because all correctly classified examples incur the same (zero) loss, the secondary goal of minimizing the change in $\theta$ places the point immediately on the positive side of the decision boundary.

(b) 1. Hinge loss and a large $\lambda$. With hinge loss, the loss can be improved by moving $\theta$ toward the example. However, since $\lambda$ is large, the change-in-$\theta$ term dominates, so $\theta$ rotates only slightly.
(c) 2. Hinge loss and a small $\lambda$. With hinge loss, the loss can be improved by moving $\theta$ toward the example. Since $\lambda$ is small, the loss term dominates, so the objective is minimized when the example is correctly classified. The difference from 0-1 loss with a small $\lambda$ is that hinge loss classifies the example correctly with a margin from the boundary: hinge loss is minimized when $y\,\theta \cdot x \ge 1$, whereas 0-1 loss is minimized as soon as $y\,\theta \cdot x > 0$.
(d) 3. 0-1 loss and a large $\lambda$. Since the loss is zero-one, it takes on only two values, depending on whether the example is classified correctly or not. With $\lambda$ large, the change-in-$\theta$ term dominates. To minimize the change in $\theta$, $\theta$ simply stays the same after the update and the loss does not improve at all.
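These four cases can be made concrete by brute-force minimization of the passive-aggressive objective $\frac{\lambda}{2}\|\theta - \theta^{(k)}\|^2 + \mathrm{Loss}(y\,\theta \cdot x)$ over a grid of candidate $\theta$. The example below is a minimal sketch; the particular $\theta^{(k)}$, $x$, $y$, and $\lambda$ values are arbitrary illustrative choices, not from the assignment.

```python
import numpy as np

theta_k = np.array([1.0, 0.0])          # current parameters
x, y = np.array([-1.0, 1.0]), 1.0       # a misclassified example: y * theta_k . x = -1

def zero_one(z): return (z <= 0).astype(float)
def hinge(z):    return np.maximum(0.0, 1.0 - z)

g = np.linspace(-2.0, 2.0, 401)         # brute-force grid over theta in 2D (step 0.01)
T1, T2 = np.meshgrid(g, g)
agree = y * (T1 * x[0] + T2 * x[1])     # y * theta . x at every grid point

for name, loss in [("0-1", zero_one), ("hinge", hinge)]:
    for lam in [0.01, 100.0]:
        obj = lam / 2 * ((T1 - theta_k[0]) ** 2 + (T2 - theta_k[1]) ** 2) + loss(agree)
        i = np.unravel_index(np.argmin(obj), obj.shape)
        print(f"{name:>5} loss, lambda = {lam:>6}: "
              f"theta -> ({T1[i]:+.2f}, {T2[i]:+.2f}), y*theta.x = {agree[i]:+.2f}")
```

The four printed cases line up with the answers above: 0-1 loss with small $\lambda$ lands just barely on the positive side, 0-1 loss with large $\lambda$ leaves $\theta$ unchanged, hinge loss with small $\lambda$ stops at margin $y\,\theta \cdot x = 1$, and hinge loss with large $\lambda$ moves $\theta$ only slightly.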
3.

(a) As $\lambda$ increases, the passive-aggressive algorithm is dissuaded from moving very far from the previous $\theta$. This is because $\lambda$ is the weight on the regularization term of the objective, $\frac{\lambda}{2}\|\theta - \theta^{(k)}\|^2$, the portion that keeps $\theta$ close to $\theta^{(k)}$. Thus, as $\lambda$ increases, we expect the step size between updates to decrease, so $\eta$ should be smaller.
(b) When $\mathrm{Loss}_h(y\,\theta^{(k+1)} \cdot x) > 0$, the hinge loss can simply be rewritten as $1 - y\,\theta \cdot x$. Thus, the objective to be minimized simplifies to
$$f(\theta) = \frac{\lambda}{2}\|\theta - \theta^{(k)}\|^2 + \mathrm{Loss}_h(y\,\theta \cdot x) = \frac{\lambda}{2}\|\theta - \theta^{(k)}\|^2 + 1 - y\,\theta \cdot x.$$
We compute the minimum by differentiating $f(\theta)$ with respect to $\theta$ and setting the derivative to zero:
$$\frac{df}{d\theta} = \lambda(\theta - \theta^{(k)}) - y\,x = 0$$
$$\theta - \theta^{(k)} = \frac{1}{\lambda}\, y\,x$$
$$\theta = \theta^{(k)} + \eta\, y\,x,$$
where $\eta$ is just the step size. Thus, $\eta = \frac{1}{\lambda}$ when $\mathrm{Loss}_h(y\,\theta^{(k+1)} \cdot x) > 0$.
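The closed-form step can be sanity-checked against a generic numerical minimizer. The snippet below is a minimal sketch assuming NumPy and SciPy; the particular $\theta^{(k)}$, $x$, $y$, and $\lambda$ are arbitrary values chosen so that the hinge loss is still positive at the optimum.

```python
import numpy as np
from scipy.optimize import minimize

theta_k = np.array([1.0, 0.0])
x, y = np.array([-1.0, 1.0]), 1.0
lam = 4.0                               # chosen so the hinge loss stays positive at the optimum

def objective(theta):
    # (lambda/2) ||theta - theta^(k)||^2 + hinge loss
    return lam / 2 * np.sum((theta - theta_k) ** 2) + max(0.0, 1.0 - y * (theta @ x))

closed_form = theta_k + (1.0 / lam) * y * x        # theta^(k) + eta * y * x with eta = 1/lambda
numerical = minimize(objective, theta_k, method="Nelder-Mead").x

print(closed_form)                 # [0.75 0.25]
print(numerical)                   # agrees up to solver tolerance
print(y * (closed_form @ x))       # -0.5 < 1, so the hinge loss is indeed still active
```

Because the objective is convex and the hinge term is still in its linear region at the minimizer, the stationarity condition above pins down the step exactly; with a much smaller $\lambda$ the step $\frac{1}{\lambda}$ would overshoot and the minimizer would instead sit at the margin $y\,\theta \cdot x = 1$, as in problem 2(c).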

(c) In general, the number of mistakes made by online passive-aggressive does depend on the ordering of the feature vectors: the ordering could cause $\theta$ to converge toward a local minimum (in terms of the number of mistakes) rather than a global one. However, this is not the case for the data in problem 1. Each example affects a distinct coordinate of $\theta$ without any interference from the other coordinates, so the parameters the algorithm finds are identical to those we would get by updating each coordinate independently of the others (as if we had $d$ one-dimensional problems). Therefore, as with the perceptron, there is no dependence on ordering for this data, as the sketch below illustrates.
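The sketch below assumes NumPy and uses the clipped step size $\eta = \min\!\left(\frac{1}{\lambda},\, \frac{1 - y\,\theta \cdot x}{\|x\|^2}\right)$ whenever the margin is below 1; part (b) derives only the unclipped $\eta = \frac{1}{\lambda}$ branch of that rule, so the clipping here is an added assumption about the full minimizer. It runs one passive-aggressive pass over the problem-1 data in two different orders and checks that the resulting $\theta$ and mistake counts coincide.

```python
import numpy as np

def pa_pass(X, y, theta, lam):
    """One pass of passive-aggressive with hinge loss, in the given order."""
    mistakes = 0
    for x_i, y_i in zip(X, y):
        margin = y_i * (theta @ x_i)
        if margin <= 0:                     # count a mistake on the current point
            mistakes += 1
        if margin < 1:                      # update only when the hinge loss is positive
            eta = min(1.0 / lam, (1.0 - margin) / (x_i @ x_i))
            theta = theta + eta * y_i * x_i
    return theta, mistakes

d, lam = 7, 2.0
X = np.eye(d)                               # the problem-1 dataset: x^(i) = e_i
y = np.random.choice([-1.0, 1.0], size=d)

order_a, order_b = np.arange(d), np.random.permutation(d)
theta_a, m_a = pa_pass(X[order_a], y[order_a], np.zeros(d), lam)
theta_b, m_b = pa_pass(X[order_b], y[order_b], np.zeros(d), lam)
print(np.allclose(theta_a, theta_b), m_a == m_b)   # True True: ordering does not matter here
```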
