(b) Using the intuition that we gained from the $d = 2$ case, we notice that upon the first pass of the data points, $\theta \cdot x^{(i)} = 0$ for all $i$. In other words, each data point we consider in sequence lies on the current boundary. Since each point starts as a mistake, there are $d$ updates. It now remains to be shown that after $d$ updates, all points are classified correctly. After the $i$th update, we add $y^{(i)} x^{(i)}$ to $\theta^{(i-1)}$. After $d$ updates,
\[
\theta^{(d)} = \sum_{i=1}^{d} y^{(i)} x^{(i)}.
\]
We check whether $y^{(t)} \theta^{(d)} \cdot x^{(t)} > 0$ for all $t$ to ensure there are no mistakes. Notice that the only non-zero term of the dot product occurs when $i = t$. Thus,
\[
y^{(t)} \theta^{(d)} \cdot x^{(t)} = (y^{(t)})^2 \|x^{(t)}\|^2 > 0
\]
for all $t = 1, 2, \dots, d$.
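The argument above can be checked numerically. The sketch below (with a hypothetical $d = 5$ and an arbitrary choice of labels) runs the perceptron on the standard-basis dataset and confirms that every point starts as a mistake, giving exactly $d$ updates, after which all points are classified correctly:

```python
import numpy as np

def perceptron(X, y, passes=2):
    """Run the perceptron over the data, counting updates (mistakes)."""
    theta = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(passes):
        for x_i, y_i in zip(X, y):
            if y_i * theta.dot(x_i) <= 0:   # point on or past the boundary
                theta += y_i * x_i
                mistakes += 1
    return theta, mistakes

d = 5
X = np.eye(d)                      # x^(i) = i-th standard basis vector
y = np.array([1, -1, 1, -1, 1.0])  # an arbitrary labeling, for illustration
theta, mistakes = perceptron(X, y)
print(mistakes)   # d updates: every point begins as a mistake
print(theta)      # equals sum_i y^(i) x^(i), i.e. the label vector itself
```

A second pass produces no further updates, since each $y^{(t)} \theta \cdot x^{(t)} = 1 > 0$.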
(c) As shown in part (b), $\theta$ converges to
\[
\theta = \sum_{i=1}^{d} y^{(i)} x^{(i)},
\]
which does not depend on the ordering, since the terms are merely summed together. However, it does depend on the labels $y^{(i)}$. Notice that each coordinate of $\theta$ is either $+1$ or $-1$.
(d) From the hint, $\|x^{(t)}\| = 1$. Thus, $\|x^{(t)}\| \le R$ for all $t$, where $R = 1$. Since each coordinate of $\theta$ is either $+1$ or $-1$, $\|\theta\| = \sqrt{d}$. Also, by construction of $\theta$, $y^{(t)} \theta \cdot x^{(t)} = 1$. Thus, the margin is
\[
\gamma = \frac{y^{(t)} \theta \cdot x^{(t)}}{\|\theta\|} = \frac{1}{\sqrt{d}}.
\]
The resulting mistake bound is $R^2 / \gamma^2 = d$. We see that the number of updates
required is exactly equal to the bound we proved in class. This agrees with our intuition since
the dataset is constructed adversarially such that every point begins as a mistake.
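As a quick sanity check on the arithmetic (again with a hypothetical $d = 5$ and arbitrary labels), the sketch below computes $R$, $\|\theta\|$, and the margin, and verifies that the bound $R^2/\gamma^2$ equals $d$:

```python
import numpy as np

d = 5
y = np.array([1, -1, 1, -1, 1.0])      # arbitrary labels for illustration
theta = y.copy()                       # theta = sum_i y^(i) x^(i): entries are +/-1
R = 1.0                                # max ||x^(t)||, since x^(t) are unit vectors
margin = 1.0 / np.linalg.norm(theta)   # y^(t) theta . x^(t) / ||theta|| = 1/sqrt(d)
bound = (R / margin) ** 2
print(bound)   # equals d: the bound is met exactly on this adversarial dataset
```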
Passive-aggressive algorithm
2.
(a) 4. 0-1 loss and a small $\lambda$. Since $\lambda$ is small, the zero-one loss term dominates. Since the loss can only improve when the example is classified correctly, the example must end up on the positive side. Because all correctly classified examples incur the same loss, the secondary goal of minimizing the change in $\theta$ will place the point immediately on the positive side of the decision boundary.
(b) 1. hinge loss and a large $\lambda$. For hinge loss, the loss can be improved by moving the vector towards the example. However, since $\lambda$ is large, the change-in-$\theta$ term dominates, so $\theta$ only rotates slightly.
(c) 2. hinge loss and a small $\lambda$. For hinge loss, the loss can be improved by moving the vector towards the example. Since $\lambda$ is small, the loss term dominates, so the loss is minimized when the example is correctly classified. The difference between hinge loss and 0-1 loss with small $\lambda$ is that hinge loss will correctly classify the example with a margin from the boundary. This is because hinge loss is minimized when $y \theta \cdot x \ge 1$ while 0-1 loss is minimized when $y \theta \cdot x > 0$.
(d) 3. 0-1 loss and a large $\lambda$. Since the loss function is zero-one, it can only take on two values, depending on whether the example is classified correctly or not. With $\lambda$ large, the change-in-$\theta$ term dominates. In order to minimize the change in $\theta$, $\theta$ simply stays the same after the update and does not improve the loss at all.
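The hinge-loss cases above can be illustrated numerically. The sketch below applies the closed-form passive-aggressive update $\theta^{(k)} + (y/\lambda)x$ (valid while the hinge loss is active) to a boundary example, with hypothetical values of $\lambda$: a small $\lambda$ pushes the example past the margin, while a large $\lambda$ barely moves $\theta$.

```python
import numpy as np

def pa_update(theta, x, y, lam):
    """One PA update: minimize lam/2 ||th - theta||^2 + hinge loss."""
    if 1 - y * theta.dot(x) > 0:          # hinge loss is active
        return theta + (y / lam) * x      # closed-form step (derived in 3(b))
    return theta.copy()

theta = np.array([1.0, 0.0])
x, y = np.array([0.0, 1.0]), 1.0          # example exactly on the boundary

small = pa_update(theta, x, y, lam=0.5)   # loss term dominates
large = pa_update(theta, x, y, lam=100.0) # change-in-theta term dominates
print(y * small.dot(x))    # >= 1: the example ends up beyond the margin
print(large)               # close to the old theta: only a slight rotation
```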
3.
(a) As $\lambda$ increases, the passive-aggressive algorithm is dissuaded from updating very far from the previous $\theta$. This is because $\lambda$ serves as the weight for the regularization term (the portion that prevents overfitting: $\|\theta - \theta^{(k)}\|^2$) of the minimizing function. Thus, as $\lambda$ increases, we expect the step size between updates to decrease, so $\|\theta^{(k+1)} - \theta^{(k)}\|$ should be smaller.
(b) When $\mathrm{Loss}_h(y\,\theta^{(k+1)} \cdot x) > 0$, the hinge loss can simply be rewritten as $1 - y\theta \cdot x$. Thus, our minimizing function simplifies to
\[
f(\theta) = \frac{\lambda}{2}\|\theta - \theta^{(k)}\|^2 + \mathrm{Loss}_h(y\theta \cdot x) = \frac{\lambda}{2}\|\theta - \theta^{(k)}\|^2 + 1 - y\theta \cdot x.
\]
Setting the gradient to zero,
\[
\nabla f(\theta) = \lambda(\theta - \theta^{(k)}) - yx = 0,
\]
\[
\theta - \theta^{(k)} = \frac{1}{\lambda}\, yx,
\]
\[
\theta^{(k+1)} = \theta^{(k)} + \eta\, yx, \quad \text{where } \eta = \frac{1}{\lambda}.
\]
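The derivation can be verified numerically: at the closed-form minimizer, the gradient of $f$ should vanish. The sketch below uses hypothetical values for $\lambda$, $\theta^{(k)}$, and the example $(x, y)$, chosen so that the hinge loss is active.

```python
import numpy as np

lam = 2.0
theta_k = np.array([0.5, -0.25])
x, y = np.array([1.0, 2.0]), -1.0
assert 1 - y * theta_k.dot(x) > 0      # hinge loss is active, so the
                                       # linear branch 1 - y theta.x applies

theta = theta_k + (y / lam) * x        # theta^(k+1) = theta^(k) + (1/lam) y x

grad = lam * (theta - theta_k) - y * x # gradient of f at the candidate minimizer
print(grad)                            # ~ [0, 0]: the update solves grad f = 0
```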
(c) The number of mistakes for online passive-aggressive does depend on the feature vector ordering in general. The feature vector ordering could cause $\theta$ to converge towards a local minimum (in terms of number of mistakes) rather than a global one. However, this is not true for the data in problem 1. Each example pertains to a distinct coordinate of $\theta$ without any interference from the other coordinates. The parameters that the algorithm finds are identical to those we would get by updating each coordinate independently of the others (as if we had $d$ one-dimensional problems). Therefore, similarly to the perceptron, there is no dependence on ordering (for this data).
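The ordering-independence claim for the problem-1 data can be checked directly. The sketch below (with a hypothetical $d = 4$, arbitrary labels, and $\lambda = 1$) runs passive-aggressive updates with hinge loss over the standard-basis dataset in two different orders and confirms the final parameters agree:

```python
import numpy as np

def pa_run(X, y, lam, passes=5):
    """Run passive-aggressive hinge-loss updates over the data in order."""
    theta = np.zeros(X.shape[1])
    for _ in range(passes):
        for x_i, y_i in zip(X, y):
            if 1 - y_i * theta.dot(x_i) > 0:   # hinge loss active
                theta += (y_i / lam) * x_i     # update from part (b)
    return theta

d = 4
X = np.eye(d)                   # problem-1 data: standard basis vectors
y = np.array([1, -1, -1, 1.0])  # arbitrary labels for illustration
lam = 1.0

order = np.array([2, 0, 3, 1])  # a shuffled ordering of the same data
t1 = pa_run(X, y, lam)
t2 = pa_run(X[order], y[order], lam)
print(np.allclose(t1, t2))      # True: each example touches its own coordinate
```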