
Algorithm 1

Given Armijo line search parameters $\alpha \in (0, 1/2)$, $\beta \in (0, 1)$, a maximum number of iterations $N$, a gradient tolerance $\epsilon$, and an initial guess $x_0$.
Define $n = 0$, $x = x_0$.
While $n < N$ and $\|\nabla f_a(x)\|_2 > \epsilon$ {
    Define $t = 1$
    While $f_a(x - t\nabla f_a(x)) \ge f_a(x) - \alpha t \|\nabla f_a(x)\|_2^2$ {
        $t = \beta t$ }
    $x = x - t\nabla f_a(x)$, $n = n + 1$ }
Return $x$, which is our guess for the optimizer $x^*$.
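A minimal MATLAB sketch of Algorithm 1 (the function name and handle interface here are illustrative assumptions; Example4grad.m may differ). fobj is assumed to be a handle returning the objective value and gradient at x, e.g. a wrapper around obj_fun_file.m:

function [x, n] = grad_descent_armijo(fobj, x0, alpha, beta, N, eps_tol)
% Backtracking (Armijo) gradient descent, following Algorithm 1.
% fobj(x) is assumed to return [f, g]: objective value and gradient at x.
x = x0; n = 0;
[f, g] = fobj(x);
while n < N && norm(g) > eps_tol
    t = 1;
    % Backtrack until the Armijo sufficient-decrease condition holds
    while fobj(x - t*g) > f - alpha*t*(g'*g)
        t = beta*t;
    end
    x = x - t*g;            % take the gradient step
    [f, g] = fobj(x);       % re-evaluate objective and gradient
    n = n + 1;
end
end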

1 Q1
Let $a > 0$ and define
$$f_a : \mathbb{R}^2 \to \mathbb{R}, \qquad f_a(x) = x_1^2 + a(x_2 - 1)^2.$$
The function obj_fun_file.m evaluates $f_a(x)$, $\nabla f_a(x)$, $\nabla^2 f_a(x)$ for a value of $x$ that you input. Note that you have to adjust the value of $a$ in the function itself.
1. Show that $f_a$ is a strongly convex and closed function. Show that strong convexity implies convexity (in general, not just for this specific example).
2. Write out the algorithm given in Example4grad.m.
3. (Easy) Find $f_a^*$ and $x^*$ s.t. $f_a^* := f_a(x^*) = \min_{x \in \mathbb{R}^2} f_a(x)$.
4. Run the algorithm given in Example4grad.m to find $f_a^*$, $x^*$ numerically; do this for $a = 1, 10, 10^2, 10^3, 10^4$. Comment on the rate of convergence.
5. Explain why the rates of convergence you found above agree with theory.

1.1 Soln to Q1
1. Put $m := a \wedge 1$. Then $f_a(x) - \frac{m}{2}\|x\|_2^2 = (1 - \frac{m}{2})x_1^2 + (a - \frac{m}{2})x_2^2 - 2ax_2 + a$, which is convex as a sum of convex functions. It's closed because, by the continuity of $f_a$ on $\mathbb{R}^2$, the sublevel sets $\{x : f_a(x) \le c\}$ are closed sets. Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ be a strongly convex function. Then $\exists\, m > 0$ s.t. $\bar f(x) := f(x) - \frac{m}{2}\|x\|_2^2$ is convex, so $f(x) = \bar f(x) + \frac{m}{2}\|x\|_2^2$ is convex as a sum of convex functions.
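An equivalent, second-order way to see the strong convexity (this check is not needed for the argument above, but it makes the constant explicit): the Hessian is
$$\nabla^2 f_a(x) = \begin{pmatrix} 2 & 0 \\ 0 & 2a \end{pmatrix} \succeq mI, \qquad m := \min(2, 2a) > 0,$$
so $x \mapsto f_a(x) - \frac{m}{2}\|x\|_2^2$ has positive semidefinite Hessian everywhere and is therefore convex, i.e. $f_a$ is strongly convex with parameter $m$.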
2. See Algorithm 1.
3. Note $f_a(x) \ge 0$ $\forall x$ (it is a sum of squares), and $f_a^* = 0$ is attained by choosing $x^* = e_2$, since both squared terms vanish there.
4. Running this in MATLAB is a simple matter of setting the desired value of $a$ in obj_fun_file.m and then running "Example4grad" in the console. We obtained the following:

Value of $a$:                        1     10    $10^2$    $10^3$    $10^4$
# iterations until convergence:      14    91    891       8883      88352
Convergence seems to get slower by roughly a factor of 10 if a grows by a factor of 10.
5. Because we have shown $f_a$ is strongly convex and closed, the theory on pp. 436-440 of the textbook applies. By (12.25), on the $k$th iteration we have $f_a(x_k) - f_a^* \le (1 - m s_{lb})^k (f_a(x_0) - f_a^*)$. For us, any $m > 0$ such that $f_a(x) - \frac{m}{2}\|x\|_2^2$ is convex in $x$ works; in particular $m = 1$ does, for every value of $a$ used here. By (12.18) and (12.19), $s_{lb} \propto \frac{1}{L}$, where $L$ is the Lipschitz constant of $\nabla f_a$. But
$$\nabla f_a(x) = \begin{pmatrix} 2x_1 \\ 2a(x_2 - 1) \end{pmatrix} = \begin{pmatrix} 2 & 0 \\ 0 & 2a \end{pmatrix}(x - e_2),$$
so the Lipschitz constant is the operator norm of this matrix: $L = 2a$ (for $a \ge 1$). As $a$ increases, so does $L$; the contraction factor $1 - m s_{lb}$ then moves toward 1, and we expect slower convergence.
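As a sanity check on this analysis, the sketch of Algorithm 1 given above can be swept over the values of $a$; the parameter choices below ($\alpha = 0.3$, $\beta = 0.8$, the starting point, tolerance, and iteration cap) are assumptions and need not match those in Example4grad.m:

% Sweep a over the values from part 4 and record iteration counts.
avals = [1 10 1e2 1e3 1e4];
iters = zeros(size(avals));
for j = 1:numel(avals)
    a = avals(j);
    % objective value and gradient of f_a at x, as a single handle
    fobj = @(x) deal(x(1)^2 + a*(x(2)-1)^2, [2*x(1); 2*a*(x(2)-1)]);
    [~, iters(j)] = grad_descent_armijo(fobj, [1; -1], 0.3, 0.8, 1e6, 1e-6);
end
disp([avals; iters])   % iteration counts should grow roughly linearly in a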

2 Q2
In Support Vector Machines (SVM), we are given data points $x_1, \ldots, x_n$ with corresponding labels $y_1, \ldots, y_n$, each $y_i \in \{\pm 1\}$. We seek a hyperplane $\{x : w^\top x + b = 0\}$ that "approximately" separates the points with labels 1 on one side and labels -1 on the other side. This can be posed as
$$\min_{w,b} \; \sum_{i=1}^n \left(1 - y_i(w^\top x_i + b)\right)_+ \tag{1}$$

We implement this with spam data, where we have 5172 observations, 32 features (counts of certain words/symbols), and
labels 1 for spam, 0 for not spam. To obtain these data, load spam_data.mat. The features are given in training_data, the
labels are given in training_labels.

1. (Data Preprocessing) To be consistent with the SVM setup, change the labels to 1 for spam, -1 for not spam. Then
partition the data randomly into a training set of size 4000, and a test set with the rest of the data.
2. Use CVX to solve (1) with the training data. Report the proportion of misclassified samples on the training and test
set. Can you see a drawback of this optimization problem?
3. Consider the following variants on (1), which are referred to as $\ell_2$ or $\ell_1$ regularization:
$$\min_{w,b} \; \sum_{i=1}^n \left(1 - y_i(w^\top x_i + b)\right)_+ + \frac{\lambda_2}{2}\|w\|_2^2 \tag{2}$$
$$\min_{w,b} \; \sum_{i=1}^n \left(1 - y_i(w^\top x_i + b)\right)_+ + \lambda_1 \|w\|_1$$

Give a brief, intuitive motivation for regularization. Use 10-fold cross validation to select a good choice of $\lambda_2$, $\lambda_1$ for both problems in (2), and report them (if you don't know what cross validation is, look it up on Wikipedia). Also report the proportion of misclassified samples on the training and test set for both formulations.

2.1 Soln to Q2
1. MATLAB code below:
load('spam_data.mat')
% Relabel: 0/1 labels become -1/+1, as required by the SVM setup
training_labels = double(2*training_labels - 1);
% Random split: 4000 training samples, the remaining 1172 for testing
indall = randperm(5172);
emtr  = training_data(indall(1:4000), :);
lbltr = training_labels(indall(1:4000));
emv   = training_data(indall(4001:5172), :);
lblv  = training_labels(indall(4001:5172));
2. MATLAB code below:
% Multiply each row of the features by its label, so the hinge argument is 1 - emtr2*w - b*lbltr'
emtr2 = bsxfun(@times, emtr, lbltr');
cvx_begin
    variable w(32);
    variable b;
    minimize( sum(pos(1 - emtr2*w - b*lbltr')) / 4000 )
cvx_end
% Misclassification rates on the training and test sets
pred = sign(emtr*w + b);
mean(pred ~= lbltr')
pred = sign(emv*w + b);
mean(pred ~= lblv')
We obtained training and testing errors both in the ballpark of 0.2; often the test error was lower than the training error. Still, this is only a modest improvement over the trivial baseline of always predicting "not spam", since about 0.29 of the test data was actually spam. A drawback of this approach is observed by looking at the $w$ that results: many coefficients are on the order of $10^6$. This is a sign of instability, given the much more modest size of the entries of the predictors. Indeed, using a different random permutation of the data, we obtain a vastly different $w$ (for instance, its first entry went from about 5.3e6 to about 0).
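An illustrative way to see this instability, assuming w1 and w2 hold the coefficient vectors fitted (by the CVX problem above) on two different random splits:

% Compare coefficient magnitudes across the two splits.
[max(abs(w1)), max(abs(w2))]              % both on the order of 1e6
plot(w1, w2, 'o'); grid on;
xlabel('coefficients, split 1'); ylabel('coefficients, split 2');
% Points far from the diagonal are coefficients that swing wildly between
% splits (e.g. the first entry going from about 5.3e6 to about 0).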
3. The motivation for regularization is to control the complexity of the model; given the enormous $w$'s above, this is a good idea here. Here is a function we wrote to perform 10-fold cross validation for our data:
function merr = cv10foldsp(indcv, l1, l2, training_data, training_labels)
% 10-fold cross validation for the l1- and l2-regularized SVM problems.
% indcv: indices of the 4000 training samples (in random order)
% l1, l2: candidate regularization parameters lambda_1, lambda_2
allem  = training_data(indcv, :);
alllbl = training_labels(indcv);
errs = zeros(2, 10);
for k = 1:10
    % Hold out the k-th block of 400 samples as the validation fold
    indv = (400*k - 399):(400*k);
    emtr = allem;
    emtr(indv, :) = [];
    lbltr = alllbl;
    lbltr(indv) = [];
    emv  = allem(indv, :);
    lblv = alllbl(indv);
    emtr2 = bsxfun(@times, emtr, lbltr');
    % l2-regularized hinge loss
    cvx_begin
        variable w(32);
        variable b;
        minimize( sum(pos(1 - emtr2*w - b*lbltr')) / 3600 + l2*sum_square(w) )
    cvx_end
    pred = sign(emv*w + b);
    errs(1, k) = mean(pred ~= lblv');
    % l1-regularized hinge loss
    cvx_begin
        variable w(32);
        variable b;
        minimize( sum(pos(1 - emtr2*w - b*lbltr')) / 3600 + l1*norm(w, 1) )
    cvx_end
    pred = sign(emv*w + b);
    errs(2, k) = mean(pred ~= lblv');
end
merr = mean(errs, 2);
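To select $\lambda_1$, $\lambda_2$ one can call cv10foldsp once per candidate; a minimal sketch of such a selection loop (the candidate grids and variable names here are illustrative assumptions, not the exact values we tried):

% Grid search over candidate regularization strengths using 10-fold CV.
% Each call solves both problems, so one loop covers both grids.
indcv  = indall(1:4000);                  % the 4000 training indices from part 1
l1grid = 10.^(-6:-1:-12);
l2grid = 10.^(-6:-1:-12);
cverr  = zeros(2, numel(l1grid));
for j = 1:numel(l1grid)
    cverr(:, j) = cv10foldsp(indcv, l1grid(j), l2grid(j), training_data, training_labels);
end
[~, j2] = min(cverr(1, :));  best_l2 = l2grid(j2);   % row 1: l2 problem
[~, j1] = min(cverr(2, :));  best_l1 = l1grid(j1);   % row 2: l1 problem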
We tried various values of $\lambda_1$, $\lambda_2$; for example:

% l2-regularized problem on the full training set with the chosen lambda_2
cvx_begin
    variable w(32);
    variable b;
    minimize( sum(pos(1 - emtr2*w - b*lbltr')) / 4000 + 10^-12*sum_square(w) )
cvx_end
pred = sign(emtr*w + b);
mean(pred ~= lbltr')
pred = sign(emv*w + b);
mean(pred ~= lblv')
% l1-regularized problem on the full training set with the chosen lambda_1
cvx_begin
    variable w(32);
    variable b;
    minimize( sum(pos(1 - emtr2*w - b*lbltr')) / 4000 + 10^-6*norm(w, 1) )
cvx_end
pred = sign(emtr*w + b);
mean(pred ~= lbltr')
pred = sign(emv*w + b);
mean(pred ~= lblv')
We noticed that the error seemed to decrease monotonically (though very gradually) as $\lambda_1$, $\lambda_2$ decreased; in fact, we kept decreasing them until CVX started having numerical issues. The training and testing error were still both in the ballpark of 0.2, with the test error often lower than the training error. However, when we look at the optimal $w$, we find the largest coefficients are now on the order of $10^2$, so in a sense these $w$'s are better behaved. This is also desirable since they produce a test error similar to that of the $w$'s with enormous weights encountered in part 2.
So the results of this exercise are a bit odd, but it is enough for students to supply the requested information, implement the models using CVX, and do cross validation, picking some candidate values of $\lambda_1$, $\lambda_2$ and choosing the one that ultimately gives the lowest CV error.
