
# Matrix Methods in Machine Learning

Lecture Notes
Rebekah Dix

November 11, 2018

Contents

1 Elements of Machine Learning

2 Linear Algebra Review
  2.1 Products
  2.2 Linear Independence

3 Linear Systems and Vector Norms
  3.1 Vector Norms

4 Least Squares
  4.1 Geometric Approach
  4.2 Vector Calculus Approach
    4.2.1 Review of Vector Calculus
    4.2.2 Application to Least Squares
  4.3 Positive Definite Matrices
  4.4 Subspaces
  4.5 Least Squares with Orthonormal Basis for Subspace
    4.5.1 Orthogonal Matrices and Orthonormal Basis
    4.5.2 Back to LS
    4.5.3 Gram-Schmidt Orthogonalization Algorithm

5 Least Squares Classification

6 Tikhonov Regularization/Ridge Regression
  6.1 Tikhonov Regularization Derivation
    6.1.1 Derivation with Vector Calculus
    6.1.2 Alternative Derivation

7 Singular Value Decomposition
  7.1 Interpretation of SVD
  7.2 Low-Rank Approximation

8 Power Iteration and Page Rank
  8.1 SVD: Connection to Eigenvalues/vectors

9 Matrix Completion
  9.1 Iterative Singular Value Thresholding

10 Iterative Solvers
  10.1 Gradient Descent/Landweber Iteration

11 Regularized Regression
  11.1 Proximal Gradient Algorithm
  11.2 LASSO (Least absolute shrinkage and selection operator)

12 Convexity and Support Vector Machines
  12.1 Convexity
  12.2 Support Vector Machines
## 1 Elements of Machine Learning

1. Collect data.

2. Preprocessing: changing data to simplify subsequent operations without losing relevant information.

3. Feature extraction: reduce raw data by extracting features or properties relevant to the model.

4. Generate training samples: a large collection of examples we can use to learn the model.

5. Loss function: to learn the model, we choose a loss function (i.e. a measure of how well a model fits the data).

6. Learn the model: search over a collection of candidate models or model parameters to find one that minimizes the loss on training data.

7. Characterize generalization error (the error of our predictions on new data that was not used for training).

## 2 Linear Algebra Review

## 2.1 Products

Inner products:

$$\langle x, w \rangle = \sum_{j=1}^{p} w_j x_j = x^T w = w^T x \tag{1}$$

Thus this inner product is a weighted sum of the elements of $x$.

Matrix-vector multiplication:

$$Xw = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix} w = \begin{bmatrix} x_1^T w \\ x_2^T w \\ \vdots \\ x_n^T w \end{bmatrix} \tag{2}$$

Matrix-matrix multiplication:

Example 1. Let $X \in \mathbb{R}^{n \times p}$ ($n$ movies, $p$ people), $T \in \mathbb{R}^{n \times r}$, and $W \in \mathbb{R}^{r \times p}$. We can think of $T$ as the taste profiles of $r$ representative customers and $W$ as the weights on each representative profile (there will be one set of weights for each customer). Suppose we have two representative taste profiles (e.g. an action lover and a romance lover). Then $w$ will be a 2-vector containing the weights on the two representative taste profiles, and $Tw$ is the expected preferences of a customer who weights the representative taste profiles of $T$ with the weights given in $w$.

$$X = TW \implies X_{ij} = \langle i\text{th row of } T,\ j\text{th column of } W \rangle \tag{3}$$

• The $j$th column of $X$ is a weighted sum of the columns of $T$, where the $j$th column of $W$ tells us the weights:
$$x_j = T w_j \tag{4}$$
That is, the tastes (preferences) of the $j$th customer.

• The $i$th row of $X$ is $x_i^T = t_i^T W$, where $t_i^T$ is the $i$th row of $T$. This gives us how much each customer likes movie $i$.

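Inner product representation:

$$TW = \begin{bmatrix} t_1^T \\ t_2^T \\ \vdots \\ t_n^T \end{bmatrix} \begin{bmatrix} w_1 & w_2 & \cdots & w_p \end{bmatrix} = \begin{bmatrix} t_1^T w_1 & t_1^T w_2 & \cdots & t_1^T w_p \\ t_2^T w_1 & t_2^T w_2 & \cdots & t_2^T w_p \\ \vdots & \vdots & \ddots & \vdots \\ t_n^T w_1 & t_n^T w_2 & \cdots & t_n^T w_p \end{bmatrix} \tag{5}$$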

Outer product representation:

$$TW = \begin{bmatrix} T_1 & T_2 & \cdots & T_r \end{bmatrix} \begin{bmatrix} w_1^T \\ w_2^T \\ \vdots \\ w_r^T \end{bmatrix} = \sum_{k=1}^{r} T_k w_k^T \tag{6}$$

This is a sum of rank-1 matrices. $TW$ has rank $r$ if and only if the columns of $T$ and the rows of $W$ are linearly independent. In this representation, we can think about $T_k$ as the $k$th representative taste profile and $w_k^T$ as the $k$th row of $W$, or the affinity of each customer with the $k$th representative profile.
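The three views of $X = TW$ described above (entrywise inner products, columns as weighted sums, and a sum of rank-1 outer products) can be checked numerically. The following is a minimal sketch using NumPy; the sizes and the random matrices are illustrative assumptions, not data from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, p = 4, 2, 3                   # movies, representative profiles, customers (assumed sizes)
T = rng.standard_normal((n, r))     # columns: representative taste profiles
W = rng.standard_normal((r, p))     # column j: weights for customer j

X = T @ W

# Inner product view: X[i, j] = <ith row of T, jth column of W>
X_inner = np.array([[T[i, :] @ W[:, j] for j in range(p)] for i in range(n)])

# Column view: x_j = T w_j
X_cols = np.column_stack([T @ W[:, j] for j in range(p)])

# Outer product view: sum of r rank-1 matrices T_k w_k^T
X_outer = sum(np.outer(T[:, k], W[k, :]) for k in range(r))
```

All three constructions should agree with `T @ W` up to floating-point rounding.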

## 2.2 Linear Independence

Definition 1. (Linear Independence) Vectors $v_1, v_2, \dots, v_n \in \mathbb{R}^p$ are linearly independent if and only if

$$\sum_{j=1}^{n} \alpha_j v_j = 0 \iff \alpha_j = 0, \quad j = 1, \dots, n \tag{7}$$

Definition 2. (Matrix rank) The rank of a matrix is the maximum number of linearly independent columns. The rank of a matrix is at most the smallest dimension of the matrix.

## 3 Linear Systems and Vector Norms

Example 2. (Condition on rank($A$) for existence of exact solution) Consider the linear system of equations $Ax = b$. This means that $b$ is a weighted sum of the columns of $A$. Suppose $A$ is full rank. Now consider the matrix $\begin{bmatrix} A & b \end{bmatrix}$. If the rank of $\begin{bmatrix} A & b \end{bmatrix}$ were greater than the rank of $A$ (since the number of columns of the matrix increased by 1 and $A$ is assumed full rank, this would imply the rank is $\text{rank}(A) + 1$), this would mean that $b$ could not be written as a linear combination of the columns of $A$, and that the system would not have an exact solution. Therefore, we must have that $\text{rank}(\begin{bmatrix} A & b \end{bmatrix}) = \text{rank}(A)$ in order for the system $Ax = b$ to have an exact solution.

To see how the definition of linear independence applies here, observe that $Ax = b \implies Ax - b = 0$. Therefore

$$\begin{bmatrix} A & b \end{bmatrix} \begin{bmatrix} x \\ -1 \end{bmatrix} = 0 \tag{8}$$

Thus, if $Ax = b$ has an exact solution, then $\begin{bmatrix} A & b \end{bmatrix}$ does not have linearly independent columns.

Example 3. (Condition on rank($A$) for more than one exact solution) If the system of linear equations $Ax = b$ has more than one exact solution, then there is at least one nonzero vector $w$ for which $x + w$ is also a solution. That is, $A(x + w) = b$. If $x$ is an exact solution, then $Ax = b$. This implies $Aw = 0$. Therefore, the columns of $A$ are linearly dependent. Thus, if an exact solution exists and $\text{rank}(A) < \dim(x)$, then there will be more than one exact solution.

Example 4. (Apply the above conditions) Let

$$A = \begin{bmatrix} 1 & -2 \\ -1 & 2 \\ -2 & 4 \end{bmatrix}, \quad b = \begin{bmatrix} 2 \\ -2 \\ -4 \end{bmatrix} \tag{9}$$

We want to solve $Ax = b$.

• This system has an exact solution, since $\text{rank}(A) = \text{rank}(\begin{bmatrix} A & b \end{bmatrix})$. This follows since the columns of $A$ are linearly dependent, so it has rank 1, and $b$ is a multiple of the columns of $A$, so the rank of $\begin{bmatrix} A & b \end{bmatrix}$ is also 1.

• Note that $1 = \text{rank}(A) < \dim(x) = 2$. Therefore this system does not have a unique solution.
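The two rank conditions can be checked directly for the matrices of Example 4. This is a small sketch using `numpy.linalg.matrix_rank`; the particular solutions `x1` and `x2` are illustrative choices, not taken from the notes.

```python
import numpy as np

A = np.array([[1.0, -2.0], [-1.0, 2.0], [-2.0, 4.0]])
b = np.array([2.0, -2.0, -4.0])

rank_A = np.linalg.matrix_rank(A)
rank_Ab = np.linalg.matrix_rank(np.column_stack([A, b]))

# An exact solution exists iff appending b does not raise the rank.
has_exact_solution = rank_Ab == rank_A
# It is unique only if, in addition, rank(A) equals the number of unknowns.
is_unique = has_exact_solution and rank_A == A.shape[1]

# Two different exact solutions: they differ by a null-space vector of A.
x1 = np.array([2.0, 0.0])
x2 = x1 + np.array([2.0, 1.0])   # A @ [2, 1] = 0
```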

## 3.1 Vector Norms

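Definition 3. (Vector Norm) A vector norm is a function $\|\cdot\|$ mapping $\mathbb{R}^n \to \mathbb{R}$ with the following properties.

1. $\|x\| \ge 0$ for all $x \in \mathbb{R}^n$.

2. $\|x\| = 0$ if and only if $x = 0$.

3. $\|\alpha x\| = |\alpha| \, \|x\|$ for all $\alpha \in \mathbb{R}$, $x \in \mathbb{R}^n$.

4. $\|x + y\| \le \|x\| + \|y\|$ for all $x, y \in \mathbb{R}^n$.

Helpful fact: $\|x\|_{q'} \le \|x\|_q$ if $1 \le q \le q' \le \infty$.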
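As a quick numerical illustration of the helpful fact and of the triangle inequality, the sketch below evaluates the 1-, 2- and $\infty$-norms with `numpy.linalg.norm`. The vectors are arbitrary examples chosen for illustration.

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])
y = np.array([-1.0, 2.0, 5.0])

norm_1 = np.linalg.norm(x, 1)         # sum of absolute values
norm_2 = np.linalg.norm(x, 2)         # Euclidean length
norm_inf = np.linalg.norm(x, np.inf)  # largest absolute entry

# Triangle inequality in the 2-norm
lhs = np.linalg.norm(x + y, 2)
rhs = np.linalg.norm(x, 2) + np.linalg.norm(y, 2)
```

For this `x` the ordering is $\|x\|_\infty \le \|x\|_2 \le \|x\|_1$, as the helpful fact predicts.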

Figure 1: The $\ell_p$ norm in $\mathbb{R}^2$

## 4 Least Squares

We are given:

1. Vector of labels $y \in \mathbb{R}^n$

2. Matrix of features $X \in \mathbb{R}^{n \times p}$

We want to find:

1. Vector of weights $w \in \mathbb{R}^p$

Assumptions:

1. $n \ge p$, and $\text{rank}(X) = p$.

If $y = Xw$, then we have a system of $n$ linear equations, where the $i$th equation is

$$y_i = w_1 x_{i1} + w_2 x_{i2} + \dots + w_p x_{ip} = \sum_{j=1}^{p} w_j x_{ij} = \langle w, x_{i\cdot} \rangle \tag{10}$$

where $x_{i\cdot}$ is the $i$th row of $X$.

In general, $y \ne Xw$ for any $w$. We define a residual $r_i = y_i - \langle w, x_{i\cdot} \rangle$. Our goal is then to find the $w$ that minimizes $\sum_{i=1}^{n} |r_i|^2$ (the sum of squared residuals/errors).

Why should we minimize the sum of squared errors?

1. It magnifies the effect of large errors.

2. It allows us to compute derivatives.

## 4.1 Geometric Approach

We know $\hat{r} = y - X\hat{w}$ is orthogonal to the span of the columns of $X$. Thus $x_i^T \hat{r} = 0$ for each column $x_i$ of $X$, or $X^T \hat{r} = 0$. This implies $X^T (y - X\hat{w}) = 0$. Thus $\hat{w}$ is a solution to the linear system of equations

$$X^T X \hat{w} = X^T y \tag{11}$$

Figure 2: Geometry of LS in R2

Observations:

• The question we're trying to answer: what is the point in $\text{col}(X)$ that has the shortest distance to $y$? In $\mathbb{R}^2$, what are the weights $\beta_1$ and $\beta_2$ such that $\beta_1 x_1 + \beta_2 x_2$ has the shortest distance to $y$?

• $\text{col}(X)$ is the space of all vectors that can be written as $\alpha x_1 + \beta x_2$ for some $\alpha, \beta \in \mathbb{R}$, that is, the span of the columns of $X$. $y$ may not lie in this space.

• The residual vector will form a right angle with $\text{col}(X)$, because any other angle would correspond to a longer distance.
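The orthogonality condition can be verified numerically: solving the normal equations (11) should give a residual with $X^T \hat{r} \approx 0$. This is a minimal sketch on synthetic data (the random `X` and `y` are assumptions for illustration), with `numpy.linalg.lstsq` used as an independent cross-check.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3
X = rng.standard_normal((n, p))   # full column rank with probability 1
y = rng.standard_normal(n)

# Solve the normal equations X^T X w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Residual should be orthogonal to every column of X
r_hat = y - X @ w_hat
orthogonality = X.T @ r_hat

# Independent check with NumPy's least squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```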

## 4.2 Vector Calculus Approach

## 4.2.1 Review of Vector Calculus

Let $w$ be a $p$-vector and let $f$ be a function of $w$ that maps $\mathbb{R}^p$ to $\mathbb{R}$. Then the gradient of $f$ with respect to $w$ is

$$\nabla_w f(w) = \begin{bmatrix} \frac{\partial f(w)}{\partial w_1} \\ \vdots \\ \frac{\partial f(w)}{\partial w_p} \end{bmatrix} \tag{12}$$

Example 5. (Gradient of an Inner Product) Let $f(w) = \langle a, w \rangle = w^T a = \sum_{i=1}^{p} w_i a_i$. Then

$$\nabla_w w^T a = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix} = a \tag{13}$$

Example 6. (Gradient of an Inner Product, Squared) Let $f(w) = \|w\|_2^2 = w^T w = w_1^2 + \dots + w_p^2$. Then

$$\nabla_w w^T w = \begin{bmatrix} 2w_1 \\ 2w_2 \\ \vdots \\ 2w_p \end{bmatrix} = 2w \tag{14}$$

(This is a special case of the quadratic form $w^T Q w$ discussed below, with $Q = I$.)

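Example 7. (Gradient of a Quadratic Form) Let $x \in \mathbb{R}^n$ and $f(x) = x^T Q x$, where $Q$ is symmetric (if $Q$ isn't symmetric we could replace $Q$ with $\frac{1}{2}(Q + Q^T)$, which leaves $f$ unchanged). Then

$$f(x) = x^T Q x = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i Q_{ij} x_j$$

Differentiating a single term with respect to $x_k$ gives

$$\frac{\partial (x_i Q_{ij} x_j)}{\partial x_k} = \begin{cases} 2 Q_{kk} x_k & i = j = k \\ Q_{kj} x_j & i = k,\ j \ne k \\ Q_{ik} x_i & j = k,\ i \ne k \\ 0 & \text{otherwise} \end{cases} \tag{15}$$

Summing over all $i, j$ gives $[\nabla_x f]_k = \sum_{j} Q_{kj} x_j + \sum_{i} Q_{ik} x_i$. Therefore

$$\nabla_x f = (Q + Q^T) x \tag{16}$$

If $Q$ is symmetric, then this equals $2Qx$.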
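A finite-difference check is a simple way to confirm the gradient formula $(Q + Q^T)x$. The sketch below uses central differences on a random, deliberately non-symmetric $Q$; the helper `numerical_gradient` and the test matrices are hypothetical, introduced only for this illustration.

```python
import numpy as np

def quad(x, Q):
    """Quadratic form f(x) = x^T Q x."""
    return x @ Q @ x

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        grad[k] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

rng = np.random.default_rng(2)
n = 4
Q = rng.standard_normal((n, n))    # not symmetric on purpose
x = rng.standard_normal(n)

grad_formula = (Q + Q.T) @ x
grad_numeric = numerical_gradient(lambda v: quad(v, Q), x)
```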

## 4.2.2 Application to Least Squares

Let $f(w) = \|y - Xw\|_2^2$. Then the least squares problem is

$$\hat{w} = \arg\min_w f(w) \tag{17}$$

Expanding the objective,

$$\begin{aligned} f(w) &= (y - Xw)^T (y - Xw) \\ &= y^T y - y^T X w - w^T X^T y + w^T X^T X w \\ &= y^T y - 2 w^T X^T y + w^T X^T X w \end{aligned}$$

Then

$$\nabla_w f(w) = -2 X^T y + 2 X^T X w$$

At an optimum the gradient is zero, so $\hat{w}$ solves $X^T y = X^T X w$. Then if $(X^T X)^{-1}$ exists, we have that

$$\hat{w} = (X^T X)^{-1} X^T y \tag{18}$$

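Theorem 1. (Sufficient Condition for Existence/Uniqueness of LS Solution) If the columns of $X$ are linearly independent, then $X^T X$ is non-singular, and there exists a unique least squares solution $\hat{w} = (X^T X)^{-1} X^T y$.

Proof. (Sketch.) If the columns of $X$ are linearly independent, then $X^T X$ is positive definite and hence invertible (properties 3 and 4 of positive definite matrices in Section 4.3). The condition $X^T X w = X^T y$ then has the single solution $\hat{w} = (X^T X)^{-1} X^T y$, and since $f$ is a convex quadratic, this stationary point is its minimizer.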

## 4.3 Positive Definite Matrices

Definition 4 (Positive Definite, pd). A matrix $Q$ ($n \times n$) is positive definite (written $Q \succ 0$) if $x^T Q x > 0$ for all $x \in \mathbb{R}^n$, $x \ne 0$.

Definition 5 (Positive Semi-Definite, psd). A matrix $Q$ ($n \times n$) is positive semi-definite (written $Q \succeq 0$) if $x^T Q x \ge 0$ for all $x \in \mathbb{R}^n$, $x \ne 0$.

Properties of positive definite matrices:

1. If $P \succ 0$ and $Q \succ 0$, then $P + Q \succ 0$.

3. For any matrix $A$, $A^T A \succeq 0$ and $A A^T \succeq 0$. Further, if the columns of $A$ are linearly independent, then $A^T A \succ 0$.

4. If $A \succ 0$, then $A^{-1}$ exists.

5. Notation: $A \succ B$ means $A - B \succ 0$.

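Example 8. Let

$$X = \begin{bmatrix} 1 & 1 \\ 1 & 1 \\ 1 & 1 \end{bmatrix} \tag{19}$$

Then

$$X^T X = \begin{bmatrix} 3 & 3 \\ 3 & 3 \end{bmatrix} \tag{20}$$

Consider the vector $a = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$. Then $a^T X^T X a = 0$. Therefore $X^T X$ is not positive definite.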
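Example 8 can be reproduced in a few lines. The sketch below evaluates the quadratic form at $a$ and, as an additional check not used in the notes, inspects the eigenvalues of $X^T X$ with `numpy.linalg.eigvalsh` (a symmetric matrix is positive definite exactly when all its eigenvalues are strictly positive).

```python
import numpy as np

X = np.ones((3, 2))
G = X.T @ X                          # Gram matrix X^T X

a = np.array([1.0, -1.0])
quad_value = a @ G @ a               # a^T X^T X a

eigenvalues = np.linalg.eigvalsh(G)  # ascending order for symmetric matrices
is_positive_definite = bool(eigenvalues[0] > 1e-12)
```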

## 4.4 Subspaces

Definition 6. (Subspace) A set of points $S \subseteq \mathbb{R}^n$ is a subspace if

1. $0 \in S$ ($S$ contains the origin)

2. If $x, y \in S$, then $x + y \in S$

3. If $x \in S$, $\alpha \in \mathbb{R}$, then $\alpha x \in S$.

## 4.5 Least Squares with Orthonormal Basis for Subspace

Suppose we are given a training sample $\{x_i, y_i\}_{i=1}^{n}$, $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$. If the columns of $X$ (the data matrix) are linearly dependent, then $X^T X$ is not invertible. It is then impossible to tell which features are significant predictors of $y$.

Given $X = \begin{bmatrix} x_1 & \cdots & x_p \end{bmatrix}$, we therefore look for a better way to represent the subspace spanned by the columns of $X$.

## 4.5.1 Orthogonal Matrices and Orthonormal Basis

Definition 7. (Orthonormal basis for $X$) An orthonormal basis for the columns of $X$ is a collection of vectors $\{u_1, \dots, u_r\}$ such that the span of the columns of $X$ equals the span of $\{u_1, \dots, u_r\}$. That is, $\text{span}(\{x_1, \dots, x_p\}) = \text{span}(\{u_1, \dots, u_r\})$. Furthermore,

$$u_i^T u_j = \begin{cases} 0 & i \ne j \\ 1 & i = j \end{cases} \tag{21}$$

That is, the $u$ vectors are orthogonal and have norm 1.

Observations:

• The rank $r$ of the subspace must satisfy $r \le \min(n, p)$. $r$ is the number of linearly independent columns of $X$.

• We can place the basis vectors into a basis matrix $U \in \mathbb{R}^{n \times r}$.

Claim 1. (Properties of orthogonal (basis) matrices) Let $U \in \mathbb{R}^{n \times r}$ be an orthogonal (basis) matrix.

1. $U^T U = I$

2. If $V$ is also an orthogonal (basis) matrix of compatible dimensions, then $UV$ is an orthogonal (basis) matrix.

3. $U$ is length preserving: $\|Uv\|_2 = \|v\|_2$ for $v \in \mathbb{R}^r$.

Proof. We prove each item as follows:

1. We can easily see this from the inner product interpretation of matrix multiplication: entry $(i, j)$ of $U^T U$ is $u_i^T u_j$.

2. $(UV)^T UV = V^T U^T U V = V^T V = I$.

3. $\|Uv\|_2^2 = (Uv)^T Uv = v^T U^T U v = v^T v = \|v\|_2^2$.

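## 4.5.2 Back to LS

Suppose $U$ is an orthonormal basis matrix for our data matrix $X$. Then, the least squares problem is

$$\hat{v} = \arg\min_v \|y - Uv\|_2^2 \tag{22}$$

Since $U^T U = I$, the least squares solution simplifies to $\hat{v} = (U^T U)^{-1} U^T y = U^T y$.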

## 4.5.3 Gram-Schmidt Orthogonalization Algorithm

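How can we take $X$ and get an orthonormal basis $U$?

1. Input: $X = \begin{bmatrix} x_1 & \cdots & x_p \end{bmatrix} \in \mathbb{R}^{n \times p}$. Output: $U = \begin{bmatrix} u_1 & \cdots & u_r \end{bmatrix} \in \mathbb{R}^{n \times r}$, where $r = \text{rank}(X) \le \min(n, p)$.

2. Initialize $u_1 = \dfrac{x_1}{\|x_1\|_2}$.

3. For $j = 2, 3, \dots, p$, let $x_j'$ be all the components of $x_j$ not represented by $u_1, \dots, u_{j-1}$:

$$x_j' = x_j - \sum_{i=1}^{j-1} (u_i^T x_j) u_i \tag{23}$$

Here $(u_i^T x_j)$ is the least squares weight for $u_i$. Then normalize:

$$u_j = \begin{cases} \dfrac{x_j'}{\|x_j'\|_2} & x_j' \ne 0 \\ 0 & x_j' = 0 \end{cases} \tag{24}$$

(A zero $u_j$ adds nothing to the span, so only the $r$ nonzero vectors are kept as columns of $U$.)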
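The steps above can be sketched in NumPy as follows. This is an illustrative implementation of classical Gram-Schmidt as written in (23) and (24), not a numerically robust one; the function name `gram_schmidt` and the tolerance `tol` used to decide when $x_j' = 0$ are assumptions introduced here.

```python
import numpy as np

def gram_schmidt(X, tol=1e-10):
    """Return U (n x r) with orthonormal columns spanning the columns of X."""
    n, p = X.shape
    basis = []
    for j in range(p):
        x_j = X[:, j].astype(float)
        # Remove the components of x_j already represented by u_1, ..., u_{j-1}
        x_res = x_j - sum((u @ x_j) * u for u in basis)
        norm = np.linalg.norm(x_res)
        if norm > tol:               # x_j' != 0: normalize and keep
            basis.append(x_res / norm)
        # otherwise x_j is linearly dependent on earlier columns and is skipped
    return np.column_stack(basis) if basis else np.zeros((n, 0))

# Third column is the sum of the first two, so rank(X) = 2
X = np.array([[1.0, 1.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])
U = gram_schmidt(X)
```

In practice a QR factorization (for example `numpy.linalg.qr`) is usually preferred for numerical stability when the columns are known to be independent.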

Next, by construction, each column $u_i$ of $U$ is in $\text{span}(\{x_1, \dots, x_p\})$. Therefore we can write

$$u_i = \alpha_{i1} x_1 + \alpha_{i2} x_2 + \dots + \alpha_{ip} x_p \tag{25}$$

where the $\alpha_{ij} \in \mathbb{R}$. We can write this in matrix form as

$$U = XA \tag{26}$$

where $X$ is $n \times p$ and $A$ is $p \times r$, and the $i$th column of $A$ is

$$a_i = \begin{bmatrix} \alpha_{i1} \\ \alpha_{i2} \\ \vdots \\ \alpha_{ip} \end{bmatrix} \tag{27}$$

Thus, $u_i = X a_i$.

Now, suppose $w \in \mathbb{R}^p$ is the vector of weights we found using LS, and as above, $v$ is our vector of weights found using LS with an orthonormal basis matrix. We have two expressions for the predicted label $\hat{y}$:

$$\begin{aligned} \hat{y} &= w_1 x_1 + w_2 x_2 + \dots + w_p x_p \\ &= v_1 u_1 + v_2 u_2 + \dots + v_r u_r \\ &= v_1 X a_1 + v_2 X a_2 + \dots + v_r X a_r \\ &= v_1 (\alpha_{11} x_1 + \alpha_{12} x_2 + \dots + \alpha_{1p} x_p) + \dots + v_r (\alpha_{r1} x_1 + \alpha_{r2} x_2 + \dots + \alpha_{rp} x_p) \\ &= x_1 (v_1 \alpha_{11} + \dots + v_r \alpha_{r1}) + \dots + x_p (v_1 \alpha_{1p} + \dots + v_r \alpha_{rp}) \end{aligned}$$

Notice that

$$\begin{aligned} w_1 &= v_1 \alpha_{11} + \dots + v_r \alpha_{r1} \\ &\ \ \vdots \\ w_p &= v_1 \alpha_{1p} + \dots + v_r \alpha_{rp} \end{aligned}$$

Therefore

$$\hat{y} = XAv = Xw \tag{28}$$

so that $Av = w$.
In sum, given a new sample $x_{new} \in \mathbb{R}^p$, we have two ways to predict the label $y_{new}$:

1. $\hat{y}_{new} = \langle x_{new}, w \rangle$

2. Using an orthonormal basis $U$, we know that $U = XA$. Therefore, $u_{new}^T = x_{new}^T A$. Equivalently, $u_{new} = A^T x_{new}$. Then $\hat{y}_{new} = \langle u_{new}, v \rangle$.

If the columns of $X$ are linearly independent ($r = p$), we can calculate $a_i$ using LS (recalling $u_i = X a_i$):

$$a_i = (X^T X)^{-1} X^T u_i \tag{29}$$
Theorem 2. Let $X \in \mathbb{R}^{n \times p}$, $n \ge p$, be full rank (the $p$ columns of $X$ are linearly independent) and $y \in \mathbb{R}^n$. Let $u_1, \dots, u_p$ be orthonormal basis vectors such that $\text{span}(\{x_1, \dots, x_p\}) = \text{span}(\{u_1, \dots, u_p\})$. Then $\hat{y} = X\hat{w}$, where $\hat{w} = \arg\min_w \|y - Xw\|_2^2$, is given by $\hat{y} = U U^T y$, where $U = \begin{bmatrix} u_1 & u_2 & \cdots & u_p \end{bmatrix}$.

Proof.

$$\hat{y} = X\hat{w} = X (X^T X)^{-1} X^T y \tag{30}$$

where $P_x = X (X^T X)^{-1} X^T$ is a projection matrix. Since $\text{span}(\{x_1, \dots, x_p\}) = \text{span}(\{u_1, \dots, u_p\})$, we must have that

$$P_x y = P_u y \tag{31}$$

for every $y$, which implies $P_x = P_u$. Thus

$$P_x = P_u = U (U^T U)^{-1} U^T = U U^T \tag{32}$$

Finally

$$\hat{y} = P_x y = P_u y = U U^T y \tag{33}$$
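Theorem 2 is easy to check numerically. In this sketch the orthonormal basis is obtained from NumPy's reduced QR factorization rather than from Gram-Schmidt (an assumption made for convenience; both give an orthonormal basis for the column space when $X$ has full column rank), and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 10, 3
X = rng.standard_normal((n, p))     # full column rank with probability 1
y = rng.standard_normal(n)

U, _ = np.linalg.qr(X)              # reduced QR: U is n x p, orthonormal columns

P_x = X @ np.linalg.inv(X.T @ X) @ X.T   # projection onto col(X)
P_u = U @ U.T                            # same projection via the orthonormal basis

y_hat_x = P_x @ y
y_hat_u = U @ (U.T @ y)             # avoids forming the n x n projection matrix
```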

## 5 Least Squares Classification

We are given a training sample $\{x_i, y_i\}_{i=1}^{n}$, $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$ (or $y_i \in \{+1, -1\}$).

Definition 8. (Linear Predictor) We have a linear predictor if each label is a linear combination of the features, i.e. we can find weights $\{w_j\}_{j=1}^{p}$ such that

$$y_i = w_1 x_{i1} + w_2 x_{i2} + \dots + w_p x_{ip} \tag{34}$$

In words, this says the label for observation $i$ is a linear combination of the features for example $i$.
The steps to complete least squares classification in this environment are as follows:

1. Build a data matrix or feature matrix and label vector

$$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix} = \begin{bmatrix} x_1^T & 1 \\ x_2^T & 1 \\ \vdots & \vdots \\ x_n^T & 1 \end{bmatrix} \in \mathbb{R}^{n \times p}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \tag{35}$$

(the second form makes explicit a constant feature equal to 1 in each row, which lets the model include an offset).

2. Solve a least squares optimization problem

$$\hat{w} = \arg\min_w \|y - Xw\|_2^2 = \arg\min_w \sum_{i=1}^{n} (y_i - x_i^T w)^2 \tag{36}$$

(this last equality makes it clear that we are minimizing the sum of squared residuals). If the columns of $X$ are linearly independent, then $X^T X$ is positive definite. Therefore $X^T X$ is invertible. In sum, if $X^T X$ is positive definite, then there exists a unique LS solution

$$\hat{w} = (X^T X)^{-1} X^T y \tag{37}$$

The predicted labels are

$$\hat{y} = X\hat{w} = X (X^T X)^{-1} X^T y$$

## 6 Tikhonov Regularization/Ridge Regression

We are given $X \in \mathbb{R}^{n \times p}$ ($n$ training samples, $p$ features) and $y \in \mathbb{R}^n$ ($n$ labels). Our model is $y \approx Xw$, which means $y_i \approx x_i^T w$ for some $w \in \mathbb{R}^p$.

The LS problem is

$$\hat{w}_{LS} = \arg\min_w \|y - Xw\|_2^2 = \arg\min_w \sum_{i=1}^{n} (y_i - x_i^T w)^2 \tag{38}$$

There are two cases:

1. If $X$ is full rank (i.e. the columns of $X$ are linearly independent), then $\hat{w}_{LS}$ is unique and

$$\hat{w}_{LS} = (X^T X)^{-1} X^T y \tag{39}$$

2. If $X$ is not full rank, then $X^T X$ is not invertible. $\hat{w}_{LS}$ is not unique; there are infinitely many solutions.

## 6.1 Tikhonov Regularization Derivation

In this second case (and it can also be useful in the first), we can define a new objective

$$\hat{w} = \arg\min_w \|y - Xw\|_2^2 + \lambda \|w\|_2^2 \tag{40}$$

where $\|y - Xw\|_2^2$ measures the fit to the data, $\lambda > 0$ is a regularization parameter or tuning parameter, and $\|w\|_2^2$ is a regularizer that measures the energy in $w$. This objective has two useful properties:

1. $\hat{w}$ is unique even when no unique least squares solution exists.

2. Even when $X$ is full rank, $X^T X$ can be badly behaved (ill-conditioned), and regularization adjusts for this.

## 6.1.1 Derivation with Vector Calculus

Let $f(w) = \|y - Xw\|_2^2 + \lambda \|w\|_2^2$. Then

$$\begin{aligned} f(w) &= y^T y - 2 w^T X^T y + w^T X^T X w + \lambda w^T w \\ &= y^T y - 2 w^T X^T y + w^T (X^T X + \lambda I) w \end{aligned}$$

Then

$$\nabla_w f(w) = -2 X^T y + 2 (X^T X + \lambda I) w \tag{41}$$

If $(X^T X + \lambda I)$ is invertible, then setting the gradient to zero gives $\hat{w} = (X^T X + \lambda I)^{-1} X^T y$. In fact, $(X^T X + \lambda I)$ is always invertible. Recall that if a matrix is positive definite, then it is invertible. We can show that $(X^T X + \lambda I)$ is indeed positive definite and hence invertible. To see this, fix $0 \ne a \in \mathbb{R}^p$, then

$$\begin{aligned} a^T (X^T X + \lambda I) a &= a^T X^T X a + \lambda a^T a \\ &= \|Xa\|_2^2 + \lambda \|a\|_2^2 \end{aligned}$$

Now note that $\|Xa\|_2^2 \ge 0$ (it could be 0 if $X$ is not full rank and $a$ is in the null space of $X$; this is what causes trouble with LS) but $\lambda \|a\|_2^2 > 0$. Therefore, $(X^T X + \lambda I)$ is positive definite.
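The closed form $\hat{w} = (X^T X + \lambda I)^{-1} X^T y$ can be sketched as below. The example reuses a rank-deficient $X$ (all ones) so that plain least squares has no unique solution while the regularized problem does; the helper name `ridge`, the labels `y` and the value of `lam` are illustrative assumptions.

```python
import numpy as np

def ridge(X, y, lam):
    """Tikhonov/ridge solution w = (X^T X + lam I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Rank-1 data matrix: X^T X = [[3, 3], [3, 3]] is singular
X = np.ones((3, 2))
y = np.array([1.0, 2.0, 3.0])
lam = 0.1

w_ridge = ridge(X, y, lam)

# X^T X + lam I is positive definite: its smallest eigenvalue is lam > 0 here
min_eig = np.linalg.eigvalsh(X.T @ X + lam * np.eye(2)).min()

# Gradient (41) should vanish at the solution
grad_at_solution = -2 * X.T @ y + 2 * (X.T @ X + lam * np.eye(2)) @ w_ridge
```

By symmetry both weights are equal here, each solving $(6 + \lambda) c = 6$.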

## 6.1.2 Alternative Derivation

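Note that for vectors $a, b$,

$$\|a\|_2^2 + \|b\|_2^2 = \left\| \begin{bmatrix} a \\ b \end{bmatrix} \right\|_2^2 \tag{42}$$

Therefore,

$$\begin{aligned} \|y - Xw\|_2^2 + \lambda \|w\|_2^2 &= \|y - Xw\|_2^2 + \left\| \sqrt{\lambda}\, w \right\|_2^2 \\ &= \left\| \begin{bmatrix} y - Xw \\ -\sqrt{\lambda}\, w \end{bmatrix} \right\|_2^2 \\ &= \left\| \begin{bmatrix} y \\ 0 \end{bmatrix} - \begin{bmatrix} Xw \\ \sqrt{\lambda}\, w \end{bmatrix} \right\|_2^2 \\ &= \left\| \begin{bmatrix} y \\ 0 \end{bmatrix} - \begin{bmatrix} X \\ \sqrt{\lambda}\, I \end{bmatrix} w \right\|_2^2 \\ &= \left\| \tilde{y} - \tilde{X} w \right\|_2^2 \end{aligned}$$

where $\tilde{y} = \begin{bmatrix} y \\ 0 \end{bmatrix}$ and $\tilde{X} = \begin{bmatrix} X \\ \sqrt{\lambda}\, I \end{bmatrix}$. We can solve this problem with LS, so that

$$\hat{w} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T \tilde{y} \tag{43}$$

where

$$\tilde{X}^T \tilde{X} = X^T X + \lambda I \tag{44}$$

and

$$\tilde{X}^T \tilde{y} = X^T y \tag{45}$$

Thus this is equivalent to the derivation above.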
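The equivalence of the two derivations can be confirmed by stacking $\sqrt{\lambda} I$ under $X$ and zeros under $y$, then calling an ordinary least squares solver. This is a sketch on synthetic data; the sizes and the value of `lam` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam = 8, 3, 0.5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Augmented system: X_tilde = [X; sqrt(lam) I], y_tilde = [y; 0]
X_tilde = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_tilde = np.concatenate([y, np.zeros(p)])

# Ordinary least squares on the augmented system
w_aug, *_ = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)

# Closed-form ridge solution from the vector calculus derivation
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```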

## 7 Singular Value Decomposition

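Theorem 3. Every matrix $X \in \mathbb{R}^{n \times p}$ can be factorized as

$$X = U \Sigma V^T \tag{46}$$

where $U$ and $V$ have orthonormal columns and $\Sigma$ is diagonal with nonnegative entries $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$ (the singular values), as depicted in Figure 3.

Figure 3: SVD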

## 7.1 Interpretation of SVD

1. $U$ is an orthonormal basis for the columns of $X$.

2. $\Sigma V^T$ are the basis coefficients.

Example 9. (Netflix) Let $X \in \mathbb{R}^{n \times p}$ be a matrix (full rank) where the columns are taste profiles of customers and the rows are single movie ratings across customers.

1. The $i$th column of $U$ is a basis vector in $\mathbb{R}^n$ and is the $i$th representative customer taste profile (i.e. vector of normalized movie ratings).

2. The $j$th column of $V^T$ (the $j$th row of $V$) is the relative importance of each representative taste profile to predicting customer $j$'s preferences.

3. The $i$th row of $V^T$ (the $i$th column of $V$) is the vector of users' affinities to the $i$th representative profile.

## 7.2 Low-Rank Approximation

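Theorem 4. (Subspace Approximation) If $X \in \mathbb{R}^{p \times n}$ has rank $r > k$, then the solution to

$$\min_{Z \,:\, \text{rank}(Z) = k} \|X - Z\|_F \tag{47}$$

is given by $Z = X_k = U_k \Sigma_k V_k^T$ (the SVD truncated to the $k$ largest singular values) and

$$\|X - X_k\|_F^2 = \sum_{i=k+1}^{r} \sigma_i^2 \tag{48}$$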
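The error identity (48) can be checked by truncating an SVD computed with `numpy.linalg.svd`. The matrix below is random and the choice $k = 2$ is arbitrary; both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((6, 5))
k = 2

# Economy-size SVD: X = U diag(s) Vt, singular values in descending order
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-k approximation: keep the k largest singular values
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

err_sq = np.linalg.norm(X - X_k, "fro") ** 2   # squared Frobenius error
tail_sq = np.sum(s[k:] ** 2)                   # sum of discarded sigma_i^2
```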

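## 8 Power Iteration and Page Rank

## 8.1 SVD: Connection to Eigenvalues/vectors

Suppose $X = U \Sigma V^T \in \mathbb{R}^{p \times n}$. Then

$$\begin{aligned} A := X^T X &= V \Sigma U^T U \Sigma V^T \\ &= V \Sigma^2 V^T \\ &= V \Lambda V^T \end{aligned}$$

Thus the columns of $V$ are the eigenvectors of $A$ and $\Lambda = \Sigma^2$ contains its eigenvalues.

Power iteration gives a method to find the first right singular vector (the eigenvector of $A$ with the largest eigenvalue).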
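The notes do not spell out the power iteration itself, so the following is a minimal sketch of the standard method under stated assumptions: repeatedly multiply a random start vector by $A = X^T X$ and renormalize. The function name, the iteration count, the seed and the small example matrix are all illustrative choices.

```python
import numpy as np

def power_iteration(A, num_iters=1000, seed=0):
    """Approximate the dominant eigenvector of a symmetric PSD matrix A."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = A @ v                   # amplify the dominant eigen-direction
        v /= np.linalg.norm(v)      # renormalize to unit length
    return v

X = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
A = X.T @ X

v1 = power_iteration(A)
sigma1 = np.sqrt(v1 @ A @ v1)       # largest singular value of X

# Reference values from the SVD (v1 is determined only up to sign)
_, s, Vt = np.linalg.svd(X, full_matrices=False)
```

Convergence is fast when the largest eigenvalue is well separated from the second largest, as it is for this example.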

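## 9 Matrix Completion

## 9.1 Iterative Singular Value Thresholding

Algorithm 1: Iterative Singular Value Thresholding

Require: $\hat{X} = \text{zeros}(n, p)$

Require: $\hat{X}_\Omega = X_\Omega$ (fill in observed entries)

Require: a threshold or a rank $r$

for $k = 0, 1, \dots$ do

• $[U, S, V] \leftarrow \text{svd}(\hat{X})$

• If using a threshold: $\hat{S} \leftarrow S$ with only the singular values $\ge$ threshold kept

• If using a rank: $\hat{S} \leftarrow S(1{:}r, 1{:}r)$ (keep $r$ singular values)

• $\hat{X} \leftarrow U \hat{S} V^T$

• $\hat{X}_\Omega \leftarrow X_\Omega$ (fill in observed entries)

• If converged, i.e. $\|\hat{X} - \hat{X}_{old}\| < \epsilon$, stop.

end for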
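A rank-based variant of Algorithm 1 might look like the sketch below. The function name `iterative_svt`, the boolean `mask` encoding of the observed set $\Omega$, the iteration cap and the tolerance are assumptions made for this illustration; the small rank-1 matrix is a toy example, and no claim is made here about how accurately the hidden entry is recovered.

```python
import numpy as np

def iterative_svt(X_obs, mask, rank, num_iters=500, tol=1e-8):
    """Rank-r version of Algorithm 1. mask is True where an entry is observed."""
    X_hat = np.zeros(X_obs.shape)
    X_hat[mask] = X_obs[mask]                  # fill in observed entries
    for _ in range(num_iters):
        X_old = X_hat.copy()
        U, s, Vt = np.linalg.svd(X_hat, full_matrices=False)
        # Keep only the r largest singular values
        X_hat = U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]
        X_hat[mask] = X_obs[mask]              # re-impose observed entries
        if np.linalg.norm(X_hat - X_old) < tol:
            break                              # converged
    return X_hat

# Toy example: a rank-1 matrix with one hidden entry
X_true = np.outer([1.0, 2.0, 3.0], [1.0, 2.0, 1.0])
mask = np.ones(X_true.shape, dtype=bool)
mask[2, 2] = False
X_obs = np.where(mask, X_true, 0.0)

X_hat = iterative_svt(X_obs, mask, rank=1)
```

The threshold variant would replace the truncation line with one that zeroes singular values below the chosen threshold.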

## 10 Iterative Solvers

Let $\tau > 0$ be the step size.

Algorithm 2: Landweber Iteration

Require: $w^{(0)}$

for $k = 0, 1, \dots$ do

• $w^{(k+1)} \leftarrow w^{(k)} - \tau X^T (X w^{(k)} - y)$

end for

At each iteration, this algorithm takes a step in the direction of the negative gradient of the objective function $f(w) = \|Xw - y\|_2^2$. Notice that

$$\begin{aligned} f(w) &= \|Xw - y\|_2^2 \\ \nabla_w f(w) &= \nabla_w (Xw - y)^T (Xw - y) \\ &= \nabla_w \left( w^T X^T X w - 2 w^T X^T y + y^T y \right) \\ &= 2 X^T X w - 2 X^T y \\ &= 2 X^T (Xw - y) \end{aligned}$$

Thus the new iterate equals the old iterate plus a step in the direction of the negative gradient (the factor of 2 is absorbed into the step size $\tau$).
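A direct transcription of the Landweber update is sketched below on synthetic data. The step size is set to $1 / \|X\|_{op}^2$, which lies inside the range $0 < \tau < 2 / \|X\|_{op}^2$ established in the convergence argument that follows; the function name, the zero starting point and the iteration count are assumptions.

```python
import numpy as np

def landweber(X, y, tau, num_iters=5000):
    """Landweber iteration w <- w - tau X^T (X w - y), started from w = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        w = w - tau * X.T @ (X @ w - y)
    return w

rng = np.random.default_rng(7)
X = rng.standard_normal((30, 3))
y = rng.standard_normal(30)

op_norm = np.linalg.norm(X, 2)      # operator norm = largest singular value
tau = 1.0 / op_norm**2

w_iter = landweber(X, y, tau)
w_ls = np.linalg.solve(X.T @ X, X.T @ y)   # closed-form LS solution for comparison
```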

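Proof. We want to show that

$$\left\| X w^{(k+1)} - y \right\|_2^2 \le \left\| X w^{(k)} - y \right\|_2^2 \tag{49}$$

Recall that the iteration is given by $w^{(k+1)} = w^{(k)} - \tau X^T (X w^{(k)} - y)$. Then

$$\begin{aligned} \left\| X w^{(k+1)} - y \right\|_2^2 &= \left\| X \left( w^{(k)} - \tau X^T (X w^{(k)} - y) \right) - y \right\|_2^2 \\ &= \left\| X w^{(k)} - y - \tau X X^T (X w^{(k)} - y) \right\|_2^2 \\ &= \left\| X w^{(k)} - y \right\|_2^2 + \tau^2 \left\| X X^T (X w^{(k)} - y) \right\|_2^2 - 2\tau (X w^{(k)} - y)^T X X^T (X w^{(k)} - y) \end{aligned}$$

Now observe that

$$\left\| X X^T (X w^{(k)} - y) \right\|_2^2 \le \|X\|_{op}^2 \left\| X^T (X w^{(k)} - y) \right\|_2^2 \tag{50}$$

and that $(X w^{(k)} - y)^T X X^T (X w^{(k)} - y) = \left\| X^T (X w^{(k)} - y) \right\|_2^2$. Therefore

$$\begin{aligned} \left\| X w^{(k+1)} - y \right\|_2^2 &\le \left\| X w^{(k)} - y \right\|_2^2 + \tau \left( \tau \|X\|_{op}^2 \left\| X^T (X w^{(k)} - y) \right\|_2^2 - 2 \left\| X^T (X w^{(k)} - y) \right\|_2^2 \right) \\ &= \left\| X w^{(k)} - y \right\|_2^2 + \tau \left\| X^T (X w^{(k)} - y) \right\|_2^2 \left( \tau \|X\|_{op}^2 - 2 \right) \end{aligned}$$

Thus, if $\tau \|X\|_{op}^2 - 2 < 0$, then $\left\| X w^{(k+1)} - y \right\|_2^2 \le \left\| X w^{(k)} - y \right\|_2^2$. Therefore, to ensure convergence we require that

$$0 < \tau < \frac{2}{\|X\|_{op}^2} \tag{51}$$

Under this condition (and when $X^T X$ is invertible),

$$w^{(k)} \to (X^T X)^{-1} X^T y \tag{52}$$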

The above proof made use of the following claim.

Claim 3. (Bound on 2-norm of matrix-vector product) Let $X$ be a matrix and $w$ a vector (conformable). Then

$$\|Xw\|_2 \le \|X\|_{op} \|w\|_2 \tag{53}$$

Proof. Recall: the 2-norm (here, operator norm) of a matrix is its largest singular value. Suppose $X = U \Sigma V^T$. Then

$$\begin{aligned} \|Xw\|_2 &= \left\| U \Sigma V^T w \right\|_2 \\ &= \left\| \Sigma V^T w \right\|_2 && (U \text{ orthonormal, preserves norms}) \\ &= \left( \sum_i \sigma_i^2 (V^T w)_i^2 \right)^{1/2} \\ &\le \sigma_{max} \left( \sum_i (V^T w)_i^2 \right)^{1/2} \\ &= \sigma_{max} \left\| V^T w \right\|_2 && (\text{definition of norm}) \\ &= \sigma_{max} \|w\|_2 && (V, V^T \text{ orthonormal, preserve norms}) \end{aligned}$$

## 11 Regularized Regression

Algorithm 3: Proximal Gradient Algorithm for $\arg\min_w \|y - Xw\|_2^2 + \lambda r(w)$

Require: initial $w^{(0)}$

for $k = 0, 1, \dots$ do

• $z^{(k)} \leftarrow w^{(k)} - \tau X^T (X w^{(k)} - y)$ (gradient descent step)

• $w^{(k+1)} \leftarrow \arg\min_w \left\| z^{(k)} - w \right\|_2^2 + \lambda \tau r(w)$ (regularization step)

end for

## 11.2 LASSO (Least absolute shrinkage and selection operator)

LASSO solves the following problem

$$\hat{w}_L = \arg\min_w \|w\|_1 \quad \text{subject to} \quad \|y - Xw\| < \epsilon \tag{55}$$

which is equivalent to

$$\hat{w}_L = \arg\min_w \|y - Xw\|_2^2 + \lambda \|w\|_1 \tag{56}$$

In the figure below, the rhombuses and circles show the locus of points for which the weight vector has a particular norm (in the $L_1$ and $L_2$ norms respectively). More precisely, they are $\{w : \|w\|_1 = \tau_1\}$ and $\{w : \|w\|_2 = \tau_2\}$. The red line is $\{w : y = Xw\}$.

Figure 4: Weight vector with Lasso vs Ridge Regression

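Example 10. ($r(w) = \|w\|_1 = \sum_i |w_i|$) The regularization step of the proximal gradient algorithm becomes

$$\begin{aligned} \hat{w} &= \arg\min_w \|z - w\|_2^2 + \lambda \tau \|w\|_1 \\ &= \arg\min_w \sum_i \left[ (z_i - w_i)^2 + \lambda \tau |w_i| \right] \quad \text{(separable)} \end{aligned}$$

so each coordinate can be solved separately:

$$\hat{w}_i = \arg\min_{w_i} (z_i - w_i)^2 + \lambda \tau |w_i| \tag{57}$$

where $\lambda, \tau > 0$.

1. Case 1: $z_i > 0$. Then $\hat{w}_i \ge 0$, so

$$\hat{w}_i = \arg\min_{w_i \ge 0} (z_i - w_i)^2 + \lambda \tau w_i$$

Setting the derivative of the objective to zero,

$$\frac{d}{dw_i}(\text{obj}) = -2 (z_i - w_i) + \lambda \tau = 0 \implies w_i = z_i - \frac{\lambda \tau}{2}$$

Thus

(a) If $z_i > \frac{\lambda \tau}{2}$, then $\hat{w}_i = z_i - \frac{\lambda \tau}{2}$.

(b) If $z_i \le \frac{\lambda \tau}{2}$, then $\hat{w}_i = 0$.

In sum

$$\hat{w}_i = \left( z_i - \frac{\lambda \tau}{2} \right)_+ \tag{58}$$

2. Case 2: $z_i < 0$. Then $\hat{w}_i \le 0$. Similarly,

$$w_i = z_i + \frac{\lambda \tau}{2} \tag{59}$$

Thus

(a) If $z_i < -\frac{\lambda \tau}{2}$, then $\hat{w}_i = z_i + \frac{\lambda \tau}{2}$.

(b) If $z_i \ge -\frac{\lambda \tau}{2}$, then $\hat{w}_i = 0$.

In sum

$$\hat{w}_i = -\left( |z_i| - \frac{\lambda \tau}{2} \right)_+ \tag{60}$$

We can combine these two cases to get that

$$\hat{w}_i = \left( |z_i| - \frac{\lambda \tau}{2} \right)_+ \text{sign}(z_i) \tag{61}$$

which we call $\text{SoftThreshold}(z_i, \frac{\lambda \tau}{2})$. The figure below shows how $\hat{w}_i$ depends on $z_i$.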
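The soft-threshold operator and its use as the regularization step of the proximal gradient algorithm can be sketched as follows. The function names, the zero initialization and the iteration count are assumptions; the demonstration uses $X = I$, a case chosen because the LASSO solution is then known in closed form as $\text{SoftThreshold}(y, \lambda/2)$.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise SoftThreshold(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proximal_gradient_lasso(X, y, lam, tau, num_iters=1000):
    """Proximal gradient for min_w ||y - Xw||_2^2 + lam ||w||_1."""
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        z = w - tau * X.T @ (X @ w - y)          # gradient descent step
        w = soft_threshold(z, lam * tau / 2)     # regularization step
    return w

# With X = I the minimizer is soft_threshold(y, lam / 2)
X = np.eye(3)
y = np.array([3.0, -0.2, -2.0])
lam, tau = 1.0, 0.5

w_lasso = proximal_gradient_lasso(X, y, lam, tau)
```

Note how the small entry of `y` is set exactly to zero while the large entries are shrunk toward zero by $\lambda / 2$.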

## 12 Convexity and Support Vector Machines

## 12.1 Convexity

Example 11. (Convex function lies above tangent lines) Suppose $l(w)$ is a convex function. Then informally, $l$ lies on or above its tangent at any point $w$. More formally, for all $u$ and $w$,

$$l(u) \ge l(w) + (u - w)^T \nabla l(w) \tag{62}$$

Figure 6: A convex function lies entirely above tangent lines

## 12.2 Support Vector Machines

When we use the classification rule $\hat{y} = \text{sign}(w^T x)$, our goal is to choose a $w$ such that $y_i = \text{sign}(w^T x_i)$ as often as possible. However, when we use least squares we do not actually minimize the number of mistakes. Minimizing the number of mistakes can be written as minimizing the following sum of indicator variables:

$$\sum_{i=1}^{n} \mathbb{I}_{\{y_i \ne \text{sign}(w^T x_i)\}} \tag{64}$$

LS actually minimizes

$$\sum_{i=1}^{n} (y_i - w^T x_i)^2 \tag{65}$$

We want to choose a convex function that mimics the ideal loss. We will use hinge loss, which is defined by

$$l(w) = \sum_{i=1}^{n} (1 - y_i x_i^T w)_+ \tag{66}$$

where

$$(a)_+ = \begin{cases} a & a > 0 \\ 0 & \text{otherwise} \end{cases} \tag{67}$$

Definition 10. (Support Vector Machine) If we minimize

$$\sum_{i=1}^{n} (1 - y_i x_i^T w)_+ + \lambda \|w\|_2^2 \tag{68}$$

this is called a support vector machine.

In the figure below, the black line is the ideal loss function (not convex). The green line
is the squared loss function (LS), the blue line is hinge loss, and the red line is log loss.
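The losses compared above can be evaluated side by side. This sketch defines the mistake count (64), the squared loss (65), the hinge loss (66) and the SVM objective (68) as small NumPy functions; the function names, the toy data and the weight vectors are assumptions made for illustration.

```python
import numpy as np

def zero_one_loss(w, X, y):
    """Number of mistakes: count of i with y_i != sign(w^T x_i)."""
    return int(np.sum(np.sign(X @ w) != y))

def squared_loss(w, X, y):
    """Least squares loss: sum of (y_i - w^T x_i)^2."""
    return float(np.sum((y - X @ w) ** 2))

def hinge_loss(w, X, y):
    """Hinge loss: sum of (1 - y_i x_i^T w)_+."""
    return float(np.sum(np.maximum(1.0 - y * (X @ w), 0.0)))

def svm_objective(w, X, y, lam):
    """Hinge loss plus the ridge penalty lam * ||w||_2^2."""
    return hinge_loss(w, X, y) + lam * float(w @ w)

X = np.array([[2.0, 1.0], [1.0, -1.0], [-1.0, -2.0], [-0.5, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w_good = np.array([1.0, 0.0])    # classifies every point correctly
w_bad = np.array([-1.0, 0.0])    # classifies every point incorrectly
```

For `w_good` the squared loss still penalizes the first point for being "too correct" ($w^T x = 2$ versus $y = 1$), while the hinge loss only penalizes the last point, whose margin is below 1. The hinge loss is also never smaller than the number of mistakes.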
