
Problem Solutions

Chapter 4
Pierre Paquay

Problem 4.1

Below we plot the monomials of order $i$, $\phi_i(x) = x^i$.

[Figure: the monomials $\phi_i(x) = x^i$ for $i = 0, \dots, 5$ on $[-1, 1]$.]
It is easy to see that as the order i increases, so does the complexity of the curve (in the sense that it is able
to fit more complex target functions).
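A minimal plotting sketch that reproduces such a figure is given below; it assumes the ggplot2 package (which the later chunks also use), and the object names are purely illustrative.

library(ggplot2)

# Evaluate phi_i(x) = x^i on a grid for i = 0, ..., 5 and plot one curve per order.
x <- seq(-1, 1, by = 0.01)
df <- do.call(rbind, lapply(0:5, function(i) {
  data.frame(x = x, Phi = x^i, Order = paste0("i = ", i))
}))

ggplot(df, aes(x = x, y = Phi, colour = Order)) +
  geom_line() +
  labs(x = "x", y = "Phi(x)", colour = "Order:")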

Problem 4.2

We may write
$$h(x) = \begin{pmatrix} 1 & -1 & 1 \end{pmatrix} \begin{pmatrix} L_0(x) \\ L_1(x) \\ L_2(x) \end{pmatrix} = L_0(x) - L_1(x) + L_2(x) = \frac{3}{2}x^2 - x + \frac{1}{2}.$$

So we get a degree 2 polynomial.
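As a quick sanity check (not part of the original derivation), we can verify this identity numerically using the explicit forms $L_0(x) = 1$, $L_1(x) = x$ and $L_2(x) = (3x^2 - 1)/2$; the snippet below is a minimal sketch.

# Numerical check that L0(x) - L1(x) + L2(x) = (3/2) x^2 - x + 1/2 on a grid.
x <- seq(-1, 1, by = 0.01)
h <- 1 - x + (3 * x^2 - 1) / 2
max(abs(h - (1.5 * x^2 - x + 0.5)))   # should be numerically zero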

Problem 4.3

(a) We use the recursive definition of the Legendre polynomials to develop an algorithm to compute Lk (x)
given x.

Legendre <- function(x, k) {
if (k == 0)
return(1)
if (k == 1)
return(x)
else
return(((2 * k - 1) / k) * x * Legendre(x, k - 1) - ((k - 1) / k) * Legendre(x, k - 2))
}

Now we plot the first six Legendre polynomials below.
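A possible plotting sketch is shown next (assuming ggplot2; since the recursive Legendre function above is not vectorized over x, we wrap it in sapply). The names are illustrative.

library(ggplot2)

# Plot L_k(x) for k = 0, ..., 5 using the recursive Legendre() function above.
x <- seq(-1, 1, by = 0.01)
df <- do.call(rbind, lapply(0:5, function(k) {
  data.frame(x = x, L = sapply(x, Legendre, k = k), Order = paste0("k = ", k))
}))

ggplot(df, aes(x = x, y = L, colour = Order)) +
  geom_line() +
  labs(x = "x", y = "L_k(x)", colour = "Order:")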

[Figure: the first six Legendre polynomials $L_k(x)$, $k = 0, \dots, 5$, on $[-1, 1]$.]
(b) We prove this fact by induction. For $k = 0$, we have $L_0(x) = 1$, which is a monomial of order 0. For $k = 1$, we have $L_1(x) = x$, which is a monomial of order 1. Now we assume that the result is true for all orders less than $k + 2$, and we prove it is still true for order $k + 2$. We also assume that $k$ is even (the case where it is odd is proved in the same way). Writing $L_{k+1}(x) = a_{k+1}x^{k+1} + a_{k-1}x^{k-1} + \cdots + a_1 x$ and $L_k(x) = b_k x^k + b_{k-2}x^{k-2} + \cdots + b_0$, and letting $c_1 = \frac{2k+3}{k+2}$ and $c_0 = \frac{k+1}{k+2}$, we have
$$L_{k+2}(x) = \frac{2k+3}{k+2}\, x\, L_{k+1}(x) - \frac{k+1}{k+2}\, L_k(x) = c_1 a_{k+1} x^{k+2} + (c_1 a_{k-1} - c_0 b_k)x^k + \cdots + (c_1 a_1 - c_0 b_2)x^2 - c_0 b_0,$$
which is a linear combination of monomials of even order with highest order $k + 2$. In this case we obviously have
$$L_k(-x) = (-1)^k L_k(x).$$

(c) Once again we proceed by induction on k. For k = 1, we have


$$\frac{x^2 - 1}{1}\,\underbrace{\frac{dL_1(x)}{dx}}_{=1} = x^2 - 1 = x L_1(x) - L_0(x).$$

Now we assume that the result is true for all orders less than $k$, and we prove it is still true for $k$. Using the recursion $L_k(x) = \frac{2k-1}{k} x L_{k-1}(x) - \frac{k-1}{k} L_{k-2}(x)$ and the induction hypothesis, we have

$$\begin{aligned}
\frac{x^2 - 1}{k} \frac{dL_k(x)}{dx}
&= \frac{x^2 - 1}{k}\left(\frac{2k-1}{k} L_{k-1}(x) + \frac{(2k-1)x}{k} \frac{dL_{k-1}(x)}{dx} - \frac{k-1}{k} \frac{dL_{k-2}(x)}{dx}\right) \\
&= \frac{(x^2-1)(2k-1)}{k^2} L_{k-1}(x) + \frac{(2k-1)(k-1)x}{k^2} \underbrace{\frac{x^2-1}{k-1} \frac{dL_{k-1}(x)}{dx}}_{= x L_{k-1}(x) - L_{k-2}(x)} - \frac{(k-1)(k-2)}{k^2} \underbrace{\frac{x^2-1}{k-2} \frac{dL_{k-2}(x)}{dx}}_{= x L_{k-2}(x) - L_{k-3}(x)} \\
&= \frac{(2k-1)(kx^2-1)}{k^2} L_{k-1}(x) - \frac{(k-1)(3kx - 3x)}{k^2} L_{k-2}(x) + \frac{(k-1)(k-2)}{k^2} L_{k-3}(x) \\
&= x \underbrace{\left(\frac{2k-1}{k} x L_{k-1}(x) - \frac{k-1}{k} L_{k-2}(x)\right)}_{= L_k(x)} - \frac{2k-1}{k^2} L_{k-1}(x) - \frac{(k-1)^2}{k^2} \underbrace{\left(\frac{2k-3}{k-1} x L_{k-2}(x) - \frac{k-2}{k-1} L_{k-3}(x)\right)}_{= L_{k-1}(x)} \\
&= x L_k(x) - \frac{(2k-1) + (k-1)^2}{k^2} L_{k-1}(x) \\
&= x L_k(x) - L_{k-1}(x).
\end{aligned}$$

(d) We may write that

$$\begin{aligned}
\frac{d}{dx}\left((x^2 - 1)\frac{dL_k(x)}{dx}\right)
&= \frac{d}{dx}\left(k x L_k(x) - k L_{k-1}(x)\right) \\
&= k L_k(x) + kx \frac{dL_k(x)}{dx} - k \frac{dL_{k-1}(x)}{dx} \\
&= k L_k(x) + \frac{k^2 x^2}{x^2 - 1} L_k(x) - \frac{k^2 x}{x^2 - 1} L_{k-1}(x) - \frac{k(k-1)x}{x^2 - 1} L_{k-1}(x) + \frac{k(k-1)}{x^2 - 1} L_{k-2}(x) \\
&= \frac{kx^2 - k + k^2 x^2}{x^2 - 1} L_k(x) - \frac{k}{x^2 - 1}\left[(2k-1)x L_{k-1}(x) - (k-1)L_{k-2}(x)\right] \\
&= \frac{kx^2 - k + k^2 x^2}{x^2 - 1} L_k(x) - \frac{k^2}{x^2 - 1} L_k(x) \\
&= \frac{k}{x^2 - 1}\left[(x^2 - 1) + kx^2 - k\right] L_k(x) \\
&= k(k + 1) L_k(x),
\end{aligned}$$

where we used part (c) in the first and third lines and the recursion $k L_k(x) = (2k-1)x L_{k-1}(x) - (k-1)L_{k-2}(x)$ in the fifth.

(e) We first consider the case where $l \neq k$. We have that
$$\frac{d}{dx}\left((1 - x^2)\frac{dL_k(x)}{dx}\right) + k(k+1)L_k(x) = 0$$
and
$$\frac{d}{dx}\left((1 - x^2)\frac{dL_l(x)}{dx}\right) + l(l+1)L_l(x) = 0.$$
Now we multiply the first identity by $L_l(x)$ and the second by $L_k(x)$; if we subtract and integrate the two identities obtained, we get
$$\int_{-1}^{1}\left[L_l(x)\frac{d}{dx}\left((1 - x^2)\frac{dL_k(x)}{dx}\right) - L_k(x)\frac{d}{dx}\left((1 - x^2)\frac{dL_l(x)}{dx}\right)\right]dx + [k(k+1) - l(l+1)]\int_{-1}^{1} L_k(x)L_l(x)\,dx = 0.$$
Using integration by parts for the first integral, we get
$$\underbrace{\left.L_l(x)(1 - x^2)\frac{dL_k(x)}{dx}\right|_{-1}^{1}}_{=0} - \underbrace{\left.L_k(x)(1 - x^2)\frac{dL_l(x)}{dx}\right|_{-1}^{1}}_{=0} - \underbrace{\int_{-1}^{1}\left[(1 - x^2)\frac{dL_l(x)}{dx}\frac{dL_k(x)}{dx} - (1 - x^2)\frac{dL_k(x)}{dx}\frac{dL_l(x)}{dx}\right]dx}_{=0} = 0.$$
Since $k(k+1) - l(l+1) \neq 0$, we finally obtain
$$\int_{-1}^{1} L_k(x)L_l(x)\,dx = 0.$$

Now, we consider the case where l = k. We have that

$$\begin{aligned}
A_k = \int_{-1}^{1} L_k^2(x)\,dx
&= \frac{2k-1}{k}\int_{-1}^{1} x L_k(x)L_{k-1}(x)\,dx - \frac{k-1}{k}\underbrace{\int_{-1}^{1} L_k(x)L_{k-2}(x)\,dx}_{=0} \\
&= \frac{(2k-1)(k+1)}{k(2k+1)}\underbrace{\int_{-1}^{1} L_{k+1}(x)L_{k-1}(x)\,dx}_{=0} + \frac{(2k-1)k}{k(2k+1)}\int_{-1}^{1} L_{k-1}^2(x)\,dx \\
&= \frac{2k-1}{2k+1}\int_{-1}^{1} L_{k-1}^2(x)\,dx,
\end{aligned}$$

where we used the recursion in the first line and the identity $x L_k(x) = \frac{(k+1)L_{k+1}(x) + k L_{k-1}(x)}{2k+1}$ in the second. Finally, we are able to obtain that

$$A_k = \frac{2k-1}{2k+1} A_{k-1} = \frac{2k-1}{2k+1}\cdot\frac{2k-3}{2k-1} A_{k-2} = \frac{2k-1}{2k+1}\cdot\frac{2k-3}{2k-1}\cdots\frac{3}{5}\cdot\frac{1}{3}\underbrace{A_0}_{=2} = \frac{2}{2k+1}.$$
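These orthogonality relations are easy to confirm numerically with the recursive Legendre function from part (a) and R's integrate; the sketch below is illustrative only.

# Numerical check: int_{-1}^{1} L_k L_l dx is 0 for k != l and 2 / (2k + 1) for k = l.
inner_prod <- function(k, l) {
  integrate(function(x) sapply(x, function(u) Legendre(u, k) * Legendre(u, l)),
            lower = -1, upper = 1)$value
}

inner_prod(2, 3)      # approximately 0
inner_prod(4, 4)      # approximately 2 / (2 * 4 + 1)
2 / (2 * 4 + 1)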

Problem 4.4

The following code is an implementation of the experimental framework used to study various aspects of
overfitting.
Legendre2 <- function(x, q) {
vec <- rep(NA, q + 1)
for (k in 0:q) {
vec[k + 1] <- (choose(q, k))^2 * (x - 1)^(q - k) * (x + 1)^k / 2^q
}

return(sum(vec))
}

f <- function(x, Qf, aq) {


Lq <- rep(0, Qf + 1)
for (k in 0:Qf) {

Lq[k + 1] <- Legendre2(x, k)
}

return(sum(aq * Lq))
}
f <- Vectorize(f, vectorize.args = "x")

experiment <- function(Qf, N, sigma, Ntest) {


aq <- rnorm(Qf + 1)
norm <- rep(0, Qf + 1)
for (q in 0:Qf)
norm[q + 1] <- 1 / (2 * q + 1)
norm_fac <- 1 / sqrt(sum(norm))
aq <- norm_fac * aq

xn <- runif(N, min = -1, max = 1)


eps <- rnorm(N)
yn <- f(xn, Qf, aq) + sigma * eps
D <- data.frame(x = xn, y = yn)

y <- D$y
D2 <- data.frame(x = D$x, x_sq = D$x^2)
Z2 <- as.matrix(cbind(1, D2))
Z2_cross <- solve(t(Z2) %*% Z2) %*% t(Z2)
w2 <- as.vector(Z2_cross %*% y)
D10 <- data.frame(x = D$x, x_sq = D$x^2, x_cub = D$x^3, x_quad = D$x^4,
x_quint = D$x^5, x_six = D$x^6, x_seven = D$x^7,
x_eight = D$x^8, x_nine = D$x^9, x_ten = D$x^10)
Z10 <- as.matrix(cbind(1, D10))
Z10_cross <- solve(t(Z10) %*% Z10) %*% t(Z10)
w10 <- as.vector(Z10_cross %*% y)

x <- runif(Ntest, min = -1, max = 1)


eps <- rnorm(Ntest)
y <- f(x, Qf, aq) + sigma * eps
Dtest <- data.frame(x = x, y = y)
Eout2 <- mean((as.matrix(cbind(1, Dtest$x, Dtest$x^2)) %*% w2 - Dtest$y)^2)
Eout10 <- mean((as.matrix(cbind(1, Dtest$x, Dtest$x^2, Dtest$x^3, Dtest$x^4,
Dtest$x^5, Dtest$x^6, Dtest$x^7, Dtest$x^8,
Dtest$x^9, Dtest$x^10)) %*% w10 - Dtest$y)^2)

return(c(Eout2, Eout10))
}

(a) To normalize $f$, we compute $\mathbb{E}_{a,x}[f^2]$ as follows,
$$\mathbb{E}_{a,x}[f^2] = \mathbb{E}_x[\mathbb{E}_{a|x}[f^2 \,|\, x]] = \mathbb{E}_x[\underbrace{\operatorname{Var}_{a|x}[f]}_{= \sum_q L_q^2(x)\operatorname{Var}[a_q] = \sum_q L_q^2(x)} + (\underbrace{\mathbb{E}_{a|x}[f]}_{= \sum_q L_q(x)\,\mathbb{E}[a_q] = 0})^2] = \sum_{q=0}^{Q_f}\mathbb{E}_x[L_q^2(x)].$$
Moreover, we may write that
$$\mathbb{E}_x[L_q^2(x)] = \frac{1}{2}\int_{-1}^{1} L_q^2(x)\,dx = \frac{1}{2q+1},$$
with which we can conclude that
$$\mathbb{E}_{a,x}[f^2] = \sum_{q=0}^{Q_f}\frac{1}{2q+1}.$$
This means that, to normalize $f$, we have to multiply each coefficient $a_q$ by the constant factor $1/\sqrt{\sum_q \frac{1}{2q+1}}$. Obviously, if the signal $f$ is normalized to $\mathbb{E}[f^2] = 1$, this implies that the noise level $\sigma^2$ is automatically calibrated to the signal level.
(b) To obtain $g_2$ and $g_{10}$, we first transform the original data $x \in \mathcal{X}$ with a second (resp. tenth) order transformation $z = \Phi_2(x) \in \mathcal{Z}_2$ (resp. $z = \Phi_{10}(x) \in \mathcal{Z}_{10}$). Then, we find the best linear fit for the data in $\mathcal{Z}_2$-space (resp. $\mathcal{Z}_{10}$-space) to obtain $\tilde{g}_2(z) = \tilde{w}^T z$ (resp. $\tilde{g}_{10}(z) = \tilde{w}^T z$). Finally, we get the best fit in $\mathcal{X}$-space,
$$g_2(x) = \tilde{g}_2(\Phi_2(x)) = \tilde{w}^T\Phi_2(x) \quad (\text{resp. } g_{10}(x) = \tilde{g}_{10}(\Phi_{10}(x)) = \tilde{w}^T\Phi_{10}(x)).$$

(c) To compute $E_{out}$ analytically for a given $g_{10}$, we have to compute
$$E_{out}(g_{10}) = \mathbb{E}_{x,y}[(g_{10}(x) - y)^2] = \mathbb{E}_{x,\epsilon}[(g_{10}(x) - f(x) - \sigma\epsilon)^2] = \mathbb{E}_x[(g_{10}(x) - f(x))^2] + \sigma^2.$$

(d) Below we plot the extent of overfitting depending on certain parameters of the learning problem. In the
first plot, we fix Qf = 20 to study the stochastic noise.
# Grid search with Qf = 20 (this and the later chunks assume that foreach/doParallel and
# ggplot2 are loaded and a parallel backend is registered, e.g. registerDoParallel()).
Nexp <- 1000
grid <- expand.grid(N = seq(20, 120, by = 5), sigma_sq = seq(0, 2, by = 0.05))
E_out_Overfit <- foreach(i = 1:nrow(grid), .combine = "rbind") %dopar% {
set.seed(1975)
Eout_H2 <- numeric(Nexp)
Eout_H10 <- numeric(Nexp)
for (n in 1:Nexp) {
tmp <- experiment(Qf = 20, grid$N[i], sqrt(grid$sigma_sq[i]), Ntest = 100)
Eout_H2[n] <- tmp[1]
Eout_H10[n] <- tmp[2]
}
c(mean(Eout_H2), mean(Eout_H10))
}
Eout <- cbind(grid, E_out_Overfit)
colnames(Eout) <- c("N", "sigma_sq", "Eout_H2", "Eout_H10")
Eout["Overfit"] <- Eout$Eout_H10 - Eout$Eout_H2
Eout$Overfit <- ifelse(Eout$Overfit > 0.2, 0.2, Eout$Overfit)

Eout$Overfit <- ifelse(Eout$Overfit < -0.2, -0.2, Eout$Overfit)

ggplot(Eout, aes(N, sigma_sq, fill = Overfit)) + geom_raster(interpolate = TRUE) +


xlab("N") + ylab("Sigma_sq") +
scale_fill_gradient2(low = "blue", mid = "green", high = "red")

[Figure: heat map of the overfit measure (Eout_H10 − Eout_H2) over N and σ² for Q_f = 20.]
In the second plot, we fix $\sigma^2 = 0.1$ to study the deterministic noise.
# grid search with sigma_sq = 0.1
Nexp <- 200
grid <- expand.grid(Qf = seq(1, 80, by = 1), N = seq(20, 120, by = 5))
E_out_Overfit <- foreach(i = 1:nrow(grid), .combine = "rbind") %dopar% {
set.seed(1975)
Eout_H2 <- numeric(Nexp)
Eout_H10 <- numeric(Nexp)
for (n in 1:Nexp) {
tmp <- experiment(grid$Qf[i], grid$N[i], sqrt(0.1), Ntest = 10)
Eout_H2[n] <- tmp[1]
Eout_H10[n] <- tmp[2]
}
c(mean(Eout_H2), mean(Eout_H10))
}
Eout <- cbind(grid, E_out_Overfit)
colnames(Eout) <- c("Qf", "N", "Eout_H2", "Eout_H10")
Eout["Overfit"] <- Eout$Eout_H10 - Eout$Eout_H2
Eout$Overfit <- ifelse(Eout$Overfit > 0.2, 0.2, Eout$Overfit)
Eout$Overfit <- ifelse(Eout$Overfit < -0.2, -0.2, Eout$Overfit)

ggplot(Eout, aes(N, Qf, fill = Overfit)) + geom_raster(interpolate = TRUE) +


xlab("N") + ylab("Q_f") +
scale_fill_gradient2(low = "blue", mid = "green", high = "red")

[Figure: heat map of the overfit measure (Eout_H10 − Eout_H2) over N and Q_f for σ² = 0.1.]
(e) We take the average over many experiments because we want estimates of the expected out-of-sample
error for a given learning scenario (Qf , N, σ) using H2 and H10 .

Problem 4.5

If we consider the following constrained optimization problem
$$\min_w E_{in}(w) \quad \text{subject to} \quad w^T w \geq C,$$
the theory of Lagrange multipliers tells us that this problem is equivalent to the following unconstrained optimization problem
$$\min_w \left(E_{in}(w) - \lambda'_C\, w^T w\right), \qquad \lambda'_C \geq 0.$$
If we let $\lambda_C = -\lambda'_C$, we get that the original constrained optimization problem is equivalent to minimizing the augmented error
$$E_{aug}(w) = E_{in}(w) + \lambda_C\, w^T w, \qquad \lambda_C \leq 0.$$
So we may conclude that the soft order constraint corresponding to a negative regularization parameter $\lambda_C \leq 0$ is $w^T w \geq C$.

Problem 4.6

(a) We begin by noting that
$$E_{in}(w_{reg}) = \frac{(w_{reg} - w_{lin})^T Z^T Z (w_{reg} - w_{lin}) + y^T(I - H)y}{N} \geq \frac{y^T(I - H)y}{N} = E_{in}(w_{lin}).$$
Now suppose that $\|w_{reg}\| > \|w_{lin}\|$; in this case we may write that
$$E_{aug}(w_{reg}) = E_{in}(w_{reg}) + \lambda\|w_{reg}\|^2 > E_{in}(w_{lin}) + \lambda\|w_{lin}\|^2 = E_{aug}(w_{lin}),$$
which is not possible since $w_{reg} = \operatorname{argmin}_w E_{aug}(w)$. So we may conclude that $\|w_{reg}\| \leq \|w_{lin}\|$.
(b) First, we note that if $v_i$ is an eigenvector of a matrix $A$ with eigenvalue $\lambda_i \neq 0$, then $A v_i = \lambda_i v_i$, and consequently
$$v_i = \lambda_i A^{-1} v_i \;\Leftrightarrow\; A^{-1} v_i = \frac{1}{\lambda_i} v_i \;\Rightarrow\; A^{-2} v_i = \frac{1}{\lambda_i^2} v_i,$$
which means that $v_i$ is also an eigenvector of $A^{-2}$ with eigenvalue $1/\lambda_i^2$.

Now, let $v_i$ be the orthogonal eigenvectors of the non-zero eigenvalues $\lambda_i$ of $Z^T Z$ (since $Z^T Z$ is invertible and symmetric). We have that
$$\|w_{reg}\|^2 = y^T Z (Z^T Z + \lambda I)^{-2} Z^T y = u^T (Z^T Z + \lambda I)^{-2} u$$
and
$$\|w_{lin}\|^2 = y^T Z (Z^T Z)^{-2} Z^T y = u^T (Z^T Z)^{-2} u,$$
where $u = Z^T y$. If we let $V = (v_0, \cdots, v_Q)$ be the orthogonal matrix of eigenvectors, we get
$$V^T Z^T Z V = \operatorname{diag}(\lambda_i)$$
and
$$V^T (Z^T Z + \lambda I) V = V^T Z^T Z V + \lambda V^T V = \operatorname{diag}(\lambda_i + \lambda).$$
If we expand $u$ in the eigenbasis of $Z^T Z$, we get $u = \sum_i \alpha_i v_i$ and
$$\begin{aligned}
\|w_{reg}\|^2 &= \sum_{i,j} \alpha_i \alpha_j v_i^T (Z^T Z + \lambda I)^{-2} v_j \\
&= \sum_{i,j} \alpha_i \alpha_j \frac{1}{(\lambda_j + \lambda)^2} v_i^T v_j \\
&= \sum_i \frac{\alpha_i^2}{(\lambda_i + \lambda)^2} \\
&\leq \sum_i \frac{\alpha_i^2}{\lambda_i^2} = \sum_{i,j} \alpha_i \alpha_j v_i^T (Z^T Z)^{-2} v_j = \|w_{lin}\|^2;
\end{aligned}$$
for the above inequality to be true, we have to note that since $Z^T Z$ is (at least) positive semi-definite, its eigenvalues are non-negative.
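The conclusion $\|w_{reg}\| \leq \|w_{lin}\|$ is also easy to illustrate numerically; the sketch below builds a random regression problem and compares the two norms over a range of $\lambda \geq 0$ (all names and sizes are illustrative).

set.seed(1)

# Random data in a (d + 1)-dimensional Z-space.
N <- 50; d <- 5
Z <- cbind(1, matrix(rnorm(N * d), nrow = N))
y <- rnorm(N)

w_lin <- solve(t(Z) %*% Z) %*% t(Z) %*% y
norm_lin <- sqrt(sum(w_lin^2))

# ||w_reg|| never exceeds ||w_lin|| and shrinks as lambda grows.
for (lambda in c(0, 0.1, 1, 10)) {
  w_reg <- solve(t(Z) %*% Z + lambda * diag(d + 1)) %*% t(Z) %*% y
  cat("lambda =", lambda, " ||w_reg|| =", sqrt(sum(w_reg^2)),
      " ||w_lin|| =", norm_lin, "\n")
}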

Problem 4.7

Here, for our $(N \times d)$ matrix $Z$, we assume that $N > d$; in this case $U$ is an $(N \times d)$ orthogonal matrix, $\Gamma$ is a $(d \times d)$ diagonal matrix and $V$ is a $(d \times d)$ orthogonal matrix. We begin by noting that
$$Z^T Z = V \Gamma U^T U \Gamma V^T = V \Gamma^2 V^T.$$

Let us first consider the vector $Hy$; we have
$$Hy = Z(Z^T Z)^{-1} Z^T y = U \Gamma V^T (V^T)^{-1} \Gamma^{-2} V^{-1} V \Gamma U^T y = U U^T y;$$
moreover, we also have for $H(\lambda)y$ that
$$\begin{aligned}
H(\lambda)y &= Z(Z^T Z + \lambda I)^{-1} Z^T y \\
&= U \Gamma V^T (V \Gamma^2 V^T + \lambda I)^{-1} V \Gamma U^T y \\
&= U \Gamma V^T [V \underbrace{(\Gamma^2 + \lambda I)}_{=\operatorname{diag}(\sigma_i^2 + \lambda)} V^T]^{-1} V \Gamma U^T y \\
&= U \Gamma V^T (V^T)^{-1} \operatorname{diag}\!\left(\frac{1}{\sigma_i^2 + \lambda}\right) V^{-1} V \Gamma U^T y \\
&= U \operatorname{diag}\!\left(\frac{\sigma_i^2}{\sigma_i^2 + \lambda}\right) U^T y.
\end{aligned}$$

Putting all of the above together, we get
$$(I - H(\lambda))y = (I - H)y + (H - H(\lambda))y = (I - H)y + U \operatorname{diag}\!\left(1 - \frac{\sigma_i^2}{\sigma_i^2 + \lambda}\right) U^T y,$$
and consequently, writing $a = U^T y$,
$$\begin{aligned}
E_{in}(w_{reg}) &= \frac{1}{N} y^T (I - H(\lambda))^2 y \\
&= \frac{1}{N} y^T (I - H(\lambda))^T (I - H(\lambda)) y \\
&= \frac{1}{N}\left[y^T (I - H)y + y^T U \operatorname{diag}\!\left(1 - \frac{\sigma_i^2}{\sigma_i^2 + \lambda}\right)^2 U^T y + 2 y^T \underbrace{(I - H)U}_{= U - U U^T U = 0} \operatorname{diag}\!\left(1 - \frac{\sigma_i^2}{\sigma_i^2 + \lambda}\right) U^T y\right] \\
&= E_{in}(w_{lin}) + \frac{1}{N} \sum_i a_i^2 \left(1 - \frac{\sigma_i^2}{\sigma_i^2 + \lambda}\right)^2.
\end{aligned}$$

Problem 4.8

First, we compute $\nabla E_{aug}(w)$; we immediately have
$$\nabla E_{aug}(w) = \nabla E_{in}(w) + 2\lambda w.$$
So the gradient descent update rule becomes
$$w(t+1) \leftarrow w(t) - \eta \nabla E_{aug}(w(t)) = (1 - 2\eta\lambda)w(t) - \eta \nabla E_{in}(w(t)).$$
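A minimal sketch of this update rule on a toy objective (ordinary least squares as $E_{in}$, with illustrative step size and $\lambda$) is given below; for reference, it also compares the result with the closed-form ridge solution.

set.seed(1)

# Toy data and in-sample error E_in(w) = ||Xw - y||^2 / N.
N <- 100; d <- 3
X <- cbind(1, matrix(rnorm(N * d), nrow = N))
y <- rnorm(N)
grad_Ein <- function(w) 2 * t(X) %*% (X %*% w - y) / N

# Weight-decay gradient descent: w <- (1 - 2 * eta * lambda) * w - eta * grad E_in(w).
eta <- 0.05; lambda <- 0.1
w <- rep(0, d + 1)
for (t in 1:2000) {
  w <- (1 - 2 * eta * lambda) * w - eta * as.vector(grad_Ein(w))
}

# Closed-form minimizer of E_aug (the N * lambda comes from the 1/N factor in E_in).
w_ridge <- solve(t(X) %*% X + N * lambda * diag(d + 1)) %*% t(X) %*% y
cbind(gd = w, closed_form = as.vector(w_ridge))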

Problem 4.9

(a) Let $\Gamma$ be the following matrix
$$\Gamma = \begin{pmatrix} \gamma_1^T \\ \vdots \\ \gamma_k^T \end{pmatrix};$$
now we construct a virtual example $(z_i, 0)$ where $z_i = \sqrt{\lambda}\,\gamma_i$ for $i = 1, \cdots, k$. If $\mathcal{D} = \{(z'_1, y_1), \cdots, (z'_N, y_N)\}$ denotes the original data set, this means that the matrix of the augmented data is
$$Z_{aug} = \begin{pmatrix} z_1'^T \\ \vdots \\ z_N'^T \\ z_1^T \\ \vdots \\ z_k^T \end{pmatrix} = \begin{pmatrix} Z \\ \sqrt{\lambda}\,\Gamma \end{pmatrix}$$
and
$$y_{aug} = \begin{pmatrix} y_1 \\ \vdots \\ y_N \\ 0 \\ \vdots \\ 0 \end{pmatrix} = \begin{pmatrix} y \\ 0 \end{pmatrix}.$$

(b) If we solve the least squares problem with $Z_{aug}$ and $y_{aug}$, we get
$$\begin{aligned}
w_{lin} &= (Z_{aug}^T Z_{aug})^{-1} Z_{aug}^T y_{aug} \\
&= \left[\begin{pmatrix} Z^T & \sqrt{\lambda}\,\Gamma^T \end{pmatrix} \begin{pmatrix} Z \\ \sqrt{\lambda}\,\Gamma \end{pmatrix}\right]^{-1} \begin{pmatrix} Z^T & \sqrt{\lambda}\,\Gamma^T \end{pmatrix} \begin{pmatrix} y \\ 0 \end{pmatrix} \\
&= (Z^T Z + \lambda \Gamma^T \Gamma)^{-1} Z^T y = w_{reg}.
\end{aligned}$$
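This equivalence can be checked directly: stacking the virtual examples $(\sqrt{\lambda}\,\gamma_i, 0)$ under the data and solving ordinary least squares reproduces the Tikhonov solution. The sketch below uses an illustrative random $\Gamma$.

set.seed(1)

N <- 40; d <- 4
Z <- cbind(1, matrix(rnorm(N * d), nrow = N))
y <- rnorm(N)
Gamma <- matrix(rnorm((d + 1)^2), nrow = d + 1)
lambda <- 0.7

# Tikhonov-regularized solution.
w_reg <- solve(t(Z) %*% Z + lambda * t(Gamma) %*% Gamma) %*% t(Z) %*% y

# Ordinary least squares on the augmented data (Z; sqrt(lambda) * Gamma) and (y; 0).
Z_aug <- rbind(Z, sqrt(lambda) * Gamma)
y_aug <- c(y, rep(0, d + 1))
w_aug <- solve(t(Z_aug) %*% Z_aug) %*% t(Z_aug) %*% y_aug

max(abs(w_reg - w_aug))   # should be numerically zero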

Problem 4.10
(a) If $w_{lin}^T \Gamma^T \Gamma w_{lin} \leq C$, then obviously $w_{reg} = w_{lin}$.

(b) If $w_{lin}^T \Gamma^T \Gamma w_{lin} > C$, then we have that $w_{reg}^T \Gamma^T \Gamma w_{reg} = C$ (see the book illustration).
(c) The original constrained problem is equivalent to solving the following unconstrained problem with Lagrange multipliers,
$$\min_w \underbrace{\left(E_{in}(w) - \lambda_C(-w^T \Gamma^T \Gamma w + C)\right)}_{= \mathcal{L}(w, \lambda_C)}$$
where $\lambda_C \geq 0$. We have that
$$\nabla_{w,\lambda_C} \mathcal{L}(w, \lambda_C) = \left(\nabla_w \mathcal{L}(w, \lambda_C),\; \frac{\partial}{\partial \lambda_C}\mathcal{L}(w, \lambda_C)\right)$$
where
$$\nabla_w \mathcal{L}(w, \lambda_C) = \nabla E_{in}(w) + 2\lambda_C \Gamma^T \Gamma w \quad \text{and} \quad \frac{\partial}{\partial \lambda_C}\mathcal{L}(w, \lambda_C) = w^T \Gamma^T \Gamma w - C.$$
Since $w_{reg}$ is a solution to the original constrained problem, it must also be a solution to the equivalent unconstrained problem; this means that
$$\nabla E_{in}(w_{reg}) + 2\lambda_C \Gamma^T \Gamma w_{reg} = 0 \quad \text{and} \quad w_{reg}^T \Gamma^T \Gamma w_{reg} - C = 0.$$
If we multiply the first equation by $w_{reg}^T$, we get that
$$w_{reg}^T \nabla E_{in}(w_{reg}) + 2\lambda_C \underbrace{w_{reg}^T \Gamma^T \Gamma w_{reg}}_{=C} = 0,$$
and consequently
$$\lambda_C = -\frac{1}{2C} w_{reg}^T \nabla E_{in}(w_{reg}).$$
(d) (i) If $w_{lin}^T \Gamma^T \Gamma w_{lin} \leq C$, we know that $w_{reg} = w_{lin}$, and consequently $\nabla E_{in}(w_{reg}) = 0$, which implies that $\lambda_C = 0$.

(ii) If $w_{lin}^T \Gamma^T \Gamma w_{lin} > C$, let us assume that $\lambda_C = 0$; this means that $w_{reg}$ minimizes
$$E_{in}(w) - \lambda_C(-w^T \Gamma^T \Gamma w + C) = E_{in}(w),$$
so we have $w_{reg} = w_{lin}$ and
$$w_{reg}^T \Gamma^T \Gamma w_{reg} = w_{lin}^T \Gamma^T \Gamma w_{lin} > C,$$
which is not possible since $w_{reg}^T \Gamma^T \Gamma w_{reg} \leq C$ by definition. In conclusion, we have that $\lambda_C > 0$.

(iii) As $w_{lin}^T \Gamma^T \Gamma w_{lin} > C$, we have that $\lambda_C > 0$, which means that $w_{reg}^T \nabla E_{in}(w_{reg}) < 0$. Now, if we compute the derivative relative to $C$, we get
$$\frac{d\lambda_C}{dC} = \frac{1}{2C^2} w_{reg}^T \nabla E_{in}(w_{reg}) < 0.$$

Problem 4.11

(a) We have immediately
$$w_{lin} = (Z^T Z)^{-1} Z^T y = (Z^T Z)^{-1} Z^T (Z w_f + \epsilon) = w_f + (Z^T Z)^{-1} Z^T \epsilon.$$
And so the average function $\bar{g}$ is given by
$$\begin{aligned}
\bar{g}(x) &= \mathbb{E}_{\mathcal{D}}[g^{\mathcal{D}}(x)] \\
&= \mathbb{E}_{\mathcal{D}}[\Phi(x)^T w_{lin}] \\
&= \Phi(x)^T w_f + \mathbb{E}_{\mathcal{D}}[\Phi(x)^T (Z^T Z)^{-1} Z^T \epsilon] \\
&= \Phi(x)^T w_f + \mathbb{E}_Z[\mathbb{E}_{y|Z}[\Phi(x)^T (Z^T Z)^{-1} Z^T \epsilon \,|\, Z]] \\
&= \Phi(x)^T w_f + \mathbb{E}_Z[\Phi(x)^T (Z^T Z)^{-1} Z^T \underbrace{\mathbb{E}_{y|Z}[\epsilon \,|\, Z]}_{= \mathbb{E}_\epsilon[\epsilon] = 0}] \\
&= \Phi(x)^T w_f = f(x),
\end{aligned}$$
which means that
$$\operatorname{bias}(x) = (\bar{g}(x) - f(x))^2 = 0,$$
and consequently $\operatorname{bias} = \mathbb{E}_x[\operatorname{bias}(x)] = 0$.
(b) We may write that
$$\begin{aligned}
\operatorname{var}(x) &= \mathbb{E}_{\mathcal{D}}[(g^{\mathcal{D}}(x) - \bar{g}(x))^2] \\
&= \mathbb{E}_{\mathcal{D}}[(g^{\mathcal{D}}(x) - f(x))^2] \\
&= \mathbb{E}_{\mathcal{D}}[(\Phi(x)^T (w_f + (Z^T Z)^{-1} Z^T \epsilon) - \Phi(x)^T w_f)^2] \\
&= \mathbb{E}_{\mathcal{D}}[\underbrace{\epsilon^T Z (Z^T Z)^{-1} \Phi(x) \Phi(x)^T (Z^T Z)^{-1} Z^T \epsilon}_{= \operatorname{trace}(\Phi(x)\Phi(x)^T (Z^T Z)^{-1} Z^T \epsilon \epsilon^T Z (Z^T Z)^{-1})}] \\
&= \operatorname{trace}(\mathbb{E}_Z[\mathbb{E}_{y|Z}[\Phi(x)\Phi(x)^T (Z^T Z)^{-1} Z^T \epsilon \epsilon^T Z (Z^T Z)^{-1} \,|\, Z]]) \\
&= \operatorname{trace}(\mathbb{E}_Z[\Phi(x)\Phi(x)^T (Z^T Z)^{-1} Z^T \underbrace{\mathbb{E}_{y|Z}[\epsilon\epsilon^T \,|\, Z]}_{= \mathbb{E}_\epsilon[\epsilon\epsilon^T] = \sigma^2 I} Z (Z^T Z)^{-1}]) \\
&= \sigma^2 \operatorname{trace}(\mathbb{E}_Z[\Phi(x)\Phi(x)^T (Z^T Z)^{-1}]),
\end{aligned}$$
where we have used the cyclic property of the trace. This allows us to write that
$$\begin{aligned}
\operatorname{var} &= \mathbb{E}_x[\operatorname{var}(x)] \\
&= \sigma^2 \operatorname{trace}(\mathbb{E}_Z[\mathbb{E}_x[\Phi(x)\Phi(x)^T (Z^T Z)^{-1}]]) \\
&= \sigma^2 \operatorname{trace}(\mathbb{E}_Z[\underbrace{\mathbb{E}_x[\Phi(x)\Phi(x)^T]}_{= \Sigma_\Phi} (Z^T Z)^{-1}]) \\
&= \frac{\sigma^2}{N} \operatorname{trace}\!\left(\Sigma_\Phi\, \mathbb{E}_Z\!\left[\left(\frac{1}{N} Z^T Z\right)^{-1}\right]\right).
\end{aligned}$$

(c) We know by the law of large numbers that $\frac{1}{N} Z^T Z$ converges in probability to $\Sigma_\Phi$; this implies that $(\frac{1}{N} Z^T Z)^{-1}$ converges in probability to $\Sigma_\Phi^{-1}$. With that in mind, to the first order in $1/N$, we have that
$$\operatorname{var} \approx \frac{\sigma^2}{N} \operatorname{trace}(\Sigma_\Phi \Sigma_\Phi^{-1}) = \frac{\sigma^2 (Q + 1)}{N}.$$
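A rough Monte Carlo sketch of these two results is given below; it uses plain monomial features and illustrative values of $Q$, $N$ and $\sigma$ (the formula $\operatorname{var} \approx \sigma^2(Q+1)/N$ does not require $\Sigma_\Phi = I$, since $\operatorname{trace}(\Sigma_\Phi \Sigma_\Phi^{-1}) = Q + 1$ in any case). The estimated bias should be close to 0 and the estimated variance close to $\sigma^2(Q+1)/N$.

set.seed(1)

# Monte Carlo bias/variance of unregularized polynomial regression on a polynomial target.
Q <- 3; N <- 100; sigma <- 0.5; Nruns <- 2000
w_f <- rnorm(Q + 1)
phi <- function(x) outer(x, 0:Q, "^")     # monomial transform, one row per point

x_test <- runif(500, -1, 1)
preds <- matrix(NA, nrow = Nruns, ncol = length(x_test))
for (r in 1:Nruns) {
  x <- runif(N, -1, 1)
  Z <- phi(x)
  y <- Z %*% w_f + sigma * rnorm(N)
  w <- solve(t(Z) %*% Z) %*% t(Z) %*% y
  preds[r, ] <- phi(x_test) %*% w
}

g_bar <- colMeans(preds)
f_test <- as.vector(phi(x_test) %*% w_f)
c(bias = mean((g_bar - f_test)^2),
  var = mean(apply(preds, 2, var)),
  theory_var = sigma^2 * (Q + 1) / N)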

Problem 4.12

(a) We may write that
$$\begin{aligned}
w_{reg} &= (Z^T Z + \lambda I)^{-1} Z^T (Z w_f + \epsilon) \\
&= (Z^T Z + \lambda I)^{-1} [(Z^T Z w_f + \lambda w_f) - \lambda w_f] + (Z^T Z + \lambda I)^{-1} Z^T \epsilon \\
&= w_f - \lambda (Z^T Z + \lambda I)^{-1} w_f + (Z^T Z + \lambda I)^{-1} Z^T \epsilon.
\end{aligned}$$

(b) The average function $\bar{g}$ is given by
$$\begin{aligned}
\bar{g}(x) &= \mathbb{E}_{\mathcal{D}}[g^{\mathcal{D}}(x)] \\
&= \mathbb{E}_{\mathcal{D}}[\Phi(x)^T w_{reg}] \\
&= \mathbb{E}_{\mathcal{D}}[\Phi(x)^T (w_f - \lambda (Z^T Z + \lambda I)^{-1} w_f + (Z^T Z + \lambda I)^{-1} Z^T \epsilon)] \\
&= \mathbb{E}_Z[\Phi(x)^T w_f - \lambda \Phi(x)^T (Z^T Z + \lambda I)^{-1} w_f + \Phi(x)^T (Z^T Z + \lambda I)^{-1} Z^T \underbrace{\mathbb{E}_{y|Z}[\epsilon \,|\, Z]}_{=0}] \\
&= \Phi(x)^T w_f - \lambda \Phi(x)^T \mathbb{E}_Z[(Z^T Z + \lambda I)^{-1}] w_f.
\end{aligned}$$
Thus, thanks to the cyclic property of the trace, $\operatorname{bias}(x)$ is equal to
$$\begin{aligned}
\operatorname{bias}(x) &= (\bar{g}(x) - f(x))^2 \\
&= \lambda^2 w_f^T \mathbb{E}_Z[(Z^T Z + \lambda I)^{-1}] \Phi(x)\Phi(x)^T \mathbb{E}_Z[(Z^T Z + \lambda I)^{-1}] w_f \\
&= \lambda^2 \operatorname{trace}(\Phi(x)\Phi(x)^T \mathbb{E}_Z[(Z^T Z + \lambda I)^{-1}] w_f w_f^T \mathbb{E}_Z[(Z^T Z + \lambda I)^{-1}]);
\end{aligned}$$
consequently, we have that
$$\begin{aligned}
\operatorname{bias} &= \mathbb{E}_x[\operatorname{bias}(x)] \\
&= \lambda^2 \operatorname{trace}(\underbrace{\mathbb{E}_x[\Phi(x)\Phi(x)^T]}_{=I} \mathbb{E}_Z[(Z^T Z + \lambda I)^{-1}] w_f w_f^T \mathbb{E}_Z[(Z^T Z + \lambda I)^{-1}]) \\
&= \lambda^2 \operatorname{trace}(\underbrace{\mathbb{E}_Z[(Z^T Z + \lambda I)^{-1}]}_{\approx \frac{1}{N+\lambda} I} w_f w_f^T \underbrace{\mathbb{E}_Z[(Z^T Z + \lambda I)^{-1}]}_{\approx \frac{1}{N+\lambda} I}) \\
&\approx \frac{\lambda^2}{(N + \lambda)^2} \underbrace{\operatorname{trace}(w_f w_f^T)}_{= \operatorname{trace}(w_f^T w_f) = \|w_f\|^2} \\
&\approx \frac{\lambda^2}{(N + \lambda)^2} \|w_f\|^2,
\end{aligned}$$
since $Z^T Z \approx N \Sigma_\Phi = N I$.
Now, if we compute $\operatorname{var}(x)$, we get
$$\begin{aligned}
\operatorname{var}(x) &= \mathbb{E}_{\mathcal{D}}[(g^{\mathcal{D}}(x) - \bar{g}(x))^2] \\
&= \mathbb{E}_{\mathcal{D}}[(\lambda \Phi(x)^T (\underbrace{\mathbb{E}_Z[(Z^T Z + \lambda I)^{-1}]}_{\approx \frac{1}{N+\lambda} I} - \underbrace{(Z^T Z + \lambda I)^{-1}}_{\approx \frac{1}{N+\lambda} I}) w_f + \Phi(x)^T (Z^T Z + \lambda I)^{-1} Z^T \epsilon)^2] \\
&\approx \mathbb{E}_{\mathcal{D}}[\epsilon^T Z (Z^T Z + \lambda I)^{-1} \Phi(x)\Phi(x)^T (Z^T Z + \lambda I)^{-1} Z^T \epsilon] \\
&\approx \mathbb{E}_Z[\operatorname{trace}(\underbrace{\mathbb{E}_{y|Z}[\epsilon\epsilon^T \,|\, Z]}_{= \sigma^2 I} Z (Z^T Z + \lambda I)^{-1} \Phi(x)\Phi(x)^T (Z^T Z + \lambda I)^{-1} Z^T)] \\
&\approx \sigma^2 \mathbb{E}_Z[\operatorname{trace}(\Phi(x)\Phi(x)^T (Z^T Z + \lambda I)^{-1} Z^T Z (Z^T Z + \lambda I)^{-1})].
\end{aligned}$$
And finally we get the variance below,
$$\begin{aligned}
\operatorname{var} &= \mathbb{E}_x[\operatorname{var}(x)] \\
&\approx \sigma^2 \mathbb{E}_Z[\operatorname{trace}(\underbrace{\mathbb{E}_x[\Phi(x)\Phi(x)^T]}_{=I} (Z^T Z + \lambda I)^{-1} Z^T Z (Z^T Z + \lambda I)^{-1})] \\
&\approx \sigma^2 \mathbb{E}_Z[\operatorname{trace}(\underbrace{I}_{\approx \frac{1}{N} Z^T Z} (Z^T Z + \lambda I)^{-1} Z^T Z (Z^T Z + \lambda I)^{-1})] \\
&\approx \frac{\sigma^2}{N} \mathbb{E}_Z[\operatorname{trace}(Z (Z^T Z + \lambda I)^{-1} Z^T Z (Z^T Z + \lambda I)^{-1} Z^T)] \\
&\approx \frac{\sigma^2}{N} \mathbb{E}_Z[\operatorname{trace}(H(\lambda)^2)].
\end{aligned}$$

Problem 4.13

(a) When $\lambda = 0$, we have $H(0) = Z(Z^T Z)^{-1} Z^T$ and $H(0)^2 = Z(Z^T Z)^{-1} Z^T Z (Z^T Z)^{-1} Z^T = H(0)$, which means that
$$\operatorname{trace}(H(0)) = \operatorname{trace}(H(0)^2) = \operatorname{trace}(Z^T Z (Z^T Z)^{-1}) = \operatorname{trace}(I_{\tilde{d}+1}) = \tilde{d} + 1.$$
So, for (i), we get
$$d_{eff}(0) = 2(\tilde{d} + 1) - (\tilde{d} + 1) = \tilde{d} + 1,$$
for (ii), we get
$$d_{eff}(0) = \tilde{d} + 1,$$
and for (iii), we get
$$d_{eff}(0) = \tilde{d} + 1.$$

(b) Here again, for our $(N \times (\tilde{d}+1))$ matrix $Z$, we assume that $N > \tilde{d}+1$; in this case $Z = U \Gamma V^T$ where $U$ is an $(N \times (\tilde{d}+1))$ orthogonal matrix, $\Gamma$ is a $((\tilde{d}+1) \times (\tilde{d}+1))$ diagonal matrix and $V$ is a $((\tilde{d}+1) \times (\tilde{d}+1))$ orthogonal matrix. From Problem 4.7, we know that
$$Z^T Z = V \Gamma^2 V^T \quad \text{and} \quad H(\lambda) = U \operatorname{diag}\!\left(\frac{\sigma_i^2}{\sigma_i^2 + \lambda}\right) U^T.$$
We begin by considering (ii); in this case we have
$$0 \leq d_{eff} = \operatorname{trace}(H(\lambda)) = \operatorname{trace}\!\left(U^T U \operatorname{diag}\!\left(\frac{\sigma_i^2}{\sigma_i^2 + \lambda}\right)\right) = \sum_{i=0}^{\tilde{d}} \frac{\sigma_i^2}{\sigma_i^2 + \lambda} \leq \sum_{i=0}^{\tilde{d}} 1 = \tilde{d} + 1$$
by the cyclic property of the trace. Obviously, if $\lambda$ increases, $d_{eff}$ decreases. Now, we consider (iii); here we have
$$0 \leq d_{eff} = \operatorname{trace}(H(\lambda)^2) = \operatorname{trace}\!\left(U^T U \operatorname{diag}\!\left(\frac{\sigma_i^4}{(\sigma_i^2 + \lambda)^2}\right)\right) = \sum_{i=0}^{\tilde{d}} \frac{\sigma_i^4}{(\sigma_i^2 + \lambda)^2} \leq \sum_{i=0}^{\tilde{d}} 1 = \tilde{d} + 1;$$
here also, if $\lambda$ increases, $d_{eff}$ decreases. Finally, we consider (i), and we get
$$0 \leq d_{eff} = 2\sum_{i=0}^{\tilde{d}} \frac{\sigma_i^2}{\sigma_i^2 + \lambda} - \sum_{i=0}^{\tilde{d}} \frac{\sigma_i^4}{(\sigma_i^2 + \lambda)^2} = \sum_{i=0}^{\tilde{d}} \frac{\sigma_i^4 + 2\sigma_i^2 \lambda}{(\sigma_i^2 + \lambda)^2} \leq \sum_{i=0}^{\tilde{d}} 1 = \tilde{d} + 1;$$
here again, each term $\frac{\sigma_i^4 + 2\sigma_i^2 \lambda}{(\sigma_i^2 + \lambda)^2}$ is decreasing in $\lambda$ (its derivative is $-2\sigma_i^2\lambda/(\sigma_i^2 + \lambda)^3 \leq 0$), so if $\lambda$ increases, $d_{eff}$ decreases.

Problem 4.14

We know from Problem 4.7 that
$$\begin{aligned}
E_{in}(w_{reg}) &= \frac{1}{N} y^T (I - H(\lambda))^2 y \\
&= \frac{1}{N} (f + \epsilon)^T (I - H(\lambda))^2 (f + \epsilon) \\
&= \frac{1}{N} \left[f^T (I - H(\lambda))^2 f + 2 f^T (I - H(\lambda))^2 \epsilon + \epsilon^T (I - H(\lambda))^2 \epsilon\right].
\end{aligned}$$
Now, if we compute the expectation of $E_{in}(w_{reg})$ relative to $\epsilon$, we get
$$\begin{aligned}
\mathbb{E}_\epsilon[E_{in}(w_{reg})] &= \frac{1}{N}\left[f^T (I - H(\lambda))^2 f + 2 f^T (I - H(\lambda))^2 \underbrace{\mathbb{E}_\epsilon[\epsilon]}_{=0} + \mathbb{E}_\epsilon[\epsilon^T (I - H(\lambda))^2 \epsilon]\right] \\
&= \frac{1}{N}\left[f^T (I - H(\lambda))^2 f + \mathbb{E}_\epsilon[\operatorname{trace}(\epsilon \epsilon^T (I - H(\lambda))^2)]\right] \\
&= \frac{1}{N}\left[f^T (I - H(\lambda))^2 f + \operatorname{trace}(\underbrace{\mathbb{E}_\epsilon[\epsilon\epsilon^T]}_{= \operatorname{diag}(\sigma^2)} (I - H(\lambda))^2)\right] \\
&= \frac{1}{N} f^T (I - H(\lambda))^2 f + \frac{\sigma^2}{N} \operatorname{trace}((I - H(\lambda))^2);
\end{aligned}$$
moreover, we also have that
$$\operatorname{trace}((I - H(\lambda))^2) = \underbrace{\operatorname{trace}(I_N)}_{=N} - 2\operatorname{trace}(H(\lambda)) + \operatorname{trace}(H(\lambda)^2) = N - d_{eff}(\lambda),$$
with which we conclude that
$$\mathbb{E}_\epsilon[E_{in}(w_{reg})] = \frac{1}{N} f^T (I - H(\lambda))^2 f + \sigma^2 \left(1 - \frac{d_{eff}(\lambda)}{N}\right).$$

(a) The term involving $\sigma^2$ that is subtracted is $\sigma^2 d_{eff}(\lambda)/N$.

(b) It is clear that if $d_{eff}$ increases, the expected in-sample error $\mathbb{E}_\epsilon[E_{in}(w_{reg})]$ decreases, which is exactly the behaviour exhibited by the number of parameters in the simpler case of linear regression. That explains why $d_{eff}$ is seen as an effective number of parameters in this more complex case.
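This expression is easy to check by simulation: for fixed inputs and a fixed target vector $f$, averaging $E_{in}(w_{reg})$ over many noise realizations should match $\frac{1}{N}f^T(I - H(\lambda))^2 f + \sigma^2(1 - d_{eff}(\lambda)/N)$ with $d_{eff}(\lambda) = 2\operatorname{trace}(H(\lambda)) - \operatorname{trace}(H(\lambda)^2)$. The sketch below uses illustrative sizes.

set.seed(1)

N <- 50; d <- 5; lambda <- 2; sigma <- 0.4
Z <- cbind(1, matrix(rnorm(N * d), nrow = N))
f <- as.vector(Z %*% rnorm(d + 1))                        # noiseless target values
H <- Z %*% solve(t(Z) %*% Z + lambda * diag(d + 1)) %*% t(Z)
deff <- 2 * sum(diag(H)) - sum(diag(H %*% H))

# Average in-sample error of the regularized fit over many noise realizations.
Ein <- replicate(5000, {
  y <- f + sigma * rnorm(N)
  mean((H %*% y - y)^2)
})

c(simulated = mean(Ein),
  theory = sum(((diag(N) - H) %*% f)^2) / N + sigma^2 * (1 - deff / N))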

Problem 4.15

Here also, for our $(N \times (d+1))$ matrix $\tilde{Z}$, we assume that $N > d+1$; in this case $\tilde{Z} = U S V^T$ where $U$ is an $(N \times (d+1))$ orthogonal matrix, $S$ is a $((d+1) \times (d+1))$ diagonal matrix and $V$ is a $((d+1) \times (d+1))$ orthogonal matrix. As $\tilde{Z} = Z \Gamma^{-1}$, we have $Z = \tilde{Z} \Gamma$; in this case, we also have that
$$\begin{aligned}
H(\lambda) &= Z (Z^T Z + \lambda \Gamma^T \Gamma)^{-1} Z^T \\
&= \tilde{Z} \Gamma [\Gamma^T (\tilde{Z}^T \tilde{Z} + \lambda I) \Gamma]^{-1} \Gamma^T \tilde{Z}^T \\
&= \tilde{Z} (\tilde{Z}^T \tilde{Z} + \lambda I)^{-1} \tilde{Z}^T \\
&= U S V^T (V S^T \underbrace{U^T U}_{=I} S V^T + \lambda V V^T)^{-1} V S U^T \\
&= U S (\underbrace{S^T S}_{=S^2} + \lambda I)^{-1} S U^T \\
&= U \operatorname{diag}\!\left(\frac{s_i^2}{s_i^2 + \lambda}\right) U^T
\end{aligned}$$
since $S^2 = \operatorname{diag}(s_i^2)$. In much the same way, we get that
$$H(\lambda)^2 = U \operatorname{diag}\!\left(\frac{s_i^2}{s_i^2 + \lambda}\right) \underbrace{U^T U}_{=I} \operatorname{diag}\!\left(\frac{s_i^2}{s_i^2 + \lambda}\right) U^T = U \operatorname{diag}\!\left(\frac{s_i^4}{(s_i^2 + \lambda)^2}\right) U^T.$$
All of the above implies that
$$\begin{aligned}
\operatorname{trace}(H(\lambda)) &= \operatorname{trace}\!\left(\underbrace{U^T U}_{=I} \operatorname{diag}\!\left(\frac{s_i^2}{s_i^2 + \lambda}\right)\right) \\
&= \sum_{i=0}^{d} \frac{s_i^2}{s_i^2 + \lambda} \\
&= \sum_{i=0}^{d} \left(\frac{s_i^2 + \lambda}{s_i^2 + \lambda} - \frac{\lambda}{s_i^2 + \lambda}\right) \\
&= d + 1 - \sum_{i=0}^{d} \frac{\lambda}{s_i^2 + \lambda},
\end{aligned}$$
and also that
$$\begin{aligned}
\operatorname{trace}(H(\lambda)^2) &= \operatorname{trace}\!\left(U^T U \operatorname{diag}\!\left(\frac{s_i^4}{(s_i^2 + \lambda)^2}\right)\right) \\
&= \sum_{i=0}^{d} \frac{s_i^4}{(s_i^2 + \lambda)^2} \\
&= \sum_{i=0}^{d} \left(\frac{s_i^4 + 2\lambda s_i^2 + \lambda^2}{(s_i^2 + \lambda)^2} - \frac{2\lambda s_i^2 + \lambda^2}{(s_i^2 + \lambda)^2}\right) \\
&= d + 1 - \sum_{i=0}^{d} \frac{2\lambda s_i^2 + \lambda^2}{(s_i^2 + \lambda)^2}.
\end{aligned}$$

(a) In this case, we may write that
$$\begin{aligned}
d_{eff}(\lambda) &= 2\operatorname{trace}(H(\lambda)) - \operatorname{trace}(H(\lambda)^2) \\
&= 2(d+1) - 2\sum_{i=0}^{d} \frac{\lambda}{s_i^2 + \lambda} - (d+1) + \sum_{i=0}^{d} \frac{2\lambda s_i^2 + \lambda^2}{(s_i^2 + \lambda)^2} \\
&= d + 1 - \sum_{i=0}^{d} \frac{\lambda^2}{(s_i^2 + \lambda)^2}.
\end{aligned}$$

(b) In this case, we immediately have that
$$d_{eff}(\lambda) = \operatorname{trace}(H(\lambda)) = d + 1 - \sum_{i=0}^{d} \frac{\lambda}{s_i^2 + \lambda}.$$

(c) Here we also immediately have that
$$d_{eff}(\lambda) = \operatorname{trace}(H(\lambda)^2) = \sum_{i=0}^{d} \frac{s_i^4}{(s_i^2 + \lambda)^2}.$$
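All three definitions are easily computed from the singular values of $Z$; the sketch below (illustrative data) evaluates them on a few values of $\lambda$ and confirms that each stays between 0 and $d + 1$ and decreases as $\lambda$ grows.

set.seed(1)

N <- 100; d <- 5
Z <- cbind(1, matrix(rnorm(N * d), nrow = N))
s2 <- svd(Z)$d^2            # squared singular values s_i^2

deff <- function(lambda) {
  h  <- s2 / (s2 + lambda)          # eigenvalues of H(lambda)
  h2 <- s2^2 / (s2 + lambda)^2      # eigenvalues of H(lambda)^2
  c(def_a = 2 * sum(h) - sum(h2),   # 2 trace(H) - trace(H^2)
    def_b = sum(h),                 # trace(H)
    def_c = sum(h2))                # trace(H^2)
}

sapply(c(0, 0.5, 2, 10, 100), deff)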

Problem 4.16

Here, we seek $w_{reg}$ that minimizes $E_{aug}(w)$, where
$$E_{aug}(w) = \frac{1}{N}\|Zw - y\|^2 + \frac{\lambda}{N} w^T \Gamma^T \Gamma w = \frac{1}{N}(w^T Z^T Z w - 2 y^T Z w + y^T y) + \frac{\lambda}{N} w^T \Gamma^T \Gamma w,$$
and where we assume that $\lambda > 0$. If we take the gradient of the previous expression, we get
$$\nabla E_{aug}(w) = \frac{2}{N}(Z^T Z w - Z^T y + \lambda \Gamma^T \Gamma w).$$
The critical point is found by solving the equation $\nabla E_{aug}(w) = 0$, which gives us
$$w = (Z^T Z + \lambda \Gamma^T \Gamma)^{-1} Z^T y,$$
provided that $\Gamma$ is of full rank (since in this case $\Gamma^T \Gamma$ is positive definite, which consequently makes $Z^T Z + \lambda \Gamma^T \Gamma$ positive definite and thus invertible). For this $w$ to be $w_{reg}$, we must show that it is actually a minimum; to do that, we compute the Hessian, that is
$$\nabla^2 E_{aug}(w) = \frac{2}{N}(Z^T Z + \lambda \Gamma^T \Gamma),$$
which is positive definite; this means that $w_{reg} = w$.

(a) We have that
$$\hat{y} = Z w_{reg} = Z (Z^T Z + \lambda \Gamma^T \Gamma)^{-1} Z^T y = H(\lambda) y.$$

(b) If $\Gamma = Z$, we get that
$$w_{reg} = (Z^T Z + \lambda Z^T Z)^{-1} Z^T y = \frac{1}{1 + \lambda}(Z^T Z)^{-1} Z^T y = \frac{1}{1 + \lambda} w_{lin}.$$
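Part (b) can be verified directly on random data; a minimal sketch (illustrative sizes) follows.

set.seed(1)

N <- 60; d <- 4; lambda <- 3
Z <- cbind(1, matrix(rnorm(N * d), nrow = N))
y <- rnorm(N)

w_lin <- solve(t(Z) %*% Z) %*% t(Z) %*% y
# Tikhonov regularizer with Gamma = Z: uniform shrinkage by 1 / (1 + lambda).
w_reg <- solve(t(Z) %*% Z + lambda * t(Z) %*% Z) %*% t(Z) %*% y

max(abs(w_reg - w_lin / (1 + lambda)))   # should be numerically zero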

Problem 4.17

First, we have the following computation
$$\begin{aligned}
\frac{1}{N}\sum_{n=1}^{N} (w^T \hat{x}_n - y_n)^2 &= \frac{1}{N}\sum_{n=1}^{N} \left[(w^T x_n - y_n) + w^T \epsilon_n\right]^2 \\
&= \frac{1}{N}\sum_{n=1}^{N} (w^T x_n - y_n)^2 + \frac{2}{N}\sum_{n=1}^{N} (w^T x_n - y_n) w^T \epsilon_n + \frac{1}{N}\sum_{n=1}^{N} (w^T \epsilon_n)^2 \\
&= E_{in}(w) + \frac{2}{N}\sum_{n=1}^{N} (w^T x_n - y_n) w^T \epsilon_n + \frac{1}{N}\sum_{n=1}^{N} (w^T \epsilon_n)^2.
\end{aligned}$$
Then, we take the expectation relative to $\epsilon_1, \cdots, \epsilon_N$ and we get
$$\begin{aligned}
\hat{E}_{in}(w) &= \mathbb{E}_{\epsilon_1 \cdots \epsilon_N}\!\left[\frac{1}{N}\sum_{n=1}^{N} (w^T \hat{x}_n - y_n)^2\right] \\
&= E_{in}(w) + \frac{2}{N}\sum_{n=1}^{N} (w^T x_n - y_n) w^T \underbrace{\mathbb{E}_{\epsilon_n}[\epsilon_n]}_{=0} + \frac{1}{N}\sum_{n=1}^{N} w^T \underbrace{\mathbb{E}_{\epsilon_n}[\epsilon_n \epsilon_n^T]}_{= \sigma_x^2 I} w \\
&= E_{in}(w) + \frac{\sigma_x^2}{N}\sum_{n=1}^{N} w^T w \\
&= E_{in}(w) + \sigma_x^2 w^T w.
\end{aligned}$$
Here, the parameters for the Tikhonov regularizer are $\Gamma = I$ and $\lambda = N \sigma_x^2$.

Problem 4.18

(a) We know from Problem 4.16 that
$$w_{reg} = \frac{1}{1 + \lambda} w_{lin}$$
and from Problem 3.14 that
$$\mathbb{E}_{\mathcal{D}}[w_{lin}^T x] = f(x).$$
We may now write that
$$\bar{g}(x) = \mathbb{E}_{\mathcal{D}}[g^{\mathcal{D}}(x)] = \frac{1}{1 + \lambda} \mathbb{E}_{\mathcal{D}}[w_{lin}^T x] = \frac{1}{1 + \lambda} f(x),$$
and consequently
$$\operatorname{bias}(x) = (\bar{g}(x) - f(x))^2 = \frac{\lambda^2}{(1 + \lambda)^2} f(x)^2.$$
We are now able to compute the bias, and we get
$$\operatorname{bias} = \mathbb{E}_x[\operatorname{bias}(x)] = \frac{\lambda^2}{(1 + \lambda)^2} w_f^T \underbrace{\mathbb{E}_x[x x^T]}_{=I} w_f = \frac{\lambda^2}{(1 + \lambda)^2} \|w_f\|^2.$$

(b) We have that
$$\begin{aligned}
\operatorname{var}(x) &= \mathbb{E}_{\mathcal{D}}[(g^{\mathcal{D}}(x) - \bar{g}(x))^2] \\
&= \frac{1}{(1 + \lambda)^2} \mathbb{E}_{\mathcal{D}}[(\underbrace{(w_{lin} - w_f)}_{= (X^T X)^{-1} X^T \epsilon}{}^{T} x)^2] \\
&= \frac{1}{(1 + \lambda)^2} \mathbb{E}_X[x^T (X^T X)^{-1} X^T \underbrace{\mathbb{E}_{y|X}[\epsilon \epsilon^T \,|\, X]}_{= \mathbb{E}_\epsilon[\epsilon\epsilon^T] = \sigma^2 I} X (X^T X)^{-1} x] \\
&= \frac{\sigma^2}{(1 + \lambda)^2} x^T \mathbb{E}_X[(X^T X)^{-1}] x.
\end{aligned}$$
The above allows us to compute the variance, and we get that
$$\begin{aligned}
\operatorname{var} &= \mathbb{E}_x[\operatorname{var}(x)] \\
&= \frac{\sigma^2}{(1 + \lambda)^2} \mathbb{E}_x[\underbrace{x^T \mathbb{E}_X[(X^T X)^{-1}] x}_{= \operatorname{trace}(x x^T \mathbb{E}_X[(X^T X)^{-1}])}] \\
&= \frac{\sigma^2}{(1 + \lambda)^2} \operatorname{trace}(\underbrace{\mathbb{E}_x[x x^T]}_{=I} \mathbb{E}_X[(X^T X)^{-1}]) \\
&= \frac{\sigma^2}{N(1 + \lambda)^2} \operatorname{trace}(\mathbb{E}_X[\underbrace{\left(\tfrac{1}{N} X^T X\right)^{-1}}_{\approx \Sigma^{-1} = I_{d+1}}]) \\
&\approx \frac{\sigma^2 (d + 1)}{N (1 + \lambda)^2}
\end{aligned}$$
by the cyclic property of the trace.


(c) We know from Problem 2.22 that
$$\begin{aligned}
\mathbb{E}_{\mathcal{D}}[E_{out}(w)] &= \sigma^2 + \operatorname{bias} + \operatorname{var} \\
&\approx \sigma^2 + \frac{\lambda^2}{(1 + \lambda)^2}\|w_f\|^2 + \frac{\sigma^2(d+1)}{N(1 + \lambda)^2} \\
&\approx \sigma^2 + \frac{1}{N}\cdot\frac{N\lambda^2\|w_f\|^2 + \sigma^2(d+1)}{(1 + \lambda)^2};
\end{aligned}$$
to determine the optimal regularization parameter, we have to compute the derivative relative to $\lambda$; we get
$$\frac{\partial}{\partial \lambda}\mathbb{E}_{\mathcal{D}}[E_{out}(w)] \approx \frac{1}{N}\cdot\frac{2N\|w_f\|^2\lambda^2 + (2N\|w_f\|^2 - 2\sigma^2(d+1))\lambda - 2\sigma^2(d+1)}{(1 + \lambda)^4}.$$
If we set the above expression equal to 0 and solve this equation for $\lambda$, we obtain
$$\lambda^* = \frac{-2N\|w_f\|^2 + 2\sigma^2(d+1) + (2N\|w_f\|^2 + 2\sigma^2(d+1))}{4N\|w_f\|^2} = \frac{\sigma^2(d+1)}{N\|w_f\|^2}.$$

(d) If we write $\lambda^*$ and $y$ in the following way,
$$\lambda^* = \frac{(d+1)/N}{\|w_f\|^2/\sigma^2} \quad \text{and} \quad y = \sigma\left(X\frac{w_f}{\sigma} + \frac{\epsilon}{\sigma}\right),$$
we may see that $\lambda^*$ is the ratio of the dimension-to-data-points ratio $(d+1)/N$ to the $\sigma$-normalized weight norm $\|w_f\|^2/\sigma^2$. This means that if the number of dimensions $d+1$ is big compared to the number $N$ of data points, the regularization parameter $\lambda^*$ will be big also; and if $\sigma^2$ is small compared to $\|w_f\|^2$, the regularization parameter $\lambda^*$ will be small also.
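To see the formula at work, the sketch below evaluates the approximate expression for $\mathbb{E}_{\mathcal{D}}[E_{out}]$ on a grid of $\lambda$ (with illustrative values of $d$, $N$, $\sigma$ and $\|w_f\|^2$) and checks that the grid minimizer agrees with $\lambda^* = \sigma^2(d+1)/(N\|w_f\|^2)$.

# Approximate expected out-of-sample error as a function of lambda (part (c)).
d <- 5; N <- 100; sigma <- 0.5; norm_wf_sq <- 2

expected_Eout <- function(lambda) {
  sigma^2 + (N * lambda^2 * norm_wf_sq + sigma^2 * (d + 1)) / (N * (1 + lambda)^2)
}

lambda_grid <- seq(0, 0.1, by = 1e-4)
lambda_star <- sigma^2 * (d + 1) / (N * norm_wf_sq)

c(grid_minimizer = lambda_grid[which.min(expected_Eout(lambda_grid))],
  lambda_star = lambda_star)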

Problem 4.19

(a) First, we note that the lasso algorithm is equivalent to the following minimization problem
$$\min_w \underbrace{\frac{1}{N}\|Xw - y\|^2}_{= \frac{1}{N}(w^T X^T X w - 2 y^T X w + y^T y)} \quad \text{subject to} \quad \sum_{i=0}^{d} |w_i| \leq C,$$
which is also equivalent to
$$\min_w (w^T X^T X w - 2 y^T X w) \quad \text{subject to} \quad \sum_{i=0}^{d} |w_i| \leq C.$$
To formulate the above problem as a quadratic program, we split each $w_i$ as $w_i = w_i^+ - w_i^-$ where
$$w_i^+ = \frac{|w_i| + w_i}{2} \geq 0 \quad \text{and} \quad w_i^- = \frac{|w_i| - w_i}{2} \geq 0;$$
in this case, we have $w = w^+ - w^-$ with
$$w^+ = \begin{pmatrix} w_0^+ \\ \vdots \\ w_d^+ \end{pmatrix} \quad \text{and} \quad w^- = \begin{pmatrix} w_0^- \\ \vdots \\ w_d^- \end{pmatrix}.$$
Thus, the lasso algorithm may be formulated as the following quadratic program
$$\begin{cases}
\min_{(w^+, w^-)} \dfrac{1}{2}\begin{pmatrix} w^+ \\ w^- \end{pmatrix}^T V V^T \begin{pmatrix} w^+ \\ w^- \end{pmatrix} + d^T \begin{pmatrix} w^+ \\ w^- \end{pmatrix} \\[2ex]
\text{subject to} \quad A \begin{pmatrix} w^+ \\ w^- \end{pmatrix} \leq C, \quad \begin{pmatrix} w^+ \\ w^- \end{pmatrix} \geq 0,
\end{cases}$$
where
$$V = \sqrt{2}\begin{pmatrix} X^T \\ -X^T \end{pmatrix}, \quad d = \begin{pmatrix} -2X^T y \\ 2X^T y \end{pmatrix}, \quad \text{and} \quad A = (1, \cdots, 1 \,|\, 1, \cdots, 1).$$

Below, we implement the lasso algorithm as a quadratic program.


# Solves the lasso quadratic program with the LowRankQP package (assumed to be loaded).
experiment2 <- function(Qf, N, sigma, Ntest, C, deg) {
aq <- rnorm(Qf + 1)
norm <- rep(0, Qf + 1)
for (q in 0:Qf)
norm[q + 1] <- 1 / (2 * q + 1)
norm_fac <- 1 / sqrt(sum(norm))
aq <- norm_fac * aq

xn <- runif(N, min = -1, max = 1)


eps <- rnorm(N)
yn <- f(xn, Qf, aq) + sigma * eps
D <- data.frame(x = xn, y = yn)

Ddeg <- data.frame(1, x = D$x)


for (d in 2:deg) {
Ddeg <- cbind(Ddeg, Ddeg$x^d)
}
X <- as.matrix(Ddeg)
d <- ncol(X) - 1
Vmat <- t(cbind(X, -X, matrix(0, nrow = nrow(X)))) * sqrt(2)
dvec <- as.vector(rbind(-2 * t(X) %*% as.matrix(D$y), 2 * t(X) %*% as.matrix(D$y), 0))
Amat <- matrix(c(rep(1, 2 * (d + 1)), 1), nrow = 1)
bOls <- lm.fit(X, D$y)$coefficients
bvec <- c(min(C, sum(abs(bOls))))
uvec <- c(abs(bOls), abs(bOls), sum(abs(bOls)))
soln <- LowRankQP(Vmat, dvec, Amat, bvec, uvec, method = "LU", verbose = FALSE)
w <- soln$alpha[1:(d + 1)] - soln$alpha[(d + 2):(2 * (d + 1))]

x <- runif(Ntest, min = -1, max = 1)


eps <- rnorm(Ntest)
y <- f(x, Qf, aq) + sigma * eps
Dtest <- data.frame(x = x, y = y)
Dtestdeg <- data.frame(1, x = Dtest$x)
for (d in 2:deg) {
Dtestdeg <- cbind(Dtestdeg, Dtestdeg$x^d)
}
Eout <- mean((as.matrix(Dtestdeg) %*% w - Dtest$y)^2)

return(Eout)
}

Now, we plot the out-of-sample error $E_{out}$ versus the regularization parameter $C$.

C_grid <- seq(0.01, 100, by = 0.5)
E_out_comp <- foreach(i = 1:length(C_grid), .combine = "rbind") %dopar% {
set.seed(1975)
tmp <- experiment2(Qf = 20, N = 1000, sigma = 0.1, Ntest = 100,
C = C_grid[i], deg = 6)
tmp
}
Eout <- data.frame(C = C_grid, Eout = E_out_comp[, 1])
ggplot(Eout, aes(x = C, y = Eout)) + geom_line(col = "red")

[Figure: out-of-sample error Eout versus the constraint parameter C.]
In the plot above, the minimum Eout is obtained for C = 26.01.
(b) The augmented error for the lasso is
$$E_{aug}(w) = E_{in}(w) + \lambda \sum_{i=0}^{d} |w_i|.$$
It is actually more convenient to optimize, since this is an unconstrained problem as opposed to the original lasso problem.
(c) Here we compare the number of non-zero weights from the lasso versus the quadratic penalty for d = 5
and N = 3.
# Uses the glmnet package (assumed to be loaded) for the ridge and lasso fits.
experiment3 <- function(Qf, N, sigma, deg, grid) {
aq <- rnorm(Qf + 1)
norm <- rep(0, Qf + 1)
for (q in 0:Qf)
norm[q + 1] <- 1 / (2 * q + 1)
norm_fac <- 1 / sqrt(sum(norm))
aq <- norm_fac * aq

xn <- runif(N, min = -1, max = 1)


eps <- rnorm(N)
yn <- f(xn, Qf, aq) + sigma * eps

D <- data.frame(x = xn, y = yn)

Ddeg <- data.frame(1, x = D$x)


for (d in 2:deg) {
Ddeg <- cbind(Ddeg, Ddeg$x^d)
}
X <- as.matrix(Ddeg)
d <- ncol(X) - 1
ridge <- glmnet(X, D$y, alpha = 0, lambda = grid, standardize = FALSE)
lasso <- glmnet(X, D$y, alpha = 1, lambda = grid, standardize = FALSE)

number_ridge <- apply(coef(ridge) != 0, 2, sum)


number_lasso <- apply(coef(lasso) != 0, 2, sum)

return(data.frame(ridge = number_ridge, lasso = number_lasso))


}

set.seed(10)
grid <- 10^seq(1, -2, length = 100)
Num_nz_weights <- cbind(grid, experiment3(Qf = 20, N = 3, sigma = 1, deg = 5, grid = grid))
ggplot(Num_nz_weights, aes(x = grid, y = ridge)) + geom_line(aes(colour = "Quadratic")) +
geom_line(aes(x = grid, y = lasso, colour = "Lasso")) +
scale_color_manual("Type:", values = c("red", "green"))

[Figure: number of non-zero weights versus the regularization parameter, for the lasso and the quadratic (ridge) penalty.]

Problem 4.20

(a) We know that the optimal weights for the transformed problem are
$$\tilde{w} = (Z^T Z)^{-1} Z^T \tilde{y}$$
where
$$Z = \begin{pmatrix} z_1^T \\ \vdots \\ z_N^T \end{pmatrix} = \begin{pmatrix} x_1^T A^T \\ \vdots \\ x_N^T A^T \end{pmatrix} = X A^T \quad \text{and} \quad \tilde{y} = \alpha y.$$
We may now write that
$$\begin{aligned}
\tilde{w} &= (Z^T Z)^{-1} Z^T \tilde{y} \\
&= (A X^T X A^T)^{-1} A X^T \alpha y \\
&= \alpha (A^T)^{-1} (X^T X)^{-1} A^{-1} A X^T y \\
&= \alpha (A^T)^{-1} w
\end{aligned}$$
since $w = (X^T X)^{-1} X^T y$.

(b) In this case, we know from Problem 4.16 that
$$\begin{aligned}
\tilde{w}_{reg}(\lambda) &= (Z^T Z + \lambda Z^T Z)^{-1} Z^T \tilde{y} \\
&= \frac{1}{1 + \lambda} \tilde{w} \\
&= \frac{1}{1 + \lambda} \alpha (A^T)^{-1} w \\
&= \alpha (A^T)^{-1} w_{reg}(\lambda)
\end{aligned}$$
since $w_{reg}(\lambda) = \frac{1}{1 + \lambda} w$.

Problem 4.21

As $h(x)$ is a linear function, we immediately have that $\partial^2 h(x)/\partial x^2 = 0$; this implies that
$$\Omega(h) = \int \frac{\partial^2 h(x)}{\partial x^2}\, dx = 0,$$
and consequently $\Gamma = 0$.

Problem 4.22

Here, we have a data set with $N = 100$ points and a validation set of $K = 25$ points. We consider $M = 100$ models $\mathcal{H}_1, \cdots, \mathcal{H}_M$, each with VC dimension $d_{VC} = 10$.

In the first case, each model $\mathcal{H}_m$ gives birth to a final hypothesis $g_m^-$ trained on the $N - K = 75$ training points; from these hypotheses, we select the one with the minimum validation error, $g_{m^*}^-$, with $E_{val}(g_{m^*}^-) = 0.25$. We know that
$$E_{out}(g_{m^*}) \leq E_{out}(g_{m^*}^-) \leq E_{val}(g_{m^*}^-) + \sqrt{\frac{1}{2K}\ln\frac{2M}{\delta}},$$
where $g_{m^*}$ is the chosen final hypothesis trained on the entire data set, since we selected our final hypothesis $g_{m^*}^-$ from a finite hypothesis set $\mathcal{H}_{val} = \{g_1^-, \cdots, g_M^-\}$. So, a bound on the out-of-sample error is given by
$$E_{val}(g_{m^*}^-) + \sqrt{\frac{1}{2K}\ln\frac{2M}{\delta}} = 0.25 + \sqrt{\frac{1}{50}\ln\frac{200}{\delta}};$$
thus we may write that
$$E_{out}(g_{m^*}) \leq 0.25 + \sqrt{\frac{1}{50}\ln\frac{200}{\delta}}$$
with probability at least $1 - \delta$.
In the second case, each model $\mathcal{H}_m$ gives birth to a final hypothesis $g_m$ trained on the entire data set; from these hypotheses, we select the one with the minimum in-sample error, $g_{m^*}$, with $E_{in}(g_{m^*}) = 0.15$. Here we must be careful: since each $g_m$ was selected (by minimizing $E_{in}$) within its hypothesis set $\mathcal{H}_m$, and $g_{m^*}$ is chosen as having the minimum $E_{in}$ among these $g_m$, this is equivalent to selecting $g_{m^*}$ as having the minimum $E_{in}$ in all of $\mathcal{H}_1 \cup \cdots \cup \mathcal{H}_M$, which is no longer a simple finite hypothesis set. Hence, we know from the VC generalization bound that
$$E_{out}(g_{m^*}) \leq E_{in}(g_{m^*}) + \sqrt{\frac{8}{N}\ln\left(\frac{4((2N)^{d_{VC}(\cup_m \mathcal{H}_m)} + 1)}{\delta}\right)},$$
where we know from Problem 2.14 that
$$d_{VC}(\cup_m \mathcal{H}_m) \leq M(d_{VC} + 1) = 1100.$$
So, a bound on the out-of-sample error is given by
$$E_{in}(g_{m^*}) + \sqrt{\frac{8}{N}\ln\left(\frac{4((2N)^{d_{VC}(\cup_m \mathcal{H}_m)} + 1)}{\delta}\right)} = 0.15 + \sqrt{\frac{8}{100}\ln\left(\frac{4(200^{1100} + 1)}{\delta}\right)};$$
thus we may write that
$$E_{out}(g_{m^*}) \leq 0.15 + \sqrt{\frac{8}{100}\ln\left(\frac{4(200^{1100} + 1)}{\delta}\right)}$$
with probability at least $1 - \delta$.

It is pretty obvious that the first bound is tighter than the second one.
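For a concrete (illustrative) confidence level, say $\delta = 0.05$, the two bounds can be evaluated directly; since $\ln(4(200^{1100} + 1)) \approx \ln 4 + 1100 \ln 200$, the huge power never has to be formed explicitly.

delta <- 0.05                   # illustrative confidence parameter
M <- 100; K <- 25; N <- 100; dvc_union <- 1100

# First case: validation-based bound.
bound1 <- 0.25 + sqrt(log(2 * M / delta) / (2 * K))

# Second case: VC bound; log(4 * ((2N)^dvc + 1) / delta) approximated by dropping the "+ 1".
bound2 <- 0.15 + sqrt(8 / N * (log(4) + dvc_union * log(2 * N) - log(delta)))

c(validation_bound = bound1, vc_bound = bound2)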

Problem 4.23

(a) We immediately have that
$$\begin{aligned}
\operatorname{Var}_{\mathcal{D}}[E_{cv}] &= \operatorname{Var}_{\mathcal{D}}\!\left[\frac{1}{N}\sum_n e_n\right] \\
&= \frac{1}{N^2}\operatorname{Var}_{\mathcal{D}}\!\left[\sum_n e_n\right] \\
&= \frac{1}{N^2}\sum_n \operatorname{Var}_{\mathcal{D}}[e_n] + \frac{1}{N^2}\sum_{n \neq m} \operatorname{Cov}_{\mathcal{D}}[e_n, e_m].
\end{aligned}$$

(b) As
$$e_n = e(g^{(N-2)} + \delta_n, y_n) = e(g^{(N-2)}, y_n) + o(\delta_n),$$
we may write that
$$\begin{aligned}
\operatorname{Cov}_{\mathcal{D}}[e_n, e_m] &= \operatorname{Cov}_{\mathcal{D}}[e(g^{(N-2)}, y_n) + o(\delta_n),\; e(g^{(N-2)}, y_m) + o(\delta_m)] \\
&= \operatorname{Cov}_{\mathcal{D}}[e(g^{(N-2)}, y_n), e(g^{(N-2)}, y_m)] + o(\delta_n) + o(\delta_m) + o(\delta_n \delta_m) \\
&= \underbrace{\mathbb{E}_{\mathcal{D}}[e(g^{(N-2)}, y_n)\, e(g^{(N-2)}, y_m)]}_{(1)} - \underbrace{\mathbb{E}_{\mathcal{D}}[e(g^{(N-2)}, y_n)]\,\mathbb{E}_{\mathcal{D}}[e(g^{(N-2)}, y_m)]}_{(2)} + o(\delta_n) + o(\delta_m) + o(\delta_n \delta_m).
\end{aligned}$$
First, we consider (1); we get
$$\begin{aligned}
(1) &= \mathbb{E}_{\mathcal{D}^{(N-2)}}[\mathbb{E}_{(x_n,y_n),(x_m,y_m)|\mathcal{D}^{(N-2)}}[e(g^{(N-2)}, y_n)\, e(g^{(N-2)}, y_m)]] \\
&= \mathbb{E}_{\mathcal{D}^{(N-2)}}[(\mathbb{E}_{(x_n,y_n)|\mathcal{D}^{(N-2)}}[e(g^{(N-2)}, y_n)])^2] \\
&= \mathbb{E}_{\mathcal{D}^{(N-2)}}[(E_{out}(g^{(N-2)}))^2].
\end{aligned}$$
Then, we consider (2), and we obtain
$$\begin{aligned}
(2) &= \mathbb{E}_{\mathcal{D}^{(N-2)}}[\mathbb{E}_{(x_n,y_n)|\mathcal{D}^{(N-2)}}[e(g^{(N-2)}, y_n)]]\;\mathbb{E}_{\mathcal{D}^{(N-2)}}[\mathbb{E}_{(x_m,y_m)|\mathcal{D}^{(N-2)}}[e(g^{(N-2)}, y_m)]] \\
&= (\mathbb{E}_{\mathcal{D}^{(N-2)}}[E_{out}(g^{(N-2)})])^2.
\end{aligned}$$
Finally, we get that
$$\begin{aligned}
\operatorname{Cov}_{\mathcal{D}}[e_n, e_m] &= \mathbb{E}_{\mathcal{D}^{(N-2)}}[(E_{out}(g^{(N-2)}))^2] - (\mathbb{E}_{\mathcal{D}^{(N-2)}}[E_{out}(g^{(N-2)})])^2 + o(\delta_n) + o(\delta_m) + o(\delta_n \delta_m) \\
&= \operatorname{Var}_{\mathcal{D}^{(N-2)}}[E_{out}(g^{(N-2)})] + o(\delta_n) + o(\delta_m) + o(\delta_n \delta_m).
\end{aligned}$$

(c) We know from point (a) that
$$\begin{aligned}
\operatorname{Var}_{\mathcal{D}}[E_{cv}] &= \frac{1}{N^2}\sum_n \underbrace{\operatorname{Var}_{\mathcal{D}}[e_n]}_{= \operatorname{Var}_{\mathcal{D}}[e_1]} + \frac{1}{N^2}\sum_{n \neq m} \underbrace{\operatorname{Cov}_{\mathcal{D}}[e_n, e_m]}_{= \operatorname{Var}_{\mathcal{D}^{(N-2)}}[E_{out}(g^{(N-2)})] + O(\frac{1}{N})} \\
&= \frac{1}{N}\operatorname{Var}_{\mathcal{D}}[e_1] + \frac{N-1}{N}\underbrace{\operatorname{Var}_{\mathcal{D}^{(N-2)}}[E_{out}(g^{(N-2)})]}_{\approx \operatorname{Var}_{\mathcal{D}}[E_{out}(g)] + O(\frac{1}{N})} + O\!\left(\frac{1}{N}\right) \\
&\approx \frac{1}{N}\operatorname{Var}_{\mathcal{D}}[e_1] + \operatorname{Var}_{\mathcal{D}}[E_{out}(g)] + O\!\left(\frac{1}{N}\right).
\end{aligned}$$

Problem 4.24

(a) Here, we use linear regression with weight decay regularization to estimate wf with wreg in the cases where
N ∈ {d + 15, d + 25, · · · , d + 115}; for each N value we also compute the cross validation errors e1 , · · · , eN
and Ecv .
d <- 3
sigma <- 0.5

wf <- as.numeric(rnorm(d + 1))


dataset_gen <- function(N) {
D <- data.frame(x1 = rnorm(N), x2 = rnorm(N), x3 = rnorm(N))

return(D)
}
y_gen <- function(D) {
y <- apply(D, 1, function(x) sum(wf * c(1, as.numeric(x))) + sigma * rnorm(1))

return(y)

}
crossval_error <- function(N, lambda) {
D <- dataset_gen(N)
y <- y_gen(D)
e <- rep(NA, N)
for (n in 1:N) {
X_n <- as.matrix(cbind(1, D[-n, ]))
X_n_cross <- solve(t(X_n) %*% X_n + (lambda / N) * diag(d + 1)) %*% t(X_n)
wreg_n <- as.vector(X_n_cross %*% as.matrix(y[-n]))
e[n] <- (sum(c(1, as.numeric(D[n, ])) * wreg_n) - y[n])^2
}
Ecv <- mean(e)

return(c(e[1], e[2], Ecv))


}
experiment4 <- function(lambda) {
Nseq <- seq(d + 15, d + 115, by = 10)
results <- matrix(NA, nrow = length(Nseq), ncol = 3)
i <- 1
for (N in Nseq) {
results[i, ] <- crossval_error(N, lambda)
i <- i + 1
}
results <- as.numeric(results)

return(results)
}

Now, we repeat the above experiment 5000 times maintaining the average and variance over the experiments
of e1 , e2 and Ecv .
set.seed(10)
iter <- 5000
lambda <- 0.05
results <- matrix(NA, nrow = 33, ncol = iter)
for (i in 1:iter) {
results[, i] <- experiment4(lambda)
}
mean_res <- apply(results, 1, mean)
var_res <- apply(results, 1, var)
final_res <- cbind(seq(d + 15, d + 115, by = 10),
as.data.frame(matrix(mean_res, nrow = 11)),
as.data.frame(matrix(var_res, nrow = 11)))
colnames(final_res) <- c("N", "Avg_e1", "Avg_e2", "Avg_Ecv", "Var_e1", "Var_e2", "Var_Ecv")

(b) We know from the theory that
$$\mathbb{E}_{\mathcal{D}}[E_{cv}] = \mathbb{E}_{\mathcal{D}}[e_1] = \mathbb{E}_{\mathcal{D}}[e_2] = \bar{E}_{out}(N - 1).$$
To visualize this, we plot below the averages of $e_1$, $e_2$ and $E_{cv}$.
ggplot(final_res, aes(x = N, y = Avg_e1)) + geom_line(aes(colour = "e1")) +
geom_line(aes(x = N, y = Avg_e2, colour = "e2")) +
geom_line(aes(x = N, y = Avg_Ecv, colour = "Ecv")) +
scale_colour_manual("Error:", values = c("blue", "green", "red")) +
labs(x = "N", y = "Averages")

[Figure: averages of e1, e2 and Ecv versus N.]
It is pretty obvious that the mean values of e1 , e2 , and Ecv are tracking each other.
(c) Since the en ’s are not independent, the contributors to the variance of e1 are the other en ’s.
(d) If the cross-validation errors were truly independent, we would have that (see Problem 4.23)
$$\operatorname{Var}_{\mathcal{D}}[E_{cv}] = \frac{1}{N^2}\sum_n \operatorname{Var}_{\mathcal{D}}[e_n] = \frac{1}{N}\operatorname{Var}_{\mathcal{D}}[e_1].$$

(e) The ratio of the variance of the $e_1$'s to that of the $E_{cv}$'s is given by
$$N_{eff} = \frac{\operatorname{Var}_{\mathcal{D}}[e_1]}{\operatorname{Var}_{\mathcal{D}}[E_{cv}]} = \frac{N \operatorname{Var}_{\mathcal{D}}[e_1]}{\operatorname{Var}_{\mathcal{D}}[e_1] + \frac{1}{N}\sum_{n \neq m}\operatorname{Cov}_{\mathcal{D}}[e_n, e_m]};$$
since in this context $e_n$ and $e_m$ are only “slightly” dependent, their covariance is close to 0, so the above ratio is close to $N$.
ggplot(final_res, aes(x = N, y = Var_e1 / Var_Ecv)) + geom_line(colour = "red") +
geom_line(aes(x = N, y = N)) +
labs(x = "N", y = "N_eff")

[Figure: N_eff = Var(e1)/Var(Ecv) versus N, compared with the line N_eff = N.]
(f) Increasing the amount of regularization should have no notable effect on $N_{eff}$: in this case the norm of $w_{reg}$ is more restricted, but this has no relation to the effective number of fresh examples used in computing the cross-validation error.
set.seed(10)
iter <- 5000
lambda <- 2.5
results2 <- matrix(NA, nrow = 33, ncol = iter)
for (i in 1:iter) {
results2[, i] <- experiment4(lambda)
}
mean_res2 <- apply(results2, 1, mean)
var_res2 <- apply(results2, 1, var)
final_res2 <- cbind(seq(d + 15, d + 115, by = 10),
as.data.frame(matrix(mean_res2, nrow = 11)),
as.data.frame(matrix(var_res2, nrow = 11)))
colnames(final_res2) <- c("N", "Avg_e1", "Avg_e2", "Avg_Ecv", "Var_e1", "Var_e2", "Var_Ecv")

As shown in the plot below, we see no modification in $N_{eff}$.


ggplot(final_res2, aes(x = N, y = Var_e1 / Var_Ecv)) + geom_line(colour = "red") +
geom_line(aes(x = N, y = N)) +
labs(x = "N", y = "N_eff")

[Figure: N_eff versus N for λ = 2.5, again compared with the line N_eff = N.]

Problem 4.25

(a) No: in this case, there is no guarantee that we get the VC-type bound we obtained when using the same validation set for all models.

(b) As explained in the text, since the validation model $\mathcal{H}_{val}$ was obtained before ever looking at the data in the validation set, the process of model selection is equivalent to learning a hypothesis from $\mathcal{H}_{val}$ using the data in $\mathcal{D}_{val}$. In this case, we may apply the VC bound for finite hypothesis sets.
(c) We know from the proof of the Hoeffding inequality and point (b) that for each $m = 1, \cdots, M$,
$$\mathbb{P}[E_{out}(m) - E_{val}(m) > \epsilon] \leq e^{-2\epsilon^2 K_m}$$
for all $\epsilon > 0$. A reasoning similar to the one that led us to (1.6) gives us that
$$\mathbb{P}[E_{out}(m^*) - E_{val}(m^*) > \epsilon] \leq \mathbb{P}[E_{out}(1) - E_{val}(1) > \epsilon] + \cdots + \mathbb{P}[E_{out}(M) - E_{val}(M) > \epsilon] \leq \sum_{m=1}^{M} e^{-2\epsilon^2 K_m}.$$
Now, if we let
$$\kappa(\epsilon) = -\frac{1}{2\epsilon^2}\ln\left(\frac{1}{M}\sum_{m=1}^{M} e^{-2\epsilon^2 K_m}\right),$$
we get
$$M e^{-2\epsilon^2 \kappa(\epsilon)} = M e^{\ln\left(\frac{1}{M}\sum_m e^{-2\epsilon^2 K_m}\right)} = \sum_{m=1}^{M} e^{-2\epsilon^2 K_m};$$
in this case, we actually obtain
$$\mathbb{P}[E_{out}(m^*) > E_{val}(m^*) + \epsilon] \leq M e^{-2\epsilon^2 \kappa(\epsilon)}.$$
Moreover, we may note that $\kappa(\epsilon) \geq 0$: since $-2\epsilon^2 K_m \leq 0$, we have $e^{-2\epsilon^2 K_m} \leq 1$, so $\frac{1}{M}\sum_m e^{-2\epsilon^2 K_m} \leq 1$, and finally $\kappa(\epsilon) \geq 0$.
(d) It is easy to see that
$$\mathbb{P}[E_{out}(m^*) \leq E_{val}(m^*) + \epsilon] = 1 - \mathbb{P}[E_{out}(m^*) > E_{val}(m^*) + \epsilon] \geq 1 - M e^{-2\epsilon^2 \kappa(\epsilon)}$$
for all $\epsilon > 0$. If $\epsilon^*$ satisfies $\epsilon^* \geq \sqrt{\frac{\ln(M/\delta)}{2\kappa(\epsilon^*)}}$, we get that
$$-2\epsilon^{*2}\kappa(\epsilon^*) \leq \ln(\delta/M)$$
and consequently
$$M e^{-2\epsilon^{*2}\kappa(\epsilon^*)} \leq \delta.$$
In conclusion, we have with probability at least $1 - \delta$ that
$$E_{out}(m^*) \leq E_{val}(m^*) + \epsilon^*$$
for all $\epsilon^* \geq \sqrt{\frac{\ln(M/\delta)}{2\kappa(\epsilon^*)}}$.

(e) We begin by proving the first inequality. Since $\min_m K_m \leq K_m$ for all $1 \leq m \leq M$, we have that
$$\begin{aligned}
-2\epsilon^2 K_m \leq -2\epsilon^2 \min_m K_m
&\;\Leftrightarrow\; \frac{1}{M}\sum_{m=1}^{M} e^{-2\epsilon^2 K_m} \leq \frac{1}{M}\sum_{m=1}^{M} e^{-2\epsilon^2 \min_m K_m} = e^{-2\epsilon^2 \min_m K_m} \\
&\;\Rightarrow\; \kappa(\epsilon) = -\frac{1}{2\epsilon^2}\ln\left(\frac{1}{M}\sum_{m=1}^{M} e^{-2\epsilon^2 K_m}\right) \geq \min_m K_m.
\end{aligned}$$
Then, we consider the second inequality. We may write that
$$\begin{aligned}
\kappa(\epsilon) &= \frac{1}{2\epsilon^2}\left(-\ln\left(\frac{1}{M}\sum_{m=1}^{M} e^{-2\epsilon^2 K_m}\right)\right) \\
&\leq \frac{1}{2\epsilon^2}\cdot\frac{1}{M}\sum_{m=1}^{M}\left(-\ln\left(e^{-2\epsilon^2 K_m}\right)\right) \\
&= \frac{1}{2\epsilon^2}\cdot\frac{1}{M}\sum_{m=1}^{M} 2\epsilon^2 K_m = \frac{1}{M}\sum_{m=1}^{M} K_m
\end{aligned}$$
by Jensen's inequality applied to the convex function $f(x) = -\ln(x)$.

We know from point (d) that with probability at least $1 - \delta$, we have (at best) that
$$E_{out}(m^*) \leq E_{val}(m^*) + \sqrt{\frac{1}{2\kappa(\epsilon^*)}\ln\frac{M}{\delta}}$$
for $\epsilon^* = \sqrt{\frac{\ln(M/\delta)}{2\kappa(\epsilon^*)}}$, when the models use different validation set sizes. We also know from the proof of the Hoeffding inequality and point (b) that
$$E_{out}(m^*) \leq E_{val}(m^*) + \sqrt{\frac{1}{2K}\ln\frac{M}{\delta}},$$
where $K = \frac{1}{M}\sum_m K_m$, when the models use the same validation set size. Since we proved that $\kappa(\epsilon) \leq \frac{1}{M}\sum_m K_m = K$, we immediately have that
$$\sqrt{\frac{1}{2\kappa(\epsilon^*)}\ln\frac{M}{\delta}} \geq \sqrt{\frac{1}{2K}\ln\frac{M}{\delta}},$$
which means that the bound is better when all models use the same validation set size.

Problem 4.26

(a) Let $Z$ be the following matrix
$$Z = \begin{pmatrix} z_1^T \\ \vdots \\ z_N^T \end{pmatrix};$$
we are then able to write that
$$Z^T Z = (z_1, \cdots, z_N)\begin{pmatrix} z_1^T \\ \vdots \\ z_N^T \end{pmatrix} = \sum_{n=1}^{N} z_n z_n^T$$
and
$$Z^T y = (z_1, \cdots, z_N)\begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix} = \sum_{n=1}^{N} z_n y_n.$$
Moreover, we also have
$$H(\lambda) = Z A(\lambda)^{-1} Z^T = \begin{pmatrix} z_1^T \\ \vdots \\ z_N^T \end{pmatrix} A(\lambda)^{-1}(z_1, \cdots, z_N) = \begin{pmatrix} z_1^T \\ \vdots \\ z_N^T \end{pmatrix}(A(\lambda)^{-1} z_1, \cdots, A(\lambda)^{-1} z_N) = \begin{pmatrix} z_1^T A(\lambda)^{-1} z_1 & \cdots & z_1^T A(\lambda)^{-1} z_N \\ \vdots & & \vdots \\ z_N^T A(\lambda)^{-1} z_1 & \cdots & z_N^T A(\lambda)^{-1} z_N \end{pmatrix},$$
which implies that $H_{nm}(\lambda) = z_n^T A(\lambda)^{-1} z_m$. If we now leave the data point $(z_n, y_n)$ out, $Z^T Z$ becomes
$$(z_1, \cdots, \hat{z}_n, \cdots, z_N)\begin{pmatrix} z_1^T \\ \vdots \\ \hat{z}_n^T \\ \vdots \\ z_N^T \end{pmatrix} = Z^T Z - z_n z_n^T,$$
and $Z^T y$ becomes
$$(z_1, \cdots, \hat{z}_n, \cdots, z_N)\begin{pmatrix} y_1 \\ \vdots \\ \hat{y}_n \\ \vdots \\ y_N \end{pmatrix} = Z^T y - z_n y_n,$$
where the hat denotes a left-out entry.

(b) We know that
$$w_n^- = (A_{-n})^{-1} Z_{-n}^T y_{-n},$$
where the subscript $-n$ stands for “when the $n$th data point is left out”. From point (a), we obtain immediately that
$$A_{-n} = Z_{-n}^T Z_{-n} + \lambda \Gamma^T \Gamma = Z^T Z - z_n z_n^T + \lambda \Gamma^T \Gamma = A - z_n z_n^T$$
and $Z_{-n}^T y_{-n} = Z^T y - z_n y_n$. Thus, we may write that
$$\begin{aligned}
w_n^- &= (A_{-n})^{-1} Z_{-n}^T y_{-n} \\
&= (A - z_n z_n^T)^{-1}(Z^T y - z_n y_n) \\
&= \left(A^{-1} + \frac{A^{-1} z_n z_n^T A^{-1}}{1 - z_n^T A^{-1} z_n}\right)(Z^T y - z_n y_n)
\end{aligned}$$
by the Sherman-Morrison-Woodbury formula.


(c) From point (b), we have that
$$\begin{aligned}
w_n^- &= \left(A^{-1} + \frac{A^{-1} z_n z_n^T A^{-1}}{1 - z_n^T A^{-1} z_n}\right)(Z^T y - z_n y_n) \\
&= \underbrace{A^{-1} Z^T y}_{=w} - A^{-1} z_n y_n + \frac{A^{-1} z_n z_n^T A^{-1}}{1 - H_{nn}} Z^T y - \frac{A^{-1} z_n z_n^T A^{-1}}{1 - H_{nn}} z_n y_n \\
&= w - \frac{1}{1 - H_{nn}}\left(A^{-1} z_n y_n - A^{-1} z_n z_n^T A^{-1} z_n y_n - A^{-1} z_n z_n^T A^{-1} Z^T y + A^{-1} z_n z_n^T A^{-1} z_n y_n\right) \\
&= w - \frac{1}{1 - H_{nn}} A^{-1} z_n (y_n - \underbrace{z_n^T A^{-1} Z^T y}_{= z_n^T w = \hat{y}_n}) \\
&= w + \frac{(\hat{y}_n - y_n) A^{-1} z_n}{1 - H_{nn}}.
\end{aligned}$$

(d) We now compute the prediction on the validation point; we get
$$z_n^T w_n^- = z_n^T\left(w + \frac{(\hat{y}_n - y_n) A^{-1} z_n}{1 - H_{nn}}\right) = \underbrace{z_n^T w}_{=\hat{y}_n} + \frac{\hat{y}_n - y_n}{1 - H_{nn}}\underbrace{z_n^T A^{-1} z_n}_{=H_{nn}} = \frac{\hat{y}_n - H_{nn} y_n}{1 - H_{nn}}.$$
(e) We immediately obtain
$$e_n = (y_n - z_n^T w_n^-)^2 = \left(y_n - \frac{\hat{y}_n - H_{nn} y_n}{1 - H_{nn}}\right)^2 = \left(\frac{y_n - \hat{y}_n}{1 - H_{nn}}\right)^2,$$
which gives us that
$$E_{cv} = \frac{1}{N}\sum_{n=1}^{N}\left(\frac{y_n - \hat{y}_n}{1 - H_{nn}}\right)^2.$$
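The shortcut is easy to verify against an explicit leave-one-out loop; the sketch below (illustrative data, weight-decay regularizer $\Gamma = I$) computes $E_{cv}$ both ways.

set.seed(1)

N <- 30; d <- 3; lambda <- 0.5
Z <- cbind(1, matrix(rnorm(N * d), nrow = N))
y <- rnorm(N)

A_inv <- solve(t(Z) %*% Z + lambda * diag(d + 1))
H <- Z %*% A_inv %*% t(Z)
y_hat <- as.vector(H %*% y)

# Closed-form cross-validation error.
Ecv_formula <- mean(((y - y_hat) / (1 - diag(H)))^2)

# Explicit leave-one-out loop.
e <- numeric(N)
for (n in 1:N) {
  w_minus <- solve(t(Z[-n, ]) %*% Z[-n, ] + lambda * diag(d + 1)) %*% t(Z[-n, ]) %*% y[-n]
  e[n] <- (y[n] - sum(Z[n, ] * w_minus))^2
}

c(formula = Ecv_formula, loop = mean(e))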

Problem 4.27

(a) The quantities $e_1, \cdots, e_N$ measure the spread of the individual cross-validation errors, but $E_{cv}$ is their average over $N$ points, so its standard deviation is roughly $\sqrt{N}$ times smaller than that of a single $e_n$; dividing the sample standard deviation by $\sqrt{N}$ therefore gives $\sigma_{cv}$ as an estimate of the uncertainty of $E_{cv}$ itself (a heuristic that ignores the dependence between the $e_n$'s).
(b) We have that
$$N\sigma_{cv}^2 = \operatorname{var}(e_1, \cdots, e_N) = \frac{1}{N}\sum_{n=1}^{N} e_n^2 - \left(\frac{1}{N}\sum_{n=1}^{N} e_n\right)^2 = \frac{1}{N}\sum_{n=1}^{N}\left(\frac{\hat{y}_n - y_n}{1 - H_{nn}}\right)^4 - E_{cv}^2,$$
which implies that
$$\sqrt{N}\sigma_{cv} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(\frac{\hat{y}_n - y_n}{1 - H_{nn}}\right)^4 - E_{cv}^2}.$$

(c) Below, we implement the experimental design to compare the different approaches.
experiment5 <- function(Qf, N, sigma, Ntest) {
aq <- rnorm(Qf + 1)
norm <- rep(0, Qf + 1)
for (q in 0:Qf)
norm[q + 1] <- 1 / (2 * q + 1)
norm_fac <- 1 / sqrt(sum(norm))
aq <- norm_fac * aq

xn <- runif(N, min = -1, max = 1)


eps <- rnorm(N)
yn <- f(xn, Qf, aq) + sigma * eps
D <- data.frame(x = xn, y = yn)

d <- 2
E_cv <- numeric()

sigma_cv <- numeric()
bound <- numeric()
lambda_seq <- seq(0.05, 5, by = 0.05)
for (lambda in lambda_seq) {
Z <- as.matrix(cbind(1, D$x, D$x^2))
Z_cross <- solve(t(Z) %*% Z + (lambda / N) * diag(d + 1)) %*% t(Z)
w_reg <- as.vector(Z_cross %*% as.matrix(D$y))

y_hat <- Z %*% w_reg


H <- Z %*% Z_cross
H_nn <- diag(H)
e <- ((y_hat - D$y) / (1 - H_nn))^2
E_cv <- c(E_cv, mean(e))
sigma_cv <- c(sigma_cv, sqrt(mean(e^2) - (mean(e))^2) / sqrt(N))
bound <- c(bound, mean(e) + sqrt(mean(e^2) - (mean(e))^2) / sqrt(N))
}

lambda_best1 <- lambda_seq[which.min(sigma_cv)]


which <- which(sigma_cv - min(sigma_cv) < min(sigma_cv))
lambda_best1 <- lambda_seq[which[length(which)]]
lambda_best2 <- lambda_seq[which.min(bound)]
lambda_best3 <- lambda_seq[which.min(E_cv)]

x <- runif(Ntest, min = -1, max = 1)


eps <- rnorm(Ntest)
y <- f(x, Qf, aq) + sigma * eps
Dtest <- data.frame(x = x, y = y)

# Refit on the training data D with each selected lambda, then evaluate on the test set.
Z <- as.matrix(cbind(1, D$x, D$x^2))
Z_cross <- solve(t(Z) %*% Z + (lambda_best1 / N) * diag(d + 1)) %*% t(Z)
w_reg <- as.vector(Z_cross %*% as.matrix(D$y))
Eout1 <- mean((as.matrix(cbind(1, Dtest$x, Dtest$x^2)) %*% w_reg - Dtest$y)^2)

Z_cross <- solve(t(Z) %*% Z + (lambda_best2 / N) * diag(d + 1)) %*% t(Z)
w_reg <- as.vector(Z_cross %*% as.matrix(D$y))
Eout2 <- mean((as.matrix(cbind(1, Dtest$x, Dtest$x^2)) %*% w_reg - Dtest$y)^2)

Z_cross <- solve(t(Z) %*% Z + (lambda_best3 / N) * diag(d + 1)) %*% t(Z)
w_reg <- as.vector(Z_cross %*% as.matrix(D$y))
Eout3 <- mean((as.matrix(cbind(1, Dtest$x, Dtest$x^2)) %*% w_reg - Dtest$y)^2)

return(c(Eout1, Eout2, Eout3))


}

set.seed(174)
Q <- 20
N_seq <- seq(2 * Q, 10 * Q, by = Q)
results <- matrix(NA, nrow = length(N_seq), ncol = 3)
for (i in 1:length(N_seq)) {
results[i, ] <- experiment5(Qf = 15, N = N_seq[i], sigma = 1, Ntest = 1000)
}

results <- as.data.frame(cbind(N_seq, results))
colnames(results) <- c("N", "Method1", "Method2", "Method3")

ggplot(results, aes(x = N, y = Method1, colour = "Method 1")) + geom_line() +


geom_line(aes(x = N, y = Method2, colour = "Method 2")) +
geom_line(aes(x = N, y = Method3, colour = "Method 3")) +
scale_color_manual("Method:", values = c("green", "red", "blue")) +
labs(x = "N", y = "Selection Method")

[Figure: out-of-sample error of the three selection methods versus N.]
We may see that these approaches give out-of-sample errors nearly identical to each other.

