Chapter 4
Pierre Paquay
Problem 4.1
[Figure: Phi(x) plotted against x on [-1, 1] for orders i = 0, 1, ..., 5.]
It is easy to see that as the order i increases, so does the complexity of the curve (in the sense that it is able
to fit more complex target functions).
Problem 4.2
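We may write
$$
\begin{aligned}
h(x) &= \begin{pmatrix} 1 & -1 & 1 \end{pmatrix} \begin{pmatrix} L_0(x) \\ L_1(x) \\ L_2(x) \end{pmatrix} \\
&= L_0(x) - L_1(x) + L_2(x) \\
&= \frac{3}{2}x^2 - x + \frac{1}{2}.
\end{aligned}
$$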
Problem 4.3
(a) We use the recursive definition of the Legendre polynomials to develop an algorithm to compute Lk (x)
given x.
Legendre <- function(x, k) {
if (k == 0)
return(1)
if (k == 1)
return(x)
else
return(((2 * k - 1) / k) * x * Legendre(x, k - 1) - ((k - 1) / k) * Legendre(x, k - 2))
}
[Figure: L_k(x) plotted against x on [-1, 1] for orders k = 0, 1, ..., 5.]
(b) We prove this fact by induction. For $k = 0$, we have $L_0(x) = 1$, which is a monomial of order 0. For $k = 1$, we have $L_1(x) = x$, which is a monomial of order 1. Now we assume that the result is true for all orders less than $k + 2$, and we prove it is still true for order $k + 2$. We also assume that $k$ is even (the odd case is proved in the same way). We have
$$
L_{k+2}(x) = \underbrace{\frac{2k+3}{k+2}}_{=c_1}\, x \cdot \underbrace{L_{k+1}(x)}_{=a_{k+1}x^{k+1} + a_{k-1}x^{k-1} + \cdots + a_1 x} - \underbrace{\frac{k+1}{k+2}}_{=c_0} \cdot \underbrace{L_k(x)}_{=b_k x^k + b_{k-2}x^{k-2} + \cdots + b_0}
$$
which is a linear combination of monomials all of even order with highest order $k + 2$. In this case we obviously have
$$
L_k(-x) = (-1)^k L_k(x).
$$
Now we assume that the result is true for all orders less than $k$, and we prove it is still true for $k$. We have that
$$
\begin{aligned}
\frac{x^2-1}{k}\frac{dL_k(x)}{dx}
&= \frac{x^2-1}{k}\left(\frac{2k-1}{k}L_{k-1}(x) + \frac{(2k-1)x}{k}\frac{dL_{k-1}(x)}{dx} - \frac{k-1}{k}\frac{dL_{k-2}(x)}{dx}\right) \\
&= \frac{(x^2-1)(2k-1)}{k^2}L_{k-1}(x) + \frac{(2k-1)(k-1)x}{k^2}\underbrace{\frac{x^2-1}{k-1}\frac{dL_{k-1}(x)}{dx}}_{=xL_{k-1}(x)-L_{k-2}(x)} - \frac{(k-1)(k-2)}{k^2}\underbrace{\frac{x^2-1}{k-2}\frac{dL_{k-2}(x)}{dx}}_{=xL_{k-2}(x)-L_{k-3}(x)} \\
&= \frac{(2k-1)(kx^2-1)}{k^2}L_{k-1}(x) - \frac{(k-1)(3kx-3x)}{k^2}L_{k-2}(x) + \frac{(k-1)(k-2)}{k^2}L_{k-3}(x) \\
&= x\underbrace{\left(\frac{2k-1}{k}xL_{k-1}(x) - \frac{k-1}{k}L_{k-2}(x)\right)}_{=L_k(x)} - \frac{2k-1}{k^2}L_{k-1}(x) - \frac{(k-1)^2}{k^2}\underbrace{\left(\frac{2k-3}{k-1}xL_{k-2}(x) - \frac{k-2}{k-1}L_{k-3}(x)\right)}_{=L_{k-1}(x)} \\
&= xL_k(x) - \frac{(2k-1) + (k-1)^2}{k^2}L_{k-1}(x) \\
&= xL_k(x) - L_{k-1}(x).
\end{aligned}
$$
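$$
\begin{aligned}
\frac{d}{dx}\left((x^2-1)\frac{dL_k(x)}{dx}\right)
&= \frac{d}{dx}\left(xkL_k(x) - kL_{k-1}(x)\right) \\
&= kL_k(x) + xk\frac{dL_k(x)}{dx} - k\frac{dL_{k-1}(x)}{dx} \\
&= kL_k(x) + \frac{k^2x^2}{x^2-1}L_k(x) - \frac{k^2x}{x^2-1}L_{k-1}(x) - \frac{k(k-1)}{x^2-1}xL_{k-1}(x) + \frac{k(k-1)}{x^2-1}L_{k-2}(x) \\
&= \frac{kx^2 - k + k^2x^2}{x^2-1}L_k(x) - \frac{k}{x^2-1}\left[(2k-1)xL_{k-1}(x) - (k-1)L_{k-2}(x)\right] \\
&= \frac{kx^2 - k + k^2x^2}{x^2-1}L_k(x) - \frac{k^2}{x^2-1}L_k(x) \\
&= \frac{k}{x^2-1}\left[(x^2-1) + kx^2 - k\right]L_k(x) \\
&= k(k+1)L_k(x).
\end{aligned}
$$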
and
$$
\frac{d}{dx}\left((1-x^2)\frac{dL_l(x)}{dx}\right) + l(l+1)L_l(x) = 0,
$$
now we multiply the first identity by $L_l(x)$ and the second by $L_k(x)$; if we subtract and integrate the two identities obtained, we get
$$
\int_{-1}^{1}\left[L_l(x)\frac{d}{dx}\left((1-x^2)\frac{dL_k(x)}{dx}\right) - L_k(x)\frac{d}{dx}\left((1-x^2)\frac{dL_l(x)}{dx}\right)\right]dx + [k(k+1) - l(l+1)]\int_{-1}^{1}L_k(x)L_l(x)\,dx = 0.
$$
Using integration by parts for the first integral, we get
$$
\underbrace{\left[L_l(x)(1-x^2)\frac{dL_k(x)}{dx}\right]_{-1}^{1}}_{=0} - \underbrace{\left[L_k(x)(1-x^2)\frac{dL_l(x)}{dx}\right]_{-1}^{1}}_{=0} - \underbrace{\int_{-1}^{1}\left((1-x^2)\frac{dL_l(x)}{dx}\frac{dL_k(x)}{dx} - (1-x^2)\frac{dL_k(x)}{dx}\frac{dL_l(x)}{dx}\right)dx}_{=0} = 0.
$$
Finally, since $k(k+1) \neq l(l+1)$ when $k \neq l$, we obtain
$$
\int_{-1}^{1}L_k(x)L_l(x)\,dx = 0.
$$
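$$
\begin{aligned}
A_k = \int_{-1}^{1}L_k^2(x)\,dx
&= \frac{2k-1}{k}\int_{-1}^{1}xL_k(x)L_{k-1}(x)\,dx - \frac{k-1}{k}\underbrace{\int_{-1}^{1}L_k(x)L_{k-2}(x)\,dx}_{=0} \\
&= \frac{(2k-1)(k+1)}{k(2k+1)}\underbrace{\int_{-1}^{1}L_{k+1}(x)L_{k-1}(x)\,dx}_{=0} + \frac{(2k-1)k}{k(2k+1)}\int_{-1}^{1}L_{k-1}^2(x)\,dx \\
&= \frac{2k-1}{2k+1}\int_{-1}^{1}L_{k-1}^2(x)\,dx.
\end{aligned}
$$
$$
\begin{aligned}
A_k &= \frac{2k-1}{2k+1}A_{k-1} \\
&= \frac{2k-1}{2k+1}\cdot\frac{2k-3}{2k-1}A_{k-2} \\
&= \frac{2k-1}{2k+1}\cdot\frac{2k-3}{2k-1}\cdots\frac{3}{5}\cdot\frac{1}{3}\underbrace{A_0}_{=2} \\
&= \frac{2}{2k+1}.
\end{aligned}
$$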
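The orthogonality and norm results above can be checked numerically. The following is a minimal sketch, not part of the original solution: `legendre_iter` and `inner` are hypothetical helper names, and the integrals are approximated with base R's `integrate`.

```r
# Iterative evaluation of L_k(x) via the three-term recursion (vectorized in x).
legendre_iter <- function(x, k) {
  prev <- rep(1, length(x))      # L_0(x)
  if (k == 0) return(prev)
  curr <- x                      # L_1(x)
  if (k == 1) return(curr)
  for (j in 2:k) {
    nxt <- ((2 * j - 1) / j) * x * curr - ((j - 1) / j) * prev
    prev <- curr
    curr <- nxt
  }
  curr
}

# Inner product of L_k and L_l on [-1, 1] by numerical quadrature.
inner <- function(k, l) {
  integrate(function(x) legendre_iter(x, k) * legendre_iter(x, l), -1, 1)$value
}

norms <- sapply(0:5, function(k) inner(k, k))   # compare with 2 / (2k + 1)
cross <- inner(2, 4)                            # compare with 0
```

Comparing `norms` with `2 / (2 * (0:5) + 1)` and `cross` with 0 gives a quick sanity check of parts (e) and (f) for small orders.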
Problem 4.4
The following code is an implementation of the experimental framework used to study various aspects of
overfitting.
Legendre2 <- function(x, q) {
vec <- rep(NA, q + 1)
for (k in 0:q) {
vec[k + 1] <- (choose(q, k))^2 * (x - 1)^(q - k) * (x + 1)^k / 2^q
}
return(sum(vec))
}
Lq[k + 1] <- Legendre2(x, k)
}
return(sum(aq * Lq))
}
f <- Vectorize(f, vectorize.args = "x")
y <- D$y
D2 <- data.frame(x = D$x, x_sq = D$x^2)
Z2 <- as.matrix(cbind(1, D2))
Z2_cross <- solve(t(Z2) %*% Z2) %*% t(Z2)
w2 <- as.vector(Z2_cross %*% y)
D10 <- data.frame(x = D$x, x_sq = D$x^2, x_cub = D$x^3, x_quad = D$x^4,
x_quint = D$x^5, x_six = D$x^6, x_seven = D$x^7,
x_eight = D$x^8, x_nine = D$x^9, x_ten = D$x^10)
Z10 <- as.matrix(cbind(1, D10))
Z10_cross <- solve(t(Z10) %*% Z10) %*% t(Z10)
w10 <- as.vector(Z10_cross %*% y)
return(c(Eout2, Eout10))
}
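$$
\begin{aligned}
E_{a,x}[f^2] &= E_x[E_{a|x}[f^2 \mid x]] \\
&= E_x\Big[\underbrace{\mathrm{Var}_{a|x}[f]}_{=\sum_q L_q^2(x)\,\mathrm{Var}_{a|x}[a_q]} + \big(\underbrace{E_{a|x}[f]}_{=\sum_q L_q(x)\,E_{a|x}[a_q]}\big)^2\Big] \\
&= \sum_{q=0}^{Q_f} E_x[L_q^2(x)],
\end{aligned}
$$
where we used $\mathrm{Var}_{a|x}[a_q] = 1$ and $E_{a|x}[a_q] = 0$.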
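As a small illustration of how this sum can be evaluated, the sketch below assumes that $x$ is uniform on $[-1, 1]$, so that $E_x[L_q^2(x)] = \frac{1}{2}\cdot\frac{2}{2q+1} = \frac{1}{2q+1}$ by the norm computed in Problem 4.3. The value `Qf = 20` and the variable names are illustrative choices only.

```r
# Assumes x is uniform on [-1, 1] and the a_q are independent standard normals.
Qf <- 20
q <- 0:Qf
second_moment <- sum(1 / (2 * q + 1))   # E[f^2] before any rescaling
norm_fac <- 1 / sqrt(second_moment)     # scaling the a_q by this gives E[f^2] = 1
```

Under these assumptions, multiplying the coefficients by `norm_fac` normalizes the target so that its second moment equals 1.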
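$g_2(x) = \tilde{g}_2(\Phi_2(x)) = \tilde{w}^T\Phi_2(x)$ (resp. $g_{10}(x) = \tilde{g}_{10}(\Phi_{10}(x)) = \tilde{w}^T\Phi_{10}(x)$).
$$
E_{out}(g_{10}) = E_{x,y}[(g_{10}(x) - y(x))^2] = E_{x,y}[(g_{10}(x) - f(x) - \sigma\epsilon)^2] = E_x[E_{y|x}[(g_{10}(x) - f(x) - \sigma\epsilon)^2 \mid x]].
$$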
(d) Below we plot the extent of overfitting depending on certain parameters of the learning problem. In the
first plot, we fix Qf = 20 to study the stochastic noise.
# Grid search with Qf = 20
Nexp <- 1000
grid <- expand.grid(N = seq(20, 120, by = 5), sigma_sq = seq(0, 2, by = 0.05))
E_out_Overfit <- foreach(i = 1:nrow(grid), .combine = "rbind") %dopar% {
set.seed(1975)
Eout_H2 <- numeric(Nexp)
Eout_H10 <- numeric(Nexp)
for (n in 1:Nexp) {
# use the full column name sigma_sq instead of relying on partial matching of grid$sigma
tmp <- experiment(Qf = 20, grid$N[i], sqrt(grid$sigma_sq[i]), Ntest = 100)
Eout_H2[n] <- tmp[1]
Eout_H10[n] <- tmp[2]
}
c(mean(Eout_H2), mean(Eout_H10))
}
Eout <- cbind(grid, E_out_Overfit)
colnames(Eout) <- c("N", "sigma_sq", "Eout_H2", "Eout_H10")
Eout["Overfit"] <- Eout$Eout_H10 - Eout$Eout_H2
Eout$Overfit <- ifelse(Eout$Overfit > 0.2, 0.2, Eout$Overfit)
Eout$Overfit <- ifelse(Eout$Overfit < -0.2, -0.2, Eout$Overfit)
[Figure: heat map of the overfit measure (Overfit, clipped to [-0.2, 0.2]) as a function of N (horizontal axis) and Sigma_sq (vertical axis).]
In the second plot, we fix $\sigma^2 = 0.1$ to study the deterministic noise.
# grid search with sigma_sq = 0.1
Nexp <- 200
grid <- expand.grid(Qf = seq(1, 80, by = 1), N = seq(20, 120, by = 5))
E_out_Overfit <- foreach(i = 1:nrow(grid), .combine = "rbind") %dopar% {
set.seed(1975)
Eout_H2 <- numeric(Nexp)
Eout_H10 <- numeric(Nexp)
for (n in 1:Nexp) {
tmp <- experiment(grid$Qf[i], grid$N[i], sqrt(0.1), Ntest = 10)
Eout_H2[n] <- tmp[1]
Eout_H10[n] <- tmp[2]
}
c(mean(Eout_H2), mean(Eout_H10))
}
Eout <- cbind(grid, E_out_Overfit)
colnames(Eout) <- c("Qf", "N", "Eout_H2", "Eout_H10")
Eout["Overfit"] <- Eout$Eout_H10 - Eout$Eout_H2
Eout$Overfit <- ifelse(Eout$Overfit > 0.2, 0.2, Eout$Overfit)
Eout$Overfit <- ifelse(Eout$Overfit < -0.2, -0.2, Eout$Overfit)
[Figure: heat map of the overfit measure (Overfit, clipped to [-0.2, 0.2]) as a function of N (horizontal axis) and Q_f (vertical axis).]
(e) We take the average over many experiments because we want estimates of the expected out-of-sample error for a given learning scenario $(Q_f, N, \sigma)$ using $H_2$ and $H_{10}$.
Problem 4.5
the theory of Lagrange multipliers tells us that this problem is equivalent to the following unconstrained optimization problem
$$
\min_w \left(E_{in}(w) - \lambda'_C w^Tw\right), \quad \lambda'_C \geq 0.
$$
If we let $\lambda_C = -\lambda'_C$, we get that the original constrained optimization problem is equivalent to minimizing the augmented error
$$
E_{aug}(w) = E_{in}(w) + \lambda_C w^Tw, \quad \lambda_C \leq 0.
$$
So, we may conclude that the soft order constraint corresponding to this problem is $w^Tw \geq C$.
Problem 4.6
which means that the $v_i$ are also eigenvectors of $A^{-2}$ with eigenvalues $1/\lambda_i^2$.
Now, let $v_i$ be the orthogonal eigenvectors of non-zero eigenvalues $\lambda_i$ of $Z^TZ$ (since $Z^TZ$ is invertible and symmetric). We have that
and
$$
||w_{lin}||^2 = y^TZ(Z^TZ)^{-2}Z^Ty = u^T(Z^TZ)^{-2}u
$$
where $u = Z^Ty$; if we let $V = (v_0, \cdots, v_Q)$ be the orthogonal matrix of eigenvectors, we get
$$
V^TZ^TZV = \mathrm{diag}(\lambda_i)
$$
and
$$
V^T(Z^TZ + \lambda I)V = V^TZ^TZV + \lambda V^TV = \mathrm{diag}(\lambda_i + \lambda).
$$
If we expand $u$ in the eigenbasis of $Z^TZ$, we get that $u = \sum_i \alpha_i v_i$ and
$$
\begin{aligned}
||w_{reg}||^2 &= \sum_{i,j}\alpha_i\alpha_j v_i^T(Z^TZ + \lambda I)^{-2}v_j \\
&= \sum_{i,j}\alpha_i\alpha_j\frac{1}{(\lambda_i + \lambda)^2}v_i^Tv_j \\
&= \sum_i\frac{\alpha_i^2}{(\lambda_i + \lambda)^2} \\
&\leq \sum_i\frac{\alpha_i^2}{\lambda_i^2} = \sum_{i,j}\alpha_i\alpha_j v_i^T(Z^TZ)^{-2}v_j = ||w_{lin}||^2;
\end{aligned}
$$
for the above inequality to be true, we have to note that since $Z^TZ$ is (at least) positive semidefinite, its eigenvalues are non-negative.
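The inequality $||w_{reg}||^2 \leq ||w_{lin}||^2$ can be illustrated numerically. The sketch below is only an illustration under assumed settings: the polynomial features, the sine target, and all sizes are hypothetical and do not come from the problem statement.

```r
set.seed(1)
N <- 30; Q <- 4                      # hypothetical sizes
x <- runif(N, -1, 1)
Z <- outer(x, 0:Q, "^")              # polynomial features 1, x, ..., x^Q
y <- sin(pi * x) + 0.3 * rnorm(N)    # arbitrary noisy target

ZtZ <- crossprod(Z)                  # Z^T Z
u <- crossprod(Z, y)                 # Z^T y
sq_norm <- function(lambda) {
  w <- solve(ZtZ + lambda * diag(Q + 1), u)   # w_reg (w_lin when lambda = 0)
  sum(w^2)
}
lambdas <- c(0, 0.01, 0.1, 1, 10)
norms <- sapply(lambdas, sq_norm)    # first entry is ||w_lin||^2
```

The entries of `norms` should not exceed the first one, which corresponds to $\lambda = 0$.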
Problem 4.7
Here, for our (N × d) matrix Z, we assume that N > d, and in this case U is a (N × d) orthogonal matrix, Γ
is a (d × d) diagonal matrix and V is a (d × d) orthogonal matrix. We begin by noting that
$$
Z^TZ = V\Gamma U^TU\Gamma V^T = V\Gamma^2V^T.
$$
$$
\begin{aligned}
Hy &= Z(Z^TZ)^{-1}Z^Ty \\
&= U\Gamma V^T(V^T)^{-1}\Gamma^{-2}V^{-1}V\Gamma U^Ty \\
&= UU^Ty;
\end{aligned}
$$
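$$
\begin{aligned}
H(\lambda)y &= Z(Z^TZ + \lambda I)^{-1}Z^Ty \\
&= U\Gamma V^T(V\Gamma^2V^T + \lambda I)^{-1}V\Gamma U^Ty \\
&= U\Gamma V^T[V\underbrace{(\Gamma^2 + \lambda I)}_{=\mathrm{diag}(\sigma_i^2+\lambda)}V^T]^{-1}V\Gamma U^Ty \\
&= U\Gamma V^T(V^T)^{-1}\,\mathrm{diag}\left(\frac{1}{\sigma_i^2 + \lambda}\right)V^{-1}V\Gamma U^Ty \\
&= U\,\mathrm{diag}\left(\frac{\sigma_i^2}{\sigma_i^2 + \lambda}\right)U^Ty.
\end{aligned}
$$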
and consequently
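$$
\begin{aligned}
E_{in}(w_{reg}) &= \frac{1}{N}y^T(I - H(\lambda))^2y \\
&= \frac{1}{N}y^T(I - H(\lambda))^T(I - H(\lambda))y \\
&= \frac{1}{N}\left[y^T(I - H)y + 2y^T(I - H)U\,\mathrm{diag}\left(1 - \frac{\sigma_i^2}{\sigma_i^2+\lambda}\right)U^Ty + y^TU\,\mathrm{diag}\left(1 - \frac{\sigma_i^2}{\sigma_i^2+\lambda}\right)U^TU\,\mathrm{diag}\left(1 - \frac{\sigma_i^2}{\sigma_i^2+\lambda}\right)U^Ty\right] \\
&= \frac{1}{N}\left[y^T(I - H)y + y^TU\,\mathrm{diag}\left(1 - \frac{\sigma_i^2}{\sigma_i^2+\lambda}\right)^2U^Ty + 2y^T\underbrace{(I - H)U}_{=U - HU = U - UU^TU = 0}\mathrm{diag}\left(1 - \frac{\sigma_i^2}{\sigma_i^2+\lambda}\right)U^Ty\right] \\
&= E_{in}(w_{lin}) + \frac{1}{N}\sum_i a_i^2\left(1 - \frac{\sigma_i^2}{\sigma_i^2+\lambda}\right)^2.
\end{aligned}
$$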
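The SVD form of $H(\lambda)$ derived in this problem can be compared with the direct formula on random data. This is a minimal sketch with hypothetical sizes and a hypothetical value of $\lambda$; it uses base R's `svd`, which returns the thin decomposition $Z = U\,\mathrm{diag}(\sigma_i)V^T$.

```r
set.seed(2)
N <- 20; d <- 3; lambda <- 0.7               # hypothetical values
Z <- matrix(rnorm(N * d), N, d)
s <- svd(Z)                                  # Z = U diag(s$d) V^T

# Direct formula H(lambda) = Z (Z^T Z + lambda I)^{-1} Z^T
H_direct <- Z %*% solve(crossprod(Z) + lambda * diag(d)) %*% t(Z)
# SVD formula H(lambda) = U diag(sigma_i^2 / (sigma_i^2 + lambda)) U^T
H_svd <- s$u %*% diag(s$d^2 / (s$d^2 + lambda)) %*% t(s$u)

max_diff <- max(abs(H_direct - H_svd))       # should be at rounding level
```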
Problem 4.8
Problem 4.9
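now we construct a virtual example $(z_i, 0)$ where $z_i = \sqrt{\lambda}\gamma_i$ for $i = 1, \cdots, k$. If $D = \{(z'_1, y_1), \cdots, (z'_N, y_N)\}$, this means that the matrix for the augmented data is
$$
Z_{aug} = \begin{pmatrix} - \; z_1'^T \; - \\ \vdots \\ - \; z_N'^T \; - \\ - \; z_1^T \; - \\ \vdots \\ - \; z_k^T \; - \end{pmatrix} = \begin{pmatrix} Z \\ \sqrt{\lambda}\,\Gamma \end{pmatrix}
$$
and
$$
y_{aug} = \begin{pmatrix} y_1 \\ \vdots \\ y_N \\ 0 \\ \vdots \\ 0 \end{pmatrix} = \begin{pmatrix} y \\ 0 \end{pmatrix}.
$$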
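(b) If we solve the least squares problem with $Z_{aug}$ and $y_{aug}$, we get
$$
\begin{aligned}
w_{lin} &= (Z_{aug}^TZ_{aug})^{-1}Z_{aug}^Ty_{aug} \\
&= \left[(Z^T \mid \sqrt{\lambda}\,\Gamma^T)\begin{pmatrix} Z \\ \sqrt{\lambda}\,\Gamma \end{pmatrix}\right]^{-1}(Z^T \mid \sqrt{\lambda}\,\Gamma^T)\begin{pmatrix} y \\ 0 \end{pmatrix} \\
&= (Z^TZ + \lambda\Gamma^T\Gamma)^{-1}Z^Ty = w_{reg}.
\end{aligned}
$$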
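The equivalence between least squares on the augmented data and the Tikhonov-regularized solution can be checked on random data. The sketch below uses hypothetical sizes and a random square matrix as a stand-in for $\Gamma$ (so $k = d$ virtual examples).

```r
set.seed(3)
N <- 25; d <- 4; lambda <- 2                 # hypothetical values
Z <- matrix(rnorm(N * d), N, d)
y <- rnorm(N)
Gamma <- matrix(rnorm(d * d), d, d)          # stand-in Tikhonov matrix

# Augmented data: append the virtual examples (sqrt(lambda) * gamma_i, 0)
Z_aug <- rbind(Z, sqrt(lambda) * Gamma)
y_aug <- c(y, rep(0, nrow(Gamma)))

w_aug <- solve(crossprod(Z_aug), crossprod(Z_aug, y_aug))                  # plain least squares
w_reg <- solve(crossprod(Z) + lambda * crossprod(Gamma), crossprod(Z, y))  # regularized solution
```

The two weight vectors should agree up to rounding error.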
Problem 4.10
(a) If $w_{lin}^T\Gamma^T\Gamma w_{lin} \leq C$, then obviously $w_{reg} = w_{lin}$.
(b) If $w_{lin}^T\Gamma^T\Gamma w_{lin} > C$, then we have that $w_{reg}^T\Gamma^T\Gamma w_{reg} = C$ (see the book illustration).
(c) The original constrained problem is equivalent to solving the following unconstrained problem with Lagrange multipliers,
$$
\min_w \underbrace{\left(E_{in}(w) - \lambda_C(-w^T\Gamma^T\Gamma w + C)\right)}_{=L(w,\lambda_C)}
$$
and consequently
$$
\lambda_C = -\frac{1}{2C}w_{reg}^T\nabla E_{in}(w_{reg}).
$$
(d) (i) If $w_{lin}^T\Gamma^T\Gamma w_{lin} \leq C$, we know that $w_{reg} = w_{lin}$, and consequently $\nabla E_{in}(w_{reg}) = 0$, which implies that $\lambda_C = 0$.
(ii) If $w_{lin}^T\Gamma^T\Gamma w_{lin} > C$, let us assume that $\lambda_C = 0$; this means that $w_{reg}$ minimizes
$$
E_{in}(w) - \lambda_C(-w^T\Gamma^T\Gamma w + C) = E_{in}(w),
$$
so we have $w_{reg} = w_{lin}$ and
$$
w_{reg}^T\Gamma^T\Gamma w_{reg} = w_{lin}^T\Gamma^T\Gamma w_{lin} > C,
$$
which is not possible since $w_{reg}^T\Gamma^T\Gamma w_{reg} \leq C$ by definition. In conclusion, we have that $\lambda_C > 0$.
(iii) As $w_{lin}^T\Gamma^T\Gamma w_{lin} > C$, we have that $\lambda_C > 0$, which means that $w_{reg}^T\nabla E_{in}(w_{reg}) < 0$. Now, if we compute the derivative relative to $C$, we get
$$
\frac{d\lambda_C}{dC} = \frac{1}{2C^2}w_{reg}^T\nabla E_{in}(w_{reg}) < 0.
$$
Problem 4.11
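$$
\begin{aligned}
\bar{g}(x) &= E_D[g^D(x)] \\
&= E_D[\Phi(x)^Tw_{lin}] \\
&= \Phi(x)^Tw_f + E_D[\Phi(x)^T(Z^TZ)^{-1}Z^T\epsilon] \\
&= \Phi(x)^Tw_f + E_Z[E_{y|Z}[\Phi(x)^T(Z^TZ)^{-1}Z^T\epsilon \mid Z]] \\
&= \Phi(x)^Tw_f + E_Z[\Phi(x)^T(Z^TZ)^{-1}Z^T\underbrace{E_{y|Z}[\epsilon \mid Z]}_{=E_\epsilon[\epsilon]=0}] \\
&= \Phi(x)^Tw_f = f(x),
\end{aligned}
$$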
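where we have used the cyclic property of the trace. This allows us to write that
$$
\begin{aligned}
\mathrm{var} &= E_x[\mathrm{var}(x)] \\
&= \sigma^2\,\mathrm{trace}(E_Z[E_x[\Phi(x)\Phi(x)^T(Z^TZ)^{-1}]]) \\
&= \sigma^2\,\mathrm{trace}(E_Z[\underbrace{E_x[\Phi(x)\Phi(x)^T]}_{=\Sigma_\Phi}(Z^TZ)^{-1}]) \\
&= \frac{\sigma^2}{N}\,\mathrm{trace}\left(\Sigma_\Phi E_Z\left[\left(\frac{1}{N}Z^TZ\right)^{-1}\right]\right).
\end{aligned}
$$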
(c) We know by the law of large numbers that $\frac{1}{N}Z^TZ$ converges in probability to $\Sigma_\Phi$; this implies that $(\frac{1}{N}Z^TZ)^{-1}$ converges in probability to $\Sigma_\Phi^{-1}$. With that in mind, to the first order in $1/N$, we have that
$$
\mathrm{var} \approx \frac{\sigma^2}{N}\,\mathrm{trace}(\Sigma_\Phi\Sigma_\Phi^{-1}) = \frac{\sigma^2(Q+1)}{N}.
$$
Problem 4.12
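$$
\begin{aligned}
\bar{g}(x) &= E_D[g^D(x)] \\
&= E_D[\Phi(x)^Tw_{reg}] \\
&= E_D[\Phi(x)^T(w_f - \lambda(Z^TZ + \lambda I)^{-1}w_f + (Z^TZ + \lambda I)^{-1}Z^T\epsilon)] \\
&= E_Z[\Phi(x)^Tw_f - \lambda\Phi(x)^T(Z^TZ + \lambda I)^{-1}w_f + \Phi(x)^T(Z^TZ + \lambda I)^{-1}Z^T\underbrace{E_{y|Z}[\epsilon \mid Z]}_{=0}] \\
&= \Phi(x)^Tw_f - \lambda\Phi(x)^TE_Z[(Z^TZ + \lambda I)^{-1}]w_f.
\end{aligned}
$$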
Thus, thanks to the cyclic property of the trace, the bias(x) is equal to
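$$
\begin{aligned}
\mathrm{bias} &= E_x[\mathrm{bias}(x)] \\
&= \lambda^2\,\mathrm{trace}(\underbrace{E_x[\Phi(x)\Phi(x)^T]}_{=I}E_Z[(Z^TZ + \lambda I)^{-1}]w_fw_f^TE_Z[(Z^TZ + \lambda I)^{-1}]) \\
&= \lambda^2\,\mathrm{trace}(\underbrace{E_Z[(Z^TZ + \lambda I)^{-1}]}_{\approx\frac{1}{N+\lambda}I}w_fw_f^T\underbrace{E_Z[(Z^TZ + \lambda I)^{-1}]}_{\approx\frac{1}{N+\lambda}I}) \\
&\approx \frac{\lambda^2}{(N+\lambda)^2}\underbrace{\mathrm{trace}(w_fw_f^T)}_{=\mathrm{trace}(w_f^Tw_f)=||w_f||^2} \\
&\approx \frac{\lambda^2}{(N+\lambda)^2}||w_f||^2,
\end{aligned}
$$
since $Z^TZ \approx N\Sigma_\Phi = NI$.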
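Now, if we compute var(x), we get
$$
\begin{aligned}
\mathrm{var} &= E_x[\mathrm{var}(x)] \\
&\approx \sigma^2E_Z[\mathrm{trace}(\underbrace{E_x[\Phi(x)\Phi(x)^T]}_{=I}(Z^TZ + \lambda I)^{-1}Z^TZ(Z^TZ + \lambda I)^{-1})] \\
&\approx \sigma^2E_Z[\mathrm{trace}(\underbrace{I}_{\approx\frac{1}{N}Z^TZ}(Z^TZ + \lambda I)^{-1}Z^TZ(Z^TZ + \lambda I)^{-1})] \\
&\approx \frac{\sigma^2}{N}E_Z[\mathrm{trace}(Z(Z^TZ + \lambda I)^{-1}Z^TZ(Z^TZ + \lambda I)^{-1}Z^T)] \\
&\approx \frac{\sigma^2}{N}E_Z[\mathrm{trace}(H(\lambda)^2)].
\end{aligned}
$$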
Problem 4.13
(a) When $\lambda = 0$, we have $H(0) = Z(Z^TZ)^{-1}Z^T$ and $H(0)^2 = Z(Z^TZ)^{-1}Z^TZ(Z^TZ)^{-1}Z^T = H(0)$, which means that
$$
\mathrm{trace}(H(0)) = \mathrm{trace}(H(0)^2) = \mathrm{trace}(Z^TZ(Z^TZ)^{-1}) = \mathrm{trace}(I_{\tilde{d}+1}) = \tilde{d} + 1.
$$
for (ii), we get
$$
d_{eff}(0) = \tilde{d} + 1,
$$
and for (iii), we get
$$
d_{eff}(0) = \tilde{d} + 1.
$$
(b) Here again, for our $(N \times (\tilde{d}+1))$ matrix $Z$, we assume that $N > \tilde{d}+1$, and in this case $Z = U\Gamma V^T$ where $U$ is an $(N \times (\tilde{d}+1))$ orthogonal matrix, $\Gamma$ is a $((\tilde{d}+1) \times (\tilde{d}+1))$ diagonal matrix and $V$ is a $((\tilde{d}+1) \times (\tilde{d}+1))$ orthogonal matrix. From Problem 4.7, we know that
$$
Z^TZ = V\Gamma^2V^T \quad\text{and}\quad H(\lambda) = U\,\mathrm{diag}\left(\frac{\sigma_i^2}{\sigma_i^2+\lambda}\right)U^T;
$$
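by the cyclic property of the trace. Obviously, if $\lambda$ increases, $d_{eff}$ decreases. Now, we consider (iii); here we have
$$
0 \leq d_{eff} = \mathrm{trace}(H(\lambda)^2) = \mathrm{trace}\left(U^TU\,\mathrm{diag}\left(\frac{\sigma_i^4}{(\sigma_i^2+\lambda)^2}\right)\right) = \sum_{i=0}^{\tilde{d}}\frac{\sigma_i^4}{(\sigma_i^2+\lambda)^2} \leq \sum_{i=0}^{\tilde{d}}1 = \tilde{d}+1;
$$
here also, if $\lambda$ increases, $d_{eff}$ decreases. Finally, we consider (i), and we get
$$
0 \leq d_{eff} = 2\sum_{i=0}^{\tilde{d}}\frac{\sigma_i^2}{\sigma_i^2+\lambda} - \sum_{i=0}^{\tilde{d}}\frac{\sigma_i^4}{(\sigma_i^2+\lambda)^2} = \sum_{i=0}^{\tilde{d}}\frac{\sigma_i^4 + 2\sigma_i^2\lambda}{(\sigma_i^2+\lambda)^2} \leq \sum_{i=0}^{\tilde{d}}1 = \tilde{d}+1;
$$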
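The three expressions for $d_{eff}$ discussed above can be tabulated from the singular values of a random $Z$. This is an illustrative sketch only: the sizes and the grid of $\lambda$ values are hypothetical, and `p` stands for $\tilde{d}+1$.

```r
set.seed(4)
N <- 40; p <- 5                              # p plays the role of d-tilde + 1
Z <- matrix(rnorm(N * p), N, p)
s2 <- svd(Z)$d^2                             # squared singular values sigma_i^2

deff <- function(lambda) {
  r <- s2 / (s2 + lambda)                    # non-zero eigenvalues of H(lambda)
  c(i = 2 * sum(r) - sum(r^2),               # (i)   2 trace(H) - trace(H^2)
    ii = sum(r),                             # (ii)  trace(H)
    iii = sum(r^2))                          # (iii) trace(H^2)
}
lambdas <- c(0, 0.1, 1, 10, 100)
tab <- t(sapply(lambdas, deff))              # one row per lambda
```

Each column of `tab` should start at `p` for $\lambda = 0$ and decrease as $\lambda$ grows, staying within $[0, \tilde{d}+1]$.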
Problem 4.14
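$$
\begin{aligned}
E_{in}(w_{reg}) &= \frac{1}{N}y^T(I - H(\lambda))^2y \\
&= \frac{1}{N}(f^T + \epsilon^T)(I - H(\lambda))^2(f + \epsilon) \\
&= \frac{1}{N}[f^T(I - H(\lambda))^2f + 2f^T(I - H(\lambda))^2\epsilon + \epsilon^T(I - H(\lambda))^2\epsilon].
\end{aligned}
$$
$$
\begin{aligned}
E_\epsilon[E_{in}(w_{reg})] &= \frac{1}{N}[f^T(I - H(\lambda))^2f + 2f^T(I - H(\lambda))^2\underbrace{E_\epsilon[\epsilon]}_{=0} + E_\epsilon[\epsilon^T(I - H(\lambda))^2\epsilon]] \\
&= \frac{1}{N}[f^T(I - H(\lambda))^2f + E_\epsilon[\mathrm{trace}(\epsilon\epsilon^T(I - H(\lambda))^2)]] \\
&= \frac{1}{N}[f^T(I - H(\lambda))^2f + \mathrm{trace}(\underbrace{E_\epsilon[\epsilon\epsilon^T]}_{=\mathrm{diag}(\sigma^2)}(I - H(\lambda))^2)] \\
&= \frac{1}{N}f^T(I - H(\lambda))^2f + \frac{\sigma^2}{N}\mathrm{trace}((I - H(\lambda))^2);
\end{aligned}
$$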
moreover, we also have that
$$
\mathrm{trace}((I - H(\lambda))^2) = \underbrace{\mathrm{trace}(I_N)}_{=N} - 2\,\mathrm{trace}(H(\lambda)) + \mathrm{trace}(H(\lambda)^2) = N - d_{eff}(\lambda),
$$
Problem 4.15
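Here also, for our $(N \times (d+1))$ matrix $\tilde{Z}$, we assume that $N > d+1$, and in this case $\tilde{Z} = USV^T$ where $U$ is an $(N \times (d+1))$ orthogonal matrix, $S$ is a $((d+1) \times (d+1))$ diagonal matrix and $V$ is a $((d+1) \times (d+1))$ orthogonal matrix. As $\tilde{Z} = Z\Gamma^{-1}$, we have $Z = \tilde{Z}\Gamma$; in this case, we also have that
$$
\begin{aligned}
H(\lambda) &= \cdots \\
&= US(\underbrace{S^TS}_{=S^2} + \lambda I)^{-1}SU^T \\
&= U\,\mathrm{diag}\left(\frac{s_i^2}{s_i^2+\lambda}\right)U^T
\end{aligned}
$$
$$
\begin{aligned}
\mathrm{trace}(H(\lambda)) &= \mathrm{trace}\left(\underbrace{U^TU}_{=I}\mathrm{diag}\left(\frac{s_i^2}{s_i^2+\lambda}\right)\right) \\
&= \sum_{i=0}^{d}\frac{s_i^2}{s_i^2+\lambda} \\
&= \sum_{i=0}^{d}\left(\frac{s_i^2+\lambda}{s_i^2+\lambda} - \frac{\lambda}{s_i^2+\lambda}\right) \\
&= d + 1 - \sum_{i=0}^{d}\frac{\lambda}{s_i^2+\lambda},
\end{aligned}
$$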
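and also that
$$
\begin{aligned}
\mathrm{trace}(H(\lambda)^2) &= \mathrm{trace}\left(U^TU\,\mathrm{diag}\left(\frac{s_i^4}{(s_i^2+\lambda)^2}\right)\right) \\
&= \sum_{i=0}^{d}\frac{s_i^4}{(s_i^2+\lambda)^2} \\
&= \sum_{i=0}^{d}\left(\frac{s_i^4 + 2\lambda s_i^2 + \lambda^2}{(s_i^2+\lambda)^2} - \frac{2\lambda s_i^2 + \lambda^2}{(s_i^2+\lambda)^2}\right) \\
&= d + 1 - \sum_{i=0}^{d}\frac{2\lambda s_i^2 + \lambda^2}{(s_i^2+\lambda)^2}.
\end{aligned}
$$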
Problem 4.16
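$$
\begin{aligned}
E_{aug}(w) &= \frac{1}{N}||Zw - y||^2 + \frac{\lambda}{N}w^T\Gamma^T\Gamma w \\
&= \frac{1}{N}(w^TZ^TZw - 2y^TZw + y^Ty) + \frac{\lambda}{N}w^T\Gamma^T\Gamma w
\end{aligned}
$$
where we assume that $\lambda > 0$. If we take the gradient of the previous expression, we get
$$
\nabla E_{aug}(w) = \frac{2}{N}(Z^TZw - Z^Ty + \lambda\Gamma^T\Gamma w).
$$
The critical point is found by solving the equation $\nabla E_{aug}(w) = 0$, which gives us
$$
w = (Z^TZ + \lambda\Gamma^T\Gamma)^{-1}Z^Ty
$$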
provided that $\Gamma$ is of full rank (since in this case $\Gamma^T\Gamma$ is positive definite, which consequently makes $Z^TZ + \lambda\Gamma^T\Gamma$ positive definite and thus invertible). For this $w$ to be $w_{reg}$, we must show that it is actually a minimum; to do that we compute the Hessian, that is
$$
\nabla^2E_{aug}(w) = \frac{2}{N}(Z^TZ + \lambda\Gamma^T\Gamma)
$$
which is positive definite; this means that $w_{reg} = w$.
(a) We have that
$$
\hat{y} = Zw_{reg} = Z(Z^TZ + \lambda\Gamma^T\Gamma)^{-1}Z^Ty = H(\lambda)y.
$$
Problem 4.17
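$$
\begin{aligned}
\frac{1}{N}\sum_{n=1}^{N}(w^T\hat{x}_n - y_n)^2 &= \frac{1}{N}\sum_{n=1}^{N}[(w^Tx_n - y_n) + w^T\epsilon_n]^2 \\
&= \frac{1}{N}\sum_{n=1}^{N}(w^Tx_n - y_n)^2 + \frac{2}{N}\sum_{n=1}^{N}(w^Tx_n - y_n)w^T\epsilon_n + \frac{1}{N}\sum_{n=1}^{N}(w^T\epsilon_n)^2 \\
&= E_{in}(w) + \frac{2}{N}\sum_{n=1}^{N}(w^Tx_n - y_n)w^T\epsilon_n + \frac{1}{N}\sum_{n=1}^{N}(w^T\epsilon_n)^2.
\end{aligned}
$$
$$
\begin{aligned}
\hat{E}_{in}(w) &= E_{\epsilon_1\cdots\epsilon_N}\left[\frac{1}{N}\sum_{n=1}^{N}(w^T\hat{x}_n - y_n)^2\right] \\
&= E_{in}(w) + \frac{2}{N}\sum_{n=1}^{N}(w^Tx_n - y_n)w^TE_{\epsilon_1\cdots\hat{\epsilon}_n\cdots\epsilon_N}[\underbrace{E_{\epsilon_n}[\epsilon_n]}_{=0}] + \frac{1}{N}\sum_{n=1}^{N}w^TE_{\epsilon_1\cdots\hat{\epsilon}_n\cdots\epsilon_N}[\underbrace{E_{\epsilon_n}[\epsilon_n\epsilon_n^T]}_{=\sigma_x^2I}]w \\
&= E_{in}(w) + \frac{\sigma_x^2}{N}\sum_{n=1}^{N}w^Tw \\
&= E_{in}(w) + \sigma_x^2w^Tw.
\end{aligned}
$$
Here, the parameters for the Tikhonov regularizer are $\Gamma = I$ and $\lambda = N\sigma_x^2$.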
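The identity $\hat{E}_{in}(w) = E_{in}(w) + \sigma_x^2w^Tw$ can be illustrated with a Monte Carlo average over the input noise. The sketch below uses arbitrary data, an arbitrary fixed weight vector and a hypothetical noise level; it assumes the $\epsilon_n$ are independent Gaussian vectors with covariance $\sigma_x^2I$.

```r
set.seed(5)
N <- 50; d <- 3; sigma_x <- 0.4              # hypothetical values
X <- matrix(rnorm(N * d), N, d)
w <- c(1, -2, 0.5)                           # an arbitrary fixed weight vector
y <- rnorm(N)

Ein <- mean((X %*% w - y)^2)                 # noiseless in-sample error
noisy_Ein <- function() {
  E <- matrix(rnorm(N * d, sd = sigma_x), N, d)   # input noise epsilon_n, one row per point
  mean(((X + E) %*% w - y)^2)
}
mc <- mean(replicate(20000, noisy_Ein()))    # Monte Carlo estimate of the expectation
theory <- Ein + sigma_x^2 * sum(w^2)         # E_in(w) + sigma_x^2 w^T w
```

With this many replications, `mc` and `theory` should agree to within Monte Carlo error.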
Problem 4.18
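and from Problem 3.14 that
$$
E_D[w_{lin}^Tx] = f(x).
$$
We may now write that
$$
\bar{g}(x) = E_D[g^D(x)] = \frac{1}{1+\lambda}E_D[w_{lin}^Tx] = \frac{1}{1+\lambda}f(x);
$$
and consequently
$$
\mathrm{bias}(x) = (\bar{g}(x) - f(x))^2 = \frac{\lambda^2}{(1+\lambda)^2}f(x)^2.
$$
We are now able to compute the bias, and we get
$$
\begin{aligned}
\mathrm{bias} &= E_x[\mathrm{bias}(x)] \\
&= \frac{\lambda^2}{(1+\lambda)^2}w_f^T\underbrace{E_x[xx^T]}_{=I}w_f \\
&= \frac{\lambda^2}{(1+\lambda)^2}||w_f||^2.
\end{aligned}
$$
$$
\begin{aligned}
\mathrm{var} &= E_x[\mathrm{var}(x)] \\
&= \frac{\sigma^2}{(1+\lambda)^2}E_x[\underbrace{x^TE_X[(X^TX)^{-1}]x}_{=\mathrm{trace}(xx^TE_X[(X^TX)^{-1}])}] \\
&= \frac{\sigma^2}{(1+\lambda)^2}\mathrm{trace}(\underbrace{E_x[xx^T]}_{=I}E_X[(X^TX)^{-1}]) \\
&= \frac{\sigma^2}{N(1+\lambda)^2}\mathrm{trace}\Big(\underbrace{E_X\Big[\Big(\frac{1}{N}X^TX\Big)^{-1}\Big]}_{\approx\Sigma^{-1}=I_{d+1}}\Big) \\
&\approx \frac{\sigma^2(d+1)}{N(1+\lambda)^2}
\end{aligned}
$$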
$$
\begin{aligned}
E_D[E_{out}(w)] &= \sigma^2 + \mathrm{bias} + \mathrm{var} \\
&\approx \sigma^2 + \frac{\lambda^2}{(1+\lambda)^2}||w_f||^2 + \frac{\sigma^2(d+1)}{N(1+\lambda)^2} \\
&\approx \sigma^2 + \frac{1}{N}\cdot\frac{N\lambda^2||w_f||^2 + \sigma^2(d+1)}{(1+\lambda)^2};
\end{aligned}
$$
to determine the optimal regularization parameter, we set the derivative with respect to $\lambda$ to zero, and we get
$$
\lambda^* = \frac{(d+1)/N}{||w_f||^2/\sigma^2}
$$
and
$$
y = \sigma\left(X\frac{w_f}{\sigma} + \frac{\epsilon}{\sigma}\right),
$$
so $\lambda^*$ can be read as the ratio between $(d+1)/N$, the dimension relative to the number of data points, and the $\sigma$-regularized weight norm $||w_f||^2/\sigma^2$. This means that if the number of dimensions $(d+1)$ is large compared to the number $N$ of data points, the regularization parameter $\lambda^*$ will be large as well; and if $\sigma^2$ is small compared to $||w_f||^2$, the regularization parameter $\lambda^*$ will be small as well.
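The closed form for $\lambda^*$ can be compared with a numerical minimization of the approximate expected error. The sketch below uses hypothetical values for $N$, $d$, $\sigma$ and $||w_f||^2$; `expected_Eout` is an illustrative helper name, and the search interval passed to base R's `optimize` is an arbitrary choice.

```r
# Approximate expected out-of-sample error as a function of lambda
expected_Eout <- function(lambda, N, d, sigma, wf_norm2) {
  sigma^2 + (lambda^2 * wf_norm2 + sigma^2 * (d + 1) / N) / (1 + lambda)^2
}

N <- 100; d <- 9; sigma <- 0.5; wf_norm2 <- 2      # hypothetical values
lambda_star <- ((d + 1) / N) / (wf_norm2 / sigma^2) # closed form from the derivation

# One-dimensional numerical minimization over lambda in [0, 5]
opt <- optimize(expected_Eout, interval = c(0, 5), N = N, d = d,
                sigma = sigma, wf_norm2 = wf_norm2)
```

Here `opt$minimum` should be close to `lambda_star`.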
Problem 4.19
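(a) First, we note that the lasso algorithm is equivalent to the following minimization problem
$$
\min_w \frac{1}{N}\underbrace{||Xw - y||^2}_{=(w^TX^TXw - 2y^TXw + y^Ty)} \quad\text{subject to}\quad \sum_{i=0}^{d}|w_i| \leq C.
$$
To formulate the above problem into a quadratic program, we split each $w_i$ as $w_i = w_i^+ - w_i^-$ where
$$
w_i^+ = \frac{|w_i| + w_i}{2} \geq 0 \quad\text{and}\quad w_i^- = \frac{|w_i| - w_i}{2} \geq 0;
$$
in this case, we have $w = w^+ - w^-$ with
$$
w^+ = \begin{pmatrix} w_0^+ \\ \vdots \\ w_d^+ \end{pmatrix} \quad\text{and}\quad w^- = \begin{pmatrix} w_0^- \\ \vdots \\ w_d^- \end{pmatrix}.
$$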
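Thus, the lasso algorithm may be formulated as the following quadratic program
$$
\min_{(w^+, w^-)} \frac{1}{2}(w^{+T}, w^{-T})VV^T\begin{pmatrix} w^+ \\ w^- \end{pmatrix} + d^T\begin{pmatrix} w^+ \\ w^- \end{pmatrix}
\quad\text{subject to}\quad A\begin{pmatrix} w^+ \\ w^- \end{pmatrix} \leq C, \quad \begin{pmatrix} w^+ \\ w^- \end{pmatrix} \geq 0
$$
where
$$
V = \sqrt{2}\begin{pmatrix} X^T \\ -X^T \end{pmatrix}, \quad d = \begin{pmatrix} -2X^Ty \\ 2X^Ty \end{pmatrix}, \quad\text{and}\quad A = (1, \cdots, 1 \mid 1, \cdots, 1).
$$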
return(Eout)
}
Now, we plot the out of sample error Eout versus the regularization parameter C.
C_grid <- seq(0.01, 100, by = 0.5)
E_out_comp <- foreach(i = 1:length(C_grid), .combine = "rbind") %dopar% {
set.seed(1975)
tmp <- experiment2(Qf = 20, N = 1000, sigma = 0.1, Ntest = 100,
C = C_grid[i], d = 6)
tmp
}
Eout <- data.frame(C = C_grid, Eout = E_out_comp[, 1])
ggplot(Eout, aes(x = C, y = Eout)) + geom_line(col = "red")
[Figure: Eout plotted against C for C between 0 and 100.]
In the plot above, the minimum Eout is obtained for C = 26.01.
(b) The augmented error for the lasso is
$$
E_{aug}(w) = E_{in}(w) + \lambda\sum_{i=0}^{d}|w_i|.
$$
It is actually more convenient to optimize since this is an unconstrained problem as opposed to the original lasso problem.
(c) Here we compare the number of non-zero weights from the lasso versus the quadratic penalty for d = 5
and N = 3.
experiment3 <- function(Qf, N, sigma, deg, grid) {
aq <- rnorm(Qf + 1)
norm <- rep(0, Qf + 1)
for (q in 0:Qf)
norm[q + 1] <- 1 / (2 * q + 1)
norm_fac <- 1 / sqrt(sum(norm))
aq <- norm_fac * aq
D <- data.frame(x = xn, y = yn)
set.seed(10)
grid <- 10^seq(1, -2, length = 100)
# pass the polynomial degree by its full argument name (deg) rather than the abbreviation d
Num_nz_weights <- cbind(grid, experiment3(Qf = 20, N = 3, sigma = 1, deg = 5, grid))
ggplot(Num_nz_weights, aes(x = grid, y = ridge)) + geom_line(aes(colour = "Quadratic")) +
geom_line(aes(x = grid, y = lasso, colour = "Lasso")) +
scale_color_manual("Type:", values = c("red", "green"))
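[Figure: number of non-zero weights plotted against the regularization grid for the lasso and the quadratic penalty (legend "Type:": Lasso, Quadratic).]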
Problem 4.20
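(a) We know that the optimal weights for the transformed problem are
$$
\tilde{w} = (Z^TZ)^{-1}Z^T\tilde{y}
$$
where
$$
Z = \begin{pmatrix} - \; z_1^T \; - \\ \vdots \\ - \; z_n^T \; - \end{pmatrix} = \begin{pmatrix} - \; x_1^TA^T \; - \\ \vdots \\ - \; x_n^TA^T \; - \end{pmatrix} = XA^T \quad\text{and}\quad \tilde{y} = \alpha y.
$$
We may now write that
$$
\begin{aligned}
\tilde{w} &= (Z^TZ)^{-1}Z^T\tilde{y} \\
&= (AX^TXA^T)^{-1}AX^T\alpha y \\
&= \alpha(A^T)^{-1}(X^TX)^{-1}A^{-1}AX^Ty \\
&= \alpha(A^T)^{-1}w
\end{aligned}
$$
since $w = (X^TX)^{-1}X^Ty$.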
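The relation $\tilde{w} = \alpha(A^T)^{-1}w$ can be checked on random data. In the sketch below the data, the scaling $\alpha$ and the invertible matrix $A$ are all hypothetical choices made for illustration.

```r
set.seed(6)
N <- 30; d <- 3; alpha <- 2.5                # hypothetical values
X <- matrix(rnorm(N * d), N, d)
y <- rnorm(N)
A <- matrix(c(2, 1, 0,
              0, 1, 1,
              1, 0, 3), 3, 3)                # a fixed invertible matrix (determinant 7)

Z <- X %*% t(A)                              # rows are z_n^T = x_n^T A^T
w <- solve(crossprod(X), crossprod(X, y))    # least squares on the original data
w_tilde <- solve(crossprod(Z), crossprod(Z, alpha * y))   # least squares on transformed data
w_pred <- alpha * solve(t(A), w)             # alpha (A^T)^{-1} w
```

The vectors `w_tilde` and `w_pred` should coincide up to rounding error.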
(b) In this case, we know from Problem 4.16 that
Problem 4.21
As $h(x)$ is a linear function, we immediately have that $\partial^2h(x)/\partial x^2 = 0$; this implies that
$$
\Omega(h) = \int\left(\frac{\partial^2h(x)}{\partial x^2}\right)^2dx = 0;
$$
and consequently $\Gamma = 0$.
Problem 4.22
Here, we have a data set with $N = 100$ points and a validation set of $K = 25$ points. We consider $M = 100$ models $H_1, \cdots, H_M$ each with VC-dimension $d_{VC} = 10$.
In the first case, each model $H_m$ produces a final hypothesis $g_m^-$ trained on the $N - K = 75$ training points; from these hypotheses, we select the one with the minimum validation error, $g_{m^*}^-$, whose validation error is 0.25. We know that
$$
E_{out}(g_{m^*}) \leq E_{out}(g_{m^*}^-) \leq E_{val}(g_{m^*}^-) + \sqrt{\frac{1}{2K}\ln\frac{2M}{\delta}}
$$
where $g_{m^*}$ is the chosen final hypothesis trained on the entire data set, since we selected our final hypothesis $g_{m^*}^-$ from a finite hypothesis set $H_{val} = \{g_1^-, \cdots, g_M^-\}$. So, a bound on the out-of-sample error is given by
$$
E_{val}(g_{m^*}^-) + \sqrt{\frac{1}{2K}\ln\frac{2M}{\delta}} = 0.25 + \sqrt{\frac{1}{50}\ln\frac{200}{\delta}};
$$
thus we may write that
$$
E_{out}(g_{m^*}) \leq 0.25 + \sqrt{\frac{1}{50}\ln\frac{200}{\delta}}
$$
with probability at least $1 - \delta$.
In the second case, each model $H_m$ produces a final hypothesis $g_m$ trained on the entire data set; from these hypotheses, we select the one with the minimum in-sample error, $g_{m^*}$, whose in-sample error is 0.15. Here we must be careful: each $g_m$ was selected (by minimizing $E_{in}$) on its hypothesis set $H_m$, and $g_{m^*}$ is chosen as having the minimum $E_{in}$ of these $g_m$, so this is equivalent to selecting $g_{m^*}$ as having the minimum $E_{in}$ in all of $H_1 \cup \cdots \cup H_M$, which is no longer a simple finite hypothesis set. Hence, we know from the VC generalization bound that
$$
E_{out}(g_{m^*}) \leq E_{in}(g_{m^*}) + \sqrt{\frac{8}{N}\ln\left(\frac{4((2N)^{d_{VC}(\cup_mH_m)} + 1)}{\delta}\right)}
$$
where we know from Problem 2.14 that
$$
d_{VC}(\cup_mH_m) \leq M(d_{VC} + 1) = 1100.
$$
So, a bound on the out-of-sample error is given by
$$
E_{in}(g_{m^*}) + \sqrt{\frac{8}{N}\ln\left(\frac{4((2N)^{d_{VC}(\cup_mH_m)} + 1)}{\delta}\right)} = 0.15 + \sqrt{\frac{8}{100}\ln\left(\frac{4(200^{1100} + 1)}{\delta}\right)};
$$
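To get a feel for the two bounds, they can be evaluated for a particular confidence level. The sketch below assumes $\delta = 0.05$, which is not specified in the text; since $200^{1100}$ overflows double precision, the logarithm is expanded by hand and the negligible "+ 1" inside it is dropped.

```r
delta <- 0.05                                   # hypothetical confidence parameter
M <- 100; K <- 25; N <- 100; dvc_union <- 1100

# First case: validation bound for a finite hypothesis set of size M
bound_val <- 0.25 + sqrt(log(2 * M / delta) / (2 * K))

# Second case: log(4 * ((2N)^dvc + 1) / delta), approximated without the "+ 1"
log_term <- log(4) + dvc_union * log(2 * N) - log(delta)
bound_vc <- 0.15 + sqrt(8 / N * log_term)
```

Under this assumption the validation bound is informative (below 1), while the VC bound on the union is far above 1 and therefore vacuous.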
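Problem 4.23
$$
\begin{aligned}
\mathrm{Var}_D[E_{cv}] &= \mathrm{Var}_D\left[\frac{1}{N}\sum_ne_n\right] \\
&= \frac{1}{N^2}\mathrm{Var}_D\left[\sum_ne_n\right] \\
&= \frac{1}{N^2}\sum_n\mathrm{Var}_D[e_n] + \frac{1}{N^2}\sum_{n\neq m}\mathrm{Cov}_D[e_n, e_m].
\end{aligned}
$$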
(b) As
$$
e_n = e(g^{(N-2)} + \delta_n, y_n) = e(g^{(N-2)}, y_n) + o(\delta_n),
$$
we may write that
First, we consider (1), we get
$$
\begin{aligned}
(1) &= E_{D^{(N-2)}}[E_{(x_n,y_n),(x_m,y_m)|D^{(N-2)}}[e(g^{(N-2)}, y_n)e(g^{(N-2)}, y_m)]] \\
&= E_{D^{(N-2)}}[(E_{(x_n,y_n)|D^{(N-2)}}[e(g^{(N-2)}, y_n)])^2] \\
&= E_{D^{(N-2)}}[(E_{out}(g^{(N-2)}))^2].
\end{aligned}
$$
$$
\begin{aligned}
(2) &= E_{D^{(N-2)}}[E_{(x_n,y_n)|D^{(N-2)}}[e(g^{(N-2)}, y_n)]]\,E_{D^{(N-2)}}[E_{(x_m,y_m)|D^{(N-2)}}[e(g^{(N-2)}, y_m)]] \\
&= (E_{D^{(N-2)}}[E_{out}(g^{(N-2)})])^2.
\end{aligned}
$$
$$
\begin{aligned}
\mathrm{Cov}_D[e_n, e_m] &= E_{D^{(N-2)}}[(E_{out}(g^{(N-2)}))^2] - (E_{D^{(N-2)}}[E_{out}(g^{(N-2)})])^2 + o(\delta_n) + o(\delta_m) + o(\delta_n\delta_m) \\
&= \mathrm{Var}_{D^{(N-2)}}[E_{out}(g^{(N-2)})] + o(\delta_n) + o(\delta_m) + o(\delta_n\delta_m).
\end{aligned}
$$
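$$
\begin{aligned}
\mathrm{Var}_D[E_{cv}] &= \frac{1}{N^2}\sum_n\underbrace{\mathrm{Var}_D[e_n]}_{=\mathrm{Var}_D[e_1]} + \frac{1}{N^2}\sum_{n\neq m}\underbrace{\mathrm{Cov}_D[e_n, e_m]}_{=\mathrm{Var}_{D^{(N-2)}}[E_{out}(g^{(N-2)})] + O(\frac{1}{N})} \\
&= \frac{1}{N}\mathrm{Var}_D[e_1] + \frac{N-1}{N}\underbrace{\mathrm{Var}_{D^{(N-2)}}[E_{out}(g^{(N-2)})]}_{\approx\mathrm{Var}_D[E_{out}(g)] + O(\frac{1}{N})} + O\left(\frac{1}{N}\right) \\
&\approx \frac{1}{N}\mathrm{Var}_D[e_1] + \mathrm{Var}_D[E_{out}(g)] + O\left(\frac{1}{N}\right).
\end{aligned}
$$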
Problem 4.24
(a) Here, we use linear regression with weight decay regularization to estimate wf with wreg in the cases where
N ∈ {d + 15, d + 25, · · · , d + 115}; for each N value we also compute the cross validation errors e1 , · · · , eN
and Ecv .
d <- 3
sigma <- 0.5
return(D)
}
y_gen <- function(D) {
y <- apply(D, 1, function(x) sum(wf * c(1, as.numeric(x))) + sigma * rnorm(1))
return(y)
}
crossval_error <- function(N, lambda) {
D <- dataset_gen(N)
y <- y_gen(D)
e <- rep(NA, N)
for (n in 1:N) {
X_n <- as.matrix(cbind(1, D[-n, ]))
X_n_cross <- solve(t(X_n) %*% X_n + (lambda / N) * diag(d + 1)) %*% t(X_n)
wreg_n <- as.vector(X_n_cross %*% as.matrix(y[-n]))
e[n] <- (sum(c(1, as.numeric(D[n, ])) * wreg_n) - y[n])^2
}
Ecv <- mean(e)
return(results)
}
Now, we repeat the above experiment 5000 times maintaining the average and variance over the experiments
of e1 , e2 and Ecv .
set.seed(10)
iter <- 5000
lambda <- 0.05
results <- matrix(NA, nrow = 33, ncol = iter)
for (i in 1:iter) {
results[, i] <- experiment4(lambda)
}
mean_res <- apply(results, 1, mean)
var_res <- apply(results, 1, var)
final_res <- cbind(seq(d + 15, d + 115, by = 10),
as.data.frame(matrix(mean_res, nrow = 11)),
as.data.frame(matrix(var_res, nrow = 11)))
colnames(final_res) <- c("N", "Avg_e1", "Avg_e2", "Avg_Ecv", "Var_e1", "Var_e2", "Var_Ecv")
[Figure: averages of e1, e2 and Ecv plotted against N.]
It is pretty obvious that the mean values of e1 , e2 , and Ecv are tracking each other.
(c) Since the en ’s are not independent, the contributors to the variance of e1 are the other en ’s.
(d) If the cross validation errors were truly independent, we would have that (see Problem 4.23)
$$
\mathrm{Var}_D[E_{cv}] = \frac{1}{N^2}\sum_n\mathrm{Var}_D[e_n] = \frac{1}{N}\mathrm{Var}_D[e_1].
$$
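The independent case in (d) can be illustrated with a small simulation. The sketch below draws truly independent errors (squared standard normals, an arbitrary choice unrelated to the experiment above) and compares the variance of one error with the variance of their average.

```r
set.seed(7)
N <- 25; reps <- 20000                        # hypothetical sizes
# Independent squared errors: one row per simulated data set, one column per e_n
e <- matrix(rnorm(N * reps)^2, nrow = reps)
var_e1 <- var(e[, 1])                         # variance of a single error e_1
var_mean <- var(rowMeans(e))                  # variance of the average of N errors
ratio <- var_e1 / var_mean                    # close to N when the e_n are independent
```

When the errors are independent, `ratio` should be close to `N`; dependence between the $e_n$ would pull it below `N`.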
(e) The ratio of the variance of the e1 ’s to that of the Ecv ’s is given by
since in this context en and em are only “slightly” dependent, their covariance is close to 0, so the above
ratio is close to N .
ggplot(final_res, aes(x = N, y = Var_e1 / Var_Ecv)) + geom_line(colour = "red") +
geom_line(aes(x = N, y = N)) +
labs(x = "N", y = "N_eff")
[Figure: N_eff plotted against N.]
(f ) Increasing the amount of regularization should have no notable effect on Nef f since in this case, the
norm of wreg is more restricted, but this has no relation to the effective number of fresh examples used in
computing the cross validation error.
set.seed(10)
iter <- 5000
lambda <- 2.5
results2 <- matrix(NA, nrow = 33, ncol = iter)
for (i in 1:iter) {
results2[, i] <- experiment4(lambda)
}
mean_res2 <- apply(results2, 1, mean)
var_res2 <- apply(results2, 1, var)
final_res2 <- cbind(seq(d + 15, d + 115, by = 10),
as.data.frame(matrix(mean_res2, nrow = 11)),
as.data.frame(matrix(var_res2, nrow = 11)))
colnames(final_res2) <- c("N", "Avg_e1", "Avg_e2", "Avg_Ecv", "Var_e1", "Var_e2", "Var_Ecv")
[Figure: N_eff plotted against N.]
Problem 4.25
(a) No, in this case, there are no guarantees that we will get the VC-bound we obtained when using the same
validation set for all models.
(b) As explained in the theory, since the validation model $H_{val}$ was obtained before ever looking at the data in the validation set, the process of model selection is equivalent to learning a hypothesis from $H_{val}$ using the data in $D_{val}$. In this case, we may apply the VC bound for finite hypothesis sets.
(c) We know from the proof of the Hoeffding inequality and point (b) that for each $m = 1, \cdots, M$,
$$
P[E_{out}(m) - E_{val}(m) > \epsilon] \leq e^{-2\epsilon^2K_m}
$$
for all $\epsilon > 0$. A reasoning similar to the one that led us to (1.6) gives us that
$$
\begin{aligned}
P[E_{out}(m^*) - E_{val}(m^*) > \epsilon] &\leq P[E_{out}(1) - E_{val}(1) > \epsilon] + \cdots + P[E_{out}(M) - E_{val}(M) > \epsilon] \\
&\leq \sum_{m=1}^{M}e^{-2\epsilon^2K_m}.
\end{aligned}
$$
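Now, if we let
$$
\kappa(\epsilon) = -\frac{1}{2\epsilon^2}\ln\left(\frac{1}{M}\sum_{m=1}^{M}e^{-2\epsilon^2K_m}\right),
$$
we get
$$
\begin{aligned}
Me^{-2\epsilon^2\kappa(\epsilon)} &= Me^{\ln\left(\frac{1}{M}\sum_me^{-2\epsilon^2K_m}\right)} \\
&= \sum_{m=1}^{M}e^{-2\epsilon^2K_m};
\end{aligned}
$$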
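in this case, we actually obtain
$$
P[E_{out}(m^*) > E_{val}(m^*) + \epsilon] \leq Me^{-2\epsilon^2\kappa(\epsilon)}.
$$
Moreover, we may note that $\kappa(\epsilon) \geq 0$: since $-2\epsilon^2K_m \leq 0$, we have $e^{-2\epsilon^2K_m} \leq 1$, and so $\frac{1}{M}\sum_me^{-2\epsilon^2K_m} \leq 1$, and finally $\kappa(\epsilon) \geq 0$.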
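(d) It is easy to see that
$$
P[E_{out}(m^*) \leq E_{val}(m^*) + \epsilon] = 1 - P[E_{out}(m^*) > E_{val}(m^*) + \epsilon] \geq 1 - Me^{-2\epsilon^2\kappa(\epsilon)}
$$
for all $\epsilon > 0$. If $\epsilon^*$ satisfies $\epsilon^* \geq \sqrt{\frac{\ln(M/\delta)}{2\kappa(\epsilon^*)}}$, we get that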
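(e) We begin by proving the first inequality. Since $\min_mK_m \leq K_m$ for all $1 \leq m \leq M$, we have that
$$
\begin{aligned}
\kappa(\epsilon) &= \frac{1}{2\epsilon^2}\left(-\ln\left(\frac{1}{M}\sum_{m=1}^{M}e^{-2\epsilon^2K_m}\right)\right) \\
&\leq \frac{1}{2\epsilon^2}\left(-\frac{1}{M}\sum_{m=1}^{M}\ln(e^{-2\epsilon^2K_m})\right) \\
&\leq \frac{1}{2\epsilon^2}\cdot\frac{1}{M}\sum_{m=1}^{M}2\epsilon^2K_m = \frac{1}{M}\sum_{m=1}^{M}K_m
\end{aligned}
$$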
where $K = \frac{1}{M}\sum_mK_m$, when models use the same validation set size. It is easy to note that since we proved that $\kappa(\epsilon) \leq \frac{1}{M}\sum_mK_m = K$, we immediately have that
$$
\sqrt{\frac{1}{2\kappa(\epsilon^*)}\ln\frac{M}{\delta}} \geq \sqrt{\frac{1}{2K}\ln\frac{M}{\delta}},
$$
which means that the bound is better when all models use the same validation set size.
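The properties of $\kappa(\epsilon)$ used in this problem can be checked numerically. The sketch below uses three hypothetical validation set sizes and a hypothetical $\epsilon$; `kappa` is an illustrative function name implementing the definition given in (c).

```r
# kappa(eps) = -(1 / (2 eps^2)) * ln( (1/M) * sum_m exp(-2 eps^2 K_m) )
kappa <- function(eps, K) -log(mean(exp(-2 * eps^2 * K))) / (2 * eps^2)

K <- c(10, 25, 50)        # hypothetical validation set sizes K_m
M <- length(K)
eps <- 0.1                # hypothetical tolerance
k_val <- kappa(eps, K)

lhs <- M * exp(-2 * eps^2 * k_val)     # M e^{-2 eps^2 kappa(eps)}
rhs <- sum(exp(-2 * eps^2 * K))        # sum_m e^{-2 eps^2 K_m}
```

For these values, `lhs` and `rhs` coincide, and `k_val` lies between `min(K)` and `mean(K)`, in line with the inequalities in (e).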
Problem 4.26
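$$
Z = \begin{pmatrix} z_1^T \\ \vdots \\ z_N^T \end{pmatrix},
$$
we are then able to write that
$$
Z^TZ = (z_1, \cdots, z_N)\begin{pmatrix} z_1^T \\ \vdots \\ z_N^T \end{pmatrix} = \sum_{n=1}^{N}z_nz_n^T
$$
and
$$
Z^Ty = (z_1, \cdots, z_N)\begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix} = \sum_{n=1}^{N}z_ny_n.
$$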
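Moreover, we also have
$$
\begin{aligned}
H(\lambda) &= ZA(\lambda)^{-1}Z^T \\
&= \begin{pmatrix} z_1^T \\ \vdots \\ z_N^T \end{pmatrix}A(\lambda)^{-1}(z_1, \cdots, z_N) \\
&= \begin{pmatrix} z_1^T \\ \vdots \\ z_N^T \end{pmatrix}(A(\lambda)^{-1}z_1, \cdots, A(\lambda)^{-1}z_N) \\
&= \begin{pmatrix} z_1^TA(\lambda)^{-1}z_1 & \cdots & z_1^TA(\lambda)^{-1}z_N \\ \vdots & & \vdots \\ z_N^TA(\lambda)^{-1}z_1 & \cdots & z_N^TA(\lambda)^{-1}z_N \end{pmatrix},
\end{aligned}
$$
which implies that $H_{nm}(\lambda) = z_n^TA(\lambda)^{-1}z_m$. If now we leave the data point $(z_n, y_n)$ out, $Z^TZ$ becomes
$$
(z_1, \cdots, \hat{z}_n, \cdots, z_N)\begin{pmatrix} z_1^T \\ \vdots \\ \hat{z}_n^T \\ \vdots \\ z_N^T \end{pmatrix} = Z^TZ - z_nz_n^T,
$$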
and $Z^Ty$ becomes
$$
(z_1, \cdots, \hat{z}_n, \cdots, z_N)\begin{pmatrix} y_1 \\ \vdots \\ \hat{y}_n \\ \vdots \\ y_N \end{pmatrix} = Z^Ty - z_ny_n.
$$
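$$
\begin{aligned}
w_n^- &= \left(A^{-1} + \frac{A^{-1}z_nz_n^TA^{-1}}{1 - z_n^TA^{-1}z_n}\right)(Z^Ty - z_ny_n) \\
&= \underbrace{A^{-1}Z^Ty}_{=w} - A^{-1}z_ny_n + \frac{A^{-1}z_nz_n^TA^{-1}}{1 - H_{nn}}Z^Ty - \frac{A^{-1}z_nz_n^TA^{-1}}{1 - H_{nn}}z_ny_n \\
&= w - \frac{1}{1 - H_{nn}}\left(A^{-1}z_ny_n - A^{-1}z_nz_n^TA^{-1}z_ny_n - A^{-1}z_nz_n^TA^{-1}Z^Ty + A^{-1}z_nz_n^TA^{-1}z_ny_n\right) \\
&= w - \frac{1}{1 - H_{nn}}A^{-1}z_n(y_n - \underbrace{z_n^TA^{-1}Z^Ty}_{=z_n^Tw=\hat{y}_n}) \\
&= w + \frac{(\hat{y}_n - y_n)A^{-1}z_n}{1 - H_{nn}}.
\end{aligned}
$$
$$
\begin{aligned}
z_n^Tw_n^- &= z_n^T\left(w + \frac{(\hat{y}_n - y_n)A^{-1}z_n}{1 - H_{nn}}\right) \\
&= \underbrace{z_n^Tw}_{=\hat{y}_n} + \frac{\hat{y}_n - y_n}{1 - H_{nn}}\underbrace{z_n^TA^{-1}z_n}_{=H_{nn}} \\
&= \frac{\hat{y}_n - H_{nn}y_n}{1 - H_{nn}}.
\end{aligned}
$$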
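The closed-form leave-one-out prediction $z_n^Tw_n^- = (\hat{y}_n - H_{nn}y_n)/(1 - H_{nn})$ can be compared with a brute-force refit that drops each point in turn. The sketch below is an illustration on random data and assumes $A(\lambda) = Z^TZ + \lambda I$ (that is, $\Gamma = I$); the sizes and $\lambda$ are hypothetical.

```r
set.seed(8)
N <- 15; p <- 3; lambda <- 0.5                  # hypothetical values
Z <- cbind(1, matrix(rnorm(N * (p - 1)), N, p - 1))
y <- rnorm(N)

A <- crossprod(Z) + lambda * diag(p)            # assumed form A(lambda) = Z^T Z + lambda I
H <- Z %*% solve(A, t(Z))                       # H(lambda) = Z A^{-1} Z^T
y_hat <- as.vector(H %*% y)
loo_formula <- (y_hat - diag(H) * y) / (1 - diag(H))   # closed-form z_n^T w_n^-

# Brute force: refit without point n, then predict at z_n
loo_brute <- sapply(1:N, function(n) {
  Zn <- Z[-n, , drop = FALSE]
  w_n <- solve(crossprod(Zn) + lambda * diag(p), crossprod(Zn, y[-n]))
  sum(Z[n, ] * w_n)
})
```

The vectors `loo_formula` and `loo_brute` should agree up to rounding error, so the cross validation errors can be obtained from a single fit.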
(e) We immediately obtain
Problem 4.27
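(a) We know that the sample standard deviation is a biased estimator of the real standard deviation, so we divide by $\sqrt{N}$ to make our $\sigma_{cv}$ less biased.
(b) We have that
$$
\begin{aligned}
N\sigma_{cv}^2 &= \mathrm{var}(e_1, \cdots, e_N) \\
&= \frac{1}{N}\sum_{n=1}^{N}e_n^2 - \left(\frac{1}{N}\sum_{n=1}^{N}e_n\right)^2 \\
&= \frac{1}{N}\sum_{n=1}^{N}\left(\frac{\hat{y}_n - y_n}{1 - H_{nn}}\right)^4 - E_{cv}^2,
\end{aligned}
$$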
(c) Below, we implement the experimental design to compare the different approaches.
experiment5 <- function(Qf, N, sigma, Ntest) {
aq <- rnorm(Qf + 1)
norm <- rep(0, Qf + 1)
for (q in 0:Qf)
norm[q + 1] <- 1 / (2 * q + 1)
norm_fac <- 1 / sqrt(sum(norm))
aq <- norm_fac * aq
d <- 2
E_cv <- numeric()
sigma_cv <- numeric()
bound <- numeric()
lambda_seq <- seq(0.05, 5, by = 0.05)
for (lambda in lambda_seq) {
Z <- as.matrix(cbind(1, D$x, D$x^2))
Z_cross <- solve(t(Z) %*% Z + (lambda / N) * diag(d + 1)) %*% t(Z)
w_reg <- as.vector(Z_cross %*% as.matrix(D$y))
set.seed(174)
Q <- 20
N_seq <- seq(2 * Q, 10 * Q, by = Q)
results <- matrix(NA, nrow = length(N_seq), ncol = 3)
for (i in 1:length(N_seq)) {
results[i, ] <- experiment5(Qf = 15, N = N_seq[i], sigma = 1, Ntest = 1000)
}
results <- as.data.frame(cbind(N_seq, results))
colnames(results) <- c("N", "Method1", "Method2", "Method3")
[Figure: comparison of the selection methods (legend "Method:": Method 1, Method 2, Method 3).]