x2 = x1 2 + x2 2 + + xN 2 = xT x
(1.1)
x + y x + y ,
for
x, y RN
(1.2)
Unless otherwise specied, from now on the notation x will denote the Euclidean norm. The CauchySchwarz inequality for the Euclidean norm reads:
Contents
T
x y xy
MoorePenrose Pseudoinverse, 18
Condition Number, 22
ReducedRank Approximation, 23
10
11
12
13
14
15
KarhunenLo`
eve Transform, 45
16
17
x1
x
2
where x =
..
.
xN
where equality is achieved when y is any scalar multiple of x, that is, y = cx. The
angle between the two vectors x, y is dened through:
cos =
19
20
QR Factorization, 64
21
xT y
Ax
= sup Ax
x=1
x=0 x
A = sup
(1.5)
We will see later that the Euclidean matrix norm A2 is equal to the largest
singular value of the SVD decomposition of A, or equivalently, the squareroot of
the largest eigenvalue of the matrix ATA or the matrix AAT . The L1 and L matrix
norms can be expressed directly in terms of the matrix elements Aij of A:
A1 = max
A = max
i
(1.6)
Another useful matrix normnot derivable from a vector normis the Frobenius norm dened to be the sum of the squares of all the matrix elements:
AF =
Aij 2 = tr ATA)
i, j
(1.4)
xy
18
(1.3)
(Frobenius norm)
(1.7)
The L2 , L1 , L , and the Frobenius matrix norms satisfy the matrix versions of
the triangle and CauchySchwarz inequalities:
solution (2.2) is obtained via the backslash or the pseudoinverse operators (which
produce the same answer in the fullrank case):
A + B A + B
c = B\b = pinv(B)b = B+ b
(1.8)
AB AB
The distance between two vectors, or between two matrices, may be dened with
respect to any norm:
d(x, y)= x y ,
d(A, B)= A B
(1.9)
(2.3)
The matrix BT B RMM is called the Grammian. Its matrix elements are the
mutual dot products of the basis vectors (BT B)ij = bT
i bj , i, j = 1, 2, . . . , M .
The quantity P = BB+ = B(BT B)1 BT is the projection matrix onto the subspace Y. As a projection matrix, it is idempotent and symmetric, that is, P2 = P
and PT = P. The matrix Q = IN P is also a projection matrix, projecting onto
the orthogonal complement of Y, that is, the space Y of vectors in RN that are
orthogonal to each vector in Y. Thus, we have:
c1
c
2
= [b1 , b2 , . . . , bM ]
.. = Bc
.
cM
b=
ci bi = c1 b1 + c2 b2 + cM bM
i=1
(2.1)
BT b = BT Bc
c = (BT B)1 BT b = B+ b ,
B+ (BT B)1 BT
Y Y = R(B) N(BT )= RN
The orthogonal decomposition theorem follows from (2.5). It states that a given
vector in RN can be decomposed uniquely with respect to a subspace Y into the
sum of a vector that lies in Y and a vector that lies in Y , that is, for b RN :
b = b + b ,
where b Y ,
b Y
(2.6)
so that bT
b = 0. The proof is trivial; dening b = Pb and b = Qb, we have:
b = IN b = (P + Q)b = Pb + Qb = b + b
The uniqueness is argued as follows: setting b + b = b + b for a different
pair b Y, b Y , we have b b = b b , which implies that both difference
vectors lie in Y Y = {0}, and therefore, they must be the zero vector.
Fig. 2.1 illustrates this theorem. An alternative proof is to expand b in the Bbasis, that is, b = Bc, and require that b = b b be perpendicular to Y, that is,
BT b = 0. Thus, we get the conditions:
b = Bc + b
(2.5)
(2.2)
The space spanned by the linear combinations of the columns of the matrix B
is called the column space or range space of B and is denoted by R(B). Because B
is a basis for Y, we will have Y = R(B). The matrix equation Bc = b given in (2.1)
is an overdetermined system of N equations in M unknowns that has a solution
because we assumed that b lies in the range space of B.
The quantity B+ = (BT B)1 BT is a special case of the MoorePenrose pseudoinverse (for the case of a full rank matrix B with N M.) In MATLAB notation, the
Indeed,
(2.4)
BT b = BT Bc + BT b = BT Bc ,
4
or,
c = (BT B)1 BT b ,
b = Bc = B(BT B)1 BT b = Pb
(2.7)
A variation of the orthogonal decomposition theorem is the orthogonal projection theorem, which states that the projection b is that vector in Y that lies closest
to b with respect to the Euclidean distance, that is, as the vector y Y varies over
Y, the distance b y is minimized when y = b .
Fig. 2.2 illustrates the theorem. The proof is straightforward. We have b y =
b + b y = (b y)+b , but since both b and y lie in Y, so does (b y) and
therefore, (b y) b . It follows from the Pythagorean theorem that:
2
UT U = IM
(orthonormal basis)
(2.10)
(2.11)
There are many ways to construct the orthonormal basis U starting with B. One
is through the SVD implemented into the function orth. Another is through the QRfactorization, which is equivalent to the GramSchmidt orthogonalization process.
The two alternatives are:
U = orth(B);
U = qr(B,0);
% SVDbased
% QRfactorization
1.52
1.60
B=
2.08
2.00
The basis B for the subspace Y is not unique. Any other set of M linearly
independent vectors in Y would do. The projector P remains invariant under a
change of basis. Indeed, suppose that another basis is dened by the basis matrix
U = [u1 , u2 , . . . , uM ] whose M columns ui are assumed to be linearly independent.
Then, each bj can be expanded as a linear combination of the new basis vectors ui :
bj =
j = 1, 2, . . . , M
ui cij ,
4.30
4.30
3.70
3.70
The matrix B has rank 3, but nonorthogonal columns. The two orthogonal bases
obtained via the SVD and via the QR factorization are as follows:
2.11
2.05
2.69
2.75
(2.8)
i=1
0 .5
0 .5
B=
0 .5
0 .5
0.4184
0.4404
B=
0.5726
0.5505
0 .5
0 .5
0 .5
0.5
0.5093
0.4904
0.4875
0.5122
0 .5
3.60
0 .5
0.48
0.5
0.08
0.5
4.80
0.64
0.06
0.5617
3.6327
0.5617
0.0000
0.4295
0.0000
0.4295
8.00
0.60 = U1 C1
0.00
4.8400
0.1666
0.0000
7.8486
0.1729 = U2 C2
1.6520
B = UC
(base change)
(2.9)
1 T T
P = B(BT B)1 BT = UC CT (UT U)C
C U
1 T 1 T T T
C U = U(UT U)1 UT
= UC C (U U) C
T
[U1,S1,V1] = svd(B,0);
[U2,C2] = qr(B,0);
C1 = S1*V1;
% alternatively, U1 = orth(B);
The orthogonal bases satisfy U1T U1 = U2T U2 = I3 , and C2 is upper triangular. The
projection matrices onto Y and Y are:
1
1
P = U1 U1T =
4 1
1
1
3
1
1
1
1
3
1
1
,
1
3
1
1
Q = I4 P =
41
1
1
1
1
1
1
1
1
1
1
1
1
1
where C
denotes the inverse of the transposed matrix C . Among the possible
bases for Y, a convenient one is to choose the M vectors ui to have unit norm and
The ranks of P, Q are the dimensions of the subspaces Y, Y , that is, 3 and 1.
The solution is obtained by setting the gradient of the performance index to zero:
An NM matrix A RNM of rank r min{M, N} is characterized by four fundamental subspaces: the two range subspaces R(A) and R(AT ) and the two null
subspaces N(A) and N(AT ). These subspaces play a fundamental role in the SVD
of A and in the leastsquares solution of the equation Ax = b.
The fundamental theorem of linear algebra [1,20] states that their dimensions
and orthogonality properties are as follows:
R(A),
N(AT ),
R(AT ),
N(A),
subspace
subspace
subspace
subspace
of
of
of
of
RN ,
RN ,
RM ,
RM ,
dim = r,
dim = N r,
dim = r,
dim = M r,
R(A) = N(AT )
N(AT ) = R(A)
R(AT ) = N(A)
N(A) = R(AT )
(3.1)
The dimensions of the two range subspaces are equal to the rank of A. The
dimensions of the null subspaces are called the nullity of A and AT . It follows that
the spaces RM and RN are the direct sums:
(3.2)
J
= 2AT e = 2AT (b Ax)= 0
x
Thus, we obtain the orthogonality and normal equations:
AT e = 0
(orthogonality equations)
A Ax = A b
T
(normal equations)
(4.2)
If the MM matrix ATA has full rank, then it is invertible and the solution of
the normal equations is unique and is given by
x = (ATA)1 AT b
(4.3)
Their intersections are: R(A) N(AT )= {0} and R(AT ) N(A)= {0}, that
is, the zero vector. Fig. 3.1 depicts these subspaces and the action of the matrices A and AT . The fundamental theorem of linear algebra, moreover, states that
the singular value decomposition of A provides orthonormal bases for these four
subspaces and that A and AT become diagonal with respect to these bases.
Using the directsum decompositions (3.2), we resolve both b and x into their
unique orthogonal components:
Fig. 3.1 The four fundamental subspaces associated with an NM matrix A.
(4.1)
b = b + b ,
b R(A),
b N(AT ),
b RN
x = x + x ,
x R(AT ),
x N(A),
x RM
(4.4)
lie in N(A) in addition to lying in R(AT ), and hence it must be the zero vector
because R(AT ) N(A)= {0}. In fact, this unique x may be constructed by the
pseudoinverse of A:
Ax = b
x = A+ b = A+ b
(minimumnorm solution)
(4.6)
An explicit expression for the pseudoinverse A+ will be given in Sec. 6 with the
help of the SVD of A. We will also show there that A+ b = A+ b. In conclusion, the
most general solution of the leastsquares problem (4.1) is given by:
x = A+ b + x
(4.7)
(4.8)
which shows that the norm x is minimum when x = 0, or, when x = x . Fig. 4.1
illustrates this property.
The minimumnorm solution is computed by MATLABs builtin function pinv,
x = pinv(A) b. The solution obtained by MATLABs backslash operator, x = A\b,
does not, in general, coincide with x . It has a term x chosen such that the resulting
vector x has at most r nonzero (and M r zero) entries, where r is the rank of A.
We obtained the general leastsquares solution by purely geometric means using
the orthogonality equation (4.2) and the orthogonal decompositions of x and b. An
alternative approach is to substitute (4.5) directly into the performance index and
use the fact that e and e are orthogonal:
(4.9)
This expression is minimized when Ax = b , leading to the same general solution (4.7). The minimized value of the meansquare error is b 2 .
The fullrank case deserves special mention. There are three possibilities depending on whether the system Ax = b is overdetermined, underdetermined, or
square. Then, one or both of the null subspaces consist only of the zero vector:
1.
2.
3.
N > M,
M > N,
N = M,
r = M,
r = N,
r = N,
N(A)= {0},
N(AT )= {0},
N(A)= {0},
R(A )= R ,
R(A)= RN ,
N(AT )= {0},
T
(overdetermined)
(underdetermined)
(square, invertible)
N > M = r,
x = A+ b,
A+ = (ATA)1 AT
2.
M > N = r,
N = M = r,
x = A+ b + x ,
A+ = AT (AAT )1
A+ = A1
3.
x=A
b,
(4.10)
In the last two cases, the equation Ax = b is satised exactly. In the rst case,
it is satised only in the leastsquares sense.
The three cases are depicted in Fig. 4.2. In the overdetermined case, N(A)= {0}
and therefore, the leastsquares solution is unique x = x and, as we saw earlier, is
given by x = (ATA)1 AT b. Comparing with (4.6), it follows that the pseudoinverse
is in this case A+ = (ATA)1 AT .
In the underdetermined case, we have b = b , that is, b is in the range of A, and
therefore, Ax = b does have a solution. There are more unknowns than equations,
Solution: The leastsquares minimization problems and their solutions are in the two cases:
10
2x = 2
x=2
Solution: This is an underdetermined fullrank case. The minimum norm solution is computed using the pseudoinverse A+ = AT (AAT )1 . We have, AAT = 2, therefore,
J
= 2(x 1)+2(x 2)= 0
x
J
= 4(2x 2)+2(x 2)= 0
x
x = 1 .5
A+ =
x = 1 .2
It may be surprising that the solutions are different since the rst equations of the two
systems are the same, differing only by an overall scale factor. The presence of the
scale factor introduces an effective weighting of the performance index which alters
the relative importance of the squared terms. Indeed, the second performance index
may be rewritten as:
a1 x = b1
a2 x = b2
a1
a2
x=
b1
b2
A=
a1
a2
1
1
=
0 .5
0.5
x =
z
z
[ 1 , 1]
,
b=
b1
b2
(4.11)
0 .5
0 .5
[2]=
1
1
z
z
=0
Ax = 0
Therefore, the most general leastsquares solution will have the form:
x = A+ b =
x = x + x =
which assigns the weights 4:1 to the two terms, as opposed to the original 1:1. Generalizing this example, we may express the systems in the form Ax = b:
The most general vector in the onedimensional null space of A has the form:
1
2
1
1
+
z
z
=
1+z
1z
a1
a2
+
=
[a1 , a2 ]
a21 + a22
x = A+ b =
a1 b1 + a2 b2
b1
[a
,
a
]
=
1
2
b2
a21 + a22
a21 + a22
1
The rst system has [a1 , a2 ]= [1, 1] and [b1 , b2 ]= [1, 2], and the second system,
[a1 , a2 ]= [2, 1] and [b1 , b2 ]= [2, 2]. If we multiply both sides of Eq. (4.11) by the
weights w1 , w2 , we get the system and solution:
w1 a 1
w 2 a2
x=
w1 b1
w2 b2
x=
w12 a1 b1 + w22 a2 b2
w12 a21 + w22 a22
(4.12)
The differences between (4.11) and (4.12) can be explained by inspecting the corresponding performance indices that are being minimized:
J = (a1 x b1 )2 +(a2 x b2 )2 ,
J.
Example 4.2: Find the minimum norm solution, as well as the most general leastsquares
solution of the system:
x1 + x2 = 2
[ 1, 1]
x1
x2
= [2] ,
A = [1, 1],
x=
x1
x2
,
b = [ 2]
Fig. 4.3 The minimum norm solution is perpendicular to the straight line x1 + x2 = 2.
The equation x1 + x2 = 2 is represented by a straight line on the x1 x2 plane. Any point
on the line is a solution. In particular, the minimumnorm solution x is obtained by
drawing the perpendicular from the origin to the line.
The direction of x denes the 1dimensional range space R(AT ). The orthogonal
direction to R(AT ), which is parallel to the line, is the direction of the 1dimensional
null subspace N(A). In the more general case, we may replace the given equation by:
a1 x1 + a2 x2 = b1
[a1 , a2 ]
x1
x2
= [b1 ],
A = [a1 , a2 ],
b = [b1 ]
T 1
A = A (AA )
a21 + a22
a1
a2
,
x = A b =
a21 + a22
a1 b1
a 2 b1
Vectors in N(A) and the most general leastsquares solution are given by:
11
12
x =
a21 + a22
a2 z
a1 z
x = x + x =
a21 + a22
a1 b1 + a2 z
a2 b1 a1 z
Example 4.3: The pseudoinverses of Ndimensional column and row vectors are:
a=
a1
a2
..
.
aN
aT
a+ =
a2
and aT = [a1 , a2 , . . . , aN ]
a
(aT )+ =
a2
where a2 = aT a = a21 +a22 + +a2N . Thus, we obtain the minimumnorm solutions:
The orthogonal matrices U, V are not unique, but the singular values i are. To
clarify the structure of and the blocks of zeros bordering r , we give below the
expressions for for the case of a 64 matrix A of rank r = 1, 2, 3, 4:
0
0 0
0
0
0
0
0
0
1 0 0 0
1
1
1
0 0 0 0 0 2 0 0 0 2
0 2
0
0
0
0
0 0 0 0 0
3 0
3
0
0
0
0
0 0
0
0
,
,
,
0 0 0 0 0
0
4
0
0
0 0
0
0
0
0
0 0 0 0 0
0 0 0
0 0
0
0
0
0
0
0
a1
b1
b2
a2
..
x = ..
.
.
aN
bN
a1 x1 + a2 x2 + + aN xN = b
x1
x2
1
. =
.
2
2
2
. a1 + a2 + + aN
xN
r
0
rr
0
0
a1
a2
. b
.
.
aN
(5.5)
r ]
A = [Ur  U
0
0
VrT
= Ur r VrT
rT
V
(5.6)
(5.1)
(5.2)
A=
UrT Ur = Ir ,
r = INr ,
rT U
U
r = 0 ,
UrT U
r U
rT = IN
Ur UrT + U
VrT Vr = Ir ,
rT V
r = IMr ,
V
r = 0 ,
VrT V
rV
rT = IM
Vr VrT + V
U U=
with positive diagonal entries called the singular values of A and arranged in decreasing order:
1 2 r > 0
(5.4)
T
T
T
i ui vT
i = 1 u1 v2 + 2 u2 v2 + + r ur vr
(5.7)
Vr R
. The orthogonality and completeness properties of U, V may be
expressed equivalently in terms of these submatrices:
(5.3)
r
i=1
13
r
V
r = diag(1 , 2 , . . . , r )
r
U
Vr
r]
V = [v1 , v2 , . . . , vr , vr+1 , . . . , vM ]= [Vr  V
Given an NM matrix A R
of rank r min(N, M), the singular value decomposition theorem [1] states that there exist orthogonal matrices U RNN and
V RMM such that A is factored in the form:
Ur
NM
(SVD)
r ]
U = [u1 , u2 , . . . , ur , ur+1 , . . . , uN ]= [Ur  U
A = UVT
a1 b1 + a2 b2 + + aN bN
aT b
x = a+ b = T =
a a
a21 + a22 + + a2N
UrT Ur
rT Ur
U
r
UrT U
T
Ur Ur
=
(5.8)
Ir
INr
= IN ,
r U
rT = UUT = IN
Ur UrT + U
The SVD of A provides also the SVD of AT , that is, AT = VT UT . The singular
values of AT coincide with those of A. The matrix T has dimension MN, but
since rT = r , we have:
r]
AT = VT UT = [Vr  V
14
0
0
UrT
rT
U
= Vr r UrT
(5.9)
AVr = Ur r VrT Vr = Ur r ,
r = Ur r VrT V
r = 0
AV
AT Ur = Vr r UrT Ur = Vr r ,
r = Vr r UrT U
r = 0
AT U
Avi = i ui ,
Avi = 0 ,
AT ui = i vi ,
AT ui = 0 ,
i = 1, 2, . . . , r
i = r + 1, . . . , M
i = 1, 2, . . . , r
i = r + 1, . . . , N
R(A)= span{Ur } ,
dim = r ,
r } ,
N(AT )= span{U
dim = N r ,
= Ir ,
PR(A) =
= INr ,
r U
rT
PN(AT ) = U
R(AT )= span{Vr } ,
dim = r ,
= Ir ,
PR(AT ) = Vr VrT
r} ,
N(A)= span{V
dim = M r ,
rT Vr = IMr ,
V
(5.10)
Ur UrT
(5.11)
The vectors ui and vi are referred to as the left and right singular vectors of
T =
r2
0
0
(5.12)
RNN
It is evident from these that V and U are the matrices of eigenvectors of ATA and
AAT and that the corresponding nonzero eigenvalues are i = i2 , i = 1, 2, . . . , r .
The ranks of ATA and AAT are equal to the rank r of A.
The SVD factors V, U could, in principle, be obtained by solving the eigenvalue
problems of ATA and AAT . However, in practice, loss of accuracy can occur in
squaring the matrix A. Methods of computing the SVD directly from A are available.
15
(economy SVD)
(5.13)
The NM matrix U1 may be enlarged into an NN orthogonal matrix by adjoining to it (N M) orthonormal columns U2 such that U2T U1 = 0, and similarly, the
MM diagonal matrix 1 may be enlarged into an NM matrix . Then, Eq. (5.13)
may be rewritten in the standard full SVD form:
A = U1 1 VT = [U1  U2 ]
VT = UVT
(5.14)
Eq. (5.13) is called the economy or thin SVD because the U1 matrix has the
same size as A but has orthonormal columns, and 1 has size MM. For many
applications, such as SVD signal enhancement, the economy SVD is sufcient. In
MATLAB, the full and the economy SVDs are obtained with the calls:
[U,S,V] = svd(A);
[U1,S1,V] = svd(A,0);
rV
rT
PN(A) = V
A = U1 1 VT
A and are the eigenvectors of the matrices AAT and ATA, respectively. Indeed, it
follows from the orthogonality of U and V that:
r2 0
T
T
T
T
T T
T
RMM
=
A A = V U UV = V( )V ,
= diag(1 , 2 , . . . , M ) RMM
These equations show that ui and vi , i = 1, 2, . . . , r , lie in the range spaces R(A)
and R(AT ), respectively. Moreover, they provide orthonormal bases for these two
subspaces. Similarly, vi , i = r + 1, . . . , M, and ui , i = r + 1, . . . , N, are bases for the
null subspaces N(A) and N(AT ), respectively.
Thus, a second part of the fundamental theorem of linear algebra is that the
r , Vr , V
r provide orthonormal bases for the four fundamental submatrices Ur , U
spaces of A, and with respect to these bases, A has a diagonal form (the ). The
subspaces, their bases, and the corresponding projectors onto them are:
UrT Ur
rT U
r
U
T
Vr Vr
ATA = VVT ,
Because ATA has full rank, it will be strictly positive denite, its eigenvalues will
be positive, and the corresponding eigenvectors may be chosen to be orthonormal,
that is,VT V = VVT = IM . Arranging the eigenvalues in decreasing order, we dene
i = i , i = 1, 2, . . . , M, and 1 = 1/2 = diag(1 , . . . , M ) RMM . Then, we
dene U1 = AV11 , which is an NM matrix with orthonormal columns:
AVr = Ur r
r = 0
AV
T
A Ur = Vr r
r = 0
AT U
% full SVD
% economy SVD
Example 5.1: To illustrate the loss of accuracy in forming A A, consider the 43 matrix:
1 1 1
1 + 2
1
1
0 0
1
1 + 2
ATA = 1
A=
0 0
2
1
1
1+
0 0
T
The matrix A remains full rank to order O( ), but ATA requires that we work to order
O( 2 ). The singular values of A are obtained from the eigenvalues of ATA:
1 = 3 + 2 , 2 = 3 = 2
1 = 3 + 2 , 2 = 2 =
The full SVD of A can be constructed along the lines described above. Starting with
the eigenproblem of ATA, we nd:
A = UVT =
16
2
0
0
0
a
0
a
3
a
0
b
b
0
c
2c
1
1
1
1
1
where a = , = b = , = c = , =
, and =
.
3
6
2
3(3 + 2 )
(3 + 2 )
0.5 1.0
0 .5
0.5 0 .1 0 .7
2
1.1 0.2 0.5 0.5 0.7
0.1
0
=
A=
1.1 0.2 0.5 0.5
0.7 0.1 0
0.5
0. 1
0. 7
0.5 1.0
0 .5
0
0
1
0. 8
0 0.6
0
0 .6
0. 8
T
= UVT
0 .5
1 .1
A=
1 .1
0 .5
0. 5
0 .5
2
0 .5 0
0. 5
0
1
0. 8
0.6
0. 6
0. 8
T
0 .5
0 .5
U=
0 .5
0 .5
0.1544
0.6901
0.6901
0.1544
0 .5
0 .5
0 .5
0 .5
0.6901
0.1544
0.1544
0.6901
0.1544
0.6901
0.6901
0.1544
0.6901
0.1
0.1544
0 .7
=
0.1544 0.7
0.6901
0.1
0 .7
0.1
C,
0 .1
0. 7
C=
0.9969
0.0781
where CT C = I2 .
0.0781
0.9969
1 ui vi 1 u1 v1 + A2
i=2
repeat this argument by separating out the remaining singular terms i ui vi one at
a time, till we exhaust all the singular values.
This theorem is useful in canonical correlation analysis and in characterizing
the angles between subspaces.
(5.15)
VV = V V = IM
(5.18)
A = 1 u1 v1 +
ComplexValued Case
A = UV
j = 1, 2, . . . , i 1
The proof is straightforward. Using the CauchySchwarz inequality and the constraints u = v = 1, and that the Euclidean norm of A is 1 , we have:
UU = U U = IN ,
(5.17)
The last two columns of the two Us are related by the 22 orthogonal matrix C:
i = 2, . . . , r
The choice of the last two columns of U is not unique. They can be transformed by
any 22 orthogonal matrix without affecting the SVD. For example, v5.3 of MATLAB
produces the U matrix:
Then, the remaining singular values and vectors are the solutions of the criteria:
0 .5
1.0
0.2
0 .5
=
0.2 0.5
0.5
1.0
The singular values and singular vectors of a matrix A of rank r can be characterized
by the following maximization criterion [87].
First, the maximum singular value 1 and singular vectors u1 , v1 are the solutions of the maximization criterion:
u=1 v=1
6. MoorePenrose Pseudoinverse
For a fullrank NN matrix with SVD A = UVT , the ordinary inverse is obtained
by inverting the SVD factors and writing them in reverse order:
A1 = VT 1 U1 = V1 UT
(5.16)
(6.1)
17
quantity u Av could just as well be replaced by its absolute value u Av in (5.17) and (5.18).
18
were square because some of its singular values are zero. For a scalar x, we may
dene its pseudoinverse by:
x+ =
x1 , if x 0
0,
if
(6.2)
x=0
For a square MM diagonal matrix, we dene its pseudoinverse to be the diagonal matrix of the pseudoinverses:
+
+ = diag(1+ , 2+ , . . . , M
)
= diag(1 , 2 , . . . , M )
(6.3)
Ax = b
Ur r VrT x = Ur UrT b
VrT x = r1 UrT b
where we multiplied both sides of the second equation by UrT and divided by r to
get the third. Multiplying from the left by Vr and using (6.6), we nd:
but we have x = Vr VrT x , which follows from (Vr VrT )2 = Vr VrT , that is, x =
Vr VrT x = Vr VrT (Vr VrT x)= Vr VrT x . Thus, we nd x = A+ b. Using (6.8) and (6.9),
we also have A+ b = (A+ AA+ )b = A+ (AA+ b)= A+ b . Thus, we have shown:
A+ = V+ UT
(MoorePenrose pseudoinverse)
(6.5)
x = A+ b = A+ b
r ]
A = [Ur  U
r]
A+ = [Vr  V
0
0
T
Vr
(6.6)
T
T
T
i ui vT
i = 1 u1 v1 + 2 u2 v2 + + r ur vr
i=1
r
1
1
1
1
A =
vi uT
v1 uT
v2 uT
vr uT
1 +
2 + +
r
i =
i
1
2
r
i=1
(6.7)
The matrix A+ satises (and is uniquely determined by) the four standard Penrose conditions [1]:
(AA+ )T = AA+
(A+A)T = A+A
AA+A = A ,
A+AA+ = A+ ,
(6.8)
These conditions are equivalent to the fact that AA+ and A+ A are the projectors
onto the range spaces R(A) and R(AT ), respectively. Indeed, using the denition
(6.6) and Eq. (5.11), we have:
(6.9)
It is straightforward also to verify the three expressions for A given by Eq. (4.10)
for the fullrank cases. For example, if N > M = r , the matrix Vr is square and
orthogonal, so that ATA = Vr r2 VrT is invertible, (ATA)1 = Vr r2 VrT . Thus,
T
(A A)
A =
Vr r2 VrT
Vr r UrT
19
Vr r1 UrT
(7.1)
x = A+ b = Vr r1 UrT b =
r
1
vi u T
i b
i=1 i
(7.2)
A=
(minimumnorm solution)
= Ur r VrT
rT
V
T
Ur
0
= Vr r1 UrT
rT
0
U
r1
Vr VrT x = Vr r1 UrT b = A+ b
=A
zr+1
z
M
r+2
rz
zi vi = [vr+1 , vr+2 , . . . , vM ]
x =
.. = V
.
i=r+1
zM
Because x is arbitrary, so is z. Thus, the most general solution of the leastsquares problem can be written in the form [1]:
rz ,
x = x + x = A+ b + V
(7.3)
(7.4)
where we used the property (IN AA+ )T (IN AA+ )= (IN AA+ ), which can be
proved directly using (6.8). Indeed,
(IN AA+ )T (IN AA+ )= IN 2AA+ +AA+ AA+ = IN 2AA+ +AA+ = IN AA+
Example 7.1: Here, we revisit Example 4.2 from the point of view of Eq. (7.3). The full and
economy SVD of A = [a1 , a2 ] are:
T
T
a1 /1 a2 /1
a1 /1
A = [a1 , a2 ]= [1][1 , 0]
= [1][1 ]
a2 /1
a1 /1
a2 /1
with the singular value 1 =
r of N(A) will be:
V
A+ = [a1 , a2 ]+ =
a21 + a22 . Thus, the pseudoinverse of A and the basis
a1 /1
a2 /1
[11 ][1]T =
12
a1
a2
,
r =
V
a2
a1
x1
x2
= A [b1 ]+
a2
a1
z=
12
a1
a2
b1 +
a2
a1
0.36
0.48
x = 0.48 + 0.64
0.80
0.60
MATLABs backslash solution is obtained by xing z1 , z2 such that x will have at most
one nonzero entry. For example, demanding that the top two entries be zero, we get:
z
1 .8
1 .8
Ax =
1 .8
1 .8
Adx + (dA)x = db
0 .5
0 .5
T
A = UV =
0 .5
0 .5
0 .5
0 .5
0 .5
0.5
4.0
10
x1
4 .0
20
x2 =
=b
30
4.0
x3
4 .0
40
2 .4
2 .4
2 .4
2 .4
0 .5
0 .5
0.5
0.5
0.5
10
0.5
0
0.5 0
0.5
0
0
0
0
0
0
0.36
0
0.48
0
0.80
0
0.48
0.64
0.60
0 .8
0 .6
0 .0
The matrix A has rank one, so that the last three columns of U and the last two
columns of V are not uniquely dened. The pseudoinverse of A will be:
0 .5
0. 5
0.36
10
0.36
0.36
20
= 0.48
x = A+ b = 0.48 [101 ][0.5, 0.5, 0.5, 0.5]
30
0.80
0.80
40
21
z2 = 0
max
min
(8.1)
where max , min are the largest and smallest singular values of A, that is, 1 , N .
1
.
The last equation of (8.1) follows from A2 = 1 and A1 2 = N
The condition number characterizes the sensitivity of the solution of a linear
system Ax = b to small changes in A and b. Taking differentials of both sides of
the equation Ax = b, we nd:
z1 = 0.75 ,
Example 7.2: Find the most general solution of the following linear system, and in partic
8. Condition Number
0.36 0.48z1 + 0.80z2
0.80
z1
0.60
= 0.48 0.64z1 0.60z2
z2
0.00
0.80 + 0.60z1
which gives 0.8 + 0.6z1 = 1.25, and therefore, x = [0, 0, 1.25]T . This is indeed
MATLABs output of the operation A\b.
It follows from (7.3) that the most general solution of a1 x1 + a2 x2 = b1 will be:
dx = A1 db (dA)x
dx A1 db (dA)x A1 db + dAx
Using the inequality b = Ax Ax, we get:
dA db
dx
(A)
+
x
A
b
(8.2)
Large condition numbers result in a highly sensitive system, that is, small changes
in A and b may result in very large changes in the solution x. Large condition numbers, (A) 1, imply that 1 N , or that A is nearly singular.
Example 8.1: Consider the matrix A, which is very close to the singular matrix A0 :
10.0002 19.9999
10 20
A=
, A0 =
4.9996 10.0002
5 10
22
0. 8
A=
0. 2
0 . 2
25.0000
0.0000
0. 8
0.0000
0.2
0.0005
0. 8
T
0 . 8
= UVT
0 .2
A Ak 2 = k+1
Its condition number is (A)= 1 /2 = 25/0.0005 = 50000. Computing the solutions of Ax = b for three slightly different bs, we nd:
b1 =
b2 =
b3 =
10.00
5.00
10.00
5.01
10.01
5.00
x1 = A\b1 =
x2 = A\b2 =
0 .2
0 .4
15.79992
8.40016
8.20016
x3 = A\b3 =
3.59968
2
2 1 /2
A Ak F = (k+
1 + + r )
This theorem is an essential tool in signal processing, data compression, statistics, principal component analysis, and other applications, such as chaotic dynamics, meteorology, and oceanography.
In remarkably many applications the matrix A has full rank but its singular
values tend to cluster into two groups, those that are large and those that are small,
that is, assuming N M, we group the M singular values into:
1 2 r r+1 M
large group
The solutions are exact in the decimal digits shown. Even though the bs differ only
slightly, there are very large differences in the xs.
9. ReducedRank Approximation
(9.4)
(9.5)
small group
Fig. 9.1 illustrates the typical pattern. A similar pattern arises in the practical
determination of the rank of a matrix. To innite arithmetic precision, a matrix A
may have rank r , but to nite precision, the matrix might acquire full rank. However,
its lowest M r singular values are expected to be small.
Ak =
i ui vT
i
= 1 u1 v1 + 2 u2 v2 +
k uk vT
k
k = 1, 2, . . . , r
(9.2)
i=1
Ak = U
k
0
0
0
VT ,
k = diag(1 , 2 , . . . , k , 0, . . . , 0 ) Rrr
(9.3)
rk zeros
Thus, A and Ak agree in their highest k singular values, but the last rk singular
values of A, that is, k+1 , . . . , r , have been replaced by zeros in Ak . The matrices
Ak play a special role in constructing reducedrank matrices that approximate the
original matrix A.
The reducedrank approximation theorem [1] states that within the set of NM
matrices of rank k (we assume k < r ), the matrix B that most closely approximates
A in the Euclidean (or the Frobenius) matrix norm is the matrix Ak , that is, the
23
The presence of a signicant gap between the large and small singular values
allows us to dene an effective or numerical rank for the matrix A.
In leastsquares solutions, the presence of small nonzero singular values may
cause inaccuracies in the computation of the pseudoinverse. If the last (M r)
small singular values in (9.5) are kept, then A+ would be given by (6.7):
A+ =
r
M
1
1
vi uT
+
vi uT
i
i
i
i
i=1
i=r+1
and the last (M r) terms would tend to dominate the expression. For this reason,
the rank and the pseudoinverse can be determined with respect to a threshold level
or tolerance, say, such that if i , for i = r + 1, . . . , M, then these singular
values may be set to zero and the effective rank will be r . MATLABs functions rank
and pinv allow the user to specify any desired level of tolerance.
24
0
1
0
0
A=
0
0
0
0
0
0
0
0
,
0
0
0.9990
0.0037
0.0008
0.0007
0.0019
0.9999
0.0016
0.0004
0.0008
0.0009
0.0010
0.0004
0.0004
0.0005
0.0002
0.0006
where the second was obtained by adding small random numbers to the elements of
the rst using the MATLAB commands:
A = zeros(4); A(1,1)=1; A(2,2)=1;
Ahat = A + 0.001 * randn(size(A));
r ]
A = [Ur  U
1
0
A+ =
0
0
0
1
0
0
0
0
0
0
0
0
,
0
0
0.9994
0.0035
+ =
A
1.1793
1.8426
0.0043
0.9992
2.0602
1.8990
1.1867
0.6850
1165.3515
701.5460
2
b=
3
2
x = A+ b = ,
0
+ b =
=A
x
1.0750
0.5451
406.8197
1795.6280
0.2683
2.2403
1871.7169
5075.9187
1.0010
0.0037
+
A =
0.0008
0.0004
0.0020
1.0001
0.0009
0.0005
0.0008
0.0016
0.0000
0.0000
0.0007
0.0004
0.0000
0.0000
+ b =
=A
x
1.0043
1.9934
0.0010
0.0014
To avoid such potential pitfalls in solving least squares problems, one may calculate
rst the singular values of A and then make a decision as to the rank of A.
In the previous example, we saw that a small change in A caused a small change
in the singular values. The following theorem [3] establishes this property formally.
are NM matrices with N M, then their singular values differ by:
If A and A
A2
i i  A
max 
1iM
(9.6)
A2
i i 2 A

F
i=1
25
VrT
r
r
r V
rT = Ar + A
= Ur r VrT + U
rT
V
(9.7)
are very close to each other, and so are the two sets of singular
Although A and A
values, the corresponding pseudoinverses differ substantially:
r
0
r = [Ur  U
r ]
A
0
0
0
0
0
r
VrT
rT
V
VrT
rT
V
= Ur r VrT
(9.8)
r
r V
rT
=U
r RN(Mr) , Vr RMr , V
r RM(Mr) .
where Ur RNr , U
r as the noise subWe will refer to Ar as the signal subspace part of A and to A
space part. The two parts are mutually orthogonal, that is, AT
r Ar = 0. Similarly,
r and r are called the signal subspace and noise subspace singular values.
Example 9.2: Consider the following 43 matrix:
0.16 0.13
0.08
0.19
A=
3.76
4.93
3.68 4.99
6.40
6.40
1.60
1.60
0. 5
0 .5
T
UV =
0 .5
0 .5
0 .5
0 .5
0 .5
0.5
0 .5
0 .5
0.5
0 .5
0 .5
10
0 .5
0
0.5 0
0.5
0
0
8
0
0
0
0.36
0
0.48
0. 1
0.80
0
0.48
0.64
0.60
0.80
0.60
0.00
0 .5
0 .5
A=
0 .5
0 .5
0. 5
0 .5
0 .5
0.5
0. 5
10
0. 5
0
0.5
0
0. 5
0
8
0
0
0.36
0 0.48
0.80
0. 1
0.48
0.64
0.60
0.80
0.60
0.00
The singular values are {1 , 2 , 3 } = {10, 8, 0.1}. The rst two are large and we
attribute them to the signal part, whereas the third is small and we assume that it
26
is due to noise. The matrix A may be replaced by its rank2 version by setting 3 = 0.
The resulting signal subspace part of A is:
0. 5
0 .5
0 .5
0.5
0. 5
10
0. 5
0
0.5
0
0. 5
which gives:
Ar =
0.12
0.12
3.72
3.72
0
8
0
0
0.36
0 0.48
0.80
0
0.16
0.16
4.96
4.96
0.48
0.64
0.60
0.80
0.60
0.00
Singular Values
6.40
6.40
1.60
1.60
0. 5
0 . 5 0 .5
0 . 5 0 .5 0 .5
Ar =
0. 5 0 . 5
0.5
0 .5
0.5
0 .5
0 .5
0.5 0.6325
0.5 0.5 0.6325
Ar =
0.5 0.5 0.3162
0 .5
0.5 0.3162
0 .5
10
0 .5
0
0.5 0
0.5
0
0
8
0
0
10
0.3162
0.3162
0
0.6325 0
0.6325
0
0
0.36
0
0.48
0
0.80
0
0
8
0
0
0.48
0.64
0.60
0.48
0.64
0.60
0.9
0.9
0.8
0.8
0.7
0.7
0.6
0.6
0.5
0.5
0.4
0.4
0.3
0.3
0.2
0.2
0.1
1
100
200
300
400
500
20
0.80
0.60
0.00
As usual, the last two columns of the Us are related by a 22 orthogonal matrix.
0.1
0.80
0.60
0.00
0
0.36
0
0.48
0
0.80
0
Singular Values
i / 1
0 .5
0 .5
Ar =
0 .5
0 .5
by rst removing the column means of the image and then performing a full SVD. The
singular values become small after the rst 100.
i / 1
Example 9.3: Fig. 9.2 shows the singular values of a 512512 image. They were computed
40
60
80
100
Fig. 9.2 Singular values of 512512 image, with expanded view of rst 100 values.
Fig. 9.3 shows the original image and the image reconstructed on the basis of the rst
100 singular values. The typical MATLAB code was as follows:
The OSP MATLAB function sigsub constructs both the signal and noise subspace parts of a matrix. It has usage:
[As,An] = sigsub(A,r);
% r = desired rank
Data compression arises because each term in the expansion requires the storage of 2N coefcients, that is, N coefcients for each of the vectors i ui and vi .
Thus, the total number of coefcients to be stored is 2Nr . Compression takes place
as long as this is less than N2 , the total number of matrix elements of the original
image. Thus, we require 2Nr < N2 or r < N/2. In practice, typical values of r that
work well are of the order of N/6 to N/5.
27
Fig. 9.3 Original (left) and compressed images, keeping r = 100 components.
A = imread(stream.tiff, tiff);
[B,M] = zmean(double(A));
[U,S,V] = svd(B);
r = 100;
Ar = M + U(:,1:r) * S(1:r,1:r) * V(:,1:r);
Ar = uint8(round(Ar));
28
% display image
The image was obtained from the USC image database [64]. The function zmean removes the mean of each column and saves it. After rankreduction, the matrix of the
means is added back to the image.
i
i=1
(true)
+
+ T
A+
f = f (A)A = Vf () U =
f (i )
vi u T
i
i
i=1
(10.1)
(regularized)
f ()= u( )
f ()=
(thresholding)
(10.2)
2
+ 2
(Tikhonov)
where u(t) is the unitstep and , are selectable parameters. The unitstep keeps
only those singular values that are above the threshold, i > . The Tikhonov
regularization is explicitly:
xf = A+
f b=
i
v uT b
2
2 i i
i
i=1
(10.4)
J
= 2AT (Ax b)+22 x = 0
x
(AT A + 2 I)x = AT b
where I is the identity matrix. The solution can be expressed in the form of Eq. (10.1).
Assuming that A is illconditioned but has full rank, then, A+ = (AT A)1 AT (for
the case N M), so that:
x = (AT A + 2 DT D)1 AT b
(10.5)
Rh = r ,
where R = E[y(n)yT(n)],
r = E[x(n)y(n)]
(11.1)
(10.3)
The Tikhonov regularization can also be obtained from the following modied
leastsquares criterion:
Regularization is used in many practical inverse problems, such as the deblurring of images or tomography. The second term in the performance index (10.4)
guards both against illconditioning and against noise in the data. If the parameter
is chosen to be too large, it is possible that noise is removed too much against
getting an accurate inverse. Some illustration of this tradeoff is shown in section
5.14 of the OSP text.
Often, the second term in (10.4) is replaced by the more general term Dx2 =
T T
x D Dx, where D is an appropriate matrix. For example, in an image restoration
application, D could be chosen to be a differentiation matrix so that the performance index would attempt to preserve the sharpness of the image. The more
general performance index and its solution are:
y0 (n)
y (n)
M
1
=
(n)= hT y(n)= [h0 , h1 , . . . , hM ]
hm ym (n)
x
..
m=0
.
yM (n)
(11.2)
The observation signals ym (n) are typically (but not necessarily) either the
outputs of a tapped delay line whose input is a single time signal yn , so that
ym (n)= ynm , or, alternatively, they are the outputs of an antenna (or other spatial
sensor) array. The two cases are shown in Fig. 11.1.
The vector y(n) is dened as:
y(n)=
yn
yn1
yn2
..
.
or, y(n)=
ynM
y0 (n)
y1 (n)
y2 (n)
..
.
yM (n)
30
(11.3)
R = E[y(n)yT(n)]
r = E[y(n)x(n)]
r=
E[y(n)e(n)]= 0
N1
1
=
R
N1
1
n=0
N1
1
y(n)yT(n)
y(n)x(n)
(11.8)
n=0
y(n)e(n)= 0
n=0
To simplify the expressions, we will drop the common factor 1/N in the above
timeaverages. Next, we dene the N(M + 1) data matrix Y whose rows are the
N snapshots yT(n),
In the array case, y(n) is called a snapshot vector because it represents the
measurement of the wave eld across the array at the nth time instant. The autocorrelation matrix R measures spatial correlations among the antenna elements,
that is, Rij = E[yi(n)yj (n)], i, j, = 0, 1, . . . , M.
In the timeseries case, R measures temporal correlations between successive
E[y(n)yT(n)]h = E[x(n)y(n)]
(11.7)
In practice, we may replace the above statistical expectation values by timeaverages based on a nite, but stationary, set of time samples of the signals x(n)
and y(n), n = 0, 1, . . . , N, where typically N > M. Thus, we make the replacements:
..
.
y1 (n)
..
.
..
.
y1 (N 1)
yM (0)
yM (1)
yM (n)
(11.9)
yM (N 1)
yT(0)
..
T
y (n)
..
T
y (N 1)
(11.10)
(n) are:
The N1 column vectors of the x(n), e(n), and the estimates x
x=
x(0)
x(1)
..
.
x(n)
..
.
e=
x(N 1)
e(0)
e(1)
..
.
e(n)
..
.
=
x
(0)
x
(1)
x
..
.
(n)
x
..
.
(11.11)
(N 1)
x
e(N 1)
Noting that Y = Y
= [y (0), y (1), . . . , y (N 1)], we can write Eqs. (11.8)
in the following compact forms (without the 1/N factor):
= Y Y ,
R
31
..
.
Y = [y0 , y1 , . . . , yM ]=
(11.5)
(11.6)
E[e(n)y(n)]= 0
y1 (0)
y1 (1)
The normal equations are derived from the requirement that the optimum weights
h = [h0 , h1 , . . . , hM ]T minimize the meansquare estimation error:
(n)2 ]= E[x(n)hT y(n)2 ]= min
E = E[e(n)2 ]= E[x(n)x
yT(0)
y0 (0)
yT(1) y (1)
..
..
.
.
=
Y=
yT(n) y0 (n)
..
..
.
.
y0 (N 1)
yT(N 1)
r = Y x ,
32
Y e = 0
(11.12)
Indeed, we have:
=
R
y(n)yT(n)= [y(0), y(1), . . . , y(N 1)]
n=0
N
1
yT(0)
yT(1)
= Y Y
..
.
T
y (N 1)
r=
y(n)x(n)= [y(0), y(1), . . . , y(N 1)]
n=0
x(0)
x(1)
N
1
= Y x
..
.
x(N 1)
= x Yh
e=xx
(n) will
For a lengthN input signal yn , n = 0, 1, . . . , N 1, the output sequence x
have length N + M, with the rst M output samples corresponding to the inputon
transients, the last M outputs being the inputoff transients, and the middle N M
(n), n = M, . . . , N 1, being the steadystate outputs.
samples, x
There are several possible choices in dening the range of summation over n in
the leastsquares index:
E =
e(n)2 ,
n
One can consider: (a) the full range, 0 n N 1 + M, leading to the socalled
autocorrelation method, (b) the steadystate range, M n N 1, leading to the
covariance method, (c) the prewindowed range, 0 n N 1, or (d) the postwindowed range, M n N 1 + M. The autocorrelation and covariance choices
are the most widely used:
(11.13)
Eaut =
N
1+M
E = E[e(n) ]= min
E =
N
1
(11.14)
n=0
Y Yh = Y x
Y e = 0,
h =
r
R
Yh = x
Since N > M + 1, these will be the same if Y has full rank, that is, r = M + 1. In
this case, the solution is unique and is given by:
(full rank Y)
(11.18)
In the timeseries case, some further clarication of the denition of the data
(n) is obtained by conmatrix Y is necessary. Since ym (n)= ynm , the estimate x
volving the orderM lter h with the sequence yn :
(n)=
x
hm ym (n)=
m=0
yT(0)
..
.
yT(M 1)
Yoff
Yaut
yT(0)
..
.
(11.19)
yT(N)
..
.
T
y (N 1 + M)
yT(M 1)
yT(M)
..
.
yT(N 1)
yT(N)
..
.
T
y (N 1 + M)
Yon
= Ycov ,
Yoff
Ycov
yT(M)
..
.
T
y (N 1)
(11.20)
hm ynm
m=0
33
e(n)2
n=M
(11.16)
The SVD of the data matrix, Y = UV , can used to characterize the nature
of the solutions of these equations. The minnorm and backslash solutions are in
MATLABs notation:
h = pinv(Y)x , h = Y\x
(11.17)
1
h = (Y Y)1 Y x = R
r
Yon
(11.15)
N
1
The minimization of the leastsquares index with respect to h gives rise to the
orthogonality and normal equations, as in Eq. (4.2):
Ecov =
n=0
e(n)2 ,
34
Yaut
y0
y
1
y2
y
3
=
y4
y5
y0
y1
y2
y3
y4
y5
where J is the usual reversing matrix consisting of ones along its antidiagonal.
While Yaut and Ycov are Toeplitz matrices, only the upper half of Yfb is Toeplitz
whereas its lower half is a Hankel matrix, that is, it has the same entries along each
antidiagonal.
Given one of the three types of a data matrix Y, one can extract the signal yn that
generated that Y. The MATLAB function datamat (in the OSP toolbox) constructs
a data matrix from the signal yn , whereas the function datasig extracts the signal
yn from Y. The functions have usage:
0
0
y0
y1
,
y2
y3
y4
y5
Ycov
y2
y
3
=
y4
y5
y1
y2
y3
y4
y0
y1
y2
y3
These follow from the denition yT(n)= [yn , yn1 , yn2 ], which gives, yT(0)=
[y0 , y1 , y2 ]= [y0 , 0, 0], and so on until the last time sample at n = N 1 + M =
6 1 + 2 = 7, that is, yT(7)= [y7 , y6 , y5 ]= [0, 0, y5 ]. The middle portion of Yaut is
the covariance version Ycov .
The autocorrelation version Yaut is recognized as the ordinary Toeplitz convolution matrix for a length6 input signal and an order2 lter. It can be constructed
easily by invoking MATLABs builtin function convmtx:
Y = convmtx(y,M+1);
y0
y
1
y2
y3
y4
y
5
y0
y1
y2
y3
y4
y5
x0
x
1
x2
y0
h0
y1
x3
h1 =
x4 ,
y2
h
2
y3
x5
x6
y4
y5
x7
0
0
y2
y
3
y4
y5
y1
y2
y3
y4
Yfb
y2
y3
y
4
y5
=
y
0
y1
y2
y3
y1
y2
y3
y4
y1
y2
y3
y4
y0
x2
h0
y1
x3
h1 =
x4
y2
h2
y3
x5
y0
y1
y2
y3
= Ycov
y2
Ycov
J
y3
y4
y5
35
fnm hm = n ,
0nN+M
(12.1)
m=max(0,nN)
In the zdomain, we require F(z)H(z)= 1. Because both lters are FIR, this can
only be satised approximately. More generally, we may allow an overall delay in
the output so that F(z)H(z)= zi . Introducing a delay may give better results.
The reason is that the exact inverse lter H(z)= zi /F(z) may have poles outside
the unit circle, which would imply that the stable inverse hn cannot be causal. Such
lter may be made approximately causal by clipping its anticausal part at some
large negative time n = i and then delaying the clipped lter by i time units to
make it causal. Thus, we replace (12.1) by
Y = datamat(y,M,type);
y = datasig(Y,type);
min
(n,M)
fnm hm = ni ,
0nN+M
where the delay can be anywhere in the range 0 i n + M. Eq. (12.2) can be
written in the compact matrix form:
F h = ui
(11.21)
(12.2)
m=max(0,nN)
(12.3)
36
f = cos(w0*(nn0)) .* exp(a*(nn0).^2);
h = F + ui
(12.4)
(12.5)
where (FF+ )ii denotes the ith diagonal entry of the projection matrix FF+ . Once
the matrix of inverse lters is computed H = F+ , one can pick the optimum among
them as the one that has the smallest error (12.5), or equivalently, the one that
corresponds to the largest diagonal entry of FF+ .
The solution (12.4) can be regularized if so desired using the techniques discussed in section 10. The regularized inverse lters will be the columns of the
matrix Hf = f (F)F+ , where f () is a regularization function.
Example 12.1: Fig. 12.1 shows the given lter fn , the designed inverse lter hn , and the
ltered signal hn fn , which is supposed to be a delta function. The signal fn was
dened by:
2
fn = cos 0 (n n0 ) ea(nn0 ) ,
n = 0, 1, . . . , N
F
H
i
h
d
=
=
=
=
=
datamat(f,M,0);
pinv(F);
26;
H(:,i);
conv(h,f);
%
%
%
%
%
% dene fn
a1
T
..
= y (n)a
.
aM
aM
.
.
T
R
.
e (n)= [yn , yn1 , . . . , ynM ]
= y (n)a
a1
(13.1)
(13.2)
where N = 65, 0 = 0.1, a = 0.004, and n0 = 25. The lter order was M = 50 and
the delay was chosen to be i = n0 = 25.
20
The prediction coefcients a are found by minimizing one of the three leastsquare performance indices, corresponding to the autocorrelation, covariance, and
forward/backward methods:
1
15
fn
inverse filter hn
h n * fn
Eaut =
10
Ecov =
5
10
Efb =
15
25
50
75
20
0
100
N
1
25
50
75
N
1
e+ (n)2 + e (n)2 = min
n=M
100
Stacking the samples e (n) into a column vector, we may express the error
vectors in terms of the corresponding autocorrelation or covariance data matrices:
37
(13.3)
n=M
0.5
0
n=0
0.5
N
1+M
e+ = Y a
R
e = Ya
..
.
..
.
..
..
.
.
38
(13.4)
and similarly for e . Noting that aR = Ja, we have for the covariance case:
e = Ycov Ja
= (Ycov J)a
Then, we may dene the extended error vector consisting of both the forward
and backward errors:
e+
e
e=
Ycov
Ycov
J
a = Yfb a
(13.5)
(13.6)
There are many methods for tting MA and ARMA models to a given data sequence
yn , n = 0, 1, . . . , N 1. Some methods are nonlinear and involve an iterative minimization of a maximum likelihood criterion. Other methods are adaptive, continuously updating the model parameters on a sample by sample basis.
Here, we briey discuss a class of methods, originally suggested by Durbin
[65,66], which begin by tting a long AR model to the data, and then deriving the
MA or ARMA model parameters from that AR model by using only leastsquares
solutions of linear equations.
MA Models
Thus, in all three cases, the problem reduces to the leastsquares solution of the
linear equation Ya = 0, that is,
yn = b0 n + b1 n1 + b2 n2 + + bq nq
Ya = 0
(13.7)
Ya = [y0 , Y1 ]
= y0 + Y1 = 0
Y1 = y0
(13.8)
= pinv(Y1 )y0 = Y1 y0
a=
Y1+ y0
(13.9)
where E is the minimized prediction error E = e2 /L, where L is the column dimension of Y. Combined with the function datamat, one can obtain the prediction
lter according to the three criteria:
[a,E] = lpls(datamat(y,M,0))
[a,E] = lpls(datamat(y,M,1))
[a,E] = lpls(datamat(y,M,2))
The autocorrelation method can be computed by the alternative call to the YuleWalker function yw :
a = lpf(yw(y,M));
Further improvements of these methods result, especially in the case of extracting sinusoids in noise, when the leastsquares solution (13.9) is used in conjunction
with the SVD enhancement iteration procedure discussed in Sec. 17.
39
(14.1)
Thus, the synthesis model lter and the power spectrum of yn are:
B(z)= b0 + b1 z1 + b2 z2 + + bq zq ,
2
Syy ()= 2 B()
(14.2)
Ab = u ,
where A = datamat(a, q)
(14.3)
0
1
a
1
a2
a3
a
4
a1
a2
a3
a4
0
0
1
0
1
a1
b1 = 0
0
a2
b2
a3
0
0
a4
a1
a2
a
3
a4
a1
a2
a3
a4
40
0
1
a1
0
b1 =
0
a2
b2
0
a3
0
a4
(14.4)
where in the second equation, we deleted the rst row of A, which corresponds to
the identity 1 = 1. Thus, denoting the bottom part of A by Abot , we obtain the
following (M + q)(q + 1) linear system to be solved by leastsquares:
Abot b = 0
b = lpls(Abot )
Abot = [a1 , A1 ],
b=
Abot b = [a1 , A1 ]
= a1 + A1 = 0
= A1 \ a1 = (A1 A1 )1 A1 a1
This has the form of a linear prediction solution = R
r, where R = A1 A1
ri = (A1 a1 )i = Raa (i + 1)
(14.7)
Raa (k)=
a
m+k am ,
M k M
(14.8)
B()2
N1
1
n=0
41
yn 2
2 =
y2
b b
yn
y
n1
where y(n)=
.. ,
.
ynp
yT(n)a = eT(n)b ,
(14.11)
n1
e(n)=
..
.
nq
(14.12)
a
b
1
1
= [ n , n1 , . . . , nq ] .
[yn , yn1 , . . . , ynp ]
.
.
.
.
.
ap
bq
% t an AR(q) model to a
q
d
= 2
bm 2 = 2 b b
2
m=0
% t an AR(M) model to y
where the function lpf extracts the prediction lter from the output of the function
yw. Once the MA lter is designed, the input noise variance 2 may be calculated
using Parsevals identity:
y2 = 2
(14.10)
2
j
B() 2
+ + bq ejq
2 1 + b1 e
Syy ()= H() =
=
1 + a1 ej + + ap ejp
A()
2
m=0
a = lpf(yw(y,M));
b = lpf(yw(a,q));
yn + a1 yn1 + + ap ynp = n + b1 n1 + + bq nq
or, compactly,
(14.9)
(14.6)
1
1 + b1 z1 + + bq zq
B(z)
=
A(z)
1 + a1 z1 + . . . + ap zp
H(z)=
To clarify further the above solution, we write (14.5) in partitioned form, separating out the rst column of Abot and the bottom part of b:
(14.5)
This problem is identical to that of Eq. (13.7) and therefore, its solution was
obtained with the help of the function lpls. These design steps have been incorporated into the MATLAB function ma with usage:
[b,sigma2] = ma(y,q,M);
ARMA models
Ya = E b ,
where Y =
yT(0)
..
T
y (n)
,
..
T
y (N 1)
E=
eT(0)
..
T
e (n)
..
T
e (N 1)
(14.13)
The data matrices Y and E have dimensions N(p + 1) and N(q + 1), respectively, and correspond to the prewindowed type. They can be constructed from
the sequences y = [y0 , y1 . . . , yN1 ]T and e = [ 0 , 1 . . . , N1 ]T by the MATLAB
calls to the function datamat:
Y = datamat(y, p, pre)
E = datamat(e, q, pre)
42
y0
y
1
y2
y3
y
4
y5
y6
y0
y1
y2
y3
y4
y5
1
1
2
a1
y0
=
a2 3
y1
4
a3
y2
5
6
y3
0
0
0
0
0
y0
y1
y2
y3
y4
0
1
2
3
4
5
0
1
1
b1
2
b2
3
4
2 = min
b
aE
J = Y
(14.15)
aM = lpf yw(y, M)
(14.16)
= datamat(
e, q, pre)
E
The second step is to solve (14.15) using least squares. To this end, we separate the
to get:
, and the bottom parts of
rst columns of the matrices Y, E
a, b
Y = [y0 , Y1 ],
a=
,
=
b
b
a=E
Y
[y0 , Y1 ]
1 ]
e0 , E
= [
1
=
y0 + Y1
e0 + E
= (y
1
E
e0 ), and solved:
This may be rearranged into Y1
0
1 ]
[Y1 , E
e0 )
= (y0
43
1 ]\y
= [Y1 , E
(14.19)
, y);
v = lter(1, b
,
w = lter(1, b
e);
V = datamat(v, p, pre);
W = datamat(w, q, pre);
(14.17)
(14.20)
Va = W b
[v0 , V1 ]
= [w0 , W1 ]
(14.21)
with performance index J = Va W b2 = min. The solution of (14.21) is:
= [V1 , W1 ] \(v0 w0 )
(14.22)
1 ] \(y0
e0 )
= [Y1 , E
(14.18)
e = lter(aM , 1, y)
= [
1 ],
e0 , E
E
2 = min
1
E
e2 = y + Y1
J =
(14.14)
Even though overdetermined, these equations are consistent and may be solved
for the model parameters. Unfortunately, in practice, only the observed output
sequence y = [y0 , y1 . . . , yN1 ]T is available.
A possible strategy to overcome this problem, originally proposed by Durbin,
is to replace the unknown exact innovation vector e = [ 0 , 1 . . . , N1 ]T by an
0 ,
1 . . . ,
N1 ]T and then solve (14.13) approximately using
estimated one
e = [
is the data matrix of the approximate innovations, then
leastsquares, that is, if E
solve the leastsquares problem:
b
a=E
Y
This completes step two. We note that because of the prewindowed choice of
e0 are the lengthN signal sequences y and
the data matrices, the rst columns y0 ,
0 ,
1 . . . ,
N1 ]T .
b = E b
Ya E
44
or
(14.24)
0
1
5
6
0
1
2
3
4
5
b1
2
b
0
1
1
b
2
b
1
0
1
1
b
= 0
2
0
2 b
3
0
4
0
0
0
1
1
b
2
b
0
0
0
0
0
0
0
0
1
1
b
2
b
0
0
0
0
0
1
1
b
2
b
0
0
0
0
0
1
1
b
0
0
0 3
4
0
0 5
6
1
(14.25)
Va = W b
V = [v0 , v1 , . . . , vM ] ,
y2 = 2
hn 2
(14.26)
n=0
i = 0, 1, . . . , M
V V = VV = IM+1
(15.2)
(15.3)
RV = V ,
= diag{0 , 1 , . . . , M }
V RV =
(15.4)
The components of the transformed vector, z(n)= [z0 (n), z1 (n), . . . , zM (n)]T ,
are called principal components. They can be expressed as the dot products of the
eigenvectors vi with y(n):
The third stage may be repeated a few additional times. At each iteration, the
Rvi = i vi ,
z(n)= VT y(n)
(15.1)
[a,b,sigma2] = arma(y,p,q,M,iter);
(KLT)
v0 y(n)
z0 (n)
z (n) vT y(n)
1
1
=
,
..
..
.
.
zM (n)
vT
M y(n)
zi (n)= vT
i y(n) ,
or,
i = 0, 1, . . . , M
(15.5)
These may be thought of as the ltering of y(n) by the FIR lters vi . Therefore,
the vectors vi are often referred to as eigenfilters. The principal components are
mutually orthogonal, that is, uncorrelated. The matrix V RV = is the covariance
matrix of the transformed vector z(n):
E[z(n)zT(n)]= V E[y(n)yT(n)]V = V RV ,
E[z(n)zT(n)]=
or,
(15.6)
or, componentwise:
E[z
i (n)zj (n)]= i ij ,
i, j = 0, 1, . . . , M
(15.7)
Thus, the KLT decorrelates the components of the vector y(n). The eigenvalues
of R are the variances of the principal components, i2 = E[zi (n)2 ]= i . Because
0 1 M , the principal component z0 (n) will have the largest variance,
the component z1 (n), the next to largest, and so on.
Dening the total variance of y(n) to be the sum of the variances of its M + 1
components, we can show that the total variance is equal to the sum of the variances
of the principal components, or the sum of the eigenvalues of R. We have:
45
46
15. KarhunenLo`
eve Transform
y =
(total variance)
(15.8)
We will ignore the overall factor 1/N, as we did in section 11, and work with the
simpler denition:
i=0
y = tr E[y (n)y (n)] = tr(R)= 0 + 1 + + M
2
=
R
(15.9)
(inverse KLT)
(15.10)
(15.11)
r
1
v
i zi (n)
n=0
M
M
(n)2 = E
i
zi (n)2 =
E y(n)y
i=r
(15.13)
i=r
y(n)yT(n)
n=0
where we assume that the sample means have been removed, so that
m=
N1
1
n=0
47
(16.4)
zT(0)
zT(1)
Z=
..
.
T
z (N 1)
(16.5)
Y = ZV
y(n)= V z(n) ,
n = 0, 1, . . . , N 1
(16.6)
v
i zi (n) ,
n = 0, 1, . . . , N 1
(16.7)
(16.1)
N
1
z(n)zT(n)=
(16.8)
i, j = 0, 1, . . . , M
(16.9)
n=0
y(n)= 0
(16.3)
i=0
Principal component analysis (PCA) is essentially equivalent to the KLT. The only
difference is that instead of applying the KLT to the theoretical covariance matrix
constructed from N available
R, it is applied to the sample covariance matrix R
signal vectors y(n), n = 0, 1, . . . , N 1:
=
R
(PCA)
y(n)=
N1
1
n = 0, 1, . . . , N 1
i=0
(16.2)
These can be combined into a single compact equation involving the data matrix
constructed from the z(n). Indeed, noting that zT(n)= yT(n)V, we have:
(15.12)
(n) will be a
If the ignored eigenvalues are small, the reconstructed signal y
good approximation of the original y(n). This approximation amounts to a rankr
reduction of the original problem. The meansquare approximation error is:
yT(0)
yT(1)
..
.
T
y (N 1)
where Y is the N(M + 1) data matrix constructed from the y(n). The eigenprob , that is, RV
= V, denes the KLT/PCA transformation matrix V. The
lem of R
corresponding principal component signals will be:
Z = YV
Y=
y(n)yT(n)= Y Y ,
z(n)= VT y(n) ,
z0 (n)
z (n)
M
1
=
v
y(n)= V z(n)= [v
..
0 , v1 , . . . , vM ]
i zi (n)
i=0
.
zM (n)
N
1
z
i (n)zj (n)= i ij ,
n=0
48
In fact, the principal component signals zi (n) are, up to a scale, equal to the
left singular eigenvectors of the SVD of the data matrix Y.
Following the simplied proof of the SVD that we gave in Sec. 5, we assume
a
fullrank case so that all the i are nonzero and dene the singular values i = i ,
for i = 0, 1, . . . , N 1, and the matrices:
= diag{0 , 1 , . . . , M } = 1/2
N
1
4
5
5 4 3 2 1
(16.11)
(economy SVD)
(16.12)
U U =
u(n)uT(n)= IM+1
n=0
N
1
where ui (n) is the ith component of u(n)= [u0 (n), u1 (n), . . . , uM (n)]T . It is the
same as zi (n), but normalized to unit norm.
2.31
2.49
2.31
2.49
Y=
3.32
3.08
3.08
3.32
0 .3
1.92
1.68
0 .3
1.92 0.3
1.68
= 0 .3
2.24
0 .4
2.56 0.4
2.56 0.4
2.24
0 .4
64.09
47.88
0
0 .5
0 .8
0.6
0 .6
0. 8
The singular values of Y are 0 = 10 and 1 = 0.5. Let the two columns of Y be
y0 and y1 , so that Y = [y0 , y1 ]. The scatterplot of the eight pairs [y0 , y1 ] is shown
below. We observe the clustering along a preferential direction. This is the direction
of the rst principal component.
= ZT Z =
47.88
36.16
=
0. 8
0.6
0 .6
0 .8
100
0
02
=
12
100
0
0
0.25
0
0.25
0 .8
0.6
0 .6
0. 8
T
= VVT
y0
= 0 . 8 y 0 + 0 . 6y 1
y1
y0
z1 = vT
= 0 . 6 y 0 + 0 . 8y 1
1 y = [0.6, 0.8] y
1
Conversely, each [y0 , y1 ] pair may be reconstructed from the PCA pair [z0 , z1 ]:
y0
y1
= V z = [v
0 , v1 ]
z0
z1
= v
0 z0 + v1 z1 =
0 .8
0.6
z0 +
0.6
0 .8
z1
The two terms in this expression dene parametrically two straight lines on the y0 , y1
plane along the directions of the principal components, as shown in the above gure.
The percentage variances carried by z0 , z1 are:
02
02
= 0.9975 = 99.75 % ,
+ 12
02
12
= 0.0025 = 0.25 %
+ 12
49
z0 = vT
0 y = [ 0 .8 , 0 . 6 ]
T
= UVT
Each principal component pair [z0 , z1 ] is constructed by the following linear combinations of the [y0 , y1 ] pairs:
0. 3
0 .3
0 .3
0.3
10
0 .4
0
0 .4
0. 4
0.4
Example 16.1: Consider the following 82 data matrix Y and its economy SVD:
0.15
0.15
0.15
0.15
,
0.20
0.20
0.20
0.20
3
Z = [z0 , z1 ]= U =
4
4
4
R=
n=0
u
i (n)uj (n)= ij
y0
Y = UV
z1
(16.10)
z0
y1
U = Z1 ,
Scatterplot of [ y0 , y1]
50
Example 16.2: The table below gives N = 24 values of the signals yT(n)= [y0 (n), y1 (n)].
The data represent the measured lengths and widths of 24 female turtles and were
obtained from the le turtle.dat on the course web page. This data set represents
one of the most wellknown examples of PCA [72]. To simplify the discussion, we
consider only a subset of this data set.
The data matrix Y has dimension N(M + 1)= 242. It must be replaced by its zeromean version, that is, with the column means removed from each column. Fig. 16.1
shows the scatterplot of the pairs [y0 , y1 ].
y0 (n)
y1 (n)
y0 (n)
y1 (n)
y0 (n)
y1 (n)
0
1
2
3
4
5
6
7
98
103
103
105
109
123
123
133
81
84
86
86
88
92
95
99
8
9
10
11
12
13
14
15
133
133
134
136
137
138
141
147
102
102
100
102
98
99
103
108
16
17
18
19
20
21
22
23
149
153
155
155
158
159
162
177
107
107
115
117
115
118
124
132
Scatterplot of [ y0 , y1]
50
z1
25
y0
y1
0.8542
0.5200
z0 +
0.5200
0.8542
z1
Y = loadfile(turtle.dat);
Y = zmean(Y(:,4:5));
[U,S,V] = svd(Y,0);
% economy SVD
figure; plot(Y(:,1),Y(:,2),.);
figure; plot(U(:,1),U(:,2),.);
% scatterplot of [y0 , y1 ]
% scatterplot of [u0 , u1 ]
The right graph in Fig. 16.1 is the scatterplot of the columns of U, that is, the unitnorm
principal components u0 (n), u1 (n), n = 0, 1, . . . , N 1. Being mutually uncorrelated,
they do not exhibit clustering along any special directions.
Several applications of PCA in diverse elds, such as statistics, physiology, psychology, meteorology, and computer vision, are discussed in [410,6980].
0.25
50
50
z1 = vT
1 y = 0.52 y0 + 0.8542 y1
0.25
z0
u1
y1
25
z0 = vT
0 y = 0.8542 y0 + 0.52 y1
The two terms represent the projections of y onto the two PCA directions. The two
straight lines shown in Fig. 16.1 are given by these two terms separately, where z0
and z1 can be used to parametrize points along these lines.
Scatterplot of [ u0 , u1]
0.5
Thus, the principal component z0 carries the bulk of the variance. The two principal
components are obtained by the linear combinations z = VT y, or,
25
25
0.5
0.5
50
0.25
y0
0.25
0.5
u0
V = [v0 , v1 ]=
0.8542
0.5200
0.5200
0.8542
,
v0 =
0.8542
0.5200
,
v1 =
0.5200
0.8542
02
= 0.989 = 98.9 % ,
02 + 12
51
12
= 0.011 = 1.1 %
02 + 12
The main idea of PCA is rank reduction for the purpose of reducing the dimensionality of the problem. In many signal processing applications, such as sinusoids in
noise, or plane waves incident on an array, the noisefree signal has a data matrix of
reduced rank. For example, the rank is equal to the number of (complex) sinusoids
that are present. We will be discussing this in detail later.
The presence of noise causes the data matrix to become full rank. Forcing the
rank back to what it is supposed to be in the absence of noise has a benecial
noisereduction or signalenhancement effect. However, rankreduction destroys
any special structure that the data matrix might have, for example, being Toeplitz or
Toeplitz over Hankel. A further step is required after rank reduction that restores
the special structure of the matrix. But when the structure is restored, the rank
becomes full again. Therefore, one must iterate this process of rankreduction
followed by structure restoration.
Given an initial data matrix of a given type, such as the autocorrelation, covariance, or forward/backward type, the following steps implement the typical SVD
enhancement iteration:
52
N = 25, M = 20, K = 3
y2
y
3
Y = y4
y5
y6
y1
y2
y3
y4
y5
y0
y1
y2 = Toeplitz
y3
y2
y0
y
1
YJ = y2
y3
y4
y1
y2
y3
y4
y5
y2
y3
y4 = Hankel
y5
y6
In such cases, in the SVD enhancement iterations one must invoke the function
toepl with its Hankel option, that is, type=1.
Example 17.1: As an example that illustrates the degree of enhancement obtained from
such methods, consider the length25 signal yn listed in the le sine1.dat on the
course web page. The signal consists of two equalamplitude sinusoids of frequencies
f1 = 0.20 and f2 = 0.25 cycles/sample, in zeromean, white gaussian noise with a 0dB
SNR. The signal samples were generated by:
n = 0, 1, . . . , 24
10
53
10
10
spectra in dB
After the iteration, one may extract the enhanced signal from the enhanced
data matrix. The MATLAB function sigsub (in the OSP toolbox) carries out an
economy SVD of Y and then keeps only the r largest singular values, that is, it
extracts the signal subspace part of Y. The function toepl restores the Toeplitz or
ToeplitzoverHankel structure by nding the matrix with such structure that lies
closest to the rankreduced data matrix. We will be discussing these functions later.
The SVD enhancement iteration method has been reinvented in different contexts. In the context of linear prediction and extracting sinusoids in noise it is
known as the Cadzow iteration [3436]. In the context of chaotic dynamics, climatology, and meteorology, it is known as singular spectral analysis (SSA) [88103];
actually, in SSA only one iteration (K = 1) is used. In nonlinear dynamics, the process of forming the data matrix Y is referred to as delaycoordinate embedding
and the number of columns, M + 1, of Y is the embedding dimension.
In the literature, one often nds that the data matrix Y is dened as a Hankel
instead of a Toeplitz matrix. This corresponds to reversing the rows of the Toeplitz
denition. For example, using the reversing matrix J, we have:
N = 15, M = 10, K = 3
10
% initialize iteration
spectra in dB
Y = datamat(y,M,type);
Ye = Y;
for i=1:K,
Ye = sigsub(Ye,r);
Ye = toepl(Ye,type);
end
ye = datasig(Ye,type);
20
30
40
SVDBurg
ordinary Burg
YuleWalker
periodogram
50
60
70
0.1
0.15
0.2
0.25
0.3
0.35
20
30
40
SVDBurg
ordinary Burg
YuleWalker
periodogram
50
60
70
0.1
0.4
f [cycles/sample]
0.15
0.2
0.25
0.3
0.35
0.4
f [cycles/sample]
The effective rank is r = 4 (each real sinusoid counts for two complex ones.) The SVDenhanced version of Burgs method gives narrow peaks at the two desired frequencies.
The number of iterations was K = 3, and the prediction lter order M = 20.
The YuleWalker method results in fairly wide peaks at the two frequenciesthe SNR
is just too small for the method to work. The ordinary Burg method gives narrower
peaks, but because the lter order M is high, it also produces several false peaks that
are just as narrow.
Reducing the order of the prediction lter from M down to r , as is done in the SVD
method to avoid any false peaks, will not work at all for the YuleWalker and ordinary
Burg methodsboth will fail to resolve the peaks.
The periodogram exhibits wide mainlobes and sidelobesthe signal duration is just
too short to make the mainlobes narrow enough. If the signal is windowed prior to
computing the periodogram, for example, using a Hamming window, the two mainlobes will broaden so much that they will overlap with each other, masking completely
the frequency peaks.
The graph on the right of Fig. 17.1 makes the length even shorter, N = 15, by using
only the rst 15 samples of yn . The SVD method, implemented with M = 10, still
exhibits the two narrow peaks, whereas all of the other methods fail, with the ordinary
Burg being a little better than the others, but still exhibiting a false peak. The SVD
method works well also for K = 2 iterations, but not so well for K = 1. The following
MATLAB code illustrates the computational steps for producing these graphs:
y = loadfile(sine1.dat);
r = 4; M = 20; K = 3;
f = linspace(0.1,0.4,401); w = 2*pi*f;
% frequency band
a = lpf(burg(y,M));
H1 = 1./abs(dtft(a,w));
H1 = 20*log10(H1/max(H1));
a = lpf(yw(y,M));
54
H2 = 1./abs(dtft(a,w));
H2 = 20*log10(H2/max(H2));
100
90
80
H3 = abs(dtft(y,w));
H3 = 20*log10(H3/max(H3));
Y = datamat(y,M);
Ye = Y;
for i=1:K,
Ye = sigsub(Ye,r);
Ye = toepl(Ye);
end
ye = datasig(Ye);
i (percent)
% periodogram spectrum in dB
% Y is the autocorrelation type
% SVD enhancement iterations
% set rank to r
% toeplitzize Ye
70
60
50
40
30
20
10
0
0
10
15
20
25
30
35
index i
% Burg prediction lter of order r
% compute enhanced Burg LP spectrum
Example 17.2: The SVD enhancement process can be used to smooth data and extract local
or global trends from noisy times series. Typically, the rst few principal components
represent the trend.
degrees oC
The functions lpf, burg, yw implement the standard Burg and YuleWalker methods.
They will be discussed later on.
Signal Subspace, r = 2
0.6
0.6
0.3
0.3
degrees oC
a = lpf(burg(ye,r));
H = 1./abs(dtft(a,w));
H = 20*log10(H/max(H));
0.3
As an example, we consider the global annual average temperature obtained from the
web site: www.cru.uea.ac.uk/cru/data/temperature/. The data represent the
temperature anomalies in degrees o C with respect to the 19611990 average.
0.6
0.3
1860
1880
1900
1920
1940
1960
1980
0.6
2000
1860
1880
1900
year
1920
1940
1960
1980
2000
1980
2000
year
The smoothed signals extracted from reducing the rank to r = 1, 2, 3, 4, 5, 6 are shown
in Figs. 17.3, 17.4, and 17.5. We note that the r = 2 case represents the trend well.
As the rank is increased, the smoothed signal tries to capture more and more of the
ner variations of the original signal.
Signal Subspace, r = 3
Signal Subspace, r = 4
0.6
0.6
0.3
0.3
degrees oC
degrees oC
Assuming that the global trend ye (n) is represented by the rst two principal components (r = 2), one can subtract it from the original sequence resulting into the
residual y1 (n)= y(n)ye (n), and the SVD enhancement method may be repeated
on that signal. The rst few components of y1 (n) can be taken to represent the local
variations in the original y(n), such as shortperiod cyclical components. The rest of
the principal components of y1 (n) may be taken to represent the noise.
% or, r = 2, 3, 4, 5, 6
% zero mean
M = 30; K=1; r = 1;
y = zmean(y);
The rst two PCs account for 76% of the total variance. The percent variances are
plotted in Fig. 17.2.
0.3
0.6
0.3
1860
1880
1900
1920
1940
1960
1980
2000
year
A = loadfile(TaveGL2.dat);
y = A(:,14);
n = A(:,1);
% read data le
% column14 holds the annual averages
% column1 holds the year
55
0.6
1860
1880
1900
1920
1940
1960
year
56
0.3
0.3
0.3
0.3
1880
1900
1920
1940
1960
1980
2000
0.6
0.3
1860
1880
1900
year
1920
1940
1960
1980
0.6
2000
1860
1880
1900
plot(n,y,:, n,ye,);
1920
1940
1960
1980
0.6
2000
1860
1880
1900
1920
year
%
%
%
%
0.3
year
Ye = datamat(y,M,2);
for i=1:K,
Ye = sigsub(Y,r);
Ye = toepl(Ye,2);
end
ye = datasig(Ye,2);
0.3
0.6
0.6
0.3
degrees oC
0.3
There are other popular methods of smoothing data and extracting trends. We mention two methods: spline smoothing and local polynomial regression. They are shown
in Fig. 17.6. They have comparable performance to the SVD method. Both methods
have a free parameter, such as or , that controls the degree of smoothing.
1960
1980
2000
0.3
0.6
1940
year
degrees oC
1860
0.6
degrees oC
0.3
degrees oC
0.6
0.6
Signal Subspace, r = 6
0.6
degrees oC
degrees oC
Signal Subspace, r = 5
0.6
0.3
1860
1880
1900
1920
1940
1960
1980
0.6
2000
1860
1880
1900
1920
year
1940
1960
1980
2000
year
J = Y T2F = min,
(18.1)
y11
Y = y21
y31
y12
y22
y32
y13
y23 ,
y33
57
t2
T = t3
t4
t1
t2
t3
t0
t1
t2
y11 t2
Y T = y21 t3
y31 t4
y12 t1
y22 t2
y32 t3
y13 t0
y23 t1
y33 t2
Because the Frobenius norm is the sum of the squares of all the matrix elements,
we have:
t1 =
y12 + y23
2
t2 =
t3 =
y21 + y32
2
J = Y H2F = min,
(18.2)
(18.3)
Once T is found by averaging the diagonals of YJ, the Hankel matrix H is constructed by rowreversal, H = TJ. This amounts to averaging the antidiagonals of
the original data matrix Y.
Finally, in the case of Toeplitz over Hankel structure, we have a data matrix
whose upper half is to be Toeplitz and its lower half is the rowreversed and conjugated upper part. Partitioning Y into these two parts, we set:
Y=
YT
YH
M=
T
T J
= required approximation
A B
where is a parameter. The generalized eigenvalues of the matrix pair {A, B} are
those values of that cause A B to reduce its rank. A generalized eigenvector
corresponding to such a is a vector in the null space N(A B).
A matrix pencil is a generalization of the eigenvalue concept to nonsquare matrices. A similar rank reduction takes place in the ordinary eigenvalue problem of
a square matrix. Indeed, the eigenvalueeigenvector condition Avi = i vi can be
written as (A i I)vi = 0, which states that A I loses its rank when = i and
vi lies in the null space N(A i I).
Matrix pencil methods arise naturally in the problem of estimating damped or
undamped sinusoids in noise [46], and are equivalent to the socalled ESPRIT methods [42]. Consider a signal that is the sum of r , possibly damped, complex sinusoids
in additive, zeromean, white noise:
2
YT
T
2
J = Y MF =
J T2F = min
YH
T J F
approximations of YT and YH
J, that is, in the notation of the function toepl:
T=
toepl(YT )+toepl(YH
J)
10
20
30
Y=
40
50
60
10
20
30
Y=
40
50
60
10
20
30
40
50
60
10
20
30
40
50
60
10
20
30
40
50
60
10
20
30
40
50
60
20
30
40
T=
50
55
60
35
40
45
M=
25
30
35
yn =
15
20
30
40
50
55
30
35
40
30
35
40
10
15
20
,
30
40
50
10
15
20
H=
30
40
50
25
30
35
35
40
45
T
T J
15
20
30
40
50
55
20
30
40
50
55
60
Ai ei n eji n + vn =
i=1
Ai zni + vn
(19.2)
i=1
where zi = ei +ji , i = 1, 2, . . . , r . The problem is to estimate the unknown damping factors, frequencies, and complex amplitudes of the sinusoids, {i , i , Ai },
from available observations of a lengthN data block y = [y0 , y1 , . . . , yN1 ]T of the
noisy signal yn . We may assume that the zi are distinct.
In the absence of noise, the (N M)(M + 1)dimensional, covariancetype,
data matrix Y can be shown to have rank r , provided that the embedding order M
is chosen such that r M N r . The data matrix Y is dened as:
Example 18.1: As an example, we give below the optimum Toeplitz, Hankel, and Toeplitz
over Hankel approximations of the same data matrix Y:
(19.1)
yT(M)
..
T
,
y
(n)
Y=
..
T
y (N 1)
and, Y becomes:
yn
yn1
yn2
zi
r
2
n zi
=
y(n)=
A
z
i i
.
.
. i=1
..
.
ynM
zM
i
zM
i
.
.
.
r
n
1
2
M
Y=
Ai
zi 1, zi , zi , . . . , zi
i=1
..
.
1
zN
i
(19.3)
(19.4)
Thus, Y is the sum of r rank1 matrices, and therefore, it will have rank r . Its null
space N(Y) can be characterized conveniently in terms of the orderr polynomial
with the zi as roots, that is,
59
60
r
A(z)=
(1 zi z1 )
(19.5)
i=1
Multiplying A(z) by an arbitrary polynomial F(z) of order Mr , gives an orderM polynomial B(z)= A(z)F(z), such that r of its roots are the zi , that is, B(zi )= 0,
for i = 1, 2 . . . , r , and the remaining M r roots are arbitrary. The polynomial B(z)
denes an (M + 1)dimensional vector b = [b0 , b1 , . . . , bM ]T through its inverse
ztransform:
b0
b
1
B(z)= b0 + b1 z1 + + bM zM = 1, z1 , z2 , . . . , zM
(19.6)
..
.
bM
the dimensionality of b, and hence of the null space N(Y), will be M + 1 r . This
implies that the rank of Y is r .
Next, we consider the matrix pencil of the two submatrices Y1 , Y0 of Y, where
Y1 is dened to be the first M columns of Y, and Y0 , the last M, that is,
1
2
M
B(zi )= 1, z
,
z
,
.
.
.
,
z
i
i
i
b0
b1
..
= 0,
.
bM
i = 1, 2, . . . , r
(19.7)
.
.
.
n
(M1)
1
2
Y1 =
Ai
zi 1, zi , zi , . . . , zi
.
i=1
.
.
1
zN
i
M
zi
.
.
.
r
n 1 2
M
Y0 =
Ai
zi zi , zi , . . . , zi
i=1
..
.
1
zN
i
zM
i
.
b0
.
.
r
n
b1
1
2
M
Yb =
Ai
zi 1, zi , zi , . . . , zi
.. = 0
.
.
i=1
.
.
bM
1
zN
i
(19.8)
.
.
.
n
Yb =
Ai
zi B(zi )= 0
.
i=1
.
.
1
zN
i
1 2
M
= zi z
,
i , zi , . . . , zi
zM
i
.
.
.
r
n 1 2
M
zi Ai
Y1 =
zi zi , zi , . . . , zi
.
i=1
.
.
1
zN
i
(19.11)
B(zi )= 0
(19.9)
Thus, B(z) must have the form B(z)= A(z)F(z). Because F(z) has degree
M r , it will be characterized by M + 1 r arbitrary coefcients. It follows that
k1
This follows from the fact that the rr Vandermonde matrix V
, k, i = 1, 2, . . . , r , has
ki = zi
nonvanishing determinant det(V)= 1i<jr (zi zj ), because the zi are distinct. See Ref. [1].
61
(M1)
1
2
1, z
i , zi , . . . , zi
last M
zM
i
rst M
(M1)
1
2
= 1, z
, zM
, zM
i , zi , . . . , zi
i
i
(M1)
1
2
1, z
i , zi , . . . , zi
(19.10)
1
2
M
]:
They were obtained by keeping the rst or last M entries of [1, z
i , zi , . . . , zi
This implies that the vector b must lie in the null space of Y:
zM
i
zM
i
.
.
.
r
n 1 2
M
(zi )Ai
Y1 Y0 =
zi zi , zi , . . . , zi
i=1
..
.
ziN1
(19.12)
becomes equal to one of the zi , one of the rank1 terms will vanish and the rank
of Y1 Y0 will collapse to r 1. Thus, the r desired zeros zi are the generalized
eigenvalues of the rankr matrix pencil Y1 Y0 .
When noise is added to the sinusoids, the matrix pencil Y1 Y0 will have
full rank, but we expect its r most dominant generalized eigenvalues to be good
estimates of the zi .
In fact, the problem of nding the r eigenvalues zi can be reduced to an ordinary
rr eigenvalue problem. First, the data matrices Y1 , Y0 are extracted from the
matrix Y, for example, if N = 10 and M = 3, we have:
y3 y2 y1 y0
y3 y2 y1
y2 y1 y0
y
y
y
4 y3 y2 y1
4 y3 y2
3 y2 y1
y5 y4 y3 y2
y5 y4 y3
y4 y3 y2
Y=
y6 y5 y4 y3 Y1 = y6 y5 y4 , Y0 = y5 y4 y3
y
y
y
7 y6 y5 y4
7 y6 y5
6 y5 y4
y8 y7 y6 y5
y8 y7 y6
y7 y6 y5
y9 y8 y7 y6
y9 y8 y7
y8 y7 y6
Ur (Y1 Y0 )Vr = Ur Y1 Vr r
r1 Ur (Y1
Y0 )Vr = Z Ir ,
where Z =
[z,A] = mpencil(y,r,M);
0 = y0
for m = 1, 2, . . . , M do:
m = ym
(19.13)
Fianlly, the eigenvalues of the rr matrix Z are computed, which are the desired
estimates of the zi . The matrix pencil Z Ir may also be obtained by inverting
Y0 using its pseudoinverse and then reducing the problem to size rr . Using
1
y0
y z1
1
. = .
. .
. .
yN1
z1N1
z2
..
.
..
.
N1
z2
1
A1
v0
zr
A2 v1
..
.. + ..
. . .
1
Ar
vN1
zN
r
E[ym
i ]
,
E[
i i]
1 m M,
0im1
m1
i=0
(19.14)
y = SA + v
(19.15)
A = S+ y = S\y
(19.16)
for example,
1
y0
y b
1 10
=
y2 b20
b
y3
30
0
1
b
21
b
31
0
0
1
b
32
(20.2)
b
mi i , or expressed vectorially:
y = B
(20.1)
E[
i ym ]
i
E[
i i ]
bmi =
m
1
i=0
Vr (Y0+ Y1 IM )Vr = Z Ir
Once the zi are determined, the amplitudes Ai may be calculated by leastsquares. Writing Eq. (19.2) vectorially for the given lengthN signal yn , we have:
% steering matrix
20. QR Factorization
or,
r1 Ur Y1 Vr
(20.3)
0
0
0
1
0 2
1
3
The matrix B is the unit lowertriangular Cholesky factor of the covariance matrix of the random vector y:
The above design steps have been implemented into the MATLAB function mpencil:
R = BDB
63
64
(20.4)
q=D
E[q q ]= I
R = G G
yT(0)
..
Y = y (n)
..
T
y (N 1)
(20.6)
Q Q = I,
= Y Y
R
(20.7)
G = upper triangular
(20.8)
= Y Y = G Q QG = G G
R
(20.9)
Y = QG
yT(0)
qT(0)
.
..
..
yT(n) = qT(n) G
..
..
.
.
T
T
y (N 1)
q (N 1)
y(n)= GT q(n) ,
N
1
yT(n)= qT(n)G ,
n = 0, 1, . . . N 1
(20.12)
or,
(20.10)
Cholesky Factorization
R = E[y yT ]= G G = BDB
y = V z = V u
y = GT q = B
E[z zT ]= = 2
T ]= D
E[
E[u uT ]= I
E[q qT ]= I
SVD
QR
Y = UV = ZV
Y = QG
= Y Y = VV = V V
R
2
= Y Y = G G
R
N
1
N
1
n=0
u(n)uT(n)= I
q(n)qT(n)= I
n=0
q
i (n)qj (n)= ij
In comparing the SVD versus the QRfactorization, we observe that both methods orthogonalize the random vector y. The SVD corresponds to the KLT/PCA
eigenvalue decomposition of the covariance matrix, whereas the QR corresponds
to the Cholesky factorization. The following table compares the two approaches.
R = E[y y ]= VV
E[q
i qj ]= ij
KLT/PCA
The QRfactorization factors the data matrix Y into an N(M+1) matrix Q with
orthonormal columns and an (M+1)(M+1) upper triangular matrix G:
(20.11)
q(n)qT(n)= I
n=0
N
1
QQ =
n=0
In practice, we must work with sample covariance matrices estimated on the basis of N vectors y(n), n = 0, 1, . . . , N1. The N(M+1) data matrix Y constructed
from these vectors is used to obtain the sample covariance matrix:
Y = QG ,
E[q qT ]= I
(20.5)
Writing qT(n)= [q0 (n), q1 (n), . . . , qM (n)], the ith column of Q is the time
signal qi (n), n = 0, 1, . . . , N 1. Orthonormality in the statistical sense translates
to orthonormality in the timeaverage sense:
66
ya1
y
a2
where ya =
.. ,
.
yap
y=
ya
yb
,
yb1
y
b2
yb =
..
.
ybq
E[wa wb ]
c=
= max
E[wa wa ]E[wb wb ]
(21.1)
R = E[y yT ]=
T
E[y
a ya ]
T
E[y
a yb ]
T
E[y
b yb ]
T
E[y
b ya ]
=
Rab
Rbb
Raa
Rba
Ra1,a1
Ra2,a1
R = Ra3,a1
Rb1,a1
Rb2,a1
Ra1,a2
Ra2,a2
Ra3,a2
Rb1,a2
Rb2,a2
Ra1,a3
Ra2,a3
Ra3,a3
Rb1,a3
Rb2,a3
Ra1,b1
Ra2,b1
Ra3,b1
Rb1,b1
Rb2,b1
Ra1,b2
1
Ra2,b2
0
Ra3,b2
= 0
c1
Rb1,b2
0
Rb2,b2
0
1
0
0
c2
0
0
1
0
0
(21.2)
c1
0
0
1
0
c2
0
1
where the random vectors are ya = [ya1 , ya2 , ya3 ]T and yb = [yb1 , yb2 ]T , and we
Ra1,a1
Ra2,a1
R = Rb1,a1
Rb2,a1
Rb3,a1
Ra1,a2
Ra2,a2
Rb1,a2
Rb2,a2
Rb3,a2
Ra1,b1
Ra2,b1
Rb1,b1
Rb2,b1
Rb3,b1
Ra1,b2
Ra2,b2
Rb1,b2
Rb2,b2
Rb3,b2
Ra1,b3
1
0
Ra2,b3
=
Rb1,b3
c1
Rb2,b3 0
0
Rb3,b3
0
1
0
c2
0
c1
0
1
0
0
c2
0
1
0
0
0
0
1
Thus, the canonical structure, having a diagonal submatrix Rab , describes the
correlations between a and b in their clearest form. The goal of CCA is to bring
the general covariance matrix R of Eq. (21.2) into such a canonical form. This is
accomplished by nding appropriate linear transformations that change the bases
ya and yb into the above form.
One may start by nding p and qdimensional vectors a, b such that the linear
combinations wa = aT ya and wb = bT yb are maximally correlated, that is, nding
a, b that maximize the normalized correlation coefcient:
67
a Rab b
= max
c=
(a Raa a)(b Rbb b)
In general, the matrices Raa , Rab , Rbb are full and inspection of their entries
does not provideespecially when the matrices are largea clear insight as to the
essential correlations between the two groups.
What should be the ideal form of R in order to bring out such essential correlations? As an example, consider the case p = 3 and q = 2 and suppose R has the
following structure, referred to as the canonical correlation structure:
(21.3)
(21.4)
We may impose the constraints that wa , wb have unit variance, that is, E[wa wa ]=
a Raa a = 1 and E[wb wb ]= b Rbb b = 1. Then, the equivalent criterion reads:
c = a Rab b = max ,
b Rbb b = 1
subject to a Raa a = 1 ,
(21.5)
Ua = Va ,
a = 1a/2 ,
Va Va = Ip
Ub = Vb ,
b = b1/2 ,
Vb Vb = Iq
Raa = Ua a Va = Va a2 Va ,
Rbb = Ub b Vb = Vb b2 Vb ,
(21.6)
1
T
1
Cab = E[u
a ub ]= a Va Rab Vb b
(21.7)
T
1
1
1
2
1
In this basis, E[u
a ua ]= a Va Raa Va a = a Va Va a Va Va a = Ip , and
T
similarly, E[u
u
]=
I
.
The
transformed
covariance
matrix
will
be:
q
b b
u=
ua
ub
T
Ruu = E[u u ]=
T
E[u
a ua ]
T
E[u
b ua ]
T
E[u
a ub ]
T
E[u
b ub ]
=
Ip
Cab
Cab
Iq
a = Va a1 fa
wa = aT ya = fT
a ua
b = Vb b1 fb
wb = bT yb = fT
b ub
fa = a Va a
fb = b Vb b
Similarly, we have:
68
(21.8)
[A,C,B] = ccacov(R,p);
(21.9)
c = fa Cab fb = max ,
fa fa = fb fb = 1
subject to
C = diag{c1 , c2 . . . , cr } Cpq
Cab = Fa CFb ,
(21.12)
= [b1 , b2 , . . . , bq ]= qq matrix
The coefcient matrices A, B transform the basis ya , yb directly into the canonical
correlation basis. We dene:
wa = AT ya
wb = BT yb
w=
wa
wb
Rww = E[w wT ]=
T
E[w
a wa ]
T
E[w
b wa ]
T
E[w
a wb ]
T
E[w
b wb ]
BT
ya
yb
A Raa A
A Rab B
B Rba A
B Rbb B
(21.13)
(21.14)
T
E[w
a wa ]
T
E[w
b wa ]
T
E[w
a wb ]
=
T
E[w
b wb ]
Ip
Iq
(21.15)
The canonical correlations and canonical random variables are obtained from
the columns of A, B by the linear combinations:
ci =
wbi ] ,
E[wai
wai =
aT
i ya
wbi =
bT
i yb
i = 1, 2, . . . , r
(21.16)
Yb =
yT
b (0)
..
T
yb (n)
,
..
.
T
yb (N 1)
Y = [Ya , Yb ]
(21.17)
We assume that the column means have been removed from Y. The corresponding sample covariance matrices are then:
= Y Y =
R
Ya Ya
Ya Yb
Yb Ya
Yb Yb
=
aa
R
ab
R
ba
R
bb
R
(21.18)
Ya = Ua a Va = economy SVD of Ya ,
Ua CNp ,
Ua Ua = Ip
C = diag{c1 , c2 . . . , cr } ,
CCA coefcients ,
AC
CCA coefcients ,
BC
= Iq
r = min(p, q)
pp
(21.19)
Wa = Ya A ,
Wa Wa = Ip
Wb = Yb B ,
Wb Wb = Iq
Rww = E[w wT ]=
A = Va a1 Fa = [a1 , a2 , . . . , ap ]= pp matrix
B=
yT
a (0)
..
T
ya (n)
,
..
T
ya (N 1)
(21.11)
Vb b1 Fb
Ya =
(21.10)
It follows from Eq. (5.17) that the solution is c = c1 , the maximum singular
value of Cab , and the vectors fa , fb are the rst singular vectors. The remaining
singular values of Cab are the lower maxima of (21.10) and are obtained subject to
the orthogonality constraints of Eq. (5.18).
Thus, the desired canonical correlation structure is derived from the SVD of the
matrix Cab . The singular values of Cab are called the canonical correlations. The
following procedure will construct all of them. Start with the full SVD of Cab :
W W=
Wa Wa
Wa Wb
Wb Wa
Wb Wb
=
Ip
Iq
(21.20)
ci =
N
1
wai
(n)wbi (n) ,
n=0
70
i = 1, 2, . . . , r
(21.21)
The above steps have been implemented by the MATLAB function cca.m. It takes
as inputs the data matrices Ya , Yb and outputs A, B, C. Its usage is as follows:
[A,C,B] = cca(Ya,Yb);
Example 21.1: As an example, consider again the turtle data in the le turtle.dat. We
take Ya , Yb to be the (N = 24) measured lengths and widths of the male (group a)
and female (group b) turtles. The data are shown below:
Ya
Yb
Ya
D = loadfile(turtle.dat);
% read data le
Ya = zmean(D(:,1:2));
Yb = zmean(D(:,4:5));
[A,C,B] = cca(Ya,Yb);
Y = [Ya,Yb];
Ryy = Y*Y;
% correlated basis
Wa = Ya*A; Wb = Yb*B;
W = [Wa,Wb]; Rww = W*W;
Yb
ya1
ya2
yb1
yb2
ya1
ya2
yb1
yb2
1
2
3
4
5
6
7
8
9
10
11
12
93
94
96
101
102
103
104
106
107
112
113
114
74
78
80
84
85
81
83
83
82
89
88
86
98
103
103
105
109
123
123
133
133
133
134
136
81
84
86
86
88
92
95
99
102
102
100
102
13
14
15
16
17
18
19
20
21
22
23
24
116
117
117
119
120
120
121
123
127
128
131
135
90
90
91
93
89
93
95
93
96
95
95
106
137
138
141
147
149
153
155
155
158
159
162
177
98
99
103
108
107
107
115
117
115
118
124
132
Finally, we mention that CCA is equivalent to the problem of nding the canonical angles between two linear subspaces. Consider the two subspaces of CN spanned
by the columns of Ya and Yb . The economy SVDs of Ya , Yb provide orthonormal
bases Ua , Ub for these subspaces.
The canonical angles between the two subspaces are dened [1,86,87] in terms
of the singular values of the matrix Ua Ub , but these are the canonical correlations.
The cosines of the canonical angles are the canonical correlations:
After removing the columns means, the computed sample covariance matrix is:
= Y Y =
R
Ya Y a
Y b Ya
Y a Yb
Y b Yb
3.1490
1.8110
= 103
5.5780
3.3785
1.8110
1.1510
5.5780
3.1760
3.1760
1.9495
10.3820
6.2270
A=
0.0191
0.0545
0.0083
0.0418
, B=
0.0023 0.0955
0.0026 0.0691
0.9767
0
c1 0
=
C=
0
0.1707
0 c2
6.2270
3.9440
W W =
W a Wa
W a Wb
W b Wa
W b Wb
1
0
0
1
0.9767
0
0
0.1707
1
0
0.9767
N
1
The largest angle corresponds to the smallest singular value, that is, cos max =
cmin = cr . This angle (in radians) is returned by the builtin MATLAB function
subspace, that is,
Books
[1] G. H. Golub and C. F. Van Loan, Matrix Computations, 3/e, Johns Hopkins University
Press, Baltimore, 1996.
[2] D. S. Watkins, Fundamentals of Matrix Computations, 2/e, Wiley, New York, 2002.
0
0.1707
0
1
[3] A. Bj
orck, Numerical Methods for Least Squares Problems, SIAM Press, Philadelphia,
1996.
[4] T. W. Anderson, Introduction to Multivariate Statistical Analysis, 2/e, Wiley, New York,
1984.
[5] D. F. Morrison, Multivariate Statistical Methods, 3/e, McGrawHill, New York, 1990.
n=0
where the linear combination coefcients are the rst columns of A, B. The following
MATLAB code implements this example:
71
(21.22)
References
The rst columns of Wa , Wb are the most correlated. They are obtained as the following linear combinations of the columns of Ya , Yb :
i = 1, 2 . . . , r = min(p, q)
th_max = subspace(Ya,Yb);
ci = cos i ,
3.3785
1.9495
72
[8] H. von Storch and F. W. Zwiers, Statistical Analysis in Climate Research, Cambridge
Univ. Press, Cambridge, 1999.
[30] D. W. Tufts, R. Kumaresan, and I. Kirsteins, Data Adaptive Signal Estimation by Singular Value Decomposition of a Data Matrix, Proc. IEEE, 70, 684 (1982).
[9] I. T. Jollife, Principal Component Analysis, 2/e, SpringerVerlag, New York, 2002.
[31] D. W. Tufts and R. Kumaresan, Estimation of Frequencies of Multiple Sinusoids: Making Linear Prediction Perform Like Maximum Likelihood, Proc. IEEE, 70, 975 (1982).
[10] K. I. Diamantaras and S. Y. Kung, Principal Component Neural Networks, Wiley, New
York, 1996.
[11] R. Gittins, Canonical Analysis, SpringerVerlag, New York, 1985.
[12] B. Parlett, Symmetric Eigenvalue Problem, Prentice Hall, Upper Saddle River, NJ, 1980.
[13] E. F. Deprettere, ed., SVD and Signal Processing, NorthHolland, New York, 1988.
[14] R. J. Vaccaro, ed., SVD and Signal Processing II, Elsevier, New York, 1991.
[15] M. Moonen and B. de Moor, SVD and Signal Processing III, Elsevier, New York, 1995.
[16] S. Van Huffel and J. Vandewalle, The Total Least Squares Problem, SIAM, Philadelphia,
1991.
[17] H. D. I. Abarbanel, Analysis of Observed Chaotic Data, SpringerVerlag, New York,
1996.
[18] H. Kantz and T. Schreiber, Nonlinear Time Series Analysis, Cambridge Univ. Press,
Cambridge, 1997.
[19] A. S. Weigend and N. A. Gershenfeld, eds., Time Series Prediction: Forecasting the Future and Understanding the Past AddisonWesley, Reading, MA, 1994.
The timeseries data and most of the chapters are available on the web via FTP:
ftp://ftp.santafe.edu/pub/TimeSeries/.
SVD Applications
[32] D. W. Tufts and R. Kumaresan, Singular Value Decomposition and Improved Frequency Estimation Using Linear Prediction, IEEE Trans. Acoust., Speech, Sig. Process.,
ASSP30, 671 (1982).
[33] R. Kumaresan and D. W. Tufts, Estimating the Parameters of Exponentially Damped
Sinusoids and PoleZero Modeling in Noise, IEEE Trans. Acoust., Speech, Sig. Process.,
ASSP30, 833 (1982).
[34] J. A. Cadzow, Signal EnhancementComposite Property Mapping Algorithm, IEEE
Trans. Acoust., Speech, Sig. Process., ASSP36, 49 (1988).
[35] L. L. Scharf, The SVD and Reduced Rank Signal Processing, Sig. Process., 25, 113
(1991).
[36] J. A. Cadzow and D. M. Wilkes, Enhanced Rational Signal Modeling, Sig. Process., 25,
171 (1991).
[37] B. De Moor, The Singular Value Decomposition and Long and Short Spaces of Noisy
Matrices, IEEE Trans. Sig. Process., SP41, 2826 (1993).
[38] H. Yang and M. A. Ingram, Design of Partially Adaptive Arrays Using the SingularValue Decomposition, IEEE Trans. Antennas Propagat., AP45, 843 (1997).
[39] S. Y. Kung, K. S. Arun, and D. V. B. Rao, State Space and Singular Value Decomposition
Based Approximation Methods for the Harmonic Retrieval Problem, J. Opt. Soc. Am.,
73, 1799 (1983).
[20] G. Strang, The Fundamental Theorem of Linear Algebra, Am. Math. Monthly, 100,
848 (1993).
[21] D. Kalman, A Singularly Valuable Decomposition: The SVD of a Matrix, College Math.
J., 27, 2 (1996).
[41] J. E. Hudson, Decomposition of Antenna Signals into Plane Waves, IEE Proc., pt. H,
132, 53 (1985).
[22] C. Mulcahy and J. Rossi, A Fresh Approach to the Singular Value Decomposition,
Coll. Math. J., 29, 199 (1998).
[42] R. Roy, A. Paulraj, and T. Kailath, ESPRITA Subspace Rotation Approach to Estimation of Parameters of Cisoids in Noise, IEEE Trans. Acoust., Speech, Sig. Process.,
ASSP34, 1340 (1986).
[23] C. Long, Visualization of Matrix Singular Value Decomposition, Math. Mag., 56, 161
(1983).
[24] V. C. Klema and A. J. Laub, The Singular Value Decomposition: Its Computation and
Some Applications, IEEE Trans. Aut. Contr., AC25, 164 (1980).
[43] A. J. Mackay and A. McCowen, An Improved PencilofFunctions Method and Comparisons with Traditional Methods of Pole Extraction, IEEE Trans. Antennas Propagat.,
AP35, 435 (1987).
[25] E. Biglieri and K. Yao, Some Properties of Singular Value Decomposition and Their
Applications to Digital Signal Processing, Sig. Process., 18, 277 (1989).
[44] P. De Groen and B. De Moor, The Fit of a Sum of Exponentials to Noisy Data, J. Comp.
Appl. Math., 20, 175 (1987).
[26] A. van der Veen, E. F. Deprettere, and A. L. Swindlehurst, Subspace Based Signal
Analysis Using Singular Value Decomposition, Proc. IEEE, 81, 1277 (1993).
[45] Y. Hua and T. K. Sarkar, Generalized PencilofFunction Method for Extracting Poles
of an EM System from Its Transient Response, IEEE Trans. Antennas Propagat., AP37,
229 (1989).
[27] J. Mandel, Use of the Singular Value Decomposition in Regression Analysis, Amer.
Statistician, 36, 15 (1982).
[28] I. J. Good, Some Applications of the Singular Decomposition of a Matrix, Technometrics, 11, 823 (1969).
[29] D. D. Jackson, Interpretation of Inaccurate, Insufcient and Inconsistent Data, Geophys. J. Roy. Astron. Soc., 28, 97 (1972).
73
[46] Y. Hua and T. K. Sarkar, Matrix Pencil Method for Estimating Parameters of Exponentially Damped/Undamped Sinusoids in Noise, IEEE Trans. Acoust., Speech, Sig.
Process., ASSP38, 814 (1990).
[47] Y. Hua and T. K. Sarkar, On SVD for Estimating Generalized Eigenvalues of Singular
Matrix Pencil in Noise, IEEE Trans. Sig. Process., SP39, 892 (1991).
74
[48] T. K. Sarkar and O. Pereira, Using the Matrix Pencil Method to Estimate the Parameters
of a Sum of Complex Exponentials, IEEE Ant. Propagat. Mag., 37, no.1, 48 (1995).
PCA
[69] H. Hotelling, Analysis of a Complex of Statistical Variables into Principal Components, J. Educ. Psychol., 24, 417 (1933).
[70] H. Hotelling, The Most Predictable Criterion, J. Educ. Psychol., 26, 139 (1935).
[71] C. R. Rao, The Use and Interpretation of Principal Component Analysis in Applied
Research, Sankhya, 26, 329 (1964).
[72] P. Jolicoeur and J. E. Mosimann, Size and Shape Variation in the Painted Turtle: A
Principal Component Analysis, Growth, 24, 339 (1960).
[73] C. S. Bretherton, C. Smith, and J. M. Wallace, An Intercomparison of Methods for
Finding Coupled Patterns in Climate Data, J. Climate, 5, 541 (1992).
[74] A. S. Hadi and R. F. Ling, Some Cautionary Notes on the Use of Principal Components
Regression, Amer. Statistician, 52, 15 (1998).
[75] D. N. Naik and R. Khattree, Revisiting Olympic Track Records: Some Practical Considerations in the Principal Component Analysis, Amer. Statistician, 50, 140 (1996).
[76] B. G. Kermani, S. S. Schiffman, and H. T. Nagle, A Novel Method for Reducing the
Dimensionality in a Sensor Array, IEEE Trans. Instr. Meas. IM47, 728 (1998).
[77] J. J. Gerbrands, On the Relationships Between SVD, KLT, and PCA, Patt. Recogn., 14,
375 (1981).
[78] M. Turk and A. Pentland, Eigenfaces for Recognition, J. Cogn. Neurosci., 3, 71 (1991).
[79] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, Eigenfaces vs. Fisherfaces:
Recognition Using Class Specic Linear Projection, IEEE Trans. Patt. Anal. Mach. Intel., PAMI19, 711 (1997).
[80] J. Karhunen and J. Jountsensalo, Generalizations of Principal Component Analysis,
Optimization Problems, and Neural Networks, Neural Netw., 8, 549 (1995).
CCA
[81] H. Hotelling, Relations Between Two Sets of Variates, Biometrika, 28, 321 (1936).
[82] K. E. Muller, Understanding Canonical Correlation Through the General Linear Model
and Principal Components, Amer. Statistician, 36, 342 (1982).
[83] A. C. Pencher, Interpretation of Canonical Discriminant Functions, Canonical Variates, and Principal Components, Amer. Statistician, 46, 217 (1992).
[84] L. L. Scharf and J. K. Thomas, Wiener Filters in Canonical Coordinates for Transform
Coding, Filtering, and Quantizing, IEEE Trans. Sig. Process., SP46, 647 (1998).
[85] N. A. Campbell and W. R. Atchley, The Geometry of Canonical Variate Analysis, Syst.
Zool., 30, 268 (1981).
[66] J. Durbin, The Fitting of Time Series Models, Rev. Int. Stat. Inst., 28, 233 (1961).
[86] S. N. Afriat, Orthogonal and Oblique Projectors and the Characteristics of Pairs of
Vector Spaces, Proc. Camb. Phil. Soc., 53, 800 (1957).
75
76
[87] A. Bj
orck and G. H. Golub, Numerical Methods for Computing the Angles Between
Linear Subspaces, Math. Comp., 27, 579 (1973).
SSA and Chaotic Dynamics
[88] D. Broomhead and G. P. King, Extracting Qualitative Dynamics from Experimental
Data, Physica D, 20, 217 (1986).
[89] R. Vautard, P. Yiou, and M. Ghil, SingularSpectrum Analysis: A Toolkit for Short,
Noisy, Chaotic Signals, Physica D, 58, 95 (1992).
[90] R. Vautard and M. Ghil, SingularSpectrum Analysis in Nonlinear Dynamics With
Applications to Paleoclimatic Time Series, Physica D, 35, 395 (1989).
[91] C. L. Keppenne and M. Ghil, Adaptive Spectral Analysis and Prediction of the Southern
Oscillation Index, J. Geophys. Res., 97, 20449 (1992).
[92] M. R. Allen and L. A. Smith, Monte Carlo SSA: Detecting Oscillations in the Presence
of Coloured Noise, J. Climate, 9, 3373 (1996).
[93] SSA toolkit web page, www.atmos.ucla.edu/tcd/ssa\.
[94] M. Ghil and R. Vautard, Interdecadal Oscillations and the Warming Trend in Global
Temperature Time Series, Nature, 350, 324 (1991).
[95] M. E. Mann and J. Park, Spatial Correlations of Interdecadal Variation in Global Surface Temperatures, Geophys. Res. Lett., 20, 1055 (1993).
[96] C. Penland, M. Ghil, and K. Weickmann, Adaptive Filtering and Maximum Entropy
Spectra with Application to Changes in Atmospheric Angular Momentum, J. Geophys.
Res., 96, 22659 (1991).
[97] M. Palus and I. Dvorak, SingularValue Decomposition in Attractor Reconstruction:
Pitfalls and Precautions, Physica D, 55, 221 (1992).
[98] V. M. Buchstaber, Time Series Analysis and Grassmannians, Amer. Math. Soc.
Transl., 162, 1 (1994).
[99] J. B. Elsner and A. A. Tsonis, Singular Spectrum Analysis: A New Tool in Time Series
Analysis, Plenum Press, New York, 1996.
[100] N. Golyandina, V. Nekrutkin, and A. Zhigliavsky, Analysis of Time Series Structure:
SSA and Related Techniques, Chapman & Hall/CRC, Boca Raton, FL, 2002.
[101] J. D. Farmer and J. J. Sidorowich, Exploiting Chaos to Predict the Future and Reduce
Noise, in Y. C. Lee, ed., Evolution, Learning, and Cognition, World Scientic, Singapore, 1988.
[102] A. Basilevsky and D. P. J. Hum, KarhunenLo`
eve Analysis of Historical Time Series
With an Application to Plantation Births in Jamaica, J. Amer. Statist. Assoc., 74, 284
(1979).
[103] D. L. Danilov, Principal Components in Time Series Forecast, J. Comp. Graph. Stat.,
6, 112 (1997).
[104] R. Cawley and GH. Hsu, LocalGeometricProjection Method for Noise Reduction in
Chaotic Maps and Flows, Phys. Rev. A, 46, 3057 (1992).
[105] C. Penland and T. Magorian, Prediction of Ni
no 3 Sea Surface Temperatures Using
Linear Inverse Modeling, J. Clim., 6, 1067 (1993).
[106] T. Sauer, Time Series Prediction by Using Delay Coordinate Embedding, in Ref. [19].
77