
# Linear Algebra Review

Niels O. Nygaard
The University of Chicago
August 29, 2006
## 1 Vector Spaces and Bases
Recall that a real vector space is a set V with a rule for adding elements
+ and a rule for multiplying elements by real numbers, i.e. if v and w are
elements in V (we shall from now on call them vectors), then we can form
another vector v + w, and if λ ∈ R we can form the vector λv. Also there
has to be a zero vector 0, so a vector space can never be the empty set.
These rules are supposed to satisfy some self-evident axioms, e.g. v + 0 = v,
λ(v + w) = λv + λw, etc.
We shall mostly deal with the vector spaces R^n of n-tuples of real numbers;
thus a vector in R^n is an n-tuple (x_1, x_2, ..., x_n). Other examples of
vector spaces are:

- The set of continuous functions on an interval, where addition of functions is defined by (f + g)(x) = f(x) + g(x)
- The set of differentiable functions on an open interval ]a, b[
- The set of real valued functions on a subset U ⊆ R^n
- The set of solutions to a system of linear differential equations Df = 0

Instead of using the real numbers R we can also use the complex numbers
C, in which case we talk about complex vector spaces.
Definition 1.0.1 A linear combination of vectors v_1, v_2, ..., v_k in a vector
space V is an expression of the form λ_1 v_1 + λ_2 v_2 + ... + λ_k v_k, where the λ's are real
numbers (or complex numbers if we are dealing with a complex vector space).
A linear combination is of course again a vector in V.
Definition 1.0.2 A subspace U of a vector space V is a (non-empty) subset
U ⊆ V with the property that if we take any two vectors u_1, u_2 which
happen to be in the subset U then their sum u_1 + u_2 is also in U, and for
any number λ (real or complex depending on whether V is a real or complex
vector space) and any u ∈ U, λu is again in U. Remark that since 0u = 0,
the zero vector is in U.
Proposition 1.0.1 Consider a set of vectors v_1, v_2, v_3, ..., v_k in V. Consider
the set of all possible linear combinations of these vectors. This set
is denoted Span{v_1, v_2, ..., v_k}; thus a typical element in this subset is a
vector w in V which can be expressed in the form

w = a_1 v_1 + a_2 v_2 + ... + a_k v_k

where the a's are real or complex numbers. Then Span{v_1, v_2, v_3, ..., v_k} is
a subspace of V.
Proof: We have to show that if w_1 and w_2 are in Span{v_1, v_2, v_3, ..., v_k}
then w_1 + w_2 and λw_1 are also in Span{v_1, v_2, ..., v_k}. Now we can write

w_1 = a_1 v_1 + a_2 v_2 + ... + a_k v_k

and

w_2 = b_1 v_1 + b_2 v_2 + ... + b_k v_k

Hence

w_1 + w_2 = (a_1 v_1 + a_2 v_2 + ... + a_k v_k) + (b_1 v_1 + b_2 v_2 + ... + b_k v_k)

but we can rewrite this expression as

w_1 + w_2 = (a_1 + b_1)v_1 + (a_2 + b_2)v_2 + ... + (a_k + b_k)v_k

and this is a linear combination of the v's, hence an element of Span{v_1, v_2, ..., v_k}.
Similarly

λw_1 = λ(a_1 v_1 + a_2 v_2 + ... + a_k v_k) = λa_1 v_1 + λa_2 v_2 + ... + λa_k v_k

and this is again an element of Span{v_1, v_2, ..., v_k}.
Definition 1.0.3 If U is a subspace of V, we say that U is spanned by
vectors u_1, u_2, ..., u_r in U if U = Span{u_1, u_2, ..., u_r}. We note that there
may be many different sets of vectors spanning U.
Example 1.1 Let V = R^n and consider the subset U ⊆ V consisting of
vectors u = (x_1, x_2, ..., x_n) with x_1 + x_2 + ... + x_n = 0, i.e. the set of all
vectors whose coordinates add up to 0. Then U is a subspace. Indeed if
u_1 = (x_1, x_2, ..., x_n) and u_2 = (y_1, y_2, ..., y_n) are in U, so
x_1 + x_2 + ... + x_n = 0 and y_1 + y_2 + ... + y_n = 0, we have
u_1 + u_2 = (x_1 + y_1, x_2 + y_2, ..., x_n + y_n) and

(x_1 + y_1) + (x_2 + y_2) + ... + (x_n + y_n) = (x_1 + x_2 + ... + x_n) + (y_1 + y_2 + ... + y_n) = 0 + 0 = 0.

Also λu = (λx_1, λx_2, ..., λx_n) and

λx_1 + λx_2 + ... + λx_n = λ(x_1 + x_2 + ... + x_n) = λ0 = 0

We claim that U is spanned by the n − 1 vectors:

(1, −1, 0, 0, ..., 0)
(0, 1, −1, 0, ..., 0)
(0, 0, 1, −1, ..., 0)
  ⋮
(0, 0, ..., 0, 1, −1)

First, all these vectors are clearly in U. Assume now that
u = (x_1, x_2, x_3, ..., x_n) ∈ U, so x_1 + x_2 + x_3 + ... + x_n = 0.
Consider the linear combination

x_1 (1, −1, 0, 0, 0, ..., 0)
+ (x_1 + x_2)(0, 1, −1, 0, 0, ..., 0)
+ (x_1 + x_2 + x_3)(0, 0, 1, −1, 0, ..., 0)
+ (x_1 + x_2 + x_3 + x_4)(0, 0, 0, 1, −1, ..., 0)
  ⋮
+ (x_1 + x_2 + ... + x_{n−1})(0, 0, 0, ..., 1, −1)

This is equal to

(x_1, −x_1 + (x_1 + x_2), −(x_1 + x_2) + (x_1 + x_2 + x_3), ..., −(x_1 + x_2 + x_3 + ... + x_{n−1}))
= (x_1, x_2, x_3, ..., −(x_1 + x_2 + x_3 + ... + x_{n−1}))
= (x_1, x_2, x_3, ..., x_n) = u

where the last equality follows from x_1 + x_2 + ... + x_{n−1} + x_n = 0, so
x_n = −(x_1 + x_2 + ... + x_{n−1}).
But U is also spanned by the n − 1 vectors

(1, −1, 0, 0, ..., 0)
(1, 0, −1, 0, ..., 0)
(1, 0, 0, −1, ..., 0)
  ⋮
(1, 0, 0, 0, ..., −1)

To see this we consider the linear combination

−x_2 (1, −1, 0, 0, ..., 0)
− x_3 (1, 0, −1, 0, ..., 0)
− x_4 (1, 0, 0, −1, ..., 0)
  ⋮
− x_n (1, 0, 0, 0, ..., −1)
= (−x_2 − x_3 − x_4 − ... − x_n, x_2, x_3, ..., x_n)
= (x_1, x_2, x_3, x_4, ..., x_n) = u

since x_1 + x_2 + x_3 + ... + x_n = 0, so x_1 = −x_2 − x_3 − ... − x_n.
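As a quick numerical sanity check (not part of the original notes), the first spanning set of Example 1.1 can be tested with numpy: the coefficients derived above are exactly the partial sums x_1, x_1 + x_2, ..., x_1 + ... + x_{n−1}. The values of n and u below are made up for the illustration.

```python
import numpy as np

n = 5
# The n-1 spanning vectors (1,-1,0,...), (0,1,-1,0,...), ... from Example 1.1.
span1 = np.array([[1 if j == i else (-1 if j == i + 1 else 0) for j in range(n)]
                  for i in range(n - 1)], dtype=float)

# A vector whose coordinates sum to 0.
u = np.array([3.0, -1.0, 2.0, -5.0, 1.0])
assert abs(u.sum()) < 1e-12

# Coefficients derived in the example: c_i = x_1 + ... + x_i (partial sums).
coeffs = np.cumsum(u)[:-1]
reconstructed = coeffs @ span1
print(np.allclose(reconstructed, u))  # True: the vectors span U
```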
Definition 1.0.4 A set of vectors v_1, v_2, ..., v_k is said to be linearly independent
if the only linear combination of these vectors which is equal
to 0 is the linear combination where all the coefficients are 0. That is,
α_1 v_1 + α_2 v_2 + α_3 v_3 + ... + α_k v_k = 0 implies that α_1 = α_2 = α_3 = ... = α_k = 0.

This is equivalent to saying that two linear combinations of these vectors are
equal if and only if all the coefficients are equal, i.e.

α_1 v_1 + α_2 v_2 + α_3 v_3 + ... + α_k v_k = β_1 v_1 + β_2 v_2 + β_3 v_3 + ... + β_k v_k

if and only if α_1 = β_1, α_2 = β_2, α_3 = β_3, ..., α_k = β_k.

Indeed if the vectors are linearly independent then

α_1 v_1 + α_2 v_2 + α_3 v_3 + ... + α_k v_k = β_1 v_1 + β_2 v_2 + β_3 v_3 + ... + β_k v_k

if and only if

(α_1 − β_1)v_1 + (α_2 − β_2)v_2 + (α_3 − β_3)v_3 + ... + (α_k − β_k)v_k = 0

so α_1 − β_1 = α_2 − β_2 = ... = α_k − β_k = 0.
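For vectors in R^n, linear independence can be checked numerically by comparing the rank of the matrix the vectors form with the number of vectors; a small sketch with numpy (not part of the original notes, and the example vectors are made up):

```python
import numpy as np

def linearly_independent(vectors):
    """Vectors (rows) are independent iff the matrix rank equals their count."""
    M = np.array(vectors, dtype=float)
    return np.linalg.matrix_rank(M) == M.shape[0]

print(linearly_independent([[1, -1, 0], [0, 1, -1]]))              # True
print(linearly_independent([[1, -1, 0], [0, 1, -1], [1, 0, -1]]))  # False: third = first + second
```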
Definition 1.0.5 Consider a vector space V. A set of vectors v_1, v_2, v_3, ..., v_k
is said to be a basis of V if

1. They span V, i.e. every vector in V can be written as a linear combination of these vectors
2. They are linearly independent

Thus v_1, v_2, v_3, ..., v_k is a basis if and only if every vector in V can be
written in exactly one way as a linear combination of these vectors.
Example 1.2 The vectors

e_1 = (1, 0, 0, ..., 0, 0)
e_2 = (0, 1, 0, ..., 0, 0)
e_3 = (0, 0, 1, ..., 0, 0)
  ⋮
e_{n−1} = (0, 0, 0, ..., 1, 0)
e_n = (0, 0, 0, ..., 0, 1)

form a basis for the vector space R^n (resp. C^n). This is called the
standard basis of R^n (resp. C^n).

Both of the two sets of vectors considered in Example 1.1 are bases of the
subspace U.
A very important result is the following:

Theorem 1.0.1 Let v_1, v_2, ..., v_k and w_1, w_2, ..., w_r be bases of the same
vector space V. Then k = r, i.e. any two bases of a vector space V have
the same number of vectors in them. This common number is called the
dimension of V, dim V.
Definition 1.0.6 A vector space is said to be finite dimensional if it has a
basis.

Remark 1.1 There are plenty of vector spaces which are not finite dimensional,
e.g. the vector space considered above of continuous functions on
an interval [a, b] does not have a basis in the sense above. If we define a
linear combination of infinitely many vectors as any linear combination involving
only a finite number of them, then in fact any vector space, finite
dimensional or not, has a basis in the sense that there is a set of vectors,
maybe infinitely many, such that every vector can be written uniquely as a
linear combination of vectors from this set. Except when explicitly stated,
we consider finite dimensional vector spaces.
Some other results related to bases:

Theorem 1.0.2 Let dim V = m. If v_1, v_2, ..., v_k is a set of linearly independent
vectors then k ≤ m. If k = m the set of vectors is a basis. In
other words, in an m-dimensional vector space no linearly independent set
can have more than m vectors in it, and any set of m linearly independent
vectors also spans the vector space.
Theorem 1.0.3 Let v_1, v_2, ..., v_s be linearly independent vectors in a vector
space V; then we can supplement this set with vectors v_{s+1}, v_{s+2}, ..., v_k
such that the combined set v_1, v_2, ..., v_s, v_{s+1}, ..., v_k is a basis (i.e.
dim V = k).
Theorem 1.0.4 Let w_1, ..., w_n be a set of vectors which span V, i.e. every
vector in V can be written as a linear combination of these vectors. Then
we can pick out a basis w_{i_1}, w_{i_2}, ..., w_{i_k} from this set. Thus dim V ≤ n.
Theorem 1.0.5 Any subspace U ⊆ V of a finite dimensional vector space
is also finite dimensional, and dim U ≤ dim V with equality if and only if
U = V.
The great thing about bases is that if we fix a basis of a vector space V,
say v_1, v_2, ..., v_k, we can basically identify V with the vector space R^k (or
C^k if we are in the complex case). The reason for this is that any vector
v ∈ V can be written as a linear combination v = a_1 v_1 + a_2 v_2 + ... + a_k v_k,
and the coefficients (a_1, a_2, ..., a_k), which we can view as a vector in R^k,
are uniquely determined by the vector v. We call these coefficients the
coordinates of v with respect to the basis v_1, v_2, ..., v_k. Thus the vector v
determines and in turn is determined by its coordinates. Remember though
that coordinates are always relative to a given basis. If we take another
basis, say w_1, w_2, ..., w_k, then v has a set of coordinates with respect to this
basis, but they will be totally different from the coordinates with respect to
the basis v_1, v_2, ..., v_k.

With a chosen basis, a vector is then completely determined by its coordinates,
and these coordinates form a vector in R^k (or C^k), and we can do all
computations involving the vector by doing them on the coordinates, e.g. we
get the coordinates of the sum of two vectors in V simply by adding their
coordinates in R^k.
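For V = R^k with a basis given by the columns of an invertible matrix, finding the coordinates of a vector amounts to solving a linear system; a small illustration with numpy (the basis and vector here are made up for the example):

```python
import numpy as np

# Columns of B are a (made-up) basis of R^3.
B = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

v = np.array([2.0, 3.0, 5.0])

# Coordinates a solve B @ a = v, i.e. v = a_1 b_1 + a_2 b_2 + a_3 b_3.
coords = np.linalg.solve(B, v)
print(np.allclose(B @ coords, v))  # True: v is recovered from its coordinates
```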
## 2 Linear Transformations and Matrices
Definition 2.0.7 Let V and W be vector spaces. A linear transformation
(or linear map) is a map T : V → W which preserves the vector space
structures, i.e. T(v_1 + v_2) = T(v_1) + T(v_2) and T(λv) = λT(v).

We sometimes leave out the parentheses, so we may write Tv instead of T(v).
Example 2.1 If V = R^1 = W, a linear transformation T : V → W is of the
form Tx = a·x for all x ∈ R, where a is a fixed real number.
Remark that the map Sx = a + x is not linear.
Example 2.2 Let V be the vector space of continuous functions on an interval
[a, b]. Then the map Tf = ∫_a^b f(t) dt is a linear transformation V → R.
Example 2.3 Let V be the vector space of continuously differentiable functions
on an open interval ]a, b[ and W the vector space of continuous functions
on ]a, b[. Then the map S : V → W defined by Sf = f′ is a linear
transformation.
Proposition 2.0.2 Let U, V, W be vector spaces and let T : U → V and
S : V → W be linear transformations. Then the composite map S ∘ T :
U → W, defined by S ∘ T(u) = S(Tu), is also linear, i.e. a composite of two
linear transformations is again a linear transformation.
Using bases, a linear transformation can be described very conveniently
using matrices. Thus let V and W be finite dimensional vector spaces and
let v_1, v_2, ..., v_m and w_1, w_2, ..., w_n be bases of V and W
resp., so dim V = m and dim W = n.

Let T : V → W be a linear transformation. Applying T to any of the
basis vectors v_j we obtain a vector Tv_j in W, and hence Tv_j can be written
(uniquely) as a linear combination of the w's, thus

Tv_j = a_{1j} w_1 + a_{2j} w_2 + ... + a_{nj} w_n

i.e. a_{1j}, a_{2j}, ..., a_{nj} are the coordinates of Tv_j with respect to the basis
w_1, w_2, ..., w_n.

We do this for every v_j and hence we get n·m numbers

$$
\begin{pmatrix}
a_{11} & a_{12} & \dots & a_{1m} \\
a_{21} & a_{22} & \dots & a_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \dots & a_{nm}
\end{pmatrix}.
$$

This is the matrix of T with respect to the bases v_1, v_2, ..., v_m and
w_1, w_2, ..., w_n. Thus the jth column in the matrix is the set of coordinates
of Tv_j with respect to the basis w_1, w_2, ..., w_n.

We emphasize that the matrix of a linear map only makes sense when the
bases are given. The linear map itself exists independently of bases, but the
matrix depends on them, and if we change the bases we change the matrix.
This fact is what the subject is all about: by choosing bases cleverly we are
often able to make the matrix very simple, e.g. having lots of 0s.
Theorem 2.0.6 Let T : V → W be a linear transformation, v_1, v_2, ..., v_m
and w_1, w_2, ..., w_n bases of V and W resp. Let

$$
A = \begin{pmatrix}
a_{11} & a_{12} & \dots & a_{1m} \\
a_{21} & a_{22} & \dots & a_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \dots & a_{nm}
\end{pmatrix}
$$

be the matrix of T with respect to these bases. Then the linear transformation
is completely determined by A, in the sense that we can compute Tv for any
vector v in V.

To see this let v ∈ V and write it as a linear combination v = x_1 v_1 +
x_2 v_2 + ... + x_m v_m. To determine Tv we need to find coefficients y_1, y_2, ..., y_n
such that Tv = y_1 w_1 + y_2 w_2 + ... + y_n w_n. We can do this as follows: since
T is linear

$$
\begin{aligned}
Tv &= T(x_1 v_1 + x_2 v_2 + \dots + x_m v_m) \\
&= x_1 Tv_1 + x_2 Tv_2 + \dots + x_m Tv_m \\
&= x_1 (a_{11} w_1 + a_{21} w_2 + \dots + a_{n1} w_n) \\
&\quad + x_2 (a_{12} w_1 + a_{22} w_2 + \dots + a_{n2} w_n) \\
&\qquad \vdots \\
&\quad + x_m (a_{1m} w_1 + a_{2m} w_2 + \dots + a_{nm} w_n) \\
&= (a_{11} x_1 + a_{12} x_2 + \dots + a_{1m} x_m) w_1 \\
&\quad + (a_{21} x_1 + a_{22} x_2 + \dots + a_{2m} x_m) w_2 \\
&\qquad \vdots \\
&\quad + (a_{n1} x_1 + a_{n2} x_2 + \dots + a_{nm} x_m) w_n
\end{aligned}
$$
Thus the coordinates of Tv with respect to the basis w_1, w_2, ..., w_n are
given by

$$
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
= \begin{pmatrix}
a_{11} x_1 + a_{12} x_2 + \dots + a_{1m} x_m \\
a_{21} x_1 + a_{22} x_2 + \dots + a_{2m} x_m \\
\vdots \\
a_{n1} x_1 + a_{n2} x_2 + \dots + a_{nm} x_m
\end{pmatrix}
$$
Let

$$
A = \begin{pmatrix}
a_{11} & a_{12} & \dots & a_{1m} \\
a_{21} & a_{22} & \dots & a_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \dots & a_{nm}
\end{pmatrix}
$$

be an n × m matrix and

$$
B = \begin{pmatrix}
b_{11} & b_{12} & \dots & b_{1k} \\
b_{21} & b_{22} & \dots & b_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
b_{m1} & b_{m2} & \dots & b_{mk}
\end{pmatrix}
$$

an m × k matrix (remark that the number of columns in A is equal to the
number of rows in B). We can then define the product of these two matrices
C = A · B, where C is the n × k matrix

$$
C = \begin{pmatrix}
c_{11} & c_{12} & \dots & c_{1k} \\
c_{21} & c_{22} & \dots & c_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
c_{n1} & c_{n2} & \dots & c_{nk}
\end{pmatrix}
$$

where

c_{ij} = a_{i1} b_{1j} + a_{i2} b_{2j} + a_{i3} b_{3j} + ... + a_{im} b_{mj}

for all 1 ≤ i ≤ n, 1 ≤ j ≤ k.
Example 2.4 Let

$$
A = \begin{pmatrix} 1 & 2 & 0 \\ 2 & 0 & 3 \end{pmatrix}
\quad\text{and}\quad
B = \begin{pmatrix} 2 & 3 & 0 \\ 1 & 1 & 2 \\ 0 & 2 & 0 \end{pmatrix}.
$$

The product is the 2 × 3 matrix

$$
C = \begin{pmatrix} 4 & 5 & 4 \\ 4 & 12 & 0 \end{pmatrix}
$$
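The entry formula c_ij = Σ_l a_il b_lj translates directly into a triple loop; a minimal sketch (numpy's `@` operator computes the same product), checked against Example 2.4:

```python
import numpy as np

def matmul(A, B):
    """Product of an n x m and an m x k matrix via c_ij = sum_l a_il * b_lj."""
    n, m = len(A), len(A[0])
    assert m == len(B), "columns of A must equal rows of B"
    k = len(B[0])
    return [[sum(A[i][l] * B[l][j] for l in range(m)) for j in range(k)]
            for i in range(n)]

A = [[1, 2, 0], [2, 0, 3]]
B = [[2, 3, 0], [1, 1, 2], [0, 2, 0]]
C = matmul(A, B)
print(C)  # [[4, 5, 4], [4, 12, 0]]
print(np.allclose(C, np.array(A) @ np.array(B)))  # True
```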
Using matrix multiplication we can reformulate the expression for the
coordinates of Tv as

$$
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
= A \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix}
= \begin{pmatrix}
a_{11} & a_{12} & \dots & a_{1m} \\
a_{21} & a_{22} & \dots & a_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \dots & a_{nm}
\end{pmatrix}
\cdot
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix}
$$
Given bases v_1, v_2, ..., v_m and w_1, w_2, ..., w_n and an n × m matrix A we can
define a linear transformation T as follows: let v ∈ V and let x_1, x_2, ..., x_m
be the coordinates of v with respect to the basis v_1, v_2, ..., v_m. Let y_1, y_2, ..., y_n
be defined as the matrix product

$$
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
= A \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix}
= \begin{pmatrix}
a_{11} & a_{12} & \dots & a_{1m} \\
a_{21} & a_{22} & \dots & a_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \dots & a_{nm}
\end{pmatrix}
\cdot
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix}
$$
We define Tv as the vector whose coordinates with respect to the basis
w_1, w_2, ..., w_n are y_1, y_2, ..., y_n, i.e.

Tv = y_1 w_1 + y_2 w_2 + ... + y_n w_n

Thus we have a 1–1 correspondence: (bases + linear transformation) ↔
(bases + matrix).
Theorem 2.0.7 Let U, V, W be vector spaces with bases u_1, u_2, ..., u_k, v_1, v_2, ..., v_m
and w_1, w_2, ..., w_n resp. Let T : U → V be a linear transformation with matrix

$$
B = \begin{pmatrix}
b_{11} & b_{12} & \dots & b_{1k} \\
b_{21} & b_{22} & \dots & b_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
b_{m1} & b_{m2} & \dots & b_{mk}
\end{pmatrix}
$$

with respect to the bases u_1, u_2, ..., u_k and v_1, v_2, ..., v_m; thus B is an
m × k matrix. Let S : V → W be a linear transformation with matrix

$$
A = \begin{pmatrix}
a_{11} & a_{12} & \dots & a_{1m} \\
a_{21} & a_{22} & \dots & a_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \dots & a_{nm}
\end{pmatrix}
$$

with respect to the bases v_1, v_2, ..., v_m and w_1, w_2, ..., w_n. Then the
matrix of the composite linear transformation S ∘ T : U → W with respect to
the bases u_1, u_2, ..., u_k and w_1, w_2, ..., w_n is the n × k matrix A · B.
To prove this we compute S ∘ T(u_j). First Tu_j = b_{1j} v_1 + b_{2j} v_2 + ... + b_{mj} v_m
and thus

$$
\begin{aligned}
S \circ T(u_j) = S(Tu_j) &= S(b_{1j} v_1 + b_{2j} v_2 + \dots + b_{mj} v_m) \\
&= b_{1j} Sv_1 + b_{2j} Sv_2 + \dots + b_{mj} Sv_m \\
&= b_{1j} (a_{11} w_1 + a_{21} w_2 + \dots + a_{n1} w_n) \\
&\quad + b_{2j} (a_{12} w_1 + a_{22} w_2 + \dots + a_{n2} w_n) \\
&\qquad \vdots \\
&\quad + b_{mj} (a_{1m} w_1 + a_{2m} w_2 + \dots + a_{nm} w_n) \\
&= (a_{11} b_{1j} + a_{12} b_{2j} + \dots + a_{1m} b_{mj}) w_1 \\
&\quad + (a_{21} b_{1j} + a_{22} b_{2j} + \dots + a_{2m} b_{mj}) w_2 \\
&\qquad \vdots \\
&\quad + (a_{n1} b_{1j} + a_{n2} b_{2j} + \dots + a_{nm} b_{mj}) w_n
\end{aligned}
$$

The coefficients in this linear combination are precisely the entries in the
jth column of the product matrix A · B.
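A quick numerical confirmation of the composite rule, using numpy matrices as the linear maps (the matrices and the vector are made up for the illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))  # matrix of S : V -> W
B = rng.standard_normal((3, 2))  # matrix of T : U -> V
u = rng.standard_normal(2)       # coordinates of a vector in U

# Applying T then S agrees with applying the product matrix A @ B.
print(np.allclose(A @ (B @ u), (A @ B) @ u))  # True
```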
Consider now a system of linear equations:

$$
\begin{aligned}
a_{11} x_1 + a_{12} x_2 + \dots + a_{1m} x_m &= b_1 \\
a_{21} x_1 + a_{22} x_2 + \dots + a_{2m} x_m &= b_2 \\
&\vdots \\
a_{n1} x_1 + a_{n2} x_2 + \dots + a_{nm} x_m &= b_n
\end{aligned}
$$

Let A be the n × m matrix of coefficients, i.e.

$$
A = \begin{pmatrix}
a_{11} & a_{12} & \dots & a_{1m} \\
a_{21} & a_{22} & \dots & a_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \dots & a_{nm}
\end{pmatrix}.
$$

Let x denote the column vector (x_1, x_2, ..., x_m)^T and b the column vector
(b_1, b_2, ..., b_n)^T; then the system of linear equations can be expressed in
terms of these matrices:

A · x = b

One of the major tasks of numerical analysis is to solve such systems, and
clearly the simpler the matrix A is, the easier this will be.
Now consider the vector spaces V = R^m and W = R^n with their standard
bases. We can then view the matrix A as the matrix of a linear transformation
T : V → W with respect to the standard bases. If w is the vector
b_1 e_1 + b_2 e_2 + ... + b_n e_n in W, then the system of equations above can be
formulated in a coordinate free form:

Tx = w

i.e. we have to find a vector x ∈ V which maps to w under the linear
transformation T. This may seem like a triviality, but it has very profound
implications, because it sets us free from dealing with the standard bases and
to use whatever bases of V and W are most convenient.
Before we do an example of this we want to introduce two subspaces
associated to a linear transformation T : V → W.

Definition 2.0.8 The image of T is the subset Im T ⊆ W of all vectors
w ∈ W that are of the form Tv, i.e. there is a vector v ∈ V which is mapped
to w. The kernel of T is the subset ker T ⊆ V consisting of all vectors v ∈ V
such that Tv = 0, i.e. the set of all vectors in V which under T are mapped
to the zero vector in W.
Lemma 2.0.1 Im T ⊆ W and ker T ⊆ V are subspaces.

Proof: To prove that Im T is a subspace, consider w_1 and w_2 in Im T
and real (or complex) numbers λ_1, λ_2. We have to show that λ_1 w_1 + λ_2 w_2
is also in Im T. Since w_1, w_2 ∈ Im T we can find vectors v_1, v_2 ∈ V such
that Tv_1 = w_1 and Tv_2 = w_2. Consider the vector λ_1 v_1 + λ_2 v_2 ∈ V. Then
T(λ_1 v_1 + λ_2 v_2) = λ_1 Tv_1 + λ_2 Tv_2 = λ_1 w_1 + λ_2 w_2. But this shows
precisely that the vector λ_1 v_1 + λ_2 v_2 maps to λ_1 w_1 + λ_2 w_2 under T,
i.e. this is a vector in Im T, which is what had to be proved.

The other statement is equally easy to prove: consider v_1, v_2 ∈ ker T
and λ_1, λ_2 ∈ R (or C). We have to show that λ_1 v_1 + λ_2 v_2 is also in ker T.
But we can verify this by showing that T maps this vector to 0. Now
T(λ_1 v_1 + λ_2 v_2) = λ_1 Tv_1 + λ_2 Tv_2 = λ_1 0 + λ_2 0 = 0, which shows
precisely that λ_1 v_1 + λ_2 v_2 ∈ ker T.
Proposition 2.0.3 Let T : V → W be a linear transformation. Let v_1, v_2, ..., v_m
and w_1, w_2, ..., w_n be bases of V and W resp. Let

$$
A = \begin{pmatrix}
a_{11} & a_{12} & \dots & a_{1m} \\
a_{21} & a_{22} & \dots & a_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \dots & a_{nm}
\end{pmatrix}
$$

be the matrix of T w.r.t. these bases. Then the subspace Im T ⊆ W is
spanned by the vectors in W with coordinates

$$
\begin{pmatrix} a_{11} \\ a_{21} \\ \vdots \\ a_{n1} \end{pmatrix},
\begin{pmatrix} a_{12} \\ a_{22} \\ \vdots \\ a_{n2} \end{pmatrix},
\dots,
\begin{pmatrix} a_{1m} \\ a_{2m} \\ \vdots \\ a_{nm} \end{pmatrix}
$$

with respect to the basis w_1, w_2, ..., w_n. In short: the columns of A span
the image of T. Thus dim Im T is equal to the maximum number of linearly
independent vectors among the columns. This maximum is called the column
rank of the matrix A. If we do the same for the rows of A, i.e. find the
maximum number of linearly independent rows (remark that they are vectors
in R^m while the columns are vectors in R^n), then we come up with the row
rank of A. It is a fact (but not a trivial fact) that for any matrix row rank
= column rank.
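The equality of row rank and column rank is easy to observe numerically: the rank of A and of its transpose always agree. A small sketch with numpy (the matrix is made up; its third row equals the first minus the second, so the rank is 2):

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, -1.0]])

col_rank = np.linalg.matrix_rank(A)    # rank computed from A itself
row_rank = np.linalg.matrix_rank(A.T)  # same computation on the transpose
print(col_rank, row_rank)  # 2 2
```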
Here is a nice formula which relates the dimensions of Im T and ker T:

Theorem 2.0.8 dim Im T + dim ker T = dim V
We can now formulate a theorem regarding the solutions to the equation
Tx = w.

Theorem 2.0.9 The equation

Tx = w

has a solution if and only if w ∈ Im T. If v is a solution, then any other
solution is of the form v + u where u is a vector in ker T.

Proof: First: the equation Tx = w has a solution precisely when there
is a vector v ∈ V such that Tv = w, but that is the same as saying that
w ∈ Im T.

Secondly: if v_1 is another solution, i.e. Tv_1 = w, then T(v_1 − v) =
Tv_1 − Tv = w − w = 0. Thus u = v_1 − v is in ker T. Hence v_1 = v + u with
u ∈ ker T.
Example 2.5 Consider the system of linear equations

$$
\begin{aligned}
x_1 + x_2 &= 3 \\
x_1 + x_3 &= 4 \\
x_2 - x_3 &= -1
\end{aligned}
$$

The matrix of coefficients is given by

$$
A = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & -1 \end{pmatrix}
$$

and the right hand side is the vector b = (3, 4, −1)^T. The linear transformation
T : R^3 → R^3 with this matrix with respect to the standard bases is
given by

T(x_1, x_2, x_3) = (x_1 + x_2, x_1 + x_3, x_2 − x_3).

Now consider the basis of R^3 given by f_1 = (1, 1, 0)^T, f_2 = (1, 0, 1)^T,
f_3 = (0, 1, 1)^T. Then the matrix of T with respect to the bases e_1, e_2, e_3
and f_1, f_2, f_3 is computed by

Te_1 = T(1, 0, 0) = (1, 1, 0)^T = f_1,
Te_2 = T(0, 1, 0) = (1, 0, 1)^T = f_2,
Te_3 = T(0, 0, 1) = (0, 1, −1)^T = f_1 − f_2.

Hence the matrix of T with respect to these bases is

$$
\begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & -1 \\ 0 & 0 & 0 \end{pmatrix}.
$$

The vector b is written as b = 4f_1 − f_2 and hence the equation Tx = b
becomes

$$
\begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & -1 \\ 0 & 0 & 0 \end{pmatrix}
\cdot
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
= \begin{pmatrix} 4 \\ -1 \\ 0 \end{pmatrix}
$$

or

$$
\begin{aligned}
x_1 + x_3 &= 4 \\
x_2 - x_3 &= -1 \\
0 &= 0
\end{aligned}
$$

Thus the solutions are all vectors of the form

$$
\begin{pmatrix} 4 - x_3 \\ -1 + x_3 \\ x_3 \end{pmatrix}
= \begin{pmatrix} 4 \\ -1 \\ 0 \end{pmatrix}
+ x_3 \begin{pmatrix} -1 \\ 1 \\ 1 \end{pmatrix}
$$

Note that T(x_1, x_2, x_3) = (x_1 + x_3) f_1 + (x_2 − x_3) f_2, hence the image of T is
spanned by f_1, f_2. Since they are linearly independent they form a basis for
Im T, so this subspace has dimension 2. Thus dim ker T = 1, spanned by the
vector (−1, 1, 1)^T.

This example illustrates that by changing bases we can simplify the system
to a form where we can readily read off the solutions, if there are any.
Also note that in this case it only involves changing the basis of the target
vector space.
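Example 2.5 can be checked numerically: since A is singular, a least-squares solve gives one particular solution, and the kernel direction (−1, 1, 1) can be confirmed directly. A sketch with numpy:

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, -1.0]])
b = np.array([3.0, 4.0, -1.0])

# One particular solution (A is singular, so use least squares).
v, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(A @ v, b))  # True: b is in the image of T

# The kernel is spanned by (-1, 1, 1): all solutions are v + t*(-1, 1, 1).
u = np.array([-1.0, 1.0, 1.0])
print(np.allclose(A @ u, 0))              # True
print(np.allclose(A @ (v + 2.5 * u), b))  # True
```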
### 2.1 Change of Basis
Next we shall explicitly compute how the matrix changes when we change
bases. Thus consider a linear transformation T : V → W. Let v_1, v_2, ..., v_m
be a basis of V and let w_1, w_2, ..., w_n be a basis of W. Assume that the
matrix of T with respect to these bases is

$$
A = \begin{pmatrix}
a_{11} & a_{12} & \dots & a_{1m} \\
a_{21} & a_{22} & \dots & a_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \dots & a_{nm}
\end{pmatrix}
$$
Now suppose we have two other bases v′_1, v′_2, ..., v′_m and w′_1, w′_2, ..., w′_n.
Then T has a matrix with respect to these bases, say

$$
A' = \begin{pmatrix}
a'_{11} & a'_{12} & \dots & a'_{1m} \\
a'_{21} & a'_{22} & \dots & a'_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
a'_{n1} & a'_{n2} & \dots & a'_{nm}
\end{pmatrix}
$$

The question is: what is the connection between the matrices A and A′?

To answer this question we need to introduce the coordinate transformation
matrix between two bases. Consider the two bases v_1, v_2, ..., v_m and
v′_1, v′_2, ..., v′_m. Since they are both bases, each of the vectors in one basis can
be expressed as a linear combination of the vectors in the other. Thus we
can write

v′_j = c_{1j} v_1 + c_{2j} v_2 + ... + c_{mj} v_m

for j = 1, 2, ..., m. We can then form the matrix

$$
C = \begin{pmatrix}
c_{11} & c_{12} & \dots & c_{1m} \\
c_{21} & c_{22} & \dots & c_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
c_{m1} & c_{m2} & \dots & c_{mm}
\end{pmatrix}.
$$

This matrix is called the coordinate transformation matrix between the basis
v′_1, v′_2, ..., v′_m and the basis v_1, v_2, ..., v_m. The reason for this terminology
is clear from the following result:
Proposition 2.1.1 Let u ∈ V and let λ_1, λ_2, ..., λ_m be the coordinates of
u with respect to the basis v′_1, v′_2, ..., v′_m, i.e. u = λ_1 v′_1 + λ_2 v′_2 + ... + λ_m v′_m.
Then the coordinates of u with respect to the basis v_1, v_2, ..., v_m are given by

$$
C \begin{pmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_m \end{pmatrix}
$$
Proof: We have u = λ_1 v′_1 + λ_2 v′_2 + ... + λ_m v′_m and each v′_j = c_{1j} v_1 +
c_{2j} v_2 + ... + c_{mj} v_m. Substituting these expressions we get

$$
\begin{aligned}
u &= \lambda_1 (c_{11} v_1 + c_{21} v_2 + \dots + c_{m1} v_m) \\
&\quad + \lambda_2 (c_{12} v_1 + c_{22} v_2 + \dots + c_{m2} v_m) \\
&\quad + \dots \\
&\quad + \lambda_m (c_{1m} v_1 + c_{2m} v_2 + \dots + c_{mm} v_m) \\
&= (\lambda_1 c_{11} + \lambda_2 c_{12} + \dots + \lambda_m c_{1m}) v_1 \\
&\quad + (\lambda_1 c_{21} + \lambda_2 c_{22} + \dots + \lambda_m c_{2m}) v_2 \\
&\quad + \dots \\
&\quad + (\lambda_1 c_{m1} + \lambda_2 c_{m2} + \dots + \lambda_m c_{mm}) v_m
\end{aligned}
$$

Thus the coordinates of u with respect to the basis v_1, v_2, ..., v_m are

$$
\begin{pmatrix}
\lambda_1 c_{11} + \lambda_2 c_{12} + \dots + \lambda_m c_{1m} \\
\lambda_1 c_{21} + \lambda_2 c_{22} + \dots + \lambda_m c_{2m} \\
\vdots \\
\lambda_1 c_{m1} + \lambda_2 c_{m2} + \dots + \lambda_m c_{mm}
\end{pmatrix}
= \begin{pmatrix}
c_{11} & c_{12} & \dots & c_{1m} \\
c_{21} & c_{22} & \dots & c_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
c_{m1} & c_{m2} & \dots & c_{mm}
\end{pmatrix}
\cdot
\begin{pmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_m \end{pmatrix}
= C \begin{pmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_m \end{pmatrix}
$$
Theorem 2.1.1 Let C be the coordinate transformation matrix between the
bases v′_1, v′_2, ..., v′_m and v_1, v_2, ..., v_m, and let D be the coordinate transformation
matrix between the bases w_1, w_2, ..., w_n and w′_1, w′_2, ..., w′_n. Let A
be the matrix of T w.r.t. v_1, v_2, ..., v_m and w_1, w_2, ..., w_n and let A′ be the
matrix of T w.r.t. v′_1, v′_2, ..., v′_m and w′_1, w′_2, ..., w′_n. Then

A′ = D · A · C
Proof: The jth column of A′ is computed as the coordinates of Tv′_j
with respect to the basis w′_1, w′_2, ..., w′_n. We first compute the coordinates
of Tv′_j with respect to the w's. Now v′_j = c_{1j} v_1 + c_{2j} v_2 + ... + c_{mj} v_m, thus
the coordinates of v′_j w.r.t. the v's are (c_{1j}, c_{2j}, ..., c_{mj})^T. Thus the
coordinates of Tv′_j w.r.t. the w's are

$$
A \begin{pmatrix} c_{1j} \\ c_{2j} \\ \vdots \\ c_{mj} \end{pmatrix}.
$$

Now using the previous Proposition, the coordinates with respect to the
w′'s are given by

$$
D \cdot A \begin{pmatrix} c_{1j} \\ c_{2j} \\ \vdots \\ c_{mj} \end{pmatrix}.
$$

But this is precisely the jth column in the product matrix D · A · C.
Definition 2.1.1 The m × m identity matrix is the m × m matrix with 1s
on the diagonal and all 0s outside the diagonal. We denote this matrix by
E_m. Thus

$$
E_m = \begin{pmatrix}
1 & 0 & 0 & \dots & 0 \\
0 & 1 & 0 & \dots & 0 \\
0 & 0 & 1 & \dots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \dots & 1
\end{pmatrix}
$$

Remark that for any n × m matrix A we have A · E_m = A and E_n · A = A.
Definition 2.1.2 An m × m matrix C is said to be invertible if there exists
an m × m matrix C′ such that C · C′ = C′ · C = E_m. The matrix C′, if it
exists, is called the inverse matrix and is denoted C^{−1}.
Example 2.6 Let C be the coordinate transformation matrix between the
bases v′_1, v′_2, ..., v′_m and v_1, v_2, ..., v_m, and let C′ be the coordinate transformation
matrix between the bases v_1, v_2, ..., v_m and v′_1, v′_2, ..., v′_m. Then C′ = C^{−1}.

This follows immediately since C · C′ is the coordinate transformation
matrix between v_1, v_2, ..., v_m and itself, hence is the identity matrix. The
same is true for the other product C′ · C.
Corollary 2.1.1 Let T : V → V be a linear transformation, i.e. T maps
V into itself. Let v_1, v_2, ..., v_m and v′_1, v′_2, ..., v′_m be bases of V. Let A
be the matrix of T with respect to v_1, v_2, ..., v_m and itself, and A′ the matrix
of T with respect to v′_1, v′_2, ..., v′_m and itself. Let C be the coordinate
transformation matrix between v_1, v_2, ..., v_m and v′_1, v′_2, ..., v′_m. Then

A′ = C · A · C^{−1}
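A numerical sanity check of the corollary, with the standard basis of R^3 and a made-up second basis: if P has the new basis vectors as its columns (so P converts new coordinates to standard ones), then C = P^{−1} and A′ = C A C^{−1} = P^{−1} A P. Applying T and then converting coordinates agrees with converting first and applying A′:

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],   # matrix of T in the standard basis
              [1.0, 0.0, 1.0],
              [0.0, 1.0, -1.0]])

P = np.array([[1.0, 1.0, 0.0],   # columns: a second (made-up) basis of R^3
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

C = np.linalg.inv(P)             # converts standard coordinates to new ones
A_prime = C @ A @ np.linalg.inv(C)

v = np.array([2.0, -1.0, 3.0])   # a vector in standard coordinates
lhs = C @ (A @ v)                # compute Tv, then convert to new coordinates
rhs = A_prime @ (C @ v)          # convert first, then apply A'
print(np.allclose(lhs, rhs))     # True
```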
## 3 Inner Products, Norms and Orthogonality
Let V be a vector space.
Definition 3.0.3 An inner product on V is a map ( , ) : V × V → R (or C
if V is a complex vector space). This map satisfies the following conditions:

1. (λ_1 v_1 + λ_2 v_2, w) = λ_1 (v_1, w) + λ_2 (v_2, w) (linearity)
2. (v, w) = (w, v) if V is real and $(v, w) = \overline{(w, v)}$ if V is complex (symmetric resp. hermitian)
3. (v, v) > 0 unless v = 0 (positive definite). Remark that (v, v) is always real because of the second condition.

It follows that we have $(v, \lambda_1 w_1 + \lambda_2 w_2) = \bar{\lambda}_1 (v, w_1) + \bar{\lambda}_2 (v, w_2)$ in the
complex case (and without the conjugations in the real case). Thus
the inner product is also linear in the second variable if V is real and
conjugate linear if V is complex (the linearity conditions in the first and second
variables are known as bilinearity in the real case and sesqui-linearity in the
complex case).
Example 3.1 Let V = R^n and define the inner product of vectors v =
(x_1, x_2, ..., x_n) and w = (y_1, y_2, ..., y_n) by (v, w) = x_1 y_1 + x_2 y_2 + ... + x_n y_n.
This is easily seen to be an inner product.

If V = C^n we define $(v, w) = x_1 \bar{y}_1 + x_2 \bar{y}_2 + \dots + x_n \bar{y}_n$.
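In numpy the real inner product is `np.dot`, and `np.vdot` implements the complex version with the conjugation. Note that `np.vdot(a, b)` conjugates its *first* argument, while the convention above conjugates the second slot, so the arguments must be swapped to match. A small sketch with made-up vectors:

```python
import numpy as np

v = np.array([1 + 2j, 3 - 1j])
w = np.array([2 - 1j, 1j])

# Inner product with the convention (v, w) = sum x_i * conj(y_i):
ip = np.sum(v * np.conj(w))

# np.vdot conjugates its first argument, so swap the slots to match:
print(np.isclose(ip, np.vdot(w, v)))  # True

# (v, v) is real for any v, as the hermitian condition requires:
print(np.isclose(np.sum(v * np.conj(v)).imag, 0.0))  # True
```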
Definition 3.0.4 A linear functional on a vector space V is a linear transformation
f : V → R (or C).

If ( , ) is an inner product on V and v ∈ V then the map u ↦ (u, v)
defines a linear functional. The next theorem states that in fact every linear
functional is of this form.

Theorem 3.0.2 (The Riesz representation theorem) Let dim V = m and
let ( , ) be an inner product. If f is a linear functional on V there exists a
unique vector v such that f(u) = (u, v) for all u ∈ V.
Proof: We shall prove it in the real case. The complex case is slightly
more complicated. Let v_1, v_2, ..., v_m be a basis. We define a linear functional
f_i : V → R by giving its matrix with respect to this basis of V and the
standard basis e_1 = 1 of R^1. The matrix of a linear functional is a 1 × m
matrix, i.e. a row. We let f_i be defined by the matrix

(0 0 ... 0 1 0 ... 0)

where the 1 is in the ith position. Thus f_i maps v_i to 1 and all the other
basis vectors to 0.

Consider the set V* of all linear functionals on V. This set has a vector
space structure if we define f + g by (f + g)(v) = f(v) + g(v) and (λf)(v) =
λ(f(v)) (it has to be verified that f + g and λf are again linear functionals, but
that is trivial). This vector space is called the dual space of V. We shall show
that the linear functionals we defined above form a basis for V*. We first
show that they are linearly independent in V*. Let λ_1 f_1 + λ_2 f_2 + ... + λ_m f_m
be a linear combination which is the 0-linear functional, i.e. the functional
which is identically 0. Evaluating this linear combination on the basis vector
v_i we get 0 = λ_1 f_1(v_i) + λ_2 f_2(v_i) + ... + λ_i f_i(v_i) + ... + λ_m f_m(v_i) =
λ_i f_i(v_i) = λ_i. Thus λ_i = 0 for i = 1, 2, ..., m, which proves the linear
independence.

To show that they span V*, consider a general linear functional f ∈ V*.
We have to show that we can write f as a linear combination of the f_i's.
Define λ_i = f(v_i), i.e. the value of f on the basis vector v_i, and put g =
λ_1 f_1 + λ_2 f_2 + ... + λ_m f_m. Then g(v_i) = λ_i for i = 1, 2, ..., m. Thus f and
g take the same values on all the basis vectors, hence f − g vanishes on all
the basis vectors. But since any vector in V is a linear combination of these
basis vectors, this means that f − g must vanish on any vector, i.e. f − g is
identically 0, or in other words f = g, and so f is a linear combination of the
f_i's.

We conclude from this that dim V = dim V*.

Now let v ∈ V. Then we define a linear functional f_v by f_v(u) = (u, v).
Because the inner product is linear in the first variable this is a linear functional,
i.e. f_v ∈ V*. Thus we have defined a map v ↦ f_v, V → V*. The
statement of the Riesz representation theorem is that this map is a bijection.

First, this map is a linear transformation, i.e. f_{v + λv′} = f_v + λf_{v′}. We
verify this by evaluating both sides on a general vector u. The left hand side
evaluates to (u, v + λv′) = (u, v) + λ(u, v′), because the inner product
is also linear in the second variable (remember we are in the real case), but
this is precisely equal to the right hand side evaluated on u.

Next, if v ∈ ker, so that f_v is identically 0, we have in particular
that 0 = f_v(v) = (v, v). By the positive definiteness this means that v = 0,
and so the only vector in ker is the 0-vector. Now using the formula
dim ker + dim Im = dim V together with dim V = dim V*, we see that
dim Im = dim V*, so Im must be all of V*, i.e. every linear functional
is of the form f_v.

To show uniqueness, assume that f_v = f_{v′}; then f_{v − v′} is identically 0,
hence v − v′ ∈ ker and so v − v′ = 0, i.e. v = v′.
If $(\ ,\ )$ is an inner product we define the norm of $v$ by $\|v\| = \sqrt{(v, v)}$. Thus $\|v\| = 0$ if and only if $v$ is the 0-vector.

Theorem 3.0.3 (The Cauchy-Schwartz inequality) For any two vectors $v$ and $w$ we have

$$|(v, w)| \le \|v\|\,\|w\|$$

with equality if and only if $w$ is a multiple of $v$.

Proof: We shall prove it in the complex case; the real case is done the same way but is easier. Let $\lambda = x + iy$ be a complex number and consider $\|\lambda v + w\|^2 = (\lambda v + w, \lambda v + w)$. Using the bilinearity this equals $|\lambda|^2\|v\|^2 + \lambda(v, w) + \bar\lambda(w, v) + \|w\|^2 = |\lambda|^2\|v\|^2 + \lambda(v, w) + \overline{\lambda(v, w)} + \|w\|^2$. Since it is the square of a norm, this expression is always $\ge 0$, with equality if and only if $w = -\lambda v$. We now view it as a function of the two variables $x$ and $y$ and try to find its minimum. Writing out the expression we get

$$(x^2 + y^2)\|v\|^2 + 2\big(x\,\Re(v, w) - y\,\Im(v, w)\big) + \|w\|^2$$

Taking the partial derivatives with respect to $x$ and $y$ and putting them equal to 0 we get

$$\frac{\partial}{\partial x} = 2x\|v\|^2 + 2\Re(v, w) = 0$$
$$\frac{\partial}{\partial y} = 2y\|v\|^2 - 2\Im(v, w) = 0$$

This gives $x = -\dfrac{\Re(v, w)}{\|v\|^2}$ and $y = \dfrac{\Im(v, w)}{\|v\|^2}$, and so the minimum value is

$$\left(\frac{(\Re(v, w))^2}{\|v\|^4} + \frac{(\Im(v, w))^2}{\|v\|^4}\right)\|v\|^2 - 2\left(\frac{(\Re(v, w))^2}{\|v\|^2} + \frac{(\Im(v, w))^2}{\|v\|^2}\right) + \|w\|^2 = \frac{|(v, w)|^2}{\|v\|^2} - 2\,\frac{|(v, w)|^2}{\|v\|^2} + \|w\|^2 = -\frac{|(v, w)|^2}{\|v\|^2} + \|w\|^2$$

This expression is $\ge 0$ and so we get

$$-\frac{|(v, w)|^2}{\|v\|^2} + \|w\|^2 \ge 0$$

or

$$-|(v, w)|^2 + \|v\|^2\|w\|^2 \ge 0$$

which gives the result.
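As a quick sanity check, the inequality and its equality case can be verified numerically. The sketch below uses NumPy with the standard dot product on $\mathbf{R}^5$ (the vectors are arbitrary random data of our choosing; any inner product would do):

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(5)
w = rng.standard_normal(5)

# |(v, w)| <= ||v|| ||w||
assert abs(v @ w) <= np.linalg.norm(v) * np.linalg.norm(w)

# Equality exactly when w is a multiple of v
w2 = 3.0 * v
assert np.isclose(abs(v @ w2), np.linalg.norm(v) * np.linalg.norm(w2))
```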
Corollary 3.0.2 (The triangle inequality) For any two vectors $v, w$ we have the inequality

$$\|v + w\| \le \|v\| + \|w\|$$

Proof: Squaring both sides we have to show

$$\|v + w\|^2 \le \|v\|^2 + \|w\|^2 + 2\|v\|\,\|w\|$$

Computing the left hand side we have $\|v + w\|^2 = (v + w, v + w) = \|v\|^2 + (v, w) + (w, v) + \|w\|^2$. Thus we need to show $(v, w) + (w, v) \le 2\|v\|\,\|w\|$. If $(v, w)$ is real this is a direct consequence of the Cauchy-Schwartz inequality. If $(v, w) = a + ib$ then $(v, w) + (w, v) = (v, w) + \overline{(v, w)} = 2a$. We have $2a \le 2\sqrt{a^2 + b^2} = 2|(v, w)| \le 2\|v\|\,\|w\|$.
Corollary 3.0.3 For any two vectors $v$ and $w$ we have

$$\big|\,\|v\| - \|w\|\,\big| \le \|v - w\|$$

Proof: Squaring both sides we have to show $\|v\|^2 + \|w\|^2 - 2\|v\|\,\|w\| \le (v - w, v - w) = \|v\|^2 - (v, w) - (w, v) + \|w\|^2$. Thus it comes down to showing that $(v, w) + (w, v) \le 2\|v\|\,\|w\|$, which is precisely what we showed above.
Definition 3.0.5 Two vectors $v$ and $w$ are said to be orthogonal if $(v, w) = 0$.

In the real case we have the formula $\cos\theta = \dfrac{(v, w)}{\|v\|\,\|w\|}$ where $\theta$ is the angle between the two vectors (remark that this does not make sense in the complex case because the inner product in general is not real, whereas of course $\cos\theta$ is a real number).

Definition 3.0.6 A set of vectors $q_1, q_2, \ldots, q_r$ is said to be an orthogonal set if for any $i \ne j$, $(q_i, q_j) = 0$, i.e. they are pairwise orthogonal. If we also have $\|q_i\| = 1$ for all $i$, we talk of an orthonormal set.
Proposition 3.0.2 Let $q_1, q_2, \ldots, q_r$ be an orthogonal set of non-zero vectors; then they are linearly independent.

Proof: Consider a linear combination equal to the 0-vector

$$c_1 q_1 + c_2 q_2 + \cdots + c_r q_r = 0$$

We have to show that all the coefficients are 0. Now

$$0 = (c_1 q_1 + c_2 q_2 + \cdots + c_i q_i + \cdots + c_r q_r, q_i) = c_1(q_1, q_i) + c_2(q_2, q_i) + \cdots + c_i(q_i, q_i) + \cdots + c_r(q_r, q_i) = c_i\|q_i\|^2$$

Since $\|q_i\| > 0$ this means $c_i = 0$.
Lemma 3.0.1 Assume $q_1, q_2, \ldots, q_r$ is an orthonormal set and let $v$ be any vector. Then the vector

$$u = v - (v, q_1)q_1 - (v, q_2)q_2 - \cdots - (v, q_r)q_r$$

is orthogonal to all the vectors $q_1, q_2, \ldots, q_r$.

If $\dim V = m$ then it follows that any orthonormal set of $m$ vectors is a basis, and so $u$ is orthogonal to all the vectors in the basis and hence to every vector in $V$, in particular to itself. Thus $0 = (u, u) = \|u\|^2$. It follows that $u = 0$ and hence we have

$$v = (v, q_1)q_1 + (v, q_2)q_2 + \cdots + (v, q_m)q_m$$

Proof: We verify this simply by taking the inner product with any of the $q_i$'s. We get

$$(q_i, u) = (q_i,\; v - (v, q_1)q_1 - (v, q_2)q_2 - \cdots - (v, q_i)q_i - \cdots - (v, q_r)q_r)$$
$$= (q_i, v) - (q_1, v)(q_i, q_1) - (q_2, v)(q_i, q_2) - \cdots - (q_i, v)(q_i, q_i) - \cdots - (q_r, v)(q_i, q_r)$$
$$= (q_i, v) - (q_i, v)\|q_i\|^2 = 0$$

since $\|q_i\| = 1$.

The vector $(v, q_1)q_1 + (v, q_2)q_2 + \cdots + (v, q_r)q_r$ is called the orthogonal projection of $v$ onto the subspace $\mathrm{Span}\{q_1, q_2, \ldots, q_r\}$.
Theorem 3.0.4 (Gram-Schmidt Orthogonalization) Let $v_1, v_2, \ldots, v_r$ be a linearly independent set of vectors. Then there is a set of orthonormal vectors $q_1, q_2, \ldots, q_r$ such that $\mathrm{Span}\{v_1, v_2, \ldots, v_i\} = \mathrm{Span}\{q_1, q_2, \ldots, q_i\}$ for $i = 1, 2, \ldots, r$.

Proof: Let $q_1 = \dfrac{v_1}{\|v_1\|}$. Assume we have constructed $q_1, \ldots, q_{i-1}$. Let $u_i = v_i - (q_1, v_i)q_1 - \cdots - (q_{i-1}, v_i)q_{i-1}$. Then as we have seen $u_i$ is orthogonal to $q_1, \ldots, q_{i-1}$. Now put $q_i = \dfrac{u_i}{\|u_i\|}$.
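The construction in the proof translates directly into code. Here is a minimal sketch for column vectors in $\mathbf{R}^m$ (the function name and the example vectors are our own, not from the notes):

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize a linearly independent list of vectors.

    At step i we subtract from v_i its components along the
    previously built q_1, ..., q_{i-1}, then normalize.
    """
    qs = []
    for v in vectors:
        u = v.astype(float).copy()
        for q in qs:
            u -= (q @ v) * q   # remove the component of v along q
        qs.append(u / np.linalg.norm(u))
    return qs

qs = gram_schmidt([np.array([1.0, 1.0, 1.0]),
                   np.array([1.0, 0.0, 1.0])])
assert np.isclose(qs[0] @ qs[1], 0.0)          # orthogonal
assert np.isclose(np.linalg.norm(qs[0]), 1.0)  # normalized
```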
Figure 1: Orthogonal Projection
Definition 3.0.7 Let $U \subseteq V$ be a subspace. We denote by $U^\perp$ the subset of vectors $u'$ which are orthogonal to all vectors in $U$, i.e. $u' \in U^\perp$ if and only if $(u', u) = 0$ for all $u \in U$. $U^\perp$ is called the orthogonal complement of $U$.

Proposition 3.0.3 $U^\perp$ is a subspace of $V$ and any vector $v \in V$ can be written uniquely in the form $v = u + u^\perp$ with $u \in U$ and $u^\perp \in U^\perp$.

Proof: If $u'_1$ and $u'_2$ are vectors in $U^\perp$ then we have for any vector $u \in U$, $(\alpha_1 u'_1 + \alpha_2 u'_2, u) = \alpha_1(u'_1, u) + \alpha_2(u'_2, u) = 0 + 0$. Thus $\alpha_1 u'_1 + \alpha_2 u'_2 \in U^\perp$, which shows that $U^\perp$ is a subspace.

Let $v \in V$ and let $u_1, u_2, \ldots, u_r$ be a basis of $U$. We apply the Gram-Schmidt procedure to obtain an orthonormal basis $q_1, q_2, \ldots, q_r$ of $U$. Let $u^\perp = v - (q_1, v)q_1 - (q_2, v)q_2 - \cdots - (q_r, v)q_r$; then $u^\perp$ is orthogonal to all the $q$'s and hence is orthogonal to all the vectors in $U$, since they are all linear combinations of the $q$'s. Let $u = (q_1, v)q_1 + (q_2, v)q_2 + \cdots + (q_r, v)q_r$; then $u \in U$ and $v = u + u^\perp$.

Consider $U \cap U^\perp$; if $u$ is a vector in this intersection then $u$ is orthogonal to every vector in $U$. But $u \in U$, so $(u, u) = 0$, which implies $u = 0$. Thus we conclude that $U \cap U^\perp = \{0\}$. Now assume we have written $v$ as $u + u^\perp$ and as $u_1 + u^\perp_1$. Then we get $u - u_1 = u^\perp_1 - u^\perp$. The left hand side is in $U$ and the right hand side is in $U^\perp$, and we conclude they are both 0, and so $u = u_1$ and $u^\perp = u^\perp_1$, which proves the uniqueness.
Definition 3.0.8 Let $T : V \to W$ be a linear operator. Assume both $V$ and $W$ are equipped with inner products and corresponding norms. We define the operator norm of $T$ by $\|T\| = \sup_{\|v\| = 1} \|Tv\|$.

Proposition 3.0.4 Let $u$ be any vector in $V$; then $\|Tu\| \le \|T\|\,\|u\|$.

Proof: It is clear if $u = 0$. If $u$ is not the zero vector, $\dfrac{u}{\|u\|}$ has norm 1 and so $\left\|T\!\left(\dfrac{u}{\|u\|}\right)\right\| \le \|T\|$. Since $T\!\left(\dfrac{u}{\|u\|}\right) = \dfrac{Tu}{\|u\|}$ the inequality follows.
Definition 3.0.9 A linear transformation $T : V \to W$ is said to be orthogonal (unitary in the complex case) if $(Tu, Tv) = (u, v)$, i.e. $T$ preserves the inner products.

Clearly an orthogonal transformation also preserves norms, and so $\|Tv\| = 1$ if $\|v\| = 1$. It follows that an orthogonal transformation has $\|T\| = 1$.

Proposition 3.0.5 An orthogonal transformation $T : V \to W$ is 1-1.

Proof: Since $\|Tu\| = \|u\|$ it is clear that if $Tu = 0$ we have $u = 0$. Thus $\ker T = \{0\}$. Now if $Tu = Tv$ we have $T(u - v) = 0$, hence $u - v = 0$, so $u = v$.
Definition 3.0.10 An $m \times m$ matrix $Q$ is said to be an orthogonal matrix if the columns of $Q$ form an orthonormal basis of $\mathbf{R}^m$.

Theorem 3.0.5 An $m \times m$ matrix $Q$ is orthogonal if and only if $Q \cdot {}^tQ = {}^tQ \cdot Q = E_m$, where ${}^tQ$ is the transposed matrix, i.e. the $ij$th entry in ${}^tQ$ is the $ji$th entry in $Q$.

Proof: Let

$$Q = \begin{pmatrix} q_{11} & q_{12} & \ldots & q_{1m} \\ q_{21} & q_{22} & \ldots & q_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ q_{m1} & q_{m2} & \ldots & q_{mm} \end{pmatrix}$$

We shall write this as $Q = \begin{pmatrix} q_1 & q_2 & \ldots & q_m \end{pmatrix}$ where $q_j$ is the $j$th column. Then ${}^tQ$ has the rows ${}^tq_1, {}^tq_2, \ldots, {}^tq_m$ and

$${}^tQ \cdot Q = \begin{pmatrix} {}^tq_1 \\ {}^tq_2 \\ \vdots \\ {}^tq_m \end{pmatrix} \cdot \begin{pmatrix} q_1 & q_2 & \ldots & q_m \end{pmatrix} = \begin{pmatrix} (q_1, q_1) & (q_1, q_2) & \ldots & (q_1, q_m) \\ (q_2, q_1) & (q_2, q_2) & \ldots & (q_2, q_m) \\ \vdots & \vdots & \ddots & \vdots \\ (q_m, q_1) & (q_m, q_2) & \ldots & (q_m, q_m) \end{pmatrix}$$

If $q_1, q_2, \ldots, q_m$ is an orthonormal basis, $(q_i, q_j) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \ne j \end{cases}$ and so we get that ${}^tQ \cdot Q = E_m$.

To go the other way: if ${}^tQ \cdot Q = E_m$ then we get from the expression for the matrix product that $(q_i, q_j) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \ne j \end{cases}$, which precisely means that the $q$'s form an orthonormal set, and since there are $m$ of them it is a basis.
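Numerically, the criterion ${}^tQ \cdot Q = E_m$ is the standard way to test orthogonality. A small sketch (the rotation matrix below is our own example of an orthogonal matrix):

```python
import numpy as np

theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # rotation: columns are orthonormal

# Q is orthogonal exactly when tQ Q (and Q tQ) is the identity
assert np.allclose(Q.T @ Q, np.eye(2))
assert np.allclose(Q @ Q.T, np.eye(2))
```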
Example 3.2 Let Q be the coordinate transformation matrix between two
orthonormal bases. Then Q is an orthogonal matrix
Theorem 3.0.6 Let $T : V \to W$ be a linear transformation. Then there exists a unique linear transformation $T^* : W \to V$, called the adjoint transformation, satisfying $(Tv, w) = (v, T^*w)$ for all vectors $v \in V$ and $w \in W$.

Proof: Consider the linear functional $v \mapsto (Tv, w)$ on $V$. By the Riesz representation theorem (Theorem 3.0.2) there exists a unique vector $w' \in V$ such that $(Tv, w) = (v, w')$ for all $v \in V$. We define $T^*w = w'$. This is a well-defined map and by construction it satisfies $(Tv, w) = (v, T^*w)$. It has to be verified that $T^*$ is a linear transformation.

We have $(v, T^*(\lambda w)) = (Tv, \lambda w) = \lambda(Tv, w) = \lambda(v, T^*w) = (v, \lambda T^*w)$ for all $v$. Hence $(v, T^*(\lambda w) - \lambda T^*w) = 0$ for all $v$, and so $T^*(\lambda w) - \lambda T^*w = 0$. The same method works to show that $T^*(w_1 + w_2) = T^*w_1 + T^*w_2$.
Theorem 3.0.7 Let $V = \mathbf{R}^m$ ($\mathbf{C}^m$) and $W = \mathbf{R}^n$ ($\mathbf{C}^n$). Let $A$ be the matrix of $T$ with respect to the standard bases. Then the matrix of $T^*$ w.r.t. the standard bases is the transposed ${}^tA$ in the real case, resp. ${}^t\bar A$, the transposed with all entries complex conjugated, in the complex case.

Proof: The $j$th column of the matrix of $T^*$ is the coordinates of $T^*e_j$, i.e. $T^*e_j = \alpha_{1j}e_1 + \alpha_{2j}e_2 + \cdots + \alpha_{mj}e_m$. Then we get $\alpha_{ij} = (e_i,\; \alpha_{1j}e_1 + \alpha_{2j}e_2 + \cdots + \alpha_{mj}e_m) = (e_i, T^*e_j) = (Te_i, e_j)$. Now $Te_i = a_{1i}e_1 + a_{2i}e_2 + \cdots + a_{ni}e_n$, so $(Te_i, e_j) = a_{ji}$, and so $\alpha_{ij} = a_{ji}$, which proves the result.
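In coordinates this says that the adjoint of "multiply by $A$" is "multiply by ${}^tA$". A quick numerical sketch of $(Av, w) = (v, {}^tA w)$ with random data of our choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 2))   # T : R^2 -> R^3
v = rng.standard_normal(2)
w = rng.standard_normal(3)

# (Tv, w) = (v, T*w) with T* given by the transpose
assert np.isclose((A @ v) @ w, v @ (A.T @ w))
```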
Definition 3.0.11 An $m \times m$ square matrix $A$ is called symmetric resp. hermitian if ${}^tA = A$ resp. ${}^t\bar A = A$. Remark that a hermitian matrix has real entries in the diagonal.
4 Eigenvectors and Eigenvalues
Definition 4.0.12 Let $T : V \to V$ be a linear transformation. A non-zero vector $v$ is said to be an eigenvector if there is a number $\lambda$ such that $Tv = \lambda v$. The number $\lambda$ is called the eigenvalue corresponding to the eigenvector $v$.

Definition 4.0.13 Let $\lambda$ be an eigenvalue; the subset $\{v \mid Tv = \lambda v\}$ is called the eigenspace associated to $\lambda$. It is a subspace of $V$; in fact it is equal to $\ker(T - \lambda I)$, where $I$ denotes the identity map, which is obviously a linear transformation.
Proposition 4.0.6 Let $\lambda_1, \lambda_2, \ldots, \lambda_r$ be distinct eigenvalues with eigenvectors $v_1, v_2, \ldots, v_r$. Then these eigenvectors are linearly independent.

Proof: We use induction on $r$. Since an eigenvector is not the zero-vector it is true for $r = 1$. Now assume it is true for $r - 1$ vectors. Assume that $c_1 v_1 + c_2 v_2 + \cdots + c_r v_r = 0$. We have to show that all the $c$'s are 0. Applying the linear transformation $T$ we get

$$c_1 \lambda_1 v_1 + c_2 \lambda_2 v_2 + \cdots + c_r \lambda_r v_r = 0$$

Multiplying the linear dependence equation by $\lambda_1$ we get

$$c_1 \lambda_1 v_1 + c_2 \lambda_1 v_2 + \cdots + c_r \lambda_1 v_r = 0$$

Comparing these two equations we see that the first terms are equal, so subtracting one from the other the first terms cancel and we get

$$c_2(\lambda_2 - \lambda_1)v_2 + \cdots + c_r(\lambda_r - \lambda_1)v_r = 0$$

By the induction hypothesis the vectors $v_2, v_3, \ldots, v_r$ are linearly independent and so all the coefficients $c_i(\lambda_i - \lambda_1) = 0$ for $i = 2, 3, \ldots, r$. Since the eigenvalues are distinct, $\lambda_i - \lambda_1 \ne 0$ for $i = 2, 3, \ldots, r$ and so we must have $c_2 = c_3 = \cdots = c_r = 0$. But then the only term left is $c_1 v_1 = 0$ and so also $c_1 = 0$.
Corollary 4.0.4 Assume $T$ has $m = \dim V$ distinct eigenvalues with eigenvectors $v_1, v_2, \ldots, v_m$; then the eigenvectors form a basis of $V$.

If we compute the matrix of $T$ with respect to a basis of eigenvectors $v_1, v_2, \ldots, v_m$ we have $Tv_i = \lambda_i v_i$, so the matrix is the diagonal matrix

$$D = \begin{pmatrix} \lambda_1 & 0 & 0 & \ldots & 0 \\ 0 & \lambda_2 & 0 & \ldots & 0 \\ 0 & 0 & \lambda_3 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \ldots & \lambda_m \end{pmatrix}$$
4.1 Determinants
Let $A$ be a square matrix. We can associate to $A$ a number $\det A$ called the determinant of $A$. The determinant has a number of properties, some of which we list below:

1. $\det A \ne 0$ if and only if $A$ has an inverse matrix

2. The determinant does not change if we change $A$ by adding a multiple of a column of $A$ to another column

3. If we change $A$ by switching two columns the determinant changes sign

4. Changing $A$ by multiplying a row by a number $\lambda$ multiplies the determinant by $\lambda$

5. The determinant of a triangular matrix (i.e. the entries above or below the diagonal are 0) is the product of the diagonal entries

6. If $A$ and $B$ are $m \times m$ matrices, $\det(A \cdot B) = \det A \,\det B$

7. $\det {}^tA = \det A$

Example 4.1 If $A$ is a $2 \times 2$ matrix $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$ we have $\det A = ad - bc$.
Example 4.2 Let $T : \mathbf{R}^m \to \mathbf{R}^m$ be a linear transformation and let $A$ be its matrix w.r.t. the standard basis. Assume $\mathbf{R}^m$ has a basis of eigenvectors $v_1, v_2, \ldots, v_m$, so that the matrix of $T$ with respect to this basis is the diagonal matrix $D$ with $\lambda_1, \lambda_2, \ldots, \lambda_m$ in the diagonal.

If $C$ is the coordinate transformation matrix between the basis of eigenvectors and the standard basis, we have $D = C^{-1} \cdot A \cdot C$ and so

$$\det D = \det(C^{-1} \cdot A \cdot C) = \det(C^{-1})\det(A)\det(C) = \det(A)\det(C^{-1})\det(C) = \det(A)\det(C^{-1} \cdot C) = \det(A)\det(E_m) = \det(A)$$

Now $\det D = \lambda_1\lambda_2\cdots\lambda_m$, the product of the eigenvalues.

Consider for a number $\lambda$ the matrix $A - \lambda E_m$. This is the matrix of the linear transformation $T - \lambda I$. Now a non-zero vector $v$ is an eigenvector with eigenvalue $\lambda$ if and only if $(T - \lambda I)(v) = 0$. Thus $\lambda$ is an eigenvalue if and only if $\ker(T - \lambda I) \ne \{0\}$. If $\ker(T - \lambda I) = \{0\}$ then we have $\dim(\operatorname{Im}(T - \lambda I)) = m$, so $T - \lambda I$ is onto. Also $\ker(T - \lambda I) = \{0\}$ implies that $T - \lambda I$ is 1-1, and so $T - \lambda I$ is a bijection and hence has an inverse map. This means that the matrix $A - \lambda E_m$ is invertible and so $\det(A - \lambda E_m) \ne 0$. This shows that $\lambda$ is an eigenvalue precisely when $\det(A - \lambda E_m) = 0$.

It can be shown that $\det(A - \lambda E_m) = (-1)^m\lambda^m + \ldots$ is a polynomial in $\lambda$ of degree $m$. This polynomial is called the characteristic polynomial of $A$. Thus the eigenvalues are precisely the roots of the characteristic polynomial.
Example 4.3 Consider $A = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$. To find the eigenvalues we compute $\det\begin{pmatrix} 1 - \lambda & 1 \\ 1 & 1 - \lambda \end{pmatrix} = (1 - \lambda)^2 - 1$. Putting this determinant equal to 0 we get $(1 - \lambda)^2 = 1$ or $1 - \lambda = \pm 1$. Hence the eigenvalues are $\lambda_1 = 0$ and $\lambda_2 = 2$.

To find eigenvectors we have to solve the systems of equations

$$x_1 + x_2 = 0$$
$$x_1 + x_2 = 0$$

and

$$x_1 + x_2 = 2x_1$$
$$x_1 + x_2 = 2x_2$$

The first set of equations gives $x_1 = -x_2$, so for instance $\begin{pmatrix} 1 \\ -1 \end{pmatrix}$ is an eigenvector with eigenvalue 0. The second set gives $x_1 = x_2$, so $\begin{pmatrix} 1 \\ 1 \end{pmatrix}$ is an eigenvector with eigenvalue 2. These two vectors form a basis of $\mathbf{R}^2$ and the matrix with respect to this basis is the diagonal matrix $\begin{pmatrix} 0 & 0 \\ 0 & 2 \end{pmatrix}$.
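The same computation can be checked in NumPy (`eigh` is the routine for symmetric matrices; the ordering of the returned eigenvalues is the library's choice, not ours):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0]])
eigenvalues, eigenvectors = np.linalg.eigh(A)  # columns are eigenvectors

assert np.allclose(sorted(eigenvalues), [0.0, 2.0])
# Each column v satisfies A v = lambda v
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)
```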
It is not true for an arbitrary matrix that there exists a basis of eigenvectors, i.e. we can't always find a basis such that the matrix becomes a diagonal matrix. It is however true if the matrix is symmetric in the real case and hermitian in the complex case. Indeed we have the following theorem.

Theorem 4.1.1 (The Spectral Theorem) Let $A$ be a symmetric or hermitian matrix. Then there exists an orthonormal basis of eigenvectors for $A$.

Let $q_1, q_2, \ldots, q_m$ be an orthonormal basis of eigenvectors for the symmetric or hermitian matrix $A$. If $Q$ is the coordinate transformation matrix between the standard basis and the basis consisting of the $q$'s then

$$Q^{-1} \cdot A \cdot Q = D$$

where $D$ is the diagonal matrix with the eigenvalues in the diagonal. Thus

$$A = Q \cdot D \cdot Q^{-1}$$

We call this the eigenvalue decomposition of $A$.

For the remainder of these notes we shall, except where specifically noted, assume we are in the real case.
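For a symmetric matrix the decomposition $A = Q \cdot D \cdot Q^{-1}$ (with $Q$ orthogonal, so $Q^{-1} = {}^tQ$) can be checked numerically. A sketch with an arbitrary symmetric matrix of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((4, 4))
A = M + M.T                          # symmetric by construction

eigenvalues, Q = np.linalg.eigh(A)   # Q has orthonormal eigenvector columns
D = np.diag(eigenvalues)

assert np.allclose(Q.T @ Q, np.eye(4))   # Q is orthogonal
assert np.allclose(A, Q @ D @ Q.T)       # eigenvalue decomposition
```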
5 The Singular Value Decomposition (SVD)
Definition 5.0.1 Let $A$ be an $m \times n$ matrix. A singular value and corresponding singular vectors consists of a non-negative real number $\sigma$, a vector $u \in \mathbf{R}^m$ and a vector $v \in \mathbf{R}^n$ such that

$$Av = \sigma u \qquad {}^tA u = \sigma v$$

Remark 5.1 If $A$ is symmetric, the singular values are the absolute values of the eigenvalues; indeed if $\lambda$ is an eigenvalue and $v$ an eigenvector for $\lambda$ then we have $Av = \lambda v$. If $\lambda < 0$ put $u = -v$; then $Av = (-\lambda)(-v) = |\lambda| u$ and ${}^tAu = Au = \lambda u = |\lambda| v$.

Definition 5.0.2 Let $A$ be an $m \times n$ matrix. A Singular Value Decomposition of $A$ is a factorization

$$A = U \cdot \Sigma \cdot V$$

where $U$ is an orthogonal $m \times m$ matrix, $V$ is an orthogonal $n \times n$ matrix and $\Sigma$ is an $m \times n$ diagonal matrix with non-negative diagonal entries.
Proposition 5.0.1 Let $A = U \cdot \Sigma \cdot V$ be a Singular Value Decomposition. Then the diagonal entries in $\Sigma$ are singular values.

Proof: Assume $m \ge n$ and let

$$\Sigma = \begin{pmatrix} \sigma_1 & 0 & \ldots & 0 \\ 0 & \sigma_2 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \sigma_n \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \ldots & 0 \end{pmatrix}$$

Let us show that $\sigma_1$ is a singular value (the other entries are completely similar). Consider the basis vector $e_1 \in \mathbf{R}^n$ and put $v = {}^tV e_1$. Thus $v$ is the first row in $V$ = first column in ${}^tV$. Also, because $V$ is orthogonal, ${}^tV = V^{-1}$. Put $u = U e_1$ (where $e_1$ now means the first canonical basis vector in $\mathbf{R}^m$). Then we have

$$Av = U \cdot \Sigma \cdot V v = U \cdot \Sigma\, e_1 = U\,\sigma_1 e_1 = \sigma_1 u$$

Also ${}^tA = {}^tV \cdot {}^t\Sigma \cdot {}^tU$, so we get

$${}^tA u = {}^tV \cdot {}^t\Sigma \cdot {}^tU(U e_1) = {}^tV \cdot {}^t\Sigma\, e_1 = {}^tV\,\sigma_1 e_1 = \sigma_1 v$$
Consider a singular value decomposition $A = U \cdot \Sigma \cdot V$ and let $v_j$ be the $j$th row in $V$. Thus $v_j = {}^tV e_j$ and $V v_j = e_j$; also $v_1, v_2, \ldots, v_n$ is an orthonormal basis of $\mathbf{R}^n$. Similarly let $u_j$ be the $j$th column in $U$, so $u_j = U e_j$, and $u_1, u_2, \ldots, u_m$ is an orthonormal basis of $\mathbf{R}^m$. Now

$$A v_j = U \cdot \Sigma \cdot V v_j = U \cdot \Sigma\, e_j = U(\sigma_j e_j) = \sigma_j U e_j = \sigma_j u_j$$

Let $T : \mathbf{R}^n \to \mathbf{R}^m$ be the linear map whose matrix with respect to the standard bases is $A$. Then the existence of a singular value decomposition is equivalent to saying that we can find an orthonormal basis of $\mathbf{R}^n$ (the $v$'s) and an orthonormal basis of $\mathbf{R}^m$ (the $u$'s) such that the matrix of $T$ with respect to these bases is a diagonal matrix.
Theorem 5.0.2 Every $m \times n$ matrix $A$ has a singular value decomposition.

Proof: Consider the $n \times n$ matrix ${}^tA \cdot A$. This is a symmetric matrix because ${}^t({}^tA \cdot A) = {}^tA \cdot {}^t({}^tA) = {}^tA \cdot A$. Hence by the spectral theorem we can find an orthogonal $n \times n$ matrix $V$ such that $V \cdot ({}^tA \cdot A) \cdot {}^tV = D$, where $D$ is the diagonal matrix with the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ of ${}^tA \cdot A$ in the diagonal. We assume that we put all the 0 eigenvalues at the end, i.e.

$$D = \begin{pmatrix} \lambda_1 & 0 & \ldots & 0 & \ldots & 0 \\ 0 & \lambda_2 & \ldots & 0 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots & & \vdots \\ 0 & 0 & \ldots & \lambda_r & \ldots & 0 \\ \vdots & \vdots & & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & 0 & \ldots & 0 \end{pmatrix}$$

and $\lambda_1, \lambda_2, \ldots, \lambda_r$ are non-zero.

Then we have ${}^t(A \cdot {}^tV)(A \cdot {}^tV) = V \cdot ({}^tA \cdot A) \cdot {}^tV = D$. Let $B = A \cdot {}^tV$ and write $B = \begin{pmatrix} b_1 \mid b_2 \mid \ldots \mid b_n \end{pmatrix}$, where the $b_j \in \mathbf{R}^m$ are the columns of $B$. Since ${}^tB \cdot B = D$ is a diagonal matrix we have

$$(b_i, b_j) = \begin{cases} 0 & \text{if } i \ne j \\ \|b_j\|^2 & \text{if } i = j \end{cases}$$

Thus $\|b_i\|^2 = \lambda_i$, $i = 1, 2, \ldots, n$, and since $\lambda_{r+1} = \cdots = \lambda_n = 0$ we have $b_{r+1} = \cdots = b_n = 0$, while $b_1, b_2, \ldots, b_r$ are $\ne 0$. Put $u_i = b_i/\|b_i\|$, $i = 1, 2, \ldots, r$. Then the $u_i$ are orthonormal and we can supplement them up to an orthonormal basis $u_1, \ldots, u_r, u_{r+1}, \ldots, u_m$. Let $U$ be the orthogonal $m \times m$ matrix with the $u$'s as columns. Then ${}^tU \cdot B$ is the matrix with $ij$th entry $(u_i, b_j)$. But $u_i = b_i/\|b_i\|$, $i = 1, 2, \ldots, r$, and the $u$'s are orthonormal, so

$$(u_i, b_j) = \begin{cases} 0 & \text{for } i \ne j \\ (b_i/\|b_i\|, b_i) = \|b_i\|^2/\|b_i\| = \|b_i\| & \text{for } i = j \end{cases}$$

Thus we get

$${}^tU \cdot B = {}^tU \cdot A \cdot {}^tV = \begin{pmatrix} \|b_1\| & 0 & \ldots & 0 & \ldots & 0 \\ 0 & \|b_2\| & \ldots & 0 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots & & \vdots \\ 0 & 0 & \ldots & \|b_r\| & \ldots & 0 \\ \vdots & \vdots & & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & 0 & \ldots & 0 \end{pmatrix} = \Sigma$$

Thus $A = U \cdot \Sigma \cdot V$.

Notice that $\lambda_i = \|b_i\|^2$, so the singular values are $\sigma_i = \|b_i\| = \sqrt{\lambda_i}$ for $i = 1, 2, \ldots, r$.
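The proof is constructive, and for small matrices it can be carried out directly. A sketch, using this section's convention $A = U\Sigma V$ (with $V$ applied un-transposed) and the matrix from the example that follows; here the rank happens to equal $m$, so no extra columns of $U$ need to be supplemented:

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0],
              [3.0, 2.0, 2.0]])
m, n = A.shape

# Eigen-decompose the symmetric matrix tA A; columns of W are eigenvectors
lams, W = np.linalg.eigh(A.T @ A)
order = np.argsort(lams)[::-1]          # non-zero eigenvalues first
lams, W = lams[order], W[:, order]
V = W.T                                  # so that V (tA A) tV = D

B = A @ V.T                              # columns b_i with ||b_i||^2 = lambda_i
sigmas = np.sqrt(np.clip(lams, 0, None))

# Build U from the normalized non-zero columns of B (rank r = 2 = m here)
U = B[:, :m] / sigmas[:m]
Sigma = np.zeros((m, n))
np.fill_diagonal(Sigma, sigmas)

assert np.allclose(U @ Sigma @ V, A)
```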
Example 5.1 Consider the $2 \times 3$ matrix $A = \begin{pmatrix} 1 & 1 & 1 \\ 3 & 2 & 2 \end{pmatrix}$. Find the singular values and the singular value decomposition.

We first form the $2 \times 2$ matrix

$$A \cdot {}^tA = \begin{pmatrix} 1 & 1 & 1 \\ 3 & 2 & 2 \end{pmatrix} \cdot \begin{pmatrix} 1 & 3 \\ 1 & 2 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} 3 & 7 \\ 7 & 17 \end{pmatrix}$$

The characteristic polynomial is $\det\begin{pmatrix} 3 - \lambda & 7 \\ 7 & 17 - \lambda \end{pmatrix} = (3 - \lambda)(17 - \lambda) - 49 = \lambda^2 - 20\lambda + 2$. Thus the eigenvalues of $A \cdot {}^tA$ are $10 \pm 7\sqrt{2}$, the singular values are $\sqrt{10 \pm 7\sqrt{2}}$, and the matrix

$$\Sigma = \begin{pmatrix} \sqrt{10 + 7\sqrt{2}} & 0 & 0 \\ 0 & \sqrt{10 - 7\sqrt{2}} & 0 \end{pmatrix}$$

To find the matrix $V$ we have to find an orthonormal basis of eigenvectors, i.e. we have to find solutions to the two systems of linear equations

$$3x_1 + 7x_2 = (10 + 7\sqrt{2})x_1 \qquad 7x_1 + 17x_2 = (10 + 7\sqrt{2})x_2$$

and

$$3x_1 + 7x_2 = (10 - 7\sqrt{2})x_1 \qquad 7x_1 + 17x_2 = (10 - 7\sqrt{2})x_2$$

From the first system we get $x_2 = (1 + \sqrt{2})x_1$, hence a normalized eigenvector for the eigenvalue $10 + 7\sqrt{2}$ is

$$v_1 = \left(\frac{1}{\sqrt{4 + 2\sqrt{2}}},\; \frac{1 + \sqrt{2}}{\sqrt{4 + 2\sqrt{2}}}\right)$$

Similarly from the second system we get $x_2 = (1 - \sqrt{2})x_1$ and hence a normalized eigenvector for $10 - 7\sqrt{2}$ is

$$v_2 = \left(\frac{1}{\sqrt{4 - 2\sqrt{2}}},\; \frac{1 - \sqrt{2}}{\sqrt{4 - 2\sqrt{2}}}\right)$$

Remark that these two eigenvectors are automatically orthogonal because they belong to different eigenvalues.

The matrix $V$ is given by

$$V = \begin{pmatrix} \dfrac{1}{\sqrt{4 + 2\sqrt{2}}} & \dfrac{1}{\sqrt{4 - 2\sqrt{2}}} \\[2ex] \dfrac{1 + \sqrt{2}}{\sqrt{4 + 2\sqrt{2}}} & \dfrac{1 - \sqrt{2}}{\sqrt{4 - 2\sqrt{2}}} \end{pmatrix}$$

Next we have to compute the matrix $U$. We first compute

$$B = {}^tV \cdot A = \begin{pmatrix} \dfrac{4 + 3\sqrt{2}}{\sqrt{4 + 2\sqrt{2}}} & \dfrac{3 + 2\sqrt{2}}{\sqrt{4 + 2\sqrt{2}}} & \dfrac{3 + 2\sqrt{2}}{\sqrt{4 + 2\sqrt{2}}} \\[2ex] \dfrac{4 - 3\sqrt{2}}{\sqrt{4 - 2\sqrt{2}}} & \dfrac{3 - 2\sqrt{2}}{\sqrt{4 - 2\sqrt{2}}} & \dfrac{3 - 2\sqrt{2}}{\sqrt{4 - 2\sqrt{2}}} \end{pmatrix}$$

It is clear that the vector $(0, 1, -1)$ is orthogonal to the two rows in $B$, hence if we normalize these three vectors and put them in as rows we get an orthogonal matrix. The norm of the first row is $\dfrac{2\sqrt{17 + 12\sqrt{2}}}{\sqrt{4 + 2\sqrt{2}}} = \sqrt{10 + 7\sqrt{2}}$, the second has norm $\dfrac{2\sqrt{17 - 12\sqrt{2}}}{\sqrt{4 - 2\sqrt{2}}} = \sqrt{10 - 7\sqrt{2}}$ and the third $\sqrt{2}$. Hence the matrix

$$U = \begin{pmatrix} \dfrac{4 + 3\sqrt{2}}{\sqrt{10 + 7\sqrt{2}}\sqrt{4 + 2\sqrt{2}}} & \dfrac{3 + 2\sqrt{2}}{\sqrt{10 + 7\sqrt{2}}\sqrt{4 + 2\sqrt{2}}} & \dfrac{3 + 2\sqrt{2}}{\sqrt{10 + 7\sqrt{2}}\sqrt{4 + 2\sqrt{2}}} \\[2ex] \dfrac{4 - 3\sqrt{2}}{\sqrt{10 - 7\sqrt{2}}\sqrt{4 - 2\sqrt{2}}} & \dfrac{3 - 2\sqrt{2}}{\sqrt{10 - 7\sqrt{2}}\sqrt{4 - 2\sqrt{2}}} & \dfrac{3 - 2\sqrt{2}}{\sqrt{10 - 7\sqrt{2}}\sqrt{4 - 2\sqrt{2}}} \\[2ex] 0 & \dfrac{1}{\sqrt{2}} & -\dfrac{1}{\sqrt{2}} \end{pmatrix}$$

is an orthogonal matrix.

Thus we get (the first two rows of $U$ pick out the norms of the rows of $B$, the third row is orthogonal to them)

$$U \cdot {}^tA \cdot V = U \cdot {}^tB = \begin{pmatrix} \sqrt{10 + 7\sqrt{2}} & 0 \\ 0 & \sqrt{10 - 7\sqrt{2}} \\ 0 & 0 \end{pmatrix} = {}^t\Sigma$$

Hence, transposing,

$${}^tV \cdot A \cdot {}^tU = \begin{pmatrix} \sqrt{10 + 7\sqrt{2}} & 0 & 0 \\ 0 & \sqrt{10 - 7\sqrt{2}} & 0 \end{pmatrix} = \Sigma$$

and so

$$A = V \cdot \Sigma \cdot U$$

is the singular value decomposition.
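NumPy's `svd` returns the decomposition in the more common form $A = U\,\Sigma\,{}^tV$, so the factors need not match the hand computation entry by entry, but the singular values must agree:

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0],
              [3.0, 2.0, 2.0]])
U, s, Vt = np.linalg.svd(A)   # A = U @ diag(s) @ Vt, s in decreasing order

expected = [np.sqrt(10 + 7 * np.sqrt(2)), np.sqrt(10 - 7 * np.sqrt(2))]
assert np.allclose(s, expected)

Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)
assert np.allclose(U @ Sigma @ Vt, A)
```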
Homework

Problem 1.
Find the singular value decompositions of the matrices

$$\begin{pmatrix} 3 & 0 \\ 0 & 2 \end{pmatrix}, \quad \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}, \quad \begin{pmatrix} 0 & 2 \\ 0 & 0 \\ 0 & 0 \end{pmatrix}, \quad \begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix}.$$

Problem 2.
Two $n \times n$ matrices $A$ and $B$ are said to be orthogonally equivalent if there exists an $n \times n$ orthogonal matrix $Q$ such that $B = Q \cdot A \cdot {}^tQ$. Is it true or false that $A$ and $B$ are orthogonally equivalent if and only if they have the same singular values?
6 The QR Decomposition
Consider an $m \times n$ ($m \ge n$) matrix

$$A = \begin{pmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \ldots & a_{mn} \end{pmatrix}$$

Let $a_j$ denote the $j$th column of $A$, so $A = \begin{pmatrix} a_1 & a_2 & a_3 & \ldots & a_n \end{pmatrix}$. Consider the sequence of subspaces of $\mathbf{R}^m$

$$\mathrm{Span}\{a_1\} \subseteq \mathrm{Span}\{a_1, a_2\} \subseteq \mathrm{Span}\{a_1, a_2, a_3\} \subseteq \cdots \subseteq \mathrm{Span}\{a_1, a_2, a_3, \ldots, a_i\} \subseteq \cdots$$

At the last step we have the subspace spanned by the columns of $A$, i.e. $\operatorname{Im} A \subseteq \mathbf{R}^m$, the image of the linear map defined by $A$.

Assume first that the image of $A$ has dimension $n$, i.e. that the columns are linearly independent.

We can then apply the Gram-Schmidt algorithm to the linearly independent vectors $a_1, a_2, \ldots, a_n$, and we get an orthonormal system of vectors in $\mathbf{R}^m$, $q_1, q_2, \ldots, q_n$. From the algorithm, at each step the vectors $q_1, q_2, \ldots, q_j$ span the subspace $\mathrm{Span}\{a_1, a_2, \ldots, a_j\}$, so we can find real numbers $r_{1j}, r_{2j}, \ldots, r_{jj}$ such that

$$a_j = r_{1j}q_1 + r_{2j}q_2 + \cdots + r_{jj}q_j$$

In fact the Gram-Schmidt algorithm constructs in the $j$th step a vector $v_j = a_j - (a_j, q_1)q_1 - (a_j, q_2)q_2 - \cdots - (a_j, q_{j-1})q_{j-1}$, so by construction $v_j$ is orthogonal to $q_1, q_2, \ldots, q_{j-1}$, and we get $q_j$ by normalizing $v_j$: $q_j = \dfrac{v_j}{\|v_j\|}$. Thus

$$a_j = (a_j, q_1)q_1 + (a_j, q_2)q_2 + \cdots + (a_j, q_{j-1})q_{j-1} + \|v_j\|\,q_j$$

hence $r_{ij} = (a_j, q_i)$ for $i \le j$.

Consider the matrix $\hat{Q}$ whose columns are the vectors $q_1, q_2, \ldots, q_n$, and consider the upper triangular matrix

$$\hat{R} = \begin{pmatrix} r_{11} & r_{12} & r_{13} & \ldots & r_{1n} \\ 0 & r_{22} & r_{23} & \ldots & r_{2n} \\ 0 & 0 & r_{33} & \ldots & r_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \ldots & r_{nn} \end{pmatrix}$$

The matrix product is

$$\hat{Q} \cdot \hat{R} = \begin{pmatrix} q_1 & q_2 & q_3 & \ldots & q_n \end{pmatrix} \cdot \hat{R} = \begin{pmatrix} r_{11}q_1 & r_{12}q_1 + r_{22}q_2 & r_{13}q_1 + r_{23}q_2 + r_{33}q_3 & \ldots \end{pmatrix} = \begin{pmatrix} a_1 & a_2 & a_3 & \ldots \end{pmatrix} = A$$

This is called the reduced QR decomposition of $A$.
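In NumPy the reduced decomposition is `np.linalg.qr` (its default mode). Note the library may negate some columns of $\hat Q$ and rows of $\hat R$ relative to a hand computation, since the signs are only determined up to that choice:

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
Q, R = np.linalg.qr(A)   # reduced QR: orthonormal columns, upper triangular R

assert np.allclose(Q.T @ Q, np.eye(3))
assert np.allclose(np.tril(R, -1), 0)   # R is upper triangular
assert np.allclose(Q @ R, A)
```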
Example 6.1 Let $A$ be the $3 \times 3$ matrix $\begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{pmatrix}$. Find the reduced QR-decomposition of $A$.

We begin by normalizing the first column of $A$, $a_1 = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}$. The norm is $\|a_1\| = \sqrt{3}$, so $q_1 = \begin{pmatrix} 1/\sqrt{3} \\ 1/\sqrt{3} \\ 1/\sqrt{3} \end{pmatrix}$. This is then the first column of $\hat{Q}$. The $(1, 1)$ entry in $\hat{R}$ is $\sqrt{3}$.

To find the second column in $\hat{Q}$ we use Gram-Schmidt: put $v_2 = a_2 - (a_2, q_1)q_1$. Then $v_2$ is orthogonal to $q_1$:

$$v_2 = \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix} - \frac{2}{\sqrt{3}}\begin{pmatrix} 1/\sqrt{3} \\ 1/\sqrt{3} \\ 1/\sqrt{3} \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix} - \begin{pmatrix} 2/3 \\ 2/3 \\ 2/3 \end{pmatrix} = \begin{pmatrix} 1/3 \\ -2/3 \\ 1/3 \end{pmatrix}$$

We normalize $v_2$: the norm is $\sqrt{\frac{1}{9} + \frac{4}{9} + \frac{1}{9}} = \sqrt{\frac{2}{3}}$. Thus $q_2 = \begin{pmatrix} 1/\sqrt{6} \\ -2/\sqrt{6} \\ 1/\sqrt{6} \end{pmatrix}$. Then $a_2 = \frac{2}{\sqrt{3}}q_1 + \sqrt{\frac{2}{3}}q_2$ and the second column in $\hat{Q}$ is $q_2$. The second column in $\hat{R}$ is $\begin{pmatrix} 2/\sqrt{3} \\ \sqrt{2/3} \\ 0 \end{pmatrix}$.

The next step in the Gram-Schmidt algorithm puts

$$v_3 = a_3 - (a_3, q_1)q_1 - (a_3, q_2)q_2 = \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix} - \frac{2}{\sqrt{3}}\begin{pmatrix} 1/\sqrt{3} \\ 1/\sqrt{3} \\ 1/\sqrt{3} \end{pmatrix} - \left(-\frac{1}{\sqrt{6}}\right)\begin{pmatrix} 1/\sqrt{6} \\ -2/\sqrt{6} \\ 1/\sqrt{6} \end{pmatrix} = \begin{pmatrix} -1/2 \\ 0 \\ 1/2 \end{pmatrix}$$

Then $v_3$ is orthogonal to $q_1$ and $q_2$. To get $q_3$ we normalize $v_3$: the norm is $\frac{1}{\sqrt{2}}$ and so $q_3 = \begin{pmatrix} -\sqrt{2}/2 \\ 0 \\ \sqrt{2}/2 \end{pmatrix}$. This is the third column of $\hat{Q}$, so

$$\hat{Q} = \begin{pmatrix} 1/\sqrt{3} & 1/\sqrt{6} & -\sqrt{2}/2 \\ 1/\sqrt{3} & -2/\sqrt{6} & 0 \\ 1/\sqrt{3} & 1/\sqrt{6} & \sqrt{2}/2 \end{pmatrix}$$

We have $a_3 = (a_3, q_1)q_1 + (a_3, q_2)q_2 + \|v_3\|q_3$, so the third column in $\hat{R}$ is $\begin{pmatrix} 2/\sqrt{3} \\ -1/\sqrt{6} \\ 1/\sqrt{2} \end{pmatrix}$ and

$$\hat{R} = \begin{pmatrix} \sqrt{3} & 2/\sqrt{3} & 2/\sqrt{3} \\ 0 & \sqrt{2/3} & -1/\sqrt{6} \\ 0 & 0 & 1/\sqrt{2} \end{pmatrix}$$
If $A = \hat{Q} \cdot \hat{R}$ is the reduced QR-decomposition of an $m \times n$ matrix $A$ with linearly independent columns, $\hat{Q}$ is an $m \times n$ matrix and $\hat{R}$ is an $n \times n$ matrix. The $n$ columns in $\hat{Q}$, $q_1, q_2, \ldots, q_n$, are orthonormal. We can extend these to an orthonormal basis for $\mathbf{R}^m$ by adjoining $q_{n+1}, q_{n+2}, \ldots, q_m$, for instance by using the Gram-Schmidt algorithm on a basis of the orthogonal complement $\mathrm{Span}\{q_1, q_2, \ldots, q_n\}^\perp$; then the matrix $Q$ with columns $q_1, q_2, \ldots, q_n, q_{n+1}, \ldots, q_m$ is an $m \times m$ orthogonal matrix. If we add $m - n$ rows of 0's to $\hat{R}$ we get an $m \times n$ upper triangular matrix $R$ and we still have $A = Q \cdot R$. This is called the (full) QR-decomposition of $A$. Remark that if $A$ is a square matrix, as in the example, the reduced and the full QR-decompositions are the same.
Example 6.2 Consider the $3 \times 2$ matrix $A = \begin{pmatrix} 1 & 1 \\ 1 & 0 \\ 1 & 1 \end{pmatrix}$.

The reduced QR-decomposition is

$$A = \begin{pmatrix} 1/\sqrt{3} & 1/\sqrt{6} \\ 1/\sqrt{3} & -2/\sqrt{6} \\ 1/\sqrt{3} & 1/\sqrt{6} \end{pmatrix} \cdot \begin{pmatrix} \sqrt{3} & 2/\sqrt{3} \\ 0 & \sqrt{2/3} \end{pmatrix}$$

To get the full QR-decomposition we would find a normal vector $q_3$ orthogonal to the columns in $\hat{Q}$. Here we could take $q_3 = \begin{pmatrix} -\sqrt{2}/2 \\ 0 \\ \sqrt{2}/2 \end{pmatrix}$, so the full QR-decomposition would be

$$A = \begin{pmatrix} 1/\sqrt{3} & 1/\sqrt{6} & -\sqrt{2}/2 \\ 1/\sqrt{3} & -2/\sqrt{6} & 0 \\ 1/\sqrt{3} & 1/\sqrt{6} & \sqrt{2}/2 \end{pmatrix} \cdot \begin{pmatrix} \sqrt{3} & 2/\sqrt{3} \\ 0 & \sqrt{2/3} \\ 0 & 0 \end{pmatrix}$$
What do we do if the columns in $A$ are not necessarily linearly independent? In this case it can happen that a vector $v_j$ constructed in the $j$th step of the Gram-Schmidt algorithm is the 0-vector:

$$0 = v_j = a_j - (a_j, q_1)q_1 - (a_j, q_2)q_2 - \cdots - (a_j, q_{j-1})q_{j-1}$$

so we can't normalize it.

If this is the case we pick $q_j$ to be any normal vector orthogonal to $q_1, q_2, \ldots, q_{j-1}$ and put $r_{jj} = 0$ in the matrix $\hat{R}$, so the $j$th column is

$$\begin{pmatrix} (a_j, q_1) \\ (a_j, q_2) \\ \vdots \\ (a_j, q_{j-1}) \\ 0 \\ \vdots \\ 0 \end{pmatrix}$$

and then we just continue the process with the vectors $q_1, q_2, \ldots, q_j$.

This again gives us a reduced QR-decomposition $A = \hat{Q} \cdot \hat{R}$, where $\hat{Q}$ has orthonormal columns (and hence can be expanded to an orthogonal $m \times m$ matrix) and $\hat{R}$ is upper triangular but may have some 0's in the diagonal.
Example 6.3 Consider the $4 \times 3$ matrix $A = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ -1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}$. Find the reduced and full QR-decompositions.

We normalize the first column to get $q_1 = \begin{pmatrix} \sqrt{2}/2 \\ 0 \\ -\sqrt{2}/2 \\ 0 \end{pmatrix}$ and $r_{11} = \sqrt{2}$.

The second step in the Gram-Schmidt algorithm gives

$$v_2 = \begin{pmatrix} 1 \\ 1 \\ 0 \\ 1 \end{pmatrix} - (a_2, q_1)q_1 = \begin{pmatrix} 1 \\ 1 \\ 0 \\ 1 \end{pmatrix} - \frac{\sqrt{2}}{2}\begin{pmatrix} \sqrt{2}/2 \\ 0 \\ -\sqrt{2}/2 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 0 \\ 1 \end{pmatrix} - \begin{pmatrix} 1/2 \\ 0 \\ -1/2 \\ 0 \end{pmatrix} = \begin{pmatrix} 1/2 \\ 1 \\ 1/2 \\ 1 \end{pmatrix}$$

We normalize to get $q_2$: the norm is $\sqrt{\frac{5}{2}}$, so $q_2 = \frac{1}{\sqrt{10}}\begin{pmatrix} 1 \\ 2 \\ 1 \\ 2 \end{pmatrix}$. The second column in $\hat{R}$ is $\begin{pmatrix} \sqrt{2}/2 \\ \sqrt{5/2} \\ 0 \end{pmatrix}$.

The third step produces

$$v_3 = \begin{pmatrix} 0 \\ 1 \\ 1 \\ 1 \end{pmatrix} - (a_3, q_1)q_1 - (a_3, q_2)q_2 = \begin{pmatrix} 0 \\ 1 \\ 1 \\ 1 \end{pmatrix} - \left(-\frac{\sqrt{2}}{2}\right)\begin{pmatrix} \sqrt{2}/2 \\ 0 \\ -\sqrt{2}/2 \\ 0 \end{pmatrix} - \sqrt{\frac{5}{2}}\cdot\frac{1}{\sqrt{10}}\begin{pmatrix} 1 \\ 2 \\ 1 \\ 2 \end{pmatrix} = 0$$

so we can't normalize $v_3$.

The third column in $\hat{R}$ is $\begin{pmatrix} -\sqrt{2}/2 \\ \sqrt{5/2} \\ 0 \end{pmatrix}$, so

$$\hat{R} = \begin{pmatrix} \sqrt{2} & \sqrt{2}/2 & -\sqrt{2}/2 \\ 0 & \sqrt{5/2} & \sqrt{5/2} \\ 0 & 0 & 0 \end{pmatrix}$$

To get the third column in $\hat{Q}$ we can take any vector orthogonal to $q_1$ and $q_2$; we can see that the vector $\begin{pmatrix} 0 \\ 1 \\ 0 \\ -1 \end{pmatrix}$ works. Normalizing we get $q_3 = \begin{pmatrix} 0 \\ 1/\sqrt{2} \\ 0 \\ -1/\sqrt{2} \end{pmatrix}$, so we get the reduced QR-decomposition

$$\begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ -1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix} = \begin{pmatrix} \sqrt{2}/2 & 1/\sqrt{10} & 0 \\ 0 & 2/\sqrt{10} & 1/\sqrt{2} \\ -\sqrt{2}/2 & 1/\sqrt{10} & 0 \\ 0 & 2/\sqrt{10} & -1/\sqrt{2} \end{pmatrix} \cdot \begin{pmatrix} \sqrt{2} & \sqrt{2}/2 & -\sqrt{2}/2 \\ 0 & \sqrt{5/2} & \sqrt{5/2} \\ 0 & 0 & 0 \end{pmatrix}$$

To find the full QR-decomposition we have to find a fourth normalized vector $q_4$ orthogonal to $q_1, q_2, q_3$. It is easy to see that the vector $\begin{pmatrix} 2 \\ -1 \\ 2 \\ -1 \end{pmatrix}$ is orthogonal to the three vectors and hence we can take $q_4 = \frac{1}{\sqrt{10}}\begin{pmatrix} 2 \\ -1 \\ 2 \\ -1 \end{pmatrix}$. Thus we get the full QR-decomposition

$$\begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ -1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix} = \begin{pmatrix} \sqrt{2}/2 & 1/\sqrt{10} & 0 & 2/\sqrt{10} \\ 0 & 2/\sqrt{10} & 1/\sqrt{2} & -1/\sqrt{10} \\ -\sqrt{2}/2 & 1/\sqrt{10} & 0 & 2/\sqrt{10} \\ 0 & 2/\sqrt{10} & -1/\sqrt{2} & -1/\sqrt{10} \end{pmatrix} \cdot \begin{pmatrix} \sqrt{2} & \sqrt{2}/2 & -\sqrt{2}/2 \\ 0 & \sqrt{5/2} & \sqrt{5/2} \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$
Homework

Problem 1.
Find the reduced and full QR-decompositions of the matrices $A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}$ and $B = \begin{pmatrix} 1 & 2 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}$.

Problem 2.
Let $A$ be an $m \times n$ ($m \ge n$) matrix and let $A = \hat{Q} \cdot \hat{R}$ be a reduced QR-decomposition. Show that $A$ has full rank $n$ if and only if all the diagonal entries in $\hat{R}$ are non-zero.
7 Orthogonal Projections
Consider a subspace V R
n
. Let u be a vector in R
n
. We know that we
can write u uniquely as v +v

where v V and v

. Since v and v

are
uniquely determined by u we can dene a map P : u v.
We call v = Pu the orthogonal projection of u onto the subspace V .
Proposition 7.0.2 The map P is a linear map R
n
R
n
and P
2
= P P =
P
Proof: We have to show that P(
1
u
1
+
2
u
2
) =
1
Pu
1
+
2
Pu
2
.
Write u
1
= v
1
+v

1
and u
2
= v
2
+v

2
where v
1
and v
2
V and v

1
, v

2
V

.
But then
1
u
1
+
2
u
2
= (
1
v
1
+
2
v
2
) +(
1
v

1
+
2
v

2
) with
1
v
1
+
2
v
2
V
and
1
v

1
+
2
v

2
V

## . This proves that P(

1
u
1
+
2
u
2
) =
1
v
1
+
2
v
2
=

1
Pu
1
+
2
Pu
2
.
Write u = v + v

so Pu = v. Then P
2
u = Pv but v is already in V so
Pv = v.
Let P⊥ = Id − P; then P⊥ is the orthogonal projection onto the subspace V⊥ and we have Id = P + P⊥.
Let q_1, q_2, ..., q_k be an orthonormal basis of V and q_{k+1}, q_{k+2}, ..., q_n an orthonormal basis for V⊥; then the full set q_1, q_2, ..., q_n is an orthonormal basis of R^n. Then we have Pq_i = q_i for i = 1, 2, ..., k and Pq_j = 0 for j = k+1, k+2, ..., n. Hence the matrix of P with respect to this basis is given by

    diag(1, ..., 1, 0, ..., 0)    (k ones followed by n − k zeros)

and the matrix of P⊥ is given by

    diag(0, ..., 0, 1, ..., 1)    (k zeros followed by n − k ones).

Let as before q_1, q_2, ..., q_k be an orthonormal basis of V; then the orthogonal projection onto V is given by

    Pu = (u, q_1)q_1 + (u, q_2)q_2 + ... + (u, q_k)q_k.
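The projection formula is easy to check numerically. The following sketch uses NumPy rather than the MATLAB used later in these notes, and the spanning vectors of the subspace V are an arbitrary choice for illustration:

```python
import numpy as np

# Orthonormal basis for a 2-dimensional subspace V of R^4,
# obtained by QR-factoring two arbitrary spanning vectors.
A = np.array([[1., 1.], [0., 1.], [1., 0.], [0., 1.]])
Q, _ = np.linalg.qr(A)           # columns q_1, q_2 are orthonormal
P = Q @ Q.T                      # matrix of the orthogonal projection onto V

u = np.array([1., 2., 3., 4.])
Pu = sum(np.dot(u, Q[:, i]) * Q[:, i] for i in range(Q.shape[1]))

assert np.allclose(P @ u, Pu)        # the two descriptions of P agree
assert np.allclose(P @ P, P)         # P is idempotent: P^2 = P
assert np.allclose(Q.T @ (u - Pu), 0)  # u - Pu = P_perp(u) lies in V-perp
```

The assertions verify Proposition 7.0.2 and the decomposition Id = P + P⊥ for this particular V.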
In the Gram-Schmidt algorithm applied to linearly independent vectors u_1, u_2, ..., u_k we let q_1 = u_1/||u_1|| and v_2 = u_2 − (u_2, q_1)q_1. But (u_2, q_1)q_1 is the orthogonal projection P_2 of u_2 onto the subspace spanned by q_1 and hence v_2 = u_2 − P_2 u_2 = (Id − P_2)u_2. Now Q_2 = Id − P_2 is the orthogonal projection onto the orthogonal complement of Span{q_1}. We have q_2 = Q_2 u_2 / ||Q_2 u_2||. Next

    v_3 = a_3 − (a_3, q_1)q_1 − (a_3, q_2)q_2

is the orthogonal projection Q_3 of a_3 onto Span{q_1, q_2}⊥, and q_3 = Q_3 a_3 / ||Q_3 a_3||.
In general we have that v_j is the orthogonal projection of a_j onto the orthogonal complement of the subspace spanned by q_1, q_2, ..., q_{j−1}, i.e. Span{q_1, q_2, ..., q_{j−1}}⊥.
We can describe this as applying a sequence of projections: we get v_2 by applying P_{q_1⊥}, the orthogonal projection onto the subspace orthogonal to q_1, to a_2.
Applying this projection to a_3 we have P_{q_1⊥} a_3 = a_3 − (a_3, q_1)q_1. Next applying P_{q_2⊥}, the projection onto the subspace orthogonal to q_2, we get

    P_{q_2⊥} P_{q_1⊥} a_3 = P_{q_2⊥}(a_3 − (a_3, q_1)q_1)
                          = (a_3 − (a_3, q_1)q_1) − (a_3 − (a_3, q_1)q_1, q_2)q_2
                          = a_3 − (a_3, q_1)q_1 − (a_3, q_2)q_2

because (q_1, q_2) = 0. But this is precisely v_3, thus

    v_3 = P_{q_2⊥} P_{q_1⊥} a_3

In general

    v_j = P_{q_{j−1}⊥} P_{q_{j−2}⊥} ... P_{q_2⊥} P_{q_1⊥} a_j

Using this algorithm works better numerically than the usual Gram-Schmidt algorithm.
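The projection-at-a-time formulation is the modified Gram-Schmidt algorithm. A NumPy sketch (the function name mgs and the test matrix are ours, chosen for illustration):

```python
import numpy as np

def mgs(A):
    """Modified Gram-Schmidt: orthogonalize the columns of A by
    applying the projections P_{q_j-perp} one at a time (a sketch)."""
    m, n = A.shape
    Q = A.astype(float).copy()
    R = np.zeros((n, n))
    for j in range(n):
        R[j, j] = np.linalg.norm(Q[:, j])
        Q[:, j] /= R[j, j]
        for k in range(j + 1, n):
            # project column k onto the complement of span{q_j}
            R[j, k] = Q[:, j] @ Q[:, k]
            Q[:, k] -= R[j, k] * Q[:, j]
    return Q, R

A = np.array([[1., 1., 0.], [0., 1., 1.], [1., 0., 1.], [0., 1., 1.]])
Q, R = mgs(A)
assert np.allclose(Q.T @ Q, np.eye(3))   # orthonormal columns
assert np.allclose(Q @ R, A)             # A = Q R
assert np.allclose(R, np.triu(R))        # R is upper triangular
```

In exact arithmetic this produces the same Q and R as classical Gram-Schmidt; the difference is purely in how rounding errors propagate.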
Using this algorithm we shall obtain the QR-decomposition, by multiplying the matrix A on the right by a sequence of upper triangular matrices.
We assume that A is an m × n matrix, m ≥ n, with linearly independent columns a_1, a_2, ..., a_n.
We let r_ij denote the inner product (a_j, q_i) where q_i is the ith normal vector constructed from the Gram-Schmidt algorithm. Remark that

    r_jj = (a_j, q_j) = (a_j − (a_j, q_1)q_1 − (a_j, q_2)q_2 − ... − (a_j, q_{j−1})q_{j−1}, q_j) = (v_j, q_j) = (||v_j|| q_j, q_j) = ||v_j||.
Multiplying on the right with the n × n matrix

    R_1 = [1/r_11 0 0 ... 0; 0 1 0 ... 0; ...; 0 0 0 ... 1]

we get

    A · R_1 = [a_1 a_2 ... a_n] · R_1 = [a_1/r_11  a_2 ... a_n] = [q_1  a_2 ... a_n]
Next consider the matrix

    R_2 = [1 −r_12/r_22 0 ... 0; 0 1/r_22 0 ... 0; 0 0 1 ... 0; ...; 0 0 0 ... 1].

Then

    A · R_1 · R_2 = [q_1  a_2 ... a_n] · R_2
                  = [q_1  −(r_12/r_22)q_1 + (1/r_22)a_2  a_3 ... a_n]
                  = [q_1  (a_2 − (a_2, q_1)q_1)/||v_2||  a_3 ... a_n]
                  = [q_1  q_2  a_3 ... a_n]
Next we put

    R_3 = [1 0 −r_13/r_33 0 ... 0;
           0 1 −r_23/r_33 0 ... 0;
           0 0 1/r_33 0 ... 0;
           0 0 0 1 ... 0;
           ...;
           0 0 0 0 ... 1]
By the same computation as before we get

    A · R_1 · R_2 · R_3 = [q_1  q_2  q_3  a_4 ... a_n]

Continuing this way we get

    A · R_1 · R_2 ... R_n = [q_1  q_2  q_3 ... q_n] = Q̂

The product R_1 R_2 R_3 ... R_n is an upper triangular n × n matrix and so is its inverse. If we let R̂ denote this inverse we get

    A = Q̂ R̂

i.e. the reduced QR-decomposition. We could call this method of obtaining the QR-decomposition upper-triangular orthogonalization.
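The right-multiplication process can be written out directly. In this NumPy sketch (variable names and the test matrix are ours) each elementary matrix R_j is built exactly as in the derivation above, and R̂ is recovered as the inverse of their product:

```python
import numpy as np

# Upper-triangular orthogonalization (a sketch): multiply A on the
# right by elementary upper-triangular matrices until the columns
# are orthonormal.
A = np.array([[1., 1., 0.], [0., 1., 1.], [1., 0., 1.], [0., 1., 1.]])
m, n = A.shape
W = A.copy()
Rprod = np.eye(n)
for j in range(n):
    Rj = np.eye(n)
    # entries -r_ij/r_jj for the already-built columns q_1..q_{j-1}
    for i in range(j):
        Rj[i, j] = -(W[:, i] @ W[:, j])
    v = W @ Rj[:, j]                 # v_j = a_j - sum_i (a_j, q_i) q_i
    Rj[:, j] /= np.linalg.norm(v)    # divide column j by r_jj = ||v_j||
    W = W @ Rj
    Rprod = Rprod @ Rj
Qhat = W
Rhat = np.linalg.inv(Rprod)          # A = Qhat Rhat, the reduced QR

assert np.allclose(Qhat.T @ Qhat, np.eye(n))
assert np.allclose(Qhat @ Rhat, A)
```

Numerically this is the same computation as modified Gram-Schmidt, only bookkept through the matrices R_j.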
There is another method to obtain the QR-decomposition, which can aptly be called orthogonal triangularization. It multiplies A on the left by a sequence of orthogonal matrices to make an upper-triangular matrix. This method is based on so-called Householder reflections.

Figure 2: Householder reflection

We start out by finding an orthogonal m × m matrix Q_1 such that the first column in Q_1 A is (||a_1||, 0, 0, ..., 0)ᵗ. The matrix Q_1 A is [Q_1 a_1  Q_1 a_2 ... Q_1 a_n], thus we want an orthogonal matrix Q_1 such that Q_1 a_1 = (||a_1||, 0, ..., 0)ᵗ.
Let v = a_1 − ||a_1|| e_1 and let u = v/||v||. The orthogonal projection onto the subspace spanned by u is given by x ↦ (x, u)u. The Householder reflection is defined by Q_1(x) = x − 2(x, u)u. It is clear that Q_1 is a linear map; in fact Q_1 is reflection in the (m−1)-dimensional subspace v⊥. A reflection preserves angles and lengths, so it is an orthogonal map and hence Q_1 is an orthogonal matrix. We have

    Q_1(a_1) = a_1 − 2(a_1, u)u = a_1 − 2(a_1, v) v/||v||².

Now

    ||v||² = ||a_1 − ||a_1|| e_1||² = (a_1 − ||a_1|| e_1, a_1 − ||a_1|| e_1) = 2||a_1||² − 2||a_1||(a_1, e_1) = 2(a_1, a_1 − ||a_1|| e_1).

Hence

    Q_1(a_1) = a_1 − 2(a_1, a_1 − ||a_1|| e_1)·(a_1 − ||a_1|| e_1)/||a_1 − ||a_1|| e_1||² = a_1 − (a_1 − ||a_1|| e_1) = ||a_1|| e_1

This shows that

    Q_1 A = [||a_1|| * ... *; 0 * ... *; 0 * ... *; ...; 0 * ... *]

where the entries marked * are in general non-zero.
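A Householder reflector is two lines of NumPy. In this sketch (the function name is ours; it assumes a is not already a positive multiple of e_1, so that v is non-zero) we check both claims above: the reflector is orthogonal, and it sends a to ||a|| e_1:

```python
import numpy as np

def householder_reflector(a):
    """Q(x) = x - 2(x,u)u with u = v/||v||, v = a - ||a|| e_1 (a sketch).
    Assumes a is not already a positive multiple of e_1."""
    m = a.shape[0]
    v = a.astype(float).copy()
    v[0] -= np.linalg.norm(a)
    u = v / np.linalg.norm(v)
    return np.eye(m) - 2.0 * np.outer(u, u)

a1 = np.array([1., 0., 1., 0.])
Q1 = householder_reflector(a1)
assert np.allclose(Q1.T @ Q1, np.eye(4))            # orthogonal
assert np.allclose(Q1 @ a1, [np.sqrt(2), 0, 0, 0])  # a_1 -> ||a_1|| e_1
```

In practice one stores only the vector u and applies x ↦ x − 2(x, u)u, rather than forming the full matrix.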
At the next step we want to multiply by a matrix Q_2 which does not disturb the first column but produces 0's below the second entry in the second column. Q_2 is going to be of the form

    Q_2 = [1 0; 0 H_2]    (block form)

where H_2 is an (m−1) × (m−1) matrix.
Let the second column in Q_1 A be (b_1, b_2, b_3, ..., b_m)ᵗ; then we let H_2 be the Householder reflection for the (m−1)-dimensional vector b = (b_2, b_3, ..., b_m)ᵗ. Since H_2 is an orthogonal (m−1) × (m−1) matrix, its (m−1)-dimensional column vectors are orthonormal, and hence also the columns in Q_2 = [1 0; 0 H_2] are orthonormal, so Q_2 is an orthogonal matrix. Now the matrix

    Q_2 Q_1 A = [||a_1|| * * ... *; 0 ||b|| * ... *; 0 0 * ... *; ...; 0 0 * ... *].
The third matrix will be of the form

    Q_3 = [1 0 0; 0 1 0; 0 0 H_3]    (block form)

where H_3 is an (m−2) × (m−2) Householder reflection. Applying this procedure n times we finally arrive at Q_n Q_{n−1} ... Q_2 Q_1 A = R where R is upper-triangular. Let ᵗQ = Q_n Q_{n−1} ... Q_2 Q_1; then ᵗQ is an orthogonal matrix and hence A = Q · R is the full QR-decomposition.
Example 7.1 Compute the QR-factorization of the matrix

    A = [1 1 0; 0 1 1; 1 0 1; 0 1 1]

We begin by computing the matrix of the Householder reflection associated to the first column a_1 = (1, 0, 1, 0)ᵗ. We have ||a_1|| = √2 and

    v = (1, 0, 1, 0)ᵗ − √2·(1, 0, 0, 0)ᵗ = (1 − √2, 0, 1, 0)ᵗ,    ||v||² = (1 − √2)² + 1 = 4 − 2√2,

so

    u = (1/√(4 − 2√2))·(1 − √2, 0, 1, 0)ᵗ

and the Householder reflection is given by Q_1(x) = x − 2(x, u)u. To compute the matrix we have to find Q_1(e_j) for j = 1, 2, 3, 4.
First, since 2(e_1, u)u = (2(1 − √2)/(4 − 2√2))·v = −(√2/2)·v,

    Q_1(e_1) = e_1 − 2(e_1, u)u = e_1 + (√2/2)(1 − √2, 0, 1, 0)ᵗ = (√2/2, 0, √2/2, 0)ᵗ.

Secondly, Q_1 e_2 = e_2 − 2(e_2, u)u = e_2 = (0, 1, 0, 0)ᵗ, since the second coordinate of u is 0.
Next, since 2(e_3, u)u = (2/(4 − 2√2))·v = (1 + √2/2)·v,

    Q_1 e_3 = e_3 − (1 + √2/2)(1 − √2, 0, 1, 0)ᵗ = (√2/2, 0, −√2/2, 0)ᵗ.

Finally, Q_1 e_4 = e_4 − 2(e_4, u)u = e_4 = (0, 0, 0, 1)ᵗ, since the fourth coordinate of u is 0.
Thus we have

    Q_1 = [√2/2 0 √2/2 0; 0 1 0 0; √2/2 0 −√2/2 0; 0 0 0 1]

and

    Q_1 A = [√2 √2/2 √2/2; 0 1 1; 0 √2/2 −√2/2; 0 1 1]
The next step is to compute the Householder reflection in R³ of the vector b = (1, √2/2, 1)ᵗ, the part of the second column of Q_1 A below the first entry. We have v = b − ||b|| e_1. The norm of b is √(5/2) and so

    v = (1 − √(5/2), √(1/2), 1)ᵗ.

The norm of v is √(5 − 2√(5/2)) and hence the normal vector in the direction of v is

    u = (1/√(5 − 2√(5/2)))·(1 − √(5/2), √(1/2), 1)ᵗ.

The Householder reflection is given by H_2 x = x − 2(x, u)u. We compute as before:

    H_2 e_1 = e_1 − (2(1 − √(5/2))/(5 − 2√(5/2)))·v = (√(2/5), √(1/5), √(2/5))ᵗ

    H_2 e_2 = e_2 − (2√(1/2)/(5 − 2√(5/2)))·v = (√(1/5), (10 − 2√(5/2))/15, −√(1/2)·(10 + 4√(5/2))/15)ᵗ

    H_2 e_3 = e_3 − (2/(5 − 2√(5/2)))·v = (√(2/5), −√(1/2)·(10 + 4√(5/2))/15, (5 − 4√(5/2))/15)ᵗ.

It follows that

    Q_2 = [1 0 0 0;
           0 √(2/5) √(1/5) √(2/5);
           0 √(1/5) (10 − 2√(5/2))/15 −√(1/2)·(10 + 4√(5/2))/15;
           0 √(2/5) −√(1/2)·(10 + 4√(5/2))/15 (5 − 4√(5/2))/15]

and

    Q_2 Q_1 A = [√2 √2/2 *; 0 √(5/2) *; 0 0 *; 0 0 *].

One further step of the same kind, a 2 × 2 Householder reflection H_3 applied to the last two entries of the third column, produces the upper-triangular R and hence the full QR-decomposition of A.
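The arithmetic in this example is easy to get wrong by hand, so it is worth checking numerically. This NumPy sketch (the helper name reflector is ours) rebuilds Q_1 and Q_2 and confirms the displayed entries of Q_2 Q_1 A:

```python
import numpy as np

def reflector(a):
    # Householder reflection sending a to ||a|| e_1 (a sketch)
    v = a.astype(float).copy()
    v[0] -= np.linalg.norm(a)
    u = v / np.linalg.norm(v)
    return np.eye(len(a)) - 2.0 * np.outer(u, u)

A = np.array([[1., 1., 0.], [0., 1., 1.], [1., 0., 1.], [0., 1., 1.]])
Q1 = reflector(A[:, 0])
B = Q1 @ A
Q2 = np.eye(4)
Q2[1:, 1:] = reflector(B[1:, 1])   # H_2 acting below the first row
C = Q2 @ B

assert np.allclose(C[:, 0], [np.sqrt(2), 0, 0, 0])
assert np.allclose(C[0, 1], np.sqrt(2) / 2)
assert np.allclose(C[1, 1], np.sqrt(5 / 2))
assert np.allclose(C[2:, 1], 0)
```

The first two columns of C match the display above; the third column still has non-zero entries in rows 3 and 4, which the final reflection H_3 removes.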
8 Computational Experiments using MATLAB
In the section on orthogonal projections we alluded to this modified Gram-Schmidt procedure as being numerically more stable than the classical procedure, and the Householder method as being even better. Next we shall illustrate this using some MATLAB computations.
The experiment is the following: we construct two orthogonal 80 × 80 matrices Q_1 and Q_2. Then we let D be the diagonal matrix with the powers 2^−1, 2^−2, ..., 2^−80 in the diagonal. Let A = Q_1 · D · Q_2. We compute the QR-decomposition using the classical Gram-Schmidt, the modified Gram-Schmidt and the Householder algorithm.
If A = Q · R is the QR-decomposition of A, then Q_1 D Q_2 = Q R, so R = Q^−1 Q_1 D Q_2. Now A = Q_1 D Q_2 is the SVD of A and so the singular values of A are the numbers 2^−1, 2^−2, ..., 2^−80; but R = (Q^−1 Q_1) D Q_2 is the SVD of R, so R has the same singular values. The eigenvalues of an upper triangular matrix are the diagonal entries, as can be seen by computing the characteristic polynomial. Hence the absolute values of the diagonal entries in R should be the numbers 2^−1, 2^−2, ..., 2^−80.
Here is a print-out of a MATLAB session

>> M=randn(80);
>> N=randn(80);
>> [Q1,X]=qr(M);
>> [Q2,Y]=qr(N);
>> D=diag(2.^(-1:-1:-80));
>> A=Q1*D*Q2;

Here we first construct two 80 × 80 matrices with random entries chosen from a standard normal distribution. Then we use the QR-decomposition to get two orthogonal matrices Q_1 and Q_2 (and also two upper triangular matrices which we have no use for). Next we construct the diagonal matrix D with the numbers 2^−1, 2^−2, ..., 2^−80 in the diagonal, and finally we form the 80 × 80 matrix A = Q_1 · D · Q_2.
Next we use MATLAB's built-in function to find the QR-decomposition (this procedure uses the Householder procedure)

>> [Q,R]=qr(A);
>> v=diag(R);
>> v=log(abs(v));
>> plot(v,'o')
We then take out the diagonal of the upper-triangular part and we take the log of the absolute values of the diagonal entries. This should give the numbers −log 2, −2 log 2, −3 log 2, ..., −80 log 2, and so when we plot them they should lie on a nice straight line with slope −log 2. Notice the specification to the plot command to use an 'o' to mark the points rather than using a line graph. The result is shown in the figure.

Figure 3:

The graph looks pretty good until we get down to around 2^−35; then the machine precision is no longer good enough to distinguish the very small values and the line is drowned out in rounding errors.
Next we shall try the classical Gram-Schmidt method. Here we first have to write our own algorithm to compute the QR-decomposition using this method. This is best done by writing a function M-file as shown below:

function [QC,RC]=clgs(A)
[n,m]=size(A);
QC=zeros(n);
RC=zeros(n);
QC(:,1)=A(:,1)/(norm(A(:,1)));
for j=1:n
    vj=A(:,j);
    for i=1:j-1
        RC(i,j)=QC(:,i)'*A(:,j);
        vj=vj-RC(i,j)*QC(:,i);
    end
    RC(j,j)=norm(vj);
    QC(:,j)=vj/RC(j,j);
end
We then compute the QR-decomposition using this method and again plot the log of the absolute values of the diagonal elements. We plot it on the same graph to compare the results of the two methods, using the command hold, which keeps the previous graph (to get a new graph use the command hold off). We plot the points using 'x' as a marker.

>> hold
Current plot held
>> [QC,RC]=clgs(A);
>> v=log(abs(diag(RC)));
>> plot(v,'x')

We see that the rounding errors take over much earlier; this is a result of the classical Gram-Schmidt algorithm being numerically unstable, i.e. the errors compound in each step.

Figure 4:

Homework:
Write a function M-file to compute the QR-decomposition using the modified Gram-Schmidt and compute the upper triangular part. Plot the log of the absolute values using '*' as a marker on the same graph as the Householder and classical Gram-Schmidt.
9 Least Squares Problems
Let A be an n × m matrix and assume n > m. We can of course view A as the matrix of a linear transformation T : R^m → R^n.
Consider a system of linear equations

    Ax = b

Since there are more equations than unknowns, this system in general will not have solutions. In fact it will have a solution precisely when b ∈ Im T. Since Im T has dimension at most m because of the formula dim ker T + dim Im T = m, the subspace Im T will be small relative to R^n, so b has to be very special in order for there to be any solution.
For a given x ∈ R^m we can consider the residual r = b − Ax ∈ R^n. Of course if x is a solution, r = 0.
The idea of a least squares solution is to find x such that ||r|| is as small as possible. If r = (r_1, r_2, ..., r_n), then ||r||² = r_1² + r_2² + ... + r_n², hence the name least squares.

Theorem 9.0.3 A vector x ∈ R^m is a least squares solution if and only if r ∈ (Im T)⊥.
Proof: Writing out ||r||² in terms of coordinates we get

    (a_11 x_1 + a_12 x_2 + ... + a_1m x_m − b_1)²
    + (a_21 x_1 + a_22 x_2 + ... + a_2m x_m − b_2)²
    + ...
    + (a_n1 x_1 + a_n2 x_2 + ... + a_nm x_m − b_n)²

Viewing this as a function of (x_1, x_2, ..., x_m), in order for x to be a minimum, all the partial derivatives ∂/∂x_i must vanish.

Figure 5:

Computing the partial with respect to x_j we get

    2(a_11 x_1 + a_12 x_2 + ... + a_1m x_m − b_1)a_1j
    + 2(a_21 x_1 + a_22 x_2 + ... + a_2m x_m − b_2)a_2j
    + ...
    + 2(a_n1 x_1 + a_n2 x_2 + ... + a_nm x_m − b_n)a_nj

This is precisely −2 times the inner product of r with the jth column of A. Thus the vanishing of ∂||r||²/∂x_j is equivalent to r being orthogonal to the jth column. Now all the partials have to vanish, and so r is orthogonal to all the columns in A; since these columns span Im T it follows that r ∈ (Im T)⊥.
Using this theorem we can show that there always is a least squares solution: let P : R^n → Im T be the orthogonal projection. Then Pb ∈ Im T and so we can find x ∈ R^m such that Pb = Ax. As we have seen, b − Pb ∈ (Im T)⊥ and so r = b − Ax = b − Pb ∈ (Im T)⊥; hence by the theorem x is a least squares solution.
Another consequence of the theorem is that x is a least squares solution if and only if (b − Ax, Ay) = 0 for all y ∈ R^m. Using the transposed matrix ᵗA we get 0 = (b − Ax, Ay) = (ᵗAb − ᵗA·Ax, y) for all y ∈ R^m. Thus ᵗAb − ᵗA·Ax is a vector in R^m orthogonal to every vector in R^m and so must be the zero-vector. Thus we have proved

Theorem 9.0.4 A vector x ∈ R^m is a least squares solution to the system Ax ≈ b if and only if

    ᵗA·Ax = ᵗAb

Remark that ᵗA·A is an m × m matrix, so we now have a system of m equations with m unknowns. This system of equations is known as the normal equations. As we have shown, the normal equations have a solution, which will be unique if and only if ker ᵗA·A = 0, or equivalently if ᵗA·A has rank m, hence if it is invertible. This is the case if A has maximal rank m, and in this case the unique solution to the least squares problem is

    x = (ᵗA·A)^−1 · ᵗAb

The matrix (ᵗA·A)^−1 · ᵗA is called the pseudo-inverse of A and often denoted by A⁺.
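The normal equations and the pseudo-inverse are easy to compare numerically. In this NumPy sketch (the random test data is an arbitrary choice) the three standard routes to the least squares solution agree, and the residual is orthogonal to the column space, as Theorem 9.0.3 demands:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 5))     # n > m; full rank with probability 1
b = rng.standard_normal(20)

x = np.linalg.solve(A.T @ A, A.T @ b)          # normal equations
x_pinv = np.linalg.pinv(A) @ b                 # pseudo-inverse A+ b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)  # library least squares

assert np.allclose(x, x_pinv)
assert np.allclose(x, x_lstsq)
# The residual r = b - Ax is orthogonal to Im T, the column space of A.
assert np.allclose(A.T @ (b - A @ x), 0)
```

Forming ᵗA·A explicitly squares the condition number, which is why the QR-based method described below is preferred in practice.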
Example 9.1 Consider a set of points (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) and the problem of finding a polynomial p(x) of degree m < n that best fits these points, i.e. we want to find p(x) such that (y_1 − p(x_1))² + (y_2 − p(x_2))² + ... + (y_n − p(x_n))² is as small as possible. If we write p(x) = c_0 + c_1 x + c_2 x² + ... + c_m x^m, this comes down to finding a least squares solution to the system of equations

    [1 x_1 x_1² ... x_1^m;
     1 x_2 x_2² ... x_2^m;
     ...;
     1 x_n x_n² ... x_n^m] · (c_0, c_1, ..., c_m)ᵗ = (y_1, y_2, ..., y_n)ᵗ

The matrix of coefficients is known as a Vandermonde matrix; if the x_i's are distinct this matrix has maximal rank m + 1. Hence there is a unique solution.
Suppose we want to fit a line through these points. The matrix A is

    A = [1 x_1; 1 x_2; ...; 1 x_n]

and so

    ᵗA·A = [n  Σx_i; Σx_i  Σx_i²]

The inverse is

    (ᵗA·A)^−1 = 1/(nΣx_i² − (Σx_i)²) · [Σx_i²  −Σx_i; −Σx_i  n]

and so

    A⁺ = 1/(nΣx_i² − (Σx_i)²) · [Σx_i²  −Σx_i; −Σx_i  n] · [1 1 ... 1; x_1 x_2 ... x_n]
       = 1/(nΣx_i² − (Σx_i)²) · [Σx_i² − x_1Σx_i   Σx_i² − x_2Σx_i  ...  Σx_i² − x_nΣx_i;
                                 nx_1 − Σx_i   nx_2 − Σx_i  ...  nx_n − Σx_i]

It follows that the line is given by the equation y = c_0 + c_1 x where

    c_0 = (Σy_i · Σx_i² − Σx_iy_i · Σx_i) / (nΣx_i² − (Σx_i)²)
    c_1 = (n · Σx_iy_i − Σx_i · Σy_i) / (nΣx_i² − (Σx_i)²)
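The closed-form coefficients can be checked against a library fit. A NumPy sketch (the data points are an arbitrary choice for illustration):

```python
import numpy as np

# Least squares line fit: compare the closed-form c0, c1 from the
# pseudo-inverse formulas with numpy's polyfit.
x = np.array([0., 1., 2., 3., 4.])
y = np.array([1., 3., 2., 5., 4.])
n = len(x)
den = n * np.sum(x**2) - np.sum(x)**2
c0 = (np.sum(y) * np.sum(x**2) - np.sum(x * y) * np.sum(x)) / den
c1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / den

c1_ref, c0_ref = np.polyfit(x, y, 1)   # returns highest degree first
assert np.isclose(c0, c0_ref)
assert np.isclose(c1, c1_ref)
```

Both routes minimize the same sum of squared residuals, so they must agree whenever the x_i are not all equal.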
Using the QR-decomposition of A we get a convenient way to solve the normal equations.
Consider the reduced QR-decomposition of A, A = Q̂R̂, where Q̂ is an n × m matrix with orthonormal columns and R̂ is an m × m upper-triangular matrix. Then ᵗA·A = ᵗR̂·ᵗQ̂·Q̂·R̂ = ᵗR̂·R̂ since ᵗQ̂·Q̂ = E_m. Hence the normal equations become

    ᵗR̂·R̂x = ᵗR̂·ᵗQ̂b

which we can solve by solving

    R̂x = ᵗQ̂b

The matrix R̂ is upper-triangular and so this is a system of the form

    r_11 x_1 + r_12 x_2 + r_13 x_3 + ... + r_1m x_m = c_1
               r_22 x_2 + r_23 x_3 + ... + r_2m x_m = c_2
                                                 ...
                                         r_mm x_m = c_m

where (c_1, c_2, ..., c_m)ᵗ = ᵗQ̂b. We can easily solve this system by back substitution:

    x_m = c_m / r_mm
    x_{m−1} = (1/r_{m−1,m−1}) (c_{m−1} − r_{m−1,m} x_m)
    x_{m−2} = (1/r_{m−2,m−2}) (c_{m−2} − r_{m−2,m−1} x_{m−1} − r_{m−2,m} x_m)
    ...
    x_1 = (1/r_11) (c_1 − r_12 x_2 − r_13 x_3 − ... − r_1m x_m)
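The back substitution loop, written out as code (a NumPy sketch; the function name and test system are ours):

```python
import numpy as np

def back_substitute(R, c):
    """Solve the upper-triangular system R x = c by back substitution,
    exactly the loop written out above (a sketch; assumes nonzero diagonal)."""
    m = len(c)
    x = np.zeros(m)
    for i in range(m - 1, -1, -1):
        # x_i = (c_i - sum_{k>i} r_ik x_k) / r_ii
        x[i] = (c[i] - R[i, i+1:] @ x[i+1:]) / R[i, i]
    return x

R = np.array([[2., 1., 3.], [0., 1., -1.], [0., 0., 4.]])
c = np.array([1., 2., 8.])
x = back_substitute(R, c)
assert np.allclose(R @ x, c)
```

Each step uses the already-computed x_{i+1}, ..., x_m, so the whole solve costs on the order of m² operations.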
10 Numerical Analysis
Consider a system of linear equations Ax = b. Here is a possible strategy for finding a solution: compute the QR-decomposition A = Q · R. Since Q is an orthogonal matrix we have Q^−1 = ᵗQ and so we get Rx = ᵗQb. Since R is upper-triangular we can solve this system by back substitution as above.
As the examples have shown, computing the QR-decomposition by hand is not possible unless the dimensions of A are very small (think about computing the QR-decomposition of a 100 × 100 matrix!). Thus we certainly want to use a program such as MATLAB to find Q and R and to perform the back substitution.
Consider the following MATLAB code

>> R=triu(randn(50));
>> [Q,X]=qr(randn(50));
>> A=Q*R;
>> [Q2,R2]=qr(A);
>> norm(Q-Q2)
ans =
    1.8187
>> norm(R-R2)/norm(R)
ans =
    0.2848
>> norm(Q2*R2-A)
ans =
    8.9370e-015

We first construct a random upper-triangular matrix R and a random orthogonal matrix Q and put A = Q · R; thus this is the QR-decomposition of A. Now we use the qr command in MATLAB to compute the QR-decomposition of A and get Q_2 and R_2. Now compare Q with Q_2 and R with R_2. If Q and Q_2 were close to each other the norm of Q − Q_2 should be small. But our computation shows it is large (≈ 2), and the same for R − R_2. Since the QR-decomposition is unique, this shows that we really can't hope to compute the QR-decomposition to any reasonable degree of accuracy. Yet when we compute Q_2 · R_2 we get very close to A. Thus seemingly the imprecisions in Q_2 and R_2 cancel out in the product. Does this mean that the algorithm above is doomed to failure and is going to yield results that are unusable?
We shall investigate this problem in some detail. First of all, a computer cannot represent all real numbers. A double precision number is represented by 64 bits, which encode the digits and the position of the decimal point. Double precision numbers lie between roughly 2.23·10^−308 and 1.79·10^308 in absolute value. The problem is that there are gaps between the numbers; for instance the numbers in [1, 2] are represented by the numbers

    1, 1 + 2^−52, 1 + 2·2^−52, 1 + 3·2^−52, ..., 2

The numbers in [2^n, 2^{n+1}] are represented by the numbers

    2^n, 2^n + 2^n·2^−52, 2^n + 2^n·2·2^−52, 2^n + 2^n·3·2^−52, ..., 2^{n+1}

i.e. the distance between points here is 2^n·2^−52. We shall call the collection of all these numbers the floating point numbers in the machine. Thus these are precisely the real numbers that can be represented exactly in the computer. Any other real number is approximated by one of the floating point numbers. Remark that when we get close to the upper bound of the numbers that can be represented by the machine, the distances between consecutive floating point numbers become enormous, though the relative distance stays the same.
For a real number x let fl(x) denote the closest floating point number. Thus

    |x − fl(x)| / |x| < (1/2)·2^−52

or |x − fl(x)| < (1/2)·2^−52·|x|. The number (1/2)·2^−52 is called the machine precision; we shall denote it by ε_machine. Thus we have fl(x) = x(1 + ε) with |ε| < ε_machine.
Consider now the usual floating point number operations +, −, ×, /. We shall denote the corresponding operations in the machine by putting a circle around the symbol, e.g. ⊖. Let ∗ denote any of these operations and assume the machine has the following property: for any two floating point numbers x, y

    x ⊛ y = fl(x ∗ y)

thus

    x ⊛ y = (x ∗ y)(1 + ε)

for some |ε| < ε_machine.
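These gaps are directly observable. A Python sketch (values hold for IEEE double precision, which Python floats and MATLAB both use):

```python
import sys
import numpy as np

# Floating point gaps: the spacing in [1, 2] is 2^-52, and in
# [2^n, 2^(n+1)] it scales to 2^n * 2^-52.  Rounding satisfies
# fl(x) = x(1 + eps) with |eps| < eps_machine = (1/2) * 2^-52.
assert sys.float_info.epsilon == 2.0 ** -52      # spacing at 1.0
assert 1.0 + 2.0 ** -52 != 1.0                   # a representable neighbor of 1
assert 1.0 + 2.0 ** -54 == 1.0                   # below eps_machine: rounds back to 1
assert np.spacing(2.0 ** 10) == 2.0 ** 10 * 2.0 ** -52   # spacing scales with 2^n
```

The third assertion is the practical meaning of ε_machine: perturbations smaller than half a gap are lost in rounding.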
Definition 10.0.3 A mathematical problem is a function (not linear in general) f : X → Y from a vector space X of data to a vector space Y of solutions. Thus the problem is: for a given data point x, compute the solution f(x).

Definition 10.0.4 An accurate algorithm for a mathematical problem f : X → Y is another function f̃ : X → Y that can be implemented by a computer program and such that for any x ∈ X which can be represented by floating point numbers we have

    ||f̃(x) − f(x)|| / ||f(x)|| = O(ε_machine)

The notation "left hand side = O(ε_machine)" means that there is a fixed constant C (not depending on x) such that left hand side < C·ε_machine.
Next we shall discuss the notion of stability of an algorithm. This definition may seem strange at first, but hopefully the significance will become clearer later on.
Definition 10.0.5 An algorithm f̃ for a problem f is stable if for each x ∈ X there is an x̃ with

    ||x̃ − x|| / ||x|| = O(ε_machine)

such that

    ||f̃(x) − f(x̃)|| / ||f(x̃)|| = O(ε_machine)

An algorithm is backwards stable if there is x̃ as above such that f̃(x) = f(x̃).
Example 10.1 Let f : R² → R, f(x_1, x_2) = x_1 − x_2. The algorithm is f̃(x_1, x_2) = fl(x_1) ⊖ fl(x_2). This algorithm is backwards stable.
Indeed fl(x_1) = x_1(1 + ε_1) and fl(x_2) = x_2(1 + ε_2) with |ε_1|, |ε_2| < ε_machine. Now by our assumption about the machine we have fl(x_1) ⊖ fl(x_2) = (fl(x_1) − fl(x_2))(1 + ε_3). Hence we get

    fl(x_1) ⊖ fl(x_2) = [x_1(1 + ε_1) − x_2(1 + ε_2)](1 + ε_3)
                      = x_1(1 + ε_1)(1 + ε_3) − x_2(1 + ε_2)(1 + ε_3)
                      = x_1(1 + ε_1 + ε_3 + ε_1ε_3) − x_2(1 + ε_2 + ε_3 + ε_2ε_3)
                      = x_1(1 + ε_4) − x_2(1 + ε_5)

with |ε_4|, |ε_5| < 2ε_machine + ε²_machine.
Thus f̃(x_1, x_2) = fl(x_1) ⊖ fl(x_2) = x̃_1 − x̃_2 = f(x̃_1, x̃_2) with x̃_1 = x_1(1 + ε_4) and x̃_2 = x_2(1 + ε_5), so

    |x̃_1 − x_1| / |x_1| = |ε_4| = O(ε_machine).

For instance we can take the constant C = 3. The same holds for x̃_2.
Consider instead f(x) = x + 1 and the algorithm f̃(x) = fl(x) ⊕ 1. Then fl(x) = x(1 + ε_1) and fl(x) ⊕ 1 = (x(1 + ε_1) + 1)(1 + ε_2). This gives

    f̃(x) − f(x̃) = (x(1 + ε_1) + 1)(1 + ε_2) − (x(1 + ε_1) + 1) = (x̃ + 1)ε_2

where x̃ = x(1 + ε_1). Thus this algorithm is not backwards stable, but since

    |f̃(x) − f(x̃)| / |f(x̃)| = |ε_2|

it is stable.
The example at the beginning of this section illustrates the notion of backwards stability. The problem is to find the QR-decomposition of the matrix A. We find that the algorithm does a poor job of computing the actual Q and R for A. It does however come up with a Q̃ and an R̃ such that Q̃·R̃ = Ã where Ã is very close to A. Thus Q̃, R̃ is not a solution to the problem for the matrix A but for the nearby matrix Ã.
Of course we want algorithms that are accurate; as it turns out this is not guaranteed by stability or even backwards stability.
Accuracy means that ||f̃(x) − f(x)|| is small relative to ||f(x)||. Stability implies that ||f̃(x) − f(x̃)|| is small relative to ||f(x̃)|| (if backwards stable it is 0), so what we need is for ||f(x̃) − f(x)|| to be small relative to ||f(x)||.

Definition 10.0.6 Let f : X → Y be a problem and let x ∈ X be a data point. Let δx be a small increment and let δf(x) = f(x + δx) − f(x). The relative condition number is defined by

    κ(x) = lim_{δ→0} sup_{||δx||<δ} (||δf(x)|| / ||f(x)||) / (||δx|| / ||x||)

A problem is well-conditioned if κ(x) is relatively small (1 to 100, say) and ill-conditioned if κ(x) is large (10^6, ..., 10^16).
Example 10.2 If f is a differentiable function, e.g. all its coordinate functions have continuous partial derivatives, then we can form the Jacobian matrix

    J(x) = [∂f_1/∂x_1(x) ∂f_1/∂x_2(x) ... ∂f_1/∂x_m(x);
            ∂f_2/∂x_1(x) ∂f_2/∂x_2(x) ... ∂f_2/∂x_m(x);
            ...;
            ∂f_n/∂x_1(x) ∂f_n/∂x_2(x) ... ∂f_n/∂x_m(x)].

Then we have ||δf(x)|| = ||f(x + δx) − f(x)|| ≈ ||J(x)δx|| ≤ ||J(x)||·||δx||. Hence we have

    κ(x) ≈ (||δf(x)|| / ||f(x)||) / (||δx|| / ||x||) ≤ (||J(x)||·||δx|| / ||f(x)||) / (||δx|| / ||x||) = ||J(x)||·||x|| / ||f(x)||

and in fact we get equality, i.e. κ(x) = ||J(x)||·||x|| / ||f(x)||.
Consider f : R² → R, f(x_1, x_2) = x_1 − x_2. The Jacobian of f is J = [1 −1]. We have ||J|| = √2 and so

    κ(x) = √2·||x|| / |x_1 − x_2|.

If x_1 and x_2 are very close this can be very large. Thus this problem is ill-conditioned.
Consider f : R → R, f(x) = x². Then J(x) = 2x and so κ(x) = 2|x|·|x| / |x²| = 2. Thus this problem is well-conditioned.
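Both condition numbers can be evaluated directly. A NumPy sketch (the helper name kappa and the sample inputs are ours):

```python
import numpy as np

# Condition number of f(x1, x2) = x1 - x2:
# kappa = sqrt(2) * ||x|| / |x1 - x2|, so nearly equal inputs are
# ill-conditioned even though subtraction itself is backward stable.
def kappa(x1, x2):
    J = np.array([1.0, -1.0])                    # Jacobian of f
    return np.linalg.norm(J) * np.hypot(x1, x2) / abs(x1 - x2)

assert np.isclose(kappa(3.0, 4.0), np.sqrt(2) * 5.0)
assert kappa(1.0, 1.0 + 1e-12) > 1e12            # ill-conditioned regime

# f(x) = x^2 is well-conditioned: kappa = |2x| * |x| / |x^2| = 2.
assert np.isclose(2 * abs(5.0) * abs(5.0) / abs(5.0 ** 2), 2.0)
```

This is the standard explanation of "catastrophic cancellation": the data of the subtraction problem are ill-conditioned near the diagonal x_1 = x_2, not the arithmetic.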
We shall consider a less trivial example, namely finding the roots of a polynomial.

Example 10.3 Consider a polynomial p(x) = a_0 + a_1 x + a_2 x² + ... + a_{n−1} x^{n−1} + x^n. If we perturb a single coefficient, what happens to the roots?
To figure this out we consider a function F : R^{n+1} → R defined by

    F(b_0, b_1, b_2, ..., b_{n−1}, z) = b_0 + b_1 z + b_2 z² + ... + b_{n−1} z^{n−1} + z^n

Thus F(a_0, a_1, ..., a_{n−1}, x) = p(x). Now assume x_0 is a simple root of p(x), i.e. p(x_0) = 0 and p′(x_0) ≠ 0. The Implicit Function Theorem says that if ∂F/∂z(a_0, a_1, ..., a_{n−1}, x_0) ≠ 0 then there is a function f : R^n → R defined in a neighborhood of the point (a_0, a_1, ..., a_{n−1}) such that

    F(b_0, b_1, ..., b_{n−1}, f(b_0, b_1, ..., b_{n−1})) = 0

for (b_0, b_1, ..., b_{n−1}) in this neighborhood and f(a_0, a_1, ..., a_{n−1}) = x_0. Thus f(b_0, b_1, ..., b_{n−1}) is a root of the polynomial b_0 + b_1 x + b_2 x² + ... + b_{n−1} x^{n−1} + x^n, and ∂f/∂b_i(a_0, a_1, ..., a_{n−1}) measures how the root x_0 is affected by a small perturbation of the ith coefficient. Since

    ∂F/∂z(a_0, a_1, ..., a_{n−1}, x_0) = (∂/∂z)(a_0 + a_1 z + a_2 z² + ... + a_{n−1} z^{n−1} + z^n)(x_0) = p′(x_0) ≠ 0

the Implicit Function Theorem does in fact apply.
Now using the chain rule we compute

    ∂F(b_0, b_1, ..., b_{n−1}, f(b_0, b_1, ..., b_{n−1}))/∂b_i = ∂F/∂b_i + (∂F/∂z)·(∂f/∂b_i) = 0

hence

    ∂f/∂b_i(a_0, a_1, ..., a_{n−1}) = −(∂F/∂b_i)(a_0, a_1, ..., a_{n−1}, x_0) / p′(x_0)

and

    (∂F/∂b_i)(a_0, a_1, ..., a_{n−1}, x_0) = (∂/∂b_i)(b_0 + b_1 z + b_2 z² + ... + b_{n−1} z^{n−1} + z^n)|_{(a_0, a_1, ..., a_{n−1}, x_0)} = x_0^i

so we get

    ∂f/∂b_i(a_0, a_1, ..., a_{n−1}) = −x_0^i / p′(x_0)

Thus the relative condition number is

    κ = (|δx_0| / |x_0|) / (|δa_i| / |a_i|) = |δx_0/δa_i|·|a_i/x_0| ≈ |∂f/∂b_i(a_0, a_1, ..., a_{n−1})|·|a_i| / |x_0| = |a_i · x_0^{i−1} / p′(x_0)|

This number can be incredibly large; consider for instance the polynomial

    p(x) = (x − 1)(x − 2)(x − 3) ... (x − 20)

The coefficient a_15 is approximately 1.67·10^9 in absolute value, and for the root x_0 = 15 we get p′(x_0) = (x_0 − 1) ... (x_0 − 14)·(x_0 − 16) ... (x_0 − 20) = 14!·(−1)(−2)(−3)(−4)(−5) = −14!·5!. Thus

    κ ≈ 1.67·10^9 · 15^14 / (14!·5!) ≈ 4.7·10^12
We can visualize the ill-conditioning of this problem by using the following MATLAB code (enter it into an M-file and name it rootplot.m)

for t=1:1000
    p=poly(1:20);
    p(6)=p(6)+10e-6*randn(1);
    r=roots(p);
    r1=real(r);
    r2=imag(r);
    plot(r1,r2,'.')
end

First execute

>> plot(zeros(1,20),'*')
>> hold

to plot the roots of p(x). The code then perturbs the coefficient a_15 by adding a very small random increment (10e−6 times a random number chosen from a standard normal distribution). It then finds the roots of the perturbed polynomial and plots the roots.
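The same experiment can be run in NumPy (a sketch of the rootplot idea with the plotting replaced by assertions; the seed and iteration count are our choices). The perturbed roots leave the real axis, which is exactly the scatter the MATLAB figure shows:

```python
import numpy as np

# Perturb the x^15 coefficient of (x-1)(x-2)...(x-20) and watch the
# roots move into the complex plane.
rng = np.random.default_rng(0)
p = np.poly(np.arange(1, 21))      # coefficients, highest degree first
roots = []
for _ in range(200):
    q = p.copy()
    q[5] += 1e-5 * rng.standard_normal()   # q[5] is the x^15 coefficient
    roots.append(np.roots(q))
roots = np.concatenate(roots)

assert np.isclose(p[5], -1.67e9, rtol=1e-2)     # a_15 as quoted above
assert np.max(np.abs(np.imag(roots))) > 0.01    # roots go complex
```

np.roots computes eigenvalues of the companion matrix, the same method MATLAB's roots uses.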
Next consider an m × n matrix A and consider the problem of computing Ax from an input x. By definition the relative condition number is

    κ(x) = sup_{δx} (||A(x + δx) − Ax|| / ||Ax||) / (||δx|| / ||x||) = sup_{δx} (||Aδx|| / ||δx||) · (||x|| / ||Ax||) = ||A||·||x|| / ||Ax||

If A happens to be square and non-singular we have ||x||/||Ax|| ≤ ||A^−1||, and so in this case we get

    κ ≤ ||A||·||A^−1||

Figure 6:

This number can also be very, very large. It is not hard to see that ||A|| = the largest singular value σ_1 and ||A^−1|| = the largest singular value of A^−1 = 1/σ_m, where σ_m is the smallest singular value of A. Thus we get

    κ ≤ σ_1/σ_m

where σ_1 (resp. σ_m) is the largest (resp. the smallest) singular value of A.
The number κ(A) = ||A||·||A^−1|| is called the conditioning number of the matrix A, and the matrix A is said to be well-conditioned (resp. ill-conditioned) if this number is relatively small (resp. large). If A is not square we define the conditioning number by κ(A) = ||A||·||A⁺|| where A⁺ is the pseudo-inverse as defined in section 9.
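The singular value characterization of κ(A) is easy to confirm. A NumPy sketch (the random matrix is an arbitrary test case):

```python
import numpy as np

# kappa(A) = ||A|| ||A^-1|| = sigma_1 / sigma_m.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 50))
s = np.linalg.svd(A, compute_uv=False)   # singular values, descending
kappa = s[0] / s[-1]

assert np.isclose(kappa, np.linalg.cond(A))                    # library value
assert np.isclose(np.linalg.norm(A, 2), s[0])                  # ||A|| = sigma_1
assert np.isclose(np.linalg.norm(np.linalg.inv(A), 2), 1 / s[-1])  # ||A^-1|| = 1/sigma_m
```

np.linalg.cond and MATLAB's cond both default to exactly this 2-norm ratio.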
Theorem 10.0.5 Consider the equation Ax = b. The condition number of computing x given b with respect to perturbing b is κ(A).

Suppose now that we perturb A by a small amount, A + δA, and keep b fixed. Then x must also be perturbed, so we have

    (A + δA)(x + δx) = b

Writing this out and ignoring the second order infinitesimal δA·δx we get

    Ax + A·δx + δA·x = b

or

    δx = −A^−1·δA·x

This implies

    ||δx|| ≤ ||A^−1||·||δA||·||x||

and hence

    (||δx|| / ||x||) / (||δA|| / ||A||) ≤ ||A^−1||·||A|| = κ(A)

Thus we have shown
Theorem 10.0.6 Let b be fixed. The condition number of solving the equation Ax = b with respect to perturbations of A is κ(A).

We can now estimate the accuracy of a backward stable algorithm in terms of the condition number:

Theorem 10.0.7 Let f : X → Y be a problem and let f̃ be a backwards stable algorithm for f. Let x ∈ X and let κ(x) be the relative condition number. Then the relative error satisfies

    ||f̃(x) − f(x)|| / ||f(x)|| = O(κ(x)·ε_machine)

Proof: By definition of backward stability we can find x̃ ∈ X such that

    ||x̃ − x|| / ||x|| = O(ε_machine)

and f(x̃) = f̃(x).
By the definition of the condition number we have

    (||f(x) − f(x̃)|| / ||f(x)||) / (||x̃ − x|| / ||x||) ≤ κ(x) + const.

hence

    ||f(x) − f̃(x)|| / ||f(x)|| ≤ (κ(x) + const.)·||x̃ − x|| / ||x||

which gives the result.
There are two properties characterizing a good algorithm: it has to be accurate and it has to be fast. We have studied the first of these properties.

The speed of an algorithm is measured in how many floating point operations or flops it takes to implement it. For matrix operations the number of flops is typically of the order of C·m^3 where C is a constant depending on the algorithm. This can clearly get large very quickly, so it is of some importance to try to minimize the constant C, though for large m, C becomes negligible. It is not hard to estimate the number of flops for a given algorithm, but we shall only give a simple example of multiplying two m×m matrices A = (a_ij) and B = (b_ij).

The ij-th entry in the product matrix is given by a_i1 b_1j + a_i2 b_2j + ... + a_im b_mj. Thus to compute each entry in the product matrix we need m products and m-1 additions, for a total of 2m-1 flops. There are m^2 entries, so we need on the order of (2m-1)m^2 flops, which when m becomes very large approaches 2m^3.
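The count above can be sketched as a pure-Python matrix multiplication that tallies its own flops (the small test matrices are illustrative):

```python
def matmul_count(A, B):
    """Multiply square matrices A and B, counting flops as we go."""
    m = len(A)
    flops = 0
    C = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            s = A[i][0] * B[0][j]       # first of the m products
            flops += 1
            for k in range(1, m):
                s += A[i][k] * B[k][j]  # one product and one addition per term
                flops += 2
            C[i][j] = s
    return C, flops

A = [[2.0, 1.0], [4.0, 3.0]]
B = [[1.0, 0.0], [0.0, 1.0]]            # identity, so C should equal A
C, flops = matmul_count(A, B)
m = len(A)
# (2m-1) flops per entry, m^2 entries
assert flops == (2 * m - 1) * m ** 2
```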
Consider the equation

Ax = b

where A is a non-singular m×m (i.e. square) matrix. Consider the following algorithm for solving this equation:

1. Compute the QR-decomposition of A, QR = A, using the Householder triangularization
2. Let y = ᵗQ b (since Q is orthogonal, Q^{-1} = ᵗQ)
3. Solve the upper-triangular system Rx = y using back substitution

All these steps are backward stable, so the whole algorithm is backward stable and we get the accuracy estimate

||x̃ - x|| / ||x|| = O(κ(A) ε_machine)
Here is a MATLAB experiment:

>> A=randn(100);
>> kappa=cond(A)
kappa =
  644.1458
>> b=randn(100,1);
>> [Q,R]=qr(A);
>> y=Q'*b;
>> tildex=R^-1*y;
>> x=A\b;
>> norm(x-tildex)
ans =
  3.2933e-013

We first construct a 100×100 random matrix. Then we use the cond(A) MATLAB command to compute the condition number. This matrix is well-conditioned. Next we construct a 100-dimensional random column vector, b. We then use the algorithm above to solve the system Ax = b and get the solution tildex. Next we use MATLAB's built-in equation solver (using \). MATLAB's algorithm in fact uses a more precise variant of the QR-decomposition (QR-decomposition with pivoting), but as we can see our algorithm gives a solution quite close to the MATLAB solution.
11 LU-decomposition

Let A be an m×m matrix. The purpose of the LU-decomposition is to write A = L·U where L is a lower triangular matrix and U an upper triangular matrix. The usual algorithm for this is known as Gaussian Elimination and consists simply of subtracting multiples of one row from all the subsequent rows to step-by-step produce 0s below the diagonal. It turns out that we can achieve this by multiplying A by a sequence of lower triangular matrices with 1s in the diagonal

L_{m-1} ... L_2 L_1 A = U

Thus L^{-1} = L_{m-1} ... L_2 L_1.

Multiplying by L_1 creates 0s below the first entry in the first column. Next, multiplying by L_2 creates 0s below the first two entries in the second column without disturbing the first column, and so on.

It is easier to do an example than to give a general explanation of this procedure:
Example 11.1 Let

A = [ 2 1 1 ]
    [ 4 3 3 ]
    [ 8 7 9 ]

We see that subtracting 2× the first row from the second row and 4× the first row from the third row will create 0s below the first entry in the first column. We can achieve this by left-multiplying by the matrix

L_1 = [  1 0 0 ]
      [ -2 1 0 ]
      [ -4 0 1 ]

We get

L_1 A = [ 2 1 1 ]
        [ 0 1 1 ]
        [ 0 3 5 ]

Next we subtract 3× the second row from the third row to get 0s below the second entry in the second column. We achieve this by multiplying by the matrix

L_2 = [ 1  0 0 ]
      [ 0  1 0 ]
      [ 0 -3 1 ]

We get

L_2 L_1 A = [ 2 1 1 ]
            [ 0 1 1 ]  = U
            [ 0 0 2 ]

So

L^{-1} = L_2 L_1 = [  1  0 0 ]
                   [ -2  1 0 ]
                   [  2 -3 1 ]

and

L = [ 1 0 0 ]
    [ 2 1 0 ]
    [ 4 3 1 ]

so

A = [ 1 0 0 ] [ 2 1 1 ]
    [ 2 1 0 ] [ 0 1 1 ]
    [ 4 3 1 ] [ 0 0 2 ]

is the LU-decomposition.
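The elimination of Example 11.1 can be sketched in a few lines of pure Python. This is a sketch of Gaussian elimination without pivoting, so it assumes the diagonal entries it divides by are non-zero:

```python
def lu_no_pivot(A):
    """LU-decomposition by Gaussian elimination without pivoting."""
    m = len(A)
    U = [row[:] for row in A]                                   # U starts as a copy of A
    L = [[float(i == j) for j in range(m)] for i in range(m)]   # identity
    for k in range(m - 1):
        for j in range(k + 1, m):
            L[j][k] = U[j][k] / U[k][k]     # multiplier ell_jk
            for c in range(k, m):
                U[j][c] -= L[j][k] * U[k][c]
    return L, U

# the matrix of Example 11.1
A = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
L, U = lu_no_pivot(A)
# L and U come out exactly as in the example
```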
The algorithm proceeds in the following manner: assume that in the kth step we have produced X_k = L_k L_{k-1} ... L_1 A such that the first k columns in X_k have 0s below the diagonal. In the (k+1)st step we want to multiply on the left by a lower triangular matrix L_{k+1} making the entries in the (k+1)st column from the (k+2)nd to the mth row equal to 0. Thus X_k is of the form

X_k = [ x_11  x_12  ...  x_1k  x_1,k+1    ...  x_1m    ]
      [ 0     x_22  ...  x_2k  x_2,k+1    ...  x_2m    ]
      [ 0     0     ...  x_3k  x_3,k+1    ...  x_3m    ]
      [ ...                                            ]
      [ 0     0     ...  x_kk  x_k,k+1    ...  x_km    ]
      [ 0     0     ...  0     x_k+1,k+1  ...  x_k+1,m ]
      [ 0     0     ...  0     x_k+2,k+1  ...  x_k+2,m ]
      [ ...                                            ]
      [ 0     0     ...  0     x_m,k+1    ...  x_mm    ]

Now take L_{k+1} to be the identity matrix with the entries -x_{j,k+1}/x_{k+1,k+1}, for j = k+2, ..., m, placed in the (k+1)st column below the diagonal:

L_{k+1} = [ 1  0  ...  0                        0  ...  0 ]
          [ 0  1  ...  0                        0  ...  0 ]
          [ ...                                           ]
          [ 0  0  ...  1                        0  ...  0 ]
          [ 0  0  ...  -x_{k+2,k+1}/x_{k+1,k+1} 1  ...  0 ]
          [ 0  0  ...  -x_{k+3,k+1}/x_{k+1,k+1} 0  ...  0 ]
          [ ...                                           ]
          [ 0  0  ...  -x_{m,k+1}/x_{k+1,k+1}   0  ...  1 ]

Left multiplication by L_{k+1} produces 0s in the (k+1)st column below the (k+1, k+1) entry, so X_{k+1} = L_{k+1} X_k has 0s below the diagonal in its first k+1 columns.
Put ℓ_{i,k+1} = x_{i,k+1}/x_{k+1,k+1}; then

L_{k+1}^{-1} = [ 1  0  ...  0            0  ...  0 ]
               [ 0  1  ...  0            0  ...  0 ]
               [ ...                               ]
               [ 0  0  ...  1            0  ...  0 ]
               [ 0  0  ...  ℓ_{k+2,k+1}  1  ...  0 ]
               [ 0  0  ...  ℓ_{k+3,k+1}  0  ...  0 ]
               [ ...                               ]
               [ 0  0  ...  ℓ_{m,k+1}    0  ...  1 ]

Thus to compute L_{k+1}^{-1} we only have to change the sign of the entries in the (k+1)st column below the diagonal. This observation makes it much easier to compute

L = L_1^{-1} L_2^{-1} ... L_m^{-1} = [ 1     0     0     ...  0 ]
                                     [ ℓ_21  1     0     ...  0 ]
                                     [ ℓ_31  ℓ_32  1     ...  0 ]
                                     [ ...                      ]
                                     [ ℓ_m1  ℓ_m2  ℓ_m3  ...  1 ]
We can now formulate the algorithm for computing the LU-decomposition:

U = A, L = E_m
for k = 1 to m-1
    for j = k+1 to m
        ℓ_jk = u_jk/u_kk
        u_{j,k:m} = u_{j,k:m} - ℓ_jk u_{k,k:m}

We can then do an operation count for this algorithm: for each pass of the inner loop the first line counts 1 flop, the second line counts 2 flops for each of u_{j,k}, u_{j,k+1}, ..., u_{j,m}. Thus in total 2(m-k+1)+1 = 2(m-k)+3, hence the inner loop contributes (m-k)(2(m-k)+3) = 2(m-k)^2 + 3(m-k), and so the total operation count is

Σ_{k=1}^{m-1} [2(m-k)^2 + 3(m-k)] = 2·(m-1)m(2(m-1)+1)/6 + 3·(m-1)m/2

As m → ∞ this grows like (2/3)m^3.
If A is factored A = LU we can solve the system Ax = b by solving two triangular systems. First Ly = b using forward substitution, which gives

y_1 = b_1
y_2 = b_2 - ℓ_21 y_1
y_3 = b_3 - ℓ_31 y_1 - ℓ_32 y_2
  ⋮
y_m = b_m - ℓ_m1 y_1 - ℓ_m2 y_2 - ... - ℓ_{m,m-1} y_{m-1}

and secondly Ux = y using back substitution.
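The two triangular solves can be sketched as follows in pure Python (L is assumed unit lower triangular, as produced by the elimination; the right-hand side is an illustrative choice making Ax = b with the factorization of Example 11.1):

```python
def forward_sub(L, b):
    # solve Ly = b, L unit lower triangular
    m = len(b)
    y = [0.0] * m
    for i in range(m):
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))
    return y

def back_sub(U, y):
    # solve Ux = y, U upper triangular
    m = len(y)
    x = [0.0] * m
    for i in range(m - 1, -1, -1):
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, m))) / U[i][i]
    return x

# the factorization from Example 11.1
L = [[1.0, 0.0, 0.0], [2.0, 1.0, 0.0], [4.0, 3.0, 1.0]]
U = [[2.0, 1.0, 1.0], [0.0, 1.0, 1.0], [0.0, 0.0, 2.0]]
b = [4.0, 10.0, 24.0]          # equals A*(1,1,1)^t for A of Example 11.1
x = back_sub(U, forward_sub(L, b))
```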
Consider the following example:

Example 11.2 A = [ 10^{-20} 1 ; 1 1 ]. The process yields L = [ 1 0 ; 10^{20} 1 ] and U = [ 10^{-20} 1 ; 0 1-10^{20} ]. Assume now that we perform this computation on a machine with ε_machine ≈ 10^{-16}; then each of these numbers will be rounded to the nearest floating point number represented by the machine. The number 1 - 10^{20} will be represented by -10^{20}, and so the machine will give

L̃ = [ 1 0 ; 10^{20} 1 ]  and  Ũ = [ 10^{-20} 1 ; 0 -10^{20} ]

If the algorithm were backward stable we would have Ã = L̃Ũ close to A. But

L̃Ũ = [ 10^{-20} 1 ; 1 0 ]

This is very far from A; we could see this by computing ||A - Ã|| / ||A||. Suppose we use the computed LU-decomposition to solve the equation Ax = ᵗ(1, 0). The correct solution is x ≈ ᵗ(-1, 1), but using Ã we get x = ᵗ(0, 1)
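This breakdown is reproducible in double precision (ε_machine ≈ 10^{-16}); the sketch below performs the single elimination step of Example 11.2 without pivoting and shows that L̃Ũ differs from A in the (2,2) entry:

```python
A = [[1e-20, 1.0], [1.0, 1.0]]

# one elimination step, no pivoting
ell = A[1][0] / A[0][0]          # 1e20
u22 = A[1][1] - ell * A[0][1]    # 1 - 1e20 rounds to -1e20 in double precision
L = [[1.0, 0.0], [ell, 1.0]]
U = [[1e-20, 1.0], [0.0, u22]]

# reconstruct the product L*U
LU = [[sum(L[i][k] * U[k][j] for k in range(2)) for j in range(2)]
      for i in range(2)]
# the (2,2) entry of L*U is 0 instead of 1, so L*U is far from A
```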
We also note that the algorithm can fail, namely if one of the x_kk's is 0, for then we would have a division by 0 error.

A partial solution to these problems is to use pivoting. This means interchanging rows so as to bring a non-zero entry into the kk position before we compute L_k. The procedure is as follows: at step k find the entry in the kth column in or below the diagonal with the largest absolute value. Assume that this entry is x_{r,k}; then we want to interchange the kth and the rth rows to bring x_{r,k} into the kkth position. Let P_{r,k} be the matrix obtained from the identity matrix E_m by interchanging the kth and rth rows, i.e. the identity matrix except that there is a 1 in the (k,r)th and the (r,k)th positions and a 0 in the (k,k)th and (r,r)th positions. Left multiplication by P_{r,k} interchanges the kth and rth rows without changing the other entries. Such a matrix is called a transposition matrix. Remark that P_{r,k}^2 = E_m.

Thus the procedure produces an upper triangular matrix

U = L_{m-1} P_{m-1} L_{m-2} P_{m-2} ... L_1 P_1 A

where P_{m-1}, P_{m-2}, ..., P_1 are transposition matrices. Now define L'_k = P_{m-1} P_{m-2} ... P_{k+1} L_k P_{k+1} ... P_{m-2} P_{m-1}.
Right multiplication by P_{r,s} interchanges the rth and the sth columns. Thus P_{r,s} L_k P_{r,s} interchanges the rth and sth columns and the rth and sth rows. But both r and s are > k, hence this does not destroy the structure of L_k (1s in the diagonal and one non-zero column). Now

L'_2 L'_1 P_{m-1} P_{m-2} ... P_1
  = (P_{m-1} P_{m-2} ... P_3 L_2 P_3 ... P_{m-2} P_{m-1}) (P_{m-1} P_{m-2} ... P_3 P_2 L_1 P_2 P_3 ... P_{m-2} P_{m-1}) P_{m-1} P_{m-2} ... P_2 P_1
  = (P_{m-1} P_{m-2} ... P_3) L_2 P_2 L_1 P_1

Multiplying by L'_3 we get

L'_3 L'_2 L'_1 (P_{m-1} P_{m-2} ... P_1)
  = P_{m-1} P_{m-2} ... P_4 L_3 P_4 ... P_{m-2} P_{m-1} (P_{m-1} P_{m-2} ... P_4 P_3) L_2 P_2 L_1 P_1
  = P_{m-1} P_{m-2} ... P_4 L_3 P_3 L_2 P_2 L_1 P_1

Continuing this way we end up with

L'_{m-1} L'_{m-2} ... L'_1 (P_{m-1} P_{m-2} ... P_1) = L_{m-1} P_{m-1} L_{m-2} P_{m-2} ... L_2 P_2 L_1 P_1

Let P = P_{m-1} P_{m-2} ... P_1; then we have

U = L_{m-1} P_{m-1} L_{m-2} P_{m-2} ... L_1 P_1 A = L'_{m-1} L'_{m-2} ... L'_1 P A

Thus using pivoting we get an LU-decomposition of the matrix PA, which has the same rows as A but permuted.

Recall that the entries in L'_k are 1s in the diagonal and 0s everywhere except in the kth column, where the entries below the diagonal are -ℓ_{j,k} = -x_{jk}/x_{kk} for j > k. But because after the pivoting |x_kk| is maximal in the kth column, we have |ℓ_{jk}| ≤ 1 and so all the entries in L'_k have absolute value ≤ 1.
We have the following stability result:

Theorem 11.0.8 Let A = LU be the LU-decomposition of a non-singular matrix. Let L̃ and Ũ be machine computed using Gaussian elimination without pivoting. Then L̃Ũ = A + δA with ||δA|| / (||L|| ||U||) = O(ε_machine) for some m×m matrix δA

Remark that if we had ||A|| in the denominator we would have backward stability. For Gaussian elimination ||L|| and ||U|| can be unboundedly large, and so the algorithm is unstable.

With pivoting, however, the entries in L all have absolute value ≤ 1, so ||L|| = O(1) (of course we now have PA = LU, but ||PA|| = ||A||). Thus if ||U|| = O(||A||) the algorithm will be backward stable.

Definition 11.0.7 The growth factor for the matrix A is defined by

ρ = max_{i,j} |u_ij| / max_{i,j} |a_ij|

Thus if ρ is of order 1 we have ||U|| = O(||A||) and the algorithm is backward stable. In general ||U|| = O(ρ ||A||).
Example 11.3 Consider the m×m matrix

A = [  1  0  0  0 ...  0 1 ]
    [ -1  1  0  0 ...  0 1 ]
    [ -1 -1  1  0 ...  0 1 ]
    [ -1 -1 -1  1 ...  0 1 ]
    [ ...                  ]
    [ -1 -1 -1 -1 ...  1 1 ]
    [ -1 -1 -1 -1 ... -1 1 ]

One can compute the LU-decomposition (no pivoting is needed):

L = [  1  0  0 ... 0 ]
    [ -1  1  0 ... 0 ]
    [ -1 -1  1 ... 0 ]
    [ ...            ]
    [ -1 -1 -1 ... 1 ]

and

U = [ 1 0 0 0 ... 0 1       ]
    [ 0 1 0 0 ... 0 2       ]
    [ 0 0 1 0 ... 0 4       ]
    [ 0 0 0 1 ... 0 8       ]
    [ ...                   ]
    [ 0 0 0 0 ... 1 2^{m-2} ]
    [ 0 0 0 0 ... 0 2^{m-1} ]

So in this case ρ = 2^{m-1}. For a 100×100 matrix we would get ||U|| = O(2^{99} ||A||); thus the relative accuracy in this case is of the order 2^{99}·2^{-16} = 2^{83} if ε_machine ≈ 2^{-16}. Clearly unacceptable.

In almost every case that arises in practice Gaussian elimination with pivoting is well-behaved, i.e. ρ = O(1). Only in very exceptional cases does the algorithm misbehave.
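A sketch of the elimination with partial pivoting in pure Python; run on the matrix of Example 11.2, the row interchange keeps the multipliers ≤ 1 in absolute value and the computed solution of Ax = ᵗ(1, 0) comes out correctly:

```python
def solve_pivot(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    A = [row[:] for row in A]
    b = b[:]
    for k in range(m - 1):
        # pick the largest entry in column k on or below the diagonal
        r = max(range(k, m), key=lambda i: abs(A[i][k]))
        A[k], A[r] = A[r], A[k]
        b[k], b[r] = b[r], b[k]
        for j in range(k + 1, m):
            ell = A[j][k] / A[k][k]   # |ell| <= 1 after pivoting
            for c in range(k, m):
                A[j][c] -= ell * A[k][c]
            b[j] -= ell * b[k]
    # back substitution on the resulting upper triangular system
    x = [0.0] * m
    for i in range(m - 1, -1, -1):
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(i + 1, m))) / A[i][i]
    return x

x = solve_pivot([[1e-20, 1.0], [1.0, 1.0]], [1.0, 0.0])
# close to the true solution (-1, 1); without pivoting one gets (0, 1)
```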
12 Cholesky Decomposition

Consider a symmetric matrix A. If ᵗx A x > 0 for any x ≠ 0 we say that A is positive definite. An example of a symmetric positive definite matrix is the covariance matrix of a collection of non-deterministic random variables X_1, X_2, ..., X_m:

A = [ var(X_1)      cov(X_1,X_2)  cov(X_1,X_3)  ...  cov(X_1,X_m) ]
    [ cov(X_2,X_1)  var(X_2)      cov(X_2,X_3)  ...  cov(X_2,X_m) ]
    [ cov(X_3,X_1)  cov(X_3,X_2)  var(X_3)      ...  cov(X_3,X_m) ]
    [ ...                                                         ]
    [ cov(X_m,X_1)  cov(X_m,X_2)  cov(X_m,X_3)  ...  var(X_m)     ]

A symmetric matrix is positive definite if and only if the eigenvalues of A are all positive. Indeed, if x is a unit eigenvector for the eigenvalue λ (i.e. ||x|| = 1), then 0 < ᵗxAx = ᵗx(λx) = λ(x, x) = λ. Conversely, expressing x as a linear combination of an orthonormal basis of eigenvectors, x = c_1 q_1 + c_2 q_2 + ... + c_m q_m, we get ᵗxAx = λ_1 c_1^2 + λ_2 c_2^2 + ... + λ_m c_m^2 > 0.
Write A = [ a_11 ᵗw ; w K ]. Since ᵗe_1 A e_1 = a_11 we have a_11 > 0, and by considering vectors with first coordinate = 0, i.e. of the form x = ᵗ(0, y) where y is any (m-1)-dimensional vector, we get 0 < ᵗxAx = ᵗyKy; we see that K is a symmetric, positive definite (m-1)×(m-1) matrix. Let

R_1 = [ √a_11  ᵗw/√a_11 ; 0  E_{m-1} ]

then we have

A = [ √a_11  ᵗ0 ; w/√a_11  E_{m-1} ] · [ 1  ᵗ0 ; 0  K - w ᵗw/a_11 ] · [ √a_11  ᵗw/√a_11 ; 0  E_{m-1} ]

Then

A_1 = ᵗR_1^{-1} A R_1^{-1} = [ 1  ᵗ0 ; 0  K - w ᵗw/a_11 ]

is again positive definite; indeed ᵗxA_1x = ᵗ(R_1^{-1}x) A (R_1^{-1}x) > 0. Now this implies, as above, that the symmetric (m-1)×(m-1) matrix K - w ᵗw/a_11 is positive definite.

Thus by the same method as above we can find the (m-1)×(m-1) matrix

R̃_2 = [ √α  ᵗw_1/√α ; 0  E_{m-2} ]

(where α is the upper left entry of K - w ᵗw/a_11 and w_1 the column below it) such that

K - w ᵗw/a_11 = ᵗR̃_2 [ 1  ᵗ0 ; 0  K_1 ] R̃_2

where K_1 is a symmetric (m-2)×(m-2) positive definite matrix. Let R_2 = [ 1  ᵗ0 ; 0  R̃_2 ]; then we get

ᵗR_2 · [ 1 0 ᵗ0 ; 0 1 ᵗ0 ; 0 0 K_1 ] · R_2 = A_1

and hence

A = ᵗR_1 ᵗR_2 [ 1 0 ᵗ0 ; 0 1 ᵗ0 ; 0 0 K_1 ] R_2 R_1.
Continuing this way we get A = ᵗR_1 ᵗR_2 ... ᵗR_{m-1} ᵗR_m E_m R_m R_{m-1} ... R_2 R_1. Clearly the matrices R_i are all upper triangular, hence the product R = R_m R_{m-1} ... R_2 R_1 is upper triangular, and we have proved that A = ᵗRR where R is an upper triangular matrix. This is the Cholesky decomposition. Thus we have

Theorem 12.0.9 Let A be a symmetric, positive definite matrix. Then we can write A = ᵗRR where R is an upper triangular matrix.
The following MATLAB session indicates that the Cholesky decomposition is backward stable:

>> R=triu(rand(50));
>> A=R'*R;
>> S=chol(A);
>> norm(S-R)
ans =
    3.3056
>> norm(S'*S-A)
ans =
  8.5023e-015
>>
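The recursive construction above leads to the standard Cholesky algorithm; the pure-Python sketch below computes the upper triangular R with ᵗR R = A for a small positive definite matrix (the example matrix is an illustrative assumption, chosen so the factors come out exact):

```python
import math

def cholesky(A):
    """Return upper triangular R with R^t R = A (A symmetric positive definite)."""
    m = len(A)
    R = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i, m):
            s = A[i][j] - sum(R[k][i] * R[k][j] for k in range(i))
            if i == j:
                R[i][i] = math.sqrt(s)    # diagonal entry
            else:
                R[i][j] = s / R[i][i]     # off-diagonal entry in row i
    return R

A = [[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]]
R = cholesky(A)
# check the factorization: R^t R should reproduce A
RtR = [[sum(R[k][i] * R[k][j] for k in range(3)) for j in range(3)]
       for i in range(3)]
```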
13 Eigenvalues

In this section we allow our matrices to have entries in C.

The eigenvalues of the m×m matrix A are the roots of the characteristic polynomial det(xE_m - A). This is a degree m polynomial and so has m roots in the complex numbers, when counted with multiplicities. Thus we can factor

det(xE_m - A) = (x - λ_1)^{m_1} (x - λ_2)^{m_2} ... (x - λ_k)^{m_k}

where λ_1, λ_2, ..., λ_k are the distinct eigenvalues of A and m_1 + m_2 + ... + m_k = m.

Definition 13.0.8 The algebraic multiplicity of an eigenvalue λ_i is the root multiplicity m_i

Definition 13.0.9 The geometric multiplicity of an eigenvalue λ_i is the dimension of the eigenspace ker(A - λ_i E_m)

Lemma 13.0.1 Let X be an invertible m×m matrix. Then the matrix X^{-1}AX has the same characteristic polynomial as A and hence has the same eigenvalues with the same algebraic multiplicities as A.

Proof: We have det(xE_m - X^{-1}AX) = det(xX^{-1}X - X^{-1}AX) = det(X^{-1}(xE_m - A)X) = det(X^{-1}) det(xE_m - A) det(X) = det(xE_m - A)
Let n_i be the geometric multiplicity of λ_i and consider a basis v_1, v_2, ..., v_{n_i} of the eigenspace. We extend this to a basis of C^m. Let C be the coordinate transformation matrix between this basis and the standard basis. Then C^{-1}AC is the matrix w.r.t. this basis. But we have C^{-1}AC v_j = λ_i v_j, hence C^{-1}AC is of the form

[ λ_i E_{n_i}  S ]
[ 0            T ]

where S is an n_i × (m-n_i) matrix and T an (m-n_i) × (m-n_i) matrix. Then det(xE_m - C^{-1}AC) = (x - λ_i)^{n_i} det(xE_{m-n_i} - T). But det(xE_m - C^{-1}AC) = det(xE_m - A), and so (x - λ_i)^{n_i} divides the characteristic polynomial of A. It follows that n_i ≤ m_i, and so we have shown

Proposition 13.0.3 The geometric multiplicity of an eigenvalue λ is always ≤ the algebraic multiplicity of λ

Definition 13.0.10 An eigenvalue λ is said to be defective if the geometric multiplicity of λ is < the algebraic multiplicity of λ

Definition 13.0.11 The matrix A is said to be defective if it has at least one defective eigenvalue.

Theorem 13.0.10 A matrix A is diagonalizable, i.e. A = X^{-1}ΛX where Λ is a diagonal matrix, if and only if A is non-defective
Proof: A diagonal matrix is non-defective; indeed, if

Λ = [ λ_1  0    ...  0    0    ...  0   ]
    [ 0    λ_1  ...  0    0    ...  0   ]
    [ ...                               ]
    [ 0    ...  0    λ_1  0    ...  0   ]
    [ 0    ...  0    0    λ_2  ...  0   ]
    [ ...                               ]
    [ 0    0    0    ...  ...  0    λ_k ]

then clearly the algebraic multiplicity of an eigenvalue λ_i is the number of times λ_i appears in the diagonal, but this is also the dimension of ker(Λ - λ_i E_m).

To go the other way, let m_i be the algebraic multiplicity of λ_i. By assumption this is also the geometric multiplicity, so we can find a basis v_{i1}, v_{i2}, ..., v_{im_i} of the eigenspace for λ_i. Since eigenvectors belonging to different eigenvalues are linearly independent by Proposition 4.0.6, the totality of these vectors for all the distinct eigenvalues is linearly independent, and since m_1 + m_2 + ... + m_i + ... + m_k = m they form a basis, i.e. we have a basis of eigenvectors. Then the matrix w.r.t. this basis is diagonal, and if X is the coordinate transformation matrix we have XAX^{-1} = Λ, which is diagonal.
We know from the Spectral Theorem (Theorem 4.1.1) that if A is symmetric then we can find an orthogonal matrix Q such that ᵗQAQ = Λ is diagonal. We have the following general theorem:

Theorem 13.0.11 Any square matrix A has a Schur factorization

A = Q T Q^{-1}

where T is an upper triangular matrix and Q is unitary

Proof: We use induction on m. For m = 1 it is trivial; assume it is true for (m-1)×(m-1) matrices. Let q_1 be an eigenvector in C^m with eigenvalue λ. Extend q_1 to an orthonormal basis q_1, q_2, ..., q_m. Let U be the m×m matrix with these vectors as columns; then U is unitary. Since U is the coordinate transformation matrix between the standard basis and the basis consisting of the q's, and since Aq_1 = λq_1, U^{-1}AU is of the form [ λ ᵗb ; 0 C ]. C is an (m-1)×(m-1) matrix, and so by the induction hypothesis C has a Schur factorization C = V S V^{-1} where V is unitary and S is upper-triangular.

Now put Q = U [ 1 ᵗ0 ; 0 V ]; then Q is unitary and we have

Q^{-1}AQ = [ 1 ᵗ0 ; 0 V^{-1} ] U^{-1}AU [ 1 ᵗ0 ; 0 V ]
         = [ 1 ᵗ0 ; 0 V^{-1} ] [ λ ᵗb ; 0 C ] [ 1 ᵗ0 ; 0 V ]
         = [ λ ᵗb ; 0 V^{-1}C ] [ 1 ᵗ0 ; 0 V ]
         = [ λ ᵗbV ; 0 V^{-1}CV ]
         = [ λ ᵗbV ; 0 S ]

Let T = [ λ ᵗbV ; 0 S ]. Since S is upper-triangular, T is as well, and Q is unitary; thus we have a Schur factorization of A.

The eigenvalues of A are the diagonal elements in T, since A and T have the same characteristic polynomial, and the characteristic polynomial of an upper-triangular matrix T is (x - t_11)(x - t_22)...(x - t_mm)
We shall now turn to the problem of actually computing eigenvalues. The obvious way would be to find the characteristic polynomial and use an algorithm to find the roots. As we have seen, finding roots of a polynomial is an extremely ill-conditioned problem, so it is unlikely that this method will give good results. But is there another algorithm to find the eigenvalues? It is known that for general polynomials of degree ≥ 5 there is no algorithm with finitely many steps for finding the roots as a function of the coefficients. Since any polynomial can be realized as the characteristic polynomial of a matrix (the polynomial p(x) = a_0 + a_1 x + a_2 x^2 + ... + a_{m-1} x^{m-1} + x^m is the characteristic polynomial of the matrix

[ 0  0  0  ...  ...  -a_0     ]
[ 1  0  0  ...  ...  -a_1     ]
[ 0  1  0  ...  ...  -a_2     ]
[ ...                         ]
[ 0  0  0  ...  0    -a_{m-2} ]
[ 0  0  0  ...  1    -a_{m-1} ]

), there cannot be any general algorithm to find eigenvalues that concludes in finitely many steps. Hence we have to resort to iterative methods to find approximations to the eigenvalues. Thus the goal of an eigenvalue finding algorithm should be to iteratively construct a sequence of numbers that converges quickly to an eigenvalue.
To simplify matters we shall assume that A is real and symmetric, so the eigenvalues are real. We shall also assume we have ordered the eigenvalues according to absolute value, i.e. |λ_1| ≥ |λ_2| ≥ ... ≥ |λ_m|, and that we have a corresponding orthonormal basis of eigenvectors q_1, q_2, ..., q_m

Definition 13.0.12 Let x be a non-zero vector. The Rayleigh quotient is

r(x) = ᵗxAx / ᵗxx = ᵗxAx / ||x||^2

Remark that if x is an eigenvector, Ax = λx, then r(x) = λ.

We view r as a function R^m \ {0} → R. In terms of the coordinates of x and the entries of A we can write it as

r(x_1, x_2, ..., x_m) = (Σ_{i,j} a_ij x_i x_j) / (Σ_i x_i^2)

Computing the partial derivatives we get

∂r/∂x_k = [ (2 Σ_i a_ik x_i)(Σ_i x_i^2) - (Σ_{i,j} a_ij x_i x_j)(2x_k) ] / (Σ_i x_i^2)^2
        = 2(Ax)_k/||x||^2 - (2/||x||^2)(ᵗxAx/||x||^2) x_k
        = (2/||x||^2) ((Ax)_k - r(x) x_k)

Hence we get ∇r(x) = (2/||x||^2)(Ax - r(x)x). Thus ∇r(x) = 0 precisely when x is an eigenvector.

Clearly r(ax) = r(x) for all a ≠ 0. Hence we may as well only look at unit vectors, i.e. ||x|| = 1, and thus we have

Theorem 13.0.12 A unit vector x is an eigenvector for A if and only if ∇r(x) = 0. In this case the eigenvalue is r(x)

Let x be a unit vector and write x = a_1 q_1 + a_2 q_2 + ... + a_m q_m. Then r(x) = Σ_i a_i^2 λ_i. Hence r(x) - r(q_j) = a_1^2 λ_1 + a_2^2 λ_2 + ... + (a_j^2 - 1)λ_j + ... + a_m^2 λ_m. Thus if x → q_j, i.e. a_i → 0 for i ≠ j and a_j → 1, we see that r(x) - r(q_j) → 0 as the square of ||x - q_j||
Another possible way to find a sequence of vectors converging to an eigenvector is the method of power iteration. The algorithm works as follows:

v^(0) = any unit vector
for k = 1, 2, ...
    w = Av^(k-1)
    v^(k) = w/||w||
    λ^(k) = r(v^(k))

We have v^(1) = Av^(0)/||Av^(0)|| and

v^(2) = Av^(1)/||Av^(1)|| = (A^2 v^(0)/||Av^(0)||) / (||A^2 v^(0)|| / ||Av^(0)||) = A^2 v^(0)/||A^2 v^(0)||

and in general v^(k) = A^k v^(0)/||A^k v^(0)||
To analyze power iteration write v^(0) = a_1 q_1 + a_2 q_2 + ... + a_m q_m where a_1 ≠ 0. Then A^k v^(0) = a_1 λ_1^k q_1 + a_2 λ_2^k q_2 + ... + a_m λ_m^k q_m, and so

v^(k) = (a_1 λ_1^k q_1 + a_2 λ_2^k q_2 + ... + a_m λ_m^k q_m) / √(a_1^2 λ_1^{2k} + a_2^2 λ_2^{2k} + ... + a_m^2 λ_m^{2k})

      = λ_1^k a_1 (q_1 + (a_2/a_1)(λ_2/λ_1)^k q_2 + ... + (a_m/a_1)(λ_m/λ_1)^k q_m)
        / ( |λ_1^k| |a_1| √(1 + (a_2^2/a_1^2)(λ_2/λ_1)^{2k} + ... + (a_m^2/a_1^2)(λ_m/λ_1)^{2k}) )

      = ± (q_1 + (a_2/a_1)(λ_2/λ_1)^k q_2 + ... + (a_m/a_1)(λ_m/λ_1)^k q_m)
        / √(1 + (a_2^2/a_1^2)(λ_2/λ_1)^{2k} + ... + (a_m^2/a_1^2)(λ_m/λ_1)^{2k})

where the ± depends on whether λ_1^k a_1 is positive or negative.

Thus ||v^(k) - (±q_1)|| = O(|λ_2/λ_1|^k). If |λ_1| > |λ_2| then |λ_2/λ_1|^k → 0 and so v^(k) → ±q_1. This convergence is only linear in |λ_2/λ_1|, i.e. going from k to k+1 improves precision by a factor of |λ_2/λ_1|, and if this fraction is close to 1, i.e. if |λ_2| is close to |λ_1|, the convergence will be very slow.
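Power iteration can be sketched in a few lines of pure Python; run on the symmetric matrix used later in this section (A = [2 1 1; 1 3 1; 1 1 4]) it converges to the dominant eigenvalue λ ≈ 5.2143:

```python
import math

def mat_vec(A, v):
    return [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(A))]

def power_iteration(A, v, steps):
    lam = 0.0
    for _ in range(steps):
        w = mat_vec(A, v)
        nw = math.sqrt(sum(x * x for x in w))
        v = [x / nw for x in w]                         # v^(k) = w / ||w||
        Av = mat_vec(A, v)
        lam = sum(v[i] * Av[i] for i in range(len(v)))  # Rayleigh quotient
    return lam, v

A = [[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [1.0, 1.0, 4.0]]
lam, v = power_iteration(A, [1.0, 0.0, 0.0], 50)
# lam is close to the largest eigenvalue, about 5.2143
```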
The trick to overcome this problem is known as inverse iteration. Here is how it works: consider a number μ which is not an eigenvalue. Then the matrix A - μE_m is invertible and the eigenvalues of (A - μE_m)^{-1} are the numbers 1/(λ - μ) where λ runs through the eigenvalues of A. So if we can find the eigenvalues of (A - μE_m)^{-1} we can certainly find those of A. The idea is that if μ is very close to a given eigenvalue λ, then |λ - μ| is very small and hence 1/|λ - μ| is very large, and in fact by choosing μ sufficiently close to λ we can make it as large as we want relative to the other eigenvalues. Thus the iterative process for (A - μE_m)^{-1} will converge very fast. There are two caveats: we have to have some idea about λ in order to choose μ close to λ, and the problem of finding (A - μE_m)^{-1} may be very ill-conditioned if A - μE_m is very close to being singular, which it will be if μ is very close to λ.

Here is the algorithm:

v^(0) = any unit vector
for k = 1, 2, ...
    Solve (A - μE_m)w = v^(k-1) for w
    v^(k) = w/||w||
    λ^(k) = r(v^(k))
We shall argue that the ill-conditioning is not going to cause trouble. Let us assume that the absolute value of the last eigenvalue λ_m is much smaller than the absolute values of the other eigenvalues. Then we may as well assume that μ is 0 and we are solving the equation Aw = v where v is a unit vector. Thus the problem is given by A ↦ A^{-1}v. Assume we are using a backward stable algorithm to solve the equation and that the computed solution is w̃. By the backward stability we have Ãw̃ = v where Ã = A + δA is a perturbation of A with ||δA||/||A|| = O(ε_machine), and where w̃ = w + δw is a perturbation of w with ||δw||/||w|| = O(κ(A) ε_machine). Now κ(A) = |λ_1/λ_m|, which by our assumptions is very large. Thus there is very little chance that w and w̃ are close, but we shall see that the normalized vectors w/||w|| and w̃/||w̃|| are close.

Remark that the eigenvectors for A^{-1} are the same as the eigenvectors for A and the eigenvalues are the inverses. Thus the largest eigenvalue for A^{-1} is 1/λ_m and the eigenvector is q_m. Our argument above gives

|| w/||w|| - (±q_m) || = O(|λ_m/λ_{m-1}|)

Let ṽ = Ã(w̃/||w̃||); then we have ||ṽ|| ≤ ||Ã|| ≈ ||A|| = |λ_1|.
Write ṽ = a_1 q_1 + a_2 q_2 + ... + a_m q_m; then

w̃/||w̃|| = (a_1/λ_1) q_1 + (a_2/λ_2) q_2 + ... + (a_{m-1}/λ_{m-1}) q_{m-1} + (a_m/λ_m) q_m

Since (a_1/λ_1)^2 + (a_2/λ_2)^2 + ... + (a_{m-1}/λ_{m-1})^2 + (a_m/λ_m)^2 = || w̃/||w̃|| ||^2 = 1 we have

w̃/||w̃|| = [ (a_1/λ_1) q_1 + (a_2/λ_2) q_2 + ... + (a_{m-1}/λ_{m-1}) q_{m-1} + (a_m/λ_m) q_m ]
          / √( (a_1/λ_1)^2 + (a_2/λ_2)^2 + ... + (a_{m-1}/λ_{m-1})^2 + (a_m/λ_m)^2 )

        = (a_m/λ_m) [ (a_1 λ_m)/(a_m λ_1) q_1 + ... + (a_{m-1} λ_m)/(a_m λ_{m-1}) q_{m-1} + q_m ]
          / ( |a_m/λ_m| √( ((a_1 λ_m)/(a_m λ_1))^2 + ((a_2 λ_m)/(a_m λ_2))^2 + ... + ((a_{m-1} λ_m)/(a_m λ_{m-1}))^2 + 1 ) )

        = ± [ (a_1 λ_m)/(a_m λ_1) q_1 + ... + (a_{m-1} λ_m)/(a_m λ_{m-1}) q_{m-1} + q_m ]
          / √( ((a_1 λ_m)/(a_m λ_1))^2 + ((a_2 λ_m)/(a_m λ_2))^2 + ... + ((a_{m-1} λ_m)/(a_m λ_{m-1}))^2 + 1 )

This shows that also || w̃/||w̃|| - (±q_m) || = O(|λ_m/λ_{m-1}|), and so

|| w̃/||w̃|| - w/||w|| || = O(|λ_m/λ_{m-1}|)
Thus the algorithm will construct a sequence converging to the proper eigenvector if μ is chosen appropriately. We shall combine the inverse iteration with the Rayleigh quotient to obtain a very fast converging algorithm:

v^(0) = any unit vector
λ^(0) = ᵗv^(0) A v^(0) = corresponding Rayleigh quotient
for k = 1, 2, ...
    Solve (A - λ^(k-1) E_m)w = v^(k-1) for w
    v^(k) = w/||w||
    λ^(k) = r(v^(k))

For this algorithm we have the following theorem:

Theorem 13.0.13 Rayleigh quotient iteration converges to an eigenvector/eigenvalue pair for all starting unit vectors except for a small set of unit vectors (of measure zero). When the algorithm converges, the convergence is cubic in the sense that if λ_j is an eigenvalue of A and v^(0) is sufficiently close to q_j then

||v^(k+1) - (±q_j)|| = O(||v^(k) - (±q_j)||^3)

and

|λ^(k+1) - λ_j| = O(|λ^(k) - λ_j|^3)
Thus each iteration triples the number of digits of accuracy. Here is an example:

>> A=[2 1 1;1 3 1;1 1 4];
>> v0=[1;0;0];
>> r0=v0'*A*v0
r0 = 2
>> w=(A-r0*eye(3))\v0;
>> v1=w/norm(w)
v1 =
   -0.7071
    0.7071
         0
>> r1=v1'*A*v1
r1 =
    1.5000
>> w=(A-r1*eye(3))\v1;
>> v2=w/norm(w)
v2 =
    0.9035
   -0.3720
   -0.2126
>> r2=v2'*A*v2
r2 =
    1.3305
>> w=(A-r2*eye(3))\v2;
>> v3=w/norm(w)
v3 =
   -0.8876
    0.4274
    0.1719
>> r3=v3'*A*v3
r3 =
    1.3249
After the fourth step all the results from MATLAB become equal. The eigenvector closest to ᵗ(1, 0, 0) is q_1 = ᵗ(0.8877, 0.4271, 0.1721) (up to sign) with eigenvalue λ_1 = 1.3249.

Starting with v^(0) = ᵗ(0, 1, 0) we get after four steps q_2 = ᵗ(0.2332, 0.7392, 0.6318) and λ_2 = 2.4608.

Starting with v^(0) = ᵗ(0, 0, 1) we get after five steps q_3 = ᵗ(0.3971, 0.5207, 0.7558) and λ_3 = 5.2143.

Using the MATLAB command

[V,D]=eig(A)

we get a matrix V whose columns are the eigenvectors and a diagonal matrix D where the diagonal entries are the eigenvalues. They agree with the ones we have found here.

Use MATLAB to try other examples of larger matrices.

Write the following M-file:

function A=qralg(B)
[Q,R]=qr(B);
A=R*Q;

Input some square matrix A and run

A=qralg(A)

Keep running this command a few times (use the up-arrow to get back to the command rather than retyping it)
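The Rayleigh quotient iteration of the MATLAB session can also be sketched in pure Python; the linear solve (A - λ^(k-1) E_m)w = v^(k-1) is done with a small inline Gaussian elimination with partial pivoting, and the run reproduces the convergence to λ_1 ≈ 1.3249 from v^(0) = (1, 0, 0):

```python
import math

def solve(A, b):
    # Gaussian elimination with partial pivoting, then back substitution
    m = len(b)
    A = [row[:] for row in A]
    b = b[:]
    for k in range(m - 1):
        r = max(range(k, m), key=lambda i: abs(A[i][k]))
        A[k], A[r] = A[r], A[k]
        b[k], b[r] = b[r], b[k]
        for j in range(k + 1, m):
            ell = A[j][k] / A[k][k]
            for c in range(k, m):
                A[j][c] -= ell * A[k][c]
            b[j] -= ell * b[k]
    x = [0.0] * m
    for i in range(m - 1, -1, -1):
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(i + 1, m))) / A[i][i]
    return x

def rayleigh_iteration(A, v, steps):
    m = len(v)
    Av = [sum(A[i][j] * v[j] for j in range(m)) for i in range(m)]
    lam = sum(v[i] * Av[i] for i in range(m))            # lambda^(0)
    for _ in range(steps):
        B = [[A[i][j] - lam * (i == j) for j in range(m)] for i in range(m)]
        w = solve(B, v)                                   # (A - lam E_m) w = v
        nw = math.sqrt(sum(x * x for x in w))
        v = [x / nw for x in w]
        Av = [sum(A[i][j] * v[j] for j in range(m)) for i in range(m)]
        lam = sum(v[i] * Av[i] for i in range(m))         # Rayleigh quotient
    return lam, v

A = [[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [1.0, 1.0, 4.0]]
lam, v = rayleigh_iteration(A, [1.0, 0.0, 0.0], 4)
# lam is close to 1.3249, matching the MATLAB session above
```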
14 Regression Analysis

Consider random variables y_t, X_{1,t}, X_{2,t}, ..., X_{k,t} where t belongs to some indexing set I (t may denote time, in which case I would be some interval on the real axis, or I could be some finite set). We assume that for any two indices t_1, t_2 ∈ I, X_{i,t_1} and X_{i,t_2} are identically distributed. We shall use x_{i,t} to indicate a sample of the random variable X_{i,t}; thus two samples x_{i,t_1} and x_{i,t_2} can be viewed as samples from a common distribution.

A statistical model relating these random variables is an equation

y_t = f(X_{1,t}, X_{2,t}, ..., X_{k,t}) + u_t

for each t ∈ I, where f is a function (which does not depend on t) and u is some stochastic process indexed by I. In the special case where u is identically 0 we can express y exactly as a function of the X's, but in general all we can specify about the random variable u_t are some distributional properties. Regression analysis is concerned with estimating the function f from observed samples y_t, x_{1,t}, x_{2,t}, ..., x_{k,t}.
The simplest linear regression model can be expressed by the equation

y_t = β_0 + β_1 X_t + u_t

The purpose of formulating this model is to explain the value of the dependent variable y as an affine function of the independent variable X plus an error term u. The coefficients of the model, β_0 and β_1, remain to be determined.

The underlying assumption of the model is that there is an affine relationship between the dependent and the independent variable, and because of the inherent imprecision in our data samples there will be a random error. Without making certain assumptions about the error term the model clearly has no meaning; in fact we could take any β_0 and β_1 and define u = y - β_0 - β_1 X.
Assume now that we have observations (y_{t_1}, x_{t_1}), (y_{t_2}, x_{t_2}), ..., (y_{t_n}, x_{t_n}). We want to use these observations to get estimates for the parameters β_0 and β_1 under certain assumptions about the error term u.

The first condition we want to impose is that the expectation E(u_t) = 0 for all t ∈ I. Using this we get the equation:

E(y_t) = β_0 + β_1 E(X_t)

Because all the y_t's and X_t's are identically distributed we can estimate the expectations by the sample means

Ê(y_t) = (1/n) Σ_i y_{t_i}   and   Ê(X_t) = (1/n) Σ_i x_{t_i}

and hence we get one equation to estimate the coefficients. But there are two coefficients, so one equation does not suffice.
The second assumption we make is more serious. We assume that the error term u_t is independent of the independent variable X_t for each t ∈ I. If the model is correctly specified and the error term arises from sampling errors rather than systemic errors, this is a reasonable assumption.

Figure 7: Scatter plot of the data set

This implies that E(u_t X_t) = E(u_t)E(X_t), and hence from the first assumption (that E(u_t) = 0) we get that also E(u_t X_t) = 0.

Multiplying the equation through by X and taking expectations we get

E(yX) = β_0 E(X) + β_1 E(X^2)

We now have two equations and we get

β_1 = (E(yX) - E(y)E(X)) / (E(X^2) - E(X)^2) = cov(y, X) / var(X)

and β_0 = E(y) - (cov(y, X)/var(X)) E(X). We can then use the sample estimates to get estimates for the coefficients
Example 14.1 Consider the data sets

y_t = 8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68

and

x_t = 10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0

Using the sample means we get estimates for the coefficients: β̂_0 = 3.001 and β̂_1 = 0.5001
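The estimates of Example 14.1 can be reproduced directly from the sample-mean formulas β̂_1 = côv(y, X)/vâr(X) and β̂_0 = Ê(y) - β̂_1 Ê(X):

```python
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]

n = len(x)
mx = sum(x) / n                     # sample mean of X
my = sum(y) / n                     # sample mean of y
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
var = sum((xi - mx) ** 2 for xi in x) / n

beta1 = cov / var                   # slope estimate
beta0 = my - beta1 * mx             # intercept estimate
# beta0 comes out close to 3.00 and beta1 close to 0.500
```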
Figure 8: Scatter plot of the data set with the Regression line

Let
$$y = \begin{pmatrix} y_{t_1} \\ y_{t_2} \\ \vdots \\ y_{t_n} \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_{t_1} \\ 1 & x_{t_2} \\ \vdots & \vdots \\ 1 & x_{t_n} \end{pmatrix} \quad\text{and}\quad u = \begin{pmatrix} u_{t_1} \\ u_{t_2} \\ \vdots \\ u_{t_n} \end{pmatrix}.$$
Then we have
$$y = X\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + u$$
Now
$${}^tX \cdot u = \begin{pmatrix} 1 & 1 & \ldots & 1 \\ x_1 & x_2 & \ldots & x_n \end{pmatrix}\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix} = \begin{pmatrix} \sum_t u_t \\ \sum_t x_t u_t \end{pmatrix} = \begin{pmatrix} n\,\widehat{E}(u) \\ n\,\widehat{E}(Xu) \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
and hence
$${}^tX \cdot y = {}^tX \cdot X \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}$$
This shows that
$$\begin{pmatrix} \widehat{\beta}_0 \\ \widehat{\beta}_1 \end{pmatrix} = ({}^tX \cdot X)^{-1} \cdot {}^tX \cdot y$$
and hence the estimates for the coefficients are precisely the least squares solution to the equation $y = X\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}$.
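The same numbers fall out of the matrix formula. Writing out $({}^tX \cdot X)^{-1} \cdot {}^tX \cdot y$ for the $2 \times 2$ normal equations gives a closed form that a short Python sketch can evaluate (again on the Example 14.1 data; no linear-algebra library is needed):

```python
# Least-squares solution (tX X)^(-1) tX y for simple regression, with
# tX X = [[n, sum x], [sum x, sum x^2]] and tX y = [sum y, sum x*y],
# solved by Cramer's rule on the 2x2 system.
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
n = len(x)
sx, sxx = sum(x), sum(v * v for v in x)
sy, sxy = sum(y), sum(a * b for a, b in zip(x, y))
det = n * sxx - sx * sx
beta0 = (sxx * sy - sx * sxy) / det
beta1 = (n * sxy - sx * sy) / det
print(round(beta0, 4), round(beta1, 4))  # -> 3.0001 0.5001
```

in agreement with the moment-based estimates, as the derivation above shows it must be.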
We can of course always find a least squares solution as we saw in section 9, but it is important to realize that just fitting an ordinary least squares (OLS) line to the data does not in any way mean that the data are samples from a linear model. It is a common error to indiscriminately fit OLS lines to data sets. This can lead to serious errors.
Example 14.2 Consider the following data sets:
1.
$$y_t = 9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74$$
$$x_t = 10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0$$
2.
$$y_t = 7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73$$
$$x_t = 10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0$$
3.
$$y_t = 6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89$$
$$x_t = 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0$$
For data set 1 we get the OLS solution $\begin{pmatrix}\widehat{\beta}_0\\\widehat{\beta}_1\end{pmatrix} = \begin{pmatrix}3.009\\.5000\end{pmatrix}$. For data set 2 we get $\begin{pmatrix}\widehat{\beta}_0\\\widehat{\beta}_1\end{pmatrix} = \begin{pmatrix}3.0025\\.4995\end{pmatrix}$ and for data set 3, $\begin{pmatrix}\widehat{\beta}_0\\\widehat{\beta}_1\end{pmatrix} = \begin{pmatrix}3.0017\\.4999\end{pmatrix}$. Thus we might be tempted to conclude that these three data sets are all samples from the same model. However, if we graph these data sets we get completely different pictures.
If we graph the residuals, i.e. the error terms $u_t$, against the independent variable in the case of data set 1, we see that the assumption of independence is clearly invalid and so the linear model is incorrect.
These examples show the importance of graphing the data before starting any numerical analysis. A graphical presentation may help in formulating reasonable models; in the case of data set 1 a quadratic model seems much more appropriate. Indeed, if we try to fit a model of the form
$$y = \beta_0 + \beta_1 X + \beta_2 X^2$$
we find the least squares solutions $\beta_0 = -5.9957$, $\beta_1 = 2.7808$, $\beta_2 = -0.1267$ and the fitted curve is shown in Figure 13.

Figure 9: Data set 1

Figure 10: Data set 2

Figure 11: Data set 3

Figure 12: Residuals for data set 1
Recall that if $X_1, X_2$ are random variables, the joint distribution is the function $F_{X_1,X_2}(x_1, x_2) = \mathrm{Prob}(X_1 < x_1, X_2 < x_2)$ and the joint density function is
$$f_{X_1,X_2}(t_1, t_2) = \frac{\partial^2 F}{\partial x_1\,\partial x_2}(t_1, t_2).$$
Thus
$$F_{X_1,X_2}(x_1, x_2) = \int_{-\infty}^{x_1}\int_{-\infty}^{x_2} f_{X_1,X_2}(t_1, t_2)\,dt_2\,dt_1.$$
Definition 14.0.13 The conditional density is defined by
$$f_{X_1|X_2}(t_1|t_2) = \frac{f_{X_1,X_2}(t_1, t_2)}{f_{X_2}(t_2)}.$$
Definition 14.0.14 Two random variables $X_1$ and $X_2$ are independent if $f_{X_1,X_2} = f_{X_1} f_{X_2}$, or equivalently $f_{X_1|X_2} = f_{X_1}$.
Recall that the expectation of the random variable $X$ is
$$E(X) = \int_\Omega X\,d\mu = \int_{-\infty}^{\infty} t f_X(t)\,dt$$
Definition 14.0.15 Recall (Introduction to Stochastic Processes = ISP 2.4) that the conditional expectation $E(X_1|X_2)$ is the random variable defined by the condition
$$\int_A E(X_1|X_2)\,d\mu = \int_A X_1\,d\mu$$
for every set $A$ in the $\sigma$-algebra generated by $X_2$.
Lemma 14.0.2
$$E(X_1|X_2)(\omega) = \int_{-\infty}^{\infty} x_1 f_{X_1|X_2}(x_1, X_2(\omega))\,dx_1$$
Proof: By Example 2.8 in ISP we have for any function $h$ that $E(h(X_1)|X_2)(\omega) = g(X_2(\omega))$ where $g$ is the function defined by
$$g(x_2) = \int_{-\infty}^{\infty} h(x_1) f_{X_1|X_2}(x_1, x_2)\,dx_1$$
Applying this with $h(x_1) = x_1$ proves the result.
Lemma 14.0.3 (Law of iterated expectations) $E(E(X_1|X_2)) = E(X_1)$
Proof: Let $\mathcal{F}_0$ denote the trivial $\sigma$-algebra. Then $E(X|\mathcal{F}_0) = E(X)$ for any random variable. Now we have $\mathcal{F}_0 \subset \sigma(X_2) \subset \mathcal{F}$ and so we have by transitivity of conditional expectations (ISP Theorem 2.4.3)
$$E(E(X_1|X_2)) = E(E(X_1|X_2)\,|\,\mathcal{F}_0) = E(E(X_1|\sigma(X_2))\,|\,\mathcal{F}_0) = E(X_1|\mathcal{F}_0) = E(X_1)$$
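A discrete sanity check of the law of iterated expectations (a sketch with a made-up joint pmf, not taken from the notes):

```python
# For a small joint pmf p[(x1, x2)], compute E(X1) directly and as
# E(E(X1|X2)) = sum over x2 of E(X1 | X2 = x2) * Prob(X2 = x2).
p = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}
ex1 = sum(x1 * pr for (x1, x2), pr in p.items())
ex1_iter = 0.0
for x2 in (0, 1):
    px2 = sum(pr for (a, b), pr in p.items() if b == x2)             # marginal of X2
    cond = sum(a * pr for (a, b), pr in p.items() if b == x2) / px2  # E(X1 | X2 = x2)
    ex1_iter += cond * px2
print(round(ex1, 10), round(ex1_iter, 10))  # -> 0.6 0.6
```

Both routes give the same number, as the lemma asserts.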
Lemma 14.0.4 If $X_1$ and $X_2$ are independent, $E(X_1|X_2) = E(X_1)$ (in particular constant).
Proof: If $X_1$ and $X_2$ are independent, $f_{X_1,X_2}(x_1, x_2) = f_{X_1}(x_1) f_{X_2}(x_2)$ (Proposition 2.3.1 in the Probability Theory Notes). Hence $f_{X_1|X_2} = f_{X_1}$ and so the function $g$ in Lemma 14.0.2 is
$$g(x_2) = \int_{-\infty}^{\infty} x_1 f_{X_1}(x_1)\,dx_1 = E(X_1)$$
Proposition 14.0.4 Assume $E(X|Y) = 0$; then $E(XY) = 0$.
Proof: We have $E(XY) = E(E(XY|Y))$ by the Law of iterated expectations. Since $Y$ is certainly measurable with respect to $\sigma(Y)$ we have, by ISP 2.4.2, $E(XY|Y) = Y\,E(X|Y) = 0$.
This Proposition shows that to use OLS to estimate the coefficients in the regression model it suffices to assume $E(u_t|X_t) = 0$ rather than the stronger assumption that they are independent.
We shall now make some further assumptions on the error terms. We assume that the error terms are independent, identically distributed with mean 0 and variance $\sigma^2$, i.e. for any two indices $t_1 \neq t_2$ the random variables $u_{t_1}, u_{t_2}$ are independent and have the same distribution. We write this condition $u_t \sim \mathrm{IID}(0, \sigma^2)$. Sometimes we will make even more restrictive assumptions, such as specifying the actual distribution, e.g. that the $u_t$'s are all normally distributed, in which case we write $u_t \sim \mathrm{NID}(0, \sigma^2)$.
Once a model has been suggested and fitted it is a good idea to plot the residuals to see if they are in fact IID.
Example 14.3 Using the data set Data1, which can be downloaded from the course page, we plot the data.
The plot seems to indicate a linear model $y_t = \beta_0 + \beta_1 X_t + u_t$. Estimating the coefficients by OLS we get
$$\begin{pmatrix}\widehat{\beta}_0\\\widehat{\beta}_1\end{pmatrix} = 10^3\begin{pmatrix}6.1497\\.0007\end{pmatrix}$$
and the plot of the fitted line seems to fit the data extremely well.
But now we plot the residuals against the independent variable (Figure 16).
The plot shows that the residuals are not random and the linear model is incorrect. The plot suggests a quadratic model, so we try to fit a model of the form
$$y_t = \beta_0 + \beta_1 X_t + \beta_2 X_t^2$$
The OLS estimates for the coefficients are
$$\begin{pmatrix}\widehat{\beta}_0\\\widehat{\beta}_1\\\widehat{\beta}_2\end{pmatrix} = 10^3\begin{pmatrix}.6736\\.0007\\.0000\end{pmatrix}$$
(the machine only displays 4 decimal points but internally stores 16 decimal points).
Now the plot of the residuals shows randomness (Figure 17).
The following Q-Q plot and histogram show that the residuals for the quadratic model are approximately normally distributed (Figures 18 and 19).

Figure 14: Scatter plot of Data1

Figure 15: OLS line

Figure 16: Residuals for the Linear model

Figure 17: Residuals for the Quadratic model

Figure 18: Q-Q plot of the residuals for the Quadratic model

Figure 19: Histogram of the residuals for the Quadratic model
The estimated variance of the residuals is $\widehat{\sigma}^2 = 3.9939 \times 10^{8}$ and so the model
$$y_t = 10^3(.6736 + .0007X_t + (.0000)X_t^2) + u_t$$
with $u_t \sim \mathrm{NID}(0, 3.9939 \times 10^{8})$ is reasonable.
This example again underscores the benefit of doing lots of graphical analysis. Just plotting the data seems to indicate a linear model, but the graph of the residuals shows the problems with this model.
Using OLS we can estimate coefficients of models with several independent variables (multiple regression models). Thus we can look at models of the form
$$y_t = \beta_0 + \beta_1 X_{1,t} + \beta_2 X_{2,t} + \cdots + \beta_k X_{k,t} + u_t$$
The least squares estimates of the coefficients are obtained by forming the $n \times (k+1)$ matrix
$$X = \begin{pmatrix} 1 & x_{1,t_1} & x_{2,t_1} & \ldots & x_{k,t_1} \\ 1 & x_{1,t_2} & x_{2,t_2} & \ldots & x_{k,t_2} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{1,t_n} & x_{2,t_n} & \ldots & x_{k,t_n} \end{pmatrix}$$
and the estimators are given by
$$\begin{pmatrix} \widehat{\beta}_0 \\ \widehat{\beta}_1 \\ \vdots \\ \widehat{\beta}_k \end{pmatrix} = ({}^tX \cdot X)^{-1} \cdot {}^tX \cdot \begin{pmatrix} y_{t_1} \\ y_{t_2} \\ \vdots \\ y_{t_n} \end{pmatrix}$$
14.1 Statistical Properties of the OLS Estimators
Definition 14.1.1 Let $\widehat{\theta}$ be an estimator for some parameter whose true value is $\theta_{\mathrm{true}}$. The bias of the estimator is defined to be $E(\widehat{\theta}) - \theta_{\mathrm{true}}$ and the estimator is said to be unbiased if the bias is 0.
Assume our data come from the model
$$y_t = X_t \cdot \beta + u_t$$
where $X_t = (1, X_{1,t}, X_{2,t}, \ldots, X_{k,t})$ and $\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}$. Our data $y_t$ and $x_t = (1, x_{1,t}, x_{2,t}, \ldots, x_{k,t})$, $t = t_1, t_2, \ldots, t_n$, are samples of the random vector. We assume the residuals $u_t \sim \mathrm{IID}(0, \sigma^2)$. The OLS estimator is as before given by
$$\widehat{\beta} = ({}^tX \cdot X)^{-1} \cdot {}^tX \cdot \begin{pmatrix} y_{t_1} \\ y_{t_2} \\ \vdots \\ y_{t_n} \end{pmatrix}.$$
Thus the bias of the OLS estimator is $E(\widehat{\beta}) - \beta$.
Now $y_t = X_t \cdot \beta + u_t$ and so
$$\widehat{\beta} = ({}^tX \cdot X)^{-1} \cdot {}^tX \cdot (X\beta + u) = \beta + ({}^tX \cdot X)^{-1} \cdot {}^tX \cdot u,$$
where $u = \begin{pmatrix} u_{t_1} \\ u_{t_2} \\ \vdots \\ u_{t_n} \end{pmatrix}$. Thus the bias of the OLS estimator is $E(({}^tX \cdot X)^{-1} \cdot {}^tX \cdot u)$.
This may not be 0 even under the condition $E(u_t|X_t) = 0$, the reason being that $({}^tX \cdot X)^{-1} \cdot {}^tX \cdot u$ involves products of the form $X_{i,t_j} X_{r,t_s} u_{t_v}$, and so we would need each $u_{t_i}$ to have $E(u_{t_i}|X_{j,t_r}) = 0$ for any combination of indices, or in shorthand $E(u_t|X_s) = 0$ for any $t, s \in I$. If this condition holds the independent variables are said to be exogenous, and in this case we have
$$E\big(({}^tX \cdot X)^{-1} \cdot {}^tX \cdot u \,\big|\, X\big) = ({}^tX \cdot X)^{-1} \cdot {}^tX \cdot E(u|X) = 0.$$
By the transitivity of conditional expectation this implies that $E(({}^tX \cdot X)^{-1} \cdot {}^tX \cdot u) = 0$, so the OLS estimator is unbiased.
Example 14.4 Consider the model of the form
$$y_t = 1.0 + 0.8y_{t-1} + u_t$$
(an auto-regressive model) and assume $u_t \sim \mathrm{NID}(0, 1)$ and $y_0 = 0$. We can simulate this model by generating vectors of N(0,1) distributed random numbers.
The following M-file named autoreg.m will generate 25 samples from the model and compute the OLS estimates of the coefficients

u=randn(25,1);
y=zeros(25,1);
for i=1:24
    y(i+1)=1+.8*y(i)+u(i+1);
end
X=[ones(24,1) y(1:24)];       % regress y_t on (1, y_{t-1})
beta=(X'*X)^(-1)*X'*y(2:25);
If we run it a couple of times we get the following estimates
$$\widehat{\beta} = \begin{pmatrix}1.0786\\0.8481\end{pmatrix}, \begin{pmatrix}1.8455\\0.6490\end{pmatrix}, \begin{pmatrix}1.2757\\0.7330\end{pmatrix}$$
The following MATLAB code will run this 10,000 times and compute the sample mean of the estimates:

b=zeros(2,10000);
for j=1:10000
    autoreg
    b(:,j)=beta;
end
betahat=sum(b,2)/10000

This gives us the estimate $\widehat{E}(\widehat{\beta}) = \begin{pmatrix}1.3328\\0.7157\end{pmatrix}$, which indicates that the OLS estimator has non-zero bias.
If, instead of generating 25 samples, we generate 10000 samples and average over 25 runs, so we generate the same amount of data, we get $\widehat{E}(\widehat{\beta}) = \begin{pmatrix}0.9912\\0.8015\end{pmatrix}$, which is a much better estimate of the actual values.
This example shows that the OLS estimator is biased but the bias becomes smaller as we increase the sample size.
This phenomenon also occurs in estimating sample variance: consider independent samples $x_i$, $i = 1, 2, \ldots, n$ of a random variable $X$ with mean $\mu$ and variance $\sigma^2$. Let $\widehat{\mu} = \frac{1}{n}\sum_i x_i$ be the sample mean. The usual estimator for the sample variance, $s^2 = \frac{1}{n}\sum_i (x_i - \widehat{\mu})^2$, is biased.
To see this, remark that the sample mean is an unbiased estimator of the mean because $E(\widehat{\mu}) = \frac{1}{n}\sum_i E(X_i) = \frac{1}{n}\sum_i \mu = \mu$.
We have
$$s^2 = \frac{1}{n}\sum_i (x_i - \widehat{\mu})^2 = \frac{1}{n}\sum_i \big((x_i - \mu) - (\widehat{\mu} - \mu)\big)^2 = \frac{1}{n}\Big(\sum_i (x_i - \mu)^2 + n(\widehat{\mu} - \mu)^2 - 2(\widehat{\mu} - \mu)\sum_i (x_i - \mu)\Big).$$
Now $\sum_i (x_i - \mu) = n(\widehat{\mu} - \mu)$, so we get
$$s^2 = \frac{1}{n}\Big(\sum_i (x_i - \mu)^2 - n(\widehat{\mu} - \mu)^2\Big).$$
Taking expectations we get
$$E(s^2) = \frac{1}{n}\Big(\sum_i E\big((x_i - \mu)^2\big) - nE\big((\widehat{\mu} - \mu)^2\big)\Big) = \sigma^2 - E\big((\widehat{\mu} - \mu)^2\big) = \sigma^2 - \mathrm{Var}(\widehat{\mu}).$$
We can compute $\mathrm{Var}(\widehat{\mu})$ as follows:
$$\mathrm{Var}(\widehat{\mu}) = E\Big(\Big(\frac{1}{n}\sum_i x_i - \mu\Big)^2\Big) = \frac{1}{n^2}E\Big(\sum_{i,j}(x_i - \mu)(x_j - \mu)\Big) = \frac{1}{n^2}\sum_{i,j}E\big((x_i - \mu)(x_j - \mu)\big)$$
Since $X_i$ and $X_j$ are independent for $i \neq j$ we get $E((X_i - \mu)(X_j - \mu)) = E(X_i - \mu)E(X_j - \mu) = 0$ for $i \neq j$, and $E((X_i - \mu)^2) = \sigma^2$. Hence
$$\mathrm{Var}(\widehat{\mu}) = \frac{1}{n^2}\,n\sigma^2 = \frac{\sigma^2}{n}.$$
Thus $E(s^2) = \frac{n-1}{n}\sigma^2$. Again we see that as the sample size gets larger (i.e. $n \to \infty$) the bias tends to 0. We also see that
$$\frac{n}{n-1}s^2 = \frac{1}{n-1}\sum_i (X_i - \widehat{\mu})^2$$
is an unbiased estimator.
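The $(n-1)/n$ factor can also be seen in a small Monte Carlo experiment. The following Python sketch (with arbitrary choices $n = 5$ and $\sigma^2 = 4$) averages both estimators over many samples:

```python
import random

# Compare the 1/n and 1/(n-1) variance estimators over many samples of
# size n from N(0, sigma^2): the first should average (n-1)/n * sigma^2.
random.seed(0)
n, reps = 5, 50000
acc_biased = acc_unbiased = 0.0
for _ in range(reps):
    xs = [random.gauss(0.0, 2.0) for _ in range(n)]   # sigma = 2, sigma^2 = 4
    m = sum(xs) / n
    ss = sum((v - m) ** 2 for v in xs)
    acc_biased += ss / n
    acc_unbiased += ss / (n - 1)
print(round(acc_biased / reps, 2), round(acc_unbiased / reps, 2))
```

The first average comes out near $(4/5)\cdot 4 = 3.2$ and the second near $4$.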
Definition 14.1.2 An estimator $\widehat{\theta}$ is said to be consistent if $\widehat{\theta}$ tends to the true value as the sample size tends to $\infty$.
The OLS estimator in the previous example and the $s^2$ estimator are examples of consistent estimators.
To show that the OLS estimator is in fact consistent, recall that we have $\widehat{\beta} = \beta + ({}^tX \cdot X)^{-1} \cdot {}^tX \cdot u$. Thus we need the last term to approach 0 when the sample size increases. Remark that the matrix $({}^tX \cdot X)$ is always a $(k+1) \times (k+1)$ matrix regardless of the sample size. In fact the $i,j$th entry in this matrix is $\sum_r X_{i,t_r} X_{j,t_r}$; the $1,1$ entry is $n$. We divide the entries by $n$ and notice that
$$({}^tX \cdot X)^{-1} \cdot {}^tX \cdot u = \Big(\frac{1}{n}\,{}^tX \cdot X\Big)^{-1} \cdot \frac{1}{n}\,{}^tX \cdot u.$$
In many cases the matrices $\frac{1}{n}\,{}^tX \cdot X$ will converge to a fixed matrix $S$ as the sample size increases (i.e. $n \to \infty$). In our example above we get the values in Table 1.
We make the (weak) assumption that $E(u_t|X_t) = 0$. This is a much weaker assumption than assuming that the independent variables are exogenous. Then we also have $E(X_{i,t_j} u_{t_j} | X_{i,t_j}) = 0$ and hence by transitivity $E(X_{i,t_j} u_{t_j}) = 0$. This shows that $E({}^tX \cdot u) = 0$. Thus $\frac{1}{n}\,{}^tX \cdot u \to 0$ and hence
$$\lim_{n\to\infty} \Big(\frac{1}{n}\,{}^tX \cdot X\Big)^{-1} \cdot \frac{1}{n}\,{}^tX \cdot u = S^{-1} \cdot 0 = 0$$
and so $\widehat{\beta} \to \beta$, i.e. under these assumptions the estimator is consistent.
Table 1:

sample size    S
100            (1, 5.5687; 5.5687, 37.0914)
1000           (1, 5.2986; 5.2986, 30.8182)
10000          (1, 5.0040; 5.0040, 27.9696)
100000         (1, 5.0161; 5.0161, 27.9281)
1000000        (1, 5.0048; 5.0048, 27.8345)
10000000       (1, 4.9991; 4.9991, 27.7744)

(each entry of S is a 2x2 matrix, written with its rows separated by a semicolon)
Bias and consistency are completely different properties; as we have seen, an estimator can be biased and still be consistent.
Example 14.5 Consider the following three estimators for the expectation $\mu$:
1. $\widehat{\mu}_1 = \frac{1}{n+1}\sum_{i=1,2,\ldots,n} y_i$
2. $\widehat{\mu}_2 = \frac{1.01}{n}\sum_{i=1,2,\ldots,n} y_i$
3. $\widehat{\mu}_3 = 0.01 y_1 + \frac{0.99}{n-1}\sum_{i=2,\ldots,n} y_i$
Since $\widehat{\mu}_1 = \frac{n}{n+1}\widehat{\mu}$, we have $E(\widehat{\mu}_1) = \frac{n}{n+1}E(\widehat{\mu}) = \frac{n}{n+1}\mu$, so this estimator is biased, but since $\frac{n}{n+1} \to 1$ it is consistent.
In the second case we have $\widehat{\mu}_2 = 1.01\widehat{\mu}$, so $E(\widehat{\mu}_2) = 1.01\mu$. This estimator is both biased and inconsistent.
In the third case we have $E(\widehat{\mu}_3) = .01E(y_1) + .99\mu = \mu$, so the estimator is unbiased. But as $n \to \infty$ the limit is $0.01y_1 + 0.99\mu$, which is stochastic, and so the limit is not the non-stochastic quantity $\mu$, i.e. the estimator is inconsistent.
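The three cases can be illustrated numerically. A Python sketch (with arbitrary choices $\mu = 2$ and $n = 50$) averaging each estimator over many samples:

```python
import random

# Average each of the three estimators of mu over many samples:
# mu1 averages n/(n+1)*mu (biased), mu2 averages 1.01*mu (biased),
# mu3 averages mu (unbiased).
random.seed(1)
mu, n, reps = 2.0, 50, 20000
e1 = e2 = e3 = 0.0
for _ in range(reps):
    ys = [random.gauss(mu, 1.0) for _ in range(n)]
    e1 += sum(ys) / (n + 1)
    e2 += 1.01 * sum(ys) / n
    e3 += 0.01 * ys[0] + 0.99 * sum(ys[1:]) / (n - 1)
print(round(e1 / reps, 2), round(e2 / reps, 2), round(e3 / reps, 2))
```

The averages come out near $\frac{50}{51}\cdot 2 \approx 1.96$, $1.01\cdot 2 = 2.02$ and $2$ respectively; consistency, by contrast, concerns the behaviour as $n \to \infty$, which averages over repeated samples of fixed size do not show.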
Under the assumption that the independent variables are exogenous and the error terms are $\mathrm{IID}(0, \sigma^2)$ we can compute the covariance matrix of the OLS estimators. The covariance matrix is $\mathrm{Cov}(\widehat{\beta}) = E\big((\widehat{\beta} - \beta) \cdot {}^t(\widehat{\beta} - \beta)\big)$.
Now $\widehat{\beta} - \beta = ({}^tX \cdot X)^{-1} \cdot {}^tX \cdot u$ and so
$$\mathrm{Cov}(\widehat{\beta}) = E\big(({}^tX \cdot X)^{-1} \cdot {}^tX \cdot u \cdot {}^tu \cdot X \cdot ({}^tX \cdot X)^{-1}\big) = ({}^tX \cdot X)^{-1} \cdot {}^tX \cdot E(u \cdot {}^tu) \cdot X \cdot ({}^tX \cdot X)^{-1}.$$
Under the assumption of the error terms being $\mathrm{IID}(0, \sigma^2)$ we have $E(u \cdot {}^tu) = \sigma^2 I$, where $I$ is the identity matrix. Hence we get
$$\mathrm{Cov}(\widehat{\beta}) = \sigma^2({}^tX \cdot X)^{-1}.$$
Assume now that we want to use our model to make a forecast; thus we ask for the value of the dependent variable $y$ given a vector of independent variables $x = (1, x_1, x_2, \ldots, x_k)$. Using our estimators the forecast is $\widehat{y} = \widehat{\beta}_0 + \widehat{\beta}_1 x_1 + \widehat{\beta}_2 x_2 + \cdots + \widehat{\beta}_k x_k$. It is of course very important to know the variance of the forecast, and we have
$$\mathrm{Var}(\widehat{y}) = \mathrm{Var}(\widehat{\beta}_0 + \widehat{\beta}_1 x_1 + \widehat{\beta}_2 x_2 + \cdots + \widehat{\beta}_k x_k) = x \cdot \mathrm{Cov}(\widehat{\beta}) \cdot {}^tx = \sigma^2\, x \cdot ({}^tX \cdot X)^{-1} \cdot {}^tx$$
Definition 14.1.3 An estimator $\widetilde{\beta}$ for the coefficients in a linear regression model $y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + u$ is said to be a linear estimator if it is of the form $A \cdot y$ for some matrix $A$. The OLS estimator $\widehat{\beta}$ is linear because it is $({}^tX \cdot X)^{-1} \cdot {}^tX \cdot y$.
Let $\widetilde{\beta}$ be an estimator and consider its covariance matrix $\mathrm{Cov}(\widetilde{\beta})$. The variance of any linear combination of the estimators, $w \cdot \widetilde{\beta}$ where $w$ is a row vector, is given by $w \cdot \mathrm{Cov}(\widetilde{\beta}) \cdot {}^tw$. We obviously would like to have an estimator where this variance is as small as possible. Thus if $\widehat{\beta}$ is another estimator we want $w \cdot \mathrm{Cov}(\widetilde{\beta}) \cdot {}^tw \geq w \cdot \mathrm{Cov}(\widehat{\beta}) \cdot {}^tw$ for all $w$; equivalently, the matrix $\mathrm{Cov}(\widetilde{\beta}) - \mathrm{Cov}(\widehat{\beta})$ is a positive semi-definite matrix. If this is the case we will say that $\widehat{\beta}$ is a more efficient or a better estimator than $\widetilde{\beta}$. Under certain assumptions, the OLS estimator is the best.
Theorem 14.1.1 (The Gauss-Markov Theorem) Assume that $E(u|X) = 0$ and that $E(u \cdot {}^tu) = \sigma^2 I$, in particular that the OLS estimator is unbiased. Then the OLS estimator $\widehat{\beta}$ is the best linear unbiased estimator (BLUE for short).
Proof: In order to prove this theorem we need to understand what it means: we have a set of $n$ observations of independent variables which we write as the $n \times (k+1)$ matrix $X$ as before. A linear estimator is just a linear map $A : \mathbb{R}^n \to \mathbb{R}^{k+1}$, i.e. given by a $(k+1) \times n$ matrix. If we apply $A$ to both sides of the equation $y = X\beta + u$ we get $Ay = AX\beta + Au$. Since $E(u|X) = 0$ we get $E(A \cdot u|X) = 0$ and so $E(A \cdot y|X) = E(A \cdot X\beta|X)$. The estimator is unbiased if $E(A \cdot y|X) = \beta$ for all $\beta$. Now $E(A \cdot X\beta|X) = A \cdot X\beta$ because $A \cdot X\beta$ is non-stochastic, and so the estimator is unbiased if $A \cdot X\beta = \beta$ for all $\beta \in \mathbb{R}^{k+1}$. Thus the linear estimator $A$ is unbiased if and only if $A \cdot X = I$.
The OLS estimator is given by the $(k+1) \times n$ matrix $({}^tX \cdot X)^{-1} \cdot {}^tX$. Let $C = A - ({}^tX \cdot X)^{-1} \cdot {}^tX$, so $C \cdot X = A \cdot X - ({}^tX \cdot X)^{-1} \cdot {}^tX \cdot X = I - I = 0$ and $\widetilde{\beta} - \widehat{\beta} = C \cdot y = C \cdot u$. Now consider the covariance
$$\mathrm{Cov}\big(\widetilde{\beta} - \widehat{\beta},\, \widehat{\beta}\big) = E\big((\widetilde{\beta} - \widehat{\beta}) \cdot {}^t(\widehat{\beta} - \beta)\big) = E\big(C \cdot u \cdot {}^t\big(({}^tX \cdot X)^{-1} \cdot {}^tX \cdot u\big)\big) = E\big(C \cdot u \cdot {}^tu \cdot X \cdot ({}^tX \cdot X)^{-1}\big).$$
By our assumptions $E(u \cdot {}^tu) = \sigma^2 I$, so $\mathrm{Cov}(\widetilde{\beta} - \widehat{\beta},\, \widehat{\beta}) = \sigma^2\, C \cdot X \cdot ({}^tX \cdot X)^{-1} = 0$.
Now $\mathrm{Cov}(\widetilde{\beta}) = \mathrm{Cov}\big(\widehat{\beta} + (\widetilde{\beta} - \widehat{\beta})\big) = \mathrm{Cov}(\widehat{\beta}) + \mathrm{Cov}(C \cdot u)$, where the last equality holds because $C \cdot u = \widetilde{\beta} - \widehat{\beta}$ and $\widehat{\beta}$ are uncorrelated. This shows that
$$\mathrm{Cov}(\widetilde{\beta}) - \mathrm{Cov}(\widehat{\beta}) = \mathrm{Cov}(C \cdot u) = E(C \cdot u \cdot {}^tu \cdot {}^tC) = \sigma^2(C \cdot {}^tC)$$
and, as is true for any matrix, $C \cdot {}^tC$ is positive semi-definite.
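In the simplest model $y = \beta + u$ the theorem can be seen directly: an unbiased linear estimator is $\sum_i w_i y_i$ with $\sum_i w_i = 1$, its variance is $\sigma^2\sum_i w_i^2$, and the OLS weights $w_i = 1/n$ minimize this. A Python sketch comparing the OLS weights with one arbitrary alternative:

```python
# Variances (in units of sigma^2) of two unbiased linear estimators of
# beta in y = beta + u: the OLS weights 1/n versus an alternative that
# puts weight 0.5 on the first observation.
n = 10
ols_w = [1.0 / n] * n
other_w = [0.5] + [0.5 / (n - 1)] * (n - 1)   # also sums to 1, so unbiased
var_ols = sum(w * w for w in ols_w)
var_other = sum(w * w for w in other_w)
print(round(var_ols, 3), round(var_other, 3))  # -> 0.1 0.278
```

Any choice of weights other than $1/n$ gives a strictly larger $\sum_i w_i^2$, which is the BLUE property in this special case.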
Next we shall estimate the variance of the error terms, i.e. estimate $\sigma^2$, in the model $y = X\beta + u$.
The vector of OLS residuals $\widehat{u} = y - X\widehat{\beta}$ is orthogonal to the subspace spanned by the columns of $X$. Let $P_X$ denote the projection onto this subspace and $P_X^{\perp} = I - P_X$ the projection onto its orthogonal complement, so $P_X^{\perp}$ is 0 on the column space and the identity on the orthogonal complement. We shall assume that $E(u|X) = 0$, so $\widehat{\beta}$ is unbiased, and that $u \sim \mathrm{IID}(0, \sigma^2)$.
Applying $P_X^{\perp}$ to $y = X\beta + u$ gives $P_X^{\perp} y = P_X^{\perp} u$. Thus $\widehat{u} = P_X^{\perp}(y - X\widehat{\beta}) = P_X^{\perp} y = P_X^{\perp} u$.
The matrix of $P_X$ is $X \cdot ({}^tX \cdot X)^{-1} \cdot {}^tX$, so the matrix of $P_X^{\perp}$ is $I - X \cdot ({}^tX \cdot X)^{-1} \cdot {}^tX$ and $\widehat{u} = u - X \cdot ({}^tX \cdot X)^{-1} \cdot {}^tX \cdot u$. This shows that the $t$th coordinate $\widehat{u}_t$ in general depends on the $u_s$ for all $s = 1, 2, \ldots, n$ and not just $u_t$. So even if the error terms $u_t$ are independent this will not necessarily be true for the estimators $\widehat{u}_t$.
To compute the covariance matrix of $\widehat{u}$ we note that $E(\widehat{u}) = E(u) - X \cdot ({}^tX \cdot X)^{-1} \cdot {}^tX \cdot E(u) = 0$ (this uses $E(u|X) = 0$), so the covariance matrix is
$$E(\widehat{u} \cdot {}^t\widehat{u}) = E\big(P_X^{\perp} u \cdot {}^t(P_X^{\perp} u)\big) = E\big(P_X^{\perp} \cdot u \cdot {}^tu \cdot P_X^{\perp}\big) = \sigma^2 (P_X^{\perp})^2 = \sigma^2 P_X^{\perp}.$$
If $h_t$ denotes the $t$th diagonal term in $P_X$, so that the $t$th diagonal term in $P_X^{\perp}$ is $1 - h_t$, we get $\mathrm{Var}(\widehat{u}_t) = (1 - h_t)\sigma^2$.
If we use the estimator $\widehat{\sigma}^2 = \frac{1}{n}\sum_t \widehat{u}_t^2$ we get
$$E(\widehat{\sigma}^2) = \frac{1}{n}\sum_t E(\widehat{u}_t^2) = \frac{1}{n}\sum_t \mathrm{Var}(\widehat{u}_t) = \frac{1}{n}\sum_t (1 - h_t)\sigma^2 = \frac{\sigma^2}{n}\mathrm{Tr}(P_X^{\perp}).$$
Since $P_X^{\perp}$ is the projection onto an $(n-k)$-dimensional subspace, $\mathrm{Tr}(P_X^{\perp}) = n - k$, so $E(\widehat{\sigma}^2) = \frac{n-k}{n}\sigma^2$, which shows that $\widehat{\sigma}^2$ is biased; but if we instead use $s^2 = \frac{1}{n-k}\sum_t \widehat{u}_t^2$ we get an unbiased estimator. Using this estimator in our formula $\mathrm{Cov}(\widehat{\beta}) = \sigma^2({}^tX \cdot X)^{-1}$ we get an unbiased estimator for the covariance matrix of $\widehat{\beta}$,
$$\widehat{\mathrm{Cov}}(\widehat{\beta}) = s^2({}^tX \cdot X)^{-1}.$$
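For the simple regression of Example 14.1 this estimated covariance matrix can be written out explicitly. A Python sketch computing $s^2$ and the two standard errors (the square roots of the diagonal of $s^2({}^tX \cdot X)^{-1}$):

```python
# Standard errors from Cov-hat(beta-hat) = s^2 (tX X)^(-1) for the data
# of Example 14.1, with s^2 = SSR/(n - 2) the unbiased variance estimator.
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
n = len(x)
sx, sxx = sum(x), sum(v * v for v in x)
sy, sxy = sum(y), sum(a * b for a, b in zip(x, y))
det = n * sxx - sx * sx
b0 = (sxx * sy - sx * sxy) / det
b1 = (n * sxy - sx * sy) / det
s2 = sum((yi - b0 - b1 * xi) ** 2 for yi, xi in zip(y, x)) / (n - 2)
# (tX X)^(-1) = [[sxx, -sx], [-sx, n]] / det, so the diagonal of the
# estimated covariance matrix is (s2*sxx/det, s2*n/det)
se_b0 = (s2 * sxx / det) ** 0.5
se_b1 = (s2 * n / det) ** 0.5
print(round(se_b0, 4), round(se_b1, 4))  # -> 1.1247 0.1179
```

These standard errors are what a t-test of the coefficients (Section 14.2) would divide by.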
Up till now we have assumed that our model is correctly specified, i.e. that the data are generated by
$$y = X\beta + u$$
where $X$ is the set of independent variables we incorporate in our model. But what happens if in reality there are more independent variables, so the data really are generated by a model
$$y = X\beta + Z\gamma + u$$
but we are generating our estimators from the underspecified model? This is of course a quite common phenomenon: often the dependent variable will depend on so many (known and unknown) factors that we can't possibly incorporate all of them.
Our OLS estimator for the underspecified model is
$$\widehat{\beta} = ({}^tX \cdot X)^{-1} \cdot {}^tX \cdot y = ({}^tX \cdot X)^{-1} \cdot {}^tX \cdot (X\beta + Z\gamma + u) = \beta + ({}^tX \cdot X)^{-1} \cdot {}^tX \cdot Z\gamma + ({}^tX \cdot X)^{-1} \cdot {}^tX \cdot u$$
and thus even if we assume that $E(u|X) = 0$, so that $E(({}^tX \cdot X)^{-1} \cdot {}^tX \cdot u) = 0$, the OLS estimator $\widehat{\beta}$ has bias $E(\widehat{\beta}) - \beta = E\big(({}^tX \cdot X)^{-1} \cdot {}^tX \cdot Z\gamma\big)$.
Classically the number used to measure how well the estimated model fits the data is
$$R^2 = \frac{||P_X y||^2}{||y||^2}$$
(also known as $\frac{ESS}{TSS}$, the explained sum of squares divided by the total sum of squares). Since $\widehat{\beta} = ({}^tX \cdot X)^{-1} \cdot {}^tX \cdot y$ and $P_X = X \cdot ({}^tX \cdot X)^{-1} \cdot {}^tX$ we have $P_X y = X\widehat{\beta} = y - \widehat{u}$ because $y = X\widehat{\beta} + \widehat{u}$. Now $\widehat{u}$ is orthogonal to the column space of $X$, so $X\widehat{\beta}$ and $\widehat{u}$ are orthogonal and consequently $||y||^2 = ||X\widehat{\beta}||^2 + ||\widehat{u}||^2 = ||P_X y||^2 + ||\widehat{u}||^2$. Thus we can write
$$R^2 = \frac{||y||^2 - ||\widehat{u}||^2}{||y||^2} = 1 - \frac{||\widehat{u}||^2}{||y||^2} = 1 - \frac{SSR}{TSS}$$
(SSR = sum of squared residuals).
We have $(P_X y, y) = ||P_X y||\,||y||\cos\theta$ where $\theta$ is the angle between the two vectors $P_X y = X\widehat{\beta}$ and $y$. Thus $\cos^2\theta = \dfrac{(P_X y, y)^2}{||P_X y||^2\,||y||^2}$. But $(P_X y, y) = (P_X y, P_X y + \widehat{u}) = (P_X y, P_X y)$ because $P_X y$ and $\widehat{u}$ are orthogonal. It follows that $(P_X y, y) = ||P_X y||^2$, hence we get
$$\cos^2\theta = \frac{||P_X y||^4}{||P_X y||^2\,||y||^2} = \frac{||P_X y||^2}{||y||^2} = R^2.$$
If $R^2 = 1$, $\theta = 0$ and so $P_X y = X\widehat{\beta} = y$, i.e. $\widehat{u} = 0$ and we have a perfect fit. The closer $R^2$ gets to 0 the worse the fit. When $R^2 = 0$, $\theta = \frac{\pi}{2}$ and so $X\widehat{\beta} = 0$ and $X$ has no explanatory power vis-a-vis $y$.
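For the data of Example 14.1 this $R^2$ can be computed directly from $R^2 = 1 - ||\widehat{u}||^2/||y||^2$. A Python sketch (note that with the definition above, the total sum of squares is the uncentered $||y||^2$, since the constant column is part of $X$):

```python
# R^2 = 1 - ||u-hat||^2 / ||y||^2 for the Example 14.1 regression,
# using the closed-form OLS coefficients to form the residuals.
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
n = len(x)
sx, sxx = sum(x), sum(v * v for v in x)
sy, sxy = sum(y), sum(a * b for a, b in zip(x, y))
det = n * sxx - sx * sx
b0 = (sxx * sy - sx * sxy) / det
b1 = (n * sxy - sx * sy) / det
ssr = sum((yi - b0 - b1 * xi) ** 2 for yi, xi in zip(y, x))
r2 = 1 - ssr / sum(v * v for v in y)
print(round(r2, 4))  # -> 0.9792
```

(Many texts instead center $y$ before forming TSS; that convention gives a smaller $R^2$ for the same fit.)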

14.2 Hypothesis Testing in Regression Models
We might have some underlying theoretical model that specifies the values of some or all of the coefficients in a linear regression model, or maybe that two or more coefficients are equal or are related in some way. How does one determine from the data whether such a hypothesis is reasonable, or more precisely: when do the data not contradict the hypothesis?
Consider the simplest possible model
$$y = \beta + u$$
where $u \sim \mathrm{IID}(0, \sigma^2)$. Here $\beta$ is the expectation, $E(y)$. Suppose we have a vector of samples $y$ and let $\widehat{\beta} = \frac{1}{n}\sum_t y_t$ be the sample mean. The $y_t$'s are independent and have $\mathrm{Var}(y_t) = \sigma^2$, so $\mathrm{Var}(\widehat{\beta}) = \frac{\sigma^2}{n}$.
Suppose we hypothesize that the true value for $\beta$ is some number $\beta^*$. How would we decide how likely this assumption is from our data? Without further information there is very little we can do, so let's make the (unrealistic) assumption that we know the distribution of $u$, and assume that $u \sim \mathrm{NID}(0, \sigma^2)$, i.e. that we know the true value of the variance (this of course will virtually never be the case). Under our hypothesis, the test statistic
$$z = \frac{\widehat{\beta} - \beta^*}{\sigma/\sqrt{n}} \sim N(0, 1).$$
For a random quantity $x \sim N(0, 1)$, the probability that $a \leq x \leq b$ is equal to $\int_a^b \phi(t)\,dt$ where $\phi$ is the Probability Density Function (PDF) of the standard normal distribution, i.e. $\phi(t) = \frac{1}{\sqrt{2\pi}}\exp(-\frac{t^2}{2})$. Looking at Figure 20 we see that there is a 95% probability that $-1.96 \leq x \leq 1.96$ and hence a 5% probability that $x$ falls outside this interval. This is called the rejection level of the test and we can call the 95% the non-rejection level. If we set our level to 5% we would then only reject our hypothesis if the test statistic $z$ falls outside the confidence interval $[-1.96, 1.96]$. Thus the test is not so much a test to confirm the hypothesis but rather of whether the data contradict the hypothesis.
Figure 20: The standard normal PDF

Consider the following MATLAB code

u=randn(1,1000);
y=4+.5*u;
betahat=sum(y)/1000;
betastar=2;
z=(betastar-betahat)/(.5/sqrt(1000))
z = -125.1287
betastar=3;
z=(betastar-betahat)/(.5/sqrt(1000))
z = -61.8832
betastar=4.1;
z=(betastar-betahat)/(.5/sqrt(1000))
z = 7.6869
betastar=4.02;
z=(betastar-betahat)/(.5/sqrt(1000))
z = 2.6273
betastar=4;
z=(betastar-betahat)/(.5/sqrt(1000))
z = 1.3624

(The sign convention does not matter for the two-sided test; only $|z|$ is compared to the rejection threshold.)
This shows that the hypotheses $\beta^* = 2, 3, 4.1, 4.02$ are all rejected at the 5% level, while we cannot reject the hypothesis $\beta^* = 4$.
The hypothesis we are testing, i.e. $u_t \sim N(0, \sigma^2)$ and $\beta = \beta^*$, is known as the null-hypothesis $H_0$. Under the null-hypothesis $z \sim N(0, 1)$. If we reject the null-hypothesis it may be either because the hypothesis about the distribution or the hypothesis about the actual value is false.
The error of rejecting a null-hypothesis that is actually true is called a type I error. Thus the probability of rejecting a true null-hypothesis is the level, i.e. 5% in this case. If we make the non-rejection level larger, say taking a 99% non-rejection level, i.e. a 1% rejection level, then we lower the probability of making a type I error; on the other hand the test will also fail to reject more wrong values of $\beta^*$. For instance the 99% confidence interval is $[-2.5758, 2.5758]$. If we compute the test statistic for $\beta^* = 4.06$ we get $z = 2.3546$; thus this is not rejected at the 1% level but is rejected at the 5% level.
The P-value of the test statistic is the rejection level at which we would reject the null-hypothesis. For instance, in our example before, the P-value of the test statistic $z = 1.3624$ is 1 minus the area under the PDF over $[-1.3624, 1.3624]$. Thus the P-value in this example equals $\mathrm{Prob}(z < -1.3624) + \mathrm{Prob}(z > 1.3624) = 0.1853$, and thus we would reject the null-hypothesis at a rejection level of 18.53% and higher. In general, if the level of the test is $\alpha$, we reject at level $\alpha$ if the P-value of the test statistic is less than $\alpha$. For instance the P-value for the test statistic $z$ with $\beta^* = 4.01$ is 0.0461, i.e. 4.61%, and hence the null-hypothesis $\beta^* = 4.01$ is rejected at the 5% level but not at the 1% level.
Most test statistics encountered in regression analysis follow one of four distributions: the standard normal distribution, the $\chi^2$-distribution, the Student t-distribution or the F-distribution.
Let $\phi(t) = \frac{1}{\sqrt{2\pi}}\exp(-\frac{t^2}{2})$; then the PDF of the normal distribution with mean $\beta$ and variance $\sigma^2$ is $\frac{1}{\sigma}\phi(\frac{t-\beta}{\sigma})$. Thus if $Z \sim N(\beta, \sigma^2)$ the cumulative distribution function is
$$\Phi(x) = \mathrm{Prob}(Z < x) = \int_{-\infty}^{x} \frac{1}{\sigma}\phi\Big(\frac{t - \beta}{\sigma}\Big)\,dt$$
This function does not have a simple closed form expression.
Figure 21: P-Value

Figure 22: The Cumulative Distribution Function (CDF) of the standard normal
If $Z \sim N(\beta, \sigma^2)$ then $\frac{Z - \beta}{\sigma} \sim N(0, 1)$. An important property of the normal distribution is that any linear combination of independent normal random variables is again normal.
Theorem 14.2.1 Let $Z_i \sim N(\beta_i, \sigma_i^2)$, $i = 1, 2, \ldots, n$, be independent random variables. Then the random variable
$$W = a_1 Z_1 + a_2 Z_2 + \cdots + a_n Z_n \sim N\big(a_1\beta_1 + a_2\beta_2 + \cdots + a_n\beta_n,\; a_1^2\sigma_1^2 + a_2^2\sigma_2^2 + \cdots + a_n^2\sigma_n^2\big)$$
Proof: We know $X_i = \frac{Z_i - \beta_i}{\sigma_i} \sim N(0, 1)$ and
$$W = a_1(\sigma_1 X_1 + \beta_1) + a_2(\sigma_2 X_2 + \beta_2) + \cdots + a_n(\sigma_n X_n + \beta_n) = (a_1\sigma_1 X_1 + \cdots + a_n\sigma_n X_n) + (a_1\beta_1 + a_2\beta_2 + \cdots + a_n\beta_n)$$
Hence it suffices to show that
$$a_1\sigma_1 X_1 + \cdots + a_n\sigma_n X_n \sim N(0, a_1^2\sigma_1^2 + a_2^2\sigma_2^2 + \cdots + a_n^2\sigma_n^2)$$
and so we may assume that $Z_i \sim N(0, 1)$ for all $i$.
We first assume that $n = 2$ and $a_1^2 + a_2^2 = 1$. We shall compute the joint density $f_{W,Z_1} = f_{W|Z_1} f_{Z_1}$. Now the conditional expectation is
$$E(W|Z_1) = E(a_1 Z_1 + a_2 Z_2 | Z_1) = a_1 Z_1 + a_2 E(Z_2|Z_1) = a_1 Z_1 + a_2 E(Z_2) = a_1 Z_1$$
and the conditional variance is
$$E\big((W - E(W|Z_1))^2 \,\big|\, Z_1\big) = E\big((W - a_1 Z_1)^2 \,\big|\, Z_1\big) = E(a_2^2 Z_2^2 | Z_1) = E(a_2^2 Z_2^2) = a_2^2.$$
Conditionally on $Z_1$, $W$ is the sum of the $N(0, a_2^2)$ random variable $a_2 Z_2$ and a constant $a_1 z_1$ and so is $N(a_1 z_1, a_2^2)$; thus the conditional density is $f_{W|Z_1}(w|z_1) = \frac{1}{a_2}\phi\big(\frac{w - a_1 z_1}{a_2}\big)$. This shows
that the joint density is
$$\begin{aligned}
f_{W,Z_1}(w, z_1) &= \frac{1}{a_2}\phi\Big(\frac{w - a_1 z_1}{a_2}\Big)\phi(z_1) \\
&= \frac{1}{a_2\sqrt{2\pi}}\exp\Big(-\frac{(w - a_1 z_1)^2}{2a_2^2}\Big)\,\frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{z_1^2}{2}\Big) \\
&= \frac{1}{a_2\,2\pi}\exp\Big(-\frac{w^2 + a_1^2 z_1^2 - 2a_1 z_1 w + a_2^2 z_1^2}{2a_2^2}\Big) \\
&= \frac{1}{a_2\,2\pi}\exp\Big(-\frac{w^2 - 2a_1 z_1 w + z_1^2}{2a_2^2}\Big) \\
&= \frac{1}{a_2\,2\pi}\exp\Big(-\frac{a_2^2 w^2 + (a_1 w - z_1)^2}{2a_2^2}\Big) \\
&= \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{w^2}{2}\Big)\,\frac{1}{a_2\sqrt{2\pi}}\exp\Big(-\frac{(a_1 w - z_1)^2}{2a_2^2}\Big)
\end{aligned}$$
where we used $a_1^2 + a_2^2 = 1$ twice: to combine $a_1^2 z_1^2 + a_2^2 z_1^2 = z_1^2$ and to split $w^2 = a_2^2 w^2 + a_1^2 w^2$.
The last factor, as a function of $z_1$, is the PDF of a $N(a_1 w, a_2^2)$ distribution. To find the density function $f_W$ we compute the marginal density of the joint density function by integrating out $z_1$:
$$f_W(w) = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{w^2}{2}\Big)\frac{1}{a_2\sqrt{2\pi}}\exp\Big(-\frac{(z_1 - a_1 w)^2}{2a_2^2}\Big)\,dz_1 = \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{w^2}{2}\Big)\int_{-\infty}^{\infty}\frac{1}{a_2\sqrt{2\pi}}\exp\Big(-\frac{(z_1 - a_1 w)^2}{2a_2^2}\Big)\,dz_1$$
The integral is 1 because we are integrating a PDF from $-\infty$ to $\infty$. This shows that $W \sim N(0, 1)$.
In the general case where $a_1^2 + a_2^2 = r^2 \neq 1$ we consider $\frac{1}{r}W$. By the previous argument $\frac{1}{r}W \sim N(0, 1)$ and so $W \sim N(0, r^2)$. This proves the result for a linear combination of two independent random variables. The general result follows by induction.
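An empirical sketch of the $n = 2$ case (with arbitrary weights $a_1 = 0.6$, $a_2 = 0.8$, so the variance should be $0.36 + 0.64 = 1$):

```python
import random

# Simulate W = a1*Z1 + a2*Z2 for independent standard normals and check
# that the sample mean is ~ 0 and the sample variance ~ a1^2 + a2^2.
random.seed(3)
a1, a2, reps = 0.6, 0.8, 100000
w = [a1 * random.gauss(0, 1) + a2 * random.gauss(0, 1) for _ in range(reps)]
m = sum(w) / reps
v = sum((t - m) ** 2 for t in w) / reps
print(round(m, 2), round(v, 2))
```

Matching the first two moments does not by itself prove normality; that is what the conditional-density argument above establishes.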
Consider a vector $Z = \begin{pmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_n \end{pmatrix}$ of independent standard normal variables. Let $A$ be a non-singular matrix and consider the vector
$$X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix} = A\begin{pmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_n \end{pmatrix}.$$
By the previous Theorem each of the components $X_i \sim N(0, a_{i1}^2 + a_{i2}^2 + \cdots + a_{in}^2)$. The covariance matrix is
$$\Sigma = \mathrm{Cov}(X) = E(X \cdot {}^tX) = E(A \cdot Z \cdot {}^tZ \cdot {}^tA) = A \cdot E(Z \cdot {}^tZ) \cdot {}^tA = A \cdot {}^tA.$$
Then the joint density function is
$$\phi_X(x) = \frac{1}{(\sqrt{2\pi})^n\sqrt{\det\Sigma}}\exp\Big(-\frac{1}{2}\,{}^tx\,\Sigma^{-1}\,x\Big).$$
If $\beta = (\beta_1, \beta_2, \ldots, \beta_n)$ is a vector of constants then $X + {}^t\beta$ has joint density function
$$\phi_X(x) = \frac{1}{(\sqrt{2\pi})^n\sqrt{\det\Sigma}}\exp\Big(-\frac{1}{2}\,{}^t(x - \beta)\,\Sigma^{-1}\,(x - \beta)\Big).$$
In particular, if $\Sigma$ is a diagonal matrix with diagonal entries $\sigma_1^2, \ldots, \sigma_n^2$, the joint density function
$$\phi_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma_1}\exp\Big(-\frac{(x_1 - \beta_1)^2}{2\sigma_1^2}\Big)\cdots\frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\Big(-\frac{(x_n - \beta_n)^2}{2\sigma_n^2}\Big)$$
becomes just the product of the densities of the components of the vector of random variables, and hence the components are independent. This is one of the very special properties of normally distributed random variables: they are independent if and only if their covariances are 0. This is not true for random variables which are not normally distributed.
This discussion suggests how we would construct a vector of normally distributed random variables with vector of means $\beta$ and covariance matrix $\Sigma$: use the Cholesky decomposition to write $\Sigma = {}^tR \cdot R$ where $R$ is an upper-triangular matrix. Then $X = {}^tR \cdot Z + {}^t\beta$ will be such a vector.
The random variable $||Z||^2 = Z_1^2 + Z_2^2 + \cdots + Z_n^2$ follows the $\chi^2$ distribution with $n$ degrees of freedom. We write $||Z||^2 \sim \chi^2(n)$. The mean and variance of the $\chi^2(n)$ distribution can be computed from the definition:
$$\text{mean} = E(Z_1^2) + E(Z_2^2) + \cdots + E(Z_n^2) = \mathrm{Var}(Z_1) + \mathrm{Var}(Z_2) + \cdots + \mathrm{Var}(Z_n) = n.$$
$$\text{var} = E\big((||Z||^2 - n)^2\big) = E\Big(\big((Z_1^2 - 1) + (Z_2^2 - 1) + \cdots + (Z_n^2 - 1)\big)^2\Big).$$
Since the random variables $(Z_i^2 - 1)$ and $(Z_j^2 - 1)$ are independent for $i \neq j$ and so have covariance 0, we get the variance $= nE((Z^2 - 1)^2)$, where $E((Z^2 - 1)^2) = E(Z^4 + 1 - 2Z^2) = E(Z^4) + 1 - 2$. Thus we need to compute $E(Z^4)$. By definition
$$E(Z^4) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} t^4 \exp\Big(-\frac{t^2}{2}\Big)\,dt$$
Remark that
$$\phi'(t) = -t\phi(t)$$
$$\phi''(t) = -\phi(t) + t^2\phi(t)$$
$$\phi'''(t) = (3t - t^3)\phi(t)$$
$$\phi^{(4)}(t) = (3 - 3t^2)\phi(t) - (3t^2 - t^4)\phi(t) = (3 - 6t^2 + t^4)\phi(t)$$
Thus
$$t^4\phi(t) = \phi^{(4)}(t) + 6t^2\phi(t) - 3\phi(t)$$
and we get
$$\int_{-\infty}^{\infty} t^4\phi(t)\,dt = \phi'''(t)\Big|_{-\infty}^{\infty} + 6\int_{-\infty}^{\infty} t^2\phi(t)\,dt - 3\int_{-\infty}^{\infty}\phi(t)\,dt = 6\,\mathrm{Var}(Z) - 3 = 3$$
Hence $E(Z^4) = 3$ and $E((Z^2 - 1)^2) = 3 + 1 - 2 = 2$.
This shows that $\mathrm{Var}(||Z||^2) = E((Z_1^2 - 1)^2) + E((Z_2^2 - 1)^2) + \cdots + E((Z_n^2 - 1)^2) = 2 + 2 + \cdots + 2 = 2n$. Thus the $\chi^2(n)$ distribution has mean $= n$ and variance $= 2n$.
It is clear that if $Y_1 \sim \chi^2(n_1)$ and $Y_2 \sim \chi^2(n_2)$ then $Y_1 + Y_2 \sim \chi^2(n_1 + n_2)$.
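A quick Monte Carlo sketch of these two moments (with the arbitrary choice $n = 6$, so mean $\approx 6$ and variance $\approx 12$):

```python
import random

# Simulate ||Z||^2 = Z_1^2 + ... + Z_n^2 for Z_i ~ N(0, 1) and check the
# chi^2(n) moments: sample mean ~ n and sample variance ~ 2n.
random.seed(2)
n, reps = 6, 50000
vals = [sum(random.gauss(0, 1) ** 2 for _ in range(n)) for _ in range(reps)]
m = sum(vals) / reps
v = sum((t - m) ** 2 for t in vals) / reps
print(round(m, 1), round(v, 1))
```

The simulated mean and variance land near $n = 6$ and $2n = 12$.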
Theorem 14.2.2 Assume the vector $X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix}$ has the multi-variate normal distribution $N(0, \Sigma)$, where $\Sigma$ is the covariance matrix of $X$, i.e. $\Sigma = E(X \cdot {}^tX)$. Then the random variable ${}^tX \cdot \Sigma^{-1} \cdot X \sim \chi^2(n)$.
Let $P$ be an orthogonal projection matrix of rank $r$ and let $\begin{pmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_n \end{pmatrix} \sim N(0, I)$; then ${}^tZ \cdot P \cdot Z \sim \chi^2(r)$.
Figure 23: $\chi^2$ Density functions
Proof: Let $\Sigma = R \cdot {}^tR$ be the Cholesky decomposition. Then $R^{-1}X$ is normal with mean 0 and covariance matrix
$$E(R^{-1} \cdot X \cdot {}^tX \cdot {}^t(R^{-1})) = R^{-1} \cdot E(X \cdot {}^tX) \cdot {}^t(R^{-1}) = R^{-1}\,\Sigma\,{}^t(R^{-1}) = R^{-1} \cdot R \cdot {}^tR \cdot {}^t(R^{-1}) = I.$$
Thus $R^{-1}X \sim N(0, I)$. It then follows that $||R^{-1}X||^2 \sim \chi^2(n)$, but
$$||R^{-1}X||^2 = {}^tX \cdot {}^t(R^{-1}) \cdot R^{-1} \cdot X = {}^tX \cdot \Sigma^{-1} \cdot X.$$
This proves the first part of the theorem.
To prove the second part consider the subspace $V = \{v \mid Pv = v\}$. Then $V = \mathrm{Im}\,P$. Indeed, if $v \in V$, $Pv = v$ so $v \in \mathrm{Im}\,P$; conversely, if $w \in \mathrm{Im}\,P$, $w = Pu$ for some $u \in \mathbb{R}^n$. Since $P$ is a projection, $P^2 = P$, so $Pw = P(Pu) = P^2 u = Pu = w$. Thus $w \in V$. This shows in particular that $\dim V = r$.
Let $W = V^{\perp}$ and let $q_1, q_2, \ldots, q_r$ be an orthonormal basis of $V$ and $q_{r+1}, \ldots, q_n$ an orthonormal basis of $W$; then the combined set $q_1, \ldots, q_r, q_{r+1}, \ldots, q_n$ is an orthonormal basis of $\mathbb{R}^n$. The matrix of $P$ with respect to this basis is
$$M = \begin{pmatrix} I_r & 0 \\ 0 & 0 \end{pmatrix}$$
(1 in the first $r$ diagonal entries and 0 everywhere else), and if $Q$ is the coordinate transformation matrix between this basis and the standard basis then the matrix of $P$ is ${}^tQ \cdot M \cdot Q$. Now consider $QZ$; its covariance matrix is $E(Q \cdot Z \cdot {}^tZ \cdot {}^tQ) = Q \cdot E(Z \cdot {}^tZ) \cdot {}^tQ = Q \cdot {}^tQ = I$. Hence $Y = QZ \sim N(0, I)$. Hence
$${}^tZ \cdot P \cdot Z = {}^tY \cdot M \cdot Y = Y_1^2 + Y_2^2 + \cdots + Y_r^2 \sim \chi^2(r).$$

Figure 24: t Density functions
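The second part can be sketched numerically with the simplest non-trivial projection, $P = I - \frac{1}{n}J$ (the projection orthogonal to ${}^t(1, \ldots, 1)$, of rank $n-1$), for which ${}^tZ \cdot P \cdot Z = \sum_t (Z_t - \bar{Z})^2$:

```python
import random

# For P the orthogonal projection onto the complement of (1,...,1),
# tZ P Z equals the centered sum of squares, which should behave like
# chi^2(n-1): check that its sample mean is ~ n-1.
random.seed(4)
n, reps = 5, 50000
acc = 0.0
for _ in range(reps):
    z = [random.gauss(0, 1) for _ in range(n)]
    zbar = sum(z) / n
    acc += sum((t - zbar) ** 2 for t in z)
print(round(acc / reps, 2))
```

The sample mean comes out near $n - 1 = 4$, one degree of freedom being lost to the projection, exactly as in the $s^2$ computation of Section 14.1.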
Let $Z \sim N(0, 1)$ and $Y \sim \chi^2(m)$; then
$$T = \frac{Z}{\sqrt{Y/m}}$$
follows the Student t-distribution with $m$ degrees of freedom. We shall write this $T \sim t(m)$. As $m$ becomes larger, $t(m)$ approaches the standard normal distribution.
If $Y_1 \sim \chi^2(m_1)$ and $Y_2 \sim \chi^2(m_2)$ then the random variable
$$F = \frac{Y_1/m_1}{Y_2/m_2}$$
follows an F-distribution with $m_1, m_2$ degrees of freedom, $F \sim F(m_1, m_2)$.
Consider again a linear regression model
$$y = \beta_0 + X_1\beta_1 + X_2\beta_2 + \cdots + X_k\beta_k + u = X\beta + u$$
and observations
$$y_t,\; x_{t,1}, x_{t,2}, \ldots, x_{t,k}, \qquad t = 1, 2, \ldots, n$$
Figure 25: F(5, 3) Density function
As before

= (
t
X X)
1

t
Xy is the OLS estimator and u = y X

the
vector of residuals. We assume u
t
NID(0,
2
).
We want to test the hypothesis that one of the coefficients, β_i say, has a
given value β̄. Without loss of generality we can assume that i = k, i.e. we
are testing the hypothesis β_k = β̄. Subtracting X_k β̄ on both sides we get
the model

    y − X_k β̄ = β_0 + X_1 β_1 + · · · + X_k (β_k − β̄) + u

and so we may also assume that the hypothesis is β_k = 0.
Consider first the case k = 1 so the model is y = β_0 + x β_1 + u. Let
P denote the orthogonal projection onto the (n − 1)-dimensional subspace
^t(1, 1, 1, . . . , 1)^⊥. Thus P kills any vector of the form (a, a, a, . . . , a). Applying
P we get Py = Px β_1 + Pu and so we get the OLS-estimator

    β̂_1 = (^tx ^tP P y)(^tx ^tP P x)^{−1} = (^tx P y)(^tx P x)^{−1}

because ^tP = P and P² = P.
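Since applying P just subtracts the mean from every coordinate, the estimator (^tx P y)(^tx P x)^{−1} is the familiar deviations-from-means slope formula. A sketch with hypothetical data (pure Python for illustration):

```python
# The centering projection P subtracts the mean from each coordinate,
# so (^tx P y)/(^tx P x) equals
# sum (x_t - xbar)(y_t - ybar) / sum (x_t - xbar)^2.
# Data below are hypothetical.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def center(v):
    # Apply P: subtract the mean from every coordinate.
    m = sum(v) / len(v)
    return [vi - m for vi in v]

Px, Py = center(x), center(y)
beta1 = sum(a * b for a, b in zip(Px, Py)) / sum(a * a for a in Px)

# The same number from the classical deviation formula:
xbar, ybar = sum(x) / len(x), sum(y) / len(y)
beta1_classic = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
                / sum((xi - xbar) ** 2 for xi in x)
print(beta1, beta1_classic)
```

Both computations give the same slope, which is the point: projecting out the constant vector is exactly what "subtracting the means" does.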
Under the null-hypothesis β_1 = 0, so Py = Pu and so

    β̂_1 = (^tx P u) / (^tx P x)
This shows that the estimator is a linear combination of NID(0, σ²) random
variables. The coefficients are the coordinates of the vector ^tx P/(^tx P x), so
the variance of the OLS estimator β̂_1 is

    σ² (^tx P P x)/(^tx P x)² = σ²/(^tx P x)
hence β̂_1 ∼ N(0, σ² (^tx P x)^{−1}) and so the statistic

    z = (^tx P y) / (σ (^tx P x)^{1/2}) ∼ N(0, 1)

and we can use this to test the null-hypothesis.
This is under the unrealistic assumption that we actually know the variance
σ² of the error terms. If this is not the case we need to use the unbiased
estimator for the variance,

    s² = (1/(n − 2)) Σ_t û_t²

where û = y − (β̂_0 + x β̂_1).
Let Q be the orthogonal projection onto the orthogonal complement of
the subspace spanned by ^t(1, 1, 1, . . . , 1) and x; thus Q kills any vector
of the form a_0 ^t(1, 1, . . . , 1) + a_1 x. Since by construction the vector of residuals û is
orthogonal to both ^t(1, 1, 1, . . . , 1) and x, we have û = Qû = Qy. Hence

    s² = (1/(n − 2)) ‖Qy‖² = (^ty Q y)/(n − 2).

Replacing σ by s we get the statistic
    t = (^tx P y)/(s (^tx P x)^{1/2}) = [(^tx P y)/(σ (^tx P x)^{1/2})] · (σ/s) = z/(s/σ)

The denominator is

    s/σ = (^ty Q y)^{1/2}/(σ √(n − 2)) = [ ^t(u/σ) Q (u/σ)/(n − 2) ]^{1/2}

using ^ty Q y = ^tu Q u, since Qy = Qu.
Since u ∼ N(0, σ² I), u/σ ∼ N(0, I), and since Q is an orthogonal projection
onto an (n − 2)-dimensional subspace, we get by Theorem 14.2.2 that
^t(u/σ) Q (u/σ) ∼ χ²(n − 2). This shows that the statistic t is distributed as

    N(0, 1) / √(χ²(n − 2)/(n − 2)) ∼ t(n − 2)

and we can then use this distribution to test the
null-hypothesis. This is naturally enough known as the t-test.
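The whole procedure can be carried out from scratch on a toy data set. The sketch below (hypothetical numbers; pure Python for illustration, rather than the notes' MATLAB) computes β̂_1, the residual variance estimate s², and the t statistic exactly as derived above:

```python
import math

# From-scratch t-test of beta1 = 0 in y = beta0 + x*beta1 + u,
# following the section's projection formulas.  Data are hypothetical.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.8]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
# ^tx P x and ^tx P y, with P the centering projection:
xPx = sum((xi - xbar) ** 2 for xi in x)
xPy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = xPy / xPx
b0 = ybar - b1 * xbar

# Residuals u-hat = Qy and the unbiased variance estimate
# s^2 = ^ty Q y / (n - 2):
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s2 = sum(r * r for r in resid) / (n - 2)

# t = ^tx P y / (s (^tx P x)^{1/2}), distributed as t(n - 2) under H0.
t = xPy / (math.sqrt(s2) * math.sqrt(xPx))
print(t)
```

The data were chosen with a strong linear trend, so t comes out large and the null-hypothesis β_1 = 0 would be rejected against t(4).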
In the general case where we have k independent variables the same calculation
goes through if we replace P by the orthogonal projection onto the
orthogonal complement of the subspace spanned by ^t(1, 1, . . . , 1), x_1, . . . , x_{k−1}
and replace n − 2 by n − k − 1. The matrix of the projection is the
n × n matrix I − X′(^tX′ X′)^{−1} ^tX′ where X′ is the n × k matrix

    1  x_{1,t_1}  x_{2,t_1}  . . .  x_{k−1,t_1}
    1  x_{1,t_2}  x_{2,t_2}  . . .  x_{k−1,t_2}
    .      .          .                 .
    1  x_{1,t_n}  x_{2,t_n}  . . .  x_{k−1,t_n}

i.e. we have left out the column of the
observations of the kth independent variable.
Example 14.6 We look again at the data set Data1 and test the hypothesis
that the coefficient to the quadratic term is 0.

y=Data1(:,1);
X=[ones(size(Data1(:,2))) Data1(:,2)];
Z=[X X(:,2).^2];
P=eye(40)-X*(X'*X)^(-1)*X';
Q=eye(40)-Z*(Z'*Z)^(-1)*Z';
x=Z(:,3);
z=x'*P*y;
ssquare=(y'*Q*y)/(40-2-1);
s=sqrt(ssquare);
t=z/(s*sqrt(x'*P*x))

t = -64.9502

p=tcdf(t,37)

p = 4.9178e-40

Since the p-value is so small we reject the null-hypothesis that the
coefficient to the quadratic term is 0.