
Niels O. Nygaard

The University of Chicago

August 29, 2006

1 Vector Spaces and Bases

Recall that a real vector space is a set V with a rule for adding elements, +, and a rule for multiplying elements by real numbers; i.e. if v and w are elements in V (we shall from now on call them vectors), then we can form another vector v + w, and if λ ∈ R we can form the vector λv. Also there has to be a zero vector 0, so a vector space can never be the empty set. These rules are supposed to satisfy some self-evident axioms, e.g. v + 0 = v, λ(v + w) = λv + λw, etc.

We shall mostly deal with the vector spaces R^n of n-tuples of real numbers; thus a vector in R^n is an n-tuple (x_1, x_2, ..., x_n). Other examples of vector spaces are:

- The set of continuous functions on an interval, where addition of functions is defined by (f + g)(x) = f(x) + g(x)
- The set of differentiable functions on an open interval ]a, b[
- The set of real-valued functions on a subset U ⊆ R^n
- The set of solutions to a system of linear differential equations Df = 0

Instead of using the real numbers R we can also use the complex numbers C, in which case we talk about complex vector spaces.

Definition 1.0.1 A linear combination of vectors v_1, v_2, ..., v_k in a vector space V is an expression of the form λ_1 v_1 + λ_2 v_2 + ... + λ_k v_k, where the λ's are real numbers (or complex numbers if we are dealing with a complex vector space).

A linear combination is of course again a vector in V.

Definition 1.0.2 A subspace U of a vector space V is a (non-empty) subset U ⊆ V with the property that if we take any two vectors u_1, u_2 which happen to be in the subset U, then their sum u_1 + u_2 is also in U, and for any number λ (real or complex depending on whether V is a real or complex vector space) and any u ∈ U, λu is again in U. Remark that since 0u = 0, the zero vector is in U.

Proposition 1.0.1 Consider a set of vectors v_1, v_2, v_3, ..., v_k in V. Consider the set of all possible linear combinations of these vectors. This set is denoted Span{v_1, v_2, ..., v_k}; thus a typical element of this subset is a vector w in V which can be expressed in the form

    w = a_1 v_1 + a_2 v_2 + ... + a_k v_k

where the a's are real or complex numbers. Then Span{v_1, v_2, v_3, ..., v_k} is a subspace of V.

Proof: We have to show that if w_1 and w_2 are in Span{v_1, v_2, v_3, ..., v_k}, then also w_1 + w_2 and λw_1 are in Span{v_1, v_2, ..., v_k}. Now we can write

    w_1 = a_1 v_1 + a_2 v_2 + ... + a_k v_k

and

    w_2 = b_1 v_1 + b_2 v_2 + ... + b_k v_k

Hence

    w_1 + w_2 = (a_1 v_1 + a_2 v_2 + ... + a_k v_k) + (b_1 v_1 + b_2 v_2 + ... + b_k v_k)

but we can rewrite this expression as

    w_1 + w_2 = (a_1 + b_1)v_1 + (a_2 + b_2)v_2 + ... + (a_k + b_k)v_k

and this is a linear combination of the v's, hence an element of Span{v_1, v_2, ..., v_k}. Similarly

    λw_1 = λ(a_1 v_1 + a_2 v_2 + ... + a_k v_k) = λa_1 v_1 + λa_2 v_2 + ... + λa_k v_k

and this is again an element of Span{v_1, v_2, ..., v_k}.

Definition 1.0.3 If U is a subspace of V, we say that U is spanned by vectors u_1, u_2, ..., u_r in U if U = Span{u_1, u_2, ..., u_r}. We note that there may be many different sets of vectors spanning U.

Example 1.1 Let V = R^n and consider the subset U ⊆ V consisting of vectors u = (x_1, x_2, ..., x_n) with x_1 + x_2 + ... + x_n = 0, i.e. the set of all vectors whose coordinates add up to 0. Then U is a subspace. Indeed, if u_1 = (x_1, x_2, ..., x_n) and u_2 = (y_1, y_2, ..., y_n) are in U, so x_1 + x_2 + ... + x_n = 0 and y_1 + y_2 + ... + y_n = 0, we have u_1 + u_2 = (x_1 + y_1, x_2 + y_2, ..., x_n + y_n) and

    (x_1 + y_1) + (x_2 + y_2) + ... + (x_n + y_n) = (x_1 + x_2 + ... + x_n) + (y_1 + y_2 + ... + y_n) = 0 + 0 = 0.

Also λu = (λx_1, λx_2, ..., λx_n) and

    λx_1 + λx_2 + ... + λx_n = λ(x_1 + x_2 + ... + x_n) = λ0 = 0

We claim that U is spanned by the n − 1 vectors:

    (1, −1, 0, 0, ..., 0)
    (0, 1, −1, 0, ..., 0)
    (0, 0, 1, −1, ..., 0)
    ...
    (0, 0, ..., 0, 1, −1)

First, all these vectors are clearly in U. Assume now that u = (x_1, x_2, x_3, ..., x_n) ∈ U, so x_1 + x_2 + x_3 + ... + x_n = 0. Consider the linear combination

      x_1 (1, −1, 0, 0, 0, ..., 0)
    + (x_1 + x_2)(0, 1, −1, 0, 0, ..., 0)
    + (x_1 + x_2 + x_3)(0, 0, 1, −1, 0, ..., 0)
    + (x_1 + x_2 + x_3 + x_4)(0, 0, 0, 1, −1, ..., 0)
    ...
    + (x_1 + x_2 + ... + x_{n−1})(0, 0, 0, ..., 1, −1)

This is equal to

    (x_1, −x_1 + (x_1 + x_2), −(x_1 + x_2) + (x_1 + x_2 + x_3), ..., −(x_1 + x_2 + x_3 + ... + x_{n−1}))
    = (x_1, x_2, x_3, ..., −(x_1 + x_2 + x_3 + ... + x_{n−1}))
    = (x_1, x_2, x_3, ..., x_n) = u

where the last equality follows from x_1 + x_2 + ... + x_{n−1} + x_n = 0, so x_n = −(x_1 + x_2 + ... + x_{n−1}).

But U is also spanned by the n − 1 vectors

    (1, −1, 0, 0, ..., 0)
    (1, 0, −1, 0, ..., 0)
    (1, 0, 0, −1, ..., 0)
    ...
    (1, 0, 0, 0, ..., −1)

To see this we consider the linear combination

      −x_2 (1, −1, 0, 0, ..., 0)
    − x_3 (1, 0, −1, 0, ..., 0)
    − x_4 (1, 0, 0, −1, ..., 0)
    ...
    − x_n (1, 0, 0, 0, ..., −1)
    = (−x_2 − x_3 − ... − x_n, x_2, x_3, ..., x_n)
    = (x_1, x_2, x_3, x_4, ..., x_n) = u

since x_1 + x_2 + x_3 + ... + x_n = 0, so x_1 = −x_2 − x_3 − ... − x_n.
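The computation above can be checked numerically. The following is a minimal Python sketch (the helper names and the choice of a concrete vector in R^4 are ours, for illustration only): a vector whose coordinates sum to 0 is rebuilt from the difference vectors using the partial-sum coefficients x_1, x_1 + x_2, x_1 + x_2 + x_3.

```python
# Check Example 1.1 in R^4: a vector with coordinate sum 0 is recovered by the
# stated combination of the difference vectors (1,-1,0,0), (0,1,-1,0), (0,0,1,-1).
def add(u, v):
    return [a + b for a, b in zip(u, v)]

def scale(c, v):
    return [c * a for a in v]

u = [3, -1, 2, -4]                      # coordinates sum to 0, so u is in U
spanning = [[1, -1, 0, 0], [0, 1, -1, 0], [0, 0, 1, -1]]

# coefficients are the partial sums x_1, x_1+x_2, x_1+x_2+x_3
coeffs = [sum(u[:i + 1]) for i in range(len(u) - 1)]

result = [0, 0, 0, 0]
for c, v in zip(coeffs, spanning):
    result = add(result, scale(c, v))

print(result)  # -> [3, -1, 2, -4], i.e. u itself
```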

Definition 1.0.4 A set of vectors v_1, v_2, ..., v_k is said to be linearly independent if the only linear combination of these vectors which is equal to 0 is the linear combination where all the coefficients are 0. That is,

    λ_1 v_1 + λ_2 v_2 + λ_3 v_3 + ... + λ_k v_k = 0

implies that λ_1 = λ_2 = λ_3 = ... = λ_k = 0.

This is equivalent to saying that two linear combinations of these vectors are equal if and only if all the coefficients are equal, i.e.

    λ_1 v_1 + λ_2 v_2 + λ_3 v_3 + ... + λ_k v_k = μ_1 v_1 + μ_2 v_2 + μ_3 v_3 + ... + μ_k v_k

if and only if λ_1 = μ_1, λ_2 = μ_2, λ_3 = μ_3, ..., λ_k = μ_k.

Indeed, if the vectors are linearly independent then

    λ_1 v_1 + λ_2 v_2 + λ_3 v_3 + ... + λ_k v_k = μ_1 v_1 + μ_2 v_2 + μ_3 v_3 + ... + μ_k v_k

if and only if

    (λ_1 − μ_1)v_1 + (λ_2 − μ_2)v_2 + (λ_3 − μ_3)v_3 + ... + (λ_k − μ_k)v_k = 0

so

    λ_1 − μ_1 = λ_2 − μ_2 = ... = λ_k − μ_k = 0
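The defining condition can be tested mechanically: put the vectors as rows of a matrix and eliminate; the vectors are linearly independent exactly when every row produces a pivot. A minimal Python sketch (the function name `independent` and the floating-point tolerance are our assumptions, not from the text):

```python
# Test linear independence by row reduction: the vectors are independent
# exactly when no row is reduced to all zeros (every row yields a pivot).
def independent(vectors, eps=1e-12):
    rows = [list(map(float, v)) for v in vectors]
    ncols = len(rows[0])
    pivot_row = 0
    for col in range(ncols):
        # find a row at or below pivot_row with a nonzero entry in this column
        for r in range(pivot_row, len(rows)):
            if abs(rows[r][col]) > eps:
                rows[pivot_row], rows[r] = rows[r], rows[pivot_row]
                break
        else:
            continue
        # eliminate this column below the pivot
        for r in range(pivot_row + 1, len(rows)):
            f = rows[r][col] / rows[pivot_row][col]
            rows[r] = [a - f * b for a, b in zip(rows[r], rows[pivot_row])]
        pivot_row += 1
    return pivot_row == len(rows)

print(independent([[1, -1, 0], [0, 1, -1]]))            # True
print(independent([[1, 0, 1], [0, 1, 1], [1, 1, 2]]))   # False: third = first + second
```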

Definition 1.0.5 Consider a vector space V. A set of vectors v_1, v_2, v_3, ..., v_k is said to be a basis of V if

1. They span V, i.e. every vector in V can be written as a linear combination of these vectors
2. They are linearly independent

Thus v_1, v_2, v_3, ..., v_k is a basis if and only if every vector in V can be written in exactly one way as a linear combination of these vectors.

Example 1.2 The vectors

    e_1 = (1, 0, 0, ..., 0, 0)
    e_2 = (0, 1, 0, ..., 0, 0)
    e_3 = (0, 0, 1, ..., 0, 0)
    ...
    e_{n−1} = (0, 0, 0, ..., 1, 0)
    e_n = (0, 0, 0, ..., 0, 1)

form a basis for the vector space R^n (resp. C^n). This is called the standard basis of R^n (resp. C^n).

Both of the two sets of vectors considered in Example 1.1 are bases of the subspace U.

A very important result is the following:

Theorem 1.0.1 Let v_1, v_2, ..., v_k and w_1, w_2, ..., w_r be bases of the same vector space V. Then k = r, i.e. any two bases of a vector space V have the same number of vectors in them. This common number is called the dimension of V, dim V.

Definition 1.0.6 A vector space is said to be finite dimensional if it has a basis.

Remark 1.1 There are plenty of vector spaces which are not finite dimensional: e.g. the vector space considered above, of continuous functions on an interval [a, b], does not have a basis in the sense above. If we define a linear combination of infinitely many vectors as any linear combination involving only a finite number of them, then in fact any vector space, finite dimensional or not, has a basis in the sense that there is a set of vectors, maybe infinitely many, such that every vector can be written uniquely as a linear combination of vectors from this set. Except when explicitly stated, we consider finite dimensional vector spaces.

Some other results related to bases:

Theorem 1.0.2 Let dim V = m. If v_1, v_2, ..., v_k is a set of linearly independent vectors then k ≤ m. If k = m, the set of vectors is a basis. In other words, in an m-dimensional vector space no linearly independent set can have more than m vectors in it, and any set of m linearly independent vectors also spans the vector space.

Theorem 1.0.3 Let v_1, v_2, ..., v_s be linearly independent vectors in a vector space V; then we can supplement this set with vectors v_{s+1}, v_{s+2}, ..., v_k such that the combined set v_1, v_2, ..., v_s, v_{s+1}, ..., v_k is a basis (i.e. dim V = k).

Theorem 1.0.4 Let w_1, ..., w_n be a set of vectors which span V, i.e. every vector in V can be written as a linear combination of these vectors. Then we can pick out a basis w_{i_1}, w_{i_2}, ..., w_{i_k} from this set. Thus dim V ≤ n.

Theorem 1.0.5 Any subspace U ⊆ V of a finite dimensional vector space is also finite dimensional, and dim U ≤ dim V with equality if and only if U = V.

The great thing about bases is that if we fix a basis of a vector space V, say v_1, v_2, ..., v_k, we can basically identify V with the vector space R^k (or C^k if we are in the complex case). The reason for this is that any vector v ∈ V can be written as a linear combination v = a_1 v_1 + a_2 v_2 + ... + a_k v_k, and the coefficients (a_1, a_2, ..., a_k), which we can view as a vector in R^k, are uniquely determined by the vector v. We call these coefficients the coordinates of v with respect to the basis v_1, v_2, ..., v_k. Thus the vector v determines and in turn is determined by its coordinates. Remember, though, that coordinates are always relative to a given basis. If we take another basis, say w_1, w_2, ..., w_k, then v has a set of coordinates with respect to this basis, but they will be totally different from the coordinates with respect to the basis v_1, v_2, ..., v_k.

With a chosen basis, a vector is then completely determined by its coordinates, and these coordinates form a vector in R^k (or C^k); we can do all computations involving the vector by doing them on the coordinates, e.g. we get the coordinates of the sum of two vectors in V simply by adding their coordinates in R^k.

2 Linear Transformations and Matrices

Definition 2.0.7 Let V and W be vector spaces. A linear transformation (or linear map) is a map T : V → W which preserves the vector space structures, i.e. T(v_1 + v_2) = T(v_1) + T(v_2) and T(λv) = λT(v).

We sometimes leave out the parentheses, so we may write Tv instead of T(v).

Example 2.1 If V = R^1 = W, a linear transformation T : V → W is of the form Tx = a·x for all x ∈ R, where a is a fixed real number. Remark that the map Sx = a + x is not linear.

Example 2.2 Let V be the vector space of continuous functions on an interval [a, b]. Then the map Tf = ∫_a^b f(t) dt is a linear transformation V → R.

Example 2.3 Let V be the vector space of continuously differentiable functions on an open interval ]a, b[ and W the vector space of continuous functions on ]a, b[. Then the map S : V → W defined by Sf = f′ is a linear transformation.

Proposition 2.0.2 Let U, V, W be vector spaces and let T : U → V and S : V → W be linear transformations. Then the composite map S ∘ T : U → W, defined by S ∘ T(u) = S(Tu), is also linear, i.e. a composite of two linear transformations is again a linear transformation.

Using bases, a linear transformation can be described very conveniently using matrices. Thus let V and W be finite dimensional vector spaces and let v_1, v_2, ..., v_m and w_1, w_2, ..., w_n be bases of V and W resp., so dim V = m and dim W = n.

Let T : V → W be a linear transformation. Applying T to any of the basis vectors v_j we obtain a vector Tv_j in W, and hence Tv_j can be written (uniquely) as a linear combination of the w's, thus

    Tv_j = a_{1j} w_1 + a_{2j} w_2 + ... + a_{nj} w_n

i.e. a_{1j}, a_{2j}, ..., a_{nj} are the coordinates of Tv_j with respect to the basis w_1, w_2, ..., w_n.

We do this for every v_j and hence we get n·m numbers

    [ a_11  a_12  ...  a_1m ]
    [ a_21  a_22  ...  a_2m ]
    [  .     .          .   ]
    [ a_n1  a_n2  ...  a_nm ]

This is the matrix of T with respect to the bases v_1, v_2, ..., v_m and w_1, w_2, ..., w_n. Thus the jth column in the matrix is the set of coordinates of Tv_j with respect to the basis w_1, w_2, ..., w_n.

We emphasize that the matrix of a linear map only makes sense when the bases are given. The linear map itself exists independently of bases, but the matrix depends on them, and if we change the bases we change the matrix. This fact is what the subject is all about: by choosing bases cleverly we are often able to make the matrix very simple, e.g. having lots of 0's.

Theorem 2.0.6 Let T : V → W be a linear transformation, v_1, v_2, ..., v_m and w_1, w_2, ..., w_n bases of V and W resp. Let

    A = [ a_11  a_12  ...  a_1m ]
        [ a_21  a_22  ...  a_2m ]
        [  .     .          .   ]
        [ a_n1  a_n2  ...  a_nm ]

be the matrix of T with respect to these bases. Then the linear transformation is completely determined by A in the sense that we can compute Tv for any vector v in V.

To see this, let v ∈ V and write it as a linear combination v = x_1 v_1 + x_2 v_2 + ... + x_m v_m. To determine Tv we need to find coefficients y_1, y_2, ..., y_n such that Tv = y_1 w_1 + y_2 w_2 + ... + y_n w_n. We can do this as follows: since T is linear,

    Tv = T(x_1 v_1 + x_2 v_2 + ... + x_m v_m)
       = x_1 Tv_1 + x_2 Tv_2 + ... + x_m Tv_m
       = x_1 (a_11 w_1 + a_21 w_2 + ... + a_n1 w_n)
       + x_2 (a_12 w_1 + a_22 w_2 + ... + a_n2 w_n)
       ...
       + x_m (a_1m w_1 + a_2m w_2 + ... + a_nm w_n)
       = (a_11 x_1 + a_12 x_2 + ... + a_1m x_m) w_1
       + (a_21 x_1 + a_22 x_2 + ... + a_2m x_m) w_2
       ...
       + (a_n1 x_1 + a_n2 x_2 + ... + a_nm x_m) w_n

Thus the coordinates of Tv with respect to the basis w_1, w_2, ..., w_n are given by

    [ y_1 ]   [ a_11 x_1 + a_12 x_2 + ... + a_1m x_m ]
    [ y_2 ] = [ a_21 x_1 + a_22 x_2 + ... + a_2m x_m ]
    [  .  ]   [                 .                    ]
    [ y_n ]   [ a_n1 x_1 + a_n2 x_2 + ... + a_nm x_m ]
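In other words, passing from the coordinates of v to the coordinates of Tv is just a matrix-vector product y_i = Σ_j a_ij x_j. A small Python sketch (the particular matrix and coordinates are made up for illustration):

```python
# Coordinates of Tv from coordinates of v: y_i = sum over j of a_ij * x_j.
def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

A = [[1, 2, 0],
     [2, 0, 3]]          # matrix of some T with respect to chosen bases
x = [1, 1, 1]            # coordinates of v in the basis v_1, v_2, v_3
y = matvec(A, x)         # coordinates of Tv in the basis w_1, w_2
print(y)                 # -> [3, 5]
```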

Let

    A = [ a_11  a_12  ...  a_1m ]
        [ a_21  a_22  ...  a_2m ]
        [  .     .          .   ]
        [ a_n1  a_n2  ...  a_nm ]

be an n × m matrix and

    B = [ b_11  b_12  ...  b_1k ]
        [ b_21  b_22  ...  b_2k ]
        [  .     .          .   ]
        [ b_m1  b_m2  ...  b_mk ]

an m × k matrix (remark that the number of columns in A is equal to the number of rows in B). We can then define the product of these two matrices, C = A·B, where C is the n × k matrix with entries

    c_ij = a_i1 b_1j + a_i2 b_2j + a_i3 b_3j + ... + a_im b_mj

for all 1 ≤ i ≤ n, 1 ≤ j ≤ k.

Example 2.4 Let

    A = [ 1  2  0 ]    and    B = [ 2  3  0 ]
        [ 2  0  3 ]               [ 1  1  2 ]
                                  [ 0  2  0 ]

The product is the 2 × 3 matrix

    C = [ 4   5  4 ]
        [ 4  12  0 ]
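The formula for c_ij translates directly into code. This Python sketch (the function name `matmul` is ours) reproduces Example 2.4:

```python
# Matrix product: c_ij = sum over l of a_il * b_lj, checked against Example 2.4.
def matmul(A, B):
    n, m, k = len(A), len(B), len(B[0])
    assert len(A[0]) == m, "columns of A must equal rows of B"
    return [[sum(A[i][l] * B[l][j] for l in range(m)) for j in range(k)]
            for i in range(n)]

A = [[1, 2, 0],
     [2, 0, 3]]
B = [[2, 3, 0],
     [1, 1, 2],
     [0, 2, 0]]
print(matmul(A, B))  # -> [[4, 5, 4], [4, 12, 0]]
```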

Using matrix multiplication we can reformulate the expression for the coordinates of Tv as

    [ y_1 ]       [ x_1 ]   [ a_11  a_12  ...  a_1m ] [ x_1 ]
    [ y_2 ] = A · [ x_2 ] = [ a_21  a_22  ...  a_2m ] [ x_2 ]
    [  .  ]       [  .  ]   [  .     .          .   ] [  .  ]
    [ y_n ]       [ x_m ]   [ a_n1  a_n2  ...  a_nm ] [ x_m ]

Given bases v_1, v_2, ..., v_m and w_1, w_2, ..., w_n and an n × m matrix A, we can define a linear transformation T as follows: let v ∈ V and let x_1, x_2, ..., x_m be the coordinates of v with respect to the basis v_1, v_2, ..., v_m. Let y_1, y_2, ..., y_n be defined as the matrix product

    [ y_1 ]       [ x_1 ]
    [ y_2 ] = A · [ x_2 ]
    [  .  ]       [  .  ]
    [ y_n ]       [ x_m ]

We define Tv as the vector whose coordinates with respect to the basis w_1, w_2, ..., w_n are y_1, y_2, ..., y_n, i.e.

    Tv = y_1 w_1 + y_2 w_2 + ... + y_n w_n

Thus we have a 1–1 correspondence: (bases + linear transformation) ↔ (bases + matrix).

Theorem 2.0.7 Let U, V, W be vector spaces with bases u_1, u_2, ..., u_k; v_1, v_2, ..., v_m; and w_1, w_2, ..., w_n resp. Let T : U → V be a linear transformation with matrix B = (b_ij) with respect to the bases u_1, u_2, ..., u_k and v_1, v_2, ..., v_m; thus B is an m × k matrix. Let S : V → W be a linear transformation with matrix A = (a_ij) with respect to the bases v_1, v_2, ..., v_m and w_1, w_2, ..., w_n; thus A is an n × m matrix. Then the matrix of the composite linear transformation S ∘ T : U → W with respect to the bases u_1, u_2, ..., u_k and w_1, w_2, ..., w_n is the n × k matrix A·B.

To prove this we compute S ∘ T(u_j). First, Tu_j = b_{1j} v_1 + b_{2j} v_2 + ... + b_{mj} v_m, and thus

    S ∘ T(u_j) = S(Tu_j) = S(b_{1j} v_1 + b_{2j} v_2 + ... + b_{mj} v_m)
       = b_{1j} Sv_1 + b_{2j} Sv_2 + ... + b_{mj} Sv_m
       = b_{1j} (a_11 w_1 + a_21 w_2 + ... + a_n1 w_n)
       + b_{2j} (a_12 w_1 + a_22 w_2 + ... + a_n2 w_n)
       ...
       + b_{mj} (a_1m w_1 + a_2m w_2 + ... + a_nm w_n)
       = (a_11 b_{1j} + a_12 b_{2j} + ... + a_1m b_{mj}) w_1
       + (a_21 b_{1j} + a_22 b_{2j} + ... + a_2m b_{mj}) w_2
       ...
       + (a_n1 b_{1j} + a_n2 b_{2j} + ... + a_nm b_{mj}) w_n

The coefficients in this linear combination are precisely the entries in the jth column of the product matrix A·B.

Consider now a system of linear equations:

    a_11 x_1 + a_12 x_2 + ... + a_1m x_m = b_1
    a_21 x_1 + a_22 x_2 + ... + a_2m x_m = b_2
    ...
    a_n1 x_1 + a_n2 x_2 + ... + a_nm x_m = b_n

Let A = (a_ij) be the n × m matrix of coefficients, let x denote the column vector with entries x_1, x_2, ..., x_m, and let b be the column vector with entries b_1, b_2, ..., b_n. Then the system of linear equations can be expressed in terms of these matrices as

    A·x = b

One of the major tasks of numerical analysis is to solve such systems, and clearly the simpler the matrix A is, the easier this will be.

Now consider the vector spaces V = R^m and W = R^n with their standard bases. We can then view the matrix A as the matrix of a linear transformation T : V → W with respect to the standard bases. If w is the vector b_1 e_1 + b_2 e_2 + ... + b_n e_n in W, then the system of equations above can be formulated in a coordinate-free form

    Tx = w

i.e. we have to find a vector x ∈ V which maps to w under the linear transformation T. This may seem like a triviality, but it has very profound implications, because it sets us free from dealing with the standard bases and lets us use whatever bases of V and W are most convenient.

Before we do an example of this we want to introduce two subspaces associated to a linear transformation T : V → W.

Definition 2.0.8 The image of T is the subset Im T ⊆ W of all vectors w ∈ W that are of the form Tv, i.e. such that there is a vector v ∈ V which is mapped to w.

The kernel of T is the subset ker T ⊆ V consisting of all vectors v ∈ V such that Tv = 0, i.e. the set of all vectors in V which under T are mapped to the zero vector in W.

Lemma 2.0.1 Im T ⊆ W and ker T ⊆ V are subspaces.

Proof: To prove that Im T is a subspace, consider w_1 and w_2 in Im T and real (or complex) numbers λ_1, λ_2. We have to show that λ_1 w_1 + λ_2 w_2 is also in Im T. Since w_1, w_2 ∈ Im T we can find vectors v_1, v_2 ∈ V such that Tv_1 = w_1 and Tv_2 = w_2. Consider the vector λ_1 v_1 + λ_2 v_2 ∈ V. Then T(λ_1 v_1 + λ_2 v_2) = λ_1 Tv_1 + λ_2 Tv_2 = λ_1 w_1 + λ_2 w_2. But this shows precisely that the vector λ_1 v_1 + λ_2 v_2 maps to λ_1 w_1 + λ_2 w_2 under T, i.e. this is a vector in Im T, which is what had to be proved.

The other statement is equally easy to prove: consider v_1, v_2 ∈ ker T and λ_1, λ_2 ∈ R (or C). We have to show that λ_1 v_1 + λ_2 v_2 is also in ker T. But we can verify this by showing that T maps this vector to 0. Now T(λ_1 v_1 + λ_2 v_2) = λ_1 Tv_1 + λ_2 Tv_2 = λ_1·0 + λ_2·0 = 0, which again is precisely what we had to show.

Proposition 2.0.3 Let T : V → W be a linear transformation. Let v_1, v_2, ..., v_m and w_1, w_2, ..., w_n be bases of V and W resp., and let A = (a_ij) be the n × m matrix of T w.r.t. these bases. Then the subspace Im T ⊆ W is spanned by the vectors in W with coordinates

    [ a_11 ]   [ a_12 ]         [ a_1m ]
    [ a_21 ] , [ a_22 ] , ... , [ a_2m ]
    [  .   ]   [  .   ]         [  .   ]
    [ a_n1 ]   [ a_n2 ]         [ a_nm ]

with respect to the basis w_1, w_2, ..., w_n. In short: the columns of A span the image of T. Thus dim Im T is equal to the maximum number of linearly independent vectors among the columns. This maximum is called the column rank of the matrix A. If we do the same for the rows of A, i.e. find the maximum number of linearly independent rows (remark that they are vectors in R^m while the columns are vectors in R^n), then we come up with the row rank of A. It is a fact (but not a trivial fact) that for any matrix, row rank = column rank.

Here is a nice formula which relates the dimensions of Im T and ker T:

Theorem 2.0.8 dim Im T + dim ker T = dim V

We can now formulate a theorem regarding the solutions to the equation Tx = w.

Theorem 2.0.9 The equation Tx = w has a solution if and only if w ∈ Im T. If v is a solution, then any other solution is of the form v + u, where u is a vector in ker T.

Proof: First: the equation Tx = w has a solution precisely when there is a vector v ∈ V such that Tv = w, but that is the same as saying that w ∈ Im T.

Secondly: if v_1 is another solution, i.e. Tv_1 = w, then T(v_1 − v) = Tv_1 − Tv = w − w = 0. Thus u = v_1 − v is in ker T. Hence v_1 = v + u with u ∈ ker T.

Example 2.5 Consider the system of linear equations

    x_1 + x_2 = 3
    x_1 + x_3 = 4
    x_2 − x_3 = −1

The matrix of coefficients is given by

    A = [ 1  1   0 ]
        [ 1  0   1 ]
        [ 0  1  −1 ]

and the right hand side is the vector b = (3, 4, −1). The linear transformation T : R^3 → R^3 with this matrix with respect to the standard bases is given by

    T(x_1, x_2, x_3) = (x_1 + x_2, x_1 + x_3, x_2 − x_3)

Now consider the basis of R^3 given by f_1 = (1, 1, 0), f_2 = (1, 0, 1), f_3 = (0, 1, 1). The matrix of T with respect to the bases e_1, e_2, e_3 and f_1, f_2, f_3 is computed by

    Te_1 = T(1, 0, 0) = (1, 1, 0) = f_1
    Te_2 = T(0, 1, 0) = (1, 0, 1) = f_2
    Te_3 = T(0, 0, 1) = (0, 1, −1) = f_1 − f_2

Hence the matrix of T with respect to these bases is

    [ 1  0   1 ]
    [ 0  1  −1 ]
    [ 0  0   0 ]

The vector b is written as b = 4f_1 − f_2, and hence the equation Tx = b in the new coordinates reads

    [ 1  0   1 ] [ x_1 ]   [  4 ]
    [ 0  1  −1 ] [ x_2 ] = [ −1 ]
    [ 0  0   0 ] [ x_3 ]   [  0 ]

or

    x_1 + x_3 = 4
    x_2 − x_3 = −1
    0 = 0

Thus the solutions are all vectors of the form

    [ 4 − x_3  ]   [  4 ]        [ −1 ]
    [ −1 + x_3 ] = [ −1 ] + x_3 ·[  1 ]
    [   x_3    ]   [  0 ]        [  1 ]

Note that T(x_1, x_2, x_3) = (x_1 + x_3) f_1 + (x_2 − x_3) f_2, hence the image of T is spanned by f_1, f_2. Since they are linearly independent, they form a basis for Im T, so this subspace has dimension 2. Thus dim ker T = 1, spanned by the vector (−1, 1, 1).

This example illustrates that by changing bases we can simplify the system to a form where we can readily read off the solutions, if there are any. Also note that in this case it only involves changing the basis of the target vector space.
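The claims in Example 2.5 are easy to confirm numerically. In this Python sketch (the variable names are ours) we apply T directly and check the particular solution, the kernel vector, and a shifted solution:

```python
# Check Example 2.5: T(x) = (x1+x2, x1+x3, x2-x3). The particular solution
# (4, -1, 0) hits b = (3, 4, -1), and (-1, 1, 1) lies in the kernel.
def T(x):
    x1, x2, x3 = x
    return [x1 + x2, x1 + x3, x2 - x3]

b = [3, 4, -1]
particular = [4, -1, 0]
kernel_vec = [-1, 1, 1]

print(T(particular))   # -> [3, 4, -1], equals b
print(T(kernel_vec))   # -> [0, 0, 0]

# every vector particular + t * kernel_vec is then also a solution
t = 7.0
shifted = [p + t * k for p, k in zip(particular, kernel_vec)]
print(T(shifted))      # -> [3.0, 4.0, -1.0]
```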

2.1 Change of Basis

Next we shall explicitly compute how the matrix changes when we change bases. Thus consider a linear transformation T : V → W. Let v_1, v_2, ..., v_m be a basis of V and let w_1, w_2, ..., w_n be a basis of W. Assume that the matrix of T with respect to these bases is

    A = [ a_11  a_12  ...  a_1m ]
        [ a_21  a_22  ...  a_2m ]
        [  .     .          .   ]
        [ a_n1  a_n2  ...  a_nm ]

Now suppose we have two other bases v′_1, v′_2, ..., v′_m and w′_1, w′_2, ..., w′_n. Then T has a matrix with respect to these bases, say A′ = (a′_ij). The question is: what is the connection between the matrices A and A′?

To answer this question we need to introduce the coordinate transformation matrix between two bases. Consider the two bases v_1, v_2, ..., v_m and v′_1, v′_2, ..., v′_m. Since they are both bases, each of the vectors in one basis can be expressed as a linear combination of the vectors in the other. Thus we can write

    v′_j = c_{1j} v_1 + c_{2j} v_2 + ... + c_{mj} v_m

for j = 1, 2, ..., m. We can then form the m × m matrix

    C = [ c_11  c_12  ...  c_1m ]
        [ c_21  c_22  ...  c_2m ]
        [  .     .          .   ]
        [ c_m1  c_m2  ...  c_mm ]

This matrix is called the coordinate transformation matrix between the basis v_1, v_2, ..., v_m and the basis v′_1, v′_2, ..., v′_m. The reason for this terminology is clear from the following result:

Proposition 2.1.1 Let u ∈ V and let λ_1, λ_2, ..., λ_m be the coordinates of u with respect to the basis v′_1, v′_2, ..., v′_m, i.e. u = λ_1 v′_1 + λ_2 v′_2 + ... + λ_m v′_m. Then the coordinates of u with respect to the basis v_1, v_2, ..., v_m are given by

        [ λ_1 ]
    C · [ λ_2 ]
        [  .  ]
        [ λ_m ]

Proof: We have u = λ_1 v′_1 + λ_2 v′_2 + ... + λ_m v′_m, and each v′_j = c_{1j} v_1 + c_{2j} v_2 + ... + c_{mj} v_m. Substituting these expressions we get

    u = λ_1 (c_11 v_1 + c_21 v_2 + ... + c_m1 v_m)
      + λ_2 (c_12 v_1 + c_22 v_2 + ... + c_m2 v_m)
      + ...
      + λ_m (c_1m v_1 + c_2m v_2 + ... + c_mm v_m)
      = (λ_1 c_11 + λ_2 c_12 + ... + λ_m c_1m) v_1
      + (λ_1 c_21 + λ_2 c_22 + ... + λ_m c_2m) v_2
      + ...
      + (λ_1 c_m1 + λ_2 c_m2 + ... + λ_m c_mm) v_m

Thus the coordinates of u with respect to the basis v_1, v_2, ..., v_m form the column vector

    [ λ_1 c_11 + λ_2 c_12 + ... + λ_m c_1m ]       [ λ_1 ]
    [ λ_1 c_21 + λ_2 c_22 + ... + λ_m c_2m ] = C · [ λ_2 ]
    [                  .                   ]       [  .  ]
    [ λ_1 c_m1 + λ_2 c_m2 + ... + λ_m c_mm ]       [ λ_m ]

Theorem 2.1.1 Let C be the coordinate transformation matrix between the bases v_1, v_2, ..., v_m and v′_1, v′_2, ..., v′_m (so v′_j = c_{1j} v_1 + ... + c_{mj} v_m), and let D be the coordinate transformation matrix between the bases w′_1, w′_2, ..., w′_n and w_1, w_2, ..., w_n (so w_j = d_{1j} w′_1 + ... + d_{nj} w′_n, and D converts coordinates with respect to the w's into coordinates with respect to the w′'s). Let A be the matrix of T w.r.t. v_1, v_2, ..., v_m and w_1, w_2, ..., w_n, and let A′ be the matrix of T w.r.t. v′_1, v′_2, ..., v′_m and w′_1, w′_2, ..., w′_n. Then

    A′ = D · A · C

Proof: The jth column of A′ consists of the coordinates of Tv′_j with respect to the basis w′_1, w′_2, ..., w′_n. We first compute the coordinates of Tv′_j with respect to the w's. Now v′_j = c_{1j} v_1 + c_{2j} v_2 + ... + c_{mj} v_m, thus the coordinates of v′_j w.r.t. the v's are

    [ c_{1j} ]
    [ c_{2j} ]
    [   .    ]
    [ c_{mj} ]

Thus the coordinates of Tv′_j w.r.t. the w's are obtained by multiplying this column by A. Now, using the previous Proposition, the coordinates with respect to the w′'s are given by

            [ c_{1j} ]
    D · A · [ c_{2j} ]
            [   .    ]
            [ c_{mj} ]

But this is precisely the jth column in the product matrix D · A · C.

Definition 2.1.1 The m × m identity matrix is the m × m matrix with 1's in the diagonal and all 0's outside the diagonal. We denote this matrix by E_m. Thus

    E_m = [ 1  0  0  ...  0 ]
          [ 0  1  0  ...  0 ]
          [ 0  0  1  ...  0 ]
          [ .  .  .       . ]
          [ 0  0  0  ...  1 ]

Remark that for any n × m matrix A we have A·E_m = A and E_n·A = A.

Definition 2.1.2 An m × m matrix C is said to be invertible if there exists an m × m matrix C′ such that C·C′ = C′·C = E_m. The matrix C′, if it exists, is called the inverse matrix and is denoted C^{−1}.

Example 2.6 Let C be the coordinate transformation matrix between the bases v_1, v_2, ..., v_m and v′_1, v′_2, ..., v′_m, and let C′ be the coordinate transformation matrix between the bases v′_1, v′_2, ..., v′_m and v_1, v_2, ..., v_m. Then C′ = C^{−1}. This follows immediately, since C·C′ is the coordinate transformation matrix between v_1, v_2, ..., v_m and itself, hence is the identity matrix. The same is true for the other product C′·C.

Corollary 2.1.1 Let T : V → V be a linear transformation, i.e. T maps V into itself. Let v_1, v_2, ..., v_m and v′_1, v′_2, ..., v′_m be bases of V. Let A be the matrix of T with respect to v_1, v_2, ..., v_m and itself, and A′ the matrix of T with respect to v′_1, v′_2, ..., v′_m and itself. Let C be the coordinate transformation matrix between v′_1, v′_2, ..., v′_m and v_1, v_2, ..., v_m. Then

    A′ = C · A · C^{−1}
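The corollary can be illustrated numerically in R^2. In this Python sketch (the matrices are made up for illustration, and `inv2` is our helper for inverting a 2 × 2 matrix), choosing the new basis vectors (1, 1) and (1, −1) makes the matrix of T diagonal, exactly the kind of simplification by a clever choice of basis that the text mentions:

```python
# Change of basis in R^2. P has the new basis vectors as columns, so P converts
# new coordinates to standard ones; hence C = P^{-1} converts standard
# coordinates to new ones, and A' = C * A * C^{-1} = P^{-1} * A * P.
def matmul(A, B):
    return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def inv2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[2, 1],
     [1, 2]]     # matrix of T in the standard basis
P = [[1, 1],
     [1, -1]]    # columns are the new basis vectors (1,1) and (1,-1)

Aprime = matmul(inv2(P), matmul(A, P))
print(Aprime)    # -> [[3.0, 0.0], [0.0, 1.0]]: T is diagonal in the new basis
```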

3 Inner Products, Norms and Orthogonality

Let V be a vector space.

Definition 3.0.3 An inner product on V is a map ( , ) : V × V → R (or C if V is a complex vector space). This map satisfies the following conditions:

1. (λ_1 v_1 + λ_2 v_2, w) = λ_1 (v_1, w) + λ_2 (v_2, w) (linearity)
2. (v, w) = (w, v) if V is real, and (v, w) is the complex conjugate of (w, v) if V is complex (symmetric resp. hermitian)
3. (v, v) > 0 unless v = 0 (positive definite). Remark that (v, v) is always real because of the second condition.

It follows that we have (v, λ_1 w_1 + λ_2 w_2) = λ̄_1 (v, w_1) + λ̄_2 (v, w_2), with the conjugation bars omitted in the real case. Thus the inner product is also linear in the second variable if V is real, and conjugate linear if V is complex (the linearity conditions in the first and second variables are known as bilinearity in the real case and sesqui-linearity in the complex case).

Example 3.1 Let $V = \mathbb{R}^n$ and define the inner product of vectors $v = (x_1, x_2, \dots, x_n)$ and $w = (y_1, y_2, \dots, y_n)$ by $(v, w) = x_1 y_1 + x_2 y_2 + \dots + x_n y_n$. This is easily seen to be an inner product.

If $V = \mathbb{C}^n$ we define $(v, w) = x_1 \bar{y}_1 + x_2 \bar{y}_2 + \dots + x_n \bar{y}_n$.
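Both conventions are easy to compute directly; the sketch below uses NumPy (not part of the notes) with made-up vectors, conjugating the second slot in the complex case, matching the convention above.

```python
import numpy as np

# Real case: (v, w) = sum x_i y_i.
v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, -1.0, 0.5])
real_ip = np.dot(v, w)

# Complex case: (v, w) = sum x_i * conj(y_i) -- conjugation in the
# second slot, as in the definition above.
vc = np.array([1 + 2j, 3 - 1j])
wc = np.array([2 - 1j, 1j])
complex_ip = np.sum(vc * np.conj(wc))

# (v, v) is real and positive for v != 0, as condition 3 requires.
assert np.isreal(np.sum(vc * np.conj(vc)))
assert np.sum(vc * np.conj(vc)).real > 0
```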

Definition 3.0.4 A linear functional on a vector space $V$ is a linear transformation $f : V \to \mathbb{R}$ (or $\mathbb{C}$).

If $(\ ,\ )$ is an inner product on $V$ and $v \in V$ then the map $u \mapsto (u, v)$ defines a linear functional. The next theorem states that in fact every linear functional is of this form.

Theorem 3.0.2 (The Riesz representation theorem) Let $\dim V = m$ and let $(\ ,\ )$ be an inner product. If $f$ is a linear functional on $V$ there exists a unique vector $v$ such that $f(u) = (u, v)$ for all $u \in V$.

Proof: We shall prove it in the real case. The complex case is slightly more complicated. Let $v_1, v_2, \dots, v_m$ be a basis. We define a linear functional $f_i : V \to \mathbb{R}$ by giving its matrix with respect to this basis of $V$ and the standard basis $e_1 = 1$ of $\mathbb{R}^1$. The matrix of a linear functional is a $1 \times m$ matrix, i.e. a row. We let $f_i$ be defined by the matrix
$$\begin{pmatrix} 0 & 0 & \dots & 0 & 1 & 0 & \dots & 0 \end{pmatrix}$$
where the 1 is in the $i$th position. Thus $f_i$ maps $v_i$ to 1 and all the other basis vectors to 0.

Consider the set of all linear functionals $V^*$. This set has a vector space structure if we define $f + g$ by $(f + g)(v) = f(v) + g(v)$ and $(\alpha f)(v) = \alpha (f(v))$ (it has to be verified that $f + g$ and $\alpha f$ are again linear functionals but that is trivial). This vector space is called the dual space of $V$. We shall show that the linear functionals we defined above form a basis for $V^*$. We first show that they are linearly independent in $V^*$. Let
$$\alpha_1 f_1 + \alpha_2 f_2 + \dots + \alpha_m f_m$$
be a linear combination which is the 0-linear functional, i.e. the functional which is identically 0. Evaluating this linear combination on the basis vector $v_i$ we get $0 = \alpha_1 f_1(v_i) + \alpha_2 f_2(v_i) + \dots + \alpha_i f_i(v_i) + \dots + \alpha_m f_m(v_i) = \alpha_i f_i(v_i) = \alpha_i$. Thus $\alpha_i = 0$ for $i = 1, 2, \dots, m$, which proves the linear independence.

To show that they span $V^*$, let $f$ be any linear functional in $V^*$. We have to show that we can write $f$ as a linear combination of the $f_i$'s. Define $\alpha_i = f(v_i)$, i.e. the value of $f$ on the basis vector $v_i$, and put $g = \alpha_1 f_1 + \alpha_2 f_2 + \dots + \alpha_m f_m$. Then $g(v_i) = \alpha_i$ for $i = 1, 2, \dots, m$. Thus $f$ and $g$ take the same values on all the basis vectors, hence $f - g$ vanishes on all the basis vectors. But since any vector in $V$ is a linear combination of these basis vectors, this means that $f - g$ must vanish on any vector, i.e. $f - g$ is identically 0, or in other words $f = g$, and so $f$ is a linear combination of the $f_i$'s.

We conclude from this that $\dim V = \dim V^*$.

Now let $v \in V$. Then we define a linear functional $f_v$ by $f_v(u) = (u, v)$. Because the inner product is linear in the first variable this is a linear functional, i.e. $f_v \in V^*$. Thus we have a map $v \mapsto f_v$, $V \to V^*$. The statement of the Riesz representation theorem is that this map is a bijection.

First, this map is a linear transformation, i.e. $f_{\alpha v + \alpha' v'} = \alpha f_v + \alpha' f_{v'}$. We verify this by evaluating both sides on a general vector $u$. The left hand side evaluates to $(u, \alpha v + \alpha' v') = \alpha (u, v) + \alpha' (u, v')$ because the inner product is also linear in the second variable (remember we are in the real case), but this is precisely equal to the right hand side evaluated on $u$.

Now using the formula $\dim \ker + \dim \mathrm{Im} = \dim V$ we see that since $\dim V = \dim V^*$, if $\ker = 0$ the image must be all of $V^*$, i.e. every linear functional is of the form $f_v$. But if $v \in \ker$, so $f_v$ is identically 0, we have in particular that $0 = f_v(v) = (v, v)$. By the positive definiteness this means that $v = 0$ and so the only vector in $\ker$ is the 0-vector.

To show uniqueness assume that $f_v = f_{v'}$; then $f_{v - v'}$ is identically 0, hence $v - v' \in \ker$ and so $v - v' = 0$, i.e. $v = v'$.
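On $\mathbb{R}^m$ with a general inner product $(u, v) = {}^tu\,G\,v$ ($G$ symmetric positive definite), the representing vector can be found by solving one linear system. The sketch below uses NumPy (not part of the notes); $G$ and the functional $c$ are made-up illustration data.

```python
import numpy as np

# A symmetric positive definite "Gram matrix" G defining (u, v) = u^T G v.
G = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 1.0]])

# A linear functional f(u) = c . u, given by a row vector c.
c = np.array([1.0, -2.0, 3.0])

# The Riesz representer v satisfies c . u = u^T G v for all u,
# i.e. G v = c.
v = np.linalg.solve(G, c)

# Check f(u) = (u, v) on the standard basis vectors.
for u in np.eye(3):
    assert np.isclose(c @ u, u @ G @ v)
```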

If $(\ ,\ )$ is an inner product we define the norm of $v$ by $\|v\| = \sqrt{(v, v)}$. Thus $\|v\| = 0$ if and only if $v$ is the 0-vector.

Theorem 3.0.3 (The Cauchy-Schwarz inequality) For any two vectors $v$ and $w$ we have
$$|(v, w)| \le \|v\| \|w\|$$
with equality if and only if $w$ is a multiple of $v$.

Proof: We shall prove it in the complex case; the real case is done the same way but is easier. Let $\lambda = x + iy$ be a complex number and consider $\|\lambda v + w\|^2 = (\lambda v + w, \lambda v + w)$. Using the sesqui-linearity this equals $|\lambda|^2 \|v\|^2 + \lambda (v, w) + \bar{\lambda} (w, v) + \|w\|^2 = |\lambda|^2 \|v\|^2 + \lambda (v, w) + \overline{\lambda (v, w)} + \|w\|^2$. Since it is the square of a norm, this expression is always $\ge 0$, with equality if and only if $w = -\lambda v$, i.e. $w$ is a multiple of $v$. We now view it as a function of the two variables $x$ and $y$ and try to find its minimum. Writing out the expression we get
$$(x^2 + y^2) \|v\|^2 + 2\big(x \Re(v, w) - y \Im(v, w)\big) + \|w\|^2$$
Taking the partial derivatives with respect to $x$ and $y$ and putting them equal to 0 we get
$$\frac{\partial}{\partial x} = 2x \|v\|^2 + 2 \Re(v, w) = 0$$
$$\frac{\partial}{\partial y} = 2y \|v\|^2 - 2 \Im(v, w) = 0$$
This gives $x = -\dfrac{\Re(v, w)}{\|v\|^2}$ and $y = \dfrac{\Im(v, w)}{\|v\|^2}$, and so the minimum value is
$$\left( \frac{(\Re(v, w))^2}{\|v\|^4} + \frac{(\Im(v, w))^2}{\|v\|^4} \right) \|v\|^2 - 2 \left( \frac{(\Re(v, w))^2}{\|v\|^2} + \frac{(\Im(v, w))^2}{\|v\|^2} \right) + \|w\|^2 = \frac{|(v, w)|^2}{\|v\|^2} - 2 \frac{|(v, w)|^2}{\|v\|^2} + \|w\|^2 = -\frac{|(v, w)|^2}{\|v\|^2} + \|w\|^2$$
This expression is $\ge 0$ and so we get
$$-\frac{|(v, w)|^2}{\|v\|^2} + \|w\|^2 \ge 0$$
or
$$-|(v, w)|^2 + \|v\|^2 \|w\|^2 \ge 0$$
which gives the result.
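The inequality and its equality case are easy to confirm numerically. A quick NumPy check (NumPy is an assumption; the vectors are random illustration data):

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(100):
    v = rng.normal(size=4) + 1j * rng.normal(size=4)
    w = rng.normal(size=4) + 1j * rng.normal(size=4)
    ip = np.sum(v * np.conj(w))   # (v, w), conjugation in the second slot
    assert abs(ip) <= np.linalg.norm(v) * np.linalg.norm(w) + 1e-12

# Equality holds when w is a (complex) multiple of v.
v = np.array([1 + 1j, 2.0, -1j])
w = (3 - 2j) * v
ip = np.sum(v * np.conj(w))
assert np.isclose(abs(ip), np.linalg.norm(v) * np.linalg.norm(w))
```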

Corollary 3.0.2 (The triangle inequality) For any two vectors $v, w$ we have the inequality
$$\|v + w\| \le \|v\| + \|w\|$$
Proof: Squaring both sides we have to show
$$\|v + w\|^2 \le \|v\|^2 + \|w\|^2 + 2\|v\|\|w\|$$
Computing the left hand side we have $\|v + w\|^2 = (v + w, v + w) = \|v\|^2 + (v, w) + (w, v) + \|w\|^2$. Thus we need to show $(v, w) + (w, v) \le 2\|v\|\|w\|$. If $(v, w)$ is real this is a direct consequence of the Cauchy-Schwarz inequality. If $(v, w) = a + ib$ then $(v, w) + (w, v) = (v, w) + \overline{(v, w)} = 2a$. We have $2a \le 2\sqrt{a^2 + b^2} = 2|(v, w)| \le 2\|v\|\|w\|$.

Corollary 3.0.3 For any two vectors $v$ and $w$ we have
$$\big| \|v\| - \|w\| \big| \le \|v - w\|$$
Proof: Squaring both sides we have to show $\|v\|^2 + \|w\|^2 - 2\|v\|\|w\| \le (v - w, v - w) = \|v\|^2 - (v, w) - (w, v) + \|w\|^2$. Thus it comes down to showing that $(v, w) + (w, v) \le 2\|v\|\|w\|$, which is precisely what we showed above.

Definition 3.0.5 Two vectors $v$ and $w$ are said to be orthogonal if $(v, w) = 0$.

In the real case we have the formula $\cos \theta = \dfrac{(v, w)}{\|v\|\|w\|}$ where $\theta$ is the angle between the two vectors (remark that this does not make sense in the complex case because the inner product in general is not real, whereas of course $\cos \theta$ is a real number).

Definition 3.0.6 A set of vectors $q_1, q_2, \dots, q_r$ is said to be an orthogonal set if for any $i \ne j$, $(q_i, q_j) = 0$, i.e. they are pairwise orthogonal. If we also have $\|q_i\| = 1$ for all $i$, we talk of an orthonormal set.

Proposition 3.0.2 Let $q_1, q_2, \dots, q_r$ be an orthogonal set of non-zero vectors; then they are linearly independent.

Proof: Consider a linear combination equal to the 0-vector
$$c_1 q_1 + c_2 q_2 + \dots + c_r q_r = 0$$
We have to show that all the coefficients are 0. Now
$$0 = (c_1 q_1 + c_2 q_2 + \dots + c_i q_i + \dots + c_r q_r, q_i) = c_1 (q_1, q_i) + c_2 (q_2, q_i) + \dots + c_i (q_i, q_i) + \dots + c_r (q_r, q_i) = c_i \|q_i\|^2$$
Since $\|q_i\| > 0$ this means $c_i = 0$.

Lemma 3.0.1 Assume $q_1, q_2, \dots, q_r$ is an orthonormal set and let $v$ be any vector. Then the vector
$$u = v - (v, q_1) q_1 - (v, q_2) q_2 - \dots - (v, q_r) q_r$$
is orthogonal to all the vectors $q_1, q_2, \dots, q_r$.

If $\dim V = m$ then it follows that any orthonormal set of $m$ vectors is a basis, and so $u$ is orthogonal to all the vectors in the basis and hence to every vector in $V$, in particular to itself. Thus $0 = (u, u) = \|u\|^2$. It follows that $u = 0$ and hence we have
$$v = (v, q_1) q_1 + (v, q_2) q_2 + \dots + (v, q_m) q_m$$

Proof: We verify this simply by taking the inner product with any of the $q_i$'s. We get
$$(q_i, u) = (q_i, v - (v, q_1) q_1 - (v, q_2) q_2 - \dots - (v, q_i) q_i - \dots - (v, q_r) q_r)$$
$$= (q_i, v) - (q_1, v)(q_i, q_1) - (q_2, v)(q_i, q_2) - \dots - (q_i, v)(q_i, q_i) - \dots - (q_r, v)(q_i, q_r)$$
$$= (q_i, v) - (q_i, v) \|q_i\|^2 = 0$$
since $\|q_i\| = 1$.

The vector $(v, q_1) q_1 + (v, q_2) q_2 + \dots + (v, q_r) q_r$ is called the orthogonal projection of $v$ onto the subspace $\mathrm{Span}\{q_1, q_2, \dots, q_r\}$.

Theorem 3.0.4 (Gram-Schmidt) Let $v_1, v_2, \dots, v_r$ be a linearly independent set of vectors. Then there is a set of orthonormal vectors $q_1, q_2, \dots, q_r$ such that $\mathrm{Span}\{v_1, v_2, \dots, v_i\} = \mathrm{Span}\{q_1, q_2, \dots, q_i\}$ for $i = 1, 2, \dots, r$.

Proof: Let $q_1 = \dfrac{v_1}{\|v_1\|}$. Assume we have constructed $q_1, \dots, q_{i-1}$. Let $u_i = v_i - (q_1, v_i) q_1 - \dots - (q_{i-1}, v_i) q_{i-1}$. Then as we have seen $u_i$ is orthogonal to $q_1, \dots, q_{i-1}$. Now put $q_i = \dfrac{u_i}{\|u_i\|}$.
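The construction in the proof translates directly into code. A minimal sketch in NumPy (an assumption — the notes contain no code), following the formula $u_i = v_i - \sum_j (q_j, v_i) q_j$, $q_i = u_i / \|u_i\|$:

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize a list of linearly independent vectors,
    following the construction in the proof above."""
    qs = []
    for v in vectors:
        u = v.astype(float).copy()
        for q in qs:
            u -= np.dot(q, v) * q      # subtract the projection onto q
        qs.append(u / np.linalg.norm(u))
    return qs

vs = [np.array([1.0, 1.0, 1.0]),
      np.array([1.0, 0.0, 1.0]),
      np.array([0.0, 1.0, 1.0])]
qs = gram_schmidt(vs)

# The resulting q's are orthonormal.
for i, qi in enumerate(qs):
    for j, qj in enumerate(qs):
        assert np.isclose(np.dot(qi, qj), 1.0 if i == j else 0.0)
```

In floating point this classical variant can lose orthogonality for ill-conditioned inputs; numerical libraries use more stable formulations, but the mathematics is the same.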

Figure 1: Orthogonal Projection

Definition 3.0.7 Let $U \subset V$ be a subspace. We denote by $U^\perp$ the subset of vectors $u^\perp$ such that $u^\perp \in U^\perp$ if and only if $(u^\perp, u) = 0$ for all $u \in U$. $U^\perp$ is called the orthogonal complement of $U$.

Proposition 3.0.3 $U^\perp$ is a subspace, and every $v \in V$ can be written uniquely in the form $v = u + u^\perp$ with $u \in U$ and $u^\perp \in U^\perp$.

Proof: If $u^\perp_1$ and $u^\perp_2$ are vectors in $U^\perp$ then $(\alpha_1 u^\perp_1 + \alpha_2 u^\perp_2, u) = \alpha_1 (u^\perp_1, u) + \alpha_2 (u^\perp_2, u) = 0 + 0$. Thus $\alpha_1 u^\perp_1 + \alpha_2 u^\perp_2 \in U^\perp$, so $U^\perp$ is a subspace.

Let $v \in V$ and let $u_1, u_2, \dots, u_r$ be a basis of $U$. We apply the Gram-Schmidt procedure to obtain an orthonormal basis $q_1, q_2, \dots, q_r$ of $U$. Let $u^\perp = v - (q_1, v) q_1 - (q_2, v) q_2 - \dots - (q_r, v) q_r$; then $u^\perp$ is orthogonal to all the $q$'s and hence is orthogonal to all the vectors in $U$, since they are all linear combinations of the $q$'s. Let $u = (q_1, v) q_1 + (q_2, v) q_2 + \dots + (q_r, v) q_r$; then $u \in U$ and $v = u + u^\perp$.

Consider $U \cap U^\perp$: a vector $u$ in this intersection is orthogonal to every vector in $U$. But $u \in U$, so $(u, u) = 0$, which implies $u = 0$. Thus we conclude that $U \cap U^\perp = \{0\}$. Now assume $v$ can be written both as $u + u^\perp$ and as $u_1 + u^\perp_1$. Then we get $u - u_1 = u^\perp_1 - u^\perp$. The left hand side is in $U$ and the right hand side is in $U^\perp$, hence both sides lie in $U \cap U^\perp$ and are both 0, and so $u = u_1$ and $u^\perp = u^\perp_1$, which proves the uniqueness.
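The decomposition $v = u + u^\perp$ in the proof is fully constructive. A short NumPy sketch (NumPy and the particular vectors are assumptions, not from the notes): orthonormalize a basis of $U$, project, and take the remainder.

```python
import numpy as np

# U = Span{u1, u2} inside R^3.
u1 = np.array([1.0, 0.0, 1.0])
u2 = np.array([0.0, 1.0, 1.0])

# Gram-Schmidt on the basis of U.
q1 = u1 / np.linalg.norm(u1)
w = u2 - np.dot(q1, u2) * q1
q2 = w / np.linalg.norm(w)

v = np.array([3.0, -1.0, 2.0])
u = np.dot(q1, v) * q1 + np.dot(q2, v) * q2   # orthogonal projection onto U
u_perp = v - u

# u_perp is orthogonal to every basis vector of U, and v = u + u_perp.
assert np.isclose(np.dot(u_perp, u1), 0.0)
assert np.isclose(np.dot(u_perp, u2), 0.0)
assert np.allclose(u + u_perp, v)
```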

Definition 3.0.8 Let $T : V \to W$ be a linear operator. Assume both $V$ and $W$ are equipped with inner products and corresponding norms. We define the operator norm of $T$ by $\|T\| = \sup_{\|v\| = 1} \|Tv\|$.

Proposition 3.0.4 Let $u$ be any vector in $V$; then $\|Tu\| \le \|T\| \|u\|$.

Proof: It is clear if $u = 0$. If $u$ is not the zero vector, $\dfrac{u}{\|u\|}$ has norm $= 1$ and so $\|T(\frac{u}{\|u\|})\| \le \|T\|$. Since $T(\frac{u}{\|u\|}) = \dfrac{Tu}{\|u\|}$ the inequality follows.

Definition 3.0.9 A linear transformation $T : V \to W$ is said to be orthogonal (unitary in the complex case) if $(Tu, Tv) = (u, v)$, i.e. $T$ preserves the inner products.

Clearly an orthogonal transformation also preserves norms, and so $\|Tv\| = 1$ if $\|v\| = 1$. It follows that an orthogonal transformation has $\|T\| = 1$.

Proposition 3.0.5 An orthogonal transformation $T : V \to W$ is 1-1.

Proof: Since $\|Tu\| = \|u\|$ it is clear that if $Tu = 0$ we have $u = 0$. Thus $\ker T = 0$. Now if $Tu = Tv$ we have $T(u - v) = 0$, hence $u - v = 0$, so $u = v$.

Definition 3.0.10 An $m \times m$ matrix $Q$ is said to be an orthogonal matrix if the columns of $Q$ form an orthonormal basis of $\mathbb{R}^m$.

Theorem 3.0.5 An $m \times m$ matrix $Q$ is orthogonal if and only if ${}^tQ \cdot Q = Q \cdot {}^tQ = E_m$, where ${}^tQ$ is the transposed matrix, i.e. the $ij$th entry in ${}^tQ$ is the $ji$th entry in $Q$.

Proof: Let
$$Q = \begin{pmatrix} q_{11} & q_{12} & \dots & q_{1m} \\ q_{21} & q_{22} & \dots & q_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ q_{m1} & q_{m2} & \dots & q_{mm} \end{pmatrix}$$
We shall write this as $Q = \begin{pmatrix} q_1 & q_2 & \dots & q_m \end{pmatrix}$ where $q_j$ is the $j$th column. Then ${}^tQ$ has the rows ${}^tq_1, {}^tq_2, \dots, {}^tq_m$ and
$$ {}^tQ \cdot Q = \begin{pmatrix} {}^tq_1 \\ {}^tq_2 \\ \vdots \\ {}^tq_m \end{pmatrix} \begin{pmatrix} q_1 & q_2 & \dots & q_m \end{pmatrix} = \begin{pmatrix} (q_1, q_1) & (q_1, q_2) & \dots & (q_1, q_m) \\ (q_2, q_1) & (q_2, q_2) & \dots & (q_2, q_m) \\ \vdots & \vdots & (q_i, q_j) & \vdots \\ (q_m, q_1) & (q_m, q_2) & \dots & (q_m, q_m) \end{pmatrix}$$
If $q_1, q_2, \dots, q_m$ is an orthonormal basis then $(q_i, q_j) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \ne j \end{cases}$ and so we get that ${}^tQ \cdot Q = E_m$.

To go the other way: if ${}^tQ \cdot Q = E_m$ then we get from the expression for the matrix product that $(q_i, q_j) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \ne j \end{cases}$, which precisely means that the $q$'s form an orthonormal set, and since there are $m$ of them it is a basis.

Example 3.2 Let Q be the coordinate transformation matrix between two

orthonormal bases. Then Q is an orthogonal matrix
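A plane rotation is the simplest concrete instance. The NumPy sketch below (NumPy and the angle are illustration assumptions) checks both characterizations: ${}^tQ \cdot Q = E_2$ and preservation of inner products.

```python
import numpy as np

# A rotation matrix has orthonormal columns, so it is orthogonal.
theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# ^tQ . Q = E_2 and Q . ^tQ = E_2.
assert np.allclose(Q.T @ Q, np.eye(2))
assert np.allclose(Q @ Q.T, np.eye(2))

# Orthogonal transformations preserve inner products: (Qu, Qv) = (u, v).
u, v = np.array([1.0, 2.0]), np.array([-3.0, 0.5])
assert np.isclose((Q @ u) @ (Q @ v), u @ v)
```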

Theorem 3.0.6 Let $T : V \to W$ be a linear transformation. Then there exists a unique linear transformation $T^* : W \to V$, called the adjoint transformation, satisfying $(Tv, w) = (v, T^*w)$ for all $v \in V$, $w \in W$.

Proof: Fix $w \in W$ and consider the linear functional $v \mapsto (Tv, w)$ on $V$. By the Riesz representation theorem (Theorem 3.0.2) there exists a unique vector $w' \in V$ such that $(Tv, w) = (v, w')$ for all $v$. We define $T^*$ by $T^*w = w'$. This is a well-defined map and by construction it satisfies $(Tv, w) = (v, T^*w)$. It has to be verified that $T^*$ is a linear transformation.

We have $(v, T^*(\alpha w)) = (Tv, \alpha w) = \alpha (Tv, w) = \alpha (v, T^*w) = (v, \alpha T^*w)$ for all $v$. Hence $(v, T^*(\alpha w) - \alpha T^*(w)) = 0$ for all $v$, and taking $v = T^*(\alpha w) - \alpha T^*(w)$, positive definiteness gives $T^*(\alpha w) = \alpha T^*(w)$. Similarly $T^*(w_1 + w_2) = T^*w_1 + T^*w_2$.

Theorem 3.0.7 Let $V = \mathbb{R}^m$ ($\mathbb{C}^m$) and $W = \mathbb{R}^n$ ($\mathbb{C}^n$). Let $A$ be the matrix of $T$ with respect to the standard bases. Then the matrix of $T^*$ w.r.t. the standard bases is the transposed ${}^tA$ in the real case, resp. ${}^t\bar{A}$, the transposed with all entries complex conjugated, in the complex case.

Proof: The $j$th column of the matrix of $T^*$ is the coordinates of $T^*e_j$, i.e. $T^*e_j = \alpha_{1j} e_1 + \alpha_{2j} e_2 + \dots + \alpha_{mj} e_m$. Then we get $\alpha_{ij} = (e_i, \alpha_{1j} e_1 + \alpha_{2j} e_2 + \dots + \alpha_{mj} e_m) = (e_i, T^*e_j) = (Te_i, e_j)$. Now $Te_i = a_{1i} e_1 + a_{2i} e_2 + \dots + a_{ni} e_n$, so $(Te_i, e_j) = a_{ji}$ and so $\alpha_{ij} = a_{ji}$, which proves the result.
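The defining identity $(Tv, w) = (v, T^*w)$ becomes $(Av) \cdot w = v \cdot ({}^tA w)$ in the real standard-basis case, which is easy to spot-check. A NumPy sketch with random illustration data (NumPy is an assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))   # matrix of T : R^4 -> R^3

# (Tv, w) = (v, T*w) becomes (A v) . w = v . (A^T w) for all v, w.
for _ in range(20):
    v = rng.normal(size=4)
    w = rng.normal(size=3)
    assert np.isclose((A @ v) @ w, v @ (A.T @ w))
```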

Definition 3.0.11 An $m \times m$ square matrix $A$ is called symmetric resp. hermitian if ${}^tA = A$ resp. ${}^t\bar{A} = A$. Remark that a hermitian matrix has real entries in the diagonal.

4 Eigenvectors and Eigenvalues

Definition 4.0.12 Let $T : V \to V$ be a linear transformation. A non-zero vector $v$ is said to be an eigenvector if there is a number $\lambda$ such that $Tv = \lambda v$. The number $\lambda$ is called the eigenvalue corresponding to the eigenvector $v$.

Definition 4.0.13 Let $\lambda$ be an eigenvalue; the subset $\{v \mid Tv = \lambda v\}$ is called the eigenspace associated to $\lambda$. It is a subspace of $V$; in fact it is equal to $\ker(T - \lambda I)$, where $I$ denotes the identity map, which is obviously a linear transformation.

Proposition 4.0.6 Let $\lambda_1, \lambda_2, \dots, \lambda_r$ be distinct eigenvalues with eigenvectors $v_1, v_2, \dots, v_r$. Then these eigenvectors are linearly independent.

Proof: We use induction on $r$. Since an eigenvector is not the zero-vector it is true for $r = 1$. Now assume it is true for $r - 1$ vectors. Assume that $c_1 v_1 + c_2 v_2 + \dots + c_r v_r = 0$. We have to show that all the $c$'s are 0. Applying the linear transformation $T$ we get
$$c_1 \lambda_1 v_1 + c_2 \lambda_2 v_2 + \dots + c_r \lambda_r v_r = 0$$
Multiplying the linear dependence equation by $\lambda_1$ we get
$$c_1 \lambda_1 v_1 + c_2 \lambda_1 v_2 + \dots + c_r \lambda_1 v_r = 0$$
Comparing these two equations we see that the first terms are equal, so subtracting one from the other the first terms cancel and we get
$$c_2 (\lambda_2 - \lambda_1) v_2 + \dots + c_r (\lambda_r - \lambda_1) v_r = 0$$
By the induction hypothesis the vectors $v_2, v_3, \dots, v_r$ are linearly independent and so all the coefficients $c_i (\lambda_i - \lambda_1) = 0$ for $i = 2, 3, \dots, r$. Since the eigenvalues are distinct, $\lambda_i - \lambda_1 \ne 0$ for $i = 2, 3, \dots, r$, and so we must have $c_2 = c_3 = \dots = c_r = 0$. But then the only term left is $c_1 v_1 = 0$ and so also $c_1 = 0$.

Corollary 4.0.4 Assume $T$ has $m = \dim V$ distinct eigenvalues with eigenvectors $v_1, v_2, \dots, v_m$; then the eigenvectors form a basis of $V$.

If we compute the matrix of $T$ with respect to a basis of eigenvectors $v_1, v_2, \dots, v_m$ we have $Tv_i = \lambda_i v_i$, so the matrix is the diagonal matrix
$$D = \begin{pmatrix} \lambda_1 & 0 & 0 & \dots & 0 \\ 0 & \lambda_2 & 0 & \dots & 0 \\ 0 & 0 & \lambda_3 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & \lambda_m \end{pmatrix}$$

4.1 Determinants

Let $A$ be a square matrix. We can associate to $A$ a number $\det A$ called the determinant of $A$.

The determinant has a number of properties, some of which we list below:

1. $\det A \ne 0$ if and only if $A$ has an inverse matrix

2. The determinant does not change if we change $A$ by adding a multiple of a column of $A$ to another column

3. If we change $A$ by switching two columns the determinant changes sign

4. Changing $A$ by multiplying a row by a number $\alpha$ multiplies the determinant by $\alpha$

5. The determinant of a triangular matrix (a matrix where all the entries below the diagonal are 0) is the product of the diagonal entries

6. If $A$ and $B$ are $m \times m$ matrices, $\det(A \cdot B) = \det A \det B$

7. $\det {}^tA = \det A$

Example 4.1 If $A$ is a $2 \times 2$ matrix $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$ we have $\det A = ad - bc$.
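Properties 2, 3, 6 and 7 are easy to confirm numerically. A NumPy sketch (NumPy and the matrices are illustration assumptions):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [-2.0, 5.0]])

# det of a 2x2 matrix is ad - bc.
assert np.isclose(np.linalg.det(A), 1 * 4 - 2 * 3)

# det(A B) = det(A) det(B) and det(^tA) = det(A).
assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))
assert np.isclose(np.linalg.det(A.T), np.linalg.det(A))

# Adding a multiple of one column to another does not change the determinant.
A2 = A.copy()
A2[:, 1] += 5 * A2[:, 0]
assert np.isclose(np.linalg.det(A2), np.linalg.det(A))

# Swapping the two columns changes the sign.
assert np.isclose(np.linalg.det(A[:, ::-1]), -np.linalg.det(A))
```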

Example 4.2 Let $T : \mathbb{R}^m \to \mathbb{R}^m$ be a linear transformation and let $A$ be its matrix w.r.t. the standard basis. Assume $\mathbb{R}^m$ has a basis of eigenvectors $v_1, v_2, \dots, v_m$, so that the matrix of $T$ with respect to this basis is the diagonal matrix
$$D = \begin{pmatrix} \lambda_1 & 0 & 0 & \dots & 0 \\ 0 & \lambda_2 & 0 & \dots & 0 \\ 0 & 0 & \lambda_3 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & \lambda_m \end{pmatrix}$$
If $C$ is the coordinate transformation matrix between the basis of eigenvectors and the standard basis we have $D = C^{-1} \cdot A \cdot C$, and so $\det D = \det(C^{-1} \cdot A \cdot C) = \det(C^{-1}) \det(A) \det(C) = \det(A) \det(C^{-1}) \det(C) = \det(A) \det(C^{-1} \cdot C) = \det(A) \det(E_m) = \det(A)$. Now $\det D = \lambda_1 \lambda_2 \dots \lambda_m$, the product of the eigenvalues.

Consider for a number $\lambda$ the matrix $A - \lambda E_m$. This is the matrix of the linear transformation $T - \lambda I$. Now a non-zero vector $v$ is an eigenvector with eigenvalue $\lambda$ if and only if $(T - \lambda I)(v) = 0$. Thus $\lambda$ is an eigenvalue if and only if $\ker(T - \lambda I) \ne \{0\}$. If $\ker(T - \lambda I) = \{0\}$ then we have $\dim(\mathrm{Im}(T - \lambda I)) = m$, so $T - \lambda I$ is onto. Also $\ker(T - \lambda I) = \{0\}$ implies that $T - \lambda I$ is 1-1, and so $T - \lambda I$ is a bijection and hence has an inverse map. This means that the matrix $A - \lambda E_m$ is invertible and so $\det(A - \lambda E_m) \ne 0$. This shows that $\lambda$ is an eigenvalue precisely when $\det(A - \lambda E_m) = 0$.

It can be shown that $\det(A - \lambda E_m) = (-1)^m \lambda^m + \dots$ is a polynomial in $\lambda$ of degree $m$. This polynomial is called the characteristic polynomial of $A$. Thus the eigenvalues are precisely the roots of the characteristic polynomial.

Example 4.3 Consider $A = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$. To find the eigenvalues we compute
$$\det \begin{pmatrix} 1 - \lambda & -1 \\ -1 & 1 - \lambda \end{pmatrix} = (1 - \lambda)^2 - 1$$
Putting this determinant equal to 0 we get $(1 - \lambda)^2 = 1$, or $1 - \lambda = \pm 1$. Hence the eigenvalues are $\lambda_1 = 0$ and $\lambda_2 = 2$.

To find eigenvectors we have to solve the systems of equations
$$x_1 - x_2 = 0, \qquad -x_1 + x_2 = 0$$
and
$$x_1 - x_2 = 2x_1, \qquad -x_1 + x_2 = 2x_2$$
The first set of equations gives $x_1 = x_2$, so for instance $\begin{pmatrix} 1 \\ 1 \end{pmatrix}$ is an eigenvector with eigenvalue 0. The second set gives $x_1 = -x_2$, so $\begin{pmatrix} 1 \\ -1 \end{pmatrix}$ is an eigenvector with eigenvalue 2. These two vectors form a basis of $\mathbb{R}^2$ and the matrix with respect to this basis is the diagonal matrix $\begin{pmatrix} 0 & 0 \\ 0 & 2 \end{pmatrix}$.
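The example above can be reproduced numerically. A NumPy sketch (NumPy is an assumption; `eigh` is its routine for symmetric matrices and returns eigenvalues in ascending order):

```python
import numpy as np

A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

eigvals, eigvecs = np.linalg.eigh(A)
assert np.allclose(eigvals, [0.0, 2.0])

# Each column of eigvecs is an eigenvector: A v = lambda v.
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)

# Diagonalization: C^{-1} A C is the diagonal matrix of eigenvalues.
C = eigvecs
assert np.allclose(np.linalg.inv(C) @ A @ C, np.diag(eigvals))
```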

It is not true for an arbitrary matrix that there exists a basis of eigenvectors, i.e. we can't always find a basis such that the matrix becomes a diagonal matrix. It is however true if the matrix is symmetric in the real case and hermitian in the complex case. Indeed we have the following theorem.

Theorem 4.1.1 (The Spectral Theorem) Let $A$ be a symmetric or hermitian matrix. Then there exists an orthonormal basis of eigenvectors for $A$.

Let $q_1, q_2, \dots, q_m$ be an orthonormal basis of eigenvectors for the symmetric or hermitian matrix $A$. If $Q$ is the coordinate transformation matrix between the standard basis and the basis consisting of the $q$'s, then
$$Q^{-1} \cdot A \cdot Q = D$$
where $D$ is the diagonal matrix with the eigenvalues in the diagonal. Thus
$$A = Q \cdot D \cdot Q^{-1}$$
We call this the eigenvalue decomposition of $A$.

For the remainder of these notes we shall, except where specifically noted, assume we are in the real case.

5 The Singular Value Decomposition (SVD)

Definition 5.0.1 Let $A$ be an $m \times n$ matrix. A singular value and corresponding singular vectors consist of a non-negative real number $\sigma$, a vector $u \in \mathbb{R}^m$ and a vector $v \in \mathbb{R}^n$ such that
$$Av = \sigma u \qquad {}^tA u = \sigma v$$

Remark 5.1 If $A$ is symmetric the singular values are the absolute values of the eigenvalues; indeed if $\lambda$ is an eigenvalue and $v$ an eigenvector for $\lambda$ then we have $Av = \lambda v$. If $\lambda < 0$ put $u = -v$; then $Av = (-\lambda)(-v) = |\lambda| u$ and ${}^tAu = Au = -\lambda v = |\lambda| v$.

Definition 5.0.2 Let $A$ be an $m \times n$ matrix. A Singular Value Decomposition of $A$ is a factorization
$$A = U \cdot \Sigma \cdot V$$
where $U$ is an orthogonal $m \times m$ matrix, $V$ is an orthogonal $n \times n$ matrix and $\Sigma$ is an $m \times n$ diagonal matrix with non-negative diagonal entries.
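Numerical libraries compute this factorization directly. The NumPy sketch below (NumPy and the random matrix are assumptions) checks the shapes and properties in the definition; note that `numpy.linalg.svd` returns the right orthogonal factor already in the position that these notes call $V$ (many texts instead write $A = U\Sigma V^T$).

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 3))     # an arbitrary m x n matrix, m = 4, n = 3

U, s, Vh = np.linalg.svd(A)     # A = U . Sigma . Vh

Sigma = np.zeros((4, 3))        # embed the singular values in an m x n
Sigma[:3, :3] = np.diag(s)      # diagonal matrix

assert np.allclose(U @ Sigma @ Vh, A)        # the factorization
assert np.allclose(U.T @ U, np.eye(4))       # U orthogonal
assert np.allclose(Vh @ Vh.T, np.eye(3))     # right factor orthogonal
assert np.all(s >= 0)                        # non-negative diagonal entries
```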

Proposition 5.0.1 Let $A = U \cdot \Sigma \cdot V$ be a Singular Value Decomposition. Then the diagonal entries in $\Sigma$ are singular values.

Proof: Assume $m \ge n$ and let
$$\Sigma = \begin{pmatrix} \sigma_1 & 0 & \dots & 0 \\ 0 & \sigma_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma_n \\ 0 & 0 & \dots & 0 \\ \vdots & \vdots & & \vdots \end{pmatrix}$$
Let us show that $\sigma_1$ is a singular value (the other entries are completely similar). Consider the basis vector $e_1 \in \mathbb{R}^n$ and put $v = {}^tV e_1$. Thus $v$ is the first row in $V$ = first column in ${}^tV$. Also, because $V$ is orthogonal, ${}^tV = V^{-1}$. Put $u = Ue_1$ (where $e_1$ now means the first canonical basis vector in $\mathbb{R}^m$). Then we have
$$Av = U \cdot \Sigma \cdot V v = U \cdot \Sigma e_1 = U \sigma_1 e_1 = \sigma_1 u$$
Also ${}^tA = {}^tV \cdot {}^t\Sigma \cdot {}^tU$, so we get
$$ {}^tA u = {}^tV \cdot {}^t\Sigma \cdot {}^tU (Ue_1) = {}^tV \cdot {}^t\Sigma e_1 = {}^tV \sigma_1 e_1 = \sigma_1 v$$

Consider a singular value decomposition $A = U \cdot \Sigma \cdot V$ and let $v_j$ be the $j$th row in $V$. Thus $v_j = {}^tV e_j$ and $V v_j = e_j$; also $v_1, v_2, \dots, v_n$ is an orthonormal basis of $\mathbb{R}^n$. Similarly let $u_j$ be the $j$th column in $U$, so $u_j = U e_j$, and $u_1, u_2, \dots, u_m$ is an orthonormal basis of $\mathbb{R}^m$. Now
$$A v_j = U \cdot \Sigma \cdot V v_j = U \cdot \Sigma e_j = U (\sigma_j e_j) = \sigma_j U e_j = \sigma_j u_j$$
Let $T : \mathbb{R}^n \to \mathbb{R}^m$ be the linear map whose matrix with respect to the standard bases is $A$. Then the existence of a singular value decomposition is equivalent to saying that we can find an orthonormal basis of $\mathbb{R}^n$ (the $v$'s) and an orthonormal basis of $\mathbb{R}^m$ (the $u$'s) such that the matrix of $T$ with respect to these bases is a diagonal matrix.

Theorem 5.0.2 Every $m \times n$ matrix $A$ has a singular value decomposition.

Proof: Consider the $n \times n$ matrix ${}^tA \cdot A$. This is a symmetric matrix because ${}^t({}^tA \cdot A) = {}^tA \cdot {}^t({}^tA) = {}^tA \cdot A$. Hence by the spectral theorem we can find an orthogonal $n \times n$ matrix $V$ such that $V \cdot ({}^tA \cdot A) \cdot {}^tV = D$, where $D$ is the diagonal matrix with the eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_n$ of ${}^tA \cdot A$ in the diagonal. We assume that we put all the 0 eigenvalues at the end, i.e.
$$D = \begin{pmatrix} \lambda_1 & & & & & \\ & \lambda_2 & & & & \\ & & \ddots & & & \\ & & & \lambda_r & & \\ & & & & 0 & \\ & & & & & \ddots \end{pmatrix}$$
and $\lambda_1, \lambda_2, \dots, \lambda_r$ are non-zero.

Then we have ${}^t(A \cdot {}^tV) \cdot (A \cdot {}^tV) = V \cdot ({}^tA \cdot A) \cdot {}^tV = D$. Let
$$B = A \cdot {}^tV = \begin{pmatrix} b_{11} & b_{12} & \dots & b_{1n} \\ b_{21} & b_{22} & \dots & b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ b_{m1} & b_{m2} & \dots & b_{mn} \end{pmatrix} = \begin{pmatrix} b_1 \mid b_2 \mid \dots \mid b_n \end{pmatrix}$$
where the $b_j \in \mathbb{R}^m$ are the columns of $B$. Since ${}^tB \cdot B = D$ is a diagonal matrix we have
$$(b_i, b_j) = \begin{cases} 0 & \text{if } i \ne j \\ \|b_j\|^2 & \text{if } i = j \end{cases}$$
Thus $\|b_i\|^2 = \lambda_i$ for $i = 1, 2, \dots, n$, and since $\lambda_{r+1} = \dots = \lambda_n = 0$ we have $b_{r+1} = \dots = b_n = 0$ and $b_1, b_2, \dots, b_r$ are $\ne 0$. Put $u_i = b_i / \|b_i\|$ for $i = 1, 2, \dots, r$. Then the $u_i$ are orthonormal and we can supplement them up to an orthonormal basis $u_1, \dots, u_r, u_{r+1}, \dots, u_m$. Let $U$ be the orthogonal $m \times m$ matrix with the $u$'s as columns. Then we have
$$ {}^tU \cdot B = \begin{pmatrix} {}^tu_1 \\ {}^tu_2 \\ \vdots \\ {}^tu_m \end{pmatrix} \begin{pmatrix} b_1 \mid b_2 \mid \dots \mid b_r \mid 0 \mid \dots \mid 0 \end{pmatrix} = \begin{pmatrix} (u_1, b_1) & (u_1, b_2) & \dots & (u_1, b_r) & 0 & \dots & 0 \\ (u_2, b_1) & (u_2, b_2) & \dots & (u_2, b_r) & 0 & \dots & 0 \\ \vdots & \vdots & & \vdots & \vdots & & \vdots \\ (u_m, b_1) & (u_m, b_2) & \dots & (u_m, b_r) & 0 & \dots & 0 \end{pmatrix}$$
But $u_i = b_i / \|b_i\|$ for $i = 1, 2, \dots, r$ and the $u$'s are orthonormal, so
$$(u_i, b_j) = \begin{cases} 0 & \text{for } i \ne j \\ (b_i / \|b_i\|, b_i) = \|b_i\|^2 / \|b_i\| = \|b_i\| & \text{for } i = j \end{cases}$$
Thus we get
$$ {}^tU \cdot B = {}^tU \cdot A \cdot {}^tV = \begin{pmatrix} \|b_1\| & 0 & \dots & \dots & 0 \\ 0 & \|b_2\| & \dots & \dots & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \dots & \|b_r\| & \dots \\ 0 & 0 & \dots & 0 & \dots \\ \vdots & & & & \end{pmatrix} = \Sigma$$
Thus $A = U \cdot \Sigma \cdot V$.

Notice that $\lambda_i = \|b_i\|^2$, so the singular values are $\sigma_i = \|b_i\| = \sqrt{\lambda_i}$ for $i = 1, 2, \dots, r$.

Example 5.1 Consider the $2 \times 3$ matrix $A = \begin{pmatrix} 1 & 1 & 1 \\ 3 & 2 & 2 \end{pmatrix}$. Find the singular values and the singular value decomposition.

We first form the $2 \times 2$ matrix
$$A \cdot {}^tA = \begin{pmatrix} 1 & 1 & 1 \\ 3 & 2 & 2 \end{pmatrix} \begin{pmatrix} 1 & 3 \\ 1 & 2 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} 3 & 7 \\ 7 & 17 \end{pmatrix}$$
The characteristic polynomial is $\det \begin{pmatrix} 3 - \lambda & 7 \\ 7 & 17 - \lambda \end{pmatrix} = (3 - \lambda)(17 - \lambda) - 49 = \lambda^2 - 20\lambda + 2$. Thus the eigenvalues of $A \cdot {}^tA$ are $10 \pm 7\sqrt{2}$ and the singular values are $\sqrt{10 \pm 7\sqrt{2}}$, so
$$\Sigma = \begin{pmatrix} \sqrt{10 + 7\sqrt{2}} & 0 & 0 \\ 0 & \sqrt{10 - 7\sqrt{2}} & 0 \end{pmatrix}$$
To find the matrix $V$ we have to find an orthonormal basis of eigenvectors, i.e. we have to find solutions to the two systems of linear equations
$$3x_1 + 7x_2 = (10 + 7\sqrt{2}) x_1, \qquad 7x_1 + 17x_2 = (10 + 7\sqrt{2}) x_2$$
and
$$3x_1 + 7x_2 = (10 - 7\sqrt{2}) x_1, \qquad 7x_1 + 17x_2 = (10 - 7\sqrt{2}) x_2$$
From the first system we get $x_2 = (1 + \sqrt{2}) x_1$, hence a normalized eigenvector for the eigenvalue $10 + 7\sqrt{2}$ is
$$v_1 = \left( \frac{1}{\sqrt{4 + 2\sqrt{2}}}, \frac{1 + \sqrt{2}}{\sqrt{4 + 2\sqrt{2}}} \right)$$
Similarly from the second system we get $x_2 = (1 - \sqrt{2}) x_1$ and hence a normalized eigenvector for $10 - 7\sqrt{2}$ is
$$v_2 = \left( \frac{1}{\sqrt{4 - 2\sqrt{2}}}, \frac{1 - \sqrt{2}}{\sqrt{4 - 2\sqrt{2}}} \right)$$
Remark that these two eigenvectors are automatically orthogonal because they belong to different eigenvalues.

The matrix $V$ is given by
$$V = \begin{pmatrix} \frac{1}{\sqrt{4 + 2\sqrt{2}}} & \frac{1}{\sqrt{4 - 2\sqrt{2}}} \\ \frac{1 + \sqrt{2}}{\sqrt{4 + 2\sqrt{2}}} & \frac{1 - \sqrt{2}}{\sqrt{4 - 2\sqrt{2}}} \end{pmatrix}$$
Next we have to compute the matrix $U$. We first compute
$$B = {}^tV \cdot A = \begin{pmatrix} \frac{4 + 3\sqrt{2}}{\sqrt{4 + 2\sqrt{2}}} & \frac{3 + 2\sqrt{2}}{\sqrt{4 + 2\sqrt{2}}} & \frac{3 + 2\sqrt{2}}{\sqrt{4 + 2\sqrt{2}}} \\ \frac{4 - 3\sqrt{2}}{\sqrt{4 - 2\sqrt{2}}} & \frac{3 - 2\sqrt{2}}{\sqrt{4 - 2\sqrt{2}}} & \frac{3 - 2\sqrt{2}}{\sqrt{4 - 2\sqrt{2}}} \end{pmatrix}$$
It is clear that the vector $(0, 1, -1)$ is orthogonal to the two rows in $B$, hence if we normalize these three vectors and put them in as rows we get an orthogonal matrix. The norm of the first row is $\frac{2\sqrt{17 + 12\sqrt{2}}}{\sqrt{4 + 2\sqrt{2}}} = \sqrt{10 + 7\sqrt{2}}$, the norm of the second row is $\frac{2\sqrt{17 - 12\sqrt{2}}}{\sqrt{4 - 2\sqrt{2}}} = \sqrt{10 - 7\sqrt{2}}$, and the third $\sqrt{2}$. Hence the matrix
$$U = \begin{pmatrix} \frac{4 + 3\sqrt{2}}{\sqrt{4 + 2\sqrt{2}}\sqrt{10 + 7\sqrt{2}}} & \frac{3 + 2\sqrt{2}}{\sqrt{4 + 2\sqrt{2}}\sqrt{10 + 7\sqrt{2}}} & \frac{3 + 2\sqrt{2}}{\sqrt{4 + 2\sqrt{2}}\sqrt{10 + 7\sqrt{2}}} \\ \frac{4 - 3\sqrt{2}}{\sqrt{4 - 2\sqrt{2}}\sqrt{10 - 7\sqrt{2}}} & \frac{3 - 2\sqrt{2}}{\sqrt{4 - 2\sqrt{2}}\sqrt{10 - 7\sqrt{2}}} & \frac{3 - 2\sqrt{2}}{\sqrt{4 - 2\sqrt{2}}\sqrt{10 - 7\sqrt{2}}} \\ 0 & \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \end{pmatrix}$$
is an orthogonal matrix.

Since the $(i, j)$ entry of $B \cdot {}^tU$ is the inner product of the $i$th row of $B$ with the $j$th row of $U$, we get
$$ {}^tV \cdot A \cdot {}^tU = B \cdot {}^tU = \begin{pmatrix} \sqrt{10 + 7\sqrt{2}} & 0 & 0 \\ 0 & \sqrt{10 - 7\sqrt{2}} & 0 \end{pmatrix} = \Sigma$$
and so
$$A = V \cdot \Sigma \cdot U$$
is the singular value decomposition.
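The singular values found by hand in this example can be confirmed numerically. A NumPy sketch (NumPy is an assumption; `compute_uv=False` returns only the singular values, in descending order):

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0],
              [3.0, 2.0, 2.0]])

s = np.linalg.svd(A, compute_uv=False)

# The worked example found singular values sqrt(10 +/- 7*sqrt(2)).
expected = np.sqrt(np.array([10 + 7 * np.sqrt(2), 10 - 7 * np.sqrt(2)]))
assert np.allclose(s, expected)
```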

Homework

Problem 1. Find the singular value decompositions of the matrices
$$\begin{pmatrix} 3 & 0 \\ 0 & 2 \end{pmatrix}, \quad \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}, \quad \begin{pmatrix} 0 & 2 \\ 0 & 0 \\ 0 & 0 \end{pmatrix}, \quad \begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix}$$
Problem 2. Two $n \times n$ matrices $A$ and $B$ are said to be orthogonally equivalent if there exists an $n \times n$ orthogonal matrix $Q$ such that $B = Q \cdot A \cdot {}^tQ$. Is it true or false that $A$ and $B$ are orthogonally equivalent if and only if they have the same singular values?

6 The QR Decomposition

Consider an $m \times n$ ($m \ge n$) matrix
$$A = \begin{pmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \dots & a_{mn} \end{pmatrix}$$
Let $a_j$ denote the $j$th column of $A$, so $A = \begin{pmatrix} a_1 & a_2 & a_3 & \dots & a_n \end{pmatrix}$. Consider the sequence of subspaces of $\mathbb{R}^m$
$$\mathrm{Span}\{a_1\} \subset \mathrm{Span}\{a_1, a_2\} \subset \mathrm{Span}\{a_1, a_2, a_3\} \subset \dots \subset \mathrm{Span}\{a_1, a_2, a_3, \dots, a_i\} \subset \dots$$
At the last step we have the subspace spanned by the columns of $A$, i.e. $\mathrm{Im}\,A \subset \mathbb{R}^m$, the image of the linear map defined by $A$.

Assume first that the image of $A$ has dimension $n$, i.e. that the columns are linearly independent.

We can then apply the Gram-Schmidt algorithm to the linearly independent vectors $a_1, a_2, \dots, a_n$, and we get an orthonormal system of vectors in $\mathbb{R}^m$, $q_1, q_2, \dots, q_n$. From the algorithm, at each step the vectors $q_1, q_2, \dots, q_j$ span the subspace $\mathrm{Span}\{a_1, a_2, \dots, a_j\}$, so we can find real numbers $r_{1j}, r_{2j}, \dots, r_{jj}$ such that
$$a_j = r_{1j} q_1 + r_{2j} q_2 + \dots + r_{jj} q_j$$
In fact the Gram-Schmidt algorithm constructs in the $j$th step a vector $v_j = a_j - (a_j, q_1) q_1 - (a_j, q_2) q_2 - \dots - (a_j, q_{j-1}) q_{j-1}$, so by construction $v_j$ is orthogonal to $q_1, q_2, \dots, q_{j-1}$, and we get $q_j$ by normalizing $v_j$: $q_j = \dfrac{v_j}{\|v_j\|}$.

Thus
$$a_j = (a_j, q_1) q_1 + (a_j, q_2) q_2 + \dots + (a_j, q_{j-1}) q_{j-1} + \|v_j\| q_j$$
hence $r_{ij} = (a_j, q_i)$ for $i \le j$.

Consider the matrix $\hat{Q}$ whose columns are the vectors $q_1, q_2, \dots, q_n$ and consider the upper triangular matrix
$$\hat{R} = \begin{pmatrix} r_{11} & r_{12} & r_{13} & \dots & r_{1n} \\ 0 & r_{22} & r_{23} & \dots & r_{2n} \\ 0 & 0 & r_{33} & \dots & r_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & r_{nn} \end{pmatrix}$$
The matrix product is
$$\hat{Q} \cdot \hat{R} = \begin{pmatrix} q_1 & q_2 & q_3 & \dots & q_n \end{pmatrix} \begin{pmatrix} r_{11} & r_{12} & r_{13} & \dots & r_{1n} \\ 0 & r_{22} & r_{23} & \dots & r_{2n} \\ 0 & 0 & r_{33} & \dots & r_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & r_{nn} \end{pmatrix}$$
$$= \begin{pmatrix} r_{11} q_1 & r_{12} q_1 + r_{22} q_2 & r_{13} q_1 + r_{23} q_2 + r_{33} q_3 & \dots \end{pmatrix} = \begin{pmatrix} a_1 & a_2 & a_3 & \dots \end{pmatrix} = A$$
This is called the Reduced QR decomposition of $A$.

Example 6.1 Let $A$ be the $3 \times 3$ matrix $\begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{pmatrix}$. Find the reduced QR-decomposition of $A$.

We begin by normalizing the first column of $A$, $a_1 = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}$. The norm is $\|a_1\| = \sqrt{3}$, so $q_1 = \begin{pmatrix} \frac{1}{\sqrt{3}} \\ \frac{1}{\sqrt{3}} \\ \frac{1}{\sqrt{3}} \end{pmatrix}$. This is then the first column of $\hat{Q}$. The $(1, 1)$ entry in $\hat{R}$ is $\sqrt{3}$.

To find the second column in $\hat{Q}$ we use Gram-Schmidt: put $v_2 = a_2 - (a_2, q_1) q_1$. Then $v_2$ is orthogonal to $q_1$:
$$v_2 = \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix} - \frac{2}{\sqrt{3}} \begin{pmatrix} \frac{1}{\sqrt{3}} \\ \frac{1}{\sqrt{3}} \\ \frac{1}{\sqrt{3}} \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix} - \begin{pmatrix} \frac{2}{3} \\ \frac{2}{3} \\ \frac{2}{3} \end{pmatrix} = \begin{pmatrix} \frac{1}{3} \\ -\frac{2}{3} \\ \frac{1}{3} \end{pmatrix}$$
We normalize $v_2$: the norm is $\sqrt{\frac{1}{9} + \frac{4}{9} + \frac{1}{9}} = \sqrt{\frac{2}{3}}$. Thus $q_2 = \begin{pmatrix} \frac{1}{\sqrt{6}} \\ -\frac{2}{\sqrt{6}} \\ \frac{1}{\sqrt{6}} \end{pmatrix}$. Then $a_2 = \frac{2}{\sqrt{3}} q_1 + \sqrt{\frac{2}{3}} q_2$ and the second column in $\hat{Q}$ is $q_2$. The second column in $\hat{R}$ is $\begin{pmatrix} \frac{2}{\sqrt{3}} \\ \sqrt{\frac{2}{3}} \\ 0 \end{pmatrix}$.

The next step in the Gram-Schmidt algorithm puts
$$v_3 = a_3 - (a_3, q_1) q_1 - (a_3, q_2) q_2 = \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix} - \frac{2}{\sqrt{3}} \begin{pmatrix} \frac{1}{\sqrt{3}} \\ \frac{1}{\sqrt{3}} \\ \frac{1}{\sqrt{3}} \end{pmatrix} - \left( -\frac{1}{\sqrt{6}} \right) \begin{pmatrix} \frac{1}{\sqrt{6}} \\ -\frac{2}{\sqrt{6}} \\ \frac{1}{\sqrt{6}} \end{pmatrix} = \begin{pmatrix} -\frac{1}{2} \\ 0 \\ \frac{1}{2} \end{pmatrix}$$
Then $v_3$ is orthogonal to $q_1$ and $q_2$. To get $q_3$ we normalize $v_3$: the norm is $\frac{1}{\sqrt{2}}$ and so $q_3 = \begin{pmatrix} -\frac{\sqrt{2}}{2} \\ 0 \\ \frac{\sqrt{2}}{2} \end{pmatrix}$. This is the third column of $\hat{Q}$, so
$$\hat{Q} = \begin{pmatrix} \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{6}} & -\frac{\sqrt{2}}{2} \\ \frac{1}{\sqrt{3}} & -\frac{2}{\sqrt{6}} & 0 \\ \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{6}} & \frac{\sqrt{2}}{2} \end{pmatrix}$$
We have $a_3 = (a_3, q_1) q_1 + (a_3, q_2) q_2 + \|v_3\| q_3$, so the third column in $\hat{R}$ is $\begin{pmatrix} \frac{2}{\sqrt{3}} \\ -\frac{1}{\sqrt{6}} \\ \frac{1}{\sqrt{2}} \end{pmatrix}$ and
$$\hat{R} = \begin{pmatrix} \sqrt{3} & \frac{2}{\sqrt{3}} & \frac{2}{\sqrt{3}} \\ 0 & \sqrt{\frac{2}{3}} & -\frac{1}{\sqrt{6}} \\ 0 & 0 & \frac{1}{\sqrt{2}} \end{pmatrix}$$

If $A = \hat{Q}\hat{R}$ is the reduced QR-decomposition of an $m\times n$ matrix $A$ with linearly independent columns, $\hat{Q}$ is an $m\times n$ matrix and $\hat{R}$ is an $n\times n$ matrix. The $n$ columns in $\hat{Q}$, $q_1, q_2, \ldots, q_n$, are orthonormal. We can extend these to an orthonormal basis for $\mathbb{R}^m$ by adding normal vectors $q_{n+1}, q_{n+2}, \ldots, q_m$, for instance by using the Gram-Schmidt algorithm on a basis of the orthogonal complement $\operatorname{Span}\{q_1, q_2, \ldots, q_n\}^\perp$. Then the matrix $Q$ with columns $q_1, q_2, \ldots, q_n, q_{n+1}, \ldots, q_m$ is an $m\times m$ orthogonal matrix.

If we add $m-n$ rows of 0's to $\hat{R}$ we get an $m\times n$ upper triangular matrix $R$ and we still have $A = Q\cdot R$. This is called the (full) QR-decomposition of $A$. Remark that if $A$ is a square matrix, as in the example, the reduced and the full QR-decompositions are the same.
Example 6.2 Consider the $3\times 2$ matrix $A = \begin{pmatrix} 1 & 1 \\ 1 & 0 \\ 1 & 1 \end{pmatrix}$.

The reduced QR-decomposition is
\[
A = \begin{pmatrix} 1/\sqrt{3} & 1/\sqrt{6} \\ 1/\sqrt{3} & -2/\sqrt{6} \\ 1/\sqrt{3} & 1/\sqrt{6} \end{pmatrix}
\begin{pmatrix} \sqrt{3} & 2/\sqrt{3} \\ 0 & \sqrt{2/3} \end{pmatrix}.
\]
To get the full QR-decomposition we would find a normal vector $q_3$ orthogonal to the columns in $\hat{Q}$. Here we could take $q_3 = \begin{pmatrix} -\sqrt{2}/2 \\ 0 \\ \sqrt{2}/2 \end{pmatrix}$ so the full QR-decomposition would be
\[
A = \begin{pmatrix} 1/\sqrt{3} & 1/\sqrt{6} & -\sqrt{2}/2 \\ 1/\sqrt{3} & -2/\sqrt{6} & 0 \\ 1/\sqrt{3} & 1/\sqrt{6} & \sqrt{2}/2 \end{pmatrix}
\begin{pmatrix} \sqrt{3} & 2/\sqrt{3} \\ 0 & \sqrt{2/3} \\ 0 & 0 \end{pmatrix}
\]
What do we do if the columns in $A$ are not necessarily linearly independent? In this case it can happen that a vector $v_j$ constructed in the $j$th step of the Gram-Schmidt algorithm is the 0-vector:
\[
0 = v_j = a_j - (a_j, q_1)q_1 - (a_j, q_2)q_2 - \cdots - (a_j, q_{j-1})q_{j-1}
\]
so we can't normalize it.

If this is the case we pick $q_j$ to be any normal vector orthogonal to $q_1, q_2, \ldots, q_{j-1}$ and put $r_{jj} = 0$ in the matrix $\hat{R}$, so the $j$th column is
\[
\begin{pmatrix} (a_j, q_1) \\ (a_j, q_2) \\ \vdots \\ (a_j, q_{j-1}) \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}
\]
and then we just continue the process with the vectors $q_1, q_2, \ldots, q_j$. This again gives us a reduced QR-decomposition $A = \hat{Q}\hat{R}$, where $\hat{Q}$ has orthonormal columns (and hence can be expanded to an orthogonal $m\times m$ matrix) and $\hat{R}$ is upper triangular but may have some 0's on the diagonal.
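The zero-detection in this variant of the algorithm can be sketched in a few lines of Python. As a simplification of our own (not from the notes), the sketch only records $r_{jj} = 0$ for a dependent column and does not pick the replacement vector $q_j$ that the text describes; the tolerance is likewise an assumption:

```python
import math

def gram_schmidt_diagonal(cols, tol=1e-12):
    """Run Gram-Schmidt on the given columns, recording the diagonal
    entries r_jj = ||v_j|| and putting r_jj = 0 when v_j = 0, i.e. when
    a column is a linear combination of the previous ones."""
    qs, diag = [], []
    for a in cols:
        v = a[:]
        for q in qs:
            c = sum(v[k] * q[k] for k in range(len(v)))
            v = [v[k] - c * q[k] for k in range(len(v))]
        nv = math.sqrt(sum(t * t for t in v))
        if nv < tol:
            diag.append(0.0)          # dependent column: r_jj = 0
        else:
            diag.append(nv)
            qs.append([t / nv for t in v])
    return qs, diag
```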

Example 6.3 Consider the $4\times 3$ matrix $A = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & -1 \\ 0 & 1 & 1 \end{pmatrix}$. Find the reduced and full QR-decompositions.

We normalize the first column to get $q_1 = \begin{pmatrix} \sqrt{2}/2 \\ 0 \\ \sqrt{2}/2 \\ 0 \end{pmatrix}$ and $r_{11} = \sqrt{2}$.

The second step in the Gram-Schmidt algorithm gives
\[
v_2 = \begin{pmatrix} 1 \\ 1 \\ 0 \\ 1 \end{pmatrix} - (a_2, q_1)q_1
= \begin{pmatrix} 1 \\ 1 \\ 0 \\ 1 \end{pmatrix} - \frac{\sqrt{2}}{2}\begin{pmatrix} \sqrt{2}/2 \\ 0 \\ \sqrt{2}/2 \\ 0 \end{pmatrix}
= \begin{pmatrix} 1 \\ 1 \\ 0 \\ 1 \end{pmatrix} - \begin{pmatrix} 1/2 \\ 0 \\ 1/2 \\ 0 \end{pmatrix}
= \begin{pmatrix} 1/2 \\ 1 \\ -1/2 \\ 1 \end{pmatrix}
\]
We normalize to get $q_2$: the norm is $\sqrt{5/2}$ so $q_2 = \frac{1}{\sqrt{10}}\begin{pmatrix} 1 \\ 2 \\ -1 \\ 2 \end{pmatrix}$. The second column in $\hat{R}$ is $\begin{pmatrix} \sqrt{2}/2 \\ \sqrt{5/2} \\ 0 \end{pmatrix}$.

The third step produces
\[
v_3 = \begin{pmatrix} 0 \\ 1 \\ -1 \\ 1 \end{pmatrix} - (a_3, q_1)q_1 - (a_3, q_2)q_2
= \begin{pmatrix} 0 \\ 1 \\ -1 \\ 1 \end{pmatrix} - \Bigl(-\frac{\sqrt{2}}{2}\Bigr)\begin{pmatrix} \sqrt{2}/2 \\ 0 \\ \sqrt{2}/2 \\ 0 \end{pmatrix} - \sqrt{\tfrac{5}{2}}\cdot\frac{1}{\sqrt{10}}\begin{pmatrix} 1 \\ 2 \\ -1 \\ 2 \end{pmatrix}
= 0
\]
so we can't normalize $v_3$. The third column in $\hat{R}$ is $\begin{pmatrix} -\sqrt{2}/2 \\ \sqrt{5/2} \\ 0 \end{pmatrix}$ so
\[
\hat{R} = \begin{pmatrix} \sqrt{2} & \sqrt{2}/2 & -\sqrt{2}/2 \\ 0 & \sqrt{5/2} & \sqrt{5/2} \\ 0 & 0 & 0 \end{pmatrix}
\]
To get the third column in $\hat{Q}$ we can take any vector orthogonal to $q_1$ and $q_2$; we can see that the vector $\begin{pmatrix} 0 \\ 1 \\ 0 \\ -1 \end{pmatrix}$ works. Normalizing we get $q_3 = \begin{pmatrix} 0 \\ 1/\sqrt{2} \\ 0 \\ -1/\sqrt{2} \end{pmatrix}$, so we get the reduced QR-decomposition
\[
\begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & -1 \\ 0 & 1 & 1 \end{pmatrix}
= \begin{pmatrix} \sqrt{2}/2 & 1/\sqrt{10} & 0 \\ 0 & 2/\sqrt{10} & 1/\sqrt{2} \\ \sqrt{2}/2 & -1/\sqrt{10} & 0 \\ 0 & 2/\sqrt{10} & -1/\sqrt{2} \end{pmatrix}
\begin{pmatrix} \sqrt{2} & \sqrt{2}/2 & -\sqrt{2}/2 \\ 0 & \sqrt{5/2} & \sqrt{5/2} \\ 0 & 0 & 0 \end{pmatrix}
\]
To find the full QR-decomposition we have to find a fourth normalized vector $q_4$ orthogonal to $q_1, q_2, q_3$. It is easy to see that the vector $\begin{pmatrix} -2 \\ 1 \\ 2 \\ 1 \end{pmatrix}$ is orthogonal to the three vectors and hence we can take $q_4 = \begin{pmatrix} -2/\sqrt{10} \\ 1/\sqrt{10} \\ 2/\sqrt{10} \\ 1/\sqrt{10} \end{pmatrix}$. Thus we get the full QR-decomposition
\[
\begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & -1 \\ 0 & 1 & 1 \end{pmatrix}
= \begin{pmatrix} \sqrt{2}/2 & 1/\sqrt{10} & 0 & -2/\sqrt{10} \\ 0 & 2/\sqrt{10} & 1/\sqrt{2} & 1/\sqrt{10} \\ \sqrt{2}/2 & -1/\sqrt{10} & 0 & 2/\sqrt{10} \\ 0 & 2/\sqrt{10} & -1/\sqrt{2} & 1/\sqrt{10} \end{pmatrix}
\begin{pmatrix} \sqrt{2} & \sqrt{2}/2 & -\sqrt{2}/2 \\ 0 & \sqrt{5/2} & \sqrt{5/2} \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}
\]

Homework

Problem 1.
Find the reduced and full QR-decompositions of the matrices $A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}$ and $B = \begin{pmatrix} 1 & 2 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}$.

Problem 2.
Let $A$ be an $m\times n$ ($m \geq n$) matrix and let $A = \hat{Q}\hat{R}$ be a reduced QR-decomposition. Show that $A$ has full rank $n$ if and only if all the diagonal entries in $\hat{R}$ are non-zero.

7 Orthogonal Projections

Consider a subspace $V \subset \mathbb{R}^n$. Let $u$ be a vector in $\mathbb{R}^n$. We know that we can write $u$ uniquely as $v + v'$ where $v \in V$ and $v' \in V^\perp$. Since $v$ and $v'$ are uniquely determined by $u$ we can define a map $P: u \mapsto v$. We call $v = Pu$ the orthogonal projection of $u$ onto the subspace $V$.

Proposition 7.0.2 The map $P$ is a linear map $\mathbb{R}^n \to \mathbb{R}^n$ and $P^2 = P \circ P = P$.

Proof: We have to show that $P(\lambda_1 u_1 + \lambda_2 u_2) = \lambda_1 P u_1 + \lambda_2 P u_2$. Write $u_1 = v_1 + v_1'$ and $u_2 = v_2 + v_2'$ where $v_1, v_2 \in V$ and $v_1', v_2' \in V^\perp$. But then $\lambda_1 u_1 + \lambda_2 u_2 = (\lambda_1 v_1 + \lambda_2 v_2) + (\lambda_1 v_1' + \lambda_2 v_2')$ with $\lambda_1 v_1 + \lambda_2 v_2 \in V$ and $\lambda_1 v_1' + \lambda_2 v_2' \in V^\perp$, so $P(\lambda_1 u_1 + \lambda_2 u_2) = \lambda_1 v_1 + \lambda_2 v_2 = \lambda_1 P u_1 + \lambda_2 P u_2$.

Write $u = v + v'$ so $Pu = v$. Then $P^2 u = Pv$, but $v$ is already in $V$ so $Pv = v$.

Let $P^\perp = \operatorname{Id} - P$; then $P^\perp$ is the orthogonal projection onto $V^\perp$ and we have $\operatorname{Id} = P + P^\perp$.

Let $q_1, q_2, \ldots, q_k$ be an orthonormal basis of $V$ and $q_{k+1}, q_{k+2}, \ldots, q_n$ an orthonormal basis for $V^\perp$, so that $q_1, q_2, \ldots, q_n$ is an orthonormal basis of $\mathbb{R}^n$. Then we have $Pq_i = q_i$ for $i = 1, 2, \ldots, k$ and $Pq_j = 0$ for $j = k+1, k+2, \ldots, n$. Hence the matrix of $P$ with respect to this basis is given by
\[
\begin{pmatrix}
1 & 0 & 0 & \cdots & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & 1 & 0 & \cdots & 0 \\
0 & 0 & 0 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & 0 & 0 & \cdots & 0
\end{pmatrix}
\]
and the matrix of $P^\perp$ is given by
\[
\begin{pmatrix}
0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \vdots & & \vdots \\
0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\
0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\
0 & \cdots & 0 & 0 & 1 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & 0 & 0 & \cdots & 1
\end{pmatrix}
\]

Let as before $q_1, q_2, \ldots, q_k$ be an orthonormal basis of $V$; then the orthogonal projection onto $V$ is given by $Pu = (u, q_1)q_1 + (u, q_2)q_2 + \cdots + (u, q_k)q_k$.
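In code, this formula is a short loop. A Python sketch (the function name and conventions are ours, not from the notes):

```python
def project(u, qs):
    """Orthogonal projection Pu = (u,q_1)q_1 + ... + (u,q_k)q_k of u
    onto V = Span{q_1,...,q_k}, where qs is an orthonormal basis of V."""
    out = [0.0] * len(u)
    for q in qs:
        c = sum(u[k] * q[k] for k in range(len(u)))  # the coefficient (u, q_i)
        for k in range(len(u)):
            out[k] += c * q[k]
    return out
```

Applying the function twice returns the same vector, matching $P^2 = P$ from Proposition 7.0.2.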

In the Gram-Schmidt algorithm applied to linearly independent vectors $u_1, u_2, \ldots, u_k$ we let $q_1 = \frac{u_1}{\|u_1\|}$ and $v_2 = u_2 - (u_2, q_1)q_1$. But $(u_2, q_1)q_1$ is the orthogonal projection $P_2$ of $u_2$ onto the subspace spanned by $q_1$, and hence $v_2 = u_2 - P_2 u_2 = (\operatorname{Id} - P_2)u_2$. Now $Q_2 = \operatorname{Id} - P_2$ is the orthogonal projection onto the orthogonal complement of $\operatorname{Span}\{q_1\}$. We have $q_2 = \frac{Q_2 u_2}{\|Q_2 u_2\|}$. Next, $v_3 = a_3 - (a_3, q_1)q_1 - (a_3, q_2)q_2$ is the orthogonal projection of $a_3$ onto $\operatorname{Span}\{q_1, q_2\}^\perp$; calling this projection $Q_3$, we have $q_3 = \frac{Q_3 a_3}{\|Q_3 a_3\|}$.

In general we have that $v_j$ is the orthogonal projection of $a_j$ onto the orthogonal complement of the subspace spanned by $q_1, q_2, \ldots, q_{j-1}$, i.e. $\operatorname{Span}\{q_1, q_2, \ldots, q_{j-1}\}^\perp$.

We can describe this as applying a sequence of projections: we get $v_2$ by applying $P^\perp_{q_1}$, the orthogonal projection onto the subspace orthogonal to $q_1$, to $a_2$.

Applying this projection to $a_3$ we have $P^\perp_{q_1} a_3 = a_3 - (a_3, q_1)q_1$. Next applying $P^\perp_{q_2}$, the projection onto the subspace orthogonal to $q_2$, we get
\[
P^\perp_{q_2} P^\perp_{q_1} a_3 = P^\perp_{q_2}\bigl(a_3 - (a_3, q_1)q_1\bigr)
= \bigl(a_3 - (a_3, q_1)q_1\bigr) - \bigl(a_3 - (a_3, q_1)q_1,\; q_2\bigr)q_2
= a_3 - (a_3, q_1)q_1 - (a_3, q_2)q_2
\]
because $(q_1, q_2) = 0$. But this is precisely $v_3$, thus
\[
v_3 = P^\perp_{q_2} P^\perp_{q_1} a_3
\]
In general
\[
v_j = P^\perp_{q_{j-1}} P^\perp_{q_{j-2}} \cdots P^\perp_{q_2} P^\perp_{q_1} a_j
\]
This version of the algorithm, the modified Gram-Schmidt algorithm, works better numerically than the usual Gram-Schmidt algorithm.
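A Python sketch of this projection-at-a-time orthogonalization (the function name is ours). The key difference from the classical version is that each inner product is taken against the partially projected vector $v$, not against the original $a_j$:

```python
import math

def modified_gram_schmidt(cols):
    """Orthonormalize linearly independent vectors by applying the
    projections P_perp_{q_1}, P_perp_{q_2}, ... one after another."""
    qs = []
    for a in cols:
        v = a[:]
        for q in qs:
            c = sum(v[k] * q[k] for k in range(len(v)))  # (v, q), with v updated
            v = [v[k] - c * q[k] for k in range(len(v))]
        nv = math.sqrt(sum(t * t for t in v))
        qs.append([t / nv for t in v])
    return qs
```

In exact arithmetic the two versions produce the same vectors; the difference shows up only in rounding behavior, which is the point of the MATLAB experiments in a later section.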

Using this algorithm we shall obtain the QR-decomposition by multiplying the matrix $A$ on the right by a sequence of upper triangular matrices.

We assume that $A$ is an $m\times n$ (with $m \geq n$) matrix with linearly independent columns $a_1, a_2, \ldots, a_n$.

We let $r_{ij}$ denote the inner product $(a_j, q_i)$, where $q_i$ is the $i$th normal vector constructed from the Gram-Schmidt algorithm. Remark that
\[
r_{jj} = (a_j, q_j) = \bigl(a_j - (a_j, q_1)q_1 - (a_j, q_2)q_2 - \cdots - (a_j, q_{j-1})q_{j-1},\; q_j\bigr) = (v_j, q_j) = \bigl(\|v_j\|\,q_j,\; q_j\bigr) = \|v_j\|.
\]
Multiplying on the right with the $n\times n$ matrix
\[
R_1 = \begin{pmatrix}
1/r_{11} & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & & \vdots \\
0 & 0 & 0 & \cdots & 1
\end{pmatrix}
\]
we get
\[
\begin{pmatrix} a_1 & a_2 & \cdots & a_n \end{pmatrix} R_1
= \begin{pmatrix} a_1/r_{11} & a_2 & \cdots & a_n \end{pmatrix}
= \begin{pmatrix} q_1 & a_2 & \cdots & a_n \end{pmatrix}
\]

Next consider the matrix
\[
R_2 = \begin{pmatrix}
1 & -r_{12}/r_{22} & 0 & \cdots & 0 \\
0 & 1/r_{22} & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1
\end{pmatrix}.
\]
Then
\[
A \cdot R_1 \cdot R_2
= \begin{pmatrix} q_1 & a_2 & \cdots & a_n \end{pmatrix} R_2
= \begin{pmatrix} q_1 & -\dfrac{r_{12}}{r_{22}}\,q_1 + \dfrac{1}{r_{22}}\,a_2 & a_3 & \cdots & a_n \end{pmatrix}
= \begin{pmatrix} q_1 & \dfrac{a_2 - (a_2, q_1)q_1}{\|v_2\|} & a_3 & \cdots & a_n \end{pmatrix}
= \begin{pmatrix} q_1 & q_2 & a_3 & \cdots & a_n \end{pmatrix}
\]

Next we put
\[
R_3 = \begin{pmatrix}
1 & 0 & -r_{13}/r_{33} & 0 & \cdots & 0 \\
0 & 1 & -r_{23}/r_{33} & 0 & \cdots & 0 \\
0 & 0 & 1/r_{33} & 0 & \cdots & 0 \\
0 & 0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & 0 & \cdots & 1
\end{pmatrix}
\]
By the same computation as before we get
\[
A \cdot R_1 \cdot R_2 \cdot R_3 = \begin{pmatrix} q_1 & q_2 & q_3 & a_4 & \cdots & a_n \end{pmatrix}
\]
Continuing this way we get
\[
A \cdot R_1 \cdot R_2 \cdots R_n = \begin{pmatrix} q_1 & q_2 & q_3 & \cdots & q_n \end{pmatrix} = \hat{Q}
\]
The product $R_1 R_2 R_3 \cdots R_n$ is an upper triangular $n\times n$ matrix and so is its inverse. If we let $\hat{R}$ denote this inverse we get
\[
A = \hat{Q}\hat{R}
\]
i.e. the reduced QR-decomposition. We could call this method of obtaining the QR-decomposition upper-triangular orthogonalization.

There is another method to obtain the QR-decomposition, which can aptly be called orthogonal triangularization. It multiplies $A$ on the left by a

45

Figure 2: Householder reflection

sequence of orthogonal matrices to make an upper-triangular matrix. This method is based on so-called Householder reflections.

We start out by finding an orthogonal $m\times m$ matrix $Q_1$ such that the first column in $Q_1 A$ is $\begin{pmatrix} \|a_1\| \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}$. The matrix $Q_1 A$ is $\begin{pmatrix} Q_1 a_1 & Q_1 a_2 & \cdots & Q_1 a_n \end{pmatrix}$, thus we want an orthogonal matrix $Q_1$ such that $Q_1 a_1 = \begin{pmatrix} \|a_1\| \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}$.

Let $v = a_1 - \|a_1\| e_1$ and let $u = \frac{v}{\|v\|}$. The orthogonal projection onto the subspace spanned by $u$ is given by $x \mapsto (x, u)u$. The Householder reflection is defined by $Q_1(x) = x - 2(x, u)u$. It is clear that $Q_1$ is a linear map; in fact $Q_1$ is reflection in the $(m-1)$-dimensional subspace $v^\perp$. A reflection preserves angles and lengths, so it is an orthogonal map, and hence $Q_1$ is an orthogonal matrix. We have
\[
Q_1(a_1) = a_1 - 2(a_1, u)u = a_1 - 2(a_1, v)\,\frac{v}{\|v\|^2}.
\]
Now
\[
\|v\|^2 = \bigl\|a_1 - \|a_1\| e_1\bigr\|^2 = \bigl(a_1 - \|a_1\| e_1,\; a_1 - \|a_1\| e_1\bigr) = 2\|a_1\|^2 - 2\|a_1\|(a_1, e_1) = 2\bigl(a_1,\; a_1 - \|a_1\| e_1\bigr).
\]
Hence
\[
Q_1(a_1) = a_1 - 2\bigl(a_1,\; a_1 - \|a_1\| e_1\bigr)\,\frac{a_1 - \|a_1\| e_1}{\bigl\|a_1 - \|a_1\| e_1\bigr\|^2}
= a_1 - \bigl(a_1 - \|a_1\| e_1\bigr) = \|a_1\| e_1
\]
This shows that
\[
Q_1 A = \begin{pmatrix}
\|a_1\| & * & \cdots & * \\
0 & * & \cdots & * \\
\vdots & \vdots & & \vdots \\
0 & * & \cdots & *
\end{pmatrix}.
\]

At the next step we want to multiply by a matrix $Q_2$ which does not disturb the first column but produces 0's below the second entry in the second column. $Q_2$ is going to be of the form
\[
Q_2 = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & & & \\ \vdots & & H_2 & \\ 0 & & & \end{pmatrix}
\]
where $H_2$ is an $(m-1)\times(m-1)$ matrix.

Let the second column in $Q_1 A$ be $\begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_m \end{pmatrix}$; then we let $H_2$ be the Householder reflection for the $(m-1)$-dimensional vector $b = \begin{pmatrix} b_2 \\ b_3 \\ \vdots \\ b_m \end{pmatrix}$. Since $H_2$ is an orthogonal $(m-1)\times(m-1)$ matrix, its $(m-1)$-dimensional column vectors are orthonormal, and hence also the columns in $Q_2$ are orthonormal, so $Q_2$ is an orthogonal matrix. Now the matrix
\[
Q_2 Q_1 A = \begin{pmatrix}
\|a_1\| & * & \cdots & * \\
0 & \|b\| & \cdots & * \\
0 & 0 & \cdots & * \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & *
\end{pmatrix}.
\]

The third matrix will be of the form
\[
Q_3 = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & & & \\ \vdots & \vdots & & H_3 & \\ 0 & 0 & & & \end{pmatrix}
\]
where $H_3$ is an $(m-2)\times(m-2)$ Householder reflection. Applying this procedure $n$ times (once per column) we finally arrive at $Q_n Q_{n-1} \cdots Q_2 Q_1 A = R$ where $R$ is upper-triangular. Let ${}^tQ = Q_n Q_{n-1} \cdots Q_2 Q_1$; then ${}^tQ$ is an orthogonal matrix and hence $A = Q\cdot R$ is the full QR-decomposition.
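Assembled into an algorithm, orthogonal triangularization can be sketched as follows. This is a Python illustration of our own, not the notes' code: each reflection is applied in place to the trailing rows, and the product of the reflections is accumulated so that $A = Q\cdot R$ at the end:

```python
import math

def householder_qr(A):
    """Full QR-decomposition of an m x n matrix A (m >= n, list of rows)
    by successive Householder reflections."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    # P accumulates Q_n ... Q_2 Q_1, i.e. the transpose of the final Q
    P = [[float(i == j) for j in range(m)] for i in range(m)]
    for k in range(n):
        x = [R[i][k] for i in range(k, m)]
        nx = math.sqrt(sum(t * t for t in x))
        v = x[:]
        v[0] -= nx                          # v = x - ||x|| e_1
        nv = math.sqrt(sum(t * t for t in v))
        if nv == 0.0:
            continue                        # column already triangular
        u = [t / nv for t in v]
        for M in (R, P):                    # apply H = I - 2 u u^T on the left
            for j in range(len(M[0])):
                c = sum(u[i] * M[k + i][j] for i in range(m - k))
                for i in range(m - k):
                    M[k + i][j] -= 2.0 * c * u[i]
    Q = [[P[j][i] for j in range(m)] for i in range(m)]   # Q = P^T
    return Q, R
```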

Example 7.1 Compute the QR-factorization of the matrix $A = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & -1 \\ 0 & 1 & 1 \end{pmatrix}$.

We begin by computing the matrix of the Householder reflection associated to the first column $\begin{pmatrix} 1 \\ 0 \\ 1 \\ 0 \end{pmatrix}$. We have
\[
v = \begin{pmatrix} 1 \\ 0 \\ 1 \\ 0 \end{pmatrix} - \sqrt{2}\begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}
= \begin{pmatrix} 1-\sqrt{2} \\ 0 \\ 1 \\ 0 \end{pmatrix}
\qquad\text{so}\qquad
u = \frac{v}{\|v\|} = \begin{pmatrix} \dfrac{1-\sqrt{2}}{\sqrt{4-2\sqrt{2}}} \\ 0 \\ \dfrac{1}{\sqrt{4-2\sqrt{2}}} \\ 0 \end{pmatrix}
\]
and the Householder reflection is given by $Q_1(x) = x - 2(x, u)u$. To compute the matrix we have to find $Q_1(e_j)$ for $j = 1, 2, 3, 4$.

First
\[
Q_1(e_1) = e_1 - 2(e_1, u)u
= \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix} - \frac{2(1-\sqrt{2})}{4-2\sqrt{2}}\begin{pmatrix} 1-\sqrt{2} \\ 0 \\ 1 \\ 0 \end{pmatrix}
= \begin{pmatrix} 1 - \dfrac{2(1-\sqrt{2})^2}{4-2\sqrt{2}} \\ 0 \\ -\dfrac{2(1-\sqrt{2})}{4-2\sqrt{2}} \\ 0 \end{pmatrix}
= \begin{pmatrix} \sqrt{2}/2 \\ 0 \\ \sqrt{2}/2 \\ 0 \end{pmatrix}
\]

Secondly
\[
Q_1 e_2 = e_2 - 2(e_2, u)u = \begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \end{pmatrix} - 0\cdot u = \begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \end{pmatrix}
\]
since the second entry of $u$ is 0.

Next
\[
Q_1 e_3 = e_3 - 2(e_3, u)u
= \begin{pmatrix} 0 \\ 0 \\ 1 \\ 0 \end{pmatrix} - \frac{2}{4-2\sqrt{2}}\begin{pmatrix} 1-\sqrt{2} \\ 0 \\ 1 \\ 0 \end{pmatrix}
= \begin{pmatrix} -\dfrac{2(1-\sqrt{2})}{4-2\sqrt{2}} \\ 0 \\ 1 - \dfrac{2}{4-2\sqrt{2}} \\ 0 \end{pmatrix}
= \begin{pmatrix} \sqrt{2}/2 \\ 0 \\ -\sqrt{2}/2 \\ 0 \end{pmatrix}.
\]

Finally
\[
Q_1 e_4 = e_4 - 2(e_4, u)u = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix} - 0\cdot u = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix}
\]
since the fourth entry of $u$ is 0.

Thus we have
\[
Q_1 = \begin{pmatrix}
\sqrt{2}/2 & 0 & \sqrt{2}/2 & 0 \\
0 & 1 & 0 & 0 \\
\sqrt{2}/2 & 0 & -\sqrt{2}/2 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}
\quad\text{and}\quad
Q_1 A = \begin{pmatrix}
\sqrt{2} & \sqrt{2}/2 & -\sqrt{2}/2 \\
0 & 1 & 1 \\
0 & \sqrt{2}/2 & \sqrt{2}/2 \\
0 & 1 & 1
\end{pmatrix}
\]

The next step is to compute the Householder reflection in $\mathbb{R}^3$ of the vector $b = \begin{pmatrix} 1 \\ \sqrt{2}/2 \\ 1 \end{pmatrix}$. We have $v = b - \|b\| e_1$. The norm of $b$ is $\sqrt{5/2}$ and so $v = \begin{pmatrix} 1 - \sqrt{5/2} \\ \sqrt{2}/2 \\ 1 \end{pmatrix}$. The norm of $v$ is $\sqrt{5 - 2\sqrt{5/2}}$ and hence the normal vector in the direction of $v$ is
\[
u = \begin{pmatrix}
\dfrac{1-\sqrt{5/2}}{\sqrt{5-2\sqrt{5/2}}} \\[2mm]
\dfrac{\sqrt{2}/2}{\sqrt{5-2\sqrt{5/2}}} \\[2mm]
\dfrac{1}{\sqrt{5-2\sqrt{5/2}}}
\end{pmatrix}
\]

_

. The Householder re-

ection is given by H

2

x = x 2(x, u)u. We compute as before H

2

e

1

=

50

e

1

2(e

1

, u)u =

_

_

1

0

0

_

_

2

1

_

5

2

_

5 2

_

5

2

_

_

_

_

_

_

_

_

_

_

_

_

_

_

_

_

_

1

_

5

2

_

5 2

_

5

2

_

1

2

_

5 2

_

5

2

1

_

5 2

_

5

2

_

_

_

_

_

_

_

_

_

_

_

_

_

_

_

_

_

=

_

_

_

_

_

_

_

_

_

_

_

_

_

_

_

_

2 + 2

_

5

2

5 2

_

5

2

_

1

2

2 2

_

5

2

5 2

_

5

2

2 2

_

5

2

5 2

_

5

2

_

_

_

_

_

_

_

_

_

_

_

_

_

_

_

_

=

_

_

_

_

_

2

5

_

5

2

_

1

2

2

5

_

5

2

2

5

_

5

2

_

_

_

_

_

=

_

_

_

_

_

_

2

5

_

1

5

_

2

5

_

_

_

_

_

Next we compute
\[
H_2 e_2 = e_2 - 2(e_2, u)u
= \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} - \frac{\sqrt{2}}{5-2\sqrt{5/2}}\begin{pmatrix} 1-\sqrt{5/2} \\ \sqrt{2}/2 \\ 1 \end{pmatrix}
= \begin{pmatrix}
1/\sqrt{5} \\[1mm]
\dfrac{10 - 2\sqrt{5/2}}{15} \\[1mm]
-\dfrac{1}{\sqrt{2}}\cdot\dfrac{10 + 4\sqrt{5/2}}{15}
\end{pmatrix}.
\]

Finally
\[
H_2 e_3 = e_3 - 2(e_3, u)u
= \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} - \frac{2}{5-2\sqrt{5/2}}\begin{pmatrix} 1-\sqrt{5/2} \\ \sqrt{2}/2 \\ 1 \end{pmatrix}
= \begin{pmatrix}
\sqrt{2/5} \\[1mm]
-\dfrac{1}{\sqrt{2}}\cdot\dfrac{10 + 4\sqrt{5/2}}{15} \\[1mm]
\dfrac{5 - 4\sqrt{5/2}}{15}
\end{pmatrix}.
\]

It follows that
\[
Q_2 = \begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & \sqrt{2/5} & 1/\sqrt{5} & \sqrt{2/5} \\
0 & 1/\sqrt{5} & \dfrac{10-2\sqrt{5/2}}{15} & -\dfrac{1}{\sqrt{2}}\cdot\dfrac{10+4\sqrt{5/2}}{15} \\
0 & \sqrt{2/5} & -\dfrac{1}{\sqrt{2}}\cdot\dfrac{10+4\sqrt{5/2}}{15} & \dfrac{5-4\sqrt{5/2}}{15}
\end{pmatrix}
\]
and
\[
Q_2 Q_1 A = \begin{pmatrix}
\sqrt{2} & \sqrt{2}/2 & -\sqrt{2}/2 \\
0 & \sqrt{5/2} & \sqrt{5/2} \\
0 & 0 & 0 \\
0 & 0 & 0
\end{pmatrix}
\]

8 Computational Experiments using MATLAB

In the section on orthogonal projections we alluded to the modified Gram-Schmidt procedure as being numerically more stable than the classical procedure, and the Householder method as being even better. Next we shall illustrate this using some MATLAB computations.

The experiment is the following: we construct two orthogonal $80\times 80$ matrices $Q_1$ and $Q_2$. Then we let $D$ be the diagonal matrix with the powers $2^{-1}, 2^{-2}, \ldots, 2^{-80}$ in the diagonal. Let $A = Q_1\cdot D\cdot Q_2$. We compute the QR-decomposition using the classical Gram-Schmidt, the modified Gram-Schmidt and the Householder algorithm.

Suppose $A = Q\cdot R$ is the QR-decomposition of $A$. Then $Q_1\cdot D\cdot Q_2 = Q\cdot R$, so $R = Q^{-1}Q_1 D Q_2$. Now $A = Q_1 D Q_2$ is an SVD of $A$, and so the singular values of $A$ are the numbers $2^{-1}, 2^{-2}, \ldots, 2^{-80}$; but $R = (Q^{-1}Q_1) D Q_2$ is an SVD of $R$, so $R$ has the same singular values. The eigenvalues of an upper triangular matrix are the diagonal entries, as can be seen by computing the characteristic polynomial. Hence the absolute values of the diagonal entries in $R$ should be the numbers $2^{-1}, 2^{-2}, \ldots, 2^{-80}$.

Here is a print-out of a MATLAB session

>> M=randn(80);
>> N=randn(80);
>> [Q1,X]=qr(M);
>> [Q2,Y]=qr(N);
>> D=diag(2.^(-1:-1:-80));
>> A=Q1*D*Q2;

Here we first construct two $80\times 80$ matrices with random entries chosen from a standard normal distribution. Then we use the QR-decomposition to get two orthogonal matrices $Q_1$ and $Q_2$ (and also two upper triangular matrices which we have no use for). Next we construct the diagonal matrix $D$ with the numbers $2^{-1}, 2^{-2}, \ldots, 2^{-80}$ in the diagonal, and finally we form the $80\times 80$ matrix $A = Q_1\cdot D\cdot Q_2$.

Next we use MATLAB's built-in function to find the QR-decomposition (this procedure uses the Householder method)

>> [Q,R]=qr(A);
>> v=diag(R);
>> v=log(abs(v));
>> plot(v,'o')

We then take out the diagonal of the upper-triangular part and we take the log of the absolute values of the diagonal entries. This should give the numbers $-\log 2, -2\log 2, -3\log 2, \ldots, -80\log 2$, and so when we plot them they should lie on a nice straight line with slope $-\log 2$. Notice the

Figure 3:

specification to the plot command to use an 'o' to mark the points rather than using a line graph. The result is shown in the figure.

The graph looks pretty good until we get down to around $2^{-35}$; then the machine precision is no longer good enough to distinguish the very small values and the line is drowned out in rounding errors.

Next we shall try the classical Gram-Schmidt method. Here we first have to write our own algorithm to compute the QR-decomposition using this method. This is best done by writing a function M-file as shown below:

function [QC,RC]=clgs(A)
[n,m]=size(A);
QC=zeros(n);
RC=zeros(n);
QC(:,1)=A(:,1)/(norm(A(:,1)));
for j=1:n
    vj=A(:,j);
    for i=1:j-1
        RC(i,j)=QC(:,i)'*A(:,j);
        vj=vj-RC(i,j)*QC(:,i);
    end
    RC(j,j)=norm(vj);
    QC(:,j)=vj/RC(j,j);
end

We then compute the QR-decomposition using this method and again plot the log of the absolute values of the diagonal elements. We plot it on the same graph to compare the results of the two methods, using the command hold, which keeps the previous graph (to get a new graph use the command hold off). We plot the points using 'x' as a marker.

>> hold
Current plot held
>> [QC,RC]=clgs(A);
>> v=log(abs(diag(RC)));
>> plot(v,'x')

We see that the rounding errors take over much earlier; this is a result of the classical Gram-Schmidt algorithm being numerically unstable, i.e. the errors compound in each step.

Figure 4:


Homework:
Write a function M-file to compute the QR-decomposition using the modified Gram-Schmidt algorithm and compute the upper triangular part. Plot the log of the absolute values using '*' as a marker on the same graph as the Householder and classical Gram-Schmidt results.

9 Least Squares Problems

Let $A$ be an $n\times m$ matrix and assume $n > m$. We can of course view $A$ as the matrix of a linear transformation $T: \mathbb{R}^m \to \mathbb{R}^n$.

Consider a system of linear equations
\[
Ax = b
\]
Since there are more equations than unknowns, this system in general will not have solutions. In fact it will have a solution precisely when $b \in \operatorname{Im} T$. Since $\operatorname{Im} T$ has dimension at most $m$ because of the formula $\dim\ker T + \dim\operatorname{Im} T = m$, the subspace $\operatorname{Im} T$ will be small relative to $\mathbb{R}^n$, so $b$ has to be very special in order for there to be any solution.

For a given $x \in \mathbb{R}^m$ we can consider the residual $r = b - Ax \in \mathbb{R}^n$. Of course if $x$ is a solution, $r = 0$.

The idea of a least squares solution is to find $x$ such that $\|r\|$ is as small as possible. If $r = (r_1, r_2, \ldots, r_n)$, then $\|r\|^2 = r_1^2 + r_2^2 + \cdots + r_n^2$, hence the name least squares.

Theorem 9.0.3 A vector $x \in \mathbb{R}^m$ is a least squares solution if and only if $r \in (\operatorname{Im} T)^\perp$.

Proof: Writing $\|r\|^2$ in terms of coordinates we get

\[
(a_{11}x_1 + a_{12}x_2 + \cdots + a_{1m}x_m - b_1)^2
+ (a_{21}x_1 + a_{22}x_2 + \cdots + a_{2m}x_m - b_2)^2
+ \cdots
+ (a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nm}x_m - b_n)^2
\]
Viewing this as a function of $(x_1, x_2, \ldots, x_m)$, in order for $x$ to be a minimum, all the partial derivatives $\frac{\partial}{\partial x_i}$ must vanish.

Figure 5:

Computing the partial derivative with respect to $x_j$ we get
\[
2(a_{11}x_1 + a_{12}x_2 + \cdots + a_{1m}x_m - b_1)a_{1j}
+ 2(a_{21}x_1 + a_{22}x_2 + \cdots + a_{2m}x_m - b_2)a_{2j}
+ \cdots
+ 2(a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nm}x_m - b_n)a_{nj}
\]
This is precisely $-2$ times the inner product of $r$ with the $j$th column of $A$. Thus the vanishing of $\frac{\partial \|r\|^2}{\partial x_j}$ is equivalent to $r$ being orthogonal to the $j$th column. Now all the partials have to vanish, and so $r$ is orthogonal to all the columns in $A$; since these columns span $\operatorname{Im} T$ it follows that $r \in (\operatorname{Im} T)^\perp$.

Using this theorem we can show that there always is a least squares solution: namely, let $P: \mathbb{R}^n \to \operatorname{Im} T$ be the orthogonal projection. Then $Pb \in \operatorname{Im} T$ and so we can find $x \in \mathbb{R}^m$ such that $Pb = Ax$. As we have seen, $b - Pb \in (\operatorname{Im} T)^\perp$, and so $r = b - Ax = b - Pb \in (\operatorname{Im} T)^\perp$; thus $x$ is a least squares solution.

Another consequence of the theorem is that $x$ is a least squares solution if and only if $(b - Ax, Ay) = 0$ for all $y \in \mathbb{R}^m$. Using the adjoint transformation and its matrix ${}^tA$ we get $0 = (b - Ax, Ay) = ({}^tA\,b - {}^tA\cdot A\,x,\; y)$ for all $y \in \mathbb{R}^m$. Thus ${}^tA\,b - {}^tA\cdot A\,x$ is a vector in $\mathbb{R}^m$ orthogonal to every vector in $\mathbb{R}^m$ and so must be the zero-vector. Thus we have proved

Theorem 9.0.4 A vector $x \in \mathbb{R}^m$ is a least squares solution to the system $Ax \approx b$ if and only if
\[
{}^tA\cdot A\,x = {}^tA\,b
\]

Remark that ${}^tA\cdot A$ is an $m\times m$ matrix, so we now have a system of $m$ equations with $m$ unknowns. This system of equations is known as the normal equations. As we have shown, the normal equations have a solution, which will be unique if and only if $\ker({}^tA\cdot A) = 0$, or equivalently if ${}^tA\cdot A$ has rank $m$, hence if it is invertible. This is the case if $A$ has maximal rank $m$, and in this case the unique solution to the least squares problem is
\[
x = ({}^tA\cdot A)^{-1}\,{}^tA\,b
\]
The matrix $({}^tA\cdot A)^{-1}\,{}^tA$ is called the pseudo-inverse of $A$ and often denoted by $A^+$.

.

Example 9.1 Consider a set of points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ and the problem of finding a polynomial $p(x)$ of degree $m < n$ that best fits these points, i.e. we want to find $p(x)$ such that $(y_1 - p(x_1))^2 + (y_2 - p(x_2))^2 + \cdots + (y_n - p(x_n))^2$ is as small as possible. If we write $p(x) = c_0 + c_1 x + c_2 x^2 + \cdots + c_m x^m$ this comes down to finding a least squares solution to the system of equations
\[
\begin{pmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^m \\
1 & x_2 & x_2^2 & \cdots & x_2^m \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_n & x_n^2 & \cdots & x_n^m
\end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_m \end{pmatrix}
= \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
\]
The matrix of coefficients is known as a van der Monde matrix; if the $x_i$'s are distinct this matrix has maximal rank $m+1$. Hence there is a unique solution.

Suppose we want to fit a line through these points. The matrix $A$ is
\[
A = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}
\]

and so
\[
{}^tA\cdot A = \begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}
\]
The inverse is
\[
\frac{1}{n\sum x_i^2 - \bigl(\sum x_i\bigr)^2}
\begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{pmatrix}
\]
and so
\[
A^+ = \frac{1}{n\sum x_i^2 - \bigl(\sum x_i\bigr)^2}
\begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{pmatrix}
\begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix}
= \frac{1}{n\sum x_i^2 - \bigl(\sum x_i\bigr)^2}
\begin{pmatrix}
\sum x_i^2 - x_1\sum x_i & \sum x_i^2 - x_2\sum x_i & \cdots & \sum x_i^2 - x_n\sum x_i \\
n x_1 - \sum x_i & n x_2 - \sum x_i & \cdots & n x_n - \sum x_i
\end{pmatrix}
\]
It follows that the line is given by the equation $y = c_0 + c_1 x$ where
\[
c_0 = \frac{\sum y_i \sum x_i^2 - \sum x_i y_i \sum x_i}{n\sum x_i^2 - \bigl(\sum x_i\bigr)^2},
\qquad
c_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \bigl(\sum x_i\bigr)^2}
\]
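The two closed-form coefficients translate directly into code. A Python sketch of the line fit (the function name is ours):

```python
def fit_line(xs, ys):
    """Least squares line y = c0 + c1*x through the points (x_i, y_i),
    using the closed-form solution of the normal equations."""
    n = len(xs)
    sx = sum(xs)                            # sum of x_i
    sxx = sum(x * x for x in xs)            # sum of x_i^2
    sy = sum(ys)                            # sum of y_i
    sxy = sum(x * y for x, y in zip(xs, ys))  # sum of x_i*y_i
    det = n * sxx - sx * sx                 # n*sum(x_i^2) - (sum x_i)^2
    c0 = (sy * sxx - sx * sxy) / det
    c1 = (n * sxy - sx * sy) / det
    return c0, c1
```

For points lying exactly on a line the residual is zero and the exact intercept and slope are recovered.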

Using the QR-decomposition of $A$ we get a convenient way to solve the normal equations.

Consider the reduced QR-decomposition of $A$, $A = \hat{Q}\hat{R}$, where $\hat{Q}$ is an $n\times m$ matrix with orthonormal columns and $\hat{R}$ is an $m\times m$ upper-triangular matrix. Then ${}^tA\cdot A = {}^t\hat{R}\,{}^t\hat{Q}\,\hat{Q}\,\hat{R} = {}^t\hat{R}\,\hat{R}$ since ${}^t\hat{Q}\,\hat{Q} = E_m$. Hence the normal equations read
\[
{}^t\hat{R}\,\hat{R}\,x = {}^t\hat{R}\,{}^t\hat{Q}\,b
\]
which we can solve by solving
\[
\hat{R}\,x = {}^t\hat{Q}\,b
\]
The matrix $\hat{R}$ is upper-triangular and so this is a system of the form
\begin{align*}
r_{11}x_1 + r_{12}x_2 + r_{13}x_3 + \cdots + r_{1m}x_m &= c_1 \\
r_{22}x_2 + r_{23}x_3 + \cdots + r_{2m}x_m &= c_2 \\
&\;\;\vdots \\
r_{mm}x_m &= c_m
\end{align*}
where $\begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_m \end{pmatrix} = {}^t\hat{Q}\,b$. Starting with the last equation and working upwards (back substitution) we get
\begin{align*}
x_m &= c_m/r_{mm} \\
x_{m-1} &= \frac{1}{r_{m-1,m-1}}\bigl(c_{m-1} - r_{m-1,m}\,x_m\bigr) \\
x_{m-2} &= \frac{1}{r_{m-2,m-2}}\bigl(c_{m-2} - r_{m-2,m-1}\,x_{m-1} - r_{m-2,m}\,x_m\bigr) \\
&\;\;\vdots \\
x_1 &= \frac{1}{r_{11}}\bigl(c_1 - r_{12}x_2 - r_{13}x_3 - \cdots - r_{1m}x_m\bigr)
\end{align*}
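The back-substitution pattern just written out is a few lines of code. A Python sketch (assumes a non-zero diagonal; the function name is ours):

```python
def back_substitute(R, c):
    """Solve R x = c for an upper-triangular R (list of rows) with
    non-zero diagonal, working from the last equation upward."""
    m = len(c)
    x = [0.0] * m
    for i in range(m - 1, -1, -1):
        s = sum(R[i][j] * x[j] for j in range(i + 1, m))  # already-solved part
        x[i] = (c[i] - s) / R[i][i]
    return x
```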

10 Numerical Analysis

Consider a system of linear equations $Ax = b$. Here is a possible strategy for finding a solution: compute the QR-decomposition $A = Q\cdot R$. Since $Q$ is an orthogonal matrix we have $Q^{-1} = {}^tQ$ and so we get $Rx = {}^tQ\,b$. Since $R$ is upper-triangular we can solve this system by back substitution as above.

As the examples have shown, computing the QR-decomposition by hand is not possible unless the dimensions of $A$ are very small (think about computing the QR-decomposition of a $100\times 100$ matrix!). Thus we certainly want to use a program such as MATLAB to find $Q$ and $R$ and to perform the back substitution.

Consider the following MATLAB code

>> R=triu(randn(50));
>> [Q,X]=qr(randn(50));
>> A=Q*R;
>> [Q2,R2]=qr(A);
>> norm(Q-Q2)
ans =
    1.8187
>> norm(R-R2)/norm(R)
ans =
    0.2848
>> norm(Q2*R2-A)
ans =
    8.9370e-015

We first construct a random upper-triangular matrix $R$ and a random orthogonal matrix $Q$ and put $A = Q\cdot R$; thus this is the QR-decomposition of $A$. Now we use the qr command in MATLAB to compute the QR-decomposition of $A$, and get $Q_2$ and $R_2$. Now compare $Q$ and $Q_2$, and $R$ and $R_2$. If $Q$ and $Q_2$ were close to each other the norm of $Q - Q_2$ should be small. But our computation shows it is large ($\approx 2$), and the same for $R - R_2$. Since the QR-decomposition is unique this shows that we really can't hope to compute the QR-decomposition to any reasonable degree of accuracy. Yet when we compute $Q_2\cdot R_2$ we get very close to $A$. Thus seemingly the imprecisions in $Q_2$ and $R_2$ cancel out in the product. Does this mean that the algorithm above is doomed to failure and is going to yield results that are unusable?

are unusable?

We shall investigate this problem in some detail. First of all, a computer cannot represent all real numbers. A double precision number is stored in 64 bits, which encode the digits and the position of the (binary) point. Double precision numbers can be between $2.23\times 10^{-308}$ and $1.79\times 10^{308}$ in magnitude. The problem is that there are gaps between the numbers; for instance the numbers in $[1,2]$ are represented by the numbers

$1,\ \ 1+2^{-52},\ \ 1+2\cdot 2^{-52},\ \ 1+3\cdot 2^{-52},\ \ \ldots,\ \ 2$

The numbers in $[2^n, 2^{n+1}]$ are represented by the numbers

$2^n,\ \ 2^n + 2^n\cdot 2^{-52},\ \ 2^n + 2^n\cdot 2\cdot 2^{-52},\ \ 2^n + 2^n\cdot 3\cdot 2^{-52},\ \ \ldots,\ \ 2^{n+1}$

i.e. the distance between consecutive points here is $2^n\cdot 2^{-52}$. We shall call the collection of all these numbers the floating point numbers of the machine. Thus these are precisely the real numbers that can be represented exactly in the computer; any other real number is approximated by one of the floating point numbers. Remark that when we get close to the upper bound of the numbers that can be represented by the machine, the distance between consecutive floating point numbers becomes enormous, though the relative distance stays the same.
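The gaps described above can be inspected directly. A small Python sketch (Python 3.9+, whose math.ulp reports the spacing at a given value, the same quantity MATLAB calls eps):

```python
import math

# Spacing between consecutive doubles just above 1 is 2^-52.
assert math.ulp(1.0) == 2.0 ** -52

# In [2^n, 2^(n+1)] the spacing scales by 2^n: absolute gaps grow,
# but the relative gap stays the same.
for n in range(0, 60, 10):
    assert math.ulp(2.0 ** n) == 2.0 ** n * 2.0 ** -52

# 1 + 2^-53 falls inside a gap and rounds back to 1.
assert 1.0 + 2.0 ** -53 == 1.0
```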

For a real number $x$ let $\mathrm{fl}(x)$ denote the closest floating point number. Thus

$\dfrac{|x - \mathrm{fl}(x)|}{|x|} < \tfrac12\, 2^{-52}$

or $|x - \mathrm{fl}(x)| < \tfrac12\,2^{-52}\,|x|$. The number $\tfrac12\,2^{-52}$ is called the machine precision; we shall denote it by $\varepsilon_{\mathrm{machine}}$. Thus we have $\mathrm{fl}(x) = x(1+\varepsilon)$ with $|\varepsilon| < \varepsilon_{\mathrm{machine}}$.

Consider now the usual floating point number operations $+,-,\times,/$. We shall denote the corresponding operations in the machine by putting a circle around the symbol, e.g. $\oplus$. The machine has the following property: for any two floating point numbers $x, y$

$x \circledast y = \mathrm{fl}(x * y)$

thus

$x \circledast y = (x * y)(1+\varepsilon)$

for some $|\varepsilon| < \varepsilon_{\mathrm{machine}}$.

Definition 10.0.3 A mathematical problem is a function (not linear in general) $f : X \to Y$ from a vector space $X$ of data to a vector space $Y$ of solutions. Thus the problem is: for a given data point $x$ compute the solution $f(x)$.

Definition 10.0.4 An accurate algorithm for a mathematical problem $f : X \to Y$ is another function $\tilde f : X \to Y$ that can be implemented by a computer program and such that for any $x \in X$ which can be represented by floating point numbers we have

$\dfrac{\|\tilde f(x) - f(x)\|}{\|f(x)\|} = O(\varepsilon_{\mathrm{machine}})$

The notation "left hand side $= O(\varepsilon_{\mathrm{machine}})$" means that there is a fixed constant $C$ (not depending on $x$) such that left hand side $< C\,\varepsilon_{\mathrm{machine}}$.

Next we shall discuss the notion of stability of an algorithm. This definition may seem strange at first, but hopefully the significance will become clearer later on.

Definition 10.0.5 An algorithm $\tilde f$ for a problem $f$ is stable if for each $x \in X$ there is an $\tilde x$ with

$\dfrac{\|\tilde x - x\|}{\|x\|} = O(\varepsilon_{\mathrm{machine}})$

such that

$\dfrac{\|\tilde f(x) - f(\tilde x)\|}{\|f(\tilde x)\|} = O(\varepsilon_{\mathrm{machine}})$

An algorithm is backwards stable if there is $\tilde x$ as above such that

$\tilde f(x) = f(\tilde x)$

Example 10.1 Let $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x_1, x_2) = x_1 - x_2$. The algorithm is $\tilde f(x_1,x_2) = \mathrm{fl}(x_1) \ominus \mathrm{fl}(x_2)$. This algorithm is backwards stable.

Indeed $\mathrm{fl}(x_1) = x_1(1+\varepsilon_1)$ and $\mathrm{fl}(x_2) = x_2(1+\varepsilon_2)$ with $|\varepsilon_1|, |\varepsilon_2| < \varepsilon_{\mathrm{machine}}$. Now by our assumption about the machine we have $\mathrm{fl}(x_1) \ominus \mathrm{fl}(x_2) = (\mathrm{fl}(x_1) - \mathrm{fl}(x_2))(1+\varepsilon_3)$. Hence we get

$\mathrm{fl}(x_1) \ominus \mathrm{fl}(x_2) = [x_1(1+\varepsilon_1) - x_2(1+\varepsilon_2)](1+\varepsilon_3)$
$= x_1(1+\varepsilon_1)(1+\varepsilon_3) - x_2(1+\varepsilon_2)(1+\varepsilon_3)$
$= x_1(1+\varepsilon_1+\varepsilon_3+\varepsilon_1\varepsilon_3) - x_2(1+\varepsilon_2+\varepsilon_3+\varepsilon_2\varepsilon_3)$
$= x_1(1+\varepsilon_4) - x_2(1+\varepsilon_5)$

with $|\varepsilon_4|, |\varepsilon_5| < 2\varepsilon_{\mathrm{machine}} + \varepsilon_{\mathrm{machine}}^2$.

Thus $\tilde f(x_1,x_2) = \mathrm{fl}(x_1)\ominus\mathrm{fl}(x_2) = \tilde x_1 - \tilde x_2 = f(\tilde x_1, \tilde x_2)$ with $\tilde x_1 = x_1(1+\varepsilon_4)$ and $\tilde x_2 = x_2(1+\varepsilon_5)$, so $\dfrac{|\tilde x_1 - x_1|}{|x_1|} = |\varepsilon_4| = O(\varepsilon_{\mathrm{machine}})$; for instance we can take the constant $C = 3$. The same holds for $\tilde x_2$.

Consider instead $f(x) = x+1$ and the algorithm $\tilde f(x) = \mathrm{fl}(x)\oplus 1$. Then $\mathrm{fl}(x) = x(1+\varepsilon_1)$ and $\mathrm{fl}(x)\oplus 1 = (x(1+\varepsilon_1)+1)(1+\varepsilon_2)$. With $\tilde x = x(1+\varepsilon_1)$ this gives $\tilde f(x) - f(\tilde x) = (x(1+\varepsilon_1)+1)(1+\varepsilon_2) - (x(1+\varepsilon_1)+1) = (\tilde x+1)\varepsilon_2$. Thus this algorithm is not backwards stable, but since

$\dfrac{|\tilde f(x) - f(\tilde x)|}{|f(\tilde x)|} = |\varepsilon_2|$

it is stable.

The example at the beginning of this section illustrates the notion of backwards stability. The problem is to find the QR-decomposition of the matrix $A$. We find that the algorithm does a poor job of computing the actual $Q$ and $R$ for $A$. It does however come up with a $\tilde Q$ and an $\tilde R$ such that $\tilde Q\tilde R = \tilde A$ where $\tilde A$ is very close to $A$. Thus $\tilde Q, \tilde R$ is not a solution to the problem for the matrix $A$, but for the nearby matrix $\tilde A$.

Of course we want algorithms that are accurate; as it turns out, this is not guaranteed by stability or even backwards stability. Accuracy means that $\|\tilde f(x) - f(x)\|$ is small relative to $\|f(x)\|$, while backwards stability only implies that $\|\tilde f(x) - f(\tilde x)\|$ is small relative to $\|f(\tilde x)\|$ (in fact it is $0$), so what we need is for $\|f(\tilde x) - f(x)\|$ to be small relative to $\|f(x)\|$.

Definition 10.0.6 Let $f : X \to Y$ be a problem and let $x \in X$ be a data point. Let $\delta x$ be a small increment and let $\delta f(x) = f(x+\delta x) - f(x)$. The relative condition number is defined by

$\kappa(x) = \lim_{\delta \to 0}\ \sup_{\|\delta x\| < \delta}\ \left(\dfrac{\|\delta f(x)\|}{\|f(x)\|}\right)\Big/\left(\dfrac{\|\delta x\|}{\|x\|}\right)$

A problem is well-conditioned if $\kappa(x)$ is relatively small ($\lesssim 100$) and ill-conditioned if $\kappa(x)$ is large ($10^6$, ..., $10^{16}$).

Example 10.2 If $f$ is a differentiable function, e.g. all its coordinate functions have continuous partial derivatives, then we can form the Jacobian matrix

$J(x) = \begin{pmatrix} \partial f_1/\partial x_1(x) & \partial f_1/\partial x_2(x) & \ldots & \partial f_1/\partial x_m(x)\\ \partial f_2/\partial x_1(x) & \partial f_2/\partial x_2(x) & \ldots & \partial f_2/\partial x_m(x)\\ \vdots & \vdots & \ddots & \vdots\\ \partial f_n/\partial x_1(x) & \partial f_n/\partial x_2(x) & \ldots & \partial f_n/\partial x_m(x) \end{pmatrix}$

Then we have $\|\delta f(x)\| = \|f(x + \delta x) - f(x)\| \approx \|J(x)\,\delta x\| \le \|J(x)\|\,\|\delta x\|$. Hence we have

$\kappa(x) \approx \left(\dfrac{\|\delta f(x)\|}{\|f(x)\|}\right)\Big/\left(\dfrac{\|\delta x\|}{\|x\|}\right) \le \dfrac{\|J(x)\|\,\|\delta x\|}{\|f(x)\|}\Big/\dfrac{\|\delta x\|}{\|x\|} = \|J(x)\|\,\dfrac{\|x\|}{\|f(x)\|}$

and in fact we get equality, i.e. $\kappa(x) = \|J(x)\|\,\dfrac{\|x\|}{\|f(x)\|}$.

Consider $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x_1,x_2) = x_1 - x_2$. The Jacobian of $f$ is $J = \begin{pmatrix}1 & -1\end{pmatrix}$. We have $\|J\| = \sqrt2$ and so

$\kappa(x) = \dfrac{\sqrt2\,\|x\|}{|x_1 - x_2|}$

If $x_1$ and $x_2$ are very close this can be very large. Thus this problem is ill-conditioned.

Consider $f : \mathbb{R} \to \mathbb{R}$, $f(x) = x^2$. Then $J(x) = 2x$ and so $\kappa(x) = \dfrac{|2x|\,|x|}{|x^2|} = 2$. Thus this problem is well-conditioned.
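The ill-conditioning of subtraction is just catastrophic cancellation. A quick Python sketch (the values are chosen only for illustration):

```python
# x1 and x2 agree to 9 digits, so kappa = sqrt(2)*||x||/|x1 - x2| ~ 1e9.
x1, x2 = 1.000000001, 1.0
exact = x1 - x2

# Perturb x1 by a relative 1e-12 and watch the output error blow up.
perturbed = x1 * (1 + 1e-12) - x2
rel_in = 1e-12
rel_out = abs(perturbed - exact) / abs(exact)

# The relative output error is many orders of magnitude larger than
# the relative input error, as the condition number predicts.
assert rel_out > 1e5 * rel_in
```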

We shall consider a less trivial example, namely finding the roots of a polynomial.

Example 10.3 Consider a polynomial $p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1}x^{n-1} + x^n$. If we perturb a single coefficient, what happens to the roots?

To figure this out we consider a function $F : \mathbb{R}^{n+1} \to \mathbb{R}$ defined by

$F(b_0, b_1, b_2, \ldots, b_{n-1}, z) = b_0 + b_1 z + b_2 z^2 + \cdots + b_{n-1}z^{n-1} + z^n$

Thus $F(a_0, a_1, \ldots, a_{n-1}, x) = p(x)$. Now assume $x_0$ is a simple root of $p(x)$, i.e. $p(x_0) = 0$ and $p'(x_0) \ne 0$. The Implicit Function Theorem says that if $\frac{\partial F}{\partial z}(a_0, a_1, \ldots, a_{n-1}, x_0) \ne 0$ then there is a function $f : \mathbb{R}^n \to \mathbb{R}$ defined in a neighborhood of the point $(a_0, a_1, \ldots, a_{n-1})$ such that

$F(b_0, b_1, \ldots, b_{n-1}, f(b_0, b_1, \ldots, b_{n-1})) = 0$

for $(b_0, b_1, \ldots, b_{n-1})$ in this neighborhood, and $f(a_0, a_1, \ldots, a_{n-1}) = x_0$. Thus $f(b_0, b_1, \ldots, b_{n-1})$ is a root of the polynomial $b_0 + b_1 x + b_2 x^2 + \cdots + b_{n-1}x^{n-1} + x^n$, and $\frac{\partial f}{\partial b_i}(a_0, a_1, \ldots, a_{n-1})$ measures how the root $x_0$ is affected by a small perturbation of the $i$th coefficient. Since

$\dfrac{\partial F}{\partial z}(a_0, a_1, \ldots, a_{n-1}, x_0) = \dfrac{\partial}{\partial z}\big(a_0 + a_1 z + a_2 z^2 + \cdots + a_{n-1}z^{n-1} + z^n\big)(x_0) = p'(x_0) \ne 0$

the Implicit Function Theorem does in fact apply.

Now using the chain rule we compute

$\dfrac{\partial}{\partial b_i}\,F(b_0, b_1, \ldots, b_{n-1}, f(b_0, b_1, \ldots, b_{n-1})) = \dfrac{\partial F}{\partial b_i} + \dfrac{\partial F}{\partial z}\,\dfrac{\partial f}{\partial b_i} = 0$

hence

$\dfrac{\partial f}{\partial b_i}(a_0, a_1, \ldots, a_{n-1}) = -\dfrac{\partial F}{\partial b_i}(a_0, a_1, \ldots, a_{n-1}, x_0)\Big/ p'(x_0)$

and

$\dfrac{\partial F}{\partial b_i}(a_0, a_1, \ldots, a_{n-1}, x_0) = \dfrac{\partial}{\partial b_i}\big(b_0 + b_1 z + b_2 z^2 + \cdots + b_{n-1}z^{n-1} + z^n\big)\Big|_{(a_0, a_1, \ldots, a_{n-1}, x_0)} = x_0^i$

so we get

$\dfrac{\partial f}{\partial b_i}(a_0, a_1, \ldots, a_{n-1}) = -\dfrac{x_0^i}{p'(x_0)}$

Thus the relative condition number is

$\kappa = \dfrac{|\delta x_0|}{|x_0|}\Big/\dfrac{|\delta a_i|}{|a_i|} \approx \Big|\dfrac{\partial f}{\partial b_i}(a_0, a_1, \ldots, a_{n-1})\Big|\,\dfrac{|a_i|}{|x_0|} = \Big|\dfrac{a_i\,x_0^{i-1}}{p'(x_0)}\Big|$

This number can be incredibly large; consider for instance the polynomial

$p(x) = (x-1)(x-2)(x-3)\cdots(x-20)$

The coefficient $a_{15}$ is approximately $1.67\times 10^9$, and for the root $x_0 = 15$ we get $p'(x_0) = (x_0-1)\cdots(x_0-14)(x_0-16)\cdots(x_0-20) = 14!\,(-1)(-2)(-3)(-4)(-5) = -14!\,5!$. Thus

$\kappa \approx \dfrac{1.67\times 10^9 \cdot 15^{14}}{14!\,5!} \approx 5.1\times 10^{13}$
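The size of this condition number can be checked directly. A Python sketch of the formula $\kappa = |a_i\,x_0^{i-1}/p'(x_0)|$ with $i = 15$, $x_0 = 15$ (what matters is its enormous order of magnitude, not the exact leading digit):

```python
# Build the coefficients a_0..a_19 of p(x) = (x-1)(x-2)...(x-20),
# stored lowest degree first (the polynomial is monic of degree 20).
p = [1.0]
for r in range(1, 21):
    # multiply the running polynomial by (x - r)
    p = [-r * p[0]] + [p[k - 1] - r * p[k] for k in range(1, len(p))] + [p[-1]]

a15 = p[15]
assert abs(abs(a15) - 1.67e9) < 0.01e9   # |a_15| is about 1.67e9

# p'(15) = product over j != 15 of (15 - j) = -14! * 5!
dp = 1.0
for j in range(1, 21):
    if j != 15:
        dp *= (15 - j)

kappa = abs(a15 * 15 ** 14 / dp)
assert kappa > 1e12   # the root x0 = 15 is violently ill-conditioned
```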

We can visualize the ill-conditioning of this problem by using the following MATLAB code (enter it into an M-file and name it rootplot.m)

for t=1:1000
p=poly(1:20);
p(6)=p(6)+10e-6*randn(1);
r=roots(p);
r1=real(r);
r2=imag(r);
plot(r1,r2,'.')
end

First execute

>> plot(zeros(1,20),'*')
>> hold

to plot the roots of p(x). The code then perturbs the coefficient $a_{15}$ by adding a very small random increment (10e-6, i.e. $10^{-5}$, times a random number chosen from a standard normal distribution). It then finds the roots of the perturbed polynomial and plots the roots.

Next consider an $m\times n$ matrix $A$ and consider the problem of computing $Ax$ from an input $x$. By definition the relative condition number is

$\kappa(x) = \sup_{\delta x}\ \left(\dfrac{\|A(x+\delta x) - Ax\|}{\|Ax\|}\right)\Big/\left(\dfrac{\|\delta x\|}{\|x\|}\right) = \sup_{\delta x}\ \dfrac{\|A\,\delta x\|}{\|\delta x\|}\ \Big/\ \dfrac{\|Ax\|}{\|x\|} = \|A\|\,\dfrac{\|x\|}{\|Ax\|}$

If $A$ happens to be square and non-singular we have $\dfrac{\|x\|}{\|Ax\|} \le \|A^{-1}\|$, and so in this case we get

$\kappa \le \|A\|\,\|A^{-1}\|$


Figure 6: the roots of the randomly perturbed polynomial, as plotted by rootplot.m

This number can also be very, very large. It is not hard to see that $\|A\|$ equals the largest singular value $\sigma_1$ of $A$, and $\|A^{-1}\|$ equals the largest singular value of $A^{-1}$, which is $1/\sigma_m$ where $\sigma_m$ is the smallest singular value of $A$. Thus we get

$\kappa \le \dfrac{\sigma_1}{\sigma_m}$

where $\sigma_1$ (resp. $\sigma_m$) is the largest (resp. the smallest) singular value of $A$.

The number $\kappa(A) = \|A\|\,\|A^{-1}\|$ is called the condition number of the matrix $A$, and the matrix $A$ is said to be well-conditioned (resp. ill-conditioned) if this number is relatively small (resp. large). If $A$ is not square we define the condition number by $\kappa(A) = \|A\|\,\|A^{+}\|$ where $A^{+}$ is the pseudo-inverse as defined in section 9.
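The identity $\kappa(A) = \sigma_1/\sigma_m$ is easy to verify numerically. A NumPy sketch (numpy.linalg.cond computes exactly this quantity):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))

s = np.linalg.svd(A, compute_uv=False)   # singular values, descending
kappa_svd = s[0] / s[-1]                 # sigma_1 / sigma_m

# ||A|| ||A^-1|| in the 2-norm gives the same number.
kappa_inv = np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2)

assert np.isclose(kappa_svd, kappa_inv)
assert np.isclose(kappa_svd, np.linalg.cond(A, 2))
```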

Theorem 10.0.5 Consider the equation $Ax = b$. The condition number of computing $x$ given $b$, with respect to perturbing $b$, is $\kappa(A)$.

Suppose now that we perturb $A$ by a small amount $A + \delta A$ and keep $b$ fixed. Then $x$ must also be perturbed, so we have

$(A + \delta A)(x + \delta x) = b$

Writing this out and ignoring the second order infinitesimal $\delta A(\delta x)$ we get

$Ax + \delta A\,x + A\,\delta x = b$

and since $Ax = b$,

$\delta x = -A^{-1}\,\delta A\,x$

This implies

$\|\delta x\| \le \|A^{-1}\|\,\|\delta A\|\,\|x\|$

and hence

$\dfrac{\|\delta x\|}{\|x\|}\Big/\dfrac{\|\delta A\|}{\|A\|} \le \|A^{-1}\|\,\|A\| = \kappa(A)$

Thus we have shown

Theorem 10.0.6 Let $b$ be fixed. The condition number of solving the equation

$Ax = b$

with respect to perturbations of $A$ is $\kappa(A)$.

We can now estimate the accuracy of a backward stable algorithm in terms of the condition number:

Theorem 10.0.7 Let $f : X \to Y$ be a problem and let $\tilde f$ be a backwards stable algorithm for $f$. Let $x \in X$ and let $\kappa(x)$ be the relative condition number. Then the relative error satisfies

$\dfrac{\|\tilde f(x) - f(x)\|}{\|f(x)\|} = O(\kappa(x)\,\varepsilon_{\mathrm{machine}})$

Proof: By definition of backward stability we can find $\tilde x \in X$ such that

$\dfrac{\|\tilde x - x\|}{\|x\|} = O(\varepsilon_{\mathrm{machine}})$

and $f(\tilde x) = \tilde f(x)$.

By the definition of the condition number we have

$\dfrac{\|f(x) - f(\tilde x)\|}{\|f(x)\|}\ \Big/\ \dfrac{\|\tilde x - x\|}{\|x\|} \le \kappa(x) + \mathrm{const.}$

hence

$\dfrac{\|f(x) - \tilde f(x)\|}{\|f(x)\|} \le (\kappa(x) + \mathrm{const.})\,\dfrac{\|\tilde x - x\|}{\|x\|}$

which gives the result.

There are two properties characterizing a good algorithm: it has to be accurate and it has to be fast. We have studied the first of these properties.

The speed of an algorithm is measured by how many floating point operations, or flops, it takes to implement it. For matrix operations the number of flops is typically of the order of $C\,m^3$ where $C$ is a constant depending on the algorithm. This can clearly get large very quickly, so it is of some importance to try to minimize the constant $C$, though for large $m$ the constant matters far less than the $m^3$ growth. It is not hard to estimate the number of flops for a given algorithm, but we shall only give a simple example of multiplying two $m\times m$ matrices

$A = \begin{pmatrix} a_{11} & a_{12} & \ldots & a_{1m}\\ a_{21} & a_{22} & \ldots & a_{2m}\\ \vdots & \vdots & \ddots & \vdots\\ a_{m1} & a_{m2} & \ldots & a_{mm} \end{pmatrix}$ and $B = \begin{pmatrix} b_{11} & b_{12} & \ldots & b_{1m}\\ b_{21} & b_{22} & \ldots & b_{2m}\\ \vdots & \vdots & \ddots & \vdots\\ b_{m1} & b_{m2} & \ldots & b_{mm} \end{pmatrix}$

The $ij$th entry in the product matrix is given by $a_{i1}b_{1j} + a_{i2}b_{2j} + \cdots + a_{im}b_{mj}$. Thus to compute each entry in the product matrix we need $m$ products and $m-1$ additions, for a total of $2m-1$ flops. There are $m^2$ entries, so we need on the order of $(2m-1)m^2$ flops, which when $m$ becomes very large approaches $2m^3$.

Consider the equation

$Ax = b$

where $A$ is a non-singular $m\times m$ (i.e. square) matrix. Consider the following algorithm for solving this equation:

1. Compute the QR-decomposition of $A$, $QR = A$, using the Householder triangularization

2. Let $y = {}^tQ\,b$ (since $Q$ is orthogonal, $Q^{-1} = {}^tQ$)

3. Solve the upper-triangular system $Rx = y$ using back substitution

All these steps are backward stable, so the whole algorithm is backward stable and we get the accuracy estimate

$\dfrac{\|\tilde x - x\|}{\|x\|} = O(\kappa(A)\,\varepsilon_{\mathrm{machine}})$

Here is a MATLAB experiment

>> A=randn(100);
>> kappa=cond(A)

kappa =

644.1458

>> b=randn(100,1);
>> [Q,R]=qr(A);
>> y=Q'*b;
>> tildex=R^-1*y;
>> x=A\b;
>> norm(x-tildex)

ans =

3.2933e-013

We first construct a $100\times 100$ random matrix. Then we use the cond(A) MATLAB command to compute the condition number; this matrix is well-conditioned. Next we construct a $100$-dimensional random column vector b. We then use the algorithm above to solve the system $Ax = b$ and get the solution tildex. Next we use MATLAB's built-in equation solver (using \). MATLAB's algorithm in fact uses a more precise variant of the QR-decomposition (QR-decomposition with pivoting), but as we can see, our algorithm gives a solution quite close to the MATLAB solution.
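The same experiment can be run in Python/NumPy (numpy.linalg.qr calls LAPACK's Householder-based routine, mirroring step 1 of the algorithm; numpy.linalg.solve stands in for the triangular back substitution):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 100))
b = rng.standard_normal(100)

# Solve Ax = b via QR: y = Q^t b, then solve the triangular system Rx = y.
Q, R = np.linalg.qr(A)
y = Q.T @ b
x_qr = np.linalg.solve(R, y)

# Compare against the library's direct solver.
x_ref = np.linalg.solve(A, b)
rel_err = np.linalg.norm(x_qr - x_ref) / np.linalg.norm(x_ref)
assert rel_err < 1e-8   # both answers agree to roughly kappa(A)*eps
```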

11 LU-decomposition

Let $A$ be an $m\times m$ matrix. The purpose of the LU-decomposition is to write $A = L\,U$ where $L$ is a lower triangular matrix and $U$ an upper triangular matrix. The usual algorithm for this is known as Gaussian elimination, and consists simply of subtracting multiples of one row from all the subsequent rows to step-by-step produce $0$s below the diagonal. It turns out that we can achieve this by multiplying $A$ by a sequence of lower triangular matrices with $1$s in the diagonal:

$L_{m-1}\cdots L_2 L_1 A = U$

Thus $L^{-1} = L_{m-1}\cdots L_2 L_1$.

Multiplying by $L_1$ creates $0$s below the first entry in the first column. Next, multiplying by $L_2$ creates $0$s below the first two entries in the second column without disturbing the first column, and so on.

It is easier to do an example than to give a general explanation of this procedure:

Example 11.1 Let $A = \begin{pmatrix} 2 & 1 & 1\\ 4 & 3 & 3\\ 8 & 7 & 9 \end{pmatrix}$. We see that subtracting $2\times$ the first row from the second row and $4\times$ the first row from the third row will create $0$s below the first entry in the first column. We can achieve this by left-multiplying by the matrix $L_1 = \begin{pmatrix} 1 & 0 & 0\\ -2 & 1 & 0\\ -4 & 0 & 1 \end{pmatrix}$. We get $L_1 A = \begin{pmatrix} 2 & 1 & 1\\ 0 & 1 & 1\\ 0 & 3 & 5 \end{pmatrix}$.

Next we subtract $3\times$ the second row from the third row to get $0$s below the second entry in the second column. We achieve this by multiplying by the matrix $L_2 = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & -3 & 1 \end{pmatrix}$. We get $L_2 L_1 A = \begin{pmatrix} 2 & 1 & 1\\ 0 & 1 & 1\\ 0 & 0 & 2 \end{pmatrix} = U$. So

$L^{-1} = L_2 L_1 = \begin{pmatrix} 1 & 0 & 0\\ -2 & 1 & 0\\ 2 & -3 & 1 \end{pmatrix}$ and $L = \begin{pmatrix} 1 & 0 & 0\\ 2 & 1 & 0\\ 4 & 3 & 1 \end{pmatrix}$,

so $A = \begin{pmatrix} 1 & 0 & 0\\ 2 & 1 & 0\\ 4 & 3 & 1 \end{pmatrix}\begin{pmatrix} 2 & 1 & 1\\ 0 & 1 & 1\\ 0 & 0 & 2 \end{pmatrix}$ is the LU-decomposition.
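The elimination in the example can be sketched in a few lines of Python/NumPy; the loop below records each multiplier in $L$ as it clears the corresponding entry of $U$:

```python
import numpy as np

A = np.array([[2., 1., 1.],
              [4., 3., 3.],
              [8., 7., 9.]])

# Gaussian elimination without pivoting, storing the multipliers in L.
m = A.shape[0]
U = A.copy()
L = np.eye(m)
for k in range(m - 1):
    for j in range(k + 1, m):
        L[j, k] = U[j, k] / U[k, k]        # multiplier l_jk
        U[j, k:] -= L[j, k] * U[k, k:]     # subtract l_jk times row k

assert np.allclose(L, [[1, 0, 0], [2, 1, 0], [4, 3, 1]])
assert np.allclose(U, [[2, 1, 1], [0, 1, 1], [0, 0, 2]])
assert np.allclose(L @ U, A)
```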

The algorithm proceeds in the following manner: assume that in the $k$th step we have produced $X_k = L_k L_{k-1}\cdots L_1 A$ such that the first $k$ columns in $X_k$ have $0$s below the diagonal. In the $k+1$st step we want to multiply on the left by a lower triangular matrix $L_{k+1}$ making the entries in the $k+1$st column from the $k+2$nd to the $m$th row equal to $0$. Thus $X_k$ is of the form

$X_k = \begin{pmatrix} x_{11} & x_{12} & \ldots & x_{1k} & x_{1,k+1} & \ldots & x_{1m}\\ 0 & x_{22} & \ldots & x_{2k} & x_{2,k+1} & \ldots & x_{2m}\\ 0 & 0 & \ldots & x_{3k} & x_{3,k+1} & \ldots & x_{3m}\\ \vdots & \vdots & & \vdots & \vdots & & \vdots\\ 0 & 0 & \ldots & x_{kk} & x_{k,k+1} & \ldots & x_{km}\\ 0 & 0 & \ldots & 0 & x_{k+1,k+1} & \ldots & x_{k+1,m}\\ 0 & 0 & \ldots & 0 & x_{k+2,k+1} & \ldots & x_{k+2,m}\\ \vdots & \vdots & & \vdots & \vdots & & \vdots\\ 0 & 0 & \ldots & 0 & x_{m,k+1} & \ldots & x_{mm} \end{pmatrix}$

Now take

$L_{k+1} = \begin{pmatrix} 1 & 0 & \ldots & 0 & 0 & \ldots & 0\\ 0 & 1 & \ldots & 0 & 0 & \ldots & 0\\ \vdots & \vdots & \ddots & \vdots & \vdots & & \vdots\\ 0 & 0 & \ldots & 1 & 0 & \ldots & 0\\ 0 & 0 & \ldots & -x_{k+2,k+1}/x_{k+1,k+1} & 1 & \ldots & 0\\ 0 & 0 & \ldots & -x_{k+3,k+1}/x_{k+1,k+1} & 0 & \ldots & 0\\ \vdots & \vdots & & \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \ldots & -x_{m,k+1}/x_{k+1,k+1} & 0 & \ldots & 1 \end{pmatrix}$

where the non-trivial entries sit in the $k+1$st column, rows $k+2$ through $m$. Left multiplication by $L_{k+1}$ produces $0$s in the $k+1$st column below the $k+1,k+1$ entry, so $X_{k+1} = L_{k+1}X_k$ is of the form

$X_{k+1} = \begin{pmatrix} x_{11} & x_{12} & \ldots & x_{1k} & x_{1,k+1} & \ldots & x_{1m}\\ 0 & x_{22} & \ldots & x_{2k} & x_{2,k+1} & \ldots & x_{2m}\\ \vdots & \vdots & & \vdots & \vdots & & \vdots\\ 0 & 0 & \ldots & x_{kk} & x_{k,k+1} & \ldots & x_{km}\\ 0 & 0 & \ldots & 0 & x_{k+1,k+1} & \ldots & x_{k+1,m}\\ 0 & 0 & \ldots & 0 & 0 & \ldots & \ast\\ \vdots & \vdots & & \vdots & \vdots & & \vdots\\ 0 & 0 & \ldots & 0 & 0 & \ldots & \ast \end{pmatrix}$

(the $\ast$ entries in the last columns are the updated values).

Put $\ell_{i,k+1} = x_{i,k+1}/x_{k+1,k+1}$; then

$L_{k+1}^{-1} = \begin{pmatrix} 1 & 0 & \ldots & 0 & 0 & \ldots & 0\\ 0 & 1 & \ldots & 0 & 0 & \ldots & 0\\ \vdots & \vdots & \ddots & \vdots & \vdots & & \vdots\\ 0 & 0 & \ldots & 1 & 0 & \ldots & 0\\ 0 & 0 & \ldots & \ell_{k+2,k+1} & 1 & \ldots & 0\\ 0 & 0 & \ldots & \ell_{k+3,k+1} & 0 & \ldots & 0\\ \vdots & \vdots & & \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \ldots & \ell_{m,k+1} & 0 & \ldots & 1 \end{pmatrix}$

Thus to compute $L_{k+1}^{-1}$ we only have to change the sign of the entries in the $k+1$st column below the diagonal. This observation makes it much easier to compute

$L = L_1^{-1}L_2^{-1}\cdots L_{m-1}^{-1} = \begin{pmatrix} 1 & 0 & 0 & \ldots & 0\\ \ell_{21} & 1 & 0 & \ldots & 0\\ \ell_{31} & \ell_{32} & 1 & \ldots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \ell_{m1} & \ell_{m2} & \ell_{m3} & \ldots & 1 \end{pmatrix}$

We can now formulate the algorithm for computing the LU-decomposition:

$U = A$, $L = E_m$
for $k = 1$ to $m-1$
  for $j = k+1$ to $m$
    $\ell_{jk} = u_{jk}/u_{kk}$
    $u_{j,k:m} = u_{j,k:m} - \ell_{jk}\,u_{k,k:m}$

We can then do an operation count for this algorithm: for each pass of the inner loop the first line counts $1$ flop, and the second line counts $2$ flops for each of $u_{j,k}, u_{j,k+1}, \ldots, u_{j,m}$. Thus in total $2(m-k+1)+1 = 2(m-k)+3$, hence the inner loop contributes $(m-k)(2(m-k)+3) = 2(m-k)^2 + 3(m-k)$, and so the total operation count is

$\sum_{k=1}^{m-1} 2(m-k)^2 + 3(m-k) = 2\,\dfrac{(m-1)m(2(m-1)+1)}{6} + 3\,\dfrac{(m-1)m}{2}$

As $m \to \infty$ this grows like $\frac{2}{3}m^3$.

If $A$ is factored $A = LU$ we can solve the system $Ax = b$ by solving two triangular systems. First $Ly = b$ using forward substitution, which gives

$y_1 = b_1$
$y_2 = b_2 - \ell_{21}y_1$
$y_3 = b_3 - \ell_{31}y_1 - \ell_{32}y_2$
$\vdots$
$y_m = b_m - \ell_{m1}y_1 - \ell_{m2}y_2 - \cdots - \ell_{m,m-1}y_{m-1}$

and secondly $Ux = y$ using back substitution.
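The two triangular solves can be sketched directly, using the $L$ and $U$ from Example 11.1 (the right-hand side $b$ is chosen arbitrarily):

```python
import numpy as np

L = np.array([[1., 0., 0.], [2., 1., 0.], [4., 3., 1.]])
U = np.array([[2., 1., 1.], [0., 1., 1.], [0., 0., 2.]])
b = np.array([1., 2., 3.])

# Forward substitution: solve Ly = b (L is unit lower triangular).
m = len(b)
y = np.zeros(m)
for i in range(m):
    y[i] = b[i] - L[i, :i] @ y[:i]

# Back substitution: solve Ux = y.
x = np.zeros(m)
for i in range(m - 1, -1, -1):
    x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]

assert np.allclose((L @ U) @ x, b)   # x solves Ax = b where A = LU
```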

Consider the following example:

Example 11.2 $A = \begin{pmatrix} 10^{-20} & 1\\ 1 & 1 \end{pmatrix}$. The process yields $L = \begin{pmatrix} 1 & 0\\ 10^{20} & 1 \end{pmatrix}$ and $U = \begin{pmatrix} 10^{-20} & 1\\ 0 & 1 - 10^{20} \end{pmatrix}$. Assume now we perform this computation on a machine with $\varepsilon_{\mathrm{machine}} \approx 10^{-16}$; then each of these numbers will be rounded to the nearest floating point number represented by the machine. The number $1 - 10^{20}$ will be represented by $-10^{20}$ and so the machine will give

$\tilde L = \begin{pmatrix} 1 & 0\\ 10^{20} & 1 \end{pmatrix}$ and $\tilde U = \begin{pmatrix} 10^{-20} & 1\\ 0 & -10^{20} \end{pmatrix}$

If the algorithm were backward stable we would have $\tilde A = \tilde L\,\tilde U$ close to $A$. But $\tilde L\,\tilde U = \begin{pmatrix} 10^{-20} & 1\\ 1 & 0 \end{pmatrix}$. This is very far from $A$, as we can see by computing $\dfrac{\|A - \tilde A\|}{\|A\|}$. Instead, suppose we use the computed LU-decomposition to solve the equation $Ax = \begin{pmatrix}1\\0\end{pmatrix}$. The correct solution is $x \approx \begin{pmatrix}-1\\1\end{pmatrix}$, but using $\tilde A$ we get $\tilde x = \begin{pmatrix}0\\1\end{pmatrix}$.

We also note that the algorithm can fail outright: namely, if one of the $x_{kk}$s is $0$ then we would have a division by $0$ error.
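The failure in Example 11.2 can be reproduced in double precision; a Python sketch of the elimination and the two triangular solves:

```python
import numpy as np

A = np.array([[1e-20, 1.], [1., 1.]])
b = np.array([1., 0.])

# LU without pivoting: the multiplier is 1/1e-20, about 1e20.
l21 = A[1, 0] / A[0, 0]
U = A.copy()
U[1] -= l21 * U[0]          # u22 = 1 - 1e20 rounds to -1e20

# Solve Ly = b, then Ux = y.
y = np.array([b[0], b[1] - l21 * b[0]])
x2 = y[1] / U[1, 1]
x1 = (y[0] - U[0, 1] * x2) / U[0, 0]

# The true solution is about (-1, 1); the computed x1 is 0.
assert x2 == 1.0
assert x1 == 0.0
```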

A partial solution to these problems is to use pivoting. This means interchanging rows so as to bring a non-zero entry into the $kk$ position before we compute $L_k$. The procedure is as follows: at step $k$, find the entry in the $k$th column in or below the diagonal with the largest absolute value. Assume this entry is $x_{r,k}$; then we want to interchange the $k$th and the $r$th row to bring $x_{r,k}$ into the $kk$th position. Let $P_{r,k}$ be the matrix obtained from the identity matrix $E_m$ by interchanging the $k$th and $r$th rows, i.e. the matrix with $1$s in the diagonal except in the $k,k$th and $r,r$th positions, where there is instead a $1$ in the $k,r$th and the $r,k$th position. Left multiplication by $P_{r,k}$ interchanges the $k$th and $r$th rows without changing the other entries. Such a matrix is called a transposition matrix. Remark that $P_{r,k}^2 = E_m$. Thus the procedure produces an upper triangular matrix

$U = L_{m-1}P_{m-1}L_{m-2}P_{m-2}\cdots L_1 P_1 A$

where $P_{m-1}, P_{m-2}, \ldots, P_1$ are transposition matrices. Now define

$L'_k = P_{m-1}P_{m-2}\cdots P_{k+1}\,L_k\,P_{k+1}\cdots P_{m-2}P_{m-1}$

Right multiplication by $P_{r,k}$ interchanges the $r$th and the $k$th columns. Thus $P_{r,s}L_kP_{r,s}$ interchanges the $r$th and $s$th columns and the $r$th and $s$th rows. But both $r$ and $s$ are $> k$, hence this does not destroy the structure of $L_k$ ($1$s in the diagonal and one non-zero column). Now

$L'_2 L'_1\,P_{m-1}P_{m-2}\cdots P_1 = (P_{m-1}P_{m-2}\cdots P_3\,L_2\,P_3\cdots P_{m-2}P_{m-1})\,(P_{m-1}P_{m-2}\cdots P_3 P_2\,L_1\,P_2 P_3\cdots P_{m-2}P_{m-1})\,P_{m-1}P_{m-2}\cdots P_2 P_1 = (P_{m-1}P_{m-2}\cdots P_3)\,L_2 P_2 L_1 P_1$

Multiplying by $L'_3$ we get

$L'_3 L'_2 L'_1\,(P_{m-1}P_{m-2}\cdots P_1) = P_{m-1}P_{m-2}\cdots P_4\,L_3\,P_4\cdots P_{m-2}P_{m-1}\,(P_{m-1}P_{m-2}\cdots P_4 P_3)\,L_2 P_2 L_1 P_1 = P_{m-1}P_{m-2}\cdots P_4\,L_3 P_3 L_2 P_2 L_1 P_1$

Continuing this way we end up with

$L'_{m-1}L'_{m-2}\cdots L'_1\,(P_{m-1}P_{m-2}\cdots P_1) = L_{m-1}P_{m-1}L_{m-2}P_{m-2}\cdots L_2 P_2 L_1 P_1$

Let $P = P_{m-1}P_{m-2}\cdots P_1$; then we have

$U = L_{m-1}P_{m-1}L_{m-2}P_{m-2}\cdots L_1 P_1 A = L'_{m-1}L'_{m-2}\cdots L'_1\,PA$

Thus using pivoting we get an LU-decomposition of the matrix $PA$, which has the same rows as $A$ but permuted.

Recall that the entries in $L_k$ are $1$s in the diagonal, $0$s everywhere except in the $k$th column, where the entries are $-\ell_{j,k} = -x_{jk}/x_{kk}$ for $j > k$; but because after the pivoting $|x_{kk}|$ is maximal in the $k$th column, we have $|\ell_{jk}| \le 1$, and so all the entries in $L_k$ have absolute value $\le 1$.
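The whole procedure, including the claim $|\ell_{jk}| \le 1$, can be sketched in Python/NumPy (a straightforward implementation of partial pivoting, not a library call):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 6))

# Gaussian elimination with partial pivoting: PA = LU.
m = A.shape[0]
U = A.copy()
L = np.eye(m)
P = np.eye(m)
for k in range(m - 1):
    r = k + np.argmax(np.abs(U[k:, k]))   # row with the largest pivot
    U[[k, r]] = U[[r, k]]                 # swap rows of U ...
    P[[k, r]] = P[[r, k]]                 # ... and record the swap in P
    L[[k, r], :k] = L[[r, k], :k]         # swap already-computed multipliers
    for j in range(k + 1, m):
        L[j, k] = U[j, k] / U[k, k]
        U[j, k:] -= L[j, k] * U[k, k:]

assert np.allclose(P @ A, L @ U)
assert np.max(np.abs(L)) <= 1 + 1e-12     # every multiplier satisfies |l_jk| <= 1
```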

We have the following stability result:

Theorem 11.0.8 Let $A = LU$ be the LU-decomposition of a non-singular matrix. Let $\tilde L$ and $\tilde U$ be machine computed using Gaussian elimination without pivoting. Then $\tilde L\,\tilde U = A + \delta A$ with

$\dfrac{\|\delta A\|}{\|L\|\,\|U\|} = O(\varepsilon_{\mathrm{machine}})$

for some $m\times m$ matrix $\delta A$.

Remark that if we had $\|A\|$ in the denominator we would have backward stability. For Gaussian elimination $\|L\|$ and $\|U\|$ can be unboundedly large, and so the algorithm is unstable.

With pivoting, however, the entries in $L$ all have absolute value $\le 1$, so $\|L\| = O(1)$ (of course we now have $PA = LU$, but $\|PA\| = \|A\|$). Thus if $\|U\| = O(\|A\|)$ the algorithm will be backward stable.

Definition 11.0.7 The growth factor for the matrix $A$ is defined by

$\rho = \dfrac{\max_{i,j}|u_{ij}|}{\max_{i,j}|a_{ij}|}$

Thus if $\rho$ is of order $1$ we have $\|U\| = O(\|A\|)$ and the algorithm is backward stable. In general $\|U\| = O(\rho\,\|A\|)$.

Example 11.3 Consider the $m\times m$ matrix

$A = \begin{pmatrix} 1 & 0 & 0 & 0 & \ldots & 0 & 1\\ -1 & 1 & 0 & 0 & \ldots & 0 & 1\\ -1 & -1 & 1 & 0 & \ldots & 0 & 1\\ -1 & -1 & -1 & 1 & \ldots & 0 & 1\\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\ -1 & -1 & -1 & -1 & \ldots & 1 & 1\\ -1 & -1 & -1 & -1 & \ldots & -1 & 1 \end{pmatrix}$

One can compute the LU-decomposition (no pivoting is needed):

$L = \begin{pmatrix} 1 & 0 & 0 & \ldots & 0\\ -1 & 1 & 0 & \ldots & 0\\ -1 & -1 & 1 & \ldots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ -1 & -1 & -1 & \ldots & 1 \end{pmatrix}$ and $U = \begin{pmatrix} 1 & 0 & 0 & 0 & \ldots & 0 & 1\\ 0 & 1 & 0 & 0 & \ldots & 0 & 2\\ 0 & 0 & 1 & 0 & \ldots & 0 & 4\\ 0 & 0 & 0 & 1 & \ldots & 0 & 8\\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\ 0 & 0 & 0 & 0 & \ldots & 1 & 2^{m-2}\\ 0 & 0 & 0 & 0 & \ldots & 0 & 2^{m-1} \end{pmatrix}$

So in this case $\rho = 2^{m-1}$. For a $100\times 100$ matrix we would get $\|U\| = O(2^{99}\,\|A\|)$; thus the relative accuracy in this case is of the order $2^{99}\,\varepsilon_{\mathrm{machine}} \approx 2^{99}\cdot 2^{-53} = 2^{46}$. Clearly unacceptable.

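The exponential growth in Example 11.3 is easy to reproduce; a Python sketch that builds the matrix for a modest $m$ and runs the elimination without pivoting:

```python
import numpy as np

m = 8
# The pathological matrix: 1 on the diagonal, -1 strictly below it,
# and 1 in the last column.
A = np.tril(-np.ones((m, m)), -1) + np.eye(m)
A[:, -1] = 1.0

# Eliminate without pivoting (every pivot is 1, so pivoting would
# change nothing here anyway).
U = A.copy()
for k in range(m - 1):
    for j in range(k + 1, m):
        U[j, k:] -= (U[j, k] / U[k, k]) * U[k, k:]

# The last column doubles at every step, so the growth factor is 2^(m-1).
assert U[-1, -1] == 2 ** (m - 1)
growth = np.max(np.abs(U)) / np.max(np.abs(A))
assert growth == 2 ** (m - 1)
```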

In almost every case that arises in practice, Gaussian elimination with pivoting is well-behaved, i.e. $\rho = O(1)$. Only in very exceptional cases does the algorithm misbehave.

12 Cholesky Decomposition

Consider a symmetric matrix $A$. If ${}^tx\,A\,x > 0$ for any $x \ne 0$ we say that $A$ is positive definite. An example of a symmetric positive definite matrix is the covariance matrix of a collection of non-deterministic random variables $X_1, X_2, \ldots, X_m$:

$A = \begin{pmatrix} \mathrm{var}(X_1) & \mathrm{cov}(X_1,X_2) & \mathrm{cov}(X_1,X_3) & \ldots & \mathrm{cov}(X_1,X_m)\\ \mathrm{cov}(X_2,X_1) & \mathrm{var}(X_2) & \mathrm{cov}(X_2,X_3) & \ldots & \mathrm{cov}(X_2,X_m)\\ \mathrm{cov}(X_3,X_1) & \mathrm{cov}(X_3,X_2) & \mathrm{var}(X_3) & \ldots & \mathrm{cov}(X_3,X_m)\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \mathrm{cov}(X_m,X_1) & \mathrm{cov}(X_m,X_2) & \mathrm{cov}(X_m,X_3) & \ldots & \mathrm{var}(X_m) \end{pmatrix}$

A symmetric matrix is positive definite if and only if the eigenvalues of $A$ are all positive. Indeed, if $x$ is a unit eigenvector for the eigenvalue $\lambda$ (i.e. $\|x\| = 1$), then $0 < {}^tx\,A\,x = {}^tx\,(\lambda x) = \lambda(x,x) = \lambda$. Conversely, expressing $x$ as a linear combination of an orthonormal basis of eigenvectors, $x = c_1q_1 + c_2q_2 + \cdots + c_mq_m$, we get ${}^tx\,A\,x = \lambda_1c_1^2 + \lambda_2c_2^2 + \cdots + \lambda_mc_m^2 > 0$.

Write $A = \begin{pmatrix} a_{11} & {}^tw\\ w & K \end{pmatrix}$. Since ${}^te_1\,A\,e_1 = a_{11}$, we have $a_{11} > 0$, and by considering vectors with first coordinate $0$, i.e. of the form $x = \begin{pmatrix}0\\y\end{pmatrix}$ where $y$ is any $m-1$-dimensional vector, we get $0 < {}^tx\,A\,x = {}^ty\,K\,y$; we see that $K$ is a symmetric, positive definite $m-1\times m-1$ matrix. Let

$R_1 = \begin{pmatrix} \sqrt{a_{11}} & {}^tw/\sqrt{a_{11}}\\ 0 & E_{m-1} \end{pmatrix}$

Then we have

$A = \begin{pmatrix} \sqrt{a_{11}} & {}^t0\\ w/\sqrt{a_{11}} & E_{m-1} \end{pmatrix}\begin{pmatrix} 1 & {}^t0\\ 0 & K - w\,{}^tw/a_{11} \end{pmatrix}\begin{pmatrix} \sqrt{a_{11}} & {}^tw/\sqrt{a_{11}}\\ 0 & E_{m-1} \end{pmatrix}$

Then $A_1 = {}^tR_1^{-1}\,A\,R_1^{-1} = \begin{pmatrix} 1 & {}^t0\\ 0 & K - w\,{}^tw/a_{11} \end{pmatrix}$ is again positive definite; indeed ${}^tx\,A_1\,x = {}^t(R_1^{-1}x)\,A\,(R_1^{-1}x) > 0$. Now this implies, as above, that the symmetric $m-1\times m-1$ matrix $K - w\,{}^tw/a_{11}$ is positive definite.

Thus by the same method as above we can find an $m-1\times m-1$ matrix

$\hat R_2 = \begin{pmatrix} \sqrt{\alpha} & {}^tw_1/\sqrt{\alpha}\\ 0 & E_{m-2} \end{pmatrix}$

(where $\alpha$ denotes the first diagonal entry of $K - w\,{}^tw/a_{11}$ and $w_1$ the rest of its first column) such that $K - w\,{}^tw/a_{11} = {}^t\hat R_2\begin{pmatrix} 1 & {}^t0\\ 0 & K_1 \end{pmatrix}\hat R_2$, where $K_1$ is a symmetric $m-2\times m-2$ positive definite matrix. Let $R_2 = \begin{pmatrix} 1 & {}^t0\\ 0 & \hat R_2 \end{pmatrix}$; then we get

${}^tR_2\begin{pmatrix} 1 & 0 & {}^t0\\ 0 & 1 & {}^t0\\ 0 & 0 & K_1 \end{pmatrix}R_2 = A_1$

and hence $A = {}^tR_1\,{}^tR_2\begin{pmatrix} 1 & 0 & {}^t0\\ 0 & 1 & {}^t0\\ 0 & 0 & K_1 \end{pmatrix}R_2\,R_1$.

Continuing this way we get $A = {}^tR_m\,{}^tR_{m-1}\cdots{}^tR_2\,{}^tR_1\,E_m\,R_1R_2\cdots R_{m-1}R_m$.

Clearly the matrices $R_i$ are all upper triangular, hence the product $R = R_1R_2\cdots R_{m-1}R_m$ is upper triangular, and we have proved that $A = {}^tR\,R$ where $R$ is an upper triangular matrix. This is the Cholesky decomposition. Thus we have

Theorem 12.0.9 Let $A$ be a symmetric, positive definite matrix. Then we can write $A = {}^tR\,R$ where $R$ is an upper triangular matrix.

The following MATLAB session indicates that the Cholesky decomposition is backward stable:

>> R=triu(rand(50));
>> A=R'*R;
>> S=chol(A);
>> norm(S-R)

ans =

3.3056

>> norm(S'*S-A)

ans =

8.5023e-015

>>
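The session can be mirrored in Python/NumPy. Note two differences, which are assumptions of this sketch: numpy.linalg.cholesky returns the lower triangular factor (so we transpose it), and here $R$ is given a dominant diagonal so that $A$ is comfortably positive definite and the factorization is guaranteed to succeed:

```python
import numpy as np

rng = np.random.default_rng(4)
R = np.triu(rng.random((50, 50))) + 2.0 * np.eye(50)   # upper triangular, positive diagonal
A = R.T @ R                                            # symmetric positive definite

S = np.linalg.cholesky(A).T   # upper triangular factor with A = S^t S

# Backward stability: S^t S reproduces A essentially to machine precision.
assert np.linalg.norm(S.T @ S - A) < 1e-10 * np.linalg.norm(A)
```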

13 Eigenvalues

In this section we allow our matrices to have entries in $\mathbb{C}$.

The eigenvalues of the $m\times m$ matrix $A$ are the roots of the characteristic polynomial $\det(xE_m - A)$. This is a degree $m$ polynomial and so has $m$ roots in the complex numbers, when counted with multiplicities. Thus we can factor

$\det(xE_m - A) = (x-\lambda_1)^{m_1}(x-\lambda_2)^{m_2}\cdots(x-\lambda_k)^{m_k}$

where $\lambda_1, \lambda_2, \ldots, \lambda_k$ are the distinct eigenvalues of $A$ and $m_1 + m_2 + \cdots + m_k = m$.

Definition 13.0.8 The algebraic multiplicity of an eigenvalue $\lambda_i$ is the root multiplicity $m_i$.

Definition 13.0.9 The geometric multiplicity of an eigenvalue $\lambda_i$ is the dimension of the eigenspace $\ker(A - \lambda_iE_m)$.

Lemma 13.0.1 Let $X$ be an invertible $m\times m$ matrix. Then the matrix $X^{-1}AX$ has the same characteristic polynomial as $A$, and hence has the same eigenvalues with the same algebraic multiplicities as $A$.

Proof: We have $\det(xE_m - X^{-1}AX) = \det(xX^{-1}X - X^{-1}AX) = \det(X^{-1}(xE_m - A)X) = \det(X^{-1})\det(xE_m - A)\det(X) = \det(xE_m - A)$.
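The lemma is easy to check numerically; a small deterministic Python sketch (the particular $A$ and $X$ are chosen only for illustration):

```python
import numpy as np

# An upper triangular A with eigenvalues 2, 3, 5, and an invertible X.
A = np.array([[2., 1., 0.],
              [0., 3., 1.],
              [0., 0., 5.]])
X = np.array([[1., 2., 0.],
              [0., 1., 1.],
              [1., 0., 1.]])   # det(X) = 3, so X is invertible

B = np.linalg.inv(X) @ A @ X   # similar to A

# Same characteristic polynomial, hence the same eigenvalues.
assert np.allclose(sorted(np.linalg.eigvals(B).real), [2., 3., 5.])
```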

Let $n_i$ be the geometric multiplicity of $\lambda_i$ and consider a basis $v_1, v_2, \ldots, v_{n_i}$ of the eigenspace. We extend this to a basis of $\mathbb{C}^m$. Let $C$ be the coordinate transformation matrix between this basis and the standard basis. Then $C^{-1}AC$ is the matrix w.r.t. this basis. But we have $C^{-1}ACv_j = \lambda_iv_j$, hence $C^{-1}AC$ is of the form

$C^{-1}AC = \begin{pmatrix} \lambda_iE_{n_i} & S\\ 0 & T \end{pmatrix}$

where $S$ is an $n_i\times m-n_i$ matrix and $T$ an $m-n_i\times m-n_i$ matrix. Then $\det(xE_m - C^{-1}AC) = (x-\lambda_i)^{n_i}\det(xE_{m-n_i} - T)$. But $\det(xE_m - C^{-1}AC) = \det(xE_m - A)$, and so $(x-\lambda_i)^{n_i}$ divides the characteristic polynomial of $A$. It follows that $n_i \le m_i$, and so we have shown

Proposition 13.0.3 The geometric multiplicity of an eigenvalue $\lambda$ is always $\le$ the algebraic multiplicity of $\lambda$.

Definition 13.0.10 An eigenvalue $\lambda$ is said to be defective if the geometric multiplicity of $\lambda$ is $<$ the algebraic multiplicity of $\lambda$.

Definition 13.0.11 The matrix $A$ is said to be defective if it has at least one defective eigenvalue.

Theorem 13.0.10 A matrix $A$ is diagonalizable, i.e. $A = X^{-1}\Lambda X$ where $\Lambda$ is a diagonal matrix, if and only if $A$ is non-defective.

Proof: A diagonal matrix is non-defective; indeed, if

$\Lambda = \begin{pmatrix} \lambda_1 & & & & & \\ & \ddots & & & & \\ & & \lambda_1 & & & \\ & & & \lambda_2 & & \\ & & & & \ddots & \\ & & & & & \lambda_k \end{pmatrix}$

then clearly the algebraic multiplicity of an eigenvalue $\lambda_i$ is the number of times $\lambda_i$ appears in the diagonal, but this is also the dimension of $\ker(\Lambda - \lambda_iE_m)$.

To go the other way, let $m_i$ be the algebraic multiplicity of $\lambda_i$. By assumption this is also the geometric multiplicity, so we can find a basis $v_{i1}, v_{i2}, \ldots, v_{im_i}$ of the eigenspace for $\lambda_i$. Since eigenvectors belonging to different eigenvalues are linearly independent by Proposition 4.0.6, the totality of these vectors for all the distinct eigenvalues is linearly independent, and since $m_1 + m_2 + \cdots + m_i + \cdots + m_k = m$ they form a basis, i.e. we have a basis of eigenvectors. Then the matrix w.r.t. this basis is diagonal, and if $X$ is the coordinate transformation matrix we have $XAX^{-1} = \Lambda$ diagonal.

We know from the Spectral Theorem (Theorem 4.1.1) that if A is sym-

metric the we can nd an orthogonal matrix Q such that

t

QAQ = is

diagonal. We have the following general theorem:

Theorem 13.0.11 Any square matrix A has Schur factorization

A = Q T Q

1

where U is an upper triangular matrix and Q is unitary

Proof: We use induction on $m$. For $m = 1$ it is trivial; assume it is true for $(m-1)\times(m-1)$ matrices. Let $q_1$ be an eigenvector in $\mathbb{C}^m$ with eigenvalue $\lambda$. Extend $q_1$ to an orthonormal basis $q_1, q_2, \dots, q_m$. Let $U$ be the $m \times m$ matrix with these vectors as columns; then $U$ is unitary. Since $U$ is the coordinate transformation matrix between the standard basis and the basis consisting of the $q$'s, and since $Aq_1 = \lambda q_1$, the matrix $U^{-1}AU$ is of the form $\begin{pmatrix} \lambda & {}^t b \\ 0 & C \end{pmatrix}$. Here $C$ is an $(m-1)\times(m-1)$ matrix, so by the induction hypothesis $C$ has a Schur factorization $C = V S V^{-1}$ where $V$ is unitary and $S$ is upper triangular.

Now put $Q = U \begin{pmatrix} 1 & {}^t 0 \\ 0 & V \end{pmatrix}$; then $Q$ is unitary and we have
$$Q^{-1} A Q = \begin{pmatrix} 1 & {}^t 0 \\ 0 & V^{-1} \end{pmatrix} U^{-1} A U \begin{pmatrix} 1 & {}^t 0 \\ 0 & V \end{pmatrix}
= \begin{pmatrix} 1 & {}^t 0 \\ 0 & V^{-1} \end{pmatrix}\begin{pmatrix} \lambda & {}^t b \\ 0 & C \end{pmatrix}\begin{pmatrix} 1 & {}^t 0 \\ 0 & V \end{pmatrix}
= \begin{pmatrix} \lambda & {}^t b \\ 0 & V^{-1} C \end{pmatrix}\begin{pmatrix} 1 & {}^t 0 \\ 0 & V \end{pmatrix}
= \begin{pmatrix} \lambda & {}^t b V \\ 0 & V^{-1} C V \end{pmatrix}
= \begin{pmatrix} \lambda & {}^t b V \\ 0 & S \end{pmatrix}$$

Let $T = \begin{pmatrix} \lambda & {}^t b V \\ 0 & S \end{pmatrix}$. Since $S$ is upper triangular, $T$ is as well, and $Q$ is unitary; thus we have a Schur factorization of $A$. $\square$

The eigenvalues of $A$ are the diagonal elements of $T$, since $A$ and $T$ have the same characteristic polynomial and the characteristic polynomial of an upper triangular matrix $T$ is $(x - t_{11})(x - t_{22})\cdots(x - t_{mm})$.

We shall now turn to the problem of actually computing eigenvalues. The obvious way would be to find the characteristic polynomial and use an algorithm to find its roots. As we have seen, finding roots of a polynomial is an extremely ill-conditioned problem, so it is unlikely that this method will give good results. But is there another algorithm to find the eigenvalues? It is known that for general polynomials of degree $\geq 5$ there is no algorithm with finitely many steps for finding the roots as a function of the coefficients. Since any polynomial can be realized as the characteristic polynomial of a matrix (the polynomial $p(x) = a_0 + a_1 x + a_2 x^2 + \dots + a_{m-1}x^{m-1} + x^m$ is the characteristic polynomial of the companion matrix
$$\begin{pmatrix}
0 & 0 & 0 & \dots & \dots & -a_0 \\
1 & 0 & 0 & \dots & \dots & -a_1 \\
0 & 1 & 0 & \dots & \dots & -a_2 \\
\vdots & & \ddots & & & \vdots \\
0 & 0 & 0 & \dots & 0 & -a_{m-2} \\
0 & 0 & 0 & \dots & 1 & -a_{m-1}
\end{pmatrix}\Big)$$
there cannot be any general algorithm to find eigenvalues that concludes in finitely many steps. Hence we have to resort to iterative methods to find approximations to the eigenvalues. Thus the goal of an eigenvalue-finding algorithm should be to iteratively construct a sequence of numbers that converges quickly to an eigenvalue.
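The companion-matrix construction above is easy to check numerically. Here is a hedged Python/NumPy sketch (Python rather than the MATLAB used elsewhere in these notes; the polynomial $p(x) = x^3 - 6x^2 + 11x - 6 = (x-1)(x-2)(x-3)$ is an arbitrary illustrative choice):

```python
import numpy as np

# Companion matrix of p(x) = a0 + a1 x + a2 x^2 + x^3.
# Illustrative polynomial: p(x) = x^3 - 6x^2 + 11x - 6, so a0=-6, a1=11, a2=-6.
a = [-6.0, 11.0, -6.0]
m = len(a)
C = np.zeros((m, m))
C[1:, :m - 1] = np.eye(m - 1)     # ones on the subdiagonal
C[:, m - 1] = [-c for c in a]     # last column holds -a_0, ..., -a_{m-1}

roots = np.sort(np.linalg.eigvals(C).real)
print(np.round(roots, 6))         # the roots 1, 2, 3 appear as eigenvalues
```

The eigenvalues of $C$ come out as the roots $1, 2, 3$ of $p$, confirming that $p$ is its characteristic polynomial.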

To simplify matters we shall assume that $A$ is real and symmetric, so the eigenvalues are real. We shall also assume we have ordered the eigenvalues according to absolute value, i.e. $|\lambda_1| \geq |\lambda_2| \geq \dots \geq |\lambda_m|$, and that we have a corresponding orthonormal basis of eigenvectors $q_1, q_2, \dots, q_m$.

Definition 13.0.12 Let $x$ be a non-zero vector. The Rayleigh quotient is
$$r(x) = \frac{{}^t x A x}{{}^t x\, x} = \frac{{}^t x A x}{||x||^2}$$

Remark that if $x$ is an eigenvector, $Ax = \lambda x$, then $r(x) = \lambda$.

We view $r$ as a function $\mathbb{R}^m \setminus \{0\} \to \mathbb{R}$. In terms of the coordinates of $x$ and the entries of $A$ we can write it as
$$r(x_1, x_2, \dots, x_m) = \frac{\sum_{i,j} a_{ij} x_i x_j}{\sum_i x_i^2}$$

Computing the partial derivatives we get
$$\frac{\partial r}{\partial x_k} = \frac{\big(2\sum_i a_{ik} x_i\big)\big(\sum_i x_i^2\big) - \big(\sum_{i,j} a_{ij} x_i x_j\big)(2x_k)}{\big(\sum_i x_i^2\big)^2}
= \frac{2(Ax)_k}{||x||^2} - \frac{{}^t x A x}{||x||^2}\cdot\frac{2x_k}{||x||^2}
= \frac{2}{||x||^2}\big((Ax)_k - r(x)\,x_k\big)$$
Hence $\nabla r(x) = \dfrac{2}{||x||^2}\big(Ax - r(x)x\big)$. Thus $\nabla r(x) = 0$ precisely when $x$ is an eigenvector.

Clearly $r(ax) = r(x)$ for all $a \neq 0$. Hence we may as well only look at unit vectors, i.e. $||x|| = 1$, and thus we have

Theorem 13.0.12 A unit vector $x$ is an eigenvector for $A$ if and only if $\nabla r(x) = 0$. In this case the eigenvalue is $r(x)$.

Let $x$ be a unit vector and write $x = a_1 q_1 + a_2 q_2 + \dots + a_m q_m$. Then $r(x) = \sum_i a_i^2 \lambda_i$. Hence
$$r(x) - r(q_j) = a_1^2\lambda_1 + a_2^2\lambda_2 + \dots + (a_j^2 - 1)\lambda_j + \dots + a_m^2\lambda_m.$$
Thus if $x \to q_j$, i.e. $a_i \to 0$ for $i \neq j$ and $a_j \to 1$, we see that $r(x) - r(q_j) \to 0$ as the square of $||x - q_j||$.

Another possible way to find a sequence of vectors converging to an eigenvector is the method of power iteration. The algorithm works as follows:

    v^(0) = any unit vector
    for k = 1, 2, ...
        w = A v^(k-1)
        v^(k) = w / ||w||
        lambda^(k) = r(v^(k))

We have
$$v^{(1)} = \frac{Av^{(0)}}{||Av^{(0)}||} \quad\text{and}\quad
v^{(2)} = \frac{Av^{(1)}}{||Av^{(1)}||} = \frac{A^2 v^{(0)}}{||Av^{(0)}||}\Big/\frac{||A^2 v^{(0)}||}{||Av^{(0)}||} = \frac{A^2 v^{(0)}}{||A^2 v^{(0)}||}$$
and in general $v^{(k)} = \dfrac{A^k v^{(0)}}{||A^k v^{(0)}||}$.

To analyze power iteration write $v^{(0)} = a_1 q_1 + a_2 q_2 + \dots + a_m q_m$ where $a_1 \neq 0$. Then $A^k v^{(0)} = a_1\lambda_1^k q_1 + a_2\lambda_2^k q_2 + \dots + a_m\lambda_m^k q_m$ and so
$$v^{(k)} = \frac{a_1\lambda_1^k q_1 + a_2\lambda_2^k q_2 + \dots + a_m\lambda_m^k q_m}{\sqrt{a_1^2\lambda_1^{2k} + a_2^2\lambda_2^{2k} + \dots + a_m^2\lambda_m^{2k}}}
= \frac{\lambda_1^k a_1\Big(q_1 + \frac{a_2}{a_1}\big(\frac{\lambda_2}{\lambda_1}\big)^k q_2 + \dots + \frac{a_m}{a_1}\big(\frac{\lambda_m}{\lambda_1}\big)^k q_m\Big)}{|\lambda_1^k|\,|a_1|\sqrt{1 + \frac{a_2^2}{a_1^2}\big(\frac{\lambda_2}{\lambda_1}\big)^{2k} + \dots + \frac{a_m^2}{a_1^2}\big(\frac{\lambda_m}{\lambda_1}\big)^{2k}}}
= \pm\frac{q_1 + \frac{a_2}{a_1}\big(\frac{\lambda_2}{\lambda_1}\big)^k q_2 + \dots + \frac{a_m}{a_1}\big(\frac{\lambda_m}{\lambda_1}\big)^k q_m}{\sqrt{1 + \frac{a_2^2}{a_1^2}\big(\frac{\lambda_2}{\lambda_1}\big)^{2k} + \dots + \frac{a_m^2}{a_1^2}\big(\frac{\lambda_m}{\lambda_1}\big)^{2k}}}$$
where the sign $\pm$ depends on whether $\lambda_1^k a_1$ is positive or negative.

Thus $||v^{(k)} - (\pm q_1)|| = O(|\lambda_2/\lambda_1|^k)$. If $|\lambda_1| > |\lambda_2|$ then $|\lambda_2/\lambda_1|^k \to 0$ and so $v^{(k)} \to \pm q_1$. This convergence is only linear in $|\lambda_2/\lambda_1|$, i.e. going from $k$ to $k+1$ improves precision by a factor of $|\lambda_2/\lambda_1|$, and if this fraction is close to 1, i.e. if $|\lambda_2|$ is close to $|\lambda_1|$, the convergence will be very slow.
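The iteration is easy to try numerically. Here is a hedged Python/NumPy sketch (Python rather than MATLAB; the $3\times 3$ symmetric matrix is the one used in the MATLAB session later in this section, whose eigenvalue of largest absolute value is $5.2143$):

```python
import numpy as np

# Power iteration: v^(k) = A v^(k-1) / ||A v^(k-1)||, lambda^(k) = r(v^(k)).
A = np.array([[2.0, 1.0, 1.0],
              [1.0, 3.0, 1.0],
              [1.0, 1.0, 4.0]])
v = np.array([1.0, 0.0, 0.0])       # v^(0): any unit vector

for k in range(50):
    w = A @ v
    v = w / np.linalg.norm(w)
lam = v @ A @ v                     # Rayleigh quotient of the final iterate

print(round(lam, 4))                # -> 5.2143
```

With $|\lambda_2/\lambda_1| \approx 0.47$ here the convergence is comfortably fast; for a ratio near 1 many more iterations would be needed.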

The trick to overcome this problem is known as inverse iteration. Here is how it works: consider a number $\mu$ which is not an eigenvalue. Then the matrix $A - \mu E_m$ is invertible and the eigenvalues of $(A - \mu E_m)^{-1}$ are the numbers $\frac{1}{\lambda - \mu}$ where $\lambda$ runs through the eigenvalues of $A$. So if we can find the eigenvalues of $(A - \mu E_m)^{-1}$ we can certainly find those of $A$. The idea is that if $\mu$ is very close to a given eigenvalue $\lambda$, then $|\lambda - \mu|$ is very small and hence $\frac{1}{|\lambda - \mu|}$ is very large; in fact by choosing $\mu$ sufficiently close to $\lambda$ we can make it as large as we want relative to the other eigenvalues. Thus the iterative process for $(A - \mu E_m)^{-1}$ will converge very fast. There are two caveats: we have to have some idea about $\lambda$ in order to choose $\mu$ close to it, and the problem of computing with $(A - \mu E_m)^{-1}$ may be very ill-conditioned if $A - \mu E_m$ is very close to being singular, which it will be if $\mu$ is very close to $\lambda$.

Here is the algorithm:

    v^(0) = any unit vector
    for k = 1, 2, ...
        Solve (A - mu E_m) w = v^(k-1) for w
        v^(k) = w / ||w||
        lambda^(k) = r(v^(k))
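A hedged Python/NumPy sketch of inverse iteration (same illustrative matrix as in the power-iteration sketch; the shift $\mu = 1$ is an arbitrary guess near the smallest eigenvalue):

```python
import numpy as np

# Inverse iteration: solve (A - mu E_m) w = v^(k-1), then normalize.
A = np.array([[2.0, 1.0, 1.0],
              [1.0, 3.0, 1.0],
              [1.0, 1.0, 4.0]])
mu = 1.0                              # shift; must not itself be an eigenvalue
v = np.array([1.0, 0.0, 0.0])

for k in range(20):
    w = np.linalg.solve(A - mu * np.eye(3), v)
    v = w / np.linalg.norm(w)
lam = v @ A @ v                       # Rayleigh quotient of the final iterate

print(round(lam, 4))                  # -> 1.3249, the eigenvalue closest to mu
```

Note that each step solves a linear system rather than forming $(A - \mu E_m)^{-1}$ explicitly.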

We shall argue that the ill-conditioning is not going to cause trouble. Let us assume that the absolute value of the last eigenvalue $\lambda_m$ is much smaller than the absolute values of the other eigenvalues. Then we may as well assume that $\mu$ is 0, and we are solving the equation $Aw = v$ where $v$ is a unit vector. Thus the problem is given by $A \mapsto A^{-1}v$. Assume we are using a backward stable algorithm to solve the equation and that the computed solution is $\tilde w$. By backward stability we have $\tilde A \tilde w = v$, where $\tilde A = A + \delta A$ is a perturbation of $A$ with $\frac{||\delta A||}{||A||} = O(\varepsilon_{\text{machine}})$, and where $\tilde w = w + \delta w$ is a perturbation of $w$ with $\frac{||\delta w||}{||w||} = O(\kappa(A)\,\varepsilon_{\text{machine}})$. Now $\kappa(A) = |\lambda_1/\lambda_m|$, which by our assumptions is very large. Thus there is very little chance that $w$ and $\tilde w$ are close, but we shall see that the normalized vectors $\frac{w}{||w||}$ and $\frac{\tilde w}{||\tilde w||}$ are close.

Remark that the eigenvectors of $A^{-1}$ are the same as the eigenvectors of $A$ and the eigenvalues are the inverses. Thus the largest eigenvalue of $A^{-1}$ is $1/\lambda_m$ and the corresponding eigenvector is $q_m$. Our argument above gives $\big\|\frac{w}{||w||} - (\pm q_m)\big\| = O(|\lambda_m/\lambda_{m-1}|)$.

Let $\tilde v = A\big(\frac{\tilde w}{||\tilde w||}\big)$; then we have $||\tilde v|| \leq ||A|| = |\lambda_1|$.

Write $\tilde v = a_1 q_1 + a_2 q_2 + \dots + a_m q_m$; then $\frac{\tilde w}{||\tilde w||} = \frac{a_1}{\lambda_1} q_1 + \frac{a_2}{\lambda_2} q_2 + \dots + \frac{a_{m-1}}{\lambda_{m-1}} q_{m-1} + \frac{a_m}{\lambda_m} q_m$. Since $\big(\frac{a_1}{\lambda_1}\big)^2 + \big(\frac{a_2}{\lambda_2}\big)^2 + \dots + \big(\frac{a_{m-1}}{\lambda_{m-1}}\big)^2 + \big(\frac{a_m}{\lambda_m}\big)^2 = \big\|\frac{\tilde w}{||\tilde w||}\big\|^2 = 1$ we have
$$\frac{\tilde w}{||\tilde w||} = \frac{\frac{a_1}{\lambda_1} q_1 + \frac{a_2}{\lambda_2} q_2 + \dots + \frac{a_{m-1}}{\lambda_{m-1}} q_{m-1} + \frac{a_m}{\lambda_m} q_m}{\sqrt{\big(\frac{a_1}{\lambda_1}\big)^2 + \big(\frac{a_2}{\lambda_2}\big)^2 + \dots + \big(\frac{a_{m-1}}{\lambda_{m-1}}\big)^2 + \big(\frac{a_m}{\lambda_m}\big)^2}}
= \frac{\frac{a_m}{\lambda_m}\Big(\frac{a_1\lambda_m}{a_m\lambda_1} q_1 + \dots + \frac{a_{m-1}\lambda_m}{a_m\lambda_{m-1}} q_{m-1} + q_m\Big)}{\big|\frac{a_m}{\lambda_m}\big|\sqrt{\big(\frac{a_1\lambda_m}{a_m\lambda_1}\big)^2 + \big(\frac{a_2\lambda_m}{a_m\lambda_2}\big)^2 + \dots + \big(\frac{a_{m-1}\lambda_m}{a_m\lambda_{m-1}}\big)^2 + 1}}
= \pm\frac{\frac{a_1\lambda_m}{a_m\lambda_1} q_1 + \dots + \frac{a_{m-1}\lambda_m}{a_m\lambda_{m-1}} q_{m-1} + q_m}{\sqrt{\big(\frac{a_1\lambda_m}{a_m\lambda_1}\big)^2 + \big(\frac{a_2\lambda_m}{a_m\lambda_2}\big)^2 + \dots + \big(\frac{a_{m-1}\lambda_m}{a_m\lambda_{m-1}}\big)^2 + 1}}$$
This shows that also $\big\|\frac{\tilde w}{||\tilde w||} - (\pm q_m)\big\| = O(|\lambda_m/\lambda_{m-1}|)$ and so
$$\Big\|\frac{\tilde w}{||\tilde w||} - \frac{w}{||w||}\Big\| = O(|\lambda_m/\lambda_{m-1}|)$$

Thus the algorithm will construct a sequence converging to the proper eigenvector if $\mu$ is chosen appropriately. We shall combine inverse iteration with the Rayleigh quotient to obtain a very fast converging algorithm:

    v^(0) = any unit vector
    lambda^(0) = t v^(0) A v^(0) = corresponding Rayleigh quotient
    for k = 1, 2, ...
        Solve (A - lambda^(k-1) E_m) w = v^(k-1) for w
        v^(k) = w / ||w||
        lambda^(k) = r(v^(k))

For this algorithm we have the following theorem:

Theorem 13.0.13 Rayleigh quotient iteration converges to an eigenvector–eigenvalue pair for all starting unit vectors except for a small set of unit vectors (of measure zero). When the algorithm converges, the convergence is cubic, in the sense that if $\lambda_j$ is an eigenvalue of $A$ and $v^{(0)}$ is sufficiently close to $q_j$, then
$$||v^{(k+1)} - (\pm q_j)|| = O\big(||v^{(k)} - (\pm q_j)||^3\big)$$
and
$$|\lambda^{(k+1)} - \lambda_j| = O\big(|\lambda^{(k)} - \lambda_j|^3\big)$$

Thus each iteration triples the number of digits of accuracy. Here is an example:

    >> A=[2 1 1;1 3 1;1 1 4];
    >> v0=[1;0;0];
    >> r0=v0'*A*v0
    r0 = 2
    >> w=(A-r0*eye(3))\v0;
    >> v1=w/norm(w)
    v1 =
       -0.7071
        0.7071
             0
    >> r1=v1'*A*v1
    r1 =
        1.5000
    >> w=(A-r1*eye(3))\v1;
    >> v2=w/norm(w)
    v2 =
        0.9035
       -0.3720
       -0.2126
    >> r2=v2'*A*v2
    r2 =
        1.3305
    >> w=(A-r2*eye(3))\v2;
    >> v3=w/norm(w)
    v3 =
       -0.8876
        0.4274
        0.1719
    >> r3=v3'*A*v3
    r3 =
        1.3249

After the fourth step all the results from MATLAB become equal. The eigenvector closest to $\begin{pmatrix}1\\0\\0\end{pmatrix}$ is $q_1 = \begin{pmatrix}-0.8877\\ 0.4271\\ 0.1721\end{pmatrix}$ with eigenvalue $\lambda_1 = 1.3249$.

Starting with $v^{(0)} = \begin{pmatrix}0\\1\\0\end{pmatrix}$ we get after four steps $q_2 = \begin{pmatrix}0.2332\\ 0.7392\\ -0.6318\end{pmatrix}$ and $\lambda_2 = 2.4608$.

Starting with $v^{(0)} = \begin{pmatrix}0\\0\\1\end{pmatrix}$ we get after five steps $q_3 = \begin{pmatrix}0.3971\\ 0.5207\\ 0.7558\end{pmatrix}$ and $\lambda_3 = 5.2143$.

Using the MATLAB command

    [V,D]=eig(A)

we get a matrix V whose columns are the eigenvectors and a diagonal matrix D whose diagonal entries are the eigenvalues. They agree with the ones we have found here.

Use MATLAB to try other examples with larger matrices.

Write the following M-file:

    function A=qralg(B)
    [Q,R]=qr(B);
    A=R*Q;

Input some square matrix A and run

    A=qralg(A)

Keep running this command a few times (use the up-arrow to get back to the command rather than retyping it).
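What qralg performs is one step of the (unshifted) QR algorithm: since $RQ = Q^{-1}(QR)Q$, each step replaces $A$ by a similar matrix, and for a symmetric matrix whose eigenvalues have distinct absolute values the iterates tend toward a diagonal matrix carrying the eigenvalues. A hedged Python/NumPy sketch of repeating the step (illustrative matrix, same one as in the Rayleigh quotient example):

```python
import numpy as np

# Repeated qralg steps: A <- R*Q, where A = Q*R is the QR factorization.
A = np.array([[2.0, 1.0, 1.0],
              [1.0, 3.0, 1.0],
              [1.0, 1.0, 4.0]])

for k in range(100):
    Q, R = np.linalg.qr(A)
    A = R @ Q                 # R Q = Q^{-1} A Q, so eigenvalues are preserved

diag = np.sort(np.diag(A))    # diagonal approaches the eigenvalues
print(np.round(diag, 4))
```

After enough steps the off-diagonal entries are negligible and the sorted diagonal matches the eigenvalues $1.3249$, $2.4608$, $5.2143$ found above.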

14 Regression Analysis

Consider random variables $y_t, X_{1,t}, X_{2,t}, \dots, X_{k,t}$ where $t$ belongs to some indexing set $I$ ($t$ may denote time, in which case $I$ would be some interval on the real axis, or $I$ could be some finite set). We assume that for any two indices $t_1, t_2 \in I$, $X_{i,t_1}$ and $X_{i,t_2}$ are identically distributed. We shall use $x_{i,t}$ to indicate a sample of the random variable $X_{i,t}$; thus two samples $x_{i,t_1}$ and $x_{i,t_2}$ can be viewed as samples from a common distribution.

A statistical model relating these random variables is an equation
$$y_t = f(X_{1,t}, X_{2,t}, \dots, X_{k,t}) + u_t$$
for each $t \in I$, where $f$ is a function (which does not depend on $t$) and $u$ is some stochastic process indexed by $I$. In the special case where $u$ is identically 0 we can express $y$ exactly as a function of the $X$'s, but in general all we can specify about the random variable $u_t$ are some distributional properties. Regression analysis is concerned with estimating the function $f$ from observed samples $y_t, x_{1,t}, x_{2,t}, \dots, x_{k,t}$.

The simplest linear regression model can be expressed by the equation
$$y_t = \beta_0 + \beta_1 X_t + u_t$$
The purpose of formulating this model is to explain the value of the dependent variable $y$ as an affine function of the independent variable $X$ plus an error term $u$. The coefficients of the model, $\beta_0$ and $\beta_1$, remain to be determined.

The underlying assumption of the model is that there is an affine relationship between the dependent and the independent variable, and because of the inherent imprecision in our data samples there will be a random error. Without making certain assumptions about the error term the model clearly has no meaning; in fact we could take any $\beta_0$ and $\beta_1$ and define $u = y - \beta_0 - \beta_1 X$.

Assume now that we have observations $(y_{t_1}, x_{t_1}), (y_{t_2}, x_{t_2}), \dots, (y_{t_n}, x_{t_n})$. We want to use these observations to get estimates for the parameters $\beta_0$ and $\beta_1$ under certain assumptions about the error term $u$.

The first condition we want to impose is that the expectation $E(u_t) = 0$ for all $t \in I$.

Using this we get the equation:
$$E(y_t) = \beta_0 + \beta_1 E(X_t)$$
Because all the $y_t$'s and $X_t$'s are identically distributed we can estimate the expectations by the sample means
$$\hat E(y_t) = \frac{1}{n}\sum_i y_{t_i} \quad\text{and}\quad \hat E(X_t) = \frac{1}{n}\sum_i x_{t_i}$$
and hence we get one equation to estimate the coefficients. But there are two coefficients, so one equation does not suffice.

The second assumption we make is more serious. We assume that the error term $u_t$ is independent of the independent variable $X_t$ for each $t \in I$. If the model is correctly specified and the error term arises from sampling errors rather than systemic errors, this is a reasonable assumption. This implies that $E(u_t X_t) = E(u_t)E(X_t)$ and hence, from the first assumption (that $E(u_t) = 0$), we get that also $E(u_t X_t) = 0$.

Figure 7: Scatter plot of the data set

Multiplying the equation through by $X$ and taking expectations we get
$$E(yX) = \beta_0 E(X) + \beta_1 E(X^2)$$
We now have two equations, and we get
$$\beta_1 = \frac{E(yX) - E(y)E(X)}{E(X^2) - E(X)^2} = \frac{\mathrm{cov}(y, X)}{\mathrm{var}(X)}$$
and $\beta_0 = E(y) - \dfrac{\mathrm{cov}(y, X)}{\mathrm{var}(X)}\,E(X)$. We can then use the sample estimates to get estimates for the coefficients.

Example 14.1 Consider the data sets
$$y_t = 8.04,\ 6.95,\ 7.58,\ 8.81,\ 8.33,\ 9.96,\ 7.24,\ 4.26,\ 10.84,\ 4.82,\ 5.68$$
and
$$x_t = 10.0,\ 8.0,\ 13.0,\ 9.0,\ 11.0,\ 14.0,\ 6.0,\ 4.0,\ 12.0,\ 7.0,\ 5.0$$
Using the sample means we get estimates for the coefficients: $\hat\beta_0 = 3.0001$ and $\hat\beta_1 = 0.5001$.

Figure 8: Scatter plot of the data set with the regression line

Let
$$y = \begin{pmatrix} y_{t_1} \\ y_{t_2} \\ \vdots \\ y_{t_n} \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & x_{t_1} \\ 1 & x_{t_2} \\ \vdots & \vdots \\ 1 & x_{t_n} \end{pmatrix} \quad\text{and}\quad
u = \begin{pmatrix} u_{t_1} \\ u_{t_2} \\ \vdots \\ u_{t_n} \end{pmatrix}.$$
Then we have
$$y = X\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + u$$
Now
$${}^t X \cdot u = \begin{pmatrix} 1 & 1 & \dots & 1 \\ x_1 & x_2 & \dots & x_n \end{pmatrix}\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}
= \begin{pmatrix} \sum_t u_t \\ \sum_t x_t u_t \end{pmatrix}
= \begin{pmatrix} n\,\hat E(u) \\ n\,\hat E(Xu) \end{pmatrix}
= \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
and hence
$${}^t X\, y = {}^t X X \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \end{pmatrix}$$
This shows that
$$\begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \end{pmatrix} = ({}^t X X)^{-1}\,{}^t X\, y$$
and hence the estimates for the coefficients are precisely the least squares solution to the equation $y = X\begin{pmatrix}\beta_0\\\beta_1\end{pmatrix}$.
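The closed form $({}^t X X)^{-1}\,{}^t X y$ is immediate to check numerically. Here is a hedged Python/NumPy sketch (Python rather than MATLAB) applying it to the data of Example 14.1:

```python
import numpy as np

# OLS for Example 14.1 via the normal equations: beta = (tX X)^{-1} tX y.
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
x = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])

X = np.column_stack([np.ones_like(x), x])    # column of ones for the intercept
beta = np.linalg.solve(X.T @ X, X.T @ y)     # solves (tX X) beta = tX y

print(np.round(beta, 4))                     # -> [3.0001 0.5001]
```

Solving the normal equations with `solve` avoids explicitly inverting ${}^t X X$, which is both cheaper and numerically safer.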

We can of course always find a least squares solution, as we saw in Section 9, but it is important to realize that just fitting an ordinary least squares (OLS) line to the data does not in any way mean that the data are samples from a linear model. It is a common error to indiscriminately fit OLS lines to data sets. This can lead to serious errors.

Example 14.2 Consider the following data sets:

1. $y_t = 9.14,\ 8.14,\ 8.74,\ 8.77,\ 9.26,\ 8.10,\ 6.13,\ 3.10,\ 9.13,\ 7.26,\ 4.74$
   $x_t = 10.0,\ 8.0,\ 13.0,\ 9.0,\ 11.0,\ 14.0,\ 6.0,\ 4.0,\ 12.0,\ 7.0,\ 5.0$

2. $y_t = 7.46,\ 6.77,\ 12.74,\ 7.11,\ 7.81,\ 8.84,\ 6.08,\ 5.39,\ 8.15,\ 6.42,\ 5.73$
   $x_t = 10.0,\ 8.0,\ 13.0,\ 9.0,\ 11.0,\ 14.0,\ 6.0,\ 4.0,\ 12.0,\ 7.0,\ 5.0$

3. $y_t = 6.58,\ 5.76,\ 7.71,\ 8.84,\ 8.47,\ 7.04,\ 5.25,\ 12.5,\ 5.56,\ 7.91,\ 6.89$
   $x_t = 8.0,\ 8.0,\ 8.0,\ 8.0,\ 8.0,\ 8.0,\ 8.0,\ 19.0,\ 8.0,\ 8.0,\ 8.0$

For data set 1 we get the OLS solution $\begin{pmatrix}\hat\beta_0\\\hat\beta_1\end{pmatrix} = \begin{pmatrix}3.0009\\0.5000\end{pmatrix}$. For data set 2 we get $\begin{pmatrix}\hat\beta_0\\\hat\beta_1\end{pmatrix} = \begin{pmatrix}3.0025\\0.4995\end{pmatrix}$ and for data set 3, $\begin{pmatrix}\hat\beta_0\\\hat\beta_1\end{pmatrix} = \begin{pmatrix}3.0017\\0.4999\end{pmatrix}$. Thus we might be tempted to conclude that these three data sets are all samples from the same model. However, if we graph these data sets we get completely different pictures.

If we graph the residuals, i.e. the error terms $\hat u_t$, against the independent variable in the case of data set 1, we see that the assumption of independence is clearly invalid and so the linear model is incorrect.

These examples show the importance of graphing the data before starting any numerical analysis. A graphical presentation may help in formulating reasonable models; in the case of data set 1 a quadratic model seems much more appropriate. Indeed, if we try to fit a model of the form $y = \beta_0 + \beta_1 X + \beta_2 X^2$ we find the least squares solutions $\hat\beta_0 = -5.9957$, $\hat\beta_1 = 2.7808$, $\hat\beta_2 = -0.1267$, and the fitted curve is shown in Figure 13.

Figure 9: Data set 1

Figure 10: Data set 2

Figure 11: Data set 3

Figure 12: Residuals for data set 1

Figure 13: Fitted quadratic model

Recall that if $X_1, X_2$ are random variables, the joint distribution is the function $F_{X_1,X_2}(x_1, x_2) = \mathrm{Prob}(X_1 < x_1, X_2 < x_2)$ and the joint density function is $f_{X_1,X_2}(t_1, t_2) = \dfrac{\partial^2 F_{X_1,X_2}}{\partial x_1 \partial x_2}(t_1, t_2)$. Thus
$$F_{X_1,X_2}(x_1, x_2) = \int_{-\infty}^{x_1}\!\!\int_{-\infty}^{x_2} f_{X_1,X_2}(t_1, t_2)\, dt_1\, dt_2.$$

Definition 14.0.13 The conditional density is defined by
$$f_{X_1|X_2}(t_1|t_2) = \frac{f_{X_1,X_2}(t_1, t_2)}{f_{X_2}(t_2)}.$$

Definition 14.0.14 Two random variables $X_1$ and $X_2$ are independent if $f_{X_1,X_2} = f_{X_1} f_{X_2}$, or equivalently $f_{X_1|X_2} = f_{X_1}$.

Recall that the expectation of the random variable $X$ is
$$E(X) = \int X\, d\mu = \int t f_X(t)\, dt$$

Definition 14.0.15 Recall (Introduction to Stochastic Processes = ISP, 2.4) that the conditional expectation $E(X_1|X_2)$ is the random variable defined by the condition
$$\int_A E(X_1|X_2)\, d\mu = \int_A X_1\, d\mu$$
for every set $A$ in the $\sigma$-algebra generated by $X_2$.

Lemma 14.0.2
$$E(X_1|X_2)(\omega) = \int x_1 f_{X_1|X_2}(x_1, X_2(\omega))\, dx_1$$

Proof: By Example 2.8 in ISP we have, for any function $h$, that $E(h(X_1)|X_2)(\omega) = g(X_2(\omega))$ where $g$ is the function defined by
$$g(x_2) = \int h(x_1) f_{X_1|X_2}(x_1, x_2)\, dx_1$$
Applying this with $h(x_1) = x_1$ proves the result. $\square$

Lemma 14.0.3 (Law of iterated expectations) $E(E(X_1|X_2)) = E(X_1)$

Proof: Let $\mathcal{F}_0$ denote the trivial $\sigma$-algebra. Then $E(X|\mathcal{F}_0) = E(X)$ for any random variable $X$. Now we have $\mathcal{F}_0 \subset \sigma(X_2) \subset \mathcal{F}$, and so we have, by transitivity of conditional expectations (ISP Theorem 2.4.3),
$$E(E(X_1|X_2)) = E(E(X_1|X_2)|\mathcal{F}_0) = E(E(X_1|\sigma(X_2))|\mathcal{F}_0) = E(X_1|\mathcal{F}_0) = E(X_1)$$
$\square$

Lemma 14.0.4 If $X_1$ and $X_2$ are independent, $E(X_1|X_2) = E(X_1)$ (in particular constant).

Proof: If $X_1$ and $X_2$ are independent, $f_{X_1,X_2}(x_1, x_2) = f_{X_1}(x_1) f_{X_2}(x_2)$ (Proposition 2.3.1 in the Probability Theory Notes). Hence $f_{X_1|X_2} = f_{X_1}$, and so the function $g$ in Lemma 14.0.2 is
$$g(x_2) = \int x_1 f_{X_1}(x_1)\, dx_1 = E(X_1)$$
$\square$

Proposition 14.0.4 Assume $E(X|Y) = 0$; then $E(XY) = 0$.

Proof: We have $E(XY) = E(E(XY|Y))$ by the Law of iterated expectations. Since $Y$ is certainly measurable with respect to $\sigma(Y)$ we have, by ISP 2.4.2, $E(XY|Y) = Y\,E(X|Y) = 0$. $\square$

This Proposition shows that to use OLS to estimate the coefficients in the regression model it suffices to assume $E(u_t|X_t) = 0$ rather than the stronger assumption that they are independent.

We shall now make some further assumptions on the error terms. We assume that the error terms are independent, identically distributed with mean 0 and variance $\sigma^2$, i.e. for any two indices $t_1 \neq t_2$ the random variables $u_{t_1}, u_{t_2}$ are independent and have the same distribution. We write this condition $u_t \sim \mathrm{IID}(0, \sigma^2)$. Sometimes we will make even more restrictive assumptions, such as specifying the actual distribution, e.g. that the $u_t$'s are all normally distributed, in which case we write $u_t \sim \mathrm{NID}(0, \sigma^2)$.

Once a model has been suggested and fitted it is a good idea to plot the residuals to see if they are in fact IID.

Example 14.3 Using the data set Data1, which can be downloaded from the course page, we plot the data. The plot seems to indicate a linear model $y_t = \beta_0 + \beta_1 X_t + u_t$. Estimating the coefficients by OLS we get $\begin{pmatrix}\hat\beta_0\\\hat\beta_1\end{pmatrix} = 10^{-3}\begin{pmatrix}6.1497\\.0007\end{pmatrix}$, and the plot of the fitted line seems to fit the data extremely well.

But now we plot the residuals against the independent variable (Figure 16). The plot shows that the residuals are not random and the linear model is incorrect. The plot suggests a quadratic model, so we try to fit a model of the form
$$y_t = \beta_0 + \beta_1 X_t + \beta_2 X_t^2 + u_t$$
The OLS estimates for the coefficients are $\begin{pmatrix}\hat\beta_0\\\hat\beta_1\\\hat\beta_2\end{pmatrix} = 10^{-3}\begin{pmatrix}.6736\\.0007\\.0000\end{pmatrix}$ (the machine only displays 4 decimal points but internally stores 16 decimal points).

Now the plot of the residuals shows randomness (Figure 17). The following Q-Q plot and histogram show that the residuals for the quadratic model are approximately normally distributed (Figures 18 and 19).

Figure 14: Scatter plot of Data1

Figure 15: OLS line

Figure 16: Residuals for the linear model

Figure 17: Residuals for the quadratic model

Figure 18: Q-Q plot of the residuals for the quadratic model

Figure 19: Histogram of the residuals for the quadratic model

The estimated variance of the residuals is $\hat\sigma^2 = 3.9939 \times 10^{-8}$, and so the model
$$y_t = 10^{-3}\big(.6736 + .0007X_t + (.0000)X_t^2\big) + u_t$$
with $u_t \sim \mathrm{NID}(0,\ 3.9939 \times 10^{-8})$ is reasonable.

This example again underscores the benefit of doing lots of graphical analysis. Just plotting the data seems to indicate a linear model, but the graph of the residuals shows the problems with this model.

Using OLS we can estimate coefficients of models with several independent variables (we have already done this when we estimated quadratic models). Thus we can look at models of the form
$$y_t = \beta_0 + \beta_1 X_{1,t} + \beta_2 X_{2,t} + \dots + \beta_k X_{k,t} + u_t$$
The least squares estimates of the coefficients are obtained by forming the $n \times (k+1)$ matrix
$$X = \begin{pmatrix}
1 & x_{1,t_1} & x_{2,t_1} & \dots & x_{k,t_1} \\
1 & x_{1,t_2} & x_{2,t_2} & \dots & x_{k,t_2} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{1,t_n} & x_{2,t_n} & \dots & x_{k,t_n}
\end{pmatrix}$$
and the estimators are given by
$$\begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \vdots \\ \hat\beta_k \end{pmatrix} = ({}^t X X)^{-1}\,{}^t X \begin{pmatrix} y_{t_1} \\ y_{t_2} \\ \vdots \\ y_{t_n} \end{pmatrix}$$

14.1 Statistical Properties of the OLS Estimators

Definition 14.1.1 Let $\hat\theta$ be an estimator for some parameter whose true value is $\theta_{\mathrm{true}}$. The bias of the estimator is defined to be $E(\hat\theta) - \theta_{\mathrm{true}}$, and the estimator is said to be unbiased if the bias is 0.

Assume our data come from the model
$$y_t = X_t \beta + u_t$$
where $X_t = (1, X_{1,t}, X_{2,t}, \dots, X_{k,t})$ and $\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}$. Our data $y_t$ and $x_t = (1, x_{1,t}, x_{2,t}, \dots, x_{k,t})$, $t = t_1, t_2, \dots, t_n$, are samples of the random vector. We assume the residuals $u_t \sim \mathrm{IID}(0, \sigma^2)$. The OLS estimator is as before given by
$$\hat\beta = ({}^t X X)^{-1}\,{}^t X \begin{pmatrix} y_{t_1} \\ y_{t_2} \\ \vdots \\ y_{t_n} \end{pmatrix}.$$
Thus the bias of the OLS estimator is $E(\hat\beta) - \beta$.

Now $y_t = X_t\beta + u_t$ and so
$$\hat\beta = ({}^t X X)^{-1}\,{}^t X (X\beta + u) = \beta + ({}^t X X)^{-1}\,{}^t X u,$$
where $u = \begin{pmatrix} u_{t_1} \\ u_{t_2} \\ \vdots \\ u_{t_n} \end{pmatrix}$. Thus the bias of the OLS estimator is $E\big(({}^t X X)^{-1}\,{}^t X u\big)$. This may not be 0 even under the condition $E(u_t|X_t) = 0$, the reason being that $({}^t X X)^{-1}\,{}^t X u$ involves products of the form $X_{i,t_j} X_{r,t_s} u_{t_v}$, and so we would need each $u_{t_i}$ to satisfy $E(u_{t_i}|X_{j,t_r}) = 0$ for any combination of indices, or in shorthand $E(u_t|X_s) = 0$ for any $t, s \in I$. If this condition holds the independent variables are said to be exogenous, and in this case we have
$$E\big(({}^t X X)^{-1}\,{}^t X u\,\big|\,X\big) = ({}^t X X)^{-1}\,{}^t X\, E(u|X) = 0.$$
By the transitivity of conditional expectation this implies that $E\big(({}^t X X)^{-1}\,{}^t X u\big) = 0$, so the OLS estimator is unbiased.

Example 14.4 Consider the model
$$y_t = 1.0 + 0.8\,y_{t-1} + u_t$$
(an auto-regressive model) and assume $u_t \sim \mathrm{NID}(0, 1)$ and $y_0 = 0$. We can simulate this model by generating vectors of N(0,1)-distributed random numbers.

The following M-file, named autoreg.m, will generate 25 samples from the model and compute the OLS estimates of the coefficients:

    u=randn(25,1);
    y=zeros(25,1);
    for i=1:24
        y(i+1)=1+.8*y(i)+u(i+1);
    end
    X=[ones(24,1) y(1:24)];
    beta=(X'*X)^(-1)*X'*y(2:25);

If we run it a couple of times we get the following estimates:
$$\hat\beta = \begin{pmatrix} 1.0786 \\ 0.8481 \end{pmatrix}, \quad \begin{pmatrix} 1.8455 \\ 0.6490 \end{pmatrix}, \quad \begin{pmatrix} 1.2757 \\ 0.7330 \end{pmatrix}$$
The following MATLAB code will run this 10,000 times and compute the sample mean of the estimates:

    b=zeros(2,10000);
    for j=1:10000
        autoreg
        b(:,j)=beta;
    end
    betahat=sum(b,2)/10000

This gives us the estimate $\hat E(\hat\beta) = \begin{pmatrix} 1.3328 \\ 0.7157 \end{pmatrix}$, which indicates that the OLS estimator has non-zero bias.

If, instead of generating 25 samples, we generate 10000 samples and average over 25 runs, so we generate the same amount of data, we get $\hat E(\hat\beta) = \begin{pmatrix} 0.9912 \\ 0.8015 \end{pmatrix}$, which is a much better estimate of the actual values. This example shows that the OLS estimator is biased, but the bias becomes smaller as we increase the sample size.

This phenomenon also occurs in estimating sample variance: consider independent samples $x_i$, $i = 1, 2, \dots, n$ of a random variable $X$ with mean $\mu$ and variance $\sigma^2$. Let $\hat\mu = \frac{1}{n}\sum_i x_i$ be the sample mean. The usual estimator for the sample variance, $s^2 = \frac{1}{n}\sum_i (x_i - \hat\mu)^2$, is biased.

To see this, remark that the sample mean is an unbiased estimator of the mean because $E(\hat\mu) = \frac{1}{n}\sum_i E(X_i) = \frac{1}{n}\sum_i \mu = \mu$.

We have
$$s^2 = \frac{1}{n}\sum_i (x_i - \hat\mu)^2 = \frac{1}{n}\sum_i \big((x_i - \mu) - (\hat\mu - \mu)\big)^2
= \frac{1}{n}\Big(\sum_i (x_i - \mu)^2 + n(\hat\mu - \mu)^2 - 2(\hat\mu - \mu)\sum_i (x_i - \mu)\Big).$$
Now $\sum_i (x_i - \mu) = n(\hat\mu - \mu)$, so we get
$$s^2 = \frac{1}{n}\Big(\sum_i (x_i - \mu)^2 - n(\hat\mu - \mu)^2\Big).$$
Taking expectations we get
$$E(s^2) = \frac{1}{n}\Big(\sum_i E\big((x_i - \mu)^2\big) - n\,E\big((\hat\mu - \mu)^2\big)\Big) = \sigma^2 - E\big((\hat\mu - \mu)^2\big) = \sigma^2 - \mathrm{Var}(\hat\mu).$$
We can compute $\mathrm{Var}(\hat\mu)$ as follows:
$$\mathrm{Var}(\hat\mu) = E\Big(\Big(\frac{1}{n}\sum_i x_i - \mu\Big)^2\Big) = E\Big(\Big(\frac{1}{n}\sum_i (x_i - \mu)\Big)^2\Big)
= \frac{1}{n^2}\,E\Big(\sum_{i,j}(x_i - \mu)(x_j - \mu)\Big) = \frac{1}{n^2}\sum_{i,j} E\big((x_i - \mu)(x_j - \mu)\big)$$
Since $X_i$ and $X_j$ are independent for $i \neq j$ we get $E\big((X_i - \mu)(X_j - \mu)\big) = E(X_i - \mu)E(X_j - \mu) = 0$ for $i \neq j$, and $E\big((X_i - \mu)^2\big) = \sigma^2$. Hence $\mathrm{Var}(\hat\mu) = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n}$. Thus
$$E(s^2) = \frac{n-1}{n}\,\sigma^2.$$
Again we see that as the sample size gets larger (i.e. $n \to \infty$) the bias tends to 0. We also see that
$$\frac{n}{n-1}\,s^2 = \frac{1}{n-1}\sum_i (X_i - \hat\mu)^2$$
is an unbiased estimator.
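The factor $(n-1)/n$ shows up clearly in a simulation. Here is a hedged Python/NumPy sketch (the sample size $n = 5$ and variance $\sigma^2 = 4$ are arbitrary illustrative choices):

```python
import numpy as np

# Average the biased estimator s^2 = (1/n) sum (x_i - muhat)^2 over many samples.
rng = np.random.default_rng(0)
n, sigma2, reps = 5, 4.0, 200_000
x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

s2 = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)
mean_s2 = s2.mean()
print(mean_s2)   # expected to be near ((n-1)/n) * sigma2 = 3.2, not sigma2 = 4
```

Dividing by `n - 1` instead (NumPy's `np.var(..., ddof=1)`) would make the average come out near 4, matching the unbiased estimator derived above.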

Definition 14.1.2 An estimator $\hat\theta$ is said to be consistent if $\hat\theta$ tends to the true value as the sample size tends to $\infty$.

The OLS estimator in the previous example and the $s^2$ estimator are examples of consistent estimators.

To show that the OLS estimator is in fact consistent, recall that we have $\hat\beta = \beta + ({}^t X X)^{-1}\,{}^t X u$. Thus we need the last term to approach 0 when the sample size increases. Remark that the matrix ${}^t X X$ is always a $(k+1)\times(k+1)$ matrix regardless of the sample size; in fact the $i,j$th entry in this matrix is $\sum_r X_{i,t_r} X_{j,t_r}$, and the $1,1$ entry is $n$. We divide the entries by $n$ and notice that
$$({}^t X X)^{-1}\,{}^t X u = \Big(\frac{1}{n}\,{}^t X X\Big)^{-1}\,\frac{1}{n}\,{}^t X u.$$
In many cases the matrices $\frac{1}{n}\,{}^t X X$ will converge to a fixed matrix $S$ as the sample size increases (i.e. $n \to \infty$). In our example above we get the matrices in Table 1.

We make the (weak) assumption that $E(u_t|X_t) = 0$. This is a much weaker assumption than assuming that the independent variables are exogenous. Then we also have $E(X_{i,t_j} u_{t_j}|X_{i,t_j}) = 0$ and hence, by transitivity, $E(X_{i,t_j} u_{t_j}) = 0$. This shows that $E({}^t X u) = 0$. Thus $\frac{1}{n}\,{}^t X u \to 0$ and hence
$$\lim_{n\to\infty}\Big(\frac{1}{n}\,{}^t X X\Big)^{-1}\,\frac{1}{n}\,{}^t X u = S^{-1}\cdot 0 = 0,$$
and so $\hat\beta \to \beta$, i.e. under these assumptions the estimator is consistent.

Table 1:

    sample size     S
    100             ( 1  5.5687 ; 5.5687  37.0914 )
    1000            ( 1  5.2986 ; 5.2986  30.8182 )
    10000           ( 1  5.0040 ; 5.0040  27.9696 )
    100000          ( 1  5.0161 ; 5.0161  27.9281 )
    1000000         ( 1  5.0048 ; 5.0048  27.8345 )
    10000000        ( 1  4.9991 ; 4.9991  27.7744 )

Bias and consistency are completely different properties; as we have seen, an estimator can be biased and still be consistent.

Example 14.5 Consider the following three estimators for the expectation $\mu$:

1. $\hat\mu_1 = \dfrac{1}{n+1}\sum_{i=1,2,\dots,n} y_i$

2. $\hat\mu_2 = \dfrac{1.01}{n}\sum_{i=1,2,\dots,n} y_i$

3. $\hat\mu_3 = 0.01\,y_1 + \dfrac{0.99}{n-1}\sum_{i=2,\dots,n} y_i$

Since $\hat\mu_1 = \frac{n}{n+1}\hat\mu$, we have $E(\hat\mu_1) = \frac{n}{n+1}E(\hat\mu) = \frac{n}{n+1}\mu$, so this estimator is biased, but since $\frac{n}{n+1} \to 1$ it is consistent.

In the second case we have $\hat\mu_2 = 1.01\hat\mu$, so $E(\hat\mu_2) = 1.01\mu$. This estimator is both biased and inconsistent.

In the third case we have $E(\hat\mu_3) = .01\,E(y_1) + .99\mu = \mu$, so the estimator is unbiased. But as $n \to \infty$ the limit is $0.01y_1 + 0.99\mu$, which is stochastic, and so the limit is not the non-stochastic quantity $\mu$; i.e. the estimator is inconsistent.

Under the assumption that the independent variables are exogenous and the error terms are $\mathrm{IID}(0, \sigma^2)$ we can compute the covariance matrix of the OLS estimators. The covariance matrix is $\mathrm{Cov}(\hat\beta) = E\big((\hat\beta - \beta)\,{}^t(\hat\beta - \beta)\big)$. Now $\hat\beta - \beta = ({}^t X X)^{-1}\,{}^t X u$ and so
$$\mathrm{Cov}(\hat\beta) = E\big(({}^t X X)^{-1}\,{}^t X u\,{}^t u\, X\,({}^t X X)^{-1}\big) = ({}^t X X)^{-1}\,{}^t X\, E(u\,{}^t u)\, X\,({}^t X X)^{-1}.$$
Under the assumption of the error terms being $\mathrm{IID}(0, \sigma^2)$ we have $E(u\,{}^t u) = \sigma^2 I$, where $I$ is the identity matrix. Hence we get
$$\mathrm{Cov}(\hat\beta) = \sigma^2\,({}^t X X)^{-1}.$$
Assume now that we want to use our model to make a forecast; thus we ask for the value of the dependent variable $y$ given a vector of independent variables $x = (1, x_1, x_2, \dots, x_k)$. Using our estimators the forecast is $\hat y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \dots + \hat\beta_k x_k$. It is of course very important to know the variance of the forecast, and we have
$$\mathrm{Var}(\hat y) = \mathrm{Var}(\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \dots + \hat\beta_k x_k) = x\,\mathrm{Cov}(\hat\beta)\,{}^t x = \sigma^2\, x\,({}^t X X)^{-1}\,{}^t x$$
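A hedged Python/NumPy sketch of the forecast-variance formula, using the design matrix of Example 14.1 and an assumed $\sigma^2 = 1$; forecasting at the sample mean $\bar x = 9$ gives exactly $\sigma^2(1/n + (\bar x - \bar x)^2/S_{xx}) = 1/11$:

```python
import numpy as np

# Var(yhat) = sigma^2 * x (tX X)^{-1} tx for a forecast point x = (1, x1, ..., xk).
xdata = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
X = np.column_stack([np.ones_like(xdata), xdata])
XtX_inv = np.linalg.inv(X.T @ X)

sigma2 = 1.0                        # assumed error variance (illustrative)
x_new = np.array([1.0, 9.0])        # forecast at X = 9, the sample mean of x
var_yhat = sigma2 * x_new @ XtX_inv @ x_new

print(round(var_yhat, 4))           # -> 0.0909, i.e. sigma^2 / n = 1/11
```

Moving `x_new` away from the sample mean increases `var_yhat`, reflecting that forecasts become less certain away from the bulk of the data.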

Definition 14.1.3 An estimator $\hat\beta$ for the coefficients in a linear regression model $y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + u$ is said to be a linear estimator if it is of the form $A\,y$ for some matrix $A$. The OLS estimator is linear because it is $({}^t X X)^{-1}\,{}^t X\, y$.

Let $\tilde\beta$ be an estimator and consider its covariance matrix $\mathrm{Cov}(\tilde\beta)$. The variance of any linear combination $w\,\tilde\beta$ of the estimators, where $w$ is a row vector, is given by $w\,\mathrm{Cov}(\tilde\beta)\,{}^t w$. We obviously would like to have an estimator where this variance is as small as possible. Thus if $\hat\beta$ is another estimator we want $w\,\mathrm{Cov}(\hat\beta)\,{}^t w \leq w\,\mathrm{Cov}(\tilde\beta)\,{}^t w$ for all $w$. Equivalently, the matrix $\mathrm{Cov}(\tilde\beta) - \mathrm{Cov}(\hat\beta)$ is positive semi-definite. If this is the case we will say that $\hat\beta$ is a more efficient, or better, estimator than $\tilde\beta$.

Theorem 14.1.1 (The Gauss–Markov Theorem) Assume that $E(u|X) = 0$ and that $E(u\,{}^t u) = \sigma^2 I$, in particular that the OLS estimator is unbiased. Then the OLS estimator $\hat\beta$ is the best linear unbiased estimator (BLUE for short).

Proof: In order to prove this theorem we need to understand what it means: we have a set of $n$ observations of independent variables which we write as the $n \times (k+1)$ matrix $X$ as before. A linear estimator is just a linear map $A : \mathbb{R}^n \to \mathbb{R}^{k+1}$, i.e. given by a $(k+1)\times n$ matrix. If we apply $A$ to both sides of the equation $y = X\beta + u$ we get $Ay = AX\beta + Au$. Since $E(u|X) = 0$ we get $E(Au|X) = 0$ and so $E(Ay|X) = E(AX\beta|X)$. The estimator is unbiased if $E(Ay|X) = \beta$ for all $\beta$. Now $E(AX\beta|X) = AX\beta$ because $AX\beta$ is non-stochastic, and so the estimator is unbiased if $AX\beta = \beta$ for all $\beta \in \mathbb{R}^{k+1}$. Thus the linear estimator $A$ is unbiased if and only if $AX = I$.

The OLS estimator is given by the $(k+1)\times n$ matrix $({}^t X X)^{-1}\,{}^t X$. Let $C = A - ({}^t X X)^{-1}\,{}^t X$, so $CX = AX - ({}^t X X)^{-1}\,{}^t X X = I - I = 0$. Write $\tilde\beta = Ay$ for the estimator given by $A$; then $\tilde\beta - \hat\beta = Cu$, and
$$\mathrm{Cov}(\tilde\beta - \hat\beta,\ \hat\beta - \beta) = E\big(Cu\ {}^t\big(({}^t X X)^{-1}\,{}^t X u\big)\big) = E\big(Cu\,{}^t u\, X\,({}^t X X)^{-1}\big).$$
By our assumptions $E(u\,{}^t u) = \sigma^2 I$, so $\mathrm{Cov}(\tilde\beta - \hat\beta,\ \hat\beta - \beta) = \sigma^2\, C X\,({}^t X X)^{-1} = 0$. Now
$$\mathrm{Cov}(\tilde\beta) = \mathrm{Cov}\big(\hat\beta + (\tilde\beta - \hat\beta)\big) = \mathrm{Cov}(\hat\beta) + \mathrm{Cov}(Cu),$$
where the equality holds because $Cu = \tilde\beta - \hat\beta$ and $\hat\beta$ are uncorrelated. This shows that
$$\mathrm{Cov}(\tilde\beta) - \mathrm{Cov}(\hat\beta) = \mathrm{Cov}(Cu) = E(Cu\,{}^t u\,{}^t C) = \sigma^2\,(C\,{}^t C),$$
and, as is true for any matrix, $C\,{}^t C$ is positive semi-definite. $\square$

Next we shall estimate the variance of the error terms, i.e. estimate $\sigma^2$ in the model $y = X\beta + u$.

The vector of OLS residuals $\hat u = y - X\hat\beta$ is orthogonal to the subspace spanned by the columns of $X$, so if we let $P_X$ denote the projection onto this subspace and $P_X^{\perp} = I - P_X$ the projection onto its orthogonal complement, then $P_X^{\perp}$ is 0 on the column space and the identity on the orthogonal complement. We shall assume that $E(u|X) = 0$ and $E(u\,{}^tu) = \sigma^2 I$.

Applying $P_X^{\perp}$ to $y = X\beta + u$ gives $P_X^{\perp} y = P_X^{\perp} u$, since $P_X^{\perp}$ kills $X\beta$. Thus $\hat u = P_X^{\perp} y = P_X^{\perp} u$.

The matrix of $P_X$ is $X({}^tX X)^{-1}\,{}^tX$, so the matrix of $P_X^{\perp}$ is $I - X({}^tX X)^{-1}\,{}^tX$ and $\hat u = u - X({}^tX X)^{-1}\,{}^tX u$. This shows that the $t$th coordinate $\hat u_t$ in general depends on the $u_s$ for all $s = 1, 2, \dots, n$ and not just on $u_t$. So even if the error terms $u_t$ are independent, this will not necessarily be true for the residuals $\hat u_t$.

To compute the covariance matrix of $\hat u$ we note that $E(\hat u) = E(u) - X({}^tX X)^{-1}\,{}^tX E(u) = 0$ (this uses $E(u|X) = 0$), so the covariance matrix is
$$E(\hat u\,{}^t\hat u) = E\big(P_X^{\perp} u\,{}^t(P_X^{\perp} u)\big) = E\big(P_X^{\perp}\, u\,{}^tu\, P_X^{\perp}\big) = \sigma^2\,(P_X^{\perp})^2 = \sigma^2\,P_X^{\perp}.$$
If $h_t$ denotes the $t$th diagonal term of $P_X$, so that the $t$th diagonal term of $P_X^{\perp}$ is $1 - h_t$, we get $\mathrm{Var}(\hat u_t) = (1 - h_t)\sigma^2$.

If we use the estimator $\hat\sigma^2 = \frac{1}{n}\sum_t \hat u_t^2$ we get
$$E(\hat\sigma^2) = \frac{1}{n}\sum_t E(\hat u_t^2) = \frac{1}{n}\sum_t \mathrm{Var}(\hat u_t) = \frac{1}{n}\sum_t (1 - h_t)\,\sigma^2 = \frac{\sigma^2}{n}\,\mathrm{Tr}(P_X^{\perp}).$$
Since $P_X^{\perp}$ is the projection onto an $(n-k-1)$-dimensional subspace, $\mathrm{Tr}(P_X^{\perp}) = n-k-1$, so $E(\hat\sigma^2) = \frac{n-k-1}{n}\sigma^2$, which shows that $\hat\sigma^2$ is biased; but if we instead use $s^2 = \frac{1}{n-k-1}\sum_t \hat u_t^2$ we

get an unbiased estimator. Using this estimator in our formula $\mathrm{Cov}(\hat\beta) = \sigma^2\,({}^tX X)^{-1}$ we get an unbiased estimator for the covariance matrix of $\hat\beta$,
$$\widehat{\mathrm{Cov}}(\hat\beta) = s^2\,({}^tX X)^{-1}.$$
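A Python/NumPy sketch of these estimators (on synthetic data, with the symbols following the text): the residuals are orthogonal to the columns of $X$, $s^2$ uses the $n-k-1$ degrees of freedom, and $s^2({}^tXX)^{-1}$ estimates $\mathrm{Cov}(\hat\beta)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # n x (k+1) design
beta = np.array([0.5, 1.0, -2.0, 0.25])
y = X @ beta + 0.7 * rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
betahat = XtX_inv @ X.T @ y
uhat = y - X @ betahat                       # OLS residuals, orthogonal to col(X)

s2 = uhat @ uhat / (n - k - 1)               # unbiased estimator of sigma^2
cov_betahat = s2 * XtX_inv                   # estimated covariance of betahat
```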

Up till now we have assumed that our model is correctly specified, i.e. that the data are generated by
$$y = X\beta + u$$
where $X$ is the set of independent variables we incorporate in our model. But what happens if in reality there are more independent variables, so that the data really are generated by a model
$$y = X\beta + Z\gamma + u,$$
while we are generating our estimators from the underspecified model? This is of course a quite common phenomenon: often the dependent variable will depend on so many (known and unknown) factors that we can't possibly incorporate all of them.

Our OLS estimator for the underspecified model is
$$\hat\beta = ({}^tX X)^{-1}\,{}^tX y = ({}^tX X)^{-1}\,{}^tX (X\beta + Z\gamma + u) = \beta + ({}^tX X)^{-1}\,{}^tX Z\gamma + ({}^tX X)^{-1}\,{}^tX u,$$
and thus even if we assume that $E(u|X) = 0$, so that $E(({}^tX X)^{-1}\,{}^tX u) = 0$, the OLS estimator $\hat\beta$ has bias $E(\hat\beta - \beta) = E(({}^tX X)^{-1}\,{}^tX Z\gamma)$.
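The bias formula can be checked mechanically: with the noise switched off ($u = 0$), the underspecified OLS estimate differs from $\beta$ by exactly $({}^tXX)^{-1}\,{}^tX Z\gamma$. A Python/NumPy sketch with invented $X$, $Z$ and $\gamma$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])    # included variables
Z = rng.normal(size=(n, 1)) + 0.5 * X[:, [1]]            # omitted variable, correlated with X
beta = np.array([1.0, 2.0])
gamma = np.array([3.0])

# data generated by the full model, here with u = 0 to isolate the bias
y = X @ beta + Z @ gamma

XtX_inv = np.linalg.inv(X.T @ X)
betahat = XtX_inv @ X.T @ y                  # OLS from the underspecified model
bias = XtX_inv @ X.T @ Z @ gamma             # (X^t X)^{-1} X^t Z gamma
```

Because the omitted variable is correlated with the included regressor, the slope estimate is pulled away from its true value by the predicted amount.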

Classically the number used to measure how well the estimated model fits the data is
$$R^2 = \frac{||P_X y||^2}{||y||^2}$$
(also known as $\mathrm{ESS}/\mathrm{TSS}$, the explained sum of squares divided by the total sum of squares). Since $\hat\beta = ({}^tX X)^{-1}\,{}^tX y$ and $P_X = X({}^tX X)^{-1}\,{}^tX$, we have $P_X y = X\hat\beta = y - \hat u$ because $y = X\hat\beta + \hat u$. Now $\hat u$ is orthogonal to the column space of $X$, so $X\hat\beta \perp \hat u$ and consequently $||y||^2 = ||X\hat\beta||^2 + ||\hat u||^2 = ||P_X y||^2 + ||\hat u||^2$. Thus we can write
$$R^2 = \frac{||y||^2 - ||\hat u||^2}{||y||^2} = 1 - \frac{||\hat u||^2}{||y||^2} = 1 - \frac{\mathrm{SSR}}{\mathrm{TSS}}$$
(SSR = sum of squared residuals).

We have $(P_X y, y) = ||P_X y||\,||y||\cos\theta$ where $\theta$ is the angle between the two vectors $P_X y = X\hat\beta$ and $y$, so $\cos^2\theta = \frac{(P_X y, y)^2}{||P_X y||^2\,||y||^2}$. But $(P_X y, y) = (P_X y,\ P_X y + \hat u) = (P_X y, P_X y)$ because $P_X y$ and $\hat u$ are orthogonal. It follows that $(P_X y, y) = ||P_X y||^2$, hence
$$\cos^2\theta = \frac{||P_X y||^4}{||P_X y||^2\,||y||^2} = \frac{||P_X y||^2}{||y||^2} = R^2.$$

If $R^2 = 1$ then $\theta = 0$ and so $P_X y = X\hat\beta = y$ and the fit is exact. The closer $R^2$ gets to 0, the worse the fit. When $R^2 = 0$, $\theta = \pi/2$; since $(P_X y, y) = ||P_X y||^2$, this forces $P_X y = X\hat\beta = 0$.
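The two expressions for $R^2$ (the explained-sum-of-squares form and the $1 - \mathrm{SSR}/\mathrm{TSS}$ form) and the identity $\cos^2\theta = R^2$ can be checked numerically; a Python/NumPy sketch with synthetic data (note that, as in the text, this is the uncentered $R^2$):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + 0.5 * rng.normal(size=n)

P_X = X @ np.linalg.inv(X.T @ X) @ X.T      # projection onto the column space of X
Py = P_X @ y
uhat = y - Py                               # residuals

R2_ess = (Py @ Py) / (y @ y)                # ||P_X y||^2 / ||y||^2
R2_ssr = 1 - (uhat @ uhat) / (y @ y)        # 1 - SSR/TSS

# cos^2(theta) for the angle between P_X y and y
cos2 = (Py @ y) ** 2 / ((Py @ Py) * (y @ y))
```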

14.2 Hypothesis Testing in Regression Models

We might have some underlying theoretical model that specifies the values of some or all of the coefficients in a linear regression model, or maybe that two or more coefficients are equal or are related in some way. How does one determine from the data whether such a hypothesis is reasonable, or more precisely: when do the data not contradict the hypothesis?

Consider the simplest possible model
$$y = \beta + u$$
where $u \sim \mathrm{IID}(0, \sigma^2)$. Here $\beta$ is the expectation, $E(y)$. Suppose we have a vector of samples $y$ and let $\hat\beta = \frac{1}{n}\sum_t y_t$ be the sample mean. The $y_t$'s are independent and have $\mathrm{Var}(y_t) = \sigma^2$, so $\mathrm{Var}(\hat\beta) = \frac{\sigma^2}{n}$.

Suppose we hypothesize that the true value of $\beta$ is some number $\beta^*$. How would we decide how likely this assumption is from our data? Without further information there is very little we can do, so let's make the (unrealistic) assumption that we know the distribution of $u$, and assume that $u \sim \mathrm{NID}(0, \sigma^2)$, i.e. that we also know the true value of the variance (this will of course virtually never be the case). Under our hypothesis, the test statistic
$$z = \frac{\sqrt{n}\,(\beta^* - \hat\beta)}{\sigma} \sim N(0, 1).$$
For a random quantity $x \sim N(0,1)$, the probability that $a \le x \le b$ is equal to $\int_a^b \varphi(t)\,dt$ where $\varphi$ is the Probability Density Function (PDF) of the standard normal distribution, i.e. $\varphi(t) = \frac{1}{\sqrt{2\pi}}\exp(-\frac{t^2}{2})$. Looking at Figure 20 we see that there is a 95% probability that $-1.96 \le x \le 1.96$ and hence a 5% probability that $x$ falls outside this interval. This 5% is called the rejection level of the test, and we can call the 95% the non-rejection level. If we set our level to 5% we would then only reject our hypothesis if the test statistic $z$ falls outside the confidence interval $[-1.96, 1.96]$. Thus the test is not so much a test to confirm the hypothesis but rather of whether the data contradict the hypothesis.

Figure 20: The standard normal PDF

Consider the following MATLAB code (here betahat is the sample mean and betastar the hypothesized value $\beta^*$):

u=randn(1,1000);
y=4+.5*u;
betahat=sum(y)/1000;
betastar=2;
z=(betastar-betahat)/(.5/sqrt(1000))

z = -125.1287

betastar=3;
z=(betastar-betahat)/(.5/sqrt(1000))

z = -61.8832

betastar=4.1;
z=(betastar-betahat)/(.5/sqrt(1000))

z = 7.6869

betastar=4.02;
z=(betastar-betahat)/(.5/sqrt(1000))

z = 2.6273

betastar=4;
z=(betastar-betahat)/(.5/sqrt(1000))

z = 1.3624

This shows that the hypotheses $\beta^* = 2$, $\beta^* = 3$, $\beta^* = 4.1$ and $\beta^* = 4.02$ are all rejected at the 5% level, while we cannot reject the hypothesis $\beta^* = 4$.

The hypothesis we are testing, i.e. $u_t \sim N(0, \sigma^2)$ and $\beta = \beta^*$, is known as the null-hypothesis $H_0$. Under the null-hypothesis $z \sim N(0, 1)$. If we reject the null-hypothesis it may be either because the hypothesis about the distribution or the hypothesis about the actual value is false.

The error of rejecting a null-hypothesis that is actually true is called a type I error. Thus the probability of rejecting a true null-hypothesis is the level, i.e. 5% in this case. If we make the non-rejection level larger, say taking a 99% non-rejection level, i.e. a 1% rejection level, then we lower the probability of making a type I error; on the other hand the test will also fail to reject more wrong values of $\beta^*$. The 99% non-rejection interval is $[-2.5758, 2.5758]$. If we compute the test statistic for $\beta^* = 4.06$ we get $z = 2.3546$; thus this hypothesis is not rejected at the 1% level but is rejected at the 5% level.

The P-value of the test statistic is the smallest rejection level at which we would reject the null-hypothesis. For instance, in our example the P-value of the test statistic $z = 1.3624$ is 1 minus the area under the PDF over $[-1.3624, 1.3624]$. Thus the P-value equals $\mathrm{Prob}(z < -1.3624) + \mathrm{Prob}(z > 1.3624) = 2(1 - \Phi(1.3624)) \approx 0.173$, and thus we would reject the null-hypothesis only at rejection levels of about 17.3% and higher. In general, if the level of the test is $\alpha$, we reject at level $\alpha$ if the P-value of the test statistic is less than $\alpha$. For instance the P-value for the test statistic $z = 2.3546$ lies between 1% and 5%, hence the null-hypothesis is rejected at the 5% level but not at the 1% level.
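The normal CDF needed for these P-values can be expressed with the error function, $\Phi(x) = \frac{1}{2}(1 + \mathrm{erf}(x/\sqrt{2}))$; a small Python sketch reproducing the two-sided P-values of the $z$ statistics above:

```python
import math

def Phi(x):
    """CDF of the standard normal, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_value(z):
    """Two-sided P-value: Prob(|N(0,1)| > |z|)."""
    return 2.0 * (1.0 - Phi(abs(z)))

p1 = p_value(1.3624)   # about 0.17: not rejected at the 5% level
p2 = p_value(2.3546)   # between 0.01 and 0.05: rejected at 5%, not at 1%
```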

Most test statistics encountered in regression analysis follow one of four distributions: the standard normal distribution, the $\chi^2$-distribution, the Student t-distribution, or the F-distribution.

Let $\varphi(t) = \frac{1}{\sqrt{2\pi}}\exp(-\frac{t^2}{2})$; then the PDF of the normal distribution with mean $\mu$ and variance $\sigma^2$ is $\frac{1}{\sigma}\varphi(\frac{t-\mu}{\sigma})$. Thus if $Z \sim N(\mu, \sigma^2)$ the cumulative distribution function is
$$\Phi(x) = \mathrm{Prob}(Z < x) = \int_{-\infty}^{x}\frac{1}{\sigma}\varphi\Big(\frac{t-\mu}{\sigma}\Big)\,dt.$$
This function does not have a simple closed form expression.

Figure 21: P-Value

Figure 22: The Cumulative Distribution Function (CDF) of the standard normal

If $Z \sim N(\mu, \sigma^2)$ then $\frac{Z-\mu}{\sigma} \sim N(0, 1)$. A very special property of the normal distribution is that any linear combination of independent normal random variables is again normal.

Theorem 14.2.1 Let $Z_i \sim N(\mu_i, \sigma_i^2)$, $i = 1, 2, \dots, n$, be independent random variables. Then the random variable
$$W = a_1 Z_1 + a_2 Z_2 + \cdots + a_n Z_n \sim N\big(a_1\mu_1 + a_2\mu_2 + \cdots + a_n\mu_n,\ a_1^2\sigma_1^2 + a_2^2\sigma_2^2 + \cdots + a_n^2\sigma_n^2\big).$$

Proof: We know $X_i = \frac{Z_i - \mu_i}{\sigma_i} \sim N(0, 1)$ and
$$W = a_1(\sigma_1 X_1 + \mu_1) + a_2(\sigma_2 X_2 + \mu_2) + \cdots + a_n(\sigma_n X_n + \mu_n) = (a_1\sigma_1 X_1 + \cdots + a_n\sigma_n X_n) + (a_1\mu_1 + a_2\mu_2 + \cdots + a_n\mu_n).$$
Hence it suffices to show that
$$a_1\sigma_1 X_1 + \cdots + a_n\sigma_n X_n \sim N\big(0,\ a_1^2\sigma_1^2 + a_2^2\sigma_2^2 + \cdots + a_n^2\sigma_n^2\big),$$
and so we may assume that $Z_i \sim N(0, 1)$ for all $i$.

We first assume that $n = 2$ and $a_1^2 + a_2^2 = 1$. We shall compute the joint density $f_{W,Z_1} = f_{W|Z_1}\,f_{Z_1}$. Now the conditional expectation is $E(W|Z_1) = E(a_1 Z_1 + a_2 Z_2\,|\,Z_1) = a_1 Z_1 + a_2 E(Z_2|Z_1) = a_1 Z_1 + a_2 E(Z_2) = a_1 Z_1$, and the conditional variance is $E((W - E(W|Z_1))^2\,|\,Z_1) = E((W - a_1 Z_1)^2\,|\,Z_1) = E(a_2^2 Z_2^2\,|\,Z_1) = E(a_2^2 Z_2^2) = a_2^2$. Conditionally on $Z_1$, $W$ is the sum of the $N(0, a_2^2)$ random variable $a_2 Z_2$ and the constant $a_1 z_1$, and so is $N(a_1 z_1, a_2^2)$; thus the conditional density is $f_{W|Z_1}(w|z_1) = \frac{1}{a_2}\varphi\big(\frac{w - a_1 z_1}{a_2}\big)$.

This shows that the joint density is
$$\begin{aligned}
f_{W,Z_1}(w, z_1) &= \frac{1}{a_2}\varphi\Big(\frac{w - a_1 z_1}{a_2}\Big)\varphi(z_1)\\
&= \frac{1}{a_2\sqrt{2\pi}}\exp\Big(-\frac{(w - a_1 z_1)^2}{2a_2^2}\Big)\,\frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{z_1^2}{2}\Big)\\
&= \frac{1}{2\pi a_2}\exp\Big(-\frac{w^2 + a_1^2 z_1^2 - 2a_1 z_1 w + a_2^2 z_1^2}{2a_2^2}\Big)\\
&= \frac{1}{2\pi a_2}\exp\Big(-\frac{w^2 - 2a_1 z_1 w + z_1^2}{2a_2^2}\Big)\\
&= \frac{1}{2\pi a_2}\exp\Big(-\frac{a_2^2 w^2 + a_1^2 w^2 - 2a_1 z_1 w + z_1^2}{2a_2^2}\Big)\\
&= \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{w^2}{2}\Big)\,\frac{1}{a_2\sqrt{2\pi}}\exp\Big(-\frac{(a_1 w - z_1)^2}{2a_2^2}\Big),
\end{aligned}$$
where we used $a_1^2 + a_2^2 = 1$ twice: once to combine $a_1^2 z_1^2 + a_2^2 z_1^2 = z_1^2$ and once to split $w^2 = a_2^2 w^2 + a_1^2 w^2$, so that $a_1^2 w^2 - 2a_1 z_1 w + z_1^2 = (a_1 w - z_1)^2$.

The last factor, as a function of $z_1$, is the PDF of a $N(a_1 w, a_2^2)$ distribution. To find the density function $f_W$ we compute the marginal density of the joint density function by integrating out $z_1$:
$$f_W(w) = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{w^2}{2}\Big)\,\frac{1}{a_2\sqrt{2\pi}}\exp\Big(-\frac{(z_1 - a_1 w)^2}{2a_2^2}\Big)\,dz_1 = \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{w^2}{2}\Big)\int_{-\infty}^{\infty}\frac{1}{a_2\sqrt{2\pi}}\exp\Big(-\frac{(z_1 - a_1 w)^2}{2a_2^2}\Big)\,dz_1.$$
The integral is 1 because we are integrating a PDF from $-\infty$ to $\infty$. This shows that $W \sim N(0, 1)$.

In the general case where $a_1^2 + a_2^2 = r^2 \ne 1$ we consider $\frac{1}{r}W$. By the previous argument $\frac{1}{r}W \sim N(0, 1)$ and so $W \sim N(0, r^2)$. This proves the result for a linear combination of two independent random variables. The general result follows by induction.

Consider a vector $Z = {}^t(Z_1, Z_2, \dots, Z_n)$ of independent standard normal variables.

Let $A$ be a non-singular matrix and consider the vector $X = {}^t(X_1, X_2, \dots, X_n) = A\,Z$. By the previous Theorem each of the components $X_i \sim N(0,\ a_{i1}^2 + a_{i2}^2 + \cdots + a_{in}^2)$. The covariance matrix is $\Sigma = \mathrm{Cov}(X) = E(X\,{}^tX) = E(AZ\,{}^tZ\,{}^tA) = A\,E(Z\,{}^tZ)\,{}^tA = A\,{}^tA$. Then the joint density function is
$$\varphi_X(x) = \frac{1}{(\sqrt{2\pi})^n\sqrt{\det\Sigma}}\exp\Big(-\frac{1}{2}\,{}^tx\,\Sigma^{-1}x\Big).$$
If $\mu = (\mu_1, \mu_2, \dots, \mu_n)$ is a vector of constants then $X + {}^t\mu$ has joint density function
$$\varphi_X(x) = \frac{1}{(\sqrt{2\pi})^n\sqrt{\det\Sigma}}\exp\Big(-\frac{1}{2}\,{}^t(x-\mu)\,\Sigma^{-1}(x-\mu)\Big).$$
In particular, if $\Sigma$ is a diagonal matrix with diagonal entries $\sigma_1^2, \dots, \sigma_n^2$, the joint density function
$$\varphi_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma_1}\exp\Big(-\frac{(x_1-\mu_1)^2}{2\sigma_1^2}\Big)\cdots\frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\Big(-\frac{(x_n-\mu_n)^2}{2\sigma_n^2}\Big)$$
becomes just the product of the densities of the components of the vector of random variables, and hence the components are independent. This is one of the very special properties of normally distributed random variables: they are independent if and only if their covariances are 0. This is not true for random variables which are not normally distributed.

This discussion suggests how we can construct a vector of normally distributed random variables with vector of means $\mu$ and covariance matrix $\Sigma$: use the Cholesky decomposition to write $\Sigma = {}^tR\,R$ where $R$ is an upper-triangular matrix. Then $X = {}^tR\,Z + {}^t\mu$ will be such a vector.
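A Python/NumPy sketch of this construction; note that `numpy.linalg.cholesky` returns a lower-triangular factor $L$ with $\Sigma = L\,{}^tL$, so $L$ plays the role of ${}^tR$ above:

```python
import numpy as np

rng = np.random.default_rng(4)

Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])          # target covariance matrix
mu = np.array([1.0, -2.0])              # target mean vector

L = np.linalg.cholesky(Sigma)           # lower triangular, Sigma = L @ L.T

n = 200_000
Z = rng.standard_normal((2, n))         # independent standard normals
X = (L @ Z).T + mu                      # each row is one N(mu, Sigma) draw

sample_mean = X.mean(axis=0)
sample_cov = np.cov(X.T)
```

With this many draws, the sample mean and sample covariance land close to the targets.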

The random variable $||Z||^2 = Z_1^2 + Z_2^2 + \cdots + Z_n^2$ is said to follow a $\chi^2$ distribution with $n$ degrees of freedom. We write $||Z||^2 \sim \chi^2(n)$. The mean and variance of the $\chi^2(n)$ distribution can be computed from the definition:
$$\text{mean} = E(Z_1^2) + E(Z_2^2) + \cdots + E(Z_n^2) = \mathrm{Var}(Z_1) + \mathrm{Var}(Z_2) + \cdots + \mathrm{Var}(Z_n) = n.$$

For the variance, $\mathrm{Var}(||Z||^2) = E((||Z||^2 - n)^2) = E\big(((Z_1^2 - 1) + (Z_2^2 - 1) + \cdots + (Z_n^2 - 1))^2\big)$. Since the random variables $(Z_i^2 - 1)$ and $(Z_j^2 - 1)$ are independent for $i \ne j$ and so have covariance 0, we get the variance $= n\,E((Z^2 - 1)^2) = n\,E(Z^4 + 1 - 2Z^2) = n\,(E(Z^4) + 1 - 2)$. Thus we need to compute $E(Z^4)$. By definition
$$E(Z^4) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} t^4\exp\Big(-\frac{t^2}{2}\Big)\,dt.$$
Remark that
$$\varphi'(t) = -t\,\varphi(t)$$
$$\varphi''(t) = -\varphi(t) + t^2\varphi(t)$$
$$\varphi'''(t) = (3t - t^3)\,\varphi(t)$$
$$\varphi^{(4)}(t) = (3 - 3t^2)\,\varphi(t) - (3t^2 - t^4)\,\varphi(t) = (3 - 6t^2 + t^4)\,\varphi(t).$$
Thus
$$t^4\varphi(t) = \varphi^{(4)}(t) + 6t^2\varphi(t) - 3\varphi(t)$$
and we get
$$\int_{-\infty}^{\infty} t^4\varphi(t)\,dt = \varphi'''(t)\Big|_{-\infty}^{\infty} + 6\int_{-\infty}^{\infty} t^2\varphi(t)\,dt - 3\int_{-\infty}^{\infty}\varphi(t)\,dt = 6\,\mathrm{Var}(Z) - 3 = 3.$$
Hence $E(Z^4) = 3$ and $E((Z^2 - 1)^2) = 3 + 1 - 2 = 2$. This shows that $\mathrm{Var}(||Z||^2) = E((Z_1^2-1)^2) + E((Z_2^2-1)^2) + \cdots + E((Z_n^2-1)^2) = 2 + 2 + \cdots + 2 = 2n$. Thus the $\chi^2(n)$ distribution has mean $= n$ and variance $= 2n$.
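These moments (and the intermediate fact $E(Z^4) = 3$) are easy to confirm by simulation; a Python/NumPy sketch, here with 7 degrees of freedom chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 1_000_000

Z = rng.standard_normal(N)
fourth_moment = (Z**4).mean()              # should be close to E(Z^4) = 3

df = 7
draws = rng.standard_normal((N // 10, df))  # each row: df independent N(0,1)
Y = (draws**2).sum(axis=1)                  # ||Z||^2 ~ chi^2(df)

chi2_mean = Y.mean()                        # close to df
chi2_var = Y.var()                          # close to 2*df
```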

It is clear that if $Y_1 \sim \chi^2(n_1)$ and $Y_2 \sim \chi^2(n_2)$ are independent, then $Y_1 + Y_2 \sim \chi^2(n_1 + n_2)$.

Theorem 14.2.2 Assume the vector $X = {}^t(X_1, X_2, \dots, X_n)$ has the multivariate normal distribution $N(0, \Sigma)$, so $\Sigma$ is the covariance matrix of $X$, i.e. $\Sigma = E(X\,{}^tX)$. Then the random variable
$${}^tX\,\Sigma^{-1}X \sim \chi^2(n).$$
Let $P$ be an orthogonal projection matrix of rank $r$ and let $Z = {}^t(Z_1, Z_2, \dots, Z_n) \sim N(0, I)$; then
$${}^tZ\,P\,Z \sim \chi^2(r).$$

Figure 23: $\chi^2$ Density functions

Proof: Let $\Sigma = R\,{}^tR$ be the Cholesky decomposition. Then $R^{-1}X$ is normal with mean 0 and covariance matrix $E(R^{-1}X\,{}^tX\,{}^tR^{-1}) = R^{-1}E(X\,{}^tX)\,{}^tR^{-1} = R^{-1}\,\Sigma\,{}^tR^{-1} = R^{-1}R\,{}^tR\,{}^tR^{-1} = I$. Thus $R^{-1}X \sim N(0, I)$. It then follows that $||R^{-1}X||^2 \sim \chi^2(n)$, but $||R^{-1}X||^2 = {}^tX\,{}^tR^{-1}R^{-1}X = {}^tX\,\Sigma^{-1}X$. This proves the first part of the theorem.

To prove the second part consider the subspace $V = \{v \mid Pv = v\}$. Then $V = \mathrm{Im}\,P$. Indeed, if $v \in V$ then $Pv = v$, so $v \in \mathrm{Im}\,P$; conversely if $w \in \mathrm{Im}\,P$ then $w = Pu$ for some $u \in \mathbf{R}^n$. Since $P$ is a projection, $P^2 = P$, so $Pw = P(Pu) = P^2u = Pu = w$; thus $w \in V$. This shows in particular that $\dim V = r$.

Let $W = V^{\perp}$, let $q_1, q_2, \dots, q_r$ be an orthonormal basis of $V$ and $q_{r+1}, \dots, q_n$ an orthonormal basis of $W$; then the combined set $q_1, \dots, q_r, q_{r+1}, \dots, q_n$ is an orthonormal basis of $\mathbf{R}^n$. The matrix of $P$ with respect to this basis is the block-diagonal matrix
$$M = \begin{pmatrix} I_r & 0 \\ 0 & 0 \end{pmatrix}$$
with $r$ ones followed by $n-r$ zeros on the diagonal, and if $Q$ is the coordinate transformation matrix between this basis and the standard basis then the matrix of $P$ is ${}^tQ\,M\,Q$.

Figure 24: t Density functions

Now consider $QZ$; its covariance matrix is $E(QZ\,{}^tZ\,{}^tQ) = Q\,E(Z\,{}^tZ)\,{}^tQ = Q\,{}^tQ = I$. Hence $Y = QZ \sim N(0, I)$, and
$${}^tZ\,P\,Z = {}^tY\,M\,Y = Y_1^2 + Y_2^2 + \cdots + Y_r^2 \sim \chi^2(r).$$
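The second statement can also be checked by simulation: build an orthogonal projection $P$ of rank $r$ from $r$ orthonormal columns and compare the moments of ${}^tZ\,P\,Z$ with those of $\chi^2(r)$. A Python/NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(6)
n, r = 8, 3

# orthonormal basis of a random r-dimensional subspace, via QR
A = rng.normal(size=(n, r))
Qbasis, _ = np.linalg.qr(A)
P = Qbasis @ Qbasis.T                      # orthogonal projection of rank r

N = 200_000
Z = rng.standard_normal((N, n))
quad = np.einsum('ij,jk,ik->i', Z, P, Z)   # tZ P Z for each draw

m = quad.mean()                            # close to r
v = quad.var()                             # close to 2r
```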

Let $Z \sim N(0, 1)$ and $Y \sim \chi^2(m)$ be independent; then
$$T = \frac{Z}{\sqrt{Y/m}}$$
follows the Student t-distribution with $m$ degrees of freedom. We shall write this $T \sim t(m)$. As $m$ becomes larger, $t(m)$ approaches the standard normal distribution.

If $Y_1 \sim \chi^2(m_1)$ and $Y_2 \sim \chi^2(m_2)$ are independent then the random variable
$$F = \frac{Y_1/m_1}{Y_2/m_2}$$
follows an F-distribution with $(m_1, m_2)$ degrees of freedom, $F \sim F(m_1, m_2)$.

Consider again a linear regression model
$$y = \beta_0 + X_1\beta_1 + X_2\beta_2 + \cdots + X_k\beta_k + u = X\beta + u$$
and observations
$$y_t,\ x_{t,1},\ x_{t,2},\ \dots,\ x_{t,k},\qquad t = 1, 2, \dots, n.$$

Figure 25: F(5, 3) Density function

As before, $\hat\beta = ({}^tX X)^{-1}\,{}^tX y$ is the OLS estimator and $\hat u = y - X\hat\beta$ the vector of residuals. We assume $u_t \sim \mathrm{NID}(0, \sigma^2)$.

We want to test the hypothesis that one of the coefficients, $\beta_i$ say, has a given value $\beta^*$. Without loss of generality we can assume that $i = k$, i.e. we are testing the hypothesis $\beta_k = \beta^*$. Subtracting $X_k\beta^*$ on both sides we get the model
$$y - X_k\beta^* = \beta_0 + X_1\beta_1 + \cdots + X_k(\beta_k - \beta^*) + u,$$
and so we may also assume that the hypothesis is $\beta_k = 0$.

Consider first the case $k = 1$, so the model is $y = \beta_0 + x\beta_1 + u$. Let $P$ denote the orthogonal projection onto the $(n-1)$-dimensional subspace orthogonal to ${}^t(1, 1, 1, \dots, 1)$. Applying $P$ we get $Py = Px\beta_1 + Pu$, and so we get the OLS-estimator
$$\hat\beta_1 = ({}^tx\,{}^tP\,P\,y)\,({}^tx\,{}^tP\,P\,x)^{-1} = ({}^tx\,P\,y)\,({}^tx\,P\,x)^{-1}$$
because ${}^tP = P$ and $P^2 = P$.

Under the null-hypothesis $\beta_1 = 0$, so $Py = Pu$ and
$$\hat\beta_1 = \frac{{}^tx\,P}{{}^tx\,P\,x}\,u.$$

This shows that the estimator is a linear combination of $\mathrm{NID}(0, \sigma^2)$ random variables. The coefficients are the coordinates of the vector $\frac{{}^tx\,P}{{}^tx\,P\,x}$, so the variance of the OLS estimator $\hat\beta_1$ is
$$\sigma^2\,\frac{{}^tx\,P\,P\,x}{({}^tx\,P\,x)^2} = \frac{\sigma^2}{{}^tx\,P\,x},$$
hence $\hat\beta_1 \sim N(0,\ \sigma^2({}^tx\,P\,x)^{-1})$, and so the statistic
$$z = \frac{{}^tx\,P\,y}{\sigma\,({}^tx\,P\,x)^{1/2}} \sim N(0, 1)$$
and we can use this to test the null-hypothesis.

This is under the unrealistic assumption that we actually know the variance of the error terms. If this is not the case we need to use the unbiased estimator for the variance,
$$s^2 = \frac{1}{n-2}\sum_t \hat u_t^2,\qquad \hat u = y - (\hat\beta_0 + x\hat\beta_1).$$
Let $Q$ be the orthogonal projection onto the orthogonal complement of the subspace spanned by ${}^t(1, 1, 1, \dots, 1)$ and $x$; thus $Q$ kills any vector of the form $\beta_0 + \beta_1 x$. Since by construction the vector of residuals $\hat u$ is orthogonal to both ${}^t(1, 1, 1, \dots, 1)$ and $x$, we have $\hat u = Q\hat u = Qy$. Hence
$$s^2 = \frac{1}{n-2}\,||Qy||^2 = \frac{{}^ty\,Q\,y}{n-2}.$$
Replacing $\sigma$ by $s$ we get the statistic
$$t = \frac{{}^tx\,P\,y}{s\,({}^tx\,P\,x)^{1/2}} = \frac{{}^tx\,P\,y}{\sigma\,({}^tx\,P\,x)^{1/2}}\cdot\frac{\sigma}{s} = \frac{z}{s/\sigma}.$$
The denominator is
$$\frac{s}{\sigma} = \frac{({}^ty\,Q\,y)^{1/2}}{\sigma\sqrt{n-2}} = \Big(\frac{{}^t\big(\frac{u}{\sigma}\big)\,Q\,\big(\frac{u}{\sigma}\big)}{n-2}\Big)^{1/2},$$
using ${}^ty\,Q\,y = {}^tu\,Q\,u$. Since $u \sim N(0, \sigma^2 I)$, $\frac{u}{\sigma} \sim N(0, I)$, and $Q$ is the orthogonal projection onto an $(n-2)$-dimensional subspace, we get by Theorem 14.2.2 that ${}^t\big(\frac{u}{\sigma}\big)\,Q\,\big(\frac{u}{\sigma}\big) \sim \chi^2(n-2)$. (The numerator and denominator are moreover independent, because $Q(Px) = 0$.) This shows that the statistic $t$ is distributed as
$$\frac{N(0, 1)}{\sqrt{\chi^2(n-2)/(n-2)}} \sim t(n-2),$$
and we can then use this distribution to test the null-hypothesis. This is naturally enough known as the t-test.
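The projection formula for the t statistic agrees with the usual "coefficient over standard error" form; a Python/NumPy sketch on synthetic data for the case $k = 1$ (a constant plus one regressor, with the null hypothesis true):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 60
ones = np.ones(n)
x = rng.normal(size=n)
X = np.column_stack([ones, x])
y = 2.0 + rng.normal(size=n)             # beta_1 = 0, so the null holds

# P: projection onto the orthogonal complement of the constant vector
P = np.eye(n) - np.outer(ones, ones) / n
# Q: projection onto the orthogonal complement of span{ones, x}
Q = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T

s2 = (y @ Q @ y) / (n - 2)               # unbiased variance estimate
t_proj = (x @ P @ y) / (np.sqrt(s2) * np.sqrt(x @ P @ x))

# direct form: betahat_1 / (s * sqrt of the (1,1) entry of (X^t X)^{-1})
XtX_inv = np.linalg.inv(X.T @ X)
betahat = XtX_inv @ X.T @ y
t_direct = betahat[1] / np.sqrt(s2 * XtX_inv[1, 1])
```

The two statistics coincide because $({}^tXX)^{-1}_{11} = ({}^tx\,P\,x)^{-1}$ for this design.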

In the general case where we have $k$ independent variables the same calculation goes through if we replace $P$ by the orthogonal projection onto the orthogonal complement of the subspace spanned by ${}^t(1, 1, \dots, 1), x_1, \dots, x_{k-1}$, and replace $n-2$ by $n-k-1$. The matrix of the projection is the $n \times n$ matrix $I - X'({}^tX'\,X')^{-1}\,{}^tX'$ where $X'$ is the $n \times k$ matrix
$$X' = \begin{pmatrix}
1 & x_{1,1} & x_{1,2} & \dots & x_{1,k-1}\\
1 & x_{2,1} & x_{2,2} & \dots & x_{2,k-1}\\
\vdots & \vdots & \vdots & & \vdots\\
1 & x_{n,1} & x_{n,2} & \dots & x_{n,k-1}
\end{pmatrix},$$
i.e. we have left out the column of the observations of the $k$th independent variable.

Example 14.6 We look again at the data set Data1 and test the hypothesis that the coefficient of the quadratic term is 0.

y=Data1(:,1);
X=[ones(size(Data1(:,2))) Data1(:,2)];
Z=[X X(:,2).^2];
P=eye(40)-X*(X'*X)^(-1)*X';
Q=eye(40)-Z*(Z'*Z)^(-1)*Z';
x=Z(:,3);
z=x'*P*y;
ssquare=(y'*Q*y)/(40-2-1);
s=sqrt(ssquare);
t=z/(s*sqrt(x'*P*x))

t = -64.9502

p=tcdf(t,37)

p = 4.9178e-40

Since the p-value is so small we reject the null-hypothesis that the coefficient of the quadratic term is 0.
