Summary
Sebastian Shaqiri
3 Linear Algebra
3.1 System of Linear Equations
3.2 Matrix Multiplication
3.3 The Matrix Equation Ax = b
    3.3.1 Properties of the Matrix-vector Product Ax
3.4 The Inverse of a Matrix
3.5 Matrix Factorizations
    3.5.1 The LU Factorization
3.6 Subspaces of Rn
3.7 Eigenvectors and Eigenvalues
3.8 Diagonalization
3.9 Inner Product, Length and Orthogonality
    3.9.1 The Inner Product
    3.9.2 The Length of a Vector
    3.9.3 Orthogonal Vectors
    3.9.4 Orthogonal Sets
    3.9.5 Orthogonal Projections
3.10 The Gram-Schmidt Process
3.11 Least-Squares Problems
3.12 Further Reading (Optional)
Part I
1
1.1 Limits
In the case x → +∞ we define the limit as follows.

Definition 1. Assume that f(x) is a function whose domain contains arbitrarily large real numbers. We say that f(x) has the limit A as x approaches infinity if for every given number ε > 0 there is a number ω such that

    x > ω, x ∈ Df  ⟹  |f(x) − A| < ε.

This is written

    f(x) → A when x → +∞

or alternatively

    lim_{x→+∞} f(x) = A.
The meaning of the definition is as follows: the function has the limit A when x → ∞ if the function values f(x) satisfy any given tolerance requirement of the form

    A − ε < f(x) < A + ε

as soon as x is sufficiently large, that is, for all x > ω. The greater the accuracy demanded, i.e. the smaller the ε stated, the larger ω has to be selected for the tolerance requirement to be fulfilled for all x > ω.
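As a concrete illustration (not part of the original text), the following Python sketch checks the tolerance requirement for f(x) = 1/x, which has the limit A = 0 as x → +∞; for a given ε the choice ω = 1/ε works.

```python
# Illustration of the epsilon-omega definition for f(x) = 1/x, whose
# limit as x -> +infinity is A = 0. For a given tolerance eps the
# choice omega = 1/eps works: every x > omega gives |f(x) - 0| < eps.

def f(x):
    return 1.0 / x

def omega_for(eps):
    # a witness omega that works for this particular f
    return 1.0 / eps

for eps in (0.1, 0.01, 0.001):
    w = omega_for(eps)
    samples = [w * factor for factor in (1.5, 2.0, 10.0, 1000.0)]
    assert all(abs(f(x) - 0.0) < eps for x in samples)
```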
For sequences, but not for other functions, there is also the following terminology: if the sequence has a limit as n → ∞ it is said to be convergent, otherwise divergent.
The definition of the limit above is just one of many similar definitions that must be made. When we examine the elementary functions we have to work with

    f(x) → A when x → −∞,
    f(x) → A when x → a,
    f(x) → A when x → a+,
    f(x) → A when x → a−.

These notions are defined analogously to the prototype lim_{x→+∞} f(x) = A treated above, and they are designated by corresponding lim notations. For example, we have in the case x → a:
Let f be a function and assume that every neighborhood of the point a contains points from Df. Then f is said to have the limit A as x approaches a if, for every number ε > 0 there exists a number δ > 0 such that

    |x − a| < δ, x ∈ Df  ⟹  |f(x) − A| < ε.

In particular, if the point a itself belongs to the domain, you can select x = a. Accordingly |f(a) − A| < ε for each ε > 0. The only possibility then is that A = f(a). Thus if f is defined at a and has a limit as x → a, the limit has to be equal to the function value f(a).
Limits of the type x → a+ and x → a− are the right and left limits respectively. Their definitions are obtained by replacing the condition |x − a| < δ above by a < x < a + δ and a − δ < x < a respectively. Evidently f(x) has a limit as x → a exactly when the right and left limits exist and are equal.
    f(x) → +∞ when x → +∞,
    f(x) → +∞ when x → a,
    f(x) → −∞ when x → a+

Such limits we call improper limits. They are defined by analogy with the former (proper) limits.
1.1.1 Evaluating Limits
We usually try to avoid working directly with the definition when determining limits. Instead, we try to use some basic properties together with a set of standard values.

The basic properties that we are about to establish are perceived by most as intuitively obvious and are applied almost automatically in problem solving. The rules are valid for all types of limits, x → +∞, x → a, x → a+ etc., and in the formulation below we therefore make no stipulation in this regard unless it is necessary for the sake of clarity.
Theorem 1. If lim f(x) = 0 and the function g(x) is bounded, it holds that f(x)g(x) → 0.

Proof. By the definition there exist two numbers C and ω0 such that |g(x)| ≤ C for x > ω0, and for every ε > 0 a number ω1 such that |f(x)| < ε/C for x > ω1. For x > max(ω0, ω1) we then have |f(x)g(x)| < ε, which proves the claim.

Theorem 2. If f(x) → A and g(x) → B, then

    f(x) + g(x) → A + B    (1.1)
    f(x)g(x) → AB          (1.2)

Furthermore, if B ≠ 0, it applies that

    f(x)/g(x) → A/B.       (1.3)

Proof of (1.1): By the definition there exist numbers ω1 and ω2 such that

    x > ω1  ⟹  |f(x) − A| < ε/2

and

    x > ω2  ⟹  |g(x) − B| < ε/2.

Let ω = max(ω1, ω2); by the triangle inequality for x > ω we then get

    |f(x) + g(x) − (A + B)| = |(f(x) − A) + (g(x) − B)| ≤ |f(x) − A| + |g(x) − B| < ε/2 + ε/2 = ε

and thereby the proof is done. Proof of (1.3): It suffices to consider 1/g(x). If B is positive we get, for all sufficiently large x,

    g(x) > B − B/2 = B/2

and then we get

    0 < 1/g(x) < 2/B,

which gives us

    |1/g(x) − 1/B| = |B − g(x)| / (B g(x)) ≤ (2/B²)|B − g(x)| → 0,

and thus the proof is done.
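A small numeric illustration of Theorem 1 (an added example, not from the text): f(x) = 1/x tends to 0 and g(x) = sin(x) is bounded, so the product tends to 0.

```python
import math

# Numeric illustration of Theorem 1: f(x) = 1/x tends to 0 and
# g(x) = sin(x) is bounded (|sin x| <= 1), so f(x)g(x) tends to 0.
def product(x):
    return (1.0 / x) * math.sin(x)

# |f(x)g(x)| <= 1/x, so the product values shrink as x grows
values = [abs(product(10.0 ** k)) for k in range(1, 7)]
assert all(v <= 10.0 ** -k for v, k in zip(values, range(1, 7)))
```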
Theorem 3 (Squeeze Theorem). If f(x) and g(x) have the same limit A and if

    f(x) ≤ h(x) ≤ g(x),

then lim h(x) = A.

Proof. By the definition there exist numbers ω1 and ω2 such that

    x > ω1  ⟹  A − ε < f(x) < A + ε

and

    x > ω2  ⟹  A − ε < g(x) < A + ε,

thus

    A − ε < f(x) ≤ h(x) ≤ g(x) < A + ε

for all x > max(ω1, ω2), and by the definition this entails that h(x) has the limit A.
1.1.2 Continuous Functions
Definition 2. A function f is said to be continuous at a point x0 if x0 belongs to the domain and if the limit

    lim_{x→x0} f(x)

exists. If a function is continuous at each point in its domain it is called continuous.

Points at which a function is not continuous are often referred to as discontinuities. Sometimes we also talk about a singularity at such a point. The meaning of continuity is that a small variation of the variable x only causes a small change in the function value f(x). A sudden change of function values thus indicates the presence of a discontinuity.
1.2 Derivatives
There are many practical questions about how quickly a particular course of change takes place, such as "how fast is the car going?", "how quickly does the air pressure fall with increasing height above the surface of the earth?", "how much does the tax increase with a growing income?", etc. We will equip ourselves with a mathematical tool that measures the speed of such changes. Let f(x) denote the temperature distribution along a thin rod positioned along the x-axis. We assume that the temperature is measured in degrees Celsius and that the unit of length is meters. If you have full knowledge of the function f(x), it should also be possible to answer the question: how fast, expressed in degrees per meter, is the temperature changing along the rod at a certain point x0? In other words, there should be an expression formed only by means of f(x) that reasonably measures the change in temperature per meter at a given point x0. To find this expression, we first note that from the point x0 to a nearby point x0 + h, the temperature has changed by

    f(x0 + h) − f(x0)

degrees. In the interval with endpoints x0 and x0 + h the temperature accordingly increases (or decreases) on average by

    (f(x0 + h) − f(x0)) / h                (1.4)

degrees per meter. The expression (1.4) gives no precise answer to the question of how large the rate of change is at the point x0, but the smaller the interval we use, the closer we should get to a precise indication. If the limit

    lim_{h→0} (f(x0 + h) − f(x0)) / h      (1.5)

exists, it is therefore reasonable to regard it as the measure of the temperature's rate of change at the point x0. The limit (1.5) thus represents the expression we were looking for.
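A small numeric sketch (added, not in the original) of how the difference quotient (1.4) approaches the limit (1.5), for f(x) = x² at x0 = 3, where the derivative is 6:

```python
# The difference quotient (1.4) approaches the derivative (1.5) as h -> 0.
# Here f(x) = x**2 at x0 = 3, where f'(3) = 6 and, in exact arithmetic,
# the error of the quotient is exactly h.
def diff_quotient(f, x0, h):
    return (f(x0 + h) - f(x0)) / h

square = lambda x: x * x
errors = [abs(diff_quotient(square, 3.0, 10.0 ** -k) - 6.0) for k in range(1, 6)]
assert all(later < earlier for earlier, later in zip(errors, errors[1:]))
assert errors[-1] < 1e-4
```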
If a function f is differentiable at every point in its domain we say briefly that f is differentiable. The function

    x ↦ f′(x), x ∈ Df

is then called the derivative of f. We also talk about f′(x0) as the slope or steepness of the function's curve at the point (x0, f(x0)).
Second derivative
It can, of course, in some cases be of interest to study the rate of growth of the derivative f′ of a function f. One then forms (f′)′. This function is called the second derivative of f and is designated in either of the ways

    f″, f⁽²⁾, D²f and d²f/dx².

According to its definition, f″ measures how fast the growth rate changes at the point x. For example, if s(t) indicates the total distance in meters at the time t seconds, then s″(t) is the usual acceleration in m/s². Depending on the interpretation of f(x), however, the "acceleration" f″(x) can have completely different units. In the example where f(x) is the temperature (°C) at x (m), this gives the acceleration the unit °C/m².

The converse of theorem 5 is not true. For example, the function f(x) = |x| is continuous but not differentiable at the point x = 0.
Algebraic Properties
Theorem 6. Let f and g be differentiable functions and let α be a constant. Then the functions αf, f + g, fg, and f/g are differentiable in their respective domains. We have, among others, the following formula for the derivative of the quotient:

    (f/g)′(x) = (f′(x)g(x) − f(x)g′(x)) / g(x)²    (1.10)

Proof. Proof of (1.10): By the definition of the derivative we get

    D(1/g(x)) = −g′(x)/g(x)²,

where the denominator goes to g(x)² when h → 0 and the numerator goes to −g′(x); combined with the product rule this gives the formula.
Derivatives of Composite Function
Theorem 7 (Chain rule). Let g(x) be differentiable at x and f(u) be differentiable at u = g(x). Let y = f(g(x)). Then

    dy/dx = f′(g(x)) g′(x).

Proof. We will use the fact that if y = h(x) is differentiable at x then

    Δy = h′(x)Δx + ρΔx,

where ρ → 0 when Δx → 0. We have that

    Δu = g′(x)Δx + ρ1Δx,   where ρ1 → 0 as Δx → 0,
    Δy = f′(u)Δu + ρ2Δu,   where ρ2 → 0 as Δu → 0.

Substituting Δu from the first equation into the second and dividing by Δx,

    Δy/Δx = (f′(u) + ρ2)(g′(x) + ρ1).

Taking the limit as Δx → 0,

    dy/dx = f′(u) g′(x) = (dy/du)(du/dx).
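As an added numeric check of the chain rule (a sketch, not from the text), take y = f(g(x)) with f = sin and g(x) = x², so dy/dx should equal cos(x²)·2x:

```python
import math

# Numeric check of the chain rule for y = f(g(x)) with f = sin and
# g(x) = x**2: dy/dx should equal cos(x**2) * 2x.
def numeric_derivative(fn, x, h=1e-6):
    # symmetric difference quotient
    return (fn(x + h) - fn(x - h)) / (2.0 * h)

x = 1.3
lhs = numeric_derivative(lambda t: math.sin(t * t), x)
rhs = math.cos(x * x) * 2.0 * x
assert abs(lhs - rhs) < 1e-6
```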
1.2.3 General Characteristics of Differentiable Functions
Definition 4. Let x0 be a point in the domain Df of a function f. We say that f has a local maximum at x0 if there is a number δ > 0 such that

    |x − x0| < δ, x ∈ Df  ⟹  f(x) ≤ f(x0).

We then call x0 a local maximum point of f and the function value f(x0) a local maximum. Moreover, if f(x) < f(x0) when x ≠ x0 we speak of a strict local maximum point and a strict local maximum.
Similarly we define a (strict) local minimum point and a (strict) local
minimum value.
Local maximum and local minimum points are with a common name called local extreme points. We also say that f has local extreme values at these points. Note carefully that the concept of local extreme value only describes the function's behavior in the immediate surroundings of a point. A local maximum is not necessarily the function's largest value, but of course it could be.

Theorem 9. If f has a local extreme value at an interior point x0 of its domain and is differentiable there, then

    f′(x0) = 0.

Proof. We consider the case where f has a local maximum at x0; the proof in the other case is analogous. For all sufficiently small values of |h|, according to the definition of a local maximum,

    (f(x0 + h) − f(x0)) / h  is  ≥ 0 if h < 0  and  ≤ 0 if h > 0.

Letting h → 0 from the left and from the right gives

    0 ≤ f′(x0) ≤ 0,

so f′(x0) = 0.
Points at which f has derivative zero, i.e. where the function's growth rate is zero, are usually called critical points. The meaning of theorem 9 is that, in addition to possible end points, extreme values can only occur at critical points. Among other things, in order to determine whether a given critical point is an extreme point or not, we need additional connections between a function and its derivative. The following theorem is fundamental in deriving such connections.
Theorem 10 (Mean value theorem). Suppose that f is continuous in the closed interval a ≤ x ≤ b and differentiable in the open interval a < x < b. Then there exists at least one point ξ, a < ξ < b, such that

    f(b) − f(a) = f′(ξ)(b − a).

Proof. Consider the following help function:

    φ(x) = f(x) − ((f(b) − f(a)) / (b − a)) (x − a).

Insertion of x = a and x = b gives us φ(a) = φ(b) = f(a). Furthermore, φ is continuous in the interval [a, b] and differentiable in the interval (a, b), which gives us

    φ′(x) = f′(x) − (f(b) − f(a)) / (b − a).

Rolle's theorem states that there has to be a critical point at some x = ξ, which gives us

    φ′(ξ) = f′(ξ) − (f(b) − f(a)) / (b − a) = 0,

which is equivalent to

    f(b) − f(a) = f′(ξ)(b − a),

and thereby we have proved the theorem.
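An added worked check of the theorem: for f(x) = x² on [1, 4], the relation f(b) − f(a) = f′(ξ)(b − a) with f′(x) = 2x forces ξ = (a + b)/2 = 2.5.

```python
# Mean value theorem check for f(x) = x**2 on [a, b] = [1, 4]:
# f(b) - f(a) = f'(xi)(b - a) with f'(x) = 2x forces xi = (a + b)/2.
a, b = 1.0, 4.0
f = lambda x: x * x
fprime = lambda x: 2.0 * x

xi = (a + b) / 2.0  # the unique point the theorem guarantees here
assert abs((f(b) - f(a)) - fprime(xi) * (b - a)) < 1e-12
```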
1.2.4 L'Hôpital's rule

Let x0 be a real number (including ±∞) and let f(x) and g(x) be differentiable functions. Suppose that lim_{x→x0} f(x) = 0 and lim_{x→x0} g(x) = 0. If lim_{x→x0} f′(x)/g′(x) exists and there is an interval (a, b) containing x0 such that g′(x) ≠ 0 for all x ∈ (a, b) with x ≠ x0, then lim_{x→x0} f(x)/g(x) exists and

    lim_{x→x0} f(x)/g(x) = lim_{x→x0} f′(x)/g′(x).

Also suppose lim_{x→x0} f(x) = ±∞ and lim_{x→x0} g(x) = ±∞. If lim_{x→x0} f′(x)/g′(x) exists and there is an interval (a, b) containing x0 such that g′(x) ≠ 0 for all x ∈ (a, b) with x ≠ x0, then lim_{x→x0} f(x)/g(x) exists and

    lim_{x→x0} f(x)/g(x) = lim_{x→x0} f′(x)/g′(x).
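An added numeric sketch of the rule on the classic 0/0 form sin(x)/x as x → 0, where the quotient of derivatives is cos(x)/1 and both quotients approach 1:

```python
import math

# L'Hopital sketch for the 0/0 form sin(x)/x as x -> 0: the quotient
# of derivatives is cos(x)/1, and both quotients approach the limit 1.
for x in (0.1, 0.01, 0.001):
    assert abs(math.sin(x) / x - 1.0) < x      # original quotient
    assert abs(math.cos(x) / 1.0 - 1.0) < x    # quotient of derivatives
```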
2
"Love can reach the same level of talent, and even genius, as the
discovery of differential calculus."
Lev Vygotsky
Let f be a function defined on an interval I. A function F is called an anti-derivative of f on I if

    F′(x) = f(x), x ∈ I.

If F is one anti-derivative of f, then every function of the form

    F(x) + C,

where C is a constant, is also an anti-derivative, and every anti-derivative g of f can be written

    g(x) = F(x) + C.
We let ∫ f(x) dx denote an anti-derivative of f. We will soon see that this differential notation has large computational advantages over other, perhaps more immediate, designations, for example ∫ f(x).
x = g(t), (2.3)
According to the chain rule and the definition of an anti-derivative, the derivative with respect to t of the left-hand side is equal to f(g(t))g′(t). One then examines whether the new integral is simpler to handle than the original ∫ f(x) dx. If so, one carries out this calculation, finishes the solution, and returns to the variable x.
2.2 Integrals
Integrals of Step Functions

A function Φ on the interval [a, b] is called a step function if there is a subdivision of [a, b] into smaller intervals on each of which Φ has a constant value. More precisely, if the division points are

    a = x0 < x1 < ... < xn = b,

then Φ is defined by Φ(x) = ck for xk−1 < x < xk.
as (2.5) immediately indicates. We shall see later that this relationship is very practical. It will also prove beneficial to no longer speak of the area between the graph of Φ and the x-axis, but rather consider I(Φ) as a number associated with the function Φ.

Definition 6. The number

    I(Φ) = Σ_{k=1}^{n} ck (xk − xk−1)

is called the integral of the step function Φ. We also use the designation

    I(Φ) = ∫_a^b Φ(x) dx.
To each step function Φ there belongs, as we have seen, a division of its definition interval [a, b]. It is of course conceivable to add further division points without the function itself being changed. We then say that the division is refined. It is obvious that the value of the integral I(Φ) is not affected by such a refinement of the division. This observation has the consequence that if we have two step functions on the same interval [a, b], then there is no restriction in assuming that they are generated from the same division of the interval. Against this background, it is not difficult to recognize the correctness of the following theorem.

Theorem 14. The following properties hold for the integrals of step functions on the interval [a, b]:

    I(αΦ) = αI(Φ), α constant,    (2.6)

    I(Φ) = ∫_a^b Φ(x) dx = ∫_a^c Φ(x) dx + ∫_c^b Φ(x) dx  if a ≤ c ≤ b.    (2.9)
The definition has the consequence that if a function is integrable, then its graph can be covered by finitely many axis-parallel rectangles with arbitrarily small total area. For the region between the graphs of Φ and Ψ in the definition consists of such rectangles and occupies an area of less than ε.

It remains to define the integral of an integrable function. The following theorem is the basis for this.

Given the geometric significance of I(Φ) and I(Ψ), the number A should be an adequate measure of the area of the region between the graph of f and the x-axis. We are therefore led to the following definition.

Definition 8. Assume that the function f is integrable over the interval [a, b]. The uniquely determined number A in Theorem 15 is called the integral of f over [a, b] and is written

    ∫_a^b f(x) dx.
Proof. Let ε be a given positive number. We will construct two step functions Φ and Ψ with Φ ≤ f ≤ Ψ and I(Ψ) − I(Φ) < ε. Since f is continuous on the closed interval [a, b], we can choose a division

    D: a = x0 < x1 < ... < xn = b

of [a, b] such that the length l(D) of the longest sub-interval satisfies l(D) < δ.

Then we define the numbers mk and Mk as the minimum and maximum value of f in the interval xk−1 ≤ x ≤ xk. In particular, then

    Mk − mk < ε/(b − a),  k = 1, 2, ..., n.

Finally, we define two step functions ΦD and ΨD belonging to this division by putting

    ΦD = mk and ΨD = Mk for xk−1 < x < xk.

Then ΦD ≤ f ≤ ΨD and

    I(ΨD) − I(ΦD) = Σ_{k=1}^{n} Mk(xk − xk−1) − Σ_{k=1}^{n} mk(xk − xk−1)
                  = Σ_{k=1}^{n} (Mk − mk)(xk − xk−1)
                  < (ε/(b − a)) Σ_{k=1}^{n} (xk − xk−1) = (ε/(b − a))(b − a) = ε.

Thus f is integrable over [a, b] in the sense of Definition 7, and the theorem is proved.
Riemann sum
Let f be a continuous function on the interval [a, b], and consider the division

    D: a = x0 < x1 < ... < xn = b

of it. Denote by l(D) the length of the largest sub-interval. We can regard this number as a measure of the fineness of the subdivision. Choose arbitrarily in each sub-interval a point ξk with xk−1 ≤ ξk ≤ xk, and form the sum

    RD = Σ_{k=1}^{n} f(ξk)(xk − xk−1).    (2.10)
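An added computational sketch of (2.10): Riemann sums for f(x) = x² on [0, 1], taking the midpoints as the sample points ξk, approach the integral 1/3 as the division is refined.

```python
# Riemann sums (2.10) for f(x) = x**2 on [0, 1], using midpoints as the
# sample points xi_k; the sums approach the integral 1/3 as l(D) -> 0.
def riemann_sum(f, a, b, n):
    dx = (b - a) / n
    return sum(f(a + (k + 0.5) * dx) * dx for k in range(n))

approx = riemann_sum(lambda x: x * x, 0.0, 1.0, 1000)
assert abs(approx - 1.0 / 3.0) < 1e-6
```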
Theorem 17. Suppose that f is continuous on [a, b]. For the Riemann sum (2.10) it holds that

    RD = Σ_{k=1}^{n} f(ξk)(xk − xk−1) → ∫_a^b f(x) dx    (2.11)

as l(D) → 0.

Proof. With mk and Mk as before,

    mk ≤ f(ξk) ≤ Mk,  k = 1, 2, ..., n,

so

    I(ΦD) ≤ RD ≤ I(ΨD),

while

    I(ΦD) ≤ I(f) ≤ I(ΨD)

by the definition of I(f). This shows that RD → I(f) when the division fineness l(D) goes to zero.
    ∫_a^b (f(x) + g(x)) dx = ∫_a^b f(x) dx + ∫_a^b g(x) dx,    (2.13)

    f(x) ≤ g(x) in [a, b]  ⟹  ∫_a^b f(x) dx ≤ ∫_a^b g(x) dx,    (2.14)

    ∫_a^b f(x) dx = ∫_a^c f(x) dx + ∫_c^b f(x) dx.    (2.15)
In particular, ∫_a^a f(x) dx = 0. With this convention we see that (2.15) is a correct formula for all relative positions of the points a, b and c, under the premise that the integrals exist.

An important special case of (2.14) is

    g(x) ≥ 0 in [a, b]  ⟹  ∫_a^b g(x) dx ≥ 0.
Proof. We put

    m = min_{a≤x≤b} f(x),  M = max_{a≤x≤b} f(x).

Then

    m ≤ f(x) ≤ M when a ≤ x ≤ b,

which gives

    m(b − a) = ∫_a^b m dx ≤ ∫_a^b f(x) dx ≤ ∫_a^b M dx = M(b − a).

We put

    C = (1/(b − a)) ∫_a^b f(x) dx;

the inequalities above then imply m ≤ C ≤ M. But f is continuous and therefore attains every value between m and M in the interval [a, b]. In particular, there is a ξ in this interval for which f(ξ) = C, and we have proved the theorem.
We now show that if f is continuous on [a, b] and S(x) = ∫_a^x f(t) dt, then

    S′(x) = f(x).
Now we use the mean value theorem for integrals, and we get

    (S(x + h) − S(x)) / h = (1/h) f(ξh)(x + h − x) = f(ξh)

for some point ξh between x and x + h. When h → 0, ξh goes towards x. Since f is continuous it follows that

    f(ξh) → f(x) when h → 0.

Thus the function S(x) is differentiable with the derivative S′(x) = f(x).
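An added numeric sketch of S′(x) = f(x) for f = cos: the integral S(x) is approximated by a Riemann sum, and its difference quotient recovers cos(x).

```python
import math

# Sketch of S'(x) = f(x) for f = cos: S(x), the integral of cos over
# [0, x], is approximated by a midpoint Riemann sum, and the difference
# quotient of S recovers f(x) = cos(x).
def S(x, n=10000):
    dx = x / n
    return sum(math.cos((k + 0.5) * dx) * dx for k in range(n))

x, h = 1.0, 1e-3
derivative = (S(x + h) - S(x)) / h
assert abs(derivative - math.cos(x)) < 1e-2
```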
exists, say equal to A, the improper integral is said to be convergent. The number A is called its value. If the limit does not exist, we say that the improper integral is divergent.

not only for the integral but also for the integral's designated value A.
Unbounded integrand

We now consider a function defined on an interval a < x ≤ b that is bounded and Riemann integrable on each sub-interval [a + ε, b], ε > 0. The function is assumed not to be bounded on all of ]a, b]. For such a function f,

    ∫_a^b f(x) dx    (2.18)

is an improper integral. If the limit

    lim_{ε→0+} ∫_{a+ε}^b f(x) dx

exists, we say that the improper integral (2.18) is convergent with the value A. If the limit does not exist it is said to be divergent.
One also works with the distribution function F(x), which is related to the density function by

    F(x) = ∫_{−∞}^{x} f(t) dt;

the number F(x) evidently means the probability that the outcome of the trial is less than or equal to x. If f is continuous, F is differentiable and F′(x) = f(x) according to the fundamental theorem of calculus.

As a measure of the location of the density function we use the so-called mean or expected value. This is defined as the number

    m = ∫_{−∞}^{+∞} x f(x) dx.

The analogy with mechanics is clear: the expected value coincides with the location of the center of gravity for a mass distribution along the entire real axis with density f(x) and total mass 1.
3
Linear Algebra
Matrix Notation
The essential information of a linear system can be recorded compactly in a rectangular array called a matrix. Given the system

    x1 − 2x2 +  x3 = 0
         2x2 − 8x3 = 8
    5x1      − 5x3 = 10    (3.2)
An echelon matrix is one that is in echelon form. Property 2 says that the leading entries form an echelon ("steplike") pattern that moves down and to the right through the matrix. Property 3 is a simple consequence of property 2, but we include it for emphasis.

The triangular matrices

    [ 2 −3  2   1 ]        [ 1 0 0 29 ]
    [ 0  1 −4   8 ]  and   [ 0 1 0 16 ]
    [ 0  0  0 5/2 ]        [ 0 0 1  3 ]

are in echelon form. In fact, the second matrix is in reduced echelon form.

Any nonzero matrix may be row reduced into more than one matrix in echelon form, using different sequences of row operations. However, the reduced echelon form one obtains from a matrix is unique.
Pivot Positions
When row operations on a matrix produce an echelon form, further row
operations to obtain the reduced echelon form do not change the positions
of the leading entries. Since the reduced echelon form is unique, the leading
entries are always in the same positions in any echelon form obtained from a
given matrix. These leading entries correspond to leading 1s in the reduced
echelon form.
If A is m × n, B is n × p, and x ∈ Rp, denote the columns of B by b1, ..., bp and the entries in x by x1, ..., xp. Then

    Bx = x1 b1 + ... + xp bp.

This definition makes equation (3.3) true for all x ∈ Rp. Equation (3.3) shows that the composite mapping is a linear transformation and that its standard matrix is AB. Multiplication of matrices corresponds to composition of linear transformations.
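The column-by-column definition above can be sketched in code (an added example; the matrix B is hypothetical):

```python
# The matrix-vector product computed column by column, as in the
# definition Bx = x1*b1 + ... + xp*bp; B is stored as a list of rows.
def matvec(B, x):
    m = len(B)
    result = [0.0] * m
    for j, xj in enumerate(x):            # one column of B at a time
        for i in range(m):
            result[i] += xj * B[i][j]     # add xj times column j
    return result

B = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
assert matvec(B, [1.0, 1.0]) == [3.0, 7.0, 11.0]
```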
1. A(BC) = (AB)C

2. A(B + C) = AB + AC

3. (B + C)A = BA + CA

5. Im A = A = A In

Proof. We will just prove property (1). Property (1) follows from the fact that matrix multiplication corresponds to composition of linear transformations, and it is known that the composition of functions is associative.
3.3 The Matrix Equation Ax = b
A fundamental idea in linear algebra is to view a linear combination of vectors as the product of a matrix and a vector.

    x1 a1 + ... + xn an = b    (3.5)

which, in turn, has the same solution set as the system of linear equations whose augmented matrix is

    [ a1  a2  ...  an  b ].    (3.6)
Existence of solutions
Theorem 24. Let A be an m × n matrix. The following statements are logically equivalent. That is, for a particular A, either they are all true or they are all false.
3. The columns of A span Rm .
3.4 The Inverse of a Matrix
Matrix algebra provides tools for manipulating matrix equations and creating various useful formulas in ways similar to doing ordinary algebra with real numbers.

Recall that the multiplicative inverse of a number such as 5 is 1/5 or 5⁻¹. This inverse satisfies the equations

    5⁻¹ · 5 = 1 and 5 · 5⁻¹ = 1.

The matrix generalization requires both equations and avoids the slanted-line notation (for division) because matrix multiplication is not commutative. Furthermore, a full generalization is possible only if the matrices involved are square.

An n × n matrix A is said to be invertible if there is an n × n matrix C such that

    CA = I and AC = I,

where I = In, the n × n identity matrix. In this case, C is an inverse of A. In fact, C is uniquely determined by A, because if B were another inverse of A then B = BI = B(AC) = (BA)C = IC = C. This unique inverse is denoted by A⁻¹, so that

    A⁻¹A = I and AA⁻¹ = I.
A matrix that is not invertible is sometimes called a singular matrix, and
an invertible matrix is called a nonsingular matrix.
Theorem 26. Let

    A = [ a  b ]
        [ c  d ].

If ad − bc ≠ 0, then A is invertible and

    A⁻¹ = (1/(ad − bc)) [  d  −b ]
                        [ −c   a ].

If ad − bc = 0, then A is not invertible.
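Theorem 26 translates directly into a short sketch (an added example):

```python
# Theorem 26 as code: the 2x2 inverse formula, returning None when
# ad - bc = 0 and the matrix is not invertible.
def inverse_2x2(a, b, c, d):
    det = a * d - b * c
    if det == 0:
        return None
    return [[d / det, -b / det],
            [-c / det, a / det]]

assert inverse_2x2(2.0, 1.0, 5.0, 3.0) == [[3.0, -1.0], [-5.0, 2.0]]  # det = 1
assert inverse_2x2(1.0, 2.0, 2.0, 4.0) is None                        # det = 0
```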
Theorem 27. If A is an invertible n × n matrix, then for each b in Rn the equation Ax = b has the unique solution x = A⁻¹b.

Proof. Take any b in Rn. A solution exists because if A⁻¹b is substituted for x, then Ax = A(A⁻¹b) = (AA⁻¹)b = Ib = b. So A⁻¹b is a solution. To prove that the solution is unique, we show that if u is any solution, then u must in fact be A⁻¹b. Indeed, if Au = b, we can multiply both sides by A⁻¹ and obtain

    A⁻¹Au = A⁻¹b,  Iu = A⁻¹b,  u = A⁻¹b.
The formula in Theorem 27 is seldom used to solve an equation Ax = b numerically, because row reduction of [A b] is nearly always faster. One possible exception is the 2 × 2 case. In this case mental computations to solve Ax = b are sometimes easier using the formula for A⁻¹.

(a) If A is an invertible matrix, then A⁻¹ is invertible and

    (A⁻¹)⁻¹ = A.

(b) If A and B are n × n invertible matrices, then so is AB, and the inverse of AB is the product of the inverses of A and B in the reverse order. That is,

    (AB)⁻¹ = B⁻¹A⁻¹.

(c) If A is invertible, then so is AT, and

    (AT)⁻¹ = (A⁻¹)T.

Proof. To verify (a), we must find a matrix C such that

    A⁻¹C = I and CA⁻¹ = I;

these equations are satisfied with A in place of C.
3.5.1 The LU Factorization
The LU factorization, described below, is motivated by the fairly common industrial and business problem of solving a sequence of equations, all with the same coefficient matrix:

    Ax = b1, Ax = b2, ..., Ax = bp.    (3.9)

When A = LU, the equation Ax = b can be solved via the pair

    Ly = b,    (3.10)
    Ux = y.    (3.11)

First solve Ly = b for y, and then solve Ux = y for x. Each equation is easy to solve because L and U are triangular.
An LU Factorization Algorithm
Suppose A can be reduced to an echelon form U using only row replacements that add a multiple of one row to another row below it. In this case, there exist unit lower triangular elementary matrices E1, ..., Ep such that

    Ep · · · E1 A = U.    (3.12)

Then

    A = (Ep · · · E1)⁻¹ U = LU,    (3.13)

where

    L = (Ep · · · E1)⁻¹.    (3.14)

It can be shown that products and inverses of unit lower triangular matrices are also unit lower triangular. Thus L is unit lower triangular.

Note that the row operations in equation (3.12), which reduce A to U, also reduce the L in equation (3.14) to I, because Ep · · · E1 L = (Ep · · · E1)(Ep · · · E1)⁻¹ = I. This observation is the key to constructing L.

Definition 15 (Algorithm for an LU Factorization). 1. Reduce A to an echelon form U by a sequence of row replacement operations, if possible.

2. Place entries in L such that the same sequence of row operations reduces L to I.

Step 1 is not always possible, but when it is, the argument above shows that an LU factorization exists. By construction, L will satisfy

    (Ep · · · E1) L = I,

using the same E1, ..., Ep as in equation (3.12). Thus L will be invertible, by the invertible matrix theorem, with Ep · · · E1 = L⁻¹. From (3.12), L⁻¹A = U, and A = LU. So step 2 will produce an acceptable L.
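The algorithm above can be sketched as follows (an added example; it uses only row replacements and assumes every pivot it meets is nonzero):

```python
# A minimal LU factorization using only row replacements (no row
# interchanges), following the algorithm above; it assumes every pivot
# U[j][j] it meets is nonzero.
def lu(A):
    n = len(A)
    U = [row[:] for row in A]
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for j in range(n):
        for i in range(j + 1, n):
            m = U[i][j] / U[j][j]   # multiplier that zeroes U[i][j]
            L[i][j] = m             # the recorded entry of L
            for k in range(j, n):
                U[i][k] -= m * U[j][k]
    return L, U

L, U = lu([[4.0, 3.0],
           [6.0, 3.0]])
assert L == [[1.0, 0.0], [1.5, 1.0]]
assert U == [[4.0, 3.0], [0.0, -1.5]]
```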
3.6 Subspaces of Rn
Definition 16. A subspace of Rn is any set H in Rn that has three properties:

(a) The zero vector is in H.

(b) For each u and v in H, the sum u + v is in H.

(c) For each u in H and each scalar c, the vector cu is in H.
In words, a subspace is closed under addition and scalar multiplication.
Theorem 29. The null space of an m × n matrix A is a subspace of Rn. Equivalently, the set of all solutions of a system Ax = 0 of m homogeneous linear equations in n unknowns is a subspace of Rn.

Proof. The zero vector is in Nul A (because A0 = 0). To show that Nul A satisfies the other two properties required of a subspace, take any u and v in Nul A. That is, suppose Au = 0 and Av = 0. Then, by a property of matrix multiplication,

    A(u + v) = Au + Av = 0 + 0 = 0.

Thus u + v satisfies Ax = 0, so u + v is in Nul A. Also, for any scalar c,

    A(cu) = c(Au) = c0 = 0,

so cu is in Nul A.
We say that λ is an eigenvalue of an n × n matrix A if and only if the equation

    (A − λI)x = 0    (3.15)

has a nontrivial solution. The set of all solutions of (3.15) is just the null space of the matrix A − λI. So this set is a subspace of Rn and is called the eigenspace of A corresponding to λ. The eigenspace consists of the zero vector and all the eigenvectors corresponding to λ.

Theorem 31. The eigenvalues of a triangular matrix are the entries on its main diagonal.

    Ax = 0x    (3.16)
of the preceding (linearly independent) vectors. Then there exist scalars c1, ..., cp such that

    c1 v1 + ... + cp vp = v_{p+1}.    (3.17)

Multiplying both sides of (3.17) by A and using the fact that Avk = λk vk for each k, we obtain

    c1 Av1 + ... + cp Avp = Av_{p+1},

that is,

    c1 λ1 v1 + ... + cp λp vp = λ_{p+1} v_{p+1}.    (3.18)

Multiplying both sides of (3.17) by λ_{p+1} and subtracting the result from (3.18), we have

    c1(λ1 − λ_{p+1}) v1 + ... + cp(λp − λ_{p+1}) vp = 0.    (3.19)

Since {v1, ..., vp} is linearly independent, the weights in (3.19) are all zero. But none of the factors λi − λ_{p+1} are zero, because the eigenvalues are distinct. Hence ci = 0 for i = 1, ..., p, and (3.17) then says v_{p+1} = 0, which is impossible. Therefore {v1, ..., vr} cannot be linearly dependent and must be linearly independent.
Similarity
The next theorem illustrates one use of the characteristic polynomial, and it provides the foundation for several iterative methods that approximate eigenvalues. If A and B are n × n matrices, then A is similar to B if there is an invertible matrix P such that P⁻¹AP = B, or equivalently, A = PBP⁻¹. Writing Q for P⁻¹, we have Q⁻¹BQ = A. So B is also similar to A, and we say simply that A and B are similar. Changing A into P⁻¹AP is called a similarity transformation.

Theorem 33. If n × n matrices A and B are similar, then they have the same characteristic polynomial and hence the same eigenvalues.

Proof. If B = P⁻¹AP, then

    B − λI = P⁻¹AP − λP⁻¹P = P⁻¹(AP − λP) = P⁻¹(A − λI)P.
3.8 Diagonalization
In many cases, the eigenvalue-eigenvector information contained in a matrix A can be displayed in a useful factorization of the form A = PDP⁻¹, where D is a diagonal matrix. In this section, the factorization enables us to compute A^k quickly for large values of k, a fundamental idea in several applications of linear algebra.

Proof. First, observe that if P is any n × n matrix with columns v1, ..., vn, and if D is any diagonal matrix with diagonal entries λ1, ..., λn, then

    AP = A[v1 v2 ... vn] = [Av1 Av2 ... Avn],    (3.21)

while

    PD = P diag(λ1, λ2, ..., λn) = [λ1 v1  λ2 v2  ...  λn vn].    (3.22)

Now suppose A is diagonalizable and A = PDP⁻¹. Then right-multiplying this relation by P, we have AP = PD. In this case, equations (3.21) and (3.22) imply that

    [Av1 Av2 ... Avn] = [λ1 v1  λ2 v2  ...  λn vn],    (3.23)

that is,

    Av1 = λ1 v1,  Av2 = λ2 v2,  ...,  Avn = λn vn.    (3.24)

Since λ1, ..., λn are then eigenvalues and v1, ..., vn are corresponding eigenvectors, this argument proves the "only if" parts of the first and second statements, along with the third statement, of the theorem.

Finally, given any n eigenvectors v1, ..., vn, use them to construct the columns of P and use the corresponding eigenvalues λ1, ..., λn to construct D. By equations (3.21)-(3.23), AP = PD. This is true without any condition on the eigenvectors. If, in fact, the eigenvectors are linearly independent, then P is invertible, and AP = PD implies that A = PDP⁻¹.
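The use of the factorization for powers can be sketched as follows (an added, hand-checkable example, not from the text):

```python
# Computing A^k through A = P D P^(-1), so that A^k = P D^k P^(-1).
# The matrix A = [[4, -3], [2, -1]] (a hand-checkable example, not from
# the text) has eigenvalues 1 and 2 with eigenvectors (1, 1) and (3, 2).
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

P = [[1.0, 3.0], [1.0, 2.0]]
P_inv = [[-2.0, 3.0], [1.0, -1.0]]         # inverse of P (det P = -1)
k = 5
Dk = [[1.0 ** k, 0.0], [0.0, 2.0 ** k]]    # powering D just powers its diagonal
Ak = matmul(matmul(P, Dk), P_inv)

# compare against repeated multiplication by A itself
A = [[4.0, -3.0], [2.0, -1.0]]
direct = A
for _ in range(k - 1):
    direct = matmul(direct, A)
assert Ak == direct
```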
(a) u · v = v · u.

(b) (u + v) · w = u · w + v · w.
Definition 23. The length (or the norm) of v is the nonnegative scalar ‖v‖ defined by

    ‖v‖ = √(v · v) = √(v1² + v2² + ... + vn²), so that ‖v‖² = v · v.

Suppose v is in R², say v = (a, b). If we identify v with a geometric point in the plane, as usual, then ‖v‖ coincides with the standard notion of the length of the line segment from the origin to v. This follows from the Pythagorean theorem applied to a right triangle.

A similar calculation with the diagonal of a rectangular box shows that the definition of the length of a vector v in R³ coincides with the usual notion of length.

For any scalar c, the length of cv is |c| times the length of v. That is,

    ‖cv‖ = |c| ‖v‖.

A vector whose length is 1 is called a unit vector. If we divide a nonzero vector v by its length, that is, multiply by 1/‖v‖, we obtain a unit vector u, because the length of u is (1/‖v‖)‖v‖ = 1. The process of creating u from v is sometimes called normalizing v, and we say that u is in the same direction as v.
Distance in Rn
Recall that if a and b are real numbers, the distance on the number line
between a and b is the number |a b|. This definition of distance in R has
a direct analogue in Rn .
Definition 24. For u and v in Rn , the distance between u and v, written
as dist(u, v), is the length of the vector u v. That is
dist(u, v) = ku vk .
In R² and R³, this definition of distance coincides with the usual formulas for the Euclidean distance between two points.
Orthogonal Complements
If a vector z is orthogonal to every vector in a subspace W of Rn, then z is said to be orthogonal to W. The set of all vectors z that are orthogonal to W is called the orthogonal complement of W and is denoted by W⊥.

Theorem 37. Let A be an m × n matrix. The orthogonal complement of the row space of A is the null space of A, and the orthogonal complement of the column space of A is the null space of AT:

    (Row A)⊥ = Nul A and (Col A)⊥ = Nul AT.
Theorem 39. Let {u1, ..., up} be an orthogonal basis for a subspace W of Rn. For each y in W, the weights in the linear combination

    y = c1 u1 + ... + cp up

are given by

    cj = (y · uj) / (uj · uj),  j = 1, ..., p.

Proof. As in the preceding proof, the orthogonality of {u1, ..., up} shows that

    y · u1 = (c1 u1 + c2 u2 + ... + cp up) · u1 = c1 (u1 · u1).

Since u1 · u1 is not zero, the equation can be solved for c1. To find cj for j = 2, ..., p, compute y · uj and solve for cj.
Orthonormal Sets
A set {u1, ..., up} is an orthonormal set if it is an orthogonal set of unit vectors. If W is the subspace spanned by such a set, then {u1, ..., up} is an orthonormal basis for W, since the set is automatically linearly independent, by Theorem 38.

The simplest example of an orthonormal set is the standard basis {e1, ..., en} for Rn. Any nonempty subset of {e1, ..., en} is orthonormal, too.

The entries in the matrix at the right are inner products, written with transpose notation. The columns of U are orthogonal if and only if
Theorem 41. Let U be an m × n matrix with orthonormal columns, and let x and y be in Rn. Then

(a) ‖Ux‖ = ‖x‖

(b) (Ux) · (Uy) = x · y

Each y in Rn can be written as a sum

    y = z1 + z2

of a vector z1 in a given subspace W and a vector z2 in W⊥. More precisely, each y can be written uniquely in the form

    y = ŷ + z,    (3.28)

where ŷ is in W, z is in W⊥, and

    ŷ = ((y · u1)/(u1 · u1)) u1 + ... + ((y · up)/(up · up)) up    (3.29)

for any orthogonal basis {u1, ..., up} of W.
Proof. Let {u1, ..., up} be any orthogonal basis for W, and define ŷ by (3.29). Then ŷ is in W because ŷ is a linear combination of the basis vectors u1, ..., up. Let z = y − ŷ. Since u1 is orthogonal to u2, ..., up, it follows from (3.29) that

    z · u1 = (y − ŷ) · u1 = y · u1 − ((y · u1)/(u1 · u1)) (u1 · u1) − 0 − ... − 0
           = y · u1 − y · u1 = 0.

Thus z is orthogonal to u1. Similarly, z is orthogonal to each uj in the basis for W. Hence z is orthogonal to every vector in W. That is, z is in W⊥.

To show that the decomposition in (3.28) is unique, suppose y can also be written as y = ŷ1 + z1 with ŷ1 in W and z1 in W⊥. Then ŷ + z = ŷ1 + z1, and so

    ŷ − ŷ1 = z1 − z.

This equality shows that the vector v = ŷ − ŷ1 is in W and in W⊥. Hence v · v = 0, which shows that v = 0. This proves that ŷ = ŷ1 and also z1 = z.
    ‖y − ŷ‖ < ‖y − v‖    (3.30)

for all v in W distinct from ŷ.

The vector ŷ in Theorem 43 is called the best approximation to y by elements of W. The distance from y to v, given by ‖y − v‖, can be regarded as the "error" of using v in place of y. Theorem 43 says that this error is minimized when v = ŷ.

Inequality (3.30) leads to a new proof that ŷ does not depend on the particular orthogonal basis used to compute it. If a different orthogonal basis for W were used to construct an orthogonal projection of y, then this projection would also be the closest point in W to y, namely ŷ.

Proof. Take v in W distinct from ŷ. Then ŷ − v is in W, and z = y − ŷ is orthogonal to W, so

    y − v = (y − ŷ) + (ŷ − v)

is an orthogonal decomposition, and the Pythagorean theorem gives

    ‖y − v‖² = ‖y − ŷ‖² + ‖ŷ − v‖².

Now ‖ŷ − v‖² > 0 because ŷ − v ≠ 0, and so inequality (3.30) follows immediately.
    projW y = U Uᵀ y for all y in Rn.    (3.32)

Proof. Formula (3.31) follows immediately from (3.29). Also, (3.31) shows that projW y is a linear combination of the columns of U using the weights y · u1, y · u2, ..., y · up. The weights can be written as u1ᵀy, u2ᵀy, ..., upᵀy, showing that they are the entries in Uᵀy, which justifies (3.32).
Theorem 45 shows that any nonzero subspace W of Rn has an orthogonal basis, because an ordinary basis {x1, ..., xk} is always available and the Gram-Schmidt process depends only on the existence of orthogonal projections onto subspaces of W that already have orthogonal bases.
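The process can be sketched in code (an added example on two vectors in R³): each new vector has its projections onto the earlier ones subtracted off.

```python
# A sketch of the Gram-Schmidt process on two vectors in R^3: each x_k
# has its projections onto the earlier v's subtracted off.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(xs):
    vs = []
    for x in xs:
        v = list(x)
        for u in vs:
            c = dot(x, u) / dot(u, u)                 # projection weight
            v = [vi - c * ui for vi, ui in zip(v, u)]
        vs.append(v)
    return vs

v1, v2 = gram_schmidt([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]])
assert v1 == [1.0, 1.0, 0.0]
assert v2 == [0.5, -0.5, 1.0]
assert dot(v1, v2) == 0.0                             # the result is orthogonal
```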
Orthonormal Bases
An orthonormal basis is constructed easily from an orthogonal basis {v1, ..., vp}: simply normalize all the vk. When working problems by hand, this is easier than normalizing each vk as soon as it is found.
    ‖b − Ax̂‖ ≤ ‖b − Ax‖

for all x in Rn.

    Ax̂ = b̂.    (3.34)

Since b̂ is the closest point in Col A to b, a vector x̂ is a least-squares solution of Ax = b if and only if x̂ satisfies (3.34). Such an x̂ in Rn is a list of weights that will build b̂ out of the columns of A.

    Aᵀ(b − Ax̂) = 0.    (3.35)
Thus

    Aᵀb − AᵀAx̂ = 0,

that is,

    AᵀAx̂ = Aᵀb.

These calculations show that each least-squares solution of Ax = b satisfies the equation

    AᵀAx = Aᵀb.    (3.36)

The matrix equation (3.36) represents a system of equations called the normal equations for Ax = b. A solution of (3.36) is often denoted by x̂.

    b = Ax̂ + (b − Ax̂)
2. Orthogonal Matrices