
Least Squares Parameter Estimation, Tikhonov
Regularization, and Singular Value Decomposition

CEE 690, ME 555 System Identification, Fall 2013
© Henri P. Gavin, September 16, 2013

This hand-out addresses the errors in parameters estimated from fitting a function to data. Any sample of measured quantities will naturally contain some variability. Normal variations in data propagate through any equation or function applied to the data. In general we may be interested in combining the data in some mathematical way to compute another quantity. For example, we may be interested in computing the gravitational acceleration of the earth by measuring the time it takes for a mass to fall from rest through a measured distance. The equation is d = gt²/2, or g = 2d/t². The ruler we use to measure the distance d will have a finite resolution and may also produce systematic errors if we do not account for issues such as thermal expansion. The clock we use to measure the time t will also have some error. Fortunately, the errors associated with the ruler are in no way related to the errors in the clock; they are statistically independent, or uncorrelated. If we repeat the experiment n times, with a very precise clock, we will naturally find that measurements of the time ti are never repeated exactly. The variability in our measurements of d and t will surely lead to variability in the estimate of g. As described in the next section, error propagation formulas help us determine the variability of g based upon the variability of the measurements of d and t.

The remainder of this document addresses how errors in parameters (such as the gravitational constant, g) can be evaluated and reduced by modifying the equations describing the system. We will see that in many parameter estimation problems, the fundamental nature of the problem can lead to parameter errors that are so large that they can make the problem almost impossible to solve. We will see that by modifying the problem slightly the problem becomes solvable, but that the resulting parameters may be biased. In this way we will be able to control the trade-off between systematic bias errors and random errors.
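As a concrete illustration, the short numpy sketch below applies first-order (quadrature) error propagation to g = 2d/t², treating the errors in d and t as independent. The numerical values of d, t, and their standard deviations are made-up assumptions, not measurements from the text.

    import numpy as np

    # First-order error propagation for g = 2*d/t**2 with independent errors:
    #   sigma_g^2 = (dg/dd)^2 sigma_d^2 + (dg/dt)^2 sigma_t^2
    d, sigma_d = 2.00, 0.005          # assumed drop height [m] and its std. dev.
    t, sigma_t = 0.64, 0.010          # assumed fall time [s] and its std. dev.

    g = 2*d / t**2                    # point estimate of g
    dg_dd = 2 / t**2                  # partial derivative dg/dd
    dg_dt = -4*d / t**3               # partial derivative dg/dt
    sigma_g = np.sqrt((dg_dd*sigma_d)**2 + (dg_dt*sigma_t)**2)

    print(f"g = {g:.2f} +/- {sigma_g:.2f} m/s^2")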


1 Formulation of Tikhonov Regularization

Consider the over-determined system of linear equations y = Xa where


y Rm and a Rn , and m > n. We seek the solution a that minimizes the
quadratic objective function
J(a) = ||Xa y||P + ||a a
||Q ,

(1)

where the quadratic vector norm is defined here as ||x||P = xT P x, in which the
weighting matrix P is positive definite, and the Tikhonov regularization factor
is non-negative. If the vector y is obtained through imprecise measurements,
and the measurements of each element of yi are statistically independent (as
is typically the case), then P is a diagonal matrix in which each diagonal
element Pii is the inverse of the variance of the measurement error of yi ,
Pii = 1/y2i . If the errors in yi are not statistically independent, then P should
be the inverse of the covariance matrix Vy of the data vector y, P = V1
y . The
positive definite matrix Q and the reference parameter vector a
reflect the way
in which we would like to constrain the parameters. For example, we may
simply want the solution, a to be near some reference point, a
, in which case
Q = In . Alternatively, we may wish some linear function of the parameters
= 0. Expanding the
LQ a to be minimized, in which case Q = LTQ LQ and a
quadratic objective function,
J(a) = aT X T P Xa 2aT X T P y + y T P y + aT Qa 2aT Q
a +
aT Q
a. (2)
The objective function is minimized by setting the first partial of J(a) with
respect to a equal to zero,

J(a)

= 2X T P Xa 2X T P y + 2Qa 2Q
a = 0n1 ,
a

(3)

and solving for the optimal parameters a ,


a() = [X T P X + Q]1 (X T P y + Q
a).

(4)

The meaning of the notation a() is that the solution a depends upon the value
of the regularization factor, . The regularization factor weights the relative
importance of ||Xa y||P and ||a a
||Q . For problems in which X or X T P X
are ill-conditioned, small values of (i.e., small compared to the average of
the diagonal elements of X T P X) can significantly improve the conditioning
of the problem.


If the m measurement errors of yᵢ are not individually known, then it is common to set P = Iₘ. Likewise, if the n parameter differences a - ā are all equally important, then it is customary to set Q = Iₙ. Finally, if we have no set of reference parameters ā, then it is common to set ā = 0. With these simplifications, the solution is given by â(λ) = [XᵀX + λIₙ]⁻¹ Xᵀy, which may be written â(λ) = X⁺(λ) y, where X⁺(λ) is called the regularized pseudo-inverse. In the more general case, in which P ≠ Iₘ and Q ≠ Iₙ, but ā = 0_{n×1},

    X⁺(λ) = [XᵀP X + λQ]⁻¹ XᵀP .                                               (5)

The dimension of X⁺(λ) is n × m. In a later section we will see that if L_Q is invertible then we can always scale and shift X, y, and a with no loss of generality.
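A minimal numpy sketch of equations (4) and (5) follows. The function name, the synthetic data, and the default choices P = Iₘ, Q = Iₙ, ā = 0 are illustrative assumptions, not part of the original hand-out.

    import numpy as np

    def tikhonov_solution(X, y, lam, P=None, Q=None, a_bar=None):
        # a_hat(lam) = [X'PX + lam*Q]^-1 (X'Py + lam*Q*a_bar), eq. (4)
        m, n = X.shape
        P = np.eye(m) if P is None else P
        Q = np.eye(n) if Q is None else Q
        a_bar = np.zeros(n) if a_bar is None else a_bar
        A = X.T @ P @ X + lam * Q
        b = X.T @ P @ y + lam * Q @ a_bar
        return np.linalg.solve(A, b)      # solve, rather than forming the inverse

    # made-up example
    rng = np.random.default_rng(0)
    X = rng.standard_normal((20, 3))
    a_true = np.array([1.0, -2.0, 0.5])
    y = X @ a_true + 0.05 * rng.standard_normal(20)
    print(tikhonov_solution(X, y, lam=1e-3))    # close to a_true for small lam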

1.1 Error Analysis of Tikhonov Regularization

Recall that if f(y) is an n-dimensional vector-valued function of m correlated random variables, with covariance matrix V_y, then the covariance matrix of f is

    [V_f]_{k,l} = Σᵢ₌₁ᵐ Σⱼ₌₁ᵐ (∂f_k/∂yᵢ)(∂f_l/∂yⱼ) [V_y]_{i,j} = [∂f/∂y] V_y [∂f/∂y]ᵀ ,    (6)

where [∂f/∂y] is the n × m Jacobian matrix of f with respect to y,

    [∂f/∂y]_{k,i} = ∂f_k/∂yᵢ .                                                  (7)

We are interested in determining the covariance matrix of the solution â(λ). Applying the quadrature error propagation relation,

    V_â(λ) = [∂â(λ)/∂y] V_y [∂â(λ)/∂y]ᵀ
           = X⁺(λ) V_y X⁺(λ)ᵀ
           = [XᵀP X + λQ]⁻¹ XᵀP V_y P X [XᵀP X + λQ]⁻¹
           = [XᵀP X + λQ]⁻¹ XᵀP X [XᵀP X + λQ]⁻¹ ,                               (8)

where we use P = V_y⁻¹. This covariance matrix is sometimes called the error propagation matrix, as it indicates how random errors in y propagate to the solution â. Note that in the special case of no regularization (λ = 0),

    V_â(0) = [XᵀP X]⁻¹ .                                                        (9)

In addition to having propagation errors, the solution â(λ) is biased by the regularization factor. Let us presume that we know y exactly, and that the exact value of y is y_e. The exact solution, without regularization, is a_e = [XᵀP X]⁻¹ XᵀP y_e. The regularization error, Δa(λ) = â(λ) - a_e (for ā = 0), is

    Δa(λ) = [X⁺(λ) - X⁺(0)] y_e
          = [[XᵀP X + λQ]⁻¹ XᵀP - [XᵀP X]⁻¹ XᵀP] y_e
          = [[XᵀP X + λQ]⁻¹ - [XᵀP X]⁻¹] XᵀP y_e .                               (10)

Recall that a_e = [XᵀP X]⁻¹ XᵀP y_e, or XᵀP X a_e = XᵀP y_e, so

    Δa(λ) = [[XᵀP X + λQ]⁻¹ - [XᵀP X]⁻¹] XᵀP X a_e
          = [[XᵀP X + λQ]⁻¹ XᵀP X - Iₙ] a_e
          = [XᵀP X + λQ]⁻¹ [XᵀP X + λQ - λQ] a_e - Iₙ a_e
          = [XᵀP X + λQ]⁻¹ [XᵀP X + λQ] a_e - [XᵀP X + λQ]⁻¹ λQ a_e - Iₙ a_e
          = -λ [XᵀP X + λQ]⁻¹ Q a_e .                                           (11)

The regularization error Δa(λ) equals zero if λ = 0 and increases with λ.

The total mean squared error matrix, E(λ), is the sum of the parameter covariance matrix and the regularization bias error, E(λ) = V_â(λ) + Δa(λ) Δa(λ)ᵀ. The mean squared error is the trace of E(λ).
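The bias-variance trade-off in equations (8) and (11) can be tabulated numerically. The sketch below sweeps λ and prints the trace of E(λ) for a made-up X, with P = I, Q = I, and assumed exact parameters a_e.

    import numpy as np

    def tikhonov_mse(X, P, Q, a_e, lams):
        # trace of E(lam) = V_ahat(lam) + da(lam) da(lam)', from eqs. (8) and (11)
        XtPX = X.T @ P @ X
        out = []
        for lam in lams:
            M = np.linalg.inv(XtPX + lam * Q)
            V = M @ XtPX @ M                  # propagated-error covariance, eq. (8)
            da = -lam * M @ Q @ a_e           # regularization bias, eq. (11)
            out.append(np.trace(V + np.outer(da, da)))
        return np.array(out)

    rng = np.random.default_rng(1)
    X = rng.standard_normal((30, 4))
    P, Q = np.eye(30), np.eye(4)
    a_e = np.array([1.0, 0.5, -1.0, 2.0])
    lams = np.logspace(-6, 1, 8)
    print(tikhonov_mse(X, P, Q, a_e, lams))   # trace of E(lam) for each lam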
2 Singular Value Decomposition

Consider a real matrix X that is not necessarily square, X ∈ R^{m×n}, with m > n. The rank of X is r, and r ≤ n. Let λ₁, λ₂, ..., λ_r be the positive eigenvalues of XᵀX, including multiplicity, ordered in decreasing numerical order, λ₁ ≥ λ₂ ≥ ... ≥ λ_r > 0. The singular values σᵢ of X are defined as the square roots of the eigenvalues, σᵢ = √λᵢ, i = 1, ..., r.

The singular value decomposition of a matrix X is the factorization

    X = U Σ Vᵀ ,                                                                 (12)

where U ∈ R^{m×m}, Σ ∈ R^{m×n}, and V ∈ R^{n×n}. The matrices U and V are orthonormal, UᵀU = UUᵀ = Iₘ and VᵀV = VVᵀ = Iₙ. The matrix Σ is a diagonal matrix of the singular values of X, Σ = diag(σ₁, σ₂, ..., σₙ). The singular values of X are sorted in non-increasing numerical order, σ₁ ≥ σ₂ ≥ ... ≥ σₙ ≥ 0. The number of singular values equal to zero is equal to the number of linearly dependent columns of X. If r is the rank of X then n - r singular values are equal to zero. The ratio of the maximum to the minimum singular value is called the condition number of X, c_X = σ₁/σₙ. Matrices with very large condition numbers are said to be ill-conditioned. If σₙ = 0, then c_X = ∞, and X is said to be singular and is non-invertible.

The columns of U and V are called the left and right singular vectors, U = [u₁ u₂ ... uₙ] and V = [v₁ v₂ ... vₙ]. The left singular vectors uᵢ are column vectors of dimension m and the right singular vectors vᵢ are column vectors of dimension n. If one or more singular values of X are equal to zero, r < n, and the set of right singular vectors {v_{r+1} ... vₙ} (corresponding to σ_{r+1} = ... = σₙ = 0) forms an orthonormal basis for the null-space of X. The dimension of the null space of X plus the rank of X equals n.
The singular value decomposition of X may be written as a dyadic expansion of the outer-products of singular vectors,

    X = Σᵢ₌₁ⁿ σᵢ [uᵢ vᵢᵀ] ,                                                      (13)

where each rank-1 dyad [uᵢ vᵢᵀ] has the same dimensions as X. Also note that ||uᵢ||_I = ||vᵢ||_I = 1. Therefore, the relative contribution of each term σᵢ uᵢ vᵢᵀ of the expansion to building X decreases with i.

The system of equations y = Xa may be inverted using the singular value decomposition, a = V Σ⁻¹ Uᵀ y, or

    a = Σᵢ₌₁ⁿ (1/σᵢ) vᵢ uᵢᵀ y .                                                   (14)

The singular values in the expansion which contribute least to the decomposition of X can potentially dominate the solution a. An additive perturbation Δy in y will propagate to a perturbation in the solution, Δa = V Σ⁻¹ Uᵀ Δy. The magnitude of Δa in the direction of vᵢ is equal to the dot product of uᵢ with Δy divided by σᵢ,

    vᵢᵀ Δa = (1/σᵢ) uᵢᵀ Δy .                                                     (15)

Therefore, perturbations Δy that are orthogonal to all of the left singular vectors uᵢ are not propagated to the solution. Conversely, any perturbation Δa in the direction of vᵢ contributes to Δy in the direction of uᵢ by an amount equal to σᵢ ||Δa||.
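A short numpy sketch of the SVD solution (14), using made-up, noise-free data:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.standard_normal((10, 3))
    a_true = np.array([2.0, -1.0, 0.5])
    y = X @ a_true

    # a = sum_i (1/sigma_i) v_i (u_i' y), eq. (14)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD
    a = Vt.T @ ((U.T @ y) / s)
    print(a)                                           # recovers a_true exactly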
If the rank of X is less than n (r < n), the components of a which lie in the space spanned by {v_{r+1} ... vₙ} have no contribution to y. In principle, this means that any component of y in the space spanned by the sub-set of left singular vectors {u_{r+1} ... uₘ} is an error or is noise, since it can not be obtained using the expression y = Xa for any value of a. These components of y are called noise or error because they can not be predicted by the model equations, y = Xa. In addition, these noisy components of y, which lie in the space spanned by {u_{r+1} ... uₘ}, will be magnified to an infinite degree when used to identify the model parameters a.

When X is obtained using measured data, it is almost never singular but is very often ill-conditioned. We will consider two types of ill-conditioned matrices: (i) matrices in which the first r singular values are all much larger than the last n - r singular values, and (ii) matrices in which the singular values decrease at a rate that is more or less uniform.
2.1 Singular Value Decomposition Series Truncation

If the first r singular values are much larger than the last n - r singular values (i.e., σ₁ ≥ σ₂ ≥ ... ≥ σ_r >> σ_{r+1} ≥ σ_{r+2} ≥ ... ≥ σₙ ≥ 0), then a relatively accurate representation of X may be obtained by simply retaining the first r singular values of X and the corresponding columns of U and V,

    X_(r) = U_r Σ_r V_rᵀ = Σᵢ₌₁ʳ σᵢ uᵢ vᵢᵀ ,                                       (16)

and the truncated optimal solution is

    â_(r) = V_r Σ_r⁻¹ U_rᵀ y = Σᵢ₌₁ʳ (1/σᵢ) vᵢ uᵢᵀ y ,

where V_r and U_r contain the first r columns of V and U, and where Σ_r contains the first r rows and columns of Σ.
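A numpy sketch of the truncated solution above, with an artificially ill-conditioned X; the truncation rank r = 2 and the synthetic data are illustrative assumptions.

    import numpy as np

    def tsvd_solution(X, y, r):
        # truncated-SVD solution a_(r) = V_r inv(Sigma_r) U_r' y
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        Ur, sr, Vr = U[:, :r], s[:r], Vt[:r, :].T
        return Vr @ ((Ur.T @ y) / sr)

    # third column is nearly a multiple of the first, so sigma_3 is tiny
    rng = np.random.default_rng(3)
    X = rng.standard_normal((50, 3))
    X[:, 2] = 3.0 * X[:, 0] + 1e-8 * rng.standard_normal(50)
    y = X @ np.array([1.0, 2.0, 0.0]) + 0.01 * rng.standard_normal(50)
    print(tsvd_solution(X, y, r=2))    # keeps only the two well-resolved directions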
2.2 Error Analysis of Singular Value Decomposition Series Truncation

If σ_{r+1} is close to the numerical precision of the computation, ε (ε ≈ 10⁻⁶ for single precision and ε ≈ 10⁻¹² for double precision), then the singular values σ_{r+1}, ..., σₙ and the corresponding columns of U and V contribute negligibly to X. Their contribution to the solution vector, a, can be dominated by random noise and round-off error in y.

The parameter covariance matrix is derived using equation (6), in which

    ∂â/∂y = V_r Σ_r⁻¹ U_rᵀ .                                                      (17)

Therefore the covariance matrix of the parameter estimates is

    V_â(r) = V_r Σ_r⁻¹ U_rᵀ V_y U_r Σ_r⁻¹ V_rᵀ .                                    (18)

Because the solution â_(r) does not contain components that are close to the null-space of X, the covariance matrix is limited to the range of X for which the singular values are not unacceptably small. The cost of the reduced parameter covariance matrix is an increased bias error, introduced through truncation. Assuming that we know y exactly, the corresponding exact solution a_e (computed without regularization) can be used to evaluate the truncation bias error, Δa(r) = â_(r) - a_e,

    Δa(r) = V_r Σ_r⁻¹ U_rᵀ y_e - V Σ⁻¹ Uᵀ y_e .                                     (19)

Substituting V Σ⁻¹ Uᵀ = V_r Σ_r⁻¹ U_rᵀ + V_n Σ_n⁻¹ U_nᵀ,

    Δa(r) = -[V_n Σ_n⁻¹ U_nᵀ] y_e ,                                                (20)

where V_n and U_n contain the last n - r columns of V and U, and where Σ_n contains the last n - r rows and columns of Σ. Substituting y_e = U Σ Vᵀ a_e,

    Δa(r) = -[V_n Σ_n⁻¹ U_nᵀ U Σ Vᵀ] a_e .                                          (21)

Noting that Σ_n⁻¹ U_nᵀ U Σ = [ 0_{(n-r)×r}  I_{n-r} ],

    Δa(r) = -[V_n V_nᵀ] a_e .                                                      (22)

Note that while the matrix V_nᵀ V_n equals I_{n-r}, the matrix V_n V_nᵀ is not the identity, because the summation is only over the last n - r columns of V.

The total mean squared error matrix, E(r), is the sum of the parameter covariance matrix and the truncation bias error, E(r) = V_r Σ_r⁻¹ U_rᵀ V_y U_r Σ_r⁻¹ V_rᵀ + V_n V_nᵀ a_e a_eᵀ V_n V_nᵀ. The mean squared error is the trace of E(r).

2.3 Singular Value Decomposition Tikhonov Regularization

If the singular values decrease at a rate that is more or less uniform, then selecting r for the truncated approximations above may require some subjective reasoning. Certainly any singular value that is equal to or less than the precision of the computation should be eliminated. However, if σ₁ ≈ 10¹⁵ and σₙ ≈ 10³, X would be considered ill-conditioned by most standards, even though the smallest singular value is easily resolved. As an alternative to eliminating one or more of the smallest singular values, one may simply add a small constant to all of the singular values. This can substantially improve the condition number of the system without eliminating any of the information contained in the full singular value factorization. This approach is equivalent to Tikhonov regularization.

To link the formulation of Tikhonov regularization to the singular value decomposition, it is useful to first show that the Tikhonov objective function (1) may be written

    J(ã) = ||X̃ã - ỹ||²_I + λ ||ã||²_I + C ,                                        (23)

by simply scaling and shifting X, y, and a, and with no loss of generality. In the above expression, the scalar constant C is independent of ã and does not affect the optimal solution. Defining L_P and L_Q as the Cholesky factors of P and Q, and defining X̃ = L_P X L_Q⁻¹, ỹ = L_P (y - Xā), and ã = L_Q (a - ā), then equation (1) is equivalent to equation (23), where

    C = āᵀXᵀL_PᵀL_P Xā - 2āᵀXᵀL_PᵀL_P y .                                           (24)

Setting (∂J(ã)/∂ã)ᵀ to zero results in the optimal solution

    ã(λ) = [X̃ᵀX̃ + λI]⁻¹ X̃ᵀỹ .                                                     (25)

Note that the solutions â(λ) and ã(λ) are related by the scaling

    â(λ) = L_Q⁻¹ ã(λ) + ā .                                                         (26)

In other words, the minimum value of the objective function (1) coincides with the minimum value of the objective function (23). As a simple example of the effects of scaling on optimal solutions, consider the two equivalent quadratic objective functions J(a) = 5a² - 3a + 1 and J(ã) = 20ã² - 6ã + 2, where a is scaled, a = 2ã. The optimal values are a* = 3/10 and ã* = 3/20. These optimal solutions satisfy the scaling relationship, a* = 2ã*.
The singular value decomposition of X̃ may be substituted into the least-squares solution for ã(λ),

    ã(λ) = [V ΣᵀUᵀU Σ Vᵀ + λI]⁻¹ V Σ Uᵀ ỹ
         = [V Σ² Vᵀ + λ V I Vᵀ]⁻¹ V Σ Uᵀ ỹ
         = [V (Σ² + λI) Vᵀ]⁻¹ V Σ Uᵀ ỹ
         = V (Σ² + λI)⁻¹ Vᵀ V Σ Uᵀ ỹ
         = V (Σ² + λI)⁻¹ Σ Uᵀ ỹ .                                                  (27)

The covariance of the parameter errors is largest in the direction corresponding to the maximum value of σᵢ/(σᵢ² + λ). If X̃ is singular, then as λ approaches zero, random errors propagate in a direction which is close to the null space of X̃. Note that the singular value decomposition solution to ỹ = X̃ã is ã = V Σ⁻¹ Uᵀ ỹ. Thus, Tikhonov regularization is equivalent to a singular value decomposition solution in which the inverse of each singular value, 1/σᵢ, is replaced by σᵢ/(σᵢ² + λ), or in which each singular value σᵢ is replaced by σᵢ + λ/σᵢ. Thus, the largest singular values are negligibly affected by regularization, while the effects of the smallest singular values on the solution are suppressed, as shown in Figure 1.
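The effect shown in Figure 1 can be reproduced numerically; the spread of singular values below is an assumed example.

    import numpy as np

    # Tikhonov regularization replaces the inverse singular values 1/sigma_i
    # by sigma_i/(sigma_i^2 + lam); compare the two.
    sigma = np.logspace(-2, 1, 7)
    print("1/sigma     :", np.round(1.0/sigma, 3))
    for lam in [1e-4, 1e-3, 1e-2, 1e-1]:
        print(f"lam = {lam:g} :", np.round(sigma/(sigma**2 + lam), 3))
    # the largest sigma_i are barely affected; the smallest no longer blow up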
Figure 1. Effect of regularization on singular values: σᵢ/(σᵢ² + λ) plotted against σᵢ for λ = 0, 0.0001, 0.001, 0.01, and 0.1.
2.4 Error Analysis of Singular Value Decomposition Tikhonov Regularization

The parameter covariance matrix is derived using equation (6), in which

    ∂ã/∂ỹ = V (Σ² + λI)⁻¹ Σ Uᵀ                                                     (28)

and

    V_ỹ = L_P V_y L_Pᵀ .                                                            (29)

Recall that V_y is the covariance matrix of the measurements y, and that P = V_y⁻¹ = L_PᵀL_P. Using the fact that (AB)⁻¹ = B⁻¹A⁻¹, we find that V_ỹ = I. Therefore the covariance matrix of the solution is

    V_ã(λ) = V Σ² (Σ² + λI)⁻² Vᵀ .                                                  (30)

Because ∂a/∂ã = L_Q⁻¹, the covariance matrix of the original solution is

    V_â(λ) = L_Q⁻¹ V Σ² (Σ² + λI)⁻² Vᵀ L_Q⁻ᵀ .                                        (31)

As in equation (8), we see here that increasing λ quadratically reduces the covariance of the propagated error. The cost of the reduced parameter covariance matrix is a bias error, introduced through regularization. Assuming that we know ỹ exactly, the corresponding exact solution with no regularization, ã_e, can be used to evaluate the regularization bias error, Δã(λ) = ã(λ) - ã_e,

    Δã(λ) = V Σ (Σ² + λI)⁻¹ Uᵀ ỹ_e - V Σ⁻¹ Uᵀ ỹ_e
          = [V Σ (Σ² + λI)⁻¹ Uᵀ - V Σ⁻¹ Uᵀ] ỹ_e .                                    (32)

Substituting ỹ_e = U Σ Vᵀ ã_e, and UᵀU = I,

    Δã(λ) = [V (Σ² + λI)⁻¹ Σ² Vᵀ - I] ã_e
          = [V (Σ² + λI)⁻¹ (Σ² + λI - λI) Vᵀ - I] ã_e
          = -λ V (Σ² + λI)⁻¹ Vᵀ ã_e .                                                (33)

As in equation (11), we see here that bias errors due to regularization increase with λ. In fact, the singular values participating in the bias errors increase as λ increases. If X̃ is singular, then the exact parameters ã_e can not lie in the null space of X̃, and the bias error Δã(λ) will be orthogonal to the null space of X̃.


Figure 2. Effect of regularization on bias errors: λ/(σᵢ² + λ) plotted against σᵢ for λ = 0.0001, 0.001, 0.01, and 0.1.

The total mean squared error matrix, E(λ), is the sum of the scaled parameter covariance matrix and the regularization bias error, E(λ) = V_ã(λ) + Δã(λ) Δã(λ)ᵀ. The mean squared error is the trace of E(λ).

From the singular value decomposition of X we find that XᵀX = V Σ² Vᵀ. If c is the condition number of X, then the condition number of XᵀX is c², and solving [XᵀX]a = Xᵀy can be numerically treacherous. The goal of regularization is to find a modification to [XᵀX] which improves its condition number while leaving the solution vector a relatively un-changed.

3 Tikhonov Regularization and Lagrange Multipliers

4 Numerical Examples

4.1 Numerical Example 1: A 2 by 2 singular matrix

Consider the singular system of equations y_o = X_o a,

    [  100 ]   [  1   10 ] [ a1 ]
    [ 1000 ] = [ 10  100 ] [ a2 ]                                                    (34)

The singular value decomposition of X_o is

    U = [ 0.0995037  -0.995037  ]    Σ = [ 101  0 ]    V = [ 0.0995037  -0.995037  ]
        [ 0.995037    0.0995037 ]        [   0  0 ]        [ 0.995037    0.0995037 ]

Because X_o is a singular matrix, the solution to this system is not defined. The second column of V gives the one-dimensional null-space of X_o. Note that y_o in equation (34) is normal to u₂. Vectors y_o normal to the space spanned by U_n are noise-free in the sense that no component of y_o propagates to the null space of X_o. The singular value expansion gives

    [ a1 ; a2 ] = (1/101) [ 0.0995037 ; 0.995037 ] ( [ 0.0995037  0.995037 ] [ 100 ; 1000 ] )
                + (1/σ₂)  [ -0.995037 ; 0.0995037 ] ( [ -0.995037  0.0995037 ] [ 100 ; 1000 ] )
                = [ 0.99009900 ; 9.9009900 ] ,                                        (35)

and the reason that the zero singular value σ₂ does not affect the solution is that y is orthogonal to u₂ (the second inner product is zero). It is interesting to note that despite the fact that the two equations in (34) represent the same line (infinitely many solutions), the SVD provides a unique solution. This uniqueness implies that the solution minimizes ||a||₂².
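A short numpy check of this example; the value of λ is an arbitrary small fraction of Tr(X_o)/2.

    import numpy as np

    Xo = np.array([[ 1.0,  10.0],
                   [10.0, 100.0]])
    yo = np.array([100.0, 1000.0])

    U, s, Vt = np.linalg.svd(Xo)
    print(s)                               # approximately [101, 0]: Xo is singular
    print(np.linalg.pinv(Xo) @ yo)         # minimum-norm solution [0.990099, 9.900990]

    lam = 1e-6 * np.trace(Xo) / 2          # small regularization factor
    print(np.linalg.solve(Xo + lam*np.eye(2), yo))   # essentially the same solution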
Regularization works very well for problems in which U_nᵀ y is very small. Here we seek the solution a that minimizes the quadratic objective function of equation (1). In other words, we want to find the solution to y = X_o a, while keeping the solution a close to ā. How much we care that a is close to ā is specified by the regularization parameter, λ. In general λ should be some small fraction of the average of the diagonal elements of X, λ << Tr(X)/n. Increasing λ will make the problem easier to solve numerically, but will also bias the solution. The philosophy of using the regularization parameter is something like this: let's say we have a problem which we can't solve (i.e., det(X) = 0). We will use regularization to change the problem only slightly into a problem which we can solve easily, and which has a solution which is ideally independent of the amount of the perturbation.

Because the solution â depends upon λ, we can plot a1 and a2 vs. λ and determine the effect of λ on the solution.
Figure 3. Effect of regularization on the solution of y = [X_o + λI]a for vectors y satisfying U_nᵀ y = 0: a1 and a2 plotted against -log(λ/(Tr(X_o)/2)).

From Figure 3 we see that for a broad range of values of λ the solution is â(λ) = [0.990100; 9.900990], which is very close to the SVD solution. In this problem λ can be as small as 10⁻¹⁰ Tr(X_o)/2. The solution is quite insensitive to λ for 10⁻³ > λ/(Tr(X_o)/2) > 10⁻¹²; for λ < 10⁻¹⁴ Tr(X_o)/2 the solution can not be found. Changing the problem only slightly by setting y = [100 1001]ᵀ, equation (34) represents two exactly parallel lines, and y is no longer normal to U_n. By changing one element of y by only 0.1 percent, the original problem is changed from having an infinite number of solutions to having no solution. For vectors y with components in the space spanned by U_n, the regularized solution is more sensitive to the choice of the regularization factor, λ. There is a region, 10⁻³ > λ/(Tr(X_o)/2) > 10⁻⁴ in this problem, for which a1 is relatively insensitive to λ, however there is no region in which da/dλ ≈ 0. For this type of problem, regularization of some type is necessary to find any solution, and the solution depends closely on the amount of regularization.

Figure 4. Effect of regularization on the solution of y_o + Δy = [X_o + λI]a for vectors Δy satisfying U_nᵀ(y_o + Δy) ≠ 0: a1 and a2 plotted against -log(λ/(Tr(X_o)/2)).

Now let's examine the effects of some small random perturbations in y. In this part of the example, small errors (normally distributed with a mean of zero and a standard deviation of 0.0005) are added to y, and regularized solutions are found for λ = 0.01 Tr(X_o)/2 and λ = 0.001 Tr(X_o)/2, i.e., the equations y_o + Δy = [X_o + λI]a are solved for a. Regularized solutions to these randomly perturbed problems illustrate that at a regularization factor λ ≈ 0.01 Tr(X_o)/2, the solution is relatively insensitive to the value of λ. Comparing the solutions for 100 randomly perturbed problems, we find that the variance among the solutions is less than 1. For a regularization factor of 0.001, on the other hand, we see that the variance of the solution is quite a bit larger, but that the mean value of all of the solutions is much closer to the noise-free solution. This illustrates the fact that larger values of λ decrease the propagation error but introduce a bias error. Note also that random errors in y propagate to the solution a along the null space of X_o. If no regularization is used, then a1 and a2 range from -1000 to +1000 and from -100 to 100, respectively. Increasing the regularization factor, λ, reduces the propagation of random noise in the solution at the cost of a bias error.

Figure 5. Effect of regularization on the solution of y_o + Δy = [X_o + λI]a: scatter of the regularized solutions in the (a1, a2) plane for λ = 0.01 and λ = 0.001, together with the noise-free solution and the null space of X_o.
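A Monte Carlo sketch of this perturbation study; the random seed and the looping details are arbitrary choices.

    import numpy as np

    # solve (yo + dy) = [Xo + lam*I] a for 100 random perturbations dy ~ N(0, 0.0005^2)
    rng = np.random.default_rng(5)
    Xo = np.array([[1.0, 10.0], [10.0, 100.0]])
    yo = np.array([100.0, 1000.0])
    a_free = np.linalg.pinv(Xo) @ yo                    # noise-free (minimum-norm) solution

    for frac in [0.01, 0.001]:
        lam = frac * np.trace(Xo) / 2
        sols = np.array([np.linalg.solve(Xo + lam*np.eye(2),
                                         yo + 0.0005*rng.standard_normal(2))
                         for _ in range(100)])
        print(f"lam/(Tr(Xo)/2) = {frac}:")
        print("  mean ", sols.mean(axis=0), " bias ", sols.mean(axis=0) - a_free)
        print("  std  ", sols.std(axis=0))
    # the larger lam gives smaller scatter (propagation error) but a larger bias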
4.2 Numerical Example 2: Power Polynomial Curve-fitting

4.3 Numerical Example 3: Curve Fitting to solve Finite Difference Expansions

5 Rescaling to Improve Conditioning

In many instances, the elements of y and a have dissimilar units. In such cases the conditioning of X is affected by the system of units used to describe y and a. The measurements y and the parameters a may be scaled using diagonal matrices D_y and D_a, y = D_y ȳ and a = D_a ā. So D_y ȳ = X D_a ā, and ȳ = D_y⁻¹ X D_a ā. The system matrix is now X̄ = D_y⁻¹ X D_a. If X = U Σ Vᵀ then X̄ = D_y⁻¹ U Σ Vᵀ D_a. Note that the singular vectors of X̄ are not D_y⁻¹ U and Vᵀ D_a, as these matrices are not orthonormal. Scaling matrix equations necessarily affects the singular values and the condition number of X.


Table 1. The ill-conditioned nature of power-polynomial curve-fitting: det([XᵀX]) for polynomial degrees n = 2, 4, 6, and 8, for curve-fitting intervals [a, b] = [0, 1] and [a, b] = [-1, 1].

Consider the fitting of an n-th degree polynomial to data over an interval [a, b]. This typically involves finding the pseudo-inverse of a Vandermonde matrix of the form

        [ 1   a      a²        ...  aⁿ      ]
        [ 1   a+h    (a+h)²    ...  (a+h)ⁿ  ]
        [ 1   a+2h   (a+2h)²   ...  (a+2h)ⁿ ]
    X = [ :   :      :              :       ]                                        (36)
        [ 1   b-2h   (b-2h)²   ...  (b-2h)ⁿ ]
        [ 1   b-h    (b-h)²    ...  (b-h)ⁿ  ]
        [ 1   b      b²        ...  bⁿ      ]

where the independent variable is uniformly sampled with a sample interval of h. Table 1 indicates the conditioning of [XᵀX] for various polynomial degrees and curve-fitting intervals, and illustrates that the conditioning of the Vandermonde matrix for polynomial curve-fitting depends upon the interval over which the data is to be fit. To explore this idea further, consider the Vandermonde matrix for power polynomial curve-fitting over the domain [-L, L]. The condition number of X for various polynomial degrees (n) and intervals L is shown in Figure 6 below. This figure illustrates that the curve-fit interval that minimizes the condition number of X depends upon the polynomial degree n. The minimum condition number is plotted with respect to the polynomial degree in Figure 7 below, along with a curve-fit. To minimize the condition number of the Vandermonde matrix for curve-fitting an n-th degree polynomial, the curve-fit should be carried out over the domain [-L, L], where L = 1.14 + 0.62/n. So, if we change variables before doing the curve-fit to an interval [-L, L], then our results will be more accurate, i.e., less susceptible to the errors of finite precision calculations.


Figure 6. Minimization of the Vandermonde condition number with respect to the shifting interval: condition number of the Vandermonde matrix plotted against the interval limit, for polynomial degrees n = 1 ... 10.

Figure 7. Condition numbers for scaled polynomial fitting problems: the symmetric interval limit L for curve-fitting, plotted against the polynomial degree n, with the curve-fit L = a/n + b, a = 0.62, b = 1.14.


Consider two related polynomials for the same function,

    y = Σ_{n=0}^{N} c_n xⁿ ,    a ≤ x ≤ b ,                                           (37)

    y = Σ_{n=0}^{N} d_n zⁿ ,   -L ≤ z ≤ L ,                                           (38)

with the linear scalings

    x = (1/2)(a + b) + (1/2)(a - b) z/L ,
    z = 2L/(a - b) x + (a + b)/(b - a) L = q x + r .                                  (39)

Then

    y = Σ_{n=0}^{N} d_n (q x + r)ⁿ ,    a ≤ x ≤ b .                                   (40)

Solving the least squares problem for the coefficients d_n is more accurate than solving the least squares problem for the coefficients c_n. Then, given a set of coefficients d_n, along with a and b, the coefficients c_n are determined from

    c_j = Σ_{n=j}^{N} d_n (n!/(j!(n-j)!)) (2L/(a - b))^j ((a + b)/(b - a) L)^{n-j} .   (42)
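A numpy sketch of this change of variables, using polynomial composition rather than the closed-form sum (42); the interval and the d_n values are arbitrary examples.

    import numpy as np
    from numpy.polynomial import polynomial as Pn

    # convert coefficients d_n of y = sum d_n z^n, with z = q*x + r, into
    # coefficients c_n of y = sum c_n x^n
    a, b, L = 0.0, 10.0, 1.5
    q, r = 2*L/(a - b), (a + b)/(b - a) * L        # z = q*x + r, eq. (39)
    d = np.array([1.0, -0.5, 0.25, 0.1])           # d_0 ... d_3 (made up)

    c = np.zeros_like(d)
    for n, dn in enumerate(d):
        c = Pn.polyadd(c, dn * Pn.polypow([r, q], n))   # (q*x + r)^n -> polypow([r, q], n)

    # check: both coefficient sets evaluate to the same y(x)
    x = np.linspace(a, b, 5)
    print(np.allclose(Pn.polyval(x, c), Pn.polyval(q*x + r, d)))   # True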

6 Orthogonal Polynomial Bases

A set of functions, fᵢ(x), i = 0, ..., n, is orthogonal with respect to some weighting function w(x) in an interval a ≤ x ≤ b if

    ∫_a^b fᵢ(x) w(x) fⱼ(x) dx = γᵢ > 0   if i = j ,   and   = 0   if i ≠ j .          (43)

Furthermore, a set of functions is orthonormal if γᵢ = 1. Division by √γᵢ normalizes the set of orthogonal functions, fᵢ(x). The sine and cosine functions are examples of orthogonal functions:

    ∫_{-π}^{π} cos(mx) sin(nx) dx = 0   for all m, n ,                                 (44)

    ∫_{-π}^{π} cos(mx) cos(nx) dx = 2π  (m = n = 0),   π  (m = n ≠ 0),   0  (m ≠ n),   (45)

    ∫_{-π}^{π} sin(mx) sin(nx) dx = π  (m = n ≠ 0),   0  otherwise.                    (46)


An orthogonal polynomial, f_k(x) (of degree k), has k real distinct roots within the domain of orthogonality. The k roots of f_k(x) are separated by the k - 1 roots of f_{k-1}(x). All orthogonal polynomials satisfy a recurrence relationship

    a_k f_{k+1}(x) = b_k x f_k(x) + c_k f_{k-1}(x) .                                    (47)

6.1 Examples of Orthogonal Polynomials

Legendre polynomials, P_k(x), are orthogonal over the domain [-1, 1] with respect to a unit weight,

    ∫_{-1}^{1} P_m(x) P_n(x) dx = 2/(2n+1)  (m = n),   0  (m ≠ n).                      (48)

Legendre polynomials are solutions to the ordinary differential equation

    ((1 - x²) u')' + n(n + 1) u = 0 .                                                   (49)

The recurrence relationship is

    (k + 1) P_{k+1}(x) = (2k + 1) x P_k(x) - k P_{k-1}(x) ,                              (50)

with P₀(x) = 1, P₁(x) = x, and P₂(x) = (3/2)x² - 1/2.

Hermite polynomials, H_k(x), are orthogonal with respect to the Gaussian weighting function, φ(x) = (1/√(2π)) e^{-x²/2}, over an infinite domain,

    ∫_{-∞}^{∞} H_m(x) φ(x) H_n(x) dx = n!  (m = n),   0  (m ≠ n).                        (51)

Hermite polynomials are solutions to the ordinary differential equation

    (φ(x) u')' + n φ(x) u = 0 .                                                          (52)

The recurrence relationship for Hermite polynomials is

    H_{k+1}(x) = x H_k(x) - k H_{k-1}(x) ,                                               (53)

with H₀(x) = 1, H₁(x) = x, H₂(x) = x² - 1, and H_k'(x) = k H_{k-1}(x).

Forsythe polynomials, F_k(x), are orthogonal over an arbitrary domain [a, b] with respect to an arbitrary weight, w(x). They are generated as follows:

    F₀(x) = 1                                                                            (54)
    F₁(x) = x F₀(x) - α₁ F₀(x) = x - α₁                                                  (55)

Figure 8. (a) Legendre polynomials P₀(x), ..., P₅(x); (b) Hermite polynomials H₀(x), ..., H₅(x); (c) Chebyshev polynomials T₀(x), ..., T₅(x).

Applying the orthogonality condition to F₀(x) and F₁(x),

    ∫_a^b F₀(x) w(x) F₁(x) dx = 0 ,                                                      (56)

leads to the condition

    ∫_a^b w(x) x dx = α₁ ∫_a^b w(x) dx .                                                 (57)

Higher degree polynomials are found from the recurrence relationship

    F_{k+1}(x) = x F_k(x) - α_{k+1} F_k(x) - β_{k+1} F_{k-1}(x) ,                          (58)

where

    α_{k+1} = ∫_a^b w(x) x F_k²(x) dx / ∫_a^b w(x) F_k²(x) dx ,                            (59)

    β_{k+1} = ∫_a^b w(x) x F_k(x) F_{k-1}(x) dx / ∫_a^b w(x) F_{k-1}²(x) dx .               (60)
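The recurrence (58)-(60) can be evaluated numerically. The sketch below uses a discrete inner product (sums over sample points in place of the integrals), which is how the polynomials are typically used for data fitting; the grid and the unit weight are assumptions.

    import numpy as np

    def forsythe_polynomials(x, w, K):
        # generate F_0 ... F_K on the points x with weight w, using eqs. (58)-(60)
        F = [np.ones_like(x)]                             # F_0(x) = 1
        a1 = np.sum(w * x) / np.sum(w)
        F.append(x - a1)                                  # F_1(x) = x - alpha_1
        for k in range(1, K):
            Fk, Fk1 = F[k], F[k-1]
            alpha = np.sum(w * x * Fk**2) / np.sum(w * Fk**2)
            beta  = np.sum(w * x * Fk * Fk1) / np.sum(w * Fk1**2)
            F.append(x*Fk - alpha*Fk - beta*Fk1)          # eq. (58)
        return np.array(F)

    x = np.linspace(0.0, 2.0, 201)
    w = np.ones_like(x)
    F = forsythe_polynomials(x, w, K=4)
    print(np.round(F @ (w * F[2]), 6))    # F_2 is (numerically) orthogonal to the others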

Chebyshev polynomials, T_k(x), are defined by the trigonometric expression

    T_k(x) = cos(k arccos x) .                                                            (61)

Chebyshev polynomials are orthogonal with respect to w(x) = (1 - x²)^{-1/2} over the domain [-1, 1],

    ∫_{-1}^{1} T_m(x) T_n(x) / √(1 - x²) dx = π  (m = n = 0),   π/2  (m = n ≠ 0),   0  (m ≠ n).    (62)

The discrete form of the orthogonality condition for Chebyshev polynomials has a special definition because the weighting function for Chebyshev polynomials is not defined at the end-points. Given the P real roots of T_P(x), t_p, p = 1, ..., P, the discrete orthogonality relationship for Chebyshev polynomials is

    Σ_{p=1}^{P} T_m(t_p) T_n(t_p) = P  (m = n = 0),   P/2  (m = n ≠ 0),   0  (m ≠ n),              (63)

where t_p = cos(π(p - 1/2)/P), p = 1, ..., P. Chebyshev polynomials are solutions to the ordinary differential equation

    (1 - x²) u'' - x u' + n² u = 0 .                                                      (64)

The recurrence relationship for Chebyshev polynomials is

    T_{k+1}(x) = 2 x T_k(x) - T_{k-1}(x) ,                                                (65)

with T₀(x) = 1, T₁(x) = x, and T₂(x) = 2x² - 1. Chebyshev polynomials are often associated with an equi-ripple or mini-max property. If an approximation ŷ(x) ≈ Σ_{k=0}^{N} c_k T_k(x) has an error e = y(x) - ŷ(x) that is dominated by T_{N+1}(x), then the maximum of the approximation error is roughly minimized. This desirable feature indicates that the error is approximately uniform over the domain of the approximation; the magnitude of the error is no worse in one part of the domain than in another part of the domain.
6.2 Use of Orthogonal Polynomials in Curve-Fitting

Consider the approximation error of a curve-fit operation,

    e_p = y_p - ŷ_p = y(x_p) - Σ_{k=0}^{K} c_k T_k(x_p) .                                  (66)

The approximation ŷ_p = ŷ(x_p) = Σ_{k=0}^{K} c_k T_k(x_p) may be written

    [ ŷ1 ]   [ 1   T1(x1)   T2(x1)   ...   TK(x1) ] [ c0 ]
    [ ŷ2 ]   [ 1   T1(x2)   T2(x2)   ...   TK(x2) ] [ c1 ]
    [  : ] = [ :   :        :              :      ] [ c2 ]                                 (67)
    [ ŷP ]   [ 1   T1(xP)   T2(xP)   ...   TK(xP) ] [  : ]
                                                    [ cK ]

or ŷ = T c. Minimizing the quadratic objective function J = (1/2) Σ_p e_p² leads to the normal equations

    c = [TᵀT]⁻¹ Tᵀ y ,                                                                     (68)

where the symmetric matrix [TᵀT], with elements [TᵀT]_{jk} = Σ_{p=1}^{P} T_j(x_p) T_k(x_p), is diagonal,

    [TᵀT] = diag( P, P/2, P/2, ..., P/2 ) ,                                                (69)

provided that the independent variables, x_p, are the roots of the polynomial T_P(x), x_p = cos(π(p - 1/2)/P). The discrete orthogonality of Chebyshev polynomials is exact to within machine precision, regardless of the number of terms in the summation. Because [TᵀT] may be inverted analytically, the curve-fit coefficients may be computed directly from

    c_0 = (1/P) Σ_{p=1}^{P} y(x_p) ,                                                        (70)

    c_k = (2/P) Σ_{p=1}^{P} T_k(x_p) y(x_p) ,   for k > 0 .                                  (71)

Also, note that T_k(x_p) = cos(kπ(p - 1/2)/P). The cost of the simplicity of the closed-form expression for c_k using the Chebyshev polynomial basis is the need to re-scale the independent variables x to the interval [-1, 1], and to interpolate the data y to the roots of T_P(x).
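A numpy sketch of this closed-form Chebyshev fit; the test function and the values of P and K are arbitrary.

    import numpy as np

    def cheb_fit(y_func, P, K):
        # sample y at the roots of T_P, then apply eqs. (70)-(71)
        p = np.arange(1, P + 1)
        xp = np.cos(np.pi * (p - 0.5) / P)          # roots of T_P
        yp = y_func(xp)
        k = np.arange(K + 1)
        Tkp = np.cos(np.outer(k, np.arccos(xp)))    # T_k(x_p) = cos(k*arccos(x_p))
        c = (2.0 / P) * Tkp @ yp
        c[0] /= 2.0                                 # c_0 uses 1/P rather than 2/P
        return c

    y_func = lambda x: np.exp(x) * np.sin(3*x)
    c = cheb_fit(y_func, P=32, K=10)

    # evaluate the fit on a dense grid and report the maximum error
    x = np.linspace(-1, 1, 400)
    fit = np.cos(np.outer(np.arange(len(c)), np.arccos(x))).T @ c
    print(np.max(np.abs(fit - y_func(x))))          # small, roughly uniform error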
6.3 Chebyshev approximation in multiple dimensions

In this section we investigate the functional approximation of a set of data in which the approximation provides a function of three variables. Consider the case in which you have collected data from a nonlinear static system which has three inputs and one output. You would like to develop a polynomial model of the output of the system, f, in terms of the three inputs x, y, and z. Your data is in the form of a table, corresponding to simultaneous measurements of x_t, y_t, z_t, and f_t, where the subscript t indicates the time instant, t = 1, ..., T. We would like to find the polynomial approximation

    f̂(x, y, z) = Σ_{i,j,k=0}^{I,J,K} C_{i,j,k} T_i(x') T_j(y') T_k(z') ,                    (72)

where x' is a mapping from x_min ≤ x_t ≤ x_max to -1 ≤ x' ≤ 1. A linear mapping, x' = (x - x_m)/x_a, can be carried out, where x_m = (x_max + x_min)/2 and x_a = (x_max - x_min)/2. To compute the three-dimensional matrix C, the data f_t must be interpolated to the roots of T_P, T_Q, T_R, which we will call x_p, y_q, z_r. The curve-fit coefficients are then given by

    C_{i,j,k} = 1/(||T_i||² ||T_j||² ||T_k||²) Σ_{p,q,r=1}^{P,Q,R} f(x''_p, y''_q, z''_r)
                cos(iπ(p - 1/2)/P) cos(jπ(q - 1/2)/Q) cos(kπ(r - 1/2)/R) ,                  (73)

where the norms are

    ||T_{i,j,k}||² = P, Q, R   for  i, j, k = 0 ,   and   P/2, Q/2, R/2   for  i, j, k > 0 .    (74)

The interpolated data f(x''_p, y''_q, z''_r) may be given by the L_n-norm interpolant

    f(x''_p, y''_q, z''_r) ≈ Σ_{t=1}^{T} f_t [(x''_p - x_t)ⁿ + (y''_q - y_t)ⁿ + (z''_r - z_t)ⁿ]⁻¹
                           / Σ_{t=1}^{T} [(x''_p - x_t)ⁿ + (y''_q - y_t)ⁿ + (z''_r - z_t)ⁿ]⁻¹ ,    (75)

where the polynomial roots x_p, y_q, and z_r are scaled to span the original data, x''_p = x_m + x_a x_p, y''_q = y_m + y_a y_q, and z''_r = z_m + z_a z_r. The exponent n is a positive even integer. Smaller values of n provide a smoother interpolation whereas larger values of n resolve more detail in the underlying data.
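A self-contained numpy sketch of this three-dimensional Chebyshev approximation follows. The synthetic data, the grid orders, and the interpolation exponent n = 6 are illustrative assumptions, and the accuracy of the result is limited by the scattered-data interpolation step.

    import numpy as np

    rng = np.random.default_rng(6)
    T = 500
    xt, yt, zt = rng.uniform(-2, 3, (3, T))            # scattered input data
    ft = np.sin(xt) + 0.5*yt*zt                        # "measured" outputs

    P, I, n = 8, 5, 6                                  # roots per axis, max order, exponent
    roots = np.cos(np.pi*(np.arange(1, P+1) - 0.5)/P)  # roots of T_P on [-1, 1]

    def span(v):                                       # half-width and midpoint of the data
        return (v.max() - v.min())/2, (v.max() + v.min())/2
    (xa, xm), (ya, ym), (za, zm) = span(xt), span(yt), span(zt)

    # grid of roots scaled to span the data (x'', y'', z'')
    Xg, Yg, Zg = np.meshgrid(xm + xa*roots, ym + ya*roots, zm + za*roots, indexing="ij")

    # interpolate f onto the grid, eq. (75)
    w = 1.0/((Xg[..., None]-xt)**n + (Yg[..., None]-yt)**n + (Zg[..., None]-zt)**n)
    Fg = (w*ft).sum(-1)/w.sum(-1)

    # coefficients C_{ijk}, eqs. (73)-(74)
    k = np.arange(I+1)
    Tk = np.cos(np.outer(k, np.pi*(np.arange(1, P+1) - 0.5)/P))   # T_k at the roots
    C = np.einsum("ip,jq,kr,pqr->ijk", Tk, Tk, Tk, Fg)
    norm = np.where(k == 0, P, P/2)
    C /= norm[:, None, None]*norm[None, :, None]*norm[None, None, :]

    # evaluate the approximation (72) at a test point and compare to the true value
    xp, yp, zp = 0.3, -1.0, 2.0
    tx = np.cos(k*np.arccos((xp - xm)/xa))
    ty = np.cos(k*np.arccos((yp - ym)/ya))
    tz = np.cos(k*np.arccos((zp - zm)/za))
    print(np.einsum("ijk,i,j,k->", C, tx, ty, tz), np.sin(xp) + 0.5*yp*zp)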


