CEE 690, ME 555 System Identification Duke University Fall 2013 H.P. Gavin
The regularized least squares problem is to find the parameters a that minimize the objective function

    J(a) = ||X a − y||_P + λ ||a − ā||_Q,    (1)
where the quadratic vector norm is defined here as ||x||_P = x^T P x, in which the weighting matrix P is positive definite, and the Tikhonov regularization factor λ is non-negative. If the vector y is obtained through imprecise measurements, and the measurements of each element y_i are statistically independent (as is typically the case), then P is a diagonal matrix in which each diagonal element P_ii is the inverse of the variance of the measurement error of y_i, P_ii = 1/σ²_{y_i}. If the errors in y_i are not statistically independent, then P should be the inverse of the covariance matrix V_y of the data vector y, P = V_y^{-1}. The positive definite matrix Q and the reference parameter vector ā reflect the way in which we would like to constrain the parameters. For example, we may simply want the solution a to be near some reference point ā, in which case Q = I_n. Alternatively, we may wish some linear function of the parameters, L_Q a, to be minimized, in which case Q = L_Q^T L_Q and ā = 0. Expanding the quadratic objective function,

    J(a) = a^T X^T P X a − 2 a^T X^T P y + y^T P y + λ a^T Q a − 2 λ a^T Q ā + λ ā^T Q ā.    (2)
The objective function is minimized by setting the first partial derivative of J(a) with respect to a equal to zero,

    ∂J(a)/∂a = 2 X^T P X a − 2 X^T P y + 2 λ Q a − 2 λ Q ā = 0_{n×1},    (3)

which is solved by

    a(λ) = [X^T P X + λ Q]^{-1} [X^T P y + λ Q ā].    (4)
The meaning of the notation a(λ) is that the solution a depends upon the value of the regularization factor λ. The regularization factor weights the relative importance of ||Xa − y||_P and ||a − ā||_Q. For problems in which X or X^T P X are ill-conditioned, small values of λ (i.e., small compared to the average of the diagonal elements of X^T P X) can significantly improve the conditioning of the problem.
In the special case in which P = I_m, Q = I_n, and ā = 0_{n×1}, the solution simplifies to a(λ) = [X^T X + λ I_n]^{-1} X^T y, which may be written a(λ) = X^+_(λ) y, where X^+_(λ) is called the regularized pseudo-inverse. In the more general case, in which P ≠ I_m and Q ≠ I_n, but ā = 0_{n×1},

    X^+_(λ) = [X^T P X + λ Q]^{-1} X^T P.    (5)
The dimension of X^+_(λ) is n × m. In a later section we will see that if L_Q is invertible then we can always scale and shift X, y, and a with no loss of generality.
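As a concrete illustration of equations (4) and (5), here is a minimal numerical sketch, assuming numpy; the design matrix, data, weights, and regularization level are hypothetical:

```python
import numpy as np

def tikhonov_solve(X, y, P, Q, lam, a_bar=None):
    """Regularized weighted least squares, equation (4):
    a(lam) = [X'PX + lam*Q]^-1 (X'Py + lam*Q*a_bar)."""
    n = X.shape[1]
    if a_bar is None:
        a_bar = np.zeros(n)
    A = X.T @ P @ X + lam * Q
    b = X.T @ P @ y + lam * Q @ a_bar
    return np.linalg.solve(A, b)

# illustrative test problem
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.standard_normal(20)
P = np.eye(20)          # independent, equal-variance measurements
Q = np.eye(3)           # penalize distance from a_bar = 0
print(tikhonov_solve(X, y, P, Q, lam=1e-3))
```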
1.1 Error propagation: the covariance of the solution
Recall that if f(y) is an n-dimensional vector-valued function of m correlated random variables y, with covariance matrix V_y, then the covariance matrix of f is

    [V_f]_{k,l} = Σ_{i=1}^{m} Σ_{j=1}^{m} (∂f_k/∂y_i)(∂f_l/∂y_j) [V_y]_{i,j},  or  V_f = [∂f/∂y] V_y [∂f/∂y]^T,    (6)

where [∂f/∂y] is the n × m Jacobian matrix of f with respect to y,

    [∂f/∂y]_{k,i} = ∂f_k/∂y_i.    (7)
    V_a(λ) = [∂a/∂y] V_y [∂a/∂y]^T = X^+_(λ) V_y X^+_(λ)^T,    (8)
where we use P = V_y^{-1}. This covariance matrix is sometimes called the error propagation matrix, as it indicates how random errors in y propagate to the solution a. Note that in the special case of no regularization (λ = 0),

    V_a(0) = [X^T P X]^{-1}.    (9)
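The error propagation expressed by equations (8) and (9) can be verified by simulation; a sketch assuming numpy, with a hypothetical design matrix and noise level:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 3))
sig = 0.1
P = np.eye(30) / sig**2                     # P = inverse data covariance
Xp = np.linalg.solve(X.T @ P @ X, X.T @ P)  # pseudo-inverse, lam = 0
V_a = np.linalg.inv(X.T @ P @ X)            # equation (9)

# Monte Carlo check of the error propagation
a_true = np.array([1.0, -2.0, 0.5])
samples = np.array([Xp @ (X @ a_true + sig * rng.standard_normal(30))
                    for _ in range(20000)])
print(np.abs(np.cov(samples.T) - V_a).max())  # small sampling error
```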
Substituting V_y = P^{-1} into equation (8), the parameter covariance for λ > 0 is

    V_a(λ) = [X^T P X + λ Q]^{-1} X^T P X [X^T P X + λ Q]^{-1}.    (10)

Assuming the measurements are exact, y_e = X a_e, regularization biases the solution away from the exact parameters a_e by

    Δa(λ) = a(λ) − a_e = λ [X^T P X + λ Q]^{-1} Q (ā − a_e),    (11)

a bias error that grows with λ.
2 Singular Value Decomposition

The singular value decomposition of X,

    X = U Σ V^T,    (12)

may be expanded as a sum of rank-one matrices,

    X = Σ_{i=1}^{n} σ_i [u_i v_i^T],    (13)

where each rank-1 dyad [u_i v_i^T] has the same dimensions as X. Also note that ||u_i||_I = ||v_i||_I = 1. Therefore, the relative contribution of each term σ_i u_i v_i^T of the expansion to building X decreases with i.
The system of equations y = X a may be inverted using the singular value decomposition, a = V Σ^{-1} U^T y, or

    a = Σ_{i=1}^{n} (1/σ_i) v_i u_i^T y.    (14)
The singular values in the expansion which contribute least to the decomposition of X can potentially dominate the solution a. An additive perturbation Δy in y will propagate to a perturbation in the solution, Δa = V Σ^{-1} U^T Δy. The magnitude of Δa in the direction of v_i is equal to the dot product of u_i with Δy divided by σ_i,

    v_i^T Δa = (1/σ_i) u_i^T Δy.    (15)

Therefore, perturbations Δy that are orthogonal to all of the left singular vectors u_i are not propagated to the solution. Conversely, any perturbation Δy with a component along a left singular vector u_i associated with a small singular value σ_i is amplified in the solution by the factor 1/σ_i.
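The amplification described by equation (15) is easy to observe; a sketch assuming numpy, with a contrived nearly-singular matrix:

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 1.0001]])        # nearly singular: sigma_2 << sigma_1
U, s, Vt = np.linalg.svd(X)
y = np.array([2.0, 2.0001])

a = Vt.T @ np.diag(1.0 / s) @ U.T @ y      # a = V S^-1 U' y, equation (14)

eps = 1e-6
for direction in (U[:, 0], U[:, 1]):       # perturb along u_1, then u_2
    da = Vt.T @ np.diag(1.0 / s) @ U.T @ (eps * direction)
    print(np.linalg.norm(da) / eps)        # amplification equals 1/sigma_i
```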
If the first r singular values are much larger than the last n − r singular values (i.e., σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_r ≫ σ_{r+1} ≥ ⋯ ≥ σ_n ≥ 0), then a relatively accurate representation of X may be obtained by simply retaining the first r singular values of X and the corresponding columns of U and V,

    X_(r) = U_r Σ_r V_r^T = Σ_{i=1}^{r} σ_i u_i v_i^T,    (16)

and the associated truncated solution is

    a_(r) = V_r Σ_r^{-1} U_r^T y = Σ_{i=1}^{r} (1/σ_i) v_i u_i^T y,    (17)

where V_r and U_r contain the first r columns of V and U, and where Σ_r contains the first r rows and columns of Σ.
2.2 Covariance and bias of the truncated solution
The covariance of the truncated solution, following equation (6), is

    V_a(r) = V_r Σ_r^{-1} U_r^T V_y U_r Σ_r^{-1} V_r^T.    (18)
Because the solution a_(r) does not contain components that are close to the null-space of X, the covariance matrix is limited to the range of X for which the singular values are not unacceptably small. The cost of the reduced parameter covariance matrix is an increased bias error, introduced through truncation. Assuming that we know y exactly, the corresponding exact solution a_e (computed without regularization) can be used to evaluate the truncation bias error, Δa_(r) = a_(r) − a_e:
    Δa_(r) = V_r Σ_r^{-1} U_r^T y_e − V Σ^{-1} U^T y_e.    (19)

Substituting y_e = U Σ V^T a_e, and noting that U_r^T U = [ I_r  0_{r×(n−r)} ] and U_n^T U = [ 0_{(n−r)×r}  I_{n−r} ],

    Δa_(r) = V_r V_r^T a_e − V V^T a_e    (20)
           = −Σ_{i=r+1}^{n} v_i v_i^T a_e    (21)
           = −V_n V_n^T a_e,    (22)

where V_n and U_n contain the last n − r columns of V and U. Note that while the matrix V_n^T V_n equals I_{n−r}, the matrix V_n V_n^T is not the identity because the summation is only over the last n − r columns of V.
The total mean squared error matrix E_(r) is the sum of the parameter covariance matrix and the truncation bias error,

    E_(r) = V_r Σ_r^{-1} U_r^T V_y U_r Σ_r^{-1} V_r^T + V_n V_n^T a_e a_e^T V_n V_n^T.

The mean squared error is the trace of E_(r).
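The bias-variance trade-off in the choice of r can be tabulated directly from these formulas; a sketch assuming numpy, with illustrative singular values, noise level, and exact parameters (which in practice are unknown):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 4)) @ np.diag([10.0, 5.0, 1.0, 1e-4])
a_e = np.array([1.0, 1.0, 1.0, 1.0])
Vy = 0.05**2 * np.eye(40)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
for r in range(1, 5):
    Vr, Ur, sr = V[:, :r], U[:, :r], s[:r]
    cov  = Vr @ np.diag(1/sr) @ Ur.T @ Vy @ Ur @ np.diag(1/sr) @ Vr.T
    Vn   = V[:, r:]
    bias = Vn @ Vn.T @ a_e                  # equation (22), up to sign
    E = cov + np.outer(bias, bias)          # mean squared error matrix
    print(r, np.trace(E))                   # MSE versus truncation level r
```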
2.3 Scaling the regularized least squares problem

If P = L_P^T L_P and Q = L_Q^T L_Q, then with the scaled and shifted variables X̃ = L_P X L_Q^{-1}, ỹ = L_P (y − X ā), and ã = L_Q (a − ā), the objective function (1) becomes

    J(ã) = (X̃ ã − ỹ)^T (X̃ ã − ỹ) + λ ã^T ã,    (23)

with gradient

    ∂J(ã)/∂ã = 2 X̃^T X̃ ã − 2 X̃^T ỹ + 2 λ ã.    (24)

Setting (∂J(ã)/∂ã)^T to zero results in the optimal solution

    ã(λ) = [X̃^T X̃ + λ I]^{-1} X̃^T ỹ,    (25)

and the solution in the original coordinates is recovered from

    a(λ) = ā + L_Q^{-1} ã(λ).    (26)
In other words, the minimum value of the objective function (1) coincides with the minimum value of the objective function (23). As a simple example of the effects of scaling on optimal solutions, consider the two equivalent quadratic objective functions J(a) = 5a² − 3a + 1 and J(ã) = 20ã² − 6ã + 1, where a is scaled, a = 2ã. The optimal values are a* = 3/10 and ã* = 3/20. These optimal solutions satisfy the scaling relationship a* = 2ã*.
The singular value decomposition of X̃ = U Σ V^T may be substituted into the least-squares solution for ã(λ),

    ã(λ) = [V Σ U^T U Σ V^T + λ I]^{-1} V Σ U^T ỹ
         = [V Σ² V^T + V λI V^T]^{-1} V Σ U^T ỹ
         = [V (Σ² + λI) V^T]^{-1} V Σ U^T ỹ
         = V (Σ² + λI)^{-1} V^T V Σ U^T ỹ
         = V (Σ² + λI)^{-1} Σ U^T ỹ.    (27)
The covariance of the parameter errors is largest in the direction corresponding to the maximum value of σ_i/(σ_i² + λ). If X̃ is singular, then as λ approaches zero, random errors propagate in a direction which is close to the null space of X̃. Note that the singular value decomposition solution to ỹ = X̃ ã is ã = V Σ^{-1} U^T ỹ. Thus, Tikhonov regularization is equivalent to a singular value decomposition solution in which the inverse of each singular value, 1/σ_i, is replaced by σ_i/(σ_i² + λ), or in which each singular value σ_i is replaced by σ_i + λ/σ_i. Thus, the largest singular values are negligibly affected by regularization, while the effects of the smallest singular values on the solution are suppressed, as shown in Figure 1.
[Figure 1: the regularized inverse singular values σ_i/(σ_i² + λ) versus σ_i, for λ = 0, 0.0001, 0.001, 0.01, and 0.1.]
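The equivalence between Tikhonov regularization and this filtered SVD inversion is a one-liner to check numerically; a sketch assuming numpy, with an arbitrary test matrix standing in for the scaled X̃:

```python
import numpy as np

rng = np.random.default_rng(3)
Xt = rng.standard_normal((15, 4))          # stands in for X-tilde
yt = rng.standard_normal(15)
lam = 1e-2

U, s, Vt = np.linalg.svd(Xt, full_matrices=False)
a_tik = np.linalg.solve(Xt.T @ Xt + lam * np.eye(4), Xt.T @ yt)
a_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ yt))   # equation (27)
print(np.allclose(a_tik, a_svd))                   # True
```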
2.4 Covariance and bias of the regularized solution
The sensitivity of the scaled solution (27) to the scaled data is

    ∂ã(λ)/∂ỹ = V (Σ² + λI)^{-1} Σ U^T,    (28)

and the covariance of the scaled data is

    Ṽ_y = L_P V_y L_P^T.    (29)

Recall that V_y is the covariance matrix of the measurements y, and that P = V_y^{-1} = L_P^T L_P. Using the fact that (AB)^{-1} = B^{-1} A^{-1}, we find that Ṽ_y = I. Therefore the covariance matrix of the scaled solution is

    V_ã(λ) = V Σ² (Σ² + λI)^{-2} V^T.    (30)
Because ∂a/∂ã = L_Q^{-1}, the covariance matrix of the original solution is

    V_a(λ) = L_Q^{-1} V Σ² (Σ² + λI)^{-2} V^T L_Q^{-T}.    (31)
The bias error of the regularized solution, evaluated with exact data ỹ_e, is

    Δã(λ) = [V (Σ² + λI)^{-1} Σ U^T − V Σ^{-1} U^T] ỹ_e.    (32)

Substituting ỹ_e = U Σ V^T ã_e, and U^T U = I,

    Δã(λ) = [V (Σ² + λI)^{-1} Σ² V^T − I] ã_e
          = [V (Σ² + λI)^{-1} (Σ² + λI − λI) V^T − I] ã_e
          = −λ V (Σ² + λI)^{-1} V^T ã_e.    (33)
As in equation (11), we see here that bias errors due to regularization increase with λ. In fact, the singular values participating in the bias errors increase as λ increases. If X̃ is singular, then the exact parameters ã_e can not lie in the null space of X̃, and the bias error Δã(λ) will be orthogonal to the null space of X̃.
[Figure 2: the bias factors λ/(σ_i² + λ) versus σ_i, for λ = 0.0001, 0.001, 0.01, and 0.1.]
The total mean squared error matrix E_(λ) is the sum of the scaled parameter covariance matrix and the regularization bias error,

    E_(λ) = V_ã(λ) + Δã(λ) Δã(λ)^T.

The mean squared error is the trace of E_(λ).
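Because both V_ã(λ) and Δã(λ) have closed forms in the SVD, the trace of E_(λ) can be scanned over λ to locate a good regularization level; a sketch assuming numpy, with illustrative problem data (and ã_e, which would be unknown in practice):

```python
import numpy as np

rng = np.random.default_rng(4)
Xt = rng.standard_normal((25, 3)) @ np.diag([5.0, 1.0, 1e-3])
a_e = np.ones(3)
U, s, Vt = np.linalg.svd(Xt, full_matrices=False)
V = Vt.T

for lam in [1e-8, 1e-6, 1e-4, 1e-2, 1.0]:
    cov  = V @ np.diag(s**2 / (s**2 + lam)**2) @ V.T           # eq (30)
    bias = -lam * V @ np.diag(1.0 / (s**2 + lam)) @ V.T @ a_e  # eq (33)
    mse = np.trace(cov + np.outer(bias, bias))
    print(f"lam = {lam:8.1e}   trace E = {mse:.4g}")
```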
From the singular value decomposition of X we find that X^T X = V Σ² V^T. If c is the condition number of X, then the condition number of X^T X is c², and solving [X^T X] a = X^T y can be numerically treacherous. The goal of regularization is to find a modification to [X^T X] which improves its condition number while leaving the solution vector a relatively unchanged.
4 Numerical Examples

4.1 A singular system of equations

Consider the singular system of equations y = X_o a,

    [100; 1000] = [1, 10; 10, 100] [a_1; a_2].    (34)
The singular value decomposition of X_o is

    U = V = [0.0995037, 0.995037; 0.995037, −0.0995037],   Σ = [101, 0; 0, 0],

and the expansion (14) gives

    [a_1; a_2] = [0.0995037; 0.995037] (1/101) ([0.0995037, 0.995037][100; 1000])
               + [0.995037; −0.0995037] (1/σ_2) ([0.995037, −0.0995037][100; 1000])
               = [0.99009900; 9.9009900],    (35)
and the reason that the zero singular value does not affect the solution is that y is orthogonal to u_2. It is interesting to note that despite the fact that the two equations in (34) represent the same line (infinitely many solutions) the SVD provides a unique solution. This uniqueness implies that the solution minimizes ||a||_2².
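This example is easy to reproduce; a sketch assuming numpy, in which np.linalg.pinv applies exactly the truncated-SVD inversion described above:

```python
import numpy as np

Xo = np.array([[1.0, 10.0],
               [10.0, 100.0]])
y = np.array([100.0, 1000.0])

U, s, Vt = np.linalg.svd(Xo)
print(s)                          # [101, ~0]: Xo is singular
print(U[:, 1] @ y)                # ~0: y is orthogonal to u2
print(np.linalg.pinv(Xo) @ y)     # [0.990099, 9.900990], equation (35)
```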
Regularization works very well for problems in which U_n^T y is very small. Here we seek the solution a that minimizes the quadratic objective function of equation (1). In other words, we want to find the solution to y = X_o a, while keeping the solution a close to ā. How much we care that a is close to ā is expressed by the magnitude of the regularization factor λ, which converts the singular problem into a problem which we can solve easily, and which has a solution which is ideally independent of the amount of the perturbation. Because the solution a depends upon λ, we can plot a_1 and a_2 vs. λ and determine the effect of λ on the solution.
[Plots of a_1 and a_2 versus −log(λ/(Tr(X_o)/2)).]
Figure 3. Effect of regularization on the solution of y = [X_o + λI]a for vectors y satisfying U_n^T y = 0.
From Figure 3 we see that for a broad range of values of λ the solution is a(λ) = [0.990100; 9.900990], which is very close to the SVD solution. In this problem λ can be as small as 10^{-10} Tr(X_o)/2. The solution is quite insensitive to λ for 10^{-3} > λ/(Tr(X_o)/2) > 10^{-12}; for λ < 10^{-14} Tr(X_o)/2 the solution cannot be found. Changing the problem only slightly by setting y = [100 1001]^T, equation (34) represents two exactly parallel lines, and y is no longer normal to U_n. By changing one element of y by only 0.1 percent, the original problem is changed from having an infinite number of solutions to having no solution.
For vectors y with components in the space spanned by U_n, the regularized solution is more sensitive to the choice of the regularization factor λ. There is a region, 10^{-3} > λ/(Tr(X_o)/2) > 10^{-4} in this problem, for which a_1 is relatively insensitive to λ; however, there is no region in which da/dλ ≈ 0. For this type of problem, regularization of some type is necessary to find any solution, and the solution depends closely on the amount of regularization.

[Plots of a_1 and a_2 versus −log(λ/(Tr(X_o)/2)).]
Figure 4. Effect of regularization on the solution of y_o + Δy = [X_o + λI]a for vectors Δy satisfying U_n^T (y_o + Δy) ≠ 0.
[Figure 5: paths of the regularized solution in the (a_1, a_2) plane for λ = 0.001 and λ = 0.01, showing the noise-free solution and the null space of X_o.]
4.3 Polynomial curve-fitting and the Vandermonde matrix
Fitting a polynomial of degree n to data sampled at equally spaced points x = a, a+h, a+2h, …, b−h, b involves the Vandermonde matrix

    X = [ 1,  a,     a²,       …,  a^n;
          1,  a+h,   (a+h)²,   …,  (a+h)^n;
          1,  a+2h,  (a+2h)²,  …,  (a+2h)^n;
          ⋮
          1,  b−2h,  (b−2h)²,  …,  (b−2h)^n;
          1,  b−h,   (b−h)²,   …,  (b−h)^n;
          1,  b,     b²,       …,  b^n ]    (36)
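The conditioning of this matrix degrades rapidly with the polynomial degree, which is easy to demonstrate; a sketch assuming numpy, with an arbitrary interval and sample count:

```python
import numpy as np

for n in range(2, 12, 3):
    x = np.linspace(0.0, 2.0, 20)             # a = 0, b = 2, m = 20
    X = np.vander(x, n + 1, increasing=True)  # columns 1, x, x^2, ..., x^n
    print(n, np.linalg.cond(X))               # condition number grows with n
```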
[Plot of the Vandermonde condition number versus the interval limit.]
Figure 6. Minimization of the Vandermonde condition number with respect to the shifting interval.

[Figure 7: the interval limit, L, minimizing the condition number, versus the degree of the polynomial, n, with the fitted curve L = a/n + b, a = 0.62, b = 1.14 ("cond.data").]
A polynomial curve-fit over the interval a ≤ x ≤ b,

    y = Σ_{n=0}^{N} c_n x^n,  a ≤ x ≤ b,    (37)

may be expressed equivalently in terms of a shifted and scaled independent variable z,

    y = Σ_{n=0}^{N} d_n z^n,  −L ≤ z ≤ L.    (38)

If

    x = ((b − a)/(2L)) z + (a + b)/2,    (39)

then

    y = Σ_{n=0}^{N} d_n (q x + r)^n,    (40)

    −L ≤ q x + r ≤ L,    (41)

where q = 2L/(b − a) and r = −L(a + b)/(b − a).
Solving the least squares problem for the coefficients d_n is more accurate than solving the least squares problem for the coefficients c_n. Then, given a set of coefficients d_n, along with a and b, the coefficients c_n are determined from

    c_j = Σ_{n=j}^{N} d_n ( n! / (j! (n − j)!) ) (2L/(b − a))^j (−L(a + b)/(b − a))^{n−j}.    (42)
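A sketch of this two-step fit, assuming numpy; the data, interval, and degree are illustrative, and the binomial-expansion form of equation (42) is used for the conversion:

```python
import numpy as np
from math import comb

a, b, L, N = 0.0, 5.0, 1.0, 4
x = np.linspace(a, b, 50)
y = np.exp(-x) * np.sin(2 * x)             # illustrative data

q, r = 2 * L / (b - a), -L * (a + b) / (b - a)
z = q * x + r                              # z spans [-L, L]
d = np.linalg.lstsq(np.vander(z, N + 1, increasing=True), y, rcond=None)[0]

# convert d_n to the coefficients c_j of the unscaled polynomial, eq (42)
c = np.array([sum(d[n] * comb(n, j) * q**j * r**(n - j)
                  for n in range(j, N + 1)) for j in range(N + 1)])
print(np.allclose(np.vander(x, N + 1, increasing=True) @ c,
                  np.vander(z, N + 1, increasing=True) @ d))  # True
```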
A set of functions f_i(x) is orthogonal over a domain if

    ∫ f_i(x) f_j(x) dx = { γ_i > 0,  i = j;  0,  i ≠ j }.    (43)

Furthermore, a set of functions is orthonormal if γ_i = 1. Division by √γ_i normalizes the set of orthogonal functions f_i(x). The sine and cosine functions are examples of orthogonal functions:
    ∫_{−π}^{π} cos mx sin nx dx = 0  for all m, n    (44)

    ∫_{−π}^{π} cos mx cos nx dx = { 2π,  m = n = 0;  π,  m = n ≠ 0;  0,  m ≠ n }    (45)

    ∫_{−π}^{π} sin mx sin nx dx = { π,  m = n ≠ 0;  0,  otherwise }    (46)
An orthogonal polynomial f_k(x) (of degree k) has k real distinct roots within the domain of orthogonality. The k roots of f_k(x) are separated by the k − 1 roots of f_{k−1}(x). All orthogonal polynomials satisfy a recurrence relationship,

    f_{k+1}(x) = (x − α_{k+1}) f_k(x) − β_{k+1} f_{k−1}(x).    (47)
Legendre polynomials P_k(x) are orthogonal over the domain [−1, 1] with respect to a unit weight,

    ∫_{−1}^{1} P_m(x) P_n(x) dx = { 2/(2n+1),  m = n;  0,  m ≠ n }.    (48)

Hermite polynomials H_k(x) are orthogonal over (−∞, ∞) with respect to the weight e^{−x²},

    ∫_{−∞}^{∞} e^{−x²} H_m(x) H_n(x) dx = { 2^n n! √π,  m = n;  0,  m ≠ n }.    (51)
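The continuous orthogonality relationship (48) can be spot-checked with Gauss-Legendre quadrature, which is exact for these polynomial integrands; a sketch assuming numpy:

```python
import numpy as np
from numpy.polynomial import legendre

xg, wg = legendre.leggauss(12)     # 12-point Gauss-Legendre rule
P = [legendre.Legendre.basis(k) for k in range(4)]

for m in range(4):
    for n in range(4):
        integral = np.sum(wg * P[m](xg) * P[n](xg))       # exact quadrature
        expected = 2.0 / (2 * n + 1) if m == n else 0.0   # equation (48)
        print(m, n, round(integral, 12), expected)
```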
Figure 8. (a) Legendre polynomials P_0(x), . . . , P_5(x); (b) Hermite polynomials H_0(x), . . . , H_5(x); (c) Chebyshev polynomials T_0(x), . . . , T_5(x).
Orthogonal polynomials F_k(x) for an arbitrary positive weight w(x) over [a, b] may be built starting from

    F_0(x) = 1,  F_1(x) = x − α_1,    (56)

where

    α_1 = ∫_a^b w(x) x dx / ∫_a^b w(x) dx,    (57)

and continuing with the recurrence

    F_{k+1}(x) = (x − α_{k+1}) F_k(x) − β_{k+1} F_{k−1}(x),    (58)

where

    α_{k+1} = ∫_a^b w(x) x F_k²(x) dx / ∫_a^b w(x) F_k²(x) dx,    (59)

    β_{k+1} = ∫_a^b w(x) x F_k(x) F_{k−1}(x) dx / ∫_a^b w(x) F_{k−1}²(x) dx.    (60)
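A compact implementation of the recurrence (56)-(60), assuming numpy; the integrals are approximated on a dense grid rather than evaluated exactly:

```python
import numpy as np

def orthopoly_table(w, x, kmax):
    """Values of F_0..F_kmax on grid x, for weight w(x), via eqs (56)-(60)."""
    dx = x[1] - x[0]
    ip = lambda u, v: np.sum(w * u * v) * dx      # inner product <u, v>_w
    F = [np.ones_like(x)]
    alpha = ip(x, F[0]) / ip(F[0], F[0])          # equation (57)
    F.append(x - alpha)
    for k in range(1, kmax):
        alpha = ip(x * F[k], F[k]) / ip(F[k], F[k])              # eq (59)
        beta  = ip(x * F[k], F[k - 1]) / ip(F[k - 1], F[k - 1])  # eq (60)
        F.append((x - alpha) * F[k] - beta * F[k - 1])           # eq (58)
    return F

x = np.linspace(-1, 1, 4001)
F = orthopoly_table(np.ones_like(x), x, 5)   # unit weight: Legendre-like
print(np.sum(F[2] * F[3]) * (x[1] - x[0]))   # approximately 0
```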
Chebyshev polynomials T_k(x) are orthogonal over [−1, 1] with respect to the weight 1/√(1 − x²),

    ∫_{−1}^{1} T_m(x) T_n(x) / √(1 − x²) dx = { π,  m = n = 0;  π/2,  m = n ≠ 0;  0,  m ≠ n }.    (62)
The discrete form of the orthogonality condition for Chebyshev polynomials has a special definition because the weighting function for Chebyshev polynomials is not defined at the end-points. Given the P real roots of T_P(x), t_p, p = 1, …, P, the discrete orthogonality relationship for Chebyshev polynomials is

    Σ_{p=1}^{P} T_m(t_p) T_n(t_p) = { P,  m = n = 0;  P/2,  m = n ≠ 0;  0,  m ≠ n },    (63)

where

    t_p = cos((p − 1/2) π / P).    (64)
A curve-fit of data y_p observed at the P points x_p in terms of a Chebyshev polynomial basis,

    y(x) ≈ Σ_{k=0}^{K} c_k T_k(x),    (65)

evaluated at each data point with equation error e_p,

    y_p = Σ_{k=0}^{K} c_k T_k(x_p) + e_p,    (66)

may be written

    [y_1; y_2; ⋮; y_P] = [ 1, T_1(x_1), T_2(x_1), …, T_K(x_1);
                           1, T_1(x_2), T_2(x_2), …, T_K(x_2);
                           ⋮
                           1, T_1(x_P), T_2(x_P), …, T_K(x_P) ] [c_0; c_1; c_2; ⋮; c_K].    (67)

Minimizing the sum of squared equation errors Σ_{p=1}^{P} e_p² leads to the normal equations

    [T^T T] c = T^T y,    (68)

in which, by the discrete orthogonality relationship (63), the matrix [T^T T] is diagonal,

    T^T T = [ P, 0, 0, …, 0;  0, P/2, 0, …, 0;  ⋮;  0, 0, 0, …, P/2 ],    (69)
provided that the independent variables, x_p, are the roots of the polynomial T_P(x), x_p = cos((p − 1/2) π / P). The discrete orthogonality of Chebyshev polynomials is exact to within machine precision, regardless of the number of terms in the summation. Because [T^T T] may be inverted analytically, the curve-fit coefficients may be computed directly from

    c_0 = (1/P) Σ_{p=1}^{P} y(x_p),    (70)

    c_k = (2/P) Σ_{p=1}^{P} T_k(x_p) y(x_p),  for k > 0.    (71)

Also, note that T_k(x_p) = cos(k (p − 1/2) π / P). The cost of the simplicity of the closed-form expression for c_k using the Chebyshev polynomial basis is the need to re-scale the independent variables x to the interval [−1, 1], and to interpolate the data y to the roots of T_P(x).
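Equations (70) and (71) translate to a few lines of code; a sketch assuming numpy, with an illustrative function sampled at the Chebyshev roots:

```python
import numpy as np

P, K = 32, 8
p = np.arange(1, P + 1)
xp = np.cos((p - 0.5) * np.pi / P)           # roots of T_P(x), equation (64)
yp = np.exp(xp)                              # illustrative data at the roots

k = np.arange(K + 1)
Tkx = np.cos(np.outer(k, np.arccos(xp)))     # T_k(x_p) = cos(k(p-1/2)pi/P)
c = (2.0 / P) * Tkx @ yp                     # equation (71)
c[0] /= 2.0                                  # equation (70): c_0 uses 1/P

x = 0.3
print(np.polynomial.chebyshev.chebval(x, c), np.exp(x))  # close agreement
```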
6.3 Multi-dimensional Chebyshev curve-fitting
A function of three variables, evaluated on coordinates scaled to [−1, 1], may be fit with a triple Chebyshev series,

    f(x′, y′, z′) ≈ Σ_{i,j,k=0}^{I,J,K} C_{i,j,k} T_i(x′) T_j(y′) T_k(z′),    (72)

with coefficients computed directly from the discrete orthogonality relationship,

    C_{i,j,k} = (1 / (||T_i||² ||T_j||² ||T_k||²)) Σ_{p,q,r=1}^{P,Q,R} f(x″_p, y″_q, z″_r) cos(i (p − 1/2) π / P) cos(j (q − 1/2) π / Q) cos(k (r − 1/2) π / R),    (73)

where

    ||T_{i,j,k}||² = { P, Q, R,  for i, j, k = 0;  P/2, Q/2, R/2,  for i, j, k > 0 }.    (74)
The interpolated data f(x″_p, y″_q, z″_r) may be given by the L_n norm interpolant

    f(x″_p, y″_q, z″_r) = ( Σ_{t=1}^{T} f_t(x_t, y_t, z_t) [(x″_p − x_t)^n + (y″_q − y_t)^n + (z″_r − z_t)^n]^{-1} ) / ( Σ_{t=1}^{T} [(x″_p − x_t)^n + (y″_q − y_t)^n + (z″_r − z_t)^n]^{-1} ),    (75)

where the polynomial roots x_p, y_q, and z_r are scaled to span the original data, x″_p = x_m + x_a x_p, y″_q = y_m + y_a y_q, and z″_r = z_m + z_a z_r. The exponent n is a positive even integer. Smaller values of n provide a smoother interpolation whereas larger values of n resolve more detail in the underlying data.
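A sketch of the interpolant (75) at a single query point, assuming numpy; the scattered data are illustrative, and a coincident query point is returned exactly to avoid division by zero:

```python
import numpy as np

def ln_interp(xq, yq, zq, xt, yt, zt, ft, n=2):
    """Inverse-distance (L_n) interpolant of equation (75), one query point."""
    d = (xq - xt)**n + (yq - yt)**n + (zq - zt)**n
    if np.any(d == 0):                  # query coincides with a data point
        return ft[np.argmin(d)]
    w = 1.0 / d
    return np.sum(ft * w) / np.sum(w)

rng = np.random.default_rng(5)
xt, yt, zt = rng.uniform(-1, 1, (3, 100))
ft = xt + yt * zt                        # illustrative underlying function
print(ln_interp(0.1, 0.2, 0.3, xt, yt, zt, ft, n=2))
```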