
INMA 2875: System Identification
Julien Hendrickx, UCL
December 4, 2018
Contents

1 Preliminaries  9
  1.1 Discrete time dynamical systems  9
    1.1.1 General definitions and properties  9
    1.1.2 LTI and impulse response  10
    1.1.3 Stability  11
    1.1.4 The Z transform  11
    1.1.5 Finite sequences  13
    1.1.6 Shift operator q  14
  1.2 Stochastic processes  17
    1.2.1 Stationary processes  17
    1.2.2 Spectral density  19
    1.2.3 White noise and factorization of random processes  20
    1.2.4 Quasi-Stationarity  22
    1.2.5 Ergodicity  25
    1.2.6 Alternative to stochastic processes  26
  1.3 Estimators  27
  Appendices  29
    1.A Proof of Theorem 1.1  29

2 Non-parametric identification  31
  2.1 Time domain analysis  32
    2.1.1 Impulse response g(t)  32
    2.1.2 Step response  34
    2.1.3 Correlation analysis  35
  2.2 Frequency domain analysis  38
    2.2.1 Single frequency  38
    2.2.2 Fourier Analysis  43
    2.2.3 Spectral Analysis  47
    2.2.4 Correlation approach to Fourier Analysis  50
  Appendices  55
    2.A Leakage  55

3 Regressions and Instrumental variables  59
  3.1 Introduction and Parametric Identification  59
  3.2 Linear Regressions  60
  3.3 Instrumental variables  68

4 Classes of parametric models for LTI systems  77
  4.1 Introduction  77
  4.2 Classes of models with rational transfer functions  78
  4.3 Prediction  80
  4.4 Parametrization and Identifiability  83

5 Input Signals  87
  5.1 Information content of the signals  87
  5.2 Persistence of excitation  95
  5.3 Some typical input signals  99
    5.3.1 Impulse and steps  99
    5.3.2 Sum of sines - periodic signals  100
    5.3.3 White noise  100
    5.3.4 Filtered white noise  101

6 General parametric methods  103
  6.1 What is an identification method?  103
  6.2 Important classes of methods  104
  6.3 Prediction error methods  106
  6.4 Statistical methods  107
    6.4.1 Maximum likelihood method  107
    6.4.2 Maximum a posteriori method  109

7 Consistency of Estimators  113
  7.1 Least Square prediction error method  114
    7.1.1 A first example  114
    7.1.2 General consistency result  115
    7.1.3 Partial consistency: recovering only G  119
    7.1.4 Application to regressions  120
  7.2 General prediction error methods  122
  7.3 Statistical Methods  122
  7.4 Correlation methods  123

8 Variance of Estimators  125
  8.1 Constant filter  125
  8.2 FIR filter  127
  8.3 General least square prediction error method  129
    8.3.1 The estimator as a random variable  130
    8.3.2 Asymptotic expression of the variance  132
    8.3.3 Simple Application  133
  8.4 Other Norms  134
  8.5 Cramer-Rao bound and optimal norms  135
  8.6 Correlation Approach  136
  8.7 Using the variance estimate  136
  Appendices  139
    8.A Efficiency of the Maximum Likelihood method  139

9 Closed Loop Identification  141
  9.1 Introduction  141
    9.1.1 Model  141
    9.1.2 Examples of problems  142
    9.1.3 Summary of good news and bad news  143
    9.1.4 Good News  143
    9.1.5 Bad News  143
  9.2 Information of experiments  144
    9.2.1 Introduction  144
    9.2.2 Regulator-based approach  145
    9.2.3 Reference signal-based approach  146
  9.3 Identification Methods  148
    9.3.1 Direct Method  148
    9.3.2 Indirect Method  149
    9.3.3 Joint input-output method  150

10 How to approach a problem of identification?  153
  10.1 Experience/intuition, physics  154
  10.2 Experiment design  155
    10.2.1 Type of input signal  155
    10.2.2 Frequency content  156
    10.2.3 Information  156
    10.2.4 Length  156
  10.3 Data collection  157
  10.4 Data analysis  157
    10.4.1 Errors and outliers  157
    10.4.2 Scaling and translation  158
    10.4.3 Filtering  159
    10.4.4 Information content  159
  10.5 Model set  159
  10.6 Criterions  160
  10.7 Computation  160
  10.8 Validation/comparison  160
Chapter 1
Preliminaries
1.1 Discrete time dynamical systems
1.1.1 General definitions and properties
In this course we will mostly consider Single Input Single Output (SISO) systems, and work with two-sided signals, defined from $-\infty$ to $\infty$. A (discrete-time) signal is a function $u : \mathbb{Z} \to \mathbb{R}$ or $\mathbb{C}$, and we call $\mathcal{S}$ the set of such signals. A SISO dynamical system is a function

    \tilde G : \mathcal{S} \to \mathcal{S} : y = \tilde G(u).

Note that we can restrict the domain of definition of $\tilde G$ to a subset of $\mathcal{S}$ when needed.

Definition 1.1. A system $\tilde G$ is said to be time invariant if, for any signals $u, y$ and value $T$, $y = \tilde G(u)$ implies $y' = \tilde G(u')$ where $u'(t) = u(t+T)$ and $y'(t) = y(t+T)$.

Definition 1.2. A system is said to be causal if for any $t$, $u(s) = u'(s)$ $\forall s \le t$ implies $y'(t) = y(t)$ for $y = \tilde G(u)$ and $y' = \tilde G(u')$.

Causality translates the fact that the output at time $t$ does not depend on input values after the instant $t$.

Definition 1.3. A system $\tilde G$ is linear if $\forall \alpha, \beta \in \mathbb{R}$ (or $\mathbb{C}$) and $u, w \in \mathcal{S}$,

    \tilde G(\alpha u + \beta w) = \alpha \tilde G(u) + \beta \tilde G(w).
Examples (linear / time invariant / causal):

- $\tilde G(u) = u$: yes / yes / yes
- $\tilde G(u)(t) = \sum_{k=-\infty}^{t} u(k)$: yes / yes / yes
- $\tilde G(u)(t) = \prod_{k=-\infty}^{t} u(k)$: no / yes / yes
- $\tilde G(u)(t) = \sum_{k=t-10}^{t+10} u(k)$: yes / yes / no
- $\tilde G(u)(t) = 5$ if $t < 0$ and $\tilde G(u)(t) = u(t)$ if $t \ge 0$: no / no / yes

Note that the second and third systems can only be defined on a subset of $\mathcal{S}$. For example, $u(t) = 2$ could not be in the domain of definition, as it would lead to "infinite" values of $\tilde G(u)(t)$ for all $t$.
1.1.2 LTI and impulse response
We define $\delta \in \mathcal{S}$ by $\delta(0) = 1$ and $\delta(t) = 0$, $\forall t \neq 0$.

Given a Linear Time Invariant (LTI) system $G$, the impulse response $g = G(\delta)$ entirely characterizes the system, as $y = G(u)$ can be computed for any $u \in \mathcal{S}$ (or in the domain of definition of $G$) by

    y(t) = \sum_{k=-\infty}^{\infty} g(k)\, u(t-k).   (1.1)

If the system is causal, we have $g(t) = 0$ when $t < 0$, so this expression becomes

    y(t) = \sum_{k=0}^{\infty} g(k)\, u(t-k).   (1.2)

The proofs of these last two facts are left to the reader. We also define the convolution $u * v$ of two signals $u, v \in \mathcal{S}$ by

    (u * v)(t) = \sum_{k=-\infty}^{\infty} u(k)\, v(t-k) = \sum_{k=-\infty}^{\infty} u(t-k)\, v(k).

Observe that (1.1) can be re-expressed as $y = g * u$.
From now on, we only consider causal systems.
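As a small illustration of (1.2), the following sketch (Python with NumPy; the impulse response and input are arbitrary choices, not taken from these notes) simulates a causal LTI system by truncated convolution.

```python
import numpy as np

def simulate_causal_lti(g, u):
    """Return y(t) = sum_{k>=0} g(k) u(t-k), with u assumed zero before t = 0."""
    # np.convolve computes the full convolution; its first len(u) samples
    # are exactly the sums in (1.2) restricted to the available data.
    return np.convolve(g, u)[:len(u)]

g = 0.5 ** np.arange(50)   # a (truncated) BIBO-stable impulse response
u = np.ones(100)           # a step input
y = simulate_causal_lti(g, u)
print(y[:5])               # 1, 1.5, 1.75, 1.875, 1.9375
```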
1.1.3 Stability
Definition 1.4. We say that a system with impulse response $g$ is (BIBO) stable if $\sum_{k=0}^{\infty} |g(k)| < \infty$.

One can verify that this definition is equivalent to requiring that any bounded input should result in a bounded output. It will be convenient in the sequel to have some additional notions of stability:

Definition 1.5. We say that a system with impulse response $g$ is strictly stable if $\sum_{k=0}^{\infty} k\, |g(k)| < \infty$.

Strict stability thus reflects the fact that an input having taken potentially very large values still produces a bounded output, provided that those large values occurred sufficiently far before $t$, and in particular that the magnitude of the signal does not grow more than linearly when going back in time.

Definition 1.6. We say that a class of systems with impulse responses $g \in \mathcal{G}^*$ is uniformly stable if there exists a BIBO stable impulse response $h$ such that $|g(k)| \le h(k)$ for all $g \in \mathcal{G}^*$.
1.1.4 The Z transform
Given $u \in \mathcal{S}$, the Z transform of $u$ is defined as

    Z(u):\quad U(z) = \sum_{t=-\infty}^{\infty} z^{-t} u(t),   (1.3)

and the inverse Z transform is

    Z^{-1}(U):\quad u(t) = \frac{1}{2\pi} \int_{\omega \in [0, 2\pi)} U(e^{i\omega}) e^{i\omega t}\, d\omega
                         = \frac{1}{2\pi} \int_{\omega \in [\omega_0, 2\pi + \omega_0)} U(e^{i\omega}) e^{i\omega t}\, d\omega, \quad \forall \omega_0 \in \mathbb{R},

where we remark that the interval is open on one side and closed on the other, to avoid taking the same angle into account twice (this only matters if $U(e^{i\omega})$ contains delta "functions"). $U(e^{i\omega})$ thus represents the coefficient of the $e^{i\omega t}$ component of the signal. The values of $U$ on the unit circle therefore contain all the information available on $u$.

Let us compute the Z transform of some signals:

- $u(0) = 1$, $u(1) = \frac{1}{2}$ and $u(t) = 0$ else. Then a direct application of (1.3) leads to $U(z) = 1 + \frac{1}{2} z^{-1}$.

- Suppose now that $u(t) = \left(\frac{1}{2}\right)^t$ if $t \ge 0$ and 0 else. Then it follows from (1.3) and the equality $\sum_{t \ge 0} \alpha^t = \frac{1}{1-\alpha}$ (for $|\alpha| < 1$) that

      U(z) = \sum_{t=0}^{\infty} z^{-t} u(t) = \sum_{t=0}^{\infty} \left(\frac{z^{-1}}{2}\right)^t = \frac{1}{1 - \frac{z^{-1}}{2}}.
We now note some interesting properties of this transform:

- $Z(g * f) = Z(g)\, Z(f)$,
- if $u$ is real, $U(\bar z) = \overline{U(z)}$,
- if $v(t) = u(t+1)$, then $Z(v) = z\, Z(u)$; conversely, if $v(t) = u(t-1)$, then $Z(v) = z^{-1} Z(u)$.

The equality $Z(g * f) = Z(g)\, Z(f)$ has an interesting consequence. Let indeed $g$ be the impulse response of an LTI system. We have seen that the output $y$ is determined from $g$ and the input $u$ by $y = g * u$. Using the standard convention that $Y = Z(y)$, $U = Z(u)$, we then have

    Y(z) = G(z)\, U(z)

using the standard convention $G = Z(g)$. We call this function $G$ the transfer function of the system. Since $Y(e^{i\omega}) = G(e^{i\omega})\, U(e^{i\omega})$, $G(e^{i\omega})$ determines the effect of the system on the component of the input at frequency $\omega$. In particular, if $u(t) = e^{i\omega t}$, then $y(t) = G(e^{i\omega})\, e^{i\omega t}$, and if $u(t) = \cos(\omega t)$, then $y(t) = |G(e^{i\omega})| \cos(\omega t + \varphi)$, where $\varphi = \arg G(e^{i\omega})$.

Exercise 1.1. Verify explicitly that $y(t) = |G(e^{i\omega})| \cos(\omega t + \varphi)$ when $u(t) = \cos(\omega t)$.
Moreover, the BIBO stability of the system $g$ can be characterized using the poles of the transfer function $G$. Remember that the poles of $G$ are the values $z$ for which $1/G(z) = 0$, and correspond to the values for which the denominator is zero in the particular case of rational functions (in irreducible form). Indeed, one can prove that the system is strictly stable if all the poles are inside the unit disk, i.e., if $|z_i| < 1$ for every pole $z_i$. Conversely, the system is not BIBO stable if at least one pole is on or outside the unit disk, i.e., there is a $z_i$ for which $|z_i| \ge 1$.

Note: when the system is assumed to be causal (which will almost always be the case), $g(t) = 0$ for $t < 0$, so that the transfer function can be computed by

    G(z) = \sum_{t=0}^{+\infty} g(t)\, z^{-t}.   (1.4)
1.1.5 Finite sequences
In most practical cases, we only know the value of the signals for a finite time period during which we measure them. We therefore now focus on how the analysis above applies to finite signals, and in particular on if and how this affects the use of transfer functions. We first define an analog of the Z transform for finite signals.

The Discrete Fourier Transform (DFT) of a finite sequence $u : \{1, \dots, N\} \to \mathbb{R}$ is

    U_N(\omega) = \frac{1}{\sqrt{N}} \sum_{t=1}^{N} u(t)\, e^{-i\omega t}.   (1.5)

$U_N(\omega)$ is thus proportional to the value at $e^{i\omega}$ of the Z transform of a signal equal to $u$ on $\{1, \dots, N\}$ and zero at all other times. The reader can verify that the inverse transformation is

    u(t) = \frac{1}{\sqrt{N}} \sum_{k=1}^{N} U_N\!\left(k \frac{2\pi}{N}\right) e^{i \frac{2\pi}{N} k t},   (1.6)

which implies that the values of $U_N$ at $\frac{2\pi}{N}, 2\frac{2\pi}{N}, \dots, 2\pi$ entirely characterize the sequence $u$. The transformation from $u$ to $U_N$ (restricted to these relevant points) can actually be seen as an orthogonal change of coordinates in $\mathbb{C}^N$, which can be efficiently computed using the FFT and IFFT algorithms. Note that one could have used other normalizing constants than $1/\sqrt{N}$ in the definition of the DFT and in its inverse, provided that the product of the two constants used is $1/N$. The choice of $1/\sqrt{N}$ ensures however that $u^* u = U_N^* U_N$, i.e.

    \sum_{t=1}^{N} u^2(t) = \sum_{k=1}^{N} \overline{U_N\!\left(k \frac{2\pi}{N}\right)}\, U_N\!\left(k \frac{2\pi}{N}\right),

so that the "power" of the signal remains the same in the frequency domain as in the time domain. Finally, note that for a real sequence $u$, $U_N(\omega) = \overline{U_N(2\pi - \omega)}$, so that one only needs to store half of the values of $U_N$.

Exercise 1.2. Verify that $U_N(\omega) = \overline{U_N(2\pi - \omega)}$ when $u$ takes real values.
The following theorem shows the relevance of transfer functions when only a finite sequence of the signals is measured, while a large part (from $-\infty$ to 0) is unknown. In particular, it shows that, under certain assumptions, the error made by approximating $Y_N$ by $G U_N$ is bounded by a quantity that decays as $1/\sqrt{N}$. The proof is technical and can be found in Appendix 1.A.

Theorem 1.1. Consider a transfer function $G$ and two signals $u, y \in \mathcal{S}$ such that $Y = G U$. Let $Y_N$ and $U_N$ be defined as in (1.5) based on the restriction of $u$ and $y$ to $\{1, \dots, N\}$. If the following two conditions hold:

(i) $\exists C_u < \infty$ s.t. $|u(t)| < C_u$ $\forall t$,

(ii) the system is strictly stable: $\exists C_g < \infty$ s.t. $\sum_{k=0}^{\infty} k\,|g(k)| = C_g$,

then

    Y_N(\omega) = G(e^{i\omega})\, U_N(\omega) + R(\omega), \quad \text{with} \quad |R(\omega)| \le \frac{2 C_u C_g}{\sqrt{N}}.

Strict stability is necessary here to ensure the asymptotic irrelevance of the values taken by the signal before $t = 0$.
1.1.6 Shift operator q
It is actually possible to define, compute and use transfer functions without the help of the Z transform, using instead the so-called shift operator $q$, even though the interpretation in terms of frequencies is less direct. Since this operator often appears in the literature, we chose to present this alternative approach here.

We define the shift operator $q$:

    q : \mathcal{S} \to \mathcal{S}, \quad (qu)(t) = u(t+1),

and the inverse shift operator $q^{-1}$:

    q^{-1} : \mathcal{S} \to \mathcal{S}, \quad (q^{-1}u)(t) = u(t-1).

The powers $q^k$ can be defined by $(q^k u)(t) = u(t+k)$ (one can verify that $q^k = q \cdot q^{k-1}$). Powers of $q$ can be combined in polynomials. For example, we have

    \left((2q^{-2} + 3q^{-1} + 5)u\right)(t) = 2u(t-2) + 3u(t-1) + 5u(t).

In the sequel, we will use a slight abuse of notation and write $qu(t) = (qu)(t)$. Besides, when referring to a function of $q^{-1}$, we actually mean the polynomial expansion of this function "applied" to $q^{-1}$. For example, $e^{q^{-1}} = \sum_{n=0}^{\infty} \frac{q^{-n}}{n!}$, and $\frac{1}{1-q^{-1}} = \sum_{k=0}^{\infty} q^{-k}$. This definition allows manipulating functions of $q$ "as if they were normal functions". For example, the relation

    y(t) - y(t-1) = u(t) + 2u(t-1)

can be expressed as $(1 - q^{-1})y = (1 + 2q^{-1})u$, and thus as $y = G(q)u$ with

    G(q) = \frac{1 + 2q^{-1}}{1 - q^{-1}}.

The expression of this last function is exactly the same as that of the transfer function of the system. This is in fact always the case. Indeed, when a system with impulse response $g$ is causal, we can rewrite (1.2) as

    y(t) = \sum_{k=0}^{\infty} g(k)\, u(t-k) = \sum_{k=0}^{\infty} \left(g(k)\, q^{-k} u\right)(t),

and therefore

    y = \left(\sum_{k=0}^{\infty} g(k)\, q^{-k}\right) u = G(q)\, u,

where $G$ is the function whose polynomial expansion in $q^{-1}$ is

    G(q) = \sum_{k=0}^{\infty} g(k)\, q^{-k}.   (1.7)

Observe that this expression is exactly the same as (1.4), so that $G(q)$ is exactly the transfer function "for $z = q$".
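As an illustration (a Python sketch relying on scipy.signal.lfilter, with zero initial conditions assumed), applying the $G(q) = (1 + 2q^{-1})/(1 - q^{-1})$ of the example above to a signal amounts to running the difference equation $y(t) = y(t-1) + u(t) + 2u(t-1)$:

```python
import numpy as np
from scipy.signal import lfilter

num = [1.0, 2.0]    # 1 + 2 q^{-1}
den = [1.0, -1.0]   # 1 - q^{-1}

u = np.zeros(10)
u[0] = 1.0                        # impulse input, so y is the impulse response
y = lfilter(num, den, u)          # zero initial conditions assumed
print(y)                          # [1, 3, 3, 3, ...]: g(0) = 1, g(k) = 3 for k >= 1
```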
Remark 1.1. There is no universal agreement in the literature about whether one should refer to the transfer function as $G(q^{-1})$ (and $G(z^{-1})$) or $G(q)$ (and $G(z)$). Since $G$ is often defined or computed using a series development in $q^{-1}$ (resp. $z^{-1}$), and its expression often contains negative powers of $q$ (resp. $z$), it could seem more natural to use the convention $G(q^{-1})$. However, this convention could cause some confusion when discussing the poles and zeros of the function, as those are the values of $z$ (resp. $q$) for which $G$ and $1/G$ are zero. Besides, the effect of the system on the signal $e^{i\omega_0 t}$ is described by the value of $G$ for $z = e^{i\omega_0}$. For these two reasons, and for the sake of using concise notations, we will use here the convention that the transfer function is $G(q)$ (resp. $G(z)$). This has however no conceptual implication.

Remark 1.2. Note that working with the operator $q$ as with a normal variable hides some difficulties. The equation $(1 - aq^{-1})y = 0$ admits for example any signal $y(t) = K a^t$ as a solution. By linearity, this means that if $y_0$ is a solution of $(1 - aq^{-1})y = Gu$, then any $y_0 + K a^t$ is a solution to the same equation. Formally, the operator $1 - aq^{-1}$ is therefore not invertible on the set of signals. The exact same problem actually also appears when working in the frequency domain.

This issue is related to the fact that the use of transfer functions, whether in $z$ or in $q$, totally ignores the initial conditions of the system. It disappears if enough (often initial) values of $y$ are specified. Such values are however usually not specified when treating classes of systems and inputs or outputs.

To avoid any problem, it is usually assumed that the input and output signals were 0 before time 0 or before a certain negative time (all combinations of powers of $q$ are invertible on that set of signals). This assumption would however forbid the use of any periodic input, and may also pose conceptual problems when the output also depends on a disturbance. Another possibility is to restrict oneself to stable transfer functions and assume that the signals remained bounded in the past. We believe that the reader must be aware of these issues, but we will not detail or solve them in these notes.
1.2 Stochastic processes
Real systems on which experiments are made often involve unknown disturbances, which can be modeled as (often nontrivial) random processes. Besides, we will see that one may want to use random signals as inputs when performing certain experiments. For these reasons, it will be necessary to work with stochastic processes: a stochastic process is a random variable taking its value in the set $\mathcal{S}$ of signals. We use the classical probabilistic notation $e : \Omega \to \mathcal{S}$ to state that $e$ is a stochastic process (where the symbol $\Omega$ denotes the set of all possible outcomes).
1.2.1 Stationary processes
Stationary processes are processes whose properties are preserved under time shifts. A stationary process is thus one that is totally indistinguishable from any of its shifted versions (individual realizations of the process will of course typically not be shift-invariant).

Definition 1.7. A stochastic process $u : \Omega \to \mathcal{S}$ is stationary if for any $T$ and sequence $t_1, t_2, \dots, t_\tau$, there holds

    F_u(u_{t_1}, \dots, u_{t_\tau}) = F_u(u_{t_1 + T}, \dots, u_{t_\tau + T}),

where $F_u(u_{t_1}, \dots, u_{t_\tau})$ is the cumulative distribution function of the values $u_{t_1}, \dots, u_{t_\tau}$.

Examples:

- i.i.d. (independent and identically distributed) variables $x_t \sim N(0, \sigma^2)$ are stationary.
- A process with all zeros up to some point and then a certain random variable is not stationary.
- The process defined by $u(t) \sim N(\sin(t), 1)$ is not stationary.
- If $u(t)$ is stationary, the process $v$ defined by $v(t) = u(t) + \frac{1}{2} u(t-1)$ is stationary.
- Even if the $e(t)$ are i.i.d., the process $u(t) = e(\lfloor t/10 \rfloor)$ is not stationary. Indeed, all the $u(t)$ have the same distribution, but the pair $(u(9), u(11))$ does not have the same distribution as $(u(11), u(13))$.
We denote by $Eu(t)$ the expected value of $u(t)$, i.e. the expected value of the signal $u$ at time $t$. It follows directly from the definition that the expected value of a stationary signal and the correlation between its values at different times are invariant under time shifts.

Proposition 1.1. Let $u$ be a stationary signal. Then

    Eu(t) = Eu(t+k), \quad \forall t, k,

and

    Eu(t)u(s) = Eu(t+k)u(s+k), \quad \forall t, s, k.

Consequently, we can define the expected value $m_u = Eu(t)$ (constant) and the correlation function $r_u(\tau) = Eu(t)u(t+\tau)$, as these values are independent of $t$.

Consider for example $u(t) = e(t) + e(t-1)$, where $e(t)$ is a sequence of i.i.d. random variables of expected value 0 and variance $\sigma^2$. Then $m_u = Eu(t) = Ee(t) + Ee(t-1) = 0$, and the correlation function can be computed by considering three cases: $r_u(0) = Eu(t)^2 = Ee(t)^2 + 2Ee(t)e(t-1) + Ee(t-1)^2 = 2\sigma^2$, because $Ee(t-1)^2 = Ee(t)^2 = \sigma^2$ and $Ee(t)e(t-1) = Ee(t)\,Ee(t-1) = 0$ as $e(t)$ and $e(t-1)$ are independent. Similarly, $r_u(1) = r_u(-1) = E\big((e(t)+e(t-1))(e(t+1)+e(t))\big) = Ee(t)e(t+1) + Ee(t)^2 + Ee(t-1)e(t+1) + Ee(t-1)e(t) = Ee(t)^2 = \sigma^2$. Finally, a similar reasoning shows $r_u(\tau) = 0$ for all $\tau \neq -1, 0, 1$.
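These values are easy to check numerically; the sketch below (Python, with an arbitrary sample length) estimates $r_u(\tau)$ by time averaging over one long realization, which is justified here because this process turns out to be ergodic (see Section 1.2.5).

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma2 = 200_000, 1.0
e = rng.normal(0.0, np.sqrt(sigma2), N + 1)
u = e[1:] + e[:-1]                       # u(t) = e(t) + e(t-1)

def r_hat(tau):
    """Sample estimate of r_u(tau) = E[u(t) u(t+tau)]."""
    return np.mean(u[:N - tau] * u[tau:])

print([round(r_hat(tau), 2) for tau in range(4)])
# approximately [2.0, 1.0, 0.0, 0.0], i.e. 2*sigma^2, sigma^2, 0, 0
```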
Note that the properties mentioned in Proposition 1.1 are consequences of the definition of stationary signals, but certainly do not define what a stationary signal is. In particular, one can build signals that are not stationary but for which the properties of Proposition 1.1 hold.
1.2.2 Spectral density
Definition 1.8. The spectral density $\Phi_u(\omega)$ of a stationary(1) process $u$ is defined by

    \Phi_u(\omega) := \sum_{k=-\infty}^{\infty} r_u(k)\, e^{-i\omega k}.

It corresponds thus to the Z transform of the correlation function evaluated at $e^{i\omega}$. As in the case of the Z transform, the spectrum is defined on a continuous set ($[-\pi, \pi)$ or equivalently $[0, 2\pi)$), even if the correlation from which it is derived is defined on a discrete set (the integers). It can moreover be proved that if $u$ takes real values, then $\Phi_u(\omega)$ is real, nonnegative, and symmetric: $\Phi_u(-\omega) = \Phi_u(\omega)$, or equivalently $\Phi_u(2\pi - \omega) = \Phi_u(\omega)$. Using the inverse of the Z transform we see that

    r_u(k) = \frac{1}{2\pi} \int_0^{2\pi} \Phi_u(\omega)\, e^{ik\omega}\, d\omega.   (1.8)

Since $\Phi_u$ is symmetric and real, we have equivalently

    r_u(k) = \frac{1}{\pi} \int_0^{\pi} \Phi_u(\omega) \cos(k\omega)\, d\omega.

Intuitively, $\Phi_u(\omega)$ thus corresponds to the importance of the frequency $\omega$ in the signal $u$: roughly speaking, a large value of $\Phi_u(\omega)$ corresponds to a large component in $\cos(\omega k)$ of the correlation. This translates into a strong positive correlation between values separated by an even multiple of $\pi/\omega$ (i.e. those for which $\cos(\omega k) = 1$) and a strong negative correlation between values separated by an odd multiple of $\pi/\omega$ (i.e. those for which $\cos(\omega k) = -1$). One has however to be careful with such an interpretation, as the "real" correlation will be the combined effect of all the frequencies.

Particularizing (1.8) to $k = 0$ leads to the following expression relating the variance $Eu^2(t)$ of the signal $u$ to its spectrum:

    Eu^2(t) = r_u(0) = \frac{1}{2\pi} \int_0^{2\pi} \Phi_u(\omega)\, d\omega.   (1.9)

(1) We will see later that this definition and all the subsequent properties apply to a larger class of processes: the quasi-stationary processes.
The spectrum of a stationary process is asymptotically related to the DFT of finite subsequences of the signal, introduced in (1.5). One can indeed prove that $E|U_N(\omega)|^2$ converges in distribution to $\Phi_u(\omega)$ (see [1], Section 2.3), i.e.

    \lim_{N\to\infty} \int_0^{2\pi} E|U_N(\omega)|^2\, \psi(\omega)\, d\omega = \int_0^{2\pi} \Phi_u(\omega)\, \psi(\omega)\, d\omega   (1.10)

for any "sufficiently smooth" function $\psi$.

We can also define the cross-correlation between two stationary signals by

    r_{yu}(k) = E\, y(t)\, u(t-k),   (1.11)

and the corresponding cross-spectrum

    \Phi_{yu}(\omega) = \sum_{k=-\infty}^{\infty} r_{yu}(k)\, e^{-i\omega k}.

(Note that for the definition of the cross-correlation to be consistent, we actually need the signals to satisfy a "mutual stationarity" condition, which we will not detail here; see [1], Section 2.3 for details.)

The following proposition shows how the spectrum of the output of an LTI system is related to the spectrum of its input (see [1], Section 2.3 for details). Note that we use $^*$ to denote the conjugate transpose. In particular, if $H$ is a transfer function/matrix, $H^*(e^{i\omega})$ is the transpose of $H(e^{-i\omega})$.

Proposition 1.2. Let $u$ be a stationary signal and $y = H(q)u$ for some transfer function $H$. The following conditions hold:

(a) $m_y = Ey(t) = H(1)\, Eu(t) = H(1)\, m_u$,
(b) $\Phi_y(\omega) = H(e^{i\omega})\, \Phi_u(\omega)\, H^*(e^{i\omega})$,
(c) $\Phi_{yu}(\omega) = H(e^{i\omega})\, \Phi_u(\omega)$.
1.2.3 White noise and factorization of random processes
Definition 1.9. A stochastic process $e : \Omega \to \mathcal{S}$ (where $\mathcal{S}$ is the set of signals $\mathbb{Z} \to \mathbb{R}$) is called white noise if the three following conditions hold:

- $E[e(s)] = 0$,
- $E[e(s)e(s)] = \sigma_e^2$ $\forall s$ (for some $\sigma_e$),
- $E[e(s)e(t)] = 0$ $\forall s \neq t$.

For such signals, we have therefore $r_e(0) = \sigma_e^2$ (we will use $\sigma^2$ when the context prevents any ambiguity) and $r_e(k) = 0$ for all $k \neq 0$. A white noise is thus a stochastic process whose values at different times are totally decorrelated, have the same variance, and have expected value 0. For example, any signal consisting of an infinite sequence of i.i.d. random variables with zero expected value is a white noise. (The converse inclusion does not hold though, as one could for example build a sequence of random variables which are decorrelated but not independent.)

The absence of correlation between different times translates into a uniform spectrum:

    \Phi_e(\omega) = \sum_{k=-\infty}^{\infty} r_e(k)\, e^{-i\omega k} = r_e(0) = \sigma^2.

It follows from this uniform spectrum and from Proposition 1.2(b) that for any transfer function $H$, one can generate a signal with spectrum proportional to $H(e^{i\omega}) H^*(e^{i\omega}) = |H(e^{i\omega})|^2$ by applying the filter $H$ to a white noise. Such a factorization will prove very convenient when trying to identify the properties of the disturbances in a system. The following theorem gives a sufficient condition for a spectrum to be "factorizable" by filters with nice properties.

Theorem 1.2 (Paley-Wiener). Let $\Phi_v$ be a signal spectrum. If

    \int_{-\pi}^{\pi} |\ln \Phi_v(\omega)|\, d\omega < \infty,

then there exists a filter $H$ such that

(a) $\Phi_v(\omega) = H(e^{i\omega})\, H^*(e^{i\omega})\, \sigma^2$;
(b) if $e(t)$ is a white noise with variance $\sigma^2$, and $y(t) = H(q)e(t)$, then $\Phi_y(\omega) = \Phi_v(\omega)$;
(c) $H$ has all its poles inside the unit disk, and $H^{-1}$ has all its poles inside or on the unit disk.
For example, observe that the spectrum $\Phi_v(\omega) = \frac{4}{5 - 4\cos\omega}$ satisfies the assumption of the theorem: since its values are contained in $[\frac{4}{9}, 4]$, the absolute value of its logarithm is uniformly bounded, so that $\int_{-\pi}^{\pi} |\ln \Phi_v(\omega)|\, d\omega < \infty$. Consistently with the Paley-Wiener theorem, observe now that a signal $y$ with this spectrum $\Phi_v$ can be obtained by applying the filter $H(q) = \frac{1}{1 - \frac{1}{2}q^{-1}}$ (which has all its poles in the unit disk and whose inverse has no pole) to a white noise $e$ of variance 1. Indeed, it follows from Proposition 1.2(b) that there holds in that case

    \Phi_y(\omega) = \frac{1}{1 - \frac{1}{2}e^{-i\omega}} \cdot \frac{1}{1 - \frac{1}{2}e^{i\omega}} = \frac{1}{1 - \cos\omega + \frac{1}{4}} = \frac{4}{5 - 4\cos\omega}.

Theorem 1.2 is not a constructive one (even though some proofs provide a way of constructing the filter, at least in some particular cases); it just guarantees the existence of the appropriate filter $H$. Its main interest is to allow modeling a large class of disturbances. Indeed, if one assumes that the disturbance satisfies the condition of the theorem (which is the case in almost all natural situations, and is always the case if $0 < \inf \Phi_v \le \sup \Phi_v < \infty$), then the problem of identifying the effect of the disturbance in an identification problem can be reformulated as one of identifying the appropriate transfer function $H$, at least as far as the spectrum (and thus the first and second moments) is concerned.

From a practical point of view, if one needs to factorize a spectrum, a common strategy for finding $H$ is to guess its structure ($\frac{1}{b - cq^{-1}}$ for example), compute the associated class of spectra, and identify the parameters. Computing $\int_{-\pi}^{\pi} |\ln \Phi_v(\omega)|\, d\omega$ is then not needed.
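As a purely numerical illustration of Theorem 1.2(b) on this example (a Python sketch using SciPy; it only checks the factorization, it does not construct $H$), one can filter a long white-noise realization through $H(q) = 1/(1 - \frac{1}{2}q^{-1})$ and compare a spectrum estimate with $\Phi_v(\omega) = 4/(5 - 4\cos\omega)$:

```python
import numpy as np
from scipy.signal import lfilter, welch

rng = np.random.default_rng(2)
e = rng.standard_normal(200_000)           # white noise of variance 1

# H(q) = 1 / (1 - 0.5 q^{-1})  <=>  y(t) = 0.5 y(t-1) + e(t)
y = lfilter([1.0], [1.0, -0.5], e)

# Two-sided Welch estimate with fs = 1, comparable to Phi_y(omega).
f, Pyy = welch(y, nperseg=1024, return_onesided=False, scaling='density')
phi = 4.0 / (5.0 - 4.0 * np.cos(2 * np.pi * f))

print(np.mean(np.abs(Pyy - phi) / phi))    # small relative error (a few percent)
```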
1.2.4 Quasi-Stationarity
The theory of stationary processes is mathematically elegant and produces convenient results. Unfortunately, this class of processes is too small, as it does not include many processes and signals which are very important from a practical point of view. For example, any process involving a transient behavior (resulting for example from the initialization of a system) is non-stationary. Similarly, sums of sinusoids, which constitute an important class of input signals, are not stationary; their expected value, which is their actual value, depends indeed on time. To palliate this issue while preserving the nice properties of stationary processes, we define here the larger class of quasi-stationary processes. Roughly speaking, those are processes for which it still makes sense to talk about an expected value $m_u$ and a correlation function $r_u(k)$, although those concepts have to be seen as asymptotic averages.

Definition 1.10. A process $u$ is quasi-stationary (Q-S) if

- the expected value $M_u(t) = Eu(t)$ is bounded for all $t$: $|M_u(t)| < C$ $\forall t$ and for some $C$, and the average expected value $m_u := \lim_{N\to\infty} \frac{1}{N}\sum_{t=1}^{N} M_u(t)$ exists;

- the correlation $R_u(s,t) := Eu(t)u(s)$ exists and is uniformly bounded: $|R_u(s,t)| < C$ $\forall s, t$ and some $C$, and for each $k$ the average correlation $r_u(k) := \lim_{N\to\infty} \frac{1}{N}\sum_{t=1}^{N} R_u(t, t-k)$ exists.

Although the analysis is more mathematically involved, one can prove that all the results on stationary processes described in the previous sections still hold for quasi-stationary processes. Under suitable conditions, one can also define the cross-correlation function

    r_{yu}(k) := \lim_{N\to\infty} \frac{1}{N} \sum_{t=1}^{N} E\, y(t)\, u(t-k)

for two (mutually(3)) quasi-stationary processes $u, y$.

(3) The notion of two processes being mutually quasi-stationary is not formally defined here, but essentially means that certain quantities about the two processes, such as $r_{yu}$, can be defined and exist; see [1], Section 2.3 for details.

Examples:

- It can be proved that any deterministic periodic signal is quasi-stationary. Consider for example $u(t) = \cos(\omega_0 t)$, for $\omega_0 \in (0, \pi)$. There holds $M_u(t) = Eu(t) = u(t) = \cos(\omega_0 t)$, which is bounded, and one can verify that

      m_u = \lim_{N\to\infty} \frac{1}{N}\sum_{t=1}^{N} M_u(t) = \lim_{N\to\infty} \frac{1}{N}\sum_{t=1}^{N} \cos(\omega_0 t) = 0,

  so that the first condition is satisfied. Observe now that(4)

      R_u(t, s) = Eu(t)u(s) = \cos(\omega_0 t)\cos(\omega_0 s) = \frac{1}{2}\cos(\omega_0(s+t)) + \frac{1}{2}\cos(\omega_0(s-t)),

  which is also bounded. Moreover, for any $k$, the limit

      r_u(k) = \lim_{N\to\infty} \frac{1}{N}\sum_{t=1}^{N} R_u(t+k, t)
             = \lim_{N\to\infty} \frac{1}{N}\sum_{t=1}^{N} \left(\frac{1}{2}\cos(\omega_0(2t+k)) + \frac{1}{2}\cos(\omega_0 k)\right)
             = \frac{1}{2}\cos(\omega_0 k)

  is well defined. Finally, computing directly the spectrum of such periodic signals can seem a priori challenging, as it appears to involve evaluating diverging or non-absolutely converging sequences. This issue is related to the fact that the spectrum of a periodic signal (or of a signal with periodic correlation) consists of a weighted sum of Dirac $\delta$-functions. The exact mathematical meaning of such a computation is beyond the scope of these notes. For all practical purposes, the spectrum of such signals can be obtained by guessing a structure, and identifying the appropriate parameters using the inverse formula (1.8). In our case, this yields

      \Phi_u(\omega) = \frac{\pi}{2}\left(\delta(\omega - \omega_0) + \delta(\omega + \omega_0)\right).

- Any signal that becomes equal to an associated stationary signal after a certain time is quasi-stationary.

- If $u$ is deterministic Q-S and $v$ is stationary with $m_v = 0$, the signal $s = u + v$ is quasi-stationary, and one can prove that there holds $r_s(\tau) = r_u(\tau) + r_v(\tau)$.

(4) This can be easily computed, for example by decomposing $\cos(\omega_0 t)$ as $\frac{1}{2}(e^{i\omega_0 t} + e^{-i\omega_0 t})$.
1.2.5 Ergodicity
Although they are often confused, there is an important conceptual difference between the expected value of a process at a certain time, and the average value that would typically be measured on a realization of this process. To exemplify this distinction, let $x$ be a random variable taking the value 1 with probability 1/2 and $-1$ else, and consider the (provably stationary) stochastic process defined by $u(t) = x$, $\forall t$. Thus the realization of $u$ is going to be $\{\dots, 1, 1, 1, 1, \dots\}$ with probability 1/2 and $\{\dots, -1, -1, -1, -1, \dots\}$ else. Clearly, for each $t$, $Eu(t) = Ex = 0$. However, for any realization of $u$, the limit

    \lim_{N\to\infty} \frac{1}{N}\sum_{t=1}^{N} u(t)

will take the value 1 with probability 1/2 and $-1$ else. The expected value of the stochastic process $u$ cannot thus be estimated by computing the average of the observed values. We say that such stochastic processes, whose properties cannot be estimated by observing an arbitrarily long (or even infinite) realization, are not ergodic. There is a very large mathematical theory of ergodicity, interacting with many other fields, but lying beyond the scope of this class. We will therefore just introduce a few concepts that we need.

When it is well defined, we call long term average of a process $u$ the limit

    \bar{E}u := \lim_{N\to\infty} \frac{1}{N}\sum_{t=1}^{N} E[u(t)].   (1.12)

Definition 1.11. We say that a process $u$ is

- ergodic with respect to the 1st moment if

      \frac{1}{N}\sum_{t=1}^{N} u(t) \longrightarrow \bar{E}[u(t)] \quad \text{w.p. 1, when } N \to \infty;

- ergodic with respect to the 2nd moment if

      \frac{1}{N}\sum_{t=1}^{N} u(t)u(t+k) \longrightarrow \bar{E}[u(t)u(t+k)] \quad \text{w.p. 1, when } N \to \infty, \ \forall k;

where we use "w.p. 1" to abbreviate "with probability 1".
The following theorem (whose proof is available in Appendix 2B of [1]) gives a weak sufficient condition for ergodicity.

Theorem 1.3. Let $s(t)$ be a quasi-stationary process, $m_s(t) = E[s(t)]$, and define $v(t) = s(t) - m_s(t)$. Suppose that

    v(t) = \sum_{k=0}^{\infty} h_t(k)\, e(t-k) = H_t(q)\, e(t),

where $e(t)$ is a process consisting of i.i.d. variables with bounded 2nd and 4th moments. If $\{H_t(q)\}$ is uniformly stable, i.e. there exists a stable $\tilde h(k)$ such that $|h_t(k)| \le \tilde h(k)$ for all $t$ and $k$, then the process $s$ is ergodic with respect to the 1st and 2nd moments.

For the sake of simplicity, unless otherwise specified, we will always assume here that the white noises considered consist of i.i.d. variables with bounded 2nd and 4th moments. As a result, they will be ergodic with respect to the 1st and 2nd moments, and so will be $H(q)e$ for any stable transfer function $H$.

In practice, non-ergodic processes are processes whose parameters cannot all be recovered, even in an arbitrarily long experiment. They typically result from irreversible phenomena (such as deciding with equal probability whether all values will be 1 or $-1$). The only way of recovering their parameters would be to run several independent experiments, but guaranteeing this independence is not always feasible, and may be challenging. Common sense suggests avoiding attempts to model systems by non-ergodic processes whenever possible, assuming that the processes considered are ergodic unless there is a very good reason not to do so. In particular, we will not consider systems involving non-ergodicity in this class.
1.2.6 Alternative to stochastic processes
To conclude this section, let us mention that system identification techniques can also be derived without any reference to stochastic processes. The techniques we develop in the next chapters often consist in assuming that the output signal $y$ results from some deterministic behavior of the system, to which is added some random signal, and in attempting to model both. Alternative approaches take the view that the output $y$ results from the deterministic behavior of the system, to which is added some arbitrary signal representing un-modeled phenomena. The goal is then to find the system description that best explains the observed output, that is, for which the signals resulting from un-modeled phenomena are as small as possible.

Although very different from a philosophical point of view, these approaches typically lead to the same practical methods.
1.3 Estimators
Many methods of system identification can be viewed as a way of estimating the parameters of a stochastic process based on one (or several) of its realizations. To be able to formalize this and to quantify the quality of these estimations, we introduce the following notions.

Let $u : \Omega \to \mathcal{S}$ be a random variable parametrized by $\theta \in \Theta$. Then:

- An estimator (of $\theta$) is a function

      f : \mathcal{S} \to \Theta : s \mapsto f(s) = \hat\theta.

- The bias of the estimator is

      E[f(s) - \theta] = E[\hat\theta] - \theta;

  the estimator is said to be unbiased if its bias is 0.

- The variance or mean square error is

      E[(\hat\theta - \theta)^2].

Let $u_N : \Omega \to \mathcal{S}_N$ be a family of random variables parametrized by a same $\theta$; then the family of estimators $\hat\theta_N$ is consistent if

    \lim_{N\to\infty} P\left(\left|\hat\theta_N - \theta\right| > \epsilon\right) = 0 \quad \forall \epsilon > 0,

where $P(\text{event})$ denotes the probability of the event considered. For example, let $u_N$ represent $N$ independent realizations of a random variable of expected value $\theta$. The estimator $\hat\theta = \frac{1}{N}\sum_{t=1}^{N} u_N(t)$ is unbiased and consistent. The estimator $\hat\theta = \frac{1}{N+2}\sum_{t=1}^{N} u_N(t)$ is biased but remains consistent. Finally, the estimator $\frac{1}{2}(u_N(1) + u_N(N))$ is unbiased, but not consistent.
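A small numerical illustration of these three estimators (a Python sketch; the true value $\theta = 3$ and the Gaussian distribution are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 3.0                                   # true expected value

def compare(N, reps=2000):
    u = rng.normal(theta, 1.0, size=(reps, N))
    estimators = {
        "mean (1/N)":    u.mean(axis=1),               # unbiased, consistent
        "sum/(N+2)":     u.sum(axis=1) / (N + 2),      # biased, consistent
        "(u(1)+u(N))/2": 0.5 * (u[:, 0] + u[:, -1]),   # unbiased, not consistent
    }
    for name, est in estimators.items():
        bias = est.mean() - theta
        mse = np.mean((est - theta) ** 2)
        print(f"N={N:6d}  {name:14s}  bias={bias:+.3f}  mse={mse:.3f}")

compare(10)
compare(10_000)   # the first two mean square errors shrink, the third one does not
```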
Appendix
1.A Proof of Theorem 1.1
Proof. Let us first develop the expression of $Y_N(\omega)$:

    Y_N(\omega) = \frac{1}{\sqrt{N}} \sum_{\tau=1}^{N} y(\tau)\, e^{-i\omega\tau}
                = \frac{1}{\sqrt{N}} \sum_{\tau=1}^{N} \left(\sum_{k=0}^{\infty} g(k)\, u(\tau - k)\right) e^{-i\omega\tau}
                = \frac{1}{\sqrt{N}} \sum_{k=0}^{\infty} \left(\sum_{t=1-k}^{N-k} u(t)\, e^{-i\omega t}\right) g(k)\, e^{-i\omega k},

where we have substituted $t := \tau - k$. Using the definition of $G$, we obtain then

    \left|Y_N(\omega) - G(e^{i\omega})\, U_N(\omega)\right|
      = \left|\sum_{k=1}^{\infty} g(k)\, e^{-i\omega k} \left(\frac{1}{\sqrt{N}} \sum_{t=1-k}^{N-k} u(t)\, e^{-i\omega t}\right) - \sum_{k=1}^{\infty} g(k)\, e^{-i\omega k}\, U_N(\omega)\right|
      = \left|\sum_{k=1}^{\infty} g(k)\, e^{-i\omega k} \left(\frac{1}{\sqrt{N}} \sum_{t=1-k}^{N-k} u(t)\, e^{-i\omega t} - U_N(\omega)\right)\right|
      \le \sum_{k=1}^{\infty} |g(k)| \left|\frac{1}{\sqrt{N}} \sum_{t=1-k}^{N-k} u(t)\, e^{-i\omega t} - U_N(\omega)\right|.   (1.13)

We now bound the difference depending on $u$ and $U_N$ in this last line:

    \left|U_N(\omega) - \frac{1}{\sqrt{N}} \sum_{t=1-k}^{N-k} u(t)\, e^{-i\omega t}\right|
      = \frac{1}{\sqrt{N}} \left|\sum_{t=1}^{N} u(t)\, e^{-i\omega t} - \sum_{t=1-k}^{N-k} u(t)\, e^{-i\omega t}\right|
      = \frac{1}{\sqrt{N}} \left|-\sum_{t=1-k}^{0} u(t)\, e^{-i\omega t} + \sum_{t=N-k+1}^{N} u(t)\, e^{-i\omega t}\right|
      \le \frac{1}{\sqrt{N}} \left(\left|\sum_{t=1-k}^{0} u(t)\, e^{-i\omega t}\right| + \left|\sum_{t=N-k+1}^{N} u(t)\, e^{-i\omega t}\right|\right)
      \le \frac{1}{\sqrt{N}} \left(\sum_{t=1-k}^{0} |u(t)|\, \left|e^{-i\omega t}\right| + \sum_{t=N-k+1}^{N} |u(t)|\, \left|e^{-i\omega t}\right|\right).

Since $|e^{-i\omega t}| = 1$ and $|u(t)| \le C_u$ for all $t$, it follows then that

    \left|U_N(\omega) - \frac{1}{\sqrt{N}} \sum_{t=1-k}^{N-k} u(t)\, e^{-i\omega t}\right|
      \le \frac{1}{\sqrt{N}} \left(\sum_{t=1-k}^{0} |u(t)| + \sum_{t=N-k+1}^{N} |u(t)|\right)
      \le \frac{2kC_u}{\sqrt{N}}.

Together with the inequality (1.13) and the assumption that $\sum_{k=0}^{\infty} k\,|g(k)| = C_g < \infty$, this implies that

    \left|Y_N(\omega) - G(e^{i\omega})\, U_N(\omega)\right| \le \sum_{k=1}^{\infty} |g(k)|\, \frac{2kC_u}{\sqrt{N}} = \frac{2 C_g C_u}{\sqrt{N}},

which proves the theorem.
Chapter 2

Non-parametric identification
Non-parametric methods aim at directly estimating the values of the transfer function $G(e^{i\omega})$ or of the impulse response $g(t)$. No assumption is thus made on the system, other than being LTI (non-parametric methods for systems that are not LTI are beyond the scope of these notes).

Among their main advantages, we cite the following:

- no assumption or prior knowledge is needed (once the input is identified),
- these methods are usually easy and fast to apply.

They present on the other hand several disadvantages:

- They do not provide a simple formulation of the estimated model, but only a list of values. It can as a result prove harder to re-use the model obtained, for example to design a controller.
- They typically do not provide a model for the noise/disturbances.
- Many of them are only moderately accurate.

Non-parametric methods can however prove very useful when one needs a rough idea of the behavior of a dynamical system, including in the context of a preliminary study. It is for example a good idea to try applying a non-parametric method to get a feel for what the data looks like, and to develop some intuition about a system that one will later identify with other methods. They can also be useful for verification purposes: one could perform all the design of a controller based on a simplified system model obtained by parametric methods, and then check the stability and properties of the controlled system using the full transfer functions identified by non-parametric methods.
2.1 Time domain analysis
In this section, we represent the system by
y(t) = G(q)u(t) + v(t) with Ev(t) = 0
2.1.1 Impulse response g(t)
A first idea is to recover the impulse response by directly applying its definition: one applies an impulse as input to the system and measures the output:

    u(t) = \alpha\,\delta(t),
    y(t) = \alpha\,g(t) + v(t),

where $\delta(0) = 1$ and $\delta(t) = 0$ for all $t \neq 0$. This leads to the estimator

    \hat g(t) = \frac{y(t)}{\alpha}.

For a given experiment, one obtains

    \hat g(t) = \frac{y(t)}{\alpha} = \frac{\alpha g(t) + v(t)}{\alpha} = g(t) + \frac{v(t)}{\alpha}.

This method is very sensitive to noise, but can give a first approximation of the system behaviour. To diminish its noise sensitivity, one should take a value $\alpha$ as large as possible. However, too large values may often be forbidden for practical and operational reasons, or may drive the system out of the zone where it behaves approximately linearly. Another solution is to repeat the experiment and average the results. We will see in the next chapters that using more appropriate input signals may yield much more accurate results though.
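A sketch of this estimator (Python with SciPy; the system and noise level are those of the example below, and several experiments are averaged to reduce the noise sensitivity):

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(4)
num, den = [1.0, 0.25], [1.0, -1.0, 0.5]     # G(q) of the example below
sigma_v, alpha, T, n_exp = np.sqrt(0.1), 1.0, 40, 20

u = np.zeros(T)
u[0] = alpha                                  # impulse of amplitude alpha
g_hat = np.zeros(T)
for _ in range(n_exp):                        # average several experiments
    y = lfilter(num, den, u) + sigma_v * rng.standard_normal(T)
    g_hat += y / alpha / n_exp

g_true = lfilter(num, den, u / alpha)         # noiseless unit impulse response
print(np.max(np.abs(g_hat - g_true)))         # decreases when n_exp or alpha grows
```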
Example: Consider the system $y(t) = \frac{1 + 0.25q^{-1}}{1 - q^{-1} + 0.5q^{-2}} u(t) + v(t)$, with $v(t)$ a white noise of variance $\sigma_v^2 = 0.1$. Figure 2.1 illustrates the behavior of the estimate $\hat g(t)$ for three different realizations of the noise signal $v(t)$, respectively for $\alpha = 1$ and $\alpha = 10$ (that is, for signal-to-noise ratios of 10 and 1000).
Figure 2.1: Impulse response estimates $\hat g(t)$ for the system $y(t) = \frac{1 + 0.25q^{-1}}{1 - q^{-1} + 0.5q^{-2}} u(t) + v(t)$. Estimates $\hat g(t)$ for three different realizations (red) of the noise signal $v(t)$, for signal-to-noise ratios of 10 and 1000 respectively, and true impulse response $g(t)$ (black).
2.1.2 Step response
A step is a signal taking the value 1 for $t \ge 0$ and 0 else.

For various practical reasons, step inputs may be easier to apply than impulse inputs. This method allows directly measuring the step response, which is another characterization of the system, on which various important parameters can be estimated (graphically). Suppose for example that the system behaves approximately as a delayed first order (continuous-time) system $G \approx \frac{A}{1 + sT} e^{-s\tau_d}$, which is the case for many systems; then we can estimate the static gain $A$, the time constant $T$ and the delay $\tau_d$ by inspecting the step response as represented in Figure 2.2.

Figure 2.2: Representation of the static gain $A$, the time constant $T$ and the delay $\tau_d$ for a system behaving approximately as $\frac{A}{1 + sT} e^{-s\tau_d}$.

The output produced by the system when a step is applied also allows re-computing the impulse response. Indeed, apply a step of amplitude $\alpha$ as input to the system, $u(t) = \alpha$ if $t \ge 0$ and 0 else. Then $y(t) = \alpha \sum_{k=0}^{t} g(k) + v(t)$. This leads to the estimator

    \hat g(t) = \frac{y(t) - y(t-1)}{\alpha} = g(t) + \frac{v(t) - v(t-1)}{\alpha}.

This method suffers from the same accuracy problems as the impulse response approach (and actually slightly worse problems for some disturbances, due to the addition of $v(t)$ and $v(t-1)$). Repeating the experiments or increasing $\alpha$ would help improve accuracy, but these are again not the most effective solutions.
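A minimal sketch of the step-based estimator (Python; the impulse response $g(\tau) = 0.8^{\tau}$ is the one used in Figure 2.3, but with additive noise instead of quantization):

```python
import numpy as np

rng = np.random.default_rng(5)
T, alpha, sigma_v = 40, 5.0, 0.1
g = 0.8 ** np.arange(T)                       # true impulse response

y = alpha * np.cumsum(g) + sigma_v * rng.standard_normal(T)   # noisy step response

# g_hat(t) = (y(t) - y(t-1)) / alpha, with y(-1) taken as 0
g_hat = np.diff(y, prepend=0.0) / alpha
print(np.max(np.abs(g_hat - g)))              # error of order sigma_v / alpha
```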
In this context, surprising errors may also occur even in the absence of "classical" noise. For example, the measurements may only be available with a very limited precision (e.g., they can only take integer values); this kind of error is standardly referred to as quantization error. In that case, very surprising effects can be observed in the system response estimates, especially when the quantization of the measurements is too coarse compared to the sampling frequency and the dynamics of the system.

Suppose for simplicity that $v = 0$, and that the quantization levels are the integers: only the quantized versions $q(y(t))$ of the output are available, where $q(y(t))$ is the integer value that is the closest to $y(t)$. If the sampling frequency is very high with respect to the dynamics of the system, then $y(t)$ evolves very slowly, and it will take several steps to change by more than 1. The quantized measurements $q(y(t))$ then remain constant between most of the steps, and only occasionally change by 1 or $-1$. The step response remains well approximated, as represented in Figure 2.3(a). However, if one computes the impulse response using $\hat g(t) = \frac{q(y(t)) - q(y(t-1))}{\alpha}$, one will obtain $\hat g(t) = 0$ for most times, and only occasionally $\hat g(t) = \pm 1/\alpha$ (or larger values if the system moves fast at some time), which will look significantly different from the real impulse response, as shown in Figure 2.3(b). In particular, it may give the impression that the system temporarily "stops" several times. Unlike the errors due to classical noise, errors caused by quantization will not necessarily be cancelled by averaging the results of several experiments; quantization noise is indeed not independent (and in fact very strongly dependent) of the value $y(t)$, and does therefore not satisfy the assumptions made in most of these lecture notes.

Similar issues may occur for various system identification methods, and should be dealt with using appropriate filters, and/or by an appropriate selection of the sampling frequency.
2.1.3 Correlation analysis
The previous methods relied on the use of specific inputs. We now see a method allowing to recompute the impulse response from the result of any experiment, although some particular input signals will make things easier.

Let us assume, for the simplicity of the exposition, that the input signal $u$ is stationary. Remember the definition of the cross-correlation function (1.11).
Figure 2.3: Quantization error for the system with impulse response $g(\tau) = (0.8)^{\tau}$ when $\tau \ge 0$ and $g(\tau) = 0$ otherwise. The quantization step is set to 0.25. Step response (top, red) and quantized step response (top, blue); impulse response (bottom, red) and impulse response computed from the quantized step response (bottom, blue).
There holds

    r_{yu}(\tau) = E[y(t)\, u(t-\tau)]
                 = E[(G(q)u(t) + v(t))\, u(t-\tau)]
                 = E\left[\left(\sum_{k=0}^{\infty} g(k)\, u(t-k) + v(t)\right) u(t-\tau)\right]
                 = \sum_{k=0}^{\infty} g(k)\, E[u(t-k)\, u(t-\tau)] + E[u(t-\tau)\, v(t)].

Assume that the disturbance $v(t)$ is uncorrelated with the value of the input $u(t-\tau)$. This assumption is a very weak one if the system is controlled in open loop, but will usually not be satisfied in closed loop (especially if the disturbance is not a white noise), as the input may then depend on previous values of the output, which have typically been affected by the disturbance. Then, there holds

    r_{yu}(\tau) = \sum_{k=0}^{\infty} g(k)\, E[u(t-k)\, u(t-\tau)].

The definition of the correlation function introduced after Proposition 1.1 implies then that

    r_{yu}(\tau) = \sum_{k=0}^{\infty} g(k)\, r_u(k - \tau).

Suppose that we have estimators $\hat r_{yu}$ and $\hat r_u$ of the cross-correlation $r_{yu}$ and the correlation $r_u$. Then this equality suggests using the following estimator:

    \hat r_{yu}(\tau) = \sum_{k=0}^{\infty} \hat g(k)\, \hat r_u(k - \tau).   (2.1)

Computing $\hat g$ in this way appears impractical in general, as it involves solving a system of equations of potentially infinite dimension. We present two situations in which (2.1) can be used.
Case 1: Finite impulse response (FIR)

A system has a finite impulse response if there exists $T$ such that $g(k) = 0$ for $k \ge T$. If the system is known (or assumed) to be FIR, then (2.1) becomes a system of $T$ linear equations, for which the solution can be computed (conditions under which the solution exists and is unique will be derived later). Typical estimators for $\hat r_{yu}$ and $\hat r_u$ used in that case are:

    \hat r_{yu}(\tau) = \frac{1}{N} \sum_{t=\tau}^{N} y(t)\, u(t-\tau),   (2.2)

    \hat r_u(\tau) = \frac{1}{N} \sum_{t=\tau}^{N} u(t)\, u(t-\tau).   (2.3)

If the system is not FIR but its impulse response $g(\tau)$ approaches 0 for large $\tau$, one can approximate it by a FIR system by truncating the impulse response, i.e. by considering $g'(\tau) = g(\tau)$ for $\tau < T$ and $g'(\tau) = 0$ else, for some $T$ after which $g(\tau)$ is very small. This is always feasible for some $T$ when $g$ is stable.
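In the FIR case, (2.1) for $\tau = 0, \dots, T-1$ is a linear system whose $(\tau, k)$ coefficient is $\hat r_u(k - \tau)$; a sketch of its direct resolution (Python, with an arbitrary length-3 FIR system and a filtered-noise input):

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(6)
N, T = 20_000, 3
g_true = np.array([1.0, 0.5, 0.25])                         # FIR impulse response

u = lfilter([1.0], [1.0, -0.7], rng.standard_normal(N))     # a non-white input
y = lfilter(g_true, [1.0], u) + 0.1 * rng.standard_normal(N)

def r_hat(a, b, tau):
    """Estimate of E[a(t) b(t - tau)], tau >= 0, as in (2.2)-(2.3)."""
    return np.mean(a[tau:] * b[:N - tau])

ru = np.array([r_hat(u, u, k) for k in range(T)])
R = np.array([[ru[abs(k - tau)] for k in range(T)] for tau in range(T)])
ryu = np.array([r_hat(y, u, tau) for tau in range(T)])
print(np.linalg.solve(R, ryu))                               # close to [1.0, 0.5, 0.25]
```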
Case 2: White noise input

If the input $u$ is a white noise of known variance $\sigma^2$, then $r_u(0) = \sigma^2$, and $r_u(\tau) = 0$ if $\tau \neq 0$. We have then

    r_{yu}(\tau) = \sum_{k=0}^{\infty} g(k)\, \sigma^2 \delta(k - \tau) = g(\tau)\, \sigma^2.

This leads to the estimator $\hat g(\tau) = \frac{\hat r_{yu}(\tau)}{\sigma^2}$, for which one can prove (see Appendix 3.2 in [2]) that
    E\left[(\hat g(k) - g(k))^2\right] \simeq \frac{r_v(0)}{N\sigma^2} + \frac{1}{N}\sum_{\tau=0}^{\infty} g(\tau)^2 + \frac{1}{N}\sum_{\tau=-k}^{k} g(\tau+k)\, g(k-\tau) - \frac{2}{N}\, g(k)^2.
The mean square error thus decays to 0 as $O(1/N)$. Observe that it remains positive even in the absence of noise.
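A sketch of the white-noise-input estimator $\hat g(\tau) = \hat r_{yu}(\tau)/\sigma^2$ (Python; the system is the one of the example of Section 2.1.1 and the noise level is arbitrary):

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(7)
N, sigma2 = 50_000, 1.0
u = np.sqrt(sigma2) * rng.standard_normal(N)                 # white-noise input
y = lfilter([1.0, 0.25], [1.0, -1.0, 0.5], u) + 0.3 * rng.standard_normal(N)

def ryu_hat(tau):
    return np.mean(y[tau:] * u[:N - tau])                    # estimator (2.2)

g_hat = np.array([ryu_hat(tau) / sigma2 for tau in range(20)])
g_true = lfilter([1.0, 0.25], [1.0, -1.0, 0.5], np.r_[1.0, np.zeros(19)])
print(np.max(np.abs(g_hat - g_true)))                        # shrinks roughly as 1/sqrt(N)
```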
2.2 Frequency domain analysis
We now focus on frequency domain methods, which estimate the transfer function of the system. In this context, we represent our dynamical system by

    Y(e^{i\omega}) = G(e^{i\omega})\, U(e^{i\omega}) + V(e^{i\omega}), \quad \text{with } E\,V(e^{i\omega}) = 0.
2.2.1 Single frequency
Idea
Exercise 2.1. Check that if the input is

    u(t) = \alpha \cos(\omega t),

then the output will be

    y(t) = \alpha \left|G(e^{i\omega})\right| \cos(\omega t + \varphi) + v(t),

where $\varphi = \arg G(e^{i\omega})$, neglecting transitory phenomena or assuming that the input has been periodic for a long time.

The idea of the frequency analysis is to recover $G(e^{i\omega})$ for a given $\omega$ by comparing the input $u(t) = \alpha\cos(\omega t)$ and the output signal, and to perform a new experiment for each $\omega$ for which the value of $G(e^{i\omega})$ needs to be known.
Estimation of $G(e^{i\omega})$
Inspecting the output $y$ and comparing it to the input may give an approximate idea of the gain $|G(e^{i\omega})|$ and the phase $\varphi = \arg G(e^{i\omega})$. One convenient way of doing this is to plot $u$ on the x axis and $y$ on the y axis, in the so-called Lissajous curve, as in the examples presented in Figure 2.4 and Figure 2.5. These curves are ellipses for linear systems. The gain $|G(e^{i\omega})|$ is recovered by comparing their extension along $y$ with their extension along $x$, while the phase angle $\varphi$ can be measured by comparing the long and short axes of the ellipse. This method presented, in the past, the advantage of being implementable on an analog oscilloscope.

Exercise 2.2. Derive an exact expression of the gain $|G(e^{i\omega})|$ and the phase angle $\varphi$ based on the characteristics of the Lissajous plot.

These methods based on a direct graphical estimation of the gain and phase are however not very accurate. More importantly, they require (a priori) some human intervention, and they do not provide a simple way of using longer experiments to gain accuracy. We now describe an efficient way of estimating these parameters.
Observe that

    e^{-i\omega t} y(t) = e^{-i\omega t} \left(\alpha \left|G(e^{i\omega})\right| \cos(\omega t + \varphi) + v(t)\right)
                        = \alpha \left|G(e^{i\omega})\right| e^{-i\omega t}\, \frac{1}{2} \left(e^{i(\omega t + \varphi)} + e^{-i(\omega t + \varphi)}\right) + e^{-i\omega t} v(t)
                        = \frac{1}{2} \alpha \left|G(e^{i\omega})\right| e^{i\varphi} \left(1 + e^{i(-2\omega t - 2\varphi)}\right) + e^{-i\omega t} v(t)
                        = \frac{1}{2} \alpha\, G(e^{i\omega}) + \frac{1}{2} \alpha\, G(e^{i\omega})\, e^{i(-2\omega t - 2\varphi)} + e^{-i\omega t} v(t).
Figure 2.4: Lissajous curves for three noiseless continuous linear systems ($v(t) = 0$); the input is $u(t) = \cos(\omega t)$, and the outputs have the form $y(t) = |G(e^{i\omega})| \cos(\omega t + \arg G(e^{i\omega}))$. For some particular choice of $\omega$, the three systems have $|G(e^{i\omega})| = 10$ and respectively $\arg G(e^{i\omega}) = 0$ (left), $\arg G(e^{i\omega}) = \pi/4$ (middle) and $\arg G(e^{i\omega}) = \pi/2$ (right).
The quantity $e^{-i\omega t} y(t)$ is thus the sum of $\frac{\alpha}{2} G(e^{i\omega})$, a term oscillating with frequency $2\omega$, and some noise. The idea of the method is to average out the oscillating term and the noise by computing

    I = \frac{1}{N} \sum_{t=1}^{N} e^{-i\omega t} y(t)
      = \frac{\alpha}{2} G(e^{i\omega}) + \frac{\alpha}{2} G(e^{i\omega}) \underbrace{\frac{1}{N} \sum_{t=1}^{N} e^{i(-2\omega t - 2\varphi)}}_{\to 0} + \frac{1}{N} \sum_{t=1}^{N} e^{-i\omega t} v(t).   (2.4)

The second term tends to 0 (and is actually exactly 0 when $N$ is a multiple of $\frac{2\pi}{\omega}$, unless $\omega = 0$ or $\pm\pi$), and so does the third term if the disturbance $v(t)$ does not contain a pure periodic component at frequency $\omega$.
Figure 2.5: Lissajous curves for three noisy discrete linear systems ($v(t) \neq 0$); the input is $u(t) = \cos(\omega t)$, and the outputs have the form $y(t) = |G(e^{i\omega})| \cos(\omega t + \arg G(e^{i\omega})) + v(t)$. For some particular choice of $\omega$, the three systems have $|G(e^{i\omega})| = 10$ and respectively $\arg G(e^{i\omega}) = 0$ (left), $\arg G(e^{i\omega}) = \pi/4$ (middle) and $\arg G(e^{i\omega}) = \pi/2$ (right).
Therefore, $I \to \frac{\alpha}{2} G(e^{i\omega})$, and we have the estimator

    \hat G(e^{i\omega}) = \frac{I}{\alpha/2} = \frac{2}{\alpha}\, \frac{1}{N} \sum_{t=1}^{N} e^{-i\omega t} y(t).   (2.5)
Example: Consider the system $y(t) = u(t) + 0.25u(t-1) + v(t)$ and the input signals $u_1(t) = \cos(\pi t/8)$ and $u_2(t) = 10\cos(\pi t/8)$. The estimates $\hat G(e^{i\pi/8})$ for this system are shown in Figure 2.6 and Figure 2.7.
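A sketch of estimator (2.5) on this example (Python; here the true value is $G(e^{i\pi/8}) = 1 + 0.25\,e^{-i\pi/8}$, and the number of samples is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
omega, alpha, N = np.pi / 8, 1.0, 200
t = np.arange(1, N + 1)

u = alpha * np.cos(omega * t)
y = u + 0.25 * alpha * np.cos(omega * (t - 1)) + 0.1 * rng.standard_normal(N)

G_hat = (2.0 / alpha) * np.mean(np.exp(-1j * omega * t) * y)   # estimator (2.5)
G_true = 1.0 + 0.25 * np.exp(-1j * omega)
print(abs(G_hat), np.angle(G_hat))      # close to the true gain and phase below
print(abs(G_true), np.angle(G_true))
```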
Link with Fourier analysis
The term $\frac{1}{N}\sum_{t=1}^{N} e^{-i\omega t} y(t)$ is equal, up to a factor $\frac{1}{\sqrt{N}}$, to the DFT of the output $y$ taken between 1 and $N$, $Y_N(\omega) = \frac{1}{\sqrt{N}}\sum_{t=1}^{N} e^{-i\omega t} y(t)$.
Figure 2.6: Estimates of $\hat G(e^{i\omega})$ (for $\omega = \pi/8$) as a function of the number of samples for the input signal $u_1(t) = \cos(\pi t/8)$. True value (black), estimate (blue) and estimate using only $N = 2k\pi/\omega$ (red).
Assuming for simplicity that the excitation frequency $\omega^*$ (i.e. the frequency of the input $u(t) = \alpha\cos(\omega^* t)$) is an integer multiple of $\frac{2\pi}{N}$, one can verify that

    U_N(\omega) = \frac{\alpha\sqrt{N}}{2} \ \text{for } \omega = \omega^*, \quad \text{and } 0 \ \text{for the other values}.

It follows then from equation (2.5) that

    \hat G(e^{i\omega}) = \frac{Y_N(\omega)}{U_N(\omega)}.   (2.6)

This estimator is thus a particular case of the Fourier analysis that we will see in the next section, and error bounds for (2.5) will thus follow directly from the general results described there.
Figure 2.7: Estimates of $\hat G(e^{i\omega})$ (for $\omega = \pi/8$) as a function of the number of samples for the input signal $u_2(t) = 10\cos(\pi t/8)$. True value (black), estimate (blue) and estimate using only $N = 2k\pi/\omega$ (red).
2.2.2 Fourier Analysis
The estimator derived in the previous section provided an estimate of the transfer function at one single frequency. The systems considered being linear, the superposition principle suggests that one could work simultaneously on several frequencies. Since $Y(e^{i\omega}) = G(e^{i\omega})\, U(e^{i\omega}) + V(e^{i\omega})$, a natural estimator of $G$ would be

    \hat G_N(e^{i\omega}) = \frac{Y_N(\omega)}{U_N(\omega)}.
We have seen in equation (1.6) in Section 1.1.5 that the values of $U_N$ and $Y_N$ at the frequencies $k\frac{2\pi}{N}$, for $k = 1, \dots, N$, contain all the information about $U_N$ and $Y_N$, so we will restrict our analysis to these frequencies (half of them would actually already be sufficient due to the symmetry properties of $U_N$ and $Y_N$). Observe that this estimator is only defined at those frequencies for which $U_N$ is nonzero, i.e. the frequencies present in the input signal. The procedure thus provides information about up to $N$ values of the transfer function. We will see that it may very well provide information about fewer than $N$ frequencies, in case $u$ is periodic for example.
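A sketch of this estimator, often called the empirical transfer function estimate (Python; the first-order system and the multisine input, whose frequencies lie exactly on the DFT grid, are arbitrary choices):

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(9)
N = 1024
t = np.arange(N)
ks = [8, 32, 64, 128]                              # excited DFT bins
u = sum(np.cos(2 * np.pi * k * t / N) for k in ks)
y = lfilter([0.5], [1.0, -0.8], u) + 0.05 * rng.standard_normal(N)

U = np.fft.fft(u) / np.sqrt(N)
Y = np.fft.fft(y) / np.sqrt(N)
for k in ks:
    omega = 2 * np.pi * k / N
    G_hat = Y[k] / U[k]                            # defined only where U_N != 0
    G_true = 0.5 / (1 - 0.8 * np.exp(-1j * omega))
    print(k, abs(G_hat - G_true))                  # small at each excited frequency
```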
Let us now analyze the error made by this estimator, assuming $G$ is strictly stable. It follows from Theorem 1.1 that

    Y_N(\omega) = G(e^{i\omega})\, U_N(\omega) + R_N(\omega) + V_N(\omega),

where $V_N$ is the discrete Fourier transform of the disturbance signal, and $R_N$ accounts for the possible effect of the inputs $u(t)$ for $t \le 0$, and satisfies $|R_N(\omega)| \le O(1/\sqrt{N})$. (Note that in the particular case of signals having been periodic since $-\infty$ and whose period is a factor of $N$ (i.e. $k \cdot \text{period} = N$ for some integer $k$), one can verify that $R_N = 0$.)

We have therefore

    \hat G_N(e^{i\omega}) = G(e^{i\omega}) + \frac{R_N}{U_N} + \frac{V_N}{U_N}.

Since the disturbance $V$ is assumed to have a zero mean, this implies that the bias of the estimator is only caused by the transitory effects of the inputs before 0:

    E\left[\hat G_N(e^{i\omega})\right] = G(e^{i\omega}) + \frac{R_N(\omega)}{U_N(\omega)}.

Since $|R_N(\omega)| \le O(1/\sqrt{N})$, the estimator is asymptotically unbiased at the frequencies for which it is defined. Besides, it can be proved (see Section 6.3 in [1]) that
E
h#
^G
N(ei!) 􀀀 G0(ei!)
# #
^G
N(ei#) 􀀀 G0(ei#)
#i
=
(
1
jUN(!)j2 (#v(!) + #2) if # = !
#2
UN(!)UN(#) else
(2.7)
where #2 = O(1=N).
We can see that errors made at di#erent frequencies are asymptotically
uncorrelated.
Moreover, the asymptotic behavior of the error variance at a
given frequency ! is
Ej^GN(ei!) 􀀀 G(ei!)j2 !
#v(!)
jUN(!)j2 when N ! 1 (2.8)
2.2. FREQUENCY DOMAIN ANALYSIS 45
that can be seen as the inverse of the signal-to-noise ratio de
ned for this
particular frequency !.
We now consider two particular cases of input signals.
Periodic input
Consider an $N_0$-periodic input signal $u(t)$, and suppose for simplicity that $N$ is a multiple of $N_0$: $N = lN_0$ for $l = 1, 2, \dots$. Then one can prove that $|U_N(\omega)|^2 \sim cN \neq 0$ if $\omega = 2\pi k/N_0$ for some integer $k \in [1, N_0]$, and $U_N(\omega) = 0$ otherwise. In this special case, $\hat G_N(e^{i\omega})$ is defined for a constant (w.r.t. $N$) number of frequencies, at most equal to $N_0$, and it follows from (2.8) that the variance at these frequencies decays as $O(1/N)$. We thus have more and more information about a finite number of frequencies.

Similar conclusions hold if $N$ is not a multiple of $N_0$, but the exposition is made more difficult by the presence of additional decaying terms, and by the spreading of the signal power onto frequencies close to the actual signal frequencies (which are not considered in the DFT). Note that this was already seen in the case of the analysis for a single frequency (see Section 2.2.1), and is usually referred to as the leakage effect (see Appendix 2.A). In summary, for a periodic input, $\hat G_N(e^{i\omega})$ is defined for a constant (w.r.t. $N$) number of frequencies, at most equal to $N_0$, and is an unbiased estimator with a variance decaying in $O(1/N)$.
Stochastic signal as input
Suppose now that $u(t)$ is a stochastic signal characterized by a power spectral density $\Phi_u(\omega)$. In this case, one can show (see Lemma 6.2 in [1]) that
\[
E|U_N(\omega)|^2 \to \Phi_u(\omega) \quad \text{when } N \to \infty.
\]
Note that this result is stronger than the claim in equation (1.10), as it concerns the actual values of $\Phi_u(\omega)$ and $E|U_N(\omega)|^2$, as opposed to the value of a scalar product (an integral) with any "sufficiently smooth" function.
Figure 2.8: Identification of the frequency response for the system $y(t) = u(t) + 0.25\,u(t-1) + v(t)$ with the input signal $u$ chosen as a sum of 5 sines. Realizations with 1 period (yellow, circles), 2 periods (red, plus) and 10 periods (blue, diamonds), and exact transfer function $G(e^{i\omega})$ (black, stars). We obtain more and more accurate estimates for the chosen set of frequencies, but no information about other frequencies.
It then follows from (2.8) that
\[
E\left|\hat G_N(e^{i\omega}) - G(e^{i\omega})\right|^2 \to \frac{\Phi_v(\omega)}{\Phi_u(\omega)} \quad \text{when } N \to \infty.
\]
Since both $\Phi_v(\omega)$ and $\Phi_u(\omega)$ are generally bounded and positive at (almost) all frequencies, this means that the quadratic error does not necessarily decay to zero when $N$ grows. On the other hand, we have an estimate of the transfer function at $N$ frequencies, since $U_N(\omega)$ is typically non-zero everywhere. An example of such an effect is presented on Figure 2.9. One could thus see the choice between periodic and stochastic inputs as a choice between having increasingly accurate information about a constant number of frequencies and having a constant level of accuracy about an increasing number of frequencies.
Figure 2.9: Identification of the frequency response for the system $y(t) = u(t) + 0.25\,u(t-1) + v(t)$ with the input signal $u$ chosen as a white noise. Realizations with $N = 10$ (black, diamonds), $N = 40$ (magenta, plus) and $N = 100$ (blue, circles), and exact transfer function $G(e^{i\omega})$ (red). The signal-to-noise ratio is kept constant; we obtain information about more and more frequencies, but the inaccuracy observed for those frequencies does not decay with $N$.
2.2.3 Spectral Analysis
The analysis in Section 2.2.2 could suggest that Fourier analysis cannot lead to asymptotically accurate results at all frequencies. One way to solve this issue is to consider a sequence of more and more complex periodic input signals. To reach a certain level of accuracy at a certain number of frequencies, one would design a periodic input signal with a period sufficiently long to have information about the desired number of frequencies, and then run the experiment for a time sufficiently long to obtain the desired level of accuracy.

Another approach comes from the realization that we have considered the identification of $G(e^{i\omega})$ at different frequencies as totally separate problems, even if the frequencies are close. In most situations, the system has a similar behavior at similar frequencies, i.e. the transfer function $G$ is smooth. The estimate of $G(e^{i\omega})$ therefore carries some information about the value of $G$ at neighboring frequencies.

In the case of periodic input signals, this means that estimates of the transfer function could be obtained at more than $N_0$ frequencies by interpolating between the frequencies $2\pi k/N_0$.
In the case of stochastic input signals, one can take advantage of the smoothness of $G$ with respect to $\omega$ to reduce the quadratic error by averaging the estimates at different neighboring frequencies. This is the idea of the so-called "spectral analysis" that we analyze in this section. It leads to the following estimator:
\[
\hat{\hat G}_N(e^{i\omega}) = \frac{\sum_{\xi} \alpha(\xi)\, \hat G_N(e^{i\xi})}{\sum_{\xi} \alpha(\xi)},
\]
where the coefficients $\alpha(\xi)$ have to be specified, and $\hat G_N(e^{i\xi})$ is the estimator obtained using Fourier analysis as described just before. The sum is over all $\xi$ integer multiples of $2\pi/N$ in some interval $[\omega_1, \omega_2]$ around $\omega$. This in effect corresponds to applying a low-pass filter to the estimate $\hat G$.
The values $\alpha(\xi)$ need not be all equal, and we have some freedom in their choice. Since the goal is to improve the accuracy of the estimator, the most natural choice would be to take the $\alpha(\xi)$ that minimize the variance of the estimator obtained, while keeping it unbiased if possible (neglecting transitory effects). It can be shown that for a set of independent random variables $X_1, X_2, \dots, X_N$ with mean $\mu$ and variances $\sigma_i^2$, the unbiased estimator given by
\[
\hat X = \frac{\sum_i \alpha_i X_i}{\sum_i \alpha_i}
\]
is of minimal variance if $\alpha_i = 1/\sigma_i^2$. So, assuming that successive values of $\hat G_N(e^{i\xi})$ are independent (which is asymptotically true, see equation (2.7)), if $G(e^{i\omega})$ were constant over $[\omega_1, \omega_2]$ we would simply obtain
\[
\alpha(\xi) = \frac{|U_N(\xi)|^2}{\Phi_v(\xi)}.
\]
Now, since $G(e^{i\omega})$ is not constant, we may want to also give more weight to the frequencies close to $\omega$. This (and the definition of the interval on which the average is taken) can be achieved by using a weighting function $W(\xi - \omega)$, which will typically take smaller values when $|\xi - \omega|$ is large, to penalize the values corresponding to frequencies $\xi$ not close enough to $\omega$. This translates to
\[
\alpha(\xi) = \frac{|U_N(\xi)|^2}{\Phi_v(\xi)}\, W(\xi - \omega).
\]
The spectral density $\Phi_v(\omega)$ of the disturbance is typically not known, but it is not unreasonable to consider it as approximately constant over the region covered by $W(\xi - \omega)$ (and there is often no other simple solution). We then obtain the estimator
\[
\hat{\hat G}_N(e^{i\omega}) = \frac{\sum_{\xi} W(\xi - \omega)\, |U_N(\xi)|^2\, \hat G_N(e^{i\xi})}{\sum_{\xi} W(\xi - \omega)\, |U_N(\xi)|^2}
= \frac{\sum_{\xi} W(\xi - \omega)\, Y_N(\xi)\, \overline{U_N(\xi)}}{\sum_{\xi} W(\xi - \omega)\, |U_N(\xi)|^2},
\]
where we have used $\hat G_N(e^{i\xi}) = Y_N(\xi)/U_N(\xi)$.
It can be shown that
\[
E\,\hat{\hat G}_N(e^{i\omega}) - G(e^{i\omega}) \approx M_W \left( \tfrac{1}{2}\, G''(e^{i\omega}) + G'(e^{i\omega})\, \frac{\Phi_u'(\omega)}{\Phi_u(\omega)} \right), \tag{2.9}
\]
where the derivatives are taken with respect to $\omega$, and
\[
E\left|\hat{\hat G}_N(e^{i\omega}) - E\,\hat{\hat G}_N(e^{i\omega})\right|^2 \approx \frac{1}{N}\, \bar W\, \frac{\Phi_v(\omega)}{\Phi_u(\omega)},
\]
with $M_W = \int_{-\pi}^{\pi} \xi^2 W(\xi)\, d\xi$ and $\bar W = \int_{-\pi}^{\pi} W^2(\xi)\, d\xi$ (assuming that $W$ is normalized so that $\int_{-\pi}^{\pi} W(\xi)\, d\xi = 1$). The averaging of neighboring values is thus effective in diminishing the error variance, which decays to 0 as $1/N$, but it introduces an asymptotic bias (see Figure 2.11, Figure 2.12 and comments below). Both the bias and the error variance depend on some characteristics of the weighting function $W$, resulting in a trade-off between these two values. It is of course possible to take a $W$ that depends on $N$ (making it narrower when $N$ grows), in order to diminish the bias while keeping a decaying error variance. With such a scheme, one can construct an asymptotically unbiased estimator for which the mean square error (which contains the bias and the error variance) tends to 0. One can prove that the optimal trade-off would lead to a mean square error decaying as $N^{-4/5}$, but this cannot actually be achieved in practical applications, as computing the function leading to the optimal trade-off requires some knowledge about several quantities unknown to the user. We refer the interested reader to [1] (Section 6.4).
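As a rough illustration of this smoothing, the following Python sketch averages the raw estimates over neighbouring DFT bins with weights $W(\xi-\omega)|U_N(\xi)|^2$; the triangular window and its width are arbitrary choices for the example, not the optimal ones discussed above.

import numpy as np

def smoothed_etfe(u, y, half_width=10):
    # Spectral-analysis estimate: weighted local average of Y_N(xi) * conj(U_N(xi))
    # over +/- half_width DFT bins, with weights W(xi - w) |U_N(xi)|^2
    N = len(u)
    U = np.fft.fft(u)
    Y = np.fft.fft(y)
    offsets = np.arange(-half_width, half_width + 1)
    W = 1.0 - np.abs(offsets) / (half_width + 1)      # hypothetical triangular window
    num = np.zeros(N, dtype=complex)
    den = np.zeros(N)
    for d, wgt in zip(offsets, W):
        # bin k accumulates the contribution of bin k+d (circular indexing)
        num += wgt * np.roll(Y * np.conj(U), -d)
        den += wgt * np.roll(np.abs(U) ** 2, -d)
    return 2 * np.pi * np.arange(N) / N, num / den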
2.2.4 Correlation approach to Fourier Analysis
It is possible to derive the estimators of the Fourier and spectral analysis by following an alternative approach based on the correlations and the spectra. It follows from Proposition 1.2 that
\[
\Phi_{yu}(\omega) = G(e^{i\omega})\, \Phi_u(\omega),
\]
with $\Phi_{yu}(\omega) = \sum_{\tau} r_{yu}(\tau) e^{-i\omega\tau}$, $r_{yu}(\tau) = \bar E\, y(t)u(t-\tau)$, and $\Phi_u(\omega) = \sum_{\tau} r_u(\tau) e^{-i\omega\tau}$, $r_u(\tau) = \bar E\, u(t)u(t-\tau)$. Based on this relation, one can define the estimator
\[
\hat G_N(e^{i\omega}) = \frac{\hat\Phi_{yu}(\omega)}{\hat\Phi_u(\omega)},
\]
where the estimates of the power spectra are obtained by using the estimators of the correlations defined in equations (2.2) and (2.3) in the correlation analysis, and taking their transforms. Some algebraic manipulations then show that
\[
\hat G_N(e^{i\omega}) = \frac{\hat\Phi_{yu}(\omega)}{\hat\Phi_u(\omega)} = \frac{Y_N(\omega)\,\overline{U_N(\omega)}}{U_N(\omega)\,\overline{U_N(\omega)}} = \frac{Y_N(\omega)}{U_N(\omega)}.
\]
This expression is exactly the same as that obtained in the Fourier analysis in Section 2.2.2, and thus has the same properties. It provides however a new interpretation of the persistent error variance of this estimator for non-periodic signals. Observe indeed that the correlation estimates for large values of $\tau$ are computed using a smaller amount of data, and are thus less accurate. Typically, the estimate $\hat r_{yu}(N)$ will be computed using only one pair of values $(y(N), u(0))$ (with an inappropriate weight). The estimate $\hat G$ derived above however gives the same weight to all $\hat r_{yu}(\tau)$, resulting in a systematic error. To correct this and improve our estimate of the transfer function, we could use a weighting function $w(\tau)$ which reduces the contribution of these inaccuracies during the power spectra estimation. So we have the new estimators
\[
\hat{\hat\Phi}_{yu}(\omega) = \sum_{\tau} w(\tau)\, \hat r_{yu}(\tau)\, e^{-i\omega\tau} = Z(r_{yu} w)(e^{i\omega}), \qquad
\hat{\hat\Phi}_u(\omega) = \sum_{\tau} w(\tau)\, \hat r_u(\tau)\, e^{-i\omega\tau} = Z(r_u w)(e^{i\omega}),
\]
where $Z(u)$ represents the Z transform of $u$. Remember that $Z(fg) = Z(f) \star Z(g)$. Therefore, denoting $Z(w)$ by $W$, we have
\[
\hat{\hat\Phi}_{yu}(\omega) = (W \star \hat\Phi_{yu})(\omega) = \big((Y_N \overline{U_N}) \star W\big)(\omega) = \sum_{\xi} W(\xi - \omega)\, Y_N(\xi)\, \overline{U_N(\xi)}
\]
and
\[
\hat{\hat\Phi}_u(\omega) = (W \star \hat\Phi_u)(\omega) = \big((U_N \overline{U_N}) \star W\big)(\omega) = \sum_{\xi} W(\xi - \omega)\, U_N(\xi)\, \overline{U_N(\xi)},
\]
which leads to the improved estimator
\[
\hat{\hat G}_N(e^{i\omega}) = \frac{\hat{\hat\Phi}_{yu}(\omega)}{\hat{\hat\Phi}_u(\omega)} = \frac{\sum_{\xi} W(\xi - \omega)\, Y_N(\xi)\, \overline{U_N(\xi)}}{\sum_{\xi} W(\xi - \omega)\, U_N(\xi)\, \overline{U_N(\xi)}},
\]
which is exactly the same as in the spectral analysis. The weighted average with weights $W$ can thus be seen either as a way to improve the estimate by taking into account the information available about the neighbouring frequencies, or as a way to weight out the correlations for large values of $\tau$.
(a) Non-parametric estimation of $G(e^{j\omega})$ using $\hat G(e^{j\omega}) = Y_N(\omega)/U_N(\omega)$ with no averaging. True transfer function $G(e^{j\omega})$ (red) and estimate (blue).

(b) Non-parametric estimation of $G(e^{j\omega})$ using $\hat G(e^{j\omega}) = Y_N(\omega)/U_N(\omega)$, improved using averaging (average between the current, 10 previous and 10 following values). True transfer function $G(e^{j\omega})$ (red) and estimate (blue).

Figure 2.10: Identification of the frequency response for the system $y(t) = u(t) + 0.25\,u(t-1) + v(t)$ with the input signal $u$ being a white noise. Comparison between the standard estimate (top) and the standard estimate with averaging (bottom).
Figure 2.11: Interpretation of the presence of the second-order derivative in the bias term from Equation (2.9). Suppose for simplicity that $G(e^{j\omega})$ is estimated by averaging just two estimates, $\hat{\hat G}(e^{j\omega}) = \frac{1}{2}\left(\hat G(e^{j(\omega+\Delta\omega)}) + \hat G(e^{j(\omega-\Delta\omega)})\right)$. Then one can easily see that a positive curvature introduces a positive bias (a tendency to overestimate $G(e^{j\omega})$), whereas a negative curvature introduces a negative bias (a tendency to underestimate $G(e^{j\omega})$). For transfer functions $G(e^{j\omega})$ whose curvature is not always of the same sign, this intuition still allows a local understanding of the bias. The figure illustrates this point on a simple quadratic function.
Figure 2.12: Interpretation of the presence of the term $G'(e^{j\omega})\,\frac{\Phi_u'(\omega)}{\Phi_u(\omega)}$ in the bias term from Equation (2.9). For simplicity, assume that one wishes to estimate the value of $G(e^{j\omega_0})$ by averaging over the estimates $\hat G(e^{j(\omega_0+\Delta\omega)})$ and $\hat G(e^{j(\omega_0-\Delta\omega)})$. Assuming that the weighting function $W$ is even (i.e., $W(\Delta\omega) = W(-\Delta\omega)$), the use of the averaging coefficients $\alpha(\xi) = |U_N(\xi)|^2 W(\xi-\omega)$ would give more weight to the point $\omega_0 - \Delta\omega$ in the estimation of $\hat{\hat G}(e^{j\omega_0})$ in a case such as the one presented on this figure. The first-order term for estimating the expectation in Equation (2.9) then predicts a larger weight $\alpha(\omega_0-\Delta\omega)$ than $\alpha(\omega_0+\Delta\omega)$ for the computation of the transfer function at $\omega_0$, resulting in a negative bias (because $G(e^{j(\omega_0-\Delta\omega)}) \approx G(e^{j\omega_0}) + G'(e^{j\omega_0})(-\Delta\omega)$ is smaller than $G(e^{j\omega_0})$ and has more weight than $G(e^{j(\omega_0+\Delta\omega)}) \approx G(e^{j\omega_0}) + G'(e^{j\omega_0})(\Delta\omega)$).
Appendix
2.A Leakage
In this section, we briefly remind the reader of the so-called leakage effect, which occurs when one computes a DFT or Z-transform of a periodic signal based on a sample whose length is not an integer multiple of its period.

Consider a signal $x_{\mathrm{original}}(n)$ defined for $n \in (-\infty, \infty)$, and suppose we only observe it on a finite time window$^2$ $[-n_0, n_0]$. This can be modeled by supposing that we only observe the finite version $x(n)$ of $x_{\mathrm{original}}(n)$, with $x(n) = x_{\mathrm{original}}(n)$ for $n \in [-n_0, n_0]$ and $x(n) = 0$ otherwise. Formally, $x(n)$ can be expressed as a product in the time domain:
\[
x(n) = x_{\mathrm{original}}(n)\, \mathrm{window}_{n_0}(n),
\]
where $\mathrm{window}_{n_0}(n)$ is the function taking the value 1 when $n \in [-n_0, n_0]$ and 0 otherwise. This product in the time domain corresponds to a convolution in the frequency domain: denoting respectively $X_{\mathrm{original}}(\omega)$, $X(\omega)$ and $W_{n_0}(\omega)$ the discrete Fourier transforms of the signals $x_{\mathrm{original}}(n)$, $x(n)$ and $\mathrm{window}_{n_0}(n)$, one can express the spectrum of $x(n)$ as
\[
X(\omega) = X_{\mathrm{original}}(\omega) \star W_{n_0}(\omega), \tag{2.10}
\]
where it can be computed that $W_{n_0}(\omega) = \frac{\sin(\omega(n_0 + \frac{1}{2}))}{\sin(\omega/2)} = \frac{\sin(\omega N/2)}{\sin(\omega/2)}$, since the total number of samples is $N = 2n_0 + 1$. (This function is closely related to its counterpart for continuous-time systems, $\mathrm{sinc}(x) = \sin(x)/x$; it is illustrated on Figure 2.A.1; note however some differences, such as $W_{n_0}(0) = N$, whereas $\mathrm{sinc}(0) = 1$.) The effect of this convolution with the cardinal-sine-like function is illustrated on Figure 2.A.2. Observe that it tends to asymptotically disappear when $N$ grows (the error due to it diminishes in $O(1/N)$, see Equation (2.4)), and is therefore less and less relevant as the number of periods of the observed signal grows.

$^2$We consider a symmetric interval for simplicity; the reader can perform the same exercise for the shifted version of the interval $[0, N]$ (note that it is only necessary to introduce the shift into the frequency domain, that is, to multiply the DFT by $e^{j\omega n_0}$). For the remainder of the computations in this section, this factor does not change the conclusions, as it will only be evaluated at $\omega = 2k\pi/N$, for which $e^{j\omega n_0} = 1$.
Figure 2.A.1: Shape of the $\frac{\sin(\omega N/2)}{\sin(\omega/2)}$ function, which plays an important role in the leakage effect.

We can verify that no leakage effect occurs for
a periodic $x_{\mathrm{original}}(n)$ of period $T_{\mathrm{original}}$ if one chooses $N = \kappa\, T_{\mathrm{original}}$ (with $\kappa = 1, 2, \dots$). Indeed, further developing the convolution product (2.10) leads to
\[
X(\omega) = X_{\mathrm{original}}(\omega) \star W_{n_0}(\omega) = \int_{-\infty}^{\infty} X_{\mathrm{original}}(\xi)\, W_{n_0}(\omega - \xi)\, d\xi.
\]
Now, assuming that $N = \kappa\, T_{\mathrm{original}}$, we directly have that $\omega_{\mathrm{original}} = 2\pi\kappa/N$ and therefore the previous integral can be expressed as a sum (the other terms are identically equal to zero, because the only possible non-zero components of the spectrum of $X_{\mathrm{original}}$ are multiples of $\omega_{\mathrm{original}}$):
\[
X\!\left(\frac{2k\pi}{N}\right) = \sum_{l=-\infty}^{\infty} X_{\mathrm{original}}\!\left(\frac{2l\pi}{N}\right) W_{n_0}\!\left(\frac{2k\pi}{N} - \frac{2l\pi}{N}\right).
\]
Figure 2.A.2: Example of the leakage effect on a simple sine signal. The blue spectrum is obtained by sampling the sine over an entire number of periods, whereas the red spectrum is obtained by sampling the sine over a fractional number of periods.
There remains to note that $W_{n_0}\!\left(\frac{2k\pi}{N} - \frac{2l\pi}{N}\right) = W_{n_0}\!\left((k-l)\frac{2\pi}{N}\right)$ is non-zero only for $k = l$, in which case $W_{n_0}\!\left((k-l)\frac{2\pi}{N}\right) = W_{n_0}(0) = N$ (we are only interested in the interval $k \in [0, N-1]$). Therefore, the previous expression becomes
\[
X\!\left(\frac{2k\pi}{N}\right) = N\, X_{\mathrm{original}}\!\left(\frac{2k\pi}{N}\right),
\]
which proves that no leakage occurs.
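The leakage effect is easy to reproduce numerically; a small Python sketch (the sine period and the lengths are arbitrary choices for the example):

import numpy as np

def dft_magnitude(x):
    N = len(x)
    return np.abs(np.fft.fft(x)) / np.sqrt(N)

T = 20                                   # period of the sine, in samples
n = np.arange(0, 10 * T)                 # 10 full periods: no leakage
x_full = np.sin(2 * np.pi * n / T)
n2 = np.arange(0, 10 * T + T // 2)       # 10.5 periods: leakage
x_frac = np.sin(2 * np.pi * n2 / T)

print(np.sort(dft_magnitude(x_full))[-4:])   # essentially two nonzero bins (+/- frequency)
print(np.sort(dft_magnitude(x_frac))[-4:])   # energy smeared over several neighbouring bins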
Chapter 3

Regressions and Instrumental variables
In this chapter, we introduce and analyze two relatively simple parametric methods for system identification. The use of these methods allows us to introduce and instantiate different concepts abundantly used in the next chapters, among others concerning the information content of signals. It will be seen that these methods are particular cases of more general ones, which are introduced in Chapter 6 and analyzed in Chapters 7 and 8.
3.1 Introduction and Parametric Identification

In (most of) the remainder of these lecture notes, we focus on parametric system identification. Unlike nonparametric methods, which aim at directly estimating the transfer function or impulse response, parametric system identification consists in selecting the model that best represents the system within a given class of models, which can typically be represented by a small number of parameters. Parametric methods yield models with a simple and concise description, which are much easier to handle and use than a simple series of numbers. They also allow specifying in advance the model complexity that one is willing
to use. Supposing for example that a system behaves approximately as a second-order model, one could use a parametric method to find the best second-order approximation of the real system. Finally, the restricted complexity of the models considered offers some natural protection against the effect of noise in the estimation of the transfer function, as it will typically forbid transfer functions with rapid variations such as in the example provided by Figure 2.10 of the previous section.

Parametric system identification comes however with new challenges, such as the following ones, which can already be foreseen at this stage:

(i) How to pick the best model in the class considered (and what does "best" mean)?

(ii) How to select the class of models? If the aim is to recover a real system, how can one take a sufficiently large class of models without it being too large, and are there any troubles associated with a class being too large?

(iii) How does one represent the class of models? Does the parametrization influence the results?

We will actually discover other challenges, for example linked to the relation between the complexity of the signals and the complexity of the models that can be considered.
3.2 Linear Regressions
To introduce the first parametric method, we start with some examples.

Example 3.1. We consider the class of models of the form
\[
y(t) = u(t) + a\, u(t-1) + v(t), \tag{3.1}
\]
where $v$ is a zero-mean disturbance, and we want to estimate the parameter $a$ that best fits some experimental signals. Since nothing is known about $v$, an a priori natural method is to find the value $a$ for which the errors that would have been made by approximating $y(t)$ by $u(t) + a\, u(t-1)$ are as small as possible. Using the standard approach of minimizing the sum of the squared errors, we obtain the following estimator:
\[
\hat a_N = \arg\min_a V(a) := \frac{1}{N}\sum_{t=1}^{N} \left(u(t) + a\, u(t-1) - y(t)\right)^2.
\]
Note that this method corresponds to selecting the value of $a$ for which the disturbance or unmodeled phenomenon $v$ has the smallest norm. Differentiating $V$ with respect to $a$, we obtain
\[
\frac{\partial V}{\partial a} = \frac{2}{N}\sum_{t=1}^{N} \left(u(t) + a\, u(t-1) - y(t)\right) u(t-1),
\]
and therefore
\[
\hat a_N = \frac{\frac{1}{N}\sum_{t=1}^{N} \left(y(t) - u(t)\right) u(t-1)}{\frac{1}{N}\sum_{t=1}^{N} u(t-1)^2}.
\]
A first measure of quality for an estimator is its consistency, i.e., its ability to asymptotically recover the correct values of the parameters if the signals are generated by a system which is really of the form that was assumed, i.e. here a system of the form (3.1). The evolution with $N$ of $\hat a_N$ and its convergence to the real parameter $a_0$ in the example in Figure 3.1 suggests that $\hat a_N \to a_0$. Let us now show that this is indeed the case.
Figure 3.1: Estimation of the parameter $a$ with the least-squares approach for the system $y(t) = u(t) + a\, u(t-1) + v(t)$ with $a = 0.5$. True value of $a$ (dashed, blue) and estimate $\hat a$ (red).
Assuming that all signals are quasi-stationary and ergodic and that $\bar E\, u(t-1)^2 \neq 0$ (if the latter condition does not hold, the input becomes asymptotically irrelevant, and the estimate may not converge; its value for large $N$ is then driven by the disturbance), $\hat a_N$ converges in probability to
\[
\hat a = \frac{\bar E\, y(t)u(t-1) - \bar E\, u(t)u(t-1)}{\bar E\, u(t-1)^2}, \tag{3.2}
\]
independently of whether or not $y$ is generated by a system of the form (3.1). Supposing now that $y(t) = u(t) + a_0 u(t-1) + v(t)$ indeed holds for some unknown $a_0$ and some zero-mean disturbance $v$, we have
\[
\bar E\, y(t)u(t-1) - \bar E\, u(t)u(t-1) = \bar E\, \left(u(t) + a_0 u(t-1) + v(t)\right) u(t-1) - \bar E\, u(t)u(t-1) = a_0\, \bar E\, u(t-1)^2 + \bar E\, v(t)u(t-1).
\]
The last term is zero because the input and the noise are uncorrelated$^1$. It then follows from (3.2) that $\hat a = a_0$. The estimator built is thus consistent in this case, as it allows asymptotically recovering the correct value of the parameters when the signals are of the form that was assumed. Note that this last computation only makes sense in the context of a theoretical analysis of the estimator, but cannot be performed in "real life" since the value of $a_0$ is unknown to the experimenter.
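A minimal Python simulation of Example 3.1 (the noise level, sample size and seed are arbitrary choices) illustrating the consistency of this least-squares estimate:

import numpy as np

rng = np.random.default_rng(1)
a0, N = 0.5, 10000
u = rng.standard_normal(N + 1)
v = 0.3 * rng.standard_normal(N + 1)
t = np.arange(1, N + 1)
y = u[t] + a0 * u[t - 1] + v[t]

# Closed-form least-squares estimate of a (cf. the expression above)
a_hat = np.sum((y - u[t]) * u[t - 1]) / np.sum(u[t - 1] ** 2)
print(a_hat)   # approaches a0 = 0.5 as N grows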
Minimizing $V$ asymptotically leads to a correct estimate of the model in this first example. The next example shows that this is not always the case if the output $y(t)$ depends on previous values of $y$.

Example 3.2. Consider the set of models of the form
\[
y(t) = u(t) + b\, y(t-1) + v(t),
\]
where $v(t)$ is again a zero-mean disturbance. Minimizing the sum of the squares of the errors made when approximating $y(t)$ by $u(t) + b\, y(t-1)$ leads to the estimator
\[
\hat b_N = \frac{\frac{1}{N}\sum_{t=1}^{N} \left(y(t) - u(t)\right) y(t-1)}{\frac{1}{N}\sum_{t=1}^{N} y(t-1)^2}. \tag{3.3}
\]
$^1$This assumption may not hold if the system is controlled in closed loop.
(a) Least-squares estimate (red) and true value of the parameter (dashed, blue) in the presence of a colored noise $v(t)$.

(b) Least-squares estimate (red) and true value of the parameter (dashed, blue) in the presence of a white noise $v(t)$.

Figure 3.2: Estimation of the parameter $b$ with the least-squares approach for the system $y(t) = u(t) + b\, y(t-1) + v(t)$ with $b = 0.5$. True value of $b$ (dashed, blue) and estimates $\hat b$ (red); the experiment fails (there is a bias) when $v(t)$ is a colored noise (top), and succeeds when $v(t)$ is a white noise (bottom).
The evolution of $\hat b_N$ with $N$ in Figure 3.2 suggests that $\hat b_N$ may or may not converge to the right value when $y$ was generated by a system of the correct form $y(t) = u(t) + b_0 y(t-1) + v(t)$.
Let us further analyze this issue. Under suitable conditions, the estimator (3.3) converges in probability to
\[
\hat b = \frac{\bar E\, y(t)y(t-1) - \bar E\, u(t)y(t-1)}{\bar E\, y(t-1)^2}. \tag{3.4}
\]
To check the consistency of this estimator, suppose now that the signals are really generated by a system $y(t) = u(t) + b_0 y(t-1) + v(t)$, for some $b_0$. Then, there holds
\[
\bar E\, y(t)y(t-1) - \bar E\, u(t)y(t-1) = \bar E\, \left(u(t) + b_0 y(t-1) + v(t)\right) y(t-1) - \bar E\, u(t)y(t-1) = b_0\, \bar E\, y(t-1)^2 + \bar E\, v(t)y(t-1),
\]
so that
\[
\hat b = b_0 + \frac{\bar E\, y(t-1)v(t)}{\bar E\, y(t-1)^2}.
\]
Unlike in Example 3.1, the second term is not necessarily 0, as $y(t-1)$ depends on $v(t-1)$, which may be correlated with $v(t)$. Besides, it also depends on previous values of $y$, which may also be correlated with $v(t)$ for the same reason. In the example taken in Figure 3.2(a), the disturbance is of the form $v(t) = e(t) + c\, e(t-1)$ where $e$ is a white noise of variance $\sigma_e^2$, and the system is controlled in open loop, so that $\bar E\, u(t)v(s) = 0$. There holds then
\[
\bar E\, y(t-1)v(t) = \bar E\, \Big( b_0 y(t-2)\left(e(t) + c\, e(t-1)\right) + \left(e(t-1) + c\, e(t-2)\right)\left(e(t) + c\, e(t-1)\right) \Big) = c\, \sigma_e^2,
\]
so that $\hat b = b_0 + \dfrac{c\, \sigma_e^2}{\bar E\, y^2}$. The estimator is thus not consistent in that case, as it converges to a value different from the real one. Note however that this asymptotic error approaches 0 when the signal-to-noise ratio is large.

On the other hand, if the disturbance $v(t)$ is a white noise $e(t)$, as in the example in Figure 3.2(b), then there holds
\[
\bar E\, y(t-1)v(t) = \bar E\, b_0 y(t-2)e(t) + \bar E\, e(t-1)e(t) = 0.
\]
The estimator thus asymptotically recovers (under suitable conditions on $u$) the correct value.
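The two situations of Example 3.2 can be reproduced with a short Python simulation (the coefficient $c = 0.8$, sample size and seed are arbitrary choices); the bias $c\sigma_e^2/\bar E\, y^2$ appears only in the colored-noise case.

import numpy as np

def estimate_b(u, v, b0=0.5, N=100000):
    # simulate y(t) = u(t) + b0 y(t-1) + v(t) and return the least-squares estimate of b
    y = np.zeros(N)
    for t in range(1, N):
        y[t] = u[t] + b0 * y[t - 1] + v[t]
    return np.sum((y[1:] - u[1:]) * y[:-1]) / np.sum(y[:-1] ** 2)

rng = np.random.default_rng(2)
N = 100000
u = rng.standard_normal(N)
e = rng.standard_normal(N)

print(estimate_b(u, e, N=N))                          # white noise: close to 0.5
v_col = e + 0.8 * np.concatenate(([0.0], e[:-1]))     # colored: v(t) = e(t) + 0.8 e(t-1)
print(estimate_b(u, v_col, N=N))                      # biased away from 0.5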
Remark 3.1. Note that $\bar E\, y^2(t)$ can be relatively easily computed in the previous example using the fact that, by the definition (1.12) of the long-term average, $\bar E\, y^2(t) = \bar E\, y^2(t-1)$. Assuming for simplicity that the input $u$ is a white noise, we have here for example
\[
y(t)^2 = u^2(t) + 2b_0 u(t)y(t-1) + 2u(t)v(t) + b_0^2 y^2(t-1) + 2b_0 y(t-1)v(t) + v^2(t),
\]
and thus
\[
\bar E\, y^2(t) - b_0^2\, \bar E\, y^2(t-1) = \bar E\left(u^2(t) + 2b_0 y(t-1)v(t) + v^2(t)\right),
\]
where we use the independence of $u$ and $v$. Since $\bar E\, y^2(t) = \bar E\, y^2(t-1)$, we obtain
\[
(1 - b_0^2)\, \bar E\, y^2(t) = \bar E\left(u^2(t) + 2b_0 y(t-1)v(t) + v^2(t)\right),
\]
which allows computing $\bar E\, y^2(t)$.
Examples 3.1 and 3.2 show that the method consisting in minimizing the mean square error may in some cases lead to consistent estimators converging asymptotically to the "real" values, but may in other cases lead to an asymptotic error, and that this appears to depend both on the model structure and on the nature of the disturbance $v$. We now formalize this method and analyze it in the general case
\[
y(t) = \varphi(t)^T \theta + v(t),
\]
where $v(t)$ is some zero-mean disturbance, $\theta \in \mathbb{R}^n$ and $\varphi(t) \in \mathbb{R}^n$ is a vector of values that are measured and available at time $t$. For example, for $y(t) = a\, u(t) + b\, u(t-1) + c\, y(t-1) + v(t)$,
\[
\varphi(t) = \begin{pmatrix} u(t)\\ u(t-1)\\ y(t-1) \end{pmatrix} \quad \text{and} \quad \theta = \begin{pmatrix} a\\ b\\ c \end{pmatrix}. \tag{3.5}
\]
We will restrict our analysis here to cases where $\varphi(t)$ only contains present and previous values of $u$ and $y$ with fixed delays, but it could in principle contain any sort of signals (e.g., multiple inputs and outputs, powers or other functions of the input/output, etc.) and its composition could be time-varying.

Since $v$ has zero mean (and no assumption is made on its characteristics), an a priori natural predictor of $y(t)$ is obtained by discarding the effect of $v(t)$: $\hat y(t|\theta) = \varphi(t)^T\theta$. This predictor makes an error $\varepsilon(t|\theta) = y(t) - \hat y(t|\theta)$. The sum of the squared prediction errors is thus
\[
V(\theta) = \frac{1}{2}\sum_{t=1}^{N} \varepsilon(t|\theta)^2 = \frac{1}{2}\sum_{t=1}^{N} \left(\varphi(t)^T\theta - y(t)\right)^2.
\]
The principle of the regression method is to use as estimator $\hat\theta_N$ the $\theta$ that minimizes the sum of squares of the prediction errors:
\[
\hat\theta_N = \arg\min_\theta V(\theta) = \left(\sum_{t=1}^{N} \varphi(t)\varphi(t)^T\right)^{-1} \left(\sum_{t=1}^{N} \varphi(t)y(t)\right) \tag{3.6}
\]
if the inverse is defined (if not, one can verify that the minimum of $V$ is not unique, and an arbitrary one could be selected). If all signals are ergodic, $\hat\theta_N$ converges in probability to
\[
\hat\theta = \left(\bar E\, \varphi\varphi^T\right)^{-1} \left(\bar E\, \varphi y\right).
\]
Let us now analyze the consistency of this estimator, i.e. measure its asymptotic error when the signals do indeed satisfy $y(t) = \varphi(t)^T\theta_0 + v(t)$ for some zero-mean disturbance $v$. We would like to stress the fact that analyzing the quality of $\hat\theta$ when $y(t)$ is not generated by such a model is meaningless, as there is in that case no real $\theta_0$ to converge to. One could however still talk about the accuracy of the model obtained. Supposing thus $y(t) = \varphi(t)^T\theta_0 + v(t)$, it follows from (3.6) that
\[
\hat\theta_N = \left(\sum_{t=1}^{N} \varphi(t)\varphi(t)^T\right)^{-1} \left(\sum_{t=1}^{N} \left(\varphi(t)\varphi(t)^T\theta_0 + \varphi(t)v(t)\right)\right) = \theta_0 + \left(\sum_{t=1}^{N} \varphi(t)\varphi(t)^T\right)^{-1} \left(\sum_{t=1}^{N} \varphi(t)v(t)\right),
\]
provided $\sum_{t=1}^{N} \varphi(t)\varphi(t)^T$ is invertible. The error in the estimation of the parameter is thus
\[
\Delta\theta_N := \hat\theta_N - \theta_0 = \left(\sum_{t=1}^{N} \varphi(t)\varphi(t)^T\right)^{-1} \left(\sum_{t=1}^{N} \varphi(t)v(t)\right).
\]
If we suppose that the signals are ergodic and quasi-stationary, for large $N$ the error tends to
\[
\lim_{N\to\infty} \Delta\theta_N = \left(\bar E\, \varphi(t)\varphi(t)^T\right)^{-1} \left(\bar E\, \varphi(t)v(t)\right)
\]
if $\bar E\, \varphi(t)\varphi(t)^T$ is nonsingular (if not, the process may not converge, or may converge to unrelated values). So the estimator is consistent if $\bar E\, \varphi(t)\varphi(t)^T$ is nonsingular and $\bar E\, \varphi(t)v(t)$ is zero. The first condition holds if the input signal contains enough information with respect to the set of models (see Chapter 5). The second, however, requiring the noise to be uncorrelated with $\varphi$, will generally not hold unless $v$ is a white noise or $\varphi$ does not contain output signals. Suppose indeed that $\varphi(t)^T = \left(u(t), u(t-1), \dots, u(t-T_u), y(t-1), y(t-2), \dots, y(t-T_y)\right)$. Then $\bar E\, \varphi(t)v(t) = 0$ is satisfied if and only if the two following conditions are satisfied.
a) $\bar E\, u(t-\tau)v(t) = 0$ for $\tau = 0, \dots, T_u$. This is always the case if the system is controlled in open loop, but not necessarily if it is controlled in closed loop, as $u$ may then be correlated with $y$ and $v$. We will analyze closed-loop systems in more detail in Chapter 9.

b) $\bar E\, y(t-\tau)v(t) = 0$ for $\tau = 1, \dots, T_y$. Remembering that $y(t-\tau) = \varphi(t-\tau)^T\theta_0 + v(t-\tau)$, we see that this will usually only be satisfied in three cases:

(i) $v$ is a white noise (or some other stochastic process for which the different values are uncorrelated), as in Figure 3.2(b) in Example 3.2;

(ii) $T_y = 0$, i.e., $y(t)$ only depends on a finite number of values of $u$ and on the disturbances, as in Figure 3.1 in Example 3.1;

(iii) $v$ is a signal with the curious property that $v(t)$ and $v(s)$ are uncorrelated if $0 \neq |t-s| \leq T_y$. Note that this case encompasses the two other ones.

Exercise: consider the system $y(t) = u(t) + b\, y(t-1) + v(t)$ with $b = 0.5$ and $v(t)$ being uncorrelated with $v(s)$ for $0 \neq |t-s| \leq \tau$. What is the critical value of $\tau$ allowing the least-squares estimator for $b$ to be unbiased?

In all other cases, $\bar E\, y(t-\tau)v(t)$ is generically different from 0, as in Figure 3.2(a) in Example 3.2, although it may equal 0 for some very specific values of the parameters $\theta_0$.

Observe nevertheless that when there is an asymptotic error, its value $\left(\bar E\, \varphi(t)\varphi(t)^T\right)^{-1}\left(\bar E\, \varphi(t)v(t)\right)$ is inversely proportional to the signal-to-noise ratio, and can be acceptable if the latter is sufficiently high.

The regression method is thus only consistent for a relatively restricted set of systems. Deeper reasons explaining why regression only works in those specific cases and what goes wrong in the other ones will be seen in the next chapters, together with more accurate and general methods that are consistent for much larger sets of models. Before that, we present in the next section an alternative method allowing one to remove the asymptotic error for all systems of the form $y(t) = \varphi(t)^T\theta + v(t)$.
3.3 Instrumental variables
The regression method is derived by minimizing the sum of the squared prediction errors $\sum_t \varepsilon(t|\theta)^2$, but can equivalently be characterized by the following system of linear equations, $\sum_t \varphi(t)\,\varepsilon(t|\theta) = 0$, or
\[
\sum_t \varphi(t)\left(\varphi(t)^T\hat\theta - y(t)\right) = 0.
\]
This corresponds to requiring the prediction error to be uncorrelated with the signals in $\varphi$, which can be justified by a rationale independent of the minimization of the squared error: if the error were correlated with these signals $\varphi$, it could be partly predicted, and this prediction could be taken into account to build a better predictor. Hence a good error should be unpredictable and therefore uncorrelated with $\varphi$ (similar arguments will be made in Chapter 10 in order to determine whether additional signals should be considered as inputs).

We could generalize this method by requiring the error to be uncorrelated with some other signals $\zeta(t)$ of the same dimension as $\varphi$:
\[
\sum_t \zeta(t)\,\varepsilon(t|\theta) = \sum_t \zeta(t)\left(y(t) - \varphi(t)^T\hat\theta\right) = 0. \tag{3.7}
\]
This is the so-called instrumental variables method, and the $\zeta(t)$ are called the instruments. The efficiency of the method strongly depends on the instruments chosen. We will see that a wise choice of instruments $\zeta$ leads to a consistent estimator. But first, we note that the instrumental variable method can be generalized by filtering the prediction error to favor some frequencies and/or by taking instruments of larger dimension than $\theta$. It then becomes in general impossible to solve (3.7) exactly due to this latter point, but one can minimize the "residuals". This generalized instrumental variable method can then be expressed as
\[
\hat\theta_N = \arg\min_\theta \left\| \sum_t \zeta(t)\, L(q)\, \varepsilon(t|\theta) \right\|,
\]
for some filter $L(q)$. We will not consider this generalization in this chapter.
An explicit expression of the estimator defined implicitly in (3.7) is given by
\[
\hat\theta_N = \left(\sum_t \zeta(t)\varphi(t)^T\right)^{-1} \left(\sum_t \zeta(t)y(t)\right),
\]
provided the matrix $\sum_t \zeta(t)\varphi(t)^T$ is invertible. Assuming again that the signals are ergodic and quasi-stationary, this estimator tends to
\[
\left(\bar E\, \zeta(t)\varphi(t)^T\right)^{-1} \left(\bar E\, \zeta(t)y(t)\right),
\]
provided $\bar E\, \zeta(t)\varphi(t)^T$ is nonsingular (if not, the process may not converge, or may converge to unrelated values, as in the case of regressions).
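The instrumental-variable estimate itself is a simple linear solve; a generic Python sketch, where the instruments zeta are assumed to be built elsewhere (for instance from a simulated noise-free model, as discussed below):

import numpy as np

def iv_estimate(zeta, phi, y):
    # theta_hat = (sum zeta phi^T)^{-1} (sum zeta y);
    # zeta and phi have one row per time step and the same number of columns.
    A = zeta.T @ phi          # sum_t zeta(t) phi(t)^T
    b = zeta.T @ y            # sum_t zeta(t) y(t)
    return np.linalg.solve(A, b)   # only defined when A is nonsingular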
To analyze the consistency of the estimator, we suppose again that the signals do satisfy $y(t) = \varphi(t)^T\theta_0 + v(t)$ where $v$ is some zero-mean disturbance. Then there holds
\[
\hat\theta = \left(\bar E\, \zeta(t)\varphi(t)^T\right)^{-1}\left(\bar E\, \zeta(t)\left(\varphi(t)^T\theta_0 + v(t)\right)\right) = \theta_0 + \left(\bar E\, \zeta(t)\varphi(t)^T\right)^{-1}\left(\bar E\, \zeta(t)v(t)\right),
\]
so that the asymptotic error is
\[
\Delta\theta = \hat\theta - \theta_0 = \left(\bar E\, \zeta(t)\varphi(t)^T\right)^{-1}\left(\bar E\, \zeta(t)v(t)\right),
\]
assuming again that $\bar E\, \zeta(t)\varphi(t)^T$ is nonsingular. This method is thus consistent if the two following conditions are satisfied:

a) $\bar E\, \zeta v = 0$. Unless some parameters take very specific values, this means that $\zeta$ must be independent of $y$, since $y$ is influenced by $v$. There is however no objection to $\zeta$ depending on $u$ if the system is controlled in open loop, so that $u(t)$ does not depend on past values of $y$ and thus of $v$.

b) $\bar E\, \zeta\varphi^T$ is nonsingular. This requires the instruments to be correlated with $\varphi$, and thus usually with the output $y$, unless $\varphi$ has no component in $y$.

These two requirements pose a challenge, as the instruments must be correlated with $y$, but $y$ may not be used in their computation. One way of solving this is to use a "simulated noise-free version" of $\varphi$ as instrument: the $\varphi$ that one would obtain with the real system if there were no noise. This signal is indeed uncorrelated with the noise $v$ on the one hand, and we may expect on the other hand that it is correlated with the actual (noisy) $\varphi$, as they share the same deterministic part.
Example 3.3. Suppose the real system is
\[
y(t) = u(t) - \frac{y(t-1)}{2} + e(t) - e(t-1),
\]
and we have guessed the correct model: $\varphi^T(t) = (y(t-1),\ u(t))$. Let us compute the asymptotic error obtained if we apply (3.7) with $\zeta^T(t) = (x(t-1),\ u(t))$, where the signal $x(t)$ satisfies $x(t) = u(t) - \frac{x(t-1)}{2}$. It is thus a "noise-free" version of the real output. Clearly, $\bar E\, \zeta(t)v(t) = \bar E\, \zeta(t)\left(e(t) - e(t-1)\right) = 0$. Let us assume for simplicity that the input signal $u$ is a white noise of variance $\sigma_u^2$. One can then verify that
\[
\bar E\, \zeta\varphi^T = \bar E \begin{pmatrix} x(t-1)\\ u(t) \end{pmatrix} \begin{pmatrix} y(t-1) & u(t) \end{pmatrix}
= \begin{pmatrix} \bar E\, x(t-1)y(t-1) & \bar E\, x(t-1)u(t)\\ \bar E\, u(t)y(t-1) & \bar E\, u(t)^2 \end{pmatrix}
= \begin{pmatrix} \tfrac{4}{3}\sigma_u^2 & 0\\ 0 & \sigma_u^2 \end{pmatrix},
\]
which is nonsingular unless $\sigma_u^2 = 0$. The asymptotic error is thus 0 in that case. Observe however that, had we used a constant input $u = 1$, we would have obtained a singular matrix
\[
\bar E\, \zeta\varphi^T = \begin{pmatrix} 4/9 & 2/3\\ 2/3 & 1 \end{pmatrix},
\]
because $\bar E\, uy = \bar E\, y = 2/3$ and $\bar E\, yx = 4/9$ in that case. This shows that the asymptotic properties of the estimator do depend on the input signal used, as will be made explicit later.
The previous example seems to indicate that taking the noise-free version of $\varphi$ as instrument may lead to a consistent estimator. But this can of course not be used directly in practical applications: generating the noise-free output $x$ of the real system requires knowing the parameters of the real system. These are by definition unknown, for otherwise we would not try to identify them.

But we will see that taking as instrument $\zeta$ the simulated $\varphi$ of almost any dynamical system of the same form as the real one (which is assumed to lie in the model set) will lead to a consistent estimator. For this purpose, we first formalize the idea of a noise-free simulated system.

Definition 3.1. Consider a class of models $y(t) = \varphi(t)^T\theta$, with
\[
\varphi^T = \left(y(t-1), \dots, y(t-n_y),\ u(t), \dots, u(t-n_u)\right),
\]
and some input signal $u$. For a $\theta \in \mathbb{R}^{n_y+n_u+1}$, the noise-free signals $\tilde y_\theta(t)$ and $\tilde\varphi_\theta(t)$ for $\theta$ are defined by
\[
\tilde\varphi_\theta(t)^T = \left(\tilde y_\theta(t-1), \dots, \tilde y_\theta(t-n_y),\ u(t), \dots, u(t-n_u)\right),
\]
and $\tilde y_\theta(t) = \tilde\varphi_\theta(t)^T\theta$, with some arbitrary initial condition.
Theorem 3.1. Consider the class of stable models $y(t) = \varphi(t)^T\theta$, with
\[
\varphi^T = \left(y(t-1), \dots, y(t-n_y),\ u(t), \dots, u(t-n_u)\right),
\]
and suppose that the real system is stable and is described by $y(t) = \varphi(t)^T\theta_0 + v(t)$, where $v(t)$ is a zero-mean disturbance term (so that the model set has been correctly "guessed") and $u$ is chosen in open loop, so that $u$ and $v$ are independent.

Assume also that the class of models does not "over-parametrize" the real (noise-free) relation between $y$ and $u$ for that input, that is,
\[
\theta \neq \theta_0 \;\Rightarrow\; \bar E\left(\tilde\varphi_{\theta_0}^T\theta - \tilde\varphi_{\theta_0}^T\theta_0\right)^2 > 0. \tag{3.8}
\]
Let now $\zeta(t) = \tilde\varphi_\beta(t)$ for some $\beta \in \mathbb{R}^{n_y+n_u+1}$, i.e.,
\[
\zeta(t) = \left(x(t-1), \dots, x(t-n_y),\ u(t), \dots, u(t-n_u)\right),
\]
where $x(t)$ is defined by $x(t) = \zeta(t)^T\beta$, with arbitrary initial conditions. It is thus the simulated noise-free version of an arbitrary system in the model set considered, for the same input.

Then, the asymptotic error of the estimator obtained by the instrumental variable method with the instrument $\zeta$ is zero for almost all choices of $\beta$ for which the system $\tilde y_\beta(t) = \tilde\varphi_\beta(t)^T\beta$ is stable, i.e. for all $\beta$ corresponding to a stable system except those in a zero-measure subset of $\mathbb{R}^{n_y+n_u+1}$ (typically, a subset of strictly smaller dimension).
Remark 3.2. The assumption that $\theta \neq \theta_0 \Rightarrow \bar E\left(\tilde\varphi_{\theta_0}^T\theta - \tilde\varphi_{\theta_0}^T\theta_0\right)^2 > 0$ means that, when observing the noise-free simulated version of the "real" system driven by the input $u$, there is asymptotically no ambiguity about the $\theta_0$ with which it was created, i.e., if one were to try a different $\theta$, one would consistently obtain values $\tilde\varphi_{\theta_0}(t)^T\theta$ different from $\tilde\varphi_{\theta_0}(t)^T\theta_0 = \tilde y_{\theta_0}(t)$. It is expressed in a somewhat ad hoc fashion here, and actually relates to two issues:

- The transfer function $G$ of the real system can be described by one unique vector of parameters. This issue depends solely on the parametrization, and is related to the "identifiability" of the transfer function at $\theta$, which we will see in Section 4.4.

- The input contains enough information to expose the difference between any two different systems in the model set considered. This issue is totally independent from the parametrization, and is only related to the set of models and the input signal. It will be studied in more detail in Chapter 5.
Proof. The proof is separated into three parts. We first show that the asymptotic correlation can be computed using the noise-free version of the signals. We then exhibit one $\beta$ for which the asymptotic correlation is nonsingular. Finally, we show that if it is nonsingular for one particular $\beta$, it is nonsingular for almost all of them.

Part 1: $\bar E\, \tilde\varphi_\beta(t)\varphi(t)^T = \bar E\, \tilde\varphi_\beta(t)\tilde\varphi_{\theta_0}(t)^T$.

By definition of the noise-free signals (Definition 3.1), this equality is satisfied if the following two conditions hold:

- $\bar E\, \tilde\varphi_\beta(t)u(t-\tau) = \bar E\, \tilde\varphi_\beta(t)u(t-\tau)$ for $\tau = 0, \dots, n_u$,
- $\bar E\, \tilde\varphi_\beta(t)y(t-\tau) = \bar E\, \tilde\varphi_\beta(t)\tilde y_{\theta_0}(t-\tau)$ for $\tau = 1, \dots, n_y$.

The first is always trivially satisfied. To prove the second, observe that $y(t) = \varphi(t)^T\theta_0 + v(t)$ can be rewritten as $A_{\theta_0}(q)y(t) = B_{\theta_0}(q)u(t) + v(t)$ for some appropriate polynomials$^2$ $A_{\theta_0}(q)$, $B_{\theta_0}(q)$ entirely determined by $\theta_0$. This can be rewritten as
\[
y(t) = \frac{B_{\theta_0}(q)}{A_{\theta_0}(q)}\, u(t) + \frac{1}{A_{\theta_0}(q)}\, v(t).
\]
By a similar development, we obtain
\[
\tilde y_{\theta_0}(t) = \frac{B_{\theta_0}(q)}{A_{\theta_0}(q)}\, u(t), \tag{3.9}
\]
up to the effect of the initial conditions, which eventually dies out due to the stability of the system. Let us now compute the correlation
\[
\bar E\, \tilde\varphi_\beta(t)y(t-\tau) = \bar E\, \tilde\varphi_\beta(t)\, \frac{B_{\theta_0}(q)}{A_{\theta_0}(q)}\, u(t-\tau) + \bar E\, \tilde\varphi_\beta(t)\, \frac{1}{A_{\theta_0}(q)}\, v(t-\tau). \tag{3.10}
\]
The second term is zero because $v$ and $\tilde\varphi_\beta(t)$ are uncorrelated. It follows then from (3.9) and (3.10) that
\[
\bar E\, \tilde\varphi_\beta(t)\varphi(t)^T = \bar E\, \tilde\varphi_\beta(t)\tilde\varphi_{\theta_0}(t)^T.
\]
Part 2: $\bar E\, \tilde\varphi_{\theta_0}(t)\tilde\varphi_{\theta_0}(t)^T$ is nonsingular.

Suppose, to obtain a contradiction, that $\bar E\, \tilde\varphi_{\theta_0}(t)\tilde\varphi_{\theta_0}(t)^T$ is singular. In that case, there exists a nonzero vector $\Delta\theta$ such that
\[
0 = \Delta\theta^T \left(\bar E\, \tilde\varphi_{\theta_0}(t)\tilde\varphi_{\theta_0}(t)^T\right) \Delta\theta = \bar E\, \Delta\theta^T \tilde\varphi_{\theta_0}(t)\tilde\varphi_{\theta_0}(t)^T \Delta\theta = \bar E\left(\tilde\varphi_{\theta_0}(t)^T\Delta\theta\right)^2.
\]
Let now $\theta^* = \theta_0 + \Delta\theta$. It follows from the previous relation that
\[
\bar E\left(\tilde\varphi_{\theta_0}(t)^T\theta^* - \tilde\varphi_{\theta_0}(t)^T\theta_0\right)^2 = 0,
\]
in contradiction with the assumption of equation (3.8).

$^2$These are really polynomials in $q^{-1}$.

Part 3: If $\bar E\, \tilde\varphi_\beta(t)\tilde\varphi_{\theta_0}(t)^T$ is nonsingular for one $\beta$, then it is nonsingular for almost all $\beta$.

Let $g(\beta) = \det \bar E\left(\tilde\varphi_\beta(t)\varphi^T(t)\right) = \det \bar E\left(\tilde\varphi_\beta(t)\tilde\varphi_{\theta_0}^T(t)\right)$, where the second equality follows from Part 1. This function can be shown to be analytic, because it can be written as a combination (albeit an involved one) of analytic functions.

A fundamental result in complex analysis states that, if $g$ is an analytic$^3$ function and if $g(\beta) = 0$ for all $\beta$ in a set of positive measure, then $g \equiv 0$. Conversely, if $g(\beta^*) \neq 0$ for at least one $\beta^*$ in the domain of $g$, then the set of $\beta$ for which $g(\beta) = 0$ has zero measure, i.e. $g(\beta) \neq 0$ for almost every $\beta$. Since we know from Part 2 that $g(\theta_0) \neq 0$, this completes the proof.
Even if Theorem 3.1 establishes that almost any choice of parameters will lead to a consistent estimate, better convergence properties are obtained if the parameters are chosen close to those of the real system, as the $\zeta$ obtained in this way has a higher correlation with $\varphi$, leading to a smaller estimation error. One possible way of achieving this is to work recursively, performing first an identification with a random instrument or a regression method, and using the obtained parameters to generate better instruments $\zeta$.

Example: consider, one more time, the system $y(t) = u(t) + b\, y(t-1) + e(t) + e(t-1)$ with $b = 0.5$. We saw on Figure 3.2 that the standard least-squares approach resulted in a biased estimator. However, this biased estimate can serve as a basis for creating a noise-free estimate of $y$. Figure 3.3 illustrates two iterations of this approach, first using $x(t) = u(t) + \hat b_{LS}\, x(t-1)$ as instrument (with $\hat b_{LS}$ being the least-squares estimate of $b$) and secondly using the refined instrument $x(t) = u(t) + \hat b_{IV1}\, x(t-1)$ (with $\hat b_{IV1}$ being the previous instrumental variable estimate of $b$).
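A Python sketch of this iterative scheme for the example system (all simulation choices are arbitrary): the least-squares fit gives a biased estimate, which is then used to simulate a noise-free instrument and refine the estimate, as on Figure 3.3.

import numpy as np

rng = np.random.default_rng(4)
b0, N = 0.5, 20000
u = rng.standard_normal(N)
e = rng.standard_normal(N)
y = np.zeros(N)
for t in range(1, N):
    y[t] = u[t] + b0 * y[t - 1] + e[t] + e[t - 1]     # colored disturbance e(t) + e(t-1)

# Step 1: (biased) least-squares estimate of b
b_ls = np.sum((y[1:] - u[1:]) * y[:-1]) / np.sum(y[:-1] ** 2)

# Step 2 and 3: simulate a noise-free output with the current guess, use it as instrument
def iv_step(b_guess):
    x = np.zeros(N)
    for t in range(1, N):
        x[t] = u[t] + b_guess * x[t - 1]              # noise-free model output
    return np.sum(x[:-1] * (y[1:] - u[1:])) / np.sum(x[:-1] * y[:-1])

b_iv1 = iv_step(b_ls)       # first instrumental-variable refinement
b_iv2 = iv_step(b_iv1)      # second refinement
print(b_ls, b_iv1, b_iv2)   # b_ls is biased, b_iv1 and b_iv2 are close to b0 = 0.5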
To conclude this chapter, we note that the assumption of absence of over-parametrization in Theorem 3.1 (commented upon in Remark 3.2) is not merely technical. In fact, the following example shows that over-parametrization may lead to trouble even if one uses the noise-free version of the real model as instrument. Note that this chapter is essentially based on the books [1, 2], to which we refer for more details.

$^3$Analytic functions are essentially complex differentiable functions. We refer the reader to standard references on complex analysis for more details (e.g., [3]).
Figure 3.3: Iterative estimation of the parameter $b$ for the system $y(t) = u(t) + b\, y(t-1) + e(t) + e(t-1)$ with $b = 0.5$. The true value $b$ (dashed, blue), its least-squares estimate $\hat b_{LS}$ (red), the instrumental variable estimate $\hat b_{IV1}$ based on the instrument $x(t) = u(t) + \hat b_{LS}\, x(t-1)$ (green), and the instrumental variable estimate $\hat b_{IV2}$ based on the instrument $x(t) = u(t) + \hat b_{IV1}\, x(t-1)$ (magenta).
Example 3.4. Consider the system
\[
y(t) = u(t) - \frac{u(t-1)}{2} + \frac{y(t-1)}{2} + v,
\]
and suppose we guess the model structure $\varphi^T = (y(t-1),\ u(t),\ u(t-1))$, and use $\zeta^T = (x(t-1),\ u(t),\ u(t-1))$, with $x$ the noise-free version of $y$. We assume moreover that the input is a white noise of variance 1. One can verify that
\[
\bar E\, \zeta\varphi^T = \bar E \begin{pmatrix} x(t-1)\\ u(t)\\ u(t-1) \end{pmatrix} \begin{pmatrix} y(t-1) & u(t) & u(t-1) \end{pmatrix}
= \begin{pmatrix} 1 & 0 & 1\\ 0 & 1 & 0\\ 1 & 0 & 1 \end{pmatrix},
\]
which is singular, so that the estimator is not consistent (and may not converge).
Actually, this could be expected. Indeed, we can rewrite the real system as follows: $\left(1 - \frac{q^{-1}}{2}\right) y(t) = \left(1 - \frac{q^{-1}}{2}\right) u(t) + v$, and therefore as $y(t) = u(t) + v'(t)$ (up to a small error linked to the initial conditions), where $v'$ is the filtered disturbance. This system is over-parametrized, as any $\theta$ of the form $(a,\ 1,\ -a)$ would lead to the correct transfer function. The choice $a_0 = \frac{1}{2}$ made in our system description is thus purely arbitrary, and there is no reason for which an automated method would converge to this particular choice. This example is numerically illustrated on Figure 3.4.
Figure 3.4: Estimation of the parameters $a$, $b$ and $c$ with the instrumental variable approach for the system $y(t) = a\, y(t-1) + b\, u(t) + c\, u(t-1) + v$, with $(a, b, c) = (1/2,\ 1,\ -1/2)$ (see Example 3.4). The system is over-parametrized, so the approach does not lead to consistent estimations: the estimator does not even converge. Evolution of $\hat a$ (top), $\hat b$ (middle) and $\hat c$ (bottom).
Chapter 4

Classes of parametric models for LTI systems
Notational convention
In the sequel, we will often deal with polynomials in $q^{-1}$ such as $a_0 + a_1 q^{-1} + a_2 q^{-2} + \dots + a_{n_A} q^{-n_A}$. Consistently with the notations used for the transfer functions, we will denote such polynomials by $A(q)$ (even if they are polynomials in $q^{-1}$). Following this convention, there holds $A(\infty) = a_0$ (when the reverse convention is used, then $A(0) = a_0$).

Moreover, we say that $z$ is a zero of $A(q)$ if $A(z) = a_0 + a_1 z^{-1} + a_2 z^{-2} + \dots + a_{n_A} z^{-n_A} = 0$. In particular, the filter $1/A(q)$ is BIBO stable if all the zeros of $A(q)$ are inside the unit disk. For example, $1 - 2q^{-1}$ has a zero outside the unit disk, while $2 - q^{-1}$ has a zero in the unit disk. The same convention applies when talking about poles or zeros of a transfer function. The poles of $H = C(q)/D(q)$ are for example those values $z$ for which $D(z) = 0$.
4.1 Introduction
An LTI system with additive noise $y(t) = G(q)u(t) + v(t)$ is entirely determined by the transfer function $G$ and the probability model for the noise $v$ (assumed to be stationary or quasi-stationary). When only the first and second moments are of interest, one can replace $v$ by a stochastic signal with the same expected value and correlation function. In particular, if $\int_{-\pi}^{\pi} \left|\ln \Phi_v(e^{i\omega})\right| d\omega < \infty$ (which is always the case if the spectrum is continuous and positive everywhere), it follows from Theorem 1.2 that the noise can be "factorized" in the following sense: there exists a monic filter $H$ (i.e. $H(\infty) = 1$) such that $\Phi_{He} = \Phi_v$, where $e$ is some white noise of appropriate variance. Moreover, $H$ can be chosen in such a way that all its zeros are inside or on the unit circle, and all its poles strictly inside the unit circle.

General LTI systems can then be represented by
\[
y(t) = G(q)u(t) + H(q)e(t), \tag{4.1}
\]
with
\[
G(q) = \sum_{k=0}^{\infty} g(k) q^{-k}, \qquad H(q) = 1 + \sum_{k=1}^{\infty} h(k) q^{-k},
\]
in which $e(t)$ is a white noise. Note that the assumption that $H$ is monic will simplify the exposition, but has no practical implication and is not a loss of generality. Indeed, one can scale $H$ by an arbitrary constant if the inverse square scaling is applied to the variance of the white noise.

A complete description of the system is given by Equation (4.1); it requires specifying $g(t)$, $h(t)$ and the pdf (probability density function) of $e(t)$, which is generally out of reach. Our goal in this chapter is to describe subclasses of linear systems in which a particular model can easily be specified via a limited number of parameters; in other words, we present some of the commonly arising parametric models. Some models with simple descriptions fit very well in this framework, while others do not. However, it is useful to represent them in their "most general description" (i.e., presenting general classes of models) in order to gain insight into their fundamental properties.
4.2 Classes of models with rational transfer functions

The most general finite-order black-box model used is the following:
\[
A(q)y(t) = \frac{B(q)}{F(q)}\, u(t) + \frac{C(q)}{D(q)}\, e(t), \tag{4.2}
\]
where we assume that $A(q)$, $B(q)$, $F(q)$, $C(q)$, $D(q)$ are all finite-order monic polynomials (monic: $A(\infty) = F(\infty) = C(\infty) = D(\infty) = 1$). In theory, there would be no loss of generality if we only considered models of the form
\[
A(q)y(t) = B(q)u(t) + C(q)e(t). \tag{4.3}
\]
To see this, multiply (4.2) by $CF$. But for numerical and practical reasons, it is more convenient to consider the redundant description (4.2)$^1$. Consider for example the model
\[
y(t) = u(t) + \frac{1}{1 - \frac{1}{2}q^{-1}}\, e(t).
\]
Any of its expressions under the form (4.3) requires $A$ and $B$ to have common zeros, as in $A = 1 - \frac{1}{2}q^{-1}$, $B = 1 - \frac{1}{2}q^{-1}$ and $C = 1$. However, if one is to recover this model based on measurements on a real system, due to numerical and experimental errors one will very likely obtain polynomials $A$ and $B$ with slightly different zeros, which could lead to ambiguities about the exact nature and the order of the system, and also to numerical issues.
Fixing certain polynomials to 1 leads to certain classical classes of LTI models:
\[
\begin{array}{ll}
y(t) = B(q)u(t) + e(t) & \text{FIR}\\[1mm]
y(t) = \dfrac{B(q)}{F(q)}u(t) + e(t) & \text{output error}\\[1mm]
A(q)y(t) = B(q)u(t) + e(t) & \text{Auto-Regressive eXogenous (ARX)}\\[1mm]
A(q)y(t) = C(q)e(t) & \text{ARMA}\\[1mm]
A(q)y(t) = B(q)u(t) + C(q)e(t) & \text{ARMAX}\\[1mm]
A(q)y(t) = B(q)u(t) + \dfrac{C(q)}{D(q)}e(t) & \text{ARARMAX}\\[1mm]
y(t) = \dfrac{B(q)}{F(q)}u(t) + \dfrac{C(q)}{D(q)}e(t) & \text{Box-Jenkins}
\end{array}
\]
Note that these classes of models can easily be generalized to describe some nonlinear systems by performing an appropriate change of variables, leading to systems of the form
\[
g(y(t)) = G(q)\, f(u(t)) + H(q)\, e(t),
\]
where $f(\cdot)$ and $g(\cdot)$ can be non-linear functions. This class of models is generally known as Hammerstein-Wiener models.

$^1$One could also have a more redundant description by introducing an additional polynomial dividing $y$. This is however not usually done.
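As an illustration of these model classes, the following Python sketch simulates data from a hypothetical ARMAX model $A(q)y = B(q)u + C(q)e$ using scipy's lfilter; the polynomial coefficients, signal length and noise level are arbitrary choices. Taking $C = 1$ gives an ARX model, and replacing the rational filters accordingly gives output-error or Box-Jenkins structures.

import numpy as np
from scipy.signal import lfilter

# Hypothetical ARMAX model: A(q) y = B(q) u + C(q) e, with polynomials in q^{-1}
A = [1.0, -0.7]          # A(q) = 1 - 0.7 q^{-1}
B = [0.0, 1.0, 0.5]      # B(q) = q^{-1} + 0.5 q^{-2}
C = [1.0, 0.3]           # C(q) = 1 + 0.3 q^{-1}

rng = np.random.default_rng(5)
N = 1000
u = rng.standard_normal(N)
e = 0.1 * rng.standard_normal(N)

# y = (B/A) u + (C/A) e : lfilter(b, a, x) filters x by the rational filter b(q)/a(q)
y = lfilter(B, A, u) + lfilter(C, A, e)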
4.3 Prediction
One important goal of model representation is the prediction of future output values based on past (or present) inputs and outputs. We will see in the next chapters that some identification methods consist in selecting the model that makes the best such predictions. The form $y = Gu + He$ is not directly appropriate for making predictions, as it relies on $e$, which is typically unknown (or even non-existent, since $e$ can be seen as a mathematical tool to parametrize $v$ as $He$). In this section, we develop explicit expressions for unbiased estimators of the output $y(t)$ given all past values of $y$ and all past and present values of $u$. In order to fix the ideas, we start by working out the concept of predictor using two examples.
Example 4.1. Consider for example the model
\[
y(t) = \tfrac{1}{2}\, y(t-1) + u(t) + e(t).
\]
An unbiased estimator of $y(t)$, given all previous values and the present value of the input, and assuming that $y(t)$ does follow this model, is
\[
\hat y(t) = E[y(t)\,|\,y(t-1), u(t)] = \tfrac{1}{2}\, y(t-1) + u(t).
\]

Example 4.2. Similarly, for the model
\[
y(t) = \tfrac{1}{2}\left(y(t-1) - e(t-1)\right) + u(t) + e(t),
\]
the unbiased estimate (given all past values and the present value of $u$) can be shown to be
\[
\hat y(t) = \sum_{k} u(t-k) \left(\tfrac{1}{2}\right)^k.
\]

Exercise 4.1. Show that the unbiased estimator for $y(t) = \tfrac{1}{2}\left(y(t-1) - e(t-1)\right) + u(t) + e(t)$ is indeed $\hat y(t) = \sum_k u(t-k)\left(\tfrac{1}{2}\right)^k$.
Interestingly, the structures of the predictors for Example 4.1 and Example 4.2 are completely different, although the systems only differ by a term $e(t-1)$. We now develop a general expression for the (one-step-ahead) predictor of the system (4.1). The system description $y(t) = G(q)u(t) + H(q)e(t)$ can trivially be rewritten as
\[
y(t) = G(q)u(t) + (H(q) - 1)e(t) + e(t), \tag{4.4}
\]
where we observe that $(H(q) - 1)e(t)$ only depends on past values $e(s)$ with $s < t$ because $H(q)$ is monic. Now, it follows from $y(t) = G(q)u(t) + H(q)e(t)$ that the value of $e(t)$ can be determined from $y$ and $u$ by
\[
e(t) = H^{-1}(q)\left(y(t) - G(q)u(t)\right).
\]
Reintroducing this expression in the second term of (4.4) (which corresponds to expressing past values of $e(t)$ in terms of past values of $u$ and $y$) leads to
\[
y(t) = G(q)u(t) + (H(q) - 1)H^{-1}(q)\left(y(t) - G(q)u(t)\right) + e(t) = H^{-1}(q)G(q)u(t) + \left(1 - H^{-1}(q)\right)y(t) + e(t). \tag{4.5}
\]
Note that $H$, and therefore $H^{-1}$, are monic. Therefore, by looking at $H$ and $H^{-1}$ as (possibly infinite) polynomials in $q^{-1}$, one can observe that the term $\left(1 - H^{-1}(q)\right)y(t)$ only depends on past values of $y$. Indeed, $H^{-1}(q)$ can be developed as $1 + \tilde h_1 q^{-1} + \tilde h_2 q^{-2} + \dots$, so that
\[
\left(1 - H^{-1}(q)\right)y(t) = -\tilde h_1 y(t-1) - \tilde h_2 y(t-2) - \dots
\]
Given past values of $u$ and $y$ and the present value of $u$, the only non-deterministic part in (4.5) is thus the term $e(t)$, which is sometimes called the innovation, and which is a random variable of zero mean. We can now formally define and express the (one-step-ahead) predictor
\[
\hat y(t) := E[y(t)\,|\,u_t, u_{t-1}, \dots, y_{t-1}, y_{t-2}, \dots] = \left(1 - H^{-1}(q)\right)y(t) + H^{-1}(q)G(q)u(t) =: W_u(q)u(t) + W_y(q)y(t), \tag{4.6}
\]
where $W_u, W_y$ are called the predictor filters, and the value $\hat y(t)$ is called the one-step-ahead predictor. We remind the reader that $\left(1 - H^{-1}(q)\right)y(t)$ does not depend on $y(t)$ but only on previous values of $y$ because $H$ is monic, and is thus entirely determined by the information available to the user at time $t$, prior to reading $y(t)$. As already mentioned, the assumption that $H$ is monic has of course no real importance and is just made for computational simplicity. For a general $H$, the same reasoning would lead to an expression involving $\left(h_0^{-1} - H^{-1}(q)\right)y(t)$, which would also not depend on $y(t)$. Particularizing (4.6) to the generic form (4.2) leads to
\[
\hat y(t) = \frac{D(q)B(q)}{C(q)F(q)}\, u(t) + \left(1 - \frac{A(q)D(q)}{C(q)}\right) y(t). \tag{4.7}
\]
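Continuing in the same spirit as the simulation sketch of the previous section, the predictor filters can be applied directly in this rational form. A minimal Python sketch of the one-step-ahead predictor for an ARMAX model (i.e. (4.7) with $F = D = 1$), with hypothetical polynomials:

import numpy as np
from scipy.signal import lfilter

def armax_one_step_predictor(A, B, C, u, y):
    # One-step-ahead predictor for A(q) y = B(q) u + C(q) e:
    # y_hat(t) = (B/C) u(t) + ((C - A)/C) y(t), cf. equation (4.7) with F = D = 1
    CA = np.zeros(max(len(A), len(C)))
    CA[:len(C)] += C          # C - A, computed coefficient-wise
    CA[:len(A)] -= A
    return lfilter(B, C, u) + lfilter(CA, C, y)

A = np.array([1.0, -0.7]); B = np.array([0.0, 1.0, 0.5]); C = np.array([1.0, 0.3])
rng = np.random.default_rng(6)
N = 500
u = rng.standard_normal(N)
e = 0.1 * rng.standard_normal(N)
y = lfilter(B, A, u) + lfilter(C, A, e)

y_hat = armax_one_step_predictor(A, B, C, u, y)
print(np.std(y - y_hat))   # residual roughly equal to the innovation std (0.1 here), up to transients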
Stability of the predictor
As seen in the introductory examples, predictor filters can take many forms and can be quite different from those involved in the system definition. Note that computing $\hat y(t)$ sometimes requires the knowledge of only a finite number of previous values of $u$ and $y$ (when $C(q) = F(q) = 1$), in which case the computation can be performed exactly (see Example 4.1). When infinitely many previous values are needed, but the filters are stable, then the prediction can still be approximated with arbitrary precision by truncating these infinite sums or by developing a recursive scheme (in which case the approximation may be inaccurate in the initialization of the process). However, if the filter is unstable, the value of the predictor is in general not defined$^2$.

The instability of the predictor filters is an issue related to, but different from, the stability of the system. In particular, unstable systems may admit stable predictors. Consider the two following examples.
Example 4.3. y = u
1􀀀2q􀀀1 +e, or equivalently y(t) = 2y(t􀀀1)+u(t)+e(t)􀀀
2e(t 􀀀 1), leads to the unstable predictor ^y(t) = u
1􀀀2q􀀀1 .
Example 4.4. y = u+e
1􀀀2q􀀀1 , or equivalently y(t) = 2y(t􀀀1)+u(t)+e(t) leads
to the stable predictor ^y(t) = 2y(t 􀀀 1) + u(t).
²It may still be defined in some particular cases, for example when u(t) was initially zero and only H^{-1}G is unstable. However, the practical computation of its value will in such cases often be ill-conditioned.
In those examples, both models correspond to systems where, at each step, the new state is obtained by doubling the previous one and adding the input. In the first system, there is an additive white noise on the measurement, so that the actual value of the state is not known with precision. The unbiased estimator is obtained by "re-computing" the value y based on all previous input values, using here an unstable filter. In the second case, some white noise is added during the computation of the new state, but this state is measured exactly and is thus available to compute the predictor. The reader can verify this explanation by deriving the transfer functions corresponding to the system descriptions given above.
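The difference between the two examples can also be checked numerically. The short sketch below (an added illustration, with an arbitrarily chosen noise level and horizon) simulates both systems and runs the corresponding predictor recursions: for Example 4.4 the prediction error stays equal to e(t), while for Example 4.3 any imperfection in the initialization of the recursion ŷ(t) = 2ŷ(t−1) + u(t) is doubled at every step.

import numpy as np

rng = np.random.default_rng(1)
N = 60
u = rng.standard_normal(N)
e = 0.1 * rng.standard_normal(N)

# Example 4.4: y(t) = 2 y(t-1) + u(t) + e(t); stable predictor y_hat(t) = 2 y(t-1) + u(t).
y4 = np.zeros(N)
yhat4 = np.zeros(N)
for t in range(1, N):
    y4[t] = 2 * y4[t-1] + u[t] + e[t]
    yhat4[t] = 2 * y4[t-1] + u[t]          # uses the measured y(t-1): error is exactly e(t)
print("Ex. 4.4: max |y - y_hat| =", np.max(np.abs(y4 - yhat4)))   # ~ max |e|

# Example 4.3: predictor y_hat = u / (1 - 2 q^{-1}), i.e. y_hat(t) = 2 y_hat(t-1) + u(t):
# a recursion on the predictor itself, hence unstable.
yhat3 = np.zeros(N)
yhat3[0] = 1e-6                             # a tiny initialization error...
for t in range(1, N):
    yhat3[t] = 2 * yhat3[t-1] + u[t]
print("Ex. 4.3: |y_hat(N)| =", abs(yhat3[-1]))   # ...is amplified by 2 at every step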
Based on the expression ŷ(t) = (1 − H^{-1}(q))y(t) + H^{-1}(q)G(q)u(t), there are two cases in which a predictor may be unstable:

- H^{-1} is unstable, in which case W_y = 1 − H^{-1} is unstable.

- G is unstable and the product with H^{-1} does not lead to a pole-zero cancellation, in which case W_u = H^{-1}G is unstable.

Remember though that H can usually be chosen with all its poles inside the unit disk and all its zeros inside or on the unit disk (see Theorem 1.2). When this is done, H^{-1} has all its zeros inside the unit disk, and cannot cancel any unstable pole of G. The filter W_u is thus stable if and only if the system is stable. Moreover, W_y = 1 − H^{-1} would be unstable only when H has a zero exactly on the unit circle, that is, when the spectrum of the disturbance Φ_v(ω) = |H(e^{iω})|² σ_e² is exactly zero at at least one frequency.

However, some systems are more naturally described by models for which H does not satisfy this assumption, even though such an H could then be replaced by an alternative one satisfying it, without changing the noise spectrum (see Theorem 1.2).
4.4 Parametrization and Identifiability

We end this chapter by discussing some desirable properties of the way model classes are parametrized. As already seen in Example 3.4, an inappropriate parametrization can have dramatic effects.

Let M be a set of models. A model structure is a smooth mapping M : D_M ⊂ R^N → M : θ ↦ M(θ) = W(q, θ), where W(q, θ) is a pair of transfer
functions W_u, W_y defining a predictor filter, and D_M is the set of parameters θ. Observe that, by (4.6), there is a bijection between a pair of predictor filters and the corresponding transfer functions G and H. However, it is possible that different values of θ correspond to the same predictor filter through the mapping M (in other words, M may not be injective). This can in some cases lead to numerical problems when optimizing over the set, especially when infinitely many values of θ correspond to the same filter (see Problems ?? and ?? of the exercise list). In particular, the estimate may not converge in the latter case if one performs several identifications using longer and longer signals. When infinitely many values of θ are "equally valid", there is indeed a priori no reason to assume that an automatic method will consistently select or converge to the same one.

Requiring M to be injective may however not always be desirable. The following concepts characterize the injectivity properties of M.

Definition 4.1. M is globally (parameter) identifiable at θ* ∈ D_M if, for any θ ∈ D_M, M(θ) = M(θ*) implies θ = θ*.
Example 4.5. Consider the model set M(a, b) = {1/(1 + aq^{-1} + bq^{-2}), 1}. M is globally parameter identifiable at (a, b) = (0, 0), as for all (x, y) ∈ R² we have M(x, y) = {1, 1} ⇒ (x, y) = (0, 0).
Definition 4.2. M is locally (parameter) identifiable at θ* ∈ D_M if there exists a neighborhood B of θ* such that, for any θ ∈ D_M ∩ B, M(θ) = M(θ*) implies θ = θ*.
Example 4.6. Consider the model set M(a, b) = {1/((q^{-1} − a)(q^{-1} − b)), 1}. M is locally parameter identifiable at (a, b) = (0, 2), as for all (x, y) ∈ [−1, 1] × [1, 3] (a neighborhood of (0, 2)) we have M(x, y) = M(0, 2) ⇒ (x, y) = (0, 2). Note that the model set is not globally identifiable at this point, as M(0, 2) = M(2, 0).
Definition 4.3. M is globally/locally (parameter) identifiable if M is globally/locally identifiable at almost all θ* ∈ D_M.
Example 4.7. Consider the model set M(a, b) = {1/((q^{-1} − a)(q^{-1} − b)), 1}. M is locally identifiable, as for all (a, b), (x, y) ∈ R² we have M(a, b) = M(x, y) ⇒ (x, y) = (a, b) or (x, y) = (b, a). Assume first that a ≠ b (in the case a = b, M(a, b) = M(x, y) ⇒ (x, y) = (a, a), so the set M is even globally identifiable at (a, a)). Then ||(b, a) − (a, b)||₂² > 0, and therefore there exists some ε > 0 satisfying ε < ||(b, a) − (a, b)||₂. The Euclidean ball of radius ε centered at (a, b) is then a neighborhood of (a, b) in which M(x, y) = M(a, b) implies (x, y) = (a, b), so M is locally identifiable at (a, b), which allows us to conclude.
Definition 4.4. M is strictly globally/locally (parameter) identifiable if M is globally/locally identifiable at all θ* ∈ D_M.
Example 4.8. Consider the model set M(a, b) = {1/(1 + aq^{-1} + bq^{-2}), 1}. M is strictly globally identifiable, as for all (a, b), (x, y) ∈ R² we have M(x, y) = M(a, b) ⇒ (x, y) = (a, b).
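As a small illustration (added here, not in the original notes), one can compare the frequency responses of the two parametrizations used in the examples above on a grid of frequencies: swapping the two roots in the factored parametrization gives exactly the same transfer function, while swapping the two coefficients in the second parametrization does not.

import numpy as np

w = np.linspace(0, np.pi, 512)        # frequency grid
z_inv = np.exp(-1j * w)               # q^{-1} evaluated on the unit circle

def G_factored(a, b):
    # G(q) = 1 / ((q^{-1} - a)(q^{-1} - b)), as in Examples 4.6 and 4.7
    return 1.0 / ((z_inv - a) * (z_inv - b))

def G_coeff(a, b):
    # G(q) = 1 / (1 + a q^{-1} + b q^{-2}), as in Examples 4.5 and 4.8
    return 1.0 / (1.0 + a * z_inv + b * z_inv**2)

# Factored parametrization: swapping the roots gives the same transfer function (only locally identifiable)
print(np.max(np.abs(G_factored(0.0, 2.0) - G_factored(2.0, 0.0))))   # ~ 0

# Coefficient parametrization: swapping the coefficients changes the transfer function (strictly globally identifiable)
print(np.max(np.abs(G_coeff(0.0, 0.4) - G_coeff(0.4, 0.0))))          # > 0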
It is important to remember, though, that identifiability is a property of the model structure, i.e. of the way we represent the set of models, and not of the set itself.

To conclude this section, we mention that inappropriate parametrizations can cause numerical problems even if they satisfy all identifiability properties. Without entering into technical details, imagine a parametrization where very different θ can lead to filters close to each other, and/or where very similar θ can lead to very different filters. The problem of finding the "best" parameters can then be ill-conditioned, as "moving by a constant distance" between two filters may require either a huge or a very small modification of the parameters. These issues are related to questions of conditioning of practical optimization problems, and are not further discussed in these lecture notes.
Chapter 5

Input Signals

One of the important differences between parameter estimation in statistics and system identification is the possible freedom to choose the input signals in system identification. We will see in this chapter that some input signals do not allow identifying the system, independently of the method used (this is for example trivially the case if u = 0), and we will characterize the open-loop input signals that contain enough "information" to perform an identification, which will depend on the model set considered. The closed-loop control case is deferred to Chapter 9. The choice of the input signal also obviously has an influence on the accuracy of the result. This will be briefly studied in Chapter 8.
5.1 Information content of the signals

Example 5.1. Suppose we want to identify a system described by

y(t) = u(t) + (2/3) u(t−1) + (1/3) u(t−2) + e(t),   (5.1)

where e(t) is a white noise, and we have correctly guessed that the system description belongs to the model set

y(t) = a u(t) + b u(t−1) + c u(t−2) + e(t).

Figure 5.1 shows the results obtained using a regression method (φ = (u(t), u(t−1), u(t−2))^T) for experiments made with the three following input signals:
(a) u is a white noise of variance σ²,

(b) u(t) = sin(π t/2), i.e. the repetition of (0, 1, 0, −1),

(c) u(t) = 1 + sin(π t/2), i.e. the repetition of (1, 2, 1, 0).
One can see that the parameters are successfully recovered when using the input signals (a) and (c), but not with the input signal (b). Remember that the regression method is asymptotically unbiased if Ē φv = 0 and Ē(φφ^T) is nonsingular. The first condition is satisfied as the disturbance v is here a white noise e. Let us evaluate Ē(φφ^T) for the three different input signals.
(a) For u a white noise of variance σ², we have

Ē(φφ^T) = [ σ²   0    0
            0    σ²   0
            0    0    σ² ],

which is clearly nonsingular if σ ≠ 0.
(b) For u(t) = sin(π t/2), φ(t) is the repetition of the following sequence:

    φ₁(t) = u(t)       0    1    0   −1
    φ₂(t) = u(t−1)    −1    0    1    0
    φ₃(t) = u(t−2)     0   −1    0    1

We now compute the different elements of the matrix by averaging, over one period, the products of the values in the corresponding rows. For example,

Ē[u(t)u(t−1)] = 0.25·(0·(−1) + 1·0 + 0·1 + (−1)·0) = 0.

We find

Ē(φφ^T) = [  0.5   0   −0.5
             0    0.5    0
            −0.5   0    0.5 ],

which is clearly singular (alternatively, we could have computed (1/4)(φ(t)φ(t)^T + φ(t+1)φ(t+1)^T + φ(t+2)φ(t+2)^T + φ(t+3)φ(t+3)^T)). As a result, the sequence of estimates may not converge, or may converge to some arbitrary value in the solution set of Ē(φφ^T)θ = Ē(φy).
Figure 5.1: Evolution, with the length N of the experiment, of the estimated parameters of system (5.1) obtained with a regression method for three different input signals. (a) u is a white noise: the estimates â, b̂ and ĉ converge to the true values of the parameters. (b) u(t) = sin(πt/2): the identification clearly fails. (c) u(t) = 1 + sin(πt/2): the estimates â, b̂ and ĉ converge to the true values of the parameters.
(c) For u(t) = 1 + sin(π t/2), using the same approach as in the previous case, we find

Ē(φφ^T) = [ 1.5   1    0.5
            1    1.5    1
            0.5   1    1.5 ],

which can be verified to be nonsingular. This choice of input is thus better than the previous one, even though the two inputs only differ by a constant.
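These computations are easy to reproduce numerically. The sketch below (an added illustration; the signal length and the use of sample averages over a long record as a stand-in for Ē are assumptions) estimates Ē(φφ^T) from data for the three inputs and prints its smallest eigenvalue, which is essentially zero for input (b) and clearly positive for (a) and (c).

import numpy as np

def empirical_phi_phiT(u, lags=3):
    """Sample estimate of E[phi phi^T] with phi(t) = (u(t), u(t-1), ..., u(t-lags+1))^T."""
    N = len(u)
    phi = np.column_stack([u[lags-1-k : N-k] for k in range(lags)])
    return phi.T @ phi / len(phi)

N = 4000
rng = np.random.default_rng(0)
t = np.arange(N)

u_a = rng.standard_normal(N)            # (a) white noise of variance 1
u_b = np.sin(np.pi * t / 2)             # (b) repetition of (0, 1, 0, -1)
u_c = 1 + np.sin(np.pi * t / 2)         # (c) repetition of (1, 2, 1, 0)

for name, u in [("(a)", u_a), ("(b)", u_b), ("(c)", u_c)]:
    R = empirical_phi_phiT(u)
    print(name, "smallest eigenvalue of E[phi phi^T] ~", np.round(np.linalg.eigvalsh(R)[0], 3))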
Exercise 5.1. Verify that for the input signal (b), u(t) = sin(π t/2), the matrix Ē φφ^T is nonsingular when φ = (u(t), u(t−1))^T, corresponding to a smaller model set.
These examples have shown that an unfortunate choice of input signal may prevent one from accurately identifying a system, and that making the distinction between a "good" and a "bad" signal may not be easy. Besides, one can verify that the second signal would have been appropriate if φ had size 2, and that the third one would not have been appropriate for sizes of φ larger than 3, so that the notion of "good" signal depends on the model set.

We will see that the problem with the input signal (b) in Example 5.1 is that the signals do not contain enough information to discriminate between different models. Observe for example that if you choose a = b = c = 0 and then a = c = 1, b = 0, you obtain different predictors, ŷ(t) = 0 and ŷ(t) = u(t) + u(t−2). But, for the input u(t) = sin(π t/2), those two predictors always give the same value because u(t) + u(t−2) = 0 (except possibly for the first few time steps). There is thus no way to distinguish them. More generally, when using this input, the predictor of any model of the previous class with parameters a, b, c will provide exactly the same prediction as that of a model with parameters a + d, b, c + d, for any d. This idea of signals not allowing one to distinguish between different predictors is formalized by the notion of informative enough signals.
Definition 5.1. Let M* be a model set. The signal Z^∞ = (u^∞, y^∞) is informative enough with respect to M* if, for any two models M₁, M₂ ∈ M* and their respective one-step-ahead predictors ŷ(t|M₁), ŷ(t|M₂), there holds

Ē(ŷ(t|M₁) − ŷ(t|M₂))² = 0 ⇒ M₁ = M₂.
Note that in this definition, we consider that being informative enough with respect to some model set is a characteristic of the pair of input and output signals. In particular, for linear systems, a lot is a priori known about the output signal directly from the knowledge of the input signal (e.g. the frequency content is the same), even without doing any computation, which simplifies a lot of situations.

In the light of this new definition, one can conclude that the signals (u, y) obtained using the input signal u(t) = sin(π t/2) are not informative enough with respect to the set of predictors ŷ(t) = au(t) + bu(t−1) + cu(t−2), because we have seen that one can for example not distinguish a = b = c = 0 from a = c = 1, b = 0.
Definition 5.2. Z^∞ is informative enough if it is informative enough with respect to the set of all LTI predictors.
As previously underlined, being informative enough is thus a property of the input and output signals Z^∞ = (u^∞, y^∞) and of the model sets, not of the identification method used. It intuitively means that different predictors can be discriminated because they will persistently give different predictions on these signals¹. There is however no guarantee that a given method will eventually lead to the "correct" model when it exists. Nor is being informative enough always necessary for performing an identification. For example, it can be shown that applying an impulse as input will usually never lead to informative enough signals, but in the absence of noise, the response to an impulse entirely determines the system.

Characterizing the signals (u, y) that are informative enough for a model set is in general not an easy problem. We will see in the next result that when the system is controlled in open loop, being informative enough boils down to having an input signal allowing one to distinguish between the different transfer functions G (under some additional assumptions). Intuitively, this comes from the fact that the white noise e will always sufficiently excite the H part of the system, so that one will always be able to distinguish between different noise models.
¹Many authors actually define the notion of "informative enough" in terms of prediction errors instead of predictions. This is however entirely equivalent, as the difference between the prediction errors at time t is the difference between the predictions at time t.
Theorem 5.1. Consider a model set M* = M*_G × M*_H (i.e. G and H can be chosen independently). Suppose that

- the output signal y is generated by a real LTI system y = G₀u + H₀e controlled in open loop, where H₀ has no zero on the unit circle, and e is a white noise of positive variance,

- all signals are quasi-stationary,

- all transfer functions H in M*_H are stable and inversely stable (and have thus no zero or pole on the unit circle, which means that Φ_{He} is positive everywhere),

- M*_H is such that ∫₀^{2π} |H₁(e^{iω}) − H₂(e^{iω})|² dω = 0 implies H₁ = H₂ (which is always the case if the transfer functions are smooth).

Then Z^∞ is informative enough with respect to M* if and only if, for all G₁, G₂ ∈ M*_G,

∫₀^{2π} |G₁(e^{iω}) − G₂(e^{iω})|² Φ_u(ω) dω = 0 ⇒ G₁ = G₂,   (5.2)

or equivalently,

Ē |(G₁ − G₂)u|² = 0 ⇒ G₁ = G₂.   (5.3)
Proof. Observe first that the equivalence between (5.2) and (5.3) follows from the equality Ē x² = (1/2π) ∫₀^{2π} Φ_x(ω) dω for any signal x (equation (1.9) in Section 1.2.1, extended to quasi-stationary signals). In particular, for the signal x = Gu, we have Φ_{Gu}(ω) = |G(e^{iω})|² Φ_u(ω) (Proposition 1.2).

Remember that there is a one-to-one correspondence between a pair of transfer functions (G, H) and the predictor ŷ = H^{-1}Gu + (1 − H^{-1})y. In order to prove the theorem, we need the following lemma.
Lemma 5.1. Under the assumptions of Theorem 5.1, for any two models M₁, M₂ characterized by (G₁, H₁) and (G₂, H₂), we have Ē(ŷ_{M₁} − ŷ_{M₂})² = 0 if and only if (i) H₁ = H₂ =: H, and (ii) there holds

∫₀^{2π} |G₁(e^{iω}) − G₂(e^{iω})|² Φ_u(ω) dω = 0.   (5.4)
Proof. Let us rewrite the difference between the predictors as

ŷ_{M₁} − ŷ_{M₂} = (G₁/H₁ − G₂/H₂) u + (1/H₂ − 1/H₁) y.

Since we assume y to be the output of an LTI system y = G₀u + H₀e, we can replace y in this expression, leading to

ŷ_{M₁} − ŷ_{M₂} = [(G₁ − G₀)/H₁ − (G₂ − G₀)/H₂] u + (1/H₂ − 1/H₁) H₀ e =: Au + Be,   (5.5)

with the obvious definitions for A and B. The signals u and e are independent because we assume the system to be controlled in open loop. Therefore, we have Ē(Au·Be) = Ē(Au)·Ē(Be) = 0, which implies

Ē(ŷ_{M₁} − ŷ_{M₂})² = Ē(Au)² + 2Ē(Au·Be) + Ē(Be)² = Ē(Au)² + Ē(Be)².

In particular, we have the following claim:

Claim 1: Ē(ŷ_{M₁} − ŷ_{M₂})² = 0 if and only if Ē(Au)² = Ē(Be)² = 0.

The rest of the proof is organized around two additional claims.
Claim 2: Ē(Be)² = 0 ⟺ H₁ = H₂ =: H.

There holds

Ē(Be)² = (1/2π) ∫₀^{2π} |B(e^{iω})|² Φ_e(ω) dω = (σ_e²/2π) ∫₀^{2π} | H₀(e^{iω}) (H₂(e^{iω}) − H₁(e^{iω})) / (H₁(e^{iω})H₂(e^{iω})) |² dω,

where we have used the fact that the spectrum of a white noise is flat, the property of the spectrum Φ_{Be} = |B|²Φ_e, and the equality Ē u² = (1/2π) ∫₀^{2π} Φ_u(ω) dω. Since H₀/(H₁H₂) is by assumption nonzero and bounded on the unit circle, the equality Ē(Be)² = 0 is equivalent to

∫₀^{2π} |H₂(e^{iω}) − H₁(e^{iω})|² dω = 0,

which according to the assumptions of the theorem is equivalent to H₁ = H₂.
Claim 3: If H₁ = H₂ =: H, then Ē(Au)² = 0 if and only if (5.4) holds (where A is defined in (5.5)).

Since H₁ = H₂ =: H, we have A = (G₁ − G₂)/H. By a reasoning similar to that made for Ē(Be)², one can verify that Ē(Au)² = 0 is then equivalent to

∫₀^{2π} |G₁(e^{iω}) − G₂(e^{iω})|² Φ_u(ω) dω = 0.

Lemma 5.1 follows from the combination of Claims 1, 2 and 3.
We can now prove Theorem 5.1. Observe that the "if" part of the definition of informative enough trivially always holds (i.e. if two models are identical, they give the same prediction errors), so we just need to focus on the "only if" part.

Suppose first that the implication (5.2) does not hold. Then there exist G₁ ≠ G₂ ∈ M*_G for which (5.4) is satisfied. Take then a transfer function H ∈ M*_H. The two models M₁ = (G₁, H) and M₂ = (G₂, H) belong to M*, because we assume M* = M*_G × M*_H, so that G and H can be chosen independently². It follows from Lemma 5.1 that Ē(ŷ_{M₁} − ŷ_{M₂})² = 0, even though the corresponding models are different, which implies the signals are not informative enough.

Conversely, suppose now that the implication (5.2) holds. For any two models M₁, M₂ satisfying Ē(ŷ_{M₁} − ŷ_{M₂})² = 0, it follows from Lemma 5.1 that H₁ = H₂ and that ∫₀^{2π} |G₁(e^{iω}) − G₂(e^{iω})|² Φ_u(ω) dω = 0, so that G₁ = G₂ by the implication (5.2). The signals are thus informative enough.

Note that the assumption that the system is controlled in open loop is essential for Theorem 5.1 to hold, as will be seen in Chapter 9.
Example 5.2. Suppose that the input signal is u(t) = sin(π t/2), so we can compute

Φ_u(ω) = (π/2) δ(ω − π/2) + (π/2) δ(ω + π/2).
²This assumption is important for the proof to be valid: otherwise, it might be that the class of H functions that can be associated with G₁ and the class of those that can be associated with G₂ do not intersect, making the construction described here impossible.
Consider now the class of transfer functions G := a + bq^{-1} + cq^{-2}. We have G(e^{iω}) = a + be^{-iω} + ce^{-2iω}. For two such transfer functions G₁, G₂, we have

∫_{−π}^{π} |G₁(e^{iω}) − G₂(e^{iω})|² Φ_u(ω) dω
  = ∫_{−π}^{π} |(a₁ − a₂) + (b₁ − b₂)e^{-iω} + (c₁ − c₂)e^{-2iω}|² (π/2)[δ(ω − π/2) + δ(ω + π/2)] dω
  = (π/2) |(a₁ − a₂) + (b₁ − b₂)e^{-iπ/2} + (c₁ − c₂)e^{-iπ}|² + (π/2) |(a₁ − a₂) + (b₁ − b₂)e^{iπ/2} + (c₁ − c₂)e^{iπ}|².

Observe now that e^{iπ} = e^{-iπ} = −1. Therefore, we can have ∫_{−π}^{π} |G₁(e^{iω}) − G₂(e^{iω})|² Φ_u(ω) dω = 0 even if G₁ ≠ G₂, provided that b₁ = b₂ and (a₁ − a₂) = (c₁ − c₂). It follows then from Theorem 5.1 that Z^∞ is not informative enough with respect to any class of models containing this class of transfer functions G (with independent parameters).

However, if we consider the class of transfer functions G := a, we have

∫_{−π}^{π} |G₁(e^{iω}) − G₂(e^{iω})|² Φ_u(ω) dω = 0 ⟺ (π/2)(|a₁ − a₂|² + |a₁ − a₂|²) = 0,

so the data would be informative enough for {G = a}.
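The mechanism of this example can be checked directly on the frequency responses. The sketch below (an added illustration) evaluates two transfer functions of the class a + bq^{-1} + cq^{-2} with b₁ = b₂ and a₁ − a₂ = c₁ − c₂ at the only frequencies ±π/2 excited by u(t) = sin(πt/2): they coincide there, even though they are different transfer functions.

import numpy as np

# Two different second-order FIR transfer functions with b1 = b2 and a1 - a2 = c1 - c2
G1 = np.array([0.0, 0.0, 0.0])   # a1, b1, c1   (parameters a = b = c = 0)
G2 = np.array([1.0, 0.0, 1.0])   # a2, b2, c2   (parameters a = c = 1, b = 0)

def freq_resp(coeffs, w):
    a, b, c = coeffs
    return a + b * np.exp(-1j * w) + c * np.exp(-2j * w)

# The input u(t) = sin(pi t / 2) only excites the frequencies +/- pi/2
for w in (np.pi / 2, -np.pi / 2):
    print(abs(freq_resp(G1, w) - freq_resp(G2, w)))   # 0: indistinguishable on this input

# At other frequencies the two transfer functions differ, but Phi_u is zero there
print(abs(freq_resp(G1, 0.0) - freq_resp(G2, 0.0)))    # 2: they are different transfer functions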
5.2 Persistence of excitation

Theorem 5.1 gives a criterion based on u and G to decide whether some data are informative enough with respect to a set of models (in open loop). We now try to quantify the information content resulting from an input signal. This will lead to a simpler criterion for many important model sets, such as those with rational transfer functions.

The intuitive idea is the following. Remember that signals are informative enough, under suitable conditions, if ∫_{−π}^{π} |G₁(e^{iω}) − G₂(e^{iω})|² Φ_u(ω) dω = 0 implies G₁ = G₂. Observe that ∫_{−π}^{π} |G₁(e^{iω}) − G₂(e^{iω})|² Φ_u(ω) dω = 0 implies G₁(e^{iω}) = G₂(e^{iω}) for all ω where Φ_u(ω) is positive (neglecting zero-measure issues for the moment). We have thus as many constraints of the form G₁(e^{iω_j}) = G₂(e^{iω_j}) as there are frequencies at which Φ_u is positive. So if the class of transfer functions considered is parametrized by k parameters, and if there are at least k points at which Φ_u is positive, the k constraints G₁(e^{iω_j}) = G₂(e^{iω_j}) will typically imply equality of the k parameters and therefore G₁ = G₂ everywhere. Informative enough signals are thus generated by input signals whose spectrum is positive at sufficiently many frequencies. To formalize this, we introduce the notion of order of persistence of excitation.
Definition 5.3. A signal u is persistently exciting of order n if there are n disjoint sets S₁, ..., Sₙ ⊂ [−π, π) such that ∫_{Sᵢ} Φ_u(ω) dω > 0 for i = 1, ..., n.

In particular, one can see that a signal is persistently exciting of order n if one of the two following conditions is satisfied:

- Φ_u contains n δ-functions;

- there exists a set S of positive measure with ω ∈ S ⇒ Φ_u(ω) > 0 (in which case u is persistently exciting of any order).
Theorem 5.2. Under the same assumptions as Theorem 5.1, if the class of transfer functions M_G contains all rational transfer functions (satisfying the stability requirements for the predictor to exist) of the form

G(q) = B(q)/A(q) = (b₀ + b₁q^{-1} + ⋯ + b_{n_b}q^{-n_b}) / (a₀ + a₁q^{-1} + ⋯ + a_{n_a}q^{-n_a}),

for given n_a and n_b (and where the parameter set does not restrict the dimension of the set G), then the signals (u, y) are informative enough with respect to the model set if and only if u is persistently exciting of order n_a + n_b + 1.
Proof. It follows from Theorem 5.1 that the signals are informative enough if and only if, for all G₁ = B₁/A₁ and G₂ = B₂/A₂,

∫_{−π}^{π} |(B₁/A₁)(e^{iω}) − (B₂/A₂)(e^{iω})|² Φ_u(ω) dω = 0 ⇒ B₁/A₁ = B₂/A₂.   (5.6)

Observe that

(B₁/A₁)(e^{iω}) − (B₂/A₂)(e^{iω}) = [A₂(e^{iω})B₁(e^{iω}) − A₁(e^{iω})B₂(e^{iω})] / [A₁(e^{iω})A₂(e^{iω})],

so that

∫_{−π}^{π} |(B₁/A₁)(e^{iω}) − (B₂/A₂)(e^{iω})|² Φ_u(ω) dω = 0

if and only if

∫_{−π}^{π} |A₁(e^{iω})B₂(e^{iω}) − A₂(e^{iω})B₁(e^{iω})|² Φ_u(ω) dω = 0.   (5.7)
Suppose now that u is persistently exciting of order n_a + n_b + 1. Then if (5.7) holds, the non-negativity of Φ_u implies that the polynomial A₁(e^{iω})B₂(e^{iω}) − A₂(e^{iω})B₁(e^{iω}) equals 0 at at least one point in each of the n_a + n_b + 1 disjoint sets on which Φ_u has a positive integral (otherwise the integral in (5.7) would necessarily be positive), so that A₁(z)B₂(z) − A₂(z)B₁(z) has at least n_a + n_b + 1 roots. Since this polynomial is of order n_a + n_b, having n_a + n_b + 1 roots implies that it is identically zero, which implies that B₁/A₁ = B₂/A₂.

Conversely, suppose that u is not persistently exciting of order n_a + n_b + 1. The definition of persistently exciting signals implies that its spectrum Φ_u is then zero almost everywhere, and that it contains at most n_a + n_b delta functions. One can then easily find parameters such that A₁(e^{iω})B₂(e^{iω}) − A₂(e^{iω})B₁(e^{iω}) is not trivially equal to 0, but has a root at every one of the at most n_a + n_b values of ω at which there is a delta function (since this polynomial is of order n_a + n_b), so that (5.7) holds without B₁/A₁ = B₂/A₂.

We have thus proved that (5.6) holds if and only if u is persistently exciting of order n_a + n_b + 1. The result of the theorem then follows from Theorem 5.1.
Example 5.3. Let us evaluate the order of persistence of excitation of the signals from the beginning of this chapter.

(a) u is a white noise of variance σ². Since Φ_u(ω) is positive for every value of ω (the spectrum is a constant), one can arbitrarily divide the interval [−π, π) into n disjoint sets for any n (e.g. by the choice Sᵢ = [−π + 2π(i−1)/n, −π + 2πi/n) with i = 1, ..., n). The signal is thus persistently exciting of any order.

(b) u(t) = sin(π t/2), i.e. the repetition of (0, 1, 0, −1). The spectrum is Φ_u(ω) = (π/2)(δ(ω − π/2) + δ(ω + π/2)), and the signal is therefore persistently exciting of order 2.

(c) u(t) = 1 + sin(π t/2), i.e. the repetition of (1, 2, 1, 0). The spectrum is Φ_u(ω) = (π/2)(δ(ω − π/2) + δ(ω + π/2)) + 2πδ(ω), and the signal is therefore persistently exciting of order 3.
Finite impulse response filters

We now analyze the particular meaning of being persistently exciting of order n in the context of finite impulse response filters, and link it to the properties of the correlation matrix (see the correlation analysis in Section 2.1.3). These special interpretations are made possible by the fact that the classes of finite impulse response filters are vector spaces, unlike other generic classes of rational transfer functions. Observe indeed that any linear combination of FIR transfer functions of order at most n is an FIR transfer function of order at most n.
Proposition 5.1. The following conditions are equivalent.

a) u is persistently exciting of order n.

b) Under the assumptions of Theorem 5.1, signals generated by u are informative enough with respect to any set of models for which the corresponding transfer functions G are finite impulse response filters of order n − 1.

c) u cannot be filtered to 0 by a nontrivial FIR filter of order n − 1, i.e. for any FIR filter G of order n − 1, Ē(Gu)² = 0 ⇒ G = 0.

d) det R_n ≠ 0, with R_n defined by

R_n = [ r_u(0)      ⋯   r_u(n−1)
        ⋮           ⋱   ⋮
        r_u(n−1)    ⋯   r_u(0) ],

where r_u(τ) is the correlation Ē u(t)u(t−τ).
Proof. (a) ⟺ (b) follows from Theorem 5.2. Moreover, it then follows from Theorem 5.1 that u is persistently exciting of order n if and only if

∫_{−π}^{π} |G₁(e^{iω}) − G₂(e^{iω})|² Φ_u(ω) dω = 0 ⇒ G₁ = G₂

for any G₁, G₂ corresponding to FIR filters of order n − 1. Since ΔG = G₁ − G₂ is also an FIR filter of order n − 1, this condition is equivalent to

∫_{−π}^{π} |ΔG(e^{iω})|² Φ_u(ω) dω = 0 ⇒ ΔG = 0,

or Ē(ΔG(q)u)² = 0 ⇒ ΔG = 0 for any FIR filter ΔG of order n − 1, so that (a) and (b) are equivalent to (c). There remains to establish the equivalence with (d).

Let φ = (u(t), u(t−1), ..., u(t−n+1))^T ∈ R^n. We have R_n = Ē φφ^T. Since R_n is symmetric and positive semidefinite, its determinant is zero if and only if there exists a nonzero vector h ∈ R^n such that h^T R_n h = 0. Let us now analyze h^T R_n h. There holds

h^T R_n h = h^T Ē(φφ^T) h = Ē h^T φφ^T h = Ē(φ^T h)².

Moreover, it follows from the definition of φ that

φ(t)^T h = Σ_{k=0}^{n−1} h_k u(t−k) = ( Σ_{k=0}^{n−1} h_k q^{-k} ) u(t) =: H(q)u(t),

where we have labeled the entries of h ∈ R^n from 0 to n − 1 for simplicity of notation. We have thus proved that det R_n = 0 if and only if there exists a nontrivial FIR filter H of order n − 1 for which Ē(H(q)u)² = 0, which is precisely the negation of (c). This establishes the equivalence between (c) and (d).
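Condition d) gives a convenient numerical test. The sketch below (added here as an illustration; sample correlations over a long record stand in for r_u(τ)) builds R_n for the two sinusoidal inputs of Example 5.3 and prints det R_n for increasing n: the determinant vanishes precisely beyond the order of persistence of excitation.

import numpy as np

def correlation_matrix(u, n):
    """Sample estimate of R_n, with entries r_u(|i-j|) = E[u(t)u(t-|i-j|)]."""
    N = len(u)
    r = np.array([np.dot(u[tau:], u[:N - tau]) / (N - tau) for tau in range(n)])
    return np.array([[r[abs(i - j)] for j in range(n)] for i in range(n)])

N = 8000
t = np.arange(N)
u_b = np.sin(np.pi * t / 2)        # persistently exciting of order 2
u_c = 1 + np.sin(np.pi * t / 2)    # persistently exciting of order 3

for name, u in [("(b)", u_b), ("(c)", u_c)]:
    for n in (2, 3, 4):
        Rn = correlation_matrix(u, n)
        print(name, "n =", n, " det R_n ~", np.round(np.linalg.det(Rn), 4))
# For (b), det R_n drops to ~0 from n = 3 on; for (c), from n = 4 on.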
5.3 Some typical input signals

We finish this chapter by presenting a few typical signals that can be used.

5.3.1 Impulse and steps

For the impulse, we have r_u(τ) = 0 for all τ and Φ_u = 0. For the step, we have r_u(τ) = 1 for all τ and Φ_u(ω) = 2πδ(ω). They are thus persistently exciting of order 0 and 1 respectively. Nevertheless, using those signals can be useful to get a first qualitative idea of the transfer function of the system. Moreover, they can still allow one to recover the exact transfer function in a noiseless environment. This latter fact does not contradict the definition of informative enough signals, which only considers the long-term expected average of the difference of predictors.
5.3.2 Sum of sines - periodic signals

Any periodic signal can be expressed as a sum of sines of the form

u(t) = Σ_{j=1}^{m} a_j sin(ω_j t + φ_j)

with 0 ≤ ω₁ < ω₂ < ⋯ < ω_m ≤ π.

One can verify that the correlations are r_u(τ) = Σ_j (C_j/2π) cos(ω_j τ), with C_j = πa_j² if ω_j ∈ (0, π), and C_j = 2πa_j² sin²(φ_j) if ω_j = π or 0. The spectrum of u is given by

Φ_u(ω) = Σ_j (C_j/2) (δ(ω − ω_j) + δ(ω + ω_j)).   (5.8)

If only frequencies in (0, π) are used, the signal is thus persistently exciting of order 2m. The use of the frequency π only adds one order of persistence of excitation, because δ(ω − π) and δ(ω + π) correspond to the same angle (they differ by 2π). The same holds true for the frequency 0, because δ(ω − 0) = δ(ω + 0). Beware that a frequency-0 component with a phase φ = 0 is actually identically zero, as sin(0·t + 0) = 0, and does thus not add any order of persistence of excitation.

Sums of a reasonable number (i.e. not too large) of sines are often appreciated for the smaller sensitivity to noise that they induce.
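The correlation formula above is easy to check empirically. The short sketch below (an added illustration with arbitrarily chosen amplitudes and frequencies, and zero phases) compares the sample correlations of a sum of two sines with r_u(τ) = Σ_j (a_j²/2) cos(ω_j τ), the special case of the formula for ω_j ∈ (0, π).

import numpy as np

# A sum of two sines with frequencies in (0, pi): expected order of persistent excitation = 4
a = [1.0, 0.5]
w = [0.4 * np.pi, 0.7 * np.pi]
N = 200000
t = np.arange(N)
u = sum(aj * np.sin(wj * t) for aj, wj in zip(a, w))

def r_hat(u, tau):
    return np.dot(u[tau:], u[:len(u) - tau]) / (len(u) - tau)

# Compare the sample correlation with the formula r_u(tau) = sum_j (a_j^2 / 2) cos(w_j tau)
for tau in range(5):
    theory = sum(aj**2 / 2 * np.cos(wj * tau) for aj, wj in zip(a, w))
    print(tau, round(r_hat(u, tau), 4), round(theory, 4))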
5.3.3 White noise

The spectrum of a white noise is a constant Φ_u = σ². Some examples of white noise are:

- u(t) ~ N(0, σ²),

- u(t) = 1 with probability 1/2, and u(t) = −1 with probability 1/2,

- any i.i.d. sequence with E(u(t)) = 0 (and bounded moments).

The second law is interesting because it has the largest possible variance among all signals with values restricted to [−1, 1].
5.3.4 Filtered white noise

One drawback of the white noise is that it excites the system at all frequencies, while we may be less interested in some frequencies, or wish to excite more those at which there is more noise. This can be solved by filtering the white noise. An obvious way of doing so is to apply a usual LTI filter H(q), leading to a signal with spectrum Φ_{He}(ω) = σ² |H(e^{iω})|². However, such filters typically do not preserve the bounds on the signal amplitude (if any). They are inapplicable if the signal can only take a finite number of designated values. In order to maximize the variance in the presence of a bound on the input, or to use only a small number of input values, one can instead apply time-varying or stochastic filters that preserve the amplitude of the signal to a white noise with bounded amplitude. For example, one can generate the following signals:

- u(t) = e(⌊(t−1)/N₀⌋ + 1), where e is a white noise taking values +1 or −1. This corresponds to the application of a time-varying filter, which holds each value of e during N₀ time steps. This signal has the correlation function r_u(τ) = (N₀ − |τ|)/N₀ if |τ| ≤ N₀, and 0 otherwise. It has the same correlation as u = He, with H = (1/√N₀)(1 − q^{-N₀})/(1 − q^{-1}), and we have Φ_u(ω) = (1 − cos(N₀ω)) / (N₀(1 − cos ω)).

- u(t) = e(t) with probability λ, and u(t) = u(t−1) otherwise. One can verify that the correlation function is r_u(τ) = (1 − λ)^{|τ|} σ_e².
Exercise 5.2. Check that the expressions for the correlation functions and the spectra provided for the signals above are correct.
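As a starting point for Exercise 5.2, the sketch below (added here; the block length N₀, the probability λ and the ±1 noise are illustrative choices) generates both amplitude-preserving signals and compares their sample correlations with the formulas given above.

import numpy as np

rng = np.random.default_rng(0)
N = 200000

# Signal 1: hold a +/-1 white noise constant during N0 steps
N0 = 5
e1 = rng.choice([-1.0, 1.0], size=N // N0 + 1)
u1 = np.repeat(e1, N0)[:N]

# Signal 2: refresh u with a new white-noise value with probability lam, otherwise hold it
lam = 0.3
e2 = rng.choice([-1.0, 1.0], size=N)
u2 = np.empty(N)
u2[0] = e2[0]
for t in range(1, N):
    u2[t] = e2[t] if rng.random() < lam else u2[t - 1]

def r_hat(u, tau):
    return np.dot(u[tau:], u[:len(u) - tau]) / (len(u) - tau)

for tau in range(4):
    print(tau,
          round(r_hat(u1, tau), 3), round(max(N0 - tau, 0) / N0, 3),   # (N0 - |tau|)/N0
          round(r_hat(u2, tau), 3), round((1 - lam) ** tau, 3))        # (1 - lam)^|tau| (sigma_e^2 = 1)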
Chapter 6

General parametric methods

6.1 What is an identification method?

Let M : θ → M(θ) = (W_u(θ, q), W_y(θ, q)) be a model structure defined on a set D_M ⊂ R^d, where W_u and W_y are the predictor filters derived in Section 4.3, so that

ŷ(t|θ) = W_u(θ, q)u(t) + W_y(θ, q)y(t).

We assume that D_M is such that these predictors are well defined for every θ.

After an experiment, we have the input-output signal pairs

z^N = ( u(0)  ⋯  u(N)
        y(0)  ⋯  y(N) ).

Given a model structure M, an identification technique simply consists of an estimator

θ̂ : Z^N → D_M

that maps the results (including the input) of an experiment onto a value θ̂ of the parameters. (We will often use θ̂_N to denote the value of the estimator for an experiment of length N.)

This definition does not contain any reference to a quality criterion, and not every estimator is of course a good estimator. For reasonable experiments, good estimators would be expected to have the following properties:
- Consistency: if the signals are generated by a system M₀ = M(θ₀) belonging to the model set considered, and described by a unique θ₀ ∈ D_M, then we expect that θ̂_N → θ₀ when N grows.

- If the signals are generated by a system M₀ belonging to the model set considered, but described by multiple values θ ∈ D_M (i.e. M is not identifiable at M₀), M(θ̂_N) should converge to M₀, even if θ̂_N itself does not necessarily converge.

- If the signals are generated by a system that does not belong to the model set, we hope that M(θ̂_N) will converge to some transfer function "close" to the real system.

Note that the last point is much more complex to analyze formally, and is beyond the scope of this class.
6.2 Important classes of methods

The "correct" model would be the one that reproduces exactly the measured output when given the same input. The stochastic or uncertain nature of the systems makes this goal unreachable, however. Nevertheless, it is reasonable to select models which will predict values "as close as possible" to the real output. Remember that a model M(θ) consists of two filters W_u, W_y corresponding to a predictor ŷ(t|θ) = W_u(θ, q)u + W_y(θ, q)y. The prediction error is the difference between the actual value of the output and this prediction:

ε(t|θ) = y(t) − ŷ(t|θ).

Lemma 6.1. For any model structure (for which the predictors are well defined) and signals, there holds

ε(t|θ) = H_θ^{-1}(q)(y(t) − G_θ(q)u(t)).

In other words, the prediction error is the realization of the noise e that would explain the observed signals if the system were correctly described by the model considered.
Proof. Remember that

ŷ(t|θ) = H_θ^{-1}G_θ u + (1 − H_θ^{-1})y = H_θ^{-1}(G_θ u − y) + y.

Therefore,

ε(t|θ) = y(t) − ŷ(t|θ) = H_θ^{-1}(y − G_θ u).
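In practice, Lemma 6.1 says that prediction errors can be computed by simple filtering. The sketch below (added as an illustration; the ARMAX-type parametrization G = B/A, H = C/A and the numerical values are assumptions) computes ε(t|θ) = (A/C)y − (B/C)u and checks that it recovers the noise e when the data are generated by that same model.

import numpy as np
from scipy.signal import lfilter

def prediction_errors(y, u, B, A, C):
    """eps(t|theta) = H^{-1}(q)(y - G(q)u) for G = B/A and H = C/A, i.e. eps = (A/C)y - (B/C)u."""
    return lfilter(A, C, y) - lfilter(B, C, u)

rng = np.random.default_rng(0)
N = 1000
A = np.array([1.0, -0.6]); B = np.array([0.0, 1.0]); C = np.array([1.0, 0.4])
u = rng.standard_normal(N)
e = 0.2 * rng.standard_normal(N)
y = lfilter(B, A, u) + lfilter(C, A, e)   # data generated by the same model

eps = prediction_errors(y, u, B, A, C)
print("max |eps - e| (up to transients):", np.max(np.abs(eps - e)[50:]))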
Most methods are based on selecting the θ that renders the prediction error as acceptable as possible. They can be divided into two groups:

1. Scalar methods: optimizing a scalar criterion of the prediction error ε:

(a) Prediction error methods: minimizing some measure of ε. This class includes regressions, least square errors, etc.

(b) Statistical methods: maximizing some probabilistic measure of the prediction error. This class includes the maximum likelihood method, or the selection of the most likely parameter θ given prior knowledge and the observations.

2. Correlation methods: making sure that ε(t|θ) is uncorrelated with some signals ζ(t), i.e. selecting θ̂_N such that Σ_t ε(t|θ)ζ(t) = 0. The rationale behind these methods, when ζ is accessible, is that if the error is correlated with a signal, this signal could help predicting and therefore correcting the error. As a result, a better prediction could be obtained. This class of methods includes the instrumental variables.

The distinction between the different methods is not absolute. We will see that some statistical methods are actually particular cases of prediction error methods. We have also seen in Section 3.3 that the least square prediction error criterion is actually equivalent to requiring the prediction error to be uncorrelated with its gradient: indeed, θ minimizes Σ_t ε(t|θ)² only if Σ_t (∂ε(t|θ)/∂θ) ε(t|θ) = 0. Finally, correlation methods can be seen as prediction error methods where one minimizes some measure of the correlation. In fact, if the dimension of the signals ζ is too large with respect to θ, one can generalize the correlation methods to θ̂_N = arg min_θ ||Σ_t ζ(t)ε(t|θ)||.
In the remainder of this chapter, we discuss general prediction error methods and statistical methods in some more detail. Correlation methods can be more arbitrary, and are less studied here.

Note that, independently of the method used, one sometimes wants to give more importance to some frequencies of the prediction errors. This can be achieved by applying the methods to L(q)ε. For example, the least square method becomes

θ̂_N = arg min_θ Σ_{t=0}^{N} (L(q)ε(t|θ))².

Observe that

L(q)ε(t) = L(q)H^{-1}(y − Gu) = (HL^{-1})^{-1}(y − Gu).   (6.1)

Therefore, applying a filter to the prediction error is exactly equivalent to modifying the noise model, using {HL^{-1}} instead of H. All the results obtained in the sequel can thus be directly extended to filtered prediction errors by applying this modification of the noise model set.
6.3 Prediction error methods

The prediction error method with norm ℓ consists in taking the following estimator:

θ̂_N = arg min_θ Σ_{t=0}^{N} ℓ(ε(t|θ)).

Very often, one will use ℓ(ε) = (1/2)||ε||², which is called the least-square criterion. It is worth analyzing the meaning of the least-square criterion in the frequency domain.

The prediction error E_N in the frequency domain is

E_N = DFT(ε_N) = (1/√N) Σ_{t=1}^{N} e^{-i 2πkt/N} ε(t),   (6.2)

with k = 1, ..., N. Using Σ_t ε²(t) = Σ_k |E_N|² and ε = H^{-1}(y − Gu), we obtain

E_N ≃ H_{LS}^{-1}(Y_N − G_{LS}U_N) + O(transients)   (6.3)
and

Σ_t ε² = Σ_k |E_N|² = Σ_k |H_{LS}|^{-2} |Y_N − G_{LS}U_N|² = Σ_k (|U_N|²/|H_{LS}|²) |Y_N/U_N − G_{LS}|².   (6.4)

Therefore, minimizing the sum of the squared prediction errors is the same (up to transitory signals) as minimizing the difference between G and the experimental transfer function estimate Y_N/U_N (see Fourier Analysis in Section 2.2.2), with a weight proportional to the ratio between the signal and the noise transfer function (which is also part of the optimization problem).
6.4 Statistical methods

Suppose that the "real" system belongs to the model set, and can be obtained using the parameter θ. Due to the presence of noise, the observed sequence of outputs y obtained when applying the input u is a random variable. We denote by f_θ(y) the probability density function of this random variable. In other words, f_θ(y) is the likelihood of observing the output sequence y if the input u is applied and if the real system is defined by the parameters θ.

6.4.1 Maximum likelihood method

The principle is to choose the parameters θ that maximize the likelihood of observing the output y given the input u. In other words, we choose as estimator the value of θ such that the observation made is the least "surprising". We will see in Section 6.4.2 that this can be seen as selecting the most "probable" θ given the observation made, in the absence of prior information about θ:

θ̂ = arg max_θ f_θ(y).
Figure 6.1: Example of probability density functions
The idea of the approach can be illustrated by Figure 6.1. If x = 0 is observed, we can have some confidence that the parameter is θ₁, but if x = 1 is observed the distinction is not so clear. The following example illustrates the maximum likelihood method.
Example 6.1. Consider the simple class of models y(t) = au(t) + e(t), where a is the constant we want to find and the e(t) are zero-mean Gaussian white noise samples with variance σ², independent and identically distributed. Assuming that a is the real value, the probability density of observing y(t) at time t if u(t) is applied is thus (1/√(2πσ²)) e^{−(y(t) − au(t))²/(2σ²)}. Moreover, the probabilities of the observations at different times are independent. It follows that the p.d.f. of observing Y = (y₁ ⋯ yₙ) for the input U = (u₁ ⋯ uₙ) if the parameter is a is

f_a(Y) = Π_{i=1}^{n} (1/√(2πσ²)) e^{−(y_i − au_i)²/(2σ²)}.

One can show that in this case arg max_a f_a(Y) = arg max_a ln f_a(Y) = arg max_a ( n ln(1/√(2πσ²)) − (1/(2σ²)) Σ_{i=1}^{n} (y_i − au_i)² ). The first term and the factor 1/(2σ²) do not influence the location of the maximum. So, â = arg min_a Σ_{i=1}^{n} (y_i − au_i)². Setting the derivative with respect to a to zero gives the minimum:

â = (Σ_{i=1}^{n} u_i y_i) / (Σ_{i=1}^{n} u_i²).
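A quick numerical check of this example (added here; the true value of a, the noise level and the sample size are arbitrary choices) compares the closed-form estimate â = Σ u_i y_i / Σ u_i² with a direct numerical maximization of the Gaussian log-likelihood.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
n, a_true, sigma = 200, 1.7, 0.5
u = rng.standard_normal(n)
y = a_true * u + sigma * rng.standard_normal(n)

# Closed-form maximum likelihood estimate derived in Example 6.1
a_ml = np.dot(u, y) / np.dot(u, u)

# Direct numerical maximization of the log-likelihood, as a sanity check
def neg_log_likelihood(a):
    return np.sum((y - a * u) ** 2) / (2 * sigma**2) + n * 0.5 * np.log(2 * np.pi * sigma**2)

a_num = minimize_scalar(neg_log_likelihood).x
print(a_ml, a_num)   # the two estimates coincide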
In order to apply the maximum likelihood method, one must have an idea of the distribution of e(t).

Suppose now that we consider a class of models of the form

y(t) = G(q, θ)u(t) + H(q, θ)e(t),   (6.5)

where e is some white noise with a p.d.f. f_e. Assuming that y is really generated by such a model, it follows from this expression that y = Gu + (H − 1)e + e and e = H^{-1}(y − Gu). Lemma 6.1 then implies that ε(t|θ) = e(t) (consistently with the interpretation that ε(t|θ) is the value that e(t) should take to explain the observed signals if the system were indeed described by G_θ, H_θ). So, if the real system is described by θ, the prediction errors behave as a white noise with p.d.f. f_e, and are thus i.i.d. It follows then that the p.d.f. of observing Y for the input U if the real system is defined by θ is

f^y_θ(Y|U) = Π_{t=1}^{n} f_e(ε(t|θ)),

so that

arg max_θ f^y_θ(Y|U) = arg min_θ −ln(f^y_θ(Y|U)) = arg min_θ Σ_{t=1}^{n} −ln(f_e(ε(t|θ))).

Therefore, if the class of models used can be described by (6.5) with e(t) i.i.d. with probability density function f_e, then the maximum likelihood method is equivalent to a prediction error method with norm −ln f_e(·).
6.4.2 Maximum a posteriori method

The maximum a posteriori method is an adaptation of the maximum likelihood method for situations where one has prior information on the distribution g_θ(θ) of the parameter θ. Instead of picking the θ for which the observed signals were the most likely, as in the latter method, the maximum a posteriori estimate consists in taking the most likely θ given the observed data. Standard probability rules imply that the a priori p.d.f. of θ defining the real system AND of observing Y (for the input U) is

f(Y ∩ θ) = f(Y|θ)g_θ(θ) = f_θ(Y)g_θ(θ),

and that, conditional on the observation of a value Y, the a posteriori p.d.f. of θ (i.e. the probability density function of θ knowing that Y has been observed) is

f(θ|Y) = f(Y ∩ θ)/f(Y) = f(Y|θ)g_θ(θ)/f(Y),

where f(Y) is the p.d.f. of Y. This leads to the following estimator:

θ̂ = arg max_θ f_θ(Y)g_θ(θ)/f(Y) = arg max_θ f_θ(Y)g_θ(θ),

where we used the fact that f(Y) does not affect the argument of the maximum since it does not depend on θ. Let us apply this method to Example 6.1.
Example 6.2. Suppose now that the parameter a is a priori distributed according to a Gaussian law with mean ā and variance σ_a². It follows, after dropping all terms that do not influence the maximization, that

â = arg max_a ln(f_a(Y)g_a(a)) = arg max_a [ −(1/(2σ²)) Σ_{i=1}^{n} (y_i − au_i)² − (1/(2σ_a²))(a − ā)² ].   (6.6)

To find the maximum, we set the derivative with respect to a to zero:

(1/σ²) Σ_{i=1}^{n} (y_i − au_i)u_i − (1/σ_a²)(a − ā) = 0,

i.e.

(1/σ²) Σ_{i=1}^{n} y_i u_i − a (Σ_{i=1}^{n} u_i²)/σ² − (a − ā)/σ_a² = 0,

and we obtain

â = [ (1/σ²) Σ_{i=1}^{n} y_i u_i + ā/σ_a² ] / [ (1/σ²) Σ_{i=1}^{n} u_i² + 1/σ_a² ].
We can see that if we almost know the value of the parameter (σ_a ≈ 0), then â ≈ ā, and if we have no prior information on the parameter (σ_a = ∞), the estimator is the same as the one from the maximum likelihood method. Moreover, the estimate obtained by this method tends to that obtained by the maximum likelihood method when n goes to infinity. This is consistent with the intuition, as getting more information from the experiment decreases the relative importance of the prior information on a.
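The behaviour described above can be observed numerically. The sketch below (an added illustration with arbitrary prior mean ā, prior variance σ_a² and noise level) computes the MAP estimator of Example 6.2 and the ML estimator of Example 6.1 for increasing sample sizes; the influence of the prior vanishes as n grows.

import numpy as np

rng = np.random.default_rng(1)
a_true, sigma = 1.7, 0.5
a_bar, sigma_a = 0.0, 1.0          # prior on a (illustrative values)

def map_and_ml(n):
    u = rng.standard_normal(n)
    y = a_true * u + sigma * rng.standard_normal(n)
    a_ml = np.dot(u, y) / np.dot(u, u)
    a_map = (np.dot(u, y) / sigma**2 + a_bar / sigma_a**2) / \
            (np.dot(u, u) / sigma**2 + 1 / sigma_a**2)
    return a_ml, a_map

for n in (5, 50, 5000):
    print(n, map_and_ml(n))   # the MAP estimate moves from the prior mean towards the ML estimate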
In general, the longer the experiment is, the narrower f_θ becomes, taking high values on a smaller and smaller set and very low values elsewhere. The prior knowledge represented by g_θ, which remains constant, thus asymptotically becomes irrelevant, and the maximum a posteriori method becomes asymptotically equivalent to the maximum likelihood method. Intuitively, if all the experimental evidence contradicts the prior knowledge, the latter is eventually dismissed.
Besides, if we do not have any prior information on θ, all values of θ are a priori equiprobable, so that maximizing f_θ(Y)g_θ is equivalent to maximizing f_θ(Y). This reasoning does not formally make sense, as one cannot define an equiprobable distribution g_θ on a set of infinite measure, so that f_θ(Y)g_θ is actually undefined. Suppose however that we just know that θ should belong to some set D̃_M of large but bounded measure, and that we have no additional information. In that case, g_θ(θ) = 1/|D̃_M| if θ ∈ D̃_M and 0 otherwise, so that

arg max_{θ ∈ D̃_M} f_θ(Y)g_θ(θ) = arg max_{θ ∈ D̃_M} f_θ(Y).

By taking larger and larger sets D̃_M, one thus recovers the maximum likelihood method.
Chapter 7

Consistency of Estimators

We now study the consistency of the estimators described in Chapter 6, that is, their ability to asymptotically recover the "real" system when it exists and belongs to the model set considered. It would of course be interesting to have analogous properties for situations where the real system is only "close" to the model set considered, but this issue is much more complex and lies beyond the scope of these notes.

We begin by analyzing in detail the case of prediction error methods with the least square criterion, and then see how the results can be extended to other norms, statistical methods, and correlation methods.

Notation convention:

In the sequel, we will denote by G_θ, H_θ the transfer functions of the model set parameterized by θ. Using a slightly abusive notation, we will denote by G₀, H₀ the transfer functions of the real system. One should however not think that these functions are those obtained when θ = 0. Actually, when the real system belongs to the model set considered and has a unique parametrization θ₀, there holds G₀ = G_{θ₀} and H₀ = H_{θ₀}.

We will also use θ̂_N to denote the parameters estimated after an experiment of length N (formally, after truncating the results of a virtually infinite experiment at length N), and Ĝ_N, Ĥ_N the corresponding transfer functions.
7.1 Least Square prediction error method

7.1.1 A first example

To begin with a simple example, let us consider the following system:

y = a₀u + e,

where e is a white noise, and the one-step-ahead predictor ŷ(t|a) = au(t), which corresponds to a model class containing the real system. The least square prediction error method gives us the estimate

â_N = arg min_a (1/N) Σ_{t=1}^{N} (y(t) − au(t))².

When N grows, it is tempting to assume that

â_N → arg min_a lim_{N→∞} (1/N) Σ_{t=1}^{N} (y(t) − au(t))² = arg min_a Ē(y − au)²

(which is actually the case if the convergence is uniform). Replacing y by its expression according to the real system yields

â_N → arg min_a Ē(a₀u + e − au)² = arg min_a [ Ē((a₀ − a)u)² + 2Ē e(a₀ − a)u + Ē e² ].

If e and u are independent, which is always the case when the experiment takes place in open loop, then the middle term is zero, so that

Ē(a₀u + e − au)² = Ē((a₀ − a)u)² + Ē e²,

which has a unique minimum at a = a₀ (for nontrivial u), since the last term is independent of a and actually equal to σ_e². The estimator â_N thus converges in this case to a₀, and the "real" system is asymptotically recovered.
The previous intuitive reasoning contains two phases. First, we assumed that the minimizers of the cost functions (1/N) Σ_{t=1}^{N} (y(t) − au(t))² asymptotically tend to the minimizer of Ē(y − au)². Second, we have seen that, thanks to some independence properties between u and e, the minimizer of this limiting function is actually an unbiased estimator of a₀. We are now going to see how this reasoning can be made in a general and formally correct way.
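The convergence of â_N in this example is easy to observe in simulation. The sketch below (added here, with arbitrary a₀, noise level and experiment lengths) computes the least-square estimate for increasing N on data generated in open loop.

import numpy as np

rng = np.random.default_rng(0)
a0, sigma_e = 2.0, 1.0
N_max = 100000
u = rng.standard_normal(N_max)
e = sigma_e * rng.standard_normal(N_max)
y = a0 * u + e

for N in (100, 1000, 10000, 100000):
    # Least-square prediction error estimate arg min (1/N) sum (y - a u)^2
    a_N = np.dot(u[:N], y[:N]) / np.dot(u[:N], u[:N])
    print(N, a_N)   # converges to a0 = 2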
7.1.2 General consistency result

For this purpose, we need to make some assumptions about the signals under consideration.

Condition D1 ("nice signals") holds if y and u are quasi-stationary and we have

y(t) = D^{yr}_t r + D^{ye}_t e₀   (7.1)
u(t) = D^{ur}_t r + D^{ue}_t e₀   (7.2)

with the D_t uniformly stable (i.e. there is a stable transfer function bounding all the D_t, see Definition 1.6), e₀ a sequence of independent variables with bounded moments (a white noise for example), and r a deterministic, bounded and quasi-stationary signal.

This hypothesis is fairly general, and holds for example in the two following cases:

- in open loop (y = Gu + v) if u, v are quasi-stationary and ergodic and if G is stable;

- in closed loop if u is quasi-stationary and if (1 + G₀F)^{-1}G, (1 + G₀F)^{-1}H, F(1 + G₀F)^{-1}G and F(1 + G₀F)^{-1}H are stable, where F is the feedback transfer function.

In particular, all signals considered until here satisfy it.
As explained in Section 6.3, the prediction error method with the usual (least square) norm consists in taking as estimator the minimizer of

V_N(θ) = (1/(2N)) Σ_{t=1}^{N} ε²(θ, t),

where ε(θ, t) is the prediction error at time t. Let us define the asymptotic cost function

V̄(θ) = (1/2) Ē[ε²(θ, t)].

The following lemma characterizes when the minimizer of the latter function can be used to represent the asymptotic behavior of the minimizers of V_N. In order to state it, we need the following definition.
Definition 7.1. A model structure M with parameters θ ∈ D_M is uniformly stable if the family of (predictor) filters that it contains and their derivatives with respect to θ are uniformly stable, and D_M is compact.

Note that the model structure can thus be uniformly stable even if the system is unstable, provided that the predictors W_u, W_y satisfy the above stability conditions.
Lemma 7.1 (see Chapter 8 in [1]). Under D1, if the model structure M is uniformly stable (for parameters θ ∈ D_M), then

sup_{θ ∈ D_M} |V_N(θ) − V̄(θ)| → 0 with probability 1.   (7.3)

As a result,

θ̂_N → arg min_θ V̄(θ) with probability 1,

where, since arg min_θ V̄(θ) is a set, this expression is to be understood as

inf_{θ* ∈ arg min_θ V̄(θ)} |θ* − θ̂_N| → 0.
We insist on the importance of equation (7.3), which states that the convergence is uniform, and not only point-wise. Indeed, one can find simple examples of sequences of functions f_N converging point-wise to a certain f, but for which it is not true that the minimizers of the f_N converge to that of f, even if all these functions have unique minima. For this property to hold, one must have uniform convergence.

It is also useful to re-emphasize the fact that the previous lemma establishes a priori the convergence of the estimator to a set, and not necessarily to a particular value, even if each V_N has a unique minimum. Such behavior can be observed when the model over-parametrizes the system or if the data are not informative enough.

In order to hope to recover the "real system", we must assume that there exists such a real system. More precisely:
Condition S1 ("Real system") The data set (u, y) is generated by

S : y = G₀(q)u + H₀(q)e₀,

where e₀ is a sequence of i.i.d. random variables with bounded moments, and H₀ is monic and inversely stable (both H₀ and H₀^{-1} are stable).

Note that the assumption about H₀ is fairly weak: it is equivalent to requiring the spectrum of the disturbance Φ_v(ω) to be factorizable and positive everywhere. Indeed, when a disturbance is factorizable, it can always be represented as H₀e where H₀ is stable and has all its zeros inside or on the unit disk (see Theorem 1.2). Requiring H₀^{-1} to be stable only adds the requirement that H₀ should have no zero on the unit circle, which translates into |H₀(e^{iω})| ≠ 0 and thus Φ_{He}(ω) ≠ 0 for all ω.

There can of course be no hope of recovering the real system if it does not belong to the model set that we consider. Given a set M parametrized by θ ∈ D_M, and assuming that assumption S1 holds, we will say that the real system S belongs to M if there exists at least one θ ∈ D_M corresponding to the real system G₀, H₀. We can now state the following convergence result.
Theorem 7.1. Suppose that conditions D1 and S1 (existence of the real system) hold, that the model structure M is uniformly stable, and that the data are informative enough with respect to M.

Suppose in addition that the system is either controlled in open loop, or in closed loop with a delay in the controller or in the system, so that e(t) and u(t) are independent (even if u(t) may be influenced by e(t − τ) for some τ > 0).

Then, if the real system S belongs to the model set M and M is identifiable at the real system, i.e. there is a unique θ₀ ∈ D_M for which M(θ₀) = S, there holds

θ₀ = arg min_θ V̄(θ).   (7.4)

As a consequence, the minimizer of V̄ corresponds to the transfer functions of the real system. Together with Lemma 7.1 and the uniform stability of M, this implies that

lim θ̂_N = θ₀,   (7.5)
G_{θ̂_N} → G₀,   (7.6)
H_{θ̂_N} → H₀.   (7.7)
Proof. We start from

V̄(θ) = Ē ε²(θ)
      = Ē(ε(θ₀) + ε(θ) − ε(θ₀))²
      = Ē ε²(θ₀)  (a)  + Ē(ε(θ) − ε(θ₀))²  (b)  + 2 Ē[ε(θ₀)(ε(θ) − ε(θ₀))]  (c).

Our goal is to prove that the minimizer of V̄(θ) is actually the minimizer of (b). We are going to show that (a) and (c) are independent of θ, with (c) being actually zero.

Remember that, by Lemma 6.1, ε(t|θ) = H_θ^{-1}(y − G_θu), and thus that at the real system¹, ε(t|θ₀) = e(t). The term (c) is thus 2Ē[e(t)(ε(t|θ) − ε(t|θ₀))]. Observe now that ε(t|θ) − ε(t|θ₀) = ŷ(t|θ₀) − ŷ(t|θ). By definition, the predictors ŷ(t|θ₀), ŷ(t|θ) can only depend on past outputs y(s) (s < t) and on past and present inputs u(s) (s ≤ t). Clearly, the past values y(s), u(s) for s < t are independent of e(t). Moreover, it is an assumption of this theorem that u(t) and e(t) are mutually independent. Therefore, the term (c) is the expectation of a product of two independent quantities, one of which (e(t)) has zero mean, and is thus equal to 0. Besides, since ε(t|θ₀) = e(t), the term (a) is Ē e² = σ_e², so that

V̄(θ) = σ_e² + Ē(ε(θ) − ε(θ₀))².   (7.8)

Therefore θ minimizes V̄(θ) if and only if Ē(ε(θ) − ε(θ₀))² = 0. Since the data are assumed to be informative enough, this implies that M(θ) = M(θ₀) holds for any θ ∈ arg min V̄(θ). Moreover, since the model set is assumed to be identifiable at θ₀, this in turn implies that arg min V̄(θ) = {θ₀}.
Remark 7.1. One can easily see from the proof above that, if the parameters θ₀ representing the real system are not unique, then the set of minimizers of V̄ is precisely the set of parameters representing the real system, so that G_{θ̂_N} → G₀ and H_{θ̂_N} → H₀, even though θ̂_N may not converge to one point but only to the set of points representing the real system. This is however often a problem from a practical point of view, as convergence of the parameters to a set may be hard to detect if the set is not known beforehand, and may often result in numerical instabilities, for example when the set is unbounded.

Remark 7.2. It also follows from equation (7.8) that min_θ V̄(θ) = Ē e(t)² = σ_e², and thus that the variance of the white noise e can be estimated by min_θ V_N(θ).

¹This could also have been obtained directly from the definition of the predictor.
7.1.3 Partial consistency: recovering only G

The result of Theorem 7.1 is relatively strong, as it guarantees convergence to the real system under some weak conditions when this system belongs to the model set, that is, when the model set contains the correct pair (G₀, H₀). In many cases however, one is mostly interested in the transfer function G₀, and having a model set sufficiently large to ensure that it contains the correct noise transfer function could be expensive if the noise has a complex structure, and may actually not be feasible if the noise does not admit a simple or finite description. The following result states that, provided that the parametrizations of G and H are independent, it is still possible to recover G₀ even when the model set does not contain the real noise transfer function.
Theorem 7.2. Suppose that conditions D1 and S1 (existence of the real system) hold, that the model structure M is uniformly stable, and that the data are informative enough with respect to M. Suppose in addition that the two classes of transfer functions G_θ, H_θ corresponding to the model structure M are independently parametrized:

- θ = (ρ, η) with ρ ∈ D_M^ρ and η ∈ D_M^η,

- G_θ = G_ρ,

- H_θ = H_η.

If the system is controlled in open loop during the experiment (so that e and u are independent), and if there is a unique ρ₀ such that the real system transfer function is G_{ρ₀}, then ρ* = ρ₀ whenever (ρ*, η*) minimizes V̄(ρ, η). As a consequence:

ρ̂_N → ρ₀,   Ĝ_N → G₀.   (7.9)

Proof. Remember that, by Lemma 6.1, we have ε(t|ρ, η) = H_η^{-1}(y − G_ρu). Since the real system is assumed to exist (and to be LTI), there also holds y = G₀u + H₀e, where e is some white noise. Reintroducing this in the expression of the prediction error leads to

ε(ρ, η) = H_η^{-1}(G₀ − G_ρ)u + H_η^{-1}H₀e.
Since we have assumed that the system is controlled in open loop, e and u are independent. Therefore, using the expression above, we have

2V̄(θ) = Ē(H_η^{-1}(G₀ − G_ρ)u)² + Ē(H_η^{-1}H₀e)² + 2 Ē[(H_η^{-1}H₀e)(H_η^{-1}(G₀ − G_ρ)u)],

where the last term is zero. The parameter ρ only appears in the first term. So, if θ* = (ρ*, η*) is a minimizer of V̄(θ), ρ* must be a minimizer of this first term given the value of η*. Indeed, if there existed a value ρ′ such that the first term took a smaller value at (ρ′, η*), we would have V̄(ρ′, η*) < V̄(ρ*, η*), in contradiction with the optimality of θ* = (ρ*, η*). Observe now that the first term can be rewritten as

Ē(H_η^{-1}(G₀ − G_ρ)u)² = (1/2π) ∫₀^{2π} |G₀(e^{iω}) − G_ρ(e^{iω})|² / |H_η(e^{iω})|² Φ_u(ω) dω.   (7.10)

This expression is nonnegative. Moreover, since the signals are informative enough with respect to M (and H_η has no pole on the unit circle), it follows from Theorem 5.1 that it is zero if and only if G_ρ = G₀, and thus if ρ = ρ₀, since we have assumed that G₀ has a unique description. Therefore, the only minimizer ρ* of (7.10) is ρ* = ρ₀.
As compared to Theorem 7.1, the last theorem requires two additional
assumptions: G and H are independently parametrized, and u and e are
independent (open loop experiment). These two assumptions are actually
essential, and the result would not hold without them, as will appear in the
sequel. Note that the assumption of independent parametrization could be replaced by the weaker assumption that the choice of $G$ does not in any way restrict the choice of $H$.
7.1.4 Application to regressions
The relative weakness of the assumptions made in Theorems 7.1 and 7.2 is
in apparent contradiction with the restrictive conditions derived in Section
3.2 to guarantee the absence of asymptotic bias when using a regression
technique. We will now re-analyze the regression technique as a general prediction error method, to show why there is actually no contradiction.
Remember that in the context of linear regressions, it is assumed that the system is of the form $y = \varphi(t)^T \theta_0 + v$, where $v(t)$ is a disturbance, and where $\varphi$ would (usually) contain a finite number of values of $y$ and $u$. This could thus be re-expressed as
\[ y(t) = (1 - A_0(q))\, y(t) + B_0(q)\, u(t) + v(t) \]
with appropriate definitions of $A_0$ and $B_0$, implying in particular that the constant term in $A_0$ is 1 (i.e., $A_0(0) = 1$). The latter expression can be put under the canonical LTI form
\[ y(t) = \frac{B_0(q)}{A_0(q)}\, u(t) + \frac{1}{A_0(q)}\, v(t). \]
Note that no assumption had in general been made about $v(t)$, which is therefore not necessarily a white noise. The estimate $\hat\theta$ in the regression was obtained by minimizing $\sum_t \left(y(t) - \varphi^T(t)\theta\right)^2$. It corresponds thus to a prediction error method with predictor
\[ \hat y(t) = \varphi^T(t)\theta = (1 - A_\theta)\, y + B_\theta\, u, \]
where the polynomial classes $A_\theta$ and $B_\theta$ are of the same "form" as $A_0$ and $B_0$, and their coefficients are the entries of $\theta$ except for the first coefficient of $A_\theta$, which must be 1 (since the present value of $y(t)$ must not be used to compute $\hat y(t)$). In particular, there is a $\theta_0$ such that $A_{\theta_0} = A_0$ and $B_{\theta_0} = B_0$. One can verify that using this predictor corresponds to considering the model set
\[ y(t) = \frac{B_\theta}{A_\theta}\, u(t) + \frac{1}{A_\theta}\, e(t). \tag{7.11} \]
The real system thus in general does not belong to the model set, as the real noise $\frac{1}{A_0(q)} v(t)$ can generally not be represented as $\frac{1}{A_\theta} e(t)$ for a white noise $e$, so that we cannot apply Theorem 7.1. Moreover, observe that the parametrizations of $G_\theta = \frac{B_\theta}{A_\theta}$ and $H_\theta = \frac{1}{A_\theta}$ are generally not independent, so that Theorem 7.2 may usually not be used either.
There are however two general exceptions (in addition to countless special cases). First, if $v$ is a white noise, then $\frac{1}{A_0(q)} v(t)$ can clearly be represented as $\frac{1}{A_\theta} e(t)$ for some $\theta$. In that case, the real system belongs to the model set, and Theorem 7.1 can be used. Second, if $A_\theta$ is a constant, then the parametrizations of $G$ and $H$ in (7.11) are independent, and one can apply
Theorem 7.2 to guarantee that we can recover the correct $\theta_0$. Due to the definition of $A_\theta$, the only possible way to have a constant $A_\theta$ is to have $A = 1$, as all coefficients of $A$ are entries of $\theta$ except the constant term, which is 1. Therefore, this second exception corresponds to the case where the real system is $y(t) = B_0(q)u(t) + v(t)$ and the model set is $y(t) = B_\theta(q)u(t) + e(t)$, that is, the case where $\varphi(t)$ only contains present and previous values of $u$. We have thus precisely recovered the two cases in which our analysis in Section 3.2 had shown the absence of asymptotic bias.
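To make the two exceptions concrete, the following small Monte Carlo sketch (an addition to these notes, not part of the original text; the system, the noise model and all numerical values are illustrative) compares the least square estimate of the parameters of $y(t) = a_0 y(t-1) + b_0 u(t-1) + v(t)$ when $v$ is white and when it is colored. In the first case the estimate is consistent; in the second, the regressor $\varphi(t) = [y(t-1), u(t-1)]^T$ is correlated with $v$ and an asymptotic bias appears.

```python
# Illustrative sketch: least-squares regression with phi(t) = [y(t-1), u(t-1)]
# is consistent when the disturbance v is white, but biased when v is colored.
import numpy as np

rng = np.random.default_rng(0)
N, a0, b0 = 200_000, 0.7, 1.0
u = rng.standard_normal(N)

def estimate(colored):
    e = rng.standard_normal(N)
    v = e + 0.9 * np.concatenate(([0.0], e[:-1])) if colored else e
    y = np.zeros(N)
    for t in range(1, N):
        y[t] = a0 * y[t-1] + b0 * u[t-1] + v[t]
    Phi = np.column_stack((y[:-1], u[:-1]))     # regressor phi(t)
    return np.linalg.lstsq(Phi, y[1:], rcond=None)[0]

print("white v  :", estimate(False))   # close to (0.7, 1.0)
print("colored v:", estimate(True))    # the estimate of a0 is visibly biased
```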
7.2 General prediction error methods
In the general prediction error method, one takes as estimator $\hat\theta_N$ the minimizer of
\[ V_N(\theta) = \sum_{t=1}^{N} \ell(\varepsilon(\theta, t)) \]
for some norm $\ell(\cdot)$. One can prove that Theorems 7.1 and 7.2 still apply provided that the function $\ell$ satisfies the two following conditions, related to the symmetry with respect to the noise distribution and to the strict convexity around 0:
• $E\,\ell'(e_0(t)) = 0$ (where $e_0$ comes from condition D1 on the signals),
• $\ell''(x) \ge \delta > 0$ for all $x$.
7.3 Statistical Methods
Observe first that the maximum a posteriori estimate tends to the maximum likelihood estimate when $N$ grows, due to the decreasing relative importance of the prior information, so that their asymptotic properties are equivalent. Moreover, we have seen in Section 6.4 that, under fairly general conditions, the maximum likelihood method is equivalent to the prediction error method with the norm $\ell(\varepsilon) = -\log f_e(\varepsilon)$, where $f_e$ is the p.d.f. of the white noise model used. Therefore, the result for general prediction error methods can directly be applied here, assuming that $f_e$ is such that $E\,\ell'(e_0(t)) = 0$ and $\ell''(x) \ge \delta > 0$ for all $x$.
7.4 Correlation methods
Most of the correlation methods can be viewed as solving
\[ f_N(\theta) = \sum_{t=1}^{N} \zeta(t, \theta)\, \varepsilon(\theta, t) = 0, \]
where
\[ \zeta = (K_u(q, \theta)\, u,\; K_y(q, \theta)\, y) \]
for some filters $K_u$ and $K_y$ of appropriate dimension. Similarly to what we have done for prediction error methods, let us define the following limiting function:
\[ f(\theta) = \bar E\left(\zeta(t, \theta)\, \varepsilon(t, \theta)\right). \]
Theorem 7.3. Under condition D1, if the model structure $\mathcal M$ is uniformly stable and if $K_u$, $K_y$ are uniformly stable, then
\[ \sup_{\theta \in \mathcal M} |f_N(\theta) - f(\theta)| \to 0 \quad \text{with probability } 1. \tag{7.12} \]
Therefore $\hat\theta_N$, defined by $f_N(\hat\theta_N) = 0$, converges to the solution of $f(\theta) = 0$.
There is however no general theory to determine whether the solution of $f(\theta) = 0$ corresponds to the parameters of the "real" system, or to a system close to the real system. One will have to analyze the specific situations on a case-by-case basis.
Chapter 8
Variance of Estimators
Until here we have only analyzed the asymptotic (or sometimes expected) values of the estimators, which are theoretically accessible after infinitely long experiments. Real-life experiments are however only finite, so one should also strongly consider the expected deviations from those asymptotic values when only a finite amount of data is available. In particular, it can be very important to detect the presence of trade-offs, or to design the experiment in such a way that more precision is obtained on certain parameters than on others.
We will see in this chapter that, under conditions similar to those in Chapter 7, the error asymptotically behaves as a Gaussian random variable whose covariance is determined by the information matrix, related to the derivatives of the one-step-ahead predictors with respect to the parameters.
8.1 Constant filter
To develop some intuition, let us begin by analyzing in detail a very simple example in which we identify a constant filter with additive white noise, using the standard prediction error method. We consider thus the model set
\[ y = a\, u + e, \tag{8.1} \]
with the corresponding predictor $\hat y(t|a) = a\,u$, and assume that our data are generated by the system
\[ y(t) = a_0\, u + e, \]
where $e$ is a white noise consisting of a sequence of i.i.d. variables¹.
It is important to realize that, once the system, the input, and the (deterministic) identification method are fixed, the estimated parameter $\hat a$ that we obtain only depends on the realization of $e$. It is thus a random variable "derived" from $e$, and we are going to analyze its properties.
Remember that the principle of the prediction error method is to select as estimator
\[ \hat a = \arg\min_a \sum_{t=0}^{N} \varepsilon(t|a)^2 = \arg\min_a \sum_{t=0}^{N} \left(y(t) - a\,u(t)\right)^2. \]
Taking as estimator the value at which the derivative of the function is zero leads to
\[ \hat a = \frac{\sum_{t=0}^{N} y(t)\,u(t)}{\sum_{t=0}^{N} u^2(t)}. \]
Assuming that the signals were indeed generated using a system of the form (8.1), let us analyze the error made by this estimator. Replacing $y$ by its expression $a_0 u + e$, we obtain
\[ \hat a = a_0 + \frac{\sum_{t=0}^{N} u(t)\,e(t)}{\sum_{t=0}^{N} u^2(t)}, \]
where we assume the denominator to be nonzero. The estimation error $\hat a - a_0$ is thus a weighted sum of the i.i.d. random variables $e(t)$ with zero mean. This implies that, for any $N$, the estimator is unbiased, in the sense that $E\hat a = a_0$. Moreover, by the central limit theorem, for large $N$, the distribution of $\hat a - a_0$ approaches a Gaussian distribution centered at 0.
Let us now compute the variance of the error, which is also the variance of the Gaussian distribution that the distribution of the error approaches. At this point it is important to remember that we consider the input fixed, even if it may have been determined by taking a realization of a random process.
¹This assumption about the nature of the white noise $e$ is not essential, but simplifies the exposition.
The only source of randomness is thus the white noise $e$, and all expectations are taken with respect to $e$:
\[ E(\hat a - a_0)^2 = \frac{E\left[\left(\sum_{s=0}^{N} u(s)e(s)\right)\left(\sum_{t=0}^{N} u(t)e(t)\right)\right]}{\left(\sum_{t=0}^{N} u^2(t)\right)^2} = \frac{\sum_{s=0}^{N}\sum_{t=0}^{N} u(s)u(t)\, E\,e(s)e(t)}{\left(\sum_{t=0}^{N} u^2(t)\right)^2}. \]
Since $e$ is a white noise, $E\,e(s)e(t) = \sigma_e^2$ if $s = t$ and 0 otherwise. Therefore,
\[ E(\hat a - a_0)^2 = \frac{\sum_{t=0}^{N} u(t)^2\, \sigma_e^2}{\left(\sum_{t=0}^{N} u^2(t)\right)^2} = \sigma_e^2 \left(\sum_{t=0}^{N} u(t)^2\right)^{-1}. \]
This result is consistent with the intuition that the error made is inversely proportional to the signal-to-noise ratio.
Let us now take a step back, and consider that the input $u$ may be the result of a stochastic process. If it is ergodic (which we usually assume), then, for large $N$, $\frac{1}{N}\sum_{t=0}^{N} u(t)^2$ tends to $\bar E\, u^2$. So, combining everything, we see that for large $N$,
\[ E(\hat a - a_0)^2 = \frac{\sigma_e^2}{N}\left(\bar E\, u^2\right)^{-1}, \]
and
\[ \sqrt{N}\,(\hat a - a_0) \approx \mathcal N\!\left(0, \frac{\sigma_e^2}{\bar E\, u^2}\right). \tag{8.2} \]
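The following small numerical check (an addition to these notes; all numerical values are illustrative) verifies the finite-$N$ variance formula above by generating many noise realizations for one fixed input sequence and comparing the empirical variance of $\hat a$ with $\sigma_e^2\left(\sum_t u(t)^2\right)^{-1}$.

```python
# Illustrative Monte Carlo check of the variance of a_hat for y = a0*u + e.
import numpy as np

rng = np.random.default_rng(1)
N, a0, sigma_e = 500, 2.0, 0.5
u = rng.standard_normal(N)          # input, kept fixed over all realizations

a_hat = []
for _ in range(10_000):             # many realizations of the noise e
    e = sigma_e * rng.standard_normal(N)
    y = a0 * u + e
    a_hat.append(np.sum(y * u) / np.sum(u * u))

print("empirical variance :", np.var(a_hat))
print("theoretical value  :", sigma_e**2 / np.sum(u * u))
```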
8.2 FIR filter
We now move to the slightly more general case of a finite impulse response filter identified using a least square prediction error method. We consider thus a one-step-ahead predictor of the form $\hat y(t|\theta) = \varphi(t)^T\theta$ for $\theta \in \mathbb R^n$, and $\varphi(t)$ a vector containing only present and past values of the input $u$². We also assume that the system considered is controlled in open loop, so that $u$ and thus $\varphi$ are independent of the disturbance. We focus on the favorable case where the signals are generated by a system of the form
\[ y(t) = \varphi(t)^T\theta_0 + e(t), \]
²This assumption is very important. Indeed, when $\varphi$ also contains values of $y$, $\varphi$ is generally not independent from the disturbance, which prevents many simplifications.
where $e(t)$ is a white noise, which we assume to consist of a sequence of i.i.d. random variables. We have seen that the regression estimator is (see (3.6))
\[ \hat\theta_N = \left(\sum_{t=0}^{N} \varphi(t)\varphi(t)^T\right)^{-1}\left(\sum_{t=0}^{N} \varphi(t)\,y(t)\right), \]
and that the corresponding estimation error is
\[ \hat\theta_N - \theta_0 = \left(\sum_{t=0}^{N} \varphi(t)\varphi(t)^T\right)^{-1}\left(\sum_{t=0}^{N} \varphi(t)\,e(t)\right). \tag{8.3} \]
As in the first example in Section 8.1, we consider now that the system and the input $u$ are fixed (and so is therefore $\varphi$). The only source of randomness is then the disturbance $e$, and (8.3) shows that the estimation error is then a random variable in $\mathbb R^n$, obtained by taking a linear combination (with weights in $\mathbb R^n$) of the i.i.d. scalar random variables $e(t)$ with $Ee = 0$. The expected value of the error is thus the vector 0, and the central limit theorem implies that for large $N$, the error distribution approaches that of a centered Gaussian.
Let us now compute the covariance of the error (and thus of that Gaussian). It follows from (8.3) and the symmetry of $\varphi(t)\varphi(t)^T$ that the covariance $E\left[(\hat\theta_N - \theta_0)(\hat\theta_N - \theta_0)^T\right]$ is
\[ E\left[\left(\sum_{t=0}^{N}\varphi(t)\varphi(t)^T\right)^{-1}\left(\sum_{t=0}^{N}\varphi(t)e(t)\right)\left(\sum_{s=0}^{N}e(s)\varphi(s)^T\right)\left(\sum_{t=0}^{N}\varphi(t)\varphi(t)^T\right)^{-1}\right] \]
\[ = \left(\sum_{t=0}^{N}\varphi(t)\varphi(t)^T\right)^{-1}\left(\sum_{t=0}^{N}\sum_{s=0}^{N}\varphi(t)\varphi(s)^T\, E\left(e(t)e(s)\right)\right)\left(\sum_{t=0}^{N}\varphi(t)\varphi(t)^T\right)^{-1}. \]
Using again the fact that $E\,e(t)e(s) = \sigma_e^2$ if $s = t$ and 0 otherwise, the expression above becomes
\[ \left(\sum_{t=0}^{N}\varphi(t)\varphi(t)^T\right)^{-1}\left(\sum_{t=0}^{N}\varphi(t)\varphi(t)^T\,\sigma_e^2\right)\left(\sum_{t=0}^{N}\varphi(t)\varphi(t)^T\right)^{-1}. \]
Therefore, the covariance of the estimation error is
\[ E\left[(\hat\theta_N - \theta_0)(\hat\theta_N - \theta_0)^T\right] = \sigma_e^2\left(\sum_{t=0}^{N}\varphi(t)\varphi(t)^T\right)^{-1}. \]
For large $N$, assuming that the input signals are ergodic, $\frac{1}{N}\sum_{t=0}^{N}\varphi(t)\varphi(t)^T \approx \bar E\,\varphi\varphi^T$. So, asymptotically, we have
\[ E\left[(\hat\theta_N - \theta_0)(\hat\theta_N - \theta_0)^T\right] = \frac{\sigma_e^2}{N}\left(\bar E\,\varphi\varphi^T\right)^{-1}, \tag{8.4} \]
and
\[ \sqrt{N}\,(\hat\theta_N - \theta_0) \approx \mathcal N\!\left(0_n,\; \sigma_e^2\left(\bar E\,\varphi\varphi^T\right)^{-1}\right). \]
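As a sanity check of the covariance formula above, the following sketch (an addition to these notes; the FIR system, the regressor and the numbers are illustrative) keeps the input fixed, repeats the experiment over many noise realizations, and compares the empirical covariance of $\hat\theta_N$ with $\sigma_e^2\left(\sum_t \varphi(t)\varphi(t)^T\right)^{-1}$.

```python
# Illustrative check of the covariance formula for an FIR predictor
# yhat(t|theta) = phi(t)^T theta with phi(t) = [u(t), u(t-1)].
import numpy as np

rng = np.random.default_rng(2)
N, theta0, sigma_e = 400, np.array([1.0, -0.5]), 0.3
u = rng.standard_normal(N)
Phi = np.column_stack((u[1:], u[:-1]))          # phi(t), kept fixed

estimates = []
for _ in range(5_000):
    e = sigma_e * rng.standard_normal(N - 1)
    y = Phi @ theta0 + e
    estimates.append(np.linalg.lstsq(Phi, y, rcond=None)[0])
errors = np.array(estimates) - theta0

print("empirical covariance:\n", np.cov(errors.T))
print("sigma_e^2 (sum phi phi^T)^-1:\n",
      sigma_e**2 * np.linalg.inv(Phi.T @ Phi))
```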
8.3 General least square prediction error method
In the two previous sections, we have analyzed relatively simple cases, where the predictor is not a function of the output, and is thus independent of the disturbance. We have seen in both cases that the estimation error is asymptotically distributed as a centered Gaussian, and evaluated its variance. We have seen that this variance was inversely proportional to the signal-to-noise ratio. What may not have been apparent at first sight is that this variance could be expressed in terms of the local dependence of the predictors on the parameters. Observe indeed that in Section 8.2, where we work with the predictor $\hat y(t|\theta) = \varphi^T(t)\theta$, we have $\frac{\partial \hat y(t|\theta)}{\partial\theta} = \varphi(t)$. Therefore, equation (8.4) can be rewritten as
\[ E\left[(\hat\theta_N - \theta_0)(\hat\theta_N - \theta_0)^T\right] = \frac{\sigma_e^2}{N}\left(\bar E\left[\left.\frac{\partial\hat y}{\partial\theta}\right|_{\theta_0}\left.\frac{\partial\hat y}{\partial\theta}\right|_{\theta_0}^T\right]\right)^{-1}, \]
and the same holds for the simpler case described in Section 8.1. We will see here that both this expression for the variance and the asymptotic Gaussian distribution are valid in general for prediction error methods with least-square criterion. In particular, the following theorem holds.
Theorem 8.1. Under the assumptions of Theorem 7.1, when $N$ tends to infinity and $\hat\theta_N$ is obtained by the least square method, $\hat\theta_N - \theta_0$ is asymptotically distributed as
\[ \mathcal N\!\left(0, \frac{P_\theta}{N}\right), \]
with
\[ P_\theta = \sigma_e^2\left(\bar E\left[\left.\frac{\partial\hat y(t,\theta)}{\partial\theta}\right|_{\theta_0}\left.\frac{\partial\hat y(t,\theta)^T}{\partial\theta}\right|_{\theta_0}\right]\right)^{-1}. \]
The rest of this section is devoted to a "heuristic proof" of Theorem 8.1. We refer the reader to [1] for a formal proof. We will work under the same assumptions as in Theorem 7.1. In particular, the real system belongs to the model set considered and corresponds to the parameters $\theta_0$, and $\hat\theta_N \to \theta_0$ when $N \to \infty$.
8.3.1 The estimator as a random variable
Let $\varepsilon(t) = y(t) - \hat y(t|\theta)$; the estimator $\hat\theta_N$ is the minimizer of
\[ V_N(\theta) = \frac{1}{2N}\sum_{t=0}^{N} \varepsilon^2(t, \theta), \]
which we will assume is unique. In these sections, we consider that $u$ and $\theta_0$ are given and fixed, and that $y$, $\hat y$ and $\varepsilon$ are random variables derived from the white noise signal $e$.
Define also $\bar V(\theta) = \frac{1}{2}\bar E\,\varepsilon^2(t, \theta)$, to which $V_N$ uniformly converges (see Lemma 7.1). We are interested in the convergence properties of $\theta_0 - \hat\theta_N$, where
\[ \theta_0 = \arg\min_\theta \bar V(\theta). \]
(a) A first expression
Let us consider a first-order Taylor series expansion of $V'_N(\theta) = \frac{\partial V_N(\theta)}{\partial\theta}$ around $\theta_0$, given by
\[ V'_N(\theta) = V'_N(\theta_0) + V''_N(\xi)(\theta - \theta_0), \]
where $\xi$ is "between" $\theta$ and $\theta_0$. Since $V'_N(\hat\theta_N) = 0$ we obtain
\[ 0 = V'_N(\hat\theta_N) = V'_N(\theta_0) + V''_N(\xi)(\hat\theta_N - \theta_0) = V'_N(\theta_0) + V''_N(\hat\theta_N - \theta_0) \;\Longleftrightarrow\; \hat\theta_N - \theta_0 = -V''^{-1}_N V'_N(\theta_0), \]
where we assumed that $V''_N(\theta)$ is sufficiently smooth to be considered as a constant matrix.
(b) Properties of $V'_N(\theta_0)$
The error is thus approximately equal to the random variable $V'_N(\theta_0)$ multiplied by a constant matrix. Let us further develop this random variable:
\[ V'_N(\theta_0) = \left.\frac{\partial}{\partial\theta}\right|_{\theta_0}\left(\frac{1}{2N}\sum_{t=0}^{N}\varepsilon^2(t,\theta)\right) = \frac{1}{N}\sum_{t=0}^{N}\varepsilon(t,\theta_0)\left.\frac{\partial\varepsilon(t,\theta)}{\partial\theta}\right|_{\theta_0} = -\frac{1}{N}\sum_{t=0}^{N}\varepsilon(t,\theta_0)\left.\frac{\partial\hat y(t|\theta)}{\partial\theta}\right|_{\theta_0}. \]
$V'_N(\theta_0)$ is thus an average of the random variables $\varepsilon(t,\theta_0)\left.\frac{\partial\hat y(t|\theta)}{\partial\theta}\right|_{\theta_0}$. Assuming that $N$ is large, it turns out moreover that the expected value $E\,V'_N(\theta_0)$ is 0, as
\[ E\,V'_N(\theta_0) \simeq -\bar E\left[\varepsilon(t,\theta_0)\left.\frac{\partial\hat y(t|\theta)}{\partial\theta}\right|_{\theta_0}\right] = \bar E\left[\left.\frac{\partial}{\partial\theta}\right|_{\theta_0}\frac{1}{2}\varepsilon^2(t,\theta)\right] = \left.\frac{\partial}{\partial\theta}\right|_{\theta_0}\frac{1}{2}\bar E\,\varepsilon^2(t,\theta) = \left.\frac{\partial}{\partial\theta}\right|_{\theta_0}\bar V(\theta) = 0. \]
(c) Central limit theorem
Suppose for a moment that the $\varepsilon(t,\theta_0)\left.\frac{\partial\hat y(t|\theta)}{\partial\theta}\right|_{\theta_0}$ were independent. Then, since their sum $V'_N(\theta_0)$ has a zero expected value, it would follow from the central limit theorem that, for large $N$,
\[ V'_N(\theta_0) \approx \mathcal N(0, Q_N), \qquad Q_N = E\left(\frac{1}{N}\sum_{t=0}^{N}\varepsilon(t,\theta_0)\left.\frac{\partial\hat y(t,\theta)}{\partial\theta}\right|_{\theta_0}\right)^2. \tag{8.5} \]
Of course the $\varepsilon(t,\theta_0)\left.\frac{\partial\hat y(t|\theta)}{\partial\theta}\right|_{\theta_0}$ are not independent. But because the transfer functions involved are assumed to be uniformly stable, it can be shown that $\varepsilon(t,\theta_0)\frac{\partial\hat y(t,\theta)}{\partial\theta}$ and $\varepsilon(s,\theta_0)\frac{\partial\hat y(s,\theta)}{\partial\theta}$ are asymptotically independent when $|t - s| \to \infty$, and that (8.5) asymptotically holds, which leads to
\[ \hat\theta_N - \theta_0 \approx \mathcal N\!\left(0,\; V''^{-1}_N Q_N V''^{-1\,T}_N\right). \tag{8.6} \]
8.3.2 Asymptotic expression of the variance
We have thus proved that, given an input signal, the distribution of the value of the estimate $\hat\theta_N$ is asymptotically Gaussian with mean $\theta_0$. Let us now evaluate more precisely its variance, by developing asymptotic expressions of $Q_N$ and $V''^{-1}_N$ in (8.6).
Recall that we assume that the system belongs to the set of models and that $\theta_0$ is the "correct" value of $\theta$. The prediction error at $\theta_0$ is therefore exactly the white noise $e(t)$ (see for example the proof of Theorem 7.1): $\varepsilon(t,\theta_0) = e(t)$, where $e(t)$ is white noise with zero mean and variance $\sigma_e^2$. The covariance matrix $Q_N$ can be developed as follows:
\[ Q_N = E\left[\left(\frac{1}{N}\sum_{t=0}^{N}\varepsilon(t,\theta_0)\left.\frac{\partial\hat y(t,\theta)}{\partial\theta}\right|_{\theta_0}\right)\left(\frac{1}{N}\sum_{s=0}^{N}\varepsilon(s,\theta_0)\left.\frac{\partial\hat y(s,\theta)}{\partial\theta}\right|_{\theta_0}\right)^T\right] = \frac{1}{N^2}\sum_{t=0}^{N}\sum_{s=0}^{N} E\left[e(t)\left.\frac{\partial\hat y(t,\theta)}{\partial\theta}\right|_{\theta_0}\, e(s)\left.\frac{\partial\hat y(s,\theta)^T}{\partial\theta}\right|_{\theta_0}\right]. \]
Note also that $\hat y(t|\theta)$ is independent of $e(t)$ because it only depends on previous values of $y$ and $u$. So we have
\[ Q_N = \frac{1}{N^2}\sum_{t=0}^{N}\sum_{s=0}^{N} E\,e(t)e(s)\; E\left[\left.\frac{\partial\hat y(t,\theta)}{\partial\theta}\right|_{\theta_0}\left.\frac{\partial\hat y(s,\theta)^T}{\partial\theta}\right|_{\theta_0}\right] = \frac{\sigma_e^2}{N^2}\sum_{t=0}^{N} E\left[\left.\frac{\partial\hat y(t,\theta)}{\partial\theta}\right|_{\theta_0}\left.\frac{\partial\hat y(t,\theta)^T}{\partial\theta}\right|_{\theta_0}\right] \simeq \frac{\sigma_e^2}{N}\,\bar E\left[\left.\frac{\partial\hat y(t,\theta)}{\partial\theta}\right|_{\theta_0}\left.\frac{\partial\hat y(t,\theta)^T}{\partial\theta}\right|_{\theta_0}\right] \quad \text{for } N \to \infty. \]
Now we compute the asymptotic value of the matrix $V''_N$ as
\[ V''_N(\theta) = \frac{\partial^2}{\partial\theta^2}\left(\frac{1}{2N}\sum_{t=0}^{N}\varepsilon^2(t,\theta)\right) = \frac{\partial}{\partial\theta}\left(-\frac{1}{N}\sum_{t=0}^{N}\varepsilon(t,\theta)\frac{\partial\hat y(t,\theta)}{\partial\theta}\right) = \frac{1}{N}\sum_{t=0}^{N}\frac{\partial\hat y(t,\theta)}{\partial\theta}\frac{\partial\hat y(t,\theta)^T}{\partial\theta} - \frac{1}{N}\sum_{t=0}^{N}\varepsilon(t,\theta)\frac{\partial^2\hat y(t,\theta)}{\partial\theta^2}. \]
If we assume that $V''_N(\theta)$ is sufficiently smooth around $\theta_0$, we can consider that $V''_N(\xi) = V''_N(\theta_0) = V''_N$ is constant. We can again use the fact that at $\theta_0$ we have $\varepsilon(t,\theta_0) = e(t)$. For large values of $N$ this leads to the following simplification:
\[ V''_N = \frac{1}{N}\sum_{t=0}^{N}\left.\frac{\partial\hat y(t,\theta)}{\partial\theta}\right|_{\theta_0}\left.\frac{\partial\hat y(t,\theta)^T}{\partial\theta}\right|_{\theta_0} - \frac{1}{N}\sum_{t=0}^{N}e(t)\left.\frac{\partial^2\hat y(t,\theta)}{\partial\theta^2}\right|_{\theta_0} = \bar E\left[\left.\frac{\partial\hat y(t,\theta)}{\partial\theta}\right|_{\theta_0}\left.\frac{\partial\hat y(t,\theta)^T}{\partial\theta}\right|_{\theta_0}\right] - \bar E\left[e(t)\left.\frac{\partial^2\hat y(t,\theta)}{\partial\theta^2}\right|_{\theta_0}\right] = \bar E\left[\left.\frac{\partial\hat y(t,\theta)}{\partial\theta}\right|_{\theta_0}\left.\frac{\partial\hat y(t,\theta)^T}{\partial\theta}\right|_{\theta_0}\right], \]
since $\hat y(t|\theta)$ is independent of $e(t)$. So we finally have the asymptotic properties of the random variable $\hat\theta_N - \theta_0$. It follows a normal distribution with zero mean and a covariance matrix given by
\[ Q = V''^{-1}_N Q_N V''^{-1\,T}_N \;\xrightarrow[N\to\infty]{}\; \frac{\sigma_e^2}{N}\left(\bar E\left[\left.\frac{\partial\hat y(t,\theta)}{\partial\theta}\right|_{\theta_0}\left.\frac{\partial\hat y(t,\theta)^T}{\partial\theta}\right|_{\theta_0}\right]\right)^{-1} =: \frac{P_\theta}{N}, \]
which finishes establishing (though not entirely formally) the validity of Theorem 8.1.
8.3.3 Simple Application
Let us come back to our example of Section 8.1: the system $y = a_0 u + e$ and the predictor $\hat y = a\,u$. Theorem 8.1 states in that case that
\[ \hat a_N - a_0 \approx \mathcal N\!\left(0, \frac{P_a}{N}\right) \]
with
\[ P_a = \sigma_e^2\left(\bar E\left[\left.\frac{\partial\hat y}{\partial a}\right|_{a_0}\left.\frac{\partial\hat y}{\partial a}\right|_{a_0}^T\right]\right)^{-1}. \]
Observe that
\[ \frac{\partial\hat y(t)}{\partial a} = \frac{\partial}{\partial a}\, a\,u(t) = u(t), \]
so that
\[ P_a = \frac{\sigma_e^2}{\bar E\, u^2}, \]
and
\[ \hat a_N - a_0 \approx \mathcal N\!\left(0, \frac{\sigma_e^2}{N\,\bar E\, u^2}\right), \]
consistently with what we had obtained in equation (8.2). One can also verify that applying Theorem 8.1 to a finite impulse response system and model set allows recovering the results of Section 8.2.
8.4 Other Norms
The result obtained for the least square method holds, with minor modifications, in the case of a generic prediction error method:
Theorem 8.2. Under the assumptions of Theorem 7.1, suppose that $\hat\theta_N$ is obtained as
\[ \hat\theta_N = \arg\min_\theta \sum_{t=1}^{N}\ell(\varepsilon(\theta, t)), \]
where $\ell(\cdot)$ is some norm for which $E\,\ell'(e(t)) = 0$ ($e$ being the white noise used in condition D1) and $\ell''(x) \ge \delta > 0$ for all $x$. Then $\hat\theta_N - \theta_0$ is asymptotically distributed as
\[ \mathcal N\!\left(0, \frac{P_\theta}{N}\right), \]
with
\[ P_\theta = \kappa(\ell)\left(\bar E\left[\left.\frac{\partial\hat y(t,\theta)}{\partial\theta}\right|_{\theta_0}\left.\frac{\partial\hat y(t,\theta)^T}{\partial\theta}\right|_{\theta_0}\right]\right)^{-1}, \]
and
\[ \kappa(\ell) = \frac{E\left[\ell'(e(t))^2\right]}{\left(E\,\ell''(e(t))\right)^2}. \tag{8.7} \]
All prediction error methods thus have the same asymptotic variance up to a constant (with respect to $N$) multiplicative factor, which depends on the norm $\ell$. This leads to two questions. First, how could one minimize this multiplicative factor? Second, are there other identification methods with a better asymptotic behavior? We analyze those two questions in the following section.
8.5 Cramer-Rao bound and optimal norms
The Cramer-Rao bound gives a lower bound on the variance of any unbiased estimate, for general random variables. By applying it to the particular case of dynamical systems (see [1]), one obtains that, for any identification method for which $E\hat\theta_N = \theta_0$, there holds
\[ \mathrm{cov}(\hat\theta_N) = \bar E\,(\hat\theta_N - \theta_0)(\hat\theta_N - \theta_0)^T \succeq \kappa_0\left(\sum_{t=1}^{N} E\left[\left.\frac{\partial\hat y}{\partial\theta}\right|_{\theta_0}\left.\frac{\partial\hat y}{\partial\theta}\right|_{\theta_0}^T\right]\right)^{-1}, \]
with
\[ \kappa_0 = \left(\int_{-\infty}^{\infty}\frac{f'_e(x)^2}{f_e(x)}\,dx\right)^{-1}, \]
where $f_e$ is the p.d.f. of $e$.
Observe that this expression for the variance is the same as that in Theorem 8.2, up to the multiplicative factor $\kappa(\ell)/\kappa_0$. The asymptotic variance of the estimates obtained by any prediction error method is thus within a constant ratio of the optimal variance.
One could thus wonder if there is a norm for which $\kappa(\ell) = \kappa_0$, and which would thus be asymptotically optimal. It turns out that the answer is yes: it is shown in Appendix 8.A that this equality holds for $\ell(\cdot) = -\ln f_e(\cdot)$, where $f_e$ is the p.d.f. of the white noise of the real system. As we have seen in Section 6.4.1, a prediction error estimate with that particular norm is exactly equivalent to the maximum likelihood method, which thus has an optimal asymptotic variance.
It should be noted though that a small error in $f_e$ (which is usually unknown) can significantly increase the value of $\kappa(\ell_{f_e})$ and decrease the performance of the corresponding estimators. Finding and using the "optimal norm" is thus not necessarily recommended. This observation led
to the development of robust norms, which are sub-optimal but perform reasonably well for any $f_e$ sufficiently close to the real one (see [1]).
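As a small illustration of the factors $\kappa(\ell)$ and $\kappa_0$ (an addition to these notes, not part of the original text; all numbers are illustrative), the sketch below evaluates both for Gaussian noise and the quadratic norm $\ell(x) = x^2/2$. They coincide with $\sigma^2$, as expected, since for Gaussian noise $-\ln f_e$ is quadratic up to constants.

```python
# Illustrative evaluation of kappa(l) and kappa_0 for Gaussian noise.
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.7
e = sigma * rng.standard_normal(1_000_000)

# kappa(l) = E[l'(e)^2] / (E[l''(e)])^2 with l(x) = x^2/2, i.e. l' = x, l'' = 1
kappa_l = np.mean(e**2)

# kappa_0 = ( integral f'(x)^2 / f(x) dx )^-1 for the Gaussian density
x = np.linspace(-10 * sigma, 10 * sigma, 200_001)
dx = x[1] - x[0]
f = np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
fprime = -x / sigma**2 * f
kappa_0 = 1.0 / np.sum(fprime**2 / f * dx)

print(kappa_l, kappa_0, sigma**2)   # all three approximately equal
```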
8.6 Correlation Approach
Similar results can be obtained for estimators obtained using a correlation approach (see [1] for more detail). Let
\[ f_N(\zeta, \theta) = \frac{1}{N}\sum_t \zeta(t)\,\varepsilon(t, \theta), \]
where $\zeta$ is the signal with which the correlation must be cancelled, and which can in general depend on $\theta$. The estimate for a sample of length $N$ is the solution to $f_N(\zeta, \theta) = 0$. Let now
\[ f(\zeta, \theta) = \lim_{N\to\infty} f_N(\zeta, \theta) = \bar E\,\zeta(t)\,\varepsilon(t, \theta), \]
and suppose that $\theta_0$ is the solution to $f(\zeta, \theta) = 0$. Under assumptions similar to those in the previous section and in Section 7.4, for large $N$, there holds
\[ \hat\theta_N - \theta_0 \approx \mathcal N\!\left(0, \frac{1}{N}P_\zeta\right) \]
with
\[ P_\zeta = \sigma_e^2\left(E\left[\zeta\,\frac{\partial\hat y}{\partial\theta}^T\right]\right)^{-1}\left(E\,\zeta\zeta^T\right)\left(E\left[\frac{\partial\hat y}{\partial\theta}\,\zeta^T\right]\right)^{-1}, \]
where the derivatives are evaluated at $\theta_0$. Observe that we find the prediction error method with least square criterion if $\zeta = \frac{\partial\hat y(t,\theta)}{\partial\theta}$, which was to be expected since the two methods are equivalent in that case.
8.7 Using the variance estimate
The estimates derived in this chapter can usually not be computed, as they require knowing the variance $\sigma_e^2$ of the white noise (and sometimes its distribution) and the real parameters $\theta_0$.
However, they can be approximated using the estimates obtained by the methods (remember that $\min V_N(\theta)$ tends to $\sigma_e^2$ when $N \to \infty$ for the least square method), or used to derive confidence intervals for the estimates obtained (see Chapter 9.6 in [1]). One could also use rough approximations of these bounds to compare different identification methods, or the influence of some parameters that can be tuned when designing the experiment.
Another problem is that the estimates of the variance are only asymptotically valid, and we have no simple way of estimating their accuracy for finite $N$. Common wisdom says that "for a typical experiment" the bounds are valid within 10% when $N \ge 300$ [1]. This claim has however no formal scientific ground, and should be treated with the greatest prudence.
Finally, it should be remembered that all these results provide estimates of the variance of the parameters, but not of the transfer functions. In many applications, the goal is to have an accurate estimate of the transfer function in some frequency range, and one is not so interested in the parameters. Evaluating the variance of the transfer function can be very complex. Some results are presented in [1], but most results available are asymptotic in both the sample size and the model order.
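In practice one therefore plugs estimates into the expression of $P_\theta$. The following sketch (an addition to these notes; it assumes the FIR case of Section 8.2, where $\partial\hat y/\partial\theta = \varphi(t)$, and uses illustrative numbers) replaces $\sigma_e^2$ by the minimal value of the least-square cost and $\theta_0$ by $\hat\theta_N$ to produce approximate confidence intervals.

```python
# Illustrative data-based approximation of the asymptotic covariance (FIR case).
import numpy as np

rng = np.random.default_rng(4)
N, theta0 = 300, np.array([1.0, -0.5])
u = rng.standard_normal(N)
Phi = np.column_stack((u[1:], u[:-1]))
y = Phi @ theta0 + 0.3 * rng.standard_normal(N - 1)

theta_hat = np.linalg.lstsq(Phi, y, rcond=None)[0]
resid = y - Phi @ theta_hat
sigma2_hat = np.mean(resid**2)                   # rough estimate of sigma_e^2
P_hat = sigma2_hat * np.linalg.inv(Phi.T @ Phi)  # estimated covariance of theta_hat

for i, (th, s) in enumerate(zip(theta_hat, np.sqrt(np.diag(P_hat)))):
    print(f"theta[{i}] = {th:.3f}   approximate 95% interval +/- {1.96 * s:.3f}")
```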
Appendix
8.A Efficiency of the Maximum Likelihood method
As explained in Section 8.5, we need to show that
\[ \kappa(\ell) := \frac{E\left[\ell'(e)^2\right]}{\left(E\,\ell''(e)\right)^2} = \left(\int_{-\infty}^{\infty}\frac{f'_e(x)^2}{f_e(x)}\,dx\right)^{-1} =: \kappa_0 \tag{8.8} \]
holds when $\ell(x) = -\ln f_e(x)$, where $f_e$ is the p.d.f. of the noise $e$ (assuming that some sense can be given to $f'_e$ and $f''_e$, and that the expected values and integrals above are well defined). We first compute the first and second derivatives of $\ell$:
\[ \ell'(x) = \left(-\ln f_e(x)\right)' = -\frac{f'_e(x)}{f_e(x)}, \]
and
\[ \ell''(x) = -\left(\frac{f'_e(x)}{f_e(x)}\right)' = \frac{f'_e(x)^2 - f''_e(x)f_e(x)}{f_e(x)^2}. \]
Observe now that for any function $g$, there holds by definition $E\,g(e) = \int_{-\infty}^{\infty} g(x)f_e(x)\,dx$. In particular,
\[ E\,\ell'(e)^2 = \int_{-\infty}^{\infty}\left(-\frac{f'_e(x)}{f_e(x)}\right)^2 f_e(x)\,dx = \int_{-\infty}^{\infty}\frac{f'_e(x)^2}{f_e(x)}\,dx, \tag{8.9} \]
and
\[ E\,\ell''(e) = \int_{-\infty}^{\infty}\frac{f'_e(x)^2 - f''_e(x)f_e(x)}{f_e(x)^2}\, f_e(x)\,dx = \int_{-\infty}^{\infty}\frac{f'_e(x)^2}{f_e(x)}\,dx - \int_{-\infty}^{\infty} f''_e(x)\,dx. \tag{8.10} \]
In the expression above, observe that $\int_{-\infty}^{\infty} f''_e(x)\,dx = 0$ (assuming that this integral is well defined). Indeed, there holds $\lim_{x\to-\infty} f'_e(x) + \int_{-\infty}^{\infty} f''_e(x)\,dx = \lim_{x\to\infty} f'_e(x)$, and $\lim_{x\to-\infty} f'_e(x)$ and $\lim_{x\to\infty} f'_e(x)$ can only be zero. Therefore, it follows from (8.10) and (8.9) that
\[ \frac{E\left[\ell'(e)^2\right]}{\left(E\,\ell''(e)\right)^2} = \frac{\int_{-\infty}^{\infty}\frac{f'_e(x)^2}{f_e(x)}\,dx}{\left(\int_{-\infty}^{\infty}\frac{f'_e(x)^2}{f_e(x)}\,dx\right)^2} = \left(\int_{-\infty}^{\infty}\frac{f'_e(x)^2}{f_e(x)}\,dx\right)^{-1}, \]
which establishes (8.8).
Chapter 9
Closed Loop Identification
9.1 Introduction
9.1.1 Model
We consider closed loop systems of the form represented in Figure 9.1, where the input $u$ results from the addition of a reference signal $r$ and a (noiseless) feedback $-Fy$. In such systems, we can no longer assume that the input $u$ and the disturbance $v$ are independent, as often done in previous chapters, since $u$ explicitly depends on $y$, which explicitly depends on $v$. As a consequence, several (but not all) results from previous chapters do not apply.
We will however make two assumptions in our analysis:
1. The reference signal r and the disturbance v are independent. This
is always the case if r has been selected in advance and if v is a real
disturbance, but may not hold true if the system is encapsulated in a
more complex control structure, where r may be linked to the output.
The analysis of such structures is beyond the scope of this class.
2. There is a delay in at least $G$ or $F$. This is in part necessary to avoid algebraic loops such as $y(t) = 2u(t)$ and $u(t) = -5y(t)$. Physical systems always contain such delays, but problems could arise if one samples a system with too coarse a time resolution.
Figure 9.1: General closed loop system with noiseless feedback $F$. $r$ is the reference signal, $u$ the input of the system $G$ resulting from $u = r - Fy$, and $y = Gu + v$ its noisy output.
9.1.2 Examples of problems
One could be tempted to dismiss the possible dependence between u and v
as a purely theoretical issue. The following examples show however that it
can lead to dramatic practical problems:
Example 9.1 (Fourier Analysis). Consider a system with constant feedback and suppose that $r \equiv 0$, so that $u = -Fy$ and $y = Gu + He$ for some scalar $F$. Applying the Fourier Analysis (see Section 2.2.2) leads to
\[ \hat G(e^{i\omega}) = \frac{Y_N(\omega)}{U_N(\omega)} = -\frac{1}{F}. \]
The estimate is thus independent of $G$ and $H$, and pretty much useless (unless one wanted to discover $F$). Note that this is not in contradiction with the fact that the asymptotic error of this estimate is proportional to the noise-to-signal ratio, as $V_N/U_N = -1/F - G$.
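A tiny simulation of this effect (an addition to these notes; the first-order system and the numbers are illustrative) is given below: with $r = 0$ and $u(t) = -F y(t)$, the ratio of the DFTs of $y$ and $u$ equals $-1/F$ at every frequency, whatever $G$ and $H$ are.

```python
# Illustrative simulation of Example 9.1: the ETFE of a purely fed-back system
# returns -1/F instead of anything related to G.
import numpy as np

rng = np.random.default_rng(5)
N, F = 4096, 2.0
e = rng.standard_normal(N)
y, u = np.zeros(N), np.zeros(N)
for t in range(1, N):
    y[t] = 0.8 * y[t-1] + 0.5 * u[t-1] + e[t]   # some plant G (with a delay) plus noise
    u[t] = -F * y[t]                            # pure feedback, r = 0

Y, U = np.fft.rfft(y), np.fft.rfft(u)
print(np.round((Y[1:8] / U[1:8]).real, 4))      # approximately -1/F = -0.5 everywhere
```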
Example 9.2. We want to apply a prediction error method with the class of predictors
\[ \hat y_{a,b}(t) = a\,y(t-1) + b\,u(t-1). \]
Suppose that the input is subject to output feedback $u(t) = -k\,y(t)$, where $k$ is a scalar. Then the predictors will take the value
\[ \hat y_{a,b}(t) = a\,y(t-1) + b\,u(t-1) = (a - kb)\,y(t-1). \]
One can directly see that any two models for which $a - kb$ takes the same value will have exactly the same predicted value, and will thus be indistinguishable in this context, even though they may correspond to very different systems and have very different behaviors when controlled in open loop or with other output feedback controllers. In particular, no indication is provided as to whether the open loop is stable, i.e. whether $|a| < 1$. This remains an issue even when $k$ is known.
It will be seen that the problem here is one of information content. Indeed, the signals generated do not allow discriminating between certain predictors (those with the same $a - kb$), and are thus not informative enough, independently of their frequency content. This is not in contradiction with the main results of Chapter 5, as these were only valid in open loop.
9.1.3 Summary of good news and bad news
9.1.4 Good News
If:
• the data are informative enough,
• the real system belongs to the model set $\mathcal M$,
then most results about consistency and optimality of methods apply.
9.1.5 Bad News
Unfortunately, there is also some bad news:
1. We have at this stage no way to check whether the data are informative enough, as the main results of Chapter 5 do not apply to closed loop systems. This can be a serious problem, as shown in Example 9.2.
2. The "naive" spectral analysis fails and leads to
\[ \frac{\Phi_{yu}}{\Phi_u} = \frac{G\,\Phi_r - F\,\Phi_v}{\Phi_r + |F|^2\,\Phi_v}. \]
So does the Fourier analysis, see Example 9.1.
3. The correlation analysis also a priori fails, because it required that $\bar E\, u(t)e(t-\tau) = 0$, which is no longer true.
4. If we try to apply a regression method when $v$ is not a white noise, we will generally have $\bar E\,\varphi v \ne 0$, so we lose consistency.
5. If $G \in \mathcal M$ and $H \notin \mathcal M$, we cannot apply Theorem 7.2 to guarantee the consistency for $G$.
In the sequel, we see in Section 9.2 how one can guarantee that the
signals contain enough information, and we present some methods that can
be applied to identify closed loop systems in Section 9.3.
9.2 Information of experiments
9.2.1 Introduction
Since the main results about information content do not apply to closed loop systems, we come back to the definition. Remember that data are informative with respect to a given class of predictor filters $\mathcal W$ (i.e. a set of pairs of filters $W_u, W_y$ such that $\hat y = W_u u + W_y y$), if the following implication holds for all $(W^1_u, W^1_y), (W^2_u, W^2_y) \in \mathcal W$:
\[ \bar E\left((W^1_u - W^2_u)u + (W^1_y - W^2_y)y\right)^2 = 0 \;\Rightarrow\; W^1_u = W^2_u \text{ and } W^1_y = W^2_y. \]
Let $\Delta W_u = W^2_u - W^1_u$ and $\Delta W_y = W^2_y - W^1_y$. The implication above is equivalent to
\[ \bar E\left(\Delta W_u\, u + \Delta W_y\, y\right)^2 = 0 \;\Rightarrow\; \Delta W_u = 0,\; \Delta W_y = 0 \tag{9.1} \]
(for all $\Delta W_u, \Delta W_y$ consistent with $\mathcal W$). Unlike in the open loop case, determining exactly when this condition holds can be very complex, and depends
on both the controller and the reference signal. We analyze here two decoupled cases: one where the information content is guaranteed by the controller, and one where it is guaranteed by the reference signal. One can of course combine the different ideas presented.
9.2.2 Regulator-based approach
Suppose that $r = 0$. If the experiment does not generate informative enough data, then $\bar E(\Delta W_u u + \Delta W_y y)^2 = 0$ is verified for nonzero $\Delta W_u$ and $\Delta W_y$ (consistent with the model sets). Therefore $u$ and $y$ are related by
\[ \Delta W_u\, u \approx -\Delta W_y\, y \;\Rightarrow\; u \approx -\frac{\Delta W_y}{\Delta W_u}\, y. \]
This relation is deterministic and can thus only be caused by the feedback law, as the direct system induces some noise (unless $v \equiv 0$, or $u \equiv y \equiv 0$, which only happens if $v \equiv 0$). There would thus hold
\[ F = \frac{\Delta W_y}{\Delta W_u}. \tag{9.2} \]
To guarantee that the data generated are informative enough, we thus just need to make sure that this equality cannot hold. Observe that relation (9.2) has three important properties that we can use to our advantage:
• it is linear,
• it is time invariant,
• its complexity is bounded by that of $W_u$ and $W_y$ (i.e. it must be expressible as a ratio between differences of $W_u$ and of $W_y$ consistent with the model set considered).
Therefore, any feedback law that fails to satisfy at least one of these properties cannot satisfy (9.2) and thus leads to informative enough data. This will be the case for feedback laws that are
• non-linear (in the range of values taken by the system),
• time varying (oscillating between $f_1$ and $f_2$ for example),
• of sufficiently high order as a transfer function (the problem with this option is that it is hard to know in advance how high an order is needed).
Note that being informative enough is not a gradual property, in the sense that the data are either informative enough or not informative enough. However, one should keep in mind that, in order to successfully perform an identification within a limited time period, the data should be "sufficiently informative enough" during that time period, and one should choose the regulator appropriately. For example, a very slight nonlinearity will lead to informative enough data, but might be almost useless on short time periods.
9.2.3 Reference signal-based approach
We now see how the reference signal can be used to generate informative enough signals, given a fixed controller. For this purpose, we reformulate the implication (9.1) in terms of the reference signal and the disturbance. Suppose that the data are really generated by a closed loop system as that of Figure 9.1, for some transfer functions $G$ and $F$ (with finitely many zeros). We have the equalities
\[ y = Gu + v, \qquad u = r - Fy. \]
Solving for $u$ and $y$, we obtain
\[ (1 + FG)\,y = Gr + v, \qquad (1 + FG)\,u = r - Fv. \]
Writing for concision $S = \frac{1}{1+FG}$, we have thus
\[ \begin{pmatrix} u \\ y \end{pmatrix} = \begin{pmatrix} 1 & -SF \\ G & S \end{pmatrix}\begin{pmatrix} Sr \\ v \end{pmatrix} =: Q \begin{pmatrix} Sr \\ v \end{pmatrix}. \]
Observe that $\det(Q) = S + SFG = 1$, so that $Q$ is always invertible. Let
us now re-write the difference of predictions in terms of $Sr$ and $v$:
\[ \Delta W_u\, u + \Delta W_y\, y = \begin{pmatrix} \Delta W_u & \Delta W_y \end{pmatrix} Q \begin{pmatrix} Sr \\ v \end{pmatrix} =: \begin{pmatrix} \Delta\tilde W_r & \Delta\tilde W_v \end{pmatrix}\begin{pmatrix} Sr \\ v \end{pmatrix}. \]
Since $Q$ is invertible, there is a linear bijection between $\begin{pmatrix}\Delta W_u & \Delta W_y\end{pmatrix}$ and $\begin{pmatrix}\Delta\tilde W_r & \Delta\tilde W_v\end{pmatrix}$. As a result,
\[ \begin{pmatrix}\Delta\tilde W_r & \Delta\tilde W_v\end{pmatrix} = 0 \;\Longleftrightarrow\; \begin{pmatrix}\Delta W_u & \Delta W_y\end{pmatrix} = 0. \tag{9.3} \]
Moreover, since $r$ and $v$ are assumed to be independent, we can rewrite the expected square of the difference of predictions in terms of $\Delta\tilde W_r$ and $\Delta\tilde W_v$ as
\[ \bar E\left(\Delta W_u u + \Delta W_y y\right)^2 = \bar E\left(\Delta\tilde W_r\, Sr + \Delta\tilde W_v\, v\right)^2 = \bar E\left(\Delta\tilde W_r\, Sr\right)^2 + 2\,\bar E\left(\Delta\tilde W_r\, Sr\;\Delta\tilde W_v\, v\right) + \bar E\left(\Delta\tilde W_v\, v\right)^2 = \bar E\left(\Delta\tilde W_r\, Sr\right)^2 + \bar E\left(\Delta\tilde W_v\, v\right)^2. \tag{9.4} \]
Therefore, thanks to this equivalence and to (9.3), the implication (9.1) is equivalent to
\[ \bar E\left(\Delta\tilde W_r\, Sr\right)^2 + \bar E\left(\Delta\tilde W_v\, v\right)^2 = 0 \;\Rightarrow\; \Delta\tilde W_r = 0,\; \Delta\tilde W_v = 0. \]
Observe that the latter implication holds if the two following conditions hold:
(i) $\bar E(\Delta\tilde W_v\, v)^2 = 0 \;\Rightarrow\; \Delta\tilde W_v = 0$,
(ii) $\bar E(\Delta\tilde W_r\, Sr)^2 = 0 \;\Rightarrow\; \Delta\tilde W_r = 0$.
These relations can also be rewritten as (see equation (1.9) and Section 1.2.1):
(i) $\int \left|\Delta\tilde W_v\right|^2 \Phi_v = 0 \;\Rightarrow\; \Delta\tilde W_v = 0$,
(ii) $\int \left|\Delta\tilde W_r\right|^2 \Phi_{Sr} = 0 \;\Rightarrow\; \Delta\tilde W_r = 0$.
It is important to realize that, if the model class considered consists of rational transfer functions, each of $\Delta\tilde W_v$ and $\Delta\tilde W_r$ either has a finite number of zeros or is identically zero everywhere, because they are differences between rational transfer functions (the same applies for almost all natural classes of models). Let us assume that we are in that case. Then, to ensure (i) and (ii) for any possible degrees of $\tilde W_v$ and $\tilde W_r$, it suffices that $\Phi_v$ and $\Phi_{Sr}$
be strictly positive at infinitely many points (formally: have a positive measure on infinitely many intervals), which is always the case when $v$ and $Sr$ are persistently exciting of all orders. Since $v$ is a disturbance, it can be assumed to be persistently exciting of all orders. As for $Sr$, since $S$ has only finitely many zeros, it can only bring a finite number of points of the spectrum to zero, so $Sr$ is persistently exciting of all orders if $r$ is.
In conclusion, as long as $r$ is persistently exciting of all orders, we can guarantee that the data are informative enough. Moreover, one can verify using our approach that the data can still be informative enough if $r$ is persistently exciting of a finite but sufficiently high order.
9.3 Identification Methods
9.3.1 Direct Method
It is the most straightforward method. It simply consists in focusing on the
system, ignoring the fact that the system is in closed loop, and applying
any prediction error method. One must however make sure that the data are
informative enough, as explained in Section 9.2.
Pros
• The method only relies on $u$ and $y$. No knowledge of $r$ or $F$ is needed, and it works for every feedback mechanism, even if it is not LTI.
• It uses the same algorithms as for open loop identification (no special tool is needed).
• The main results on consistency (Theorem 7.1) and optimality (Chapter 8) hold when the real system belongs to the model set.
Cons
As opposed to open loop, both $G$ and $H$ must absolutely belong to the model set for this method to be consistent, because Theorem 7.2 only applies for open loop systems.
However, the asymptotic bias will remain small even if $H$ does not belong to the model set if:
• the signal-to-noise ratio is good,
• the effect of the feedback is "small",
• there is a "real" $H_0$ which is close to some $H_\theta$ in the model set.
9.3.2 Indirect Method
The idea of the indirect method is to consider the larger system that includes both the initial system and the feedback loop, and whose input is the reference signal $r$ and output is $y$. Indeed, the closed loop equation can also be rewritten as
\[ y = \frac{G}{1+FG}\, r + \frac{1}{1+FG}\, He \;\Longleftrightarrow\; y = G_{cl}\, r + H_{cl}\, e, \]
which is now expressed as an open loop system that we can identify as such, using any method of our choice. After that, we can re-obtain $G$ and $H$ by solving
\[ G_{cl} = \frac{G}{1+FG}, \qquad H_{cl} = \frac{H}{1+FG}. \]
However, solving this system is not so easy, and this is the major drawback of this method, because:
• $F$ must be known.
• The system may admit no solution if $G_{cl}$ is only calculated approximately, which is almost always the case in practice. One then has to apply more complex methods to obtain the "best approximate solution" of the system. Note that this problem should never appear if one would only consider pairs $(G_{cl}, H_{cl})$ consistent with the selected classes of models (i.e. for which there exists a solution) in the initial identification. However, the set of such pairs of functions may not be easy to parameterize efficiently, or may not be appropriate for standard algorithms.
• The solution may not be unique if $F$ is too simple and $r$ is not persistently exciting, as described in Section 9.2.
Moreover, if $F$ is not an LTI filter, the open loop system will not be LTI either, making the initial identification and the resolution of the corresponding system much more complex. On the other hand, the main advantages of this method are that:
• One can use any method for the initial identification step.
• Theorem 7.2 is applicable, so that the method works even if $H$ does not belong to the model set considered.
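The algebraic step of the indirect method can be illustrated by the following sketch (an addition to these notes; the plant, the constant feedback and the frequency grid are illustrative): once an estimate of $G_{cl} = G/(1+FG)$ is available, for example as a frequency response, and $F$ is known, $G$ is recovered point-wise as $G = G_{cl}/(1 - F G_{cl})$.

```python
# Illustrative point-wise recovery of G from the closed-loop estimate and known F.
import numpy as np

w = np.linspace(0.01, np.pi, 200)            # frequency grid
z = np.exp(1j * w)
G = 0.5 / (1 - 0.8 / z)                      # "true" plant (illustrative)
F = 0.4                                      # known constant feedback
Gcl_hat = G / (1 + F * G)                    # pretend this was identified

G_rec = Gcl_hat / (1 - F * Gcl_hat)          # indirect-method reconstruction
print(np.max(np.abs(G_rec - G)))             # ~0 up to rounding errors
```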
9.3.3 Joint input-output method
The idea of this method is to consider the closed loop system as a system with one input $r$ and two outputs $u$ and $y$. For this purpose, one considers a slightly different model, involving a perturbation $w$ on the feedback, as shown in Figure 9.2. One then jointly obtains the transfer function from $r$ to $y$ and the transfer function from $r$ to $u$.
For this system, we can write the following equations:
\[ y = GSr + GSw + Sv = G_{ry}\, r + v_1, \]
\[ u = Sr + Sw - FSv = G_{ru}\, r + v_2. \]
One solution is then to identify the single-input two-output system
\[ \begin{pmatrix} y \\ u \end{pmatrix} = \tilde G\, r + \tilde H\, e \]
and to then deduce $G$ and $F$. A simpler alternative consists in ignoring the correlations between $v_1$ and $v_2$, identifying separately $\hat G_{ry}$ and $\hat G_{ru}$, and then using the estimate $\hat G = \hat G_{ry} / \hat G_{ru}$.
Figure 9.2: Closed loop model used for the joint input-output method
Chapter 10
How to approach a problem of identification?
The best answer is probably "Using common sense, your experience and your knowledge about the system physics". Figure 10.1 represents the main blocks of the identification process. In this chapter, we will briefly discuss all these blocks. More information is available in Chapters 11 and 12 of [2] and Part III of [1].
Figure 10.1: Representation of the identification process. In some cases, one cannot make an experiment, and must just use some given data. The "experiment design" and "data collection" parts are then removed. In an alternative view, one performs identification simultaneously with different model sets and criteria, and compares all results afterwards, usually avoiding the iteration process.
10.1 Experience/intuition, physics
Information is often cheaper than running an experiment. One should thus gather as much information on the problem as possible. In most cases some prior knowledge of the problem and of its physics or properties is available: one has an idea of what the system does or of what we want it to do. When no information at all is available, one can begin with simple things (nonparametric methods, ETFE, impulse response, low order models) to gather some
information and experience, and iterate. Such analysis should of course be performed in conditions deemed "normal" for the system considered.
At this stage, it is also important to analyze the purpose of the identification (control, estimation of a parameter, fault detection, prediction, etc.) and to use this information to choose appropriate model sets, criteria, and experiments.
10.2 Experiment design
The experiment should take place in conditions as close as possible to those in which the results will need to be used. The aim is to get as much data as possible for the cost that one is ready to pay.
10.2.1 Type of input signal
• Noisy aperiodic signal: such a signal is obtained by filtering a white noise, possibly with a nonlinear or time-varying filter (or with the trivial filter, in which case the signal is white noise). It typically excites all frequencies and is persistently exciting of all orders.
• Periodic signal: periodic signals only excite a finite number of frequencies (they are persistently exciting of order O(period)). They present however the advantage of being much less sensitive to noise in some conditions. Besides, one can take advantage of the periodicity by averaging the results over various periods: apply $N_0$ times the same input signal, then average the $N_0$ outputs before identification (a small sketch of this averaging is given at the end of this subsection). The distribution of the $N_0$ different output sequences around their average then already provides some information on the noise in the system.
The choice of the input signal is very dependent on what we know about the system. If we do not know anything, we should choose the noisy signal, or a periodic signal with many frequencies. This choice will give a first estimate when first run on the system.
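The following sketch (an addition to these notes; signals and numbers are illustrative) shows the averaging idea for periodic inputs mentioned above: the same period is applied $N_0$ times, the $N_0$ recorded output periods are averaged before identification, and their spread gives a rough indication of the noise level.

```python
# Illustrative averaging of the output over repeated periods of a periodic input.
import numpy as np

rng = np.random.default_rng(6)
P, N0 = 64, 20                                          # period length, number of periods
u_period = np.sin(2 * np.pi * 3 * np.arange(P) / P)     # one period of the input
u = np.tile(u_period, N0)

# illustrative "measured" output: a short FIR filter applied to u, plus noise
y = np.convolve(u, [0.0, 0.5, 0.3], mode="full")[:u.size]
y += 0.2 * rng.standard_normal(y.size)

Y = y.reshape(N0, P)                                    # one row per period
y_avg = Y.mean(axis=0)                                  # averaged output (noise variance / N0)
noise_std = Y.std(axis=0, ddof=1) / np.sqrt(N0)         # rough uncertainty on the average
print(y_avg[:5], noise_std[:5])
```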
10.2.2 Frequency content
Remember that applying the prediction error method with the usual norm is equivalent to minimizing the following expression (see Section 6.4):
\[ V(\theta) = \frac{1}{N}\sum_\omega \frac{\left|\hat{\hat G} - G_\theta\right|^2\, |U_N|^2}{|H_\theta|^2} + \text{transient terms}. \]
We should thus choose an input signal with a lot of power at the frequencies that interest us, and with less power where we know there is a lot of noise. This will be easy with a sinusoid, but more complex with white noise (filtering is then required).
The frequency content also has an important effect on the information matrix $E\left[\frac{\partial\hat y}{\partial\theta}\frac{\partial\hat y}{\partial\theta}^T\right]$, which drives the evolution of the variance of the estimated parameters (see Chapter 8). It may therefore be desirable to select the frequency content in such a way that the information matrix behaves in a desired way. This may however require knowing (at least approximately) how the system behaves.
10.2.3 Information
The signal should be persistently exciting of a sufficiently large order, so that the experiment is informative enough with respect to the selected model set. Ideally, it should be persistently exciting of an order sufficiently larger than what is necessary for being informative enough. This will allow detecting whether the model set considered was large enough. It can also prove very handy later by avoiding running new experiments in case one realizes that a larger order was needed.
Remember that, as seen in Chapter 9, special care needs to be taken when identifying a system in closed loop.
10.2.4 Length
Choose a long enough signal, and pay attention to the length of the transient part of the output: is it really playing a role? Besides, one should keep in mind that a part of the data should be kept for the validation of the models obtained.
10.3 Data collection
When identifying a continuous-time system (as most physical systems are), one needs to sample the signals. The sampling frequency should be sufficiently high to gather all the relevant information. But it should not be higher than necessary: first, a higher sampling rate often means a higher cost. Second, high sampling rates can result in additional errors and noise. In particular, the amount of noise (or rounding, etc.) on each measurement can become large with respect to the typical variations between two consecutive measurements. Finally, small delays correspond to a very large number of time steps if the sampling rate is too high. However, oversampling is always preferable to undersampling.
Note that when high-frequency phenomena in which one is not interested take place in the system, sampling at a relatively low frequency can lead to folding and aliasing problems. One should then first apply an (analog) filter to the data to remove these high-frequency phenomena.
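As a small illustration of this last remark (an addition to these notes; a crude moving-average filter is used purely for illustration, whereas in practice the anti-aliasing filtering should be done in analog form before sampling or with a properly designed digital filter), the sketch below shows how low-pass filtering before downsampling strongly reduces the component that would otherwise fold into the band of interest.

```python
# Illustrative comparison of direct downsampling vs. filtering then downsampling.
import numpy as np

N, q = 10_000, 10                       # original length, decimation factor
t = np.arange(N)
# slow component of interest + fast component that folds to 0.30 after decimation
x = np.sin(2 * np.pi * 0.010 * t) + 0.5 * np.sin(2 * np.pi * 0.430 * t)

naive = x[::q]                          # direct downsampling: the 0.43 component aliases
h = np.ones(q) / q                      # crude anti-aliasing (moving-average) filter
decimated = np.convolve(x, h, mode="same")[::q]

def amp(sig, f):
    # amplitude of the component at relative frequency f (cycles per sample)
    k = int(round(f * sig.size))
    return 2 * np.abs(np.fft.rfft(sig))[k] / sig.size

print("aliased component :", amp(naive, 0.30), "->", amp(decimated, 0.30))
print("useful component  :", amp(naive, 0.10), "->", amp(decimated, 0.10))
```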
10.4 Data analysis
10.4.1 Errors and outliers
Signals often contain errors or outliers. These incorrect values can have devastating effects on the identification performance. It is however not always easy to determine whether seemingly exceptional values correspond to errors in the measurements or to correct measurements of exceptional phenomena. Some cases are easy (typically, a zero in an otherwise continuous sequence of numbers far from zero), but one should otherwise use judgement and one's experience and knowledge of the system.
One approach to errors is to assume that the noise model will take care of them. This can however significantly increase the complexity of the noise model, so that it is most often preferable to remove or replace these outliers when preprocessing the data.
Solutions to ignore or replace the erroneous and missing data include:
• Interpolation from other values (a small sketch is given at the end of this subsection). This does not work if the sampling frequency is too low, or just above the typical frequency.
• Cutting off the parts of the data considered untrustworthy: removing the part of the data in which there is an outlier, and performing the identification based on the rest of the sequence. This works very well if the remaining data set is sufficiently large to perform a proper identification. Problems may appear when data points are frequently missing or discarded, or when the computation of the predictors requires using a lot of data points.
• Using a time-varying norm with very low weights for the times at which the data are unreliable. This is a soft version of the "cutting" solution.
• Refitting the unreliable data: first identify a system, then consider the system as fixed and identify the "real values of the input" for the times for which there is a doubt: $y(t) = \sum_k g_k\, u(t-k) + \sum_k h_k\, e(t-k)$ (where the parameters are some of the $u$). Then iterate.
Again, the most important thing is to be reasonable and careful! Moreover, it is essential to store the initial data, and to document and justify the modifications made.
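As an illustration of the interpolation-based solution above, the following preprocessing sketch (an addition to these notes; the detection rule based on the deviation from a local median, the threshold of 5 scaled median absolute deviations, and the test signal are all illustrative choices) flags suspect samples and replaces them by linear interpolation of the neighbouring values.

```python
# Illustrative outlier detection (local median + robust scale) and interpolation.
import numpy as np

def replace_outliers(y, window=11, threshold=5.0):
    n, half = y.size, window // 2
    padded = np.pad(y, half, mode="edge")
    local_med = np.array([np.median(padded[i:i + window]) for i in range(n)])
    resid = y - local_med
    mad = np.median(np.abs(resid)) + 1e-12            # robust scale of the residual
    bad = np.abs(resid) > threshold * 1.4826 * mad
    good = ~bad
    y_clean = y.copy()
    y_clean[bad] = np.interp(np.flatnonzero(bad), np.flatnonzero(good), y[good])
    return y_clean, bad

rng = np.random.default_rng(8)
y = np.sin(0.05 * np.arange(500)) + 0.05 * rng.standard_normal(500)
y[[50, 200, 201]] = 0.0                               # a few "stuck at zero" samples
y_clean, bad = replace_outliers(y)
print("samples replaced:", np.flatnonzero(bad))
```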
10.4.2 Scaling and translation
The theory was derived for LTI systems of the form $y = Gu + He$. In particular, a zero input should result in just noise as output. However, many systems contain a systematic offset (or are affine), for example because the level 0 is arbitrary (think of the level of the sea), so that they are really of the form
\[ y = Gu + He + a. \]
One solution is to consider $a$ as another parameter. This however introduces additional complexity. Another solution is to just translate the system to remove the offset, for example by removing the mean. In addition, a scaling can be applied to maximize the entropy of the data.
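A small preprocessing sketch of this translation and scaling (an addition to these notes; the operating point, gains and noise levels are illustrative) is given below; the offsets and scale factors must of course be stored in order to convert the identified model back to the original physical units.

```python
# Illustrative offset removal and rescaling before identification.
import numpy as np

rng = np.random.default_rng(9)
u = 3.0 + 0.1 * rng.standard_normal(1000)           # input around an operating point
y = 20.0 + 5.0 * (u - 3.0) + 0.05 * rng.standard_normal(1000)

u0, y0 = u.mean(), y.mean()                          # estimated offsets
su, sy = u.std(), y.std()                            # scale factors
u_n, y_n = (u - u0) / su, (y - y0) / sy              # detrended, normalized data

# identification is done on (u_n, y_n); a static gain g_n found on the normalized
# data corresponds to g = g_n * sy / su in the original units
g_n = np.sum(y_n * u_n) / np.sum(u_n * u_n)
print(g_n * sy / su)                                 # close to the true gain 5.0
```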
10.4.3 Filtering
One can filter the data to give more weight to some frequencies. This is different from the (analog) filtering that may need to be applied before the sampling of the signals to avoid aliasing issues.
10.4.4 Information content
If the data are given, one must check whether they are informative enough with respect to the model set used. Note that in most cases, there will be a small component at all frequencies, so that the data are theoretically informative enough for all frequencies. But since only a finite amount of data is available, one cannot rely on the asymptotic ability to distinguish between different models. One should thus check whether there are enough frequencies at which the power of the signal is sufficiently high (i.e. high with respect to the noise, for example), as opposed to just positive.
10.5 Model set
If we have some physical knowledge about the system, we must take it into account (by bounding the order of the model, or by using a "gray box" model). Also, try to detect dependencies between some input and output signals. Otherwise, one will usually try different model sets, starting with simple ones (i.e. low order), and increasing the complexity until an appropriate result is found. In some cases, one may also deliberately choose a certain model order due to the intended use of the model.
10.6 Criteria
Should we try several criteria? Again, one should take into account the knowledge about the system, and the reason for which the identification is performed.
If prediction error methods are used, we have seen that some norms are optimal, but that the performance can significantly decrease if one uses them inappropriately, or for a slightly different white noise p.d.f. One could thus for example use robust norms (see Section 15.2 in [1]).
10.7 Computation
The implementation of the methods presented in these notes may be nontrivial, and is beyond our scope here. These methods have been implemented in several toolboxes, as for example the ident toolbox of MATLAB. One should however remember not to use such tools blindly. Among other concerns, one should for example take care not to use a model with way too many parameters, for this could cause identifiability issues and numerical problems.
10.8 Validation/comparison
While too often neglected, the validation of the results and/or the comparison between the different models obtained and the reality are among the most important steps of the identification procedure.
A first natural step is to compare the output computed with the identified model with the real outputs. One should verify whether they behave in a similar way, and whether the behavior of the identified model makes sense from a physical point of view.
Figure 10.1: Overfitting
Under the constraints that the model set has to be large enough to contain $\mathcal S$, and not too large to avoid numerical instabilities and redundancy, there are two important questions ("one criterion", "one model"):
1. Is the model close enough to the real system?
This can be tested in several ways. First, the prediction error should ideally be a white noise. One can test
\[ E[\varepsilon(t|\hat\theta)\,\varepsilon(t-\tau|\hat\theta)], \qquad E[\varepsilon(t|\hat\theta)\,u(t-\tau)] \qquad \text{and} \qquad E[\varepsilon(t|\hat\theta)\,y(t-\tau)], \]
which should all be close to 0 for $\tau \ge 1$ (at least in open loop); a small numerical sketch of these tests is given after this list. We should also expect a symmetry in the distributions of the prediction errors and of the three quantities above. Remember also that we can compute correlations also with signals that were not used in the identification procedure. This is particularly useful to see whether those signals should in fact be used.
When the model obtained is considered to be too different from reality, one should re-iterate the identification procedure with a larger or different class of models, possibly involving other signals.
2. If we reduce the model set, do we lose accuracy?
For the same accuracy, it is always preferable to use a model with a smaller description. There are several ways to see whether the model order can be reduced.
• If the high order coefficients are very small, one can reduce the model order ($10^{-9}x^5 \to$ an $x^4$ model).
• If there is a zero and a pole approximately at the same place, one can delete them, or at least try a lower-order model to avoid overparametrization.
• The AIC test (Akaike's Information Criterion) evaluates the gain in accuracy resulting from the addition/removal of parameters, and allows evaluating whether a parameter improves the accuracy of the model sufficiently (see for example Section 16.4 in [1]).
• Rank of the information matrix $E\left[\frac{\partial\hat y}{\partial\theta}\frac{\partial\hat y}{\partial\theta}^T\right]$: if this matrix is singular, then there exists a vector $\Delta\theta$ such that
\[ \Delta\theta^T E\left[\frac{\partial\hat y}{\partial\theta}\frac{\partial\hat y}{\partial\theta}^T\right]\Delta\theta = 0, \]
or equivalently,
\[ E\left[\left(\frac{\partial\hat y}{\partial\theta}^T\Delta\theta\right)^2\right] = 0, \]
which implies that
\[ \frac{\partial\hat y}{\partial\theta}^T\Delta\theta \simeq 0. \]
This means that there exists a direction $\Delta\theta$ along which a modification of the parameters $\theta$ would, to first order, not modify the predicted values $\hat y$. In such a case, the model is clearly locally overparametrized around $\hat\theta$, and a smaller order model should be used.
Similarly, if the information matrix is almost singular, then there is a direction of modification of $\theta$ that almost does not affect the predictor, and one should consider working with a smaller order model set.
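The following sketch (an addition to these notes; the data-generating system, the predictor and the numbers are illustrative) implements the residual tests of point 1 above: it estimates the normalized correlations between the prediction errors and between the errors and past inputs, which should stay roughly within $\pm 2/\sqrt{N}$ for $\tau \ge 1$ when the model is adequate.

```python
# Illustrative residual whiteness and cross-correlation tests.
import numpy as np

def xcorr(a, b, max_lag):
    # normalized estimate of E[a(t) b(t - tau)] for tau = 1..max_lag
    a, b = a - a.mean(), b - b.mean()
    c = [np.mean(a[tau:] * b[:b.size - tau]) for tau in range(1, max_lag + 1)]
    return np.array(c) / (a.std() * b.std())

rng = np.random.default_rng(10)
N = 2000
u = rng.standard_normal(N)
y = np.zeros(N)
for t in range(1, N):
    y[t] = 0.5 * y[t-1] + u[t-1] + 0.1 * rng.standard_normal()

# residuals of a (here correct) one-step-ahead predictor yhat = 0.5*y(t-1) + u(t-1)
eps = y[1:] - (0.5 * y[:-1] + u[:-1])

print("eps-eps correlation:", np.round(xcorr(eps, eps, 5), 3))
print("eps-u   correlation:", np.round(xcorr(eps, u[1:], 5), 3))
print("2/sqrt(N) threshold:", round(2 / np.sqrt(N), 3))
```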
Figure 10.2: Selection of the model order based on the evolution of the determinant of the information matrix
Besides, in many cases, one will observe that the performance (on the sets on which the optimization was applied) increases when increasing the model order, until it reaches a very slowly increasing plateau. The point at which that plateau is reached usually corresponds to the best model order (i.e. the smallest order for which a near optimal performance is obtained).
Finally, one of the simplest and most efficient tests when no specific constraint is present is to select the model structure for which the corresponding model performs best on the validation set (see Figure 10.3). When a large number of models is present, one will often select the model set by comparing performances on a comparison set, and re-evaluate its performance on a validation set. Indeed, the act of selecting a model among many based on its performance on some set automatically creates a bias in its performance on that set (of course the model is good on this comparison set, since we have precisely selected the one which performed best on it).
Figure 10.3: Error made on the validation set
Acknowledgment
The very first version of these notes was drafted with the help of Pierre-Yves Andre, Jonathan Bonnevie, Philippe Delandmeter, Johan Dupont, Cyrille Lefevre and Valentin Vallaeys in 2011.
Lefevre and Valentin Vallaeys in 2011.
Many thanks also to Adrien Taylor and to the CLIL team for their help in
the improvements made in 2016.
Bibliography
[1] L. Ljung, System Identification: Theory for the User, Prentice Hall, 1999.
[2] T. Soderstrom and P. Stoica, System Identification, http://user.it.uu.se/~ts/sysidbook.pdf
[3] T. Needham, Visual Complex Analysis, Clarendon Press, New York, 1997.
