System Identification
Kristiaan Pelckmans Johan A.K. Suykens
Abstract: This special session aims to survey, present new results and stimulate discussions
on how to apply techniques of convex optimization in the context of system identification. The
scope of this session includes but is not limited to linear and nonlinear modeling, model structure
selection, block structured systems, regularization mechanisms (e.g. L1, sum-of-norms, nuclear
norm), segmentation of time-series, trend filtering, optimal experiment design and others.
1. INTRODUCTION
REFERENCES
A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications, volume 2. Society for Industrial and Applied Mathematics, 2001.
P.J. Bickel, B. Li, A.B. Tsybakov, S.A. van de Geer, B. Yu, T. Valdes, C. Rivero, J. Fan, and A. van der Vaart. Regularization in statistics. Test, 15(2):271–344, 2006.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
S. Boyd, L. El Ghaoui, E. Feron, and V. Balakrishnan. Linear Matrix Inequalities in System and Control Theory, volume 15. Society for Industrial and Applied Mathematics, 1994.
E.J. Candès and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation with application to system identification. SIAM Journal on Matrix Analysis and Applications, 31(3):1235–1256, 2009.
L. Ljung. System Identification. Wiley Online Library, 1999.
L. Ljung, H. Hjalmarsson, and H. Ohlsson. Four encounters with system identification. European Journal of Control, 17(5):449–471, 2011.
C.H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Dover, 1998.
B. Recht, M. Fazel, and P.A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
A. Schrijver. Theory of Linear and Integer Programming. John Wiley & Sons, 1998.
Abstract: In recent years there has been growing interest in convex optimization techniques
for system identification and time series modeling. This interest is motivated by the success of
convex methods for sparse optimization and rank minimization in signal processing, statistics,
and machine learning, and by the development of new classes of algorithms for large-scale
nondifferentiable convex optimization.
1. INTRODUCTION
Low-dimensional model structure in identification problems is typically expressed in terms of matrix rank or sparsity of parameters. In optimization formulations this generally leads to non-convex constraints or objective functions. However, formulations based on convex penalties that indirectly minimize rank or maximize sparsity are often quite effective as heuristics, relaxations, or, in rare cases, exact reformulations. The best known example is ℓ1-norm regularization in sparse optimization, i.e., the use of the ℓ1-norm ‖x‖₁ in an optimization problem as a substitute for the cardinality (number of nonzero elements) of a vector x. This idea has a rich history in statistics, image and signal processing [Rudin et al., 1992, Tibshirani, 1996, Chen et al., 1999, Efron et al., 2004, Candès and Tao, 2007], and an extensive mathematical theory has been developed to explain when and why it works well [Donoho and Huo, 2001, Donoho and Tanner, 2005, Candès et al., 2006b, Candès and Tao, 2005, Candès et al., 2006a, Candès and Tao, 2006, Donoho, 2006, Tropp, 2006]. Several excellent surveys and tutorials on this topic are available; see for example [Romberg, 2008, Candès and Wakin, 2008, Elad, 2010].
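As a minimal illustration (our sketch, not code from any of the cited works), the ℓ1 substitution turns the intractable cardinality-penalized problem into a convex one that can be prototyped in CVX; the data A, b and the weight gamma are assumed given:

% ℓ1-regularized least squares (lasso-type) sketch: the ℓ1 term
% stands in for card(x); A, b and gamma are assumed given.
n = size(A, 2);
cvx_begin
    variable x(n)
    minimize( sum_square(A * x - b) + gamma * norm(x, 1) )
cvx_end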
The ℓ1-norm used in sparse optimization has a natural counterpart in the nuclear norm for matrix rank minimization. Here one uses the penalty function ‖X‖∗, where ‖·‖∗ denotes the nuclear norm (sum of singular values), as a substitute for rank(X). Applications of nuclear norm methods in system theory and control were first explored by [Fazel, 2002, Fazel et al., 2004], and have recently gained in popularity in the wake of the success of ℓ1-norm techniques for sparse optimization [Recht et al., 2010]. Much of the recent work in this area has focused on the low-rank matrix completion problem [Candès and Recht, 2009, Candès and Plan, 2010, Candès and Tao, 2010, Mazumder et al., 2010], i.e., the problem of identifying a low-rank matrix from a subset of its entries. This problem has applications in collaborative prediction [Srebro et al., 2005] and multi-task learning [Pong et al., 2011]. Applications of nuclear norm methods in system identification are discussed in [Liu and Vandenberghe, 2009a, Grossmann et al., 2009, Mohan and Fazel, 2010, Gebraad et al., 2011, Fazel et al., 2011].
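To make the matrix completion problem concrete, a minimal CVX sketch (our illustration, not code from the cited references): M is the partially observed matrix and Omega a logical mask of the observed entries, both assumed given.

% Nuclear norm heuristic for low-rank matrix completion (sketch).
[mM, nM] = size(M);
cvx_begin
    variable X(mM, nM)
    minimize( norm_nuc(X) )
    subject to
        X(Omega) == M(Omega)   % match the observed entries exactly
cvx_end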
S(ω) = Σ_k R_k e^{jkω},   (1)

where the coefficients R_k are given by the sums

R_k = Σ_i X_{i,i+k},   k = 0, 1, . . . , p,   (2)

of the blocks in X.
An extension to ARMA processes is studied by Avventi et al. [2010].
3. GRAPHICAL MODELS
In a graphical model of a normal distribution x ∼ N(0, Σ) the edges in the graph represent the conditional dependence relations between the components of x. The vertices in the graph correspond to the components of x; the absence of an edge between vertices i and j indicates that x_i and x_j are independent, conditional on the other entries of x. Equivalently, vertices i and j are connected if there is a nonzero in the i, j position of the inverse covariance matrix Σ⁻¹.
A key problem in the estimation of the graphical model
is the selection of the topology. Several authors have
addressed this problem by adding an ℓ1-norm penalty to the maximum likelihood estimation problem, and solving

minimize   tr(CX) − log det X + γ ‖X‖₁,

where C is the sample covariance, ‖X‖₁ = Σ_{i,j} |X_{ij}|, and γ > 0 is a regularization parameter.
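This penalized maximum likelihood problem can be prototyped directly in CVX; a minimal sketch, with the sample covariance C and the weight gamma assumed given:

% Graphical lasso sketch: trace(C*X) - log det X + gamma*||X||_1 over
% symmetric X; the sparsity pattern of the solution estimates the graph.
nC = size(C, 1);
cvx_begin
    variable X(nC, nC) symmetric
    minimize( trace(C * X) - log_det(X) + gamma * sum(sum(abs(X))) )
cvx_end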
4. ALGORITHMS
For small and medium sized problems the applications
discussed in the previous sections can be handled by
general-purpose convex optimization solvers, such as the
modeling packages CVX [Grant and Boyd, 2007] and YALMIP [Löfberg, 2004], and general-purpose conic optimization packages. In this section we discuss algorithmic
approaches that are of interest for large problems that fall
outside the scope of the general-purpose solvers.
4.1 Interior-point algorithms
Interior-point algorithms are known to attain a high accuracy in a small number of iterations, fairly independent of problem data and dimensions. The main drawback is the high linear algebra complexity per iteration associated with solving the Newton equations that determine search directions. However, sometimes problem structure can be exploited to devise dedicated interior-point implementations that are significantly more efficient than general-purpose solvers.
A simple example is the ℓ1-norm approximation problem

minimize   ‖Ax − b‖₁,

which is equivalent to the linear program

minimize   Σ_{i=1}^m y_i
subject to  −y ≤ Ax − b ≤ y,

at the expense of introducing m auxiliary variables and 2m linear inequality constraints. By taking advantage of the structure in the inequalities, each iteration of an interior-point method for the LP can be reduced to solving linear systems AᵀDA Δx = r, where D is a positive diagonal matrix. As a result, the complexity of solving the ℓ1-norm approximation problem using a custom interior-point solver is roughly the equivalent of a small number of weighted least-squares problems.
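To make the reduction concrete, a hypothetical inner step (the names are ours; D and r are whatever the current interior-point iterate provides):

% One inner step of a custom interior-point method for min ||Ax - b||_1:
% the search direction solves the weighted normal equations A'*D*A*dx = r,
% where D is a positive diagonal matrix from the current iterate.
dx = (A' * D * A) \ r;   % same cost as one weighted least-squares solve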
A similar result holds for the nuclear norm approximation problem

minimize   ‖A(x) − B‖∗.   (3)
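A minimal CVX sketch of (3), under the assumption (ours, not the paper's notation) that the linear mapping has the form A(x) = x₁A{1} + ... + x_pA{p} for given matrices A{i} stored in a cell array:

% Nuclear norm approximation (3) in CVX, assuming A(x) = sum_i x_i*A{i}.
cvx_begin
    variable x(p)
    expression M(size(B, 1), size(B, 2))
    M = -B;
    for i = 1:p
        M = M + x(i) * A{i};   % accumulate the affine expression A(x) - B
    end
    minimize( norm_nuc(M) )
cvx_end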
R. Dahlhaus. Graphical interaction models for multivariate time series. Metrika, 51(2):157–172, 2000.
T. Ding, M. Sznaier, and O. Camps. A rank minimization approach to fast dynamic event detection and track matching in video sequences. In Proceedings of the 46th IEEE Conference on Decision and Control, 2007.
D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
D. L. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory, 47(7):2845–2862, 2001.
D. L. Donoho and J. Tanner. Sparse nonnegative solutions of underdetermined systems by linear programming. Proceedings of the National Academy of Sciences of the United States of America, 102(27):9446–9451, 2005.
J. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse Gaussians. In Proceedings of the Conference on Uncertainty in AI, 2008.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
M. Elad. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer, 2010.
M. Fazel. Matrix Rank Minimization with Applications. PhD thesis, Stanford University, 2002.
M. Fazel, H. Hindi, and S. Boyd. Rank minimization and applications in system theory. In Proceedings of the American Control Conference, pages 3273–3278, 2004.
M. Fazel, T. K. Pong, D. Sun, and P. Tseng. Hankel matrix rank minimization with applications to system identification and realization. 2011. Submitted.
J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
P. M. O. Gebraad, J. W. van Wingerden, G. J. van der Veen, and M. Verhaegen. LPV subspace identification using a novel nuclear norm regularization method. In Proceedings of the American Control Conference, pages 165–170, 2011.
M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming (web page and software). http://stanford.edu/~boyd/cvx, 2007.
C. Grossmann, C. N. Jones, and M. Morari. System identification via nuclear norm regularization for simulated moving bed processes from incomplete data sets. In Proceedings of the 48th IEEE Conference on Decision and Control, pages 4692–4697, 2009.
R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12:2297–2334, 2011.
L. Li and K.-C. Toh. An inexact interior point method for L1-regularized sparse covariance selection. Mathematical Programming Computation, 2:291–315, 2010.
Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation with application to system identification. SIAM Journal on Matrix Analysis and Applications, 31:1235–1256, 2009a.
Z. Liu and L. Vandenberghe. Semidefinite programming methods for system realization and identification. In Proceedings of the 48th IEEE Conference on Decision and Control, pages 4676–4681, 2009b.
Abstract: Change detection has traditionally been seen as a centralized problem. Many
change detection problems are however distributed in nature and the need for distributed
change detection algorithms is therefore significant. In this paper a distributed change detection
algorithm is proposed. The change detection problem is first formulated as a convex optimization
problem and then solved distributively with the alternating direction method of multipliers
(ADMM). To further reduce the computational burden on each sensor, a homotopy solution
is also derived. The proposed method has interesting connections with the Lasso and compressed sensing, and the theory developed for these methods is therefore directly applicable.
1. INTRODUCTION
The change detection problem is often thought of as a centralized problem. Many scenarios are, however, distributed and lack a central node, or require distributed processing. A practical example is a sensor network. Selecting one of the sensors as a central node may make the network vulnerable. Moreover, it may be preferable if failing sensors can be detected in a distributed manner. Another practical example is the monitoring of a fleet of agents (airplanes/UAVs/robots) of the same type, see e.g., Chu et al. [2011]. The problem is how to detect if one or more agents start deviating from the rest. Theoretically, this can be done straightforwardly in a centralized manner. The centralized solution, however, poses many difficulties in practice. For instance, the communication between the central monitor and the agents, and the computation capacity and speed of the central monitor, become highly demanding for a large number of agents in the fleet and/or extremely large data sets, Chu et al. [2011]. Therefore, it is desirable to deal with the change detection problem in a distributed way.
In a distributed setting, there will be no central node. Each
sensor or agent makes use of measurements from itself and
the other sensors or agents to detect if it has failed or
not. To tackle the problem, we first formulate the change
detection problem as a convex optimization problem. We
then solve the problem in a distributed manner using
the so-called alternating direction method of multipliers
(ADMM, see for instance Boyd et al. [2011]). The optimization problem turns out to have connections with the
Lasso [Tibshirani, 1996] and compressive sensing [Candès et al., 2006, Donoho, 2006], and the theory developed for these methods is therefore directly applicable.
* Ohlsson, Chen and Ljung are partially supported by the Swedish Research Council in the Linnaeus center CADICS and by the European Research Council under the advanced grant LEARN, contract 267381. Ohlsson is also supported by a postdoctoral grant from the Sweden-America Foundation, donated by ASEA's Fellowship Fund, and by a postdoctoral grant from the Swedish Research Council.
The failing sensors can in principle be found by solving

minimize over θ₁, . . . , θ_N, θ   Σ_{i=1}^N Σ_{t=T−T_i+1}^T ‖y_i(t) − φ_iᵀ(t)θ_i‖²_{σ_i}   (1)
subject to   ‖[ ‖θ₁ − θ‖_p · · · ‖θ_N − θ‖_p ]‖₀ = k,   (2b)

with ‖·‖_p being the p-norm, p ≥ 1, and ‖·‖_{σ_i} defined as ‖·/σ_i‖₂. The k failing sensors could now be identified as the sensors for which ‖θ_i − θ‖_p ≠ 0. It follows from basic optimization theory (see for instance Boyd and Vandenberghe [2004]) that there exists a λ > 0 such that

minimize over θ₁, . . . , θ_N, θ   Σ_{i=1}^N Σ_{t=T−T_i+1}^T ‖y_i(t) − φ_iᵀ(t)θ_i‖²_{σ_i} + λ ‖[ ‖θ₁ − θ‖_p · · · ‖θ_N − θ‖_p ]‖₁   (3)

gives the same solution. The estimate θ obtained from the sum-of-norms criterion (4) should be interpreted as the nominal model. Most sensors will have data that can be explained by the nominal model, and the criterion (4) will therefore give θ_i = θ for most i's. However, failing sensors will generate a data sequence that could not have been generated by the nominal model represented by θ, and for these sensors (4) will give θ_i ≠ θ.
By introducing auxiliary variables ζ_i, the problem can be rewritten as

minimize over θ₁, . . . , θ_N, ζ₁, . . . , ζ_N, θ   Σ_{i=1}^N ( ‖Y_i − Φ_i θ_i‖²_{σ_i} + λ ‖ζ_i − θ_i‖_p )   (6a)
subject to   ζ_i − θ = 0,   i = 1, . . . , N.   (6b)

Let x = [θ₁ᵀ · · · θ_Nᵀ ζ₁ᵀ · · · ζ_Nᵀ]ᵀ, let δ = [δ₁ᵀ δ₂ᵀ · · · δ_Nᵀ]ᵀ be the Lagrange multiplier vector, and let δ_i be the Lagrange multiplier associated with the ith constraint ζ_i − θ = 0, i = 1, . . . , N. The augmented Lagrangian then takes the following form:

L_ρ(θ, x, δ) = Σ_{i=1}^N ( ‖Y_i − Φ_i θ_i‖²_{σ_i} + λ ‖ζ_i − θ_i‖_p + δ_iᵀ(ζ_i − θ) + (ρ/2)‖ζ_i − θ‖₂² ).   (7)
The ADMM iterations consist of a minimization of L_ρ over x (step (8a), carried out locally by each sensor), followed by the updates

θ^{k+1} = (1/N) Σ_{i=1}^N ( ζ_i^{k+1} + (1/ρ) δ_i^k ),   (8b)
δ_i^{k+1} = δ_i^k + ρ ( ζ_i^{k+1} − θ^{k+1} ),   for i = 1, . . . , N.   (8c)

Inserting this into the above equation yields (10). So, without loss of generality, further assume

Σ_{i=1}^N δ_i^1 = 0.   (12)
(6) If not converged, set k = k + 1 and return to step 2.

To show that ADMM gives:
• θ^k − ζ_i^k → 0 as k → ∞, i = 1, . . . , N,
• Σ_{i=1}^N ‖Y_i − Φ_i θ_i^k‖²_{σ_i} + λ‖ζ_i^k − θ_i^k‖_p → p*, where p* is the optimal objective of (4),
it is sufficient to show that the Lagrangian L₀(θ, x, δ) (the augmented Lagrangian evaluated at ρ = 0) has a saddle point, according to [Boyd et al., 2011, Sect. 3.2.1] (since the objective consists of closed, proper and convex functions). Let θ*, x* denote the solution of (4). It is easy to show that θ*, x* and δ = 0 is a saddle point. Since L₀(θ, x, 0) is convex,

L₀(θ*, x*, 0) ≤ L₀(θ, x, 0)   ∀ θ, x,   (14)

and since L₀(θ*, x*, 0) = L₀(θ*, x*, δ) for all δ, the point (θ*, x*, δ = 0) must be a saddle point. ADMM hence converges to the solution of (4) in the sense listed above.
5. PROPOSED METHOD: RECURSIVE SOLUTION
To apply the above batch method to a scenario where we
continuously get new measurements, we propose to re-run
the batch algorithm every T th sample time:
(1) Initialize by running the batch algorithm proposed in
the previous section on the data available.
(2) Every T th time-step, re-run the batch algorithm using the last sT, s ∈ N, data. Initialize the ADMM iterations using the estimates of θ and δ from the previous run. Since faults occur rarely over time, the optimal solutions for different data batches are often similar. As a result, using the estimates of θ and δ from the previous run to initialize the ADMM algorithm can speed up the convergence of the ADMM algorithm considerably.
To add T new observation pairs, one could possibly use
an extended version of the homotopy algorithm proposed
by Garrigues and El Ghaoui [2008]. The homotopy algorithm presented in Garrigues and El Ghaoui [2008] was
developed for including new observations in Lasso.
The local subproblem solved by each sensor in step 2 can be written, using an epigraph variable s for the p-norm term, as

minimize over θ_i, ζ_i, s   ‖Y_i − Φ_i θ_i‖²_{σ_i} + λ s + δ_i^kᵀ ζ_i + (ρ/2)‖ζ_i − θ^k‖₂²
subject to   ‖θ_i − ζ_i‖ ≤ s,   (16)

where the following data matrices describe this optimization problem:

H_i = Φ_iᵀ Φ_i / σ_i²,   h_i = Φ_iᵀ Y_i / σ_i²,   h̃_i = h_i − δ_i^k/2 + (ρ/2)θ^k.   (17)

As can be seen from (17), among these matrices only H_i and h_i are affected by the new measurements. Let y_i^new and φ_i^new denote the new measurements. Then H_i and h_i can be updated as follows:

H_i ← H_i + φ_i^new φ_i^newᵀ / σ_i²,   h_i ← h_i + φ_i^new y_i^new / σ_i².   (18)
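In MATLAB, the updates in (18) are two rank-one corrections; a minimal sketch, with phi_new, y_new and sigma_i assumed given:

% Rank-one data updates of (18); phi_new is the new regressor (column
% vector), y_new the new output sample, sigma_i the noise scale.
H_i = H_i + (phi_new * phi_new') / sigma_i^2;
h_i = h_i + (phi_new * y_new) / sigma_i^2;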
To handle single data samples upon arrival, step 2 of the ADMM algorithm should therefore be altered to:
(2) If there exist any new measurements, update H_i and h_i according to (18). Find ζ_i^{k+1}, θ_i^{k+1} by solving (16).
Remark 5.1. In order for this algorithm to be responsive
to the arrival of the new measurements, it is required to
have network-wide persistent communication. As a result
this approach demands much higher communication traffic
than the batch solution.
6. IMPLEMENTATION ISSUES

Step 2 of ADMM requires solving the optimization problem

minimize over x   L_ρ(θ^k, x, δ^k).   (19)

The subproblem (16) is a second-order cone program. Its optimality (KKT) conditions (21) consist of stationarity conditions, the cone constraints

‖z_{i2}‖ ≤ z_{i1},   (21d)
‖θ_i − ζ_i‖ ≤ s,   (21e)

and the complementary slackness condition

z_{i1} s + z_{i2}ᵀ(θ_i − ζ_i) = 0,   (21f)

where z_i = [z_{i1}; z_{i2}] collects the dual variables [Boyd and Vandenberghe, 2004]. Let θ_i*, ζ_i*, s* and z_i* be the primal and dual optima for the problem in (16) at the kth iteration, and let h̃_i^{k+1} = h̃_i^k + Δh̃_i. These can be used to generate a good warm-start point for the solver that solves (16). As a result, by (21) the following vectors can be used to warm start the solver:

θ_i^w = θ_i*,   ζ_i^w = ζ_i* + Δζ,   z_i^w = z_i*,   s^w = s* + Δs,   (22)

where Δs should be chosen such that

‖θ_i^w − ζ_i^w‖ ≤ s* + Δs   and   z_{i1}^w (s* + Δs) + z_{i2}^wᵀ(θ_i^w − ζ_i^w) = μ,   (23)

for some μ ≥ 0.

Introduce the change of variables η_i := ζ_i − θ_i   (24)

and let

g_i^k(θ_i, η_i) := ‖Y_i − Φ_i θ_i‖²_{σ_i} + λ‖η_i‖₁ + (δ_i^k)ᵀ(η_i + θ_i − θ^k) + (ρ/2)‖η_i + θ_i − θ^k‖₂².   (25)

It is then straightforward to show that the optimization problem (9) is equivalent to

(θ_i^{k+1}, η_i^{k+1}) = argmin over θ_i, η_i   g_i^k(θ_i, η_i).   (26)

Moreover,

ζ_i^{k+1} = θ_i^{k+1} + η_i^{k+1}.   (27)
Differentiating (25),

∇_{θ_i} g_i^k(θ_i, η_i) = 2Φ_iᵀΦ_i θ_i / σ_i² − 2Φ_iᵀY_i / σ_i² + δ_i^k + ρ(η_i + θ_i − θ^k),   (28a)
∂_{η_i} g_i^k(θ_i, η_i) = λ ∂‖η_i‖₁ + δ_i^k + ρ(η_i + θ_i − θ^k).   (28b)

The optimality conditions for (26) read

0 = ∇_{θ_i} g_i^k(θ_i^{k+1}, η_i^{k+1}),   (29a)
0 ∈ ∂_{η_i} g_i^k(θ_i^{k+1}, η_i^{k+1}).   (29b)

From (29a), θ_i^{k+1} = Q_i⁻¹ h̃_i, where

h̃_i = h_i − δ_i^k/2 + (ρ/2)(θ^k − η_i),   h_i = Φ_iᵀ Y_i / σ_i²,   (30)
Q_i = Φ_iᵀ Φ_i / σ_i² + (ρ/2) I.   (31)

Substituting θ_i^{k+1} = Q_i⁻¹ h̃_i into (29b) gives an optimality condition in η_i alone (32). Parameterizing the inclusion of the new measurements by t ∈ [0, 1] yields a family of such conditions whose solutions η_i^k(t) trace a homotopy path from the old solution η_i^k(0) to the updated solution η_i^k(1) (33).

Assume now that η_i^{k+1} has been computed and that the elements have been arranged such that the q first elements are nonzero and the last m − q are zero. Let us write

η_i^k(0) = [ η̄_i^k ; 0 ].   (34)

Splitting the optimality condition accordingly gives one block of equations fixing the signs of the nonzero elements η̄_i^k (36a), and one block defining a vector v, with all elements in [−1, 1], on the zero part (36b); here Q_i⁻¹ is partitioned conformably as in (37), Q̄_i denotes the top q rows of Q_i⁻¹, and both η_i^k(t) and v are obtained from these conditions as in (38)-(39). Now, to find t*, we notice that both η_i^k(t) and v are linear functions of t. We can hence compute the minimal t that:

• makes one or more elements of v equal to 1 or −1, and/or
• makes one or more elements of η_i^k(t) equal to 0.

This minimal t will be t*. At t* the partition (34) changes:

• Elements corresponding to v-elements equal to 1 or −1 should be included in η̄_i^k.
• Elements of η_i^k(t) equal to 0 should be fixed to zero.

Given the solution η_i^k(t*), we can now continue in a similar way as above to compute η_i^k(t), t ∈ [t*, t**]. The procedure continues until η_i^k(1) has been computed. Due to space limitations we have chosen to not give a summary of the algorithm. We instead refer the interested reader to download the implementation available from http://www.rt.isy.liu.se/~ohlsson/code.html.

7. NUMERICAL ILLUSTRATION
[Figure: ‖θ_i − θ‖₂ plotted against sensor number, three panels.]
Here, the tolerances ε_pri > 0 and ε_dual > 0 can be set via an absolute plus relative criterion,

ε_pri = √n ε_abs + ε_rel max{‖x^k‖₂, ‖z^k‖₂},
ε_dual = √n ε_abs + ε_rel ‖ρ u^k‖₂,

where ε_abs > 0 and ε_rel > 0 are absolute and relative tolerances (see Boyd et al. [2011] for details).
In each iteration of ADMM, we perform alternating minimization of the augmented Lagrangian over x and z. At iteration k we carry out the following steps:

x^{k+1} := argmin over x { f(x) + (ρ/2)‖x − z^k + u^k‖₂² },   (3)
z^{k+1} := Π_C(x^{k+1} + u^k),   (4)
u^{k+1} := u^k + (x^{k+1} − z^{k+1}),   (5)

where Π_C denotes Euclidean projection onto C. In the first step of ADMM, we fix z and u and minimize the augmented Lagrangian over x; next, we fix x and u and minimize over z; finally, we update the dual variable u.
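A minimal MATLAB sketch of this loop, assuming oracle functions prox_f (solving the x-update (3)) and proj_C (the Euclidean projection in (4)); both names are ours, not the paper's:

% Generic ADMM loop for minimize f(x) subject to x in C.
% prox_f(v, rho) is assumed to return argmin_x f(x) + (rho/2)*||x - v||^2;
% proj_C(v) is assumed to return the Euclidean projection of v onto C.
z = z0; u = zeros(size(z0));
for k = 1:max_iter
    x = prox_f(z - u, rho);   % x-update, eq. (3)
    z = proj_C(x + u);        % z-update, eq. (4)
    u = u + (x - z);          % dual update, eq. (5)
end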
2.1 Convergence

Under mild assumptions on f and C, we can show that the iterates of ADMM converge to a solution; specifically, we have

f(x^k) → p*,   x^k − z^k → 0,

as k → ∞. The rate of convergence, and hence the number of iterations required to achieve a specified accuracy, can depend strongly on the choice of the parameter ρ. When ρ is well chosen, this method can converge to a fairly accurate solution (good enough for many applications) within a few tens of iterations. However, if the choice of ρ is poor, many iterations can be needed for convergence. These issues, including heuristics for choosing ρ, are discussed in more detail in Boyd et al. [2011].

2.2 Stopping criterion

The primal and dual residuals at iteration k are given by

e_p^k = x^k − z^k,   e_d^k = −ρ(z^k − z^{k−1}).   (6)
minimize   Σ_{i=1}^N φ_i(x_i) + Σ_{i=1}^{N−1} ψ_i(r_i)
subject to  r_i = x_{i+1} − x_i,   i = 1, . . . , N − 1,

with variables x₁, . . . , x_N, r₁, . . . , r_{N−1} ∈ Rⁿ, and where φ_i : Rⁿ → R ∪ {∞} and ψ_i : Rⁿ → R ∪ {∞} are convex functions.

This problem has the form (1), with variables x = (x₁, . . . , x_N) and r = (r₁, . . . , r_{N−1}), objective function

f(x, r) = Σ_{i=1}^N φ_i(x_i) + Σ_{i=1}^{N−1} ψ_i(r_i),

and constraint set

C = { (x, r) | r_i = x_{i+1} − x_i, i = 1, . . . , N − 1 }.   (7)

The problem can be put in ADMM form as

minimize   Σ_{i=1}^N φ_i(x_i) + Σ_{i=1}^{N−1} ψ_i(r_i) + I_C(z, s)   (8)
subject to  x_i = z_i, i = 1, . . . , N,   r_i = s_i, i = 1, . . . , N − 1,

with variables x = (x₁, . . . , x_N), r = (r₁, . . . , r_{N−1}), z = (z₁, . . . , z_N), and s = (s₁, . . . , s_{N−1}). Furthermore, we let u = (u₁, . . . , u_N) and t = (t₁, . . . , t_{N−1}) be vectors of scaled dual variables associated with the constraints x_i = z_i, i = 1, . . . , N, and r_i = s_i, i = 1, . . . , N − 1 (i.e., u_i = (1/ρ)y_i, where y_i is the dual variable associated with x_i = z_i).
The ADMM updates are carried out as follows.

Step 1. The x- and r-updates are

x_i^{k+1} := argmin over x_i { φ_i(x_i) + (ρ/2)‖x_i − z_i^k + u_i^k‖₂² },   (9)

for i = 1, . . . , N, and

r_i^{k+1} := argmin over r_i { ψ_i(r_i) + (ρ/2)‖r_i − s_i^k + t_i^k‖₂² },   (10)

for i = 1, . . . , N − 1.

Step 2.

(z^{k+1}, s^{k+1}) := Π_C((x^{k+1}, r^{k+1}) + (u^k, t^k)).

For the particular constraint set (7), we will show in Section 3.3 that the projection can be performed extremely efficiently.

Step 3.

u_i^{k+1} := u_i^k + (x_i^{k+1} − z_i^{k+1}),   i = 1, . . . , N,

and

t_i^{k+1} := t_i^k + (r_i^{k+1} − s_i^{k+1}),   i = 1, . . . , N − 1.

These updates can also be carried out independently in parallel, for each variable block.
3.3 Projection

In this section we work out an efficient formula for projection onto the constraint set C (7). To perform the projection

(z, s) = Π_C((w, v)),

we solve the optimization problem

minimize   ‖z − w‖₂² + ‖s − v‖₂²
subject to  s = Dz,

where D is the forward difference operator, (Dz)_i = z_{i+1} − z_i. The optimality conditions reduce to a linear system with the tridiagonal matrix DDᵀ + I, which can be solved in O(N) operations through its Cholesky factorization; the factor entries l_{i,i} and l_{i+1,i} satisfy simple scalar recursions. For the quadratic loss φ_i(x_i) = (1/2)‖y_i − x_i‖₂², the x-update (9) is also analytic: x_i^{k+1} = (y_i + ρ(z_i^k − u_i^k))/(1 + ρ).
The X_i-update is

X_i^{k+1} := argmin over X_i   φ_i(X_i) + (ρ/2)‖X_i − Z_i^k + U_i^k‖_F²,

where

φ_i(X_i) = Tr(X_i y_i y_iᵀ) − log det X_i.

This update can be solved analytically, as follows.

(1) Compute the eigendecomposition ρ(Z_i^k − U_i^k) − y_i y_iᵀ = QΛQᵀ, where Λ = diag(λ₁, . . . , λₙ).
(2) Now let

x_j := ( λ_j + √(λ_j² + 4ρ) ) / (2ρ),   j = 1, . . . , n.

(3) Finally, we set X_i^{k+1} = Q diag(x₁, . . . , xₙ) Qᵀ.

For ℓ1 mean filtering, step (10) is

r_i^{k+1} := argmin over r_i { λ‖r_i‖₂ + (ρ/2)‖r_i − s_i^k + t_i^k‖₂² },

which simplifies to

r_i^{k+1} = S_{λ/ρ}(s_i^k − t_i^k),   (12)

and, in the componentwise variant,

(r_i^{k+1})_j = S_{λ/ρ}((s_i^k − t_i^k)_j),   j = 1, . . . , n.
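The three-step analytic update above translates directly into MATLAB; a minimal sketch, with V = Z_i^k − U_i^k, the data vector y and the parameter rho assumed given:

% Analytic X-update: argmin_X Tr(X*y*y') - log det X + (rho/2)*||X - V||_F^2.
[Q, Lam] = eig(rho * V - y * y');                 % step (1): eigendecomposition
lam = diag(Lam);
xj = (lam + sqrt(lam.^2 + 4 * rho)) / (2 * rho);  % step (2): positive roots
X = Q * diag(xj) * Q';                            % step (3): reassemble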
The ℓ1 variance filtering problem is

minimize   Σ_{i=1}^N ( Tr(X_i y_i y_iᵀ) − log det X_i ) + λ Σ_{i=1}^{N−1} ‖R_i‖_F
subject to  R_i = X_{i+1} − X_i,   i = 1, . . . , N − 1,

where ‖R_i‖_F is the Frobenius norm of R_i. Step (10) is

R_i^{k+1} := argmin over R_i { λ‖R_i‖_F + (ρ/2)‖R_i − S_i^k + T_i^k‖_F² },

which simplifies to

R_i^{k+1} = S_{λ/ρ}(S_i^k − T_i^k),

where S_κ is a matrix soft threshold operator, defined as

S_κ(A) = (1 − κ/‖A‖_F)₊ A,   S_κ(0) = 0.
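For reference, the matrix soft threshold has a one-line MATLAB implementation (our sketch):

% Matrix soft threshold S_kappa(A) = (1 - kappa/||A||_F)_+ * A, S_kappa(0) = 0.
function B = soft_threshold_fro(A, kappa)
    nA = norm(A, 'fro');
    B = max(1 - kappa / max(nA, eps), 0) * A;   % returns 0 when ||A||_F <= kappa
end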
Let X₁*, . . . , X_N*, R₁*, . . . , R_{N−1}* denote an optimal point; our estimates of Σ₁, . . . , Σ_N are (X₁*)⁻¹, . . . , (X_N*)⁻¹.

Variations. As with ℓ1 mean filtering, we can replace the Frobenius norm penalty with a componentwise vector ℓ1-norm penalty on R_i to get the problem

minimize   Σ_{i=1}^N ( Tr(X_i y_i y_iᵀ) − log det X_i ) + λ Σ_{i=1}^{N−1} ‖R_i‖₁
subject to  R_i = X_{i+1} − X_i,   i = 1, . . . , N − 1.

Again, the ADMM updates are the same; the only difference is that in step (10) we replace matrix soft thresholding with a componentwise soft threshold, i.e.,

(R_i^{k+1})_{l,m} = S_{λ/ρ}((S_i^k − T_i^k)_{l,m}),

for l = 1, . . . , n, m = 1, . . . , n.
For joint mean and covariance filtering, the constraints are

r_i = m_{i+1} − m_i,   i = 1, . . . , N − 1,
R_i = X_{i+1} − X_i,   i = 1, . . . , N − 1,

with variables r₁, . . . , r_{N−1} ∈ Rⁿ, m₁, . . . , m_N ∈ Rⁿ, R₁, . . . , R_{N−1} ∈ Sⁿ, and X₁, . . . , X_N ∈ Sⁿ₊.

ADMM steps. This problem is also in the form (6); however, as far as we are aware, there is no analytical formula for steps (9) and (10). To carry out these updates, we must solve semidefinite programs (SDPs), for which there are a number of efficient and reliable software packages (Toh et al. [1999], Sturm [1999]).
5. NUMERICAL EXAMPLE

In this section we solve an instance of ℓ1 mean filtering with n = 1, σ = 1, and N = 400, using the standard Fused Lasso method. To improve convergence of the ADMM algorithm, we use over-relaxation with α = 1.8, see Boyd et al. [2011]. The parameter λ is chosen as approximately 10% of λ_max, where λ_max is the largest value that results in a non-constant mean estimate. Here, λ_max ≈ 108 and so λ = 10. We use an absolute plus relative error stopping criterion, with ε_abs = 10⁻⁴ and ε_rel = 10⁻³. Figure 1 shows convergence of the primal and dual residuals. The resulting estimates of the means are shown in Figure 2.
[Fig. 1: primal and dual residuals versus iteration (log scale). Fig. 2: estimated means versus measurement index.]
The only tuning parameter for our method is the parameter ρ. Finding an optimal ρ is not a straightforward problem, but Boyd et al. [2011] contains many heuristics that work well in practice. For the ℓ1 mean filtering example, we find that a simple fixed choice of ρ works well, but we do not have a formal justification.
REFERENCES
O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516, 2008.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
P. L. Combettes and J. C. Pesquet. A Douglas-Rachford splitting approach to nonsmooth convex variational signal recovery. IEEE Journal of Selected Topics in Signal Processing, 1(4):564–574, December 2007. ISSN 1932-4553. doi: 10.1109/JSTSP.2007.910264.
A. P. Dempster. Covariance selection. Biometrics, 28(1):157–175, 1972.
J. Eckstein and D. P. Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55:293–318, 1992.
J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 1.21. http://cvxr.com/cvx, April 2011.
S. J. Kim, K. Koh, S. Boyd, and D. Gorinevsky. l1 trend filtering. SIAM Review, 51(2):339–360, 2009.
H. Ohlsson, L. Ljung, and S. Boyd. Segmentation of ARX-models using sum-of-norms regularization. Automatica, 46:1107–1111, April 2010.
L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Phys. D, 60:259–268, November 1992. ISSN 0167-2789. doi: http://dx.doi.org/10.1016/0167-2789(92)90242-F.
S. Setzer. Operator splittings, Bregman methods and frame shrinkage in image processing. International Journal of Computer Vision, 92(3):265–280, 2011.
J. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11:625–653, 1999. Software available at http://sedumi.ie.lehigh.edu/.
R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67 (Part 1):91–108, 2005.
K. Toh, M. Todd, and R. Tütüncü. SDPT3: a MATLAB software package for semidefinite programming, version 1.3. Optimization Methods and Software, 11(1):545–581, 1999.
B. Wahlberg, C. R. Rojas, and M. Annergren. On l1 mean and variance filtering. Proceedings of the Forty-Fifth Asilomar Conference on Signals, Systems and Computers, 2011. arXiv/1111.5948.
Abstract: Given a linear system in a real or complex domain, linear regression aims to recover the model parameters from a set of observations. Recent studies in compressive sensing have successfully shown that under certain conditions, a linear program, namely, ℓ1-minimization, guarantees recovery of sparse parameter signals even when the system is underdetermined. In this paper, we consider a more challenging problem: when the phase of the output measurements from a linear system is omitted. Using a lifting technique, we show that even though the phase information is missing, the sparse signal can be recovered exactly by solving a semidefinite program when the sampling rate is sufficiently high. This is an interesting finding since the exact solutions to both sparse signal recovery and phase retrieval are combinatorial. The results extend compressive sensing to the class of applications where only output magnitudes can be observed. We demonstrate the accuracy of the algorithms through extensive simulation and a practical experiment.
Keywords: Phase Retrieval; Compressive Sensing; Semidefinite Programming.
1. INTRODUCTION

Linear models, e.g., y = Ax, are by far the most used and useful type of model. The main reasons for this are their simplicity of use and identification. For the identification, the least-squares (LS) estimate in a complex domain is computed by

x_ls = argmin over x ∈ Cⁿ   ‖y − Ax‖₂².   (1)

To recover the signs {σ_i}_{i=1}^N ⊂ {−1, 1} of the measurements, one would have to try out combinations of sign sequences until one that satisfies

σ_i √b_i = a_iᵀ x,   i = 1, . . . , N,   (6)

for some x ∈ Rⁿ has been found. For any practical size of data sets, this combinatorial problem is intractable.
albeit the source signal was not assumed sparse. Using the lifting technique to construct the SDP relaxation of the NP-hard phase retrieval problem, with high probability, the program (11) recovers the exact solution (sparse or dense) if the number of measurements N is at least of the order O(n log n). The region of success is visualized in Figure 1 as region I with a thick solid line.
minimize over x   ‖x‖₀,   subject to   b = |Ax|² = {a_iᴴ x xᴴ a_i}_{1≤i≤N}.   (7)

As the counting norm ‖·‖₀ is not a convex function, following the ℓ1-norm relaxation in CS, (7) can be relaxed as

minimize over x   ‖x‖₁,   subject to   b = |Ax|² = {a_iᴴ x xᴴ a_i}_{1≤i≤N}.   (8)

Lifting the problem to the matrix variable X = xxᴴ gives

minimize over X   ‖X‖₁,   subject to   b_i = Tr(a_iᴴ X a_i), i = 1, . . . , N,   rank(X) = 1,   X ⪰ 0,   (9)

and dropping the rank constraint yields the convex program

minimize over X   Tr(X) + μ‖X‖₁,   subject to   b_i = Tr(Φ_i X), i = 1, . . . , N,   X ⪰ 0,   (10)

where we further denote Φ_i = a_i a_iᴴ ∈ Cⁿˣⁿ and where μ > 0 is a design parameter. Finally, the estimate of x can be found by computing the rank-1 decomposition of X via singular value decomposition. We will refer to the formulation (10) as compressive phase retrieval via lifting (CPRL).

We compare (10) to a recent solution of PhaseLift by Candès et al. [2011b]. In Candès et al. [2011b], a similar objective function was employed for phase retrieval:

minimize over X   Tr(X),   subject to   b_i = Tr(Φ_i X), i = 1, . . . , N,   X ⪰ 0.   (11)

Finally, the motivation for introducing the ℓ1-norm regularization in (10) is to be able to solve the sparse phase retrieval problem for N smaller than what PhaseLift requires. However, one will not be able to solve the compressive phase retrieval problem in region III below the dashed curve. Therefore, our target problems lie in region II.
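For illustration, the CPRL program (10) can be prototyped directly in CVX; a minimal sketch, with the measurement vectors a_i stacked as columns of a matrix Amat, the magnitude data b, and the weight mu all assumed given (the names are ours):

% CPRL sketch (10): trace plus elementwise l1 penalty over a Hermitian
% PSD matrix, with linear measurement constraints b_i = Tr(Phi_i X).
nA = size(Amat, 1); N = size(Amat, 2);
cvx_begin
    variable X(nA, nA) hermitian semidefinite
    minimize( trace(X) + mu * sum(sum(abs(X))) )
    subject to
        for i = 1:N
            b(i) == real(Amat(:, i)' * X * Amat(:, i));  % Tr(a_i a_i' X)
        end
cvx_end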
[Figure residue: Fig. 1 (regions I, II, III of success for PhaseLift (PL) and CPRL), a histogram comparing CS, CPRL, and PhaseLift, and plots of |x_i| and z_i, z_est,i.]

In the experiment the measurements are formed as y = Rz = R F_inv x, and we are only given the magnitudes of our measurements, such that b = |y|² = |Rz|².
S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.
J. Dainty and J. Fienup. Phase retrieval and image reconstruction for astronomy. In Image Recovery: Theory and Applications. Academic Press, New York, 1987.
A. d'Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet. A direct formulation for Sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007.
D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, April 2006.
J. Fienup. Phase retrieval algorithms: a comparison. Applied Optics, 21(15):2758–2769, 1982.
J. Fienup. Reconstruction of a complex-valued object from the modulus of its Fourier transform using a support constraint. Journal of Optical Society of America A, 4(1):118–123, 1987.
R. Gerchberg and W. Saxton. A practical algorithm for the determination of phase from image and diffraction plane pictures. Optik, 35:237–246, 1972.
R. Gonsalves. Phase retrieval from modulus data. Journal of Optical Society of America, 66(9):961–964, 1976.
M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 1.21. http://cvxr.com/cvx, August 2010.
A. Hoerl and R. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
D. Kohler and L. Mandel. Source reconstruction from the modulus of the correlation function: a practical approach to the phase problem of optical coherence theory. Journal of the Optical Society of America, 63(2):126–134, 1973.
I. Loris. On the performance of algorithms for the minimization of ℓ1-penalized functionals. Inverse Problems, 25:1–16, 2009.
S. Marchesini. Phase retrieval and saddle-point optimization. Journal of the Optical Society of America A, 24(10):3289–3296, 2007.
R. Millane. Phase retrieval in crystallography and optics. Journal of the Optical Society of America A, 7:394–411, 1990.
M. Moravec, J. Romberg, and R. Baraniuk. Compressive phase retrieval. In SPIE International Symposium on Optical Science and Technology, 2007.
H. Ohlsson, A. Y. Yang, R. Dong, and S. Sastry. Compressive Phase Retrieval From Squared Output Measurements Via Semidefinite Programming. Technical Report arXiv:1111.6323, University of California, Berkeley, November 2011.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B (Methodological), 58(1):267–288, 1996.
J. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, October 2004.
A. Walther. The question of phase retrieval in optics. Optica Acta, 10:41–49, 1963.
A. Yang, A. Ganesh, Y. Ma, and S. Sastry. Fast ℓ1-minimization algorithms and an application in robust face recognition: A review. In ICIP, 2010.
Abstract: Cointegrated Vector AutoRegressive (VAR) processes arise in the study of long-run equilibrium relations of stochastic dynamical systems. In this paper we introduce a novel convex approach for the analysis of these types of processes. The idea relies on an error correction representation and amounts to solving a penalized empirical risk minimization problem. The latter finds a model from data by minimizing a trade-off between a quadratic error function and a nuclear norm penalty used as a proxy for the cointegrating rank. We elaborate on properties of the proposed convex program; we then propose an easily implementable and provably convergent algorithm based on FISTA. This algorithm can be conveniently used for computing the regularization path, i.e., the entire set of solutions associated with the trade-off parameter. We show how such a path can be used to build an estimator for the cointegrating rank and illustrate the proposed ideas with experiments.
1. INTRODUCTION

Unit-root nonstationary multivariate processes play an important role in the study of dynamical stochastic systems Box and Tiao [1977], Engle and Granger [1987], Stock and Watson [1988], Johansen [1988]. Contrary to their stationary counterparts these processes are allowed to have trends or shifts in the mean or in the covariances. This feature makes them suitable to describe many phenomena of interest such as economic cycles and population dynamics. In this paper we focus on VAR processes. It is well known that these processes can generate stochastic and deterministic trends if the associated polynomial matrix has zeros on the unit circle. If some of the variables within the same VAR process move together in the long run, in a sense that we clarify later, they are called cointegrated. This situation is of considerable practical interest. Equilibrium relationships arise between economic variables such as, for instance, household income and expenditures. Cointegration has also been advocated to describe long-term parallel growth of mutually dependent indicators such as regional population and employment growth or city populations and total urban populations Payne and Ewing [1997], Sharma [2003], Møller and Sharp [2008].

* The authors are grateful to the anonymous reviewers for the helpful comments. Research supported by the Research Council KUL: GOA/11/05 Ambiorics, GOA/10/09 MaNet, CoE EF/05/006 Optimization in Engineering (OPTEC) and PFV/10/002 (OPTEC). Flemish Government: FWO: PhD/postdoc grants, projects: G0226.06 (cooperative systems and optimization), G0321.06 (Tensors), G.0302.07 (SVM/Kernel). Research communities (WOG: ICCoS, ANMMM, MLDM). Belgian Federal Science Policy Office: IUAP P6/04 (DYSCO, Dynamical systems, control and optimization, 2007-2011); IBBT EU: ERNSI; FP7-HD-MPC (INFSO-ICT-223854), COST intelliCIS, FP7-EMBOCON (ICT-248940), FP7-SADCO (MC ITN-264735), ERC HIGHWIND (259166). The scientific responsibility is assumed by its authors.
The analysis of cointegrated VAR models presents challenges that are not present in the stationary case. In particular, one of the main goals in the analysis of cointegrated systems is the estimation of the cointegrating rank. In this work we propose a novel approach that relies on an error correction representation of cointegrated VAR processes. The approach consists of solving a convex program and uses a nuclear norm as a proxy for the cointegrating rank. We show how the regularization path arising from different values of a trade-off parameter can be used to estimate the cointegrating rank. In order to compute solutions we propose to use a simple yet efficient algorithm based on an existing procedure called FISTA (fast iterative shrinkage-thresholding algorithm).

In Section 2 we recall the concept of cointegrated VAR models, error correction representations and cointegrating rank. In Section 3 we present our main problem formulation and discuss its properties. Section 4 deals with an algorithm to compute solutions. In Section 5 we introduce an estimator for the cointegrating rank based on the regularization path. In Section 6 we report experiments. We conclude with final remarks in Section 7.
2. COINTEGRATED VAR(P) MODELS

In the following we denote vectors as lower case letters (a, b, c, . . .) and matrices as capital letters (A, B, C, . . .). In particular, we use I to indicate the identity matrix, the dimension of which will be clear from the context. For a positive integer P we write N_P to denote the set {1, . . . , P}. Finally we use ⟨A, B⟩ to denote the inner product between A, B ∈ R^{D1×D2}:

⟨A, B⟩ = trace(AᵀB) = Σ_{d1∈N_{D1}} Σ_{d2∈N_{D2}} a_{d1 d2} b_{d1 d2}.   (1)

A VAR(P) process x_t is stationary if all the solutions of det(I − Φ₁z − Φ₂z² − · · · − Φ_P z^P) = 0 are outside the unit circle, see e.g. Tsay [2005]. If det(P(1)) = 0, then x_t is unit-root nonstationary. Following Johansen [1992] we call x_t integrated of order R (or simply I(R)) if the R-times differenced process Δ^R x_t is stationary whereas, for any r ∈ N_{R−1}, the process Δ^r x_t is not. A stationary process is referred to as I(0). Finally we say that an I(R) process x_t is cointegrated if there exists at least one cointegrating vector β ∈ R^D such that the scalar process βᵀx_t is I(R̃) with R̃ < R. Cointegrated processes were originally introduced in Granger [1981] and Engle and Granger [1987]. Since then they have become popular mostly in theoretical and applied econometrics. Cointegration has been advocated to explain long-run or equilibrium relationships; for more discussions on cointegration and cointegration tests, see Box and Tiao [1977], Engle and Granger [1987], Stock and Watson [1988], Johansen [1988]. In the following we focus on the situation where x_t is I(1). This case is the most commonly studied in the literature.
Π = ABᵀ,   (7)

where A, B ∈ R^{D×D̃} are full (column) rank. The D̃ linearly independent columns of B are cointegrating vectors; they form a basis for the cointegrating subspace. The vector process

f_t = Bᵀ x_t   (8)

is stationary and represents deviations from equilibrium. These equations clarify that (3) expresses the change in x_t in terms of the deviations from the equilibrium at time t − 1 (the term Π x_{t−1} = A f_{t−1}, called error correction term) and previous changes (the terms Φ̃_p Δx_{t−p}, 1 ≤ p ≤ P − 1).
The error correction model (ECM) representation is

Δx_t = Π x_{t−1} + Φ̃₁ Δx_{t−1} + Φ̃₂ Δx_{t−2} + · · · + Φ̃_{P−1} Δx_{t−P+1} + e_t,   (3)

where Φ̃_p = −Σ_{j=p+1}^P Φ_j and Π = −P(1). The VAR(P) model can be recovered from the ECM via:

Φ₁ = I + Π + Φ̃₁,   (4a)
Φ_p = Φ̃_p − Φ̃_{p−1},   p = 2, . . . , P − 1,   (4b)
Φ_P = −Φ̃_{P−1}.   (4c)

[Figure residue: two time-series panels, (a) and (b).]
ΔX = [ ΔX_{−1}ᵀ, ΔX_{−2}ᵀ, · · · , ΔX_{−P+1}ᵀ ]ᵀ,   (9c)
Φ̃ = [ Φ̃₁, Φ̃₂, · · · , Φ̃_{P−1} ],   (9d)
C = [ X_{−1}ᵀ, ΔXᵀ ]ᵀ.   (9e)

Note that, based upon (9), the ECM (3) can be restated in matrix notation as ΔX₀ = Π X_{−1} + Φ̃ ΔX + E. Our approach now consists of finding estimates (Π̂(λ), Φ̂(λ)) by solving:

minimize over Π, Φ̃   (1/(2c)) ‖ΔX₀ − Π X_{−1} − Φ̃ ΔX‖² + λ ‖Π‖∗.   (14)
Let σ₁ ≥ σ₂ ≥ · · · ≥ σ_R > 0 denote the nonzero singular values of A. The nuclear norm (a.k.a. trace norm or Schatten 1-norm) of A is defined Horn and Johnson [1994] as

‖A‖∗ = Σ_{r∈N_R} σ_r.   (12)

The nuclear norm has been used to devise convex relaxations to rank-constrained matrix problems Recht et al. [2007], Candès and Recht [2009], Candes et al. [2011]. This parallels the approach followed by estimators like the LASSO (Least Absolute Shrinkage and Selection Operator, Tibshirani [1996]) that estimate a high-dimensional loading vector x ∈ R^D based on the l1-norm

‖x‖₁ = Σ_{d∈N_D} |x_d|.   (13)

The nonconvexity of the latter implies that practical algorithms can only be guaranteed to deliver local solutions of (15) that are rank deficient for sufficiently large λ. In contrast, the problem (14) is convex since the objective is a sum of convex functions Boyd and Vandenberghe [2004]. This implies that any local solution found by a provably convergent algorithm is globally optimal.

Once a solution of (14) is available, one can recover the parameters of the VAR(P) model (2) based upon (4). Note that here we focus on the case where the error function is expressed in terms of the Hilbert-Frobenius norm; however, alternative norms might also be meaningful. For later reference, note that, when P = 1 and c = 1, (14) boils down to:

minimize over Π   (1/2) ‖ΔX₀ − Π X_{−1}‖² + λ ‖Π‖∗.   (16)
3.2 l2 Smoothing

The nature of the problem makes it difficult to find a solution of (14) for λ → 0. Indeed, in practice X_{−1} and ΔX are often close to singular, so that the problem is ill-conditioned. In order to improve numerical stability a possible approach is to add a ridge penalty to the objective of (14). That is, for a small user-defined parameter ε > 0, to add:

(ε/2) ( ‖Π‖² + Σ_{p∈N_{P−1}} ‖Φ̃_p‖² ).   (17)

This is reminiscent of the elastic net, which finds a high-dimensional loading vector x based on empirical data. The approach consists of replacing the LASSO penalty based on the l1-norm (13) with the composite penalty λ₁‖x‖² + λ₂‖x‖₁. This strategy aims at improving the LASSO in the presence of dependent variables (which, in fact, leads to ill-conditioning).

In the present context it is easy to see that the solution of the l2-smoothed formulation can be found by solving (14) where the data matrices (ΔX₀, X_{−1}, ΔX) have been replaced by lifted matrices (ΔX₀^ε, X_{−1}^ε, ΔX^ε). Consider for simplicity problem (16). The lifted matrices of interest become:

ΔX₀^ε = [ ΔX₀, O ]   and   X_{−1}^ε = [ X_{−1}, √ε I ],   (18)

where O, I ∈ R^{D×D} and O is a matrix of zeros. For this reason in the following we will uniquely discuss an algorithm for the non-smoothed case; it is understood that a solution for the smoothed formulation can be found simply by replacing the data matrices.
3.3 Dual Representation and Duality Gap

In this section, for simplicity of exposition, we restrict ourselves to the primal problem (16) and derive its dual representation. The Fenchel conjugate Rockafellar [1974] of λ‖·‖∗ is:

f*(Λ) = max over X   ⟨Λ, X⟩ − λ‖X‖∗,   (19)

which evaluates to

f*(Λ) = 0 if ‖Λ‖₂ ≤ λ, and ∞ otherwise.   (21)

With reference to (16), it can be shown that strong duality holds; the dual problem can be obtained as follows:

min over Π { (1/2)‖ΔX₀ − Π X_{−1}‖² + λ‖Π‖∗ }
= min over Π max over Λ { ⟨Λ, ΔX₀ − Π X_{−1}⟩ − (1/2)⟨Λ, Λ⟩ + λ‖Π‖∗ }
= max over Λ min over Π { ⟨Λ, ΔX₀⟩ − ⟨Λ, Π X_{−1}⟩ − (1/2)⟨Λ, Λ⟩ + λ‖Π‖∗ }   (by Lemma 2)
= max over Λ { ⟨Λ, ΔX₀⟩ − (1/2)⟨Λ, Λ⟩ : ‖Λ X_{−1}ᵀ‖₂ ≤ λ }.   (22)

Additionally, as is clear from the second line of (22), the primal and dual solutions are related as follows:

Λ̂(λ) = ΔX₀ − Π̂(λ) X_{−1}.   (23)

This fact can be readily used to derive an optimality certificate based on the duality gap, i.e., the difference between the values of the objective functions of the dual and primal problems.

Remark 3. Note that the solution Λ̂(λ) of the dual problem in the last line of (22) corresponds to the projection of ΔX₀ onto the convex set S = { Λ : ‖Λ X_{−1}ᵀ‖₂ ≤ λ }.

4. ALGORITHM

In order to find a solution (Π̂(λ), Φ̂(λ)) corresponding to a fixed value of λ one could restate (14) as a semidefinite programming problem (SDP) and rely on general purpose solvers such as SeDuMi Sturm [1999]. Alternatively, it is possible to use a modelling language like CVX Grant and Boyd [2010] or YALMIP Löfberg [2004]. However these approaches are practically feasible only for relatively small problems. In contrast, the iterative scheme that we present next can be easily implemented and scales well with the problem size. Additionally, the approach can be conveniently used through warm-starting to compute solutions corresponding to nearby values of λ. The procedure, detailed in the algorithmic environment below, can be shown to be a special instance of the fast iterative shrinkage-thresholding algorithm (FISTA) proposed in Beck and Teboulle [2009]; therefore it inherits its favorable convergence properties, see Beck and Teboulle [2009]. We call it Cointegrated VAR(P) via FISTA (CoVAR(P)-FISTA).

Algorithm: CoVAR(P)-FISTA
Input: ΔX₀; X_{−1}; ΔX_{−p}, p = 1, . . . , P − 1; λ.
Initialize: Θ¹ = Π⁰; Ξ¹_p = Φ̃⁰_p, p ∈ N_{P−1}; t₁ = 1; L = (1/c)‖CCᵀ‖₂ (see (9e)).
Iteration k ≥ 1:
  A^k = Θ^k X_{−1} + Σ_{p∈N_{P−1}} Ξ^k_p ΔX_{−p} − ΔX₀   (24a)
  Φ̃^k_p = Ξ^k_p − (1/(Lc)) A^k ΔX_{−p}ᵀ,   p ∈ N_{P−1}   (24b)
  C^k = Θ^k − (1/(Lc)) A^k X_{−1}ᵀ   (24c)
  Π^k = D_{λ/L}(C^k)   (see (25))   (24d)
  t_{k+1} = (1 + √(1 + 4t_k²))/2,   r_{k+1} = (t_k − 1)/t_{k+1}   (24e)
  Θ^{k+1} = Π^k + r_{k+1}(Π^k − Π^{k−1})   (24f)
  Ξ^{k+1}_p = Φ̃^k_p + r_{k+1}(Φ̃^k_p − Φ̃^{k−1}_p),   p ∈ N_{P−1}   (24g)
Return: Π^k, Φ̃^k.

Here, for a matrix A with singular value decomposition A = UΣVᵀ,

D_τ(A) = U Σ₊ Vᵀ,   Σ₊ = diag({max(σ_r − τ, 0)}_{1≤r≤R}).   (25)

D_τ(·) is the proximity operator Bauschke and Combettes [2011] of the nuclear norm function. Equation (24e) defines the updating constant r_k based upon the estimate sequence t_k Nesterov [1983, 2003]. Finally, equations (24f) and (24g) update the second set of variables based upon the variables in the first set.
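The proximity operator (25) is singular value soft thresholding, which takes a few lines of MATLAB; a minimal sketch (our illustration):

% Singular value thresholding D_tau(A) = U * max(S - tau, 0) * V',
% the proximity operator of tau*||.||_* used in step (24d).
function D = svt(A, tau)
    [U, S, V] = svd(A, 'econ');
    D = U * diag(max(diag(S) - tau, 0)) * V';
end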
The approach requires setting an appropriate termination criterion. A sensible idea, which we follow in experiments, is to stop when the duality gap corresponding to the current estimate is smaller than a predefined threshold.
5. CONTINUOUS SHRINKAGE AND COINTEGRATING RANK

5.1 Regularization Path

By continuously varying the regularization parameter λ in (14) one obtains an entire set of solutions, denoted as {(Π̂(λ), Φ̂(λ))}_λ, and called the regularization path. In general, continuous shrinkage is known to feature certain favorable properties. In particular, the paths of estimators like the LASSO are known to be more stable than those of inherently discrete procedures such as Subset Selection Breiman [1996].
In the present context the path begins at λ = 0 and terminates at the least value λ_max leading to Π̂(λ_max) = 0. For problem (16), in particular, such value is

λ_max = ‖ΔX₀ X_{−1}ᵀ‖₂,   (26)

as one can see in light of (23) and Remark 3.
Denote by {σ_r(λ)}_{1≤r≤R} the spectrum of Π̂(λ). For a > 1 consider the vector m ∈ R^D defined entry-wise by:

m_d = ( ∫₀^{log_a(λ_max)} σ_d²(aᵗ) aᵗ dt )^{1/2}.   (27)

An estimator for the cointegrating rank is then

R̂(m) = arg min over d ∈ N_D { d : f(d) > τ },   f(d) = Σ_{i∈N_d} m_i² / Σ_{i∈N_D} m_i²,   (28)

where 0 < τ < 1 is a predefined threshold. Note that this estimator is independent of λ. This is a desirable feature: setting an appropriate value for the regularization parameter is known to be a difficult task.
7. CONCLUSIONS

We presented a novel approach, based on a convex program, for the analysis of cointegrated VAR processes from observational data. We proposed to compute solutions via a scalable iterative scheme; used in combination with warm starting, this algorithm can be conveniently employed to compute the entire regularization path. At each step one can rely on the duality gap as an optimality certificate. The regularization path offers indication for the actual value of the cointegrating rank and can be used to define estimators for the latter. An important advantage is that the approach does not require fixing a value for the regularization parameter in (14). This is known to be a difficult task, especially when the goal of the analysis is model selection rather than low prediction errors.

Fig. 2. (a) The singular values of the true Π and Π̂_LS; note that the LS solution does not give an indication of the cointegrating rank. (b) The path spectrum; note the gap after d = 8. (c) The regularization path.
REFERENCES
H.H. Bauschke and P.L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer Verlag, 2011.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
G.E.P. Box and G.C. Tiao. A canonical analysis of multiple time series. Biometrika, 64(2):355–365, 1977.
S.P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
L. Breiman. Heuristics of instability and stabilization in model selection. Annals of Statistics, 24(6):2350–2383, 1996.
J.F. Cai, E.J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
E.J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
E.J. Candes, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58(3):Article 11, 37p., 2011.
R.F. Engle and C.W.J. Granger. Co-integration and error correction: representation, estimation, and testing. Econometrica: Journal of the Econometric Society, pages 251–276, 1987.
M. Fazel. Matrix rank minimization with applications. PhD thesis, Elec. Eng. Dept., Stanford University, 2002.
Abstract: The question in the title is answered empirically by solving instances of three classical problems: fitting a straight line to data, fitting a real exponent to data, and system identification in the errors-in-variables setting. The results show that the nuclear norm heuristic performs worse than alternative problem-dependent methods: ordinary and total least squares, Kung's method, and subspace identification. In the line fitting and exponential fitting problems, the globally optimal solution is known analytically, so that the suboptimality of the heuristic methods is quantified.

Keywords: low-rank approximation, nuclear norm, subspace methods, system identification.
1. INTRODUCTION

With a few exceptions, model reduction and system identification lead to non-convex optimization problems, for which there are no efficient global solution methods. The methods for H2 model reduction and maximum likelihood system identification can be classified as local optimization methods and convex relaxations. Local optimization methods require an initial approximation and are in general computationally more expensive than the relaxation methods; however, the local optimization methods explicitly optimize the desired criterion, which ensures that they produce at least as good a result as a relaxation method, provided the solution of the relaxation method is used as an initial approximation for the local optimization method.

A subclass of convex relaxation methods for system identification are the subspace methods, see Van Overschee and De Moor [1996]. Subspace identification emerged as a generalization of realization theory and proved to be a very effective approach. It also leads to computationally robust and efficient algorithms. Currently there are many variations of the original subspace methods (N4SID, MOESP, and CVA). Although the details of the subspace methods may differ, their common feature is that the approximation is done in two stages, the first of which is unstructured low-rank approximation of a matrix that is constructed from the given input/output trajectory.
Related to the subspace methods are Kung's method and the balanced model reduction method, which are the most effective heuristics for model reduction of linear time-invariant systems.

A recently proposed convex relaxation method is the one using the nuclear norm as a surrogate for the rank. The nuclear norm relaxation for solving rank minimization problems was proposed in Fazel et al. [2001] and was shown to be the tightest convex relaxation of the rank. It is a generalization of the ℓ1-norm heuristic from sparse vector approximation problems to rank minimization problems.

The nuclear norm heuristic leads to a semidefinite optimization problem, which can be solved by existing algorithms with provable convergence properties and readily available software packages. (We use CVX, see Grant and Boyd.) Apart from theoretical justification and easy implementation in practice, formulating the problem as a semidefinite program has the additional advantage of flexibility. For example, adding regularization and affine inequality constraints in the data modeling problem still leads to semidefinite optimization problems that can be solved by the same algorithms and software as the original problem.

A disadvantage of using the nuclear norm heuristic is the fact that the number of optimization variables in the semidefinite optimization problem depends quadratically on the number of data points in the data modeling problem. This makes methods based on the nuclear norm heuristic impractical for problems with more than a few hundred data points. Such problems are considered small-size data modeling problems.
Outline of the paper

The objective of this paper is to test the effectiveness of the nuclear norm heuristic as a tool for system identification and model reduction. Although there are recent theoretical results, see, e.g., Candès and Recht [2009], on exact solution of matrix completion problems by the nuclear norm heuristic, to the best of the authors' knowledge there are no similar results about the effectiveness of the heuristic in system identification problems. The nuclear norm heuristic is compared empirically with other heuristic methods on benchmark problems. The selected problems are simple: small complexity models and small numbers of data points. The experiments in the paper are reproducible, see Buckheit and Donoho [1995]. Moreover, the MATLAB code that generates the results is included in the paper, so that the reader can repeat the examples by copying the code chunks from the paper and pasting them in the MATLAB command prompt, or by downloading the code from

http://eprints.soton.ac.uk/336088/

The selected benchmark problems are:
(1) line fitting by geometric distance minimization (orthogonal regression),
(2) fitting a real exponential function to data, and
(3) system identification in the errors-in-variables setting.
The fitting accuracy of a model B for the data D is measured by the geometric distance

dist(D, B) = min over D̂, image(D̂) ⊆ B   ‖D − D̂‖_F.   (dist)

The model of complexity at most r that minimizes the distance to the data is given by unstructured low-rank approximation:

minimize over D̂   ‖D − D̂‖_F   subject to   rank(D̂) ≤ r.   (LRA)

Let D̂* be an optimal solution of (LRA) and let B̂* be the optimal fitting model B̂* = image(D̂*).
Let ‖D‖∗ denote the nuclear norm of D, i.e., the sum of the singular values of D. Applying the nuclear norm heuristic to (LRA), we obtain the following convex relaxation

minimize over D̂ ∈ R^{q×N}   ‖D̂‖∗
subject to   ‖D − D̂‖_F ≤ e.   (NNA)
⟨define nna 2b⟩≡
function dh = nna(d, e)
% NNA  Nuclear norm approximation of the matrix d with error bound e.
cvx_begin, cvx_quiet(true);
    variable dh(size(d, 1), size(d, 2))
    minimize norm_nuc(dh)
    subject to
        norm(d - dh, 'fro') <= e
cvx_end
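For instance (a usage sketch; the data matrix d and the bound 0.5 are illustrative, not from the paper):
dh = nna(d, 0.5);      % nuclear norm approximation with error bound e = 0.5
norm(d - dh, 'fro')    % verify the misfit against the bound e
svd(dh)                % small singular values indicate low numerical rank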
Denote by nna_e(D) the solution of (NNA) as a function of e. Note that nna_e(D) = 0 for e ≥ ‖D‖_F, so that for sufficiently large values of e, nna_e(D) is rank deficient. Moreover, the numerical rank numrank(nna_e(D)) of the nuclear norm approximation reaches r for some e < ‖D‖_F. We are interested in characterizing the set
  E := { e | numrank(nna_e(D)) ≤ r }.  (E)
We hypothesise that E is an interval
  E = [e_nna, ∞).  (H)
The smallest value of the approximation error ‖D − nna_e(D)‖_F for which rank(nna_e(D)) ≤ r (i.e., for which a valid model exists) characterizes the effectiveness of the nuclear norm heuristic. We define
  nna_r := nna_{e_nna},  where  e_nna := min { e | e ∈ E }.
A bisection algorithm for computing the limit of performance e_nna of the nuclear norm heuristic is given in Appendix A.
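Appendix A is not reproduced in this excerpt; under hypothesis (H), a minimal bisection sketch is as follows (the numerical rank test via a relative singular value tolerance tol is an assumption):
function e_nna = nna_bisect(d, r, tol)
% Bisection for e_nna: smallest e with numrank(nna(d, e)) <= r.
el = 0; eu = norm(d, 'fro');       % nna(d, eu) = 0 is rank deficient
while eu - el > tol
    e = (el + eu) / 2;
    s = svd(nna(d, e));
    if sum(s > tol * max(s)) <= r  % numerical rank at most r?
        eu = e;                    % valid model found, decrease e
    else
        el = e;                    % no valid model yet, increase e
    end
end
e_nna = eu;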
The function e ↦ dist(nna_e(D), r) presents a complexity vs. accuracy trade-off in using the nuclear norm heuristic. The optimal rank-r approximation corresponds in the (dist, e) space to the point (0, e_lra), where
  e_lra := dist_r(D) = ‖D − lra_r(D)‖_F.
The best model nna_r(D) identifiable by the nuclear norm heuristic corresponds to the point (0, e_nna). The loss of optimality incurred by the heuristic is quantified by the difference Δe = e_nna − e_lra.
The following code defines a simulation example and plots the function e ↦ dist(nna_e(D), r) over the interval [e_lra, 1.75 e_lra].
⟨Test line fitting 3b⟩≡
randn('seed', 0); q = 2; N = 10; r = 1;
d0 = [1; 1] * (1:N); d = d0 + 0.1 * randn(q, N);
⟨define dist 3a⟩, e_lra = dist(d, r)
N = 20; Ea = linspace(e_lra, 1.75 * e_lra, N);
for i = 1:N
    Er(i) = dist(nna(d, Ea(i)), r);
end
figure, plot(Ea, Er, 'o', 'markersize', 8)
y"2 y"3 . . .
y"T L+2
.. ,
HL ("
y) := y"3 . . .
.
..
y"L y"L+1
y"T
where 1 < L < T 1 has rank less than or equal to 1. Therefore,
the exponential fitting problem (EF) is equivalent to the Hankel
structured rank-1 approximation problem
minimize over y" RT #yd y"#2
&
'
(HLRA)
subject to rank HL ("
y) 1.
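The chunk defining nna_exp is not shown in this excerpt; a minimal sketch of the Hankel-structured nuclear norm relaxation of (HLRA), analogous to nna (the name and signature follow the test code below; the body is an assumption):
function yh = nna_exp(y, L, e)
% Relax (HLRA): minimize the nuclear norm of the L x (T-L+1) Hankel
% matrix of yh subject to norm(y - yh) <= e.
T = length(y);
cvx_begin, cvx_quiet(true);
    variable yh(T, 1)
    H = [];
    for i = 1:L
        H = [H; yh(i:i + T - L)'];  % i-th row of the Hankel matrix
    end
    minimize norm_nuc(H)
    subject to
        norm(y - yh) <= e
cvx_end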
Fig. 1. Distance of nna_e(D) to a linear model of complexity 1 as a function of the approximation error e.
Next, we compare the loss of optimality of the nuclear norm heuristic with that of two other heuristics: line fitting by minimization of the sum of squared vertical and, respectively, horizontal distances from the data points to the fitting line, i.e., the classical method of solving an overdetermined linear system of equations in the least squares sense.
⟨Test line fitting 3b⟩+≡
dh_ls1 = [1; d(2, :) / d(1, :)] * d(1, :);
For a fixed z, the optimal value of c in the exponential fitting problem is given by the least squares solution
  c(z) = ( Σ_{t=1}^T y_d(t) zᵗ ) / ( Σ_{t=1}^T z^{2t} ),  (c)
and the optimal z is characterized by the first-order optimality condition
  Σ_{t=1}^T ( y_d(t) − c zᵗ ) t zᵗ = 0.  (z)
The quantities of interest are the error e_hlra of the optimal Hankel structured low-rank approximation and the minimal error e_nna for which the heuristic identifies a valid model.
The following code defines a simulation example and plots the trade-off curve over the interval [e_hlra, 1.25 e_hlra].
⟨Test exponential fitting 4b⟩≡
randn('seed', 0); z0 = 0.4; c0 = 1; T = 10;
t = (1:T)'; y = c0 * (z0 .^ t) + 0.1 * randn(T, 1);
⟨define dist_exp 4a⟩, e_hlra = dist_exp(y)
N = 20; Ea = linspace(e_hlra, 1.25 * e_hlra, N);
L = round(T / 2);
for i = 1:N
    Er(i) = dist_exp(nna_exp(y, L, Ea(i)));
end
ind = find(Er < 1e-6); es_nna = min(Ea(ind))
figure, plot(Ea, Er, 'o', 'markersize', 8)
Fig. 3. Distance of ŷ = nna_e(y) to an exponential model as a function of the parameter L.
Fig. 2. Distance of ŷ = nna_e(y) to an exponential model as a function of the approximation error e = ‖y − ŷ‖.
The problem is equivalent to the following block-Hankel structured low-rank approximation problem
  minimize over ŵ  ‖w_d − ŵ‖₂
  subject to  rank( [ H_L(ŵ_1) ; H_L(ŵ_2) ] ) ≤ L + n,  (BHLRA)
for n < L < ⌊T/2⌋.
⟨define blk_hank 4e⟩≡
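The body of this chunk is not visible in this excerpt; a minimal sketch forming the stacked block-Hankel matrix in (BHLRA) (the calling convention is an assumption):
function H = blk_hank(w, L)
% Stacked block-Hankel matrix [H_L(w1); H_L(w2)] of a q x T trajectory w;
% works both on numeric data and on CVX variables.
[q, T] = size(w);
H = [];
for k = 1:q
    for i = 1:L
        H = [H; w(k, i:i + T - L)];  % i-th Hankel row of the k-th signal
    end
end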
Fig. 5. Distance of ŵ = nna_e(w) to a model of order 1 as a function of the parameter L.
"
The distance from the data wd (w) to the obtained model B
(sysh) is computed by the function misfit, see Appendix D.
⟨Test system identification 5c⟩+≡
[e_n4sid, wh_n4sid] = misfit(w, sysh); e_n4sid
Fig. 4. Distance of ŵ = nna_e(w) to a model of order 1 as a function of the approximation error e = ‖w − ŵ‖.
The best model computed by the nuclear norm heuristic has corresponding approximation error e_nna = 0.7789. We have manually selected the value L = 4 as giving the best results.
⟨Test system identification 5c⟩+≡
Lrange = (n + 1):floor(T / 2);
for L = Lrange
    Er(L) = dist_sysid(nna_sysid(w, L, es_nna), L, n);
end
figure, plot(Lrange, Er(Lrange), 'o', 'markersize', 8)
Abstract: In this contribution we present a method to estimate structured high order ARX models. By this we mean that the estimated model, despite its high order, is close to a low order model. This is achieved by adding two terms to the least-squares cost function. These two terms correspond to nuclear norms of two Hankel matrices. These Hankel matrices are constructed from the impulse response coefficients of the inverse noise model, and the numerator polynomial of the model dynamics, respectively. In a simulation study the method is shown to be competitive with the prediction error method. In particular, in the study the performance degrades more gracefully than for the prediction error method when the signal to noise ratio decreases.
1. INTRODUCTION
The fundamental problem of estimating a model for a
linear dynamical system, possibly subject to (quasi-) stationary noise has received renewed interest in the last
few years. In particular different types of regularization
schemes have been in focus. An important contribution to
these developments is Pillonetto and De Nicolao [2010]
where a powerful approach rooted in Machine Learning is presented. In Chen et al. [2011] it is shown that
this approach has close ties with ℓ2-regularization with
a cleverly chosen penalty term. Another approach is to
use structured low rank approximations. This typically
leads to non-convex optimization problems for which local
nonlinear optimization methods are used, see Markovsky
[2008] for a survey. Recently, the nuclear norm has been
used in this approach to obtain convex optimization problems. A number of contributions based on this idea have already appeared, e.g. Fazel et al. [2003], Grossmann et al. [2009a,b], Mohan and Fazel [2010], Liu and Vandenberghe [2009a,b]. In this contribution we add
to this avalanche of new and exciting methods by introducing structured estimation of high order ARX models. Our
contribution can be seen as an extension of the method
presented in Grossmann et al. [2009a,b]. The extensions
concern i) the possibility to also estimate a noise model in
the same convex framework, ii) a way to choose the regularization parameter, and iii) a quite extensive simulation
study. In the simulation study the new method compares
favourably to the prediction error method for scenarios
where the signal to noise ratio (SNR) is poor.
The outline of the paper is as follows. In Section 2
the problem under consideration is discussed. Section 3
⋆ This work was supported in part by the European Research Council under the advanced grant LEARN, contract 267381, and in part by the Swedish Research Council under contract 621-2009-4017.
Even though the system (1) is not captured by this structure, by letting n increase, Go can be well approximated
by B/A and Ho can be well approximated by 1/A [Ljung
and Wahlberg, 1992]. This structure thus has some very
attractive features and is extensively used, e.g. in industrial practice [Zhu, 2001]. A drawback with this structure
is that the variance error, i.e. the error induced by the
noise eo , increases linearly with the number of parameters
[Ljung and Wahlberg, 1992]. Thus the accuracy can be significantly worse when compared with using the Box-Jenkins structure (3) (with n = no, m = mo). In order to make a distinction between the desired low order model with n parameters in B and F, and m parameters in C and D, we will use the notation nho to denote the number of parameters in the B and A polynomials when a high-order ARX-model is used¹.
3. STRUCTURED ARX ESTIMATION
A model equivalent to (3) is given by [Kailath, 1980]
  y(t) = Σ_{k=1}^∞ b_k u(t − k) + e(t) + Σ_{k=1}^∞ h_k e(t − k),
subject to the rank conditions
  rank{Hankel(b)} = n,  rank{Hankel(h)} = m,  (10)
and the corresponding prediction errors are
  ε(t) = ( 1 + Σ_{k=1}^∞ h_k q^{−k} )^{−1} ( y(t) − Σ_{k=1}^∞ b_k u(t − k) ).  (11)
There are several problems associated with (11) from an optimization point of view. Firstly, it is non-convex in {h_k}. Furthermore, the rank constraints are non-convex. An interesting convex rank heuristic is obtained using the nuclear norm² [Fazel et al., 2001] (see also Fazel et al. [2003]).
² ‖X‖_* = Σ_i σ_i(X), i.e., the sum of the singular values of X.
A least squares estimate of the high order model is then obtained from
  min_{b_1, b_2, …}  Σ_{t=1}^N ( y(t) − Σ_k b_k u(t − k) )².  (12)
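To make the construction described in the abstract concrete, here is a hypothetical CVX sketch of a least squares ARX fit with nuclear norm penalties on Hankel matrices of the estimated polynomial coefficients. The paper builds the penalized Hankel matrices from the impulse response of the inverse noise model and from the numerator polynomial, so penalizing the a and b coefficients directly, as below, is a simplified stand-in; the names arx_nn, hank and the weights lam_a, lam_b are assumptions:
function [b, a] = arx_nn(y, u, nho, lam_b, lam_a)
% Structured high order ARX sketch: equation error least squares plus
% nuclear norm penalties on Hankel matrices of the coefficient vectors.
% y, u are N x 1 column vectors; nho is the ARX order.
N = length(y);
Phi_y = toeplitz([0; y(1:N - 1)], zeros(1, nho));  % past outputs y(t-k)
Phi_u = toeplitz([0; u(1:N - 1)], zeros(1, nho));  % past inputs u(t-k)
L = ceil(nho / 2);
cvx_begin, cvx_quiet(true);
    variables a(nho, 1) b(nho, 1)
    res = y + Phi_y * a - Phi_u * b;               % ARX equation error
    minimize(sum_square(res) + lam_b * norm_nuc(hank(b, L)) ...
                             + lam_a * norm_nuc(hank(a, L)))
cvx_end

function H = hank(x, L)
% L x (length(x) - L + 1) Hankel matrix of the (CVX) vector x.
T = length(x);
H = [];
for i = 1:L
    H = [H; x(i:i + T - L)'];
end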
Abstract: Recently a new approach to black-box nonlinear system identification has been
introduced which searches over a convex set of stable nonlinear models for the one which
minimizes a convex upper bound of long-term simulation error. In this paper, we further study
the properties of the proposed model set and identification algorithm and provide two theoretical
results: (a) we show that the proposed model set includes all quadratically stable nonlinear
systems, as well as some more complex systems; (b) we study the statistical consistency of the
proposed identification method applied to a linear system with noisy measurements. It is shown that a modification related to instrumental variables gives consistent parameter estimates.
1. INTRODUCTION
Building approximate models of dynamical systems from
data is a ubiquitous task in the sciences and engineering.
Black-box modeling in particular plays an important role
when first-principles models are either weakly identifiable,
too complicated for the eventual application, or simply
unavailable (see, e.g., Sjöberg et al. [1995], Suykens and Vandewalle [1998], Giannakis and Serpedin [2001], Ljung [2010] and references therein).
Model instability, prevalence of local minima, and poor
long-term (many-step-ahead) prediction accuracy are common difficulties when identifying black-box nonlinear models (Ljung [2010], Schön et al. [2010]). Recently a new
approach labelled robust identification error (RIE) was
introduced which searches over a convex set of stable
nonlinear models for the one which minimizes a convex
upper bound of long-term simulation error (Tobenkin et al.
[2010]). In Manchester et al. [2011] the method was extended to systems with orbitally stable periodic solutions.
Earlier related approaches appeared in Sou et al. [2008],
Bond et al. [2010], Megretski [2008].
Since both the model set and the cost function of the
RIE involve convexifying relaxations, it is important to
study the degree of conservatism so introduced. In this
paper, we show that the proposed model set includes all
quadratically contracting systems, as well as some more
complex models. We also study the statistical consistency
of the proposed identification method applied to a linear
system with noisy measurements. It is shown that the
estimator can give biased parameter estimates, but a
modification related to instrumental variables recovers
statistical consistency.
⋆ This work was supported by National Science Foundation Grant No. 0835947.
We consider models in the implicit form
  e(x(t + 1)) = f(x(t), u(t)),  (1)
  y(t) = g(x(t), u(t)),  (2)
where e : R^n → R^n, f : R^n × R^m → R^n, and g : R^n × R^m → R^k are continuously differentiable functions such that the equation e(z) = w has a unique solution z ∈ R^n for every w ∈ R^n. This choice of implicit equations provides important flexibility for the convex relaxations.
Models are used for a wide variety of purposes, each with its own characteristic measure of performance; however, for a large class of problems an appropriate measure is simulation error, i.e.
  E = Σ_{t=0}^T |ỹ(t) − y(t)|²,  (3)
where ỹ(t) denotes the simulated model output.
Ideally, we would search over all stable nonlinear models (1), (2) of limited complexity 1 for one which minimizes simulation error. There are two major difficulties
which render this impossible in general: firstly, we have no
tractable parameterization of all stable nonlinear models
of limited complexity; secondly, even supposing we are
given such a parameterization, the simulation error is a
highly nonlinear function of f and g, making the associated optimization problem very difficult. Indeed, in Ljung
[2010] stability of models and convexity of optimization
criteria are listed among the major open problems in
nonlinear system identification.
1.1 Convex Upper Bounds for Simulation Error
We will start with the second problem: optimization of simulation error. The main difficulty with simulation error is that it depends on ỹ(t), the result of solving a difference or differential equation over a long time interval. Even if some finite-dimensional linearly-parameterized class of functions is used to represent f and g, the simulated output will be a highly nonlinear function of those parameters.
When analyzing stability or performance of a dynamical system it is common to make use of a storage function and a dissipation inequality, and we follow the same approach here. The advantage is that a condition on the solutions of a dynamical system can be checked or enforced via pointwise conditions on the system equations. In particular, let Δ be the difference between the state of the model and the true state of the system. Suppose we find some positive definite² function V(Δ, t) satisfying
  V(Δ(t + 1), t + 1) − V(Δ(t), t) ≤ r(t) − |ỹ(t) − y(t)|²  (5)
with ỹ(t) = g(x̃(t) + Δ(t), ũ(t)) and r(t) a slack variable. Suppose also that the model's initial conditions are correct, i.e. Δ(0) = 0; then we can simply sum the above dissipation inequality up to time T and obtain
  E := Σ_{t=0}^T |ỹ(t) − y(t)|² ≤ Σ_{t=0}^T r(t).  (7)
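In more detail, the summation step is a telescoping argument: since V is nonnegative and V(Δ(0), 0) = 0,
  0 ≤ V(Δ(T + 1), T + 1) = Σ_{t=0}^T [ V(Δ(t + 1), t + 1) − V(Δ(t), t) ] ≤ Σ_{t=0}^T [ r(t) − |ỹ(t) − y(t)|² ],
and rearranging gives the bound in (7).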
In particular, the bound is expressed through the mismatch signals
  Δ_y := g(x̃ + Δ, ũ) − y,   Δ_e := e(x̃ + Δ) − e(x̃),  (8)
together with f(x̃ + Δ, ũ), evaluated along the data. The identification cost is the sum of pointwise errors
  Σ_{t=1}^N E_Q(z̃(t)),  (9)
where E_Q(z̃) upper-bounds the pointwise simulation error and is defined as a supremum over Δ ∈ R^n of a quadratic form in these mismatch signals, weighted by matrices Q and P and including the term |Δ_y|²  (10).
Choosing (ẽ, f̃) = (ē, f̄), where f̄ = Q⁻¹E⁻¹f(x, u), we see that the arguments of the supremum in the definition of Ê identically match those in the definition of E.
For linear models in the implicit form
  E x̃(t + 1) = F x̃(t) + K ũ(t),  (18)
  ỹ(t) = G x̃(t) + L ũ(t),  (19)
with E, F ∈ R^{n×n}, K ∈ R^{n×m}, G ∈ R^{k×n}, L ∈ R^{k×m} and E invertible, RIE minimization depends on the data only through its correlation matrix. This observation is of computational interest as, in this case, RIE minimization will require only d = 2n + m + k nonlinear convex constraints regardless of the number of observations. It is also the basis of the consistent estimator to follow.
Lemma 6. Let Z = {z(t_i)}_{i=1}^N ⊂ R^d. Define:
  W := (1/N) Σ_{i=1}^N z(t_i) z(t_i)′.  (20)
The second claim follows by taking an eigenvector decomposition of W and reversing the above identities.
Consider data generated by a linear system
  x(t + 1) = A₀ x(t) + B₀ u(t),  (24)
  y(t) = C₀ x(t) + D₀ u(t),  (25)
observed, in repeated experiments indexed by r, through noisy measurements
  [ x̃^(r)(t) ; ỹ^(r)(t) ] = [ x^(r)(t) ; y^(r)(t) ] + w^(r)(t).  (26)
Note that in this setting direct observation of the state is not particularly restrictive, as we can use the recent input and output history for x̃(t) and assume the system (24),(25) is in an observable canonical form.
The RIE produces implicit models of the form (18),(19). We denote the parameters of such a model by θ ∈ R^{n×n} × R^{n×n} × R^{n×m} × R^{k×n} × R^{k×m}:
  θ := (E, F, K, G, L).
The implicit form means there are many θ corresponding to the same linear system. We define S(θ) to be the map taking implicit models to their explicit form:
  S(θ) = [ E⁻¹F  E⁻¹K ; G  L ].
In this notation, we will present an estimator for S(θ₀) where θ₀ = (I, A₀, B₀, C₀, D₀).
With
  z̃^(r)(t) = [ x̃^(r)(t + 1)′  ỹ^(r)(t)′  x̃^(r)(t)′  ũ(t)′ ]′,
our estimator is defined as follows. Compute a symmetrized cross-correlation Ŵ_N:
  Ŵ_N = (1/(2N)) Σ_{t=1}^N ( z̃^(1)(t) z̃^(2)(t)′ + z̃^(2)(t) z̃^(1)(t)′ ).  (27)
Define W̄_N by
  W̄_N = Ŵ_N + max{0, …},
let (θ̂_N, Q̂_N) be any minimizer of F_Q(W̄_N, θ) subject to the constraints that … < I and Q = Q′ > 0. Our estimator is then given by S(θ̂_N).
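As a concrete illustration, (27) is a single matrix product per experiment pair; a minimal MATLAB sketch (the d × N column layout of z1 and z2 is an assumption):
% Columns of z1, z2 are the vectors z~(1)(t) and z~(2)(t), t = 1, ..., N.
N = size(z1, 2);
What = (z1 * z2' + z2 * z1') / (2 * N);   % symmetrized cross-correlation (27)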
The following statement describes conditions under which this estimator converges:
Theorem 7. Let (A₀, B₀) be stable and controllable. Given an input ũ : Z₊ → R^m and observations {(z̃(t)^(1), z̃(t)^(2))}_{t=1}^∞, let W_N = (1/N) Σ_{t=1}^N z^(1)(t) z^(1)(t)ᵀ (i.e. a noiseless empirical correlation). If lim sup_{N→∞} ‖W_N‖²_F < ∞ and the following conditions hold:
  lim inf_{N→∞} λ_min( (1/N) Σ_{t=1}^N [ x^(i)(t) ; ũ(t) ][ x^(i)(t) ; ũ(t) ]′ ) ≥ ε,  (28)
  lim_{N→∞} (1/N) Σ_{t=1}^N w^(1)(t) w^(2)(t)′ = 0,  (29)
  lim_{N→∞} (1/N) Σ_{t=1}^N w^(i)(t) z^(j)(t)′ = 0,  (30)
for (i, j) = (1, 2) and (i, j) = (2, 1) and some K, ε > 0, then:
  lim_{N→∞} S(θ̂_N) = S(θ₀).
Fig. 1. Bode magnitude plots of several estimation strategies for a second-order system after 200 (upper) and
2000 (lower) samples.
The mixed-correlation approach is by far the best at recovering the resonant peak, whereas the other approaches seem to generate models which are too stable.
5. CONCLUSION
The RIE is a new approach to nonlinear system identification. It allows one to search over a convex set of
stable nonlinear models, for one which minimizes a convex
upper bound of long-term simulation error. The resulting
optimization problem is a semidefinite program.
In order to convexify the model class and the optimization
objective, a number of relaxations and approximations
were necessary. The objective of this paper was to shed
light on the degree of approximation introduced. In particular, it is shown that the set of nonlinear models we search over in principle includes all quadratically stable systems (subject to richness of parametrization). It was also shown that there exist at least some models which are not quadratically contracting, but for which a non-quadratic contraction map can be found which satisfies the relaxed stability condition. Further results (positive or negative) on the coverage of this particular model class remain a topic for future research.
REFERENCES
B.N. Bond, Z. Mahmood, Y. Li, R. Sredojevic, A. Megretski, V. Stojanovic, Y. Avniel, and L. Daniel. Compact modeling of nonlinear analog circuits using system identification via semidefinite programming and incremental stability certification. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 29(8):1149–1162, Aug. 2010.
G.B. Giannakis and E. Serpedin. A bibliography on nonlinear system identification. Signal Processing, 81(3):533–580, 2001.
S.L. Lacy and D.S. Bernstein. Subspace identification with guaranteed stability using constrained optimization. IEEE Transactions on Automatic Control, 48(7):1259–1263, 2003.
L. Ljung. System Identification: Theory for the User. Prentice Hall, Englewood Cliffs, New Jersey, USA, 3rd edition, 1999.
L. Ljung. Perspectives on system identification. Annual Reviews in Control, 34(1):1–12, 2010.
W. Lohmiller and J.J.E. Slotine. On contraction analysis for non-linear systems. Automatica, 34(6):683–696, June 1998.
I.R. Manchester, M.M. Tobenkin, and J. Wang. Identification of nonlinear systems with stable oscillations. In 50th IEEE Conference on Decision and Control (CDC). IEEE, 2011.
A. Megretski. Convex optimization in robust identification of nonlinear feedback. In Proceedings of the 47th IEEE Conference on Decision and Control, pages 1370–1374, Cancun, Mexico, Dec. 9-11 2008.
P.A. Parrilo. Structured Semidefinite Programs and Semialgebraic Geometry Methods in Robustness and Optimization. PhD thesis, California Institute of Technology, May 18, 2000.
T.B. Schön, A. Wills, and B. Ninness. System identification of nonlinear state-space models. Automatica, 47(1):39–49, 2010.
J. Sjöberg, Q. Zhang, L. Ljung, A. Benveniste, B. Delyon, P.-Y. Glorennec, H. Hjalmarsson, and A. Juditsky. Nonlinear black-box modeling in system identification: a unified overview. Automatica, 31(12):1691–1724, 1995.
K.C. Sou, A. Megretski, and L. Daniel. A convex relaxation approach to the identification of the Wiener-Hammerstein model. In Proc. 47th IEEE Conference on Decision and Control, pages 1375–1382, Dec. 2008.
J.A.K. Suykens and J. Vandewalle, editors. Nonlinear Modeling: advanced black-box techniques. Springer Netherlands, 1998.
M.M. Tobenkin, I.R. Manchester, J. Wang, A. Megretski, and R. Tedrake. Convex optimization in identification of stable non-linear state space models. In 49th IEEE Conference on Decision and Control (CDC), pages 7232–7237. IEEE, 2010.
Abstract: This paper gives a primal-dual derivation of the Least Squares Support Vector Machine (LS-SVM) using Instrumental Variables (IVs), denoted simply as the Primal-dual Instrumental Variable Estimator. Then we propose a convex optimization approach for learning the optimal instruments. Besides the traditional argumentation for the use of IVs, the primal-dual derivation gives another interesting advantage, namely that the complexity of the system to be solved is expressed in the number of instruments, rather than in the number of samples as is typically the case for SVM and LS-SVM formulations. This note explores some exciting issues in the design and analysis of such an estimator.
1. INTRODUCTION
The method of Least Squares Support Vector Machines (LS-SVMs) amounts to a class of nonparametric, regularized estimators capable of dealing efficiently with high-dimensional, nonlinear effects, as surveyed in Suykens et al. [2002]. The methodology builds upon the research on Support Vector Machines (SVMs) for classification, integrating ideas from functional analysis, convex optimization and learning theory. The use of Mercer kernels is found to generalize many well-known nonparametric approaches, including smoothing splines, RBF and regularization networks, as well as parametric approaches, see Suykens et al. [2002] and citations. The primal-dual derivations are found to be a versatile tool for deriving non-parametric estimators, and are found to adapt well to cases where the application at hand provides useful structural information, see e.g. Pelckmans [2005]. In the context of (nonlinear) system identification a particularly relevant structural constraint comes in the form of a specific noise model, as explained in Espinoza et al. [2004], Pelckmans [2005].
In the literature two essentially different approaches to function estimation in the context of colored noise are found. In (i) the first, one includes the model of the noise coloring in the estimation process. This approach is known as the Prediction Error Method (PEM) in the context of system identification. The main drawback of such an approach is the computational demand. In (ii) an instrumental variables approach, one tries to retain the complexity and convenience of ordinary least squares estimation. The latter copes with the coloring of the noise by introducing artificial instruments restricting the least squares estimator to model the coloring of the noise. Introductions to Instrumental Variable (IV) estimators can be found in Ljung [1987], Söderström and Stoica [1989] in a context of system identification of LTI systems, and Bowden and Turkington [1990] in a context of econometrics.
In this paper we construct a kernel based learning machine
based on an instrumental variable estimator. Related ideas
⋆ This work was supported in part by the Swedish Research Council under contract 621-2007-6364.
The instruments z^k ∈ R^N, k = 1, . . . , m, impose the moment conditions
  Σ_{t=1}^N z_{tk} ( f(x_t) − y_t ) = (z^k)ᵀ( f_N − y_N ) = 0,  ∀k = 1, . . . , m,  (3)
where f_N = (f(x_1), . . . , f(x_N))ᵀ ∈ R^N and y_N = (y_1, . . . , y_N)ᵀ are vectors. Therefore, the method is also referred to as the method of generalized moment matching in the statistical and econometric literature. The rationale goes that albeit the residuals might not be white (or minimal), they are to be uncorrelated with the covariates. That is, the estimate is to be independent of the realized stochastic behavior (noise). The choice of instruments depends on which statistical assumptions one might exploit in a certain case. A practical example is found when estimating parameters of a dynamic model (say a polynomial linear time-invariant model). Then, instruments might be chosen as filters of input signals, hereby exploiting the assumption of the residuals being uncorrelated with past inputs, see e.g. Ljung [1987], Söderström and Stoica [1989] and citations.
The primal solution then admits the expansion
  w = Σ_{k=1}^m α_k Σ_{t=1}^N z_{tk} φ(x_t),  (5)
where φ denotes the feature map.
Now, define K ∈ R^{N×N} (the kernel matrix) as K_{t,s} = φ(x_s)ᵀφ(x_t). The dual problem is given as
  min_{α ∈ R^m}  (1/2) αᵀ(ZᵀKZ)α − αᵀZᵀy_N.  (6)
If m < N and the matrix (ZᵀKZ) is full rank, the optimal solution is unique and can be computed efficiently as the solution to the linear system
  (ZᵀKZ)α = Zᵀy_N.  (7)
The estimate can be evaluated at a new sample x* ∈ R^d as
  f̂(x*) = Σ_{k=1}^m α_k Σ_{l=1}^N z_{lk} K(x*, x_l),  (8)
where K(x*, x_l) = φ(x_l)ᵀφ(x*).
A regularized version of the estimator is obtained by penalizing the violations of the moment conditions:
  min_{w,e}  (1/2) wᵀw + (γ/2) Σ_{k=1}^m e_k²
  s.t.  Σ_{t=1}^N z_{tk} ( wᵀφ(x_t) − y_t ) = e_k,  ∀k = 1, . . . , m.  (9)
The dual problem is
  min_{α ∈ R^m}  (1/2) αᵀ( ZᵀKZ + (1/γ) I_m )α − αᵀZᵀy_N,  (10)
with unique solution given by the linear system
  ( ZᵀKZ + (1/γ) I_m ) α = Zᵀy_N.  (11)
The above derivation also implies reduced computational complexity compared to standard LS-SVMs. In particular, it is not needed to memorize and work with the full matrix K ∈ R^{N×N}, but one may focus attention on the matrix (ZᵀKZ) ∈ R^{m×m}, which is of considerably lower dimension when m ≪ N.
2.4 Kernel Machines for Colored Noise Sources
The above problem formulation relates explicitly to learning an LS-SVM from measurements which contain colored noise. Assume the noise is given as a filter h(q⁻¹) = Σ_{τ=0}^d h_τ q^{−τ} driven by a white noise sequence {e_t}, so that the model constraints become
  wᵀφ(x_t) + Σ_{τ=0}^d h_τ e_{t−τ} = y_t,  ∀t = 1, . . . , N.  (13)
Note that the filter coefficients {h_τ} are assumed to be known here. The dual problem is given as
  min_{α ∈ R^N}  (1/2) αᵀ( K + (1/γ) HHᵀ )α − αᵀy_N,  (14)
where I_N = diag(1, . . . , 1) ∈ R^{N×N} and the Toeplitz matrix H ∈ R^{N×N} is made up of the filter coefficients, H_ij = h̄_{i−j}, with h̄_τ = h_τ for 0 ≤ τ ≤ d, and h̄_τ = 0 otherwise. Similarly, H⁻¹ is the Toeplitz matrix consisting of the coefficients of the inverse filter, assuming that the inverse exists (or that the filter is minimum phase). The solution can be obtained by solving the linear system
  ( H⁻ᵀKH⁻¹ + (1/γ) I_N ) α = H⁻ᵀy_N.  (15)
Using the identity Z(ZᵀKZ + (1/γ)I_m)⁻¹ = ((ZZᵀ)K + (1/γ)I_N)⁻¹Z, which follows from ((ZZᵀ)K + (1/γ)I_N)Z = Z(ZᵀKZ + (1/γ)I_m), the fitted outputs can be written as
  Ŷ^i = KZ( ZᵀKZ + (1/γ)I_m )⁻¹ ZᵀY^i = K( (ZZᵀ)K + (1/γ)I_N )⁻¹ (ZZᵀ) Y^i,  (17)
so that the fit depends on the instruments only through the matrix ZZᵀ:
  Ŷ^i = K( (ZZᵀ)K + (1/γ)I_N )⁻¹ (ZZᵀ) Y^i.  (18)
The optimal instruments can then be sought by solving
  min_{λ_t(Λ) ∈ {0,1}}  Σ_{i=1}^m ‖Y^i − ΛRY^i‖²₂  s.t.  ΛRY^1 = ⋯ = ΛRY^m,  (19)
where R = K(K + (1/γ)I_N)⁻¹ ∈ R^{N×N} can be computed in advance. The problem can also be phrased as follows by introducing a vector Ŷ ∈ R^N:
  (Λ̂, Ŷ) = arg min_{λ_t(Λ) ∈ {0,1}}  Σ_{i=1}^m ‖Y^i − Ŷ‖²₂
  s.t.  Ŷ = ΛRY^i,  ∀i = 1, . . . , m.  (20)
This combinatorial optimization problem can be relaxed to
  (Λ̂, Ŷ) = arg min_{λ_t(Λ) ∈ [0,1]}  Σ_{i=1}^m ‖Y^i − Ŷ‖²₂
  s.t.  Ŷ = ΛRY^i,  ∀i = 1, . . . , m,  (21)
where the eigenvalues λ_t can now take any value in the interval [0, 1]. This problem can be solved efficiently as a Semi-Definite Program (SDP). This is so as both the constraints Λ ⪰ 0 and λ_max(Λ) ≤ 1 are convex, see e.g. Boyd and Vandenberghe [2004]. Note that in practice, in the optimum, many eigenvalues of Λ̂ will be set to zero (structural sparseness), implying that one could find Z ∈ R^{N×m} with m ≪ N.
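A minimal CVX sketch of the relaxation (21); the N × m matrix Y with columns Y^i, the precomputed matrix R, and the use of CVX's sdp mode are assumptions:
[N, m] = size(Y);                     % columns Y(:, i) are the signals Y^i
cvx_begin sdp, cvx_quiet(true);
    variable Lam(N, N) symmetric
    variable Yb(N, 1)
    minimize( sum(sum_square(Y - Yb * ones(1, m))) )
    subject to
        for i = 1:m
            Lam * (R * Y(:, i)) == Yb;   % Yb = Lam R Y^i for all i
        end
        Lam >= 0;                        % Lam positive semidefinite (sdp mode)
        eye(N) - Lam >= 0;               % eigenvalues of Lam at most one
cvx_end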
Consider a MIMO-LTI system with transfer function matrix
  H(z) = Σ_{n=0}^∞ H_n z^{−n},  (3)
where the impulse response matrices collect the responses h(r̃_j, r_m, n) between the source positions r_m and the observer positions r̃_j, i.e.,
  H(z) = Σ_{n=0}^∞ [ h(r̃_1, r_1, n) ⋯ h(r̃_1, r_M, n) ; ⋮ ⋱ ⋮ ; h(r̃_J, r_1, n) ⋯ h(r̃_J, r_M, n) ] z^{−n}.  (4)
The common-denominator pole-zero model has numerator coefficient matrices
  B_n = [ b_n(r̃_1, r_1) ⋯ b_n(r̃_1, r_M) ; ⋮ ⋱ ⋮ ; b_n(r̃_J, r_1) ⋯ b_n(r̃_J, r_M) ].  (6)
The numerator parameters for source m are stacked in the vector
  b_m = [ b_0(r̃_1, r_m) … b_Q(r̃_1, r_m) … b_0(r̃_J, r_m) … b_Q(r̃_J, r_m) ]ᵀ  (10)
for m = 1, . . . , M, and the denominator parameters in
  a = [ a_0 … a_P ]ᵀ.  (11)
Note that the first coefficient in the denominator parameter vector is usually fixed to a_0 = 1. We include it here in the parameter vector for notational convenience.
2.2 State-of-the-Art Data-Based Identification Approach
Different algorithms for the estimation of the parameter vector using the data model (5)-(8) have been proposed, see Gustafsson et al. (2000), Haneda et al. (1994), Rolain et al. (1998), Stoica and Jansson (2000), Verboven et al. (2004), and Hikichi and Miyoshi (2004). In these algorithms, however, the knowledge that the noise-free observations u(r̃_j, n), j = 1, . . . , J, are samples of the wave field generated by (1) is not exploited, and hence the structure imposed by the wave equation is ignored.
(The weighting matrix Γ(ω), defined in (12), is Hermitian and is built from the blocks S*(ω)Sᵀ(ω), Y^H(ω)Y(ω), and vec(Y(ω)S^H(ω)), combined with I_J and the rank-one matrices z_Q(ω)z_Q^H(ω), z_Q(ω)z_P^H(ω), and z_P(ω)z_P^H(ω) through Kronecker products.)
The matrix
  B(ω) = [ B_11(ω) ⋯ B_1M(ω) ; ⋮ ⋱ ⋮ ; B_J1(ω) ⋯ B_JM(ω) ]  (17)
contains the pole-zero model numerator frequency responses for the different source-observer combinations.
By defining the equation error vector related to (13) as
  E(θ, ω) = B(ω)S(ω) − A(ω)Y(ω),  (18)
a least squares (LS) criterion for the estimation of the parameter vector θ can be obtained as
  min_θ  Σ_ω E^H(θ, ω) E(θ, ω),  (19)
which can be written as the quadratic program (QP)
  min_θ  Σ_ω θ^H Γ(ω) θ  (20)
  s. t.  a_0 = 1,  (21)
with Γ(ω) defined in (12), where I_J represents the J × J identity matrix, (·)* denotes the complex conjugation operator, vec(·) is the matrix vectorization operator, ⊗ denotes the Kronecker product, and the complex sinusoidal vectors are defined as
  z_Q(ω) = [ 1  e^{jω}  …  e^{jQω} ]ᵀ  (22)
  z_P(ω) = [ 1  e^{jω}  …  e^{jPω} ]ᵀ.  (23)
3. FINITE ELEMENT METHOD FOR COMMON-DENOMINATOR POLE-ZERO MODELS
We will now derive a set of linear equations in the pole-zero model parameters that are valid if the MIMO-LTI system is indeed governed by the wave equation in (1). We can eliminate the time variable and the partial time derivative from the wave equation by taking the discrete Fourier transform of (1) after temporal sampling, which results in the Helmholtz equation
  ∇²U(r, ω) + k²U(r, ω) = S(r, ω),  (24)
from which the following set of equations is obtained:
  ∇²H(r, r_1, ω) + k²H(r, r_1, ω) = Σ_{m=1}^M ( S_m(ω) / S_1(ω) ) δ(r − r_m)
  ⋮
  ∇²H(r, r_M, ω) + k²H(r, r_M, ω) = Σ_{m=1}^M ( S_m(ω) / S_M(ω) ) δ(r − r_m),  (26)
where the frequency-domain Green's function H(r, r_m, ω), m = 1, . . . , M, corresponds to the frequency response of the discrete-time system defined in (4) for r = r̃_j, j = 1, . . . , J. We can hence substitute the common-denominator pole-zero model for H(r, r_m, ω) in (26), and bring the common denominator (which is independent of r) to the right-hand side, i.e.,
  ∇²B(r, r_m, ω) + k²B(r, r_m, ω) = A(ω) Σ_{l=1}^M ( S_l(ω) / S_m(ω) ) δ(r − r_l),  m = 1, . . . , M,  (27)
where we have used a more compact notation to denote a set of M equations. Note that we have deliberately not restricted the observer position r in (27) to the discrete set of positions r̃_j defined earlier. Instead, we consider the numerator frequency response B(r, r_m, ω) to be a continuous function of r. This function can be approximated in a finite-dimensional subspace by discretizing the spatial domain using a 3-D grid defined by the points r̄_k, k = 1, . . . , K.
Here, the subspace basis functions are chosen to be piecewise linear functions satisfying φ_i(r̄_k) = δ(i − k), i = 1, . . . , K. In particular, the basis functions are defined on a 3-D triangulation of the spatial domain Ω, where the kth basis function is made up of linear (non-zero slope) segments along the line segments between the point r̄_k and all the points with which point r̄_k shares a tetrahedron edge, and zero-valued segments elsewhere. We can then rewrite the set of Helmholtz equations in (27) as a set of linear equations in B(r̄_k, r_m, ω) by making use of the FEM, see Brenner and Scott (2008). In a nutshell, the FEM consists in converting the partial differential equation (PDE) in (27) to its weak formulation, performing integration by parts to relax the differentiability requirements on the subspace basis functions, and enforcing the subspace approximation error induced by (28) to be orthogonal to this subspace. The set of M PDEs in (27) can then be rewritten as the Galerkin system
  ( K − k²L ) β_m(ω) = A(ω) Σ_{l=1}^M ( S_l(ω) / S_m(ω) ) ψ_l,  m = 1, . . . , M.  (29)
Here, the K × K matrices K and L denote the FEM stiffness and mass matrices, defined as
  [K]_ij = ∫_Ω ∇φ_j(r) · ∇φ_i(r) dr,  (30)
  [L]_ij = ∫_Ω φ_j(r) φ_i(r) dr,  (31)
and β_m(ω) collects the numerator frequency response samples on the grid,
  β_m(ω) = [ B(r̄_1, r_m, ω) … B(r̄_K, r_m, ω) ]ᵀ.  (32)
The K × 1 vectors ψ_l, l = 1, . . . , M on the right-hand side of the Galerkin system in (29) contain the barycentric coordinates of the point sources, obtained by projecting the spatial unit-impulse functions δ(r − r_l) onto the chosen subspace basis, i.e.,
  [ψ_l]_i = ∫_Ω δ(r − r_l) φ_i(r) dr.  (33)
Analogously to (10), define the extended numerator parameter vector on the grid points,
  b̄_m = [ b_0(r̄_1, r_m) … b_Q(r̄_1, r_m) … b_0(r̄_K, r_m) … b_Q(r̄_K, r_m) ]ᵀ,  (37)
and recall the (P + 1) × 1 denominator parameter vector
definition in (11). Note that only the first J(Q + 1) coefficients of the numerator parameter vector (corresponding to the elements of the numerator parameter vector
bm defined in (10)) are of explicit interest, while the
other coefficients have been introduced for constructing
the FEM approximation of the continuous-space function B(r, rm , ). By using the above parameter vector
definitions, and recalling the definitions of the complex
sinusoidal vectors in (22)-(23), we can rewrite the Galerkin
system in (36) as follows,
  ( K − k²L )( I_K ⊗ z_Q^H(ω) ) b̄_m = ψ̄_m(ω) z_P^H(ω) a,  m = 1, . . . , M,  (38)
where ψ̄_m(ω) := Σ_{l=1}^M ( S_l(ω) / S_m(ω) ) ψ_l collects the source terms of (29),
or equivalently
  [ M(ω)   0    ⋯   0     −ψ̄_1(ω) z_P^H(ω)
     0    M(ω)  ⋯   0     −ψ̄_2(ω) z_P^H(ω)
     ⋮     ⋮    ⋱   ⋮          ⋮
     0     0    ⋯  M(ω)   −ψ̄_M(ω) z_P^H(ω) ]  [ b̄_1 ; b̄_2 ; … ; b̄_M ; a ] = 0.  (39)
Here, M(ω) = ( K − k²L )( I_K ⊗ z_Q^H(ω) ), and 0 represents a zero vector or matrix of appropriate dimensions. We denote the above block matrix by Γ̄(ω) and the stacked parameter vector [ b̄_1ᵀ … b̄_Mᵀ aᵀ ]ᵀ by θ̄.
The hybrid identification problem then augments the data fit with these structural constraints:
  minimize the data term in (40)
  s. t.  Γ̄(ω_1) θ̄ = 0
         ⋮
         Γ̄(ω_L) θ̄ = 0
         a_0 = 1.  (41)
Here, the [J(Q + 1) + P + 1] × [K(Q + 1) + P + 1] selection matrix C is defined such that
  θ = C θ̄.  (42)
Compared to the state-of-the-art identification approach exemplified by the QP in (20)-(21), LMK additional equality constraints have been included in (41). These equality constraints allow one to impose structural information at a number of frequencies ω_1, . . . , ω_L, thus increasing the model accuracy at these particular frequencies. The number of frequencies L at which the Galerkin equations are imposed determines the number of additional constraints. The Galerkin equations can be written as
  ( K − k²L ) β_m(ω) = ψ̄_m(ω) A(ω),  m = 1, . . . , M,  (43)
or equivalently, in block form analogous to (39),
  [ M(ω)   0    ⋯   0     T_1(ω) A(ω) I_K
     0    M(ω)  ⋯   0     T_2(ω) A(ω) I_K
     ⋮     ⋮    ⋱   ⋮          ⋮
     0     0    ⋯  M(ω)   T_M(ω) A(ω) I_K ]  θ̄(a, …) = 0.  (44)
The data term in (40) can also be rewritten as a function of θ̄, by partitioning the matrix ( Σ_ω Γ(ω) )^{1/2} C =: [ Γ_L | Γ_R ] such that
  ( Σ_ω Γ(ω) )^{1/2} C θ̄ = Γ_L [ I_{MK(Q+1)}  0 ] θ̄ + Γ_R a.  (45)
  s. t.  ( I_M ⊗ 1_Kᵀ ) λ = 1_M,  λ ≥ 0,  (47)
with 1 a vector of all ones. In this optimization problem, the sparsity of λ is exploited by including an ℓ1-regularization term in (46), while the non-negativity and the property of columns summing to one are enforced in the (in)equality constraints (47).
5. SIMULATION RESULTS
We provide a simulation example, in which the proposed identification approach is compared to the state-of-the-art approach for the case of indoor acoustic wave propagation (c = 344 m/s). We consider a rectangular room of 8 × 6 × 4 m, with M = 3 sources and J = 5 sensors positioned as shown in Fig. 1. The Green's functions …
Fig. 1. Positions of the M = 3 sources and J = 5 sensors in the 8 × 6 × 4 m room (x, y, z in m).
(Further figure panels compare the DATA with the HYBRID (L = 1) and HYBRID (L = 2) model magnitude responses as a function of ω (rad).)