
Special Session on Convex Optimization for System Identification

Kristiaan Pelckmans and Johan A.K. Suykens

SysCon, Information Technology, Uppsala University, 75501 Uppsala, Sweden
KU Leuven ESAT-SCD, B-3001 Leuven (Heverlee), Belgium

Abstract: This special session aims to survey, present new results and stimulate discussions
on how to apply techniques of convex optimization in the context of system identification. The
scope of this session includes but is not limited to linear and nonlinear modeling, model structure
selection, block structured systems, regularization mechanisms (e.g. L1, sum-of-norms, nuclear
norm), segmentation of time-series, trend filtering, optimal experiment design and others.
1. INTRODUCTION

Insights in convex optimization continue to be a driving force for new formulations and methods of estimation and identification. The paramount incentive for phrasing methods of estimation as standard convex optimization problems has been the availability of efficient solvers, from both a theoretical and a practical perspective. From a conceptual perspective, convexity of a problem formulation ensures that no local minima exist. The work in convex optimization has matured to such an extent that researchers now view the distinction between convex and non-convex problems as more profound than the distinction between linear and non-linear optimization problems. A survey of techniques of convex optimization, together with applications in estimation, is Boyd and Vandenberghe [2004].
Convex optimization has always maintained a close connection to systems theory and estimation problems. Main
catalyzers of this synergy include the following:
(Combinatorial) Interest in convex approaches to efficiently solving complex optimization problems can be traced back to the max-flow min-cut theorem, essentially stating when a combinatorial problem can be solved as a convex linear programming one. This result has mainly impacted the field of Operations Research (OR), but the related property of (total) unimodularity of a matrix is currently seeing a revival in the context of machine learning and estimation. Another important result in this line of work has been the recent interest in Semi-Definite Programming (SDP) relaxations of NP-hard combinatorial problems. Standard references are Papadimitriou and Steiglitz [1998] and Schrijver [1998].
(LMI) An immediate predecessor of the solid body of work in convex optimization is the literature on Linear Matrix Inequalities (LMIs). This research has found a particularly rich application area in systems theory, where LMIs occur naturally in problems of stability and of automatic control. Standard works are Boyd et al. [1994] and Ben-Tal and Nemirovskii [2001].

(L1) Compressed Sensing or Compressive Sampling (CS) has led to vigorous research in applications of convex optimization in estimation and recovery problems. More specifically, the interest in sparse estimation problems has led to countless proposals of L1 norms in estimation problems, often stimulated by the promise that sparsity of (a transformation of) the unknowns has a natural interpretation in the specific application at hand. A main benefit of research in CS over earlier interest in the use of the L1 heuristic for recovery are the newly derived theoretical guarantees, a research area often attributed to Candes and Tao [2005]. Lately, much interest has been devoted to the study of the low-rank approximation problem Recht et al. [2010], where different heuristics were proposed to relax the combinatorial rank constraint. For extensions in the area of system identification see Recht et al. [2010], Liu and Vandenberghe [2009].
(Structure) Recent years have made it apparent that techniques of convex optimization can play yet another important role in identification and estimation problems. If the structure of the system underlying the
observations can be imposed as constraints in the
estimation problem, the set of possible solutions can
be sharply reduced, leading in turn to better estimates. This view has been especially important in
modeling of nonlinear phenomena, where the role of
a parametric model is entirely replaced by structural
constraints. Examples of such thinking are Support
Vector Machines and other non-parametric methods.
The former are also tied to convex optimization via
another link: it is found that Lagrange duality (as in
the theory of convex optimality) can lead to a systematic approach for introducing nonlinearity through
the use of Mercer kernels.
(Design) The design of experiments, as in the statistical sciences, has always been closely related to convex optimization. This is no different in the context of system identification, where the design of experiments now points to the design of an input sequence which properly excites all the modes of the system to be identified. One apparent reason why techniques of convex optimization are useful is that such experiments have to work in the allowed operating regions; that is, constraints enter naturally in most cases. For related references, see e.g. Boyd and Vandenberghe [2004].
(Priors) A modern view is that methods based on the L1 norm, on nuclear norm relaxation, or on imposing structure are examples of a larger picture, namely that such terms can be used to make the inverse problem well-posed. In other words, they fill in unknown pieces of information in the estimation problem by imposing or suggesting a prior structure. In general, an estimation problem can be greatly helped if one is able to suggest a good prior for completing the evidence given by the data alone. Such a prior can come in the form of a dictionary into which the solution fits nicely, or as a penalization term in the cost function of the optimization problem which penalizes the occurrence of unusual phenomena in the solution. A statistical treatment of regularization is surveyed in Bickel et al. [2006]; regularization and priors in system identification are discussed in Ljung [1999], Ljung et al. [2011]. A small illustrative sketch of this penalization viewpoint is given after the concluding paragraph below.
The last item suggests a definite way forward, namely how techniques of convex optimization can be used to model appropriate priors in a context of system identification. We conjecture that the near future will witness such a shift of focus from parametric Box-Jenkins and state-space models to structural constraints and application-specific priors.
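As a small illustration of the penalization viewpoint sketched above (this sketch is not part of the original session text), the snippet below imposes a sparsity prior on a linear-regression estimate through an L1 penalty. The synthetic data, the regularization weight lam, and the use of the cvxpy modeling package are all assumptions made for the example.

```python
# Hypothetical sketch: least squares with an L1 prior on the parameters (cvxpy).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
Phi = rng.standard_normal((100, 30))            # regressor matrix (synthetic)
theta_true = np.zeros(30); theta_true[:3] = [1.0, -0.5, 0.25]
y = Phi @ theta_true + 0.05 * rng.standard_normal(100)

theta = cp.Variable(30)
lam = 0.1                                        # prior strength, application dependent
prob = cp.Problem(cp.Minimize(cp.sum_squares(Phi @ theta - y) + lam * cp.norm1(theta)))
prob.solve()
print(np.round(theta.value, 3))                  # most entries are driven (near) zero
```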

REFERENCES
A. Ben-Tal and A.S. Nemirovskii. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications, volume 2. Society for Industrial Mathematics, 2001.
P.J. Bickel, B. Li, A.B. Tsybakov, S.A. van de Geer, B. Yu, T. Valdes, C. Rivero, J. Fan, and A. van der Vaart. Regularization in statistics. Test, 15(2):271–344, 2006.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
S. Boyd, L. El Ghaoui, E. Feron, and V. Balakrishnan. Linear Matrix Inequalities in System and Control Theory, volume 15. Society for Industrial Mathematics, 1994.
E.J. Candès and T. Tao. Decoding by linear programming. Information Theory, IEEE Transactions on, 51(12):4203–4215, 2005.
Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation with application to system identification. SIAM Journal on Matrix Analysis and Applications, 31(3):1235, 2009.
L. Ljung. System identification. Wiley Online Library, 1999.
L. Ljung, H. Hjalmarsson, and H. Ohlsson. Four encounters with system identification. European Journal of Control, 17(5):449, 2011.
C.H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Dover, 1998.
B. Recht, M. Fazel, and P.A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
A. Schrijver. Theory of Linear and Integer Programming. John Wiley & Sons Inc, 1998.

Convex optimization techniques in system identification

Lieven Vandenberghe

Electrical Engineering Department, UCLA, Los Angeles, CA 90095
(Tel: 310-206-1259; e-mail: vandenbe@ee.ucla.edu)

Abstract: In recent years there has been growing interest in convex optimization techniques
for system identification and time series modeling. This interest is motivated by the success of
convex methods for sparse optimization and rank minimization in signal processing, statistics,
and machine learning, and by the development of new classes of algorithms for large-scale
nondifferentiable convex optimization.
1. INTRODUCTION
Low-dimensional model structure in identification problems is typically expressed in terms of matrix rank or
sparsity of parameters. In optimization formulations this
generally leads to non-convex constraints or objective
functions. However, formulations based on convex penalties that indirectly minimize rank or maximize sparsity
are often quite effective as heuristics, relaxations, or, in
rare cases, exact reformulations. The best-known example is 1-norm regularization in sparse optimization, i.e., the use of the 1-norm ‖x‖₁ in an optimization problem as a substitute for the cardinality (number of nonzero elements) of a vector x. This idea has a rich history in statistics, image and signal processing [Rudin et al., 1992, Tibshirani, 1996, Chen et al., 1999, Efron et al., 2004, Candès and Tao, 2007], and an extensive mathematical theory has been developed to explain when and why it works well [Donoho and Huo, 2001, Donoho and Tanner, 2005, Candès et al., 2006b, Candès and Tao, 2005, Candès et al., 2006a, Candès and Tao, 2006, Donoho, 2006, Tropp, 2006]. Several excellent surveys and tutorials on this topic are available; see for example [Romberg, 2008, Candès and Wakin, 2008, Elad, 2010].
The 1-norm used in sparse optimization has a natural counterpart in the nuclear norm for matrix rank minimization. Here one uses the penalty function ‖X‖_*, where ‖·‖_* denotes the nuclear norm (sum of singular values), as a substitute for rank(X). Applications of nuclear norm methods in system theory and control were first explored by [Fazel, 2002, Fazel et al., 2004], and have recently gained in popularity in the wake of the success of 1-norm techniques for sparse optimization [Recht et al., 2010]. Much of the recent work in this area has focused on the low-rank matrix completion problem [Candès and Recht, 2009, Candès and Plan, 2010, Candès and Tao, 2010, Mazumder et al., 2010], i.e., the problem of identifying a low-rank matrix from a subset of its entries. This problem has applications in collaborative prediction [Srebro et al., 2005] and multi-task learning [Pong et al., 2011]. Applications of nuclear norm methods in system identification are discussed in [Liu and Vandenberghe, 2009a, Grossmann et al., 2009, Mohan and Fazel, 2010, Gebraad et al., 2011, Fazel et al., 2011].

The 1-norm and nuclear norm techniques can be extended in several interesting ways. The two types of penalties can be combined to promote sparse-plus-low-rank structure in matrices [Candès et al., 2011, Chandrasekaran et al., 2011]. Structured sparsity, such as group sparsity or hierarchical sparsity, can be induced by extensions of the
1-norm penalty [Bach et al., 2012, Jenatton et al., 2011,
Bach et al., 2011]. Finally, Chandrasekaran et al. [2010]
and Bach [2010] describe systematic approaches for constructing convex penalties for different types of nonconvex
structural constraints.
In this tutorial paper we discuss a few applications of convex methods for structured rank minimization and sparse
optimization, in combination with classical ideas from
system identification and signal processing. We focus on
subspace algorithms for system identification and topology
selection problems in graphical models. The second part of
the paper (section 4) provides a short survey of available
convex optimization algorithms.
2. SYSTEM IDENTIFICATION
Subspace methods in system identification and signal
processing rely on singular value decompositions (SVDs)
to make low-rank matrix approximations [Ljung, 1999].
The structure in the approximated matrices (for example,
Hankel structure) is therefore lost during the low-rank
approximation. A convex optimization formulation based
on the nuclear norm penalty offers an interesting alternative, because it promotes low rank while preserving linear
matrix structure. An additional benefit of an optimization formulation is the possibility of adding other convex
regularization terms or constraints on the optimization
variables.
As an illustration, consider the input-output equation used as a starting point in many subspace identification methods:
Y = OX + HU.
The matrices U and Y are block Hankel matrices constructed from a sequence of inputs u(t) and outputs y(t)
of a state space model
x(t + 1) = Ax(t) + Bu(t),

y(t) = Cx(t) + Du(t),

and the columns of X form a sequence of states x(t).


The matrix H depends on the system matrices, and O is an extended observability matrix [Verhaegen and Verdult, 2007, p. 295]. A simple subspace method consists of forming
the Hankel matrices U and Y and then projecting the
rows of Y on the nullspace of U . If the data are exact
and a persistence of excitation assumption holds, the rank
of the projected output matrix is equal to the system
order and from it a system realization is easily computed.
When the input-output data are not exact, one can use
a singular value decomposition of the projected output
Hankel matrix to estimate the order and compute a system
realization. However, as mentioned, this step destroys the
Hankel structure in Y and U . The nuclear norm penalty
on the other hand can be used as a convex heuristic
for indirectly reducing the rank, while preserving linear
structure. For example, if the inputs are exactly known and the measured outputs y_m(t) are subject to error, one can solve the convex problem

  minimize  ‖Y Q‖_* + γ Σ_t ‖y(t) − y_m(t)‖²₂

where the columns of Q form a basis of the nullspace of U and γ is a positive weight. The optimization variables are the model outputs y(t), and the matrix Y is a Hankel matrix constructed from the model outputs y(t). This is a convex optimization problem that can be solved via semidefinite programming. We refer the reader to [Liu and Vandenberghe, 2009a,b] for more details and numerical results. As an important advantage, the optimization formulation can be extended to include convex constraints on the model outputs. Another promising application is identification with missing data [Ding et al., 2007, Grossmann et al., 2009].
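The following sketch (not from the paper) illustrates the structure of this nuclear-norm output-error problem for a scalar output, using the cvxpy modeling package. The Hankel dimensions, the weight gamma, the synthetic data, and the construction of the nullspace basis Q are assumptions made for the example.

```python
# Minimal sketch (assumed setup):  minimize ||hankel(y) Q||_* + gamma * sum_t (y(t) - ym(t))^2
import numpy as np
import cvxpy as cp
from scipy.linalg import null_space

T, r = 60, 10                                     # data length and Hankel block rows (assumed)
rng = np.random.default_rng(0)
u = rng.standard_normal(T)                        # known input sequence
ym = rng.standard_normal(T)                       # measured (noisy) outputs, placeholder data

cols = T - r + 1
U = np.array([u[i:i + cols] for i in range(r)])   # input Hankel matrix (r x cols)
Q = null_space(U)                                 # basis of the nullspace of U

y = cp.Variable(T)                                # model outputs (optimization variable)
Y = cp.vstack([y[i:i + cols] for i in range(r)])  # output Hankel matrix, linear in y
gamma = 10.0                                      # positive weight (design parameter)
prob = cp.Problem(cp.Minimize(cp.normNuc(Y @ Q) + gamma * cp.sum_squares(y - ym)))
prob.solve()                                      # handled internally via an SDP reformulation
```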

3. GRAPHICAL MODELS

In a graphical model of a normal distribution x ∼ N(0, Σ) the edges in the graph represent the conditional dependence relations between the components of x. The vertices in the graph correspond to the components of x; the absence of an edge between vertices i and j indicates that x_i and x_j are independent, conditional on the other entries of x. Equivalently, vertices i and j are connected if there is a nonzero in the i, j position of the inverse covariance matrix Σ⁻¹.

A key problem in the estimation of the graphical model is the selection of the topology. Several authors have addressed this problem by adding a 1-norm penalty to the maximum likelihood estimation problem, and solving

  minimize  tr(CX) − log det X + γ ‖X‖₁.   (1)

Here X denotes the inverse covariance Σ⁻¹, the matrix C is the sample covariance matrix, and ‖X‖₁ = Σ_ij |X_ij|. See [Meinshausen and Bühlmann, 2006, Banerjee et al., 2008, Ravikumar et al., 2008, Friedman et al., 2008, Lu, 2009, Scheinberg and Rish, 2009, Yuan and Lin, 2007, Duchi et al., 2008, Li and Toh, 2010, Scheinberg and Ma, 2012].

Graphical models of the conditional independence relations can be extended to Gaussian vector time series [Brillinger, 1996, Dahlhaus, 2000]. In this extension the topology of the graph is determined by the sparsity pattern of the inverse spectral density matrix

  S(ω) = Σ_{k=−∞}^{∞} R_k e^{−jkω},

with R_k = E x(t + k)x(t)ᵀ. Using this characterization, one can formulate extensions of the regularized maximum likelihood problem (1) to vector time series. In [Songsiri et al., 2010, Songsiri and Vandenberghe, 2010] autoregressive models

  x(t) = Σ_{k=1}^{p} A_k x(t − k) + v(t),   v(t) ∼ N(0, Σ),

were considered, and convex formulations were presented for the problem of estimating the parameters A_k, Σ subject to conditional independence constraints, and of estimating the topology via a 1-norm type regularization. The topology selection problem leads to the following extension of (1):

  minimize   tr(CX) − log det X₀₀ + γ h(X)
  subject to X ⪰ 0.   (2)

The variable X is a (p + 1) × (p + 1) block matrix with blocks of size n × n (the length of the vector x(t)), and X₀₀ is the leading block of X. The penalty h is chosen to encourage a common, symmetric sparsity pattern for the diagonal sums

  Σ_{i=0}^{p−k} X_{i,i+k},   k = 0, 1, . . . , p,

of the blocks in X.

An extension to ARMA processes is studied by Avventi et al. [2010].
4. ALGORITHMS
For small and medium sized problems the applications discussed in the previous sections can be handled by general-purpose convex optimization solvers, such as the modeling packages CVX [Grant and Boyd, 2007] and YALMIP [Löfberg, 2004], and general-purpose conic optimization packages. In this section we discuss algorithmic approaches that are of interest for large problems that fall outside the scope of the general-purpose solvers.
4.1 Interior-point algorithms
Interior-point algorithms are known to attain a high accuracy in a small number of iterations, fairly independently of the problem data and dimensions. The main drawback is the high linear algebra complexity per iteration, associated with solving the Newton equations that determine search directions. However, sometimes problem structure can be exploited to devise dedicated interior-point implementations that are significantly more efficient than general-purpose solvers.
A simple example is the 1-norm approximation problem

  minimize  ‖Ax − b‖₁

with A of size m × n. This can be formulated as a linear program (LP)

  minimize    Σ_{i=1}^{m} y_i
  subject to  [ A  −I ] [ x ]     [  b ]
              [ −A −I ] [ y ]  ≤  [ −b ],

at the expense of introducing m auxiliary variables and 2m linear inequality constraints. By taking advantage of the structure in the inequalities, each iteration of an interior-point method for the LP can be reduced to solving linear systems AᵀDAx = r, where D is a positive diagonal matrix. As a result, the complexity of solving the 1-norm approximation problem using a custom interior-point solver is roughly the equivalent of a small number of weighted least-squares problems.
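A small sketch (with assumed example data) of the LP reformulation above, using scipy's general-purpose linprog rather than a structure-exploiting interior-point solver:

```python
# L1 approximation  minimize ||Ax - b||_1  as the LP with constraints [A -I; -A -I][x; y] <= [b; -b].
import numpy as np
from scipy.optimize import linprog

m, n = 40, 10
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

c = np.concatenate([np.zeros(n), np.ones(m)])             # minimize sum of the slacks y_i
A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
b_ub = np.concatenate([b, -b])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (n + m))
x_hat = res.x[:n]                                         # L1-optimal coefficients
```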
A similar result holds for the nuclear norm approximation problem

  minimize  ‖A(x) − B‖_*   (3)

where A(x) is a matrix-valued function of size p × q and x is an n-vector of variables. This problem can be formulated as a semidefinite program (SDP)

  minimize    (tr U + tr V)/2
  subject to  [ U            (A(x) − B)ᵀ ]
              [ A(x) − B     V           ]  ⪰ 0   (4)

with variables x, U, V. The very large number of variables (O(p²) if we assume p ≥ q) makes the nuclear norm optimization problem very expensive to solve by general-purpose SDP solvers. A specialized interior-point solver for the SDP is described in [Liu and Vandenberghe, 2009a], with a linear algebra cost per iteration of O(n²pq) if n ≥ max{p, q}. This is comparable to solving the matrix approximation problem in the Frobenius norm, i.e., minimizing ‖A(x) − B‖_F, and the improvement makes it possible to solve nuclear norm problems with p and q on the order of several hundred by an interior-point method.
We refer the reader to the book chapter [Andersen et al., 2012] for additional examples of special-purpose interior-point algorithms.
4.2 Nonlinear optimization methods
Burer and Monteiro [2003, 2005] have developed a large-scale method for semidefinite programming, based on substituting a low-rank factorization for the matrix variable and solving the resulting nonconvex problem by an augmented Lagrangian method. Adapted to the SDP (4), the method amounts to reformulating the problem as

  minimize    ‖L‖²_F + ‖R‖²_F
  subject to  A(x) − B = LRᵀ   (5)

with variables x, L ∈ R^{p×r}, R ∈ R^{q×r}, where r is an upper bound on the rank of A(x) − B at the optimum. Recht et al. [2010] discuss Burer and Monteiro's method in detail in the context of nuclear norm optimization.
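A rough numerical sketch of this factorization idea follows. It treats the coupling constraint A(x) − B = LRᵀ with a plain augmented Lagrangian and gradient steps; the affine map A(x) = Σ_i x_i A_i, the sizes, the step size, and the iteration counts are all assumptions made for illustration, and this is not the algorithm of the cited references.

```python
# Burer-Monteiro style sketch for the SDP (4): factorize the implicit low-rank matrix as L R^T.
import numpy as np

rng = np.random.default_rng(0)
p, q, n, r = 20, 15, 5, 3
As = rng.standard_normal((n, p, q))           # matrices defining the (assumed) map A(x) = sum_i x_i As[i]
B = rng.standard_normal((p, q))

x = np.zeros(n)
L = 0.1 * rng.standard_normal((p, r))
R = 0.1 * rng.standard_normal((q, r))
Lam = np.zeros((p, q))                        # multiplier for the constraint A(x) - B = L R^T
rho, step = 1.0, 1e-3

def A_of(x):
    return np.tensordot(x, As, axes=1)        # affine map A(x)

for outer in range(50):
    for inner in range(200):                  # gradient steps on the augmented Lagrangian
        E = A_of(x) - B - L @ R.T
        M = Lam + rho * E
        x -= step * np.array([np.sum(M * As[i]) for i in range(n)])
        L -= step * (2 * L - M @ R)
        R -= step * (2 * R - M.T @ L)
    Lam += rho * (A_of(x) - B - L @ R.T)      # multiplier update
# At a feasible point, (||L||_F^2 + ||R||_F^2)/2 upper-bounds the nuclear norm of L R^T.
```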

4.3 Proximal gradient algorithms

The proximal gradient algorithm is an extension of the gradient algorithm to problems with simple constraints or with simple nondifferentiable terms in the cost function. It is less general than the subgradient algorithm, but it is typically much faster and it handles many types of nondifferentiable problems that occur in practice.

The proximal gradient algorithm applies to a convex problem of the form

  minimize  f(x) = g(x) + h(x),   (6)

in which the cost function f is split in two components g and h, with g differentiable and h a simple nondifferentiable function. Simple here means that the prox-operator of h, defined as the mapping

  prox_{th}(x) = argmin_u ( h(u) + (1/(2t)) ‖u − x‖²₂ )

(with t > 0), is inexpensive to compute. It can be shown that if h is closed and convex, then prox_{th}(x) exists and is unique for every x.

A typical example is h(x) = ‖x‖₁. Its prox-operator is the element-wise soft-thresholding

  prox_{th}(x)_i =  x_i − t   if x_i ≥ t
                    0         if −t ≤ x_i ≤ t
                    x_i + t   if x_i ≤ −t.

Constrained optimization problems

  minimize    g(x)
  subject to  x ∈ C

can be brought in the form (6) by defining h(x) = I_C(x), the indicator function of C (i.e., I_C(x) = 0 if x ∈ C and I_C(x) = +∞ if x ∉ C). The prox-operator for I_C is the Euclidean projection on C. Prox-operators share many of the properties of Euclidean projections on closed convex sets. For example, they are nonexpansive, i.e.,

  ‖prox_{th}(x) − prox_{th}(y)‖₂ ≤ ‖x − y‖₂

for all x, y. (See Moreau [1965].)

The proximal gradient method for minimizing (6) uses the iteration

  x⁺ = prox_{th}(x − t∇g(x))

where t > 0 is a step size. The proximal gradient update consists of a standard gradient step for the differentiable term g, followed by an application of the prox-operator associated with the non-differentiable term h. It can be motivated by noting that x⁺ is the minimizer of the function

  h(y) + g(x) + ∇g(x)ᵀ(y − x) + (1/(2t)) ‖y − x‖²₂

over y, so x⁺ minimizes an approximation of f, obtained by adding to h a simple local quadratic model of g.
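To make the update concrete, here is a minimal proximal gradient (ISTA-style) sketch for g(x) = ½‖Ax − b‖²₂ and h(x) = λ‖x‖₁, using the soft-thresholding prox-operator described above. The data and step size are assumptions made for illustration.

```python
# Proximal gradient iteration  x+ = prox_{th}(x - t * grad g(x))  for a lasso-type problem.
import numpy as np

rng = np.random.default_rng(0)
m, n = 60, 100
A = rng.standard_normal((m, n))
x_true = np.zeros(n); x_true[:5] = rng.standard_normal(5)
b = A @ x_true + 0.01 * rng.standard_normal(m)

lam = 0.1
t = 1.0 / np.linalg.norm(A, 2) ** 2            # step size 1/L, L the Lipschitz constant of grad g

def soft_threshold(v, tau):
    # element-wise prox of tau*||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

x = np.zeros(n)
for k in range(500):
    grad = A.T @ (A @ x - b)                   # gradient of the smooth term g
    x = soft_threshold(x - t * grad, t * lam)  # prox step for the nondifferentiable term h
```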

It can be shown that if ∇g is Lipschitz continuous with constant L, then the suboptimality f(x^{(k)}) − f* decreases to zero as O(1/k) [Nesterov, 2004, Beck and Teboulle, 2009]. Recently, faster variants of the proximal gradient method, with a 1/k² convergence rate under the same assumptions and with the same complexity per step, have been developed [Nesterov, 2004, 2005, Beck and Teboulle, 2009, Tseng, 2008, Becker et al., 2011].
The (accelerated) proximal gradient methods are well suited for problems of the form

  minimize  g(x) + ‖x‖

where g is differentiable with a Lipschitz-continuous gradient. Most common norms have easily computed prox-operators, and the following property is useful when computing the prox-operator of a norm h(x) = ‖x‖:

  prox_{th}(x) = x − t P_B(x/t),

where P_B is the Euclidean projection on the unit ball of the dual norm.
In other applications it is advantageous to apply the proximal gradient method to the dual problem. Consider for example an optimization problem

  minimize  f(x) + ‖Ax − b‖

with f strongly convex. Reformulating this problem as

  minimize    f(x) + ‖y‖   (7)
  subject to  y = Ax − b

and taking the Lagrange dual gives

  maximize    bᵀz − f*(Aᵀz)
  subject to  ‖z‖_d ≤ 1

where f*(u) = sup_x (uᵀx − f(x)) is the conjugate of f and ‖·‖_d is the dual norm of ‖·‖. It can be shown that if f is strongly convex, then f* is differentiable with a Lipschitz continuous gradient. If projection on the unit ball of the dual norm is inexpensive, the dual problem is therefore readily solved by a fast gradient projection method.
An extensive library of fast proximal-type algorithms is available in the MATLAB software package TFOCS [Becker et al., 2010].
4.4 ADMM
The Alternating Direction Method of Multipliers (ADMM)
was proposed in the 1970s as a simplified version of
the augmented Lagrangian method. It is a simple and
often very effective method for large-scale or distributed
optimization, and has recently been applied successfully
to the regularized covariance selection problem mentioned
above [Scheinberg et al., 2010, Scheinberg and Ma, 2012].
The recent survey by Boyd et al. [2011] gives an overview
of the theory and applications of ADMM. Here we limit
ourselves to a description of the method when applied to a
problem of the form (7). The ADMM iteration consists of
two alternating minimization steps (over x and y) of the
augmented Lagrangian
  L(x, y, z) = f(x) + ‖y‖ + zᵀ(y − Ax + b) + (t/2) ‖y − Ax + b‖²₂,

followed by an update

  z := z + t(y − Ax + b)

of the dual variable z. The complexity of minimizing over x depends on the properties of f. If f is quadratic, for example, it reduces to a least-squares problem. The minimization of the augmented Lagrangian over y reduces to the evaluation of the prox-operator of the norm ‖·‖.
A numerical comparison of the ADMM and proximal
gradient algorithms for nuclear norm minimization can be
found in the recent paper by Fazel et al. [2011].
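A minimal ADMM sketch for a problem of the form (7), with the strongly convex choice f(x) = ½‖x − c‖²₂ and the 1-norm, following the splitting y = Ax − b and the augmented Lagrangian given above. The problem data and the penalty parameter t are assumptions made for illustration.

```python
# ADMM for  minimize 0.5*||x - c||_2^2 + ||A x - b||_1  via the splitting y = A x - b.
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 20
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
c = rng.standard_normal(n)
t = 1.0                                             # augmented Lagrangian parameter

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

x = np.zeros(n); y = np.zeros(m); z = np.zeros(m)
H = np.eye(n) + t * A.T @ A                         # the x-update is a least-squares solve
for k in range(200):
    x = np.linalg.solve(H, c + A.T @ z + t * A.T @ (y + b))
    y = soft_threshold(A @ x - b - z / t, 1.0 / t)  # prox of the 1-norm
    z = z + t * (y - A @ x + b)                     # dual (multiplier) update
```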
5. SUMMARY
Advances in algorithms for large-scale nondifferentiable convex optimization are leading to a greater role for convex optimization in system identification and time series modeling. These techniques are based on formulations that incorporate convex penalty functions that promote low-dimensional model structure (such as sparsity or rank).
Similar techniques have been used extensively in signal
processing, image processing, and machine learning. While at this point theoretical results that characterize the success of these convex heuristics in system identification are limited, the extensive theory that supports 1-norm techniques in sparse optimization gives hope that progress can be made in our understanding of similar techniques for system identification as well.
ACKNOWLEDGMENT
This material is based upon work supported by the National Science Foundation under Grants No. ECCS-0824003 and ECCS-1128817. Any opinions, findings, and
conclusions or recommendations expressed in this material
are those of the author and do not necessarily reflect the
views of the National Science Foundation.
REFERENCES
M. S. Andersen, J. Dahl, Z. Liu, and L. Vandenberghe. Interior-point methods for large-scale cone programming. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 55–83. MIT Press, 2012.
E. Avventi, A. Lindquist, and B. Wahlberg. Graphical models of autoregressive moving-average processes. In The 19th International Symposium on Mathematical Theory of Networks and Systems (MTNS 2010), July 2010.
F. Bach. Structured sparsity-inducing norms through submodular functions. 2010. Available from arxiv.org/abs/1008.4220.
F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2011.
F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 19–53. MIT Press, 2012.
O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516, 2008.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
S. Becker, E. J. Candès, and M. Grant. Templates for convex cone problems with applications to sparse signal recovery. 2010. arxiv.org/abs/1009.2065.
S. Becker, J. Bobin, and E. Candès. NESTA: a fast and accurate first-order method for sparse recovery. SIAM Journal on Imaging Sciences, 4(1):1–39, 2011.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
D. R. Brillinger. Remarks concerning graphical models for time series and point processes. Revista de Econometria, 16:1–23, 1996.
S. Burer and R. D. C. Monteiro. A nonlinear programming
algorithm for solving semidefinite programs via low-rank
factorization. Mathematical Programming (Series B), 95
(2), 2003.
S. Burer and R. D. C. Monteiro. Local minima and convergence in low-rank semidefinite programming. Mathematical Programming (Series A), 103(3), 2005.
E. Candès and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.
E. J. Candès and Y. Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.
E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
E. J. Candès and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
E. J. Candès and T. Tao. Near-optimal signal recovery from random projections and universal encoding strategies. IEEE Transactions on Information Theory, 52(12), 2006.
E. J. Candès and T. Tao. The power of convex relaxation: near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
E. J. Candès and M. B. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2):21–30, 2008.
E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006a.
E. J. Candès, J. K. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006b.
E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58(3), 2011.
V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. 2010. arXiv:1012.0621v1.
V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011.
S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33–61, 1999.
R. Dahlhaus. Graphical interaction models for multivariate time series. Metrika, 51(2):157–172, 2000.
T. Ding, M. Sznaier, and O. Camps. A rank minimization approach to fast dynamic event detection and track matching in video sequences. In Proceedings of the 46th IEEE Conference on Decision and Control, 2007.
D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
D. L. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory, 47(7):2845–2862, 2001.
D. L. Donoho and J. Tanner. Sparse nonnegative solutions of underdetermined systems by linear programming. Proceedings of the National Academy of Sciences of the United States of America, 102(27):9446–9451, 2005.
J. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse Gaussians. In Proceedings of the Conference on Uncertainty in AI, 2008.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
M. Elad. Sparse and Redundant Representations: From
Theory to Applications in Signal and Image Processing.
Springer, 2010.
M. Fazel. Matrix Rank Minimization with Applications. PhD thesis, Stanford University, 2002.
M. Fazel, H. Hindi, and S. Boyd. Rank minimization and applications in system theory. In Proceedings of the American Control Conference, pages 3273–3278, 2004.
M. Fazel, T. K. Pong, D. Sun, and P. Tseng. Hankel matrix rank minimization with applications to system identification and realization. 2011. Submitted.
J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432, 2008.
P. M. O. Gebraad, J. W. van Wingerden, G. J. van der Veen, and M. Verhaegen. LPV subspace identification using a novel nuclear norm regularization method. In Proceedings of the American Control Conference, pages 165–170, 2011.
M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming (web page and software). http://stanford.edu/~boyd/cvx, 2007.
C. Grossmann, C. N. Jones, and M. Morari. System identification via nuclear norm regularization for simulated moving bed processes from incomplete data sets. In Proceedings of the 48th IEEE Conference on Decision and Control, pages 4692–4697, 2009.
R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12:2297–2334, 2011.
L. Li and K.-C. Toh. An inexact interior point method for L1-regularized sparse covariance selection. Mathematical Programming Computation, 2:291–315, 2010.
Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation with application to system identification. SIAM Journal on Matrix Analysis and Applications, 31:1235–1256, 2009a.
Z. Liu and L. Vandenberghe. Semidefinite programming methods for system realization and identification. In Proceedings of the 48th IEEE Conference on Decision and Control, pages 4676–4681, 2009b.

L. Ljung. System Identification: Theory for the User.


Prentice Hall, Upper Saddle River, New Jersey, second
edition, 1999.
J. Löfberg. YALMIP: A toolbox for modeling and optimization in MATLAB. In Proceedings of the CACSD Conference, Taipei, Taiwan, 2004.
Z. Lu. Smooth optimization approach for sparse covariance selection. SIAM Journal on Optimization, 19(4):1807–1827, 2009.
R. Mazumder, T. Hastie, and R. Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11:2287–2322, 2010.
N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34(3):1436–1462, 2006.
K. Mohan and M. Fazel. Reweighted nuclear norm minimization with application to system identification. In Proceedings of the American Control Conference (ACC), pages 2953–2959, 2010.
J. J. Moreau. Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France, 93:273–299, 1965.
Yu. Nesterov. Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2004.
Yu. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming Series A, 103:127–152, 2005.
T. K. Pong, P. Tseng, Shuiwang Ji, and J. Ye. Trace norm regularization: reformulations, algorithms, and multi-task learning. SIAM Journal on Optimization, 20(6):3465–3489, 2011.
P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence, 2008. arxiv.org/abs/0811.3628.
B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
J. Romberg. Imaging via compressive sampling. IEEE Signal Processing Magazine, 25(2):14–20, 2008.
L. Rudin, S. J. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268, 1992.
K. Scheinberg and S. Ma. Optimization methods for sparse inverse covariance selection. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 455–477. MIT Press, 2012.
K. Scheinberg and I. Rish. SINCO - a greedy coordinate ascent method for the sparse inverse covariance selection problem. Technical report, IBM Research Report, 2009.
K. Scheinberg, S. Ma, and D. Goldfarb. Sparse inverse covariance selection via alternating linearization methods. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2101–2109. 2010.
J. Songsiri and L. Vandenberghe. Topology selection in graphical models of autoregressive processes. Journal of Machine Learning Research, 11:2671–2705, 2010.
J. Songsiri, J. Dahl, and L. Vandenberghe. Graphical models of autoregressive processes. In Y. Eldar and D. Palomar, editors, Convex Optimization in Signal Processing and Communications, pages 89–116. Cambridge University Press, Cambridge, 2010.
N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1329–1336. MIT Press, Cambridge, MA, 2005.
R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267–288, 1996.
J. A. Tropp. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030–1051, 2006.
P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. 2008.
M. Verhaegen and V. Verdult. Filtering and System Identification. Cambridge University Press, 2007.
M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19, 2007.

Distributed Change Detection ⋆

Henrik Ohlsson, Tianshi Chen, Sina Khoshfetrat Pakazad, Lennart Ljung, S. Shankar Sastry

Division of Automatic Control, Department of Electrical Engineering, Linköping University, Sweden, e-mail: ohlsson@isy.liu.se.
Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, CA, USA.

Abstract: Change detection has traditionally been seen as a centralized problem. Many change detection problems are however distributed in nature, and the need for distributed change detection algorithms is therefore significant. In this paper a distributed change detection algorithm is proposed. The change detection problem is first formulated as a convex optimization problem and then solved in a distributed manner with the alternating direction method of multipliers (ADMM). To further reduce the computational burden on each sensor, a homotopy solution is also derived. The proposed method has interesting connections with the Lasso and compressed sensing, and the theory developed for these methods is therefore directly applicable.
1. INTRODUCTION
The change detection problem is often thought of as a centralized problem. Many scenarios are however distributed and lack a central node, or require distributed processing. A practical example is a sensor network. Selecting one of the sensors as a central node may make the network vulnerable. Moreover, it may be preferable if a failing sensor can be detected in a distributed manner. Another practical example is the monitoring of a fleet of agents (airplanes/UAVs/robots) of the same type, see e.g., Chu et al. [2011]. The problem is how to detect whether one or more agents start deviating from the rest. Theoretically, this can be done straightforwardly in a centralized manner. The centralized solution, however, poses many difficulties in practice. For instance, the communication between the central monitor and the agents, and the computation capacity and speed of the central monitor, are highly demanding due to the large number of agents in the fleet and/or the extremely large data sets to be processed, Chu et al. [2011]. Therefore, it is desirable to deal with the change detection problem in a distributed way.
In a distributed setting, there will be no central node. Each sensor or agent makes use of measurements from itself and the other sensors or agents to detect if it has failed or not. To tackle the problem, we first formulate the change detection problem as a convex optimization problem. We then solve the problem in a distributed manner using the so-called alternating direction method of multipliers (ADMM, see for instance Boyd et al. [2011]). The optimization problem turns out to have connections with the Lasso [Tibshirani, 1996] and compressive sensing [Candès et al., 2006, Donoho, 2006] and the theory developed for
⋆ Ohlsson, Chen and Ljung are partially supported by the Swedish Research Council in the Linnaeus center CADICS and by the European Research Council under the advanced grant LEARN, contract 267381. Ohlsson is also supported by a postdoctoral grant from the Sweden-America Foundation, donated by ASEA's Fellowship Fund, and by a postdoctoral grant from the Swedish Research Council.

these methods is therefore applicable. To further reduce the computational burden on each sensor, a homotopy solution (see e.g., Garrigues and El Ghaoui [2008]) is also studied. Finally, we show the effectiveness of the proposed method by a numerical example.
2. PROBLEM FORMULATION
The basic idea of the proposed method is to use system identification, in a distributed manner, to obtain a nominal model for the sensors or agents and then detect whether one or more sensors or agents start deviating from this nominal model.

To set up the notation, assume that we are given a sensor network consisting of N sensors. Denote the measurement from sensor i at time t by y_i(t) and assume that there is a linear relation of the form

  y_i(t) = φ_iᵀ(t) θ + e_i(t),   (1)

describing the relation between the measurable quantity y_i(t) ∈ ℝⁿ and the known quantity φ_iᵀ(t) ∈ ℝ^{n×m}. We will call θ ∈ ℝ^m the state. The state is related to the sensor reading through φ_i(t). e_i(t) ∈ ℝⁿ is the measurement noise, assumed to be white Gaussian with zero mean and variance σ_i², and e_i(t) is assumed independent of e_j(t) for all i = 1, . . . , N and j = 1, . . . , i−1, i+1, . . . , N. At time t it is assumed that sensor i obtains y_i(t) and knows φ_i(t).

The problem is now, in a distributed manner, to detect a failing sensor. That is, to detect if the relation (1) is no longer valid.

Remark 2.1. (Time varying state). A dynamical equation or a random walk type of description for the state can be incorporated. This is straightforward but, for the sake of clarity and due to page limitations, it is not shown here. Actually, the only restriction is that θ does not vary over sensors (that is, it does not depend on i).

Remark 2.2. (Partially observed state). Note that (1) does not imply that the sensors need to measure all elements of θ. Some sensors can observe some parts and other sensors other parts.

Remark 2.3. (Time-varying network topology). That sensors are added and taken away from the network is a very realistic scenario. We will assume that N is the maximum number of sensors in the network and set φ_i(t) = 0 if sensor i is not present at time t.

Remark 2.4. (Multidimensional y_i(t)). For notational simplicity, from now on we let y_i(t) ∈ ℝ. However, the extension to multidimensional y_i(t) is straightforward.

Remark 2.5. (Distributed system identification). The proposed algorithm could also be seen as a robust distributed system identification scheme. The algorithm computes, in a distributed manner, a nominal model using observations from several systems and is robust to systems deviating from the majority.

A straightforward way to solve the distributed change detection problem as posed here would be to
(1) locally, at each sensor, estimate θ,
(2) broadcast the estimates and the error covariances,
(3) at each sensor, fuse the estimates,
(4) at each sensor, use a likelihood ratio test to detect a failing sensor (see e.g., Willsky and Jones [1976], Willsky [1976]).
This method will work fine as long as the number of measurements available at each sensor well exceeds m. Let us say that {(y_i(t), φ_i(t))}_{t=T−T_i+1}^{T} is available at sensor i, i = 1, . . . , N. It is hence required that T_1, T_2, . . . , T_N ≫ m. If m > T_i for some i = 1, . . . , N, the method will however fail. That m > T_i for some i = 1, . . . , N is a very realistic scenario. T_i may for example be very small if new sensors may be added to the network at any time. The case T_1, T_2, . . . , T_N ≫ m was previously discussed in Chu et al. [2011].

Remark 2.6. (Sending all data). One may also consider broadcasting data and solving the full problem on each sensor. Sending all data available at time T may however be too much. Sensor i would then have to broadcast {(y_i(t), φ_i(t))}_{t=T−T_i+1}^{T}.
3. BACKGROUND
Change detection has a long history (see e.g., Gustafsson [2001], Patton et al. [1989], Basseville and Nikiforov [1993] and references therein) but has traditionally been seen as a centralized problem. The literature on distributed or decentralized change detection is therefore rather small, and only standard methods such as CUSUM and the generalized likelihood ratio (GLR) test have been discussed and extended to distributed scenarios (see e.g., Tartakovsky and Veeravalli [2002, 2003]). The method proposed here certainly has a strong connection to GLR (see for instance Ohlsson et al. [2012]) and an extensive comparison is seen as future work.
The change detection algorithm proposed here also has connections to compressive sensing and ℓ1-minimization. There are several comprehensive review papers that cover the literature on compressive sensing and related optimization techniques in linear programming. The reader is referred to the works of Candès and Wakin [2008], Bruckstein et al. [2009], Loris [2009], Yang et al. [2010].
4. PROPOSED METHOD - BATCH SOLUTION

Assume that the data set {(y_i(t), φ_i(t))}_{t=T−T_i+1}^{T} is available at sensor i, i = 1, . . . , N. Since (1) is assumed to hold for functioning sensors, we would like to detect a failing sensor by checking if its likelihood falls below some threshold. What complicates the problem is that:
• θ is unknown,
• m > T_i for some i = 1, . . . , N, typically.
We first solve the problem in a centralized setting.

4.1 Centralized Solution

Introduce θ_i for the state of sensor i, i = 1, . . . , N. Assume that we know that k sensors have failed. The maximum likelihood (ML) solution for θ_i, i = 1, . . . , N (taking into account that N − k sensors have the same state) can then be computed by

  min_{θ_1,...,θ_N,θ}  Σ_{i=1}^{N} Σ_{t=T−T_i+1}^{T} ‖y_i(t) − φ_iᵀ(t)θ_i‖²_{σ_i²}   (2a)
  subj. to  ‖[ ‖θ_1 − θ‖_p  . . .  ‖θ_N − θ‖_p ]‖₀ = k,   (2b)

with ‖·‖_p being the p-norm, p ≥ 1, ‖·‖₀ the zero-norm (the number of nonzero elements), and ‖·‖²_{σ_i²} defined as ‖·/σ_i‖². The k failing sensors could now be identified as the sensors for which ‖θ_i − θ‖_p ≠ 0. It follows from basic optimization theory (see for instance Boyd and Vandenberghe [2004]) that there exists a λ > 0 such that

  min_{θ_1,...,θ_N,θ}  Σ_{i=1}^{N} Σ_{t=T−T_i+1}^{T} ‖y_i(t) − φ_iᵀ(t)θ_i‖²_{σ_i²} + λ ‖[ ‖θ_1 − θ‖_p  . . .  ‖θ_N − θ‖_p ]‖₀   (3)

gives exactly the same estimate for θ_1, . . . , θ_N, θ as (2). However, both (2) and (3) are non-convex and combinatorial, and in practice unsolvable.

What makes (3) non-convex is the second term. It has recently become popular to approximate the zero-norm by its convex envelope, that is, to replace the zero-norm by the one-norm. This is in line with the reasoning behind the Lasso [Tibshirani, 1996] and compressed sensing [Candès et al., 2006, Donoho, 2006]. Relaxing the zero-norm by replacing it with the one-norm leads to the convex criterion

  min_{θ,θ_1,...,θ_N}  Σ_{i=1}^{N} Σ_{t=T−T_i+1}^{T} ‖y_i(t) − φ_iᵀ(t)θ_i‖²_{σ_i²} + λ Σ_{i=1}^{N} ‖θ_i − θ‖_p.   (4)

Here θ should be interpreted as the nominal model. Most sensors will have data that can be explained by the nominal model, and the criterion (4) will therefore give θ_i = θ for most i's. However, failing sensors will generate a data sequence that could not have been generated by the nominal model represented by θ, and for these sensors (4) will give θ_i ≠ θ.

In (4), λ regulates the trade-off between misfit to the observations and deviation from the nominal model θ. In practice, a large λ will make us less sensitive to noise but may also imply that we fail to detect a deviating sensor. However, a too small λ may in a noisy environment generate false alarms. λ should be seen as an application dependent design parameter. The estimates of (2) and (3) are indifferent to the choice of p. The estimate of (4) is not, however. In general, p = 1 is a good choice if one is interested in detecting changes in individual elements of the sensors' or agents' states. If one only cares about detecting whether a sensor or agent is failing, p > 1 is a better choice.

What is remarkable is that under some conditions on φ_i(t) and the number of failures, the criterion (4) will work exactly as well as (2). That is, (4) and (2) will pick out exactly the same sensors as failing sensors. To examine when this happens, theory developed in compressive sensing can be used. This is not discussed here but is a good future research direction.
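A centralized sketch of the relaxed criterion (4) for p = 2 (a sum-of-norms penalty), written with the cvxpy package, is given below. The network size, the synthetic data, the noise level sigma, and the weight lam are all assumptions made for the example, not values from the paper.

```python
# Centralized sketch of (4) with p = 2:
# minimize sum_i ||Y_i - Phi_i th_i||^2 / sigma_i^2 + lam * sum_i ||th_i - th||_2.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
N, m, Ti = 10, 20, 15                       # sensors, state dimension, samples per sensor
theta_nom = rng.standard_normal(m)
Phi = [rng.standard_normal((Ti, m)) for i in range(N)]
Y = [Phi[i] @ theta_nom + 0.1 * rng.standard_normal(Ti) for i in range(N)]
Y[2] = Phi[2] @ (theta_nom + 2.0) + 0.1 * rng.standard_normal(Ti)   # sensor 2 "fails"

sigma2, lam = 0.1 ** 2, 50.0
theta = cp.Variable(m)
thetas = [cp.Variable(m) for i in range(N)]
fit = sum(cp.sum_squares(Y[i] - Phi[i] @ thetas[i]) / sigma2 for i in range(N))
dev = sum(cp.norm(thetas[i] - theta, 2) for i in range(N))
cp.Problem(cp.Minimize(fit + lam * dev)).solve()
# Sensors with a clearly nonzero ||thetas[i].value - theta.value|| are flagged as deviating.
```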
4.2 Distributed Solution
Let us now apply ADMM (see e.g., Boyd et al. [2011], [Bertsekas and Tsitsiklis, 1997, Sec. 3.4]) to solve the identification problem in a distributed manner. First let

  Y_i = [ y_i(T−T_i+1);  y_i(T−T_i+2);  . . . ;  y_i(T) ],   Φ_i = [ φ_iᵀ(T−T_i+1);  φ_iᵀ(T−T_i+2);  . . . ;  φ_iᵀ(T) ],   (5)

stacking rows on top of each other. The optimization problem (4) can then be written as

  min_{θ, θ_1, ϑ_1, ..., θ_N, ϑ_N}  Σ_{i=1}^{N} [ ‖Y_i − Φ_iθ_i‖²_{σ_i²} + λ‖ϑ_i − θ_i‖_p ],   (6a)
  subj. to  ϑ_i − θ = 0,   i = 1, . . . , N.   (6b)

Let x = [θ_1ᵀ . . . θ_Nᵀ ϑ_1ᵀ . . . ϑ_Nᵀ]ᵀ, let ηᵀ = [η_1ᵀ η_2ᵀ . . . η_Nᵀ] be the Lagrange multiplier vector, and let η_i be the Lagrange multiplier associated with the ith constraint ϑ_i − θ = 0, i = 1, . . . , N. The augmented Lagrangian then takes the following form:

  L_ρ(θ, x, η) = Σ_{i=1}^{N} [ ‖Y_i − Φ_iθ_i‖²_{σ_i²} + λ‖ϑ_i − θ_i‖_p + η_iᵀ(ϑ_i − θ) + (ρ/2)‖ϑ_i − θ‖² ].   (7)

ADMM consists of the following update rules:

  x^{k+1} = argmin_x L_ρ(θ^k, x, η^k),   (8a)
  θ^{k+1} = (1/N) Σ_{i=1}^{N} ( ϑ_i^{k+1} + (1/ρ)η_i^k ),   (8b)
  η_i^{k+1} = η_i^k + ρ( ϑ_i^{k+1} − θ^{k+1} ),   for i = 1, . . . , N.   (8c)

Remark 4.1. It should be noted that given θ^k, η^k, the criterion L_ρ(θ^k, x, η^k) in (8a) is separable in terms of the pairs ϑ_i, θ_i, i = 1, . . . , N. Therefore, the optimization can be done separately, for each i, as

  ϑ_i^{k+1}, θ_i^{k+1} = argmin_{ϑ_i, θ_i}  ‖Y_i − Φ_iθ_i‖²_{σ_i²} + λ‖ϑ_i − θ_i‖_p + (η_i^k)ᵀ(ϑ_i − θ^k) + (ρ/2)‖ϑ_i − θ^k‖².   (9)

Remark 4.2. (Boyd et al. [2011]). It is interesting to note that no matter what η^1 is,

  Σ_{i=1}^{N} η_i^k = 0,   k ≥ 2.   (10)

To show (10), first note that

  Σ_{i=1}^{N} η_i^{k+1} = Σ_{i=1}^{N} η_i^k + ρ Σ_{i=1}^{N} ϑ_i^{k+1} − ρN θ^{k+1},   k ≥ 1.   (11)

Inserting θ^{k+1} from (8b) into the above equation yields (10). So, without loss of generality, further assume

  Σ_{i=1}^{N} η_i^1 = 0.   (12)

Then the update on θ reduces to

  θ^{k+1} = (1/N) Σ_{i=1}^{N} ϑ_i^{k+1},   k ≥ 1.   (13)

As a result, in order to implement the ADMM in a distributed manner, each sensor or system i should follow the steps below (a numerical sketch of this loop is given after the list).

(1) Initialization: set θ^1, η_i^1 and ρ.
(2) ϑ_i^{k+1}, θ_i^{k+1} = argmin_{θ_i, ϑ_i} L_ρ(θ^k, x, η^k).
(3) Broadcast ϑ_i^{k+1} to the other systems (sensors) j = 1, . . . , i−1, i+1, . . . , N.
(4) θ^{k+1} = (1/N) Σ_{i=1}^{N} ϑ_i^{k+1}.
(5) η_i^{k+1} = η_i^k + ρ( ϑ_i^{k+1} − θ^{k+1} ).
(6) If not converged, set k = k + 1 and return to step 2.
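The following sketch simulates the per-sensor loop above serially (in a real network each sensor would solve its own subproblem and broadcast the result). The data, the parameters rho, lam, sigma2, and the use of cvxpy for the local subproblems are assumptions made for illustration; vartheta_i plays the role of the local copy ϑ_i.

```python
# Simulated distributed ADMM loop for criterion (4), p = 2 (assumed setup).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
N, m, Ti = 10, 20, 15
theta_nom = rng.standard_normal(m)
Phi = [rng.standard_normal((Ti, m)) for i in range(N)]
Y = [Phi[i] @ theta_nom + 0.1 * rng.standard_normal(Ti) for i in range(N)]
sigma2, lam, rho = 0.1 ** 2, 50.0, 1.0

theta = np.zeros(m)                                # global (nominal) state estimate
eta = [np.zeros(m) for i in range(N)]              # per-sensor multipliers
for k in range(10):
    vartheta = []
    for i in range(N):                             # step 2: local subproblems
        th_i = cp.Variable(m); vt_i = cp.Variable(m)
        obj = (cp.sum_squares(Y[i] - Phi[i] @ th_i) / sigma2
               + lam * cp.norm(vt_i - th_i, 2)
               + cp.sum(cp.multiply(eta[i], vt_i - theta))
               + (rho / 2) * cp.sum_squares(vt_i - theta))
        cp.Problem(cp.Minimize(obj)).solve()
        vartheta.append(vt_i.value)                # step 3: broadcast the local copy
    theta = np.mean(vartheta, axis=0)              # step 4: average
    for i in range(N):                             # step 5: multiplier update
        eta[i] = eta[i] + rho * (vartheta[i] - theta)
```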
To show that ADMM gives:
• θ^k − ϑ_i^k → 0 as k → ∞, i = 1, . . . , N,
• Σ_{i=1}^{N} ‖Y_i − Φ_iθ_i^k‖²_{σ_i²} + λ‖ϑ_i^k − θ_i^k‖_p → p*, where p* is the optimal objective of (4),
it is sufficient to show that the Lagrangian (L_0(θ, x, η), the augmented Lagrangian evaluated at ρ = 0) has a saddle point, according to [Boyd et al., 2011, Sect. 3.2.1] (since the objective consists of closed, proper and convex functions). Let θ*, x* denote the solution of (4). It is easy to show that θ*, x* and η = 0 is a saddle point. Since L_0(θ, x, 0) is convex,

  L_0(θ*, x*, 0) ≤ L_0(θ, x, 0)   ∀ θ, x,   (14)

and since L_0(θ*, x*, 0) = L_0(θ*, x*, η) for all η, the point θ*, x*, η = 0 must be a saddle point. ADMM hence converges to the solution of (4) in the sense listed above.
5. PROPOSED METHOD - RECURSIVE SOLUTION
To apply the above batch method to a scenario where we continuously get new measurements, we propose to re-run the batch algorithm every Tth sample time:
(1) Initialize by running the batch algorithm proposed in the previous section on the data available.
(2) Every Tth time step, re-run the batch algorithm using the sT, s ∈ ℕ, last data. Initialize the ADMM iterations using the estimates of θ and η from the previous run. Considering the fact that faults occur rarely over time, the optimal solutions for different data batches are often similar. As a result, using the estimates of θ and η from the previous run to initialize the ADMM algorithm can speed up the convergence of the ADMM algorithm considerably.
To add T new observation pairs, one could possibly use an extended version of the homotopy algorithm proposed by Garrigues and El Ghaoui [2008]. The homotopy algorithm presented in Garrigues and El Ghaoui [2008] was developed for including new observations in the Lasso.

The ADMM algorithm presented in the previous section could also be altered to use single data samples upon arrival. In such a setup, instead of waiting for a collection or batch of measurements, the algorithm is updated upon arrival of new measurements. This can be done by studying the augmented Lagrangian of the problem. The augmented Lagrangian in (7) can also be written in normalized form as

  L_ρ(θ, x, η) = Σ_{i=1}^{N} [ ‖Y_i − Φ_iθ_i‖²_{σ_i²} + λ‖ϑ_i − θ_i‖_p + (ρ/2)‖ϑ_i − (θ − u_i)‖² − (ρ/2)‖u_i‖² ],   (15)

where u_i = η_i/ρ. Hence, for p = 2, the update in (9) can be achieved by solving the following convex optimization problem, which can be written as a Second Order Cone Programming (SOCP) problem [Boyd and Vandenberghe, 2004]:

  min_{θ_i, ϑ_i, s}  θ_iᵀH_iθ_i − 2θ_iᵀh_i + (ρ/2)ϑ_iᵀϑ_i − ϑ_iᵀh̄_i^k + λs
  subj. to  ‖θ_i − ϑ_i‖ ≤ s,   (16)

where the following data matrices describe this optimization problem:

  H_i = Φ_iᵀΦ_i/σ_i²,   h_i = Φ_iᵀY_i/σ_i²,   h̄_i^k = ρθ^k − η_i^k.   (17)

As can be seen from (17), among these matrices only H_i and h_i are affected by new measurements. Let y_i^{new} and φ_i^{new} denote the new measurements. Then H_i and h_i can be updated as follows:

  H_i ← H_i + φ_i^{new}(φ_i^{new})ᵀ/σ_i²,   h_i ← h_i + φ_i^{new} y_i^{new}/σ_i².   (18)

To handle single data samples upon arrival, step 2 of the ADMM algorithm should therefore be altered to:
(2) If there exist any new measurements, update H_i and h_i according to (18). Find ϑ_i^{k+1}, θ_i^{k+1} by solving (16).
Remark 5.1. In order for this algorithm to be responsive to the arrival of new measurements, it is required to have network-wide persistent communication. As a result, this approach demands much higher communication traffic than the batch solution.
6. IMPLEMENTATION ISSUES
Step 2 of ADMM requires solving the optimization problem

  min_x L_ρ(θ^k, x, η^k).   (19)

This problem is solved locally on each sensor once every ADMM iteration. What varies from one iteration to the next are the values of the arguments θ^k and η^k. However, it is unlikely that θ^k and η^k differ significantly from θ^{k+1} and η^{k+1}. Taking advantage of this fact can considerably ease the computational load on each sensor. We present two methods for doing this: warm start and a homotopy method.
The following two subsections are rather technical, and we refer the reader not interested in implementing the proposed algorithm to Section 7.

in (16) and (17). However, in the batch solution, among


k changes with the iteration
the matrices in (17), only h
i
number. Therefore, if we assume that the vectors k and
ik do not change drastically from iteration to iteration, it
is possible to use the solution for the problem in (16) at the
kth iteration to warm start the problem at the (k + 1)th
iteration. This is done as follows.
The Lagrangian for the problem in (16) can be written as
k
L(i , #i , s, zi ) = iT Hi i 2iT hi + (/2)#T
#T
i #i
i hi

s
zi1
,
,
(20)
+ s
x i #i
zi2

for all kzi2 k zi1 . By this, the optimality conditions for


the problem in (16), can be written as
2Hi i 2hi zi2 = 0
(21a)
k

#i hi + zi2 = 0
(21b)
zi1 = 0
(21c)
kzi2 k zi1
(21d)
ki #i k s
(21e)
#i ) = 0.

(21f)

zi1
[Boyd and Vandenberghe, 2004]. Let i ,
zi2
#i , t and zi be the primal and dual optimums for the
k+1 = h
k +
problem in (16) at the kth iteration and let h
i
i
i . These can be used to generate a good warm start
h
point for the solver that solves (16). As a result, by (21)
the following vectors can be used to warm start the solver
iw = i

i
h
#w
i = #i +
(22)
w

zi = zi
sw = s + s
where s should be chosen such that

s
kiw #w
i ks +
(23)
w
wT w
#w
zi1 (s + s) + zi2 (i
i ) = ,
for some 0.
where zi =

6.2 A Homotopy Algorithm


Since it is unlikely that k and k dier significantly
from k+1 and k+1 , one can use the previously computed
solution xk and through a homotopy update the solution
instead of resolving (19) from scratch every iteration. We
will in the following assume that p = 1 and leave the details
for p > 1.
First, define
= #i i .
The optimization objective of (9) is then

(24)

gik (i , i ) ,kYi

i i k

2
i

+ k i k1 + (ik )T (

+ i

k )

+(/2)k i + i k k2 .
(25)
It is then straightforward to show that the optimization
problem (9) is equivalent to
ik+1 ,

  θ_i^{k+1}, γ_i^{k+1} = argmin_{θ_i, γ_i} g_i^k(θ_i, γ_i).
(26)
Moreover,
  β_i^{k+1} = γ_i^{k+1} + θ_i^{k+1}.
(27)

Now, compute the subdifferential of g_i^k(θ_i, γ_i) w.r.t. θ_i and
γ_i. Simple calculations show that
  ∂_{θ_i} g_i^k(θ_i, γ_i) = ∇_{θ_i} g_i^k(θ_i, γ_i) = (2/σ_i^2) θ_i^T Φ_i^T Φ_i − (2/σ_i^2) Y_i^T Φ_i + (u_i^k)^T + ρ (θ_i + γ_i − z^k)^T,   (28a)
  ∂_{γ_i} g_i^k(θ_i, γ_i) = λ ∂‖γ_i‖_1 + (u_i^k)^T + ρ (θ_i + γ_i − z^k)^T.   (28b)

A necessary condition for the global optimal solution
θ_i^{k+1}, γ_i^{k+1} of the optimization problem (9) is
  0 ∈ ∂_{θ_i} g_i^k(θ_i^{k+1}, γ_i^{k+1}),   (29a)
  0 ∈ ∂_{γ_i} g_i^k(θ_i^{k+1}, γ_i^{k+1}).   (29b)

It follows from (28a) and (29a) that
  θ_i^{k+1} = R_i^{-1} ( h_i − u_i^k / 2 − (ρ/2)(γ_i − z^k) ),
(30)
where we have let
  R_i = Φ_i^T Φ_i / σ_i^2 + (ρ/2) I,   h_i = Φ_i^T Y_i / σ_i^2.
(31)

With (30), the problem now reduces to how to solve (29b).
Inserting (30) into (28b), and letting Q_i ≜ I − (ρ/2) R_i^{-1}, yields
  ∂_{γ_i} g_i^k(θ_i^{k+1}, γ_i) = λ ∂‖γ_i‖_1 + ρ h_i^T R_i^{-1} + (u_i^k − ρ z^k + ρ γ_i)^T Q_i.
Now, replace z^k with z^k + t Δz^{k+1} and u_i^k with u_i^k + t Δu_i^{k+1}.
Let
  G_i^k(t) = λ ∂‖γ_i‖_1 + ρ h_i^T R_i^{-1} + (u_i^k − ρ z^k + ρ γ_i)^T Q_i + t (Δu_i^{k+1} − ρ Δz^{k+1})^T Q_i.
(32)

G_i^k(t) hence equals ∂_{γ_i} g_i^k(θ_i^{k+1}, γ_i) with the shifted
arguments inserted. Let γ_i^k(t) denote the minimizer of the
corresponding objective, i.e., 0 ∈ G_i^k(t) at γ_i^k(t). It follows that
  γ_i^{k+1} = γ_i^k(0),   0 ∈ G_i^k(0),   γ_i^{k+2} = γ_i^k(1).
(33)
Assume now that γ_i^{k+1} has been computed and that the
elements have been arranged such that the q first elements
are nonzero and the last m − q are zero. Let us write
  γ_i^k(0) = [ γ̄_i^k ; 0 ].
(34)
We then have that (both the sign and | · | taken elementwise)
  ∂‖γ_i^k(0)‖_1 = [ sign(γ̄_i^k)^T   v^T ],   v ∈ R^{m−q},  |v| ≤ 1.
(35)


Hence, 0 ∈ G_i^k(0) is equivalent to
  λ sign(γ̄_i^k)^T + ρ h_i^T R̄_i^{-1} + (u_i^k − ρ z^k + ρ γ_i^k)^T Q̄_i = 0,   (36a)
  λ v^T + ρ h_i^T R̃_i^{-1} + (u_i^k − ρ z^k + ρ γ_i^k)^T Q̃_i = 0,   (36b)
with
  R_i^{-1} = [ R̄_i^{-1}  R̃_i^{-1} ],   R̄_i^{-1} ∈ R^{m×q},   R̃_i^{-1} ∈ R^{m×(m−q)},   (37)
  Q_i = [ Q̄_i  Q̃_i ],   Q̄_i ∈ R^{m×q},   Q̃_i ∈ R^{m×(m−q)}.   (38)

It can be shown that the partition (34), i.e., the support
of γ_i^k(t), will stay unchanged for t ∈ [0, t*), t* > 0 (see
Lemma 1 of Garrigues and El Ghaoui [2008]). It hence
follows that for t ∈ [0, t*]
  γ̄_i^k(t)^T = −( (λ/ρ) sign(γ̄_i^k)^T + h_i^T R̄_i^{-1} + (u_i^k/ρ − z^k)^T Q̄_i + t (Δu_i^{k+1}/ρ − Δz^{k+1})^T Q̄_i ) Q̂_i^{-1},
(39)
where Q̂_i denotes the q × q matrix made up of the top q
rows of Q̄_i. The corresponding v(t) is obtained from (36b)
with γ_i^k(t) inserted; both γ̄_i^k(t) and v(t) are hence linear
functions of t. Now, to find t*, we compute the minimal t that
makes one or more elements of v equal to 1 or −1, and/or
makes one or more elements of γ̄_i^k(t) equal to 0.
This minimal t will be t*. At t* the partition (34) changes:
elements corresponding to v-elements equal to 1 or −1 should be included in γ̄_i^k;
elements of γ̄_i^k(t) equal to 0 should be fixed to zero.
Given the solution γ_i^k(t*), we can now continue in the same
way as above to compute γ_i^k(t) for t ∈ [t*, t**]. The procedure
continues until γ_i^k(1) has been computed. Due to space
limitations we have chosen not to give a summary of
the algorithm. We instead refer the interested reader to
download the implementation available from http://www.
rt.isy.liu.se/~ohlsson/code.html.
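As an illustration of the breakpoint search for t* described above, the sketch below computes the smallest t > 0 at which an element of v(t) reaches ±1 or an element of γ̄(t) reaches 0, using the fact that both are affine in t. All names are ours; this is not the authors' released implementation.

import numpy as np

def next_breakpoint(gamma0, dgamma, v0, dv, tol=1e-12):
    """Smallest t > 0 where an entry of gamma(t) = gamma0 + t*dgamma hits 0,
    or an entry of v(t) = v0 + t*dv hits +1 or -1."""
    candidates = []
    with np.errstate(divide="ignore", invalid="ignore"):
        tg = -gamma0 / dgamma            # gamma_j(t) = 0
        tv_pos = (1.0 - v0) / dv         # v_j(t) = +1
        tv_neg = (-1.0 - v0) / dv        # v_j(t) = -1
    for arr in (tg, tv_pos, tv_neg):
        candidates.extend(arr[np.isfinite(arr) & (arr > tol)])
    return min(candidates) if candidates else np.inf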

7. NUMERICAL ILLUSTRATION


For the numerical illustration, we consider a network of
N = 10 sensors with θ_1 = θ_2 = · · · = θ_10 being random
samples from N(0, I) ∈ R^20. We assume that there
exist 10 batches of measurements, each consisting of 15
samples. The regressors were unit Gaussian noise and the
measurement noise variance was set to one. In the scenario
considered in this section, we simulate failures in sensors
2 and 5, upon the arrival of the 4th and 6th measurement
batches, respectively. This is done by changing the 5th
component of θ_2 by multiplying it by 5 and changing the
8th component of θ_5 by shifting (adding) it by 5. It is
assumed that the faults are persistent.
With λ = ρ = 20, the centralized solution, the ADMM solution and
the ADMM solution with homotopy give identical results (up to the
2nd digit; 10 iterations were used in ADMM). As can be
seen from Figures 1-3, the results correctly detect that the
2nd and 5th sensors are faulty. In addition, as can be seen
from Figures 2 and 3, ADMM and ADMM with homotopy
show for how many data batches the sensors remained
faulty. The results also detect which elements of θ_2 and
θ_5 deviated from the nominal value. In this example (using
the ADMM algorithm), each sensor had to broadcast
m × (ADMM iterations) × (number of batches) = 20 × 10 ×
10 = 2000 scalar values. If instead all data were
shared, each sensor would have to broadcast (m +
1) × T × (number of batches) = (20 + 1) × 15 × 10 = 3150
scalar values. Using the proposed algorithm, the traffic over
the network can hence be made considerably lower while
keeping the performance of a centralized change detection
algorithm. Using the homotopy algorithm (or warm start)
to solve step 2 of the ADMM algorithm will not affect the
traffic over the network, but could lower the computational
burden on the sensors. It is also worth noting that the
classical approach using a likelihood ratio test, as described
in Section 3, would not work on this example since 20 =
m > T = 15.


Fig. 1. Results from the centralized change detection. As
can be seen, sensors 2 and 5 are detected to be faulty.

Fig. 2. Results from the ADMM batch solution. Sensors 2
and 5 have been detected faulty for 6 and 4 batches.

Fig. 3. Results from the homotopy solution. Sensors 2 and
5 have been detected faulty for 6 and 4 batches.
8. CONCLUSION
This paper has presented a distributed change detection
algorithm. Change detection is most often seen as a centralized
problem. As many scenarios are naturally distributed, there is a
need for distributed change detection algorithms. The basic idea
of the proposed distributed change detection algorithm is to use
system identification, in a distributed manner, to obtain a nominal
model for the sensors or agents, and then detect whether one or
more sensors or agents start deviating from this nominal model.
The proposed formulation takes the form of a convex optimization
problem. We show how this can be solved distributively and
present a homotopy algorithm to ease the computational load.
The proposed formulation has connections with Lasso and
compressed sensing, and theory developed for these methods is
therefore directly applicable.

M. Basseville and I. V. Nikiforov. Detection of Abrupt Changes


Theory and Application. Prentice-Hall, Englewood Cliffs, NJ,
1993.
D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed
Computation: Numerical Methods. Athena Scientific, 1997.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge
University Press, 2004.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed
optimization and statistical learning via the alternating direction
method of multipliers. Foundations and Trends in Machine
Learning, 2011.
A. M. Bruckstein, D. L. Donoho, and M. Elad. From sparse solutions
of systems of equations to sparse modeling of signals and images.
SIAM Review, 51(1):3481, 2009.
E. J. Cand`
es and M. B. Wakin. An introduction to compressive
sampling. Signal Processing Magazine, IEEE, 25(2):2130, March
2008.
E. J. Cand`
es, J. Romberg, and T. Tao. Robust uncertainty principles:
Exact signal reconstruction from highly incomplete frequency
information. IEEE Transactions on Information Theory, 52:489
509, February 2006.
E. Chu, D. Gorinevsky, and S. Boyd. Scalable statistical monitoring
of fleet data. In Proceedings of the 18th IFAC World Congress,
pages 1322713232, Milan, Italy, August 2011.
D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):12891306, April 2006.
P. Garrigues and L. El Ghaoui. An homotopy algorithm for the
lasso with online observations. In Proceedings of the 22nd Annual
Conference on Neural Information Processing Systems (NIPS),
2008.
F. Gustafsson. Adaptive Filtering and Change Detection. Wiley,
New York, 2001.
I. Loris. On the performance of algorithms for the minimization of
`1 -penalized functionals. Inverse Problems, 25:116, 2009.
H. Ohlsson, F. Gustafsson, L. Ljung, and S. Boyd. Smoothed state
estimates under abrupt changes using sum-of-norms regularization. Automatica, 48(4):595605, 2012.
R. Patton, P. Frank, and R. Clark. Fault Diagnosis in Dynamic
Systems Theory and Application. Prentice Hall, 1989.
A. Tartakovsky and V. Veeravalli. Quickest change detection in
distributed sensor systems. In Proceedings of the 6th International Conference on Information Fusion, pages 756763, Cairns,
Australia, July 2003.
A.G. Tartakovsky and V.V. Veeravalli. An efficient sequential
procedure for detecting changes in multichannel and distributed
systems. In Proceedings of the Fifth International Conference on
Information Fusion, pages 4148, 2002.
R. Tibsharani. Regression shrinkage and selection via the lasso.
Journal of Royal Statistical Society B (Methodological), 58(1):
267288, 1996.
A. Willsky. A survey of design methods for failure detection in
dynamic systems. Automatica, 12:601611, 1976.
A. Willsky and H. Jones. A generalized likelihood ratio approach
to the detection and estimation of jumps in linear systems.
IEEE Transactions on Automatic Control, 21(1):108112, February 1976.
A. Yang, A. Ganesh, Y. Ma, and S. Sastry. Fast `1 -minimization
algorithms and an application in robust face recognition: A review.
In ICIP, 2010.

An ADMM Algorithm for a Class of Total Variation Regularized Estimation Problems ?

Bo Wahlberg, Stephen Boyd, Mariette Annergren, and Yang Wang

Automatic Control Lab and ACCESS, School of Electrical Engineering,
KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden

Department of Electrical Engineering, Stanford University, Stanford,
CA 94305, USA
Abstract: We present an alternating augmented Lagrangian method for convex optimization
problems where the cost function is the sum of two terms, one that is separable in the variable
blocks, and a second that is separable in the difference between consecutive variable blocks.
Examples of such problems include Fused Lasso estimation, total variation denoising, and multi-period portfolio optimization with transaction costs. In each iteration of our method, the first
step involves separately optimizing over each variable block, which can be carried out in parallel.
The second step is not separable in the variables, but can be carried out very efficiently. We apply
the algorithm to segmentation of data based on changes in mean (`1 mean filtering) or changes
in variance (`1 variance filtering). In a numerical example, we show that our implementation is
around 10000 times faster compared with the generic optimization solver SDPT3.
Keywords: Signal processing algorithms, stochastic parameters, parameter estimation, convex
optimization and regularization
1. INTRODUCTION
In this paper we consider optimization problems where
the objective is a sum of two terms: The first term is
separable in the variable blocks, and the second term is
separable in the difference between consecutive variable
blocks. One example is the Fused Lasso method in statistical learning, Tibshirani et al. [2005], where the objective
includes an `1 -norm penalty on the parameters, as well as
an `1 -norm penalty on the difference between consecutive
parameters. The first penalty encourages a sparse solution,
i.e., one with few nonzero entries, while the second penalty
enhances block partitions in the parameter space. The
same ideas have been applied in many other areas, such as
Total Variation (TV) denoising, Rudin et al. [1992], and
segmentation of ARX models, Ohlsson et al. [2010] (where
it is called sum-of-norms regularization). Another example
is multi-period portfolio optimization, where the variable
blocks give the portfolio in different time periods, the first
term is the portfolio objective (such as risk-adjusted return), and the second term accounts for transaction costs.
In many applications, the optimization problem involves
a large number of variables, and cannot be efficiently
handled by generic optimization solvers. In this paper,
our main contribution is to derive an efficient and scalable
optimization algorithm, by exploiting the structure of the
optimization problem. To do this, we use a distributed
? This work was partially supported by the Swedish Research Council, the Linnaeus Center ACCESS at KTH and the European Research Council under the advanced grant LEARN, contract 267381.

optimization method called Alternating Direction Method


of Multipliers (ADMM). ADMM was developed in the
1970s, and is closely related to many other optimization
algorithms including Bregman iterative algorithms for `1
problems, Douglas-Rachford splitting, and proximal point
methods; see Eckstein and Bertsekas [1992], Combettes
and Pesquet [2007]. ADMM has been applied in many
areas, including image and signal processing, Setzer [2011],
as well as large-scale problems in statistics and machine
learning, Boyd et al. [2011].
We will apply ADMM to `1 mean filtering and `1 variance
filtering (Wahlberg et al. [2011]), which are important
problems in signal processing with many applications, for
example in financial or biological data analysis. In some
applications, mean and variance filtering are used to preprocess data before fitting a parametric model. For nonstationary data it is also important for segmenting the
data into stationary subsets. The approach we present is
inspired by the `1 trend filtering method described in Kim
et al. [2009], which tracks changes in the mean value of
the data. (An example in this paper also tracks changes in
the variance of the underlying stochastic process.) These
problems are closely related to the covariance selection
problem, Dempster [1972], which is a convex optimization
problem when the inverse covariance is used as the optimization variable, Banerjee et al. [2008]. The same ideas
can also be found in Kim et al. [2009] and Friedman et al.
[2008].

This paper is organized as follows. In Section 2 we review


the ADMM method. In Section 3, we apply ADMM to our
optimization problem to derive an efficient optimization
algorithm. In Section 4.1 we apply our method to `1
mean filtering, while in Section 4.2 we consider `1 variance
filtering. Section 5 contains some numerical examples, and
Section 6 concludes the paper.
2. ALTERNATING DIRECTION METHOD OF
MULTIPLIERS (ADMM)

We terminate the algorithm when the primal and dual


residuals satisfy a stopping criterion (which can vary
depending on the requirements of the application). A
typical criterion is to stop when
kekp k2 pri , kekd k2 dual .

Here, the tolerances pri > 0 and dual > 0 can be set via
an absolute plus relative criterion,
p
pri = nabs + rel max{kxk k2 , kz k k2 },
p
dual = nabs + rel kuk k2 ,

In this section we give an overview of ADMM. We follow


closely the development in Section 5 of Boyd et al. [2011].

where abs > 0 and rel > 0 are absolute and relative
tolerances (see Boyd et al. [2011] for details).

Consider the following optimization problem
  minimize  f(x)
  subject to  x ∈ C,
(1)
with variable x ∈ R^n, and where f and C are convex. We
let p* denote the optimal value of (1). We first re-write the
problem as
  minimize  f(x) + I_C(z)
  subject to  x = z,
(2)
where I_C(z) is the indicator function on C (i.e., I_C(z) = 0
for z ∈ C, and I_C(z) = ∞ for z ∉ C). The augmented
Lagrangian for this problem is
  L_ρ(x, z, u) = f(x) + I_C(z) + (ρ/2)‖x − z + u‖_2^2,
where u is a scaled dual variable associated with the
constraint x = z, i.e., u = (1/ρ)y, where y is the dual
variable for x = z. Here, ρ > 0 is a penalty parameter.

In each iteration of ADMM, we perform alternating minimization
of the augmented Lagrangian over x and z. At
iteration k we carry out the following steps
  x^{k+1} := argmin_x { f(x) + (ρ/2)‖x − z^k + u^k‖_2^2 },   (3)
  z^{k+1} := Π_C(x^{k+1} + u^k),   (4)
  u^{k+1} := u^k + (x^{k+1} − z^{k+1}),   (5)
where Π_C denotes Euclidean projection onto C. In the
first step of ADMM, we fix z and u and minimize the
augmented Lagrangian over x; next, we fix x and u and
minimize over z; finally, we update the dual variable u.
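To make the iteration (3)-(5) concrete, here is a minimal Python sketch of a generic ADMM loop for this problem class. The function names (prox_f, project_C) and the toy example are ours, not from the paper.

import numpy as np

def admm(prox_f, project_C, n, rho=1.0, iters=100):
    """Generic ADMM iteration (3)-(5) for: minimize f(x) subject to x in C.

    prox_f(v)    : argmin_x { f(x) + (rho/2)||x - v||^2 }   (x-update (3))
    project_C(v) : Euclidean projection of v onto C          (z-update (4))
    """
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    for _ in range(iters):
        x = prox_f(z - u)        # (3)
        z = project_C(x + u)     # (4)
        u = u + (x - z)          # (5), scaled dual update
    return x, z, u

# Tiny example: f(x) = 0.5*||x - y||^2 with C the nonnegative orthant.
y = np.array([1.0, -2.0, 0.5])
rho = 1.0
x_opt, _, _ = admm(lambda v: (y + rho * v) / (1.0 + rho),
                   lambda v: np.maximum(v, 0.0), n=3, rho=rho)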
2.1 Convergence
Under mild assumptions on f and C, we can show that the
iterates of ADMM converge to a solution; specifically, we
have
  f(x^k) → p*,   x^k − z^k → 0,
as k → ∞. The rate of convergence, and hence the number
of iterations required to achieve a specified accuracy, can
depend strongly on the choice of the parameter ρ. When
ρ is well chosen, this method can converge to a fairly
accurate solution (good enough for many applications)
within a few tens of iterations. However, if the choice of ρ
is poor, many iterations can be needed for convergence.
These issues, including heuristics for choosing ρ, are discussed
in more detail in Boyd et al. [2011].
2.2 Stopping criterion

The primal and dual residuals at iteration k are given by
  e_p^k = x^k − z^k,   e_d^k = −ρ (z^k − z^{k−1}).
We terminate the algorithm when the primal and dual
residuals satisfy a stopping criterion (which can vary
depending on the requirements of the application). A
typical criterion is to stop when
  ‖e_p^k‖_2 ≤ ε_pri,   ‖e_d^k‖_2 ≤ ε_dual.
Here, the tolerances ε_pri > 0 and ε_dual > 0 can be set via
an absolute plus relative criterion,
  ε_pri = √n ε_abs + ε_rel max{‖x^k‖_2, ‖z^k‖_2},
  ε_dual = √n ε_abs + ε_rel ρ ‖u^k‖_2,
where ε_abs > 0 and ε_rel > 0 are absolute and relative
tolerances (see Boyd et al. [2011] for details).

3. PROBLEM FORMULATION AND METHOD

In this section we formulate our problem and derive an
efficient distributed optimization algorithm via ADMM.

3.1 Optimization problem

We consider the problem
  minimize  Σ_{i=1}^N φ_i(x_i) + Σ_{i=1}^{N−1} ψ_i(r_i)
  subject to  r_i = x_{i+1} − x_i,   i = 1, . . . , N − 1,
(6)
with variables x_1, . . . , x_N, r_1, . . . , r_{N−1} ∈ R^n, and where
φ_i : R^n → R ∪ {∞} and ψ_i : R^n → R ∪ {∞} are convex
functions.
This problem has the form (1), with variables x =
(x_1, . . . , x_N), r = (r_1, . . . , r_{N−1}), objective function
  f(x, r) = Σ_{i=1}^N φ_i(x_i) + Σ_{i=1}^{N−1} ψ_i(r_i),
and constraint set
  C = { (x, r) | r_i = x_{i+1} − x_i, i = 1, . . . , N − 1 }.
(7)

The ADMM form for problem (6) is
  minimize  Σ_{i=1}^N φ_i(x_i) + Σ_{i=1}^{N−1} ψ_i(r_i) + I_C(z, s)
  subject to  r_i = s_i,  i = 1, . . . , N − 1,
              x_i = z_i,  i = 1, . . . , N,
(8)
with variables x = (x_1, . . . , x_N), r = (r_1, . . . , r_{N−1}),
z = (z_1, . . . , z_N), and s = (s_1, . . . , s_{N−1}). Furthermore,
we let u = (u_1, . . . , u_N) and t = (t_1, . . . , t_{N−1}) be vectors
of scaled dual variables associated with the constraints
x_i = z_i, i = 1, . . . , N, and r_i = s_i, i = 1, . . . , N − 1 (i.e.,
u_i = (1/ρ)y_i, where y_i is the dual variable associated with
x_i = z_i).

3.2 Distributed optimization method


Applying ADMM to problem (8), we carry out the following steps in each iteration.
Step 1. Since the objective function f is separable in x_i
and r_i, the first step (3) of the ADMM algorithm consists
of 2N − 1 separate minimizations
  x_i^{k+1} := argmin_{x_i} { φ_i(x_i) + (ρ/2)‖x_i − z_i^k + u_i^k‖_2^2 },   (9)
i = 1, . . . , N, and
  r_i^{k+1} := argmin_{r_i} { ψ_i(r_i) + (ρ/2)‖r_i − s_i^k + t_i^k‖_2^2 },   (10)
i = 1, . . . , N − 1. These updates can all be carried out in
parallel. For many applications, we will see that we can
often solve (9) and (10) analytically.

Step 2. In the second step of ADMM, we project (x^{k+1} +
u^k, r^{k+1} + t^k) onto the constraint set C, i.e.,
  (z^{k+1}, s^{k+1}) := Π_C((x^{k+1}, r^{k+1}) + (u^k, t^k)).
For the particular constraint set (7), we will show in
Section 3.3 that the projection can be performed extremely
efficiently.

Step 3. Finally, we update the dual variables:
  u_i^{k+1} := u_i^k + (x_i^{k+1} − z_i^{k+1}),   i = 1, . . . , N,
and
  t_i^{k+1} := t_i^k + (r_i^{k+1} − s_i^{k+1}),   i = 1, . . . , N − 1.
These updates can also be carried out independently in
parallel, for each variable block.
3.3 Projection
In this section we work out an efficient formula for
projection onto the constraint set C (7). To perform the
projection
  (z, s) = Π_C((w, v)),
we solve the optimization problem
  minimize  ‖z − w‖_2^2 + ‖s − v‖_2^2
  subject to  s = Dz,
with variables z = (z_1, . . . , z_N) and s = (s_1, . . . , s_{N−1}),
and where D ∈ R^{(N−1)n × Nn} is the block forward difference
operator, i.e., block row i of D has −I in block column i and
I in block column i + 1.
This problem is equivalent to
  minimize  ‖z − w‖_2^2 + ‖Dz − v‖_2^2,
with variable z = (z_1, . . . , z_N). Thus to perform the
projection we first solve the optimality condition
  (I + D^T D) z = w + D^T v,
(11)
for z, then we let s = Dz.

The matrix I + D^T D is block tridiagonal, with diagonal
blocks equal to multiples of I, and sub/super-diagonal
blocks equal to −I. Let LL^T be the Cholesky factorization
of I + D^T D. It is easy to show that L is block banded, of
the form
  L = L̃ ⊗ I,
where L̃ is the N × N lower bidiagonal matrix with diagonal
entries l_{1,1}, . . . , l_{N,N} and subdiagonal entries
l_{2,1}, . . . , l_{N,N−1}, and ⊗ denotes the Kronecker product.
The coefficients l_{i,j} can be explicitly computed via the recursion
  l_{1,1} = √2,
  l_{i+1,i} = −1/l_{i,i},   l_{i+1,i+1} = √(3 − l_{i+1,i}^2),   i = 1, . . . , N − 2,
  l_{N,N−1} = −1/l_{N−1,N−1},   l_{N,N} = √(2 − l_{N,N−1}^2).
The coefficients only need to be computed once, before the


projection operator is applied.
The projection therefore consists of the following steps
(1) Form b := w + D^T v:
  b_1 := w_1 − v_1,   b_N := w_N + v_{N−1},
  b_i := w_i + (v_{i−1} − v_i),   i = 2, . . . , N − 1.
(2) Solve Ly = b:
  y_1 := (1/l_{1,1}) b_1,
  y_i := (1/l_{i,i})(b_i − l_{i,i−1} y_{i−1}),   i = 2, . . . , N.
(3) Solve L^T z = y:
  z_N := (1/l_{N,N}) y_N,
  z_i := (1/l_{i,i})(y_i − l_{i+1,i} z_{i+1}),   i = N − 1, . . . , 1.
(4) Set s = Dz:
  s_i := z_{i+1} − z_i,   i = 1, . . . , N − 1.
Thus, we see that we can perform the projection very
efficiently, in O(N n) flops (floating-point operations). In
fact, if we pre-compute the inverses 1/li,i , i = 1, . . . , N , the
only operations that are required are multiplication, addition, and subtraction. We do not need to perform division,
which can be expensive on some hardware platforms.
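A minimal NumPy sketch of the projection steps (1)-(4) above, operating blockwise on (N, n) arrays; the array names are ours, and the Cholesky coefficients would in practice be precomputed once.

import numpy as np

def tv_projection(w, v):
    """Project (w, v) onto C = {(z, s): s_i = z_{i+1} - z_i} via (11) and steps (1)-(4).

    w : (N, n) array of blocks w_1..w_N
    v : (N-1, n) array of blocks v_1..v_{N-1}
    Returns (z, s) with s = Dz.
    """
    N = w.shape[0]
    diag = np.empty(N)      # l_{i,i}
    sub = np.empty(N - 1)   # l_{i+1,i}
    diag[0] = np.sqrt(2.0)
    for i in range(N - 2):
        sub[i] = -1.0 / diag[i]
        diag[i + 1] = np.sqrt(3.0 - sub[i] ** 2)
    sub[N - 2] = -1.0 / diag[N - 2]
    diag[N - 1] = np.sqrt(2.0 - sub[N - 2] ** 2)

    # (1) b := w + D^T v
    b = w.copy()
    b[0] -= v[0]
    b[N - 1] += v[N - 2]
    for i in range(1, N - 1):
        b[i] += v[i - 1] - v[i]

    # (2) forward solve L y = b
    y = np.empty_like(w)
    y[0] = b[0] / diag[0]
    for i in range(1, N):
        y[i] = (b[i] - sub[i - 1] * y[i - 1]) / diag[i]

    # (3) backward solve L^T z = y
    z = np.empty_like(w)
    z[N - 1] = y[N - 1] / diag[N - 1]
    for i in range(N - 2, -1, -1):
        z[i] = (y[i] - sub[i] * z[i + 1]) / diag[i]

    # (4) s = D z
    s = z[1:] - z[:-1]
    return z, s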
4. EXAMPLES
4.1 `1 Mean filtering
Consider a sequence of vector random variables
  Y_i ∼ N(ȳ_i, Σ),   i = 1, . . . , N,
where ȳ_i ∈ R^n is the mean, and Σ ∈ S^n_+ is the covariance
matrix. We assume that the covariance matrix is known,
but the mean of the process is unknown. Given a sequence
of observations y_1, . . . , y_N, our goal is to estimate the mean
under the assumption that it is piecewise constant, i.e.,
ȳ_{i+1} = ȳ_i for many values of i.
In the Fused Group Lasso method, we obtain our estimates
by solving
  minimize  (1/2) Σ_{i=1}^N (y_i − x_i)^T Σ^{-1} (y_i − x_i) + λ Σ_{i=1}^{N−1} ‖r_i‖_2
  subject to  r_i = x_{i+1} − x_i,   i = 1, . . . , N − 1,
with variables x_1, . . . , x_N, r_1, . . . , r_{N−1}. Let x_1*, . . . , x_N*,
r_1*, . . . , r_{N−1}* denote an optimal point; our estimates of
ȳ_1, . . . , ȳ_N are x_1*, . . . , x_N*.
This problem is clearly in the form (6), with
  φ_i(x_i) = (1/2)(y_i − x_i)^T Σ^{-1}(y_i − x_i),   ψ_i(r_i) = λ‖r_i‖_2.

ADMM steps. For this problem, steps (9) and (10) of
ADMM can be further simplified. Step (9) involves minimizing
an unconstrained quadratic function in the variable
x_i, and can be written as
  x_i^{k+1} = (Σ^{-1} + ρI)^{-1} ( Σ^{-1} y_i + ρ(z_i^k − u_i^k) ).
Step (10) is
  r_i^{k+1} := argmin_{r_i} { λ‖r_i‖_2 + (ρ/2)‖r_i − s_i^k + t_i^k‖_2^2 },
which simplifies to
  r_i^{k+1} = S_{λ/ρ}(s_i^k − t_i^k),
(12)
where S_κ is the vector soft thresholding operator, defined
as
  S_κ(a) = (1 − κ/‖a‖_2)_+ a,   S_κ(0) = 0.
Here the notation (v)_+ = max{0, v} denotes the positive
part of the vector v. (For details see Boyd et al. [2011].)
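To make the `1 mean-filtering updates concrete, here is a small NumPy sketch of steps (9) and (10) above, including the soft thresholding of (12). The function and variable names are ours, and the covariance is taken as given.

import numpy as np

def soft_threshold(a, kappa):
    """Vector soft thresholding S_kappa(a) = (1 - kappa/||a||_2)_+ a."""
    norm = np.linalg.norm(a)
    if norm == 0.0:
        return np.zeros_like(a)
    return max(1.0 - kappa / norm, 0.0) * a

def x_update(y_i, Sigma_inv, z_i, u_i, rho):
    """Step (9): unconstrained quadratic minimization."""
    n = y_i.shape[0]
    return np.linalg.solve(Sigma_inv + rho * np.eye(n),
                           Sigma_inv @ y_i + rho * (z_i - u_i))

def r_update(s_i, t_i, lam, rho):
    """Step (10): vector soft thresholding, cf. (12)."""
    return soft_threshold(s_i - t_i, lam / rho)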

Variations. In some problems, we might expect that
individual components of x_i will be piecewise constant, in
which case we can instead use the standard Fused Lasso
method. In the standard Fused Lasso method we solve
  minimize  (1/2) Σ_{i=1}^N (y_i − x_i)^T Σ^{-1} (y_i − x_i) + λ Σ_{i=1}^{N−1} ‖r_i‖_1
  subject to  r_i = x_{i+1} − x_i,   i = 1, . . . , N − 1,
with variables x_1, . . . , x_N, r_1, . . . , r_{N−1}. The ADMM
updates are the same, except that instead of doing vector
soft thresholding for step (10), we perform scalar componentwise
soft thresholding, i.e.,
  (r_i^{k+1})_j = S_{λ/ρ}((s_i^k − t_i^k)_j),   j = 1, . . . , n.

4.2 `1 Variance filtering

Consider a sequence of vector random variables (of dimension n)
  Y_i ∼ N(0, Σ_i),   i = 1, . . . , N,
where Σ_i ∈ S^n_+ is the covariance matrix for Y_i (which
we assume is fixed but unknown). Given observations
y_1, . . . , y_N, our goal is to estimate the sequence of
covariance matrices Σ_1, . . . , Σ_N, under the assumption
that it is piecewise constant, i.e., it is often the case that
Σ_{i+1} = Σ_i. In order to obtain a convex problem, we use
the inverse covariances X_i = Σ_i^{-1} as our variables.
The Fused Group Lasso method for this problem involves
solving
  minimize  Σ_{i=1}^N ( Tr(X_i y_i y_i^T) − log det X_i ) + λ Σ_{i=1}^{N−1} ‖R_i‖_F
  subject to  R_i = X_{i+1} − X_i,   i = 1, . . . , N − 1,
where our variables are R_i ∈ S^n, i = 1, . . . , N − 1, and
X_i ∈ S^n_+, i = 1, . . . , N. Here,
  ‖R_i‖_F = √(Tr(R_i^T R_i))
is the Frobenius norm of R_i. Let X_1*, . . . , X_N*, R_1*, . . . , R_{N−1}*
denote an optimal point; our estimates of Σ_1, . . . , Σ_N are
(X_1*)^{-1}, . . . , (X_N*)^{-1}.

ADMM steps. It is easy to see that steps (9) and (10)
simplify for this problem. Step (9) requires solving
  X_i^{k+1} := argmin_{X_i ≻ 0} { φ_i(X_i) + (ρ/2)‖X_i − Z_i^k + U_i^k‖_F^2 },
where
  φ_i(X_i) = Tr(X_i y_i y_i^T) − log det X_i.
This update can be solved analytically, as follows.
(1) Compute the eigenvalue decomposition of
  ρ(Z_i^k − U_i^k) − y_i y_i^T = Q Γ Q^T,
where Γ = diag(γ_1, . . . , γ_n).
(2) Now let
  x_j := ( γ_j + √(γ_j^2 + 4ρ) ) / (2ρ),   j = 1, . . . , n.
(3) Finally, we set
  X_i^{k+1} = Q diag(x_1, . . . , x_n) Q^T.
For details of this derivation, see Section 6.5 in Boyd et al. [2011].

Step (10) is
  R_i^{k+1} := argmin_{R_i} { λ‖R_i‖_F + (ρ/2)‖R_i − S_i^k + T_i^k‖_F^2 },
which simplifies to
  R_i^{k+1} = S_{λ/ρ}(S_i^k − T_i^k),
where S_κ is a matrix soft threshold operator, defined as
  S_κ(A) = (1 − κ/‖A‖_F)_+ A,   S_κ(0) = 0.
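A minimal sketch of the analytic X_i-update above (eigenvalue decomposition followed by the scalar formula in step (2)), following the derivation in Section 6.5 of Boyd et al. [2011]; the function and variable names are ours.

import numpy as np

def covsel_x_update(y_i, Z_i, U_i, rho):
    """Analytic update of X_i = Sigma_i^{-1} in the l1 variance filtering problem."""
    S = np.outer(y_i, y_i)                        # empirical second moment y_i y_i^T
    gamma, Q = np.linalg.eigh(rho * (Z_i - U_i) - S)
    x = (gamma + np.sqrt(gamma ** 2 + 4.0 * rho)) / (2.0 * rho)
    return Q @ np.diag(x) @ Q.T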

Variations. As with `1 mean filtering, we can replace the
Frobenius norm penalty with a componentwise vector `1-norm
penalty on R_i to get the problem
  minimize  Σ_{i=1}^N ( Tr(X_i y_i y_i^T) − log det X_i ) + λ Σ_{i=1}^{N−1} ‖R_i‖_1
  subject to  R_i = X_{i+1} − X_i,   i = 1, . . . , N − 1,
with variables R_1, . . . , R_{N−1} ∈ S^n and X_1, . . . , X_N ∈ S^n_+,
and where
  ‖R‖_1 = Σ_{j,k} |R_{jk}|.
Again, the ADMM updates are the same; the only difference
is that in step (10) we replace matrix soft thresholding
with a componentwise soft threshold, i.e.,
  (R_i^{k+1})_{l,m} = S_{λ/ρ}((S_i^k − T_i^k)_{l,m}),
for l = 1, . . . , n, m = 1, . . . , n.

4.3 `1 Mean and variance filtering


Consider a sequence of vector random variables
  Y_i ∼ N(ȳ_i, Σ_i),   i = 1, . . . , N,
where ȳ_i ∈ R^n is the mean, and Σ_i ∈ S^n_+ is the covariance
matrix for Y_i. We assume that the mean and covariance
matrix of the process are unknown. Given observations
y_1, . . . , y_N, our goal is to estimate the mean and the
sequence of covariance matrices Σ_1, . . . , Σ_N, under the
assumption that they are piecewise constant, i.e., it is
often the case that ȳ_{i+1} = ȳ_i and Σ_{i+1} = Σ_i. To obtain a
convex optimization problem, we use
  X_i = (1/2) Σ_i^{-1},   m_i = Σ_i^{-1} ȳ_i,
as our variables. In the Fused Group Lasso method, we
obtain our estimates by solving
  minimize  Σ_{i=1}^N ( −(1/2) log det(2 X_i) + Tr(X_i y_i y_i^T) − m_i^T y_i + (1/4) Tr(X_i^{-1} m_i m_i^T) )
            + λ_1 Σ_{i=1}^{N−1} ‖r_i‖_2 + λ_2 Σ_{i=1}^{N−1} ‖R_i‖_F
  subject to  r_i = m_{i+1} − m_i,   i = 1, . . . , N − 1,
              R_i = X_{i+1} − X_i,   i = 1, . . . , N − 1,
with variables r_1, . . . , r_{N−1} ∈ R^n, m_1, . . . , m_N ∈ R^n,
R_1, . . . , R_{N−1} ∈ S^n, and X_1, . . . , X_N ∈ S^n_+.

ADMM steps. This problem is also in the form (6); however,
as far as we are aware, there is no analytical formula
for steps (9) and (10). To carry out these updates, we must
solve semidefinite programs (SDPs), for which there are a
number of efficient and reliable software packages (Toh
et al. [1999], Sturm [1999]).
5. NUMERICAL EXAMPLE
In this section we solve an instance of `1 mean filtering
with n = 1, Σ = 1, and N = 400, using the standard Fused
Lasso method. To improve convergence of the ADMM
algorithm, we use over-relaxation with α = 1.8, see Boyd
et al. [2011]. The parameter λ is chosen as approximately
10% of λ_max, where λ_max is the largest value that results
in a non-constant mean estimate. Here, λ_max ≈ 108 and
so λ = 10. We use an absolute plus relative error stopping
criterion, with ε_abs = 10^{-4} and ε_rel = 10^{-3}. Figure 1 shows
convergence of the primal and dual residuals. The resulting
estimates of the means are shown in Figure 2.
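For readers implementing this example, the over-relaxation mentioned above can be realized as in Boyd et al. [2011] by mixing the new primal iterate with the previous z and s before the projection and dual updates. The sketch below is ours (scalar case n = 1, dense linear solve instead of the banded solver, and not the authors' C implementation).

import numpy as np

def admm_fused_lasso_1d(y, lam, rho=10.0, alpha=1.8, iters=200):
    """Sketch of l1 mean filtering (standard Fused Lasso, n = 1) with over-relaxation."""
    N = len(y)
    D = np.diff(np.eye(N), axis=0)             # forward difference operator
    M = np.eye(N) + D.T @ D                    # system matrix of the projection (11)
    soft = lambda a, k: np.sign(a) * np.maximum(np.abs(a) - k, 0.0)
    x = np.zeros(N); r = np.zeros(N - 1)
    z = np.zeros(N); s = np.zeros(N - 1)
    u = np.zeros(N); t = np.zeros(N - 1)
    for _ in range(iters):
        x = (y + rho * (z - u)) / (1.0 + rho)  # step (9), quadratic x-update
        r = soft(s - t, lam / rho)             # step (10), componentwise soft threshold
        x_hat = alpha * x + (1 - alpha) * z    # over-relaxation
        r_hat = alpha * r + (1 - alpha) * s
        z = np.linalg.solve(M, (x_hat + u) + D.T @ (r_hat + t))   # projection (11)
        s = D @ z
        u = u + (x_hat - z)
        t = t + (r_hat - s)
    return x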
Fig. 2. Estimated means (solid line), true means (dashed
line) and measurements (crosses).
We solved the same `1 mean filtering problem using CVX,
a package for specifying and solving convex optimization
problems (Grant and Boyd [2011]). CVX calls the generic
SDP solvers SeDuMi (Sturm [1999]) or SDPT3 (Toh et al.
[1999]) to solve the problem. While these solvers are
reliable for wide classes of optimization problems, and
exploit sparsity in the problem formulation, they are
not customized for particular problem families, such as
ours. The computation time for CVX is approximately
20 seconds. Our ADMM algorithm (implemented in C)
took 2.2 milliseconds to produce the same estimates.
Thus, our algorithm is approximately 10000 times faster
compared with generic optimization packages. Indeed, our
implementation does not exploit the fact that steps 1 and
3 of ADMM can be implemented independently in parallel
for each measurement. Parallelizing steps 1 and 3 of the
computation can lead to further speedups. For example,
simple multi-threading on a quad-core CPU would result
in a further 4x speed-up.
6. CONCLUSIONS

In this paper we derived an efficient and scalable method


for an optimization problem (6) that has a variety of applications in control and estimation. Our custom method
exploits the structure of the problem via a distributed
optimization framework. In many applications, each step
of the method is a simple update that typically involves
solving a set of linear equations, matrix multiplication,
or thresholding, for which there are exceedingly efficient
libraries. In numerical examples we have shown that we
can solve problems such as `1 mean and variance filtering
many orders of magnitude faster than generic optimization
solvers such as SeDuMi or SDPT3.

Fig. 1. Residual convergence: primal residual e_p (solid
line) and dual residual e_d (dashed line).

The only tuning parameter for our method is the penalty
parameter ρ. Finding an optimal ρ is not a straightforward
problem, but Boyd et al. [2011] contains many heuristics
that work well in practice. For the `1 mean filtering example,
we find that setting ρ = λ works well, but we do not have a
formal justification.

REFERENCES
O. Banerjee, L. El Ghaoui, and A. dAspremont. Model
selection through sparse maximum likelihood estimation
for multivariate gaussian or binary data. Journal of
Machine Learning Research, 9:485516, 2008.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein.
Distributed optimization and statistical learning via the
alternating direction method of multipliers. Foundations
and Trends in Machine Learning, 3(1):1122, 2011.
P. L. Combettes and J. C. Pesquet. A Douglas-Rachford
splitting approach to nonsmooth convex variational signal recovery. Selected Topics in Signal Processing, IEEE
Journal of, 1(4):564 574, dec. 2007. ISSN 1932-4553.
doi: 10.1109/JSTSP.2007.910264.
A. P. Dempster. Covariance selection. Biometrics, 28(1):
157175, 1972.
J. Eckstein and D. P. Bertsekas. On the Douglas-Rachford
splitting method and the proximal point algorithm for
maximal monotone operators. Mathematical Programming, 55:293318, 1992.
J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse
covariance estimation with the graphical lasso. Biostatistics, 9(3):432441, 2008.
M. Grant and S. Boyd.
CVX: Matlab software
for disciplined convex programming, version 1.21.
http://cvxr.com/cvx, April 2011.
S. J. Kim, K. Koh, S. Boyd, and D. Gorinevsky. l1 trend
filtering. SIAM Review, 51(2):339360, 2009.
H. Ohlsson, L. Ljung, and S. Boyd. Segmentation of arxmodels using sum-of-norms regularization. Automatica,
46:1107 1111, April 2010.
L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total
variation based noise removal algorithms. Phys. D,
60:259268, November 1992. ISSN 0167-2789. doi:
http://dx.doi.org/10.1016/0167-2789(92)90242-F.
Simon Setzer. Operator splittings, bregman methods and
frame shrinkage in image processing. International
Journal of Computer Vision, 92(3):265280, 2011.
J. Sturm. Using SeDuMi 1.02, a MATLAB toolbox
for optimization over symmetric cones. Optimization
Methods and Software, 11:625653, 1999. Software
available at http://sedumi.ie.lehigh.edu/.
R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and
K. Knight. Sparsity and smoothness via the fused
lasso. Journal of the Royal Statistical Society: Series
B (Statistical Methodology), 67 (Part 1):91108, 2005.
K. Toh, M. Todd, and R. Tütüncü. SDPT3: a Matlab
software package for semidefinite programming, version
1.3. Optimization Methods and Software, 11(1):545-581,
1999.
B. Wahlberg, C. R. Rojas, and M. Annergren. On l1
mean and variance filtering. Proceedings of the FortyFifth Asilomar Conference on Signals, Systems and
Computers, 2011. arXiv/1111.5948.

Compressive Phase Retrieval From Squared Output Measurements
Via Semidefinite Programming ?

Henrik Ohlsson, Allen Y. Yang, Roy Dong, S. Shankar Sastry

Department of Electrical Engineering and Computer Sciences,
University of California at Berkeley, CA, USA (email:
{ohlsson,yang,roydong,sastry}@eecs.berkeley.edu).

Division of Automatic Control, Department of Electrical
Engineering, Linköping University, Sweden (e-mail:
ohlsson@isy.liu.se).

Abstract: Given a linear system in a real or complex domain, linear regression aims to recover
the model parameters from a set of observations. Recent studies in compressive sensing have
successfully shown that under certain conditions, a linear program, namely, `1 -minimization,
guarantees recovery of sparse parameter signals even when the system is underdetermined. In
this paper, we consider a more challenging problem: when the phase of the output measurements
from a linear system is omitted. Using a lifting technique, we show that even though the phase
information is missing, the sparse signal can be recovered exactly by solving a semidefinite
program when the sampling rate is sufficiently high. This is an interesting finding since the
exact solutions to both sparse signal recovery and phase retrieval are combinatorial. The results
extend the type of applications that compressive sensing can be applied to, namely, applications where only
output magnitudes can be observed. We demonstrate the accuracy of the algorithms through
extensive simulation and a practical experiment.
Keywords: Phase Retrieval; Compressive Sensing; Semidefinite Programming.
1. INTRODUCTION
Linear models, e.g., y = Ax, are by far the most used and
useful type of model. The main reasons for this are their
simplicity of use and identification. For the identification,
the least-squares (LS) estimate in a complex domain is
computed by 1
  x_ls = argmin_{x ∈ C^n} ‖y − Ax‖_2^2,
(1)

assuming the output y ∈ C^N and A ∈ C^{N×n} are given.


Further, the LS problem has a unique solution if the
system is full rank and not underdetermined, i.e., N ≥ n.
Consider the alternative scenario when the system is
underdetermined, i.e., n > N . The least squares solution is
no longer unique in this case, and additional knowledge has
? Ohlsson is partially supported by the Swedish foundation for
strategic research in the center MOVIII, the Swedish Research
Council in the Linnaeus center CADICS, the European Research
Council under the advanced grant LEARN, contract 267381, and a
postdoctoral grant from the Sweden-America Foundation, donated
by ASEAs Fellowship Fund. Sastry and Yang are partially supported
by an ARO MURI grant W911NF-06-1-0076. Dong is supported by
the NSF Graduate Research Fellowship under grant DGE 1106400,
and by the Team for Research in Ubiquitous Secure Technology
(TRUST), which receives support from NSF (award number CCF0424422).
1 Our derivation in this paper is primarily focused on complex
signals, but the results should be easily extended to real domain
signals.

to be used to determine a unique model parameter. Ridge


regression or Tikhonov regression [Hoerl and Kennard,
1970] is one of the traditional methods to apply in this
case, which takes the form
  x_r = argmin_x (1/2)‖y − Ax‖_2^2 + λ‖x‖_2^2,
(2)
where λ > 0 is a scalar parameter that decides the trade-off
between fit (the first term) and the `2-norm of x (the
second term).
Thanks to the `2 -norm regularization, ridge regression is
known to pick up solutions with small energy that satisfy
the linear model. In a more recent approach stemming
from the LASSO [Tibsharani, 1996] and compressive sensing (CS) [Cand`es et al., 2006, Donoho, 2006], another
convex regularization criterion has been widely used to
seek the sparsest parameter vector, which takes the form
1
x`1 = argmin ky Axk22 + kxk1 .
(3)
2
x
Depending on the choice of the weight parameter , the
program (3) has been known as the LASSO by Tibsharani
[1996], basis pursuit denoising (BPDN) by Chen et al.
[1998], or `1 -minimization (`1 -min) by Cand`es et al. [2006].
In recent years, several pioneering works have contributed
to efficiently solving sparsity minimization problems such
as Tropp [2004], Beck and Teboulle [2009], Bruckstein
et al. [2009], especially when the system parameters and
observations are in high-dimensional spaces.

In this paper, we consider a more challenging problem. We


still seek a linear model y = Ax. Rather than assuming
that y is given, we will assume that only the squared
magnitude of the output is given,
  b_i = |y_i|^2 = |⟨x, a_i⟩|^2,   i = 1, . . . , N,
(4)
where A^T = [a_1, . . . , a_N] ∈ C^{n×N} and y^T = [y_1, . . . , y_N] ∈
C^{1×N}. This is clearly a more challenging problem since
the phase of y is lost when only the (squared) magnitude
is available. A classical example is that y represents the
Fourier transform of x, and that only the Fourier transform
modulus is observable. This scenario arises naturally in
several practical applications such as optics (Walther [1963],
Millane [1990]), coherent diffraction imaging (Fienup [1987]),
astronomical imaging (Dainty
and Fienup [1987]), and is known as the phase retrieval
problem.
We note that in general the phase cannot be uniquely recovered,
regardless of whether the linear model is overdetermined
or not. A simple example to see this: if x_0 ∈ C^n is a
solution to y = Ax, then for any scalar c ∈ C on the
unit circle, cx_0 leads to the same squared output b. As
mentioned in Candès et al. [2011a], when the dictionary A
represents the unitary discrete Fourier transform (DFT),
the ambiguities may represent time-reversed solutions or
time-shifted solutions of the ground truth signal x0 . These
global ambiguities caused by losing the phase information
are considered acceptable in phase retrieval applications.
From now on, when we talk about the solution to the phase
retrieval problem, it is the solution up to a global phase.
Accordingly, a unique solution is a solution unique up to
a global phase.
Further note that since (4) is nonlinear in the unknown
x, N ≥ n measurements are in general needed for a
unique solution. When the number of measurements N
is fewer than necessary for a unique solution, additional
assumptions are needed to select one of the solutions (just
like in Tikhonov, LASSO and CS).
Finally, we note that the exact solution to either CS or
phase retrieval is combinatorially expensive (Chen et al.
[1998], Candès et al. [2011b]). Therefore, the goal of this
work is to answer the following question: Can we effectively
recover a sparse parameter vector x of a linear system, up
to a global ambiguity, using its squared magnitude output
measurements via convex programming? The problem is
referred to as compressive phase retrieval (CPR).
The main contribution of the paper is a convex formulation
of the sparse phase retrieval problem. Using a lifting technique, the NP-hard problem is relaxed as a semidefinite
program. Through extensive experiments, we compare the
performance of our CPR algorithm with traditional CS
and PhaseLift algorithms. The results extend the type
of applications that compressive sensing can be applied
to, namely, applications where only magnitudes can be
observed.
1.1 Background
Our work is motivated by the `1 -min problem in CS and
a recent PhaseLift technique in phase retrieval by Cand`es
et al. [2011b]. On one hand, the theory of CS and `1 -min
has been one of the most visible research topics in recent

years. There are several comprehensive review papers


that cover the literature of CS and related optimization
techniques in linear programming. The reader is referred
to the works of Cand`es and Wakin [2008], Bruckstein
et al. [2009], Loris [2009], Yang et al. [2010]. On the other
hand, the fusion of phase retrieval and matrix completion
is a novel topic that has recently been studied in a
selected few papers, such as Chai et al. [2010], Candès
et al. [2011b,a]. The fusion of phase retrieval and CS was
discussed in Moravec et al. [2007]. In the rest of the section,
we briefly review the phase retrieval literature and its
recent connections with CS and matrix completion.
Phase retrieval has been a longstanding problem in optics
and x-ray crystallography since the 1970s [Kohler and
Mandel, 1973, Gonsalves, 1976]. Early methods to recover
the phase signal using Fourier transform mostly relied on
additional information about the signal, such as band limitation, nonzero support, real-valuedness, and nonnegativity. The Gerchberg-Saxton algorithm was one of the popular algorithms that alternates between the Fourier and
inverse Fourier transforms to obtain the phase estimate iteratively [Gerchberg and Saxton, 1972, Fienup, 1982]. One
can also utilize steepest-descent methods to minimize the
squared estimation error in the Fourier domain [Fienup,
1982, Marchesini, 2007]. Common drawbacks of these iterative methods are that they may not converge to the
global solution, and the rate of convergence is often slow.
Alternatively, Balan et al. [2006] have studied a frame-theoretical approach to phase retrieval, which necessarily
relied on some special types of measurements.
More recently, phase retrieval has been framed as a low-rank matrix completion problem in Chai et al. [2010],
Candès et al. [2011a,b]. Given a system, a lifting technique
was used to approximate the linear model constraint as
a semidefinite program (SDP), which is similar to the
CPR objective function (10) only without the sparsity
constraint. The authors also derived the upper-bound for
the sampling rate that guarantees exact recovery in the
noise-free case and stable recovery in the noisy case.
We are aware of the work by Moravec et al. [2007],
which has considered compressive phase retrieval on a
random Fourier transform model. Leveraging the sparsity
constraint, the authors proved that an upper-bound of
O(k 2 log(4n/k 2 )) random Fourier modulus measurements
to uniquely specify k-sparse signals. Moravec et al. [2007]
also proposed a compressive phase retrieval algorithm.
Their solution largely follows the development of `1 -min in
CS, and it alternates between the domain of solutions that
give rise to the same squared output and the domain of an
`1 -ball with a fixed `1 -norm. However, the main limitation
of the algorithm is that it tries to solve a nonconvex
optimization problem which assumes the `1 -norm of the
true signal is known.
2. CPR VIA SDP
In the noise free case, the phase retrieval problem takes
the form of the feasibility problem:
H
find x subj. to b = |Ax|2 = {aH
i xx ai }1iN , (5)
where bT = [b1 , , bN ] 2 R1N . This is a combinatorial
problem to solve: Even in the real domain with the sign

of the measurements {i }N
i=1 { 1, 1}, one would have
to try out combinations of sign sequences until one that
satisfies
p
i bi = aTi x, i = 1, , N,
(6)
for some x 2 Rn has been found. For any practical size of
data sets, this combinatorial problem is intractable.


Since (5) is nonlinear in the unknown x, N ≥ n measurements
are in general needed for a unique solution. When
the number of measurements N is fewer than necessary
for a unique solution, additional assumptions are needed
to select one of the solutions. Motivated by compressive
sensing, we here choose to seek the sparsest solution of
CPR satisfying (5) or, equivalently, the solution to


  min_x ‖x‖_0,   subj. to   b = |Ax|^2 = { a_i^H x x^H a_i }_{1≤i≤N}.
(7)
As the counting norm ‖ · ‖_0 is not a convex function,
following the `1-norm relaxation in CS, (7) can be relaxed
as
  min_x ‖x‖_1,   subj. to   b = |Ax|^2 = { a_i^H x x^H a_i }_{1≤i≤N}.
(8)

Note that (8) is still not a linear program, as its equality


constraint is not a linear equation. In the literature, a
lifting technique has been extensively used to reframe
problems such as (8) to a standard form in semidefinite
programming, such as in Sparse PCA [dAspremont et al.,
2007].

More specifically, given the ground truth signal x_0 ∈ C^n,
let X_0 = x_0 x_0^H ∈ C^{n×n} be a rank-1 semidefinite matrix.
Then the CPR problem can be cast as 2
  min_X ‖X‖_1
  subj. to  b_i = Tr(a_i^H X a_i),  i = 1, . . . , N,
            rank(X) = 1,  X ⪰ 0.
(9)
This is of course still a non-convex problem due to the rank
constraint. The lifting approach addresses this issue by
replacing rank(X) with Tr(X). For a semidefinite matrix,
Tr(X) is equal to the sum of the eigenvalues of X. This
leads to an SDP
  min_X  Tr(X) + λ‖X‖_1
  subj. to  b_i = Tr(Φ_i X),  i = 1, . . . , N,
            X ⪰ 0,
(10)
where we further denote Φ_i = a_i a_i^H ∈ C^{n×n} and where
λ > 0 is a design parameter. Finally, the estimate of x
can be found by computing the rank-1 decomposition of
X via singular value decomposition. We will refer to the
formulation (10) as compressive phase retrieval via lifting
(CPRL).
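As an illustration of the program (10), here is a minimal sketch using the cvxpy modeling package; it is our own simplification to the real-valued case, with our variable names, and is not the authors' implementation.

import numpy as np
import cvxpy as cp

def cprl(A, b, lam):
    """Real-valued sketch of the CPRL program (10).

    A   : (N, n) real sensing matrix with rows a_i^T
    b   : (N,) squared output measurements b_i = (a_i^T x)^2
    lam : design parameter lambda weighting the sparsity term
    """
    N, n = A.shape
    X = cp.Variable((n, n), symmetric=True)
    constraints = [X >> 0]
    constraints += [A[i] @ X @ A[i] == b[i] for i in range(N)]
    objective = cp.Minimize(cp.trace(X) + lam * cp.sum(cp.abs(X)))
    cp.Problem(objective, constraints).solve()
    # Rank-1 estimate of x from the leading eigenpair of X.
    w, V = np.linalg.eigh(X.value)
    return np.sqrt(max(w[-1], 0.0)) * V[:, -1]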
We compare (10) to a recent solution of PhaseLift by
Candès et al. [2011b]. In Candès et al. [2011b], a similar
objective function was employed for phase retrieval:
  min_X  Tr(X)
  subj. to  b_i = Tr(Φ_i X),  i = 1, . . . , N,
            X ⪰ 0,
(11)
albeit the source signal was not assumed sparse. Using the
lifting technique to construct the SDP relaxation of the
NP-hard phase retrieval problem, with high probability,
the program (11) recovers the exact solution (sparse or
dense) if the number of measurements N is at least of the
order of O(n log n). The region of success is visualized in
Figure 1 as region I with a thick solid line.
If x is sufficiently sparse and random Fourier dictionaries
are used for sampling, Moravec et al. [2007] showed that
in general the signal is uniquely defined if the number of
squared magnitude output measurements b exceeds the
order of O(k^2 log(4n/k^2)). This lower bound for the region
of success of CPR is illustrated by the dashed line in Figure 1.
Finally, the motivation for introducing the `1-norm regularization
in (10) is to be able to solve the sparse phase
retrieval problem for N smaller than what PhaseLift requires.
However, one will not be able to solve the compressive phase
retrieval problem in region III below the dashed
curve. Therefore, our target problems lie in region II.

Fig. 1. An illustration of the regions in which PhaseLift
and CPR are capable of recovering the ground truth
solution up to a global phase ambiguity. While
PhaseLift primarily targets problems in region I,
CPRL operates primarily in region II.

2 ‖X‖_1 for a matrix X denotes the entry-wise `1-norm in this paper.

3. NUMERICAL SOLUTIONS FOR NOISY DATA


In this section, we consider the case that the measurements
are contaminated by data noise. In a linear model,
typically bounded random noise affects the output of the
system as y = Ax + e, where e ∈ C^N is a noise term with
bounded `2-norm: ‖e‖_2 ≤ ε. However, in phase retrieval,
we follow closely a more special noise model used in Candès
et al. [2011b]:
  b_i = |⟨x, a_i⟩|^2 + e_i.
(12)
This nonstandard model avoids the need to calculate the
squared magnitude output |y|^2 with the added noise term.
More importantly, in practical phase retrieval applications,
measurement noise is introduced when the squared magnitudes
or intensities of the linear system are measured,
not on y itself (Candès et al. [2011b]).

Accordingly, we denote a linear function B of X,
  B : X ∈ C^{n×n} ↦ { Tr(Φ_i X) }_{1≤i≤N} ∈ R^N,
(13)
that measures the noise-free squared output. Then the
approximate CPR problem with the bounded `2 error model
(12) can be solved by the following SDP program:
  min  Tr(X) + λ‖X‖_1
  subj. to  ‖B(X) − b‖_2 ≤ ε,
            X ⪰ 0.
(14)
The estimate of x, just as in the noise-free case, can finally
be found by computing the rank-1 decomposition of X via
singular value decomposition. We refer to the method as
approximate CPRL. Due to machine rounding error,
in general a nonzero ε should always be assumed in the
objective (14) and in its termination condition during the
optimization.

We should further discuss several numerical issues in the
implementation of the SDP program. The constrained
CPR problem (14) can be rewritten as an unconstrained
objective function:
  min_{X ⪰ 0}  Tr(X) + λ‖X‖_1 + (μ/2)‖B(X) − b‖_2^2,
(15)
where λ > 0 and μ > 0 are two penalty parameters.

In (15), due to the lifting process, the rank-1 condition of
X is approximated by its trace function Tr(X). In Candès
et al. [2011b], the authors considered phase retrieval of a
generic (dense) signal x. They proved that if the number
of measurements obeys N ≥ cn log n for a sufficiently large
constant c, with high probability, minimizing (15) without
the sparsity constraint (i.e., λ = 0) recovers a unique rank-1
solution obeying X = xx^H.
In Section 4, we will show that using either random
Fourier dictionaries or more general random projections, in
practice one needs many fewer measurements to exactly
recover sparse signals if the measurements are noise-free.
Nevertheless, in the presence of noise, the recovered lifted
matrix X may not be exactly rank-1. In this case, one can
simply use its rank-1 approximation corresponding to the
largest singular value of X.
We also note that in (15), there are two main parameters
λ and μ that can be defined by the user. Typically μ
is chosen depending on the level of noise that affects
the measurements b. For λ, associated with the sparsity
penalty ‖X‖_1, one can adopt a warm start strategy to
determine its value iteratively. The strategy has been
widely used in other sparse optimization, such as in `1-min
[Yang et al., 2010]. More specifically, the objective is solved
iteratively with respect to a sequence of monotonically
decreasing λ → 0, and each iteration is initialized using
the optimization results from the previous iteration. When
λ is large, the sparsity constraint outweighs the trace
constraint and the estimation error constraint, and vice
versa.
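The warm-start strategy in λ described above can be sketched as a simple continuation loop. The function solve_cprl below is a hypothetical solver for (15) that accepts an initial iterate; all names are ours.

import numpy as np

def cprl_continuation(b, A, lambdas, solve_cprl):
    """Continuation (warm start) over a decreasing sequence of lambda values.

    b, A       : measurements and sensing matrix
    lambdas    : monotonically decreasing sequence, e.g. np.logspace(1, -3, 10)
    solve_cprl : hypothetical callable (b, A, lam, X_init) -> X solving (15)
    """
    n = A.shape[1]
    X = np.zeros((n, n))                 # start from the zero matrix
    for lam in lambdas:
        X = solve_cprl(b, A, lam, X)     # each solve is warm started at the previous X
    return X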
Example 3.1. (Compressive Phase Retrieval). In this example,
we illustrate a simple CPR example, where a 2-sparse
complex signal x_0 ∈ C^64 is first transformed by
the Fourier transform F ∈ C^{64×64} followed by random
projections R ∈ C^{32×64}:
  b = |R F x_0|^2.
(16)
Given b, F, and R, we first apply the PhaseLift algorithm
[Candès et al., 2011b] with A = RF to the 32 squared
observations b. The recovered dense signal is shown in
Figure 2. PhaseLift fails to identify the 2-sparse signal.
Next, we apply CPRL (14), and the recovered sparse signal
is also shown in Figure 2. CPRL correctly identifies the two
nonzero elements in x.
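For reference, a short sketch of generating the measurement setup in (16); the random seed, normalizations and variable names are ours and are not specified in the paper.

import numpy as np

rng = np.random.default_rng(0)
n, N, k = 64, 32, 2

x0 = np.zeros(n, dtype=complex)
support = rng.choice(n, size=k, replace=False)
x0[support] = rng.standard_normal(k) + 1j * rng.standard_normal(k)   # 2-sparse complex signal

F = np.fft.fft(np.eye(n)) / np.sqrt(n)                                # 64 x 64 Fourier transform
R = (rng.standard_normal((N, n)) + 1j * rng.standard_normal((N, n))) / np.sqrt(2 * n)  # projections
b = np.abs(R @ F @ x0) ** 2                                           # squared magnitudes, cf. (16)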


Fig. 2. The magnitude of the estimated signal provided by


CPRL and PhaseLift (PL). CPRL correctly identifies
elements 2 and 24 to be nonzero while PhaseLift
provides a dense estimate. It is also verified that the
estimate from CPRL, after a global phase shift, is
approximately equal the true x0 .
4. EXPERIMENT
This section gives a number of examples. Code for the
numerical illustrations can be downloaded from http://
www.rt.isy.liu.se/~ohlsson/code.html.
4.1 Simulation
First, we repeat the simulation given in Example 3.1 for
k = 1, . . . , 5. For each k, n = 64 is fixed, and we increase
the measurement dimension N until CPRL recovered the
true sparse support in at least 95 out of 100 trials, i.e.,
95% success rate. New data (x, b, and R) are generated
in each trial. The curve of 95% success rate is shown in
Figure 3.
With the same simulation setup, we compare the accuracy of CPRL with the PhaseLift approach and the CS
approach in Figure 3. First, note that CS is not applicable
to phase retrieval problems in practice, since it assumes
the phase of the observation is also given. Nevertheless, the
simulation shows CPRL via the SDP solution only requires
a slightly higher sampling rate to achieve the same success
rate as CS, even when the phase of the output is missing.
Second, similar to the discussion in Example 3.1, without
enforcing the sparsity constraint in (11), PhaseLift would
fail to recover correct sparse signals in the low sampling
rate regime.
It is also interesting to see the performance as n and
N vary and k is held fixed. We therefore use the same
setup as in Figure 3 but now fix k = 2 and, for n =
10, . . . , 60, gradually increase N until CPRL recovered
the true sparsity pattern with 95% success rate. The same
procedure is repeated to evaluate PhaseLift and CS. The
results are shown in Figure 4.

Fig. 3. The curves of 95% success rate for CPRL,
PhaseLift, and CS. Note that the CS simulation is
given the complete output y instead of its squared
magnitudes.
Compared to Figure 3, we can see that the degradation
from CS to CPRL when the phase information is omitted
is largely affected by the sparsity of the signal. More
specifically, when the sparsity k is fixed, even when the
dimension n of the signal increases dramatically, the number
of squared observations needed to achieve accurate recovery
does not increase significantly for either CS or CPRL.

Fig. 4. The curves of 95% success rate for CPRL,
PhaseLift, and CS. Note that the CS simulation is
given the complete output y instead of its squared
magnitudes.

4.2 CPRL Applied to Audio Signals

In this section, we further demonstrate the performance
of CPRL using signals from a real-world audio recording.
The timbre of a particular note on an instrument is
determined by the fundamental frequency and several
overtones. In a Fourier basis, such a signal is sparse, being
the summation of a few sine waves. Using the recording
of a single note on an instrument will give us a naturally
sparse signal, as opposed to the synthesized sparse signals in
the previous sections. Also, this experiment will let us
analyze how robust our algorithm is in practical situations,
where effects like room ambience might color our otherwise
exactly sparse signal with noise.
Our recording z ∈ R^s is a real signal, which is assumed
to be sparse in a Fourier basis. That is, for some sparse
x ∈ C^n, we have z = F_inv x, where F_inv ∈ C^{s×n} is a matrix
representing a transform from Fourier coefficients into the
time domain. Then, we have a randomly generated mixing
matrix with normalized rows, R ∈ R^{N×s}, with which our
measurements are sampled in the time domain:
  y = R z = R F_inv x.
(17)
Finally, we are only given the magnitudes of our measurements,
such that b = |y|^2 = |Rz|^2.
For our experiment, we choose a signal with s = 32
samples, N = 30 measurements, and it is represented
with n = 2s (overcomplete) Fourier coefficients. Also, to
generate F_inv, the C^{n×n} matrix representing the Fourier
transform is generated, and s rows from this matrix are
randomly chosen.
The experiment uses part of an audio file recording the
sound of a tenor saxophone. The signal is cropped so that
the signal only consists of a single sustained note, without
silence. Using CPRL to recover the original audio signal
given b, R, and F_inv, the algorithm gives us a sparse
estimate x, which allows us to calculate z_est = F_inv x.
We observe that all the elements of z_est have phases that
are 0 or π apart, allowing for one global rotation to make z_est
purely real. This matches our previous statements that
CPRL will allow us to retrieve the signal up to a global
phase.
We also find that the algorithm is able to achieve results
that capture the trend of the signal using fewer than s
measurements. In order to fully exploit the benefits of
CPRL that allow us to achieve more precise estimates with
smaller errors using fewer measurements relative to s, the
problem should be formulated in a much higher ambient
dimension. However, using the CVX Matlab toolbox by
Grant and Boyd [2010], we already ran into computational
and memory limitations with the current implementation
of the CPRL algorithm. These results highlight the need
for a more efficient numerical implementation of CPRL.


Fig. 5. The retrieved signal z est using CPRL versus the


original audio signal z.


Fig. 6. The magnitude of x retrieved using CPRL. The


audio signal z est is obtained by z est = Finv x.
5. CONCLUSION AND DISCUSSION
A novel method for the compressive phase retrieval problem has been presented. The method takes the form of
an SDP problem and provides the means to use compressive sensing in applications where only squared magnitude
measurements are available. The convex formulation gives
it an edge over previously presented approaches, and numerical illustrations show state-of-the-art performance.
One of the future directions is improving the speed of
the standard SDP solver, i.e., interior-point methods, currently used for the CPRL algorithm. Some preliminary results along with a more extensive study of the performance
bounds of CPRL are available in Ohlsson et al. [2011].
REFERENCES
R. Balan, P. Casazza, and D. Edidin. On signal reconstruction without phase. Applied and Computational Harmonic Analysis, 20:345–356, 2006.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
A. Bruckstein, D. Donoho, and M. Elad. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51(1):34–81, 2009.
E. J. Candès and M. Wakin. An introduction to compressive sampling. Signal Processing Magazine, IEEE, 25(2):21–30, March 2008.
E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52:489–509, February 2006.
E. J. Candès, Y. Eldar, T. Strohmer, and V. Voroninski. Phase retrieval via matrix completion. Technical Report arXiv:1109.0573, Stanford University, September 2011a.
E. J. Candès, T. Strohmer, and V. Voroninski. PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Technical Report arXiv:1109.4499, Stanford University, September 2011b.
A. Chai, M. Moscoso, and G. Papanicolaou. Array imaging using intensity-only measurements. Technical report, Stanford University, 2010.
S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.
J. Dainty and J. Fienup. Phase retrieval and image reconstruction for astronomy. In Image Recovery: Theory and Applications. Academic Press, New York, 1987.
A. d'Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet. A direct formulation for Sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007.
D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, April 2006.
J. Fienup. Phase retrieval algorithms: a comparison. Applied Optics, 21(15):2758–2769, 1982.
J. Fienup. Reconstruction of a complex-valued object from the modulus of its Fourier transform using a support constraint. Journal of Optical Society of America A, 4(1):118–123, 1987.
R. Gerchberg and W. Saxton. A practical algorithm for the determination of phase from image and diffraction plane pictures. Optik, 35:237–246, 1972.
R. Gonsalves. Phase retrieval from modulus data. Journal of Optical Society of America, 66(9):961–964, 1976.
M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 1.21. http://cvxr.com/cvx, August 2010.
A. Hoerl and R. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
D. Kohler and L. Mandel. Source reconstruction from the modulus of the correlation function: a practical approach to the phase problem of optical coherence theory. Journal of the Optical Society of America, 63(2):126–134, 1973.
I. Loris. On the performance of algorithms for the minimization of ℓ1-penalized functionals. Inverse Problems, 25:1–16, 2009.
S. Marchesini. Phase retrieval and saddle-point optimization. Journal of the Optical Society of America A, 24(10):3289–3296, 2007.
R. Millane. Phase retrieval in crystallography and optics. Journal of the Optical Society of America A, 7:394–411, 1990.
M. Moravec, J. Romberg, and R. Baraniuk. Compressive phase retrieval. In SPIE International Symposium on Optical Science and Technology, 2007.
H. Ohlsson, A. Y. Yang, R. Dong, and S. Sastry. Compressive Phase Retrieval From Squared Output Measurements Via Semidefinite Programming. Technical Report arXiv:1111.6323, University of California, Berkeley, November 2011.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B (Methodological), 58(1):267–288, 1996.
J. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, October 2004.
A. Walther. The question of phase retrieval in optics. Optica Acta, 10:41–49, 1963.
A. Yang, A. Ganesh, Y. Ma, and S. Sastry. Fast ℓ1-minimization algorithms and an application in robust face recognition: A review. In ICIP, 2010.

Convex Estimation of Cointegrated VAR Models by a Nuclear Norm Penalty

M. Signoretto and J. A. K. Suykens

Katholieke Universiteit Leuven, ESAT-SCD/SISTA, Kasteelpark Arenberg 10, B-3001 Leuven (BELGIUM)
Email: marco.signoretto@esat.kuleuven.be and johan.suykens@esat.kuleuven.be

Abstract: Cointegrated Vector AutoRegressive (VAR) processes arise in the study of long run equilibrium relations of stochastic dynamical systems. In this paper we introduce a novel convex approach for the analysis of these types of processes. The idea relies on an error correction representation and amounts to solving a penalized empirical risk minimization problem. The latter finds a model from data by minimizing a trade-off between a quadratic error function and a nuclear norm penalty used as a proxy for the cointegrating rank. We elaborate on properties of the proposed convex program; we then propose an easily implementable and provably convergent algorithm based on FISTA. This algorithm can be conveniently used for computing the regularization path, i.e., the entire set of solutions associated to the trade-off parameter. We show how such a path can be used to build an estimator for the cointegrating rank and illustrate the proposed ideas with experiments.
1. INTRODUCTION
Unit root nonstationary multivariate processes play an important role in the study of dynamical stochastic systems Box and Tiao [1977], Engle and Granger [1987], Stock and Watson [1988], Johansen [1988]. Contrary to their stationary counterparts, these processes are allowed to have trends or shifts in the mean or in the covariances. This feature makes them suitable to describe many phenomena of interest such as economic cycles and population dynamics. In this paper we focus on VAR processes. It is well known that these processes can generate stochastic and deterministic trends if the associated polynomial matrix has zeros on the unit circle. If some of the variables within the same VAR process move together in the long run, in a sense that we clarify later, they are called cointegrated. This situation is of considerable practical interest. Equilibrium relationships arise between economic variables such as, for instance, household income and expenditures. Cointegration has also been advocated to describe long-term parallel growth of mutually dependent indicators such as regional population and employment growth or city populations and total urban populations Payne and Ewing [1997], Sharma [2003], Møller and Sharp [2008].
* The authors are grateful to the anonymous reviewers for the helpful comments. Research supported by the Research Council KUL: GOA/11/05 Ambiorics, GOA/10/09 MaNet, CoE EF/05/006 Optimization in Engineering (OPTEC) and PFV/10/002 (OPTEC); Flemish Government: FWO: PhD/postdoc grants, projects: G0226.06 (cooperative systems and optimization), G0321.06 (Tensors), G.0302.07 (SVM/Kernel), research communities (WOG: ICCoS, ANMMM, MLDM); Belgian Federal Science Policy Office: IUAP P6/04 (DYSCO, Dynamical systems, control and optimization, 2007-2011); IBBT; EU: ERNSI, FP7-HD-MPC (INFSO-ICT-223854), COST intelliCIS, FP7-EMBOCON (ICT-248940), FP7-SADCO (MC ITN-264735), ERC HIGHWIND (259166). The scientific responsibility is assumed by its authors.

The analysis of cointegrated VAR models presents challenges that are not present in the stationary case. In particular, one of the main goals in the analysis of cointegrated systems is the estimation of the cointegrating rank. In this work we propose a novel approach that relies on an error correction representation of cointegrated VAR processes. The approach consists of solving a convex program and uses a nuclear norm as a proxy for the cointegrating rank. We show how the regularization path arising from different values of a trade-off parameter can be used to estimate the cointegrating rank. In order to compute solutions we propose to use a simple yet efficient algorithm based on an existing procedure called FISTA (fast iterative shrinkage-thresholding algorithm).

In Section 2 we recall the concept of cointegrated VAR models, error correction representations and cointegrating rank. In Section 3 we present our main problem formulation and discuss its properties. Section 4 deals with an algorithm to compute solutions. In Section 5 we introduce an estimator for the cointegrating rank based on the regularization path. In Section 6 we report experiments. We conclude with final remarks in Section 7.
2. COINTEGRATED VAR(P) MODELS

In the following we denote vectors as lower case letters (a, b, c, . . .) and matrices as capital letters (A, B, C, . . .). In particular, we use I to indicate the identity matrix, the dimension of which will be clear from the context. For a positive integer P we write N_P to denote the set {1, . . . , P}. Finally we use ⟨A, B⟩ to denote the inner product between A, B ∈ R^{D1×D2}:

⟨A, B⟩ = trace(A^⊤ B) = Σ_{d1∈N_{D1}} Σ_{d2∈N_{D2}} a_{d1 d2} b_{d1 d2},   (1)

where ⊤ indicates matrix transposition (in this paper all vectors and matrices are real). The corresponding Hilbert-Frobenius norm is ‖A‖ = √⟨A, A⟩.

In this work we are concerned with a multivariate time series x_t ∈ R^D following a Vector Autoregressive (VAR) model of order P:

x_t = Φ_1 x_{t−1} + Φ_2 x_{t−2} + · · · + Φ_P x_{t−P} + e_t,   (2)

where, for any p ∈ N_P, Φ_p ∈ R^{D×D} and the innovation e_t ∈ R^D is a zero-mean serially uncorrelated stationary process.

Remark 1. We do not consider here the case where (2) includes a deterministic trend; note, however, that the convex approach that we consider in the following can be easily adapted to deal also with this case.
2.1 Nonstationarity and Cointegration

Recall that x_t is called unit-root stationary if all the zeros of the univariate polynomial matrix of degree P

P(z) = I − Φ_1 z − Φ_2 z^2 − · · · − Φ_P z^P

are outside the unit circle, see e.g. Tsay [2005]. If det(P(1)) = 0, then x_t is unit-root nonstationary. Following Johansen [1992] we call x_t integrated of order R (or simply I(R)) if the R-times differenced process Δ^R x_t is stationary whereas, for any r ∈ N_{R−1}, the process Δ^r x_t is not. A stationary process is referred to as I(0). Finally we say that an I(R) process x_t is cointegrated if there exists at least a cointegrating vector β ∈ R^D such that the scalar process β^⊤ x_t is I(R*) with R* < R. Cointegrated processes were originally introduced in Granger [1981] and Engle and Granger [1987]. Since then they have become popular mostly in theoretical and applied econometrics. Cointegration has been advocated to explain long-run or equilibrium relationships; for more discussions on cointegration and cointegration tests, see Box and Tiao [1977], Engle and Granger [1987], Stock and Watson [1988], Johansen [1988]. In the following we focus on the situation where x_t is I(1). This case is the most commonly studied in the literature.
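To fix ideas, the following MATLAB fragment (our own illustration, not from the paper) generates a simple bivariate cointegrated pair; the variable names are hypothetical.

% A random walk x1 and a second series x2 sharing the same stochastic trend:
% both are I(1), but [1 -1] is a cointegrating vector since x1 - x2 is I(0).
T  = 300;
x1 = cumsum(randn(T, 1));        % unit-root nonstationary component
x2 = x1 + 0.1 * randn(T, 1);     % cointegrated with x1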

2.2 Error Correction Models

An error correction model (ECM) for the VAR(P) process (2) is Tsay [2005], Lütkepohl [2005]:

Δx_t = Π x_{t−1} + Φ*_1 Δx_{t−1} + Φ*_2 Δx_{t−2} + · · · + Φ*_{P−1} Δx_{t−P+1} + e_t,   (3)

where Φ*_p = −Σ_{j=p+1}^{P} Φ_j and Π = −P(1). The VAR(P) model can be recovered from the ECM via:

Φ_1 = I + Π + Φ*_1,   (4a)
Φ_p = Φ*_p − Φ*_{p−1},  p = 2, . . . , P − 1,   (4b)
Φ_P = −Φ*_{P−1}.   (4c)

Model-based forecasting is done as in the stationary case. Assume e_t is independent white noise with covariance matrix Σ_e. It can be shown that the optimal H-step forecast at the origin is

y_t(H) = Φ_1 y_t(H−1) + · · · + Φ_P y_t(H−P),   (5)

where y_t(j) = y_{t+j} for j ≤ 0. The forecast mean square error (MSE) matrix is Σ_y(H) = Σ_e + Σ_{h∈N_{H−1}} Ψ_h Σ_e Ψ_h^⊤; notably, for unit-root nonstationary processes some entries will approach infinity as the horizon H → ∞, Lütkepohl [2005].

2.3 Cointegrating Rank

Note that, under the assumption that x_t is at most I(1), Δx_t is an I(0) process. The cointegrating rank Engle and Granger [1987] is defined as:

D* = rank(Π).   (6)

One can distinguish the following situations depending on its value:

i) D* = 0. In this case Π = 0 and (3) reduces to a VAR(P−1) model.
ii) 0 < D* < D. The matrix Π can be written as
Π = A B^⊤,   (7)
where A, B ∈ R^{D×D*} are full (column) rank. The D* linearly independent columns of B are cointegrating vectors; they form a basis for the cointegrating subspace. The vector process
f_t = B^⊤ x_t   (8)
is stationary and represents deviations from equilibrium. These equations clarify that (3) expresses the change in x_t in terms of the deviations from the equilibrium at time t−1 (the term Π x_{t−1} = A f_{t−1}, called the error correction term) and previous changes (the terms Φ*_p Δx_{t−p}, 1 ≤ p ≤ P−1).
iii) D* = D. In this case det(P(1)) ≠ 0 (Π is full rank); x_t is I(0) and one can study (2) directly.

An example of a cointegrated process is given in Figure 1.

Fig. 1. (a) A realization of the scalar components of a 4-dimensional I(1) cointegrated process with cointegrating rank 1. (b) The stationary process corresponding to the cointegrating vector [0.4 0.2 0.1 0.3].

2.4 Existing Estimators

Different approaches have been proposed to estimate ECMs based on training data {x_t}_{1≤t≤T}, see [Lütkepohl, 2005, Chapter 7.2]. Define the matrices

X_p = [x_{P−p+1}, x_{P−p+2}, · · · , x_{T−p}],   (9a)
ΔX_p = X_p − X_{p+1},   (9b)
ΔX = [ΔX_1^⊤, ΔX_2^⊤, · · · , ΔX_{P−1}^⊤]^⊤,   (9c)
Φ = [Φ*_1, Φ*_2, · · · , Φ*_{P−1}],   (9d)
C = [X_1^⊤, ΔX^⊤]^⊤,   (9e)

where X_p ∈ R^{D×(T−P)}, ΔX_p ∈ R^{D×(T−P)}, ΔX ∈ R^{(P−1)D×(T−P)}, Φ ∈ R^{D×(P−1)D} and C ∈ R^{PD×(T−P)}. The Least Squares (LS) estimator is simply Lütkepohl [2005]:

[Π̂_LS, Φ̂_LS] = ΔX_0 C^⊤ (C C^⊤)^{−1}.   (10)

This approach does not take the decomposition (7) of Π into account²; in principle, one could perform a truncated singular value decomposition (SVD) of Π̂_LS a posteriori to find B. However, this practice requires the knowledge of D*, which is normally not available³. When this information is available, an alternative approach consists of Maximum Likelihood Estimation (MLE), which works under the assumption that the innovation is independent white Gaussian noise. Contrary to LS, this approach directly estimates the factors in the representation (7). This leads to a nonconvex multistage algorithm. We refer the reader to [Tsay, 2005, Section 8.6.2] for details.

² In the literature (10) is sometimes called unrestricted LS to emphasize that the model's parameters are not constrained to satisfy any specific structural form.
³ As we later illustrate in experiments, the spectrum of Π̂_LS normally does not give a good indication of the actual value of the cointegrating rank.

3. ESTIMATION BASED ON CONVEX PROGRAMMING

Recall that for an arbitrary matrix A ∈ R^{D1×D2} with rank R, the SVD is

A = U Σ V^⊤,  Σ = diag({σ_r}_{1≤r≤R}),   (11)

where U ∈ R^{D1×R} and V ∈ R^{D2×R} are matrices with orthonormal columns, and the singular values σ_r satisfy σ_1 ≥ σ_2 ≥ · · · ≥ σ_R > 0. The nuclear norm (a.k.a. trace norm or Schatten 1-norm) of A is defined Horn and Johnson [1994] as

‖A‖_* = Σ_{r∈N_R} σ_r.   (12)

The nuclear norm has been used to devise convex relaxations to rank constrained matrix problems Recht et al. [2007], Candès and Recht [2009], Candes et al. [2011]. This parallels the approach followed by estimators like the LASSO (Least Absolute Shrinkage and Selection Operator, Tibshirani [1996]) that estimate a high dimensional loading vector x ∈ R^D based on the l1-norm

‖x‖_1 = Σ_{d∈N_D} |x_d|.   (13)

3.1 Main Problem Formulation

Note that, based upon (9), the ECM (3) can be restated in matrix notation as ΔX_0 = Π X_1 + Φ ΔX + E. Our approach now consists of finding estimates (Π̂(λ), Φ̂(λ)) by solving:

min_{Π,Φ}  (1/(2c)) ‖ΔX_0 − Π X_1 − Φ ΔX‖² + λ ‖Π‖_*,   (14)

where c is a fixed normalization constant, such as c = D(T−P), and λ is a trade-off parameter. When λ = 0, (14) reduces to the LS estimates (10). For a strictly positive λ, (14) fits a model to the data by minimizing a trade-off between an error function and a proxy for the cointegrating rank; a positive λ reflects the prior knowledge that the process should be cointegrated with cointegrating rank D* < D.

It can be shown that (12) is the convex envelope on the unit ball of the (non-convex) rank function Fazel [2002]. In the present context, this makes (14) the tightest convex relaxation to the problem:

min_{Π,Φ}  (1/(2c)) ‖ΔX_0 − Π X_1 − Φ ΔX‖² + λ rank(Π).   (15)

The nonconvexity of the latter implies that practical algorithms can only be guaranteed to deliver local solutions of (15) that are rank deficient for sufficiently large λ. In contrast, the problem (14) is convex since the objective is a sum of convex functions Boyd and Vandenberghe [2004]. This implies that any local solution found by a provably convergent algorithm is globally optimal.

Once a solution of (14) is available, one can recover the parameters of the VAR(P) model (2) based upon (4). Note that here we focus on the case where the error function is expressed in terms of the Hilbert-Frobenius norm; however, alternative norms might also be meaningful. For later reference, note that, when P = 1 and c = 1, (14) boils down to:

min_Π  (1/(2c)) ‖ΔX_0 − Π X_1‖² + λ ‖Π‖_*.   (16)
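As an illustration of the data matrices (9) and of the LS baseline (10), consider the following MATLAB sketch (our own, not the authors' code); x is assumed to be a D x T matrix collecting the observed series and P the VAR order.

% Minimal sketch: build the matrices in (9) and compute the LS estimator (10).
function [PiLS, PhiLS, dX0, X1, dX] = var_ls_sketch(x, P)
[D, T] = size(x);
Xp  = @(p) x(:, (P - p + 1):(T - p));   % X_p in (9a), D x (T-P)
dXp = @(p) Xp(p) - Xp(p + 1);           % Delta X_p in (9b)
dX0 = dXp(0);                           % current differences
X1  = Xp(1);                            % lag-one levels
dX  = [];                               % stacked lagged differences (9c)
for p = 1:(P - 1)
    dX = [dX; dXp(p)];
end
C = [X1; dX];                           % regressor matrix (9e)
Theta = (dX0 * C') / (C * C');          % unrestricted LS estimator (10)
PiLS  = Theta(:, 1:D);
PhiLS = Theta(:, D + 1:end);            % [Phi*_1, ..., Phi*_{P-1}], cf. (9d)
end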

3.2 l2 Smoothing

The nature of the problem makes it difficult to find a solution of (14) for λ → 0. Indeed, in practice X_1 and ΔX are often close to singular so that the problem is ill-conditioned. In order to improve numerical stability a possible approach is to add a ridge penalty to the objective of (14). That is, for a small user-defined parameter γ > 0, to add:

(γ/2) ( ‖Π‖² + Σ_{p∈N_{P−1}} ‖Φ*_p‖² );   (17)

we call the resulting optimization problem the l2-smoothed formulation. The idea has a long tradition both in optimization and statistics. Recently it found application in the Elastic Net Zou and Hastie [2005]. The Elastic Net finds a high dimensional loading vector x based on empirical data. The approach consists of replacing the LASSO penalty based on the l1 norm (13) with the composite penalty λ_1 ‖x‖² + λ_2 ‖x‖_1. This strategy aims at improving the LASSO in the presence of dependent variables (which, in fact, leads to ill-conditioning).

In the present context it is easy to see that the solution of the l2-smoothed formulation can be found by solving (14) where the data matrices (ΔX_0, X_1, ΔX) have been replaced by lifted matrices. Consider for simplicity problem (16). The lifted matrices of interest become:

ΔX_0^γ = [ΔX_0, O]  and  X_1^γ = [X_1, √γ I],   (18)

where O, I ∈ R^{D×D} and O is a matrix of zeros. For this reason in the following we will only discuss an algorithm for the non-smoothed case; it is understood that a solution for the smoothed formulation can be found simply by replacing the data matrices.
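For concreteness, a two-line MATLAB sketch of the lifting (18) (our illustration, not the authors' code), assuming dX0 and X1 are the data matrices of (9) and gamma is a small smoothing parameter:

dX0_g = [dX0, zeros(size(dX0, 1))];             % lifted Delta X_0, cf. (18)
X1_g  = [X1, sqrt(gamma) * eye(size(X1, 1))];   % lifted X_1, cf. (18)
% Any solver for (16), e.g. the FISTA sketch given later, can then be run on
% (dX0_g, X1_g) to obtain the l2-smoothed solution.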
3.3 Dual Representation and Duality Gap

In this section, for simplicity of exposition, we restrict ourselves to the primal problem (16) and derive its dual representation. The Fenchel conjugate Rockafellar [1974] of λ‖X‖_* is:

f*(A) = max_X ⟨A, X⟩ − λ‖X‖_*.   (19)

Recall that the dual norm of the nuclear norm is the spectral norm, denoted ‖·‖_2; for a generic matrix A with SVD (11) such norm is defined by

‖A‖_2 = σ_1.   (20)

It is a known fact that the conjugate of a norm is the indicator function of the dual norm unit ball Boyd and Vandenberghe [2004]. In the present context this fact reads as follows.

Lemma 2.
f*(A) = 0 if ‖A‖_2 ≤ λ, and f*(A) = ∞ otherwise.   (21)

With reference to (16), it can be shown that strong duality holds; the dual problem can be obtained as follows:

min_Π { (1/2)‖ΔX_0 − Π X_1‖² + λ‖Π‖_* }
  = min_Π max_Λ { ⟨Λ, ΔX_0 − Π X_1⟩ − (1/2)⟨Λ, Λ⟩ + λ‖Π‖_* }
  = max_Λ min_Π { ⟨Λ, ΔX_0 − Π X_1⟩ − (1/2)⟨Λ, Λ⟩ + λ‖Π‖_* }
  = max_Λ { ⟨Λ, ΔX_0⟩ − (1/2)⟨Λ, Λ⟩ + min_Π { −⟨Λ X_1^⊤, Π⟩ + λ‖Π‖_* } }
  = max_Λ { ⟨Λ, ΔX_0⟩ − (1/2)⟨Λ, Λ⟩ : ‖Λ X_1^⊤‖_2 ≤ λ },   (by Lemma 2)   (22)

Additionally, as results clear from the second line of (22), the primal and dual solutions are related as follows:

Λ̂ = ΔX_0 − Π̂ X_1.   (23)

This fact can be readily used to derive an optimality certificate based on the duality gap, i.e. the difference of values between the objective functions of the dual and primal problems.

Remark 3. Note that the solution Λ̂ of the dual problem in the last line of (22) corresponds to the projection of ΔX_0 onto the convex set S = {Λ : ‖Λ X_1^⊤‖_2 ≤ λ}.

4. ALGORITHM

In order to find a solution (Π̂(λ), Φ̂(λ)) corresponding to a fixed value of λ one could restate (14) as a semidefinite programming problem (SDP) and rely on general purpose solvers such as SeDuMi Sturm [1999]. Alternatively, it is possible to use a modelling language like CVX Grant and Boyd [2010] or YALMIP Lofberg [2004]. However these approaches are practically feasible only for relatively small problems. In contrast, the iterative scheme that we present next can be easily implemented and scales well with the problem size. Additionally, the approach can be conveniently used through warm-starting to compute solutions corresponding to nearby values of λ. The procedure, detailed in the algorithmic environment below, can be shown to be a special instance of the fast iterative shrinkage-thresholding algorithm (FISTA) proposed in Beck and Teboulle [2009]; therefore it inherits its favorable convergence properties, see Beck and Teboulle [2009]. We call it Cointegrated VAR(P) via FISTA (CoVAR(P)-FISTA).

Algorithm: CoVAR(P)-FISTA
Input: X_1; ΔX_p, p = 0, 1, . . . , P − 1; λ
Initialize: Ω_0 = Π_{−1}; Ω*_{0,p} = Φ*_{−1,p}, p ∈ N_{P−1}; t_1 = 1; L = (1/c)‖C C^⊤‖_2 (see (9e))
Iteration k ≥ 1:
  A_k = Ω_k X_1 + Σ_{p∈N_{P−1}} Ω*_{k,p} ΔX_p − ΔX_0   (24a)
  Φ*_{k,p} = Ω*_{k,p} − (1/(Lc)) A_k ΔX_p^⊤,  p ∈ N_{P−1}   (24b)
  C_k = Ω_k − (1/(Lc)) A_k X_1^⊤   (24c)
  Π_k = D_{λ/L}(C_k)  (see (25))   (24d)
  t_{k+1} = (1 + √(1 + 4 t_k²))/2,  r_{k+1} = (t_k − 1)/t_{k+1}   (24e)
  Ω_{k+1} = Π_k + r_{k+1}(Π_k − Π_{k−1})   (24f)
  Ω*_{k+1,p} = Φ*_{k,p} + r_{k+1}(Φ*_{k,p} − Φ*_{k−1,p}),  p ∈ N_{P−1}   (24g)
Return: Π_k, Φ*_{k,1}, . . . , Φ*_{k,P−1}

The approach is essentially a forward-backward splitting technique (see Bauschke and Combettes [2011] and references therein) which is accelerated to reach the optimal rate of convergence in the sense of Nesterov [1983, 2003]. The procedure is based on two sets of working variables, (Π_k, Φ*_{k,1}, Φ*_{k,2}, . . . , Φ*_{k,P−1}) and (Ω_k, Ω*_{k,1}, Ω*_{k,2}, . . . , Ω*_{k,P−1}). Equations (24a) to (24c) correspond to a forward step in the first set of variables conditioned on the variables in the second set; the step size is determined by the Lipschitz constant L. Equation (24d) represents the backward step: it amounts to evaluating at the current estimate the singular value shrinkage operator Cai et al. [2010] defined, for a matrix A with SVD (11), as:

D_τ(A) = U Σ_+ V^⊤,  Σ_+ = diag({max(σ_r − τ, 0)}_{1≤r≤R}).   (25)

D_τ(·) is the proximity operator Bauschke and Combettes [2011] of the nuclear norm function. Equation (24e) defines the updating constant r_k based upon the estimate sequence t_k Nesterov [1983, 2003]. Finally, equations (24f) and (24g) update the second set of variables based upon the variables in the first set.

The approach requires setting an appropriate termination criterion. A sensible idea, which we follow in experiments, is to stop when the duality gap corresponding to the current estimate is smaller than a predefined threshold.
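To make the iteration concrete, the following MATLAB sketch (our illustration, not the authors' implementation) instantiates CoVAR(P)-FISTA for the P = 1, c = 1 problem (16), including a duality-gap stopping test based on (22)-(23); dX0, X1, lambda and the tolerance tol are assumed given.

function Pi = covar_fista_sketch(dX0, X1, lambda, tol)
L  = norm(X1 * X1', 2);                    % Lipschitz constant of the smooth part
D  = size(dX0, 1);
Pi = zeros(D); Pi_old = Pi; Om = Pi; t = 1;
while true
    G = (Om * X1 - dX0) * X1';             % gradient step, cf. (24a)-(24c)
    [U, S, V] = svd(Om - G / L, 'econ');
    Pi = U * diag(max(diag(S) - lambda / L, 0)) * V';   % shrinkage operator (25)
    t_new = (1 + sqrt(1 + 4 * t^2)) / 2;   % estimate sequence, cf. (24e)
    Om = Pi + ((t - 1) / t_new) * (Pi - Pi_old);         % momentum step, cf. (24f)
    Pi_old = Pi; t = t_new;
    % Duality-gap certificate: scale the residual into the feasible set of (22)
    % and compare primal and dual objective values.
    Lam = dX0 - Pi * X1;
    Lam = Lam * min(1, lambda / norm(Lam * X1', 2));
    primal = 0.5 * norm(dX0 - Pi * X1, 'fro')^2 + lambda * sum(svd(Pi));
    dual   = Lam(:)' * dX0(:) - 0.5 * norm(Lam, 'fro')^2;
    if primal - dual < tol, break; end
end
end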
5. CONTINUOUS SHRINKAGE AND COINTEGRATING RANK

5.1 Regularization Path

By continuously varying the regularization parameter λ in (14) one obtains an entire set of solutions, denoted {(Π̂(λ), Φ̂(λ))}_λ, and called the regularization path. In general, continuous shrinkage is known to feature certain favorable properties. In particular, the paths of estimators like the LASSO are known to be more stable than those of inherently discrete procedures such as Subset Selection Breiman [1996].

In the present context the path begins at λ = 0 and terminates at λ_max, the least value of λ leading to Π̂(λ_max) = 0. For problem (16), in particular, such value is

λ_max = ‖ΔX_0 X_1^⊤‖_2,   (26)

as one can see in light of (23) and Remark 3.

5.2 Estimation of the Cointegrating Rank

Denote by {σ_r(λ)}_{1≤r≤R} the spectrum of Π̂(λ). For a > 1 consider the vector m̄ ∈ R^D defined entry-wise by⁴:

m̄_d = ( ∫_0^{log_a(λ_max)} σ_d²(a^t) a^t dt )^{1/2}.   (27)

We call m, the vector obtained from the inverse order statistics⁵ of m̄, the path spectrum. Note that the integral in (27) weights more the area of the path which corresponds to shrunk singular values. The rationale behind this is simple: those singular values that survive the shrinking are likely to be the most important ones. The path spectrum can be used to define an estimator for the cointegrating rank. In particular, one can take

D̂*(m) = arg min_{d∈N_D} { f(d) = Σ_{i∈N_d} m_i² / Σ_{i∈N_D} m_i²  :  f(d) > τ },   (28)

where 0 < τ < 1 is a predefined threshold. Note that this estimator is independent of λ. This is a desirable feature: setting an appropriate value for the regularization parameter is known to be a difficult task.

In practice the regularization path is evaluated at discrete values of λ. Therefore the integrals are replaced by their Monte Carlo estimates. In computing the path we actually begin with λ_max (recall that Π̂(λ_max) = 0) and proceed backwards, computing the solutions corresponding to logarithmically spaced values of the parameter. At each step we initialize the algorithm of Section 4 with the previous solution.

⁴ In experiments we always consider a = 10.
⁵ I.e., m = [m̄_(1), m̄_(2), · · · , m̄_(D)] is obtained from m̄ by sorting its entries in decreasing order. Note that, normally, m̄ already coincides with its inverse order statistics m.

6. EXPERIMENTS

To test the performance of the proposed estimator we considered cointegrated systems with a triangular representation Phillips [1991]. More specifically, we generated realizations of a D-dimensional cointegrated process x_t with M cointegrating vectors by the following model⁶:

x_{it} = Σ_{j=iM+1}^{(i+1)M} x_{jt} + e_{it},  if i = 1, 2, . . . , M,
x_{it} = x_{i,t−1} + e_{it},  if i = M+1, M+2, . . . , D.   (29)

In all the cases e_t ∈ R^D is a Gaussian white noise with mean zero and covariance matrix (5e−3) I_D. The process is observed in noise: we actually used as training data N successive time steps from a realization of x̃_t = x_t + w_t, where w_t is zero-mean Gaussian white noise with covariance matrix σ² I_D. For all the cases we took P = 1 and computed the path corresponding to an l2-smoothed formulation with smoothing parameter γ. We compared D̂*(m) (see (28)) against the naive estimator D̂*(σ_LS), where σ_LS are the singular values of Π̂_LS. Figure 2 refers to an experiment with D = 72, M = 8, N = 400, σ = 0.002 and γ = 0.01. In Table 1 we report the average estimated cointegrating rank (with standard deviation in parentheses) over 20 random experiments performed for two different sets of values of D, M, N, σ and γ.

⁶ From (29) it is easy to recover the error correction model representation.

Table 1. Estimated cointegrating ranks in randomized experiments (threshold τ = 0.8)

D = 16, M = 2, N = 60, σ = 0.001, γ = 0.1:   D̂*(m) = 2.3 (0.5),   D̂*(σ_LS) = 4.35 (0.5)
D = 72, M = 8, N = 400, σ = 0.002, γ = 0.01:   D̂*(m) = 7.4 (1.9),   D̂*(σ_LS) = 14.3 (0.8)

Fig. 2. (a) The singular values of the true Π and of Π̂_LS; note that the LS solution does not give an indication of the cointegrating rank. (b) The path spectrum; note the gap after d = 8. (c) The regularization path.
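For reproducibility of the data generation step, the following MATLAB sketch (our own, not the authors' code) draws one noisy realization of the triangular model (29); the parameter values mirror the first row of Table 1.

% Sketch: generate a D-dimensional cointegrated process via (29), then add
% observation noise with standard deviation sigma.
D = 16; M = 2; N = 60; sigma = 0.001;
e = sqrt(5e-3) * randn(D, N);                    % innovations e_t
x = zeros(D, N);
for t = 2:N
    x(M+1:D, t) = x(M+1:D, t-1) + e(M+1:D, t);   % random-walk components
end
for i = 1:M
    x(i, :) = sum(x(i*M+1:(i+1)*M, :), 1) + e(i, :);   % cointegrated components
end
xt = x + sigma * randn(D, N);                    % observed training data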

7. CONCLUSIONS

We presented a novel approach, based on a convex program, for the analysis of cointegrated VAR processes from observational data. We proposed to compute solutions via a scalable iterative scheme; used in combination with warm starting, this algorithm can be conveniently employed to compute the entire regularization path. At each step one can rely on the duality gap as an optimality certificate. The regularization path offers an indication of the actual value of the cointegrating rank and can be used to define estimators for the latter. An important advantage is that the approach does not require fixing a value for the regularization parameter λ in (14). This is known to be a difficult task, especially when the goal of the analysis is model selection rather than low prediction errors.
REFERENCES
H.H. Bauschke and P.L. Combettes. Convex Analysis and
Monotone Operator Theory in Hilbert Spaces. Springer
Verlag, 2011.
A. Beck and M. Teboulle. A fast iterative shrinkagethresholding algorithm for linear inverse problems.
SIAM Journal on Imaging Sciences, 2(1):183202, 2009.
G.E.P. Box and G.C. Tiao. A canonical analysis of
multiple time series. Biometrika, 64(2):355, 1977.
S.P. Boyd and L. Vandenberghe. Convex Optimization.
Cambridge University Press, 2004.
L. Breiman. Heuristics of instability and stabilization in
model selection. Annals of Statistics, 24(6):23502383,
1996.
J.F. Cai, E.J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
E.J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
E.J. Candes, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of ACM, 58(3):Article 11, 37 pp., 2011.
R.F. Engle and C.W.J. Granger. Co-integration and
error correction: representation, estimation, and testing. Econometrica: Journal of the Econometric Society,
pages 251276, 1987.
M. Fazel. Matrix rank minimization with applications.
PhD thesis, Elec. Eng. Dept., Stanford University, 2002.

C.W.J. Granger. Some properties of time series data and


their use in econometric model specification. Journal of
econometrics, 16(1):121130, 1981.
M. Grant and S. Boyd.
CVX: Matlab software
for disciplined convex programming, version 1.21.
http://cvxr.com/cvx, May 2010.
R.A. Horn and C.R. Johnson. Topics in Matrix Analysis.
Cambridge University Press, 1994.
S. Johansen. Statistical analysis of cointegration vectors.
Journal of economic dynamics and control, 12(2-3):231
254, 1988.
S. Johansen. A representation of vector autoregressive
processes integrated of order 2. Econometric theory, 8
(02):188202, 1992.
J. Lofberg.
Yalmip : A toolbox for modeling and
optimization in MATLAB.
In Proceedings of the
CACSD Conference, Taipei, Taiwan, 2004.
URL
http://users.isy.liu.se/johanl/yalmip.
H. Lütkepohl. New introduction to multiple time series analysis. Springer, 2005.
N. F. Møller and P. Sharp. Malthus in cointegration space: A new look at living standards and population in pre-industrial England. Discussion Papers 08-16, University of Copenhagen, Department of Economics, July 2008. URL http://ideas.repec.org/p/kud/kuiedp/0816.html.
Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.
Y. Nesterov. Introductory lectures on convex optimization:
A basic course. Kluwer Academic Pub, 2003.
J.E. Payne and B.T. Ewing. Population and economic
growth: a cointegration analysis of lesser developed
countries. Applied economics letters, 4(11):665, 1997.
P.C.B. Phillips. Optimal inference in cointegrated systems. Econometrica: Journal of the Econometric Society, pages 283306, 1991.
B. Recht, M. Fazel, and P.A. Parrilo.
Guaranteed
minimum-rank solutions of linear matrix equations via
nuclear norm minimization. SIAM Rev., 52:471501,
2007.
R.T. Rockafellar. Conjugate duality and optimization,
volume 16. Society for Industrial Mathematics, 1974.
S. Sharma. Persistence and stability in city growth.
Journal of Urban Economics, 53(2):300320, 2003.
J.H. Stock and M.W. Watson. Testing for common trends.
Journal of the American statistical Association, pages
10971107, 1988.
J.F. Sturm. Using SeDuMi 1.02, a MATLAB toolbox
for optimization over symmetric cones. Optimization
Methods and Software, 11(1):625653, 1999.
R. Tibshirani. Regression shrinkage and selection via the
LASSO. Journal of the Royal Statistical Society. Series
B (Methodological), 58(1):267288, 1996.
R.S. Tsay. Analysis of financial time series, volume 543.
Wiley-Interscience, 2005.
H. Zou and T. Hastie. Regularization and variable selection via the elastic net. J. Roy. Stat. Soc. Ser. B, 67(2):
301320, 2005.

How effective is the nuclear norm heuristic in solving data approximation problems?

Ivan Markovsky

School of Electronics and Computer Science, University of Southampton, SO17 1BJ, United Kingdom, Email: im@ecs.soton.ac.uk

Abstract: The question in the title is answered empirically by solving instances of three classical problems: fitting a straight line to data, fitting a real exponent to data, and system identification in the errors-in-variables setting. The results show that the nuclear norm heuristic performs worse than alternative problem-dependent methods: ordinary and total least squares, Kung's method, and subspace identification. In the line fitting and exponential fitting problems, the globally optimal solution is known analytically, so that the suboptimality of the heuristic methods is quantified.
Keywords: low-rank approximation, nuclear norm, subspace methods, system identification.
1. INTRODUCTION
With a few exceptions model reduction and system identification lead to non-convex optimization problems, for which there
are no efficient global solution methods. The methods for H2
model reduction and maximum likelihood system identification
can be classified as local optimization methods and convex
relaxations. Local optimization methods require an initial approximation and are in general computationally more expensive
than the relaxation methods, however, the local optimization
methods explicitly optimize the desired criterion, which ensures
that they produce at least as good result as a relaxation method,
provided the solution of the relaxation method is used as an
initial approximation for the local optimization method.
A subclass of convex relaxation methods for system identification are the subspace methods, see Van Overschee and De Moor
[1996]. Subspace identification emerged as a generalization of
realization theory and proved to be a very effective approach. It
also leads to computationally robust and efficient algorithms.
Currently there are many variations of the original subspace
methods (N4SID, MOESP, and CVA). Although the details of
the subspace methods may differ, their common feature is that
the approximation is done in two stages, the first of which is
unstructured low-rank approximation of a matrix that is constructed from the given input/output trajectory.
Related to the subspace methods are Kung's method and the balanced model reduction method, which are the most effective heuristics for model reduction of linear time-invariant systems.
A recently proposed convex relaxation method is the one using
the nuclear norm as a surrogate for the rank. The nuclear
norm relaxation for solving rank minimization problems was
proposed in Fazel et al. [2001] and was shown to be the tightest
relaxation of the rank. It is a generalization of the ℓ1-norm heuristic from sparse vector approximation problems to rank minimization problems.
The nuclear norm heuristic leads to a semidefinite optimization problem, which can be solved by existing algorithms with provable convergence properties and readily available software packages. (We use CVX, see Grant and Boyd.) Apart from theoretical justification and easy implementation in practice, formulating the problem as a semidefinite program has the additional advantage of flexibility. For example, adding regularization and affine inequality constraints in the data modeling problem still leads to semidefinite optimization problems that can be solved by the same algorithms and software as the original problem.
A disadvantage of using the nuclear norm heuristic is the fact that the number of optimization variables in the semidefinite optimization problem depends quadratically on the number of data points in the data modeling problem. This makes methods based on the nuclear norm heuristic impractical for problems with more than a few hundreds of data points. Such problems are considered small size data modeling problems.
Outline of the paper
The objective of this paper is to test the effectiveness of the nuclear norm heuristic as a tool for system identification and
Outline of the paper
The objective of this paper is to test the effectiveness of the
nuclear norm heuristic as a tool for system identification and
model reduction. Although there are recent theoretical results, see, e.g., Candès and Recht [2009], on exact solution of matrix completion problems by the nuclear norm heuristic, to the best of the authors' knowledge there are no similar results about the effectiveness of the heuristic in system identification problems.
The nuclear norm heuristic is compared empirically with other heuristic methods on benchmark problems. The selected problems are simple: small complexity models and small numbers of data points. The experiments in the paper are reproducible Buckheit and Donoho [1995]. Moreover the MATLAB code that generates the results is included in the paper, so that the reader can repeat the examples by copying the code chunks from the paper and pasting them in the MATLAB command prompt, or by downloading the code from
http://eprints.soton.ac.uk/336088/
The selected benchmark problems are:
(1) line fitting by geometric distance minimization (orthogonal regression),
(2) fitting a real exponential function to data, and
(3) system identification in the errors-in-variables setting.

Problem 1 is the static equivalent of problem 3 and can be


solved exactly by unstructured rank-1 approximation of the
matrix of the point coordinates. Problem 2 can be viewed as
a first order autonomous system identification problem. This
problem also admits an exact analytic solution. Therefore in
the first two cases, we are able to quantify the sub-optimality of
the nuclear norm heuristic (as well as any other method). This is
not possible in the third benchmark problem, where there are no
methods that can efficiently compute a globally optimal point.
2. TEST EXAMPLES
2.1 Line fitting
In this section, we consider the problem of fitting a line B,
passing through the origin, to a set of points in the plane
D = { d1 , . . . , dN }.
The fitting criterion is the geometric distance from D to B

dist(D, B) = √( Σ_{i=1}^N dist²(d_i, B) ),   (dist)

where

dist(d_i, B) := min_{d̂_i ∈ B} ‖d_i − d̂_i‖_2.

The line fitting problem in the geometric distance sense

minimize  dist(D, B)  over all lines B passing through 0   (LF)

is equivalent to the problem of finding the nearest, in the Frobenius norm ‖·‖_F sense, rank-1 matrix D̂ to the matrix of the point coordinates D = [d_1 · · · d_N], i.e.,

minimize over D̂ ∈ R^{q×N}  ‖D − D̂‖_F  subject to rank(D̂) ≤ r,   (LRA)

where q = 2 and r = 1.

Note 1. (Generalization and links to other methods). For general r < q < N, (LRA) corresponds to fitting an r-dimensional subspace to N points in a q-dimensional space. This problem is closely related to the principal component analysis and total least squares problems Markovsky and Van Huffel [2007].
The following theorem shows that all optimal solutions of (LRA) are available analytically in terms of the singular value decomposition of D.

Theorem 1. (Eckart-Young-Mirsky). Let D = UΣV^⊤ be a singular value decomposition of D and partition U, Σ =: diag(σ_1, . . . , σ_q), and V as

U =: [U_1 U_2],  Σ =: [Σ_1 0; 0 Σ_2],  V =: [V_1 V_2],

where U_1 ∈ R^{q×r}, Σ_1 ∈ R^{r×r}, and V_1 ∈ R^{N×r}. Then the rank-r matrix obtained from the truncated singular value decomposition

D̂* = U_1 Σ_1 V_1^⊤

is such that

‖D − D̂*‖_F = min over rank(D̂) ≤ r of ‖D − D̂‖_F = √(σ²_{r+1} + · · · + σ²_q).

The minimizer D̂* is unique if and only if σ_{r+1} ≠ σ_r.

)define lra 2a*


function dh = lra(d, r)
[u, s, v] = svd(d);
dh = u(:, 1:r) * s(1:r, 1:r) * v(:, 1:r)';

Let D̂* be an optimal solution of (LRA) and let B̂* be the optimal fitting model B̂* = image(D̂*).

The rank constraint in the matrix approximation problem (LRA) corresponds to the constraint in the line fitting problem (LF) that the model B̂ is a line passing through the origin (a subspace of dimension one), i.e., dim(B̂) = rank(D̂). We use the dimension of the model as a measure for its complexity and define the map lra_r : D ↦ D̂*, implemented by the function lra.

Let ‖D‖_* denote the nuclear norm of D, i.e., the sum of the singular values of D. Applying the nuclear norm heuristic to (LRA), we obtain the following convex relaxation

minimize over D̂ ∈ R^{q×N}  ‖D̂‖_*  subject to ‖D − D̂‖_F ≤ e.   (NNA)
)define nna 2b*
function dh = nna(d, e)
cvx_begin, cvx_quiet(true);
variables dh(size(d))
minimize norm_nuc(dh)
subject to
norm(d - dh, 'fro') <= e
cvx_end

The parameter e in (NNA) is a user-supplied upper bound on the approximation error ‖D − D̂‖_F.

Let D̂ be the solution of (NNA). Problem (NNA) defines the map nna_e : D ↦ D̂, implemented by the function nna_e.

The approximation nna_e(D) may have rank more than r, in which case (NNA) fails to identify a valid model B̂. However, note that nna_e(D) = 0 for e ≥ ‖D‖_F, so that for sufficiently large values of e, nna_e(D) is rank deficient. Moreover, the numerical rank num rank(nna_e(D)) of the nuclear norm approximation reaches r already for e ≪ ‖D‖_F. We are interested in characterizing the set

E := { e | num rank(nna_e(D)) ≤ r }.   (E)

We hypothesise that E is an interval

E = [e_nna, ∞).   (H)

The smallest value of the approximation error ‖D − nna_e(D)‖_F for which rank(nna_e(D)) ≤ r (i.e., for which a valid model exists) characterizes the effectiveness of the nuclear norm heuristic. We define

nna_r := nna_{e_nna},  where  e_nna := min { e | e ∈ E }.

A bisection algorithm for computing the limit of performance e_nna of the nuclear norm heuristic is given in Appendix A.

Another way to quantify the effectiveness of the nuclear norm heuristic is to compute the distance of the approximation nna_e(D) to the manifold of rank-r matrices

ε(e) = dist_r(nna_e(D)) := min over D̂ of ‖nna_e(D) − D̂‖_F  subject to rank(D̂) ≤ r.
)define dist 3a*


dist = @(d, r) norm(d - lra(d, r), 'fro');

The function e ↦ ε(e) presents a complexity vs accuracy trade-off in using the nuclear norm heuristic. The optimal rank-r approximation corresponds in the (ε, e) space to the point (0, e_lra), where

e_lra := dist_r(D) = ‖D − lra_r(D)‖_F.

The best model nna_r(D) identifiable by the nuclear norm heuristic corresponds to the point (0, e_nna). The loss of optimality incurred by the heuristic is quantified by the difference Δe_nna = e_nna − e_lra.
The following code defines a simulation example and plots the e ↦ ε(e) function over the interval [e_lra, 1.75 e_lra].
)Test line fitting 3b*
randn('seed', 0); q = 2; N = 10; r = 1;
d0 = [1; 1] * [1:N]; d = d0 + 0.1 * randn(q, N);
)define dist 3a*, e_lra = dist(d, r)
N = 20; Ea = linspace(e_lra, 1.75 * e_lra, N);
for i = 1:N
Er(i) = dist(nna(d, Ea(i)), r);
end
figure, plot(Ea, Er, 'o', 'markersize', 8)

The result is shown in Figure 1. In the example, e_nna = 0.4603 and e_lra = 0.3209.

Fig. 1. Distance of nna_e(D) to a linear model of complexity 1 as a function of the approximation error e.

Next, we compare the loss of optimality of the nuclear norm heuristic with those of two other heuristics: line fitting by minimization of the sum of squared vertical and horizontal distances from the data points to the fitting line, i.e., the classical method of solving an overdetermined linear system of equations in the least squares sense.
)Test line fitting 3b*+
dh_ls1 = [1; d(2, :) / d(1, :)] * d(1, :);
e_ls1 = norm(d - dh_ls1, 'fro')
dh_ls2 = [d(1, :) / d(2, :); 1] * d(2, :);
e_ls2 = norm(d - dh_ls2, 'fro')

The results are e_ls1 = 0.4546 and e_ls2 = 0.4531, which are both slightly better than the nuclear norm heuristic.

2.2 Exponential fitting

The problem considered in this section is fitting a time series y_d := (y_d(1), . . . , y_d(T)) by an exponential function c exp_z := (c z^1, . . . , c z^T) in the 2-norm sense, i.e.,

minimize over c ∈ R and z ∈ R  ‖y_d − c exp_z‖_2.   (EF)

The constraint that the sequence ŷ = (ŷ(1), . . . , ŷ(T)) is an exponential function is equivalent to the constraint that the Hankel matrix

H_L(ŷ) := [ ŷ_1 ŷ_2 ŷ_3 · · · ŷ_{T−L+1} ; ŷ_2 ŷ_3 · · · ŷ_{T−L+2} ; ⋮ ; ŷ_L ŷ_{L+1} · · · ŷ_T ],

where 1 < L < T − 1, has rank less than or equal to 1. Therefore, the exponential fitting problem (EF) is equivalent to the Hankel structured rank-1 approximation problem

minimize over ŷ ∈ R^T  ‖y_d − ŷ‖_2  subject to rank(H_L(ŷ)) ≤ 1.   (HLRA)

Problem (HLRA) has an analytic solution, see [De Moor, 1994, Sec. IV.C].

Lemma 1. The optimal solution of (HLRA) is ŷ* = c* exp_{z*}, where z* is a root of the polynomial equation

Σ_{t=1}^T t y_d(t) z^{t−1} · Σ_{t=1}^T z^{2t} − Σ_{t=1}^T y_d(t) z^t · Σ_{t=1}^T t z^{2t−1} = 0   (z*)

and

c* := ( Σ_{t=1}^T y_d(t) z_*^t ) / ( Σ_{t=1}^T z_*^{2t} ).   (c*)

The proof of the lemma and an implementation of a procedure


fit_exp for global solution of (HLRA), suggested by the
lemma, are given in Appendix B.
Applying the nuclear norm heuristic to problem (HLRA), we obtain the following convex relaxation

minimize over ŷ ∈ R^T  ‖H_L(ŷ)‖_*  subject to ‖y − ŷ‖_2 ≤ e.
)define nna_exp 3d*
function yh = nna_exp(y, L, e)
cvx_begin, cvx_quiet(true);
variables yh(size(y))
minimize norm_nuc(hankel(yh(1:L), yh(L:end)))
subject to
norm(y - yh) <= e
cvx_end

As in the line fitting problem, the selection of the parameter e can be done by a bisection algorithm. As in Section 2.1, we show the complexity vs accuracy trade-off curve and quantify the loss of optimality by the difference Δe_nna = e_nna − e_hlra between the optimal approximation error e_hlra, computed using the result of Lemma 1,
)define dist_exp 4a*
dist_exp = @(y) norm(y - fit_exp(y));
and the minimal error e_nna for which the heuristic identifies a valid model.
The following code defines a simulation example and plots the trade-off curve over the interval [e_hlra, 1.25 e_hlra].
)Test exponential fitting 4b*
randn('seed', 0); z0 = 0.4; c0 = 1; T = 10;
t = (1:T)'; y = c0 * (z0 .^ t) + 0.1 * randn(T, 1);
)define dist_exp 4a*, e_hlra = dist_exp(y)
N = 20; Ea = linspace(e_hlra, 1.25 * e_hlra, N);
L = round(T / 2);
for i = 1:N
Er(i) = dist_exp(nna_exp(y, L, Ea(i)));
end
ind = find(Er < 1e-6); es_nna = min(Ea(ind))
figure, plot(Ea, Er, 'o', 'markersize', 8)

The result is shown in Figure 2. In the example, e_nna = 0.3130 and e_hlra = 0.2734.

Fig. 2. Distance of ŷ = nna_{e_nna}(y) to an exponential model as a function of the approximation error e = ‖y − ŷ‖.

The performance of the nuclear norm heuristic depends on the parameter L. In the simulation example, we have fixed the value L = ⌈T/2⌉. Empirical results (see the following chunk of code and the corresponding plot in Figure 3) suggest that this is the best choice.
)Test exponential fitting 4b*+
Lrange = 2:(T - 1)
for L = Lrange
Er(L) = dist_exp(nna_exp(y, L, es_nna));
end
figure,
plot(Lrange, Er(Lrange), 'o', 'markersize', 8)

Fig. 3. Distance of ŷ = nna_e(y) to an exponential model as a function of the parameter L.

As in the line fitting problem, we compare the loss of optimality of the nuclear norm heuristic with an alternative heuristic method, Kung's method, see Kung [1978]. Kung's method is based on results from realization theory and balanced model reduction. Its core computational step is the singular value decomposition of the Hankel matrix H_L(y_d), i.e., unstructured low-rank approximation. The heuristic comes from the fact that the Hankel structure is not taken into account. For details about Kung's algorithm, we refer the reader to [Markovsky, 2012, Sect. 3.1]. For completeness, an implementation kung of Kung's method is given in Appendix C.
)Test exponential fitting 4b*+
e_kung = norm(y - kung(y, 1, L))

The obtained result is e_kung = 0.2742, which is much better than the result obtained by the nuclear norm heuristic.

2.3 Errors-in-variables system identification

The considered errors-in-variables identification problem is a generalization of the line fitting problem (LF) to dynamic models. The fitting criterion is the geometric distance (dist) and the model B is a single-input single-output linear time-invariant system of order n. Let w_d := (w_d(1), . . . , w_d(T)), where w_d(t) ∈ R², be the given trajectory of the system. The identification problem is defined as follows: given w_d and n,

minimize  dist(w_d, B̂)  subject to  B̂ is an LTI system of order n.   (SYSID)

The problem is equivalent to the following block-Hankel structured low-rank approximation problem

minimize over ŵ  ‖w_d − ŵ‖_2  subject to  rank([H_L(ŵ_1); H_L(ŵ_2)]) ≤ L + n,   (BHLRA)

for n < L < ⌈T/2⌉.
)define blk_hank 4e*
blk_hank = @(w, L) [hankel(w(1, 1:L), w(1, L:end))
                    hankel(w(2, 1:L), w(2, L:end))];

This is a nonconvex optimization problem, for which there are no efficient solution methods. Using the nuclear norm heuristic, we obtain the following convex relaxation

minimize over ŵ  ‖[H_L(ŵ_1); H_L(ŵ_2)]‖_*  subject to  ‖w_d − ŵ‖_2 ≤ e.

)define nna_sysid 5a*
function wh = nna_sysid(w, L, e)
)define blk_hank 4e*
cvx_begin, cvx_quiet(true);
variables wh(size(w))
minimize norm_nuc(blk_hank(wh, L))
subject to
norm(w - wh, 'fro') <= e
cvx_end

A lower bound to the distance from w to a trajectory of a linear time-invariant system of order n is given by the unstructured low-rank approximation of the block Hankel matrix.
)define dist_sysid 5b*
)define blk_hank 4e*, )define dist 3a*
dist_sysid = @(w, L, n) dist(blk_hank(w, L), L + n);

The following code defines a test example.
)Test system identification 5c*
)define dist_sysid 5b*
randn('seed', 0); rand('seed', 0); T = 20; n = 1;
sys0 = ss(0.5, 1, 1, 1, -1); u0 = rand(T, 1);
y = lsim(sys0, u0) + 0.1 * randn(T, 1);
w = [(u0 + 0.1 * randn(T, 1))'; y'];
N = 20; Ea = linspace(0.3, 1, N); L = 4;
for i = 1:N
Er(i) = dist_sysid(nna_sysid(w, L, Ea(i)), L, n);
end
ind = find(Er < 1e-6); es_nna = min(Ea(ind))
figure, plot(Ea, Er, 'o', 'markersize', 8)

The obtained trade-off curve is shown in Figure 4. The optimal model computed by the nuclear norm heuristic has corresponding approximation error e_nna = 0.7789.

Fig. 4. Distance of ŵ = nna_e(w) to a model of order 1 as a function of the approximation error e = ‖w − ŵ‖.
We have manually selected the value L = 4 as giving the best results (see Figure 5).
)Test system identification 5c*+
Lrange = (n + 1):floor(T / 2);
for L = Lrange
Er(L) = dist_sysid(nna_sysid(w, L, es_nna), L, n);
end
figure,
plot(Lrange, Er(Lrange), 'o', 'markersize', 8)

Fig. 5. Distance of ŵ = nna_e(w) to a model of order 1 as a function of the parameter L.

Next, we apply the N4SID method, implemented in function n4sid of the Identification Toolbox.
)Test system identification 5c*+
sysh = ss(n4sid(iddata(w(2, :)', w(1, :)'), n, 'nk', 0)); sysh = sysh(1, 1);

The distance from the data w_d (the variable w) to the obtained model B̂ (sysh) is computed by the function misfit, see Appendix D.
)Test system identification 5c*+
[e_n4sid, wh_n4sid] = misfit(w, sysh); e_n4sid

The approximation error achieved by the n4sid alternative heuristic method is e_n4sid = 0.3019. In this example, the subspace method produces a significantly better model than the nuclear norm heuristic.

3. CONCLUSIONS

The examples considered in the paper (line fitting in the geometric distance sense, optimal exponential fitting, and system identification) suggest that alternative heuristics (ordinary least squares, Kung's, and N4SID methods) are more effective in solving the original nonconvex optimization problems than the nuclear norm heuristic. Further study will focus on understanding the cause of the inferior performance of the nuclear norm heuristic and finding ways for improving it.

ACKNOWLEDGEMENTS
The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) / ERC Grant agreement number 258581 "Structured low-rank approximation: Theory, algorithms, and applications".
REFERENCES
J. Buckheit and D. Donoho. Wavelets and Statistics, chapter "Wavelab and reproducible research". Springer-Verlag, Berlin, New York, 1995.
E. Candès and B. Recht. Exact matrix completion via convex optimization. Found. Comput. Math., 9:717–772, 2009.
B. De Moor. Total least squares for affinely structured matrices and the noisy realization problem. IEEE Trans. Signal Proc., 42(11):3104–3113, 1994.
M. Fazel, H. Hindi, and S. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proc. American Control Conference, 2001.
M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming. stanford.edu/~boyd/cvx.
S. Kung. A new identification method and model reduction algorithm via singular value decomposition. In Proc. 12th Asilomar Conf. Circuits, Systems, Computers, pages 705–714, Pacific Grove, 1978.
I. Markovsky. Low Rank Approximation: Algorithms, Implementation, Applications. Springer, 2012.
I. Markovsky and S. Van Huffel. Overview of total least squares methods. Signal Processing, 87:2283–2302, 2007.
P. Van Overschee and B. De Moor. Subspace Identification for Linear Systems: Theory, Implementation, Applications. Kluwer, Boston, 1996.
Appendix A. BISECTION ALGORITHM FOR COMPUTING THE LIMIT OF PERFORMANCE OF NNA

Assuming that E is an interval (H) and observing that ‖D − lra_r(D)‖_F ≤ e_nna ≤ ‖D‖_F, we propose a bisection algorithm on e (see Algorithm 1) for computing e_nna.

Algorithm 1 Bisection algorithm for computing e_nna
Input: D, r, and convergence tolerance ε
e_l := ‖D − lra_r(D)‖_F and e_u := ‖D‖_F
repeat
  e := (e_l + e_u)/2
  if rank(nna_e(D)) > r then
    e_l := e
  else
    e_u := e
  end if
until rank(nna_e(D)) = r and e_u − e_l ≤ ε
return e
)bisection 6a*
function e = opt_e(d, r)
el = norm(d - lra(d, r), 'fro');
eu = norm(d, 'fro');
while 1,
e = mean([el eu]);
re = rank(nna(d, e), 1e-5); % numerical rank
if re > r, el = e; else, eu = e; end
if (re == r) && (eu - el < 1e-3), break, end
end

Appendix B. PROOF OF LEMMA 1 AND FUNCTION FOR GLOBAL SOLUTION OF (HLRA)

The fact that ŷ* is an exponential function c* exp_{z*} follows from the equivalence of (HLRA) and (EF). Setting the partial derivatives of the cost function

f(c, z) := Σ_{t=1}^T ( y_d(t) − c z^t )²

of (EF) w.r.t. c and z to zero, we have the following first order optimality conditions

∂f/∂c = 0  ⟹  Σ_{t=1}^T ( y_d(t) − c z^t ) z^t = 0,
∂f/∂z = 0  ⟹  Σ_{t=1}^T ( y_d(t) − c z^t ) t z^{t−1} = 0.

Solving the first equation for c gives (c*). The right-hand side of the second equation is a polynomial of degree 2T, and the resulting polynomial equation is (z*).
)Analytic solution of the exponential fitting problem 6b*
function [yh, ch, zh] = fit_exp(y)
t = (1:length(y))';
p1(t) = t .* y(t); p2(2 * t + 1) = 1;
p3(t + 1) = y(t); p4(2 * t) = t;
r = roots(conv(p1, p2) - conv(p3, p4));
r(r == 0) = []; z = 1 ./ r(imag(r) == 0);
for i = 1:length(z)
c(i) = (z(i) .^ t) \ y;
Yh(:, i) = c(i) * (z(i) .^ t);
f(i) = norm(y - Yh(:, i)) ^ 2;
end
[f_min, ind] = min(f);
zh = z(ind); ch = c(ind); yh = Yh(:, ind);

Appendix C. IMPLEMENTATION OF KUNG'S METHOD

)Kung method 6c*
function yh = kung(y, n, L)
[U, S, V] = svd(hankel(y(1:L), y(L:end)));
O = U(:, 1:n) * S(1:n, 1:n); C = V(:, 1:n)';
c = O(1, :); b = C(:, 1);
a = O(1:end - 1, :) \ O(2:end, :);
for t = 1:length(y)
yh(t) = c * (a ^ (t - 1)) * b;
end

Appendix D. DISTANCE COMPUTATION IN THE DYNAMIC CASE

The problem of computing the distance, also called misfit, from a time series to a linear time-invariant model is a convex quadratic optimization problem. The solution is therefore available in closed form. The linear time-invariant structure of the system, however, allows efficient O(T) computation; e.g., the Kalman smoother computes the misfit in O(T) flops, using a state-space representation of the system.
In [Markovsky, 2012, Section 3.2], a method based on what is called an image representation of the system is presented. The following implementation does not exploit the structure in the problem and the algorithm has computational cost O(T³).
)define misfit 6d*
function [M, wh] = misfit(w, sys)
)transfer function to image representation 6e*
T = size(w, 2); TP = blktoep(P, T);
wh = reshape(TP * (TP \ w(:)), 2, T);
M = norm(w - wh, fro);

The function blktoep (not shown) constructs a block Toeplitz


matrix and the conversion from transfer function to image
representation is done as follows:
)transfer function to image representation 6e*
[q, p] = tfdata(tf(sys), 'v');
P = zeros(2, length(p));
P(1, :) = fliplr(p);
P(2, :) = fliplr([q zeros(length(p) - length(q))]);

Identification of Box-Jenkins models using structured ARX models and nuclear norm relaxation !

Håkan Hjalmarsson, James S. Welsh, Cristian R. Rojas

Automatic Control Lab and ACCESS Linnaeus Centre, KTH Royal Institute of Technology (e-mail: hakan.hjalmarsson, cristian.rojas@ee.kth.se)
School of Electrical Engineering and Computer Science, University of Newcastle, Australia (e-mail: james.welsh@newcastle.edu.au)

Abstract: In this contribution we present a method to estimate structured high order ARX models. By this we mean that the estimated model, despite its high order, is close to a low order model. This is achieved by adding two terms to the least-squares cost function. These two terms correspond to nuclear norms of two Hankel matrices. These Hankel matrices are constructed from the impulse response coefficients of the inverse noise model and the numerator polynomial of the model dynamics, respectively. In a simulation study the method is shown to be competitive as compared to the prediction error method. In particular, in the study the performance degrades more gracefully than for the Prediction Error Method when the signal to noise ratio decreases.
1. INTRODUCTION
The fundamental problem of estimating a model for a linear dynamical system, possibly subject to (quasi-)stationary noise, has received renewed interest in the last few years. In particular, different types of regularization schemes have been in focus. An important contribution to these developments is Pillonetto and De Nicolao [2010], where a powerful approach rooted in machine learning is presented. In Chen et al. [2011] it is shown that this approach has close ties with ℓ2-regularization with a cleverly chosen penalty term. Another approach is to use structured low rank approximations. This typically leads to non-convex optimization problems for which local nonlinear optimization methods are used; see Markovsky [2008] for a survey. Recently, the nuclear norm has been used in this approach to obtain convex optimization problems. A number of contributions based on this idea have already appeared, e.g. Fazel et al. [2003], Grossmann et al. [2009a,b], Mohan and Fazel [2010], Liu and Vandenberghe [2009a,b]. In this contribution we add to this avalanche of new and exciting methods by introducing structured estimation of high order ARX models. Our contribution can be seen as an extension of the method presented in Grossmann et al. [2009a,b]. The extensions concern i) the possibility to also estimate a noise model in the same convex framework, ii) a way to choose the regularization parameter, and iii) a quite extensive simulation study. In the simulation study the new method compares favourably to the prediction error method for scenarios where the signal to noise ratio (SNR) is poor.
The outline of the paper is as follows. In Section 2 the problem under consideration is discussed. Section 3 presents our approach, and in Section 4 some numerical comparisons with state-of-the-art algorithms are presented. Section 5 concludes the paper.
⋆ This work was supported in part by the European Research Council under the advanced grant LEARN, contract 267381, and in part by the Swedish Research Council under contract 621-2009-4017.
2. THE PROBLEM
Consider the discrete-time linear time-invariant system with input u(t) and output y(t) given by
y(t) = G_o(q)u(t) + H_o(q)e_o(t)    (1)
where {e_o(t)} is zero mean white noise with variance λ_e. The input to output relationship is given by the rational transfer function
G_o(q) := B_o(q)/F_o(q) = (b_1^o q^{-1} + ... + b_{n_o}^o q^{-n_o}) / (1 + f_1^o q^{-1} + ... + f_{n_o}^o q^{-n_o}),
and the coupling between the noise and output is given by
H_o(q) := C_o(q)/D_o(q) = (1 + c_1^o q^{-1} + ... + c_{m_o}^o q^{-m_o}) / (1 + d_1^o q^{-1} + ... + d_{m_o}^o q^{-m_o}).
Here q^{-1} is the time-shift operator, q^{-1}u(t) = u(t − 1). Thus the system can be written as
y(t) = (B_o(q)/F_o(q)) u(t) + (C_o(q)/D_o(q)) e_o(t).    (2)
A parametric model of (1) is given by the Box-Jenkins structure
y(t) = (B(q)/F(q)) u(t) + (C(q)/D(q)) e(t)    (3)
where B(q), F(q), C(q) and D(q) are polynomials
B(q) = b_1 q^{-1} + ... + b_n q^{-n}    (4)
F(q) = 1 + f_1 q^{-1} + ... + f_n q^{-n}    (5)
C(q) = 1 + c_1 q^{-1} + ... + c_m q^{-m}    (6)
D(q) = 1 + d_1 q^{-1} + ... + d_m q^{-m}    (7)
whose coefficients b_1, ..., d_m are collected in the vector θ ∈ R^{2n+2m}. This parameter vector can, e.g., be estimated using prediction error identification [Ljung, 1999]. A disadvantage with prediction error identification using the model structure (3) is that in general the criterion is non-convex and thus the numerical optimization may converge to a local minimum.
One model structure for which there exists an explicit expression for the parameter estimate is the ARX structure
y(t) = (B(q)/A(q)) u(t) + (1/A(q)) e(t)    (8)
where
A(q) = 1 + a_1 q^{-1} + ... + a_n q^{-n}.    (9)
Even though the system (1) is not captured by this structure, by letting n increase, G_o can be well approximated by B/A and H_o can be well approximated by 1/A [Ljung and Wahlberg, 1992]. This structure thus has some very attractive features and is extensively used, e.g. in industrial practice [Zhu, 2001]. A drawback with this structure is that the variance error, i.e. the error induced by the noise e_o, increases linearly with the number of parameters [Ljung and Wahlberg, 1992]. Thus the accuracy can be significantly worse when compared with using the Box-Jenkins structure (3) (with n = n_o, m = m_o). In order to make a distinction with the desired low order model, with n parameters in B and F and m parameters in C and D, we will use the notation n_ho to denote the number of parameters in the B and A polynomials when a high-order ARX model is used¹.
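For concreteness, the explicit ARX estimate mentioned above is an ordinary linear least-squares problem. The following MATLAB sketch (illustrative only; the function and variable names are assumptions, not the authors' code) forms the regression matrix of a high-order ARX model and solves for its coefficients:

function [a, b] = arx_ls(y, u, nho)
% Least-squares estimate of an ARX model A(q)y(t) = B(q)u(t) + e(t),
% with A and B both of order nho; y and u are assumed to be column vectors.
N = length(y);
Phi = zeros(N - nho, 2 * nho);
for k = 1:nho
    Phi(:, k)       = -y(nho + 1 - k : N - k);   % past outputs
    Phi(:, nho + k) =  u(nho + 1 - k : N - k);   % past inputs
end
theta = Phi \ y(nho + 1 : N);                    % explicit least-squares solution
a = [1; theta(1:nho)];                           % A(q) coefficients
b = [0; theta(nho + 1:end)];                     % B(q) coefficients (b_0 = 0)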
3. STRUCTURED ARX ESTIMATION
A model equivalent to (3) is given by [Kailath, 1980]
y(t) = Σ_{k=1}^∞ b_k u(t−k) + e(t) + Σ_{k=1}^∞ h_k e(t−k),
Rank{Hankel(b)} = n,
Rank{Hankel(h)} = m,    (10)
where Hankel(x) is the infinite Hankel matrix with x_1, x_2, x_3, ... in the first column. Prediction error identification of a Box-Jenkins model (3) using output/input data {y(t), u(t)}_{t=1}^N can thus be stated as
min_{b_1,b_2,...,h_1,h_2,...}  Σ_{t=1}^N [ (1 + Σ_{k=1}^∞ h_k q^{-k})^{-1} ( y(t) − Σ_{k=1}^∞ b_k u(t−k) ) ]²
s.t.  Rank{Hankel(b)} = n,  Rank{Hankel(h)} = m.    (11)
There are several problems associated with (11) from an optimization point of view. Firstly, it is non-convex in {h_k}. Furthermore, the rank constraints are non-convex. An interesting convex rank heuristic is obtained using the nuclear norm [Fazel et al., 2001] (see Fazel et al. [2003] for its connection to another heuristic, the log-det heuristic), which for a matrix X ∈ R^{n×m} is given by
‖X‖_* = Σ_{i=1}^{min(n,m)} σ_i(X)
where {σ_i(X)} are the singular values of X. The nuclear norm is also known as the Schatten 1-norm and the Ky Fan n-norm, and it can be obtained as the solution to the semi-definite program (SDP)
‖X‖_* = min_{Y,Z} (1/2) Tr{Y + Z}  s.t.  [ Y  X ; X'  Z ] ⪰ 0.
The nuclear norm can be seen as the matrix extension of the ℓ1-norm, known to give sparse estimates when used in estimation [Tibshirani, 1996].
¹ It is also easy to generalize our method so that different numbers of parameters in B and A are used.
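As a quick numerical check of this characterization (not from the paper; a minimal sketch assuming the CVX toolbox is available), the nuclear norm can be computed both from the singular values and from the SDP above:

X = randn(5, 3);
nn_svd = sum(svd(X));                 % nuclear norm as the sum of singular values
cvx_begin sdp quiet
    variable Y(5, 5) symmetric
    variable Z(3, 3) symmetric
    minimize( 0.5 * (trace(Y) + trace(Z)) )
    [Y, X; X', Z] >= 0;               % linear matrix inequality of the SDP characterization
cvx_end
% cvx_optval and nn_svd agree up to solver tolerance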
In the following we will show how to use ARX models (8) of
high order together with the nuclear norm to approximate
(11). We proceed by first studying a simpler problem and
then in a second step we address the full problem.
3.1 Estimation of structured FIR models
Assume that it is known that H_o(q) = 1, so that no noise model has to be estimated. Then (11) simplifies to
min_{b_1,b_2,...}  Σ_{t=1}^N ( y(t) − Σ_{k=1}^∞ b_k u(t−k) )²
s.t.  Rank{Hankel(b)} = n.    (12)
A straightforward convex approximation of this problem is to truncate the impulse response {b_k}, i.e. to set b_k = 0 for k > n_ho for some sufficiently large integer n_ho, and to replace the rank constraint by a constraint on the nuclear norm of the finite dimensional Hankel matrix with {b_k}_{k=1}^{n_ho} as unique elements. We denote this Hankel matrix by H_{n_ho}(b). A variation of this approach is to add the nuclear norm as a regularization term to the cost function. This leads to the following SDP:
min_{b_1,...,b_{n_ho}, Y_b, Z_b}  Σ_{t=1}^N ( y(t) − Σ_{k=1}^{n_ho} b_k u(t−k) )² + (λ_b/2) Tr{Y_b + Z_b}
s.t.  [ Y_b  H_{n_ho}(b) ; H_{n_ho}'(b)  Z_b ] ⪰ 0.    (13)
This approach has been used in Grossmann et al. [2009a,b], where in addition missing data were included as free parameters. We will call this method NUC-FIR. The regularization parameter λ_b can be determined by cross-validation. We refer to the next section for details.
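A minimal CVX sketch of the NUC-FIR criterion (this is not the authors' implementation; CVX and all names are assumptions) exploits the fact that the Tr{Y_b + Z_b} term with its LMI in (13) evaluates exactly the nuclear norm, so the built-in norm_nuc can be used:

% y, u: column vectors of length N; nho: FIR order; lambda_b: regularization parameter.
nho = 20; lambda_b = 1;
p = ceil(nho / 2); q = nho - p + 1;
U = toeplitz([0; u(1:end-1)], zeros(1, nho));   % U(t,k) = u(t-k), zero initial conditions
cvx_begin quiet
    variable b(nho)
    expression H(p, q)                          % Hankel matrix H_nho(b), affine in b
    for i = 1:p
        for j = 1:q
            H(i, j) = b(i + j - 1);
        end
    end
    minimize( sum_square(y - U * b) + lambda_b * norm_nuc(H) )
cvx_end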
3.2 Estimation of structured ARX models
We cannot directly extend the idea in the preceding subsection to also include a noise model since, as already noted, (11) is non-convex in the impulse response coefficients of the noise dynamics {h_k}. However, we have also observed that prediction error identification of ARX models is convex and that such models can approximate any rational system and noise dynamics arbitrarily well just by picking the polynomial order n large enough. Such

Stable Nonlinear System Identification: Convexity, Model Class, and Consistency

Ian R. Manchester ∗ Mark M. Tobenkin ∗∗ Alexandre Megretski ∗∗

∗ Aerospace, Mechanical and Mechatronic Engineering, University of Sydney, Australia
∗∗ Electrical Engineering and Computer Science, Massachusetts Institute of Technology, USA
e-mail: {irm, mmt, ameg}@mit.edu

Abstract: Recently a new approach to black-box nonlinear system identification has been introduced which searches over a convex set of stable nonlinear models for the one which minimizes a convex upper bound of long-term simulation error. In this paper, we further study the properties of the proposed model set and identification algorithm and provide two theoretical results: (a) we show that the proposed model set includes all quadratically stable nonlinear systems, as well as some more complex systems; (b) we study the statistical consistency of the proposed identification method applied to a linear system with noisy measurements. It is shown that a modification related to instrumental variables gives consistent parameter estimates.
1. INTRODUCTION
Building approximate models of dynamical systems from
data is a ubiquitous task in the sciences and engineering.
Black-box modeling in particular plays an important role
when first-principles models are either weakly identifiable,
too complicated for the eventual application, or simply
unavailable (see, e.g., Sjoberg et al. [1995], Suykens and
Vandewalle [1998], Giannakis and Serpedin [2001], Ljung
[2010] and references therein).
Model instability, prevalence of local minima, and poor
long-term (many-step-ahead) prediction accuracy are common difficulties when identifying black-box nonlinear models (Ljung [2010], Schon et al. [2010]). Recently a new
approach labelled robust identification error (RIE) was
introduced which searches over a convex set of stable
nonlinear models for the one which minimizes a convex
upper bound of long-term simulation error (Tobenkin et al.
[2010]). In Manchester et al. [2011] the method was extended to systems with orbitally stable periodic solutions.
Earlier related approaches appeared in Sou et al. [2008],
Bond et al. [2010], Megretski [2008].
Since both the model set and the cost function of the
RIE involve convexifying relaxations, it is important to
study the degree of conservatism so introduced. In this
paper, we show that the proposed model set includes all
quadratically contracting systems, as well as some more
complex models. We also study the statistical consistency
of the proposed identification method applied to a linear
system with noisy measurements. It is shown that the
estimator can give biased parameter estimates, but a
modification related to instrumental variables recovers
statistical consistency.
⋆ This work was supported by National Science Foundation Grant No. 0835947.

We focus on a particular implicit form of state-space models:
e(x(t + 1)) = f(x(t), u(t)),    (1)
y(t) = g(x(t), u(t)),    (2)
where e : R^n → R^n, f : R^n × R^m → R^n, and g : R^n × R^m → R^k are continuously differentiable functions such that the equation e(z) = w has a unique solution z ∈ R^n for every w ∈ R^n. This choice of implicit equations provides important flexibility for the convex relaxations.
Models are used for a wide variety of purposes, each with its own characteristic measure of performance; however, for a large class of problems an appropriate measure is simulation error, i.e.
E = Σ_{t=0}^T |ỹ(t) − y(t)|²,
where y is the simulated output from the model and ỹ is the recorded real system output, with each subjected to the same input ũ(t) and the same initial conditions.
When the true states x̃(t) can be measured or estimated, a standard approach is to consider equation error:
ε_x(t) = e(x̃(t + 1)) − f(x̃(t), ũ(t)),    (3)
ε_y(t) = ỹ(t) − g(x̃(t), ũ(t)).    (4)
If e, f and g are linearly parametrized, then minimizing, e.g., Σ_t (|ε_x|² + |ε_y|²) amounts to basic least squares. However, minimizing ε_x and ε_y does not give any guarantees about the simulation error, since models of this form are recursive by nature and modelling errors will accumulate and dissipate as the simulation progresses. In particular, there is no guarantee that a model which has an optimal fit in terms of equation error will even be stable.

Ideally, we would search over all stable nonlinear models (1), (2) of limited complexity 1 for one which minimizes simulation error. There are two major difficulties
which render this impossible in general: firstly, we have no
tractable parameterization of all stable nonlinear models
of limited complexity; secondly, even supposing we are
given such a parameterization, the simulation error is a
highly nonlinear function of f and g, making the associated optimization problem very difficult. Indeed, in Ljung
[2010] stability of models and convexity of optimization
criteria are listed among the major open problems in
nonlinear system identification.
1.1 Convex Upper Bounds for Simulation Error
We will start with the second problem: optimization of simulation error. The main difficulty with simulation error is that it depends on y(t), the result of solving a difference or differential equation over a long time interval. Even if some finite-dimensional linearly-parameterized class of functions is used to represent f and g, the simulated output will be a highly nonlinear function of those parameters.
When analyzing stability or performance of a dynamical system it is common to make use of a storage function and a dissipation inequality, and we follow the same approach here. The advantage is that a condition on the solution of a dynamical system can be checked or enforced via pointwise conditions on the system equations. In particular, let Δ = x − x̃ be the difference between the state of the model and the true state of the system. Suppose we find some positive definite² function V(Δ) satisfying
V(Δ(t + 1), t + 1) − V(Δ(t), t) ≤ r(t) − |ỹ(t) − y(t)|²    (5)
with y(t) = g(x̃(t) + Δ(t), ũ(t)) and r(t) a slack variable.
dissipation inequality to time T and obtain
E :=

T
X
t=0

|y(t)

y(t)|2

T
X

r(t).

t=0
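To make the summation step explicit (a short derivation added here for clarity, using only (5), $\Delta(0)=0$ and the positive definiteness of $V$):
$$\sum_{t=0}^{T}\big[V(\Delta(t+1),t+1)-V(\Delta(t),t)\big]=V(\Delta(T+1),T+1)-V(\Delta(0),0)\ \ge\ 0,$$
so summing (5) over $t=0,\dots,T$ yields
$$0\ \le\ \sum_{t=0}^{T} r(t)\;-\;\sum_{t=0}^{T}\lvert\tilde y(t)-y(t)\rvert^{2}.$$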

This would suggest that to minimize simulation error we can search over functions e, f, g, and V, and try to minimize Σ r(t) while satisfying the dissipation inequality (5). There are two problems. Firstly, the dissipation inequality made use of the particular Δ(t) which comes from the model simulation. This can be ensured by simply requiring that (5) hold for all Δ(t), which adds a certain robustness at the expense of conservatism.
The second problem is more challenging: it is that the condition (5) is not jointly convex in V, e, and f. For example, the term V(Δ(t+1), t+1) is a composition of V and Δ(t+1), which is given by x(t+1) − x̃(t+1) = f(x̃(t) + Δ(t), ũ(t)) − x̃(t+1). Hence (5) contains a composition of V and f, which is nonconvex. In Tobenkin et al. [2010] a convex upper bound was derived that can be optimized via semidefinite programming.
¹ It is difficult to give a general definition of limited complexity, but for a concrete example consider f(x, u) and g(x, u) to be multivariate polynomials up to some fixed degree.
² Here positive definite means that V(0, t) = 0 and V(Δ, t) > 0 otherwise, for all t.

1.2 Convex Classes of Stable Nonlinear Models


Another central contribution of Tobenkin et al. [2010] is the model class itself: a particular convex parametrization of stable nonlinear models was presented which can be represented by sum-of-squares constraints Parrilo [2000]. There are some subtleties and a few different formulations which will be detailed in later sections, but the thrust is that the storage function V(Δ, t) used to bound simulation error doubles as a contraction metric for the identified model (Lohmiller and Slotine [1998]).
Characterizing the richness of a nonlinear model class
is difficult both conceptually and computationally. It was
shown in Tobenkin et al. [2010] that all stable linear
models are included in the model class. In this work we
extend those results to show inclusion of all quadratically
stable nonlinear systems, and at least some systems more
complex than this.
2. ROBUST IDENTIFICATION ERROR
In this section we recap a few relevant points from Tobenkin et al. [2010].
The global RIE for a model (1),(2) is a function of the coefficients of (1),(2), a single data sample
z̃ = (ṽ, ỹ, x̃, ũ) ∈ R^n × R^k × R^n × R^m,    (6)
where ṽ(t) = x̃(t+1), i.e. a time-shifted measurement of the state, and an auxiliary parameter Q = Q' > 0, a positive definite symmetric n-by-n matrix³:
E_Q(z̃) = sup_{Δ∈R^n} { |f(x̃ + Δ, ũ) − e(ṽ)|²_Q − |Δ_e|²_Q + |Δ_y|² },    (7)
where |a|²_Q is shorthand for a'Qa, and
Δ_y = g(x̃ + Δ, ũ) − ỹ,    Δ_e = e(x̃ + Δ) − e(x̃).    (8)

Note that (7) corresponds to (5) with a particular choice of V = |Δ_e|²_Q. Then of course the following theorem suggests minimizing the sum of E_Q(z̃) with respect to e, f, g and Q:
Theorem 1. The inequality
E ≤ Σ_{t=1}^N E_Q(z̃(t))    (9)
holds for every Q = Q' > 0.

Unfortunately this is still nonconvex; however, we have the following convex upper bound: let
Δ_v = f(x̃ + Δ, ũ) − e(ṽ),
then E_Q(z̃) ≤ Ê_Q(z̃) where
Ê_Q(z̃) = sup_{Δ∈R^n} { |Δ_v|²_Q + |Δ|²_P − 2Δ'Δ_e + |Δ_y|² },    (10)
and P = Q^{-1}. The function Ê_Q(z̃) serves as an upper bound for E_Q(z̃) that is jointly convex with respect to e, f, g, and P = Q^{-1} > 0.
³ For convenience, we only indicate the dependence on z̃ and Q; we will occasionally write Ê_Q[e, f, g] to make the dependence on (e, f, g) explicit.

2.1 Convex Set of Stable Models

The class of models we search over can be referred to as incrementally output ℓ2 stable. In particular, we consider the model (1),(2) stable if for every input sequence ũ : Z+ → R^m and pair of initial conditions x_0^0, x_0^1 ∈ R^n, the solutions (x, y) = (x^0, y^0) and (x, y) = (x^1, y^1) defined by (1),(2) with u(t) ≡ ũ(t), x^0(1) = x_0^0 and x^1(1) = x_0^1 satisfy
Σ_{t=1}^∞ |y^0(t) − y^1(t)|² < ∞.
By minor variation of the work of Lohmiller and Slotine [1998], this form of stability can be verified by the condition
|F(x, u)Δ|²_Q − |E(x)Δ|²_Q + |G(x, u)Δ|² ≤ 0,    (11)
where capital letters denote Jacobians of e, f and g with respect to x. Here, E(x)'QE(x) is a contraction metric for the model.
However, as with the RIE, this condition is nonconvex with respect to E(x), but a similar relaxation results in the following sufficient condition:
|F(x, u)Δ|²_Q − |Δ|²_{E(x)+E(x)'−Q^{-1}} + |G(x, u)Δ|² ≤ 0,    (12)
for all x, Δ ∈ R^n and u ∈ R^m. This condition is jointly convex in (e, f, g) and Q^{-1}.
The well-posedness of the state dynamics equation (1) is guaranteed when the function e : R^n → R^n is a bijection.
Theorem 2. Let e : R^n → R^n be a continuously differentiable function with a uniformly bounded Jacobian E(x), satisfying
E(x) + E(x)' ⪰ 2r_0 I,  ∀x ∈ R^n    (13)
for some fixed r_0 > 0. Then e is a bijection.
In the case when e, f and g are polynomials, the convex constraints and optimization objectives in this section can be handled via semidefinite programming and the sum-of-squares relaxation Parrilo [2000].

3. RESULTS ON ACCURACY OF RELAXATIONS

For both the RIE and our class of stable models we have described convex upper bounds on non-convex functions. It is natural to ask how tight these upper bounds will be, i.e. how close will the optimizer of the relaxed problems be to attaining the optimal value of the original RIE problems, and what characterizes models which potentially satisfy the unrelaxed stability constraint (11) but have no parameterization satisfying the relaxed constraint (12). In this section we offer partial answers to these questions. In particular, we will demonstrate that when e(x) is linear our bounds and model class are tight in a sense we will make precise below.
As a corollary, we will show our model class contains every quadratically stable nonlinear system and in particular contains all stable linear systems of appropriate dimensions. We will also demonstrate, via a simple example, that the ability to choose e(x) to be nonlinear allows the method to describe systems which are not quadratically stable and offers a substantially richer model class.

3.1 Tightness of the RIE Relaxation

The following lemma describes the tightness of the relaxation of the RIE when e(x) is linear.
Lemma 3. Let (e, f, g) be functions as in (1),(2) and Q = Q' ∈ R^{n×n} be positive definite. If e(x) is linear and invertible (i.e. e(x) = Ex for an invertible E ∈ R^{n×n}), then there exists a symmetric positive definite Q̃ ∈ R^{n×n}, an invertible linear function ẽ : R^n → R^n and a function f̃ : R^n × R^m → R^n such that
E_Q[e, f, g] = E_Q̃[ẽ, f̃, g] = Ê_Q̃[ẽ, f̃, g],
and e^{-1} ∘ f = ẽ^{-1} ∘ f̃.
An identical statement holds for the global RIE (E_Q) and its relaxation (Ê_Q), proven by the same choices for ẽ, f̃, Q̃.
Proof. Examine the choices:
Q̃^{-1} = E'QE,  ẽ(x) = Q̃^{-1}x,  f̃(x, u) = Q̃^{-1}E^{-1}f(x, u).
Then we see the arguments of the supremum in the definition of Ê identically match those in the definition of E.

3.2 Coverage of Quadratically Stable Models

Let f_0 : R^n × R^m → R^n and g : R^n × R^m → R^k. We call the system
x(t + 1) = f_0(x(t), u(t)),    (14)
y(t) = g(x(t), u(t)),    (15)
quadratically incrementally ℓ2 output stable (or, for our purposes, simply quadratically stable) if there exists an M ∈ R^{n×n} with M = M' > 0 such that
|Δ|²_M ≥ |f_0(x + Δ, u) − f_0(x, u)|²_M + |g(x + Δ, u) − g(x, u)|²    (16)
holds for all x, Δ ∈ R^n and u ∈ R^m. The following lemma describes how any quadratically stable system, in principle, belongs to some model class described by the relaxed stability constraint (12).
Lemma 4. Let f_0 and g be continuously differentiable functions defining the system (14),(15). Then there exist e : R^n → R^n, f : R^n × R^m → R^n and Q ∈ R^{n×n} such that e(x) is linear and invertible, f_0 = e^{-1} ∘ f, Q = Q' > 0, and the relaxed stability constraint (12) is satisfied.
Proof. As (f_0, g) is quadratically stable there exists an M such that (16) is satisfied for all x, Δ ∈ R^n and u ∈ R^m. The differentiability of f and g then guarantees that
|Δ|²_M ≥ |F_0(x, u)Δ|²_M + |G(x, u)Δ|²    (17)
also holds for all x, Δ ∈ R^n and u ∈ R^m. Then taking e(x) = Mx and Q = M^{-1} gives E(x) = M, F(x, u) = MF_0(x, u). Now, substituting these into (12) recovers (17); hence the model is in the relaxed model class.
As stable linear systems are automatically quadratically stable in the above sense, we arrive at the following corollary:
Corollary 5. For any k-input, m-output, stable linear discrete-time state-space model with n states there exists a positive definite symmetric Q ∈ R^{n×n} and an equivalent state-space model (1),(2) given by linear (e, f, g) such that (12) is satisfied.
This corollary is closely connected to the results of Lacy and Bernstein [2003].
We now take a brief moment to explore the merits of searching over nonlinear e(x) through a simple example. It is easy to see that the unrelaxed stability constraint implies quadratic stability when e(x) is linear. We examine systems (14),(15) where g(x(t), u(t)) = x(t) ∈ R. In this case we see that the quadratic stability of (f_0, g) implies that for any fixed u the map f_0(·, u) is a contraction map. This restricts solutions to grow closer in Euclidean norm at every step of their evolution. In this setting, the relaxed stability constraint (12) admits systems which are not contraction maps. For example, the system (1) with
e(x) = x + (1/5)x⁵,  f(x, u) = (1/3)x³,
satisfies (12) with Q = 1, but is not a contraction map in a neighborhood of x = 3/2.
4. A CONSISTENT ESTIMATOR OF LINEAR MODELS
The work of Tobenkin et al. [2010] did not consider the effects of noise on the proposed fitting procedure. The objectives thus far described are best suited to identification problems where the effect of noise can be reduced via preprocessing, or model-order reduction from simulation. In this section we describe a consistent estimator for stable linear models with an output-error like noise model, under quite weak statistical assumptions.
We first present the proof that, for linear models, i.e. models of the form
Ex(t + 1) = Fx(t) + Ku(t),    (18)
y(t) = Gx(t) + Lu(t),    (19)
with E, F ∈ R^{n×n}, K ∈ R^{n×m}, G ∈ R^{k×n}, L ∈ R^{k×m} and E invertible, RIE minimization depends on the data only through its correlation matrix. This observation is of computational interest as, in this case, RIE minimization will require only d = 2n + m + k nonlinear convex constraints regardless of the number of observations. It is also the basis of the consistent estimator to follow.
Lemma 6. Let Z = {z(t_i)}_{i=1}^N ⊂ R^d. Define
W := (1/N) Σ_{i=1}^N z(t_i) z(t_i)'.    (20)

Let θ = (E, F, G, K, L) be as in (18),(19) and Q = Q' ∈ R^{n×n} be a positive definite matrix. For any such matrices such that
R̂_dt := F'QF + G'G − E − E' + Q^{-1} < 0
there exists a matrix H = H' ∈ R^{d×d} such that
F_Q(W, θ) := tr(WH) = (1/N) Σ_{i=1}^N E⁰_Q(z(t_i)),    (21)
for all data sets Z. Further, for every positive semidefinite W = W' in R^{d×d} there exist {q_i}_{i=1}^d ⊂ R^d, depending only on W, such that
F_Q(W, θ) = Σ_{i=1}^d E⁰_Q(q_i).    (22)

dt < 0 the supremum in the definition of


Proof. When R
0

EQ can be calculated explicitly as:


0
2
F Q ex
0
EQ
(z) = | ex |2Q + | ey |2 +
.
(23)
0
G ey ( R ) 1
dt

The above expression is a homogenous quadratic form


in z, i.e. there exists a symmetric matrix H such that
0
EQ
(z) = z 0 Hz. We can then conclude:
! !
N
N
1 X 0
1 X 0
z Hz = tr
zz H = tr(W H).
N i=1
N i=1

The second claim follows by taking an eigenvector decomposition of W and reversing the above identities.

We now construct our consistent estimator. Our approach is to approximate a noiseless empirical covariance between the inputs, outputs and states through multiple observations of the system subjected to the same input; this is similar in spirit to instrumental variables and other correlation approaches Ljung [1999]. We then choose a model which minimizes the RIE treating the approximate correlation as its input as in Lemma 6.
More formally, we assume that, for each r ∈ {1, 2} and t ∈ Z+, our experiments are generated by a linear system:
x^(r)(t + 1) = A_0 x^(r)(t) + B_0 ũ(t),    (24)
y^(r)(t) = C_0 x^(r)(t) + D_0 ũ(t),    (25)
with unknown initial conditions x^(r)(1) and identical input ũ : Z+ → R. The data we obtain is corrupted by noise sequences w^(r)(t) ∈ R^{n+k}:
[ x̃^(r)(t) ; ỹ^(r)(t) ] = [ x^(r)(t) ; y^(r)(t) ] + w^(r)(t).    (26)
Note that in this setting direct observation of the state is not particularly restrictive, as we can use the recent input and output history for x̃(t) and assume the system (24),(25) is in an observable canonical form.
The RIE produces implicit models of the form (18),(19). We denote the parameters of such a model by θ ∈ R^{n×n} × R^{n×n} × R^{n×m} × R^{k×n} × R^{k×m}:
θ := (E, F, K, G, L).
The implicit form means there are many θ corresponding to the same linear system. We define S(θ) to be the map taking implicit models to their explicit form:
S(θ) = [ E^{-1}F  E^{-1}K ; G  L ].
In this notation, we will present an estimator for S(θ_0) where θ_0 = (I, A_0, B_0, C_0, D_0).

Given a data set {(z̃^(1)(t), z̃^(2)(t))}_{t=1}^N with
z̃^(r)(t) = [ x̃^(r)(t + 1)'  ỹ^(r)(t)'  x̃^(r)(t)'  ũ(t)' ]',
our estimator is defined as follows. Compute a symmetrized cross-correlation Ŵ_N:
Ŵ_N = (1/2N) Σ_{t=1}^N [ z̃^(1)(t) z̃^(2)(t)' + z̃^(2)(t) z̃^(1)(t)' ].    (27)
Define W̄_N by
W̄_N = Ŵ_N + max{0, −λ_min(Ŵ_N)} I.
Clearly W̄_N is symmetric and positive semidefinite. Let (θ̂_N, Q̂_N) be any minimizer of F_Q(W̄_N, θ) subject to the constraints that R̂_dt ⪯ −I and Q = Q' > 0. Our estimator is then given by S(θ̂_N).
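A minimal sketch of the data-dependent part of this estimator (only the correlation matrices; the convex program over (θ, Q) is not shown, and all variable names are illustrative assumptions):

% z1, z2: d-by-N arrays whose columns are z^(1)(t) and z^(2)(t), t = 1..N.
[d, N] = size(z1);
What = (z1 * z2' + z2 * z1') / (2 * N);                          % symmetrized cross-correlation (27)
Wbar = What + max(0, -min(eig((What + What') / 2))) * eye(d);    % shift eigenvalues to make it PSD
% Wbar is then passed to the semidefinite program minimizing F_Q(Wbar, theta).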
The following statement describes conditions under which this estimator converges:
Theorem 7. Let (A_0, B_0) be stable and controllable. Given an input ũ : Z+ → R and observations {(z̃^(1)(t), z̃^(2)(t))}_{t=1}^∞, let W_N = (1/N) Σ_{t=1}^N z^(1)(t) z^(1)(t)' (i.e. a noiseless empirical correlation). If lim sup_N ‖W_N‖²_F < ∞ and the following conditions hold:
lim inf_N  λ_min( (1/N) Σ_{t=1}^N [x^(i)(t); ũ(t)] [x^(i)(t); ũ(t)]' ) ≥ ε,    (28)
lim_{N→∞} (1/N) Σ_{t=1}^N w^(1)(t) w^(2)(t)' = 0,    (29)
lim_{N→∞} (1/N) Σ_{t=1}^N w^(i)(t) z^(j)(t)' = 0,    (30)
for (i, j) = (1, 2) and (i, j) = (2, 1) and some K, ε > 0, then
lim_{N→∞} S(θ̂_N) = S(θ_0),
where, again, θ_0 = (I, A_0, B_0, C_0, D_0).


A proof of this theorem is given in the appendix. The following theorem follows from this proof with minimal modifications:
Theorem 8. Let ũ(t), w^(i)(t) be stochastic processes. If condition (28) holds almost surely and the limits (29),(30) converge in probability, then S(θ̂_N) converges in probability to S(θ_0).
The boundedness of W_N can be easily satisfied, and the condition (28) is a standard persistence of excitation condition. The remaining conditions require a lack of correlation between noise sequences on separate experiments and between the noise sequences and the noiseless data, and can be satisfied by assuming the input is bounded and independent of the noise, and that the noise satisfies certain bounds on its moments and autocorrelations.
It is important to note that, unlike other consistent estimators (such as equation error), models identified by the RIE based on a finite number of samples are still guaranteed to be stable.
4.1 Illustrative Example
To illustrate the utility of the above estimation procedure, let us consider a simple second-order example, with DC gain of 1 and a resonant pole pair with natural frequency 3 rad/s and damping factor 0.15. Given two highly noisy data records of this system responding to the same input, but with independent noise, there are three simple ways they could be used: simply concatenated and treated as one long data record; the output could be averaged at each time to reduce noise; or the RIE could be applied with the above mixed-correlation based modification.
Figure 1 shows the results of fitting with these three methods after 200 and 2000 samples, respectively. It is clear that the mixed-correlation approach is by far the best at recovering the resonant peak, whereas the other approaches seem to generate models which are too stable.
Fig. 1. Bode magnitude plots of several estimation strategies for a second-order system after 200 (upper) and 2000 (lower) samples.
5. CONCLUSION
The RIE is a new approach to nonlinear system identification. It allows one to search over a convex set of stable nonlinear models for one which minimizes a convex upper bound of long-term simulation error. The resulting optimization problem is a semidefinite program.
In order to convexify the model class and the optimization objective, a number of relaxations and approximations were necessary. The objective of this paper was to shed light on the degree of approximation introduced. In particular, it is shown that the set of nonlinear models we search over in principle includes all quadratically stable systems (subject to richness of parametrization). It was also shown that there exist at least some models which are not quadratically contracting, but for which a non-quadratic contraction map can be found which satisfies the relaxed stability condition. Further results (positive or negative) on the coverage of this particular model class, or alternative suggestions for convex parametrization of nonlinear models, would be highly interesting.
Another challenging theoretical aspect is the effect of measurement noise on the (nonlinear) estimation algorithm, and the resulting behaviour of the nonlinear system. In this paper, we provided preliminary results in this direction, in the form of a modified estimator which is provably consistent for linear systems and is inspired by instrumental variables methods. Extending such analysis to nonlinear models is non-trivial, due to the interaction of the system parametrization and the stability certificates; however, this will be a focus of future work.
Appendix A. PROOF OF THEOREM 7
We first show that F_Q̂N(W̄_N, θ̂_N) converges to zero.
By the corollary to Lemma 4 we know there exists an R = R' > 0 such that Q_R = R^{-1} and θ_R = (R, RA_0, RB_0, C_0, D_0) is a feasible point for each minimization which determines θ̂_N; thus
F_Q̂N(W̄_N, θ̂_N) ≤ F_QR(W̄_N, θ_R).
Conditions (29) and (30) guarantee that
lim_{N→∞} Ŵ_N − W_N = 0.
Our boundedness assumption on W_N and this convergence ensure that for all sufficiently large N we have W̄_N and W_N belonging to some compact set. On this set λ_min(·) is uniformly continuous, and as λ_min(W_N) ≥ 0, lim_{N→∞} W̄_N − W_N = 0. Similarly F_QR(W, θ_R) is a uniformly continuous function of W on this set, and thus lim_{N→∞} F_QR(W̄_N, θ_R) − F_QR(W_N, θ_R) = 0. We see F_QR(W_N, θ_R) = 0, as the equation errors (i.e. (ε_x, ε_y) in (23)) vanish. We thus conclude lim_{N→∞} F_Q̂N(W̄_N, θ̂_N) = 0.
The constraint that −E'QE ⪯ R̂_dt ⪯ −I guarantees E'QE ⪰ I. From this we see, for every z = (v, y, x, u) ∈ R^d:
|v − E^{-1}(Fx + Ku)|² + |y − (Gx + Lu)|² ≤ |ε_x(z)|²_Q + |ε_y(z)|².
Defining L : R^{d×d} × R^{(n+k)×(n+m)} → R by
L(W, S) = tr( W [I  −S]' [I  −S] ),
we see the above inequality ensures
L(W, S(θ)) ≤ F_Q(W, θ)
for any W = W' ⪰ 0 and feasible (Q, θ). This allows us to conclude
lim_{N→∞} L(W̄_N, S(θ̂_N)) = lim_{N→∞} L(W̄_N, S(θ_R)) = 0
from our previous inequalities.
Let W_N22 be the lower (n + m) square sub-block of W_N and W̄_N22 be the equivalent sub-block of W̄_N. Condition (28) ensures that for all sufficiently large N we have W_N22 ⪰ εI. Then for sufficiently large N we also have W̄_N22 ⪰ (ε/2)I. This condition implies L(W̄_N, S) is strongly convex in S with parameter ε/2. As L(W̄_N, ·) ≥ 0 and both lim_{N→∞} L(W̄_N, S(θ̂_N)) = 0 and lim_{N→∞} L(W̄_N, S(θ_R)) = 0, strong convexity and the triangle inequality give us lim_{N→∞} S(θ̂_N) = S(θ_R) = S(θ_0).
This completes the proof of the theorem.

REFERENCES
B.N. Bond, Z. Mahmood, Yan Li, R. Sredojevic, A. Megretski, V. Stojanovic, Y. Avniel, and L. Daniel. Compact modeling of nonlinear analog circuits using system identification via semidefinite programming and incremental stability certification. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 29(8):1149–1162, Aug. 2010.
G.B. Giannakis and E. Serpedin. A bibliography on nonlinear system identification. Signal Processing, 81(3):533–580, 2001.
S. L. Lacy and D. S. Bernstein. Subspace identification with guaranteed stability using constrained optimization. IEEE Transactions on Automatic Control, 48(7):1259–1263, 2003.
L. Ljung. System Identification: Theory for the User. Prentice Hall, Englewood Cliffs, New Jersey, USA, 3rd edition, 1999.
L. Ljung. Perspectives on system identification. Annual Reviews in Control, 34(1):1–12, 2010.
W. Lohmiller and J.J.E. Slotine. On contraction analysis for non-linear systems. Automatica, 34(6):683–696, June 1998.
I.R. Manchester, M.M. Tobenkin, and J. Wang. Identification of nonlinear systems with stable oscillations. In 50th IEEE Conference on Decision and Control (CDC). IEEE, 2011.
A. Megretski. Convex optimization in robust identification of nonlinear feedback. In Proceedings of the 47th IEEE Conference on Decision and Control, pages 1370–1374, Cancun, Mexico, Dec. 9-11 2008.
P. A. Parrilo. Structured Semidefinite Programs and Semialgebraic Geometry Methods in Robustness and Optimization. PhD thesis, California Institute of Technology, May 18 2000.
T.B. Schon, A. Wills, and B. Ninness. System identification of nonlinear state-space models. Automatica, 47(1):39–49, 2010.
J. Sjoberg, Q. Zhang, L. Ljung, A. Benveniste, B. Delyon, P.-Y. Glorennec, H. Hjalmarsson, and A. Juditsky. Nonlinear black-box modeling in system identification: a unified overview. Automatica, 31(12):1691–1724, 1995.
K. C. Sou, A. Megretski, and L. Daniel. Convex relaxation approach to the identification of the Wiener-Hammerstein model. In Proc. 47th IEEE Conference on Decision and Control, pages 1375–1382, Dec. 2008.
J. A. K. Suykens and J. Vandewalle, editors. Nonlinear Modeling: Advanced Black-Box Techniques. Springer Netherlands, 1998.
M.M. Tobenkin, I.R. Manchester, J. Wang, A. Megretski, and R. Tedrake. Convex optimization in identification of stable non-linear state space models. In 49th IEEE Conference on Decision and Control (CDC), pages 7232–7237. IEEE, 2010.

Primal-Dual Instrumental Variable Estimators

K. Pelckmans

Syscon, Information Technology, Uppsala University, 75501, SE

Abstract: This paper gives a primal-dual derivation of the Least Squares Support Vector Machine (LS-SVM) using Instrumental Variables (IVs), denoted simply as the Primal-Dual Instrumental Variable Estimator. Then we propose a convex optimization approach for learning the optimal instruments. Besides the traditional argumentation for the use of IVs, the primal-dual derivation gives an interesting additional advantage, namely that the complexity of the system to be solved is expressed in the number of instruments, rather than in the number of samples as is typically the case for SVM and LS-SVM formulations. This note explores some open issues in the design and analysis of such an estimator.
1. INTRODUCTION
The method of Least Squares Support Vector Machines (LS-SVMs) amounts to a class of nonparametric, regularized estimators capable of dealing efficiently with high-dimensional, nonlinear effects, as surveyed in Suykens et al. [2002]. The methodology builds upon the research on Support Vector Machines (SVMs) for classification, integrating ideas from functional analysis, convex optimization and learning theory. The use of Mercer kernels is found to generalize many well-known nonparametric approaches, including smoothing splines, RBF and regularization networks, as well as parametric approaches, see Suykens et al. [2002] and citations. The primal-dual derivations are found to be a versatile tool for deriving non-parametric estimators, and are found to adapt well to cases where the application at hand provides useful structural information, see e.g. Pelckmans [2005]. In the context of (nonlinear) system identification a particularly relevant structural constraint comes in the form of a specific noise model, as explained in Espinoza et al. [2004], Pelckmans [2005].
In the literature two essentially different approaches to function estimation in the context of colored noise are found. In (i) the first, one includes the model of the noise coloring in the estimation process. This approach is known as the Prediction Error Framework (PEM) in the context of system identification. The main drawback of such an approach is the computational demand. In (ii) an instrumental variables approach, one tries to retain the complexity and convenience of ordinary least squares estimation. The latter copes with the coloring of the noise by introducing artificial instruments restricting the least squares estimator to model the coloring of the noise. Introductions to Instrumental Variable (IV) estimators can be found in Ljung [1987], Söderström and Stoica [1989] in the context of system identification of LTI systems, and Bowden and Turkington [1990] in the context of econometrics.
In this paper we construct a kernel based learning machine based on an instrumental variable estimator. Related ideas are articulated in Laurain et al. [2011], but here instead we use the primal-dual framework to derive the estimators, and deviate resolutely from the bias-variance methods essentially borrowed from the parametric context. The design of suitable instruments relies on recent results on nonparametric instrumental variable methods, as found e.g. in Hall and Horowitz [2005]. However, it is not quite clear what optimal IVs are in a context of non-parametric estimators. We give a resolutely different answer to this question, based on the observation that IVs try to razor away stochastic effects which differ across various realizations of the data.
The contribution of this brief note is threefold. The next section derives the Instrumental Variable (IV) kernel machine, as well as an extended version of it. Then this is related to the kernel-based estimator designed for the case of a known (stable) colored noise source. Section 3 then elaborates on how the IVs can be chosen if one has access to different realizations of the data. Then we apply these ideas to the case where one has only a single realization of the data. The problem of finding such optimal IVs can be relaxed as a Semi-Definite Programming problem. The problem is essentially one of finding an optimal projection matrix, a task that relates quite nicely to the objective of projecting away stochastic effects which correlate with the very flexible (kernel) basis. Section 4 gives an outlook on the open questions which are raised. This paper essentially solicits crisp ideas in non-parametric estimation and nonlinear system identification in the context of colored noise.
⋆ This work was supported in part by the Swedish Research Council under contract 621-2007-6364.
2. THE DUAL OF THE INSTRUMENTAL VARIABLE (IV) ESTIMATOR
The following setup is adopted. Given N samples {(x_t, y_t)}_{t=1}^N ⊂ R^d × R. Let f : R^d → R be any function which can be written as f(x) = w'φ(x) for all x ∈ R^d, where φ : R^d → R^{d_φ} denotes a mapping of the data to a higher dimensional (possibly infinite dimensional) feature space, and w ∈ R^{d_φ} are unknown linear parameters in this feature space. Consider the model
y_t = f(x_t) + e_t = w'φ(x_t) + e_t,  ∀t = 1, ..., N,    (1)
which explains the given samples up to the residuals {e_t}_t. Unlike the case in traditional regression analysis, we do not want to assume that the residuals are uncorrelated (white). In fact, we are interested in what happens if the residuals are allowed to have nontrivial coloring.
Consider m suitable instruments (time series), designed appropriately, denoted here for all k = 1, ..., m as
z^k = (z^k_1, ..., z^k_N)' ∈ R^N.    (2)
Instrumental Variable (IV) techniques aim at estimating the parameters w (or the function f) such that the second moments of the residuals with the instruments are zeroed out, or
Σ_{t=1}^N z^k_t (f(x_t) − y_t) = z^k'(f_N − y_N) = 0,  ∀k = 1, ..., m,    (3)
where f_N = (f(x_1), ..., f(x_N))' ∈ R^N and y_N = (y_1, ..., y_N)' are vectors. Therefore, the method is also referred to as the method of generalized moment matching in the statistical and econometric literature. The rationale goes that albeit the residuals might not be white (or minimal), they are to be uncorrelated with the covariates. That is, the estimate is to be independent of the realized stochastic behavior (noise). The choice of instruments depends on which statistical assumptions one might exploit in a certain case. A practical example is found when estimating parameters of a dynamic model (say a polynomial linear time-invariant model). Then, instruments might be chosen as filters of input signals, hereby exploiting the assumption of the residuals being uncorrelated with past inputs, see e.g. Ljung [1987], Söderström and Stoica [1989] and citations.

2.1 Primal-Dual Derivation of the IV Estimator

We now implement this principle on a kernel based learning scheme, thereby building on the method of Least Squares Support Vector Machines (LS-SVMs). The primal objective might become
min_w (1/2) w'w  s.t.  Σ_{t=1}^N z^k_t (w'φ(x_t) − y_t) = 0,  ∀k = 1, ..., m.    (4)
Let Z = (z^1 z^2 ... z^m) ∈ R^{N×m} be a matrix containing all m instruments. Let to each instrument z^k and its associated moment condition (3) a single Lagrange multiplier α_k ∈ R be associated. From the dual conditions of optimality one has
w = Σ_{k=1}^m α_k Σ_{t=1}^N z^k_t φ(x_t).    (5)
Now, define K ∈ R^{N×N} (the kernel matrix) as K_{t,s} = φ(x_s)'φ(x_t). The dual problem is given as
min_{α∈R^m} (1/2) α'(Z'KZ)α − α'Z'y_N.    (6)
If m < N and the matrix (Z'KZ) is full rank, the optimal solution is unique and can be computed efficiently as the solution to the linear system
(Z'KZ)α = Z'y_N.    (7)
Let x* ∈ R^d. The estimate can be evaluated in a new sample x* ∈ R^d as
f̂(x*) = Σ_{k=1}^m α_k Σ_{l=1}^N z^k_l K(x*, x_l),    (8)
or in matrix notation as f̂(x*) = K(x*)'Zα, where the vector K(x*) ∈ R^N is defined as K(x*) = (φ(x*)'φ(x_1), ..., φ(x*)'φ(x_N))' ∈ R^N. Note that at this point the traditional questions, including the choice of the kernel function K, come in, see e.g. Suykens et al. [2002], Pelckmans [2005].
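A minimal numerical sketch of (7) and (8) (illustrative only; the Gaussian kernel, the instruments Z and all variable names are assumptions, not the paper's code):

% x, y: column vectors of length N; Z: N-by-m matrix of instruments.
sig = 1;                                     % kernel bandwidth (assumed)
K = exp(-(x - x').^2 / (2 * sig^2));         % Gaussian kernel matrix K(t,s)
alpha = (Z' * K * Z) \ (Z' * y);             % dual solution of (7)
xs = linspace(min(x), max(x), 200)';         % evaluation grid
Ks = exp(-(xs - x').^2 / (2 * sig^2));       % kernel evaluations K(x*, x_l)
fhat = Ks * Z * alpha;                       % estimate (8) on the grid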
2.2 Primal-Dual Derivation of the Extended IV Estimator

In a similar vein as extended IV methods, described e.g. in Söderström and Stoica [1989], one may extend the basic dual IV estimator by introducing slack variables. Let γ > 0 be a trade-off parameter between the model complexity w'w and how strictly the moment conditions z^k'(f_N − y_N) = 0 are enforced; then the primal objective can be written as
min_{w,e} (1/2) w'w + (γ/2) Σ_{k=1}^m e_k²  s.t.  Σ_{t=1}^N z^k_t (w'φ(x_t) − y_t) = e_k,  ∀k = 1, ..., m.    (9)
In case γ → ∞, one recovers the estimator (7). In case γ = 0, a trivial solution (i.e. w = 0) is obtained. The choice of an optimal γ > 0 is a difficult problem of model selection, see e.g. Suykens et al. [2002], Pelckmans [2005]. The dual problem is given as
min_{α∈R^m} (1/2) α'(Z'KZ + (1/γ)I_m)α − α'Z'y_N,    (10)
where I_m = diag(1, ..., 1) ∈ R^{m×m}. The solution can be obtained by solving the linear system
(Z'KZ + (1/γ)I_m)α = Z'y_N.    (11)
In general, if Z' = I_N, the method reduces to the standard LS-SVM without bias term. The resulting estimate can be evaluated in a new point x* as
f̂_n(x*) = K_n(x*)'Zα_n,    (12)
where α_n solves (15) and K_n(x*) = (K(x_1, x*), ..., K(x_n, x*))' ∈ R^n.
Fig. 1. Example function f(x) = sin(x) (full line), as well as observed samples y_t = f(x_t) + e_t (green dots) for x_t = 20t/10000 and t = 1, ..., 1000, where {e_t}_t is a colored noise source generated by the filter 1/(1 + 0.99q^{-1}). This example illustrates that we need extra information (assumptions) to separate (i) the function one aims to reconstruct, and (ii) the noise coloring. IV estimators do help a method to separate stochastic elements (such as the colored noise) from deterministic components by use of appropriate instruments exploiting model assumptions.
2.3 Computational Complexity

The above derivation also implies reduced computational complexity compared to the standard LS-SVM. In particular, it is not needed to memorize and work with the full matrix K ∈ R^{N×N}; one may instead focus attention on the matrix (Z'KZ) ∈ R^{m×m}, which is of considerably lower dimension when m ≪ N.
2.4 Kernel Machines for Colored Noise Sources

The above problem formulation relates explicitly to learning an LS-SVM from measurements which contain colored noise. Assume the noise is given as a filter h(q^{-1}) = Σ_{τ=0}^d h_τ q^{-τ}, where d is possibly infinite, and {h_τ} denote the filter coefficients. Furthermore, we assume that the filter is stable and that its inverse exists and is unique (this assumption is used implicitly in the derivation of the dual, left here as an exercise). Then the primal objective can be written as
min_{w,e} (1/2) w'w + (γ/2) Σ_{t=1}^N e_t²  s.t.  w'φ(x_t) + Σ_{τ=0}^d h_τ e_{t−τ} = y_t,  ∀t = 1, ..., N.    (13)
Note that the filter coefficients {h_τ} are assumed to be known here. The dual problem is given as
min_{α∈R^N} (1/2) α'(K + (1/γ)HH')α − α'y_N,    (14)
where I_N = diag(1, ..., 1) ∈ R^{N×N} and the Toeplitz matrix H ∈ R^{N×N} is made up of the filter coefficients, H_{ij} = h̄_{i−j}, with h̄_τ = h_τ for 0 ≤ τ ≤ d, and h̄_τ = 0 otherwise. Similarly H^{-1} is the Toeplitz matrix consisting of the coefficients of the inverse filter, assuming that the inverse exists (or that the filter is minimum phase). The solution can be obtained by solving the linear system
(H^{-1}KH^{-T} + (1/γ)I_N) ᾱ = H^{-1}y_N,    (15)
where we use the change of variables ᾱ = H'α. The resulting estimate can be evaluated in a new point x* as
f̂_n(x*) = K_n(x*)'H^{-T}ᾱ_n,    (16)
where ᾱ_n solves (15) and K_N(x*) is as before. From this derivation it becomes clear that the instrumental variable approach is essentially the same as the extended IV approach if the instruments are chosen appropriately. That is, the optimal instruments correspond with the inverse coloring of the noise realized in the sample. This duality between noise coloring and instrumental variables is well established in the context of linear identification, but comes as a surprise in the present context of regularized methods.
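A small sketch of this estimator in the known-filter case (illustrative only; the names, the Gaussian kernel, and the direct solve of the dual system (K + (1/γ)HH')α = y_N, which is equivalent to (15) up to the change of variables, are assumptions):

% x, y: column vectors of length N; h: known filter coefficients [h0 h1 ...].
N = length(y); gam = 10; sig = 1;
K = exp(-(x - x').^2 / (2 * sig^2));                                      % kernel matrix
H = toeplitz([h(:); zeros(N - length(h), 1)], [h(1), zeros(1, N - 1)]);   % lower-triangular Toeplitz filter matrix
alpha = (K + (1/gam) * (H * H')) \ y;                                     % dual variables
fhat = K * alpha;                                                         % fitted values at the training inputs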
3. ON THE CHOICE OF THE INSTRUMENTS
The question now is how to choose the instruments Z in a realistic or optimal fashion. The classical approach used in linear IV approaches, based on minimizing the variance of the final estimate, is not directly applicable as the nonlinear methods based on regularization are essentially biased. Minimizing the variance would result in an optimal estimate such that f(x) = 0 for all x. Indeed, the variance would certainly be minimal! Trying to minimize the variance while controlling the bias implies the need of a proper stochastic framework and the assumption of a reasonable prior, which contrasts with the nonparametric or distribution-free approach for which such methods are conceived. If one is willing to make such strong stochastic assumptions, one may as well resort to a parametric approach, e.g. relying on a finite set of basis functions. This argument indicates that there is a need for new principles for IV approaches in nonparametric estimation. In this note we focus on the maximal invariance principle, stating that the IVs should be chosen such that the result of the estimation is similar in case of different realizations of the data. In order to avoid the trivial solution f(x) = 0, we try to find the invariant solution which is as informative as possible. It turns out that such instruments can be found by solving a convex optimization problem.
3.1 Different Realizations

Suppose that the time series corresponding to {x_t}_t has different realizations of the output values {y_t}_t for i = 1, ..., m. Let these realizations be stacked in the vectors Y^i = (y^i_1, ..., y^i_N)' ∈ R^N for all i = 1, ..., m. The aforementioned derivation is then given as
Ŷ^i = KZ(Z'KZ + (1/γ)I_m)^{-1}Z'Y^i = KΠ(ΠKΠ + (1/γ)I_N)^{-1}ΠY^i,    (17)
where Π = ZZ' is a projection matrix with eigenvalues in {0, 1}. Here we used that Π is equivalent to I_N up to the null space of Π. This also implies that this method is equivalent to
Ŷ^i = KΠ(ΠKΠ + (1/γ)I_N)^{-1}ΠY^i.    (18)
Then the problem of learning an optimal matrix Π such that the estimate is constant over the m different realizations is phrased as follows. Let Π ∈ R^{N×N} be a symmetric positive matrix with eigenvalues {λ_t}_{t=1}^N, then
Π̂ = arg min_{λ_t(Π)∈{0,1}} Σ_{i=1}^m ‖Y^i − ΠRY^i‖²_2  s.t.  ΠRY^1 = ... = ΠRY^m,    (19)
where R = K(K + (1/γ)I_N)^{-1} ∈ R^{N×N} can be computed in advance. The problem can also be phrased as follows by introducing a vector Ȳ ∈ R^N:
(Π̂, Ȳ) = arg min_{λ_t(Π)∈{0,1}} Σ_{i=1}^m ‖Y^i − Ȳ‖²  s.t.  Ȳ = ΠRY^i, ∀i = 1, ..., m.    (20)
This combinatorial optimization problem can be relaxed to
(Π̄, Ȳ) = arg min_{λ_t(Π)∈[0,1]} Σ_{i=1}^m ‖Y^i − Ȳ‖²  s.t.  Ȳ = ΠRY^i, ∀i = 1, ..., m,    (21)
where the eigenvalues λ_t can now take any value in the interval [0, 1]. This problem can be solved efficiently as a Semi-Definite Program (SDP). This is so as both the constraint Π ⪰ 0 and λ_max(Π) ≤ 1 are convex, see e.g. Boyd and Vandenberghe [2004]. Note that in practice, in the optimum, many eigenvalues of Π̄ will be set to zero (structural sparseness), implying that one could find Z ∈ R^{N×m} with m ≪ N.
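A minimal CVX sketch of the relaxed problem (illustrative only; CVX and all names are assumptions, and the constraint set 0 ⪯ Π ⪯ I encodes the eigenvalue relaxation):

% Y: N-by-m matrix of stacked realizations; R = K / (K + eye(N)/gam) precomputed.
[Nn, m] = size(Y);
cvx_begin sdp quiet
    variable P(Nn, Nn) symmetric      % relaxed projection matrix Pi
    variable Ybar(Nn)
    minimize( sum(sum_square(Y - Ybar * ones(1, m))) )
    subject to
        for i = 1:m
            Ybar == P * R * Y(:, i);  % invariance constraints of (21)
        end
        0 <= P <= eye(Nn);            % eigenvalues of Pi relaxed to [0, 1]
cvx_end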
A more general open question is whether the relaxation of the {0, 1} eigenvalues to the interval [0, 1] is tight. In other words, is such an approach for learning a projection matrix efficient? Is it guaranteed to lead to a low rank solution?
3.2 A Single Realization

Consider the situation where we have only observations of a single realization Y_N ∈ R^N; can we still do something similar as before? It turns out that we can, by exploiting the stationarity of the noise filters. Let us divide the samples {y_t} into a set of subsequent batches of length n < N as Y^i = (y_i, ..., y_{n+i−1})', and let S^i ∈ {0, 1}^{n×N} be a selection matrix such that S^i Y = Y^i for all i = 1, ..., N − n + 1. Then we can phrase an optimal IV as a solution to
(Π̂_n, Ȳ) = arg min_{λ_t(Π_n)∈{0,1}} Σ_{i=1}^m ‖Y^i − S^i Ȳ‖²  s.t.  S^i Ȳ = R^i Π_n Y^i, ∀i = 1, ..., m,    (22)
where m = N − n + 1, and R^i = S^i K S^{i'} (S^i K S^{i'} + (1/γ)I_n)^{-1} for all i = 1, ..., m. Note that in this case Π_n ∈ R^{n×n}. This problem basically aims to filter away the local colored noise, assuming that the basis of this coloring is translation invariant.
4. DISCUSSION
This paper gives a primal-dual derivation of an instrumental variable approach for kernel machines. The main insight is that the computational complexity of the estimator is given in terms of the number of instruments, rather than in the number of data samples. This observation opens up many unexplored opportunities for dealing with large sets of data. We related this approach to the problem of estimating in the presence of colored noise, where the coloring is assumed to be known. The resulting estimator suggests a close relation between the noise coloring scheme and the employed instruments, a relation well-known in the context of parametric identification. This observation is however unexploited in the context of regularized and non-parametric estimators. Finally, we pronounced one approach for learning optimal instruments, based on principles following from dealing with data with multiple realizations. It is indicated how this can be solved as a convex problem, basically reducing to finding optimal projection matrices.
Results in this paper demand empirical validation. A reason this is not reported in this paper is that empirical validation in non-parametric settings also prompts fundamental questions on model selection. That is, if prediction of the output value of the next (in time) sample is the problem, it is beneficial to exploit the noise coloring. One intriguing open issue is to elaborate how Π̂ reflects the causal structure of the instruments, or which structure to impose to enforce such a property. A related question is whether the optimal choice Π̂ gives us the noise coloring as well.

REFERENCES
R.J. Bowden and D.A. Turkington. Instrumental Variables, volume 8. Cambridge University Press, 1990.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
M. Espinoza, J.A.K. Suykens, and B. De Moor. Partially linear models and least squares support vector machines. In Decision and Control, 2004. CDC. 43rd IEEE Conference on, volume 4, pages 3388–3393. IEEE, 2004.
P. Hall and J.L. Horowitz. Nonparametric methods for inference in the presence of instrumental variables. The Annals of Statistics, 33(6):2904–2929, 2005.
V. Laurain, W. Xing Zheng, and R. Toth. Introducing instrumental variables in the LS-SVM based identification framework. In Proceedings of the 50th IEEE Conference on Decision and Control (CDC 2011). IEEE, 2011.
L. Ljung. System Identification, Theory for the User. Prentice Hall, 1987.
K. Pelckmans. Primal-Dual Kernel Machines. PhD thesis, Faculty of Engineering, K.U.Leuven, Leuven, May 2005. 280 pp., TR. 05-95.
T. Söderström and P. Stoica. System Identification, 1989.
J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific, Singapore, 2002.

Identification of Black-Box Wave Propagation Models Using Large-Scale Convex Optimization ⋆

Toon van Waterschoot ∗ Moritz Diehl ∗ Marc Moonen ∗ Geert Leus ∗∗

∗ Department of Electrical Engineering (ESAT-SCD), Katholieke Universiteit Leuven, 3001 Leuven, Belgium (e-mail: {tvanwate,mdiehl,moonen}@esat.kuleuven.be)
∗∗ Faculty of Electrical Engineering, Mathematics, and Computer Science, Delft University of Technology, 2628 CD Delft, The Netherlands (e-mail: g.j.t.leus@tudelft.nl)

Abstract: In this paper, we propose a novel approach to the identification of multiple-input multiple-output (MIMO) wave propagation models having a common-denominator pole-zero parametrization. We show how the traditional, purely data-based identification approach can be improved by incorporating a physical wave propagation model, in the form of a spatiotemporally discretized version of the wave equation. If the wave equation is discretized by means of the finite element method (FEM), a high-dimensional yet highly sparse linear set of equations is obtained that can be imposed at those frequencies where a high-resolution model estimate is desired. The proposed identification approach then consists in sequentially solving two large-scale convex optimization problems: a sparse approximation problem for estimating the point source positions required in the FEM, and an equality-constrained quadratic program (QP) for estimating the common-denominator pole-zero model parameters. A simulation example for the case of indoor acoustic wave propagation is provided to illustrate the benefits of the proposed approach.
Keywords: Multivariable System Identification; Hybrid and Distributed System Identification;
Vibration and Modal Analysis
1. INTRODUCTION
We consider wave propagation in a three-dimensional
(3-D) enclosure with partially reflective boundaries as
governed by the wave equation
\[
\frac{1}{c^2} \frac{\partial^2 u(r,t)}{\partial t^2} - \nabla^2 u(r,t) = s(r,t) \qquad (1)
\]
with appropriate boundary conditions in the spatiotemporal domain Ω × T. Here, r = [x, y, z]^T ∈ Ω and t ∈ T denote the spatial and temporal coordinates, respectively,
This research was carried out at the ESAT laboratory of KU Leuven, and was supported by the KU Leuven Research Council (CoE EF/05/006 Optimization in Engineering (OPTEC), PFV/10/002 (OPTEC), IOF-SCORES4CHEM, Concerted Research Action GOA-MaNet), the Belgian Federal Science Policy Office (IUAP P6/04 Dynamical systems, control and optimization (DYSCO), 2007-2011), the Research Foundation Flanders FWO (Postdoctoral Fellowship T. van Waterschoot, Research Projects G0226.06, G0321.06, G.0302.07, G.0320.08, G.0558.08, G.0557.08, G.0588.09, G.0600.08, Research Communities G.0377.09 and WOG: ICCoS, ANMMM, MLDM), the IWT (Research Projects Eureka-Flite+, SBO LeCoPro, SBO Climaqs, SBO POM, O&O-Dsquare), the European Commission (Research Community ERNSI, Research Projects FP7-HDMPC (INFSO-ICT-223854), COST-intelliCIS, FP7-EMBOCON (ICT-248940), FP7-SADCO (MC ITN-264735), ERC HIGHWIND (259 166)), the NWO-STW (VICI project 10382), AMINAL, ACCM, and IBBT. The scientific responsibility is assumed by its authors.

s(r, t) represents the driving source function that initiates the wave propagation, u(r, t) represents the resulting wave field, and c is the wave propagation speed determined by the propagation mechanism and medium. If we consider the driving source function to be generated by M point sources at positions r_m, m = 1, ..., M, then the wave field can be expressed as a superposition of M contributions corresponding to the temporal convolution of the point source signals s_m(t) with the Green's function h(r, r_m, t),
\[
u(r,t) = \sum_{m=1}^{M} \int s_m(\tau)\, h(r, r_m, t - \tau)\, d\tau. \qquad (2)
\]
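As a purely illustrative Python sketch of (2) after temporal sampling, the following code forms the observed wave field as a superposition of discrete convolutions; the signal lengths, decay profile, and random impulse responses are toy stand-ins, not quantities from this paper.

import numpy as np

rng = np.random.default_rng(0)
M, J, N, Lh = 3, 5, 1000, 128            # sources, observers, samples, impulse response length (toy)
s = rng.standard_normal((M, N))          # point source signals s_m(n)
decay = np.exp(-np.arange(Lh) / 20.0)
h = rng.standard_normal((J, M, Lh)) * decay   # stand-ins for the sampled Green's functions h(r_j, r_m, n)

# u_j(n) = sum_m (s_m * h_{j,m})(n): superposition of temporal convolutions, cf. (2)
u = np.zeros((J, N))
for j in range(J):
    for m in range(M):
        u[j] += np.convolve(s[m], h[j, m])[:N]
print(u.shape)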

If we observe the wave field at a discrete number of positions r̃_j, j = 1, ..., J, it follows from (2) that the wave propagation can be modeled as a multiple-input multiple-output (MIMO) linear time-invariant (LTI) system. If we assume the source signals s_m(t) to be bandlimited, then the Green's function can be sampled in time and the MIMO-LTI system can be represented by a discrete-time transfer function matrix

\[
H(z) = \sum_{n=0}^{\infty} H_n z^{-n} \qquad (3)
\]
\[
\phantom{H(z)} = \sum_{n=0}^{\infty}
\begin{bmatrix}
h(\tilde{r}_1, r_1, n) & \cdots & h(\tilde{r}_1, r_M, n) \\
\vdots & \ddots & \vdots \\
h(\tilde{r}_J, r_1, n) & \cdots & h(\tilde{r}_J, r_M, n)
\end{bmatrix} z^{-n} \qquad (4)
\]

where n = t/Ts denotes the discrete time index, with Ts


the sampling period. It is particularly relevant to represent
the transfer function matrix by means of a pole-zero model
with a common denominator, i.e.,
\[
H(z) = \frac{\sum_{n=0}^{Q} B_n z^{-n}}{\sum_{n=0}^{P} a_n z^{-n}} \qquad (5)
\]
with
\[
B_n =
\begin{bmatrix}
b_n(\tilde{r}_1, r_1) & \cdots & b_n(\tilde{r}_1, r_M) \\
\vdots & \ddots & \vdots \\
b_n(\tilde{r}_J, r_1) & \cdots & b_n(\tilde{r}_J, r_M)
\end{bmatrix}. \qquad (6)
\]
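As a minimal Python sketch of how a common-denominator model of the form (5) can be evaluated at a single frequency (the coefficients below are random toy values, purely for illustration and not part of the proposed method):

import numpy as np

def common_denominator_response(B, a, omega):
    # H(e^{j w}) = (sum_n B_n e^{-j n w}) / (sum_n a_n e^{-j n w}), cf. (5)
    Q, P = B.shape[0] - 1, a.shape[0] - 1
    zQ = np.exp(-1j * omega * np.arange(Q + 1))
    zP = np.exp(-1j * omega * np.arange(P + 1))
    num = np.tensordot(zQ, B, axes=(0, 0))   # J x M numerator matrix B(omega)
    den = zP @ a                             # scalar common denominator A(omega)
    return num / den

# toy example: J = 2 observers, M = 2 sources, orders Q = P = 2 (hypothetical coefficients)
rng = np.random.default_rng(1)
B = rng.standard_normal((3, 2, 2))           # B_n stacked along the first axis
a = np.array([1.0, -0.5, 0.25])              # a_0 fixed to 1
print(common_denominator_response(B, a, omega=0.3))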

Indeed, it was shown in Gustafsson et al. (2000) that


such a parametrization is related to the assumed modes
solution of the wave equation, in which the source and
wave field are expanded on the eigenfunction basis of the
enclosure, see also Kuttruff (2009). Here, the common
denominator is related to the resonant modes of the
enclosure which can be understood to be independent of
the source and observer positions, see Gustafsson et al.
(2000), Kuttruff (2009) and Haneda et al. (1994).
A parametric model of the wave propagation is particularly useful for the prediction, simulation, and deconvolution (i.e., source recovery) of the wave field, in case a fixed
set of source and observer positions is considered. Even
though these operations may also be performed through
direct use of (a numerical approximation of) the wave
equation in (1), the availability of a parametric model will
typically result in significantly fewer computations. Because
of the tight connection between the pole-zero model and
the assumed modes solution of the wave equation, it may
be appealing to use a grey-box model, by parametrizing
the numerator and denominator in (5) explicitly as a function of the resonance frequencies and damping factors, see
Gustafsson et al. (2000). However, this parametrization is
highly nonlinear and so the parameter estimation requires
the solution of a non-convex optimization problem. Therefore, we prefer to use a linear-in-the-parameters black-box
pole-zero model.
Our aim is then to estimate the model parameters as
accurately as possible, given a data set consisting of
source and observed signal samples. A variety of estimation
algorithms for identifying common-denominator pole-zero
models has been proposed earlier in literature, see, e.g.,
Gustafsson et al. (2000), Haneda et al. (1994), Rolain et al.
(1998), Stoica and Jansson (2000), Verboven et al. (2004),
and Hikichi and Miyoshi (2004). A common property
of these algorithms is that they rely exclusively on the
available data set. Instead, we propose an approach in
which not only the data set, but also the structure of the
underlying wave equation is exploited in the estimation of
the pole-zero model parameters. By enforcing the model
parameters to obey a linear relationship derived from a
finite element approximation of the wave equation, we can
include physical arguments in the black-box identification
while avoiding the non-convexity issues encountered with
a grey-box approach. This makes it possible to achieve a higher estimation accuracy than the purely data-based algorithms in the literature, or a similar accuracy
with a smaller data set. The latter property is particularly
appealing if the estimation of the common-denominator
coefficients is of primary interest. The inclusion of the wave

equation structure in the black-box identification problem


can then result in a reduction of the number of observation
positions required to achieve a given accuracy, which
may significantly reduce the cost of the identification
experiment.
The paper is organized as follows. In Section 2 we formulate the problem statement and review the existing
data-based approach to the identification of common-denominator pole-zero models. In Section 3, we show how
the finite element method (FEM) can be used to derive a
set of linear equations in the pole-zero model parameters,
that are valid if the MIMO-LTI system is indeed governed
by the wave equation in (1). This set of equations is
then used in Section 4 to formulate a large-scale convex
optimization problem that makes it possible to identify the common-denominator pole-zero model by relying on both the data
set and the wave equation structure. Finally, a simulation
example is provided in Section 5.
2. PROBLEM STATEMENT & STATE OF THE ART
2.1 Problem Statement
The problem considered in this paper can be formulated
as follows. We are given a data set consisting of N samples
of the source signals and observed signals,
\[
Z^N = \{ s_m(n), y_j(n) \}_{n=1}^{N}, \quad m = 1, \ldots, M, \; j = 1, \ldots, J, \qquad (7)
\]
where the observed signals obey the measurement model
\[
y_j(n) = u(\tilde{r}_j, n) + v_j(n), \quad j = 1, \ldots, J. \qquad (8)
\]
Here, the noise-free observations u(r̃_j, n), j = 1, ..., J, result from a spatiotemporal sampling of the wave field u(r, t), as generated by the wave equation (1), and v_j(n), j = 1, ..., J, represents measurement noise. For the sake of simplicity, and without loss of generality, we will assume that the measurement noise signals v_j(n) are realizations of zero-mean and mutually uncorrelated white noise processes with equal variance σ_v². Our aim is then to
obtain the best possible estimate of the parameter vector
\[
\theta = \begin{bmatrix} b_1^T \; b_2^T \; \cdots \; b_M^T \; a^T \end{bmatrix}^T \qquad (9)
\]
containing the coefficients of the common-denominator pole-zero model in (5), with
\[
b_m = \begin{bmatrix} b_0(\tilde{r}_1, r_m) \; \cdots \; b_Q(\tilde{r}_1, r_m) \; \cdots \; b_0(\tilde{r}_J, r_m) \; \cdots \; b_Q(\tilde{r}_J, r_m) \end{bmatrix}^T \qquad (10)
\]
for m = 1, ..., M, and
\[
a = \begin{bmatrix} a_0 \; \cdots \; a_P \end{bmatrix}^T. \qquad (11)
\]
Note that the first coefficient in the denominator parameter vector is usually fixed to a0 = 1. We include it here in
the parameter vector for notational convenience.
2.2 State-of-the-Art Data-Based Identification Approach
Different algorithms for the estimation of the parameter vector θ using the data model (5)-(8) have been proposed, see Gustafsson et al. (2000), Haneda et al. (1994), Rolain et al. (1998), Stoica and Jansson (2000), Verboven et al. (2004), and Hikichi and Miyoshi (2004). In these algorithms, however, the knowledge that the noise-free observations u(r̃_j, n), j = 1, ..., J, are samples of the wave field generated by (1) is not exploited, and hence the structure

/ .
.
//
.
/ .
/
S ()ST () IJ zQ ()zH
vec Y()SH () zQ ()zH
Q ()
P ()

() =
/
.
/H .
YH ()Y()zP ()zH
vec Y()SH () zP ()zH
P ()
Q ()

of the wave equation is not taken into account. In this paper, we will adopt the frequency domain identification algorithm proposed in Verboven et al. (2004) as the state-of-the-art algorithm. In the frequency domain, the data model corresponding to (5)-(8) can be written as
\[
Y(\omega) = \frac{B(\omega)}{A(\omega)}\, S(\omega) + V(\omega) \qquad (13)
\]
where ω = 2πf T_s denotes radial frequency,
\[
S(\omega) = [S_1(\omega) \; \ldots \; S_M(\omega)]^T \qquad (14)
\]
\[
V(\omega) = [V_1(\omega) \; \ldots \; V_J(\omega)]^T \qquad (15)
\]
\[
Y(\omega) = [Y_1(\omega) \; \ldots \; Y_J(\omega)]^T \qquad (16)
\]
contain the N-point discrete Fourier transform (DFT) samples of the source signals, measurement noise, and observed signals, A(ω) represents the pole-zero model denominator frequency response, and
\[
B(\omega) =
\begin{bmatrix}
B_{11}(\omega) & \cdots & B_{1M}(\omega) \\
\vdots & \ddots & \vdots \\
B_{J1}(\omega) & \cdots & B_{JM}(\omega)
\end{bmatrix} \qquad (17)
\]
contains the pole-zero model numerator frequency responses for the different source-observer combinations.
By defining the equation error vector related to (13) as
\[
E(\omega, \theta) = B(\omega)\, S(\omega) - A(\omega)\, Y(\omega), \qquad (18)
\]
a least squares (LS) criterion for the estimation of the parameter vector θ can be obtained as
\[
\min_{\theta} \sum_{\omega} E^H(\omega, \theta)\, E(\omega, \theta) \qquad (19)
\]
where the summation is executed over the DFT frequencies


ω = 0, 2π/N, ..., 2π(N-1)/N, and (·)^H denotes the Hermitian transposition operator. The LS criterion (19) can be rewritten as a quadratic program (QP),
\[
\min_{\theta} \; \theta^T \left( \sum_{\omega} \Lambda(\omega) \right) \theta \qquad (20)
\]
\[
\text{s.t. } a_0 = 1, \qquad (21)
\]
with Λ(ω) defined as
\[
\Lambda(\omega) =
\begin{bmatrix}
\left( S^*(\omega) S^T(\omega) \right) \otimes I_J \otimes z_Q(\omega) z_Q^H(\omega) &
-\operatorname{vec}\!\left( Y(\omega) S^H(\omega) \right) \otimes z_Q(\omega) z_P^H(\omega) \\
-\operatorname{vec}\!\left( Y(\omega) S^H(\omega) \right)^H \otimes z_P(\omega) z_Q^H(\omega) &
Y^H(\omega) Y(\omega)\, z_P(\omega) z_P^H(\omega)
\end{bmatrix} \qquad (12)
\]
where I_J represents the J × J identity matrix, (·)* denotes the complex conjugation operator, vec(·) is the matrix vectorization operator, ⊗ denotes the Kronecker product, and the complex sinusoidal vectors are defined as
\[
z_Q(\omega) = \begin{bmatrix} 1 \; e^{j\omega} \; \ldots \; e^{jQ\omega} \end{bmatrix}^T \qquad (22)
\]
\[
z_P(\omega) = \begin{bmatrix} 1 \; e^{j\omega} \; \ldots \; e^{jP\omega} \end{bmatrix}^T. \qquad (23)
\]
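To make the equation-error criterion (18)-(19) concrete, here is a simplified single-input single-output Python sketch that eliminates the constraint a_0 = 1 and solves the resulting linear least squares problem; the toy system, orders, and noise level are assumptions for illustration only, and the full MIMO construction with the matrix Λ(ω) in (12) is not reproduced.

import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(2)
N, Q, P = 512, 4, 4
b_true = rng.standard_normal(Q + 1)
a_true = np.poly([0.7 * np.exp(1j * 0.5), 0.7 * np.exp(-1j * 0.5),
                  0.8 * np.exp(1j * 1.5), 0.8 * np.exp(-1j * 1.5)]).real   # stable toy denominator, a_0 = 1
s = rng.standard_normal(N)
y = lfilter(b_true, a_true, s) + 0.05 * rng.standard_normal(N)

# frequency-domain data and equation error E(w) = B(w)S(w) - A(w)Y(w), cf. (18)
S, Y = np.fft.fft(s), np.fft.fft(y)
omega = 2 * np.pi * np.arange(N) / N
ZQ = np.exp(-1j * np.outer(omega, np.arange(Q + 1)))   # rows evaluate the numerator polynomial at each DFT frequency
ZP = np.exp(-1j * np.outer(omega, np.arange(P + 1)))   # rows evaluate the denominator polynomial

# unknowns theta = [b; a_1..a_P]; a_0 = 1 is eliminated by moving a_0*Y(w) to the right-hand side
A_mat = np.hstack([ZQ * S[:, None], -ZP[:, 1:] * Y[:, None]])
rhs = Y
A_ri = np.vstack([A_mat.real, A_mat.imag])             # stack real/imaginary parts: theta is real
r_ri = np.concatenate([rhs.real, rhs.imag])
theta, *_ = np.linalg.lstsq(A_ri, r_ri, rcond=None)
b_hat, a_hat = theta[:Q + 1], np.r_[1.0, theta[Q + 1:]]
print(np.linalg.norm(b_hat - b_true), np.linalg.norm(a_hat - a_true))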
3. FINITE ELEMENT METHOD FOR
COMMON-DENOMINATOR POLE-ZERO MODELS

We will now derive a set of linear equations in the pole-zero model parameters, which are valid if the MIMO-LTI
system is indeed governed by the wave equation in (1).
We can eliminate the time variable and the partial time
derivative from the wave equation by taking the discrete
Fourier transform of (1) after temporal sampling, which
results in the Helmholtz equation
\[
\nabla^2 U(r, \omega) + k^2 U(r, \omega) = -S(r, \omega) \qquad (24)
\]
where k = ω/c represents the wave number. As mentioned earlier, we consider the source function to consist of M point source contributions, i.e.,
\[
S(r, \omega) = \sum_{m=1}^{M} S_m(\omega)\, \delta(r - r_m). \qquad (25)
\]

By substituting (25) into (24) and dividing both sides by S_m(ω), m = 1, ..., M, we obtain a set of M equations
\[
\begin{aligned}
\nabla^2 H(r, r_1, \omega) + k^2 H(r, r_1, \omega) &= -\sum_{m=1}^{M} \frac{S_m(\omega)}{S_1(\omega)}\, \delta(r - r_m) \\
&\;\;\vdots \\
\nabla^2 H(r, r_M, \omega) + k^2 H(r, r_M, \omega) &= -\sum_{m=1}^{M} \frac{S_m(\omega)}{S_M(\omega)}\, \delta(r - r_m)
\end{aligned} \qquad (26)
\]
where the frequency-domain Green's function H(r, r_m, ω), m = 1, ..., M, corresponds to the frequency response of the discrete-time system defined in (4) for r = r̃_j, j = 1, ..., J. We can hence substitute the common-denominator pole-zero model for H(r, r_m, ω) in (26), and bring the common denominator (which is independent of r) to the right-hand side, i.e.,
\[
\nabla^2 B(r, r_m, \omega) + k^2 B(r, r_m, \omega) = -A(\omega) \sum_{l=1}^{M} \frac{S_l(\omega)}{S_m(\omega)}\, \delta(r - r_l), \quad m = 1, \ldots, M, \qquad (27)
\]
where we have used a more compact notation to denote a set of M equations. Note that we have deliberately not restricted the observer position r in (27) to the discrete set of positions r̃_j defined earlier. Instead, we consider the numerator frequency response B(r, r_m, ω) to be a continuous function of r. This function can be approximated in a finite-dimensional subspace by discretizing the spatial domain using a 3-D grid defined by the points r̃_k, k = 1, ..., K, with K ≥ J (and typically K ≫ J), which includes the observer positions,
\[
B(r, r_m, \omega) \approx \sum_{k=1}^{K} B(\tilde{r}_k, r_m, \omega)\, \phi_k(r). \qquad (28)
\]

Here, the subspace basis functions are chosen to be piecewise linear functions satisfying φ_i(r̃_k) = δ(i - k), i = 1, ..., K. In particular, the basis functions are defined on a 3-D triangulation of the spatial domain Ω, where the kth basis function is made up of linear (non-zero slope) segments along the line segments between the point r̃_k and all the points with which point r̃_k shares a tetrahedron edge, and zero-valued segments elsewhere. We can then rewrite the set of Helmholtz equations in (27) as a set of linear equations in B(r̃_k, r_m, ω) by making use of the FEM, see Brenner and Scott (2008). In a nutshell, the FEM consists in converting the partial differential equation (PDE) in (27) to its weak formulation, performing integration by parts to relax the differentiability requirements on the subspace basis functions, and enforcing the subspace approximation error induced by (28) to be orthogonal to this subspace. The set of M PDEs in (27) can then be

expanded to a set of MK linear equations, also known as the Galerkin equations,
\[
\left( \mathbf{K} - k^2 \mathbf{L} \right) \phi_m(\omega) = -A(\omega) \sum_{l=1}^{M} \frac{S_l(\omega)}{S_m(\omega)}\, \gamma_l, \quad m = 1, \ldots, M. \qquad (29)
\]
Here, the K × K matrices K and L denote the FEM stiffness and mass matrices, defined as
\[
[\mathbf{K}]_{ij} = \int_{\Omega} \nabla \phi_j(r) \cdot \nabla \phi_i(r)\, dr \qquad (30)
\]
\[
[\mathbf{L}]_{ij} = \int_{\Omega} \phi_j(r)\, \phi_i(r)\, dr \qquad (31)
\]

and the K × 1 vector φ_m(ω) contains the spatial samples of the function B(r, r_m, ω) as defined by (28), i.e.,
\[
\phi_m(\omega) = [B(\tilde{r}_1, r_m, \omega) \; \ldots \; B(\tilde{r}_K, r_m, \omega)]^T. \qquad (32)
\]
The K × 1 vectors γ_l, l = 1, ..., M, on the right-hand side of the Galerkin system in (29) contain the barycentric coordinates of the point sources, obtained by projecting the spatial unit-impulse functions δ(r - r_l) onto the chosen subspace basis, i.e.,
\[
[\gamma_l]_i = \int_{\Omega} \delta(r - r_l)\, \phi_i(r)\, dr. \qquad (33)
\]

Each vector γ_l has only 1, 2, 3, or 4 non-zero elements, depending on whether the lth point source is located in a vertex, on an edge, on a face, or in the interior of a tetrahedron of the FEM mesh. We can write (29) in a more compact notation by defining M × 1 vectors γ_m(ω), m = 1, ..., M, containing the source spectrum ratios,
\[
\gamma_m(\omega) = \left[ \frac{S_1(\omega)}{S_m(\omega)} \; \ldots \; \frac{S_M(\omega)}{S_m(\omega)} \right]^T, \quad m = 1, \ldots, M, \qquad (34)
\]
and the K × M matrix
\[
\Gamma = [\gamma_1 \; \ldots \; \gamma_M] \qquad (35)
\]
such that
\[
\left( \mathbf{K} - k^2 \mathbf{L} \right) \phi_m(\omega) = -A(\omega)\, \Gamma\, \gamma_m(\omega), \quad m = 1, \ldots, M. \qquad (36)
\]
Finally, we can write the Galerkin equations as a function of the model parameters of the common-denominator pole-zero model defined in (5) as follows. Define the K(Q+1) × 1 numerator parameter vector, for m = 1, ..., M,
\[
\bar{b}_m = [\, b_0(\tilde{r}_1, r_m) \; \ldots \; b_Q(\tilde{r}_1, r_m) \; \ldots \; b_0(\tilde{r}_K, r_m) \; \ldots \; b_Q(\tilde{r}_K, r_m) \,]^T \qquad (37)
\]
and recall the (P+1) × 1 denominator parameter vector definition in (11). Note that only the first J(Q+1) coefficients of the numerator parameter vector (corresponding to the elements of the numerator parameter vector b_m defined in (10)) are of explicit interest, while the other coefficients have been introduced for constructing the FEM approximation of the continuous-space function B(r, r_m, ω). By using the above parameter vector definitions, and recalling the definitions of the complex sinusoidal vectors in (22)-(23), we can rewrite the Galerkin system in (36) as follows,
\[
\left( \mathbf{K} - k^2 \mathbf{L} \right) \left( I_K \otimes z_Q^H(\omega) \right) \bar{b}_m = -\Gamma\, \gamma_m(\omega)\, z_P^H(\omega)\, a, \quad m = 1, \ldots, M, \qquad (38)
\]

or equivalently
\[
\underbrace{\begin{bmatrix}
M(\omega) & 0 & \cdots & 0 & \Gamma \gamma_1(\omega) z_P^H(\omega) \\
0 & M(\omega) & \cdots & 0 & \Gamma \gamma_2(\omega) z_P^H(\omega) \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & M(\omega) & \Gamma \gamma_M(\omega) z_P^H(\omega)
\end{bmatrix}}_{\Psi(\omega)}
\underbrace{\begin{bmatrix} \bar{b}_1 \\ \bar{b}_2 \\ \vdots \\ \bar{b}_M \\ a \end{bmatrix}}_{\bar{\theta}} = 0. \qquad (39)
\]
Here, M(ω) = (K - k²L)(I_K ⊗ z_Q^H(ω)), and 0 represents a zero vector or matrix of appropriate dimensions.
A few remarks are in order here. First, the Galerkin system in (39) is always underdetermined. However, we can straightforwardly increase the number of equations by considering (39) for L different radial frequencies ω_l, l = 1, ..., L, without increasing the dimension of the parameter vector. It suffices to choose L ≥ Q + 1 + (P + 1)/(MK) to obtain a square or overdetermined system of equations. Second, a well-known and attractive property of the FEM is that the stiffness and mass matrices K and L, as well as the point source positioning matrix Γ, are highly sparse and structured. Consequently, the system of equations in (39) can typically be solved with a linear complexity. Third, we should stress that the accuracy of the FEM approximation relies heavily on the quality of the mesh, which is why we cannot just set K = J and define the FEM mesh using only the observer positions r̃_j, j = 1, ..., J. In particular, a sufficiently large number of mesh points is needed to achieve a good spatial resolution and near-uniformity of the tetrahedra defined in the triangulation.
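To illustrate the sparse, structured nature of the Galerkin system (39), the Python sketch below assembles Ψ(ω) with SciPy sparse matrices; the stiffness, mass, and source positioning matrices are random sparse stand-ins (a real implementation would assemble them from (30)-(31) on the tetrahedral mesh and from (33)), so only the block and Kronecker structure is meaningful here.

import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(3)
Kn, M, Q, P = 50, 3, 6, 6                 # grid points, sources, model orders (toy values)
c, omega = 344.0, 2.0
k = omega / c

# random sparse stand-ins for the FEM stiffness/mass matrices and source positioning matrix
Kmat = sp.random(Kn, Kn, density=0.05, random_state=3, format="csr")
Lmat = sp.random(Kn, Kn, density=0.05, random_state=4, format="csr")
Gamma = sp.random(Kn, M, density=0.05, random_state=5, format="csr")
gamma = [np.ones(M) for _ in range(M)]    # toy source spectrum ratio vectors gamma_m(omega)

zQ_H = np.exp(-1j * omega * np.arange(Q + 1))[None, :]   # z_Q(omega)^H as a row vector
zP_H = np.exp(-1j * omega * np.arange(P + 1))[None, :]

# M(omega) = (K - k^2 L)(I_K kron z_Q(omega)^H), cf. (39)
Mw = (Kmat - k**2 * Lmat) @ sp.kron(sp.identity(Kn), zQ_H)

blocks = []
for m in range(M):
    row = [None] * (M + 1)
    row[m] = Mw                                                  # block-diagonal part acting on b_m
    row[M] = sp.csr_matrix((Gamma @ gamma[m])[:, None] @ zP_H)   # Gamma gamma_m(omega) z_P(omega)^H acting on a
    blocks.append(row)
Psi = sp.bmat(blocks, format="csr")
print(Psi.shape, Psi.nnz)     # (M*Kn) x (M*Kn*(Q+1) + P + 1), highly sparse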
4. PROPOSED IDENTIFICATION APPROACH
The proposed identification approach is aimed at blending
measured information in the data set with structural
information obtained from the wave equation, and results
from the integration of the Galerkin equations (39) in
the QP (20)-(21). One way to achieve this integration
is to apply the field estimation framework proposed in
van Waterschoot and Leus (2011), where an optimization
problem is defined in which a LS data-based objective
function is minimized subject to the Galerkin equations. If
we apply this framework to the problem considered here,
we end up with a large-scale equality-constrained QP,
\[
\min_{\bar{\theta}} \; \bar{\theta}^T C^T \left( \sum_{\omega} \Lambda(\omega) \right) C\, \bar{\theta} \qquad (40)
\]
\[
\text{s.t. } \; \Psi(\omega_1)\, \bar{\theta} = 0, \;\; \ldots, \;\; \Psi(\omega_L)\, \bar{\theta} = 0, \;\; a_0 = 1. \qquad (41)
\]
Here, the [J(Q+1) + P + 1] × [K(Q+1) + P + 1] selection matrix C is defined such that
\[
C\, \bar{\theta} = \theta. \qquad (42)
\]
Compared to the state-of-the-art identification approach exemplified by the QP in (20)-(21), LMK additional equality constraints have been included in (41). These equality constraints make it possible to impose structural information at a number of frequencies ω_1, ..., ω_L, thus increasing the model accuracy at these particular frequencies. The number of frequencies L at which the Galerkin equations are imposed in (41) should satisfy L ≤ Q + (P + 1)/(MK), as otherwise an infeasible QP may be obtained.
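Structurally, (40)-(41) is a standard equality-constrained QP. The following Python sketch mimics it with CVXPY on small random stand-ins for the accumulated data matrix and the stacked Galerkin constraints; all dimensions and data are toy assumptions, not the actual problem data.

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
n_theta, n_eq = 40, 10                       # parameter and constraint counts (toy)
G = rng.standard_normal((80, n_theta))       # stand-in square root of the accumulated data matrix
Psi = rng.standard_normal((n_eq, n_theta))   # stand-in for the stacked Galerkin constraints Psi(w_1),...,Psi(w_L)
e0 = np.zeros(n_theta); e0[-1] = 1.0         # selects the coefficient playing the role of a_0 (position arbitrary here)

theta = cp.Variable(n_theta)
objective = cp.Minimize(cp.sum_squares(G @ theta))   # equals theta^T (G^T G) theta, the data term of (40)
constraints = [Psi @ theta == 0, e0 @ theta == 1]    # Galerkin equalities and a_0 = 1, cf. (41)
prob = cp.Problem(objective, constraints)
prob.solve()
print(prob.status, float(e0 @ theta.value))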

Up till now we have assumed that all quantities required in the computation of the matrices Ψ(ω) and C are available. More particularly, Ψ(ω) relies on the geometry of the FEM mesh (through K, L, and Γ), on the source spectrum ratios (through γ_m(ω)), and on the point source positions (through Γ), while C depends on the observer positions. In a typical identification experiment, the source spectrum ratios and the observer positions are indeed known, while the FEM mesh is known by construction. However, in many applications, the point source positions are unknown and so the point source positioning matrix Γ cannot be straightforwardly computed. Nevertheless, we will show that if a preliminary estimate â of the pole-zero model denominator parameter vector is available (e.g., by using the state-of-the-art data-based identification approach outlined in Section 2.2), the point source positioning matrix can be estimated by exploiting its particular structure and sparsity. To this end, we first rewrite (36) as
\[
\left( \mathbf{K} - k^2 \mathbf{L} \right) \phi_m(\omega) = -\left[ \gamma_m^T(\omega) \otimes A(\omega) I_K \right] \operatorname{vec}(\Gamma), \quad m = 1, \ldots, M, \qquad (43)
\]
or equivalently
\[
\underbrace{\begin{bmatrix}
M(\omega) & 0 & \cdots & 0 & \gamma_1^T(\omega) \otimes A(\omega) I_K \\
0 & M(\omega) & \cdots & 0 & \gamma_2^T(\omega) \otimes A(\omega) I_K \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & M(\omega) & \gamma_M^T(\omega) \otimes A(\omega) I_K
\end{bmatrix}}_{\Phi(\hat{a}, \omega)}
\begin{bmatrix} \bar{b}_1 \\ \bar{b}_2 \\ \vdots \\ \bar{b}_M \\ \operatorname{vec}(\Gamma) \end{bmatrix} = 0. \qquad (44)
\]
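The step from (36) to (43) relies on the identity vec(AXB) = (B^T ⊗ A) vec(X). A quick numerical check in Python (toy dimensions, purely illustrative):

import numpy as np

rng = np.random.default_rng(5)
K, M = 6, 3
A_w = 0.7 - 0.2j                                   # a scalar standing in for A(omega)
Gamma = rng.standard_normal((K, M))
gamma_m = rng.standard_normal(M)

lhs = A_w * (Gamma @ gamma_m)                      # A(w) Gamma gamma_m(w), as in (36)
rhs = np.kron(gamma_m, A_w * np.eye(K)) @ Gamma.reshape(-1, order="F")   # [gamma_m^T kron A(w) I_K] vec(Gamma)
print(np.allclose(lhs, rhs))                       # True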
The data term in (40) can also be rewritten as a function of Γ, by partitioning the matrix (Σ_ω Λ(ω))^{1/2} C =: [Λ_L | Λ_R] such that
\[
\left( \sum_{\omega} \Lambda(\omega) \right)^{1/2} C\, \bar{\theta} = \Lambda_L \underbrace{\left[ I_{MK(Q+1)} \;\; 0 \right] \bar{\theta}}_{F} + \Lambda_R\, a. \qquad (45)
\]
Again, data information and structural information can be combined into a single convex optimization problem in which the point source positioning matrix Γ is estimated alongside the pole-zero model numerator coefficients, i.e.,
\[
\min_{F, \Gamma} \; \left\| \Lambda_L F + \Lambda_R \hat{a} \right\|^2 + \left\| \Phi(\hat{a}, \omega) \begin{bmatrix} F \\ \operatorname{vec}(\Gamma) \end{bmatrix} \right\|^2 + \left\| \operatorname{vec}(\Gamma) \right\|_1 \qquad (46)
\]
\[
\text{s.t. } \; \left( I_M \otimes 1_{1 \times K} \right) \operatorname{vec}(\Gamma) = 1_{M \times 1}, \quad \operatorname{vec}(\Gamma) \geq 0, \qquad (47)
\]
with 1 a vector of all ones. In this optimization problem, the sparsity of Γ is exploited by including an ℓ1-regularization term in (46), while the non-negativity and the property of the columns summing to one are enforced in the (in)equality constraints (47).
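A hedged CVXPY sketch of an optimization problem with the same ingredients as (46)-(47), namely a quadratic data-fit term in vec(Γ), an ℓ1 penalty, non-negativity, and unit column sums, using a random stand-in for the linear map acting on vec(Γ):

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(6)
K, M = 30, 3
A_fit = rng.standard_normal((50, K * M))     # stand-in for the linear map acting on vec(Gamma) in the data term
b_fit = rng.standard_normal(50)

G = cp.Variable((K, M), nonneg=True)         # point source positioning matrix Gamma >= 0
objective = cp.Minimize(cp.sum_squares(A_fit @ cp.vec(G) - b_fit) + cp.norm1(cp.vec(G)))   # data fit + l1, cf. (46)
constraints = [cp.sum(G, axis=0) == 1]       # each column sums to one, cf. (47)
cp.Problem(objective, constraints).solve()
print(np.round(G.value, 3))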
5. SIMULATION RESULTS
We provide a simulation example, in which the proposed
identification approach is compared to the state-of-the-art
approach for the case of indoor acoustic wave propagation
(c = 344 m/s). We consider a rectangular room of 8 × 6 × 4 m, with M = 3 sources and J = 5 sensors positioned as shown in Fig. 1.


Fig. 1. Simulation scenario: rectangular room (8 × 6 × 4 m) with M = 3 sources (blue) and J = 5 sensors (red o).

Fig. 2. Frequency magnitude responses of the Green's functions related to the different source-observer combinations.
The Green's functions related to the source and observer positions have been simulated using the assumed modes solution to the wave equation, see Gustafsson et al. (2000), truncated to a duration of 10 s, sampled at f_s = 100 Hz, and low-pass filtered to suppress the cavity mode at DC. The resulting frequency magnitude responses 20 log_10 |H(r̃_j, r_m, ω)| for m = 1, ..., M, j = 1, ..., J are plotted in Fig. 2. The common resonances can be clearly observed.
The data set was generated as follows: the M source signals were obtained by filtering M Gaussian white noise signals with M different all-pole filters (first-order low-pass, second-order band-pass, and first-order high-pass for m = 1, 2, 3, respectively). The observed signals were obtained by filtering the source signals with the simulated Green's functions and adding Gaussian white noise at a 0 dB signal-to-noise ratio (SNR). The FEM mesh was generated by performing a 3-D Delaunay triangulation on a set of 315 regularly spaced grid points separated by 1 m in each dimension. The resulting FEM mesh consists of 1152 elements, and is shown in Fig. 3.
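As an aside on how such a mesh could be generated in practice, the Python sketch below builds a regular 1 m grid for the 8 × 6 × 4 m room (315 points, as above) and applies a 3-D Delaunay triangulation with SciPy; the resulting element count depends on the triangulation and need not equal the 1152 elements reported here, and the source-location query at the end only illustrates the barycentric membership underlying (33).

import numpy as np
from scipy.spatial import Delaunay

# regular 1 m grid in an 8 x 6 x 4 m room: 9 * 7 * 5 = 315 points
x, y, z = np.meshgrid(np.arange(9.0), np.arange(7.0), np.arange(5.0), indexing="ij")
points = np.column_stack([x.ravel(), y.ravel(), z.ravel()])

tri = Delaunay(points)                        # 3-D Delaunay triangulation into tetrahedra
print(points.shape[0], "points,", tri.simplices.shape[0], "tetrahedra")

# locate the tetrahedron containing a candidate source position (an arbitrary test point)
r_src = np.array([[2.3, 1.7, 0.9]])
print("source lies in tetrahedron", int(tri.find_simplex(r_src)[0]))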

Fig. 3. Visualization of the tetrahedral FEM mesh.

Fig. 4. Results with exact source positioning matrix.

Fig. 5. Results with estimated source positioning matrix.


We evaluate the capability of the data-based (DATA) and proposed (HYBRID) identification approaches to capture the resonant behavior of the wave propagation, by inspecting the pole-zero model inverse denominator frequency magnitude response 20 log_10 |A^{-1}(r̃_j, r_m, ω)|. The pole-zero model orders are set to Q = P = 12. The proposed approach is evaluated with the Galerkin equality constraints imposed at L = 1 frequency and at L = 2 frequencies. These frequencies are chosen to correspond to the 4th and 1st resonance frequency of the Green's functions, respectively, i.e., ω_1 = 2.7018 rad and ω_2 = 1.3509 rad.
Fig. 4 shows the results for the case when the point source
positions are exactly known, and hence (40)-(41) can be
directly solved. It is clearly observed that by imposing the
Galerkin equality constraints at a certain frequency, the
resonant behavior at that particular frequency is identified
much more accurately compared to the case when only
measurement information is used.
Finally, Fig. 5 shows the results for the case when the point source positions are unknown, and the point source positioning matrix Γ is estimated using the sparse approximation algorithm (46)-(47) prior to executing the hybrid identification algorithm (40)-(41). The resulting identification performance is seen to be comparable to the case when exact knowledge of the point source positions is assumed.

REFERENCES

Brenner, S.C. and Scott, L.R. (2008). The Mathematical Theory of Finite Element Methods. Springer, New York.
Gustafsson, T., Vance, J., Pota, H.R., Rao, B.D., and Trivedi, M.M. (2000). Estimation of acoustical room transfer functions. In Proc. 39th IEEE Conf. Decision Control (CDC 00), 5184-5189. Sydney, Australia.
Haneda, Y., Makino, S., and Kaneda, Y. (1994). Common acoustical pole and zero modeling of room transfer functions. IEEE Trans. Speech Audio Process., 2(2), 320-328.
Hikichi, T. and Miyoshi, M. (2004). Blind algorithm for calculating common poles based on linear prediction. In Proc. 2004 IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP 04), volume 4, 89-92. Montreal, Quebec, Canada.
Kuttruff, H. (2009). Room Acoustics. Spon Press, London.
Rolain, Y., Vandersteen, G., and Schoukens, J. (1998). Best conditioned common denominator transfer function matrix estimation in the frequency domain. In Proc. 37th IEEE Conf. Decision Control (CDC 98), 3938-3939. Tampa, Florida, USA.
Stoica, P. and Jansson, M. (2000). MIMO system identification: State-space and subspace approximations versus transfer function and instrumental variables. IEEE Trans. Signal Process., 48(11), 3087-3099.
van Waterschoot, T. and Leus, G. (2011). Static field estimation using a wireless sensor network based on the finite element method. In Proc. Int. Workshop Comput. Adv. Multi-Sensor Adaptive Process. (CAMSAP 11), 369-372. San Juan, PR, USA.
Verboven, P., Guillaume, P., Cauberghe, B., Vanlanduit, S., and Parloo, E. (2004). Modal parameter estimation from input-output Fourier data using frequency-domain maximum likelihood identification. J. Sound Vib., 276(3-5), 957-979.
