
www.ietdl.org
Published in IET Computer Vision
Received on 24th September 2012
Revised on 7th June 2013
Accepted on 16th July 2013
doi: 10.1049/iet-cvi.2013.0017

ISSN 1751-9632

Augmented Lagrangian-based approach for dense three-dimensional structure and motion estimation from binocular image sequences
Geert De Cubber^{1,2}, Hichem Sahli^{1,3}
1 Electronics and Information Processing (ETRO), Vrije Universiteit Brussel, Brussels 1040, Belgium
2 Mechanical Engineering, Royal Military Academy of Belgium, Brussels 1000, Belgium
3 Interuniversity Microelectronics Centre IMEC, Heverlee 3001, Belgium
E-mail: geert.de.cubber@rma.ac.be

Abstract: In this study, the authors propose a framework for stereo-motion integration for dense depth estimation. They formulate the stereo-motion depth reconstruction problem as a constrained minimisation one. A sequential unconstrained minimisation technique, namely the augmented Lagrange multiplier (ALM) method, has been implemented to address the resulting constrained optimisation problem. ALM has been chosen because of its relative insensitivity to whether the initial design points for a pseudo-objective function are feasible or not. The development of the method and results from solving the stereo-motion integration problem are presented. Although the authors' work is not the only one adopting the ALM framework in the computer vision context, to their knowledge the presented algorithm is the first to use this mathematical framework in a context of stereo-motion integration. This study describes how the stereo-motion integration problem was cast in a mathematical context and solved using the presented ALM method. Results on benchmark and real visual input data show the validity of the approach.

1 Introduction

1.1 Problem statement

The integration of the stereo and motion depth cues offers the
potential of a superior depth reconstruction, as the
combination of temporal and spatial information makes it
possible to reduce the uncertainty in the depth
reconstruction result and to augment its precision. However,
this requires the development of a data fusion methodology,
which is able to combine the advantages of each method,
without propagating errors induced by one of the depth
reconstruction cues. Therefore the mathematical formulation
of the problem of combining stereo and motion information
must be carefully considered.
The dense depth reconstruction problem can be cast as a variational problem, as advocated by a number of researchers [1, 2]. The main problem in dense stereo-motion reconstruction is that the solution depends on the simultaneous evaluation of multiple constraints, which have to be balanced carefully. This is sketched in Fig. 1, which shows the different constraints to be imposed for a sequence acquired with a moving binocular camera. Consider a pair of rectified stereo images (I_1^l, I_1^r) at time t = t_0 and a stereo pair (I_2^l, I_2^r) at time t = t_0 + t_k, with t_k being determined by the frame rate of the camera. A point x_1^l in the reference frame I_1^l can be related to a point x_1^r via the stereo constraint, as well as to a point x_2^l via the motion
© The Institution of Engineering and Technology 2014

constraint. Using the stereo and motion constraints in combination, the point x_1^l can even be related to a point x_2^r, via a stereo + motion or a motion + stereo constraint. It is evident that, ideally, all these interrelations should be taken into consideration for all the pixels in all the frames of the sequence. In the following, we present such a methodology for addressing the stereo-motion integration problem for dense reconstruction.
1.2 State-of-the-art

The early work on stereo-motion integration goes back to the approach of Richards [3], relating the stereo-motion integration problem to the human vision system. Based on this analysis, Waxman and Duncan [4] proposed a stereo-motion fusion algorithm. They define a binocular difference flow as the difference between the left and right optical flow fields, where the right flow field is shifted by the current disparity field. In 1993, Li and Duncan [5] presented a method for recovering structure from stereo and motion. They assume that the cameras undergo translation, but no rotational motion. Tests on laboratory scenes presented good results; however, the constraint of having only translational motion is hard to fulfil for a real-world application.
The above-mentioned early work on stereo-motion integration generally considers only sparse features and uses
IET Comput. Vis., 2014, Vol. 8, Iss. 2, pp. 98-109, doi: 10.1049/iet-cvi.2013.0017


Fig. 1 Motion and stereo constraints on a binocular sequence

three-dimensional (3D) tracking techniques [6] or direct methods [7] for reconstruction. Tracking techniques track 3D tokens from frame to frame and estimate their kinematics. The motion computation problem is formulated as a tracking problem and solved using an extended Kalman filter. Direct methods use a rigid-body motion model to estimate relative camera orientation and local ranges for both the stereo and motion components of the data. The obvious disadvantage of sparse reconstruction methodologies is that no densely reconstructed model can be obtained. To overcome this problem, other researchers have proposed model-based approaches [8]. The visible scene surface is represented with a parametrically deformable, spatially adaptive, wireframe model. The model parameters are iteratively estimated using the image intensity matching criterion. The disadvantage of this kind of approach is that it only works well for reconstructing objects that can be easily modelled (small objects, statues, …), and not for unstructured environments like outdoor natural scenes.
Recent approaches to stereo-motion-based reconstruction concentrate more on dense reconstruction. The general idea of these approaches is to combine the left and right optical flows with the disparity field, for example, using space carving [9] or voxel carving [10]. Some researchers [11] emphasise the stereo constraint and only reinforce the stereo disparity estimates using optical flow information, whereas Isard and MacCormick [12] use more advanced belief propagation techniques to find the right balance between the stereo and optical flow constraints. Sudhir et al. [13] model the visual processes as a sequence of coupled Markov random fields (MRFs). The MRF formulation allows appropriate interactions between the stereo and motion processes to be defined and outlines a solution in terms of an appropriate energy function. The MRF property allows the interactions between stereo and motion to be modelled in terms of local probabilities, specified in terms of local energy functions. These local energy functions express constraints helping the stereo disambiguation by significantly reducing the search space. The integration algorithm as proposed by Sudhir et al. [13] makes the visual processes tightly constrained and reduces the possibility of an error. Moreover, it is able to detect stereo-occlusions and sharp object boundaries in both the disparity and the motion field. However, as this is a local method, it has difficulties when there are many regions with homogeneous intensities. In these regions, any local method

of computation of stereo and motion is unreliable. Other researchers (e.g. Larsen et al. in [14]) later improved the MRF-based stereo-motion reconstruction methodology by making it able to operate on a 3D graph that includes both spatial and temporal neighbours and by introducing noise suppression methods.
As an alternative to the MRF-based approach, Strecha and Van Gool [1, 15] presented a partial differential equation (PDE)-based approach for 3D reconstruction from multi-view stereo. Their method builds upon the PDE-based approach for dense optical flow estimation by Proesmans et al. [16] and reasons on the occlusions between stereo and motion to estimate the quality or confidence of correspondences. The evolution of the confidence measures is driven by the difference between the forward and backward flows in the stereo and motion directions. Based on the above-estimated per-pixel and per-depth-cue quality or confidence measures, their weighting scheme guides at every iteration and at every pixel the relative influences of both depth cues during the evolution towards the solution.
Other researchers [17-20] use scene-flow-based methods for stereo-motion integration. Like the optical flow, 3D scene flow is defined at every point in a reference image. The difference is that the velocity vector in the scene-flow field contains not only x and y, but also z velocities.
Zhang and Kambhamettu [17] formulated the problem as computing a 4D vector (u, v, w, d), where (u, v) are the components of the optical flow vector, d is the disparity and w is the disparity motion, at every point of the reference image, where the initial disparity is used as an initial guess. However, with serious occlusion and a limited number of cameras, this formulation is very difficult, because it implies solving for four unknowns at every point. At least four independent constraints are needed to make the algorithm stable. Therefore in [17], constraints on motion, disparity, smoothness and optical flow, as well as a confidence measurement on the disparity estimation, have been formulated. The major disadvantage of this approach is its limitation to slowly moving Lambertian scenes under constant illumination.
The method advocated by Pons et al. in [18] handles
projective distortion without any approximation of shape
and motion and can be made robust to appearance changes.
The metric used in their framework is the ability to predict
the other input views from one input view and the
estimated shape or motion. Their method consists of
maximising, with respect to shape and motion, the
similarity between each input view and the predicted
images coming from the other views. They warp the input
images to compute the predicted images, which
simultaneously removes projective distortion.
Huguet and Devernay [19] proposed a method to recover the scene flow by coupling the optical flow estimation in both cameras with dense stereo matching between the images, thus reducing the number of unknowns per image point. The main advantage of this method is that it handles occlusions both for optical flow and stereo. In [20], Sizintsev and Wildes extend the scene-flow reconstruction approach by introducing a spatiotemporal quadric element, which encapsulates both spatial and temporal image structure for 3D estimation. These so-called stequels are used for spatiotemporal view matching. Whereas Huguet and Devernay [19] apply a joint smoothness term to all displacement fields, Valgaerts et al. [21] propose a regularisation strategy that penalises discontinuities in the different displacement fields separately.

1.3 Related work

As can be noted from the overview of the previous section, most of the recent research works on stereo-motion reconstruction use scene-flow-based reconstruction methods. The main disadvantage of 3D scene flow is that it is computationally quite expensive, because of the 4D nature of the problem. Therefore we formulate the stereo-motion depth reconstruction problem as a constrained minimisation one and use a sequential unconstrained minimisation technique, namely the augmented Lagrange multiplier (ALM) method, for solving it. This approach was presented originally by De Cubber in [22]. The use of ALM has also been proposed recently by Del Bue et al. [23]; however, they apply the technique only to singular stereo reconstruction and structure from motion, whereas we propose the use of ALM for integrated stereo-motion reconstruction.
The augmented Lagrangian (AL)-based stereo-motion reconstruction methodology presented here differentiates itself from the current state-of-the-art in stereo-motion reconstruction by a number of key factors: the processing strategy, depicted in Fig. 2, considers three sources of information for the structure estimation process: left and right proximity maps from motion, and a proximity map from stereo. During optimisation, information from the (central) proximity map from stereo is transferred to the left and right proximity maps, which are the ones actually being optimised simultaneously. During the optimisation process, data is constantly being interchanged between both optimisers, as they are highly dependent. The advantage of this concurrent optimisation methodology is that it provides a symmetric processing cue. This makes it easier to handle the uncertainties induced by the unknown displacements between the different cameras, in comparison with other approaches [13] which consider only one reference image and warp all other images to this reference image for matching and depth estimation. Other researchers have noted this too

Fig. 2 Processing strategy of a binocular sequence: from the left and right image sequences, proximity maps are calculated through stereo and dense structure from motion
These maps are iteratively improved by constrained optimisation, using the AL method

and have used even more depth or proximity maps. In [1], Strecha and Van Gool combine four proximity maps d_i^l, d_{i+1}^l, d_i^r and d_{i+1}^r, as displayed in Fig. 1. The problem with using so many proximity maps, however, is that the problem size is increased drastically, and with it, also the computation time.
The proposed methodology poses the dense stereo-motion reconstruction problem as a constrained optimisation problem and uses the AL to transform the estimation into an unconstrained optimisation problem, which can be solved with a classical method. Other researchers, in contrast, express the stereo-motion reconstruction problem as an MRF [13, 14] or a graph cut [2] optimisation problem. The approach we follow is very natural, as the stereo-motion reconstruction problem is by nature a highly constrained and tightly coupled optimisation problem and the AL has been proven before [23, 24] to be an excellent method for this kind of problem.

2 Methodology

2.1 Depth reconstruction model

The stereo-motion integration problem for dense depth estimation can be regarded as a high-dimensional data fusion problem. In this paper, we formulate the stereo-motion depth reconstruction problem as a constrained minimisation one, with a suitable functional that minimises the error on the dense reconstruction. Fig. 2 illustrates the proposed methodology, where a pair of stereo images at time t is related to a consecutive pair at time t + 1.
Fig. 2 considers a binocular image stream consisting of the left and right images of a stereo camera system. The left and right streams are processed individually, using the dense structure-from-motion algorithm proposed by De Cubber and Sahli in [25], resulting in left and right proximity maps d^l and d^r, respectively. In parallel, the left and right images are combined using a stereo algorithm [26, 27], embedded in the used Bumblebee stereo camera. As a result of this stereo computation, a new proximity map from stereo, d^c, can be defined. The reason for calling this proximity map d^c lies in the fact that it is defined in the reference frame of a virtual central camera of the stereo vision system.
There exist strong interrelations between the different proximity maps d^l, d^c and d^r, which need to be expressed to ensure consistency and to improve the reconstruction result. Therefore we adopt an approach where the left proximity map d^l is optimised, subject to two constraints, relating it to d^c and d^r, respectively. In parallel, the right proximity map d^r is optimised, also subject to two constraints, relating it to d^c and d^l. The compatibility of the left and right proximities is hereby automatically ensured, as both d^l and d^r are related to d^c.
The dense stereo-motion reconstruction problem can thus be stated as the following constrained optimisation problem

Find min_{x ∈ Ω} E(x)  subject to  u_i(x) = 0 for i = 1, …, n   (1)

with E(x) the objective functional and u_i(x) expressing a number of constraint equations.
IET Comput. Vis., 2014, Vol. 8, Iss. 2, pp. 98109
doi: 10.1049/iet-cvi.2013.0017

www.ietdl.org
A traditional solving technique for constrained optimisation problems such as the one posed by (1) is the Lagrangian multiplier method, which converts a constrained minimisation problem into an unconstrained minimisation problem of a Lagrange function. In theory, the Lagrangian methodology can be used to solve the stereo-motion reconstruction problem; however, to improve the convergence characteristics of the optimisation scheme, it is better [28] to use the AL L(x, λ), with λ the vector of Lagrange multipliers. The AL, which was presented by Powell and Hestenes in [29, 30], adds a quadratic penalty term to the original Lagrangian

L(x, λ) = E(x) + Σ_{i=1}^{n} λ_i u_i(x) + (r/2) Σ_{i=1}^{n} u_i(x)²   (2)

with a penalty parameter r > 0.
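The sequential unconstrained minimisation loop behind (2) can be sketched on a toy problem. This is a minimal sketch under assumed values: the quadratic objective, the single constraint and the penalty r = 10 are illustrative choices of ours, not the paper's image functionals.

```python
# Toy illustration of the augmented Lagrange multiplier (ALM) scheme of (2):
# minimise E(x) = x^2 subject to the single constraint u(x) = x - 1 = 0.
# The inner unconstrained minimisation of L(x, lam) is done in closed form
# here; the paper minimises its per-pixel ALs numerically instead.

def alm_solve(r=10.0, iterations=25):
    """Sequential unconstrained minimisation: minimise L, then update lam."""
    lam = 0.0
    x = 0.0
    for _ in range(iterations):
        # argmin_x  x^2 + lam*(x - 1) + (r/2)*(x - 1)^2:
        # dL/dx = 2x + lam + r*(x - 1) = 0  =>  x = (r - lam) / (2 + r)
        x = (r - lam) / (2.0 + r)
        # Multiplier update: lam <- lam + r * u(x)
        lam = lam + r * (x - 1.0)
    return x, lam

x_opt, lam_opt = alm_solve()
# Converges to the constrained optimum x* = 1 with multiplier lam* = -2
```

The same two-step pattern, an inner minimisation of L followed by a multiplier update, is what the paper applies per pixel to its left and right ALs.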


In the context of dense stereo-motion reconstruction, we seek to simultaneously minimise two energy functions: E^l(d^l), for the left image, and E^r(d^r), for the right image, which we seek to optimise subject to four constraints:

1. u_lc^l(d^l, d^c) = 0 relates d^l to the proximity map obtained from stereo, d^c.
2. u_lr^l(d^l, d^r) = 0 relates d^l to the proximity map of the right image, d^r.
3. u_rc^r(d^r, d^c) = 0 relates d^r to the proximity map obtained from stereo, d^c.
4. u_rl^r(d^r, d^l) = 0 relates d^r to the proximity map of the left image, d^l.
According to the AL theorem and the definition given by (2), we can write the AL for the left image as follows

L^l(d^l, λ_lc^l, λ_lr^l) = E^l(d^l) + λ_lc^l u_lc^l(d^l, d^c) + (r/2) [u_lc^l(d^l, d^c)]²
                         + λ_lr^l u_lr^l(d^l, d^r) + (r/2) [u_lr^l(d^l, d^r)]²   (3)

For the right image, we have in a similar fashion

L^r(d^r, λ_rc^r, λ_rl^r) = E^r(d^r) + λ_rc^r u_rc^r(d^r, d^c) + (r/2) [u_rc^r(d^r, d^c)]²
                         + λ_rl^r u_rl^r(d^r, d^l) + (r/2) [u_rl^r(d^r, d^l)]²   (4)
The energy functions in (3) and (4) express the relationship between structure and motion between successive images. It has to be noted that the approach for solving the reconstruction problem is, in principle, not tied to the formulation of the dense structure-from-motion problem, so any formulation can be chosen. Here, we use the dense structure-from-motion approach presented originally by De Cubber in [22], which formulates dense structure from motion as minimising the following energy functional [25]

E = φ_data + μ φ_regularisation   (5)

The data term is based on the image-derivatives-based optical flow constraint

φ_data = ( I_x [a d + b] + I_y [α d + β] + I_t )²   (6)

where I_x and I_y denote the spatial gradients of the image in the x- and y-directions, I_t denotes the temporal gradient, d is a depth (proximity) parameter and the motion coefficients [a, b, α, β] are defined as a function of the camera focal length f, its translation t = (t_x, t_y, t_z) and its rotation ω = (ω_x, ω_y, ω_z)

[ a ]           [ f t_x + x t_z ]
[ α ] = Q_t t = [ f t_y + y t_z ]

[ b ]           [ (xy/f) ω_x − (f + x²/f) ω_y + y ω_z ]
[ β ] = Q_ω ω = [ (f + y²/f) ω_x − (xy/f) ω_y − x ω_z ]   (7)
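As a sketch of (6) and (7), the per-pixel motion coefficients and the optical flow predicted from a proximity map can be computed as follows. The image size, focal length and motion values are illustrative assumptions, and the signs follow the flow model as reconstructed above.

```python
# Sketch of the motion model of (6)-(7): given a proximity map d and a known
# camera motion (t, omega), compute the per-pixel motion coefficients
# [a, b, alpha, beta] and the predicted flow (a*d + b, alpha*d + beta).
import numpy as np

def motion_coefficients(x, y, f, t, omega):
    """Per-pixel coefficients [a, b, alpha, beta] of the flow model (7)."""
    tx, ty, tz = t
    wx, wy, wz = omega
    a     = f * tx + x * tz                                   # Q_t t, row 1
    alpha = f * ty + y * tz                                   # Q_t t, row 2
    b     = (x * y / f) * wx - (f + x**2 / f) * wy + y * wz   # Q_w w, row 1
    beta  = (f + y**2 / f) * wx - (x * y / f) * wy - x * wz   # Q_w w, row 2
    return a, b, alpha, beta

def predicted_flow(d, f, t, omega):
    """Predicted optical flow fields (u, v) for proximity map d, as in (6)."""
    h, w = d.shape
    y, x = np.mgrid[0:h, 0:w].astype(float)
    a, b, alpha, beta = motion_coefficients(x, y, f, t, omega)
    return a * d + b, alpha * d + beta

# Pure x-translation: the flow reduces to u = f * t_x * d and v = 0
d = np.full((4, 4), 0.01)          # constant proximity map (illustrative)
u, v = predicted_flow(d, f=500.0, t=(0.1, 0.0, 0.0), omega=(0.0, 0.0, 0.0))
```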

As expressed by (5), a regularisation term is used to filter erroneous reconstruction results and to smooth and extrapolate the structure (depth) over related pixels. A key aspect here is of course to find out which pixels are related (e.g. belonging to the same object at the same distance), such that proximity information can be propagated, and which pixels are not related. Here, we make use of the Nagel-Enkelmann anisotropic regularisation model, as defined in [31]

φ_regularisation = (∇d)ᵀ D(∇I_1) (∇d)   (8)

with D a regularised projection matrix.
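Equation (8) leaves D unspecified beyond "a regularised projection matrix". The sketch below assumes the common Nagel-Enkelmann form D(∇I) = (∇I_⊥ ∇I_⊥ᵀ + γ² Id) / (|∇I|² + 2γ²), with an illustrative γ; both are assumptions of ours rather than details given in the text.

```python
# Sketch of the Nagel-Enkelmann anisotropic regulariser of (8), assuming the
# standard form of the projection matrix D (not spelled out in the paper).
import numpy as np

def nagel_enkelmann_D(grad_I, gamma=1.0):
    """Regularised projection matrix D(grad I) for one pixel."""
    gx, gy = grad_I
    perp = np.array([-gy, gx])                    # direction along the edge
    norm2 = gx**2 + gy**2
    return (np.outer(perp, perp) + gamma**2 * np.eye(2)) / (norm2 + 2 * gamma**2)

def regulariser(grad_d, grad_I, gamma=1.0):
    """phi_regularisation = (grad d)^T D(grad I) (grad d) at one pixel."""
    grad_d = np.asarray(grad_d, dtype=float)
    return grad_d @ nagel_enkelmann_D(grad_I, gamma) @ grad_d

# In a flat image region (zero image gradient), D reduces to Id/2, i.e. the
# model falls back to isotropic smoothing; across a strong edge, smoothing
# of the depth gradient perpendicular to the edge is strongly reduced.
```

This behaviour matches the role described in the text: proximity information is propagated over related pixels while depth discontinuities at image edges are preserved.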


The energy functions E^l(d^l) and E^r(d^r) can then be defined as

E^l(d^l) = φ_data^l(d^l) + μ φ_regularisation^l(d^l)   (9)

E^r(d^r) = φ_data^r(d^r) + μ φ_regularisation^r(d^r)   (10)

with φ_data^l(d^l) and φ_data^r(d^r) as given by (6) for, respectively, the left and right images, and φ_regularisation^l(d^l) and φ_regularisation^r(d^r) the regularisation terms, according to (8). The diffusion parameter μ regulates the balance between the data and regularisation terms. In order to regulate this balance, μ is estimated iteratively, using the methodology described in [22].
The constraints u_ij^i(d^i, d^j), with i, j ∈ {left, centre, right}, express the similarity between an estimated proximity map d^i and another proximity map d^j. In order to calculate this similarity measure, the second proximity map must be warped to the first one. This warping process can be expressed by introducing a warping function χ = χ(x, d, ω, t), with d the proximity, and ω and t the camera rotation and translation, respectively. χ allows defining the constraint equations u_ij^i, i, j ∈ {l, r, c}, as errors in the warping

u_ij^i(d^i, d^j) = [ d^i(x) − d^j( x + χ(x, d^i(x), ω_ji, t_ji) ) ]²   (11)

The first constraint, u_lc^l(d^l, d^c), expresses the similarity between the estimated left proximity map d^l and the proximity map from stereo, d^c. The motion that is considered in this case is in fact the displacement between the left camera and the virtual central camera, which is known a priori. Since we consider rectified stereo images, the rotational movement between the cameras is zero (ω_stereo = 0) and the translational movement is along the X-axis over a distance of half the stereo baseline b, such that t_cl = (b/2, 0, 0)ᵀ. For estimating the depth, an iterative
procedure is proposed. Following this methodology, the current estimate of the proximity map d^l is substituted into (11). As such, the warping process is integrated in the optimisation scheme and will gradually improve over time. Finally, u_lc^l(d^l, d^c) is given by

u_lc^l(d^l, d^c) = [ d^l(x) − d^c( x + χ(x, d^l(x), 0, t_cl) ) ]²   (12)

The second constraint on the left proximity map, u_lr^l(d^l, d^r), can be obtained in the same way

u_lr^l(d^l, d^r) = [ d^l(x) − d^r( x + χ(x, d^l(x), 0, t_st) ) ]²   (13)

Note that, in this case, we use the translation over the whole baseline, t_st = (b, 0, 0)ᵀ, for warping the right proximity map to the left proximity map.
The constraints on the right proximity map are as follows

u_rc^r(d^r, d^c) = [ d^r(x) − d^c( x + χ(x, d^r(x), 0, t_st/2) ) ]²   (14)

u_rl^r(d^r, d^l) = [ d^r(x) − d^l( x + χ(x, d^r(x), 0, t_st) ) ]²   (15)

By integrating the definitions of the energy functions of (9) and (10), and the constraints (12)-(15), into the formulation of the AL functions given by (3) and (4), the constrained minimisation problem stated in (1) is now completely defined. How this problem is numerically solved is discussed in the following section.
discussed in the following section.
2.2 Numerical implementation

The discrete version of (3) is given by

(L_1^l)_{i,j}^k = (E^l)_{i,j}^k + (λ_lc^l)_{i,j}^k (u_lc^l)_{i,j}^k + (r/2) [(u_lc^l)_{i,j}^k]² + (λ_lr^l)_{i,j}^k (u_lr^l)_{i,j}^k + (r/2) [(u_lr^l)_{i,j}^k]²   (16)

The constraints given by (12) and (13) measure the dissimilarity between the left proximity map and the (warped) central and right proximity maps, respectively. However, these proximity maps are discrete and possibly highly discontinuous, which makes them impractical to work with in an optimisation scheme. Therefore we use an interpolation function f_I(d, x, y), which interpolates the discrete function d at a continuous location (x, y). In this work, we use a bi-cubic spline interpolation function [32], and formulate the discrete version of the constraint φ_lc^l of (12) as

(φ_lc^l)_{i,j}^k = [ (d^l)_{i,j}^k − f_I( d^c, i − (b/2) f (d^l)_{i,j}^k, j ) ]²   (17)

Similarly, (φ_lr^l)_{i,j}^k is given by

(φ_lr^l)_{i,j}^k = [ (d^l)_{i,j}^k − f_I( d^r, i − b f (d^l)_{i,j}^k, j ) ]²   (18)
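The per-pixel constraint (17) can be sketched as follows. The paper interpolates with bi-cubic splines [32]; purely to keep the sketch short, bilinear interpolation is substituted here, and the baseline, focal length and maps are illustrative assumptions.

```python
# Sketch of the discrete warping constraint (17): the left proximity map is
# compared against the central map sampled at a continuously warped position.
import numpy as np

def f_I(d, x, y):
    """Interpolate the discrete map d at the continuous location (x, y).
    The paper uses bi-cubic splines; bilinear is used here for brevity."""
    h, w = d.shape
    x0 = int(np.clip(np.floor(x), 0, w - 2))
    y0 = int(np.clip(np.floor(y), 0, h - 2))
    ax, ay = x - x0, y - y0
    return ((1 - ay) * ((1 - ax) * d[y0, x0] + ax * d[y0, x0 + 1])
            + ay * ((1 - ax) * d[y0 + 1, x0] + ax * d[y0 + 1, x0 + 1]))

def phi_lc(d_l, d_c, i, j, f, b):
    """Squared mismatch between d_l at (i, j) and the central map sampled at
    the position warped over half the baseline, as in (17)."""
    return (d_l[j, i] - f_I(d_c, i - 0.5 * b * f * d_l[j, i], j)) ** 2

# If the stereo map agrees with the (warped) left map, the constraint is ~0:
d_c = np.full((8, 8), 0.02)   # constant proximity: warping maps it onto itself
d_l = d_c.copy()
violation = phi_lc(d_l, d_c, i=4, j=4, f=100.0, b=0.1)
```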

The update equations of the Lagrangian multipliers λ_{i,j}^k are derived as follows. When the solution x^k converges to a local minimum x*, the multipliers λ^k must converge to the corresponding optimal Lagrange multipliers λ*. This condition can be expressed by differentiating the AL of (2) with respect to x

∇_x L(x, λ) = ∇E(x) + Σ_{i=1}^{n} λ_i ∇u_i(x) + r Σ_{i=1}^{n} u_i(x) ∇u_i(x)   (19)

In the local minimum, ∇E(x*) + Σ_{i=1}^{n} λ_i* ∇u_i(x*) = 0, and the optimality conditions on the AL require that also ∇_x L(x*, λ) = 0; hence, comparing the two expressions, we can deduce

λ_i* = λ_i + r u_i(x)   (20)

which gives us an update scheme for the Lagrangian multipliers, such that they converge to λ_i*

(λ_lc^l)_{i,j}^{k+1} = (λ_lc^l)_{i,j}^k + r (u_lc^l)_{i,j}^k
(λ_lr^l)_{i,j}^{k+1} = (λ_lr^l)_{i,j}^k + r (u_lr^l)_{i,j}^k
(λ_rc^r)_{i,j}^{k+1} = (λ_rc^r)_{i,j}^k + r (u_rc^r)_{i,j}^k
(λ_rl^r)_{i,j}^{k+1} = (λ_rl^r)_{i,j}^k + r (u_rl^r)_{i,j}^k   (21)

The expression of the energy and the constraint equations completely defines the formulation of the AL of (16), governing the iterative optimisation of the left proximity map d^l. As such, the constrained optimisation problem of (1) is transformed into an unconstrained optimisation problem. To solve this unconstrained optimisation problem, we use a classical numerical solving technique, proposed by Brent in [33]. Brent's method switches between inverse parabolic interpolation and golden section search. Golden section search [34] is a methodology for finding the minimum of a bounded function by successively narrowing the range of values inside which the minimum is known to exist. This range is also updated using inverse parabolic interpolation, but only if the produced result is acceptable. If not, then the algorithm falls back to an ordinary golden section step.
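The bounded search described above can be sketched as follows. This sketch implements the golden section step only, omitting the inverse parabolic interpolation of Brent's full method; the bounds and test function are illustrative.

```python
# Golden section search: narrow a bracket [lo, hi] known to contain the
# minimum of a unimodal function, shrinking it by 1/phi per iteration.
import math

def golden_section_minimise(fun, lo, hi, tol=1e-8):
    """Minimise a unimodal function on [lo, hi] by golden section search."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0   # 1/phi, about 0.618
    a, b = lo, hi
    c = b - invphi * (b - a)
    d = a + invphi * (b - a)
    while b - a > tol:
        if fun(c) < fun(d):
            b, d = d, c                      # minimum lies in [a, d]
            c = b - invphi * (b - a)
        else:
            a, c = c, d                      # minimum lies in [c, b]
            d = a + invphi * (b - a)
    return 0.5 * (a + b)

# Example: a 1D objective with its minimum inside the bounds, as in the
# per-pixel proximity search between d_min and d_max of (23):
x_min = golden_section_minimise(lambda x: (x - 0.3) ** 2, 0.0, 1.0)
```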
This optimisation method converges to a minimum within the search interval. Therefore it is crucial that a good initial value is available for all status variables. To estimate this initial value for the proximity field, the dense disparity map from stereo is used. The reason for this is that the camera displacement between the left and right stereo frames is well known and is fixed over time. As such, it is possible to warp the stereo data in the virtual central camera reference frame towards the left and the right image with high accuracy. Applying image warping following the perspective projection model, it is possible to define the equations providing an initial value for the left and right proximity maps d^l and d^r, based on a stereo proximity map d_st

d_initial^l(x, y) = d_st( x − (b/2) f d_st(x, y), y )

d_initial^r(x, y) = d_st( x + (b/2) f d_st(x, y), y )   (22)

As can be noted, (22) contains no real unknown data, apart from the stereo proximity map d_st.
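The initialisation of (22) can be sketched as a warp of the stereo proximity map over half the baseline towards each camera. Nearest-pixel sampling and the numeric values used below are simplifying assumptions of this sketch.

```python
# Sketch of (22): warp the central stereo proximity map d_st to the left and
# right camera reference frames; out-of-image samples become NaN gaps, the
# "blind spots" the optimiser later has to fill in.
import numpy as np

def initial_proximity_maps(d_st, f, b):
    """Initial left and right proximity maps warped from the stereo map."""
    h, w = d_st.shape
    y, x = np.mgrid[0:h, 0:w]
    shift = 0.5 * b * f * d_st                      # half-baseline disparity
    maps = []
    for sign in (-1.0, +1.0):                       # left map: -, right map: +
        xs = np.rint(x + sign * shift).astype(int)  # nearest-pixel sampling
        valid = (xs >= 0) & (xs < w)
        d = np.full((h, w), np.nan)
        d[valid] = d_st[y[valid], xs[valid]]
        maps.append(d)
    return maps[0], maps[1]

# A constant proximity map warps onto itself wherever the sample stays in-image
d_st = np.full((6, 6), 0.01)
d_l0, d_r0 = initial_proximity_maps(d_st, f=200.0, b=0.1)
```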

The application of Brent's optimisation method also requires that the minimum and maximum boundaries between which the solution is to be found are known. In our case, this means that minimum and maximum proximity values must be available for each pixel of the left and right images. These minimum and maximum proximity maps are calculated based on the 3σ error interval of the initial value of the proximity maps

d_min^i = d_initial^i − 3σ(d_initial^i)
d_max^i = d_initial^i + 3σ(d_initial^i)   (23)

where d_initial^l and d_initial^r are calculated according to (22).
For the right proximity map, a set of similar expressions can be found, starting from the AL

(L_1^r)_{i,j}^k = (E^r)_{i,j}^k + (λ_rc^r)_{i,j}^k (u_rc^r)_{i,j}^k + (r/2) [(u_rc^r)_{i,j}^k]² + (λ_rl^r)_{i,j}^k (u_rl^r)_{i,j}^k + (r/2) [(u_rl^r)_{i,j}^k]²   (24)

and with the constraints

(φ_rc^r)_{i,j}^k = [ (d^r)_{i,j}^k − f_I( d^c, i + (b/2) f (d^r)_{i,j}^k, j ) ]²   (25)

(φ_rl^r)_{i,j}^k = [ (d^r)_{i,j}^k − f_I( d^l, i + b f (d^r)_{i,j}^k, j ) ]²   (26)

Algorithm 1 details the constrained optimisation methodology (Fig. 3). As shown in Fig. 3, there are, in fact, two functions that are optimised at the same time: one using (L_1^l)_{i,j}^k, which optimises the left proximity map d^l, and one using (L_1^r)_{i,j}^k, which optimises the right proximity map d^r. In the proposed algorithm, these functions are optimised alternately, hereby always using the latest result for both proximity maps.
An aspect which is not depicted by Algorithm 1 is the choice of the optimal framerate. The underlying structure-from-motion algorithm uses the geometric robust information criterion scoring scheme, introduced by Torr in [35], to assess the optimal framerate. This has as an effect that if the camera does not move (no translation and no rotation) between two consecutive time instants, no reconstruction is performed.
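The alternating scheme can be sketched on a toy coupled problem, with two scalars standing in for the maps d^l and d^r and a single coupling constraint. The objective, constraint and penalty value are illustrative assumptions, not the paper's image functionals.

```python
# Toy sketch of the alternating optimisation: minimise the left AL in x with
# y fixed, then the right AL in y with the latest x, then update the
# multiplier. Problem: minimise x^2 + y^2 subject to x - y - 1 = 0.

def alternating_alm(r=10.0, outer_iterations=60):
    """Alternately minimise over x and y, then update the multiplier."""
    x, y, lam = 0.0, 0.0, 0.0
    for _ in range(outer_iterations):
        # Left update: argmin_x  x^2 + lam*(x - y - 1) + (r/2)*(x - y - 1)^2
        x = (r * (y + 1.0) - lam) / (2.0 + r)
        # Right update (using the latest x): argmin_y of the same AL
        y = (lam + r * (x - 1.0)) / (2.0 + r)
        # Multiplier update, as in the scheme of (21)
        lam = lam + r * (x - y - 1.0)
    return x, y, lam

x, y, lam = alternating_alm()
# Converges towards the constrained optimum x* = 0.5, y* = -0.5
```

As in the paper's algorithm, each sub-problem always uses the latest value of the other variable, and the multiplier update couples the two optimisers.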

3 Results and analysis

3.1 Qualitative analysis using a real-world binocular video sequence
3.1.1 Evaluation methodology: The validation and evaluation of a dense stereo-motion reconstruction algorithm requires the use of an image sequence with a moving stereo camera. Hence, we recorded, using a Bumblebee stereo head, an image sequence of an office environment, as illustrated in Fig. 4, hereafter denoted as the Desk sequence. The translation of the camera is mainly along its optical axis (Z-axis) and along the positive X-axis. The rotation of the camera is almost only about the positive Y-axis.

Fig. 3 Constrained optimisation for binocular depth reconstruction using AL




Fig. 4 Some frames of the binocular desk sequence
a Frame 1, left image
b Frame 1, right image
c Frame 10, left image
d Frame 10, right image

As can be seen from Fig. 4, the recorded sequence consists of a cluttered environment, presenting serious challenges for any reconstruction algorithm:
- A cluttered environment with many objects at different scales of depth.
- Relatively large untextured areas (e.g. the wall in the upper left), making correspondence matching very difficult.
- Areas with specular reflection (e.g. on the poster in the upper right of the image), violating the Lambertian assumption traditionally made for stereo matching.
- Variable lighting and heavy reflections (in the window on the upper right), causing saturation effects and incoherent pixel colours across different frames.
We will focus our evaluation on how the presented iterative optimisation methodology deals with these issues and how well it is able to reconstruct the structure of this scene. However, it must not be forgotten that this iterative optimiser is also dependent on an initialisation procedure, which can influence the reconstruction result.
The initialisation step of the iterative optimiser estimates an initial value for the left and right depth fields. This method consists of warping a stereo proximity image to the left and right camera reference frames. The initial values for the left and right proximity maps still contain a lot of blind spots, or areas where no (reliable) proximity data is available. These areas are caused by unsuccessful correspondences in the used stereo vision algorithm, which performs an area-based correlation with sum of absolute differences on bandpassed images [26, 27]. This algorithm is fairly robust and it has a number of validation steps that reduce the level of noise. However, the method requires texture and contrast to work correctly. Effects like occlusions, repetitive features and specular reflections can cause problems, leading to gaps in the proximity maps. In the following discussion, we will evaluate how well the proposed dense stereo-motion algorithm is able to cope with these blind spots and see whether it is capable of filling in the areas where depth data are missing.
To compare our method to the state of the art, we
implemented a more classical dense stereo–motion
reconstruction approach. This approach defines classical
stereo and motion constraints, based upon the constant
image brightness assumption, alongside the Nagel–
Enkelmann regularisation constraint. These constraints are
integrated into one objective function, which is solved
using a traditional trust-region method. As such, this
approach presents a relatively simple and straightforward
solution. This methodology serves as a baseline
benchmarking method for the AL-based stereo–motion
reconstruction technique.
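The structure of such a combined objective can be sketched as follows (a deliberately simplified 1-D toy version with invented names; the actual constraints and trust-region solver of the benchmark method are more elaborate). A data term penalises brightness differences between the left image and the warped right image, and a quadratic smoothness term regularises the disparity field, with a weight `lam` balancing the two. Plain gradient descent stands in for the trust-region solver:

```python
# Toy 1-D objective: brightness-constancy data term + quadratic
# smoothness term on a disparity field d (our own simplification).

def sample(img, pos):
    """Linearly interpolate img (a list) at a real-valued, clamped pos."""
    pos = min(max(pos, 0.0), len(img) - 1.0)
    i = int(pos)
    f = pos - i
    j = min(i + 1, len(img) - 1)
    return (1 - f) * img[i] + f * img[j]

def objective(d, I_left, I_right, lam=0.5):
    """E(d) = sum_x (I_l(x) - I_r(x - d_x))^2 + lam * sum_x (d_{x+1} - d_x)^2."""
    data = sum((I_left[x] - sample(I_right, x - d[x])) ** 2
               for x in range(len(d)))
    smooth = sum((d[x + 1] - d[x]) ** 2 for x in range(len(d) - 1))
    return data + lam * smooth

def minimise(d, I_left, I_right, lam=0.5, step=1e-3, iters=200):
    """Gradient descent with numerical gradients, standing in for the
    trust-region solver; enough to show the objective being driven down."""
    d = list(d)
    eps = 1e-4
    for _ in range(iters):
        grad = []
        for x in range(len(d)):
            d[x] += eps
            e_plus = objective(d, I_left, I_right, lam)
            d[x] -= 2 * eps
            e_minus = objective(d, I_left, I_right, lam)
            d[x] += eps
            grad.append((e_plus - e_minus) / (2 * eps))
        d = [dx - step * g for dx, g in zip(d, grad)]
    return d
```

For a right signal that is a one-pixel shift of the left one, the constant field d = 1 zeroes the objective, and the stand-in solver drives the objective down from any nearby starting point.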
Applying this more classical technique to the Desk
sequence shown in Fig. 4 results in a depth reconstruction
as shown in Fig. 5. Overall, the reconstruction of the
proximity field correlates with the physical reality, as
imaged in Fig. 4, but there are some serious errors in the
reconstructed proximity fields, notably on the board in the
Fig. 5 Proximity maps for different frames of the desk sequence using the global optimisation algorithm
a Frame 1, left proximity d_l^1
b Frame 1, right proximity d_r^1
c Frame 10, left proximity d_l^10
d Frame 10, right proximity d_r^10

Fig. 6 Proximity maps for different frames of the desk sequence using the AL algorithm
a Frame 1, left proximity d_l^1
b Frame 1, right proximity d_r^1
c Frame 10, left proximity d_l^10
d Frame 10, right proximity d_r^10
IET Comput. Vis., 2014, Vol. 8, Iss. 2, pp. 98–109
doi: 10.1049/iet-cvi.2013.0017
© The Institution of Engineering and Technology 2014
Fig. 7 Reconstructed 3D model of the desk sequence
a Novel view 1
b Novel view 2
c Novel view 3
d Novel view 4

middle of the image. This leads us to conclude that this
method is not suitable for high-quality 3D modelling. In the
following, we compare these results with the ones obtained
by the proposed AL-based stereo–motion optimisation
methodology, using the same input sequence.
3.1.2 Reconstruction results: Fig. 6 shows the
reconstructed left and right proximity maps using the
algorithm shown in Fig. 3. The reconstructed proximity field
correlates very well with the physical nature of the scene.
Foreground and background objects are clearly
distinguishable. The depth gradients on the left and back
walls can be clearly identified, despite the fact that there is
very little texture on these walls. The occurrence of specular
reflection on the poster does not cause erroneous
reconstruction results. The only remaining errors on the
proximity field are in fact because of border effects. Indeed,
at the lower left of Fig. 6a and the lower right of Fig. 6b,
one can note some areas where the regularisation has
smoothed out the proximity field. The reason for this lies in
the lack of initial proximity data in these areas. Owing to the
total absence of proximity information in these areas, the
algorithm used the solution from the neighbouring regions.
In general, this was performed correctly, but because of the
lack of information, the algorithm estimated the direction of
regularisation wrongly at these two locations. This is quite a
normal side-effect when using area-based optimisation
techniques, which can be solved by extending the image
canvas before the calculations. The result of Fig. 6 can be
compared with Fig. 5, which shows the same output, but
using the global optimisation approach. From this
comparison, it is evident that the result of the AL-based
reconstruction technique is far superior to the one using
global optimisation. The global optimisation result features
numerous problems: erroneous proximity values,
under-regularised areas, over-regularised areas and erroneous
estimation of discontinuities. None of those problems are
present in the result of the AL, as shown in Fig. 6.
To show the applicability of the presented technique for 3D
modelling, the individual reconstruction results were
integrated to form one consistent 3D representation of the
imaged environment. Fig. 7 shows four novel views of the
3D model. From the different novel viewpoints, the 3D
structure of the office environment can be clearly deduced;
there are no visible outliers and all items in the scene have
been reconstructed, even those with very low texture. This
illustrates the capabilities of the proposed AL-based stereo–
motion reconstruction technique, which allows the
reconstruction of a high-quality 3D model.
3.2 Quantitative analysis using standard
benchmark sequences
For quantitative analysis, we compared the performance of
the proposed approach with a traditional variational
scene-flow-based method using standard benchmark
sequences. The selected benchmarking sequences are the
well-known Cones and Teddy sequences created by
Fig. 8 Quantitative analysis: input images and ground truth depth maps
Top row: left input image and bottom row: ground truth left depth image. Left column: Cones sequence and right column: Teddy sequence
a Cones sequence, left image at t0
b Teddy sequence, left image at t0
c Cones sequence, ground truth depth image at t0
d Teddy sequence, ground truth depth image at t0

Fig. 9 Comparison of the reconstruction result using the traditional variational scene-flow method [19] and the proposed method
Top row: reconstructed left depth image using [19] and bottom row: reconstructed left depth image using proposed method. Left column: Cones sequence and right
column: Teddy sequence
a Cones sequence, depth image at t0 using [19]
b Teddy sequence, depth image at t0 using [19]
c Cones sequence, depth image at t0 using proposed method
d Teddy sequence, depth image at t0 using proposed method
Scharstein and Szeliski [36, 37], shown on the top row of
Fig. 8.
As a baseline algorithm, the variational scene-flow
reconstruction approach presented by Huguet and Devernay
[19] was chosen, as the authors provided the algorithm
online, which makes it possible to perform comparison
tests. To be able to supply a correct comparison of the
stereo–motion reconstruction capabilities of both
algorithms, the same base stereo algorithm [38] was used to
initialise both methods.
The results of any reconstruction algorithm depend largely
on the correct initialisation of the algorithm and the selection
of the parameters. In the initialisation phase of the proposed
method, the estimation of the motion vectors (t_l, ω_l) and
(t_r, ω_r) via sparse structure from motion plays an important role.
To assess the validity of the motion vector estimation
results, it is possible to compare the measured motion with
the perceived motion between the subsequent images. For
example, for the Cones sequence, the main motion is a
horizontal movement, which is correctly expressed by the
estimated translation vectors: t_l = [0.0800, 0.3151, 0.0988]
and t_r = [0.1101, 0.3131, 0.1255]. Ideally, both vectors
should be identical (as both cameras follow an identical
motion pattern), which gives an idea of the errors on the
motion estimation process.
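Since both vectors should ideally coincide, their difference provides a simple sanity check on the motion estimation. A quick calculation (our own, using the values above) quantifies this discrepancy:

```python
import math

# The two translation vectors reported above for the Cones sequence.
t_l = [0.0800, 0.3151, 0.0988]
t_r = [0.1101, 0.3131, 0.1255]

def vec_norm(v):
    """Euclidean norm of a 3-vector."""
    return math.sqrt(sum(x * x for x in v))

# Absolute discrepancy between the two estimates...
diff = [a - b for a, b in zip(t_l, t_r)]
abs_err = vec_norm(diff)               # ~0.040
# ...and relative to the magnitude of the left translation.
rel_err = abs_err / vec_norm(t_l)      # ~12%
```

The discrepancy of roughly 12% of the translation magnitude gives a concrete feel for the error level in the motion estimation process.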
Parameter-tuning is a process which affects many modern
reconstruction algorithms, as the parameter selection makes
comparison and application of the algorithms in real
situations difficult. For the proposed approach, one
parameter is of major importance: the parameter deciding
on the balance between the data and the regularisation term.
In our experiments, a value of 0.5 was chosen, based on
previous analysis [25]. A remaining parameter of lesser
importance is the threshold for stopping the iterative
solver. This parameter is somewhat sequence-dependent,
with typical values between 10 and 20. With
regard to the benchmark algorithm by Huguet and
Devernay, all parameters were chosen as provided by the
authors in their original implementation.
Fig. 9 shows a qualitative comparison of the reconstruction
results of both methods. The quantitative evaluation is done
by computing the root-mean-square (RMS) error on the
depth map, measured in pixels, as presented in Table 1. The
proposed AL-based binocular structure-from-motion
(ALBDSFM) approach has the convenient property of
decreasing the residual of the objective function
dramatically in the first iteration, whereas convergence
slows down in subsequent iterations. For this reason, we
also included the results after one iteration in the tables.
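For reference, the RMS metric reported in Table 1 can be written as the following minimal sketch (our own formulation of the standard definition):

```python
import math

# RMS error between an estimated depth map and the ground truth, in the
# same pixel units as the maps themselves (names are our own).

def rms_error(estimated, ground_truth):
    """RMS error between two equally sized depth maps (lists of rows)."""
    total, count = 0.0, 0
    for row_est, row_gt in zip(estimated, ground_truth):
        for e, g in zip(row_est, row_gt):
            total += (e - g) ** 2
            count += 1
    return math.sqrt(total / count)
```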
As can be noted from Fig. 9 and Table 1, the proposed
ALBDSFM algorithm performs better on both the Cones
and Teddy sequences. On the Cones sequence, the
ALBDSFM approach is better capable of representing the
structure of the lattice in the back, whereas this structure is
completely smoothed by the SceneFlow algorithm. On the
Teddy sequence, both reconstruction results are visually
quite similar. It is clear that both reconstruction techniques
suffer from over-segmentation. This is a typical problem of
the Nagel–Enkelmann regularisation we used and
can partly be remedied by fine-tuning the regularisation
parameters; however, to keep the comparison honest, we
did not perform such sequence-specific parameter-tuning.
The quantitative analysis on the Teddy sequence in Table 1
shows an advantage for the ALBDSFM approach.
Table 2 gives an overview of the total processing times
required for both algorithms. It must be noted that all
experiments were performed on an Intel Core i5 central
processing unit (CPU) of 1.6 GHz.

Table 1 RMS error in pixels on the different sequences using both methods

RMS error                      Cones    Teddy
ALBDSFM (1 iteration)          2.411    5.002
ALBDSFM (convergence)          2.381    4.961
variational scene flow [19]    9.636    8.650

Table 2 Total processing time in minutes on the different sequences using both methods

Processing time (min)          Cones    Teddy
ALBDSFM (1 iteration)          24       24
ALBDSFM (convergence)          74       205
variational scene flow [19]    257      243

The SceneFlow
algorithm is a C++ application (available at http://devernay.
free.fr/vision/varsceneflow), whereas the ALBDSFM is
implemented in MATLAB. While neither of the algorithms
can be called fast, it is clear that the ALBDSFM approach
is much faster than the SceneFlow implementation. The
processing time is mostly dependent on the computational
cost of a single iteration (within the while-loop of
Algorithm 1) and the number of iterations, as the
initialisation and stereo computation steps only take a few
seconds. As the iteration step consists of a double
optimisation step using Brent's method, its
computational complexity is of the order of O(2n^2), with n
the number of image pixels. When high-resolution images
are used, the computational cost quickly rises, which
explains the relatively large processing times. However,
neither of these algorithm implementations makes use of
multi-threading or graphics processing unit (GPU)
optimisations, so large speed gains could be obtained by
applying such optimisations.
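To illustrate the derivative-free character of this inner solver, the following sketch implements golden-section search, the bracketing half of Brent's method [33]. The full method adds parabolic interpolation for speed; this is our own simplified stand-in, not the MATLAB implementation used here:

```python
import math

# Derivative-free 1-D minimisation in the spirit of Brent's method.
# Only function evaluations are needed, which is what makes such
# solvers attractive for the per-pixel optimisation step.

def golden_section_minimise(f, a, b, tol=1e-6):
    """Return x in [a, b] approximately minimising a unimodal f."""
    inv_phi = (math.sqrt(5) - 1) / 2       # 1/golden ratio ~ 0.618
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c                    # keep left sub-interval
            c = b - inv_phi * (b - a)
        else:
            a, c = c, d                    # keep right sub-interval
            d = a + inv_phi * (b - a)
    return (a + b) / 2
```

Each iteration shrinks the bracket by a constant factor, so the cost is a fixed number of function evaluations per unknown; applied once per pixel in a double optimisation step, this gives the O(2n^2)-type behaviour discussed above.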

4 Conclusions

The combination of spatial and temporal visual information
makes it possible to achieve high-quality dense depth
reconstruction, but comes at the cost of a high
computational complexity. To this extent, we presented a
novel solution to integrate the stereo and motion depth cues,
by simultaneously optimising the left and right proximity
fields using the AL. The main advantage of our algorithm
is the ability to exploit all the available constraints in one
minimisation framework. Another advantage is that the
framework is able to incorporate any given stereo
reconstruction methodology. The algorithm has been
implemented and applied on real imagery as well as
benchmarks. A comparison of the proposed method to the
variational scene-flow method shows that the quality of the
obtained results far exceeds the quality of the results using
the traditional method. The added quality with respect to
normal stereo comes with the penalty of an increased processing
time, which is still considerable. Even though future
optimisation of the implementation, by considering, for
example, a GPU implementation, will certainly further
reduce the processing time, we consider that the proposed
approach can already be used effectively at this moment in
an off-line production environment, where the proposed 3D
reconstruction methodology presents an excellent
reconstruction tool allowing high-quality 3D reconstruction
from binocular video.

Acknowledgment

The research leading to these results has received funding
from the European Union Seventh Framework Programme
(FP7/2007-2013) under grant agreement number 285417.

References

1 Strecha, C., Van Gool, L.J.: 'Motion–stereo integration for depth estimation'. ECCV, 2002, no. 2, pp. 170–185
2 Worby, J.A.: 'Multi-resolution graph cuts for stereo-motion estimation'. Master's thesis, University of Toronto, 2007
3 Richards, W.: 'Structure from stereo and motion', J. Opt. Soc. Am., 1985, 2, pp. 343–349
4 Waxman, A., Duncan, J.: 'Binocular image flows: steps towards stereo-motion fusion', IEEE Trans. Pattern Anal. Mach. Intell., 1986, 8, (6), pp. 715–729
5 Li, L., Duncan, J.: '3-D translational motion and structure from binocular image flows', IEEE Trans. Pattern Anal. Mach. Intell., 1993, 15, (7), pp. 657–667
6 Zhang, Z., Faugeras, O.D.: 'Three-dimensional motion computation and object segmentation in a long sequence of stereo frames', Int. J. Comput. Vis., 1992, 7, (3), pp. 211–241
7 Hanna, K.J., Okamoto, N.E.: 'Combining stereo and motion analysis for direct estimation of scene structure'. ICCV, 1993, pp. 357–365
8 Malassiotis, S., Strintzis, M.G.: 'Model-based joint motion and structure estimation from stereo images', Comput. Vis. Image Underst., 1997, 65, (1), pp. 79–94
9 Kutulakos, K.N., Seitz, S.M.: 'A theory of shape by space carving', Int. J. Comput. Vis., 2000, 38, (3), pp. 199–218
10 Neumann, J., Aloimonos, Y.: 'Spatio-temporal stereo using multiresolution subdivision surfaces', Int. J. Comput. Vis., 2002, 47, (1–3), pp. 181–193
11 Gong, M.: 'Enforcing temporal consistency in real-time stereo estimation'. ECCV, 2006, pp. 564–577
12 Isard, M., MacCormick, J.: 'Dense motion and disparity estimation via loopy belief propagation'. ACCV, 2006, pp. 32–41
13 Sudhir, G., Banerjee, S., Biswas, K.K., Bahl, R.: 'Cooperative integration of stereopsis and optic flow computation', J. Opt. Soc. Am. A, 1995, 12, (12), pp. 2564–2572
14 Larsen, E.S., Mordohai, P., Pollefeys, M., Fuchs, H.: 'Temporally consistent reconstruction from multiple video streams'. ICCV, 2007, pp. 1–8
15 Strecha, C., Van Gool, L.: 'PDE-based multi-view depth estimation'. First Int. Symp. 3D Data Processing Visualization and Transmission (3DPVT'02), 2002, vol. 416
16 Proesmans, M., van Gool, L., Pauwels, E., Oosterlinck, A.: 'Determination of optical flow and its discontinuities using non-linear diffusion'. ECCV, 1994, pp. 295–304
17 Zhang, Y., Kambhamettu, C.: 'On 3-D scene flow and structure recovery from multiview image sequences', Syst. Man Cybern. B, 2003, 33, (4), pp. 592–606
18 Pons, J.P., Keriven, R., Faugeras, O.: 'Modelling dynamic scenes by registering multiview image sequences'. Int. Conf. Computer Vision and Pattern Recognition, 2005, vol. 2, pp. 822–827
19 Huguet, F., Devernay, F.: 'A variational method for scene flow estimation from stereo sequences'. ICCV, 2007, pp. 1–7
20 Sizintsev, M., Wildes, R.: 'Spatiotemporal stereo and scene flow via stequel matching', IEEE Trans. Pattern Anal. Mach. Intell., 2012, 34, (6), pp. 1206–1219
21 Valgaerts, L., Bruhn, A., Zimmer, H., Weickert, J., Stoll, C., Theobalt, C.: 'Joint estimation of motion, structure and geometry from stereo sequences'. ECCV, 2010
22 De Cubber, G.: 'Variational methods for dense depth reconstruction from monocular and binocular sequences'. PhD thesis, Vrije Universiteit Brussel, March 2010
23 Del Bue, A., Xavier, J., Agapito, L., Paladini, M.: 'Bilinear modeling via augmented Lagrange multipliers (BALM)', IEEE Trans. Pattern Anal. Mach. Intell., 2012, 34, (8), pp. 1496–1508
24 Nocedal, J., Wright, S.J.: 'Numerical optimization', Springer Series in Operations Research (Springer, 1999, 2nd edn.)
25 De Cubber, G., Sahli, H.: 'Partial differential equation-based dense 3D structure and motion estimation from monocular image sequences', IET Comput. Vis., 2012, 6, (3), pp. 174–185
26 Fua, P.: 'Combining stereo and monocular information to compute dense depth maps that preserve depth discontinuities'. 12th Int. Joint Conf. Artificial Intelligence, 1991, pp. 1292–1298
27 Murray, D., Little, J.J.: 'Using real-time stereo vision for mobile robot navigation', Auton. Robots, 2000, 8, (2), pp. 161–171
28 Bertsekas, D.P.: 'Constrained optimization and Lagrange multiplier methods' (Athena Scientific, 1996)
29 Powell, M.J.D.: 'A method of nonlinear constraints in minimization problems', in 'Optimization' (Academic Press, London, 1969)
30 Hestenes, M.R.: 'Multiplier and gradient methods', J. Optim. Theory Appl., 1969, 4, pp. 303–320
31 Nagel, H., Enkelmann, W.: 'An investigation of smoothness constraints for the estimation of displacement vector fields from image sequences', IEEE Trans. Pattern Anal. Mach. Intell., 1986, 8, (5), pp. 565–593
32 Keys, R.: 'Cubic convolution interpolation for digital image processing', IEEE Trans. Acoust. Speech Signal Process., 1981, 29, (6), pp. 1153–1160
33 Brent, R.P.: 'Algorithms for minimization without derivatives' (Prentice-Hall, Englewood Cliffs, NJ, 1973)
34 Forsythe, G.E., Malcolm, M.A., Moler, C.B.: 'Computer methods for mathematical computations' (Prentice-Hall, 1976)
35 Torr, P.H.S.: 'Bayesian model estimation and selection for epipolar geometry and generic manifold fitting', Int. J. Comput. Vis., 2002, 50, (1), pp. 35–61
36 Scharstein, D., Szeliski, R.: 'A taxonomy and evaluation of dense two-frame stereo correspondence algorithms', Int. J. Comput. Vis., 2002, 47, (1–3), pp. 7–42
37 Scharstein, D., Szeliski, R.: 'High-accuracy stereo depth maps using structured light'. IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR 2003), Madison, WI, USA, 2003, vol. 1, pp. 195–202
38 Felzenszwalb, P.F., Huttenlocher, D.P.: 'Efficient belief propagation for early vision', Int. J. Comput. Vis., 2006, 70, (1), pp. 1–26
