Published in IET Computer Vision
Received on 24th September 2012
Revised on 7th June 2013
Accepted on 16th July 2013
doi: 10.1049/iet-cvi.2013.0017
ISSN 1751-9632
1 Electronics and Information Processing (ETRO), Vrije Universiteit Brussel, Brussels 1040, Belgium
2 Mechanical Engineering, Royal Military Academy of Belgium, Brussels 1000, Belgium
3 Interuniversity Microelectronics Centre IMEC, Heverlee 3001, Belgium
E-mail: geert.de.cubber@rma.ac.be
Abstract: In this study, the authors propose a framework for stereo-motion integration for dense depth estimation. They formulate the stereo-motion depth reconstruction problem as a constrained minimisation one. A sequential unconstrained minimisation technique, namely the augmented Lagrange multiplier (ALM) method, has been implemented to address the resulting constrained optimisation problem. ALM has been chosen because of its relative insensitivity to whether the initial design points for a pseudo-objective function are feasible or not. The development of the method and results from solving the stereo-motion integration problem are presented. Although the authors' work is not the only one adopting the ALM framework in the computer vision context, to their knowledge the presented algorithm is the first to use this mathematical framework in a context of stereo-motion integration. This study describes how the stereo-motion integration problem was cast in a mathematical context and solved using the presented ALM method. Results on benchmark and real visual input data show the validity of the approach.
1 Introduction

1.1 Problem statement
The integration of the stereo and motion depth cues offers the
potential of a superior depth reconstruction, as the
combination of temporal and spatial information makes it
possible to reduce the uncertainty in the depth
reconstruction result and to augment its precision. However,
this requires the development of a data fusion methodology,
which is able to combine the advantages of each method,
without propagating errors induced by one of the depth
reconstruction cues. Therefore the mathematical formulation
of the problem of combining stereo and motion information
must be carefully considered.
The dense depth reconstruction problem can be cast as a variational problem, as advocated by a number of researchers [1, 2]. The main problem in dense stereo-motion reconstruction is that the solution depends on the simultaneous evaluation of multiple constraints, which have to be balanced carefully. This is sketched in Fig. 1, which shows the different constraints to be imposed for a sequence acquired with a moving binocular camera. Consider a pair of rectified stereo images I_1^l, I_1^r at time t = t_0 and a stereo pair I_2^l, I_2^r at time t = t_0 + t_k, with t_k being determined by the frame rate of the camera. A point x_1^l in the reference frame I_1^l can be related to a point x_1^r via the stereo constraint, as well as to a point x_2^l via the motion
1.2 State-of-the-art

1.3 Related work

2 Methodology

2.1
A traditional solving technique for constrained optimisation problems such as the one posed by (1) is the Lagrangian multiplier method, which converts a constrained minimisation problem into an unconstrained minimisation problem of a Lagrange function. In theory, the Lagrangian methodology can be used to solve the stereo-motion reconstruction problem; however, to improve the convergence characteristics of the optimisation scheme, it is better [28] to use the augmented Lagrangian (AL) L(x, \lambda), with \lambda the Lagrangian multiplier. The AL, which was presented by Powell and Hestenes in [29, 30], adds a quadratic penalty term to the original Lagrangian

L(x, \lambda) = E(x) + \sum_{i=1}^{n} \lambda_i u_i(x) + \frac{r}{2} \sum_{i=1}^{n} u_i(x)^2    (2)
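To make the scheme of (2) concrete, the following is a minimal, illustrative sketch (not the paper's implementation) of the ALM iteration on a hypothetical toy problem: minimise E(x) = (x0 - 1)^2 + (x1 - 2)^2 subject to u(x) = x0 + x1 - 1 = 0. All function and variable names are chosen for the example only; the inner minimisation uses crude gradient descent as a stand-in for a proper solver.

```python
# Illustrative sketch of the augmented Lagrangian method of (2) on a toy
# problem; the inner solver and all names are hypothetical stand-ins.
# Minimise E(x) = (x0 - 1)^2 + (x1 - 2)^2  subject to  u(x) = x0 + x1 - 1 = 0.

def E(x):
    return (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2

def u(x):
    return x[0] + x[1] - 1.0

def augmented_lagrangian(x, lam, r):
    # L(x, lambda) = E(x) + lambda * u(x) + (r/2) * u(x)^2, cf. (2)
    return E(x) + lam * u(x) + 0.5 * r * u(x) ** 2

def minimise_L(lam, r, steps=2000, eta=0.01):
    # crude gradient descent on the unconstrained pseudo-objective
    x = [0.0, 0.0]
    for _ in range(steps):
        g = lam + r * u(x)  # derivative of the multiplier/penalty terms
        grad = [2.0 * (x[0] - 1.0) + g, 2.0 * (x[1] - 2.0) + g]
        x = [x[0] - eta * grad[0], x[1] - eta * grad[1]]
    return x

lam, r = 0.0, 10.0
for _ in range(20):          # outer ALM iterations
    x = minimise_L(lam, r)
    lam = lam + r * u(x)     # multiplier update

print(x, u(x), lam)          # converges to the constrained minimum (0, 1)
```

Note how the multiplier update drives the constraint residual u(x) to zero without requiring a feasible starting point, which is the property the authors cite as the reason for choosing ALM.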
L^l(d^l, \lambda^l_{lc}, \lambda^l_{lr}) = E^l(d^l) + \lambda^l_{lc} u^l_{lc}(d^l, d^c) + \frac{r}{2} u^l_{lc}(d^l, d^c)^2 + \lambda^l_{lr} u^l_{lr}(d^l, d^r) + \frac{r}{2} u^l_{lr}(d^l, d^r)^2    (3)
For the right image, we have in a similar fashion

L^r(d^r, \lambda^r_{rc}, \lambda^r_{rl}) = E^r(d^r) + \lambda^r_{rc} u^r_{rc}(d^r, d^c) + \frac{r}{2} u^r_{rc}(d^r, d^c)^2 + \lambda^r_{rl} u^r_{rl}(d^r, d^l) + \frac{r}{2} u^r_{rl}(d^r, d^l)^2    (4)
The energy functions in (3) and (4) express the relationship between structure and motion between successive images. It has to be noted that the approach for solving the reconstruction problem is, in principle, not tied to the formulation of the dense structure-from-motion problem, so any formulation can be chosen. Here, we use the dense structure-from-motion approach presented originally by De Cubber in [22], which formulates dense structure from motion as minimising the following energy functional [25]

E = f_{data} + \mu f_{regularisation}    (5)
The induced image flow is expressed as \dot{x} = Q_v v, with v = (v_x, v_y, v_z)^T the translational velocity, f the focal length and b the stereo baseline    (6)-(9)

E^r(d^r) = f^r_{data}(d^r) + \mu f^r_{regularisation}(d^r)    (10)
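The energy functional of (5), a weighted sum of a data term and a regularisation term, can be illustrated on a 1-D signal. The quadratic terms below are simplified stand-ins for the paper's functionals, chosen only to show the role of the balance parameter \mu.

```python
# Minimal sketch of a data + regularisation energy of the form
# E = f_data + mu * f_regularisation (cf. (5)), on a 1-D proximity signal.
# The quadratic data and smoothness terms are simplified stand-ins, not
# the paper's exact functionals.

def energy(d, observed, mu):
    # data term: agreement with the observed (e.g. stereo-initialised) values
    f_data = sum((di - oi) ** 2 for di, oi in zip(d, observed))
    # regularisation term: penalise differences between neighbouring pixels
    f_reg = sum((d[i + 1] - d[i]) ** 2 for i in range(len(d) - 1))
    return f_data + mu * f_reg

observed = [0.2, 0.8, 0.3, 0.9]
# evaluating at the observed signal itself: the data term vanishes and
# only the (weighted) smoothness penalty remains
print(energy(observed, observed, 0.5))
```

Raising \mu favours smoother proximity maps at the cost of fidelity to the data, which is the trade-off discussed in the parameter-selection experiments later in the paper.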
procedure is proposed. Following this methodology, the current estimate of the proximity map d^l is filled in (11). As such, the warping process is integrated in the optimisation scheme and will gradually improve over time. Finally, u^l_{lc}(d^l, d^c) is given by

u^l_{lc}(d^l, d^c) = \left[ d^l(x) - d^c\big(x + c(x, d^l(x), 0, t_{st}/2)\big) \right]^2    (13)

The second constraint, u^l_{lr}(d^l, d^r), on the left proximity map can be obtained in the same way

u^l_{lr}(d^l, d^r) = \left[ d^l(x) - d^r\big(x + c(x, d^l(x), 0, t_{st})\big) \right]^2    (19)

Note that, in this case, we use the translation over the whole baseline t_{st} = (b, 0, 0)^T for warping the right proximity map to the left proximity map. The constraints on the right proximity map are as follows

u^r_{rc}(d^r, d^c) = \left[ d^r(x) - d^c\big(x + c(x, d^r(x), 0, t_{st}/2)\big) \right]^2    (14)

u^r_{rl}(d^r, d^l) = \left[ d^r(x) - d^l\big(x + c(x, d^r(x), 0, t_{st})\big) \right]^2    (15)

Numerical implementation

The discretised augmented Lagrangian for the left proximity map reads

L_1^l(i, j)^k = E^l(i, j)^k + \lambda^l_{lc}(i, j)^k u^l_{lc}(i, j)^k + \frac{r}{2} \left[ u^l_{lc}(i, j)^k \right]^2 + \lambda^l_{lr}(i, j)^k u^l_{lr}(i, j)^k + \frac{r}{2} \left[ u^l_{lr}(i, j)^k \right]^2    (16)

\phi^l_{lc}(i, j)^k = \left[ d^l(i, j)^k - \phi_I\big(d^c, i - \phi^b(d^l(i, j)^k)/2, j\big) \right]^2    (17)

\phi^l_{lr}(i, j)^k = \left[ d^l(i, j)^k - \phi_I\big(d^r, i - \phi^b(d^l(i, j)^k), j\big) \right]^2    (18)

The gradient of the augmented Lagrangian is given by

\nabla_x L(x, \lambda) = \nabla E(x) + \sum_{i=1}^{n} \lambda_i \nabla u_i(x) + r \sum_{i=1}^{n} u_i(x) \nabla u_i(x)    (12)

and the Lagrange multipliers are updated according to

\lambda_i = \lambda_i + r u_i(x)    (20)

which in discretised form becomes

\lambda^l_{lc}(i, j)^{k+1} = \lambda^l_{lc}(i, j)^k + r u^l_{lc}(i, j)^k
\lambda^l_{lr}(i, j)^{k+1} = \lambda^l_{lr}(i, j)^k + r u^l_{lr}(i, j)^k
\lambda^r_{rc}(i, j)^{k+1} = \lambda^r_{rc}(i, j)^k + r u^r_{rc}(i, j)^k
\lambda^r_{rl}(i, j)^{k+1} = \lambda^r_{rl}(i, j)^k + r u^r_{rl}(i, j)^k    (21)
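The consistency constraints and the multiplier update of (20)-(21) can be sketched on a single image row. The integer-pixel warp and the per-pixel shift values below are hypothetical simplifications of the paper's warping function c(.); the point is only the shape of the computation: a squared warped-difference residual per pixel, followed by the multiplier update.

```python
# Hedged sketch: a left/right consistency constraint of the form
# u(d_l, d_r) = (d_l(x) - d_r(x + shift))^2 evaluated on one image row,
# followed by the multiplier update lambda <- lambda + r * u (cf. (20)-(21)).
# The integer-pixel warp is a simplification of the paper's warping c(.).

def constraint_row(d_left, d_right, shifts):
    """Squared consistency residual per pixel; out-of-image warps get 0."""
    u = []
    for x, s in enumerate(shifts):
        xw = x + s                        # warped coordinate in the right map
        if 0 <= xw < len(d_right):
            u.append((d_left[x] - d_right[xw]) ** 2)
        else:
            u.append(0.0)                 # no constraint outside the image
    return u

d_left = [0.5, 0.6, 0.7, 0.8]             # hypothetical proximity values
d_right = [0.6, 0.7, 0.8, 0.9]
shifts = [1, 1, 1, 1]                     # hypothetical per-pixel warp

u = constraint_row(d_left, d_right, shifts)
r = 10.0
lam = [0.0] * len(u)
lam = [l + r * ui for l, ui in zip(lam, u)]   # multiplier update, cf. (21)
print(u, lam)
```

In the full algorithm this update is applied per pixel (i, j) and per constraint, so each pixel carries its own set of multipliers, as the four update rules in (21) make explicit.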
d^l_{initial}(x, y) = d_{st}\big(x - \frac{b}{2} \phi_{d_{st}}(x, y), y\big)

d^r_{initial}(x, y) = d_{st}\big(x + \frac{b}{2} \phi_{d_{st}}(x, y), y\big)    (22)
The application of Brent's optimisation method also requires that the minimum and maximum boundaries where the solution is to be found be known. In our case, this means that minimum and maximum proximity values must be available for each pixel of the left and right images. These minimum and maximum proximity maps are calculated based on the 3\sigma error interval of the initial value of the proximity maps

d^i_{min} = d^i_{initial} - 3\sigma\big(d^i_{initial}\big)

d^i_{max} = d^i_{initial} + 3\sigma\big(d^i_{initial}\big)    (23)

where d^l_{initial} and d^r_{initial} are calculated according to (22).
For the right proximity map, a set of similar expressions can be found, starting from the AL

L_1^r(i, j)^k = E^r(i, j)^k + \lambda^r_{rc}(i, j)^k u^r_{rc}(i, j)^k + \frac{r}{2} \left[ u^r_{rc}(i, j)^k \right]^2 + \lambda^r_{rl}(i, j)^k u^r_{rl}(i, j)^k + \frac{r}{2} \left[ u^r_{rl}(i, j)^k \right]^2    (24)

\phi^r_{rc}(i, j)^k = \left[ d^r(i, j)^k - \phi_I\big(d^c, i + \phi^b(d^r(i, j)^k)/2, j\big) \right]^2    (25)

\phi^r_{rl}(i, j)^k = \left[ d^r(i, j)^k - \phi_I\big(d^l, i + \phi^b(d^r(i, j)^k), j\big) \right]^2    (26)
two functions that are optimised at the same time: one using L_1^l(i, j)^k, which optimises the left proximity map d^l, and one using L_1^r(i, j)^k, which optimises the right proximity map d^r. In the proposed algorithm, these functions are optimised alternately, hereby always using the latest result for both proximity maps.
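The alternating scheme described above can be sketched as follows. The single-map `refine` step is a hypothetical placeholder for the per-pixel bounded minimisation; the structure of interest is the loop, in which each map is refined using the latest estimate of the other.

```python
# Sketch of the alternating optimisation: the left and right proximity
# maps are refined in turn, each step using the latest estimate of the
# other map. The "refine" step is a hypothetical placeholder for the
# paper's per-pixel bounded minimisation.

def refine(own, other, weight=0.5):
    # placeholder update: pull each pixel toward the other map's estimate
    return [(1 - weight) * a + weight * b for a, b in zip(own, other)]

d_l = [0.2, 0.4, 0.6]      # hypothetical left proximity row
d_r = [0.3, 0.5, 0.7]      # hypothetical right proximity row

for _ in range(10):            # outer iterations
    d_l = refine(d_l, d_r)     # optimise left map with latest right map
    d_r = refine(d_r, d_l)     # optimise right map with latest left map

print(d_l, d_r)                # the two maps converge toward agreement
```

Always feeding the freshest estimate of the other map into each half-step is what couples the two minimisations, so the consistency constraints tighten as the iterations progress.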
An aspect which is not depicted by Algorithm 1 is the choice of the optimal frame rate. The underlying structure-from-motion algorithm uses the geometric robust information criterion scoring scheme introduced by Torr in [35] to assess the optimal frame rate. As an effect, if the camera does not move (no translation and no rotation) between two consecutive time instants, no reconstruction will be performed.
Fig. 4
Fig. 5 Proximity maps for different frames of the desk sequence using the global optimisation algorithm
a Frame 1, left proximity d^l_1
b Frame 1, right proximity d^r_1
c Frame 10, left proximity d^l_10
d Frame 10, right proximity d^r_10
Fig. 6 Proximity maps for different frames of the desk sequence using the AL algorithm
a Frame 1, left proximity d^l_1
b Frame 1, right proximity d^r_1
c Frame 10, left proximity d^l_10
d Frame 10, right proximity d^r_10
Fig. 8 Quantitative analysis: input images and ground truth depth maps
Top row: left input image and bottom row: ground truth left depth image. Left column: Cones sequence and right column: Teddy sequence
a Cones sequence, left image at t0
b Teddy sequence, left image at t0
c Cones sequence, ground truth depth image at t0
d Teddy sequence, ground truth depth image at t0
Fig. 9 Comparison of the reconstruction result using the traditional variational scene-flow method [19] and the proposed method
Top row: reconstructed left depth image using [19] and bottom row: reconstructed left depth image using the proposed method. Left column: Cones sequence and right column: Teddy sequence
a Cones sequence, depth image at t0 using [19]
b Teddy sequence, depth image at t0 using [19]
c Cones sequence, depth image at t0 using the proposed method
d Teddy sequence, depth image at t0 using the proposed method
Scharstein and Szeliski [36, 37], shown on the top row of
Fig. 8.
As a baseline algorithm, the variational scene-flow reconstruction approach presented by Huguet and Devernay [19] was chosen, as the authors provided the algorithm online, which makes it possible to perform comparison tests. To be able to supply a correct comparison of the stereo-motion reconstruction capabilities of both algorithms, the same base stereo algorithm [38] was used to initialise both methods.
The results of any reconstruction algorithm depend largely on the correct initialisation of the algorithm and the selection of the parameters. In the initialisation phase of the proposed method, the estimation of the motion vectors (t^l, \omega^l) and (t^r, \omega^r) via sparse structure from motion plays an important role. To assess the validity of the motion vector estimation results, it is possible to compare the measured motion with the perceived motion between the subsequent images. For example, for the Cones sequence, the main motion is a horizontal movement, which is correctly expressed by the estimated translation vectors: t^l = [0.0800, 0.3151, 0.0988] and t^r = [0.1101, 0.3131, 0.1255]. Ideally, both vectors should be identical (as both cameras follow an identical motion pattern), so the difference between them gives an idea of the errors in the motion estimation process.
Parameter-tuning is a process which affects many modern reconstruction algorithms, as the parameter selection makes comparison and application of the algorithms in real situations difficult. For the proposed approach, one parameter is of major importance: the parameter \mu deciding on the balance between the data and the regularisation term. In our experiments, a value of \mu = 0.5 was chosen, based on previous [25] analysis. A remaining parameter of lesser importance is the threshold for stopping the iterative solver. This parameter is somewhat sequence-dependent, with typical values somewhere between 10 and 20. With regard to the benchmark algorithm by Huguet and Devernay, all parameters were chosen as provided by the authors in their original implementation.
Fig. 9 shows a qualitative comparison of the reconstruction results of both methods. The quantitative evaluation is done by computing the root-mean-square (RMS) error on the depth map measured in pixels, as presented in Table 1. The proposed AL-based binocular structure-from-motion (ALBDSFM) approach has the convenient property of decreasing the residual on the objective function dramatically in the first iteration, whereas convergence slows down in subsequent iterations. For this reason, we also included the results after one iteration in the tables.
As can be noted from Fig. 9 and Table 1, the proposed ALBDSFM algorithm performs better on both the Cones and Teddy sequences. On the Cones sequence, the ALBDSFM approach is better capable of representing the structure of the lattice in the back, whereas this structure is completely smoothed by the SceneFlow algorithm. On the Teddy sequence, both reconstruction results are visually quite similar. It is clear that both reconstruction techniques suffer from over-segmentation. This is a typical problem of the Nagel-Enkelmann regularisation we used and can partly be remedied by fine-tuning the regularisation parameters; however, to keep the comparison honest, we did not perform such sequence-specific parameter-tuning. The quantitative analysis on the Teddy sequence in Table 1 shows an advantage for the ALBDSFM approach.
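The quantitative measure behind Table 1 can be sketched as follows: the RMS error between a reconstructed depth map and the ground truth, measured in pixels. The maps are given as flat lists of hypothetical values for brevity.

```python
# Sketch of the evaluation metric used above: the root-mean-square (RMS)
# error between a reconstructed depth map and the ground truth, in pixels.
# Maps are represented as flat lists of hypothetical values.

def rms_error(reconstructed, ground_truth):
    n = len(reconstructed)
    return (sum((r - g) ** 2 for r, g in zip(reconstructed, ground_truth)) / n) ** 0.5

gt = [10.0, 12.0, 14.0, 16.0]      # hypothetical ground-truth depths
rec = [11.0, 12.0, 13.0, 16.0]     # hypothetical reconstruction
print(rms_error(rec, gt))          # sqrt((1 + 0 + 1 + 0) / 4)
```

Because the error is averaged over all pixels before taking the square root, a few badly reconstructed regions (such as the smoothed lattice in the Cones sequence) can dominate the score even when most of the map is accurate.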
Table 2 gives an overview of the total processing times required for both algorithms. It must be noted that all

Table 1 RMS error on the depth maps (pixels)

                                Cones    Teddy
ALBDSFM (1 iteration)           2.411    5.002
ALBDSFM (convergence)           2.381    4.961
variational scene flow [19]     9.636    8.650

Table 2 Total processing times for both algorithms

                                Cones    Teddy
ALBDSFM (1 iteration)           24       24
ALBDSFM (convergence)           74       205
variational scene flow [19]     257      243

© The Institution of Engineering and Technology 2014
Conclusions
reconstruction methodology presents an excellent reconstruction tool allowing high-quality 3D reconstruction from binocular video.
Acknowledgment
References
1 Strecha, C., Van Gool, L.J.: Motion stereo integration for depth estimation. ECCV, 2002, no. 2, pp. 170-185
2 Worby, J.A.: Multi-resolution graph cuts for stereo-motion estimation. Master's thesis, University of Toronto, 2007
3 Richards, W.: Structure from stereo and motion, J. Opt. Soc. Am., 1985, 2, pp. 343-349
4 Waxman, A., Duncan, J.: Binocular image flows: steps towards stereo-motion fusion, IEEE Trans. Pattern Anal. Mach. Intell., 1986, 8, (6), pp. 715-729
5 Li, L., Duncan, J.: 3-D translational motion and structure from binocular image flows, IEEE Trans. Pattern Anal. Mach. Intell., 1993, 15, (7), pp. 657-667
6 Zhang, Z., Faugeras, O.D.: Three-dimensional motion computation and object segmentation in a long sequence of stereo frames, Int. J. Comput. Vis., 1992, 7, (3), pp. 211-241
7 Hanna, K.J., Okamoto, N.E.: Combining stereo and motion analysis for direct estimation of scene structure. ICCV, 1993, pp. 357-365
8 Malassiotis, S., Strintzis, M.G.: Model-based joint motion and structure estimation from stereo images, Comput. Vis. Image Underst., 1997, 65, (1), pp. 79-94
9 Kutulakos, K.N., Seitz, S.M.: A theory of shape by space carving, Int. J. Comput. Vis., 2000, 38, (3), pp. 199-218
10 Neumann, J., Aloimonos, Y.: Spatio-temporal stereo using multiresolution subdivision surfaces, Int. J. Comput. Vis., 2002, 47, (1-3), pp. 181-193
11 Gong, M.: Enforcing temporal consistency in real-time stereo estimation. ECCV, 2006, pp. 564-577
12 Isard, M., MacCormick, J.: Dense motion and disparity estimation via loopy belief propagation. ACCV, 2006, pp. 32-41
13 Sudhir, G., Banerjee, S., Biswas, K.K., Bahl, R.: Cooperative integration of stereopsis and optic flow computation, J. Opt. Soc. Am. A, 1995, 12, (12), pp. 2564-2572
14 Larsen, E.S., Mordohai, P., Pollefeys, M., Fuchs, H.: Temporally consistent reconstruction from multiple video streams. ICCV, 2007, pp. 1-8
15 Strecha, C., Van Gool, L.: PDE-based multi-view depth estimation. First Int. Symp. 3D Data Processing Visualization and Transmission (3DPVT'02), 2002, vol. 416
16 Proesmans, M., van Gool, L., Pauwels, E., Oosterlinck, A.: Determination of optical flow and its discontinuities using non-linear diffusion. ECCV, 1994, pp. 295-304
17 Zhang, Y., Kambhamettu, C.: On 3-D scene flow and structure recovery from multiview image sequences, Syst. Man Cybern. B, 2003, 33, (4), pp. 592-606