
Building virtual worlds by 3D object mapping

Alessandro Moro, DEEI, University of Trieste, Italy (alessandro.moro@phd.units.it)
Enzo Mumolo, DEEI, University of Trieste, Italy (mumolo@units.it)
Massimiliano Nolich, DEEI, University of Trieste, Italy (mnolich@units.it)

ABSTRACT

In this paper, a method for automatically building 3D virtual worlds which correspond to the objects detected in a real environment is presented. The proposed method can be used in many applications, such as Virtual Reality, Augmented Reality, remote inspection and Virtual World generation. Our method requires an operator equipped with a stereo camera and moving in an office environment. The operator takes a picture of the environment and, with the proposed method, the Regions of Interest (ROIs) are extracted from each picture, their content is classified, and 3D virtual scenarios are reconstructed using icons which resemble the classified object categories. ROI extraction and the pose and height estimation of the classified objects are performed using stereo vision. The ROIs are obtained using a Dempster-Shafer technique for fusing different information detected from the image, such as the Speeded Up Robust Features (SURF) and the depth data obtained with the stereo camera. Experimental results are presented in office environments.

Keywords

Virtual Reality, Stereo Vision, Dempster-Shafer

1. INTRODUCTION

Virtual worlds are 3D environments built with digital techniques where users can interact with each other over the Internet using virtual users and virtual objects. Since their introduction, an increasing number of applications have been developed with them, including virtual computer games [24], social networking [17], augmented reality based systems [21], e-commerce [7] and so forth. The virtual environments and objects in the virtual world are typically created using suitable software tools and are generally not related to real scenes. However, building a virtual environment which represents a real environment can have important applications, for example in inspection or security frameworks. This paper is directed towards this direction, namely to build a tool for representing real environments by means of corresponding virtual worlds. To this end, we borrowed from robotic science the concept of 3D object mapping, that is, the way to build a map of the environment which describes the shape and pose of the objects located in the environment for robotic handling or avoidance [22]. Building 3D maps is a challenge in mobile robotics and a number of approaches have been proposed so far [12]; most of the 3D maps developed are composed of grid cells or geometric elements such as polygons.

We present in this paper a novel technique of object mapping using a stereo camera, whose objective is not robotic mapping but the building of 3D virtual worlds. Key points of our technique are the automatic detection of the Regions of Interest (ROIs) and object classification. Generally speaking, we have adopted the following approach: starting from the features obtained from a picture, we extract the objects in it using a novel ROI detection algorithm, which can be considered one of the contributions of this paper, and then we train statistical models of the objects using a simplified 2D-HMM for classification.

We use edges for object characterization because they are more robust than other features with respect to changes in the light conditions and because different objects belonging to the same category, seen from the same point of view and represented with edges, have a similar shape.

We consider the following scenario. In a real world, for example an office environment, there are items such as tables, chairs or persons. The image of the environment is acquired with a stereo camera from a given point of view, and the algorithm described in this paper is used for building a 3D object map which is given as input to a 3D graphics package. In this way, the 3D objects in the map are represented with corresponding graphical icons. As the pose and height of the objects are estimated from the images taken in the real world, the icons are put in the virtual world at the same pose and with the same height as estimated, so that the virtual world corresponds to the real one. We assume that the a-priori map of the environment is known and fixed, while the objects may vary. This procedure can be repeated for pictures taken from different points of view. However, in this work we do not consider the linkage between images taken from subsequent points of view, which could be used to make virtual video, leaving that to further developments.

This paper is organized as follows. In Section 2 some previous work dealing with object maps is described. In Section 3 we describe the developed approach and in Section 4 some experimental results are reported. Final remarks conclude the paper.

2. RELATED WORK
Object maps are mostly oriented towards mobile robot
navigation. Vasudevan et al. [23] use SIFT as a recognition
tool and develop a hierarchical probabilistic representation
of space that is based on objects. A global topological repre-
sentation of places with object graphs serving as local maps
is suggested.
Anguelov et al. describe a probabilistic approach for de-
tecting and modeling doors in a corridor environment [1].
They use features based on shape, color, and motion proper-
ties of door and wall objects. Modayil and Kuipers describe
in [11] an approach for object localization and recognition
based on object shape models.
Brezetz et al. [3] use range scans to segment objects and represent them in a topological framework. Limketkai et al. [8] describe relational object maps of walls and doors. Mozos et al. [14] use an object map for the interpretation of the environment, and Ranganathan et al. [18] develop object maps for giving a semantic interpretation of places.
Tomono describes in [22] 3D map building algorithms using vision data. Moro et al. [13] describe an approach for object classification based on edge features.

3. THE PROPOSED 3D OBJECT MAP BUILDING APPROACH

Figure 1: Block diagram of the proposed algorithm.

The stereo camera used in this work is a Bumblebee [6] stereo camera, a Firewire CCD camera with a resolution of 640x480 pixels at 48 fps. We consider images acquired by the stereo camera: they are combined and rectified, and for each pixel its depth is computed. All of this low-level processing is performed by the camera's internal firmware. It is important to note that we divide the image into texels of 20 x 20 pixels, as all the subsequent processing is texel based.

The algorithm described in this paper is summarized in the block diagram reported in Fig. 1. As described in Fig. 1, our 3D object map approach is performed using the following steps: basic belief assignment by processing the rectified and depth images, data fusion using the Dempster-Shafer algorithm, ROI estimation, feature extraction from the rectified image, and object classification. Using the object labels and the depth map, which gives the distance from the camera to every pixel of the image, a 3D virtual world is finally built using a 3D vectorial graphics application.
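Since every later stage operates on these texels, the subdivision is worth pinning down. The following minimal Python sketch (our illustration, not code from the paper) shows it for one rectified 640x480 frame:

```python
import numpy as np

TEXEL = 20  # the paper's 20x20 pixel texels

def texels(img):
    """Split an image into the 20x20 texels on which all later stages work."""
    h, w = img.shape[:2]
    return [img[y:y + TEXEL, x:x + TEXEL]
            for y in range(0, h - h % TEXEL, TEXEL)
            for x in range(0, w - w % TEXEL, TEXEL)]

frame = np.zeros((480, 640), dtype=np.uint8)  # one rectified 640x480 frame
print(len(texels(frame)))  # 768 texels of 20x20 pixels each
```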
In the following we summarize some results from the Dempster-Shafer theory of evidence, limited to those used in this work. Many good tutorials are available, such as [5] and [19].

3.1 The Dempster-Shafer Fusion

The goal of the Dempster-Shafer theory of evidence [20] is to represent uncertainty and lack of knowledge. The theory can combine different measures of evidence. At the base of the theory is a finite set of possible hypotheses, say $\theta = \{\theta_1, \ldots, \theta_K\}$.

In our case, a hypothesis set is defined for each texel into which the image is divided. Within each texel, the hypotheses concern the possibility that the pixel (i, j) corresponds to an object or not. In other words, we have eight hundred hypotheses for each texel, namely $\theta = \{\theta_1(0,0), \ldots, \theta_1(19,19), \theta_2(0,0), \ldots, \theta_2(19,19)\}$, where $\theta_1(i,j)$ is the belief that the pixel (i, j) of that texel belongs to an object in the environment and $\theta_2(i,j)$ is the belief that it does not.

3.1.1 Basic Belief Assignment

The Basic Belief Assignment can be viewed as a generalization of a probability density function. More precisely, a Basic Belief Assignment m(.) is a function that assigns a value in [0, 1] to every subset A of $\theta$ and satisfies the following:

$$\sum_{A \subseteq \theta} m(A) = 1, \qquad m(\emptyset) = 0$$

It is worth noting that m(A) is the belief that supports the subset A of $\theta$, not the elements of A. This reflects some ignorance, because it means that we can assign belief only to subsets of $\theta$, not to the individual hypotheses as in classical probability theory.

3.1.2 Belief function

The belief function bel(.), associated with the Basic Belief Assignment m(.), assigns a value in [0, 1] to every nonempty subset B of $\theta$. It is defined by

$$bel(B) = \sum_{A \subseteq B} m(A)$$

The belief function can be viewed as a generalization of a probability function.
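As an illustration of these two definitions, the following minimal Python sketch (our encoding, not the paper's) represents a BBA over the per-texel frame {O, notO} as a dictionary of focal elements and computes bel(.):

```python
# Minimal sketch of a basic belief assignment (BBA) over the per-texel frame
# {O, notO}: subsets are frozensets, masses sum to 1, m(emptyset) = 0.
FRAME = frozenset({"O", "notO"})

def bel(m, B):
    """bel(B) = sum of m(A) over all nonempty subsets A of B."""
    return sum(mass for A, mass in m.items() if A and A <= B)

# An expert that is 60% sure the texel shows an object and leaves the
# remaining mass on the whole frame (explicit ignorance).
m = {frozenset({"O"}): 0.6, FRAME: 0.4}
print(bel(m, frozenset({"O"})))   # 0.6 -- the ignorance mass does not count
print(bel(m, FRAME))              # 1.0
```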
3.1.3 Combination of evidence

Consider two Basic Belief Assignments $m_1(.)$ and $m_2(.)$ and the corresponding belief functions $bel_1(.)$ and $bel_2(.)$. Let $A_j$ and $B_k$ be subsets of $\theta$. Then $m_1(.)$ and $m_2(.)$ can be combined to obtain the belief mass assigned to $C \subset \theta$ according to the following formula [20]:

$$m(C) = (m_1 \oplus m_2)(C) = \frac{\sum_{j,k:\, A_j \cap B_k = C} m_1(A_j)\, m_2(B_k)}{1 - \sum_{j,k:\, A_j \cap B_k = \emptyset} m_1(A_j)\, m_2(B_k)} \qquad (1)$$

The denominator is a normalizing factor which measures how much $m_1(.)$ and $m_2(.)$ are conflicting.

3.1.4 Belief functions combination

The combination rule can be easily extended to several belief functions by repeatedly applying the rule to new belief functions. Thus the combination of n belief functions $bel_1, bel_2, \ldots, bel_n$ can be formed as

$$((bel_1 \oplus bel_2) \oplus bel_3) \oplus \ldots \oplus bel_n = \bigoplus_{i=1}^{n} bel_i$$

It is important to note that the combination formula given above assumes that the belief functions to be combined are independent.
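A compact implementation of Eq. (1), with the n-fold extension obtained by folding the pairwise rule, might look as follows; the example masses are invented for illustration only.

```python
from functools import reduce
from itertools import product

def dempster_combine(m1, m2):
    """Combine two BBAs with Dempster's rule (Eq. 1): intersect focal
    elements, then renormalize by 1 minus the conflicting mass."""
    combined, conflict = {}, 0.0
    for (A, mA), (B, mB) in product(m1.items(), m2.items()):
        C = A & B
        if C:
            combined[C] = combined.get(C, 0.0) + mA * mB
        else:
            conflict += mA * mB
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources fully contradict")
    return {C: mass / (1.0 - conflict) for C, mass in combined.items()}

def combine_all(*bbas):
    """n-fold combination of independent sources by pairwise folding."""
    return reduce(dempster_combine, bbas)

O, NOTO = frozenset({"O"}), frozenset({"notO"})
FRAME = O | NOTO
m_surf = {O: 0.5, NOTO: 0.2, FRAME: 0.3}   # invented example masses
m_depth = {O: 0.6, NOTO: 0.1, FRAME: 0.3}
print(combine_all(m_surf, m_depth))  # {O: ~0.63, notO: ~0.11, frame: ~0.09}
```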

3.2 Basic Belief Assignment for ROI estimation

As stated above, the Basic Belief Assignment is related to the definition of the relevant and available evidence that supports a claim for each of the considered hypotheses. To this end, we considered two independent experts, namely the SURF expert and the Depth expert, that support the claim for the hypothesis that each pixel of the image belongs or does not belong to an object in the real environment. In order to simplify the computation, we make the following approximation: we assume that the texels are so small that all the pixels in them share the same hypothesis. This means that we have to define the following evidences: $\{m_1(O), m_1(\bar{O})\}$ for the SURF expert, and $\{m_2(O), m_2(\bar{O})\}$ for the Depth expert, where $O$ and $\bar{O}$ mean that there is an object or not, respectively.

Let us consider the SURF expert first. It is worth recalling the Scale-Invariant Feature Transform (SIFT) algorithm ([9][10]), which has been widely used in computer vision: it detects and describes local features in images and can be used to estimate points of interest. In [2] a variant of SIFT, called SURF, was presented, which requires much less computation than SIFT. SURF is a performant scale and rotation invariant interest point detector and descriptor algorithm. We extracted points of interest in the rectified image according to SURF, and defined $m_1(O)$ for a texel as the proportion of SURF points which appear in that texel with respect to the total number of SURF points. The evidence $m_1(\bar{O})$ was set, in a first attempt, to the complement of $m_1(O)$, even if it is worth noting that a more precise assignment would use a different setting. This issue will be explored in further work; however, this simple setting gives satisfactory results, as reported in the experimental section.

Let us now consider the Depth expert. This expert uses the depth image, where the value of each pixel is the distance from the camera to the point in the real scene that corresponds to that pixel. Since the presence of an object leads to a number of pixels in the image sharing the same value, the evidence $m_2(O)$ is basically set to the number of pixels with the same value in the texel. More precisely, given the maximum distance sensed by the camera, which can correspond to a wall or, more generally, to an absence of objects, we set $m_2(O)$ to the number of pixels in the texel that have a distance value different from the maximum distance value. If all the pixels in the texel have the maximum distance value, $m_2(O)$ is set to zero.
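Under these definitions, the two experts could be sketched as follows. The normalization of the depth count by the texel area is our assumption (the text gives only the raw count), and the keypoint list is assumed to come from any SURF implementation, such as OpenCV's contrib module.

```python
import numpy as np

TEXEL = 20

def surf_expert(keypoints, shape):
    """m1(O) per texel: the fraction of all SURF interest points that fall in
    the texel; m1(notO) is its complement (the paper's first-attempt setting).
    `keypoints` are (x, y) pixel positions from any SURF implementation."""
    rows, cols = shape[0] // TEXEL, shape[1] // TEXEL
    counts = np.zeros((rows, cols))
    for x, y in keypoints:
        r = min(int(y) // TEXEL, rows - 1)
        c = min(int(x) // TEXEL, cols - 1)
        counts[r, c] += 1
    m1_O = counts / max(counts.sum(), 1.0)
    return m1_O, 1.0 - m1_O

def depth_expert(depth, max_range):
    """m2(O) per texel: count of pixels whose distance differs from the
    maximum sensed distance (wall / no object); dividing by the texel area
    to keep masses in [0, 1] is our assumption."""
    rows, cols = depth.shape[0] // TEXEL, depth.shape[1] // TEXEL
    m2_O = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            block = depth[r * TEXEL:(r + 1) * TEXEL, c * TEXEL:(c + 1) * TEXEL]
            m2_O[r, c] = np.count_nonzero(block < max_range) / block.size
    return m2_O, 1.0 - m2_O

depth = np.full((480, 640), 8.0)
depth[100:200, 300:400] = 2.5                  # a synthetic object in range
print(depth_expert(depth, max_range=8.0)[0].max())  # texels on the object -> 1.0
```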
3.3 ROI estimation, object pose and height estimation

The final evidence of the presence of an object in a texel is computed using (1). In this way, each pixel is assigned a value that states the belief that there is an object in it. A 3D view of the final belief image is reported in Fig. 2.

Figure 2: 3D view of the distance image

The detection of the Regions of Interest is made by computing the contours of this three-dimensional image and assigning a ROI to them. This is equivalent to extracting the regions with the highest value of evidence that an object exists in the environment, according to the opinion of the SURF and Depth experts.
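One plausible reading of this contour step, using OpenCV, is sketched below; the binarization threshold is an assumption, as the text does not specify how contours are extracted from the belief image.

```python
import cv2
import numpy as np

def extract_rois(belief, thresh=0.5):
    """Threshold the fused belief image and take the contours' bounding
    boxes as ROIs; the 0.5 threshold is our assumption."""
    mask = np.uint8(belief >= thresh) * 255
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]  # (x, y, w, h) per ROI

belief = np.zeros((24, 32))
belief[5:10, 15:20] = 0.9          # one strong blob of evidence
print(extract_rois(belief))        # [(15, 5, 5, 5)] in texel coordinates
```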
Regarding the pose and height estimation: since the distance is computed by the camera, the pose of an object is computed by evaluating its orientation with respect to the camera. With distance and orientation, the position of the object in the real world is easily computed. The height of the objects deserves some further observations. If the objects were entirely extracted, their height could be computed, for example, by pixel counting. However, not all the objects may be entirely contained in the corresponding ROI (for instance the chair in Fig. 4), and for those objects only the top of the object is contained in the extracted ROI. For this reason, we perform the computation of the object height as described in Fig. 3. More precisely, it is sufficient to obtain, from the camera, the distance to the top of the objects and the angle from the x-axis of the camera plane to the top of the objects, which can be easily estimated from the corresponding number of pixels in the image plane. The value H in Fig. 3 is the height of the camera from the floor of the environment, and the values HO1 and HO2 are the heights of the two objects. The values D1 and D2 are the distances from the camera to the top pixels of the objects. Then, assuming that all the objects touch the floor, their height can be estimated as shown in Fig. 3.

Figure 3: Objects height estimation method
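The extracted text does not spell out the resulting formula, but under the floor-contact assumption of Fig. 3 a natural reconstruction is HO = H + D sin(alpha), with alpha the elevation angle of the object's top pixel (negative when below the camera axis):

```python
import math

def elevation_deg(pixel_offset_y, focal_px):
    """Elevation angle of a pixel above (+) or below (-) the camera axis,
    from its vertical offset to the principal point; both quantities are in
    pixels (focal_px is a hypothetical calibration value)."""
    return math.degrees(math.atan2(pixel_offset_y, focal_px))

def object_height(H, D, alpha_deg):
    """Height of an object touching the floor: H is the camera height above
    the floor, D the sensed distance to the object's top pixel, alpha its
    elevation angle. HO = H + D*sin(alpha) is our reading of Fig. 3, not a
    formula stated in the extracted text."""
    return H + D * math.sin(math.radians(alpha_deg))

# A chair top seen 10 degrees below the axis of a camera mounted at 1.2 m,
# at a sensed distance of 2.0 m: about 0.85 m tall.
print(round(object_height(1.2, 2.0, -10.0), 2))
```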
3.4 Feature extraction, HMM training and classification

Each ROI we estimate as described in Section 3.3 is then processed as a new single image. The image corresponding to the ROI is processed with a classical Canny algorithm [4], which computes the edges of the image. We use edges for object characterization because they are more robust than other features with respect to changes in the light conditions and because different objects belonging to the same category, seen from the same point of view, have similar shapes. The resulting image is then spectrally processed with a DCT transform.

A large number of images of objects belonging to the same class is used, after block division and feature representation, to train an HMM model of that class. We experimentally show that a classification based on HMM models trained with a large number of objects shows a certain degree of abstraction of the class of an object, independent of the point of view and of partial occlusions. To describe objects in the visual scene, an HMM should be structured as a two-dimensional HMM, where a matrix of states is linked by transitions with certain probabilities. Since the states represent image blocks, such a model can capture the statistical relations among the image blocks. Two-dimensional HMMs have been described, for example, in [25], but they are computationally too complex. For this reason we used the pseudo2D-HMM [16]. The pseudo2D-HMM consists of a rectangular array of states organized as superstates, each of them consisting of states. The image is divided into blocks from which features are computed; the array of blocks is scanned from top to bottom and from left to right. The pseudo2D-HMM parameters are estimated with standard Baum-Welch re-estimation formulas. The goal of the parameter estimation is to estimate the parameters of the pseudo2D-HMM $\lambda$ that maximize $P(O|\lambda)$.

Pseudo2D-HMMs are used to obtain a model of an object from its edge image. More precisely, the image computed with the proposed feature extraction algorithm is divided into subblocks on which a DCT is applied. The greatest DCT coefficients are given as input to the pseudo2D-HMM. Due to the objects' complexity, a uniform subdivision of the image has been used, differently from the structure proposed in [15]. The super-state structure follows a vertical line. The images used for training are centered on the objects of interest and do not have textures in the background.
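A sketch of this feature extraction, under stated assumptions, is given below; we read "the greatest DCT coefficients" as the low-frequency corner of each block transform, a common choice that the paper does not confirm, and the block size is likewise an assumption.

```python
import cv2
import numpy as np

def edge_dct_features(roi_gray, block=8, keep=3):
    """Observation sequence for the pseudo2D-HMM: Canny edges, uniform block
    subdivision, 2D DCT per block, keeping the low-frequency keep x keep
    corner of each transform. Block size and coefficient selection are our
    assumptions."""
    edges = np.float32(cv2.Canny(roi_gray, 100, 200))
    rows, cols = edges.shape[0] // block, edges.shape[1] // block
    obs = []  # blocks scanned top to bottom and left to right, as in the text
    for r in range(rows):
        for c in range(cols):
            d = cv2.dct(edges[r * block:(r + 1) * block,
                              c * block:(c + 1) * block])
            obs.append(d[:keep, :keep].flatten())
    return np.array(obs)

roi = np.random.randint(0, 255, (64, 48), dtype=np.uint8)  # a toy ROI image
print(edge_dct_features(roi).shape)  # (48, 9): 8x6 blocks, 9 coefficients each
```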
4. EXPERIMENTAL RESULTS

The scene is acquired by the Bumblebee stereo camera placed at a particular point of the environment. For testing the algorithm's performance, hundreds of images were taken in our environments and, from these pictures, one hundred images, both rectified and depth, have been randomly extracted and analyzed with the described algorithm. This Section deals with the experimental results we obtained with such data.

4.1 ROI estimation

After automatic ROI extraction, the results were manually analyzed by inspecting the complex images. Analyzing 100 complex images extracted randomly from the videos, the ROI extraction accuracy is about 91.3%.

4.2 Objects classification

The object classification is performed with a pseudo2D-HMM algorithm used to compare the ROI images with previously obtained models of the considered object categories.

Preliminarily, we performed some tests to establish the superiority of the edge representation over grayscale. To this goal, we obtained HMM models with incremental training on 600 images of each class of interest. The images were acquired in the following way: ten instances of each object were obtained by rotating the object itself by 36 degrees. Furthermore, another data set of 600 images was acquired for testing. The 600 test images were manually extracted and every object is perfectly centered in its ROI image. The superiority of the edge representation over the grayscale representation is demonstrated by the following results. We considered two types of image representations as input of the pseudo2D-HMM classifier, namely the grayscale representation and the edge representation computed with the classical Canny detector. For grayscale images the average classification rate is 73%, obtained with 7 superstates and 7 states per superstate. Using Canny edge detection, the average classification rate is 92% with 8 states per superstate.

Turning to the ROIs automatically extracted using the described algorithm, we then analyzed the 100 complex images selected randomly from the initial dataset. In this case many objects are not as well centered as in the ideal case, and parts of other objects appear in the background. Here the classification accuracy was about 84%.

In Tab. 1 we report the confusion matrix for the different categories considered in this work.

         Person   Chair    Door     Table
Person   70%      10%      20%      0%
Chair    0%       90.3%    6.5%     3.2%
Door     0%       2.7%     93.6%    3.7%
Table    0%       19.6%    16.9%    63.8%

Table 1: Confusion matrix of the object classification.

4.3 3D map building: case study

In this section we report a case study describing how the proposed algorithm works for a particular picture taken in the considered environment. Fig. 4 shows a picture of an office environment.

Fig. 5 shows the depth map of the environment shown in Fig. 4. These data, which represent the distance from each pixel of the image of the environment to the camera, are used by the Depth expert to give its evidence of the presence of objects.

Figure 4: An office environment.

Figure 5: Depth map of Fig. 4.

Figure 6: SURF feature map of Fig. 4.

Figure 7: Result from Dempster-Shafer fusion technique.


In Fig. 6 the points of interest obtained by SURF are shown, and in Fig. 7 the final evidence resulting from the combination rule is shown. Each texel of the image is labeled according to the resulting evidence.

In Fig. 8 the ROIs extracted from the final belief image are shown. The ROIs are passed to the EHMM classifier and the result of the classification is written inside each ROI. Note that the table nearest to the camera is not detected with sufficient detail, and the EHMM does not classify it.

Finally, Fig. 9 shows the result from a vectorial 3D graphical package. The vectorized image can be easily embedded and modified in other frameworks. The inputs to the package are the classified objects and their pose and height in the environment.

Figure 8: ROIs estimation and object classification related to the environment depicted in Fig. 4.

Figure 9: Virtual world related to the real world depicted in Fig. 4.

5. FINAL REMARKS AND CONCLUSIONS

In this paper an approach for building a 3D virtual environment corresponding to a real scenario is described. The problem is faced through 3D object maps. The ROIs are estimated using a novel approach which fuses depth data and SURF features. Object pose and height are estimated using the depth data. Objects are modeled using edges and classified using a pseudo2D-HMM.

The algorithm works sufficiently well in the simple environment of Fig. 4. Many improvements are currently being studied, in particular concerning the Basic Belief Assignment with depth and SURF data. Also, the algorithm is currently being tested in other environments and used in remote inspection applications based on virtual worlds technology.

6. REFERENCES

[1] D. Anguelov, D. Koller, E. Parker, and S. Thrun. Detecting and modeling doors with mobile robots. In International Conference on Robotics and Automation (ICRA), pages 3777-3784, 2004.
[2] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110(3):346-359, 2008.
[3] S. Brezetz, R. Chatila, and M. Devy. Natural scene understanding for mobile robot navigation. In IEEE International Conference on Robotics and Automation (ICRA), 1994.
[4] J. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679-698, 1986.
[5] H. Choi, A. Katake, S. Choi, Y. Kang, and Y. Choe. Probabilistic combination of multiple evidence. Lecture Notes in Computer Science. Springer, 2009.
[6] Point Grey Research. Bumblebee stereo camera. http://www.ptgrey.com.
[7] M. Khoury, X. Shen, and S. Shirmohammadi. A peer-to-peer collaborative virtual environment for e-commerce. In Canadian Conference on Electrical and Computer Engineering, pages 828-831, 2007.
[8] B. Limketkai, L. Liao, and D. Fox. Relational object maps for mobile robots. In International Joint Conference on Artificial Intelligence (IJCAI), 2005.
[9] D. G. Lowe. Object recognition from local scale-invariant features. In International Conference on Computer Vision, pages 1150-1157, 1999.
[10] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[11] J. Modayil and B. Kuipers. Autonomous shape model learning for object localization and recognition. In International Conference on Robotics and Automation (ICRA), pages 2991-2996, 2006.
[12] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. FastSLAM: a factored solution to the simultaneous localization and mapping problem. In Proceedings of the 18th National Conference on Artificial Intelligence (AAAI), pages 593-598, July 2002.
[13] A. Moro, E. Mumolo, and M. Nolich. Visual scene analysis using relaxation labeling and embedded hidden Markov models for map-based robot navigation. In International Conference on Information Technology Interfaces (ITI), 2008.
[14] O. Mozos, R. Triebel, P. Jensfelt, A. Rottmann, and W. Burgard. Supervised semantic labeling of places using information extracted from sensor data. Robotics and Autonomous Systems, 55(5):391-402, 2007.
[15] A. Nefian. A Hidden Markov Model-Based Approach for Face Detection and Recognition. PhD thesis, Georgia Institute of Technology, 1999.
[16] A. V. Nefian and M. H. Hayes III. An embedded HMM-based approach for face detection and recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing, 1999.
[17] L. Overbey, G. McKoy, J. Gordon, and S. McKitrick. Automated sensing and social network analysis in virtual worlds. In International Conference on Intelligence and Security Informatics, pages 179-184, 2008.
[18] A. Ranganathan and F. Dellaert. Semantic modeling of places using objects. In Robotics: Science and Systems (RSS), 2007.
[19] K. Sentz and S. Ferson. Combination of evidence in Dempster-Shafer theory. Technical Report SAND2002-0835, Sandia National Laboratories, 2002.
[20] G. Shafer. A Mathematical Theory of Evidence. Princeton University Press, 1976.
[21] C. Sun, Z. Pan, and Y. Li. SRP based natural interaction between real and virtual worlds in augmented reality. In International Conference on Cyberworlds, pages 117-124, 2008.
[22] M. Tomono. 3-D object map building using dense object models with SIFT-based recognition features. In Proc. of the IEEE Int. Conf. on Intelligent Robots and Systems (IROS), 2006.
[23] S. Vasudevan, S. Gachter, M. Berger, and R. Siegwart. Cognitive maps for mobile robots: an object based approach. Robotics and Autonomous Systems, 55(5), 2007.
[24] S. Vosinakis and T. Panayiotopoulos. A tool for constructing 3D environments with virtual agents. Multimedia Tools and Applications, 25(2):253-279, 2005.
[25] M. Xiang, D. Schonfeld, and A. Khokhar. A general two-dimensional hidden Markov model and its application in image classification. In IEEE International Conference on Image Processing (ICIP), pages VI-41-VI-44, 2007.