Proceedings of the 5th European Conference on Mobile Robots
ECMR’2011
September 7-9, 2011 Örebro, Sweden
Editors:
Achim J. Lilienthal Tom Duckett
September 5, 2011
Achim J. Lilienthal AASS Research Centre School of Science and Technology Örebro University SE-70182 Örebro Sweden achim.lilienthal@oru.se
Tom Duckett Lincoln School of Computer Science University of Lincoln LN6 7TS Lincoln United Kingdom tduckett@lincoln.ac.uk
ECMR’11
Proceedings of the 5th European Conference on Mobile Robots, September 7-9, 2011, Örebro, Sweden
General Chair:
Achim J. Lilienthal
Program Chair:
Tom Duckett
Conference Board:
Adam Borkowski Wolfram Burgard Primo Zingaretti
Organizing Committee
Local Organization:
Monica Wettler (Conference Coordinator) Barbro Alvin (Local Organization) Per Sporrong (Local Organization) Marcello Cirillo (Local Organization) Branko Arsov (Finance Organization)
Program Committee:
Kai Arras Alexander Bahr Juan Andrade Cetto Antonio Chella Marcello Cirillo Mark Cummins David Filliat Udo Frese Javier Gonzalez Horst-Michael Gross Dirk Holz Patric Jensfelt Maarja Kruusmaa Lino Marques Stefan May Ivan Petrović Libor Přeučil Alessandro Saffiotti Antonio Sgorbissa Cyrill Stachniss Rudolph Triebel Jasmin Velagic
Tamim Asfour Nicola Bellotto Raja Chatila Grzegorz Cielniak Andreu Corominas Murtra Robert Cupec Thierry Fraichard Emanuele Frontoni Roderich Gross Dongbing Gu Luca Iocchi Dermot Kerr Martin Magnusson Tomás Martínez-Marín Emanuele Menegatti Cedric Pradalier Dario Lodi Rizzini Thomas Schön Piotr Skrzypczyński Adriana Tapus Andrzej Typiak Markus Vincze
Sponsors
ABB Volvo Construction Equipment Robot Dalen Atlas Copco
Welcome Message
We are honoured to welcome you to the 5th European Conference on Mobile Robots – ECMR 2011, held in the city of Örebro, Sweden on September 7-9, 2011. ECMR is a biennial European forum, internationally open, allowing researchers throughout Europe to become acquainted with the latest accomplishments and innovations in advanced mobile robotics and mobile human-robot systems. The first ECMR meeting was held in September 2003 in Radziejowice, Poland, followed by ECMRs in September 2005 in Ancona, Italy; in September 2007 in Freiburg, Germany; and in September 2009 in Mlini/Dubrovnik, Croatia. A priority of ECMR is to attract young researchers to present their work to an international audience. ECMR is organized in single track mode to favour discussions. ECMR 2011 will continue this policy and will provide panel sessions and original presentations about research in progress. ECMR 2011 has continued to build on the success of the previous conferences, reaching a level of maturity that is reflected in the quality of the technical program. Each of the 71 submitted papers was evaluated by three reviewers, and 51 of them (from 19 different countries and 170 authors) were accepted by the Program Committee. These papers are included in these proceedings and will be presented at the conference. They cover a wide spectrum of research topics in mobile robotics:
3D perception, navigation, path planning and tracking, SLAM, mapping and exploration, human-robot cooperation, various service applications, etc. We are especially proud to welcome our distinguished keynote speakers Professor Per-Erik Forssén from Linköping University, Sweden, who will give the talk “Dynamic and Situated Visual Perception”, and Professor Markus Vincze from Vienna University of Technology, Austria, who will give the talk “Robot Object Classification”. The Technical Program also includes a special invited talk by Professor Wolfram Burgard of Freiburg University, Germany, on “Robot Control Based on System Identification”, a presentation in memoriam of Professor Ulrich Nehmzow. Ulrich was a keen supporter of ECMR (as well as EUROBOT, one of the predecessor conferences to ECMR) and will be deeply missed by many colleagues in the European mobile robotics community. A further invited talk will be given by Torbjörn Martinsson of Volvo CE on “Construction Equipment is a Mobile Robotic Market”, reflecting the strong connections between the academic community and its industrial partners at ECMR. Our sincere thanks are due to all people whose cooperation and hard work made this conference possible. First and foremost, we would like to thank the members of the Organizing Committee and Program Committee for their outstanding work, and our sponsors, whose support is particularly appreciated. Our special thanks go to the authors for submitting their work to ECMR 2011 and to the reviewers for their time and effort in evaluating the submissions. The results of their joint work are visible in the Program and Proceedings of ECMR 2011. It is now up to all of us to make ECMR 2011 a great success and a memorable event by participating in the technical program and by supporting our younger colleagues, especially students, as they will shape the future of mobile robotics research.
Achim J. Lilienthal Tom Duckett
CONFERENCE PROGRAM
Day 1
Session 1 – Shared Environments
1    Cipriano Galindo, Javier González, Juan-Antonio Fernández-Madrigal, Alessandro Saffiotti
     Robots that Change Their World: Inferring Goals from Semantic Knowledge
7    Arne Kreutzmann, Immo Colonius, Lutz Frommberger, Frank Dylla, Christian Freksa, Diedrich Wolter
     On Process Recognition by Logical Inference
13   Alper Aydemir, Moritz Göbelbecker, Andrzej Pronobis, Kristoffer Sjöö, Patric Jensfelt
     Plan-based Object Search and Exploration using Semantic Spatial Knowledge in the Real World
Session 2 – Comparative Evaluation
19   Todor Stoyanov, Athanasia Louloudi, Henrik Andreasson, Achim J. Lilienthal
     Comparative Evaluation of Range Sensor Accuracy in Indoor Environments
25   Dirk Holz, Nicola Basilico, Francesco Amigoni, Sven Behnke
     A Comparative Evaluation of Exploration Strategies and Heuristics to Improve Them
Session 3 – Tracking
31   Srećko Jurić-Kavelj, Ivan Marković, Ivan Petrović
     People Tracking with Heterogeneous Sensors using JPDAF with Entropy Based Track Management
37   Aamir Ahmad, Pedro Lima
     Multi-Robot Cooperative Object Tracking Based on Particle Filters
Session 4 – Navigation
43   Bernd Kitt, Jörn Rehder, Andrew Chambers, Miriam Schönbein, Henning Lategahn, Sanjiv Singh
     Monocular Visual Odometry using a Planar Road Model to Solve Scale Ambiguity
49   Boris Lau, Christoph Sprunk, Wolfram Burgard
     Incremental Updates of Configuration Space Representations for Non-Circular Mobile Robots with 2D, 2.5D or 3D Obstacle Models
55   Maximilian Beinhofer, Jörg Müller, Wolfram Burgard
     Landmark Placement for Accurate Mobile Robot Navigation
61   Francesco Capezio, Fulvio Mastrogiovanni, Antonello Scalmato, Antonio Sgorbissa, Paolo Vernazza, Tullio Vernazza, Renato Zaccaria
     Mobile Robots in Hospital Environments: an Installation Case Study
Day 2
Session 5 – Visual SLAM
69   John McDonald, Michael Kaess, Cesar Cadena, José Neira, John J. Leonard
     6-DOF Multi-session Visual SLAM using Anchor Nodes
77   Gerardo Carrera, Adrien Angeli, Andrew J. Davison
     Lightweight SLAM and Navigation with a Multi-Camera Rig
Poster Spotlight Session 1 – Shared Environments, Navigation
83   Abir B. Karami, Abdel-Illah Mouaddib
     A Decision Model of Adaptive Interaction Selection for a Robot Companion
89   Jonas Firl, Quan Tran
     Probabilistic Maneuver Prediction in Traffic Scenarios
95   Jens Kessler, Andrea Scheidig, Horst-Michael Gross
     Approaching a Person in a Socially Acceptable Manner using Expanding Random Trees
101  Amir Aly, Adriana Tapus
     Speech to Head Gesture Mapping in Multimodal Human-Robot Interaction
109  Hatice Kose-Bagci, Rabia Yorganci, Hatice Esra Algan
     Evaluation of the Robot Sign Language Tutoring Assistant using Video-based Studies
115  Agustin Ortega, Juan Andrade-Cetto
     Segmentation of Dynamic Objects from Laser Data
121  Erik Einhorn, Markus Filzhuth, Christof Schröter, Horst-Michael Gross
     Monocular Detection and Estimation of Moving Obstacles for Robot Navigation
127  Robert Cupec, Emmanuel K. Nyarko, Damir Filko
     Fast 2.5D Mesh Segmentation to Approximately Convex Surfaces
133  Janusz Bedkowski
     Data Registration Module - A Component of Semantic Simulation Engine
139  Ernesto H. Teniente, Juan Andrade-Cetto
     FaMSA: Fast Multi-Scan Alignment with Partially Known Correspondences
Session 6 – Perception
145  Sebastian Scherer, Daniel Dube, Philippe Komma, Andreas Masselli, Andreas Zell
     Robust Real-Time Number Sign Detection on a Mobile Outdoor Robot
153  Marcel Häselich, Marc Arends, Dagmar Lang, Dietrich Paulus
     Terrain Classification with Markov Random Fields on fused Camera and 3D Laser Range Data
159  Andrzej Pronobis, Patric Jensfelt
     Hierarchical Multi-Modal Place Categorization
165  Luís Osório, Gonçalo Cabrita, Lino Marques
     Mobile Robot Odor Plume Tracking using Three Dimensional Information
Session 7 – Planning
171  Jan Faigl, Vojtěch Vonásek, Libor Přeučil
     Multi-Goal Path Planning for Goal Regions in the Polygonal Domain
177  Jörg Stückler, Ricarda Steffens, Dirk Holz, Sven Behnke
     Real-Time 3D Perception and Efficient Grasp Planning for Everyday Manipulation Tasks
Poster Spotlight Session 2 – Perception, Planning, Visual Mapping, SLAM, Applications
183  Gonçalo Cabrita, Pedro Sousa, Lino Marques
     Odor Guided Exploration and Plume Tracking - Particle Plume Explorer
189  Miriam Schönbein, Bernd Kitt, Martin Lauer
     Environmental Perception for Intelligent Vehicles Using Catadioptric Stereo Vision Systems
195  Dominik Belter, Przemysław Łabęcki, Piotr Skrzypczyński
     On-Board Perception and Motion Planning for Legged Locomotion over Rough Terrain
201  Vojtěch Vonásek, Jan Faigl, Tomáš Krajník, Libor Přeučil
     A Sampling Schema for Rapidly Exploring Random Trees using a Guiding Path
207  Rainer Palm, Abdelbaki Bouguerra
     Navigation of Mobile Robots by Potential Field Methods and Market-based Optimization
213  Feras Dayoub, Grzegorz Cielniak, Tom Duckett
     A Sparse Hybrid Map for Vision-Guided Mobile Robots
219  Marco A. Gutiérrez, Pilar Bachiller, Luis J. Manso, Pablo Bustos, Pedro Núñez
     An Incremental Hybrid Approach to Indoor Modeling
227  Hemanth Korrapati, Youcef Mezouar, Philippe Martinet
     Efficient Topological Mapping with Image Sequence Partitioning
233  Thomas Féraud, Roland Chapuis, Romuald Aufrère, Paul Checchin
     Kalman Filter Correction with Rational Non-linear Functions: Application to Visual-SLAM
239  Paolo Raspa, Emanuele Frontoni, Adriano Mancini, Primo Zingaretti, Sauro Longhi
     Helicopter Safe Landing using Vision and 3D Sensing
245  Klas Hedenberg, Björn Åstrand
     Safety Standard for Mobile Robots - A Proposal for 3D Sensors
253  Wajahat Kazmi, Morten Bisgaard, Francisco Garcia-Ruiz, Karl D. Hansen, Anders la Cour-Harbo
     Adaptive Surveying and Early Treatment of Crops with a Team of Autonomous Vehicles
Day 3
Session 8 – New Design Concepts
259  Shahriar Asta, Sanem Sariel-Talay
     A Differential Steering System for Humanoid Robots
265  Christian Mandel, Udo Frese
     Annelid - a Novel Design for Actuated Robots Inspired by Ringed Worm’s Locomotion
Session 9 – SLAM
271  Andreas Nüchter, Seyedshams Feyzabadi, Deyuan Qiu, Stefan May
     SLAM à la carte - GPGPU for Globally Consistent Scan Matching
277  Andreja Kitanov, Ivan Petrović
     Generalization of 2D SLAM Observability Condition
283  Anssi Kemppainen, Janne Haverinen, Ilari Vallivaara, Juha Röning
     Near-optimal Exploration in Gaussian Process SLAM: Scalable Optimality Factor and Model Quality Rating
291  Eduardo Lopez, Caleb De Bernardis, Tomas Martinez-Marin
     An Active SLAM Approach for Autonomous Navigation of Nonholonomic Vehicles
Session 10 – Localization
297  John Folkesson
     Robustness of the Quadratic Antiparticle Filter for Robot Localization
303  Stéphane Bazeille, Emmanuel Battesti, David Filliat
     Qualitative Localization using Vision and Odometry for Path Following in Topo-metric Maps
309  Keisuke Matsuo, Jun Miura, Junji Satake
     Stereo-Based Outdoor Localization using a Line Drawing Building Map
315  Alessandro Benini, Adriano Mancini, Emanuele Frontoni, Primo Zingaretti, Sauro Longhi
     Adaptive Extended Kalman Filter for Indoor/Outdoor Localization using a 802.15.4a Wireless Network
321  List of Authors
Robots that change their world:
Inferring Goals from Semantic Knowledge
Cipriano Galindo*, Javier González, Juan-Antonio Fernández-Madrigal
Dept. of System Engineering and Automation
University of Málaga,
Spain
Abstract —A growing body of literature shows that endowing a mobile robot with semantic knowledge, and with the ability to reason from this knowledge, can greatly increase its capabilities. In this paper, we explore a novel use of semantic knowledge:
we encode information about how things should be, or norms, to allow the robot to infer deviations from these norms and to generate goals to correct these deviations. For instance, if a robot has semantic knowledge that perishable items must be kept in a refrigerator, and it observes a bottle of milk on a table, this robot will generate the goal to bring that bottle into a refrigerator. Our approach provides a mobile robot with a limited form of goal autonomy: the ability to derive its own goals to pursue generic aims. We illustrate our approach in a full mobile robot system that integrates a semantic map, a knowledge representation and reasoning system, a task planner, as well as standard perception and navigation routines.
Index Terms — Semantic Maps, Mobile Robotics, Goal Generation, Goal Autonomy, Knowledge Representation, Proactivity.
I. INTRODUCTION
Mobile robots intended for service and personal use are being increasingly endowed with the ability to represent and use semantic knowledge about the environment where they operate [13]. This knowledge encodes general information about the entities in the world and their relations, for instance:
that a kitchen is a type of room which typically contains a refrigerator, a stove and a sink; that milk is a type of perishable food; and that perishable food is stored in a refrigerator. Once this knowledge is available to a robot, there are many ways in which it can be exploited to better understand the environment or plan actions [10], [18], [19], [21], [23], [25], assuming of course that this knowledge is a faithful representation of the properties of the environment. There is, however, an interesting issue which has received less attention so far: what happens if this knowledge turns out to be in conflict with the robot’s observations? Suppose for concreteness that the robot observes a milk bottle lying on a table. This observation conflicts with the semantic knowledge that milk is stored in a refrigerator. The robot has three options to resolve this contradiction: (a) to update its semantic knowledge base, e.g., by creating a new subclass of milk that is not perishable; (b) to question the
*Corresponding author. System Engineering and Automation Dpt., University of Málaga, Campus de Teatinos, E-29071 Málaga, Spain. Email: cipriano@ctima.uma.es. This work has been partially supported by the Spanish Government under contract DPI2008-03527.
Alessandro Saffiotti
AASS Mobile Robotics Lab
Örebro University, Sweden
validity of its perceptions, e.g., by looking for clues that may indicate that the observed object is not a milk bottle; or (c) to modify the environment, e.g., by bringing the milk into a refrigerator. While some works have addressed the first two options [6], [11], the last one has not received much attention so far. Interestingly, the last option leverages a unique capability of robots: the ability to modify the physical environment. The goal of this paper is to investigate this option. We propose a framework in which a mobile robot can exploit semantic knowledge to identify inconsistencies between the observed state of the environment and a set of general, declarative descriptions, or norms, and to generate goals to modify the state of the environment in such a way that these inconsistencies would disappear. When given to a planner, these goals lead to action plans that can be executed by the robot. This framework can be seen as a way to enable a robot to proactively generate new goals, based on the overall principle of keeping the world consistent with the given declarative knowledge. In this light, our framework contributes to the robot’s goal autonomy. Although behavioral autonomy has been widely addressed in the robotic arena by developing deliberative architectures and robust algorithms for planning and executing tasks under uncertainty, goal autonomy has received less attention, having been explored in recent years in the theoretical field of multi-agent systems [4], [8] and implemented through motivational architectures [1], [7]. Our framework relies on a hybrid semantic map, which combines semantic knowledge based on description logics [2] with traditional robot maps [11], [18], [21]. Semantic maps have already been shown to increase the robot’s behavioral autonomy, by improving their basic skills (planning, navigation, localization, etc.) with deduction abilities. For instance, if a robot is commanded to “fetch a milk bottle” but does not know the target location, it can deduce that milk is supposed to be in fridges which, in turn, are located in kitchens. We now extend our previous works on these issues [10], [11] to also include partial goal autonomy through the proactive generation of goals based on the robot’s internal semantic model. More specifically, we consider a robot with the innate objective of keeping its environment in good order with respect to a given set of norms, encoded in a declarative way in its internal semantic representation. Incoherences between the sensed reality and the model, i.e., the observation of facts that violate a particular norm, will lead to the generation of the corresponding goal that, when planned and executed,
will re-align the reality to the model, as in the milk bottle example discussed above. It should be emphasized that in this work we only focus on the goal inference mechanism: the development of the required sensorial system, and the possible use of semantic knowledge in that context, are beyond the scope of this paper. Our approach to goal autonomy can be seen as a case of normative goals applied to agents which act based on beliefs and intentions [4], [8]. However, normative goals are often considered as simple if-then rules triggered when particular stimuli are given in the environment [1], [20]. Other works have used the term maintenance goals to represent innate goals that are aimed to satisfy a particular state of the world over time, e.g., the battery level should always be over a certain value [3], [12]. Our approach substantially diverges from those works, since it is not based on procedural rules, i.e., motivation-action pairs, nor if-then rules. Instead, we rely on a declarative representation of the domain, using the LOOM description logic language [17], from which the robot infers what should be done according to the current factual information in order to maintain the consistency between its environment and its representation. This manuscript is structured as follows. In the next section we present our semantic map. Section III formalizes the use of semantic knowledge for goal generation. In section IV a real experiment is described. Finally, some conclusions and future work are outlined.
II. A SEMANTIC MAP FOR MOBILE ROBOT OPERATION
The semantic map considered in this work, derived from [10], comprises two different but tightly interconnected parts:
a spatial box, or S-Box, and a terminological box, or T-Box. Roughly speaking, the S-Box contains factual knowledge about the state of the environment and of the objects in it, while the T-Box contains general semantic knowledge about the domain, giving meaning to the entities in the S-Box in terms of concepts and relations. For instance, the S-Box may represent that Obj-3 is placed at Area-2, while the T-Box may represent that Obj-3 is a stove which is a type of appliance. By combining the two sides, one can infer, for instance, that Area-2 is a kitchen, since it contains a stove. This structure is reminiscent of the structure of hybrid knowledge representation (KR) systems [2], which are now dominant in the KR community. Our semantic map extends the assertional component to be more than a list of facts about individuals by also associating these individuals to sensor-level information with a spatial structure — hence the name S-Box. Please refer to [10] for more detail. Figure 1 shows a simple example of a semantic map of
a home-like environment where both the S-Box and the T-Box have a hierarchical structure. The hierarchy in the T-Box reflects the fact that the represented semantic knowledge forms a taxonomy. For the S-Box, the use of a hierarchical spatial representation is a convenient and common choice in the robotic literature [9], [15]. This hierarchical arrangement largely helps in reducing the computational burden in large-scale scenarios when spatial information is involved, i.e. robot localization, as well as when symbolic information is required, i.e. goal generation, or task planning [10]. Of course one could also consider a flat representation in the S-Box: in fact, in our framework, the S-Box can be substituted by any other spatial representation.
Fig. 1. An example of semantic map for a home-like environment. S-Box is on the left and T-Box on the right. See explanation in the text.
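To make the interplay between the two boxes more concrete, the following minimal Python sketch mimics the Obj-3/Area-2 example above. The class layout, the toy taxonomy and the single hand-written rule ("an area containing a stove is a kitchen") are illustrative assumptions made for this sketch, not the LOOM-based machinery used in the actual system.

# Minimal sketch of a hybrid semantic map: the S-Box stores facts about
# individuals, the T-Box stores taxonomic knowledge about concepts.
# All names and the single rule below are illustrative assumptions.

class TBox:
    def __init__(self):
        # concept -> parent concept (a toy taxonomy)
        self.is_a = {"stove": "appliance", "appliance": "object",
                     "kitchen": "room", "room": "area"}

    def subsumed_by(self, concept, ancestor):
        while concept is not None:
            if concept == ancestor:
                return True
            concept = self.is_a.get(concept)
        return False

class SBox:
    def __init__(self):
        self.instance_of = {}   # individual -> concept, e.g. "Obj-3" -> "stove"
        self.placed_at = {}     # individual -> area,    e.g. "Obj-3" -> "Area-2"

def infer_kitchens(sbox, tbox):
    # toy inference: an area containing an object subsumed by "stove" is a kitchen
    return {area for obj, area in sbox.placed_at.items()
            if tbox.subsumed_by(sbox.instance_of.get(obj), "stove")}

tbox, sbox = TBox(), SBox()
sbox.instance_of["Obj-3"] = "stove"
sbox.placed_at["Obj-3"] = "Area-2"
print(infer_kitchens(sbox, tbox))   # {'Area-2'}: Area-2 is deduced to be a kitchen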
III. INFERRING GOALS FROM SEMANTICS
The semantic map described above provides two different points of view of the robot workspace. On the one hand, the spatial part (S-Box) enables the robot to generate plans from basic skills, striving for behavioral autonomy. On the other hand, the terminological part (T-Box) provides an abstract model of the robot environment which includes general knowledge, e.g., books are located on shelves, which can be exploited for the automatic generation of robot goals. First we give an informal description of the proposed mechanism for goal generation. Then, section III-B formalizes our approach under description logic. Finally, section III-C illustrates the process with two intuitive examples.
A. Informal Description
In the field of knowledge representation, semantic knowledge is usually interpreted as being descriptive of a specific domain: for example, the item of knowledge “beds are located in bedrooms” is used to partially describe beds. This knowledge is most useful to infer implicit properties from a few observed facts. For example, if the robot perceives a bed in a room it can infer that the room is a bedroom; conversely, if it is commanded to find a bed it can restrict its search to bedrooms. Galindo et al. [10] offer examples of applications of these inferences in the robotic domain. Interestingly, semantic knowledge can also be interpreted as being normative: under this interpretation, the above item of knowledge is prescribing where a bed must be located. The difference becomes apparent when considering how a robot should react to an observation that contradicts this knowledge. Consider again the milk box example in the Introduction, and the three possible options to resolve the contradiction
discussed there. Options (a) (update the model) and (b) (update the perceived state) correspond to modifying the robot’s beliefs to recover from a contradiction, and are related to execution monitoring and uncertain state estimation. These options have been explored previously [6], [11]. The third option (c) (update the world) involves goal generation, and it is the one addressed here. Informally, our approach defines a subset of concepts and relations stored in the T-Box as normative, i.e. they are involved in norms that should be fulfilled, by defining a special class normative-concept and a special relation normative-relation. Items of knowledge to be treated as normative will derive from them. For instance, we can define that the normative concept towel should be related to the concept bathroom through the normative relation place — that is, towels should be located in a bathroom. When a given instance violates a norm in the T-Box, the system derives the items of knowledge involved in the norm, and hence the goal that should be posted in order to satisfy that norm. In our example, suppose that an instance of a towel is perceived in a room which is not a bathroom. Then the given definition of a towel is violated — a circumstance that can be detected by most knowledge representation systems, including the LOOM [17] system used in our experiments. Since the above definition of towel is normative, the system yields a goal to satisfy the constraint, that is, to make the place of this towel be an instance of a bathroom. If the robot knows that, let us say, room-3 is a bathroom, this means that the goal “bring the towel to room-3” is generated.
B. Description Logic Representation for Inferring Normative Goals
Let I be a description logic interpretation on a particular domain D. Let ℘ define a set of disjoint concepts ℘ = {P1, ..., Pn}, i.e., ∀a, a ⊑ Pi ⇒ ¬∃j, j ≠ i, a ⊑ Pj, where x ⊑ y denotes that x is subsumed by concept y. Let Nr be called a normative relation, a function defined as:

Nr : NC → ℘

where NC represents the so-called normative concepts, that is, concepts which ought to be properly related to those from ℘. Nr actually defines the norms to be kept. Normative relations are defined as one-to-one functions: ∀b ⊑ NC ⇒ ∃! Pj ∈ ℘, b → [FILLS: Nr Pj], where a → [FILLS: B C] denotes that the instance a is related to an instance derived from concept C through the relation B. The NC set is further divided into two disjoint sets: the set △ of all normative concepts that fulfill the imposed norms, and the set △̄ of those that fail to fulfill some of the norms (see figure 2). Within this structure of the domain, constraint violations are automatically inferred when instances of the defined partitions are deduced to belong to a number of disjoint concepts. Let us see an example:
Fig. 2. Knowledge representation for detecting inconsistencies. Boxes represent concepts while instances are represented as circles. The concept C is defined as a normative concept related to Pi through the normative relation Nr1. See explanation in the text.
Let C be a normative concept (and therefore C ⊑ △ by definition) which is related to the Pi concept through the normative relation Nr. That is,

∀c ⊑ C, c → [FILLS: Nr x], x ⊑ Pi

If in a given interpretation I, ∃k ⊑ C, k → [FILLS: Nr y], y ⊑ Pj ∈ ℘, Pj ≠ Pi, then in I, y ⊑ Pj ∧ y ⊑ Pi ⇒ (Incoherent y). That is, if the normative relation is not met for a particular instance of a normative concept, the filler of such an instance, in this case y, becomes incoherent. Moreover, since k is defined as k ⊑ C ⊑ △, it is also inferred that k ⊑ △̄, which also makes k incoherent.

Goal Inference. Given an incoherent instance of a normative concept, k ⊑ C, and the normative relation Nr, Nr(k) = x, x ⊑ Pi ∈ ℘, the inferred goal to recover the system from the incoherence is:

(exists ?z (Pi z) (Nr k z))

That is, in the goal state, there should exist an instance of Pi related to k through the normative relation Nr.¹
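As a rough illustration of this inference outside of a description logic system, the following Python sketch encodes a partition ℘, a single normative relation, and the goal produced when the filler of a normative instance falls into the wrong element of the partition. The data structures and names are assumptions made only for this sketch and do not reflect the LOOM encoding used in the paper.

# Sketch of normative goal inference: concepts in the partition are disjoint,
# and a normative relation prescribes which partition element the filler of a
# normative concept must belong to. Names and structures are illustrative only.

PARTITION = {"fridge", "table", "shelf"}              # the disjoint concepts in the partition
NORMS = {"milk-bottle": ("place", "fridge")}          # normative concept -> (relation, required concept)

INSTANCE_OF = {"mb": "milk-bottle", "t1": "table", "f1": "fridge"}
FILLERS = {("mb", "place"): "t1"}                     # relation fillers, S-Box style

def infer_goals():
    goals = []
    for inst, concept in INSTANCE_OF.items():
        if concept not in NORMS:
            continue                                  # not a normative concept
        relation, required = NORMS[concept]
        filler_concept = INSTANCE_OF.get(FILLERS.get((inst, relation)))
        # violation: the filler belongs to a different element of the partition
        if filler_concept in PARTITION and filler_concept != required:
            # both filler and instance become incoherent; post a goal that
            # re-establishes the norm, e.g. (exists ?z (fridge ?z) (place mb ?z))
            goals.append(f"(exists ?z ({required} ?z) ({relation} {inst} ?z))")
    return goals

print(infer_goals())   # ['(exists ?z (fridge ?z) (place mb ?z))']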
C. Sample Scenarios
In this section we describe two illustrative examples.
1) Milk should be inside fridges: Consider a home assistant robot taking care of an apartment. Among other norms, the robot might watch over milk bottles so that they are always kept inside the fridge (see an implementation in section IV). The semantic map for this scenario will contain information
about the different rooms, i.e. kitchen, livingroom, bathroom, etc., the objects found inside, i.e. tables, chairs, fridges, books, bottles, etc., and their relations. Following the formalization given in III-B, part of the description of this scenario includes the partition of different places where bottles of milk could be found, e.g. ℘ = {fridge, table, shelf}, with milk-bottle being a normative concept, i.e. milk-bottle ⊑ △ (see figure 3). Note that this definition implicitly provides certain restrictions that any bottle of milk should fulfill. Precisely,
¹It is not necessary to add the negation of (Nr k z) to the goal state, since the Nr function is defined as one-to-one.
(defrelation normative-relation)
(defrelation place
  :is normative-relation
  :domain object
  :range object-places)
(defconcept normative-concept
  :partitions $Norms$)
(defconcept fulfilling-norm
  :is (and normative-concept
           (:not (:some normative-relation incoherent)))
  :in-partitions $Norms$)
(defconcept non-fulfilling-norm
  :is (and normative-concept
           (:some normative-relation incoherent))
  :in-partitions $Norms$)
(defconcept milk-bottle
  :is (and beverage
           fulfilling-norm
           (:the place fridge)
           (:the color white)))
Fig. 3. Part of the domain definition for the “milk inside fridges” example. For clarity’s sake, fulfilling-norm is used instead of △ and non-fulfilling-norm instead of △̄.
milk-bottle is assumed to be a beverage which has to meet at least one norm imposed by a normative relation, since
it is subsumed by the fulfilling-norm concept.
Through the definitions given in figure 3, the expression (:the place fridge) indicates that every bottle of milk ought to be located in one location that must be
a fridge. Notice that in this example, the other restriction
(:the color white) is not defined as a normative relation, and thus, if it is not fulfilled in the scenario it will simply be deduced that the object is not a bottle of milk and no incoherences or robot goals will be generated.
Let us now consider the following situation in which the
information gathered by the robot contradicts the definitions
in the domain:
{(table t1)(milk-bottle mb)(fridge f1) (place mb t1)}
Under this interpretation, LOOM infers that the instance t1 should be a fridge since there is a bottle of milk placed on it. Such an inference produces an incoherence in the model given that the instance t1 is deduced to belong to two concepts, i.e. table and fridge, which have been defined as members of a partition. In this situation t1 is marked by LOOM as “incoherent”. Moreover, it is also deduced that the instance mb, initially defined as mb ⊑ △, also belongs to △̄ since the normative relation (:the place fridge) is filled with an incoherent instance. Again the system detects that mb belongs to two concepts defined in a partition, and thus mb is also marked as “incoherent”. The result is that the instances involved in the violated norm are detected and marked as incoherent. By checking the domain definition of such incoherent instances,
Fig. 4. General scheme for representing multiple norms. A particular partition has to be defined for each normative relation.
the following goal is deduced:²
(exist ?x (fridge ?x) (place mb ?x))
That is, the robot has to put the bottle of milk represented by mb inside any object ?x which is known to be a fridge. Since in the robot’s domain there is a single fridge f1, the above goal is instantiated as (place mb f1).
2) Plants should be properly watered: We now describe a more general example in which two norms are imposed on the same normative concept. Consider that we impose that “plants should be placed at the garden and have a normal humidity level”. In this case we need two normative relations, place and has-humidity, and two partitions of concepts representing the possible, disjoint values for such relations. Figure 4 depicts part of the T-Box for this example. Let us consider the following situation:
{(kitchen k)(bathroom b)(garden g) (plant p)(place p k)(humidity-value dry) (has-humidity p dry)}
As in the previous example, the process for detecting norm violations checks for incoherent instances. In this case instances k and dry become incoherent since they are deduced to belong to {kitchen, garden} and {dry, normal-hum} respectively. Besides, the instance p is also incoherent and therefore the following goal is generated:
(and (exist ?x (garden ?x)(place p ?x)) (exist ?y (normal-hum ?y)(has-humidity p ?y)))
IV. AN ILLUSTRATIVE EXPERIMENT
We now illustrate the applicability of our goal generation technique to a real robotic application by showing an illustrative experiment run in a home environment. The experiment is inspired by the “milk” scenario in Sec. III above.
²This goal is expressed in the goal language of the planner used in our experiment (see below), which is a subset of FOL.
Fig. 5. The test environment. Left: layout. Right: the robot Astrid.
A. Physical setup
We have used a physical test-bed facility, called the PEIS-Home [24], that looks like a bachelor apartment of about 25 m² and consists of a living-room, a bedroom and a small kitchen — see Fig. 5. The PEIS-Home is equipped with a communication infrastructure, and with a number of sensing and actuating devices, including a few mobile robots. Relevant to the experiments reported here are:
• a refrigerator equipped with a computer, some gas sensors, a motorized door, and an RFID tag reader;
• an RFID tag reader mounted under the kitchen table;
• a set of RFID tagged objects, including a milk carton;
• a set of webcams mounted on the ceiling; and
• Astrid, a PeopleBot mobile robot equipped with a laser scanner, a PTZ camera, and a simple gripper.
A few provisions have been introduced to simplify execution. In particular, since Astrid does not have a manipulator able to pick up an object and place it somewhere else, these operations have been performed with the assistance of a human who puts the object in and out of Astrid’s gripper. These simplifications are acceptable here, since the purpose of our experiments is not to validate the execution system but to illustrate our goal generation algorithm in the context of a full robotic application.
B. Software setup
The software system used in our experiment is schematically shown in Fig. 6. The block named “PEIS Ecology” contains all the robotic components and devices distributed in the PEIS-Home. These are integrated through a specific middleware, called the PEIS-Middleware, that allows them to be dynamically activated and connected in different ways in order to perform different tasks [5]. A set of activations and connections is called a configuration of the PEIS Ecology. For instance, the configuration in which the ceiling cameras are connected via object recognition to the navigation controller onboard Astrid can be used to let the robot reach a given object. The semantic map is based on a simple metric-topological map attached to the LOOM knowledge representation system [17]. Newly observed facts are asserted in LOOM using the tell primitive. The goal generation system interacts with
Fig. 6. Sketch of the software architecture used in our experiments. Only the modules and connections relevant to goal generation are shown.
LOOM as described in Sec. III above. Newly generated goals are passed to the planning system. This consists of three parts: an action planner, called PTLplan [14], that generates a sequence of actions to satisfy the goal; a sequencer, that selects those actions one by one; and a configuration planner [16], that generates the configuration needed to perform each action. When the current plan is completed, the goal generation system is re-activated.
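The resulting control flow can be summarized as a simple loop. In the sketch below, generate_goals, plan, configure and execute are placeholders standing in for the goal generator, PTLplan, the sequencer and the configuration planner; they are illustrative assumptions and do not correspond to the actual interfaces of those components.

# Sketch of the goal-generation / planning / execution cycle described above.
# All callables are placeholders for the real components (LOOM-based goal
# generation, PTLplan, the sequencer, the configuration planner).

def run_cycle(semantic_map, generate_goals, plan, configure, execute):
    while True:
        goals = generate_goals(semantic_map)         # detect norm violations
        if not goals:
            break                                    # world consistent with the norms
        for action in plan(semantic_map, goals):     # sequencer: one action at a time
            devices = configure(action)              # configuration planner role
            execute(action, devices)                 # run the action in the ecology
        # when the plan is completed, the goal generator is re-activated

# trivial dry run with dummy components: one goal, a fixed plan
state = {"goal_pending": True}
run_cycle(state,
          generate_goals=lambda m: ["(place mb-22 fridge-1)"] if m.pop("goal_pending", False) else [],
          plan=lambda m, g: ["MOVE", "PICKUP", "OPEN", "MOVE", "PLACE", "CLOSE"],
          configure=lambda a: ["astrid"],
          execute=lambda a, d: print("executing", a, "with", d))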
C. Execution
Before the execution started, the semantic map contained a metric-topological map of the PEIS-Home, and the considered semantic knowledge in LOOM. In particular, the following statement was included in the LOOM knowledge base
(defconcept MilkBox :is (:and Container FulfillingNorm (:the place Fridge) ))
This encodes the normative constraint that any instance of the normative class MilkBox must have a single filler for the place relation, and that this filler must be of the class Fridge. An RFID tag has been attached to a milk box, containing an encoding of the following information:
id: mb-22
type: MilkBox
color: white-green
size: 1-liter
At start, the milk is put on the kitchen table, called table-1 in the map. The RFID tag reader under the table detects the new tag, and reports the information that mb-22 is a MilkBox and it is at table-1 — see Fig. 7. This information is entered into LOOM by:
(tell (MilkBox mb-22)) (tell (place mb-22 table-1))
As discussed in Sec. III, this information renders both the instances mb-22 and table-1 incoherent. The goal generation algorithm identifies mb-22 as the normative instance. The algorithm then searches through all the relations that constrain mb-22 to find a violated normative one, and it finds place. Since this relation should be filled by an instance of Fridge, it generates the following goal:
(exists ?x (and (Fridge ?x)(place mb-22 ?x)))
PTLplan uses the knowledge in the semantic map, together with its domain knowledge about the available actions, to generate the following action plan (simplified):
((MOVE astrid table-1) (PICKUP astrid mb-22) (OPEN fridge-1) (MOVE astrid fridge-1) (PLACE astrid mb-22) (CLOSE fridge-1))
Fig. 7. RFID tagged objects and RFID tag readers used in our experiments. Left: in the fridge. Right: under the kitchen table.
where the variable ?x has been instantiated by fridge-1. The sequencer passes each action in turn to the configuration planner, which connects and activates the devices in the PEIS Ecology needed to execute it. For example, the first two actions only require devices which are on-board Astrid, while the third action requires the activation of the fridge door device. (The details of this “ecological” execution are not relevant here: see [16] for a comprehensive account.) After the milk is removed from the table, the RFID tag reader under the table detects its absence and signals it to the semantic map. When the milk is placed into the fridge, it is detected by the reader in the fridge. Corresponding to these two events, the following assertions are made in LOOM:
(forget (place mb-22 table-1)) (tell (place mb-22 fridge-1))
After execution is completed, the sequencer re-activates the goal generator. Since the place of mb-22 is now an instance of a fridge, no incoherence is detected and no new goal is generated.
V. DISCUSSION AND CONCLUSIONS
This paper has explored a novel use of semantic knowledge:
recognizing and correcting situations in the world that do not comply with the given semantic model, by generating appropriate goals for the robot. A distinctive feature of our approach is that the normative model is provided in a declarative way, rather than by exhaustive violation-action rules. Simple experiments demonstrate the conceptual viability of this approach. The work reported here is a first step in an interesting direction, and many extensions can and should be considered. For instance, in our work we assume that the robot should always enforce consistency with the semantic knowledge. However, there are cases where norm violation might be allowed. Going back to the milk example, it would be reasonable to allow that the milk bottle stays out of the fridge while the user is having breakfast. Dealing with this type of situation would require the ability to reason about both the robot’s plans and the user’s activities in an integrated way [22].
REFERENCES
[1] R.C. Arkin, M. Fujita, T. Takagi, and R. Hasegawa. An ethological and emotional basis for human-robot interaction. Robotics and Autonomous Systems, 42(3-4), 2003.
[2] F. Baader, D. Calvanese, D.L. McGuinness, and D. Nardi, editors. The Description Logic Handbook. Cambridge University Press, 2007.
[3] C. Baral, T. Eiter, M. Bjäreland, and M. Nakamura. Maintenance goals of agents in a dynamic environment: Formulation and policy construction. Artificial Intelligence, 172(12-13):1429–1469, 2008.
[4] G. Boella and R. Damiano. An architecture for normative reactive agents. In Proc. of the 5th Pacific Rim Int. Workshop on Multi-Agents, pages 1–17, London, 2002. Springer.
[5] M. Bordignon, J. Rashid, M. Broxvall, and A. Saffiotti. Seamless integration of robots and tiny embedded devices in a PEIS-Ecology. In Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), San Diego, CA, 2007.
[6] A. Bouguerra, L. Karlsson, and A. Saffiotti. Semantic knowledge-based execution monitoring for mobile robots. In Proc. of the IEEE Int. Conf. on Robotics and Automation, Rome, Italy, pages 3693–3698, 2007.
[7] A. M. Coddington, M. Fox, J. Gough, D. Long, and I. Serina. Madbot: A motivated and goal directed robot. In Proc. of the 20th Nat. Conf. on Artificial Intelligence, Pennsylvania, USA, pages 1680–1681, 2005.
[8] M. Dastani and L.W.N. van der Torre. What is a normative goal?: Towards goal-based normative agent architectures. In Proc. of the Int. Workshop on Regulated Agent-Based Social Systems (2002), Bologna, Italy, pages 210–227, 2002.
[9] C. Galindo, J.A. Fernandez-Madrigal, and J. Gonzalez. Multiple Abstraction Hierarchies for Mobile Robot Operation in Large Environments. Studies in Computational Intelligence, Vol. 68. Springer, 2007.
[10] C. Galindo, J.A. Fernandez-Madrigal, J. Gonzalez, and A. Saffiotti. Robot task planning using semantic maps. Robotics and Autonomous Systems, 56(11):955–966, 2008.
[11] C. Galindo, A. Saffiotti, S. Coradeschi, P. Buschka, J.A. Fernandez-Madrigal, and J. Gonzalez. Multi-hierarchical semantic maps for mobile robotics. In Int. Conf. on Intelligent Robots and Systems (IROS), pages 3492–3497, Edmonton, Alberta, Canada, 2005.
[12] M.V. Hindriks and M.B. van Riemsdijk. Satisfying maintenance goals. In Proc. of the Int. Workshop on Declarative Agent Languages and Technologies (DALT07), Honolulu, HI, USA, pages 86–103, 2007.
[13] J. Hertzberg and A. Saffiotti, editors. Special issue on semantic knowledge in robotics. Robotics and Autonomous Systems, 56(11), 2008.
[14] L. Karlsson. Conditional progressive planning under uncertainty. In Proc. of the Int. Joint Conf. on Artificial Intelligence (IJCAI), pages 431–438, Seattle, USA, 2001.
[15] B.J. Kuipers. Modeling Spatial Knowledge, chapter in Advances in Spatial Reasoning, Vol. 2, pages 171–198. The U. of Chicago Press, 1990.
[16] R. Lundh, L. Karlsson, and A. Saffiotti. Autonomous functional configuration of a network robot system. Robotics and Autonomous Systems, 56(10):819–830, 2008.
[17] R. MacGregor and R. Bates. The Loom knowledge representation language. Technical report, DTIC Research Report ADA183415, 1987.
[18] D. Meger, P. Forssen, K. Lai, S. Helmer, S. McCann, T. Southey, M. Baumann, J. Little, D. Lowe, and B. Dow. Curious George: An attentive semantic robot. In IROS Workshop: From Sensors to Human Spatial Concepts, pages 390–404, 2007.
[19] O.M. Mozos, P. Jensfelt, H. Zender, M. Kruijff, and W. Burgard. From labels to semantics: An integrated system for conceptual spatial representations of indoor environments for mobile robots. In ICRA Workshop: Semantic Information in Robotics, 2007.
[20] T.J. Norman and D. Long. Alarms: An implementation of motivated agency. In Proc. of the IJCAI Workshop on Agent Theories, Architectures, and Languages (ATAL), Montreal, Canada, pages 219–234, 1995.
[21] A. Nüchter, O. Wulf, K. Lingemann, J. Hertzberg, B. Wagner, and H. Surmann. 3D mapping with semantic knowledge. In RoboCup Int. Symp., pages 335–346, 2005.
[22] F. Pecora and M. Cirillo. A constraint-based approach for plan management in intelligent environments. In Scheduling and Planning Applications Workshop at ICAPS09, 2009.
[23] A. Ranganathan and F. Dellaert. Semantic modeling of places using objects. In Robotics: Science and Systems Conf., 2007.
[24] A. Saffiotti, M. Broxvall, M. Gritti, K. LeBlanc, R. Lundh, J. Rashid, B.S. Seo, and Y.J. Cho. The PEIS-Ecology project: vision and results. In Int. Conf. on Intelligent Robots and Systems, Nice, France, 2008.
[25] M. Tenorth, U. Klank, D. Pangercic, and M. Beetz. Web-enabled robots: Robots that use the web as an information resource. Robotics and Automation Magazine, 18(2), 2011.
On Process Recognition by Logical Inference
Arne Kreutzmann
Immo Colonius
Lutz Frommberger
Frank Dylla
Christian Freksa
Diedrich Wolter
SFB/TR 8 Spatial Cognition, University of Bremen, Germany
Abstract — The ability to recognize and to understand processes
allows a robot operating in a dynamic environment to rationally respond to dynamic changes. In this paper we demonstrate how
a mobile robot can recognize storage processes in a warehouse
environment, solely using perception data and an abstract specification of the processes. We specify processes symbolically in linear temporal logic (LTL) and pose process recognition as a model verification problem. The key feature of our logic-based approach is its ability to infer missing pieces of information by logic-based reasoning. The evaluation demonstrates that this approach is able to reconstruct histories of good movements in
a lab-simulated warehouse.
Index Terms — plan recognition, temporal logic, spatio-temporal reasoning
I. INTRODUCTION
Mastering dynamic environments is a demanding challenge in autonomous robotics. It involves recognizing and understanding processes in the environment [7]. Recent advances in simultaneous localization and mapping (SLAM) [20, 21, 22] build the basis for sophisticated navigation in dynamic environments, but our aim of understanding processes goes beyond navigation. In this paper we indicate how the problem of recognizing processes can be tackled on a conceptual level in the domain of warehouse logistics. In a warehouse, there is a constant flow of goods which are moved through space, establishing functional zones that are connected with certain types of storage processes (for example, admission of goods into a warehouse makes use of a buffer zone to temporarily store goods for quality assurance). Knowing about the in-warehouse processes and their whereabouts enables warehouse optimization. Hildebrandt et al. argue for using autonomous robots as a minimally invasive means to observe in-warehouse processes [10]. However, the sensory system of the robot provides uncertain and incomplete knowledge about the environment and the observed spatio-temporal patterns. Thus the challenge is to interpret the observations sensibly. Many approaches to process recognition rely on statistical data to train probabilistic classifiers such as Markov networks [6, 13], Bayesian networks [23], or supervised learning [5]. Approaches based on statistical data perform very well in terms of recognition rate, but, aside from the need for training, they do not support flexible queries about processes and they have to be re-trained if new elements or processes are introduced in the domain. Symbolic approaches have none of these downsides, but require a model of the observable processes, which is given in our environment. Additionally, a
This paper presents work done in the project R3-[Q-Shape] of the Transregional Collaborative Research Center SFB/TR 8 Spatial Cognition. Funding by the German Research Foundation (DFG) is gratefully acknowledged.
well-constructed model allows for efficient use of heuristics to speed up query processes [8]. Usually, symbolic approaches are used to tackle plan recognition, which is closely related to process recognition—see [2, 3] for an overview. In the following we present a logic-based approach that allows us to recognize activities purely from qualitative process descriptions without prior training. By integrating and abstracting sensory information we are able to answer queries about observed spatio-temporal activities (such as “How often have goods been relocated within the storage zone?”) as well as about regions in space (e.g., “Which areas in the warehouse have been used as a buffer zone?”). Answering such queries is an important step towards logistic optimization. The contribution of this paper is to demonstrate how processes and their whereabouts can be inferred in a previously unknown environment. Referring to the decomposition of process detection by Yang [23], we propose a multi-step approach to get from low-level sensory observations to high-level symbolic representations (see Fig. 1). In our scenario, a robot performs a surveillance task in the warehouse. Object recognition is outside the scope of this paper, but in many logistics scenarios goods can easily be identified by unique labels attached to them (such as barcodes or RFID tags). Thus, we assume that the robot is able to uniquely identify goods in the warehouse. The integration of position estimates for the goods in itself presents a feature-based SLAM problem. Uncertain and incomplete position estimates of entities gathered by a probabilistic mapping procedure must be transferred into a symbolic representation in a symbol grounding process to allow for high-level descriptions of the system dynamics. What has been an uncertain position estimate in the mapping process must become a stable qualitative notion of location. Based on correspondence of features and locations in time, we are able to specify processes of interest in an abstract formal language and, in a third step, tackle the process recognition problem by model verification. The formal language we choose to formalize processes
and to state queries is linear temporal logic (LTL) [17, see Sect. III-A]. LTL was proposed earlier as a tool for mobile robotics [1], especially for robot motion planning from high-level specifications [11, 18]. Recently, this approach has also been applied to real robotic systems [12]. In the domain of smart environments, an approach to process detection by LTL model verification has been presented in [14]. LTL not only allows for queries about processes, but also about spatial relations of regions. This approach covers a wide range of reasoning tasks adequately. In particular, it allows us to query the occurrence of processes operating on spatial regions and the concrete whereabouts of those regions at the same time in one and the same reasoning process.
II. THE WAREHOUSE SCENARIO
We address the problem of understanding so-called chaotic or random-storage warehouses, characterized by the lack of a predefined spatial structure, that is, there is no fixed assignment of storage locations to specific goods. Thus, storage processes are solely the responsibility of the warehouse operators and basically not predictable: goods of the same type may be distributed over various locations and no database keeps track of these locations. This makes it a hard problem for people aiming at understanding the internal storage processes. On a conceptual level, storage processes are defined by
a unique pattern [19]: On their way into and out of the warehouse, goods are (temporarily) placed into functional zones which serve specific purposes (see Fig. 2). All goods arrive in the entrance zone (E). From there, they are picked up and temporarily moved to a buffer zone (B) before they are finally stored in the storage zone (S). Within the storage zone redistribution of goods can occur arbitrarily often. When taking out goods, they are first moved to the picking zone (P) from where they are taken to the outlet zone (O), before being placed on a truck. A mobile robot observing such a warehouse is not able
to directly perceive these zones, as they are not marked. For
all zones we know that they exist (that is, that such regions are used within the storage operations), but not their concrete spatial extents or their number of occurrences, as they appear as a result of dynamic in-warehouse storage processes. The robot can detect and identify goods, and estimate their position.
So when observing this kind of environment, we face the challenge that for detecting concrete storage processes we rely
on the existence of certain zones, but we do not know their whereabouts.
III. IN-WAREHOUSE PROCESS DETECTION WITH LINEAR TEMPORAL LOGIC
To interpret raw sensory data such that we achieve a symbolic
representation of the processes of interest, we first introduce linear temporal logic and the axiomatization of our domain. All queries are stated as LTL formulas and can be answered by model verification. Following this, we describe the symbolic grounding. Then, we specify the in-warehouse processes in linear temporal logic and demonstrate the inference process by an example.
A. Linear Temporal Logic (LTL)
LTL [17] is a modal logic that extends propositional logic by a sequential notion of time. A formula φ in LTL is defined over a finite set of propositions with a set of the usual logic operators (∧, ∨,¬, →). The temporal component is established by an accessibility relation R that connects worlds (or states) and a set of modal operators, of which we use the following:
• ◦φ – next. A formula φ holds in the following world
• □φ – always. A formula φ holds now and in all future worlds
• ♦φ – eventually. φ will hold in some world in the future (♦φ ↔ ¬□¬φ)
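To make the role of these operators in model verification concrete, the following Python sketch evaluates them over a finite sequence of worlds, with each world represented as the set of atoms true in it. This finite-trace evaluation is a simplification written for illustration and is not the verification machinery used in the evaluation of this paper.

# Minimal finite-trace evaluation of the LTL connectives used in this paper.
# A trace is a list of worlds; each world is the set of atoms true in it.

def holds(formula, trace, i=0):
    op = formula[0]
    if op == "atom":                     # ("atom", "at(G,L1)")
        return formula[1] in trace[i]
    if op == "not":
        return not holds(formula[1], trace, i)
    if op == "and":
        return holds(formula[1], trace, i) and holds(formula[2], trace, i)
    if op == "implies":
        return (not holds(formula[1], trace, i)) or holds(formula[2], trace, i)
    if op == "next":                     # ◦φ: φ holds in the following world
        return i + 1 < len(trace) and holds(formula[1], trace, i + 1)
    if op == "always":                   # □φ: φ holds now and in all later worlds
        return all(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == "eventually":               # ♦φ: φ holds in some current or later world
        return any(holds(formula[1], trace, j) for j in range(i, len(trace)))
    raise ValueError("unknown operator: " + op)

trace = [{"at(G,L1)"}, {"at(G,L2)"}, {"at(G,L3)"}]
print(holds(("eventually", ("atom", "at(G,L3)")), trace))   # True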
B. Axiomatizing the Warehouse Scenario
1) Propositions: We define the propositions that model the desired processes in our logic with the help of the following atomic observables:
• a set G = {G1, ..., Gn} of uniquely identifiable goods
• a set L = {L1, ..., Lm} of locations in space at which goods have been perceived by the robot
• a set Z = {E, B, S, P, O} of functional zones as described in Sect. II.
The following atoms need to be defined over G, L, and Z, such that we obtain a finite set of atoms:
• at(G, L) – holds iff a good G is known to be at location L
• in(L, Z) – holds iff a location L lies within zone Z
• close(L1, L2) – holds iff two locations L1, L2 are close to one another
2) Axioms: Based on constraints of space and general knowledge about our domain, we axiomatize our domain. One constraint is that we disregard continuous motion and therefore only deal with snapshots of the world. This means that all observed goods are temporarily fixed at their positions.
• A good G can only be at one location at a time. We introduce the following axioms for all G ∈ G and Li, Lj ∈ L, i ≠ j:

A1_{G,Li,Lj} = ¬(at(G, Li) ∧ at(G, Lj))    (1)
• Any object is always located within a zone Z ∈ Z. We have for all G ∈ G and L ∈ L:

A2_{G,L} = at(G, L) → ⋁_{Z∈Z} in(L, Z)    (2)

• Locations in different zones are not close to each other, that is, zones are at least some minimum distance apart. We have for all Zk, Zl ∈ Z (k ≠ l) and Li, Lj ∈ L (i ≠ j):

A3_{Li,Lj,Zk,Zl} = in(Li, Zk) ∧ in(Lj, Zl) → ¬close(Li, Lj)    (3)

• Zones are static. We have for all Zk, Zl ∈ Z (k ≠ l) and L ∈ L:

A4_{L,Zk,Zl} = in(L, Zk) → ¬♦in(L, Zl)    (4)

A set A subsumes all axioms (1)–(4).
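As a rough companion to these axioms, the sketch below checks a single ground world (a set of at, in and close facts) against instantiations of (1)–(3), and a pair of consecutive worlds against (4). The encoding of atoms as tuples is an assumption made for this sketch.

# Sketch: test ground worlds against axioms (1)-(4). A world is a dict holding
# the ground atoms that are true in it; the tuple encoding is illustrative only.

def violated_axioms(world, next_world=None):
    violations = []
    at, inz, close = world["at"], world["in"], world["close"]

    # A1: a good can only be at one location at a time
    for good in {g for g, _ in at}:
        if len({loc for g, loc in at if g == good}) > 1:
            violations.append("A1")
    # A2: every occupied location lies within some zone
    if any(loc not in {l for l, _ in inz} for _, loc in at):
        violations.append("A2")
    # A3: locations assigned to different zones must not be close
    zone_of = dict(inz)
    if any(a in zone_of and b in zone_of and zone_of[a] != zone_of[b]
           for a, b in close):
        violations.append("A3")
    # A4: zones are static over time
    if next_world is not None:
        later = dict(next_world["in"])
        if any(later.get(loc, zone) != zone for loc, zone in inz):
            violations.append("A4")
    return violations

world = {"at": {("G1", "L1"), ("G1", "L2")},     # G1 observed at two locations
         "in": {("L1", "E"), ("L2", "B")},
         "close": set()}
print(violated_axioms(world))   # ['A1']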
C. Grounding Symbols
So far, we have formal descriptions of the high-level in-warehouse observables on the one hand, and sensory perceptions from the robot on the other hand. These need to be connected to each other in order to perform reasoning on real-world data.
That is, we need to transform the sensory information to our logical propositions at(G, L), close(L 1 , L 2 ), and in(L, Z). Mapping a perceived good to a symbol G is trivial in this task due to the unique identifiers. However, for the good’s location we will only have an uncertain position estimate
(x, y) ∈ R² for the entity observed from the mapping process. These estimates are subject to noise and thus will vary over time although the observed object remains static. A location
is a qualitative abstraction from positional measurements that
abstracts from uncertainty emerging from sensory perceptions and the mapping process. Therefore, we need to transform
position estimates to a discrete and finite set of symbols, i.e., to subsume similar or comparable positions. This transformation
is a function f : R² → L, that is, every position estimate is
mapped to a single location (see Axiom (1)). To this end, a clustering method can be applied to map estimates to a set of prototypical positions—the locations (see Section IV-B).
We ground close(L_1, L_2) by applying a metric and checking whether the distance between L_1 and L_2 is below a certain threshold. To ground in(L, Z), we need to identify the functional zones in the warehouse. These zones are constituted by sets of locations. For zones Z whose extents are known a priori, the corresponding locations L_Z ⊆ L can be assigned directly by introducing the respective in-atoms. All remaining locations L_i ∈ L\L_Z are known to be not a part of Z, i.e., ¬in(L_i, Z), but (according to (2)) must be part of one of the other zones: in(L_i, Z′) with Z′ ∈ Z\{Z}. In addition to the axioms A, the propositions close and in are persistent over all worlds. The set

B = A ∪ ⋃_{L_i,L_j ∈ L} close(L_i, L_j) ∪ ⋃_{L ∈ L, Z ∈ Z} in(L, Z)    (5)

is called background knowledge. The only proposition that changes over different worlds is at(G, L). We traverse through the time steps t and map all goods G_i with their position estimates (x_i, y_i) to corresponding observations obs(t, G_i, L_j) that assert that G_i has been observed at L_j at time step t. This yields a series of sets of observations O_t = ⋃_{G_i ∈ G} obs(t, G_i, L_j) over time. A new world is established as soon as our observations change, that is, O_{t+1} ≠ O_t. Then, from obs(t, G_i, L_j) follows at(G_i, L_j), and the new world consists of B ∪ ⋃_i at(G_i, L_j).
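For illustration, a minimal Python sketch of this world construction is given below. It is not the authors' implementation; the tuple encoding of facts such as at(G, L) and the helper names are assumptions made purely for this example.

# Build the sequence of LTL worlds from per-time-step observations.
# Facts are encoded as tuples, e.g. ('at', 'G', 'L1') or ('in', 'L1', 'E').
def build_worlds(observations, background):
    """observations: list of sets of (good, location) pairs, one set per time step t.
    background: set of persistent facts (the background knowledge B).
    Returns the sequence of distinct worlds, each a set of facts."""
    worlds = []
    previous = None
    for obs_t in observations:
        if obs_t == previous:                      # observations unchanged -> no new world
            continue
        world = set(background)
        world.update(('at', good, loc) for good, loc in obs_t)
        worlds.append(world)
        previous = obs_t
    return worlds

# Example: good 'G' observed at 'L1' twice, then at 'L2' -> two distinct worlds.
B = {('in', 'L1', 'E'), ('in', 'L2', 'B')}
O = [{('G', 'L1')}, {('G', 'L1')}, {('G', 'L2')}]
assert len(build_worlds(O, B)) == 2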
D. In-Warehouse Processes
We now formalize the in-warehouse processes Admission, Take-out, and Redistribution:
• Admission – a good G is delivered to the warehouse’s entrance zone E and moved to the storage zone S via the buffer zone B. For all G ∈ G and L_i, L_j, L_k ∈ L:

Admission_{G,L_i,L_j,L_k} = at(G, L_i) ∧ in(L_i, E) → ♦ at(G, L_j) ∧ in(L_j, B) → ♦ at(G, L_k) ∧ in(L_k, S)    (6)

• Take-out – a good G is moved from the storage zone S to the outlet zone O via a picking zone P. For all G ∈ G and L_i, L_j, L_k ∈ L:

Takeout_{G,L_i,L_j,L_k} = at(G, L_i) ∧ in(L_i, S) → ♦ at(G, L_j) ∧ in(L_j, P) → ♦ at(G, L_k) ∧ in(L_k, O)    (7)

• Redistribution – a good G is moved within the storage zone S. For all G ∈ G and L_i, L_j ∈ L, i ≠ j:

Redistribution_{G,L_i,L_j} = at(G, L_i) ∧ in(L_i, S) → ♦ at(G, L_j) ∧ in(L_j, S)    (8)
Process detection can be posed as a model checking problem:
An in-warehouse process is detected when we can find a model (based on the sensory observations from the robot) that satisfies the corresponding formula. The history of a good is the chain of processes that the good is part of and can also be stated as a formula. A typical history for a good consists of an admission, zero or more redistributions, and a take-out.
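As a rough illustration of this model checking view, the sketch below (a simplification in Python, not the Prolog-based verifier used later) scans a finite sequence of worlds, encoded as fact sets as in the sketch above, for the Admission pattern of formula (6): the good is seen in the entrance zone, eventually in the buffer zone, and eventually in the storage zone. Take-out and Redistribution can be checked analogously by changing the zone sequence.

def located_in(world, good, zone):
    # True iff some location L has at(good, L) and in(L, zone) in this world.
    locs = {f[2] for f in world if f[0] == 'at' and f[1] == good}
    return any(('in', loc, zone) in world for loc in locs)

def detect_admission(worlds, good):
    # Return (start, end) indices of worlds witnessing an admission E -> B -> S, or None.
    for i in range(len(worlds)):
        if not located_in(worlds[i], good, 'E'):
            continue
        for j in range(i + 1, len(worlds)):          # eventually in the buffer zone
            if located_in(worlds[j], good, 'B'):
                for k in range(j + 1, len(worlds)):  # eventually in the storage zone
                    if located_in(worlds[k], good, 'S'):
                        return (i, k)
    return None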
E. Example
A good G entered the warehouse and was stored in the entrance zone E at position L_1 at time t_0. At t_1, it was moved to a location L_2 and at t_2 it was moved to L_3. All these locations are not close to one another. Let us assume that we observe the following from this process: We perceived G to be at L_1 at t_0, at L_2 at t_1 and at L_3 at t_4. See Fig. 3 for a depiction and the logical interpretations; to ease understanding, the worlds are labeled just like the time points. These observations constitute a model that satisfies (6), such that the observed process is an admission, starting in world t_1 and ending in world t_4, and we can also deduce that location L_2 is in the buffer zone and L_3 is in the storage zone. Note that the deduced start and end times differ from the real admission times: While the admission took place from t_0 to t_3, we detect it from observations t_1 to t_4; this is due to incomplete observation of the world.
Fig. 3. Example: Model verification for an admission process of good G (only the relevant assertions for each world t_0 … t_3 are shown). in(L_1, E) is background knowledge; also it is known that locations L_2 and L_3 are either part of the buffer zone (B) or the storage zone (S) but not close to one another, so that they cannot belong to the same zone. From this admission, refined knowledge about the buffer and storage zone can be inferred: in(L_2, B) ∧ in(L_3, S).
IV. IMPLEMENTATION
A. Mapping of Positions of Goods
We use visual tags to represent our observable features. To ease the evaluation, some tags are known to be static throughout the experiments. This allows the map constructed by the robot to be easily aligned with the ground-truth for evaluation. The positions of the tags relative to the camera are estimated using the tag detection routine provided by the ARToolkit software library 1 , for which we determined a measurement model. Positions of detected tags with a sufficient quality as well as odometry of the robot are fed into the TreeMap SLAM algorithm [9]. In contrast to [22] we deal with dynamic objects by using only one map layer in which we handle the movement of a good by simply comparing its current position measurement with its expected position. If the positions are too far apart (in our experiments >1 meter), the good is treated as having been moved and is added as a new feature into the SLAM algorithm. This results in a sequence of maps that contain position estimates and a mapping of goods to positions at each time step.
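The movement test described above amounts to a simple distance check. The sketch below is only illustrative; the slam object and its add_new_feature/update_feature methods are a hypothetical interface, not part of the TreeMap implementation.

import math

MOVE_THRESHOLD = 1.0   # metres, the value used in the experiments

def integrate_tag(tag_id, measured_xy, expected_xy, slam):
    dist = math.hypot(measured_xy[0] - expected_xy[0],
                      measured_xy[1] - expected_xy[1])
    if dist > MOVE_THRESHOLD:
        # the good is treated as having been moved: add it as a new feature
        slam.add_new_feature(tag_id, measured_xy)
    else:
        slam.update_feature(tag_id, measured_xy)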
B. From Positions to Locations
Measured positions are clustered after each step and the generated cluster centroids are used as qualitative locations. Therefore, the mapping of positions to clusters needs to stay fixed even when new centroids are generated by added data. We implemented two clustering methods for later comparison: The first clustering method assigns position estimates to predefined locations (shown in Fig. 4(a)). We used this method for evaluation purposes. The second clustering method computes locations automatically by employing a straightforward greedy algorithm: Positions are clustered together if their surrounding circle is below a certain size; otherwise a new cluster is created (shown for one test run in Fig. 4(b)). Each observation of a good is now attributed by a location and a time step (obs(t, G, L)), which is the starting point for the symbol grounding as described in Section III-C.
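A possible reading of the greedy clustering is sketched below. It is a simplification under several assumptions: 2D positions, Euclidean distance, and a distance-to-centroid test with a made-up radius standing in for the surrounding-circle criterion; once a point is assigned to a location, that assignment is never changed.

import math

class GreedyClusterer:
    def __init__(self, max_radius=0.5):              # assumed radius bound in metres
        self.max_radius = max_radius
        self.centroids = []                          # one prototypical position per location
        self.members = []

    def assign(self, x, y):
        # Return the location index for (x, y), creating a new location if needed.
        for idx, (cx, cy) in enumerate(self.centroids):
            if math.hypot(x - cx, y - cy) <= self.max_radius:
                self.members[idx].append((x, y))
                pts = self.members[idx]
                self.centroids[idx] = (sum(p[0] for p in pts) / len(pts),
                                       sum(p[1] for p in pts) / len(pts))
                return idx
        self.centroids.append((x, y))
        self.members.append([(x, y)])
        return len(self.centroids) - 1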
1 http://artoolkit.sourceforge.net/
C. From LTL–Worlds to Histories
As described at the end of Section III-D, histories of goods are also LTL formulas and as such can also be used during model verification. It is straightforward to implement the rules as Prolog clauses and let Prolog try to prove them. The connection of the worlds is realized by an ordered list, i.e., two worlds W_i and W_j are connected if W_j directly follows W_i in the list. We then use Prolog to constructively prove the existence of a history for each good by model verification. The history construction includes the deduction of zones, as demonstrated in the example shown in Fig. 3. In general, many histories can be verified for the same observations, e.g., moving a good from A to B to C verifies the model Redistribution_{G,A,B} ∧ Redistribution_{G,B,C}, but also Redistribution_{G,A,C}. In the latter case the observation that the good was at location B is ignored. Therefore, in ambiguous cases we select the maximal model, i.e., the history involving the largest number of observations.
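The preference for the maximal model reduces to a simple selection once the candidate histories and the number of observations each of them explains are available; the data layout below is assumed only for illustration.

def select_history(candidates):
    # candidates: list of (process_chain, num_explained_observations) pairs, e.g.
    # ([('Redistribution', 'A', 'B'), ('Redistribution', 'B', 'C')], 3)
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1])[0]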
V. EXPERIMENTS AND EVALUATION
In our experimental setting we simulated warehouse processes in our lab in order to measure to what extent histories can be identified correctly. 2
A. Experimental Setup
Our robot platform is an ActivMedia Pioneer-3 AT controlled by a top-mounted laptop and equipped with a SONY DFW-SX900 camera (approx. 160° FOV) that delivers 7.5 frames per second. In our lab we simulate a warehouse that consists of five dedicated zones (entrance, buffer, storage, picking, outlet) as seen in Fig. 4(a) and 4(c). 15 tags are distributed within the environment as static landmarks. Goods are represented by boxes with visual tags attached to all sides. An experiment consists of a series of movements of goods between the zones while our robot is monitoring the environment. For each of the 10 test runs, the robot was placed in the lab and driven around until each landmark had been seen at least twice to ensure a fully mapped environment. Then, we
2 Video material of a test run is available at http://www.sfbtr8.
uni-bremen.de/project/r3/ECMR/index.html
(a) Floor plan with zones  (b) Automatically generated clusters  (c) The lab (picture taken from the outlet zone)
Fig. 4. Experimental Setup: The warehouse (6.12 m × 7 m) with 5 zones (Entrance, Buffer, Storage, Picking, and Outlet). In (a) the fixed cluster centers are shown as red stars and the landmarks as blue dots with their corresponding letters. In (b) measurements of a single experimental run are shown as black × and the automatically generated clusters as blue circles.
Fig. 5. Evaluation results showing the rate of correct history detections. We evaluated 10 test runs which included a total number of 21 full histories and 18 partial histories.
steered the robot in arbitrary round courses, while we moved boxes through the lab, simulating the previously defined logistic processes (Sect. III-D). The duration of a test run was between approx. 11 and 28 minutes in which we moved 3 to 8 goods through the warehouse, resulting in 4 to 19 detectable processes per run including runs with only partial histories. Goods were moved between zones while not covered by sensor surveillance, to comply with axiom 2 in section III-B. The robot was driven by hand in the experiments. As mentioned in section IV-B, we evaluated our approach with two different clustering methods, each one with 3 different settings of background knowledge. In the first setting all zones are previously known, in the second setting only entrance and outlet are known and in the third one the whereabouts of no zone are known.
B. Evaluation
We evaluate our approach based on correctly identified histories. For each good we query its history, i.e., we run the model verification to generate it. A history is correctly identified if the temporal order and the number of processes match the ground truth.
Fig. 5 shows the result of our evaluation. In the most favorable case of knowing all zones and predefined cluster centroids we achieve an average recognition rate of 83%. The experiments comprise 21 full histories and 18 partial ones. In partial histories, a good either started in the warehouse or, after its admission, never left it again. Our current interpretation prefers full histories over partial histories and is biased towards an empty starting warehouse, i.e., if the observations verify both admission and take-out we prefer the admission. Especially in the case of having no prior knowledge we found that partial histories reduced the detection rate. In particular, with automatically generated centroids and no prior knowledge about the zones, 37.9% of the full histories were correctly found, but only 13.3% of the partial histories were correctly found. A significant difference can be observed between the two clustering methods, but both follow the same pattern: additional prior knowledge results in more correctly identified histories. If no zone is known, i.e. all zones needed to be inferred, the results show that the approach is still capable of correctly identifying histories. This clearly demonstrates the utility of inference in process recognition.
VI. DISCUSSION
Our work targets online process detection and online queries while the robot is operating. Thus, we rely on observations of goods as soon as we detect them, even if the position estimate is still uncertain. Over time, stability of positions is achieved by clustering them into locations. Every (new) perception of a good at a different location (immediately) triggers the creation of a new world. Poor position estimates (for example, when few tags are detected due to motion blur while the robot turns) can easily be mapped to locations that incorrectly induce movement of a good or lie outside of the zone the good is in. Such cases result in incorrectly detected histories. The results in Fig. 5 confirm this: When providing stable, pre-defined cluster centers, detection rates are significantly higher, especially when more domain knowledge is included. Thus, excluding estimates with too much uncertainty would improve the detection rate. Using uncertainty estimates for measured positions will also improve
the robustness of geometrical shape estimation for the zones. However, the current implementation of the TreeMap SLAM algorithm 3 does not provide uncertainty estimates. In a real-world environment it is reasonable to assume knowledge about entrance and outlet zones (e.g., by placing tags to mark the end of the warehouse). The observable difference between knowing all zones and knowing only entrance and outlet is relatively small, especially when predefined clusters are used (83% and 75%, respectively). These results illustrate the feasibility of our approach. In this work, we currently restrict ourselves to using inference only on sensory observations. As stated before, the detection of correct histories improves with better clustering (e.g., by using outlier detection). To query more complex information it would be reasonable to also include knowledge gained within the mapping process. That is, information on goods we have observed before and included in the map, but that we are not able to perceive at the very moment. For these objects, we have a strong belief of their existence and position, but this belief cannot, according to the actual observation, be validated. A possibility to include reasoning on such beliefs is to use a logic that provides a modal belief operator, such as the logic for BDI agents presented in [16]. Another source of information for more complex queries could be provided by an ontology, as shown in [15]. Our logic foundation also supports multiple instances of the same type of good, e.g. splitting or merging packages for delivery. However, due to the limited size of our lab, we did not include this feature in our experiments.
VII. SUMMARY
In this paper we propose an approach to process detection based on a specification of processes as temporal logic formulas. We show in our evaluation that our approach is applicable using real sensory data from a mobile robot. One strength of our approach is that it can fill in missing pieces of information by reasoning about processes and spatial configurations in the same formalism. It is also possible to query about previously unspecified processes as well as about spatial facts, such as functional zones. Basing our approach on the well-established linear temporal logic not only works for passive process detection but would also allow us to incorporate so-called search control knowledge and perform high-level planning [4], i.e., doing active process detection in the sense of planning where to go for more information. This is the objective of our future research.
REFERENCES
[1] Marco Antoniotti and Bud Mishra. Discrete event models + temporal logic = supervisory controller: Automatic synthesis of locomotion controllers. In Proceedings of the IEEE Conference on Robotics and Automation (ICRA), volume 2, pages 1441–1446, 1995.
[2] D. Avrahami-Zilberbrand, G. A. Kaminka, and H. Zarosim. Fast and complete symbolic plan recognition: Allowing for duration, interleaved execution, and lossy observations. In Proceedings of the AAAI Workshop on Modeling Others from Observations (MOO). Citeseer, 2005.
3 as provided at http://openslam.org/treemap.html
[3] Dorit Avrahami-Zilberbrand. Efficient Hybrid Algorithms for Plan Recognition and Detection of Suspicious and Anomalous Behavior. PhD thesis, Bar Ilan University, 2009.
[4] Fahiem Bacchus and Froduald Kabanza. Using temporal logics to express search control knowledge for planning. Artificial Intelligence, 116(1-2):123–191, 2000.
[5] Maria-Florina Balcan and Avrim Blum. A discriminative model for semi-supervised learning. Journal of the ACM (JACM), 57(3):1–46, 2010.
[6] Maren Bennewitz, Wolfram Burgard, Grzegorz Cielniak, and Sebastian Thrun. Learning motion patterns of people for compliant robot motion. The International Journal of Robotics Research (IJRR), 24(1):39–41, 2005.
[7] Marcello Cirillo, Lars Karlsson, and Alessandro Saffiotti. A human-aware robot task planner. In Alfonso Gerevini, Adele E. Howe, Amedeo Cesta, and Ioannis Refanidis, editors, Proceedings of the 11th International Conference on Automated Planning and Scheduling (ICAPS). AAAI, 2009.
[8] Christophe Dousson and Pierre Le Maigat. Chronicle recognition improvement using temporal focusing and hierarchization. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), pages 324–329, 2007.
[9] Udo Frese. An O(log n) Algorithm for Simultaneous Localization and Mapping of Mobile Robots in Indoor Environments. PhD thesis, University of Erlangen-Nürnberg, 2004.
[10] Torsten Hildebrandt, Lutz Frommberger, Diedrich Wolter, Christian Zabel, Christian Freksa, and Bernd Scholz-Reiter. Towards optimization of manufacturing systems using autonomous robotic observers. In Proceedings of the 7th CIRP International Conference on Intelligent Computation in Manufacturing Engineering (ICME), June 2010.
[11] Marius Kloetzer and Calin Belta. LTL planning for groups of robots. In Proceedings of the IEEE International Conference on Networking, Sensing and Control (ICNSC), pages 578–583, 2006.
[12] Marius Kloetzer and Calin Belta. Automatic deployment of distributed teams of robots from temporal logic motion specifications. IEEE Transactions on Robotics, 26(1):48–61, 2010.
[13] Lin Liao, Donald J. Patterson, Dieter Fox, and Henry Kautz. Learning and inferring transportation routines. Artificial Intelligence, 171(5-6):311–331, 2007.
[14] Tommaso Magherini, Guido Parente, Christopher D. Nugent, Mark P. Donnelly, Enrico Vicario, Frederico Cruciani, and Cristiano Paggetti. Temporal logic bounded model-checking for recognition of activities of daily living. In Proceedings of the 10th IEEE International Conference on Information Technology and Applications in Biomedicine (ITAB), Corfu, Greece, November 2010.
[15] Fulvio Mastrogiovanni, Antonio Sgorbissa, and Renato Zaccaria. Context assessment strategies for ubiquitous robots. In IEEE International Conference on Robotics and Automation (ICRA), pages 2717–2722, 2009.
[16] John-Jules Meyer and Frank Veltman. Intelligent agents and common sense reasoning. In Patrick Blackburn, Johan Van Benthem, and Frank Wolter, editors, Handbook of Modal Logic, volume 3 of Studies in Logic and Practical Reasoning, chapter 18, pages 991–1029. Elsevier, 2007.
[17] Amir Pnueli. The temporal logic of programs. In Proceedings of the 18th Annual Symposium on Foundations of Computer Science (FOCS), pages 46–57, 1977.
[18] Stephen L. Smith, Jana Tumová, Calin Belta, and Daniela Rus. Optimal path planning under temporal logic constraints. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3288–3293, Taipei, Taiwan, October 2010.
[19] Michael Ten Hompel and Thorsten Schmidt. Management of Warehouse Systems, chapter 2, pages 13–63. Springer, Berlin Heidelberg, 2010.
[20] Trung-Dung Vu, Julien Burlet, and Olivier Aycard. Grid-based localization and local mapping with moving object detection and tracking. Information Fusion, 12:58–69, January 2011.
[21] Chieh-Chih Wang, Charles Thorpe, Sebastian Thrun, Martial Hebert, and Hugh Durrant-Whyte. Simultaneous localization, mapping and moving object tracking. The International Journal of Robotics Research, 26:889–916, September 2007.
[22] Denis F. Wolf and Gaurav S. Sukhatme. Mobile robot simultaneous localization and mapping in dynamic environments. Autonomous Robots, 19:53–65, 2005. doi:10.1007/s10514-005-0606-4.
[23] Qiang Yang. Activity recognition: Linking low-level sensors to high level intelligence. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI), pages 20–25, Pasadena, CA, July 2009.
Plan-based Object Search and Exploration Using Semantic Spatial Knowledge in the Real World
Alper Aydemir∗, Moritz Göbelbecker†, Andrzej Pronobis∗, Kristoffer Sjöö∗, Patric Jensfelt∗
∗ Centre for Autonomous Systems, Royal Institute of Technology, Stockholm, Sweden
† Institut für Informatik, Albert-Ludwigs-Universität Freiburg, Germany
Abstract — In this paper we present a principled planner-based approach to the active visual object search problem in unknown environments. We make use of a hierarchical planner that combines the strength of decision theory and heuristics. Furthermore, our object search approach leverages conceptual spatial knowledge in the form of object co-occurrences and semantic place categorisation. A hierarchical model for representing object locations is presented with which the planner is able to perform indirect search. Finally, we present real world experiments to show the feasibility of the approach.
Index Terms— Active Sensing, Object Search, Semantic Mapping, Planning
I. INTRODUCTION
Objects play an important role when building a semantic representation and an understanding of the function of space [14]. Key tasks for service robots, such as fetch-and-carry, require a robot to successfully find objects. It is evident that such a system cannot rely on the assumption that all objects relevant to the current task are already present in its sensory range. It has to actively change its sensor parameters to bring the target object into its field of view. We call this problem active visual search (AVS). Although researchers began working on the problem of visually finding a relatively small object in a large environment as early as 1976 at SRI [4], the issue is often overlooked in the field. A commonly stated reason for this is that underlying problems such as reliable object recognition and mapping pose hard enough challenges. However, as the field furthers its aim to build robots acting in realistic environments, this assumption needs to be relaxed. The main contribution of this work is a method to relinquish the above-mentioned assumption.
A. Problem Statement
We define the active visual object search problem as an agent localizing an object in a known or unknown 3D environment by executing a series of actions with the lowest total cost. The cost function is often defined as the time it takes to complete the task or the distance traveled. Let the environment be Ω and Ψ be the search space, with Ψ ⊆ Ω. Also let P_o(Ψ) be the probability distribution for the position of the center of the target object o, defined as a function over Ψ. The agent can execute a sensing action s in
This work was supported by the SSF through its Centre for Autonomous Systems (CAS), and by the EU FP7 project CogX.
the reachable space of Ψ. In the case of a camera as the sensor,
s is characterised by the camera position (x_c, y_c, z_c), pan-tilt angles (p, t), focal length f and a recognition algorithm a; s = s(x_c, y_c, z_c, p, t, f, a). The part of Ψ covered by s is
called a viewcone. In practice, a has an effective region in which reliable recognition or detection is achieved. For the i-th viewcone we call this region V_i. Depending on the agent's level of a priori knowledge of Ψ and P_o(Ψ) there are three extreme cases of the AVS problem. If both Ψ and P_o(Ψ) are fully known, then the problem is that of sensor placement and coverage maximization given limited field of view and cost constraints. If both Ψ and P_o(Ψ) are unknown, then the agent has an additional explore action as well. An exhaustive exploration strategy is not always optimal, i.e. the agent needs to select which parts of the environment to explore first depending on the target object's properties. Furthermore the agent needs to trade off between executing a sensing action and exploration at any given point. That is, should the robot search for the object o in the partially known Ψ, or explore further? This is classically known as the exploration vs. exploitation problem. When P_o(Ψ) is unknown (i.e. uniformly distributed) but Ψ is known (i.e. acquired a priori), the agent needs to gather information about the environment similar to the above case. However in this case, the exploration is for learning about the target-object-specific characteristics of the environment. Knowing Ψ also means that the robot can reason whether or not to execute a costly search action at the current position, or move to another more promising region of space. The rare case where P_o(Ψ) is fully known but Ψ is unknown is not practically interesting to the scope of this paper. So far, we have examined the case where the target object is an instance. The implication of this is that P_o(Ψ) + P_o(Ω \ Ψ) = 1, therefore observing V_i has an effect on P_o(Ψ \ V_i). However this is not necessarily true if instead the agent is searching for any member of an object category and the number of them is not known in advance. Therefore knowing whether the target object is a unique instance or a member of an object category is an important factor in search behavior. Recently there has been an increasing amount of work on acquiring semantic maps. Semantic maps have parts of the environment labeled, representing various high-level concepts and functions of space. Exploring and building a semantic map while performing AVS contributes to the estimation of P_o(Ψ). The semantic map provides information that can be exploited
by leveraging common-sense conceptual knowledge about
indoor environments. This knowledge describes, for example, how likely it is that plates are found in kitchens, that a mouse and a computer keyboard occur in the same scene and that corridors typically connect multiple rooms. Such information offers valuable information in limiting the search space. The sources for those can be from online common-sense databases or world wide web among others. Acknowledging the need to limit the search space and integrate various cues to guide the search, [4] proposed indirect search. Indirect search as
a search strategy is a simple and powerful idea: find another object first and then use it to facilitate finding the target object, e.g. finding a table first while looking for a landline phone. Tsotsos [13] approached the problem by analyzing the complexity of the AVS problem and showed that it is NP-hard. Therefore we must adhere to a heuristics-based solution. Ye [15] formulated the problem in a probabilistic framework. In this work we consider the case where Ψ and P_o(Ψ) are both unknown. However, the robot is given probabilistic default knowledge about the relations between objects and the occurrences of objects in different room categories, following our previous work [1, 6].
B. Contributions
The contributions of this work are fourfold. First, we provide the domain adaptation of a hierarchical planner to address the AVS problem. Second, we show how to combine semantic cues to guide the object search process in a more complex and larger environment than found in previous work. Third, we start with an unknown map of the environment and provide an exploration strategy which takes the object search task into account. Fourth, we present real-world experiments searching for multiple objects in a large office environment, and show how the planner adapts the search behavior depending on the current conditions.
C. Outline
The outline of this paper is as follows. First we present how the AVS problem can be formulated in a principled way using a planning approach (Section II). Section III provides the motivation for and structure of various aspects of our spatial representation. Finally we showcase the feasibility of our approach in real world experiments (Section IV).
II. PLANNING
For a problem like AVS, which entails probabilistic action outcomes and world state, the robot needs to employ a planner to generate flexible and intelligent search behavior that trades off exploitation versus exploration. In order to guarantee optimality, a POMDP planner can be used, i.e. a decision theoretic planner that can accurately trade different costs against each other and generate the optimal policy. However, this is only tractable when a complex problem like AVS is applied to very small environments. Another type of planner is the classical AI planner, which requires perfect knowledge about the environment. This is not the case here, since both Ψ and P_o(Ψ) are unknown.
A variation of the classical planners are the so-called continual planners that interleave planning and plan monitoring in order to deal with uncertain or dynamic environments [3]. The basic idea behind the approach is to create a plan that might reach the goal and to start executing that plan. This initial plan takes into account success probabilities and action costs; however, it is optimistic in nature. A monitoring component keeps track of the execution outcome and notifies the planner in the event of the current plan becoming invalid (either because the preconditions of an action are no longer satisfied or because the plan does not reach the goal anymore). In this case, a new plan is created with the updated current state as the initial state and execution starts again. This continues until either the monitoring component detects that the goal has been reached or no plan can be found anymore. In this paper we make use of a so-called switching planner. It combines two different domain-independent planners for different parts of the task: a classical continual planner to decide the overall strategy of the search (for which objects to search in which location) and a decision theoretic planner to schedule the low-level observation actions using a probabilistic sensing model. Both planners use the same planning model and are tightly integrated. We first give a brief description of the switching planner. We focus on the use of the planner in this paper and refer the reader to [5] for a more detailed description. We will also present the domain modeling for the planner, and give further details on various aspects of knowledge that the planner makes use of.
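The plan-execute-monitor-replan cycle described above can be summarised by the following schematic loop. The planner, executor and monitor objects are placeholders sketched for illustration; they are not the actual interfaces of the system.

def continual_planning(planner, executor, monitor, goal, state):
    while True:
        plan = planner.make_plan(state, goal)        # optimistic plan that might reach the goal
        if plan is None:
            return False                             # no plan can be found anymore
        for action in plan:
            state = monitor.update(state, executor.execute(action))
            if monitor.goal_reached(state, goal):
                return True
            if not monitor.plan_still_valid(plan, state, goal):
                break                                # plan became invalid -> replan from new state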
A. Switching Planner
1) Continual Planner (CP): We build our planning framework on an extended SAS+ [2] formalism. As a base for the continual planner, we use Fast Downward [7]. Because our knowledge of the world and the effects of our actions are uncertain, we associate a success probability p(a) with every action a. In contrast to more expressive models like MDPs or even POMDPs, actions do not have multiple possible outcomes; they can only succeed with probability p(a) or fail with probability 1 − p(a). The goal of the planner is then to find a plan π that reaches the goal with a low cost. In classical planning the cost function is usually either the number of actions in a plan or the sum of all actions' costs. Here we chose a function that resembles the expected reward, adjusted to our restricted planning model. With p(π) = ∏_{a∈π} p(a) as the plan's total success probability and cost(π) = ∑_{a∈π} cost(a) as the total cost, we get for the optimal plan π*:

π* = argmin_π cost(π) + R(1 − p(π))

where a is an action and the constant R is the reward the planner is given for achieving the goal. For small values of R the planner will prefer cheaper but more unlikely plans; for larger values more expensive plans will be considered.
Assumptions: The defining feature of an exploration problem is that the world's state is uncertain. Some planning frameworks such as MDPs allow the specification of an initial state distribution. We choose not to do this for two different reasons: a) having state distributions would be too strong a departure from the classical planning model and b) the typical exploration problems we deal with have too many possible
states to express explicitly. We therefore use an approach we
call assumptive actions that allow the planner to construct parts
of the initial state on the fly, and which allows us to map the
spatial concepts to the planning problem in an easy way. 2) Decision Theoretic (DT) Planner: When the continual planner reaches a sensing action (e.g. search location1 for an object2), we create a POMDP that only contains the parts of the state that are relevant for that subproblem. This planner can only use the MOVE and PROCESSVIEWCONE actions explained in Section II-B.2. The DT planner operates in
a closed-loop manner, sending actions to be executed and
receiving observations from the system. Once the DT planner
either confirms or rejects a hypothesis, it returns control back
to the continual planner, which treats the outcome of the DT
session like the outcome of any other action.
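To make the cost metric of the continual planner more tangible, the small numeric example below evaluates cost(π) + R(1 − p(π)) for two invented plans; the action costs and success probabilities are made up. For a small reward R the cheaper but riskier plan wins, for a larger R the more expensive but safer one does.

from math import prod

def plan_value(actions, reward):
    # actions: list of (cost, success_probability) pairs
    total_cost = sum(c for c, _ in actions)
    success = prod(p for _, p in actions)
    return total_cost + reward * (1.0 - success)

cheap_but_risky = [(2.0, 0.5), (3.0, 0.6)]        # cost 5,  p(plan) = 0.30
expensive_but_safe = [(6.0, 0.95), (7.0, 0.9)]    # cost 13, p(plan) = 0.855

print(plan_value(cheap_but_risky, 10.0), plan_value(expensive_but_safe, 10.0))    # 12.0  14.45
print(plan_value(cheap_but_risky, 100.0), plan_value(expensive_but_safe, 100.0))  # 75.0  27.5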
B. Domain Modeling
We need to discretize the involved spaces (object location, spatial model and actions) to make a planner approach applicable to the AVS problem. Most methods make use of discretizations as a way to handle the NP-hard nature of the problem. 1) Representing space: For the purposes of obstacle avoidance, navigation and sensing action calculation, Ψ is represented as a 3D metric map. Ψ is discretised into volumetric cells c_i so that Ψ = c_0 ∪ … ∪ c_i. Each cell represents the occupancy with the attributes OCCUPIED, FREE or UNKNOWN, as well as the probability of the target object's center being in that cell. However, further abstraction is needed to achieve reliable and fast plan calculation, as the number of cells can be high.
For this purpose we employ a topological representation of Ψ called place map, see Fig 1(a). In the place map, the world is represented by a finite number of basic spatial entities called places created at equal intervals as the robot moves. Places are connected using paths which are discovered by traversing the space between places. Together, places and paths represent the topology of the environment. This abstraction is also useful for a planner since metric space would result in a largely intractable planning state space. The places in the place map are grouped into rooms. In the case of indoor environments, rooms are usually separated by doors or other narrow openings. Thus, we propose to use a door detector and perform reasoning about the segmentation of space into rooms based on the doorway hypotheses. We use a template-based door detection algorithm which matches a door template to each acquired laser scan. This creates door hypotheses which are further verified by the robot passing through a narrow opening. In addition, unexplored space is represented in the place map using hypothetical places called placeholders defined in the boundary between free and unknown space in the metric map. We represent object locations not in metric coordinates but
in relation to other known objects or rooms to achieve further
abstraction. The search space is considered to be divided into locations L. A location is either a room R or a related space. Related spaces are regions connected with a landmark object o, either in or on the landmark (see [1] for more details). The related space "in" o is termed I_o and the space "on" o is termed O_o.
2) Modeling actions: The planner has access to three physical actions: MOVE can be used to move to a place or placeholder, CREATEVIEWCONES creates sensing actions for an object label in relation to a specified location, and PROCESSVIEWCONE executes a sensing action. Finally, there is the virtual SEARCHFOROBJECT action that triggers the decision theoretic planner. 3) Virtual objects: There are two aspects of exploration in the planning task: we are searching for an (at that moment) unknown object, which may include the search for support objects as an intermediate step. But the planner may also need to consider the utility of exploring its environment in order to find new rooms in which finding the goal object is more likely. Because the planners we use employ the closed world assumption, adding new objects as part of the plan is impossible. We therefore add a set of virtual objects to the planning problem that can be instantiated by the planner as required by the plan. This approach will fail for plans that require finding more objects than pre-allocated, but this is not a problem in practice. The monitoring component tries to match new (real) objects to virtual objects that occur in the plan. This allows us to deliver the correct observations to the DT planner and avoid unnecessary replanning. 4) Probabilistic spatial knowledge: The planner makes use of the following probabilistic spatial knowledge in order to generate sensible plans:
• P_category(room_i) defines the distribution over room categories that the robot has a model for, for a given room, integrated over the places that belong to room_i. The planner uses this information to decide whether to plan for a SEARCHFOROBJECT action or explore the remaining placeholders.
• P_category(placeholder_i) represents the probability distribution of a placeholder turning into a new room of a certain category upon exploration. Using this distribution, the planner can choose to explore one placeholder instead of another, or plan to launch the search altogether.
• P(ObjectAtL) gives the probability of an object o being at location L. The calculation of these probabilities is explained in more detail in Section III.
III. SPATIAL REPRESENTATION
5) Conceptual Map: All higher level inference is performed in the so-called conceptual map which is represented by a graphical model. It integrates the conceptual knowledge (food items are typically found in kitchens) with instance knowledge (the rice package is in room4). We model this in a chain graph [8], whose structure is adapted online according to the state of the underlying topological map. Chain graphs provide a natural generalisation of directed (Bayesian Networks) and undirected (Markov Random Fields) graphical models, allowing us to model both "directed" causal as well as "undirected" symmetric or associative relations. The structure of the chain graph model is presented in Fig. 2. Each discrete place is represented by a set of random variables connected to variables representing the semantic category of a room. Moreover, the room category variables are connected
Fig. 1. (a) The place map with several places and 3 detected doors. (b) Two placeholders with different probabilities for turning into a new room; one of them is behind a door hypothesis and therefore has a higher probability of leading into a new room. Colors on the circular discs indicate the …
Fig. 2. Schematic image of the chain graph.
3) … as Poisson processes over cells c_0 … c_i per location L. In other words, each location has the possibility of containing, independently of all other locations, a number n_c of objects of a class c with probability

P(n_c = k) = (λ_{L,c}^k · e^{−λ_{L,c}}) / k!    (1)
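A quick numeric check of the Poisson model above; the expected count λ_{L,c} = 0.5 is an arbitrary example value.

from math import exp, factorial

def p_count(k, lam):
    # probability of exactly k objects of a class at a location with expected count lam
    return (lam ** k) * exp(-lam) / factorial(k)

print(p_count(0, 0.5), p_count(1, 0.5), p_count(2, 0.5))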
IV. EXPERIMENTS
Experiments were carried out on a Pioneer III wheeled robot, equipped with a Hokuyo URG laser scanner and a camera mounted at 1.4 m above the floor. Experiments took place in a 12×8 m environment with 3 different rooms, kitchen, office1 and office2, connected by a corridor. The robot had models of all objects it searches for before each search run. 3 different objects (cerealbox, stapler and whiteboardmarkers) were used during the experiments. The BLORT framework was used to detect objects [10]. To highlight the flexibility of the planning framework, we evaluated the system with 6 different starting positions, tasked with finding different objects in an unknown environment. We refer the reader to http://www.csc.kth.se/~aydemir/avs.html for videos. Each sub-figure in Fig. 3 shows the trajectory of the robot. The color-coded trajectory indicates the room category as perceived by the robot: red is kitchen, green is corridor and blue is office. The two green arrows denote the current position and the start position of the robot. In the following we give a brief explanation of what happened in the different runs.
• Fig. 3(a) Starts: corridor, Target: cerealbox in kitchen
The robot starts by exploring the corridor. The robot finds a doorway on its left; the placeholder behind it has a higher probability of leading into a kitchen, and the robot enters office1. As the robot acquires new observations, the CP's kitchen assumption is violated. The robot returns to exploring the corridor until it finds the kitchen door. Here the CP's assumptions are validated and the robot searches this room. The DT planner plans a strategy of first finding a table and then the target object on it. After finding a table, the robot generates view cones for the O_{table,cornflakes} location. The cerealbox object is found.
• Fig. 3(b) Starts: office2, Target: cerealbox in kitchen Unsatisfied with the current room’s category, the CP commits to the assumption that exploring placeholders in the corridor will result in a room with category kitchen. The rest proceeds as in Fig. 3(a).
• Fig. 3(c) Starts: corridor Target: cerealbox in kitchen The robot explores until it finds office2. Upon entry the robot categorises office2 as kitchen, but after further exploration office2 is categorised correctly. The robot switches back to exploration and, since the kitchen door is closed, it passes the kitchen and finds office1. Not satisfied with office1, the robot gives up, since the success probabilities of all possible plans are smaller than a given threshold value.
• Fig. 3(d) Starts: office1 Target: stapler in office2 After failing to find the object in office1, the robot notices the open door but, finding that the room behind it is kitchen-like, decides not to search it. This time the stapler object is found in office2.
• Fig. 3(e) Starts: kitchen Target: cerealbox in kitchen As before it tries locating a table, but in this case all table objects have been eliminated beforehand; failing
to detect a table the robot switches to looking for a
counter. Finding no counter either, it finally goes out in the corridor to look for another kitchen and upon failing that, gives up. • Fig. 3(f) Starts: corridor Target: whiteboardmarker in
office1
The robot is started in the corridor and driven to the kitchen by a joystick; thus in this case the environment
is largely explored already when the planner is activated
and asked to find a whiteboardmarker object. The part
of the corridor leading to office2 has been blocked. The robot immediately finds its way to office1 and launches
a search which results in a successful detection of the
target object. In the following, we describe the planning decisions in more detail for a run similar to the one described in Fig. 3(a), with the main difference being that the cereals could not be found in the end due to a false negative detection. The first plan, with the robot starting out in the middle of the corridor, looks as follows:
ASSUME-LEADS-TO-ROOM place1 kitchen
ASSUME-OBJECT-EXISTS table IN new-room1 kitchen
ASSUME-OBJECT-EXISTS cerealbox ON new-object1 table kitchen
MOVE place1
CREATEVIEWCONES table IN new-room1
SEARCHFOROBJECT table IN new-room1 new-object1
CREATEVIEWCONES cerealbox ON new-object1
SEARCHFOROBJECT cerealbox ON new-object1 new-object2
REPORTPOSITION new-object2
Here we see several virtual objects being introduced: The first action assumes that place1 leads to a new room new-room1 with category kitchen. The next two assumptions hypothesize that a table exists in the room and that cornflakes exist on that table. The rest of the plan is rather straightforward: create view cones and search for the table, then create view cones and search for the cereal box. Execution of that plan leads to frequent replanning, as the first assumption is usually too optimistic: most placeholders do not directly lead to a new room, but require a bit more exploration. After following the corridor, the robot does find the office, and returns to the corridor to explore into the other direction. It finally finds a room which has a high likelihood of being a kitchen.
ASSUME-CATEGORY room3 kitchen
ASSUME-OBJECT-EXISTS table IN room3 kitchen
ASSUME-OBJECT-EXISTS cerealbox ON new-object1 table kitchen
MOVE place17
MOVE place18
MOVE place16
CREATEVIEWCONES table IN room3
SEARCHFOROBJECT table IN room3 new-object1
CREATEVIEWCONES cerealbox ON new-object1
SEARCHFOROBJECT cerealbox ON new-object1 new-object2
The new plan looks similar to the first one, except that we do not assume the existence of a new room but the category of an existing one.
Fig. 3. Trajectories taken by the robot in multiple experiments, subfigures (a)–(f).
Also, the robot cannot start creating view cones immediately because a precondition of the CREATEVIEWCONES action is that the room must be fully
explored, which involves exploring all remaining placeholders in the room. After view cones are created, the decision theoretic planner
is invoked. We used a relatively simple sensing model, with a
false negative probability of 0.2 and a false positive probability of 0.05 – these are educated guesses, though. The DT planner starts moving around and processing view cones until it eventually detects a table and returns to the continual planner. At this point the probability of the room being a kitchen is so high that it is considered certain by the planner. With much of the initial uncertainty removed, the final plan is straightforward:
ASSUME-OBJECT-EXISTS cerealbox ON object1 table kitchen
CREATEVIEWCONES cerealbox ON object1
SEARCHFOROBJECT cerealbox ON object1 new-object2
REPORTPOSITION new-object2
During the run, the continual planner created 14 plans in total, taking 0.2 – 0.5 seconds per plan on average. The DT planner was called twice, and took about 0.5 – 2 seconds per action it executed.
V. CONCLUSIONS AND FUTURE WORK
In this paper we have presented a planning approach to active object search. We made use of a switching planner, combining a classical continual planner with a decision theoretic planner. We provide a model for the planning domain appropriate for the planner and show by experimental results that the system is able to search for objects in a real-world office environment, making use of both low-level sensor percepts and high-level conceptual and semantic information. Future work includes incorporating 3D shape cues to guide the search and
a specialized planner for the AVS problem.
REFERENCES
[1] Alper Aydemir, Kristoffer Sjöö, John Folkesson, and Patric Jensfelt. Search in the real world: Active visual object search based on spatial relations. In IEEE International Conference on Robotics and Automation (ICRA), May 2011.
[2] C. Bäckström and B. Nebel. Complexity results for SAS+ planning. Comp. Intell., 11(4):625–655, 1995.
[3] Michael Brenner and Bernhard Nebel. Continual planning and acting in dynamic multiagent environments. Journal of Autonomous Agents and Multiagent Systems, 19(3):297–331, 2009.
[4] Thomas D. Garvey. Perceptual strategies for purposive vision. Technical Report 117, AI Center, SRI International, 333 Ravenswood Ave., Menlo Park, CA 94025, Sep 1976.
[5] Moritz Göbelbecker, Charles Gretton, and Richard Dearden. A switching planner for combined task and observation planning. In Twenty-Fifth Conference on Artificial Intelligence (AAAI-11), August 2011.
[6] Marc Hanheide, Charles Gretton, Richard W. Dearden, Nick A. Hawes, Jeremy L. Wyatt, Andrzej Pronobis, Alper Aydemir, Moritz Göbelbecker, and Hendrik Zender. Exploiting Probabilistic Knowledge under Uncertain Sensing for Efficient Robot Behaviour. In Proc. Int. Joint Conf. on Artificial Intelligence (IJCAI), 2011.
[7] Malte Helmert. The Fast Downward planning system. Journal of Artificial Intelligence Research, 26:191–246, 2006.
[8] S. L. Lauritzen and T. S. Richardson. Chain graph models and their causal interpretations. J. Roy. Statistical Society, Series B, 64(3):321–348, 2002.
[9] J. M. Mooij. libDAI: A free and open source C++ library for discrete approximate inference in graphical models. J. Mach. Learn. Res., 11:2169–2173, August 2010.
[10] T. Mörwald, J. Prankl, A. Richtsfeld, M. Zillich, and M. Vincze. BLORT - The blocks world robotic vision toolbox. In Workshop on Best Practice in 3D Perception and Modeling for Mobile Manipulation at ICRA 2010, 2010.
[11] Andrzej Pronobis and Patric Jensfelt. Hierarchical multi-modal place categorization. In submitted to ECMR'11, 2011.
[12] Andrzej Pronobis, Oscar M. Mozos, Barbara Caputo, and Patric Jensfelt. Multi-modal semantic place classification. The International Journal of Robotics Research (IJRR), Special Issue on Robotic Vision, 29(2-3):298–320, February 2010.
[13] J. K. Tsotsos. On the relative complexity of active vs. passive visual search. International Journal of Computer Vision, 7(2):127–141, 1992.
[14] S. Vasudevan and R. Siegwart. Bayesian space conceptualization and place classification for semantic maps in mobile robotics. Robot. Auton. Syst., 56:522–537, June 2008.
[15] Yiming Ye and John K. Tsotsos. Sensor planning for 3D object search. Comput. Vis. Image Underst., 73(2):145–168, 1999.
Comparative Evaluation of Range Sensor Accuracy in Indoor Environments
Todor Stoyanov, Athanasia Louloudi, Henrik Andreasson and Achim J. Lilienthal
Center of Applied Autonomous Sensor Systems (AASS), Örebro University, Sweden
Abstract — 3D range sensing is one of the important topics in robotics, as it is often a component in vital autonomous subsystems like collision avoidance, mapping and semantic perception. The development of affordable, high frame rate and precise 3D range sensors is thus of considerable interest. Recent advances in sensing technology have produced several novel sensors that attempt to meet these requirements. This work is concerned with the development of a holistic method for accuracy evaluation of the measurements produced by such devices. A method for comparison of range sensor output to a set of reference distance measurements is proposed. The approach is then used to compare the behavior of three integrated range sensing devices to that of a standard actuated laser range sensor. Test cases in an uncontrolled indoor environment are performed in order to evaluate the sensors' performance in a challenging, realistic application scenario.
I. INTRODUCTION
In recent years, a multitude of range sensing devices have become available at more affordable costs. Notably, 2D laser range sensors have demonstrated reliable performance and therefore have been widely used for both research and industrial applications. As the complexity of application scenarios considered in mobile robotics increases, so does the use of 3D range sensors. Although precise commercial 3D laser sensors are available, their prohibitively high cost has limited their use. Actuated laser range finders (aLRF) have thus been the most widely used 3D range sensors in the mobile robotics community. Usually aLRF sensors utilize commercially available 2D laser sensors of high accuracy and well known precision, resulting in a reliable measurement system. Nevertheless, aLRF sensors have several disadvantages — the lack of commercial availability of an integrated system, slow refresh rates on the order of 0.1 Hz, a high weight and a high number of moving parts. Several competing sensor technologies that attempt to solve the problems of aLRF systems have been proposed. Recently, time-of-flight (ToF) and structured light cameras have become more available and more widely used. ToF sensors operate by emitting modulated infrared light and measuring the phase shift of the reflected signal. Typically, ToF cameras can deliver dense range measurements at high frame rates of up to 50 Hz. Structured light cameras can produce similar measurements, using a projected pattern that is observed by a CCD camera with a known baseline distance. Although the ToF and structured light sensor technologies have a lot of potential for use in mobile robotics, both are affected by several error sources. It is thus very important to evaluate the accuracy of these emerging sensors, compared to that of the actuated LRF.
Over the course of development of ToF sensors, several works have evaluated their accuracy in the context of sensor calibration. Kahlmann [1] proposes several calibration routines for the SwissRanger SR-2 and SR-3000 cameras. In order to evaluate their effect on the range measurement quality, he proposes to scan a flat wall and compare offsets from the expected distance. Later, several precisely engineered optical calibration setups, as well as a calibration track line, are used. Linder et al. also use standard optical calibration set-ups (checkerboard images) in order to calibrate PMD ToF cameras [2]. Fuchs et al. [3] evaluate distance accuracy using
a modified checkerboard pattern and a pair of ToF cameras
mounted on an industrial manipulator. Chiabrando et al. [4] perform a comprehensive evaluation of the SR-4000 ToF camera, also using a pre-fixed optical calibration pattern and
a tilted calibration board. While these set-ups constitute an
important test case for ToF cameras, they do not capture the complexity of typical uncontrolled environments encountered by mobile robots. May et al. [5] consider several real-world environments and measure the distance to well-known objects in these set-ups, resulting in a more complete accuracy
evaluation. Several works have also implicitly evaluated the accuracy of ToF ranging systems by considering their utility in the context of mapping [6], [7], obstacle avoidance [8] and object modeling [9]. Prior work dealing with the accuracy of ToF systems has uncovered complex error sources. Features of the environment, such as dark textures, sharp edges, foreground objects and distant bright objects, all introduce measurement errors in the obtained ranges. Although investigations into structured light camera accuracy are lacking, similar complexity of the error sources may be expected. Thus, the evaluation of both ToF and structured light cameras in strictly controlled environments and engineered test scenarios may not properly reflect their performance in a real world setting. It is therefore important to develop a procedure for a holistic evaluation of novel range sensing devices for the purposes of mobile robotics applications. This work extends a recently proposed method for spatial representation accuracy evaluation [10] to perform a comparison of range sensor measurements. Three integrated 3D range sensors — the SR-4000 and Fotonic B70 ToF cameras and the Microsoft Kinect structured light camera — are compared to a standard aLRF sensor. This article proceeds with a description of the accuracy evaluation approach. Section III describes the test setup and evaluation environments considered. Section IV presents an analysis of the obtained results, followed by a summary of the major contributions.
II. ACCURACY EVALUATION
In order to objectively compare the performance of different 3D range sensors, a well-defined framework is needed. It is also important to define well-formed comparison criteria and obtain meaningful, easy to interpret statistical results. The main objective of this work is to propose a method to compare the accuracy of range measurements in an uncontrolled environment, closely related to the application scenario in which the sensors are to be used. The information returned by 3D range sensors consists of discrete samples from a real world environment. Due to differences in sensing methodology, position and orientation of the devices, the discrete points obtained by each sensor do not correspond to precisely the same physical locations. Thus, a simple, per-point distance error would not be an accurate measure of the discrepancies between different sensors. Rather, in order to compare a test point set P_t to a reference measurement P_r, a continuous model M(P_r) should be used. The model M should be capable of evaluating the occupancy of any arbitrary point in the field of view of the sensor. Several alternative probabilistic modeling techniques could be used for constructing M. Building on previous results in spatial representation accuracy evaluation [10], the Three-Dimensional Normal Distributions Transform (3D-NDT) is chosen for constructing the reference model M(P_r).
A. The Three-Dimensional Normal Distributions Transform (3D-NDT)
The Normal Distributions Transform was originally developed in the context of 2D laser scan registration [11]. The central idea is to represent the observed range points as a set of Gaussian probability distributions. In order to better account for outliers in the data, a mixture of a Gaussian distribution and a uniform distribution may also be used [12]. NDT has been extended to three dimensions [13] and applied to the domains of 3D scan registration [14] and loop detection [15], as well as change detection [16] and path planning [17]. One of the key advantages of the 3D-NDT is the fact that it forms a piecewise smooth spatial representation, resulting in the existence of analytic derivatives. Consequently, standard optimization methods can be employed to produce state-of-the-art registration of 3D point clouds using the 3D-NDT representation [14]. Assuming that a set of n point samples P = {(x_i, y_i, z_i)} has been drawn from a Gaussian distribution N(µ, Σ), the maximum likelihood estimates of the covariance and mean can be obtained from the observations:
\mu = \frac{1}{n} \sum_{i=0}^{n} \begin{bmatrix} x_i \\ y_i \\ z_i \end{bmatrix} \quad (1)

M = \begin{bmatrix} x_0 - \mu_x & \cdots & x_n - \mu_x \\ y_0 - \mu_y & \cdots & y_n - \mu_y \\ z_0 - \mu_z & \cdots & z_n - \mu_z \end{bmatrix} \quad (2)

\Sigma = \frac{1}{n-1} M M^T \quad (3)
The 3D-NDT partitions space into a set of disjoint voxels and fits a Gaussian distribution to the points in each cell. In order to obtain a well-fitting estimate, it is important to choose a discretisation strategy that permits each distribution to be well supported by the observations in P. The most typical implementation of 3D-NDT utilizes a regular grid, though the use of an OcTree data structure of irregular cell size is also possible [15]. As shown previously, the spatial indexing used has an effect on the representation accuracy of the 3D-NDT [10], as well as on the speed and accuracy of the registration algorithms that utilize it [15].
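For concreteness, the construction of such a model on a regular grid can be sketched as follows (Python with NumPy; the function name, cell size and minimum support per voxel are illustrative assumptions and not taken from the authors' implementation):

import numpy as np
from collections import defaultdict

def build_ndt(points, cell_size=0.5, min_points=5):
    """Fit a Gaussian (mean, covariance) to the points falling in each voxel.

    points: (N, 3) array of range measurements.
    Returns a dict mapping voxel index -> (mean, covariance).
    """
    cells = defaultdict(list)
    for p in points:
        idx = tuple(np.floor(p / cell_size).astype(int))
        cells[idx].append(p)

    model = {}
    for idx, pts in cells.items():
        pts = np.asarray(pts)
        if len(pts) < min_points:          # skip poorly supported cells
            continue
        mu = pts.mean(axis=0)              # Eq. (1)
        sigma = np.cov(pts.T, bias=False)  # Eqs. (2)-(3), unbiased estimate
        model[idx] = (mu, sigma)
    return model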
B. Comparing Range Sensor Accuracy using the 3D-NDT
Prior work in evaluating the discriminative capabilities of the 3D-NDT [10] has demonstrated that it is a robust spatial representation with better accuracy than the standard occupancy grid map. Thus, the 3D-NDT is used in this work to represent the ground truth models of the environment. An important feature of the 3D-NDT is that due to its probabilistic foundations, it can be used to evaluate the occupancy likelihood of any point in space. Thus, given a 3D-NDT model M ndt and a query point x, a probability of occupancy can be estimated. The PDFs in each cell of M ndt can be treated as surface generation processes, thus the probability of observing x, generated from N (µ, Σ) is:
p(\mathbf{x} \mid \mathcal{N}(\mu, \Sigma)) = \frac{1}{(2\pi)^{3/2} |\Sigma|^{1/2}} \, e^{-\frac{(\mathbf{x}-\mu)^T \Sigma^{-1} (\mathbf{x}-\mu)}{2}} \quad (4)
As the PDFs in the model M_ndt are largely disjoint, the probability p(x|M_ndt) can be evaluated by only using the closest 3D-NDT component and directly applying Equation 4. In order to declare a point in space as occupied, the probability p(x|M_ndt) can be tested against a pre-fixed threshold. Previous work has empirically determined that a conservative threshold value of 0.1 results in a very accurate model. For the purposes of the current evaluation, an OcTree version of the 3D-NDT is used, as it offers stable performance over varying discretisation sizes. The proposed evaluation procedure assumes a reference 3D sensor, which produces a ground truth point cloud P_r. First, the 3D-NDT model of the reference point cloud M_ndt(P_r) is constructed. The point set output of a test sensor P_t is then considered as a set of positive examples pos_t, produced by a binary classifier associated with the sensor. Next, a set of corresponding negatively classified points neg_t can be generated, using P_t. For each point in P_t, a random offset of between 0.1 and 1 meters along the beam direction is subtracted. This method of generating free space samples is consistent with the sensor output, as each point in neg_t is located between the sensor origin and a corresponding reported obstacle point from pos_t. Next, all points in pos_t and neg_t are tested for occupancy, using the ground truth model M_ndt. The samples in pos_t correctly labeled positive represent the true positives. The negatively labeled samples from pos_t are false negatives. Similarly, the positively labeled points from neg_t constitute the false positives.
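The evaluation loop just described can be sketched as follows (Python; only the 0.1 occupancy threshold and the 0.1–1 m free-space offsets come from the text above; the voxel lookup, names and default values are illustrative assumptions):

import numpy as np

def occupancy_probability(x, model, cell_size=0.5):
    """Evaluate Eq. (4) using the 3D-NDT component of the voxel containing x."""
    idx = tuple(np.floor(x / cell_size).astype(int))
    if idx not in model:
        return 0.0
    mu, sigma = model[idx]
    d = x - mu
    norm = (2 * np.pi) ** 1.5 * np.sqrt(np.linalg.det(sigma))
    return float(np.exp(-0.5 * d @ np.linalg.solve(sigma, d)) / norm)

def evaluate_sensor(pos_t, sensor_origin, model, threshold=0.1, rng=None):
    """Return (TPR, FPR) for a test point cloud pos_t against the reference model."""
    rng = rng or np.random.default_rng(0)
    tp = fn = fp = tn = 0
    for p in pos_t:
        direction = (p - sensor_origin) / np.linalg.norm(p - sensor_origin)
        neg = p - direction * rng.uniform(0.1, 1.0)   # free-space sample along the beam
        occupied_p = occupancy_probability(p, model) > threshold
        occupied_n = occupancy_probability(neg, model) > threshold
        tp += occupied_p
        fn += not occupied_p
        fp += occupied_n
        tn += not occupied_n
    return tp / (tp + fn), fp / (fp + tn)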
Fig. 2. Setup for Evaluation with Known Ground Truth. Figure 2(a) shows a picture of the controlled testing environment. Figure 2(b) shows the output from the SwissRanger SR-4000 camera (red), registered together with the ground truth point set (white). Finally, Figure 2(c) shows the SwissRanger points along with the 3D-NDT distributions constructed from the ground truth model. The normal distributions are drawn as ellipsoids at three standard deviations, with color coding indicating the dominant orientations of the distributions.
[Fig. 3 (panels): ROC Plot Evaluation with Known Ground Truth (a); ROC Plot Evaluation with Known Ground Truth (Laser) (b); ROC Plot for all Scans (c); ROC Plot for Low Range Scans; ROC Plot for Mid Range Scans; ROC Plot for High Range Scans. Horizontal axis of each plot: False Positive Rate (fpr).]
to those of the laser sensor. The two time-of-flight cameras exhibit similar performance, with the SwissRanger slightly outperforming the Fotonic camera. Figure 3(b) shows results from the same test scenario, with the laser sensor point clouds used as a reference sensor, providing the point cloud P_r. Comparing with Figure 3(a) can give a good sense of the expected accuracy of the results presented later. First, due to the inaccuracies in the laser system discussed earlier, measurements are considerably more spread across the ROC plot. Notably though, the comparative performance of the evaluated sensors remains the same, with the Kinect sensor producing the best results. The average values of the sensor accuracy also remain close to their ground truth values, making a strong case for the proposed comparison strategy, especially so in the absence of reliable ground truth information.
B. Evaluation in an Uncontrolled Environment
Fifty sets of test examples collected from an indoor laboratory environment were used to generate the ROC plot shown in Figure 3(c). The ROC points associated with the measurements from the actuated laser are, as expected, concentrated in the top-left area of the plot. The obtained points are spread over a small area, featuring a high rate of true positives and a negligible rate of false positives. As the laser points were used to construct the ground truth 3D-NDT models, this behavior is normal and only serves as a reference for the best possible results on the data set. The points from the other three sensors are spread over a large area of the ROC plot and do not seem to offer a clear distinction between the sensors. A closer look at the collected data can however offer some insights about the sensor performance. It is important to note that in the operational mode used (millimeter precision), the SICK LMS-200 has a maximum range of eight meters. The operational ranges of the three integrated 3D sensors, as declared by the manufacturers, are between 0.8 and 3.5 meters for the Kinect, up to 5 meters for the SR-4000 and between 0.1 and 7 meters for the Fotonic B70. Thus, a separate evaluation, depending on the average distances measured in the collected point sets, is necessary. Figures 3(d)–3(f) show the results obtained for partitions of the evaluation set into point clouds containing ranges of up to 3, between 3 and 5, and above 5 meters, respectively. The distributions of points on the ROC plots obtained are notably more concentrated and easier to interpret. When observing objects that are closer than three meters, the Microsoft Kinect structured light camera performs the best, delivering point clouds with a high likelihood conditioned on the laser measurements. The Fotonic and SwissRanger cameras have slightly worse performance, with a slightly higher rate of false positives. Table I summarizes the average performance of each sensor on the full data set, as well as the partitioned sets. As also testified by Figure 3(d), the performance of the Kinect sensor is very close to that of the laser for short range environments. The two time-of-flight cameras did not show significant differences in the short range test set.
Data Set | True Positive Rate (TPR) | σ(TPR) | False Positive Rate (FPR) | σ(FPR)
aLRF full | 0.91 | 0.02 | 0.01 | 0.01
Kinect full | 0.77 | 0.15 | 0.22 | 0.20
SR-4000 full | 0.87 | 0.05 | 0.26 | 0.14
Fotonic B70 full | 0.86 | 0.04 | 0.24 | 0.11
Kinect low | 0.89 | 0.05 | 0.05 | 0.05
SR-4000 low | 0.89 | 0.05 | 0.14 | 0.10
Fotonic B70 low | 0.89 | 0.04 | 0.16 | 0.09
Kinect mid | 0.71 | 0.13 | 0.34 | 0.18
SR-4000 mid | 0.87 | 0.02 | 0.30 | 0.07
Fotonic B70 mid | 0.85 | 0.04 | 0.25 | 0.07
Kinect high | 0.61 | 0.12 | 0.44 | 0.07
SR-4000 high | 0.87 | 0.04 | 0.41 | 0.04
Fotonic B70 high | 0.83 | 0.02 | 0.35 | 0.05
TABLE I
AVERAGE AND STANDARD DEVIATION OF TRUE AND FALSE POSITIVE RATES FOR DIFFERENT SENSORS AND SCAN TYPES
In the indoor environment tested, the two cameras performed in a similar fashion over the whole test set, with the SwissRanger naturally degrading on scans containing longer distances. In the evaluation performed, no sensor achieved performance comparable to that of the laser sensor at the full distance range. Depending on the application scenario and intended use of the sensors, additional evaluation might be needed to assess the feasibility of using an integrated 3D range device in place of an actuated laser sensor. For example, if the primary use of the sensor is collision avoidance, distances of less than three meters from the robot are usually a sufficient look-ahead interval. Thus, any of the three evaluated 3D range sensors would perform well. If a mapping application is envisioned, the uncertainty in the position of far-away objects would have to be explicitly taken into account if one of the evaluated 3D range cameras is to be used. Amplitude thresholding for the time-of-flight cameras, or a distance cutoff for the structured light sensor, might be necessary to ensure better performance. In order to demonstrate the precision of the evaluated sensors at shorter ranges, an additional test was performed. The point sets obtained by all four sensors were filtered, eliminating all obstacle points found at ranges above 3.5 meters. Thus, when only close points are considered both for model building and testing, the ROC plots in Figure 4 are obtained. The improvement in performance is evident, with the average accuracy of the Kinect sensor increasing from (0.77, 0.22) TPR/FPR (σ(0.20, 0.15)) to (0.90, 0.06) TPR/FPR (σ(0.03, 0.01)). A similar performance gain can be observed for the Fotonic B70 camera. The SwissRanger SR-4000 is not influenced by the relaxed conditions, due to the severe backfolding effects that occur in high range scans: because of the modulation frequency used, bright objects at more than five meters cannot be distinguished from faint objects at very close range. This effect was not observed in the Fotonic ToF sensor, due to the use of multiple modulation frequencies for range disambiguation.
V. DISCUSSION
This work presented a comparative evaluation of three integrated 3D range cameras. A novel evaluation methodology,
Fig. 4. ROC plot for the full indoor environment data set, considering only ranges of less than 3.5 meters. The performance of the Kinect and Fotonic cameras improves drastically.
based on previous work on spatial representation evaluation,
was used to compare the outputs of the three cameras. The proposed approach offers easy to interpret statistics for the descriptive power of different sensors, compared to a set of reference range observations. In simple environments, hand-crafted precise geometric models can be used to evaluate absolute sensor performance. In addition, a sensor with known performance can be used to collect reference measurements in arbitrarily complex application scenarios. A direct comparison of the ranges obtained with a different sensor can then be performed, allowing for further analysis into the usability of novel range sensors. The proposed methodology does not make any assumptions about the operational principle of the evaluated range sensors and can thus be used for generic benchmarking of newly available devices. The accuracy of three integrated 3D range sensors (the SwissRanger SR-4000 and Fotonic B70 ToF cameras and the Microsoft Kinect structured light camera) was compared to that of an actuated laser range sensor. The three devices evaluated can deliver high frame rates and dense range measurements, but none of them features an accuracy similar to that of the laser system. A notably different conclusion can be reached when explicitly considering environments of constrained size. When all range measurements are concentrated within a sphere of radius 3.5 meters, the Microsoft Kinect sensor has an accuracy similar to the one obtained by the actuated laser sensor. In conclusion, when considering smaller environments, the Kinect and Fotonic B70 sensors can be directly used as a high data rate substitute for an aLRF system. Future work in sensor evaluation will focus on a more in-depth analysis of the acquired data set and a more extensive comparison of the expressive power of the sensors considered. One option to be explored is the direct comparison of the 3D-NDT models that can be constructed from the range measurements returned by each sensor. Other directions to be further explored are a diversification of the set of sensors evaluated and the collection of a larger test sample from different environments. Emphasis will also be placed on outdoor environments and some application-specific scenarios, in particular the scenario of grasping deformable objects stacked in a container.
ACKNOWLEDGMENTS
This work has been partially supported by the European Union FP7 Project “RobLog — Cognitive Robot for Automation of Logistic Processes”.
REFERENCES
[1] T. Kahlmann, “Range imaging metrology: Investigation, calibration and development,” Ph.D. dissertation, 2007.
[2] M. Lindner, I. Schiller, A. Kolb, and R. Koch, “Time-of-Flight sensor calibration for accurate range sensing,” Computer Vision and Image Understanding, 2010.
[3] S. Fuchs and G. Hirzinger, “Extrinsic and Depth Calibration of ToF-cameras,” in Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, 2008, pp. 1–6.
[4] F. Chiabrando, R. Chiabrando, D. Piatti, and F. Rinaudo, “Sensors for 3D Imaging: Metric Evaluation and Calibration of a CCD/CMOS Time-of-Flight Camera,” Sensors, vol. 9, no. 12, 2009.
[5] S. May, B. Werner, H. Surmann, and K. Pervolz, “3D Time-of-Flight Cameras for Mobile Robotics,” in Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, IROS 2006, pp. 790–795.
[6] S. May, D. Dröschel, S. Fuchs, D. Holz, and A. Nüchter, “Robust 3D-mapping with Time-of-Flight Cameras,” in Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, IROS 2009, pp. 1673–1678.
[7] K. Pathak, A. Birk, and J. Poppinga, “Sub-pixel Depth Accuracy with a Time of Flight Sensor using Multimodal Gaussian Analysis,” in Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, IROS 2008, pp. 3519–3524.
[8] D. Dröschel, D. Holz, J. Stückler, and S. Behnke, “Using Time-of-Flight Cameras with Active Gaze Control for 3D Collision Avoidance,” in IEEE Int. Conf. on Robotics and Automation, ICRA 2010, pp. 4035–4040.
[9] Y. Cui, S. Schuon, D. Chan, S. Thrun, and C. Theobalt, “3D Shape Scanning with a Time-of-Flight Camera,” 2010, pp. 1173–1180.
[10] T. Stoyanov, M. Magnusson, H. Almqvist, and A. J. Lilienthal, “On the Accuracy of the 3D Normal Distributions Transform as a Tool for Spatial Representation,” in Proc. of IEEE Int. Conf. on Robotics and Automation, ICRA 2011.
[11] P. Biber and W. Straßer, “The Normal Distributions Transform: A new Approach to Laser Scan Matching,” in Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, IROS 2003, 2003.
[12] P. Biber, S. Fleck, and W. Straßer, “A Probabilistic Framework for Robust and Accurate Matching of Point Clouds,” in 26th Pattern Recognition Symposium (DAGM 04), 2004.
[13] M. Magnusson, T. Duckett, and A. J. Lilienthal, “3D scan registration for autonomous mining vehicles,” Journal of Field Robotics, vol. 24, no. 10, pp. 803–827, Oct. 2007.
[14] M. Magnusson, A. Nüchter, C. Lörken, A. J. Lilienthal, and J. Hertzberg, “Evaluation of 3D Registration Reliability and Speed - A Comparison of ICP and NDT,” in Proc. IEEE Int. Conf. on Robotics and Automation, ICRA 2009, pp. 3907–3912.
[15] M. Magnusson, “The Three-Dimensional Normal-Distributions Transform — an Efficient Representation for Registration, Surface Analysis, and Loop Detection,” Ph.D. dissertation, Örebro University, Dec. 2009.
[16] H. Andreasson, M. Magnusson, and A. J. Lilienthal, “Has Something Changed Here? Autonomous Difference Detection for Security Patrol Robots,” in Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, IROS 2007, 2007.
[17] T. Stoyanov, M. Magnusson, H. Andreasson, and A. J. Lilienthal, “Path Planning in 3D Environments Using the Normal Distributions Transform,” in Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, IROS 2010, Taipei, Taiwan, 2010.
A comparative evaluation of exploration strategies and heuristics to improve them
Dirk Holz ∗
Nicola Basilico †
Francesco Amigoni †
Sven Behnke ∗
∗ Autonomous Intelligent Systems Group, University of Bonn, Bonn, Germany † Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy
Abstract—Exploration strategies play an important role in influencing the performance of an autonomous mobile robot exploring and mapping an unknown environment. Although several exploration strategies have been proposed in recent years, their experimental evaluation and comparison are still largely unaddressed. In this paper, we quantitatively evaluate exploration strategies by experimentally comparing, in a simulation setting, a representative sample of techniques taken from the literature. From a broader perspective, our work also contributes to the development of good experimental methodologies in the field of autonomous mobile robotics by promoting the principles of comparison, reproducibility, and repeatability of experiments.
Index Terms—Robotic exploration. Exploration strategies.
I. INTRODUCTION
Exploration and mapping are fundamental tasks for autonomous mobile robots operating in unknown environments. Recent work [1] showed that exploration strategies largely influence the efficiency with which a robot performs exploration. Broadly speaking, an exploration strategy drives a robot within a partially known environment, determining where to acquire new spatial information. In this work, we focus on the mainstream approach of greedy Next-Best-View (NBV) exploration strategies. When employing an NBV strategy, exploration comes down to a sequence of steps where, at each step, a number of candidate observation locations are evaluated according to some objective function and the best one is selected for the robot to reach. Several exploration strategies [2], [3], [8], [16], [19] have been proposed, but their experimental evaluation and comparison constitute a topic that is still largely unaddressed, with few exceptions (e.g., [1] and [12]). In our opinion, filling this gap is an important issue in mobile robot exploration.
In this paper, we aim to contribute to the experimental comparison of exploration strategies. In particular, we compare three exploration strategies for a single robot [5], [8], [11] that are a representative sample of the current state of the art, and we investigate the reasons for their different performance and the ways in which they can be improved. The original contribution of this paper is not in the proposal of new exploration strategies, but in presenting some insights derived from the quantitative experimental evaluation of both some strategies and some general heuristics that can be used to improve them. These insights can be intended as enabling factors for more complex exploration applications and for developing better exploration strategies. Our work extends the results of [1] by comparing a different set of strategies within a more realistic simulation framework and by presenting new insights. Furthermore, we extend the work in [9] by also evaluating the heuristic improvements when applied to different exploration strategies. Our work can also be viewed from the general perspective of the definition of good experimental methodologies for autonomous mobile robotics (for instance, see [6] and [14]). Recent efforts have recognized that experimentation in this field has not yet reached a level of maturity comparable with that reached in other engineering and scientific fields [4]. Among the elements that define a good experimental methodology is the comparison of experimental results. With this paper, we contribute toward the definition of a framework for evaluating exploration strategies in different setups. We conduct our comparison in simulation, since it enables performing reproducible and repeatable experiments [4]. Reproducibility is the possibility to verify, in an independent way, the results of a given experiment. Other experimenters, different from the one claiming the validity of some results, should be able to achieve the same results by starting from the same initial conditions, using the same type of instruments, and adopting the same experimental techniques. Repeatability concerns the fact that a single result is not sufficient to ensure the success of an experiment. A successful experiment must be the outcome of a number of trials, performed at different times and in different places. These requirements guarantee that the result has not been achieved by chance, but is systematic. Performing experiments using a standard and publicly available simulation platform (like Player/Stage) is a way to promote comparison, reproducibility, and repeatability of experiments.
II. RELATED WORK
The definition of strategies for autonomous exploration of environments has been addressed by several works in the literature. Besides exploration strategies that make the robots move along predefined trajectories [13] and that attempt to close loops for localization purposes [17], the mainstream approach considers exploration as an incremental process in which the next observation location is selected among a set of candidates on the basis of available information. These Next-Best-View (NBV) systems evaluate the candidate observation locations according to some criteria. Usually, in NBV systems, candidate locations are on the frontier between the known free space and the unexplored part of the environment and are reachable from the current position of the robot [21]
(an exception is the feature-based approach of [15]). The exploration strategies analyzed in this paper follow the NBV approach. NBV problems have also been studied in Computer Vision and Graphics. However, the proposed techniques do not apply well to mobile robots [8].
In evaluating a candidate location, single or multiple criteria can be used. For example, [21] presents a strategy that uses a single criterion, the traveling cost, according to which the best observation location is the nearest one. Other approaches combine traveling cost with different criteria, for example with expected information gain [8]. This criterion is related to the expected amount of new information about the environment obtainable from a candidate location. It is estimated by measuring the area of the portion of unknown environment potentially visible from the candidate location, taking into account the so-far built map and the robot's sensing range. Other examples of combining different criteria are [16], in which the traveling cost is linearly combined with the expected reduction of the uncertainty of the map after the observation, and [2], in which a technique based on relative entropy is used. In [19], several criteria are employed to evaluate a candidate location: traveling cost, uncertainty in landmark recognition, number of visible features, length of visible free edges, rotation and number of stops needed to follow the path to the location. They are combined in a multiplicative function to obtain a global utility value. The above strategies are based on ad hoc aggregation functions (linear combination, multiplication, ...) that combine criteria. In [3], the authors dealt with this problem and proposed a more theoretically grounded approach based on multi-objective optimization, in which the best candidate location is selected on the Pareto frontier. In [3], besides traveling cost and expected information gain, overlap is also taken into account. This criterion is related to the amount of already known features that are visible from a candidate location. It accounts for the precision of self-localization of the robot: the larger the overlap, the better the localization of the robot.
III. EXPERIMENTAL SETTING
We now introduce our experimental setting in which we compared the three exploration strategies described in the next section. The strategies have been integrated into a robot control architecture [11] and simulated runs have been performed in Player/Stage to assess and compare their performance. The system represents a class of widely used wheeled mobile robots and consists of a differential-drive robot platform equipped with a SICK LMS 200 laser range scanner with 180 degree field of view and 1 degree angular resolution. The goal of the robot is to fully explore an initially unknown environment. Robot localization and mapping are performed by incrementally registering raw 2D laser range scans as described in [10]. The robot continuously updates the map as it moves. The map is represented as an unordered point cloud where duplicate storage of measurements is avoided by adding to the map only points that provide new information. They are determined according to a minimum distance from the already stored points. In addition, we update a grid map that represents, for each cell c_{[xy]}, its reflection probability

p(c_{[xy]}) = \frac{\#\mathrm{hits}}{\#\mathrm{hits} + \#\mathrm{misses}},

where #hits is the number of range beams that have been reflected by an object in the corresponding region and #misses is the number of range beams that have passed through the cell without being reflected. Initially, a value of 0.5 is assigned to each cell, i.e., a cell's reflection probability is initially unknown. Path planning is accomplished by computing a reachability map which stores, for every cell, both the length of the shortest path to reach it from the current location of the robot and the preceding cell along this path. It is built by iteratively applying Dijkstra's algorithm on the grid map without specifying any goal location, so as to fully explore the reachable workspace. Therefore, once a candidate location is selected, the shortest obstacle-free path for navigating to it can be recursively looked up in the reachability map. To guarantee safe navigation, we consider as traversable only cells c_{[xy]} such that p(c_{[xy]}) ≤ 0.25 and whose distance to the closest obstacle is at least 30 cm. Finally, note that running all the exploration strategies in the same experimental setting provides a fair way to compare them. Furthermore, using the architecture described above allows for directly applying the implemented exploration strategies on real mobile robots.
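A minimal sketch of this bookkeeping follows (Python with NumPy; the class and method names are our own; only the 0.5 prior, the p ≤ 0.25 traversability bound and the 30 cm clearance come from the description above):

import numpy as np

class ReflectionGrid:
    """Grid map of reflection probabilities p(c) = hits / (hits + misses)."""

    def __init__(self, width, height):
        self.hits = np.zeros((height, width))
        self.misses = np.zeros((height, width))

    def update_beam(self, traversed, hit=None):
        # traversed: cells the beam passed through; hit: cell where it was reflected
        for r, c in traversed:
            self.misses[r, c] += 1
        if hit is not None:
            self.hits[hit] += 1

    def reflection_probability(self):
        total = self.hits + self.misses
        p = np.full(self.hits.shape, 0.5)          # unknown cells keep the 0.5 prior
        known = total > 0
        p[known] = self.hits[known] / total[known]
        return p

    def traversable(self, clearance_mask):
        # clearance_mask: True where the distance to the closest obstacle is
        # at least 30 cm (e.g., obtained from a distance transform of obstacle cells)
        return (self.reflection_probability() <= 0.25) & clearance_mask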
IV. EXPLORATION STRATEGIES
In this section, we present the three exploration strategies
that we compared in our experiments. These methods constitute a representative sample of different classes of NBV exploration strategies proposed in the literature. The first one, a closest-frontier strategy [11], is simple, both in its definition and computation, and considers a single criterion for candidate selection – the traveling cost. The second technique [8] combines traveling cost and information gain with an ad hoc exponential function. The third technique [5] is based on a more principled way for aggregating multiple criteria in a global utility function.
A. Closest-Frontier Exploration Strategy
The idea of frontier-based exploration strategies is to detect borders between already explored regions of the environment and those regions where the robot has not yet acquired information. Hence, the robot searches for regions that are traversable in the map built so far and that are adjacent to unexplored regions and holes in the map.
A simple frontier-based exploration strategy is the closest-
frontier strategy (CF). It has been proposed in [21] and can be briefly described according to the following steps:
1) determine the set T of traversable cells;
2) determine the set R of reachable cells, i.e., compute a reachability map (see Section III);
3) determine the set C of cells that are both reachable and traversable: C = T ∩ R;
4) determine the set of frontier cells F by checking for every cell in the set C if it is adjacent to a cell with unknown reflection probability:

F = \{ c_{[xy]} \mid c_{[xy]} \in C, \; \exists\, c_{[(x+m)(y+n)]} : p(c_{[(x+m)(y+n)]}) = 0.5, \; m \in \{-1, 1\}, \; n \in \{-1, 1\} \}; \quad (1)

5) determine n = (n_x\ n_y)^T as the frontier cell lying closest to the robot's current position r = (r_x\ r_y)^T:

n = \arg\min_{c_{[xy]} \in F} L\big((x\ y)^T, r\big), \quad (2)

where L(p, r) is the length of the shortest path from p to r.
Finally, n is chosen as the next best observation location and the robot is guided towards it following the minimum path.
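For illustration, steps 1)–5) can be sketched as follows (Python with NumPy; a unit-cost breadth-first search stands in for the Dijkstra-based reachability map of Section III, and all names are our own assumptions):

import numpy as np
from collections import deque

UNKNOWN = 0.5

def closest_frontier(prob, traversable, start):
    """prob: reflection-probability grid; traversable: boolean grid;
    start: (row, col) of the robot. Returns the closest frontier cell, or None."""
    rows, cols = prob.shape
    dist = np.full(prob.shape, np.inf)
    dist[start] = 0
    queue = deque([start])                       # unit step costs: BFS == Dijkstra
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols and traversable[nr, nc]
                    and np.isinf(dist[nr, nc])):
                dist[nr, nc] = dist[r, c] + 1
                queue.append((nr, nc))

    def is_frontier(r, c):
        # adjacent to at least one cell whose reflection probability is still unknown
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and prob[nr, nc] == UNKNOWN:
                    return True
        return False

    frontiers = [(dist[r, c], (r, c)) for r in range(rows) for c in range(cols)
                 if np.isfinite(dist[r, c]) and is_frontier(r, c)]
    return min(frontiers)[1] if frontiers else None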
B. González-Baños and Latombe’s Exploration Strategy
The second exploration strategy we decided to evaluate is the strategy by González-Baños and Latombe (GBL) presented in [8]. It selects the next best observation location according to traveling cost and information gain. Given the current partial map of the environment, this strategy generates a set of candidate locations by randomly sampling cells in the vicinity of frontier cells F. Then, given a candidate location p, the corresponding utility u(p) is computed according to two criteria: the traveling cost L(p, r) for reaching p and the estimated information gain I(p) when performing a sensing action at p. The global utility is then computed as
u(p) = I(p)\, e^{-\lambda L(p, r)}, \quad (3)

and the candidate n that maximizes u(·) is selected as the next observation position (in our experiments we used λ = 0.2, as suggested by the authors). Whereas the traveling cost is estimated in the same way as above (using the reachability map), the information gain is estimated as the expected relative change in map entropy. That is, we simulate range scans and corresponding map updates at all candidate locations p. The information gain I(p) is estimated as the difference between the map's entropy before (H) and after (\hat{H}) the simulated update, I(p) = H - \hat{H}. Since the probabilistic reflection maps we used represent, in principle, two probabilities for each cell (being occupied and being free), we estimate the map entropy by:

H = -\sum_{c_{[xy]}} \Big[ \underbrace{p(c_{[xy]}) \log p(c_{[xy]})}_{=H_{p(\mathrm{occupied})}} + \underbrace{(1 - p(c_{[xy]})) \log(1 - p(c_{[xy]}))}_{=H_{p(\mathrm{free})}} \Big]. \quad (4)
C. MCDM-based Exploration Strategy
This exploration strategy has been introduced in [5] for maps of line segments. Here we summarize it and show its extension to grid maps. This exploration strategy exploits a decision theoretic technique called Multi-Criteria Decision Making (MCDM), which constitutes a more principled way to combine the criteria that evaluate a candidate location. Given a candidate location p , we consider three criteria for its evaluation. The first one is the traveling cost L (p , r), computed as the length of the path connecting the current position of the robot with p . Then, we consider the estimated information gain I (p ) and the overlap O (p ). These two last criteria should be maximized in order to select good observation locations. Both I (p ) and O (p ) are computed according to a standard entropy measure. Given the set of cells V p that are visible from the candidate location p , i.e., cells falling within the sensing range area centered at p , we distinguish between old and new cells using a threshold k over the reflection probability. In particular, a cell c [ xy ] ∈ V p
is considered as old if p(c_{[xy]}) ≤ k or if p(c_{[xy]}) ≥ 1 − k; otherwise c_{[xy]} is considered as new. In our experiments we set k = 0.2. Then, maximizing I(p) corresponds to maximizing the total entropy over the new cells of V_p (p provides a potentially large amount of new information), while maximizing O(p) corresponds to minimizing the total entropy over the old cells of V_p (p provides good localization).
We call N the set of three criteria that are considered, N = {L(), I(), O()}. Given a criterion i ∈ N and a candidate location p, a utility value u_i(p) in the [0, 1] interval is computed in order to evaluate on a common scale p's goodness according to every criterion. The utility is normalized over all the candidates in the current exploration step. For example, considering the traveling cost L(p, r) and calling C the set of (current) candidate locations, the utility u_L(p) (with p ∈ C) is computed with the following linear mapping function:
u_L(p) = 1 - \frac{L(p, r) - \min_{q \in C} L(q, r)}{\max_{q \in C} L(q, r) - \min_{q \in C} L(q, r)}. \quad (5)
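In code, Eq. (5) amounts to a min–max normalization of the path lengths (Python; names are ours): the nearest candidate receives utility 1 and the farthest 0.

def normalized_cost_utility(costs):
    """costs: dict mapping candidate -> path length L(p, r). Returns u_L per candidate."""
    lo, hi = min(costs.values()), max(costs.values())
    if hi == lo:                       # all candidates are equally far away
        return {p: 1.0 for p in costs}
    return {p: 1.0 - (c - lo) / (hi - lo) for p, c in costs.items()}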
Analogous normalization functions are used for the other criteria, preserving the idea that the larger the utility, the better the satisfaction of the criterion. In order to select an observation location, the robot computes a global utility value measuring the overall goodness of each candidate. For every pair p ∈ C and i ∈ N, a utility value u_i(p) is computed. MCDM uses an aggregation technique called the Choquet fuzzy integral. Let us introduce this concept. We call a function¹ µ : P(N) → [0, 1] a fuzzy measure on the set of criteria N when it satisfies the following properties:
1) µ(∅) = 0, µ(N) = 1;
2) if A ⊂ B ⊂ N, then µ(A) ≤ µ(B).
Given A ∈ P(N), µ(A) represents the weight of the set of criteria A. In this way, weights are associated not only to single criteria, but also to their combinations.
¹ P(N) is the power set of N.
µ(L) = 0.2    µ(I) = 0.4    µ(O) = 0.4
µ({L, I}) = 0.9    µ({L, O}) = 0.6    µ({I, O}) = 0.8
TABLE I
DEFINITION OF µ() FOR THE MCDM-BASED STRATEGY
The global utility u(p) for a candidate location p is computed by means of the Choquet integral with respect to the fuzzy measure µ:

u(p) = \sum_{i=1}^{|N|} \big(u_{(i)}(p) - u_{(i-1)}(p)\big)\, \mu(A_{(i)}), \quad (6)

where (i) indicates the indices after a permutation that changed their order to have, for a given p, u_{(1)}(p) ≤ · · · ≤ u_{(|N|)}(p) ≤ 1 (it is supposed that u_{(0)}(p) = 0) and A_{(i)} = \{ j \in N \mid u_{(i)}(p) \le u_j(p) \le u_{(|N|)}(p) \}.
Different aggregation functions can be defined by changing the definition of µ. For example, the weighted average is a particular case of the Choquet integral when µ is additive (i.e., µ(A ∪ B) = µ(A) + µ(B)). Most importantly, µ can model dependence relationships between criteria. Formally, criteria belonging to a group G ⊆ N are:
• redundant, if µ(G) < \sum_{g \in G} µ(g);
• synergic, if µ(G) > \sum_{g \in G} µ(g);
• independent, otherwise.
In summary, what MCDM provides is a sort of “distorted” weighted average, which takes into account dependency between criteria. The next observation location is the candidate location that maximizes u() in Eq. (6). The MCDM-based strategy used in experiments has been defined according to the weights reported in Table I. Such weights have been manually chosen in order to model a synergy relation between the information gain I() and the traveling cost L(), thus favoring candidates that satisfy those criteria in a balanced way. Finally, we note that the computational time of employing MCDM, although longer than that of employing CF, has a negligible impact on the time required to map an environment.
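As an illustration of how Eq. (6) is evaluated with the weights of Table I (Python; the representation of the fuzzy measure and all names are our own assumptions, and the criterion utilities are assumed to be already normalized to [0, 1] as in Eq. (5)):

# Fuzzy measure from Table I; keys are frozensets of criterion names.
MU = {
    frozenset(): 0.0,
    frozenset("L"): 0.2, frozenset("I"): 0.4, frozenset("O"): 0.4,
    frozenset("LI"): 0.9, frozenset("LO"): 0.6, frozenset("IO"): 0.8,
    frozenset("LIO"): 1.0,
}

def choquet_utility(u, mu=MU):
    """u: dict criterion -> normalized utility in [0, 1], e.g. {'L': .8, 'I': .5, 'O': .3}."""
    order = sorted(u, key=u.get)             # sort criteria by increasing utility, Eq. (6)
    total, prev = 0.0, 0.0
    for i, crit in enumerate(order):
        a_i = frozenset(order[i:])           # criteria whose utility is >= u_(i)
        total += (u[crit] - prev) * mu[a_i]
        prev = u[crit]
    return total

# Example: a candidate that is close (high L-utility) with moderate information gain.
print(choquet_utility({"L": 0.8, "I": 0.5, "O": 0.3}))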
D. Heuristics to Improve the Strategies
The three exploration strategies we considered (and most of those presented in the literature) have two main limitations:
1) the decision of reaching a selected location is not changed until the location is actually reached;
2) evaluation of candidate locations is based only on information relative to the single locations, without considering their relation with other locations.
In the next section, we provide an experimental answer to the question of how much these limitations affect the performance of exploration strategies. Here, we describe two simple heuristics that can be applied to exploration strategies in order to cope with these limitations and to obtain a better performance (thereby extending the initial results from [9]).
Fig. 1. Example trajectories with and without repetitive re-checking (a) and examples of map segmentation (b).
1) Repetitive Re-checking: During navigation to a selected location, the map is continuously updated. As a result, the robot might have fully explored an unknown region before actually reaching the selected frontier location. Hence, continuing to travel to the selected location is unnecessary. We address this problem by using Repetitive Re-checking (RR), i.e., the robot checks whether or not the currently approached frontier location n is still adjacent to at least one cell with unknown reflection probability. As soon as n is no longer a valid frontier, the robot stops traveling towards it and selects the next best location according to the employed exploration strategy. In Fig. 1(a) we report an example showing that the robot's trajectory is shortened by repetitive re-checking, especially when approaching frontiers in the vicinity of corners.
2) Map Segmentation: Sometimes it can happen that a single room gets visited multiple times if successively selected locations lie in different rooms. To reduce the number of multiple visits, we applied Map Segmentation (SEG), which splits the map built so far into segments representing individual rooms and makes the robot prefer candidates lying in the segment of its current location. We use an approach based on [18] and [20] that splits map regions at local minima in the Voronoi diagram (critical points) of the map's free space. We define critical points to be Voronoi nodes of degree 2 that are local minima with respect to the distance to the closest Voronoi site and that are adjacent to a junction node, or adjacent to another node that is adjacent to a junction node. Using critical points we split previously unassigned map regions into two parts. We assign cells to segments with respect to their distances to critical points. That is, we form clusters of cells being closest to a common split point. This can be performed efficiently by computing a Euclidean distance transform (EDT) for the critical points. For the actual assignment we compute and store both the distance to the closest critical point (as for the EDT) and the closest critical point itself, thus computing a nearest neighbor transform. Then, in an iterative refinement step, we merge segments that are adjacent to each other but not split by the same critical point. An example of the segmentation algorithm is reported in Fig. 1(b). The map segmentation can be used in the exploration strategies to restrict the set of candidate locations within the scope of the robot's current segment. If the set of candidates
belonging to the robot's current segment is not empty, then the exploration strategy will choose the next best location from that set. In this sense, we will say that, when SEG is used, a “depth-first-like” or room-by-room exploration is favored.

Fig. 2. The two environments, (a) “AVZ” and (b) “Hospital”, provided by K. Lingemann, A. Nüchter, and J. Hertzberg (a) and by R. Vaughan (b).
V. EXPERIMENTAL RESULTS
We compared the performance of exploration strategies in two office-like indoor environments composed of several rooms and corridors (Fig. 2). Indoor environments present interesting challenges to exploration strategies, mainly due to their intricate structure that makes the selection of the next observation position non-trivial. We compared the three exploration strategies of Section IV with and without the RR and SEG heuristics. As a baseline for comparison, we also report results obtained with a Random Frontier (RF) exploration strategy. It chooses the next observation location according to a uniform probability distribution over the current candidate locations. For every configuration in which a particular exploration strategy was tested within an environment, we performed 50 simulation runs with the same initial position for the robot. Each run is considered completed when no more frontiers can be determined in the current map, i.e., when there does not exist any reachable cell adjacent to another cell with unknown reflection probability. To compare the performance obtained in different configurations, we report the mean of the length of trajectories covered by the robot (as in [1]–[3], [5], [16], [19]). Results obtained in the two environments are reported in Fig. 3. All the strategies perform better than RF, as expected, with more evident differences in the more complex hospital environment. The first interesting comparison that is worth doing is between the CF and MCDM strategies. The good performance of CF means that minimizing the traveled distance at every exploration step produces a small global traveled distance in the indoor environments we considered. This fact, together with the observation that the robot acquires data during its movements, explains the good performance of CF. Although CF performs slightly better, MCDM achieves comparable performance with respect to CF. This is not obvious, since in MCDM other criteria (I() and O()) are given more importance than traveling cost (see Table I), which is the only criterion adopted by CF. In fact, the MCDM strategy provides, by means of synergy, a good trade-off between I() and L(). The close performance of CF and MCDM can also be explained by saying that the latter strategy compensates the potential performance worsening, due to the fact that distance is not minimized, with good information gains. Moreover, we observed that MCDM maps most of the environment following a short path and then travels a relatively long path to complete the map (e.g., filling holes close to corners). A reduction in the total traveled distance of the three strategies can be observed when enabling Repetitive Re-checking (RR). A strategy with the RR heuristic outperforms the corresponding basic strategy, which needs to reach every selected observation location independently of the sensor data acquired along the path. Using map segmentation (SEG) reduces the traveled distance especially in the hospital environment, where SEG provides a good quality segmentation. Enabling SEG prevents corners and occlusions from being left out and explored only in the last steps of the exploration. Without SEG, multiple visits to the same room can be necessary, e.g., when the current robot's room is not completely explored and the best frontier location happens to be outside that room. Interestingly, the MCDM strategy showed a “depth-first” behavior with respect to unknown regions, even without using SEG. The main reason is the presence of the overlap criterion, which leads to a more conservative exploration by requiring a certain amount of old information in each sensor acquisition. Comparing the two heuristics, SEG appears to reduce the traveled distance slightly more than RR. This result can be explained by considering that RR only stops following an already made decision, while SEG helps in making a better decision.
The GBL strategy (with and without RR and SEG) is outperformed by MCDM and CF in both environments. This means that using more criteria does not by itself guarantee a better exploration strategy and suggests that the way in which criteria are combined is fundamental. In this sense, general aggregation techniques such as MCDM appear more suitable for designing multi-criteria exploration strategies. This is in accordance with the results of [5], where exploration
strategies defined with MCDM and GBL are compared using maps composed of line segments. Finally, we considered a variant of the GBL strategy, in
which the information gain is computed as in MCDM, i.e., by using the entropy only over the new cells visible from
a candidate location (data are not shown here). With this
different I (), GBL shows a slightly better performance, but the above considerations still hold. This suggests that the way
in which criteria are combined could be even more important
than the methods used to compute the criteria themselves. An evaluation criterion that has been largely neglected so far is the completeness of the maps after exploration. In our experiments, we compared the map entropies resulting from exploration with those computed on (manually) fully explored maps (c = (1 − H_full/H_expl.) · 100 to obtain completeness c in percent). By having the same termination criterion, all basic
Fig. 3. Traveled distances in m (average and standard deviation): (a) “AVZ”; (b) “Hospital”.
strategies (with and without SEG) achieve a map completeness of roughly 99%. RR lowers the completeness to 98% in the AVZ environment, and 96% in the hospital environment. This is primarily caused by taking fewer close range measurements in corners, which can be seen in Fig. 1(a). That is, there is a trade-off in RR between shortening the robot's trajectory and increasing its map uncertainty.
VI. CONCLUSIONS
In this paper, we addressed the experimental comparison of frontier-based exploration strategies for an autonomous mobile robot that maps an unknown environment. A representative sample of three strategies proposed in the literature has been evaluated in combination with two improvement techniques in a common simulated experimental setting. Some insights obtained from our analysis, like the influence of the function used to combine criteria in evaluating candidate locations, can help in developing better exploration strategies. Our work is intended to constitute another step toward the definition of good experimental methodologies for exploration strategies. In particular, in our experimental framework we used a standard simulation platform in order to support the comparison, reproducibility, and repeatability of experiments. Several additional issues can be considered to improve the experimental framework of this paper. For example, it would be interesting to compare the performance of exploration strategies with that of optimal offline coverage strategies (in which the map is known) [7]. Another issue worth considering is the metric used to measure performance. In this paper, we have considered the traveled distance to account for the energy and time effort, but also the number of map updates and the entropy of the final map can be considered to account for, respectively, the computational effort and the quality of the produced map (which could also involve loop closures). Moreover, the relationships between the performance of the strategies and the particular setting (robot locomotion, speed, etc.) deserve more attention. Finally, extensions to 3D and flying robots are a matter of future work.
REFERENCES
[1] F. Amigoni. Experimental evaluation of some exploration strategies for mobile robots. In Proc. ICRA, pages 2818–2823, 2008.
[2] F. Amigoni and V. Caglioti. An information-based exploration strategy for environment mapping with mobile robots. Robotics and Autonomous Systems, 58(5):684–699, 2010.
[3] F. Amigoni and A. Gallo. A multi-objective exploration strategy for mobile robots. In Proc. ICRA, pages 3861–3866, 2005.
[4] F. Amigoni, M. Reggiani, and V. Schiaffonati. An insightful comparison between experiments in mobile robotics and in science. Autonomous Robots, 27(4):313–325, 2009.
[5] N. Basilico and F. Amigoni. Exploration strategies based on multi-criteria decision making for an autonomous mobile robot. In Proc. ECMR, pages 259–264, 2009.
[6] EURON GEM Sig. http://www.heronrobots.com/EuronGEMSig/, 2007.
[7] Y. Gabriely and E. Rimon. Competitive complexity of mobile robot on-line motion planning problems. International Journal of Computational Geometry and Applications, 20(3):255–283, 2010.
[8] H. González-Baños and J.-C. Latombe. Navigation strategies for exploring indoor environments. International Journal of Robotics Research, 21(10-11):829–848, 2002.
[9] D. Holz, N. Basilico, F. Amigoni, and S. Behnke. Evaluating the efficiency of frontier-based exploration strategies. In Proc. ISR 2010, pages 36–43, 2010.
[10] D. Holz and S. Behnke. Sancta simplicitas – on the efficiency and achievable results of SLAM using ICP-based incremental registration. In Proc. ICRA, pages 1380–1387, 2010.
[11] D. Holz, G. K. Kraetzschmar, and E. Rome. Robust and computationally efficient navigation in domestic environments. In RoboCup 2009: Robot Soccer World Cup XIII, volume 5949/2010 of Lecture Notes in Computer Science, pages 104–115. Springer, Germany, 2009.
[12] D. Lee and M. Recce. Quantitative evaluation of the exploration strategies of a mobile robot. International Journal of Robotics Research, 16:413–447, 1997.
[13] J. Leonard and H. Feder. A computationally efficient method for large-scale concurrent mapping and localization. In Proc. Int'l Symposium on Robotics Research, pages 169–176, 1999.
[14] R. Madhavan, C. Scrapper, and A. Kleiner. Special issue on characterizing mobile robot localization and mapping. Autonomous Robots, 27(4):309–481, 2009.
[15] P. Newman, M. Bosse, and J. Leonard. Autonomous feature-based exploration. In Proc. ICRA, pages 1234–1240, 2003.
[16] C. Stachniss and W. Burgard. Exploring unknown environments with mobile robots using coverage maps. In Proc. IJCAI, pages 1127–1134, 2003.
[17] C. Stachniss, D. Haehnel, and W. Burgard. Exploration with active loop-closing for FastSLAM. In Proc. IROS, pages 1505–1510, 2004.
[18] S. Thrun. Learning metric-topological maps for indoor mobile robot navigation. Artificial Intelligence, 99(1):21–71, 1998.
[19] B. Tovar, L. Munoz-Gomez, R. Murrieta-Cid, M. Alencastre-Miranda, R. Monroy, and S. Hutchinson. Planning exploration strategies for simultaneous localization and mapping. Robotics and Autonomous Systems, 54(4):314–331, 2006.
[20] K. Wurm, C. Stachniss, and W. Burgard. Coordinated multi-robot exploration using a segmentation of the environment. In Proc. IROS, pages 1160–1165, 2008.
[21] B. Yamauchi. A frontier-based approach for autonomous exploration. In Proc. CIRA, pages 146–151, 1997.
People Tracking with Heterogeneous Sensors using JPDAF with Entropy Based Track Management
Srećko Jurić-Kavelj
Ivan Marković
Ivan Petrović
Department of Control and Computer Engineering, University of Zagreb Faculty of Electrical Engineering and Computing, Croatia
Abstract — In this paper we study the problem of tracking an arbitrary number of people with multiple heterogeneous sensors. To solve the problem, we start with a Bayesian derivation of the multiple-hypothesis tracking (MHT), and, under certain assumptions, we arrive at the joint probabilistic data association filter (JPDAF). In their original derivations, both the MHT and the JPDAF assume a multiple-sensor scenario, which enables us to fuse the sensor measurements by asynchronously updating the tracking filters. To solve the data association problem, instead of using the optimal MHT with complex hypothesis branching, we choose the JPDAF, since we are interested only in local observations by a mobile robot for people detection, tracking, and avoidance. However, the JPDAF assumes a constant and known number of objects in the scene, and therefore we propose to extend it with an entropy based track management scheme. The benefits of the proposed approach are that all the required data come from a running filter, and that it can be readily utilized for an arbitrary type of filter, as long as a strong mathematical principle like entropy is tractable for the underlying distribution. The proposed algorithm is implemented for the Kalman and particle filter, and the performance is verified by simulation and experiment. For the simulation purposes, we analyze two generic sensors, a location and a bearing sensor, while in the experiments we use a laser range scanner, a microphone array and an RGB-D camera.
Index Terms — Multi-sensor fusion, 3D sensing, JPDAF, En- tropy
I. INTRODUCTION
The prospects of utilizing measurements from several sensors to infer a system's state are manifold. To begin with, the use of multiple sensors results in increased sensor measurement accuracy, and moreover, additional sensors should never reduce the performance of the optimal estimator [24]. However, in order to ensure this performance, special care must be taken when choosing the process model [29]. Furthermore, system reliability increases with additional sensors, since the system itself becomes more resilient to sensor failure [14]. Therefore, by combining data from multiple sensors, and perhaps related information from associated databases, we can achieve improved accuracies and more specific inferences than by using only a single sensor [9]. With such an approach, we increase the chances of a mobile robot to operate intelligently in a dynamic environment. Sensor measurements may be combined, or fused, at a variety of levels: from the raw data level to the state vector level, or at the decision level [9]. Raw sensor data can be directly combined if the sensor data are commensurate (i.e., if the sensors are measuring the same physical phenomena),
while if the sensor data are noncommensurate, then the sensor data, i.e., sensor information, must be fused at a feature/state vector level or a decision level. Some sensors, like monocular cameras and microphone arrays, can only measure the angle and not the range of the detected objects. Moreover, some sensors can provide measurements at higher rates, thus making sensor fusion an even more challenging problem. A large body of work exists on tracking moving objects with mobile robots. As discussed in [21], two major approaches can be identified, both defined by the sensors. The first approach stems from the field of computer vision and implies a camera as the major sensor, while the second utilizes a laser range sensor (LRS), whose measurements are similar to those of radars and sonars. Since the field of tracking and surveillance (where radars and sonars are commonly used) was well established before mobile robotics, a lot of results [8], [22] from that field were applied to the problem of people tracking with an LRS. The LRS approach can be further subdivided according to data association techniques into deterministic and probabilistic [1], [11], [12], [25] approaches. Additionally, these two sensors can also be used conjointly. E.g., in [4], the nearest neighbour approach and an unscented Kalman filter are used for tracking people with a laser and a camera, while in [16] the authors used Euclidean distance and the covariance intersection method for fusing laser, sonar and camera measurements. When considering multitarget tracking, data association is the fundamental problem. A detailed overview of probabilistic data association techniques is given in [6]. Our previous work [11] was heavily influenced by [1], [25], where the authors use the joint probabilistic data association filter (JPDAF) to solve the data association problem. In [19] the JPDAF is extended to handle multiple data sources (sensors). Such a rigorous approach is questioned when looking at the JPDAF seminal paper [8, Section V, Fig. 2], since the target-sensor geometry indicates that three sonar sensors were used to obtain the measurements. Since in our case the data acquisition happens asynchronously across sensors, we prefer the approach in [8]. The idea is as follows. When the new sensory inputs arrive, predictions about track states are made, and then the JPDAF is used to solve the data association problem. Finally, the track states are updated according to the association probabilities, where the final steps use the likelihood function of the reporting sensor, and that is the only thing required by the JPDAF to handle the multisensor case. Another approach to probabilistic data association is the multiple hypothesis tracking (MHT) developed in a seminal
paper [22] by Reid. It is an optimal solution to the data association problem, unlike the JPDAF. As discussed by Reid himself [22, Section I], the JPDAF is a special case of the MHT, in which only one hypothesis remains after data processing. To be clear, Reid is referring to [3], which is the initial derivation of the JPDAF. Reid already solves the multisensor problem for two different generic types of sensors. He accomplishes that by describing sensors with their detection and false-alarm statistics. Thanks to such an approach, we can use any type of sensor, provided we have its probabilistic description. The downside of the MHT is its high memory and processing requirements (which grow exponentially with the number of tracks). However, an efficient implementation of the MHT is discussed in [7] and some recent applications are presented in [2], [13]. Instead of using the optimal MHT with complex hypothesis branching, we choose the simpler, although not optimal, JPDAF, as it is a very convenient solution for people tracking by a mobile robot for its local navigation [26]. However, the JPDAF assumes a constant and known number of objects in the scene, and therefore we propose to extend it with an entropy based track management algorithm. The proposed approach is tested for, but not limited to, the Kalman and particle filter. Furthermore, both trackers were tested in simulation and experiment.
II. PROBLEM STATEMENT
We consider initialized tracks at time k described by a set of continuous random variables X_k = {x̂_k^1, x̂_k^2, ..., x̂_k^{T_k}}, where T_k denotes the number of tracks. At time k we receive a set Z_k^s = {z_k^{s,1}, z_k^{s,2}, ..., z_k^{s,m_k}} of measurements from sensor s_k, where m_k denotes the number of measurements. The set of all measurements received until and including k is denoted as Z^k = {Z_0^s, ..., Z_k^s}.
Let Θ^{k−1} denote a set of hypotheses about measurement-to-track assignments at time k − 1. Considering a specific hypothesis Θ_{p(h)}^{k−1} at time k − 1, we can construct a legal assignment θ_h(k) for the set of measurements Z_k^s. The resulting hypothesis at time k is denoted as Θ_h^k = {Θ_{p(h)}^{k−1}, θ_h(k)}, where Θ_{p(h)}^{k−1} denotes the parent of the Θ_h^k hypothesis. We can now consider the probability of a hypothesis given the sensor measurements, P(Θ_h^k | Z^k), and calculate it using Bayes' theorem

P(\Theta_h^k \mid Z^k) = \frac{P(\Theta_{p(h)}^{k-1}, \theta_h(k), Z_k^s, Z^{k-1})}{P(Z_k^s, Z^{k-1})} = \frac{P(Z_k^s \mid \Theta_{p(h)}^{k-1}, \theta_h(k), Z^{k-1}) \, P(\theta_h(k) \mid \Theta_{p(h)}^{k-1}, Z^{k-1}) \, P(\Theta_{p(h)}^{k-1} \mid Z^{k-1})}{P(Z_k^s \mid Z^{k-1})}.   (1)
The JPDAF can be viewed as a special case of the MHT, where after each iteration all the considered hypotheses are reduced to a single hypothesis. We denote that hypothesis
at time k as θ(k). Again, the JPDAF considers a possible data assignment hypothesis relative to the hypothesis in the previous time instant k − 1. Calculation of the probability for such a hypothesis reduces to
P(\Theta_h^k \mid Z^k) = \frac{1}{c} \, P(Z_k^s \mid \theta_h(k), \theta(k-1), Z^{k-1}) \cdot P(\theta_h(k) \mid \theta(k-1), Z^{k-1}) \cdot P(\theta(k-1) \mid Z^{k-1}),   (2)
where c is a normalizing constant. Since we stated that the JPDAF is a zero scan data association algorithm (only one hypothesis is left after the measurements processing), the probability of the previous hypothesis P(θ(k − 1) | Z^{k−1}) is equal to one. If no new track hypotheses are considered, then this formulation coincides with the one in [8]. Each hypothesis Θ_h^k contains track states X_k updated using the considered association θ_h(k). Since the JPDAF flattens the hypothesis tree to a single branch θ(k), it contains track states X_k updated using all the measurements, given their association probabilities

\beta_j^t = \sum_{\theta \in \Theta_{jt}^k} P(\theta \mid Z^k),   (3)

where Θ_{jt}^k denotes all the hypotheses that associate measurement j with track t.
III. JOINT PROBABILISTIC DATA ASSOCIATION FILTER
As stated before, the JPDAF considers possible data assignment hypotheses θ_h(k) relative to the singular hypothesis from the previous time instant k − 1, which has unit probability. The probability of a specific hypothesis is given in (2) (with P(θ(k − 1) | Z^{k−1}) = 1). This leaves us to describe the other two terms in (2). The second term, P(θ_h(k) | θ(k − 1), Z^{k−1}), can be modeled as a constant, as in [11], [25]. A more precise derivation of this term can be found in [7], [8], [22]. Now, we only need to develop the first term

P(Z_k^s \mid \theta_h(k), \theta(k-1), Z^{k-1}) = P(Z_k^s \mid \theta_h(k)) = \prod_{j=1}^{m_k} P(z_k^{s,j} \mid \theta_h(k)),   (4)
where P(z_k^{s,j} | θ_h(k)) depends on the measurement-to-track associations made by hypothesis θ_h(k)

P(z_k^{s,j} \mid \theta_h(k)) =
\begin{cases}
P_F^{s_k} \, P(z_k^{s,j} \mid s_k), & \text{false alarm} \\
P_D^{s_k} \, P(z_k^{s,j} \mid \hat{x}_k^t, s_k), & \text{existing track}
\end{cases}   (5)

where P_D^{s_k} is the detection and P_F^{s_k} the false alarm probability for sensor s_k. By inserting (5) in (2) and then inserting the result in (3), we get an expression for the measurement-to-tracked-object association probabilities

\beta_j^t = \frac{1}{c} \sum_{\theta \in \Theta_{jt}^k} \prod_{j=1}^{m_k} P(z_k^{s,j} \mid \theta).   (6)
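To make the association step concrete, the following is a minimal Python/NumPy sketch (not the authors' implementation) of how the association probabilities in (4)–(6) can be evaluated for a small measurement cluster by brute-force enumeration of legal hypotheses; the function name, the constant clutter density and the interface are illustrative assumptions, and gating is omitted.

```python
import itertools
import numpy as np

def jpda_association_probabilities(track_likelihoods, P_D, P_F, clutter_pdf):
    """Brute-force JPDAF association probabilities, eqs. (4)-(6).

    track_likelihoods : (m_k, T) array, entry [j, t] = N(nu_j^t; 0, S_t)
                        (or the particle average likelihood of eq. 14).
    P_D, P_F          : detection and false-alarm probabilities of the sensor.
    clutter_pdf       : density of a false-alarm measurement, P(z | s_k).
    Returns beta      : (m_k, T) array with beta[j, t] = beta_j^t.
    """
    m_k, T = track_likelihoods.shape
    beta = np.zeros((m_k, T))
    c = 0.0
    # A legal hypothesis theta_h(k) assigns every measurement either to clutter (-1)
    # or to a track, with each track receiving at most one measurement.
    for assignment in itertools.product(range(-1, T), repeat=m_k):
        used = [t for t in assignment if t >= 0]
        if len(used) != len(set(used)):
            continue                                   # a track used twice: not a legal hypothesis
        p = 1.0
        for j, t in enumerate(assignment):
            if t < 0:
                p *= P_F * clutter_pdf                 # false-alarm branch of (5)
            else:
                p *= P_D * track_likelihoods[j, t]     # existing-track branch of (5)
        c += p
        for j, t in enumerate(assignment):
            if t >= 0:
                beta[j, t] += p                        # accumulate the numerator of (6)
    return beta / c
```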
The aforementioned association probabilities are used to update the track states (the filtering part). Since the track states are
updated with all the measurements (weighted by their respective probabilities) from their cluster, this is what essentially flattens the hypothesis tree. In other words, after the update, only one hypothesis remains, the one about the current track states. The actual track state update and the implementation of (6) depend on the state estimator (filter) used. In this paper, we present a particle filter, used in our previous work [11], and a Kalman filter, used in the original JPDAF formulation [8].
A. Kalman JPDAF

Given any state estimator, a process model is required. We use a quite general constant velocity model for motion in the 2D plane, where the state

\mathbf{x} = [x \;\; \dot{x} \;\; y \;\; \dot{y}]^T   (7)

is described by position (x, y) and velocity (ẋ, ẏ) in the xy-plane. The model itself is given by

\mathbf{x}_{k+1} = F\mathbf{x}_k + G\mathbf{w}_k =
\begin{bmatrix} 1 & \Delta T_k & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & \Delta T_k \\ 0 & 0 & 0 & 1 \end{bmatrix} \mathbf{x}_k +
\begin{bmatrix} \frac{\Delta T_k^2}{2} & 0 \\ \Delta T_k & 0 \\ 0 & \frac{\Delta T_k^2}{2} \\ 0 & \Delta T_k \end{bmatrix} \mathbf{w}_k,   (8)
where w_k is the process noise and ΔT_k is the update interval. Prediction is calculated using the standard Kalman filter equations

\hat{x}_{k-}^{t} = F \hat{x}_{k-1}^{t}, \qquad P_{k-}^{t} = F P_{k-1}^{t} F^T + G Q G^T.   (9)
Given the innovation vector

\nu_j^t = z_k^{s,j} - H \hat{x}_{k-}^{t},   (10)

and its covariance matrix

S_t = H P_{k-}^{t} H^T + R_{s_k},   (11)

we can define (5) in the case of an existing track association as P(z_k^{s,j} | x_k^t) = N(ν_j^t; 0, S_t). The innovation vector and covariance matrix can be used for measurement gating. Since (ν_j^t)^T S_t^{−1} ν_j^t has a χ² distribution, by using tables we can select an upper limit which includes valid measurements with, e.g., 99% probability. The update is done using all the validated measurements, i.e. the weighted innovation is used for the state update

\nu_t = \sum_{j=1}^{m_k} \beta_j^t \nu_j^t, \qquad \hat{x}_{k}^{t} = \hat{x}_{k-}^{t} + K_k \nu_t.   (12)

Given β_t = 1 − Σ_{j=1}^{m_k} β_j^t and P_{ν_t} = Σ_{j=1}^{m_k} β_j^t ν_j^t (ν_j^t)^T − ν_t ν_t^T, the covariance update is calculated as in [5]

P_{k}^{t} = \beta_t P_{k-}^{t} + (1 - \beta_t)\,[I - K_k H]\, P_{k-}^{t} + K_k P_{\nu_t} K_k^T.   (13)

An important implementation note is that instead of the standard Kalman filter covariance update [I − K_k H] P_{k−}^{t}, we use Joseph's stabilized form [I − K_k H] P_{k−}^{t} [I − K_k H]^T + K_k R_{s_k} K_k^T, since the standard form caused numerical problems.
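The following Python/NumPy sketch (an illustration under the notation above, not the authors' code) shows the per-track prediction (9) and the JPDAF-weighted Kalman update (10)–(13) with the Joseph-stabilized covariance form; the matrices F and G follow (8), while H, Q and R_{s_k} are assumed to be supplied by the sensor model.

```python
import numpy as np

def cv_model(dT):
    """Constant-velocity model matrices of eq. (8) for the state [x, xdot, y, ydot]."""
    F = np.array([[1, dT, 0, 0],
                  [0, 1,  0, 0],
                  [0, 0,  1, dT],
                  [0, 0,  0, 1]], dtype=float)
    G = np.array([[dT**2 / 2, 0],
                  [dT,        0],
                  [0, dT**2 / 2],
                  [0, dT       ]], dtype=float)
    return F, G

def kf_predict(x, P, F, G, Q):
    """Kalman prediction, eq. (9)."""
    return F @ x, F @ P @ F.T + G @ Q @ G.T

def jpda_kf_update(x_pred, P_pred, H, R, measurements, beta):
    """JPDAF update of one track, eqs. (10)-(13); beta[j] = beta_j^t for the
    validated measurements. With no measurements the prediction is returned."""
    if len(measurements) == 0:
        return x_pred, P_pred
    S = H @ P_pred @ H.T + R                              # innovation covariance (11)
    K = P_pred @ H.T @ np.linalg.inv(S)                   # Kalman gain
    nus = [z - H @ x_pred for z in measurements]          # innovations (10)
    nu_t = sum(b * nu for b, nu in zip(beta, nus))        # weighted innovation (12)
    x_upd = x_pred + K @ nu_t
    beta_none = 1.0 - np.sum(beta)                        # probability of no association
    P_nu = sum(b * np.outer(nu, nu) for b, nu in zip(beta, nus)) - np.outer(nu_t, nu_t)
    I = np.eye(len(x_pred))
    # Joseph's stabilized form replaces the plain (I - K H) P_pred term of (13)
    P_joseph = (I - K @ H) @ P_pred @ (I - K @ H).T + K @ R @ K.T
    P_upd = beta_none * P_pred + (1.0 - beta_none) * P_joseph + K @ P_nu @ K.T
    return x_upd, P_upd
```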
B. Particle JPDAF
If a particle filter is to be used as a state estimator, besides the prediction and update of the filter, we also have to modify the gating and hypothesis probability calculation procedures. We model the estimated state as a set of tuples (x̂_k^t(i), w^t(i)), where x̂_k^t(i) is a particle containing a possible track state, and w^t(i) is its associated weight. If a particle falls within the valid measurement region (based on the squared Mahalanobis distance (ν_j^t(i))^T R_{s_k}^{−1} ν_j^t(i)), then we consider the association of measurement j to track t when generating the hypothesis. As for the hypothesis probability calculation, we use the average likelihood of the particles

P(z_k^{s,j} \mid x_k^t) = \frac{1}{N} \sum_{i=1}^{N} N(\nu_j^t(i); 0, R_{s_k}),   (14)

where ν_j^t(i) is the individual particle's innovation. After obtaining the association probabilities, we update the particle weights according to w^t(i) = Σ_{j=1}^{m_k} β_j^t N(ν_j^t(i); 0, R_{s_k}). After the weights calculation, and only if there are measurements in the current time step, we resample the particles.
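A minimal sketch of the particle-filter variant (Python/NumPy, illustrative names and interfaces, not the authors' code): it evaluates the average likelihood (14) for each measurement and then reweights and resamples the particles as described above; h(·) denotes an assumed measurement function of the reporting sensor.

```python
import numpy as np

def particle_jpda_step(particles, z_list, beta, h, R, rng):
    """particles: (N, n) samples of one track; z_list: validated measurements;
    beta[j] = beta_j^t; h maps a state to measurement space; R is the sensor noise."""
    N = len(particles)
    if len(z_list) == 0:
        return particles, np.array([])                      # no measurements: keep the particle set
    d = R.shape[0]
    R_inv = np.linalg.inv(R)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(R))
    predicted = np.array([h(p) for p in particles])         # per-particle predicted measurements
    L = np.empty((len(z_list), N))                          # L[j, i] = N(nu_j^t(i); 0, R)
    for j, z in enumerate(z_list):
        nu = z - predicted                                  # innovations of all particles
        L[j] = norm * np.exp(-0.5 * np.einsum('ik,kl,il->i', nu, R_inv, nu))
    avg_likelihood = L.mean(axis=1)                         # eq. (14), fed to the hypothesis probabilities
    weights = L.T @ np.asarray(beta)                        # w^t(i) = sum_j beta_j^t N(nu_j^t(i); 0, R)
    weights /= weights.sum()
    idx = rng.choice(N, size=N, p=weights)                  # resample only when measurements arrived
    return particles[idx], avg_likelihood
```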
IV. TRACK MANAGEMENT
When tracking multiple targets, track management is practically as important as the association itself. A solution for the Kalman filter, described in [5], is based on a logarithmic
hypothesis ratio and innovation matrix. In [25] a Bayesian
estimator of the number of objects for an LRS is proposed. This approach requires learning the probability of how many features are observed under a presumed number of objects in
the perceptual field of the sensor, while the tracking performance is monitored by an average of the sum of unnormalized sample weights of the particle filter.
We propose to use an entropy measure as a feature in track management. If such a strong mathematical principle is tractable for the underlying probability distribution, then it can be readily utilized for track management independently of the
filtering approach. Furthermore, all the information required
for the entropy calculation is already available in the running filter and sensor model, and, as will be presented, threshold setting is quite convenient.
A practical entropy measure for this task is the quadratic Rényi entropy [23]

H_2(\hat{x}^t) = -\log \int p(\hat{x}^t)^2 \, d\hat{x}^t.   (15)

For the Kalman filter, i.e. Gaussian distributions, an analytical solution exists and is given by H_2(x̂^t) = (n/2) log 4π + (1/2) log |P^t|, where n is the state dimension.
Entropy calculation of continuous random variables is based on the probability density functions (pdfs) of these variables. In order to calculate the entropy of a particle filter, which represents the density by a set of samples rather than by a function, we need a non-parametric method to estimate the pdf. One such method is the Parzen window method [20], which involves placing a kernel function on top of each sample and then evaluating the density
Fig. 1. [Panels: (a) Trajectories, x [m]; (b) KF entropies over t [s], 0–20 s.] Simulation results for the KF with modeled detection probability, false alarms and silent periods – true (dashed) and estimated (solid) track states, and tentative but not confirmed tracks (red + marker). The first two objects were detected and a KF was initialized for each one at 0 s. At 5.9 s they went out of the range of the sensors and the entropy of their KFs kept rising until they were deleted at 6.3 s. The third object was detected at 3.6 s, and at 9.4 s it went behind the robot, thus causing a rise in entropy. At 11 s it was detected again by the bearing sensor, causing an effective drop in entropy. At 15.1 s it moved in front of the robot again and was detected by the location sensor, which significantly lowered the entropy.
as a sum of the kernels. We continue this approach as proposed in [17], [18], and convert each sample to a kernel

K_h(\hat{x}^t) = h^n K(\hat{x}^t),   (16)

where K(·) is the particle set covariance and h > 0 is the scaling parameter. For the kernel, we choose h = (4/(n+2))^e N^{−e}, where e = 1/(n+4), and N is the number of particles. At this point, each track is described as a sum of Gaussian kernels, p(x̂^t) = (1/N) Σ_{i=1}^{N} N(x̂^t(i), 2K_h(x̂^t)), for which an analytical solution for the quadratic Rényi entropy exists [27]

H_2(x^t) = -\log \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} N\!\left(\hat{x}^t(i) - \hat{x}^t(j); 0, 2 K_h(\hat{x}^t)\right).   (17)
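As an illustration of (16)–(17), a small Python/NumPy sketch (assumed interface, not the authors' code) that computes the quadratic Rényi entropy of an equally weighted particle set; for clarity all N² kernel evaluations are carried out, although the symmetry noted below would allow halving that work.

```python
import numpy as np

def renyi_entropy_particles(particles):
    """Quadratic Renyi entropy of a particle set via Gaussian kernels, eqs. (16)-(17).
    particles: (N, n) array of samples of one track state."""
    N, n = particles.shape
    e = 1.0 / (n + 4)
    h = (4.0 / (n + 2)) ** e * N ** (-e)                    # scaling parameter
    K_h = h ** n * np.cov(particles, rowvar=False)          # kernel covariance, eq. (16)
    C = 2.0 * K_h                                           # the kernels in (17) have covariance 2*K_h
    C_inv = np.linalg.inv(C)
    norm = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(C))
    diffs = particles[:, None, :] - particles[None, :, :]   # all pairwise sample differences
    quad = np.einsum('ijk,kl,ijl->ij', diffs, C_inv, diffs)
    kernel_sum = (norm * np.exp(-0.5 * quad)).sum()
    return -np.log(kernel_sum / N ** 2)                     # eq. (17)
```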
Due to symmetry, only half of these kernels need to be evaluated in practice. The track management logic is as follows. When the tracks are initialized, they are considered tentative and the initial entropy is stored. When the entropy of a tentative track drops by 50%, it becomes a confirmed track. If and when the entropy gets 20% larger than the initial entropy, the track is deleted. This logic reflects the fact that if the entropy is rising, we are becoming less and less confident that the track is informative. Furthermore, since no entropy should be greater than the one calculated at the point of the track initialization, we can use this initialization entropy as an appropriate deletion threshold.
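The management rules above can be summarized in a few lines; the sketch below (Python, illustrative class layout) uses the analytical Gaussian entropy for the Kalman filter case and assumes the percentage thresholds are applied to the entropy values in the sense described in the text.

```python
import numpy as np

def renyi_entropy_gaussian(P):
    """Analytical quadratic Renyi entropy of a Gaussian with covariance P."""
    n = P.shape[0]
    return 0.5 * n * np.log(4.0 * np.pi) + 0.5 * np.log(np.linalg.det(P))

class TrackManager:
    """Entropy-based confirm/delete logic: confirm after a 50% drop from the
    initial entropy, delete once the entropy exceeds it by 20%."""
    def __init__(self, initial_entropy):
        self.h_init = initial_entropy      # entropy stored at track initialization
        self.confirmed = False

    def update(self, entropy):
        if not self.confirmed and entropy <= 0.5 * self.h_init:
            self.confirmed = True
        if entropy >= 1.2 * self.h_init:
            return 'delete'
        return 'confirmed' if self.confirmed else 'tentative'
```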
V. SIMULATIONS
In order to test the performance of the algorithm, we generated three intersecting circular trajectories. The robot
was at (0, 0, 90°), the first object started at (2, 1) m and finished at (−0.8, 10) m, the second object started at (−2, 1) m and finished at (0.8, 10) m, while the third object started at (3, 0) m and finished at (−1.6, 2.5) m, thus making more than one
revolution around the mobile robot (Fig. 1a). Each object was tracked in an alternating manner by the location and bearing sensor, while the maximum range for both was kept at 6 m. The location sensor can only track objects in front of the mobile robot, i.e. from 0 to π, and was corrupted with white
Gaussian noise given by N([x y]^T; 0, 0.03 · I). The bearing sensor, on the other hand, can only measure the bearing angle θ of the object, but in the full range around the mobile robot, i.e. from 0 to ±π, and was also corrupted with white Gaussian noise given by N(θ; 0, 3°). Furthermore, for both sensors each measurement had a
detection probability of P_D = 0.9, and the probability of a false alarm was P_F = 0.01. Since the bearing sensor models a microphone array, it is logical to assume that the speaker will have pauses while talking, thus resulting in longer periods of absent measurements. This was modeled by placing a random number of pauses of maximum length of 2 seconds at random time instances. Although the assumption about the talking speaker might not be realistic for everyday scenarios, we find it important to analyze the performance of a bearing-only sensor in such a multisensor system. The tracks can only be initialized by the location sensor, but the existing tracks should be kept by the bearing sensor when the object moves behind the robot. Since in this case the entropy is substantially larger, it requires the calculation of a bearing-only initialization entropy in order to efficiently manage the case when the object is behind the robot. In Fig. 1 we show KF simulation results with added detection probability, false alarms, and silent speaker periods, from which we can see that there were several false tracks initiated but never confirmed. Furthermore, we made 100 Monte Carlo (MC) runs for the Kalman filter – on average there were 11.43 initialized tracks, 7.95 tentative tracks, and 3.48 confirmed tracks. In an ideal situation we would have had three confirmed tracks, but taking the scenario into account we can conclude that the
Fig. 2. [Panels: (a) Trajectories, y [m]; (b) PF entropies over t [s].] Simulation results for the PF with modeled detection probability, false alarms and silent periods – true (dashed) and estimated (solid) track states, and tentative but not confirmed tracks (red + marker). The objects' trajectories were the same as in the case of the KF. The main difference is in the third object's estimated trajectory: when the object moved behind the robot, there were bearing measurements up to 13.8 seconds, when a silent period started. The entropy kept rising and the object was deleted before new measurements appeared. The object moved in front of the robot at 15.1 s and consequently a new filter was initialized.
algorithm performs well when it comes to tracking, association and track management. The results of the simulation for the PF with added detection probability, false alarms, and silent speaker periods are shown in Fig. 2. Furthermore, we also made 100 Monte Carlo (MC) runs for the particle filter – on average there were 8.17 initialized, 3.22 tentative, and 4.95 confirmed tracks. Although the average number of confirmed tracks was larger than in the case of the Kalman filter, we still find it to be of acceptable performance. Simulations were performed on a machine running at 2.33 GHz with an unoptimized Matlab implementation. The average computational time of each iteration was 1.9 ms and 137.2 ms for the KF and PF, respectively. Time spent on the entropy calculation was 0.02 ms and 88.6 ms for the KF and PF, respectively.
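For reference, the two simulated sensors used above could be generated roughly as follows (Python sketch with assumed helper names; the noise levels, maximum range, P_D and P_F follow the text, while the front-of-robot check assumes the 90° heading of the simulated robot).

```python
import numpy as np

P_D, P_F, MAX_RANGE = 0.9, 0.01, 6.0
rng = np.random.default_rng()

def location_measurement(obj_xy, robot_xy):
    """Position measurement; only objects in front of the robot (here: y >= 0) are seen."""
    rel = np.asarray(obj_xy) - np.asarray(robot_xy)
    if np.linalg.norm(rel) > MAX_RANGE or rel[1] < 0.0 or rng.random() > P_D:
        return None                                   # out of range, behind the robot, or missed
    return rel + rng.multivariate_normal(np.zeros(2), 0.03 * np.eye(2))

def bearing_measurement(obj_xy, robot_xy, silent=False):
    """Bearing-only measurement (microphone array), full field of view, silent periods allowed."""
    rel = np.asarray(obj_xy) - np.asarray(robot_xy)
    if silent or np.linalg.norm(rel) > MAX_RANGE or rng.random() > P_D:
        return None
    return np.arctan2(rel[1], rel[0]) + rng.normal(0.0, np.deg2rad(3.0))

def false_alarm():
    """With probability P_F, a uniformly placed clutter measurement within range."""
    if rng.random() >= P_F:
        return None
    r, phi = MAX_RANGE * np.sqrt(rng.random()), rng.uniform(-np.pi, np.pi)
    return np.array([r * np.cos(phi), r * np.sin(phi)])
```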
VI. EXPERIMENTS
To further test the proposed approach, we conducted experiments with our Pioneer 3-DX robot. The laser sensor was the Sick LMS 200 model, while the microphone array is of our own design. Furthermore, since the proposed framework is easily extended to multiple sensors, we also used the Kinect RGB-D camera with a face recognition algorithm based on [28] to yield a set of measurements in 3D. In the experiment two people were walking in intersecting trajectories in front of the robot (a snapshot of the experiment is shown in Fig. 3). The results are shown in Fig. 4, from which we can see that the first person (blue line) started at (−1.2, 2.3) m and finished at (0.9, 2.3) m, while the second person (green line) started at (0.7, 0) m and finished at (0.6, 0) m. The first person was in the field-of-view (FOV) of all three sensors and was talking throughout the experiment, while the second person entered the LRS FOV at a later time, kept quiet and was facing the robot only in the second half of the trajectory. Tracks were
Fig. 3. A snapshot of the data acquisition and signal processing for the experiments. The measurements were classified and collected based on our previous work [10], [11], [15], with only the signal processing stage done, i.e. no tracking was performed on the sensor level.
correctly initialized and maintained, despite the large number of false alarms. The second track was deleted shortly after the second person left the LRS FOV.
VII. CONCLUSION
In the present work we addressed the problem of tracking multiple objects with multiple heterogeneous sensors – specifically an LRS, a microphone array, and an RGB-D camera. The integration of multiple sensors is solved by asynchronously updating the tracking filters as new data arrive. We solved the data association problem by applying the JPDAF, which is a suboptimal zero-scan derivation of the MHT, but which in effect assumes a known number of objects. To circumvent this assumption, we proposed an entropy based track management
Fig. 4. [Panels: (a) People trajectories, x [m]; (b) KF entropies over t [s].] Experimental results for the KF – estimated (solid) track states, and tentative but not confirmed tracks (red + marker). The first object was in the scene from the beginning, while the second object entered the scene at 7.5 s. At 15 s the second object got occluded by the first, which caused an increase in entropy, while at 30 s the second object occluded the first shortly before exiting the scene. The false alarms were caused by tiles on the wall and leg-like features in the room (chairs and tables).
scheme, and demonstrated its performance for the Kalman and particle filter both in simulation and experiment. The results showed that the proposed algorithm is capable of maintaining a viable number of filters with correct association and accurate tracking.
ACKNOWLEDGMENT
This work was supported by the Ministry of Science, Education and Sports of the Republic of Croatia under grant No. 036-0363078-3018.
REFERENCES
[1] A. Almeida, J. Almeida, and R. Araujo.´ Real-time Tracking of Multiple Moving Objects Using Particle Filters and Probabilistic Data Association. AUTOMATIKA Journal, 46(1-2):39–48, 2005. [2] K.O. Arras, S. Grzonka, M. Luber, and W. Burgard. Efficient people tracking in laser range data using a multi-hypothesis leg-tracker with adaptive occlusion probabilities. In IEEE ICRA, pages 1710–1715. IEEE, 2008. [3] Y. Bar-Shalom. Extension of the probabilistic data association filter to multi-target environment. In Proceedings of the 5th Symposium on Nonlinear Estimation, pages 16–21, San Diego, CA, 1974. [4] N. Bellotto and H. Hu. Vision and Laser Data Fusion for Tracking People with a Mobile Robot. In Proceedings of the IEEE International Conference on Robotics and Biomimetics, pages 7–12, 2006. [5] S. Blackman and R. Popoli. Design and Analysis of Modern Tracking Systems (Artech House Radar Library). Artech House Publishers, 1999. [6] I.J. Cox. A review of statistical data association techniques for motion correspondence. International Journal of CV, 10(1):53–66, 1993. [7] I.J. Cox and S.L. Hingorani. An Efficient Implementation of Reid’s Multiple Hypothesis Tracking Algorithm and its Evaluation for the Purpose of Visual Tracking. Trans. on PAMI, 18(2):138–150, 1996. [8] T. Fortmann, Y. Bar-Shalom, and M. Scheffe. Sonar tracking of multiple targets using joint probabilistic data association. IEEE Journal of Oceanic Engineering, 8(3):173–184, July 1983. [9] D.L. Hall and J. Llinas. An Introduction to Multisensor Data Fusion. Proceedings of the IEEE, 85(1):6–23, 1997. [10] S. Juric-Kavelj´ and I. Petrovic.´ Experimental comparison of AdaBoost algorithms applied on leg detection with different range sensor setups. In 19th International Workshop on RAAD, pages 267–272. IEEE, 2010. [11] S. Juric-Kavelj,´ M. Seder, and I. Petrovic.´ Tracking Multiple Moving Objects Using Adaptive Sample-based Joint Probabilistic Data Associ- ation Filter. In Proceedings of 5th International Conference on CIRAS, pages 99–104, Linz, Austria, 2008.
[12] A. Kraußling¨ and D. Schulz. Tracking extended targets—a switching algorithm versus the SJPDAF. In Proc. of FUSION, pages 1–8, Florence, Italy, 2006. [13] M. Luber, G.D. Tipaldi, and K.O. Arras. Spatially grounded multi- hypothesis tracking of people. In Workshop on People Detection and Tracking, 2009 IEEE ICRA, Kobe, Japan, 2009. [14] R. Luo and M. Kay. Multisensor Integration and Fusion in Intelli- gent Systems. IEEE Transactions on Systems, Man, and Cybernetics, 19(5):901–931, 1989. [15] I. Markovic´ and I. Petrovic.´ Applying von Mises Distribution to Microphone Array Probabilistic Sensor Modelling. In International Symposium on Robotics (ISR’10), Munchen,¨ Germany, 2010. in print. [16] C Martin, E Schaffernicht, A Scheidig, and H.-M. Gross. Multi-Modal Sensor Fusion using a Probabilistic Aggregation Scheme for People Detection and Tracking. Robotics and Autonomous Systems, 54(9):721– 728, September 2006. [17] C. Musso, N. Oudjane, and F. Le Gland. Improving Regularised Particle Filters, pages 247–272. Springer-Verlag, 2001. [18] L.-L.S. Ong. Non-Gaussian Representations for Decentralised Bayesian Estimation. PhD thesis, The University of Sydney, 2007. [19] L.Y. Pao and S.D. O’Neil. Multisensor fusion algorithms for tracking. In Proceedings of the ACC, pages 859–863, San Francisco, CA, 1993. [20] E. Parzen. On Estimation of a Probability Density Function and Mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
[21] E. Prassler, J. Scholz, and A. Elfes. Tracking multiple moving objects for real-time robot navigation. Autonomous Robots, 8(2):105–116, 2000. [22] D. Reid. An algorithm for tracking multiple targets. IEEE Transactions on Automatic Control, 24(6):843–854, December 1979. [23] A. Renyi.´ Probability Theory. North-Holland, London, 1970. [24] J. Richardson and K. Marsh. Fusion of Multisensor Data. The International Journal of Robotics Research, 7(6):78–96, 1988.
[25] D. Schulz, W. Burgard, D. Fox, and A.B. Cremers. People Tracking with
Mobile Robots Using Sample-Based Joint Probabilistic Data Association Filters. The International JRR, 22(2):99–116, February 2003. [26] M. Seder and I. Petrovic.´ Dynamic window based approach to mobile robot motion control in the presence of moving obstacles. In Proceedings of IEEE ICRA, pages 1986–1991, Roma, Italy, 2007. [27] K. Torkkola. Feature Extraction by Non-Parametric Mutual Information Maximization. Journal of Machine Learning Research, 3(7-8):1415– 1438, October 2003. [28] P. Viola and M. Jones. Robust real-time object detection. International Journal of Computer Vision, 2002. [29] G.A. Watson and W.D. Blair. Tracking Maneuvering Targets with Multiple Sensors Using the Interacting Multiple Model Algorithm. Proceedings of SPIE, 1954:438–449, 1993.
Multi-Robot Cooperative Object Tracking Based on Particle Filters
Aamir Ahmad∗   Pedro Lima∗
∗ Institute for Systems and Robotics, Instituto Superior Técnico, Lisboa, Portugal
Abstract — This article presents a cooperative approach for tracking a moving object by a team of mobile robots equipped with sensors, in a highly dynamic environment. The tracker’s
core is a particle filter, modified to handle, within a single unified framework, the problem of complete or partial occlusion for some
of the involved mobile sensors, as well as inconsistent estimates in
the global frame among sensors, due to observation errors and/or self-localization uncertainty. We present results supporting our approach by applying it to a team of real soccer robots tracking
a soccer ball.
I. INTRODUCTION AND RELATED WORK
This paper deals with cooperative tracking of an object by a team of robots. Object tracking is a field of research with multiple techniques being researched and developed
extensively [1]. In recent years RoboCup Soccer has laid down
a common platform for various research areas in robotics,
object tracking being a predominant and crucial one. This involves tracking the soccer ball by the robots during the game play. The complexity of tracking has risen from small, orange colored balls to standard sized, random/multi-colored balls and from 2D to 3D [2]. The problem can be formulated as tracking a moving object of known dimensions by a moving robot. We use RoboCup Soccer as an ideal testbed for novel methods that can be used outside soccer.
Particle Filters (PF) are one of the most popular methods
employed for tracking [7]. PF is a non-parametric filter. Non-parametric filters can efficiently handle multi-modal beliefs.
In a generic tracker, the motion model of the object being tracked can be completely unknown and might change over time; hence, using a parametric filter can often lead to failures. This is because if we use any standard motion model for the object, the tracker can quickly end up with low confidence in the posterior when the object motion changes to a different model or switches randomly. This makes it essential to have beliefs with multiple modes scattered over the whole state space, which makes the use of a non-parametric filter appropriate. In the RoboCup scenario, PF based trackers are dominant tools currently being used by most of the teams. An interesting approach fusing the Extended Kalman Filter (EKF) and a Monte Carlo PF has been described in [4], where an integrated self-localization and ball tracking method is presented. In [5] a method for simultaneously estimating ball position and velocity using Monte Carlo Localization (MCL) is developed. An efficient implementation of a Rao-Blackwellised PF, which was successfully demonstrated on Sony AIBO robots in the four-legged league of RoboCup, is presented in [7]. None
This work was supported by FCT (ISR/IST plurianual funding) through the PIDDAC Program funds, and by project FCT PTDC/EEA-CRO/100692/2008 PCMMC.
of these works use the information from more than one sensor/robot, therefore being less robust to occlusion and very dependent on the relative state of the robot and the object tracked. The field of sensor fusion, including its use for single and multiple target tracking [12,13], is now very mature. However,
it frequently addresses situations where the sensors are static, know their location in a global frame with no uncertainty, and occlusions occur rarely. When sensors are mobile, e.g., mounted on the top of mobile robots, their knowledge of their own localization may degrade over time and/or during time periods due to a number of reasons (e.g., absence of known environment features, bad odometry), and this impacts the uncertainty in the determination of the target position in the global frame, where it is fused with the estimates from the other sensors. Furthermore, occlusions can occur more frequently, as they are due not only to the target(s) path but also to the motion of the different sensors/robots. Therefore, the problem of cooperatively tracking a moving object by a team of mobile sensors is an extension of sensor
fusion, designated here as cooperative perception, in which one has to handle occlusions, disagreements between sensors, and dynamic changes of the observation models due to frequent spatial changes. Efficient solutions for multiple static platforms and a moving target [14] or a single moving platform and moving target(s) [15] have been introduced. Our approach to combine both challenges, i.e., track a moving target using multiple moving platforms, is novel. In [10] the relationship between fixed world objects and moving objects is explored for global object localization. These relationships are communicated to teammates where they form a set of constrained relations, the solution of which gives object
location estimates. The authors in [6] present a cooperative PF based tracker for Sony AIBO robots, where the fusion of information involves communicating a reduced set of particles between the robots over the wireless network, which still remains a huge data set, causing inefficient communication. Our approach overcomes this problem, as explained further
in this paper. Outside the RoboCup scenario, in [11] a new
cooperative perception architecture is developed and tested on multiple UAVs for forest fire detection and localization. A substantial effort is put into developing the fire detector and the fusion of data from various sensors used on board single and multiple UAVs. The errors that creep in due to the self-localization of the UAVs themselves are unaccounted for, which we address in our work. In [12,13] a decentralized PF for multiple target tracking is developed and deployed on flight vehicles. The communication bandwidth problem is solved by transforming the particle
set into a Gaussian Mixture Model (GMM), which seems to be an efficient way. In our work we communicate a single parametrized observation probability density function between two robots. This not only further reduces the bandwidth usage but also prevents the prediction model errors of the PF from being propagated to teammates, which happens when particles (or a parametrized form of them) are shared. Our work builds mainly upon [2] and [3], carried out in the direction of object tracking and sensor fusion among teammates, respectively. In [2], a PF based tracker is presented with a unique and novel 3-D observation model based on color histogram matching. Each robot has an individual tracker and its most notable feature is that the tracking can be performed in 3-D space without the object color information, but at the cost of computational expense. In [3] a sensor fusion technique for cooperative object localization using particle filters is presented. Parameters of a GMM approximating a teammate's tracker's particles are communicated to the other robots. Particles at a robot's tracker are then sampled using its own belief and the received GMM. In this paper we introduce an approach to cooperative perception where we implement a PF-based tracker. For each observing robot (i.e., a mobile robot with a sensor), we determine confidence factors associated to the tracked target from two origins: i) the confidence on the observation itself and ii) the confidence on the self-localization estimate of the observing robot. Note that the self-localization method itself is completely decoupled from the PF-based tracker. The observation model of each mobile sensor is a parametrized probability density function (e.g., a Gaussian centered on the observation). The probability density functions associated to the observations of the team robots are shared by all of them in a pool. Each robot selects the best function, i.e., the one with higher confidence factors, from the pool, and uses it to assign weights to the particles in the traditional PF update step. The parametrization of the observation models intends to significantly reduce the amount of data communicated to teammates, since the probability density function can be univocally represented by its communicated parameters. The method handles, within a single unified framework, inconsistencies (disagreements) between sensors due to observation errors and/or self-localization uncertainty. In order to achieve near real-time tracking, we track the object in 2-D space only and use the object color information. These will be relaxed in future work, as they depend mostly on the available computing power.
II. PARTICLE FILTER BASED TRACKER
In this section we briefly explain a standard PF and how it is used to construct a tracker. Our tracker will estimate the 2-D pose of an object assumed to be moving on a known ground plane. The state vector of the object will be denoted by x t . We assume that the object’s state evolution over time is an unobserved Markov process with a uniform initial distribution p(x 0 ) and a transition distribution p(x t |x t−1 ). The observations {y t ; t ∈ N} are conditionally independent given the process {x t ; t ∈ N} with distribution p(y t |x t ).
For the estimation of the a posteriori distribution, i.e. the belief of the state given all observations p(x_t | y_{1:t}), the problem under the Markov assumption is formulated as:

p(x_t \mid y_{1:t}) \propto p(y_t \mid x_t) \int p(x_t \mid x_{t-1}, u_t)\, p(x_{t-1} \mid y_{1:t-1})\, dx_{t-1}   (1)
In the rest of paper we refer to p(x t |x t−1 ) as the motion model and to p(y t |x t ) as the observation model. u t is the odometry input to the motion model. For solving the problem as formulated above we use a PF- based tracker. A PF is a non-parametric Bayesian filter where,
contrary to a parametric Bayesian filter, the state's probability distribution is represented by a set of M weighted particles X_t = \{x_t^i, w_t^i\}_{i=1}^{M}, where each particle is an instantiation of the state itself [8].
III. THE COOPERATIVE TRACKER
In this section we present our PF-based cooperative tracker algorithm (see Algorithm 1) which involves the classical PF augmented with the fusion step which we introduce as a novel contribution.
Algorithm 1 PF Cooperative Tracker(X_{t−1}, u_t, M, k, N)
  X̄_t = X_t = ∅
  for m = 1 to M do
      x_t^[m] = sample_motion_model(u_t, x_{t−1}^[m])
  end for
  {The Fusion step starts here}
  Perform sensor observation and generate Observation Model M_{k_local}
  M_{k_local} → M_{k_world}   {Frame transformation using self posture. In the rest of the algorithm the observation models' parameters are in the world frame and are denoted by M_k}
  Compute and send M_k, C(M_k), C_LOC^k to teammates.
  α_k = C(M_k)   {Observation confidence for robot k}
  for r = 1 to N do
      if r ≠ k then
          receive M_r, C(M_r), C_LOC^r from R_r
          α_r = C(M_r) · C_LOC^r
      end if
  end for
  for r = 1 to N do
      α_r = α_r / Σ_{r=1}^{N} α_r   {Weight normalization step}
  end for
  for m = 1 to M do
      i = r; r ∈ [1 : N] sampled with probability α_r
      w_t^[m] ∼ M_i
      X̄_t = X̄_t + ⟨x_t^[m], w_t^[m]⟩
  end for
  {The Fusion step ends here}
  X_t = Low_Variance_Sampler(X̄_t)
  return X_t
The prediction step and the resampling step of the PF based cooperative tracker inherit directly from the original PF [8]. We introduce a fusion step which modifies the observation
model’s mechanism which in turn modifies the way in which the particles are assigned weights. The first input to the algorithm, X t−1 is the set of particles, which initially could be distributed on the state space accord- ing to any choice. Here we consider a uniform distribution. u t denotes the input to the motion model of the tracked object. M is the total number of particles which depends on the available computational resource. The larger the value of M , the better is the approximation of the target probability density function (PDF). A standard practice is to keep M close to the order of 500 − 1000. k is the robot number on which the algorithm is running in a team of N robots. We denote the r th robot as R r . First, we assume that each robot can communicate with every other robot in the team. After the prediction step of PF the robot makes an observation and generates an observation PDF, denoted by M, approximating the observation over the whole state space. This PDF could be any parametrized function in general based on the model chosen. M k is the observation model PDF by robot k, which can be reduced to the parameters of the PDF representing the observation model, e.g., mean and covariance matrix for the Gaussian case. C(M k ) represents the robot k’s confidence on its observation which can be calculated in various ways depending on the implementation and scenario. We calculate C (M k ) as :
C(M_k) = \frac{A_{obs}}{A_{exp}},   (2)

where A_obs is the area of the tracked object observed in the camera image and A_exp is the expected area of the object, since we know a priori the real dimensions of the tracked object. C_LOC^k represents robot k's confidence on its own localization. The self-localization confidence factor is determined from the particle filter set. One good approach to do this is to consider the number of effective particles n_eff^k [8]

C_{LOC}^k = n_{eff}^k = \frac{1}{\sum_{m=1}^{M} (w_t^{[m]})^2},   (3)
which assumes normalized weights of the particles. The observation PDF’s parameters are converted into the world frame at this point, using the robot’s self-localization estimate. Next, we receive M r , C(M r ), C LOC r from every other robot r in the team. It is important to note here that these parameters obtained from the teammate robots are already expressed in the world frame. This leads us to form a set P k = {M r | 1 r N } which we call an observation model pool (OMP) for the robot k in the world frame whose elements represent each robot’s individual observation PDF. We now associate a weight to each element in P k as mentioned in Algorithm 1. For the robot k’s OMP, only the elements due to other teammates are weighed using their self-localization confidence except the element due to k’s own observation which is weighed only by its observation confidence. The reason for this will become clear later.
\alpha_k = C(M_k)   (4)

\alpha_r = C(M_r) \cdot C_{LOC}^r, \qquad 1 \le r \le N;\; r \ne k   (5)

Furthermore, we normalize the weights to:

\alpha_r = \frac{\alpha_r}{\sum_{r=1}^{N} \alpha_r}, \qquad 1 \le r \le N   (6)
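A small Python sketch of how the confidence factors (2)–(3) and the normalized OMP weights (4)–(6) could be computed; the function names and argument layout are assumptions, not the authors' code.

```python
import numpy as np

def observation_confidence(area_observed, area_expected):
    """C(M_k) = A_obs / A_exp, eq. (2)."""
    return area_observed / area_expected

def localization_confidence(loc_weights):
    """Effective number of particles of the self-localization filter, eq. (3);
    the weights are assumed to be normalized."""
    w = np.asarray(loc_weights)
    return 1.0 / np.sum(w ** 2)

def omp_weights(own_obs_conf, mate_obs_confs, mate_loc_confs):
    """Eqs. (4)-(6): the robot's own model is weighted by its observation confidence
    only, teammates' models by observation times self-localization confidence."""
    alphas = np.array([own_obs_conf] +
                      [co * cl for co, cl in zip(mate_obs_confs, mate_loc_confs)])
    return alphas / alphas.sum()
```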
The most crucial step of the fusion is here. For every particle that we have after the prediction step we do the following:
Step 1: Sample an element M_i from the OMP P_k with probability α_i. Recall that M_i is the parametrized observation PDF generated by robot i's observation of the target.
Step 2: Calculate the weight of the particle using the PDF represented by M_i sampled in the previous step. For instance, if M_i is parametrized as a Gaussian over the target space, the closer the particle lies to the mean of this Gaussian, the higher its weight.
These two steps are further clarified in Fig. 1 with the help of an example OMP.
Fig. 1. For an m-th particle x_t^[m] at time step t, M_3 is sampled with a probability α_3 from the OMP, which consists of the observation model PDFs M_1 · · · M_4, shown here as Gaussians for the sake of an example (but in practice they can be any parametrized PDF). M_3 is then used to generate the weight w_t^[m] of the particle x_t^[m]. Sampling M and then using it for generating the particle's weight is done for every particle at each time step.
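In code, the two steps reduce to the following sketch (Python/NumPy, illustrative interface): the OMP elements are assumed Gaussian over the 2-D target position, as in the example of Fig. 1, and one element is drawn per particle before weighting.

```python
import numpy as np

def fusion_step(particles, alphas, omp_means, omp_covs, rng):
    """particles: (M, 2) predicted target positions; alphas: normalized OMP weights;
    omp_means/omp_covs: observation model parameters of all N robots (world frame)."""
    M = len(particles)
    weights = np.empty(M)
    picks = rng.choice(len(alphas), size=M, p=alphas)      # Step 1: sample M_i per particle
    for m, i in enumerate(picks):
        d = particles[m] - omp_means[i]
        cov = omp_covs[i]
        weights[m] = np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / \
            np.sqrt((2 * np.pi) ** len(d) * np.linalg.det(cov))   # Step 2: weight with M_i
    return weights / weights.sum()
```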
Since the above two steps are performed for every particle
at a given iteration of Algorithm 1, it fuses the information from all the elements of the OMP according to their respective
importance. After this, resampling is performed which can
be done by any established method. The association of OMP elements to particles is done in every iteration of the PF-based tracker. Due to the sampling process in step 1 as mentioned above a few particles get associated with low weight elements of OMP to maintain the spread of the particles. The most relevant problem which this PF-based cooperative tracker solves is that of partial or complete occlusion. A partially occluded object’s observation by a robot k may lead to a low confidence in its own observation PDF, but in case the same object is being fully observed by another teammate robot r, it will lead to a high confidence in the element contributed by it to P k , and hence a greater chunk of particles will get associated to it. This way the robot k would still be able to make a good approximation of the target distribution. The OMP elements refer to the world frame. If a robot is wrongly localized, its observation of the object in the world frame will be incoherent with another correctly localized robot’s observation of the same object in the world frame, although both might be observing the object correctly in their respective local frames. The incoherency here is due to the frame transformation carried out by the wrongly
localized robot, of its observation in the local frame to the global frame. In our approach this problem is solved by weighting the OMP elements by the associated robot's self-localization confidence multiplied by its observation confidence. Also, this is done differently for the recipient robot and the sender robots (4), (5). By doing this we ensure two major advantages: i) a wrongly localized robot's good observation hardly influences other robots' OMPs; ii) a wrongly localized robot would still have a high confidence in its own good observation, which is necessary if we want to carry out visual tracking in its local frame. It will not be affected by the observations of well localized teammates which are incoherent with the wrongly localized robot's local frame. This enables the robot to keep tracking the object with its local information during visual tracking without relying on incorrect global frame information. We do not have to deal with the robot's motion directly. This is taken care of in situ when we construct the OMP in the world frame, using the self-posture of the robot, which inherently involves the odometry or motion update of the robot, assuming that the robot's acceleration is not very high. Otherwise the frame transformation functions should include the non-inertial terms. Robot motion control is handled under the same low-acceleration assumption.
IV. IMPLEMENTATION AND RESULTS
A. Test Bed
We applied the proposed algorithm to the robot soccer scenario. Our testbed is the RoboCup Middle Size League (MSL) robot team. In robot soccer, one of the basic necessities is to continuously track the ball. A major concern here is that the robots tend to lose their localization on the soccer field because of low-range vision and field symmetry. Moreover, since the field is large compared to a robot's camera vision field, a ball far from the robot (> 5 m) is scarcely visible, and very often a nearby lying ball is occluded by a teammate or an opponent robot. Thus it becomes a very interesting and appropriate testbed for our proposed algorithm.
Fig. 2. Omnidirectional robots of the MSL on the soccer field
Each of our 5 fully autonomous robots (Fig. 2) in the team is characterized by an omnidirectional drive and a dioptric vision system consisting of a fish-eye lens facing downwards. All high level algorithms are coded using the C/C++ programming language and run on a Pentium IV 1.6 GHz CPU laptop always mounted on the robot itself. All implementations and results presented further involve tracking the ball by 4 of these robots in one half of the MSL field.
B. Implementation
In our implementation, we model each individual robot's observation of the ball as a bivariate Gaussian distribution over the soccer field. The confidence of the observation is calculated based on the observed ball size and the expected ball size as mentioned in (2). This is because in the image frame the expected ball size is a fixed function of the distance of the ball (fixed radius = 10 cm) from the image center. Images from the camera are streamed at 15 fps. For the prediction step of the tracker we use the same approach (7) as in [2]
X_{t+1} = \begin{bmatrix} I & (\Delta t) I \\ 0 & I \end{bmatrix} X_t + \begin{bmatrix} \frac{\Delta t^2}{2} I \\ (\Delta t) I \end{bmatrix} a_t,   (7)
which is a constant velocity model with normally distributed acceleration noise about zero mean. I is a 3×3 identity matrix, ∆t = 1, and a t is a 3 × 1 white zero mean random vector corresponding to an acceleration disturbance. For resampling we apply the Low Variance Sampling method [8].
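For illustration, the prediction (7) and the low variance resampling of [8] could be sketched as follows (Python/NumPy; the state is assumed to be stacked as a 3-D position followed by a 3-D velocity, and the acceleration noise level is a placeholder).

```python
import numpy as np

def predict_particles(X, dt, accel_std, rng):
    """Constant-velocity prediction of eq. (7); X is (M, 6) = [position, velocity]."""
    a = rng.normal(0.0, accel_std, size=(X.shape[0], 3))   # zero-mean acceleration disturbance a_t
    pos, vel = X[:, :3], X[:, 3:]
    return np.hstack([pos + dt * vel + 0.5 * dt ** 2 * a,
                      vel + dt * a])

def low_variance_sampler(particles, weights, rng):
    """Low variance (systematic) resampling as described in [8]."""
    M = len(particles)
    r = rng.uniform(0.0, 1.0 / M)
    c, i = weights[0], 0
    out = np.empty_like(particles)
    for m in range(M):
        U = r + m / M
        while U > c:
            i += 1
            c += weights[i]
        out[m] = particles[i]
    return out
```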
C. Robot Localization
Since the PF based tracker is closely interlaced with the localization uncertainties of individual robots, it is worth mentioning that the localization method implemented here is Monte Carlo Localization, and is independent of the ball tracking particle filter.
D. Results
In this sub-section we present a set of plots for each experiment we made using 4 robots, represented as R_1 − R_4. In Figs. 3 and 5 each robot's self-localization confidence and each individual robot's ball observation confidence are shown. The plots in Figs. 4 and 6 represent the spatial variance and the variance of the weights of the particles at every iteration step for the tracker running on R_3 in each experiment, which are calculated as in (8) and (9). Every iteration of the PF-based tracker, consisting of the prediction step and the observation-fusion-update step, takes ∼ 0.1 second. Note that comparing with the ground truth is rather difficult in scenarios like this because an overhead field camera will have tracking errors of its own.
\sigma_{spatial}^2 = \frac{1}{M} \sum_{m=1}^{M} \left\| x_t^{[m]} - \bar{x}_t \right\|^2,   (8)

\sigma_{weight}^2 = \frac{1}{M} \sum_{m=1}^{M} \left( w_t^{[m]} \right)^2 - \left( \frac{1}{M} \sum_{m=1}^{M} w_t^{[m]} \right)^2.   (9)
In the accompanying video [http://www.youtube.com/watch?v=m_a4DuWagYI], the left side of the frame shows the self-localization estimates of all robots and R_3's estimate of the tracked ball's global position. The right side of the frame shows the simultaneously shot overhead footage of the experiment, in which R_3 can be seen reaching for the ball.
Fig. 3. Confidence plots of experiment 1 (each iteration is performed every 0.1 second); each panel shows a confidence value between 0 and 1 versus the iteration number.
1) Experiment 1: In this experiment we have 3 robots which are forced to stay at their initial positions on the field and R 3 is supposed to reach for the ball. Tracking by R 3 has been analyzed in the plots of Fig. 4.
Confidence plots for R_3 in Fig. 3 show that intermittently between iterations 100 and 400 the ball was not directly observed by it. During this period, the ball was directly observed by either R_1, R_2 or R_4 (see other plots in Fig. 3). As a result of the fusion of the observation models from teammates, as explained previously in this paper, the cooperative tracker on R_3 was able
to track the ball consistently during this period with only a few
breaks where it momentarily lost the ball. This is supported by the low spatial variance of R_3's tracker particles and their converged weight variance in Fig. 4, with a few small periods of peaks in the variances denoting instances when the ball was completely lost. Also, the robot kept following the correct ball, as visible in the video accompanying this paper. The spatial variance of the particles in R_3's tracker reaches a high peak for a small duration after iteration 300 in Fig. 4, which indicates that the ball was not tracked during that period. This was not only because the ball was lost from R_2, R_3 and R_4's vision fields, but also because only R_1 was observing the ball, with a low observation confidence (∼ 0.4). During most of this experiment all four robots stayed well localized.
Fig. 4. Spatial variance and variance of the weights of the particles of the cooperative tracker at R_3 in experiment 1 (plotted against the iteration number).
2) Experiment 2: Similar to experiment 1, in this setup too there are 3 static robots (R 1 , R 2 , R 4 ) and one dynamic robot
(R 3 ). The major difference here is that R 1 , R 2 and R 3 do not localize well for most of the time. The first visible conclusion from the confidence plots in Fig. 5 is that the robot R 3 directly observes the ball for
a very short period (iterations ∼ 300 − 350, near 450 and
near 700). Up to iteration 300, the spatial variance of R_3's tracker's particles remains low and their weight variance has also converged, indicating that R_3 was able to track the ball, which is supported by the accompanying video. This is because R_4 (see Fig. 5) was directly observing the ball and, since it was well localized, R_3 was relying on its observation model and hence tracking the ball in the correct position. Close to iteration 300 the ball is lost by all the robots for a very short instance (peak in spatial variance in Fig. 6) and then R_3 and R_4 observe the ball alternately between iterations ∼ 300 and 420 (Fig. 5), leading to a converged spatial variance during that time, supported by the video showing that R_3 is consistently following the ball. During iterations ∼ 420 − 650 the ball is seen directly only by R_1 (Fig. 5), which has an almost zero confidence on its self-localization estimate (Fig. 5). The cooperative tracker
at R 3 takes this into account and prevents itself from tracking
the ball in a wrong position, hence the high spatial variance in Fig. 6 during that time period. Another notable instance is at iteration 400 where R 4 was directly observing the ball and was well localized (Fig. 5), R 1 and R 3 were not seeing the ball and R 2 was observing the ball but was not localized (low self-localization confidence in Fig. 5). At that instance, R 3 was following the correct ball (supported by low spatial variance in Fig. 6 and the video) indicating that the fusion in R 3 ’s cooperative tracker diminished the influence of R 2 ’s observation model because of its low self localization confidence hence protecting itself from tracking the ball in a wrong position.
V. CONCLUSION AND FUTURE WORK
After performing a considerable number of tests on real robots we conclude that our algorithm provides a robust approach for tracking an object cooperatively by a team of robots and support this on a set of real robot experimental
Fig. 5. Confidence plots of experiment 2 (each iteration is performed every 0.1 second); each panel shows a confidence value between 0 and 1 versus the iteration number.
data. However, a few major points can be enumerated: the approach is a solution for tracking an object which is likely to be occluded or partially occluded quite often. If the object is confidently located by a wrongly localized robot, after fusion it would track it correctly in its own local frame and affect other teammates’ fused observation model quite insignificantly. By sharing an observation PDF (in which only the models’ parameters are enough to construct the whole PDF), we significantly reduce the use of bandwidth and communication time which leads to real-time tracking of the object. We are working further to extend our model of cooperative tracking to include multiple sensors on the same robot as well as to relax the 2-D space and colored object tracking assumptions. Tracking a random colored ball in 3-D space will require a new observation model and an extension in the state space without any changes in our current tracker’s algorithm because it uses a generalized observation model and is independent of the state space dimensions. Furthermore we intend to extend our current approach to active cooperative tracking where we dynamically change the geometry of the robot formation so as to reduce the uncertainty of the belief in the object recognition.
Fig. 6. Spatial variance and variance of the weights of the particles of the cooperative tracker at R_3 in experiment 2 (plotted against the iteration number).
REFERENCES
[1] Alper Yilmaz, Omar Javed and Mubarak Shah. Object Tracking: A Survey, ACM Computing Surveys, Vol. 38, No. 4, Article 13, December 2006. [2] M. Taiana, J. Santos, J. Gaspar, J. Nascimento, A. Bernardino and P. Lima.
Tracking objects with generic calibrated sensors : an algorithm based on color and 3D shape features, Robotics and Autonomous Systems, special issue on Omnidirectional Robot Vision, Vol. 58, Issue 6, 30 June, pp. 784-795, 2010. [3] J. Santos and Pedro Lima Multi-Robot Cooperative Object Localization - Decentralized Bayesian Approach, Proc. of RoboCup 2009 Symposium, Graz, Austria, 2009. [4] Raul´ A. Lastra, Paul A Vallejos and Javier Ruiz-del-Solar Integrated Self- Localization and Ball Tracking in the Four-Legged Robot Soccer League, 1st IEEE Latin American Robotics Symposium, Mexico city, 2004. [5] Jun Inoue, Akira Ishino and Ayumi Shinohara Ball tracking with velocity based on Monte-Carlo localization, 9th International Conference on Intelligent Autonomous Systems, Tokyo, Japan, March 2006. [6] Walter Nistico,´ Matthias Hebbel, Thorsten Kerkhof and Christine Zarges Cooperative Visual Tracking in a Team of Autonomous Mobile Robots, RoboCup 2006: Robot Soccer World Cup X. Lecture Notes in Artificial Intelligence, Springer 2007. [7] Cody Kwok and Dieter Fox Map-based Multiple Model Tracking of a Moving Object, RoboCup 2004: Robot Soccer World Cup VIII, Lecture Notes in Computer Science 2005. [8] Thrun, Sebastian and Burgard, Wolfram and Fox, Dieter Probabilistic Robotics, The MIT Press, 2005. [9] Hastie Trevor, Tibshirani Robert and Friedman Jerome The Elements of Statistical Learning, Springer, 2001. [10] Daniel Gohring¨ and Hans-Dieter Burkhard Multi Robot Object Tracking and Self Localization Using Visual Percept Relations, Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, October 9 - 15, 2006, Beijing, China. [11] Luis Merino, Fernando Caballero, J. R. Mart´ınez-de Dios, Joaqu´ın Ferruz, and An´ıbal Ollero A Cooperative Perception System for Multiple UAVs: Application to Automatic Detection of Forest Fires, Journal of Field Robotics 23(3/4), 165-184; 2006 [12] Ong, L.-L., Upcroft, B., Bailey, T., Ridley, M., Sukkarieh, S., and Durrant-Whyte, H. (2006a). A decentralised particle filtering algorithm for multi-target tracking across multiple flight vehicles. In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, pages 4539 –4544. [13] Ong, L.-L., Upcroft, B., Ridley, M., Bailey, T., Sukkarieh, S., and Durrant-Whyte, H. (2006b). Consistent methods for decentralised data fusion using particle filters. In Multisensor Fusion and Integration for Intelligent Systems, 2006 IEEE International Conference on, pages 85
–91.
[14] Sheng, Xiaohong and Hu, Yu-Hen and Ramanathan, Parameswaran. Distributed particle filter with GMM approximation for multiple targets localization and tracking in wireless sensor network. Proceedings of the 4th international symposium on Information processing in sensor networks , IPSN 2005. [15] Michael D. Breitenstein and Fabian Reichlin and Bastian Leibe and Esther Koller-Meier and Luc Van Gool. Online Multi-Person Tracking- by-Detection from a Single, Uncalibrated Camera. in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010
Monocular Visual Odometry using a Planar Road Model to Solve Scale Ambiguity
Bernd Kitt∗   Jörn Rehder‡   Andrew Chambers†   Miriam Schönbein∗   Henning Lategahn∗   Sanjiv Singh†
∗ Institute of Measurement and Control Systems, Karlsruhe Institute of Technology, Karlsruhe, Germany
† The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA
‡ Hamburg University of Technology, Hamburg, Germany
Abstract— Precise knowledge of a robot's ego-motion is a crucial requirement for higher level tasks like autonomous navigation. Bundle adjustment based monocular visual odometry has proven to successfully estimate the motion of a robot for short sequences, but it suffers from an ambiguity in scale. Hence, approaches that only optimize locally are prone to drift in scale for sequences that span hundreds of frames. In this paper we present an approach to monocular visual odometry that compensates for drift in scale by applying constraints imposed by the known camera mounting and assumptions about the environment. To this end, we employ a continuously updated point cloud to estimate the camera poses based on 2d-3d-correspondences. Within this set of camera poses, we identify keyframes which are combined into a sliding window and refined by bundle adjustment. Subsequently, we update the scale based on robustly tracked features on the road surface. Results on real datasets demonstrate a significant increase in accuracy compared to the non-scaled scheme.
Index Terms— Localization, Navigation, Robot Vision
I. INTRODUCTION
Ego-motion estimation is an important prerequisite in robotics applications. Many higher level tasks like obstacle detection, collision avoidance or autonomous navigation rely on an accurate localization of the robot. All of these applications make use of the relative pose of the current camera with respect to the previous camera frame or a static world reference frame. Usually, this localization task is performed using GPS or wheel speed sensors. In recent years, camera systems became cheaper and the performance of computing hardware increased dramatically. Hence, high resolution images can be processed in real-time on standard hardware. It has been proven that the information given by a camera system is sufficient to estimate the motion of a moving camera in a static environment, called visual odometry [16]. Compared to the abovementioned sensors, camera systems have different advantages. It is well known that the accuracy of the GPS-localization depends on the number of satellites used. This number drops in urban environments with large buildings on either side of the road. The accuracy of wheel speed sensors depends mainly on the slip between wheel and road, which can be high depending on the terrain. Obviously, the localization results based on these sensors are highly affected by the environment. Further drawbacks of GPS or inertial measurement units (IMUs) are the low accuracy and the high cost, respectively. The local drift rates for visual odometry are mostly smaller than the drift rates of IMUs, except for expensive high accuracy integrated navigation systems
which fuse inertial measurements with GPS data [10]. As for all incremental motion estimation techniques, long term drift can only be mitigated by applying loop closure based on (visual) place recognition (e.g. [20]) or by fusing absolute localization data. In this work, we make use of a fully calibrated monocular camera to estimate the pose of the current camera with respect to a global world reference frame. The datasets used were captured in urban environments using a vehicle moving at a speed of approximately 15 m/s on average. The six degrees of freedom (6 DoF) motion of the vehicle is estimated merely from the visual information. No additional sensor measurements such as GPS- or IMU-data are used, in contrast to [1] or [5].
The remainder of this paper is organized as follows: The following section describes work already done in the field of vision-based motion estimation. In Section III, the monocular motion estimation approach is described, which is extended in section IV to cope with drift in scale. We close the paper with experimental results in section V, a short conclusion and an outlook on future work.
II. RELATED WORK
In recent years, many algorithms have been developed that estimate the ego-motion of a robot. These algorithms can roughly be classified into two main categories, namely algorithms using monocular camera systems (e.g. [24], [18]) and binocular approaches (e.g. [15], [12]). A further subdivision is possible into algorithms using only feature matches between consecutive frames (e.g. [10], [21]) and algorithms using feature tracks over a couple of images (e.g. [11], [17]). Each class of algorithms has different benefits and drawbacks. Monocular algorithms suffer from the scale ambiguity in the translational camera movement, which is usually resolved using measurements from IMUs (e.g. [5]) or a combination of wheel speed sensors and GPS as in Agrawal et al. (e.g. [1], [2]). Compared to algorithms which use feature tracks, algorithms making use only of feature matches usually suffer from higher drift rates, since the information used incorporates only two images. The entire trajectory is then computed by accumulating the relative camera motions between two consecutive frames. This drift can be reduced using feature tracks over a sequence of images combined with a bundle adjustment scheme [23]. The drawback of bundle adjustment is the computational burden of the optimization process. To
relax this, most algorithms are based on a bundle adjustment which performs the optimization only over a limited number of images, i.e. a sliding window. Other algorithms make use of additional sensors, like IMU- or GPS-systems (e.g. [1]), to increase the accuracy of the estimation. Obviously, the use of GPS-information reduces drift significantly because of the global nature of the system. Furthermore, approaches applying assumptions about the observer's motion have been developed. Scaramuzza et al. [18] make use of a planar motion model and the non-holonomic constraints of wheeled vehicles to reduce the parameter space and increase the accuracy. Good localization results have also been achieved using visual SLAM techniques (e.g. [4]) which simultaneously estimate a map of the environment jointly with the camera pose inside this map. Besides the computational complexity of these approaches, most of the monocular visual SLAM techniques perform well only in well-structured environments and at low speed. Hence, these approaches are mainly applicable to indoor environments. Recently, Strasdat et al. [20] developed a monocular SLAM algorithm applicable for large-scale environments, which resolves the drift in scale when loop-closure occurs. Since we focus on open-loop scenarios, where a robot is travelling from A to B, this approach is not suitable in our case. Compared to the abovementioned approaches which combine monocular vision with additional sensors, our algorithm uses only visual inputs. To solve translational scale drift, we make use of some assumptions about the mounting pose of the camera and the planarity of the road surface in the vicinity of the vehicle. Given these reasonable assumptions, we can continuously resolve the ambiguity in scale and reduce the scale drift significantly. We prefer to use a monocular camera setup because the effort used in the calibration process is much less than in the binocular case. Furthermore, calibration errors directly affect the motion estimation process. Hence, the use of a monocular camera mitigates the effect of an erroneous calibration on the motion estimation. Compared to Scaramuzza et al. [18], our assumptions are less restrictive since we assume only a locally planar surface but estimate the robot's movement in 6 DoF.
III. MONOCULAR VISUAL ODOMETRY
The proposed algorithm for performing monocular visual odometry assumes a fully calibrated camera with known and fixed intrinsic calibration parameters K . Given a sequence
of images, the goal is to estimate the camera pose at each time step solely based on these images. To this end, we track salient image features x_j^k (e.g. corners [7], [19]) over a series of frames. Here, j describes the index of the feature track and k denotes the index of the frame, respectively. Based on these feature tracks, we reconstruct the poses of the moving camera with respect to a predefined world reference frame and the locations of the scene points corresponding with the tracks. In the following sections we will give a detailed description of the different steps of the pose estimation algorithm. Since the feature tracks are short, the (arbitrary) scale of the estimated camera poses drifts over time. In Section IV we propose a technique to reduce the scale drift and recover scale from a moving monocular camera.
A. Pose Initialization
The pose initialization step is based on a set of three preselected keyframes K = {K_1, K_2, K_3} and corresponding feature points visible in all of these images. Based on the feature points, we compute the epipolar geometry between K_1 → K_2 and K_1 → K_3, which describes the relative pose of the keyframes with respect to the world reference frame. Note that the world reference frame coincides with the camera coordinate frame of K_1. To this end, we use the normalized eight-point algorithm [9] wrapped in a RANSAC framework [6] to reject outliers caused by independently moving objects or false feature matches. Outliers are detected using the pairwise Euclidean distance between the observed features and their corresponding epipolar lines. All features with a distance larger than a predefined threshold are classified as outliers. Based on the essential matrix E_{K_1→K_3}, we recover the pose R_{K_3}, t_{K_3} of K_3 with respect to the world reference frame [8]. Here, R and t describe the orientation and translation of the camera, respectively. Note that t_{K_3} can only be recovered up to an arbitrary scale factor. Using the recovered pose of the third keyframe and the feature correspondences, we compute the corresponding scene points for all features which are inliers in both pairwise epipolar geometry estimations. Based on the triangulated scene points and their corresponding features in K_2, we can compute its pose with respect to the world reference frame using the algorithm proposed in [13]. Afterwards, we perform a bundle adjustment [23] over these three keyframes to get the best estimation for the camera poses and the scene points. The bundle adjustment minimizes the reprojection error, i.e.
$$\min_{P_k, X_j} \sum_{j,k} d\big(P_k X_j,\, x_j^k\big)^2, \qquad (1)$$

making it the maximum likelihood estimate, assuming a Gaussian distribution in the measurement errors [23]. Here, P_k = K · [R_k | t_k] denotes the 3 × 4 projection matrix of camera k and d(·) denotes the geometric distance between two image points.
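To make the objective in Eq. (1) concrete, the following sketch evaluates the stacked reprojection residuals for a small window of cameras and points and hands them to a generic nonlinear least-squares routine. This is an illustration only, not the sparse SBA implementation of [14] used by the authors; the axis-angle pose parametrization, the array layouts and the SciPy solver are our assumptions.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation


def project(K, rvec, t, X):
    """Project 3D points X (Nx3) with pose (rvec, t) and intrinsics K."""
    Xc = Rotation.from_rotvec(rvec).apply(X) + t   # world -> camera
    uv = (K @ Xc.T).T                              # homogeneous image coordinates
    return uv[:, :2] / uv[:, 2:3]                  # perspective division


def residuals(params, K, n_cams, n_pts, obs):
    """Stack d(P_k X_j, x_j^k) for all observations (k, j, measured_xy)."""
    poses = params[:6 * n_cams].reshape(n_cams, 6)  # [rvec | t] per camera
    pts = params[6 * n_cams:].reshape(n_pts, 3)
    res = []
    for k, j, xy in obs:
        pred = project(K, poses[k, :3], poses[k, 3:], pts[j][None])[0]
        res.append(pred - xy)
    return np.concatenate(res)


# usage sketch: refine initial poses/points against tracked features
# obs = [(k, j, np.array([u, v])), ...]
# x0 = np.hstack([poses0.ravel(), points0.ravel()])
# sol = least_squares(residuals, x0, args=(K, n_cams, n_pts, obs))
```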
B. Pose Estimation
Using the previously initialized scene structure, the estimation of the camera pose R_k, t_k corresponding to image I_k is based on already initialized scene points visible in the current image, i.e. a set of 2d-3d-correspondences. Given a set of N already known scene points X = {X_1, ..., X_j, ..., X_N} and feature tracks in the current image I_k, we use all feature tracks x_j^k in the current image with an already associated scene point X_j to estimate the current camera pose in the global coordinate frame, using the algorithm proposed in [13]. Since we would like to reject features lying on independently moving objects or false feature matches, we use a RANSAC scheme to get a robust pose estimation. Hence, we use randomly chosen subsets of the 2d-3d-correspondences and create a pose hypothesis for each subset. Given the pose hypothesis and
the set of visible scene points, we can compute the expected image points. Subsequently, we evaluate the quality of each hypothesis using the Euclidean reprojection error between the expected image points based on the current hypothesis and the measured image features. All correspondences with an error larger than a predefined threshold are classified as outliers. The final estimation uses only the correspondences of the largest set of inliers.
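The hypothesize-and-test loop described above can be sketched as follows. The minimal-set pose solver (EPnP in the paper [13]) is abstracted behind a hypothetical solve_pnp_minimal callback; the iteration count and pixel threshold are assumed values, not those used by the authors.

```python
import numpy as np


def ransac_pose(pts3d, pts2d, project_fn, solve_pnp_minimal,
                n_iter=200, min_set=4, thresh_px=2.0, rng=None):
    """Robustly estimate a camera pose from 2d-3d correspondences.

    project_fn(R, t, X) -> Nx2 expected image points,
    solve_pnp_minimal(X_subset, x_subset) -> (R, t) pose hypothesis.
    """
    rng = rng if rng is not None else np.random.default_rng()
    best_inliers = np.zeros(len(pts3d), dtype=bool)
    for _ in range(n_iter):
        idx = rng.choice(len(pts3d), size=min_set, replace=False)
        try:
            R, t = solve_pnp_minimal(pts3d[idx], pts2d[idx])
        except np.linalg.LinAlgError:
            continue  # degenerate minimal subset
        err = np.linalg.norm(project_fn(R, t, pts3d) - pts2d, axis=1)
        inliers = err < thresh_px
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() < min_set:
        raise RuntimeError("no valid pose hypothesis found")
    # final pose from the largest consensus set
    return solve_pnp_minimal(pts3d[best_inliers], pts2d[best_inliers]), best_inliers
```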
C. Keyframe Selection and Pose Refinement
Since the pose estimation scheme proposed in the previous section does not update the scene structure, the number of visible scene points decreases over time. Hence, if none of the initialized scene points is visible in the current camera image, no pose estimation is possible. Therefore, new keyframes are selected when the number of visible scene points in the current image drops below a threshold. Whenever a new frame I_k is selected as keyframe K_{M+1}, where M denotes the number of already initialized keyframes, a bundle adjustment is performed. To keep the computational complexity moderate, we perform the bundle adjustment only over a sliding window of L keyframes. Furthermore, we optimize only the m camera poses of the most recent keyframes inside this window (m < L), keeping the remaining camera poses fixed. This is reasonable, since the other keyframes have been optimized multiple times in previous optimization steps. Note that the reprojection error in the fixed camera poses is taken into consideration during the optimization process. We use the well-established implementation of a sparse bundle adjustment proposed by Lourakis et al. [14]. After the bundle adjustment, we use the most recent keyframe K_{M+1} and the two previous keyframes {K_{M−1}, K_M} to triangulate all feature points visible in these keyframes using a three-view triangulation scheme [3]. Note that only those features are triangulated that do not correspond to a previously initialized scene point. To reject outliers, we check the reprojection error of the triangulated scene points in all keyframes used for the triangulation. If the Euclidean reprojection error in any of these keyframes exceeds a certain threshold, the triangulated scene point is rejected.
The following section describes the approach to recover scale from a monocular image sequence based on assump- tions about the camera mounting and the road environment. Furthermore, the proposed approach allows for a reduction in the drift of the scale. This part is the main contribution of the paper and consists of a robust triangulation scheme for feature points lying on the less textured asphalt as well as the recovery of the translational scale from monocular imagery.
IV. RECOVERING SCALE
A. Planar Assumption
In automotive applications, cameras are often rigidly mounted at a fixed position and constant orientation. Furthermore, streets in urban environments may be reasonably assumed to be approximately planar in the vicinity of the vehicle (see
Fig. 1: This figure illustrates the mounting of the camera. h denotes the height of the camera above ground and R 0 denotes the orientation of the camera with respect to the ground plane.
figure 2). Additionally, we assume that the roll and pitch movement of the vehicle is negligible, since highly dynamic maneuvers are not in the scope of this contribution. Nevertheless, roll and pitch of the camera can be taken into account based on the estimated pose of the vehicle. In this case, only the initial pose of the camera w.r.t. the ground plane has to be known. Employing these assumptions provides a clue that may be exploited to upgrade the monocular visual odometry to a metric reconstruction and to compensate for the inevitable drift in scale. For this purpose, we estimate the ground plane based on correspondences of features that may safely be assumed to lie on this plane. The motion of the camera is then scaled in a way that recovers its known height above ground.
B. Patch Matching
While we employ corner-like features (e.g. [19]) in visual odometry, this approach fails to robustly detect features on asphalt due to the lack of strong texture. Furthermore, the images of patches on the street plane close to the camera undergo severe perspective distortions in successive frames caused by the camera movement, which makes it particularly challenging to match them by means of simple block-matching algorithms. However, the known mounting position and orientation of the camera as well as the assumption about the planarity of the area in front of the vehicle may be exploited to overcome these challenges and robustly match features on the street. Let R_0 denote a rotation matrix that accounts for the pitch and roll angle of the camera as depicted in figure 1, and t = −R_0 (0, h, 0)^T the translation in its reference frame, where h is the height above the road. Further, let the road plane be defined by Y = 0 as shown by the gray coordinate frame in figure 1, and a point on this plane by X′ = (X, Z, 1)^T in
Fig. 2: This figure depicts the region of interest (ROI) in the image that is assumed to be approximately planar.
projective coordinates, then image points x = ( x, y, 1) T are related to points X ′ on the road plane through a projective transformation x = HX ′ . This transformation is calculated as (see also [8])
$$x = K \cdot [R_0 \,|\, t]\,(X, 0, Z, 1)^T \qquad (2)$$
$$\phantom{x} = K \cdot [r_1 \;\, r_3 \;\, t]\,(X, Z, 1)^T \qquad (3)$$
$$\phantom{x} = H X', \qquad (4)$$
where r_1 and r_3 denote the first and third columns of R_0. Our approach uses the inverse mapping H^{-1} to generate a synthetic fronto-parallel view of a region of interest (ROI) on the street plane, as depicted in figure 3. As the ROI is in general evenly textured, it is difficult to identify significant features that are well suited for tracking. Instead, a predefined set of square patches is used in frame k, located at the edge of the rectangular ROI closest to the vehicle. This choice ensures that a region with high resolution is used for matching as well as the maximum flow for triangulation. The patches are matched against the ROI at k − 1, using the sum of absolute differences (SAD) and choosing the minimum value as match. In most cases, the matching exhibits a distinct minimum. In the fronto-parallel projection, patches in successive frames are related by a Euclidean transformation, which improves the robustness of patch-based matching schemes significantly. In our experiments, the rotational component of this transformation was moderate and thus did not affect the performance of the SAD-based matching negatively. Using the projective transformation H, the correspondences are mapped onto image points as
$$x_j^{k-1} = H\, X_j'^{\,k-1} \qquad (5)$$
$$x_j^{k} = H\, X_j'^{\,k}. \qquad (6)$$

Note that the purpose of all steps described in this section is to robustly match features on the road surface. Only the positions x_j^{k-1} and x_j^k of these features in I_{k-1} and I_k will be used in the following.
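As an illustration of Eqs. (2)-(6), the sketch below assembles H from K, R_0 and t = −R_0(0, h, 0)^T, warps the image into a fronto-parallel view of the ROI, and matches a patch by minimizing the SAD. The pixel-per-metre scaling of the synthetic view, the ROI size and the use of OpenCV's warpPerspective are assumptions made for this example, not details taken from the paper.

```python
import numpy as np
import cv2


def road_homography(K, R0, h):
    """H maps road-plane points (X, Z, 1)^T to image points, cf. Eqs. (2)-(4)."""
    t = -R0 @ np.array([0.0, h, 0.0])
    return K @ np.column_stack((R0[:, 0], R0[:, 2], t))   # [r1 r3 t]


def fronto_parallel_view(img, H, px_per_m=100.0, roi_size=(200, 200)):
    """Warp img into a synthetic top-down view of the road ROI.

    Output pixel (u, v) corresponds to road-plane coordinates (u, v) / px_per_m,
    sampled from the image at H * (X, Z, 1)^T.
    """
    S = np.diag([1.0 / px_per_m, 1.0 / px_per_m, 1.0])  # output pixels -> metres
    return cv2.warpPerspective(img, np.linalg.inv(H @ S), roi_size)


def sad_match(patch, search_img):
    """Return the offset in search_img that minimizes the sum of absolute differences."""
    ph, pw = patch.shape
    best, best_pos = np.inf, (0, 0)
    for v in range(search_img.shape[0] - ph + 1):
        for u in range(search_img.shape[1] - pw + 1):
            sad = np.abs(search_img[v:v + ph, u:u + pw].astype(float) - patch).sum()
            if sad < best:
                best, best_pos = sad, (u, v)
    return best_pos
```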
C. Scale Update
In order to relate the height of the camera above the estimated ground plane to the distance of translation, we triangulate the points based on the relative positions of the cameras at frame k − 1 and frame k as estimated by the visual odometry. Let
$$P_{k-1} = K \cdot [R_0 \,|\, 0] \qquad (7)$$
$$P_k = K \cdot \big[R_k R_{k-1}^{-1} R_0 \;\big|\; t_k - R_k R_{k-1}^{-1} t_{k-1}\big] \qquad (8)$$

be projection matrices, where R_0 accounts for tilting as depicted in figure 1. P_k compensates for the poses of both frames in the global world reference frame, i.e. P_k expresses the relative pose of frame k w.r.t. frame k − 1. Since we compensate for the mounting orientation of the camera at frame k − 1, the X-Z-plane of the camera coordinate frame is parallel to the ground plane and its height coincides with the mounting height h of the camera. Then the points X_j', triangulated with P_{k-1} and P_k and the patch correspondences x_j^{k-1} and x_j^k, should roughly have the same Y-coordinate (expressed in the coordinate frame of image k − 1), which corresponds to the height above the road plane in the current scale, i.e. the estimated height of these points depends on the actual translation between the frames. Since we know the height of the camera h and the height of these points, we can rescale the translation vector such that the camera height and the height of the points coincide.
Fig. 3: This figure depicts the synthetic fronto-parallel view of the ROI. The white square marks a single patch that is matched in the preceding frame.
To robustly estimate the scale factor, we implemented an outlier rejection based on the reprojection error |x_j^k − P_k X_j| to cope with mismatches, as well as a simple planarity check to discard estimates of the plane that exceed a threshold in the variance of the Y-coordinates of the triangulated points. The scale factor s_k is determined by

$$s_k = (1 - \alpha)\, s_{k-1} + \alpha\, h / \bar{Y}^k, \qquad (9)$$

with α ∈ [0, 1] and \bar{Y}^k being the mean over all Y-coordinates
of the X_j triangulated from the image pair k and k − 1. The factor incorporates the previous scale factor for temporal smoothing of the scale as well as spatial averaging. This smoothing seems reasonable as the drift in scale occurs on a large time-scale, thus making noise rejection a priority over a fast response. The scale factor s_k is used to scale all camera positions i within the current window as well as the structure observed by the frames in the window. The window for which the scaling is applied consists of all frames i ∈ {S, …, k}, where S is the index corresponding to the oldest keyframe inside the sliding window. As the scale of the motion only affects the translations t_i, the scaled translations may be calculated as t̃_i = s_k (t_i − t_S) + t_S, where t_S is the translation of the oldest camera pose in the optimization window, which is held constant to ensure a smooth trajectory. Analogously, the point cloud is scaled as

$$\tilde{X}_j = s_k X_j + R_i^{-1} \tilde{t}_i - R_i^{-1} t_i. \qquad (10)$$
The proposed scaling of both the translational camera movements and the scene points guarantees a consistently estimated trajectory of the camera as well as a consistent structure of the scene. Obviously, the structure of the scene is too sparse
Fig. 4: This figure depicts the results of the proposed algo- rithm. The red trajectory is the ground truth given by the INS system. The green and blue trajectories are estimated using the visual odometry algorithm. Blue depicts the trajectory with a single scaling at the beginning of the sequence, green is the trajectory with continuously updated scale.
for obstacle avoidance or path planning purposes but sufficient for a robust motion estimation.
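A minimal sketch of the scale recovery of Eqs. (9) and (10): the smoothed scale factor is computed from the mean Y-coordinate of the triangulated road points, and the translations and scene points of the sliding window are rescaled about the oldest keyframe. The variable names, the world-to-camera pose convention and the per-frame grouping of points follow our reconstruction above, not the authors' implementation.

```python
import numpy as np


def update_scale(s_prev, h_cam, y_mean, alpha=0.8):
    """Eq. (9): exponentially smoothed scale factor from the mean patch height."""
    return (1.0 - alpha) * s_prev + alpha * h_cam / y_mean


def rescale_window(s_k, Rs, ts, points):
    """Rescale translations and scene points of the sliding window.

    Rs, ts: lists of world-to-camera rotations/translations, oldest keyframe first.
    points: dict {frame index i: Nx3 array of points associated with frame i}.
    """
    t_S = ts[0].copy()                                  # oldest pose is held fixed
    ts_new = [s_k * (t - t_S) + t_S for t in ts]        # scaled translations
    pts_new = {}
    for i, X in points.items():
        shift = Rs[i].T @ (ts_new[i] - ts[i])           # R_i^-1 (t~_i - t_i), cf. Eq. (10)
        pts_new[i] = s_k * X + shift
    return ts_new, pts_new
```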
V. EXPERIMENTAL RESULTS
For our experiments, we used different datasets captured
in real environments. To this end, we used our experimental
vehicle, which is equipped with a camera and a high-accuracy integrated navigation system (INS) that combines inertial measurements with a GPS-receiver and wheel speed sensors for measuring motion, pose and orientation of the vehicle. Therefore, the INS yields accurate measurements for the motion of the vehicle along its roll axis and the yaw rate.
In the following, these measurements are used as ground truth
for our experiments. The camera used is mounted on top of the car and yields images at a resolution of 1344 × 391 pixels at 10 Hz.
As features, we used Harris corners [7] in combination with
a block matching on the image derivatives to get feature
correspondences between two adjacent frames. These feature
matches are accumulated to get feature tracks over a series
of images. The results for two challenging datasets with different length
and speed can be seen in figures 4 and 5. The first sequence (figure 4) includes 600 frames and was captured with an average speed of approximately 10 m/s. The second dataset (figure 5) with a loopy trajectory consists of 1405 frames; the speed was approximately 15 m/s on average. The demonstrated results were computed offline on a standard PC, using an implementation in MATLAB. The length of the sliding window was set to L = 10, optimizing only the m = 5 most recent camera poses inside this window. These parameters have been chosen to account for the average track length of approximately four frames. The number of
used scene points for the bundle adjustment over the sliding window is approximately 900. This and the good initialization of both the scene points and the camera poses yields a fast convergence of the bundle adjustment. Per frame, five patches of size 50 × 50 pixels were matched on the road plane. For the vast majority of patches, the SAD exhibited a distinct minimum which gave rise to accurately triangulated points on the plane. The resulting scale was smoothed with a factor of α = 0.8. Figure 4 displays the results of the proposed approach compared to the ground truth trajectory given by the high-accuracy INS (red). Compared to the trajectory which has been scaled only once at the beginning of the sequence (blue), the trajectory using the proposed approach (green) exhibits no significant drift in scale. Obviously, there is an increasing deviation from the ground truth position due to the local nature of the algorithm. Since the drift in scale does not affect the rotational component of the motion estimate, the angular error of the scaled and unscaled trajectories is similar. Only the length of the trajectory is affected by the scaling. Note that the trajectories are manually translated and rotated to align them with the ground truth trajectory. Currently, the proposed algorithm (without feature tracking) takes approximately 1.7 s/frame due to the implementation in MATLAB. An implementation in C++, which is planned for future work, should increase the framerate significantly.
VI. CONCLUSION AND FUTURE WORK
In this paper we presented an approach for estimating the 6 DoF ego-motion of a vehicle solely from monocular image sequences. By exploiting constraints induced by the known camera mounting and a reasonable assumption about the planarity of the road surface in the vicinity of the vehicle, the algorithm is capable of continuously recovering the scale of the motion. Furthermore, we can significantly reduce the drift in scale. Our experiments have shown that the scale ambiguity in monocular approaches leads to a large drift in scale. As a result, distant sections of the same estimated trajectory are scaled differently, causing a distortion of the path. Based on feature tracks over a series of images, we estimate the motion of the camera. To this end, we use a continuously updated scene structure to estimate the pose of the camera based on 2d-3d-correspondences between scene points and image features. Whenever the number of visible scene points in the current image drops below a threshold, we select a new keyframe and refine the previous scene structure and the camera poses in a sliding window based on bundle adjustment. Subsequently, we update the scene structure based on newly triangulated feature tracks. The proposed keyframe selection criterion has proven to be sufficient for the proposed algorithm. Nevertheless, more sophisticated approaches for keyframe selection as proposed e.g. in [22] may be used. The algorithm employs knowledge about the mounting of the camera and a planar assumption to generate a synthetic fronto-parallel view of the road. Despite the lack of strong features on asphalt, patches in successive frames can be matched robustly in this view. Using these matches and the motion
Fig. 5: This figure illustrates the results of the proposed algorithm (green) in comparison to the unscaled version (blue) for a sequence consisting of 1405 frames. As can be seen, the unscaled version suffers from a significant drift in scale. The trajectory exploiting knowledge about the camera mounting exhibits a slight drift due to the local nature of the algorithm. Nevertheless, there is no significant drift in scale.
hypothesis of the visual odometry, we triangulate points to acquire an estimate of the distance to the ground plane as measured by the visual odometry. Based on the deviation of this measurement from the known mounting height, the algorithm continuously scales the motion and structure of the current window. The experimental results suggest that the proposed algorithm is capable of accurately reconstructing the trajectory of the vehicle solely based on visual inputs. Furthermore, continuously correcting the scale results in a significant improvement of the accuracy.
Due to the use of only a single camera, the algorithm fails in the presence of dominant independent motion. This may be the case when moving traffic accounts for a significant share of the image. To improve the robustness of the approach, we are working on a more reliable outlier rejection scheme. The planar assumption is sufficiently fulfilled in most urban scenarios. Nevertheless, large changes in the slope of the road cause our approach to estimate a wrong scaling factor. Although it will recover from such perturbations, the resulting trajectory will be incorrectly scaled around this slope. Another case where the algorithm fails to correctly compute the scale factor is the presence of objects inside the ROI. Future work will include the detection and handling of these conditions as well as a real-time implementation.
ACKNOWLEDGEMENTS
The first author gratefully acknowledges the support from Karlsruhe House of Young Scientists.
REFERENCES
[1] Motilal Agrawal and Kurt Konolige. Real-time localization in outdoor environments using stereo vision and inexpensive GPS. In Proceedings of the International Conference on Pattern Recognition, pages 1063 – 1068, 2006. [2] Motilal Agrawal and Kurt Konolige. Rough terrain visual odometry. In Proceedings of the International Conference on Advanced Robotics, August 2007.
[3] Martin Byröd, Klas Josephson, and Kalle Åström. Fast optimal three view triangulation. In Proceedings of the Asian Conference on Computer Vision, pages 549 – 559, 2007. [4] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1 – 16, June 2007. [5] Christian Dornhege and Alexander Kleiner. Visual odometry for tracked vehicles. In Proceedings of the IEEE International Workshop on Safety, Security and Rescue Robotics (SSRR 2006), 2006. [6] Martin A. Fischler and Robert C. Bolles. Random sample consensus:
A paradigm for model fitting with applications to image analysis and
automated cartography. Communications of the ACM, 24(6):381 – 395,
1981.
[7] Chris Harris and Mike Stephens. A combined corner and edge detector.
In Proceedings of the Alvey Vision Conference, pages 147 – 151, 1988.
[8] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2008.
[9] Richard I. Hartley. In defense of the eight-point algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(6):580 – 593, June 1997. [10] Andrew Howard. Real-time stereo visual odometry for autonomous ground vehicles. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2008), pages 3946 – 3952, September 2008. [11] Andrew Edie Johnson, Steven B. Goldberg, Yang Cheng, and Larry H. Matthies. Robust and efficient stereo feature tracking for visual odometry. In IEEE International Conference on Robotics and Automation, pages 39 – 46, May 2008. [12] Bernd Kitt, Frank Moosmann, and Christoph Stiller. Moving on to dynamic environments: Visual odometry using feature classification. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5551 – 5556, Taipei, Taiwan, October 2010. [13] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision, 81(2):155 – 166, February 2009. [14] Manolis I. A. Lourakis and Antonis A. Argyros. SBA: A software package for generic sparse bundle adjustment. ACM Transactions on Mathematical Software, 36(1):1 – 30, March 2009. [15] Annalisa Milella and Roland Siegwart. Stereo-based ego-motion estimation using pixel-tracking and iterative closest point. In Proceedings of the IEEE International Conference on Computer Vision Systems, 2006. [16] David Nistér, Oleg Naroditsky, and James Bergen. Visual odometry.
In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, volume 1, pages 652 – 659, 2004.
[17] Frank Pagel. Robust monocular egomotion estimation based on an IEKF.
In Canadian Conference on Computer and Robot Vision, pages 213 –
220, Kelowna, BC, Canada, May 2009. [18] Davide Scaramuzza, Friedrich Fraundorfer, and Roland Siegwart. Real-time monocular visual odometry for on-road vehicles with 1-point RANSAC. In Proceedings of the IEEE Conference on Robotics and Automation, May 2009. [19] Jianbo Shi and Carlo Tomasi. Good features to track. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 1994), pages 593 – 600, June 1994. [20] Hauke Strasdat, J. M. M. Montiel, and Andrew J. Davison. Scale drift-aware large scale monocular SLAM. In Proceedings of the Robotics:
Science and Systems Conference, Zaragoza, Spain, June 2010. [21] Ashit Talukder and Larry Matthies. Real-time detection of moving objects from moving vehicles using dense stereo and optical flow. In IEEE International Conference on Intelligent Robots and Systems (IROS 2004), volume 4, pages 3718 – 3725, September 2004. [22] Thorsten Thormählen, Hellward Broszio, and Axel Weissenfeld. Keyframe selection for camera motion and structure estimation from multiple views. In Proceedings of the European Conference on Computer Vision, pages 523 – 535, Prague, Czech Republic, May 2004. [23] Bill Triggs, Philip McLauchlan, Richard Hartley, and Andrew Fitzgibbon. Bundle adjustment – a modern synthesis. Lecture Notes in Computer Science, 1883:298 – 372, 2000. [24] Koichiro Yamaguchi, Takeo Kato, and Yoshiki Ninomiya. Vehicle ego-motion estimation and moving object detection using a monocular camera. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR 2006), pages 610 – 613, 2006.
Incremental Updates of Configuration Space Representations for Non-Circular Mobile Robots with 2D, 2.5D, or 3D Obstacle Models
Boris Lau
Christoph Sprunk
Wolfram Burgard
Abstract — This paper presents techniques to incrementally update collision maps, distance maps, and Voronoi diagrams in the configuration space of non-circular mobile robots. Compared to previous work, our approach only updates the cells affected by changes in the environment. Thus, it is applicable in large workspaces and in the presence of unknown or moving obstacles. The c-space collision maps allow for performing highly efficient collision checks in dynamically changing environments, for 2D, 2.5D, and 3D representations of robots and obstacles. By using the proposed c-space distance maps, long trajectories can efficiently be checked for collisions. Finally, our c-space Voronoi diagrams allow effective and complete path planning in narrow spaces. Our algorithms are easy to implement, benefit from parallelization on multi-core CPUs, and can be integrated with state-of-the-art path planners. Experiments demonstrate the effectiveness of our methods and show their applicability to real-world scenarios.
I. INTRODUCTION
Efficient collision checks are a crucial ability for many online systems in autonomous mobile robotics: simulators, path planners, and trajectory optimizers alike need to check for every considered pose for collision. For plain 2D obstacle representations and circular approximations of the robot’s footprint, these collision checks are easy to perform. For non-circular robots in passages narrower than their circumcircle, however, circularity is too crude an assumption, and collisions have to be checked for in the three-dimensional configuration space (c-space) of robot poses. Also, even for robots moving on a plane as considered in this paper, 3D obstacles and collisions can be important: applications like robotic transporters, wheelchairs, or mobile manipulators can require the robot to partially move underneath or above obstacles as shown in Fig. 1. In these cases, collision checks can easily become a dominant part of the computational effort. Naive collision checks on 2D occupancy grid maps require one lookup per grid cell in the area occupied by the robot. For 3D obstacle representations like multi-layer surface maps or full 3D maps [1], the effort depends on the robot’s volume. However, by convolving a map with the discretized shape of the robot, one can precompute a collision map that marks all colliding poses. With such a map, a collision check requires just a single lookup, even for 3D obstacle representations. From these c-space maps one can derive c-space distance maps that encode the distance to the closest colliding robot pose. Based on this, checking a trajectory for collisions can be made more efficient by extending the intervals of collision
All authors are with the Department of Computer Science at the University of Freiburg, Germany, {lau, sprunkc, burgard}@informatik.uni-freiburg.de. This work has partly been supported by the European Commission under FP7-248258-First-MM, FP7-248873-RADHAR, and FP7-260026-TAPAS.
Fig. 1. For some applications, representing obstacles and robots by their 2D footprints can be sufficient (top-left). For overhanging parts of robots, their load, or obstacles, 2.5D representations are needed (bottom), whereas interaction tasks can also require actual 3D obstacle and robot models (top- right). Robot discretizations as used in our experiments are depicted in blue.
checks from one cell to the respective local collision distance. Similarly, in contrast to regular 2D Voronoi maps that can be used for complete motion planning of circular robots, c-space Voronoi maps can be employed for non-circular robots as well. Early work on c-space obstacle representations assumes static environments [2]. More recent approaches are often motivated by dynamic obstacles, but mostly try to improve the efficiency with the goal of executing the entire computation on every sensor update. This paper presents methods to incrementally update collision, distance, and Voronoi grid maps in the configuration space of non-circular mobile robot poses. Our algorithms only update cells affected by changes, which greatly reduces the computational costs. We consider all obstacle types shown in Fig. 1 and thus provide means for efficient collision checking in real-world applications. The algorithms are easy to implement and benefit from parallelization on multiple cores. After discussing related work, we describe dynamic c-space collision, distance, and Voronoi maps in Sects. III and IV, and their application to path planning in Sect. V. Finally, we present extensive experiments in Sect. VI.
II. RELATED WORK
Algorithms for efficient collision checking in 3D continue to be an active area of research. For example, Tang et al. [3] recently proposed a connection collision query algorithm that
5 th European Conference on Mobile Robots, September 7-9, 2011, Örebro, Sweden
49
detects collisions of triangle meshes moving between given states. Thus, it can be used for sampling-based path planning. For online feasibility, Pan et al. use multi-core GPUs for collision queries [4]. Still, the cost per collision check depends on the number of polygons used to represent the tested objects. Schlegel precomputes collision distances for circular arcs as a function of relative obstacle location and curvature [5]. Thus, the kinematic analysis is done offline, and collision distances can be obtained with one lookup per obstacle. Precomputing c-space representations instead further reduces the online effort for collision checks to a single lookup. Since Lozano-Pérez's 1983 paper on c-space planning among static polyhedral obstacles, many approaches have been proposed to reduce the cost of computing c-space obstacles [2]. Because of the relevance of this problem, researchers still work on improving the efficiency [6]. Convolving a gridmap of a robot's environment with an image of its footprint yields a discrete c-space map. To always reflect the current state of previously unknown or moving obstacles, these maps need to be updated regularly. Kavraki proposed to use the fast Fourier transform (FFT) to reduce the computational cost of the convolution [7], and Therón et al. added parallelization as a further speed-up [8]. Later, the same group proposed a multi-resolution approach to reduce memory and computational load in large workspaces [9]. To speed up path planning for autonomous cars, Ziegler and Stiller decompose the shape of the robot into circular discs [10]. As a first dynamic approach for changing environments, Wu et al. precompute colliding robot poses for each potentially occupied cell in the work space of a manipulator [11]. Taking the union of the colliding poses for a given set of occupied cells yields the c-space obstacle map without further recomputation. For mobile robots, however, the size of the operational area can make the database storage and the online computation of the union infeasible. In contrast, our method for updating c-space collision maps is truly incremental: it executes a regular map convolution in an offline phase, and during online application only updates the cells affected by changes in the environment. In our previous work we presented algorithms to incrementally update two-dimensional Euclidean distance maps and Voronoi diagrams [12]. For path planning with circular robots, 2D Voronoi diagrams are appealing roadmaps since they cover all topologically different paths in a map with a small number of cells. For rectangular robots, however, 2D Voronoi planning loses its completeness property, and one has to repair paths in narrow areas where following the Voronoi diagram leads to collisions, e.g., by using RRTs as proposed by Foskey et al. [13]. In this paper we combine the dynamic distance and Voronoi maps proposed in [12] with our novel incrementally updatable c-space collision maps. In this way, we overcome the aforementioned problem and can perform complete Voronoi planning in the configuration space of non-circular robots. Although we use A* planning as an example application, our approach can be combined with other planners, e.g., D* Lite [14], or rapidly-exploring random trees (RRT) with Voronoi-biased sampling [15].
Fig. 2. Convolving a map M(x, y) with a representation of the robot's shape S^θ(i, j) for a given orientation θ yields a collision map C^θ(x, y), according to Eq. (1). Each cell (x, y) in C^θ counts the number of cells in the robot shape that collide with occupied cells in M, given the robot is at pose (x, y, θ).
Fig. 3. A newly occupied cell in the map M (red) increments the collision count C for all robot poses that cause a collision with the obstacle in the updated cell. In this way, the collision map is updated (red cells) without recomputing the values for unaffected (gray) cells. Alg. 1 implements this procedure as well as the corresponding case for newly emptied cells.
III. DYNAMIC C-SPACE COLLISION MAPS
This section presents a method to incrementally update c-space collision maps for non-circular mobile robots moving on a plane. It can be applied to all obstacle types shown in Fig. 1. For the sake of clarity we start with the 2.5D representation with overhanging obstacles shown in Fig. 1 (bottom-right), and discuss the adaptation to the other obstacle models later. Let M(x, y) be a grid map that represents the vertical clearance, i.e., the height of free space above the floor, with zeros for completely occupied cells. Consider a robot moving on the floor with continuous orientation θ̃ with respect to the map coordinate system. We represent the discretized shape of the robot for a given orientation θ̃ by a map S^θ̃(i, j) that stores the height of the robot for every cell of its footprint. S has the same resolution and orientation as the grid map M, whereas its origin S(0, 0) is located at the robot's center.
A convolution-type conjunction of M and S^θ̃ yields a collision map C^θ̃(x, y) as shown in Fig. 2. Each cell (x, y) in C^θ̃ stores the number of cells the robot collides with when located there:

$$C^{\tilde{\theta}}(x, y) = \sum_i \sum_j \mathrm{eval}\big(M(x + i, y + j) < S^{\tilde{\theta}}(i, j)\big), \qquad (1)$$

where eval(true) = 1 and eval(false) = 0. If we discretize θ̃ and stack the C^θ(x, y) for all discrete θ, we obtain the robot's c-space map C(x, y, θ) for M. Clearly, by testing C(x, y, θ) > 0 we can check if the discretized pose (x, y, θ) is colliding.
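Eq. (1) can be implemented directly as a batch pass over the map, which corresponds to the offline initialization before the incremental updates of Alg. 1 take over. A minimal sketch under assumed array conventions (clearance map M, per-orientation robot height map S with its origin at the array center):

```python
import numpy as np


def collision_map(M, S_theta):
    """C^theta(x, y): number of robot cells colliding when the robot is at (x, y).

    M: 2D array of vertical clearance (0 = fully occupied).
    S_theta: 2D robot height map for one orientation; its origin is the array center.
    """
    H, W = M.shape
    sh, sw = S_theta.shape
    ci, cj = sh // 2, sw // 2
    C = np.zeros((H, W), dtype=np.int32)
    # pad with zero clearance so poses partially outside the map count as colliding
    Mp = np.pad(M, ((ci, sh - ci), (cj, sw - cj)), constant_values=0)
    for i in range(sh):
        for j in range(sw):
            if S_theta[i, j] > 0:
                window = Mp[i:i + H, j:j + W]            # M(x + i - ci, y + j - cj)
                C += (window < S_theta[i, j]).astype(np.int32)
    return C


# a pose (x, y, theta) is colliding iff collision_map(M, S[theta])[x, y] > 0
```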
A. Discretization of Orientations
An appropriate discretization of θ ensures that if two adjacent poses (x, y, θ_i) and (x, y, θ_{i+1}) are collision-free according to C, intermediate orientations θ̃ ∈ [θ_i, θ_{i+1}] are collision-free as well. Under this constraint we seek to discretize θ as coarsely as possible to keep the number of θ-layers in C small.
In occupancy grid maps, the actual location of obstacles can
be anywhere in the cells they occupy. Therefore, one usually
Alg. 1 Dynamic Update of C-space Collision Map
function UpdateVerticalClearance(x, y, v_new)
1:  v_old ← M(x, y)
2:  M(x, y) ← v_new
3:  for all θ do
4:      for all (x', y') = (x − i, y − j) with S^θ(i, j) > 0 do
5:          if v_new < S^θ(i, j) ∧ v_old ≥ S^θ(i, j) then
6:              C(x', y', θ) ← C(x', y', θ) + 1
7:              if C(x', y', θ) = 1 then NewOccupied(x', y', θ)
8:          else if v_new ≥ S^θ(i, j) ∧ v_old < S^θ(i, j) then
9:              C(x', y', θ) ← C(x', y', θ) − 1
10:             if C(x', y', θ) = 0 then NewEmpty(x', y', θ)
assumes an additional safety margin m around the robot, e.g., of m = 1 pixel unit. With this, we can formulate a bound on
the angular resolution for the discretization of θ̃ as follows: if the robot rotates from θ_i to θ_{i+1}, each point on the robot moves along an arc. The maximum arc length occurs at the
Fig. 5 panel labels: connecting start and goal to the closest cell on the Voronoi graph (top); our method: use bubble-like Voronoi areas at start and goal (bottom).
Fig. 5. Connecting start and goal to the Voronoi graph (green) during planning: using the shortest connection (top), the planned path (blue) can change abruptly for small changes of the start configuration, even for sightline- pruned paths (dashed). We create Voronoi bubbles around start and goal, and use goal-directed search therein, which yields more stable paths (bottom).
In 2D, Voronoi graphs are the union of points whose two closest obstacles are at the same distance. Similar to distance maps, we compute a Voronoi diagram V^θ(x, y) for every C^θ(x, y). Stacking these Voronoi diagrams results in the c-space Voronoi diagram V(x, y, θ) as shown in Fig. 4 (bottom). If θ is discretized according to Sect. III-A, Voronoi lines in neighboring layers connect to unbroken surfaces. We update the layers of the c-space distance map D^θ and Voronoi diagram V^θ incrementally as proposed for 2D in our previous work [12]. The functions SetObstacle(x, y) and RemoveObstacle(x, y) described therein are used to specify newly occupied or freed cells. UpdateDistanceMap() performs the update with a dynamic brushfire algorithm. The event NewOccupied(x', y', θ) in Alg. 1 calls SetObstacle(x', y') for the dynamic distance map D^θ(x, y), and NewEmpty(x', y', θ) calls RemoveObstacle(x', y'). After the update of C(x, y, θ), we execute UpdateDistanceMap() in parallel for every D^θ, which completes the update of D(x, y, θ) and V(x, y, θ).
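How these pieces fit together can be sketched as follows. The per-layer distance map interface (set_obstacle, remove_obstacle, update_distance_map) is a hypothetical stand-in for the dynamic brushfire implementation of [12], and the thread-pool parallelization merely illustrates the per-θ independence noted above.

```python
from concurrent.futures import ThreadPoolExecutor


def update_cspace(dist_maps, changes):
    """Propagate c-space collision-map events into the per-layer distance maps.

    dist_maps: list of per-theta dynamic distance map objects exposing
               set_obstacle(x, y), remove_obstacle(x, y), update_distance_map().
    changes:   list of (event, x, y, theta) with event in {"occupied", "empty"},
               as emitted by the NewOccupied/NewEmpty hooks of Alg. 1.
    """
    for event, x, y, theta in changes:
        if event == "occupied":
            dist_maps[theta].set_obstacle(x, y)
        else:
            dist_maps[theta].remove_obstacle(x, y)
    # the theta-layers are independent, so they can be updated in parallel
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda d: d.update_distance_map(), dist_maps))
```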
V. C-SPACE VORONOI PATH PLANNING
Given a c-space Voronoi map as described above, a goal-directed graph search on the Voronoi cells is no different from regular planning on 3D grids, except for the cyclic nature of the orientation dimension. This section details important aspects of Voronoi planning in dynamic environments. In general, the start and goal locations of a planning problem are not on the Voronoi graph. Trivial approaches search for the closest Voronoi cell at both locations, and connect them with straight lines to the graph [17]. This is problematic in practice, since a small change of the start pose can substantially change the planned path, as shown in Fig. 5 (top row). Our approach tackles this problem by applying the following steps:
1) insert a virtual obstacle at the start and goal location, 2) update the Voronoi map for all layers, 3) use brushfire expansion to mark cells in Voronoi bubbles, 4) plan a path from start to goal, 5) undo the changes made by 1) and 2).
Fig. 6. Maps of the environments we used to benchmark collision checks.
Through the virtual obstacles inserted at start and goal, these locations get enclosed by the Voronoi graph which generates
a “bubble”-like area as shown in Fig. 5 (bottom row). For
all θ-layers in parallel, the brushfire expansion starts at the start and at the goal position, and marks all visited cells while expanding like wavefronts up to the enclosing Voronoi lines. Then, a goal-driven A* search is initiated at the start location. It is restricted to only consider Voronoi cells and the ones in the start and goal bubble, marked by the brushfire expansion. In this way, the search expands from the start onto the Voronoi graph, follows Voronoi lines, and then connects to the goal when reaching the goal bubble. Since the whole path is the result of goal-directed graph search, the consecutive paths planned for a moving robot are very similar to each other and do not change abruptly (see Fig. 5). The brushfire expansion has to be run for each θ-layer separately using a 4-connected neighborhood. This ensures that the expansion is contained in the start and goal bubbles. Since the update of the c-space representations is run separately for the θ layers as well, the whole procedure can be parallelized except for the actual planning routine in step 4).
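The bubble marking itself is a plain breadth-first brushfire expansion over one θ-layer: starting from the start (or goal) cell, it marks free cells until it reaches the Voronoi diagram, which then bounds the bubble. A minimal sketch, assuming is_free and is_voronoi predicates over the grid:

```python
from collections import deque


def mark_bubble(seed, is_free, is_voronoi, shape):
    """4-connected wavefront from seed, stopped at (and including) Voronoi cells."""
    H, W = shape
    marked = set()
    queue = deque([seed])
    while queue:
        x, y = queue.popleft()
        if (x, y) in marked:
            continue
        marked.add((x, y))
        if is_voronoi(x, y):
            continue                      # Voronoi lines enclose the bubble
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < H and 0 <= ny < W and is_free(nx, ny) and (nx, ny) not in marked:
                queue.append((nx, ny))
    return marked   # the A* search is then restricted to Voronoi cells plus these bubbles
```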
VI. EXPERIMENTS
This section presents application examples and benchmarks for our incrementally updatable c-space representations. We performed tests on two sequences of 2D laser range data with 200 frames each. The data was recorded by a robot moving
in areas where walking people heavily affected the traversable
space, namely an office building (FR079) and a large foyer (FR101). The algorithms were initialized with the gridmaps shown in Fig. 6, using a resolution of 0.05 m per grid cell. The maximum range of the laser scanner was limited to 5 m.
To get 2.5D and 3D obstacles, we augmented the laser data with random height values between 0 m and the robot height. In 2D, we assumed a medium sized rectangular robot (0.85x0.45 m) and a large one (1.75x0.85 m). In 2.5D, we
modeled a wheelchair with a low front and a high rear part, as
in Fig. 1 (bottom-right). In 3D, the robot was modeled like a
Willow Garage PR2, with a frontal extension for the base and
the fixed arms (see Fig. 1 top-right). The C++ implementation
of our algorithms was executed on an Intel Core i7 2670 MHz,
using OpenMP for parallelization with up to 6 threads.
A. Collision Checks for Non-Circular Robots
The c-space collision map presented in Sect. III requires computation of the incremental update in every time step, but then, each collision check for the whole robot takes only a single lookup. In the 2D model, we exploit the symmetry of the rectangular robot as described in Sect. III-C.
5 th European Conference on Mobile Robots, September 7-9, 2011, Örebro, Sweden
52
Fig. 7 plot panels: 2D Occupancy Map, 2D Distance Map, C-space Map; axes: time for 200 frames [s] over #collision checks per frame.
Fig. 7. Computation time for different collision checking routines for two sequences and two robot models. The update required in every frame for the c-space collision map pays off starting from 10,000 collision checks per frame. The plot shows mean and standard deviations for 10 runs.
Fig. 8 plot legend: 2D (Medium robot), 2.5D (Wheelchair), 3D (PR2); x-axis: #collision checks per frame.
Fig. 8. Collision check performance for different robot and obstacle models, using our updatable c-space collision map (left) vs. the straight-forward occupancy gridmap approach (right). The costs for updating the c-space map are remedied by the faster collision checks for 10,000 checks or more per frame. The plot shows mean and standard deviations for 20 runs.
We compare our method to a previous collision checking approach for rectangular robots that uses incrementally updatable 2D distance maps [18]. As a baseline, we also include a straightforward approach that checks every cell of the robot's footprint for collision using an up-to-date 2D occupancy map. The results of this benchmark are shown in Fig. 7. The time required for updating the distance and c-space maps is shown by the first data point of each plot (zero collision checks). The slopes of the curves depend on the cost per collision check. In contrast to the distance map approach, the update time for the c-space map grows with the size of the robot (right vs. left column), but does not suffer from the open area in FR101 (bottom vs. top row). The update for the c-space collision map pays off for 10,000 or more collision checks, which can easily be required during path planning or trajectory optimization. In comparison, the break-even point for a single disc-shaped
Fig. 9 panel labels: C-space Voronoi (left), KPiece (right).
Fig. 9. Map of a factory floor (9.5x15.4 m) with start location (A) and three goals (B), (C), and (D). Example paths from (A) to (D) are shown for two different planners. The sampling-based planner (right) is challenged by narrow passages, while the performance of the Voronoi planner (left) is unaffected.
Fig. 10. Planning time and path length for three planners and the three planning tasks in Fig. 9. The plot shows mean and min/max for 20 runs. In contrast to the Voronoi planner, the sampling-based planners require several orders of magnitude more planning time for each narrow passage in the path.
object was at 22,400 for the disc-decomposition method by Ziegler et al., and 5 × 10^6 for the full c-space [10]. We repeat the experiment, but with 2.5D and 3D obstacles and robots this time. Compared to the 2D rectangular robot (dashed), the costs for the c-space update with 2.5D and 3D are higher, since the robots are not symmetric anymore and consist of two and three parts, respectively, see Fig. 8 (left). However, the cost per collision check (slope of the plots) is the same, as opposed to the curves for the straightforward occupancy grid map approach (right). In all cases, the update of the c-space map takes less than 15 ms per frame. Performing 150,000 collision checks per frame additionally requires at most another 15 ms. This corresponds to 10 × 10^6 collision checks per second for arbitrary robot shapes, which clearly outperforms even modern GPU-based approaches with 0.5 × 10^6 collision checks per second for simple polygons [4].
B. Path Planning using C-space Voronoi Maps
The c-space Voronoi maps presented in this paper provide means for complete grid map planning for non-circular omnidirectional robots using standard graph search algorithms
Fig. 11. Table-docking with a PR2 robot in a 3D map using Voronoi planning.
like A* or D* Lite [14]. With our algorithms for incremental updates they are applicable in dynamic environments. This experiment uses the Voronoi bubble technique proposed in Sect. V. We use A* to plan paths for the large robot model (see above) on the grid map of the factory floor shown in Fig. 9. The start pose is given by (A), and three possible goal poses by (B), (C), and (D). Each of the consecutive goals requires traversing another narrow passage. For comparison, we test our method against the KPiece and RRT implementations available in the Open Motion Planning Library [19]. All planners use our c-space map for collision checking. The average resulting planning times and path lengths for 20 runs per start-goal combination are shown in Fig. 10. Each additional narrow passage requires several orders of magnitude more planning time for the sampling-based planners, while the time taken by the Voronoi planner grows roughly linearly with the path length. Using per-cell collision checking rather than the c-space collision maps for the sampling-based planners increases the computation times by a factor of 3. As another application example, we plan the path of a PR2 robot using a c-space Voronoi map generated from real 3D point cloud data (see Fig. 11). After a precomputation phase of 0.5 s, planning a path on the incrementally updatable c-space Voronoi map takes less than 2.5 ms. Clearly, Voronoi planning is of advantage in narrow areas as long as the grid resolution is fine enough. Our incrementally updatable c-space Voronoi representation allows this idea to be applied to non-circular robots in dynamic environments, and could also be used in Voronoi sampling routines of other path planners [15].
VII. CONCLUSION
This paper presents incrementally updatable collision maps, distance maps, and Voronoi diagrams for non-circular robots. During initialization, these representations are computed using a given grid map or point cloud. During online application, our methods only update cells that are affected by changes in the environment, and can thus be used in real-world scenarios with unexpected or moving obstacles.
We consider different obstacle representations, namely a robot moving on a plane with overhanging obstacles, or vice versa, obstacles elevating from the ground, and a robot with overhanging parts. Our approach is also applicable to 2D and full 3D obstacle representations, and can exploit symmetries in the robot shape. In our experiments we showed that the presented methods allow for a high number of collision checks and reliable path planning at the frame rates required for online real-world applications. To further increase the efficiency of our algorithms, e.g., in higher-dimensional c-spaces of manipulators, one could combine them with hierarchical decomposition as proposed by Blanco et al. [9].
REFERENCES
[1] K. Wurm, A. Hornung, M. Bennewitz, C. Stachniss, and W. Burgard, "OctoMap: A probabilistic, flexible, and compact 3D map representation for robotic systems," in ICRA Workshop on Best Practice in 3D Perception and Modeling for Mobile Manipulation, Anchorage, 2010.
[2] K. D. Wise and A. Bowyer, "A survey of global configuration-space mapping techniques for a single robot in a static environment," International Journal of Robotics Research, vol. 19, no. 8, pp. 762–779, 2000.
[3] M. Tang, Y. J. Kim, and D. Manocha, "CCQ: Efficient local planning using connection collision query," in Algorithmic Foundations of Robotics IX, ser. Springer Tracts in Advanced Robotics (STAR), 2011, vol. 68.
[4] J. Pan and D. Manocha, "GPU-based parallel collision detection for real-time motion planning," in Algorithmic Foundations of Robotics IX, ser. Springer Tracts in Advanced Robotics (STAR), 2011, vol. 68.
[5] C. Schlegel, "Fast local obstacle avoidance under kinematic and dynamic constraints for a mobile robot," in Intl. Conf. on Intelligent Robots and Systems (IROS), Victoria, Canada, 1998.
[6] E. Behar and J.-M. Lien, "A new method for mapping the configuration space obstacles of polygons," Department of Computer Science, George Mason University, Tech. Rep. GMU-CS-TR-2011-11, 2010.
[7] L. E. Kavraki, "Computation of configuration-space obstacles using the fast Fourier transform," IEEE Transactions on Robotics and Automation, vol. 11, no. 3, pp. 408–413, June 1995.
[8] R. Therón, V. Moreno, B. Curto, and F. J. Blanco, "Configuration space of 3D mobile robots: Parallel processing," in 11th Intl. Conf. on Advanced Robotics, vol. 1–3, 2003.
[9] F. J. Blanco, V. Moreno, B. Curto, and R. Therón, "C-space evaluation for mobile robots at large workspaces," in IEEE Intl. Conf. on Robotics and Automation (ICRA), Barcelona, Spain, April 2005.
[10] J. Ziegler and C. Stiller, "Fast collision checking for intelligent vehicle motion planning," in IEEE Intelligent Vehicles Symposium, San Diego, CA, USA, June 2010.
[11] X. J. Wu, J. Tang, and K. H. Heng, "On the construction of discretized configuration space of manipulators," Robotica, vol. 25, pp. 1–11, 2006.
[12] B. Lau, C. Sprunk, and W. Burgard, "Improved updating of Euclidean distance maps and Voronoi diagrams," in IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), Taipei, Taiwan, October 2010.
[13] M. Foskey, M. Garber, M. C. Lin, and D. Manocha, "A Voronoi-based hybrid motion planner," in IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), Maui, HI, USA, October 2001.
[14] S. Koenig and M. Likhachev, "D* Lite," in Eighteenth National Conference on Artificial Intelligence (AAAI), 2002, pp. 476–483.
[15] L. Zhang and D. Manocha, "An efficient retraction-based RRT planner," in IEEE Intl. Conf. on Robotics and Automation (ICRA), Pasadena, 2008.
[16] J. Canny, "A Voronoi method for the piano-movers problem," in IEEE Intl. Conf. on Robotics and Automation (ICRA), March 1985.
[17] R. Geraerts and M. Overmars, "A comparative study of probabilistic roadmap planners," in Algorithmic Foundations of Robotics V, ser. Springer Tracts in Advanced Robotics, 2004, vol. 7, pp. 43–58.
[18] C. Sprunk, B. Lau, P. Pfaff, and W. Burgard, "Online generation of kinodynamic trajectories for non-circular omnidirectional robots," in IEEE Intl. Conf. on Robotics and Automation (ICRA), Shanghai, 2011.
[19] I. A. Şucan, M. Moll, and L. E. Kavraki, "The Open Motion Planning Library (OMPL)," 2010. [Online]. Available: http://ompl.kavrakilab.org/
Landmark Placement for Accurate Mobile Robot Navigation
Maximilian Beinhofer
Jörg Müller
Wolfram Burgard
Department of Computer Science, University of Freiburg, Germany
Abstract — Being able to navigate accurately is one of the basic capabilities a mobile robot needs to effectively execute a variety of tasks, including collision avoidance, docking, manipulation, and path following. One popular approach to achieve the desired accuracy is to use artificial landmarks for a sufficiently accurate localization. In this paper, we consider the problem of optimally placing landmarks for robots navigating repeatedly along a given set of trajectories. Our method aims at minimizing the number of landmarks for which a bound on the maximum deviation of the robot from its desired trajectory can be guaranteed with high confidence. The proposed approach incrementally places landmarks utilizing linearized versions of the system dynamics of the robot, thus allowing for an efficient computation of the deviation guarantee. To verify this guarantee for the real and possibly non-linear system dynamics of the robot, our method then performs a Monte Carlo simulation. We evaluate our algorithm in extensive experiments carried out both in simulation and with a real robot. The experiments demonstrate that our method requires substantially fewer landmarks than other approaches to achieve the desired accuracy.
Index Terms— Localization, Autonomous Navigation, Service Robots
I. INTRODUCTION
One of the main challenges for mobile service robots is
autonomous navigation. Especially in industrial applications,
a high degree of accuracy is required to avoid collisions
and, e.g., to ensure reliable docking maneuvers. However, robots navigating autonomously tend to deviate from their desired trajectory due to uncertainty in motion and position.
Typically, they are equipped with on-board sensors to estimate the deviation and react according to a feedback control law. In service tasks, robots usually have to repeatedly execute
a fixed number of trajectories. As real world environments
often contain dynamic and ambiguous areas, many practical applications rely on artificial landmarks placed along these trajectories to allow for accurate self-localization [6, 7]. In these applications, placing the artificial landmarks is often expensive or the computational power of the robot is limited. Therefore, it is beneficial to select the smallest possible num- ber of landmark positions which still guarantees an accurate navigation.
In this paper, we present a novel algorithm for land- mark placement. It computes a landmark configuration which guarantees that the deviation of the robot from its desired trajectory stays below a user-defined threshold d max with high confidence. To check if the guarantee holds, we use the specific properties of the robot and its navigation task in the landmark placement algorithm. The algorithm works in two stages. In a
Fig. 1. The miniature e-puck robot we use in the experiments. Mounted on top is a wireless webcam detecting uniquely identifiable visual markers attached to the ceiling.
first stage, it uses a linearized motion model and observation model of the robot to efficiently place landmarks in an incremental fashion. This stage aims at placing the smallest number of landmarks for which the deviation guarantee holds. In a second stage, the algorithm employs a Monte Carlo simulation for the computed landmark configuration to check if the guarantee also holds for the possibly non-linear models. Our approach has several characteristics which make it especially useful for mobile robot navigation. It can deal with arbitrary trajectories, and the maximum allowed deviation of the robot can be defined individually for every part of the trajectories. Taking into account the properties of the individual robotic system results in customized landmark sets:
while high-precision robots only need few landmarks for reaching the deviation guarantee, low cost systems typically require more landmarks. As our placement algorithm effi- ciently evaluates the guarantee using linear models, it can deal even with large instances of the landmark placement problem (i.e., long trajectories). Note that our incremental method simultaneously determines the number of landmarks needed and their positions to meet the desired guarantee. This paper is organized as follows. After discussing related work in the following section, we formalize the problem definition in Section III. In Section IV we describe the prediction of the deviation from the trajectory in linearized systems. Afterwards, in Section V, we present our incremental landmark placement algorithm. Finally, we provide extensive experiments in which we evaluate the algorithm in simulation as well as with the real e-puck robot [9] depicted in Fig. 1.
II. RELATED WORK
In the past, the problem of finding an optimal set of landmark positions has been addressed from several points
of view. Sala et al. [12] select landmark positions so that at every position in the map, at least k landmarks are observable. Jourdan and Roy [8] consider a set of possible target positions. They place sensors on the walls of buildings to minimize the average position error bound in the sensor network. Unlike these methods, our approach takes into account the full spec- ification of the robot and its navigation task. Vitus and Tomlin [15] also consider the full problem specifi- cation to place sensors in the environment. They approximate the a-priori covariances with the a-posteriori covariances of the most likely run of the robot. Similar to our approach, van den Berg et al. [3] evaluate sensor positions using the exact a-priori distributions in a linearized system. As they focus mainly on path planning, they restrict themselves to randomly sampled positions of a single sensor. Our previous work [1] selects landmarks maximizing the mutual information between the sensor readings and the states of the robot. It solely applies Monte Carlo simulations to estimate the a-priori distributions, which makes it computationally more demanding. While all of the approaches above place artificial land- marks or sensors before operation of the robot, the following approaches decide whether to utilize observed landmarks during operation. In contrast to our method, their decisions are based on a-posteriori distributions, i.e., the information already gathered by the robot. Thrun [14] selects the subset of the observed landmarks for localization which minimizes the average posterior localization error. Strasdat et al. [13] and Zhang et al. [17] both consider landmark selection in the context of the simultaneous localization and mapping (SLAM) problem. Strasdat et al. use reinforcement learning to create a landmark selection policy whereas Zhang et al. minimize the entropy of the a-posteriori distributions. In contrast to the above-mentioned approaches, our method optimizes the positions of the landmarks so that the robot stays within a user-defined region around the trajectory with high a-priori probability.
III. PROBLEM DEFINITION
We consider the problem of placing landmarks for localiza- tion and control of a mobile robot. We assume the time to be discretized into steps of equal duration. At each time step t the state of the robot is defined by a vector x t ∈ X , which changes over time according to the stochastic motion model
x_t = f(x_{t-1}, u_{t-1}, m_t),   (1)

where u_t ∈ U is the control command at time t. Thereby, the motion is disturbed by Gaussian noise m_t ∼ N(0, M_t). For self-localization, we assume that the robot is equipped with a sensor taking measurements of a set of landmarks A = {l_1, ..., l_n} according to the measurement function

z_t = h(x_t, n_t, A),   (2)

where the sensor signal is disturbed by Gaussian noise n_t ∼ N(0, N_t). The covariances M_t and N_t model the uncertainty in the motion and the measurements, respectively. We define a navigation task as a trajectory that the robot should follow. A trajectory T = (x^*_0, u^*_0, ..., x^*_T, u^*_T) can be considered as a series of states and desired controls the robot should execute to reach these states. In this navigation task, we assume that the trajectory will be executed using a linear-quadratic regulator (LQR) [4] feedback controller. At each time step t, the LQR controller selects the control command u_t which minimizes the quadratic cost function

E[ \sum_{\ell=t}^{T} ( (x_\ell - x^*_\ell)^T C (x_\ell - x^*_\ell) + (u_\ell - u^*_\ell)^T D (u_\ell - u^*_\ell) ) ],   (3)

where C and D are positive definite weight matrices.

The localization uncertainty and, as a result, also the deviation from the desired trajectory strongly depend on the specific configuration of landmarks A = {l_1, ..., l_n} which are observed during operation. Our approach selects landmarks l_i from a continuous space of possible landmark locations L. We evaluate the quality of a landmark configuration based on the deviation of the (real) state x_t from the desired state x^*_t at each time step t (ignoring the control part u^*_{0:T} of the trajectory). In particular, we consider the Euclidean distance between the part x^{pos}_t of the state describing the position of the robot and x^{*pos}_t. We focus on limiting the deviation

d^{pos}(x_t, x^*_t) = || x^{pos}_t - x^{*pos}_t ||_2   (4)

of the robot from its trajectory at all time steps t ∈ [0, T]. Our approach aims at finding the landmark configuration A with the fewest elements for which the deviation guarantee

∀ t ∈ [0, T]: p( d^{pos}(x_t, x^*_t) ≤ d_max(x^*_t) | A ) ≥ p_min   (5)

holds. This guarantee ensures that the probability of deviating at most d_max from the desired trajectory is at least p_min. Note that d_max can be either a globally constant value or depend on the position or time.
IV. PREDICTING THE DEVIATION FROM THE TRAJECTORY

To validate the guarantee (5) for a certain landmark configuration A, we need to compute p(d^{pos}(x_t, x^*_t) ≤ d_max(x^*_t) | A). For this, we consider the a-priori probability distribution

p(x_t - x^*_t | A) = ∫ p(x_t - x^*_t | u_{0:t-1}, z_{1:t}, A) · p(u_{0:t-1}, z_{1:t} | A) du_{0:t-1} dz_{1:t},   (6)

which averages over the observations z_{1:t} and controls u_{0:t-1} that are not yet available during landmark placement. For general, non-linear systems, the a-posteriori distributions p(x_t - x^*_t | u_{1:t-1}, z_{1:t}, A) can be used to estimate the a-priori distributions (6) via Monte Carlo simulation by sampling observations and controls and averaging over numerous runs [1]. However, this is computationally expensive for large instances of the landmark placement problem.
A. A-Priori State Estimation in Linearized, Gaussian Systems
In the main part of our landmark placement algorithm, we locally linearize the system and approximate all distributions by Gaussians. This allows for an analytical evaluation of the guarantee (5), making it more efficient than Monte Carlo
simulations. Linearizing the motion model (1) and the sensor model (2) around the desired trajectory (x^*_{0:T}, u^*_{0:T}) by first order Taylor expansion leads to

x_t = f(x^*_{t-1}, u^*_{t-1}, 0) + A_t (x_{t-1} - x^*_{t-1}) + B_t (u_{t-1} - u^*_{t-1}) + V_t m_t,   (7)
z_t = h(x^*_t, 0, A) + H_t (x_t - x^*_t) + W_t n_t,   (8)

with the Jacobians

A_t = ∂f/∂x (x^*_{t-1}, u^*_{t-1}, 0),   B_t = ∂f/∂u (x^*_{t-1}, u^*_{t-1}, 0),
V_t = ∂f/∂m (x^*_{t-1}, u^*_{t-1}, 0),
H_t = ∂h/∂x (x^*_t, 0, A),   W_t = ∂h/∂n (x^*_t, 0, A).   (9)

In this linearized system, the Gaussian a-posteriori distribution p(x_t - x^*_t | u_{1:t-1}, z_{1:t}, A) ∼ N(µ_t - x^*_t, P_t) of the deviation from the trajectory can be computed recursively using a Kalman filter [16]. The Kalman filter propagates a given initial Gaussian distribution p(x_0 - x^*_0 | A) ∼ N(µ_0 - x^*_0, P_0) according to the actual control commands in the motion update

µ̄_t - x^*_t = A_t (µ_{t-1} - x^*_{t-1}) + B_t (u_{t-1} - u^*_{t-1}),
P̄_t = A_t P_{t-1} A_t^T + V_t M_t V_t^T,   (10)

and according to the measurements in the observation update

K_t = P̄_t H_t^T (H_t P̄_t H_t^T + W_t N_t W_t^T)^{-1},   (11)
µ_t - x^*_t = µ̄_t - x^*_t + K_t (z_t - h(x^*_t, 0, A) - H_t (µ̄_t - x^*_t)),
P_t = (I - K_t H_t) P̄_t.   (12)

Note that the covariance P_t and the Kalman gain K_t depend, via the Jacobians, on x^*_{0:t} and u^*_{0:t-1} but not on the actual values of u_{0:t-1} and z_{1:t} (see (10), (11), (12)). Therefore they can be calculated before the robot starts operation (a-priori). The minimization of the expected deviation from the desired trajectory (3) in the LQR controller can also be solved a-priori, linearly relating the control command u_t to the estimated state µ_t via a feedback matrix L_t:

u_t - u^*_t = L_t (µ_t - x^*_t).   (13)
L_t depends on the a-priori known Jacobians (9) and the weight matrices (3) and is derived explicitly in [4]. As described above, we express the whole navigation algorithm, which consists of executing a motion command, making an observation, localizing, and selecting the next motion command depending on the localization, by linear functions. For this linear navigation system, van den Berg et al. [2] proved that the a-priori joint distribution of x_t and µ_t is a Gaussian

\begin{pmatrix} x_t - x^*_t \\ µ_t - x^*_t \end{pmatrix} ∼ N( 0, R_t ),   R_t = \begin{pmatrix} S_t & Cov(x_t, µ_t) \\ Cov(x_t, µ_t)^T & U_t \end{pmatrix},

and that its covariance R_t can be computed recursively by

R_0 = \begin{pmatrix} P_0 & 0 \\ 0 & 0 \end{pmatrix},   R_t = F_t R_{t-1} F_t^T + G_t \begin{pmatrix} M_t & 0 \\ 0 & N_t \end{pmatrix} G_t^T,

with

F_t = \begin{pmatrix} A_t & B_t L_{t-1} \\ K_t H_t A_t & A_t + B_t L_{t-1} - K_t H_t A_t \end{pmatrix},   (14)

G_t = \begin{pmatrix} V_t & 0 \\ K_t H_t V_t & K_t W_t \end{pmatrix}.   (15)

R_t only depends on a-priori known variables, namely the Jacobians (9), the Kalman gain (11), and the feedback matrix (13). These variables can be computed a-priori since we linearize the models around the (a-priori known) desired states x^*_{0:T} and not around the (a-priori unknown) estimates µ_{0:T} as it is done for example in the extended Kalman filter [16].
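To make the a-priori computation concrete, the following NumPy sketch evaluates the recursions (10)-(15) along a trajectory, assuming the Jacobians A_t, B_t, V_t, H_t, W_t, the noise covariances M_t, N_t, and the LQR feedback matrices L_t have already been computed for every time step. It is a minimal illustration of the equations above, not the authors' implementation; the variable names are assumptions.

```python
import numpy as np

def apriori_covariances(A, B, V, H, W, M, N, L, P0):
    """Compute the a-priori covariances of x_t - x_t^* via eqs. (10)-(15).

    A, B, V, H, W, M, N: lists of Jacobians / noise covariances for t = 1..T.
    L: list of LQR feedback matrices for t = 0..T-1.
    P0: initial estimation covariance.
    Returns the list of marginal covariances S_t (top-left block of R_t).
    """
    n = P0.shape[0]
    P = P0.copy()
    R = np.block([[P0, np.zeros((n, n))],
                  [np.zeros((n, n)), np.zeros((n, n))]])
    S = [R[:n, :n].copy()]
    for t in range(1, len(A) + 1):
        At, Bt, Vt = A[t - 1], B[t - 1], V[t - 1]
        Ht, Wt = H[t - 1], W[t - 1]
        Mt, Nt, Lprev = M[t - 1], N[t - 1], L[t - 1]
        # Kalman prediction and gain (eqs. (10)-(12)); independent of the data.
        P_bar = At @ P @ At.T + Vt @ Mt @ Vt.T
        K = P_bar @ Ht.T @ np.linalg.inv(Ht @ P_bar @ Ht.T + Wt @ Nt @ Wt.T)
        P = (np.eye(n) - K @ Ht) @ P_bar
        # Joint covariance recursion of [x_t - x_t^*, mu_t - x_t^*] (eqs. (14)-(15)).
        F = np.block([[At, Bt @ Lprev],
                      [K @ Ht @ At, At + Bt @ Lprev - K @ Ht @ At]])
        G = np.block([[Vt, np.zeros((n, Wt.shape[1]))],
                      [K @ Ht @ Vt, K @ Wt]])
        Q = np.block([[Mt, np.zeros((Mt.shape[0], Nt.shape[1]))],
                      [np.zeros((Nt.shape[0], Mt.shape[1])), Nt]])
        R = F @ R @ F.T + G @ Q @ G.T
        S.append(R[:n, :n].copy())
    return S
```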
B. Evaluation of the Deviation Guarantee
In the linearized system we can efficiently check whether the deviation guarantee (5) holds. Let S^{pos}_t be the part of the a-priori covariance S_t of p(x_t - x^*_t | A) corresponding to the position of the robot. The length a_t(A) of the major semi-axis of the p_min-confidence ellipsoid of S^{pos}_t can be calculated using

a_t(A) = c √λ_t,   (16)

where λ_t is the largest eigenvalue of S^{pos}_t and c is a scaling factor corresponding to p_min via the regularized Gamma function as described in [2]. If a_t(A) ≤ d_max, then the p_min-ellipsoid of S^{pos}_t is inside a circle with radius d_max and guarantee (5) holds for the linearized system. Note that this test is a conservative approximation and is exact if the p_min-ellipsoid is a sphere.
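For a two-dimensional position, the scaling factor c can be obtained from the chi-square quantile function (equivalently, the inverse regularized Gamma function). A small sketch of the resulting test, under the assumption of a 2D position block and using SciPy's chi-square quantile, is given below.

```python
import numpy as np
from scipy.stats import chi2

def guarantee_holds_linear(S_pos, d_max, p_min=0.99):
    """Check a_t(A) = c * sqrt(lambda_max) <= d_max for one time step (eq. (16)).

    S_pos: 2x2 position block of the a-priori covariance S_t.
    """
    c = np.sqrt(chi2.ppf(p_min, df=2))         # p_min-confidence scaling in 2D
    lam_max = np.max(np.linalg.eigvalsh(S_pos))
    a_t = c * np.sqrt(lam_max)                 # major semi-axis of the ellipse
    return a_t <= d_max
```

Calling this test for every time step along the trajectory gives the largest prefix of the trajectory on which the linearized guarantee holds.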
C. Visibility of Landmarks
For robots with a limited sensor range, the non-linearity of the sensor model at the border of the field of view of the robot induces a large discrepancy between the real model and its linearization. To avoid this, we conservatively estimate for every landmark if the robot will observe it at time t ,
following the approach of Vitus and Tomlin [15]: We consider a landmark as visible at time t only if it is p_min-visible, i.e., it is visible from every position inside the p_min-ellipsoid of S^{pos}_t around x^{*pos}_t. If a landmark configuration satisfies the guarantee (5) when only p_min-visible landmarks are observed, it also satisfies it when using all visible landmarks. Note that for general state spaces and fields of view (like in the framework proposed in [11]), checking if a certain landmark is p_min-visible is highly non-trivial and could for example be done using a sampling-based approach. However, for the two-dimensional case (i.e., x^{pos} = [x y]) and for a robot with a circular field of view with radius r, there exists a closed-form solution to checking if a given landmark l is p_min-visible. Consider the p_min-ellipse of S^{pos}_t centered at the origin of the coordinate system. We apply a principal axis transformation on the ellipse so that afterwards its semi-axes lie on the axes of the coordinate system. Applying this transformation on S^{pos}_t yields a diagonal matrix S'^{pos}_t = diag(λ_1, λ_2) with the diagonal elements identical to the eigenvalues of the matrix. Any point x' = [x'_1, x'_2]^T on the transformed ellipse E can then be described as x' = c · diag(√λ_1, √λ_2) · x for a point x = [x_1, x_2]^T on the unit circle C and the scaling factor c from (16). To check if landmark l is p_min-visible given an a-priori estimate (x^*_t, S_t), we apply the principal axis transformation also on the relative position (l - x^*_t) of the landmark, resulting in l' = [l'_1, l'_2]. Checking if l is p_min-visible means checking if

max_{x' ∈ E} || x' - l' ||_2 ≤ r
⇔ max_{x ∈ C} || c · diag(√λ_1, √λ_2) · x - l' ||_2 ≤ r
⇔ max_{x_1 ∈ [-1, 1], sgn ∈ {-1, 1}} (c √λ_1 x_1 - l'_1)^2 + (c √λ_2 sgn √(1 - x_1^2) - l'_2)^2 - r^2 ≤ 0.
Applying a distinction of cases for sgn , we set the derivative with respect to x 1 of the function inside the max -operator to 0 and reorder the resulting equation. This yields a quartic term, which can be solved analytically in an efficient way.
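Instead of solving the quartic symbolically, the same p_min-visibility test can be approximated numerically by maximizing the distance over the ellipse boundary. The sketch below does this by sampling the boundary, which is a simplification of the closed-form solution described above; the sampling density and helper names are assumptions.

```python
import numpy as np

def is_pmin_visible(landmark, x_star_pos, S_pos, c, r, samples=720):
    """Approximate p_min-visibility check for a circular field of view.

    landmark, x_star_pos: 2D positions; S_pos: 2x2 position covariance;
    c: confidence scaling from eq. (16); r: sensor range.
    The landmark is p_min-visible iff it is within range r from every point of
    the p_min-ellipse, i.e. iff the maximum distance over the ellipse boundary
    does not exceed r.
    """
    lam, E = np.linalg.eigh(S_pos)             # principal axis transformation
    l_prime = E.T @ (np.asarray(landmark) - np.asarray(x_star_pos))
    phi = np.linspace(0.0, 2.0 * np.pi, samples, endpoint=False)
    boundary = np.stack([c * np.sqrt(lam[0]) * np.cos(phi),
                         c * np.sqrt(lam[1]) * np.sin(phi)], axis=1)
    max_dist = np.max(np.linalg.norm(boundary - l_prime, axis=1))
    return max_dist <= r
```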
V. INCREMENTAL LANDMARK PLACEMENT ALGORITHM
Our landmark placement approach aims at minimizing the number of landmarks that have to be placed for the deviation guarantee to hold. Since the dimensionality of the search space grows with the length of the trajectory, in general, globally searching for the optimal landmark configuration is computationally intractable. However, using an incremental placement algorithm, we can efficiently find an approximate solution to the landmark placement problem.
A. Landmark Placement for the Linearized System
In a first stage, our algorithm employs the linearized system to incrementally place landmarks. Considering linearized Gaussian models has the advantage that the a-priori distributions can be efficiently calculated analytically. The objective of our approach is to minimize the number of landmarks needed for the deviation guarantee to hold on the whole trajectory x^*_{0:T}. We approximate this minimum by maximizing the number of time steps every single landmark guarantees (5). Let

t_max(A, x^*_{0:T}) = max { t | a_s(A) ≤ d_max ∀ s ≤ t }   (17)

be the maximum time step for which the landmark set A guarantees (5) in the linearized system for the first part x^*_{0:t_max} of the trajectory. In every iteration our algorithm adds the landmark l^* which maximizes t_max to the already selected set of landmarks A. In some cases, one additional landmark is not enough to increase t_max. This can happen for example if d_max(x^*_{t_max+1}) is chosen considerably smaller than d_max(x^*_{t_max}). In these cases, the algorithm selects the landmark which minimizes a_{t_max}(A) instead. Reducing a_{t_max}(A) increases the likelihood that in the next step a landmark can be found which increases t_max again (see (17)). Algorithm 1 describes the incremental landmark placement for the linearized system.
Algorithm 1 Landmark Placement for the Linearized System
Input: trajectory x^*_{0:T}, space of landmark locations L
Output: landmark configuration A
  A ← ∅
  t ← 0
  while t < T do
    l^* ← argmax_{l ∈ L} t_max(A ∪ {l}, x^*_{0:T})
    t^* ← t_max(A ∪ {l^*}, x^*_{0:T})
    if t^* = t then
      l^* ← argmin_{l ∈ L} a_{t_max}(A ∪ {l})
    end if
    A ← A ∪ {l^*}
    t ← t^*
  end while
  return A

B. Monte Carlo Validation

In a second stage, we check the computed landmark configuration A for the deviation guarantee via Monte Carlo simulation using the real (possibly non-linear, non-Gaussian)
models. This is necessary to account for approximation errors due to the linearization and the Gaussian assumption. The Monte Carlo simulation samples robot states x_{0:T}, controls u_{0:T}, and observations z_{1:T} of the landmarks in A and counts the number of time steps t in which d^{pos}(x_t, x^*_t) ≤ d_max(x^*_t), as required in guarantee (5). Averaging over numerous runs yields an estimate p_MC of p_min for which the deviation guarantee in the real system holds. If p_MC < p_min, one can use arbitrary heuristics to place additional landmarks. For example, one could run our algorithm for increased values of p_min or decreased values of d_max.
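A minimal sketch of this validation stage is given below. It assumes a user-supplied `simulate_run` function that executes one closed-loop run with the non-linear models and returns the per-time-step deviations d^{pos}(x_t, x^*_t); how the counts are aggregated into p_MC (here: the overall fraction of time steps within the bound) is an assumption for illustration.

```python
import numpy as np

def monte_carlo_validate(simulate_run, landmarks, d_max, num_runs=1000):
    """Estimate p_MC for a landmark configuration via Monte Carlo simulation.

    simulate_run(landmarks) -> array of deviations d^pos(x_t, x_t^*) for t = 0..T.
    d_max: scalar bound or array of per-time-step bounds d_max(x_t^*).
    """
    within = 0
    total = 0
    for _ in range(num_runs):
        deviations = np.asarray(simulate_run(landmarks))
        within += int(np.sum(deviations <= d_max))
        total += deviations.size
    return within / total   # estimate p_MC of p_min
```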
VI. EXPERIMENTAL RESULTS
We evaluated our landmark placement algorithm and com- pared it to other landmark placement approaches in extensive experiments both in simulation and with a real robot.
A. Setup of the Simulation Experiments

In the simulation experiments, we considered a wheeled robot with a differential drive. For self-localization we simulated three different kinds of sensors detecting uniquely identifiable landmarks: a range-only sensor, measuring only the distance to the landmarks, a bearing-only sensor, measuring only the relative angle between the robot and the landmarks, and a range-and-bearing sensor, measuring both. In this setup, the differential drive motion model and all sensor models have non-linear components. For all sensors, we assumed a circular field of view around the robot with radius 2 m. We evaluated our algorithm on five navigation tasks T1-T5 for all three sensor models, resulting in 15 experiments. Fig. 2 shows the landmarks our algorithm computed for the three sensor models in the first task T1, together with a-priori and a-posteriori distributions. Fig. 3 depicts the landmark configurations and a-priori distributions for the other four tasks T2-T5 for a range-only sensor. For all trajectories, we set p_min = 99% and d_max = 0.5 m. For the pick-and-place task T5, we changed d_max to 0.2 m in the pick up zone and in the deposit zone (gray rectangles). Because of the high accuracy necessary to fulfill this task, we simulated a more precise robotic system than for the other tasks, i.e., we scaled down the noise values.
Fig. 2. The landmark configurations (red triangles) our algorithm computed for a figure eight trajectory T1 for three different sensor models: range-only (left), bearing-only (middle) and range-and-bearing (right). The blue points and ellipses in the upper row correspond to the means and 99% covariance ellipses of the a-priori distributions, and in the lower row to the a-posteriori distributions of simulated sample runs. The true positions of the robot in the sample runs are depicted as black lines.

Fig. 3. The landmark configurations (red triangles) our algorithm computed for four sample trajectories T2-T5 using a range-only sensor. T2 is a square, T3 a curved shape, T4 a sweeping trajectory, and T5 a pick-and-place task. The blue points and ellipses correspond to the means and 99% covariance ellipses of the a-priori distributions. In T5, the pick up zone and the deposit zone are marked as gray rectangles.
B. Restriction of the Search Space
To implement the argmax and argmin operators in Algorithm 1, we used an insight gained from empirical evaluations. In the experiments, l^* typically lay inside the field of view of x^*_t. Therefore, we restricted the search space for the optimizations to this area. As t_max and a_{t_max} usually had several local optima inside the field of view of x^*_t, we performed a search on a discrete grid covering this region followed by a fine search with Powell's method [10] around the optimum on the grid. To evaluate this strategy, we also ran grid searches on the full environment of the robot followed by Powell optimization steps. In all 15 experiments, Algorithm 1 with the full searches did not select fewer landmarks than our implementation with the restricted search spaces. On an Intel Core i7 at 2.8 GHz with 12 GB RAM, the execution time of our algorithm averaged over the 15 experiments was 138 s using the restricted search spaces and 617 s using the full searches.
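The coarse-to-fine search can be sketched as follows: evaluate the placement objective on a grid covering the field of view around x^*_t, then refine the best grid point with Powell's method. This is a generic illustration using SciPy's Powell optimizer, not the authors' code; the objective function and grid spacing are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def place_next_landmark(objective, center, fov_radius, grid_step=0.25):
    """Maximize objective(l) over candidate landmark positions l near center.

    objective: callable returning the score to maximize (e.g. t_max, or -a_tmax).
    center: position x_t^* around whose field of view the search is restricted.
    """
    xs = np.arange(center[0] - fov_radius, center[0] + fov_radius + 1e-9, grid_step)
    ys = np.arange(center[1] - fov_radius, center[1] + fov_radius + 1e-9, grid_step)
    # Coarse search on a discrete grid covering the field of view.
    candidates = [(x, y) for x in xs for y in ys]
    best = max(candidates, key=lambda l: objective(np.array(l)))
    # Fine search with Powell's method, started at the best grid point.
    res = minimize(lambda l: -objective(l), x0=np.array(best), method="Powell")
    return res.x
```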
C. Influence of the Sensor Model
As can be seen in Fig. 2, the number of landmarks our algorithm computes and their configuration strongly depend on the chosen sensor model. For the range-only sensor, the landmarks tend to be further away from the trajectory than for the other two sensor models. The numbers of landmarks needed are stated in the first row of Table I. Also the results of the Monte Carlo simulations in our algorithm varied strongly for the different sensor models. In every Monte Carlo simulation, we performed 1000 simulated runs of the robot to get an estimate p_MC of p_min. For all trajectories and all sensor models, the values of p_MC for the landmark sets our approach computed are stated in the fifth row of Table I. For the range-only sensor, p_MC is considerably above the intended value of 99% in all tasks. For the range-and-bearing sensor, p_MC is slightly below 99% in the pick-and-place task, and for the bearing-only sensor, p_MC is below 99% in three of the five tasks. These results indicate that the non-linear components of the range measurements are less critical for landmark placement than those of the bearing measurements.
D. Comparison to other Landmark Placement Strategies
For comparison, we evaluated three other landmark placement methods. Each method starts with a minimum number of landmarks and successively increases the number (or density) of landmarks until it finds a set for which the guarantee in the linearized system holds.
• On trajectory places landmarks equidistantly along the desired trajectory.
• On grid places a landmark in the center of each cell of a regular grid. Starting with one cell covering the whole environment, the cell size is decreased at every iteration until the deviation guarantee holds. For efficiency, landmark positions which are outside the field of view of all states x^*_{0:T} on the desired trajectory are not used.
• Random successively places landmarks at randomly chosen positions observable from the desired trajectory.
The numbers of selected landmarks and the values of p_MC for all landmark placement strategies are stated in Table I.
TABLE I
NUMBERS OF SELECTED LANDMARKS AND RESULTS OF MONTE CARLO SIMULATIONS

                        Range-only sensor                  Bearing-only sensor                Range-and-bearing sensor
                        T1     T2     T3     T4     T5     T1     T2     T3     T4     T5     T1     T2     T3     T4     T5
Number of landmarks
  Our approach          14     12     11     18     10     11     9      7      16     7      9      8      6      13     5
  On trajectory         -      -      25     -      58     41     -      12     -      23     12     9      10     17     7
  On grid               48     32     38     56     23     26     19     17     30     18     20     20     17     25     16
  Random                108    63     62     138    62     75     66     51     88     31     38     29     38     37     15
p_MC
  Our approach          0.999  0.998  0.999  0.999  0.999  0.979  0.978  0.991  0.994  0.826  0.999  0.997  0.998  0.994  0.986
  On trajectory         -      -      0.996  -      0.962  0.353  -      0.955  -      0.773  0.996  0.999  0.983  0.999  0.980
  On grid               0.999  0.999  0.999  0.999  0.998  0.997  0.981  0.995  0.999  0.931  0.996  0.999  0.995  0.999  0.999
  Random                0.999  0.999  0.999
Dashes in the table indicate that no valid landmark configuration could be found. For all experiments, our approach placed fewer landmarks than the other approaches. The on trajectory method is the best method after ours for the range-and-bearing sensor, measured in the number of landmarks placed. However, for the other two sensor models, the on trajectory method was not always able to find a landmark configuration which satisfied the guarantee in the linearized system. For this method, especially the non-linearities in the bearing-only sensor model resulted in low values for p_MC.
E. Experiments with a Real Robot

To further validate the simulation experiments, we evaluated a landmark set our algorithm generated also on the real e-puck robot [9] depicted in Fig. 1. As a range-and-bearing sensor, we utilized a webcam pointing upwards detecting uniquely identifiable ARToolKit markers [5] attached to the ceiling. We considered the navigation task T1 scaled down to suit the miniature size of our robot (diameter 75 mm) and the lower ceiling. Scaling the task by the factor 0.08 yields d_max = 0.04 m. To evaluate the deviation d^{pos}(x_t, x^*_t) of the e-puck robot from its desired trajectory, we obtained the reference positions x_t from a MotionAnalysis motion capture system with four digital Raptor-E cameras. During 20 autonomous runs, d^{pos}(x_t, x^*_t) was below d_max in 99.7% of the time steps.
VII. CONCLUSIONS
In this paper, we presented a landmark placement method guaranteeing a bound on the maximum deviation of the robot from its trajectory with high confidence. During a placement stage, our approach approximates the real motion model and sensor model by their linearizations to efficiently evaluate the guarantee. In a subsequent validation stage, we apply
a Monte Carlo simulation using the real system dynamics
to check if the selected landmark set satisfies the deviation
guarantee also for the possibly non-linear models. In contrast
to other approaches, our algorithm is customizable to specific
robotic systems and navigation tasks and inherently chooses the number of landmarks needed. In extensive experiments, we demonstrated that our method outperforms other approaches.
ACKNOWLEDGMENTS
This work has partly been supported by the German Re- search Foundation (DFG) within the Research Training Group
1103. The authors would like to thank Axel Rottmann for his implementation of the Powell algorithm.
REFERENCES

[1] M. Beinhofer, J. Müller, and W. Burgard. Near-optimal landmark selection for mobile robot navigation. In Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), 2011.
[2] J. van den Berg, P. Abbeel, and K. Goldberg. LQG-MP: Optimized path planning for robots with motion uncertainty and imperfect state information. In Proc. of Robotics: Science and Systems (RSS), 2010.
[3] J. van den Berg, S. Patil, R. Alterovitz, P. Abbeel, and K. Goldberg. LQG-based planning, sensing, and control of steerable needles. In Proc. of the Int. Workshop on the Algorithmic Foundations of Robotics (WAFR), 2010.
[4] D. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2nd edition, 2000.
[5] M. Billinghurst and H. Kato. Collaborative augmented reality. Communications of the ACM, 45(7):64–70, 2002.
[6] H.F. Durrant-Whyte, D. Pagac, B. Rogers, M. Stevens, and G. Nelmes. Field and service applications – an autonomous straddle carrier for movement of shipping containers. IEEE Robotics & Automation Magazine, 14:14–23, 2007.
[7] J.-S. Gutmann, E. Eade, P. Fong, and M. Munich. A constant-time algorithm for vector field SLAM using an exactly sparse extended information filter. In Proc. of Robotics: Science and Systems (RSS), 2010.
[8] D.B. Jourdan and N. Roy. Optimal sensor placement for agent localization. ACM Trans. Sen. Netw., 4(3):1–40, 2008.
[9] F. Mondada, M. Bonani, X. Raemy, J. Pugh, C. Cianci, A. Klaptocz, S. Magnenat, J.-C. Zufferey, D. Floreano, and A. Martinoli. The e-puck, a robot designed for education in engineering. In Proc. of the Conf. on Autonomous Robot Systems and Competitions, 2009.
[10] M. Powell. An efficient method for finding the minimum of a function of several variables without calculating derivatives. The Computer Journal, 7(2):155–162, 1964.
[11] C. Pradalier and S. Sekhavat. "Localization Space": a framework for localization and planning, for systems using a sensor/landmarks module. In Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), 2002.
[12] P. Sala, R. Sim, A. Shokoufandeh, and S. Dickinson. Landmark selection for vision-based navigation. IEEE Transactions on Robotics and Automation, 22(2):334–349, 2006.
[13] H. Strasdat, C. Stachniss, and W. Burgard. Which landmark is useful? Learning selection policies for navigation in unknown environments. In Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), 2009.
[14] S. Thrun. Finding landmarks for mobile robot navigation. In Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), 1998.
[15] M. Vitus and C. Tomlin. Sensor placement for improved robotic navigation. In Proc. of Robotics: Science and Systems (RSS), 2010.
[16] G. Welch and G. Bishop. An introduction to the Kalman filter. Technical report, 1995.
[17] S. Zhang, L. Xie, and M.D. Adams. Entropy based feature selection scheme for real time simultaneous localization and map building. In Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2005.
Mobile Robots in Hospital Environments: an Installation Case Study
F. Capezio, F. Mastrogiovanni, A. Scalmato, A. Sgorbissa, P. Vernazza, T. Vernazza, R. Zaccaria
Abstract — This paper discusses a number of issues that arose during the deployment of a commercial robotic platform in a hospital scenario. In particular, a realistic case study and installation process at the Polyclinic of Modena (Italy) are described, where robots have to accomplish two different types of tasks, namely drugs delivery and waste transportation. The lesson learned is that the design of a robot to be used in civilian environments must not only take into account state-of-the-art and dependable algorithms, but also consider real-world issues (i.e., not strictly scientific or technological) when adopting techniques for navigation, obstacle avoidance or localization. The paper discusses the most important features of the overall system architecture that have been modified and the solutions that have been selected to deal with this realistic scenario.
Index Terms— Add up to 4 keywords here
I. INTRODUCTION
During the past few years, service mobile robotics for object transportation in semi-structured environments and warehouses has developed into a mature research field. A number of systems and algorithms have been presented, which are able – in principle – to provide customers with specifications about their use. Industrial applications, and especially Automated Guided Vehicles (AGVs in short), are nowadays very common. However, a number of limitations are usually associated with these systems, which are related to the need of operating in a workspace that is specifically modified, or even built on purpose, for the robots. Among the many functional or non-functional requirements that robots must be able to adhere to, a service robot must be dependable. From a customer's perspective, failures imply costs: a service robot must be less expensive and more efficient than a corresponding team of human workers when performing the same task. It is necessary to analyse how customer or contingent requirements can be guaranteed (or at least enforced) by the robot control framework and architecture, both at the software and hardware levels. In particular, it is necessary to take into account the impact of such requirements on the robot design, both at the hardware and software levels. This paper describes the lessons learned in deploying a commercial service robot in a real-world environment. In particular, the paper discusses how a robotic platform has been designed to better comply with a number of issues that arose during the deployment process, as well as to guarantee the execution of the assigned tasks, namely drugs delivery and waste transportation. These design choices take into account both hardware and software issues. The deployment of robots in similar scenarios is not new. The first robot operating in hospital environments is HelpMate [1], which appeared in 1991 and was used for the transportation of small-sized objects at Danbury Hospital. After this first example, other robots followed the same philosophy and strategies adopted for HelpMate. I-Merc [2], which is specialized in hospital meal transportation, and Care-o-Bot [3, 4], which has been developed to provide assistance to elderly people and people with special needs, are such systems. In recent years, mobile robots for hospital transportation started to become a commercial business. The best-known system in Europe is TransCar [6] by Swisslog, which is a laser-guided AGV that can transport payloads of up to 450 kg. The vehicle is able to move in order to lift specifically designed boxes where the material to be handled is assumed to be located. Many robots are able to operate in the same workspace or storage area. However, the system assumes that the environment is very well-structured: such requirements include, for instance, wide spaces, flat floors and elevators that precisely align with the floor in order to allow robots to enter. The TransCar solution seems feasible in new buildings, although it is difficult to adopt it in existing (and old) hospital buildings. Recently, Swisslog produced a small autonomous robot called RoboCourier [7], which is designed for inter-department material transportation, with a payload of about 25 kg. TUG [8] from Aethon is similar to RoboCourier. It has been designed for the transportation of various materials such as dietary trays, medications, linens, blood samples, medical records and pharmacy drugs. It can be interfaced with most standard carts used in hospitals, which can store up to 250 kg. FMC Technologies develops ATLIS [9], a robot for the transportation of heavy objects that is very similar to TransCar, but characterized by a heavier payload. However, it requires major changes to the workspace. Recently, the Muratek Keio Robot [5] has been presented, which is not yet commercially available but seems very promising. It is an omni-directional autonomous robot, developed for the transportation of light payloads. The paper describes the experiences gained during a real-world installation of Merry Porter, a service mobile robot from Genova Robot. The paper is organized as follows: Section II describes the considered real-world case study; Sections III and IV introduce the robot architecture that has been designed to accomplish the required tasks; Section V discusses the installation procedure in a hospital environment. Conclusions follow.
II. CASE STUDY
In this section, we describe the two types of missions that have been required by the hospital service management unit.
Fig. 1. Different robot tasks. Left: waste transportation. Right: drugs delivery.
The case study discussed in this article is a hospital environment, the Polyclinic of Modena. The hospital needs a high level of automation to accomplish two tasks: waste transportation and drugs delivery. Each task is characterized by a specific path for the robots to follow and by specific needs. As depicted in Figure 1 on the right, the drugs delivery mission starts from the pharmacy, which is located on the basement floor, and arrives at the sixth floor, where various pediatric departments are located. The total path length is about 50 m. The delivery is scheduled three times per day, in particular early morning, midday and evening, provided that the deliveries occur before patients' visits. Waste transportation (Figure 1 on the left) has a different and longer path covering about 500 m. The robot starts from the fourth floor, where the operating rooms of general surgery are located, and transports waste to a waste dump placed outside the hospital, passing through the basement floor. Without the adoption of robots, this task needs two workers: one to transport waste to the basement floor, the other to move the waste to the external garbage dump with a small electric machine. As can be noticed, robots need to use and control: (i) elevators; (ii) additional environmental devices, such as the ones controlling automated doors: e.g., the pharmacy has a system for access control for security reasons. To accomplish these two tasks, robots must be equipped with different components. On the one hand, a robot requested to perform drugs delivery must be provided with a security box to prevent drug theft. The box should be opened only with a predefined code, known by authorized personnel. On the other hand, a robot performing the waste transportation task must be provided with a conveyor to load the waste (so that no human personnel is needed for loading) and with a GPS for localization, since it has to transport its payload outside the hospital. A commercial robot system must maximize reliability because, as previously pointed out, each inconvenience or delay has a cost. Typically, maintenance efforts are part of the costs that must be minimized. After analysing these missions, we found the following weak points:
• Hardware structure: real and general-purpose environments can require a number of hardware features on the robot side in order to overcome physical problems like slopes or heavy payloads.
• Navigation and security: robots must navigate in different environments that can present obstacles. Areas can be crowded, thereby requiring particular trajectories: as a consequence, the navigation system must be adaptable. Even more critical is system security. Robots must avoid contact with objects or people. The security system must have higher priority than the navigation system and it must fulfil the rules defined in the European laws, in particular EN 1525:1997.
• Localization: localization failures are not acceptable:
when the robot localization fails, the manual intervention of a technician is necessary.
• User interface: in order for the system to be used by non-specialists, a simple interface allowing users to provide robots with tasks and to monitor their execution is required. Interfaces can be different: either for simple monitoring or for providing advanced commands, for qualified personnel. The following sections consider each problem and present the solution developed for the presented system.
III. HARDWARE ARCHITECTURE
Merry Porter has a base architecture that can be extended with different modules, both hardware and software. The hardware modules make it possible to configure the robot with different devices, like a GPS or conveyors, while the software modules allow each component to be separated and managed with its proper characteristics. The base architecture of each robot fulfils these specifications:
• maximum speed of 1.5 m/s;
• max slope without payload 30%;
• 7% slope with 100 kg of payload;
• 180 kg of payload without slope;
• ability to overcome 3 cm steps. These characteristics are required by a standard hospital, where there can be very long corridors (hence: high speed is required in order to avoid wasting time), the floors can be rugged and sloping, and the alignment between the floors themselves and the elevators may not be perfect (hence: the robot must overcome steps). The base robot architecture has the following characteristics (Figure 2):
• unicycle structure with a front steering wheel;
• a telescopic turret for placing sensors at a variable height between 1.40 and 2.10 m, to avoid occlusions due to surrounding people;
• 4 motors and drives (2 for the rear wheels, 1 for the front steering wheel, and 1 for the telescopic turret) connected by CANbus;
• 1 Hokuyo laser scanner for localization only, placed on top of the turret;
• 1 Sick S3000 laser scanner for security and obstacle avoidance placed at 30 cm from the floor;
Fig. 2. Merry Porter base architecture at the Polyclinic of Modena.
• 2 IP cameras for payloads and environmental surveil- lance, placed on the turret;
• 1 DLPS active beacon localization system, placed on top of the turret;
• 16 ultrasound sensors placed around the robot body for obstacle detection;
• 1 alert flashing light to advise people about the robot presence.
• 1 embedded fanless PC with an Intel Celeron 1 GHz CPU and a Linux-based OS.
Robots are equipped with two lithium polymer batteries allowing them 4 hours of continuous work after a 30-minute charge. Consequently, robots can work for more than 21 hours per day. The two cameras prevent or record material thefts and, at the same time, they can improve the security system of the hospital, by visualizing their input streams in the central room of the security personnel.
IV. SOFTWARE ARCHITECTURE
Since the Merry Porter software architecture comprises different modules, these are integrated within a multi-agent based system, called ETHNOS [10], where each agent has a specific task to carry out, like managing missions, localization, or trajectory planning and following (Figure 3). ETHNOS provides soft real-time extensions to the Linux kernel supporting different communication, execution, and representation requirements. It is built over a Linux kernel and a dedicated network communication protocol allowing not only the communication between the different parts of the robot, but also with other software agents running on board devices distributed throughout the environment.
Fig. 3. Agents present in the robot software architecture.
A. Navigation
Requirements: The navigation subsystem must provide an efficient algorithm for trajectory following, allowing the robot to navigate in different kinds of environments. It must be characterized by good obstacle avoidance algorithms, and any kind of collision must be prevented at all costs: as a consequence, the associated priority is maximal. This is necessary in any civilian environment where people move around and other obstacles, like hospital carts, are often left in the robot workspace. Approach: The robot navigates on a predefined graph, following a given node sequence that can be chosen by the user or automatically generated by a trajectory planner. In the absence of obstacles, node-to-node navigation is simply performed by moving along the connecting straight lines. The navigation subsystem can be observed in Figure 3: the Trajectory Generator computes the linear and angular speeds that must be fed to the actuators, depending on the current position returned by the Localization Agent and the next point to be reached determined by the Task&Path Manager. Controls are then sent to the Motor Manager. Obstacle detection and avoidance: In order to guarantee the safest possible behaviour, obstacle detection is managed by a safety laser scanner (the previously cited SICK S3000) with two different safety mechanisms, i.e., hardware and software. The hardware mechanism consists of an internal circuit that, when the laser detects something in a given (short) range, automatically stops the robot by cutting the motor power, which causes the motor brakes to be activated. This system is necessary according to European laws, and it is activated only when extremely necessary. Standard obstacle detection is performed by a software agent, which subsumes the hardware mechanism in most cases. The software analyses all the data returned by the safety laser and feeds them to a simple rule-based engine: the final output also depends on the robot speed and on the width of corridors and passages. For example, the higher the
speed, the greater the distance at which the robot begins to slow down; the smaller the passage (such as when entering an elevator), the smaller the minimum allowed distance from obstacles before stopping the robot. Simple stop (or slow down) strategies guarantee a safe approach, making it possible to avoid collisions with people and objects. Furthermore, they guarantee a very repeatable behaviour. Therefore, they maximize the system reliability. As a consequence of a security standard check on the robots performed by an external expert body, another hardware mechanism has been installed. This mechanism prevents the lateral collision and the shear effect that can occur in the case that the robot, when performing a high-curvature trajectory or when moving very close to walls, comes very close to people. To solve this issue, two photocells have been placed on rods of about 15 cm length on the lateral part of the robot: when someone is located between the photocells or hits the rods, the motor power is cut off. Other 15 cm rods with photocells are placed on the back of the robot to prevent collisions when the robot moves backward. This solution complies with the European standard: reverse motion is used only when exiting from the elevator and when approaching conveyor belts. On the one hand, when the robot is in cluttered and very crowded environments, it does not avoid obstacles (but simply stops), because this could cause it to be trapped in very narrow areas. On the other hand, an algorithm for obstacle avoidance can be used in wider areas. The algorithm is based on the concept of "roaming stripes" [11]. A roaming stripe is a diamond-shaped region connecting two nodes. These are designed in such a way that their intersection with the free space is always a convex area. When the obstacle avoidance algorithm is active, the trajectory to avoid obstacles is computed through a standard Artificial Potential Field, by storing laser data into a local probabilistic map centered on the robot reference frame. During navigation, the robot is allowed to deviate from its trajectory to avoid obstacles, but it is never allowed to exit from the roaming stripe. This guarantees that the robot always moves within a convex area containing the target node. By assuming that, sooner or later, all dynamic obstacles (e.g., people, pieces of furniture or objects) move or are moved away, this in turn guarantees that the robot will be able to reach the target by following a straight line. Pros and Cons: The adopted solution is able to guarantee that robots navigate in selected areas, because the roaming stripes algorithm always selects a trajectory that lies within the stripe. This is very important in a real-world case, because medical personnel must always be guaranteed a free path for emergency cases. The on-board safety system (made up of a laser range finder and a number of sonars) is aimed at preventing collisions in a very conservative way. However, this approach increases the time necessary to complete the task.
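The key constraint of the roaming-stripes scheme is easy to state geometrically: any avoidance waypoint must stay inside the diamond-shaped region spanned by the two graph nodes. The sketch below checks this constraint, assuming the stripe is a rhombus whose long diagonal is the segment between the nodes and whose half-width is a design parameter; the exact parametrization used in [11] may differ.

```python
import numpy as np

def inside_roaming_stripe(point, node_a, node_b, half_width):
    """True if `point` lies inside the diamond (rhombus) whose long diagonal is
    the segment node_a -> node_b and whose half short diagonal is half_width."""
    a, b, p = map(np.asarray, (node_a, node_b, point))
    center = 0.5 * (a + b)
    axis = b - a
    length = np.linalg.norm(axis)
    if length < 1e-9:
        return np.linalg.norm(p - a) <= half_width
    u = axis / length                         # unit vector along the stripe
    d = p - center
    along = abs(np.dot(d, u))                 # offset along the node-to-node axis
    perp = abs(d[0] * (-u[1]) + d[1] * u[0])  # lateral offset
    # Diamond: scaled L1-ball in the stripe's principal axes.
    return along / (0.5 * length) + perp / half_width <= 1.0

# During obstacle avoidance, a candidate waypoint from the potential field is
# accepted only if inside_roaming_stripe(candidate, node_a, node_b, half_width)
# holds, so the robot never leaves the convex region containing the target node.
```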
B. Localization
Requirements: A robot required to navigate in an environment for very long periods of time and with high reliability must be provided with a robust and redundant localization system. With respect to the introduced case study, it must be noted that localization must also provide a very precise estimate of the robot position. The workspace is characterized by many narrow passages, like elevators and doors. Furthermore, the robot must be able to approach conveyors with a positioning error lower than 2 cm. In the case that this accuracy cannot be met, the payload could fall down. Approach: Merry Porter is provided with two different localization systems for indoor operation and a GPS-based system for outdoor operation. All three systems communicate positioning information to an Extended Kalman Filter (EKF). 1) Indoor – Map Based Localization: Usual localization approaches based on available maps periodically compare observations taken from the environment with a map representation, with the aim of correcting the a priori estimate provided by odometry. Among the most common approaches, Kalman filter based localization assumes that a unimodal probability distribution is adequate to represent the pose estimate, thus being a simple and efficient solution to local position tracking. Merry Porter exploits the EKF with a priori known environment maps. The design choices associated with the whole localization subsystem have been described in [12], where the use of line-based features (extracted from a laser rangefinder) is discussed, together with a technique to improve the corrective power of each extracted line feature. In particular, [12] introduces the concept of "low frequency cross section":
in typical indoor environments, above a given height from the ground, environments are more regular and simple. This has obvious consequences when computing line segments from range data: each laser scan produces a smaller number of lines, each one being longer and more reliable. However, the hospital environment turns out to be more complex. For instance, on the basement floor there are areas where the ceiling is particularly low, thereby requiring the adoption of a telescopic turret (on which the localization laser is mounted), which is able to lower its position if necessary. This is actually performed by the robot control architecture when the robot enters specific areas. 2) Indoor - Active Beacon Localization: The active beacon localization system (called DLPS) has been described in [13]. The basic idea derives from classic active beacon methods, where each beacon has a unique identifier and a known position in the map. When a beacon is in the robot's line of sight, the robot can send it specific commands; as a consequence, the active beacon can reply with information useful for localization. Different commands issued by the robot are possible. In particular, the system adopts the classic trilateration principle, that is, when the robot detects three beacons it is able to compute its position precisely. The DLPS system introduces a different approach, where beacon-based localization is formalized as a position-tracking problem. The information sent by the beacons is processed by the EKF because of the high non-linearity of the measurement model. At every time instant t_i, the DLPS system provides the robot with the identifiers of the detected beacons, their positions (x_{b_k}, y_{b_k}) and the angle λ_k, for each detected beacon k. The angle λ_k is interpreted as the estimate of the direction of the beacon. More precisely, two values are considered: λ_k and σ_k correspond respectively to the mean
value and the variance of the direction estimate for beacon k. This estimate is fed to the EKF to correct the estimate of the robot position. 3) Outdoor - GPS: The outdoor localization system derives from a similar system previously tested in the context of an airport surveillance robot project, described in [14], where the very large space, characterized by few line features in the environment, required an alternative localization method. A common non-differential GPS localization system has been developed, which is integrated with the laser scanner data. Unfortunately, GPS measurements are corrupted by error sources introducing highly coloured noise with significant low-frequency components, which can be modelled as a non-zero mean value (i.e., a bias) in the GPS error that varies slowly in time. The analysis of both longitude and latitude data collected for 24 hours at a fixed location clearly shows this effect:
considering the Fast Fourier Transform (FFT) of the GPS longitude and latitude data, low-frequency components can be noticed, corresponding to a slow variation of the signal over time. By estimating this bias in the GPS measurements, one can expect to improve GPS data precision, therefore making robot localization more accurate. To this purpose, an augmented state vector x̂ is defined, comprising the x, y, and θ components of the robot pose, as well as the x_gps and y_gps components of the low-frequency bias. Since the coloured components are then separated from the additive white Gaussian components of the GPS noise (i.e., they are comprised in the state vector), the EKF can be applied. When moving outdoors, the detected line features of the a priori map mostly correspond to external walls of buildings. In this situation, a smaller number of features are usually available, since the robot mostly traverses areas where no lines are detected at all. However, when line features are available, their contribution is sufficient to estimate the full state vector, and hence both the robot pose and the GPS bias. Pros and Cons: In indoor areas the combined use of map-based and beacon-based localization is complementary. The map-based algorithm does not need any external devices but, in a few parts of the robot paths, it does not provide the necessary precision. Even worse, the variance associated with the localization estimate is very high, or no guarantees about localization can be given. From a commercial perspective, the beacon-based localization system adds costs to the overall system set-up. However, it turns out to be a good choice for enforcing system reliability. Finally, GPS-based localization has good performance and is mandatory for outdoor localization.
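A minimal sketch of the augmented-state EKF for the GPS bias described above is given below, assuming a unicycle motion model and illustrative noise values; it only shows how the bias components enter the prediction and measurement equations and is not the deployed filter.

import numpy as np

# State: [x, y, theta, b_x, b_y] -- robot pose plus the slowly varying GPS bias.

def predict(x, P, v, w, dt, q_pose=1e-3, q_bias=1e-6):
    """EKF prediction with a unicycle model; the bias is modelled as a slow random walk."""
    theta = x[2]
    x = x.copy()
    x[0] += v * dt * np.cos(theta)
    x[1] += v * dt * np.sin(theta)
    x[2] = theta + w * dt
    F = np.eye(5)
    F[0, 2] = -v * dt * np.sin(theta)
    F[1, 2] = v * dt * np.cos(theta)
    Q = np.diag([q_pose, q_pose, q_pose, q_bias, q_bias])   # illustrative values
    return x, F @ P @ F.T + Q

def gps_update(x, P, z, r_gps=4.0):
    """GPS measures position plus the coloured (bias) component:
    z = [x + b_x, y + b_y] + white noise, so the bias is observable jointly
    with the pose once other measurements (e.g., line features) are fused."""
    H = np.array([[1, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1]], float)
    R = r_gps * np.eye(2)
    y = z - H @ x
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    return x + K @ y, (np.eye(5) - K @ H) @ P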
C. User Interface - Robomap
Requirements: A commercial robot system must present a simple and user-friendly interface. Approach: The main user interface is provided by a software tool called Robomap, an integrated management system that allows maps and missions to be defined and the state of the robots to be monitored in real time. Map definition is only allowed to well-qualified personnel and it provides all the elements necessary to describe the fundamental parts of the environment to be used
Fig. 4. Robomap Interface: Map definition starting from CAD maps.
Fig. 5. Robomap Interface: On-line monitoring and visualization of the robot map, mission nodes and active beacon positions.
by the robot localization system. A map comprises both lines and active beacons. Map lines can be specified directly by using and refining already available CAD maps (Figure 4). An alternative method for populating the map with lines suitable for localization is to trace out the lines produced by the line extraction algorithm applied to the laser scanner data; to do this, it is necessary to move the robot to the location of interest. The task (or mission) definition mode allows the creation of nodes and links between nodes, which compose the path. Each link can be associated with a number of parameters, such as the desired speed for the robot on the link or the preferred localization method. Furthermore, for each node it is possible to declare a set of actions that the robot must perform in order for the mission to complete or continue, such as requesting an elevator at the current floor, opening an automated door or loading the payload (by communicating with the conveyor base station).
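To illustrate this node/link model, a hypothetical mission definition could look roughly as follows; the field names and values are invented for illustration and do not reflect Robomap's actual file format.

# Hypothetical mission description in the spirit of Robomap's node/link model;
# every field name and value below is illustrative only.
mission = {
    "nodes": {
        "pharmacy":   {"pose": (12.4, 3.1, 0.0),
                       "actions": ["load_payload"]},
        "elevator_0": {"pose": (40.2, 8.7, 1.57),
                       "actions": ["request_elevator", "enter_elevator"]},
        "ward_6":     {"pose": (118.0, 15.5, 3.14),
                       "actions": ["open_automated_door", "unload_payload"]},
    },
    "links": [
        {"from": "pharmacy",   "to": "elevator_0",
         "max_speed": 0.8, "localization": "map_based"},
        {"from": "elevator_0", "to": "ward_6",
         "max_speed": 0.4, "localization": "beacon_based"},
    ],
}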
During robot missions, Robomap allows for on-line monitoring of the robot (Figure 5). This mode allows different robot properties to be monitored, such as the position with respect to the map, speed, jog, obstacles detected by the sonars, battery status and detected beacons, to name but a few. Furthermore, a human supervisor can issue commands to the robot for movement control and mission management (e.g., starting or changing a mission).
V. APPLICATIONS
This section describes the system set-up realized at the Polyclinic of Modena, Italy. We explain the main environment requirements and the device installation process, and give an overview of performance issues and improvements on typical hospital tasks.
A. Environment and Devices
Robots have to control a number of devices distributed in the environment in order to accomplish their tasks. For example, moving between two floors requires an elevator call; accessing an area may require controlling automated doors. Environment automation is essentially based on custom devices that communicate via an Echelon LonWorks network. The network guarantees a very flexible topology and is easily adapted to any environment. In order for all the robot parameters to be accessible to human supervisors, the whole workspace is provided with a wireless network communication system. This system is also used to allow communication between robots and devices, specifically through a monitoring system remotely located on a workstation, which is connected both to the wireless network and to the Echelon fieldbus. In order to explain how the environment automation works, we briefly discuss the elevator control. Custom devices are connected to the elevator control system, which requires a simple installation that must be performed by a qualified technician. The device is directly connected to the elevator buttons; therefore, it is possible to perform all the actions provided by the elevator panel. The interface device is connected to the Echelon fieldbus like all the other devices and the supervisor workstation. Therefore, the supervisor can query each device and issue commands to it. Robots can interact with the supervisor by simply using the wireless network provided by the hospital and the multi-agent framework used to develop both the robot and supervisor applications.
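The resulting message flow between a robot, the supervisor and a fieldbus device can be pictured roughly as follows; every class and method name here is an invented placeholder, since the real Echelon/LonWorks and supervisor APIs are not described in the paper.

# Illustrative message flow only: the robot asks the supervisor over the
# wireless network, and the supervisor forwards the request to the elevator
# interface device on the fieldbus.  All names below are hypothetical.
class ElevatorDevice:
    def __init__(self, fieldbus, node_id):
        self.fieldbus, self.node_id = fieldbus, node_id

    def press(self, button):
        # The interface device is wired to the elevator panel buttons.
        self.fieldbus.send(self.node_id, {"cmd": "press", "button": button})

class Supervisor:
    def __init__(self, devices):
        self.devices = devices          # e.g. {"elevator_0": ElevatorDevice(...)}

    def handle_robot_request(self, request):
        # Requests arrive from robots over the hospital wireless network.
        if request["type"] == "call_elevator":
            self.devices[request["device"]].press(request["floor"])
            return {"status": "requested"}
        return {"status": "unknown_request"}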
In order to complete the required tasks, the described building automation is not sufficient. In particular, the described hospital application needs other components, such as a conveyor where the personnel leave waste awaiting robot transportation. The conveyor must be connected to the automation network and it is controlled by the supervisor application. In the described set-up, a number of conveyors have been installed, which can be directly controlled by the robots through ZigBee commands.
B. DLPS Implementation
Fig. 6. The robot takes the elevator from floor -1 to floor 6. (1-4): the robot calls the elevator and enters. (5-6): the robot reverses out, turns and continues the mission on the 6th floor.

The DLPS system is composed of a rotating laser placed on top of the robot's telescopic turret as well as a number of active beacons distributed in the environment. Active beacons are battery powered and communicate through the ZigBee protocol. These two characteristics allow set-up time and costs to be minimized. The laser beam carries information about the encoder associated with the motor responsible for rotating the laser itself. When the laser hits a beacon, the beacon sends a message to the robot with the beacon identifier and the encoder value: from these it is possible to estimate the robot position as previously described. Battery duration is very important in real-world scenarios because human personnel cannot frequently be dedicated to replacing beacon batteries. The beacon electronics essentially consist of two parts, namely the ZigBee communication module and the laser management module, which interprets the message contained in the laser light. The following strategy is implemented to maximise beacon life (a simple firmware-style sketch of this duty cycle follows the list):
• beacons wake up every 28 seconds and check whether any robot is nearby; otherwise they return to a sleep state;
• if a robot finds a beacon, it keeps the beacon active and turns on the laser management module;
• if the beacon does not receive any signal from the robot for 160 seconds, it returns to the sleep state. With these devices and this strategy, beacons need to be recharged only every three months.
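The duty-cycling policy above can be summarised by the following sketch of a beacon firmware loop; the radio and laser_rx interfaces are placeholders (not a real driver API), while the 28 s and 160 s constants come from the text.

import time

WAKE_PERIOD_S = 28      # beacon polls for nearby robots every 28 s
ACTIVE_TIMEOUT_S = 160  # return to sleep after 160 s without robot messages
BEACON_ID = 42          # placeholder identifier of this beacon

def beacon_loop(radio, laser_rx):
    """Duty-cycle sketch of the beacon behaviour described in the text.
    `radio` (ZigBee link) and `laser_rx` (laser-message decoder) are
    hypothetical interfaces used only for illustration."""
    while True:
        time.sleep(WAKE_PERIOD_S)            # sleep state: electronics mostly off
        if not radio.robot_nearby():         # quick ZigBee presence check
            continue
        laser_rx.power_on()                  # enable the laser-management part
        last_msg = time.monotonic()
        while time.monotonic() - last_msg < ACTIVE_TIMEOUT_S:
            msg = laser_rx.poll()            # decode the encoder value carried by the laser
            if msg is not None:
                radio.send({"beacon_id": BEACON_ID, "encoder": msg.encoder})
                last_msg = time.monotonic()
        laser_rx.power_off()                 # back to the sleep state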
C. System evaluation and automation improvements
The described system reduces the number of manual transportation runs and improves the management of the described tasks. Workers and nurses can delegate a number of low-level tasks to robots, thereby focusing on other tasks where their know-how is essential. Drug delivery is not a very frequent task. In contrast, waste transportation requires a huge amount of effort from human personnel. The current battery power cycle allows robots to operate more than 21 hours a day, i.e., a period of time that would normally be covered by three workers, since the standard working time is 8 hours a day. Another consideration relates to medical waste regulations. With the introduction of robots, the
hospital management wants to separate the different paths of material distribution (i.e., medicine, linen and food) from those of waste disposal, as medical regulations impose. In particular, surgical waste is subject to stricter rules, requiring that this waste come into contact with personnel as infrequently as possible. From this perspective, automated robot-based transportation directly supports this requirement. As can be seen in Figure 1, the two task types are characterized by very different paths and distances, and therefore by different completion times. The robot performing waste transportation has to cover about 500 meters and take an elevator, with a mean overall mission time of about 15 minutes; this time is quite constant because there are not many people in those hospital areas. The distance to traverse for drug delivery is much shorter, only about 50 meters. However, the path presents a large number of obstacles and people. As a consequence, the time to accomplish two deliveries, in the two pediatric departments, is about 8-10 minutes. Currently, waste transportation presents a number of difficult issues, to the extent that it may not be convenient for the hospital management. It is evident that human personnel are more flexible and provide better performance in some cases. On the one hand, the existing waste transportation system is proven to work reasonably well: the structures for waste storage and transportation are adequate for the task when it is performed by humans. On the other hand, using an automated robot-based system requires the installation of a number of ad hoc conveyors, and the existing large carts for waste transportation must be replaced by smaller containers. Switching from human to automated transportation implies an important change in hospital habits that is difficult to implement.
VI. CONCLUSIONS
An example of commercial robots for hospital environments has been presented. Merry Porter robots use different and consolidated approaches to navigate, avoid obstacles and localize. All of these aspects have been designed to improve reliability, with the aim of complying not only with state-of-the-art research results, but above all with the real-world regulations and functional needs of a typical hospital environment. The current set-up at the Polyclinic of Modena has been described, where two robots are used for drug delivery and waste transportation. The article addresses a number of topics that played a major role in the deployment process. In particular, it focuses on the architectural and system-level choices that were deemed necessary for a real-world installation.
REFERENCES
[1] J. M. Evans. HelpMate: an autonomous mobile robot courier for hospitals. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '94), 'Advanced Robotic Systems and the Real World', 1994.
[2] F. Carreira et al. i-Merc: A mobile robot to deliver meals inside health services. In Proceedings of the 2006 IEEE International Conference on Cybernetics and Intelligent Systems & Robotics, Automation and Mechatronics, 2006, pp. 1-8.
[3] B. Graf, M. Hans, and R. D. Schraft. Care-O-bot II: Development of a next generation robotic home assistant. Autonomous Robots, 16(2):193-205, 2004.
[4] Care-O-bot 3: http://www.care-o-bot-research.org/care-o-bot-3
[5] M. Takahashi, T. Suzuki, H. Shitamoto, T. Moriguchi, and K. Yoshida. Developing a mobile robot for transport applications in the hospital domain. Robotics and Autonomous Systems, 58(7):889-899, July 2010.
[6] Swisslog TransCar: http://www.swisslog.com/index/hcs-index/hcs-systems/hcs-agv/hcs-agv-transcar.htm
[7] Swisslog RoboCourier: http://www.swisslog.com/index/hcs-index/hcs-systems/hcs-mobile-robot-speciminder.htm
[8] Aethon TUG: http://www.aethon.com/products/default.php
[9] FMC Technologies ATLIS: http://www.fmcsgvs.com/content/products/atlis.htm
[10] M. Piaggio, A. Sgorbissa, and R. Zaccaria. A programming environment for real-time control of distributed multiple robotic systems. Advanced Robotics, 14(1):75-86, 2000.
[11] A. Sgorbissa and R. Zaccaria. Roaming Stripes: smooth reactive navigation in a partially known environment. In Proceedings of the 2003 IEEE International Workshop on Robot and Human Interactive Communication, Millbrae, California, USA, Oct. 31 - Nov. 2, 2003.
[12] F. Mastrogiovanni, A. Sgorbissa, and R. Zaccaria. Learning to extract line features: beyond split and merge. In Proceedings of the 10th International Conference on Intelligent Autonomous Systems (IAS-10), Baden-Baden, Germany, July 23-25, 2008.
[13] M. Piaggio, A. Sgorbissa, and R. Zaccaria. Beacon based navigation and localization for service mobile robots. In Proceedings of the 2nd International Symposium on Robotics and Automation (ISRA 2000), Monterrey, N.L., Mexico, November 10-12, 2000.
[14] F. Capezio, A. Sgorbissa, and R. Zaccaria. An augmented state vector approach to GPS-based localization. In Proceedings of the 7th IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA-07), Jacksonville, Florida, June 21-23, 2007.
[15] F. Mastrogiovanni, A. Sgorbissa, and R. Zaccaria. Context assessment strategies for ubiquitous robots. In Proceedings of the 2009 IEEE International Conference on Robotics and Automation (ICRA '09), Kobe, Japan, May 12-17, 2009.
6-DOF Multi-session Visual SLAM using Anchor Nodes
John McDonald ∗
Michael Kaess ‡
Cesar Cadena †
José Neira †
John J. Leonard ‡
∗ Department of Computer Science, National University of Ireland Maynooth, Maynooth, Co. Kildare, Ireland † Instituto de Investigación en Ingeniería de Aragón (I3A), Universidad de Zaragoza, Zaragoza 50018, Spain ‡ Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT),
Cambridge, MA 02139, USA
Abstract— This paper describes a system for performing multi-session visual mapping in large-scale environments. Multi-session mapping considers the problem of combining the results of multiple Simultaneous Localisation and Mapping (SLAM) missions performed repeatedly over time in the same environment. The goal is to robustly combine multiple maps in a common metrical coordinate system, with consistent estimates of uncertainty. Our work employs incremental Smoothing and Mapping (iSAM) as the underlying SLAM state estimator and uses an improved appearance-based method for detecting loop closures within single mapping sessions and across multiple sessions. To stitch together pose graph maps from multiple visual mapping sessions, we employ spatial separator variables, called anchor nodes, to link together multiple relative pose graphs. We provide experimental results for multi-session visual mapping in the MIT Stata Center, demonstrating key capabilities that will serve as a foundation for future work in large-scale persistent visual mapping.
Index Terms— multi-session visual SLAM, lifelong learning, persistent autonomy
I. INTRODUCTION
Despite substantial recent progress in visual SLAM [17],
many issues remain to be solved before a robust, general visual mapping and navigation solution can be widely deployed. A key issue in our view is that of persistence – the capability for a robot to operate robustly for long periods of time. As a robot makes repeated transits through previously visited areas, it cannot simply treat each mission as a completely new experiment, not making use of previously built maps. However, nor can the robot treat its complete lifetime experience as “one big mission”, with all data considered as a single pose graph and processed in a single batch optimisation. We seek to develop a framework that achieves a balance between these two extremes, enabling the robot to leverage off the results of previous missions, while still adding in new areas as they are uncovered and improving its map over time. The overall problem of persistent visual SLAM involves several difficult challenges not encountered in the basic SLAM problem. One issue is dealing with dynamic environments, requiring the robot to correct for long-term changes, such as furniture and other objects being moved, in its internal representation; this issue is not addressed in this paper. Another critical issue, which is addressed in this paper, is how to pose the state estimation problem for combining the results
of multiple mapping missions efficiently and robustly.
Cummins defines the multi-session mapping problem as “the task of aligning two partial maps of the environment collected by the robot during different periods of operation [3].”
Fig. 1: Internal architecture of windowed and multi-session visual SLAM (vSLAM) processes.
We consider multi-session mapping in the broader context of life-long, persistent autonomous navigation, in which we would anticipate tens or hundreds of repeated missions in the same environment over time. As noted by Cummins, the “kidnapped robot problem” is closely related to multi-session mapping. In the kidnapped robot problem, the goal is to estimate the robot’s position with respect to a prior map given no a priori information about the robot’s position.
Also closely related to the multi-session mapping problem is the multi-robot mapping problem. In fact, multi-session mapping can be considered as a more restricted case of multi-robot mapping in which there are no direct encounters between robots (only indirect encounters, via observations made of the same environmental structure). Kim et al. presented an extension to iSAM to facilitate online multi-robot mapping based on multiple pose graphs [11]. This work utilised “anchor nodes”, equivalent to the “base nodes” introduced by Ni and Dellaert for the decomposition of large pose graph SLAM problems into submaps for efficient batch optimisation [18], in an approach called Tectonic Smoothing and Mapping (T-SAM). Our work extends the approach of Kim et al. [11] to perform multi-session visual mapping by incorporating a stereo odometry frontend in conjunction with a place-recognition system for identifying inter- and intra-session loop closures.
II. RELATED WORK
Several vision researchers have demonstrated the operation of visual mapping systems that achieve persistent operation in a limited environment. Examples of recent real-time visual SLAM systems that can operate persistently in a small-scale environment include Klein and Murray [12], Eade and Drummond [5], and Davison et al. [4, 8]. Klein and Murray's system is highly representative of this work, and is targeted at the task of facilitating augmented reality applications in small-scale workspaces (such as a desktop). In this approach, the processes of tracking and mapping are performed in two parallel threads. Mapping is performed using bundle adjustment. Robust performance was achieved in an environment as large as a single office. While impressive, these systems are not designed for multi-session missions or for mapping of large-scale spaces (e.g., the interior of a building).
There have also been a number of approaches reported for
large-scale visual mapping. Although a comprehensive survey
is beyond the scope of this paper, we do draw attention to the more relevant stereo-based approaches. Perhaps the earliest of these was the work of Nistér et al. [19] on stereo odometry. In the robotics literature, large-scale multi-session mapping has been the focus of recent work of Konolige et al. in developing view-based mapping systems [14, 13]. Our research is closely related to this work, but has several differences. A crucial new aspect of our work in relation to [14] is the method we use for joining the pose graphs from different mapping sessions. Konolige and Bowman join pose graphs using “weak links”, which are used to connect disjoint sequences. The weak links are added with a very high covariance and subsequently deleted after place recognition is used to join the pose graphs [14]. In our approach, which extends [11] to full 6-DOF, we use anchor nodes as an alternative to weak links; the use of anchor nodes provides a more efficient and consistent way to stitch together the multiple pose graphs resulting from multiple mapping sessions. In addition, our system has been applied to hybrid indoor/outdoor scenes, with hand-carried (full 6-DOF) camera motion.
III. SYSTEM OVERVIEW
In this section we describe the architecture and components of a complete multi-session stereo visual SLAM system. This
includes a stereo visual SLAM frontend, a place recognition system for detecting single and multi-session loop closures, and a multi-session state-estimation system. A schematic of the system architecture is shown in Figure 1. The system uses
a sub-mapping approach in conjunction with a global multi-
session pose graph representation. Optimisation is performed by applying incremental and batch SAM to the pose graph and the constituent submaps, respectively. Each submap is built up
over consecutive sets of frames, where both the motion of the sensor and a feature based map of the scene is estimated. Once the current submap reaches a user defined maximum number
of poses, 15 in our system, the global pose graph is augmented
with the resultant poses. In parallel to the above, as each frame is processed, the visual SLAM frontend communicates with a global place
recognition system for intra- and inter-session loop closure detection. When a loop closure is detected, pose estimation is performed on the matched frames, with the resultant pose and frame-ids passed to the multi-session pose graph optimisation module.
IV. STEREO ODOMETRY
Within each submap the inter-frame motion and associated scene structure are estimated via a stereo odometry frontend. The most immediate benefit of the use of stereo vision is that it avoids issues associated with monocular systems, including the inability to estimate scale and the need for indirect depth estimation. The stereo odometry approach we use is similar to that presented in [19].
Our stereo odometry pipeline tracks features using a standard robust approach followed by a pose refinement step. For each pair of stereo frames we first track a set of Harris corners in the left frame using the KLT tracking algorithm. The resulting tracked feature positions are then used to compute the corresponding feature locations in the right frame. Approximate 6-DOF pose estimation is performed through the use of a RANSAC-based 3-point algorithm [6]. The input to the motion estimation algorithm consists of the set of tracked feature positions and disparities within the current frame and the current estimates of the 3D locations of the corresponding landmarks. In our work we have found that ensuring that approximately 50 features are tracked between frames results in a reliable pose estimate through the 3-point RANSAC procedure. Finally, accurate pose estimation is achieved by identifying the inliers from the estimated pose and using them in a Levenberg-Marquardt optimisation that minimises the reprojection error in both the left and right frames. In our implementation of the above stereo odometry pipeline we use a GPU based KLT tracker [25]. This minimises the load on the CPU (by delegating the feature detection and tracking to the GPU) and exploits the GPU's inherent parallel architecture to permit processing at high frame rates. In parallel to this we compute a disparity map for the frame, which is then combined with the results of the feature tracker, resulting in a set of stereo features. In order to maintain an adequate number of features we detect new features in every fifth frame, or when the number of feature tracks in the current frame drops below a certain threshold. A consequence of keeping the number of features in
a given frame high, whilst at the same time setting a minimum
inter-feature distance in the KLT tracker, is that it helps to ensure a good distribution of the resulting feature set over the image.
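A rough single-camera approximation of this pipeline, using standard OpenCV building blocks in place of the GPU KLT tracker, is sketched below; the thresholds and the use of solvePnPRansac/solvePnPRefineLM are assumptions for illustration, and unlike the full system the refinement here only considers the left image.

import cv2
import numpy as np

def stereo_odometry_step(prev_left, cur_left, prev_pts, prev_landmarks, K):
    """One illustrative visual-odometry step (CPU OpenCV stand-ins for the
    GPU KLT tracker used in the paper).
    prev_pts:       Nx2 feature positions in the previous left frame
    prev_landmarks: Nx3 triangulated 3D points for those features
    K:              3x3 intrinsics of the left camera
    """
    dist = np.zeros(5)                      # assume rectified, undistorted images
    # Track features into the current left frame with pyramidal KLT.
    cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_left, cur_left, prev_pts.astype(np.float32), None)
    ok = status.ravel() == 1
    pts2d = cur_pts.reshape(-1, 2)[ok].astype(np.float32)
    pts3d = np.asarray(prev_landmarks, np.float32)[ok]
    if len(pts2d) < 6:                      # too few tracks: caller should re-detect
        return None
    # Robust pose from 3D-2D correspondences (3-point algorithm inside RANSAC),
    # followed by an iterative (Levenberg-Marquardt style) refinement on the inliers.
    ok_pnp, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d, pts2d, K, dist, flags=cv2.SOLVEPNP_P3P,
        reprojectionError=2.0, iterationsCount=200)
    if not ok_pnp or inliers is None:
        return None
    idx = inliers.ravel()
    rvec, tvec = cv2.solvePnPRefineLM(pts3d[idx], pts2d[idx], K, dist, rvec, tvec)
    # NOTE: the system described in the paper also minimises the reprojection
    # error in the right frame; this single-camera refinement is a simplification.
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec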
V. SINGLE SESSION VISUAL SLAM
Deriving a pose graph representation from the stereo odometry system involves two levels of processing. The first of these optimises over the poses, features and structure within a local window. As each new frame is added, a full batch optimisation is performed. The second step transfers optimised poses to the pose graph after a fixed maximum number of frames is reached. The resulting pose graph structure contains
5 th European Conference on Mobile Robots, September 7-9, 2011, Örebro, Sweden
70
no point features and can be optimised efficiently even for a large number of poses. We apply smoothing in combination with a homogeneous point parameterisation to the local window to improve the pose estimates obtained from visual odometry. In contrast to visual odometry, smoothing takes longer range constraints into account, which arise from a single point being visible in multiple frames. The homogeneous point parameterisation p = (x, y, z, w) allows dealing with points at infinity [24]. Points close to or at infinity cannot be represented correctly by the conventional Euclidean formulation. Even for points that are not at infinity, convergence of the smoothing optimisation is typically improved. We use exponential maps based on Lie group theory to deal with overparameterised representations. In particular we use quaternions to represent orientations in 3D space. Quaternions consist of four parameters to represent three degrees of freedom, therefore causing problems for conventional least-squares algorithms. Using an exponential map, as described for example in [7], reduces the local updates during optimisation to three parameters. The homogeneous point parameterisation suffers from the same problem, and indeed the same solution can be applied as for quaternions after realising that both are equivalent to the 3-sphere S^3 in R^4 if normalised. With overparameterisations removed, the optimisation problem can now be solved with standard least-squares solvers. We use the iSAM library [9] to perform batch smoothing with Powell's Dog Leg algorithm. iSAM represents the optimisation as a factor graph, a bipartite graph containing variable nodes, factor nodes and links between those. Factor nodes, or factors for short, represent individual probability densities
\[
f_i(\Theta_i) = f_i(x_{j_i}, p_{k_i}) \propto \exp\!\left( -\tfrac{1}{2} \,\big\lVert \Pi(x_{j_i}, p_{k_i}) - z_i \big\rVert^2_{\Sigma_i} \right) \qquad (1)
\]
where Π(x, p) is the stereo projection of a 3D point p into a camera of given 3D pose x, yielding the predicted stereo projections (u_L, v) and (u_R, v), z_i = (û_L, û_R, v̂) is the actual stereo measurement, and Σ_i represents the Gaussian image measurement noise. iSAM then finds the least-squares estimate Θ* of all variables Θ (camera poses and scene structure combined) as
\[
\Theta^{*} = \operatorname*{arg\,max}_{\Theta} \; \prod_i f_i(\Theta_i) \qquad (2)
\]
When the smoothing window reaches a maximum size, all poses and associated odometry are transferred to the current session's pose graph, and a new local window is initialised. By including all poses from a window, as opposed to just the first or the first and last pose (as is the case in other approaches), we ensure that loop closures between arbitrary frames can be dealt with within the pose graph. Full details of the loop closure handling are provided in Section VII. To initialise a new window we use the last pose of the previous window in conjunction with all landmarks that correspond to features that are tracked into the current frame. The pose graph is again optimised using the iSAM library [9], but this time using the actual incremental iSAM algorithm [10] to efficiently deal with large pose graphs. In
contrast to the stereo projection factors f_i in the smoothing formulation above, we now use factors g_i
\[
g_i(\Theta_i) = g_i(x_{j_i}, x_{j'_i}) \propto \exp\!\left( -\tfrac{1}{2} \,\big\lVert (x_{j'_i} \ominus x_{j_i}) - c_i \big\rVert^2_{\Xi_i} \right) \qquad (3)
\]
that represent constraints c_i with covariances Ξ_i between pairs of poses as obtained by local smoothing or by loop closure detection. We use the notation x_d = x_a ⊖ x_b from Lu and Milios [16] for representing pose x_a in the local frame of pose x_b (x_a = x_b ⊕ x_d).
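As an illustration of the relative-pose factor in Eq. (3), the sketch below computes the residual ((x_j' ⊖ x_j) - c_i) for poses stored as translation/quaternion pairs, using SciPy's rotation utilities; it is not the iSAM implementation used in the paper, only a way to make the ⊖ operator and the residual concrete.

import numpy as np
from scipy.spatial.transform import Rotation as R

def ominus(xa, xb):
    """Express pose xa in the local frame of pose xb (x_d = x_a ⊖ x_b).
    Poses are (t, q) with t a 3-vector and q a unit quaternion (x, y, z, w)."""
    ta, qa = xa
    tb, qb = xb
    Rb_inv = R.from_quat(qb).inv()
    return Rb_inv.apply(np.asarray(ta) - np.asarray(tb)), (Rb_inv * R.from_quat(qa)).as_quat()

def relative_pose_residual(xj, xj_next, c):
    """6-DOF residual of the factor in Eq. (3): (x_j' ⊖ x_j) - c_i.
    `c` is the measured relative pose (t, q) from local smoothing or a loop
    closure; the rotational error is expressed on the manifold via a rotation
    vector (exponential-map) parameterisation."""
    t_d, q_d = ominus(xj_next, xj)
    t_c, q_c = c
    rot_err = (R.from_quat(q_c).inv() * R.from_quat(q_d)).as_rotvec()
    return np.concatenate([t_d - np.asarray(t_c), rot_err])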
VI. PLACE RECOGNITION
Place recognition is an important component in the context of large-scale, multi-robot and multi-session SLAM, where algorithms based on visual appearance are becoming more popular for detecting locations already visited, also known as loop closures. In this work we have implemented a place recognition module based on the recent work of [1, 2], which demonstrated robust and reliable performance. The place recognition module has the following two components:
• The first component is based on the bag-of-words (BoW) method [23], implemented in a hierarchical way [20]. This implementation enables quick comparisons of an image at time t with a database of images in order to find those that are similar according to the score s. There are then three possibilities: if s ≥ α⁺λ_t the match is considered highly reliable and accepted; if α⁻λ_t < s < α⁺λ_t the match is checked by CRF-Matching (conditional random field matching) in the next step; otherwise the match is ignored (a compact sketch of this decision rule is given after this list). In our implementation, λ_t is the BoW score computed between the current image and the previous one in the database. The minimum confidence expected for a loop closure candidate is α⁻ = 0.15 and for a loop closure to be accepted is α⁺ = 0.8. The images from one session are added to the database at one frame per second and only when the sensor is in motion, i.e., when the sensor's motion during the last second, according to the visual odometry module, is greater than 0.2 m or 0.2 rad.
• The second component consists of checking the previous candidates with CRF-Matching in 3D space. CRF-Matching is an algorithm based on Conditional Random Fields (CRFs), proposed by Lafferty et al. [15]; it was originally applied to matching 2D laser scans [21] and image features [22]. CRF-Matching is a probabilistic model that is able to jointly reason about the association of features. In [1] CRF-Matching was extended to reason in 3D space about the association of data provided by a stereo camera system. We compute the negative log-likelihood Λ_{t,t'} from the maximum a posteriori (MAP) association between the current scene at time t and the candidate scene at time t'. We accept the match only if Λ_{t,t'} ≤ Λ_{t,t-1}. This module exploits the efficiency of BoW to detect revisited places in real time. CRF-Matching is a more computationally demanding data association algorithm because it uses much more information than BoW. For this reason, only the positive results of BoW are considered for CRF-Matching.
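The decision rule from the first component can be written compactly as follows; the thresholds are those quoted in the text, while the crf_check callable stands in for the CRF-Matching verification stage.

ALPHA_MINUS = 0.15   # minimum confidence for a loop-closure candidate
ALPHA_PLUS = 0.80    # confidence above which the match is accepted outright

def loop_closure_decision(s, lambda_t, crf_check):
    """s is the BoW score against a database image, lambda_t the score against
    the previous image, and crf_check() a placeholder for CRF-Matching."""
    if s >= ALPHA_PLUS * lambda_t:
        return "accept"                              # highly reliable match
    if ALPHA_MINUS * lambda_t < s < ALPHA_PLUS * lambda_t:
        return "accept" if crf_check() else "reject" # borderline: verify geometrically
    return "reject"                                  # below the candidate threshold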
VII. MULTI-SESSION VISUAL SLAM
For multi-session mapping we use one pose graph for each robot/camera trajectory, with multiple pose graphs connected to one another with the help of “anchor nodes” as introduced in Kim et al. [11] and Ni and Dellaert [18]. In this work we distinguish between intra-session and inter-session loop closures. Processing of loop closures is performed firstly with each candidate frame being input to the above place recognition system. These candidate frames are matched against previously input frames from all sessions. On successful recognition of a loop closure the place recognition system returns the matched frame's session and frame identifier in conjunction with a set of stereo feature correspondences between the two frames. These feature sets consist of lists of SURF feature locations and stereo disparities. Note that since these features are already computed and stored during the place recognition processing, their use here does not place any additional computational load on the system. These feature sets serve as input to the same camera orientation estimation system described in Section IV. Here the disparities for one of the feature sets are used to perform 3D reconstruction of their preimage points. These 3D points are passed with their corresponding 2D features from the second image into a 3-point-algorithm-based RANSAC procedure. Finally the estimated orientation is iteratively refined through a non-linear optimisation procedure that minimises the reprojection error in conjunction with the disparity.
Inter-session loop closures introduce encounters between pose graphs corresponding to different visual SLAM sessions. An encounter between two sessions s and s' is a measurement that connects two robot poses x^s_j and x^{s'}_{j'}. This is in contrast to measurements between poses of a single trajectory, which are of one of two types: the most frequent type of measurement connects successive poses, and is derived from visual odometry and the subsequent local smoothing; a second type of measurement is provided by intra-session loop closures. The use of anchor nodes [11] allows multiple pose graphs that have previously been optimised independently to be combined at any time. The anchor node Δ_s for the pose graph of session s specifies the offset of the complete trajectory with respect to a global coordinate frame. That is, we keep the individual pose graphs in their own local frames. Poses are transformed to the global frame by pose composition Δ_s ⊕ x^s with the corresponding anchor node.
In this relative formulation, pose graph optimisation remains the same; only the formulation of encounter measurements involves the anchor nodes. The factor describing an encounter between two pose graphs also involves the anchor nodes associated with each pose graph. The anchor nodes are involved because the encounter is a global measure between the two trajectories, but the pose variables of each trajectory are specified in the session's own local coordinate frame. The anchor nodes are used to transform the respective poses of each pose graph into the global frame, where a comparison with the measurement becomes possible. The factor h describing
Fig. 2: Multi-session visual SLAM processing.

an encounter c_i is given by
\[
h(x^{s}_{j}, x^{s'}_{j'}, \Delta_{s}, \Delta_{s'}) \propto \exp\!\left( -\tfrac{1}{2} \,\big\lVert \big( (\Delta_{s} \oplus x^{s}_{j}) \ominus (\Delta_{s'} \oplus x^{s'}_{j'}) \big) - c \big\rVert^2_{\Gamma} \right) \qquad (4)
\]
where the index i was dropped for simplicity. The concept of relative pose graphs generalises well to a larger number of robot trajectories. The number of anchor nodes depends only on the number of robot trajectories.
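To make the role of the anchor nodes concrete, the following sketch lifts two session-local poses into the global frame via Δ_s ⊕ x^s_j and evaluates the residual of the encounter factor in Eq. (4); poses again use a translation/quaternion convention and SciPy rotations, and this is only an illustration of the formulation, not the system's optimiser.

import numpy as np
from scipy.spatial.transform import Rotation as R

def oplus(xb, xd):
    """Pose composition x_a = x_b ⊕ x_d, with poses as (t, q), q = (x, y, z, w)."""
    tb, qb = xb
    td, qd = xd
    Rb = R.from_quat(qb)
    return Rb.apply(td) + np.asarray(tb), (Rb * R.from_quat(qd)).as_quat()

def ominus(xa, xb):
    """Express pose xa in the local frame of pose xb (x_d = x_a ⊖ x_b)."""
    ta, qa = xa
    tb, qb = xb
    Rb_inv = R.from_quat(qb).inv()
    return Rb_inv.apply(np.asarray(ta) - np.asarray(tb)), (Rb_inv * R.from_quat(qa)).as_quat()

def encounter_residual(anchor_s, x_s, anchor_sp, x_sp, c):
    """Residual of the encounter factor in Eq. (4): both session-local poses are
    first lifted into the global frame via their anchor nodes, then compared to
    the measured inter-session constraint c (same (t, q) convention)."""
    g_s = oplus(anchor_s, x_s)        # Δ_s  ⊕ x^s_j
    g_sp = oplus(anchor_sp, x_sp)     # Δ_s' ⊕ x^s'_j'
    t_d, q_d = ominus(g_s, g_sp)      # relative pose between the two global poses
    t_c, q_c = c
    rot_err = (R.from_quat(q_c).inv() * R.from_quat(q_d)).as_rotvec()
    return np.concatenate([t_d - np.asarray(t_c), rot_err])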
VIII. EXPERIMENTS AND RESULTS
In this section we present results of the performance of our system for both single- and multi-session processing. The dataset that we use was collected at the Ray and Maria Stata Center at MIT over a period of months. This building is known for its irregular architecture and provides a good testing ground
for visual SLAM techniques in general. The dataset includes indoor and outdoor (and mixed) sequences captured from both a wheeled platform and using a handheld camera with full 6-DOF movement (e.g. ascending and descending stairs, etc.). All image sequences were
captured using a Point Grey Bumblebee colour stereo camera with a baseline of 11.9cm and where both lenses had a focal length of 3.8mm. The wheeled platform also included
a horizontally mounted 2D SICK laser scanner and a spinning
LiDAR. Although we do not use the LiDAR sensors in our system, the accompanying laser data allows us to compare the performance of our technique to that of a laser-based
scan matcher in restricted 3D scenarios (i.e. 2D + rotational movement). The complete multi-session visual SLAM system follows the architecture shown in Fig. 1, and is implemented as a set of loosely coupled processes that communicate via the Lightweight Communications and Marshalling (LCM) robot middleware system. This permits straightforward parallelism between the components of the system, hence minimising the impact on all modules due to fluctuations in the load of a particular module (e.g. due to place recognition deferring to CRF processing). Furthermore, the overall architecture can be transparently reconfigured for different setups (e.g. from single-CPU to multi-core or distributed processing).
5 th European Conference on Mobile Robots, September 7-9, 2011, Örebro, Sweden
72
Fig. 3: Single session visual SLAM processing including full 6-DOF motion.
A. Single-Session Visual SLAM Results
In this section we provide results from a number of single
session SLAM experiments. We have applied the system in single session mode (i.e. only running a single frontend) across a variety of sequences from the Stata Center dataset described above. The system is capable of operating over extended sequences in indoor, outdoor and mixed environments with full 6-DOF motion. Two example feature-based maps from outdoor sequences
are shown in Fig. 3. Here, for (a), the underlying grid is at
a scale of 10m, where the trajectory is approximately 100m
in length. An example image from the sequence is shown in the inset with the GPU KLT feature tracks overlaid on the left frame. Fig. 3 (b) shows a similar scale sequence that includes full 6-DOF motion, where the user has carried a handheld camera up a flight of stairs. In the absence of loop closing we have found the system to have a drift of approximately 1%-3% in position during level motion (i.e. without changes in pitch angle). To demonstrate this, Fig. 4 shows two maps with two trajectories, both taken from the same sequence. The yellow contour shows a 2D LiDAR-based map computed by applying a scan matching algorithm to the output of the horizontal LiDAR scanner attached to the cart. The scan matcher's estimated pose is shown by the dark blue trajectory, which can be seen more clearly in the lower right-hand inset. The distance between grid lines in the figure is 2m. From the figure, the horizontal displacement of the final poses is approximately 60cm over a total trajectory of approximately 20m. An example of the accumulated error in position due to drift is shown in Fig. 5. Here the dataset consists of an image sequence taken over an indoor area within the Stata Center. The grid is at a scale of 5m, with the sequence taken by travelling on a large loop over a space of approximately 35m × 15m. The image at the top shows the result of the motion estimate in the absence of a loop closure. The majority of the drift here is due to the tight turn at the right-hand end of the
sequence, where the divergence between each traversal of the hallway can be clearly seen. The center figure shows the result of the correction applied to the pose graph due to a sequence of loop closures occurring in the area highlighted by the red box. Here it can be seen that the pose graph sections showing the traversals of the hallway are much more coincident and that the misalignment in corresponding portions of the map is reduced considerably. The figure also shows the accuracy of the map relative to the ground truth CAD floorplan. Although the odometry system has been shown to be robust over maps of the order of hundreds of meters, two failure modes for the system are low-texture or low-contrast environments, and cases where the disparity estimation fails over a large set of features, e.g. due to aliasing. We do not address this situation in the current system; however, the standard approach of incorporating inertial sensors is a natural solution to this problem. An alternative approach that we are currently investigating is the possibility of using multi-session SLAM as a solution to this problem, whereby odometry failure results in the creation of a new session with a weak prior on the initial position. This disjoint session is treated the same as any other session. When a new encounter does occur, the session can be reconnected to the global pose graph. A future paper will present results of this approach.
B. Multi-Session Visual SLAM Results
To test the full multi-session visual SLAM system, we took two sequences from the same area as shown in Fig. 5 and processed each through a separate instance of the visual SLAM frontend. Results of each of the separate sessions are shown in Fig. 6 (a) and 6 (b), with the combined multi-session results shown in Fig. 6 (c). Again, loop closure occurred in the same area as shown in Fig. 5 (b). Finally Fig. 6 (d) shows a textured version of the same map. The scale of the grid is 2m for Figures (a) & (b), and 5m for Figures (c) & (d).
Fig. 4: Comparison of drift in single session visual SLAM against 2D LiDAR scan matcher over a 20m trajectory. Grid scale is 2m.
IX. CONCLUSIONS
In this paper we have presented a 6-DOF multi-session visual SLAM system. The principal contribution of the paper is to integrate all of the components required for a multi-session visual SLAM system using iSAM with the anchor node formulation [11]. In particular this is the first example of an anchor node based SLAM system that (i) uses vision as the primary sensor, (ii) operates in general 6-DOF motion, (iii) includes a place recognition module for identifying encounters in general environments, and (iv) derives 6-DOF pose constraints from those loop closures within these general environments (i.e. removing the need for the fiducial targets used in [11]). We have demonstrated this system in both indoor and outdoor environments, and have provided examples of single- and multi-session pose graph optimisation and map construction. We have also shown the effects of loop closures within single-session mapping in reducing drift and correcting map structure. Multi-session visual mapping provides a solution to the problem of large-scale persistent localisation and mapping. In the future we plan to extend the results published here to incorporate the entire Stata dataset described in Section VIII. Furthermore, we intend to evaluate the approach in online collaborative mapping scenarios over extended timescales.
ACKNOWLEDGMENTS
Research presented in this paper was funded by a Strategic Research Cluster grant (07/SRC/I1168) from Science Foundation Ireland under the Irish National Development Plan, and by the Dirección General de Investigación of Spain under projects DPI2009-13710 and DPI2009-07130. The authors gratefully acknowledge this support. The authors would like to thank Hordur Johannsson and Maurice Fallon for their assistance in the collection of the Stata datasets.
REFERENCES
[1] C. Cadena, D. Gálvez, F. Ramos, J.D. Tardós, and J. Neira. Robust place recognition with stereo cameras. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), Taipei, Taiwan, October 2010.
Fig. 5: Single-session dataset containing a large loop. Here the grid scale is 5m. (a) Map and pose graph prior to loop closure showing drift in position and structure. (b) Map and pose graph showing correction in the position and structure due to a series of loop closures in the area shown by the red square. Background image shows ground truth CAD floorplans of the environment. (c) Textured version of figure (b).
[2] C. Cadena, J. McDonald, J. Leonard, and J. Neira. Place recognition using near and far visual information. In Proceedings of the 18th IFAC World Congress, August 2011. To appear.
[3] M. Cummins. Probabilistic Localization and Mapping in Appearance Space. PhD thesis, University of Oxford, 2009.
[4] A.J. Davison. Real-time simultaneous localisation and mapping with a single camera. In Proceedings of the Ninth IEEE International Conference on Computer Vision, pages 1403–1410, 2003.
[5] E. Eade and T. Drummond. Unified loop closing and recovery for real time monocular SLAM. In British Machine Vision Conference, 2008.
[6] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, 1981.
[7] F.S. Grassia. Practical parameterization of rotations using the exponential map. J. Graph. Tools, 3:29–48, Mar 1998.
[8] H. Strasdat, J.M.M. Montiel, and A.J. Davison. Real-time monocular SLAM: Why filter? In IEEE Intl. Conf. on Robotics and Automation (ICRA), May 2010.
Fig. 6: Stata Center second floor dataset with two separate sessions captured over an 850 m² area. (a) Map and poses for session 1. (b) Map and poses for session 2. (c) Multi-session pose graphs after inter-session loop closure showing transformed maps. (d) Textured version of figure (c).
[9] M. Kaess, H. Johannsson, and J.J. Leonard. Incremental smoothing and mapping (iSAM) library. http://people.csail.mit.edu/kaess/isam, 2010–2011.
[10] M. Kaess, A. Ranganathan, and F. Dellaert. iSAM: Incremental smoothing and mapping. IEEE Trans. Robotics, 24(6):1365–1378, Dec 2008.
[11] B. Kim, M. Kaess, L. Fletcher, J.J. Leonard, A. Bachrach, N. Roy, and S. Teller. Multiple relative pose graphs for robust cooperative mapping. In IEEE Intl. Conf. on Robotics and Automation (ICRA), pages 3185–3192, Anchorage, Alaska, May 2010.
[12] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In Proceedings of the 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 1–10. IEEE Computer Society, 2007.
[13] K. Konolige, J. Bowman, J.D. Chen, P. Mihelich, M. Calonder, V. Lepetit, and P. Fua. View-based maps. Intl. J. of Robotics Research, 29(10), 2010.
[14] K. Konolige and J. Bowman. Towards lifelong visual maps. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), pages 1156–1163, 2009.
[15] J. Lafferty, A. McCallum, and F. Pereira. Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning, pages 282–289. Morgan Kaufmann, San Francisco, CA, 2001.
[16] F. Lu and E. Milios. Globally consistent range scan alignment for environmental mapping. Autonomous Robots, 4:333–349, April 1997.
[17] J. Neira, A.J. Davison, and J.J. Leonard. Guest editorial: special issue on visual SLAM. IEEE Trans. Robotics, 24(5):929–931, Oct 2008.
[18] K. Ni, D. Steedly, and F. Dellaert. Tectonic SAM: Exact, out-of-core, submap-based SLAM. In IEEE Intl. Conf. on Robotics and Automation (ICRA), pages 1678–1685, Apr 2007.
[19] D. Nister, O. Naroditsky, and J. Bergen. Visual odometry for ground vehicle applications. J. of Field Robotics, 23(1):3–20, 2006.
[20] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, volume 2, pages 2161–2168, 2006.
[21] F. Ramos, D. Fox, and H. Durrant-Whyte. CRF-Matching: Conditional Random Fields for feature-based scan matching. In Robotics: Science and Systems (RSS), 2007.
[22] F. Ramos, M.W. Kadous, and D. Fox. Learning to associate image features with CRF-Matching. In Intl. Sym. on Experimental Robotics (ISER), pages 505–514, 2008.
[23] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Intl. Conf. on Computer Vision (ICCV), volume 2, page 1470, Los Alamitos, CA, USA, 2003. IEEE Computer Society.
[24] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment – a modern synthesis. In W. Triggs, A. Zisserman, and R. Szeliski, editors, Vision Algorithms: Theory and Practice, LNCS, pages 298–375. Springer Verlag, Sep 1999.
[25] C. Zach, D. Gallup, and J.-M. Frahm. Fast gain-adaptive KLT tracking on the GPU. In Computer Vision and Pattern Recognition Workshops (CVPRW '08), pages 1–7, June 2008.
Lightweight SLAM and Navigation with a Multi-Camera Rig
Gerardo Carrera
Adrien Angeli
Andrew J. Davison
Department of Computing, Imperial College London, 180 Queen’s Gate, London SW7 2AZ, UK
Abstract — An interesting recent branch of SLAM algorithms using vision has taken an appealing approach which can be characterised as simple, robust and lightweight when compared to the more established and complex geometrical methods. These lightweight approaches typically comprise mechanical odometry or simple visual odometry for local motion estimation; appearance-based loop closure detection using either whole image statistics or invariant feature matching; and some type of efficient pose graph relaxation. However, such algorithms have so far been proven only as localisation systems, since they have not offered the semantic demarcation of free space and obstacles necessary
to guide fully autonomous navigation and exploration. In this
paper we investigate how to apply and augment such approaches with other lightweight techniques to permit fully autonomous
navigation and exploration around a large and complicated room environment. Our approach uses only odometry and visual
sensing, the latter being provided by a rig of multiple standard cameras without the need for precise inter-camera calibration.
A particular focus of our work is to investigate the camera
configurations which are most valuable to permit the capabilities needed by autonomous navigation, and our solution neatly assigns each camera a well-defined specialist role.
Index Terms — Computer Vision, Robotics, SLAM, Multi-Camera Rig.
I. INTRODUCTION
Computer vision is an increasingly appealing sensing modality for mobile robotics, and a number of successful SLAM systems involving vision as the primary outward-looking sensor have been recently presented. In particular, besides the more standard feature-based reconstruction approaches, there have been several systems which have solved the visual SLAM problem using much ‘simpler’ lightweight techniques which combine local trajectory estimation (via either wheel or simple visual odometry), visual place recognition, and pose graph optimisation (e.g. [13]). However, in either paradigm few systems have gone further than tackling localisation, and few real visual SLAM systems have been proven in autonomous navigation. In this paper we present a method based on a combination of lightweight vision techniques which permits robust, automatic and repeatable mapping and navigation around a large room. In this scenario, there are several demands placed on the robot's sensing systems:
• Local trajectory estimation.
• Place recognition to detect re-visits (loop closures) and permit global map optimisation.
• Detection and avoidance of obstacles in the robot’s close proximity.
• Global free-space mapping for path planning. Here we describe a novel combination of lightweight methods to provide all of these capabilities using odometry together
with multi-camera visual sensing. Our use of a rig of multiple standard cameras is based on the observation that while a wide field of view is often desirable for SLAM and relocalisation, the other requirements placed on a vision system when the goal is full autonomous navigation are not easily satisfied by a single omnidirectional camera. These cameras often have a limited and low-resolution view of the downward angles below the horizontal necessary for free space and obstacle detection, which, looking more to the future, could be a limitation for tasks like object recognition or human interaction. Pre-calibrated multi-camera rigs in a single unit which offer good resolution across a wide field of view are available, such as Point Grey's Ladybug, but they are expensive, as well as inflexible since they cannot be reconfigured. The advantage of multiple specialised sensors in high performance autonomous systems has been proven in robots such as the DARPA Grand Challenge vehicles or Willow Garage's PR2. In our work, we choose to use just cameras rather than other outward-looking sensor types, but retain the idea of multiple cameras mounted in a weakly coupled way which provide specific functions as part of a whole navigation solution, without requiring a tedious inter-camera calibration procedure. We demonstrate automatic localisation and mapping, and autonomous navigation in a goal-directed scenario where the robot is able to move repeatably between any pair of points indicated in the map. Further, we demonstrate full autonomous exploration; the robot is dropped into a large cluttered room with no prior knowledge and is able to explore autonomously to build a consistent map of the whole area suitable for autonomous navigation.
II. RELATED WORK
A. Lightweight Vision-Based SLAM
The lightweight approaches we study in this work have at their core a coarse metric-topological environment model, but they can still enable accurate and autonomous navigation. The single most important function provided by vision within such a system boils down to image retrieval, where the most recent image is compared against all previous locations in the environment model. The use of colour and texture information encoded in global histograms for image representation has proven successful for topological robot localisation [16, 18]. More recently, purely appearance-based topological place recognition approaches based on the “Bag of words” paradigm and inspired by information retrieval techniques have been proposed [2, 4]. These are however more computationally demanding, as they require expensive feature descriptor calculations and matching, and we consider
them unnecessary in the context of our robot's local navigation problem. Although a rig of multiple standard cameras is seemingly a good alternative to a single omnidirectional device, the use of such a rig for SLAM has been the subject of little research, presumably because of the difficulty of extrinsic calibration. In our previous work [3] we showed that autonomous external calibration of a camera rig with no overlapping views was indeed possible, and [10] demonstrated SLAM using an eight camera rig. While such geometric work on fused multi-camera SLAM will doubtless continue, in the present paper we aim to use each of the cameras for one or more specialised tasks without the need for global extrinsic calibration.
B. Integrated Visual SLAM and Autonomous Navigation
There have been only a few visual SLAM systems using standard cameras which enable fully autonomous navigation. Davison and Murray's early visual SLAM system based on fixating active stereo [6] was used for real-time position-based navigation, but the feature map generated was too sparse to permit reliable reasoning about free and occupied areas of space. With the advent of feature-based robot SLAM systems with much increased feature density, there have recently been some attempts at performing free-space mapping based on semi-dense point clouds. Notably, systems like [15] offer good possibilities for free-space detection (at least in highly textured areas with many feature points), and enable autonomous navigation and exploration.
C. 2D Free Space Mapping using Vision
On the assumption that our robot moves on a ground plane, 2D free space mapping is critical to allow obstacle avoidance and path planning. This problem has been studied not only for robot navigation but also in road detection for vehicles to aid autonomous driving. Some approaches attempt to define the drivable area either using offline machine learning techniques such as Support Vector Machines [17] or by a combination of geometry information and offline learning [1]. Such techniques cannot however be directly transposed to the indoor 2D free-space detection problem, because in this context there is no general geometric pattern for the boundary of the drivable area that can be retrieved in each image based on (potentially learned) a priori information. Successful methods for obstacle avoidance and free-space mapping for mobile robot navigation rely solely on the use of colour information [9], or infer a planar homography mapping image pixels to 2D floor coordinates under a ground plane assumption [19, 11]. The authors of [12] propose to make use of multiple cues to calculate horizontal lines defining the boundary between floor and walls, which is well suited to corridors, but presumably not adapted to more complicated structures.
III. METHOD
We divide our whole framework into five main sections which are all interconnected:
• Rig camera placement.
• Map representation and loop-closure detection.
• Global map relaxation.
• Free space mapping.
• Autonomous navigation, obstacle avoidance and path planning.
A. Rig Camera Placement
Our robot's vision capability is to be provided by a rig of up to four standard cameras, each with a lens offering approximately 80° horizontal field of view. There are many possible choices for the configuration of these cameras, since they must support the various tasks required by loop closure detection, free-space mapping and exploration, and we have examined the trade-offs of different set-ups. Notably, the final configuration chosen is adapted to the characteristics of the office environment used, where movements can be quite restricted and the distance to objects relatively short (e.g., when traversing narrow corridors between two desks). In principle, an extremely wide field of view, up to the maximum full cylindrical field of view offered by a single omnidirectional camera, is well suited to relocalisation: not only does this enable the capture of a good variety of appearance data to act as a fingerprint for a location, but it also permits recognition of previously-visited places independent of the orientation of the robot. However, when this wide field of view is provided not by a single omnidirectional camera but by a rig, we found that additional difficulties arose. We experimented extensively with histogram-based place recognition (see the next section) with an ad-hoc four-camera rig designed such that the cameras were mounted horizontally with maximum angular spacing (i.e. at the corners of a square). However, the performance in recognising locations with different robot orientations was disappointing. This was partly due to the fact that the four cameras used left gaps or 'blind spots' in the cylindrical field of view which would not necessarily align when the robot had made a rotation. Another significant point was that the actual distance between the camera centres on the robot was often significant compared to the distance to objects and furniture in the indoor scene, so unmodelled parallax effects came into play. We found that a good pragmatic solution to camera placement for loop closure detection in an indoor scene is to have one horizontal camera facing forwards, and one backwards (see Figure 1). This configuration is able to detect the vast majority of loop closure events the robot will encounter, because in restricted spaces, when a robot revisits a place it is very likely to do so either moving in the same or exactly the opposite direction to before. Admittedly, some possible loop closure events will be missed when the robot crosses a previous path at right angles. But in fact, such a crossing might well be difficult to detect reliably in any case, as it may correspond to a wide open area where several significantly distant positions at its centre have similar appearance, making recognition ambiguous and localisation inaccurate.
Fig. 1. Robot camera configuration consisting of a three camera rig where camera C is used for obstacle avoidance and free space mapping and the cameras A and B are used for loop closure detection.
Together with the front/back camera combination for loop closure detection, we added a third camera pointing down
at the ground in front of the robot for free space mapping.
As detailed later, we experimented with the downward angle
of this camera where there is a trade-off between immediate,
local obstacle detection capability and a more forward view suitable for planning. There is an interesting feedback in play
here between the image matching for relocalisation required
by SLAM and free space mapping which permits autonomous
robot guidance. A free space detection solution which is working well will enable the robot, when exploring or re- visiting, to navigate in a precise way by for instance moving repeatably half-way between gaps or along corridor passages. This makes loop closure detection easier, since the robot is likely to very precisely revisit locations.
B. Map Representation and Loop Closure Detection
Our approach does not assume any prior knowledge about the environment, except that it is traversable by a wheeled robot and that its visual appearance is descriptive enough.
As the robot autonomously explores or is being driven through an environment, it builds a topological or graph representation of the area. A topological map is a very practical and desirable model for navigation, since it imposes a discrete structure on a continuous space, and because it easily enables the use of different low level (graph relaxation) and high level (path planning) algorithms. In this undirected graph G = (V, E), a vertex V_t = (I_t^N, X_t) represents a physical location in the environment which stores all the images I_t^N from the N cameras of the rig at time t as well as the global position of the robot X_t = (x_t, y_t, θ_t) (2D position plus heading direction). Each edge E_t in the graph stores the relative 2D transformation between nodes X_t and X_{t+1}. A new vertex is initialised when the distance travelled from the previous position is greater than some threshold β (using the internal odometry of the robot), or when the dissimilarity between consecutive images is above a threshold α.
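To make these data structures concrete, the following minimal Python sketch (with hypothetical names and illustrative threshold values, not the authors' implementation) shows one possible encoding of a map vertex, an edge, and the vertex-creation rule:

```python
class Vertex:
    """Topological map node: the N rig images plus the global 2D robot pose X_t."""
    def __init__(self, images, x, y, theta):
        self.images = images       # list of N camera images captured at this location
        self.pose = (x, y, theta)  # 2D position plus heading direction

class Edge:
    """Relative 2D transformation between two consecutive vertices."""
    def __init__(self, src, dst, dx, dy, dtheta):
        self.src, self.dst = src, dst
        self.transform = (dx, dy, dtheta)

def should_create_vertex(odometry_distance, image_dissimilarity, beta=0.5, alpha=0.3):
    """Create a new vertex when the robot has travelled more than beta (from odometry)
    or the appearance dissimilarity to the previous image exceeds alpha.
    The default values of beta and alpha are placeholders for illustration."""
    return odometry_distance > beta or image_dissimilarity > alpha
```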
Our approach for image comparison relies on a global descriptor implemented in the form of a 1D grey-level histogram. Such minimalistic single-signature place characterisation enables reasonable discrimination, while only requiring frugal computational resources. The signature of a location is obtained by sticking the images of both forward and backward cameras next to each other into a single mosaic which serves as the support for the calculation of the descriptor. Different methods for comparison were tested, as well as multi-dimensional histograms including cues such as gradients, colour or intensity, obtaining the best results using EMD [14] over 1D histograms of grey intensities. Loop-closure detection is achieved by comparing the most recent location signature against all the previously visited locations. Thanks to the compactness and simplicity of place characterisation, such an exhaustive retrieval procedure can
be executed efficiently. Once a candidate for loop closure is found, time consistency is ensured to cope with perceptual aliasing. To this end, we perform a comparison of the appearance of 7 locations L_recent = {V_{i-3}, ..., V_i, ..., V_{i+3}} centred in time around the location of interest V_i, with 7 locations L_previous = {V_{j-3}, ..., V_j, ..., V_{j+3}} similarly sampled around the potentially loop-closing location V_j. This yields a 7 × 7 matrix C = {C_{ij}}, i, j = 1, ..., 7, where each entry corresponds to the distance (in appearance-space) between locations V_i ∈ L_recent and V_j ∈ L_previous. Note that the procedure has to be postponed until the locations V_{i+1}, V_{i+2} and V_{i+3} are visited and added to the map.
Asserting the time-consistency of a potential loop-closure requires an evaluation of the entries of C which takes care of the heading direction of the robot (see Figure 2). However, because of the configuration of the rig retained in our approach, we only need to distinguish between situations where the relative orientation from one passing to another is either 0 deg or 180 deg. Therefore, we review the scores of all the elements on both diagonals of C, and the loop-closure is accepted only if they are all below some threshold for one of the diagonals. This enforces the consistency of the appearance over neighbouring locations in time around both the location of interest and the potential loop-closing location, in such a way that the relative orientation of the robot between the two passings is properly taken into consideration.
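For illustration, the loop-closure pipeline of this section (signature computation, EMD comparison, and diagonal consistency test) could be sketched as follows in Python; the histogram bin count and the distance threshold are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

def location_signature(front_image, back_image, bins=64):
    """Global descriptor: normalised 1D grey-level histogram of the front/back mosaic."""
    mosaic = np.hstack([np.asarray(front_image), np.asarray(back_image)])
    hist, _ = np.histogram(mosaic, bins=bins, range=(0, 255))
    return hist / max(hist.sum(), 1)

def emd_1d(h1, h2):
    """Earth Mover's Distance between two normalised 1D histograms; for 1D
    distributions of equal mass it reduces to the L1 distance of the CDFs."""
    return np.abs(np.cumsum(h1) - np.cumsum(h2)).sum()

def consistent_loop_closure(C, threshold=0.15):
    """Time-consistency test on the 7x7 appearance-distance matrix C:
    accept only if every entry of one diagonal is below the threshold.
    The main diagonal corresponds to a revisit with the same heading (0 deg),
    the anti-diagonal to a revisit with opposite heading (180 deg)."""
    if np.all(np.diag(C) < threshold):
        return True, 0.0
    if np.all(np.diag(np.fliplr(C)) < threshold):
        return True, 180.0
    return False, None
```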
C. Graph Relaxation and Map Optimization
Due to the topological nature of our map representation we can optimize the poses by minimising the error between
robot positions every time a loop closure is found. Every time a new graph relaxation is applied, the map becomes more consistent with the real metric layout and therefore becomes usable for navigation and path planning. In our approach we used TORO [8], which provides a highly efficient gradient descent-based error minimisation for constrained graphs.
In order to enable accurate obstacle avoidance and path planning, a global free space map M of the environment is built incrementally as the robot navigates (see Section III-D for details about free space detection). To this end, we associate to every keyframe of the robot trajectory a relative local free space map M_t, every vertex of the graph G now being represented as V_t = (I_t^N, X_t, M_t). M_t is a simple 2D occupancy grid with a resolution of 1 cm² per cell and an origin anchored according to the position and orientation of X_t. To ensure the consistency between topological and global
Fig. 2. Loop closures using two cameras: forward and backward. a) represents a loop closure found with a difference in robot orientation of 180 deg with respect to the matching node. b) represents a loop closure found with the same robot orientation as the matching node. The top row in figures a) and b) indicates the current frame and the bottom row the matching node or keyframe. The single image at the left of figures a) and b) shows the matrix consistency check, with the bright red indicating strong matching along either the forward or backward diagonal.
free space maps, M is updated based on the optimised vertex positions whenever the topological graph is relaxed. It is true that with our featureless technique for loop-closure detection, an accurate position of the robot is not obtained right away. Yet, imposing this precision is important, as it improves the graph of poses after relaxation. Therefore, to obtain a good relative position estimate at loop-closure between the current X_t and past X_p poses, we approximate the displacement ∆P = (∆x, ∆y, ∆θ) between them by aligning the contours of their respective local free space maps M_t and M_p, which can simply be done by solving a cost function minimising the distance between points in those contours:
F_{C1,C2} = min_{i,j} ( dist(P^{C1}_i, P^{C2}_j) ),     (1)

where the function dist(·,·) evaluates the 2D Euclidean distance between all the points in P^{C1}_i and P^{C2}_j. When the robot revisits a location with an opposite direction to the one of the previous passing, it is difficult to match the free space maps, as they do not overlap very much. In such a situation though, we can still approximate the relative orientation between the two views by calculating the mean horizontal optical flow between the corresponding images (obtained by the loop closure detection), and this relative orientation is used in the loop closure constraint.
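A brute-force way to realise this alignment, given the two contours as arrays of 2D points, is sketched below in Python; the candidate pose grid and the use of a nearest-point cost are assumptions made for illustration rather than the exact optimisation used in the system.

```python
import numpy as np

def align_contours(contour_curr, contour_prev, search_poses):
    """Estimate the displacement (dx, dy, dtheta) between two local free-space maps by
    aligning their contours: for each candidate pose, transform the current contour and
    score it by the summed nearest-point distance to the previous contour (cf. Eq. 1).
    contour_curr, contour_prev: (M, 2) and (K, 2) arrays of 2D contour points.
    search_poses: iterable of candidate (dx, dy, dtheta) tuples."""
    best_pose, best_cost = None, np.inf
    for dx, dy, dtheta in search_poses:
        c, s = np.cos(dtheta), np.sin(dtheta)
        R = np.array([[c, -s], [s, c]])
        transformed = contour_curr @ R.T + np.array([dx, dy])
        # distance from every transformed point to its nearest point in the previous contour
        d = np.linalg.norm(transformed[:, None, :] - contour_prev[None, :, :], axis=2)
        cost = d.min(axis=1).sum()
        if cost < best_cost:
            best_cost, best_pose = cost, (dx, dy, dtheta)
    return best_pose, best_cost
```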
D. Free Space Mapping
We have developed an accurate algorithm which incorporates geometry and colour information. Our solution assumes that the robot is moving on a plane which is the same over the whole room, and that the floor is made of large regions with homogeneous colour. Under the floor planarity assumption, it is possible to map the image pixels of a downward looking camera to the 2D cells of a local free space map of the visible floor in front of the robot. This transformation H_f is a homography which is calibrated once each time the camera is fixed to the robot, by the simple procedure of matching four artificial landmarks on the floor with the corresponding pixel positions in the image.
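Such a four-point calibration can be carried out, for instance, with OpenCV as in the sketch below; the landmark coordinates are made-up values purely for illustration, and the chosen mapping direction (floor to image) is an assumption of this sketch.

```python
import numpy as np
import cv2  # OpenCV

# Hypothetical calibration data: pixel positions of four artificial landmarks in the
# downward-looking camera image, and their measured 2D floor coordinates (in metres).
image_points = np.array([[120, 400], [520, 395], [480, 250], [160, 255]], dtype=np.float32)
floor_points = np.array([[0.20, -0.30], [0.20, 0.30], [0.60, 0.25], [0.60, -0.25]], dtype=np.float32)

# H_f maps floor coordinates to image pixels; its inverse maps pixels back to the floor.
H_f, _ = cv2.findHomography(floor_points, image_points)
H_f_inv = np.linalg.inv(H_f)

def pixel_to_floor(u, v):
    """Map an image pixel (u, v) to 2D floor coordinates using the inverse homography."""
    p = H_f_inv @ np.array([u, v, 1.0])
    return p[:2] / p[2]
```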
The inverse mapping H_f^{-1} can be employed to retrieve from the image the RGB value X_i associated to any floor cell i visible in front of the robot, so as to determine if this cell is free of obstacles or not. This is done here by calculating the log-likelihood ratio of occupancy, as follows (statistical dependency on the model is omitted for simplicity):

ln [ P(C_i = F | X_i) / P(C_i = O | X_i) ] = ln [ P(X_i | C_i = F) / P(X_i | C_i = O) ] + ln [ P(C_i = F) / P(C_i = O) ]     (2)

where C_i is the class of cell i (F for "floor", O for "obstacle"). We assume constant, empirically fixed priors and a uniform "obstacle" likelihood model, while the "floor" likelihood model is a mixture of L gaussians similar to the one used by Thrun et al. [5] for road detection:

P(X_i | C_i = F) = Σ_{j=1}^{L} w_j N(X_i ; µ_j, Σ_j)     (3)
where µ_j, Σ_j, w_j are respectively the mean, covariance matrix and mixing coefficient parametrising gaussian j. When the occupancy ratios have been calculated for every cell, the model parameters are updated according to the procedure proposed in [5], using only those cells whose ratio is above some threshold. Initially, the gaussians are learned in the very first frame using only a small region at the bottom-centre of the image (that is, assuming that at startup the corresponding area on the floor is free of obstacles and is a good representative sample of the overall appearance of the floor). For better robustness and more efficient planning during navigation, the local occupancy grids corresponding to several consecutive robot poses are fused together (see Figure 3):
not only does this provide a medium-sized free space map around the robot which is more suited to navigation, but also makes it possible to filter out incorrect obstacle detections due
to inaccurate probability scores in the presence of noise in
the image or illumination changes in the scene. Similarly, as
already mentioned, a global free space map M of the explored
environment can be computed at any time by fusing all the individual local occupancy grids (see Figures 5 and 6).
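For illustration, the per-cell decision of Eq. (2) could be evaluated as in the following Python sketch; the uniform obstacle likelihood, the fixed priors and the small regularisation constant are assumptions of the sketch rather than the paper's exact values.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density for a single RGB sample x."""
    d = x - mean
    k = len(mean)
    return np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / np.sqrt(((2 * np.pi) ** k) * np.linalg.det(cov))

def floor_log_odds(rgb, means, covs, weights,
                   p_floor_prior=0.5, obstacle_likelihood=1.0 / 255 ** 3):
    """Log-likelihood ratio of occupancy (Eq. 2) for one floor cell:
    the floor likelihood is a mixture of Gaussians over RGB (Eq. 3),
    the obstacle likelihood is uniform, and the priors are fixed constants."""
    p_rgb_floor = sum(w * gaussian_pdf(rgb, m, c) for w, m, c in zip(weights, means, covs))
    return (np.log(p_rgb_floor + 1e-12) - np.log(obstacle_likelihood)
            + np.log(p_floor_prior) - np.log(1.0 - p_floor_prior))
    # the cell is classified as floor when this value is above a chosen threshold
```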
Fig. 3. The image on the left shows part of a map which has been incrementally built, the region inside the marked area is the current frame being mapped which corresponds to the image on the right.
E. Autonomous Navigation and Exploration
We use the Dynamic Window Approach (DWA) [7] to plan smooth motions of the robot around nearby obstacles. Having estimated the free space around the robot based on a local window of a fixed number of recent keyframes, the local occupancy map is binarised to obtain boundaries between free space, obstacles and unexplored areas. Every cell detected as an obstacle boundary in our map is used in DWA to plan the next velocity commands. DWA plans the best route from the current robot position to a target location while avoiding obstacles, and therefore requires the definition of robot goals. We have considered it outside the scope of this paper to investigate ideal exploration strategies, finding that a heuristic technique was sufficient in our application. A goal is randomly selected relative to the robot position within a square window of 3 m × 3 m. If the robot has spent much time around the same area, then a goal is selected further away. It is also important to obtain a balance between mapping unexplored areas and obtaining accurate maps. Every time the robot has mapped an area completely, it revisits previous places within the mapped area to find potential loop closures. By doing this we try to correct drifts in the odometry, and every cell is also corrected according to the new robot position.
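The goal-selection heuristic could be sketched as follows in Python; the larger fallback window and the lingering criterion are illustrative assumptions, not the exact parameters used in the experiments.

```python
import random

def select_exploration_goal(robot_pose, time_in_area, window=3.0, far_window=6.0, lingering_limit=60.0):
    """Heuristic goal selection: pick a random goal inside a square window centred on the
    robot; if the robot has lingered in the same area too long, sample from a larger window.
    window, far_window (metres) and lingering_limit (seconds) are placeholder values."""
    x, y, _ = robot_pose
    half = (far_window if time_in_area > lingering_limit else window) / 2.0
    return (x + random.uniform(-half, half), y + random.uniform(-half, half))
```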
IV. EXPERIMENTS AND RESULTS
We have carried out our experiments in an office of size 10 × 10 m using a Pioneer robot platform. As can be seen in Figure 4, this environment represents a challenging scenario with strong perceptual aliasing and multiple narrow spaces (88 cm), making both localisation and autonomous navigation difficult. The quality of our approach is demonstrated by performing manual and autonomous navigation with incremental and real-time free space mapping and loop closure detection. Our first experiments consider the effect of loop closure detection on map precision: it is well known that odometry-based robot position estimation will continuously drift over time, until a loop-closure is detected and the inconsistency is compensated for. To navigate safely, it is therefore extremely important to correct the map accordingly as often as possible. As can be observed in the top left image of Figure 5,
Fig. 4. Office environment with strong perceptual aliasing and challenging navigable spaces.
Ground truth comparison (cm)
Robot Pos   | Ground truth | Robot Pos    | Ground truth
(182, 0)    | (180, 0)     | (96, 445)    | (90, 450)
(212, 145)  | (210, 150)   | (-246, 487)  | (-240, 480)
(425, 143)  | (420, 140)   | (89, -325)   | (90, -330)
(512, -271) | (510, -270)  | (-280, -326) | (-270, -330)
TABLE I
EVALUATION OF LOCALISATION ACCURACY.
a constructed map using only odometry information with no loop-closure detection is highly inaccurate, and it would be almost impossible to navigate autonomously or to use the map for further high-level tasks. When loop-closures are detected with only one camera (top right image), it can be observed that both the robot trajectory and the map are significantly more accurate. When using two cameras, the number of loop closure detections increases, and so the map becomes even more accurate (see Figure 5, bottom). We have investigated the impact of the downward looking angle of the camera used for free space detection on the quality of the map (Figure 5, bottom images). We have found that if the camera is oriented such that it only covers a very restricted region (i.e., within 20 cm to 110 cm) in front of the robot, then the obtained map is very detailed, enabling very accurate obstacle localisation. However, such a steep downward looking angle penalises motion planning, as it only provides very local information, with obstacles being detected very late, only when the robot is very close to them (see the bottom right part of the figure). When the camera covers a larger region (i.e., within 60 cm to 450 cm) in front of the robot, obstacles further away can be detected, enabling more efficient motion planning over a wider time window and leading to smoother robot trajectories (bottom left of the figure). To measure the metric quality of our free space map, we have picked 8 random robot positions distributed over the whole room, and compared their coordinates in the map with ground truth obtained by manually measuring the coordinates of these points on the floor. This comparison, presented in Table I, demonstrates the accuracy of our solution, with a mean 2D error of 6.39 cm and a maximum error of 10.77 cm. With our system the robot was able to successfully complete several autonomous navigation runs of approximately 20 min. Figure 6 shows examples of global free space maps, providing
Fig. 5. Free Space Map built by driving the robot manually (700 keyframes). Top left: map without loop closures; top right: map with loop closures coming from the camera facing forward; bottom left: map with loop closures from both cameras; bottom right: map built using a camera facing the floor.
a comparison of the two downward looking angles already used earlier, again proving the superior accuracy of the proximal sensing configuration. The purpose of this experiment is to demonstrate the reliability of our complete solution comprising simultaneous localisation, mapping, free space detection and navigation: the exploration strategy here is deliberately simplistic, and used only as a proof of applicability of our solution (more advanced policies would certainly result in more efficient exploration).
Fig. 6. Free Space Map built autonomously using 2500 keyframes
V. CONCLUSIONS
We have shown that lightweight vision-based techniques, previously shown to be effective for surprisingly high performance localisation, can be augmented to be similarly effective for fully autonomous robot navigation. Our system puts together odometry-based trajectory estimation, front/back vision-based loop closure detection, free space identification from a local obstacle detection camera and a forward-looking camera, and local and global occupancy mapping. Our approach relies on an ad-hoc rig of multiple standard cameras, and we have investigated the role each camera should effectively play for the best performance, but in future work it may be worth taking the specialisation route further and creating a system with heterogeneous camera types with even more elaborate configurations and roles.
ACKNOWLEDGMENTS
This work was supported by a CONACYT scholarship from the Mexican government to G. Carrera and European Research Council Starting Grant 210346.
REFERENCES
[1] Y. Alon, A. Ferencz, and A. Shashua. Off-road path following using region classification and geometric projection constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 689–696. IEEE, 2006. [2] A. Angeli, D. Filliat, S. Doncieux, and J.-A. Meyer. Real-time visual loop-closure detection. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2008. [3] G. Carrera, A. Angeli, and A. J. Davison. SLAM-based automatic extrinsic calibration of a multi-camera rig. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2011. [4] M. Cummins and P. Newman. Highly scalable appearance-only SLAM — FAB-MAP 2.0. In Proceedings of Robotics: Science and Systems (RSS), 2009. [5] H. Dahlkamp, A. Kaehler, D. Stavens, S. Thrun, and G. Bradski. Self-supervised monocular road detection in desert terrain. In Proceedings of Robotics: Science and Systems (RSS), 2006. [6] A. J. Davison and D. W. Murray. Mobile robot localisation using active vision. In Proceedings of the European Conference on Computer Vision (ECCV), 1998. [7] D. Fox, W. Burgard, and S. Thrun. The dynamic window approach to collision avoidance. IEEE Robotics and Automation Magazine, 4(1):23–33, 1997. [8] G. Grisetti, C. Stachniss, S. Grzonka, and W. Burgard. A tree parameterization for efficiently computing maximum likelihood maps using gradient descent. In Proceedings of Robotics: Science and Systems (RSS), 2007. [9] J. Hoffmann, M. Jungel, and M. Lotzsch. A vision based system for goal-directed obstacle avoidance. In RoboCup 2004 Symposium, Instituto Superior Técnico, Lisboa, Portugal, 2004. [10] M. Kaess and F. Dellaert. Probabilistic structure matching for visual SLAM with a multi-camera rig. Computer Vision and Image Understanding (CVIU), 2009. [11] Y. Kim and H. Kim. Layered ground floor detection for vision-based mobile robot navigation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), volume 1, pages 13–18. IEEE, 2004. [12] Y. Li and S. T. Birchfield. Image-based segmentation of indoor corridor floors for a mobile robot. In Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems (IROS), 2010. [13] M. Milford and G. Wyeth. Persistent navigation and mapping using a biologically inspired SLAM system. International Journal of Robotics Research (IJRR), 29(9):1131–1153, 2009. [14] Y. Rubner, C. Tomasi, and L. J. Guibas. A metric for distributions with applications to image databases. In Proceedings of the 1998 IEEE International Conference on Computer Vision, pages 59–66, 1998. [15] R. Sim and J. J. Little. Autonomous vision-based exploration and mapping using hybrid maps and Rao-Blackwellised particle filters. In Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems (IROS), 2006. [16] I. Ulrich and I. Nourbakhsh. Appearance-based place recognition for topological localization. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2000. [17] Q. Wu, W. Zhang, T. Chen, V. Kumar, et al. Camera-based clear path detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1874–1877. IEEE, 2010. [18] C. Zhou, Y. Wei, and T. Tan. Mobile robot self-localization based on global visual appearance features. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), volume 1, pages 1271–1276, 2003. [19] J. Zhou and B. Li. Robust ground plane detection with normalized homography in monocular sequences from a robot platform. In Proceedings of the IEEE International Conference on Image Processing (ICIP), pages 3017–3020. IEEE, 2007.
A Decision Model of Adaptive Interaction Selection for a Robot Companion
Abir B. Karami
Abdel-Illah Mouaddib
GREYC-CNRS Laboratory, UMR 6072, 14032 Caen CEDEX, France. { akarami, abdel-illah.mouaddib} @info.unicaen.fr
Abstract — This paper is motivated by two ideas. First, establishing a system for a robot companion that is capable of switching between different types of interaction according to the human's needs. Second, representing the system with a model that outperforms Partially Observable Markov Decision Processes (POMDPs) for large-scale problems. To this end, we propose a unified model of Human-Robot Multi-Type Interaction (HRMTI). The objective is to observe the human's behavior, try to predict/estimate the human's intention/need, and therefore react appropriately (assist, cooperate, collaborate, ...). POMDPs are largely used by the community of collaborative and assistive Human-Robot Interaction (HRI). However, solving an HRMTI problem for large-scale applications by employing POMDPs is intractable. We present an approach to overcome this limitation by dividing the HRMTI problem into three levels: first, estimate the human intention; second, select the appropriate type of interaction; last, choose one of the pre-calculated policies with respect to the human intention and the needed type of interaction. We present the unified model in detail; it includes: a belief update model concerning the intention of the human using multiple MDP models defined by empathy for each specific human task; an MDP model that supports the robot's decision making towards the human needs by selecting the appropriate interaction type; finally, an algorithm that first defines the task to accomplish by the robot, then chooses the pre-calculated policy
to achieve it with respect to the type of interaction. We present
some interesting performance and scalability analysis and, with
a representative scenario inspired by robocup-at-work, we
present an implementation on a real robot showing the feasibility
of the approach.
Index Terms — Human-Robot Adaptive Interaction, Robot Companions, Hidden Markov Models, Markovian Decision Models.
I. INTRODUCTION
Nowadays, a vast research interest is directed towards robot companions that share environments with people, such as elders in their homes or in assisted living facilities. A robot companion can help a human in different ways, depending on the human's possible incapabilities or preferences. More information about the human's desires and constraints makes it more likely that the robot companion offers the right help when needed. For example, if the human has the intention of doing a certain task, like calling his friend, he might be able to achieve the task himself; however, he might need assistance in finding or remembering the phone number. In the first case, the robot will be able to help the human by achieving some other task (like cleaning the room), and in the second case the robot should offer assistance to the human to find the phone number.
We consider a Human-Robot Multi-Type Interaction (HRMTI) model where the robot companion observes a human and acts appropriately to provide the best help. Indeed, to best achieve this mission, the companion must be able to: estimate the human's intention (what task is he trying to achieve?), then infer the kind of help (interaction) that he might need, then reason about the best action to help the human. In addition to accurately and quickly inferring the human's intentions, the companion should adapt to the human's possible change in desire over time.
A possible idea to solve such problems would be using
POMDPs while representing the human's intention as a hidden part of the POMDP state. This solution was used in many studies in the HRI domain [2] [5] [10]. POMDPs proved to be a natural solution for handling uncertainty about the environment and humans and for adapting to any possible change. However, as is well known, solving POMDPs is PSPACE-Complete, and finding a tractable approximate solution for real-sized applications is highly complicated.
We are interested in a framework for a robot companion that is able to interact with a human using different types of interaction (assisting, collaborating, cooperating, ...) depending on the recognized intention and preference of the human. Instead of using POMDPs, we propose a better-performing model that solves the problem on three levels without losing the adaptability and the capability of handling uncertainty of POMDPs. First, the system observes the human action and estimates his intention using multiple simulated Human-MDPs and a Hidden Markov Model (HMM) where the hidden state corresponds to the human intention. Second, an MDP model helps in inferring the kind of help the human needs using the belief state updated from the HMM. Third, knowing the kind of interaction needed to help the human and his intention, the robot applies an action from previously computed policies (a policy for each task achieved under a certain type of interaction).
II. DEFINITIONS
In this section, we will present some definitions that are
related to the HRMTI model. We will present a few types of interaction and try to define the difference between them. We also present our definition of a task and how it can be related to one or more types of interaction. Finally, knowing all that, we define what a human intention might be. We define I as a set of interaction classes I = {IC_1, IC_2, IC_3, ...}. Each class of interaction differs in the
way the robot companion helps the human. We distinguish in the HRI literature three types of interaction that we will concentrate on in this paper: Cooperation, Assistance and Collaboration, I = {CP, AS, CL}. Cooperation CP: a robot can cooperate with a human by doing a task on his behalf to save him time or effort. The robot should be capable of doing the task alone (e.g., cleaning the carpet or the dinner table), and should not disturb the human, respecting the human's preferences and interests [7]. Assistance AS: a robot companion can assist a human by guidance (typically spoken guidance). For example, by accessing a manual of how to run the dishwasher, the
robot can assist the human by reading the steps. This type of interaction also covers cases like robot assistants for people with dementia [2]. Collaboration CL: a task that needs the participation of both the human and the robot to achieve it (typically physical action) requires a collaboration between them [6] [10]. A robot companion should not keep the human waiting when he shows interest in a collaborative task. The robot holds a list of all possible tasks that can be done by the human and/or the robot. A task tk ∈ TK can be defined as tk = ⟨context, agent, policy, time⟩, where context holds information about the task regarding the context of the problem; it might include relations and dependencies between tasks, conditions to achieve, and related databases to access or update information; agent ∈ {R, H, H ∨ R, H ∧ R} is the possible agent that can achieve the task: it can be achieved uniquely by the robot, uniquely by the human, by any of them or by both of them; policy ∈ {true, false} indicates whether the robot holds a policy or a manual of how to assist the human if he needs assistance to do this task; and time is the number of time steps normally needed to achieve the task. A task can be achieved by one or more types of interaction following the rules mentioned in Table I:
tk ∈ CP | tk = ⟨∗, {R, R ∨ H}, ∗, ∗⟩
tk ∈ AS | tk = ⟨∗, {H, R ∨ H}, true, ∗⟩
tk ∈ CL | tk = ⟨∗, {R ∧ H}, ∗, ∗⟩
TABLE I
RELEVANCE BETWEEN TASKS AND DIFFERENT INTERACTION CLASSES. (∗) REPRESENTS ANY POSSIBLE VALUE.
We define TK = TK_h ∪ TK_r, where for TK_h the domain of the variable agent is {H ∨ R, H, H ∧ R}, and for TK_r the variable agent ∈ {H ∨ R, R, H ∧ R}. The human's intention can be one of the human's tasks or nothing at all: intention ∈ TK_h ∪ {do_nothing}.
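As an illustration of these definitions, the following Python sketch encodes a task tuple and the relevance rules of Table I; the string encodings of the agent values ('R', 'H', 'H|R', 'H&R') are an assumption of this sketch.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """A task tk = <context, agent, policy, time>; agent is one of 'R', 'H', 'H|R', 'H&R'."""
    context: dict
    agent: str
    policy: bool      # True if the robot holds a policy/manual to assist with this task
    time: int         # number of time steps normally needed to achieve the task

def interaction_classes(tk):
    """Interaction classes a task is relevant to, following the rules of Table I."""
    classes = set()
    if tk.agent in ('R', 'H|R'):
        classes.add('CP')                    # robot can do it on the human's behalf
    if tk.agent in ('H', 'H|R') and tk.policy:
        classes.add('AS')                    # robot can guide the human through it
    if tk.agent == 'H&R':
        classes.add('CL')                    # needs both human and robot
    return classes
```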
III. THE MULTI INTERACTION MODEL
We will present in this section the multi-type interaction HRMTI decision control model that allows a robot to observe
a human and detect his possible intentions (intended tasks) and infer any possible need of assistance or collaboration. We concentrate in this paper on a robot companion system that puts the human's needs (assistance, collaboration) first, and otherwise helps the human by doing cooperative tasks. Figure 1 presents the three levels of the system. In the first level, at each time step, the Human Intention Estimator
Fig. 1. The HRMTI Decision Control Model.
observes the human action z_ie and updates the estimation over possible human intentions HIE(b'_ie | b_ie, z_ie). It uses a library of human action values (Q-values) calculated offline. The second level receives those estimations and uses them to efficiently switch the type of interaction in a way that corresponds to the human's desires. The third level uses all information from the first and second levels to choose a task for the robot, respecting the belief over the human intentions and the chosen class of interaction. By selecting the policy of the robot task from the policy database, the robot applies the appropriate action from this policy. Afterwards, the system passes the control back to the Human Intention Estimator and, along with the new observed human action, a new time step begins. We note that levels 2 and 3 adapt their decisions following the possible changes in the human intention estimation. The high level algorithm of the system is as follows:
1) the system observes the human action z_ie; 2) the belief update function updates the belief over the human intention b_ie; 3) the Interaction Class Selector creates the state s_ic from b_ie; 4) the Interaction Class Selector calls the policy π_ic to choose an interaction class; 5) the Task Selector algorithm chooses a task tk ∈ TK_r to be achieved by the robot; 6) the appropriate policy from the database is called, and a robot action is sent to the system; 7) the robot applies its action and a new time step starts. In the remainder of this section, we explain how we build the Library of Q-values and detail the three levels of the system.
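A compact Python sketch of this control loop is given below; every interface in it (hie, ics, task_selector, policy_db) is a hypothetical placeholder introduced only to make the data flow between the three levels explicit.

```python
def control_step(hie, ics, task_selector, policy_db, belief, flag, z_ie):
    """One time step of the HRMTI loop (steps 1-7 above); all component interfaces
    are hypothetical placeholders used only to illustrate the data flow."""
    belief = hie.update(belief, z_ie)                 # level 1: belief over human intentions
    s_ic = ics.build_state(belief, flag)              # level 2: ICS state from belief and Flag
    interaction = ics.policy(s_ic)                    # level 2: chosen interaction class
    task, flag = task_selector.choose(s_ic, belief)   # level 3: robot task selection
    action = policy_db[(task, interaction)].action()  # pre-calculated policy for the pair
    return action, belief, flag
```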
A. The Human-MDPs
To be able to link an observed human action to a possible human intention, we propose a solution inspired by the simulation theory of empathy [11], which suggests that, with a similar brain structure to simulate the underlying mental states of a person, it is possible to infer his beliefs and goals and to anticipate and understand his behavior. Our solution consists of building a number of Human-MDPs simulated by empathy:
MDP_h^{tk} = ⟨S_h^{tk}, A_h^{tk}, T_h^{tk}, R_h^{tk}⟩, ∀tk ∈ TK_h. Each of the Human-MDPs simulates a rational human acting towards a task tk. After solving the Human-MDPs (using Value Iteration [1] or another approximate MDP solver), a library of Q-values is filled using the MDP_h^{tk} model and the corresponding calculated value function V_h^{tk}:

Q_h^{tk}(s_h, a_h) = R_h^{tk}(s_h, a_h) + γ Σ_{s'_h ∈ S_h} T_h^{tk}(s_h, a_h, s'_h) V_h^{tk}(s'_h)
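For illustration, the library could be filled as in the following Python sketch, assuming tabular transition and reward arrays; the array shapes, discount factor and iteration count are assumptions of the sketch.

```python
import numpy as np

def solve_human_mdp(S, A, T, R, gamma=0.95, iters=200):
    """Value iteration for one simulated Human-MDP, then extraction of the Q-values
    Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') V(s') stored in the library.
    T has shape (|S|, |A|, |S|) and R has shape (|S|, |A|); S and A are state/action lists."""
    V = np.zeros(len(S))
    for _ in range(iters):
        Q = R + gamma * np.einsum('sap,p->sa', T, V)
        V = Q.max(axis=1)
    return R + gamma * np.einsum('sap,p->sa', T, V)   # final Q-value table

# Hypothetical usage: one Q-table per human task tk in TK_h
# q_library = {tk: solve_human_mdp(S, A, T[tk], R[tk]) for tk in TK_h}
```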
The simulated Human-MDP policies are calculated offline by the system but are not necessarily followed by the human. Figure 2 shows an example library of Q-values. The Q-values hold a value for every human action in any possible state, and therefore help in evaluating any rational human action towards any of his intentions. It is possible to personalize the system towards a certain human behavior by updating the values online while observing the real human actions [5]. We notice from the figure that observing the human doing a_h from state s_h, for example, indicates a higher estimation for intention tk_1 than for tk_2, where Q^{tk_1}(s_h, a_h) = 0.5 and Q^{tk_2}(s_h, a_h) = 0.2. We will explain later how we integrate the Q-values in (Eq. 1) in order to update the belief over the possible human intentions in (Eq. 2).
Fig. 2. The Q-values concluded from the Human-MDPs.
B. Level 1: The Human Intention Estimator (HIE)

The HIE is the part of the system that observes the human actions and tries to translate those actions into intentions using the Library of Q-values. A belief update function is used to calculate the belief over the human intentions after observing his actions. This function is defined using an HMM where the human intention is represented as the hidden part of the model. The HIE is a tuple ⟨S_ie, Z_ie, T_ie, O_ie, b_ie^0⟩, where S_ie = s_ie^{context_h} × s_ie^{inten_h}; the human intention (s_ie^{inten_h} ∈ TK_h) is non-observable and the human context (s_ie^{context_h}) includes the human's information regarding the problem context. The possible observations are the possible perceived actions of the human that are related to the set of tasks TK: Z_ie ∈ A_h. We introduce into the transition function T_ie the probabilities that the human maintains or changes his intention (p, 1 − p). Those probabilities can be learned according to the human personality:

pr(s'_ie | s_ie) = pr(s'_ie^{context_h} | s_ie) × p,                              if s'_ie^{inten_h} = s_ie^{inten_h}
pr(s'_ie | s_ie) = pr(s'_ie^{context_h} | s_ie) × (1 − p) / |s_ie^{inten_h}|,     if s'_ie^{inten_h} ≠ s_ie^{inten_h}

The observation function O_ie is calculated using the Library of Q-values:

O_ie = pr(z_ie | s'_ie, s_ie) = λ Q^{s_ie^{inten_h}}(s_ie^{context_h})(z_ie),     (1)

where λ is a normalizing factor. If the human is rational in his actions, then the probability of observing the human doing action z_ie while his intention is s^{inten_h} can be estimated from the library of Q-values, where we have a value for each possible human action towards each possible human intention. The Human Intention Estimator receives at each time step the perceived human action z_ie and uses the belief update function (Eq. 2) to create the new belief state b'_ie and passes it to the Interaction Class Selector. Any possible change in the human intention is usually reflected in his actions and therefore reflected in the new belief over the human intentions:

b'_ie(s'_ie) = γ Σ_{s_ie ∈ S_ie} pr(z_ie | s'_ie, s_ie) pr(s'_ie | s_ie) b_ie(s_ie)     (2)
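A minimal Python sketch of this belief update is shown below; it assumes tabular transition and observation arrays, and the array layout is an assumption of the sketch rather than the paper's implementation.

```python
import numpy as np

def belief_update(b, z, transition, observation):
    """Belief update of Eq. (2) over the hidden human-intention states.
    b:            current belief, shape (|S|,)
    z:            index of the observed human action
    transition:   T[s, s'] = pr(s' | s)
    observation:  O[s', s, z] = pr(z | s', s)
    Returns the normalised new belief b'."""
    # b'(s') = gamma * sum_s pr(z | s', s) * pr(s' | s) * b(s); gamma normalises the result
    new_b = np.einsum('ps,sp,s->p', observation[:, :, z], transition, b)
    return new_b / new_b.sum()
```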
C. Level 2: The Interaction Class Selector (ICS)
The ICS is responsible for choosing the appropriate type of interaction. It starts by analyzing the received belief state b_ie and concludes the intention of the human. Then it uses a pre-calculated policy of the MDP_ic model to choose one of the different interaction classes: Cooperation CP; Assistance AS; Collaboration CL. We will explain two points before detailing the ICS model, and then we will explain how this model allows the robot to detect a human's need of assistance.
Confirming with the human: In order to avoid unneeded gestures of assistance or collaboration deduced from a false intention estimation, we propose to add a Confirmation policy which allows the robot to ask the human for a confirmation when the estimated human intention is of type CL. On the other hand, the fact that the human intention is of type AS does not necessarily prove that the human needs assistance to achieve his task (he is possibly capable of achieving it by himself). We will explain later how the human's need of assistance is detected. The confirmation is simple, for example: (Do you need my help in doing task 'x'?) and the human answer can be of type (yes, no). There is no need of any kind of confirmation if the human intention is of type CP. In this case, the robot will choose another task of type CP.
to communicate between the different levels of the system. At each time step the Flag is assigned a value by one of the system models. The value of the Flag is used in the ICS to decide the needed class of interaction. An example of possible Flag values:
- P ossibleNeed : a need of assistance is detected.
- DoingX : the robot is processing a task of type X where: X ∈ I.
- Conf irm : the robot is confirming with the human his need of help.
- Y es : the human answer is (yes).
- No : the human answer is (no).
1) The ICS model: The ICS is a Markov Decision Process MDP_ic = ⟨S_ic, A_ic, T_ic, R_ic⟩:
• S_ic = s_ic^{context_h} × s_ic^{inten_h} × s_ic^{Flag}, where (s_ic^{context_h}) represents human context related information,
(s_ic^{inten_h} ∈ TK_h) is the human intended task, and (s_ic^{Flag} ∈ Flag) holds the Flag value.
• A_ic = {doCP, doAS, doCL, doConfirm}: the decision is one of the possible interaction classes or to confirm the need of any of them.
• T_ic = pr(s'_ic | s_ic, a_ic): we calculate the ICS transition function in two parts:

T_ic = pr(s'_ic^{Flag} | s_ic, a_ic) × pr({s'_ic^{context_h}, s'_ic^{inten_h}} | s_ic).

The first part is responsible for the probability of changing the Flag variable knowing the action a_ic ∈ A_ic. The second part is responsible for the transition of the human variables s_ic^{context_h} and intention s_ic^{inten_h}, where we also introduce the probabilities of the human maintaining or changing his intention (p, 1 − p). The changes in other human information (s_ic^{context_h}) depend on the specific context of the problem:

pr({s'_ic^{context_h}, s'_ic^{inten_h}} | s_ic) = pr(s'_ic^{context_h} | s_ic) × p,                              if s'_ic^{inten_h} = s_ic^{inten_h}
pr({s'_ic^{context_h}, s'_ic^{inten_h}} | s_ic) = pr(s'_ic^{context_h} | s_ic) × (1 − p) / |s_ic^{inten_h}|,     if s'_ic^{inten_h} ≠ s_ic^{inten_h}
• R_ic(s_ic, a_ic): a confirmation of collaboration is rewarded if the estimated human intention is of type CL, while a confirmation for assistance is only rewarded if Flag = PossibleNeed. DoCL and DoAS are rewarded when the estimated human intention is of type CL or AS respectively and Flag = Yes. DoCP is rewarded otherwise. The ICS policy π_ic is calculated offline. Once the system is online, the policy is called to choose the best interaction class. As the state of the human is not totally observable, the state of the ICS is created online as described in Algorithm 1.
Algorithm 1 The creation of the ICS state
1: INPUT: Flag, b_ie(s): the passed Flag and belief state.
2: st* = argmax_{s ∈ S_ie} b_ie(s)
3: st2* = argmax_{s ∈ S_ie − {st*}} b_ie(s)
4: if (b_ie(st*) − b_ie(st2*) > k) then                        ▷ distinct intention
5:     s_ic = ⟨st*^{context_h}, st*^{inten_h}, Flag⟩
6: else                                                        ▷ ambiguity in b_ie: check for need of assistance
7:     for (tk ∈ AS) do
8:         if (st*^{context_h} relates with tk^{context}) then
9:             s_ic = ⟨st*^{context_h}, tk, PossibleNeed⟩
10:        end if
11:    end for
12: end if
13: if s_ic still not defined then                             ▷ no clear intention
14:     s_ic = ⟨st*^{context_h}, st*^{inten_h}, Flag⟩
15: end if
Using the belief state b_ie(s) that the ICS receives from the HIE at each time step, the algorithm chooses the dominant state st* in this belief state and uses its variables to build the ICS state s_ic. We define:

st* = argmax_{s ∈ S} b_ie(s),     st2* = argmax_{s ∈ S − {st*}} b_ie(s)
Line (4) of Algorithm 1 tests whether the dominant state st* is dominant enough to recognize a distinct intention. The value of k has an inverse relationship with the number of tasks tk ∈ TK_h. The condition in Line (4) covers cases where (s_ic^{inten_h} ∈ {CP ∪ CL}). However, if (s_ic^{inten_h} ∈ AS) it does not necessarily mean that the human needs assistance.
2) Detecting a need of assistance: When human actions are indistinct (waiting, for example), this will lead to a certain ambiguity in the belief state over his intention. This ambiguity might be a sign of a lack of interest in all intentions, or of a need of assistance. Line (7) tests whether the human possibly needs help in doing one of the (AS) tasks. This test checks for a relation between the human context variable st*^{context_h} and the context variables of one of the assistance tasks (AS). This relation might be a similar location or any variable that indicates a human interest in this task. In case of a possible need of assistance, a state is created with the related assistance task and a Flag of value (PossibleNeed). In Line (13) we define the case where there is a lot of ambiguity in the belief state (b_ie) and no need of assistance. The created state includes the intention of the dominant state st* even if it is not dominant enough for a clear type of interaction. From here we pass the state (s_ic) to the policy (π_ic) that will give the appropriate ICS action.
D. Level 3: Choosing a Task and Achieving a Type of Interaction

Algorithm 2 describes the selection of the task that the robot should achieve. We notice that the task is easily chosen if the human intention is of type AS or CL and he confirmed his need of help. Otherwise, in Line (11) the algorithm looks for tasks that only the robot can do. In Lines (17-23), if none of the prior cases is true, the method looks for a task that is doable by the robot and is least intended by the human. The algorithm sets the Flag to pass information about the type of interaction the robot is intending to perform. As the system passes through level 2 at each time step to choose the Interaction Class, any change of the human intention will be processed directly. This allows the robot to adapt quickly to the human's possible need of assistance or collaboration. If there are dependency conditions for achieving a task (defined in tk^{context}), tests over such conditions should be integrated in Algorithm 2. We assume a database of pre-calculated policies, one for each of the tasks in the mission respecting a certain type of interaction. Once the task is chosen, the appropriate pre-calculated policy is called.
IV. EXPERIMENTS
The results we show in this section concern problems with three classes of interaction. We present some performance results, including offline calculation times registered for different sizes of the problem. For scalability analysis we show in Table II results for 6 experiments. We considered the task contexts to be their positions in the human-robot shared area. The Human-MDPs are calculated to plan the human path to each possible task. Task types are divided equally and randomly
Algorithm 2 Task Selection
1: INPUT: s_ic^{inten_h}, b_ie(s).
2: OUTPUT: tk ∈ TK, Flag.
3: if (s_ic^{inten_h} ∈ AS ∧ Flag ∈ {yes, PossibleNeed}) then
4:     tk = s_ic^{inten_h}
5:     Flag = DoingAS
6: else if (s_ic^{inten_h} ∈ CL ∧ Flag ∈ {yes}) then
7:     tk = s_ic^{inten_h}
8:     Flag = DoingCL
9: else
10:    for (t ∈ TK_r) do
11:        if (t^{agent} = R) then                             ▷ priority: robot-only tasks
12:            tk = t
13:            Flag = DoingCP
14:        end if
15:    end for
16: end if
17: while (tk is not defined) do
18:     st^0 = argmin_{s} b_ie(s)                              ▷ get next argmin
19:     if (st^0_{inten_h} ∈ TK_r AND st^0_{inten_h} ∉ CL ∪ AS) then
20:         tk = st^0_{inten_h}
21:         Flag = DoingCP
22:     end if
23: end while
to (CP, CL, AS). In these experiments we represented the areas with 10 to 100 rooms, each with 10 to 20 accessible positions, which is large enough to cover scenarios of a house, an apartment or a hotel. Table II presents the number of positions |POS|, the number of tasks |TK|, the number of ICS states and the offline calculation time. We consider that the presented HRMTI calculation times allow us to solve real-world robot companion scenarios, contrary to a POMDP model (Table III)^1.
Exp. | |POS| | |TK| | |S_ic|      | total time (minutes)
1    | 200   | 25   | 4.5 × 10^4  | < 1
2    | 200   | 50   | 9 × 10^4    | < 1
3    | 400   | 50   | 1.8 × 10^5  | 2
4    | 400   | 100  | 3.6 × 10^5  | 5
5    | 1200  | 100  | 1.08 × 10^6 | 21
6    | 1200  | 400  | 4.32 × 10^6 | 480 (8 hours)
TABLE II
EXPERIMENTS FROM THE HRMTI MODEL.
We recall that the parts calculated offline are: the Human-MDP policies, the HIE transition and observation functions, and the MDP_ic policy. The more complicated the scenario, the more time is needed to create the MDP models and to solve them. In our results, the majority of the calculation time was spent on the creation of the transition function T_ic of the MDP_ic model. We also recall that once the system is online, the robot decisions are taken instantly. Figure 3 includes screen-shots of a video demonstrating three types of interactions. After some human hesitation in
^1 Experiments done on a 6-core Intel(R) 2.13 GHz Linux machine with 20 GB of memory.
front of the bookcase, in 3(a) the robot offers assistance in finding a book (we suppose the robot has a database with the placement of books in the bookcase). In 3(b), the robot issues a Confirm for a CL task when the human starts to show interest in the collaborative task of filling the printer with paper. In 3(c), DoingCL: bringing a pack of paper to the human. In 3(d), DoingCP: cleaning the window. The complete video can be seen at the following link:
http://users.info.unicaen.fr/~akarami/demohri/demo_multi_interaction.avi
Fig. 3. Screen-shots of a video demonstrating three types of interactions: (a) Confirm Assistance, (b) Confirm Collaboration, (c) Collaboration, (d) Cooperation.
In previous work [7], we presented a POMDP model to solve a cooperative Human-Robot Interaction problem. The model was presented for a robot that shares a mission with the human, where the mission is divided into tasks (of type CP only). The robot is supposed to choose the tasks that the human is least interested in doing. Table III shows the incapability of a POMDP model to solve large-scale problems. The approach we present in this paper solves problems more than 1000 times the size of problems solved using the POMDP model, and with acceptable calculation time. This indicates that it would be impossible to solve HRMTI using a POMDP model.
Exp. | |POS| | |TK| | |S_ic| | total time (minutes)
1    | 10    | 4    | 2765   | 2
2    | 12    | 4    | 4385   | 10
3    | 11    | 6    | 15112  | 51
4    | 16    | 6    | 40897  | 315 (5 hours)
TABLE III
EXPERIMENTS FROM THE COOPERATIVE-INTERACTION POMDP MODEL.
V. RELATED WORK
To our knowledge, there is no prior work on adaptive multi- interaction systems in HRI. However, assistive and collabora- tive individual models have a large interest in the literature. Most prior work on decision making for assistant systems used approaches based on POMDP models. Some of them were for assistant-interaction systems. For example, hierar- chical POMDPs for an autonomous mobile robotic assistant
that provides elderly people in nursing homes with reminders about their daily activities [9]. A cognitive assistive system to help people with advanced dementia in their daily living activities is presented in [2]: the authors present a model for a hand-washing assistant which monitors the person's progress in his activity and suggests guidance in case of an unusual observed behavior. Markovian models have also been described for human intention/activity recognition problems. A switching hidden semi-Markov model [4] is used for a system that learns a model of the house occupant's daily activities through observation. Then, by monitoring the person's current activity, the model detects any abnormality and alerts the caregiver. Assistant robots are supposed to offer the information/help that facilitates the human task. This class of approaches includes the detection of the human's need of help and assisting him. However, the systems are limited to one type of interaction (assistive/informative) and, in some cases, it is a specialized system for only one task [2]. This is why we think it is interesting to have a higher level system that is able to detect which of the pre-calculated task/interaction policies is needed by the human. A POMDP model for selective assistive actions [5] estimates the possible human goals by using the Q-function values of representative optimal human MDPs. They use these values in a heuristic to approximately solve the POMDP while, in our approach, we use the Q-values in a belief update function over the human intention and his context variables. They describe a solution for human-robot interaction different from the one described in this paper. The assistant robot in [5] continues doing assisting actions from the calculated policy until no more actions can be done or high ambiguity is reached (no-operation); only then can the human participate by doing one action. Our approach, in case of ambiguity, checks whether special help is needed and otherwise switches to the cooperation interaction (CP). An additional variety of research on robot companions includes robots that collaborate with the human to do their tasks. A wheelchair navigation system [10] uses a POMDP model that can predict, with minimal joystick input, the target position of the wheelchair's user and drive him there. We consider this approach as collaborative, as the system (the chair) and the human are collaborating together to do the same task. Other problems suggested in the literature discuss robot companions that cooperate with their human partners without acting jointly with them, for instance a robot companion that plans its actions with awareness of the existence of its human partner in the same environment, using HMMs [3] or POMDPs [7]. In those problems the robot should recognize the human plan in order to cooperate by achieving its own tasks without disturbing the human plan. Some approaches use speech to help recognizing the human plan or preferences for cooperative missions [8]. The previously mentioned approaches are specialized in only one type of interaction. This paper concentrated on how we can enrich companion robots with systems that are able to switch from one type of interaction to another depending on the human's needs. Our approach includes online intention recognition for both the human intention and the type of help needed. It includes
an MDP model that helps the robot make the best decision in each possible situation. Solving the problem this way enriches the system without exploding the memory or the computation time. In addition, we are able to solve large-scale HRI problems, contrary to POMDP-based approaches.
VI. CONCLUSION
In this paper, we presented a multi-type human-robot interaction model for a robot companion. We explained the difference between three types of interaction: Cooperation, Assistance and Collaboration. We described how the human's behavior leads to an estimate of his need for one of those interactions. We then presented the succession of processes in our model that gives the robot companion an appropriate reaction. Our results show that by dividing the decision model into different components, we were able to solve real-life problem sizes with acceptable computation time, and to compute policies that behave reasonably, as expected. Those policies handle the problem of uncertainty over the human intention and also overcome the state-space limitation of POMDPs. A significant amount of time is dedicated to manually creating the MDP models (HMDPs, ICS, HIE and a database of policies for each pair (task, interaction)). Nevertheless, this is a normal difficulty when solving HRI problems with a large task space. In addition, the policy database can be reused in different robot companion scenarios.
REFERENCES
[1] Richard Bellman. A Markovian decision process. Indiana Univ. Math. J., 6:679–684, 1957.
[2] Jennifer Boger, Pascal Poupart, Jesse Hoey, Craig Boutilier, Geoff Fernie, and Alex Mihailidis. A decision-theoretic approach to task assistance for persons with dementia. In Leslie Pack Kaelbling and Alessandro Saffiotti, editors, IJCAI, pages 1293–1299, 2005.
[3] Marcello Cirillo, Lars Karlsson, and Alessandro Saffiotti. A human-aware robot task planner. In Alfonso Gerevini, Adele E. Howe, Amedeo Cesta, and Ioannis Refanidis, editors, ICAPS, 2009.
[4] Thi V. Duong, Hung Hai Bui, Dinh Q. Phung, and Svetha Venkatesh. Activity recognition and abnormality detection with the switching hidden semi-Markov model. In CVPR (1), pages 838–845. IEEE Computer Society, 2005.
[5] Alan Fern, Sriraam Natarajan, Kshitij Judah, and Prasad Tadepalli. A decision-theoretic model of assistance. In Manuela M. Veloso, editor, IJCAI, pages 1879–1884, 2007.
[6] Guy Hoffman and Cynthia Breazeal. Cost-based anticipatory action selection for human-robot fluency. IEEE Transactions on Robotics, 23(5):952–961, 2007.
[7] Abir-Beatrice Karami, Laurent Jeanpierre, and Abdel-Illah Mouaddib. Human-robot collaboration for a shared mission. In Pamela J. Hinds, Hiroshi Ishiguro, Takayuki Kanda, and Peter H. Kahn Jr., editors, HRI, pages 155–156, 2010.
[8] L. Matignon, A. Karami, and A. I. Mouaddib. A model for verbal and non-verbal human-robot collaboration. In Dialog with Robots: AAAI 2010 Fall Symposium Technical Reports, pages 62–67, 2010.
[9] Joelle Pineau, Michael Montemerlo, Martha E. Pollack, Nicholas Roy, and Sebastian Thrun. Towards robotic assistants in nursing homes: Challenges and results. Robotics and Autonomous Systems, 42(3-4):271–281, 2003.
[10] Tarek Taha, Jaime Valls Miró, and Gamini Dissanayake. POMDP-based long-term user intention prediction for wheelchair navigation. In ICRA, pages 3920–3925, 2008.
[11] Wikipedia. Simulation theory of empathy. http://en.wikipedia.org/wiki/Simulation_theory_of_empathy
Probabilistic Maneuver Prediction in Traffic Scenarios
Jonas Firl ∗
Quan Tran †
∗ Adam Opel AG, 65423 Rüsselsheim, Germany
† Department of Measurement and Control, KIT, 76131 Karlsruhe, Germany
Abstract— Traffic scene understanding is an important part of future advanced driver assistance systems (ADAS). In this paper we present a probabilistic approach to this problem. We use Hidden Markov Models (HMMs) to model the spatio-temporal dependencies of traffic situations. The parameters of the individual models are learned from a database generated by professional driving simulation software. New oncoming situations are first perceived by an on-board sensor system including radar and camera. After the parameters of the traffic participants are estimated, they are used for recognition and prediction of the traffic maneuver by evaluating the likelihood of the learned models. Experimental results in real-world environments have shown that the method can robustly recognize driving maneuvers. As an example, the prediction of lane change maneuvers shows the benefit for different ADAS functions.
Index Terms— traffic scene understanding, probabilistic model, pattern recognition
I. MOTIVATION
Traffic scene understanding is a key concept for future advanced driver assistance systems (ADAS) and a fundamental requirement for autonomous driving as well. Real traffic is a complex phenomenon involving multiple traffic participants, which evolve over time and interact with each other. To act intelligently in this dynamic scene, an ADAS should be able to perceive and interpret its environment. In particular, recognizing and predicting the maneuvers of traffic participants in advance allows improving the passengers' comfort and safety. Information perception in this context refers to detection and tracking of relevant objects based on raw sensory data. At this low level the system must deal with the uncertainty in sensory data. Generally, to improve the reliability of the perceived data, data from separate sources are combined, namely by sensor fusion. At the high level, the system uses the estimated information from the low level to recognize what is happening in the driving environment. At this level the system must be able to deal with the ambiguities of the scene. Although much effort has been devoted to it recently, traffic scene understanding remains challenging. One of the crucial difficulties is the complexity of the dynamic traffic scene, which usually involves the spatio-temporal dependencies of multiple objects.
Most existing approaches can be divided into two groups: logical approaches and probabilistic approaches. In the logical approaches, common knowledge is used to model the space and time of situations. This approach was first introduced in [11]. Prior knowledge is formulated in logical languages such as propositional time logic [10], modal logic [1] or description logic [7] to achieve a logical representation of traffic situations. In another system, introduced in [5], a conceptual representation of elementary vehicle activities is formulated with fuzzy metric-temporal Horn logic. Despite their advantages in scene representation and reasoning, these logic-based approaches have difficulties handling the uncertainties and ambiguities of real situations. Recently there have been some approaches based on the probabilistic logic formalism introduced in [14]. In [6] Markov Logic is used for modeling and understanding the object relations in a traffic scene. However, these approaches are still not capable of capturing the temporal dependencies of the scene. The other group of approaches is based on probabilistic models. In [4] Dynamic Bayesian Networks (DBNs) are chosen because of their capability to capture spatio-temporal dependencies. Whereas using DBNs for modeling traffic situations is intuitive and elegant, parameter learning and inference remain challenging. To overcome these problems, some other approaches use specific structures of DBNs. Hidden Markov Models (see [13]) are special cases of DBNs with linear structure which are widely used in many areas of pattern recognition. The main advantages of HMMs in comparison with DBNs are the efficient algorithms for learning the model parameters as well as for doing inference. Some variants of HMMs, like coupled HMMs and factorial HMMs, have been applied successfully to modeling human activities, as in [2] and [8]. Our work is mainly motivated by the idea of using HMMs to model traffic maneuvers as in [12], where two laser scanners are used for data acquisition, which are not applicable for current ADAS. Furthermore, additional scenario information like road geometry and boundaries is not taken into account there. The main contribution of this paper is to show that the maneuvers of traffic participants can be practically modeled and robustly recognized with a platform using production-oriented sensors. This paper is organized as follows. In the next section, we review Hidden Markov Models with the important algorithms for parameter learning and probabilistic inference. Next, in section III, we describe our system using HMMs as the framework for modeling and recognition of traffic situations. In section IV we present the experimental results evaluated in real-world environments. The paper is finally concluded in section V.
Fig. 1. Example of an HMM structure. Left: Markov chain of hidden system states q_t and observations o_t. Right: Model structure with 3 system states and 2 (discrete) observation symbols v_1, v_2.
II. THEORETICAL FRAMEWORK
Hidden Markov Models (HMMs) are one of the most popular methods for probabilistic modeling of sequential data. HMMs have been used successfully in many areas such as speech, gesture or human activities. In [13] Rabiner gives an excellent tutorial on HMMs with applications in speech recognition. An HMM is specified by the following:
• The set of states S = {s_1, s_2, ..., s_N} defines the unobserved, or hidden, discrete state of the model. In an HMM, the sequence of hidden states is a Markov chain Q = (q_1, q_2, ..., q_T), in which the probability of each state depends only on the state immediately preceding it. The probabilities of going from state i to state j, a_{i,j} = P(q_{t+1} = s_j | q_t = s_i), are summarized in the transition matrix A.
• The observation model or emission probabilities B consist of b_i(o) = p(o | s_i), the probabilities of an observation o if the model is in state s_i. For practical reasons, the observation distributions are continuous and are usually parameterized, e.g. by Mixtures of Gaussians (MoG).
• The initial state distribution π = {π_i}, where π_i = P(q_1 = s_i) are the probabilities of s_i being the first state of the state sequence.
In summary, the parameter set of an HMM can be indicated by the compact notation λ = (A, B, π). An example of the structure of an HMM can be seen in Figure 1.
Using HMMs for the modeling and recognition of traffic situations involves two questions: evaluating the likelihood of an observation sequence with respect to an HMM, and learning the parameters of the HMMs. These two problems can be efficiently solved by standard algorithms, which are briefly described in the following subsections.
A. Recognition with HMMs
The first problem is about the question: given a new observation sequence O = (o_1, o_2, ..., o_T) and a set of models {λ_1, λ_2, ..., λ_K}, which model explains the sequence best? To solve this problem, it is necessary to compute the likelihood of an observation sequence O given a model λ_i. This likelihood expands as follows:

p(O | λ_i) = Σ_{all possible Q} p(O, Q | λ_i)    (1)

It is the sum over every possible assignment to Q. Because q_t can take one of N possible values at each time step, evaluation of this sum would require O(N^T) operations. In practice p(O | λ) can be computed via a dynamic programming algorithm called the forward algorithm, described in [13]. In the forward algorithm, at every time step, the likelihood of the partial observation sequence that ends in state j is recursively computed as:

• Initialization:
α_1(i) = a_{1,i} b_i(o_1),   for i = 1, ..., N    (2)

• Recursion:
α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) · a_{ij} ] b_j(o_{t+1})    (3)

The likelihood of the whole sequence is finally calculated by:

p(O | λ) = Σ_{i=1}^{N} α_T(i)    (4)
The algorithm computes the α_t(i) efficiently: each α_{t+1}(j) requires only O(N) operations, so the total probability of an observation sequence of length T can be computed with complexity O(N²·T).
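To make the recursion concrete, the following Python sketch implements a scaled forward pass for a single HMM; the function and variable names are illustrative and not taken from the paper, and the emission density is passed in as a user-supplied callable (e.g. a Gaussian mixture). The per-step normalization corresponds to the scaling procedure of [13] used later in section IV.

```python
import numpy as np

def forward_log_likelihood(init, trans, emission_pdf, observations):
    """Scaled forward algorithm for one HMM (illustrative sketch).

    init          -- initial state distribution pi, shape (N,)
    trans         -- transition matrix A, shape (N, N)
    emission_pdf  -- callable emission_pdf(state, o) returning b_state(o)
    observations  -- sequence of observation vectors o_1 ... o_T

    Returns log p(O | lambda); per-step normalization avoids numerical underflow.
    """
    N = len(init)

    # Initialization, Eq. (2): alpha_1(i) = pi_i * b_i(o_1)
    # (the initial distribution pi is written a_{1,i} in the text)
    alpha = np.array([init[i] * emission_pdf(i, observations[0]) for i in range(N)])
    scale = alpha.sum()
    alpha /= scale
    log_likelihood = np.log(scale)

    # Recursion, Eq. (3): alpha_{t+1}(j) = [sum_i alpha_t(i) * a_ij] * b_j(o_{t+1})
    for o in observations[1:]:
        alpha = np.array([(alpha @ trans[:, j]) * emission_pdf(j, o) for j in range(N)])
        scale = alpha.sum()
        alpha /= scale
        log_likelihood += np.log(scale)

    # Termination, Eq. (4): in the scaled version the log-likelihood is the
    # accumulated sum of the log scaling factors.
    return log_likelihood
```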
B. Parameter Learning
The second problem of HMMs related to situation recognition is how to learn the model parameters. Given a set of observations {O_training}, the model parameters λ = (A, B, π) need to be adjusted so that the observation data become most likely. In other words, we have to solve the following optimization problem:
λ̂ = argmax_λ P(O_training | λ)    (5)
This optimization problem can be solved efficiently by the Baum-Welch algorithm, described in [13] in more detail. In this work we use the Baum-Welch algorithm to learn, for every situation HMM λ_i, the initial distribution π_i, the transition matrix A_i and the vector B_i of emission probability density functions, which are modeled by Gaussian Mixtures. The Baum-Welch algorithm is a derivation of the Expectation Maximization (EM) algorithm for Hidden Markov Models, which uses an iterative numerical technique to greedily maximize the likelihood function. Like other applications of EM algorithms, parameter learning for HMMs is a non-convex problem with many local maxima. The Baum-Welch algorithm converges to a maximum that depends on its initial parameters. In order to obtain stable and robust results, multiple runs of the algorithm are needed.
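Since the result depends on the initialization, one practical pattern is to restart training several times and keep the model with the highest final likelihood. The sketch below illustrates this; `initialize_fn` and `baum_welch_fn` are placeholders for an initialization routine and an EM trainer, neither of which is specified in the paper.

```python
def train_with_restarts(train_sequences, initialize_fn, baum_welch_fn, n_restarts=10):
    """Run Baum-Welch from several random initializations and keep the best model.

    initialize_fn  -- callable returning a random initial parameter set lambda = (A, B, pi)
    baum_welch_fn  -- callable (sequences, initial_params) -> (fitted_params, log_likelihood)
    Both callables are assumptions; the paper does not describe its implementation.
    """
    best_params, best_loglik = None, float("-inf")
    for _ in range(n_restarts):
        params, loglik = baum_welch_fn(train_sequences, initialize_fn())
        if loglik > best_loglik:   # keep the restart with the highest training likelihood
            best_params, best_loglik = params, loglik
    return best_params
```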
III. SYSTEM DESCRIPTION
In this section the system for maneuver recognition and prediction in traffic scenarios is described. It is based on the HMM approach introduced in the last section, to take advantage of its beneficial properties: the capability to handle dynamic, complex systems, uncertainties and ambiguities, the availability of effective training and evaluation algorithms, and the readability of the model. As a first step, relevant traffic maneuvers have to be identified which may contribute to an improved traffic prediction. As
this work focuses on extra-urban traffic scenarios, following, overtaking and flanking maneuvers are selected, since these are the most frequently occurring interactions on country roads and highways. To develop a recognition system for these traffic maneuvers, it is necessary to identify characteristic quantities. Keeping in mind the available sensor setups of current ADAS, the information about relative position, velocity and acceleration of each traffic participant pair is appropriate. Each of the mentioned maneuvers is modeled by one HMM. The system states of the models thereby correspond to the different physical stages of the maneuvers. The number of states of every model is set to 5. For a given observation sequence O, recognition is performed by applying the forward procedure to compute the likelihood P(O | λ_i) for each model λ_i. The continuous observations are implemented using a MoG observation model, as already mentioned in section II, since Mixtures of Gaussians have good approximation capabilities for arbitrary probability distributions. The number of mixture components is set to 3, since a higher number of mixture components leads to a sharply increasing number of model parameters. Figure 2 presents the semantic modeling of driving maneuvers using Hidden Markov Models with Mixtures of Gaussians as observation models. Although this approach works well in simple traffic environments, it shows some disadvantages in more complex real-world traffic scenarios. These are characterized e.g. by different road types with different numbers of lanes, relations between more than two objects, and uncertainty about the free space available for the maneuvers. In addition, the probability P(O | λ) of an observation sequence given a model is not the best quantity for making decisions; the probability P(λ | O) of a model given an observation sequence is. Hence, a more feasible approach was developed in this work. Using Bayes' theorem, the likelihood P(O | λ) can be rewritten, and the assumption of a constant observation probability P(O) yields:
P(λ_i | O) = P(O | λ_i) · P(λ_i) / P(O) ∝ P(O | λ_i) · P(λ_i)    (6)
The a-priori probability P(λ_i) has an effect on the resulting likelihood and depends on the following factors:
• Type of road: The a-priori probabilities of the different maneuvers depend on the type of road; e.g. overtaking maneuvers occur more often on highways than on country roads.
Fig. 3. System overview for traffic maneuver recognition using HMMs.
• Past recognized maneuvers: Individual driver behavior, like an aggressive or defensive way of driving, has an influence on upcoming maneuvers. Therefore, previously recognized maneuvers are also taken into account.
• Influence of other vehicles: The intended maneuver does not only depend on the relative information between the two vehicles involved, but also on other vehicles in the traffic situation.
• Road geometry: Road information like boundaries or oncoming lanes also has to be taken into account.
As an example for the last two points, consider the overtaking maneuver of Figure 6. In stage 1 (before the first lane change), the probability of an overtaking maneuver is directly influenced by possible road boundaries on the left or by other vehicles on the left lane. Figure 3 shows the diagram of the maneuver recognition and prediction system. New traffic participant pairs are initialized every time a new vehicle enters the sensor field of view. For every participant pair, the likelihood P(λ_i | O) of each maneuver λ_i is computed. This information is used for maneuver prediction. For this purpose, the dominating maneuver model is identified by determining:
argmax_i P(λ_i | O).
Results of single maneuver prediction are presented in the next section.
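The selection of the dominating maneuver can be written compactly in log space; the sketch below combines the forward log-likelihoods with the log a-priori probabilities as in Eq. (6). The dictionary layout and the `log_likelihood_fn` argument are illustrative choices, not part of the paper.

```python
def classify_maneuver(observations, models, log_priors, log_likelihood_fn):
    """Return the dominating maneuver according to Eq. (6) (illustrative sketch).

    models             -- dict: maneuver name -> HMM parameters lambda_i
    log_priors         -- dict: maneuver name -> log P(lambda_i), e.g. derived from
                          road type and previously recognized maneuvers
    log_likelihood_fn  -- callable (params, observations) -> log P(O | lambda_i),
                          e.g. the scaled forward pass of section II-A
    """
    # log P(lambda_i | O) = log P(O | lambda_i) + log P(lambda_i) + const.
    scores = {name: log_likelihood_fn(params, observations) + log_priors[name]
              for name, params in models.items()}
    best = max(scores, key=scores.get)   # argmax_i P(lambda_i | O)
    return best, scores
```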
IV. EXPERIMENTAL RESULTS
In this section the implementation of the described system is presented, and some results on real-world data for traffic maneuver recognition and prediction are given.
A. Scenario Setup
Fig. 5. Lane-fixed coordinate system (X, Y) and sensor-fixed coordinate system (X_S, Y_S) of the ego vehicle (red), with dist_x = X_1 − X_2 and dist_y = Y_1 − Y_2.

Fig. 6. Overtaking maneuver for real data evaluation.

For data acquisition an experimental vehicle of the Adam Opel AG, an Opel Insignia (see Figure 4), is used, which is equipped with a sensor fusion system consisting of a forward-looking radar sensor (Continental ARS300) and a mono camera (Continental CSF200). With this sensor platform it is possible to detect, classify and track all objects in front of the car. The opening angles of the sensor setup are shown on the right side of Figure 4. Both sensors have complementary properties that make sensor fusion reasonable:
• Radar: Good position and velocity estimation of all detected objects; high detection range using far-range scans.
• Camera: Good classification capability for detected objects; provides scenario information (lanes and road boundaries).
After selecting the sensor platform, the next step is to choose an appropriate coordinate system. Whereas current approaches use vehicle- or sensor-fixed coordinates, our work uses a lane-fixed coordinate system to obtain more significant information, especially on winding streets (see Figure 5). One axis is directed along the road, which is normally not a straight line, and the other axis is orthogonal to it (the resulting coordinate system is not Cartesian). To transform positions from the given sensor-fixed coordinate system (X_s, Y_s) to the lane coordinate system (X, Y), the geometry of the road has to be known. In this work a clothoid model is used (see [3]), where the parameters of the model are extracted from the lane marker and road boundary information given by the sensor fusion system (see [3]).
After that, the observation vector for a traffic participant pair can be defined:

o = (dist_x, dist_y, vel, acc),    (7)
where dist_x and dist_y are the relative distances of the objects in X- and Y-direction (lane coordinate system), and vel and acc are the absolute values of the relative velocity and acceleration of the objects (see Figure 5). The reason for using only one-dimensional velocity and acceleration is the inaccurate lateral velocity resolution of radar sensors. The details of the algorithmic implementation in the demonstration vehicle regarding object detection, tracking and classification are presented in [9].
After defining the observation vector, two kinds of input data have to be generated to implement the proposed maneuver recognition system:
• Training data: data for model parameter estimation using the Baum-Welch algorithm.
• Evaluation data: data for maneuver recognition using the forward algorithm.
As HMMs have many model parameters, the number of training sequences has to be large enough. Because the acquisition of real data with the current sensor setup shown in Figure 4 is extremely time-consuming, data simulated with the tool IPG CarMaker© is used for the training stage. CarMaker is a popular simulation tool for many applications in the automotive industry. When simulating traffic maneuvers with CarMaker, two main models are used: a vehicle model, which models all characteristic properties of the simulated cars, and a driver model, which includes driver-specific inaccuracies when following a predefined path. With a parametrized definition of traffic scenarios, CarMaker allows the user to rapidly build up a maneuver database (here consisting of ≈ 500 maneuvers). To evaluate the trained models and carry out prediction steps, the test scenario shown in Figure 7 was set up. In this scenario 60 overtaking maneuvers were recorded to evaluate the maneuver recognition system described in the last section and to show exemplarily the benefit for lane change prediction. Figure 6 shows the different stages of the maneuvers.
B. Recognition Results
In the following, results of the maneuver recognition are presented, using real-world data as described in section IV-A to verify the functionality of the system in real ADAS applications. After acquiring the data, the relative dynamic information about the interacting vehicle pair is extracted according to (7). Maneuver recognition is done by applying the procedure shown in Figure 3 to classify the three maneuvers overtaking, following and flanking for the complete set of approx. 60 recorded sequences. The likelihoods P(O | λ_i) are computed efficiently by applying the forward algorithm (see section II). Because the complexity of the algorithm is only O(N²·T), the system has real-time capability. One main problem when computing the forward variables is underflow due to the multiplication of many very small terms. Therefore, in this work we use the scaling procedure presented in [13], which normalizes the forward variables at each time step. When applying this procedure the likelihood functions are no longer monotonically decreasing. In this highway-like test scenario, the a-priori probabilities of all maneuvers were set manually to P(following) = 0.9 and P(overtaking) = P(flanking) = 0.05, due to the general occurrence of these maneuvers in real traffic situations on highways.
Fig. 4. Experimental vehicle of the Adam Opel AG. Left: Opel Insignia experimental vehicle. Right: opening angles of the used sensor setup (ARS300 radar far- and near-range scans, CSF200 mono camera system); the sensor system consists of a mono camera and a radar sensor performing a far and a short range scan.

Fig. 7. Screenshots of the scenario setup for model evaluation with real data. From left to right: different stages of the overtaking maneuver.

Fig. 8. Results of the maneuver recognition system for three different models: log(P(λ_i | O)) plotted over frames (using the scaling procedure of [13]).
This means that only the dependency on the type of road (see section III) is taken into account in the a-priori probabilities. The overtaking maneuvers are detected correctly in all sequences, at the latest when the first lane change was performed. At the beginning of all sequences (stage 1 of Figure 6), a clear distinction between the maneuvers overtaking and following is hard to make or sometimes impossible. This corresponds to the general ambiguities of traffic maneuvers in real-world scenarios. Figure 8 presents exemplarily the results for the sequence shown in Figure 7, where the likelihoods P(λ_i | O) corresponding to the three HMMs are plotted logarithmically.
At the beginning of the sequence, when vehicle number 1 is following vehicle number 2 (left image in Figure 7), the likelihood of the following maneuver is slightly above that of the overtaking maneuver. At frame 250, vehicle number 1 begins the overtaking maneuver with the first lane change (middle image in Figure 7): the likelihood of the following maneuver decreases and the likelihood of the flanking maneuver increases abruptly. At frame 520, the second lane change of vehicle number 1 is performed (right image in Figure 7), which is reflected by the decreasing flanking probability.
C. Lane Change Prediction
The results of the maneuver recognition of the last section can be used in a wide field of ADAS applications. In the following we present a single driving maneuver prediction in the form of lane changes. The basic idea is to predict a lane change maneuver by recognizing an overtaking maneuver while the vehicle is still in stage 1 of Figure 6. For this we observe the ratio r = P(O | following) / P(O | overtaking). For example, an increasing relative velocity vel will lead to a rapidly decreasing likelihood P(O | following) and makes a lane change more probable. If we set the critical ratio value to r = 1, lane changes are predicted when the probability of the following maneuver falls below that of the overtaking maneuver (see e.g. Figure 8, frame 250). To decrease the resulting false positive rate, the critical ratio has to be decreased as well. In Table I the results of the lane change prediction described above are shown for the complete set of 60 recorded maneuvers. The predicted lane changes are listed for different times before the vehicle actually starts to perform the lane change (i.e., the left tires of the vehicle pass the lane markers); the ground truth is determined manually by analyzing the images given by the camera system.
TABLE I
LANE CHANGE PREDICTION RESULTS BASED ON OVERTAKING MANEUVER RECOGNITION.

Time before lane change [s] | 2.0 | 1.8 | 1.6  | 1.4  | 1.2  | 1.0 | 0.8 | 0.6 | 0.4  | 0.2 | 0.0  | -0.2
correct predictions [%]     | 6.7 | 10  | 18.3 | 28.3 | 33.3 | 40  | 55  | 65  | 76.7 | 85  | 98.3 | 100
It can be seen that at the moment the lane change starts (0.0 s), 98.3% of the maneuvers are already predicted correctly, and 0.4 seconds before the true lane change 76.7% of the maneuvers are predicted correctly.
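A minimal sketch of the decision rule described above: a lane change is flagged as soon as the ratio r = P(O | following) / P(O | overtaking) drops below a critical value, evaluated in log space for numerical robustness. The threshold is a tunable parameter, as discussed in section IV-C.

```python
import math

def lane_change_predicted(log_lik_following, log_lik_overtaking, critical_ratio=1.0):
    """Predict a lane change from the likelihood ratio r (illustrative sketch).

    log_lik_following, log_lik_overtaking -- forward log-likelihoods of the two HMMs
    critical_ratio -- decision threshold; lowering it reduces false positives
    """
    log_r = log_lik_following - log_lik_overtaking   # log of r = P(O|following)/P(O|overtaking)
    return log_r < math.log(critical_ratio)          # lane change predicted when r falls below the threshold
```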
V. CONCLUSION
In this paper we presented a probabilistic framework for vehicle maneuver recognition and prediction in real-world traffic scenarios. In particular, our system is based only on state-of-the-art sensor systems of current ADAS. Furthermore, we considered different factors which influence traffic situations. Our system provides robust and reliable results for the recognition of driving maneuvers in extra-urban scenarios. One application of the maneuver recognition system is lane change prediction, which can be used by subsequent ADAS functions, like lane departure warning. For future work we plan to study more detailed modeling of the prior information. We believe that, if the prior information is modeled more accurately, the recognition results will be improved. Moreover, it will become possible to handle more diverse and complex scenarios. We would also like to use the result of the recognition stage for a path prediction stage.
REFERENCES
[1] B. Bennett, A.G. Cohn, F. Wolter, and M. Zakharyaschev. Multi-dimensional modal logic as a framework for spatio-temporal reasoning. Applied Intelligence, 17(3):239–251, 2002.
[2] M. Brand, N. Oliver, and A. Pentland. Coupled hidden Markov models for complex action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), page 994, 1997.
[3] M. Darms, M. Komar, and S. Lueke. Map based road boundary estimation. In Intelligent Vehicles Symposium (IV), pages 609–614, 2010.
[4] T. Dean and K. Kanazawa. A model for reasoning about persistence and causation. Computational Intelligence, 5(2):142–150, 1989.
[5] R. Gerber and H.H. Nagel. Representation of occurrences for road vehicle traffic. Artificial Intelligence, 172(4-5):351–391, 2008.
[6] I. Hensel, A. Bachmann, B. Hummel, and Q. Tran. Understanding object relations in traffic scenes. International Conference on Computer Vision Theory and Applications (VISAPP), pages 389–395, 2010.
[7] B. Hummel, W. Thiemann, and I. Lulcheva. Scene understanding of urban road intersections with description logic. Logic and Probability for Scene Interpretation, Dagstuhl Seminar Proceedings, Dagstuhl, Germany, 2008.
[8] N. Landwehr. Modeling interleaved hidden processes. In Proceedings of the 25th International Conference on Machine Learning, pages 520–527, 2008.
[9] S. Lueke, P. Rieth, and M. Darms. From brake assist to autonomous collision avoidance. International Federation of Automotive Engineering Societies (FISITA), 2008.
[10] Z. Manna and A. Pnueli. The Temporal Logic of Reactive and Concurrent Systems: Specification. Springer, 1992.
[11] J. McCarthy. Situations, actions, and causal laws. Technical report, Stanford Department of Computer Science, 1963.
[12] D. Meyer-Delius, C. Plagemann, and W. Burgard. Probabilistic situation recognition for vehicular traffic scenarios. In Robotics and Automation, 2009. ICRA'09. IEEE International Conference on, pages 459–464. IEEE, 2009.
[13] L.R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[14] M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1):107–136, 2006.
Approaching a Person in a Socially Acceptable Manner Using Expanding Random Trees
Jens Kessler ∗
Andrea Scheidig ∗
Horst-Michael Gross ∗
∗ Neuroinformatics and Cognitive Robotics Lab, Ilmenau University of Technology, Ilmenau, Germany
Abstract — In real-world scenarios for mobile robots, socially acceptable navigation is a key component for interacting naturally with other persons. On the one hand this enables a robot to behave in a more human-like way, and on the other hand it increases the user's acceptance of the robot as an interaction partner. As part of this research field, we present in this paper a strategy for approaching a person in a socially acceptable manner. To this end, we use the theory of "personal space" and present a method of modeling this space to enable a mobile robot to approach a person from the front. We use a standard Dynamic Window Approach to control the robot motion and, since the personal space model cannot be used directly, a graph planner in configuration space to plan an optimal path by expanding the graph with the DWA's update rule. Additionally, we give a proof of concept with first preliminary experiments.
Index Terms— Socially acceptable navigation, approaching strategy, expanding random trees, dynamic window approach
I. INTRODUCTION
In recent years, mobile robotics has been developing towards fields of application with direct interaction with persons. There are several prototypical systems that aim to help elderly people to improve cognitive abilities [1], to assist caregivers in hospitals [2, 3], to be an intelligent video-conferencing system [4], to guide people in supermarkets and home improvement stores [5, 6], or simply to improve well-being by providing an easy-to-use communication platform. All these scenarios have to consider persons interacting with the robot system. Psychologists and gerontologists showed in the 90s that technical devices, for example cars, televisions and computers, are treated and perceived as "social beings" [7, 8]. A robot system is also recognized as a social being and therefore has to behave like one. One important part of the robot's behavior is socially acceptable navigation. Navigation commonly includes tasks like mapping, motion control, obstacle avoidance, localization and path planning. Socially acceptable navigation addresses these tasks while keeping in mind that humans are within the operating area of the robot and require special treatment. We are contributing to the ALIAS (Adaptable Ambient LIving ASsistant) project. ALIAS has the goal of developing a mobile robot system to "interact with elderly users, monitor and provide cognitive assistance in daily life, and promote social inclusion by creating connections to people and events in the wider world" [9].
A. The ALIAS robot and the navigation system
Fig. 1. The ALIAS robot, a SCITOS G5 platform of MetraLabs GmbH, with cameras, Kinect© 3D sensor and laser range finder. It interacts with the user by touch display and speech output.

The ALIAS project provides a variety of services, like automatically collecting and searching the web for specific events, a calendar function to remind the user, and, most importantly, a service to communicate by e-mail, social networks and voice or video telephony, particularly adapted to the needs of the target group. All these tasks are provided by a mobile robot system (see Fig. 1). The benefit of a mobile system is the capability to move: the robot can be requested by the user and should autonomously drive to the user and approach him/her. Navigation has to be smooth and exact; therefore our motion control system is based on the Dynamic Window Approach [10]. Based on this approach, we present here how to approach a person with known pose while considering the "personal space" of the interaction partner. This provides a more natural, polite and unobtrusive approaching behavior of the robot. The personal space itself is not appropriate to use directly inside the DWA, so we need to apply a planning strategy to find an optimal approaching behavior. The robot we use is a SCITOS G5 platform and is equipped with sonar-based and laser-based distance sensors, a high-res front camera, a Kinect© 3D camera from Microsoft (see Fig. 1), and a dual-core PC. With these sensors, obstacle recognition and person detection are performed to provide the navigation system with all needed information.
TABLE I
PSYCHOLOGICAL DEFINITION OF THE PERSONAL SPACE AS DEFINED IN [11]. THIS SPACE CONSISTS OF 5 ZONES, EACH SUPPORTING DIFFERENT ACTIVITIES AND DIFFERENT COMMUNICATION INTENTIONS.

zone           | interval        | example situation
close intimate | 0.0 m - 0.15 m  | lover or close friend touching
intimate zone  | 0.15 m - 0.45 m | lover or close friend talking
personal zone  | 0.45 m - 1.2 m  | conversation between friends
social zone    | 1.2 m - 3.6 m   | conversation with a non-friend
public zone    | from 3.6 m      | no private interaction

The navigation system is only a small part of the overall ALIAS system architecture, which also comprises the dialog controller, the person recognition and detection system, speech recognition and speaker identification, and a set of applications presented to the user, which are enhanced with various web services.
II. STATE OF THE ART
Psychologists have investigated human-to-human interaction in public areas very carefully since the 1970s. One of the foundational and most recognized publications is the work of Hall [11], [12], who first introduced the concept of different spaces around a human being that support different modes of interaction. There is a space for non-interaction, public interaction, interactions with friends, and also an intimate space for interaction with very close relatives (see Table I). Following the theory that interaction is also coupled to spatial configurations between interaction partners, many investigations on this matter have taken place, and it could be shown that the configuration depends on many aspects like cultural background, age, sex, social status and the person's character [13, 14, 15, 16, 17, 18]. But is the personal space a valid description for human-robot interaction? As Reeves and Nass [8, 7] showed, complex technical devices are indeed seen as social beings and treated as such. So, we can assume that a robot with a person-like appearance is treated like a person. Additional evidence is given by exhaustive experiments done within the COGNIRON project, where Wizard-of-Oz methods showed that a spatial configuration between robots and humans exists [19] and that this configuration changes depending on the task of interaction (e.g. talking, handing over an object) [20], or on constraints like sex or experience with robots [21]. However, none of these works tried to autonomously approach a person in a socially acceptable manner, but the Wizard-of-Oz experiments did reveal useful spatial parameters for autonomously approaching a person. Despite the thorough psychological background work, only few publications exist that describe an actual autonomous approaching behavior. Often a simple control policy is used, where a fuzzy controller [22], a PID controller [23, 24], or a similar technique keeps the robot at a certain distance to the person. The used distance thresholds or fuzzy rules are always hand-crafted and set by the designer without sufficient psychological justification. Some systems can only approach a person from the front [23], since face detection is needed, and some simply do not consider the upper body orientation of the person and approach the person from any direction [22].
There are only a few works, more aware of the concept of personal space, which use this space to approach a person or to drive around a person without intruding into the person's personal zone. For example, Pacchierotti [25] uses an elliptical region around a tracked person in a corridor to signal avoidance towards the person by changing the robot's driving lane at an early stage of approaching, where collision avoidance alone would not have suggested such a driving behavior. The distance for the lane change was tuned by hand, and the distance threshold for driving by was determined by evaluating a questionnaire. A hand-made approaching scenario was also presented by Hoeller [26], where different approaching regions were defined, each with a different priority. At least one of these regions had to be free of obstacles, and the region with the highest priority was the current target region. Hoeller also uses expanding random trees [26] to plan the next motion step in an optimal fashion. The work of Svenstrup and Andersen [27] models the personal space explicitly and without the need for any thresholds, so they could create a dense representation of the personal space and approach a person by using a potential field method. Although their results do not consider any obstacles and could get stuck in local minima, they were the first with an actual mathematical model of the personal space. Sisbot [28] investigates other aspects of planning a path towards a person: the robot has to be visible, should not hide behind walls, and should not drive up behind a person. He uses an adapted A* planner to derive a planned path but does not show how to include these results into the motion planning concept. Other authors do not consider the personal space, but also need to approach a walking person from the front to catch customer attention [29]. Here, the trajectory of the person is predicted, and a point on that trajectory is chosen as the goal, to give the robot enough time to turn towards that person and approach her from the front.
A. The Dynamic Window Approach

To move a robot, decisions must be taken about which action to execute next. Here, two parts are important: first, the robot has to know to which position it has to drive, and second, which trajectory it has to follow to reach a good position. As mentioned before, we use the Dynamic Window Approach (DWA) [10] for motion planning and can therefore only support physically plausible paths towards the target. We can assume two things when deciding upon the next action: first, we can measure the robot's position and speed, and second, we know the current obstacle situation. The key idea of the Dynamic Window Approach is to select a rectangular region of rotation and translation speeds around the current rotation and translation speed, and to decide which next speed pair is best by evaluating different so-called objectives. Each objective focuses on one aspect of navigation, like avoiding obstacles, heading towards the target, driving at a certain speed, and so on. The window's outer bounds are based only on physical constraints, like the robot's acceleration capabilities and maximum allowed speeds. The votes of the objectives are summed up with weights, and the speed pair with the minimum vote in the current speed window is chosen as the next action. Our goal is to design an objective for the DWA which uses a personal space model to approach a person. The model of the personal space is described in the next section. After that, we show how to include the model into the DWA's objective.
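To illustrate the weighted voting described above, the following sketch evaluates a sampled dynamic window and returns the speed pair with the minimum summed objective vote. The objective callables, their weights and the window samples are placeholders, not values taken from the paper.

```python
import math

def choose_next_speeds(window_speeds, objectives, weights):
    """Pick the next (v_rot, v_trans) pair from a dynamic window (illustrative sketch).

    window_speeds -- iterable of (v_rot, v_trans) pairs sampled from the admissible window
    objectives    -- list of callables mapping (v_rot, v_trans) to a vote/cost, e.g.
                     obstacle avoidance, heading, speed, or the personal space objective
    weights       -- one weight per objective
    """
    best_pair, best_vote = None, math.inf
    for v_rot, v_trans in window_speeds:
        # weighted sum of all objective votes for this speed pair
        vote = sum(w * obj(v_rot, v_trans) for w, obj in zip(weights, objectives))
        if vote < best_vote:   # the minimum vote wins, as described above
            best_pair, best_vote = (v_rot, v_trans), vote
    return best_pair
```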
Fig. 2. Two regions of our personal space model. The front region is within a ±45° interval (in red). The back region is the rest (in blue). Note that the regions are not limited in radial extension, as it may appear in the illustration.
III. MODEL OF THE PERSONAL SPACE
As described in section II, the model of the personal space is the key component for approaching a person. Similar to the work of Dautenhahn [19], we also want the robot to approach a person from the front, but with a slight deviation from the direct front, since most users perceive such a behavior as more comfortable. For this purpose, we obviously need the position and viewing direction of the person to calculate the configuration of the personal space model. The space configuration should enable the robot to drive around the person at a comfortable distance and to turn towards the person when a "front position" is reached. Like in [27], we model the personal space with a sum of Gaussians. The space relative to the person's upper body direction is separated into two regions: a front region, which is considered to be within ±45° around the person's upper body direction, and a back region, which is the rest (see Fig. 2). In both regions we define a distance function to keep the robot out of the user's personal zone but within his/her social zone while approaching the person. The function is defined relative to the person's upper body direction.
a(x, y) = α/(2πσ_1) · e^(−(x² + y²)/σ_1²) − β/(2πσ_2) · e^(−(x² + y²)/σ_2²)    (1)
The variables α, β, σ_1, σ_2 describe a classical Difference of Gaussians function and are set in our case (see Fig. 2) to α = 0.6, β = 0.3, σ_1 = 2 m, σ_2 = √7 m to form a minimum-cost region at a distance of 3.5 meters around the person. The front region is additionally treated with an "intrusion function" i(x, y). This is also a Gaussian function and is simply added to a(x, y).
i(x, y) = γ/(2π|Σ|) · e^(−x^T Σ^(−1) x)    (2)

with Σ = ( σ_x  0.0 ; 0.0  σ_y ) · ( cos(φ)  −sin(φ) ; sin(φ)  cos(φ) )
Fig. 3. Two example configurations for different approaching directions.

Here, the variables σ_x and σ_y define an elliptical region that is rotated towards the desired approaching direction φ, as seen from the person's perspective. The vector x is simply the column vector (x, y)^T. The variables are set to γ = −0.5, σ_x = 2.9 and σ_y = 1.1. Only φ and σ_x need to be set at runtime to regulate the approaching distance and direction. These parameters, which define the form of the personal space, can be obtained by investigating the familiarity of the user with robots, but for the sake of simplicity they have been chosen manually for our first trials. All other parameters are constant and are chosen to reflect the properties of the personal space definition in [11]. So the final definition of the personal space p(x, y), relative to the person coordinates x = 0, y = 0 and an upper body pose towards the x-axis, is as follows:
p(x, y) = a(x, y),             if (x, y) in back region
p(x, y) = a(x, y) + i(x, y),   if (x, y) in front region    (3)
To compute the personal space in a real-world application, each point (x', y')^T has to be transformed into the person-centered coordinate system (x, y)^T presented here. In our trials we use the given person's upper body pose, representing the "most likely" pose. Figure 3 shows two example configurations of the personal space with two different approaching directions.
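The following sketch evaluates p(x, y) in person-centred coordinates using the parameter values given above. It is a transcription of Eqs. (1)-(3) as reconstructed here; in particular, the symmetric composition of the matrix Σ from the diagonal matrix and the rotation matrix is our assumption, since the flattened Eq. (2) is ambiguous on this point.

```python
import numpy as np

# Parameters as given in the text; phi (approach direction) and sigma_x are set at runtime.
ALPHA, BETA = 0.6, 0.3
SIGMA1, SIGMA2 = 2.0, np.sqrt(7.0)
GAMMA, SIGMA_X, SIGMA_Y = -0.5, 2.9, 1.1

def personal_space(x, y, phi):
    """Personal space cost p(x, y), Eqs. (1)-(3), in person-centred coordinates (sketch)."""
    r2 = x ** 2 + y ** 2

    # Eq. (1): Difference of Gaussians forming a minimum-cost ring around the person
    a = (ALPHA / (2 * np.pi * SIGMA1)) * np.exp(-r2 / SIGMA1 ** 2) \
        - (BETA / (2 * np.pi * SIGMA2)) * np.exp(-r2 / SIGMA2 ** 2)

    # Back region: outside +/-45 degrees of the person's upper body direction (the x-axis)
    if abs(np.arctan2(y, x)) > np.pi / 4:
        return a

    # Eq. (2): elliptical "intrusion" Gaussian rotated towards the approach direction phi.
    # Assumption: Sigma = R diag(sigma_x, sigma_y) R^T, so the quadratic form is well behaved.
    rotation = np.array([[np.cos(phi), -np.sin(phi)],
                         [np.sin(phi),  np.cos(phi)]])
    sigma = rotation @ np.diag([SIGMA_X, SIGMA_Y]) @ rotation.T
    v = np.array([x, y])
    i = (GAMMA / (2 * np.pi * np.linalg.det(sigma))) * np.exp(-v @ np.linalg.inv(sigma) @ v)

    return a + i   # Eq. (3), front region
```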
IV. PLANNING WITH EXPANDING RANDOM TREES AND THE DYNAMIC WINDOW APPROACH
Up to this point, we have shown how the personal space can be described if the upper body pose of a person is known. We also stated that this space is used within an objective for the DWA. The basic idea of the DWA is to decide which next action is best in a locally optimal fashion. The local driving command is only valid for a certain ∆t, then the next window configuration is evaluated. The model of the personal space could be used directly within a Dynamic Window objective: it is possible to predict, for every speed pair V_rot, V_trans, the trajectory within the interval ∆t and simply evaluate the value of the personal space at the end point of each trajectory. This is shown in Fig. 4. The minimal value results in the most supported driving decision. However, by using the personal space directly, multiple driving decisions may lead to the same minimal value and a unique local optimum cannot be guaranteed. So we have to reformulate the search problem to obtain a function with a unique local minimum and, by sequentially following the local minima, a function that leads towards the global minimum (or target position). It is known that planning algorithms can provide such functions. We choose a random tree planner [30] for two reasons.
Fig. 4. No distinct speed decision is possible when the personal space model is used directly; several actions can lead to the same minimal value.
First, classical planning approaches like A*, D* and E* are defined only on metric grid-based maps (and not in the configuration space the DWA is defined in) and have to explore a large area of the grid to finish the plan. The second reason is that random trees need to touch only a small area of the planning grid. Here, computation time can be saved by computing only the needed cells of the personal space grid and by covering only a sparse portion of the planning space. In the following sections we describe how the random tree graph is constructed and how it fits to the Dynamic Window Approach. The basic idea is simply to use the globally optimal pose, extracted from the personal space, and to use the mentioned planning algorithm to overcome local minima in the personal space while finding a cost-optimal path to the globally optimal pose.
A. Expanding Random Trees
For planning purposes we use so-called expanding random trees [30]. These trees are used to generate a path towards a target pose, so one aspect of planning is to define the target pose to reach (see IV-C), the graph's state definition (this section) and also the directed expansion of existing graph nodes towards the target (section IV-B). If the target is reached by one node of the graph, it is guaranteed that a cost-minimal path toward the target is found. The benefit of an expanding random tree is that only a sampled set of possible actions is used per node to expand that node. This makes the tree efficient and still suitable for complex planning tasks. Formally, a random tree is quite simple: it consists of a set of nodes S = (s_1, s_2, ..., s_n), each representing a state s_i of the system. Our tree uses a five-dimensional state space consisting of the rotational speed V_rot of the robot, the translational speed V_trans, and the position and orientation of the robot x, y, φ. What makes this approach useful is the creation of successive states by using a random transition function tr(s_i) together with the state update equation from the DWA. This function generates a set of next states by considering the current node's state s_i and applying a set of random actions to that state (see Fig. 5 b). This process is also called the "expansion" of a node. As the transition function we use a motion model for a differential drive robot with left wheel speed and right wheel speed. Given a pair of these speeds, we can create the trajectory for a given time interval ∆t. Since translation speed and rotation speed are convertible to left and right wheel speeds, we can sample a set of speed pairs from a virtual dynamic window centered at the current speed state of the given node (see Fig. 5 b).
B. Expanding the graph
To expand the graph, the method of A* [31] is used. A* uses a heuristic to implement a directed search (unlike other planners like E* or Dijkstra) and can significantly speed up the search for the optimal path. Each node of the planning graph carries a cost value c_i which is computed incrementally from the cost of the node's parent c_{i−1}, the real cost to travel from node s_{i−1} to node s_i (denoted by the cost function C(s_{i−1}, s_i)), and the heuristic for node s_i. So, a cost update is:
c_i = c_{i−1} + C(s_{i−1}, s_i) + h(s_i)    (4)
The traveling cost function C(s_{i−1}, s_i) is described in more detail in section IV-D. The heuristic is quite simple: we use the 5D Euclidean distance of the node's state vector to the minimum cell (x', y') of p_min(x', y') with target speeds V_rot = 0 and V_trans = 0. All nodes with updated costs are put into the active node list. From that list, the node with the lowest cost c_i is selected, expanded and removed from the list of active nodes. If a node reaches the target cell with the correct speed and viewing direction, the planning task is complete. The graph is initialized with the current configuration of the dynamic window: the root node is the current robot position, viewing direction, and rotation and translation speed. The dynamic window is used to give a fully specified set of next actions, which are applied to that node, and the graph expands. All subsequent nodes are expanded by using only a sampled subset of the corresponding dynamic window, valid only for each node (see Fig. 5 c). Then the sequence of best motion actions is applied to the robot's driving system. The deviation from the best path is measured and, if the difference rises above a threshold, complete replanning is done. The same is done when the person changes his/her position too much.
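As a compact illustration of this expansion loop, the sketch below keeps an open list ordered by accumulated travel cost plus heuristic (the bookkeeping behind Eq. (4)), samples a few random speed pairs per node as the virtual dynamic window, and rolls each sample forward with a simple differential-drive update. All numeric bounds, the sampling scheme and the helper callables (travel_cost, heuristic, is_target) are placeholders, not the values used on the ALIAS robot.

```python
import heapq
import math
import random

def expand_random_tree(root_state, travel_cost, heuristic, is_target,
                       n_samples=8, dt=0.5, max_expansions=5000):
    """A*-style expansion of a random tree over (v_rot, v_trans, x, y, phi) states (sketch).

    travel_cost(s, s') -- the cost C(s, s') of section IV-D (infinite on collision)
    heuristic(s)       -- the heuristic h(s), e.g. 5D distance to the target cell
    is_target(s)       -- True if s lies in the target region with admissible speed/heading
    Returns the state sequence from the root to the target, or None.
    """
    counter = 0  # tie-breaker so the heap never has to compare state tuples
    open_list = [(heuristic(root_state), counter, 0.0, root_state, [root_state])]

    for _ in range(max_expansions):
        if not open_list:
            break
        _, _, g, state, path = heapq.heappop(open_list)
        if is_target(state):
            return path

        v_rot, v_trans, x, y, phi = state
        for _ in range(n_samples):   # sampled subset of this node's dynamic window
            nv_rot = v_rot + random.uniform(-0.3, 0.3)                 # placeholder accel. bounds
            nv_trans = max(0.0, v_trans + random.uniform(-0.2, 0.2))
            nphi = phi + nv_rot * dt
            nx = x + nv_trans * dt * math.cos(nphi)
            ny = y + nv_trans * dt * math.sin(nphi)
            new_state = (nv_rot, nv_trans, nx, ny, nphi)

            new_g = g + travel_cost(state, new_state)   # accumulated real travel cost
            if math.isinf(new_g):
                continue                                # trajectory hit an obstacle
            counter += 1
            # priority = accumulated cost + heuristic, cf. the cost update of Eq. (4)
            heapq.heappush(open_list,
                           (new_g + heuristic(new_state), counter, new_g,
                            new_state, path + [new_state]))
    return None
```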
C. Extracting the target region
To navigate with the Dynamic Window, we use local occupancy maps to represent the obstacle situation around the robot. In this grid representation, we also have to rasterize the personal space values p(x', y') to merge the costs of the personal space with the costs of obstacles and create an optimal path. Each planning algorithm has to know the target state to which the system has to drive. Since we have a rasterized personal space, we can easily extract the minimum value p_min(x', y'). For a grid-based planner this would be sufficient as the target region, but for expanding random trees it is very unlikely to hit exactly that single cell. So the target region has to grow to increase the probability of hitting a target cell. We do so by adding a small value ε to the minimum, and each cell with a value below p_min(x', y') + ε is called a target cell (see Fig. 5 a). For each target cell we also store the required orientation of the robot towards the person. Planning is complete when the first lattice of the graph hits a target cell, the speed of the robot at that cell is near zero, and the viewing direction of the robot is nearly towards the person to approach.
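A short sketch of this target region extraction: all grid cells whose rasterized personal space value lies within a tolerance of the minimum are collected as target cells. The tolerance value and the array layout are assumptions; the paper does not give a numeric value for the threshold.

```python
import numpy as np

def extract_target_cells(personal_space_grid, eps=0.05):
    """Return the indices of all cells with value <= p_min + eps (illustrative sketch).

    personal_space_grid -- 2D array of rasterized p(x', y') values
    eps                 -- growth of the target region around the minimum (assumed value)
    """
    p_min = personal_space_grid.min()
    return np.argwhere(personal_space_grid <= p_min + eps)
```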
Fig. 5. Different stages of creating the expanding random tree. In a) the root node at the current robot state is created and expanded with the actions from the full dynamic window; the target region is also defined and each node gets new state variables. In b) a new individual dynamic window is constructed in each node to define a set of new possible trajectories. In c) only a subset of these trajectories is used per node to expand the graph towards the target region.
Fig. 6. To estimate the personal space costs, the trajectory is rasterized and the personal space costs along the trajectory are summed up.
D. The cost function
The last piece (and core component) needed to understand the graph structure is the calculation of the traveling cost function C(s_{i−1}, s_i) from one node to the next. It consists of two components: one is the cost component from the personal space, and the second is the traveling time. In a differential drive system the robot can only drive straight lines or piecewise defined circular arcs. The radius of the circles is simply V_trans / V_rot; so, when V_rot approaches zero, the radius becomes infinitely large. Given V_rot and the prediction time interval ∆t, one can easily calculate the rotation angle (V_rot · ∆t) and the length of the trajectory segment l_ij. To compute the traveling time t_{s_i,s_j} we calculate t_{s_i,s_j} = l_ij / V_trans. The costs of the personal space are harder to calculate: we use the rasterization of the trajectory and sum up all costs along the rasterized trajectory (see Fig. 6). The costs k(s_i, s_j) over all cells (x_n, y_n) that are part of the trajectory are:
k(s_i, s_j) = Σ_n p(x'_n, y'_n),   for (x_n, y_n) ∈ traj(s_i, s_j)    (5)
If the trajectory hits an obstacle, the traveling costs are set
to infinity. The resulting costs are the sum of both values:
C(s i−1 , s i ) = t(s i−1 , s i ) + k(s i−1 , s i )
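A sketch of the resulting traveling cost, assuming the trajectory between two node states is approximated by a rasterized straight segment; the grid accessors `is_occupied` and `value_at` are hypothetical helpers, and a real implementation would rasterize the circular arc of the differential-drive motion instead.

```python
import math

def traveling_cost(state_i, state_j, personal_space_grid, occupancy_grid, cell_size=0.05):
    """C(s_i, s_j) = t(s_i, s_j) + k(s_i, s_j), cf. section IV-D (illustrative sketch)."""
    _, v_trans, xi, yi, _ = state_i
    _, _, xj, yj, _ = state_j

    # traveling time t = l_ij / v_trans (segment length as arc-length approximation)
    l_ij = math.hypot(xj - xi, yj - yi)
    t = l_ij / max(v_trans, 1e-3)

    # Eq. (5): sum of personal space costs over the rasterized trajectory;
    # an obstacle on the trajectory makes the cost infinite
    k = 0.0
    n_cells = max(1, int(l_ij / cell_size))
    for n in range(n_cells + 1):
        x = xi + (xj - xi) * n / n_cells
        y = yi + (yj - yi) * n / n_cells
        if occupancy_grid.is_occupied(x, y):        # hypothetical grid interface
            return math.inf
        k += personal_space_grid.value_at(x, y)     # hypothetical grid interface

    return t + k
```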
V. EXPERIMENTS
A problem when approaching a person is the estimation of the person's position and the associated measurement noise.
TABLE II
THE VARIANCE OF THE END POSITION OF THE ROBOT VS. THE VARIANCE OF THE PERSON'S UPPER BODY POSITION.

        | scen. 1(I) | scen. 1(II) | scen. 2(I) | scen. 2(II)
σ_pers  | (0.3, 0.1) | (0.2, 0.1)  | (0.2, 0.1) | (0.2, 0.2)
σ_rob   | (0.3, 0.1) | (0.4, 0.2)  | (0.2, 0.2) | (0.2, 0.4)
We plan to detect the upper body pose by fusing two standard tracking methods, namely the leg-pair detector of [32] using the laser range scanner and the OpenNI full body pose tracker using the Kinect. To test the stability and robustness of the approach against this noise, we investigated two scenarios, one in a narrow space and one in a large room of our lab. We use a simulator to avoid the problems of person detection and to control the (simulated) measurement noise of the person's and robot's pose. We could also show in first tests that the approach runs well on the real robot, but there one has to face the challenging task of upper body pose estimation. To investigate the stability of the approaching behavior, the positions of the person and the robot were chosen randomly within a circle around a marked position. The robot and the person should each face towards a given direction. For each of the two locations, we defined two person positions with different viewing angles and performed ten runs for each position. So we had a set of four trials with a total of 40 single runs. The variance of the final positions of the robot and the variance of the person positions are shown in Table II. From the experimental setup we have uncertainties of 0.1 m to 0.3 m in the person position. The question to be answered in our experiments is how much the uncertainty of the target position of the robot increases when approaching a person. To this end, we record the trajectory of the robot and calculate the mean and standard deviation of the final robot poses. The results are also shown in Table II. The average distance from the person is 0.7 meters, and the variance is usually of the same magnitude as the variance of the person's pose. In two cases, the variance in one direction is increased by 0.2 m, which is a result of the target region calculation with its simple threshold heuristic. Figure 7 shows
the path and the mean person position with variance for all four test cases. The quality of the trajectories also gives an impression of the stability of the method. Scenario 2 shows how the upper body pose heavily influences the trajectory of the robot. Scenario 1 shows that in narrow spaces the trajectory has to follow the physical restrictions and only the upper body pose is considered. The personal space has to be intruded if there is no other option.
Fig. 7. Resulting trajectories of the two tested scenarios. Per scenario two different poses are evaluated by the user (I and II). The mean positions of the user are shown as black dots, the mean upper body poses as arrows. In each scenario the blue lines denote the robot’s trajectories corresponding to the first person setup, while the red lines show trajectories of the second setup. Red circles denote the mean starting position of the robot. Both scenarios show, how the upper body pose influences the approaching trajectory. Scenario 2 also shows, that the social zone is respected if there is room to navigate.
VI. CONCLUSIONS
In this paper we presented a method, working within the Dynamic Window Approach, to approach a person while considering his/her personal space. By using a planning strategy, we could demonstrate that a stable and reliable solution can be achieved. Nevertheless, the method for extracting the target region could be improved in future work. We also want to include obstacles in the personal space model to improve planning quality, and to focus on the task of real-time replanning when the person changes his/her pose while the robot approaches. And of course we plan to couple the planning approach with data from a real person tracker to show these results at the conference.
ACKNOWLEDGMENT
This work was financed by the project AAL-2009-2-049 ”Adaptable Ambient Living Assistant” (ALIAS) co-funded by the European Commission and the Federal Ministry of Education and Research (BMBF) in the Ambient Assisted Living (AAL) program.
REFERENCES
[1] C. Pastor et al, “Affective robotics for assisting elderly people,” in Conf. Proc. Assistive Technology From Adapted Equipment to Inclusive Environments: AAATE 2009, Florence, Italy, 2009, pp. 153–258.
[2] M.E. Pollack et al, “Pearl: A mobile robotic assistant for the elderly,” in AAAI Workshop on Automation as Eldercare, Munich, Germany, 2002. [3] F. Weisshardt et al, “Making high-tech service robot platforms avail- able,” in Joint Int. Conf. ISR/ROBOTIK2010, 2010, pp. 1115–1120. [4] Ch. Schroeter, M. Hoechemer, St. Mueller, and H.-M. Gross, “Au- tonomous robot cameraman - observation pose optimization for a mobile service robot in indoor living space,” in Proc. ICRA , Kobe, Japan, 2009, pp. 424–429. [5] T. Kanda, M. Shiomi, Z. Miyashita, H. Ishiguro, and N. Hagita, “A communication robot in a shopping mall,” IEEE Transactions on Robotics, vol. 26, no. 5, pp. 897–913, 2010. [6] H.-M. Gross, H.-J. Boehme, Ch. Schroeter, St. Mueller, A. Koenig, E. Einhorn, Ch. Martin, M. Merten, and A. Bley, “Toomas: Interactive shopping guide robots in everyday use - final implementation and experiences from long-term field trials,” in Proc. IROS , St. Louis, 2009, pp. 2005–2012. [7] C. Nass, J. Steuer, and E.R. Tauber, “Computers are social actors,” in
Proc. the Conference on Human Factors in Computing , 1994, pp. 72–78.
[8] B. Reeves and C. Nass, The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places, CSLI Press, Stanford, 1996. [9] F. Walhoff and E. Bourginion, “Alias project description,” ALIAS home page, http://www.aal-alias.eu/content/project-overview. [10] D. Fox, W. Burgard, and S. Thrun, “The dynamic window approach to collision avoidance,” IEEE Robotics and Automation Magazine, vol. 4, no. 1, pp. 23–33, 1997. [11] E.T. Hall, The hidden dimension, Doubleday, NY, 1966.
[12] E.T. Hall, “Proxemics,” Current Anthropology, vol. 9, no. 2, pp. 83+,
1968.
[13] D.L. Gillespie and A. Leffler, “Theories of nonverbal behaviour: A critical review of proxemics research,” Sociological Theory , vol. 1, pp.
120–154, 1983.
[14] A. Leffler, D.L. Gillespie, and J.C. Conaty, “The effects of status
differentiation on nonverbal behaviour,” Social Psychology Quarterly,
vol. 45, pp. 153–161, 1982.
[15] H.W. Smith, “Territorial spacing on a beach revisited: A cross-national exploration,” Social Psychology Quaterly, vol. 44, pp. 132–137, 1981. [16] C. Albas and D. Albas, “Aligning actions: The case of subcultural proxemics,” Canadian Ethnic Studies, vol. 21, pp. 74–82, 1989. [17] K.A. Krail and G. Leventhal, “The sex variable in the intrusion of personal space,” Sociometry, vol. 39, pp. 170–173, 1976. [18] J.L. Williams, “Personal space and its relation to extraversion- introversion,” Canadian Journal of Behavioural Science , vol. 3, pp. 156–160, 1971. [19] K. Dautenhahn et al, “How may i serve you? a robot companion approaching a seated person in a helping context,” in Proc. HRI , 2006, pp. 172–179. [20] K. Koay et al, “Exploratory study of a robot approaching a person in the context of handing over an object,” in AAAI Spring Symposia, 2007. [21] L Takayama and C. Pantofaru, “Influences on proxemic behaviours in human-robot interaction,” in Proc. IROS, 2009, pp. 5495–5502. [22] Ch. Hu, X. Ma, and X. Dai, “Reliable person following approach for mobile robot in indoor environment,” in Proc. 8th IEEE Int. Conf. on Machine Learning and Mechatronics, 2009, pp. 1815–1821.
[23] Z. Chen and S.T. Birchfield, “Person following with a mobile robot using binocular feature-based tracking,” in Proc. IROS, 2007, pp. 815–820.
[24] X. Ma, C. Hu, X. Dai, and K. Qian, “Sensor integration for person tracking and following with mobile robot,” in Proc. IROS, 2008, pp. 3254–3259.
[25] E. Pacchierotti, H.I. Christensen, and P. Jensfelt, “Evaluation of passing distance for social robots,” in Proc. RO-MAN, 2006. [26] F. Hoeller, D. Schulz, M. Moors, and F.E. Schneider, “Accompanying persons with a mobile robot using motion prediction and probabilistic roadmaps,” in Proc. IROS, 2007, pp. 1260–1265. [27] M. Svenstrup, S. Tranberg, H.J. Andersen, and T. Bak, “Pose estimation and adaptive robot behaviour for human-robot interaction,” in Proc. ICRA, 2009, pp. 3571–3576. [28] E.A. Sisbot, Towards Human-Aware Robot Motions, PhD Thesis, Univ. of Toulouse, Toulouse, 2006. [29] S. Satake et al, “How to approach humans? - strategies for social robots to initiate interaction,” in Proc. of HRI, 2009, pp. 109–116. [30] J.M. Phillips, L.E. Kavraki, and N. Bedrossian, “Spacecraft rendezvous and docking with real-time, randomized optimization,” in AIAA Guidance, Navigation, and Control, 2003. [31] P.E. Hart, N.J. Nilsson, and B. Raphael, “A formal basis for the heuristic determination of minimum cost paths,” IEEE Transactions on Systems, Science and Cybernetics, vol. 4, pp. 100–107, 1968. [32] K. Arras, O.M. Mozos, and W. Burgard, “Using boosted features for the detection of people in 2d range data,” in Proc. ICRA, 2007.
Speech to Head Gesture Mapping in Multimodal Human-Robot Interaction
Amir Aly and Adriana Tapus Cognitive Robotics Lab, ENSTA-ParisTech, France { amir.aly, adriana.tapus } @ensta-paristech.fr
Abstract — In human-human interaction, para-verbal and non- verbal communication are naturally aligned and synchronized. The difficulty encountered during the coordination between speech and head gestures concerns the conveyed meaning, the way of performing the gesture with respect to speech character- istics, their relative temporal arrangement, and their coordinated organization in a phrasal structure of utterance. In this research, we focus on the mechanism of mapping head gestures and speech prosodic characteristics in a natural human-robot interaction. Prosody patterns and head gestures are aligned separately as a parallel multi-stream HMM model. The mapping between speech and head gestures is based on Coupled Hidden Markov Models (CHMMs), which could be seen as a collection of HMMs, one for the video stream and one for the audio stream. Experimental results with Nao robot are reported.
Index Terms— Coupled HMM, audio-video signal synchroniza- tion, signal mapping
I. INTRODUCTION
Robots are more and more present in our daily lives and the new trend is to make them behave more naturally so as to obtain an appropriate social behavior and response. The work described in this paper presents a new methodology that allows the robot to automatically adapt its head gestural behavior to the user's profile (e.g., the user's prosodic patterns) and therefore to produce a personalizable interaction. This work is based on findings in the linguistic literature showing that head movements (e.g., nodding, the turn-taking system) support the verbal stream. Moreover, in human-human communication, prosody expresses the rhythm and intonation of speech and reflects various features of the speakers. These two communication modalities are strongly linked together and synchronized. Humans use gestures and postures as a communicative act. McNeill in [1] defines a gesture as a movement of the body synchronized with the flow of speech. The mechanism of the human natural alignment of the verbal and non-verbal characteristic patterns, based on the work described in [2], shows a direct relationship between prosody features and gestures/postures, and constitutes an inspiration for our work. Recently, there has been a growth of interest in socially intelligent robotic technologies featuring flexible and customizable behaviors. Based on the literature in linguistics and psychology that suggests that prosody and gestural kinematics are synchronous and therefore strongly linked together, we posit that it is important to have a robot behavior that integrates this element. Therefore, in this paper, we describe a new
methodology for speech prosody and head gesture mapping for human-robot social interaction. The gesture/prosody modeled patterns are aligned separately as a parallel multi-stream HMM model and the mapping between speech and head gestures
is based on Coupled Hidden Markov Models (CHMMs).
A specific gestural behavior is estimated according to the
incoming voice signal’s prosody of the human interacting with
the robot. This permits the robot to adapt its behavior to the
user profile (e.g. here the user prosodic patterns) and therefore
to produce a personalizable interaction. To the best of our knowledge, very little research has been
dedicated to this research area. An attempt is described by the authors in [3], who present a robotic system that uses dance to explore the properties of rhythmic movement in general social interaction. Most of the existing works are related to computer graphics and interactive techniques. A general correlation between head gestures and voice prosody has been discussed in [4], [5]. The emotional content of the speech can also be correlated to some bodily gestures. In [6], the relation between voice prosody and hand gestures is discussed, while [7] discusses the relation between the verbal and semantic content and the gesture. Reference [8], which is somewhat close to the topic discussed in this research, presents the relation between prosody changes and the orientation of the head (Euler angles). Moreover, the authors in [9] proposed a mechanism for driving head gestures from speech prosody. Our work presents a framework for head gesture and prosody correlation for automatic robot gesture production from the speech of an interacting human user. The system is validated with the Nao robot in order to find out how naturalistic the head gestures driven from a voice test signal are with respect to an interacting human speaker. The rest of the paper is organized as follows: section II presents the applied algorithm for extracting the pitch contour of a voice signal; section III illustrates the detection of head poses and Euler angles; section IV describes speech and gesture temporal segmentation; section V presents the speech to head gesture coupling by using CHMMs; section VI summarizes the obtained results; and finally, section VII concludes the paper.
II. PROSODIC FEATURES EXTRACTION
In human-robot interaction applications, the human voice signal can convey many messages and meanings, which should
be understood appropriately by the robot in order to interact
properly. Next, we describe the methodology used for pitch
extraction.
Fig. 1: Pitch Tracking
Talkin [10] defined the pitch as the auditory percept of tone, which is not directly measurable from a signal. Moreover, it is a nonlinear function of the signal's spectral and temporal energy distribution. Instead, another vocal characteristic, the fundamental frequency F0, is measured, as it correlates well with the perceived pitch. Voice processing systems that estimate the fundamental frequency F0 often have 3 common processes: (1) signal conditioning; (2) candidate period estimation; and (3) post-processing. The signal preconditioning process is concerned with removing interfering signal components like noise and DC offset, while the post-processing process chooses the most likely candidate period in order to precisely estimate the fundamental frequency F0. Talkin in [10] developed the traditional normalized cross-correlation (NCC) method further in order to reliably estimate the voicing periods and the fundamental frequency F0 by considering all candidates simultaneously in a large temporal context. This methodology uses a two-pass normalized cross-correlation (NCC) calculation for searching the fundamental frequency F0, which reduces the overall computation load with respect to the traditional NCC method. The procedure of the algorithm is illustrated in Figure 1. In this work, we choose to express the characterizing vector of the voice signal in terms of the pitch and the intensity of the signal.
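As an illustration of the candidate-period search only, the following minimal Python sketch estimates F0 for one conditioned frame with a single-pass normalized cross-correlation; the two-pass search, the dynamic-programming post-processing and the parameter values of Talkin's full algorithm are omitted, and all names and thresholds are assumptions.

```python
import numpy as np

def estimate_f0(frame, sr, f0_min=60.0, f0_max=400.0):
    """Sketch of NCC-based F0 estimation for a single frame sampled at `sr` Hz."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()                    # crude signal conditioning: remove DC offset
    lag_min, lag_max = int(sr / f0_max), int(sr / f0_min)
    n = len(frame) - lag_max
    if n <= 0:
        return 0.0                                  # frame too short for the requested lag range
    best_lag, best_ncc = 0, 0.0
    for lag in range(lag_min, lag_max):             # candidate period estimation
        a, b = frame[:n], frame[lag:lag + n]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
        ncc = np.dot(a, b) / denom if denom > 0 else 0.0
        if ncc > best_ncc:
            best_lag, best_ncc = lag, ncc
    if best_ncc < 0.5:                              # naive voiced/unvoiced decision (assumed threshold)
        return 0.0                                  # treat as unvoiced
    return sr / best_lag                            # F0 in Hz
```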
III. HEAD POSE ESTIMATION
During social human-robot interaction, robots should be able to estimate the human head pose. This can help the robot to understand the human focus of attention and/or the meaning of the spoken message. The authors in [11] present a survey
Fig. 2: Detecting the face rectangle that contains all salient points
on the different existing algorithms for head pose estimation and detection. The methodology we use for the detection of head poses is the Viola and Jones algorithm [12]. After extracting the head region, the eyes are detected by the valley points detection algorithm [13]. After detecting the location of the eyes, it is possible to detect the location of the other salient points of the face using the geometry of the face [14]. For example, if the distance between the two eye points (1 & 3) equals D (see Figure 2), and point 2 is the midpoint between the eyes, then the mouth point 4 is located at a distance of 1.1D downwards from point 2. The X-Y coordinates of the rectangle surrounding the salient points of the face (points 5, 6, 7, and 8) (see Figure 2) can be specified as follows:
• The difference between the Y-coordinates of points (5 & 1 or 3 & 7) = 0.2 · 1.8D
• The difference between the X-coordinates of points (5 & 1 or 3 & 7) = 0.225 · 1.8D
After calculating the coordinates of points (5, 7), the coordinates of points (6, 8) are directly calculated based on the vertical distance between points (7 & 8 or 5 & 6), which is equal to 1.8D. One of the problems that may appear when detecting the surrounding rectangle of the facial salient points is the rotation of the head clockwise and counterclockwise (see Figure 3). Therefore, the (X, Y) coordinates of the eyes have to be rotated first to (X', Y') before following the previous steps in order to determine the points of the surrounding rectangle, because the above-mentioned relations are only valid when the eye coordinates are in the same plane as the face (i.e., if the face is rotated, the coordinates of the eyes also have to be located in the rotated plane). The direction of rotation is detected by calculating the slope (i.e., rotation angle θ) of the line passing through the two eyes using their (X, Y) coordinates. The rotation of the axes is described by the following equations:
X' = X\cos\theta - Y\sin\theta    (1)

Y' = X\sin\theta + Y\cos\theta    (2)
Fig. 3: Tracking Salient Face Details in Different Cases of Rotation: (a) Rotation clockwise; (b) Rotation counterclockwise
After calculating the coordinates of a face salient point in the rotated (X', Y') plane while the head is rotating clockwise or counterclockwise, it is important to apply the inverse rotation each time to detect the location of this point in the (X, Y) plane. The pose of the head is calculated in terms of Euler angles: Pitch, Yaw, and Roll. These angles are calculated using the previously detected 8 salient facial points and their relative positions, based on the geometry and symmetry of faces (Figure 2), as follows:
\mathrm{Yaw}_i = b_1 \left[ \frac{(P2_{x,i} - C1_{x,i}) + (P2_{x,i} - C2_{x,i})}{2D_{eyes,0}} - \frac{(P2_{x,0} - C1_{x,0}) + (P2_{x,0} - C2_{x,0})}{2D_{eyes,0}} \right] + b_2 \left[ \frac{(P4_{x,i} - C1_{x,i}) + (P4_{x,i} - C2_{x,i})}{2D_{eyes,0}} - \frac{(P4_{x,0} - C1_{x,0}) + (P4_{x,0} - C2_{x,0})}{2D_{eyes,0}} \right]    (3)

\mathrm{Pitch}_i = b_3 \left[ \frac{(P2_{y,i} - C3_{y,i}) + (P2_{y,i} - C4_{y,i})}{2D_{eyes,0}} - \frac{(P2_{y,0} - C3_{y,0}) + (P2_{y,0} - C4_{y,0})}{2D_{eyes,0}} \right] + b_4 \left[ \frac{(P4_{y,i} - C3_{y,i}) + (P4_{y,i} - C4_{y,i})}{2D_{eyes,0}} - \frac{(P4_{y,0} - C3_{y,0}) + (P4_{y,0} - C4_{y,0})}{2D_{eyes,0}} \right]    (4)

where:
• P2_{x,i}, P4_{x,i}: the x coordinates of the midpoint between the eyes and of the mouth point, respectively (see Figure 2), in frame i of the video.
• P2_{y,i}, P4_{y,i}: the y coordinates of the midpoint between the eyes and of the mouth point, respectively (see Figure 2), in frame i of the video.
• P2_{x,0}, P4_{x,0}: the x coordinates of the midpoint between the eyes and of the mouth point, respectively (see Figure 2), in frame 0, which is the reference frame of the video (1st frame).
• P2_{y,0}, P4_{y,0}: the y coordinates of the midpoint between the eyes and of the mouth point, respectively (see Figure 2), in frame 0, which is the reference frame of the video (1st frame).
TABLE I: Yaw and Pitch Initial Angles (Frames 1-3) Used for Calculation of Regression Values

         Frame 1 (reference)   Frame 2   Frame 3
Yaw      0                     0.0016    0.0034
Pitch    0                     0.00255   0.0075
• C1_{x,i}: the x coordinate of the center point between point 5 and point 6 (Figure 2), in frame i.
• C2_{x,i}: the x coordinate of the center point between point 7 and point 8 (Figure 2), in frame i.
• C3_{y,i}: the y coordinate of the center point between point 5 and point 7 (Figure 2), in frame i.
• C4_{y,i}: the y coordinate of the center point between point 6 and point 8 (Figure 2), in frame i.
The regression values b_1, b_2, b_3, and b_4 are constants throughout all the video frames. They are calculated by fixing the absolute values of the Yaw and Pitch angles in the second and third frames (according to an empirical test), as shown in Table I. Substituting the second and third values of Pitch and Yaw in equations 3 and 4 leads directly to the computation of the constants b_1, b_2, b_3, and b_4. The calculation of the Roll angle is straightforward; it depends on the coordinates of the midpoint between the eyes (point 2) in frame i with respect to the reference frame [15], and the value of the Roll angle in the first (reference) frame equals 0.
\mathrm{Roll}_i = \tan^{-1}\left(\frac{-P2_{x,i}}{P2_{y,i}}\right) - \tan^{-1}\left(\frac{-P2_{x,0}}{P2_{y,0}}\right)    (5)
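A compact sketch of equations (3)-(5) is given below. The dictionary-based point layout, the variable names, and the use of atan2 in place of tan^{-1} are assumptions made only for illustration; the constants b_1..b_4 are the regression values obtained via Table I.

```python
import math

def euler_angles(pts_i, pts_0, d_eyes_0, b=(1.0, 1.0, 1.0, 1.0)):
    """Sketch of equations (3)-(5): yaw, pitch, roll of frame i relative to frame 0.
    `pts_i` / `pts_0` map point names ('P2', 'P4', 'C1'..'C4') to (x, y) pixel
    coordinates; `d_eyes_0` is the eye distance in the reference frame."""
    b1, b2, b3, b4 = b

    def term(p, c_a, c_b, axis):
        # ((P - C_a) + (P - C_b)) / (2 D_eyes,0), frame i minus reference frame 0
        num_i = (pts_i[p][axis] - pts_i[c_a][axis]) + (pts_i[p][axis] - pts_i[c_b][axis])
        num_0 = (pts_0[p][axis] - pts_0[c_a][axis]) + (pts_0[p][axis] - pts_0[c_b][axis])
        return (num_i - num_0) / (2.0 * d_eyes_0)

    yaw = b1 * term('P2', 'C1', 'C2', 0) + b2 * term('P4', 'C1', 'C2', 0)    # eq. (3)
    pitch = b3 * term('P2', 'C3', 'C4', 1) + b4 * term('P4', 'C3', 'C4', 1)  # eq. (4)
    roll = math.atan2(-pts_i['P2'][0], pts_i['P2'][1]) - \
           math.atan2(-pts_0['P2'][0], pts_0['P2'][1])                       # eq. (5)
    return yaw, pitch, roll
```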
IV. SPEECH AND HEAD GESTURE SEGMENTATION
The mapping between speech and head gestures is done by using the Coupled Hidden Markov Models (CHMMs), which could be seen as a collection of HMMs, one for the video stream and one for the audio stream. The advantage of this
model over a lot of other topologies is its ability to capture
the dual influences of each stream on the other one across
time (see Figure 6). In the beginning, speech and head gestures
streams are aligned separately as a parallel multi-stream HMM model.
The mapping between speech and head gestures is performed in 2 main steps: (1) the first is modeling the gesture sequence and the associated voice prosody sequence (in terms of their characteristic vectors) into two separate HMMs; (2) then, after training both models, a correlation between the two HMMs is necessary so as to estimate a final head gesture state sequence given a speech test signal. The HMM structure used in analyzing gestures (and similarly voice prosody) is indicated in Figure 4. It is composed of N parallel states, where each one represents a gesture composed of M observations. The goal of the transition from state S_END to S_START is to continue the transition between states 1 to N (e.g., after performing gesture state 1, the model transfers from the transient end state to the start state to perform any gesture state from 2 to N in a sequential way, and so on). In order to be able to model gestures/prosody, it is necessary to make a temporal segmentation of the video
Fig. 4: HMM structure for gesture and prosody analysis
TABLE II: Voice Signal Segmentation Labels

Trajectory Class   Trajectory State
1                  pitch ↑ & intensity ↑
2                  pitch ↑ & intensity ↓
3                  pitch ↓ & intensity ↑
4                  pitch ↓ & intensity ↓
5                  Unvoiced segment
content to detect the M number of observations in each state and the total number of states N .
A. Speech Temporal Segmentation
Speech is segmented into syllables represented by the states 1 to N as indicated in Figure 4. The segmentation is performed by intersecting the inflection points (zero crossings of the rate of change of the curve) of both the pitch and intensity curves, besides the points that separate the voiced and unvoiced segments of the signal (see Figure 5 for an example of pitch and intensity curves). When comparing the two curves together, 5 different trajectory states can result [16] (see Table II). The goal is to code each segment of the signal with its corresponding pitch-intensity trajectory class (e.g., a voice signal segment coding could be: 5, 3, 4, 2, etc.). This segmental coding is used as labels for CHMM training. The next step corresponds to segmenting the voice signal, with its corresponding trajectory labeling, into syllables. Arai and Greenberg in [17] defined the average duration of a syllable as 200 ms, and this duration can increase or decrease according to the nature of the syllable as being short or long. Practical tests proved that
TABLE III: Gesture Segmentation Labels

Trajectory Class   Trajectory State (Rate of Change)
1                  Yaw ↑ & Pitch ↑
2                  Yaw ↑ & Pitch ↓
3                  Yaw ↓ & Pitch ↑
4                  Yaw ↓ & Pitch ↓
5                  No change
Fig. 6: Coupled Hidden Markov Model CHMM lag-1 Structure
within a syllable of duration varying from 180 ms to 220 ms, the average number of trajectory classes in its corresponding pitch and intensity curves is around 5. Therefore, given the voice signal with its segments coded by the corresponding pitch-intensity trajectory labels, each 5 segments of the signal will create a syllable state (from 1 to N ) and the corresponding 5 labels will be the observations M within the syllable state.
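The following sketch illustrates the segmental coding of Table II and the grouping into syllable states. As a simplification it labels frames rather than inflection-delimited segments (consecutive identical labels then correspond to one segment), and all names and array layouts are assumptions rather than the authors' implementation.

```python
import numpy as np

def trajectory_classes(pitch, intensity, voiced):
    """Assign each frame a pitch-intensity trajectory class (1-5, Table II)."""
    dp = np.gradient(np.asarray(pitch, dtype=float))
    di = np.gradient(np.asarray(intensity, dtype=float))
    labels = []
    for k in range(len(pitch)):
        if not voiced[k]:
            labels.append(5)            # unvoiced segment
        elif dp[k] >= 0 and di[k] >= 0:
            labels.append(1)            # pitch up,   intensity up
        elif dp[k] >= 0:
            labels.append(2)            # pitch up,   intensity down
        elif di[k] >= 0:
            labels.append(3)            # pitch down, intensity up
        else:
            labels.append(4)            # pitch down, intensity down
    return labels

def syllable_states(segment_labels, per_state=5):
    """Every 5 consecutive segment labels form one syllable state; the labels
    are the M observations within that state."""
    return [segment_labels[k:k + per_state]
            for k in range(0, len(segment_labels) - per_state + 1, per_state)]
```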
B. Gestures Temporal Segmentation
The average duration for making gestures, in general, varies between 0.1 and 2.5 seconds according to the speed and the performed gesture, for example a pointing or a head gesture. In the case of head gestures, the average duration of performing a gesture is limited to about 0.4 seconds [18, 19]. In our case, the camera used to capture the gestures records 30 frames/second, and therefore we estimate that on average 12 frames are sufficient to characterize a gesture. Similarly to the speech temporal segmentation (see Section IV-A), gesture temporal segmentation is performed by comparing the 9 trajectory classes according to the sinusoidal evolution of the extracted angle curves. However, the mechanical characteristics of our platform (the NAO robot) are limited only to pitch and yaw movements, therefore introducing only 5 trajectory classes (see Table III). In the context of the CHMM model, each group of 12 frames forms a complete gesture state from 1 to N, and the corresponding coding labels constitute the observations within the gesture state.
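Analogously, a minimal sketch of the gesture coding of Table III and the grouping into 12-frame gesture states could look as follows; the no-change threshold and all names are assumptions made for illustration.

```python
def gesture_classes(yaw, pitch, window=12, eps=1e-3):
    """Label each frame transition with a yaw-pitch trajectory class (1-5, Table III)
    and group every `window` labels into the observations of one gesture state."""
    labels = []
    for k in range(1, len(yaw)):
        dy, dp = yaw[k] - yaw[k - 1], pitch[k] - pitch[k - 1]
        if abs(dy) < eps and abs(dp) < eps:
            labels.append(5)                      # no change
        elif dy >= 0:
            labels.append(1 if dp >= 0 else 2)    # yaw up, pitch up / down
        else:
            labels.append(3 if dp >= 0 else 4)    # yaw down, pitch up / down
    return [labels[k:k + window] for k in range(0, len(labels) - window + 1, window)]
```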
V. SPEECH TO HEAD GESTURE COUPLING
A typical CHMM structure is shown in Figure 6, where the circles present the discrete hidden nodes/states while the rectangles present the observable continuous nodes/states, which contain the observation sequences of voice and gestures characteristics.
Fig. 5: Speech, Pitch and Intensity Curves (The red parts in the voice signal are the unvoiced parts, while blue parts are the voiced parts of the signal. The black points depict the inflection points of the signal, while green points represent the separating points between the unvoiced and the voiced segments.)
According to the sequential nature of gestures and speech, the CHMM structure is of type lag-1 in which couple (back- bone) nodes at time t are conditioned on those at time t − 1 [20, 21, 22]. A CHMM model λ C is defined by the following parameters:
\pi_0^C(i) = P(q_1^C = S_i)    (6)

a^C_{i|j,k} = P(q_t^C = S_i \mid q_{t-1}^{audio} = S_j, q_{t-1}^{video} = S_k)    (7)

b_t^C(i) = P(O_t^C \mid q_t^C = S_i)    (8)

where C ∈ {audio, video} denotes the audio and visual channels respectively, and q_t^C is the state of the coupling node in the C-th stream at time t [23, 24]. The training of this model is based on the maximum likelihood form of the expectation maximization (EM) algorithm. Supposing there are 2 observable sequences of the audio and video states O = {A_1^N, B_1^N}, where A_1^N = {a_1, ..., a_N} is the set of observable states in the first audio sequence, and similarly B_1^N = {b_1, ..., b_N} is the set of observable states in the second visual sequence, and S = {X_1^N, Y_1^N} is the set of states of the couple nodes of the first audio chain and the second visual chain respectively [21, 22]. The expectation maximization algorithm finds the maximum likelihood estimates of the model parameters by maximizing the following function [22]:

f(\lambda_C) = P(X_1) P(Y_1) \prod_{t=1}^{T} P(A_t \mid X_t) P(B_t \mid Y_t) P(X_{t+1} \mid X_t, Y_t) P(Y_{t+1} \mid X_t, Y_t), \quad 1 \le T \le N    (9)
where:
• P(X_1) and P(Y_1) are the prior probabilities of the audio and video chains respectively
• P(A_t | X_t) and P(B_t | Y_t) are the observation densities of the audio and video chains respectively
• P(X_{t+1} | X_t, Y_t) and P(Y_{t+1} | X_t, Y_t) are the couple node transition probabilities in the audio and video chains.
The training of the CHMM differs from the standard HMM in the expectation step (E), while they are both identical in the maximization step (M), which tries to maximize equation 9 in terms of the expected parameters [25]. The expectation step of the CHMM is defined in terms of the forward and backward recursions. For the forward recursion we define a variable for the audio and video chains at t = 1:

\alpha_{t=1}^{audio} = P(A_1 \mid X_1) P(X_1)    (10)

\alpha_{t=1}^{video} = P(B_1 \mid Y_1) P(Y_1)    (11)

Then the variable \alpha is calculated incrementally at any arbitrary moment t as follows:

\alpha_{t+1}^{audio} = P(A_{t+1} \mid X_{t+1}) \int\!\!\int \alpha_t^{audio} \, \alpha_t^{video} \, P(X_{t+1} \mid X_t, Y_t) \, dX_t \, dY_t    (12)

\alpha_{t+1}^{video} = P(B_{t+1} \mid Y_{t+1}) \int\!\!\int \alpha_t^{audio} \, \alpha_t^{video} \, P(Y_{t+1} \mid X_t, Y_t) \, dX_t \, dY_t    (13)
Meanwhile, for the backward direction there is no split in the calculated recursions, which can be expressed as follows:

\beta_{t+1}^{audio,video} = P(O_{t+1}^N \mid S_t) = \int\!\!\int P(A_{t+1}^N, B_{t+1}^N \mid X_{t+1}, Y_{t+1}) \, P(X_{t+1}, Y_{t+1} \mid X_t, Y_t) \, dX_{t+1} \, dY_{t+1}    (14)
After combining both forward and backward recursion parameters, an audio signal is tested on the trained model, generating a synthesized equivalent gesture sequence that most likely fits the model. The generated gesture sequence is determined when the change in the likelihood falls below a threshold.
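For a discrete-state model, the coupled forward recursion of equations (10)-(13) can be sketched as below; the array layouts and names, and the replacement of the integrals by sums over discrete states, are assumptions made for illustration only.

```python
import numpy as np

def chmm_forward(obs_a, obs_v, pi_a, pi_v, A_a, A_v, B_a, B_v):
    """Discrete-state sketch of the coupled forward pass.
    A_a[j, k, i] = P(X_{t+1}=i | X_t=j, Y_t=k), A_v[j, k, i] = P(Y_{t+1}=i | X_t=j, Y_t=k);
    B_a[i, o] / B_v[i, o] are observation densities; pi_a / pi_v the priors."""
    pi_a, pi_v = np.asarray(pi_a, dtype=float), np.asarray(pi_v, dtype=float)
    T, Na, Nv = len(obs_a), len(pi_a), len(pi_v)
    alpha_a, alpha_v = np.zeros((T, Na)), np.zeros((T, Nv))
    alpha_a[0] = B_a[:, obs_a[0]] * pi_a            # eq. (10)
    alpha_v[0] = B_v[:, obs_v[0]] * pi_v            # eq. (11)
    for t in range(T - 1):
        joint = np.outer(alpha_a[t], alpha_v[t])    # weight of each couple state (X_t=j, Y_t=k)
        alpha_a[t + 1] = B_a[:, obs_a[t + 1]] * np.einsum('jk,jki->i', joint, A_a)  # eq. (12)
        alpha_v[t + 1] = B_v[:, obs_v[t + 1]] * np.einsum('jk,jki->i', joint, A_v)  # eq. (13)
    return alpha_a, alpha_v
```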
VI. EXPERIMENTAL RESULTS
The experimental testbed used in this study is the humanoid robot Nao developed by Aldebaran Robotics. For the training and testing, we used the MVGL-MASAL gesture-speech
TABLE IV: Confusion matrix of the original and synthesized trajectories' classes

Synthesized/Real Gesture Classes    1     2     3     4     5
1                                   25    13    13    6     36
2                                   3     29    5     3     28
3                                   2     6     20    8     25
4                                   4     2     5     40    33
5                                   20    18    30    43    351
Turkish database [9]. The database is composed of 4 videos of different durations ranging from 6 to 8 minutes. It contains the audiovisual information of different subjects instructed to tell stories to a children's audience. We use one part of the database for training the models and the other part for testing. The audio signals are extracted and then processed in order to extract the relevant prosodic characteristics. The proposed speech-to-gesture mapping methodology was tested on the database using a cross-validation scheme. The system was trained on the audio/visual sequences of 3 videos from the database, and then tested on the audio sequence of the 4th video. The generated gestures are compared to the natural gesture sequence in the test video, and an average score of 62% was found in terms of the similarity of trajectory classes. Table IV depicts the confusion matrix between the original and synthesized gesture label trajectories. The confusion matrix reveals that trajectory state 5, in which there is no change in the Yaw and Pitch angles, is the dominant trajectory class. This can be a result of the smoothing processes and/or of the precision of the Euler angle extraction algorithm; however, this will not cause unnaturalness when the robot and the human interact in long conversations.
After calculating the score of similarity between the trajectory labels of the original and the synthesized signals, it is important to generate the corresponding Yaw and Pitch curves for the head motion and compare them to the original curves by calculating the total average root mean square (RMS) error between the corresponding curve points. The RMS errors found between the generated Yaw and Pitch curves with respect to the original curves are 10% and 12%, respectively. In fact, the obtained score of 62% and the RMS errors between the original and the synthesized curves can be considered a reasonable result, because the duration and the surrounding environment conditions of the test video and the training video set were similar. Also, the speaker's tonality in all training and test videos was similar. However, we do not yet know the score we will obtain in real applications where the robot will be tested under different conditions. The performed head gestures could differ in amplitude or direction from one person to another without hindering the transfer of the meaning of the gesture message between interacting humans and, similarly, between the interacting robot and human. Figures 7 and 8 show a comparison between a part of the original and synthesized pitch and yaw curves (after being smoothed by a median filter) of the test video from the database.
A video of the speech-gesture mapping system with Nao
Fig. 7: View of the original (blue curve) and synthesized (red curve) Pitch angles of a part of the test video
Fig. 8: View of the original (blue curve) and synthesized (red curve) Yaw angles of a part of the test video
robot is available at: http://www.ensta-paristech.fr/~tapus/HRIAA/media.
VII. CONCLUSIONS
This research focuses on synthesizing head gestures based on speech characteristics (e.g., pitch and intensity of the signal). Our mapping system is based on the Coupled Hidden Markov Model (CHMM), which tries to find a coupling joint between audio and visual sequences. The audio sequence is composed of parallel states representing the syllables, and each syllable is composed of a specific number of observations (M=5, in our case). Meanwhile, the video sequence has the same parallel construction, where the states represent the gestures and each state is composed of another specific number of observations determined experimentally (M=12, in our case). After training the CHMM on audio-visual sequences from a database, and when a test audio signal is given, the system tries to find a corresponding sequence of gestures based on its own experience learnt during the training phase. The generated gesture sequence is the sequence that achieves the maximum likelihood estimation with the speech test signal. Our system shows a score of 62%, which measures the similarity between the original gesture sequence labels and the synthesized gesture sequence labels, over a test video of 8 minutes. This can be considered a good score. The proposed system is able to generate appropriate robot head gestures from speech input, which allows it to produce an automatic natural robot behavior that is almost completely absent from present-day
human-robot interactions. Further work will focus on creating a triadic alignment between the speech, head gestures, and hand gestures in different human-robot interactional contexts that will allow the robot to interact naturally under different conditions.
REFERENCES
[1] D. McNeill, Hand and mind : what gestures reveal about thought . Chicago, USA: Chicago : University of Chicago Press, 1992.
[2] F. P. Eyereisen and J. D. D. Lannoy, Gestures and Speech: Psychological Investigations. Cambridge University Press, 1991.
[3] M. P. Michalowski, S. Sabanovic, and H. Kozima, “A dancing robot for rhythmic social interaction,” in Proceedings of the Human-Robot Interaction Conference, Arlington, USA, Mar. 2007, pp. 89–96.
[4] K. Munhall, J. A. Jones, D. E. Callan, T. Kuratate, and E. Vatikiotis- Bateson, “Visual prosody and speech intelligibility: Head movement improves auditory speech perception,” Psychological Science , vol. 15,
no. 2, pp. 133–137, 2004.
[5] T. Kuratate, K. G. Munhall, P. E. Rubin, E. Vatikiotis-Bateson, and
H. Yehia, “Audio-visual synthesis of talking faces from speech pro-
duction correlates,” in Proceedings of the 6th European Conference on Speech Communication and Technology (EUROSPEECH) , 1999, pp.
1279–1282.
[6] L. Valbonesi, R. Ansari, D. McNeill, F. Quek, S. Duncan, K. E. McCullough, and R. Bryll, “Multimodal signal analysis of prosody and hand motion: Temporal correlation of speech and gestures,” in Proceedings of the European Signal Processing Conference(EUSIPCO), vol. 1, 2005, pp. 75–78. [7] F. Quek, D. McNeill, R. Ansari, X. Ma, R. Bryll, S. Duncan, and K. Mc- Cullough, “Gesture cues for conversational interaction in monocular video,” in Proceedings of the ICCV , 1999, pp. 64–69.
[8] H. P. Graf, E. Cosatto, V. Strom, and F. J. Huang, “Visual prosody:
facial movements accompanying speech,” in Proceedings of IEEE Int. Conf. Automatic Face and Gesture Recognition, 2002, pp. 381–386. [9] M. E. Sargn, Y. Yemez, E.Erzin, and A. M. Tekalp, “Analysis of head gesture and prosody patterns for prosody-driven head-gesture anima- tion,” IEEE Transactions on Pattern Analysis and Machine Intelligence , vol. 30, no. 8, pp. 1330–1345, 2008. [10] D. Talkin, “A robust algorithm for pitch tracking,” in Speech Coding and Synthesis . W B Kleijn, K Paliwal eds, Elsevier, 1995, pp. 497–518. [11] E. M.Chutorian and M.M.Trivedi, “Head pose estimation in computer vision: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 607–626, 2009.
[12] P. Viola and M. J. Jones, “Robust real-time face detection,” International Journal of Computer Vision, vol. 57, pp. 137–154, 2004.
[13] K. Wong, K. Lam, and W. Siu, “A robust scheme for live detection of human faces in color images,” Signal Processing: Image Communica- tion, vol. 18, no. 2, pp. 103–114, 2003. [14] K. W. Wong, K. I. Lam, and W. Siu, “An efficient algorithm for human face detection and facial feature extraction under different conditions,” Pattern Recognition, vol. 34, no. 10, pp. 1993–2004, 2000.
[15] B. Yip, W. Y. Siu, and S. Jin, “Pose determination of human head using one feature point based on head movement,” in Proceedings of IEEE Int. Conf. on Multimedia and Expo (ICME), vol. 2, 2004, pp. 1183–1186. [16] F. Ringeval, J. Demouy, G. S. andM. Chetouani, L. Robel, J. Xavier, and D. C. Plaza, “Automatic intonation recognition for the prosodic assessment of language impaired children.” IEEE Transactions on Audio, Speech and Language Processing, vol. 99, pp. 1–15, 2010. [17] T. Arai and S. Greenberg, “The temporal properties of spoken japanese are similar to those of english,” in Proceedings of Eurospeech, Rhodes, Greece, 1997, pp. 1011–1114. [18] K. Nickel and R. Stiefelhagen, “Real-time recognition of 3d-pointing gestures for human-machine-interaction,” in Proceedings of DAGM- Symposium , Magdeburg, Germany, 2003, pp. 557–565. [19] S. A. Moubayed and J. Beskow, “Effects of visual prominence cues on speech intelligibility,” in Proceedings of the International Conference on Auditory-Visual Speech Processing (AVSP), Norwich, UK, 2009. [20] L. R. Rabiner, “A tutorial on hidden markov models and selected applications in speech recognition,” in Proceedings of the IEEE , vol. 77,
no. 2, 1989, pp. 257–286.
[21] I.Rezek, P. Sykacek, and S. Roberts, “Coupled hidden markov models for biosignal interaction modelling,” in Proceedings of the International Conference on Advances in Medical Signal and Information Processing (MEDSIP) , 2000.
[22] I. Rezek and S. J. Roberts, “Estimation of coupled hidden markov mod- els with application to biosignal interaction modelling,” in Proceedings of the IEEE International Workshop on Neural Networks for Signal Processing (NNSP) , Sydney, Australia, 2000. [23] A. V. Nean, L. Liang, X. Pi, X. Liu, and C. Mao, “A coupled hidden markov model for audio-visual speech recognition,” in Proceedings of the International Conference on Acoustics, Speech and Signal Process- ing (ICASSP) , vol. 2, Orlando, USA, 2002, pp. 2013–2016. [24] L. Liang, X. Liu, X. Pi, Y. Zhao, and A. V. Nean, “Speaker independent audio-visual continuous speech recognition,” in Proceedings of the International Conference on Multimedia and Expo (ICME) , vol. 2, Lausanne, Switzerland, 2002, pp. 25–28. [25] W. Penny and S. Roberts, “Gaussian observation hidden markov models for eeg analysis,” in Technical Report TR-98-12 , Imperial College London, UK, 1998.
Evaluation of the Robot Sign Language Tutoring Assistant Using Video-Based Studies
H. Kose-Bagci, R. Yorganci, H. E. Algan
Social Robotics Lab, Faculty of Computer and Informatics, Istanbul Technical University, Istanbul, Turkiye Email:{ hatice.kose, yorganci, algan}@itu.edu.tr
Abstract—There is an on-going study which aims to assist in teaching Turkish Sign Language (TSL) to hearing impaired children by means of non-verbal communication and imitation based interaction games between a humanoid robot and the child. In this study, the robot will be able to express a word in the TSL among a set of chosen words using hand movements, body and face gestures and having comprehended the word, the child will give relevant feedback to the robot. This paper aims to perform an evaluation on a subset of sample words chosen from TSL via the comparison of their different video representations carried out by human teachers and the simulated Nao robot. Several surveys and user studies have been realized to reveal the resemblance between the two types of videos involving the performance of the robot simulator and the human teacher for each chosen word. In order to investigate the perceived level of similarity between human and robot behavior, a number of participants with different sign language acquaintance have been asked to complete questionnaires. The results of these surveys have been summarized and the most significant factors affecting the comprehension of TSL words have been discussed. Index Terms— Humanoid Robots, Interaction games, Non- verbal communication, Sign Language
I. INTRODUCTION
Turkish Sign Language (TSL) is a visual language that is comprised of hand movements, body and face gestures. Hearing impaired children have the chance to learn this language as their native language even before they learn the written Turkish language, on condition that their parents are hearing impaired as well. Language acquisition, which is an extremely crucial process for brain development and intelligence, is completed at the age of 2 or 3 years. Hence the existence of sufficient native language materials and their employment during education is of great importance for preschool training. The utilization of television and computers has become widespread for the purpose of teaching sign language to hearing impaired people. A comprehensive research effort is being carried out at Boğaziçi University, Turkey, to teach TSL by means of videos and to understand sign language words performed by students. This study involves research on recognition and classification of hand and face gestures [1-4].
This work was supported by Istanbul Technical University Scientific Research Projects foundation under the contract BAP 34255. Contact address: hatice.kose@itu.edu.tr
In addition, a TSL tutoring software package [5] has been implemented. This study has been realized as part of an on-going research project, which aims to utilize humanoid robots for aiding sign language tutoring, due to the incompetency of the 2-D instructional tools developed for this goal and the lack of sufficient educational material. In the proposed system, it is intended that a child-sized humanoid robot will perform various elementary TSL words so as to assist in teaching these words to hearing impaired children. This will be achieved through interaction games based on non-verbal communication, turn-taking and imitation that are designed specifically for robot and child to play together. In this study, a survey has been carried out to investigate the similarity between the behaviors of human teachers and the Nao robot simulator during the representation of TSL words. For each selected word, a video which displays the Nao robot simulator performing the sign language expression has been prepared. The corresponding videos of sign language representations by human teachers are available within the TSL tutoring software [5]. For the test study, following the demonstration of the robot's and human teacher's performances of several selected TSL words by videos, a number of participants were asked to fill out a questionnaire regarding the success of the robot's performance in matching the correct human implementation. The results of the survey have been summarized and the most significant factors regarding the realization of TSL words are discussed in the experiments and discussions section.
II. LITERATURE SURVEY
Various studies have been carried out for the recognition of different sign languages. A system that recognizes 40 words, each of which is a pronoun, noun, verb or adjective in American Sign Language, with 90% precision has been implemented [6]. Another study states an 80% success rate for detecting 95 words taken from the Australian Sign Language [7]. Regarding the Japanese Sign Language, a total of 52 words, 42 of which are represented by the finger-alphabet, have also been adequately identified [8]. There is also a study that recognizes 19 words from American Sign Language by utilizing Hidden Markov
models [1]. These words are expressed by the movements of the head and the two hands. The program flow commences with the user repeating a sign language word after watching its corresponding video. The recorded movements of the user are analyzed to determine whether the sign language word has been performed appropriately. A success rate of 99% has been reported for the recognition of words expressed solely by hands and a success rate of 85% was achieved for the classification of words performed both by hands and by the head. Various other studies on sign languages such as the recognition of hand shapes and movements or the classification of facial gestures have been carried out in order to analyze and help teach sign languages [1-4].
There are several successful user studies on non-gesture communication through imitation-based interaction games with humanoid robots and human participants [9, 10]. To achieve interaction, drumming games with gestures provided by the robot to motivate successful turn-taking were studied in [10-13]. There is no previous study that reports the usage of humanoid robots for helping teach sign language to hearing impaired children. The proposed system aims to compensate for this lack while being supportive to hearing impaired children.
III. PROPOSED WORK
This study is a part of an on-going research project that aims to help teach TSL to hearing-impaired children via interaction-based games between the robot and the child. The H-25 Nao robots will be used in the user studies of this research, as they have hands and fingers to implement most of the sign language words, and are appropriate for use in interactive children's games due to their expressive face, small size, compact shape and toy-like appearance. The Nao robot, which has a height of 0.57 m and a weight of 4.5 kg, is a system with 21-25 degrees of freedom, a 500 MHz processor, two cameras, sonar sensors and force-sensitive resistors [14]. In this study, a subset of the most appropriate words that can be demonstrated by the movements of the Nao robot simulator software has been determined. The physical limitations of the Nao robot make it hard to implement some of the words. One of the reasons for this is the fact that the Nao robot has only 3 dependent fingers, while most of the words from the TSL are performed by using 5 fingers (mostly independently, e.g., one kept still while the other 2 are curled). By the aid of the proposed system, distinct success rates for robot and human tutors will be extracted. Regarding the TSL, no matter how professional the teacher might be, each individual has a differentiated style, which causes difficulty in learning in case the instructor is replaced. In the proposed system, it is intended to show that the robot tutor achieves a similar success rate to human teachers. In this way, it is planned that tutor-independent TSL teaching will be achieved. Here, it should also be noted that the aim of the
proposed system is to develop a fast, simple, motivating tool with easy update facility that allows children to test their knowledge. The proposed system is expected to assist human teachers rather than to form a substitute for them.
IV. EXPERIMENTS & DISCUSSION
A survey, which intends to evaluate the performance of
the Nao robot simulator [15] for demonstrating a number of chosen TSL words, has been carried out in this study. The survey has been carried out both for people who have little or no acquaintance with the TSL and for second grade students of TSL courses held at Bilgi and Boğaziçi Universities, Istanbul, Turkey.
A. Adult participants with no acquaintance with Sign Language
In the first part of the survey, 40 people (16 male, 24
female), 36 of whom are students at different universities in Istanbul, have participated in the survey. The remaining four people have distinct professions. Table 1 exhibits the acquaintance of participants with TSL.
A subset of TSL words, which have been chosen for the
survey, are given with their English meaning in Table 2.
TABLE 1. TSL ACQUAINTANCE OF PARTICIPANTS

                    Familiar   Little acquaintance   No acquaintance   No answer
# of participants   4          6                     27                3
TABLE 2. CHOSEN TSL WORDS

Turkish Word   English Meaning
Araba          Car
Arkadaş        Friend
Baba           Father
Masa           Table
Üç             Three
During the execution process throughout the survey, firstly the video of the robot performing the sign language word has been displayed to the participant. Subsequently, a number of human teacher videos taken from the TSL software have been shown for the participant to identify the video that mostly resembles that of the robot. In each of the TSL software videos, a specific TSL word is performed by a human teacher. Finally, the participants were asked to reveal which word was most successfully expressed by the robot. Figure 1 to Figure 4 display video screen shots of sample TSL word representations performed by Nao robot simulator and a human teacher.
Figure 1 Videos corresponding to the realization of TSL word “masa” (table) by Nao robot simulator and a human teacher.
Figure 2 Videos corresponding to the realization of TSL word “arkadaş”(friend) by Nao robot simulator and a human teacher.
In Table 3, the accuracy of the choices made by the participants who are not familiar with TSL is displayed. To state it more explicitly, it has been observed that 36 people identified the corresponding human teacher TSL video after watching the video of the Nao robot performing the TSL word “araba” (car). A choice refers to the human teacher TSL video selected by the participant, who attempts to ascertain the human teacher TSL video most correlated with the robot video. For each robot video that demonstrates a particular TSL word, there is merely one corresponding human teacher TSL video and at most three irrelevant other human teacher TSL videos. In other words, there is one correct choice along with other wrong choices that can be selected by the participant. Wrong choices were generally selected among the ones which most resemble the correct video. Some of the participants did not make any choice at all since they admitted they could not find a resemblance between the given choices and the corresponding video. The
total number of answers, and the number of wrong choices for each video are presented in Table 3 as well.
Figure 3 Videos corresponding to the realization of TSL word “baba”(father) by Nao robot simulator and a human teacher.
Figure 4 Videos corresponding to the realization of TSL word “araba”(car) by Nao robot simulator and a human teacher.
TABLE 3. ACCURACY OF CHOICES FOR NONFAMILIAR PARTICIPANTS

           Correct Choice   Wrong Choice   Total
Araba      36               4              40
Arkadaş    23               14             37
Baba       19               18             37
Masa       33               6              39
Üç         31               5              36
In this survey, having chosen the most likely human teacher TSL video for the robot’s expression, the participant will then assess a score of resemblance for the corresponding two videos of human and robot performance (Table 4). Hence, if the participant is able to come up with the correct choice, he/she will determine a score for the corresponding TSL video. Otherwise, he/she will be evaluating a wrong choice, thus endeavoring to assess the resemblance between the robot video and an irrelevant human teacher TSL video. In Table 4, each column represents the score of resemblance for the related choice. Also the distribution of scores among different wrong choices, and the total scores of the wrong choices are given.
TABLE 4. ASSESSMENT OF RESEMBLANCE FOR ROBOT AND HUMAN TEACHER VIDEOS.
           Correct   1st Wrong   2nd Wrong   3rd Wrong   Total of Wrong   # of Ungraded
           Choice    Choice      Choice      Choice      Choices          Words
Araba      139       4           11          -           15               2
Arkadaş    57        3           28          -           31               4
Baba       56        5           13          32          50               2
Masa       121       6           16          -           22               1
Üç         121       16          0           -           16               2
Lastly, the questionnaire asks about the most and the least successful robot/human teacher TSL video pair according to their similarity with each other. Numbers in parentheses display the number of participants that selected the correct TSL video for the robot video of that TSL word (Table 5). The participants are asked to evaluate the rate of success according to the resemblance between the two videos for the same TSL word. However, the participants perform this evaluation for the robot video that they have watched and the TSL video choice that they have come up with.
TABLE 5. NUMBER OF PARTICIPANTS EVALUATING THE MOST AND THE LEAST SUCCESSFUL ROBOT/HUMAN TEACHER VIDEO PAIR

                   Successful   Unsuccessful
Araba (Car)        17 (16)      -
Arkadaş (Friend)   2 (1)        11 (6)
Baba (Father)      3 (2)        8 (6)
Masa (Table)       8 (8)        5 (5)
Üç (Three)         14           1 (0)
The questionnaires also interrogate the reasons for the failure of the robot performance in expressing a TSL word, such as: a) the physical appearance of the robot; b) physical incompetence pertaining to the robot; c) software inadequacy of the Nao robot simulator; d) other reasons (Table 6).
TABLE 6. POSSIBLE REASONS FOR THE LACK OF RESEMBLANCE BETWEEN ROBOT AND TID VIDEOS.

                   (a)   (b)   (c)   (d)
Araba (Car)        10    1     -     2
Arkadaş (Friend)   2     9     -     2
Baba (Father)      1     13    2     2
Masa (Table)       1     2     3     -
Üç (Three)         6     8     1     2
TABLE 7. DISTRIBUTION OF CHOICE PERCENTILES

           Correct Choice   Wrong Choice(s)
Araba      90%              -
Arkadaş    62.16%           35.14%
Baba       51.35%           43.24%
Masa       84.62%           -
Üç         86.11%           13.89%
Table 7 demonstrates the distribution of percentiles for the choices made by the participants. Choices that were selected by fewer than 5 people, and the unanswered questions, have been discarded. Table 8 shows the mean scores and standard deviations of the different choice scores for each TSL word. It can be inferred from the results that the rates of appropriate association between robot and human teacher TSL videos are above 50% for all the TSL words chosen. The word “Baba” has the least successful robot performance among the TSL words. However, it can be observed that the wrong choices made by the participants exhibit a distribution rather than concentrating on a single wrong answer.
TABLE 8. AVERAGE AND STANDARD DEVIATION VALUES FOR DIFFERENT CHOICE SCORES

           Correct Choice   Wrong Choice
Araba      3.86±0.82        -
Arkadaş    2.47±0.85        2.15±1.21
Baba       2.95±0.97        2.75±1.52
Masa       3.67±0.82        -
Üç         3.90±1.11        3.2±1.48
Since problems were faced whilst trying to move the fingers and rotate the elbows in the Nao simulation program, the robot expressions for the words “arkadaş” (friend), “baba” (father) and “üç” (three) tend to be rather problematic. Therefore, the results in the previous tables were anticipated. The word “üç” (three) has the best performance among the problematic words. Moreover, it turned out to have one of the most successful representations. The reason for this can be traced to the fact that the corresponding expression of the TSL word is widely used among Turkish people. Therefore, the participants were unable to respond in an unbiased way.
B. Adult participants with Sign Language acquaintance
As a test case, the survey was also carried out with a class of second grade sign language students (11 females and 1 male). Although the words and the videos are the same as in the first survey, the way we asked the questions changed slightly. Before giving the choices to the sign language professionals, we first wanted to get their unbiased opinions and guesses about the videos. The second questionnaire is comprised of three phases. Firstly, the participant is asked to watch a video containing a Nao simulator representation of a TSL word and then to
guess the word that the simulated Nao attempts to perform.
Secondly, the participant is expected to choose the correct
video, which resembles the most to the Nao’s demonstration
of a given TSL word, among a number of videos each of which displays the human teacher performance of a specific TSL word. In addition, the resemblance ratio among these
two similar videos is required to be graded by the participant
based on a 5-scale. This part is same with the test applied in the first survey. Thirdly, the participant is requested to select
a certain word that he/she thinks to correspond to the video of the Nao simulator representation among a number of given words and to rank the resemblance ratio as well. Finally, the participants were asked to evaluate the videos of the robots as the most and least successful word implementation of the robot.
TABLE 9. THE RESULT DISTRIBUTION FOR PHASE 1
Word | Correct Choice | Wrong Choice | No Choice
Araba | 7 | 5 | -
Arkadaş | 1 | 8 | 3
Baba | - | 10 | 2
Masa | - | 11 | 1
Üç | - | 5 | 7
Table 9 displays the results of the first part of the second questionnaire. It can be observed that only the word "Araba" could be guessed correctly from the Nao simulator video, with a success rate of slightly more than 50%, when no human teacher TSL videos were shown to the participants.
TABLE 10. DISTRIBUTION OF CHOICES MADE BY THE PROFESSIONAL PARTICIPANTS IN PHASE 2
Word | Correct Choice | 1st Wrong | 2nd Wrong | 3rd Wrong | Tot.
Araba | 11 | - | - | - | 11
Arkadaş | 8 | - | - | - | 8
Baba | 8 | 1 | 1 | - | 10
Masa | 4 | 3 | 4 | - | 11
Üç | 2 | - | - | - | 2
TABLE 11. ASSESSMENT OF RESEMBLANCE FOR ROBOT AND TID VIDEOS.
Word | Correct Choice | 1st Wrong Choice | 2nd Wrong Choice | 3rd Wrong Choice | # of Ungraded Words
Araba | 31 | - | - | - | 2
Arkadaş | 11 | - | - | - | -
Baba | 17 | 4 | 2 | - | 1
Masa | 13 | 8 | 9 | - | -
Üç | 2 | - | - | - | 4
TABLE 12. DISTRIBUTION OF PERCENTILES FOR THE SCORES MADE BY THE PARTICIPANTS.
Word | Correct Choice | Wrong Choice(s)
Araba | 91.66% | -
Arkadaş | 66.67% | -
Baba | 66.67% | 16.66%
Masa | 33.33% | 58.33%
Üç | 16.67% | -
TABLE 13. AVERAGE AND STANDARD DEVIATION VALUES OF SCORES FOR EACH TSL WORD
Word | Correct Choice | Wrong Choice
Araba | 3.45±0.87 | -
Arkadaş | 1.125±0.35 | -
Baba | 2.125±0.64 | 6±0
Masa | 3.2±1.10 | 5.25±1.41
Üç | 1±0 | -
Tables 10 to 13 show the results of the evaluation of the Nao simulator videos by these participants. In Table 11, each column gives the resemblance score for the corresponding choice. The participants, who are able to recognize a much wider range of TSL words, have more difficulty in finding the right TSL video corresponding to the robot video. The reason is that the physical deficiencies of the simulated robot might change the meaning of the words, since there are many words that look very similar and are difficult to distinguish even for sign language professionals.
In the third phase of the second questionnaire, the participants are given words instead of videos and are asked to choose the word that they think best fits the one demonstrated in the Nao simulator video. Tables 14 to 17 display the corresponding results.
TABLE 14. DISTRIBUTION OF CHOICES MADE BY THE PROFESSIONAL PARTICIPANTS IN PHASE 3
Word | Correct Choice | 1st Wrong | 2nd Wrong | 3rd Wrong | Tot.
Araba | 10 | 2 | - | - | 12
Arkadaş | 10 | - | - | - | 10
Baba | 7 | 2 | 1 | - | 10
Masa | 8 | 4 | - | - | 12
Üç | 1 | 3 | 2 | - | 6
TABLE 15. ASSESSMENT OF RESEMBLANCE FOR ROBOT AND TID VIDEOS.
Word | Correct Choice | 1st Wrong Choice | 2nd Wrong Choice | 3rd Wrong Choice | # of Ungraded Words
Araba | 36 | 7 | - | - | -
Arkadaş | 14 | - | - | - | -
Baba | 16 | 1 | 2 | - | 2
Masa | 20 | 10 | - | - | -
Üç | - | 7 | 3 | - | -
TABLE 16. DISTRIBUTION OF PERCENTILES FOR THE CHOICES MADE BY THE PROFESSIONAL PARTICIPANTS.
Word | Correct Choice | Wrong Choice(s)
Araba | 83.34% | 16.67%
Arkadaş | 83.34% | -
Baba | 58.33% | 25.00%
Masa | 66.67% | 33.34%
Üç | 8.33% | 41.67%
TABLE 17. AVERAGE AND STANDARD DEVIATION VALUES FOR DIFFERENT CHOICE SCORES FOR EACH TURKISH SIGN LANGUAGE WORD
Word | Correct Choice | Wrong Choice
Araba | 3.45±1.13 | 3.5±0.71
Arkadaş | 1.4±0.97 | -
Baba | 2.29±0.76 | 3±0
Masa | 2.5±1.07 | 2.5±0.58
Üç | 3.54±3 | -
Lastly, Table 18 presents the number of participants who rated each robot/TID video pair as the most or the least successful in terms of similarity. The numbers in parentheses give the number of those participants who had also selected the correct TSL video for the robot video of that Turkish Sign Language word.
TABLE 18. EVALUATION OF THE MOST AND THE LEAST SUCCESSFUL ROBOT/TID VIDEO PAIR
Word | Successful | Unsuccessful
Araba (Car) | 9 (9) | -
Arkadaş (Friend) | 1 (1) | 3 (1)
Baba (Father) | - | -
Masa (Table) | - | -
Üç (Three) | - | 8 (1)
The answers of the participants with adequate TSL knowledge confirm the results obtained from the naïve users. In this study, the students were also asked to suggest alternative answers for the words embodied by the robot's actions. Although they were able to find the words with high success rates, they commented on how the small cues produced by slight motions of the fingers or fingertips, which are currently beyond the Nao robot's physical capabilities, might change the meaning of the words, since many words look very similar and are difficult to distinguish even for sign language professionals. As expected, although the word "üç" (three) obtained a high success rate among the naïve users, it was not rated as successful by the professionals, since the robot's fingers cannot show the right action and without the fingers it is not easy to distinguish it from the similar word "iki" (two).
V. CONCLUSIONS
In this study, a survey has been carried out to compare the performance of the Nao robot simulator with that of a human teacher in expressing certain words from TSL. This study is part of a more comprehensive research effort that aims to help teach Turkish Sign Language to hearing-impaired children by generating interaction-based games between the robot and the child. The videos in which a human TSL tutor performs the chosen words were taken from the TSL software developed at Boğaziçi University [3]. The Nao robot simulator was used to prepare corresponding videos for the chosen words. The participants were asked to fill out a questionnaire assessing the resemblance between the robot and the human teacher, in order to determine whether the Nao robot has near-human capability for expressing TSL words. Although the physical limitations of the robot do not allow the implementation of all words in the domain (i.e. those that require five fingers or complex upper-torso motion), and most of the participants were naïve people with no knowledge of robots or sign language, all the words were recognized with high success rates. There was some bias related to the familiarity of certain words, and to the fact that several words look very similar because of their close meaning (e.g. father - uncle). We enlarged our domain of words and repeated this experiment with experienced sign language users, and we are still analysing the results. Our next step is to convert the survey into an online interaction game and test it with primary school students to see whether age and experience affect these results. Currently, more than 40 words have been implemented on the real and simulated Nao H25 robot. There is also ongoing work to use a robotic platform with five fingers and more degrees of freedom in the arms to realize the signs more successfully. We are also working on extending the study to American Sign Language (ASL).
REFERENCES
[1] O. Aran, C. Keskin, and L. Akarun, "Sign Language Tutoring Tool", European Signal Processing Conference, EUSIPCO'05, Antalya, Turkey, 2005.
[2] A. Caplier, S. Stillittano, O. Aran, L. Akarun, G. Bailly, D. Beautemps, N. Aboutabit, and T. Burger, "Image and video for hearing impaired people", EURASIP Journal on Image and Video Processing, Special Issue on Image and Video Processing for Disability, 2007.
[3] C. Keskin and L. Akarun, "Input-output HMM based 3D hand gesture recognition and spotting for generic applications", Pattern Recognition Letters, vol. 30, no. 12, pp. 1086-1095, September 2009.
[4] O. Aran, I. Ari, A. Benoit, P. Campr, A. H. Carrillo, F. Fanard, L. Akarun, A. Caplier, M. Rombaut, and B. Sankur, "SignTutor: An Interactive System for Sign Language Tutoring", IEEE Multimedia, vol. 16, no. 1, pp. 81-93, Jan.-March 2009.
[5] Turkish Sign Language Dictionary v1.0. [Online]. Available: http://www.cmpe.boun.edu.tr/pilab/tidsozlugu
[6] T. Starner and A. Pentland, "Real-Time American Sign Language Recognition from Video using Hidden Markov Models", Technical Report TR-306, Media Lab, MIT.
[7] W. Kadous, "GRASP: Recognition of Australian Sign Language Using Instrumented Gloves", MSc. Thesis, University of New South Wales, 1995.
[8] K. Murakami and H. Taguchi, "Gesture Recognition Using Recurrent Neural Networks", Proceedings of CHI'91 Human Factors in Computing Systems, 1991, pp. 237-242.
[9] Q. Shen, J. Saunders, H. Kose-Bagci, and K. Dautenhahn, "An Experimental Investigation of Interference Effects in Human-Humanoid Interaction Games", IEEE RO-MAN 2009, accepted.
[10] H. Kose-Bagci, K. Dautenhahn, and C. L. Nehaniv, "Emergent Dynamics of Turn-Taking Interaction in Drumming Games with a Humanoid Robot", Proc. IEEE RO-MAN 2008, Technische Universität München, Munich, Germany, 1-3 August 2008.
[11] H. Kose-Bagci, E. Ferrari, K. Dautenhahn, D. S. Syrdal, and C. L. Nehaniv, "Effects of Embodiment and Gestures on Social Interaction in Drumming Games with a Humanoid Robot", Special Issue on Robot and Human Interactive Communication, Advanced Robotics, vol. 24, no. 14, 2009.
[12] H. Kose-Bagci, K. Dautenhahn, D. S. Syrdal, and C. L. Nehaniv, "Drummate: Interaction dynamics and gestures in human-humanoid drumming experiments", Connection Science, 2009, accepted.
[13] K. Dautenhahn, C. L. Nehaniv, M. L. Walters, B. Robins, H. Kose-Bagci, N. A. Mirza, and M. Blow, "KASPAR - A Minimally Expressive Humanoid Robot for Human-Robot Interaction Research", Special Issue on "Humanoid Robots", Applied Bionics and Biomechanics, Taylor and Francis, in press.
[14] C. Graf, A. Härtl, T. Röfer, and T. Laue, "A robust closed-loop gait for the standard platform league humanoid", in C. Zhou, E. Pagello, E. Menegatti, S. Behnke, and T. Röfer, eds., Proceedings of the Fourth Workshop on Humanoid Soccer Robots in conjunction with the 2009 IEEE-RAS International Conference on Humanoid Robots, Paris, France, 2009, pp. 30-37.
[15] Aldebaran Robotics Choregraphe. [Online]. Available: http://www.aldebaran-robotics.com/en/programmable
Segmentation of Dynamic Objects from Laser Data
Agustin Ortega and Juan Andrade-Cetto
Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Barcelona, Spain
Abstract — We present a method to segment dynamic objects from high-resolution low-rate laser scans. Data points are tagged as static or dynamic based on the classification of pixel data from registered imagery. Per-pixel background classes are adapted online as Gaussian mixtures, and their matching 3D points are classified accordingly. Special attention is paid to the correct calibration and synchronization of the scanner with the accessory camera. Results of the method are shown for a small indoor sequence with several people following arbitrarily different trajectories.
Index terms – Segmentation, 3D sensing, calibration, sensor synchronization.
I. INTRODUCTION
2D and 3D lidar scanning are popular sensing methodologies for robotics applications. They are used for robot navigation [5], trajectory planning [20], scene reconstruction [17], and even object recognition [1]. Aside from pricey devices such as the Velodyne HDL-64E, high resolution 3D lidar scanning is only possible at low frame rates. As an example, we have built an omnidirectional lidar sensing device for outdoor mobile robotics applications that scans with resolutions and acquisition times ranging from 0.5 degrees at 9 seconds per revolution to finer point clouds sampled at 0.1 degrees resolution at a more demanding processing time of 45 seconds per revolution. This sensor has been devised for low-cost, dense 3D mapping. The removal of dynamic and spurious data from the laser scans is a prerequisite for dense 3D mapping.
In this paper we address this problem by synchronizing the laser range sensor with a color camera, and using the high frame-rate image data to segment out dynamic objects from the point clouds. Per-pixel class properties of the image data are adapted online using Gaussian mixtures. The result is a synchronized labeling of foreground/background for corresponding laser points and image data, as shown in Fig. 1.
The paper is organized as follows. In the next section we present related work in background segmentation using computer vision methods and 3D laser range data. Section III gives our custom-built sensor specifications and details the methods developed for sensor synchronization and calibration. Section IV details the background segmentation algorithm. Results of the method are shown in Section V on a real indoor scenario with several people moving with random patterns. Conclusions and future work are detailed in Section VI.
Fig. 1. Several laser scans of a dynamic object reprojected on their corresponding image frame.
II. RELATED WORK
Methods that study the segmentation of 3D laser data usually focus on the extraction of valuable geometric primitives such as planes or cylinders [10], with applications that vary from map building to object classification [2], road classification [6], or camera network calibration [9]. All these methods, however, are designed to work on static data only and do not consider temporal information. For outdoor map building applications, the removal of dynamic objects from the laser data is desirable. Furthermore, for low-rate scanning devices such as ours, moving items in the scene would appear as spurious 3D data; hence the need to segment them out. Background segmentation is a mature topic in computer vision, and is applied especially to track objects in scenarios where the illumination changes over time but the camera remains fixed with respect to a given reference frame. The most popular methods adapt the probability of each image pixel being of the background class using the variation of intensity values over time. Such adaptation can be tracked with the aid of a Kalman filter [14], taking into account illumination changes and cast shadows. These methods can be extended to use multimodal density functions [18, 19] in the form of Gaussian mixture models, whose parameters are updated depending on the membership degree to the background class. The classification of 3D range data fusing appearance information has been addressed in the past, again for static scene analysis. Posner et al. [11, 12, 13] propose an unsupervised
method that combines 3D laser data and monocular images to classify image patches as belonging to a set of 8 different object classes. The technique oversegments images based on texture and appearance properties, and assigns geometric attributes to these patches using the reprojected 3D point correspondences. Each patch is then described by a bag of words and classified using a Markov random field to model the expected relationship between patch labels.
Fig. 2. Camera to laser rigid body pose estimation using a planar calibration pattern.
Fig. 3. Our custom built 3D range sensing device and a rigidly attached color camera.
These methods (and ours) have as a prerequisite the accurate calibration of both sensors, the laser and the camera. The computation of the rigid body transformation between 2D and 3D laser scanners and a camera is a common procedure in mobile robotics and is usually solved with the aid of a calibration pattern. The techniques vary depending on the type of sensor to calibrate and on the geometric motion constraints between the two sensor reference frames [23, 22, 7, 9]. Sensor synchronization, on the other hand, has received less attention. Sensor synchronization and occlusions are studied in [16] for the case of the Velodyne HDL-64 sensor. A more general method to synchronize sensors with varying latency is proposed in [8].
Fig. 4. Laser-camera pose refinement using line primitives. The green dotted lines show the image features. Red lines show reprojection prior to pose refinement, and blue lines correspond to refined reprojected estimates (best viewed in color).
Fig. 5. Camera and laser synchronization.
III. SENSOR SYNCHRONIZATION AND CALIBRATION
A. Sensor specifications and data acquisition
Our 3D range sensing device consists of a Hokuyo UTM-30LX laser mounted on a slip ring, with computer-controlled angular position via a DC brushless motor and a controller. For the experiments reported in this paper, the laser resolution has been set to 0.5 degrees in azimuth with a 360 degree omnidirectional field of view, and 0.5 degrees resolution in elevation over a range of 270 degrees. Each point cloud contains 194,580 range measurements of up to 30 meters, with noise varying from 30 mm for distances closer than 10 m up to 50 mm for objects as far as 30 m. The color camera used is a Point Grey Flea camera with M1214-MP optics and a 40.4 degree field of view. Fig. 3 shows a picture of the entire unit.
B. Sensor Calibration
We are interested in the accurate registration of laser range data with intensity images. Registration is possible by first calibrating the intrinsic camera parameters and then finding the relative transformation between the camera and laser reference frames. Intrinsic camera calibration is computed using Zhang's method and a planar calibration pattern [24],
although other methods could also be used [3, 21]. Extrinsic calibration between the laser and camera is initialized by selecting correspondences of the calibration plane corners in both sensing modalities with the aid of a graphical user interface, and using Hager's method for pose estimation [4], as shown in Fig. 2. The method is subject to the resolution of the laser scanner for the selection of the four 3D to 2D corner matches in the pattern. Pose estimation is further refined by minimizing the reprojection error of line primitives. Lines in the 3D point cloud are obtained by growing and intersecting planar patches as in [10]. Their corresponding matches in the images are manually selected using the graphical user interface. The line reprojection error is computed as the weighted sum of angular and midpoint location reprojection errors, as shown in Fig. 4,

$\epsilon = (\theta_i - \theta_p)^2 + w\,(\mathbf{m}_i - \mathbf{m}_p)^T(\mathbf{m}_i - \mathbf{m}_p)$   (1)

where the subscript $i$ corresponds to measured image features, and the subscript $p$ indicates projected model features. The weight $w$ is a free tuning parameter to account for the difference between angular and Cartesian coordinates.
C. Synchronization
At 0.5 degree resolution, our 3D scanner takes about 9 seconds to complete a scan, which corresponds to a 180 degree turn of the sensor. The camera frame rate is set to 17 fps, so we have roughly 153 images per full 3D scan. The timestamps of consecutive laser slices $t_{\mathrm{slice}_i}$ and grabbed images $t_{\mathrm{frame}_j}$ are compared and required to lie within a reasonable threshold $T_s$,

$|t_{\mathrm{slice}_i} - t_{\mathrm{frame}_j}| \leq T_s.$   (2)

With $T_s = 1/17$, each laser slice is uniquely assigned to its corresponding image frame, roughly two to three slices per image. Increasing this threshold allows each laser slice to be matched to more than one image at a time (see Fig. 5).
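The synchronization rule of Eq. (2) reduces to a simple timestamp comparison. A minimal sketch, assuming timestamps in seconds and hypothetical variable names:

```python
# Pair each laser slice with all camera frames whose timestamps lie within T_s (Eq. 2).
def match_slices_to_frames(t_slices, t_frames, T_s=1.0 / 17.0):
    """Return, per laser slice, the indices of all frames within the threshold."""
    return [[j for j, t_f in enumerate(t_frames) if abs(t_s - t_f) <= T_s]
            for t_s in t_slices]

# With T_s = 1/17 s most slices map to a single frame; a larger threshold
# (e.g. 0.5 s) deliberately matches several frames per slice.
print(match_slices_to_frames([0.03, 0.12], [0.0, 1.0 / 17.0, 2.0 / 17.0]))
```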
IV. BACKGROUND SUBTRACTION
Once we have time correspondences between 3D laser slices and image frames, we can use background segmentation results on the image sequence to classify the corresponding 3D points in each time slice as belonging to a dynamic or a static object. The method we implemented is based on [19].
A. Mixture Model
For each pixel in the image, the probability of its RGB coordinates $\mathbf{x}$ being of the background class is modeled as a mixture of $K$ Gaussian distributions,

$p(\mathbf{x}) = \sum_{k=1}^{K} \omega_k\, \mathcal{N}(\mathbf{x}\,|\,\mu_k, \Sigma_k)$   (3)
with $\omega_k$ the weight of the $k$-th Gaussian, and $K$ a user-selected number of distributions. This classification scheme assumes that the RGB values of neighboring pixels are independent. During the training session, when a pixel RGB value $\mathbf{x}$ falls within 2.5 standard deviations of any of the distributions in the sum (in the Mahalanobis sense), evidence in the matching distributions is stored by recursively updating their sample weight, mean, and variance with

$\omega_k(t+1) = (1-\alpha)\,\omega_k(t) + \alpha$   (4)
$\mu_k(t+1) = (1-\rho)\,\mu_k(t) + \rho\,\mathbf{x}$   (5)
$\Sigma_k(t+1) = (1-\rho)\,\Sigma_k(t) + \rho\,(\mathbf{x}-\mu_k(t))^T(\mathbf{x}-\mu_k(t))$   (6)

and

$\rho = \alpha\, \mathcal{N}(\mathbf{x}\,|\,\mu_k, \Sigma_k).$   (7)

Note that after updating $\omega$ in Eq. 4, the weights need to be renormalized. And, just as in [19], we also consider during the training session that when a pixel value $\mathbf{x}$ does not fall within 2.5 standard deviations of any of the distributions, the least probable distribution of the Gaussian sum is replaced by one with the current RGB pixel value as mean, an initially high variance, and a low prior weight.
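A compact sketch of the recursive update of Eqs. (4)-(7) for a single pixel is given below. It assumes diagonal covariances and, as a simplification, only touches the matched components; the initialization constants and names are illustrative, not the authors' implementation.

```python
# Hedged sketch of the per-pixel Gaussian-mixture update (Eqs. 4-7), diagonal covariances.
import numpy as np

def gaussian(x, mu, var):
    """Diagonal-covariance Gaussian density."""
    return float(np.exp(-0.5 * np.sum((x - mu) ** 2 / var))
                 / np.sqrt((2 * np.pi) ** 3 * np.prod(var)))

def update_pixel_mixture(x, weights, means, variances, alpha=0.3, thresh=2.5):
    """x: RGB value (3,); weights (K,), means (K,3), variances (K,3).
    alpha = 0.3 is the learning rate used in the paper's experiments."""
    d2 = np.sum((x - means) ** 2 / variances, axis=1)   # squared Mahalanobis distances
    matched = d2 < thresh ** 2
    if not matched.any():
        # no match: replace the least probable component (low weight, high variance)
        k = int(np.argmin(weights))
        weights[k], means[k], variances[k] = 0.05, x.copy(), np.full(3, 900.0)
    else:
        for k in np.flatnonzero(matched):
            rho = alpha * gaussian(x, means[k], variances[k])          # Eq. (7)
            mu_old = means[k].copy()
            weights[k] = (1 - alpha) * weights[k] + alpha              # Eq. (4)
            means[k] = (1 - rho) * mu_old + rho * x                    # Eq. (5)
            variances[k] = (1 - rho) * variances[k] + rho * (x - mu_old) ** 2  # Eq. (6)
    weights /= weights.sum()                                           # renormalize
    return weights, means, variances
```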
B. Background Class
The mixture model on each pixel encodes the distribution of colors for the full image sequence per full 3D scan (about 153 images). The static portion of the data, i.e., the background, is expected to have large frequency and low variance. By ordering the Gaussians in each sum by the value $\omega / \det \Sigma$, the distributions with larger probability of being of the background class will be aggregated at the top of the list. Static items might, however, be multimodal in their color; for instance, a flickering screen or a blinking light. As a result, we choose as background class the first $B < K$ ordered distributions which add up to a factored weight $\omega_B$, where

$B = \arg\min_b \left( \sum_{i=1}^{b} \omega_i \geq \omega_B \right).$   (8)
C. Point classification
Each point in each scan slice is reprojected to its matching image frames according to Eq. 2. Ideally, for tight bounds on $T_s$, only one image will be assigned to each scan slice. Robustness to noise is possible, however, if this bound is relaxed and we allow larger values of $T_s$, so that more than one image can be matched to the same scan slice. We call this set of images $I$. Thus, for each point in a slice, the corresponding pixel values $\mathbf{x}$ from the whole set $I$ are visited and checked for inclusion in the set $B$ of distributions. Class assignment is made if $\mathbf{x}$ belongs to $B$ for all the images in the set $I$.
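The per-point decision can then be sketched as below, assuming a 3x4 projection matrix per matched image and a per-pixel background test provided by the mixture model; the interface and names are illustrative, and bounds checking is omitted.

```python
# Hedged sketch of Sec. IV-C: a 3D point is static only if its reprojection
# is explained by the background class in every matched image.
import numpy as np

def classify_point(X, matched_images, projections, is_background_pixel):
    """X: 3D point (3,); projections: list of 3x4 camera matrices;
    is_background_pixel(image, u, v) -> bool is the per-pixel GMM test."""
    X_h = np.append(X, 1.0)
    for image, P in zip(matched_images, projections):
        u, v, s = P @ X_h                       # homogeneous image coordinates
        u, v = int(round(u / s)), int(round(v / s))
        if not is_background_pixel(image, u, v):
            return "dynamic"
    return "static"
```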
V. EXPERIMENTS
Results are shown for a series of indoor sequences with moderate dynamic content. For background segmentation, the multimodal distribution is set to contain 4 Gaussians, the learning rate is set to α = 0.3, and the background class is set to one third of the frequency in the distributions, i.e., ω_B = 0.3.
Fig. 6. Segmentation results for a sequence with one moving person and varying values of the synchronization threshold: (a) T_s = 1/fps; (b) T_s = 0.5 sec; (c) foreground segmentation; (d) segmented dynamic object.
The synchronization threshold T_s is varied from the minimal 1/17 to a more conservative value of 0.5 seconds. The first analyzed sequence corresponds to a single person moving in front of the laser and camera. Frames (a) and (b) in Fig. 6 show the final results of point classification for different values of T_s; frame (c) shows the image pixel classification results; and frame (d) shows the 3D reconstruction of both the segmented dynamic object and the entire 3D scene. The second sequence contains a more challenging scenario with three people following slow random walking trajectories. Given the slow motion of the people, laser range readings hitting them are difficult to categorize as dynamic. The background segmentation algorithm proposed in this paper helps to alleviate this issue. Figure 7 shows results of background segmentation in this new sequence for varying values of the synchronization parameter. Setting this parameter slightly above the camera acquisition rate accounts for synchronization errors and produces better segmentation results. Frames (a-c) in the figure show the segmentation results for T_s = 1/fps, whereas frames (d-f) show segmentation results for T_s = 0.5 sec. Figure 8 shows 3D reconstruction results of the segmented data and of the full 3D scene, for a synchronization threshold of 0.5 sec.
We appreciate the suggestion, made during the peer review of this work, to compare our method with other approaches. Unfortunately, as far as we know, the system presented is unique, and there are no other methods in the literature that take low-rate 3D scans and remove dynamic content from them using high-rate imagery. To validate the approach, we can however report an empirical comparison with a ground truth point cloud difference. Assuming a clean background scan is available (without people), the difference with respect to a full dynamic cloud was computed with the Point Cloud Library [15] using a distance threshold of 3 mm. Fig. 9 shows the result of this point cloud difference computation. The results of our method are visually comparable to this ground truth experiment.
VI. CONCLUSIONS
We present a method to segment low-rate 3D range data into static and dynamic parts using multimodal classification. The technique classifies fast-rate image data from an accessory camera as background/foreground by adapting a per-pixel Gaussian mixture distribution at frame rate. The results of the image classification are used to tag the reprojected laser data. Special attention is paid to the synchronization and metric calibration of the two sensing devices. Sensor synchronization is of paramount importance as it allows matching high frame-rate imagery with the corresponding low-rate laser scans. The method is tested on indoor sequences with moderate dynamics.
5 th European Conference on Mobile Robots, September 7-9, 2011, Örebro, Sweden
118
(a)
(d)
(b)
(e)
(c)
(f)
Fig. 7. Segmentation results for a sequence with three people moving randomly and varying values of the synchronization thr eshold. Frames (a-c) show three sequence instances segmented at T s = 1/fps. Frames (d-f) show the same sequence instances segmented at T s = 0.5sec.
Fig. 8. Segmentation results for a sequence with three slowly moving people with random walking trajectories.
The proposed method was designed to remove spurious data and dynamic objects from low acquisition rate lidar sensors. The result is a cleaner 3D picture of the static data points. These point clouds can then be aggregated into larger datasets with the guarantee that dynamic data and noise will not jeopardize point cloud registration. The intended application of the technique is robotic 3D mapping.
VII. ACKNOWLEDGMENTS
This work has been partially supported by the Mexican Council of Science and Technology with a PhD scholarship to A. Ortega, by the Spanish Ministry of Science and Innovation under projects PAU (DPI2008-06022) and MIPRCV Consolider Ingenio (CSD2007-018), and by the CEEDS (FP7-ICT-2009-5-95682) and INTELLACT (FP7-ICT-2009-6-269959) projects of the EU. The authors thank M. Morta for the development of the 3D laser used in this research.
5 th European Conference on Mobile Robots, September 7-9, 2011, Örebro, Sweden
119
Fig. 9. Result of applying point cloud difference using PCL.
REFERENCES
[1] K.O. Arras, O.M. Mozos, and W. Burgard. Using boosted features for the detection of people in 2D range data. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 3402-3407, Rome, April 2007.
[2] F. Endres, C. Plagemann, C. Stachniss, and W. Burgard. Unsupervised discovery of object classes from range data using latent Dirichlet allocation. In Robotics: Science and Systems V, Seattle, USA, June 2009.
[3] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, 2nd edition, 2004.
[4] C.P. Lu, G.D. Hager, and E. Mjolsness. Fast and globally convergent pose estimation from video images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:610-622, 2000.
[5] F. Maurelli, D. Droeschel, T. Wisspeintner, S. May, and H. Surmann. A 3D laser scanner system for autonomous vehicle navigation. In Proceedings of the 14th International Conference on Advanced Robotics, Munich, June 2009.
[6] F. Moosmann, O. Pink, and C. Stiller. Segmentation of 3D lidar data in non-flat urban environments using a local convexity criterion. In IEEE Intelligent Vehicles Symposium, pages 215-220, 2009.
[7] P. Núñez, P. Drews Jr, R. Rocha, and J. Dias. Data fusion calibration for a 3D laser range finder and a camera using inertial data. In Proceedings of the European Conference on Mobile Robotics, Dubrovnik, September 2009.
[8] E. Olson. A passive solution to the sensor synchronization problem. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1059-1064, Taipei, October 2010.
[9] A. Ortega, B. Dias, E. Teniente, A. Bernardino, J. Gaspar, and J. Andrade-Cetto. Calibrating an outdoor distributed camera network using laser range finder data. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 303-308, Saint Louis, October 2009.
[10] A. Ortega, I. Haddad, and J. Andrade-Cetto. Graph-based segmentation of range data with applications to 3D urban mapping. In Proceedings of the European Conference on Mobile Robotics, pages 193-198, Dubrovnik, September 2009.
[11] I. Posner, M. Cummins, and P. Newman. Fast probabilistic labeling of city maps. In Robotics: Science and Systems IV, Zurich, June 2008.
[12] I. Posner, D. Schroeter, and P. Newman. Describing composite urban workspaces. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 4962-4968, Rome, April 2007.
[13] I. Posner, D. Schroeter, and P. Newman. Online generation of scene descriptions in urban environments. Robotics and Autonomous Systems, 56(11):901-914, 2008.
[14] C. Ridder, O. Munkelt, and H. Kirchner. Adaptive background estimation and foreground detection using Kalman filtering. In Proceedings of the IASTED International Conference on Robotics and Manufacturing, pages 193-199, Istanbul, August 1995.
[15] R.B. Rusu and S. Cousins. 3D is here: Point Cloud Library (PCL). In Proceedings of the IEEE International Conference on Robotics and Automation, Shanghai, May 2011.
[16] S. Schneider, M. Himmelsbach, T. Luettel, and H.J. Wuensche. Fusing vision and lidar - synchronization, correction and occlusion reasoning. In Proceedings of the IEEE Intelligent Vehicles Symposium, pages 388-393, San Diego, June 2010.
[17] I. Stamos and P.K. Allen. 3-D model construction using range and image data. In Proceedings of the 14th IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 1531, Hilton Head, SC, June 2000.
[18] C. Stauffer and W. Grimson. Adaptive background mixture models for real-time tracking. In Proceedings of the 13th IEEE Conference on Computer Vision and Pattern Recognition, pages 246-252, Fort Collins, June 1999.
[19] C. Stauffer and W. Grimson. Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):747-757, 2000.
[20] H. Surmann, A. Nuchter, and J. Hertzberg. An autonomous mobile robot with a 3D laser range finder for 3D exploration and digitalization of indoor environments. Robotics and Autonomous Systems, 45(3-4):181-198, 2003.
[21] R. Tsai. A versatile camera calibration technique for high accuracy 3D machine vision metrology using off-the-shelf TV cameras. IEEE Journal of Robotics and Automation, 3(4):323-344, August 1987.
[22] R. Unnikrishnan and M. Hebert. Fast extrinsic calibration of a laser rangefinder to a camera. Technical Report CMU-RI-TR-05-09, Robotics Institute, Pittsburgh, July 2005.
[23] Q. Zhang and R. Pless. Extrinsic calibration of a camera and laser range finder (improves camera calibration). In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2301-2306, Sendai, September 2004.
[24] Z. Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330-1334, 2000.
Monocular Detection and Estimation of Moving Obstacles for Robot Navigation
Erik Einhorn
Markus Filzhuth
Christof Schröter
Horst-Michael Gross
Neuroinformatics and Cognitive Robotics Lab, Ilmenau University of Technology, Germany
Abstract — The detection of motion and moving objects or persons with stationary monocular cameras has been extensively studied. However, those techniques fail if the camera is moving itself. In this paper, we present a method for detecting and estimating the position of moving objects using a monocular camera that is mounted in front of a mobile robot platform. The position estimates are used for obstacle avoidance and robot navigation. We apply image warping to compensate for the ego-motion of the camera. This allows us to use standard techniques for motion detection. The final position and velocity estimates are obtained using Extended Kalman Filters. Combined with a monocular scene reconstruction, our approach allows the robust detection and avoidance of both static and moving obstacles by using a single monocular camera as the only sensor.
Index Terms — visual obstacle detection, motion detection, monocular scene reconstruction
I. INTRODUCTION
With steadily increasing processing power and newly evolving hardware, approaches for visual navigation gain more and more importance in mobile robotics. For obstacle detection, vision-based sensors can obtain a larger amount of information about the structure of the local surroundings compared to traditional range-measuring sensors like laser range finders or sonar sensors, which operate in a single plane only. There is a large variety of vision sensors that are suitable for obstacle detection. Time-of-flight cameras are able to measure the distance between camera and obstacles directly by emitting short light pulses. Due to their still high cost, these cameras may be suitable for robot prototypes but are not an option for our purposes. Microsoft's Kinect depth camera uses structured light patterns to obtain the depth of objects in the scene. Due to its low cost and precise depth measurements, this technique will surely have a huge impact on mobile robotics and robot navigation. However, in several experiments we came across a few disadvantages of that camera. Even for smaller robots, its field of view is too narrow to cover the whole area in front of the robot. Moreover, the camera can be used indoors only; in outdoor environments, the emitted infrared light pattern is outshone by the sunlight and cannot be detected by the camera. Besides these depth cameras and stereo cameras, monocular approaches are an adequate alternative for obstacle detection. The majority of such approaches use feature-based techniques that reconstruct the depth or the entire 3D position of each feature. In our previous work, we have developed such an approach for visual obstacle detection that utilizes images captured by a single monocular camera mounted in front of a mobile robot [7]. Our Structure-from-Motion approach
employs Extended Kalman Filters (EKF) to reconstruct the 3D position of the image features in real-time in order to identify potential obstacles in the reconstructed scene. We have shown that this method can significantly improve the reliability of obstacle detection when combined with laser range finders [6]. Similar to other related approaches [2, 5] our previous approach assumes that the scene is static. Moving objects could not be estimated and were ignored. However, since our mobile robots operate at home and in public environments [8] that are populated by walking people and other dynamic objects, a proper handling of those obstacles is required. In this paper, we present an extension of our previous approach that is now able to detect moving objects and allows the estimation of their position and velocity. The presented method is the first approach for monocular obstacle detection that is able to detect both static and dynamic obstacles.
II. RELATED WORK
For detecting moving objects or persons with stationary cameras, a large number of approaches exist that use difference images, motion history images [3] or background subtraction. The detection of moving objects using images of a monocular camera mounted in front of a mobile robot is a more difficult problem, since the ego-motion of the robot induces an inherent optical flow in the images that must be distinguished from the flow of the moving objects. In [11] a feature-based method is presented that uses image correspondences over two and three frames. Different constraints are applied to differentiate between features located on static objects and those located on moving objects, which violate the constraints. In a comprehensive analysis, Klappstein et al. [11] show that objects moving in parallel to the camera with a lower velocity can hardly be detected. Moreover, detecting objects moving anti-parallel towards the camera is only possible if an additional heuristic is used [10]. A problem related to the detection of moving objects in a monocular image sequence is motion segmentation, i.e. the segmentation of the image into regions of pixels moving coherently across the image sequence. In [13] a feature-based approach for real-time motion segmentation is presented that uses an expectation-maximization algorithm to group neighboring features with similar trajectories. However, the problem of detecting objects moving parallel to the camera remains; those objects would most likely be assigned to the segments of background motion.
Fig. 1. The architecture of our approach. The upper part shows the scene reconstruction for static parts of the environment. In the captured image sequence, features are tracked, and their 3D positions in the scene are estimated using EKFs. The resulting features with their 3D positions are used for creating a volumetric 3D occupancy map of the static scene. The tracked features and their estimated 3D position are also used for the detection and estimation of dynamic objects in the scene. As shown in the lower part, the features are used for ego-motion compensation in the input image sequence. Afterwards, the dynamic objects are detected, segmented, and tracked to provide information that is used to recover their positions and velocities.
Besides the detection of dynamic objects, the estimation of their 3D positions and trajectories is even more difficult. In addition to an overall scale ambiguity, further difficulties arise due to the ego-motion of the camera, making it impossible to estimate the object positions and trajectories if no additional assumptions about the object movements are made. There are approaches for Nonrigid Structure-from-Motion [15, 12, 4, 1, 16] that tackle these difficulties by using shape priors [4] or shape bases [16] to constrain the problem. Other authors [12, 1] apply an orthogonal approach and try to solve the problem in the trajectory space instead of the shape space. In those feature-based methods the position and movement of each point is represented by a linear combination of basis trajectories, e.g. the discrete cosine transform basis. The reconstruction is then performed by estimating the coefficients of such a combination of basis trajectories. The approach presented in [12] is most closely related to our problem. However, the authors analyze the reconstructibility of the point trajectories and come to the conclusion that the reconstruction of the points and their trajectories is poor if the trajectories are correlated with the camera's movement and if the camera is moving slowly and smoothly. Since the camera mounted in front of our robot performs exactly this kind of movement, this method is not applicable in our case.
III. OVERVIEW
We use a single calibrated camera that is mounted in front
of the robot. During the robot’s locomotion, the camera is
capturing a sequence of images (..., I_{t-1}, I_t, I_{t+1}, ...) that are rectified immediately according to the intrinsic camera parameters in order to correct the lens distortions. Using the robot's odometry data, the corresponding camera position, expressed by its projection matrix P_t = K R_t [I | -c_t], can be computed for each image I_t, containing the orientation R_t, the position c_t, and the intrinsic calibration matrix K of the camera (see [9] for details). Both the camera's position and its orientation are expressed with respect to a global reference coordinate frame.
The complete architecture of the software system that processes this input data is shown in Figure 1. The monocular Structure-from-Motion approach [7] for reconstructing the 3D shape of the static scene is shown in the upper part of that figure. In each captured and preprocessed image (frame), distinctive image features are selected using the Shi-Tomasi corner detector [14]. These features are tracked over the acquired image sequence while their 3D positions are estimated
using EKFs. Similar to the camera’s pose, these 3D positions of the features are computed with respect to the same global reference coordinate frame [7]. For static obstacle detection, we perform this monocular scene reconstruction for 200-300 salient features of the scene simultaneously. Before the reconstructed features are used to build a map of the environment, they have to undergo some post-processing where unreliable estimates with large covariance and uncertainty are removed. The lower part of Fig. 1 shows the data flow for the detection of dynamic obstacles, which is the main scope of this paper. The tracked features processed by the above static scene reconstruction are used to perform an ego-motion com- pensation: the input images are warped in order to eliminate the effect of the robot’s movement in the images. After this step, the images can be treated as if they were captured by a static camera. Hence, standard operations such as the computation of difference images can be applied to detect and segment the moving obstacles within the images. Such objects are then tracked and their positions in the images are used to estimate their positions and velocities in the scene. These processing steps are described in more detail in the following two sections.
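As a rough illustration of the per-frame preprocessing described above, the sketch below assembles the projection matrix P_t = K R_t [I | -c_t] from odometry and selects Shi-Tomasi features with OpenCV; the function names and parameter values are assumptions of the sketch, not the authors' implementation.

```python
# Hedged sketch: camera matrix from odometry and Shi-Tomasi feature selection.
import numpy as np
import cv2

def projection_matrix(K, R_t, c_t):
    """K: 3x3 intrinsics, R_t: 3x3 orientation, c_t: camera centre (3,)."""
    return K @ R_t @ np.hstack([np.eye(3), -c_t.reshape(3, 1)])

def detect_features(gray_image, max_corners=300):
    """Shi-Tomasi corners used as tracking features (cf. [14]); thresholds illustrative."""
    return cv2.goodFeaturesToTrack(gray_image, max_corners, 0.01, 7)
```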
IV. EGO-MOTION COMPENSATION
One major problem for detecting moving objects is the inherent optical flow that is induced by the movement of the robot or camera. We try to eliminate this effect by virtually correcting the viewpoint of the images. We will see that the ego-motion compensation finally allows us to apply methods known from motion detection with static cameras. The ego-motion compensation is done by image warping, and we will show that this image warping can easily be computed using the image features that are tracked by the approach for static scene reconstruction. However, the ego-motion compensation is successful only if features on static scene objects are used. For simplicity we call these features static features, while features on moving objects are called dynamic features in the following. Using dynamic features for the ego-motion compensation leads to inferior results and would prevent the detection of moving objects in later processing steps of our approach.
A. Feature Filtering
In the next paragraphs, we will describe several constraints and criteria that allow us to classify the features that are
extracted and tracked during the static scene reconstruction into static features and dynamic ones. One possible constraint is the epipolar constraint: for a static feature at position x_t in frame I_t, its corresponding image point x_{t'} in frame I_{t'} will be located along the epipolar line l = F x̃_t, where x̃_t denotes the homogeneous coordinate of x_t and F is the fundamental matrix that describes the geometry between the two involved camera poses (see [9] for details). A dynamic feature, in comparison, will have a larger distance from that epipolar line. However, instead of the epipolar constraint we use a different constraint for long-term stability that implicitly contains the epipolar constraint but uses more than two frames to classify a feature. As stated in Section III, the 3D positions of the tracked features are estimated iteratively using EKFs. Besides the 3D positions, the EKFs additionally compute the covariance of the position estimates. During the scene reconstruction, the algorithm assumes that the scene is static. Hence, dynamic features are estimated with large covariances since they violate the assumptions of the estimation algorithm; this is where the epipolar constraint is implicitly taken into account. The dynamic features can then be filtered out by ignoring all features whose covariance is above a certain threshold. Moreover, the reconstruction algorithm uses the feature estimates to guide the feature tracking process by predicting the position of the corresponding image points in the next frame. Dynamic features that move quickly and perpendicular to their epipolar line will be too far away from the predicted image location, and the approach for static scene reconstruction will lose track of them. Hence, those dynamic features are filtered out at a very early stage, already during the static scene reconstruction.
Unfortunately, there are still dynamic features that pass through the stage of static scene reconstruction and have 3D position estimates with a low variance. This happens for features located on objects that move in parallel to the camera. Those features primarily do not violate the assumption of a static scene; instead, the estimation algorithm is able to find static and stable 3D position estimates that correspond to the observed 2D feature locations in the image sequence. Of course, those position estimates are not correct for dynamic features, and the error depends on the movement of the object. Dynamic features on objects moving faster than the camera, and hence away from it, are reconstructed behind the camera with a negative depth. For features on objects that are moving in the same direction as the camera but with a lower speed, the estimated depth is too large; since our camera is tilted towards the ground, those features are reconstructed below the ground plane. Features that move with the same speed as the camera are reconstructed at infinity and hence also below the ground plane. We apply simple plausibility tests and classify features with a negative depth or a negative height as dynamic features in order to filter them out. It should be noted that the detection of dynamic features is done only to exclude them from the ego-motion compensation; they are not used to detect moving obstacles at this point.
The above criteria are able to reveal a large number of features on moving objects. However, features on objects that are moving towards the camera cannot be detected as dynamic features. They are reconstructed as static obstacles closer to the camera than their real distance; in fact, they are reconstructed near the position where a potential collision with the moving object would occur. Hence, for obstacle detection and avoidance, adding those features to the static obstacle map results in acceptable behavior of the robot. Nevertheless, our processing step for detecting moving objects in the warped images is able to detect most of these moving objects as well; this will be described in Section V.
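The plausibility tests described above can be sketched as a simple predicate; the covariance threshold and the exact depth/height conventions are illustrative assumptions of this sketch.

```python
# Hedged sketch of the feature filtering: features with implausible reconstructions
# (negative depth, below the ground plane) or large covariance are treated as dynamic.
def is_dynamic_feature(depth, height, cov_trace, cov_threshold=1.0):
    """depth: reconstructed depth along the optical axis; height: above the ground plane;
    cov_trace: trace of the EKF position covariance (threshold value illustrative)."""
    return depth < 0.0 or height < 0.0 or cov_trace > cov_threshold
```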
B. Image Warping
The static features that were classified by the above criteria can now be used for the actual ego-motion compensation, which is done by image warping. Image warping allows us to morph any image I_{t'} taken at camera position P_{t'} to the "perspective" P_t of any other frame I_t. The warp W depends on the two positions of the camera and hence - in our case of a single continuously moving camera - on the two time stamps t and t':

$I_{t' \to t} = W_{t' \to t}(I_{t'})$   (1)

The warped image I_{t'→t} corresponds to I_{t'} being taken from the same camera position where I_t was captured. Moreover, if the warp were perfect and the scene were static, the warped image I_{t'→t} would be equal or similar to the image I_t:

$I_t \approx I_{t' \to t}$   (2)

The warp is approximated as a piecewise affine transform of a triangle mesh. As vertices of the mesh we use the static features F = f^{(1)}, ..., f^{(n)} that are tracked between the two frames I_t and I_{t'}. Let x_t^{(i)} denote the 2D image position of the feature f^{(i)} within image I_t, while x_{t'}^{(i)} denotes its image position within image I_{t'}. The triangle mesh is generated by computing a Delaunay triangulation of the 2D points x_t^{(1)}, ..., x_t^{(n)} within the image I_t, as shown in Fig. 2. The mesh is then used to warp each image point in I_{t'}. The image features x_{t'}^{(i)} are warped according to their tracked image position in the image I_t:

$\mathbf{x}_t^{(i)} = W_{t' \to t}\big(\mathbf{x}_{t'}^{(i)}\big).$
For all other pixels in between, the position is bilinearly interpolated within the triangle spanned by the surrounding three image features according to the Delaunay triangulation. When generating the mesh, one must take into account that the positions of the features, and hence the vertices of the mesh, move between the two images. This can result in triangles that fold over, i.e. their orientation flips and their back-faces become visible, as shown in Fig. 3. Such triangles lead to inferior warping results. Hence, we identify the vertices that cause the fold-over in order to remove them from the mesh; afterwards, the mesh is triangulated again. The described warp can be efficiently computed on contemporary graphics hardware by using the image I_{t'} as a texture for rendering the triangle mesh. The features' positions in image I_{t'} are used as texture coordinates and the features' positions in the image I_t are used as model coordinates for each vertex of the mesh.
Fig. 2. Generated triangulation for both images I_t and I_{t'}. While the same mesh topology is used in both images, the positions of the vertices differ between the two images since the features have moved due to the ego-motion of the camera.
Fig. 3. Left: the two yellow triangles fold over since two of the features used for the triangulation moved significantly differently. The vertices/features that caused the fold-over are marked with red circles. Right: the two vertices were removed from the mesh to resolve the fold-over.
The upper row in Fig. 4 shows the warping for an image sequence consisting of 5 frames. The warping was computed using the triangle mesh built from the features tracked by the approach for static scene reconstruction, as described above; dynamic features were removed by the aforementioned criteria. The two leftmost and the two rightmost images are warped to the perspective of the middle image. While the images were captured, the robot and the blue ball moved forward, the blue ball being slightly faster than the robot. Due to the ego-motion compensation, the static scene remains still in the images although the robot moves forward; additionally, it is apparent that the blue ball moves forward.
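A condensed sketch of this piecewise affine warping is given below, using SciPy's Delaunay triangulation and per-triangle affine warps in OpenCV instead of the GPU rendering described in the text; fold-over handling is omitted and all names are illustrative assumptions.

```python
# Hedged sketch of the ego-motion compensation: warp the source frame I_t' into the
# perspective of the reference frame I_t using matched static feature positions.
import numpy as np
import cv2
from scipy.spatial import Delaunay

def warp_to_reference(img_src, pts_src, pts_dst):
    """img_src: frame I_t'; pts_src: (n,2) feature positions in I_t';
    pts_dst: (n,2) tracked positions of the same features in the reference frame I_t."""
    h, w = img_src.shape[:2]
    warped = np.zeros_like(img_src)
    tri = Delaunay(pts_dst)                       # mesh topology built on the reference positions
    for simplex in tri.simplices:
        dst = np.float32(pts_dst[simplex])
        src = np.float32(pts_src[simplex])
        A = cv2.getAffineTransform(src, dst)      # per-triangle affine transform
        patch = cv2.warpAffine(img_src, A, (w, h))
        mask = np.zeros((h, w), np.uint8)
        cv2.fillConvexPoly(mask, np.int32(dst), 1)
        warped[mask == 1] = patch[mask == 1]      # copy only the pixels inside the triangle
    return warped
```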
V. DETECTION AND ESTIMATION OF MOVING OBJECTS
Using the described ego-motion compensation, we can implement the actual detection of moving objects using methods known from motion detection with static cameras. For performance reasons, we use difference images, which can be computed efficiently. According to Eq. 2, the difference image of two input images is close to zero if one image is warped into the perspective of the other and the scene is static: I_{t'→t} − I_t ≈ 0. Moving objects, on the other hand, produce large differences and can therefore be detected in the difference images. To reduce the influence of image noise and to increase the signal-to-noise ratio, we do not use only two images for generating the difference image. Instead, we use 5 consecutive images from the captured image sequence. Let I_t denote the reference image in which the motion should be detected. Then we use two images I_{t−2}, I_{t−1} that were captured before and two images I_{t+1}, I_{t+2} that were captured after the reference image. All images (except the reference image) are warped into the perspective of the reference image
I_t. Afterwards, the difference images between the four warped images and the reference image I_t are computed, as shown in the second row of Fig. 4. Moreover, the difference images are binarized using a threshold that was chosen experimentally. As shown in Fig. 4, the binarized difference images are then combined in pairs using an AND operation, i.e. in the resulting binary image a pixel is set only if it was set in both binarized difference images. The resulting images are combined using an OR operation. Finally, remaining noise is removed by applying an opening and a closing morphological operator. This procedure has yielded the best results in our experiments. It reduces the noise in the difference images and preserves the outline of the moving object even if the size of the object increases or decreases when the object's distance changes. As seen in the above figure, some smaller image regions, especially along occlusion boundaries, were also classified as moving objects although they belong to the static environment. These regions would also be treated as moving objects in the next processing steps of our approach. However, their position and velocity estimates would finally reveal that those objects are in fact static.
In the final processed binary difference image, bounding boxes are generated around each connected image segment. These bounding boxes are used as features of the extracted moving objects. They provide the approximate position and size of each object within the image and are used for tracking the moving objects within the image sequence. At the moment, we use a single-hypothesis tracker, i.e. we assume that there is only one moving object within the field of view of the camera. Currently, the bounding box with the largest area is chosen as the hypothesis of the single moving object; the other bounding boxes are ignored.
The position and velocity estimation is done using an EKF. As stated in the first sections, it is not possible to reconstruct the position of a moving object from a moving camera without additional constraints. We therefore assume that all moving objects touch the ground and hence that the lower border of the extracted bounding box is located on the ground plane in the scene. Consequently, the state of the moving object can be modeled by y = (p, v), where p ∈ ℝ² represents the position of the object on the ground plane in world coordinates and v ∈ ℝ² denotes the velocity of the object along the ground plane. The resulting state transition function is given as y_k = A y_{k−1}, with

$A = \begin{pmatrix} 1 & 0 & \Delta t & 0 \\ 0 & 1 & 0 & \Delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}.$
As measurement for the EKF update, the midpoint of the bounding box's bottom edge in the image is used. During the update, the image position of the midpoint is projected onto the ground plane. This projection can be described by a homography H = P_t G, where P_t is again the projection matrix of the camera and

$G = \begin{pmatrix} \mathbf{e}_1 & \mathbf{e}_2 & \mathbf{c} \\ 0 & 0 & 1 \end{pmatrix}$

describes the orientation and location of the ground plane, which is spanned by the two vectors e_1, e_2 and goes through the point c.
Fig. 4. Processing chain for the detection of moving objects. First row: input sequence consisting of 5 frames. Second row: two images before and two images after the reference frame I_t are warped into the perspective of the reference frame. Third row: binarized difference images between the warped images and the reference image. Fourth row: difference images are combined in pairs using an AND operation. Fifth row: resulting images are combined using an OR operation. Sixth row: final detection of moving objects after opening and closing operations with extracted bounding boxes.
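The combination of difference images illustrated in Fig. 4 can be sketched as follows with OpenCV, assuming grayscale inputs; the binarization threshold and the exact pairing of the difference images are illustrative assumptions of the sketch.

```python
# Hedged sketch of the motion mask: binarized differences to the reference frame,
# AND-combined in pairs (before/after), OR-combined, then morphologically cleaned.
import cv2

def motion_mask(ref, warped_m2, warped_m1, warped_p1, warped_p2, thresh=30):
    diffs = [cv2.threshold(cv2.absdiff(w, ref), thresh, 255, cv2.THRESH_BINARY)[1]
             for w in (warped_m2, warped_m1, warped_p1, warped_p2)]
    pair_before = cv2.bitwise_and(diffs[0], diffs[1])   # frames before the reference
    pair_after = cv2.bitwise_and(diffs[2], diffs[3])    # frames after the reference
    mask = cv2.bitwise_or(pair_before, pair_after)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # remove residual noise
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # close small holes
    return mask
```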
Using the above stated transition and projection functions, the usual EKF prediction and update steps are iteratively processed for each new image, and hence for each extracted bounding box, to continuously estimate the position and velocity of the moving object. The final position estimate is used for navigation and obstacle avoidance in addition to the reconstructed features of the static scene.
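The estimation step can be sketched as a linear Kalman filter once the bounding-box midpoint has been projected to the ground plane via the homography H = P_t G; this simplification (projecting before the update rather than linearising the projection inside an EKF) and all names are assumptions of the sketch.

```python
# Hedged sketch: constant-velocity Kalman filter on the ground-plane position,
# with the measurement obtained by inverting the homography H = P_t G.
import numpy as np

def make_A(dt):
    A = np.eye(4)
    A[0, 2] = A[1, 3] = dt                      # constant-velocity transition matrix
    return A

def ground_point(H, u, v):
    """Map image point (u, v) to ground-plane coordinates via the 3x3 homography H."""
    q = np.linalg.solve(H, np.array([u, v, 1.0]))
    return q[:2] / q[2]

def kf_step(y, P, z, dt, Q, R):
    """y: state (p, v); z: measured ground-plane position; Q, R: noise covariances."""
    A = make_A(dt)
    Hm = np.hstack([np.eye(2), np.zeros((2, 2))])   # we observe the position only
    y, P = A @ y, A @ P @ A.T + Q                   # prediction
    S = Hm @ P @ Hm.T + R
    K = P @ Hm.T @ np.linalg.inv(S)                 # Kalman gain
    y = y + K @ (z - Hm @ y)                        # update
    P = (np.eye(4) - K @ Hm) @ P
    return y, P
```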
VI. RESULTS
We have tested our approach with numerous data sets of real data that was recorded while the robot was moving through an indoor environment. As moving obstacles we used an untextured ball as shown in Fig. 4 and a person who walked in front of the robot as shown in Fig. 5 and Fig. 6. We tested different kinds of movements relative to the robot. The detection and estimation of obstacles that moved perpendicular to the robot’s trajectory was easily managed by our approach (Fig. 5). More difficult is the detection of objects that move parallel to the robot. Nevertheless, those obstacles were successfully detected and estimated correctly by the proposed approach. An example is shown in Fig. 6 where a person is walking in front of the robot into the same direction with a slightly higher velocity. In the left column of Fig. 6 the frames 10, 20 and 30 of the captured images sequence taken by the robots front camera are shown. Image regions where motion was detected using our proposed method are labeled with red color. Beside some smaller artifacts near occlusion boundaries the walking person was correctly detected. The
blue circle denotes the estimated position of the person on the ground plane reconstructed using an EKF as described in the previous section. The green circles indicate the reconstructed positions of the person within the next ten frames that are not shown as separate images. The sizes of the circles indicate the uncertainties of the estimates. In the right column a bird’s-eye view of the scene is shown, where the reconstructed static features are additionally visualized as black dots. Together with the estimated position of the moving object, such a 2D map can be used for obstacle avoidance and navigation. As ground truth, the range measurements of a laser range finder are indicated in all images by a thin black line. In the images of the front camera, the dashed line shows a projection of the range measurements at the height where the laser is mounted, while the solid line shows a projection of the range measurements onto the ground plane in order to allow for a better comparison with the position estimates computed by our approach, which are located on the ground plane, too.
Fig. 5.
the robot is moving forward. Image regions, where motion was detected, are highlighted in red. Our approach correctly detected the motion of the leg that is moving forward, while the leg that stands on the floor is classified as static.
Detection of a moving person that is walking from left to right while
5 th European Conference on Mobile Robots, September 7-9, 2011, Örebro, Sweden
125
6
velocity (m/s)
Fig. 6. left column: Three frames of the input sequence: image regions where motion was detected are highlighted in red. The position estimate of the moving object is indicated by the blue circle. The green circles show the reconstructed positions for 10 further frames where no image is shown. For comparison, measurements of a laser range finder are shown as black line. right column: Bird’s-eye view of the scene. The reconstructed features of the static scene are additionally shown as black dots.
Especially in the bird’s-eye view it becomes apparent that our approach estimates the position of the moving person correctly. Qualitatively, the precision is comparable with the precision of the laser range finder. In Fig. 7, the velocity of the person that is additionally estimated by the EKF is plotted for each frame of the image sequence. The person comes into the field of view in the third frame. After a short initialization phase of 2 frames, the velocity is correctly estimated at around 1 m/s. The small oscillation in the graph is caused by the non-uniform leg movement of the person.
frame
|
Fig. 7. |
Estimated velocity of the moving object for each frame of the |
|
sequence. |
|
VII. CONCLUSIONS AND FUTURE WORK
In this paper, we presented a method for detecting and estimating the position and velocity of moving objects in monocular image sequences that were captured from a moving robot. Combined with our existing approach for monocular static scene reconstruction, the presented technique allows us to detect both static and dynamic obstacles for robot navigation and obstacle avoidance. Thus, it increases the robustness of our obstacle detection system for real world robot applications in
public environments, like tour guides and shopping assistants [8]. Those environments are populated by many persons and other moving obstacles (e.g. shopping carts). In several experiments, we have tested the robustness of our approach. Even in the difficult case of an object that is moving parallel to the camera, our technique is able to detect the motion and correctly estimates the position of that object. One example for this kind of motion is given in the paper (Fig. 6). At the moment our algorithm uses a single hypothesis tracker only. Hence, it assumes that there is only one moving object in the camera’s field of view. However, it can be easily extended to a multi-hypotheses tracking algorithm that can handle several moving objects. The presented method for ego-motion compensation using image warping allows the use of difference images for motion detection. In other applications, the same ego-motion compen- sation technique could be used to allow for the usage of image processing algorithms that were primarily developed for static cameras also for moving cameras.
REFERENCES
Ijaz Akhter, Yaser Ajmal Sheikh, Sohaib Khan, and Takeo Kanade. Non- rigid Structure from Motion in Trajectory Space. In Neural Information Processing Systems, pages 41–48, 2008.
[2] J. Civera, A.J. Davison, and J. Montiel. Inverse Depth Parametrization for Monocular SLAM. IEEE Trans. on Robotics, pages 932–945, 2008.
J.W. Davis. Hierarchical Motion History Images for Recognizing Human
Motion. In Proceedings IEEE Workshop on Detection and Recognition of Events in Video, pages 39–46. IEEE Comput. Soc. [4] A. Del Bue. A Factorization Approach to Structure from Motion with Shape Priors. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8, 2008. [5] E. Eade and T. Drummond. Unified Loop Closing and Recovery for Real Time Monocular SLAM. In Proc. of the BMVC, 2008. [6] E. Einhorn, Ch. Schroter,¨ and H.-M. Gross. Monocular Scene Recon- struction for Reliable Obstacle Detection and Robot Navigation. In Proc. of the 4th ECMR, pages 156–161, 2009. [7] E. Einhorn, Ch. Schroter,¨ and H.-M. Gross. Attention-Driven Monocular Scene Reconstruction for Obstacle Detection, Robot Navigation and Map Building. Robotics and Autonomous Systems, 59(5):279–292, 2011. [8] H.-M. Gross, H.-J. Bohme,¨ Ch. Schroter,¨ St. Muller,¨ A. Konig,¨ E. Ein- horn, Ch. Martin, M. Merten, and A. Bley. TOOMAS: Interactive Shopping Guide Robots in Everyday Use - Final Implementation and Experiences from Long-term Field Trials. In Proc. of IROS, pages 2005– 2012, 2009. [9] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0-521-54051-8, second edition, 2006.
[10] Jens Klappstein. Optical-Flow Based Detection of Moving Objects in
Traffic Scenes. PhD thesis, Ruprecht-Karls-Universitat¨ Heidelberg, 2008. [11] Jens Klappstein, Fridtjof Stein, and Uwe Franke. Detectability of Moving Objects Using Correspondences over Two and Three Frames. In Proc of the 29th DAGM conference on Pattern recognition, pages 112–121, 2007. [12] Hyun Soo Park, Takaaki Shiratori, Iain Matthews, and Yaser Sheikh. 3D Reconstruction of a Moving Point from a Series of 2D Projections. In ECCV, pages 158–171, 2010. [13] Shrinivas J. Pundlik, Stanley T. Birchfield, and Senior Member. Real- time motion segmentation of sparse feature points at any speed. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 38(3):731 – 742, 2008. [14] J. Shi and C. Tomasi. Good features to track. In 9th IEEE Conference on Computer Vision and Pattern Recognition, pages 593–600, 1994. [15] J. Taylor, A.D. Jepson, and K.N. Kutulakos. Non-Rigid Structure from Locally-Rigid Motion. In Proc. Computer Vision and Pattern Recognition Conf., pages 2761–2768, 2010. [16] R. Vidal and D. Abretske. Nonrigid Shape and Motion from Multiple Perspective Views. In ECCV, pages 205–218, 2006.
[1]
[3]
5 th European Conference on Mobile Robots, September 7-9, 2011, Örebro, Sweden
126
1
Fast 2.5D Mesh Segmentation to Approximately Convex Surfaces
Robert Cupec, Emmanuel K. Nyarko and Damir Filko
Faculty of Electrical Engineering, J. J. Strossmayer University of Osijek, Croatia
Abstract—A range image segmentation approach is presented.
A range image obtained by a 3D sensor is transformed to a 2.5D
triangle mesh. This mesh is then segmented into approximately convex surfaces by an iterative procedure based on the incremental convex hull algorithm and region growing. The
proposed approach is designed as a fast mid-level tool whose task
is to provide suitable geometric primitives for a high-level object
recognition or robot localization algorithm. The presented approach is experimentally evaluated using 3D data obtained by a Kinect sensor.
Index Terms— 2.5D Mesh, Kinect, Range Image Segmentation, Convex Surfaces
I.
INTRODUCTION
E NVIRONMENT PERCEPTION capabilities are crucial for achieving a highly autonomous operation of a mobile
robot. The robot must be capable of localizing itself in an unstructured environment as well as to recognize certain objects of interest for a particular task. Human-like robot localization strategies rely on visual recognition of landmarks, i.e. objects with fixed position in the environment. A common approach is to use point features with assigned descriptors as landmarks [1]. Alternatively, line features [2] can be used. More complex structures such as generic objects could be very useful for robot localization [3]. Apart from the purpose of localization, a robot's ability to recognize objects is necessary for flexible object manipulation. A useful preprocessing step in an object recognition process is image segmentation. The image of a scene is segmented into several regions potentially representing different objects. A common image segmentation approach is to partition the image into a number of approximately uniformly colored regions [4], [5]. These regions can be grouped into more complex structures which can be matched with objects in a database. A human can easily recognize separate objects in an image such as the one shown in Fig. 1. For a computer vision system, segmentation of a scene from a single 2D image is still an open problem. 3D information can be of great help in scene analysis. 3D sensors such as laser range finder, stereo vision, time-of-flight camera
and sensors based on structured light can provide depth image of an observed scene and thus information of the shape of geometric structures which appear in the scene. The images obtained by the discussed sensors are called range images since every image point is assigned a depth value encoding the distance of the observed point from the camera. Since every image point is assigned a unique depth value, such an image is also called 2.5D model of the scene.
Fig. 1 Human can easily segment such an image to object and recognize them.
In this paper a computationally efficient approach for segmenting a 2.5D image is proposed. First, the point cloud obtained by a 3D sensor is transformed to a 2.5D triangle mesh. Then an iterative procedure based on the incremental convex hull algorithm and region growing is applied to detect approximately convex surfaces in the scene. The proposed approach is designed as a mid-level processing stage in a vision-based perception system of a robot which should provide suitable geometric primitives for high-level object recognition or robot localization algorithm. The experiments reported in this paper are performed using a commercially available low cost 3D sensor Kinect and give some insight into the applicability of this device for robot perception tasks. This paper is structured as follows. In Section II an overview of the related work is given. The proposed range image segmentation algorithm is described in Section III and its experimental evaluation in Section IV.
5 th European Conference on Mobile Robots, September 7-9, 2011, Örebro, Sweden
127
2
II. RELATED WORK
Range image segmentation is the process of dividing, or segmenting, a range image such that all the points of the same surface belong to the same region. The most common range image segmentation methods found in literature can be grouped into two main categories: region-based (surface- based) methods and edge-based methods. In [6], an overview of several such algorithms are presented and evaluated. Edge-based methods often apply an edge detector to extract edges from the range image. Even though such methods produce well defined boundaries between different regions, they often produce gaps between boundaries. These methods are also prone to under-segmenting the range image especially on curved surfaces where the discontinuities are smooth and hard to determine. Examples of edge-based methods are given in [7], [8] and [9]. With region-based methods, the image can be segmented either by using only planar surfaces or a combination of planar and curved surfaces. In [10], [11], [12], [13] and [14] range images are segmented entirely using only planar surfaces. A procedure to detect connected planar, convex, and concave surfaces of 3-D objects is given in [15]. The range image is initially segmented into "surface patches" by applying a square error criterion clustering algorithm based on the surface points and the associated surface normals. These patches are then classified as planar, convex or concave using a non-parametric statistical trend test, curvature values and eigenvalue analysis. Finally, the boundaries between adjacent surface patches are classified as crease or noncrease edges, and based on this information, compatible patches are merged to produce reasonable faces of the objects. In [16], a method for segmenting range images using curved (convex) surfaces is given. The authors propose a method of detecting 3D convex surfaces using the signs of simplified Gaussian and mean curvatures based on classical differential geometry. By combining edge-based methods and region-based methods, more accurate segmentation results can be obtained [8]. An approach to multi-scale edge and planar surface detection in range images is also given in [9]. Computational geometry approaches to range image segmentation often utilize Voronoi diagrams and Delaunay triangulations to determine connections among data points and subsequently construct triangulated surfaces by connecting adjacent vertices [14], [17] and [18]. These triangulated surfaces form a topological structure often referred to as 3D mesh. A survey in [19] provides an overview of 3D mesh segmentation methodologies. With the aid of this 3D mesh, 3D objects can be modeled by using the convex hull algorithm. The use of convex hulls is well-suited for rapid 3D object modeling since: such volumes are inherently conservative due to their convex nature; the resulting hull is bounded by planar faces, thereby enabling fast computation of distances; and its use in the generation of bounding models to represent the spatial shapes of 3D objects is compact, fast, and relatively efficient [20] and [21].
A technique for approximating range images with adaptive triangular meshes ensuring a user-defined approximation error is presented in [18]. The technique is based on an efficient coarse to-fine refinement algorithm that avoids iterative optimization stages. The algorithm first maps the pixels of the given range image to 3D points defined in a “curvature space” (the curvature space can be thought of as a curvature image in which the range originally associated with each pixel is substituted for a measure of its local curvature). These points are then tetrahedralized with a 3D Delaunay algorithm. Finally, an iterative process starts sculpturing (digging up) the convex hull of the obtained tetrahedralization by progressively removing the triangles that do not fulfill the specified user- defined approximation error with respect to the original 3D triangular mesh. The introduction of the aforementioned curvature space makes it possible for both convex and non- convex object surfaces to be approximated with adaptive triangular meshes, improving thus the behavior of previous coarse-to-fine sculpturing techniques. In [22], the authors developed a system for automated 3D object detection and modeling using range images in a construction work environment. The range image is initially filtered to reduce noise. Then the image is segmented such that the ground-floor or plane is detected. The data corresponding to the ground-floor of the scene is removed in order to ease the detection of the individual objects located on the ground. The resulting image is then segmented in order to detect the objects in the scene. As a result of this object segmentation process, the surface points on the various objects of interest are segmented. Using a convex hull algorithm, the surface points for each object are then connected to generate a 3D model. We propose a fast and computationally efficient method of segmenting a range image into approximately convex surfaces. We initially create a 2.5D mesh of the range image using iterative Delaunay triangulation. Approximately convex surfaces are then obtained by implementing the incremental convex hull algorithm combined with region growing.
III. SEGMENTATION TO APPROXIMATELY CONVEX SURFACES
Triangle mesh is a common representation of 3D objects in computer graphics and computer vision, where 3D objects are approximated by polyhedral surfaces up to a given precision. In this section, 2.5D mesh is defined and the term of approximately convex surface is introduced. Let us start with the notation used throughout this section.
Let M = {F 1 , F 2 ,
representing a triangle mesh. The vertices of the triangles
∈M are referred to in the following as the vertices of M.
F µ } be a set of triangular surfaces F k
F
k
The set of all vertices of M is denoted by Σ M . A mesh M such that for every two triangles
,
is a sequence (
two triangles
side is referred to in the following as a connected mesh.
F
i
,
F
j
∈ M there
F
s
F
s
p
1
,
F
s
2
and
…
,
F
s
F
p
+
s
1
r
) , where s 1 = i, s r = j and every
in that sequence share a common
5 th European Conference on Mobile Robots, September 7-9, 2011, Örebro, Sweden
128
3
Let us consider a mesh M such that for every two points
P P′∈Σ
,
M
holds
1
0
0
1
0
0
⋅
(
p
′−
p
)
≠
0 ,
where p = [x, y, z] T denotes the vector defining the position of a point P. A mesh with these properties is referred to in the following as a 2.5D mesh. The convex hull of any 2.5D mesh M is also a polyhedral surface, i.e. a 3D mesh which can be represented by a set of
triangles H M = {T 1 , T 2 ,
unit normal n k directed out of the hull and a parameter ρ k , referred to in the following as offset. The plane in which T k lies is defined by the equation
T m }. Each triangle T k is assigned a
∀P ∈T
k
,
n
T
k
⋅p − ρ
k
=
0
.
(1)
Let
+
M
H
be a subset of H M defined by
+
M
H
=
{
T
k
∈
H
M
:
n
T
k
⋅z >
0
}
.
(2)
In the case where M is obtained by a 3D sensor,
represents the subset of triangles
the sensor. A connected 2.5D mesh M such that
+
M
H
T
k
∈ H
M
which are visible to
max
P ∈Σ
M
{
min
T
k
∈
+
M
H
{
ξ
⋅
(
ρ
k
−n
T
k
⋅p
)}}
≤
ε
,
(3)
where ξ = 1 and ε is a given tolerance, is referred to in the following as an approximately convex surface with tolerance ε. Approximately concave surface can be defined by the same equation for ξ = –1. Detection of approximately convex surfaces in a 2.5D mesh can be achieved by employing a region growing approach. Starting with a single triangle, which is by itself a convex surface, region growing proceeds by successively appending adjacent triangles while preserving (3). Algorithm 1 presented in the following can be used for such an incremental growing of an approximately convex surface.
Algorithm 1:
Incremental
Growing of
Convex Surface
an
Approximately
Input: M, H M , ε, F
Output: M',
1:
2:
H
M
′ , result
M' ← M ∪{F}
P ← vertex of F which is not a vertex of any triangle of
3:
4:
M
H M ,1
← {
T
k
∈ H
M
:
n
T
k
⋅p ≤ ρ
k
}
Find all pairs of adjacent triangles
T
k
,
T ∈ H
l
M
such that
5:
T
k
∈ H
M ,1
and
T
l
∉ H
M ,1
.
For every such pair, insert into a set
defined by the common side of T k and T l and the point P.
H
M ,2
a new triangle
6:
7:
8:
9:
10:
11:
H
M
′ ←
H
M ,1
∪ H
M ,2
If (3) is satisfied for M' and
H
M
return M',
H
M
′ , SUCESS
′
else return M, H M , FAILURE end if
The algorithm is based on the incremental convex hull algorithm [23], where after updating the convex hull of the surface M by appending a new triangle F, evaluation of (3) is performed. Evaluating (3) for all vertices in Σ M can be avoided. Let us
assign to each triangle T k a set Σ M,k containing all vertices P ∈
Σ M for which T k is the closest triangle in
+
H
M
, i.e.
Σ
M k
,
=
P
∈Σ
M
:
k
=
argmin
T
k
∈
+
M
H
{
ρ
k
−
n
T
k
⋅
p
}
It is easy to show that instead of evaluating (3) for all vertices in Σ M it is sufficient to evaluate only those in the set
Σ
*
M
=
∪
T ∈H
k
M
\
H
M ,1
Σ
M k
,
.
be
evaluating (3) only for the triangles from the set
Additional
computational
saving
can
+
M
H
,2
=
{
T
k
∈
H
M
,2
:
n
T
k
⋅z >
0
}
.
achieved
by
Hence, in line 7: of Algorithm 1, instead of (3), the following equation can be evaluated
max
P ∈Σ
*
M
{
min
T
k
∈
+
M ,2
H
{
ξ
⋅
(
ρ
k
−n
T
k
⋅p
)}
}
≤
ε
.
(4)
However, this second simplification can result in rejecting some valid approximately convex surfaces in certain cases.
This can happen if there is a vertex
≤ ε and at the same
time
for which
T
k
⋅p
)
P∈Σ
M
,
*
T
k
∈
+
H
M ,1
exists such that
ξ ⋅
(
ρ
k
−n
min
T
k
∈
+
M ,2
H
{
ξ
⋅
(
ρ
k
−n
T
k
⋅p
)}
>
ε
.
In that case, condition (4) is not satisfied although (3) is. Nevertheless, such cases are expected to be rather rare. The segmentation algorithm discussed in this section starts with a connected 2.5D mesh M. The triangle with the greatest
area of the image projection is selected from M as a seed for the region growing procedure. Region growing proceeds by
incrementally appending adjacent triangles to the expanding
surface, as explained in Algorithm 1. In every iteration, there
are multiple candidates which can be merged with the growing surface. There are various possible criteria for selecting the
5 th European Conference on Mobile Robots, September 7-9, 2011, Örebro, Sweden
129
4
next triangle to be appended. In the experiments reported in Section IV, the angle between the normals of adjacent triangles is used as the selection criterion, where appending of triangles with similar normal is favored. The growing of the surface stops when no adjacent triangle satisfies (4). All triangles grouped in the obtained approximately convex surface are labeled and thereby excluded from the further segmentation process. After one surface is completed, the procedure continues by taking the triangle with the greatest area of the image projection from the remaining unlabeled triangles in M as a seed for the next surface. The procedure stops when all or a predefined percentage of triangles are labeled.
IV. EXPERIMENTAL EVALUATION
The range image segmentation approach described in Section III is experimentally evaluated using 3D data provided by a Kinect sensor. Microsoft Kinect is a game console peripheral with integrated multi-microphone array for sound direction tracking, RGB camera and a depth sensor. The depth sensor combines an infrared laser projector with a monochrome CMOS sensor. The depth camera interprets 3D scene information based on continuously projected infrared structured light, i.e. this 3D scanner uses patent technology from Primesense called LightCoding. Kinect specifications claims its video and depth refresh rate to be 30Hz with 8-bit VGA resolution, while the depth sensor produces 640 × 480 depth image with maximum 11-bit values which corresponds to 2048 levels of depth. In the experiments reported in this section, the depth images are subsampled to the resolution 320 × 240. The effective scanning range of the Kinect depth sensor is between 0.7m and approximately 5m, with an angular field of view of 58° horizontally and 45° vertically. The depth resolution is approximately 1cm at distance of 2m, while at the same distance spatial resolution is 3mm [24]. At the time of writing this paper no official drivers for Kinect existed for Windows platform, therefore we used unofficial drivers and libfreenect framework from the OpenKinect community [25] which provided satisfactory results. The Kinect sensor provides a 3D point cloud. This point cloud is transformed to a 2.5D triangle mesh by iterative refinement of Delaunay triangulation proposed in [14]. The obtained mesh is then segmented into approximately convex surfaces by applying the procedure described in Section III. The algorithm is implemented in C++ programming language and executed on a 3.40GHz Intel Pentium 4 Dual Core CPU with 2GB of RAM. The results are presented in Figs. 2 and 3. The objects are mostly correctly segmented. However, in some cases oversegmentation occurs. This is mostly due to the choice of the tolerance, ε. Furthermore, in some cases, almost identical scenes result in different segmentation. In general, the smaller the value of ε, the greater the probability of oversegmentation. On the other hand, setting ε to greater values could result in merging of separate objects into a single segment. Nevertheless, the proposed method is designed as a mid-level tool which requires appropriate grouping of the obtained
segments on a higher level. This grouping is simplified by the fact that a relatively small number of dominant segments must be considered for grouping on the higher level.
Fig. 2 Camera image (top left); depth image obtained by Kinect (top right); triangle mesh (bottom left) and segmentation (bottom right).
Fig. 3 Camera image (top left); depth image obtained by Kinect (top right); triangle mesh (bottom left) and segmentation (bottom right).
In order to evaluate the segmentation obtained by the proposed method, we used V-measure [26], a cluster evaluation measure based on the conditional entropy. This measure combines two criteria for a segmentation result:
homogeneity and completeness. A segmentation satisfies homogeneity if every segment is a subset of a ground truth segment, while satisfying completeness assumes that every ground truth segment is a subset of a segment belonging to the evaluated segmentation. Both homogeneity and completeness are assessed by a value between 0 and 1 and these two values
5 th European Conference on Mobile Robots, September 7-9, 2011, Örebro, Sweden
130
5
are combined in V-measure. The approach proposed in [26] allows different weighting of homogeneity and completeness. In this work, we use equal weights for both properties. The evaluation is performed by applying the proposed algorithm to 12 depth images representing scenes with several objects positioned on a table. Two such scenes are shown in Figs. 4 and 5. The segmentation results obtained by the proposed approach are compared to ground truth segmentation that is obtained by hand-labeling.
Fig. 4 Camera image (top left); ground truth (top right); segmentation obtained by PCL (bottom left) and segmentation obtained by our method (bottom right).
Fig. 5 Camera image (top left); ground truth (top right); segmentation obtained by PCL (bottom left) and segmentation obtained by our method (bottom right).
In order to explore the influence of the threshold ε to the segmentation quality, we performed a series of experiments by changing ε from 0, which corresponds to the segmentation to strictly convex surfaces, to 20. The average homogeneity, completeness and V-measure over 12 considered depth images for different values of ε are presented in Fig. 6. The best
segmentation according to V-measure is achieved with ε = 7. We compared our method (Segmentation to Approximately Convex Surfaces – SACS) to the segmentation obtained by the tools SACSegmentation and EuclideanClusterExtraction of the Point Cloud Library (PCL). These two tools are used as explained in the following. First, SACSegmentation is used to detect two dominant planar surfaces: the table surface and the wall. This detection is performed using RANSAC with 100 iterations and distance threshold parameter set to 0.02. Then, EuclideanClusterExtraction is applied with cluster tolerance set to 0.02, minimum cluster size set to 10 and maximum cluster size set to 250000. The obtained results are compared to the segmentation obtained by our method with ε = 7 using V-measure. This comparison is presented in Fig. 7. It indicates that our method outperforms PCL for all 12 test samples. One of the most important advantages of the proposed approach is its computational efficiency, which makes it suitable for application in robotics. The execution time of our method (SACS) for the considered 12 test samples is shown in Tab. 1, where it is compared to the execution time of PCL.
Fig. 6
values ε.
Homogenity, completeness and V-measure for different
V-Measure
Image
Fig. 7 Comparison of the proposed method (SACS) and PCL according to the V-measure.
5 th European Conference on Mobile Robots, September 7-9, 2011, Örebro, Sweden
131
6
Tab.
1
Execution time of the proposed method (SACS) and PCL in seconds.
Exec.
time
mesh
building
SACS
segment.
total
|
0.119 |
0.036 |
0.155 |
|
0.176 |
0.128 |
0.278 |
|
0.146 |
0.066 |
0.212 |
PCL
|
min. |
1.293 |
|
max. |
2.492 |
|
average |
1.626 |
V.
CONCLUSION
An approach for the segmentation of range images into approximately convex surfaces is presented. From the original range image a 2.5D triangle mesh is created. This mesh is then segmented to approximately convex surfaces using a region growing technique based on the incremental convex hull algorithm. The algorithm requires the user to specify the tolerance up to which the surfaces in the scene can be considered as convex surfaces. The method is experimentally evaluated and compared to the 3D segmentation tools provided by the Point Cloud Library (PCL). For the scenes used in this evaluation, our method achieved better segmentation according to V-measure in significantly shorter time. We believe that the output of the proposed method could provide a good basis for a high-level object recognition or robot localization algorithm. Nevertheless, the critical aspect of the considered algorithm is tolerance which significantly influences the result. Possible extensions of the presented research are search for an appropriate tolerance adaptation strategy as well as designing a robot localization or object recognition system which would rely on the convex surfaces provided by the proposed approach.
REFERENCES
[1] S. Se, D. G. Lowe and J. J. Little, Vision-Based Global Localization and Mapping for Mobile Robots, IEEE Trans. on Robotics, vol. 21, no. 3, June 2005, pp. 364–375. [2] X. Lebègue, J. K. Aggarwal, Significant Line Segments for an Indoor Mobile Robot, IEEE Trans. on Robotics and Automation, vol. 9, no. 6, December 1993. [3] D. Kim and R. Nevatia, Recognition and localization of generic objects for indoor navigation using functionality, Image and Vision Computing vol. 16, Issue 11, 1 August 1998, pp. 729–743. [4] D. Comanicu and P. Meer, Mean Shift: A Robust Approach Toward Feature Space Analysis, IEEE Trans. on Pattern Analysis and Machine Intelligence, May 2002. [5] P. F. Felzenszwalb, and D. P. Huttenlocher, Efficient Graph-Based Image Segmentation, International Journal of Computer Vision, vol. 59, no. 2, September 2004. [6] A. Hoover et al., An Experimental Comparison of Range Image Segmentation Algorithms, IEEE Transactions on Pattern Analysis and Machine Intelligence, July 1996, pp. 673-689. [7] S. Suganthan, S. Coleman and B. Scotney, Single-step Planar Surface Extraction from Range Images, in Proceedings of the Fourth International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT 08), Georgia Institute of Technology, Atlanta, GA, USA, 2008, pp363-370. [8] S. M. Bhandarkar and A. Siebert, Integrating edge and surface information for range image segmentation, Pattern Recognition, vol. 25, no. 9, September 1992, pp 947-962.
[9] X. Jiang, Qualitative Decomposition of Range Images into Convex Parts / Objects, Proceedings of the International Workshop on Machine
Vision Applications (MVA), The University of Tokyo, Japan, November 2000. [10] X. Yu, T. D. Bui, and A. Krzyzak, Robust Estimation for Range Image Segmentation and Reconstruction, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 5, May 1994, pp. 530–
538.
[11] K. M. Lee, P. Meer, and R. H. Park, Robust Adaptive Segmentation of Range Images, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 2, February 1998, pp. 200–205. [12] H. Wang, and D. Suter, Robust Adaptive-Scale Parametric Model Estimation for Computer Vision, IEEE Transactions on Pattern Analysis and Machine Intelligence, November 2004, pp. 1459–1474. [13] R. Cupec et al, Detection of Planar Surfaces Based on RANSAC and LAD Plane Fitting, Proceedings of the 4th European Conference on Mobile Robots (ECMR), Mlini/Dubrovnik, Croatia, September 2009, pp. 37-42. [14] F. Schmitt and X. Chen, Fast Segmentation of Range Images into Planar Regions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Maui, HI , USA, June 1991, pp 710 – 711. [15] R. Hoffman and A. K. Jain, Segmentation and classification of range images, IEEE Transaction on pattern analysis and machine intelligence, no. 5, 1987, pp. 608–620. [16] C. Zhao, D. Zhao and Y. Chen, Simplified Gaussian and mean curvatures to range image segmentation, Proceedings of the 13th International Conference on Pattern Recognition (ICPR), Vienna, Austria, vol.2, August 1996, pp. 427 – 431. [17] N. Amenta, M. Bern and M. Kamvysselis,. A new voronoi based surface reconstruction algorithm, Proceedings of the 25th annual conference on Computer graphics and interactive techniques (SIGGRAPH 98) 1998, pp. 415–421. [18] A. D. Sappa and M. A. Garcia, Coarse-to-fine approximation of range images with bounded error adaptive triangular meshes, Journal of Electronic Imaging, vol.16, no. 2, 023010, April 2007. [19] A. Agathos et al, 3D Mesh segmentation methodologies for CAD applications, Computer Aided Design & Applications, vol. 4, no. 6, 2007, pp. 827–841. [20] C. Kim, C.T. Haas, K.A. Liapi and C.H. Caldas, Human-assisted obstacle avoidance system using 3D workspace modeling for construction equipment operation, Journal of Computing in Civil Engineering, vol. 20, no.3, 2006, pp.177–186. [21] J. McLaughlin, S.V. Sreenivasan, C. Haas and K. Liapi, Rapid human- assisted creation of bounding models for obstacle avoidance in construction, Computer-Aided Civil and Infrastructure Engineering, vol. 19, no. 1, 2004, pp. 3–15. [22] H. Son, C. Kim, K. Choi, Rapid 3D object detection and modeling using range data from 3D range imaging camera for heavy equipment operation, Automation in Construction, vol. 19, no. 7, November 2010, pp 898-906. [23] T. Lambert, Incremental Algorithm, http://www.cse.unsw.edu.au/
~lambert/java/3d/incremental.html
[24] Primesense reference design, http://www.primesense.com/ [25] OpenKinect, http://openkinect.org/wiki/Main_Page [26] A. Rosenberg, J. Hirschberg, V-Measure: A conditional entropy-based external cluster evaluation measure, Proceedings of EMNLP- CoNLL'2007, pp. 410–420. [27] R. B. Rusu, S. Cousins, 3D is here: Point Cloud Library (PCL), Proceedings of IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, 2011.
5 th European Conference on Mobile Robots, September 7-9, 2011, Örebro, Sweden
132
1
Data registration module - a component of semantic simulation engine
Janusz Bedkowski ∗ †
∗ Institute of Automation and Robotics, Warsaw University of Technology, Poland
† Industrial Research Institute for Automation and Measurements, Warsaw, Poland
Abstract — In this paper the data registration module being a component of semantic simulation engine is shown. An improved implementation of ICP (Iterative Closest Point) algorithm based on GPGPU (General-purpose computing on graphics processing units) is proposed. The main achievement is on-line aliment of two data sets composed of up to 262144 3D points, therefore it can be used during robot motion. The algorithm uses GPGPU NVIDIA CUDA for NNS (Nearest Neighborhood Search) computation and to improve the performance all ICP steps are implemented also on GPU. Experiments performed in INDOOR and OUDROOR environments show benefits of parallel computation applied for on-line 3D map building. Empirical validation using new generation CUDA architecture, named Fermi is shown.
Index Terms— Data registration, parallel computing, iterative closest point
I. INTRODUCTION
Semantic simulation engine [12] is a project that main idea is to improve State of The Art in the area of semantic robotics [5] [16] with the focus on practical applications such as robot supervision and control, semantic map building, robot reasoning and robot operator training using augmented reality techniques [14] [11]. Data registration using Iterative Closest Point algorithm is a very important component of Seman- tic Simulation Engine. Semantic simulation engine combines semantic map with rigid body simulation to perform super- vision of its entities such as robots moving in INDOOR or OUTDOOR environments composed by floor, ceiling, walls, door, tree, etc. Semantic simulation engine is composed of data registration modules, semantic entities identification (data segmentation) modules and semantic simulation module. It provides tools to implement mobile robot simulation based on real data delivered by robot and processed on-line using parallel computation. Semantic entities identification modules can classify several objects in INDOOR and OUTDOOR environments. Data can be delivered by robot observation based on modern sensors such as laser measurement system 3D and RGB-D cameras. Real objects are associated with virtual entities of simulated environment. The concept of semantic simulation is a new idea, and its strength lies on the semantic map integration with mobile robot simulator. Alignment and merging of two 3D scans, which are obtained from different sensor coordinates, with respect to a reference coordinate system is called 3D registration [10] [6] [15]. Park [18] proposed a real-time approach for 3D registration using GPU, where the registration technique is based on the Iterative Projection Point (IPP) algorithm. IPP technique is a combination of point-to-plane and point-to-projection
registration schemes [19]. Processing time for this approach is about 60ms for aligning 2 3D data sets of 76800 points during 30 iterations of the IPP algorithm. Fast searching algorithms such as the k-d tree algorithm are usually used to improve the performance of the closest point search [17] [23]. GPU accelerated nearest neighbor search for 3D registration is proposed in work of Qiu [22], where the advantage of Arya’s priority search algorithm described in [4] to fit NNS in the SIMD (Single Instruction Multiple Data) model was used for GPU acceleration purpose. Purcell suggested that k-d tree and priority queue methods are efficient but difficult to be implemented on GPU [21]. Garcia proves, that a brute force NNS approach using NVidia Compute Unified Device Architecture (CUDA) is 400 times faster over the CPU k-
d tree implementation [8]. GPU-based NNS with advanced
search structures is also used in the context of ray tracing [7], where NNS procedure builds trees with a different manner
from a triangle soup, and takes these triangles as the objects of interest. To convert k-d tree into serialized flat array that can be easily loaded into CUDA device, left-balanced k-d tree was proposed [13] [22]. Another technique for 3D registration using Fast Point Feature Histograms (FPFH) is shown in the work of Rusu [24]. Rusu also proposed a way of characterizing the local geometry of 3D points, using persistent feature histograms, where the relationships between the neighbors of a point are analyzed and the resulted values are stored in a 16-bin histogram [25]. The histograms are pose and point cloud density invariant and cope well with noisy datasets. An alternative concept to ICP algorithm which relies on instantaneous kinematics and on the geometry of the squared distance function of a surface is shown in the work of Pottmann [20]. The proposed algorithm exhibits faster convergence than ICP, which is supported both by results of
a local convergence analysis and by experiments. The ICP
algorithm is used in SLAM 6D (Simultaneous Localization and Mapping), where 6 DOF (Degree of freedom) of robot position is computed based on aliment of 3D clouds of points and loop-closing technique [28]. The paper shows an improved implementation and empirical validation of GPGPU ICP algorithm. This algorithm offers on- line computation. The main difference compared to State of The Art approaches is NNS procedure, where 3D space was
divided into regular grid of buckets, therefore there is no need
to build complex data structures such as k-d tree, and the time
of ICP is decreased. The proposed solution is efficient since it performs nearest neighbor search using a bucket data structure
(sorted using CUDA primitive) and computes the correlation
5 th European Conference on Mobile Robots, September 7-9, 2011, Örebro, Sweden
133
ONEER 3AT equipped with 3D laser measurement system CK LMS 200).
arallel CUDA all-prefix-sum instruction.
s organized as follows: section II gives an
rning harware used in experiments, section III DIA GPGPU processor characteristic features. thm is presented in section IV where all details allel computing are given. The results of data INDOOR and OUTDOOR environments are
on V.
II. ROBOT
ments commercially available robot PIONEER (figure 1). Robot is equipped with 3D laser ystem 3DLSN based on rotated SICK LMS 200. ystem delivers 3D cloud of points in stop-scan scan contains 361 (horizontal) x 498 (vertical)
IV.
3D DATA REGISTRATION
Aligning two-view range images with respect to the ref- erence coordinate system is performed by the ICP (Iterative Closest Points) algorithm. Range images are defined as a model set M and data set D, N m and N d denotes the number of the elements in the respective set. The alignment of these two data sets is solved by minimization with respect to R,t of the following cost function:
E (R, t) =
N m N d
i=1 j=1
w ij m i − (Rd j + t) 2
(1)
w ij is assigned 1 if the i-th point of M correspond to the j-th point in D. Otherwise w ij =0. R is a rotation matrix, t is a translation matrix. m i and d i corresponds to the i-th point from model set M and D respectively. The ICP algorithm using CUDA parallel programming is given:
Algorithm 1 ICP - parallel computing approach
allocate the memory copy data from the host(M host , D host ) to the
device(M device , D device )
for iter = 0 to max iterations do
select closest points between M device and D device calculate (R, t) that minimizes equation 1
transform D device by (R, t) and put the results into
D deviceRt copy D deviceRt to D device
end for
copy D device to D host
free memory
III.
GPU ARCHITECTURE
|
GPUs are programmable multi core chips built |
A. Memory allocation on device and copy from host |
|
|
y |
of processors working in parallel. The GPU |
Memory allocation on device and copy from host is a main bottleneck of proposed implementation. Data transfer between host and the device is limited by memory bandwidth and should be minimized. Proposed ICP algorithm is implemented in GPU (all ICP steps are performed using GPU), therefore the need of data transfer is limited to copying cloud of points to/from GPU. Figure 2 and algorithm 2 shows main data used in presented approach. The table of found buckets, table of sorted buckets, table of sorted points consist of 512 × 512 integer elements, table of amount of points in bucket and table of bucket indexes consist of 256 × 256 × 256 integer elements. M (reference data) and D (data to be align) data sets are stored in six 512×512 tables consisting float values stored in one dimensional array. For sorting the table of buckets routine described in section IV-B.2 and used in algorithm 2 the CUDA Radix Sort class available in [1] briefly described in [26] and [27] is used. The method initialize() called by the constructor of Radix Sort Class allocates temporary storage for the sort and the prefix sum that it uses. Temporary storage is (2*NUMBER OF POINTS + 3*8*M/CTA SIZE) unsigned ints with a default CTA SIZE of 256 threads and |
|
f an array of streaming multiprocessors (SM), |
||
|
them can launch up to 1024 co-resident con- . Currently available graphics units are in the SM up to 30 SMs for the high end products. |
||
|
M |
contains 8 scalar processors (SP) each with |
|
|
gisters. The total of 64KB of register space each SM. Each SM is also equipped with a memory that is characterized by low access |
||
|
h |
bandwidth. It is important to realize that all |
|
|
ment (creation, scheduling, synchronization) is |
||
|
ardware and the overhead is extremely low. SM scheme (Single Instruction, Multiple Thread), are executed in groups of 32 called warps. mming model defines the host and the device. CPU sequential procedures while the device |
||
|
el |
GPU programs - kernels. A kernel works |
|
|
heme (Single Program, Multiple Data). CUDA tage of using massively parallel computation plications. Detailed GPU architecture can be riginal documentation [3]. Useful additional ssues are published in best practices guide [2]. |
||
5 th European Conference on Mobile Robots, September 7-9, 2011, Örebro, Sweden
134
3
Fig. 2.
Initial steps: selection of closest points procedure.
NUMBER OF P OINT S = 512 × 512. Amount of data is
large, therefore the procedure of memory allocation is done initially.
B. Selection of closest points
The distance between two points in Euclidean distance
metric for point p 1 = {x 1 , y 1 , z 1 } and
p 2 = {x 2 , y 2 , z 2 } is:
1
distance(p 1 , p 2 ) = (x 1 − x 2 ) 2 + (y 1 − y 2 ) 2 + (z 1 − z 2 ) 2
(2)
To find pairs of closest points between model set M and data set D, the decomposition of XYZ space, where x,y,z ∈ < −1, 1 >, into 2 8 x2 8 x2 8 buckets is proposed. The idea of the decomposition will be discussed for 2 2 x2 2 x2 2 case. Figure 3 shows decision tree that decomposes XYZ space into 64 buckets.Each node of the decision tree includes boundary decision, therefore points are categorized into left or right branch. Nodes that do not have branches assign buckets. Each bucket has unique index and is related to cubic subspace with length,width,height = 2/2 2 ,2/2 2 ,2/2 2 . Each bucket that does not belong to border has 26 neighbors. The 27 neighboring cubic subspaces are shown on figure 4 where also the way of indexing is given. Figure 5 demonstrates the idea of nearest neighbor (NN) search technique on 2 dimensional example. Assuming that we are looking for nearest neighbor that satisfies condition R < 2/2 8 and circle R=2/2 8 ⊂ bucket 3R NN will be found in the same bucket or in neighboring bucket (in this example NN of point d is m5). Algorithm 2 describes the procedure of selection of closest points. For better explanation figure 2 shows initial steps of this algorithm where set M of 10 points from figure 5 is used for NN search. The details of the algorithm will be discussed in next subsections. 1) Find bucket: Figure 3 shows tree structure used for indexing of 2 2 x2 2 x2 2 buckets. The concept of finding bucket index for point m xyz is shown on the scheme 6, where x corresponds to border for current level in the tree and 0, 1,
2,
correspond to actual bucket index during its
computation. The bucket indexing procedure is executed in parallel, where each CUDA kernel computes bucket index for different point xyz . 2) Sort buckets: Radix sort class [1] is used to sort unsigned integer key-value pairs. Keys correspond to the elements of table of buckets and value corresponds to elements from table
2
3,
14,
Fig. 3. Tree structure used for XYZ space decomposition into 64 buckets. From root to the leaf/bucket chosen left or right brunch depends on the current separation line.
Fig. 4. Cubic subspaces - neighboring buckets, the way of indexing is explained in section IV-B.1.
Fig. 5.
2 dimensional example of NN search in neighboring buckets.
Algorithm 2 Selection of closest points
for all points m xyz in parallel do find bucket m update table of found buckets end for in parallel sort table of found buckets {radix sort} in parallel count points in each bucket for all points d xyz in parallel do find bucket d for all neighbors of bucket d do find NN for d xyz {nearest neighbor is one from m xyz } end for end for
Fig. 6.
The scheme of bucket indexing procedure.
5 th
European Conference on Mobile Robots, September 7-9, 2011, Örebro, Sweden
135
4
of points. Procedure outputs sorted table of buckets. Figure 2
shows an example of sorting result. Radix sort is a well known sorting algorithm, very efficient on sequential machines for sorting small keys. It assumes that the keys are d-digit numbers and sorts on one digit of the keys at a time, starting from least and finishing on most significant. The complexity of sorting
n keys will be O(n). The details of GPU based radix sort
implementation can be found in [26] [27]. It is important to emphasized that the implementation of GPU based radix sort
is robust, therefore it can be used for on-line computation.
3) Count points in bucket and find index of bucket: In the procedure of counting points that belong to the same bucket the counting is based on table of sorted buckets (see figure 2). It is important to notice, that also the index of the found bucket is computed. This index, along with the information concerning an amount of points in the bucket, will be used for searching nearest neighbor in algorithm 2.
C. Calculation of (R,t)
Calculation of the rotation and translation (R,t) is performed
using reduced equation
1:
where
E (R, t) ∝
1
N
N
i=1
m i − (Rd i + t) 2
N =
N m N d
w ij
i=1 j=1
(3)
(4)
Rotation R is decoupled from computation of translation t using the centroids c m and c d of points:
c m =
c d =
1
N
1
N
N
i=1
N
i=1
m i
d
i
and modified data sets:
M = {m i = m i − c m } 1,
D = {d i = d i − c d } 1,
,N
,N
(5)
(6)
(7)
(8)
After applying equations 5, 6, 7 and 8 to the mean square error function E(R, t), the equation 3 takes the following form:
E (R, t) ∝
1
N
Assuming that:
N
i=1
m i − Rd i − (t − c m + Rc d ) 2 (9)
t − c m + Rc d = t
(10)
equation
9 takes following form:
E (R, t) ∝
1 N
N i=1 m i − Rd i 2 −
N t ·
2
N
i=1 (m i − Rd i ) +
1
N
N
i=1
t
2
(11)
|
To minimize |
11 the algorithm has to minimize the following |
|
term: |
|
N
i=1
with the assumption:
|
m i − Rd i 2 |
(12) |
|
t = 0 |
(13) |
The optimal rotation is calculated by R= VU T , where matrices V and U are derived from the singular value decomposition of a correlation matrix C = USV T given by:
where:
C =
N
i=1
m i T d i =
c
xx
c yx
zx
c
c
xy
c yy
zy
c
c
c
c
xz
yz
zz
(14)
N
N
N
i=1
c xx =
(15)
Correlation matrix elements are computed using optimized parallel reduction described in section IV-C.1. The optimal translation t is derived from equation 13 and 10, therefore
t = c m − Rc d
1) Optimized parallel reduction: Parallel computation of the correlation matrix (equation 14) is performed using prefix sum [9] available in CUDA. The all-prefix-sums operations
take a binary associate operator ⊕ with identity I, and an array of n elements
(17)
In the result following array is returned:
(18)
All-prefix-sums operations on array of data is commonly known as scan. The parallel implementation uses multiple thread blocks for processing an array of 512 × 512 data points stored in one dimensional array. The strategy is to keep all multiprocessors on the GPU busy to increase the performance. Each thread block is responsible for reducing a portion of the array. To avoid the problem of global synchronization, the computation is decomposed into multi kernel invocations. Optimized kernel available in CUDA is used in parallel computation.
(16)
m ix d ix , c xy =
i=1
m ix d iy ,
i=1
,
c zz =
m iz d iz
[a 0 , a 1 ,
, a n−1 ]
[I, a 0 , (a 0 ⊕ a 1 ) ,
, (a 0 ⊕ a 1 ⊕
⊕ a n−2 )]
V. E XPERIMENTS
Experiments where performed in INDOOR and OUTDOOR environments shown on figure 7, where robot was acquiring observations in stop-scan fashion in one meter step. The goal was to align iteratively all scans, therefore the odometry error was deceased. INDOOR data set is composed of 142
5 th European Conference on Mobile Robots, September 7-9, 2011, Örebro, Sweden
136
Robot paths during INDOOR and OUTDOOR 3D data acquisition.
Comparison between the robot trajectory based on combined with gyroscope correction system and GPGPU based ICP with 10
terations - INDOOR. Robot initial and final positions are shown as
7.
f 361 × 498 3D data points. OUTDOOR data set is
ed of 61 3D scans. The computation is performed using
A GF 580 GTX GPGPU. The comparison between the
ajectory based on combined odometry with gyroscope
|
on |
system and GPGPU based ICP with 10 and 100 |
|
s |
is shown on figure 8 for INDOOR and 9 for |
OOR. The error E(R,t) (see eq. 1 ) for each robot
in function of ICP algorithm iterations is shown on 10. The error E(R,t) obviously is decreasing mono-
y during performed ICP iterations. Time of ICP for
bot position in function of ICP iterations is shown
|
e |
11. The worst, best and average ICP performance |
|
d |
during ICP alignments are shown on figure 12. It |
Fig. 9. Comparison between the robot trajectory based on co odometry with gyroscope correction system and GPGPU based ICP and 100 iterations - OUTDOOR. Robot initial and final positions are as in figure 7.
Error E(R,t) − INDOOR
(a) INDOOR
Error E(R,t) − OUTDOOR
150
70
5 th European Conference on Mobile Robots, September 7-9, 2011, Örebro, Sweden
137
6
ICP time[ms]
ICP time[ms]
ICP measured performance − INDOOR
(a) INDOOR
100
ICP measured performance − OUTDOOR
robot position
80
0
number of ICP iterations
Fig. 11. Time of ICP for each robot position in function of ICP algorithm iterations: (a) INDOOR, (b) OUTDOOR.
Fig. 12. The worst, best and average ICP performance measured during ICP alignments: (a) INDOOR - 142 scans, (b) OUTDOOR - 61 scans.
FaMSA: Fast Multi-Scan Alignment with Partially Known Correspondences
Ernesto H. Teniente and Juan Andrade-Cetto
Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Barcelona, Spain
Abstract — This paper presents FaMSA, an efficient method to boost 3D scan registration from partially known correspondence sets. This situation is typical at loop closure in large laser-based mapping sessions. In such cases, scan registration for consecutive point clouds has already been made during open loop traverse, and the point match history can be used to speed up the computation of new scan matches. FaMSA allows a new scan to be quickly matched with multiple consecutive scans at a time, with the consequent benefits in computational speed. Registration error is shown to be comparable to that of independent scan alignment. Results are shown for dense 3D outdoor scan matching.
Index terms – ICP, 3D scan registration, scan alignment.
I. INTRODUCTION
The Iterative Closest Point (ICP) algorithm is the de-facto standard for range registration in 3D mapping. It is used to
compute the relative displacement between two robot poses by pairwise registration of the point clouds sensed from them. In a typical mapping session, consecutive pairwise registration
is performed during open loop traverse, and accumulates drift
error. This error is corrected by closing loops, i.e., matching point clouds with large temporal deviation (see Fig. 1). Most SLAM algorithms keep probabilistic estimates of the
robot location that can be used to determine whether or not
a loop closure test is advisable. For instance, by considering
not only pose uncertainty but information content as well [12]. But, once a loop closure test is deemed necessary, an algorithm that can compute it expeditiously is needed. Typically loop closure tests are checked not only from the current cloud to a query cloud in the past, but instead, to a consecutive set of query clouds in the past, which in turn have already been registered among them. Using this knowledge, we can expedite multiple registrations at a time. In this paper we propose FaMSA, a technique for fast multi-scan point cloud alignment at loop closure that takes advantage of the asserted point correspondences during sequential scan matching. The paper is organized as follows. A description of related work is given in Section II. Section III gives implementation details of our ICP algorithm; and Section IV elaborates on the particularities of the method. Experiments that validate the viability of the method are given in Section VI, and Section VII contains some concluding remarks.
II. R ELATED WORK
The most popular scan matching methods are based on the Iterative Closest Point algorithm [5]. The objective of this
Fig. 1. Dense point cloud registration during loop closure at the Barcelona Robot Lab.
algorithm is to compute the relative motion between two data sets partially overlapped by minimizing the mean squared error of the distance between correspondences in the two sets.
In the original algorithm, a point-to-point metric is used
to measure the distance between correspondences in the set. Point-to-plane metrics are also common practice [8], which make the method less susceptible to local minima. Furthermore, point-to-projection metrics are also possible [7], by matching points to ray indexes directly, inverting the ray casting process. A thorough account of these metrics and their properties is given in [21]. More recently, an error metric that weights rotation and translation unevenly was proposed for 2D [15], [16], and later extended to 3D [6]. The method uses point-to-projection minimization using triangles as the projection surface, and performs nearest neighbor search in the new metric space. FaMSA uses this metric for optimization.
ICP’s computational bottleneck is in correspondence search. Most strategies to accelerate this search rely on some prior ordering of points within each point cloud, and use tree-based search methods such as the Approximate Nearest Neighbor [19], [3] that uses balanced kd-trees; kd-trees with caching mechanisms [24]; or parallelized kd-trees [18].
A method for fast NN search that competes with kd trees
for execution speed is based on the spherical triangle con- straint [10]. Like in [24], point caching is maintained from one iteration to the next, and ordering and triangle constraints are used to quickly identify correspondences. Aside from tr ee structures, other space partitioning mechanisms that allow for fast NN search include double z-buffering [4] and grid decomposition [26].
Point sampling is also a common strategy used to accel-
erate the matching process. Sampling, however, only reduces
the computational cost by a constant factor. It
is common practice to use hierarchical coarse-to-fine sampling
methods to avoid missing fine resolution correspondences [25], [27]; and sampling can be either uniform [25], [7], random [14], or with ad-hoc reduction heuristics related to the sensing mechanism [20]. Outlier removal is also a major concern in most modern ICP implementations. Point rejection can be based on statistical point distributions [27], [14], [20], using fixed or dynamic distance thresholds [22], or using topological heuristics [25], [27], [22]. The idea of multi-scan alignment has been addressed as
a bundle adjustment problem for 2D range scans [13] using force field simulation. The work that most relates to ours is the latent map [11], a multi-scan matching technique for 2D range matching.
III. RANGE IMAGE REGISTRATION
A. Notation
The objective of the classic ICP algorithm is to compute the relative rotation and translation ( R, t ) between two partially overlapped point clouds P and Q , iteratively minimizing the mean square error over point matches. For a given set of point match indexes Y , ICP’s cost function is
\[ \arg\min_{R,t} \; \sum_{(i,j) \in Y} \left\| p_i - R q_j - t \right\|^2 . \tag{1} \]
This minimization is solved iteratively, revising at each iteration the list of point matches, using for instance, NN search.
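As an illustration of this iterative scheme (not the implementation evaluated in the experiments below), a minimal point-to-point ICP loop could look as follows; the nearest-neighbour search here uses a kd-tree from SciPy, and the closed-form pose update is the standard SVD-based solution for a fixed match set:

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_point_to_point(P, Q, R0=np.eye(3), t0=np.zeros(3), iters=100, tol=1e-6):
    """Estimate (R, t) minimising Eq. 1 with a point-to-point metric.

    P, Q: (N,3) and (M,3) arrays; matches are re-estimated at every iteration
    by nearest-neighbour search from the transformed Q to P.
    """
    R, t = R0.copy(), t0.copy()
    tree = cKDTree(P)                       # NN structure over the reference cloud
    prev_err = np.inf
    for _ in range(iters):
        Qt = Q @ R.T + t                    # apply the current guess to the query cloud
        dist, idx = tree.query(Qt)          # match each transformed q_j to its NN p_i
        Pm = P[idx]
        # Closed-form update of (R, t) for the fixed matches (Horn/Kabsch via SVD).
        cp, cq = Pm.mean(0), Q.mean(0)
        H = (Q - cq).T @ (Pm - cp)
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ D @ U.T
        t = cp - R @ cq
        err = np.mean(dist ** 2)
        if abs(prev_err - err) < tol:       # stop when the error no longer improves
            break
        prev_err = err
    return R, t
```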
B. Implementation details and computational complexity
Correspondence search is the most expensive step in the ICP algorithm. Finding the NN to a given query point relies on the ability to discard large portions of the data with simple tests. Brute force correspondence search would take O(n), with n the size of the point cloud. The preferred data structures used to solve the NN problem in low multidimensional spaces are kd-trees [9] with O(n log n) construction complexity and O(log n) search complexity. Box structures on the other hand take polynomial time to build [2] and constant time to search. Box structures are possible in ICP only when the initial and final poses do not change significantly so that NNs remain in the originally computed box. We implement Akca's box search structure with some modifications. The box structure in [2] assigns to empty boxes the index of the last occupied box. We instead leave empty boxes out of the search. This serves effectively as a fixed distance filter with significant savings in computational load. Our method is faster than using the optimized Approximate Nearest Neighbor (ANN) library [3] with fixed radius search, as shown in the experiments section. The original ICP algorithm of Besl and McKay [5] assumes that for each point in the reference set there must be a correspondence in the query set. In most applications this is not
FaMSA(P, P′, Q, Y, R, t, R0, t0)
INPUTS:
P, P′: Two consecutive query point clouds.
Q: Current point cloud.
Y: Correspondences between P and P′.
R, t: Relative displacement between P and P′.
R0, t0: Initial displacement between P and Q.
OUTPUTS:
R_P, t_P: Relative displacement between P and Q.
R_P′, t_P′: Relative displacement between P′ and Q.
1: R_P, t_P ← R0, t0
2: R_P′, t_P′ ← (R0, t0) ⊕ (R, t)
3: while not convergence do
4:   Z ← NNSEARCH(P, Q, R_P, t_P)
5:   Z′ ← LINK(Z, Y)
6:   R_P, t_P ← ICPUPDATE(P, Q, R_P, t_P, Z)
7:   R_P′, t_P′ ← ICPUPDATE(P′, Q, R_P′, t_P′, Z′)
8:   convergence ← (ε < T) and (ε′ < T)
9: end while
Algorithm 1: FaMSA: Fast multi-scan alignment with partially known correspondences
the case and adequate similarity tests must be implemented. Using point distance as the only criterion for point similarity usually leads to wrong data association and local minima.
We use, as in [22], oriented normal similarity constraints, together with statistical constraints [14], i.e., points at distances larger than a multiple of their standard deviation are rejected. These filtering strategies are time consuming, and should be used with discretion, since they require sorting and binary search. Correspondence uniqueness is also enforced and its implementation needs appropriate bookkeeping of matches at each iteration. Several metrics exist to find the closest point during correspondence search [23], [21]. We adopt in this work the metric proposed in [6], but use point-to-point matching instead
of point-to-triangle matching, and avoid the computational burden of computing the corresponding triangle mesh. The metric is an approximated distance that penalizes rotations with a user defined weight L,
\[ d(p_i, q_j) = \sqrt{ \left\| p_i - R q_j - t \right\|^2 - \frac{\left\| q_j \times (p_i - R q_j - t) \right\|^2}{\left\| q_j \right\|^2 + L^2} } , \tag{2} \]
and a point norm \( \|q\| = \sqrt{x^2 + y^2 + z^2 + L^2 \theta^2} \). The metric d substitutes the Euclidean distance in Eq. 1, and as L → ∞, this measure tends to the Euclidean distance.
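Purely as an illustration of Eq. 2 (the placement of the square root and norms follows our reading of the formula above and is an assumption), the approximated distance could be evaluated as:

```python
import numpy as np

def approx_metric_distance(p, q, R, t, L):
    """Approximated distance of Eq. 2, penalising rotations with the weight L.

    p, q: 3-vectors; R, t: current rotation and translation estimate.
    """
    e = p - R @ q - t                       # point-to-point residual
    num = np.cross(q, e) @ np.cross(q, e)   # squared norm of q x e
    den = q @ q + L ** 2
    d2 = e @ e - num / den                  # squared metric distance (non-negative)
    return np.sqrt(max(d2, 0.0))            # guard against round-off
```

As L grows, the correction term vanishes and the value approaches the Euclidean distance, matching the limit stated above.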
IV. FAST MULTI SCAN ALIGNMENT WITH PARTIALLY KNOWN CORRESPONDENCES
Given that correspondence search is the most expensive part
of any ICP implementation, we propose FaMSA to boost mul-
tiple scan alignment using previously known correspondences.
That is, given two previously aligned point clouds P and P ′ , the relative transformation between the two R , t , and a list Y
of correspondences, we want to find the registration between
the current point cloud Q and the two query scans P and P ′ .
FaMSA2(P, P′, Q, Y, R, t, R0, t0)
INPUTS:
P, P′: Two consecutive query point clouds.
Q: Current point cloud.
Y: Correspondences between P and P′.
R, t: Relative displacement between P and P′.
R0, t0: Initial displacement between P and Q.
OUTPUTS:
R_P, t_P: Relative displacement between P and Q.
R_P′, t_P′: Relative displacement between P′ and Q.
1: R_P, t_P ← R0, t0
2: while not convergence do
3:   Z ← NNSEARCH(P, Q, R_P, t_P)
4:   R_P, t_P ← ICPUPDATE(P, Q, R_P, t_P, Z)
5:   convergence ← (ε < T)
6: end while
7: R_P′, t_P′ ← (R_P, t_P) ⊕ (R, t)
8: while not convergence do
9:   Z′ ← LINK(Z, Y)
10:  R_P′, t_P′ ← ICPUPDATE(P′, Q, R_P′, t_P′, Z′)
11: end while
Algorithm 2: FaMSA2: Very fast multi-scan alignment with partially known correspondences
The method proceeds as follows. Standard correspondence search is implemented between clouds P and Q, and for each match between points p_i and q_j, a link to P′ is read from Y, and consequently the distance from q_j to p′_k is immediately established, avoiding the computation of similarity search and filters. Aside from the previous alignment of P and P′, the method needs, as any other iterative ICP algorithm, an initial estimation of the relative displacement between the query cloud Q and P. Algorithm 1 shows the approach.
In the algorithm, Z and Z′ indicate the correspondence sets between P and Q, and between P′ and Q, respectively. Appropriate index bookkeeping links one set to the other in constant time. The threshold T is used to indicate the maximum error allowed for the registration of both point clouds. The method also limits the search to a maximum number of iterations, typically set to 100.
The method is suboptimal in the sense that no new matches are sought between point clouds P′ and Q. For sufficiently close reference clouds P and P′ it does not impose a limitation on the quality of the final correspondence.
In the same way that FaMSA takes advantage of the point correspondences between P and P ′ to boost the computation of the relative displacement between P ′ and Q , one can also defer the estimation of the pose between P ′ and Q until all iterations for P have finished and use the result as a starting point for the second optimization. This method is shown in Algorithm 2.
Extensive experimentation shows that only one iteration of ICP update suffices to revise the pose of P ′ with respect to Q , once the relative transformation between P and Q has been optimized. We call this method FaMSA2.
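A compact sketch of the linking idea behind FaMSA and FaMSA2 (helper and variable names are placeholders, not the authors' MATLAB/C++ code; the pose update itself can reuse a closed-form step such as the one sketched in Section III):

```python
def link_correspondences(Z, Y):
    """FaMSA linking step: given the matches Z between the current cloud Q and the
    first query cloud P (pairs (i, j) meaning p_i <-> q_j), and the stored map Y
    from indices in P to indices in P', return matches between Q and P'."""
    return [(Y[i], j) for (i, j) in Z if i in Y]

# FaMSA2 flow, following Algorithm 2:
#  1. run standard ICP between P and Q, keeping the final match set Z;
#  2. Z2 = link_correspondences(Z, Y);
#  3. apply a single ICP update of (R_P', t_P') on Z2, with no NN search against P'.
```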
Fig. 2. Our mobile robotic platform.
V. EXPERIMENT SETUP
Our experimental data was acquired in the Barcelona Robot Lab, located at the Campus Nord of the Universitat Politècnica de Catalunya. The point clouds were captured using a Pioneer 3AT mobile robot and a custom built 3D laser with a Hokuyo UTM-30LX scanner mounted in a slip-ring. Each scan has 194,580 points with a resolution of 0.5 deg in azimuth and 0.25 deg in elevation. Figure 2 shows the coordinate frames of all of our robot sensors. For the work reported here, only 39 scans from this dataset were used. Figure 7(a) shows a partial view of the mapped environment. The entire dataset is available in [1]. Each scan was uniformly sampled for faster convergence using voxel space discretization with a voxel size of 0.35 meters. During sampling, we also computed surface normals and enforced a minimum voxel occupancy restriction of 4 points. Random sampling with set sizes of 20 points was used for those boxes exceeding that number of points. Normal orientations are computed after random sampling. This has been shown to produce better orientation estimates, especially around corners, when compared to other strategies such as k-NNs with density filtering. ICP is executed in open loop for 39 consecutive scans, storing all relative pose displacements as well as the correspondence indexes. Then, a number of possible loop closure locations were selected manually. FaMSA was executed on these loop closure candidates. The specific parameters of the ICP implementation include: maximum angle between normals of 35 deg; upper and lower bounds of sigma rejection at 0.25σ and 5σ, respectively; and maximum number of iterations at 100. For the execution times reported, experiments were run in MATLAB using mex files of C++ routines on an Intel Core 2
(a) Dense point cloud registration. Color indicates height.
(b) Robot trajectory. In green the initial pose, in red the final pose.
Fig. 3. A path with 39 poses around the FIB plaza of the Barcelona Robot Lab.
Fig. 4. Algorithm performance: (a) time required to match Q and P′, when the correspondences between P and P′ are known; (b) time required to match Q with both P and P′.
Quad CPU Q9650 3.0 GHz system, with 4 GB RAM running Ubuntu 10.04 32 bits.
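A minimal sketch of the voxel-grid subsampling described in the setup above (the voxel size, minimum occupancy and per-voxel sample size are taken from the text; the normal estimation step is omitted):

```python
import numpy as np

def voxel_subsample(points, voxel=0.35, min_occupancy=4, max_per_voxel=20, seed=0):
    """Keep voxels holding at least `min_occupancy` points and randomly subsample
    voxels holding more than `max_per_voxel` points down to that size."""
    rng = np.random.default_rng(seed)
    keys = np.floor(points / voxel).astype(np.int64)   # integer voxel coordinates
    voxels = {}
    for i, key in enumerate(map(tuple, keys)):
        voxels.setdefault(key, []).append(i)
    keep = []
    for idx in voxels.values():
        if len(idx) < min_occupancy:
            continue                                   # enforce minimum occupancy
        if len(idx) > max_per_voxel:
            idx = rng.choice(idx, size=max_per_voxel, replace=False).tolist()
        keep.extend(idx)
    return points[np.asarray(keep, dtype=int)]
```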
VI. RESULTS
First, we compare the execution time in seconds for various implementations of multi-scan ICP. To this end, 10 loop closure locations Q are selected in the trajectory, and each is compared against its query clouds P and P′. Figure 4(a) shows the time it takes to align the current cloud Q to the second query cloud P′, given that the correspondences between Q and the first cloud P are known. The methods BNN, ANN-FR and ANN refer to our implementation of voxel NNs, ANN with a fixed radius equal to the size of the voxels, and conventional ANN, respectively. FaMSA and FaMSA2 stand for the methods presented in this paper that make use of previous point correspondence indexes to speed up registration. Note that FaMSA2 is the
fastest of the methods, requiring only one iteration in the minimization. Extensive experimentation showed that further refinement in the case of FaMSA2 does not significantly improve the registration. Figure 4(b) plots the time it takes to register the current point cloud Q against both query clouds P and P′. The plot shows individual registration using BNN and combined registration using the proposed schemes BNN+FaMSA and BNN+FaMSA2. The advantages in computational load of using the proposed mechanism are significant. One might think that using only the correspondences in Y would yield suboptimal estimation. As a matter of fact, when using only this set to compute the relative displacement between P′ and Q, the number of correspondences effectively halves (see Fig. 5), but pose estimation accuracy does not suffer significantly.
Fig. 5. Number of correspondences between P ′ and Q running a full BNN compared to using the stored set Y .
Figure 6 plots proportional translation and rotational errors as compared with full ICP estimation using BNN, and computed as follows [17]: using as ground truth the relative pose between Q and P′ as computed with BNN (R_BNN, t_BNN), we measure the relative error of the estimated rotation (R, t) as E_R(%) = ‖q_BNN − q‖ / ‖q‖, where q_BNN and q are the normalized quaternions of the corresponding orientation matrices R_BNN and R, respectively. Similarly, the relative error of the estimated translation is computed with E_t(%) = ‖t_BNN − t_P′‖ / ‖t_P′‖. Translation error turns out to be less than 0.7% for FaMSA and for all cloud pairs, and less than 0.2% for FaMSA2. Rotation error is barely noticeable for both methods.
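For illustration, the two error measures can be computed as follows (this assumes SciPy's Rotation class for the matrix-to-quaternion conversion and ignores the q versus -q sign ambiguity of quaternions):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def relative_errors(R_bnn, t_bnn, R_est, t_est):
    """E_R and E_t in percent, as defined above, with BNN taken as ground truth."""
    q_bnn = Rotation.from_matrix(R_bnn).as_quat()   # unit quaternion (x, y, z, w)
    q_est = Rotation.from_matrix(R_est).as_quat()
    E_R = 100.0 * np.linalg.norm(q_bnn - q_est) / np.linalg.norm(q_est)
    E_t = 100.0 * np.linalg.norm(np.asarray(t_bnn) - np.asarray(t_est)) / np.linalg.norm(t_est)
    return E_R, E_t
```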
Figure 7 shows a sample of the point cloud match (best viewed in color). In blue, the current pose. In green and red, the query poses. A safe percentage of point cloud overlap in our method is roughly 50%. This is achieved with displacements of about 4 meters.
VII. CONCLUSIONS
This paper presents a novel ICP variation for simultaneous multiple scan registration that benefits from previously known correspondences. The speed-up gain is substantial when compared with other methods. The method uses a voxel structure to efficiently search for correspondences to the first cloud in the set, and a metric that unevenly weights rotation and translation. The method was devised to search for loop closures after long sequences in open loop traverse, but could be used for other configurations, provided the correspondences on the query set are known.
VIII. ACKNOWLEDGMENTS
This work has been partially supported by the Mexican Council of Science and Technology with a PhD Scholarship to
E.H. Teniente, by the Spanish Ministry of Science and Innova- tion under projects PAU (DPI2008-06022) and MIPRCV Con- solider Ingenio (CSD2007-018), and by the CEEDS (FP7-ICT- 2009-5-95682) and INTELLACT (FP7-ICT2009-6-269959) projects of the EU.
REFERENCES
[1] Barcelona Robot Lab data set. [online] http://www.iri.upc.edu/research/webprojects/pau/datasets/BRL/php/dataset.php, 2011.
[2] D. Akca and A. Gruen. Fast correspondence search for 3D surface matching. In Proc. ISPRS Workshop on Laser Scanning, pages 186–191, Enschede, Sep. 2005.
[3] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. J. ACM, 45(6):891–923, Nov. 1998.
[4] R. Benjemaa and F. Schmitt. Fast global registration of 3d sampled surfaces using a multi-z-buffer technique. Image Vision Comput., 17:113–123, 1999.
[5] P.J. Besl and N.D. McKay. A method for registration of 3D shapes. IEEE Trans. Pattern Anal. Machine Intell., 14(2):239–256, Feb. 1992.
[6] L. Biota, L. Montesano, J. Minguez, and F. Lamiraux. Toward a metric-based scan matching algorithm for displacement estimation in 3d workspaces. In Proc. IEEE Int. Conf. Robot. Automat., pages 4330–4332, Orlando, May 2006.
[7] G. Blais and M.D. Levine. Registering multiview range data to create 3D computer objects. IEEE Trans. Pattern Anal. Machine Intell., 17(8):820–824, Aug. 1995.
[8] Y. Chen and G. Medioni. Object modeling by registration of multiple range images. In Proc. IEEE Int. Conf. Robot. Automat., volume 3, pages 2724–2729, Sacramento, Apr. 1991.
[9] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM T. Math. Software, 3(3):209–226, Sep. 1977.
[10] M. Greenspan and G. Godin. A nearest neighbor method for efficient icp. In Proc. 3rd Int. Conf. 3D Digital Imaging Modeling, pages 161–168, Quebec, May 2001.
[11] Q-X Huang and D. Anguelov. High quality pose estimation by aligning multiple scans to a latent map. In Proc. IEEE Int. Conf. Robot. Automat., pages 1353–1360, Anchorage, May 2010.
[12] V. Ila, J. M. Porta, and J. Andrade-Cetto. Information-based compact Pose SLAM. IEEE Trans. Robot., 26(1):78–93, Feb. 2010.
[13] R. Lakaemper, N. Adluru, and L.J. Latecki. Force field based n-scan alignment. In Proc. European Conf. Mobile Robotics, Freiburg, Sep. 2007.
[14] T. Masuda. Registration and integration of multiple range images by matching signed distance fields for object shape modeling. Comput. Vis. Image Und., 87(1):51–65, 2002.
[15] J. Minguez, F. Lamiraux, and L. Montesano. Metric-based scan matching algorithms for mobile robot displacement estimation. In Proc. IEEE Int. Conf. Robot. Automat., pages 3568–3574, Barcelona, Apr. 2005.
[16] J. Minguez, L. Montesano, and F. Lamiraux. Metric-based iterative closest point scan matching for sensor displacement estimation. IEEE Trans. Robot., 22(5):1047–1054, Oct. 2006.
[17] F. Moreno-Noguer, V. Lepetit, and P. Fua. EPnP: An accurate O(n) solution to the PnP problem. Int. J. Comput. Vision, 81(2):155–166, 2009.
[18] A. Nüchter. Parallelization of scan matching for robotic 3d mapping. In Proc. European Conf. Mobile Robotics, Freiburg, Sep. 2007.
[19] A. Nuchter, K. Lingemann, J. Hertzberg, and H. Surmann. 6D SLAM with approximate data association. In Proc. 12th Int. Conf. Advanced Robotics, pages 242–249, Seattle, Jul. 2005.
[20] A. Nuchter, H. Surmann, K. Lingemann, J. Hertzberg, and S. Thrun. 6D SLAM with an application in autonomous mine mapping. In Proc. IEEE Int. Conf. Robot. Automat., pages 1998–2003, New Orleans, Apr. 2004.
[21] S-Y. Park and M. Subbarao. A fast point-to-tangent plane technique for multi-view registration. In Proc. 4th Int. Conf. 3D Digital Imaging Modeling, pages 276–283, Banff, Oct. 2003.
[22] K. Pulli. Multiview registration for large data sets. In Proc. 2nd Int. Conf. 3D Digital Imaging Modeling, pages 160–168, Ottawa, Oct. 1999.
Fig. 6. Proportional translation and rotation errors for the registration between Q and P′ with the proposed methods: (a) relative translational error; (b) relative rotational error. BNN is used for ground truth comparison.
Fig. 7. A loop closure location between clouds 3, 4, and 28 in the BRL dataset (best viewed in color): (a) P in yellow, P′ in red, and Q in blue; (b) top view.
[23] S. Rusinkiewicz and M. Levoy. Efficient variants of the ICP algorithm. In Proc. 3rd Int. Conf. 3D Digital Imaging Modeling, pages 145–152, Quebec, May 2001.
[24] D.A. Simon, M. Hebert, and T. Kanade. Real-time 3D pose estimation using a high-speed range sensor. In Proc. IEEE Int. Conf. Robot. Automat., volume 3, pages 2235–2241, New Orleans, Apr. 2004.
[25] G. Turk and M. Levoy. Zippered polygon meshes from range images. In Computer Graphics. Proc. ACM SIGGRAPH Conf., pages 311–318, Orlando, Jul. 1994. ACM Press.
[26] S.M. Yamany, M.N. Ahmed, and A.A. Farag. A new genetic-based technique for matching 3D curves and surfaces. Pattern Recogn., 32(10):1827–1820.
[27] Z. Zhang. Iterative point matching for registration of free-form curves and surfaces. Int. J. Comput. Vision, 13:119–152, 1994.
Robust Real-Time Number Sign Detection on a Mobile Outdoor Robot
Sebastian A. Scherer∗, Daniel Dube∗, Philippe Komma∗, Andreas Masselli∗, Andreas Zell∗
∗ Department of Computer Science, University of Tuebingen, Tuebingen, Germany
Abstract — We present the evolution of a highly-efficient system for visually detecting number signs in real-time on a mobile robot with limited computational power. The system was designed for use on a robot participating in the 2010 “SICK robot day” robotics challenge, in which the robot needed to autonomously find 10 number signs and drive to all of them in a given order as fast as possible. Our most efficient method has proven to be robust and successful, since our robot won the competition using the final vision systems described in this paper.
Index Terms — number detection, number classification, real-time, robotic vision.
I. INTRODUCTION
Automatic sign detection is an essential task for robots
working in environments built for humans: Signs can serve as landmarks and convey messages for both humans and robots.
In this work we try to find the best system for detecting a set of number signs used for marking subgoals in the 2010 “SICK
robot day” robotics challenge. The goal of this challenge was to build a robot that autonomously finds and drives to 10
number signs in a given order within at most 10 minutes. Hence the system had to run on an on-board processor of a mobile robot and had to provide real-time performance.
Even though the system presented here solves a very specific task, there are several domains of research which are related
to the problem at hand: recognition of text on certain objects
such as license plates [5, 11, 9] and containers [12], video
OCR [16], and text detection in natural images [15, 23, 20].
A few examples for the latter field of research are Liang et al.
[14], who give an overview of document analysis from camera-captured images. Yamaguchi et al. in [22] address the problem of recognizing phone numbers in natural images. Both take advantage of the fact that there is more than a single character and that characters are usually arranged in lines. In contrast, de Campos et al. [7] focus on recognizing single characters in natural images, which is more similar to our problem at hand. Another related area of research is the detection of signs, e.g. traffic signs. In comparison with video OCR, sign detection is a more challenging problem due to the requirement of coping with highly dynamic settings in real-time. Note, however, that the task of sign detection is facilitated by the fact that almost all traffic signs are either colored in, or at least framed by, an easy-to-detect signal color to make them easier to spot. This also implies that there is usually a distinct edge around their boundary, which renders their detection less demanding and makes color and shape information popular features for detecting traffic signs [8].
Fig. 1. Our robot driving towards sign number 5 during its winning run at the “SICK robot day” robotics challenge 2010.
For the “SICK robot day” competition, it was not totally clear whether there would be an always visible boundary that could be used for sign detection. Hence, we could not apply car number extraction techniques based on color segmentation [13] or rectangular area detection [11]. The contribution of this work is to provide a framework which addresses these problems in real-time. As shown in Sect. V the proposed method allows for robust number detection and classification in a highly dynamic environment. The remainder of this paper is organized as follows: In Sect. II we describe the 2010 “SICK robot day” robotics challenge. Sect. III provides details of our experimental setup. In Sect. IV we describe four different number detection and classification approaches we implemented. The results are presented and discussed in Sect. V. Finally, Sect. VI gives conclusions.
II. THE 2010 “SICK ROBOT DAY” CHALLENGE
The 2010 “SICK robot day” challenge was an interna- tional robotic race for universities, colleges and other research groups. Participating teams were required to build and program an autonomous mobile robot that is able to explore and navi- gate an unknown environment, find and recognize landmarks given by number signs, and reach all of them in a given order as fast as possible. The organizers originally announced that the challenge could take place either indoors or outdoors in
Fig. 3. Patches used for training the classifier.
in different scales. Window positions at which the classifier predicts a face are memorized to produce the final detection result for the image. One crucial idea in [19] is the introduction of a so-called attentional cascade, which greatly improves the speed of the overall detector. A classifier consists of a cascade of simpler classifiers, or layers, which are applied successively. It starts with the simplest layer, which is designed to be quickly evaluated and highly sensitive, allowing a high false positive rate. This layer can already reject many non-faces, so the remaining layers focus on the image patches accepted by the first layer. The whole classifier only accepts an image patch as a face if it has been predicted positive by all layers. Originally published as a fast and robust face detector, the system can be trained with arbitrary objects. We created a number detector by training the system with patches of number signs that were cropped out of our log files (see also section III-B). Examples of the image patches are shown in figure 3. Since a single detector can only distinguish between two classes, usually objects and non-objects, we arrange multiple detectors to create a classifier which is capable of telling not only if a number is found, but also which number is predicted (see Figure 4). The first stage has been designed to quickly select patches which are candidates for a number and reject the majority of image patches, which do not have to be processed further. Only patches that pass the first stage are examined more thoroughly by the classifiers sensitive to a certain number. By using a first stage common for all numbers we extended the idea of the attentional cascade described in [19], overcoming the burden of applying 10 different classifiers, one for each number, to all window positions and scales when scanning an image. As shown in Figure 4, the structure of the additional processing is linear and relies on a robust response of the detectors that are specific for each number, e.g. if the “one”-classifier erroneously predicts a one instead of a nine, the “nine”-classifier never gets a chance to correct this. However, this structure was chosen over a more sophisticated approach owing to its processing speed and ease of implementation. 1) Training: The stage common for all number candidates consists of a 10 layer cascade trained with an initial set of 1472
Fig. 4. Structure of the classifier composed from multiple cascade detectors. All image patches must pass a common stage (C) to get examined by the classifiers specific for each single number, going from 0 to 9. Patches that are rejected by either the C stage or the 9 classifier will be considered as non-numbers.
images of all 10 numbers and 12252 images of non-numbers, i.e. the images were used for training the first layer of the cascade. For the last layers the bootstrapping method described in [19] is used, taking only those images for training that have been predicted positive by the first layers, and refilling the non-number set by gathering false positives from large images that do not contain numbers. Gathering is done by scanning these images with the classifier created so far, consisting of the former layers. The detection rate is set to 0.97 per cascade layer, the false positive rate is set to 0.4 per layer, which means that the common stage already rejects about 1 − 0.4^10 ≈ 99.990% of all non-numbers. The classifiers for a certain number are trained with 112 to 193 positive examples and initially 12252 negative examples (non-numbers), also using a detection rate and false positive rate of 0.97 and 0.4 per layer, respectively. Bootstrapping is applied for images that do not contain that particular number, i.e. all other numbers are present. First training results show that this is necessary to ensure a high specificity of the classifier. Training is done until the non-number training set cannot be refilled with false positives, which results in classifiers consisting of 13 to 22 cascade layers.
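The paper does not state which cascade implementation was used; as a hedged illustration of the two-stage arrangement of Fig. 4, a common stage followed by per-digit detectors could be run with OpenCV cascade classifiers roughly like this (all file names are hypothetical):

```python
import cv2

# Hypothetical cascade files: the common stage C and ten per-digit cascades,
# each trained separately (e.g. with OpenCV's cascade training tools).
common = cv2.CascadeClassifier("common_stage.xml")
digits = [cv2.CascadeClassifier(f"digit_{d}.xml") for d in range(10)]

def detect_numbers(gray):
    """Run the common stage over the image, then pass each candidate patch
    through the per-digit detectors in order (0..9), mimicking Fig. 4."""
    results = []
    for (x, y, w, h) in common.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3):
        patch = gray[y:y + h, x:x + w]
        for d, clf in enumerate(digits):
            # A non-empty detection on the candidate patch is taken as "digit d".
            if len(clf.detectMultiScale(patch, scaleFactor=1.1, minNeighbors=3)) > 0:
                results.append((d, (x, y, w, h)))
                break                       # linear structure: the first positive wins
    return results
```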
B. The OpenCV Approach
For our second approach, we focused on achieving real-time number detection by using basic and fast image processing methods. We therefore used standard algorithms, e.g. adaptive thresholding and pattern matching, as implemented in the OpenCV library. 1) Thresholding: Depending on the lighting conditions, the setup of the camera and the captured scene, the detected gray value of the white board background or the black number is highly variable. Methods separating pixels into fore- or background by globally adjusted thresholds are therefore not applicable. In an adaptive approach, this threshold is deter- mined dynamically taking spatial variations in illumination into account. For example, Wellner [21] employs a moving
average technique in which the mean m of the last s visited pixels is considered. If the value of the current pixel is t
percent less than the average m, then this pixel is assigned the value 0 (dark), and 1 (light) otherwise. Since this approach depends on the scanning order of the pixels and the neighborhood samples are not evenly distributed during the average calculation, Bradley and Roth [1] adopt an alternative
approach for determining the threshold. They compute the average of an s × s window of pixels centered around the pixel under consideration and define a certain percentage of this average as the threshold value. We used the implementation of the described adaptive thresholding approach from the open source image library OpenCV with a dynamic window size of 15 × 15 pixels to assign each pixel to either light background or dark foreground.
Although this setup is fast and stable for most conditions,
it has major difficulties detecting numbers that are very close to the camera. There,
the dynamic window is smaller than the line width of the number, hence the binarization of the inner number area is essentially random, depending on the image noise. The computation time of the adaptive threshold, however, increases with the window size, so a larger window is not an option for dealing with this problem.
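For reference, the adaptive mean thresholding used here corresponds to a single OpenCV call; the 15 × 15 window comes from the text, while the offset constant C below is illustrative only:

```python
import cv2

def binarize(gray):
    """Adaptive binarisation: threshold each pixel against the mean of its
    15x15 neighbourhood minus a constant C (C = 10 is an assumed value)."""
    return cv2.adaptiveThreshold(gray, 255,
                                 cv2.ADAPTIVE_THRESH_MEAN_C,
                                 cv2.THRESH_BINARY, 15, 10)
```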
2) Component Labeling: We applied a contour-based ap- proach [4] 1 to obtain the connected components of the binary image. The algorithm identifies the components by tracing
a contour chain for each connected region. Since we were
only interested in blobs which are likely to be numbers, we applied several filters with experimentally determined thresholds to discard some of the extracted components, which saves computation time in the classification step. In our case, numbers are higher than wide, so we removed all components with an aspect ratio greater than 1. Since numbers usually have straight contours, we only kept blobs with a ratio of perimeter to diagonal between 2 and 10. Furthermore, the relative line width of the numbers lies within a certain range, so we discarded blobs with a ratio between blob area and bounding box area lower than 0.1 or greater than 0.5.
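A sketch of these geometric filters, with the thresholds quoted above (interpreting the aspect ratio as width divided by height is our assumption):

```python
import math

def keep_blob(width, height, perimeter, blob_area):
    """Return True if a connected component passes the aspect-ratio,
    perimeter-to-diagonal and bounding-box fill-ratio filters."""
    if width <= 0 or height <= 0:
        return False
    if width / height > 1.0:                            # numbers are higher than wide
        return False
    diagonal = math.hypot(width, height)
    if not (2.0 <= perimeter / diagonal <= 10.0):       # roughly straight contours
        return False
    fill = blob_area / (width * height)
    return 0.1 <= fill <= 0.5                           # plausible relative line width
```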
3) Classification: We based the classification task on a k-nearest-neighbor classifier (KNN) with k = 1. Since the image patches are two dimensional and vary in size whereas the KNN classifier requires one dimensional inputs of constant length, the connected regions had to be preprocessed. For this we scaled every contour chain to fit in a 16 × 16 pixels bounding box. Now the OpenCV method cvWarpImage was used to crop this bounding box from the image. Then we set all non-blob-related areas of this patch to the background color and scaled the pixel values of the patch linearly to fill the interval between 0 and 255. Finally, we determined the nearest neighbor of the extracted patch by calculating a normalized distance between the patch and every patch of a labeled reference set. In our case, the reference set consists of images of the numbers from zero to nine rendered with rotation angles between −15◦ and 15◦ in steps of 5◦. Given two image patches p_1 and p_2, the
normalized distance is calculated by d = ‖p_1 − p_2‖_L2 / (255 · √A), where ‖·‖_L2 is the L2 norm, calculated by the OpenCV method norm, and A is the area of the patch. If the minimal distance between the extracted patch and a training patch is lower than a threshold of 0.33, the label of the training patch is assigned to the patch. In the other case, the patch is classified as a non-number. The threshold was experimentally chosen to yield a good ratio between correct detections and false positives.
1 We used the open source library cvBlobsLib which implements [4] and is based on OpenCV: http://opencv.willowgarage.com/wiki/cvBlobsLib.
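The normalized distance and the 1-NN decision described above can be summarised as follows (a sketch; the reference patches and labels are assumed to be prepared as described):

```python
import numpy as np

def normalized_distance(p1, p2):
    """Normalised L2 distance between two equally sized grayscale patches;
    the patch area A is the number of pixels."""
    p1, p2 = np.asarray(p1, dtype=np.float64), np.asarray(p2, dtype=np.float64)
    return np.linalg.norm(p1 - p2) / (255.0 * np.sqrt(p1.size))

def classify_patch(patch, references, labels, threshold=0.33):
    """1-NN over the reference set; a patch farther than the threshold from
    every reference is reported as a non-number (None)."""
    dists = [normalized_distance(patch, r) for r in references]
    best = int(np.argmin(dists))
    return labels[best] if dists[best] < threshold else None
```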
C. The Accelerated k-Nearest Neighbor (kNN) Approach
The third approach aimed at reducing the run-time com- plexity of each individual step by means of several modifica- tions: an integral image-based adaptive thresholding approach, connected-component labeling using flood fill, and accelerated k-nearest neighbor classification. 1) Thresholding: Using integral images [6, 19], the average calculation for adaptive thresholding can be accomplished in constant time, rendering the approach of [1] practical for real- time applications. Given an image I with pixel intensities I(x, y) at pixel locations (x, y), an entry within the integral image I(x, y) stores the sum of all intensities which are contained in the rectangle defined by the upper left corner (0,0) and the lower right corner (x,y). Then, the average intensity
i(x, y) of an arbitrary rectangular area of I given by (x ur , y ur ) and (x ll , y ll ) can be calculated as:
\[ i(x, y) = \frac{I(x_{ur}, y_{ur}) - I(x_{ur}, y_{ll}) - I(x_{ll}, y_{ur}) + I(x_{ll}, y_{ll})}{(x_{ur} - x_{ll})(y_{ur} - y_{ll})} , \tag{1} \]
where x ur , y ur and x ll , y ll denote the upper-right and lower- left x- and y-coordinates of the given rectangle, respectively. Equation (1) shows how the average can be computed effi- ciently. Further acceleration was achieved by choosing a power of two as the window size, which enables the use of a right- shift instead of a division operation. Adaptive thresholding using an integral image is a two-stage process. At first, the integral image has to be calculated, which can be accomplished in linear time. In the second stage, the original image is binarized by comparing each pixel intensity of I with the average intensity of the rectangular area which surrounds the pixel under consideration. 2) Component Labeling: For the labeling task, we em- ployed a simple seed fill algorithm: Given a certain location
in the binary image (x, y), its neighbors are recursively tested
whether they are assigned the same binary value as the pixel under consideration. If this is the case the respective neighbor (x ∗ , y ∗ ) is assigned the same label as the pixel (x, y). Further, this process is recursively repeated to all non-visited neighbors of pixel (x ∗ , y ∗ ). Recursion stops if the binary value of the neighboring pixel (x ∗ , y ∗ ) differs from the one of (x, y).
All pixels with the same label assigned define a set of connected components. We further represented the area which was covered by the minimum bounding box surrounding a single connected component set by a binary matrix. Here, a
matrix element was assigned the value of 1 if it represented
a connected component and 0 otherwise. Finally, the set of
all binary matrices established the basis of the succeeding classification step. 3) Classification: Following the classification scheme of the first approach, we also employed k-nearest neighbor clas- sification with k = 1 given rescaled image regions as inputs. In contrast to the previous approach, however, these image regions did not originate from the acquired grayscale image but from their respective connected components extracted in the previous step. Further, resampling was achieved by nearest neighbor interpolation yielding rescaled connected components of size 32 × 32 pixels. In the following step, the
32×32 matrix of binary values was transformed into a 1024 bit sized one dimensional vector by concatenating each individual
row of the matrix. Here, the one dimensional vector was repre-
sented by a byte array of length 128, F_n = {f_{n,1}, . . . , f_{n,128}}, each byte containing 8 pixels of the binary vector. Classification was performed by calculating the Hamming distance between a candidate input vector and each entry of the number training set. Similar to the first approach the training set consisted of unmodified and transformed numbers rotated by an angle between −15◦ and +15◦ with step size 5◦, yielding 70 training patterns in total. The image in the training set with the minimum distance to the candidate vector determined the classification result. If the Hamming distance exceeded a certain threshold, the considered image patch was defined to be a non-number. We determined this threshold experimentally by choosing a threshold value which resulted in the smallest classification error. Adopting our representation of connected areas, the Hamming distance between two input vectors F_1 and F_2, H(F_1, F_2), can be efficiently computed using addition and XOR operations only: H(F_1, F_2) = Σ_{i=1}^{128} bitcount(f_{1,i} ⊕ f_{2,i}), where bitcount determines the number of 1-bits in the resulting bitstring after the XOR operation. In our implementation, the function bitcount was realized using a lookup table covering all 2^8 possible configurations of an input vector entry f_{n,i}. We disregarded further acceleration measures such as using 16 bit up to 64 bit integers instead of 8 bit integers as the basis data type for the input vector representation (cf. [17]) or employing specialized instructions of the SSE4.2 instruction set as proposed in [3]. This is because the classification step proved to be fast enough and hence did not have to be accelerated.
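The packed representation and the table-based Hamming distance can be reproduced in a few lines (a sketch using NumPy rather than the authors' C++ implementation):

```python
import numpy as np

# 256-entry lookup table holding the number of 1-bits of every possible byte value.
BITCOUNT = np.array([bin(v).count("1") for v in range(256)], dtype=np.uint8)

def pack_binary_patch(patch_32x32):
    """Pack a 32x32 binary matrix (row-major) into the 128-byte vector F_n."""
    bits = np.asarray(patch_32x32, dtype=np.uint8).reshape(-1)   # 1024 bits
    return np.packbits(bits)                                     # 128 bytes

def hamming_distance(F1, F2):
    """H(F1, F2) via XOR plus table lookup, as defined above."""
    xored = np.bitwise_xor(np.asarray(F1, dtype=np.uint8), np.asarray(F2, dtype=np.uint8))
    return int(BITCOUNT[xored].sum())
```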
D. The Artificial Neural Network (ANN) Approach
In the third approach we reused the algorithm to compute an adaptive thresholding using the integral image. We notably changed, however, how connected components are extracted and classified. 1) Extracting Suitable Connected Components: We based our connected components labeling algorithm on the run-based two-scan labeling algorithm by He et al. [10]. This algorithm regards runs, i.e. continuous horizontal line segments with the same pixel value, instead of pixels as the smallest elements of connected components, which promises a significant speedup for binary images with large areas of constant value. It requires extra space for a label image, which assigns the same unique label to all pixels that belong to the same connected component upon completion of the algorithm, and for three arrays of limited size to store temporary information about so-called provisional label sets. The original algorithm by He traverses the image twice: In the first scan, the algorithm traverses the image pixel-wise to extract runs. If a run is not connected to any previous run, a new label is assigned to all of its pixels. Otherwise it inherits the label of the connected run. In case there are two or more runs with different labels both connected to each other, they are merged to form a provisional label set. This merging operation can be implemented efficiently since all information about provisional label sets is kept in compact data structures.
In the second scan, the original algorithm traverses the whole image again to compute the final label image, replacing all provisional labels with their final ones. In our case, however, we do not need the complete final
label image. As we usually only need to extract a small number of connected components, we modified He’s algorithm in the following way: In addition to the data structures proposed by He, we also store the bounding box of each provisional label set. Thus, we can disregard connected components depending on the dimension of their bounding box and extract only connected components which are likely to be numbers. Once we have selected a smaller number of connected components as candidates, we extract these efficiently, based on their bounding box instead of traversing the whole label image for
a second time. 2) Classification of Patches: To further improve classification performance over the previous kNN classifier without requiring much more computation time, we employ an artificial neural network (ANN) to classify image patches. The neural network has to decide to which of the 11 classes
a patch belongs: It is either one of the 10 numbers or “not a
number”. We implemented this using the 1-of-11 target coding scheme, which requires one output neuron per class: For each input, we expect the output neuron corresponding to its class to return 1 and all others to return 0. The structure of our neural network is a multilayer percep- tron with 160 input units (one per pixel of patches resized to 10 × 16) and only one hidden layer consisting of 20 hidden units. All units of our neural network use the logistic activation function, except the output layer which uses the softmax activation function, so we can interpret the output of each output unit as the posterior probability of the input patch belonging to its corresponding class. We used the Stuttgart Neural Network Simulator (SNNS) [24] to train and evaluate the neural network because of its ability to automatically generate efficient C code evaluating a previously trained network using its snns2c module. We trained the network using standard backpropagation on a large dataset containing real-world patches extracted from logfiles in which a robot moved around environments that contained all 10 signs. The classifier of the previous approach supplied us with a preliminary labeling of this dataset and we only needed to manually verify or correct this labeling afterwards. Since these logfiles contain a relatively small number of patches that correspond to actual numbers compared to a large number of non-number patches, we syn- thetically generated more training patches by warping images of numbers using different perspective transformations. We relied on early stopping to prevent overfitting to the training data and employed a 10-fold cross validation on the training data to find that the average test error was minimal after 53 epochs of training. By adding a large number of synthetic training patches, we significantly changed the ratio between number and non- number patches and thus the prior class probabilities assumed by the network. We need to correct this by scaling the
posteriors returned by the network with the ratio of relative frequencies of each class before and after adding synthetic
TABLE I
DETECTION PERFORMANCE OF THE OPENCV, KNN, ANN AND VIOLA/JONES APPROACHES. THE FIRST TWO VALUES ARE RELATIVE TO THE NUMBER OF TRUE NUMBER SIGNS IN THE VALIDATION SET.
                            OpenCV    kNN     ANN    Viola/Jones
true positives in %           35.0    34.6    49.7       64.0
false negatives in %          65.0    65.4    50.3       36.0
false positives per frame    0.012   0.011   0.012      0.007
TABLE II
CLASSIFICATION PERFORMANCE OF THE OPENCV, KNN, ANN AND VIOLA/JONES APPROACHES.
                            OpenCV    kNN     ANN    Viola/Jones
correct detections in %      100.0    99.8    99.6       97.6
wrong detections in %          0.0     0.2     0.4        2.4
data. Since we were more concerned about false positives and wrong detections than about false negatives, we applied a loss
function that assigns a loss of 1 to false negatives and a loss
of 10 to false positives and wrong detections. Classes are then
chosen to minimize the expected loss.
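A sketch of this decision rule (it assumes the "not a number" class is the last of the 11 network outputs and that the prior-correction factors are available per class):

```python
import numpy as np

def decide(posteriors, prior_ratio, loss_fn=1.0, loss_fp=10.0):
    """Rescale softmax outputs by the per-class ratio of original to training
    frequencies, then choose the class with minimum expected loss."""
    p = np.asarray(posteriors, dtype=float) * np.asarray(prior_ratio, dtype=float)
    p /= p.sum()                                  # renormalise the corrected posteriors
    n = len(p)
    # Loss matrix L[decision, true class]: wrong digits and false positives cost
    # loss_fp, deciding "non-number" for a true digit costs loss_fn, correct is 0.
    L = np.full((n, n), loss_fp)
    np.fill_diagonal(L, 0.0)
    L[n - 1, :n - 1] = loss_fn
    expected = L @ p                              # expected loss of each decision
    return int(np.argmin(expected))
```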
V. EXPERIMENTAL RESULTS
We evaluate all four of the approaches described in Sect. IV using the validation dataset described in Sect. III-B with regard to both classification performance and computation time.
A. Detection Performance
In order to compare the detection performance, we decouple
the two problems of detecting a general number sign and the
classification of detected signs. Each true number sign is either correctly detected (true positive) or missed (false negative). Detections that do not correspond to a true number sign are
counted as false positives. Since numbers as found by the described detectors might slightly differ in position and size from the ones labeled by humans, we consider two labels to coincide if the intersection of the areas of both covers at least one-fourth of each label's individual area. The resulting detection rates can be seen in Table I. Rates of the OpenCV and kNN approach are almost identical because they mainly differ in details of their implementation. The ANN approach achieves many more true positives while keeping the number of errors at a similar rate due to the better classification performance of the artificial neural network. The Viola/Jones-based detector performs significantly better than all other methods.
is shown in Table II. This is consistently high with the
exception of the Viola/Jones approach which tends to confuse number signs more often than the others. The overall ratios of correctly recognized numbers as re- ported in Table I might seem low to the reader. This is because many of the number signs visible and tagged in our
Fig. 5. Detection performance depending on the visible size of numbers. The correct detection rate in this plot is equivalent to the true positives rate from table I times the correct detection rate from table II.
TABLE III
PROCESSING TIME PER FRAME IN MILLISECONDS.
                   OpenCV        kNN          ANN          Viola/Jones
thresholding      3.5 ± 0.4     2.5 ± 0.4    2.4 ± 0.3         -
segmentation      9.3 ± 5.3     9.3 ± 1.3    2.3 ± 0.7         -
classification    0.4 ± 0.3     0.2 ± 0.4    0.1 ± 0.2         -
overall          13.2 ± 5.5    12.0 ± 1.6    4.8 ± 1.0     193.5 ± 73.3
log files suffer from motion blur or are far away from the robot. In combination with the artifacts introduced by the discretization in the thresholding step of the first three approaches, this makes classifying very small numbers correctly
a hard problem. Another important issue to keep in mind
is that, for our application, low ratios of false positives or incorrectly classified numbers are much more important than higher recognition rates: As long as our robot has not seen the next number it needs to reach, it can simply keep exploring
until it eventually finds it. If, on the other hand, it drives towards an incorrectly recognized (false positive) sign and decides it has reached it, there is no way to recover from this bad decision and the competition would be lost. Fig. 5 shows how detection performance of different approaches depends on the size in pixels of the number sign as seen in the image. While the performance is comparable for large numbers, the most significant differences can be observed for small numbers. Good performance for small numbers, however, is essential, as we want to find number signs early on when the robot is still far away. Once the robot is close to
a sign, we hope to have an accurate estimate of its position and can afford to not detect it from time to time.
B. Computation Time
Table III lists the computation time required by the de- scribed methods. We measured these times by running the number sign recognition systems on our robot in batch mode, i.e. with no other jobs running at the same time and utilizing only one of the two cores. It is worth noting that our own implementation of adaptive thresholding using the integral image is significantly faster
than the implementation provided by OpenCV on a window size of 15 × 15 pixels. The most important improvement
with respect to computation time, however, comes from using
a run-based algorithm for segmentation as described in the ANN approach. While using runs might not help that much when processing images with many very small connected
components, e.g. documents with very small letters, binarized images of natural scenes in our experience always contain
at least some uniform regions, in which using runs improves
performance considerably.
VI. CONCLUSIONS
We have implemented and compared the performance of four methods for detecting number signs on a mobile robot and managed to create a robust system that requires less than 5 ms per image. Using this system, our robot is able to detect number signs in real-time even when processing images from three cameras at the same time, while still leaving enough computational resources for other modules running on the robot. Specialized algorithms using efficient simplifications, e.g. binarization, are key to image processing on mobile robots in high-speed environments like robot races. Unfortunately, they usually require easily exploitable properties. For the task described here, we were able to exploit the fact that there is a white region around the black numbers to be detected. This allowed us to implement a robust number sign detection algorithm that is several orders of magnitude faster than general-purpose object detection algorithms. Our robot described in III-A won the 1 st place in the “SICK robot day” robotics challenge 2010 using the artifi- cial neural network-based number sign detection system. A video of our first run at the SICK robot day challenge 2010 can be found at http://www.youtube.com/watch?v= tpILYAUGxfU .
ACKNOWLEDGMENT
The authors would like to thank all other colleagues of our team who participated in the "SICK robot day" robotics challenge: Karsten Bohlmann, Yasir Niaz Khan, Stefan Laible and Henrik Marks.
REFERENCES
[1] D. Bradley and G. Roth. Adaptive thresholding using the integral image. Journal of Graphics Tools, 12(2):13–21, 2007. [2] G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000. [3] M. Calonder, V. Lepetit, C. Strecha, and P. Fua. BRIEF: Binary robust independent elementary features. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, Computer Vision - ECCV 2010, volume 6314 of Lecture Notes in Computer Science, chapter 56, pages 778– 792. Springer Berlin / Heidelberg, Berlin, Heidelberg, 2010. [4] F. Chang. A linear-time component-labeling algorithm using con- tour tracing technique. Computer Vision and Image Understanding, 93(2):206–220, February 2004. [5] X. Chen, J. Yang, J. Zhang, and A. Waibel. Automatic detection and recognition of signs from natural scenes. IEEE Transactions on Image Processing, 13(1):87–99, 2004. [6] F. C. Crow. Summed-area tables for texture mapping. In Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 84), pages 207–212, New York, NY, USA, 1984. ACM.
[7] T. E. de Campos, B. R. Babu, and M. Varma. Character recognition in natural images. In Proceedings of the International Conference on Computer Vision Theory and Applications, Lisbon, Portugal, February
2009.
[8] A. de la Escalera, J. M Armingol, and M. Mata. Traffic sign recognition
and analysis for intelligent vehicles. Image and Vision Computing, 21(3):247 – 258, 2003. [9] K. Deb, H.-U. Chae, and K.-H. Jo. Vehicle license plate detection method based on sliding concentric windows and histogram. Journal of Computers, 4(8):771–777, 2009. [10] Lifeng He, Yuyan Chao, and K. Suzuki. A run-based two-scan labeling algorithm. Image Processing, IEEE Transactions on, 17(5):749–756, May 2008.
[11] C.G. Keller, C. Sprunk, C. Bahlmann, J. Giebel, and G. Baratoff. Real- time recognition of u.s. speed signs. In IEEE Intelligent Vehicles Symposium, pages 518–523, Eindhoven, June 2008. [12] S. Kumano, K. Miyamoto, M. Tamagawa, H. Ikeda, and K. Kan. Development of a container identification mark recognition system. Electronics and Communications in Japan, Part 2, 87(12):38–50, 2004. [13] M. Lalonde and Y. Li. Detection of road signs using color indexing. Technical Report CRIM-IT-95, Centre de Recherche Informatique de Montreal, 1995. [14] J. Liang, D. Doermann, and H. Li. Camera-based analysis of text and documents: a survey. International Journal on Document Analysis and Recognition, 7:84–104, 2005. 10.1007/s10032-004-0138-z. [15] J. Ohya, A. Shio, and S. Akamatsu. Recognizing characters in scene images. IEEE Trans. Pattern Anal. Mach. Intell., 16:214–220, February
1994.
[16] T. Sato, T. Kanade, E. K. Hughes, and M. A. Smith. Video OCR for digital news archive. In Proceedings of the 1998 International Workshop
on Content-Based Access of Image and Video Databases (CAIVD ’98), CAIVD ’98, pages 52–60, Washington, DC, USA, 1998. IEEE Computer Society. [17] S. Taylor, E. Rosten, and T. Drummond. Robust feature matching in 2.3µs. In IEEE CVPR Workshop on Feature Detectors and Descriptors:
The State Of The Art and Beyond, June 2009. [18] A. Treptow, A. Masselli, and A. Zell. Real-time object tracking for soccer-robots without color information. In European Conference on Mobile Robotics (ECMR 2003), pages 33–38, Radziejowice, Poland,
2003.
[19] P. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57:137–154, May 2004. [20] K. Wang and J. A. Kangas. Character location in scene images from digital camera. Pattern Recognition, 36(10):2287–2299, October 2003. [21] P. D. Wellner. Adaptive thresholding for the DigitalDesk. Technical Report EPC-1993-110, 1993. [22] T. Yamaguchi, Y. Nakano, M. Maruyama, H. Miyao, and T. Hananoi. Digit classification on signboards for telephone number recognition. In Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on, pages 359–363, 2003. [23] J. Yang, X. Chen, J. Zhang, Y. Zhang, and A. Waibel. Automatic detection and translation of text from natural scenes. In Proceedings of International Conference on Acoustics and Signal Processing 2002 (ICASSP '02), Orlando, FL, USA, May 2002. [24] A. Zell, N. Mache, R. Hübner, G. Mamier, M. Vogt, M. Schmalzl, and K.-U. Herrmann. SNNS (Stuttgart Neural Network Simulator). In Josef Skrzypek, editor, Neural Network Simulation Environments, volume 254 of The Springer International Series in Engineering and Computer Science. Kluwer Academic Publishers, Norwell, MA, USA, February 1994. Chapter 9.
Terrain Classification with Markov Random Fields on fused Camera and 3D Laser Range Data
Marcel Häselich ∗
Marc Arends ∗
Dagmar Lang ∗
Dietrich Paulus ∗
∗ Active Vision Group, AGAS Robotics, University of Koblenz-Landau, 56070 Koblenz, Germany {mhaeselich, marends, dagmarlang, paulus }@uni-koblenz.de
Abstract— In this paper we consider the problem of interpreting the data of a 3D laser range finder. The surrounding terrain is segmented into a 2D grid where each cell can be an obstacle or a negotiable region. A Markov random field models the relationships between neighboring terrain cells and classifies the whole surrounding terrain. This allows us to add context-sensitive information to the grid cells where sensor noise or uncertainties could lead to false classifications. Camera images provide a perfect complement to the laser range data because they add color and texture features to the point cloud. Therefore camera images are fused with the 3D points and the features from both sensors are considered for classification. We present a novel approach for online terrain classification from fused camera and laser range data by applying a Markov random field. In our experiments we achieved a recall ratio of about 90% for detecting streets and obstacles and show that our approach is fast enough to be used on an autonomous mobile robot in real time.
Index Terms— Multi-sensor fusion, terrain negotiability
I. INTRODUCTION
Autonomous navigation in unstructured environments is a current and challenging task in robotics. The mobile system needs a detailed interpretation of the surrounding terrain to
avoid obstacles and to regard the negotiability of the surface.
A modern 3D Laser Range Finder (LRF) provides a rich and
thorough picture of the environment in the form of 3D distance measurements. For a path planning algorithm, however, the data
of the 3D LRF cannot be directly used because of their vast
amount. Therefore, as a first step, a reduction of the huge point cloud is necessary and an efficient data structure is essential. Our work provides a direct extension to the terrain analysis by Neuhaus et al. [14], where a data structure was introduced that we use in our work. The laser range measurements alone allow no differentiation between different surfaces. Therefore we calibrated three cameras to the LRF and can then access the fused data in one coordinate system. This allows us to determine the color and texture of the 3D points in the field of view of each camera. In unstructured environments, the classification of the terrain can be a problem due to sensor noise, varying density of the data, ego-motion or shocks on rough terrain. For that reason we apply a Markov random field to add context-sensitive information to the terrain classification, which models the relationships in our data structure. Our goal is to determine the negotiability of the surrounding terrain with a Markov random field in real time based on the sensors described in the following section.
This paper is organized as follows. First of all we present the deployed sensors and their layout in Sec. II. After discussing related work in Sec. III, we will briefly describe the Markov random field basics in Sec. IV. The algorithm is described in detail in Sec. V followed by the experiments in Sec. VI. Finally, we present the conclusion in Sec. VII.
II. HARDWARE
Fig. 1. Deployed sensors: A 3D laser Velodyne HDL-64E S2 (left) and two different commercially available cameras (Logitech HD Pro Webcam C910 on the upper right and Philips SPC1300NC on the lower right) are used to perceive the environment.
As shown in Fig. 1 we use a Velodyne HDL-64E S2 [8] and two different camera models. The head of the Velodyne consists of 64 lasers which permanently gather data of the environment as the head spins at a frequency of up to 15 Hz around the upright axis. In doing so, the sensor produces a rich dataset of 1.8 million distance measurements per second. The data of one full rotation are accumulated into one point cloud. A Logitech HD Pro Webcam C910 [18] is installed at the front and two Philips SPC1300NC [15] cameras are fixed on the left and the right side of the construction. The sensors are either mounted on top of a 500 kg robot (autonomous driving) or on a car (recording of sensor data).
III. RELATED WORK
There exist various approaches to classify the terrain surrounding an autonomous mobile robot platform. Especially image- or laser-based strategies are widespread when terrain negotiability information is needed.
The image based strategies either use a single, stereo or combined setup of digital and infrared cameras. Konolige et al. [10] and Alberts et al. [1] both use stereo vision approaches to maneuver a vehicle through unstructured environments. Stereo vision allows them to extract traversable regions from the camera video streams. Furthermore, Vernaza et al. [23] present a camera based terrain classification approach. Their approach uses a Markov random field that classifies image data of a stereo system into obstacles or ground regions for an autonomous robot. Negative obstacles (non-negotiable regions underneath the ground level) present a difficult challenge in non-urban environments. Thermal infrared images have the characteristic that negative obstacles remain warmer than the surrounding terrain at night. Rankin et al. [17] therefore combine thermal signatures and stereo range data to determine the terrain negotiability. Laser based approaches either work with a 2D, a self-built rotating 2D or a 3D LRF, like the Velodyne HDL-64E S2. Wurm et al. [26] use the laser remission value of a 2D LRF on a pan-tilt unit to classify the surface terrain based on the resulting 3D scans. In this way, they can detect grass-like vegetation and prefer paved routes with their robot. Another approach for terrain classification is presented by Wolf et al. [25]. Their robot uses a 2D LRF oriented to the ground, records data while driving and produces 3D maps using Hidden Markov models. The authors are able to differentiate flat areas from grass, gravel or other obstacles. Vandapel et al. [22] segment 3D distance measurements and classify the segments into three different classes for terrain surface, clutter or wires. Their approach worked with a detailed stationary 3D sensor as well as on a mobile platform with a rotating 2D scanning mount. Ye and Borenstein [27] present an algorithm for terrain mapping with a 2D LRF. Their LRF is mounted at a fixed angle to the ground in front of their robot and creates an elevation map while driving. A color stereo camera and a 2D LRF are used by Manduchi et al. [12] to detect obstacles. The authors present a color-based classification system and an algorithm that analyses the laser data to discriminate between grass and obstacles. Schenk and Csatho [20] fuse aerial images with 3D point clouds to construct surfaces of urban scenes. The surfaces are represented in a 3D object space coordinate system by patches that store the shape and the boundary of the corresponding surface region. A stereo pair of digital cameras, an infrared camera and two 2D LRFs mounted on scanning mounts are used by Wellington et al. [24]. The authors apply multiple Markov random fields that interact through a hidden semi-Markov model that enforces a prior on vertical structures. Their results showed that including the neighborhood structure significantly improved obstacle classification. Beyond terrain classification another research topic covers the simultaneous localization and mapping (SLAM) problem, where LRFs are used to process the 2D and 3D distance measurements to build maps [5, 13] of the environment. In contrast to mapping, we want to process the sensor data as
fast as possible online on the robot so that a path planning algorithm can directly access these data and perform tasks autonomously in real time. All presented approaches have in common that unstructured environments are described as challenging by the authors. Sensor noise, manifold and complex vegetation, rough terrain and shocks while driving require a solid approach to realize an adequate classification of the terrain. Under the circumstances described, we want to achieve the best classification possible with our sensor data. A statistical approach can handle sensor noise and uncertainties quite well, and for this reason we chose a Markov random field, which will be described in the following section.
IV. MARKOV RANDOM FIELD BASICS
Markov random fields can be regarded as a 2D expansion of Markov chains. This implies that a random state ω_ij at position (i, j) has a neighborhood N_ij including states ω_λ with

ω_λ ≠ ω_ij    (1)

ω_λ ∈ N_ij ⟺ ω_ij ∈ N_λ    (2)

The neighboring states can be located at different distances, forming different cliques of neighbors. In our work, we examine the 8 nearest neighbors surrounding a random state in a 2D data representation. Assuming that each state takes a value results in a configuration ω of a random field. A random field is called a Markov random field if the probability of a configuration takes a positive value

P(ω) > 0    (3)

and if a random state depends only on its neighbor states

P(ω_ij | ω − ω_ij) = P(ω_ij | ω_λ), with ω_λ ∈ N_ij    (4)

where ω − ω_ij stands for ω \ ω_ij, the set ω without ω_ij. For easier computation and modeling, Markov random fields are often used as a Gibbs random field. Markov random fields and Gibbs random fields can be seen as equivalent, which is described by the Hammersley-Clifford theorem. In a Gibbs random field the distribution of the random states corresponds to the Gibbs distribution, which is defined as

P(ω) = (1/Z) exp(−U(ω)/T)    (5)

where T is a temperature parameter, U(·) a function which calculates the energy E of a configuration, and Z is a normalizing constant with

Z = Σ_ω exp(−U(ω)/T)    (6)
that calculates a sum over all possible configurations. The modeling of U (· ) depends on the problem, which needs to be solved. Maximizing P (ω ) can be achieved by finding a configuration ω that minimizes U (ω ). For this purpose several sampling techniques exist, as it is too time consuming to test every possible configuration.
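To make the role of the Gibbs distribution concrete, the toy sketch below evaluates Eq. (5) for a three-cell field by brute force; the labels, energy function and temperature are invented for illustration, and the exhaustive sum over all configurations in Eq. (6) is exactly what becomes infeasible for realistic grids.

```python
import itertools
import math

def gibbs_probability(energy, omega, labels, cells, T=1.0):
    """Evaluate Eq. (5) for one configuration by brute force.

    The partition function Z (Eq. 6) sums over every possible labeling,
    which is only feasible for toy-sized fields; real fields require
    sampling techniques instead.
    """
    Z = sum(math.exp(-energy(dict(zip(cells, assignment))) / T)
            for assignment in itertools.product(labels, repeat=len(cells)))
    return math.exp(-energy(omega) / T) / Z

# A 1x3 toy "field": neighbouring cells prefer to take equal labels.
cells = [0, 1, 2]
labels = ["road", "obstacle"]

def energy(omega):
    return sum(-1.0 if omega[i] == omega[i + 1] else 1.0 for i in range(2))

omega = {0: "road", 1: "road", 2: "road"}
print(gibbs_probability(energy, omega, labels, cells))
```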
An introduction of Markov random fields for classification tasks is presented by Theodoridis and Koutroumbas [21], and a detailed description and several possibilities for modeling with Markov random fields are discussed by Li [11].
V. MARKOV RANDOM FIELD APPLICATION
After an introduction of the Markov random fields, the following section details their application to our sensor data and their use in our framework. The section starts with the data structure and pre-classification and continues with a description of the features. Our Markov Model of the terrain and the details of the camera and LRF fusion follow right after. The section ends with a description of the feature parameter estimation for the Markov random fields classes.
A. PRINCIPAL COMPONENT ANALYSIS
As a first step, the large point cloud delivered by our sensor is subdivided into smaller chunks, which can then be analyzed with the help of the principal component analysis (PCA) [14]. We use a 2D equidistant grid, centered around the origin of the sensor, to subdivide the point cloud. Our grid has a size of 40 × 40 m and each of the cells measures approximately 50 × 50 cm. The output of the PCA is a pre-classified grid, where each cell can either be flat, obstacle or unknown. A detailed description of the PCA in form of a hierarchical implementation is provided by Neuhaus et al. [14].
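A minimal sketch of how such a PCA-based pre-classification of a single cell could look is given below; the point-count, thickness and tilt thresholds are assumptions for illustration and do not correspond to the hierarchical implementation of [14].

```python
import numpy as np

def preclassify_cell(points, min_points=10, flat_thickness=0.05, max_tilt_deg=20.0):
    """Pre-classify one grid cell as flat / obstacle / unknown.

    points: (N, 3) array of laser returns falling into the cell.
    Thresholds are illustrative; [14] describes a hierarchical variant.
    """
    if len(points) < min_points:
        return "unknown"
    centered = points - points.mean(axis=0)
    cov = centered.T @ centered / len(points)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    normal = eigvecs[:, 0]                          # direction of least variance
    thickness = np.sqrt(eigvals[0])                 # spread along that normal
    tilt = np.degrees(np.arccos(abs(normal[2])))    # angle of the normal to the vertical
    if thickness < flat_thickness and tilt < max_tilt_deg:
        return "flat"
    return "obstacle"

if __name__ == "__main__":
    ground = np.random.rand(50, 3) * [0.5, 0.5, 0.02]   # thin horizontal patch
    wall = np.random.rand(50, 3) * [0.5, 0.02, 1.0]     # thin vertical patch
    print(preclassify_cell(ground), preclassify_cell(wall))
```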
B. FEATURES
Features are essential for classification tasks. We define a set of features which describe different terrain classes we use in our approach. This set can be divided into two categories. The first type of features is acquired from of the data gained by the LRF mounted on our robot. The second type consists of features computed from the camera images of the terrain. The first laser-based feature we use, is the roughness f r of
a terrain cell. The roughness can be obtained by calculating the local height disturbance [14]. The roughness feature offers valuable clues about how uneven a terrain cell is. This information is helpful for distinguishing between roads and grass or acres, for example. The second laser-based feature, the height difference f h ,
is capable of differentiating between obstacles and negotiable
terrain. This feature simply displays the distance between the height of the lowest and the highest laser point of a terrain cell and was presented by Happold et al. [6]. The usage of this feature is based on our assumption that the main characteristic of an obstacle is its height. Under the assumption that texture differs for diverse types of terrain, we apply features acquired from image data. We use the second angular moment f_sam, variance f_v and inverse difference moment f_idm texture features. These are three easy-to-compute features out of the 14 texture calculations proposed by Haralick et al. [7]. We use another fast-to-compute feature, expressing the homogeneity f_fh of a texture, proposed by Knauer et al. [9]. This is based on the assumption that rough terrain has an inhomogeneous texture in contrast to the texture of even terrain. In addition to texture features, the color f_c of a terrain cell is also used, which is fast and easy to compute. The combination of all features yields the feature vector f = (f_r, f_h, f_sam, f_v, f_idm, f_fh, f_c) that integrates into the terrain classification process.
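The following sketch assembles such a feature vector for one cell; only the height difference f_h and the mean color f_c are actually computed here, while the remaining entries are placeholders, since the roughness, Haralick and homogeneity features require the computations described in [14], [7] and [9].

```python
import numpy as np

def height_difference(points):
    """Laser feature f_h: distance between lowest and highest point of a cell [6]."""
    z = points[:, 2]
    return float(z.max() - z.min())

def cell_feature_vector(points, image_patch):
    """Assemble f = (f_r, f_h, f_sam, f_v, f_idm, f_fh, f_c) for one terrain cell.

    Only f_h and the mean colour f_c are computed in this sketch; the other
    entries are zero placeholders for the features described in the text.
    """
    f_h = height_difference(points)
    f_c = image_patch.reshape(-1, 3).mean(axis=0)      # mean RGB of the fused pixels
    f_r = f_sam = f_v = f_idm = f_fh = 0.0             # placeholders
    return np.hstack([f_r, f_h, f_sam, f_v, f_idm, f_fh, f_c])

if __name__ == "__main__":
    pts = np.random.rand(40, 3)
    patch = (np.random.rand(8, 8, 3) * 255).astype(np.uint8)
    print(cell_feature_vector(pts, patch))
```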
C. MARKOV MODEL OF THE TERRAIN
Regarding the fact that flat terrain cells can either be easy
or hard to pass for a robot, the two classes road and rough are used for differentiation. This leads to a total of four possible terrain-classes {unknown, road, rough, obstacle }. A Markov model of the terrain should respect the context-sensitivity of neighboring cells and the acquired features. For this reason we chose a Markov random field model used for image segmentation tasks, described by Deng et al. [2]. This model
is able to fulfill both requirements and can easily be adapted to
the terrain classification problem by regarding each terrain cell as random state in the Markov random field. Here the energy E of a Gibbs random field is calculated piecewise by splitting
it up into a component E N that models the relationship of the
neighboring cells and another component E f which models the computed features of a cell. The neighborhood component E N ij of a cell C ij at position (i, j ) in a grid is defined as
E_N_ij = Σ_{λ ∈ N_ij} β · δ(c_ij, c_λ)    (7)

where c_ij is the terrain class assigned to the cell C_ij and c_λ is the terrain class assigned to a cell C_λ, which is part of the neighborhood of C_ij. The function δ(·) returns a value of −1 for c_ij = c_λ and +1 for c_ij ≠ c_λ. β is used to weight a neighbor's impact according to its distance to the cell. In our approach, we observe the 8 closest neighbors of a cell, where the value of β is fixed and adjusted to our environment.
The corresponding feature component E_f_ij of a cell C_ij is based on the assumption that the features of a cell are Gaussian distributed. As an energy, its computation is defined as

E_f_ij = Σ_k [ (f_ij,k − μ_ij,k)² / (2 σ²_ij,k) + log(√(2π) σ_ij,k) ]    (8)

where f_ij,k is the k-th feature of C_ij and μ_ij,k and σ_ij,k are the mean and the standard deviation, respectively, of the k-th feature of the class c_ij assigned to the cell C_ij. Combining the two calculations, the complete energy E_ij of a cell C_ij can be calculated as

E_ij = E_N_ij + α · E_f_ij    (9)
where α is a weighting constant used to control the influence of the different energy types. For classification the sum of all computed energies E ij needs to be minimized, which leads to a maximization of the
a posteriori probability of the labeling of the terrain. These energies can be minimized by finding a label for each cell which fits best for the computed features and the labels of the neighbor cells. We apply the Gibbs Sampler described by Geman and Geman [3] to solve this task.
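A compact sketch of Eqs. (7)–(9) together with one Gibbs-sampler sweep [3] is shown below; the class set, feature models and the α, β, T values are illustrative stand-ins rather than the authors' implementation.

```python
import math
import random

CLASSES = ["road", "rough", "obstacle"]

def neighbour_energy(labels, i, j, beta=0.8):
    """E_N_ij of Eq. (7): -beta for agreeing neighbours, +beta otherwise."""
    e = 0.0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == dj == 0:
                continue
            n = labels.get((i + di, j + dj))
            if n is not None:
                e += beta * (-1.0 if n == labels[(i, j)] else 1.0)
    return e

def feature_energy(f, mu, sigma):
    """E_f_ij of Eq. (8) for one feature vector under a class model (mu, sigma)."""
    return sum((fk - m) ** 2 / (2 * s ** 2) + math.log(math.sqrt(2 * math.pi) * s)
               for fk, m, s in zip(f, mu, sigma))

def gibbs_sweep(labels, features, models, alpha=0.4, T=1.0):
    """One Gibbs-sampler sweep: resample every cell from its conditional (Eq. 9)."""
    for cell in labels:
        weights = []
        for c in CLASSES:
            labels[cell] = c
            e = neighbour_energy(labels, *cell) + alpha * feature_energy(
                features[cell], models[c]["mu"], models[c]["sigma"])
            weights.append(math.exp(-e / T))
        z = sum(weights)
        labels[cell] = random.choices(CLASSES, [w / z for w in weights])[0]
    return labels

if __name__ == "__main__":
    grid = [(i, j) for i in range(3) for j in range(3)]
    labels = {cell: random.choice(CLASSES) for cell in grid}
    features = {cell: [random.random(), random.random()] for cell in grid}
    models = {c: {"mu": [0.3 * k, 0.3 * k], "sigma": [0.2, 0.2]}
              for k, c in enumerate(CLASSES)}
    print(gibbs_sweep(labels, features, models))
```

In practice the sweep would be repeated until the labeling stabilizes, which corresponds to minimizing the sum of the cell energies E_ij.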
Fig. 2. Example scene of an unstructured environment with three different representations (best viewed in color). In each representation the images from the left, front and right camera are shown in the upper part. The upper image shows the 3D point cloud provided by the Velodyne HDL-64E S2. The color of a point is determined by its height. In the image in the middle the LRF data are fused with all three cameras. This way color, texture and 3D position are available in one coordinate system. The lower image shows the result of the MRF application: classified terrain cells with negotiability information. The classes unknown, road, rough and obstacle are visualized in black, gray, brown and red, respectively.
D. CAMERA AND LIDAR FUSION
Our framework makes it possible to fuse laser points and image pixels. This is achieved by a scenegraph representation of the used hardware, which provides transformation matrices between different sensors and the corresponding intrinsic parameters. Thus we can build a lookup table that returns, for each laser point, the corresponding pixel. Using this table, finding the image region spanned by the measured laser points is straightforward. Further information on how to calibrate a camera to an LRF is available in the literature [19].
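A minimal sketch of such a laser-to-pixel lookup is shown below, assuming a standard pinhole camera model; in practice the extrinsic transform and intrinsic matrix would come from the calibrated scenegraph, and the values used here are toy numbers.

```python
import numpy as np

def project_point(p_laser, T_cam_laser, K):
    """Map one 3D laser point into pixel coordinates of a calibrated camera.

    T_cam_laser: 4x4 homogeneous transform from the laser to the camera frame.
    K:           3x3 pinhole intrinsic matrix of the camera.
    Returns (u, v) or None if the point lies behind the camera.
    """
    p_cam = T_cam_laser @ np.append(p_laser, 1.0)
    if p_cam[2] <= 0:
        return None
    uvw = K @ p_cam[:3]
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

def build_lookup(points, T_cam_laser, K, width, height):
    """Pre-compute, for every laser point, the corresponding pixel (or None)."""
    table = []
    for p in points:
        uv = project_point(p, T_cam_laser, K)
        if uv is None or not (0 <= uv[0] < width and 0 <= uv[1] < height):
            table.append(None)
        else:
            table.append((int(uv[0]), int(uv[1])))
    return table

if __name__ == "__main__":
    K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
    T = np.eye(4)                      # identity extrinsics for the toy example
    pts = np.array([[0.5, 0.2, 3.0], [1.0, -0.5, -1.0]])
    print(build_lookup(pts, T, K, 640, 480))
```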
E. ESTIMATION OF FEATURE PARAMETERS
The application of the described Markov random field model depends on knowledge of the mean and standard deviation of each feature for each class. Thus, we estimate these parameters by hand-annotating terrain cells with labels from the used set of classes. From the annotated terrain it is possible to calculate the needed parameters of the features. The laser feature parameters can be applied in different environments, in contrast to the image feature parameters, which need to be estimated separately for each scenario because of the different appearances of the terrain classes. This has the advantage that the estimated parameters are based on real data acquired from the environment, which are carefully classified by a human observer.
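The parameter estimation itself reduces to computing per-class statistics over the hand-annotated cells, as in the sketch below; the feature values and labels are made up for illustration.

```python
import numpy as np
from collections import defaultdict

def estimate_class_parameters(annotated_cells):
    """Estimate per-class mean and standard deviation for every feature.

    annotated_cells: iterable of (label, feature_vector) pairs produced by
    hand-annotating terrain cells.  Returns {label: (mu, sigma)} as needed
    by the feature energy of Eq. (8).
    """
    per_class = defaultdict(list)
    for label, f in annotated_cells:
        per_class[label].append(f)
    params = {}
    for label, vectors in per_class.items():
        m = np.asarray(vectors)
        mu = m.mean(axis=0)
        sigma = m.std(axis=0) + 1e-6     # avoid zero deviation for constant features
        params[label] = (mu, sigma)
    return params

if __name__ == "__main__":
    data = [("road", [0.02, 0.05]), ("road", [0.03, 0.04]),
            ("obstacle", [0.4, 1.2]), ("obstacle", [0.5, 0.9])]
    print(estimate_class_parameters(data))
```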
VI. EXPERIMENTAL RESULTS
In our experiments we compared our terrain classification results with a ground truth. The ground truth is acquired by a human observer, who annotates the terrain cells by hand. By comparing the results with their respective ground truth we calculate the true positive rate and the false positive rate for each terrain class. We evaluated the classification for different scenarios, where the relative appearances of the terrain classes differ. An example scene is shown in Fig. 2, where the 3D data, the fused data and the Markov random field result are visualized. Table I shows the result for a scenario where the robot drives on a road through fields and small to medium sized vegetation, and Fig. 3 the corresponding ROC curves. In this table the usage of laser based features is denoted by L, selected Haralick features by H, the homogeneity feature by FH and color by C. Furthermore, TPR is the true positive rate, or recall rate, and FPR the corresponding false positive rate. Table II shows the results for a forest scenario, where the class rough appears only at single terrain cells and not in larger regions. In this case, the results show that the detection of the rarely appearing class does not work properly. The reason is the modeling of the terrain, in which we assume that terrain cells of one class show a tendency to group and not to exist as single cells. Furthermore we observed that the usage of image features does not improve the quality of the results in our approach. A reason for this may be that texture and color vary more than we assumed, which leads to high values for the standard deviations of these features. Thus a high deviation from the mean value of such a feature does not change the computed energy value sufficiently to have impact on the classification.
The evaluation of the runtime, shown in Table III, reveals that the usage of image features multiplies the needed calculation time. In this table PCA expresses the usage of the PCA based algorithm without the application of our Markov random field and MRF stands for the application of our Markov random field. Using laser features exclusively leads to a fast runtime and good classification results. In all experiments the fixed values α and β were set to 0.4 and 0.8, respectively.
TABLE I
TABLE II
Table I and Table II show our results for a rural and a forest scenario, respectively. The true positive rate (TPR) and the false positive rate (FPR) are presented for the three classes Road, Rough and Obstacle. The columns are separated by the used features: laser based (L), Haralick features (H), the homogeneity feature (FH) and the color information (C).
Mode | Mean       | Std. dev.  | Max        | Min
     | 9.832 ms   | 1.757 ms   | 16.145 ms  | 5.871 ms
     | 62.667 ms  | 7.357 ms   | 82.240 ms  | 48.787 ms
     | 744.917 ms | 131.507 ms | 993.987 ms | 421.465 ms
     | 141.379 ms | 11.233 ms  | 188.420 ms | 121.733 ms
     | 144.439 ms | 12.566 ms  | 187.013 ms | 117.326 ms
TABLE III
Table III displays the runtime results for our approach. All values represent long term measurements and result from the direct application to real sensor data, without preprocessing or any form of data reduction. The terms L, H, FH and C are equivalent to Table I and II. The runtime of the PCA is compared to the Markov random field application with different features from single-sensor and fused sensor data.
VII. CONCLUSIONS AND FUTURE WORK
In this paper, we presented an approach to process the fused data of a modern 3D LRF and three cameras with a Markov random field. We thereby integrate context sensitivity and different features from both laser and image data. The application of the Markov random field results in a classified 2D grid with terrain negotiability information for each cell. We annotated terrain cells in different scenarios to learn the class parameters of the features. The resulting grid forms a perfect data structure for an autonomous system in unstructured environments. We achieved a recall ratio of about 90% for detecting streets and obstacles. Limitations of our approach become apparent at higher speeds. The robot platform we use has a maximum speed of 14 km/h, but
Fig. 3. ROC curves corresponding to the values of Table I. The three curves show the true positive rate in relation to the false positive rate for the three classes Road (cyan), Rough (brown) and Obstacle (blue)
the sensor data were also recorded while driving around with a car. For the usage of our approach on our autonomous system, the presented fusion is sufficient. In order to deal with very fast proper motion, the sensor calibration needs to take into account the instant of time at which the data were recorded. A system of precise time stamps should be able to solve this issue. The computation time of the image features is another difficult aspect. The image features need to be chosen wisely, or a solution for speeding up this cost-intensive part is needed. In future work, we will work on an integration of time into the Markov random fields. By accessing previous classifications from the current one, we hope to fill in spots with missing or sparse data, for example in regions with negative obstacles or in the immediate vicinity around our sensor. Another quite promising extension is a self-supervised learning of the feature parameters. In doing so, our approach can adapt to various situations and is able to improve during runtime if the features are relearned under certain circumstances. In addition, we will investigate the application of graph cuts [4] to the Markov random field data structure. The Markov random field and the fused data provide the opportunity to extract semantic information in real time from the sensor readings. This information can be used to enrich 3D maps [16] and result in a semantic 3D map.
ACKNOWLEDGMENT
This work was partially funded by Wehrtechnische Dienst- stelle 51 (WTD), Koblenz, Germany.
REFERENCES
[1] J. Alberts, D. Edwards, T. Soule, M. Anderson, and M. O’Rourke.
Autonomous navigation of an unmanned ground vehicle in unstructured forest terrain. In Proceedings of the ECSIS Symposium on Learning and Adaptive Behaviors for Robotic Systems, pages 103–108, 2008. [2] H. Deng and D. A. Clausi. Unsupervised image segmentation using a simple mrf model with a new implementation scheme. Pattern Recogn.,
2004.
[3] S. Geman and D. Geman. Stochastic relaxation, gibbs distribution, and bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1984.
[4] D. M. Greig, B. T. Porteous, and A. H. Seheult. Exact maximum a
posteriori estimation for binary images. Journal of the Royal Statistical Society. Series B (Methodological), 1989. [5] J. Guivant, E. Nebot, and S. Baiker. Autonomous navigation and map building using laser range sensors in outdoor applications. Journal of Robotic Systems, 2000. [6] M. Happold, M. Ollis, and N. Johnson. Enhancing supervised terrain classification with predictive unsupervised learning. In Proceedings of Robotics: Science and Systems, 2006. [7] R. M. Haralick, I. Dinstein, and K. Shanmugam. Textural features for image classification. In Proceedings of IEEE Transactions on Systems, Man, and Cybernetics, pages 610–621, 1973. [8] Velodyne Lidar Inc. Velodyne Lidar. http://www.velodyne.com/lidar, April 2011.
[9]
U. Knauer and B. Meffert. Fast computation of region homogeneity with
application in a surveillance task. In Proceedings of ISPRS Commission V Mid-Term Symposium Close Range Image Measurement Techniques,
2010.
[10] K. Konolige, M. Agrawal, R. C. Bolles, C. Cowan, M. Fischler, and
B. Gerkey. Outdoor mapping and navigation using stereo vision. In Proceedings of the International Symposium on Experimental Robotics,
pages 179–190, 2006. [11] S. Z. Li. Markov Random Field Modeling in Computer Vision. Springer,
2009.
[12] R. Manduchi, A. Castano, A. Talukder, and L. Matthies. Obstacle detection and terrain classification for autonomous off-road navigation. Autonomous Robots, 2004. [13] M. Montemerlo and S. Thrun. Large-scale robotic 3-d mapping of urban structures. In Experimental Robotics IX, pages 141–150. Springer, 2006.
[14]
F. Neuhaus, D. Dillenberger, J. Pellenz, and D. Paulus. Terrain drivability analysis in 3d laser range data for autonomous robot navigation in unstructured environments. In Proceedings of the IEEE International Conference on Emerging Technologies and Factory Automation, pages 1686–1689, 2009.
[15] Philips Electronics N.V. SPC1300NC. http://www.p4c.philips.com/cgi- bin/dcbint/cpindex.pl?ctn=SPC1300NC/00, April 2011. [16] J. Pellenz, D. Lang, F. Neuhaus, and D. Paulus. Real-time 3d mapping
of rough terrain: A field report from disaster city. In Proceedings of the IEEE International Workshop on Safety, Security and Rescue Robotics,
2010.
[17] A. Rankin, A. Huertas, and L. Matthies. Negative obstacle detection
by thermal signature. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 906–913, 2003. [18] Logitech Europe S.A. HD Pro C910. http://www.logitech.com/de- de/webcam-communications/webcams/devices/6816, April 2011.
[19] D. Scaramuzza, A. Harati, and R. Siegwart. Extrinsic self calibration of a camera and a 3d laser range finder from natural scenes. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4164–4169, 2007.
[20] T. Schenk and B. Csatho. Fusing imagery and 3d point clouds for reconstructing visible surfaces of urban scenes. In Proceedings of the IEEE GRSS/ISPRS Joint Workshop on Remote Sensing and Data Fusion over Urban Areas, 2007. [21] S. Theodoridis and K. Koutroumbas. Pattern Recognition. Academic Press, 2009.
[22] N. Vandapel, D. Huber, A. Kapuria, and M. Hebert. Natural terrain clas- sification using 3-d ladar data. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 5117–5122, 2004. [23] P. Vernaza, B. Taskar, and D. D. Lee. Online, self-super vised terrain classification via discriminatively trained submodular markov random fields. In IEEE International Conference on Robotics and Automation,
2008.
[24] C. Wellington, A. Courville, and A. Stentz. Interacting markov random
fields for simultaneous terrain modeling and obstacle detection. In Proceedings of Robotics Science and Systems, 2005. [25] D. F. Wolf, G. Sukhatme, D. Fox, and W. Burgard. Autonomous terrain mapping and classification using hidden markov models. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 2026–2031, 2005. [26] K. M. Wurm, R. Kümmerle, C. Stachniss, and W. Burgard. Improving robot navigation in structured outdoor environments by identifying vegetation from laser data. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1217–1222, 2009. [27] C. Ye and J. Borenstein. A new terrain mapping method for mobile robots obstacle negotiation. In Proceedings of the UGV Technology Conference at the SPIE AeroSense Symposium, pages 21–25, 2003.
Hierarchical Multi-Modal Place Categorization
Andrzej Pronobis ∗
Patric Jensfelt ∗
∗ Centre for Autonomous Systems, Royal Institute of Technology, Stockholm, Sweden
Abstract — In this paper we present a hierarchical approach to place categorization. Low level sensory data is processed into more abstract concepts, named properties of space. The framework allows for fusing information from heterogeneous sensory modalities and a range of derivatives of their data. Place categories are defined based on the properties, which decouples them from the low level sensory data. This allows for better scalability, both in terms of memory and computations. The probabilistic inference is performed in a chain graph which supports incremental learning of the room category models. Experimental results are presented where the shape, size and appearance of the rooms are used as properties along with the number of objects of certain classes and the topology of space.
Index Terms— place categorization; graphical models; semantic mapping; machine learning
I. INTRODUCTION
The topic of this paper is place categorization, denoting the problem of assigning a label (kitchen, office, corridor, etc) to each place in space. To motivate why this is useful, consider a domestic service robot. Such a robot should be able to "speak the language" of the operator/user to minimize training efforts and to be able to understand what the user is saying. That is, the robot should be able to make use of high level concepts such as rooms when communicating with a person, both to verbalize spatial knowledge and to process information received from the human in an efficient way. Besides robustness and speed, there are a number of additional desirable characteristics of a place categorization system:
C1: Categorization The system should support true categorization and not just recognition of room instances. That is, it should be able to classify an unknown room as "a kitchen" and not only recognize "the kitchen". C2: Spatio-temporal integration The system should support integration over space and time, as the information acquired at a single point rarely provides enough evidence for reliable categorization. C3: Multiple sources of information No single source of information will be enough in all situations and it is thus important to be able to make use of as much information as possible. C4: Handles input at various levels of abstraction The system should not only be able to use low level sensor data but also higher level concepts such as objects. C5: Automatically detect and add new categories The system should be able to augment the model with new categories identified from data. C6: Scalability and complexity The system should be scalable both in terms of memory and computations. That is, for example, it should not be a problem to double the number of
room categories. C7: Automatic and dynamic segmentation of space The system should be able to segment space into areas (such as rooms) automatically and should be able to revise its decision if new evidence suggesting another segmentation is received. C8: Support life-long incremental learning The robot system cannot be supplied with all the information at production time; it needs to learn along the way in an incremental fashion throughout its life. C9: Measure of certainty There are very few cases where the categorization can be made without uncertainty due to imperfections in sensing but also model ambiguities. Ideally the system should produce a probability distribution over all categories, or at least say something about the certainty in the result. In our previous work we have designed methods that meet C1, C3, C7 and partly C2, C4 and C9. In this paper we will improve on C4 and C9 and add C6 and C7. The main contribution of the paper relates to C4, C6 and C9.
A. Outline
Section II presents related work and describes our contribution with respect to it. Section III describes our method and Section IV provides implementation details. Finally, Section V describes the experimental evaluation and Section VI draws some conclusions and discusses future work.
II. RELATED WORK
In this section we give an overview of the related work in the area of place recognition and categorization. Place categorization has been addressed both by the computer vision and the robotics community. In computer vision the problem is often referred to as scene categorization. Although also related, object categorization methods are not covered here. However, we believe that objects are key to understanding space and we will include them in our representation but will make use of standard methods for recognizing/categorizing them. Table II maps some of the methods presented below to the desired characteristics presented in the previous section. In computer vision one of the first works to address the problem of place categorization is [19] based on the so called ”gist” of a scene. One of the key insights in the paper is that the context is very important for recognition and categorization of both places and objects and that these processes are intimately connected. Place recognition is formulated in the context of localization and information about the connectivity of space is utilized in an HMM. Place categorization is also addressed using a HMM. In [23] the problem of grouping images into semantic categories is addressed. It is pointed out that many
TABLE I
CHARACTERIZING SOME OF THE PLACE CATEGORIZATION WORK BASED ON THE DESIRABLE CHARACTERISTICS FROM SECTION I. The columns correspond to C1: Categorization, C2: Spatio/temporal, C3: Multi source, C4: Multi levels, C5: Novelty detection, C6: Scalability, C7: Segmentation, C8: Incremental and C9: Uncertainty.
natural scenes are ambiguous and the performance of the system is often quite subjective. That is, if two people are asked to sort the images into different categories they are likely to come up with different partitions. [23] argue that typicality is a key measure to use in achieving meaningful categorizations. Each cue used in the categorization should be assigned a typicality measure to express the uncertainty
when used in the categorization, i.e. the saliency of that cue. The system is evaluated in natural outdoor scenes. In [4] another method is presented for categorization of outdoors scenes based on representing the distribution of codewords
in each scene category. In [25] a new image descriptor, PACT, is presented and shown to give superior results on the datasets used in [19, 4].
In robotics, one of the early systems for place recognition
is [20] where color histograms are used to model the appearance of places in a topological map and place recognition is performed
as a part of the localization process. Later, [10] uses laser data to extract a large number of features used to train classifiers using AdaBoost. This system shows impressive results based on laser data alone. However, it is not able to identify and learn new categories: adding a new category requires off-line re-training, there is no measure of certainty, and space is segmented only implicitly by providing an estimate of the category for every point in space. In [12] this work is extended to also incorporate visual information in the form of object detections. Furthermore, this work also adds an HMM on top of the point-wise classifications to incorporate information about the connectivity of space and make use of information such as that offices are typically connected to corridors. In [14]
a vision-only place recognition system is presented. Support Vector Machines (SVMs) are used as classifiers. The characteristics are similar to those of [10]: the system cannot identify and learn new categories on-line, only works with data from a single source, and classification was done frame by frame. In [9, 16] a version of the system supporting incremental learning is presented. The other limitations remain the same. In [13] a measure of confidence is introduced as a means to better fuse different cues and also provide the consumer of the information with some information about the certainty in
the end result. In [15] the works in [10, 14] are combined using
an SVM on top of the laser and vision based classifiers. This allows the system to learn what cues to rely on in what room
category. For example, in a corridor the laser based classifier is more reliable than vision whereas in rooms the laser does
not distinguish between different room types. Segmentation
of space is done based on detecting doors that are assumed
to delimit the rooms. Evidence is accumulated within a room
to provide a more robust and stable classification. It is also
shown that the method supports categorization and not only
recognition. In [24] the work from [25] is extended with a
new image descriptor, CENTRIST, and a focus on visual place
categorization in indoor environment for robotics. A database, VPC, for benchmarking of vision based place categorization systems is also presented. A Bayesian filtering scheme is added on top of the frame based categorization to increase robustness and give smoother category estimates. In [17] the problem of place categorization is addressed in a drastically different and novel way. The problem is cast in a fully prob- abilistic framework which operates on sequences rather than individual images. The method uses change point detection to detect abrupt changes in the statistical properties of the data. A Rao-Blackwellized particle filter implementation is presented for the Bayesian change point detection to allow for real-time performance. All information deemed to belong to the same segment is used to estimate the category for that segment using a bag-of-words technique. In [27] a system for clustering panoramic images into convex regions of space indoors is presented. These regions correspond roughly with the human concept of rooms and are defined by the similarity between the images. In [21] panoramic images from indoor and outdoor scenes are clustered into topological regions using incremental spectral clustering. These clusters are defined by appearance and the aim is to support localization rather than human robot interaction. The clusters therefore have no obvious semantic meaning. As mentioned above [12] makes use of object observations to perform the place categorization. In [6] objects play a key role in the creation of semantic maps. In [18] a 3D model centered around objects is presented as a way to model places and to support place recognition. In [22] a Bayesian framework for connecting objects to place categories is presented. In [26] the work in [12] is combined with detections of objects to deduce the specific category of a room in a first-order logic way.
A. Contributions
In this paper we contribute a method for hierarchical cat- egorization of places. The method can make use of a very diverse set of input data, potentially also including spoken dialogue. We make use of classical classifiers (SVM in our
case, building on the work [15]) and a graphical model to fuse information at a higher level. The categorical models for rooms are based on so-called properties of space, rather than the low level sensor characteristics, which is the case in most of the other work presented above. This also means that a new category can be defined without the need to re-train from the sensor data level. The properties decouple the system. The introduction of properties also makes the system more scalable, as the low level resources (memory for models and computations for classifiers) can be shared across room categorizers. The system we present still relies on the detection of doors like [15], but the graphical model allows us to add and remove these doors and thus change the segmentation of space. The system will automatically adjust the category estimates for each room taking into account the new topology of space.
III. HIERARCHICAL MULTI-MODAL CATEGORIZATION
We pose the problem of place categorization as that of estimating the probability distribution of category labels, c i , over places, p j . That is, we want to estimate p(c i , p j ). We consider a discrete set of places rather than a continuous space. In our implementation the places are spread out over space like bread crumbs every one meter [26]. The places become nodes (representing free space) in a graph covering the environment. Edges are added when the robot has traveled directly between two nodes. In our previous work [26] we performed place catego- rization by combining a room/corridor classifier (based on [10]) with an ontology that related objects to specific room types. For example, we inferred being in a living room if the classification system reported a room and a sofa and a TV set were found (objects associated with a living rooms according to the ontology). This method had some clear and severe shortcomings that made it only appropriate for illustrating ideas rather than being a real world categorization system in anything but simple and idealized test scenarios. Furthermore, because the system was unable to retract inferred information any categorization was crisp and set in stone. Conceptually the solution has several appealing traits. It allowed us to teach the system, at a symbolic level, to distinguish different room categories simply by assigning specific objects to them. It combined information from low level sensor data (to classify room/corridor) with high level concepts such as objects. The place categorization system in this paper provides a principled way to maintain the advantages mentioned above even in natural environments. Our approach is based on the insight that what made the previous system easy to re-train was that the categorization was based on high level concepts rather than on low level sensor data. For this purpose, we introduce what we call properties of space where in the previous system the properties corresponded to the existence of certain types of objects. In general these properties could be related to, for example, the size, shape and appearance of a place. The introduction of properties decomposes our approach hi- erarchically. The categories are defined based on the properties and the properties are defined based on sensor data, either directly or in further hierarchies. This is closely related to the
work on part based object recognition and categorization [3]. The property based decomposition buys us better scalability in several ways. Instead of having to build a model from the level of sensor data for every new category, we can reuse the low level concepts. This saves memory (models for SVMs can be hundreds of megabytes in size) and saves computations (calculations shared across categories). The introduction of properties also makes training easier. Once we have the mod- els for the properties, training the system for a new category is decoupled from low level sensor data. The properties can be seen as high level basis functions on which the categories are defined, providing a significant dimensionality reduction. The graph made up of the free space nodes can be used to impose topological constraints on the places as well and help lay the foundation for the segmentation process.
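A minimal sketch of such a place graph is given below; the one-metre spacing follows the description above, while the class layout and update rule are illustrative assumptions rather than the authors' implementation.

```python
import math

class PlaceGraph:
    """Sketch of the place graph: nodes are dropped like bread crumbs every
    metre of travelled distance and connected to the previously visited node."""

    def __init__(self, spacing=1.0):
        self.spacing = spacing
        self.nodes = []          # (x, y) positions of places (free space)
        self.edges = set()       # pairs of node indices the robot travelled between
        self._last = None

    def update(self, x, y):
        """Call with the current robot pose; adds a new place if far enough."""
        if self._last is None or math.dist(self.nodes[self._last], (x, y)) >= self.spacing:
            self.nodes.append((x, y))
            new = len(self.nodes) - 1
            if self._last is not None:
                self.edges.add((self._last, new))
            self._last = new

if __name__ == "__main__":
    g = PlaceGraph()
    for step in range(30):
        g.update(step * 0.2, 0.0)      # robot drives 6 m in a straight line
    print(len(g.nodes), "places,", len(g.edges), "edges")
```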
Fig. 1. Structure of the graphical model for the places showing the influence of the properties and the topology on the categorization and segmentation.
We use a graphical model to structure the problem, start- ing from the place graph. More precisely we will use a probabilistic chain graph model [8]. Chain graphs are a natural generalization of directed (Bayesian Networks) and undirected (Markov Random Fields) graphical models. As such, they allow for modelling both “directed” causal as well as “undirected” symmetric or associative relationships, including circular dependencies. Figure 1 shows our graphical model. The structure of model depends on the topology of the environment. Each discrete place is represented by a set of random variables connected to variables representing the semantic category of areas. Moreover, the category variables are connected by undirected links to one another according to the topology of the environment. The potential functions φ rc (·, ·) represent the knowledge about the connectivity of
areas of certain semantic categories (e.g. kitchens are typically connected to corridors). The remaining variables represent properties of space. These can be connected to observations of features extracted directly from the sensory input. Finally,
the functions p_p1(·|·), p_p2(·|·), …, p_pN(·|·) model spatial properties. The joint density f of a distribution that satisfies the Markov property associated with a chain graph can be written as [8]:

f(x) = ∏_{τ ∈ T} f(x_τ | x_pa(τ)),
where T is the set of chain components and pa(τ) denotes the set of parents of the vertices τ. This factorization can be viewed as a graph with vertices representing the variables X_τ, for τ in T (one for each chain component). Each factor f(x_τ | x_pa(τ)) factorizes further into

f(x_τ | x_pa(τ)) = (1 / Z(x_pa(τ))) ∏_{α ∈ A(τ)} φ_α(x_α),

where A(τ) denotes sets of vertices in the moralized graph, such that in every set there exist edges between all pairs of vertices in the set. The factor Z normalizes the product into a proper distribution. In order to perform inference on the chain graph, we first convert it to a factor graph representation [1]. To meet the time constraints posed by most robotics applications, we use an approximate inference engine, namely Loopy Belief Propagation.
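As a toy illustration of such a chain graph, the sketch below combines an undirected potential φ_rc over the categories of two connected areas with a directed property model p(shape | category) and computes the exact posterior by enumeration; all numbers are invented, and a real system would use approximate inference (e.g., Loopy Belief Propagation) instead of enumeration.

```python
from itertools import product

# Tiny example: two connected areas with room categories, one shape property each.
CATEGORIES = ["corridor", "office"]

# phi_rc: pairwise potential over categories of connected areas (illustrative numbers).
phi_rc = {("corridor", "office"): 2.0, ("office", "corridor"): 2.0,
          ("corridor", "corridor"): 1.0, ("office", "office"): 0.5}

# p_shape(shape | category): conditional model of the shape property (illustrative numbers).
p_shape = {"corridor": {"elongated": 0.9, "square": 0.1},
           "office":   {"elongated": 0.2, "square": 0.8}}

def unnormalised_density(cat1, cat2, shape1, shape2):
    """Product of the undirected and directed factors of the toy chain graph."""
    return phi_rc[(cat1, cat2)] * p_shape[cat1][shape1] * p_shape[cat2][shape2]

def posterior_over_categories(shape1, shape2):
    """Exact posterior over (cat1, cat2) given the observed shape properties."""
    scores = {(c1, c2): unnormalised_density(c1, c2, shape1, shape2)
              for c1, c2 in product(CATEGORIES, repeat=2)}
    z = sum(scores.values())
    return {k: v / z for k, v in scores.items()}

if __name__ == "__main__":
    print(posterior_over_categories("elongated", "square"))
```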
IV. IMPLEMENTATION
In our implementation, each object class results in one property representing the expected/observed number of such objects. In addition, we make use of the following properties:
- shape (e.g. elongated, square) – estimated from laser data
- size (e.g. large or small, compared to other typical rooms) – estimated from laser data
- appearance (e.g. office-like appearance) – estimated from visual data
- doorway (whether the place is in a doorway) – estimated from laser data
Moreover, since places inside the same room tend to share similar functionality, in this work we cluster places into rooms based on the doorway property (using a door detector); this property is considered to be crisp. The resulting areas are not part of the chain graph but rather act as a grouping of places. However, the graphical model allows revising the topology if new information becomes available. The system therefore performs segmentation of space, and its dynamic nature is based on re-detection of doors. Figure 2 illustrates how the places are segmented into areas (ellipses) by doors (small circles) and how this defines the topology of areas. We follow the approach in [15] when defining the properties of shape, size and appearance (see [15] for details). The property categorizers are based on Support Vector Machines (SVMs) and are trained on data from rooms belonging to different categories and under different illumination settings (during the day and at night). By including several different room instances in the training data, the acquired model can generalize sufficiently to perform categorization rather than instance recognition.
Fig. 2. The set of places, {p_i}, is segmented into areas based on the detected doorways between the places. The doors form the edges in the topological area graph.
A measure of the uncertainty in the categorization results is obtained from the distances between the classified samples and the discriminative model hyperplanes (see [13] for details). To learn the probabilities associated with the relations between rooms, objects, shapes, sizes and appearances, we analyzed common-sense resources available online (see [7]) and the annotated data in the COLD-Stockholm database. The relations between rooms and objects were bootstrapped from part of the Open Mind Indoor Common Sense database. The object-location pairs found in this process were then used to form queries of the pattern 'object in the location' that were fed to an online image search engine. The number of hits returned was used as a basis for the probability estimate. Relations that were not found were assigned a certain low default probability so that they were not ruled out completely.
Figure 3 shows two distributions corresponding to the relation between the number of computers in a double office and a professor's office. In the specific case of the double office, we set the expected number of computers to two. In all the remaining cases the parameter λ is estimated by matching p(0) with the probability of there being no objects of a certain category according to the common sense knowledge bases.
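Assuming a Poisson model for the object counts (consistent with the single rate parameter λ mentioned here, though the distributional form is our assumption), matching p(0) with the common-sense probability of absence yields λ directly, as the short sketch below shows with a made-up value.

```python
import math

def poisson_rate_from_absence(p_no_object):
    """Recover lambda from the probability of observing no objects of a class.

    Under a Poisson(lambda) model for the object count, P(n = 0) = exp(-lambda),
    so matching p(0) with the common-sense estimate of absence gives
    lambda = -ln p(0).
    """
    return -math.log(p_no_object)

if __name__ == "__main__":
    # Assumed value: the knowledge base suggests a 30% chance that an office
    # contains no printer.
    lam = poisson_rate_from_absence(0.3)
    print(f"lambda = {lam:.2f}, expected number of printers = {lam:.2f}")
```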
V. EXPERIMENTS
A. Experimental Setup
The COLD-Stockholm database contains data from four floors. We divide the database into two subsets. For training and validation, we used the data acquired on floors 4, 5 and 7. The data acquired on floor 6 is used for evaluation of the performance of the property classifiers and for the real-world experiment. For the purpose of the experiments presented in this paper, we have extended the annotation of the COLD-Stockholm database to include 3 room shapes (elongated, square and rectangular), 3 room sizes (small, medium and large) as well as 7 general appearances (anteroom-, bathroom-, hallway-, kitchen-, lab-, meetingroom- and office-like). The room size and shape were decided based on the length ratio and maximum length of edges of a rectangle fitted to the room outline. These properties together with 6 object types defined the 11 room categories used in our experiments, see Figure 5.
B. Evaluation of Property Categorizers
The performance of each of the property categorizers was evaluated in separation. Training and validation datasets were formed by grouping rooms having the same values of properties. Parameters of the models were obtained by cross-validation. All training and validation data were then collected together and used for training the final models, which were evaluated on test data acquired in previously unseen rooms. Table II presents the results of the evaluation. The classification rates were obtained separately for each of the classes and then averaged in order to exclude the influence of an unbalanced training set. As can be seen, all classifiers provided a recognition rate above 80%. Furthermore, integrating the two visual cues (CRFH and BOW-SURF) increased the classification rate of the appearance property by almost 5%. From the confusion matrices in Fig. 4 we see that confusion occurs between property values that are semantically close.
TABLE II
Property   | Cues               | Classification rate
Shape      | Geometric features | 84.9%
Size       | Geometric features | 84.5%
Appearance | CRFH               | 80.5%
Appearance | BOW-SURF           | 79.9%
Appearance | CRFH + BOW-SURF    | 84.9%
Fig. 4. Confusion matrices for the evaluation of the property categorizers (Shape; Size; Appearance: CRFH, BOW-SURF, CRFH + BOW-SURF).
C. Real-world experiments
In the real-world experiment the robot was joysticked around manually. The robot started with only the models obtained in the evaluation of the property categorizers. Laser based SLAM [5] was performed while moving and new places were added every meter traveled into unexplored space. The robot was driven through 15 different rooms while performing real-time place categorization without relying on any previous observations of this particular part of the environment. The object observations were provided by human input. The information comes into the change graph in exactly the same way as real object detections would.
Figure 5 illustrates the performance of the system during part of a run. The 11 categories can be found along the vertical axis. The ground truth for the room category is marked with
a box with thick dashed lines. The Maximum a posteriori
(MAP) estimate for the room category is indicated with white dots. The system correctly identified the first two rooms as
a hallway and a single office using only shape, size and
general appearance (no objects were found). The next room was properly classified as a double office. The MAP estimate switches to professor's office for a short while when one computer is found and switches back again when a second is found. After some initial uncertainty, where the MAP switches category several times, the next room is classified as a double office until the robot finds a computer, at which point it switches to professor's office. Later the robot enters a robot lab which, according to its models, is very similar to a computer lab. Initially there is a slightly higher probability for the hypothesis that it is a computer lab, but once the robot detects a robot arm the robot lab hypothesis completely dominates. The next non-hallway room is a single person office currently occupied by several Master's students. Because of its current appearance, the best match is a double office. The robot continues and the
Mobile Robot Odor Plume Tracking Using Three Dimensional Information
Luís Osório, Gonçalo Cabrita and Lino Marques
Abstract—This paper studies the utilization of data gathered from multiple gas sensors placed vertically on a wheeled mobile robot to perform plume tracking strategies. The main research questions to be answered are whether this is advantageous and, if so, to quantify that advantage in terms of plume tracking over traditional two dimensional plume acquisition. This was accomplished by testing a few well known plume tracking algorithms (Surge-Cast, Casting and Gradient following) against a modified version of each one of the algorithms, designed to take into account odor data from the vertical gas sensor array. Extensive real world experiments were conducted in a large indoor environment. It was found that plume tracking with three dimensional acquisition outperformed the two dimensional variants in the tested environment by about 25%.
Index Terms—Three dimensional plume acquisition, mobile robot olfaction, odor search.
I. INTRODUCTION
The last two decades have seen a growing interest in autonomous odor source localization. The use of autonomous agents capable of locating odor sources can be extremely useful in various scenarios, from monitoring public structures for hazardous gasses or explosives to aiding in the search for survivors during a disaster. When working with mobile robot olfaction it is necessary to understand how odor dispersion behaves in uncontrolled conditions. In natural environments, even without noticeable air flow, odor disperses by thermal convection induced by thermal gradients existing on the surfaces. On the other hand, in the presence of advection the drainage speed will superimpose on this effect, usually becoming the dominant phenomenon. In any case, when close to planar surfaces, only laminar flow is possible, as the fluid moves parallel to the surface, resulting in the dispersion of odor molecules in the form of filaments that enlarge as they get farther away from the odor source. However, if the intensity of the air flow is high enough, as one gets further away from the source it is possible to perceive increasing turbulence. As the odor filament enters the turbulent areas it separates into smaller segments. The odor plume will continue to disperse in the direction of the air flow; however, it will now happen irregularly as it expands in a series of odor fractions. Furthermore, the weight of the particles and their temperature influence the vertical movement of the plume as well. All these effects, along with the way air is diverted by obstacles, will have a major impact on the process of odor dispersion [1]. The effects of air flow are part of everyday life; in spite of this, the opportunity to observe these patterns in full detail is
L. Osorio, G. Cabrita and L. Marques are with the Department of Electrical and Computer Engineering, Institute of Systems and Robotics, University of Coimbra, 3030-290 Coimbra, Portugal. {bica, goncabrita, lino}@isr.uc.pt
extremely rare. Occasionally they become visible due to particles in suspension (e.g. smoke leaving a chimney), or due to variations in the air density (e.g. the ascension of hot air over the asphalt in the summer). A simple technique that allows visualization of the air flow around a robot is the introduction of small solid or liquid particles into the air [2]. Their capacity to reflect or disperse light makes them visible. If the particles are sufficiently small, their low mass allows them to follow the changes of the air's trajectory, as they are not largely affected by gravity. Some examples of these cases are the smoke from tobacco, from burning oils, or the drops of steam [3].
A. Two Dimensional Odor Plume Tracking
Up to this point, most of the research in mobile robot olfaction has focused on the problem of odor source localization. This problem can be divided into three phases [4]. The first phase consists of finding odor cues (odor exploration), the second phase is characterized by following the plume (plume tracking) and the last phase is in charge of pinpointing the odor source (odor source declaration). Current methods and algorithms for detecting odor sources have originated in a wide range of areas of study: biological inspiration [6][7]; logical reasoning [8]; gradient following [9]; neural networks [12]; probabilistic methods [13]; multi-robot cooperation [14][15]. Odor tracking can be done with a single gas sensor and multiple spatial samples [5] or with the help of a directional anemometer. Lochmatter et al. [16] used this approach to evaluate the effectiveness of three algorithms that follow the wind direction to track odor sources. The odor sensor was used in a binary fashion, distinguishing only whether the robot was inside or outside the odor plume. In another work, this group studied the shortening of the search time obtained with the utilization of a group of robots instead of one [15]. Increasing the number of sensors allows a system to collect simultaneous readings of odor concentration in different positions, thus characterizing the atmosphere in a shorter time interval or orienting more easily towards a gradient ascent direction. Waphare et al. [18] explored this strategy with three odor sensors on a robot, one located on the front of the robot and the other two on the sides. The cooperation between robots has shown great potential to speed up the location of the odor source; however, due to the increase in complexity, progress in this area has been slow. On this subject one can also find the proposal of a cooperative Surge-Spiral algorithm [14], the odor guided cooperative exploration of areas using stochastic search techniques [20][21], an algorithm inspired by nature to be used in environments with low air flow [22], a system of cooperative odor guided odor source localization [23] and a study of
swarm-based robotic plume tracking [24]. Regardless of the approach, all the algorithms previously discussed share one characteristic: the odor plume is sampled at a constant height. This means that all these algorithms perceive the odor plume as a two dimensional entity when in fact the chemical will scatter over a three dimensional space.
B. Three Dimensional Odor Plume Tracking
Although there is plenty of work related to tracking airborne odor plumes in two dimensions, there have been relatively few works exploring the three dimensional capabilities of plume acquisition. Marques et al. [19] used a robot equipped with two gas sensors, one on the top and another near its bottom, in order to infer the localization of a gas source located on the ground. In [25] Ishida used a blimp as a platform to measure the odor concentration in three dimensions, creating an odor distribution map. In [26] a group of robots was used to determine the environment's odor concentration, resorting to the Kernel DM+V algorithm [27]. In [17] Russell built a robot with a gas sensor moved vertically through the action of a winch, which allowed measuring odor in three dimensions, building a 3D representation of the plume while it was tracked.
II. PROBLEM APPROACH
To track odor in three dimensions, a robot capable of determining the odor concentration at different heights and detecting the wind's direction was built. Although the robot has the ability to detect odor in three dimensions, it can only move in two dimensions. Therefore the odor was measured at different heights; however, only the highest concentration measured along the gas sensor array was used, resulting in a pseudo three dimensional system. An analogy can be made with mobile robot navigation. While the robot can only move in a two dimensional space, it occupies a three dimensional space and as such should avoid any obstacle in its way. In our case, although the robot moves in a two dimensional space, the plume will scatter along a three dimensional space, hence the need to detect it as such yet translate that information to the robot's two dimensional workspace. Three algorithms (Surge-Cast, Casting and Gradient) were tested tracking three dimensional odor plumes and compared against a purely two dimensional system, to evaluate the advantages of three dimensional plume acquisition over the traditional two dimensional method for the purpose of plume tracking. Surge-Cast and Casting use upwind movement while Gradient, as the name suggests, relies solely on concentration measurements in order to estimate the gradient. Although these algorithms are already well known, a quick overview follows.
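The collapse of the vertical array into a single value for the planar tracking algorithms can be sketched as follows; the function and parameter names are illustrative and not taken from the authors' ROS package:

def plume_reading(sensor_ppm, use_3d=True, fixed_sensor=0):
    # Collapse the vertical gas-sensor array into one concentration value.
    # sensor_ppm: calibrated readings (ppm) ordered along the array.
    # In the pseudo three dimensional setup the maximum over the whole array
    # is used, as described above; in the two dimensional setup only one
    # fixed sensor (here an arbitrary index) is read.
    if use_3d:
        return max(sensor_ppm)
    return sensor_ppm[fixed_sensor]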
A. Surge-Cast
The robot executes an upwind movement while it is inside the odor plume, executing movements perpendicular to the wind when it leaves the plume. Once it is back inside the plume the robot returns to performing the upwind movement [16]. This is illustrated in Figure 1 and the pseudocode can be found in Algorithm 1.
Algorithm 1: Surge-Cast
if odor_reading < threshold and !found_plume then
    wander();
end
if odor_reading >= threshold then
    upwindSurge();
else if found_plume then
    if last_direction == left then
        castRight();
        last_direction = right;
    else
        castLeft();
        last_direction = left;
    end
end
Figure 1 – Surge-Cast algorithm.
B. Casting
While inside the odor plume, the robot moves towards the wind direction with an added offset angle that makes the robot leave the odor plume from a known side. When it is outside of the plume it moves perpendicularly to the wind in the opposite direction of the offset angle until it is once more inside the plume, repeating the process with an inverted offset angle [16]. This is illustrated in Figure 2 while the pseudocode can be found in Algorithm 2.
Algorithm 2: Casting
if odor_reading < threshold and !found_plume then
    wander();
end
if odor_reading == threshold then
    if reaquiring_plume then
        angle = -angle;
        reaquiring_plume = false;
    end
    cast(angle);
else if found_plume then
    cast(-(angle/||angle||) * π/2);
    reaquiring_plume = true;
end
Figure 2 – Casting algorithm.
C. Gradient
This algorithm only uses odor concentration data to navigate. The robot describes a circle to acquire odor concentration in its neighborhood, moving to the position where the maximum concentration was found, repeating the process until it reaches the odor source. The radius of the circle is enlarged if odor was detected in most of the circular path and is reduced if odor was absent for most of the path. This is illustrated in Figure 3 and the pseudocode can be found in Algorithm 3.
Algorithm 3: Gradient
if odor_reading < threshold and found_plume then wander ( );
end if odor_reading == threshold then circularPath(radius); goToMaxOdor();
end /* Percentage of path where odor was found */ if odor_percent > odor_percent_upper_threshold then radius = radius_inc_coeff*radius;
end /* Percentage of path where was found odor */ if odor_percent < odor_percent_lower_threshold then radius = radius_dec_coeff*radius;
end
Figure 3 – Gradient algorithm.
III. EXPERIMENTAL SETUP
A. The Robot
A Videre Erratic robot was used for all the experiments presented in this paper. The robot was equipped with a laser range finder (Hokuyo URG-04LX) for obstacle avoidance, a 2D anemometer (Gill Instruments WindSonic) for wind direction measurement and a vertical array of four Figaro TGS2620 gas sensors placed 0.5m apart from each other, as can be seen in Figure 4. The gas sensors were connected to the Erratic through an Arduino microcontroller board. This robot allowed testing both two and three dimensional setups, by using concentration data from only one sensor (2D) or by using the full array of gas sensors (3D). The Erratic was running the ROS navigation stack for autonomous navigation and localization [11]. A metric map of the environment was provided to the robot for each experiment. All the algorithms developed for this work were implemented using the ROS framework and can be downloaded from http://www.ros.org/wiki/odor_search.
Figure 4 – The robot used in the experiments and the location of the Figaro TGS-2620 gas sensors on the robot.
B. Gas Sensor Calibration
Since an array of gas sensors was used, a simultaneous calibration of the Figaro TGS-2620 array was done to guarantee that all the readings were referenced to the same concentration reference (ppm in this case). This was accomplished by enclosing all four sensors inside a compartment with constant volume, pressure and temperature and then inserting controlled volumes of liquid ethanol, which was later vaporized. The readings of each sensor were registered after each ethanol increase.
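A minimal sketch of how such a calibration could be applied at run time, assuming the raw reading and the injected concentration were logged at each step of the procedure; all numbers below are placeholders, not measured values:

import numpy as np

# Hypothetical calibration curves: raw readings of each of the four sensors
# recorded at known ethanol concentrations (ppm) inside the sealed chamber.
known_ppm = np.array([0.0, 10.0, 20.0, 40.0, 80.0])
raw_per_sensor = [np.array([120.0, 210.0, 290.0, 430.0, 650.0]),
                  np.array([115.0, 205.0, 300.0, 445.0, 660.0]),
                  np.array([130.0, 220.0, 310.0, 450.0, 670.0]),
                  np.array([118.0, 215.0, 295.0, 440.0, 655.0])]

def raw_to_ppm(sensor_idx, raw_value):
    # Piecewise-linear interpolation of the sensor's calibration curve,
    # so all four sensors report on the same concentration scale.
    return float(np.interp(raw_value, raw_per_sensor[sensor_idx], known_ppm))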
C. The Testing Environment
C. The Testing Environment
Although simulation is an important step during research, it is important to ensure that algorithms perform well in the real world. Most of the proposed odor localization algorithms have only been demonstrated in simulation so far. In the real world the behavior of an odor plume can appear random and chaotic most of the time, meaning that a realistic plume simulation is hard to achieve. Real world experimentation is thus an important tool for validating odor related algorithms. With this in mind, the experiments for this work were performed in our building's entrance hallway, which is 5m wide, more than 100m long and more than 10m high. A part of the testing environment can be seen in Figures 5 and 6. It was observed that this hallway generated an air flow in one direction during the day and in the opposite direction during the night. The tests were performed during the night to have similar conditions during all the tests and to avoid disturbances from people
walking by. No artificial air flow was induced during the experiments, but a steady airflow ranging from 0.1m/s to 0.3m/s was observed. The ethanol vapor was released into the atmosphere by bubbling air through an ethanol solution.
Figure 5 – Pictures of the hallway where the experiments were conducted.
D. Particle Plume
The Particle Plume mapping algorithm was presented in [10] and allows for a three dimensional representation of odor plumes. To accomplish this, the chemical readings from the gas sensors are converted into a point cloud around the sensor's location in a global frame. The number of points generated for each reading is proportional to the chemical reading in parts per million (ppm). The point cloud is confined to a predefined radius around the point of origin. The newly created point cloud is finally added to the plume. If the volume occupied by the new points contains older points, these older points are removed from the plume.
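A condensed sketch of the bookkeeping described above, assuming a fixed confinement radius and a points-per-ppm constant (both values are placeholders; the actual implementation is the Particle Plume package of [10]):

import numpy as np

class ParticlePlume:
    def __init__(self, radius=0.2, points_per_ppm=2.0):
        self.radius = radius                  # confinement radius around a reading [m]
        self.points_per_ppm = points_per_ppm  # proportionality constant
        self.points = np.empty((0, 3))        # accumulated plume point cloud

    def add_reading(self, sensor_xyz, ppm):
        sensor_xyz = np.asarray(sensor_xyz, dtype=float)
        # Older points inside the volume of the new observation are removed.
        if len(self.points):
            keep = np.linalg.norm(self.points - sensor_xyz, axis=1) > self.radius
            self.points = self.points[keep]
        # The number of new points is proportional to the reading in ppm;
        # they are sampled uniformly inside a sphere around the sensor pose.
        n = int(self.points_per_ppm * ppm)
        if n > 0:
            d = np.random.normal(size=(n, 3))
            d /= np.linalg.norm(d, axis=1, keepdims=True)
            d *= self.radius * np.random.rand(n, 1) ** (1.0 / 3.0)
            self.points = np.vstack([self.points, sensor_xyz + d])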
E. Gas Molecular Weight
The Figaro TGS-2620 has the ability to measure alcohol and organic gases, mainly Ethanol (CH3CH2OH), Hydrogen (H2), Isobutane (C4H10), Carbon Monoxide (CO) and Methane (CH4). Since their compositions differ, they behave differently when released into the environment, affecting the way they are carried by the air flux. The molecular weight of the chemical being released (Table 1) has a great influence on the experiments conducted in this work as it changes the way the chemical moves vertically. If the chemical's molecular weight is greater than the air's, it will tend to descend; otherwise the chemical will tend to rise. This means that away from the source the chemical might be traced at a height different from that at which it was released, supporting the idea of sensing chemicals in three dimensions.
IV. EXPERIMENTAL RESULTS
The first set of experiments consisted in setting the odor source at different heights and producing a map of the plume in order to determine the differences between representing the plume in two and three dimensions. First the odor source was set at ground height and the Surge-Cast algorithm was used to track the plume. This test was performed for both the two and three dimensional versions of the algorithm.
Figure 6 – Picture and scheme of the testing area. Notice the odor source, the wind direction and the three starting positions marked by three stars.
Next the source was placed at the height of the third gas sensor and the test was repeated. The second set of experiments consisted in comparing the performance of both setups (two and three dimensions) for each algorithm. Several tests were conducted, consisting in placing the robot outside the odor plume and then instructing the robot to move into it. As soon as the gas sensors detected concentrations above 7ppm, the robot started tracking the odor plume to its source using the algorithms described previously. When the detected odor concentration was higher than a predefined threshold measured in the neighborhood of the source (30ppm), the odor source was declared found and the experiment finished. A set of 9 experiments was performed for each setup (two and three dimensions) and each algorithm, in a total of 54 experiments. The odor source was placed about 1.06m above the ground.
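The two thresholds above fully determine when a trial switches from searching to tracking and when it terminates; a minimal sketch of that outer loop, with the robot interface passed in as hypothetical callables:

DETECT_PPM = 7.0    # concentration that triggers plume tracking
DECLARE_PPM = 30.0  # concentration near the source that ends a run

def run_trial(read_max_ppm, wander_step, tracking_step, max_steps=10000):
    found_plume = False
    for _ in range(max_steps):
        ppm = read_max_ppm()      # highest reading along the sensor array
        if ppm >= DECLARE_PPM:
            return True           # odor source declared found
        if ppm >= DETECT_PPM:
            found_plume = True
        if found_plume:
            tracking_step()       # one Surge-Cast, Casting or Gradient step
        else:
            wander_step()
    return False                  # trial aborted without a declaration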
A. Results
Figure 7 shows the results for the plume mapping experiments conducted with the odor source at ground height, while in Figure 8 it is possible to see the results for the same experiment conducted with the odor source at the height of the third gas sensor. These screenshots were taken from rviz, a 3D visualization environment for robots using ROS. The plume is represented by the red point cloud while the yellow path represents the visited map cells. The results of the performance test between two dimensional and three dimensional plume acquisition for plume tracking can be found in Table 2. Furthermore, Figure 9 shows three rviz screenshots, one from each of the algorithms tested, running with the three dimensional version of the software.
Table 1 – Molecular weight of gases detected by the Figaro TGS-2620.
Gas             | Formula   | Molecular weight (g/mol)
Hydrogen        | H2        | 2.0
Methane         | CH4       | 16.0
Carbon monoxide | CO        | 28.0
Air             | -         | 29.0
Ethanol         | CH3CH2OH  | 46.1
Isobutane       | C4H10     | 58.1
a) Two dimensional plume acquisition.
b) Three dimensional plume acquisition. Figure 7 – Surge-cast with the odor source at ground height.
Figure 8 – Surge-cast with the odor source at the third gas sensor’s height, about 1.06m from the ground. Three dimensional plume acquisition.
a) Surge-cast.
b) Casting
c) Gradient. Figure 9 – Three experiments with three dimensional plume acquisition.
Table 2 – Results of the tests with Surge-Cast, Casting and Gradient, both in 2D and 3D.

               | Surge-Cast |     | Casting |     | Gradient |     | Total |
               | 2D | 3D    | 2D  | 3D      | 2D  | 3D       | 2D  | 3D
Success        | 5  | 7     | 8   | 8       | 2   | 4        | 15  | 19
Fail           | 4  | 2     | 1   | 1       | 7   | 5        | 12  | 8
Succ. Rate (%) | 56 | 78    | 89  | 89      | 22  | 44       | 56  | 70
V. DISCUSSION
A. Odor Plume Acquisition at Various Heights
In the first experiment, ethanol was released at ground level and, since ethanol vapor is heavier than air, it tends to stay close to the ground. As a consequence, the three upper sensors were unable to detect the odor. In these conditions the three dimensional setup has no advantage over the two dimensional one; in fact they perform exactly the same, since only the lower sensor detects any odor. In the second experiment the advantages of using sensors at different heights became evident. The second sensor was the first to detect odor, leading the robot to the proximity of the odor source. Then the third sensor, at 1.06m height, started detecting the odor, leading the robot to the odor source and performing the declaration. This can be seen in Figure 8. The lack of the gas sensor array would have resulted in the robot not finding the odor source.
B. Odor Plume Tracking
The Surge-Cast algorithm showed an improvement in the success rate when using the three dimensional setup, raising its success ratio from 56% to 78%. This algorithm takes the robot close to the border of the odor plume, hence its success rate suffered from the flickering of the odor concentration caused by the irregular air flow. It was noticeable that the two dimensional setup sometimes failed to reacquire the odor plume due to momentary decreases in the odor concentration while the cast was being performed. In the same situation the three dimensional setup performed much better, as the momentary decrease did not affect all the sensors at the same time. The Casting algorithm performed successfully with both the two and three dimensional setups, with a success rate of 89% in either case. This is probably due to the fact that this algorithm relies mostly on the wind direction to navigate. The Gradient algorithm, on the other hand, relies solely on the odor data to determine the movement of the robot. It is thus no surprise that the three dimensional setup revealed itself to be more accurate than the two dimensional setup, scoring a success rate of 44% against 22%. Overall, the three dimensional odor acquisition improved the success rate of the plume tracking from 56% to 70%, representing an increase of about 25% in the success rate. For the two dimensional setup experiments the gas sensor used was always the one at the odor source's height. The two dimensional setup would have performed worse if sensor and source had been at different heights, resulting in an even greater difference between the traditional two dimensional odor acquisition setup and the three dimensional one.
VI. CONCLUSIONS
Three dimensional plume acquisition proved to be more efficient than the more traditional two dimensional approach in terms of odor plume tracking. Odor moves in all three dimensions, rising or descending depending on the molecular weight of the mixture. Even if a chemical compound is heavier or lighter than air, it can still be carried upward or downward by the wind (vortexes will affect the plume not only in width and length but also in height), rendering a ground robot equipped with a single fixed gas sensor handicapped in various scenarios. A robot with this setup might be unable to pinpoint the odor source or even fail to detect the plume at all. Many odor related algorithms have been demonstrated in constrained conditions, with the odor source and gas sensors located at a height favorable for plume detection. This work shows that as the field of mobile robot olfaction moves towards real world applications, the ability to acquire a plume in all its three dimensions will become a mandatory requirement. Future work in this area will include the implementation of other plume tracking algorithms (namely the inclusion of wind information in a variant of the gradient-based algorithm), the study of plumes generated by other gases of different molar masses and, finally, the possibility to fully track odor plumes in all their three dimensions and declare the corresponding source.
REFERENCES
[1] R. A. Russel, "Odour Detection by Mobile Robots", World Scientific Publishing, 1999.
[2] R. C. Pankhurst and D. W. Holder, "Wind-Tunnel Technique: An Account of Experimental Methods in Low-Speed and High-Speed Wind Tunnels", Sir Isaac Pitman & Sons, 1952.
[3] T. J. Mueller, "Flow visualization by direct injection," in R. J. Goldstein (ed.), Fluid Mechanics Measurements, 1983.
[4] A. Lilienthal, A. Loutfi and T. Duckett, "Airborne Chemical Sensing with Mobile Robots", Sensors, vol. 6, pp. 1616-1678, 2006.
[5] L. Marques, U. Nunes, and A. T. de Almeida, "Olfaction-based mobile robot navigation", Thin Solid Films, vol. 418, no. 1, pp. 51-58, 2002.
[6] N. J. Vickers, "Mechanisms of animal navigation in odor plumes," Biological Bulletin, vol. 198, pp. 203-212, 2000.
[7] F. W. Grasso, T. R. Consi, D. C. Mountain, and J. Atema, "Biomimetic robot lobster performs chemo-orientation in turbulence using a pair of spatially separated sensors: Progress and challenges", Rob. Autonom. Syst., vol. 30, pp. 115-131, 2000.
[8] R. A. Russell, D. Thiel, R. Deveza, and Mackay-Sim, "A robotic system to locate hazardous chemical leaks", Proc. IEEE Int. Conf. Robotics and Automation, pp. 556-561, 1995.
[9] H. Ishida, T. Nakamoto, and T. Moriizumi, "Remote sensing of gas/odor source location and concentration distribution using mobile system", Sensors and Actuators B, vol. 49, pp. 52-57, 1998.
[10] G. Cabrita, P. Sousa and L. Marques, "Odor guided exploration and plume tracking - Particle Plume Explorer", Proc. of European Conf. on Mobile Robots (ECMR), 2011.
[11] E. Marder-Eppstein, E. Berger, T. Foote, B. Gerkey and K. Konolige, "The Office Marathon: Robust Navigation in an Indoor Office Environment", Proc. IEEE Int. Conf. on Robotics and Automation, Anchorage, USA, 2010.
[12] A. M. Farah and T. Duckett, "Reactive localization of an odor source by a learning mobile robot", Proc. 2nd Swedish Workshop on Autonomous Robotics, Stockholm, pp. 29-38, 2002.
[13] J. Farrell, J. Murlis, W. Li, and R. Carde, "Filament-based atmospheric dispersion model to achieve short time-scale structure of odor plumes", Environmental Fluid Mechanics, vol. 2, pp. 143-169, 2002.
[14] A. T. Hayes, A. Martinoli, and R. M. Goodman, "Distributed odor source localization", IEEE Sensors Journal, vol. 2, no. 3, pp. 260-271, June 2002.
[15] T. Lochmatter, N. Heiniger, and A. Martinoli, "Understanding the Potential Impact of Multiple Robots in Odor Source Localization", 9th Symposium on Distributed Autonomous Robotic Systems, Tsukuba, Japan, November 2008.
[16] T. Lochmatter, N. Heiniger, and A. Martinoli, "Theoretical Analysis of Three Bio-Inspired Plume Tracking Algorithms", Proc. IEEE Int. Conf. on Robotics and Automation, Kobe, Japan, May 2009.
[17] A. R. Russel, "Tracking Chemical Plumes in 3-Dimensions", Proc. Int. Conf. on Robotics and Biomimetics, pp. 31-36, Clayton, Australia, 2006.
[18] S. Waphare, D. Gharpure, and A. Shaligram, "Implementation of 3-Nose Strategy in Odor Plume-Tracking Algorithm", Proc. Int. Conf. on Signal Acquisition and Processing, 2010.
[19] L. Marques, N. Almeida, and A. de Almeida, "Mobile robot olfactory sensory system", Proc. of EuroSensors, 2003.
[20] L. Marques, U. Nunes, and A. T. de Almeida, "Particle swarm-based olfactory guided search", Autonomous Robots, vol. 20, no. 3, 2006.
[21] M. Turduev, M. Kirtay, P. Sousa, V. Gazi, and L. Marques, "Chemical concentration map building through bacterial foraging optimization based search algorithm by mobile robots", Proc. IEEE Int. Conference on Systems Man and Cybernetics (SMC), 2010.
[22] G. Ferri, E. Caselli, V. Mattoli, A. Mondini, B. Mazzolai and P. Dario, "Spiral: A novel biologically-inspired algorithm for gas/odor source localization in an indoor environment with no strong airflow", Robotics and Autonomous Systems, vol. 57, no. 4, 2009.
[23] A. Marjovi, J. Nunes, P. Sousa, R. Faria, and L. Marques, "An Olfactory-Based Robot Swarm Navigation Method", Proc. IEEE Int. Conf. on Robotics and Automation, pp. 4958-4963, Alaska, USA, 2010.
[24] D. V. Zarzhitsky, D. F. Spears, and D. R. Thayer, "Experimental studies of swarm robotic chemical plume tracing using computational fluid dynamics simulations", Int. Journal of Intelligent Computing and Cybernetics, 2010.
[25] H. Ishida, "Blimp Robot for Three-Dimensional Gas Distribution Mapping in Indoor Environment", in Proc. of the 13th International Symposium on Olfaction and Electronic Nose, pp. 61-64, 2009.
[26] G. Ferri, A. Mondini, A. Manzi, B. Mazzolai, C. Laschi, V. Mattoli, M. Reggente, T. Stoyanov, A. Lilienthal, M. Lettere and P. Dario, "DustCart, a Mobile Robot for Urban Environments: Experiments of Pollution Monitoring and Mapping during Autonomous Navigation in Urban Scenarios", Proc. ICRA Workshop on Networked and Mobile Robot Olfaction in Natural, Dynamic Environments, 2010.
[27] A. J. Lilienthal, M. Reggente, M. Trincavelli, J. L. Blanco, and J. Gonzalez, "A Statistical Approach to Gas Distribution Modelling with Mobile Robots - The Kernel DM+V Algorithm", Proc. Int. Conf. on Intelligent Robots and Systems, pp. 570-576, USA, 2009.
A Multi-Goal Path Planning for Goal Regions in the Polygonal Domain
Jan Faigl∗, Vojtěch Vonásek, Libor Přeučil
∗ Center for Applied Cybernetics, Department of Cybernetics Faculty of Electrical Engineering, Czech Technical University in Prague {xfaigl,vonasek,preucil}@labe.felk.cvut.cz
Abstract — This paper concerns a variant of the multi-goal path planning problem in which goals may be polygonal regions. The problem is to find a closed shortest path in a polygonal map such that all goals are visited. The proposed solution is based on a self-organizing map algorithm for the traveling salesman problem, which is extended to polygonal goals. Neurons' weights are considered as nodes inside the polygonal domain and connected nodes represent a path that evolves according to the proposed adaptation rules. Performance of the rules is evaluated in a set of problems including an instance of the watchman route problem with restricted visibility range. Regarding the experimental results, the proposed algorithm provides a flexible approach to solve various NP-hard routing problems in polygonal maps.
I. INTRODUCTION
The multi-goal path planning problem (MTP) stands to find a shortest path connecting a given set of goals located in a robot working environment. The environment can be represented by the polygonal domain W and the goals may be sensing locations in the inspection task. Such point goals, guaranteeing that the whole environment will be covered using a sensor with limited sensing range, can be found by a sensor placement algorithm [8]. The MTP with point goals can be formulated as the Traveling Salesman Problem (TSP) [16], e.g., using all shortest paths between goals found in a visibility graph by Dijkstra's algorithm. Then, the MTP is a combinatorial optimization problem to find a sequence of goals' visits.
A more general variant of the MTP can be more appropriate if objects of interest may be located in certain regions of W, e.g., when it is sufficient to reach a particular part of the environment to "see" the requested object. In such a problem formulation, a goal is a polygonal region rather than a single point. Several algorithms addressing this problem can be found in the literature; however, only for particular restricted variants. For example, goals form a disjoint set of convex polygons attached to a simple polygon in the safari route problem [12], which can be solved in O(n^3) [17]. If the route is not allowed to enter the convex goals, the problem is called the zoo-keeper problem, which can be solved in O(n log n) for a given starting point and the full shortest path map [1]. However, both problems are NP-hard in general.
Routing problems with polygonal goals can be considered as variants of the TSP with neighborhoods (TSPN) [10]. The TSPN is studied for graphs or as a geometric variant in a plane, but typically without obstacles. Approximate algorithms for restricted variants of the TSPN have been proposed [5, 3]; however, the TSPN is APX-hard and cannot be approximated to within a factor 2 − ε for any ε > 0, unless P=NP [13].
A combinatorial approach [14] can be used for the MTP
with partitioned goals, where each goal is represented by
a finite (small) set of point goals. However, combinatorial
approaches cannot be used for continuous sets because there are too many possibilities how to connect the goals. This is also the
case of the watchman route problem (WRP) in which goals
are not explicitly prescribed. The WRP stands to find a closed shortest path such that all points of W are visible from at least one point [11]. Although polynomial algorithms have been proposed for a restricted class of polygons [2], the WRP
is NP-hard for the polygonal domain.
In this paper, a self-organizing map (SOM) algorithm for
the TSP in W [9] is modified to deal with a general variant
of the MTP. Contrary to combinatorial approaches, a geometrical interpretation of SOM evolution in W allows easy and straightforward extensions to deal with polygonal goals. To demonstrate the geometric relation between the learning network and polygonal goals, several modifications of the adaptation rules are proposed and evaluated in a set of problems. The main advantage of the proposed approach is the ability to address general multi-goal path planning problems in W (not only in a simple polygon) and with goals not necessarily attached to W. The rest of this paper is organized as follows. The addressed problem formulation is presented in the next section. The proposed algorithms are based on the SOM adaptation schema for the TSP in W, and therefore, a brief overview of the schema is presented in Section III. The proposed modifications of the adaptation rules for polygonal goals are presented in Section IV. Experimental evaluation of the proposed algorithm variants is presented in Section V. Concluding remarks are presented in Section VI.
II. PROBLEM STATEMENT
The problem addressed in this paper can be formulated as follows: Find a closed shortest path visiting a given set of goals represented as convex polygons (possibly overlapping each other) in a polygonal map W. The problem formulation is based on the safari route problem [12]; however, it is more general in three aspects. First, polygons can be placed inside a polygon with holes. Also, it is not required that convex polygons are attached to the boundary of W as in the original safari route problem formulation. Finally, polygons
can overlap each other, and therefore, such polygons can represent a polygonal goal of an arbitrary shape.
The proposed problem formulation comprises the WRP with restricted visibility range d. The set of goals can be found as a convex cover set of W, i.e., a set of convex polygons whose union is W. The advantage of an algorithm solving the formulated problem is that it is not required to have a minimal cover set. The convex polygons restricted to the size d can be found by a simple algorithm based on a triangular mesh of W [6].
III. SOM ALGORITHM FOR THE TSP IN W
A SOM algorithm for routing problems, in particular the SOM for the TSP in W [9], is a Kohonen-type unsupervised two-layered learning neural network. The network has a two dimensional input vector and an array of output units that are organized into a uni-dimensional structure. An input vector represents coordinates of a point goal and connections' weights (between the input and output units) represent coordinates of the output units. Connections' weights can be considered as nodes representing a path, which provides a direct geometric interpretation of neurons' weights. So, the nodes form a ring in W because of the uni-dimensional structure of the output layer, see Fig. 1.
Fig. 1: A schema of the two-layered neural network and associated geometric representation.
The network learning process is an iterative stochastic procedure in which goals are presented to the network in a random order. The procedure basically consists of two phases: (1) selection of winner node to the presented goal; (2) adaptation of the winner and its neighbouring nodes toward the goal. The learning procedure works as follows. 1) Initialization: For a set of n goals G and a polygonal map W, create 2n nodes N around the first goal. Let the initial value of the learning gain be σ=12.41n+0.06, and adaptation parameters be µ=0.6, α=0.1. 2) Randomizing: Create a random permutation of goals Π(G). 3) Clear Inhibition: I ← ∅. 4) Winner Selection: Select the closest node ν to the goal g ∈ Π(G) according to:
ν∗ ← argmin_{ν ∈ N, ν ∉ I} |S(ν, g)|,
where |S(ν, g)| is the length of the shortest path S(ν, g) among obstacles from ν to g.
5) Adapt: Move ν∗ and its neighbouring nodes along a particular path toward g:
• Let the current number of nodes be m, and let N∗ (N∗ ⊆ N) be the set of ν∗'s neighbourhoods within cardinal distance less than or equal to 0.2m.
• Move ν∗ along the shortest path S(ν∗, g) toward g by the distance |S(ν∗, g)|µ.
• Move nodes ν ∈ N∗ toward g along the path S(ν, g) by the distance |S(ν, g)|µf(σ, l), where f is the neighbouring function f = exp(−l^2/σ^2) and l is the cardinal distance of ν to ν∗.
• Update the permutation: Π(G) ← Π(G) \ {g}.
• Inhibit the winner: I ← I ∪ {ν∗}. If |Π(G)| > 0 go to Step 4.
6) Decrease the learning gain: σ ← (1 − α)σ.
7) Termination condition: If all goals have their winner within a sufficient distance, e.g., less than 10^-3, or σ < 10^-4, stop the adaptation. Otherwise go to Step 2.
8) Final path construction: Use the last winners to determine the sequence of goals' visits.
that each goal has an associated distinct winner; thus, a sequence of all goals' visits can be obtained by traversing the ring at the end of each adaptation step.
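A condensed Python sketch of one adaptation step (Steps 2-6) may help to see how the pieces fit together. It assumes two helpers that are not defined here: path_length(a, b), the length of the (approximate) shortest path in W, and move_along_path(a, b, d), the point reached after travelling distance d from a toward b along that path; the parameter values follow the schema above.

import math
import random

def adaptation_step(nodes, goals, sigma, path_length, move_along_path, mu=0.6, alpha=0.1):
    m = len(nodes)
    inhibited = set()
    for g in random.sample(goals, len(goals)):        # random permutation of goals
        # Winner selection among non-inhibited nodes (Step 4).
        candidates = [i for i in range(m) if i not in inhibited]
        winner = min(candidates, key=lambda i: path_length(nodes[i], g))
        inhibited.add(winner)
        # Adaptation of the winner and its ring neighbours (Step 5).
        width = int(0.2 * m)
        for offset in range(-width, width + 1):
            i = (winner + offset) % m
            l = abs(offset)                           # cardinal distance on the ring
            f = math.exp(-(l * l) / (sigma * sigma))  # neighbouring function
            d = path_length(nodes[i], g) * mu * f
            nodes[i] = move_along_path(nodes[i], g, d)
    return (1.0 - alpha) * sigma                      # decreased learning gain (Step 6)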
The computational burden of the adaptation procedure depends on determination of the shortest path in W, because 2n^2 node–goal distance queries (Step 4) and (0.8n+1)n node–goal path queries (Step 5) have to be resolved in each adaptation step. Therefore, an approximate shortest path is considered using a supporting division of W into convex cells (a convex partition of W) and pre-computed all shortest paths between map vertices and the point goals. The approximate node–goal path is found as a path over vertices of the cells in which the points (node and goal) are located. Then, such a rough approximation is refined using a test of direct visibility from the node to the vertices of the path. Details and evaluation of refinement variants can be found in [9]. Besides the approximation, the computational burden can be decreased using the Euclidean pre-selection [7], because only a node with a shorter Euclidean distance to the goal than the distance (length of the approximate shortest path) of the current winner node candidate can become the winner. In Fig. 2, a ring of nodes connected by an approximate shortest path between two points is shown to provide an overview of the ring evolution in W.
IV. ADAPTATION RULES FOR POLYGONAL GOALS
Although it is obvious that a polygonal goal can be sampled into a finite set of points and the problem can be solved as the MTP with partitioned goals, the aforementioned SOM procedure can be straightforwardly extended to sample the goals during the self-adaptation. Thus, instead of explicit sampling of the goals three simple strategies how to deal with adaptation toward polygonal goals are presented in this section. The proposed algorithms are based on the SOM for
(a) step 29
(b) step 40
(c) step 58
(d) step 78
Fig. 2: An example of ring evolution in a polygonal map for the MTP with point goals, small green disks represent goals and blue disks are nodes.
the TSP using centroids of the polygonal goals as point goals. However, the select winner and adapt phases are modified to find a more appropriate point of the polygonal goal and to avoid unnecessary movement into the goal. Therefore, a new point representing a polygonal goal is determined during the adaptation and used as a point goal, which leads to computation of a shortest path between two arbitrary points in W. Similarly to the node–goal queries an approximate node–point path is considered to decrease the computational burden. The approximation is also based on a convex partition of W and the shortest path over cells’ vertices (detailed description can be found in [6]).
A. Interior of the Goal
Probably the simplest approach (called goal interior here) can be based on the regular adaptation to the centroids of the polygonal goals. However, the adaptation, i.e., the node movement toward the centroid, is performed only if the node is not inside the polygonal goal. Determining whether a node is inside a polygonal goal with n vertices can be done in O(n) by computing the winding number, or in O(log n) in the case of a convex goal. So, in this strategy, the centroids are more like attraction points toward which nodes are attracted, because the adaptation process is terminated once all winner nodes are inside their particular polygonal goals. Then, the final path is constructed from the sequence of winner nodes using the approximate shortest node–node path. An example of solutions using the new termination condition and with the adaptation of winners already inside the goals avoided is shown in Fig. 3.
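The point-in-goal test needed by this rule is elementary; a minimal sketch of the O(n) variant for a convex goal, assuming the vertices are given in counter-clockwise order (the O(log n) variant would instead binary-search the fan of triangles around one vertex):

def inside_convex(point, vertices):
    # True if the point lies inside (or on the boundary of) the convex
    # polygon given by vertices in counter-clockwise order.
    px, py = point
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # The point must not lie strictly to the right of any directed edge.
        if (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1) < 0.0:
            return False
    return True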
(a) a found path for termination of the adaptation if all winners are inside the goals, L=84.3 m
(b) a found path with avoiding adaptation of winners inside the goal, L=65.0 m
Fig. 3: Examples of found paths without and with consideration of winners inside the goals. Goals are represented by yellow regions with small disks representing the centroids of the regions. Winner nodes are represented by small orange disks. The length of the found path is denoted as L, and the length of the path connecting the centroids is L_ref=85.9 m.
(a) an intersection point
(b) a found path, L=59.7 m
Fig. 4: Examples of an intersection point and a found path using the attraction algorithm variant.
B. Attraction Point
The strategy described above can be extended by determination of a new attraction point at the border of the polygonal goal. First, a winner node ν is found regarding its distance to the centroid c(g) of the goal g. Then, an intersection point p of g with the path S(ν, c(g)) is determined. The point p is used as the point goal to adapt the winner and its neighbouring nodes. This modification is denoted as attraction in the rest of this paper.
An example of a determined intersection point p and the final found path is shown in Fig. 4. The found path is about five meters shorter than a path found by avoiding adaptation of winner nodes inside the goals. Determination of the intersection point increases the computational burden, therefore an experimental evaluation of the proposed algorithm variants is presented in Section V.
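Finding the attraction point amounts to intersecting the polyline S(ν, c(g)) with the goal border; a minimal sketch of that geometric step (plain 2D segment intersection; helper names are not from the paper):

def segment_intersection(p, q, a, b):
    # Intersection point of segments p-q and a-b, or None if they do not cross.
    (px, py), (qx, qy), (ax, ay), (bx, by) = p, q, a, b
    rx, ry = qx - px, qy - py
    sx, sy = bx - ax, by - ay
    denom = rx * sy - ry * sx
    if abs(denom) < 1e-12:
        return None  # parallel or collinear segments are ignored in this sketch
    t = ((ax - px) * sy - (ay - py) * sx) / denom
    u = ((ax - px) * ry - (ay - py) * rx) / denom
    if 0.0 <= t <= 1.0 and 0.0 <= u <= 1.0:
        return (px + t * rx, py + t * ry)
    return None

def attraction_point(path, goal_vertices):
    # First crossing of the path (winner node -> centroid) with the goal border;
    # falls back to the path end (the centroid) if no crossing is found.
    for p, q in zip(path[:-1], path[1:]):
        for i in range(len(goal_vertices)):
            hit = segment_intersection(p, q, goal_vertices[i],
                                       goal_vertices[(i + 1) % len(goal_vertices)])
            if hit is not None:
                return hit
    return path[-1]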
C. Selection of Alternate Goal Point
A polygonal goal can be visited using any point of its border. The closest point of the goal border to a node can be determined in the winner selection phase. To find such a point, straight line segments forming the goal are considered instead of the goal centroid. Moreover, a goal can be attached to the map, and therefore, only segments laying inside the free space of W are used. Let S_g = {s_1, s_2, ..., s_k} be the border segments of the polygonal goal g that are entirely inside W. Then, the winner node ν∗ is selected from the set of non-inhibited nodes regarding the shortest path S(ν, s) from a point ν to the segment s, s ∈ S_g. Beside the winner node, a point p at the border of g is found in the winner selection procedure as a result of the determination of S(ν, s). The border point p is then used as an alternate point goal for adaptation; therefore this modification is denoted as alternate goal.
(a) an alternate goal point
(b) a found path, L=56.9 m
Fig. 5: An example of the alternate goal point and the final found path. Red straight line segments around the goal regions denote parts of the goal border inside the free space of W.
Determination of the exact shortest point–segment path can be too computationally demanding, therefore the following approximation is proposed. First, the Euclidean distance between the node ν and the segment s is determined. If the distance is smaller than the distance of the current winner node candidate, then the resulting point p of s is used to determine an approximate path among obstacles between p and ν. If |S(p, ν)| is shorter than the path length of the current winner node candidate to its border point, ν becomes the new winner candidate and p is the current alternate goal (border) point. Even though this modification is similar to the modification described in Section IV-B, it provides sampling of the goal boundary with a smaller distance of the goal point to the winner node; thus, a shorter final path can be found. An example of a found alternate goal point and the found path is shown in Fig. 5.
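A sketch of this winner selection with the Euclidean pre-selection, assuming the same path_length(a, b) helper as before (the Euclidean point-to-segment distance is a lower bound on the path length, so a node whose bound already exceeds the current best cannot win):

import math

def closest_point_on_segment(p, a, b):
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    denom = dx * dx + dy * dy
    t = 0.0 if denom == 0.0 else max(0.0, min(1.0, ((p[0] - ax) * dx + (p[1] - ay) * dy) / denom))
    return (ax + t * dx, ay + t * dy)

def select_winner(nodes, inhibited, segments, path_length):
    # Returns (index of the winner node, alternate goal point on the border).
    best_len, best_node, best_point = math.inf, None, None
    for i, nu in enumerate(nodes):
        if i in inhibited:
            continue
        for a, b in segments:             # border segments inside free space
            p = closest_point_on_segment(nu, a, b)
            euclid = math.hypot(nu[0] - p[0], nu[1] - p[1])
            if euclid >= best_len:
                continue                  # Euclidean pre-selection: cannot improve
            d = path_length(p, nu)        # approximate path among obstacles
            if d < best_len:
                best_len, best_node, best_point = d, i, p
    return best_node, best_point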
V. EXPERIMENTAL RESULTS
The proposed adaptation rules of Section IV have been experimentally verified in a set of problems. Due to the lack of commonly available multi-goal path planning problems with polygonal goals, several problems have been created within maps of real and artificial environments. An overview of the basic properties of the environments is shown in Table I.
TABLE I: Properties of environments and their polygonal representation
Map      | Dimensions [m × m] | No. vertices | No. holes | No. convex polygons
jh       | 20.6 × 23.2        | 196          | 9         | 77
pb       | 133.3 × 104.8      | 89           | 3         | 41
h2       | 84.9 × 49.7        | 1061         | 34        | 476
dense    | 21.0 × 21.5        | 288          | 32        | 150
potholes | 20.0 × 20.0        | 153          | 23        | 75
The last column shows the number of convex polygons of the supporting convex polygon partition utilized in the approximation of the shortest path. The partition is found by Seidel's algorithm [15]. Maps jh, pb, and h2 represent real environments (building plans), and maps dense and potholes are artificial environments with many obstacles. Sets of polygonal goals have been placed within the maps in order to create representative multi-goal path planning problems. The name of a problem is derived from the name of the map, the considered visibility range d in meters written as a subscript, and a particular problem variant, i.e., the problem name is of the form map_d-variant. The value of d restricts the size of the convex polygonal goal, i.e., all vertices are in mutual distance less than d. An unrestricted visibility range is considered in problems without the subscript. The three proposed variants of the SOM based algorithm for the MTP with polygonal goals have been experimentally evaluated within the set of problems. The algorithms are randomized, therefore twenty solutions of each problem have been found by each algorithm variant. The average length of the path L, the minimal found path length L_min, and the standard deviation in percent of L (denoted as s_L%) are used as the quality metrics. All presented length values are in meters. The experimental results are shown in Table II, where n is the number of goals. The best found solutions are shown in Fig. 6. From the visualized solutions, one can assume that high quality solutions are found for all problems. The required computational time is presented in the column T. All algorithms have been implemented in C++, compiled by G++ version 4.2 with -O2 optimization, and executed within the same computational environment using a single core of a 2 GHz CPU. Therefore, the presented required computational times can be directly compared. The time includes determination of all shortest paths between map vertices (used in the path approximation) and the adaptation time. The supporting convex partition and the complete visibility graph are found in tens of milliseconds, and therefore, these times are negligible regarding the time of the adaptation procedure and determination of the shortest paths.
Discussion
The presented results provide performance overview of the proposed adaptation rules. The principle of the attraction and alternate goal algorithm variants are very similar; however, the alternate goal variant provides better results. The advantage of the alternate goal is sampling of goals’ borders. Even though a simple approximation of the shortest path between
TABLE II: Experimental results

                |     | goal interior                | attraction                   | alternate goal
Problem         | n   | L [m]  s_L%  L_min   T [s]   | L [m]  s_L%  L_min   T [s]   | L [m]  s_L%  L_min   T [s]
dense-small     | 35  | 119.5  3.21  113.51  0.85    | 114.4  3.47  108.64  0.96    | 111.8  3.09  106.11  1.20
dense_5-A       | 9   | 68.2   1.75  66.21   0.41    | 62.4   2.19  60.60   0.43    | 58.7   1.75  58.06   0.44
h2_5-A          | 26  | 425.3  1.42  416.22  3.56    | 405.9  1.25  399.62  3.54    | 402.1  0.97  396.46  3.75
jh-rooms        | 21  | 103.6  1.49  101.37  0.28    | 88.2   0.22  87.83   0.33    | 88.1   0.26  87.79   0.34
jh_10-doors     | 21  | 71.5   3.24  67.48   0.47    | 68.0   1.52  66.11   0.46    | 62.2   0.25  62.06   0.57
jh_10-coverage  | 106 | 134.2  2.52  128.31  3.64    | 106.6  1.59  101.19  4.19    | 93.8   0.47  93.12   6.22
jh_4-A          | 16  | 66.6   2.97  64.00   0.44    | 61.5   2.74  59.10   0.45    | 57.3   1.02  56.80   0.51
jh_5-corridors  | 11  | 69.6   2.05  66.96   0.38    | 66.0   1.23  64.83   0.39    | 60.0   0.56  59.60   0.41
pb_5-A          | 7   | 277.8  3.21  268.61  0.84    | 273.8  4.55  265.56  0.88    | 270.1  2.66  264.91  0.86
potholes_2-A    | 13  | 73.9   2.09  72.01   0.13    | 72.0   1.95  70.42   0.14    | 72.0   1.78  70.32   0.16
Fig. 6: The best found solutions: (a) jh_4-A, L_best=56.6 m; (b) jh_5-corridors, L_best=59.6 m; (c) jh_10-doors, L_best=62.1 m; (d) jh-rooms, L_best=87.8 m; (e) jh_10-coverage, L_best=93.1 m; (f) h2_5-A, L_best=395.6 m; (g) pb_5-A, L_best=264.7 m; (h) dense-small, L_best=102.8 m; (i) dense_5-A, L_best=58.0 m; (j) potholes_2-A, L_best=68.7 m.
a node (point) and a goal's segment is used, the precision of the approximation increases with node movements toward the goal, and therefore, a better point of the goal is sampled. This is an important benefit of the SOM adaptation, which allows usage of a relatively rough approximation of the shortest path.
On the other hand, the attraction algorithm variant is more straightforward, as the path to the centroid is utilized as a path to a fixed point goal. The fixed point goals allow the use of pre-computed shortest paths from map vertices to the goals, which improves the precision of the approximate node–goal path.
Such an approximation is less computationally intensive at the cost of higher memory requirements. However, this benefit is not evident from the results, because the alternate goal variant provides a faster convergence of the network. Although convex goals are assumed in the problem formulation, the presented adaptation rules do not depend on the goal convexity. The convex goals are advantageous in visual inspection tasks (covering tasks), because the whole goal region is inspected by visiting the goal at any point of the goal. Also, a point representative of the convex goal can be simply computed as the centroid. If a goal is not convex, a point that is inside the goal has to be determined for the goal interior and attraction algorithms. Basically any point inside the goal can be used, but a bias toward the point can be expected. The alternate goal algorithm variant uses a set of segments representing the goal, and therefore, this algorithm can be directly used for problems with non-convex goals (see Fig. 7).
(a) the jh environment
(b) potholes environment
Fig. 7: Found solutions of problems with non-convex goals by the alternate goal algorithm variant.
Regarding the problems with disjoint convex goals that are relatively far from each other, a sequence of goals' visits can be found as a solution of the TSP for the centroids of the goals. Then, having the sequence of polygonal goals, the problem of finding the path connecting them can be formulated as the touring polygons problem (TPP) if start and end points are given. A polynomial algorithm exists for the TPP with convex goals lying inside a simple polygon [4]; however, the TPP is NP-hard in general. Therefore, the proposed alternate goal algorithm seems to be more practical due to its flexibility.
VI. CONCLUSION
A self-organizing map based algorithm for the multi-goal path planning problem in the polygonal domain has been presented. Three variants of the algorithm addressing polygonal goals have been proposed and experimentally evaluated on a set of problems including an instance of the WRP with restricted visibility range (jh_10-coverage). Even though the solution quality is not guaranteed because of SOM, regarding the experimental results the algorithms provide high quality solutions. The advantage of the proposed alternate goal algorithm is that it provides a flexible approach to solve various routing problems including the TSP, WRP, safari route problems, and their variants in the polygonal domain. From the practical point of view, the proposed SOM algorithm is based on relatively simple algorithms and supporting structures, which is an additional benefit.
SOM is not a typical technique used for routing problems motivated by robotic applications. The presented results demonstrate the flexibility of SOM based algorithms; thus, they may encourage roboticists to consider SOM as a suitable planning technique for other multi-goal path planning problems.
ACKNOWLEDGMENT
This work has been supported by the Grant Agency of the Czech Technical University in Prague under grant No. SGS10/185, by the Ministry of Education of the Czech Repub- lic under Projects No. 1M0567 and No. LH11053 and partially by Project No. 7E08006, and by the EU Project No. 216342.
REFERENCES
[1] Sergei Bespamyatnikh. An o(n log n) algorithm for the Zoo-keeper’s problem. Computational Geometry: Theory and Applications, 24(2):63– 74, 2002. [2] Wei-Pang Chin and Simeon Ntafos. Optimum watchman routes. In SCG ’86: Proceedings of the second annual symposium on Computational geometry, pages 24–33, Yorktown Heights, New York, United States,
1986. ACM Press.
[3] Mark de Berg, Joachim Gudmundsson, Matthew J. Katz, Christos Levcopoulos, Mark H. Overmars, and A. Frank van der Stappen. Tsp with neighborhoods of varying size. Journal of Algorithms, 57(1):22–36,
2005.
[4] Moshe Dror, Alon Efrat, Anna Lubiw, and Joseph S. B. Mitchell. Touring a sequence of polygons. In STOC ’03: Proceedings of the thirty- fifth annual ACM symposium on Theory of computing, pages 473–482, San Diego, CA, USA, 2003. ACM Press. [5] Adrian Dumitrescu and Joseph S. B. Mitchell. Approximation algo- rithms for tsp with neighborhoods in the plane. Journal of Algorithms, 48(1):135 – 159, 2003. [6] Jan Faigl. Approximate Solution of the Multiple Watchman Routes Problem with Restricted Visibility Range. IEEE Transactions on Neural Networks, 21(10):1668–1679, 2010. [7] Jan Faigl. On the performance of self-organizing maps for the non- euclidean traveling salesman problem in the polygonal domain. Infor- mation Sciences, 181(19):4214–4229, 2011. [8] Jan Faigl, Miroslav Kulich, and Libor Pˇreucil.ˇ A sensor placement algorithm for a mobile robot inspection planning. Journal of Intelligent & Robotic Systems, 62(3-4):329–353, 2011. [9] Jan Faigl, Miroslav Kulich, Vojtechˇ Vonasek,´ and Libor Pˇreucil.ˇ An Application of Self-Organizing Map in the non-Euclidean Traveling Salesman Problem. Neurocomputing, 74:671–679, 2011. [10] Jacob E. Goodman and Joseph O’Rourke, editors. Handbook of Discrete and Computational Geometry. CRC Press LLC, Boca Raton, FL, 2004.
[11] Fajie Li and Reinhard Klette. An approximate algorithm for solving the watchman route problem. In RobVis'08: Proceedings of the 2nd international conference on Robot vision, pages 189–206, Berlin, Heidelberg, 2008. Springer-Verlag.
[12] Simeon Ntafos. Watchman routes under limited visibility. Computational Geometry: Theory and Applications, 1(3):149–170, 1992.
[13] Shmuel Safra and Oded Schwartz. On the complexity of approximating TSP with neighborhoods and related problems. Comput. Complex., 14(4):281–307, 2006.
[14] Mitul Saha, Tim Roughgarden, Jean-Claude Latombe, and Gildardo Sánchez-Ante. Planning Tours of Robotic Arms among Partitioned Goals. Int. J. Rob. Res., 25(3):207–223, 2006.
[15] Raimund Seidel. A simple and fast incremental randomized algorithm for computing trapezoidal decompositions and for triangulating polygons. Comput. Geom. Theory Appl., 1(1):51–64, 1991.
[16] Steven N. Spitz and Aristides A. G. Requicha. Multiple-Goals Path Planning for Coordinate Measuring Machines. In ICRA, pages 2322–2327, 2000.
[17] Xuehou Tan and Tomio Hirata. Finding shortest safari routes in simple polygons. Information Processing Letters, 87(4):179–186, 2003.
Real-Time 3D Perception and Efficient Grasp Planning for Everyday Manipulation Tasks
Joerg Stueckler, Ricarda Steffens, Dirk Holz, and Sven Behnke Autonomous Intelligent Systems Group, University of Bonn, 53113 Bonn, Germany
Abstract — In this paper, we describe efficient methods for solving everyday mobile manipulation tasks that require object pick-up. In order to achieve real-time performance in complex environments, we focus our approach on fast yet robust solutions. For 3D perception of objects on planar surfaces, we develop scene segmentation methods that process Microsoft Kinect depth images in real-time at high frame rates. We efficiently plan feasible, collision-free grasps on the segmented objects directly from the perceived point clouds to achieve fast execution times. We evaluate our approaches quantitatively in lab experiments and also report on the successful integration of our methods in public demonstrations at RoboCup German Open 2011 and RoboCup 2011 in Istanbul, Turkey.
Index Terms— Scene Segmentation, Grasp Planning, Mobile Manipulation
I. INTRODUCTION
Mobile manipulation tasks in domestic environments require a vast set of perception and action capabilities. The robot not only requires localization, mapping, path planning, and obstacle avoidance abilities to safely navigate through the environment; it also needs to integrate object detection, recognition, and manipulation. A typical requirement for a service robot is not just to achieve the task, but to perform it in reasonable time. While much research has been invested into the general solution of complex perception and motion planning problems, little work has focused on methods that solve the tasks efficiently in order to allow for continuous task execution without interruptions. In this paper, we present fast methods to flexibly grasp objects from planar surfaces. To achieve fast performance, we combine real-time object perception with efficient grasp planning and motion control. For real-time perception, we combine rapid normal estimation using integral images with efficient segmentation techniques. We segment the scene into the support plane of interest and the objects thereon. Our perception algorithm processes depth images of a Microsoft Kinect in real-time at a frame rate of approx. 16 Hz. From the raw object point clouds our grasp planning method derives feasible, collision-free grasps within about 100 milliseconds. We consider grasps on objects from either the side or from above. The planned grasps are then executed using parametrized motion primitives. We integrate our approaches into a system that we publicly evaluate at RoboCup competitions. We further conduct experiments in our lab to demonstrate the robustness and efficiency of our approaches. This paper is organized as follows: after a brief system overview in Sec. III, we detail our approaches to real-time
Fig. 1. Top: Cosero grasps a spoon and pours milk into a bowl of cereals. A 5 min video showing the complete demonstration at RoboCup GermanOpen 2011 is available at http://nimbro.net.
3D perception and efficient grasp planning in Sec. IV and Sec. V, respectively. In Sec. VI, we evaluate our approaches and report on successful public demonstrations at RoboCup GermanOpen 2011 and at RoboCup 2011 in Istanbul, Turkey.
II. RELATED WORK
Many research groups currently develop systems for mobile manipulation in everyday environments. A very prominent example is the Personal Robot 2 (PR2) developed by Willow Garage [5]. It is equipped with two 7 DoF compliant arms and a parallel gripper with touch sensor matrices on the gripper tips. Similar to our approach, they derive feasible, collision- free grasps from the raw object point cloud [3]. They select the best-ranked grasp and plan a collision-free motion for the arm taking into account obstacles that are perceived by the robot’s 3D sensors. While the authors demonstrate that the approach can robustly grasp a variety of objects in a wide range of configurations, the execution speed of the system for perception and grasping is still far slower than human performance. Further systems that perform object manipulation in clut- tered environments have been reported by Srinivasa et al. [8, 7]. In [8], the authors present a robotic busboy system in which a mobile tray delivers mugs to a statically mounted manipulator. The manipulator grasps the mugs and loads them into a dishwasher rack. A real-time vision system that is designed for the mugs estimates the pose of the mugs on the tray. Since the objects are known, valid grasps on the
mug are precomputed. The grasp planner then selects the best feasible grasp online based on several criteria such as reachability and collision avoidance. The authors report a total duration of 51 sec on average for executing a grasp and releasing the mug in the dishrack. With the robot HERB [7], the vision system has been extended to more general object recognition and motion planning. While object recognition is aborted after 1 sec, the planning of motions is reported to take several seconds. Our approach is not restricted to recognizable objects. Jain and Kemp develop EL-E [4], a mobile manipulator intended to assist the impaired. EL-E consists of a Katana manipulator on a vertical linear actuator mounted on an Erratic differential drive. While we extract object information in real-time from a depth image sensor, they segment measurements of a 3D laser using connected components labelling to find object clusters above table height. Similar to our approach, they perform top grasps along the object's principal axis. However, side grasps are not considered. If an object is too high or too wide to fit into the gripper, they also consider overhead grasps on the top-most points of the object. To ensure that the grasping motion is not in collision, a cuboid volume from the manipulator base to the object is checked for obstacles. Morales et al. [11] propose a system that selects feasible, collision-free grasps on objects from a database. They determine the set of feasible grasps on the object from its CAD model in an offline phase. After the object has been recognized and localized with a stereo vision system, a grasp simulation framework (GraspIt! [6]) is used to select a collision-free grasp among the potential grasps on the object. The authors report 5 ms computation time for the recognition of objects in a database of 5 objects. The time for planning collision-free, feasible grasps in GraspIt! is reported to range from seconds to several minutes in [6].
III. SYSTEM OVERVIEW
A. Design of Cognitive Service Robot Cosero
Domestic environments are designed for the specific capabilities of the human body. It is therefore natural to endow the robot with an anthropomorphic upper body for similar manipulation abilities. Furthermore, the actions of the robot become predictable and interpretable when they are performed in a human-like way. In such environments, robots also have to interact closely with humans. By its lightweight design, Cosero is inherently less dangerous than a heavy-weight industrial-grade robot. Finally, the robot should also possess natural sensing capabilities, e.g., vision and audio, since humans design their environments to be salient and distinguishable in these perception channels. We focused the design of Cosero on such typical requirements of household settings. We equipped Cosero with an omnidirectional drive to maneuver in the narrow passages found in household environments. Its two anthropomorphic arms resemble average human body proportions and reaching capabilities. A yaw joint in the torso enlarges the workspace of the arms. In order to compensate for the missing torso pitch joint and legs, a linear actuator in the trunk can move the upper body vertically by approx. 0.9 m. This allows the robot to manipulate at heights similar to humans.
Cosero has been constructed from light-weight aluminum parts. All joints in the robot are driven by Robotis Dynamixel actuators. These design choices allow for a light-weight and inexpensive construction, compared to other domestic service robots. While each arm has a maximum payload of 1.5 kg and the drive has a maximum speed of 0.6 m/sec, the low weight (in total ca. 32 kg) requires only moderate actuator power. Compared to its predecessor Dynamaid [9], we increased payload and precision of the robot by stronger actuation. Cosero perceives its environment with a variety of comple- mentary sensors. The robot senses the environment in 3D with a Microsoft Kinect RGB-D camera in its head that is attached to the torso with a pan-tilt unit in the neck. To improve the robustness of manipulation, the robot can measure the distance to obstacles directly from the grippers. We attached infrared distance sensors to each gripper that point downward and forward in the finger tips. Another sensor is placed in the palm and measures distance to objects within the gripper.
B. Mobile Manipulation in Everyday Environments
We develop Cosero to perform a variety of mobile manipulation tasks in everyday environments. For mobile manipulation, we combine safe navigation of the robot through the environment with motion control methods for the upper body.
1) Motion Control: We developed omnidirectional driving for Cosero's eight-wheeled mobile base [9]. The linear and angular velocity of the drive can be set independently and can be changed continuously. We determine the steering direction and the individual wheel velocities of the four differential drives, which are located at the corners of the rectangular base, from an analytical solution to the drive's inverse kinematics. For the anthropomorphic arms, we implemented differential inverse kinematics with redundancy resolution [9]. We also developed compliance control for the arms [10]. For our method we exploit that the servo actuators are back-drivable and that the torque which the servo applies for position control can be limited. Compliance can be set separately for each direction in task or joint space. For example, the end-effector can be kept loose in both lateral directions while it keeps the other directions at their targets. With these methods Cosero can perform a variety of parameterizable motions like grasping, placing objects, and pouring out containers.
2) Mobile Manipulation: In typical everyday mobile manipulation scenarios, the workspace of a statically mounted manipulator is too small. One possible solution to achieve a larger workspace is to construct robots with a restricted manipulator workspace but to extend it with a mobile base. Since the robot is not statically mounted to the environment, it has to estimate its pose with respect to static parts of the environment like walls, as well as to dynamic objects and persons. We propose a coarse-to-fine strategy to align the robot to the objects involved in mobile manipulation. For example, when the robot grasps an object from a table, it first approaches the table roughly within the reference frame of the static map. Then, it adjusts its height and distance to the table. Finally, it aligns itself to bring the object into the workspace of its arms. Cosero grasps objects on horizontal surfaces like tables and shelves in a height range from ca. 0.3 m to 1 m [9]. It
Fig. 2. (a) Example table-top setting. (b) Raw point cloud from the Kinect with RGB information. (c) Each detected object is marked with a random color.
also carries the object, and hands it to human users. We also developed solutions to pour out containers, to place objects on horizontal surfaces, to dispose of objects in containers, to grasp objects from the floor, and to receive objects from users. When handing an object over, the arms are compliant in the upward direction so that when the human pulls the object, the arm complies and the object is released. For receiving an object from a person, we localize the handed object with the depth camera and drive towards it. As soon as the object is reachable with the arms, the robot grasps it.
IV. REAL-TIME 3D PERCEPTION
A typical task for a domestic service robot is to fetch and deliver objects. This involves detecting objects in the robot’s workspace and recognizing them. In household environments, objects are usually located on planar surfaces such as tables. In [9] we proposed to use the laser scanner in the lower torso to detect such objects. This approach, however, is not able to perceive valuable 3D information like the objects’ height or principal axes. We therefore developed real-time 3D scene segmentation with the RGB-D camera. In order to identify objects, we extract texture and color information.
A. Real-Time 3D Scene Segmentation
Our approach to object detection processes images of a depth camera such as the Microsoft Kinect at frame rates of approx. 16 Hz. This enables our system to extract information about the objects in a scene with very low latency for further decision-making and planning stages. We base our approach on fast planar segmentation of the scene. We achieve the high computational speed of our approach by combining rapid normal estimation with efficient segmentation techniques. The basic idea of the normal estimation method is to determine local surface normals from the cross product of two tangents to the surface. For each pixel in the depth image, the tangents are estimated from local pixel neighbors. In the simplest case, both tangents could be calculated from just the horizontal and vertical neighbors, respectively. However, this approach would be highly prone to measurement noise. The tangent estimates should therefore be averaged in an image neighborhood. By using integral images, such averaging operations can be processed rapidly in constant time, independent of the neighborhood size. The overall runtime complexity of this approach is linear in the number of points for which normals are computed. One way to find all planes in a scene would be to extract planes from the local surface normals in a two-stage process [2]: first, one could find clusters in normal orientation, which are then separated by clustering in the plane distance to the origin. Since we assume here that the robot as well as the objects in the environment are typically standing on horizontal surfaces, we instead focus our method on local surface normals close to the vertical direction. On all points with such normal orientation and within a 3D region of interest, we apply efficient RANSAC [1] to determine the horizontal plane. Then, we find the points above the plane and extract object clusters whose projections lie within the convex hull of the support plane. For clustering we assume that there is a small space between the objects. The size of this space has to be chosen carefully, since due to occlusions, parts of an object may be disconnected. Fig. 2 shows a typical segmentation result for a table-top scene. A multi-object tracker is constantly updated with the detected objects.
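For readers who want to reproduce this kind of pipeline, the following is a minimal sketch using the Point Cloud Library (PCL), which offers integral-image normal estimation, RANSAC plane fitting, and Euclidean clustering. The parameter values and the assumption that the cloud has already been transformed into a frame with a vertical z-axis are illustrative choices; this is not the authors' exact implementation.

```cpp
// Minimal table-top segmentation sketch with PCL (illustrative parameters).
// Assumes an organized Kinect cloud transformed into a frame whose z-axis is vertical.
#include <cmath>
#include <vector>
#include <pcl/point_types.h>
#include <pcl/features/integral_image_normal.h>
#include <pcl/segmentation/sac_segmentation.h>
#include <pcl/segmentation/extract_clusters.h>
#include <pcl/search/kdtree.h>

void segmentTableTop(const pcl::PointCloud<pcl::PointXYZ>::ConstPtr& cloud)
{
  // 1) Fast surface normals from integral images (constant-time averaging per pixel).
  pcl::PointCloud<pcl::Normal>::Ptr normals(new pcl::PointCloud<pcl::Normal>);
  pcl::IntegralImageNormalEstimation<pcl::PointXYZ, pcl::Normal> ne;
  ne.setNormalEstimationMethod(ne.AVERAGE_3D_GRADIENT);
  ne.setMaxDepthChangeFactor(0.02f);  // depth discontinuity threshold
  ne.setNormalSmoothingSize(10.0f);   // averaging neighborhood in pixels
  ne.setInputCloud(cloud);
  ne.compute(*normals);

  // 2) Keep points with near-vertical normals, then fit the support plane with RANSAC.
  pcl::PointIndices::Ptr vertical(new pcl::PointIndices);
  for (std::size_t i = 0; i < normals->size(); ++i)
    if (std::fabs((*normals)[i].normal_z) > 0.9f)
      vertical->indices.push_back(static_cast<int>(i));

  pcl::ModelCoefficients plane_coeffs;
  pcl::PointIndices::Ptr plane_inliers(new pcl::PointIndices);
  pcl::SACSegmentation<pcl::PointXYZ> seg;
  seg.setModelType(pcl::SACMODEL_PLANE);
  seg.setMethodType(pcl::SAC_RANSAC);
  seg.setDistanceThreshold(0.02);
  seg.setInputCloud(cloud);
  seg.setIndices(vertical);
  seg.segment(*plane_inliers, plane_coeffs);

  // 3) Cluster the remaining points into objects, assuming a small gap between them.
  //    (Selecting only the points above the plane is omitted here for brevity.)
  pcl::search::KdTree<pcl::PointXYZ>::Ptr tree(new pcl::search::KdTree<pcl::PointXYZ>);
  std::vector<pcl::PointIndices> objects;
  pcl::EuclideanClusterExtraction<pcl::PointXYZ> ec;
  ec.setClusterTolerance(0.03);  // 3 cm gap separates neighboring objects
  ec.setMinClusterSize(50);
  ec.setSearchMethod(tree);
  ec.setInputCloud(cloud);
  ec.extract(objects);
}
```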
V. EFFICIENT GRASP PLANNING
Objects in everyday manipulation scenarios are highly variable in shape and appearance. Furthermore, the configuration of objects and obstacles in a scene is strongly unstructured. It is therefore challenging to develop a grasp planning method that can cope with any encountered situation. Our approach is specifically suited for rigid objects whose shape is symmetric along the principal axes of the object. We also assume that the center of gravity roughly coincides with the center of the object. While many objects meet these assumptions, our approach can also yield robust grasps for objects that violate the constraints. We developed flexible grasping motions to grasp objects from the side or from above. When the robot encounters a new situation, it plans and executes a feasible collision-free grasp on the object of interest. The robot perceives the scene with its depth camera. It interprets the raw point representation of the objects on the grasp surface, which is provided by our real-time 3D perception method (see Sec. IV-A).
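The pose and shape properties shown on the left of Fig. 3 can be derived directly from a segmented object cluster, for example via a principal component analysis of the points. The sketch below (using Eigen; the variable names and the rough one-sigma extent estimate are our illustrative choices, not the authors' code) computes the centroid and principal axes with which side and top grasps can be aligned.

```cpp
// Sketch: centroid and principal axes of an object point cluster (Eigen).
#include <Eigen/Dense>
#include <vector>

struct ObjectPose {
  Eigen::Vector3f centroid;
  Eigen::Matrix3f axes;     // columns: principal axes, sorted minor to major
  Eigen::Vector3f extents;  // rough spread of the points along each axis
};

ObjectPose estimatePose(const std::vector<Eigen::Vector3f>& points)
{
  ObjectPose pose;
  pose.centroid = Eigen::Vector3f::Zero();
  for (const auto& p : points) pose.centroid += p;
  pose.centroid /= static_cast<float>(points.size());

  // Covariance of the mean-free points.
  Eigen::Matrix3f cov = Eigen::Matrix3f::Zero();
  for (const auto& p : points) {
    const Eigen::Vector3f d = p - pose.centroid;
    cov += d * d.transpose();
  }
  cov /= static_cast<float>(points.size());

  // Eigenvectors of the covariance are the principal axes of the cluster.
  Eigen::SelfAdjointEigenSolver<Eigen::Matrix3f> solver(cov);
  pose.axes = solver.eigenvectors();                        // ascending eigenvalues
  pose.extents = 2.0f * solver.eigenvalues().cwiseSqrt();   // ~1-sigma size estimate;
  // a tight bounding box would instead use min/max projections onto the axes.
  return pose;
}
```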
Fig. 3. Left: We extract object pose and shape properties from the object points. The arrows mark the bounding box of the objects by the principal axes. Right: We rank feasible, collision-free grasps (red, size prop. to scores) and select the most appropriate one (larger RGB-coded coordinate frame).
TABLE I
COMPUTATION TIME OF INDIVIDUAL PROCESSING STAGES.

processing stage    | mean (std) in msec
--------------------|-------------------
normal estimation   | 7.2  (2.4)
scene segmentation  | 11.9 (1.4)
object clustering   | 41.6 (1.5)
grasp planning      | 98.1 (9.1)
of the arm’s workspace.
• Collisions. We check for collisions during the reaching and grasping motion. Fig. 3 shows an example of grasps that satisfy our criteria. One possible solution for collision checking would be to search for collisions of all robot limbs during the complete trajectory of the grasping motion. However, we propose to use simple geometric constraints to find all possible collisions (s. Fig. 4). While our method is more conservative, we can find collisions with little computational effort. We first project all points on obstacles into the horizontal plane. In order to avoid collisions of the upper arm, we search for collisions within a circle around the shoulder with a radius equal to the upper arm length. We further require that the gripper and the forearm can move towards the object by checking a cone with opening angle and forearm length behind the grasping pose. We extend the cone towards the robot's center position to cover the area swept during the reaching motion. Finally, we search for collisions within a small circle at the final grasp position. The radius of this circle is set to the maximum diameter of the open gripper. 3) Ranking of Grasps: We rank the feasible and collision-free grasps according to several criteria such as
• Distance to object center. We favor grasps with a smaller distance to the object center.
• Grasp width. We reward grasp widths closer to a preferred width (0.08 m).
• Grasp orientation. Preference is given to grasps with a smaller angle between the line towards the shoulder and the grasping direction.
• Distance from robot. We prefer grasps with a smaller distance to the shoulder.
Fig. 3 illustrates this process with example rankings. From the ranked grasps, we find the best top- and side-grasps and select the most appropriate one. This decision depends on the relation of the object height to the largest extent of the object in the horizontal plane. We integrate a small bias towards the faster side grasps.
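As a rough illustration of such a weighted ranking (the weights are arbitrary illustrative values; only the preferred grasp width of 0.08 m is taken from the text, and the structure is a sketch rather than the implementation used on Cosero):

```cpp
// Sketch: weighted ranking of feasible, collision-free grasp candidates.
#include <algorithm>
#include <cmath>
#include <vector>

struct Grasp {
  double distToObjectCenter;  // [m]
  double width;               // gripper opening at the grasp [m]
  double angleToShoulder;     // angle between grasp direction and line to shoulder [rad]
  double distToShoulder;      // [m]
  double score = 0.0;
};

void rankGrasps(std::vector<Grasp>& grasps)
{
  const double preferredWidth = 0.08;  // preferred grasp width from the text
  for (auto& g : grasps) {
    // Each criterion is a penalty; smaller penalties yield a higher score.
    g.score = -1.0 * g.distToObjectCenter
              -2.0 * std::abs(g.width - preferredWidth)
              -0.5 * g.angleToShoulder
              -0.5 * g.distToShoulder;
  }
  // Best-ranked grasp first.
  std::sort(grasps.begin(), grasps.end(),
            [](const Grasp& a, const Grasp& b) { return a.score > b.score; });
}
```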
VI. EXPERIMENTS
A. Quantitative Results
1) Run-Time Efficiency: In Table I, we summarize aver- age run-times of several stages of our perception and grasp planning pipeline in the scene shown in Fig. 2. For a depth image resolution of 160×120, our table-top segmentation approach achieves an average frame rate of approx. 16 Hz. The experiments have been carried out on an HP Pavilion dv6 notebook with an Intel Core i7 Q720 processor. Using
Fig. 5. Example grasps, segmentations, and grasp planning results for each of the 8 household objects used in the experiments.
the integral image approach, normals can be estimated rapidly for the 19200 image pixels within only 7.2 msec on average. The segmentation of the scene into vertical points, applying RANSAC to find the support plane, and determining the points above the support plane requires 11.9 msec (avg.). The clustering of the points into objects takes 41.6 msec (avg.); the computation time of this step depends on the number of objects in the scene. Our approach to grasp planning requires computation time of the same magnitude as the segmentation, i.e., 98.1 msec (avg.). The timings demonstrate that our approaches are very efficient and yield results in short computation times. We also measured the time for the complete object pick-up process. The robot has already approached the table. It perceives the objects on the table and plans a grasp on the closest object in front. It executes the grasp and moves the gripper back to its starting pose. The overall process takes approx. 15 sec for a side-grasp and 25 sec for a top-grasp. 2) Robustness: We evaluate the robustness of our perception and grasp planning pipeline in a series of experiments. We chose 8 typical household objects and executed 10 grasps with the right arm that cover the range of feasible object orientations for this arm. Fig. 5 shows an example grasp for each object. Table II summarizes the results of the experiment. The robot could grasp most objects very reliably. For the tissues, it sometimes chose a top-grasp along the shorter side of the object; in one situation it missed the object with this grasp. Our approach also estimates the tea box to be high enough to be grasped from the side in some configurations. Despite the fact that the cloth strongly violates our assumptions that objects are rigid and shaped symmetrically along their principal axes, our method succeeds robustly for this object.
B. Public Demonstration
While one can assess the quality of individual system components in the laboratory, it is difficult to compare robot
TABLE II
SUCCESS RATES (SUCCESS / TRIALS) WHEN GRASPING OBJECTS 10 TIMES IN RANDOM ORIENTATIONS.

object        | side-grasp | top-grasp
--------------|------------|----------
filter box    | 10 / 10    | 0 / 0
tea box       | 1 / 1      | 9 / 9
banana        | 0 / 0      | 10 / 10
cup           | 10 / 10    | 0 / 0
chewing gums  | 0 / 0      | 10 / 10
tissues       | 0 / 0      | 9 / 10
cloth         | 3 / 3      | 7 / 7
pen           | 0 / 0      | 10 / 10
systems with others. In recent years, competitions such as the DARPA Grand and Urban Challenges and RoboCup have played an important role in assessing the performance of robot systems. The international RoboCup competitions include the @Home league for domestic service robots. In this competition, the robots have to perform tasks defined by the rules of the competition, in a given environment at a predetermined time. In addition, there are open challenges and the final demonstration, where the teams can highlight the capabilities of their robots in self-defined tasks. The simultaneous presence of multiple teams allows for a direct comparison of the systems by measuring objective performance criteria, and by subjective judgment of the scientific and technical merit by a jury.
With Cosero, we won the RoboCup GermanOpen 2011 competition. In the finals, Cosero and Dynamaid prepared breakfast within the 10 min demonstration slot. Dynamaid fetched orange juice out of the refrigerator, which it opened and closed successfully. It delivered the bottle to the breakfast table. In the meantime, Cosero grasped a bottle of milk, opened the bottle, and poured the milk into a cereal bowl. Cosero disposed of the empty bottle in the trash bin. It then moved to another table and successfully grasped, with a top-grasp, a spoon that a jury member had placed in an arbitrary orientation. Cosero put the spoon next to the cereal bowl and finally waited for an instruction to leave the room. Another jury member pointed towards one of two exit doors using a pointing gesture. Cosero successfully recognized the pointing gesture and left the room through the correct door. The jury awarded us the highest score for the finals. We also won the @Home competition at RoboCup 2011 in Istanbul, Turkey. Early in the competition, in the open challenge, Cosero demonstrated preparing cereals to a jury of team leaders of other teams. In the demo challenge, Cosero cleaned up the apartment by picking up laundry from the ground and putting it into the correct laundry basket. Beforehand, a human user could show, using gestures, in which basket to put colored and in which to put white laundry. Afterwards, Cosero picked up three objects from a table using the perception and grasping pipeline proposed in this paper. In the first attempt to pick up a carrot, it had to choose a grasp perpendicular to the carrot's principal axis and failed to keep grip on the object. However, in the second attempt, it picked up the carrot successfully along its principal axis. Finally, it grasped a tea box with a top-grasp. The objects had been placed randomly. We could convince the jury with this demonstration and achieved the highest score.¹
VII. CONCLUSION
In this paper, we proposed highly efficient means to perceive objects on planar surfaces and to plan feasible, collision-free grasps on the object of interest. We integrate our methods into a mobile manipulation system that robustly executes object pick-up in reasonable time without longer processing interruptions, i.e., any interruptions remain in the range of milliseconds to seconds. For object perception, we segment depth images of a Microsoft Kinect camera in real-time at a frame rate of approx. 16 Hz. We demonstrated that our perception and planning modules yield their results in a very short time. In the integrated system this allows for short and steady execution of the task. Our experiments demonstrate that our method is fast yet robust. In future work, we plan to integrate feedback from touch sensors into the grasp execution. By including top-grasps at high points on the objects we could further extend the range of graspable objects (bowl-like shapes, for instance). We could also improve our grasping pipeline through knowledge of how to grasp specific objects.
ACKNOWLEDGMENTS
This research has been partially funded by the FP7 ICT- 2007.2.2 project ECHORD (grant agreement 231143) experi- ment ActReMa.
REFERENCES
[1] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, 1981.
[2] D. Holz, S. Holzer, R. B. Rusu, and S. Behnke. Real-time plane segmentation using RGB-D cameras. In Proc. of the 15th RoboCup International Symposium, 2011.
[3] K. Hsiao, S. Chitta, M. Ciocarlie, and E. G. Jones. Contact-reactive grasping of objects with partial shape information. In Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2010.
[4] A. Jain and C. C. Kemp. EL-E: an assistive mobile manipulator that autonomously fetches objects from flat surfaces. Autonomous Robots, 28:45–64, January 2010.
[5] W. Meeussen, M. Wise, S. Glaser, S. Chitta, C. McGann, P. Mihelich, E. Marder-Eppstein, M. Muja, V. Eruhimov, T. Foote, J. Hsu, R. B. Rusu, B. Marthi, G. Bradski, K. Konolige, B. P. Gerkey, and E. Berger. Autonomous door opening and plugging in with a personal robot. In Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA), 2010.
[6] A. T. Miller, S. Knoop, H. I. Christensen, and P. K. Allen. Automatic grasp planning using shape primitives. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA), 2003.
[7] S. Srinivasa, D. Ferguson, C. Helfrich, D. Berenson, A. C. Romea, R. Diankov, G. Gallagher, G. Hollinger, J. Kuffner, and J. M. Vandeweghe. HERB: a home exploring robotic butler. Autonomous Robots, 28(1):5–20, January 2010.
[8] S. Srinivasa, D. Ferguson, J. M. Vandeweghe, R. Diankov, D. Berenson, C. Helfrich, and H. Strasdat. The robotic busboy: Steps towards developing a mobile robotic home assistant. In Proc. of the Int. Conf. on Intelligent Autonomous Systems (IAS), 2008.
[9] J. Stückler and S. Behnke. Integrating indoor mobility, object manipulation, and intuitive interaction for domestic service tasks. In Proc. of the 9th IEEE-RAS Int. Conf. on Humanoid Robots (Humanoids), 2009.
[10] J. Stückler and S. Behnke. Compliant task-space control with back-drivable servo actuators. In Proc. of the 15th RoboCup International Symposium, 2011.
[11] A. Morales, T. Asfour, P. Azad, S. Knoop, and R. Dillmann. Integrated grasp planning and visual object localization for a humanoid robot with five-fingered hands. In Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2006.
1 Videos showing these demonstrations can be found at http://nimbro.net.
Odor guided exploration and plume tracking - Particle Plume Explorer
Gonc¸alo Cabrita, Pedro Sousa and Lino Marques
Abstract — This paper presents a new algorithm for single robot odor exploration called Particle Plume Explorer. This algorithm is supported by a novel point cloud based plume mapping method called Particle Plume, also presented here. The developed algorithm was validated through extensive real world experiments in multiple environments using a single robot equipped with a gas sensor.
I. INTRODUCTION
The sense of smell is becoming increasingly popular among mobile robots. The applications for a gas-sensor-equipped robot are numerous, from patrolling an airport for explosives and other illegal substances to aiding in the search for survivors during a natural catastrophe. Applications like these require a robot or group of robots to determine the location of odor sources. Although some applications are still out of the reach of current off-the-shelf chemical sensing technologies, the algorithms which will one day drive these robots are starting to emerge. The existing solutions to the odor source localization problem can be divided into three families of algorithms: (1) hill climbing or gradient ascent techniques; (2) biologically inspired algorithms; and (3) probabilistic methods. Gradient based techniques require comparisons between two or more spatially separated chemical measurements, meaning that there must be more than one gas sensor on the same agent or, in the case of a single sensor, the agent must perform multiple measurements separated in time and space in order to estimate the gradient and make a decision for the next move. This behavior has been widely observed in several insects (e.g. Drosophila) and in some bacteria (e.g. E. coli). However, this type of behavior is satisfactory only if the agent is either close to or inside the plume, otherwise it will perform random movements while trying to re-acquire the plume. Biologically inspired algorithms have been broadly used in robotics as an attempt to mimic the successful behaviors of animals, in this case while performing odor tracking or odor source declaration. The bacterium E. coli, which is also referred to in smooth gradient ascent techniques, presents a chemotaxis behavior that consists of a series of movements that probabilistically lead towards the highest concentrations. However, the moth is the living being that has inspired the most algorithms in this area. The moth performs a set of movements to reach the odor source [1], [2], which consist of moving straight upwind while inside the odor plume - the
G. Cabrita, P. Sousa and L. Marques are with the Department of Electrical and Computer Engineering, Institute of Systems and Robotics, University of Coimbra, 3030-290 Coimbra, Portugal.
{ goncabrita, pvsousa, lino} @isr.uc.pt
surge; then, the moth performs counter-turning patterns (zig-zag movements) along the plume and according to the wind direction, aiming to acquire odor cues - the casting. Some silkworm moths [3] also describe irregular spiral movements when the casting strategy bears no fruit. Besides these, other bio-inspired algorithms have been implemented and tested on mobile robots to accomplish the task of odor source localization [4], [5], using search strategies like Lévy-taxis [6], [7] or the Biased Random Walk [8]. Different bio-inspired approaches which rely on large groups of individuals carrying out a task as a whole (swarm behavior) can be found in [9], [10]. The odor source localization problem has also been researched employing probabilistic methods such as hidden Markov methods [11] or Bayesian methods [12], [13]. These works proposed to localize the odor source by building a probability map of the source location and, whenever new information was available, updating each cell in the map. In spite of the interest of the research community in the odor source localization problem, most of the proposed approaches are limited by constraints such as the absence of obstacles or near-perfect airflow, meaning that most methods are not suitable for real world implementation. In fact, the majority of the existing odor localization algorithms have only been demonstrated in simulation so far. The tradeoffs between real world, laboratory and simulation experiments are commonly discussed in the community. It is important to ensure that algorithms that perform well in simulation or in the laboratory will also perform well in real world experiments; thus the simulation environment must include the key features that make it as close as possible to the real world. Due to the number of variables involved in the process of odor displacement, the behavior of a chemical plume can appear somewhat random and chaotic, hence realistic plume simulation is extremely difficult to achieve, meaning that real world experimentation is an important step for validating most odor related algorithms. The remainder of this paper is organized as follows: section II presents the proposed approach for solving the odor oriented exploration and plume tracking problems. Section III describes the experimental setup used to validate the developed algorithm. Section IV presents the results and section V the discussion. The conclusions can be found in section VI.
II. ODOR SOURCE LOCALIZATION
The odor source localization problem can be divided into three stages, as depicted in Figure 1: (1) exploration of odor cues, (2) plume tracking and (3) odor source declaration. Although odor plume tracking based approaches are not the
Fig. 1. Diagram of the various stages of an odor source localization algorithm.
only solution for the odor source localization problem, they are in fact the most studied by the scientific community and are the basis for this work. While searching for odor sources a robot might often find itself lacking odor cues to follow. Furthermore, due to the chaotic nature of chemical plumes, odor cues might be present yet insufficient to provide an accurate estimate of where the robot should advance to. The odor exploration block in Figure 1 is thus the base of many odor source localization algorithms, providing the robot with the ability to explore the environment, which is unknown from a chemical point of view at the beginning of the search. Without any additional information this becomes an optimization problem; as such, during this stage the robot should attempt to cover the unknown area as efficiently as possible while searching for chemical cues. If during chemical exploration the robot is able to find enough odor cues to take an informed decision, the next step towards the odor source is plume tracking, as depicted in Figure 1. Odor is caused by the volatilization of chemicals which are then mixed with the air molecules. Odor can then travel by means of: (1) advection, caused by the motion of the fluid in which the chemical particles are suspended (in this case the air); (2) diffusion, resulting from the movement of particles along concentration gradients; and (3) a combination of the previous two. This leads to the intermittence of the plume, caused by the irregular spreading pattern. This characteristic is responsible for the high discrepancies in the chemical concentrations that can be measured at close points inside the plume, making intermittence one of the main problems while attempting to track a plume. Wind plays an important role in the motion component of the odor propagation. Along with this, changes in the wind direction and speed will also affect the odor plume, causing it to meander and produce instantaneous alterations in the wind direction. It is thus natural to use information provided by the wind to optimize plume tracking. However, the wind might be almost imperceptible under certain circumstances or so chaotic that it becomes unreliable, giving no useful information to the algorithm. In the final step, odor source declaration, the robot should finally be able to pinpoint the location of the odor source by means of varying chemical concentrations or by the shape of the plume. Nonetheless, some researchers have combined the olfactory information with other sources of information in order to better locate the source, e.g. using
visual information [14]. This topic is however out of the scope of this paper, since the proposed algorithm only goes as far as plume tracking. In spite of the fact that odor mapping is not traditionally seen as part of the solution for odor source localization, it is shown in Figure 1 as a part of the entire algorithm. Odor mapping is often present in odor related experiments in some way; however, building odor maps is usually performed as a goal and not as a tool. One of the main problems encountered during odor mapping is the fact that the chemical concentration at the same point inside a plume can have extremely high variations over time. The work developed in this project can be divided into two parts: Particle Plume (PP), a plume mapping algorithm, and Particle Plume Explorer (PPE), an odor oriented exploration and plume tracking algorithm.
A. Particle Plume
The PP mapping algorithm intends to tackle the problems of plume intermittence, gas sensor noise and, when applicable, chemical data sharing between robots. To accomplish this, the chemical readings from any available gas sensor (from a robot or a static sensor network) are converted into a point cloud around the sensor's location in a global frame. The number of points generated for each reading is proportional to the chemical reading in parts per million (ppm). In this case the MOX gas sensors being used were calibrated to provide readings in ppm; the calibration procedure is described further ahead. The point cloud is confined to a predefined radius around the point of origin.
The newly created point cloud is finally added to the plume. If the volume occupied by the new points contains older points, these older points are removed from the plume. This is necessary in order to ensure that the number of points in a certain volume reflects the last concentration read in that area. This process is represented in Figure 2. The result is a low-pass-filter-like behavior, producing a smooth representation of the plume. Point clouds with a larger radius produce smoother plumes, while a smaller radius results in larger variations inside the plume. The idea is not to capture the plume in a realistic manner, but in such a way that facilitates the job of the odor algorithms. Along with the particle cloud representation of the plume, PP is in charge of generating a grid of visited cells. This means that it is possible to differentiate between explored areas with no odor detected and unexplored areas.
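A minimal sketch of one PP update step is given below. The data types, the points-per-ppm factor and the uniform sampling inside the sphere are illustrative assumptions; the released implementation is the ROS package referenced in Sec. II-C.

```cpp
// Sketch of one Particle Plume update: add a point cloud around the sensor
// with a point count proportional to the reading (ppm), after dropping
// older points in that volume so local density reflects the latest reading.
#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <vector>

struct Point3 { double x, y, z; };

void updatePlume(std::vector<Point3>& plume, const Point3& sensor,
                 double ppm, double radius, double pointsPerPpm)
{
  // Remove older points inside the sphere covered by the new reading.
  plume.erase(std::remove_if(plume.begin(), plume.end(),
              [&](const Point3& p) {
                const double dx = p.x - sensor.x, dy = p.y - sensor.y, dz = p.z - sensor.z;
                return dx * dx + dy * dy + dz * dz <= radius * radius;
              }),
              plume.end());

  // Insert new points, their number proportional to the reading in ppm,
  // sampled uniformly inside the sphere around the sensor location.
  const int n = static_cast<int>(ppm * pointsPerPpm);
  for (int i = 0; i < n; ++i) {
    Point3 p;
    do {
      p.x = sensor.x + radius * (2.0 * std::rand() / RAND_MAX - 1.0);
      p.y = sensor.y + radius * (2.0 * std::rand() / RAND_MAX - 1.0);
      p.z = sensor.z + radius * (2.0 * std::rand() / RAND_MAX - 1.0);
    } while (std::pow(p.x - sensor.x, 2) + std::pow(p.y - sensor.y, 2) +
             std::pow(p.z - sensor.z, 2) > radius * radius);
    plume.push_back(p);
  }
}
```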
Fig. 2. Flowchart for the Particle Plume plume mapping algorithm.
Fig. 3. The four steps of the Particle Plume Explorer algorithm: (a) Draw a circle of radius r around the robot and divide it in n slices; (b) Remove the slices which overlap with obstacles; (c) Remove the slices which contain a percentage of explored area over a certain threshold; (d) Calculate the goal using the cost function in Equation 1.
B. Particle Plume Explorer
The PPE algorithm is in charge of odor oriented explo- ration and plume tracking. This is accomplished in four steps. (1) Initially a circle of radius r is drawn around the robot and divided into n slices. (2) Next the slices which overlap with obstacles are removed. (3) The slices which contain a percentage of explored area over a certain threshold are then also removed, i.e. slices which overlap a significant amount of already explored area (leaving only the slices which overlap unexplored areas). (4) Finally the cost of each remaining slice is calculated using Equation 1,
S_cost = W_o · |mad(α, θ)| + W_v · A_v + W_d · |mad(β, θ)| (1)
where W_o, W_v and W_d represent the weights of the odor, visited cells and current robot orientation, respectively; mad() is the minimum angular distance function; α, β and θ are the direction of growing odor concentration, the direction of the current slice and the orientation of the robot, respectively; finally, A_v represents the amount of area in the current slice that has already been explored, i.e. a slice which overlaps no explored areas will have an A_v value of 0. The slice with the lowest cost is chosen and the goal pose is set to the edge of the chosen slice. The four stages of the PPE algorithm are illustrated in Figure 3 and the pseudocode can be found in Algorithm 1. There might be situations where all slices are removed before the final stage. When this happens, PPE enters a recovery behavior where the area surrounding the robot is scanned for the closest viable spot to continue the exploration. This is done respecting the obstacles present in the environment. Determining if a certain location is suitable for further exploration is accomplished using the four-stage algorithm discussed earlier. If, while performing the recovery behavior, PPE is unable to locate a suitable pose to continue exploration, it assumes that there are no more places it needs to explore and the algorithm comes to an end. Since the purpose of PPE is odor exploration and plume tracking, areas which appear to have odor cues are given more importance and are thus more thoroughly explored. This is achieved by means of a dynamic threshold in the third stage of the algorithm. The presence of nearby odor
Algorithm 1: Particle Plume Explorer algorithm.
input : obstacle map, odor map, robot pose
output: goal

/* Step 1: Set the pie around the robot */
initPie(radius, number_of_slices);
if odor is present in the pie then
    max_visited_area ← odor_max_visited_area;
    calculateOdorGradient();
else
    max_visited_area ← no_odor_max_visited_area;
    odor_weight ← 0;
end
foreach slice do
    /* Step 2: Discard slices with obstacles */
    if slice overlaps with obstacle then
        discard slice;
    /* Step 3: Discard overly explored slices */
    else if slice_explored_area > max_visited_area then
        discard slice;
    else
        slice_cost ← odor_weight × abs(shortestAngularDistance(odor, robot))
                   + visited_weight × visited_area
                   + direction_weight × abs(shortestAngularDistance(slice, robot));
    end
end
/* Step 4: Determine best slice and set goal */
sortSlices();
goal ← getBestSliceGoal();
cues results in higher thresholds for the percentage of visited area for each slice while the absence of odor cues results in little tolerance for visited areas.
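A compact sketch of the slice cost of Equation 1 is given below. The default weights follow the control parameters reported in Sec. IV, and the helper mad() is our hypothetical implementation of the minimum angular distance function, not code taken from the pp_explorer package.

```cpp
// Sketch of the slice cost of Equation 1 for the surviving slices.
#include <cmath>

// Minimum angular distance between two angles, wrapped to [-pi, pi].
double mad(double a, double b)
{
  const double kPi = 3.14159265358979323846;
  double d = a - b;
  while (d >  kPi) d -= 2.0 * kPi;
  while (d < -kPi) d += 2.0 * kPi;
  return d;
}

// alpha: direction of growing odor concentration, beta: direction of the slice,
// theta: robot orientation, visitedArea: fraction of the slice already explored.
// Default weights: visited area 1.00, current heading 0.01, odor 0.10 (Sec. IV).
double sliceCost(double alpha, double beta, double theta, double visitedArea,
                 double wOdor = 0.10, double wVisited = 1.00, double wDir = 0.01)
{
  return wOdor * std::fabs(mad(alpha, theta))
       + wVisited * visitedArea
       + wDir * std::fabs(mad(beta, theta));
}
```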
C. Implementation
Both PP and PPE were implemented in C++ under the ROS framework [15]. All the source code developed for this project is open source and can be downloaded for free from http://www.ros.org/wiki/pp_explorer.
III. EXPERIMENTAL SETUP
For the real world experiments the iRobot Roomba platform was used. Each robot was equipped with a Hokuyo LASER range finder, a Figaro TGS2620 MOX gas sensor and an onboard computer for running the software. The robot can be seen in Figure 4. The robots used the ROS navigation stack for autonomous navigation and localization. Since odor exploration is the main focus of this paper, the metric map of the environment was provided to the robots
Fig. 4. The Roomba equipped with Hokuyo LASER range finders and gas sensors. (a) The Roomba used for the experiments. (b) Figaro TGS2620 MOX sensor on the left.
for each experiment. The Figaro TGS2620 gas sensor is considerably slow, with a rise time of about 2.0 seconds and a decay time of about 15.0 seconds. For this reason the maximum robot speed allowed during the experiments was 0.10 m/s.
The Figaro TGS2620 MOX sensor was calibrated to provide readings in ppm. The sensor was placed inside a sealed chamber of known volume containing clean air. Next a known quantity of CH 3 OH (the chemical used throughout the experiments) was released into the chamber. After the sensor stabilized a sensor reading was recorded along with the current quantity of chemical inside the chamber in ppm. MOX sensors produce changes in resistance as the chemical concentration changes. The procedure was repeated for several calibration points until the sensor was close to saturation. The obtained data set was then fitted by least square minimization to a logarithmic model.
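Since the exact form of the logarithmic calibration model is not given, the sketch below assumes one common choice, r = a + b·ln(ppm), fitted by ordinary least squares and inverted at query time; the function and variable names are ours.

```cpp
// Sketch: least-squares fit of an assumed logarithmic calibration model
// r = a + b * ln(ppm), inverted at runtime as ppm = exp((r - a) / b).
#include <cmath>
#include <vector>

struct Calibration { double a, b; };

Calibration fitLog(const std::vector<double>& reading, const std::vector<double>& ppm)
{
  // Ordinary least squares of the sensor reading against x = ln(ppm).
  double sx = 0, sy = 0, sxx = 0, sxy = 0;
  const double n = static_cast<double>(reading.size());
  for (std::size_t i = 0; i < reading.size(); ++i) {
    const double x = std::log(ppm[i]);
    sx += x;  sy += reading[i];  sxx += x * x;  sxy += x * reading[i];
  }
  Calibration c;
  c.b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
  c.a = (sy - c.b * sx) / n;
  return c;
}

// Convert a raw sensor reading into ppm using the fitted model.
double toPpm(const Calibration& c, double reading)
{
  return std::exp((reading - c.a) / c.b);
}
```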
Real world experiments were performed inside an arena specifically designed for mobile robot odor experiments. The 3 m × 4 m arena is a completely enclosed environment delimited by four walls, two of which are made of honeycomb-shaped plastic. It includes an array of controllable fans allowing for the generation of different kinds of air flows. The chemical substance released during all real world experiments was methanol vapor (CH3OH). It was always released at a point about 0.3 m above the ground using the bubbling method.
Simulations were performed in Stage under the ROS framework. Plume simulation was accomplished using PlumeSim [16], a plume simulation software developed by the authors. The plumes were generated using the Computa- tional fluid dynamics (CFD) software ANSYS Fluent.
A. Simulated Experiments
A set of simulated experiments was used to fine-tune the algorithms and determine the best parameters for PP and PPE. A set of simulations using a single robot was also conducted to determine the relation between the area of the environment and the amount of time needed to complete the exploration. To accomplish this, the robot was allowed to perform the PPE algorithm in empty square arenas of different sizes: 9 m², 12 m², 36 m² and 81 m².
B. Real World Experiments
The real world experiments were performed using a single robot in the testing arena. Three different maps were tested to determine how robust the PPE algorithm is in different environments. A set of 10 experiments was performed for each map, with the robot starting at a random location inside the map. The distance travelled by the robot, the time spent on the exploration and the percentage of covered area were recorded for each experiment. The second set of experiments, performed in Map 3, consisted in varying the parameters of the PPE algorithm. For each experiment a single parameter was changed from the control parameters to evaluate how it would affect the behavior of the robot. Only the two percentage-of-visited-area thresholds were changed simultaneously to the same value, so that in practice PPE would work with only one threshold. A fourth map was used to test the following PPE parameter changes: radius of the circle around the robot increased to 1 m and decreased to 0.3 m (control value 0.6 m); number of slices decreased to 4 and increased to 32 (control value 16); both percentage-of-visited-area thresholds set to 20% and to 80% (control values: 20% in the absence of odor and 50% in the presence of odor).
IV. RESULTS
The initial simulations and real world experiments resulted in the following set of parameters, used for the remainder of the experiments: for PP the sphere radius used was 0.10 m; for PPE the pie radius r was 0.60 m, the number of slices n was 16, the visited area, current heading and odor weights were respectively 1.00, 0.01 and 0.10, and the maximum allowed percentage of discovered area per slice was 20% in the absence of odor and 50% in the presence of odor. The results of the real world experiments conducted on Maps 1, 2 and 3 can be found in Figure 5 and Table I. The robot performed all of the experiments successfully. In Figure 5 it is possible to see a picture of the arena followed by a sequence of screenshots from the visualization software rviz captured during operation. The screenshots were captured in real time as the robot was performing the exploration algorithm. The visited cells can be seen in yellow while the point cloud representation of the plume appears in red. In Figure 6 it is possible to see the graphic representing the evolution of the time needed to perform an exploration as the area to explore grows. The results for the cost function
Fig. 5. Results of the real world experiments on (a) Map 1, (b) Map 2, and (c) Map 3. Map 1 has no obstacles. Map 2 has a set of obstacles similar to what a robot can find while navigating a building. Map 3 contains a set of obstacles with awkward shapes and angles. The blue circle depicts the localization of the odor source.
Fig. 6. Graphic representing the evolution of the time needed to perform an exploration as the area to explore (in m²) increases.
experiments can be found in Figure 7, which is composed of three graphics showing the distance travelled in meters, the time spent in minutes and the percentage of explored area.
TABLE I
SINGLE ROBOT ARENA EXPERIMENTS

                        |   | Map 1 | Map 2 | Map 3
------------------------|---|-------|-------|------
Explored Area (%)       | µ | 56.17 | 43.57 | 47.05
                        | σ | 2.22  | 3.84  | 3.23
Distance Traveled (m)   | µ | 26.03 | 21.65 | 22.57
                        | σ | 2.81  | 3.26  | 3.56
Time Spent (min)        | µ | 9.69  | 7.99  | 7.05
                        | σ | 1.75  | 1.59  | 1.32
V. DISCUSSION
The results in Table I show that PPE performs in a similar way when faced with different sets of obstacles. It is natural
that Map 2 and Map 3 yield lower percentages of explored area, travelled distance and time spent when compared to Map 1, since the obstacles present in Maps 2 and 3 effectively decrease the area to explore. The robot tends to avoid exploring areas very close to the obstacles, since a real robot must take its own dimensions into account when traversing any area. The time spent on the explorations is related to the speed of the robot, which in turn was limited by the slow response speed of the gas sensors being used. Nevertheless, the times shown in Table I are very satisfactory. The fact that the robot never failed any experiment shows how robust the PPE algorithm is. From the sequences of images in Figure 5 it is possible to observe that the robot covered more area where odor was present. Furthermore, it was also possible to verify during the experiments that the robot was following the odor gradient as intended. This was achieved due to the smooth plumes generated by PP, the point cloud representations of the odor captured by the gas sensors. It is visible that, particularly after being exposed to high concentrations, PP generated a drag effect of the chemical. This is due to the low response speed of the gas sensors being used; however, this behavior was both expected and acceptable in order to maintain the desired average robot speed. This is visible in the rightmost
image in Figure 5(b). The graphic in Figure 6 shows that as the area to explore increases so does the time spent on the exploration, following a second order polynomial curve. This behavior was expected, and Figure 6 shows that PPE's performance does not degrade as the area to explore increases. The circle radius experiments (Figure 7) showed that as the radius increases the robot becomes unable to move in tight places, since slices which overlap obstacles are removed, leaving
Fig. 7. Graphics related to the parameter tests performed for the PPE algorithm: (a) Distance Travelled (m); (b) Time Spent (min); (c) Explored Area (%).
most of the area to explore. Although PPE finished faster, the percentage of explored area is about half of that explored during the control experiments. On the other hand, as the radius was decreased to 0.3 m the robot tended to perform a swirl-like movement, spending more time to complete the exploration while also achieving the highest percentage of explored area (Figure 7). The high amount of time necessary for completing the exploration can also be explained by the increase in the number of iterations: as the radius decreases, the number of iterations PPE must run increases. When the number of slices was decreased to 4 the results were similar to those of a large circle radius, as Figure 7 shows. The robot left much of the area unexplored. However, in this case it is due to the fact that as the number of slices decreases PPE has fewer slices to choose from, making it almost impossible to move near obstacles. When the number of slices was increased to 32 the results were similar to those of the control experiments. The slight increase in time spent, distance travelled and explored area can be explained by the fact that the map being used, and thus the slices, are in fact composed of a grid of cells with a side length of 0.05 m. With 32 slices each slice had very few cells, thus the performance of the algorithm degraded. The results for the percentage of visited area behaved as expected: with a lower threshold the explored area, time spent and travelled distance fell, whereas with a higher threshold the robot was able to visit more of the area, consequently increasing time spent and distance travelled. This can be seen in the graphics of Figure 7.
VI. CONCLUSIONS
The presented PP and PPE algorithms were tested under realistic scenarios and performed efficiently. Particle Plume Explorer was able to complete all the odor explorations successfully in three different scenarios, thus proving to be both reliable and robust. Future work will include adding the final stage of odor source declaration to the current work, completing the odor source localization algorithm, as well as the use of wind direction during plume tracking when such information is available. As the time needed to complete an exploration grows with the area to be explored, it is only natural that a multi-robot version of the PPE algorithm will be researched in the near future, to allow a quicker exploration of larger
areas. Furthermore faster gas sensors will be tested to allow the robots to move at greater speeds, thus accelerating the exploration process.
REFERENCES
[1] J. Belanger and E. Arbas, "Behavioral strategies underlying pheromone-modulated flight in moths: lessons from simulation studies," Journal of Comparative Physiology A: Neuroethology, Sensory, Neural, and Behavioral Physiology, vol. 183, no. 3, pp. 345–360, 1998.
[2] T. Lochmatter and A. Martinoli, "Tracking odor plumes in a laminar wind field with bio-inspired algorithms," in Proc. of the 11th Int. Symposium on Experimental Robotics 2008 (ISER 2008). Springer, 2009, pp. 473–482.
[3] R. A. Russell, Odour Detection by Mobile Robots, ser. Robotics and Intelligent Systems. World Scientific, 1999, vol. 22.
[4] L. Marques, U. Nunes, and A. de Almeida, "Olfaction-based mobile robot navigation," Thin Solid Films, vol. 418, no. 1, pp. 51–58, 2002, 1st Int. School on Gas Sensors.
[5] R. Russell, A. Bab-Hadiashar, R. Shepherd, and G. Wallace, "A comparison of reactive robot chemotaxis algorithms," Robotics and Autonomous Systems, vol. 45, no. 2, pp. 83–97, 2003.
[6] Z. Pasternak, F. Bartumeus, and F. W. Grasso, "Lévy-taxis: a novel search strategy for finding odor plumes in turbulent flow-dominated environments," Journal of Physics A: Mathematical and Theoretical, vol. 42, no. 43, 2009. [Online]. Available: http://stacks.iop.org/1751-8121/42/i=43/a=434010
[7] S. Nurzaman, Y. Matsumoto, Y. Nakamura, S. Koizumi, and H. Ishiguro, "Biologically inspired adaptive mobile robot search with and without gradient sensing," in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS 2009), 2009, pp. 142–147.
[8] A. Dhariwal, G. Sukhatme, and A. Requicha, "Bacterium-inspired robots for environmental monitoring," in Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA 2004), vol. 2, 2004, pp. 1436–1443.
[9] L. Marques, U. Nunes, and A. de Almeida, "Particle swarm-based olfactory guided search," Autonomous Robots, vol. 20, no. 3, pp. 277–287, 2006, special issue on Mobile Robot Olfaction.
[10] V. Gazi and K. Passino, Swarm Stability and Optimization. Springer, 2011.
[11] J. A. Farrell, S. Pang, and W. Li, "Plume mapping via hidden Markov methods," IEEE Trans. on Systems, Man, and Cybernetics - Part B, vol. 33, no. 6, pp. 850–863, 2003.
[12] S. Pang and J. Farrell, "Chemical plume source localization," IEEE Trans. on Systems, Man, and Cybernetics, Part B, vol. 36, no. 5, pp. 1068–1080, 2006.
[13] S. Pang, "Plume source localization for AUV based autonomous hydrothermal vent discovery," in OCEANS '10, 2010, pp. 1–8.
[14] H. Ishida, H. Tanaka, H. Taniguchi, and T. Moriizumi, "Mobile robot navigation using vision and olfaction to search for a gas/odor source," Autonomous Robots, vol. 20, pp. 231–238, 2006. DOI: 10.1007/s10514-006-7100-5.
[15] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, "ROS: an open-source Robot Operating System," in ICRA Workshop on Open Source Software, 2009.
[16] G. Cabrita, P. Sousa, and L. Marques, "Player/stage simulation of olfactory experiments," in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS 2010), 2010, pp. 1120–1125.
Environmental Perception for Intelligent Vehicles Using Catadioptric Stereo Vision Systems
Miriam Schönbein∗ Bernd Kitt∗ Martin Lauer∗
∗ Institute for Measurement and Control Systems, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
Abstract— This paper presents advantages of catadioptric camera systems for robotic applications using the example of autonomous vehicles and driver assistance systems. We propose two different catadioptric stereo camera systems to improve
environmental perception. Both stereo camera systems consist of
at least one catadioptric camera, which combines a lens with
a hyperbolic mirror and provides a panoramic view of the
surrounding. This allows a 360 ◦ field of view of the vehicle’s
surrounding and ultimately allows e.g. object detection around the vehicle or the detection of lane markings next to the vehicle. We discuss the advantages of both camera systems for stereo vision and 3D reconstruction. Furthermore, we show results for depth estimation with sparse features in a simulated environment
as well as on a real data set.
Index Terms— catadioptric camera, stereo vision, hybrid imaging system
Fig. 1. Field of view of the catadioptric stereo camera systems. (b) Catadioptric/catadioptric system mounted on the left and right side above the windshield.
I. INTRODUCTION
For robotic applications, and above all for autonomous driving, vision-based obstacle detection and 3D reconstruction are important tasks. Therefore, it is desirable for camera systems to have a large field of view of the surroundings of the vehicle. However, conventional perspective cameras, as typically used in vehicles, only provide a horizontally limited field of view (< 90°). Commonly the cameras point in the frontal direction, and objects alongside the vehicle are not visible with a single perspective camera. However, for many advanced driver assistance systems such as blind spot detection, lane departure warning or lane tracking, knowledge about the sides of the vehicle is essential. In addition, a panoramic view of the surroundings could enable new applications like intersection reconstruction (e.g. [1]) for autonomous driving in urban environments. Omnidirectional cameras are able to provide such information, for example about passing vehicles, pedestrians or lane markings. The apparent possibility of obtaining a panoramic view of the vehicle environment by capturing a set of perspective images, using cameras with different orientations or one rotating camera, suffers from the complexity of stitching the images together, the sizable installation space and the extensive calibration. Hence, such systems are not suitable for the dynamic environment of a vehicle, although they are capable of obtaining a high-resolution image. Moreover, violation of the single view point condition while stitching a panorama from multiple images introduces unwanted effects such as ghosting, which makes the panoramas unsuitable for stereo matching and reconstruction. In contrast, catadioptric camera systems are able to provide a 360° field of view of the environment with only one camera. This eliminates the problem of stitching a panorama out of many perspective images. Moreover, a central catadioptric camera system inherently fulfills the single view point condition [2]. Catadioptric cameras are a combination of a convex mirror above a lens and provide a panoramic image. Such panoramic imaging systems are very popular in the robotics research community for autonomous navigation and surveillance systems and have been studied in the past [3, 4]. While monoscopic catadioptric cameras have been considered for driver assistance applications such as blind spot detection [5], lane tracking [6] or egomotion estimation [7], there is very little work on the potential of stereoscopic catadioptric systems for intelligent vehicles. In recent years, different types of catadioptric stereo vision systems have been proposed. Zhu et al. [8] discuss different configurations of omnidirectional stereo setups, mainly binocular stereo with vertically or horizontally aligned cameras. Common methods to use vertically aligned omnistereo are the combination of two lenses with two separated mirrors which are vertically aligned [9], or two mirrors mounted on top of each other and only one lens [10]. Gandhi and Trivedi [11] used two horizontally aligned catadioptric sensors mounted on the side mirrors of a vehicle for visualizing and analyzing the nearby surroundings of a vehicle, and Ehlgen et al. [12] applied two horizontally aligned catadioptric cameras mounted on the rear roof edges. Both approaches give the driver a visualization of the entire environment. A hybrid camera system, combining a catadioptric and a perspective camera, has been proposed by Lauer et al. [13] and Sturm [14]. This imaging system combines the advantages of a 360° field of view from a catadioptric camera and the high resolution from a conventional perspective
camera. In [15] such a hybrid omnidirectional pinhole sensor (HOPS) is used for stereo obstacle detection in a robotic environment. However, little research has been performed on stereo matching and 3D reconstruction with a stereo camera system consisting of at least one catadioptric camera, as well as on the choice of a catadioptric camera system for autonomous vehicle applications. In this paper, we evaluate the advantages of catadioptric camera systems for environmental perception in applications for driver assistance systems and intelligent vehicles. To this end, we propose two promising systems to improve environmental perception. The first system consists of a perspective and a catadioptric camera which are vertically aligned (Fig. 1(a)). The second system consists of two horizontally aligned catadioptric cameras (Fig. 1(b)). Both systems provide
a panoramic view of the surrounding, which allows object
detection around the vehicle and not only in the small area in front of the vehicle. We introduce stereo vision and 3D reconstruction for both catadioptric systems and discuss results for simulated and real data sets as well as the respective advantages of both methods.
The remainder of this paper is organized as follows. Sec- tion II introduces the imaging model of catadioptric cam- eras and the construction of a catadioptric/perspective stereo camera system as well as a catadioptric/catadioptric stereo camera setup. Section III describes stereo vision utilizing the proposed catadioptric stereo camera systems. We explain feature matching as well as the 3D reconstruction for both camera systems. In Section IV we discuss results for feature matching and 3D reconstruction on real data and evaluate the accuracy of the reconstruction for both systems within a simulated environment. Furthermore, we show the advantages of both systems for future intelligent vehicle applications. Finally, Section V concludes our work and gives an overview of future work.
II. CATADIOPTRIC STEREO CAMERA SYSTEMS
A. Catadioptric Camera
Our catadioptric camera consists of a hyperbolic mirror and
a camera lens. The catadioptric system fulfills the single view point (SVP) condition, which means that the camera center C coincides with the second focal point F of the hyperbolic mirror as shown in Fig. 2. The shape of a hyperbolic mirror
is defined by the following equation
\frac{(z + e)^2}{a^2} - \frac{R^2}{b^2} = 1 \qquad (1)
where a and b are the mirror parameters and e = \sqrt{a^2 + b^2}. The relationship between a world point and the corresponding camera point consists of two projection steps [16]: a non-linear projection from the world point X to the point on the mirror surface X_M, and a linear projection from the point on the mirror surface X_M to the image point u_C. The world point in the mirror coordinate system X_WM is given by the following equation

X_{WM} = R_M (X - t_M) \qquad (2)
Fig. 2. Hybrid Imaging System consisting of a catadioptric camera with image plane u c and a perspective camera with image plane u p .
where R_M is the rotation matrix and t_M the translation vector between the world and the mirror coordinate system. The point X_WM is projected to the point on the mirror surface X_M by the nonlinear function \lambda_{1/2}(X_WM),

X_M = \lambda_{1/2}(X_{WM}) \cdot X_{WM} \qquad (3)

with

\lambda_{1/2}(X_{WM}) = \frac{b^2 \left( -eZ \pm a \lVert X_{WM} \rVert \right)}{b^2 Z^2 - a^2 X^2 - a^2 Y^2}. \qquad (4)
Consequently, the point X_M is computed from the intersection of the ray from X_WM to the focal point F′ and the hyperbolic mirror surface. Thus, we receive two values for \lambda caused by the two intersection points between the mirror and the ray. The correct \lambda is given by

\lambda(X_{WM}) = \begin{cases} \min(\lambda_{1/2}), & \text{for } \lambda_{1/2} > 0 \\ \max(\lambda_{1/2}), & \text{for } \lambda_1 > 0 \text{ and } \lambda_2 < 0 \end{cases} \qquad (5)
The point X_M is then transformed into the camera coordinate system with rotation matrix R_C and translation vector t_C = [0, 0, -2e]^T. Finally, the point on the mirror surface in the camera coordinate system X_C is projected to the image point u_c with the camera calibration matrix K by the following equation

u_c = \frac{1}{z_C} K X_C. \qquad (6)

Using equation (6), the complete imaging model is given by

u_c = \frac{1}{z_C} K R_C \left( \lambda(X_{WM}) \, R_M (X - t_M) - t_C \right). \qquad (7)
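The projection chain of equations (1)–(7) is compact enough to be prototyped in a few lines. The following is a minimal sketch (not the authors' implementation), assuming numpy, a 3×3 calibration matrix K, and externally given calibration data R_M, t_M, R_C; function and variable names are illustrative.

```python
import numpy as np

def project_catadioptric(X, K, R_M, t_M, R_C, a, b):
    """Project a 3D world point X onto the catadioptric image plane,
    following the chain of Eqs. (2)-(7): world frame -> mirror frame ->
    mirror surface -> camera frame -> image point."""
    e = np.sqrt(a**2 + b**2)                      # e = sqrt(a^2 + b^2), as in Eq. (1)
    t_C = np.array([0.0, 0.0, -2.0 * e])          # camera centre at the second focal point

    X_WM = R_M @ (X - t_M)                        # Eq. (2): point in the mirror frame
    x, y, z = X_WM

    # Eq. (4): the two candidate scale factors of the ray/mirror intersection
    denom = b**2 * z**2 - a**2 * x**2 - a**2 * y**2
    lam12 = b**2 * (-e * z + np.array([1.0, -1.0]) * a * np.linalg.norm(X_WM)) / denom

    # Eq. (5): choose the physically correct intersection point
    lam = lam12[lam12 > 0].min() if np.all(lam12 > 0) else lam12.max()

    X_M = lam * X_WM                              # Eq. (3): point on the mirror surface
    X_C = R_C @ (X_M - t_C)                       # transform into the camera frame
    u = K @ X_C / X_C[2]                          # Eq. (6): projection with calibration matrix K
    return u[:2]
```

As an example, the mirror parameters a and b reported in Section IV could be plugged in to obtain the forward model used later for simulation.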
Fig. 3. The images show the stereo and the mono region of the catadioptric/catadioptric system in the unwarped panoramic images of the left (above) and the right camera (below). The green area is the visible region in both cameras, the yellow area the visible region of the left camera and the blue area the visible region of the right camera.
B. Catadioptric/Perspective Stereo Camera System
The catadioptric/perspective stereo camera system, which will be denoted as System A, consists of a catadioptric and a perspective camera. The system is mounted on top of the vehicle in the center above the windshield. As depicted in Fig. 2, the perspective camera is attached under the catadiop- tric camera, so that the focal points of the perspective camera and the catadioptric camera are aligned vertically. As shown in Fig. 1(a) the system allows a monoscopic view of the surrounding environment in all directions of the car, only the area behind the vehicle is occluded by the vehicle itself. Additionally, the system obtains stereo information in a preferential direction which is the frontal direction in our case. The stereo area of the system is depicted in Fig. 1(a) and is bounded by the field of view of the perspective camera. The advantages of this camera system are the 360 ◦ field of view of the catadioptric camera while at the same time providing a standard perspective camera image in frontal direction. This allows the use of well-established monoscopic image processing algorithms such as lane boundary estimation, traffic sign recognition and obstacle detection even for longer distances. Besides, in the close up range in frontal direction we obtain stereo vision information in the overlapping perspective and catadioptric field of view.
C. Catadioptric/Catadioptric Stereo Camera System
The catadioptric/catadioptric stereo camera system, which will be denoted as System B, consists of two catadioptric cameras which are aligned horizontally. The cameras are mounted on the left and right side on top of the vehicle above the windshield. This enables a binocular observation in front of the vehicle as well as on both sides of the vehicle, in contrast to Gandhi et al. [11], who mounted the cameras on the side mirrors and obtained stereo information only in the frontal direction. Moreover, mounting the camera system on a rigid frame on top of the vehicle yields a higher rigidity of the stereo camera system, which is important for stereo vision. The overlapping stereo area which is visible in both catadioptric cameras is shown in Fig. 1(b). The stereo area is also larger than in the catadioptric/perspective system. Due to the two horizontally aligned cameras, the stereo area of the system is only restricted by the vehicle itself. As depicted
in the unwarped panoramic view of the left and the right catadioptric camera in Fig. 3, both cameras observe the area in front of the vehicle and far range alongside the vehicle. The close range behind and on both sides alongside the vehicle is only visible within one of the two catadioptric camera images. Hence, the advantages of the catadioptric/catadioptric camera system are the large stereo field of view in which we obtain 3D information about the environment and the monocular observation alongside the vehicle.
III. STEREO VISION WITH CATADIOPTRIC CAMERAS
The first step for stereo vision is the matching process. During the matching process we obtain corresponding points in the image regions which are visible in both images. To sim- plify the correspondence search and to verify the established correspondence points with the epipolar geometry, we need the intrinsic and the extrinsic calibration parameters of the camera setup. We use the calibration toolbox from [17] for the intrinsic calibration of the catadioptric cameras. The extrinsic calibration is done with the aid of virtual images which are generated out of the catadioptric images. Due to the fact that we use perspective images for the extrinsic calibration, we can use the stereo calibration toolbox [18] for perspective images. Finally, we reconstruct the corresponding 3D world points from the 2D image matches by triangulation.
A. System A: Catadioptric/Perspective System
In System A we obtain two different images, a panoramic catadioptric image and a perspective image. For the feature matching we compare the overlapping area of both images and search corresponding feature points directly in the different images. Due to the different mappings in the catadioptric and perspective camera, standard block matching algorithms are non-applicable to establish feature correspondences. There- fore, we use the well-known SURF features [19] which are scale and rotation invariant and hence applicable for the different images. By means of the correspondence search directly in the different images we do not lose any image information by the transformation of the catadioptric image into a virtual perspective image. The epipolar geometry for a combined catadioptric/perspective stereo setup is more difficult
Fig. 4. Reconstruction of a 3D scene point X from the corresponding catadioptric image points u_L and u_R by the shortest distance between the rays through F_R, x_R and F_L, x_L.
than in the classical case of stereo vision. Points which lie on a vertical line in the perspective image are on a radial curve in the catadioptric image. In our case, the perspective camera is attached in the symmetry axis of the mirror under the catadioptric camera. This simplifies the epipolar geometry so that points on a radial line in the catadioptric image are mapped onto a straight line in the perspective image [13]. The 3D scene point X is calculated out of the corresponding image points. To estimate the 3D points we use the points on the mirror surface X M which belong to the corresponding image points u c of the respective image. The mirror points are back-projections from the image points u c to the mirror points X c by equation (3). The 3D point can be found as the midpoint of the shortest distance [20] between the ray through the focal point of the mirror F c and the point on the mirror surface X M and the ray through the focal point of the perspective camera F p and the 3D point on the perspective image plane X p .
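The "midpoint of the shortest distance" construction used here (and again for System B below) reduces to a small linear solve. This is a generic sketch, not the authors' code; both rays are assumed to be expressed in a common coordinate frame, each given by a focal point F and a point X on the mirror surface or image plane.

```python
import numpy as np

def midpoint_triangulation(F1, X1, F2, X2):
    """Return the midpoint of the shortest segment connecting two rays,
    ray i going from focal point Fi through point Xi (all 3D, common frame)."""
    d1 = (X1 - F1) / np.linalg.norm(X1 - F1)   # unit direction of ray 1
    d2 = (X2 - F2) / np.linalg.norm(X2 - F2)   # unit direction of ray 2

    # Solve for the ray parameters s, t of the closest points F1 + s*d1 and F2 + t*d2
    A = np.array([[d1 @ d1, -d1 @ d2],
                  [d1 @ d2, -d2 @ d2]])
    b = np.array([(F2 - F1) @ d1, (F2 - F1) @ d2])
    s, t = np.linalg.solve(A, b)               # singular only for parallel rays

    P1 = F1 + s * d1
    P2 = F2 + t * d2
    return 0.5 * (P1 + P2)                     # reconstructed 3D scene point
```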
B. System B: Catadioptric/Catadioptric System
In System B we obtain two catadioptric images. The over- lapping area as shown in Fig. 1 is larger and not only in front of the vehicle. Again, block matching algorithms are non- applicable to match directly in the catadioptric images, due to the distortions in catadioptric images. As in System A, we use SURF features to find corresponding feature points directly in the catadioptric images. The epipolar geometry for a catadiop- tric image pair is described in [16]. A corresponding point in one image lies on a conic circle in the other catadioptric image. The estimation of the 3D point is performed by calculating the midpoint of the shortest distance between the ray through the focal point of the left mirror F L and the point on the left mirror surface X L and the ray through the focal point on the right mirror F R and the point on the right mirror surface X R . The points on the left and right mirror surface depend on the corresponding image points in the catadioptric images. The reconstruction of a 3D scene point from the corresponding image points in case of two catadioptric cameras is shown in Fig. 4.
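Correspondence search with SURF directly in the two (distorted) images can be prototyped with OpenCV. A rough sketch under the assumption that an opencv-contrib build with the SURF implementation is available; the image file names are placeholders, and the ratio test is a common heuristic rather than something prescribed by the paper, which instead verifies matches with the epipolar geometry.

```python
import cv2

# Load the two images of the stereo pair (file names are placeholders)
img_left = cv2.imread("catadioptric_left.png", cv2.IMREAD_GRAYSCALE)
img_right = cv2.imread("catadioptric_right.png", cv2.IMREAD_GRAYSCALE)

# SURF is scale and rotation invariant, so it can be matched directly
# in the distorted catadioptric images (requires opencv-contrib-python).
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
kp_l, des_l = surf.detectAndCompute(img_left, None)
kp_r, des_r = surf.detectAndCompute(img_right, None)

# Brute-force matching with Lowe's ratio test to reject ambiguous matches
bf = cv2.BFMatcher(cv2.NORM_L2)
matches = bf.knnMatch(des_l, des_r, k=2)
good = [m for m, n in matches if m.distance < 0.7 * n.distance]

print(f"{len(good)} putative correspondences")
```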
IV. EXPERIMENTAL RESULTS
Fig. 5. SURF correspondences between the perspective and the catadioptric image (System A).
For our experiments we use catadioptric cameras with IT+Robotics VS-C450MR hyperbolic mirrors with the parameters a = 26.8578 mm, b = 20.8485 mm and r_max = 22.5 mm. The spatial angle of view is 104°. The image resolution of the catadioptric cameras and of the perspective camera is 1392 × 1032. The catadioptric/catadioptric stereo system as well as the perspective/catadioptric stereo system can outperform standard perspective camera systems in the task of object detection, since the detection of objects is possible not only in front of the vehicle but also on both sides. Moreover, both systems have different additional advantages. Due to the mounting of System B on the sides on top of the vehicle, it is also possible to detect small objects alongside the car such as lane markings or curbs. As shown in Fig. 3, lane markings alongside the vehicle are clearly visible at both sides in the images captured with System B. The ability to not only detect lane markings in front of the vehicle may help to improve systems such as lane departure warning or vehicle localization significantly. With System A it is not possible to see close objects alongside the vehicle. However, this system has advantages for object detection in front of the vehicle due to the higher resolution of the perspective camera. Objects far away could be detected in the high resolution perspective image before their position is estimated in the stereo vision area. This camera system also allows tracking of detected objects even after they have left the stereo field of view.
A. Feature Matching
Within the matching process we obtain corresponding features in both images of the respective camera system by means
Fig. 6. 3D reconstruction results within a simulated environment. (a) Distribution of the 3D reconstruction for a noisy circular point cloud with 8 m radius around the midpoint of the camera baseline, from System B (left image) and from System A (right image).
(b) Depth Image from the catadioptric/catadioptric system of a simulated environment with noisy input data.
of a SURF feature detector. We evaluate the feature matching and 3D reconstruction in a sequence of 600 frames captured with the considered catadioptric stereo camera systems. The features are only matched in the area around the vehicle and not on the vehicle itself. On average, 500 features are found with the SURF detector
in the perspective images, whereas in the catadioptric images 1000 features are found. From these features an average of
100 corresponding image points is found for System B, which
is about two times higher than for System A. This can be explained by the larger stereo field of view for System B than for System A and the fact that corresponding points are also found on both sides of the vehicle. The number of correspondence points in frontal direction which is visible with both systems is approximately the same. The corresponding image points for one frame in System A are depicted in Fig. 5 and the corresponding image points for System B are shown in Fig. 7.
B. Reconstruction
From the established feature points and the extrinsic calibration, the corresponding 3D world points are calculated. The accuracy of the world points for both systems is evaluated using a simulated 3D environment. Herein, we simulate both camera systems and a circular point cloud with 8 m radius around the midpoint of the camera baseline. Afterwards, we map all these points onto the image planes of the cameras and perform a Monte Carlo simulation. For this purpose, we create 100 samples of the mapped points, each sample disturbed with Gaussian noise of standard deviation 1.5 pixels. Subsequently, we reconstruct each sample and compute mean and standard deviation of the Euclidean distance between reconstructed and ground truth scene points. For System B the error depends on the azimuth angle as shown in the left image of Fig. 6(a). The accuracy of the reconstruction decreases from the points in the middle between the two cameras to the points on both sides of the cameras. The reconstruction of a point which is horizontally aligned with both catadioptric cameras is not possible, because the baseline length in the horizontal direction of both cameras is close to zero. Table I shows the back-projection error depending on the azimuth angle. For System A the
accuracy is more or less constant due to the small aperture angle of the perspective camera, which restricts the stereo vision area. The right image of Fig. 6(a) shows the results for the catadioptric/perspective system within the same simulation environment. Fig. 6(b) shows the depth image obtained from the catadioptric/catadioptric stereo camera system for a simulated environment. For this purpose, we map a simulated 3D scene onto the catadioptric image planes and disturb the image points with Gaussian noise with standard deviation 1.5 pixels. The image shows the calculated depth information of the back-projected scene points. The catadioptric/catadioptric system captures a larger stereo vision area and thus allows feature matching and 3D reconstruction also alongside the vehicle. The mean and standard deviation of the Euclidean distance between reconstructed and ground truth scene points in the frontal direction for System A are not significantly smaller than for System B. Therefore, System B is better suited for 3D reconstruction, because 3D points on the sides of a vehicle can be calculated in addition to points in the frontal direction. Moreover, with the 3D information alongside the vehicle it is also possible to detect and track objects which enter the field of view from one side of the observer vehicle.
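The Monte Carlo evaluation described above follows a simple recipe: project ground-truth points on a circle into both cameras, perturb the image points with 1.5-pixel Gaussian noise, reconstruct, and accumulate error statistics per azimuth angle. A schematic sketch, assuming hypothetical project_fn and reconstruct_fn callables that wrap the stereo projection and triangulation of the respective camera system.

```python
import numpy as np

def evaluate_reconstruction(project_fn, reconstruct_fn,
                            n_samples=100, sigma_px=1.5, radius_mm=8000.0):
    """Mean/std of the Euclidean reconstruction error for a circular point
    cloud around the midpoint of the camera baseline (cf. Table I)."""
    rng = np.random.default_rng(0)
    stats = {}
    for az_deg in range(0, 100, 10):                        # azimuth angles 0..90 degrees
        az = np.deg2rad(az_deg)
        X_gt = np.array([radius_mm * np.cos(az), radius_mm * np.sin(az), 0.0])
        u_l, u_r = project_fn(X_gt)                          # ideal image points in both cameras
        errors = []
        for _ in range(n_samples):
            u_l_n = u_l + rng.normal(0.0, sigma_px, 2)       # disturb with pixel noise
            u_r_n = u_r + rng.normal(0.0, sigma_px, 2)
            X_hat = reconstruct_fn(u_l_n, u_r_n)             # back-project and triangulate
            errors.append(np.linalg.norm(X_hat - X_gt))
        stats[az_deg] = (np.mean(errors), np.std(errors))    # same units as X (mm here)
    return stats
```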
V. CONCLUSIONS AND FUTURE WORK
In this work, we presented two different catadioptric stereo camera systems for advanced intelligent vehicle applications. The feature matching and the 3D reconstruction from the feature points for a catadioptric/perspective and a catadioptric/catadioptric stereo camera system were explained. We performed experiments on point correspondence search with SURF features as well as 3D reconstruction for both catadioptric camera systems on real data and evaluated the accuracy of the 3D reconstruction on simulated data. Furthermore, we pointed out the advantages of both systems for applications in autonomous vehicles. The catadioptric/perspective system is advantageous for object detection in the preferential frontal direction due to the higher resolution of the perspective camera. Besides, it would be possible to track objects alongside the vehicle if they were detected first in the frontal direction. However,
TABLE I
MEAN AND STANDARD DEVIATION OF THE EUCLIDEAN RECONSTRUCTION ERROR DEPENDING ON THE AZIMUTH ANGLE FOR SIMULATED SCENE POINTS. FOR EVALUATION PURPOSES WE CREATE SCENE POINTS ON A CIRCLE WITH RADIUS 8 M AND DISTURB THE MAPPED SCENE POINTS WITH GAUSSIAN NOISE OF 1.5 PIXEL STANDARD DEVIATION.

Azimuth angle            0°     10°     20°     30°     40°     50°     60°     70°     80°     90°
Mean [m]                 -      1.7837  0.8703  0.5572  0.4160  0.3769  0.3274  0.2917  0.2995  0.2779
Standard deviation [m]   -      1.8950  0.7073  0.4281  0.3203  0.2739  0.2476  0.2020  0.2371  0.2030
Fig. 7. SURF correspondences between both catadioptric images (System B) and the corresponding reconstructed 3D points in top view.
the catadioptric/catadioptric system allows for feature matching and 3D reconstruction also on both sides of the vehicle. Hence, it is possible to track objects which are not detected in the frontal direction at first. Moreover, lane markings on the side of the vehicle can be detected and used, e.g., to improve localization. The results of this evaluation showed that the use of catadioptric cameras improves many intelligent vehicle applications. Based on the investigation in this study, the catadioptric/catadioptric system turned out to be more promising for further evaluation. Our next steps will include an implementation for object detection and tracking in the entire vehicle surrounding from catadioptric images of a catadioptric/catadioptric stereo system. The possibility of using lane markings next to the vehicle to improve localization or lane departure warning will also be evaluated.
REFERENCES
[1] A. Geiger, M. Lauer, and R. Urtasun, “A generative model for 3d urban scene understanding from movable platforms,” in Computer Vision and Pattern Recognition (CVPR), Colorado Springs, USA, June 2011. [2] C. Geyer and K. Daniilidis, “A unifying theory for central panoramic systems and practical applications,” in ECCV ’00: Proceedings of the 6th European Conference on Computer Vision-Part II. London, UK:
Springer-Verlag, 2000, pp. 445–461. [3] S. Baker and S. K. Nayar, “A theory of single-viewpoint catadioptric image formation,” International Journal of Computer Vision, vol. 35, no. 1, pp. 1–22, 1999. [4] R. Benosman and S. B. Kang, Eds., Panoramic Vision: Sensors, Theory, and Applications. Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2001.
[5] T. Ehlgen, T. Pajdla, and D. Ammon, “Eliminating blind spots for assisted driving,” vol. 9, no. 4, pp. 657–665, December 2008. [6] S. Y. Cheng and M. M. Trivedi, “Toward a comparative study of lane tracking using omni-directional and rectilinear images for driver assistance systems,” in ICRA 2007 Workshop: Planning, Perception and
Navigation for Intelligent Vehicles, 2007.
[7] D. Scaramuzza and R. Siegwart, “Monocular omnidirectional visual
odometry for outdoor ground vehicles,” Springer Lecture Notes in Computer Science, Computer Vision Systems, 2008. [8] Z. Zhu, “Omnidirectional stereo vision,” in Proc. of ICAR 01, 2001, pp.
22–25.
[9] J. Gluckman, S. K. Nayar, and K. J. Thoresz, “Real-time omnidirectional and panoramic stereo,” in Proceedings of the 1998 DARPA Image Understanding Workshop. Morgan Kaufmann, 1998, pp. 299–303. [10] S. Yi and N. Ahuja, “An omnidirectional stereo vision system using a single camera,” in ICPR ’06: Proceedings of the 18th International Conference on Pattern Recognition. Washington, DC, USA: IEEE Computer Society, 2006, pp. 861–865. [11] T. Gandhi and M. Trivedi, “Vehicle surround capture: Survey of techniques and a novel omni-video-based approach for dynamic panoramic surround maps,” vol. 7, no. 3, pp. 293–308, September 2006. [12] T. Ehlgen, M. Thom, and M. Glaser, “Omnidirectional cameras as backing-up aid,” Computer Vision, IEEE International Conference on, vol. 0, pp. 1–5, 2007. [13] M. Lauer, M. Schönbein, S. Lange, and S. Welker, “3d-objecttracking with a mixed omnidirectional stereo camera system,” Mechatronics, vol. 21, no. 2, pp. 390–398, 2011, special Issue on Advances in intelligent robot design for the Robocup Middle Size League. [14] P. Sturm, “Mixing catadioptric and perspective cameras,” Omnidirectional Vision, Workshop on, vol. 0, p. 37, 2002. [15] G. Adorni, L. Bolognini, S. Cagnoni, and M. Mordonini, “Stereo obstacle detection method for a hybrid omni-directional/pin-hole vision system,” in RoboCup, 2001, pp. 244–250. [16] T. Svoboda and T. Pajdla, “Epipolar geometry for central catadioptric cameras,” Int. J. Comput. Vision, vol. 49, no. 1, pp. 23–37, 2002. [17] C. Mei and P. Rives, “Single view point omnidirectional camera calibration from planar grids,” in IEEE International Conference on Robotics and Automation, April 2007. [18] J. Y. Bouguet, “Camera calibration toolbox for Matlab,” 2008. [Online]. Available: http://www.vision.caltech.edu/bouguetj/calib_doc/
[19] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (SURF),” Comput. Vis. Image Underst., vol. 110, no. 3, pp. 346–359, 2008. [20] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge University Press, ISBN: 0521540518, 2004.
On-Board Perception and Motion Planning for Legged Locomotion over Rough Terrain
Dominik Belter∗ Przemysław Łabęcki∗ Piotr Skrzypczyński∗
∗ Institute of Control and Information Engineering, Poznań University of Technology, Poznań, Poland
Abstract— This paper addresses the issues of perception and motion planning in a legged robot walking over rough terrain. The solution takes the limited perceptual capabilities of the robot into account. A passive stereo camera and a 2D laser scanner are considered for sensing the terrain, with environment representations matching the uncertainty characteristics of these sensors. The motion planner adheres to the dual-representation concept by applying the A⋆ algorithm for strategic level planning on a coarse elevation grid from stereo data, and then using an RRT-based method to find a sequence of feasible motions on a more precise but smaller map obtained by using the laser data. Results of simulations and experiments on real data are provided.
Index Terms— legged robot, terrain mapping, path planning
I. INTRODUCTION
In this paper we introduce a motion planning system that allows a legged robot to traverse previously unseen rough terrain using only on-board perception. In our previous work [9] we presented a mapping system for the Messor hexapod robot (Fig. 1A), which uses a tilted-down miniature 2D laser scanner Hokuyo URG-04LX to acquire the terrain profile. The obtained elevation map is then used for planning of the footholds [1], and as shown in our recent work [2], can be a basis for local motion planning. However, the perception range of the tilted scanner is too small to plan an effective path of the robot on-line, while perceiving the terrain.
Fig. 1. Messor robot with the dual-mode terrain perception system (A), and left camera image from the STOC system (B).
Hence, we added a stereo camera to obtain a more extended representation of the terrain (because of the limited payload of the robot we cannot use a 3D laser scanner). This permits to extend our previous motion planning approach [2] to a “strategic” level, allowing the robot to devise a path to a more distant goal. The motion planning approach is closely related to the perception capabilities of our robot. The stereo camera perceives the environment up to several metres from the robot (Fig. 1B), but the range data are sparse and corrupted by many artefacts. On the other hand, the laser scanner, which is much more precise in acquiring the terrain profile, provides only local data. Therefore, we conceived a dual-resolution terrain
representation (Fig. 2), which employs a coarse “strategic” elevation grid, describing the more distant areas perceived by means of the camera, and a smaller, fine-grained “tactical” elevation map that surrounds the robot, and is updated from the URG-04LX range data. The robot is located in the centre of these maps, and both maps are translated while the robot is walking. With such a terrain representation the approach to motion planning resembles human behaviour during a trip in rugged terrain: we usually plan a general path by looking at the area from a distance to avoid trees, large stones, ditches, etc., and then we follow this path, focusing our visual attention on the terrain at our feet in order to avoid smaller obstacles and find appropriate footholds. These two planning levels are implemented in the robot by two different algorithms. The Rapidly-exploring Random Tree (RRT) algorithm [10], which provides a single-step, probabilistically complete planning method for high-dimensional configuration spaces, is applied at the tactical level. At this level the RRT-based planner handles the 6-d.o.f. robot pose and the 18-dimensional space defined by the legs’ joints. However, the probabilistic nature of RRT often leads to exploring spaces unrelated to any reasonable path.
Fig. 2. Concept of the terrain representation and motion planning approach.
To mitigate this problem the strategic-level path planner is added, which uses the heuristic, non-randomized A ⋆ al- gorithm. This algorithm works in a lower-dimensional search space, because the coarse elevation map allows only to roughly determine traversability between two cells. However, the A ⋆ algorithm is guaranteed to yield a path optimal up to the resolution of the map. The path planned at the strategic level is then used to define local sub-goals for the tactical RRT-based planner. Because the two planning levels run repeatedly as the robot moves and perceives new areas, the fine-grained motion plan yielded by the RRT follows the general path from the A ⋆
planner, but can make detours if the strategic plan turns out to be locally unfeasible for the walking robot.
II. RELATED WORK
This section briefly surveys the terrain perception and motion planning works that are most directly related to our approach. Visual terrain perception in legged robots was pioneered by the work on Ambler [7], which used a laser scanner to build an elevation grid of the surroundings. Passive stereo vision, which is a popular sensing modality in robotics, has so far found limited use in terrain perception for walking robots. One notable example is [13]; however, it relies on the specific mechanics of the RHex robot, which provides robust motion in moderately rough terrain without explicit foothold selection. Also Kolter et al. [5] describe a perception system with an on-board stereo camera, but off-board data processing is used. The terrain modelling method proposed in [5] focuses on filling-in the occluded areas of the terrain that arise due to the line-of-sight constraints of the camera. Elevation values at unseen locations are also predicted in [12] by applying Gaussian process regression to 3D laser data. This approach enables filling-in large discontinuities, but its computational cost is high. Conversely, we use not only a long-range sensor (stereo camera), but also the 2D laser scanner, which yields dense data and is purposefully configured to minimize discontinuities in the obtained map [9]. The notion of a variable-resolution grid map for motion planning is also used in [4]. The spatial resolution of the map decreases with the distance from the robot, which allows an optimal A⋆ path planner to be used in real-time for an environment with moving obstacles. A similar grid map concept is applied in our work, however it is motivated mostly by the properties of the sensors being used. Our strategic level path planner exploits some form of cost map computed over this variable-resolution grid. It is based on the spherical variance, which is used also by [14] in connection with the Fast Marching wave-propagating path planner. Motion planning for a legged robot is computationally expensive because of the huge search space with numerous constraints. Some researchers try to simplify the problem, e.g. by adapting existing 2D planning techniques to operate in 3D terrain. Such an approach may generate quite conservative motion plans, which do not take full advantage of the locomotion capabilities of a legged robot. Another way of addressing the issue of complexity is presented by Kolter et al. [6], who at first plan a path of the LittleDog robot’s trunk without considering the trajectories of its feet. Then, the footholds are planned to follow the initial path while considering appropriate constraints. This approach also may result in motion plans that do not consider all possible paths, as some solutions could be eliminated in an early stage of planning. The high-dimensionality issue in motion planning for legged robots can be tackled by employing a randomized planning algorithm. In particular, Hauser et al. [3] successfully used the Probabilistic Roadmaps technique to plan motion of a six-legged lunar robot. In a recent work [16] the R⋆ algorithm, which combines aspects of deterministic and randomized search, is used.
III. TERRAIN PERCEPTION AND MODELLING
Because the terrain map updating procedure applied here is virtually identical to the one we have used for the laser-only perception [9], we provide here only a brief description of the terrain modelling sub-system, focusing on the added stereo vision data processing. The foothold selection procedure [1] requires a high res- olution elevation map, which reflects geometric properties of the ground at scale of the robot foot size. Such a map is created incrementally, as the robot moves, using range data from the tilted URG-04LX laser scanner. Currently, self-localization is accomplished by means of proprioceptive sensing supported by a vision-based SLAM procedure [15]. The mapping algorithm maintains two co-located grids of the same size: an elevation grid and a certainty grid. Each cell
in the certainty grid holds a value that describes the accuracy of the corresponding cell’s estimate of the elevation. We use
a local grid with cells of 15×15 mm in size, and maximal
dimensions of 2×2 m, which are determined by the range of the URG-04LX sensor in the tilted configuration.
Fig. 3. STOC camera results (brighter colours correspond to objects closer to the camera): disparity image with default filter thresholds (A), disparity images with lowered filter thresholds (B), disparity image after removing pixels of high variance (C), disparity image after final filtering (D)
Long-range terrain perception is accomplished by means of passive stereo vision. Stereo cameras are lightweight and inexpensive sensors, which can be used to obtain 3D data for
a range of several metres, depending on the base-line used. To
cope with the high computational requirements of the stereo image processing in a legged robot with limited on-board computing power, we use the STereo-On-Chip (STOC) camera from Videre Design, which provides embedded image processing. The STOC camera implements the Small Vision System stereo algorithm based on area correlation and generates a point cloud of 640×480 resolution at 30 frames per second. However, the small base-line (only 6 cm) results in a limited practical range of the sensor, while changing illumination and insufficient texture of observed objects contribute to artefacts and large empty areas in the disparity images. Although there are built-in filters in the camera, they do not remove all of the artefacts, and there is a need for measurement post-processing [8]. The artefacts that are most problematic for the built-in filters are those on the borders of objects with disparity values much different from the background. Such measurements can be removed by analyzing the depth image. The artefact removal procedure is illustrated for a simple indoor mockup scene depicted in Fig. 1B. Figure 3A shows the disparity image acquired with default filter thresholds. There are few false data in the disparity image, but large patches of the terrain are missing. Lowering the filter thresholds provides a much richer image, but introduces some artefacts (Fig. 3B).
Some of the invalid disparity values are of random nature and can be eliminated by analyzing several consecutive images and discarding pixels of large disparity variance (Fig. 3C). There are also some false measurements that are stable from image to image (encircled by red lines in Fig. 3C). Some of these artefacts can be removed at the expense of increasing the minimal measurement distance (Fig. 3D).
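The temporal filter that discards pixels with unstable disparity can be sketched as follows; a simplified illustration (not the actual STOC post-processing), assuming a stack of consecutive disparity images with invalid pixels stored as NaN and a variance threshold that would have to be tuned.

```python
import numpy as np

def filter_unstable_disparities(disparity_stack, max_variance=4.0, min_valid=3):
    """disparity_stack: array of shape (T, H, W), consecutive disparity images,
    invalid measurements encoded as NaN. Returns a filtered disparity image."""
    valid_count = np.sum(~np.isnan(disparity_stack), axis=0)
    mean_disp = np.nanmean(disparity_stack, axis=0)
    var_disp = np.nanvar(disparity_stack, axis=0)

    # Keep only pixels that were measured often enough and whose disparity
    # is stable over time; everything else is treated as an artefact.
    stable = (valid_count >= min_valid) & (var_disp <= max_variance)
    return np.where(stable, mean_disp, np.nan)
```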
Fig. 4. Voting scheme for stereo data filtering and elimination of artefacts.
Although the depth image analysis removes much of the artefacts, we apply a simple voting scheme in a local 3D data structure (Fig. 4) in order to eliminate as many spurious measurements as possible before integrating the measurements into the actual elevation map. The obtained point cloud (green points) is projected into a 3D grid. Usually more than one measured point falls into a single cell, which has the height of 0.1 m. Each cell in the 3D grid is treated as a separate bin, which has its index of votes increased whenever a range measurement “hits” that bin. After projecting all the points, for each cell of the 2D “floor” the bin in its respective vertical “column”, that has the largest number of votes is selected. The final elevation value in the given cell is computed as a weighted average of the vertical coordinates of the mea- surements accumulated in the winning bin, with the disparity standard deviation used as the weighting factor. The final elevation values are shown as coloured patches in Fig. 4. Measurements allocated to other bins in the given column are discarded. Moreover, points with the elevation |h | > 0.5 m are discarded, as we are only interested in the terrain description.
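The voting scheme reduces to a histogram over vertical bins followed by a weighted average in the winning bin. A minimal sketch under the assumption that the stereo point cloud is already expressed in the map frame (coordinates in metres) and that each point carries the standard deviation of its disparity measurement; the cell sizes follow the text, everything else is illustrative.

```python
import numpy as np
from collections import defaultdict

def elevation_from_votes(points, sigmas, cell_xy=0.1, cell_h=0.1, h_max=0.5):
    """points: (N, 3) array of (x, y, h) in metres; sigmas: (N,) disparity standard
    deviations. Returns {(i, j): elevation} for each 2D cell of the 'floor' grid."""
    keep = np.abs(points[:, 2]) <= h_max               # only terrain-level points are of interest
    points, sigmas = points[keep], sigmas[keep]

    columns = defaultdict(lambda: defaultdict(list))    # (i, j) -> k -> [(h, sigma), ...]
    for (x, y, h), s in zip(points, sigmas):
        i, j = int(np.floor(x / cell_xy)), int(np.floor(y / cell_xy))
        k = int(np.floor(h / cell_h))                   # vertical bin index (0.1 m bins)
        columns[(i, j)][k].append((h, s))

    elevation = {}
    for cell, bins in columns.items():
        winner = max(bins.values(), key=len)            # bin with the largest number of votes
        h_vals = np.array([h for h, _ in winner])
        weights = np.array([s for _, s in winner])      # disparity std as weighting factor (as in the text)
        elevation[cell] = float(np.average(h_vals, weights=weights))
    return elevation
```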
Fig. 5. Elevation maps obtained with an indoor mockup: stereo data only (A), and combined stereo and laser data (B).
Long-range perception with the STOC sensor allows to create a much bigger, “strategic” elevation map. The STOC camera version we use can perceive objects as far as 5 m, thus the map size can be set to 10 × 10 m. Because of the artefacts and empty areas the precision of the strategic map is quite low (Fig. 5A). Therefore, the size of its grid
cell is 0.1×0.1 m. This map allows to roughly plan the path of the robot, but doesn’t allow foothold selection. The strategic elevation map is updated only from stereo data, while the smaller tactical map updating procedure uses mostly the URG-04LX range measurements. However, as the robot walks forward the precise map gradually overlaps the area that was already mapped with the STOC camera. To re-use the already acquired elevation data in the precise map we implemented a simple fusion mechanism. As the laser data are much more reliable, these data override the stereo data in all cells that have the certainty value above a given threshold (i.e. they were successfully measured several times). However, if there are low-certainty cells in the tactical map, they are filled with the elevation data from the corresponding (but larger) cells of the less precise map from stereo. Results of this fusion procedure are shown in Fig. 5B (stereo and laser data mapped to different colours to make them distinguishable).
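The fusion rule itself is a per-cell decision. A simplified sketch, assuming both maps are stored as numpy arrays that have been cropped and aligned so that index (0, 0) of each refers to the same corner of the overlapping area; the certainty threshold and the resolutions are illustrative placeholders.

```python
import numpy as np

def fuse_maps(laser_elev, laser_certainty, stereo_elev, certainty_thr=0.7,
              tactical_res=0.015, strategic_res=0.1):
    """Fill low-certainty cells of the fine tactical map with values from the
    coarse strategic map; laser data always take precedence where reliable."""
    fused = laser_elev.copy()
    scale = strategic_res / tactical_res              # tactical cells per strategic cell

    rows, cols = np.nonzero(laser_certainty < certainty_thr)
    for r, c in zip(rows, cols):
        # index of the (larger) strategic cell that covers this tactical cell
        R, C = int(r / scale), int(c / scale)
        fused[r, c] = stereo_elev[R, C]               # stereo data only fill the gaps
    return fused
```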
IV. PERCEPTION-BASED MOTION PLANNING
A. Outline of the planning procedure
Our motion planning method involves two algorithms. The whole procedure (Fig. 6) finds and executes a path between the current start configuration q c of the robot and the goal configuration q g . Although the configuration (denoted by q ) is defined in the 24-dimensional space constituted by the 6-d.o.f. pose and the 18-dimensional posture of the robot, in some operations of the planning algorithm particular variables are set to default values, effectively reducing the dimensionality. At first, the planning procedure finds a coarse path P A between the current and the goal pose by using the A ⋆ algorithm, which handles only the 2D position of the robot’s trunk, assuming a default posture. If the A ⋆ fails to yield a path, the goal is considered to be unreachable. If a path is found, a temporary goal configuration q t is created on the path P A . The temporary goal is located in the distance d RRT from q c , which at start is equal to the current robot pose q R . Then the RRT-based planning procedure searches for a full motion plan between the q c and q t configurations. If this search is successful, the robot executes the desired sequence of motions, but only up to the limits of the known tactical map (this configuration is denoted by q m ). However, if the RRT-based algorithm fails to find a feasible motion sequence to q t , the distance d RRT is increased, allowing the tactical path planner to make a bigger detour from the general path found by the strategic planner. While walking, the robot acquires new data which are stored in the tactical and the strategic elevation maps. After reaching a temporary goal the path planning procedures (both A ⋆ and RRT-based) are repeated on the updated maps. The whole motion planning procedure stops when the robot reaches the goal configuration or the RRT-based planner fails during searching for a motion plan between q c and q g (cf. Fig. 6).
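The interplay of the two planners (Fig. 6) can be summarized as a control loop; the sketch below uses hypothetical callables standing in for the A⋆ planner, the RRT-based planner, motion execution up to the limit of the tactical map, and the map update, so it illustrates the control flow rather than the actual implementation.

```python
def plan_and_walk(q_start, q_goal, a_star, rrt_plan,
                  execute_until_map_limit, update_maps, d_rrt0=1.2):
    """Two-level planning loop: A* on the coarse strategic map guides the
    RRT-based planner on the fine tactical map (cf. Fig. 6)."""
    q_c = q_start
    while q_c != q_goal:
        path_A = a_star(q_c, q_goal)                 # strategic path on the coarse grid
        if path_A is None:
            return False                             # goal considered unreachable
        d_rrt = d_rrt0
        while True:
            # temporary goal q_t located on the A* path, d_rrt away from q_c
            q_t = path_A.point_at_distance(d_rrt, from_config=q_c)
            motion = rrt_plan(q_c, q_t)              # full motion plan on the tactical map
            if motion is not None:
                break
            if q_t == q_goal:
                return False                         # RRT failed on the final goal: give up
            d_rrt += d_rrt0                          # allow a bigger detour from the A* path
        q_c = execute_until_map_limit(motion)        # walk only within the known tactical map
        update_maps()                                # new laser/stereo data, then replan
    return True
```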
B. Strategic level planning
The A ⋆ algorithm performs strategic-level path planning on the strategic map obtained from the stereo camera. This representation is sufficient for coarse path planning, which allows to obtain a general direction of further motion, i.e., to
guide the RRT-based planner. Moreover, the lower-resolution grid allows A ⋆ to run in real-time.
A cost function and a heuristic distance function must be defined for the A⋆ path planner. We defined a transition cost between two neighbouring nodes by using two coefficients and a procedure that finds non-traversable transitions. The first coefficient c_1 simply defines the Euclidean distance between the centres of two neighbouring cells. The second coefficient c_2 is based on the spherical variance [14], which describes the “roughness” of the terrain. The higher the value, the bigger the difficulties the robot can expect in finding appropriate footholds. To compute this coefficient the elevation map is converted to a triangle mesh by using the considered [i, j] cell and its neighbours. The vector \vec{N}_k = (x_k, y_k, z_k) normal to each triangle on the surface is computed. Then the module R of the vector set \vec{N}_k is computed:

R = \sqrt{\left(\sum_{k=0}^{n} x_k\right)^2 + \left(\sum_{k=0}^{n} y_k\right)^2 + \left(\sum_{k=0}^{n} z_k\right)^2}. \qquad (1)

The module R is divided by the number of vectors n (we use n = 8) to normalize the values into a range between 0 and 1, and the coefficient is computed as follows:

c_2 = 1 - \frac{R}{n}. \qquad (2)
Prior to computing the final transition cost c final the kine- matic ability of the robot to traverse between the two given nodes is considered by the SIMPLE TRAVERSE procedure. While planning, the roll, pitch and yaw angles of the robot’s trunk are assumed to be zero, and the robot uses nominal footholds, resulting from the gait pattern. Finally, if the transition between the initial and the goal state is possi- ble from the point of view of our hexapod’s kinematics, SIMPLE TRAVERSE returns “True”, and c final coefficient is
computed as a mean value of c 1 and c 2 coefficients. If the transition is not achievable (SIMPLE TRAVERSE=False) the
final cost c final is set to infinity. We use Euclidean distance to define the heuristic cost in A ⋆ , which ensures that we never overestimate the heuristic part of the cost, and that we fulfill the monotony condition, guaranteeing the optimality of the planner.
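Equations (1)–(2) and the combination into c_final translate directly into code. A small sketch assuming the unit normal vectors of the n triangles around the target cell are already available and that a simple_traverse predicate implements the kinematic check; all names are illustrative.

```python
import numpy as np

def transition_cost(cell_a, cell_b, normals_b, cell_size, simple_traverse):
    """Cost of moving between two neighbouring cells of the strategic map.
    normals_b: (n, 3) array of unit normals of the triangles around the target cell."""
    if not simple_traverse(cell_a, cell_b):        # kinematically impossible transition
        return np.inf

    # c1: Euclidean distance between the cell centres
    c1 = cell_size * np.linalg.norm(np.subtract(cell_a, cell_b))

    # c2: spherical variance (Eqs. 1-2), a "roughness" measure of the terrain
    n = normals_b.shape[0]                         # the paper uses n = 8
    R = np.linalg.norm(normals_b.sum(axis=0))      # module of the summed normal vectors
    c2 = 1.0 - R / n

    return 0.5 * (c1 + c2)                         # c_final = mean of c1 and c2
```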
C. Tactical level planning
To find a precise sequence of movements for the legged
robot, an algorithm working in a continuous configuration
space should be used, thus graph search methods like A ⋆ are unacceptable. We chose a RRT-based planner because
of its ability to quickly explore the high dimensional search space. The tactical level planner is based on the RRT-Connect
concept [10], which is used here as a meta-algorithm, invoking several lower-level planning modules. The particular sub-procedures of this meta-algorithm are explained further in the paper. Like RRT-Connect, the RRT_MOTION_PLANNING procedure (Alg. 1) creates two random trees. Each node in a tree represents a kinematic configuration of the robot. A child node is connected to the parent node if a transition between these two configurations is possible. The root configuration of the first tree T_c is located in the current configuration q_c. The root configuration q_t of the second tree T_t defines a temporary goal state of the robot. For the T_c tree the motion is planned forward, whereas the direction in the T_t tree is inverted.

Algorithm 1
procedure RRT_MOTION_PLANNING(q_c, q_t)
    T_c.Init(q_c); T_t.Init(q_t)
    for k := 1 to K do
    begin
        q_rand := RANDOM_CONFIG;
        if not (q_new := EXTEND(T_c, q_rand)) = Trapped
            if EXTEND(T_t, q_new) = Reached
                return PATH(T_c, T_t)
        SWAP(T_c, T_t);
    end;
At first, the procedure RANDOM CONFIG randomly selects
a pose of the robot q rand (only the x and y coordinates) on the given elevation map. Then the EXTEND operation extends the tree and tries to build new branch in the direction of q rand . If it is possible, and the new node q new is created, the algorithm extends the second tree in the direction of q new node. In the next iteration of the algorithm the order of the trees is swapped (SWAP procedure). As a result, the T t tree is extended at the beginning of the next iteration. The maximal number of
iterations is defined by K . When the two trees are connected and the EXTEND proce- dure returns “Reached“ value, the algorithm finds the shortest path between q c and q t . The q new configuration is common for the two trees. The PATH procedure finds the path from q new to the root nodes in both trees. The sequence of the nodes for T t tree should be inverted. Finally, the sequence of configurations which allows to move from the initial to the goal configuration is obtained. The EXTEND procedure (Alg. 2) plays a key role in the RRT-based planner for a legged robot. It checks if a transition between two given configurations of the robot is possible. At the beginning the NEAREST NEIGHBOUR operation finds the existing node q near , which is the closest one to the new, random position q rand . Then, a new configuration is created by using the module, which selects footholds and creates a posture (CREATE POSTURE). Then, the procedure CHECK TRAVERSE verifies if the transition from q rand to q near
is possible. Trajectories for the feet and a path for the trunk are created. Then, the planner checks if the desired sequence of the robot's postures is achievable (lack of collisions, acceptable footholds inside the workspaces) and safe (the robot maintains static stability). If a transition to this configuration is impossible, another configuration q_rand, which is however closer to q_near, is created by the REDUCE_DISTANCE procedure. The maximal number of iterations is N.

Algorithm 2
procedure EXTEND(T, q_rand)
    q_near := NEAREST_NEIGHBOUR(T, q_rand)
    for n := 1 to N do
    begin
        CREATE_POSTURE(q_rand);
        if CHECK_TRAVERSE(q_rand, q_near) = true
            if n = 1
                return q_rand, Reached;
            else
                return q_rand, Advanced;
        REDUCE_DISTANCE(q_rand, q_near)
    end;
    return Trapped;
If the considered q rand node is located inside the tactical map the robot plans its movements using all the modules described in Subsection IV-D. But whenever the d RRT distance increases beyond the tactical map area, this node can be outside the precise map. In such a case the movement is planned by using a simplified method that roughly determines traversability between two nodes. The CREATE POSTURE and CHECK TRAVERSE procedures are not used, replaced by the default posture and SIMPLE TRAVERSE procedure, used by the strategic planner. However, as the robot advances on its path and the precise map covers new areas the motion sequence is replanned with the full-featured version.
D. Main modules of the tactical motion planner
The module responsible for maintaining an acceptable posture creates the robot's configuration at the given (x, y) location on the elevation map. It computes the required inclination of the robot's trunk and its elevation above the terrain. The static stability of the new posture is checked using a simple but fast procedure implementing the stability criterion defined in [11] (a sketch is given below). For a given posture of the body, the robot has to find appropriate footholds, using the algorithm proposed in [1]. The aim of this algorithm is to find an analytical relation between simple characteristics of the local shape of the terrain and the predicted slippage of the robot's feet, i.e., the slippage expected when the robot chooses the considered point of the elevation map. As shown in [1], the robot learns without supervision how to determine the footholds; it then uses the acquired experience to predict the slippage and to minimize it by choosing proper points from the map. Finally, to determine whether the considered footholds are reachable, the planner checks if each foothold is located inside the workspace of the respective leg. Points that are more distant from the borders of the leg's workspace are preferred, to avoid potential kinematic deadlocks. Legs of the robot can collide if a selected foot position is inside the workspace of another leg. To detect such situations, the planner is equipped with a module for collision detection, which uses a simplified CAD model of the robot (see [2] for details). In the transition between two configurations the path of the robot's trunk is determined as a straight line. However, the motion planner also has to determine whether the robot is able to move its legs in such a way that the goal state is reached, e.g., by moving the legs above an obstacle. Therefore, a fast trajectory planner for the feet of the robot is implemented, which takes into account the shape of the terrain as modelled by the detailed elevation map. The procedure for planning trajectories of the feet also sets the sequence of swing and stance phases for the legs, which defines the gait. Although currently the motion planner applies a wave or a tripod gait, it can plan the movements using any kind of cyclic gait.
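For illustration, a support-polygon test in the spirit of the criterion of [11] can be sketched as follows (a simplified 2D Python sketch; it assumes the stance-foot positions are already projected onto the ground plane and given in counter-clockwise order, which is our assumption, not the paper's implementation):

import numpy as np

def statically_stable(com_xy, stance_feet_xy, margin=0.02):
    # com_xy: projection of the centre of mass onto the ground plane;
    # stance_feet_xy: stance-foot (x, y) positions in counter-clockwise
    # order, forming the support polygon; margin in metres.
    com = np.asarray(com_xy, dtype=float)
    feet = [np.asarray(f, dtype=float) for f in stance_feet_xy]
    n = len(feet)
    for i in range(n):
        a, b = feet[i], feet[(i + 1) % n]
        edge = b - a
        inward = np.array([-edge[1], edge[0]])  # inward normal for a CCW polygon
        inward /= np.linalg.norm(inward)
        # the CoM must lie at least `margin` metres inside every edge
        if np.dot(com - a, inward) < margin:
            return False
    return True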
Because purely position-based control is risky on rough terrain, where the internal representation of the environment is uncertain, the robot uses the force sensors in its feet to detect contact with the ground and to stop the movement once the support provided by the terrain is sufficient.
V. RESULTS
Initially, the proposed motion planning method was tested with a physics-engine-based simulator of the Messor robot. A specially prepared terrain model, which includes a real map of a rocky terrain acquired using the URG-04LX sensor, was used in the simulator¹. Example results are shown in Fig. 7A, where the three paths (depicted in shades of blue) represent three independent simulations of the motion planning procedure. The branches of the resulting RRT trees (for one of the three paths) are shown in green. Although the randomized RRT algorithm is used, the results are repeatable thanks to the guidance provided by the strategic planner. On a 2 GHz PC the average execution time of the A⋆ planner is 10 s, and execution of the RRT planner also takes 10 s for d_RRT = 1.2 m.
Fig. 7. Motion planning results in the simulator: paths obtained with the proposed two-level planner (A); path planned by using the RRT-Connect algorithm (B)
1 A movie clip is available at http://lrm.cie.put.poznan.pl/ecmr2011.wmv
For the sake of comparison, we show how the same problem is solved by the RRT_MOTION_PLANNING procedure used without the strategic-level planner (Fig. 7B). This method, being in fact our implementation of the RRT-Connect idea for the walking robot, searches over the whole map, which is time-consuming. In contrast, in the proposed two-level planning approach the same procedure is focused on exploring the areas around the solution provided by the A⋆ planner.
Fig. 8. Experimental results obtained with the indoor mockup
The described motion planning method and the terrain representation were also verified in a series of small-scale indoor and outdoor experiments, involving a mockup of a rocky terrain (cf. Fig. 1) and a real rubble-covered terrain with some vegetation. Figure 8 visualizes two moments during an indoor experiment, when the robot is about 40 cm from the start point (A) and about one metre from the start (B). The general path provided by the strategic planner is red, the RRT tree is shown in green, and the final path is black. This figure also shows the local feet trajectories (blue lines). Both the general path and the motion plan get updated as the robot receives new information about the terrain. Figure 9 shows a similar situation during an outdoor experiment, illustrating the ability of the on-board perception system to handle natural scenes. The distance between the two configurations shown in Fig. 9A and B is 80 cm.
Fig. 9. Experimental results obtained outdoors, in a rubble-covered area
VI. CONCLUSIONS AND FUTURE WORK
The motion planning system presented in this paper combines the strengths of two popular approaches to path planning: the RRT and the graph-search-based approach. Moreover, it provides a natural decomposition of the computationally expensive task, which, being inspired by human behaviour, is well suited for the dual-mode terrain perception system of the Messor robot. It allows the RRT algorithm to take advantage of the guidance provided by the strategic planner, but at the same time does not eliminate any solutions in the early stage of planning: if the A⋆ path is infeasible for legged locomotion (e.g., because of poor cost assessment), the whole procedure reduces to an RRT-Connect-like planner, which takes much more time, but finds a path if one exists. Future work includes improved perception from the STOC camera and improvements to the low-level modules, which should make the RRT-based planner fully real-time. The replacement of the A⋆ algorithm with a dynamic planner (e.g., D⋆ Lite) is also considered, to make the replanning step faster.
REFERENCES
[1] D. Belter, P. Łabęcki, and P. Skrzypczyński. Map-based adaptive foothold planning for unstructured terrain walking. In Proc. IEEE Int. Conf. on Robot. & Autom., pages 5256–5261, Anchorage, 2010.
[2] D. Belter and P. Skrzypczyński. Integrated motion planning for a hexapod robot walking on rough terrain. In Prepr. 18th IFAC World Congress, Milan, 2011 (to appear).
[3] K. Hauser, T. Bretl, J.-C. Latombe, and D. Wilcox. Motion planning for a six-legged lunar robot. In 7th Int. Workshop on the Algorithmic Foundations of Robotics, pages 301–316, New York, 2006.
[4] R. Kirby, R. Simmons, and J. Forlizzi. Variable sized grid cells for rapid replanning in dynamic environments. In Proc. IEEE/RSJ Int. Conf. on Intel. Robots & Syst., pages 4913–4918, St. Louis, 2009.
[5] J. Z. Kolter, Y. Kim, and A. Y. Ng. Stereo vision and terrain modeling for quadruped robots. In Proc. IEEE Int. Conf. on Robot. & Autom., pages 1557–1564, Kobe, 2009.
[6] J. Z. Kolter, M. P. Rodgers, and A. Y. Ng. A control architecture for quadruped locomotion over rough terrain. In Proc. IEEE Int. Conf. on Robot. & Autom., pages 811–818, Pasadena, 2008.
[7] E. Krotkov and R. Simmons. Perception, planning, and control for autonomous walking with the Ambler planetary rover. Int. Journal of Robotics Research, 15(2):155–180, 1996.
[8] P. Łabęcki. Improved data processing for an embedded stereo vision system of an inspection robot. In R. Jablonski et al., editors, Recent Advances in Mechatronics, Springer, Berlin, 2011 (to appear).
[9] P. Łabęcki, D. Rosiński, and P. Skrzypczyński. Terrain perception and mapping in a walking robot with a compact 2D laser scanner. In H. Fujimoto et al., editors, Emerging Trends in Mobile Robotics, pages 981–988, World Scientific, Singapore, 2010.
[10] J. J. Kuffner Jr. and S. M. LaValle. RRT-Connect: An efficient approach to single-query path planning. In Proc. IEEE Int. Conf. on Robot. & Autom., pages 995–1001, San Francisco, 2000.
[11] R. B. McGhee and G. I. Iswandhi. Adaptive locomotion for a multilegged robot over rough terrain. IEEE Trans. on Systems, Man, and Cybernetics, SMC-9(4):176–182, 1979.
[12] C. Plagemann, S. Mischke, S. Prentice, K. Kersting, N. Roy, and W. Burgard. A Bayesian regression approach to terrain mapping and an application to legged robot locomotion. Journal of Field Robotics, 26(10):789–811, 2009.
[13] R. B. Rusu, A. Sundaresan, B. Morisset, K. Hauser, M. Agrawal, J.-C. Latombe, and M. Beetz. Leaving flatland: efficient real-time three-dimensional perception and motion planning. Journal of Field Robotics, 26(10):841–862, 2009.
[14] L. Sanctis, S. Garrido, L. Moreno, and D. Blanco. Outdoor motion planning using fast marching. In O. Tosun et al., editors, Mobile Robotics: Solutions and Challenges, pages 1071–1080, World Scientific, Singapore, 2009.
[15] A. Schmidt and A. Kasiński. The visual SLAM system for a hexapod robot. In L. Bolc et al., editors, Computer Vision and Graphics, volume 6375 of LNCS, pages 260–267, Springer, Berlin, 2010.
[16] P. Vernaza, M. Likhachev, S. Bhattacharya, S. Chitta, A. Kushleyev, and D. D. Lee. Search-based planning for a legged robot over rough terrain. In Proc. IEEE Int. Conf. on Robot. & Autom., pages 2380–2387, Kobe, 2009.
A Sampling Schema for Rapidly Exploring Random Trees using a Guiding Path
Vojtěch Vonásek, Jan Faigl∗, Tomáš Krajník, Libor Přeučil
Department of Cybernetics, ∗ Center for Applied Cybernetics Faculty of Electrical Engineering, Czech Technical University in Prague {vonasek,xfaigl,tkrajnik,preucil}@labe.felk.cvut.cz
Abstract — In this paper, a novel sampling schema for Rapidly Exploring Random Trees (RRT) is proposed to address the narrow passage issue. The introduced method employs a guiding path to steer the tree growth towards a given goal. The main idea of the proposed approach is to prefer sampling of the configuration space C along a given guiding path over sampling of the whole space. While for a low-dimensional C the guiding path can be found as a geometric path in the robot workspace, such a path does not provide useful information for efficient sampling of a high-dimensional C. We propose an iterative scaling approach to find a guiding path in such high-dimensional configuration spaces. The approach starts with a geometric model of the robot scaled down to a fraction of its original size, for which a guiding path is found using the RRT algorithm. Such a path is then iteratively used in the proposed RRT-Path algorithm for a larger robot, up to its original size. The experimental results indicate the benefit of the proposed technique: a significantly higher ratio of successfully found feasible paths in comparison to state-of-the-art RRT algorithms.
Index Terms— motion planning, RRT
I. INTRODUCTION
Motion planning is a classical robotic problem, which can be described using the notion of the configuration space C, where q ∈ C is a configuration of the robot. The obstacles in the workspace correspond to a set C_obs ⊆ C, while C_free = C \ C_obs is the set of feasible configurations. The motion of a robot in a workspace corresponds to a feasible path in C_free. Several complete methods have been proposed to solve the planning problem, e.g., Voronoi diagrams or Visibility Graphs. These approaches require an explicit representation of the configuration space. However, such a representation cannot be easily computed for systems with many degrees of freedom. To overcome this problem, sampling-based methods such as Probabilistic Roadmaps (PRM) [13], Rapidly Exploring Random Trees (RRT) [15] or Expansive Spaces Trees (EST) [9] were suggested to obtain an approximation of the configuration space. The common idea of the sampling-based methods is to build a roadmap of free configurations q ∈ C_free. These methods randomly sample the configuration space C and classify the samples as free or non-free using a collision detection method. The free samples are stored and connected to make a roadmap, in which a solution is found. The collision detection is applied as a "black box", which makes it possible to cope with robots of arbitrary shapes. The sampling-based methods are intensively studied, and their advantages and issues have been described in many papers; here, we refer to the summary [17]. A well-known issue of the sampling-based methods is the narrow passage problem. A narrow passage can be defined as a part of C whose removal changes the topology of C [14]. The narrow passage becomes important if the resulting path in C has to pass through it. The issue comes from the low number of samples covering the narrow passage, which prevents the construction of a feasible path through it. The sparse coverage of the passage is due to the uniform distribution, which is usually used in sampling-based motion planning methods. Even though the methods are probabilistically complete, they may not provide a feasible path within a given computational time.
In this paper, a novel sampling schema for the RRT algorithm is proposed to cope with the narrow passage problem and to increase the success ratio of finding feasible paths. The approach is based on a modification of the RRT–Path [2] algorithm, which uses a guiding path to steer the tree growth in C. Although a guiding path can be computed directly in the workspace corresponding to a low-dimensional C, it is difficult to construct a useful path in a general C. Therefore, we propose to iteratively employ the RRT–Path algorithm to find a path for a robot of downscaled size, but preserving the same motion model. The path found for a smaller robot is then used as a guiding path for a larger one, which increases the performance of the motion planning. The paper is organized as follows. Related work is described in the next section. In Section III, the modified RRT–Path algorithm is described. It is then used in Section IV in the iterative method to find a guiding path in a high-dimensional C. Experimental verifications are described in Section V.
II. RELATED WORK
A crucial part of the sampling-based methods is the sampling of the configuration space. The sampling process can be characterized by a sampling source and a sampling measure [11]. The sampling source denotes how the samples are constructed, e.g., using pseudorandom number generators or by deterministic approaches like Halton sequences. The sampling measure is the distribution of the samples in C, and it is more important for the planning process itself than the sampling source [11].
In the original PRM and RRT, the configuration space is sampled uniformly. To cope with the narrow passage problem, the sampling measure can be modified to generate more samples in difficult regions. A simple method that modifies the sampling measure in the RRT is the goal-bias [16], where the
random configuration is replaced by the goal configuration with a probability p_g. The goal-bias suppresses the exploration of C, as the tree is expanded preferentially towards the goal (a minimal sketch is given below). Although the goal-bias speeds up the growth towards the goal, it may cause congestion of the tree growth due to the existence of an obstacle in the pathway. The location of a narrow passage can be estimated from the shape of the workspace prior to planning. This approach is used in the Gaussian PRM [18] and the Bridge-Test [8], where random configurations are generated close to obstacles. If a roadmap with a higher clearance is preferred, the samples can be generated on the medial axis [23]. Other approaches that use knowledge about the workspace to determine regions for dense sampling are [14, 22, 1, 10]. The sampling measure can also be adapted during the planning phase, which makes it possible to generate more samples in sparsely covered regions. The adaptive sampling can be based on the level of information gain brought by the samples [4, 5]. The coverage of the configuration space by the samples can be estimated using the KPIECE [20] algorithm. The adaptive methods improve the sampling in high-dimensional configuration spaces, where the sampling cannot be modified using purely the workspace knowledge [14]. The sampling measure can be easily modified in a PRM algorithm, because such an algorithm first samples the configuration space, and only afterwards is the roadmap constructed from the samples located in C_free. The modifications originally suggested for the PRM cannot be directly used in the RRT, because the RRT algorithm simultaneously builds the tree and samples the configuration space. To modify the sampling measure in the RRT, the tree growth needs to be considered; otherwise the tree can get stuck at an obstacle, like in the situation depicted in Fig. 1a. More samples in the narrow passage np do not guarantee that the tree will escape the "u" obstacle, because the tree tends to grow towards the samples in np, but an obstacle is located in that direction. Therefore, the sampling measure should be boosted around np only after the tree approaches the narrow passage (Fig. 1b).
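The goal-bias itself takes only a few lines (a generic Python sketch of the idea of [16], with our own function names):

import random

def sample_with_goal_bias(sample_free, q_goal, p_g=0.05):
    # with probability p_g propose the goal configuration itself,
    # otherwise draw a uniform random sample from the configuration space
    return q_goal if random.random() < p_g else sample_free()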
Fig. 1. Environment with a narrow passage (np). If the sampling is boosted around np before the tree approaches it, the tree attempts to grow to the obstacles (left). To help the tree to pass np the sampling measure should be boosted in np when the tree approaches it (right).
Several approaches that adjust the sampling measure considering the tree growth in the RRT have been presented so far. The tree nodes can be classified into two groups: (1) frontier nodes, whose Voronoi cells grow together with the growth of the environment; and (2) boundary nodes, which are close to obstacles [24]. The tree should be expanded from the frontier nodes, as these are found at the boundary of the tree and can thus be extended towards unexplored regions. In narrow passages, the nodes are mainly of both the frontier and the boundary types. They are frequently chosen for tree expansion, because they are frontier nodes; nevertheless, their expansion may not be successful, as they are located close to obstacles. The authors of [24] proposed the RRT-DD algorithm to deactivate nodes from which the tree cannot be expanded. Each node is assigned a radius defining the maximum distance of a random sample in C which can activate the node for expansion. The radius is initialized to ∞; hence, a node can initially be chosen for expansion by an arbitrary random sample. If the expansion of a node fails, its radius is decreased to a predefined value R. This suppresses frequent selection of the boundary nodes and increases the probability of selecting frontier nodes. However, the performance of the RRT-DD strongly depends on the value of R. A strategy for adapting the activation radius has been proposed in [12]: if the expansion of a node succeeds, its activation radius is increased by a given percentage α, otherwise it is decreased by the same percentage (see the sketch after this paragraph). To attract the tree into a narrow passage, the Retraction-RRT [26] computes contact configurations q_c that are located on the boundary of C_free and C_obs. Such configurations are found using a retraction procedure, which works as follows. A random configuration q_rand ∈ C_free is generated and a close non-free configuration q_obs ∈ C_obs is found. A contact configuration q_c on the segment (q_rand, q_obs) is found, and its neighborhood is searched for a q_c minimizing the distance between q_obs and q_c. The configuration q_c is then added to the tree. It was shown that this approach can deal with narrow passages efficiently, because the generated contact configurations penetrate into the narrow passages. However, to find the contact configurations, the collision detection algorithm is called frequently, which can decrease the performance of the algorithm.
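The activation-radius bookkeeping mentioned above can be sketched as follows (an illustrative Python fragment combining the fixed radius R of [24] with the adaptive update of [12]; class and function names are ours):

import math

class TreeNode:
    def __init__(self, q):
        self.q = q
        self.radius = math.inf  # initially any random sample activates the node

def update_activation_radius(node, succeeded, R=1.0, alpha=0.1, adaptive=True):
    if not adaptive:            # original RRT-DD [24]
        if not succeeded:
            node.radius = R     # failed node: shrink its domain to R
        return
    if succeeded:               # adaptive variant [12]
        if not math.isinf(node.radius):
            node.radius *= (1.0 + alpha)
    else:
        node.radius = R if math.isinf(node.radius) else node.radius * (1.0 - alpha)

def can_expand(node, q_rand, dist):
    # a node is considered for expansion only if the random sample
    # falls inside its activation radius
    return dist(node.q, q_rand) < node.radius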
In [2], the performance of the RRT for problems with a low-dimensional C was increased by the proposed RRT–Path algorithm. The approach is based on a precomputed geometric path in the workspace that is used to guide the randomized construction of the tree through the configuration space. Therein, the guiding path is computed using well-known approaches like the Visibility graph, the Voronoi diagram, or the Visibility-Voronoi combination. A similar idea for attracting the tree was proposed in [21], where the tree is grown towards multiple precomputed key configurations. In the RRT–Path, only one attraction configuration is used; moreover, this configuration is moved along the path considering the growth of the tree, which allows the tree to reach the goal configuration even in an environment with obstacles. The significant performance improvement of the RRT–Path in problems with many narrow passages (like office building plans) makes it possible to consider motion planning in the watchman route problem [6], where many trajectories have to be determined to find a final watchman route trajectory from which the whole environment is covered. However, a guiding path found in the workspace cannot be used to guide the RRT in a high-dimensional C, because the workspace importance decreases as the dimension of the configuration space increases [14]. Therefore, in this paper, we extended the
former RRT–Path algorithm to efficiently deal with problems in a high-dimensional C.
III. GUIDING A TREE USING A PRECOMPUTED PATH
Although the RRT–Path has been proposed in [2], the algorithm has been further improved to cope with high-dimensional spaces; that is why a detailed description of the modified algorithm is presented in this section. The basic idea of the algorithm is to steer the growth of the tree through the environment without unnecessary exploration steps. To steer the growing process of the tree, a precomputed guiding path is employed, and new samples are generated along this path using a temporal goal g_t, which follows the goal-bias principle. The temporal goal g_t is maintained considering the growth of the tree; thus, g_t slides along the guiding path during the tree construction. To determine the temporal goal g_t, the nearest node in the tree would have to be projected onto a line segment of the guiding path. To avoid computing this projection, we represent the guiding path by a set of sample points; the temporal goal is then searched among these samples. To bias the tree growth towards g_t, a new random sample q_rand is randomly drawn from the w neighbor sample points of g_t. So, q_rand is sampled around g_t with the probability p_g, otherwise q_rand is sampled from C. The principle of the sampling is depicted in Fig. 2. Similarly to the original RRT, the nearest configuration q_near in the tree is found and the tree is extended towards q_rand. If the tree approaches the current goal g_t to a distance less than δ_t, the temporal goal is updated to the most distant point of the guiding path that has not yet been reached by the tree. The algorithm terminates if the tree approaches q_goal. The RRT–Path algorithm is listed in Algorithm 1. The guiding path can be found easily for 2D or 3D workspaces using well-known approaches like Visibility graphs or Voronoi diagrams. Here, we do not focus on the guiding path computation itself; rather, we assume a path is given and focus on the sampling along it. To measure the distance between a configuration q ∈ C and a path point p, only the corresponding variables are used in the Euclidean metric; for example, for a car-like robot with q = {x, y, ϕ} and a 2D guiding path point p = {x, y}, only the x and y variables are used to determine the distance between q and p (a minimal sketch is given below). Maintenance of the temporal goal (lines 4–12 in Algorithm 1) is based on the distances between the tree and the points on the guiding path that have a higher index than g_t. If the distance is lower than δ_t for a point p_i, then the next point p_{i+1} is labeled as the new temporal goal. The considered distance δ_t implies that the tree may not follow the path exactly; a higher δ_t leads to a less accurate tracking of the guiding path.
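The partial distance metric mentioned above can be written as (an illustrative Python sketch with our own naming):

import math

def path_distance(q, p, shared=("x", "y")):
    # Euclidean distance over the variables shared by the configuration q
    # and the guiding-path point p, e.g. x and y for a car-like robot
    # with q = {x, y, phi} and p = {x, y}
    return math.sqrt(sum((q[k] - p[k]) ** 2 for k in shared))

# example: path_distance({"x": 1.0, "y": 2.0, "phi": 0.3}, {"x": 0.0, "y": 0.0})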
Fig. 2. Sampling along the guiding path; the temporal goal is point 4. If w = 5 neighbors of the temporal goal are used to attract the tree, q_rand is drawn from points {2, 3, 4, 5, 6}.
Algorithm 1: RRT–Path
Input: Configurations q_init, q_goal, the maximum number of iterations K, the temporal goal bias p_g, the number of the temporal goal neighbors w, a guiding path P = {p_1, ..., p_n}
Output: A trajectory between q_init and q_goal or failure
1   T.add(q_init)
2   g_t = 1                       // index of the temporal goal point
3   while iteration < K and q_goal not reached do
4     i = n
5     while i > g_t do
6       q_n = nearestNeighbor(T, p_i)
7       if distance(q_n, p_i) < δ_t then
8         g_t = i + 1             // new temporal goal
9         break
10      end
11      i = i − 1
12    end
13    if p_g > rand(0, 1) then
14      q_rand = random point among the w neighbors of g_t
15    else
16      q_rand = random configuration in C
17    end
18    q_near = nearestNeighbor(T, q_rand)
19    q_new = extend q_near towards q_rand
20    if q_new can be connected to q_near then
21      T.add(q_new)
22    end
23    iteration = iteration + 1
24  end
25  if q_goal was reached by the tree T then
26    return trajectory between q_init and q_goal
27  else
28    return failure
29  end
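Lines 4–17 of Algorithm 1 can be realized, for example, as follows (a minimal Python sketch under our own naming; the tree is assumed to provide a nearest-neighbour query, and the path index g_t is assumed valid):

import random

def update_temporal_goal(tree, path, g_t, delta_t, dist):
    # lines 4-12: find the most distant path point already approached by
    # the tree and move the temporal goal just beyond it
    for i in range(len(path) - 1, g_t, -1):
        q_n = tree.nearest_neighbor(path[i])
        if dist(q_n, path[i]) < delta_t:
            return min(i + 1, len(path) - 1)  # new temporal goal index
    return g_t

def sample_guided(path, g_t, w, p_g, sample_free):
    # lines 13-17: with probability p_g draw q_rand from the w path points
    # around the temporal goal, otherwise sample C uniformly
    if random.random() < p_g:
        lo = max(0, g_t - w // 2)
        return random.choice(path[lo:g_t + w // 2 + 1])
    return sample_free()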
In such a situation, a new path through the environment may be found, and it may even be interconnected with an existing path behind the temporal goal. The tree can thus "skip" a part of the guiding path, which is useful in situations where a part of the guiding path is not able to steer the tree properly, e.g., due to a low clearance. An example of such a situation is depicted in Fig. 3. The balance between path following and exploration of C depends on the number of neighbors of g_t and on the distance between the consecutive points on the guiding path. If the number of neighbors w is high, or the consecutive path points are too far apart, the algorithm prefers exploration of C over following the guiding path, because the tree is more attracted by widespread points than by the guiding path itself. If the distance between the consecutive path sample points is small, or only a few g_t neighbors are used to select the random points, the tree is attracted by points located in a close area; thus, the tree does not explore, but attempts to reach these points. As the distance between consecutive path points decreases, the number of path points increases, which also increases the computational burden of the temporal goal
Fig. 3. Example of skipping a part of the guiding path by the tree. The temporal goal is set to g_t, but the tree approaches the point p. The new temporal goal is thus set to g'_t.
selection. To overcome these situations, the distance between consecutive sample points on the guiding path should not be higher than δ_t.
IV. RRT–PATH WITH ITERATIVE SCALING OF ROBOT GEOMETRY
The RRT–Path is able to grow the tree along a predefined geometric path. The guiding path can be computed easily for a point robot in 2D or 3D workspaces, while its computation is PSPACE-hard [19] for a non-point robot. In this section, we propose an iterative scaling algorithm, denoted RRT-IS, to find a guiding path in a general C. The approach is based on an iterative refinement of the guiding path using a scaled model of the robot. First, the original RRT is used to find an initial guiding path with a downscaled model of the robot. This path is then used as the guiding path in the modified RRT–Path algorithm, which is executed several times with a stepwise enlarged robot until the original size of the robot is reached. The RRT–Path attempts to follow the given guiding path at most m times. Here, we assume that the robot is a single-body robot and that it is scaled equally in all dimensions. The algorithm is listed in Algorithm 2. The narrow passages in the C of a smaller robot are relatively wider than the passages in the C of the original (full-sized) robot; thus, it is easier to find a path through the passages. The iterative scaling approach assumes that the path found for a smaller robot is topologically equivalent to the path for the robot of the original size. This assumption is clearly not always valid: a path in C found for a small robot can become infeasible for a larger robot due to the existence of an obstacle, see Fig. 4 for an example. However, the modified RRT–Path is able to skip a part of the guiding path, and therefore it may find a solution also in these cases. The performance of the proposed RRT-IS depends on the scaling strategy used. The simplest strategy is to start the planning with a low scale and to increase the scale at each iteration by a predefined value (see the sketch at the end of this section). Although the initial robot scale can be chosen extremely low, to increase the chance of finding an admissible guiding path, such a path may become infeasible for a larger robot. An improved scaling strategy, e.g., one employing a binary search, can be used to speed up the RRT-IS algorithm. The idea of the iterative scaling approach for finding the guiding path in C is similar to the Iterative Relaxation of Constraints (IRC) [3], which was introduced for the PRM. The IRC method iteratively scales the robot and, for each scale, a roadmap is built.
Algorithm 2: RRT–IS
Input: q_init, q_goal, the number of RRT–Path trials m, the maximum number of allowed iterations K
Output: A path for the robot with the scale 1.0 or failure
1   guidingPath = ∅
2   while q_goal is not reached by the original robot do
3     scale = getNextScale()      // scaling strategy
4     robot.scaleGeometry(scale)
5     if guidingPath.size = 0 then
6       p = rrtOriginal.findPath(q_init, q_goal, robot, K)
7     else
8       for t = 1 : m do
9         p = rrtpath.findPath(guidingPath, q_init, q_goal, K)
10        if goal reached by p then
11          break
12        end
13      end
14    end
15    if goal reached by p then
16      guidingPath = p
17    else
18      return failure
19    end
20  end
21  return guidingPath            // a solution for scale 1.0
Fig. 4. An example of guiding paths found for two different scales of a robot. The workspaces with the robot are depicted in the left column, and their configuration spaces are on the right. Two paths can be found in the configuration space of a smaller robot (b); however, only the dashed path is feasible for a larger robot.
In the next step, the scale of the robot is increased and the previous roadmap is adapted. The benefit of our approach, designed for the RRT, is that it provides a solution in situations where (a) differential constraints must be considered, or (b) the environment changes, in which case finding one path between start and goal is faster than building a roadmap of the whole configuration space. Moreover, the proposed RRT-IS uses a guiding path that considers the motion model of the robot, and therefore it can be used to find a path for a non-holonomic robot.
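The simple linear scaling strategy used in our experiments can be sketched as follows (illustrative Python; a binary-search strategy would replace this schedule):

def linear_scales(start=0.9, step=0.02, stop=1.0):
    # yields the robot scales tried by RRT-IS, ending with the original size
    s = start
    while s < stop:
        yield round(s, 4)
        s += step
    yield stop

# e.g. list(linear_scales()) -> [0.9, 0.92, 0.94, 0.96, 0.98, 1.0]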
V. EXPERIMENTS
The performance of the proposed modification of the RRT with iterative scaling (RRT-IS) has been experimentally verified and compared with state-of-the-art RRT approaches in two sets of experiments. All experiments were performed within the same computational environment, using an Intel Core2Duo 3.0 GHz CPU with 2 GB RAM. The nearest neighbors in the RRT algorithms were searched using the MPNN library [25], and the collisions were detected using the RAPID library [7]. In the first set of experiments, the RRT-IS was verified in the 2D environments¹ BT_100 and BT_30 with a car-like robot. The subscript in the map name denotes the size of the narrow passage in centimeters. The size of the robot is 100 × 20 cm. Each algorithm was executed 100 times for each map. A trial in which the distance between the resulting trajectory and the goal state was higher than 30 cm is considered a planning failure and is discarded. The ratio of the number of planning failures to the total number of trials performed denotes the failure ratio. Results from the successful trials are averaged.
In the RRT-IS, the scale was initialized to 0.9, and it was increased by 0.02 at each iteration. The number of the inner RRT–Path trials was set to m = 3. The parameters of RRT– Path were: w = 15, p g = 95%, and δ t = 20 cm.
The results are shown in Table I, and examples of found solutions in Fig. 5. The maximum number of iterations for the RRT-Original is given in the row No. iterations. For the RRT-IS, this denotes the maximum number of iterations of one run of the RRT or RRT–Path on a certain scale (parameter K in Alg. 2). The row Time is the required computational time, Tree size is the size of the tree created by the method, and No. collisions is the number of collision queries. The tree size of the RRT-IS can be higher than the number of iterations, as the RRT–Path algorithm can be performed up to m times.
A passage that is more than four times wider than the robot does not represent a significant issue for the RRT-Original algorithm in the BT_100 environment. However, the failure ratio of the RRT-Original is higher in the BT_30 environment with a narrow passage. The failure ratio of the RRT-Original depends on the maximum number of iterations; however, even when this number was increased (to 50,000), the failure ratio of the RRT-Original was still higher than that of the RRT-IS with only 5,000 allowed iterations.
Although the RRT-IS is more computationally intensive than the RRT-Original, it provides significantly better performance regarding the failure ratio. The high computational requirements of the RRT-IS are caused by the determination of the temporal goals, as the RRT–Path has to find the nearest neighbors to all points on the guiding path in each iteration. The initial construction of the first guiding path for the most downscaled robot is time-consuming, whilst the guidance of larger robots is faster, as can be seen in Fig. 7.
The second experimental verification concerns 3D environments in the scenarios Bugtrap² and Room. The task is to find
1 Maps are available at http://imr.felk.cvut.cz/planning/maps.xml 2 http://parasol.tamu.edu/groups/amatogroup/benchmarks/mp/bug-trap/
a path for a stick-shaped robot. Both environments contain a narrow passage. The iterative scaling was started at a scale of 0.05 for the Bugtrap environment, and the scale was increased by 0.05 after each iteration. The initial scale for the Room environment was 0.7, increased by 0.02 after each iteration. The parameters of the RRT–Path were w = 15, p_g = 95%, and δ_t = 0.5 map units. A trial was accepted if the distance between the found trajectory and the goal state was less than 1 map unit. The results are shown in Table II. Besides the RRT-Original, the RRT-Retraction [26] and the RRT-DynamicDomain [24] with adaptive tuning were implemented for comparison. The parameters α = 0.1 and R = 100 of the RRT-DD were used, where the unit is the length of the expansion step in the RRT. Although the state-of-the-art methods RRT-DD and RRT-Retraction are faster and need fewer collision detection queries, their failure ratios are significantly higher than that of the RRT-IS. Regarding the experimental results, the proposed RRT-IS provides a feasible path with a higher frequency, which indicates the benefit of the proposed guided sampling method.
TABLE I
RESULTS FOR BT 100 AND BT 30 MAPS WITH CAR-LIKE ROBOT.
BT_100             RRT-Original          RRT-IS
No. iterations     15,000     35,000     5,000
Time [s]           1.45       3.18       5.0
Tree size          10,738     24,485     8,279
No. collisions     180,557    384,971    257,796
Failures           8 %        1 %        1 %

BT_30              RRT-Original          RRT-IS
No. iterations     30,000     50,000     5,000
Time [s]           2.72       4.2        28.81
Tree size          19,335     26,734     18,834
No. collisions     331,085    510,084    860,929
Failures           82 %       78 %       1 %
Fig. 5. Results of RRT-IS on maps BT 100 (left) and BT 30 (right). The rectangle represents the robot.
VI. CONCLUSION
A novel sampling schema for RRT planning algorithms has been presented. The proposed method employs a guiding path to steer the growth of the tree in the configuration space. The guiding path can be computed in the workspace by geometric path-planning methods, or by the RRT approach with the herein proposed iterative scaling. Whilst the geometric methods are more suitable for a low-dimensional configuration space, the RRT-IS can deal even with high-dimensional problems; thus, it can find a path through a narrow passage in C even if the corresponding part of the workspace does not indicate the presence of a narrow passage.
TABLE II
RESULTS FOR 3D ENVIRONMENTS.
Room                       RRT-Original   RRT-IS    RRT-Retr   RRT-DD
No. iterations             200,000        40,000    40,000     100,000
Time [s]                   66.84          58.2      9.9        19.2
Tree size                  156,874        209,031   69,886     100,000
No. collisions (×10⁶)      4.38           4.57      1.23       1.2
Failures                   59 %           0 %       75 %       80 %

Bugtrap                    RRT-Original   RRT-IS    RRT-Retr   RRT-DD
No. iterations             4·10⁶          3.5·10⁶   10⁶        10⁶
Time [s]                   2,692          3,000     257        394
Tree size (×10⁶)           3.8            2.2       1.63       0.99
No. collisions (×10⁶)      133            120       16         12
Failures                   86 %           1 %       46 %       86 %
Fig. 6. 3D environments Bugtrap (a) and Room (b) with a tree along a path.
Fig. 7. Tree size needed to find a solution with different scales of a car-like robot in BT_100 and BT_30 (a); running time of the RRT–Path for various scales of the robot in the Bugtrap problem (b).
The experimental verifications have shown that the proposed RRT-IS algorithm is able to cope with the narrow passage problem in both 2D and 3D workspaces, with a significantly higher success rate than the other methods. Although the proposed method is computationally demanding, it may be combined with the RRT-DD or the RRT-Retraction to decrease the computational burden, which is a subject of our future work.
VII. ACKNOWLEDGMENTS
This work has been supported by the Grant Agency of the Czech Technical University in Prague under grant No. SGS10/185 and SGS10/195, by the Ministry of Education of the Czech Republic under Projects No. 7E08006, No. 1M0567, and No. LH11053, and by the EU under Project No. 216342.
REFERENCES
[1] Nancy M. Amato, O. Burchan Bayazit, Lucia K. Dale, Christopher
Jones, and Daniel Vallejo. OBPRM: an obstacle-based PRM for 3D workspaces. In WAFR, pages 155–168. A. K. Peters, Ltd., 1998. [2] Vonasek´ Vojtech,ˇ Jan Faigl, Toma´sˇ Krajn´ık, and Libor Pˇreucil.ˇ RRT- Path: a guided Rapidly Exploring Random tree. In Robot motion and control , Poznan, Poland, June 2009. [3] O.B. Bayazit, Dawen Xie, and N.M. Amato. Iterative relaxation of constraints: a framework for improving automated motion planning. In IROS 2005, pages 3433 – 3440, aug. 2005. [4] B. Burns and O. Brock. Information theoretic construction of proba- bilistic roadmaps. In IROS, volume 1, pages 650–655, Oct. 2003. [5] Brendan Burns and Oliver Brock. Toward optimal configuration space sampling. In Proceedings of Robotics: Science and Systems, Cambridge, USA, June 2005. [6] Jan Faigl. Multi-Goal Path Planning for Cooperative Sensing. PhD thesis, Czech Technical University in Prague, 2010. [7] S. Gottschalk, M. C. Lin, and D. Manocha. OBBTree: a hierarchical structure for rapid interference detection. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques , pages 171–180, New York, NY, USA, 1996. ACM. [8] David Hsu. The bridge test for sampling narrow passages with probabilistic roadmap planners. In IEEE ICRA , 2003. [9] David Hsu, Jean claude Latombe, and Rajeev Motwani. Path planning in expansive configuration spaces. In International Journal of Compu- tational Geometry and Applications, pages 2719–2726, 1997. [10] David Hsu, Lydia E. Kavraki, Jean-Claude Latombe, and Rajeev Mot- wani. On finding narrow passages with probabilistic roadmap planners. In WAFR, 1998. [11] David Hsu, Jean-Claude Latombe, and Hanna Kurniawati. On the prob- abilistic foundations of probabilistic roadmap planning. International
Journal of Robotics Research, 25(7):627–643, 2006. [12] L. Jaillet, A. Yershova, S.M. La Valle, and T. Simeon. Adaptive tuning of the sampling domain for dynamic-domain RRTs. In IEEE/RSJ IROS , pages 2851 – 2856, 2-6 2005. [13] Lydia E. Kavraki, Petr Svestka, Jean claude Latombe, and Mark H. Overmars. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation , 12:566–580, 1996. [14] Hanna Kurniawati and David Hsu. Workspace importance sampling for probabilistic roadmap planning. In IROS , September 2004. [15] S. M. LaValle. Rapidly-exploring random trees: A new tool for path planning, 1998. TR 98-11. [16] Steven M. Lavalle and James J. Kuffner. Rapidly-exploring random trees: Progress and prospects. In Algorithmic and Computational Robotics: New Directions, pages 293–308, 2000. [17] Stephen R. Lindemann and Steven M. LaValle. Current issues in sampling-based motion planning. In Robotics Research: The Eleventh International Symposium, pages 36–54, Berlin, Germany, 2005. Springer-Verlag. [18] Mark H. Overmars. The Gaussian Sampling strategy for probabilistic roadmap planners. In ICRA, pages 1018–1023, 1999. [19] John H. Reif. Complexity of the mover’s problem and generalizations. In Proceedings of SFCS ’79, pages 421–427, Washington, DC, USA, 1979. IEEE Computer Society. [20] Ioan Alexandru Sucan and Lydia E. Kavraki. Kinodynamic motion planning by interior-exterior cell exploration. In WAFR, 2008. [21] E. Szadeczky-Kardoss and B. Kiss. Extension of the rapidly exploring random tree algorithm with key configurations for nonholonomic motion planning. In IEEE International Conference on Mechatronics, 2006. [22] J.P. van den Berg and M.H. Overmars. Using workspace information as a guide to non-uniform sampling in probabilistic roadmap planners. In IEEE ICRA, volume 1, pages 453–460 Vol.1, April-1 May 2004. [23] Steven A. Wilmarth, Nancy M. Amato, and Peter F. Stiller. Maprm: A probabilistic roadmap planner with sampling on the medial axis of the free space. In IEEE ICRA, pages 1024–1031, 1999. [24] A. Yershova, L. Jaillet, T. Simeon, and S.M. LaValle. Dynamic-domain RRTs: Efficient exploration by controlling the sampling domain. In ICRA, pages 3856–3861, April 2005. [25] A. Yershova and S. M. LaValle. Improving motion-planning algo- rithms by efficient nearest-neighbor searching. Robotics, IEEE Trans- actions on, 23(1):151–157, Feb. 2007. http://msl.cs.uiuc.edu/ yer- shova/MPNN/MPNN.htm. [26] Liangjun Zhang and D. Manocha. An efficient retraction-based RRT planner. In IEEE International Conference on Robotics and Automation , pages 3743 –3750, 19-23 2008.
Navigation of mobile robots by potential field methods and market-based optimization
Rainer Palm ∗
Abdelbaki Bouguerra †
∗ AASS, Dept. of Technology, Oerebro University, Sweden † AASS, Dept. of Technology, Oerebro University, Sweden
Abstract— Mobile robots play an increasing role in everyday life, be it for industrial purposes, military missions, or for health care and the support of handicapped people. A prominent aspect is multi-robot planning and the autonomous navigation of a team of mobile robots, especially the avoidance of static and dynamic obstacles. The present paper deals with obstacle avoidance using artificial potential fields and selected traffic rules. As a novelty, the potential field method is enhanced by a decentralized market-based optimization (MBO) between competing potential fields of mobile robots: some potential fields are strengthened and others are weakened, depending on the local situation. In addition, circular potential fields are 'deformed' by using fuzzy rules to avoid undesired behavior of a robot in the vicinity of obstacles.
I. INTRODUCTION
A group of autonomously acting mobile robots (mobile platforms) is a system of subsystems with different goals and velocities, competing navigation commands, and obstacles to be avoided. In the last two decades several methods addressing these issues have been reported. One of the most popular methods for obstacle avoidance is the artificial potential field method [1]. Borenstein and Koren gave a critical view of this method, addressing its advantages and drawbacks regarding stability and deadlocks [2]. Aircardi and Baglietto addressed team building among mobile robots sharing the same task and the appropriate decentralized control [3]; in approaching situations, robots act as moving obstacles, and coordination is done by online local path planning using so-called via points. Further research results regarding the navigation of non-holonomic mobile robots can be found in [4] and [5]. The execution of robot tasks based on semantic domain knowledge is reported in detail in [6]. Trying to achieve different tasks at the same time makes a decentralized optimization necessary. Decentralized methods like multi-agent control are expected to handle optimization tasks for systems composed of a large number of complex local systems more efficiently than centralized approaches. One example is the flow control of mobile platforms in a manufacturing plant using intelligent agents [7]. Other decentralized control strategies are the so-called utility approach [8] and the behavioral approach [9], used for mobile robot navigation. The difficulty of optimizing a decentralized system consisting of many local systems leads us to game-theoretic and related methods. One of the most interesting and promising approaches to large decentralized systems is market-based (MB) optimization. MB algorithms imitate the behavior of economic systems in which producer and consumer agents both compete and cooperate on a market of commodities; this simultaneous cooperation and competition of agents is also called coopetition [10]. In [11] an overview of MB multi-robot coordination is presented, which is based on bidding processes. The method deals with motion planning, task allocation and team cooperation, whereas obstacles are not considered. General ideas and some results of MB control strategies are presented in [12] and [13]. In [14], a more detailed description of the optimization algorithm is presented; the authors show how to optimize distributed systems by so-called producer and consumer agents using local cost functions. Given desired setpoints and, with this, a cost function for each local system, a set of controls has to be found that leads the whole set of local systems to a so-called Pareto optimum.
The present paper adopts many ideas from [13], [14] and [15] to improve the performance of safe navigation of multiple robots using potential fields. In the context of MB navigation, the combinations of competing tasks that should be optimized are manifold, for example the presence of a traffic rule and the necessity of avoiding an obstacle at the same time. Another case is the accidental meeting of more than two robots within a small area, which requires a certain minimum distance between the robots and appropriate (smooth) manoeuvers to maintain the stability of the trajectories to be tracked. The present paper addresses exactly this point: as a novelty, optimization takes place between "competing" potential fields of mobile robots, whereas some potential fields are strengthened and some are weakened depending on the local situation. Repulsive forces, both between robots and between robots and obstacles, are computed under the assumption of circular force fields, meaning that forces are computed between the centers of mass of the objects considered. It should be mentioned that each mobile robot uses only local data available through its own sensors in order to compute its local actions. Section II shows the navigation principles applied to the robot task. In Section III the general task of obstacle avoidance for a multi-robot system using artificial potential fields is outlined. Section IV gives an introduction to the special MB approach used in this paper. The connection between the MB approach and the system to be controlled is outlined in Section V. Section VI shows simulation experiments and Section VII draws conclusions and highlights future work.
II. NAVIGATION PRINCIPLES
To illustrate the navigation problem, let n mobile platforms (autonomous mobile robots) in a working area perform tasks like loading materials at a starting station, bringing them to a target station, and unloading them there. The task of the platforms is to reach their targets while avoiding other platforms according to specific rules. The problem is to find an appropriate navigation strategy.
Fig. 1. Platform area
Let, as an example, mobile robots (platforms) P_1, P_2, and P_3 be supposed to move to targets T_1, T_2, and T_3, respectively, while collisions should be avoided. Each platform has some estimate of its own position and orientation and also of the position of its own target. The position of another platform P_j relative to P_i can be measured if it lies within the sensor cone of P_i. Navigation principles for a mobile robot (platform) P_i are meant to be heuristic rules to perform a specific task under certain restrictions originating from the environment, obstacles O_j, and other robots P_j. As already pointed out, each platform P_i is supposed to have an estimate of the position/orientation of itself and of the target T_i. This information can be given either in the base frame (world coordinate system) or in the local frame of platform P_i; in our case, it is given in world coordinates. Apart from the heading-to-target movement, all other navigation actions take place in the local coordinate system of platform P_i. The positions of obstacles (static or dynamic) O_j or other platforms P_j are formulated in the local frame of platform P_i. In the following, four navigation principles that have been used in our work are formulated:
1. Move in the direction of the target T_i.
2. Avoid an obstacle O_j (static or dynamic) if it appears in the sensor cone at a certain distance. Always orient the platform in the direction of motion.
3. Decrease speed if a dynamic (moving) obstacle O_j comes from the right.
4. Move to the right if the obstacle angles β [16] of two approaching platforms are small (e.g., β < 10°) (see Fig. 2).
III. NAVIGATION AND OBSTACLE AVOIDANCE USING POTENTIAL FIELDS
This section gives a more detailed view of the potential field method for obstacle avoidance. Let us start with a simple first-order dynamics for each platform P_i, which automatically avoids abrupt changes in the orientation:

$\dot{v}_i = k_{di} (v_i - v_{di})$   (1)

where $v_i \in \mathbb{R}^2$ is the actual velocity vector of platform P_i, $v_{di} \in \mathbb{R}^2$ is the vector of the desired velocity of platform P_i, and $k_{di} \in \mathbb{R}^{2 \times 2}$ is a damping matrix (diagonal). The tracking velocity is designed as a control term

$v_{ti} = k_{ti} (x_i - x_{ti})$   (2)

where $x_i \in \mathbb{R}^2$ is the position of platform P_i, $x_{ti} \in \mathbb{R}^2$ is the position of target T_i, and $k_{ti} \in \mathbb{R}^{2 \times 2}$ is a gain matrix (diagonal). Virtual repulsive forces appear between platform P_i and obstacle O_j, from which 'repulsive velocities' are derived:

$v_{ij}^{o} = - c_{ij}^{o} (x_i - x_j^{o}) (d_{ij}^{o})^{-2}$   (3)

where $v_{ij}^{o} \in \mathbb{R}^2$ is the repulsive velocity between platform P_i and obstacle O_j, $x_j^{o} \in \mathbb{R}^2$ is the position of obstacle O_j, $d_{ij}^{o} \in \mathbb{R}$ is the Euclidean distance between platform P_i and obstacle O_j, and $c_{ij}^{o} \in \mathbb{R}^{2 \times 2}$ is a gain matrix (diagonal). Virtual repulsive forces also appear between platforms P_i and P_j, from which we get the repulsive velocities

$v_{ij}^{p} = - c_{ij}^{p} (x_i - x_j) (d_{ij}^{p})^{-2}$   (4)

where $v_{ij}^{p} \in \mathbb{R}^2$ is the repulsive velocity between platforms P_i and P_j, $x_i, x_j \in \mathbb{R}^2$ are the positions of platforms P_i and P_j, $d_{ij}^{p} \in \mathbb{R}$ is the Euclidean distance between platforms P_i and P_j, and $c_{ij}^{p} \in \mathbb{R}^{2 \times 2}$ is a gain matrix (diagonal).

Fig. 2. Geometrical relationship between platforms

The resulting desired velocity $v_{di}$ is the sum

$v_{di} = v_{ti} + \sum_{j=1}^{m_o} v_{ij}^{o} + \sum_{j=1}^{m_p} v_{ij}^{p}$   (5)
where $m_o$ and $m_p$ are the numbers of contributing obstacles and platforms, respectively. It has to be stressed that force fields are switched on/off according to the actual scenario: the distance between interacting systems, the state of activation according to the sensor cones of the platforms, and the positions and velocities of platforms w.r.t. targets, obstacles and other platforms. All calculations of the velocity components (1)-(5), angles and sensor cones are formulated in the local coordinate systems of the platforms (see Fig. 2).
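Equations (1)-(5) translate directly into a short routine (a minimal NumPy sketch under our own naming; scalar gains replace the diagonal gain matrices, and the signs follow the formulas as printed, so the gain signs must be chosen so that the first term attracts and the other terms repel):

import numpy as np

def desired_velocity(x_i, x_t, obstacles, platforms, k_t=-0.5, c_o=1.0, c_p=1.0):
    # x_i: platform position; x_t: target position;
    # obstacles/platforms: lists of 2D positions
    x_i, x_t = np.asarray(x_i, float), np.asarray(x_t, float)
    v_d = k_t * (x_i - x_t)                  # tracking term, eq. (2)
    for x_o in obstacles:                    # obstacle repulsion, eq. (3)
        x_o = np.asarray(x_o, float)
        d = np.linalg.norm(x_i - x_o)
        v_d += -c_o * (x_i - x_o) / d**2
    for x_j in platforms:                    # platform repulsion, eq. (4)
        x_j = np.asarray(x_j, float)
        d = np.linalg.norm(x_i - x_j)
        v_d += -c_p * (x_i - x_j) / d**2
    return v_d                               # sum of all terms, eq. (5)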
A. ”Deformation” of potential fields using fuzzy rules
Potential fields of obstacles (static and dynamic) normally act independently of the attractive force of the target. This may cause unnecessary repulsive forces, especially in the case when the platform can "see" the target.
Fig. 3. Deformation of potential field
Fig. 4. Fuzzy table for potential field
The goal is therefore to "deform" the repulsive potential field in such a way that it is strong if the obstacle hides the target and weak if the target "can be seen" from the platform. In addition, the potential field should also be strong for a high tracking velocity and weak for a small one (see Fig. 3). This is done by a coefficient $coef_{ij} \in [0, 1]$ that is multiplied by $v_{ij}^{o}$ to obtain a new $v_{ij}^{o}$ as follows:

$v_{ij}^{o} = - coef_{ij}\, c_{ij}^{o} (x_i - x_j^{o}) (d_{ij}^{o})^{-2}$   (6)
The coefficients $coef_{ij}$ can be calculated by a set of 16 fuzzy rules like

IF $v_{ti} = B$ AND $\alpha_{ij} = M$ THEN $coef_{ij} = M$   (7)
where $\alpha_{ij}$ is the angle between $v_{ij}^{o}$ and $v_{ti}$. The whole set of 16 rules can be summarized in the table shown in Fig. 4, where Z (ZERO), S (SMALL), M (MEDIUM) and B (BIG) are fuzzy sets [17].
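A rule table of this kind can be evaluated, for instance, with triangular memberships and weighted-average defuzzification (a generic Mamdani-style Python sketch; the function names are ours, and the actual 16-entry table is the one in Fig. 4, not reproduced here):

SETS = {"Z": 0.0, "S": 1.0 / 3, "M": 2.0 / 3, "B": 1.0}  # set centres on [0, 1]

def mu(x, centre, width=1.0 / 3):
    # triangular membership function on a normalized input
    return max(0.0, 1.0 - abs(x - centre) / width)

def deformation_coef(v_norm, alpha_norm, table):
    # table maps (v_ti set, alpha_ij set) -> coef_ij set, 16 entries in total
    num = den = 0.0
    for (v_set, a_set), out_set in table.items():
        w = min(mu(v_norm, SETS[v_set]), mu(alpha_norm, SETS[a_set]))  # rule firing degree
        num += w * SETS[out_set]
        den += w
    return num / den if den > 0.0 else 0.0

# rule (7) would appear in the table as: table[("B", "M")] = "M"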
IV. MB APPROACH
Imitation of economical market mechanisms and their application to a multi-robot system requires the modeling of both the system to be optimized (see Sect. III) and the optimization strategy itself. The system and the optimization strategy are presented as continuous models, whereas the computational realization is usually discrete. The desired motion of platform P_i is described by

$v_{di} = vo_i + \sum_{j=1, i \neq j}^{m} w_{ij} v_{ij}^{p}$   (8)

where $vo_i$ is a combination of
- the tracking velocity, depending on the distance between platform i and goal i,
- repulsive/control terms between platform i and obstacles,
- traffic rules,
m is the number of platforms, $v_{ij}^{p}$ is the repulsive velocity between platforms i and j, and $w_{ij}$ are weighting factors for the repulsive velocities, where $\sum_{j=1, i \neq j}^{m} w_{ij} = 1$.
The objective is to change the weights $w_{ij}$ so that all contributing platforms show a smooth dynamical behavior in the process of avoiding each other. One possible option for tuning the weights $w_{ij}$ is to find a global optimum over all contributing platforms. This, however, is rather difficult, especially in the case of many interacting platforms; therefore, a multi-agent approach has been preferred. The determination of the weights is done by producer-consumer agent pairs in an MB scenario, which is presented in the following. Assume that to every local system $S_i$ (platform) belongs a set of m producer agents $Pag_{ij}$ and m consumer agents $Cag_{ij}$. Producer and consumer agents sell and buy, respectively, the weights $w_{ij}$ on the basis of a common price $p_i$. Producer agents $Pag_{ij}$ supply weights $w_{ij}^{p}$ and try to maximize specific local profit functions $\rho_{ij}$, where "local" means "belonging to system $S_i$". On the other hand, consumer agents $Cag_{ij}$ demand weights $w_{ij}^{c}$ from the producer agents and try to maximize specific local utility functions $U_{ij}$. The whole "economy" is in equilibrium when the sum over all supplied weights $w_{ij}^{p}$ is equal to the sum over all utilized weights $w_{ij}^{c}$:
$\sum_{j=1}^{m} w_{ij}^{p}(p_i) = \sum_{j=1}^{m} w_{ij}^{c}(p_i)$   (9)
The trade between the producer and consumer agents is based on the definition of cost functions for each type of agent. We define a local utility function for the consumer agent $Cag_{ij}$:

$U_{ij} = \text{benefit} - \text{expenditure} = \tilde{b}_{ij} w_{ij}^{c} - \tilde{c}_{ij} p_i (w_{ij}^{c})^2$   (10)

where $\tilde{b}_{ij}, \tilde{c}_{ij} \geq 0$ and $p_i \geq 0$. Furthermore, a local profit function is defined for the producer agent $Pag_{ij}$:
$\rho_{ij} = \text{income} - \text{costs} = g_{ij} p_i w_{ij}^{p} - e_{ij} (w_{ij}^{p})^2$   (11)

where $g_{ij}, e_{ij} \geq 0$ are free parameters which determine the average price level. It has to be stressed that both cost functions (10) and (11) use the same price $p_i$, on the basis of which the weights $w_{ij}$ are calculated. Using the system equation (8), we further define a local energy function between the pair of platforms P_i and P_j to be minimized:
$\tilde{J}_{ij} = v_{di}^{T} v_{di} = a_{ij} + b_{ij} w_{ij} + c_{ij} (w_{ij})^2 \rightarrow \min$   (12)

where $\tilde{J}_{ij} \geq 0$ and $a_{ij}, c_{ij} > 0$.
The question is how to combine the local energy function (12) and the utility function (10), and how the parameters in (10) are to be chosen. An intuitive choice is

$\tilde{b}_{ij} = |b_{ij}|, \quad \tilde{c}_{ij} = c_{ij}$   (13)

which guarantees $w_{ij} \geq 0$. It can also be shown that, independently of $a_{ij}$, near the equilibrium $v_{di} = 0$ and for $p_i = 1$, the energy function (12) reaches its minimum and the utility function (10) its maximum, respectively. With (13) the utility function (10) becomes

$U_{ij} = |b_{ij}| w_{ij}^{c} - c_{ij} p_i (w_{ij}^{c})^2$   (14)
A maximization of (14) leads to

$\frac{\partial U_{ij}}{\partial w_{ij}^{c}} = |b_{ij}| - 2 c_{ij} p_i w_{ij}^{c} = 0$   (15)

from which a local $w_{ij}^{c}$ is obtained:

$w_{ij}^{c} = \frac{|b_{ij}|}{2 c_{ij}} \cdot \frac{1}{p_i}$   (16)
Maximization of the local profit function (11) yields

$\frac{\partial \rho_{ij}}{\partial w_{ij}^{p}} = g_{ij} p_i - 2 e_{ij} w_{ij}^{p} = 0$   (17)

from which a local $w_{ij}^{p}$ is obtained:

$w_{ij}^{p} = \frac{p_i}{2 \eta_{ij}}, \quad \text{where} \quad \eta_{ij} = \frac{e_{ij}}{g_{ij}}$   (18)
The requirement of an equilibrium between the sums of the "produced" $w_{ij}^{p}$ and the "demanded" $w_{ij}^{c}$ leads to the balance equation

$\sum_{j=1}^{m} w_{ij}^{c} = \sum_{j=1}^{m} w_{ij}^{p}$   (19)
Substituting (16) and (18) into (19) gives the prices $p_i$ for the weights $w_{ij}^{p}$:

$p_i = \sqrt{\frac{\sum_{j=1}^{m} |b_{ij}| / c_{ij}}{\sum_{j=1}^{m} 1 / \eta_{ij}}}$   (20)
Substituting (20) into (16) yields the final weights $w_{ij}$ to be implemented in each local system. Once the new weights $w_{ij}$ are calculated, each of them has to be normalized with respect to $\sum_{j=1}^{m} w_{ij}$, which guarantees the above requirement $\sum_{j=1}^{m} w_{ij} = 1$.
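Putting (16), (18) and (20) together, the weight update of one platform reduces to a few lines (an illustrative NumPy sketch with our own naming; b, c and eta are the per-neighbour coefficients $b_{ij}$, $c_{ij}$ of (12) and $\eta_{ij}$ of (18)):

import numpy as np

def market_weights(b, c, eta):
    # b, c, eta: arrays over the neighbours j of platform i
    p_i = np.sqrt(np.sum(np.abs(b) / c) / np.sum(1.0 / eta))  # common price, eq. (20)
    w = np.abs(b) / (2.0 * c) / p_i                           # consumer weights, eq. (16)
    return w / np.sum(w)                                      # normalization: sum w_ij = 1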
V. MB OPTIMIZATION OF OBSTACLE AVOIDANCE
In the following, the optimization of obstacle avoidance between moving platforms by MB methods is addressed. Coming back to equation (8) of the system of mobile robots,

$v_{di} = vo_i + \sum_{j=1, i \neq j}^{m} w_{ij} v_{ij}^{p}$   (21)
where vo i is a subset of the RHS of (5) - a combination of different velocities (tracking velocity, control terms, etc.). v ij p reflects the repulsive forces between platforms P i and P j . The global energy function (12) reads
J̃_i = v_{d_i}^T v_{d_i} = vo_i^T vo_i + 2 vo_i^T Σ_{j=1, j≠i}^{m} w_ij v_ij^p + (Σ_{j=1, j≠i}^{m} w_ij v_ij^p)^T (Σ_{j=1, j≠i}^{m} w_ij v_ij^p)   (22)
The local energy function reflects only the energy of a pair of two interacting platforms P_i and P_j:
J̃_ij = v_{d_i}^T v_{d_i}
     = vo_i^T vo_i + (Σ_{k=1, k≠i,j}^{m} w_ik v_ik^p)^T (Σ_{k=1, k≠i,j}^{m} w_ik v_ik^p)
     + 2 Σ_{k=1, k≠i,j}^{m} w_ik vo_i^T v_ik^p
     + 2 w_ij (vo_i^T + Σ_{k=1, k≠i,j}^{m} w_ik v_ik^{pT}) v_ij^p
     + w_ij^2 (v_ij^{pT} v_ij^p)   (23)
Comparison of (23) and (12) yields
a_ij = vo_i^T vo_i + (Σ_{k=1, k≠i,j}^{m} w_ik v_ik^p)^T (Σ_{k=1, k≠i,j}^{m} w_ik v_ik^p) + 2 Σ_{k=1, k≠i,j}^{m} w_ik vo_i^T v_ik^p
b_ij = 2 (vo_i^T + Σ_{k=1, k≠i,j}^{m} w_ik v_ik^{pT}) v_ij^p
c_ij = v_ij^{pT} v_ij^p   (24)
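For illustration, the following sketch (hypothetical names and array layout, not the authors' code) evaluates a_ij, b_ij and c_ij of (24) for one pair (i, j), given the combined velocity vo_i, the repulsive velocities v_ik^p and the current weights w_ik.

```python
import numpy as np

def local_energy_coefficients(vo_i, v_p, w, j):
    """Coefficients a_ij, b_ij, c_ij of eq. (24) for the pair (i, j) (sketch).

    vo_i : (2,) array, combined tracking/control velocity of platform i
    v_p  : (m, 2) array, repulsive velocities v_ik^p towards the other platforms
    w    : (m,) array, current weights w_ik
    j    : index of the partner platform whose weight is being traded
    """
    vo_i = np.asarray(vo_i, dtype=float)
    v_p = np.asarray(v_p, dtype=float)
    w = np.asarray(w, dtype=float)

    # Sum over all k except the pair (i, j): simply all rows except j, assuming
    # row i is not present in v_p (a platform exerts no force on itself).
    mask = np.ones(len(w), dtype=bool)
    mask[j] = False
    s = (w[mask][:, None] * v_p[mask]).sum(axis=0)   # sum_k w_ik v_ik^p

    a_ij = vo_i @ vo_i + s @ s + 2.0 * (vo_i @ s)
    b_ij = 2.0 * (vo_i + s) @ v_p[j]
    c_ij = v_p[j] @ v_p[j]
    return a_ij, b_ij, c_ij
```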
VI. SIMULATION RESULTS
The following simulation results consider mainly the obstacle avoidance of the multi-robot system in a relatively small area. The sensor cone of a platform spans +/- 170 deg. Inside the cone a platform can see another platform within the range of 0-140 units. Platforms P_1 and P_3 approach head-on. At the same time platform P_2 crosses the course of P_1 and P_3 right before their avoidance manoeuvre. If only platforms P_1 and P_3 were involved, the avoidance manoeuvre would work without problems. According to the built-in traffic rules both platforms would move some steps to the right (seen from their local coordinate system) and keep heading to their targets after their encounter. Platform P_2 acts as a disturbance since both P_1 and P_3 react to the repulsive potential of P_2, which influences their avoidance manoeuvre. The result is a disturbed trajectory (see Fig. 5), characterized by drastic changes especially of the course of P_3 during the rendezvous situation. A collision between P_1 and P_3 cannot be excluded because of the crossing of the courses of P_1 and P_3. When we activate the MB optimization, we obtain a behavior that follows both the repulsive potential law for obstacle avoidance and the traffic rules during the head-on approach (see Fig. 6). There is no crossing of tracks between P_1 and P_3 anymore, due to the MB optimization of the repulsive forces between platforms P_1, P_2, and P_3 and a respective tuning of the weights w_ij. Figure 7 shows the resulting weights. We also notice that w_12 and w_13, w_21 and w_23, and w_31 and w_32 are pairwise mirror-inverted due to the condition Σ_{j=1, j≠i}^{m} w_ij = 1 (see also eq. (8)).
In further simulations, which are not shown here, the platforms move on circles with different speeds, similar diameters and centre points while avoiding other platforms and static obstacles on their tracks. To determine the smoothness of the trajectories, the averages of the curvatures along the trajectories of the platforms were calculated. It turned out that the use of MB optimization leads to a significant improvement of the smoothness of the trajectories (see Figs. 8 and 9).
VII. CONCLUSIONS
Navigation while avoiding obstacles by mobile robots can be successfully done by using a mixture of principles like virtual potential fields, traffic rules, and control methods. It has also been shown that a ’deformation’ of the central symmetry
Fig. 5. Approach, no MB optimization.
Fig. 6. Approach, with MB optimization.
Fig. 8. Moving on circular trajectories, no MB optimization.
Fig. 9. Moving on circular trajectories, with MB optimization.
of an obstacle is helpful because it takes into account whether a robot is moving towards or away from it. An important aspect presented in this paper is the market-based (MB) optimization of competing potential fields of mobile platforms. MB optimization imitates economical behavior between consumer and producer agents on the basis of a common price. By means of MB optimization some potential fields are strengthened and some weakened depending on the actual scenario. This is especially required when more than two robots meet within a small area, which makes a certain minimum distance between the robots and appropriate manoeuvres necessary. Therefore, MB navigation allows smooth motions in such situations. Simulation experiments with simplified robot kinematics and dynamics have shown the feasibility of the presented method. Although this does not belittle the presented results, it has to be underlined that the simulation of the vehicles only meets a part of the requirements for non-holonomic systems. Restrictions w.r.t. abrupt changes of motions in position and orientation are only indirectly considered. Restrictions regarding the weight, height, engine and wheels of the vehicles and the sensor have been neglected to a large extent. Therefore, future aspects of this work are the enhancement of the simulation regarding more realistic vehicles, with the goal of implementation on a real mobile robot.
REFERENCES
[1] O. Khatib. Real-time obstacle avoidance for manipulators and mobile robots. IEEE Int. Conf. on Robotics and Automation, St. Louis, Missouri, pages 500–505, 1985.
[2] Y. Koren and J. Borenstein. Potential field methods and their inherent limitations for mobile robot navigation. Proceedings of the IEEE Conference on Robotics and Automation, Sacramento, California, pages 1398–1404, April 7-12, 1991. [3] M.Aicardi and M. Baglietto. Decentralized supervisory control of a set of mobile robots. Proceedings of the European Control Conference 2001., pages 557–561, 2001.
[4] M. J. Sorensen. Feedback control of a class of nonholonomic Hamiltonian systems. Ph.D. Thesis, Department of Control Engineering, Institute of Electronic Systems, Aalborg University, Denmark, 2005.
[5] J. Alonso-Mora, A. Breitenmoser, M. Rufli, P. Beardsley, and R. Siegwart. Optimal reciprocal collision avoidance for multiple non-holonomic robots. Proc. of the 10th Intern. Symp. on Distributed Autonomous Robotic Systems (DARS), Switzerland, Nov 2010.
[6] A. Bouguerra. Robust execution of robot task-plans: A knowledge-based approach. Ph.D. Thesis, Örebro University, 2008.
[7] A. Wallace. Flow control of mobile robots using agents. 29th International Symposium on Robotics Birmingham, UK, pages 273–276,
1998.
[8] T.B.Gold, J.K. Archibald, and R.L. Frost. A utility approach to multi-agent coordination. Proceedings of the 2000 IEEE International Conference on Robotics and Automation, San Francisco, pages 2052– 2057, 1996. [9] E.W.Large, H.I.Christensen, and R.Bajcsy. Scaling the dynamic ap- proach to path planning and control: Competition among behavioral constraints. Vol. 18, No. 1, pages 37–58, 1999. [10] T.Teredesai and V.C.Ramesh. A multi-agent initiative system for real- time scheduling. SMC98 Conference, San Diego, CA, USA, pages 439– 444, 1998. [11] M. B. Dias, R. Zlot, N. Kalra, and A. Stentz. Market-based multirobot coordination: a survey and analysis. Proceedings of the IEEE, vol. 94,
no. 7, pages 1257–1270, July 2006. [12] S.H. Clearwater (ed.). Market-based control: A paradigm for distributed resource allocation. Proceedings of the 38th CDC, Phoenix, Arizona USA., World Scientific, Singapore., 1996. [13] O.Guenther, T.Hogg, and B.A.Huberman. Controls for unstable struc- tures. Proceedings of the SPIE, San Diego, CA, USA, pages 754–763,
1997.
[14] H. Voos and L. Litz. A new approach for optimal control using market-based algorithms. Proceedings of the European Control Conference ECC99, Karlsruhe, 1999. [15] R. Palm. Synchronization of decentralized multiple-model systems by market-based optimization. IEEE Trans Syst Man Cybern B, Vol. 34, pages 665–672, Feb 2004. [16] B. R. Fajen and W. H. Warren. Behavioral dynamics of steering, obstacle avoidance, and route selection. Journal of Experimental Psychology: Human Perception and Performance, Vol. 29, No. 2, pages 343–362, 2003. [17] R. Palm, D. Driankov, and H. Hellendoorn. Model based fuzzy control. Springer-Verlag Berlin New York Heidelberg, 1997.
A Sparse Hybrid Map for Vision-Guided Mobile Robots
Feras Dayoub
Grzegorz Cielniak
Tom Duckett
Department of Computing and Informatics, University of Lincoln, Lincoln, UK
{ fdayoub,gcielniak,tduckett} @lincoln.ac.uk
Abstract — This paper introduces a minimalistic approach to produce a visual hybrid map of a mobile robot’s working environment. The proposed system uses omnidirectional images along with odometry information to build an initial dense pose- graph map. Then a two level hybrid map is extracted from the dense graph. The hybrid map consists of global and local levels. The global level contains a sparse topological map extracted from the initial graph using a dual clustering approach. The local level contains a spherical view stored at each node of the global level. The spherical views provide both an appearance signature for the nodes, which the robot uses to localize itself in the environment, and heading information when the robot uses the map for visual navigation. In order to show the usefulness of the map, an experiment was conducted where the map was used for multiple visual navigation tasks inside an office workplace.
I. INTRODUCTION
Different methods have been introduced to tackle the problem of acquiring a map of a mobile robot’s environment. The problem is called simultaneous localization and mapping (SLAM). Due to the importance of solving the SLAM prob- lem before truly autonomous mobile robots can be built, the robotics community has given much attention to SLAM over the last decade. The result is a wide variety of methods, with each method having its own advantages and disadvantages. Recently, the methods which deal with SLAM as a non- linear graph optimization problem, i.e. graph-based SLAM, have gained increasing attention in the literature. The output of these methods is a pose-graph where the nodes associate past poses of the robot with map features. The edges of the graph model spatial constraints between the nodes [1]. The optimization step aims to select the spatial configuration of the nodes, which best satisfies the constraints encoded in the edges. Generally, the output from the graph optimization algorithms is a pose-graph map, where the nodes of the graph are created at every step the robot has performed. Therefore, the graph is dense and it contains redundant information that can be removed leading to a more compact map, which is preferable in the case of mobile robots with limited resources. In this paper we use a graph-based SLAM algorithm to produce an initial dense pose-graph map of the environment. Then the initial map is used to build a sparse hybrid map consisting of two levels, global and local. Fig. 1 illustrates the hybrid map. On the global level, the world is represented as a topological map. The topological map is extracted from the initial dense pose-graph using a dual clustering approach, introduced in this paper. The proposed approach selects nodes from the initial map which are located in areas such as doorways and corners, allowing the topological map to
maintain full coverage of the environment while minimizing the required number of nodes. On the local level of the map, each node stores a spherical view representation of image features extracted from an omnidirectional image recorded at the position of the node. The spherical views contain the 3D location of the image features on a sphere. Thus, we only store the direction of the features (but not their distance or depth) from the centre of the sphere, which corresponds to the centre of that node. The spherical views are used for estimating the robot’s heading in a visual navigation system where we use the map to perform a path following task. The paper is constructed as follows. In Section II, we discuss related work in the field. Section III contains details of the method to build the initial map. In Section IV we present the dual clustering algorithm which selects the nodes of the global map. A visual navigation strategy is presented in Section V. The experiment and results are presented in Section VI. Finally, we draw conclusions in Section VII.
II. BACKGROUND
Different vision-based mapping methods using graph op- timization have been proposed [2], [3], [4]. These methods follow the same general approach where the map is built as a graph, with the nodes containing camera views from the environment and the graph edges are expressed as geometric constraints between these views. A loop closing mechanism is deployed to detect when previously mapped areas are revisited. When a loop is detected, a new constraint link is added to the graph and then a graph optimizer is invoked to correct the map. Although the work presented in this paper follows the same general approach, we differ by proposing a spherical view representation for the nodes in the local level of the hybrid map and also we introduce a dual clustering algorithm to reduce the number of nodes in the global level. The loop detection mechanism in this paper uses a sim- ilarity measure between the views stored in the nodes of the map. The detection mechanism could simply be a direct one-to-one view matching procedure. However, more sophis- ticated methods for loop detection can be used to provide real time performance when the graph contains a large number of nodes. Such methods include the hierarchical vocabulary tree [5] and the FABMAP visual vocabulary [6]. A naive solution to reduce the number of nodes in the graph would be based on the time stamp of the nodes, where the graph is sampled using a fixed time step. This method would fail if the robot stands still for some time or has a changing speed while mapping the environment. Another simple solution would be based on the distance between the nodes [7]. However, this method does not take into
Fig. 1. Proposed hybrid map with two levels i.e. global and local. The environment is represented as an adjacency graph of nodes on the global level of the map and each node on the local level represents the 3D location of image features on a sphere. Our method represents the direction of the features (but not their distance or depth) from the centre of the sphere, which corresponds to the centre of that node.
account the rapid change in the appearance of the robot’s surroundings which could occur when the robot crosses a door or goes round a corner. Successive images can differ
considerably on the two sides of the doors or corners, which could become a challenge when the robot tries to actually use the map for navigation. This implies that image similarity should be considered in the process. In general, the methods which use image similarity to reduce the number of nodes in the map start by clustering the nodes based on image similarity and then choose key nodes as a representative for each image group [8]. These methods can be used when information about the geometric distances between the nodes
is not available. However, in our case the distance between
the nodes is provided by the graph optimization approach and we can use that as an additional source of information.
Instead of choosing the key nodes in the graph based on the
geometric distance alone or the similarity alone, we propose
a method which selects the nodes in the graph based on a
combination of the two metrics. Recently, Ila et al. [9] introduced a pose-graph mapping method where they measure statistical content on links to enforce node sparsity and limited graph connectivity. The result is a compact map which contains non-redundant nodes and links. The main criteria with which the nodes are selected relates directly to reducing the uncertainty in the estimation of the robot position. This tends to produce densely dis- tributed nodes when the robot performs curved paths and sparsely distributed nodes when the robot performs straight forward paths. This is because of the increasing uncertainty resulting from turning motions. The method presented in this
paper uses different criteria to reduce the number of nodes in the map. Our main goal is to produce a map with sparse nodes located in places that are suitable to use the map for visual navigation tasks.
III. BUILDING THE HYBRID MAP
We aim to produce a hybrid map extracted from an initial dense pose-graph map. The hybrid map consists of a global level, which contains a sparse topological map, and a local level, which stores a spherical view for each node (see Fig. 1). Each spherical view is generated from an omnidirectional image recorded from the position of the node. The initial map building process is carried out based on a stop-sense-go strategy, where the robot is driven by a human operator. The driver follows the following routine. The robot starts a mapping step by capturing an omnidirectional image from a camera on-board, while it is static, along with the odometry reading from the wheel encoder. Then the robot moves a short distance forward and stops. If there is a need to perform a rotation, the robot rotates a certain angle and ends the current mapping step by stopping. The driver repeats as many mapping steps as required to cover the environment. The constraints between consecutive poses along the trajectory of the robot consist of two components, i.e. translation and rotation. These constraints can be extracted from the internal odometry of the robot. However, using the odometry alone is not always accurate enough to build the constraint network. First, the odometry measurements are affected by noise due to wheel drift and slippage, especially during rotation. Second, using odometry alone cannot provide any information about loop closure. To deal with these limitations, we use measurements from the omnidirectional vision sensor on-board as an input for a Bayesian filtering framework (extended Kalman filter) to reduce the error from the odometry measurements. In addition to that, the vision sensor is used to perform loop closure. A loop closure results in an edge in the constraint network which relates the current robot pose with a former robot pose. These loop closure edges contradict the pose estimates resulting from plain odometry. In order to correct the structure of the graph after a loop is detected, we use the graph optimization algorithm TORO [10], which considers each edge in the pose graph as a cost function for the two connected nodes and re-arranges the nodes in the graph such that the total costs associated with all edges are minimized.
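As a data-structure illustration (a generic sketch under invented names; it is not the TORO interface actually used), the constraint network can be kept as a list of poses plus relative-pose edges, where odometry contributes sequential edges and a visual loop closure contributes an edge between non-consecutive nodes:

```python
from dataclasses import dataclass, field

@dataclass
class PoseGraph:
    """Minimal pose-graph container (sketch, not the TORO API)."""
    poses: list = field(default_factory=list)   # [(x, y, theta), ...]
    edges: list = field(default_factory=list)   # [(i, j, (dx, dy, dtheta)), ...]

    def add_pose(self, pose, odom_constraint=None):
        self.poses.append(pose)
        if odom_constraint is not None and len(self.poses) > 1:
            # sequential edge from the previous node to the new one
            self.edges.append((len(self.poses) - 2, len(self.poses) - 1, odom_constraint))

    def add_loop_closure(self, i, j, constraint):
        # non-sequential edge produced by visual loop detection
        self.edges.append((i, j, constraint))

graph = PoseGraph()
graph.add_pose((0.0, 0.0, 0.0))
graph.add_pose((0.35, 0.0, 0.0), odom_constraint=(0.35, 0.0, 0.0))
graph.add_loop_closure(0, 1, (0.34, 0.01, 0.0))
# graph.poses and graph.edges would then be handed to an optimizer such as TORO.
```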
A. Relations Based on Odometry
Let the robot’s pose at any given time step t be represented as p t = (x t , y t , θ t ), where (x t , y t ) are the coordinates of the robot and θ t is the current heading. In our stop-sense- go strategy, robot motion between two consecutive poses is approximated by a translation d followed by a rotation δ, and the model which obtains the pose p t from p t−1 is
  [x_t]     [x_{t-1}]     [ d̂_t cos(θ_{t-1}) ]
  [y_t]  =  [y_{t-1}]  +  [ d̂_t sin(θ_{t-1}) ]   (1)
  [θ_t]     [θ_{t-1}]     [ δ̂_t              ]
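A direct transcription of eq. (1), with the measured translation d̂ and rotation δ̂ passed in as plain numbers (noise handling omitted), could look as follows:

```python
import math

def propagate(pose, d, delta):
    """Stop-sense-go odometry model of eq. (1): translation d, then rotation delta."""
    x, y, theta = pose
    return (x + d * math.cos(theta),
            y + d * math.sin(theta),
            theta + delta)

print(propagate((0.0, 0.0, 0.0), d=0.35, delta=math.radians(10)))
```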
the map for tasks such as visual navigation. Fig 2 illustrates this situation.
A. The clustering algorithm
Our clustering algorithm is inspired by a fast imple- mentation of the dual clustering method presented in [14]. The algorithm performs clustering based on density in the geometric domain, while maximizing the similarity in the non-geometric domain. In order to achieve this, for each cluster center in the geometrical domain, the neighborhood of a given radius Ψ d should contain a minimum number of points. And in the non-geometric domain, the similarity between the neighborhood points of each cluster and the center of that cluster should be above a certain threshold Ψ s . The clustering process is performed incrementally as follows:
1) Initialize a cluster by starting from the first unclassified node in the graph N_p (based on the time stamp). If the number of nodes in the neighborhood of N_p is less than a predefined threshold κ, the node is ignored. Otherwise, create a new cluster C.
2) Insert all nodes from the neighborhood of N_p which have similarity with N_p greater than Ψ_s in C.
3) Compute the center of the cluster N_ref by selecting the node which is most similar to all other nodes in the cluster, as
   N_ref = argmax_{k ∈ C} Σ_{j ∈ C} sim(N_j, N_k),
   where sim(N_j, N_k), the similarity between the two views in the nodes N_j and N_k, is the number of matched image features.
4) Check an arbitrary node N_q from the neighborhood of N_p. If the number of nodes in the neighborhood of N_q is at least κ, and the similarity between an unclassified node N_o in the neighborhood of N_q and the center of the cluster is above Ψ_s, then insert N_o into the cluster C and recompute the center of the cluster as in step 3. Repeat step 4 until the cluster C cannot be extended any more.
5) Repeat all the steps until all the nodes in the graph are classified.
We discuss how to set the clustering parameters in our experiments in Section VI.
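A compact sketch of the steps above is given below (illustrative only; the neighbourhood lookup, the similarity function and the parameter values are placeholders rather than the implementation used in the experiments):

```python
def dual_clustering(nodes, neighbors, sim, psi_s, kappa):
    """Sketch of the dual clustering of the pose-graph nodes.

    nodes     : iterable of node ids in time-stamp order
    neighbors : function node -> set of node ids within radius Psi_d
    sim       : function (a, b) -> number of matched image features
    psi_s     : similarity threshold Psi_s
    kappa     : minimum neighbourhood size
    Returns a list of (center, members) tuples.
    """
    unclassified = set(nodes)
    clusters = []

    def center_of(members):
        # node most similar to all other members (step 3)
        return max(members, key=lambda k: sum(sim(j, k) for j in members))

    for n_p in nodes:                                   # step 1
        if n_p not in unclassified or len(neighbors(n_p)) < kappa:
            continue
        cluster = {n_p}
        cluster |= {n for n in neighbors(n_p)
                    if n in unclassified and sim(n, n_p) >= psi_s}   # step 2
        ref = center_of(cluster)                        # step 3
        frontier = list(cluster)
        while frontier:                                 # step 4: expand the cluster
            n_q = frontier.pop()
            if len(neighbors(n_q)) < kappa:
                continue
            for n_o in neighbors(n_q):
                if n_o in unclassified and n_o not in cluster and sim(n_o, ref) >= psi_s:
                    cluster.add(n_o)
                    frontier.append(n_o)
                    ref = center_of(cluster)
        unclassified -= cluster
        clusters.append((ref, cluster))
    return clusters                                     # step 5: all nodes classified
```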
V. USING THE MAP FOR NAVIGATION
Every map can be judged by its usefulness for practical purposes. In our case the map is used for a path following routine inside an indoor environment. When robots work inside an indoor environment, their navigation generally is restricted to what the humans con- sider to be a path inside that environment, such as corridors and the areas between the furniture. These routes effectively simplify the task of navigation by limiting the robot to only one degree of freedom along the path. And by representing this path as a sequence of images, the following framework
Fig. 3. The proposed visual navigation strategy. N_j is the current node in the path and N_k is the next node. The dashed line is the path, θ_j and θ_k are the relative orientations between the robot's heading and the reference orientation of the nodes N_j and N_k respectively. θ_r is the robot's desired heading.
of the appearance-based approach for visual navigation is used in the literature [15], [16], [17]:
• The path is first built during a learning phase where the robot is controlled by a human operator. During this phase the robot captures a sequence of images along the path.
• A subset of the captured images is selected to represent the reference images along the path.
• During the replay phase, the robot starts near the first position and is required to repeat the same path.
• The robot extracts the control commands of its motion by comparing the currently observed image with the reference images along the path.
In this work we adopted a similar framework for visual path following using a sequence of nodes from the map. Fig. 3 illustrates the navigation strategy. First the robot localizes itself to one of the nodes in the path. This is done by selecting the node which has the highest similarity score with the currently observed view. Let S_j be the similarity score, i.e. the number of matched points. The similarity score is also computed between the current view and the next node in the sequence. Let S_k be the similarity score with the next node. Then the following ratios are computed:
ω_j = S_j / (S_j + S_k),   ω_k = S_k / (S_j + S_k)   (4)
The heading angle θ_r is computed as a weighted sum:
θ_r = ω_j · θ_j + ω_k · θ_k   (5)
where θ_j and θ_k are the relative orientations between the current view and the nodes N_j and N_k respectively (see Fig. 3). By following this navigation strategy, the nodes in the path can be considered as directional signs which lead the robot toward its goal. In order to estimate the relative orientation between two views, we use epipolar geometry. The method first estimates
Fig. 4. Left: The true trajectory of the robot. Right: The experimental platform. An ActivMedia P3-AT robot equipped with an omnidirectional vision system.
the so-called essential matrix E. Then, based on the method introduced by Hartley and Zisserman in [18], the essential matrix is factored to give Eq. 6 which contains the rotation matrix R ∈ SO(3) and the skew-symmetric matrix [t] × of the translation vector t ∈ R 3 .
E = [t]_× R   (6)
After that, the relative orientation is extracted from the rotation matrix R.
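The localization and heading rule of eqs. (4)–(5) then reduces to a few lines; in this sketch the similarity scores S_j, S_k and the relative angles θ_j, θ_k are assumed to come from the feature matching and essential-matrix decomposition described above:

```python
def desired_heading(s_j, s_k, theta_j, theta_k):
    """Weighted heading of eqs. (4)-(5) from similarity scores and relative angles."""
    w_j = s_j / (s_j + s_k)                 # eq. (4)
    w_k = s_k / (s_j + s_k)
    return w_j * theta_j + w_k * theta_k    # eq. (5)

# e.g. 80 feature matches with the current node, 40 with the next one
print(desired_heading(80, 40, theta_j=0.10, theta_k=0.35))
```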
VI. EXPERIMENTS AND RESULTS
Our experimental platform is an ActivMedia P3-AT robot equipped with a GigE progressive camera (Jai TMC- 4100GE, 4.2 megapixels) with a curved mirror from 0- 360.com. The following experiment was carried out in an office floor at the University of Lincoln. We drove the robot on a tour between the offices while recording a set of omnidirectional images. The resulting database contains 222 images with approximately 35 cm between the consecutive images. Fig. 4 shows the positions of the recorded images obtained using the GMapping library [19]. The GMapping algorithm provides a SLAM solution using laser range- finder scans based on a Rao-Blackwellized particle filter. The output of the algorithm is an estimate of the robot trajectory along with an occupancy grid map of the environment. We would like to emphasize that our method does not use laser scan matching. We use this information for visualisation purposes only. The robot starts from location (x 0 = 0, y 0 = 0) and comes back to the same start point. The dashed line in Fig. 5 shows the trajectory of the robot based on the odometry alone where the drift effect is clear. Fig. 5 also shows the trajectory after we added the EKF as a filter for the observation of the robot heading. Although the robot does not have any method to detect loop closure, the resulting error in estimating the trajectory is smaller. The final output of the mapping step
Fig. 5. The green dashed line represents the trajectory of the robot from the odometry. The red line is the result of using the vision-estimated relative orientation as an observation with an EKF filter. The black line is the final output after loop closing and graph relaxation.
is shown as well in Fig. 5, where the graph optimization algorithm produced a map which is globally consistent. In the next step, we applied our dual clustering algorithm which pruned the dense graph produced from the previous step and generated the final map. The number of nodes in the final map is affected by the initial neighborhood radius Ψ d and the image similarity threshold Ψ s . The parameter Ψ d gives an initial radius for the node to cover and then the algorithm expands the node based on the image similarity threshold Ψ s . In our experiments the map is intended to be used for autonomous visual navigation; therefore the sparsity of the map should not result in gaps where the robot cannot estimate its heading relative to one of the nodes in the map. In order to achieve that we assign Ψ d to 1 m as a minimum distance between the nodes and Ψ s to 35 feature points as the minimum number of matched points between any view in the node and the center of the node; and finally we assign κ, the minimum number of points required in the neighborhood of any node, to 1 m. Fig. 6 shows the result of the pruning step where a set of 23 nodes was selected. In order to test the map for visual navigation, we per- formed five path following runs using the nodes of the map. The paths were chosen randomly, while covering all nodes in the map. At the start of each run the robot was given a sequence of nodes to follow. The robot then followed each path using the navigation strategy presented in Section. V. In addition, an array of sonar sensors was used for obstacle avoidance. The same five runs were then re-executed using manual drive where a human driver steers the robot taking the best track where the robot was driven through the shortest distance and at the same time was kept away from obstacles. Table I shows the results for both the autonomous and the manual runs. In order to show the robustness of the navigation procedure, the 5 autonomous runs were executed
Fig. 6. The selected nodes from the dual clustering algorithm along with five sequences of nodes which the robot was given to follow.
twice. The mean and minimum sonar range distances to obstacles were calculated for each run along with the traveled distance. These values are used as an indication of the quality of the navigation performance. As the results show, although the robot takes a slightly longer distance to reach its goal, the autonomous routes were smooth and similar to the manual runs. The average mean distance to any obstacle was 1.00 m for the manual runs and 0.88 m for the autonomous runs. The average value of the minimum range to obstacles was 0.49 m for manual driving and 0.44 m for the autonomous runs.
VII. CONCLUSION
This paper introduced a minimalistic mapping method using an omnidirectional vision sensor. The produced map is hybrid with two levels of representation, global and local. On the global level, the world is represented as a graph of adjacent nodes with each node containing a group of image features. On the local level, the features inside each node form a spherical view, which is used for estimating
TABLE I
VISION GUIDED NAVIGATION RESULTS

                    Distance [m]   Mean Range [m]   Minimum Range [m]
Path 1   Manual        22.87            0.85              0.43
         Auto1         24.07            0.82              0.42
         Auto2         24.28            0.82              0.44
Path 2   Manual        14.30            1.20              0.58
         Auto1         15.34            0.90              0.45
         Auto2         14.90            0.99              0.48
Path 3   Manual         9.81            0.94              0.56
         Auto1         10.66            0.85              0.41
         Auto2         10.70            0.87              0.51
Path 4   Manual         8.97            1.04              0.42
         Auto1          9.27            0.97              0.37
         Auto2          9.09            1.02              0.42
Path 5   Manual        15.52            0.98              0.48
         Auto1         16.42            0.85              0.46
         Auto2         16.69            0.78              0.47
the robot’s heading using multi-view geometry. The map is built using a sequence of images along with odometry information. The global consistency of the map is achieved using a graph optimization algorithm. In order to reduce the number of nodes in the map, a dual clustering algorithm for post-processing the initial map was developed. The map was used in an experiment where the robot performed multiple path following tasks.
REFERENCES
[1] F. Lu and E. Milios. Globally consistent range scan alignment for environment mapping. Autonomous Robots, 4(4):333–349, 1997. [2] F. Fraundorfer, C. Engels, and D. Nistr. Topological mapping, localization and navigation using image collections. In Proc. IEEE International Conference on Intelligent Robots and Systems (IROS), pages 3872–3877, 2007. [3] E. Eade and T. Drummond. Monocular slam as a graph of coalesced observations. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 1–8, 2007. [4] K. Konolige, J. Bowman, J. D. Chen, P. Mihelich, M. Calonder, V. Lepetit, and P. Fua. View-based maps. The International Journal of Robotics Research, 29(8):941, 2010. [5] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2006. [6] M. Cummins and P. Newman. Probabilistic appearance based navi- gation and loop closing. In Proc. IEEE International Conference on Robotics and Automation (ICRA) , pages 2042–2048, 2007. [7] M. Jogan and A. Leonardis. Robust localization using an omnidi- rectional appearance-based subspace model of environment. Robotics and Autonomous Systems , 45(1):51–72, 2003. [8] O. Booij, Z. Zivkovic, and B. Krose. Sampling in image space for vision based SLAM. In Proc. Workshop on the Inside Data Association, Robotics: Science and Systems Conference (RSS), 2008.
[9] V. Ila, J. M Porta, and J. Andrade-Cetto. Information-based compact pose SLAM. IEEE Transactions on Robotics, 26(1):78–93, 2010. [10] G. Grisetti, C. Stachniss, S. Grzonka, and W. Burgard. A tree parameterization for efficiently computing maximum likelihood maps using gradient descent. In Proc. of Robotics: Science and Systems (RSS) , 2007. [11] N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian bayesian state estimation. In Radar and Signal Processing, IEE Proceedings F, volume 140, pages 107–113,
2002.
[12] C. Geyer and K. Daniilidis. A unifying theory for central panoramic
systems and practical implications. Proc. European Conference on Computer Vision (ECCV) , pages 445–461, 2000.
[13] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded Up Robust Features. In Proc. European Conference on Computer Vision (ECCV),
2006.
[14] J. Zhou, F. Bian, J. Guan, and M. Zhang. Fast implementation of dual clustering algorithm for spatial data mining. In Proc. International Conference on Fuzzy Systems and Knowledge Discovery , volume 3, pages 568–572. IEEE Computer Society, 2007.
[15] G. Blanc, Y. Mezouar, and P. Martinet. Indoor navigation of a
wheeled mobile robot along visual routes. In Proc. IEEE International
Conference on Robotics and Automation (ICRA), pages 3354–3359,
2005.
[16] T. Goedeme,´ M. Nuttin, T. Tuytelaars, and L. Van Gool. Omnidirec-
tional Vision Based Topological Navigation. International Journal of
Computer Vision , 74(3):219–236, 2007.
[17] O. Booij, B. Terwijn, Z. Zivkovic, and B. Krose. Navigation using an
appearance based topological map. Proc. of the IEEE International
Conference on Robotics and Automation (ICRA), 2007.
[18] R. Hartley and A. Zisserman. Multiple View Geometry in Computer
Vision . Cambridge University Press, 2 edition, March 2004.
[19] G. Grisetti, C. Stachniss, and W. Burgard. Improved techniques
for grid mapping with Rao-Blackwellized particle filters. IEEE
Transactions on Robotics, 23(1):34–46, 2007.
An incremental hybrid approach to indoor modeling
Marco A. Gutiérrez
Pilar Bachiller
Luis J. Manso
Pablo Bustos
Pedro Núñez
Department of Computer and Communication Technology, University of Extremadura, Spain
Abstract — Most of a mobile robot's basic functions are highly dependent on a model of its environment. Proper modeling is crucial for tasks such as local navigation, localization or route planning. This paper describes a novel solution for building models of indoor environments. We use a 3D laser on a mobile robot to scan the surroundings and obtain sensing information. While the robot explores the environment, perceived points are clustered forming models of rooms. Room modeling is solved using a new variation of the Hough transform. The result of the modeling process is a topological graph that represents the rooms (nodes) and their connecting doors (edges). Each node in this representation contains the metric parametrization of the room model. Using this basic metric information, robots do not need to maintain in parallel a metric map of the environment. Instead, this metric map can be totally or partially built from the topological representation whenever it is necessary. To test the approach we have carried out a modeling experiment in a real environment, obtaining promising results.
Index Terms — Environment modeling, hybrid representation, active exploration, 3D laser scanning.
I. INTRODUCTION
In recent years, the problem of map building has become an important research topic in the field of robotics. A wide variety of proposals use dense metric maps to represent the robot environment. While a metric representation is necessary for solving certain robot tasks, many others require a more qualitative organization of the environment. Topological maps provide such a qualitative description of the space and consti- tute a good support to the metric information. Several approaches on mobile robotics propose the use of topological representation to complement the metric infor- mation of the environment. In [16] it is proposed to create off-line topological graphs by partitioning metric maps into regions separated by narrow passages. In [14] the environment
is represented by a hybrid topological-metric map composed
of a set of local metric maps called islands of reliability.
[17] describes the environment using a global topological map that associates places which are metrically represented by infinite lines belonging to the same places. [18] constructs
a topological representation as a route graph using Voronoï
diagrams. In [19] the environment is represented by a graph whose nodes are crossings (corners or intersections). [9] organizes the information of the environment in a graph of planar regions. [3] proposes an off-line method that builds a topological representation, whose nodes correspond to rooms, from a previously obtained metric map.
In this paper, we propose a novel incremental modeling method that provides a hybrid topological/metric represen- tation of indoor environments. Our approach improves the current state of the art in several aspects. Firstly, as in [3], the topological space is represented by a graph whose
nodes encode rooms and whose edges describe connections between rooms. However, in our proposal, the topological
representation is incrementally built from the local information provided by the sensor. This means that no global metric map is needed to extract a topological description of the environment. In addition, instead of maintaining in parallel a dense metric map, each topological node contains a minimal set of metric information that allows building a map of a part
of the environment when it is needed. This approach reduces
drastically the computation the robot must perform to maintain
an internal representation of its surroundings. In addition, it
can be very helpful for solving certain tasks in an efficient way, such as global navigation or self-localization.
We also present a new variation of the Hough transform used for room modeling. Under the rectangularity assumption, this method provides the geometrical parametrization of a room from a set of points. It is shown how this proposal improves the results compared to other approaches. The rest of the paper is organized as follows. In section II, an overview of the approach is presented. Section III describes the modeling method that provides a description
of the environment in terms of rooms and their connections.
Along section IV, the process for creating the proposed hybrid representation is detailed. Experimental results in a real environment are presented in section V. Finally, section VI summarizes the main conclusions of our work.
II. OVERVIEW OF THE APPROACH
As a first approach towards environment modeling, we focus on indoor environments composed of several rooms connected through doors. Rooms are considered approximately rectangular and contain several objects on the floor.
To solve the low-level perceptual process, we have equipped one of our mobile robots with a motorized 2D LRF (Laser Range Finder). Stereo cameras, static 2D LRFs and 3D LRFs constitute an alternative for this purpose. It is possible nowadays to use stereo vision or even the very popular RGB-D PrimeSense sensor [1] to retrieve depth from images and, in consequence, be able to map the robot surroundings. However, most of these vision-based techniques have been studied under almost ideal circumstances. Uniformly colored surfaces or light variations are some of the problems these solutions might face. In addition, compared to LRF performance, sensors like RGB-D have small fields of view and low depth precision (3 cm in a 3 m scan). LRFs constitute a strong alternative to these sensors, especially when facing non-ideal environments.
A wide variety of this kind of sensor has lately become available: point range finders, 2D LRFs and 3D LRFs. 3D LRFs seem promising, but their high cost makes them in
Fig.: Rectangle detection using the variation of the Hough transform based on lines (b) and the proposed one based on segments (c). (a) 2D view of the occupancy grid. (b) Rectangle detection using lines. (c) Rectangle detection using segments.
A. Room detection
In our room detection method, the Hough space is parameterized by (θ, d, p), being θ and d the parameters of the line representation (d = x cos(θ) + y sin(θ)) and |p| the length of a segment in the line. For computing p it is assumed that one of the extreme points of its associated segment is initially fixed and situated at a distance of 0 to the perpendicular line passing through the origin. Under this assumption, being (x, y) the other extreme point of the segment, its signed length p can be computed as:
p = x cos(θ + π/2) + y sin(θ + π/2) (1)
the two segments with non-fixed extreme points, respectively, according to equation 1. Since a rectangle is composed of four segments, the Hough space parameterized by (θ, d, p) allows computing the total number of points included in the contour of the rectangle. Thus, considering a rectangle expressed by its vertices V_1 = (x_1, y_1), V_2 = (x_2, y_2), V_3 = (x_3, y_3), V_4 = (x_4, y_4) (in clockwise order), the number of points in its contour, denoted as H_r, can be computed as:
H_r = H_{1↔2} + H_{2↔3} + H_{3↔4} + H_{4↔1}   (5)
Considering the restrictions about the segments of a rectangle and using equation 4, each H_{i↔j} of (5) can be rewritten as follows:
H_{1↔2} = |H(α, d_{1↔2}, d_{4↔1}) − H(α, d_{1↔2}, d_{2↔3})|
H_{2↔3} = |H(α+π/2, d_{2↔3}, d_{1↔2}) − H(α+π/2, d_{2↔3}, d_{3↔4})|
H_{3↔4} = |H(α, d_{3↔4}, d_{2↔3}) − H(α, d_{3↔4}, d_{4↔1})|
H_{4↔1} = |H(α+π/2, d_{4↔1}, d_{3↔4}) − H(α+π/2, d_{4↔1}, d_{1↔2})|
being α the orientation of the rectangle and d_{i↔j} the distance from the origin to the straight line defined by the points V_i and V_j. Since H_r expresses the number of points in the rectangle defined by (α, d_{1↔2}, d_{2↔3}, d_{3↔4}, d_{4↔1}), the problem of obtaining the best rectangle given a set of points can be solved by finding the combination of (α, d_{1↔2}, d_{2↔3}, d_{3↔4}, d_{4↔1}) that maximizes H_r. This parametrization of the rectangle can be transformed into a more practical representation based on the five-tuple (α, x_c, y_c, w, h), being (x_c, y_c) the center of the rectangle and w and h its dimensions. This transformation can be achieved using the following expressions:
x_c = ((d_{1↔2} + d_{3↔4}) / 2) cos(α) − ((d_{2↔3} + d_{4↔1}) / 2) sin(α)   (10)
y_c = ((d_{1↔2} + d_{3↔4}) / 2) sin(α) + ((d_{2↔3} + d_{4↔1}) / 2) cos(α)   (11)
b) Compute d = x cos(θ) + y sin(θ).
c) Compute the discrete value d d associated to d.
d) Compute p = x cos(θ + π/2) + y sin(θ + π/2).
e) Compute the discrete value p d associated to p.
f) For p_d = p_d, ..., p_dMax: increment H(θ_d, d_d, p_d) by 1.
3) Compute argmax H_r(α, d_{1↔2}, d_{2↔3}, d_{3↔4}, d_{4↔1}).
4) Obtain the rectangle r = (α, x_c, y_c, w, h) using equations 10, 11, 12 and 13.
using equa- |
As it can be observed, the height of walls is only taken into account through histogram contributions. This is because walls correspond to 2D segment with higher histogram values than any other plane perpendicular to the floor. Thus, it is not necessary to explicitly consider the height of points in the room detection method. Even though this method is computationally expensive, in practice, its complexity can be significantly reduced in two ways. Firstly, instead of computing H from the whole occupancy grid, it can be updated using only those cells whose occupancy state has changed. In addition, it is not necessary to apply step 3 over the entire parameter space, since only rectangles of certain dimensions are considered rooms. Thus,
it is assumed a specific rank of w and h that limits the search
to those values of d 1↔2 , d 2↔3 , d 3↔4 and d 4↔1 fulfilling that rank.
B. Door detection
The proposed 3D Hough space is also used for door detec- tion. Doors are free passage zones that connect two different
rooms, so they can be considered as empty segments of the corresponding room rectangle (i.e. segments without points). Taking this into account, once the room model is obtained, doors can be detected by analyzing each wall segment in the 3D Hough space. Therefore, for each segment of the rectangle, defined by V i and V j , two points D k = (x k , y k ) and
D l = (x l , y l ) situated on the inside of that segment constitute
a door segment if they verify:
H k↔l = |H(θ i↔j , d i↔j , p k ) − H(θ i↔j , d i↔j , p l )| = 0 (14)
being θ i↔j and d i↔j the parameters of the straight line defined by V i and V j and p k and p l the signed lengths of the segments for D k and D l :
p k = x k cos(θ i↔j + π/2) + y k sin(θ i↔j + π/2) (15)
p l = x l cos(θ i↔j + π/2) + y l sin(θ i↔j + π/2) (16)
Assuming p i ≤ p k < p l ≤ p j and a minimum length
l for each door segment, the door detection process can be
carried out by verifying equation 14 for every pair of points between V i and V j , such that p l − p k ≥ l. Starting from the discrete representation of the Hough space, this process can be summarized in the following steps:
1) Compute the discrete value θ_d associated to θ_{i↔j}.
2) Compute the discrete value d_d associated to d_{i↔j}.
3) Compute the discrete value p_di associated to p_i.
4) Compute the discrete value p_dj associated to p_j.
5) Compute the discrete value l_d associated to l (minimum length of doors).
6) p_dk ← p_di
7) While p_dk ≤ p_dj − l_d:
   a) p_dl ← p_dk + 1
   b) While p_dl < p_dj and |H(θ_d, d_d, p_dk) − H(θ_d, d_d, p_dl)| = 0: p_dl ← p_dl + 1
   c) If p_dl − p_dk > l_d:
      i) Compute the real value p_k associated to p_dk.
      ii) Compute the real value p_l associated to (p_dl − 1).
      iii) Compute the door limits D_k and D_l from p_k and p_l.
      iv) Insert the new door segment with extreme points D_k and D_l into the list of doors.
   d) p_dk ← p_dl
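A direct transcription of steps 6) and 7) is given below (a sketch; the mapping from bin indices back to metric lengths is left as a user-supplied function):

```python
def detect_doors(H, t_d, d_d, p_di, p_dj, l_d, to_real):
    """Door detection along one wall segment (sketch of steps 6-7).

    H          : discrete Hough accumulator H[theta, d, p]
    t_d, d_d   : discrete line parameters of the wall
    p_di, p_dj : discrete signed lengths of the wall's end points (p_di < p_dj)
    l_d        : minimum door length in bins
    to_real    : function mapping a p bin index back to a metric signed length
    Returns a list of (p_k, p_l) door limits in metric units.
    """
    doors = []
    p_dk = p_di                                          # step 6
    while p_dk <= p_dj - l_d:                            # step 7
        p_dl = p_dk + 1                                  # a)
        while p_dl < p_dj and H[t_d, d_d, p_dk] - H[t_d, d_d, p_dl] == 0:
            p_dl += 1                                    # b) grow the empty run
        if p_dl - p_dk > l_d:                            # c)
            doors.append((to_real(p_dk), to_real(p_dl - 1)))
        p_dk = p_dl                                      # d)
    return doors
```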
IV. INCREMENTAL MODELING OF THE ENVIRONMENT
Building topological maps requires to endow the robot not only with modeling skills, but also with the ability to actively explore the environment. Exploration plays an important role in our proposal because the robot must make sure that each room model corresponds to a real room before leaving it behind. For this reason the robot must scan the whole space surrounding it to take correct decisions about the current model. In our system, the exploration task is driven by the 3D local grid. When a room model is detected from the set of points stored in the grid, the robot must verify it by scanning the unexplored zones inside the estimated room model. Depending on its occupancy value, the cells of the grid are labeled as occupied, empty and unexplored. Thus, by analyzing the grid, the robot can direct its sensor towards new places and retrieve the necessary information to get a reliable model. At this point, the benefits of using a long range sensor become clear. Few movements of the robot are needed to cover the whole space around it and, in consequence, the modeling process is less sensitive to odometric errors. Once the local space around the robot has been completely explored, the current room model is inserted as a node in the graph representing the topological space and the robot gets out of the room to model new places. Each node in the graph stores the geometric parametrization of the room and its doors. The center and orientation of the room are used to form a reference frame (F r ) that expresses the location of the room in relation to a global reference frame (F w ). Thus, being r = (α, x c , y c , w, h) the rectangle that models a given room, the transformation matrix (T r ) that relates F r with F w is defined as:
        [ cos(α)   −sin(α)   0   x_c ]
T_r  =  [ sin(α)    cos(α)   0   y_c ]   (17)
        [   0         0      1    0  ]
        [   0         0      0    1  ]
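For illustration, T_r can be built and used to express a world point in room coordinates as follows (a sketch assuming, as eq. (17) suggests, that T_r maps room coordinates into world coordinates):

```python
import numpy as np

def room_frame(alpha, x_c, y_c):
    """Homogeneous transform T_r of eq. (17) relating the room frame F_r to F_w."""
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[c, -s, 0.0, x_c],
                     [s,  c, 0.0, y_c],
                     [0.0, 0.0, 1.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0]])

# Express a world point in room coordinates (illustrative values)
T_r = room_frame(alpha=0.3, x_c=2.0, y_c=1.5)
p_world = np.array([2.5, 2.0, 0.0, 1.0])
p_room = np.linalg.inv(T_r) @ p_world
```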
Using this transformation matrix, any point of the model is expressed in coordinates relative to the room. This way, if the local or global reference frames are modified, points of room models remain unaffected. In addition, the robot can be aware
(a) Frontal view of the scene.
The room scanning is finished when all regions in the model have been scanned, and further scanning (even using higher resolutions) results in no modifications. Then, the obtained room model is fixed and stored in the topological graph. Notice that, although the model size and doors are properly obtained, the room has been a little mispositioned when compared to reality. This is due to the accumulation of odometric errors during the robot movements. Nevertheless, the relative position of the robot inside the room remains correct and therefore the detection of a new room is not affected by this misplacement.
Fig. 4. Modeling process of room 1 (panels (a)–(f)).
Fig. 5. Modeling process of room 2 (panels (a)–(f)).
Once the second model has been fixed, the deviation between the two room models (figure 5(f)), caused by odometric errors, is corrected according to the common door restriction (equation 20). Figure 6 shows the result of this correction. Each square of the figure represents an area of 0.27 × 0.27 m². This size corresponds to the sampling step of the Hough space for the parameters d and p. This means that the accuracy of each room model is limited by this sampling step. The real sizes of the two rooms are 3.19 × 3.78 m² (room 1) and 4.20 × 3.78 m² (room 2). The sizes obtained by the modeling process are 2.97 × 3.78 m² (room 1) and 4.05 × 3.78 m² (room 2). As can be observed, the difference between the representation and the real world is in the range of the permissible error. More accurate models can be obtained by reducing the sampling step of the discrete Hough space.
VI. CONCLUSIONS AND FUTURE WORK
In this paper, we have presented an incremental modeling method for building hybrid representations of indoor environments. The proposed representation consists of a topological graph that describes the rooms of the environment and their connections. Each node of the graph represents a room and contains the geometric parametrization of the corresponding room and its doors. This geometric parametrization is the only metric information included in the representation. Using this information, the robot can recover a metric map of a particular zone of the environment when it is needed. Thus, no dense metric map is maintained in parallel to the topological graph. Rooms and doors are modeled using a variation of the Hough transform that detects segments and rectangle patterns. We have proposed methods for dealing with odometric errors in the creation of new models as well as in loop closings. In addition, a method for robot pose estimation using room models has been presented. Real experiments have been carried out using a mobile robot equipped with a motorized 2D laser. Results show the accuracy of the modeling process in real environments. We are working on several improvements of our proposal. In particular, work to relax the rectangle assumption is currently in progress. We are also extending the modeling ability of our robots for representing other structures of man-made environments like corridors. Bigger and more complex surroundings are also being taken into account.
ACKNOWLEDGMENT
This work has partly been supported by grants PRI09A037 and PDT09A044, from the Ministry of Economy, Trade and Innovation of the Extremaduran Government, and by grants TSI-020301-2009-27 and IPT-430000-2010-2, from the Span- ish Government and the FEDER funds.
REFERENCES
[1] The primesensor technology. http://www.primesense.com/.
[2] R.O. Duda and P.E. Hart. Use of the hough transformation to detect lines and curves in pictures. Commun. ACM, 15:11–15, January 1972. [3] Kwangro Joo, Tae-Kyeong Lee, Sanghoon Baek, and Se-Young Oh. Generating topological map from occupancy grid-map using virtual door detection. In IEEE Congress on Evolutionary Computation, pages 1–6,
2010.
[4] C.R. Jung and R. Schramm. Rectangle detection based on a windowed hough transform. In Proceedins of the XVII Brasilian Symposium on Computer Graphics and Image Processing, pages 113–120, 2004. [5] D. Lagunovsky and S. Ablameyko. Straight-line-based primitive ex- traction in grey-scale object recognition. Pattern Recognition Letters, 20(10):1005–1014, 1999. [6] C. Lin and R. Nevatia. Building detection and description from a single intensity image. Computer Vision and Image Understanding, 72(2):101– 121, 1998. [7] L. Manso, P. Bachiller, P. Bustos, P. N u´ nez,˜ R. Cintas, and L. Calderita. Robocomp: a tool-based robotics framework. In Proceedings of the Sec- ond international conference on Simulation, modeling, and programming for autonomous robots, SIMPAR’10, pages 251–262, 2010. [8] J. Matas, C. Galambos, and J. Kittler. Robust detection of lines using the progressive probabilistic hough transform. Computer Vision and Image Understanding, 78(1):119–137, 2000. [9] E. Montijano and C. Sagues. Topological maps based on graphs of planar regions. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1661–1666, October 2009. [10] E. Olson. Robust and Efficient Robotic Mapping. PhD thesis, Mas- sachusetts Institute of Technology, Cambridge, MA, USA, June 2008. [11] P.L. Palmer, J. Kittler, and M. Petrou. Using focus of attention with the hough transform for accurate line parameter estimation. Pattern Recognition, 27(9):1127–1134, 1994. [12] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951. [13] A. Rosenfeld. Picture processing by computer. ACM Comput. Surv., 1:147–176, September 1969. [14] S. Simhon and G. Dudek. A global topological map formed by local metric maps. In In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1708–1714, 1998. [15] W.-B. Tao, J.-W. Tian, and J. Liu. A new approach to extract rectangular building from aerial urban images. In Signal Processing, 2002 6th International Conference on, volume 1, pages 143 – 146, aug. 2002. [16] S. Thrun. Learning metric-topological maps for indoor mobile robot navigation. Artificial Intelligence, 99(1):21–71, 1998. [17] N. Tomatis, I. Nourbakhsh, and R. Siegwart. Hybrid simultaneous localization and map building: a natural integration of topological and metric. Robotics and Autonomous Systems, 44(1):3–14, 2003. [18] D. Van Zwynsvoorde, T. Simeon, and R. Alami. Incremental topological modeling using local vorono-like graphs. In Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and System (IROS 2000), volume 2, pages 897– 902, 2000. [19] F. Yan, Y. Zhuang, and W. Wang. Large-scale topological environ- mental model based particle filters for mobile robot indoor localization. Robotics and Biomimetics, IEEE International Conference on, 0:858– 863, 2006. [20] Y. Zhu, B. Carragher, F. Mouche, and C.S. Potter. Automatic particle de- tection through efficient hough transforms. IEEE Trans. Med. Imaging, 22(9), 2003.
Efficient Topological Mapping with Image Sequence Partitioning
Hemanth Korrapati, Youcef Mezouar, Philippe Martinet
LASMEA, CNRS JOINT UNIT 6602, 24, Avenue des Landais, 63177 AUBIÈRE, FRANCE
Email: firstname.lastname@lasmea.univ-bpclermont.fr
Abstract— Topological maps are vital for fast and accurate localization in large environments. Sparse topological maps can be constructed by partitioning a sequence of images acquired by a robot, according to their appearance. All images in a partition have similar appearance and are represented by a node in a topological map. In this paper, we present a topological mapping framework which makes use of image sequence partitioning (ISP) to produce sparse maps. The framework facilitates coarse loop closure at node level and a finer loop closure at image level. Hierarchical inverted files (HIF) are proposed which are naturally adaptable to our sparse topological mapping framework and enable efficient loop closure. Computational gain attained in loop closure with HIF over sparse topological maps is demonstrated. Experiments are performed on outdoor environments using an omnidirectional camera.
Index Terms— Topological Mapping, Omnidirectional Vision, Loop Closure
I. INTRODUCTION
Mapping is one of the fundamental problems of Au- tonomous Mobile robotics. Mapping problem can be widely categorized as Topological and Metrical [16]. Metrical map- ping involves accurate position estimates of robots and land- marks of the environment. Topological mapping on the other hand represents an environment as a graph in which nodes cor- respond to places and the edges between them indicate some sort of connectivity. Recently, a third category called Topo- Metric Mapping [17], [9] is gaining popularity. Topo-Metric mapping is a hybrid approach which uses both metrical and topological information in map building. Building an accurate map either metrical or topological depends on loop closure accuracy. Such maps are difficult to build using metrical in- formation which is prone to gross errors in position estimation of robot and landmarks. Topological maps facilitate accurate loop closure as they depend on appearance information rather than on exact metrical information of the environment. Many powerful loop closing techniques for topological maps have been introduced recently [2, 3, 5, 7]. Most of them produce dense topological maps, in which every acquired image stands as a node in the topological graph. Sparser topo- logical maps can be built by representing sets of contiguous images with similar appearance features as places. Each place is represented by a node in the topological graph. We refer to this kind of partitioning of image sequences into places as Image Sequence Partitioning (ISP). In a sparse topological map, since each node represents multiple images, fewer nodes would be sufficient for the map representation. Maps with
Fig. 1: A global view of our topological mapping framework.
fewer nodes reduce the computational complexity involved in loop closure and map merging. An example map merging problem is to localize a topological map of a small environment within a larger map (e.g., Google Maps). We use a topological mapping framework which facilitates coarse loop closure at the node level and finer loop closure at the image level. The topological map is represented as a graph T = (N, E), where N and E are the sets of nodes and edges respectively. The map is updated with each newly acquired image. Every new image (query image) is checked for similarity with previously visited place nodes and with the current place node; if it is similar, the corresponding node is augmented with the image. If the query image is not similar to any of the existing nodes, a new place node is created and augmented with the image. This process of map update is exactly Image Sequence Partitioning (ISP). Each node contains a set of representative features representing all the member images of the node (place). The representative features are used in evaluating node-image similarity during ISP. Another contribution of this paper is the proposal of Hierarchical Inverted Files (HIF) for efficient loop closure at both node and image levels. Similar to the traditional inverted files used for object recognition [11] and loop closure problems [2], HIFs are associated with the visual words in the vocabulary. HIFs combine the power of regular inverted files with our sparse topological map structure and help in fast loop closure. Considering that in a sparse topological map images are grouped into nodes, HIFs organize
the previous occurrence information of visual words in a hierarchical fashion, which enables fast loop closure. We use an inverse similarity score evaluation methodology in order to take advantage of HIFs for loop closure. Experiments were performed on omnidirectional image data acquired in outdoor urban environments. Omnidirectional images offer a 360 degree field of view, which helps in building topological maps that are invariant to the robot's heading. As a result, loop closure can be performed even if the robot is not headed in the same direction as during the previous visit to the same location. Map building in outdoor environments is challenging due to illumination variation and possible occlusions [15]. The sparsity and accuracy of topological maps constructed using ISP are evaluated, and the computational savings achieved in HIF-based loop closure are analysed. The rest of the paper is organized as follows: Section II details the related work done in this area. Section III describes the steps involved in ISP in detail and provides algorithmic illustrations. Section IV introduces HIFs and discusses how node and image level loop closures are performed using HIFs. Section V evaluates the sparsity of maps produced by ISP, and the accuracy and computational cost of loop closure on the generated topological maps.
II. RELATED WORK
Scene change detection and key frame selection for video segmentation and abstraction [19], [13] have goals similar to those of ISP. They try to represent a video with fewer images, called key frames, selected whenever there is a sufficient change in the scene, and most of them are focused on the video compression domain. The major difference between these video abstraction problems and topological mapping is that topological mapping demands localization of a query image which is obtained at a previously visited place, but with variation in illumination and viewpoint, and possible occlusion. Hence, video segmentation techniques using pixel-wise intensity measures, global image features like histograms, or motion based segmentation cannot be applied to our problem. Loop closure in topological maps has gained popularity among mobile robotics researchers in recent times. Many loop closure algorithms for topological mapping have been proposed and tested in both indoor [7], [2], [3], [21], [8] and outdoor environments [5], [6], [1], [12]. In [21], [22] topological maps are built for indoor environments. They segment the topological graph of the environment using the normalized graph-cuts algorithm, resulting in subgraphs corresponding to convex areas in the environment. In [8] SIFT features were used to perform matching over a sequence of images; transitions between individual indoor locations were detected depending on the number of SIFT features which can be successfully matched between successive frames. In [14] the fingerprint of an acquired image is generated using the omnidirectional image and laser readings, and these fingerprints are compared to those of the previous images. If the similarity is above a threshold the image is added to the existing node, and if not a new node is formed. All of these works were focused on indoor environments. Indoor environments contain convex
spaces (rooms) which are relatively simpler to partition compared to outdoor environments. A topological mapping framework using incremental spectral clustering has been presented in [20]. Nodes containing similar images are constructed using incremental spectral clustering over the affinity matrix of the images, thereby producing a topological graph. An optical flow based ISP technique was presented in [12] for topological mapping in outdoor environments using a quadrotor robot; optical flow is used to detect changes in environmental appearance. In [1], gist features [18] were used to cluster images with similar appearance for topological map construction.
III. IMAGE SEQUENCE PARTITIONING
Figure 1 depicts a global view of our framework, in which a modular view of ISP is enclosed by a red dashed line. As can be seen from Figure 1, ISP consists of three main modules: node level loop closure, evaluation of similarity to the current place, and new node formation. Given a query image, SURF [4] features are first extracted from the image. Using the SURF features, we evaluate the node-image similarity of the query image with all the nodes in the graph except the current place node and pick the top k similar nodes. The top k similar nodes are assigned to the set of winning nodes N_w. This process is called node level loop closure as it finds the previously visited places (nodes) most similar to the query image. Only the representative feature sets of the nodes are used to compute the node-image similarities during node level loop closure. In our framework, the representative features of a node are the SURF features of the first image augmented to the node. An empty N_w indicates loop closure failure; that is, the query image is not similar to any of the previously visited places. In that case, the similarity of the query image to the current node is evaluated. If the similarity of the query image with the current place node is above a certain threshold, the current node is augmented with the query image. If the query image is not similar to the current node either, a new node is created with the query image. But if N_w is not empty, indicating a loop closure, then the set of winning nodes is considered for a thorough image level loop closure. Algorithm 1 shows the steps involved in ISP. The 'node level loop closure' function in line 4 is discussed in detail in Section IV. The function 'current node image similarity' evaluates the similarity of the current node with the query image. This is done by matching the SURF features of the query image to the features of the current place node. Feature matching is performed as proposed in [10]: two features are considered to be matched if the ratio of the best match to the second best match is greater than 0.6.
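As a rough illustration of the matching step just described, the following sketch (Python, not the authors' code) implements a nearest/second-nearest distance-ratio test over SURF descriptors. The paper states the criterion as a 0.6 ratio between the best and second best match; the sketch uses the common formulation of [10] in which a match is accepted when the distance ratio is below that threshold, and the normalised similarity score as well as all function names are our own assumptions.

import numpy as np

def ratio_test_matches(query_desc, node_desc, ratio=0.6):
    # Count query descriptors whose nearest neighbour in node_desc is
    # unambiguous according to the distance-ratio test of [10].
    matches = 0
    for d in query_desc:                                  # one SURF descriptor per row
        dists = np.linalg.norm(node_desc - d, axis=1)     # distances to all node features
        best, second = np.partition(dists, 1)[:2]         # two smallest distances
        if best < ratio * second:                         # unambiguous nearest neighbour
            matches += 1
    return matches

def current_node_image_similarity(query_desc, node_desc):
    # Normalised match count in [0, 1]; the normalisation is an assumption.
    return ratio_test_matches(query_desc, node_desc) / max(len(query_desc), 1)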
IV. LOOP CLOSURE & HIERARCHICAL INVERTED FILES
Node and image level loop closures are performed using visual words corresponding to the SURF features of the images. Given a query image, to find the most similar
Algorithm 1 Image Sequence Partitioning Algorithm
1: procedure PROCESS_QUERY_IMAGE(T, I_q, n_c)   ⊲ T, I_q, n_c, T.N are the topological graph, query image, current node and node-set of the topological graph respectively.
2:   N' = T.N − {n_c}   ⊲ Reference node set excluding current node.
3:   N_w = {}   ⊲ Set of winning nodes.
4:   N_w = Node_Level_Loop_Closure(N', I_q, Th_s)
5:   if is_empty(N_w) then
6:     if current_node_image_similarity(n_c, I_q) > Th_t then
7:       n_c.add_image(I_q)
8:     else
9:       n = new_node()
10:      n.add_image(I_q)
11:      update_map(T, n)
12:    end if
13:  else
14:    N_w = get_top_k_similar_nodes(N_w)
15:    I_w = get_images_of_nodes(N_w)
16:    n* = Image_Level_Loop_Closure(I_w, N_w, I_q)
17:    add_image_to_node(I_q, n*)
18:    update_map(T, n*)
19:  end if
20: end procedure
reference image in the database, [11], [7] compute a Tf-Idf based similarity score between the query image and all the reference images in the database and select the reference images corresponding to the top n similarity scores. We use an inverse methodology to compute image similarities for loop closure. The steps involved are enumerated as follows:
1) Let the set of reference images be I = {I_1, I_2, ..., I_M}.
2) We consider a histogram H with the number of bins corresponding to the number of reference images, M. Extract the set of visual words W = {w_1, w_2, ..., w_p} from the query image, I_q.
3) For each visual word w_i, using the inverted file IF_i of the word, we extract the reference image indexes I^{w_i} = {I_1^{w_i}, I_2^{w_i}, ...} in which the word has been previously seen. The histogram bins corresponding to these extracted reference images are incremented by the Tf-Idf of the corresponding word:

H[I_j^{w_i}] = H[I_j^{w_i}] + Tf-Idf(I_j^{w_i}, w_i)   (1)

Fig. 2: (a) represents a traditional inverted file; (b) represents a hierarchical inverted file. VW1, VW2, ... represent visual words; N1, N2, ... represent node ids in the first level of the HIF; I1, I2, I3, ... represent the image ids in the inverted files.

The resulting histogram can be interpreted to contain the degrees of similarity of the query image with respect to the reference images. As we can see, the loop closure computation time only depends on the number of words in the query image and the average inverted file length at that instant. As a result, the loop closure time does not increase as steeply as is the case with the forward method. Closely related work can be found in [6, 2, 3]. However, this method is suitable only for loop closure over dense topological maps and not for sparse topological maps in which each node represents multiple images. With a change in the inverted file structure, we can adapt this similarity evaluation method to sparse topological maps. A regular inverted file corresponding to a visual word simply contains a list of all previous images which contained the word. We associate a Hierarchical Inverted File (HIF) with each visual word. As the name suggests, a HIF contains two levels. The first level consists of the ids of the nodes in which the visual word occurred previously. The second level consists of small child inverted files attached to each of the
node ids. These image ids indicate the images belonging to the parent node in which the visual word has occurred. Each child inverted file attached to a node of the topological graph contains the list of all previous images belonging to that node in which the word has occurred. The difference between traditional inverted files and HIFs is illustrated in Figure 2. To perform a node level loop closure using HIFs, we do not have to go through the entire HIF; it is sufficient to go through the first level (node ids) of the HIFs. For an image level loop closure using HIFs, we only have to traverse the child inverted files corresponding to the winning nodes, which form only a fraction of the total HIF. Thus, HIFs offer a computational gain in loop closure when compared to regular inverted files, which is demonstrated
in section V. Algorithms 2 and 3 give a clearer picture of
the node level and image level loop closures. There can be multiple winning images given by the image level loop closure and hence multiple corresponding nodes, but for the sake of simplicity we do not represent that in the algorithm. In such cases we use RANSAC based geometric verification to find the right match.
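A minimal sketch of the HIF layout described above, with illustrative names only (this is not the authors' implementation): each visual word owns a two-level structure whose first level is keyed by node id and whose second level is the child inverted file of image ids.

from collections import defaultdict

class HierarchicalInvertedFile:
    # One HIF, associated with a single visual word.
    def __init__(self):
        self.nodes = defaultdict(list)      # node_id -> child inverted file (image ids)

    def add_occurrence(self, node_id, image_id):
        self.nodes[node_id].append(image_id)

    def node_ids(self):                     # first level only: node level loop closure
        return self.nodes.keys()

    def images_of(self, node_id):           # one child file: image level loop closure
        return self.nodes.get(node_id, [])

# one HIF per visual word in the vocabulary
hif_index = defaultdict(HierarchicalInvertedFile)
hif_index[42].add_occurrence(node_id=3, image_id=157)   # word 42 seen in image 157 of node 3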
Algorithm 2 Node Level Loop Closure Algorithm
1: procedure NODE_LEVEL_LOOP_CLOSURE(N', I_q, Th_s)
2:   N' = T.N − {n_c}
3:   H = Histogram(size_of(N'))
4:   W = extract_and_quantize_features(I_q)
5:   for each word w_i in W do
6:     HIF_i = get_hierarchical_inverted_file(w_i)
7:     for each node n_j in HIF_i do
8:       H[n_j] = H[n_j] + 1
9:     end for
10:  end for
11:  N* = get_winners_from_histogram(H, Th_s)
12:  return N*
13: end procedure
Algorithm 3 Image Level Loop Closure Algorithm
1: procedure IMAGE_LEVEL_LOOP_CLOSURE(I_w, N_w, I_q)
2:   H = Histogram(M)
3:   W = extract_and_quantize_features(I_q)
4:   for each word w_i in W do
5:     HIF_i = get_hierarchical_inverted_file(w_i)
6:     for each node n_j in N_w do
7:       IF_i^{n_j} = get_inverted_file_of_node(HIF_i, n_j)
8:       for each entry f_i in IF_i^{n_j} do
9:         H[f_i] = H[f_i] + Tf-Idf(f_i, w_i)
10:      end for
11:    end for
12:  end for
13:  I* = get_winner_from_histogram(H)
14:  n* = get_corresponding_node(I*)
15:  return n*
16: end procedure
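Building on the HIF structure sketched earlier, the following illustration mirrors Algorithms 2 and 3: node level voting touches only the first HIF level, while image level voting traverses only the child files of the winning nodes. The Tf-Idf weights are assumed to be supplied by the caller, and all names are illustrative rather than the authors' implementation.

def node_level_loop_closure(words, hif_index, num_nodes, threshold):
    # Algorithm 2: vote over node ids using only the first level of each HIF.
    hist = [0.0] * num_nodes
    for w in words:                                   # quantized visual words of I_q
        for node_id in hif_index[w].node_ids():
            hist[node_id] += 1.0
    return [n for n, score in enumerate(hist) if score > threshold]

def image_level_loop_closure(words, hif_index, winning_nodes, num_images, tfidf):
    # Algorithm 3: traverse only the child inverted files of the winning nodes.
    hist = [0.0] * num_images
    for w in words:
        for node_id in winning_nodes:
            for image_id in hif_index[w].images_of(node_id):
                hist[image_id] += tfidf(image_id, w)  # Tf-Idf weight, assumed given
    return max(range(num_images), key=lambda i: hist[i])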
V. EXPERIMENTS
Our experimental setup consists of a Pioneer P3DX robot equipped with an omnidirectional camera. A laptop with an Intel Centrino 2 processor running Ubuntu 9.04 is used for data processing. The experiments were carried out in our artificial urban environment, PAVIN. The environment contains roads, artificial buildings, and a variety of real-world road settings like junctions, traffic lights, roundabouts, curved roads and dead ends. Omnidirectional images were acquired at a frame rate of 2 fps as the robot moved along a manually controlled trajectory. Image data was acquired in four installments (A, B, C and D) at very different times of two days and hence contained significant illumination variation. Figure 3 shows the parts of the environment through which the robot traversed during each installment. We took care that the data from all four installments contained overlaps so as to put our loop closure algorithm to the test. We constructed two datasets by combining data from the four installments. Dataset-6560 was obtained by combining the data of installments A, C and D; it contains 6560 images with 52 possible loop closures. Another dataset, Dataset-11200, was obtained by combining all four installments; it contains 11200 images and 71 possible loop closures. The number of loop closures was determined by manually examining the datasets.
A. ISP - Sparsity
The number of nodes in a topological map indicates its sparsity. An ideal topological map is one in which each distinct place in the environment is represented by a node in the topological graph. The sparsity of such ideal maps represents the optimal sparsity for the actual environment. In practice, however, an ideal topological map is far from being attained. Different features produce different topological representations of the environment. We have experimented with SURF128, U-SURF128, and SIFT features, out of which we found that U-SURF128 features lead to the topological structure closest to the ground truth. Another important factor that affects the stability of appearance is image distortion. The features directly extracted from warped omnidirectional images are unstable, as the appearance of a keypoint patch changes considerably even with a small
Fig. 3: Paths traversed by the robot during each image acquisition installment: (a) Installment A, (b) Installment B, (c) Installment C, (d) Installment D.
TABLE I: SPARSITY

(a) Sparsity - Warped

                  SURF128   U-SURF128   SIFT
DATASET-6560        539        502      1582
DATASET-11200       756        742      2795

(b) Sparsity - Unwarped

                  SURF128   U-SURF128   SIFT
DATASET-6560        504        473      1037
DATASET-11200       737        723      1257
displacement of the camera. Hence it is likely that the maps produced using warped images contain a greater number of place nodes: due to the feature instability, each place can be understood as multiple adjacent places and hence multiple nodes in the topological graph. The undistorted (unwarped) images produce relatively sparser maps. Tables I(a) and I(b) show the sparsity of the maps produced by ISP using different features on warped and unwarped images. We can see that U-SURF128 is the best performing feature, producing the sparsest maps.
B. Accuracy
In this subsection, we discuss the accuracy of the maps produced by ISP, based on the number of accurate loop closures and the obtained false positives. The most sparse map may not guarantee an accurate map. Only those maps with accurate place partitioning are accurate and can lead to accurate loop closures. Thus a good mapping technique is
TABLE II: NODE LEVEL LOOP CLOSURE ACCURACY

                  #(LC)   #(FP)
Dataset-6560        49       4
Dataset-11200       68       6
one which provides an optimal combination of sparsity and accuracy. Given a query image, we first perform loop closure at the node level and then at the image level if more accuracy is required. Table II shows the number of loop closures detected and the number of false positives obtained on Dataset-6560 and Dataset-11200 by node level loop closure. We can see that a few loop closures are missed and some false positives are observed. The inaccuracy can be attributed to imperfections in ISP; there is a direct relation between the accuracy of the maps and ISP. The way in which we perform topological mapping does not induce any information loss. In other words, we do not perform any kind of sampling or selection of a reduced number of features. Instead, we consider each and every feature extracted from the images and store it in the HIFs. This guarantees that no information is lost during the mapping process. Inaccurate loop closure detection might however occur due to inaccurate partitioning of places. For example, a place can be represented by two nodes if it is partitioned inaccurately during ISP due to appearance feature instability. As a result, during node level loop closure with a query image, neither of the nodes representing that place may receive a high similarity score, and hence the loop closure becomes inaccurate. A good ISP algorithm produces maps with a minimum number of these situations. Image level loop closure accuracy depends on the accuracy of node level loop closure. If node level loop closure selects an inaccurate set of winning nodes, then image level loop closure also becomes inaccurate. However, in the case of an accurate node level loop closure, we have observed that 99% accuracy was possible in image level loop closure irrespective of the ISP technique. Figure 4 shows two loop closure scenarios that occurred in our mapping.
C. Computational Time
Table III shows the average computational time (in milliseconds) taken by each stage of our topological mapping framework to process a query frame on both Dataset-6560 and Dataset-11200. The abbreviations NLLC, LFE+QUANT and ILLC stand for Node Level Loop Closure, Local Feature Extraction & Quantization, and Image Level Loop Closure respectively. NLLC extracts the most similar nodes using HIFs as described in Algorithm 2. This takes 10 ms, as shown in Table III, and requires an additional 50 ms to compare with the current place node whenever needed. Local feature extraction (LFE) (200 ms) and quantization (QUANT) (70 ms) times are roughly constant for every acquired frame; in fact, the time required for both of these tasks increases with the number of features in an image. The computation time of image level loop closure (ILLC) is very low. This low computation time is the result of using
Fig. 4: Example loop closures (a)-(d).
TABLE III: AVERAGE COMPUTATION TIMES (in ms)

NLLC       LFE+QUANT   ILLC
10 + 50    200 + 70    21
hierarchical inverted files (HIF). As mentioned before, HIFs make the loop closure computation almost independent of the number of reference images and, in our case, of the number of nodes in the topological graph.
Figures 5(a) and 5(b) show graphs comparing the loop closure times of our HIF-based method and of a non-HIF based method. Red curves in the graphs indicate the time taken for feature quantization, node level loop closure and image level loop closure for each image frame in a sequence. The blue curves represent the time taken by feature quantization and similarity score generation using regular inverted files with the inverse similarity evaluation methodology. We can see that the loop closure time of the non-HIF based method increases relatively quickly with the number of images in the map, while with our method the loop closure time increases much more slowly. The performance gain becomes more prominent for huge datasets (a huge number of images), as can be seen in Figure 5(b), corresponding to Dataset-11200, which contains 11200 images. The non-HIF based loop closure time for Dataset-11200 increases less steeply than that of Dataset-6560; this happens because the average number of features per image in Dataset-11200 is lower than in Dataset-6560, and as a result it takes less time to process each reference frame. This efficiency of our HIF-based method can be attributed to the combination of sparse topological mapping and HIFs for efficient map storage. The representational power of HIFs saves a lot of the computation involved in loop closure.
Fig. 5: Loop closure computation times of non-HIF based loop closure and HIF based loop closure on the maps generated from our datasets: (a) Dataset-6560, (b) Dataset-11200. (Axes: loop closure time in seconds versus image id.)
VI. CONCLUSION
We proposed a sparse topological mapping framework involving two levels of loop closure. Hierarchical Inverted Files (HIF) are naturally adaptable to loop closure in our sparse topological mapping framework and make fast loop closure possible. Image Sequence Partitioning (ISP) plays a key role in producing sparse topological maps. The sparsity of the maps produced by different features was analyzed and their accuracy was evaluated. Finally, our framework was evaluated on the computational time required for loop closure. The experiments support our argument that HIFs are suitable for sparse topological maps, as they take advantage of the sparsity of the map to perform loop closure efficiently without discarding any information.
ACKNOWLEDGMENT
This work has been supported by the ANR projects CityVip and R-Discover. Special thanks to Mr. Jonathan Courbon for his help with the experimentation.
REFERENCES
[1] A. C. Murillo, P. Campos, J. Kosecka, and J. J. Guerrero. Gist vocabularies in omnidirectional images for appearance based mapping and localization. In 10th IEEE Workshop on Omnidirectional Vision, Camera Networks and Non-classical Cameras (OMNIVIS), held with Robotics: Science and Systems, 2010.
[2] A. Angeli, D. Filliat, S. Doncieux, and J.-A. Meyer. A fast and incremental method for loop-closure detection using bags of visual words. IEEE Transactions on Robotics, Special Issue on Visual SLAM, 2008.
[3] A. Angeli, S. Doncieux, J.-A. Meyer, and D. Filliat. Visual topological SLAM and global localization. In Proceedings of the 2009 IEEE International Conference on Robotics and Automation (ICRA'09), pages 2029–2034, 2009.
[4] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110, June 2008.
[5] M. Cummins and P. Newman. FAB-MAP: Probabilistic localization and mapping in the space of appearance. The International Journal of Robotics Research, 27(6):647–665, 2008.
[6] M. Cummins and P. Newman. Highly scalable appearance-only SLAM - FAB-MAP 2.0. In Robotics: Science and Systems (RSS), Seattle, USA, June 2009.
[7] F. Fraundorfer, C. Engels, and D. Nistér. Topological mapping, localization and navigation using image collections. In IROS, pages 3872–3877, 2007.
[8] J. Kosecka, F. Li, and X. Yang. Global localization and relative positioning based on scale-invariant keypoints. Robotics and Autonomous Systems, 52(1):27–38, 2005.
[9] B. Kuipers, J. Modayil, P. Beeson, M. MacMahon, and F. Savelli. Local metrical and global topological maps in the hybrid Spatial Semantic Hierarchy. In Proceedings of the IEEE Intl. Conf. on Robotics and Automation (ICRA-04), 2004.
[10] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91–110, 2004.
[11] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pages 2161–2168, 2006.
[12] N. Nourani-Vatani and C. Pradalier. Scene change detection for topological localization and mapping. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2010.
[13] C. G. M. Snoek and M. Worring. Multimodal video indexing: A review of the state-of-the-art. Multimedia Tools and Applications, 25:5–35, 2005.
[14] A. Tapus and R. Siegwart. Incremental robot mapping with fingerprints of places. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), pages 2429–2434, 2005.
[15] S. Thrun. Robotic mapping: A survey. In Exploring Artificial Intelligence in the New Millennium. Morgan Kaufmann, 2002.
[16] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics (Intelligent Robotics and Autonomous Agents). The MIT Press, 2005.
[17] N. Tomatis. Hybrid, metric-topological representation for localization and mapping. In Robotics and Cognitive Approaches to Spatial Mapping, volume 38 of Springer Tracts in Advanced Robotics, pages 43–63, 2008.
[18] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin. Context-based vision system for place and object recognition. In Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV), volume 2, pages 273–, 2003.
[19] B. T. Truong and S. Venkatesh. Video abstraction: A systematic review and classification. ACM Trans. Multimedia Comput. Commun. Appl., 3, 2007.
[20] C. Valgren, T. Duckett, and A. Lilienthal. Incremental spectral clustering and its application to topological mapping. In Proc. IEEE Int. Conf. on Robotics and Automation, pages 4283–4288, 2007.
[21] Z. Zivkovic, O. Booij, and B. Kröse. From images to rooms. Robotics and Autonomous Systems, 55:411–418, May 2007.
[22] Z. Zivkovic, B. Bakker, and B. Kröse. Hierarchical map building using visual landmarks and geometric constraints. In Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 7–12, 2005.
Kalman filter correction with rational non-linear functions: Application to Visual-SLAM
Thomas Féraud, Roland Chapuis, Romuald Aufrère and Paul Checchin
Clermont Université, Université Blaise Pascal, LASMEA UMR 6602, CNRS, 63177 AUBIÈRE, FRANCE, Email: firstname.name@univ-bpclermont.fr
Abstract — This article deals with the divergence of the Kalman filter when used with rational non-linear observation functions in the Visual SLAM framework. The objective is to localize a vehicle and simultaneously build a map of the environment perceived by a camera. There are many ways to fuse all the data coming from the sensors, and the usual one is the Kalman filter. A main problem of this approach is divergence due to an improper linearization of the observation model, which leads to wrong estimates that disturb the whole process. The presented method allows, under a weak constraint on the observation function, to reduce the divergence effect without modifying the observation noise. In the Visual SLAM context, this method drastically improves the results and gives more stability to the monocular system when initializing landmarks.
Index Terms — Visual SLAM, depth estimation, Kalman filtering
I. INTRODUCTION
Simultaneous Localization and Mapping (SLAM) has become a well-defined problem in the robotics community: a moving sensor platform constructs a representation of its environment on the fly, while concurrently estimating its ego-motion [8, 2]. Stochastic approaches have solved the SLAM problem in a consistent way, because they explicitly deal with sensor noise [7, 13]. The implementation of a feature-based SLAM approach encompasses the following functionalities:
• the perception process which depends on the kind of en- vironment and on the sensors the robot is equipped with. It consists in detecting from the perceived data, features of the environment that are salient, easily observable and whose relative position to the robot can be estimated.
• the observation process which concerns the estimation of the features locations relative to the robot’s pose from which they are observed.
• the prediction process which deals with the estimation of the robot’s motion between two feature observations. This estimate can be provided by sensors, by a dynamic model of robot’s evolution fed with the motion control inputs, or based on simple assumptions such as a constant velocity model.
• the data association process in which the observations of landmarks must be properly associated (or matched), otherwise the robot’s position can become totally incon- sistent.
• the estimation process which consists in integrating the various relative measurements to estimate the robot and landmarks positions in a common reference frame.
Fig. 1 gives an example of a monocular SLAM process.
Historically, SLAM relied on laser telemeters such as Range-and-Bearing (RB) sensors allowing robots to build planar maps of the environment. The original solution [12] utilized an Extended Kalman Filter (EKF) to fuse data acquired by a laser range scanner or other range and bearing sensors, leading to Range and Bearing EKF-SLAM. Recently, the use of vision sensors has put the 3D world within reach but has given rise to new difficulties due, on the one hand, to the richness of the information provided and, on the other hand, to the loss of the depth dimension in the measurements, inherent to the projective nature of vision. Visual SLAM with a single camera is more challenging than visual SLAM based on stereo vision, but a single camera will always be cheaper, more compact and easier to calibrate than a multi-camera rig.
Fig. 1. Top: two images taken at steps k and k+1 and the corresponding extracted features. Bottom: the distance between the two observer positions (black triangles) based on the trajectory in blue.
The EKF estimation algorithm used in Davison's system has been shown to be inconsistent [1]. The algorithm tends to be overconfident in the estimate calculated due to linearization errors. As well as improving the computational efficiency of mapping large environments, sub-mapping also improves the quality of the estimate by decreasing these linearization errors [1], [5]. Other monocular SLAM systems have also attempted to improve the quality of the estimate by using different estimation algorithms. Chekhlov et al. [6] used the Unscented Kalman Filter (UKF) for their monocular SLAM system. This algorithm improves the consistency of the estimate by making more accurate approximations to the non-linearities in the observation process. The improvement in consistency comes at a cost in complexity (O(N^3) vs. O(N^2) for the EKF). Recently, Holmes et al. [10] showed that using the square root UKF algorithm [14] a more consistent estimate can be obtained at O(N^2) complexity in a monocular SLAM system. However, the actual processing time required for the implementation was an order of magnitude greater than for the EKF on reasonably sized maps, due to a large constant factor on the computational cost in their implementation.
In this paper, a camera-centered Extended Kalman Filter is used to process a monocular sequence. The sensor-centered Extended Kalman Filter was first introduced in the context of laser-based SLAM [4]. Contrary to the standard EKF-SLAM, where estimates are always referred to a world reference frame, the sensor-centered approach represents all feature locations and the camera motion in a reference frame attached to the current camera. The typical correlated feature-sensor uncertainty arising from the uncertain camera location is transferred into uncertainty in the world reference frame, resulting in lower variances for the camera and map features, and thus in smaller linearization errors [5].
This paper presents a method to reduce the divergence in the specific case of rational non-linear observation models. At each step, comparisons between solutions with and without correction in the non-linear case demonstrate the effectiveness of the proposed approach. In Section II, we begin by introducing the camera-centered Extended Kalman Filter. Section III presents the solution proposed in this paper. Finally, Section IV shows experimental results of this work. Section V concludes the paper.
II. ISSUES OF EKF VISUAL SLAM
A. Prediction step
The SLAM algorithm begins with standard EKF prediction using proprioceptive sensors and the known dynamic model associated to the vehicle. Thus, the current position of the observer can be predicted:
X_{k+1|k} = f(X_{k|k}, u_k)   (1)

where X_k is the state vector at step k and u_k is the vector of control inputs. The estimation step of the variance-covariance matrix is then expressed as:

P_{k+1|k} = F_X P_{k|k} F_X^T + Q_k   (2)

where Q_k is the covariance of the process noise and F_X is the Jacobian of the evolution model f(X_{k|k}) [9] with respect to X_k. At this point we obtain the best estimation of the position according to the proprioceptive information.
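As a minimal illustration of equations (1)-(2) (a sketch, not the authors' code), with the evolution model f, its Jacobian and the process noise covariance Q supplied by the application:

import numpy as np

def ekf_predict(X, P, u, f, jacobian_f, Q):
    X_pred = f(X, u)                 # X_{k+1|k} = f(X_{k|k}, u_k)          (1)
    F = jacobian_f(X, u)             # F_X, Jacobian of f at the estimate
    P_pred = F @ P @ F.T + Q         # P_{k+1|k} = F_X P_{k|k} F_X^T + Q_k  (2)
    return X_pred, P_pred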
B. Update step
In the case of SLAM, the extracted features are used to
update the vehicle’s position and the map. At time k, a set of features constitutes the map which can be augmented with new observed features. On the other hand, if this is a feature previously observed, its location can be predicted in the observation space. The Kalman gain G k+1 and the update process in this case can be written for each observation y as follows:
G_{k+1} = P_{k+1|k} H_X^T (H_X P_{k+1|k} H_X^T + R_obs)^{-1}   (3)
X_{k+1|k+1} = X_{k+1|k} + G_{k+1} (y − h(X_{k+1|k}))   (4)
P_{k+1|k+1} = P_{k+1|k} − G_{k+1} H_X P_{k+1|k}   (5)

where h is the observation model and H_X is its Jacobian with respect to the state vector. R_obs is the covariance of the noise associated with the observations. Localization accuracy depends on the number of points, the precision of the initial positioning and the reconstruction model. At this stage, the gain has two behaviours depending on the observation model: if it is linear, the Kalman gain is optimal; otherwise, it is suboptimal and may even lead to divergence, mainly due to inappropriate linearization.
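And correspondingly, a sketch of the update equations (3)-(5) for a single observation y, again with placeholder observation model h, Jacobian jacobian_h and noise covariance R_obs:

import numpy as np

def ekf_update(X_pred, P_pred, y, h, jacobian_h, R_obs):
    H = jacobian_h(X_pred)                          # H_X
    S = H @ P_pred @ H.T + R_obs                    # innovation covariance
    G = P_pred @ H.T @ np.linalg.inv(S)             # Kalman gain          (3)
    X_new = X_pred + G @ (y - h(X_pred))            # state update         (4)
    P_new = P_pred - G @ H @ P_pred                 # covariance update    (5)
    return X_new, P_new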
C. Visual SLAM
In the case of Visual SLAM, the dimension of observation
space is lower than that of the state space and the associated
model is non-linear. To understand the consequences, Fig. 2a shows the result of a 2D feature’s position updated with
a 1D observation. The divergence observed is the result of
an improper linearization. The problem of negative depth was addressed for inverse depth parametrization in [11]. The presented solution includes the case of negative depth. We propose in this article to introduce a correction in the update process by putting a constraint on the rational observation function in order to make it more consistent and
to avoid the divergence phenomenon.

D. From 2D to 3D
A first approach would be to see that if the observation
model in the 3D space is linear and equal to identity, then for each 2D observation, a virtual 3D point is created along
the line of sight. For each point, a variance-covariance matrix
is associated, dealing with uncertainty of the pose and the
observation. With this virtual 3D point, only the corresponding 3D point in the state vector is updated. The result of this method is presented in Fig. 2c. This method has the advantage of working with a linear observation function but at the cost of increased uncertainty associated with the data. Indeed, considering the uncertainty as Gaussian in the image, its rigorous retro-projection in the world is not Gaussian as shown in Fig. 3.
5 th European Conference on Mobile Robots, September 7-9, 2011, Örebro, Sweden
234
3
Fig. 2. Simulation: let the green star be a landmark of the environment observed from a first robot position (left blue triangle). The position of a 3D feature is initialized along the line of sight (red cross) with its uncertainty in blue. (a) When observed from a second position (right blue triangle), the location of the 3D feature (green circle) is estimated after the update process in the 2D space, showing divergence due to the non-linear observation function. (b) A zoom on the real point proves it is contained in the uncertainty of the virtual point. (c) The new 3D point (black cross with its uncertainty) is created from the second observation, but this time the Kalman update step is processed in the 3D world with a linear observation function equal to identity. The result is the black circle with the uncertainty in red.
One approach commonly used in this case consists in an approximation by using the sum of Gaussian functions, each centered on a different point of the line of sight and tangent to the cone resulting from the projection of the measured uncertainty (see Fig. 4).
Fig. 3. For a 2D Gaussian uncertainty in the image (left part), the reprojection in the 3D world is not Gaussian and corresponds to a cone (right part).
Fig. 4. 3D point uncertainty approximation by using the sum of three Gaussian functions.
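The sketch below shows one possible way to build such a sum-of-Gaussians initialization along the line of sight. The candidate depths, the linear growth of the lateral spread with depth and the choice of longitudinal spread are assumptions made for illustration only, not taken from the paper; for simplicity the covariances are expressed in a local frame whose third axis is aligned with the ray.

import numpy as np

def gaussian_sum_along_ray(origin, direction, depths, angular_sigma):
    # One Gaussian per candidate depth along the line of sight; the lateral
    # standard deviation grows with depth so each component stays inside the
    # projection cone of the image uncertainty.
    direction = direction / np.linalg.norm(direction)
    step = depths[1] - depths[0] if len(depths) > 1 else 1.0
    components = []
    for d in depths:
        mean = origin + d * direction
        lateral = angular_sigma * d          # cone radius at depth d (assumption)
        longitudinal = 0.5 * step            # spacing between consecutive components
        cov = np.diag([lateral**2, lateral**2, longitudinal**2])  # ray-aligned frame
        components.append((mean, cov))
    return components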
III. PROPOSED APPROACH
In order to directly initialize the new point, it is placed into the world reference frame from its first observation with an uncertainty coherent with the observation error in the image. However, despite being direct, this method has a linearization problem: the gap between the created point and the real point may be significant. We propose here to compute a correction in the update step of the Kalman process in order to attenuate the filter divergence (see Section III-A). Its application in a Visual SLAM context is introduced next in Section III-B.
A. Weighted Coefficient
When the measurement matches the predicted observation, we have the following relation:
observation = h(P_3D^{(k+1)})   (6)

where P_3D^{(k+1)} = P_3D^{(k)} + Gpt_{k+1} (y − h(P_3D^{(k)})), with Gpt_{k+1} the part of the Kalman gain related to the 3D point.
Hypothesis: Let the observation function be defined as a ratio of two polynomial functions N and D of P_3D:

h(P_3D^{(k+1)}) = N(P_3D^{(k+1)}) / D(P_3D^{(k+1)})   (7)

with P_3D^{(k+1)} not a root of D.
Using (6) and (7) we have a polynomial sum with respect to P_3D^{(k+1)}:

observation × D(P_3D^{(k+1)}) − N(P_3D^{(k+1)}) = 0   (8)

Let us define the gain Ω which satisfies (8). As the 2D-3D transformation is not well-conditioned, we intend to seek a simple proportional relationship between Ω and Gpt_{k+1}. This
research requires that Ω satisfies the same existing conditions as the Kalman gain Gpt:
Ω = r × Gpt = r × P_Pt H_Pt^T (H_Pt P_Pt H_Pt^T + R)^{-1}   (9)
  = P_Pt H_Pt^T (H_Pt (P_Pt / r) H_Pt^T + (R / r))^{-1}   (10)

where P_Pt and H_Pt are the parts of the covariance and of the Jacobian matrices related to the point and to the observation model respectively. As P_Pt and R are positive-definite matrices, P_Pt/r and R/r must also be positive-definite, so r cannot be negative. Furthermore, as R is the best estimate of the noise associated with the observation, R cannot be improved; this leads to R ≤ R/r, i.e. r ≤ 1, and (10) leads to a polynomial function in r, which is the coefficient of proportionality sought between Ω and Gpt_{k+1}. This can be summarized by the following relation:
Ω_{k+1} = r × Gpt_{k+1} and r ≤ min(r, 1)   (11)

Once the value of r is found, the corrected update step can be processed:

P_3D^{(k+1)} = P_3D^{(k)} + Ω_{k+1} (y − h(P_3D^{(k)}))
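To make the procedure concrete, here is a sketch (for a single scalar observation component, with placeholder polynomials N and D) of how the coefficient r could be obtained and applied: the constraint (8) is evaluated along the family of corrected points obtained with gains r × Gpt for r in [0, 1], a sign change is bracketed with scipy's brentq, and r = 1 (the plain EKF step) is used as a fallback. The root-bracketing strategy is our assumption for illustration, not the authors' procedure.

import numpy as np
from scipy.optimize import brentq

def corrected_update(P3d, Gpt, y, N, D):
    # P3d: current 3D point estimate, Gpt: its part of the Kalman gain (3-vector),
    # y: scalar observation, N and D: numerator/denominator of the rational model.
    innovation = y - N(P3d) / D(P3d)

    def residual(r):
        P = P3d + r * Gpt * innovation          # candidate corrected point
        return y * D(P) - N(P)                  # left-hand side of (8)

    r = 1.0                                     # r = 1 recovers the uncorrected EKF step
    if residual(0.0) * residual(1.0) < 0.0:     # a root of (8) lies in [0, 1]
        r = brentq(residual, 0.0, 1.0)
    Omega = r * Gpt                             # weighted gain
    return P3d + Omega * innovation, r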