
Robot Detection and Localization Based on Deep Learning
Sha Luo, Huimin Lu, Junhao Xiao, Qinghua Yu, Zhiqiang Zheng
College of Mechatronic Engineering and Automation
National University of Defense Technology
luoshasha1992@gmail.com, lhmnew@nudt.edu.cn, junhao.xiao@hotmail
yuqinghua163@163.com, zqzheng@nudt.edu.cn

Abstract—Real-time and accurate robot detection and localization is important for RoboCup Middle Size League (MSL) soccer robots. In the robot detection methods currently used by most teams, black-color-based information is used to distinguish robots from the environment, which is not robust if a robot changes its marker color, as the current rule allows. Considering the good performance of deep learning in feature extraction and object detection, in this paper we propose a novel approach for robot detection and localization based on Convolutional Neural Networks (CNNs) for RoboCup MSL soccer robots. The approach is composed of two stages: robot detection using the RGB image, and robot localization using the depth point cloud. The high accuracy and mean average precision (mAP) verify that the proposed method is suitable for robot detection during the MSL competition, which will benefit the subsequent strategy design and obstacle avoidance procedures. The proposed approach can easily be adapted to deal with different objects and to be used in other RoboCup leagues. The acquired dataset is made available to the community.

Keywords—Robot Detection and Localization; Deep Learning; RoboCup MSL; Convolutional Neural Networks (CNNs)

I. INTRODUCTION
Robot Soccer World Cup (RoboCup¹) is a worldwide competition and academic event for promoting the research and development of artificial intelligence (AI) and robotics by providing a challenging and public testing platform. The Middle Size League (MSL) is one of the most important events of RoboCup. For a completely autonomous soccer robot, accurate and real-time detection of robots plays an important role in the robot's entire system, and it helps the robot to perceive the environment as a basis for realizing autonomous capabilities such as motion planning and decision-making.

In the past decade, most MSL teams used color-based methods for robot detection [1, 2], because the main color of the robots is black, which is obvious and easy to recognize. According to the new RoboCup MSL 2017 rule, the robots are allowed to wear clothes of different colors [3], which introduces a new challenge to those methods. Recently, deep learning has attracted increasing attention in object detection and recognition, as it allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state of the art in speech recognition, visual object recognition, object detection and so on [4]. Since the early 1990s, CNNs have been applied with great success to the detection, segmentation and recognition of objects and regions in images. In 2012, Krizhevsky et al. [5] applied deep CNNs to a million images and showed high image classification accuracy in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [6, 7], which made CNNs well known to computer vision researchers.

Usually, increased detection speed is achieved at the cost of decreased detection accuracy. Ribeiro et al. used deep learning in the MSL for object detection [12], where classifiers implemented by artificial neural networks are trained using images obtained from an omnidirectional vision sensor. Different objects can be detected despite the imaging distortions caused by the omnidirectional vision sensor. A deep learning method for NAO robot detection was proposed in the RoboCup Standard Platform League [13], which performs very well in terms of accuracy. However, its processing time is too long to meet the real-time requirement.

Fig. 1 Illustration of the NuBot implementation: (a) the NuBot robot; (b) the connection architecture of the Kinect, the Jetson TX2 and the robot's onboard industrial computer, whose vision, communication and decision modules are linked through the ROS master over the LAN.

1 RoboCup Homepage: http://www.robocup.org/


This research was supported by National Natural Science Foundation of
China (61403409, 61503401), China Postdoctoral Science Foundation
(2014M562648).

There are some state-of-the-art object detection systems such as SSD [8], Faster R-CNN [9] and YOLO v2 [10, 11]. YOLO v2 applies a single neural network to the full image; the network divides the image into regions and predicts bounding boxes and probabilities for each region, and these bounding boxes are weighted by the predicted probabilities. According to Table 3 in [11], YOLO v2 is faster and more accurate than previous detection methods. It can also run at different resolutions for an easy trade-off between speed and accuracy (67 FPS with 76.8% mAP on the VOC2007 test set).

In this paper, we used the Kinect v2² as the vision sensor for robot detection and localization. It is an active 3D depth estimation setup employing IR laser structured patterns for depth calculation, which makes it more accurate for object localization. However, the huge amount of data from the Kinect would increase the burden on the onboard industrial computer's CPU, which needs to run the motion control, omnidirectional vision, decision control and communication modules at the same time. Considering the weight and real-time requirements, we decided to use the NVIDIA Jetson TX2³ embedded development board to process the data from the Kinect for robot detection and localization, and to transmit the localization results to the onboard industrial computer for later decision making and obstacle avoidance. The detection and localization system is mounted on our NuBot soccer robot (Fig. 1(a)), and the communication between the TX2 and the robot's onboard industrial computer is handled by ROS messages through a shared ROS master, as illustrated in Fig. 1(b). To exploit the accuracy and speed of the YOLO detection system, we adjusted the framework for fast and accurate robot detection during the detection stage. We also built a novel dataset for robot detection which contains fully annotated images acquired from MSL competitions on a regular field. The dataset is publicly available at https://github.com/Abbyls/robocup-MSL-dataset. After the robot detection step, we obtain the robots' 2D positions from the RGB image. Then we register the RGB image to the depth image and produce the depth point cloud to obtain the robots' 3D positions.
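As a simple illustration of this ROS-based hand-off (a sketch under assumptions, not NuBot's actual node: the topic name, message type, frame id and rate are chosen here for illustration), a detected robot position could be published from the Jetson TX2 as follows, so that the onboard industrial computer can subscribe to it through the shared ROS master:

// Minimal sketch: publish one detected robot's 3D position over ROS.
#include <ros/ros.h>
#include <geometry_msgs/PointStamped.h>

int main(int argc, char** argv)
{
    ros::init(argc, argv, "robot_detector");
    ros::NodeHandle nh;
    ros::Publisher pub =
        nh.advertise<geometry_msgs::PointStamped>("detected_robot_position", 10);

    ros::Rate rate(10);  // assumed detection rate of ~10 Hz
    while (ros::ok())
    {
        geometry_msgs::PointStamped msg;
        msg.header.stamp = ros::Time::now();
        msg.header.frame_id = "kinect_link";  // hypothetical frame name
        msg.point.x = 1.75;                   // placeholder position (m)
        msg.point.y = 3.28;
        msg.point.z = 0.0;
        pub.publish(msg);                     // received by the onboard computer
        rate.sleep();
    }
    return 0;
}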
The structure of the paper is organized as follows. Section 2 introduces the proposed approach for robot detection and localization in detail. Section 3 presents the experiments conducted on our NuBot soccer robot and analyzes the accuracy and efficiency of the approach. Section 4 concludes the paper and discusses future work.

II. PROPOSED APPROACH
The pipeline of our proposed approach is shown in Fig. 2; it includes two stages: robot detection and 3D position localization.

Fig. 2 The pipeline of the proposed approach (detection stage: dataset creation, image pre-processing acceleration, robot detection; localization stage: image registration, depth point cloud production, noise elimination, 3D position).

A. Dataset building
To train the network model for robot detection, a huge amount of data from real scenarios is needed, together with accurate ground-truth annotations. Considering the lack of an open-source dataset for the RoboCup MSL, we decided to collect a set of images taken under varying conditions, with different vision sensors and from different points of view. We have made the dataset publicly available and encourage similar research in the robotics/vision community.

A typical image from our dataset is shown in Fig. 3, where there are several robots with different shapes and colors on the green soccer field. These robots share some common features, such as a black base, an upper body that is thinner than the lower body, and a similar background. We need to detect the robots and annotate their positions with respect to the image. All the annotations are provided in an XML file named "annotations.xml" that contains each robot's Class, Width, Height, Xmin, Xmax, Ymin and Ymax, among other fields. The origin of the image coordinate frame is placed in the upper-left corner. Since more than one robot is usually present in an image, bounding boxes can overlap. There are 1456 images in total in our dataset, without any pre-processing, and the images have different dimensions. We selected 1000 images for training and 456 images for testing.

Fig. 3 One of the annotated images from our dataset; a bounding box is defined by its (Xmin, Ymin) and (Xmax, Ymax) corners, with width along the X axis and height along the Y axis.
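Since a YOLO-style detector expects each ground-truth box in normalized (x_center, y_center, width, height) form, the corner-style annotations described above have to be converted before training. The following is a minimal sketch of that conversion, assuming the box corners and image size are given in pixels; the function name and return type are ours, and the darknet label-file format details are omitted:

// Convert a Pascal-VOC-style box (Xmin, Ymin, Xmax, Ymax) in pixels to the
// normalized (x_center, y_center, width, height) form used by YOLO/darknet.
struct YoloBox { float cx, cy, w, h; };

YoloBox vocToYolo(int xmin, int ymin, int xmax, int ymax,
                  int imgWidth, int imgHeight)
{
    float cx = 0.5f * (xmin + xmax) / imgWidth;   // normalized box center x
    float cy = 0.5f * (ymin + ymax) / imgHeight;  // normalized box center y
    float w  = static_cast<float>(xmax - xmin) / imgWidth;
    float h  = static_cast<float>(ymax - ymin) / imgHeight;
    return {cx, cy, w, h};
}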

² Kinect Homepage: https://developer.microsoft.com/en-us/windows/kinect
³ Jetson TX2 introduction: http://elinux.org/Jetson_TX2


B. Image pre-processing acceleration
As mentioned above, the YOLO detection system is extremely fast and accurate in comparison with other detection systems when it is run on the NVIDIA Titan X⁴, which was the strongest graphics processing unit at the time. Obviously, the embedded development board Jetson TX2 we used is weaker than the NVIDIA Titan X, which makes the detection process slower for robot detection.
For fast and convenient prediction, we transformed and resized (preserving the aspect ratio) the input image before the prediction process, and then used a fixed-size box to constrain the resized image to a fixed width and height as the input to the prediction process. The image pre-processing and the prediction process need about 58 ms and 60 ms respectively, failing to meet the real-time requirements of a real competition.

Fig. 4 Data transmission between the CPU and GPU for the image pre-processing (the RGB image is transmitted to the YOLO framework, the image data are processed on the GPU, and the boxed image is copied back to the CPU).

The pipeline of the image pre-processing algorithm offers pixel-level data parallelism, which can be easily exploited on the CUDA (Compute Unified Device Architecture) architecture. To accelerate the detection process, we decided to parallelize the data pre-processing step using CUDA. Since the GPU consists of multiple cores, it allows independent thread scheduling and execution, and is suitable for computations over independent pixels. Therefore, in the image pre-processing step, we use m × n threads for an image with dimensions m × n, and use blocks of an appropriate size running on multiple cores. The pipeline of the data transmission between the CPU and GPU is shown in Fig. 4.

⁴ NVIDIA Titan X Homepage: https://www.nvidia.com/en-us/geforce/products/10series/titan-x-pascal/
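As a minimal sketch of this pixel-level parallelism (an illustration, not the authors' exact kernel: the packed three-channel layout and names are assumptions, and the aspect-ratio-preserving padding step is omitted), a nearest-neighbour resize to the fixed network input size can use one CUDA thread per output pixel:

#include <cuda_runtime.h>

// Each thread writes one pixel of the fixed-size network input by
// nearest-neighbour sampling of the source RGB image.
__global__ void resizeRgbKernel(const unsigned char* src, int srcW, int srcH,
                                unsigned char* dst, int dstW, int dstH)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dstW || y >= dstH) return;

    // Nearest source pixel for this destination pixel.
    int sx = min(srcW - 1, (int)(x * (float)srcW / dstW));
    int sy = min(srcH - 1, (int)(y * (float)srcH / dstH));

    for (int c = 0; c < 3; ++c)  // copy the R, G and B channels
        dst[(y * dstW + x) * 3 + c] = src[(sy * srcW + sx) * 3 + c];
}

void resizeRgb(const unsigned char* d_src, int srcW, int srcH,
               unsigned char* d_dst, int dstW, int dstH)
{
    dim3 block(16, 16);  // one thread per output pixel
    dim3 grid((dstW + block.x - 1) / block.x, (dstH + block.y - 1) / block.y);
    resizeRgbKernel<<<grid, block>>>(d_src, srcW, srcH, d_dst, dstW, dstH);
}

Because every output pixel is computed independently, this step maps directly onto the GPU and removes the pre-processing time from the CPU.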
C. Robot Detection
After the image pre-processing, we feed the processed images into the test network to predict each robot's box position and the corresponding probability. Considering the real-time requirement of the robot detection process, we use a tiny version of the model which has 9 convolutional layers and 1 region detection layer. The first 6 convolutional layers of the network extract features from the image, the pooling layers perform subsampling and enhance the rotational invariance, and the region detection layer predicts the boxes' coordinates and classification probabilities. The network architecture is shown in Fig. 5.

Fig. 5 The network architecture of the Convolutional Neural Network used for robot detection (stacked convolution and pooling layers followed by a region detection layer).

D. Image Registration and Depth Point Cloud Production
We register the RGB image to the depth image and produce the depth point cloud, and then find the depth value according to the box's center position in the registered image. After that, we can get the 3D position from the depth point cloud. The open-source driver libfreenect2 [14, 15] is used here, but it does not provide parallel registration of the color and depth images. The Jetson TX2 has a relatively weak CPU compared with mainstream CPU processors. To improve the real-time performance of the algorithm, and for the convenience of the robot detection step, we parallelized the image registration and depth point cloud production on the GPU.

The bottlenecks of this step when implemented on the GPU are that the amount of data to process is huge, the global memory bandwidth is small, and the amount of shared and constant memory is limited. To get around these bottlenecks, the texture storage provided by the GPU architecture is utilized. It is cached on the chip and provides more effective bandwidth by reducing memory requests to off-chip DRAM. We use the texture memory to store the depth-to-color mapping data. The computation speed is greatly improved in each thread since the memory access time is significantly reduced. The registered image and depth point cloud are shown in Fig. 6.

Fig. 6 Image registration and depth point cloud production: (a) RGB image; (b) depth image; (c) registered image; (d) depth point cloud.
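A minimal sketch of the texture-memory idea is given below; it is an illustration under assumptions, not the libfreenect2 or NuBot code. The per-pixel float2 (color x, color y) map layout and the single-channel output are simplifications: the mapping table is copied once into a CUDA array, exposed through a texture object, and each registration thread reads its entry through the cached texture path instead of global memory.

#include <cuda_runtime.h>

// One thread per depth pixel: look up the corresponding color pixel through
// the texture cache and write the registered value.
__global__ void registerKernel(cudaTextureObject_t mapTex,
                               const unsigned char* gray, int colorW, int colorH,
                               unsigned char* registered, int depthW, int depthH)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= depthW || y >= depthH) return;

    float2 c = tex2D<float2>(mapTex, x + 0.5f, y + 0.5f);  // cached map read
    int cx = (int)c.x;
    int cy = (int)c.y;

    unsigned char value = 0;
    if (cx >= 0 && cx < colorW && cy >= 0 && cy < colorH)
        value = gray[cy * colorW + cx];   // single channel kept for brevity
    registered[y * depthW + x] = value;
}

// Upload the depth-to-color map once and wrap it in a texture object.
cudaTextureObject_t makeMapTexture(const float2* hostMap, int depthW, int depthH)
{
    cudaChannelFormatDesc ch = cudaCreateChannelDesc<float2>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &ch, depthW, depthH);
    cudaMemcpy2DToArray(arr, 0, 0, hostMap, depthW * sizeof(float2),
                        depthW * sizeof(float2), depthH, cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc tex = {};
    tex.addressMode[0] = cudaAddressModeClamp;
    tex.addressMode[1] = cudaAddressModeClamp;
    tex.filterMode     = cudaFilterModePoint;
    tex.readMode       = cudaReadModeElementType;

    cudaTextureObject_t texObj = 0;
    cudaCreateTextureObject(&texObj, &res, &tex, nullptr);
    return texObj;
}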


E. Noise Elimination
Sometimes we get false positive results because of blurred features and noise in the image. To avoid these situations, a validation process is employed to decide whether a box contains a real robot.
The robot's size is limited to a fixed range, and all of the robots have similar sizes. So, we can establish the relationship between a box's size and the distance from the box's center to the Kinect. Using this knowledge, we estimate the number of pixels that a robot box should contain according to its distance to the Kinect. We obtain the relation function in equation (1), in which x represents the distance and f(x) represents the box's size:

f(x) = 3.67e+09 · x^(−1.972)    (1)

The curve of f(x) is plotted in Fig. 7. We can then eliminate any box whose size deviates too much from the curve.

Fig. 7 The relation between the robot box's size (in pixels) and its distance to the Kinect sensor (measured data points and fitted curve).
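A minimal sketch of this validation test is given below, using the fitted curve of equation (1); the relative tolerance is an illustrative choice rather than a value reported here:

#include <cmath>

// Expected number of box pixels at distance x (in mm), from equation (1).
inline double expectedBoxPixels(double x)
{
    return 3.67e9 * std::pow(x, -1.972);
}

// Keep a candidate box only if its pixel count is close to the count
// expected for its distance to the Kinect.
inline bool isPlausibleRobotBox(double boxPixels, double distanceMm,
                                double relativeTolerance = 0.5)
{
    double expected = expectedBoxPixels(distanceMm);
    return std::fabs(boxPixels - expected) <= relativeTolerance * expected;
}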

F. Get the 3D Positions
We obtain the most probable candidate robot regions after validating the targets, and then we look up every box's center position in the depth point cloud to get the 3D position of each detected robot. If the box's center has no depth information, we adjust it by adding a small offset to the center.
Finally, we obtain the robot's 3D position in the depth point cloud as shown in Fig. 8(b), where the grey ball represents the detected robot's center position.

Fig. 8 (a) Detected robots in the registered image; (b) the robots' 3D positions in the depth point cloud.
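A minimal sketch of this look-up step is shown below; the organized point-cloud layout, the NaN convention for missing depth and the offset pattern are assumptions made for illustration, not the authors' code:

#include <cmath>
#include <vector>

struct Point3f { float x, y, z; };   // one point of the organized cloud

// Organized point cloud: width*height points, row-major; NaN z means no depth.
// Read the point at the box centre (cx, cy) and, if it carries no valid depth,
// retry at small offsets around it.
bool lookupRobotPosition(const std::vector<Point3f>& cloud, int width, int height,
                         int cx, int cy, Point3f& out)
{
    const int offsets[5][2] = {{0, 0}, {2, 0}, {-2, 0}, {0, 2}, {0, -2}};
    for (const auto& o : offsets)
    {
        int x = cx + o[0], y = cy + o[1];
        if (x < 0 || x >= width || y < 0 || y >= height) continue;
        const Point3f& p = cloud[y * width + x];
        if (!std::isnan(p.z) && p.z > 0.0f)   // valid depth found
        {
            out = p;
            return true;
        }
    }
    return false;                             // no depth near the centre
}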
III. EXPERIMENTAL RESULTS
This section presents quantitative experimental results obtained by our approach. In order to test both the real-time performance and the measurement accuracy of robot detection and localization, we used our NuBot soccer robots as the experimental platforms; they are equipped with the Kinect v2 sensor and employ the Jetson TX2 as the processor. JetPack 3.0 was used as the programming interface, with CUDA 8.0 for parallel computing and cuDNN 5.0 for CNN acceleration.

A. Detection precision and mAP
As there is no other public dataset of MSL robots available for us to measure our dataset performance, we tested the dataset using the 400 testing images contained in the dataset. The testing results are shown in TABLE I.

TABLE I. DATASET PERFORMANCE
IOU       Recall    Precision   mAP
72.76%    94.79%    89.09%      70.65%

From the table we can see that the precision is approximately 90%, which meets the detection accuracy requirements of the RoboCup competition. The mAP results indicate good performance on the MSL robots dataset. We encourage others to use our dataset and to compare algorithm performance based on it within the robot vision community. Furthermore, we tested the detection performance using some images acquired at the 2017 RoboCup competition in Nagoya with different viewpoints, and the detection results are shown in Fig. 9. From the results we can conclude that we can detect the visible robots in the MSL competition using our proposed approach.

Fig. 9 Detection examples on images from the 2017 RoboCup competition in Nagoya.

B. Real-time performance after the parallel computing
The detection process includes two steps: image pre-processing and prediction. The prediction process already employs GPU parallel computing, which leaves little room for acceleration because of the hardware restrictions. The image pre-processing impacts the speed a lot, and the computation time to process each frame is approximately 58 ms. We accelerated this process by parallelizing it based on CUDA.
We tested the processing time of the parallelized image pre-processing under different scenes in the robot soccer field. The improved performance is shown in TABLE II. We can realize real-time robot detection in the RoboCup MSL competition using this approach with the YOLO v2 architecture. The statistics of the computing time in the image pre-processing after the acceleration and in the prediction process are shown in Fig. 10, where the red lines represent the median computing time, the grey lines represent the maximal and minimal computing times, and the blue boxes are the box-whisker plots produced by Matlab.


Fig. 10 Statistics of processing time.

TABLE II. PROCESSING TIME AFTER ACCELERATION
Time                 Pre-processing (ms)   Prediction (ms)   FPS
TX2                  51-60                 54-60             8.3-9.2
After acceleration   0-10                  54-60             14.2-18.5

C. Localization error
After the accurate robot detection in the registered RGB image, we need to localize the robots in the soccer field for the decision-making and obstacle avoidance procedures. The accuracy of the localization has a great impact on the robot's competition performance during a game.
As mentioned above, we combined the depth information with the RGB information to get the detected robots' 3D positions in the field. To measure the localization error, we set up an experiment in which two soccer robots are placed at positions (-1660, 1800) and (1750, 3280) respectively in our soccer field, and the robots' positions are measured from the position (0, 0) by a robot with the proposed detection and localization system. The localization results are shown in Fig. 11 (red points: detected robots' positions; blue points: robots' ground-truth positions; maroon point: the position of the NuBot carrying the Kinect sensor for robot detection and localization), and the average Euclidean distance measurement error is shown in TABLE III. Considering the physical dimensions of the robot (approximately 52 cm × 52 cm) and the error of placing a robot at a specific position, the localization error (30.567 mm) of our proposed system is acceptable.

Fig. 11 Robot localization results in our soccer field (axes X and Z in mm; the plot shows the detected positions of robot 1 and robot 2, the robots' ground-truth positions, and the NuBot's position).

TABLE III. LOCALIZATION ERROR (mm)
Robot position    mean     min     max
(-1660, 1800)     24.140   0.568   77.967
(1750, 3280)      36.994   0.832   181.658

IV. CONCLUSION AND FUTURE WORK
This paper proposed a novel approach for robot detection and localization in the RoboCup MSL competition, using the Kinect v2 and the Jetson TX2 as the hardware platform.
We built a dataset of MSL robots, as there was no public dataset for research on MSL robot detection, and we shared it on GitHub for the benefit of the robot vision community. We accelerated the image pre-processing step of the YOLO detection system to make it meet the real-time requirements of the RoboCup MSL competition. More importantly, we combined the detected robots' 2D positions in the RGB image with the depth point cloud to obtain their 3D positions, so the accuracy of robot localization was improved. We think the real-time performance can be further improved if a more powerful GPU is used.
In the future, we will maintain and improve our dataset of MSL robots and continue the research on accurate and real-time detection and localization of the robots, the ball and the referee.

REFERENCES
[1] Neves, A.J., Pinho, A.J., Martins, D.A., and Cunha, B., "An Efficient Omnidirectional Vision System for Soccer Robots: From Calibration to Object Detection," Mechatronics, 21(2), 2011, pp. 399-410.
[2] Lu, H., Yang, S., Zhang, H., and Zheng, Z., "A Robust Omnidirectional Vision Sensor for Soccer Robots," Mechatronics, 21(2), 2011, pp. 373-389.
[3] MSL Technical Committee, "Middle Size Robot League Rules and Regulations for 2017," 2017.
[4] LeCun, Y., Bengio, Y., and Hinton, G., "Deep Learning," Nature, 521(7553), 2015, pp. 436-444.
[5] Krizhevsky, A., Sutskever, I., and Hinton, G., "ImageNet Classification with Deep Convolutional Neural Networks," in NIPS, 2012.
[6] Deng, J., Berg, A., Satheesh, S., Su, H., Khosla, A., and Fei-Fei, L., "ImageNet Large Scale Visual Recognition Competition 2012 (ILSVRC2012)," 2012. http://www.image-net.org/challenges/LSVRC/2012/.
[7] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L., "ImageNet: A Large-Scale Hierarchical Image Database," in CVPR, 2009.
[8] Liu, W., et al., "SSD: Single Shot MultiBox Detector," in European Conference on Computer Vision, Springer, Cham, 2016.
[9] Ren, S., et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," in Advances in Neural Information Processing Systems, 2015.
[10] Redmon, J., et al., "You Only Look Once: Unified, Real-Time Object Detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[11] Redmon, J., and Farhadi, A., "YOLO9000: Better, Faster, Stronger," arXiv preprint arXiv:1612.08242, 2016.
[12] de Almeida Ribeiro, P.R., Lopes, G., and Ribeiro, F., "Neural Network in Computer Vision for RoboCup Middle Size League," Journal of Software Engineering and Applications, 9(7), 2016, p. 319.
[13] Albani, D., et al., "A Deep Learning Approach for Object Recognition with NAO Soccer Robots," RoboCup Symposium: Poster Presentation, 2016.
[14] Lawin, F.J., Forssén, P.-E., and Ovrén, H., "Efficient Multi-frequency Phase Unwrapping Using Kernel Density Estimation," in European Conference on Computer Vision, pp. 170-185, Oct. 2016.
[15] Lingzhu Xiang, F.E., Christian Kerl, Thiemo Wiedemeyer, Lars, hanyazou, Alistair, "libfreenect2: Release 0.2 [Data set]," Zenodo, http://doi.org/10.5281/zenodo.50641, 2016.


