
Fire SSD: Wide Fire Modules based Single Shot Detector on Edge Device

HengFui Liau,1 Yamini Nimmagadda,2 and YengLiong Wong1
{heng.hui.liau,yamini.nimmagadda,yeng.liong.wong}@intel.com

arXiv:1806.05363v1 [cs.CV] 14 Jun 2018

1 HengFui Liau and YengLiong Wong are with Intel's Internet of Things Group, Bayan Lepas, Penang, Malaysia.
2 Yamini Nimmagadda is with Intel's Software and Service Group, Hillsboro, Oregon, United States.

Abstract: With the emergence of edge computing, there is an increasing need for running convolutional neural network based object detection on small form factor edge computing devices with limited compute and thermal budget for applications such as video surveillance. To address this problem, efficient object detection frameworks such as YOLO and SSD were proposed. However, SSD based object detection that uses VGG16 as backend network is insufficient to achieve real-time speed on edge devices. To further improve the detection speed, the backend network is replaced by more efficient networks such as SqueezeNet and MobileNet. Although the speed is greatly improved, it comes at the price of lower accuracy. In this paper, we propose an efficient SSD named Fire SSD. Fire SSD achieves 70.7mAP on the Pascal VOC 2007 test set. Fire SSD achieves a speed of 30.6FPS on a low power mainstream CPU, is about 6 times faster than SSD300, and has an about 4 times smaller model size. Fire SSD also achieves 22.2FPS on an integrated GPU.

I. INTRODUCTION
Object detection is one of the main areas of research in computer vision and has many real-world applications. In recent years, object detection has been widely used in several fields including but not limited to digital surveillance, autonomous vehicles, smart city, industrial automation, and smart home. With the increasing need for protecting privacy and security, and for applications with limited network connectivity, edge computing devices (referred to as edge throughout this paper) are being deployed for local processing and inference of visual data. Edge devices are often thin clients and internet of things (IoT) systems with limited power and memory budget. To perform object detection on edge devices, there is a need for real-time inference speed, small model size and high energy efficiency. To be economical and widely deployable, this object detector must operate on low power CPUs, integrated GPUs, or accelerators that dissipate far less power than the powerful discrete GPUs used for benchmarking in typical computer vision experiments. To address the need for running CNN based object detection in real time, various object detection methods have been proposed [1][2][3][4]. Among the various object detection methods, Single Shot Multibox Detector (SSD) [2] has received a lot of attention because it is relatively fast and has an interchangeable backend network. SSD is robust to scale variations because it makes use of multiple convolution layers at different scales for object detection. Although SSD300 and SSD512 perform well in detection accuracy, they still cannot fit within the constraints of edge devices. The first constraint is model size. As an example, the model size of SSD300 that uses the VGG16 network [5] as backend is 103MB. This makes the network model less efficient for distribution across edge devices over the network. In complex visual use cases, tasks such as video decode and image pre-processing are run on the CPU or integrated GPU of edge devices and deep learning is offloaded to add-on accelerators such as FPGAs. To qualify as edge devices, low-power FPGAs with less than 10MB of on-chip memory are used for such deployments. The large network size also limits the deployment of the object detector on these accelerators. The second constraint is the throughput of the object detector. SSD300 achieves 46FPS on Titan X [2]. However, edge devices may not have the luxury of using a 250W GPU because they are typically smaller and have a very low power/thermal envelope.

To further improve the throughput of SSD, the backend network, which acts as a feature extractor, is replaced by more efficient network architectures such as SqueezeNet [6], MobileNet [7] and ShuffleNet [8]. Iandola et al. [6] proposed SqueezeNet to address the issue of high parameter count in conventional neural networks. SqueezeNet reduces the number of parameters by replacing the conventional 3 × 3 convolutional layer with the Fire Module. The Fire Module reduces the number of channels of the input data by employing a 1 × 1 convolutional layer before applying a 3 × 3 convolutional layer and a 1 × 1 convolutional layer in parallel, each with 1/2 of the targeted number of channels. MobileNet proposes depth-wise separable convolution to reduce the number of computation operations significantly. ShuffleNet proposes pointwise grouped convolutions and channel shuffle to greatly reduce the computations while maintaining accuracy [8]. Replacing the backend network with SqueezeNet or MobileNet reduces the computation time significantly but it comes at the price of accuracy degradation. In this work, we propose an efficient SSD named Fire SSD that is both fast and accurate. The key features of Fire SSD are as follows:

• Wide fire module inspired by SqueezeNet, ResNeXt [9] and ShuffleNet. Grouped convolution [9][10] has been shown not only to reduce the computational complexity but also to improve the accuracy at the same time. We improved the Fire module by replacing the 3 × 3 expand convolution with grouped 3 × 3 convolution. The 1 × 1 expand layer remains unchanged. We studied the performance difference between grouped 1 × 1 convolution and 1 × 1 convolution and the experimental result shows that 1 × 1 grouped convolution does not improve the
accuracy.
• Dynamic Residual Mbox Detection layer. In the SSD framework, six feature maps with different sizes are extracted from the backend network. Each feature map is processed by a single 3 × 3 convolutional layer for bounding box regression and object class prediction. Feature maps at higher layers are responsible for detecting small scale objects. SSD does not perform well on detecting small scale objects because the corresponding feature maps are not deep enough and the extracted features are not semantic enough to detect small scale objects. VGG16 and ResNet [11] show that deeper features improve the classification accuracy. We stacked a different number of convolution layers for feature maps at different levels. To keep the model size and computational complexity small, we use the Wide Fire Module (WFM). To compensate for the vanishing gradient effect as the number of convolutional layers grows, a residual connection is introduced to connect the upper and lower WFM modules.
• Normalized and dropout module. The Normalized and Dropout Module (NDM) serves two purposes. First, NDM normalizes the gradient from different levels of feature maps by using a batch normalization layer. This effectively eliminates the need for the L2 normalization layer on the conv4_3 feature map as described in SSD. Second, the dropout layer [10][21] helps regularize the training and performs data augmentation at different feature maps. The experimental result shows that NDM improves the accuracy by 1.6mAP.

We evaluate our models on the PASCAL VOC dataset. Fire SSD achieves 70.7mAP on the Pascal VOC 2007 test set [20]. Fire SSD achieves a speed of 30.6FPS on a low power Intel® Core™ CPU, which is about 6 times faster than SSD300. We also implemented our network in Intel® OpenVINO™ [12] and measured performance on a small form factor PC, Intel® Skull Canyon [13]. The result shows that Fire SSD achieved real-time speed on the CPU and 22.2FPS on the integrated GPU available in most personal computers.

II. RELATED WORK
A. CNNs for object detection
SSD, YOLO [3] and YOLOv2 [4] are among the most popular fast object detection approaches. YOLO uses a single feedforward convolutional network to directly predict the object classes and locations. Redmon et al. [4] proposed YOLOv2 to improve YOLO in several aspects. Redmon et al. also proposed a YOLOv2 with a small model size named tiny YOLOv2. Tiny YOLOv2 has only 41% of the number of parameters of YOLOv2, with 12mAP lower accuracy. Liu et al. [2] proposed the SSD method, which spreads out the anchors of different scales to multiple layers within the backend network and uses feature maps from different layers to focus on predicting objects of a certain scale. Because SSD detects objects directly from the feature maps of different backend network layers, it can achieve real-time object detection on high end GPUs and runs faster than most other state-of-the-art object detectors.

B. SqueezeNet
Iandola et al. proposed SqueezeNet to address the issue of high parameter count in conventional neural networks. The SqueezeNet model has only 4.8MB of parameters and it matches AlexNet accuracy on ImageNet. The SqueezeNet model is composed of Fire Modules. In a Fire Module, the input tensor of size H × W × C is first passed through a 1 × 1 convolution filter to reduce the channel size to $C/4$. Next, a 3 × 3 convolution filter with $C/2$ channels together with a parallel 1 × 1 convolutional filter with $C/2$ channels is used to fuse spatial information. Given an input and output of size H × W × C, a 3 × 3 convolutional layer requires $9 \times C^2$ parameters and $9 \times H \times W \times C^2$ computations, while the Fire Module requires only $\frac{3}{2} \times C^2$ parameters and $\frac{3}{2} \times H \times W \times C^2$ computations.

III. NETWORK ARCHITECTURE
Fire SSD is based on the SSD network. The backend network is replaced by residual SqueezeNet. After the Fire8 layer, 10 WFM modules are appended along with one 3 × 3 convolutional layer. Our proposed architecture is shown in Figure 1. Fire SSD consists of a backend network and six Multibox feature branches. The appended WFM modules are summarized in Table I.

TABLE I
FIRE SSD: APPENDED WIDE FIRE MODULES AND POOLING LAYERS

Layers               Output Size   Stride   Channel
Input (From Fire8)   38 × 38       -        512
Pool8                19 × 19       2        512
Fire9                19 × 19       1        512
Fire10               19 × 19       1        512
Pool10               10 × 10       2        512
Fire11               10 × 10       1        512
Fire12               10 × 10       1        512
Pool12               5 × 5         2        512
Fire13               5 × 5         1        512
Fire14               5 × 5         1        512
Fire15               3 × 3         2        512
Fire16               3 × 3         1        512
Conv17               1 × 1         1        512

A. Wide Fire Module
The WFM shown in Figure 2(a) is inspired by SqueezeNet, ResNeXt [9] and ShuffleNet. Grouped convolution not only reduces the computational complexity but also improves the accuracy at the same time. We improved the fire module by replacing the 3 × 3 expand convolution with grouped 3 × 3 convolution. The 1 × 1 expand layer remains unchanged. The study conducted in ResNeXt [9] shows that increasing the cardinality of a network gives better results than going deeper or wider in image classification tasks. Compared to going wider or deeper, increasing cardinality comes with only a minimal increase in computational complexity. One can view this as ensembling multiple paths of
Fig. 1. Network architecture of Fire SSD

3 × 3 convolution and 1 × 1 convolution filters. For Fire SSD, the cardinality of the 3 × 3 grouped convolution is set to 16 and the cardinality of the 1 × 1 grouped convolution is set to 2. Higher cardinality for the 1 × 1 grouped convolution has been tested and the results show minimal improvement or even a decrease in performance. We argue that this is due to the receptive field of the 1 × 1 convolution, which is much smaller than that of the 3 × 3 convolution. Further breaking the 1 × 1 convolution into multiple groups with a smaller number of channels may leave it unable to capture the useful information from the input feature maps. To balance the number of features generated by the grouped 1 × 1 and 3 × 3 convolutions, the cardinality is set to fulfill the condition in (1).

$C_{1\times1} \times K_{1\times1} \approx C_{3\times3} \times K_{3\times3}$    (1)

where C is the number of channels for each grouped convolution and K is the filter size.

B. Dynamic Mbox Detection Layer
In the SSD framework, six feature maps with different sizes are extracted from the backend network. Each feature map is processed by a single 3 × 3 convolutional layer for bounding box regression and object class prediction. Feature maps at higher layers are responsible for detecting small scale objects. SSD does not perform well on detecting small scale objects because the corresponding feature maps are not deep enough and the extracted features are not semantic enough for detecting objects. VGG16 and ResNet show that deeper features can improve the classification accuracy. We stacked a different number of convolutional layers for feature maps based on the feature map size to increase the depth of each mbox branch. To keep the model size and computational complexity small, WFM is used. To compensate for the vanishing gradient effect as the number of convolutional layers grows, a residual connection is introduced to connect the upper and lower WFM modules. The details of the dynamic mbox detection layer are presented in Figure 1. For the 38 × 38 and 19 × 19 feature maps, two WFMs are introduced before the 3 × 3 convolution layer. For the 10 × 10 and 5 × 5 feature maps, one WFM is introduced before the 3 × 3 convolution layer. No WFM is added to the 3 × 3 and 1 × 1 feature map branches because large objects are easier to detect.

C. Normalize and Dropout module
In SSD300, the objective functions are applied directly on the selected feature maps and an L2 normalization layer is used for the conv4_3 layer because of the large magnitude of its gradient. We investigated the layers that contain an mbox branch and found that the gradient magnitude is different for each layer. Keeping this in mind, we added a batch normalization layer for each subnet. The number of filters for each layer is fixed to 512 channels to ensure that the mbox layers obtain sufficient information to make decisions. A large number of filters comes with a high risk of overfitting. To avoid overfitting while using such a large number of filters, a dropout layer is appended after the batch normalization layer. The dropout ratio is set to a small value to minimize the variance shift problem [17]. The dropout ratio varies according to the size of the feature map. Larger feature maps have a larger dropout ratio
because the larger the number of parameters, the more prone it is to variance shift. We named the sub-network design the normalized and dropout module (NDM), as shown in Figure 2(b).

Fig. 2. (a) Wide Fire Module. (b) Normalization and Dropout Module

IV. EXPERIMENT
The experiments are conducted on the widely used PASCAL VOC dataset, which has 20 object categories. Object detection accuracy is measured by mean Average Precision (mAP).

A. Network Training
The model is trained on the union of the PASCAL 2007 trainval and 2012 trainval datasets with Caffe [14]. The backend network of Fire SSD is SqueezeNet with residual connections. The newly added layers are initialized with the Xavier method [15]. The training process consists of three stages. The optimization algorithm for training is stochastic gradient descent (SGD). During the first stage of training, the backend network is frozen. The learning rate is warmed up [19] to 0.001 and the batch size is 64 for the first 50K iterations. For the second stage, the backend network is unfrozen. The learning rate is adjusted to 0.01 and the batch size is increased to 128 for another 50K iterations. During the third stage, the learning rate is reduced to 0.001 for another 50K iterations.

B. Effect of Various Design Choices
The effects of several design choices are investigated. A baseline Fire SSD with SqueezeNet and residual connections as the backend network is trained. The result shows that the baseline network only scores 67.9mAP. By implementing WFM, the accuracy is improved by 0.4mAP. This result shows that the grouped 3 × 3 convolution filter improves the performance. The performance is further improved by 0.2mAP by incorporating the dynamic mbox detection layer. The dynamic mbox layer improves Fire SSD accuracy on small objects. The final design of Fire SSD with NDM boosts the network to 70.7mAP. It shows that all of the introduced modules contribute to the accuracy improvement.

TABLE II
EFFECT OF VARIOUS DESIGN CHOICES (× = module included, - = not included)

Dynamic Mbox Layer         ×      ×      -      -
Wide Fire Module           ×      ×      ×      -
Normalize Dropout Module   ×      -      -      -
VOC2007 mAP                70.7   69.1   68.9   68.5

C. Pascal VOC 2007
As can be seen from Table III, among the lightweight object detectors, Fire SSD scores the highest accuracy. It has higher accuracy than YOLOv2 by 2.7mAP with only about 10% of YOLOv2's model size. In terms of model size, Fire SSD is about 2 times smaller than Tiny YOLOv2 but achieves 13.6mAP higher accuracy. Compared with the other SSD-like implementations, Fire SSD outperforms SSD+MobileNet by 2.7mAP and SSD+SqueezeNet by 6.4mAP. The comparisons are shown in Table III.

TABLE III
PASCAL VOC 2007 TEST RESULT

Model                   Computation Cost   Model Size             Accuracy
                        (Millions MACS)    (Million Parameters)   (mAP)
YOLO V2                 8,360              67.13                  69
Tiny YOLO V2            3,490              15.86                  57.1
SSD + MobileNet [16]    1,150              5.77                   68
SSD + SqueezeNet [16]   1,180              5.53                   64.3
Fire SSD                2,460              6.81                   70.7
SSD300                  31,380             26.28                  77.2

D. Inference Speed on CPU and GPU
We tested Fire SSD on a small form factor computer called the Intel® Next Unit of Computing (NUC). The model selected is Skull Canyon. Skull Canyon features a 45W quad core CPU clocked at 2.6GHz and Intel® Iris™ Pro Graphics 580 clocked at 350MHz [13]. The size of Skull Canyon is 211mm × 116mm × 28mm, which is ideal for edge deployment. Fire SSD is implemented in IntelCaffe [18] and OpenVINO™. IntelCaffe is a fork of Caffe that is optimized for Intel® CPUs. OpenVINO™ supports deep learning acceleration on different Intel hardware platforms. In this paper, we evaluated the inference throughput of the network on the CPU and the integrated GPU. The results are presented in Table IV. The batch size for all tests is set to 1. Fire SSD achieves 17.9FPS using IntelCaffe. Using OpenVINO™, Fire SSD achieves 30.6FPS, which is nearly double the speed of IntelCaffe. The improved speed is due to the model optimizer of OpenVINO™. It performs
layer fusing and pruning on the network [12]. For GPU, OpenVINO™ supports both FP32 and FP16 computations. Due to the lower clock rate compared to the CPU, Fire SSD scores 22.2FPS running on the GPU. Surprisingly, the GPU FP16 result of Fire SSD is slower than GPU FP32. The reason is unclear and needs further investigation. Although Fire SSD is not the fastest network, it is the most accurate among the networks that can run at real-time speed on edge devices.

TABLE IV
INFERENCE SPEED (FPS) ON CPU AND GPU

                                                 OpenVINO™
Model              Accuracy   IntelCaffe with    CPU      GPU      GPU
                   (mAP)      CPU (FP32)         (FP32)   (FP32)   (FP16)
Tiny YOLO V2       57.1       NA                 49.5     84.2     108.6
SSD + MobileNet    68         27.8               91.7     69.4     90.9
SSD + SqueezeNet   64.3       35.7               80       57.8     67.6
Fire SSD           72         17.9               30.6     22.2     18.6
SSD300             77.2       4.2                5.2      10.8     18

V. CONCLUSIONS
In this work, we presented WFM, which originates from the fire module, to improve the backend network of SSD in terms of accuracy while keeping the number of parameters and the computational complexity small. The dynamic mbox layer is proposed to improve the mbox branches of the original SSD. NDM serves the purpose of regularizing the network and improving the accuracy of the object detector. The proposed object detector not only has a small model size but also scores 72mAP on PASCAL VOC 2007. With the above-mentioned advantages, Fire SSD is ideal for deployment on edge devices and achieves real-time performance on mainstream CPUs and integrated GPUs.

REFERENCES
[1] S. Ren, K. He, R. Girshick and J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, in NIPS'15 Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, pp. 91-99.
[2] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu and A. Berg, SSD: Single shot multibox detector, in Computer Vision - 14th European Conference, ECCV 2016, Proceedings, pp. 21-37.
[3] J. Redmon, S. Divvala, R. Girshick and A. Farhadi, You only look once: Unified, real-time object detection, in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) 2016, pp. 779-788.
[4] J. Redmon and A. Farhadi, YOLO9000: Better, faster, stronger, in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) 2017, pp. 6517-6525.
[5] K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv preprint arXiv:1409.1556, 2014.
[6] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally and K. Keutzer, SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, arXiv preprint arXiv:1602.07360, 2016.
[7] A. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto and H. Adam, MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, arXiv preprint arXiv:1704.04861, 2017.
[8] X. Zhang, X. Zhou, M. Lin and J. Sun, ShuffleNet: An extremely efficient convolutional neural network for mobile devices, arXiv preprint arXiv:1707.01083, 2017.
[9] S. Xie, R. Girshick, P. Dollár, Z. Tu and K. He, Aggregated Residual Transformations for Deep Neural Networks, in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) 2017, pp. 5987-5995.
[10] A. Krizhevsky, I. Sutskever and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Advances in neural information processing systems 2012, pp. 1097-1105.
[11] K. He, X. Zhang, S. Ren and J. Sun, Deep Residual Learning for Image Recognition, in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) 2016, pp. 770-778.
[12] A. Burns, OpenVINO Toolkit Accelerates Computer Vision Development across Intel Platforms, https://software.intel.com/en-us/blogs/2018/05/16/openvino-toolkit-accelerates-cv-development-across-intel-platforms.
[13] Skull Canyon, https://www.intel.com/content/www/us/en/products/boards-kits/nuc/kits/nuc6i7kyk.html.
[14] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama and T. Darrell, Caffe: Convolutional architecture for fast feature embedding, in Proceedings of the 22nd ACM international conference on Multimedia, pp. 675-678.
[15] X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249-256.
[16] J. Huang et al., Speed/accuracy trade-offs for modern convolutional object detectors, arXiv preprint arXiv:1611.10012, 2016.
[17] D. Hendrycks and K. Gimpel, Adjusting for Dropout Variance in Batch Normalization and Weight Initialization, arXiv preprint arXiv:1607.02488, 2016.
[18] V. Karpusenko, A. Rodriguez, J. Czaja and M. Moczala, Caffe* Optimized for Intel Architecture: Applying Modern Code Techniques, https://software.intel.com/en-us/articles/caffe-optimized-for-intel-architecture-applying-modern-code-techniques.
[19] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia and K. He, Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, arXiv preprint arXiv:1706.02677, 2017.
[20] M. Everingham, S. M. Eslami, L. Van Gool, C. K. Williams, J. Winn and A. Zisserman, The Pascal Visual Object Classes Challenge: A Retrospective, International Journal of Computer Vision, Volume 111 Issue 1, January 2015, pp. 98-136.
[21] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15 (1), pp. 1929-1958, 2014.
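
The parameter counts quoted in Section II-B (9 × C² for a plain 3 × 3 convolution versus 3/2 × C² for a Fire Module) can be checked with a few lines of arithmetic. The Python sketch below is illustrative only and is not the authors' implementation: the function names are hypothetical, biases are ignored, and the Wide Fire Module line assumes that WFM keeps the Fire Module's C/4 squeeze and C/2 expand widths and only adds groups (16 for the 3 × 3 branch, 2 for the 1 × 1 branch, as stated in Section III-A).

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution with `groups` groups (bias ignored)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    return k * k * (c_in // groups) * c_out


def plain_3x3_params(c):
    """Plain 3x3 convolution mapping C channels to C channels: 9 * C^2."""
    return conv_params(c, c, 3)


def fire_module_params(c, groups_3x3=1, groups_1x1=1):
    """Squeeze C -> C/4, then parallel 3x3 and 1x1 expands, each C/4 -> C/2.

    With both group counts left at 1 this is the Fire Module of Section II-B.
    Passing group counts > 1 is an ASSUMED model of the Wide Fire Module.
    """
    squeeze = conv_params(c, c // 4, 1)
    expand_3x3 = conv_params(c // 4, c // 2, 3, groups_3x3)
    expand_1x1 = conv_params(c // 4, c // 2, 1, groups_1x1)
    return squeeze + expand_3x3 + expand_1x1


if __name__ == "__main__":
    c = 512  # channel width used throughout Table I
    print("plain 3x3 :", plain_3x3_params(c))
    print("fire      :", fire_module_params(c))
    print("wide fire :", fire_module_params(c, groups_3x3=16, groups_1x1=2))
```

For C = 512 the Fire Module count comes out at exactly one sixth of the plain 3 × 3 layer, which matches the 9 : 3/2 ratio in the text. Under the stated assumption, grouping shrinks the expand branches further while the squeeze layer is unchanged.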

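
The three-stage training procedure of Section IV-A can be written down as a small lookup table. The Python sketch below is a hedged illustration rather than the authors' Caffe solver configuration: the `Stage` class and `stage_for` helper are hypothetical names, the warm-up ramp inside stage 1 is not modeled because its length is not given in the text, and the stage-3 batch size of 128 is an assumption (the text only states the learning rate for that stage).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    end_iter: int          # exclusive upper bound, in iterations
    learning_rate: float   # target learning rate for the stage
    batch_size: int
    backend_frozen: bool   # True while the SqueezeNet backend is not updated


# Values taken from Section IV-A unless marked as assumed.
SCHEDULE = (
    Stage("stage1", 50_000, 0.001, 64, True),
    Stage("stage2", 100_000, 0.01, 128, False),
    Stage("stage3", 150_000, 0.001, 128, False),  # batch size ASSUMED
)


def stage_for(iteration):
    """Return the training stage that a 0-based iteration index falls into."""
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    for stage in SCHEDULE:
        if iteration < stage.end_iter:
            return stage
    raise ValueError("iteration is beyond the 150K-iteration schedule")


if __name__ == "__main__":
    for it in (0, 49_999, 50_000, 100_000, 149_999):
        s = stage_for(it)
        print(it, s.name, s.learning_rate, s.batch_size, s.backend_frozen)
```

Keeping the schedule as data makes the stage boundaries easy to audit against the text: the backend is frozen only for the first 50K iterations, and the learning rate peaks at 0.01 in the middle stage.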