
Neural road semantic segmentation in driving scenarios

David-Traian Iancu, Alexandru Sorici, Adina-Magda Florea
University Politehnica of Bucharest, Bucharest, Romania
Email: david_traian.iancu@cti.pub.ro, alexandru.sorici@cs.pub.ro, adina.florea@cs.pub.ro

Abstract— The purpose of this paper is to review the best semantic segmentation algorithms based on deep neural networks and to compare the most widely used ones on a specific task for autonomous driving - road segmentation - with respect to accuracy and inference time. We analyze the state of the art, then propose a new dataset for road semantic segmentation consisting of images from our university campus, annotated by our team and taking into account the time of day. We analyze the results of the road segmentation task obtained with the best segmentation models and report several metrics (accuracy, IoU) with respect to the time of day (day, dusk, night).

Keywords— semantic segmentation, autonomous driving, neural networks, self driving car dataset

I. INTRODUCTION
Autonomous driving has gained a lot of interest in recent years, with both universities and large car corporations developing new algorithms to improve the safety and driving quality of autonomous cars. Autonomous cars promise to improve many aspects of life; the advantages include, among others, financial savings, individual comfort and safety. Studies [1] found that most road accidents are due to human error, so large-scale adoption of autonomous cars should lead to fewer accidents. For example, if every car incorporated a GPS signal, each individual car would know the position of the other vehicles in traffic at any moment, making overtaking easier, and all cars would respect the traffic rules, due to the restrictions of the algorithms, so traffic as a whole would become safer. However, the current algorithms cannot compete even with a beginner human driver, so better algorithms at all levels are required before the safety of an autonomous car can be trusted. Even a simple but crucial task such as object detection has a very low recall when there are many objects in the scene (for example, in a parking lot). An autonomous system is composed of many layers - object detection and segmentation, object tracking, depth prediction, trajectory prediction for surrounding vehicles and people, motion planning, path planning and so on. This article is part of a series of three studies made with our university dataset: we discussed object detection in [2], in this work we focus on the semantic segmentation task, especially road segmentation, and the final study will involve depth prediction. As we said earlier, there are no perfect algorithms for the problems encountered in autonomous driving, and for the semantic segmentation task in particular there is no network that can be trusted completely.

Given these problems, an analysis of the current performance of the best networks is required. In Section 2 we review the current approaches to semantic segmentation, as well as other research papers that evaluate this task; in Section 3 we analyze some classical datasets and then introduce our dataset for the road segmentation benchmark. In Section 4 we describe the metrics used for the experiments and the models evaluated. In Section 5 we analyze the results obtained on our dataset, and finally we draw the conclusions in Section 6.

II. STATE OF THE ART

In this section, we present the most important semantic segmentation networks, as well as other works that review the best architectures, provide a benchmark for semantic segmentation, or compare models on some classical datasets. After presenting the best models, we analyze other similar studies, show how our work compares with them, and explain how we improve the current analysis of semantic segmentation for the specific tasks of autonomous driving and road segmentation.

A. Semantic segmentation architectures

The first approaches to object detection and segmentation involved classical computer vision algorithms such as Histogram of Oriented Gradients (HOG) [3], Scale Invariant Feature Transform (SIFT) [4], Speeded Up Robust Features (SURF) [5] or corner detectors [6]. With the improvement of neural networks, the classical computer vision algorithms have been overtaken by deep neural networks, and the segmentation task is now addressed almost exclusively with deep convolutional neural networks, which have many advantages in image processing.

The first architectures worth mentioning are not directly used for semantic segmentation, but have served as backbones for many networks. Architectures like AlexNet [7], VGG [8], GoogLeNet [9], ResNet [10], ZFNet [11] and SENet [12] competed and won in different years at the ImageNet Challenge (ILSVRC) [13], an image classification contest where the network has to recognize the object class in a photo. They are all convolutional neural networks with several differences regarding the layers used. AlexNet, for example, uses convolutions of different sizes, max pooling layers, dropout, ReLU activation functions and SGD with momentum for training. ZFNet is an improvement of the AlexNet hyperparameters, VGG uses only 3x3 convolutions, GoogLeNet introduces a new module called Inception together with batch normalization and RMSProp, and ResNet uses skip connections and achieves higher performance with 152 layers. A typical segmentation architecture consists of an encoder, which can be one of the previous networks, and a decoder for pixel classification.
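To make the encoder-decoder pattern concrete, the following is a minimal, illustrative PyTorch sketch; the class name and layer sizes are our own choices, not taken from any of the surveyed papers. The encoder shrinks the spatial resolution while extracting features, and the decoder upsamples back to one score map per class, so an argmax over the channels labels every pixel.

import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    """Minimal encoder-decoder for per-pixel classification (illustrative only)."""
    def __init__(self, num_classes=2):
        super().__init__()
        # Encoder: strided convolutions reduce resolution and extract features.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder: transposed convolutions upsample back to the input size.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, num_classes, kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

logits = TinyEncoderDecoder()(torch.randn(1, 3, 360, 640))  # (1, 2, 360, 640)
labels = logits.argmax(dim=1)  # per-pixel class ids, here road vs. background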
There are semantic segmentation architectures that try to perform both the detection and the segmentation task. Mask R-CNN [14] uses the Faster R-CNN [15] algorithm but adds another branch for semantic segmentation (a small convolutional network) during training. Simultaneous Detection and Segmentation (SDS) [16] is based on R-CNN but uses an SVM on top of the network to assign a score to each category. Hypercolumns [17] uses as pixel descriptors the vectors of activations of all the network units above the pixel, achieving segmentation and detection in the same step. PANet [18] performs both object detection and instance segmentation and improves Mask R-CNN by adding an FPN architecture with bottom-up path augmentation in the lower layers. However, this combined approach couples the errors of the two tasks, so it is a better idea to perform the segmentation task independently. We will focus on such networks in the following paragraphs.
Most of the networks do not attempt both tasks, but use special architectures for semantic segmentation only. One of the first networks used for this task is Fully Convolutional Networks for Semantic Segmentation (FCN) [19], which learns a mapping of the pixels by using a convolutional network with upsampling layers, removing the fully connected layers in order to accept various image sizes as input. Many networks have taken this fully convolutional architecture and made improvements in order to achieve better results. SegNet [20] uses a modified VGG16 with 13 layers for the encoder and a corresponding decoder with one layer for each encoding layer. It uses max pooling and subsampling, and the last layer is connected to a softmax classifier which classifies each pixel. DeepLab [21] uses ResNet-101 for feature extraction and has several improved versions - DeepLab v2 [22], DeepLab v3 [23] and DeepLab v3+ [24]. The last one uses a different feature extractor, the Xception network [25], atrous spatial pyramid pooling, batch normalization and a decoder module to refine the results. Dilated Convolution [26] also uses atrous (dilated) convolutions, together with a context module built from 7 layers based on dilated convolutions with different dilation factors.
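A short sketch of the idea behind atrous (dilated) convolutions, in the spirit of [26] rather than a reproduction of it: spreading the kernel taps by a dilation factor enlarges the receptive field without pooling or extra parameters, so stacking such layers aggregates multi-scale context at full resolution.

import torch
import torch.nn as nn

x = torch.randn(1, 64, 90, 160)  # a feature map from some backbone

# A dilated 3x3 convolution: with dilation 4 the nine kernel taps cover a
# 9x9 window, and padding = dilation keeps the spatial size unchanged.
atrous = nn.Conv2d(64, 64, kernel_size=3, padding=4, dilation=4)
print(atrous(x).shape)  # torch.Size([1, 64, 90, 160])

# A context module in the spirit of [26]: stacked dilated convolutions with
# exponentially growing dilation factors; the receptive field grows to 63x63
# while the resolution is preserved.
context = nn.Sequential(*[
    nn.Sequential(nn.Conv2d(64, 64, kernel_size=3, padding=d, dilation=d),
                  nn.ReLU())
    for d in (1, 2, 4, 8, 16)
])
print(context(x).shape)  # torch.Size([1, 64, 90, 160])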
ParseNet [27] constructs a context vector by pooling the feature maps into a single vector with a pooling layer; the final feature map is the concatenation of two vectors - the normalized initial feature map and the normalized, unpooled context vector. U-Net [28] is a variant of FCN with some differences: the network is symmetric, and the skip connections between the downsampling and upsampling paths use the concatenation operator instead of the sum. PSPNet [29] uses a pyramid pooling module with sub-region average pooling, bilinear interpolation for upsampling and concatenation to obtain a global prior. In DenseNet [30], an image classification network, each layer is connected to all the other layers in a feed-forward scheme in order to reuse information from the previous layers; FC-DenseNet [31] took this idea to obtain a new semantic segmentation model. EncNet [32] uses ResNet for feature extraction, then feeds the features into a context encoding module; it also uses a semantic encoding loss which detects the presence of the object classes. ENet [33] is another convolutional architecture, which claims to perform semantic segmentation in real time using five stages of convolutions and a final full convolution. DANet [34] uses a traditional dilated FCN but adds two attention modules which model the semantic interdependencies.

There are also segmentation networks for tasks other than detecting people and vehicles. An example is ECNet [35], which segments images obtained by a side scan sonar; it uses an encoder-decoder architecture, plus a single-stream deep neural network to optimize the edge segmentation task. To summarize, most of the networks have a feature extraction step followed by a convolutional architecture for obtaining the segmentation. Their main particularities were summarized in the paragraphs above, but they cannot be compared only by analyzing the architecture; the networks have to be tested on different datasets, which is what we have done in this work for some of the most used ones. The reason is that there is no clear winner - only different approaches with new features, network optimizations and models. Some architectures are clearly better than others - DeepLab v3+, for example, is a clear improvement over DeepLab - but such a model cannot be directly compared with a network following a different strategy. Depending on the dataset and on the data used for training, one model can perform better than another. In this work, we analyzed some of these models, pretrained on classical datasets, on our new dataset.

There is another category of networks: those that attempt weakly supervised semantic segmentation, motivated by the fact that semantic segmentation annotations are hard to obtain, since they require a lot of manual work. Among the most cited are BoxSup [36], which performs segmentation using bounding boxes, and Simple Does It [37], which applies de-noising during training to obtain the correct labels. However, the results of these models are not as good as those of the previous ones, so we will not insist on them, although they are worth mentioning.

A summary of the models can be found in Table 1.

B. Segmentation reviews and benchmarks

Some articles tackle the same problem as we do. In [38] we can find a good review of the state-of-the-art models, but without any benchmarks or comparisons on a dataset. A review with some metrics reported on classical datasets can be found in [39]. A very good review with experiments, which considers not only neural models but also other techniques, can be found in [40]; however, it does not take into account the time of day or the specific task of road segmentation. [41] is another review which analyzes both the datasets and the neural networks and discusses object classification, semantic and instance segmentation, but without mentioning road segmentation. There are also recent benchmarks; for example, [42] proposes a new benchmark for object detection, but without taking the light factor into account. In this article we present a new benchmark for road segmentation made in our university campus and compare the results of reference implementations of some of the most popular architectures. We discuss several metrics on our dataset, including the light factor. We insist on the time of day because most existing experiments use images taken in daylight, but, in general, the same photos taken at night will lead to worse results, so a good study should take the light into account.

Our previous study on object detection, as well as the future study on depth prediction, also take the light into account.
TABLE 1. SEMANTIC SEGMENTATION MODELS.

Object Detection and       Fully supervised           Weakly supervised
Segmentation Networks      Segmentation Networks      Networks
-------------------------  -------------------------  ------------------
Mask R-CNN                 FCN                        BoxSup
Simultaneous Detection     SegNet                     Simple Does It
  and Segmentation         DeepLab
Hypercolumns               Dilated Convolution
PANet                      ParseNet
                           U-Net
                           PSPNet
                           FC-DenseNet
                           EncNet
                           ENet
                           DANet
III. DATASETS FOR SEMANTIC SEGMENTATION

Currently there are some standard datasets annotated for the semantic segmentation task. However, due to the difficulty of manually annotating images for this task, there are only a few such datasets, and they do not contain many images, but they offer good quantitative results for evaluation. In this section we describe the most used datasets for semantic segmentation, then briefly present our own dataset made for the specific task of road segmentation.
A. Classical datasets

The most common datasets used for semantic segmentation for autonomous driving are Cityscapes [43] and KITTI [44], but there are also some other datasets, like CamVid [45], DUS [46] and Vistas [47]. There are also datasets that are not specially designed for autonomous driving tests, such as Pascal VOC [48], COCO [49] and SYNTHIA [50]. A more comprehensive list of datasets can be found in [41]. Cityscapes contains 5000 fully segmented images plus 20000 coarsely segmented images, with 30 classes. The KITTI dataset, although not released with segmented images, was later annotated for segmentation tasks by other people; it does, however, include a small dataset for road segmentation and lane detection (under 300 annotated frames). CamVid consists of 700 images with 32 classes, DUS consists of 500 images annotated with 5 classes (ground/road, building, sky, vehicle, pedestrian), and Vistas is the biggest dataset in this category - over 25000 images with 66 classes, with instance segmentation for 37 of them. Pascal VOC contains almost 1500 segmented images, but only for objects (not for the road), so it is more useful for instance segmentation than for autonomous driving. COCO contains many annotated images but is also focused on object detection and instance segmentation, without the road as a segmented class. SYNTHIA features over 13,000 images with 11 classes (including the road) and is worth mentioning because it offers some diversity regarding the scenes, the seasons and the weather.
B. Our dataset

Our dataset consists of 138 short movies recorded in our university campus. Each movie contains only a few frames, and in total about 20,000 frames were manually annotated for the road segmentation task using an online tool, CVAT [51]. The road was segmented using polygons. We segmented only the road for this specific task, but we plan to add more classes in the future, at least for cars and pedestrians. For the experiments presented in this work, we used a sample from this dataset: we kept only one frame out of every 20, in order to have a smaller but representative subset, diverse enough to preserve the general scores. We chose to keep a smaller number of images (about 1000 frames) in order to have a faster evaluation of the networks. We also divided the dataset into three classes - day, dusk and night - according to the daylight, in order to observe how the results vary depending on the outside conditions. We have 735 images for day, 133 for dusk and 165 for night. A minimal sketch of this subsampling procedure is shown below.
detection (under 300 frames annotated). CamVid consists of
700 images with 32 classes, DUS consists of 5 classes
(ground/ road, building, sky, vehicle, pedestrian) annotated in The mIoU represents the mean of all the IoUs from all the
500 images and Vistas is the biggest dataset from this category images. We computed the mean TP, FP, Accuracy and IoU for
- over 25000 images with 66 classes and contains instance all the categories - day, dusk and night, and also for all the
segmentation for 37 of their classes. Pascal VOC contains dataset. The TP and FP are shown as percentages, but they
almost 1500 segmented images only for objects (not for the were used in their absolute values when computing the
road) more useful for instance segmentation, not for Accuracy and IoU. The percentage of FN and TN can be
autonomous driving. Coco contains many annotated images computed as 1-TP and 1-FP.
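As a minimal sketch (with NumPy; the function names are ours), the per-pixel confusion counts and Eq. (1) can be computed directly from a binary prediction and the black-and-white ground truth mask:

import numpy as np

def confusion_counts(pred, gt):
    """Per-pixel confusion counts for boolean masks (True = road).

    `gt` comes from the black-and-white annotation: white pixels (road)
    map to True, black pixels (environment) map to False."""
    tp = np.sum(pred & gt)     # detected as road, and actually road
    fp = np.sum(pred & ~gt)    # detected as road, but not road
    fn = np.sum(~pred & gt)    # road pixels that were missed
    tn = np.sum(~pred & ~gt)   # environment correctly left out
    return tp, fp, fn, tn

def accuracy(pred, gt):
    """Eq. (1): fraction of all pixels that are classified correctly."""
    tp, fp, fn, tn = confusion_counts(pred, gt)
    return (tp + tn) / (tp + tn + fp + fn)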
However, the most important metric for semantic segmentation is the mIoU - the mean Intersection over Union. The IoU is the number of pixels in the intersection of the detected road and the actual road, divided by the number of pixels in the union of the two. It can be formally defined as

IoU = TP / (TP + FN + FP)    (2)

The mIoU is the mean of the IoUs over all the images. We computed the mean TP, FP, Accuracy and IoU for each of the categories - day, dusk and night - and also for the whole dataset. The TP and FP are shown as percentages, but their absolute values were used when computing the Accuracy and IoU. The percentages of FN and TN can be computed as 1-TP and 1-FP.
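Continuing the sketch above, Eq. (2), the selection of the road class that maximizes the IoU, and the mIoU aggregation could look as follows (again illustrative, not the paper's exact scripts):

def iou(pred, gt):
    """Eq. (2): intersection of detected and actual road over their union."""
    tp, fp, fn, _ = confusion_counts(pred, gt)
    return tp / (tp + fn + fp)

def road_iou(label_map, gt):
    """The networks output many classes; keep the label whose binary mask
    maximizes the IoU with the road annotation, as described above."""
    return max(iou(label_map == c, gt) for c in np.unique(label_map))

def mean_iou(label_maps, gts):
    """mIoU: mean of the per-image IoUs over a (sub)dataset."""
    return float(np.mean([road_iou(lm, gt) for lm, gt in zip(label_maps, gts)]))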
We also measured the inference time for some of the networks tested, in order to see which of them are suitable for a live driving scenario. The experiments were run on an NVIDIA DGX-1 workstation which contains 4 Tesla V100 GPUs.
A. Reference models

For our experiments we chose to evaluate some of the best semantic segmentation networks with public implementations on GitHub. At [52] we can find multiple architectures for this task, along with pretrained models; from this repository we tested Fully Convolutional Networks (FCN), PSPNet and ENet. We also evaluated SegNet using the implementation found at [53], Dilated Convolutions using the implementation found at [54],
and a more recent network, DANet, using its official implementation [55]. The scripts, the datasets and the results can be found at [56].
V. RESULTS

We will now analyze the results of the road segmentation task. Table 2 presents, for each architecture we tested, the mean TP, mean FP, mean Accuracy and mean IoU for each of the three times of day and light conditions - day, dusk and night. The last column presents the average score over the whole dataset.

TABLE 2. SEGMENTATION RESULTS.

                   Day     Dusk    Night   Avg
FCN          TP    0.782   0.742   0.572   0.743
             FP    0.057   0.051   0.070   0.058
             Acc   0.897   0.893   0.835   0.887
             IoU   0.673   0.647   0.476   0.638
ENet         TP    0.781   0.725   0.711   0.763
             FP    0.218   0.448   0.320   0.264
             Acc   0.780   0.597   0.686   0.741
             IoU   0.527   0.352   0.393   0.483
PSPNet       TP    0.940   0.947   0.836   0.924
             FP    0.048   0.058   0.089   0.056
             Acc   0.948   0.943   0.889   0.938
             IoU   0.831   0.820   0.676   0.805
Dilation     TP    0.508   0.431   0.958   0.571
             FP    0.571   0.176   0.941   0.350
             Acc   0.687   0.718   0.299   0.629
             IoU   0.338   0.313   0.267   0.324
SegNet       TP    0.955   0.903   0.454   0.868
             FP    0.157   0.233   0.028   0.146
             Acc   0.872   0.803   0.829   0.856
             IoU   0.673   0.558   0.414   0.617
DANet (512)  TP    0.651   0.759   0.673   0.668
             FP    0.442   0.604   0.312   0.442
             Acc   0.584   0.497   0.681   0.589
             IoU   0.307   0.294   0.367   0.315
DANet (768)  TP    0.770   0.825   0.713   0.768
             FP    0.574   0.744   0.325   0.556
             Acc   0.518   0.413   0.684   0.532
             IoU   0.303   0.274   0.382   0.312
The first network analyzed is FCN. We can see that the scores for TP, FP, Acc and IoU are similar for day and dusk, but there is a decrease in TP, Acc and IoU at night, and the number of false positives also increases on the night subset. The percentage of FP is computed by dividing their number by all the negatives (pixels that are annotated as not road), so even though the increase is only 2%, the decrease in IoU is almost 20% from day to night. However, the night subset is smaller than the day subset, so it influences the final average by only about 4%.
The next network analyzed is ENet. The results are worse than FCN: even though the number of true positives is almost the same as for FCN, and the TP is even better at night, the number of FP is much bigger than for FCN. This means that even when the road is correctly detected, the network believes the road is bigger than it is in reality. It could also happen that the network did not correctly guess the label for the road: when computing the IoU we consider the class that maximizes the IoU, but if the road was not detected this could be any other class - for example the sky, with the road included in the sky region. The average IoU drops for dusk and night, and the overall result is not very good - 48.3%.
The next network shown in the table is PSPNet. From the beginning we can see that the results for this network are the best obtained in our experiments. The number of TPs is close to the real number of correct pixels - 94% were correctly identified in the day, and only somewhat fewer at night, 83%. Also, the number of false positives is low. This combination results in very good accuracy scores - almost 95% in the day and almost 89% at night, with a mean of 93%. The IoU is also good, although not as good as the accuracy - 80.5% on average, but only 67.6% at night. Still, the results are much better than those of the other networks tested, even at night.
The following architecture tested, Dilation, has poor results, even though its published results on other datasets were better. We used the official implementation, so we can exclude errors due to the implementation. The number of true positives is 50% for the day, 43% for dusk and an impressive 95% for the night. However, when we analyze the false positives, we can observe that for the day and dusk the numbers are decent (24% and 17%), but for the night almost all of the negatives are counted as false positives. When we manually looked at the outputs to see what happened, we saw that most of the image was labeled as a single class (probably the sky), with the road also labeled as sky/background, so the number of true positives was big, but the number of false positives was very big as well. This results in very poor scores for accuracy and IoU: the average accuracy over the whole dataset is 62.9% and the average IoU is only 32.4%.
The following network, SegNet, has decent results for day and dusk, but not so good results for the night. For the day the number of true positives is over 95% and for dusk over 90%, but at night there is a huge drop, to only 45%. Looking at the false positives, we can see a huge drop in false positives at night as well. The network correctly detected only a part of the road, so the number of true positives is not so good, but the network also did not wrongly classify other sections of the image as road. The result of this partially correct identification is a good accuracy, because the number of true negatives is big, but a small IoU. The average accuracy over the whole dataset is over 85%, but the average IoU is only 61%, a little worse than FCN.

The last two experiments involve the same network, DANet. The results of this network depend on two parameters, base size and crop size - the width of the analyzed image and the size used for cropping (applied for data augmentation). The recommended base size is 2048, because the authors worked with large images, but this resulted in very high GPU memory consumption, so we tested with base sizes 512 and 768. For 1024, for example, our system could not handle the memory used, even though we tested on 32 GB GPUs. However, even though we expected better results for a bigger base size, the mean IoU remained virtually unchanged at about 31%, the accuracy had a small drop from 58.9% to 53.2%, and the number of true positives increased for all the categories (day, dusk, night), but so did the number of false positives. The number of false positives is big for all the categories, with a maximum of almost 75% for dusk at base size 768. Unfortunately, when we looked at the outputs, the segmentation was quasi-random, with poor results, and in general the road was actually included in another class - the explanation for the big number of false positives.
We also made an experiment regarding the time needed for different image sizes. We measured the inference time for several sizes to see how the number of pixels affects the prediction time, and also to see whether the networks can be used in real-time applications. We used one of the fastest systems on the market, so the times are relevant for judging whether the architectures are reliable for a real-life driving system. We used pictures with the following resolutions: 640x360, 640x480, 1024x768, 1280x720, 1366x768, 1920x1080 and 3140x2160. The results are shown in Figure 1. We measured the time for each of the architectures discussed earlier, but for DANet we considered only the version with base size 768, because the time was almost the same for base size 512. A sketch of such a measurement loop is shown below.
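A hedged sketch of how such per-resolution timings can be collected on a GPU; the `model` handle and the repetition count are our own choices, and the synchronization calls keep asynchronous CUDA execution from distorting the measurements.

import time
import torch

# (height, width) pairs matching the resolutions listed above
RESOLUTIONS = [(360, 640), (480, 640), (768, 1024), (720, 1280),
               (768, 1366), (1080, 1920), (2160, 3140)]

@torch.no_grad()
def time_inference(model, device="cuda", repeats=10):
    model = model.to(device).eval()
    for h, w in RESOLUTIONS:
        x = torch.randn(1, 3, h, w, device=device)
        model(x)                          # warm-up (allocations, caching)
        torch.cuda.synchronize()          # flush queued GPU work
        start = time.perf_counter()
        for _ in range(repeats):
            model(x)
        torch.cuda.synchronize()          # wait until all runs finish
        print(f"{w}x{h}: {(time.perf_counter() - start) / repeats:.3f} s/image")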
correctly detected only a part of the road, so the number of true
positives was not so good, but also the network didn't wrongly VI. CONCLUSION AND FUTURE WORK
classified other sections of the image as road. The result for In this article, we presented the most used deep neural
this partially correct identification is that we have a good networks for semantic segmentation, especially those that use
accuracy, because the number of true negatives is big, but a a fully convolutional architecture and are specialized only for
small IoU. The average accuracy for all the dataset is over this task, not for both detection and segmentation. We also
85%, but the average IoU is only 61%, a little worse than FCN. presented the most used datasets for semantic segmentation,
The last two experiments involve the same network, including but not limited to the most used datasets for
DaNet. The results of the network depend on two parameters, autonomous driving. We discussed the road detection task for
base size and crop size - the width for the analyzed image and both the architectures and the dataset. We presented our
the size for cropping (applied for data augmentation). Their dataset for road semantic segmentation made at the
recommended base size is 2048 because they worked with Politehnica University of Bucharest and how we used it for the
large images, but this resulted in a very big GPU memory experiments. We defined the experiments made, the metrics
VI. CONCLUSION AND FUTURE WORK

In this article, we presented the most used deep neural networks for semantic segmentation, especially those that use a fully convolutional architecture and are specialized only for this task, not for both detection and segmentation. We also presented the most used datasets for semantic segmentation, including but not limited to the most used datasets for autonomous driving. We discussed the road detection task for both the architectures and the datasets. We presented our dataset for road semantic segmentation made at the Politehnica University of Bucharest and how we used it for the experiments. We defined the experiments, the metrics and the networks that we tested - FCN, ENet, SegNet, Dilated Convolutions and DANet. We presented the scores for all the networks with respect to the time of day - day, dusk or night - and the inference times with respect to the image size. We can see from the experiments that the segmentation results are close to reality, but there is still room for improvement if we want better autonomous driving systems. The experiments showed that the scores are worse at night, so there should be more tests regarding segmentation on pictures taken at night. The best networks tested were PSPNet and FCN, which came close to PSPNet, and the worst was DANet. Regarding the time, we observed that there is a trade-off between quality and inference time. The best networks have big inference times and cannot be used in real-time scenarios, while the networks with poor results have inference times that make us optimistic that they could be used in real-time scenarios, especially if the resolution of the photos is not very big. However, there is still room for improvement regarding the inference time if we want to segment more than a few frames per second. In the future, we will extend our dataset with more classes for a better understanding of the environment, and we will also try to offer the community some pretrained models that include our dataset. We will also extend the dataset for our future study, which involves depth prediction.
ACKNOWLEDGMENT

The authors would like to thank the team from the AI laboratory of the Politehnica University of Bucharest, which recorded the dataset in the university campus and annotated the images for the road segmentation task. We would also like to thank the Politehnica University for the hardware used for the experiments.
REFERENCES

[1] https://www.visualexpert.com/Resources/roadaccidents.html
[2] David Iancu, Alexandru Sorici, Adina Florea. Object detection.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. CVPR. 2005.
[4] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110. 2004.
[5] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. European Conference on Computer Vision. 2006.
[6] Sheng He, Jiedong Chen and Yongjin Hou. Image Segmentation Method Based on SURF Algorithm and Harris Corner Points Detection. International Journal of Simulation - Systems, Science & Technology. 2016.
[7] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575. 2014.
[8] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR. 2015.
[9] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842. 2014.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR. 2016.
[11] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. CoRR, abs/1311.2901, 2013. Published in Proc. ECCV. 2014.
[12] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507. 2017.
[13] http://www.image-net.org/challenges/LSVRC/
[14] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-CNN. arXiv:1703.06870. 2017.
[15] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS. 2015.
[16] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. ECCV. 2014.
[17] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. CVPR. 2015.
[18] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. arXiv preprint arXiv:1803.01534. 2018.
[19] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. CVPR. 2015.
[20] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv:1511.00561. 2015.
[21] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. ICLR. 2015.
[22] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915. 2016.
[23] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. 2017.
[24] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. ECCV. 2018.
[25] F. Chollet. Xception: Deep learning with depthwise separable convolutions. CVPR. 2017.
[26] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. ICLR. 2016.
[27] W. Liu, A. Rabinovich, and A. C. Berg. ParseNet: Looking wider to see better. ICLR. 2016.
[28] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2015, pp. 234-241. Springer. 2015.
[29] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. arXiv:1612.01105. 2016.
[30] G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected convolutional networks. IEEE Conference on Computer Vision and Pattern Recognition. 2017.
[31] Simon Jegou, Michal Drozdzal, David Vazquez, Adriana Romero, and Yoshua Bengio. The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. Workshop on Computer Vision in Vehicle Technology, CVPRW. 2017.
[32] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal. Context encoding for semantic segmentation. CVPR. 2018.
[33] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello. ENet: A deep neural network architecture for real-time semantic segmentation. arXiv:1606.02147. 2016.
[34] Jun Fu, Jing Liu, Haijie Tian, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. arXiv preprint arXiv:1809.02983. 2018.
[35] Meihan Wu, Qi Wang, Eric Rigall, Kaige Li, Wenbo Zhu, Bo He, and Tianhong Yan. ECNet: Efficient convolutional networks for side scan sonar image segmentation. Sensors. 2019.
[36] J. Dai, K. He, and J. Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. ICCV. 2015.
[37] A. Khoreva, R. Benenson, J. Hosang, M. Hein, and B. Schiele. Simple does it: Weakly supervised instance and semantic segmentation. CVPR. 2017.
[38] Y. Guo, Y. Liu, T. Georgiou, and M. S. Lew. A review of semantic segmentation using deep neural networks. International Journal of Multimedia Information Retrieval. 2018.
[39] https://medium.com/@arthurouaknine/review-of-deep-learning-algorithms-for-image-semantic-segmentation-509a600f7b57
[40] H. Yu et al. Methods and datasets on semantic segmentation: A review. Neurocomputing, vol. 304, pp. 82-103. 2018.
[41] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, and J. Garcia-Rodriguez. A review on deep learning techniques applied to semantic segmentation. arXiv preprint arXiv:1704.06857. 2017.
[42] H. Li, J. Cai, T. N. A. Nguyen, and J. Zheng. A benchmark for semantic image segmentation. Proc. IEEE Int. Conf. Multimedia Expo. 2013.
[43] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. CVPR. 2016.
[44] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research, 32(11). 2013.
[45] Gabriel J. Brostow, Julien Fauqueur, and Roberto Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30(2). 2009.
[46] T. Scharwachter, M. Enzweiler, U. Franke, and S. Roth. Efficient multi-cue scene segmentation. GCPR. 2013.
[47] G. Neuhold, T. Ollmann, S. R. Bulo, and P. Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. International Conference on Computer Vision. 2017.
[48] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal visual object classes challenge: A retrospective. International Journal of Computer Vision. 2015.
[49] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. European Conference on Computer Vision, Springer. 2014.
[50] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.
[51] https://github.com/opencv/cvat
[52] https://github.com/hellochick/semantic-segmentation-tensorflow
[53] https://github.com/toimcio/SegNet-tensorflow
[54] https://github.com/fyu/dilation
[55] https://github.com/junfu1115/DANet
[56] https://github.com/funkydvd/semantic-segm-bench
