Академический Документы
Профессиональный Документы
Культура Документы
problematic situations for segmentation algorithms are which work on grayscale or colored single pixel images
examined. A taxonomy of segmentation algorithms is (see Section II-C) in a completely automated, passive
given.
fashion (see Section II-D).
I. I NTRODUCTION
Semantic segmentation is the task of clustering A. Allowed classes
parts of images together which belong to the same
object class. This type of algorithm has several use- Semantic segmentation is a classification task. As
cases such as detecting road signs [MBLAGJ+ 07], such, the classes on which the algorithm is trained is a
detecting tumors [MBVLG02], detecting medical in- central design decision.
struments in operations [WAH97], colon crypts segmen- Most algorithms work with a fixed set of classes;
tation [CRSS14], land use and land cover classifica- some even only work on binary classes like fore-
tion [HDT02]. In contrast, non-semantic segmentation ground vs background [RM07], [CS10] or street vs
only clusters pixels together based on general character- no street [BKTT15].
istics of single objects. Hence the task of non-semantic However, there are also unsupervised segmentation
segmentation is not well-defined, as many different algorithms which do not distinguish classes at all (see
segmentations might be acceptable. Section V-B) as well as segmentation algorithms which
Several applications of segmentation in medicine are are able to recognize when they don’t know a class.
listed in [PXP00]. For example, in [GRC+ 08] a void class was added
Object detection, in comparison to semantic seg- for classes which were not in the training set. Such
mentation, has to distinguish different instances of the a void class was also used in the MSRCv2 dataset
same object. While having a semantic segmentation (see Section III-B2) to make it possible to make more
is certainly a big advantage when trying to get object coarse segmentations and thus having to spend less
instances, there are a couple of problems: neighboring time annotating the image.
pixels of the same class might belong to different object
instances and regions which are not connected my
belong to the same object instance. For example, a B. Class affiliation of pixels
tree in front of a car which visually divides the car into
two parts. Humans do an incredible job when looking at the
This paper is organized as follows: It begins by giving world. For example, when we see a glass of water
a taxonomy of segmentation algorithms in Section II. standing on a table we can automatically say that there
A summary of quality measures and datasets which are is the glass and behind it the table, even if we only had a
used for semantic segmentation follows in Section III. single image and were not allowed to move. This means
A summary of traditional segmentation algorithms and we simultaneously two labels to the coordinates of the
their characteristics follows in Section V, as well as a glass: Glass and table. Although there is much more
brief, non-exhaustive summary of recently published work being done on single class affiliation segmenta-
semantic segmentation algorithms which are based on tion algorithms, there is a publication about multiple
neural networks in Section VI. Finally, Section VII class affiliation segmentation [LRAL08]. Similarly,
informs the reader about typical problematic cases for recent publications in pixel-level object segmentation
segmentation algorithms. used layered models [YHRF12].
2
C. Input Data
where the received image cannot be influenced. Among accuracy has two major drawbacks:
the passive algorithms, some segment in a completely P1 Tasks like segmenting images for autonomous cars
automatic fashion, others work in an interactive mode. have large regions which have one class. This
One example would be a system where the user clicks makes achieving classification accuracies of more
on the background or marks a coarse segmentation and than 30 % with a priori knowledge only possible.
the algorithm finds a fine-grained segmentation. [BJ00], For example, a system might learn that a certain
[RKB04], [PS07] describe systems which work in an position of the image is most of the time “sky”
interactive mode. while another position is most of the time “road”.
3
P2 The manually labeled images could have a more segmentation, every image needs to be processed within
coarse labeling. For example, a human classifier 20 ms [BKTT15]. This time is called latency.
could have labeled a region as “car” and the Most papers do not give exact values for the time
algorithm could have split that region into the their application needs. One reason might be that this is
general “car” and the more specific “wheel of a very hardware, implementation and in some cases even
car” data specific. For example, [HJBJ+ 96] notes that their
Three accuracy metrics which do not suffer from algorithm needs 10 s on a Sun SparcStation 20. The
problem P1 are used in [LSD14]: fastest CPU ever produced for this system had 200 MHz.
• mean accuracy: k ·
1
Pk nii Comparing this directly with results which were ob-
i=1 ti ∈ [0, 1]
• mean intersection over union:
tained using an Intel i7-4820K with 3.9 GHz would not
1
Pk nP be meaningful.
k · i=1 ti −nii + k nji ∈ [0, 1]
ii
j=1 However, it does still make sense to mention the
• frequency weighted intersection over union:
Pk −1 Pk nP
execution time as well as the hardware in individual
( i=1 ti ) i=1 ti · t −n + k n ∈ [0, 1]
ii
i ii j=1 ji
papers. This gives the interested reader the possibility to
Another problem might be pixels which cannot be estimate how difficult it might be to adjust the algorithm
assigned to one of the known classes. For this reason, to work in the required time-constraints.
[SWRC06] makes use of a void class. This class gets Besides the latency, the throughput is another
completely ignored for all quality measures. Hence the relevant characteristic of algorithms and implementa-
total number of pixels is assumed to be width · height − tions for semantic segmentation. For example, for the
number of void pixels. automatic description of images in order to enable text
One way to deal with problem P1 and problem P2 search the throughput is of much higher importance
is giving the confusion matrix as done in [SWRC06]. than latency.
However, this approach is not feasible if many classes 3) Stability: A reasonable requirement on semantic
are given. segmentation algorithms is the stability of a segmen-
The F -measure is useful for binary classifica- tation over slight changes in the input image. When
tion task such as the KITTI road segmentation the image data is sightly blurred by smoke such as
benchmark [FKG13] or crypt segmentation as done in Figure 4(c), the segmentation should not change.
by [CRSS14]. It is calculated as “the harmonic mean Also, two images which show a slight change in
of the precision and recall” [PH05]: perspective should also only result in slight changes in
tp the segmentation [PH05].
Fβ = (1 + β)2 4) Memory usage: Peak memory usage matters
(1 + β 2 ) · tp + β 2 · fn + fp
when segmentation algorithms are used in devices like
where β = 1 is chosen in most cases and tp means smartphones or cameras, or when the algorithms have
true positive, fn means false negative and fp means to finish in a given time frame, run on the graphics
false positive. processing unit (GPU) and consume so much memory
Finally, it should be noted that a lot of other measures for single image segmentation that only the latest
for the accuracy of segmentations were proposed for graphic cards can be used. However, no publication
non-semantic segmentation. One of those accuracy were available mentioning the peak memory usage.
measures is Normalized Probabilistic Rand (NPR)
index which was introduced in [UPH05] and eval-
B. Datasets
uated in [CSI+ 09] on dermoscopy images. Other
non-semantic segmentation measures were introduced The computer vision community produced a couple
in [MFTM01], but the reason for creating them seems to of different datasets which are publicly available. In
be to deal with the under-defined task description of non- the following, only the most widely used ones as well
semantic segmentation. These accuracy measures try to as three medical databases are described. An overview
deal with different levels of coarsity of the segmentation. over the quantity and the kind of data is given by
This is much less of a problem in semantic segmentation Table I.
and thus those measures are not explained here. 1) PASCAL VOC: The PASCAL1 VOC2 challenge
2) Speed: A maximum upper bound on the execution was organized eight times with different datasets:
time for the inference on a single image is a hard Once every year from 2005 to 2012 [EVGW+ b].
requirement for some applications. For example, in the 1 pattern analysis, statistical modelling and computational learning,
case of autonomous cars an algorithm which classifies an EU network of excellence
pixel as street or no-street and thus makes a semantic 2 Visual Object Classes
4
V. T RADITIONAL A PPROACHES the directions is calculated for each patch. HOG features
Image segmentation algorithms which use traditional were proposed in [DT05] and are used in [BMBM10],
approaches, hence don’t apply neural networks and [FGMR10] for segmentation tasks.
make heavy use of domain knowledge, are wide-spread 3) SIFT: Scale-invariant feature transform (SIFT)
in the computer vision community. Features which can feature descriptors describe keypoints in an image. The
be used for segmentation are described in Section V-A, image patch of the size 16 × 16 around the keypoint
a very brief overview of unsupervised, non-semantic is taken. This patch is divided in 16 distinct parts of
segmentation is given in Section V-B, Random Decision the size 4 × 4. For each of those parts a histogram of
Forests are described in Section V-C, Markov Random 8 orientations is calculated similar as for HOG features.
Fields in Section V-E and Support Vector Machines This results in a 128-dimensional feature vector for
(SVMs) in Section V-D. Postprocessing is covered in each keypoint.
Section V-G. It should be emphasized that SIFT is a global feature
It should be noted that algorithms can use combina- for a complete image.
tion of methods. For example, [TNL14] makes use of a SIFT is described in detail in [Low04] and are used
combination of a SVM and a MRF. Also, auto-encoders in [PTN09].
can be used to learn features which in turn can be used 4) BOV: Bag-of-visual-words (BOV), also called
by any classifier. bag of keypoints, is based on vector quantization.
Similar to HOG features, BOV features are histograms
which count the number of occurrences of certain
A. Features and Preprocessing methods patterns within a patch of the image. BOV are described
The choice of features is very important in traditional in [CDF+ 04] and used in combination with SIFT
approaches. The most commonly used local and global feature descriptors in [CP08].
features are explained in the following as well as feature 5) Poselets: Poselets rely on manually added extra
dimensionality reduction algorithms. keypoints such as “right shoulder”, “left shoulder”,
1) Pixel Color: Pixel color in different image spaces “right knee” and “left knee”. They were originally
(e.g. 3 features for RGB, 3 features for HSV, 1 feature used for human pose estimation. Finding those extra
for the gray-value) are the most widely used features. A keypoints is easily possible for well-known image
typical image is in the RGB color space, but depending classes like humans. However, it is difficult for classes
on the classifier and the problem another color space like airplanes, ships, organs or cells where the human
might result in better segmentations. RGB, YcBcr, HSL, annotators do not know the keypoints. Additionally, the
Lab and YIQ are some examples used by [CRSS14]. keypoints have to be chosen for every single class. There
No single color space has been proven to be superior are strategies to deal with those problems like viewpoint-
to all others in all contexts [CJSW01]. However, the dependent keypoints. Poselets were used in [BMBM10]
most common choices seem to be RGB and HSI. to detect people and in [BBMM11] for general object
Reasons for choosing RGB is simplicity and the support detection of the PASCAL VOC dataset.
by programming languages, whereas the choice of 6) Textons: A texton is the minimal building block
the HSI color space might make it simpler for the of vision. The computer vision literature does not give a
classifier to become invariant to illumination. One strict definition for textons, but edge detectors could be
reason for choosing CIE-L*a*b* color space is that it one example. One might argue that deep learning tech-
approximates human perception of brightness [KP92]. niques with Convolution Neuronal Networks (CNNs)
It follows that choosing the L*a*b color space helps learn textons in the first filters.
algorithms to detect structures which are seen by An excellent explanation of textons can be found
humans. Another way of improving the structure within in [ZGWX05].
an image is histogram equalization, which can be 7) Dimensionality Reduction: High-resolution im-
applied to improve contrast [PAA+ 87], [RM07]. ages have a lot of pixels. Having one or more feature per
2) Histogram of oriented Gradients: Histogram of pixel results in well over a million features. This makes
oriented gradients (HOG) features interpret the image training difficult while the higher resolution might not
as a discrete function I : N2 → { 0, . . . , 255 } which contain much more information. A simple approach
maps the position (x, y) to a color. For each pixel, there to deal with this is downsampling the high-resolution
are two gradients: The partial derivative of x and y. image to a low-resolution variant. Another way of
Now the original image is transformed to two feature doing dimensionality reduction is principal component
maps of equal size which represents the gradient. These analysis (PCA), which is applied by [COWR11]. The
feature maps are splitted into patches and a histogram of idea behind PCA is to find a hyperplane on which all
6
feature vectors can be projected with a minimal loss The 4-neighborhood (north, east, south west) or an 8-
of information. A detailed description of PCA is given neighborhood (north, north-east, east, south-east, south,
by [Smi02]. south-west, west, north-west) are plausible choices.
One problem of PCA is the fact that it does not One way to cut the edges is by building a minimum
distinguish different classes. This means it can happen spanning tree and removing edges above a threshold.
that a perfectly linearly separable set of feature vectors This threshold can either be constant, adapted to the
becomes not separable at all after applying PCA. graph or adjusted by the user. After the edge-cutting
There are many other techniques for dimensionality step, the connected components are the segments.
reduction. An overview and a comparison over some A graph-based method which ranked 2nd in the
of them is given by [vdMPvdH09]. Pascal VOC 2010 challenge [EVGW+ 10] is described
in [CS10]. The system makes heavy use of the multi-
cue contour detector globalPb [MAFM08] and needs
B. Unsupervised Segmentation about 10 GB of main memory [CS11].
Unsupervised segmentation algorithms can be used 3) Random Walks: Random walks belong to the
in supervised segmentation as another source of infor- graph-based image segmentation algorithms. Random
mation or to refine a segmentation. While unsupervised walk image segmentation usually works as follows:
segmentation algorithms can never be semantic, they are Seed points are placed on the image for the different
well-studied and deserve at least a very brief overview. objects in the image. From every single pixel, the
Semantic segmentation algorithms store information probability to reach the different seed points by a
about the classes they were trained to segment while random walk is calculated. This is done by taking
non-semantic segmentation algorithms try to detect image gradients as described in Section V-A for HOG
consistent regions or region boundaries. features. The class of the pixel is the class of which a
1) Clustering Algorithms: Clustering algorithms can seed point will be reached with highest probability. At
directly be applied on the pixels, when one gives a first, this is an interactive segmentation method, but it
feature vector per pixel. Two clustering algorithms are can be extended to be non-interactive by using another
k-means and the mean-shift algorithm. segmentation methods output as seed points.
The k-means algorithm is a general-purpose cluster- 4) Active Contour Models: Active contour models
ing algorithm which requires the number of clusters to (ACMs) are algorithms which segment images roughly
be given beforehand. Initially, it places the k centroids along edges, but also try to find a border which is
randomly in the feature space. Then it assigns each smooth. This is done by defining a so called energy
data point to the nearest centroid, moves the centroid function which will be minimized. They were initially
to the center of the cluster and continues the process described in [KWT88]. ACMs can be used to segment
until a stopping criterion is reached. A faster variant is an image or to refine segmentation as it was done
described in [Har75]. in [AM98] for brain MR images.
k-means was applied by [CLP98] for medical image 5) Watershed Segmentation: The watershed algo-
segmentation. rithm takes a grayscale image and interprets it as a
Another clustering algorithm is the mean-shift algo- height map. Low values are catchment basins and
rithm which was introduced by [CM02] for segmen- the higher values between two neighboring catchment
tation tasks. The algorithm finds the cluster centers basins is the watershed. The catchment basins should
by initializing centroids at random seed points and contain what the developer wants to capture. This
iteratively shifting them to the mean coordinate within implies that those areas must be dark on grayscale
a certain range. Instead of taking a hard range constraint, images. The algorithm starts to fill the basins from
the mean can also be calculated by using any kernel. the lowest point. When two basins are connected, a
This effectively applies a weight to the coordinates watershed is found. The algorithm stops when the
of the points. The mean shift algorithm finds cluster highest point is reached.
centers at positions with a highest local density of A detailed description of the watershed segmentation
points. algorithm is given in [RM00].
2) Graph Based Image Segmentation: Graph-based The watershed segmentation was used in [JLD03] to
image segmentation algorithms typically interpret pixels segment white blood cells. As the authors describe,
as vertices and an edge weight is a measure of the segmentation by watershed transform has two
dissimilarity such as the difference in color [FH04], flaws: Over-segmentation due to local minima and thick
[Fel]. There are several different candidates for edges. watersheds due to plateaus.
7
Detailed introductions to MRFs are given by VI. N EURAL N ETWORKS FOR S EMANTIC
[BKR11], [Mur12]. MRFs are used by [ZBS01] and S EGMENTATION
[MSB12] for image segmentation. Artificial neural networks are classifiers which are
inspired by biologic neurons. Every single artificial
F. Conditional Random Fields neuron has some inputs which are weighted and sumed
CRFs are MRFs where all clique potentials are up. Then, the neuron applies a so called activation
conditioned on input features [Mur12]. This means, function to the weighted sum and gives an output. Those
instead of learning the distribution P (y, x), the task neurons can take either a feature vector as input or the
is reformulated to learn the distribution P (y|x). One output of other neurons. In this way, they build up
consequence of this reformulation is that CRFs need feature hierarchies.
much less parameters as the distribution of x does The parameters they learn are the weights w ∈ R.
not have to be estimated. Another advantage of CRFs They are learned by gradient descent. To do so, an error
compared to MRFs is that no distribution assumption function — usually cross-entropy or mean squared error
about x has to be made. — is necessary. For the gradient descent algorithm, one
sees the labeled training data as given, the weights
A CRF has the partition function Z:
X as variables and the error function as a surface in
Z(x) = P (x, y) this weight-space. Minimizing the error function in the
y weight space adapts the neural network to the problem.
and joint probability distribution There are lots of ideas around neural networks like
regularization, better optimization algorithms, automat-
1 Y ically building up architectures, design choices for
P (y|x) = ψc (yc |x)
Z(x) activation functions. This is not explained in detail here,
c∈C
but some of the mayor breakthroughs are outlined.
The simplest way to define the clique potentials ψ is CNNs are neural networks which learn image filters.
the count of the class yc given x added with a positive They drastically reduce the number of parameters which
smoothing constant to prevent the complete term from have to be learned while being still general enough for
getting zero. the problem domain of images. This was shown by Alex
CRFs as described in [LRKT09] have reached top Krizhevsky et al. in [KSH12]. One major idea was a
performance in PASCAL VOC 2010 [VOC10] and clever regularization called dropout training, which set
are also used in [HZCP04], [SWRC06] for semantic the output of neurons while training randomly to zero.
segmentation. Another contribution was the usage of an activation
A method similar to CRFs was proposed function called rectified linear unit:
in [GBVdW+ 10]. The system of Gonfaus et.al.
ranked 1st by mean accuracy in the segmentation task ϕReLU (x) = max(0, x)
+
of the PASCAL VOC 2010 challenge [EVGW 10]. Those are much faster to train than the commonly used
An introduction to CRFs is given by [SM11]. sigmoid activation functions
1
G. Post-processing methods ϕSigmoid (x) = −x
e +1
Post-processing refine a found segmentation and Krizhevsky et al. implemented those ideas and partici-
remove obvious errors. For example, the morphological pated in the ImageNet Large-Scale Visual Recognition
operations opening and closing can remove noise. The Challenge (ILSVRC). The best other system, which
opening operation is a dilation followed by a erosion. used SIFT features and Fisher Vectors, had a perfor-
This removes tiny segments. The closing operation is a mance of about 25.7 % while the network by Alex
erosion followed by a dilation. This removes tiny gaps Krizhevsky et al. got 17.0 % error rate on the ILSVRC-
in otherwise filled regions. They were used in [CLP98] 2010 dataset. As a preprocessing step, they downsam-
for biomedical image segmentation. pled all images to a fixed size of 256 px×256 px before
Another way of refinement of the found segmentation they fed the features into their network. This network
is by adjusting the segmentation to match close edges. is commonly known as AlexNet.
This was used in [BBMM11] with an ultra-metric Since AlexNet was developed, a lot of different
contour map [AMFM09]. neural networks have been proposed. One interesting
Active contour models are another example of a example is [PC13], where a recurrent CNN for semantic
post-processing method [KWT88]. segmentation is presented.
10
A. Lens Flare
Lens flare is the effect of light getting scattered in (e) Transparency (f) Viewpoint
the lens system of the camera. The testing data set of
the KITTI road evaluation benchmark [FKG13] has a Figure 4: Examples of images which might cause
couple of photos with this problem. Figure 4(a) shows semantic segmentation systems to fail.
an extreme example of lens flare.
1998. [Online]. Available: http://ieeexplore.ieee.org/ vol. 1, June 2005, pp. 886–893 vol. 1.
xpls/abs_all.jsp?arnumber=730379 [Online]. Available: http://ieeexplore.ieee.org/xpls/
[CM02] D. Comaniciu and P. Meer, “Mean shift: A abs_all.jsp?arnumber=1467360
robust approach toward feature space analysis,” [EVGW+ a] M. Everingham, L. Van Gool, C. K. I.
Pattern Analysis and Machine Intelligence, IEEE Williams, J. Winn, and A. Zisserman, “The
Transactions on, vol. 24, no. 5, pp. 603–619, 2002. PASCAL Visual Object Classes Challenge
[Online]. Available: http://ieeexplore.ieee.org/xpl/ 2007 (VOC2007) Results,” http://www.pascal-
login.jsp?tp=&arnumber=1000236 network.org/challenges/VOC/voc2007/workshop/index.html.
[COWR11] C. Chen, J. Ozolek, W. Wang, and G. K. Rohde, [Online]. Available: http://host.robots.ox.ac.uk:
“A pixel classification system for segmenting 8080/pascal/VOC/voc2007/index.html
biomedical images using intensity neighborhoods [EVGW+ b] ——, “The PASCAL Visual Object Classes Chal-
and dimension reduction,” in Biomedical Imaging: lenge 2012 (VOC2012) Results,” http://www.pascal-
From Nano to Macro, 2011 IEEE International network.org/challenges/VOC/voc2012/workshop/index.html.
Symposium on. IEEE, 2011, pp. 1649–1652. [Online]. Available: http://host.robots.ox.ac.uk:
[Online]. Available: https://www.andrew.cmu.edu/ 8080/pascal/VOC/voc2012/index.html
user/gustavor/chen_isbi_11.pdf [EVGW+ 10] M. Everingham, L. Van Gool, C. K. Williams,
[CP08] G. Csurka and F. Perronnin, “A simple high J. Winn, and A. Zisserman, “The pascal visual object
performance approach to semantic segmentation.” classes (voc) challenge,” International journal of
in BMVC, 2008, pp. 1–10. [Online]. Avail- computer vision, vol. 88, no. 2, pp. 303–338, 2010.
able: http://www.xrce.xerox.com/layout/set/print/ [EVGW+ 12] M. Everingham, L. Van Gool, C. K. I. Williams,
content/download/16654/118653/file/2008-023.pdf J. Winn, and A. Zisserman, “Visual object
[CRSS] A. Cohen, E. Rivlin, I. Shimshoni, and classes challenge 2012 (voc2012),” 2012. [Online].
E. Sabo, “Colon crypt segmentation website.” [On- Available: http://host.robots.ox.ac.uk:8080/pascal/
line]. Available: http://mis.haifa.ac.il/~ishimshoni/ VOC/voc2012/index.html
SegmentCrypt/Download.htm [Fel] P. F. Felzenszwalb, “Graph based im-
[CRSS14] ——, “Memory based active contour algorithm age segmentation.” [Online]. Available: http:
using pixel-level classified images for colon crypt //cs.brown.edu/~pff/segment/
segmentation,” Computerized Medical Imaging [FGMR10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester,
and Graphics, Nov. 2014. [Online]. Available: and D. Ramanan, “Object detection with discrimina-
http://mis.haifa.ac.il/~ishimshoni/SegmentCrypt/ tively trained part-based models,” Pattern Analysis
Active%20contour%20based%20on%20pixel- and Machine Intelligence, IEEE Transactions on,
level%20classified%20image%20for%20colon% vol. 32, no. 9, pp. 1627–1645, 2010.
20crypts%20segmentation.pdf
[FH04] P. F. Felzenszwalb and D. P. Huttenlocher,
[CS10] J. Carreira and C. Sminchisescu, “Constrained
“Efficient graph-based image segmentation,”
parametric min-cuts for automatic object segmenta-
International Journal of Computer Vision,
tion,” in Computer Vision and Pattern Recognition
vol. 59, no. 2, pp. 167–181, 2004. [Online].
(CVPR), 2010 IEEE Conference on. IEEE, 2010,
Available: http://link.springer.com/article/10.1023/
pp. 3241–3248.
B:VISI.0000022288.19776.77
[CS11] ——, “Cpmc: Constrained parametric min-cuts for
automatic object segmentation,” Feb. 2011. [Online]. [FKG13] J. Fritsch, T. Kuehnl, and A. Geiger, “A
Available: http://www.maths.lth.se/matematiklth/ new performance measure and evaluation
personal/sminchis/code/cpmc/ benchmark for road detection algorithms,” in
International Conference on Intelligent Transporta-
[CSI+ 09] M. E. Celebi, G. Schaefer, H. Iyatomi, W. V.
tion Systems (ITSC), 2013. [Online]. Available:
Stoecker, J. M. Malters, and J. M. Grichnik, “An
http://www.cvlibs.net/datasets/kitti/eval_road.php
improved objective evaluation measure for border
detection in dermoscopy images,” Skin Research [GBVdW+ 10] J. M. Gonfaus, X. Boix, J. Van de Weijer, A. D.
and Technology, vol. 15, no. 4, pp. 444–450, 2009. Bagdanov, J. Serrat, and J. Gonzalez, “Harmony po-
[Online]. Available: http://arxiv.org/abs/1009.1020 tentials for joint classification and segmentation,” in
Computer Vision and Pattern Recognition (CVPR),
[CSM09] L. P. Coelho, A. Shariff, and R. F. Murphy, “Nuclear
2010 IEEE Conference on. IEEE, 2010, pp. 3280–
segmentation in microscope cell images: a hand-
3287.
segmented dataset and comparison of algorithms,”
in Biomedical Imaging: From Nano to Macro, [GRC+ 08] S. Gould, J. Rodgers, D. Cohen, G. Elidan, and
2009. ISBI’09. IEEE International Symposium on. D. Koller, “Multi-class segmentation with relative
IEEE, 2009, pp. 518–521. [Online]. Available: location prior,” International Journal of Computer
http://murphylab.web.cmu.edu/data Vision, vol. 80, no. 3, pp. 300–316, Apr. 2008.
[CXGS12] M. D. Collins, J. Xu, L. Grady, and V. Singh, [GVSY13] S. Giannarou, M. Visentini-Scarzanella, and G.-
“Random walks based multi-image segmentation: Z. Yang, “Probabilistic tracking of affine-invariant
Quasiconvexity results and gpu-based solutions,” anisotropic regions,” Pattern Analysis and Machine
in Computer Vision and Pattern Recognition Intelligence, IEEE Transactions on, vol. 35, no. 1,
(CVPR), 2012 IEEE Conference on. IEEE, pp. 130–143, 2013.
2012, pp. 1656–1663. [Online]. Available: http: [Har75] J. A. Hartigan, Clustering algorithms. John Wiley
//pages.cs.wisc.edu/~jiaxu/pub/rwcoseg.pdf & Sons, Inc., 1975.
[DHS15] J. Dai, K. He, and J. Sun, “Instance-aware seman- [HDT02] C. Huang, L. Davis, and J. Townshend, “An
tic segmentation via multi-task network cascades,” assessment of support vector machines for land
arXiv preprint arXiv:1512.04412, 2015. cover classification,” International Journal of remote
[DT05] N. Dalal and B. Triggs, “Histograms of oriented sensing, vol. 23, no. 4, pp. 725–749, 2002.
gradients for human detection,” in Computer [HHR01] S. Hu, E. Hoffman, and J. Reinhardt, “Automatic
Vision and Pattern Recognition, 2005. CVPR lung segmentation for accurate quantitation of
2005. IEEE Computer Society Conference on, volumetric x-ray ct images,” Medical Imaging, IEEE
13
Transactions on, vol. 20, no. 6, pp. 490–498, Jun. invariant keypoints,” International Journal of
2001. Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[HJBJ+ 96] A. Hoover, G. Jean-Baptiste, X. Jiang, P. J. [Online]. Available: http://dx.doi.org/10.1023/B%
Flynn, H. Bunke, D. B. Goldgof, K. Bowyer, 3AVISI.0000029664.99615.94
D. W. Eggert, A. Fitzgibbon, and R. B. [LRAL08] A. Levin, A. Rav-Acha, and D. Lischinski,
Fisher, “An experimental comparison of range “Spectral matting,” Pattern Analysis and
image segmentation algorithms,” Pattern Analysis Machine Intelligence, IEEE Transactions on,
and Machine Intelligence, IEEE Transactions vol. 30, no. 10, pp. 1699–1712, 2008.
on, vol. 18, no. 7, pp. 673–689, Jul. 1996. [Online]. Available: http://ieeexplore.ieee.org/xpls/
[Online]. Available: http://ieeexplore.ieee.org/xpls/ abs_all.jsp?arnumber=4547428
abs_all.jsp?arnumber=506791 [LRKT09] L. Ladický, C. Russell, P. Kohli, and P. Torr,
[Ho95] T. K. Ho, “Random decision forests,” in “Associative hierarchical crfs for object class image
Document Analysis and Recognition, 1995., segmentation,” in Computer Vision, 2009 IEEE 12th
Proceedings of the Third International Conference International Conference on, 2009, pp. 739–746.
on, vol. 1. IEEE, 1995, pp. 278–282. [Online]. Available: http://ieeexplore.ieee.org/xpls/
[Online]. Available: http://ect.bell-labs.com/who/ abs_all.jsp?arnumber=5459248
tkh/publications/papers/odt.pdf [LSD14] J. Long, E. Shelhamer, and T. Darrell, “Fully
[Hus07] Hustvedt, “File:cctv lens flare.jpg,” Wikipedia convolutional networks for semantic segmentation,”
Commons, Nov. 2007. [Online]. Avail- arXiv preprint arXiv:1411.4038, 2014. [Online].
able: https://commons.wikimedia.org/wiki/File: Available: http://arxiv.org/abs/1411.4038
CCTV_Lens_flare.jpg [MAFM08] M. Maire, P. Arbelaez, C. Fowlkes, and
[HZCP04] X. He, R. Zemel, and M. Carreira-Perpindn, J. Malik, “Using contours to detect and localize
“Multiscale conditional random fields for image junctions in natural images,” in Computer Vision
labeling,” in Computer Vision and Pattern and Pattern Recognition, 2008. CVPR 2008.
Recognition, 2004. CVPR 2004. Proceedings IEEE Conference on, June 2008, pp. 1–8.
of the 2004 IEEE Computer Society Conference [Online]. Available: http://ieeexplore.ieee.org/xpls/
on, vol. 2, Jun. 2004, pp. II–695–II–702 Vol.2. abs_all.jsp?arnumber=4587420
[Online]. Available: http://ieeexplore.ieee.org/xpl/ [Man12] M. Manske, “File:randabschattung mikroskop
login.jsp?tp=&arnumber=1315232 kamera 6.jpg,” Wikipedia Com-
[JLD03] K. Jiang, Q.-M. Liao, and S.-Y. Dai, “A novel white mons, Dec. 2012. [Online]. Avail-
blood cell segmentation scheme using scale-space able: https://commons.wikimedia.org/wiki/File:
filtering and watershed clustering,” in Machine Randabschattung_Mikroskop_Kamera_6.JPG
Learning and Cybernetics, 2003 International [MBLAGJ+ 07] S. Maldonado-Bascon, S. Lafuente-Arroyo, P. Gil-
Conference on, vol. 5, Nov 2003, pp. 2820–2825 Jimenez, H. Gomez-Moreno, and F. Lopez-
Vol.5. [Online]. Available: http://ieeexplore.ieee.org/ Ferreras, “Road-sign detection and recognition
xpl/login.jsp?tp=&arnumber=1260033 based on support vector machines,” Intelligent
[Kaf07] L. Kaffer, “File:great male leopard in south afrika- Transportation Systems, IEEE Transactions on,
jd.jpg,” Wikipedia Commons, Jul. 2007. [Online]. vol. 8, no. 2, pp. 264–278, Jun. 2007.
Available: https://commons.wikimedia.org/wiki/File: [Online]. Available: http://ieeexplore.ieee.org/xpls/
Great_male_Leopard_in_South_Afrika-JD.JPG abs_all.jsp?arnumber=4220659
[KKV+ 14] V. Kalesnykiene, J.-k. Kamarainen, R. Voutilainen, [MBVLG02] N. Moon, E. Bullitt, K. Van Leemput, and G. Gerig,
J. Pietilä, H. Kälviäinen, and H. Uusitalo, “Automatic brain and tumor segmentation,” in Med-
“Diaretdb1 diabetic retinopathy database and ical Image Computing and Computer-Assisted In-
evaluation protocol,” 2014. [Online]. Available: tervention—MICCAI 2002. Springer, 2002, pp.
http://www2.it.lut.fi/project/imageret/diaretdb1/ 372–379.
[KP92] J. M. Kasson and W. Plouffe, “An analysis of [MFTM01] D. Martin, C. Fowlkes, D. Tal, and J. Malik,
selected computer interchange color spaces,” ACM “A database of human segmented natural
Transactions on Graphics (TOG), vol. 11, no. 4, pp. images and its application to evaluating
373–405, 1992. segmentation algorithms and measuring ecological
[KP06] Z. Kato and T.-C. Pong, “A markov random statistics,” in Computer Vision, 2001. ICCV
field image segmentation model for color 2001. Proceedings. Eighth IEEE International
textured images,” Image and Vision Computing, Conference on, vol. 2. IEEE, 2001, pp. 416–423.
vol. 24, no. 10, pp. 1103–1114, 2006. [Online]. [Online]. Available: http://ieeexplore.ieee.org/xpls/
Available: http://www.sciencedirect.com/science/ abs_all.jsp?arnumber=937655
article/pii/S0262885606001223 [MHMK+ 14] L. Maier-Hein, S. Mersmann, D. Kondermann,
[KSH12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, S. Bodenstedt, A. Sanchez, C. Stock, H. G.
“Imagenet classification with deep convolutional Kenngott, M. Eisenmann, and S. Speidel, “Can
neural networks,” in Advances in neural information masses of non-experts train highly accurate
processing systems, 2012, pp. 1097–1105. image classifiers?” in Medical Image Computing
[KWT88] M. Kass, A. Witkin, and D. Terzopoulos, and Computer-Assisted Intervention–MICCAI 2014.
“Snakes: Active contour models,” International Springer, 2014, pp. 438–445. [Online]. Available:
journal of computer vision, vol. 1, no. 4, pp. http://opencas.webarchiv.kit.edu/?q=node/26
321–331, Jan. 1988. [Online]. Available: http: [Min89] J. Mingers, “An empirical comparison of selection
//link.springer.com/article/10.1007/BF00133570 measures for decision-tree induction,” Machine
[LKJ15] F.-F. Li, A. Karpathy, and J. Johnson, Learning, vol. 3, no. 4, pp. 319–342, 1989.
“CS231n: Convolutional neural networks for [Online]. Available: http://dx.doi.org/10.1023/A%
visual recognition,” 2015. [Online]. Available: 3A1022645801436
http://cs231n.stanford.edu/ [MSB12] G. Moser, S. B. Serpico, and J. A. Benediktsson,
[Low04] D. Lowe, “Distinctive image features from scale- “Markov random field models for supervised land
14
cover classification from very high resolution tion strategies,” Fundam. Inform., vol. 41, no. 1-2,
multispectral remote sensing images,” in Advances pp. 187–228, 2000.
in Radar and Remote Sensing (TyWRRS), 2012 [RM07] J. Reynolds and K. Murphy, “Figure-ground
Tyrrhenian Workshop on. IEEE, 2012, pp. 235– segmentation using a hierarchical conditional
242. [Online]. Available: http://ieeexplore.ieee.org/ random field,” in Computer and Robot
xpl/login.jsp?tp=&arnumber=6381135 Vision, 2007. CRV ’07. Fourth Canadian
[MSC] “Object class recognition image database.” Conference on, May 2007, pp. 175–182.
[Online]. Available: http://research.microsoft.com/ [Online]. Available: http://ieeexplore.ieee.org/xpls/
vision/cambridge/recognition/ abs_all.jsp?arnumber=4228537
[MSR] “Image understanding - research data,” [RMBK06] C. Rother, T. Minka, A. Blake, and V. Kolmogorov,
Microsoft Research. [Online]. Avail- “Cosegmentation of image pairs by histogram
able: http://research.microsoft.com/en-us/projects/ matching - incorporating a global constraint
objectclassrecognition/ into mrfs,” in Computer Vision and Pattern
[Mur12] K. P. Murphy, Machine learning: a probabilistic Recognition, 2006 IEEE Computer Society
perspective. MIT press, 2012. Conference on, vol. 1, June 2006, pp. 993–
[OKS78] Y.-i. Ohta, T. Kanade, and T. Sakai, “An analysis 1000. [Online]. Available: http://ieeexplore.ieee.org/
system for scenes containing objects with substruc- xpls/abs_all.jsp?arnumber=1640859
tures,” in Proceedings of the Fourth International [SAN+ 04] J. Staal, M. D. Abràmoff, M. Niemeijer,
Joint Conference on Pattern Recognitions, 1978, pp. M. Viergever, B. Van Ginneken et al., “Ridge-based
752–754. vessel segmentation in color images of the retina,”
[PAA+ 87] S. M. Pizer, E. P. Amburn, J. D. Austin, Medical Imaging, IEEE Transactions on, vol. 23,
R. Cromartie, A. Geselowitz, T. Greer, B. ter no. 4, pp. 501–509, 2004. [Online]. Available:
Haar Romeny, J. B. Zimmerman, and K. Zuiderveld, http://www.isi.uu.nl/Research/Databases/DRIVE/
“Adaptive histogram equalization and its variations,” [SCZ08] F. Schroff, A. Criminisi, and A. Zisserman,
Computer vision, graphics, and image processing, “Object class segmentation using random
vol. 39, no. 3, pp. 355–368, 1987. [Online]. forests.” in BMVC, 2008, pp. 1–10. [On-
Available: http://www.sciencedirect.com/science/ line]. Available: http://research.microsoft.com/pubs/
article/pii/S0734189X8780186X 72423/Criminisi_bmvc2008.pdf
[PC13] P. H. Pinheiro and R. Collobert, “Recurrent [SJC08] J. Shotton, M. Johnson, and R. Cipolla,
convolutional neural networks for scene parsing,” “Semantic texton forests for image categorization
arXiv preprint arXiv:1306.2795, 2013. [Online]. and segmentation,” in Computer vision and
Available: http://arxiv.org/abs/1306.2795v1 pattern recognition, 2008. CVPR 2008. IEEE
[PH05] C. Pantofaru and M. Hebert, “A Conference on. IEEE, Jun. 2008, pp. 1–8.
comparison of image segmentation algorithms,” [Online]. Available: http://ieeexplore.ieee.org/xpls/
Robotics Institute, p. 336, 2005. [Online]. abs_all.jsp?arnumber=4587503
Available: http://riweb-backend.ri.cmu.edu/ [SM11] C. Sutton and A. McCallum, “An introduction
pub_files/pub4/pantofaru_caroline_2005_1/ to conditional random fields,” Machine Learning,
pantofaru_caroline_2005_1.pdf vol. 4, no. 4, pp. 267–373, 2011. [Online].
[PS07] A. Protiere and G. Sapiro, “Interactive Available: http://homepages.inf.ed.ac.uk/csutton/
image segmentation via adaptive weighted publications/crftutv2.pdf
distances,” Image Processing, IEEE Transactions [Smi02] L. I. Smith, “A tutorial on principal components
on, vol. 16, no. 4, pp. 1046–1057, 2007. analysis,” Cornell University, USA, vol. 51, p. 52,
[Online]. Available: http://ieeexplore.ieee.org/xpls/ 2002.
abs_all.jsp?arnumber=4130436 [Smi04] B. T. Smith, “Lagrange multipliers tutorial in the
[PTN09] N. Plath, M. Toussaint, and S. Nakajima, “Multi- context of support vector machines,” Memorial Uni-
class image segmentation using conditional random versity of Newfoundland St. John’s, Newfoundland,
fields and global classification,” in Proceedings Canada, Jun. 2004.
of the 26th Annual International Conference on [SSA12] D. Schiebener, J. Schill, and T. Asfour, “Discovery,
Machine Learning. ACM, 2009, pp. 817–824. segmentation and reactive grasping of unknown
[PXP00] D. L. Pham, C. Xu, and J. L. Prince, “A objects.” in Humanoids, 2012, pp. 71–77. [On-
survey of current methods in medical image line]. Available: http://h2t.anthropomatik.kit.edu/
segmentation,” Annual Review of Biomedical pdf/Schiebener2012.pdf
Engineering, vol. 2, no. 1, pp. 315–337, 2000, [SUM+ 11] D. Schiebener, A. Ude, J. Morimotot,
pMID: 11701515. [Online]. Available: http:// T. Asfour, and R. Dillmann, “Segmentation
dx.doi.org/10.1146/annurev.bioeng.2.1.315 and learning of unknown objects through physical
[Qui86] J. R. Quinlan, “Induction of decision trees,” interaction,” in Humanoid Robots (Humanoids),
Machine learning, vol. 1, no. 1, pp. 81–106, 2011 11th IEEE-RAS International Conference
Aug. 1986. [Online]. Available: http://dx.doi.org/ on. IEEE, 2011, pp. 500–506. [Online].
10.1023/A%3A1022643204877 Available: http://ieeexplore.ieee.org/ielx5/6086637/
[Qui93] ——, C4.5: Programs for Machine Learning, P. Lan- 6100798/06100843.pdf
gley, Ed. Morgan Kaufmann Publishers, Inc., 1993. [SWRC06] J. Shotton, J. Winn, C. Rother, and A. Criminisi,
[RKB04] C. Rother, V. Kolmogorov, and A. Blake, “Grabcut: “Textonboost: Joint appearance, shape and context
Interactive foreground extraction using iterated modeling for multi-class object recognition and
graph cuts,” ACM Transactions on Graphics segmentation,” in Computer Vision–ECCV 2006.
(TOG), vol. 23, no. 3, pp. 309–314, 2004. [Online]. Springer, 2006, pp. 1–15. [Online]. Available: http:
Available: http://delivery.acm.org/10.1145/1020000/ //link.springer.com/chapter/10.1007/11744023_1
1015720/p309-rother.pdf [TNL14] J. Tighe, M. Niethammer, and S. Lazebnik,
[RM00] J. B. Roerdink and A. Meijster, “The watershed “Scene parsing with object instances and
transform: Definitions, algorithms and paralleliza- occlusion ordering,” in Computer Vision and
15
A PPENDIX A
TABLES
Number Number
Database Image Resolution (width × height) of of Channels Data source
Images Classes
Colon Crypt DB (302 px − 1116 px) × (349 px − 875 px) 389 2 3 [CRSS]
DIARETDB1 1500 px × 1500 px 89 4 3 [KKV+ 14]
KITTI Road (1226 px − 1242 px) × (370 px − 376 px) 289 2 3 [FKG13]
MSRCv1 (213 px − 320 px) × (213 px − 320 px) 240 9 3 [MSR]
MSRCv2 (213 px − 320 px) × (162 px − 320 px) 591 23 3 [MSR]
Open-CAS Endoscopic Datasets 640 px × 480 px 120 2 3 [MHMK+ 14]
PASCAL VOC 2012 (142 px − 500 px) × ( 71 px − 500 px) 2913 20 3 [EVGW+ 12]
Warwick-QU (567 px − 775 px) × (430 px − 522 px) 165 5 3 [CSM09]
Table I: An overview over publicly available image databases with a semantic segmentation ground trouth.