Pattern Recognition, Vol. 29, No. 8, pp. 1335-1346, 1996
Elsevier Science Ltd
Copyright 1996 Pattern Recognition Society
Printed in Great Britain. All rights reserved
0031 - 3203/96 $15.00 + .00
Department of Electronic Engineering, Tsinghua University, 100084 Beijing, People's Republic of China,
and The Institute of AI and PR, Shantou University, 515063 Shantou, People's Republic of China
(Received 19 April 1995; in revised form 9 November 1995; received for publication 5 December 1995)
Abstract--This paper studies different methods proposed so far for segmentation evaluation. Most methods
can be classified into three groups: the analytical, the empirical goodness and the empirical discrepancy
groups. Each group has its own characteristics. After a brief description of each method in every group, some
comparative discussions about different method groups are first carried out. An experimental comparison for
some empirical (goodness and discrepancy) methods commonly used is then performed to provide a rank of
their evaluation abilities. In addition, some special methods are also discussed. This study is helpful for an
appropriate use of existing evaluation methods and for improving their performance as well as for
systematically designing new evaluation methods. Copyright 1996 Pattern Recognition Society. Published
by Elsevier Science Ltd.
Image analysis Image segmentation Segmentation evaluation
Analytical and empirical study Performance assessment Criteria function
Algorithm comparison Image quality measure Method characterization
Image analysis usually refers to the processing of images by computer with the goal of finding what objects are present in the image.(1) Image segmentation is one of the most critical tasks in automatic image analysis. It consists of subdividing an image into its constituent parts and extracting those parts of interest (objects). A great variety of segmentation algorithms have been developed in the last few decades and their number continues to increase each year.(2) Several survey papers on segmentation techniques have been presented in the literature.(3-8) Since none of the proposed segmentation algorithms is generally applicable to all images, and different algorithms are not equally suitable for a particular application,(9) the performance evaluation of segmentation algorithms is indispensable and thus an important subject in the study of segmentation. More generally, performance evaluation is critical for all computer vision algorithms from research to application,(10) while image segmentation is an essential and important step of low-level vision.
While the development of segmentation algorithms has attracted significant attention, relatively few efforts have been spent on their evaluation, although many newly developed algorithms are compared (most often subjectively) with a few particular algorithms on a few particular images. Moreover, most efforts spent on evaluation are devoted to designing new evaluation methods, and only very few authors have attempted to characterize the existing evaluation methods.(11) The present paper reviews the existing methods for segmentation evaluation and discusses and compares their applicability, advantages and limitations.

*This research has been supported under Grants SCE-F 1994660 and SCE-TM 199416.
Segmentation algorithms can be evaluated analytically or empirically, so the evaluation methods can be divided into two categories: the analytical methods and the empirical methods. The analytical methods directly examine and assess the segmentation algorithms themselves by analysing their principles and properties. The empirical methods indirectly judge the segmentation algorithms by applying them to test images and measuring the quality of the segmentation results. Various empirical methods have been proposed. Most of them can be further classified into two types: goodness methods and discrepancy methods. In the first type, some desirable properties of segmented images, often established according to human intuition, are measured by "goodness" parameters. The performance of the segmentation algorithms under investigation is judged by the values of the goodness measures. In the second type, references that present the ideal or expected segmentation results are first found. The actual segmentation results, obtained by applying a segmentation algorithm (sometimes preceded by preprocessing and/or followed by postprocessing), are compared with the references by counting their differences. The performance of the segmentation algorithms under investigation is then assessed according to the discrepancy measures. Following this discussion, three groups of methods can be distinguished.
1336 Y.J. ZHANG
[Figure 1: block diagram linking the input image, preprocessing, segmenting image, segmentation, segmented image, postprocessing and output image, with access points for the analytical method (algorithms), the empirical goodness method and the empirical discrepancy method (reference image).]
Fig. 1. General scheme for segmentation and its evaluation.
The above classification of evaluation methods can be seen more clearly in Fig. 1, where a general scheme for segmentation and its evaluation is presented. The input image obtained by sensing is first (optionally) preprocessed to produce the segmenting image for the segmentation procedure (in its strict sense). The segmented image can then be (optionally) postprocessed to produce the output image. Further processes, such as feature extraction and measurement, will be based on these output images. In Fig. 1 the part enclosed by the rounded square with a thin line corresponds to the segmentation procedure in its narrow sense, while the part enclosed by the rounded square with a dotted line corresponds to the segmentation procedure in its general form. The black arrows indicate the processing directions of segmentation. The access points for the three groups of evaluation methods are depicted with gray arrows in Fig. 1. Note that there is an "or" condition between the two arrows leading to the boxes containing "segmented image" and "output image", both from the "empirical goodness method" and from the "empirical discrepancy method". Moreover, there is an "and" condition between the arrow from "empirical discrepancy method" to "reference image" and the two ("or") arrows going to "segmented image" and "output image". The analytical methods treat the algorithms for segmentation directly. The empirical goodness methods judge the segmented image or output image so as to indirectly assess the performance of algorithms. For applying empirical discrepancy methods, the reference image is necessary. It can be obtained manually or automatically from the input image or segmenting image. The empirical discrepancy methods compare the segmented image or output image to the reference image and use their difference to assess the performance of algorithms.
Each method group has its own particularities that distinguish it from the other groups. Each method also has its own characteristics by which it can be identified. In the following three sections a brief description of the methods belonging to the three groups is provided. They are arranged according to the above method classification. The justification of the classification of methods into analytical and empirical ones, as well as the separation of empirical methods into goodness and discrepancy groups, will be made clear by the comparative discussion of different method groups in Section 5. In addition, an experimental comparison of several commonly used empirical methods is carried out in Section 6. These representative methods are compared according to their ability and behavior in evaluating the same series of segmented images. A ranking among them is then obtained. In Section 7 several special evaluation methods that do not fall clearly into the above three groups and some common problems of most existing evaluation methods are discussed. Finally, some concluding remarks are given in Section 8.
2. ANALYTICAL METHODS
The analytical methods directly treat the segmentation algorithms themselves by considering the principles, requirements, utilities, complexity, etc., of algorithms. Using the analytical methods to evaluate segmentation algorithms avoids the concrete implementation of these algorithms, and the results are exempt from the influence of the arrangement of evaluation experiments, which affects the empirical methods. However, not all properties of segmentation algorithms can be obtained by analytical studies. The difficulty, up to now, is the lack of a general theory for image segmentation.(12) Although some initial attempts in the direction of a unified theory of segmentation have been reported, for example on the relation between image models and segmentation,(13) no formal solution has been found yet. Until now, the analytical methods work only with some particular models or desirable properties of algorithms.

Evaluation methods for image segmentation 1337
One analytical method has been proposed by Liedtke et al.(14) They presented an evaluation study of several algorithms by taking into account the type and amount of a priori knowledge that has been incorporated into different segmentation algorithms. For certain segmentation algorithms such knowledge can readily be analysed, since it is mainly determined by the nature of the algorithms. However, such knowledge is usually heuristic information, and different types of a priori knowledge are hardly comparable. The information provided by this method is therefore rough and qualitative. On the other hand, not only "the amount of relevant a priori knowledge that can be incorporated into the segmentation algorithm is decisive for the reliability of the segmentation methods",(14) but the way in which such a priori knowledge has been incorporated is also very important for the performance of the algorithm.(15)
The analytical methods can in certain cases provide quantitative information about segmentation algorithms. Abdou and Pratt(16) analysed the performance of several edge detectors with a detection probability ratio in a statistical design procedure. Let $T$ be the edge decision threshold, $P_c$ the probability of correct detection and $P_f$ the probability of false detection:

$$P_c = \int_T^{\infty} p(t \mid \text{edge})\,dt \qquad (1)$$

$$P_f = \int_T^{\infty} p(t \mid \text{no-edge})\,dt \qquad (2)$$

The plot of $P_c$ versus $P_f$ in terms of $T$ can provide a performance index of detectors. Such an index should be useful for evaluating the segmentation algorithms based on edge detection [for example, see reference (9)]. In contrast to the a priori knowledge discussed above, this index can be precisely defined and calculated for simple edge detectors.(16)
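The detection and false-detection probabilities of equations (1) and (2) can be illustrated with a minimal sketch. It assumes, purely for illustration, that the detector response is Gaussian under both the edge and the no-edge hypothesis; the means, standard deviations and function names below are hypothetical and are not taken from Abdou and Pratt.

```python
import math


def upper_tail(threshold, mean, std):
    """P(t > threshold) for a Gaussian response with the given mean and std."""
    return 0.5 * math.erfc((threshold - mean) / (std * math.sqrt(2.0)))


def detection_curve(thresholds, edge=(3.0, 1.0), no_edge=(0.0, 1.0)):
    """Return (T, P_c, P_f) triples for each decision threshold T.

    `edge` and `no_edge` are assumed (mean, std) pairs of the detector
    response under the two hypotheses; they are illustrative values only.
    """
    return [(t, upper_tail(t, *edge), upper_tail(t, *no_edge))
            for t in thresholds]


curve = detection_curve([0.0, 1.5, 3.0])
for t, p_c, p_f in curve:
    print(f"T={t:.1f}  P_c={p_c:.3f}  P_f={p_f:.3f}")
```

Sweeping the threshold traces out the curve of correct detection against false detection; a detector whose curve lies higher for the same false-detection probability would be rated better under this index.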
Other properties of segmentation algorithms that can be obtained by analysis include the processing strategy, the processing complexity and efficiency, and the segmentation resolution of the algorithm.(17,18) These properties could be helpful for selecting suitable algorithms in particular applications. For example, the processing strategy of segmentation algorithms can be parallel, sequential, iterative or mixed. The parallel algorithms are suitable for fast implementation. However, for images that are severely contaminated by noise, the performance of parallel algorithms is often poorer than that of sequential methods.(19)
3. EMPIRICAL GOODNESS METHODS
The methods in this group evaluate the performance of algorithms by judging the quality of segmented images. To carry out this work certain quality measures should be defined. Most measures are established according to human intuition about what conditions should be satisfied by an "ideal" segmentation (for example, a pretty picture). In other words, the quality of segmented images is assessed by some "goodness" measures. These methods characterize different segmentation algorithms by simply computing the goodness measures based on the segmented image, without a priori knowledge of the correct segmentation.(1) The application of these evaluation methods does not require references, so they can be used for on-line evaluation. Different types of goodness measures have been proposed.
3.1. Goodness based on intra-region uniformity

Weszka and Rosenfeld proposed a threshold evaluation method that uses a busyness measure as the criterion to judge thresholded images.(21) To apply the busyness measure they assume that the images are composed of objects and background of compact shapes and are not strongly textured. Under these assumptions, the thresholded images should look smooth rather than busy. In practice, they compute the amount of busyness for a thresholded image by using the gray-level co-occurrence matrix of the image.(22) That is, those entries of the co-occurrence matrix representing the percentage of object-background adjacencies are summed. The lower the busyness, the smoother the thresholded image and the better the segmentation result. In consequence, the better the segmentation results, the higher the performance of the applied algorithms.
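The idea of a busyness measure can be sketched as follows. This is a simplified illustration, not the exact procedure of Weszka and Rosenfeld: it assumes a binary (already thresholded) image and counts only horizontally and vertically adjacent pixel pairs, which corresponds to the off-diagonal entries of a two-level co-occurrence matrix. The function name is hypothetical.

```python
def busyness(binary):
    """Fraction of adjacent pixel pairs that join object and background.

    `binary` is a list of rows of 0/1 values (a thresholded image).
    Only horizontal and vertical neighbours are considered (an assumption).
    """
    height, width = len(binary), len(binary[0])
    mixed = 0
    total = 0
    for y in range(height):
        for x in range(width):
            if x + 1 < width:  # horizontal neighbour pair
                total += 1
                mixed += binary[y][x] != binary[y][x + 1]
            if y + 1 < height:  # vertical neighbour pair
                total += 1
                mixed += binary[y][x] != binary[y + 1][x]
    return mixed / total


smooth = [[0, 0, 1, 1],
          [0, 0, 1, 1]]
busy = [[0, 1, 0, 1],
        [1, 0, 1, 0]]
print(busyness(smooth), busyness(busy))
```

The compact two-region image yields a low value, while the checkerboard-like image yields the maximum, matching the intuition that a good thresholding of compact objects should look smooth.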
Similar to Weszka and Rosenfeld, Nazif and Levine also believe that an adequate segmentation should produce images having higher intra-region uniformity, which is related to the similarity of a property over the region elements.(23) The uniformity of a feature over a region can be computed on the basis of the variance of that feature evaluated at every pixel belonging to that region.(2) In particular, for a gray-level image $f(x, y)$, let $R_i$ be the ith segmented region and $A_i$ be the area of $R_i$; then the gray-level uniformity measure (GU) of $f(x, y)$ is:

$$GU = \sum_i \sum_{(x,y) \in R_i} \left[ f(x,y) - \frac{1}{A_i} \sum_{(x,y) \in R_i} f(x,y) \right]^2 \qquad (3)$$

A normalized uniformity measure (NU) has been proposed by Sahoo et al.:(8)

$$NU = 1 - GU/C, \qquad (4)$$

where $C$ is a normalization factor. Generally, other features can also be used.
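A minimal sketch of the gray-level uniformity computation is given below. It assumes that GU is the sum, over all regions, of the squared deviations of each pixel from its region mean, and that the normalization factor C is supplied by the caller; the function names and the sample values are hypothetical.

```python
def gray_level_uniformity(image, labels):
    """Sum over regions of squared deviations from the region mean.

    `image` holds gray levels and `labels` holds the region index of each
    pixel; both are lists of rows of equal shape.
    """
    sums, counts = {}, {}
    for img_row, lab_row in zip(image, labels):
        for value, region in zip(img_row, lab_row):
            sums[region] = sums.get(region, 0.0) + value
            counts[region] = counts.get(region, 0) + 1
    means = {region: sums[region] / counts[region] for region in sums}
    return sum((value - means[region]) ** 2
               for img_row, lab_row in zip(image, labels)
               for value, region in zip(img_row, lab_row))


def normalized_uniformity(image, labels, c):
    """NU = 1 - GU / C, with the normalization factor C chosen by the caller."""
    return 1.0 - gray_level_uniformity(image, labels) / c


image = [[10, 12, 50],
         [11, 13, 52]]
good = [[0, 0, 1],
        [0, 0, 1]]   # regions follow the gray-level structure
bad = [[0, 1, 1],
       [0, 1, 1]]    # one region mixes dark and bright pixels
print(gray_level_uniformity(image, good), gray_level_uniformity(image, bad))
```

The segmentation whose regions follow the gray-level structure gives a much smaller GU, and hence a larger NU for the same C.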
The intra-region uniformity, as a desired property of segmented images, can also be measured by the higher-order local entropy based on information theory.(24) Pal and Pal proposed a thresholding method that maximizes the second-order local entropy of the object and background regions.(24) This entropy $H_2$, for an assumed threshold $T$, is computed by:

$$H_2(T) = -\sum_{i=0}^{T} \sum_{j=0}^{T} p_{ij} \ln p_{ij}, \qquad (5)$$

where $p_{ij}$ is the probability of occurrence of the pair $(i, j)$ within the object/background. This entropy is also used by Pal and Bhandari(25) as a measure of the region homogeneity in segmented images for the performance evaluation of segmentation results.
3.2. Goodness based on inter-region contrast

In addition to intra-region uniformity, Levine and Nazif also believe that an adequate segmentation should produce images having higher contrast across adjacent regions.(2) In the simple case that a gray-level image $f(x, y)$ consists of the object with average gray level $f_o$ and the background with average gray level $f_b$, a gray-level contrast measure (GC) can be computed by:

$$GC = \frac{|f_o - f_b|}{f_o + f_b} \qquad (6)$$

Note that a similar idea has already been used by Otsu(26) for evaluating the "goodness" of threshold values in the development of a histogram-based threshold selection algorithm. By maximizing the between-region variance, a threshold value producing the highest region separability can be obtained.
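The gray-level contrast between object and background can be sketched as below. The sketch assumes that the contrast is the absolute difference of the two mean gray levels normalized by their sum; the function name and the sample image are hypothetical.

```python
def gray_level_contrast(image, mask):
    """Contrast between mean object and mean background gray level.

    `mask` marks object pixels with 1 and background pixels with 0.
    Assumes the normalization |f_o - f_b| / (f_o + f_b).
    """
    obj = [v for row, mrow in zip(image, mask) for v, m in zip(row, mrow) if m]
    bkg = [v for row, mrow in zip(image, mask) for v, m in zip(row, mrow) if not m]
    f_o = sum(obj) / len(obj)
    f_b = sum(bkg) / len(bkg)
    return abs(f_o - f_b) / (f_o + f_b)


image = [[10, 12, 50],
         [11, 13, 52]]
mask = [[0, 0, 1],
        [0, 0, 1]]
contrast = gray_level_contrast(image, mask)
print(contrast)
```

With this normalization the measure lies between 0 and 1 for non-negative gray levels, and a segmentation that separates bright from dark pixels scores higher than one that mixes them.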
3.3. Goodness based on region shape

Not only the gray level, but also the form of a segmented region can be taken into account to design goodness measures that satisfy the human intuition about an "ideal" segmentation. Sahoo et al.(8) proposed a shape measure (SM) for evaluating several threshold selection algorithms, which is defined as:

$$SM = \frac{1}{C} \left\{ \sum_{x,y} \mathrm{Sgn}\left[ f(x,y) - f_{N(x,y)} \right] g(x,y)\, \mathrm{Sgn}\left[ f(x,y) - T \right] \right\}, \qquad (7)$$

where $f_{N(x,y)}$ is the average gray value of the neighborhood $N(x, y)$ of a pixel located at $(x, y)$ with gray level $f(x, y)$ and gradient value $g(x, y)$, $T$ is the threshold value selected for segmentation, $C$ is a normalization factor and Sgn(.) is the unit step function.

4. EMPIRICAL DISCREPANCY METHODS

In practical segmentation applications, some errors in the segmented image can be tolerated. On the other hand, if the segmenting image is complex and the algorithm used is fully automatic, errors are inevitable.(27) The disparity between an actually segmented image and a correctly/ideally segmented image (reference image), which is the best expected result, can be used to assess the performance of algorithms. Both images (actually segmented and reference) are obtained from the same input image. The reference image is sometimes called the gold standard [e.g. reference (27)]. When the test images are synthetic, the reference images can be obtained simply from the image generation procedure,(28) while when the test images are real images, manually segmented images (with the help of visual inspection) are often used as references. The methods in this group take into account the difference (measured by various discrepancy parameters) between the actually segmented and reference images, i.e. these methods try to determine how far the actually segmented image is from the reference image. A higher value of the discrepancy measure implies a bigger error in the actually segmented image relative to the reference image, and this indicates a lower performance of the applied segmentation algorithms.

In image encoding, the disparity between the original image and the decoded image has often been used to objectively assess the performance of coding algorithms. A commonly used discrepancy measure is the mean-square signal-to-noise ratio [see, e.g. reference (29)]. However, in contrast to image encoding, image segmentation is a process that changes the image unit. In other words, image encoding is an image processing process, while image segmentation is an image analysis process, in which the input and output are different matters. So many other discrepancy measures have been proposed and used.
4.1. Discrepancy based on the number of mis-segmented pixels

Considering image segmentation as a pixel classification process, the percentage of pixels misclassified is the discrepancy measure that comes most readily to mind.(30) Suppose an image consists of $N$ pixel classes; a confusion matrix $C$ of dimension $N$ can be constructed, where each entry $C_{ij}$ represents the number of class $j$ pixels classified as class $i$ by the segmentation algorithm. Two error types can thus be computed for each pixel class $k$, both of which can be used to describe the class-by-class performance of these algorithms.(30) The multi-class Type I error is defined as:

$$M_{\mathrm{I}}^{(k)} = 100 \times \frac{\left( \sum_{i=1}^{N} C_{ik} \right) - C_{kk}}{\sum_{i=1}^{N} C_{ik}}, \qquad (8)$$

where the numerator represents the number of pixels of class $k$ not classified as $k$ and the denominator is the total number of pixels of class $k$.

The multi-class Type II error is defined as:

$$M_{\mathrm{II}}^{(k)} = 100 \times \frac{\left( \sum_{i=1}^{N} C_{ki} \right) - C_{kk}}{\left( \sum_{i=1}^{N} \sum_{j=1}^{N} C_{ij} \right) - \sum_{i=1}^{N} C_{ik}}, \qquad (9)$$

where the numerator represents the number of pixels of other classes called class $k$. The denominator is the total number of pixels of other classes. In equations (8) and (9), each pixel class is weighted equally.
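The two class-wise error types can be sketched directly from their verbal definitions. The sketch assumes the confusion-matrix convention stated above (entry [i][j] counts class j pixels classified as class i) and expresses both errors as percentages; the function name and the counts are hypothetical.

```python
def type_errors(confusion, k):
    """Return (Type I, Type II) errors in percent for pixel class k.

    confusion[i][j] is the number of class j pixels classified as class i.
    """
    n = len(confusion)
    class_k_total = sum(confusion[i][k] for i in range(n))   # all true class k pixels
    called_k_total = sum(confusion[k][j] for j in range(n))  # all pixels labelled k
    grand_total = sum(sum(row) for row in confusion)
    # Type I: class k pixels not classified as k, relative to all class k pixels.
    type_1 = 100.0 * (class_k_total - confusion[k][k]) / class_k_total
    # Type II: pixels of other classes called k, relative to all other-class pixels.
    type_2 = 100.0 * (called_k_total - confusion[k][k]) / (grand_total - class_k_total)
    return type_1, type_2


# Two classes, 100 true pixels each (illustrative counts).
confusion = [[90, 5],
             [10, 95]]
print(type_errors(confusion, 0))
print(type_errors(confusion, 1))
```

For class 0 in this example, 10 of its 100 pixels are missed (Type I) and 5 of the 100 pixels of the other class are wrongly assigned to it (Type II).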
Weszka and Rosenfeld(21) used a similar approach to measure the difference between an "ideal" (correct) image and a thresholded image. Under the assumption that the image consists of objects and background, each having a specified distribution of gray level, they compute, for any given threshold value, the probability of misclassifying an object pixel as background, or vice versa. This probability in turn provides an index of segmentation results, which can be used for evaluating threshold selection algorithms. In their work, such a probability is minimized in the process of selecting an appropriate threshold.

Recently, a discrepancy measure based on the same principle has been defined. It is termed the probability of error (PE). For a two-class problem PE can be calculated by:(31)

$$PE = P(O) \times P(B \mid O) + P(B) \times P(O \mid B), \qquad (10)$$

where $P(B \mid O)$ is the probability of error in classifying objects as background, $P(O \mid B)$ is the probability of error in classifying background as objects, and $P(O)$ and $P(B)$ are the a priori probabilities of objects and background in images. For multi-class problems, a general definition of PE can be found in reference (32).
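A minimal sketch of the two-class probability of error follows. It assumes, for illustration only, that all four probabilities are estimated empirically from pixel counts in a binary reference mask and a binary segmented mask; the function name and the masks are hypothetical.

```python
def probability_of_error(reference, segmented):
    """Empirical PE = P(O) * P(B|O) + P(B) * P(O|B) for two binary masks.

    Object pixels are 1 and background pixels are 0 in both masks.
    Both classes are assumed to be present in the reference.
    """
    n_obj = n_bkg = obj_as_bkg = bkg_as_obj = 0
    for ref_row, seg_row in zip(reference, segmented):
        for ref, seg in zip(ref_row, seg_row):
            if ref:
                n_obj += 1
                obj_as_bkg += 0 if seg else 1
            else:
                n_bkg += 1
                bkg_as_obj += 1 if seg else 0
    total = n_obj + n_bkg
    p_o, p_b = n_obj / total, n_bkg / total
    return p_o * (obj_as_bkg / n_obj) + p_b * (bkg_as_obj / n_bkg)


reference = [[1, 1, 0, 0],
             [1, 1, 0, 0]]
segmented = [[1, 0, 0, 0],
             [1, 1, 1, 0]]
print(probability_of_error(reference, segmented))
```

With empirical estimates of this kind, PE reduces to the overall fraction of misclassified pixels (here 2 out of 8).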
The idea of computing discrepancy based on the number of error pixels is also reflected in some edge-detection evaluation schemes. For example, a maximum likelihood estimate of the fraction of correctly detected edges has been used by Fram and Deutsch.(33) Such a measure could be readily extended to measure what fraction of the segmented object pixels were actually object pixels, so as to be applied for segmentation evaluation.
4.2. Discrepancy based on the position of mis-segmented pixels

The discrepancy measures based only on the number of mis-segmented pixels do not take into account the spatial information of these pixels. It is thus possible that images segmented differently have the same discrepancy measure values if these measures only count the number of mis-segmented pixels. To address this problem, some discrepancy measures based on pixel position error have been proposed.

One way is to use the distance between the mis-segmented pixel and the nearest pixel that actually belongs to the mis-segmented class. Let $N$ be the number of mis-segmented pixels for the whole image and $d(i)$ be a distance metric from the ith mis-segmented pixel to the nearest pixel that actually is of the misclassified class; a discrepancy measure ($D$) based on this distance is defined by Yasnoff et al. as:

$$D = \sum_{i=1}^{N} d^2(i) \qquad (11)$$

In equation (11), each distance is squared. This measure is further normalized (ND), to remove the influence of image size and to give it a suitable value range, by:(30)

$$ND = 100 \times \sqrt{D} / A, \qquad (12)$$

where $A$ is the total number of pixels in the image (i.e. a measure of area).
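The position-based discrepancy can be sketched with a brute-force nearest-pixel search. The sketch assumes Euclidean distance, assumes that every class assigned in the segmented image also occurs somewhere in the reference, and assumes the normalization 100 * sqrt(D) / A; the function name and the one-row test images are hypothetical.

```python
import math


def distance_discrepancy(reference, segmented):
    """Return (D, ND) for two label images of equal shape.

    For each mis-segmented pixel, the squared Euclidean distance to the
    nearest reference pixel of the class it was assigned to is accumulated.
    """
    height, width = len(reference), len(reference[0])
    coords = {}
    for y in range(height):
        for x in range(width):
            coords.setdefault(reference[y][x], []).append((y, x))
    d_total = 0.0
    for y in range(height):
        for x in range(width):
            assigned = segmented[y][x]
            if assigned != reference[y][x]:
                d_total += min((y - yy) ** 2 + (x - xx) ** 2
                               for yy, xx in coords[assigned])
    nd = 100.0 * math.sqrt(d_total) / (height * width)
    return d_total, nd


reference = [[0, 0, 0, 1, 1]]
near_error = [[0, 0, 1, 1, 1]]  # one wrong pixel next to the true boundary
far_error = [[1, 0, 0, 1, 1]]   # one wrong pixel far from the true boundary
print(distance_discrepancy(reference, near_error))
print(distance_discrepancy(reference, far_error))
```

Both test segmentations contain exactly one mis-segmented pixel, so a count-based measure cannot separate them, whereas the distance-based measure penalizes the error that lies farther from the true region.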
In the evaluation of edge detectors a commonly used discrepancy measure is the mean-square distance figure of merit (FOM) proposed by Pratt:(34)

$$FOM = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{1 + p \times d^2(i)}, \qquad (13)$$

where $N = \max(N_i, N_a)$, $N_i$ and $N_a$ denote the number of ideal and actually detected edge pixels, respectively, $d(i)$ denotes the distance between the ith detected edge pixel and its correct position and $p$ is a scaling parameter. This measure has been shown to be insensitive to correlation in false alarms and missed edges.(35) Strasters and Gerbrands used FOM for evaluating segmentation results, with $N$ denoting the number of pixels in the image and $d(i)$ denoting the distance between the ith pixel and its correct class.(36) In addition, they defined a modified version of FOM, named FOM_e, to expand the FOM value range near perfect segmentation:

$$FOM_e = \begin{cases} \dfrac{1}{N_e} \sum_{i=1}^{N_e} \dfrac{1}{1 + p \times d^2(i)} & \text{if } N_e > 0 \\ 1 & \text{if } N_e = 0, \end{cases} \qquad (14)$$

where $N_e$ denotes the number of mis-segmented pixels.
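A minimal sketch of the figure of merit is shown below. It assumes the form FOM = (1/N) * sum of 1 / (1 + p * d(i)^2) with N = max(N_i, N_a), and it takes the list of distances for the detected pixels as given; the function name, the scaling value and the distances are hypothetical.

```python
def figure_of_merit(distances, n_ideal, n_actual, p):
    """Mean-square distance figure of merit.

    `distances` holds d(i) for each detected pixel, `n_ideal` and `n_actual`
    are the numbers of ideal and actually detected pixels, and `p` is the
    scaling parameter (its value is an illustrative choice).
    """
    n = max(n_ideal, n_actual)
    return sum(1.0 / (1.0 + p * d * d) for d in distances) / n


perfect = figure_of_merit([0, 0, 0, 0], n_ideal=4, n_actual=4, p=0.25)
shifted = figure_of_merit([0, 0, 2, 2], n_ideal=4, n_actual=4, p=0.25)
print(perfect, shifted)
```

A perfect detection gives 1, and the value decreases as detected pixels move away from their correct positions.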
4.3. Discrepancy based on the number of objects in the image

For perfect segmentation a necessary condition is that the reference image and the segmented image contain an equal number of objects of each class. A substantial disagreement in the object number indicates a large discrepancy between the reference and segmented images. Yasnoff and Bacus(37) proposed to compute the object-count-agreement (OCA) based on probability theory. Let $R_i$ be the number of objects of class $i$ in the reference image and $S_i$ be the number of objects of class $i$ in the segmented image; they use the probability $F_{OCA}$ that the two numbers $R_i$ and $S_i$ represent samples from the same distribution for measuring the OCA:

$$F_{OCA} = \int_L^{\infty} \frac{1}{2^{M/2}\, \Gamma(M/2)}\, t^{(M-2)/2}\, e^{-t/2}\, dt \qquad (15)$$

In equation (15), $M = N - 1$ denotes the number of degrees of freedom, $\Gamma(.)$ denotes the Gamma function and $L$ can be computed by:

$$L = \sum_{i=1}^{N} \frac{(S_i - R_i)^2}{p R_i}, \qquad (16)$$

where $N$ is the number of object classes and $p$ is a correlation parameter.

On the basis of a similar idea, another weighting scheme called fragmentation (FRAG) is defined as:(36)

$$FRAG = \frac{1}{1 + p \times |T_s - A_s|^q}, \qquad (17)$$

where $T_s$ is the true object number in the reference image, $A_s$ is the actual object number in the segmented image, and $p$ and $q$ are scaling parameters.
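The object-count measures can be sketched as follows. The sketch assumes the fragmentation form 1 / (1 + p * |T - A|^q) and, for the count statistic L, a chi-square-like form that sums (S_i - R_i)^2 / (p * R_i) over the classes; both function names and all parameter values are hypothetical.

```python
def fragmentation(true_count, actual_count, p=1.0, q=1.0):
    """FRAG from the true and actual object numbers (p, q are scaling parameters)."""
    return 1.0 / (1.0 + p * abs(true_count - actual_count) ** q)


def count_statistic(reference_counts, segmented_counts, p=1.0):
    """Chi-square-like statistic L over the per-class object counts.

    Assumes every reference count is non-zero.
    """
    return sum((s - r) ** 2 / (p * r)
               for r, s in zip(reference_counts, segmented_counts))


print(fragmentation(5, 5), fragmentation(5, 8))
print(count_statistic([10, 20], [12, 18]))
```

FRAG equals 1 when the object numbers agree and falls toward 0 as the segmentation splits or merges objects; the statistic L is 0 for perfect agreement and grows with the disagreement in counts.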
4.4. Discrepancy based on the feature values of segmented objects

Image analysis is concerned with the extraction of information from an image; an image in yields data out.(38) Here the data are the measurement values of object features obtained from segmented images. One fundamental question in image analysis is whether a measurement made on the objects from segmented images is as accurate as one made on the original images. According to this measure, a segmented image has the highest quality if the object features extracted from it precisely match the features in the original. In practice, an image has high quality if the decision made on it is unchanged from that made on the original image.(39) The ultimate goal of image segmentation in the context of image analysis is to obtain measurements of object features.(40) The accuracy of these measurements obtained from the segmented image with respect to the reference image provides useful discrepancy measures. This accuracy can be termed "ultimate measurement accuracy" (UMA) to reflect the ultimate goal of segmentation. The UMA is feature dependent and so can be denoted as $UMA_f$. Let $R_f$ denote the feature value obtained from the reference image and $S_f$ denote the feature value measured from the segmented image; the absolute $UMA_f$ ($AUMA_f$) and relative $UMA_f$ ($RUMA_f$) are defined as:(40)

$$AUMA_f = |R_f - S_f| \qquad (18)$$

$$RUMA_f = \frac{|R_f - S_f|}{R_f} \times 100\% \qquad (19)$$
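The absolute and relative ultimate measurement accuracy can be sketched for one simple feature. The sketch uses object area (pixel count of a binary mask) as the feature, which is only one possible choice; the function names and masks are hypothetical.

```python
def area(mask):
    """Object area as the number of non-zero pixels in a binary mask."""
    return sum(1 for row in mask for value in row if value)


def absolute_uma(r_f, s_f):
    """AUMA_f = |R_f - S_f|."""
    return abs(r_f - s_f)


def relative_uma(r_f, s_f):
    """RUMA_f = |R_f - S_f| / R_f * 100 (in percent); R_f must be non-zero."""
    return abs(r_f - s_f) / r_f * 100.0


reference = [[1, 1, 0, 0],
             [1, 1, 0, 0]]
segmented = [[1, 1, 1, 0],
             [1, 1, 0, 0]]
r_f, s_f = area(reference), area(segmented)
print(absolute_uma(r_f, s_f), relative_uma(r_f, s_f))
```

Replacing `area` with another feature extractor (for example a perimeter or form-factor routine) yields a different member of the same family of discrepancy measures.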
Both $AUMA_f$ and $RUMA_f$ can represent a number of discrepancy measures when different object features are used. The features can be densitometric, static or geometric features. Some examples of geometric features are the area, bending energy, eccentricity, form factor, normalized mean absolute curvature, perimeter and sphericity of objects.(4) Among them, the area of objects is more suitable than the others to appraise the quality of differently segmented images.(2,4)

4.5. Discrepancy based on miscellaneous quantities

There are other discrepancy measures that can describe the difference between the reference image and the segmented image. The discrepancy measure proposed by Levine and Nazif(41) is a 2-D (two-dimensional) distance measure based on two components. One is an under-merging error measure and the other is an over-merging error measure. The former component is proportional to the amount by which the regions in the segmented image overlap the regions in the reference image. The latter component signifies the amount by which the segmented regions partition the reference regions.

Not only the spatial information, but also the gray-level information can be used to describe the difference between the segmented image and the reference image. Strasters and Gerbrands(36) defined a figure of certainty (FOC) to take this information into account. Let $f_i$ be the gray level of the ith pixel in the reference image and $g_i$ be the representative gray level of the region comprising the ith pixel in the segmented image (note that both images are taken as masks here to extract $f_i$ and $g_i$ from the image to be segmented); the FOC is defined as:

$$FOC = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{1 + p \times |f_i - g_i|^q}, \qquad (20)$$

where $N$ denotes the total number of pixels in the image and $p$ and $q$ are scaling parameters.

If we consider both the segmented image and the reference image as probability distributions, the difference between them is reflected by their divergence. Suppose that the segmented image has $N$ regions and $p_i^s$ represents the a posteriori probability of a pixel being in the ith region, while $p_i^r$ is that in the reference image; Pal and Bhandari(25) proposed to use the symmetric divergence (SD):

$$SD = \sum_{i=1}^{N} \left( p_i^s - p_i^r \right) \ln \frac{p_i^s}{p_i^r}, \qquad (21)$$

as a measure of performance for the segmentation algorithms.

The three method groups for segmentation evaluation described in the above sections have their own characteristics. In the following, their advantages and limitations are discussed.
5.1. Generality for evaluation

One desirable property of an evaluation method is its generality, i.e. that it can be applied for studying various properties of different segmentation algorithms. To apply analytical methods, some formal models of an image should first be defined. The behavior of the algorithm on such an image can then be analysed (mathematically) in terms of the parameters of the image and the algorithm.(42) Certain properties of segmentation algorithms can be easily obtained just by analysis, such as the processing strategy of algorithms and the resolution of segmentation results.(18) However, some other properties cannot be precisely analysed since no formal model exists. For example, there is no quantitative measure for a priori knowledge about images that can be incorporated into segmentation algorithms,(14) so various types of knowledge can hardly be compared. In addition, there are methods that are applicable only to certain segmentation algorithms. For instance, the method based on the detection probability ratio is suitable only for studying simple edge detectors.(16)
Empirical methods, as described in Sections 3 and 4, are mainly used to study the correctness of segmentation algorithms by taking into account the accuracy of segmentation results. One reason is that limitations in other properties of algorithms, such as computation cost, have been partially overcome by the progress of technology. Another reason is that the accuracy of segmentation is often the primary concern in real applications and is difficult to study by analytical methods. From the point of view that only one property is studied, the empirical methods can be thought of as somewhat limited. However, most of them can be considered as relatively general, because they can evaluate different types of segmentation algorithms. The studies presented in references (9, 23, 43, 44) are some examples in which quite different types of algorithms are treated. In most empirical studies, only the images to be segmented and the segmented images are needed, no matter which type of algorithm is used. A few exceptions are the methods based on busyness(21) and the shape measure.(8) Since the threshold value is necessary for calculating these measures, other types of algorithms cannot be evaluated.
5.2. Qualitative versus quantitative and subjective versus objective
Two more desirable properties of an evaluation method are the abilities to evaluate segmentation
algorithms in a quantitative way and on an objective basis. Quantitative study can provide
precise results reflecting the exactness of evaluation.(2) Objective study removes the influence
of the human factor and provides consistent and unbiased results.(35) Generally, analytical
methods are more ready to apply, but they often provide only qualitative properties of
algorithms. Empirical methods are normally quantitative, as the values of quality measures can
be numerically computed. Among them, goodness methods based on subjective measures of image
quality are less suitable for an objective evaluation of segmentation algorithms. Discrepancy
methods can be both objective (the gold standard available yields objective results (27)) and
5.3. Complexity for evaluation
The complexity of applying the above three groups of methods in segmentation evaluation
increases progressively. Applying empirical methods for evaluation is usually more complicated
than analysing the algorithm alone, because the algorithms must be concretely implemented and
some extra effort is needed to segment test images and to calculate the values of quality
measure parameters. The computational cost of different empirical methods is determined first by
the quality measures they use. For example, the object count agreement can be easily obtained,
while the uniformity measure and shape measure need much more computation.
Among empirical methods, goodness methods are less complicated to apply than discrepancy methods
and they can be used for on-line evaluation.(2) One particular requirement associated with the
application of discrepancy methods is the reference image. Many studies use real images as test
images and manually segment them to obtain the references [for example, see reference (31)].
This process greatly increases the complexity of applying discrepancy methods. In addition,
since only real images from particular task domains were used in these studies, the evaluation
results may not be appropriate for other applications. One possible and effective alternative is
to use synthetic images.(11) The two problems associated with real images, as discussed above,
can be overcome by using well-designed synthetic images. Other advantages of synthetic images
are that they can be easily controlled and that they can be reproduced by all users.(2, 28)
5.4. Consideration of segmentation applications
The effective use of domain-dependent knowledge in computer vision can help to make different
processes reliable and efficient [see, for example, reference (45)]. To evaluate segmentation
algorithms effectively, it is also important to consider the applications in which the
algorithms are used.
The above three method groups differ in the extent to which they explicitly consider the
applications for which the segmentation algorithms are used. At one extreme are the analytical
studies, which do not consider the nature and goal of the application. The evaluation results
depend only on the analysis of the algorithms themselves. The empirical goodness methods, in
which some desirable properties of segmented images are quantified by goodness measures, begin
to address the application issue, as the choice of which goodness measure to use is related to
the application goal. The empirical discrepancy methods, which take both the reference and
segmented images into consideration, attempt to capture the application through the discrepancy
measures. The need to have a reference forces the evaluation to be connected to the actual
applications.(27)
In empirical studies the segmentation algorithm is applied to test images and statistics of its
performance are gathered with the help of some measurements from segmentation results. Most
empirical evaluation methods are developed independently and no comparison of performance or
behavior with other methods has been made. Since a number of methods have been proposed, as
described in the above sections, their comparison becomes important and necessary.
The performance of different empirical methods can be compared according to their behavior in
judging the same sequence of segmented images. This sequence of images can be obtained by
thresholding an image with a number of ordered threshold values.(21) As we know, the quality of
thresholded images is better if an appropriate threshold value is used and worse if the selected
threshold values are too high or too low. In other words, if the threshold value increases or
decreases in one direction, the probability of erroneously classifying the background pixels as
object pixels goes down, but the probability of erroneously classifying the object pixels as
background pixels goes up, or vice versa. Since different evaluation methods use different
measures to assess this quality, they will behave differently for the same sequence of images.
By comparing the behavior of different methods in such a case, the performance of different
methods can be revealed and
On the basis of this idea, a comparative study of different empirical methods has been carried
out. The five methods studied (and the measures they are based on) are the following:
(1) G-GU: goodness based on gray-level uniformity [see equation (3) in Subsection 3.1];
(2) G-GC: goodness based on gray-level contrast [see equation (6) in Subsection 3.2];
(3) D-PE: discrepancy based on probability of error [see equation (10) in Subsection 4.1];
(4) D-ND: discrepancy based on normalized distance [see equation (12) in Subsection 4.2];
(5) D-AA: discrepancy based on absolute UMA_f with area as the feature [see equation (18) in
Subsection 4.4].
These five methods belong to five different method subgroups. They are considered for the
comparative study mainly because the measures these methods are based on are quite general and
so are comparable. The methods in other subgroups and the measures they are based on are less
general. For example, the method based on the shape measure defined in equation (7) of
Subsection 3.3 can only account for the local smoothness of the region boundary and cannot even
distinguish a circle from a square.(2) On the other hand, the measure based on the number of
objects in the image is only meaningful when the segmentation results are quite poor. In
near-perfect segmentation, the numbers of objects in the reference image and the segmented image
are often the same and the discrimination power of this measure is lost.
The whole experiment can be divided into several steps: define test images, segment test images,
apply evaluation methods, measure quality parameters and compare evaluation results. It is
arranged similarly to the study of object features in the context of image segmentation
evaluation.(11) A similar process has also been discussed by Haralick (10) for characterizing
computer vision algorithms.
Test images are synthetically generated with the system described in reference (28). Since our
main concern is to compare different evaluation methods with the same segmented images, some
simple images are synthesized. They are 256 x 256 with 256 gray levels. The objects are centered
discs of various sizes with gray level 144. The background is homogeneous with gray level 112.
Independent zero-mean Gaussian noise with various standard deviations is then added to these
images. To cope with the random nature of noise, for each standard deviation five noise samples
are generated independently and added separately to noise-free images in this study. Five test
images thus generated form a test group. Figure 2 gives an example.
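The construction of one test group can be sketched in Python with NumPy. This is only an illustration under stated assumptions, not the generator of reference (28): the function name `make_test_group`, the disc radius, the rounding and clipping to 8-bit values, and the random seed are all hypothetical choices. Only the image size, the two gray levels, the zero-mean Gaussian noise and the five samples per group come from the text.

```python
import numpy as np


def make_test_group(radius, sigma, n_samples=5, size=256,
                    object_level=144, background_level=112, seed=0):
    """Return (reference_mask, list of noisy images) for one test group.

    radius and seed are illustrative assumptions; size, gray levels and
    the number of noise samples follow the description in the text.
    """
    rng = np.random.default_rng(seed)
    yy, xx = np.mgrid[:size, :size]
    centre = (size - 1) / 2.0
    # Reference segmentation: a disc centred in the image.
    reference = (xx - centre) ** 2 + (yy - centre) ** 2 <= radius ** 2
    clean = np.where(reference, object_level, background_level).astype(float)

    group = []
    for _ in range(n_samples):
        # Independent zero-mean Gaussian noise for every sample.
        noisy = clean + rng.normal(0.0, sigma, size=clean.shape)
        # Assumption: quantise back to 256 gray levels.
        group.append(np.clip(np.rint(noisy), 0, 255).astype(np.uint8))
    return reference, group


reference, group = make_test_group(radius=40, sigma=8.0)
```

Because the reference mask is known exactly by construction, no manual segmentation is needed to apply the discrepancy measures to such images.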
Fig. 2. A group of test images.

Table 1. Comparison results of different evaluation methods
Label     1     2     3     4     5     6     7     8     9    10    11    12    13    14
G-GC  0.989 0.994 0.997 0.997 0.998 0.999 0.999 0.999 0.999 1.000 0.999 0.998 0.997 0.995
G-GU  1.000 0.897 0.858 0.846 0.821 0.808 0.804 0.800 0.800 0.800 0.808 0.825 0.854 0.906
D-ND  0.705 0.538 0.454 0.415 0.362 0.292 0.260 0.238 0.290 0.382 0.466 0.583 0.719 1.000
D-PE  0.578 0.340 0.242 0.202 0.154 0.100 0.079 0.066 0.099 0.170 0.254 0.395 0.573 1.000
D-AA  0.526 0.340 0.241 0.203 0.149 0.092 0.042 0.017 0.077 0.161 0.252 0.395 0.573 1.000

Fig. 3. Plot of the comparison results listed in Table 1 (horizontal axis: labels 1 to 14;
curves shown include G-GC, G-GU, D-PE and D-AA).

Test images are segmented by thresholding them as described above. A sequence of 14 threshold
values labelled from 1 to 14 is taken to segment each group of images. The five evaluation
methods are then applied to the segmented images. The values of the corresponding measures are
obtained by averaging the results of five measurements over each group. In Table 1 comparison
results of the five methods for one experiment are presented as examples. The labels correspond
to the sequence of segmented images. In other words, each column in Table 1 indicates a
different threshold applied to a group of images. The measure values have been normalized to the
range [0, 1] for easy comparison. In Fig. 3 the curves corresponding to the measure values
listed in Table 1 are plotted. These curves can be analysed by comparing their forms.
Firstly, as the worst segmentation results give the value one for all measures, the valley
values that correspond to the best segmentation results determine the margin between the two
extremes. The deeper the valley, the larger the dynamic range of the measure for assessing the
best and worst segmentation results. Comparing the depth of valleys, these methods can be ranked
in the order D-AA, D-PE, D-ND, G-GU, G-GC. Note that the G-GC curve is almost unity for all
segmented images (as can be seen more clearly from Table 1), so that different segmentation
results can hardly be distinguished in such a case.
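The valley-depth ranking can be reproduced directly from the values in Table 1. The short Python sketch below takes the depth of a curve to be the difference between its largest and smallest value; this definition and the names `table1` and `valley_depth` are assumptions made for illustration, since the text does not give a formula.

```python
# Normalised measure values copied from Table 1 (threshold labels 1 to 14).
table1 = {
    "G-GC": [0.989, 0.994, 0.997, 0.997, 0.998, 0.999, 0.999,
             0.999, 0.999, 1.000, 0.999, 0.998, 0.997, 0.995],
    "G-GU": [1.000, 0.897, 0.858, 0.846, 0.821, 0.808, 0.804,
             0.800, 0.800, 0.800, 0.808, 0.825, 0.854, 0.906],
    "D-ND": [0.705, 0.538, 0.454, 0.415, 0.362, 0.292, 0.260,
             0.238, 0.290, 0.382, 0.466, 0.583, 0.719, 1.000],
    "D-PE": [0.578, 0.340, 0.242, 0.202, 0.154, 0.100, 0.079,
             0.066, 0.099, 0.170, 0.254, 0.395, 0.573, 1.000],
    "D-AA": [0.526, 0.340, 0.241, 0.203, 0.149, 0.092, 0.042,
             0.017, 0.077, 0.161, 0.252, 0.395, 0.573, 1.000],
}


def valley_depth(values):
    """Dynamic range of a curve: worst (largest) minus best (smallest) value."""
    return max(values) - min(values)


# Deepest valley first.
ranking = sorted(table1, key=lambda name: valley_depth(table1[name]), reverse=True)
print(ranking)
```

With the tabulated values the depths are 0.983, 0.934, 0.762, 0.200 and 0.011, which gives the order stated in the text.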
Second, for evaluation purposes a good method should be capable of detecting very small
variations in segmented images. The sharper the curve, the higher the measure's capability to
discriminate small segmentation degradations. The ranking of these five methods according to
this point is the same as above. Looking more closely, though the D-AA and D-PE curves are
parallel or even overlap in most cases in Fig. 3, the D-AA curve is much sharper than the D-PE
curve near the valley. This means that D-AA has more power than D-PE to distinguish slightly
different, near-best segmentation results, which is of more interest in practice/~61 It is
clear that D-AA should not be confused with D-PE, as was done by Beghdadi et al.(47) On the
other hand, the flatness of the G-GC and G-GU curves around the valley shows that the methods
based on goodness measures such as GC and GU are less appropriate for segmentation evaluation.
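To make the two best-ranked discrepancy measures concrete, the following Python sketch computes them over a sequence of thresholds. It rests on assumptions: the probability of error is taken as the fraction of misclassified pixels, the absolute UMA with area as the feature is taken as the absolute difference between the reference and segmented object areas, and normalization divides by the largest value of each curve. The test image, the disc radius, the noise level and the 14 threshold values are hypothetical, as the text does not list the ones actually used.

```python
import numpy as np


def probability_of_error(reference, segmented):
    """Fraction of pixels whose object/background label differs from the reference."""
    return float(np.mean(reference != segmented))


def absolute_area_difference(reference, segmented):
    """Absolute difference between reference and segmented object areas (pixels)."""
    return abs(int(reference.sum()) - int(segmented.sum()))


def normalised(values):
    """Scale a curve so that its largest value becomes one."""
    values = np.asarray(values, dtype=float)
    peak = values.max()
    return values / peak if peak > 0 else values


def evaluate_threshold_sequence(image, reference, thresholds):
    """Return the normalised D-PE and D-AA curves for an ordered threshold sequence."""
    pe, aa = [], []
    for t in thresholds:
        segmented = image >= t  # object pixels are the brighter class
        pe.append(probability_of_error(reference, segmented))
        aa.append(absolute_area_difference(reference, segmented))
    return normalised(pe), normalised(aa)


# Hypothetical test image: disc at level 144 on background 112, Gaussian noise.
rng = np.random.default_rng(0)
yy, xx = np.mgrid[:256, :256]
reference = (xx - 127.5) ** 2 + (yy - 127.5) ** 2 <= 40 ** 2
image = np.where(reference, 144.0, 112.0) + rng.normal(0.0, 8.0, (256, 256))

thresholds = np.linspace(115, 141, 14)  # assumed values, labelled 1 to 14
pe_curve, aa_curve = evaluate_threshold_sequence(image, reference, thresholds)
```

Plotting `pe_curve` and `aa_curve` against the threshold labels gives valley-shaped curves of the kind discussed above; in the study itself the values are additionally averaged over the five images of a group.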
The effectiveness of evaluation methods is largely determined by the image quality measures they
employ. From this comparative study, it becomes evident that the evaluation methods using
discrepancy measures, such as those based on the feature values of segmented objects and those
based on the number of mis-segmented pixels, are more powerful than the evaluation methods using
other measures. Moreover, as the methods compared in this study are representative of various
method subgroups, it seems that the empirical discrepancy methods surpass the empirical goodness
methods in evaluation.
7.1. Special evaluation methods
There are also a few particular evaluation methods that do not fall clearly into any one of the
above three groups. The following is a critical review of them.
(1) For a general segmentation procedure, preprocessing and postprocessing are often needed (see
Fig. 1). In practical applications, when an automatically segmented image is not perfect, some
manual editing operations are often needed to bring the results to a level that satisfies the
desired quality.(27) The amount of such operations, or the cost of performing them, can also
provide an index of how far the segmented image deviates from the desired quality. This index
has been used by Graaf et al.(27, 48) to validate segmentation results and to judge the
performance of algorithms. Since the desired quality level of a segmentation is determined by
the particular processing task, such a method makes a task-directed evaluation and depends on
the tools available for image editing.(48) More generally, one tries to estimate the requirement
for pre- and/or post-processing to obtain satisfactory segmentation results from the raw
images.(18) In a sense, it is not the segmentation algorithms but the pre- and/or
post-processing algorithms that are studied.
(2) In image analysis the size of a region is obtained by counting the number of pixels
belonging to this region.(38) The mis-segmented pixels modify the size of regions in segmented
images. This size change can easily be observed by human eyes. Instead of defining numerical
discrepancy measures, MacAulay and Palcic proposed a qualitative evaluation method.(49) In their
study comparing four simple thresholding algorithms, a segmentation is determined to be
acceptable if the area of the segmented objects matches, within a margin of 5%, the area of the
visually detected object. If a large number of images are processed, a statistical study of the
results can help to compare the performance of the tested algorithms. This method is quite
similar to the methods described in Subsection 4.1, except that the discrepancy is qualitatively
and visually measured. Most subjective comparison studies are based on similar principles.
(3) To select an appropriate threshold value for segmentation, Brink (50) proposed a
thresholding technique that uses a gray-level correlation measure. An optimum threshold is
selected by maximizing the correlation between the original image and the thresholded bilevel
image. The value of the correlation measure provides an index of the dissimilarity between these
two images. This measure has been used in the evaluation of thresholding algorithms by Pal and
Bhandari.(25) In contrast to the discrepancy methods described in Section 4, this method takes
the image to be segmented directly as the "reference" image. Although this correlation measure
seems different in appearance from other measures, it has been proved (51) that the square of
the correlation coefficient used in Brink's method is just the class separability quotient used
by Otsu (26) in the "goodness" measure for threshold selection. This method should thus behave
similarly to the method based on inter-region contrast.
(4) Taking the image to be segmented as the reference is also the approach followed by Beghdadi
et al.(47) They proposed to use a measure termed the blurring effect for segmentation
comparison. A noise-free synthetic image is generated and is then blurred with a Gaussian
filter. The authors unusually set the blurred boundary pixels as object pixels and thus
curiously take enlarged objects as references. The blurring effect is measured by the location
difference between the detected boundary and the reference boundary. Such a use of synthetic
images loses their advantages in evaluation. In addition, the effect of noise, a very important
and common degradation factor influencing the performance of algorithms, cannot be studied by
such a method.
(5) Different from all the above methods, Bryant and Bouldin (52) proposed another interesting
evaluation procedure based on relative grading for edge detectors. The principle may be extended
to evaluating segmentation algorithms. No precise quality measure or criterion is defined in
this procedure. It consists of comparing the output of an algorithm to the consensus results of
other algorithms. In other words, it compares the output of a number of algorithms and rates
each algorithm by how often it agrees with the consensus of the others. This can be considered
an interesting idea, but it is blind to errors made by all algorithms and may even penalize a
good algorithm that does not produce errors made by a majority of bad algorithms.(2)
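The equivalence mentioned in item (3), between the squared correlation coefficient and the class separability quotient, can be checked numerically. The Python sketch below is an illustration under assumptions, not Brink's or Otsu's original formulation: the bilevel image is coded as 0/1 (the correlation coefficient is unchanged by any positive rescaling of the two levels), the separability quotient is taken as between-class variance divided by total variance, and the function names and random test image are hypothetical.

```python
import numpy as np


def squared_correlation(image, threshold):
    """Squared correlation between an image and its thresholded bilevel version."""
    pixels = image.ravel().astype(float)
    bilevel = (pixels >= threshold).astype(float)
    if bilevel.min() == bilevel.max():
        return 0.0  # one class is empty, correlation is undefined
    r = np.corrcoef(pixels, bilevel)[0, 1]
    return float(r * r)


def class_separability(image, threshold):
    """Between-class variance divided by total variance for a two-class split."""
    pixels = image.ravel().astype(float)
    mask = pixels >= threshold
    w1 = mask.mean()
    w0 = 1.0 - w1
    if w0 == 0.0 or w1 == 0.0:
        return 0.0
    mu0 = pixels[~mask].mean()
    mu1 = pixels[mask].mean()
    between = w0 * w1 * (mu1 - mu0) ** 2
    return float(between / pixels.var())


rng = np.random.default_rng(1)
image = rng.integers(0, 256, size=(64, 64))
for t in (50, 100, 150, 200):
    print(t, squared_correlation(image, t), class_separability(image, t))
```

For every threshold the two printed numbers agree up to rounding error, so maximizing either one selects the same threshold.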
7.2. Common problems for most existing methods
There are still two main problems associated with most existing evaluation methods.
(1) Each evaluation method determines the performance of algorithms according to certain
criteria. If the same criterion used for segmentation is also used for evaluation, then some
biased results will be produced.(2) For example, the second-order local entropy was maximized
for selecting threshold values in the new algorithm proposed by Pal and Pal (24) and was also
computed for comparing the performance of this algorithm with that of other algorithms by Pal
and Bhandari.(25) It is to be expected that the new algorithm produces quite high entropy
values. In many applications, images are modeled as a mosaic of regions of uniform intensity
corrupted by additive Gaussian white noise [e.g. reference (53)]. Therefore, region homogeneity
is a commonly used criterion for designing various segmentation algorithms [e.g. the Otsu
algorithm (26)]. The method using the goodness measure based on uniformity takes the same
criterion for evaluation. When this criterion is used to compare a number of thresholding
algorithms,(8) it is not surprising that the Otsu (26) algorithm ranks first. When other
criteria were used, however, the ranking order became completely different.(8)
(2) To strengthen certain aspects in the quality measures, some scaling/weighting parameters are
often used. For example, the parameter p in FOM [see equation (13)] provides a relative penalty
between smeared edges and isolated but offset edges,(34) while the parameters p and q in FOC
[see equation (20)] determine the contribution of a large deviation relative to a small
deviation.(36) There exists no suitable guideline or rule for choosing these parameters. In
practice, they are often selected on the basis of human intuition or judgment. As a result, an
evaluation that is expected to be objective is undesirably influenced by subjective factors.
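The influence of such a parameter can be illustrated with the figure of merit. Equation (13) is not reproduced in this section, so the Python sketch below assumes the usual form of Pratt's figure of merit, FOM = (1 / max(N_ideal, N_detected)) * sum over detected edge pixels of 1 / (1 + p d^2), where d is the distance from a detected edge pixel to the nearest ideal edge pixel. The function name, the brute-force distance computation, the tiny test images and the default p = 1/9 (a commonly quoted choice) are illustrative assumptions.

```python
import numpy as np


def figure_of_merit(ideal_edges, detected_edges, p=1.0 / 9.0):
    """Figure of merit for boolean edge maps, assuming the usual form of Pratt's FOM."""
    ideal = np.argwhere(ideal_edges)
    detected = np.argwhere(detected_edges)
    if len(ideal) == 0 or len(detected) == 0:
        return 0.0
    # Squared distance from every detected pixel to its nearest ideal pixel.
    diff = detected[:, None, :] - ideal[None, :, :]
    d2 = (diff ** 2).sum(axis=2).min(axis=1)
    return float(np.sum(1.0 / (1.0 + p * d2)) / max(len(ideal), len(detected)))


ideal = np.zeros((10, 10), dtype=bool)
ideal[:, 5] = True              # ideal edge: one vertical line

offset = np.zeros_like(ideal)
offset[:, 6] = True             # detected edge displaced by one pixel

smeared = ideal | offset        # detected edge two pixels wide

for p in (1.0 / 9.0, 1.0):
    print(p, figure_of_merit(ideal, offset, p), figure_of_merit(ideal, smeared, p))
```

With p = 1/9 the offset and smeared edges score 0.90 and 0.95; with p = 1 they score 0.50 and 0.75. The same two edge maps are thus separated by very different margins depending on a value chosen by the evaluator.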
In this paper most methods proposed for segmentation evaluation and comparison so far are
reviewed. A method classification scheme is introduced. Comparative studies for different method
groups and for different methods are also carried out, both analytically and experimentally.
Segmentation evaluation is indispensable for improving the performance of existing segmentation
algorithms and for developing new powerful segmentation algorithms. This study attempts to
stimulate the work in this direction. To make segmentation get off trial-and-error status,
further studies and more efforts for segmentation evaluation are
From this study some results concerning the performance of different evaluation methods are
obtained. As there is currently no general segmentation theory, the empirical methods are more
suitable and useful than the analytical methods for performance evaluation of segmentation
algorithms. Among empirical methods, the discrepancy methods are better for objectively
assessing segmentation algorithms than the goodness methods, although the former are somewhat
more complex in application than the latter due to the requirement for a reference. According to
the experimental comparison made in this paper, the method D-AA is more powerful for evaluation
than the other methods. More general studies are still being carried out.
Each method studied in this paper has advantages and limitations. From an application point of
view, those that belong to different groups are more complementary than competitive. Besides,
the performance of segmentation algorithms is influenced by many factors, so a single evaluation
method is not enough to judge all properties of an algorithm, and different methods should be
used together. One early work of this type was done by Yasnoff et al.,(54) who combined two
error measures they proposed, namely pixel spatial distribution and pixel class proportion,(30)
into one generalized measure. Later they also incorporated another component, the object count
agreement.(37) Other evaluation studies using several measures can be found in references
(23, 25, 36, 43, 44). Generally, for a complete evaluation and comparison of segmentation
techniques, a set of performance measures is necessary. ~9' ~s) How to form such a set will be
a promising research subject in segmentation evaluation.
Acknowledgement--We are very grateful to the reviewer for
his helpful comments and valuable suggestions to improve
the presentation of this paper.
1. T. Pavlidis, Image analysis, Ann. Rev. Comput. Sci. 3, 121-146 (1988).
2. Y. J. Zhang and J. J. Gerbrands, Objective and quantitative segmentation evaluation and comparison, Sign. Process. 39, 43-54 (1994).
3. E. M. Riseman and M. A. Arbib, Survey: computational techniques in the visual segmentation of static scenes, CGIP 6, 221-276 (1977).
4. J. S. Weszka, A survey of threshold selection techniques, CGIP 7, 259-265 (1978).
5. K. S. Fu and J. K. Mui, A survey on image segmentation, Pattern Recognition 13, 3-16 (1981).
6. R. M. Haralick and L. G. Shapiro, Survey: image segmentation techniques, CVGIP 29, 100-132 (1985).
7. V. I. Borisenko, A. A. Zlatotol, I. B. Muchnik, Image segmentation (state of the art survey), Automat. Remote Control 48, 837-879 (1987).
8. P. K. Sahoo, S. Soltani, A. K. C. Wong, Y. C. Chen, A survey of thresholding techniques, CVGIP 41, 233-260
9. N. R. Pal and S. K. Pal, A review on image segmentation techniques, Pattern Recognition 26, 1277-1294 (1993).
10. R. M. Haralick, Performance characterization in computer vision, CVGIP-IU 60, 245-249 (1994).
11. Y. J. Zhang and J. J. Gerbrands, Segmentation evaluation using ultimate measurement accuracy, SPIE 1657, 449-460 (1992).
12. R. M. Haralick and L. G. Shapiro, Computer and Robot Vision. Addison-Wesley, New York (1992).
13. A. Rosenfeld and L. S. Davis, Image segmentation and image models, Proc. IEEE 67, 764-772 (1979).
14. C. E. Liedtke, T. Gahm, F. Kappei, B. Aeikens, Segmentation of microscopic cell scenes, AQCH 9, 197-211 (1987).
15. Y. J. Zhang and J. J. Gerbrands, Transition region determination based thresholding, Pattern Recognition Lett. 12, 13-23 (1991).
16. I. E. Abdou and W. K. Pratt, Quantitative design and evaluation of enhancement/thresholding edge detectors, Proc. IEEE 67, 753-763 (1979).
17. C. Garbay, Image structure representation and processing: a discussion of some segmentation methods in cytology, IEEE Trans. PAMI-8, 140-146 (1986).
18. Y. J. Zhang, Comparison of segmentation evaluation criteria, Proc. 2ICSP 870-873 (1993).
19. J. J. Gerbrands, Segmentation of noisy images, Doctoral Thesis, Delft University of Technology, Delft, The Netherlands (1988).
20. M. D. Levine and A. Nazif, Dynamic measurement of computer generated image segmentations, IEEE Trans. PAMI-7, 155-164 (1985).
21. J. S. Weszka and A. Rosenfeld, Threshold evaluation techniques, IEEE Trans. SMC-8, 622-629 (1978).
22. R. M. Haralick, K. Shanmugam, I. Dinstein, Textural features for image classification, IEEE Trans. SMC-3, 610-622 (1973).
23. A. M. Nazif and M. D. Levine, Low level image segmentation: an expert system, IEEE Trans. PAMI-6, 555-577
24. N. R. Pal and S. K. Pal, Entropic thresholding, Sign. Process. 16, 97-108 (1989).
25. N. R. Pal and D. Bhandari, Image thresholding: some new techniques, Sign. Process. 33, 139-158 (1993).
26. N. Otsu, A threshold selection method from gray-level histogram, IEEE Trans. SMC-9, 62-66 (1979).
27. C. N. Graaf, A. S. E. Koster, K. L. Vincken, M. A. Viergever, Validation of the interleaved pyramid for the segmentation of 3D vector images, Pattern Recognition Lett. 15, 467-475 (1994).
28. Y. J. Zhang and J. J. Gerbrands, On the design of the test images for segmentation evaluation, Proc. E UROSCO-92 1, 551-554 (1992).
29. R. C. Gonzalez and P. Wintz, Digital Image Processing. Addison-Wesley, New York (1987).
30. W. A. Yasnoff, J. K. Mui, J. W. Bacus, Error measures for scene segmentation, Pattern Recognition 9, 217-231
31. S. U. Lee, S. Y. Chung, R. H. Park, A comparative performance study of several global thresholding techniques for segmentation, CVGIP 52, 171-190 (1990).
32. Y. W. Lim and S. U. Lee, On the color image segmentation algorithms based on the thresholding and fuzzy c-means techniques, Pattern Recognition 23, 935-952
33. J. R. Fram and E. S. Deutsch, On the quantitative evaluation of edge detection schemes and their comparison with human performance, IEEE Trans. C-24, 616-628
34. W. K. Pratt, Digital Image Processing. John Wiley and Sons, New York (1978).
35. F. Heyden, Evaluation of edge detection algorithms, Proc. 3ICIPA 618-622 (1989).
36. K. Strasters and J. J. Gerbrands, Three-dimensional image segmentation using a split, merge and group approach, Pattern Recognition Lett. 12, 307-325 (1991).
37. W. A. Yasnoff and J. W. Bacus, Scene segmentation algorithm development using error measures, AQCH 6, 45-58 (1984).
38. I. T. Young, Sampling density and quantitative microscopy, AQCH 10, 269-275 (1988).
39. P. C. Cosman, R. M. Gray, R. A. Olshen, Evaluating quality of compressed medical images: SNR, subjective rating, and diagnostic accuracy, Proc. IEEE 82, 919-932
40. Y. Zhang, Influence of image segmentation over feature measurement, Pattern Recognition Lett. 16, 201-206
41. M. D. Levine and A. Nazif, An experimental rule based system for testing low level segmentation strategies, in Multi-Computers and Image Processing: Algorithms and Programs, K. Preston and L. Uhr, eds., pp. 149-160. Academic Press, New York (1982).
42. L. J. Kitchen and J. A. Malin, The effect of spatial discretization on the magnitude and direction response of simple differential edge operators on a step edge, CVGIP 47, 243-258 (1989).
43. Y. J. Zhang, Image synthesis and segmentation comparison, Proc. 3ICYCS 8.21-8.24 (1993).
44. Y. J. Zhang, Segmentation evaluation and comparison: a study of various algorithms, SPIE 2094, 801-812
45. Y. Shirai, Three-Dimensional Computer Vision. Springer-Verlag, Berlin (1987).
46. Y. J. Zhang and J. J. Gerbrands, Comparison of thresholding techniques using synthetic images and ultimate measurement accuracy, Proc. 11ICPR 3, 209-213
47. A. Beghdadi, A. Negrate, P. V. Lesegno, Entropic thresholding using a block source model, GMIP 57, 197-205
48. C. N. Graaf, A. S. E. Koster, K. L. Vincken, M. A. Viergever, Task-directed evaluation of image segmentation methods, Proc. 11ICPR 3, 219-222 (1992).
49. C. MacAulay and B. Palcic, A comparison of some quick and simple threshold selection methods for stained cells, AQCH 10, 155-164 (1988).
50. A. D. Brink, Gray-level thresholding of images using a correlation criterion, Pattern Recognition Lett. 9, 335-341 (1989).
51. I. Cseke and Z. Fazekas, Comments on gray-level thresholding of images using a correlation criteria, Pattern Recognition Lett. 11, 209-210 (1990).
52. D. J. Bryant and D. W. Bouldin, Evaluation of edge operators using relative and absolute grading, Proc. IEEE PRIP 138-145 (1979).
53. P. C. Chen and T. Pavlidis, Image segmentation as an estimation problem, CGIP 12, 153-172 (1980).
54. W. A. Yasnoff, W. Galbraith, J. W. Bacus, Error measures for objective assessment of scene segmentation algorithms, AQC 1, 107-121 (1979).
About the Author--YU JIN ZHANG received the Ph.D. degree in applied science from the State
University of Liège, Liège, Belgium, in 1989. From 1989 to 1993, he was with the Department of
Applied Physics and the Department of Electrical Engineering at the Delft University of
Technology, Delft, The Netherlands, as a post-doctoral fellow and scientific staff. In 1993 he
joined the Department of Electronic Engineering at Tsinghua University, Beijing, People's
Republic of China. He has been associate professor of information and signal processing since
1993. His research interests include image processing, analysis and understanding, computer
graphics, computer and machine vision, pattern recognition and artificial intelligence, as well
as their applications. He is the first author of more than 20 research papers in the above fields.