What Are We Tracking: A Unified Approach of Tracking and Recognition
Jialue Fan, Xiaohui Shen, Student Member, IEEE, and Ying Wu, Senior Member, IEEE
evidence. Online models for low-level correspondences are generally employed to adapt to the changing appearances of the target [1]–[3]. However, one notable shortcoming of these online models is that they are constructed and updated based on the previous appearance of the target without much semantic understanding. Therefore, they are limited in predicting unprecedented states
Fig. 1. Traditional methods may fail in these complicated scenarios, while our approach handles them well. (a) Rapid appearance change. (b) Object morphing.
the same low-level appearance. Therefore, the discriminative information provided by the car category, i.e., the high-level correspondences, can be utilized to help successfully track the target as the evidence is accumulated.
Fig. 2. Framework of the proposed method. In the first few frames, the object category is unknown, so
our tracking procedure only relies on the online target model, which is the same as traditional tracking.
Meanwhile, video-based object recognition is applied on the tracked objects. When the target is recognized
properly, the offline target model will be automatically incorporated to provide more information about the
target.
The online target model is $M^L_t = (z_1, \ldots, z_t)$. The estimation relies on the input image sequence $I_t$ as well as the offline model. Since $x_t$ and $c_t$ are coupled, they cannot be estimated simultaneously. Therefore we employ a two-step EM-like method here: at time $t$, we first estimate $x_t$ (i.e., tracking), and then estimate $c_t$ based on the new tracking result $z_t = I_t(x_t)$ (i.e., recognition). In the first few frames, only the online object model is considered. For every frame, the online target model is updated as $M^L_t = (z_1, \ldots, z_t)$. In this paper, the state $x_t = \{x, y, w, h\}$ consists of the target central position $(x, y)$, its width $w$, and its height $h$. In the next subsections, we will present these two steps in detail.
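To make the alternation concrete, a minimal Python sketch of the two-step loop follows. All callables here (candidates_fn, e_online, e_offline, classify) are hypothetical stand-ins for the search space, the online/offline energies, and the recognition step; this is an illustration of the loop structure, not the paper's implementation.

# Minimal sketch of the two-step, EM-like alternation (illustrative only;
# all callables below are hypothetical stand-ins, not the paper's code).

def unified_tracking(frames, candidates_fn, e_online, e_offline, classify):
    """Alternate tracking (estimate x_t) and recognition (estimate c_t)."""
    x, c = None, "others"                       # category unknown at first
    results = []
    for image in frames:
        offline = e_offline.get(c)              # offline term once recognized
        def energy(cand):                       # Eq. (2): E_t + E_d
            z = image[cand]                     # z_t = I_t(x_t)
            return e_online(z) + (offline(z) if offline else 0.0)
        # candidates_fn(None) should yield initial candidates (e.g., detection).
        x = min(candidates_fn(x), key=energy)   # step 1: tracking
        c = classify(image[x])                  # step 2: recognition
        results.append((x, c))
    return results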
B. Tracking Procedure

Different from traditional tracking, the estimation of $x_t$ in our approach relies on the online model $M^L_{t-1}$, the offline model $M^H_{c_{t-1}}$ selected by the previous recognition result $c_{t-1}$, and the current input image $I_t$. In a Bayesian perspective, we have

$$x_t = \arg\max_{x_t} p\big(x_t \mid M^L_{t-1}, M^H_{c_{t-1}}, I_t\big) = \arg\max_{x_t} p\big(M^L_{t-1} \mid x_t, I_t\big)\, p\big(M^H_{c_{t-1}} \mid x_t, I_t\big). \qquad (1)$$

Please note that we absorb the normalization factor into the likelihood terms. Maximizing the likelihood in Eq. (1) can be formulated as an energy minimization problem by defining the energy term $E = -\ln p$. Therefore

$$x_t = \arg\min_{x_t} E_t(x_t) + E_d(x_t) \qquad (2)$$

where $E_t(x_t)$ is the energy term related to tracking and $E_d(x_t)$ is the energy term related to detection. The term "detection" is consistent with the widely used term "tracking-by-detection" in the state-of-the-art literature.
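As a quick sanity check on the $E = -\ln p$ identity, maximizing the product of likelihoods in Eq. (1) is exactly minimizing the sum of energies in Eq. (2); the probability values below are made up for illustration.

import math

# E = -ln p: the product of likelihoods in Eq. (1) becomes the sum of
# energies in Eq. (2). The probability values are made up for illustration.
p_online, p_offline = 0.8, 0.5       # p(M^L | x_t, I_t), p(M^H | x_t, I_t)
E_t, E_d = -math.log(p_online), -math.log(p_offline)
assert abs((E_t + E_d) - (-math.log(p_online * p_offline))) < 1e-12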
The exemplar energy term is

$$E_e(x_t) = \sum_{j=1}^{r} \big\| f_e(z_t) - \mathrm{NN}_{j,c_t}\big(f_e(z_t)\big) \big\| \qquad (6)$$

where $\mathrm{NN}_{j,c_t}(f_e(z_t))$ denotes the $j$-th nearest neighbor of the feature $f_e(z_t)$ among the exemplars of class $c_t$.
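Under the reconstruction of Eq. (6) above, the exemplar term is a sum of distances to the $r$ nearest class exemplars in feature space. A NumPy sketch follows; the feature vectors and the exemplar set are random placeholders, not real descriptors.

import numpy as np

def exemplar_energy(f_z, exemplars, r=3):
    """E_e of Eq. (6): summed distance from feature f_z to its r nearest
    neighbors among the rows of `exemplars` (one class's exemplar set)."""
    dists = np.linalg.norm(exemplars - f_z, axis=1)
    return np.sort(dists)[:r].sum()

# Toy usage with random 64-D features (placeholders only).
rng = np.random.default_rng(0)
print(exemplar_energy(rng.normal(size=64), rng.normal(size=(100, 64))))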
Since the minimization does not have an analytic solution, we perform a coarse-to-fine search to obtain $x_t$ at time $t$. We construct a subset of the search space with a coarse spacing of several pixels and take the minimum-energy candidate in it (Fig. 3 gives an illustration). The target state cannot change rapidly, so we then start from this coarse optimum and refine the search locally, and the local optimum is treated as our tracking result.
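The coarse-to-fine search can be sketched as a grid search with shrinking spacing. In the sketch below, the spacings (8, 2, 1 pixels) and the window radius are placeholders, since the paper's actual values are not recoverable here, and `energy` is any function of a candidate center.

import itertools

def coarse_to_fine(energy, x0, y0, radius=16, spacings=(8, 2, 1)):
    """Greedy coarse-to-fine search around (x0, y0): scan a grid at each
    spacing, then recenter on the best cell before refining."""
    bx, by = x0, y0
    for s in spacings:
        offsets = range(-radius, radius + 1, s)
        bx, by = min(((bx + dx, by + dy)
                      for dx, dy in itertools.product(offsets, offsets)),
                     key=lambda p: energy(*p))
        radius = s                      # shrink the window around the optimum
    return bx, by

# Toy usage: quadratic bowl with minimum at (5, -3).
print(coarse_to_fine(lambda x, y: (x - 5) ** 2 + (y + 3) ** 2, 0, 0))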
The online model contributes a salient-point energy $E_s(x_t)$ (Eq. (3)), which measures the matching cost between the salient points $l^i_{t-1}$ in the previous target region and their correspondences $g^i_t$ in the current image, as well as a histogram energy $E_h(x_t)$ (Eq. (4)). From the matched salient points, we can estimate the rough global motion $(\Delta x_t, \Delta y_t)$ of the target by simply averaging the motion of each salient point. The outer boundary is then the rectangle with the center $(x_{t-1} + \Delta x_t,\ y_{t-1} + \Delta y_t)$ and the width/height $w_{t-1} + (w_{t-1} + h_{t-1})/4$ and $h_{t-1} + (w_{t-1} + h_{t-1})/4$. The illustration is shown in Fig. 3.
The recognition procedure is formulated as finding the best $c_t$ such that the cost $U(z_t, c_t)$ is low. However, recognition based on a single frame may fail under partial occlusion. Therefore, we instead find the optimal sequence $\{c_1, \ldots, c_t\}$ given the measurements:

$$\{c_1, \ldots, c_t\} = \arg\max_{c_1, \ldots, c_t \in \{C^0, C^1, \ldots, C^N\}} \prod_{i=1}^{t} p(z_i \mid c_i)\, p(c_i \mid c_{i-1}). \qquad (7)$$

The last equation assumes that $z_i$ are independent of $z_j$ and $c_j$ ($i \neq j$), and that $\{c_i\}$ is a Markov chain.

We use the same cost function $U(d, c)$ for object detection and recognition. In object detection, an exemplar-based energy term $E_d(x_t) = U(z_t, c_{t-1})$ (recall $z_t = I_t(x_t)$) is designed for each specific class $c_{t-1}$, which can be decomposed as the weighted sum of its component energy terms.

The inference algorithm is usually used in particle-filter-style object tracking research, and the gradient-descent method is usually used when the objective function has a nice property (like mean-shift). For other tracking research which does not reside in those two directions, the naive exhaustive search is often applied [2], [3], [10], [12].
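Under the Markov-chain assumption, Eq. (7) is the standard HMM decoding problem [40], solvable by the Viterbi algorithm. A minimal log-space sketch follows; the score matrices are simply inputs here, with no claim about how the paper computes them.

import numpy as np

def viterbi(log_emit, log_trans, log_prior):
    """Decode argmax over {c_1..c_T} of sum_i [log p(z_i|c_i) + log p(c_i|c_{i-1})],
    i.e., Eq. (7) in log-space. log_emit: (T, N); log_trans: (N, N); log_prior: (N,)."""
    T, N = log_emit.shape
    score = log_prior + log_emit[0]              # best score ending in each class
    back = np.zeros((T, N), dtype=int)           # backpointers
    for t in range(1, T):
        cand = score[:, None] + log_trans        # transition from class i to j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):                # trace the best path backwards
        path.append(int(back[t, path[-1]]))
    return path[::-1]                            # the sequence {c_1, ..., c_t}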
1) Tracking Morphable Objects: The target category $c_t$ can be extended to describe different statuses of the object. For example, if we want to track morphable objects, we regard $\{C^1, \ldots, C^N\}$ as the set of possible statuses of the target.
IV. EXPERIMENTS

In this section, we first go through the technical details. Then we compare the proposed video-based object recognition method with still-image object recognition, followed by a sensitivity analysis of the parameters. The tracking performance in various scenarios is then evaluated.
A. Technical Details

For the gradient directions, we use the $[-1, 0, 1]$ gradient filter with no smoothing. The parameters $w_s$, $w_h$, $w_p$, and $w_e$ are chosen to make the energy terms comparable. In practice, the weight of each term is adaptively adjusted by a confidence score, which measures the performance of that term at the current frame. This ensures that the inaccuracy of one term will not impact our results, even in some extreme cases (e.g., $w_s = 0$). For example, the confidence score of $E_s$ is defined as
$$s = \frac{\sum_{x_t \in \Omega} \mathbb{1}\big(E_s(x_t) < T_s\big)}{|\Omega|} \qquad (8)$$
where $x_t$ is one candidate region at frame $t$, $\Omega$ is the search space, $\mathbb{1}(\cdot)$ is the indicator function, and $T_s$ is the threshold of $E_s$. Eq. (8) means that the more candidate regions exceed the threshold, the less confident the energy term is. Likewise, we can define the confidence scores for the other energy terms. The thresholds for the offline model (e.g., $T_p$ and $T_e$) are determined based on the training data. Different categories have different thresholds. For a certain category $c$, we compute $T_p(c)$ as follows (similarly for $T_e(c)$):
We choose one image from the training data, and collect some image regions close to the target location as positive samples. For each image region $z$, we obtain the average distance $d_p = E_p(z, c)$ between this region and its nearest neighbors from the remaining training samples. Intuitively, $d_p$ is less than $T_p(c)$, since it comes from the positive data.
Fig. 4. Comparison of video-based object recognition and image object recognition. (a) Dog. (b) Plane. We show these two because the recognition is relatively challenging.

Fig. 5. Tracking a car (frames 16, 34, 75, 109, 145, 247). The red box is the tracking result (the optimum). The green bounding boxes are the target candidates with the 2nd–5th best energy scores.
For all the training images, we obtain the distribution of $d_p$; likewise, we collect image regions far from the target location in each image, and obtain the distribution of $d_n$ for negative samples. So $T_p(c)$ is essentially the Bayes-optimal decision boundary, and it can be easily obtained numerically based on these two distributions. The thresholds for the online model (e.g., $T_s$ and $T_h$) are determined empirically.
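A sketch of obtaining $T_p(c)$ numerically as described: scan candidate thresholds and keep the one minimizing the empirical misclassification count between the two distance distributions. The distance samples below are synthetic placeholders.

import numpy as np

def bayes_threshold(d_pos, d_neg):
    """Scan candidate thresholds; pick the one minimizing classification
    error (positive distances should fall below T, negatives above)."""
    cands = np.sort(np.concatenate([d_pos, d_neg]))
    errs = [(d_pos >= T).sum() + (d_neg < T).sum() for T in cands]
    return cands[int(np.argmin(errs))]

# Toy distance samples (illustrative only): positives cluster low.
rng = np.random.default_rng(1)
T_p = bayes_threshold(rng.normal(1.0, 0.3, 200), rng.normal(2.5, 0.5, 200))
print(T_p)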
Fig. 6. Quantitative evaluation on individual components of the proposed method (Car).
TABLE I
AVERAGE CENTER LOCATION ERRORS (IN PIXELS) FOR THE MIL, OAB, VTD, DAT, AND MDT TRACKERS AND THE PROPOSED METHOD (INCLUDING THE PROPOSED (SO) AND PROPOSED (AI) VARIANTS)

TABLE II
AVERAGE CENTER LOCATION ERRORS (IN PIXELS) FOR PARAMETER SENSITIVITY ANALYSIS. FOR EACH COLUMN, WE CHANGE THE VALUE OF ONLY ONE PARAMETER BY ±20% AND FIX THE OTHERS. VIDEO CLIPS: CAR, CAR2, PLANE
Fig. 12. Tracking a car (frames 0, 51, 75, 109, 145, 210). (a) Red/green/purple/cyan/magenta bounding boxes are generated from the MIL/OAB/VTD/DAT/MDT trackers, respectively. (b) On top of the bounding box, we show the recognition result (C stands for Car). The dashed bounding box means that we do not know the target category at the beginning.
Fig. 13. Tracking a dog (frames 0, 45, 205, 223, 245, 262). (a) Red/green/purple/cyan/magenta bounding boxes are generated from the MIL/OAB/VTD/DAT/MDT trackers, respectively. (b) On top of the bounding box, we show the recognition result (Q stands for Quadruped). The dashed bounding box means that we do not know the target category at the beginning.

Fig. 14. Multiple target tracking and recognition (frames 71, 293, 336, 364, 403, 455). We assign an object ID to each target (shown in different colors). The recognition result is given when the target is successfully recognized.
that are far from the existing target in order to avoid such redundancy. As observed in the experiment, our tracker can successfully track and recognize people and cars (except in the rightmost picture, where a man is bending his body and riding a bicycle, so that his contour differs from the moving people in the training set). This result can be further used for high-level tasks such as abnormal video event detection.
Fig. 15. Rock-paper-scissors game (frames 0, 73, 120, 160, 238, 252). The proposed method automatically tracks the hands, recognizes the hand gestures, and decides the winner of the game. R, P, and S stand for rock, paper, and scissors, respectively.

Fig. 16. Example of scene understanding. The system detects cars and pedestrians in the video. In (a) (frames 0, 25, 41, 130, 208, 254), the system understands that the human passes behind the car, while in (b) (frames 0, 76, 112, 142, 305, 388), the system understands that the human gets into the car. P and C stand for People and Car, respectively.
For morphable objects, we jointly estimate the target state and status:

$$\{x_t, c_t\} = \arg\min_{x_t,\, c_t \in C^{0:N}} E(x_t, c_t). \qquad (9)$$

Since $\{C^1, \ldots, C^N\}$ is not an ordered set, we lower the transition score between states. The excellent results are shown in Fig. 15.

Tracking by recognition has important applications in video understanding. Figure 16 is a simple application of scene understanding.
Fig. 17. Automatic initialization for the testing videos: Car (0), Car2 (0), Dog (45), Plane (0), and People (0).
While most object tracking methods use manual initialization, our method can automatically initialize the object by detection in the first frame. We tested the automatic initialization performance of our method. In the first frame, the target location is initialized via object detection (over our five categories). The results are shown in Fig. 17 and Table I (see column Proposed (AI)). We observe that the result is worse than with manual initialization, as an initialization error is induced through object detection. But the automatic initialization still performs better than the baseline methods in most cases. This further verifies that the feedback of the recognition module helps our object tracking.
G. About the Number of Object Categories

Generally speaking, when the number of object categories increases, the ambiguity of the tracking problem also increases. Here, we investigate the performance of the proposed algorithm versus different numbers of object categories. In the first experiment, we obtain the recognition accuracy (in the same way as Sec. IV.B) as the number of categories varies, where all the object categories are included in the extreme case. The result for the dog sequence is shown in Fig. 18.
Fig. 19. Tracking a dog (frames 0, 45, 205, 223, 245, 262) using 16 categories. The tracking ambiguity increases and the dog is wrongly classified as Cat.
recognition accuracy (Fig. 4). (2) The tracking result is determined by both the online target model and the offline model, so the online target model may adjust the target location in some cases. (3) In our work, we have designed the "others" class. The purpose of introducing this "others" class is that we want to combine the online and offline models only if the object is recognized with high confidence. If the classifier is not confident enough to recognize the object, we treat it as the "others" class. Then our tracker falls back to regular online tracking (the offline model is disabled), just as in the first few frames when the object category is unknown.
3) Dataset: Our current design is not appropriate for some tracking datasets such as MIL/VTD. The major reason is that, for the testing data, there is a gap between the object tracking literature (MIL/VTD datasets) and the object recognition literature (the PASCAL VOC series). For all machine learning techniques applied to computer vision problems, the training data should have some consistency with the testing data. Therefore, the inconsistency between the MIL/VTD datasets and the PASCAL VOC series causes the problem here. We further explain it as follows:
We think that the tracking term $E_t$ and the detection term $E_d$ should be consistent with each other, since they are both about matching the target representation to the candidate model (online vs. offline model). In the state-of-the-art object recognition literature, researchers widely use the salient-point (bag-of-words) representation, or some variant of it. Inspired by those works, we use the salient-point representation in $E_d$ and also in $E_t$. Unfortunately, this salient-point representation in $E_t$ is not well suited to sequences like those in the MIL/VTD datasets.
continuous object recognition using local feature descriptors," in Proc. Conf. Comput. Vis. Pattern Recognit., 2009, pp. 1–8.
[37] G. Takacs, V. Chandrasekhar, S. Tsai, D. Chen, R. Grzeszczuk, and B. Girod, "Unified real-time tracking and recognition with rotation-invariant fast features," in Proc. Conf. Comput. Vis. Pattern Recognit., 2010, pp. 934–941.
[38] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proc. Conf. Comput. Vis. Pattern Recognit., 2006, pp. 2169–2178.
[39] O. Chum and A. Zisserman, "An exemplar model for learning object classes," in Proc. Conf. Comput. Vis. Pattern Recognit., 2007, pp. 1–8.
[40] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.
[41] O. Boiman, E. Shechtman, and M. Irani, "In defense of nearest-neighbor based image classification," in Proc. Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8.
[42] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. Conf. Comput. Vis. Pattern Recognit., 2005, pp. 886–893.
[43] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. (2007). The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html
[44] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proc. Conf. Comput. Vis. Pattern Recognit., 2005, pp. 524–531.
[45] J. Kwon and K. M. Lee, "Visual tracking decomposition," in Proc. Conf. Comput. Vis. Pattern Recognit., 2010, pp. 1–8.