IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 4, APRIL 2014

Visual Tracking via Discriminative Sparse Similarity Map
Bohan Zhuang, Huchuan Lu, Senior Member, IEEE, Ziyang Xiao, and Dong Wang

Abstract: In this paper, we cast the tracking problem as finding the candidate that scores highest in an evaluation model based upon a matrix called the discriminative sparse similarity map (DSS map). This map captures the relationship between all the candidates and the templates, and it is constructed from the solution to a novel optimization formulation named the multi-task reverse sparse representation formulation, which searches multiple subsets of the whole candidate set to simultaneously reconstruct multiple templates with minimum error. A customized APG method is derived to obtain the optimal solution (in matrix form) within several iterations. This formulation allows the candidates to be evaluated accurately in parallel, rather than one by one as most sparsity-based trackers do, while also considering the relationship between candidates; it is therefore superior in terms of cost-performance ratio. The discriminative information contained in this map comes from a large template set with multiple positive target templates and hundreds of negative templates. A Laplacian term is introduced to keep the similarity level of the coefficients in accordance with the similarities of the candidates, thereby making our tracker more robust. A pooling approach is proposed to extract the discriminative information in the DSS map, easily yet effectively separating good candidates from bad ones and finally yielding the optimal tracking results. Extensive experimental evaluations on challenging image sequences demonstrate that the proposed tracking algorithm performs favorably against state-of-the-art methods.
Index Terms: Object tracking, sparse representation, appearance model.

I. INTRODUCTION

Visual tracking, one of the fundamental topics in computer vision, has long played a critical role in numerous applications such as surveillance, military reconnaissance, motion recognition, and traffic monitoring. While many breakthroughs have been made within the last decades (e.g., [7]-[16]), tracking still remains challenging in many respects, including pose variation, illumination change, partial occlusion, camera motion, and background clutter, as we demonstrate in Fig. 1.

Manuscript received August 20, 2013; revised December 18, 2013 and
February 17, 2014; accepted February 17, 2014. Date of publication
February 26, 2014; date of current version March 14, 2014. This work was
supported in part by the Natural Science Foundation of China under Grants
61071209 and 61272372, and in part by the Joint Foundation of China
Education Ministry and China Mobile Communication Corporation under
Grant MCM20122071. The associate editor coordinating the review of this
manuscript and approving it for publication was Prof. Richard J. Radke.
The authors are with the School of Information and Communication
Engineering, Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian 116024, China (e-mail:
zhuangbohan2013@gmail.com; lhchuan@dlut.edu.cn; 461179822@qq.com;
wangdong.ice@gmail.com).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2014.2308414

Fig. 1. Challenges during tracking in real-world environments, including heavy occlusion (Woman), abrupt motion (Face), illumination change (Singer1), pose variation (Girl), and complex background (Cliffbar). We use blue, green, black, yellow, magenta, cyan, and red rectangles to represent the tracking results of OSPT [1], APGL1 [2], LSAT [3], ASLAS [4], MTT [5], SCM [6], and the proposed method, respectively.

A general way to construct a robust tracking system involves two key components: a motion model, e.g., a particle filter [17] or Kalman filter [18], which forecasts the likely movements of the target over time to supply the tracker with a number of candidate states; and an observation model (or appearance model), which evaluates the likelihood of each candidate state being the true target state and selects the best candidate as the tracking result for the current frame. The latter is the core of a tracking system.
Since sparse representation was first introduced into visual tracking by Mei and Ling [19], it has been employed to build various efficient trackers (we refer to them as sparse trackers in the remainder of this paper) with favorable experimental performance against other state-of-the-art trackers. In [19], the feature vector of a candidate state is reconstructed by both the target templates and the trivial templates (accounting for noisy pixels) with sparsity and nonnegativity constraints on the reconstruction coefficients.



The likelihood of a candidate being the true target is then measured by its error in being reconstructed by the target templates. This method requires solving the $\ell_1$ minimization problem as many times as the number of candidates, making it quite computationally expensive. To explore more efficient solutions within the same framework, an approximate solution is developed in [20] to reduce the number of particles that need to be sparsely decomposed, and an efficient gradient descent approach is introduced in [2] to accelerate the solving of the $\ell_1$ minimization problem.
Sharing the candidate evaluation scheme of [19], some other sparsity-based tracking algorithms build new formulations with customized sparse constraint terms. In [3], Liu et al. select a sparse set of features for representing target objects and extend the sparsity constraint to a dynamic group sparsity constraint that accounts for the contiguous distribution of noisy pixels. Zhang et al. [5] formulate the tracking problem using sparse representation within the multi-task learning framework, in which the similarities between candidates are exploited by enforcing joint group sparsity with mixed-norm constraints. An algorithm that also considers the relevance among candidates is presented in [21], where the tracking problem is posed as a low-rank matrix learning problem.
Although these new formulations are effective in modeling the object, the reconstruction-error-based candidate evaluation scheme that they share is neither efficient nor robust. Therefore, several sparse trackers not only propose new sparsity-involved models but also introduce improvements to the candidate evaluation scheme. Liu and Sun [22] propose to use a dictionary composed of all candidates and trivial templates to represent a static object template, and view the decomposition coefficients as the similarity between all candidates and the templates. Wang et al. [1] replace the target templates with online updated PCA basis vectors, which better express the target object subspace. Meanwhile, they use an occlusion mask to explicitly consider the effect of occluded pixels when evaluating a candidate. Jia et al. [4] propose a structural local sparse appearance model that integrates local and global information of an observed image through an alignment pooling method, and the coefficients after pooling are summed to sort the candidates. Zhong et al. [6] develop two independent sparsity-based models and evaluate the candidates by integrating the information from both models.
All the aforementioned sparsity-based methods yield impressive tracking performance; however, most of them focus on measuring how closely a candidate resembles the foreground object while ignoring the background information, which makes them prone to drift when background objects are similar to the target or when the target appearance bears some similarity to the background due to partial occlusion. Although Zhong et al. [6] employ a discriminative model, it is more of an assistant to the generative model, and it makes the tracker redundant since hundreds of candidates are all evaluated twice, which doubles the number of $\ell_1$ minimization problems involved and greatly aggravates the computational complexity.


Motivated by the above analysis, we propose a reverse multi-task sparse tracking framework that projects the template matrix (containing both positive and negative templates) onto the candidate space. By selecting and weighting the discriminative sparse coefficients, the DSS map and the pooling method lead to the best candidate. Our contributions can be summarized in the following three aspects:
First, we propose a novel optimization formulation named multi-task reverse sparse representation. In our work, a single task means reconstructing a template with a few candidates that bear more similarity to the template than the others, which is the inverse of traditional sparsity-based formulations (like those in [1]-[6], [19], [20]), and multi-task means that we seek to simultaneously reconstruct multiple templates. A customized APG method is derived to obtain the optimal solution (in matrix form) within several iterations. A Laplacian term is also included to keep the similarity level of the coefficients in accordance with the similarities of the candidates, which makes our tracker more robust, as our experimental observations show. This formulation provides the tracker with the similarity relationship between all the candidates and the templates by solving only one optimization problem, without loss of accuracy. Therefore, this formulation is superior in terms of cost-performance ratio.
Second, we construct a discriminative sparse similarity map (DSS map) based upon these similarity relationships. The discriminative information contained in this map comes from a large template set composed of multiple positive target templates and hundreds of negative templates. Both the target templates and the background templates are updated online to accommodate appearance changes in and near the target area. With this DSS map, candidates are evaluated in both directions: not only how similar a candidate is to the target object but also how different it is from the background. This is also one of the key differences from most previous sparse trackers like [1]-[5], [19]-[21], [23], [24], making our tracker more robust when similar objects appear near the target or when the target appearance bears some similarity to the background due to partial occlusion.
Third, we propose a simple yet useful additive pooling method to make the best use of the information in the DSS map; before this step, the DSS map is refined with adaptive weights to remove potential instability. Through this pooling scheme, the information for each candidate is integrated into a single score, and the candidate with the highest score is regarded as the tracking result.
II. BAYESIAN INFERENCE FRAMEWORK

We carry out object tracking in a Bayesian inference framework, a technique for estimating the posterior distribution of the state variables that characterize a dynamic system, to form a robust tracking algorithm. We define the observation set of the target as $Z_t = [z_1, z_2, \ldots, z_t]$ and let $x_t$ be the state variable of an object at time $t$.

In the tracking framework, we use the affine transformation to model the object motion between two consecutive frames. Then the optimal state $\hat{x}_t$ can be computed by the maximum a posteriori (MAP) estimation,

$$\hat{x}_t = \arg\max_{x_t^i} p(x_t^i \mid Z_t) \quad (1)$$

where $x_t^i$ indicates the state of the $i$-th sample. The posterior probability can be inferred within the Bayesian framework recursively,

$$p(x_t \mid Z_t) \propto p(z_t \mid x_t) \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid Z_{t-1})\, dx_{t-1} \quad (2)$$
where $p(x_t \mid x_{t-1})$ is the dynamic model and $p(z_t \mid x_t)$ denotes the observation likelihood. The state variable $x_t$ is composed of six independent parameters $\{\alpha_1, \alpha_2, \alpha_3, \alpha_4, t_1, t_2\}$, in which $\{\alpha_1, \alpha_2, \alpha_3, \alpha_4\}$ are the deformation parameters and $\{t_1, t_2\}$ contain the 2D translation information. As the dynamic model can be modeled by a Gaussian distribution, it can be represented by

$$p(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\, x_{t-1}, \Sigma) \quad (3)$$

where $\Sigma$ is a diagonal covariance matrix whose elements are the variances of the affine parameters.
Through this method, we obtain the candidate set $Y = [y_1, y_2, \ldots, y_m] \in \mathbb{R}^{d \times m}$, in which $d$ is the feature dimension and $m$ is the number of candidates. The observation model $p(z_t \mid x_t)$ essentially reflects the likelihood of observing $z_t$ at state $x_t$. In this paper, $p(z_t \mid x_t)$ is proportional to the discriminative score obtained by applying the additive pooling scheme to the DSS map.
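As a concrete illustration of the dynamic model in Eq. (3), the following minimal NumPy sketch propagates affine-parameter particles with independent Gaussian noise; the particle count and the per-parameter variances below are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def propagate_particles(prev_states, sigma, rng):
    """Gaussian dynamic model of Eq. (3): perturb each affine-parameter
    state independently with a diagonal covariance."""
    return prev_states + rng.normal(0.0, sigma, size=prev_states.shape)

# Usage: m particles over the six affine parameters.
rng = np.random.default_rng(0)
m = 600                                                  # particle count (illustrative)
x0 = np.array([1.0, 0.0, 0.0, 1.0, 120.0, 80.0])         # deformation + 2D translation
sigma = np.array([0.01, 0.002, 0.002, 0.01, 4.0, 4.0])   # per-parameter std. deviations
particles = propagate_particles(np.tile(x0, (m, 1)), sigma, rng)
```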
III. PROBLEM FORMULATION
A. Discriminative Reverse Sparse Representation
To construct a robust tracker, the numbers of templates and candidates often amount to hundreds or even thousands, and high-dimensional features must be used to preserve rich target information. However, traditional sparse-coding-based trackers perform computationally expensive $\ell_1$ minimizations at each frame for each candidate. Hundreds of $\ell_1$ minimizations per frame make the computational load so high that such a tracker is unsuitable for processing high-dimensional image features in fast and robust tracking applications under dynamic environments. Reversing conventional sparse representation, where a candidate (an observed image patch associated with a state) is reconstructed mainly by several target templates, we construct the dictionary with the candidate set $Y$ to represent each target template, as in Eq. (4), with sparsity and nonnegativity constraints,

$$\arg\min_{c} \|t - Yc\|_2^2 + \lambda \|c\|_1, \quad \text{s.t. } c \succeq 0 \quad (4)$$

where $t$ denotes a representative template, $\lambda$ is the parameter that adjusts the sparsity penalty term, and $c$ represents the coefficient vector.
With the sparsity constraint and the goal of minimizing the reconstruction error term, only a few candidates that bear more similarity to the template are involved in representing the template. Their associated elements in $c$ are positive, and the magnitudes of these elements are assumed to reflect the similarity levels. Thus, we add the constraint $c \succeq 0$, which means all the elements of $c$ are nonnegative, because each element represents the similarity between the corresponding template and candidate, and negative elements are meaningless.
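For illustration, the single-template problem in Eq. (4) can be solved with any non-negative $\ell_1$ solver; a minimal sketch using scikit-learn's Lasso with the positive flag follows (note that sklearn rescales the data-fit term by $1/(2d)$, so its alpha is a rescaled version of the $\lambda$ in Eq. (4)).

```python
import numpy as np
from sklearn.linear_model import Lasso

def reverse_sparse_code(t, Y, lam):
    """Represent one template t (shape (d,)) with the candidate dictionary
    Y (shape (d, m)) under sparsity and nonnegativity constraints, Eq. (4)."""
    d = Y.shape[0]
    # sklearn minimizes (1/(2*d))*||t - Yc||^2 + alpha*||c||_1 with c >= 0
    # when positive=True, so alpha = lam/(2*d) matches the scaling of Eq. (4).
    model = Lasso(alpha=lam / (2.0 * d), positive=True,
                  fit_intercept=False, max_iter=5000)
    model.fit(Y, t)
    return model.coef_   # nonnegative sparse coefficient vector c, shape (m,)
```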
Beyond that, although the $\ell_1$ minimization makes the tracker efficient and adaptive to appearance change, the lack of negative templates gives it poor discriminative power, since it ignores the background information around the target, which may cause the tracker to gradually drift away from the target. Therefore, in this work, multiple positive target templates are exploited to make the tracker more responsive to a variety of appearance changes. Meanwhile, in order to better capitalize on the distinction between the foreground and the background to locate the target, we use plenty of negative templates, which are capable of fully sketching out the periphery of the target area.
The positive and negative template sets are respectively defined as $T_{pos} = [t_1, t_2, \ldots, t_p]$ and $T_{neg} = [t_{p+1}, t_{p+2}, \ldots, t_{p+n}]$, where $p$ and $n$ denote the numbers of positive and negative templates.
With these assumptions, our problem formulation is equivalent to an ensemble of sparse decomposition problems in which the templates are effectively expressed by finding the combination of the particles and the corresponding coefficients, as follows:

$$\begin{cases} \arg\min_{c_1} \|t_1 - Yc_1\|_2^2 + \lambda \|c_1\|_1 \\ \quad\quad \vdots \\ \arg\min_{c_p} \|t_p - Yc_p\|_2^2 + \lambda \|c_p\|_1 \\ \arg\min_{c_{p+1}} \|t_{p+1} - Yc_{p+1}\|_2^2 + \lambda \|c_{p+1}\|_1 \\ \quad\quad \vdots \\ \arg\min_{c_{p+n}} \|t_{p+n} - Yc_{p+n}\|_2^2 + \lambda \|c_{p+n}\|_1 \end{cases} \quad (5)$$

where $c_i = [c_i^1, c_i^2, \ldots, c_i^m]^\top$ denotes the sparse coefficients of the $i$-th template, and $c_i \succeq 0$, $i = 1, 2, \ldots, (p+n)$, means that all the elements in $c_i$ are nonnegative.
In this formulation, one template is decomposed in each sparse representation procedure through $\ell_1$ optimization, and the whole process terminates when all the positive and negative templates have been represented. We illustrate the basic idea of this formulation in Fig. 2.
The matrix formed by the reconstruction coefficient vectors of all templates is defined as the sparse map $C = [c_1, \ldots, c_{p+n}]$, which fundamentally reflects a mapping relationship between the reference templates and the candidates, i.e., the value of a map element $c_i^j$ can be understood as an indicator of the similarity between the $i$-th template and the $j$-th candidate.
The candidates that contribute more to reconstructing one template should correspond to large map elements, while those that carry little information about the template should correspond to smaller ones, in most cases zero.

Fig. 2. Problem formulation. This figure illustrates the basic idea of the multi-task reverse sparse representation scheme. (a) The positive and the negative template sets. (b) The sampled candidates. (c) The discriminative sparse similarity map (DSS map).

Meanwhile, we obtain two sub-maps, i.e., $C_{pos} = [c_1, \ldots, c_p]$ for the positive template set and $C_{neg} = [c_{p+1}, \ldots, c_{p+n}]$ for the negative one. The rightmost part of Fig. 2 is the sparse map $C$, each column of which contains the sparse coefficients of all the candidates representing one template.

B. Laplacian Multi-Task Reverse Sparse Representation

Overall, the formulation presented in Eq. (5) suffers from two principal problems. First, it still requires solving multiple $\ell_1$ minimization problems per frame, which is computationally expensive, especially when a large template set is maintained. Second, the dependence among the features of the particles is ignored, so even similar features may show unreasonable differences in their sparse representation responses, which manifests as disparities in the corresponding coefficients.
In order to alleviate these defects, we reformulate the problem of calculating the decomposition coefficients of multiple templates as a single optimization procedure, in which the optimal similarity map $C$ can be calculated as a whole. Intuitively, we propose a multi-task concept here, in which a single task means that one template can be represented as a linear combination of a few similar candidates, and, further, the multi-task setting refers to reconstructing multiple templates simultaneously. We name this procedure the multi-task reverse sparse representation problem, given in Eq. (6):

$$\arg\min_{C} \|T - YC\|_2^2 + \lambda \sum_i \|c_i\|_1, \quad \text{s.t. } c_i \succeq 0,\ i = 1, 2, \ldots, (p+n). \quad (6)$$

In addition, to preserve the similarity of the sparse codes of similar candidate features, we introduce a customized Laplacian regularization term, inspired by the success of a similar implementation for image classification [25]. To begin with, we have the following formulation:

$$\arg\min_{C} \|T - YC\|_2^2 + \lambda \sum_i \|c_i\|_1 + \frac{\gamma}{2} \sum_{ij} \|c_i - c_j\|_2^2 B_{ij}, \quad \text{s.t. } c_i \succeq 0,\ i = 1, 2, \ldots, (p+n), \quad (7)$$

where $\gamma$ is the parameter that adjusts the new regularization term and $B$ is a binary matrix indicating the relationship between any two coefficient vectors, with $B_{ij} = 1$ if $c_i$ is among the $K$ nearest neighbors of $c_j$ and $B_{ij} = 0$ otherwise.
The last part of this formula can be transformed as

$$\sum_{ij} \|c_i - c_j\|_2^2 B_{ij} = \sum_i \|c_i\|^2 D_i + \sum_j \|c_j\|^2 D_j - 2 \sum_{ij} c_i^\top c_j B_{ij} = 2\,\mathrm{tr}(CLC^\top), \quad (8)$$

where $L = D - B$ is the Laplacian matrix, the degree of $c_i$ is defined as $D_i = \sum_{j=1}^{p+n} B_{ij}$, and $D = \mathrm{diag}(D_1, D_2, \ldots, D_{p+n})$.
So the Laplacian multi-task reverse optimization problem is reformulated as

$$\arg\min_{C} \|T - YC\|_2^2 + \lambda \sum_i \|c_i\|_1 + \gamma\,\mathrm{tr}(CLC^\top), \quad \text{s.t. } c_i \succeq 0,\ i = 1, 2, \ldots, (p+n). \quad (9)$$

Let $\mathbf{1}_m \in \mathbb{R}^m$ ($m$ is the number of candidates) denote the column vector whose entries are all ones, and define $\phi(c_i)$ as

$$\phi(c_i) = \begin{cases} 0 & c_i \succeq 0 \\ +\infty & \text{otherwise.} \end{cases} \quad (10)$$

With this nonnegativity constraint, Eq. (9) can be equivalently rewritten as

$$\arg\min_{C} \|T - YC\|_2^2 + \lambda\,\mathbf{1}_m^\top C \mathbf{1}_{p+n} + \gamma\,\mathrm{tr}(CLC^\top) + \Phi(C), \quad (11)$$

where $\Phi(C) = \sum_i \phi(c_i)$. Then we apply the accelerated proximal gradient (APG) approach [2] to solve this minimization problem with

$$F(C) = \|T - YC\|_2^2 + \lambda\,\mathbf{1}_m^\top C \mathbf{1}_{p+n} + \gamma\,\mathrm{tr}(CLC^\top), \qquad G(C) = \Phi(C), \quad (12)$$

where $F(C)$ is a differentiable convex function and $G(C)$ is a non-smooth convex function. Following the APG method, we need to solve the optimization problem

$$C_{k+1} = \arg\min_{C} \frac{\tau}{2} \left\| C - \Lambda_{k+1} + \frac{\nabla F(\Lambda_{k+1})}{\tau} \right\|_2^2 + G(C), \quad (13)$$

where $\tau$ is the Lipschitz constant, since the function $F$ has a Lipschitz-continuous gradient, and the variable $k$ denotes the current iteration index.


Fig. 3. This figure illustrates how the discriminative sparse similarity map indicates whether a candidate is good or not. (a) The original discriminative similarity map. A typical good candidate and a bad one are picked as examples. (b) The process of obtaining the refined discriminative feature for the good candidate. The notation $\odot$ is the Hadamard product (element-wise product). (c) The process of obtaining the refined discriminative feature for the bad candidate. The sub-features related to the positive/negative templates are shown in red/green. Note that the positive part of the refined discriminative feature for the bad candidate is weakened by the adaptive weights.

Then we define $g_{k+1} = \Lambda_{k+1} - \frac{1}{\tau}\nabla F(\Lambda_{k+1})$. Since


$$\nabla F(\Lambda_{k+1}) = -2 Y^\top (T - Y\Lambda_{k+1}) + \gamma\,\Lambda_{k+1} (L + L^\top) + \lambda\,\mathbf{1}_m \mathbf{1}_{p+n}^\top, \quad (14)$$

we can easily obtain

$$g_{k+1} = \Lambda_{k+1} + \frac{1}{\tau} \left[ 2 Y^\top (T - Y\Lambda_{k+1}) - \gamma\,\Lambda_{k+1} (L + L^\top) - \lambda\,\mathbf{1}_m \mathbf{1}_{p+n}^\top \right], \quad (15)$$

where $\mathbf{1}_{p+n} \in \mathbb{R}^{p+n}$ ($p+n$ is the number of templates) denotes the vector whose entries are all ones. Based on the above, Eq. (13) is equivalent to

$$C_{k+1} = \max(0, g_{k+1}). \quad (16)$$

The algorithm for solving our Laplacian multi-task reverse sparse representation problem is summarized in Algorithm 1.

Algorithm 1: Optimizing the Laplacian Multi-Task Reverse Sparse Representation.

The computational complexity of each iteration in Algorithm 1 is dominated by step 3, the gradient computation. Thus, the per-frame complexity is $O(kd(p+n))$, where $k$ is the iteration number, $d$ is the feature dimension, and $p+n$ is the total number of templates.
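A minimal NumPy sketch of Algorithm 1 under our reading of Eqs. (13)-(16) is given below. The FISTA-style momentum schedule is a standard choice rather than the paper's stated one, and, since the extracted text leaves the construction of $B$ ambiguous, the sketch builds the KNN graph from the template features; both are assumptions. The default parameters follow Section VI.

```python
import numpy as np

def knn_graph_laplacian(V, K):
    """Binary KNN graph over the columns of V and its Laplacian L = D - B."""
    n = V.shape[1]
    d2 = ((V[:, :, None] - V[:, None, :]) ** 2).sum(axis=0)   # pairwise sq. distances
    B = np.zeros((n, n))
    for j in range(n):
        B[np.argsort(d2[:, j])[1:K + 1], j] = 1.0             # K nearest neighbors of j
    return np.diag(B.sum(axis=1)) - B

def apg_solve(T, Y, L, lam=0.04, gamma=0.8, tau=1.0 / 0.00018, iters=5):
    """APG iterations for Eq. (9), cf. Eqs. (13)-(16): C holds one nonnegative
    sparse coefficient column per template."""
    m, q = Y.shape[1], T.shape[1]
    C = np.zeros((m, q)); C_prev = C.copy(); t_cur = t_prev = 1.0
    ones = np.ones((m, q))
    for _ in range(iters):
        Lam = C + ((t_prev - 1.0) / t_cur) * (C - C_prev)     # momentum point
        grad = -2.0 * Y.T @ (T - Y @ Lam) + gamma * Lam @ (L + L.T) + lam * ones
        C_prev, C = C, np.maximum(0.0, Lam - grad / tau)      # Eqs. (15)-(16)
        t_prev, t_cur = t_cur, (1.0 + np.sqrt(1.0 + 4.0 * t_cur ** 2)) / 2.0
    return C
```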
IV. OBJECT TRACKING VIA THE REFINED DSS MAP

A. Weighted Discriminative Sparse Similarity Map

1) Discriminative Sparse Similarity Map: In this subsection, we further interpret the discriminative sparse similarity map. As introduced in the sections above and illustrated in Fig. 2, each column of $C$ contains the coefficients of a certain template decomposed over all candidates. It is worth noting that each row of $C$ corresponds to the responses of one candidate on all templates, which can be viewed as a discriminative feature of that candidate. For the $i$-th candidate, we have

$$f_i = [C_{i1}, \ldots, C_{ip}, C_{i(p+1)}, \ldots, C_{i(p+n)}]^\top \quad (17)$$

where $C_{ij}$ is the element in the $i$-th row and $j$-th column of $C$. From this perspective, the similarity map can be represented as

$$F = [f_1, \ldots, f_m] = C^\top, \quad (18)$$

where each column is the discriminative feature of a candidate, indicating its similarity levels to the $p$ positive templates and the $n$ negative templates.
The discriminative nature of this feature is reflected in the distribution of its larger elements, as shown in Fig. 3. For a good candidate, the indices of the larger elements in $f$ must lie in the range $[1, p]$, corresponding to several positive templates. Likewise, a bad candidate should be more similar to some negative templates, which results in larger coefficients with indices in $[p+1, p+n]$ and small or even zero coefficients for the positive templates. For the subsequent implementation, we define two sub similarity maps as $F_{pos} = C_{pos}^\top$ and $F_{neg} = C_{neg}^\top$.


2) Refined Discriminative Sparse Similarity Map: To remove potential instability and achieve better robustness, we refine the DSS map with adaptive weights. The weight $W_{ij}$ for an element $F_{ij}$ in the similarity map is constructed based on the difference between the $j$-th candidate $y_j$ and the $i$-th template $t_i$:

$$W_{ij} \propto \exp(-\|t_i - y_j\|_2^2). \quad (19)$$

A candidate with a smaller difference from a foreground template shares higher similarity with it, indicating that the candidate is more likely to be a target object, and vice versa. For the subsequent employment, we separate the weight map into two sub-maps:

$$W_{pos} = [w_1, \ldots, w_p]^\top, \qquad W_{neg} = [w_{p+1}, \ldots, w_{p+n}]^\top, \quad (20)$$

where $w_i = [W_{i1}, \ldots, W_{im}]^\top$ for $i = 1, 2, \ldots, (p+n)$.


Then we obtain two weighted DSS maps through

$$\tilde{F}_{pos} = W_{pos} \odot F_{pos}, \quad (21)$$

and

$$\tilde{F}_{neg} = W_{neg} \odot F_{neg}, \quad (22)$$

where $\odot$ is the Hadamard product (element-wise product). In the weighted DSS map, an element $\tilde{F}_{ij} = W_{ij} F_{ij}$ is supposed to be large only when the $j$-th candidate has a small difference from the $i$-th template and plays a significant role in decomposing the $i$-th template together with other candidates. Otherwise, $\tilde{F}_{ij}$ will have a small or even zero value, indicating that the $j$-th candidate bears little similarity to the $i$-th template.
An example is shown in Fig. 3(c) to illustrate the benefit of this refinement process. For the bad candidate, the sub-feature related to the positive templates (in red) is nonzero, since the positive templates might account for some minor parts of the bad candidate in the $\ell_1$ minimization process. Although these values are small, they might cause unexpected tracking results. However, by applying the adaptive weights, the refined sub-feature related to the positive templates (in red) is suppressed to be close to zero, which means the bad candidate only bears similarity to some negative templates rather than to any positive templates. In this way, we obtain the most accurate feature for each candidate and thus a convincing final candidate score.
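A sketch of the refinement step of Eqs. (19)-(22): compute the adaptive weights from template-candidate distances and apply the Hadamard product. Dropping the proportionality constant of Eq. (19) is an assumption.

```python
import numpy as np

def refined_dss_map(T, Y, C):
    """Eqs. (19)-(22): weight the DSS map F = C^T (shape (p+n, m)) with
    adaptive weights W_ij = exp(-||t_i - y_j||^2)."""
    F = C.T                                                    # DSS map, Eq. (18)
    d2 = ((T[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)    # ||t_i - y_j||^2
    return np.exp(-d2) * F                                     # Hadamard product
```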
B. Additive Pooling

For the $i$-th candidate, we view the $i$-th column of the refined similarity map $\tilde{F}$ as a refined discriminative feature:

$$\tilde{f}_i = [\tilde{F}_{1i}, \ldots, \tilde{F}_{pi}, \tilde{F}_{(p+1)i}, \ldots, \tilde{F}_{(n+p)i}]^\top, \quad (23)$$

and we have two sub-features, each representing the candidate's resemblance to the positive and negative templates:

$$\tilde{f}_i^{pos} = [\tilde{F}_{1i}, \ldots, \tilde{F}_{pi}]^\top, \qquad \tilde{f}_i^{neg} = [\tilde{F}_{(p+1)i}, \ldots, \tilde{F}_{(n+p)i}]^\top \quad (24)$$

from which we calculate the candidate's confidence $s_i$ in being the true target object through an intuitive additive pooling method consisting of two steps. First, we separately sum the largest $l$ coefficients in $\tilde{f}_i^{pos}$ and $\tilde{f}_i^{neg}$ to obtain the scores $s_i^{pos}$ and $s_i^{neg}$, indicating to what extent the $i$-th candidate is related to the positive and negative template sets. This process can be written as

$$s_i^{pos} = \mathcal{L}(\tilde{f}_i^{pos}, 1) + \cdots + \mathcal{L}(\tilde{f}_i^{pos}, l), \qquad s_i^{neg} = \mathcal{L}(\tilde{f}_i^{neg}, 1) + \cdots + \mathcal{L}(\tilde{f}_i^{neg}, l) \quad (25)$$

where $\mathcal{L}(\tilde{f}, k)$ denotes the $k$-th largest element of $\tilde{f}$; in this work we set $l$ to half the number of positive templates. Discarding the small map values that may come from uncertain interference ensures that we get more robust scores.
Second, the discriminative score for the $i$-th candidate is formulated as

$$s_i = s_i^{pos} - s_i^{neg} \quad (26)$$

and the score set for all the candidates is denoted as $S = \{s_i\}_{i=1,\ldots,m}$. This formulation is based on the assumption that a candidate with a larger foreground score and a smaller background score is more likely to be the target object, and vice versa. Namely, a true target observation should have a large discriminative score, while a bad candidate has a relatively small one. Thus, the additive pooling process is completed after the two steps defined by Eqs. (25) and (26).
The likelihood of the observation $y_i$ being the target at state $x_t$ can be constructed within the Bayesian framework as

$$p(y_i \mid x_t) \propto s_i \quad (27)$$

Finally, the target observation $y_t$ can be located by maximizing

$$p(y_t \mid x_t) = \max_i\, p(y_i \mid x_t) \quad (28)$$

We give a summary of this additive pooling scheme in Fig. 4. Note that some discriminative scores are similar to each other. This is reasonable, because we sample numerous candidates and, inevitably, some of them share similar features, which leads to similar responses to the additive pooling scheme.
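The whole of Section IV-B reduces to a few array operations; a minimal sketch (the column layout and the choice $l = p/2$ follow the text above) is:

```python
import numpy as np

def additive_pooling(F_tilde, p, l):
    """Eqs. (23)-(28): score every candidate from the refined DSS map
    F_tilde (shape (p+n, m); the first p rows relate to positive templates)."""
    s_pos = np.sort(F_tilde[:p], axis=0)[-l:].sum(axis=0)   # sum of l largest, Eq. (25)
    s_neg = np.sort(F_tilde[p:], axis=0)[-l:].sum(axis=0)
    scores = s_pos - s_neg                                  # Eq. (26)
    return scores, int(np.argmax(scores))                   # best candidate, Eq. (28)
```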
V. IMPORTANT IMPLEMENTATION SCHEMES

To make this work clear and complete, we briefly introduce some less novel but rather important implementation schemes in our work.

A. Locally Normalized Features

In this work, we adopt locally normalized features to withstand partial occlusion and moderate appearance variation. An observed image patch $A$ is partitioned into $E$ local patches, each of which is independently expressed in gray-scale values, vectorized, and normalized to a vector with unit $\ell_2$ norm. We then concatenate these local feature vectors so that the global structural information is maintained. The candidates and templates in this work are all represented with these locally normalized features.
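A minimal sketch of this feature extraction; treating $E$ as a square grid is an assumption, with the 4 x 4 grid on a 32 x 32 patch mirroring the settings reported in Section VI.

```python
import numpy as np

def locally_normalized_feature(patch, grid=4):
    """Partition a square gray-scale patch into a grid x grid set of local
    patches, normalize each to unit l2 norm, and concatenate."""
    h = patch.shape[0] // grid
    blocks = []
    for r in range(grid):
        for c in range(grid):
            v = patch[r * h:(r + 1) * h, c * h:(c + 1) * h].ravel().astype(float)
            n = np.linalg.norm(v)
            blocks.append(v / n if n > 0 else v)   # unit l2 norm per local patch
    return np.concatenate(blocks)
```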


Fig. 4. This figure illustrates how to obtain the discriminative scores for all candidates and choose the best candidate state accordingly. (a) The weighted DSS map. (b) The two score vectors after the first step of additive pooling; they indicate the degree of resemblance to the positive (top) and negative (bottom) template sets for all candidates. (c) The final discriminative score vector after the second step of additive pooling. (d) The optimal state, corresponding to the candidate with the highest score.

B. Initial Discriminative Template Sets

The first tracking result is a manually chosen rectangular area. Let the point $Q(h, v)$ be the center of the rectangular region. We sample $p$ patches as the initial positive templates around $Q(h, v)$ within a circular area satisfying $\|Q_i - Q(h, v)\| < r$, where $Q_i$ is the center of the $i$-th sampled patch. Similarly, the initial negative template set, which is updated dynamically, is sampled from the annular region $r_{in} < \|Q_j - Q(h, v)\| < r_{out}$ a few pixels away from $Q$, where $Q_j$ is the center of the $j$-th sampled image, and $r_{in}$ and $r_{out}$ are the inner and outer radii of the annular region, respectively.
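A sketch of the initial template-center sampling via rejection sampling; the concrete radii below are hypothetical, while $p = 10$ and $n = 150$ follow the settings in Section VI.

```python
import numpy as np

def sample_centers(q, num, r_min, r_max, rng):
    """Rejection-sample patch centers whose distance from the target center q
    lies in [r_min, r_max): r_min = 0 gives the circular positive region,
    a positive r_min gives the annular negative region."""
    centers = []
    while len(centers) < num:
        offset = rng.uniform(-r_max, r_max, size=2)
        if r_min <= np.linalg.norm(offset) < r_max:
            centers.append(q + offset)
    return np.array(centers)

rng = np.random.default_rng(0)
q = np.array([160.0, 120.0])                            # manually chosen target center
pos_centers = sample_centers(q, 10, 0.0, 4.0, rng)      # p = 10 positive templates
neg_centers = sample_centers(q, 150, 8.0, 30.0, rng)    # n = 150 negative templates
```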
C. Update Scheme

For the positive template set, as the target in the first frame is always the ground truth, we keep the first template in the positive template set unchanged to alleviate the drifting problem. We denote $\delta = [\delta_1, \delta_2, \ldots, \delta_p]$ as the similarity vector and set a threshold $\theta$ to describe the degree of similarity. In each frame, we measure the similarity $\delta_i$ between the current tracking result and the $i$-th positive template based on the Euclidean distance. We then compare the maximum similarity value $\Delta = \max_i \delta_i$, $i = 1, 2, \ldots, p$, with the threshold $\theta$. If $\Delta > \theta$, we use the tracking result to replace the positive template that has the largest similarity to the new target appearance. Otherwise, there is an unusually large appearance change between adjacent frames, or a significant part of the target object is occluded; in that case, we discard this bad sample without updating.
On the other hand, for the negative templates, although the background information varies greatly along the tracking process, we only sample negative templates around the tracking result of the last frame. Since the backgrounds of two successive frames are quite similar, the negative templates can be well represented by the current candidates that contain much background information. In this way, these bad candidates achieve lower scores in the subsequent pooling step and are not considered as possible tracking results, because they take part in representing the negative templates.
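A sketch of the positive-set update rule; converting the Euclidean distance into a similarity via a negative exponential is an assumption (the text does not spell out the conversion), and the guard on the first, ground-truth template follows the paragraph above. The default threshold follows Section VI.

```python
import numpy as np

def update_positive_set(T_pos, result, theta=0.4):
    """Positive-set update: if the maximum similarity between the tracking
    result and the positive templates exceeds theta, replace the most
    similar template; the first (ground-truth) template is never replaced."""
    sims = np.exp(-np.linalg.norm(T_pos - result[:, None], axis=0) ** 2)
    i = int(np.argmax(sims))
    if sims[i] > theta and i != 0:
        T_pos[:, i] = result        # otherwise the bad sample is discarded
    return T_pos
```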
VI. EXPERIMENTS

The proposed algorithm is implemented in MATLAB and runs at 2 frames per second on a PC with a 2.5 GHz Intel Core i5-2450M CPU and 4 GB of memory. The parameters, which are fixed for all sequences, are summarized as follows. In Eq. (9), the sparse regularization constant $\lambda$ is set to 0.04 and the Laplacian constraint weight $\gamma$ is 0.8. The iteration number is 5 and the Lipschitz constant $\tau$ is equal to 1/0.00018. The variables $p$ and $n$ (the numbers of positive and negative templates) are set to 10 and 150, respectively. The update threshold $\theta$ is 0.4. We resize the target image patch to 32 x 32 pixels and extract 4 x 4 local patches within the target region. We update both positive and negative templates in each frame. The MATLAB source code and datasets will be made available on our website (http://ice.dlut.edu.cn/lu/publications.html).
A. Key Component Validation

In this section, we qualitatively discuss the effect of the Laplacian constraint term and the negative templates. It is worth noticing that OWN (our algorithm without the negative template set) and OWL (our algorithm without the Laplacian constraint) also perform relatively well, from which we can conclude that the overall framework is effective. However, without the negative templates or the Laplacian constraint, the robustness of our tracker indeed decreases to some extent. As shown in Tables I and II, all the results of the proposed algorithm are better than those of OWN and OWL. From the comparison with OWL, we can conclude that the Laplacian constraint serves to increase the stability of the proposed algorithm. The OWN tracker performs relatively poorly, compared with OWL and the proposed algorithm, on sequences undergoing heavy occlusion or severe background clutter, which demonstrates the significant role of the negative templates in handling occlusion and segregating the foreground target from the background.
B. Quantitative Evaluation

We use fifteen challenging videos in the experiments to evaluate the performance of the proposed algorithm. The challenging factors of these videos include heavy occlusion, motion blur, pose variation, background clutter, and illumination change. The proposed approach is compared with eleven state-of-the-art algorithms, including the IVT [7], APGL1 [2], PN [8], VTD [9], MILTrack [10], FragTrack [11], MTT [5], OSPT [1], ASLAS [4], LSAT [3], and SCM [6] methods. In addition, two extra algorithms, OWL and OWN, are introduced for self-comparison.


TABLE I
Comparison Results in Terms of Average Center Error (in Pixels). The Best Three Results Are Shown in Red, Blue, and Green Fonts. (The Last Two Columns Are for Self-Comparison and Do Not Participate in Ranking.)

TABLE II
Comparison Results in Terms of Average Overlap Rate. The Best Three Results Are Shown in Red, Blue, and Green Fonts. The Last Row Shows Comparison Results on Computational Load in Terms of FPS. (The Last Two Columns Are for Self-Comparison and Do Not Participate in Ranking.)

For fair evaluation, we use the source code provided by the authors and run these codes with the same initial position of the target.
To assess the performance of the proposed tracker, two criteria, the center location error and the overlap rate, are employed in this paper. Note that a smaller average error or a larger overlap rate means a more accurate result. Given the tracking result of each frame $R_T$ and the corresponding ground truth $R_G$, we obtain the overlap rate by the PASCAL VOC [26] criterion,

$$\text{score} = \frac{\text{area}(R_T \cap R_G)}{\text{area}(R_T \cup R_G)}.$$

Tables I and II report the quantitative comparison results in terms of the average center location errors and average overlap rates, respectively. As shown in the tables, the proposed tracker yields favorable performance against other state-of-the-art methods.
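For reference, a minimal implementation of the PASCAL VOC overlap criterion for axis-aligned boxes in (x, y, w, h) form:

```python
def overlap_rate(rt, rg):
    """PASCAL VOC overlap score between two axis-aligned boxes (x, y, w, h)."""
    x1, y1 = max(rt[0], rg[0]), max(rt[1], rg[1])
    x2 = min(rt[0] + rt[2], rg[0] + rg[2])
    y2 = min(rt[1] + rt[3], rg[1] + rg[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = rt[2] * rt[3] + rg[2] * rg[3] - inter
    return inter / union if union > 0 else 0.0
```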
Regarding computational load, in the last row of Table II we report the comparison results in terms of fps, obtained by running all the algorithms on computers with the same configuration and on the same dataset for fair comparison. From the results we can tell that although the sparse-representation-based trackers (the APGL1, MTT, OSPT, ASLAS, LSAT, and SCM trackers and the proposed tracker) are slower than some classic trackers like the IVT and MIL trackers, they generally yield superior performance. Among the sparsity-based trackers, the proposed tracker is the best in terms of accuracy and second in speed, striking a good balance between performance and computational load.
C. Qualitative Evaluation

As the sparse trackers generally perform better than the other state-of-the-art methods and are more related to our work, we only demonstrate the comparison with them in Fig. 5, including the OSPT [1], APGL1 [2], LSAT [3], ASLAS [4], MTT [5], SCM [6], and the proposed method.


Fig. 5. Sample tracking results on fifteen challenging sequences. (a) Occlusion1 and Woman, with heavy occlusion and in-plane rotation. (b) Caviar1 and Caviar2, with heavy occlusion and in-plane rotation. (c) Face, Jumping, and Deer, with abrupt motion. (d) DavidIndoor, Singer1, and Car4, with illumination change. (e) Sylvester2008b, Girl, and Dudek, with pose variation. (f) Cliffbar and Car11, with background clutter.

Heavy Occlusion: We test four sequences (Occlusion1, Woman, Caviar1, Caviar2) characterized by either severe occlusion or long-term partial occlusion. Fig. 5(a) and (b) confirms the robustness of the proposed algorithm in dealing with rotation and scale change when the target undergoes heavy occlusion. Since two sub discriminative features are formulated to evaluate a candidate's similarity to the positive and the negative template sets respectively, even though a good candidate under occlusion bears some similarity to the background, other misaligned candidates bear more, which makes their final scores lower than that of the good candidate. Moreover, as we use the particles to reconstruct the templates, the influence of the occluded parts of the particles is effectively suppressed, because they contribute little to the reconstruction process.
Motion Blur: Fig. 5(c) presents the tracking results on the sequences Face, Jumping, and Deer. When the target object undergoes abrupt motion, it is rather difficult to accurately locate its position and account for the blur, which reduces the discriminative information in the feature vectors. It is worth noticing that the proposed method performs better than the other algorithms. Thanks to the discriminative template set and the update scheme, our tracker can maximally capture the appearance change information in and near the target area and accurately select the target from the background, even with limited discriminative information in the feature vectors when blur occurs.
Illumination Change: Fig. 5(d) demonstrates the tracking results on the sequences DavidIndoor, Singer1, and Car4 with drastic illumination change. Our tracker successfully follows the target throughout the entire sequences, which can be attributed to the locally normalized features that are highly effective in resisting illumination change. We also observe that, thanks to the template update strategy with incremental subspace learning, which enables the tracker to capture illumination change, the ASLAS algorithm achieves good performance on these sequences as well.
Rotation: The sequences Sylvester2008b, Girl, and Dudek, involving both in-plane and out-of-plane rotations, are reported in Fig. 5(e). As we use affine transformation parameters that model the rotation angle, we can capture rotating candidates for further selection. We also observe that some trackers do not adapt to scale change or in-plane rotation (e.g., LSAT, APGL1, and MTT).
Background Clutter: Fig. 5(f) shows the tracking results on Cliffbar and Car11 with complex backgrounds. By introducing both the positive template set and the negative template set to model the foreground and the background information respectively, we can obtain enough discriminative information


and store it in the DSS map. Meanwhile, the additive pooling method effectively extracts the discriminative information in the DSS map and enables our method to accurately calculate the discriminative scores and find the optimal candidate.
VII. CONCLUSION

In this paper, we propose an efficient tracking algorithm based on a discriminative sparse similarity map, which is obtained via a multi-task reverse sparse coding approach with a Laplacian constraint. The proposed formulation enjoys advantages including a light computational load, through the use of a customized APG method, and good stability, by incorporating a Laplacian term. The employment of dynamically updated positive and negative template sets supplies our tracker with sufficient discriminative information, which is stored in the DSS map and accurately integrated via an additive pooling scheme. Both quantitative and qualitative evaluations against several state-of-the-art algorithms on challenging image sequences demonstrate the accuracy and robustness of the proposed tracker.


REFERENCES

[1] D. Wang, H. Lu, and M.-H. Yang, "Online object tracking with sparse prototypes," IEEE Trans. Image Process., vol. 22, no. 1, pp. 314–325, Jan. 2013.
[2] C. Bao, Y. Wu, H. Ling, and H. Ji, "Real time robust L1 tracker using accelerated proximal gradient approach," in Proc. CVPR, 2012, pp. 1830–1837.
[3] B. Liu, J. Huang, L. Yang, and C. Kulikowski, "Robust tracking using local sparse appearance model and k-selection," in Proc. CVPR, 2011, pp. 1313–1320.
[4] X. Jia, H. Lu, and M.-H. Yang, "Visual tracking via adaptive structural local sparse appearance model," in Proc. CVPR, 2012, pp. 1822–1829.
[5] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, "Robust visual tracking via multi-task sparse learning," in Proc. CVPR, 2012, pp. 2042–2049.
[6] W. Zhong, H. Lu, and M.-H. Yang, "Robust object tracking via sparsity-based collaborative model," in Proc. CVPR, 2012, pp. 1838–1845.
[7] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, "Incremental learning for robust visual tracking," Int. J. Comput. Vis., vol. 77, nos. 1–3, pp. 125–141, 2008.
[8] Z. Kalal, J. Matas, and K. Mikolajczyk, "P-N learning: Bootstrapping binary classifiers by structural constraints," in Proc. CVPR, 2010, pp. 49–56.
[9] J. Kwon and K. M. Lee, "Visual tracking decomposition," in Proc. CVPR, 2010, pp. 1269–1276.
[10] B. Babenko, M.-H. Yang, and S. Belongie, "Visual tracking with online multiple instance learning," in Proc. CVPR, 2009, pp. 983–990.
[11] A. Adam, E. Rivlin, and I. Shimshoni, "Robust fragments-based tracking using the integral histogram," in Proc. CVPR, 2006, pp. 798–805.
[12] S. Hare, A. Saffari, and P. H. Torr, "Struck: Structured output tracking with kernels," in Proc. ICCV, 2011, pp. 263–270.
[13] M. Godec, P. Roth, and H. Bischof, "Hough-based tracking of non-rigid objects," in Proc. ICCV, 2011, pp. 81–88.
[14] L. Sevilla-Lara and E. G. Learned-Miller, "Distribution fields for tracking," in Proc. CVPR, 2012, pp. 25–33.
[15] F. Yang, H. Lu, and M.-H. Yang, "Robust visual tracking via multiple kernel boosting with affinity constraints," IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 2, pp. 242–254, Jul. 2013.
[16] F. Yang, H. Lu, and M.-H. Yang, "Learning structured visual dictionary for object tracking," Image Vis. Comput., vol. 31, no. 12, pp. 992–999, 2013.
[17] P. Pérez, C. Hue, J. Vermaak, and M. Gangnet, "Color-based probabilistic tracking," in Proc. ECCV, 2002, pp. 661–675.
[18] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 564–577, May 2003.
[19] X. Mei and H. Ling, "Robust visual tracking using $\ell_1$ minimization," in Proc. ICCV, 2009, pp. 1–10.
[20] X. Mei, H. Ling, Y. Wu, E. Blasch, and L. Bai, "Minimum error bounded efficient $\ell_1$ tracker with occlusion detection," in Proc. CVPR, 2011, pp. 1257–1264.
[21] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, "Low-rank sparse learning for robust visual tracking," in Proc. ECCV, 2012, pp. 470–484.
[22] H. Liu and F. Sun, "Visual tracking using sparsity induced similarity," in Proc. 20th ICPR, 2010, pp. 1702–1705.
[23] D. Wang and H. Lu, "On-line learning parts-based representation via incremental orthogonal projective non-negative matrix factorization," Signal Process., vol. 93, no. 6, pp. 1608–1623, 2013.
[24] D. Wang, H. Lu, and M.-H. Yang, "Least soft-threshold squares tracking," in Proc. CVPR, 2013, pp. 2371–2378.
[25] S. Gao, I. W.-H. Tsang, L.-T. Chia, and P. Zhao, "Local features are not lonely – Laplacian sparse coding for image classification," in Proc. CVPR, 2010, pp. 3555–3561.
[26] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, 2010.

Bohan Zhuang is currently pursuing the B.E. degree with the School of Information and Communication Engineering, Dalian University of Technology, Dalian, China.

Huchuan Lu (SM'12) received the M.Sc. degree in signal and information processing and the Ph.D. degree in system engineering from the Dalian University of Technology (DUT), Dalian, China, in 1998 and 2008, respectively. He joined the faculty of DUT in 1998 and is currently a Full Professor with the School of Information and Communication Engineering. His current research interests include computer vision and pattern recognition, with a focus on visual tracking, saliency detection, and segmentation. He is a member of the ACM and an Associate Editor of the IEEE Transactions on Systems, Man, and Cybernetics, Part B.

Ziyang Xiao received the B.E. degree in electronic engineering from the Dalian University of Technology, Dalian, China, in 2011, where she is currently pursuing the master's degree with the School of Information and Communication Engineering. Her research interest is in object tracking.

Dong Wang received the B.E. degree in electronic information engineering and the Ph.D. degree in signal and information processing from the Dalian University of Technology (DUT), Dalian, China, in 2008 and 2013, respectively, where he is currently a faculty member with the School of Information and Communication Engineering. His research interests include face recognition, interactive image segmentation, and object tracking.
