Multiple Camera Tracking of Interacting and Occluded Human Motion
Shiloh L. Dockstader and A. Murat Tekalp
Department of Electrical and Computer Engineering
University of Rochester, Rochester, NY 14627, USA

E-mail: dockstad@ieee.org, tekalp@ece.rochester.edu
URL: http://www.ece.rochester.edu/~dockstad/research
Abstract
We propose a distributed, real-time computing platform for tracking multiple interacting persons
in motion. To combat the negative effects of occlusion and articulated motion we use a multi-
view implementation, where each view is first independently processed on a dedicated processor.
This monocular processing uses a predictor-corrector filter to weigh re-projections of 3-D
position estimates, obtained by the central processor, against observations of measurable image
motion. The corrected state vectors from each view provide input observations to a Bayesian
belief network, in the central processor, with a dynamic, multidimensional topology that varies
as a function of scene content and feature confidence. The Bayesian net fuses independent
observations from multiple cameras by iteratively resolving independency relationships and
confidence levels within the graph, thereby producing the most likely vector of 3-D state
estimates given the available data. To maintain temporal continuity we follow the network with a
layer of Kalman filtering that updates the 3-D state estimates. We demonstrate the efficacy of the
proposed system using a multi-view sequence of several people in motion. Our experiments
suggest that, when compared with data fusion based on averaging, the proposed technique yields
a noticeable improvement in tracking accuracy.
Key Words: Kalman filtering, Bayesian network, human motion analysis, human interaction,
real-time tracking, multiple camera fusion, articulated motion, occlusion

1. Introduction
1.1. Motivation
Motion analysis and tracking from monocular or stereo video has long been proposed for
applications such as visual security and surveillance [1][2][3][4]. More recently, it has also been
employed for performing gesture and event understanding [5][6]; developing generalized
ubiquitous and wearable computing for the man-machine interface [7]; and monitoring patients
to capture the essence of epileptic seizures, assist with the analysis of rehabilitative gait, and
perform monitoring and examination of basic sleep disorders [8][9]. However, due to a
combination of several factors, reliable motion tracking still remains a challenging domain of
research. The underlying difficulties behind human motion analysis are founded mostly upon the
complex, articulated, and self-occluding nature of the human body [10]. The interaction between
this inherent motion complexity and an equally complex environment greatly compounds the
situation. Environmental conditions such as dynamic ambient lighting, object occlusions,
insufficient or incomplete scene knowledge, and other moving objects and people are just some
of the naturally interfering factors. These issues make tasks as fundamental as tracking and
semantic correspondence recognition exceedingly difficult without first outlining numerous, and
sometimes prohibitive, assumptions. Computational complexity also plays an important role, as
many of the popular applications of human motion analysis and monitoring demand real-time (or
near real-time) solutions.
1.2. Previous Work
A fundamental step in human motion analysis and tracking consists of the modeling of
moving people in image sequences. Classical 3-D modeling of moving people has taken the form
of volumetric bodies based on elliptical cylinders [11], tapered super-quadrics [12], or other
more highly parameterized primitives or configurations [13]. Rehg and Kanade [14] extend 3-D
model-based tracking to account for the significant self-occlusions inherent in the tracking of
articulated motion. Alternatives to using exact 3-D models include the use of 2-D models that
represent the projection of 3-D data onto some imaging plane [15], point distribution models
(PDM) [16], strict 2-D/3-D contour modeling [17], stick figure and medial axis modeling [18], or
even 2-D generalized shape and region modeling [19][20]. For human motion analysis, choosing
an object or motion model depends greatly on the application. Real-time scenarios, for example,
must often sacrifice the accuracy of more highly parameterized models in order to achieve rapid,
automatic model initialization. In line with this motivation, Wren et al. [21] introduce a multi-
class statistical model of color and shape to obtain a 2-D representation of the head and hands.
They use simple blobs, or structurally significant regions, to represent various body parts for the
purpose of tracking and recognition. The approach lends itself to a real-time implementation and
is more robust to certain types of occlusion than other techniques.
At the heart of human motion analysis is motion tracking. The dependence between
tracking and recognition is significant, where even a slight increase in the quality or precision of
tracking can yield considerable improvements in recognition accuracy. Various
approaches to object tracking employ features such as edges, shape, color, and optical flow [21].
To handle occlusion, articulated motion, and noisy observations, numerous researchers have
looked to stochastic modeling such as Kalman filtering [22][23][24][25] and probabilistic
conditional density propagation (Condensation algorithm) [17]. Dockstader and Tekalp [3]
suggest a modified Kalman filtering approach to the tracking of several moving people in video
surveillance sequences. They take advantage of the fact that as multiple moving people interact,
the state predictions and observations for the corresponding Kalman filters no longer remain
independent. Lerasle et al. [25] employ Kalman filters to perform tracking of human limbs from
multiple vantage points. Jang and Choi [26] suggest the use of active models based on regional
and structural characteristics such as color, shape, texture, and the like to track non-rigid moving
objects. Like the tracking theory set forth by Peterfreund [27], the active models employ Kalman
filtering to predict basic motion information and snake-like energy minimization terms to
perform dynamic adaptations using the moving object's structure. MacCormick and Blake [28]
present the notion of partitioned sampling to perform robust tracking of multiple moving objects.
The underlying probabilistic exclusion principle prevents a single observation from supporting
the presence of multiple targets by employing a specialized observational model. The
Condensation [17] algorithm is used in conjunction with the aforementioned observational
model to perform the contour tracking of multiple, interacting objects. Without the explicit use of
extensive temporal or stochastic modeling, McKenna et al. [29] describe a computer vision
system for tracking multiple moving persons in relatively unconstrained environments.
Haritaoglu et al. [30] present a real-time system, W4, for detecting and tracking multiple people
when they appear in a group. All of the techniques discussed so far have considered motion
tracking from monocular video.
The use of multi-view monitoring [12][16][31] and data fusion [32][33] provides an
elegant mechanism for handling occlusion, articulated motion, and multiple moving objects in
video sequences. With distributed monitoring systems, however, come additional issues such as
stereo matching, pose estimation, automated camera switching, and system bandwidth and
processing capabilities [34]. Stillman et al. [35] propose a robust system for tracking and
recognizing multiple people with two cameras capable of panning, tilting, and zooming (PTZ) as
well as two static cameras for general motion monitoring. The static cameras perform person
detection and histogram registration in parallel while the PTZ cameras acquire more highly
focused data to perform basic face recognition. Utsumi et al. [36] suggest a system for detecting
and tracking multiple persons using multiple cameras to address complex occlusion and
articulated motion. The system is composed of multiple tasks including position detection,
rotation angle detection, and body-side detection. For each of the tasks, the camera that provides
the most relevant information is automatically chosen using the distance transformations of
multiple segmented object maps. Cai and Aggarwal [34] develop an approach for tracking
human motion using a distributed-camera model. The system starts with tracking from a single
camera view and switches when the active camera no longer has a sufficient view of the moving
object. Selecting the optimal camera is based on a matching evaluation, which uses a simple
threshold on some tracking confidence parameter, and a frame number calculation, which
attempts to minimize the total amount of camera switching.
It stands to reason that many who focus on the modeling and tracking of articulated
human motion also initiate efforts in the recognition, interpretation, and understanding of such
motion. Toward this end, popular approaches typically employ parameterized motion trajectories
or vector fields and use Dynamic Time Warping, hidden Markov models [6][37], combinations
thereof [38], or hybrid graph-theoretic methods [39]. Other researchers have addressed activity
recognition using methods based on principal component analysis [40], vector basis fields [41],
or generalized dynamic modeling. Wren and Pentland [42] describe an approach to gesture
recognition based on the use of 3-D dynamic modeling of the human body. Complementary
research to activity recognition investigates the effects of various features on recognition
accuracy. Nagaya et al. [43] propose an appearance-based feature for real-time gesture
recognition from motion images. Campbell et al. [5] experiment with a number of potential
feature vectors for human activity recognition, including 3-D position, measures of local
curvature, and numerous velocity terms. Their experimental results favor the use of
redundant velocity-based feature vectors.
For a more thorough discussion of modeling, tracking, and recognition for human motion
analysis, we refer the reader to Gavrila [44], Aggarwal and Cai [45], and the special issue of the
IEEE Transactions on Pattern Analysis and Machine Intelligence on video surveillance [46].
1.3. Proposed Contribution
In this paper, we introduce a distributed, real-time computing platform for improving
feature-based tracking in the presence of articulation and occlusion for the goal of recognition.
The main contribution of this work is to perform both spatial (between multiple cameras) and
temporal data integration within a unified framework of 3-D position tracking to provide
increased robustness to temporary feature point occlusion. In particular, the proposed system
employs a probabilistic weighting scheme for spatial data integration as a simple Bayesian belief
network (BBN) [47] with a dynamic, multidimensional topology. As a directed acyclic graph
(DAG), a Bayesian network demonstrates a natural framework for representing causal,
inferential relationships between multiple variables (e.g., a 3-D trajectory is always inferred from
two or more 2-D sources). Moreover, unlike traditional probabilistic weighting schemes [48], a
BBN is easily extended to multiple modalities due to its inherent semantic structure. Perhaps the
most exploited aspect, however, is the possibility of topological simplification resulting from
conditional dependency and independency relationships between multiple network variables. For
the proposed system, this corresponds to the selective use of multiple views of particular features
based on measures of spatio-temporal tracking confidence. Our approach is particularly well
suited as a precursor to generalized motion monitoring, activity recognition, and interaction
analysis systems [49]. To this end, we use multi-view data acquired from an unconstrained home
environment. Video captured within a home environment has the advantage of broad
applicability, where human motion is often significant and unrestricted. Applications range from
home security to understanding of social and interactive phenomena to monitoring for the
purposes of home health care and rehabilitation.
2. Theory
2.1. System Overview
The proposed system, as shown in Figure 1, consists of three major components: (i) state-
based 2-D predictor-corrector filtering for monocular tracking (in the dotted box), (ii) multi-view
(spatial) data fusion, and (iii) Kalman filtering for 3-D trajectory tracking. The first stage consists
of video preprocessing including background subtraction, sparse 2-D motion estimation, and
foreground region clustering. From the estimated 2-D motion, we extract a set of measurements
(observations), $\mathbf{x}[k]$, where $k \geq 0$ is the frame number, for the estimation of the state vector,
$\mathbf{s}[k] = [\,\mathbf{s}_1[k]\;\ \mathbf{s}_2[k]\;\ \cdots\;\ \mathbf{s}_N[k]\,]^T$. Here, $\mathbf{s}_m[k]$, $1 \leq m \leq N$, denotes the image coordinates of the
$m$-th feature point that we wish to track in time, and $N$ is the number of features being tracked on
one or more independently moving regions. We also define $\boldsymbol{\varphi}[k]$ as a 3-D state vector that
captures both the velocity and position of features in 3-D Cartesian space, where the unknown 3-D
feature position is denoted by $\mathbf{y}[k]$. At each frame, $k$, we use the observations, $\mathbf{x}[k]$, in
conjunction with the 3-D state estimates, $\hat{\boldsymbol{\varphi}}[k-1|k-1]$, as input to a predictor-corrector filter. The
output of the predictor-corrector filter is a state estimate, $\hat{\mathbf{s}}[k]$, with some confidence, $\mathbf{M}[k]$.
[Figure 1 depicts, for the $j$-th view, input video data feeding feature extraction (observations), a predictor-corrector filter driven by projections of the corrected 3-D feature locations onto the $j$-th imaging plane (predictions), a Bayesian network fusing the corrected states from all views into 3-D observations, and a standard Kalman filter producing the output trajectory states of the multiple moving objects.]

Figure 1. System flow diagram.
The stage of multi-view fusion (indicated as the Bayesian network in Figure 1) performs
spatial data integration using triangulation, perspective projections, and Bayesian inference. The
input to the Bayesian network is a set of random variables, $\Theta_1[k], \Theta_2[k], \ldots, \Theta_J[k]$, where the $j$-th
element is identical to the output, $\hat{\mathbf{s}}_j[k]$, taken from the predictor-corrector filter corresponding
to the $j$-th view of the scene. The network defines an unknown subset of $\Theta_1[k], \Theta_2[k], \ldots, \Theta_J[k]$,
denoted by $\tilde{\Theta}[k]$, that combine to form yet another random variable, $\Omega[k]$, indicative of a vector
of 3-D positions. At the output of the BBN are the estimates, $\hat{\tilde{\Theta}}[k]$ and $\hat{\mathbf{y}}[k]$, that maximize the
joint density, $P_{\Omega,\tilde{\Theta}}[\Omega, \tilde{\Theta}]$. We use the notation $\hat{\mathbf{y}}[k]$ instead of $\hat{\Omega}[k]$ since the output of the
network is actually an estimate of the unknown 3-D position vector, $\mathbf{y}[k]$. Accompanying the 3-D
estimate is a noise covariance matrix, $\mathbf{R}[k]$, minimized by the maximization of $P_{\Omega,\tilde{\Theta}}[\Omega, \tilde{\Theta}]$.
The final stage of the system uses a Kalman filter to maintain a level of temporal
smoothness on the vector of 3-D trajectories. The observations for the Kalman filter are the 3-D
estimates, $\hat{\mathbf{y}}[k]$, produced by the Bayesian network. As mentioned previously, the states of the
filter are indicated by $\boldsymbol{\varphi}[k]$, which represents both the true velocity and position of the $N$ features
in 3-D Cartesian space. The corrected states, $\hat{\boldsymbol{\varphi}}[k|k]$, at the output of the Kalman filter provide
updated estimates of the unknown 3-D position vector. For tracking in the next frame, the
algorithm develops a prediction, $\mathcal{F}(\boldsymbol{\Phi}[k+1]\,\hat{\boldsymbol{\varphi}}[k|k])$, based on a perspective projection, $\mathcal{F}(\cdot)$,
of the corresponding 3-D prediction. This process introduces a temporally iterative estimation
algorithm capable of improving the accuracy of both 2-D and 3-D object tracking of features
proven useful in human motion analysis [5].
The proposed architecture considers both the fundamental bandwidth constraints as well
as the computational complexity requirements for various components. Feature tracking and
processing occurs for a single camera ($1 \leq j \leq J$) and, due to the complexity of the required
motion estimation and spatial clustering routines, is performed using a dedicated processor.
Rather than burdening a network with the real-time transfer and analysis of video data from J
views, we perform all tracking and correspondence temporally at each view and then use the
Bayesian network to perform spatial integration of only the relevant data. Both the Bayesian
network and the subsequent stage of 3-D Kalman filtering are computationally simple enough (in
comparison to J occurrences of motion-based estimation and segmentation) to coexist on a single
central processor.
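
To make the data flow of Figure 1 concrete, the sketch below (Python/NumPy, not part of the original system) shows one way the per-view results and the fused central results might be represented, together with the perspective projection $\mathcal{F}(\cdot)$ used to feed 3-D feedback back to each view; all class, field, and function names are illustrative assumptions.

```python
# Illustrative sketch only: per-frame messages between a per-view processor
# and the central fusion processor. The structures and the placeholder
# projection are assumptions, not the authors' implementation.
from dataclasses import dataclass
import numpy as np

@dataclass
class ViewEstimate:
    """Output of one view's predictor-corrector filter (Section 2.2)."""
    view_id: int
    s_hat: np.ndarray    # (N, 2) corrected image coordinates of the N features
    M: np.ndarray        # (N, 2, 2) per-feature error covariance (confidence)
    omega: np.ndarray    # (N,) temporal tracking confidence, eq. (5)

@dataclass
class FusedEstimate:
    """Output of the Bayesian network and 3-D Kalman filter (Sections 2.3-2.4)."""
    y_hat: np.ndarray    # (N, 3) fused 3-D feature positions
    R: np.ndarray        # (N, 3, 3) per-feature reconstruction noise covariance

def project_to_view(y_hat: np.ndarray, P: np.ndarray) -> np.ndarray:
    """Perspective projection F(.) of 3-D feature positions onto one imaging
    plane, given a (hypothetical) 3x4 camera projection matrix P; the result
    serves as the 2-D predictions for that view in the next frame."""
    homog = np.hstack([y_hat, np.ones((y_hat.shape[0], 1))])   # (N, 4)
    uvw = homog @ P.T                                          # (N, 3)
    return uvw[:, :2] / uvw[:, 2:3]                            # (N, 2)
```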
2.2. Two-Dimensional Feature Tracking
Throughout this section it is tacitly understood that each equation is applied to data at the
plane of the $j$-th camera, although we drop the explicit dependence on $j$ with the understanding
that the steps in question are equally applicable to multiple views of the scene. As in [3], the
proposed technique performs moving object detection and segmentation using change detection
and localized, sparse motion estimation over a grid of points, including the semantic features,
within the foreground of the video sequence. While in [3] we were able to effectively use motion
estimation and frame differencing to produce filter predictions and observations, respectively,
such an approach is not feasible for the tracking of specific occluded features of moving people.
Rather, we must now rely on temporal correspondence to generate observations while using
feedback from the 3-D feature vector trajectories to develop next state predictions.
The first step involves computation of the state prediction, $\hat{\mathbf{s}}[k|k-1]$, in the current
frame, $k$, according to
\[
\hat{\mathbf{s}}[k|k-1] = \mathcal{F}\bigl(\boldsymbol{\Phi}[k]\,\hat{\boldsymbol{\varphi}}[k-1|k-1]\bigr), \qquad (1)
\]
where $\boldsymbol{\Phi}[k]$ is the 3-D state transition matrix and $\mathcal{F}(\cdot)$ is a vector-valued function that maps
points in 3-D Cartesian space to a particular imaging plane using a perspective projection. The
precise definition of $\boldsymbol{\Phi}[k]$ is given in Section 2.4. We then compute the 2-D error covariance matrix,
$\mathbf{M}[k|k-1]$, by projecting the corresponding 3-D error covariance matrix, $\boldsymbol{\Lambda}[k|k-1]$, associated
with $\hat{\boldsymbol{\varphi}}[k-1|k-1]$, to each imaging plane as per
\[
\mathbf{M}[k|k-1] = E\Bigl\{\bigl(\mathbf{s}[k] - \hat{\mathbf{s}}[k|k-1]\bigr)\bigl(\mathbf{s}[k] - \hat{\mathbf{s}}[k|k-1]\bigr)^T\Bigr\} = \mathcal{G}\bigl(\boldsymbol{\Lambda}[k|k-1]\bigr), \qquad (2)
\]
where $\mathcal{G}(\cdot)$ is a transformation of the noise covariance from 3-D to 2-D. The mapping of 3-D
data to the imaging plane of each camera, as in (1) and (2), results in a one-step predictor-corrector
filter at each frame, $k$. Collectively, the implementation of these functions, $\mathcal{F}(\cdot)$ and $\mathcal{G}(\cdot)$,
parallels that of a transformation of random variables, $\mathcal{H}_j(\cdot)$, which we fully define and
illustrate in Section 2.3.
The next step involves the computation of the gain matrix, $\mathbf{K}[k]$, which determines the
magnitude of the correction at the $k$-th frame. The operation relies on $\mathbf{M}[k|k-1]$ and a noise
covariance matrix, $\mathbf{C}[k]$, which describes the distribution of the assumed Gaussian observation
noise. Using a probabilistic weighting scheme, similar to that proposed in [3], we introduce a
matrix that captures the confidence associated with the temporal correspondence measurements
(i.e., observations, $\mathbf{x}[k]$) for each feature. For this task, we classify the correspondence of
various features and their neighboring pixels into three basic classes:
Class A: The element is visible in the previous frame and presumably in the current frame, as there exists a strong temporal correspondence;
Class B: The element is visible in the previous frame but presumably not in the current frame, as there exists only a weak temporal correspondence; and
Class C: The element is not visible in the previous frame.
The qualification of a feature as having a strong or weak temporal correspondence is based on
well-established criteria in the motion estimation and optical flow literature [50]. Since, in the
interest of computational complexity, the proposed system estimates only the forward motion,
we ignore the case where a previously occluded feature becomes revealed in the next frame; this
is considered a Class C correspondence.
Let us refer to the $i$-th motion vector in the neighborhood of some feature point, $\mathbf{s}_m[k]$, as
$\mathbf{v}_i[k]$ and the origin of this vector in the previous frame (in image coordinates) as $\tilde{\mathbf{v}}_i[k]$. For
convenience, we introduce the notation $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$ to represent the sets of all points, $\tilde{\mathbf{v}}_i[k]$
and $\mathbf{s}_m[k-1]$, that are classified as Class A, Class B, and Class C, respectively. It is assumed that
the union of these three sets describes a single, arbitrary region. For the proposed contribution,
this corresponds to an entire moving person, although the technique is equally applicable to
single body parts, groups of moving people, or even generic moving regions, depending on the
segmentation of the foreground.
We construct $\mathbf{C}[k]$ in the traditional manner by allowing the matrix to be representative
of the time-varying differences between observations and corrected predictions. We then
compose a gain matrix according to
\[
\mathbf{K}[k] = \mathbf{M}[k|k-1]\,\mathbf{H}^T[k]\,\mathbf{W}[k]\,\Bigl(\mathbf{C}[k] + \mathbf{W}^T[k]\,\mathbf{H}[k]\,\mathbf{M}[k|k-1]\,\mathbf{H}^T[k]\Bigr)^{-1}, \qquad (3)
\]
where $\mathbf{H}[k] = \mathbf{I}$ represents the linear observation matrix and
\[
\mathbf{W}[k] = \operatorname{diag}\bigl(\omega_1(k)\;\ \omega_2(k)\;\ \cdots\;\ \omega_N(k)\bigr)^{1/2}, \qquad (4)
\]
\[
\omega_m(k) =
\begin{cases}
1, & \mathbf{s}_m[k-1] \in \mathcal{A}, \\
1 - D_m^{-1}(k)\,\min_{i \in \mathcal{A}} d_m(k,i), & \mathbf{s}_m[k-1] \in \mathcal{B}, \\
0, & \mathbf{s}_m[k-1] \in \mathcal{C},
\end{cases} \qquad (5)
\]
and $d_m(k,i) = \lVert \tilde{\mathbf{v}}_i[k] - \mathbf{s}_m[k-1] \rVert$ is the distance between the $m$-th feature and the $i$-th observable
motion vector originating from a particular moving region. Here, $D_m(k) = \max_i\{d_m(k,i)\}$
indicates the maximum distance between a particular feature and the observable motion vectors
used to produce its temporal correspondence. The observations are indicated by
\[
\mathbf{x}_m[k] = \mathbf{s}_m[k-1] + \Biggl(\sum_{i \in \mathcal{A}} \bigl(D_m(k) - d_m(k,i)\bigr)\Biggr)^{-1} \sum_{i \in \mathcal{A}} \bigl(D_m(k) - d_m(k,i)\bigr)\,\mathbf{v}_i[k], \qquad (6)
\]
where $d_m(k,i)$, $\mathbf{v}_i[k]$, and $\tilde{\mathbf{v}}_i[k]$ are illustrated more clearly in Figure 2.


Upon inspecting (3) through (6), one may make a number of observations. First, a
completely visible feature that has a seemingly accurate motion estimate is tracked in the usual
way; this represents the trivial correspondence problem with no apparent occlusion. An occluded
feature, on the other hand, develops a weighted least-squares estimate of its trajectory using
neighboring, observable motion vectors. The confidence of this estimate is encoded in $\mathbf{W}[k]$,
where both the visibility and the proximity of the neighboring trajectories are taken into account.
We draw the reader's attention to the fact that nothing can be said about the optimality of the
weighted least-squares estimate unless something more is known regarding the distribution of the
bordering motion vector estimates. In the next section, we address this issue by introducing a
weighting scheme that takes the form of a Bayesian network with a dynamic topology.
[Figure 2 depicts two interacting objects in frames $k-1$ and $k$, the Class A, B, and C labels of neighboring points, the distance $d_m(k,i)$, and the $i$-th motion vector $\mathbf{v}_i[k]$ with origin $\tilde{\mathbf{v}}_i[k]$.]

Figure 2. Occlusion modeling. Here we show various feature tracking scenarios in the presence of interacting
foreground objects (i.e., two moving people). When a feature is occluded, the temporal trajectory is estimated using
neighboring, but visible, motion vectors.
For the sake of completeness, we provide the remaining steps of the filtering procedure.
These equations follow those of the standard Kalman filter and are indicated by
\[
\hat{\mathbf{s}}[k|k] = \hat{\mathbf{s}}[k|k-1] + \mathbf{K}[k]\bigl(\mathbf{x}[k] - \mathbf{H}[k]\,\hat{\mathbf{s}}[k|k-1]\bigr) \qquad (7)
\]
and
\[
\mathbf{M}[k|k] = \bigl(\mathbf{I} - \mathbf{K}[k]\,\mathbf{H}[k]\bigr)\,\mathbf{M}[k|k-1]. \qquad (8)
\]
The two equations represent the state correction based on the gain matrix and the update of the
error covariance matrix, respectively.
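
As an illustration of the occlusion weighting described above, the sketch below computes the observation of (6) and the confidence weight of (5) for a single feature; the function names, the fallback for equidistant vectors, and the toy numbers are assumptions rather than the authors' code.

```python
# Minimal sketch of the occlusion-weighted observation of eq. (6) and the
# confidence weight of eq. (5) for one feature; the set A holds the visible
# (Class A) neighboring motion vectors.
import numpy as np

def feature_observation(s_prev, origins_A, vectors_A):
    """s_prev: (2,) feature position at frame k-1.
    origins_A: (nA, 2) origins of Class A motion vectors in frame k-1.
    vectors_A: (nA, 2) the corresponding motion vectors.
    Returns the observation x_m[k] of eq. (6) and the distances d_m(k, i)."""
    d = np.linalg.norm(origins_A - s_prev, axis=1)        # d_m(k, i)
    D = d.max()                                           # D_m(k)
    w = D - d                                             # proximity weights
    if w.sum() <= 0.0:                                    # equidistant vectors
        w = np.ones_like(d)
    x = s_prev + (w[:, None] * vectors_A).sum(axis=0) / w.sum()
    return x, d

def feature_confidence(feature_class, d=None, D=None):
    """omega_m(k): 1 for Class A, 0 for Class C, and a proximity-based
    value in [0, 1] for Class B, as in eq. (5)."""
    if feature_class == "A":
        return 1.0
    if feature_class == "C":
        return 0.0
    return max(0.0, 1.0 - d.min() / D)                    # Class B

# Example: an occluded feature surrounded by three visible motion vectors.
s_prev = np.array([120.0, 80.0])
origins = np.array([[118.0, 78.0], [125.0, 84.0], [110.0, 90.0]])
vectors = np.array([[2.0, 1.0], [2.5, 0.5], [1.5, 1.5]])
x_obs, d = feature_observation(s_prev, origins, vectors)
omega = feature_confidence("B", d=d, D=d.max())
print(x_obs, omega)
```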
2.3. Spatial Integration
The spatial integration is performed for each feature point independently. The following
treatment addresses spatial integration for the $m$-th feature point, although the index $m$ has been
omitted for ease of notation. We introduce the random variables $\Theta_j[k] \equiv \hat{\mathbf{s}}_j[k] = \mathbf{s}_j[k] + \mathbf{n}_s^j[k]$ and
$\Omega[k] \equiv \mathbf{y}[k] + \mathbf{n}_y[k]$, where $\mathbf{n}_s^j[k]$ represents some zero-mean distribution of estimation error on
the imaging plane corresponding to the $j$-th view. Similarly, $\mathbf{n}_y[k]$ characterizes the zero-mean 3-D
reconstruction error inherent in $\Omega[k]$. Let $\Theta_j[k]$, $j \in [1, J]$, indicate a family of random
variables defined by the vector of probability density functions (PDF),
\[
P\bigl[\Theta_j[k]\bigr] = (2\pi)^{-N}\,\bigl|\mathbf{M}_j[k]\bigr|^{-\frac{1}{2}} \exp\Bigl(-\tfrac{1}{2}\bigl(\Theta_j[k] - \hat{\mathbf{s}}_j[k]\bigr)^T \mathbf{M}_j^{-1}[k]\bigl(\Theta_j[k] - \hat{\mathbf{s}}_j[k]\bigr)\Bigr), \qquad (9)
\]
where $\mathbf{M}_j[k]$ contains the confidence of the $m$-th component of the estimate on the $j$-th view. Under
the assumption that the $m$-th feature may be occluded in some views, the algorithm uses $1 < K \leq J$
views of the scene to reconstruct an estimate of $\mathbf{y}[k]$. We indicate the set of random variables
used in reconstruction by
\[
\tilde{\Theta}[k; K] = \bigl\{\Theta_{j_1}[k], \Theta_{j_2}[k], \ldots, \Theta_{j_K}[k]\bigr\} \subseteq \bigl\{\Theta_1[k], \ldots, \Theta_J[k]\bigr\}, \qquad (10)
\]
where $1 \leq j_n \leq J$ indicates one of the $K$ views upon which $\mathbf{y}[k]$ might be dependent. For the sake of
notational simplicity, we drop the explicit dependence of these variables on the frame number, $k$.
The proposed Bayesian network obtains an estimate of $\mathbf{y}[k]$ and the proper subset
$\tilde{\Theta}[k; K]$ by calculating
\[
\bigl(\hat{\Omega}, \hat{\tilde{\Theta}}\bigr) = \arg\max_{\Omega,\,\tilde{\Theta}}\; P_{\Omega,\tilde{\Theta}}\bigl[\Omega, \tilde{\Theta}\bigr], \qquad (11)
\]
where, according to Bayes' rule,
\[
P_{\Omega,\Theta}\bigl[\Omega, \Theta_1, \ldots, \Theta_J\bigr] = P\bigl[\Omega \,\big|\, \Theta_1, \ldots, \Theta_J\bigr]\, P\bigl[\Theta_J \,\big|\, \Theta_{J-1}, \ldots, \Theta_1\bigr] \cdots P\bigl[\Theta_2 \,\big|\, \Theta_1\bigr]\, P\bigl[\Theta_1\bigr]. \qquad (12)
\]
Representing (12) as a causal network, our task is reduced to finding a topology, $T = \{\Omega, \tilde{\Theta}\}$, such
that $P_{\Omega,\tilde{\Theta}}[\Omega, \tilde{\Theta}]$ is maximized. In (12), $P[\Omega \,|\, \Theta_1, \ldots, \Theta_J]$ models 3-D reconstruction
noise, $P[\Theta_j]$ models 2-D observation noise, and the remaining conditional densities model the
effects of occlusion and correlation between various views of the scene. We can contrast the
model of (12) to that of a more simplistic approach, assuming $J$ independent sources, where
\[
P_{\Omega,\Theta}\bigl[\Omega, \Theta_1, \ldots, \Theta_J\bigr] = P\bigl[\Omega \,\big|\, \Theta_1, \Theta_2, \ldots, \Theta_J\bigr] \prod_{j=1}^{J} P\bigl[\Theta_j\bigr]. \qquad (13)
\]
The latter model works well for a small number of variables, but (if the variables are correlated)
becomes less accurate as J increases. The generalized topologies of (12) and (13) are illustrated
in Figure 3. The remaining modifications to the network consist of defining a topological
ordering of the nodes, removing nonexistent dependencies (i.e., edges), and estimating the
conditional probabilities for the remaining I-map structure.

[Figure 3 residue: in topology (B), the $j$-th node in the network contains $(J-j+1)$ divergent paths and $(j-1)$ convergent paths.]

Figure 3. Bayesian network topologies. The network shown in (A) represents a basic BBN with a convergent
topology at a child node, indicative of independent sources, while that in (B) shows the more general case in which
the correlation between views is considered.
To determine a topological ordering of the nodes, we must take into consideration the
relative significance of various data sources. With two cameras there is no ordering, as both
views are equally necessary for 3-D reconstruction. However, the case of two cameras does not
provide a means to deal with the possibility of occlusion; hence it is not of interest to us. For
more views, we need relative measures of importance, most likely determined by redundant
scene content. If we assume uniformly distributed scene content, then by moving the cameras
further apart we increase the total probability of having multiple unobstructed views of one or
more features. Note, however, that there is also an implicit upper bound on camera separation in order to
maintain a well-defined stereo matching problem for feature initialization.
Using the $\binom{J}{2}$ fundamental matrices for the system, based on the static content associated
with each view, we define a probabilistic measure, $0 \leq l_{ij} \leq 1$. This probability portrays the ratio
of points seen from the $i$-th view that have some correspondence in the $j$-th view to the total number
of pixels in the $i$-th view. Within a graph-theoretic framework, this quantity states that an
inferential dependency exists between these nodes such that one might conjecture that $\Theta_i \rightarrow \Theta_j$
with a confidence of $l_{ij}$. Two identical views of a scene are indicated by $l_{ij} = 1$, while $l_{ij} = 0$
would suggest two views containing no points in common. Note that the converse of the former
statement does not necessarily follow. With fixed cameras, such a matrix, $\mathbf{L} = \{l_{ij}\}$, is easily
populated with arbitrary levels of precision using automatic, semi-automatic, or even manual
assessment. This dependence on camera position is illustrated in Figure 4. For a system
consisting of $J$ cameras, the nodal ordering is specified by
\[
T = \bigl\{\Omega, \Theta_{j_1}, \Theta_{j_2}, \ldots, \Theta_{j_J}\bigr\}, \qquad (14)
\]
where the nodes are placed in descending order of confidence from $\Theta_{j_1}$, which indicates the root
of the BBN, to $\Theta_{j_J}$, which has no dependencies. This topological ordering (14) varies for each
feature being tracked and, therefore, must consider both the confidence of temporal tracking and
the correlation between views of the scene. For the $m$-th feature in the $j_n$-th view of the scene we
have a confidence metric
\[
\beta_{m j_n}[k] = \frac{\omega_m^{j_n}(k)}{J} \sum_{i=1}^{J} \bigl(1 - l_{j_n i}\bigr), \qquad (15)
\]
where $\omega_m(k)$ for an arbitrary view is given in (5). The random variable for the $n$-th node in
the network is equal to $\Theta_{j_n}$, where the views of the scene are ranked according to
\[
\beta_{m j_1}[k] \geq \beta_{m j_2}[k] \geq \cdots \geq \beta_{m j_K}[k]. \qquad (16)
\]

Figure 4. Three-dimensional view of a scene. The correlation between views $j_1$ and $j_3$ is higher than that between $j_1$
and $j_2$. It is expected, then, that occluded content in $j_1$ has a higher probability of remaining occluded in $j_3$ than in $j_2$.
As an example for a single, arbitrary feature, consider a four-camera system in which we
have $\omega_m[k] = (0.72,\ 0.51,\ 1.00,\ 0.39)$ and a matrix of camera correlation data
\[
\mathbf{L} = \begin{bmatrix}
1.00 & 0.86 & 0.90 & 0.52 \\
0.42 & 1.00 & 0.73 & 0.36 \\
0.92 & 0.71 & 1.00 & 0.57 \\
0.94 & 0.56 & 0.78 & 1.00
\end{bmatrix}. \qquad (17)
\]
The corresponding confidence vector is given by $\beta_m[k] = (0.13,\ 0.19,\ 0.20,\ 0.07)$, where the
resulting topological ordering, $T = \{\Omega, \Theta_3, \Theta_2, \Theta_1, \Theta_4\}$, is demonstrated graphically in Figure 5. For an
arbitrary feature, this arrangement suggests that one should expect that data obtained from $\Theta_2$ is
likely, on average, to contain more unique scene content than that from $\Theta_1$ or $\Theta_4$. Alternatively,
this might be read as: given that we have an estimated feature vector in the second view, it is
58% or 64% likely ($1 - l_{2j}$) that the first or fourth views, respectively, provide additional
information. Of course, "information" in this context is taken for the case of occlusion. The goal
of this implementation is to lessen the adverse effects of observation and reconstruction
error by jointly considering camera position and temporal tracking confidence.
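
The confidence metric (15) and the ranking (16) can be checked directly on this example; the short script below (an illustrative check, not the authors' code) reproduces the reported confidence vector and the ordering $T = \{\Omega, \Theta_3, \Theta_2, \Theta_1, \Theta_4\}$.

```python
# Quick check of eq. (15) on the four-camera example above:
# beta_j = (omega_j / J) * sum_i (1 - l_ji), which reproduces the reported
# vector (0.13, 0.19, 0.20, 0.07) and the ordering {3, 2, 1, 4}.
import numpy as np

omega = np.array([0.72, 0.51, 1.00, 0.39])          # temporal confidences, eq. (5)
L = np.array([[1.00, 0.86, 0.90, 0.52],
              [0.42, 1.00, 0.73, 0.36],
              [0.92, 0.71, 1.00, 0.57],
              [0.94, 0.56, 0.78, 1.00]])            # camera correlation, eq. (17)
J = len(omega)

beta = omega * (1.0 - L).sum(axis=1) / J            # eq. (15); diagonal adds zero
order = np.argsort(-beta) + 1                       # views ranked as in eq. (16)

print(np.round(beta, 2))                            # [0.13 0.19 0.2  0.07]
print(order)                                        # [3 2 1 4]
```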

Figure 5. Four-camera BBN. The illustrated network demonstrates a dynamic topology of $T = \{\Omega, \Theta_3, \Theta_2, \Theta_1, \Theta_4\}$. This
configuration is valid for only the $m$-th feature for the $k$-th frame.
To calculate (11), we must define the three fundamental densities provided in (12). In
addition to the a priori distribution at each imaging plane, as indicated by (9), we require density
functions for $\Omega$ and $\Theta_j$, each conditioned upon a collection of other views. Let us refer to this
collection of random variables as $\bar{\Theta} = \{\Theta_{j_1}, \Theta_{j_2}, \ldots, \Theta_{j_B}\}$ such that $\bar{\Theta}[k; B] \subseteq \tilde{\Theta}[k; K]$. We
introduce two transformation functions, $\mathcal{B}(\cdot)$ and $\mathcal{H}_j(\cdot)$, where the first function maps a
collection of random variables, $\bar{\Theta}$, from 2-D to 3-D, while the latter transformation projects a
random variable, $\Omega$, from 3-D to the imaging plane of the $j$-th camera. These functions are
analogous to those used in error propagation within the 3-D reconstruction literature [51]. Due to
the inherent complexity of these transformations, however, it is not always possible to provide a
simple, analytical representation of the noise propagation.
To estimate the transformed distributions, we first assume a noise model for the
reconstructed and projected density functions. For density functions transformed by $\mathcal{B}(\cdot)$, we
assume a 3-D Gaussian distribution, $\mathcal{N}(\boldsymbol{\mu}, \mathbf{R})$, while for those functions transformed by
$\mathcal{H}_j(\cdot)$, we assume a 2-D distribution of $\mathcal{N}(\boldsymbol{\nu}_j, \mathbf{U}_j)$. These distributions are calculated using
the notion of random sampling. For $\mathcal{N}(\boldsymbol{\mu}, \mathbf{R})$, we choose one point at random (distributed
according to $\Theta_{j_1}$, $\Theta_{j_2}$, etc.) on each of the imaging planes corresponding to $\bar{\Theta}$. These $B$ points
are then triangulated in the standard way [52] to form a random observation in 3-D Cartesian
space. We continue this sampling process until enough observations exist to estimate $\boldsymbol{\mu}$ and $\mathbf{R}$
with sufficient confidence. A similar sampling procedure is used to estimate $\boldsymbol{\nu}_j$ and $\mathbf{U}_j$, where
points are selected at random in 3-D according to $\mathcal{N}(\boldsymbol{\mu}, \mathbf{R})$, which also uses random samples
from the $j$-th view, and then projected onto the $j$-th imaging plane. The random 3-D positions will
project to a random distribution on the imaging plane that provides a set of observations for
estimating $\boldsymbol{\nu}_j$ and $\mathbf{U}_j$. We illustrate the random sampling process for constructing $\mathcal{N}(\boldsymbol{\mu}, \mathbf{R})$
and $\mathcal{N}(\boldsymbol{\nu}_j, \mathbf{U}_j)$ in Figure 6.
Figure 6. Random sampling process. The diagram on the left shows the generation of 3-D observations given
random samples of $P[\Theta_{j_1}[k]]$, $P[\Theta_{j_2}[k]]$, and $P[\Theta_{j_3}[k]]$, while that on the right demonstrates the creation of 2-D
observations given random samples with a $\mathcal{N}(\boldsymbol{\mu}, \mathbf{R})$ distribution.
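
The sampling procedure of Figure 6 can be sketched as follows; the DLT-style triangulation, the normalized camera matrices, and the numerical values are hypothetical placeholders that simply show how a cloud of triangulated samples yields estimates of $\boldsymbol{\mu}$ and $\mathbf{R}$.

```python
# Sketch of the random-sampling construction of N(mu, R): draw image points
# from each view's 2-D Gaussian, triangulate every draw (here with a simple
# linear triangulation), and fit a 3-D Gaussian to the resulting cloud.
import numpy as np

def triangulate(points_2d, projections):
    """Linear triangulation of one 3-D point from its projections in B views."""
    rows = []
    for (u, v), P in zip(points_2d, projections):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.stack(rows))
    X = vt[-1]
    return X[:3] / X[3]

def estimate_3d_gaussian(s_hats, Ms, projections, n_samples=500, seed=0):
    """s_hats: list of (2,) 2-D estimates; Ms: list of (2, 2) covariances."""
    rng = np.random.default_rng(seed)
    cloud = np.array([
        triangulate([rng.multivariate_normal(s, M) for s, M in zip(s_hats, Ms)],
                    projections)
        for _ in range(n_samples)])
    return cloud.mean(axis=0), np.cov(cloud, rowvar=False)   # mu, R

# Two hypothetical normalized cameras observing a point near (0, 0, 5).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
mu, R = estimate_3d_gaussian([np.array([0.0, 0.0]), np.array([-0.2, 0.0])],
                             [0.0004 * np.eye(2)] * 2, [P1, P2])
print(np.round(mu, 2), np.round(np.diag(R), 4))
```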
Starting with the two most confident views of the scene, corresponding to $\Theta_{j_1}$ and $\Theta_{j_2}$ in the
network, we construct $P_{\Omega,\tilde{\Theta}}[\Omega, \tilde{\Theta}]$ under the assumption that a conditional independence exists,
\[
P_{\Omega,\tilde{\Theta}}\bigl[\Omega, \tilde{\Theta}\bigr] = P\bigl[\Omega \,\big|\, \bar{\Theta}\bigr] \prod_{\Theta_j \in \tilde{\Theta} \setminus \bar{\Theta}} P\bigl[\Theta_j \,\big|\, \Omega\bigr] \prod_{\Theta_j \in \tilde{\Theta}} P\bigl[\Theta_j\bigr], \qquad (18)
\]
where $\bar{\Theta} = \{\Theta_{j_1}, \Theta_{j_2}\}$. To calculate (18), we introduce
\[
P\bigl[\Omega[k] \,\big|\, \bar{\Theta}[k]\bigr] = (2\pi)^{-\frac{3N}{2}}\,\bigl|\mathbf{R}[k]\bigr|^{-\frac{1}{2}} \exp\Bigl(-\tfrac{1}{2}\bigl(\Omega[k] - \boldsymbol{\mu}[k]\bigr)^T \mathbf{R}^{-1}[k]\bigl(\Omega[k] - \boldsymbol{\mu}[k]\bigr)\Bigr) \qquad (19)
\]
and
\[
P\bigl[\Theta_j[k] \,\big|\, \Omega[k]\bigr] = (2\pi)^{-N}\,\bigl|\mathbf{U}_j[k]\bigr|^{-\frac{1}{2}} \exp\Bigl(-\tfrac{1}{2}\bigl(\Theta_j[k] - \boldsymbol{\nu}_j[k]\bigr)^T \mathbf{U}_j^{-1}[k]\bigl(\Theta_j[k] - \boldsymbol{\nu}_j[k]\bigr)\Bigr), \qquad (20)
\]
where we show the dependence of these equations on time, $k$, for completeness. The
maximization of (18) is trivial due to the normal distribution of each variable. The solution
parallels that of a weighted least-squares problem, where $\bar{\Theta}$ are the observations and the a priori
density in (9) is analogous to the weighting factor for a particular observation. We iteratively
modify the topology of the graph, adding nodes of successively lower confidence, until the
dynamic topological ordering satisfies (11). By considering nodes in a highest confidence first
(HCF) manner, the iterations are encouraged to converge quickly, and typically to a global
maximum. Since $\tilde{\Theta}$ is finite, however, guaranteed absolute convergence to the maximum of
$P_{\Omega,\tilde{\Theta}}[\Omega, \tilde{\Theta}]$ is possible by simply testing all possible input combinations. This may become
prohibitively expensive, however, depending on the number of views and features. The
resulting estimate of 3-D feature position, $\hat{\mathbf{y}}[k]$, and corresponding noise covariance, $\mathbf{R}[k]$, are
the input to a Kalman filter that encourages a 3-D trajectory with temporal continuity.
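
A compact skeleton of this highest-confidence-first search is sketched below; the stand-in for the density evaluation, the greedy stopping rule, and the toy scores are illustrative assumptions rather than the exact procedure.

```python
# Skeleton of the highest-confidence-first (HCF) search over view subsets:
# views are added in decreasing order of the metric of eq. (15), and the
# subset whose maximized joint density is largest is kept. The callable
# joint_log_density stands in for fitting and evaluating (18) after the
# sampling of Figure 6; the toy scores below are purely illustrative.
import numpy as np

def hcf_select_views(beta, joint_log_density, min_views=2):
    """beta: per-view confidences for one feature; joint_log_density: callable
    mapping a tuple of view indices to the maximized joint log-density."""
    order = [int(i) for i in np.argsort(-np.asarray(beta))]   # eq. (16) ranking
    best_subset = tuple(order[:min_views])
    best_score = joint_log_density(best_subset)
    for n in range(min_views + 1, len(order) + 1):
        subset = tuple(order[:n])
        score = joint_log_density(subset)
        if score <= best_score:          # adding a low-confidence view did not help
            break
        best_subset, best_score = subset, score
    return best_subset, best_score

# Toy usage: the fourth (least confident) view is occluded, so including it hurts.
scores = {(2, 1): -3.1, (2, 1, 0): -2.4, (2, 1, 0, 3): -5.0}
subset, score = hcf_select_views([0.13, 0.19, 0.20, 0.07], lambda s: scores[s])
print(subset, score)                     # (2, 1, 0) -2.4
```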
2.4. Temporal Integration
The state-based feature vector represents both the location and velocity of points in 3-D
and is indicated by

[ ] 1 2 (2 )
[ ] [ ] [ ] [ ]
T
N
k k k k

" , (21)
where [ ],
m
k m N indicates the ideal 3-D position of the m
th
feature within the k
th
frame and

[ ]
,
[ ]
b
k
m
k
b m N m N
k

>


. (22)
An estimate of the 3-D feature vector in (21) is denoted by (23) where we use a dynamic model
for [ ] k with constant velocity and linear 3-D displacement such that
[ ] 1 2
[ | 1] [ ] [ 1| 1] [ ] [ ] [ ] 1 1 1
T
N
k k k k k k k k " " , (23)
where
[ ] 2 [ 1| 1] [ 2 | 2]
m m m
k k k k k . (24)
We develop an error covariance matrix, [ | 1] k k , that depicts our confidence in the predictions
of the state estimates. The update equation is indicated by
17

[ | 1] [ ] [ 1| 1] [ ] [ ]
T
k k k k k k k + Q , (25)
where [ ] k Q represents a Gaussian noise covariance matrix which is iteratively modified over
time to account for the deviations between the predictions and corrections of the state estimates.
The system then constructs a Kalman gain matrix, [ ] k D , as follows

( )
1
1
2
[ ] [ | 1] [ ] { [ ] [ ]} [ ] [ | 1] [ ]
T T
k k k k k k k k k k

+ + D R , (26)
where [ ]
2
[ ]
N N
k

I 0 indicates the linear observation matrix and [ ] k is a recursively


updated observation noise covariance matrix. The remaining steps of the three-dimensional
trajectory filtering include
( )
[ | ] [ | 1] [ ] [ ] [ ] [ | 1] k k k k k k k k k + D y (27)
and
( ) [ | ] [ ] [ ] [ | 1] k k k k k k I D . (28)
These equations represent the state and noise covariance update equations for [ ] k and [ ] k ,
respectively.
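
For concreteness, the following sketch applies (25) through (28) to a single feature using a standard constant-velocity transition in place of the prediction of (23)-(24); the time step, $\mathbf{Q}[k]$, $\boldsymbol{\Sigma}[k]$, $\mathbf{R}[k]$, and the sample observations are placeholder assumptions rather than the paper's settings.

```python
# Minimal sketch of the 3-D trajectory filter of eqs. (25)-(28) for a single
# feature with state [x, y, z, vx, vy, vz].
import numpy as np

def predict(phi, Lam, Phi, Q):
    phi_pred = Phi @ phi                            # state prediction
    Lam_pred = Phi @ Lam @ Phi.T + Q                # eq. (25)
    return phi_pred, Lam_pred

def correct(phi_pred, Lam_pred, y_obs, R, Sigma):
    G = np.hstack([np.eye(3), np.zeros((3, 3))])    # observation matrix [I 0]
    S = 0.5 * (R + Sigma) + G @ Lam_pred @ G.T
    D = Lam_pred @ G.T @ np.linalg.inv(S)           # gain, eq. (26)
    phi = phi_pred + D @ (y_obs - G @ phi_pred)     # state update, eq. (27)
    Lam = (np.eye(6) - D @ G) @ Lam_pred            # covariance update, eq. (28)
    return phi, Lam

dt = 1.0
Phi = np.eye(6)
Phi[:3, 3:] = dt * np.eye(3)                        # constant-velocity transition
Q = 1e-3 * np.eye(6)                                # process noise (assumed)
Sigma = 1e-2 * np.eye(3)                            # recursive observation noise (assumed)

phi, Lam = np.zeros(6), np.eye(6)
for y in [np.array([0.1, 0.0, 5.0]), np.array([0.2, 0.0, 5.1])]:
    R = 1e-2 * np.eye(3)                            # covariance from the BBN
    phi, Lam = predict(phi, Lam, Phi, Q)
    phi, Lam = correct(phi, Lam, y, R, Sigma)
print(np.round(phi, 3))
```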
2.5. Algorithm Procedure
For convenience, we summarize the overall flow of the algorithm. Following the initial
process of camera calibration and the estimation of the redundancy matrix, L, semantic features
are chosen manually between J views of the scene to initialize the predictor-corrector filters used
for tracking. Then, for an arbitrary frame, k, we have the following procedure:
At each imaging plane:
1. Estimate the sparse motion between frames k and k-1 and segment the foreground into
moving regions, as in [3].
2. Update $\mathbf{H}[k]$ and $\mathbf{C}[k]$; using (1) through (8) and the projections of $\boldsymbol{\Phi}[k]\,\hat{\boldsymbol{\varphi}}[k-1|k-1]$ and
$\boldsymbol{\Lambda}[k-1|k-1]$, produce an estimate, $\hat{\mathbf{s}}[k]$, with some confidence, $\mathbf{M}[k]$, for the feature
vector, $\mathbf{s}[k]$.
For each feature over all imaging planes:
1. Define a set of random variables, $\Theta[k]$, and a subset, $\tilde{\Theta}[k]$, that characterize some
unknown combination of the vector estimates, $\hat{\mathbf{s}}_j[k]$, from different views.
2. Following (15) and (16), construct a Bayesian topology (12) that orders the views in
descending order of confidence.
3. Start with the two most confident views of the feature and estimate the density functions
in (9), (19), and (20) using random sampling, triangulation, and geometric projections, as
illustrated in Figure 6.
4. Define and maximize $P_{\Omega,\tilde{\Theta}}[\Omega, \tilde{\Theta}]$ using the previously estimated density functions.
5. Iterate over steps 2-4, adding an additional view of the feature at each step, until a global
maximum of $P_{\Omega,\tilde{\Theta}}[\Omega, \tilde{\Theta}]$ is reached, yielding the estimates $\hat{\tilde{\Theta}}[k]$ and $\hat{\mathbf{y}}[k]$.
In 3-D Cartesian space:
1. Update $\boldsymbol{\Phi}[k]$, $\boldsymbol{\Sigma}[k]$, and $\mathbf{Q}[k]$; using (21) through (28), produce a corrected estimate,
$\hat{\boldsymbol{\varphi}}[k]$, with some confidence, $\boldsymbol{\Lambda}[k]$, for the actual 3-D position of the features, $\mathbf{y}[k]$.
3. Experimental Results
To test the proposed contribution, we use 600 frames of synchronized video data taken
from three (i.e., $J = 3$) distinct views of a home environment. The underlying scene captures an
informal social gathering of four people, each of whom is characterized by five semantic features
that are selected at first sight and tracked throughout the remainder of the sequence. For features,
we use the top of each shoe, the transition between the sleeve and the arm, and the top of the
torso underneath the neck as seen from the front of the body. These points correlate well with
various body parts (e.g., wrists, elbows, feet, head/neck) while providing enough color and
content to perform accurate tracking. If a desired feature is initially not visible due to self-
occlusion, but its position in one or more views can be estimated with fair accuracy, it is labeled
and tracked as an occluded point until it becomes visible. This method of feature selection and
tracking is illustrated more clearly in Figure 7, where we show the tracking results for the system
at various stages.
Using the metric in (5), the algorithm associates a confidence level with each feature. We
quantize the number of levels to three, where a confidence level of 0 ($L_V$) suggests that the
feature is being tracked with high accuracy, a level of 1 ($L_O$) indicates that the feature is being
tracked with relatively low accuracy, and a level of 2 ($L_M$) identifies a feature that is no longer in
the field of view. Referring to the notation of Section 2.2, $L_V$-features typically include those of Class A
and some of those in Class B and are assumed to be visible, while $L_O$-features include those in
Class C and some of those in Class B and are assumed to be occluded. While tracking, visible
and occluded features are marked using solid and dashed circles, respectively.
Figure 7. Tracking results. Each column provides a distinct view of the scene and each row represents an individual
frame extracted from the sequence. Features that appear to be occluded or tracked with questionable accuracy are
automatically labeled with a dashed circle, while those of higher confidence are given a solid mark.
In addition to capturing the precise location and state of the features at various frames, we
also show feature trajectories for a continuous set of frames. In particular, Figure 8 shows the
path of every feature followed over a period of five seconds. For the sake of visual clarity, each
row of the figure is dedicated to a different individual, while the features on each person are
denoted using a variety of colors. To quantify the accuracy of the proposed tracking system, we
calculate the average absolute error between the automatically generated feature locations and
the corresponding ground-truth data. Since the underlying features have some well-defined
semantic interpretation, it stands to reason that the ground-truth should most appropriately be
generated by the same intervention that initially selected and defined each feature. When 3-D
data is available, we characterize the tracking error, $\epsilon_B$, by using the absolute difference between
the ground-truth and the 3-D feature projection at each imaging plane. The reported error is taken
on the imaging plane carrying the maximum absolute difference over all J views. If a feature can
only be tracked from one view, however, we simply calculate the absolute error between the
ground-truth and the tracking results of monocular image sequence processing.

Figure 8. Trajectories of various features. Each row shows three views of a five second interval of the scene for a
single person. The numbers in the lower corner of each image indicate the view followed by a range of frames.
For exhaustively quantifying the data, then, we must develop a baseline for three views of
four people, each with five features, over 600 frames. As this would clearly be no less than an
overwhelming task, we choose to faithfully represent the entire sequence using a random sample
of only 60 frames. The fundamental hypothesis of the proposed contribution is based upon an
assumed correlation between tracking accuracy and various configurations of visible and
occluded features. As such, we group features into any of $B$ bins, where for a total of $J$ views
and $F$ confidence levels, $B$ is the number of unique solutions to
\[
\sum_{i=1}^{F} O_i = J, \qquad 0 \leq O_i \leq J, \quad O_i, J \in \mathbb{Z}_{\geq 0}, \qquad (29)
\]
and $O_i$ is the number of views with the $i$-th confidence level. Using the three levels defined earlier
($L_V$, $L_O$, $L_M$), it can be shown that we have $B = \tfrac{1}{2}J^2 + \tfrac{3}{2}J + 1$ bins for any combination of $J$ cameras.
Each bin is represented using a triplet, where for any feature the first number, $O_V$, indicates the
number of cameras that presumably have an unobstructed view, the second number, $O_O$,
represents the number of cameras that are thought to have an occluded view, and the third
number, $O_M$, specifies the number of cameras for which the feature is outside the field of view.
Figure 9 summarizes the tracking error for the proposed Bayesian technique ($\epsilon_B$) while
comparing it to the performance of a more simplistic averaging approach ($\epsilon_A$). For the latter
method, multiple observations are combined in the least-squares sense in order to combat the
effects of observation noise. We draw the reader's attention to a number of important
characteristics. As indicated by the data, using the proposed BBN for data fusion produces lower
tracking errors for any class of features for which more than two observations exist (i.e., bins
300, 210, 120, and 030). Assuming uniformly distributed features over $B$ bins, it is easily shown
that the probability of having an arbitrary feature with more than two observations is
\[
\Pr[O_M < J - 2] = \frac{J^2 + 3J - 10}{J^2 + 3J + 2}. \qquad (30)
\]
With three views, one would expect nearly 40% of the features (under a uniform distribution) to
benefit from the proposed data fusion approach. Equation (30) states that as the number of views
increases, so does the probability that the tracking of a feature occurs with a lower error. This is
seen when taking the limit as the number of cameras monitoring the scene approaches infinity:
\[
\lim_{J \to \infty} \Pr[O_M < J - 2] = \lim_{J \to \infty} \frac{J^2 + 3J - 10}{J^2 + 3J + 2} = \lim_{J \to \infty} \Bigl(1 - \frac{12}{J^2 + 3J + 2}\Bigr) = 1. \qquad (31)
\]
Due to the nature of occlusion, however, one cannot draw a similar inference for the case of
simple averaging. That is, without a priori knowledge of the feature distribution within the
scene, adding more cameras is likely to increase the number of occluded and visible points by
comparable amounts. This clearly increases the probability that more than two cameras have an
unobstructed view of a particular feature, as indicated by (31), but typically at the expense of
additional observations in occlusion. As demonstrated by the transitions from bins 111 to
120/210 and 201 to 210/300 in Figure 9, we only witness a significant decrease in error, on
average, when the third observation has a relatively high confidence. In contrast, the Bayesian
belief network effectively weights only the most likely observations based on the a priori
confidence distributions presented in (9).
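
The bin count following (29) and the probability of (30) are easy to verify by enumeration for $J = 3$ views and $F = 3$ confidence levels; the snippet below is a quick check, not part of the proposed system.

```python
# Enumerating the triplets (O_V, O_O, O_M) for J = 3 views gives 10 bins,
# of which 4 (300, 210, 120, 030) have more than two usable observations,
# i.e., a probability of 0.4 under a uniform distribution, as in eq. (30).
from itertools import product

J, F = 3, 3
bins = [c for c in product(range(J + 1), repeat=F) if sum(c) == J]
assert len(bins) == (J * J + 3 * J + 2) // 2              # 10 bins
usable = [b for b in bins if b[2] < J - 2]                # O_M < J - 2
print(len(bins), len(usable), len(usable) / len(bins))    # 10 4 0.4
```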
[Figure 9 plots the maximum absolute error (pixels) of $\epsilon_A$ and $\epsilon_B$ against the view/confidence configuration (bins 300, 210, 201, 120, 111, 030, 021, 102, 012, 003), with labeled percentage differences of 4%, 34%, 0%, 14%, 0%, 9%, 0%, 0%, 0%, and N/A; the error is undefined for the three right-most bins.]

Figure 9. Summary of tracking results. Each bin in the histogram represents a different class of features, denoted by
a triplet, where the first, second, and third digits indicate the number of cameras tracking a given feature with high,
low, and no accuracy, respectively. The three right-most bins indicate features for which 3-D data is unavailable,
where the last bin (003) shows all points that have left the scene completely. The labeled percentages indicate the
percent increase in error of one approach over another for each class of features.
If we consider our specific distribution of features as a function of configuration over the
entire sequence, as indicated in Figure 10, the difference in error between $\epsilon_A$ and $\epsilon_B$ is as low as
4%, as high as 34%, and 14.5% on average. As further indicated by Figure 10, the total number
of features for which the proposed algorithm improves the tracking accuracy accounts for 42% of
all features. This value is in agreement with the prediction of 40% for a uniform distribution of
observations. In support of Figures 9 and 10, we show corresponding numerical data in Table 1.
[Figure 10 plots the number of sampled features against the view/confidence configuration (bins 300 through 003), with 20% of the samples in the first group of bins, 62% in the second, and 18% in the third.]

Figure 10. Distribution of features. We show the distribution of sampled features as a function of configuration. The
total number sampled is 1200, 18% of which did not have enough observations to generate 3-D data.
All major sources of error in the tracking results can be traced back to occlusion; this
point is exemplified by the trends in Figure 9. A more careful examination of the sequence even
indicates that self-occlusion is often a stronger culprit than other forms, such as occlusion due to
multiple object interactions and scene clutter. With the proposed technique, however, occlusion
due to multiple moving objects still yields erroneous tracking results when an occluded feature
undergoes some form of unexpected acceleration. If the visible motion estimates in the vicinity
of the occluded region do not positively correlate with the motion of that region, then neither the
predictions nor the observations will produce an accurate feature trajectory. For a more detailed
discussion of tracking multiple objects in the presence of occlusion, we refer the reader to [3].
Table 1. Tracking statistics. The first column shows those features for which 3-D position was estimated with high
accuracy, the middle column indicates those features for which one or more necessary views were apparently
occluded, while features in the third column had (at most) an estimated position in only one view of the scene.
        O_V >= 2                          O_V < 2 <= (O_V + O_O)            (O_V + O_O) < 2
Config  (ε_A, ε_B)    Samples    Config  (ε_A, ε_B)    Samples    Config  (ε_A, ε_B)    Samples
300     (1.80, 1.73)     58      120     (3.21, 2.82)    121      102     (2.26, 2.26)     29
210     (2.41, 1.79)     93      111     (3.03, 3.03)    140      012     (5.38, 5.38)     48
201     (1.81, 1.81)     85      030     (3.78, 3.48)    229      003     (N/A, N/A)      145
                                 021     (4.09, 4.09)    252
Percentage            19.6%      Percentage             61.8%     Percentage            18.6%

Our technique for tracking occluded features is based on an analysis of visible data.
Because we use the less computationally demanding, yet more simplistic, approach of model-free tracking, the
motion of features in self-occlusion is often predicted using erroneous vector estimates within
the foreground of the moving person in question. The result is a trajectory that is correct at a
global scale, but corrupt at finer resolutions. This is easily seen, for example, while attempting to
track a feature on an occluded arm, where the motion of the arm is not as positively correlated
with that of the torso (or the visible arm) as the algorithm would like to imply. We illustrate the
difficulty of self-occlusion in Figure 11.
[Figure 11 shows two views, A and B, of a person: a feature and its observations are visible in one view while the corresponding feature is occluded in the other.]

Figure 11. Observations in self-occlusion. This figure illustrates the adverse effects of self-occlusion and articulated
motion on estimating observations for feature positions.
One solution to the problem of self-occlusion might be the introduction of an a priori
motion model, as suggested by [53]. In conjunction with some type of human model [20], one
might consider segmenting the foreground video data into contiguous regions, each defined by
some parametric motion model. These regions could then be used to intelligently guide the
inclusion of sparse motion estimates in the definition of the state observations given in (6). This
approach is theoretically sound, but is difficult to implement in practice due to the dependence of
the segmentation on an accurate initialization, the uncertainty associated with region boundaries
(especially under occlusion), and the need for real-time performance. Given the proposed
framework for the fusion of video data from multiple cameras, however, a simpler (and provably
better) solution might be to extend the number of views, thus migrating towards the
increasingly popular notion of next generation ubiquitous computing. To develop a full
appreciation of the proposed technique, we invite and encourage the reader to visit our website to
inspect the multi-view tracking results in their entirety.
4. Conclusions and Future Work
We introduce a novel technique for tracking interacting human motion using multiple
layers of temporal filtering coupled by a simple Bayesian belief network for multiple camera
fusion. The system uses a distributed computing platform to maintain real-time performance and
multiple sources of video data, each capturing a distinct view of some scene. To maximize the
efficiency of distributed computation, each view of the scene is processed independently using a
dedicated processor. The processing for each view is based on a predictor-corrector filter with
Kalman-like state propagation that uses measures of state visibility and sparse estimates of image
motion to produce observations. The corresponding gain matrix probabilistically weights the
calculated observations against projected 3-D data from the previous frame. This mixing of
predictions and observations produces a stochastic coupling between interacting states that
effectively models the dependencies between spatially neighboring features.
The corrected output of each predictor-corrector filter provides a vector observation for a
Bayesian belief network. The network is characterized by a dynamic, multidimensional topology
that varies as a function of scene content and feature confidence. The algorithm calculates an
appropriate input configuration by iteratively resolving independency relationships and a priori
confidence levels within the graph. The output of the network is a vector of 3-D positional data
with a corresponding vector of noise covariance matrices; this information is provided as input to
a standard Kalman filtering mechanism. The proposed method of data fusion is compared to the
more basic approach of data averaging. Our results indicate that for any input configuration
consisting of more than two observations per feature the method of Bayesian fusion is superior.
In addition to producing lower errors than observational averaging, the Bayesian network lends
itself well to applications involving numerous input data (qualitatively as well as
computationally) and, in general, is better suited to handling multi-sensor data sources.
Future work in this area will continue to investigate methods for handling complex
occlusion such as those employing a priori motion and object models or those based on
additional observational sources of data. Outside the scope of tracking, but well within the field
of human motion analysis are other topics of potential contribution. These investigations might
be in areas including, but not limited to, simultaneous object tracking and recognition, multi-
modal recognition of gestures and activities, and higher-level fusion of activities and modalities
for event and interactive understanding applications.
Acknowledgments
This research was made possible by generous grants from the Center for Future Health,
the Keck Foundation, and Eastman Kodak Company. We would also like to acknowledge
Terrance Jones for his assistance and organizational efforts in providing synchronized multi-
view sequences of multiple people in action.
References
[1] Y. Ivanov, C. Stauffer, A. Bobick, and W. E. L. Grimson, Video surveillance of
interactions, Proc. of the Workshop on Visual Surveillance, Fort Collins, CO, 26 June
1999, pp. 82-89.
[2] T. Boult, R. Micheals, A. Erkan, P. Lewis, C. Powers, C. Qian, and W. Yin, Frame-rate
multi-body tracking for surveillance, Proc. of the DARPA Image Understanding
Workshop, Monterey, CA, 20-23 November 1998, pp. 305-313.
[3] S. L. Dockstader and A. M. Tekalp, On the Tracking of Articulated and Occluded Video
Object Motion, Real-Time Imaging, to appear in 2001.
[4] I. Haritaoglu, D. Harwood, and L. S. Davis, Hydra: multiple people detection and tracking
using silhouettes, Proc. of the Workshop on Visual Surveillance, Fort Collins, CO, 26 June
1999, pp. 6-13.
[5] L. W. Campbell, D. A. Becker, A. Azarbayejani, A. F. Bobick, and A. Pentland, Invariant
features for 3-D gesture recognition, Proc. of the Int. Conf. on Automatic Face and
Gesture Recognition, Killington, VT, 14-16 October 1996, pp. 157-162.
[6] A. D. Wilson and A. F. Bobick, Parametric Hidden Markov Models for Gesture
Recognition, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 21, no. 9, pp.
884-890, September 1999.
[7] A. Pentland, Looking at People: Sensing for Ubiquitous and Wearable Computing, IEEE
Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 107-119, January
2000.
[8] J. M. Nash, J. N. Carter, and M. S. Nixon, Extraction of moving articulated-objects by
evidence gathering, Proc. of the British Machine Vision Conference, Southampton, United
Kingdom, 14-17 September 1998, pp. 609-618.
[9] C. Bregler, Learning and recognizing human dynamics in video sequences, Proc. of the
Conf. on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, 17-19 June
1997, pp. 568-574.
[10] H. A. Rowley and J. M. Rehg, Analyzing articulated motion using expectation-
maximization, Proc. of the Conf. on Computer Vision and Pattern Recognition, San Juan,
Puerto Rico, 17-19 June 1997, pp. 935-941.
[11] D. Hogg, Model-Based Vision: A Program to See a Walking Person, Image and Vision
Computing, vol. 1, no. 1, pp. 5-20, January 1983.
[12] D. M. Gavrila and L. S. Davis, 3-D model-based tracking of humans in action: a multi-
view approach, Proc. of the Conf. on Computer Vision and Pattern Recognition, San
Francisco, CA, 18-20 June 1996, pp. 73-80.
[13] J. O'Rourke and N. I. Badler, Model-Based Image Analysis of Human Motion Using
Constraint Propagation, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 2,
no. 6, pp. 522-536, November 1980.
[14] J. M. Rehg and T. Kanade, Model-based tracking of self-occluding articulated objects,
Proc. of the Int. Conf. on Computer Vision, Cambridge, MA, 20-23 June 1995, pp. 618-623.
[15] S. Wachter and H.-H. Nagel, Tracking Persons in Monocular Image Sequences,
Computer Vision and Image Understanding, vol. 74, no. 3, June 1999.
[16] E.-J. Ong and S. Gong, Tracking hybrid 2D-3D human models from multiple views,
Proc. of the Int. Workshop on Modelling People, Kerkyra, Greece, 20 September 1999, pp.
11-18.
[17] M. Isard and A. Blake, Condensation - Conditional Density Propagation for Visual
Tracking, Int. J. of Computer Vision, vol. 29, no. 1, pp. 5-28, August 1998.
[18] K. Akita, Image Sequence Analysis of Real World Human Motion, Pattern Recognition,
vol. 17, no. 1, pp. 73-83, January 1984.
[19] M. K. Leung and Y.-H. Yang, Human Body Motion Segmentation in a Complex Scene,
Pattern Recognition, vol. 20, no. 1, pp. 55-64, January 1987.
[20] S. X. Ju, M. J. Black, and Y. Yacoob, Cardboard people: A parameterized model of
articulated image motion, Proc. of the Int. Conf. on Automatic Face and Gesture
Recognition, Killington, VT, 14-16 October 1996, pp. 38-44.
[21] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, Pfinder: Real-Time Tracking
of the Human Body, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 19,
no. 7, pp. 780-785, July 1997.
[22] Y.-S. Yao and R. Chellappa, Tracking a Dynamic Set of Feature Points, IEEE Trans. on
Image Processing, vol. 4, no. 10, pp. 1382-1395, October 1995.
[23] A. Azarbayejani and A. P. Pentland, Recursive Estimation of Motion, Structure, and Focal
Length, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 17, no. 6, pp. 562-
575, June 1995.
[24] T. J. Broida and R. Chellappa, Estimating the Kinematics and Structure of a Rigid Object
from a Sequence of Monocular Images, IEEE Trans. on Pattern Analysis and Machine
Intelligence, vol. 13, no. 6, pp. 497-513, June 1991.
[25] F. Lerasle, G. Rives, and M. Dhome, Tracking of Human Limbs by Multiocular Vision,
Computer Vision and Image Understanding, vol. 75, no. 3, pp. 229-246, September 1999.
[26] D.-S. Jang and H.-I. Choi, Active Models for Tracking Moving Objects, Pattern
Recognition, vol. 33, no. 7, pp. 1135-1146, July 2000.
[27] N. Peterfreund, Robust Tracking of Position and Velocity with Kalman Snakes, IEEE
Trans. on Pattern Analysis and Machine Intelligence, vol. 21, no. 6, pp. 564-569, June
1999.
[28] J. MacCormick and A. Blake, A probabilistic exclusion principle for tracking multiple
objects, Proc. of the Int. Conf. on Computer Vision, Kerkyra, Greece, 20-27 September
1999, pp. 572-578.
[29] S. J. McKenna, S. Jabri, Z. Duric, and H. Wechsler, Tracking interacting people, Proc. of
the Int. Conf. on Automatic Face and Gesture Recognition, Grenoble, France, 28-30 March
2000, pp. 348-353.
[30] I. Haritaoglu, D. Harwood, and L. S. Davis, W4: Real-Time Surveillance of People and
Their Activities, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 8,
pp. 809-830, August 2000.
[31] V. Kettnaker and R. Zabih, Counting people from multiple cameras, Proc. of the Int.
Conf. on Multimedia Computing and Systems, Florence, Italy, 7-11 June 1999, pp. 267-271.
[32] G.-W. Chu and M. J. Chung, An optimal image selection from multiple cameras under the
limitation of communication capacity, Proc. of the Int. Conf. on Multisensor Fusion and
Integration for Intelligent Systems, Taipei, Taiwan, 15-18 August 1999, pp. 261-266.
[33] T. Darrell, G. Gordon, M. Harville, and J. Woodfill, Integrated Person Tracking Using
Stereo, Color, and Pattern Detection, Int. J. of Computer Vision, vol. 37, no. 2, pp. 175-
185, June 2000.
[34] Q. Cai and J. K. Aggarwal, Tracking Human Motion in Structured Environments Using a
Distributed-Camera System, IEEE Trans. on Pattern Analysis and Machine Intelligence,
vol. 21, no. 11, pp. 1241-1247, November 1999.
[35] S. Stillman, R. Tanawongsuwan, and I. Essa, A system for tracking and recognizing
multiple people with multiple cameras, Proc. of the Int. Conf. on Audio and Video-Based
Biometric Person Authentication, Washington, DC, 22-23 March 1999, pp. 96-101.
[36] A. Utsumi, H. Mori, J. Ohya, and M. Yachida, Multiple-human tracking using multiple
cameras, Proc. of the Int. Conf. on Automatic Face and Gesture Recognition, Nara, Japan,
14-16 April 1998, pp. 498-503.
[37] J. Yamato, J. Ohya, and K. Ishii, Recognizing human action in time-sequential images
using hidden Markov model, Proc. of the Conf. on Computer Vision and Pattern
Recognition, Champaign, IL, 15-18 June 1992, pp. 379-385.
[38] M. J. Black and A. J. Jepson, Recognizing temporal trajectories using the Condensation
algorithm, Proc. of the Int. Conf. on Automatic Face and Gesture Recognition, Nara,
Japan, 14-16 April 1998, pp. 16-21.
[39] T. J. Olson and F. Z. Brill, Moving object detection and event recognition algorithms for
smart cameras, Proc. of the DARPA Image Understanding Workshop, New Orleans, LA,
11-14 May 1997, pp. 159-175.
[40] Y. Yacoob and M. J. Black, Parameterized Modeling and Recognition of Activities,
Computer Vision and Image Understanding, vol. 73, no. 2, pp. 232-247, February 1999.
[41] Y. Yacoob and L. S. Davis, Learned Models for Estimation of Rigid and Articulated
Human Motion from Stationary or Moving Camera, Int. J. of Computer Vision, vol. 12, no.
1, pp. 5-30, January 2000.
[42] C. R. Wren and A. P. Pentland, Dynamic models of human motion, Proc. of the Int. Conf.
on Automatic Face and Gesture Recognition, Nara, Japan, 14-16 April 1998, pp. 22-27.
[43] S. Nagaya, S. Seki, and R. Oka, A theoretical consideration of pattern space trajectory for
gesture spotting recognition, Proc. of the Int. Conf. on Automatic Face and Gesture
Recognition, Killington, VT, 14-16 October 1996, pp. 72-77.
[44] D. M. Gavrila, The Visual Analysis of Human Movement: A Survey, Computer Vision
and Image Understanding, vol. 73, no. 1, pp. 82-98, January 1999.
[45] J. K. Aggarwal and Q. Cai, Human Motion Analysis: A Review, Computer Vision and
Image Understanding, vol. 73, no. 3, pp. 428-440, March 1999.
[46] R. T. Collins, A. J. Lipton, and T. Kanade, Introduction to the Special Section on Video
Surveillance, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 8,
pp. 745-746, August 2000.
[47] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference,
San Francisco, CA: Morgan Kaufmann, 1988.
[48] Y. Altunbasak, A. M. Tekalp, and G. Bozdagi, Simultaneous stereo-motion fusion and 3-D
motion tracking, Proc. of the Int. Conf. on Acoustics, Speech, and Signal Processing,
Detroit, MI, 9-12 May 1995, vol. 4, pp. 2270-2280.
[49] N. M. Oliver, B. Rosario, and A. P. Pentland, A Bayesian Computer Vision System for
Modeling Human Interactions, IEEE Trans. on Pattern Analysis and Machine Intelligence,
vol. 22, no. 8, pp. 831-843, August 2000.
[50] J. L. Barron, D. J. Fleet, and S. S. Beauchemin, Systems and Experiment: Performance of
Optical Flow Techniques, Int. J. of Computer Vision, vol. 12, no. 1, pp. 43-77, 1994.
[51] Z. Sun, Object-Based Video Processing with Depth, Ph.D. Thesis, University of Rochester,
2000.
[52] E. Trucco and A. Verri, Introductory Techniques for 3-D Computer Vision, Upper Saddle
River, NJ: Prentice-Hall, 1998.
[53] H. Sidenbladh, M. J. Black, and D. J. Fleet, Stochastic tracking of 3D human figures using
2D image motion, Proc. of the European Conf. on Computer Vision, Dublin, Ireland, 26
June - 1 July 2000.
