Xavier P. Burgos-Artizzu, Piotr Dollár, Dayu Lin+, David J. Anderson, Pietro Perona
California Institute of Technology, Microsoft Research, Redmond, + NYU Medical Center
xpburgos@caltech.edu, pdollar@microsoft.com, dayu.lin@nyumc.org, {wuwei,perona}@caltech.edu
Figure 1. Behavior categories: frame examples, descriptions, frequency of occurrence (p), and duration mean and variance expressed in seconds (μ, σ²). All behaviors refer to the cage resident, main mouse. The probability of other (not shown) is 55.3%. [Entries recoverable from the figure: clean - grooms itself (p=7.6%, μ=2.6, σ²=25.4); human - human intervenes (p=1.2%, μ=3.5, σ²=10.5); sniff - sniffs any body part of the intruder (p=14.4%, μ=2.7, σ²=27.7); up - stands on its hind legs (p=3.8%, μ=2.1, σ²=12.0).]
… first computing spatio-temporal and trajectory features from video; a second level analyzes temporal context using an extension of Auto-context [35] to video. The temporal context model helps segment the video into action bouts, improving results 8% across all behaviors.
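As a rough sketch of this two-level design (an illustration only, not the paper's implementation; the classifier choice, scikit-learn's AdaBoostClassifier, and the simple windowed-mean context statistics are assumptions), one binary classifier per behavior is trained on per-frame features, and at each Auto-context iteration the previous iteration's confidences are summarized over temporal windows and appended to the feature vector:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def window_context(conf, sz):
    """Mean confidence per behavior over a window of sz frames centered at each frame."""
    kernel = np.ones(sz) / sz
    return np.column_stack(
        [np.convolve(conf[:, k], kernel, mode="same") for k in range(conf.shape[1])])

def auto_context_train(X, y, n_classes, n_iters=2, win_sizes=(75, 185, 615)):
    """X: (T, D) per-frame local features; y: (T,) frame labels in [0, n_classes)."""
    models = []
    conf = np.zeros((len(y), n_classes))
    feats = X
    for t in range(n_iters):
        if t > 0:
            # append temporal-context features computed from previous confidences
            ctx = np.hstack([window_context(conf, sz) for sz in win_sizes])
            feats = np.hstack([X, ctx])
        clfs = []
        for k in range(n_classes):
            # one binary classifier per behavior (behavior k vs. rest)
            clf = AdaBoostClassifier(n_estimators=50).fit(feats, (y == k).astype(int))
            conf[:, k] = clf.predict_proba(feats)[:, 1]
            clfs.append(clf)
        models.append(clfs)
    labels = conf.argmax(axis=1)  # per-frame behavior with highest confidence
    return models, labels
```

The windowed means above are only a stand-in; the context features actually used are richer (see Figure 4(b)).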
Our method reaches a recognition rate of 61.2% on 13 categories, tested on 133 videos, see Table 3. Figure 2 shows a comparison of our approach with experts' agreement on the 12 videos for which we have multiple expert annotations. On this smaller dataset, expert agreement is around 70%, while our method's performance is 62.6%.

However, CRIM13 still represents a serious challenge. Disagreement between human annotators lies almost entirely on the labeling of the other behavior (less important), while our approach still makes mistakes between real behaviors. In fact, removing other from the confusion matrices results in a human performance of 91%, and only 66% for our approach. When counting other, our approach outperforms human agreement in 5 of the 12 behaviors (approach, chase, circle, human, walk away).
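For reference, agreement figures of this kind follow directly from two frame-level label sequences; the sketch below is a plain NumPy illustration (not the paper's evaluation code), and the convention used here for removing other is one of several possible choices:

```python
import numpy as np

def frame_agreement(labels_a, labels_b, n_classes, other_idx=None):
    """Frame-level agreement between two per-frame label sequences.

    labels_a, labels_b: (T,) integer behavior labels; other_idx: index of 'other'.
    Returns the confusion matrix, the overall agreement, and (optionally) the
    agreement restricted to frames the first sequence labels as a real behavior.
    """
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    conf = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(conf, (labels_a, labels_b), 1)       # rows: sequence a, cols: sequence b
    agree = np.trace(conf) / conf.sum()
    if other_idx is None:
        return conf, agree, None
    keep = labels_a != other_idx                   # drop frames labeled 'other' by a
    agree_real = (labels_a[keep] == labels_b[keep]).mean()
    return conf, agree, agree_real
```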
2. Related work

Datasets What makes a good dataset? Behavior is interesting and meaningful when purposeful agents interact with each other and with objects in a given environment. The ideal behavior dataset identifies genuine agents acting freely in a well-defined scenario, and captures all spontaneous actions and activities. Filming should be continuous in order to allow a study of the structure of behavior at different scales of temporal resolution. Social behavior and agent-object interactions are of particular interest. The currently most widely-used datasets for action classification, KTH [32], INRIA-XMAS [40], Weizmann [12], UCF Sports [30], Hollywood2 [25], YouTube [24], Olympic Sports [14] and UT videos [31], do not meet this standard: they are segmented, they are acted, the choice of actions is often arbitrary, they are not annotated by experts and they include little social behavior. Moreover, the KTH and Weizmann datasets may have reached the end of their useful life, with current state-of-the-art classification rates of 94% and 99% respectively.

One of the first approaches to continuous event recognition was proposed by [43], although only a very small set of continuous video sequences was used. Virat [27] is the first large continuous-video dataset allowing the study of behavior in a well-defined, meaningful environment. The focus of this dataset is video surveillance; it contains examples of interactions of humans with objects (cars, bags) and more than 29h of video. Another useful dataset, collected by Serre and colleagues, contains continuous video of solitary mice [15].

[Figure 2: confusion matrices over the 13 categories (approach, attack, copulation, chase, circle, drink, eat, clean, human, sniff, up, walk away, other). Panels: (a) annotators 2-5 vs. annotator 1 (69.7%), (b) our method vs. annotator 1 (62.6%), (c) ground-truth segmentation, (d) method segmentation.]
Figure 2. Confusion matrices for comparison with experts' annotations on the 12 videos that were annotated by more than two experts. For results on the whole dataset, see Table 3. (a) Comparison between annotator 1 and a group of 4 other annotators. (b) Comparison between annotator 1 and the output of our approach. (c,d) Comparison of the segmentation of 2k frames of a video (available from the project website) into behavior bouts.

Features Most action recognition approaches are based solely on spatio-temporal features. These features are computed in two separate stages: interest point detection and description. Popular spatio-temporal interest point detectors include Harris3D [19], Cuboids [8], and Hessian [13]. In datasets where the background contains useful information about the scene, as is the case for the Hollywood dataset, densely sampling points instead of running an interest point detector seems to improve results. The most effective spatio-temporal point descriptors are Cuboids (PCA-SIFT) [8], HOG/HOF [20], HOG3D [18] and eSURF [13]. More recently, relationships between spatio-temporal features have been used to model the temporal structure of primitive actions [5, 31].
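As a minimal stand-in for that two-stage pipeline (dense sampling plus a coarse gradient-orientation histogram; this is an illustration, not an implementation of Harris3D, Cuboids or HOG3D, and the patch and bin sizes are arbitrary assumptions), a NumPy sketch could look like:

```python
import numpy as np

def dense_st_patches(video, step=16, patch=(9, 16, 16)):
    """Densely sample space-time patches from a (T, H, W) grayscale video."""
    pt, ph, pw = patch
    T, H, W = video.shape
    for t in range(0, T - pt + 1, step):
        for y in range(0, H - ph + 1, step):
            for x in range(0, W - pw + 1, step):
                yield video[t:t + pt, y:y + ph, x:x + pw]

def grad_orientation_descriptor(cube, n_bins=8):
    """Histogram of spatial gradient orientations, weighted by gradient magnitude."""
    gy, gx = np.gradient(cube.astype(float), axis=(1, 2))
    mag = np.hypot(gx, gy)
    ori = np.arctan2(gy, gx)                               # orientation in [-pi, pi]
    bins = ((ori + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return hist / (hist.sum() + 1e-8)
```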
Other features that have shown good performance are Local Trinary Patterns [41] and motion features derived from optical flow [9, 1]. Recently, space-time locally adaptive regression kernels (3DLSK) have been shown to reach state-of-the-art recognition on the KTH dataset with only one training example [33]. Biologically-inspired hierarchies of features have also been widely used [16, 34, 42, 22]. In [22], deep learning techniques were applied to learn hierarchical invariant spatio-temporal features that achieve state-of-the-art results on the Hollywood and YouTube datasets.

Another trend is to use an indirect representation of the visual scene as input to the classification, such as silhouettes, body parts or pose [29]. These approaches are generally sensitive to noise, partial occlusions and variations in viewpoint, except in the case of small animals such as flies, where trajectory features combined with pose information and body-part segmentation have proven to work well [7, 3].

Classifiers The most common approach to classification is to use a bag of spatio-temporal words combined with a classifier such as SVM [36] or AdaBoost [11]. Other interesting classification approaches are feature mining [10] and unsupervised latent topic models [26].
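To make that pipeline concrete, here is a hedged scikit-learn sketch (an assumption for illustration; the paper does not prescribe this code): descriptors from all training clips are quantized into a visual vocabulary with k-means, each clip becomes an L1-normalized word histogram, and a linear SVM is trained on those histograms.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def train_bow_svm(train_descriptors, train_labels, vocab_size=500):
    """train_descriptors: list of (Ni, D) arrays, one per clip; train_labels: list of ints."""
    vocab = KMeans(n_clusters=vocab_size, n_init=4).fit(np.vstack(train_descriptors))

    def bow_histogram(desc):
        words = vocab.predict(desc)                        # assign each descriptor to a word
        hist = np.bincount(words, minlength=vocab_size).astype(float)
        return hist / (hist.sum() + 1e-8)                  # L1-normalized word histogram

    X = np.array([bow_histogram(d) for d in train_descriptors])
    clf = LinearSVC().fit(X, train_labels)
    return vocab, bow_histogram, clf
```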
Mouse behavior Two works focused on actions of solitary black mice on a white background [8, 15]. We tackle the more general and challenging problem of social behavior in mice of unconstrained coat color.

3. Caltech Resident-Intruder Mouse dataset

The dataset was collected in collaboration with biologists, for a study of the neurophysiological mechanisms involved in aggression and courtship [23]. The videos always start with a male resident mouse alone in a laboratory enclosure. At some point a second mouse, the intruder, is introduced and the social interaction begins. Just before the end of the video, the intruder mouse is removed. Behavior is categorized into 12+1 different mutually exclusive action categories, i.e. 12 behaviors and one last category, called other, used by annotators when no behavior of interest is occurring. The mice will start interacting by getting to know each other (approach, circle, sniff, walk away). Once it is established that the intruder is a female, the resident mouse will likely court her (copulation, chase). If the intruder is a male, the resident mouse will likely attack it to defend its territory. Resident mice can also choose to ignore the intruder, engaging in solitary behaviors (clean, drink, eat, up). The introduction/removal of the intruder mouse is labeled as human. Figure 1 shows video frames for each behavior in both top and side views, and gives a short description of each behavior, its probability p (how often each behavior occurs), and the mean and variance of its duration in seconds (μ, σ²).

Each scene was recorded from both top and side views using two fixed, synchronized cameras. Mice being nocturnal animals, near-infrared in-cage lighting was used; thus the videos are monochromatic (visually indistinguishable from grayscale). Videos typically last around 10 min and were recorded at 25fps with a resolution of 640x480 pixels and 8-bit pixel depth. The full dataset consists of 237 videos and over 8M frames.

Every video frame is labeled with one of the thirteen action categories, resulting in a segmentation of the videos into action intervals or bouts. The beginning and end of each bout are accurately determined, since behavior instances must be correlated with electrophysiological recordings at a frequency of 25Hz. Some behaviors occur more often than others, and durations (both intra-class and inter-class) vary greatly, see Figure 1.
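As a small illustration of that bout representation (not code from the paper), the per-frame labels can be run-length encoded into (start frame, end frame, behavior) triples:

```python
def labels_to_bouts(frame_labels):
    """Run-length encode per-frame behavior labels into bouts.

    frame_labels: sequence of per-frame labels (e.g. 'attack', 'other', ...).
    Returns a list of (start, end, label) with end inclusive.
    """
    bouts = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            bouts.append((start, i - 1, frame_labels[start]))
            start = i
    return bouts

# e.g. labels_to_bouts(['other', 'other', 'sniff', 'sniff', 'attack'])
# -> [(0, 1, 'other'), (2, 3, 'sniff'), (4, 4, 'attack')]
```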
The level of expertise of annotators varied slightly; they were all trained by the same two behavior biologists and were given the same set of instructions on how the behaviors should be annotated. Annotators used both top and side views in a video behavior annotator GUI developed in Matlab. This tool provides all the playback functionality needed to carefully analyze the videos and allows the manual annotation of behaviors into bouts (starting and ending frame). The tool is available from the project website. Each 10min video contains an average of 140 action bouts (without counting other). It typically takes an expert 7-8 times the video length to fully annotate it.

[Figure 3: overview of the approach. From the input video, local features and mice positions are computed; binary classifiers output per-frame confidence levels hkt(i), refined over an Auto-context loop t = 1..T; each frame is assigned the behavior with the highest confidence, arg max_k hkt(i).]
… direction, velocities and accelerations, see Figure 4(a). Then, the algorithm generates a large pool of weak features from these measurements.

Figure 4(a) defines, for each mouse mi with track position (x_mi(t), y_mi(t)):
- Direction difference: DDir(t) = |Dir_1(t) - Dir_2(t)|
- Velocity (1st-order derivative): Vx_mi(t) = (x_mi(t+1) - x_mi(t-1)) / (2Δt), Vy_mi(t) = (y_mi(t+1) - y_mi(t-1)) / (2Δt)
- Acceleration: Ax_mi(t) = (Vx_mi(t+1) - Vx_mi(t-1)) / (2Δt), Ay_mi(t) = (Vy_mi(t+1) - Vy_mi(t-1)) / (2Δt)

Figure 4(b) defines temporal context features from the confidence levels h^k_{t-1}(i): for a window of size sz centered at frame i,
Δ^k_{t-1}(start, end) = h^k_{t-1}(end) - h^k_{t-1}(start),
computed over the whole window (i - sz/2, i + sz/2), the past half (i - sz/2, i) and the posterior half (i, i + sz/2), i.e. the "all", "past" and "post" variants.
Figure 4. (a) Trajectory information computed from mouse tracks. Distance helps discriminate between solitary and social behaviors (mice far from / close to each other), as well as detect a transition from one to the other (approach, walk away). Velocity and acceleration help detect behaviors such as attack, chase or clean (stationary). Direction is important to distinguish between behaviors such as sniff (often head to head) and copulation or chase (head to tail). Pose estimation would clearly improve results in this respect. (b) Temporal context features computed from the confidence levels h^k_{t-1} output by each binary classifier k at the previous iteration t-1, given a window of size sz centered at frame i. For each frame, 1536 context features are computed, using windows of sz = 75, 185, 615 to combine short-, medium- and long-term context. Some of the features are also computed only on past or only on posterior frames to encode behavior transition probabilities.
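A minimal NumPy sketch of these trajectory features, assuming each mouse track is a per-frame (x, y) position and estimating heading from displacement (the paper may obtain orientation differently from its tracker), could look like this; the large weak-feature pool built on top of these signals is omitted:

```python
import numpy as np

def trajectory_features(track1, track2, dt=1.0 / 25.0):
    """Trajectory features from two mouse tracks of shape (T, 2) with (x, y) per frame."""
    track1 = np.asarray(track1, dtype=float)
    track2 = np.asarray(track2, dtype=float)

    def central_diff(sig):
        # (f(t+1) - f(t-1)) / (2*dt), with the first/last frame replicated
        d = np.empty_like(sig)
        d[1:-1] = (sig[2:] - sig[:-2]) / (2.0 * dt)
        d[0], d[-1] = d[1], d[-2]
        return d

    # distance between the mice: separates solitary from social behaviors
    dist = np.linalg.norm(track1 - track2, axis=1)

    # velocity of each mouse (central differences), heading estimated from displacement
    v1 = np.column_stack([central_diff(track1[:, 0]), central_diff(track1[:, 1])])
    v2 = np.column_stack([central_diff(track2[:, 0]), central_diff(track2[:, 1])])
    dir1 = np.arctan2(v1[:, 1], v1[:, 0])
    dir2 = np.arctan2(v2[:, 1], v2[:, 0])
    ddir = np.abs(np.angle(np.exp(1j * (dir1 - dir2))))    # wrapped |Dir_1(t) - Dir_2(t)|

    # acceleration of the resident mouse, again by central differences
    a1 = np.column_stack([central_diff(v1[:, 0]), central_diff(v1[:, 1])])

    return {"distance": dist, "dir_difference": ddir,
            "velocity": v1, "acceleration": a1}
```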
[Figure 6: two plots of recognition rate (%), one as a function of T (50 to 2k) and one as a function of S (100 to 1M).]
Figure 6. Role of T and S in classification. Experiments were carried out using only trajectory features (see Sec. 5.2.3).

Features used           Without Context   With Context
TF                      52.3%             58.3%
STF                     29.3%             43.0%
STF Top                 26.6%             39.3%
STF Side                28.2%             39.1%
TF+STF (full method)    53.1%             61.2%

Table 3. Method results on CRIM13. Context results were achieved after 2 Auto-context iterations. Adding temporal context improves classification by an average of 10% (14% on STF). For a comparison with experts' annotations, see Figure 2.

The first column of Table 3 shows the results of each combination of local features without context. The second column shows the results after adding temporal context features with Auto-context. Without context, TF clearly outperforms STF, while the combination of both, TF+STF, improves classification 1% with respect to TF. The use of Auto-context improves performance 8% on the full method, 10% on average. In comparison, a simple temporal smoothing of the output labels only improves results 0.2%. Best results are achieved by the combination of both local features, TF+STF, which improves on TF by 3%, reaching a final recognition rate of 61.2%. The upper bound for this task is 70%, which is the average agreement rate between experts' annotations, see Figure 2.

Although the metric used might suggest that the difference between TF and TF+STF is small, analyzing each behavior separately shows that TF+STF outperforms TF in 7 out of the 12 behaviors (attack, copulation, chase, drink, eat, sniff, up). Also, comparing the confusion matrices of STF from the side and STF from the top shows that the top view is best suited to detect 5 behaviors (chase, circle, clean, sniff and walk away), while the rest are best recognized from the side.

We proposed an approach for the automatic segmentation and classification of spontaneous social behavior in continuous video. The approach uses spatio-temporal and trajectory features, each contributing to the correct segmentation of the videos into behavior bouts. We benchmarked every component of our approach separately. On our dataset, the combination of novel trajectory features with popular spatio-temporal features outperforms their use separately. This suggests that the study of behavior should be based on a multiplicity of heterogeneous descriptors. Also, we found that temporal context can improve classification of behavior in continuous videos. Our method's performance is not far from that of trained human annotators (see Figure 2 and Table 3), although there is clearly still room for improvement. While disagreement between human annotators lies almost entirely on the labeling of the other behavior (less important), our method confuses some real behaviors. In fact, removing other from the confusion matrices results in a human performance of 91%, and 66% for our approach.

We believe that classification performance could be improved by adding new features derived from pose and applying dynamic multi-label classifiers that can prove more robust to unbalanced data. We will also further study the use of LTP features, due to their promising results and faster speed. Although in this work we were able to deal with behaviors of different durations, an open question is whether an explicit distinction between short and simple movemes, medium-scale actions and longer, more complex activities should be directly included into future models.

Acknowledgments

The authors would like to thank R. Robertson for his careful work annotating the videos, as well as Dr. A. Steele for coordinating some of the annotations. We would also like to thank Dr. M. Maire for his valuable feedback on the paper. X.P.B.A. holds a postdoctoral fellowship from the Spanish Ministry of Education, Programa Nacional de Movilidad de Recursos Humanos del Plan Nacional de I-D+i 2008-2011. D.L. was supported by the Jane Coffin Childs Memorial Foundation. P.P. and D.J.A. were supported by the Gordon and Betty Moore Foundation. D.J.A. is also supported by the Howard Hughes Medical Institute. P.P. was also supported by ONR MURI Grant #N00014-10-1-0933.