Semantic Event Detection using Conditional Random Fields
Tao Wang1, Jianguo Li1, Qian Diao1, Wei Hu1, Yimin Zhang1, Carole Dulong2
1 Intel China Research Center, Beijing, P.R. China, 100080
2 Intel Corporation, Santa Clara, CA 95052, USA
{tao.wang, jianguo.li, qian.diao, wei.hu, yimin.zhang, carole.dulong}@intel.com
Abstract
Semantic event detection has been an active research field in video mining in recent years. One of the challenging problems is how to effectively model the temporal and multi-modality characteristics of video. In this paper, we employ Conditional Random Fields (CRFs) to fuse temporal multi-modality cues for event detection. CRFs are undirected probabilistic models designed for segmenting and labeling sequence data. Compared with traditional SVM and Hidden Markov Models (HMMs), CRFs-based event detection offers several advantages, including the ability to relax strong independence assumptions in the state transitions and to avoid a fundamental limitation of directed graphical models. To detect events, we use a three-level framework based on multi-modality fusion and mid-level keywords. The first level extracts audiovisual features, the mid-level detects semantic keywords, and the high-level infers semantic events from multiple keyword sequences. Experimental results on soccer highlight detection demonstrate that CRFs achieve better performance, particularly in the slice-level measure.
1. Introduction
With the advance of storage capacity, computing power and multimedia technology, research on semantic event detection has become increasingly active in recent years, with applications such as video surveillance, sports highlight detection, TV/movie abstraction and home video retrieval. Through event detection, consumers can quickly retrieve specific segments from long videos and save much browsing time.
There is a large literature on semantic event detection [1][3][5][11][16]. However, semantic event detection remains a challenging problem due to the large semantic gap and the difficulty of modeling the temporal and multi-modality characteristics of video.
Proceedings of the 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW06)
0-7695-2646-2/06 $20.00 2006 IEEE
[Fig. 1. Graphical structures over states s_{t-1}, s_t, s_{t+1} and observations o_{t-1}, o_t, o_{t+1}.]
The conditional distribution over a label sequence s = (s_1, ..., s_T) given an observation sequence o is

p_T(s \mid o) = \frac{1}{Z(o)} \exp\Big( \sum_{t=1}^{T} F(s, o, t) \Big)    (1)

where Z(o) = \sum_{s} \exp\big( \sum_{t=1}^{T} F(s, o, t) \big) is the normalization factor, and

F(s, o, t) = \sum_{i} \lambda_i f_i(s_{t-1}, s_t, o, t)    (2)

is a weighted sum of feature functions f_i with weights \lambda_i. As undirected graphical models, CRFs require only the Markov property on states given the observation: p(s_t \mid s_w, o, w \neq t) = p(s_t \mid s_w, o, w \sim t), where w \sim t means that w and t are neighbors in the graph.

Given N labeled training sequences \{(s^{(j)}, o^{(j)})\}_{j=1}^{N}, the weights \lambda are estimated by maximizing the conditional log-likelihood

L(\lambda) = \sum_{j=1}^{N} \log p_T(s^{(j)} \mid o^{(j)}) = \sum_{j=1}^{N} \Big( \sum_{t=1}^{T} F(s^{(j)}, o^{(j)}, t) - \log Z(o^{(j)}) \Big)    (3)
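As a sanity check on Eqs. (1)-(3), the sketch below builds a toy linear-chain CRF with hypothetical transition and emission weights (not taken from the paper) and evaluates p_T(s|o) by brute-force enumeration, confirming that Z(o) normalizes the distribution.

```python
import itertools
import math

# Toy linear-chain CRF: 2 labels, T = 3 time steps.
# The weights below are hypothetical, chosen only for illustration.
LABELS = [0, 1]
T = 3
trans = {(0, 0): 0.5, (0, 1): -0.2, (1, 0): 0.1, (1, 1): 0.8}  # transition weights
emit = [[0.3, -0.1], [0.0, 0.9], [0.7, 0.2]]  # per-step emission scores, stand-in for o

def score(seq):
    """Sum_t F(s, o, t): emission score at every t plus transition score from t-1."""
    total = sum(emit[t][seq[t]] for t in range(T))
    total += sum(trans[(seq[t - 1], seq[t])] for t in range(1, T))
    return total

# Z(o) = sum over all 2^T label sequences of exp(sum_t F(s, o, t))
Z = sum(math.exp(score(seq)) for seq in itertools.product(LABELS, repeat=T))

def p(seq):
    """Eq. (1): p_T(s | o) = exp(sum_t F) / Z(o)."""
    return math.exp(score(seq)) / Z

# The probabilities over all label sequences must sum to 1.
total = sum(p(seq) for seq in itertools.product(LABELS, repeat=T))
print(round(total, 10))  # → 1.0
```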
At inference time, the most probable label sequence is selected:

s^{*} = \arg\max_{s} p_T(s \mid o)    (4)
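For a linear chain, the argmax in Eq. (4) can be computed exactly by Viterbi dynamic programming; since argmax of p_T(s|o) equals argmax of the unnormalized score, Z(o) is never needed. The sketch below uses the same hypothetical toy weights as before and cross-checks against exhaustive search.

```python
import itertools

# Toy 2-label chain with hypothetical weights (illustration only).
LABELS = [0, 1]
T = 3
trans = {(0, 0): 0.5, (0, 1): -0.2, (1, 0): 0.1, (1, 1): 0.8}
emit = [[0.3, -0.1], [0.0, 0.9], [0.7, 0.2]]

def viterbi():
    """Eq. (4): s* = argmax_s of the chain score, via dynamic programming."""
    # delta[s] = best score of any prefix ending in label s; psi stores backpointers.
    delta = {s: emit[0][s] for s in LABELS}
    psi = []
    for t in range(1, T):
        new_delta, back = {}, {}
        for s in LABELS:
            prev = max(LABELS, key=lambda r: delta[r] + trans[(r, s)])
            new_delta[s] = delta[prev] + trans[(prev, s)] + emit[t][s]
            back[s] = prev
        psi.append(back)
        delta = new_delta
    # Backtrack from the best final label.
    last = max(LABELS, key=lambda s: delta[s])
    seq = [last]
    for back in reversed(psi):
        seq.append(back[seq[-1]])
    return list(reversed(seq))

best = viterbi()

# Cross-check against brute-force enumeration of all label sequences.
def score(seq):
    return (sum(emit[t][seq[t]] for t in range(T))
            + sum(trans[(seq[t - 1], seq[t])] for t in range(1, T)))

brute = max(itertools.product(LABELS, repeat=T), key=score)
print(best, list(brute))  # → [1, 1, 1] [1, 1, 1]
```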
[Fig. 2. Overview of the event detection framework: video and audio features are extracted as multimodal features and mapped to keyword streams x_1, x_2, ..., x_N.]
Fig. 4. (a) Hough line detection on a segmented playfield; (b) five detected regions in the playfield.
[Figure: CRF feature templates at time t over the event label sequence s (states s_{t-2}, s_{t-1}, s_t), the video sequence, and the keyword sequences v and w.]
4. Experiments
In our experiments, we used the libSVM [13], Intel OpenPNL [8] and FlexCRF [7] toolkits for the training/inference of linear SVM, first-order HMMs and first-order CRFs, respectively. To demonstrate the effectiveness of the proposed approach, soccer highlight detection experiments were conducted on eight soccer matches totaling 12.6 hours of video. Highlights are video segments in which the user has special or elevated interest. We define the semantic events goal, shot, foul, free kick and corner kick as highlight events, and all others as non-highlights. Five matches are used as training data, and the other three as testing data. The ground truth is labeled manually.
To compare the performance of SVM, HMM, and CRFs fairly, we input the same video segments for highlight detection. Since televised soccer generally uses a close-up view or a replay as a break to emphasize a highlight event, we first filter multiple
Table. Soccer highlight detection results (event level, then slice level).

Event level       SVM      HMMs     CRFs
Miss              26       30       28
False             27       45       20
Precision         86.9%    79.5%    89.8%
Recall            87.3%    85.4%    86.3%
F-Score           87.10%   82.34%   88.02%

Slice level       SVM      HMMs     CRFs
Precision         73.4%    70.0%    78.5%
Recall            60.4%    60.7%    71.6%
F-Score           66.3%    65.3%    74.9%
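As a quick arithmetic check, the reported F-scores follow from the standard formula F = 2PR/(P + R) applied to each precision/recall pair; e.g., the CRFs event-level column:

```python
# Harmonic mean of precision and recall (the standard F-score).
def f_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

# CRFs event-level column: precision 89.8%, recall 86.3%.
f = f_score(0.898, 0.863)
print(round(100 * f, 2))  # → 88.02, matching the table
```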
5. Conclusion
In this paper, we propose a CRFs-based semantic event detection approach. The method first extracts mid-level keyword sequences from low-level multi-modality features, and then employs CRFs to jointly infer semantic event labels from multiple keyword sequences. Compared with traditional approaches, e.g., HMMs and SVM, CRFs offer several advantages, including the ability to relax strong independence assumptions in the state transitions and to avoid a fundamental limitation of directed graphical models. The experiments on soccer highlight detection demonstrate that CRFs achieve better performance than SVM and HMM, particularly in the slice-level measure.
It is worth pointing out that the proposed method can be broadly applied to other kinds of event-based video applications, e.g. video surveillance, video summarization and content-based retrieval.
6. Acknowledgements
The authors thank Yang Bo, Wang Fei, Sun Yi, Prof. Sun Lifeng, and Prof. Ou Zhijian of the Departments of CS and EE of Tsinghua University for their research on mid-level audiovisual keyword detection.
7. References
[1]