
Semantic Event Detection using Conditional Random Fields

Tao Wang¹, Jianguo Li¹, Qian Diao¹, Wei Hu¹, Yimin Zhang¹, Carole Dulong²
¹ Intel China Research Center, Beijing, P.R. China, 100080
² Intel Corporation, Santa Clara, CA 95052, USA
{tao.wang, jianguo.li, qian.diao, wei.hu, yimin.zhang, carole.dulong}@intel.com

Abstract
Semantic event detection has been an active research field in video mining in recent years. One of the challenging problems is how to effectively model the temporal and multi-modality characteristics of video. In this paper, we employ Conditional Random Fields (CRFs) to fuse temporal multi-modality cues for event detection. CRFs are undirected probabilistic models designed for segmenting and labeling sequence data. Compared with traditional SVMs and Hidden Markov Models (HMMs), CRFs-based event detection offers several particular advantages, including the ability to relax the strong independence assumptions on state transitions and to avoid a fundamental limitation of directed graphical models. To detect events, we use a three-level framework based on multi-modality fusion and mid-level keywords: the first level extracts audiovisual features, the mid level detects semantic keywords, and the high level infers semantic events from multiple keyword sequences. Experimental results on soccer highlights detection demonstrate that CRFs achieve better performance, particularly in the slice-level measure.

1. Introduction
With the advances in storage capability, computing power, and multimedia technology, research on semantic event detection has become increasingly active in recent years, with applications such as video surveillance, sports highlight detection, TV/movie abstraction, and home video retrieval. Through event detection, consumers can quickly retrieve specific segments from long videos and save much browsing time. There is a large literature on semantic event detection [1][3][5][11][16]. However, semantic event detection remains a challenging problem due to the large semantic gap and the difficulty of modeling the temporal and multi-modality characteristics of video.

In general, two kinds of methods are adopted in previous work, i.e., segments classification and sequence learning. The Segments Classification Approach (SCA) treats event detection as a classification problem: it first selects possible event segments, e.g., with a sliding data window, and then adopts classification algorithms to predict the semantic label of each segment. Duan et al. [11] used game-specific rules to classify events. Although a rule system is intuitive and yields adequate results, it lacks scalability and robustness. Wang et al. used SVMs to detect events [10]. SVM is a good classifier, particularly for small training sets; however, it may not sufficiently characterize the relations and temporal layout of features. Some researchers utilized Naive Bayes classifiers to detect specific events [1]. Naive Bayes assumes that features are independent of each other and consequently neglects the important relationships among features. SCA methods are simple and effective but have two limitations. First, they cannot characterize long-term dependencies within video streams, and thus may be myopic about the impact of the current decision on later decisions [9]. Second, it is difficult for them to determine accurate event boundaries, i.e., the starting and ending times of detected events.
Compared with segments classification, the Sequence Learning Approach (SLA) uses probabilistic models to characterize the temporal video sequence. SLA treats event detection as a sequence labeling problem, i.e., decoding the most probable hidden state (semantic event label) sequence from the observed sequence (video). The most popular model is the Hidden Markov Model (HMM), which provides well-understood training and decoding algorithms for sequence labeling [4][5]. While enjoying much historical success, HMMs suffer from one principal drawback: the structure of the HMM is often a poor model of the true process producing the data. Part of the problem stems from the Markov property: any relationship between two separated labels (e.g., $y_0$ and $y_3$) must be communicated via the intervening labels (e.g., $y_1$ and $y_2$). A first-order Markov model, in which $p(y_t)$ depends only on $y_{t-1}$, cannot capture these kinds of relationships. This limitation is one of the main motivations to consider Conditional Random Fields (CRFs) as an alternative [9].
CRFs are undirected graphical models which specify the joint probability of possible label sequences given an observation sequence. Although they encompass HMM-like models, CRFs are more expressive because they allow more dependencies on the observation sequence. In addition, the chosen features may represent attributes at different levels of granularity of the same observations, or aggregate properties of the observation sequence [9][6].
CRFs have been successfully applied in text processing, such as named entity extraction [2] and shallow text parsing [6], but there has been little work applying them to video processing. In this paper, we employ Conditional Random Fields (CRFs) for semantic event detection. In our approach, we take advantage of mid-level keywords [11] to narrow the semantic gap between low-level features and high-level events. The method first detects mid-level semantic keywords from low-level audio/visual features; CRFs then infer semantic event labels from the multiple mid-level keyword sequences.
The rest of this paper is organized as follows. In Section 2, we briefly describe CRF concepts and the relevant algorithms. In Section 3, we propose the CRFs-based semantic event detection approach. To evaluate the effectiveness of this method, extensive experiments over 12.6 hours of soccer video are reported in Section 4. Finally, concluding remarks are given in Section 5.

2. Conditional random fields


In this section, we describe the basic concepts of CRFs and the relevant algorithms for sequence labeling. More details on CRFs can be found in [6].

2.1 CRFs concepts

Contrary to directed HMMs, CRFs are undirected probabilistic graphical models, as shown in Fig. 1(b). Let $o = [o_1, o_2, \ldots, o_T]$ be the input observation sequence and $s = [s_1, s_2, \ldots, s_T]$ be the corresponding label sequence, e.g., the sequence of event labels ($s_t = 1$ represents an event and $s_t = 0$ a non-event, $t = 1, 2, \ldots, T$). Conditioned on the observation sequence $o$ and the surrounding labels $s_{\setminus t}$, the random variable $s_t$ obeys the Markov property

$$p(s_t \mid s_{\setminus t}, o) = p(s_t \mid s_w, o, w \sim t),$$

where $w \sim t$ means that $w$ is a neighbor of the node $s_t$ in the graph model, i.e., lies in its Markov blanket.

Fig 1. (a) HMM model and (b) CRFs model. [Figure: chains of states $s_{t-1}, s_t, s_{t+1}$ over observations $o_{t-1}, o_t, o_{t+1}$; edges are directed in the HMM and undirected in the CRF.]

The conditional probability $p_\theta(s \mid o)$ of a CRF is defined as follows:

$$p_\theta(s \mid o) = \frac{1}{Z(o)} \exp\left( \sum_{t=1}^{T} F(s, o, t) \right) \quad (1)$$

where $Z(o) = \sum_{s} \exp\left( \sum_{t=1}^{T} F(s, o, t) \right)$ is the normalization constant and $F(s, o, t)$ is the feature function at position $t$. For first-order CRFs, the feature function $F(s, o, t)$ is given by:

$$F(s, o, t) = \sum_i \lambda_i f_i(s_{t-1}, s_t) + \sum_j \mu_j g_j(o, s_t) \quad (2)$$

in which $f_i(\cdot)$ and $g_j(\cdot)$ are the transition and state feature functions respectively, and $\theta = \{\lambda_1, \lambda_2, \ldots; \mu_1, \mu_2, \ldots\}$ are the learned weights associated with $f_i(\cdot)$ and $g_j(\cdot)$.

Compared with the observation probability $p(o_t \mid s_t)$ of an HMM, the state feature functions $g_j(o, s_t)$ of a CRF depend not only on the current observation $o_t$ but also on past and future observations in $o$. Although an SVM is able to make decisions based on dependent local observation segments, its objective function does not consider the state relationships in the label sequence, e.g., $f_i(s_{t-1}, s_t)$. So CRFs are, in theory, better at characterizing the sequence labeling problem than SVMs and HMMs.

2.2 Training and inference of CRFs

The parameters $\theta = \{\lambda_1, \lambda_2, \ldots; \mu_1, \mu_2, \ldots\}$ of a CRF are trained by the maximum likelihood estimation (MLE) approach. Given $N$ training sequences $\{(s^j, o^j)\}$, the log-likelihood $L$ is written as

$$L = \sum_{j=1}^{N} \log p_\theta(s^j \mid o^j) = \sum_{j=1}^{N} \left( \sum_{t=1}^{T} F(s^j, o^j, t) - \log Z(o^j) \right) \quad (3)$$

It has been shown that the L-BFGS quasi-Newton method converges much faster when learning the parameters $\theta$ than traditional iterative scaling algorithms such as GIS and IIS [6]. L-BFGS avoids explicit estimation of the Hessian matrix of the log-likelihood by building up an approximation of it using successive evaluations of the gradient.

After training, the most probable label sequence $s^*$ given an observation sequence $o$ is inferred by:

$$s^* = \arg\max_{s} p_\theta(s \mid o) = \arg\max_{s} \exp\left( \sum_{t=1}^{T} F(s, o, t) \right) \quad (4)$$

$s^*$ can be computed efficiently by the Viterbi algorithm, which finds the maximizing state at each position of the sequence using a dynamic-programming procedure [7].
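To make equations (1)-(4) concrete, the following is a minimal NumPy sketch of inference in a first-order linear-chain CRF. It is an illustration under our own simplifications, not the FlexCRFs implementation: the weighted feature sums of equation (2) are assumed to be precomputed into a transition score matrix `trans` and per-position state scores `state_scores`.

```python
import numpy as np

def viterbi_decode(trans, state_scores):
    """Most probable label sequence s* (eq. 4) for a linear-chain CRF.

    trans:        (S, S) array, trans[a, b] = sum_i lambda_i * f_i(a, b)
    state_scores: (T, S) array, state_scores[t, b] = sum_j mu_j * g_j(o, b)
    """
    T, S = state_scores.shape
    delta = state_scores[0].copy()          # best score ending in each state
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + trans + state_scores[t]   # (S, S) candidates
        backptr[t] = cand.argmax(axis=0)    # best previous state per state
        delta = cand.max(axis=0)
    path = [int(delta.argmax())]            # best final state
    for t in range(T - 1, 0, -1):           # follow back-pointers
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]

def log_partition(trans, state_scores):
    """log Z(o), the normalizer of eq. (1), by the forward algorithm."""
    alpha = state_scores[0].copy()
    for t in range(1, len(state_scores)):
        alpha = np.logaddexp.reduce(alpha[:, None] + trans, axis=0) \
                + state_scores[t]
    return float(np.logaddexp.reduce(alpha))
```

With two states (0: non-highlight, 1: highlight), `viterbi_decode` returns the labeling of equation (4); subtracting `log_partition(...)` from a labeling's raw score $\sum_t F(s, o, t)$ gives its log-probability under equation (1).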

3. CRFs based Event Detection


In this section, we propose the CRFs-based event detection approach. The method first detects mid-level semantic keywords from low-level audio/visual features, and then CRFs jointly infer semantic event labels from the multiple keyword sequences.


3.1 Framework of event detection


In general, events appear with particular patterns. Similar to the idea of going from content (word) to context (sentence) in text mining, the events of a video are characterized by certain multimedia content elements and their temporal layout. For example, music and camera motion are two kinds of important content elements that often appear in movies; video segments with high tempo, extreme emotion, and tense music usually indicate highlight scenes. By semantic keyword/concept detection, the content elements can be extracted as keywords; their temporal layout then constitutes the keyword sequences. So event detection can be intuitively viewed as a sequence labeling problem, i.e., inferring the semantic event labels $s$ from the observed keyword sequences $o^k = [x_{k1}, x_{k2}, \ldots, x_{kN}]$, with $k = 1, \ldots, K$ keywords and $N$ time slices.
Fig. 2 illustrates our event detection framework. The framework consists of three levels: a low-level feature extraction module, a mid-level semantic keyword detection module, and a high-level event detection module. In processing, the low-level module first extracts audio/visual features from the video stream. Then the mid-level module detects semantic keywords from the low-level features. Finally, the high-level module infers events in the semantic space of these keyword sequences.
In this framework, we take advantage of mid-level keywords to bridge the large semantic gap between low-level features and high-level events [11]. Since mid-level keywords convert video streams into text-like symbol sequences, high-level event detection can be conveniently processed and analyzed like text mining.

Fig 2. Overview of the event detection framework. [Block diagram: video → low-level feature extraction (visual features, audio features, ...) → multimodal features → mid-level keyword detection → keyword streams x1, x2, ..., xN → high-level event detection → prediction result]

3.2. Mid-level keyword detection


The mid-level module detects relevant semantic keywords from the low-level audio/visual features of videos. Keywords denote basic semantic concepts in a frame or a shot, such as subject (face, car, building, road, sky, water, grass), place (indoor and outdoor), sound type (silence, speech, music, applause, explosion), camera motion (pan, tilt, zoom), and view type (global view, medium view, and close-up view). Generally, the relevant keywords depend on the application. For instance, car and speed are relevant keywords for vehicle traffic surveillance, while face and speech are important for cast indexing of movies.
In the case of soccer highlights detection, we detect the following multi-modal keywords for high-level semantics inference. In the visual domain, there are three relevant keywords: semantic view type, play-position, and replay. In the audio domain, we detect two significant keywords: the commentator's excited speech and the referee's whistle. Details of keyword generation are described as follows:
- $x^1$ View type: View type plays a critical role in video understanding. We predefine four kinds of view types: global view, medium view, close-up view, and out of view [1][11], and use the playfield area and player size to determine the view type of each key frame. The corresponding low-level processing includes playfield segmentation by the HSV dominant color of the playfield and connected-component analysis. The dominant color of the playfield is adaptively trained by accumulating HSV color histograms over many frames [14]. Fig. 3 shows examples of these view types.
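As a rough illustration of the dominant-color step (the histogram granularity and the ratio thresholds below are our own choices, not the paper's), the following sketch accumulates HSV hue histograms over sampled frames to learn the playfield color [14] and then labels a key frame by its playfield-pixel ratio; the paper additionally uses player size, which is omitted here.

```python
import numpy as np
import cv2  # OpenCV

def learn_dominant_hue(frames, bins=64):
    """Accumulate hue histograms over many frames; the peak bin
    approximates the playfield (grass) color."""
    hist = np.zeros(bins)
    for f in frames:                    # f: BGR uint8 image
        hue = cv2.cvtColor(f, cv2.COLOR_BGR2HSV)[:, :, 0].astype(int)
        hist += np.bincount((hue * bins // 180).ravel(), minlength=bins)
    return int(hist.argmax()) * 180 // bins   # representative hue value

def view_type(frame, dom_hue, tol=10):
    """Classify a key frame by the fraction of playfield pixels.
    The ratio thresholds are illustrative only."""
    hue = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)[:, :, 0].astype(int)
    field_ratio = float(np.mean(np.abs(hue - dom_hue) < tol))
    if field_ratio > 0.6:
        return "global"
    if field_ratio > 0.3:
        return "medium"
    if field_ratio > 0.1:
        return "close-up"
    return "out-of-view"
```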
- $x^2$ Play-position: Play-position indicates potential highlights in field-ball sports video (near the penalty area, or a transition from the middle area to the penalty area). We classify the play-position in global views into five regions, as shown in Fig. 4(b): LL means the left region, including the two corners and the penalty area in the left half-field; ML refers to the middle-left region, and MM to the middle region. In the implementation, we first run a Hough transform to detect the playfield line-marks, including the boundary lines, the middle line, and the penalty-box lines, and then use a decision tree to determine the play-position according to the lines' slopes and positions [1][10].
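A minimal sketch of the line-mark step, assuming the playfield has already been segmented into a binary mask; OpenCV's probabilistic Hough transform stands in for the paper's Hough implementation, and the illustrative rule below replaces the trained decision tree of [1][10].

```python
import numpy as np
import cv2

def detect_line_marks(field_mask):
    """Detect playfield line-marks (boundary, middle, penalty-box lines)
    with a probabilistic Hough transform on the mask edges."""
    edges = cv2.Canny(field_mask, 50, 150)
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=80,
                            minLineLength=60, maxLineGap=10)
    return [] if lines is None else [tuple(l[0]) for l in lines]

def play_position(lines, frame_width):
    """Toy stand-in for the decision tree over line slopes/positions:
    a near-vertical line close to a frame border suggests a penalty
    area (LL/RR); otherwise default to the middle region (MM)."""
    for x1, y1, x2, y2 in lines:
        if abs(x2 - x1) < 0.2 * abs(y2 - y1) + 1:   # near-vertical line
            xc = (x1 + x2) / 2
            if xc < 0.25 * frame_width:
                return "LL"
            if xc > 0.75 * frame_width:
                return "RR"
    return "MM"
```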
- $x^3$ Replay: Replay is an important video editing device in broadcast programs. It is usually used to play back important or interesting segments in a slow-motion pattern so the audience can enjoy the details. Generally, a logo flies across the screen at high speed at the beginning and end of each replay (see Fig. 3, right). We detect logo patterns by color and optical-flow motion features, and then identify similar replay segments by dynamic programming [17].
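The paper's replay detector combines color and optical-flow features with dynamic programming [17]; the sketch below keeps only the color cue as a simplified illustration: frames whose histograms match a known logo template are flagged, merged into flashes, and successive flashes are paired into candidate replay segments.

```python
import cv2

def logo_frames(frames, logo_hist, thresh=0.8):
    """Indices of frames whose HSV histogram correlates with the logo
    template (color cue only; motion features are omitted here)."""
    hits = []
    for i, f in enumerate(frames):
        hsv = cv2.cvtColor(f, cv2.COLOR_BGR2HSV)
        h = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
        cv2.normalize(h, h)
        if cv2.compareHist(h, logo_hist, cv2.HISTCMP_CORREL) > thresh:
            hits.append(i)
    return hits

def replay_segments(hits, fps, gap=5, min_len=2.0, max_len=30.0):
    """Merge consecutive hit frames into flashes, then pair successive
    flashes as candidate (logo-in, logo-out) replay segments."""
    starts, last = [], None
    for i in hits:
        if last is None or i - last > gap:
            starts.append(i)
        last = i
    return [(a, b) for a, b in zip(starts, starts[1:])
            if min_len <= (b - a) / fps <= max_len]
```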
- $x^{4,5}$ Audio keywords: We detect two types of audio keywords: the commentator's excited speech and the referee's whistle. They have strong relations to soccer highlights such as goal, shot, and foul. A Gaussian mixture model (GMM) is used to detect these two keywords from low-level audio features, including Mel-frequency cepstral coefficients (MFCC), energy, and pitch [11][15].
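A sketch of the GMM keyword detectors; librosa and scikit-learn are our library choices, and zero-crossing rate stands in for the pitch feature, so treat the parameters as placeholders rather than the paper's configuration.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def audio_features(wav_path, sr=16000):
    """Per-frame MFCC + energy + a cheap pitch stand-in (ZCR)."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # (13, T)
    rms = librosa.feature.rms(y=y)                        # (1, T)
    zcr = librosa.feature.zero_crossing_rate(y)           # (1, T)
    T = min(mfcc.shape[1], rms.shape[1], zcr.shape[1])
    return np.vstack([mfcc[:, :T], rms[:, :T], zcr[:, :T]]).T  # (T, 15)

def train_models(feats_by_class, n_components=8):
    """One GMM per keyword class, e.g. excited speech, whistle,
    and a background class."""
    return {c: GaussianMixture(n_components).fit(X)
            for c, X in feats_by_class.items()}

def classify(models, X):
    """Label each frame by the GMM with the highest log-likelihood."""
    names = list(models)
    ll = np.stack([models[c].score_samples(X) for c in names])  # (C, T)
    return [names[i] for i in ll.argmax(axis=0)]
```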
One thing worth pointing out is that the proposed
mid-level module is an open framework. More
advanced features and mid-level keywords can readily
be incorporated for special applications.

Fig. 3. From left to right: examples of global view, medium view, close-up view, out of view, and a replay logo.

Fig. 4. (a) Hough line detection on a segmented playfield. (b) The five detected regions of the playfield.

3.3. High-level event detection using CRFs


The goal of semantic event detection is to find meaningful events (event type) and their starting and ending times (event boundary). For classifier-based event detection, SCA depends on a candidate event region, e.g., a sliding window, to decide whether an event happened in the corresponding video segment. Since different events happen at unpredictable times with varying durations, it is difficult for SCA to determine accurate candidate event regions in a whole video without prior knowledge. For CRFs-based event detection, labeling the mid-level keyword sequences automatically decodes the semantic event labels and outputs both the event type and the detailed event region according to the whole observed keyword sequence. CRFs-based event detection is therefore more convenient in practice and more likely to achieve better performance through joint inference over entire video sequences.
For event detection with CRFs, the transition feature functions $f_i(s_{t-1}, s_t)$ (i.e., the edges in the undirected graph) are defined by the model. Users only need to specify the state feature functions $g_j(o, s_t)$. At each time position t of a given sequence, we assume a Markov blanket and extract combined state features within that Markov-blanket domain. Given two keyword sequences, the context-combination state features at position t are defined and illustrated in Fig. 5 and Table 1. Here, v and w are two kinds of mid-level keywords, and $s_t$ is the event label. Clearly, CRFs are more expressive for modeling temporal sequences than HMMs and SVMs, because they allow arbitrary dependencies on the observation sequence. In addition, the chosen features may be at different levels of granularity of the same observations or aggregate properties of the observation sequence.
The CRFs model is flexible and extensible. For different applications, one can incorporate different keyword sequences and state features $g_j(o, s_t)$, and set an appropriate Markov-blanket size according to prior knowledge or feature selection methods. In the case of soccer highlight detection, we empirically set the Markov-blanket size to 5, take the view type $x^1$ as one keyword sequence v, and encode all the other mid-level keywords $[x^2, x^3, x^4, x^5]$ as another sequence w. We define $s_t = 1$ if the current slice t belongs to a highlight event, and $s_t = 0$ otherwise.
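The paper does not spell out the encoding of $[x^2, x^3, x^4, x^5]$ into w; judging from the four-character codes in Fig. 5 (e.g., 'LLXE', 'XXRE'), one plausible scheme is two characters for the play-position plus one each for replay and excited speech, with 'X' marking absence. The sketch below is therefore our guess, not the paper's code:

```python
def encode_w(play_pos, replay, excited):
    """Fuse per-slice keyword streams into one composite symbol,
    e.g. ('LL', False, True) -> 'LLXE'. A whistle stream could
    extend the code by a fifth character in the same way."""
    return [(p or "XX")
            + ("R" if r else "X")
            + ("E" if e else "X")
            for p, r, e in zip(play_pos, replay, excited)]
```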
Fig 5. Markov blanket of CRFs at position t with size 5. [Figure: event labels $s_{t-3}, \ldots, s_t, s_{t+1}$ above the video sequence; keyword sequence v (view types): ... G G C ...; keyword sequence w: MLXX LLXX LLXX LLXE XXRE ...] In this example of soccer highlights detection, G: global view; M: medium view; C: close-up view; LL: left region; ML: middle-left region; R: replay; E: excited speech.
Table 1. CRFs feature templates at time position t. The transition template is paired with $s_{t-1} s_t$; each state-feature template is paired with the current label $s_t$.

Template for transition features:
  s_{t-1} s_t

Templates for state features (each combined with s_t):
  w_{t-2}, w_{t-1}, w_t, w_{t+1}, w_{t+2}, w_{t-1}w_t, w_t w_{t+1}
  v_{t-2}, v_{t-1}, v_t, v_{t+1}, v_{t+2}, v_{t-2}v_{t-1}, v_{t-1}v_t, v_t v_{t+1}, v_{t+1}v_{t+2}
  v_{t-2}v_{t-1}v_t, v_{t-1}v_t v_{t+1}, v_t v_{t+1}v_{t+2}
  v_{t-1}v_t v_{t+1} w_t
  v_{t-1}w_{t-1}, v_t w_t, v_{t-1}v_t w_{t-1}, v_{t-1}v_t w_t, v_{t-1}w_{t-1}w_t, v_t w_{t-1}w_t
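Each template in Table 1 expands, at every position t, into binary state features paired with the label $s_t$. The sketch below shows how such context features could be emitted as strings in the style a FlexCRFs-like toolkit consumes; the naming scheme and boundary padding are ours, and only a representative subset of the mixed v/w templates is shown.

```python
def state_features(v, w, t):
    """Expand the Table 1 context templates at position t into feature
    strings; out-of-range context is padded with a boundary symbol."""
    def at(seq, i):
        return seq[i] if 0 <= i < len(seq) else "#"
    f = []
    # unigram and bigram contexts over w
    f += [f"w{d}={at(w, t + d)}" for d in (-2, -1, 0, 1, 2)]
    f += [f"w-1w0={at(w, t - 1)}|{at(w, t)}",
          f"w0w1={at(w, t)}|{at(w, t + 1)}"]
    # unigram, bigram, and trigram contexts over v
    f += [f"v{d}={at(v, t + d)}" for d in (-2, -1, 0, 1, 2)]
    f += [f"v{d}v{d+1}={at(v, t + d)}|{at(v, t + d + 1)}"
          for d in (-2, -1, 0, 1)]
    f += [f"v{d}..v{d+2}={at(v, t + d)}|{at(v, t + d + 1)}|{at(v, t + d + 2)}"
          for d in (-2, -1, 0)]
    # representative mixed v/w combinations
    f.append(f"v-1v0v1w0={at(v, t - 1)}|{at(v, t)}|{at(v, t + 1)}|{at(w, t)}")
    f += [f"v-1w-1={at(v, t - 1)}|{at(w, t - 1)}",
          f"v0w0={at(v, t)}|{at(w, t)}"]
    return f
```

During training, each emitted string, conjoined with a candidate value of $s_t$, becomes one binary feature $g_j(o, s_t)$ whose weight $\mu_j$ is learned.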

4. Experiments

In our experiments, we used the libSVM [13], Intel OpenPNL [8], and FlexCRFs [7] toolkits for the training/inference of the linear SVM, first-order HMMs, and first-order CRFs, respectively. To demonstrate the effectiveness of the proposed approach, experiments on soccer highlights detection were conducted on eight soccer matches totaling 12.6 hours of video. Highlights are video segments in which the user has special or elevated interest. We define the semantic events goal, shot, foul, free kick, and corner kick as highlight events, and all others as non-highlights. Five matches are used as training data and the others as testing data. The ground truth is labeled manually.

To compare the performance of SVM, HMM, and CRFs fairly, we input the same video segments for highlights detection. Since televised soccer generally uses a close-up view or a replay as a break to emphasize a highlight event, we first filter the multiple keyword sequences to find candidate event segments, i.e., play-break units, using the algorithm described in [1]. Each keyword stream of a play-break unit is then time-sampled and represented by an N-dimensional feature vector. This vector is used by all approaches to detect events, where $o^k = [x_{k1}, x_{k2}, \ldots, x_{kN}]$, $k = 1, 2, \ldots, K$, with K = 5 keywords and N = 40 time slices.

The most widely used performance measures in information retrieval are precision (Pr) and recall (Re). Based on Pr and Re, the F-score = 2*Pr*Re/(Pr+Re) evaluates the overall performance. Since events always happen in particular time regions, a good detector not only predicts the correct event type, i.e., the event label, but also detects accurate event boundaries covering the full event content. We therefore further define a segment-level measure and a slice-level measure to evaluate event detection performance.

A predicted video segment is counted as correct under the segment-level measure if it overlaps the real event region by at least 80%. Similarly, a predicted video slice is counted as correct under the slice-level measure if it lies within the real event region. The performance over a whole video is then the average over all predicted segments/slices. The segment-level measure is suitable for evaluating recall because it does not depend on accurate event boundaries. The slice-level measure, on the other hand, better evaluates how accurate the predicted event boundaries are.

Tables 2 and 3 summarize the highlights detection performance in the segment-level and slice-level measures respectively. From the two tables, the following observations can be made:

- For the segment-level measure, CRFs achieve slightly better performance than the SVM approach and clearly better performance than HMMs, since CRFs relax the strong first-order Markov independence assumptions and avoid a fundamental limitation of directed graphical models. The lower precision of HMMs demonstrates their inability to capture long-term interactions in sequence labeling.

- For the slice-level measure, CRFs greatly outperform all other approaches, since sequence learning approaches can automatically predict event boundaries through sequence labeling. This also explains why CRFs obtain the highest precision in the segment-level measure of Table 2. SCA approaches (SVM), in contrast, cannot detect accurate event boundaries because they depend on prior knowledge to decide possible event regions. The lower precision of HMMs further demonstrates their deficiency in semantic event detection.

Table 2. Comparison on soccer highlights detection in the segment-level measure. The ground truth contains 205 highlights.

Method   Miss   False   Precision   Recall   F-Score
SVM      26     27      86.9%       87.3%    87.10%
HMM      30     45      79.5%       85.4%    82.34%
CRFs     28     20      89.8%       86.3%    88.02%

Table 3. Comparison on soccer highlights detection in the slice-level measure.

Method   Precision   Recall   F-Score
SVM      73.4%       60.4%    66.3%
HMM      70.0%       60.7%    65.3%
CRFs     78.5%       71.6%    74.9%
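To make the two measures concrete, here is a small sketch of how they could be computed from predicted and ground-truth intervals; the interval handling and the interpretation of the 80% rule (overlap measured against the predicted segment's length) are our assumptions.

```python
def overlap(a, b):
    """Length of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def segment_level(pred, truth, min_overlap=0.8):
    """Segment-level precision/recall: a predicted segment counts as
    correct if >= 80% of it overlaps some real event region."""
    def correct(p):
        return any(overlap(p, g) >= min_overlap * (p[1] - p[0])
                   for g in truth)
    tp = sum(correct(p) for p in pred)
    detected = sum(any(overlap(p, g) > 0 and correct(p) for p in pred)
                   for g in truth)
    return tp / max(len(pred), 1), detected / max(len(truth), 1)

def slice_level(pred, truth):
    """Slice-level precision/recall from 0/1 per-slice label lists."""
    tp = sum(p and g for p, g in zip(pred, truth))
    return tp / max(sum(pred), 1), tp / max(sum(truth), 1)

def f_score(pr, re):
    """F-score = 2*Pr*Re/(Pr+Re), as defined in Section 4."""
    return 2 * pr * re / (pr + re) if pr + re else 0.0
```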

Both SCA (SVM) and SLA (HMMs and CRFs) have their advantages and disadvantages, and suit different applications. For example, if we only care about the rough position of each highlight, or if prior knowledge can filter possible event segments accurately, then SCA is sufficient. However, if we require precise event boundaries but lack the prior knowledge needed to segment the videos, then we have to use SLA.

5. Conclusion
In this paper, we propose a CRFs-based semantic event detection approach. The method first extracts mid-level keyword sequences from low-level multi-modality features, and then employs CRFs to jointly infer semantic event labels from the multiple keyword sequences. Compared with traditional approaches, e.g., HMMs and SVM, CRFs offer several particular advantages, including the ability to relax the strong independence assumptions on state transitions and to avoid a fundamental limitation of directed graphical models. The experiments on soccer highlights detection demonstrate that CRFs achieve better performance than SVM and HMM, particularly in the slice-level measure.
It is worth pointing out that the proposed method can be broadly applied to other kinds of event-based video applications, e.g., video surveillance, video summarization, and content-based retrieval.

6. Acknowledgements
The authors are grateful to Yang Bo, Wang Fei, Sun Yi, Prof. Sun Lifeng, and Prof. Ou Zhijian of the Departments of CS and EE of Tsinghua University for their research on mid-level audiovisual keyword detection.

7. References
[1] A. Ekin, A. M. Tekalp, and R. Mehrotra. Automatic soccer video analysis and summarization. IEEE Trans. on Image Processing, 12(7):796-807, 2003.
[2] A. McCallum. Efficiently inducing features of conditional random fields. In Proc. of the Conf. on Uncertainty in Artificial Intelligence (UAI), 2003.
[3] C. G. Snoek and M. Worring. Multimedia event-based video indexing using time intervals. IEEE Trans. on Multimedia, 7(4):638-647, 2005.
[4] D. Q. Phung, T. V. Duong, S. Venkatesh, and H. H. Bui. Topic transition detection using hierarchical hidden Markov and semi-Markov models. In ACM Multimedia, pp. 11-20, 2005.
[5] D. Zhang, D. Gatica-Perez, S. Bengio, and I. McCowan. Semi-supervised adapted HMMs for unusual event detection. In IEEE Conf. on CVPR, vol. 1, pp. 611-618, 2005.
[6] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In Proc. of HLT/NAACL, 2003.
[7] FlexCRFs: Flexible Conditional Random Fields. http://www.jaist.ac.jp/~hieuxuan/flexcrfs/flexcrfs.html
[8] Intel Open Source Probabilistic Network Library (OpenPNL). http://www.intel.com/research/mrl/pnl
[9] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proc. of ICML, pp. 282-289, 2001.
[10] J. Wang, C. Xu, E. Chng, K. Wan, and Q. Tian. Automatic replay generation for soccer video broadcasting. In ACM Multimedia, 2004.
[11] L. Duan, M. Xu, T.-S. Chua, Q. Tian, and C. Xu. A mid-level representation framework for semantic sports video analysis. In ACM Multimedia, 2003.
[12] L. Xie, S.-F. Chang, A. Divakaran, and H. Sun. Structure analysis of soccer video with hidden Markov models. In Proc. of ICASSP, 4:4096-4099, 2002.
[13] LIBSVM: A Library for Support Vector Machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[14] M. Luo, Y. Ma, and H. J. Zhang. Pyramidwise structuring for soccer highlight extraction. In ICICS-PCM, pp. 1-5, 2003.
[15] M. Xu, N. Maddage, C. Xu, M. Kankanhalli, and Q. Tian. Creating audio keywords for event detection in soccer video. In IEEE ICME, vol. 2, pp. 281-284, 2003.
[16] N. Haering, R. J. Qian, and M. I. Sezan. A semantic event-detection approach and its application to detecting hunts in wildlife video. IEEE Trans. on Circuits and Systems for Video Technology, 10(6):857-868, 2000.
[17] X. Yang, P. Xue, and Q. Tian. Repeated video clip identification system. In ACM Multimedia, pp. 227-228, 2005.

