
TIME INTERVAL MAXIMUM ENTROPY BASED EVENT INDEXING IN SOCCER VIDEO

Cees G.M. Snoek and Marcel Worring

Intelligent Sensory Information Systems, University of Amsterdam


Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
{cgmsnoek, worring}@science.uva.nl

This research is sponsored by the ICES-KIS MIA project and TNO.

ABSTRACT

Multimodal indexing of events in video documents poses problems with respect to representation, inclusion of contextual information, and synchronization of the heterogeneous information sources involved. In this paper we present the Time Interval Maximum Entropy (TIME) framework that tackles the aforementioned problems. To demonstrate the viability of TIME for event classification in multimodal video, an evaluation was performed on the domain of soccer broadcasts. It was found that by applying TIME, the amount of video a user has to watch in order to see almost all highlights can be reduced considerably.

1. INTRODUCTION

Effective and efficient extraction of semantic indexes from video documents requires simultaneous analysis of visual, auditory, and textual information sources. In the literature several such methods have been proposed, addressing different types of semantic indexes; see [12] for an extensive overview. Multimodal methods for detection of semantic events are still rare, notable exceptions being [3, 7, 8, 10]. For the integration of the heterogeneous data sources a statistical classifier gives the best results [12], compared to heuristic methods, e.g. [3]. In particular, instances of the Dynamic Bayesian Network (DBN) framework have been used, e.g. [8, 10]. A first drawback of the DBN framework is that the model works with fixed common units, e.g. image frames, thereby ignoring differences in the layout schemes of the modalities, and thus proper synchronization. Secondly, it is difficult to model several asynchronous temporal context relations simultaneously. Finally, it lacks satisfactory inclusion of the textual modality.

Some limitations are overcome by using a maximum entropy framework, which has been successfully applied in diverse research disciplines, including statistical natural language processing, where it achieved state-of-the-art performance [4]. More recently it was also reported in the video indexing literature [7], indicating promising results for the purpose of highlight classification in baseball. However, the presented method lacks synchronization of multimodal information sources. We propose the Time Interval Maximum Entropy (TIME) framework that extends the standard framework with time interval relations, to allow proper inclusion of multimodal data, synchronization, and context relations. To demonstrate the viability of TIME for detection of semantic events in multimodal video documents, we evaluated the method on the domain of soccer broadcasts. Other methods using this domain exist, e.g. [2, 14]. We improve on this existing work by exploiting multimodal, instead of unimodal, information sources, and by using a classifier based on statistics instead of heuristics.

The rest of this paper is organized as follows. We first introduce event representation in the TIME framework. Then we proceed with the basics of the maximum entropy classifier in section 3. In section 4 we discuss the classification of events in soccer video, and the features used. Experiments are presented in section 5.

2. VIDEO EVENT REPRESENTATION

We view the problem of event detection in video as a pattern recognition problem, where the task is to assign to a pattern x an event or category ω, based on a set of n features (f_1, f_2, ..., f_n) derived from x. We now consider how to represent a pattern.

A multimodal video document is composed of different modalities, each with their own layout and content elements. Therefore, features have to be defined on layout specific segments. Hence, synchronization is required. To illustrate, consider figure 1. In this example a video document is represented by five time dependent features defined on different asynchronous time scales. At a certain moment an event occurs. Clues for the occurrence of this event are found in the features that have a value within the time window of the event, but also in contextual features that have a value before or after the actual occurrence of the event. As an example, consider a goal in a soccer match. Clues that indicate this event are a swift camera pan towards the goal area before the goal, an excited commentator during the goal, and a specific keyword in the closed caption afterwards.


l"
' -.
$ 1

event time ..,",..F

Figure 1 : Feature based representation of a video document


with an event (box)and contextual relations (dashed lines).
-.q
1 1
+
........ .... .~
I
................. ............ ......... ~.
f
1
f p
1........... 1 1.
...~~ ..... f
~ f
56

Hence, we need a means to express the different visual, auditory, and textual features in one fixed reference pattern without loss of their original layout scheme. For this purpose we propose to use binary fuzzy Allen time interval relations [1]. A total of thirteen possible interval relations, i.e. precedes, meets, overlaps, starts, during, finishes, equals, and their inverses, identified by i at the end, can be distinguished. A margin is introduced to account for imprecise boundary segmentation, explaining the fuzzy nature. By using fuzzy Allen relations it becomes possible to model events, context, and synchronization in one common framework. When we choose a camera shot as a reference pattern, a goal in a soccer broadcast can be modelled by a swift camera pan that precedes the current camera shot, excited speech that finishes the camera shot, and a keyword in the closed caption that precedesi the camera shot. Note that the precedes and precedesi relations require a range parameter to limit the amount of contextual information that is included in the analysis.

Thus, we choose a reference pattern, and express a co-occurrence between a pattern x and category ω by means of binary fuzzy Allen relations with binary features f_j, where each f_j is defined as:

f_j(x, \omega) = \begin{cases} 1, & \text{if } A_j(x) = \text{true and } \omega = \omega', \\ 0, & \text{otherwise,} \end{cases}    (1)

where A_j(x) is a predicate function that checks for a fuzzy Allen relation, and ω' is one of the categories, or events.

3. PATTERN CLASSIFICATION

Having defined the feature based pattern representation, we now switch to classification using the maximum entropy framework [4]. For each feature the expected value over the training set S is computed:

E_{\tilde{p}}(f_j) = \sum_{x, \omega} \tilde{p}(x, \omega) f_j(x, \omega),    (2)

where \tilde{p}(x, ω) is the observed probability of x and ω in S. This creates a model of S. To use this model for classification of unseen patterns, i.e. the reconstructed model p(ω|x), we require that the constraints for S are in accordance with the constraints of the test set T. Hence, we need the expected value of f_j with respect to the model p(ω|x):

E_p(f_j) = \sum_{x, \omega} \tilde{p}(x) p(\omega|x) f_j(x, \omega),    (3)

where \tilde{p}(x) is the observed probability of x in S. The complete model of training and test set is visualized in figure 2.

Figure 2: Simplified visual representation of the maximum entropy framework.

We are left with the problem of finding the optimal reconstructed model p*(ω|x). This is solved by restricting attention to those models p(ω|x) for which the expected value of f_j over T equals the expected value of f_j over S. From all those possible models, the maximum entropy philosophy dictates that we select the one with the most uniform distribution, assuming no evidence if nothing has been observed. The uniformity of the conditional distribution p(ω|x) can be measured by the conditional entropy, defined as:

H(p) = - \sum_{x, \omega} \tilde{p}(x) p(\omega|x) \log p(\omega|x).    (4)

The model with maximum entropy, p*(ω|x), should be selected. It is shown in [4] that there is always a unique model p*(ω|x) with maximum entropy, and that p*(ω|x) must have a form that is equivalent to:

p^*(\omega|x) = \frac{1}{Z(x)} \prod_j \alpha_j^{f_j(x, \omega)},    (5)

where α_j is the weight for feature f_j and Z is a normalizing constant, used to ensure that a probability distribution results. The values for α_j are computed by the Generalized Iterative Scaling (GIS) [5] algorithm. Since GIS relies on both E_{\tilde{p}}(f_j) and E_p(f_j) for the calculation of α_j, an approximation is used that relies only on E_{\tilde{p}}(f_j) from S [9]. This allows us to construct a classifier that depends on the training set only. Hence, by using the maximum entropy classifier we can focus on which features to use, since the relative importance of each feature is computed automatically.
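To make the representation concrete, the sketch below (Python; not code from the paper) shows how a fuzzy Allen "precedes" check with a tolerance margin and a range parameter could be turned into a binary feature of the form of equation (1). The margin, range, and interval values are illustrative assumptions.

# Illustrative sketch: a fuzzy Allen "precedes" check with a tolerance margin,
# and the binary feature of equation (1). All numeric values are assumptions.

def precedes(a, b, margin=1.0, max_range=None):
    """True if interval a = (begin, end) precedes interval b, allowing
    `margin` seconds of boundary imprecision and, optionally, limiting the
    contextual gap to `max_range` seconds (the range parameter)."""
    gap = b[0] - a[1]                      # time between end of a and begin of b
    if gap < -margin:                      # a does not end before b starts
        return False
    if max_range is not None and gap > max_range:
        return False                       # too far in the past to count as context
    return True

def binary_feature(allen_predicate, target_event):
    """Equation (1): f_j(x, omega) = 1 iff A_j(x) holds and omega = omega'."""
    def f(x, omega):
        return 1 if allen_predicate(x) and omega == target_event else 0
    return f

# Example: a camera-pan segment preceding the reference camera shot, paired
# with the 'goal' category (all times in seconds, made up for illustration).
pan_segment = (118.0, 121.5)
camera_shot = (122.0, 130.0)               # reference pattern
f_goal_pan = binary_feature(lambda x: precedes(pan_segment, x, max_range=40), "goal")
print(f_goal_pan(camera_shot, "goal"))     # -> 1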

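Along the same lines, the following sketch shows how a model of the form of equation (5) can be evaluated and how one multiplicative GIS update [5] adjusts the weights. It is an illustration only: the feature functions, categories, and the constant C are placeholders, and the actual experiments rely on the OpenNLP Maxent implementation [9], which uses the approximation mentioned above.

# Minimal sketch of equation (5) plus one GIS update step [5]; not the
# implementation used in the paper.
from math import prod

def maxent_prob(features, alpha, x, categories):
    """p*(omega|x) = (1/Z(x)) * prod_j alpha_j^{f_j(x, omega)}  (equation 5)."""
    scores = {w: prod(alpha[j] ** f(x, w) for j, f in enumerate(features))
              for w in categories}
    Z = sum(scores.values())                   # normalizing constant Z(x)
    return {w: s / Z for w, s in scores.items()}

def gis_step(features, alpha, train, categories, C):
    """One GIS update: alpha_j <- alpha_j * (E_empirical(f_j) / E_model(f_j))^(1/C).
    GIS assumes sum_j f_j(x, omega) = C; in practice a slack feature enforces this."""
    n = len(features)
    emp = [0.0] * n                            # E_ptilde(f_j), cf. equation (2)
    mod = [0.0] * n                            # E_p(f_j), cf. equation (3)
    for x, w_true in train:
        p = maxent_prob(features, alpha, x, categories)
        for j, f in enumerate(features):
            emp[j] += f(x, w_true) / len(train)
            mod[j] += sum(p[w] * f(x, w) for w in categories) / len(train)
    return [a * (e / m) ** (1.0 / C) if m > 0 else a
            for a, e, m in zip(alpha, emp, mod)]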

4. EVENT INDEXING IN SOCCER BROADCASTS

To demonstrate the viability of the TIME framework for event detection in multimodal video, we consider the domain of soccer. Typical highlight events that occur in a soccer match are goals, penalties, yellow cards, red cards, and substitutions. We take as a basic pattern a camera shot, since this is the most natural candidate for retrieval of events. In what follows, we will highlight several multimodal features used for modelling those events. Features were chosen based on reported robustness and training experiments. The parameters for individual detectors were found by experimentation. The features with fuzzy Allen relations are summarized in table 1.

Feature                 Fuzzy Allen
Camera work             during
Person close-up         0 - 40
Goal keyword            0 - 6
Card keyword            0 - 6
Substitution keyword    0 - 6
Excitement              0 - 1
Info block              20 - 80
Person block            20 - 50
Referee block           20 - 50
Coach block             20 - 50
Goal block              20 - 50
Card block              20 - 50
Substitution block
Block length

Table 1: Features with fuzzy Allen time interval relations.

4.1. Static information

Game related information, like the players who played during the match, the names of the coaches and referees, and so on, can be found on the UEFA web site. This information was extracted with a web spider and stored in a game database. This information is used to improve a visual feature detector that is explained later on.

4.2. Textual features

The teletext (closed caption) provides a textual description of what is said by the commentator during a match. This information source was analyzed for the presence of informative keywords, like yellow, red, card, goal, 1-0, 1-2, and so on. In total 30 informative stemmed keywords were defined for the various events.

4.3. Visual features

From the visual modality we extracted several features. The type of camera work [13] was computed for each camera shot. A face detector [11] was applied for detection of persons. The same detector formed the basis for a close-up detector. Close-ups are detected by relating the size of detected faces to the total frame size. Often, a director shows a close-up of a player after an event of importance. One of the most informative pieces of information in a soccer broadcast are the visual overlay blocks that give information about the game. For segmentation of the overlay blocks we used a color invariant edge detector [6] combined with a marked watershed algorithm. The segmented region was the input for a Video Optical Character Recognition (VOCR) module [13], see figure 3. Results of VOCR are noisy, but by using the game database and fuzzy string matching we were able to reliably detect team names, player names, coach names, referee names, and descriptive text like: misses next match or 3 goals in 6 matches. This information is fused to classify an overlay block as either info, person, referee, coach, goal, card, or substitution. The duration of visibility of the overlay block is also used, as we observed that substitution and info blocks are displayed longer on average.

Figure 3: Different steps in overlay segmentation: color edge detection, marked watershed, and video OCR.

4.4. Auditory features

From the auditory modality the excitement of the commentator is a valuable feature [10]. For such a feature to work properly, we require that it is insensitive to crowd cheer. This can be achieved by using a high threshold on the average energy of a fixed window, and by requiring that an excited segment has a minimum duration of 4 seconds.

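Returning to the visual features of section 4.3, the close-up detector only needs the ratio between the area of the largest detected face and the frame area. A brief sketch is given below; the 0.15 ratio threshold is an assumption rather than a value reported in the paper.

# Illustrative close-up test (section 4.3): relate detected face size to frame
# size. The 0.15 threshold is an assumption, not a value from the paper.
def is_close_up(face_boxes, frame_width, frame_height, min_ratio=0.15):
    """face_boxes: list of (x, y, w, h) rectangles from a face detector [11]."""
    frame_area = frame_width * frame_height
    largest = max((w * h for (_, _, w, h) in face_boxes), default=0)
    return largest / frame_area >= min_ratio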

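The paper does not name the fuzzy string matching technique used to snap noisy VOCR output onto entries of the game database (section 4.3). One possible realization, sketched with Python's standard difflib and made-up database entries, is:

# Sketch of matching noisy VOCR output against the game database (section 4.3).
# difflib is only one possible choice; the paper does not name the method used.
import difflib

game_database = ["Van Nistelrooy", "Kluivert", "Davids", "Ajax", "PSV"]  # example entries

def match_overlay_text(vocr_string, names=game_database, cutoff=0.6):
    """Return the best matching known name for a noisy VOCR string, or None."""
    hits = difflib.get_close_matches(vocr_string, names, n=1, cutoff=cutoff)
    return hits[0] if hits else None

print(match_overlay_text("Kluiverf"))   # -> 'Kluivert' despite the OCR error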
5. EVALUATION

For the evaluation of TIME we digitized 8 live soccer broadcasts from TV, about 12 hours in total. The videos were digitized in 704 x 576 resolution MPEG-2 format. The audio was sampled at 16 kHz with 16 bits per sample. The time stamped teletext was recorded with a teletext receiver.

                 Ground truth            Maximum Entropy
                 Total    Duration       Relevant    Duration
Goal             12       3:07           10          10:14
Yellow card      24       10:35          22          26:12
Substitution     29       8:09           25          7:36
Sum              65       21:51          57          44:02

Table 2: Evaluation results, duration in minutes:seconds.

We used a representative training set of 3 hours and a test set of 9 hours. We focussed on 3 events, ω ∈ {yellow card, substitution, goal}; red card and penalty were excluded from analysis since there was only one instance of each in the data set. We manually labelled all the camera shots as belonging to one of four categories: yellow card, goal, substitution, or unknown. We defined the different events as follows:

- Goal: begin until end of the camera shot showing the actual goal;
- Yellow card: begin of the camera shot showing the foul until the end of the camera shot that shows the referee with the yellow card;
- Substitution: begin of the camera shot showing the player that goes out, until the end of the camera shot showing the player that comes in.

Since events can cross camera shot boundaries, adjacent events are merged. Hence, we cannot use precision and recall as an evaluation measure. From a user's perspective it is unacceptable that events are missed. Therefore, we strive to find all events. Since it is difficult to exactly define the start and end of an event in soccer video, we introduce a tolerance value τ (in seconds) with respect to the boundaries of detection results. We used a τ of 7 s for all events. Results are visualized in table 2. Note that almost all events are found, and that the amount of video that a user has to watch before finding those events is only two times longer compared to the best case scenario.

The weights computed by GIS indicate that for goal and yellow card specific keywords in the closed captions, excitement with during and overlaps relations, and the presence of an overlay nearby are important features. For substitution the auditory modality is less important.

6. CONCLUSION

We combined multimodal information sources into a common framework for the purpose of event detection in video documents. The presented Time Interval Maximum Entropy framework allows for proper modelling of events, synchronization, and asynchronous contextual information relations. Our method was evaluated on the domain of soccer. Results show that a considerable reduction of watching time can be achieved. The indexed events were used to build the Goalgle soccer video search engine, see figure 4.

Figure 4: The Goalgle soccer video search engine.

7. REFERENCES

[1] M. Aiello, C. Monz, L. Todoran, and M. Worring. Document understanding for a broad class of documents. IJDAR, 2002.
[2] J. Assfalg, M. Bertini, A. Del Bimbo, W. Nunziati, and P. Pala. Soccer highlights detection and recognition using HMMs. In IEEE ICME, 2002.
[3] N. Babaguchi, Y. Kawai, and T. Kitahashi. Event based indexing of broadcasted sports video by intermodal collaboration. IEEE Trans. on Multimedia, 4(1):68-75, 2002.
[4] A. Berger, S. Della Pietra, and V. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71, 1996.
[5] J. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5):1470-1480, 1972.
[6] J. Geusebroek, R. van den Boomgaard, A. Smeulders, and H. Geerts. Color invariance. IEEE TPAMI, 23(12), 2001.
[7] M. Han, W. Hua, W. Xu, and Y. Gong. An integrated baseball digest system using maximum entropy method. In ACM Multimedia, 2002.
[8] M. Naphade and T. Huang. A probabilistic framework for semantic video indexing, filtering, and retrieval. IEEE Trans. on Multimedia, 3(1):141-151, 2001.
[9] OpenNLP Maxent. http://maxent.sf.net/.
[10] M. Petkovic, V. Mihajlovic, W. Jonker, and S. Djordjevic-Kajan. Multi-modal extraction of highlights from TV formula 1 programs. In IEEE ICME, 2002.
[11] H. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE TPAMI, 20(1):23-38, 1998.
[12] C. Snoek and M. Worring. Multimodal video indexing: A review of the state-of-the-art. Multimedia Tools and Applications. To appear.
[13] The Lowlands Team. Lazy users and automatic video retrieval tools in (the) lowlands. In TREC, 2001.
[14] D. Yow, B. Yeo, M. Yeung, and B. Liu. Analysis and presentation of soccer highlights from digital video. In ACCV, 1995.
lations. Our method was evaluated on the domain of soccer.


