
Tiny Videos: A Large Data Set

for Nonparametric Video Retrieval


and Frame Classification
Alexandre Karpenko, Student Member, IEEE, and Parham Aarabi, Senior Member, IEEE
Abstract: In this paper, we present a large database of over 50,000 user-labeled videos collected from YouTube. We develop a compact representation called tiny videos that achieves high video compression rates while retaining the overall visual appearance of the video as it varies over time. We show that frame sampling using affinity propagation, an exemplar-based clustering algorithm, achieves the best trade-off between compression and video recall. We use this large collection of user-labeled videos in
conjunction with simple data mining techniques to perform related video retrieval, as well as classification of images and video frames.
The classification results achieved by tiny videos are compared with the tiny images framework [24] for a variety of recognition tasks.
The tiny images data set consists of 80 million images collected from the Internet. These are the largest labeled research data sets of
videos and images available to date. We show that tiny videos are better suited for classifying scenery and sports activities, while tiny
images perform better at recognizing objects. Furthermore, we demonstrate that combining the tiny images and tiny videos data sets
improves classification precision in a wider range of categories.
Index Terms: Image classification, content-based retrieval, tiny videos, tiny images, data mining, nearest-neighbor methods.

1 INTRODUCTION
Only a few years ago, the majority of Web sites on the
World Wide Web consisted of static content such as text
and images. Today, video has become a prominent
component of many news, entertainment, information,
blogging, and personal Web sites. In May 2009, an estimated
20 hours of video were uploaded every minute to YouTube
[11]. YouTube is currently hosting over 100 million videos,
and it is only one of a growing list of video sharing Web sites.
Despite readily available online video content, the
majority of video retrieval and recognition research to this
day employs much smaller data sets. Frequently used
annotated research databases such as the Open Video
Project [8] contain on the order of 5,000 videos. TREC Video
Retrieval Evaluation (TRECVID) [21] uses about 200 hours
of video footage. These small data sets, while convenient,
do not capture the diversity of online video content viewed
daily by the public.
Large data sets of user-generated and annotated data are
inherently more noisy and challenging. In addition, efficient
storage space utilization and computational complexity also
play a major role. However, if some of these challenges are
overcome, then the ubiquity and diversity of online visual
data can be leveraged to aid a variety of computer vision
problems. This is the case because a very large amount of
data and simple algorithms can be used in place of
sophisticated algorithms to model the complexity of the
vision task at hand. A number of recent research papers
have used large collections of images for various computer
vision tasks [18], [22], [24].
The tiny images database [24] is currently the largest
labeled database of images available to researchers. It consists of 79,302,017 images that were collected from the Internet and down-sampled to a tiny 32 × 32 pixel size. In [24], Torralba et al. use this large data set of images and very simple nearest-
neighbor (NN) techniques to perform person detection and
localization, scene recognition, automatic image annotation,
as well as image colorization and orientation detection.
In this paper, we present a new method for using videos
to classify objects, scenes, people, and activities. We
demonstrate that a database of 52,159 videos collected from
YouTube and compressed to tiny size can be used to classify
a wide range of categories using very simple nearest-
neighbor techniques. Furthermore, our representation of
tiny videos is compatible with the tiny image representa-
tion. This allows us to not only compare, but also combine
both data sets for a variety of classification tasks. We show
that tiny videos perform better than tiny images for
classification tasks involving sports activities and scenery.
We further demonstrate that by combining both data sets,
classification performance is improved for a wider range of
categories and visual appearances.
This paper is organized into three parts. In Section 2, we
discuss our data set of YouTube videos, the tiny videos
representation, and the similarity metrics used for duplicate
frame detection and classification. In Section 3, we use our
similarity metrics for content-based copy detection. Section 4
presents the main application of our video data set: image
and video frame classification. Sections 4.3 and 4.5 present
an analysis of the classification performance of tiny videos
and tiny images for a variety of categorization tasks. Finally,
in Section 5, we conclude with a discussion of directions for
future work.
2 THE TINY VIDEOS REPRESENTATION
In this section, we introduce our large database of over
50,000 videos. We discuss the video collection procedure
and develop the tiny video representation, in particular our
approach for compressing the temporal dimension of
videos. We then discuss the similarity metrics employed
and use them to characterize some properties of our data set.
2.1 Video Collection Procedure
We used YouTube's API [1] to download videos from
YouTube. The videos were collected over a period of four
months in 2008. We decided to keep all videos that we
download in their original size and format, resulting in
52,159 videos occupying 520 GB of disk space.
The videos were primarily collected in YouTube's News,
Sports, People, Travel, and Technology sections (see Fig. 1). We
chose these sections because we expect their videos to
contain a great deal of within-category visual overlap (e.g.,
soccer fields, basketball courts, news anchors, politicians,
cities, and so on). For each of these categories, YouTube
finds about 350,000 results. However, the API allows us to
retrieve only the top 1,000 results. Therefore, to increase the
number of videos collected per section, we use the API to
sort the results by most viewed, top rated, relevance, and most
recent. It is important to note that collecting videos in this
nonrandom way introduces slight biases in our sample of
YouTube videos. For example, frequently viewed videos
tend to be shorter in duration than rarely viewed videos.
Hence, videos sorted by most viewed will increase the fraction
of short videos in our data set. We chose this collection
procedure despite these slight biases, not only because it is
more practical to collect popular videos given the limitations
of the API, but also because these videos provide a good
representation of frequently viewed content on YouTube.
Fig. 2 shows the distribution of video durations in our
entire data set of YouTube videos. The average video
duration is about 5 minutes. In March 2006, YouTube
introduced a 10 minute limit on the duration of uploaded
videos for regular user accounts, although 3,240 videos in
our database exceed the 10-minute limit and span up to a
maximum of 2 hours and 23 minutes. Note the peak around
the minute mark, since many popular videos (including viral videos, music clips, movie trailers, and excerpts taken from longer shows) fall into this category.
For each video, we also store all of the associated metadata returned by YouTube's API. The metadata includes such information as video duration, rating, view count, title, description, and user-assigned labels (tags). The metadata in
uncompressed text files occupies a total of 436 MB. For our
experiments, we use only a small fraction of the metadata,
which is compactly stored in a file that is 2.8 MB in size.
2.2 Video Preprocessing Procedure
We store YouTube videos in their native Flash video format.
The frame rate of Flash videos is not constant and usually
depends on the content (e.g., a video of slides from a
presentation might only have one frame per slide spanning
several seconds or even minutes). We ignore this complica-
tion by extracting frames at constant frame intervals rather
than constant time intervals because we want to extract
unique-looking frames.
We crop videos that were encoded with black bars (see
Fig. 3). To detect horizontal black bars, we use the
following formula:
D(y) = \frac{1}{N} \sum_{x,c} |I_y(x, y, c)|,

y_{min} = \arg\min_y [ D(y) > t ],
y_{max} = \arg\max_y [ D(y) > t ],

where I_y is the derivative along y of frame I, which is an N × M × 3 matrix (where N and M correspond to the native resolution of the video). We sum the gradient responses along the x-direction and over the color channels c, yielding D(y). Large D(y) values are likely to correspond to a transition between the frame content and the black bar. We pick the smallest y and largest y for which D(y) is greater than a threshold t (a value of 0.5 was found to work well) as the locations at which the frame will be cropped (i.e., y_min and y_max, respectively). We check that the bar being removed contains at least 80 percent black pixels; otherwise, we do not crop that region. This technique is repeated for vertical black bars.
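For illustration, the horizontal black-bar detector described above can be sketched in a few lines of NumPy. This is a minimal sketch, assuming frames are RGB arrays scaled to [0, 1]; the black-level constant and the function name are illustrative choices rather than details of our implementation.

```python
import numpy as np

def crop_black_bars(frame, t=0.5, black_frac=0.8, black_level=0.1):
    """Detect and crop horizontal black bars from an RGB frame in [0, 1]."""
    N, M, _ = frame.shape                      # native resolution (rows x cols x 3)
    I_y = np.abs(np.diff(frame, axis=0))       # |derivative along y|, shape (N-1, M, 3)
    D = I_y.sum(axis=(1, 2)) / (3 * M)         # mean gradient per row; the constant only rescales t

    strong = np.where(D > t)[0]                # rows with a content/bar transition
    if strong.size == 0:
        return frame
    y_min, y_max = strong.min(), strong.max() + 1

    # Only crop a bar if it contains at least `black_frac` black pixels.
    top, bottom = frame[:y_min], frame[y_max:]
    if top.size and (top < black_level).mean() < black_frac:
        y_min = 0
    if bottom.size and (bottom < black_level).mean() < black_frac:
        y_max = N
    return frame[y_min:y_max]
```

The same routine, applied to the transposed frame, handles vertical bars.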
Similarly to [24], we remove frames that contain more
than 80 percent of pixels of the same color. These frames
generally correspond to title sequences and diagrams.
Fig. 1. Histogram showing the number of videos per YouTube category
in our tiny videos data set.
Fig. 2. Histogram showing the distribution of video durations in 30-second
intervals for our data set. The average video duration is 4 minutes and
40 seconds.
2.3 Low-Dimensional Video Representation
Since we want our tiny videos data set to be compatible with
tiny images, we resize individual frames to be 32 × 32 pixels
in size. We concatenate the three color channels and
normalize the resulting frame vector to have zero mean
and unit norm. This is done in order to reduce sensitivity to
variations in illumination and is a common transformation
in image processing. The resulting normalized tiny frame is
compatible with the tiny images descriptor which is used by
Torralba et al. [24] because it is obtained in the same manner.
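A minimal sketch of how such a descriptor can be computed is shown below; the resampling backend (Pillow) and the bilinear filter are assumptions of this sketch, and any 32 × 32 resampler would serve equally well.

```python
import numpy as np
from PIL import Image

def tiny_descriptor(rgb_frame):
    """Resize a frame to 32 x 32, flatten the color channels, and normalize
    the result to zero mean and unit norm (the tiny images descriptor)."""
    img = Image.fromarray(rgb_frame).resize((32, 32), Image.BILINEAR)
    vec = np.asarray(img, dtype=np.float64).reshape(-1)   # 32 * 32 * 3 = 3,072 dimensions
    vec -= vec.mean()                                      # zero mean
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec                 # unit norm
```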
Unlike images, videos have an additional temporal
dimension. The temporal dimension of most videos is
densely sampled (usually at a rate of 24 frames per second)
even when the motion in the shot is minuscule. As a result,
videos can be strongly compressed temporally by retaining
only distinct visual appearances. A large number of video
summarization algorithms have been developed to perform
temporal compression [3], [17], [20], [23]. Many of these
algorithms are quite complex since they often depend on
shot boundary detection; shot boundaries are difficult to detect
reliably due to gradual shot transitions such as blends and
wipes [9], [26], [27]. In addition, false positives can arise
from fast moving objects in front of the camera lens or fast
motions of the camera itself (e.g., camera pans and dollies).
Furthermore, frames coming from the same shot can appear
more distinct than frames coming from different shots in
the presence of camera motion.
As a result, we limit ourselves to algorithms that do not
rely on shot boundary detection. Perhaps the most widely
employed summarization approach is uniform sampling. In
uniform sampling, frames are extracted at a constant interval.
The main advantage of this approach is computational
efficiency. However, uniform sampling tends to oversample
long shots or skip short shots.
A widely used approach that adapts to changes in frame
content is intensity of motion (IM) frame sampling [10], [19].
Intensity of motion has also been used as a feature vector for
describing motion characteristics and also for detecting sharp
boundary transitions in prior work [5]. Intensity of motion is
defined as the mean of consecutive frame differences:

IM(t) = \frac{1}{XY} \sum_{x,y} |I(x, y, t+1) - I(x, y, t)|,   (1)

where X and Y are the dimensions of the video (X = Y = 32 in our case) and I(x, y, t) denotes the luminance value of pixel (x, y) of the frame at time t. After applying a Gaussian filter to IM(t), Joly et al. use the locations of the extrema to select the keyframes. Fig. 4 shows a sample intensity of motion plot for a video in our database. IM sampling allows the sampling rate to be controlled by adjusting the standard deviation of the Gaussian filter. Larger standard deviations lead to fewer extrema and, as a result, fewer keyframes (see Fig. 4).
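As a concrete illustration, IM keyframe selection can be sketched as follows; the SciPy smoothing and extrema routines stand in for the exact implementation of [10], [19], and the luminance frames are assumed to be precomputed.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import argrelextrema

def im_keyframes(luma, sigma=10):
    """luma: array of shape (T, 32, 32) holding luminance frames.
    Returns frame indices at the extrema of the smoothed IM(t) curve."""
    im = np.abs(np.diff(luma, axis=0)).mean(axis=(1, 2))   # IM(t) as in (1)
    smoothed = gaussian_filter1d(im, sigma)                # larger sigma -> fewer extrema
    maxima = argrelextrema(smoothed, np.greater)[0]
    minima = argrelextrema(smoothed, np.less)[0]
    return np.sort(np.concatenate([maxima, minima]))
```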
The intensity of motion keyframe selection algorithm is
robust to color and affine transformations. For a given shot,
it also selects the same keyframes regardless of any
temporal shifts or the appearance of neighboring shots.
These properties make it particularly suitable for content-
based copy detection (CBCD) and other video retrieval
tasks as it generally samples the same set of frames for a
shot occurring across multiple videos.
In [12], we proposed a video-summarization algorithm
that uses exemplar-based clustering to select only unique-
looking keyframes. Similar to uniform and IM sampling, this
approach does not rely on shot boundary detection.
However, exemplar-based clustering not only captures
Fig. 3. Removing black bars from videos. The regions above y_min and below y_max are only cropped if they contain at least 80 percent black pixels. (a) Input frame I. (b) Derivative of input frame: |I_y|. (c) After horizontal and vertical crop.
Fig. 4. Intensity of motion IM(t) plots with Gaussian filters applied. The time at which the extrema (red) occur is used to select the keyframes. Note that a larger standard deviation leads to fewer keyframes. (a) σ = 10. (b) σ = 30.
within-shot visual appearance variations, but also consoli-
dates similarities across multiple shots. This is particularly
suitable for YouTube as its video clips are generally short
and shots often alternate between a small set of scenes (e.g.,
a reporter in a studio and an on-location journalist). This
allows the visual range of most clips on YouTube to be
captured with only a few unique-looking frames.
We use affinity propagation [7] to cluster densely sampled
frames into visually related groups. Only the exemplar (or
unique looking) frame within each cluster is retained and
the rest are discarded. Affinity propagation (AP) is
particularly suitable here because it allows us to define
what "unique looking" means in terms of the same frame similarity metrics that we will later use for video retrieval. As a result, AP selects exemplars such that similarity (as defined by (2) of Section 2.4) with their cluster members is maximized. By adjusting the preference parameter p, we
can control the number of exemplars (or keyframes) that AP
sampling selects. Fig. 5 shows the advantage of AP
sampling over uniform sampling qualitatively.
In Section 3.1, we provide an additional empirical
comparison for uniform, IM, and AP frame sampling.
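A hedged sketch of AP keyframe selection using scikit-learn's AffinityPropagation on the negative SSD similarity is given below; the preference value shown is an arbitrary placeholder and would in practice be tuned to control the number of exemplars.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def ap_keyframes(descriptors, preference=-2.0):
    """descriptors: (T, 3072) array of normalized tiny-frame vectors.
    Returns the indices of the exemplar (keyframe) frames."""
    sq_norms = (descriptors ** 2).sum(axis=1)
    ssd = sq_norms[:, None] + sq_norms[None, :] - 2.0 * descriptors @ descriptors.T
    ap = AffinityPropagation(affinity="precomputed", preference=preference,
                             random_state=0)
    ap.fit(-ssd)                                  # similarity = negative SSD
    return np.unique(ap.cluster_centers_indices_)
```

A more negative preference yields fewer exemplars, which is how the average sampling interval is controlled in the experiments of Section 3.1.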
2.4 Frame and Video Similarity Metrics
We adopt the same similarity metrics for tiny frames as in
[24]. We will briefly review these metrics, but suggest that
readers consult [24] for a more detailed discussion. Torralba
et al. define a basic distance measure between two tiny images I_a and I_b (tiny frames in our case) as their sum of squared differences:

D^2_{ssd}(I_a, I_b) = \sum_{x,y,c} \big(I_a(x, y, c) - I_b(x, y, c)\big)^2,   (2)

where I denotes a 32 × 32 × 3 dimensional zero-mean, normalized tiny video frame or tiny image. Furthermore,
Torralba et al. show that recognition performance can be
improved by allowing the pixels of the tiny image to shift
slightly within a 5-pixel window. This reduces sensitivity to
slight image misalignments, such as moving objects or
variations in scale. Hence, the following distance metric is
also used:
D^2_{shift}(I_a, I_b) = \sum_{x,y,c} \min_{|D_{x,y}| \le w} \big(I_a(x, y, c) - \hat{I}_b(x + D_x, y + D_y, c)\big)^2,   (3)

where w is the window size within which individual pixels can shift and \hat{I}_b is a transformed version of frame I_b. For simplicity, we have only implemented the horizontal mirroring transformation, although other distortion transformations could be used.
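The two frame distances can be sketched directly from (2) and (3); the wrap-around behaviour of np.roll at the frame border and the w = 2 setting (a 5-pixel window) are simplifying assumptions of this sketch.

```python
import numpy as np

def d2_ssd(a, b):
    """Eq. (2): sum of squared differences between two normalized 32x32x3 frames."""
    return float(((a - b) ** 2).sum())

def d2_shift(a, b, w=2):
    """Eq. (3): per-pixel minimum squared difference over small shifts of b
    and of its horizontal mirror, summed over all pixels and channels."""
    per_pixel_best = np.full(a.shape, np.inf)
    for candidate in (b, b[:, ::-1, :]):          # identity and mirrored versions of b
        for dy in range(-w, w + 1):
            for dx in range(-w, w + 1):
                shifted = np.roll(candidate, (dy, dx), axis=(0, 1))
                per_pixel_best = np.minimum(per_pixel_best, (a - shifted) ** 2)
    return float(per_pixel_best.sum())
```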
We extend these frame distance measures to work for a
pair of videos V_a and V_b by defining the following basic video distance measure:

\hat{D}^2_{ssd,shift}(V_a, V_b) = \min_{I_a \in V_a, \, I_b \in V_b} D^2_{ssd,shift}(I_a, I_b),   (4)
where the video distance D̂²_ssd,shift denotes that either (2) or (3) can be used as the frame distance metric. In essence, the distance between two videos V_a and V_b is defined as the distance of the most similar pair of frames I_a and I_b belonging to these videos. Note that, if both videos V_a and V_b consist of a single frame, then the D̂²_ssd,shift distance metric reduces to D²_ssd,shift. Furthermore, a single tiny image or tiny frame can be substituted for V_a in order to compute the distance between that image or frame and the tiny video V_b. In the following sections, it will also be convenient to refer to the similarity between two videos or frames in terms of their correlation. The correlation is defined in terms of distance as follows:

\rho = 1 - \tfrac{1}{2} D^2_{ssd},   (5)

\hat{\rho} = 1 - \tfrac{1}{2} \hat{D}^2_{ssd}.   (6)
This relationship can be trivially derived by expanding the expression for D²_ssd, collecting the terms that sum to 1, and rearranging (recall that our descriptors are normalized to zero mean and unit norm, so the squared entries of each descriptor sum to 1). A correlation of ρ = 1 for two frames implies that the frames are identical. For a pair of videos, ρ̂ = 1 signifies that the videos share at least one identical frame. A correlation of zero corresponds to completely dissimilar frames (i.e., the descriptors are orthogonal).
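Building on the frame distances, the video-level distance (4) and the correlation conversion (5)-(6) reduce to a few lines; this sketch reuses the d2_ssd and d2_shift helpers above and is the direct definition rather than the accelerated search used in practice.

```python
def video_distance(frames_a, frames_b, frame_dist=d2_shift):
    """Eq. (4): distance between two tiny videos is the distance
    of their most similar pair of frames."""
    return min(frame_dist(fa, fb) for fa in frames_a for fb in frames_b)

def correlation(d2):
    """Eqs. (5)-(6): convert an SSD-style distance between zero-mean,
    unit-norm descriptors into a correlation."""
    return 1.0 - 0.5 * d2
```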
These distance measures allow us to find a set of NN
given an image or frame as input. In addition, Torralba et al.
propose using PCA-compressed descriptors to facilitate
faster neighbor retrieval. We have adopted their method,
although alternative approximate nearest-neighbor search
methods [25] as well as methods that minimize disk reads
[18] should be explored in future work. A D̂²_shift sibling-set search runs in about 20 seconds on the data set of 80 million tiny images using our Matlab/C code on a 3 GHz P4. Despite the approximate nearest-neighbor pruning, disk reads are the bottleneck due to the random-access nature of reading in the full 3,072-dimensional descriptors for the set of approximate neighbors.
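A sketch of the retrieval pipeline implied here is given below: descriptors are compressed with PCA, candidates are shortlisted in the compressed space, and only the shortlist is re-ranked with the full 3,072-dimensional distance. The number of principal components and the shortlist size are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def build_index(descriptors, n_components=20):
    """Fit PCA on the full descriptors and keep their compressed projections."""
    pca = PCA(n_components=n_components).fit(descriptors)
    return pca, pca.transform(descriptors)

def approximate_neighbors(query, descriptors, pca, compressed, k=80, candidates=2000):
    """Prune with distances in PCA space, then re-rank the shortlist exactly."""
    q = pca.transform(query[None, :])[0]
    coarse = ((compressed - q) ** 2).sum(axis=1)                 # cheap, in-memory pruning
    shortlist = np.argsort(coarse)[:candidates]
    exact = ((descriptors[shortlist] - query) ** 2).sum(axis=1)  # full 3,072-D re-ranking
    return shortlist[np.argsort(exact)[:k]]
```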
Fig. 6 illustrates the 36 nearest neighbors (sibling set) for a query frame, ranked using D̂²_ssd and D̂²_shift (the top approximate neighbors are also shown). All three distance metrics give similar results. For a category with a lot of within-class visual variation (e.g., scenery in Fig. 6, middle), D̂²_shift returns more relevant sibling videos than D̂²_ssd. In some cases, more outliers appear in D̂²_shift (Fig. 6, right). This occurs because the visual appearance of the basketball category is more constrained, which leads
Fig. 5. Comparison of (a) uniform sampling to (b) AP sampling. The
green samples are of a scene that was already sampled once. The blue
samples are of scenes that are not present in uniform sampling. The red
sample is the only scene with content missing from AP sampling, albeit
its visual appearance is captured in the second frame in (b). AP
sampling has no redundant samples. (a) A 1,000-frame interval
sampling. (b) A 30-frame interval sampling + AP clustering.
D̂²_shift to return more unrelated but similar neighbors (such as hockey fields) due to the pixel-shifting relaxation. While any improvements visible in Fig. 6 are at best marginal, we observe that for frame and image classification (discussed in Section 4), the D̂²_shift metric empirically outperforms the D̂²_ssd metric. A more robust descriptor that puts less weight on color and more weight on structure could be used to improve results in some of these cases. We leave this extension for future work.
Finally, we define two additional similarity metrics that
we have found to be particularly suitable for duplicate
video retrieval:
S_1(a, b) = \sum_{I_a \in V_a} H\big(\hat{\rho}(I_a, V_b)\big),   (7)

S_2(a, b) = \sum_{I_a \in V_a} H\big(\hat{\rho}(I_a, V_b)\big) \, H\big(\rho(I_{a+1}, I_{b+1})\big),   (8)

H(t) = \begin{cases} 1, & \text{if } t > \tau, \\ 0, & \text{if } t \le \tau. \end{cases}

Here, S_1 counts the number of frames I_a in video V_a that match to video V_b with a correlation ρ̂ that exceeds τ. S_2 adds an additional constraint by only counting consecutively matching pairs of frames in V_a and V_b (where I_b denotes the frame of V_b that best matches I_a), since such occurrences are less likely to arise by chance.
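A sketch of the two duplicate-counting metrics over keyframe sequences follows; it reuses the d2_ssd and correlation helpers sketched earlier and assumes keyframes are stored in temporal order. The default thresholds mirror the values used in Section 3.

```python
def s1(frames_a, frames_b, tau=0.9):
    """Eq. (7): number of frames of V_a whose best match in V_b exceeds correlation tau."""
    return sum(
        1 for fa in frames_a
        if max(correlation(d2_ssd(fa, fb)) for fb in frames_b) > tau
    )

def s2(frames_a, frames_b, tau=0.8):
    """Eq. (8): count only matches whose next keyframes also match (consecutive pairs)."""
    count = 0
    for i, fa in enumerate(frames_a[:-1]):
        rho, j = max((correlation(d2_ssd(fa, fb)), j) for j, fb in enumerate(frames_b))
        if rho > tau and j + 1 < len(frames_b):
            if correlation(d2_ssd(frames_a[i + 1], frames_b[j + 1])) > tau:
                count += 1
    return count
```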
3 CONTENT-BASED COPY DETECTION
In this section, we evaluate our tiny videos representation
and similarity metrics on CBCD. The aim is to detect the
same video shots occurring in different videos. This will also
allow us to label videos that share a duplicate shot as related
based on content, rather than user metadata (e.g., labels).
Valid duplicate shots may vary in compression, aspect ratio,
and other common spatiotemporal video transformations
(see Fig. 7), which makes robust CBCD challenging.
We use the MUSCLE-VCD-2007 copy detection evalua-
tion corpus [14] as a baseline benchmark for our methods.
This corpus was previously used for video copy detection
evaluation at CIVR 2007. It consists of 101 videos and about
100 hours of continuous playback in total. The videos were
Fig. 6. Top 36 sibling videos for the query frames in (a), ranked using the D̂²_ssd metric (c) and the D̂²_shift metric (d). We only show the closest frame from each sibling video to the query frame. (a) Query frame. (b) Approximate neighbors. (c) D̂²_ssd sibling set. (d) D̂²_shift sibling set.
collected from The Internet Archive [2] and The Open Video
Project [8] and contain a wide range of subject matter. The
videos have different bit rates, resolution, and video formats
(though they have been reencoded to MPEG1 in the
database). In addition, 15 query videos are provided, of
which 10 are in the database. The query videos have
undergone some common transformations, such as crop-
ping, reencoding with high compression, and camcording.
Fig. 8 shows the probability that any two frames are
duplicates as a function of pixelwise correlation ρ. This plot was obtained by randomly sampling many pairs of tiny video frames in 0.02 correlation intervals and labeling them as either duplicates or not. The plot indicates that we can detect duplicate frames with high probability if their correlation is greater than 0.9. We therefore set τ to 0.9 for S_1, resulting in the similarity metric counting the number of likely duplicate frames in a pair of videos.
Using AP sampling (at an average keyframe interval of 325 frames) and the S_1 metric, we are able to match 14 of the 15 query videos correctly. The only failure is due to a color shift in query video #11 that results in S_1 = 0. This failure case could be removed by switching to a more robust color space; however, to keep compatibility with tiny images, we opt to instead relax the threshold (setting τ to 0.8) and use the more constrained S_2 similarity metric. This results in a perfect match score on the MUSCLE data set. Therefore, despite the sensitivity to color, our preprocessing steps coupled with the small size of tiny video frames make our descriptors and similarity metrics robust to camcording, strong reencoding, subtitles, and mirroring transformations.
More sophisticated local feature-based copy detection methods [4] have recently been proposed that also achieve perfect results on the MUSCLE corpus. Furthermore, local feature-based methods are known to be more robust than appearance-based methods (such as our own) for CBCD [15]. However, these methods are more complex than our tiny videos approach. In addition, recent results from TRECVID 2009 suggest that audio can be used more effectively for CBCD than video data. Therefore, future work could explore augmenting tiny videos with audio in order to further improve upon our CBCD results.
3.1 Related Video Retrieval Using Tiny Videos
We now apply the CBCD results to our own data set of
YouTube videos. In particular, our aim is to find YouTube
videos that are related by content (i.e., they share at least
one duplicate shot) and to evaluate the frequency of such
occurrences.
We begin by examining pairs of videos and hand-
labeling them as either related or not. Since hand-labeling is
very laborious, we only examine a subset of 6,654 YouTube
videos. Even this subset is still rather large to manually
consider all possible combinations of video pairs. Therefore,
to narrow down the set of potentially related videos, we use
the D̂²_ssd video distance metric to find similar videos for each input video. Recall that the D̂²_ssd distance between two videos corresponds to the distance between their two most similar frames. Related videos should contain very similar
frames coming from their shared duplicate shot. Hence,
related videos should be ranked high in the sibling set. As a
result, we only need to examine the query video and a few
of its highest ranked sibling videos. We examine a total of
2,429 potentially related video pairs which consist of an
input video and a video from the sibling set. All video pairs
which we confirm to share a duplicate shot are hand-
labeled as related. Of the 2,429 video pairs examined, we
find that 215 pairs of videos are related. The remaining
2,214 video pairs are labeled as not related. Given this
ground-truth set of videos, we can evaluate CBCD
performance on our tiny videos data set.
Fig. 7. Example transformations in the MUSCLE-VCD-2007 copy
detection evaluation corpus. (a) Strong reencoding. (b) Camcording
and subtitles. (c) Horizontal mirror.
Fig. 8. Probability of duplicates as a function of pixelwise correlation ρ. The probability rapidly increases for ρ > 0.9.
It is important to note some of the biases introduced by
preparing a ground-truth data set in this manner. First, the
215 pairs of related videos consist of videos, which were
identified as similar using our distance metrics. Videos
which are related but appear dissimilar due to various
transformations (such as severe color shifts) would not be
included in the ground-truth data set. In order to reduce the
frequency of such cases, we examine sibling videos with
correlations as low as ρ̂ = 0.7. Furthermore, sampling could skip shots which happen to be very short yet which appear across related videos. Therefore, we sample densely¹ in order to identify most related videos, including those which
share only short duplicate segments. Even with these
precautions, we have no way of knowing the exact number
of related videos given any query video unless we examine
the entire database of videos. As a result, our ground-truth
data set removes some of the copy detection complexity that
exists in our database.
Fig. 9 shows the precision-recall (PR) curves obtained
for related video retrieval within the ground truth of
2,429 labeled video pairs. Precision is defined as the fraction
of videos correctly identified as containing a duplicate shot.
Recall indicates the fraction of related videos found out of
all videos with a duplicate shot in our ground-truth data
set. The confidence in a duplicate pair of videos is given by the match score S_1 (red) or S_2 (blue). Precision increases as the number of matching duplicate frames in a video pair increases. This result is not surprising, as many duplicate frame matches are unlikely to occur for a false positive. Note that only two matching pairs of consecutive frames (i.e., S_2 ≥ 2) are required to achieve a precision of over 80 percent at 60 percent recall.
Furthermore, given our similarity metrics and this
ground-truth data set, we can now evaluate the three video
summarization algorithms empirically based on the follow-
ing criteria: A good keyframe-sampling algorithm should
preserve the video similarities observed in temporally
uncompressed videos. Fig. 10 plots the fraction of duplicate
shots found as a function of average sampling interval for the
three sampling methods. The average sampling interval for
AP sampling is controlled by the preference parameter p. For IM sampling, we vary the standard deviation σ of the
Gaussian filter. Last, for uniform sampling, we simply
increase the sampling interval.
The results demonstrate that AP sampling retains the
similarities for more video pairs than IM or uniform
sampling for a given sampling rate. Since affinity propaga-
tion uses the negative of D²_ssd for exemplar selection, all
frames with high correlation, except for the exemplar, will
be discarded. This distributes the samples to dissimilar
frames. Neither uniform nor IM sampling take visual
appearance into account; as a result, they may contain
several highly correlated samples. This is redundant as the
similarity across videos will be preserved, even if one of
these highly correlated samples is discarded.
From this hand-labeled data set, we estimate that about
16 percent of all of our YouTube videos have duplicate shots.
We suspect that the number of related videos will vary
between categories and with the popularity of a particular
topic. Popular videos on YouTube, for example, are more
likely to be submitted multiple times than rarely viewed
videos. Our results suggest that a surprisingly large number
of videos on YouTube share at least one duplicate shot.
Fig. 11 shows an example pair of related videos that were
found using our similarity metrics.
While YouTube already suggests potentially relevant
videos to the user, those videos do not always contain
visually related content. Suggesting videos that are related
by content separately from other relevant videos has a
number of benefits that are not currently present on
YouTube. First, it can aid the user in finding a better
quality version of the current video. This is a surprisingly
frequent goal: Music videos, movie trailers, and other clips
can vary greatly in their quality, and locating the best
looking version tends to involve some effort since there is
currently no indication when a better quality version exists.
Furthermore, the need for such a tool has recently been
exacerbated with the introduction of high-definition (HD)
content on YouTube. Many existing videos have been
1. The dense sampling rate averages two frames per second. After
removal of mostly constant color frames, we obtain a data set consisting of
2,951,578 tiny frames. The storage requirement for this test set of tiny video
frames is 8.4 GB, a compression of approximately 10 times from the
original subset of 6,654 YouTube Flash videos.
Fig. 9. Precision-recall curves for classifying related videos using the S_1 (red) and S_2 (blue) similarity metrics. Points a, b, and c correspond to S_1 ≥ 0, 1, and 2 duplicate frames, respectively. Points 1, 2, and 3 correspond to S_2 ≥ 0, 1, and 2 duplicate consecutive pairs, respectively.
Fig. 10. Comparison of the three sampling algorithms. We show the
percentage of video pairs that have retained at least one duplicate
keyframe pair (S_1 ≥ 1) as a function of average sampling rate. Note that
even if we, on average, sample every 250th frame in a video, AP
sampling still retains duplicate keyframes for more than half of the video
pairs.
resubmitted in higher quality, but their low-quality versions
still show up in search results. In addition, duplicate video
detection could be used to clean up YouTube's search
results by grouping identical videos into one entry, rather
than letting them clutter the results page. Finally, users
frequently upload excerpts and remixes of funny videos,
television shows, and movies without indicating the source
of the clip. Therefore, one final benefit, and perhaps the
most common usage scenario for related video retrieval, is
helping the user locate the source material for a given clip.
4 CATEGORIZATION USING LARGE DATA SETS
In this section, we use our large database of videos to
classify unlabeled images and video frames into broad
categories. We compare our classification results with those
achieved by tiny images [24] and show how the data sets
can be combined to improve precision for a wider range of
visual appearances.
4.1 Discussion of Labeling Noise
A major component that drives recognition rates is labeling
noise in our data. For the data set of tiny images, Torralba
et al. use 75,062 nonabstract nouns to query various search
engines for images. The label (i.e., noun) for an image could
therefore originate from surrounding text, which does not
always describe the image's content. This means that each
tiny image is only loosely tied to its label.
In contrast, the tiny videos data set was obtained by
ranking videos in YouTube's categories based on view
counts, ratings, relevancy (to the broad category), and date
submitted. The labels (i.e., tags) for the videos collected are
not restricted to only nonabstract nouns. The tiny videos
data set consists of 61,279 unique tags for 52,159 videos. The
average number of tags per video is 10.9. Valid tags include
dates, acronyms, names, and verbs in addition to nouns. As
a result, only 16,190 tags correspond to known dictionary
words. Fig. 12 shows a histogram of the number of videos
per dictionary word as well as the top 10 most frequent tags
in our data set. Tags are assigned by the users with the
specific goal of describing the content of the video which
they have submitted. As a result, the labels are more
strongly tied to the video compared to tiny images.
However, the primary source of labeling noise for tiny
videos is the temporal dimension of the video itself. While a
label for an image often applies directly to its content, a
label for a video could only apply to a specific segment of
the video and could be completely unrelated to other parts.
Therefore, many video frames in the tiny videos data set are
unrelated to the video's labels since the user does not indicate which labels apply to which parts of the video.²
Fig. 13a shows a random sample of tiny images for the
person category. Fig. 13b shows a random sample of tiny
video frames for the same category. A larger fraction of tiny
images contain people than tiny video frames. We exam-
ined the labeling noise in a set of 1,084 randomly chosen
tiny images and tiny video frames with the labels person,
technology, and city. The results are tabulated in
Fig. 13c. Note that the labeling noise for tiny video frames
is overall twice as high as for tiny images.
4.2 Categorization Using WordNet Voting
Torralba et al. propose a k-nearest-neighbor (k-NN) voting
scheme that transfers the labels of 80 nearest neighbors to
the input image or frame. Given our
^
1
2
o. / distance
metric, the term nearest neighbor can refer to either a
neighbor video \
/
or a neighbor image 1
/
without loss of
generality. Therefore, for an input image or frame 1
o
, the
k-NN are videos in the case of tiny videos, images in the
case of tiny images, and videos and/or images if the data
sets are combined.
In order to further reduce labeling noise, Torralba et al.
propose letting the k-NN labels vote at multiple semantic
levels using the WordNet database [6]. By accumulating
votes from higher semantic levels, more images or videos
will contribute a vote for a category at a lower semantic
level. For example, if our goal is to classify person
images or frames, then not only do the neighbors labeled
with the person tag vote for this category, but also all
the neighbors labeled with tags whose hypernyms include
the word person (e.g., politician, scientist, and so on). As
a result, we draw on a great deal more data to classify
images and frames.
Fig. 12. (a) A histogram of videos per word collected (words are tags
which are defined in the dictionary). (b) Number of videos for the
10 most frequent tags.
Fig. 11. Excerpts (a) and (b) from a pair of videos that share a duplicate shot (highlighted in blue). For illustration purposes, the videos can be viewed
on YouTube by entering the following IDs: 875-bt7o9HE and -EoPavHl0aw for videos (a) and (b), respectively. Note that YouTube does not list
these videos as related, even though they have content in common.
2. YouTube does allow users to add notes to videos, which are overlaid
at user specified times during the playback of the video. However, these
notes are optional, rarely used, and, if used, then mostly for advertising or
for linking to other videos.
Unlike tiny images, videos in our data set very frequently
have more than one label (tag). To ensure that a video with
multiple tags gets the same vote as a video with fewer tags (or an image with a single label), we split the video's vote evenly across all of its tags. Formally, a video has a set of tags T. Only a subset {t_i}_{i=1}^{N} ⊆ T of these tags is defined as nouns in WordNet. Each tag t_i then votes for its branch in the WordNet tree with a weight equal to 1/N, such that the total vote per video, regardless of the number of tags, is 1.
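A sketch of this tag-vote accumulation using NLTK's WordNet interface is shown below. Taking the first noun sense of each tag and voting once per ancestor node are simplifying assumptions of this sketch.

```python
from collections import Counter
from nltk.corpus import wordnet as wn

def accumulate_votes(neighbor_tag_lists):
    """neighbor_tag_lists: one list of tags per nearest-neighbor video.
    Returns a Counter of votes per WordNet noun, accumulated over all semantic levels."""
    votes = Counter()
    for tags in neighbor_tag_lists:
        known = [t for t in tags if wn.synsets(t, pos=wn.NOUN)]
        if not known:
            continue
        weight = 1.0 / len(known)                       # the total vote per video is 1
        for tag in known:
            synset = wn.synsets(tag, pos=wn.NOUN)[0]    # first noun sense (assumption)
            nodes = {n for path in synset.hypernym_paths() for n in path}
            for node in nodes:                          # one vote per ancestor node
                votes[node.name()] += weight
    return votes
```

The confidence for a category such as person is then the accumulated count at the corresponding node (e.g., the node keyed by "person.n.01").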
Fig. 14 shows an input image and the top tiny image
and tiny video nearest neighbors. Fig. 15 shows the
WordNet-voted branches, accumulated over a set of
80 labels from the corresponding tiny images in Fig. 14.
Similarly, Fig. 16 shows the WordNet-voted branches for
the nearest neighbor tiny videos. Since each video casts a
total vote of 1, the accumulated number of votes at each
node is comparable to tiny images. The number of votes
accumulated for the root node is close to 80 (i.e., the number
of nearest neighbors retrieved³).
Categorization is performed at the desired semantic level
by assigning the input image (or frame) the label with the
most votes at that level in the WordNet tree. The number of
votes is also used as a confidence measure (or score) for that
label. Note that the person node has the highest number of
votes at its semantic level in Figs. 15 and 16.
Finally, tiny images and tiny videos can be combined in
order to improve precision for some categorization tasks.
Given an input image or frame, we retrieve the top 80
nearest neighbors of the combined data sets by ranking
images and videos using the same D̂² metric. The resulting
Fig. 14. Top nearest neighbors to (a) the input image, ranked using the D̂²_shift metric. Only the closest frame to (a) is shown for each neighbor video in (c). (a) Input image. (b) Tiny images nearest neighbors. (c) Tiny videos nearest neighbors.
Fig. 15. Accumulated votes from the tiny images nearest neighbors of
Fig. 14b. For clarity, the nouns are shown only for nodes with more than
two votes. The ground-truth nodes for the input image (Fig. 14a) are
colored in red. Note that these nodes have the highest vote count at their
respective semantic levels.
Fig. 13. Characterization of labeling noise in the tiny images and tiny
videos data sets. (a) Person tiny images. (b) Person tiny video
frames. (c) Labeling noise for tiny images and tiny videos data sets.
3. The actual number is slightly less than 80 for tiny images, as well as
tiny videos, because some of the 80 neighbors may have labels which are
not defined in the WordNet dictionary.
Fig. 16. Accumulated votes from the tiny videos nearest neighbors of
Fig. 14c. For clarity, the nouns are shown only for nodes with more than
two votes. The person node has the highest count at its semantic level.
set of nearest neighbors will, as a result, consist of both tiny
images and tiny videos. Categories which are more densely
sampled in tiny videos will have more neighbors coming
from the tiny videos data set. The converse is true for
categories which are sampled more densely in tiny images.
4.3 Categorization Results
We now evaluate classification performance for tiny
images, tiny videos, and both data sets combined. In
particular, we use these data sets to recognize man-made
devices, people, sports activities, and scenery. Our experimen-
tal setup closely follows that of Torralba et al. [24]. We
prepare a ground truth of randomly sampled positive and
negative examples for these categories. Furthermore, we
remove video and image duplicates of the input image or
frame from participating in classification in order to avoid
introducing biases in the results. The D̂²_shift distance metric
is used to find nearest neighbors in the tiny images and
tiny videos data sets, although we will also discuss
alternative similarity measures.
The man-made devices categorization task includes
positive examples of mobile phones, computers, and other
clearly identifiable technological equipment. These frames
were sampled from YouTube's technology category. We use
the number of votes for the artifact noun in the WordNet
tree as the score for this classification task. The artifact
noun is used because it has such children as device,
gadget, and so on. Therefore, all neighbor images and
videos with such labels would increase the score of the
input image. Negative examples are of frames, which do not
contain man-made devices, such as scenery, people, sports,
and various other unrelated frames or images.
For categorizing people, we use the number of votes for the
person noun in the WordNet tree as the confidence score.
The noun person has children such as politician,
chemist, reporter, and child. Positive examples con-
tain images and frames of one or more clearly identifiable
people, while negative examples contain no people in them.
Positive examples in the sports categorization task are
taken from hockey, figure skating, wrestling, boxing,
Olympic games, soccer, rugby, and tennis videos. This
gives us a wide variety of sports-related visual appearances.
The number of votes for the sport node in WordNet is
used as the confidence score in this task.
Finally, as discussed in [24], many Web images (and video
frames) correspond to full scenes rather than individual
objects. Therefore, we also use our data sets to differentiate
between frames that contain scenes (positive examples) and
those that are of objects and other unrelated frames (negative
examples). To decide which frames are scenes, we use the
location noun because it is the most generic node in the
WordNet tree that is related to scenes (with children such as
landscape, destination, and city).
The categorization results using tiny videos, tiny images,
and both data sets combined are shown in Fig. 17. First note
that tiny videos perform significantly worse than tiny
images at categorizing people. This is the case because the
majority of videos in our data set contain people, regardless
of their category.
On the other hand, tiny videos perform much better in
classifying sports activities than tiny images. In addition,
Fig. 18 demonstrates that some sports activities can be
classified at a higher semantic level using tiny videos.
Sports-related appearances are much more frequent in the
collection of over 50,000 videos compared to the data set of
80 million tiny images. Furthermore, the visual appearances
captured by images and videos can vary even for the same
category. In particular, images primarily contain specific
objects (such as people, cars, devices, and so on), while
videos focus on activities (such as news reporting, video
blogging, sports, travel, and so on). This difference occurs
because people take photos of objects, but they videotape
activities. Therefore, labels associated with a picture would
generally describe the object that it contains, while labels
associated with a video would generally describe the
activity that it captures.
Fig. 17. Precision-recall curves for classifying, from (a) to (d), people, sport activities, man-made devices, and scenery. Tiny videos (red) performs
better than tiny images (blue) at sports categorization. The converse is true for person categorization. Note that the combined data set (blue)
performs well in every categorization task.
Fig. 18. PR for classifying football-related frames using the football
noun in WordNet.
For example, a Google search for the noun basketball
returns images of the ball (see Fig. 19a), while a search for
basketball on YouTube returns videos of the game. Such
videos contain basketball courts, basketball players, com-
mentators, and so on (refer to Fig. 19b). Hence, tiny videos
sample the visual appearance of the game, while tiny
images sample the appearance of the ball for the same label.
Depending on whether we want to classify the activity or
the object, tiny videos or tiny images would be better suited,
respectively. Combining the two data sets has the advan-
tage of capturing the greatest range of visual appearances
for a category.
Notice that the combined data set performs better than
tiny videos at classifying people, and it also performs better
than tiny images at classifying sports. In fact, in all but the
sports classification, the combined data set performs at or
slightly above the precision obtained by tiny images or tiny
videos alone. For sports-related activities, the precision drops
below tiny videos due to additional noise from the tiny
images data set (e.g., corn fields that look like soccer fields).
This noise only significantly impacts classification precision
in the case of more difficult positive and negative examples.
Precision for the combined data set is comparable to that
achieved by tiny videos alone for the simpler sports cases at
40 percent or lower recall. However, the results demonstrate
that, unlike tiny images or tiny videos alone, the combined
data sets perform well on objects as well as activities.
Finally, tiny images perform notably better at scenery
classification than tiny videos (although in Section 4.5, we
will show how to reverse this result). Scenery photography
is fairly common; therefore, images are suitable for this
classification task. Videos also frequently capture outdoor
activities (such as hiking, travel, cities, and so on); therefore,
they perform well at this classification task, although for
now, at lower precision than tiny images. Once again note
that combining the two data sets produces slightly higher
precision than that attained by either data set alone.
4.4 An Alternative Video Similarity Metric for
Categorization
Recall that the D̂²_shift measure used in the previous section is simply the distance D²_shift of the closest frame in video V_b to our unlabeled input image I_a. The similarity of other frames in V_b is therefore effectively ignored. Here, we examine a different measure, which returns the average distance of the k closest frames I_1, ..., I_k in video V_b for an input image I_a:

\bar{D}^2_k(I_a, V_b) = \frac{1}{k} \sum_{m=1}^{k} D^2_{shift}(I_a, I_m),   (9)

where D^2_{shift}(I_a, I_1) \le D^2_{shift}(I_a, I_2) \le \dots \le D^2_{shift}(I_a, I_N).

If we set k = 1, then the average distance of the k closest frames is simply the distance to the closest frame, i.e., D̄²_1 = D̂²_shift. Furthermore, this metric reduces to D²_shift if V_b contains a single frame. As a result, it can be used to find both neighbor videos and images.
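A sketch of this measure, reusing the d2_shift helper from Section 2.4, is given below; setting k = 1 recovers the closest-frame behaviour of D̂²_shift.

```python
def d2_avg_k(image, frames_b, k=3):
    """Eq. (9): average distance between an input image and the k closest
    keyframes of video V_b (k = 1 reduces to the closest-frame distance)."""
    dists = sorted(d2_shift(image, fb) for fb in frames_b)
    k = min(k, len(dists))
    return sum(dists[:k]) / k
```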
Fig. 20 shows the categorization results using the D̄² and D̂² metrics. The D̄² metric tends to rank videos with
multiple moderately similar keyframes higher than videos
with only a single very similar frame. We find this to only
be effective for person categorization since a video with
many distinct-looking person keyframes is more likely to
be tagged as such.
However, we find that the converse is true for other
categorization tasks. In particular, videos with a single very
similar keyframe and multiple completely dissimilar key-
frames tend to make better neighbors in classifying an input
image than videos with multiple moderately similar key-
frames. This is the case because our AP sampling algorithm
picks exemplar keyframes and discards all other similar-
looking frames. Therefore, if an input image is very similar
to one keyframe, it will not match any other keyframes very
well (since AP sampling selects only dissimilar frames). As
a result, we find the D̂²_shift measure to work best for all other
categorization tasks.
4.5 Classification Using YouTubes Categories
As we have discussed in Section 2.1, the tiny videos data set
contains all the metadata provided by YouTube for each
video. Unlike the tiny images data set, which has only one
label per image, tiny videos stores the video's title, description, rating, view count, a list of related videos, and other metadata, in addition to video tags (which we used for classification in the previous section). We will show that some
of this metadata can be used to further improve classification
precision of the tiny videos data set, resulting in significant
improvements in person detection, and outperforming tiny
images on the scene classification task.
Fig. 19. Comparing the appearances of the top image results and the top
video results for the search query basketball. (a) Top image results on
Google. (b) Top video results on YouTube.
Fig. 20. (a) and (b) Comparison of the D̄² and D̂² distance measures for classification. Note that D̄²_N denotes the average distance from the input image to all frames in video V_b. For all but the person classification task, the D̂² metric achieves the highest precision.
All videos on YouTube are placed into a category. The
number of categories on YouTube has increased over the
years. Currently, videos on YouTube can appear in one of
14 categories. The tiny videos data set contains videos from
predominantly the News, Sports, People, Travel, and Technol-
ogy categories. In this section, we use these categories for
frame and image classification.
Similar to WordNet voting, we let the nearest-neighbor
videos vote for the category to which they belong. The
number of votes for a category is used as the confidence
measure in classifying the input image or frame into that
category. Since the set of categories is small, no hierarchical
voting is required. The classification results using
YouTube's category voting (YT) on the ground-truth sets
discussed in the previous section are shown in Fig. 21. For a
frame of reference, we also include the results obtained with
WordNet voting (WN) in Section 4.3. To classify scenes,
recall that we used the location node in WordNet. For this
classification task, we can use the number of votes for the
Travel category (which consists predominantly of videos
taken outdoors) to distinguish between scenes and negative
examples. Notice that we achieve substantially higher
precision with this method compared to WordNet voting
using tiny videos, as well as a notable improvement over
WordNet voting using tiny images. For the artifact
classification task, we use the number of votes in the
Technology category as our confidence score. In this case,
WordNet voting performs on par with YouTube's category voting classification scheme. For the sport frame classification task, we use the Sports category. YouTube's
category voting using tiny videos once again performs
better than WordNet voting using either data set. Finally,
we use the People category to classify tiny images that
contain people. Note that this approach significantly
improves the classification of people using the tiny videos
data set. However, as discussed before, many videos in
other categories also contain people. Therefore, we are still
unable to match the precision achieved by the tiny images
data set for the person classification task.
Improved precision is achieved with category voting
because labels are constrained to a single category from a
predefined list, while users are permitted to enter any text
string to tag a video or to describe an image. This results in
higher labeling noise for video tags and image labels. As a
result, YouTube's categories can be used to classify images
or frames more reliably than WordNet voting.
Note, however, that classification in this section is
limited to the 14 categories⁴ defined by YouTube. If we want to classify other categories or classify at higher semantic levels, then we must resort to WordNet voting. For example, we used WordNet to recognize frames related to football in Fig. 18. While YouTube's Sports category can be used to classify those frames as belonging to sports, they cannot be distinguished from other sports activities since no corresponding categories are defined by YouTube. In such cases, we must resort to the video's tags
and the WordNet voting scheme.
In this section, we have shown that we can effectively
draw on the additional metadata available to us in order to
improve recognition performance for some specific cate-
gories. Other data such as ratings and view counts may also
be used to improve various classification tasks. We leave
this extension for future work.
5 CONCLUSION
This paper presented a method for compressing a large
database of videos into a compact representation called
tiny videos. We showed that tiny videos can be used
effectively for content-based copy detection. In addition, we
leveraged this large data set of user-labeled online videos to
perform a variety of classification tasks using only simple
nearest-neighbor methods. The classification performance
of the tiny videos data set was compared with tiny images.
We showed that tiny videos are better suited for classifying
sports-related activities than tiny images, while tiny images
performed better at categorizing people. The tiny videos
data set was designed to be compatible with tiny images.
This allows us to combine both data sets to achieve high
precision and recall for both activity and object-classifica-
tion tasks. Finally, we show that additional metadata in the
tiny videos database can be used to significantly improve
classification precision for some categories.
The same descriptor was used for tiny videos and tiny
images. This allowed us to combine the two data sets for
classification. However, the RGB color space used by tiny
images is not the best choice for CBCD. Future work could
explore different types of descriptors for the tiny videos data
Fig. 21. (a)-(d) Classification results achieved with the tiny videos data set and voting using YouTube's video categories (magenta). For comparison,
we also show the classification results achieved with tiny videos (red) and tiny images (green) using the WordNet voting scheme discussed in
Section 4.3.
4. Note that some categories (such as Comedy, Education, and
Nonprofit) are very abstract. As a result, they cannot be successfully
classified by tiny videos or tiny images because they violate the implicit
assumption that visually similar frames or images are more likely to be of
the same category.
set. For example, a histogram of oriented gradients [16] could be used for its improved tolerance to changes in illumination.
Furthermore, we have only looked at classifying cate-
gories that are well defined by their visual appearance.
Some categories (e.g., activities such as walking and
running) are only loosely defined by their visual
appearance. Instead, they are more readily described by
their motion. Therefore, descriptors that take motion into
account could also be explored in future work (e.g., space-
time interest points [13]). However, encoding the temporal
dimension in the descriptor is not always desirable because
it precludes the use of a single image or frame as input.
Since the appearance-based tasks presented herein are
entirely data-driven, obtaining more data means better
retrieval and recognition performance. As the price of storage
(and, in particular, RAM) continues to drop, our ongoing goal
is to explore new computer-vision methods that draw on
increasingly larger collections of video and image data.
ACKNOWLEDGMENTS
The authors would like to thank Antonio Torralba, Rob
Fergus, and William T. Freeman for making their data set of
80 million tiny images available online.
REFERENCES
[1] YouTube's APIs and Developer Tools, http://code.google.com/
apis/youtube/overview.html, 2010.
[2] The Internet Archive, http://www.archive.org, 2009.
[3] N. Dimitrova, T. McGee, and H. Elenbaas, Video Keyframe
Extraction and Filtering: A Keyframe Is Not a Keyframe to
Everyone, Proc. Sixth Int'l Conf. Information and Knowledge
Management, pp. 113-120, 1997.
[4] M. Douze, A. Gaidon, H. Jégou, M. Marszałek, and C. Schmid,
Inria-Lear's Video Copy Detection System, Proc. Text Retrieval
Conf. Video Retrieval Evaluation Workshop, http://lear.inrialpes.fr/
pubs/2008/DGJMS08a, Nov. 2008.
[5] S. Eickeler and S. Müller, Content-Based Video Indexing of TV
Broadcast News Using Hidden Markov Models, Proc. IEEE Int'l
Conf. Acoustics, Speech, and Signal Processing, vol. 6, pp. 2997-3000,
Mar. 1999.
[6] WordNet: An Electronic Lexical Database (Language, Speech, and
Communication), C. Fellbaum, ed., MIT Press, May 1998.
[7] B.J. Frey and D. Dueck, Clustering by Passing Messages between
Data Points, Science, vol. 315, pp. 972-976, 2007.
[8] G. Geisler and G. Marchionini, The Open Video Project: A
Research-Oriented Digital Video Repository, Proc. ACM Digital
Libraries, pp. 258-259, http://www.open-video.org, 2000.
[9] A. Hampapur, R. Jain, and T.E. Weymouth, Production Model
Based Digital Video Segmentation, Multimedia Tools Appl., vol. 1,
no. 1, pp. 9-46, 1995.
[10] A. Joly, C. Frélicot, and O. Buisson, Robust Content-Based Video
Copy Identification in a Large Reference Database, Proc. Conf.
Image and Video Retrieval, pp. 414-424, 2003.
[11] R. Junee, Zoinks! 20 Hours of Video Uploaded Every Minute!
http://youtube-global.blogspot.com/2009/05/zoinks-20-hours-
of-video-uploaded-every_20.html, May 2009.
[12] A. Karpenko and P. Aarabi, Tiny Videos: Non-Parametric
Content-Based Video Retrieval and Recognition, Proc. Tenth
IEEE Int'l Symp. Multimedia, pp. 619-624, Dec. 2008.
[13] I. Laptev, On Space-Time Interest Points, Int'l J. Computer Vision,
vol. 64, nos. 2-3, pp. 107-123, 2005.
[14] J. Law-To, A. Joly, and N. Boujemaa, Muscle-VCD-2007: A Live
Benchmark for Video Copy Detection, http://www-rocq.inria.
fr/imedia/civr-bench/, 2007.
[15] J. Law-To, L. Chen, A. Joly, I. Laptev, O. Buisson, V. Gouet-Brunet,
N. Boujemaa, and F. Stentiford, Video Copy Detection: A
Comparative Study, Proc. Sixth ACM Int'l Conf. Image and Video
Retrieval, pp. 371-378, 2007.
[16] D.G. Lowe, Distinctive Image Features from Scale-Invariant
Keypoints, Int'l J. Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[17] S. Lu, M.R. Lyu, and I. King, Semantic Video Summarization
Using Mutual Reinforcement Principle and Shot Arrangement
Patterns, Proc. 11th IEEE CS Int'l Multimedia Modelling Conf.,
pp. 60-67, 2005.
[18] D. Nistér and H. Stewénius, Scalable Recognition with a
Vocabulary Tree, Proc. IEEE Conf. Computer Vision and Pattern
Recognition, pp. 2161-2168, 2006.
[19] K.A. Peker, A. Divakaran, and H. Sun, Constant Pace Skimming
and Temporal Sub-Sampling of Video Using Motion Activity,
Proc. Int'l Conf. Image Processing, vol. 3, pp. 414-417, 2001.
[20] B. Shahraray, Scene Change Detection and Content-Based
Sampling of Video Sequences, Proc. SPIE Conf., pp. 2-13, Apr.
1995.
[21] A.F. Smeaton, P. Over, and W. Kraaij, Evaluation Campaigns and
TRECVID, Proc. Eighth ACM Int'l Workshop Multimedia Information
Retrieval, pp. 321-330, 2006.
[22] N. Snavely, S.M. Seitz, and R. Szeliski, Modeling the World from
Internet Photo Collections, Int'l J. Computer Vision, vol. 80, no. 2,
pp. 189-210, http://phototour.cs.washington.edu/, Nov. 2008.
[23] C. Toklu, S.P. Liou, and M. Das, Videoabstract: A Hybrid
Approach to Generate Semantically Meaningful Video Summa-
ries, Proc. IEEE Int'l Conf. Multimedia and Expo, vol. 3, pp. 1333-
1336, 2000.
[24] A. Torralba, R. Fergus, and W.T. Freeman, 80 Million Tiny
Images: A Large Data Set for Non-Parametric Object and Scene
Recognition, Technical Report MIT-CSAIL-TR-2007-024, 2007.
[25] A. Torralba, R. Fergus, and Y. Weiss, Small Codes and Large
Image Databases for Recognition, Proc. IEEE Conf. Computer
Vision and Pattern Recognition, 2008.
[26] R. Zabih, J. Miller, and K. Mai, A Feature-Based Algorithm for
Detecting and Classifying Scene Breaks, Proc. ACM Multimedia
Conf., pp. 189-200, 1995.
[27] R. Zabih, J. Miller, and K. Mai, A Feature-Based Algorithm for
Detecting and Classifying Production Effects, Multimedia Systems,
vol. 7, no. 2, pp. 119-128, 1999.
Alexandre Karpenko received the BASc de-
gree (with honors) in engineering science
(computer option) and the MASc degree in
computer engineering from the University of
Toronto, in 2007 and 2009, respectively. His
current research is in the area of large-scale
data mining of online videos and images for
content-based retrieval and recognition tasks.
He is a student member of the IEEE and the
IEEE Computer Society, and a recipient of the
Ontario Graduate Scholarship and the Queen Elizabeth II Scholarship.
Parham Aarabi received the BASc degree in
engineering science (electrical option) and the
MASc degree in computer engineering from the
University of Toronto, in 1998 and 1999,
respectively, and the PhD degree in electrical
engineering from Stanford University in 2001.
He is a Canada Research Chair in Internet
Video, Audio, and Image Search, an associate
professor in The Edward S. Rogers Sr. Depart-
ment of Electrical and Computer Engineering,
and the founder and director of the Artificial Perception Laboratory at the
University of Toronto. His recent awards include the 2002, 2003, and
2004 Professor of the Year Awards, the 2003 Faculty of Engineering
Early Career Teaching Award, the 2004 IEEE Mac Van Valkenburg
Early Career Teaching Award, the 2005 Gordon Slemon Award, the
2005 TVO Best Lecturer (Top 30) Selection, the Ontario Early
Researcher Award, the 2006 APUS/SAC University of Toronto Under-
graduate Teaching Award, the 2007 TVO Best Lecturer (Top 30)
Selection, as well as MIT Technology Review's 2005 TR35 World's Top
Young Innovator Award. His current research, which includes multi-
sensor information fusion, human-computer interactions, and hardware
implementation of sensor fusion algorithms, has appeared in more than
50 peer-reviewed publications and has been covered by media such as
the New York Times, MIT's Technology Review Magazine, Scientific
American, Popular Mechanics, and the Discovery Channel. He is a
senior member of the IEEE.