
Abstract

Although images and videos are captured to depict certain objects, some background or other
content inevitably gets captured along with those objects. This makes it difficult for
computers to take such images and videos directly as inputs, whether to extract visual
properties of objects such as shape, size, location and structure, or for other object-based
applications. The main problem here is that a single image or video by itself carries hardly
any information about the object. But if multiple images or videos of the same object category
are provided for joint processing, the job becomes a little easier, because there is now a
clue: the commonness. This clue alone is usually not sufficient, because backgrounds can also
end up being common in many cases. Fortunately, thanks to several years of prior research on
individual image and video processing, there are numerous visual saliency and motion saliency
extraction methods that can provide a certain object prior. These priors are also unreliable
individually, because the object need not always be salient, and even the background can be
salient at times. Although these two kinds of clues, commonness and saliency, are individually
unreliable, their combination can be quite promising, because the combination is geared
towards finding the "common object".
By making use of these clues, we propose several fusion-based joint processing methods,
where we essentially make two assumptions: (i) common object pixels are generally salient,
if not in every image; (ii) background pixels are generally non-salient, if not in every
image. These assumptions allow us to develop jointly processed general (or fused) object
prior maps, which then guide the individual processing of images or videos to extract the
desired visual properties of objects, as discussed above. This is the way we perform
co-segmentation, co-localization and co-skeletonization of common objects in the images
and videos.
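To illustrate the idea behind these two assumptions, the following is a minimal sketch of how per-image saliency maps could be fused into common-object prior maps. It assumes the maps have been resized to a common resolution and are roughly aligned, and the simple mean-based consensus is only an illustrative stand-in for the actual fusion methods; the function name and weighting scheme are our own inventions for this example.

```python
import numpy as np

def fused_object_prior(saliency_maps):
    """Fuse per-image saliency maps into common-object prior maps.

    A minimal sketch, assuming all maps share one resolution and are
    roughly aligned (a simplification of the actual joint processing).
    Under assumption (i), pixels salient in most images get a high
    fused prior; under assumption (ii), pixels salient in only a few
    images (likely background) are suppressed.
    """
    stack = np.stack([m.astype(np.float64) for m in saliency_maps])
    # Cross-image consensus: the mean over images rewards pixels that
    # are consistently salient across the collection.
    consensus = stack.mean(axis=0)
    # Each image's prior is its own saliency modulated by the consensus,
    # so a background that happens to be salient in one image is damped.
    return [m * consensus for m in stack]
```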
In particular, we fuse object priors (mostly saliency maps) in multiple ways: (i) across
different images/frames, (ii) across different modalities (visual and motion), and (iii) across
different object prior (saliency) extraction methods, with the joint processing aim of
highlighting the common objects’ pixels and suppressing the background pixels.

Through our work, we also try to address several challenges, such as minimizing the
complexity involved in jointly processing several images or videos, improving speed,
reducing parameter tuning, and monitoring whether joint processing is necessary at all,
because in certain cases individually processed saliency can do the job by itself.
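As a rough illustration of these fusion axes and the necessity check, consider the hedged sketch below for a single video. Here `alpha`, `quality_threshold`, and the correlation-based agreement test are hypothetical choices made for this example, not the thesis's actual formulation.

```python
import numpy as np

def multi_way_fusion(visual_maps, motion_maps, alpha=0.5, quality_threshold=0.8):
    """Illustrative three-axis fusion of object priors for one video.

    `visual_maps` and `motion_maps` are lists of per-frame saliency
    maps from two modalities (axis ii), each of which may itself be an
    average over several extraction methods (axis iii). `alpha` and
    `quality_threshold` are hypothetical parameters for this sketch.
    """
    # Axis (ii): combine the visual and motion modalities per frame.
    per_frame = [alpha * v + (1.0 - alpha) * m
                 for v, m in zip(visual_maps, motion_maps)]
    # Axis (i): share evidence across frames via their mean.
    consensus = np.mean(per_frame, axis=0)
    # Necessity check: skip joint refinement when a frame's individual
    # saliency already agrees closely with the cross-frame consensus.
    fused = []
    for p in per_frame:
        corr = np.corrcoef(p.ravel(), consensus.ravel())[0, 1]
        fused.append(p if corr >= quality_threshold else p * consensus)
    return fused
```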
The fusion-based joint processing methods obtain promising results not only on benchmark
datasets for the co-segmentation, co-localization and co-skeletonization tasks, but also on
large-scale datasets such as ImageNet, which suggests that the proposed methods are not only
accurate but scalable as well.
