G. de Haan
Philips Research Laboratories
Prof. Holstlaan 4, 5656 AA Eindhoven, The Netherlands
G.de.Haan@Philips.com
Video format conversion has become a key technology for multimedia systems. This paper discusses the progress in the relevant areas, i.e. spatial
scaling, de-interlacing, picture rate conversion, and motion estimation.
INTRODUCTION
Typically in a television chain, a single costly source broadcasts picture material to a large number of low cost receivers. The high number of receivers has
evidently prevented rapid introduction of new video formats enabling a higher
spatial or temporal resolution. On the other hand, this architecture stimulated
researchers to improve the perceived image quality in a compatible way. Particularly the digital techniques that entered the television receiver around 1990, and in parallel with them the option to store and delay image parts, pushed the use of image enhancement techniques. Silicon technology enabled a complexity growth according to Moore's Law, which helped the more robust, but less efficient, digital techniques to become the natural choice even in areas where the earlier analogue solutions required less silicon area. More or less simultaneously, video
entered the personal computer, which became a consumer electronics product.
By the end of the twentieth century, this synthesis led to multimedia products in which the video is scalable in space and time. This caused an explosion of video formats, as in addition to the two main broadcast formats¹, PC monitors with picture rates between 60 Hz and 120 Hz and spatial resolutions in a broad range (VGA, SVGA, XVGA, etc.) arrived on the market. Also television receivers profited from these techniques and decoupled their display format from the historically determined transmission format to eliminate flicker artifacts, and/or to adapt to new display principles, which resulted in new flicker-free (100 Hz), non-interlaced (Proscan), and widescreen (16:9) formats on cathode ray tubes, plasma

¹ The interlaced 50 Hz and 60 Hz formats with 625 and 525 scanning lines, respectively.
Figure 1: The decimating, or interpolating, poly-phase filter (a) calculates output pixels at any position as a weighted average of input pixels. In (b) the simpler and inferior pixel-repetition and pixel-dropping techniques are illustrated.
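The weighted-average computation of the poly-phase filter, and the pixel dropping/repetition alternative, can be sketched in one dimension as follows. This is an illustrative sketch using 2-tap linear-interpolation weights per phase, not the filter of any particular product; the function names are hypothetical.

```python
def polyphase_resample(pixels, out_len):
    """Resample a 1-D list of pixel values to out_len samples.

    For each output position, the integer part selects the neighbouring
    input pixels, and the fractional ("phase") part selects the weights;
    here simple 2-tap linear-interpolation weights are used."""
    in_len = len(pixels)
    scale = in_len / out_len              # input step per output pixel
    out = []
    for i in range(out_len):
        pos = i * scale                   # position on the input grid
        k = int(pos)                      # left input neighbour
        phase = pos - k                   # fractional part -> weights
        right = pixels[min(k + 1, in_len - 1)]
        out.append((1.0 - phase) * pixels[k] + phase * right)
    return out

def pixel_drop_or_repeat(pixels, out_len):
    """The inferior alternative of Figure 1b: nearest input pixel only."""
    in_len = len(pixels)
    return [pixels[min(int(i * in_len / out_len), in_len - 1)]
            for i in range(out_len)]
```

Up-scaling the two samples [0, 10] to three samples illustrates the difference: the poly-phase filter produces an intermediate value at the new position, whereas dropping/repetition merely duplicates an existing sample.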
panel displays, and liquid crystal screens. Currently, also videotelephony, video
from the Internet, and graphics are being merged with broadcast signals.
In conclusion, video format conversion (VFC) has become a key technology for multimedia systems. To enable conversion at a consumer price level, breakthroughs were required in motion estimation, de-interlacing, and robust picture interpolation. We shall discuss progress in these areas in separate sections. Although the other format conversion ingredient, spatial scaling, is an almost trivial application of long available theory, a brief section serves to show the differences and commonalities with the other areas.
SPATIAL SCALING
Figure 2: Screen details showing up- and down-scaling using pixel dropping and repetition (b), and poly-phase filtering (c), respectively. The left-hand picture (a) shows the original.
frequency. For vertical scaling of interlaced video material, this is not the case. A prior de-interlacing stage is then required. In the temporal domain, more fundamental problems prohibit the use of linear filters. Therefore, an inferior conversion method, which has also been used for spatial scaling in the past, is still quite popular for vertical and temporal format conversions. This simpler method consists of pixel dropping and/or pixel repetition. Figure 1b illustrates the procedure, while Figure 2b shows the resulting image when used for spatial scaling.
PICTURE RATE CONVERSION
In the temporal domain, there is no pre-filtering prior to the sampling by the camera. It may appear as if such a filter, though difficult to realise, would bring down the problem of temporal interpolation of video images to the common sampling rate conversion problem solved for spatial scaling already. To understand why this does not yield acceptable results, we have to look at some properties of the human visual system (HVS).
As can be seen in Figure 3b, the temporal frequency response of the HVS shows a roll-off for higher frequencies, due to integration. Figure 3a shows the perceived image for a viewer focusing his/her eyes on a fixed part of the screen on which a moving rotor with calendar is shown. The image is blurred, due to the integration over a number of successive pictures.
Under normal viewing conditions, though, the perceived sharpness is hardly affected by motion. The contradiction results from the fact that the relation shown in Figure 3b holds for temporal frequencies as they occur at the retina of the observer. These frequencies, however, equal the frequencies at the display
Figure 3: Picture showing the perceived moving image in case the eyes are stationary with respect to the screen (a). Although from the frequency response of the human visual system (b) we may expect such a result, the object tracking of the eye normally prevents this blurring.
only if the eye is stationary with respect to this display. For an object-tracking observer, even very high temporal frequencies at the display are transformed to much lower frequencies at the retina. Ideally, even a temporally stationary image results, as illustrated in Figure 4. Consequently, perfect picture interpolation only results from methods that take the object tracking, or motion compensation, into account. This requires interpolation along the motion trajectory rather than along the temporal axis. As can be seen from Figure 5, the use of poly-phase filters along the temporal axis may even result in worse blurring than straightforward pixel repetition.
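Interpolation along the motion trajectory can be sketched as follows, assuming a per-pixel motion vector field is already available. The function name, the nearest-neighbour fetch, and the simple two-picture average are illustrative simplifications, not the algorithm of any product discussed here.

```python
def mc_interpolate(prev, nxt, motion, alpha):
    """Motion-compensated picture interpolation (sketch).

    prev, nxt: 2-D lists of luminance values (two consecutive pictures).
    motion[y][x]: (dy, dx) displacement of the content from prev to nxt.
    alpha: temporal position of the output picture, 0 < alpha < 1.

    Each output pixel averages the pixel fetched backwards along the
    trajectory in prev with the pixel fetched forwards in nxt, instead
    of averaging at the same spatial location."""
    h, w = len(prev), len(prev[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = motion[y][x]
            # Backwards fetch in the previous picture ...
            yb = min(max(int(round(y - alpha * dy)), 0), h - 1)
            xb = min(max(int(round(x - alpha * dx)), 0), w - 1)
            # ... and forwards fetch in the next picture.
            yf = min(max(int(round(y + (1 - alpha) * dy)), 0), h - 1)
            xf = min(max(int(round(x + (1 - alpha) * dx)), 0), w - 1)
            out[y][x] = (1 - alpha) * prev[yb][xb] + alpha * nxt[yf][xf]
    return out
```

For an object moving by two pixels per picture, the interpolated picture at alpha = 0.5 shows the object at the halfway position, sharp, rather than a blurred double image.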
To enable interpolation along the motion trajectory, motion vectors that describe the displacement of image content from one picture to another are necessary. However, motion vectors cannot always be reliably estimated, and artifacts are introduced in the resulting interpolated images [2, 3]. Moreover, even if it is possible to describe the temporal changes in an image correctly with a motion vector, a practical implementation of a motion estimator may not find this vector.
Clearly, just "shifting parts of the image over a vector" is a risky operation that easily introduces artifacts, worse than the blurring that we observe from linear methods. Figure 7 illustrates this remark, showing pictures from robust and less robust motion compensation methods, using vectors from good and poor motion estimators, respectively.
Figure 4: An object tracking observer performs a frequency transformation. High temporal frequencies on the screen become stationary intensities at the retina.

Several algorithms have been proposed to reduce the artifacts introduced by erroneous motion vectors. Next to the obvious motion-compensated (MC) averaging, the use of order statistic filters has more recently been proposed [4, 5], and has appeared in commercially available products [11, ?]. The most advanced methods attempt to decide upon foreground and background objects in order to better cope with the interpolation of occlusion areas [6, 7].
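In the spirit of the order-statistic approach, a minimal sketch: the output is the median of the two motion-compensated pixels and a non-compensated average, so that an erroneous vector degrades gracefully towards ordinary blur instead of producing shifted artifacts. This is an illustrative construction, not the exact filter of [4, 5]; the function names are hypothetical.

```python
def median3(a, b, c):
    """Median of three values: the middle element after sorting."""
    return sorted((a, b, c))[1]

def robust_mc_pixel(mc_prev, mc_next, prev, nxt):
    """Order-statistic MC interpolation for one pixel (sketch).

    mc_prev, mc_next: pixels fetched along the motion trajectory from
    the previous and the next picture.
    prev, nxt: pixels at the same spatial position, no compensation.

    If the vector is correct, mc_prev and mc_next agree and win the
    median; if it is wrong, the non-compensated average takes over."""
    fallback = 0.5 * (prev + nxt)
    return median3(mc_prev, mc_next, fallback)
```

With a correct vector both MC taps agree and the median passes them through; with a grossly wrong vector the two MC taps disagree and the median falls back to the non-compensated average.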
DE-INTERLACING
Interlace is the common video broadcast procedure to transmit only the odd, or only the even, numbered lines of a picture in alternate fields. De-interlacing aims at restoring the full vertical resolution for every picture, i.e. making odd and even lines available simultaneously. Closer study reveals that de-interlacing is an implicit requirement for nearly all VFCs.
As with spatial up-scaling, the simplest method consists of pixel repetition. If the repeated pixel is a vertical neighbour, we speak of line repetition; if it is a temporal neighbour, we speak of field repetition. Field repetition preserves all detail in a stationary image part, but introduces severe artifacts in moving parts, as can be seen in Figure 6d². Line repetition, on the other hand, cannot eliminate the alias present in a single field and leads to jagged edges, as shown in Figure 6c. All kinds of adaptive methods switching between spatial and temporal interpolation, as well as 2-D vertical-temporal, linear and non-linear, interpolation filters, and methods aiming at an interpolation along 2-D spatial edges have been proposed. The more advanced methods use motion compensation. Although a theoretical solution exists, based on motion compensation and a generalisation of the sampling theorem [8], its robustness for vector errors is a problem [9], and various MC alternatives have been proposed. Nevertheless, MC methods are
² An exception occurs when film material is broadcast. As the odd and even lines transmitted in separate fields originate from a single film image, they can be assembled without any drawback.
Figure 5: Picture showing the perceived moving image in case of 50 Hz to 100 Hz up-conversion applying picture repetition (a), picture averaging (b), and poly-phase filtering (c).
Figure 6: De-interlacing a video signal. In case of motion, assembling the lines from the odd (a) and the even (b) field leads to strong artifacts (d). When interpolating the missing lines from a single field only, alias remains (c). Only if the motion between the fields is precisely compensated for does assembling lead to a perfect de-interlacing (e).
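The two simplest de-interlacing methods described above can be sketched as follows, with each field modelled as a toy list of 1-D luminance lines; the helper names are hypothetical.

```python
def line_repetition(field):
    """De-interlace by repeating each transmitted line to fill the
    missing neighbour. Alias in the field remains; edges get jagged."""
    frame = []
    for line in field:
        frame += [line, line]   # transmitted line, then its repeat
    return frame

def field_repetition(cur_field, prev_field):
    """De-interlace by taking the missing lines from the previous,
    opposite-parity field. Here cur_field carries the even frame lines
    and prev_field the odd ones. Perfect for stationary content, but
    moving content produces the strong artifacts of Figure 6d."""
    frame = []
    for cur, prev in zip(cur_field, prev_field):
        frame += [cur, prev]    # interleave current and previous field
    return frame
```

For stationary content both fields sample the same scene and field repetition restores the full frame exactly; for moving content the interleaved lines come from two different instants, which is precisely the artifact the MC methods above try to avoid.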
MOTION ESTIMATION
still lower hardware cost, such as the number of significantly different pixels [14], sacrificed too much quality.
A more significant reduction of the complexity resulted from efficient search techniques. This effort, for coding applications, resulted in three-step search [15], logarithmic search [14], and one-at-a-time search techniques [16]. Unfortunately, all these methods decreased the already weak relation with the true motion of the objects in the scene.
A breakthrough occurred when hierarchical methods were proposed [17, 18], as these not only reduced the operations count, but simultaneously improved the consistency of the vector field, resulting in a closer relation to the true motion. Two of these methods, the hierarchical block matching of Reference [17] and the phase plane correlation (PPC) of Reference [18], are found in professional VFC equipment.
The breakthrough necessary for introduction in consumer ICs came with the introduction of the recursive search block matcher (RS-BM) [10]. The background of the RS-BM is that, if objects are larger than blocks and have inertia, then the best candidate vector is available in a spatio-temporal neighbourhood. Experiments indicated that with just three spatio-temporal predictors, a single random vector, and a well-chosen set of penalties added to the match criterion, a superior large-range true-motion ME could be realised. This concept is used in the first generation of MC VFC consumer ICs [11, 19].
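The RS-BM candidate evaluation might be sketched as follows for a single block; the candidate set, the penalty values, and the helper names are illustrative assumptions, not the actual design of [10].

```python
import random

def sad(prev, cur, bx, by, dx, dy, bs):
    """Sum of absolute differences for block (bx, by) of size bs,
    matching the current picture against the previous one shifted
    by candidate vector (dx, dy). Borders are clipped."""
    err = 0
    for y in range(by, by + bs):
        for x in range(bx, bx + bs):
            py = min(max(y + dy, 0), len(prev) - 1)
            px = min(max(x + dx, 0), len(prev[0]) - 1)
            err += abs(cur[y][x] - prev[py][px])
    return err

def rs_bm_block(prev, cur, bx, by, bs, spatial, temporal, rng):
    """Pick the best vector from a small candidate set: spatial
    predictors (unpenalised), a temporal predictor, and one random
    update of a spatial predictor (penalised to favour smooth,
    "inertial" candidates)."""
    update = (spatial[0][0] + rng.choice([-1, 0, 1]),
              spatial[0][1] + rng.choice([-1, 0, 1]))
    candidates = [(v, 0) for v in spatial]       # no penalty
    candidates += [(temporal, 1), (update, 4)]   # penalised candidates
    best_v, best_e = None, None
    for (dx, dy), penalty in candidates:
        e = sad(prev, cur, bx, by, dx, dy, bs) + penalty
        if best_e is None or e < best_e:
            best_v, best_e = (dx, dy), e
    return best_v
```

The point of the small candidate set is that only a handful of SAD evaluations per block are needed, instead of the hundreds a full search requires, while the spatio-temporal predictors keep the vector field consistent.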
Figure 7: Screen photographs showing MC interpolation using vectors from a full search block matcher (a), and motion-compensated interpolation using vectors from the 3-D Recursive Search block matcher (b). In (c), the motion blur resulting from a non-MC picture interpolation is shown.

In the second generation [12], sub-pixel accuracy was added, which enabled picture rate conversion as well as the accuracy-demanding MC de-interlacing. This generation further applied an extra candidate vector, next to the spatio-temporal prediction and the random candidate, derived from a global motion model [20].
A natural extension of the above-mentioned parameter model, capable of describing the global camera movements, results from segmenting the image into individual objects and estimating motion parameters for each of these objects. This follows the trend pixel-block-object. That is attractive, as the number of blocks typically exceeds the number of objects by two or three orders of magnitude. Consequently, one could hope for a further reduction of the operations count. Moreover, the consistency of the vector field within objects would be guaranteed.
The main problem in realising an object-based ME (OME) turns out to be the object segmentation. This far from trivial task easily costs more operations than the actual motion parameter estimation [21]. The breakthrough came with the insight that it is possible to just start a number of independent and different parameter estimators, and assign image portions to individual estimators. Although the assignment, based on a best-match criterion, occurs rather accidentally this way, the process converges towards a fair object segmentation, provided that each individual parameter estimator is focused on the image parts where it turned out to be best. This focusing is achieved by increasing the contribution to the match criterion for these areas, while decreasing the contribution from the rest of the picture. It is further interesting that the match optimisation can be realised on some 1% of the luminance data and the assignment on a down-scaled image, which tremendously reduces the operations count to a level where real-time DSP software becomes feasible [22].
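The assignment-and-focusing loop can be illustrated with a toy sketch in which each "parameter estimator" is just a translation vector, and the match criterion is replaced by a stand-in distance between a block's true motion and the candidate; all names and numbers are illustrative, not those of [21, 22].

```python
def block_error(block_motion, candidate):
    """Stand-in for a match criterion: L1 distance between the block's
    motion and the estimator's candidate vector."""
    return (abs(block_motion[0] - candidate[0]) +
            abs(block_motion[1] - candidate[1]))

def assign_and_focus(block_motions, estimators, rounds=3):
    """Alternate between assigning blocks to their best estimator and
    re-estimating each estimator from its own blocks only (an extreme
    form of up-weighting the image parts where it was best)."""
    assignment = [0] * len(block_motions)
    for _ in range(rounds):
        # Assignment: each block goes to the best-matching estimator.
        for i, bm in enumerate(block_motions):
            errs = [block_error(bm, est) for est in estimators]
            assignment[i] = errs.index(min(errs))
        # Focusing: re-estimate each estimator's parameters from the
        # blocks currently assigned to it.
        for e in range(len(estimators)):
            mine = [bm for i, bm in enumerate(block_motions)
                    if assignment[i] == e]
            if mine:
                estimators[e] = (sum(m[0] for m in mine) / len(mine),
                                 sum(m[1] for m in mine) / len(mine))
    return assignment, estimators
```

Even though the initial assignment is essentially accidental, the alternation converges towards a grouping of blocks by common motion, which is the "fair object segmentation" the text describes.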
(Figure: bar chart comparing the operations count, on a scale from 10⁻¹ to 10³, the accuracy, and the consistency of the discussed ME methods: FS-BM, PPC, 3-D RS, and OME.)
OUTLOOK
REFERENCES
[1] A.W.M. van den Enden and N.A.M. Verhoeckx, Discrete-Time Signal Processing, Prentice Hall, 1989, ISBN 0-13-216763-8, pp. 233-.
[2] A. Puri et al., 'Video Coding With Motion-Compensated Interpolation for CD-Rom Applications,' Signal Processing: Image Communication, vol. 2, Aug. 1990, pp. 127-144.
[3] Cafforio and Rocca, 'Methods for measuring small displacements of television images,' IEEE Tr. on IT, Sep. 1976, pp. 573-579.
[4] G. de Haan et al., 'An Evolutionary Architecture for Motion-Compensated 100 Hz Television,' IEEE Tr. on CSVT, Jun. 1995, pp. 207-217.