
VIDEO PROCESSING FOR MULTIMEDIA SYSTEMS

G. de Haan
Philips Research Laboratories
Prof. Holstlaan 4, 5656 AA, Eindhoven, The Netherlands

G.de.Haan@Philips.com

Video format conversion has become a key technology for multimedia systems. This paper discusses the progress in the relevant areas, i.e. spatial
scaling, de-interlacing, picture rate conversion, and motion estimation.
INTRODUCTION

Typically in a television chain, a single costly source broadcasts picture material to a large number of low-cost receivers. The high number of receivers has
evidently prevented rapid introduction of new video formats enabling a higher
spatial or temporal resolution. On the other hand, this architecture stimulated
researchers to improve the perceived image quality in a compatible way. Particularly the digital techniques that entered the television receiver around 1990,
and in parallel with them the option to store and delay image parts, pushed the use
of image enhancement techniques. Silicon technology enabled a complexity
growth according to Moore's Law, which helped the more robust, but less efficient,
digital techniques to become the natural choice even in areas where the earlier
analogue solutions required less silicon area. More or less simultaneously, video
entered the personal computer, which became a consumer electronics product.
By the end of the twentieth century, this synthesis led to multimedia products in which the video is scalable in space and time. This caused an explosion
of video formats, as in addition to the two main broadcast formats1, PC monitors
with picture rates between 60 Hz and 120 Hz, and spatial resolutions in a broad
range (VGA, SVGA, XVGA, etc.), arrived on the market. Also television receivers
profited from these techniques and decoupled their display format from the historically determined transmission format to eliminate flicker artifacts, and/or to
adapt to new display principles, which resulted in new flicker-free (100 Hz), non-interlaced (Proscan), and widescreen (16:9) formats on cathode ray tubes, plasma
1 The interlaced 50 and 60 Hz formats with 625 and 525 scanning lines, respectively.

Figure 1: The decimating, or interpolating, poly-phase filter (a) calculates output pixels
at any position as a weighted average of input pixels. In (b), the simpler and inferior pixel
repetition and pixel dropping technique is illustrated.

panel displays, and liquid crystal screens. Currently, also videotelephony, video
from the Internet, and graphics are being merged with broadcast signals.
In conclusion, video format conversion (VFC) has become a key technology
for multimedia systems. To enable conversion at a consumer price level, breakthroughs were required in motion estimation, de-interlacing, and robust picture
interpolation. We shall discuss progress in these areas in separate sections. Although the other format conversion ingredient, spatial scaling, is an almost trivial
application of long available theory, a brief section serves to show the differences
and commonalities with the other areas.
SPATIAL SCALING

Scaling a time-discrete representation of a continuous signal is a
straightforward application of long available theory [1], and perfection, in the
spatial domain, is currently achievable at a consumer price level. It involves
the use of decimating (sub-sampling) and interpolating (up-sampling) low-pass
filters for down-scaling and up-scaling respectively, as illustrated in Figure 1a.
Non-integer scaling results as a (virtual) cascade of up-scaling and down-scaling.
Memories are used for buffering, as samples are written at a first (input) sampling frequency and read at a second (output) sampling frequency. A decimating
low-pass filter reduces the bandwidth of the input signal to less than half of the
output sampling frequency and eliminates the redundant samples, while the interpolating low-pass filter processes at the higher output sampling frequency by
adding zero-valued samples to the input samples (zero-stuffing) and removes repeat spectra from the input spectrum where possible. We emphasize that this
procedure yields valid results only in case the demands of the sampling theorem
are met, i.e. that the input spectrum is bandwidth limited to half the sampling

Figure 2: Screen details showing up- and down-scaling using pixel dropping and repetition,
(b), and poly-phase filtering, (c), respectively. The left-hand picture, (a), shows the original.

frequency. For vertical scaling of interlaced video material, this is not the case.
A prior de-interlacing stage is then required. In the temporal domain, more
fundamental problems prohibit the use of linear filters. Therefore, an inferior
conversion method, which has also been used for spatial scaling in the past, is still
quite popular for vertical and temporal format conversions. This simpler method
consists of pixel dropping and/or pixel repetition. Figure 1b illustrates the procedure, while Figure 2b shows the resulting image when used for spatial scaling.
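The decimation/interpolation procedure above can be sketched in a few lines. The following Python fragment is illustrative only; the function names and the short windowed-sinc filter are our own choices, not taken from [1]. It up-scales a 1-D signal by an integer factor via zero-stuffing and low-pass filtering, next to the inferior pixel repetition alternative of Figure 1b.

```python
import numpy as np

def upscale_polyphase(x, L, taps=8):
    """Up-scale a 1-D signal by integer factor L: insert L-1 zeros between
    samples (zero-stuffing), then low-pass filter to suppress repeat spectra."""
    stuffed = np.zeros(len(x) * L)
    stuffed[::L] = x
    # Windowed-sinc low-pass, cutoff at the input Nyquist frequency, DC gain L.
    n = np.arange(-taps * L, taps * L + 1)
    h = np.sinc(n / L) * np.hamming(len(n))
    return np.convolve(stuffed, h, mode='same')

def upscale_repetition(x, L):
    """Inferior alternative: simply repeat every pixel L times."""
    return np.repeat(x, L)
```

Pixel repetition keeps the repeat spectra intact, which shows up as the blockiness of Figure 2b, whereas the poly-phase filter suppresses them at the cost of a few multiplications per output pixel.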
PICTURE RATE CONVERSION

In the temporal domain, there is no prefiltering prior to the sampling by the
camera. It may appear as if such a filter, though difficult to realize, would bring
down the problem of temporal interpolation of video images to the common sampling rate conversion problem solved for spatial scaling already. To understand
why this does not yield acceptable results, we have to look at some properties of
the human visual system (HVS).
As can be seen in Figure 3b, the temporal frequency response of the HVS
shows a roll-off for higher frequencies, due to integration. Figure 3a shows the
perceived image for a viewer focusing his/her eyes on a fixed part of the screen
on which a moving rotor with calendar is shown. The image is blurred, due to
the integration over a number of successive pictures.
Under normal viewing conditions though, the perceived sharpness is hardly
affected by motion. The contradiction results from the fact that the relation
shown in Figure 3b holds for temporal frequencies as they occur at the retina
of the observer. These frequencies, however, equal the frequencies at the display

Figure 3: Picture showing the perceived moving image in case the eyes are stationary with
respect to the screen (a). Although from the frequency response of the human visual system,
(b), we may expect such a result, the object tracking of the eye normally prevents this blurring.

only if the eye is stationary with respect to this display. For an object-tracking
observer, even very high temporal frequencies at the display are transformed to
much lower frequencies at the retina. Ideally, even a temporally stationary image
results, as illustrated in Figure 4. Consequently, perfect picture interpolation only
results from methods that take the object tracking, or motion compensation, into
account. This requires interpolation along the motion trajectory rather than along
the temporal axis. As can be seen from Figure 5, the use of poly-phase filters
along the temporal axis may even result in worse blurring than straightforward
pixel repetition.
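Interpolation along the motion trajectory at temporal position alpha between two pictures amounts to out(x) = (1 - alpha) * prev(x - alpha*v) + alpha * next(x + (1 - alpha)*v). The Python fragment below is our own drastic simplification: a 1-D signal and a single global, integer motion vector.

```python
import numpy as np

def mc_interpolate(prev, nxt, v, alpha=0.5):
    """Interpolate a picture at temporal position alpha (0..1) between prev
    and nxt, along the motion trajectory of a global displacement v
    (pixels per picture period); integer shifts only, for simplicity."""
    a = np.roll(prev, int(round(alpha * v)))        # prev, shifted to time alpha
    b = np.roll(nxt, -int(round((1 - alpha) * v)))  # nxt, shifted back to alpha
    return (1 - alpha) * a + alpha * b
```

For a correctly estimated vector this reproduces the moving object sharply at its intermediate position, whereas averaging along the temporal axis doubles, i.e. blurs, every moving edge.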
To enable interpolation along the motion trajectory, motion vectors that describe the displacement of image content from one picture to another are necessary. However, motion vectors cannot always be reliably estimated, and artifacts
are introduced in the resulting interpolated images [2, 3]. Moreover, even if it is
possible to describe the temporal changes in an image correctly with a motion
vector, a practical implementation of a motion estimator may not find this vector.
Clearly, just "shifting parts of the image over a vector" is a risky operation
that easily introduces artifacts, worse than the blurring that we observe from
linear methods. Figure 7 illustrates this remark, showing pictures from robust
and less robust motion compensation methods, using vectors from good, and poor,
motion estimators, respectively.
Several algorithms have been proposed to reduce the artifacts introduced
by erroneous motion vectors. Next to the obvious motion compensated (MC)

Figure 4: An object-tracking observer performs a frequency transformation. High temporal
frequencies on the screen become stationary intensities at the retina.

averaging, the use of order-statistic filters has more recently been proposed [4, 5],
and appeared in commercially available products [11, ?]. The most advanced
methods attempt deciding upon foreground and background objects in order to
better cope with the interpolation of occlusion areas [6, 7].
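A popular order-statistic variant takes, per pixel, the median of the two motion-compensated samples and the uncompensated average, so that an erroneous vector makes the output fall back towards the safe, merely blurred, average. A minimal 1-D sketch of this idea follows; it is our own simplification, not the exact filter of [4, 5].

```python
import numpy as np

def median_upconversion(prev, nxt, v, alpha=0.5):
    """Order-statistic MC interpolation: per-pixel median of the two
    motion-compensated samples and the non-compensated average, which
    limits the damage of an erroneous motion vector."""
    mc_prev = np.roll(prev, int(round(alpha * v)))
    mc_next = np.roll(nxt, -int(round((1 - alpha) * v)))
    non_mc = 0.5 * (prev + nxt)
    return np.median(np.stack([mc_prev, mc_next, non_mc]), axis=0)
```

With a correct vector the two compensated samples agree and the median passes them through; with a wrong vector they disagree and the average, lying between them, wins.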
DE-INTERLACING

Interlace is the common video broadcast procedure to transmit only the odd,
or only the even, numbered lines of a picture in alternate fields. De-interlacing
aims at restoring the full vertical resolution for every picture, i.e. making odd and
even lines available simultaneously. Closer study reveals that de-interlacing is an
implicit requirement for nearly all VFCs.
As with spatial up-scaling, the simplest method consists of pixel repetition.
If the repeated pixel is a vertical neighbour we speak of line repetition, if it is
a temporal neighbour we speak of field repetition. Field repetition preserves all
detail in a stationary image part, but introduces severe artifacts in moving parts,
as can be seen in Figure 6a2. Line repetition, on the other hand, cannot eliminate
the alias present in a single field and leads to jagged edges, as shown in Figure
6c. All kinds of adaptive methods switching between spatial and temporal interpolation, as well as 2-D vertical-temporal, linear and non-linear, interpolation
filters, and methods aiming at an interpolation along 2-D spatial edges, have been
proposed. The more advanced methods use motion compensation. Although a
theoretical solution exists, based on motion compensation and a generalisation
of the sampling theorem [8], its robustness against vector errors is a problem [9],
and various MC alternatives have been proposed. Nevertheless, MC methods are
2 An exception occurs when film material is broadcast. As the odd and even lines transmitted in separate fields originate from a single film image, they can be assembled without any drawback.

Figure 5: Picture showing the perceived moving image in case of 50 Hz to 100 Hz up-conversion
applying picture repetition (a), picture averaging (b), and poly-phase filtering (c).

clearly superior to the older methods, as shown in an extensive evaluation of
de-interlacing methods [9].
However, MC methods have long been judged too expensive for consumer
applications, mainly due to the high price of the motion vector estimator, which in
this application needs to have sub-pixel accuracy. Here, more recently, breakthroughs have been reported [10, 20], which made it feasible that some generations
of MC de-interlacing methods are available in consumer products [11, 19, 12].
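The two simplest de-interlacing methods discussed above can be sketched as follows (Python, illustrative only; a field is stored as an array holding just its transmitted lines):

```python
import numpy as np

def line_repetition(field):
    """De-interlace a single field by repeating every transmitted line;
    no motion artifacts, but the intra-field alias remains (Figure 6c)."""
    out = np.empty((2 * field.shape[0], field.shape[1]))
    out[0::2] = field
    out[1::2] = field
    return out

def field_insertion(odd_field, even_field):
    """Weave two successive fields into one picture; perfect for stationary
    image parts, severe artifacts for moving ones (Figure 6d)."""
    out = np.empty((2 * odd_field.shape[0], odd_field.shape[1]))
    out[0::2] = odd_field
    out[1::2] = even_field
    return out
```

The adaptive methods mentioned above essentially switch, per pixel, between outputs of this spatial and this temporal kind, while the MC methods first shift the neighbouring field over the estimated motion vector before weaving.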
PROGRESS IN MOTION ESTIMATION

The first motion estimation (ME) algorithms were computationally highly
complex. It was absolutely necessary to put effort into simplification to enable
introduction in consumer products. An interesting observation is that although
indeed the computational complexity of the ME algorithms for VFC decreased
over time, at the same time their quality improved. Intuitively, this is in line
with a trend that we can see in the techniques being proposed, which range from
pixel-based methods (pel-recursive, or gradient, algorithms) [13], via block-based
methods (various types of block matching) [10], to object-based algorithms [22].
This trend was not sufficient though. Although there are fewer blocks than
pixels in an image, the huge number of candidates tested in a full search block
matcher (FS-BM) easily demands more processing power than the single calculation of an update in a pel-recursive estimator, even when this calculation has
to be repeated for every pixel. The search for reduced-complexity match criteria
has converged, via the normalised cross correlation and the mean squared error, to
the most popular summed absolute difference. This did not, however, reduce
the operations count by the required orders of magnitude. Alternatives with

Figure 6: De-interlacing a video signal. In case of motion, assembling the lines from the odd
(a) and the even (b) field leads to strong artifacts (d). When interpolating the missing lines
from a single field only, alias remains (c). Only if the motion between the fields is precisely
compensated for, assembling leads to a perfect de-interlacing (e).

still lower hardware cost, such as counting the number of significantly different pixels [14],
sacrificed too much quality.
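For reference, the summed-absolute-difference criterion and the exhaustive candidate scan of a full search block matcher can be sketched as follows (Python, illustrative; block size and search range are arbitrary choices of ours):

```python
import numpy as np

def sad(block_a, block_b):
    """Summed absolute difference: the most popular match criterion."""
    return float(np.sum(np.abs(block_a - block_b)))

def full_search(prev, cur, bx, by, bs=8, rng=4):
    """Full-search block matcher: test every candidate displacement in
    [-rng, rng]^2 for the block at (by, bx) of cur against prev, and
    return the (dx, dy) with the lowest SAD."""
    ref = cur[by:by + bs, bx:bx + bs]
    best, best_v = np.inf, (0, 0)
    for dy in range(-rng, rng + 1):
        for dx in range(-rng, rng + 1):
            y, x = by + dy, bx + dx
            if 0 <= y and y + bs <= prev.shape[0] and 0 <= x and x + bs <= prev.shape[1]:
                cost = sad(ref, prev[y:y + bs, x:x + bs])
                if cost < best:
                    best, best_v = cost, (dx, dy)
    return best_v
```

The double loop makes the cost explicit: (2*rng+1)^2 SAD evaluations per block, which is precisely the operations count the efficient search techniques below attack.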
A more significant reduction of the complexity resulted from efficient search
techniques. This effort, for coding applications, resulted in three-step search [15],
logarithmic search [14], and one-at-a-time search techniques [16]. Unfortunately,
all these methods further weakened the already weak relation with the true motion of
the objects in the scene.
A breakthrough occurred when hierarchical methods were proposed [17, 18],
as these not only reduced the operations count, but simultaneously improved the
consistency of the vector field, resulting in a closer relation to the true motion.
Two of these methods, the hierarchical block matching of Reference [17] and the phase
plane correlation (PPC) of Reference [18], are found in professional VFC equipment.
The breakthrough necessary for introduction in consumer ICs resulted with
the introduction of the recursive search block matcher (RS-BM) [10]. The background of the RS-BM is that, if objects are larger than blocks and have inertia,
then the best candidate is available in a spatio-temporal neighbourhood. Experiments indicated that with just three spatio-temporal predictors, a single random
vector, and a well-chosen set of penalties added to the match criterion, a superior
large-range true-motion ME could be realised. This concept is being used in the
first generation of MC VFC consumer ICs [11, 19].
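The candidate selection of such a recursive search block matcher can be sketched as follows (Python, illustrative; the candidate set and penalty values are simplified relative to [10]):

```python
import numpy as np

def sad(a, b):
    """Summed absolute difference match criterion."""
    return float(np.sum(np.abs(a - b)))

def rs_select(prev, cur, bx, by, candidates, penalties, bs=8):
    """Pick the best vector for the block at (by, bx) from a small candidate
    set (spatio-temporal predictors plus a random update); the penalties
    bias the choice towards the predictors, keeping the vector field smooth
    even where the SADs of several candidates are similar."""
    ref = cur[by:by + bs, bx:bx + bs]
    best, best_v = np.inf, (0, 0)
    for (dx, dy), pen in zip(candidates, penalties):
        y, x = by + dy, bx + dx
        if 0 <= y and y + bs <= prev.shape[0] and 0 <= x and x + bs <= prev.shape[1]:
            cost = sad(ref, prev[y:y + bs, x:x + bs]) + pen
            if cost < best:
                best, best_v = cost, (dx, dy)
    return best_v
```

In a full estimator this selection runs block by block, the winning vector immediately serving as spatial predictor for neighbouring blocks, which is what makes the convergence to the true motion so fast.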
In the second generation [12], sub-pixel accuracy was added, which enabled
picture rate conversion as well as the accuracy-demanding MC de-interlacing.
This generation further applied an extra candidate vector, next to the spatio-temporal prediction and the random candidate, derived from a global motion
model [20].

Figure 7: Screen photographs showing MC interpolation using vectors from a full search
block matcher, (a), and motion compensated interpolation using vectors from the 3-D Recursive
Search block matcher, (b). In (c), motion blur resulting from a non-MC picture interpolation
is shown.
A natural extension of the above-mentioned parameter model, capable of
describing the global camera movements, results from segmenting the image into
individual objects and estimating motion parameters for each of these objects.
This follows the trend pixel-block-object. That is attractive, as the number of
blocks typically exceeds the number of objects by two or three orders of magnitude. Consequently, one could hope for a potentially further reduced operations count. Moreover, consistency of the vector field within objects would be
guaranteed.
The main problem in realising an object-based ME (OME) turns out to be
the object segmentation. This far from trivial task easily costs more operations
than the actual motion parameter estimation [21]. The breakthrough came with
the insight that it is possible to just start a number of independent and different parameter estimators, and assign image portions to individual estimators.
Although the assignment, based on a best-match criterion, occurs rather accidentally this way, the process converges towards a fair object segmentation, provided
that each individual parameter estimator is focused on the image parts where it
turned out to be best. This focusing is achieved by increasing the contribution
to the match criterion for these areas, while decreasing the contribution from the
rest of the picture. Interesting is further that the match optimisation can be
realised on some 1% of the luminance data and the assignment on a down-scaled
image, which tremendously reduces the operations count to a level where real-time DSP software becomes feasible [22].
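A toy version of this segmentation-by-competition can be sketched as follows (Python; our own drastic simplification of the scheme of [21, 22]: a 1-D signal, purely translational "objects", hard assignment instead of criterion weighting, and exhaustive refitting instead of a real parameter estimator):

```python
import numpy as np

def sad(a, b):
    return float(np.sum(np.abs(a - b)))

def object_me(prev, cur, n_est=2, bs=8, iters=3, v_range=4):
    """Toy object-based ME: start n_est independent translational estimators
    on disjoint image portions, then alternate between (1) focusing: each
    estimator refits its vector on its own blocks only, and (2) assignment:
    every block moves to the estimator that matches it best."""
    nb = len(cur) // bs

    def block(sig, b, v=0):
        return np.roll(sig, v)[b * bs:(b + 1) * bs]

    assign = [min(b * n_est // nb, n_est - 1) for b in range(nb)]
    vectors = [0] * n_est
    for _ in range(iters):
        for e in range(n_est):                       # focusing step
            mine = [b for b in range(nb) if assign[b] == e]
            if mine:
                vectors[e] = min(range(-v_range, v_range + 1),
                                 key=lambda v: sum(sad(block(cur, b), block(prev, b, v))
                                                   for b in mine))
        assign = [int(np.argmin([sad(block(cur, b), block(prev, b, v))
                                 for v in vectors])) for b in range(nb)]
    return vectors, assign
```

The real algorithm applies the focusing by re-weighting the match criterion rather than by hard assignment, uses richer parametric motion models, and matches on roughly 1% of the luminance data [22].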

Figure 8: The evolution of motion estimation (FS-BM, PPC, 3-D RS, OME). Note that the
vertical axis is logarithmic. The operations count drops dramatically over time, and the
consistency increases, while the accuracy (prediction error) remains roughly the same. The
last two aspects reflect that the vectors more closely describe the true object velocities.

OUTLOOK

Progress in the field of ME has caused an evolution in algorithms along the
path of pixel-based, block-based, and object-based methods, see Figure 8. While
the operations count decreased over time, the quality greatly increased, and the calculations became more irregular. Consequently, the hardware-software balance
of the algorithms moved to increased software content, while the most recent algorithms are implemented entirely in software running in real time on a DSP. Mainly
driven by this progress in ME, MC picture rate conversion and de-interlacing have
shown rapid performance improvement, and new algorithms have been developed
with inherent robustness against motion vector inaccuracies. The quality achieved
with this recent progress enables a great amount of flexibility in choosing the
video format of modern multimedia systems.
REFERENCES

[1] A.W.M. van den Enden and N.A.M. Verhoeckx, Discrete-Time Signal Processing, Prentice Hall, 1989, ISBN 0-13-216763-8, pp. 233-.
[2] A. Puri et al., `Video Coding With Motion-Compensated Interpolation for CD-ROM Applications,' Signal Processing: Image Communications 2, Aug. 1990, pp. 127-144.
[3] Cafforio and Rocca, `Methods for measuring small displacements of television images', IEEE Tr. on IT, Sep. 1976, pp. 573-579.
[4] G. de Haan et al., `An Evolutionary Architecture for Motion-Compensated 100 Hz Television,' IEEE Tr. on CSVT, Jun. 1995, pp. 207-217.
[5] O.A. Ojo and G. de Haan, `Robust motion-compensated video up-conversion', IEEE Tr. on CE, Nov. 1997, pp. 1045-1055.
[6] G.A. Thomas and M. Burl, `Vector assignment for video image motion compensation', US patent no. 6,005,639.
[7] A. Pelagotti and G. de Haan, `High quality video in a PC', IEEE Multimedia Systems, Vol. 2, Jun. 1999, Florence, pp. 872-876.
[8] P. Delogne et al., `Improved interpolation, motion estimation and compensation for interlaced pictures', IEEE Tr. on Im. Proc., Sep. 1994, pp. 482-491.
[9] G. de Haan and E.B. Bellers, `Deinterlacing - an overview', Proc. of the IEEE, Sep. 1998, pp. 1839-1857.
[10] G. de Haan et al., `True motion estimation with 3-D recursive search block-matching,' IEEE Tr. on CSVT, Oct. '93, pp. 368-388.
[11] G. de Haan et al., `IC for motion compensated 100 Hz TV, with a smooth motion movie-mode,' IEEE Tr. on CE, May 1996, pp. 165-174.
[12] G. de Haan, `IC for motion compensated de-interlacing, noise reduction, and picture rate conversion,' IEEE Tr. on CE, Aug. '99, pp. 617-624.
[13] J. Driessen et al., `Pel-recursive motion estimation from image sequences', J. Visual Comm. Image Represent., '91.
[14] Y. Ninomiya and Y. Ohtsuka, `A motion compensated interframe coding scheme for television pictures', IEEE Tr. on Comm., Vol. 30, no. 1, '82.
[15] T. Koga et al., `Motion-compensated interframe coding for video conferencing,' IEEE Proc. NTC 81, G5.3.1, '81.
[16] R. Srinivasan and K. Rao, `Predictive coding based on efficient motion estimation,' IEEE Tr. on Comm., pp. 888-896, '85.
[17] R. Thoma and M. Bierling, `Motion compensating interpolation considering covered and uncovered background,' Signal Processing: Image Communications 1, pp. 191-212, '89.
[18] G. Thomas, `Television motion measurement for DATV and other applications,' BBC Research Report, no. BBC RD 1987/11, Nov. '87.
[19] M. Shu et al., `System on silicon for motion compensated scan rate conversion, picture-in-picture processing, split screen applications and display processing', IEEE Tr. on CE, Aug. '99, pp. 842-850.
[20] G. de Haan and P. Biezen, `An efficient true-motion estimator using candidate vectors from a parametric motion model', IEEE Tr. on CSVT, Mar. '98, pp. 85-91.
[21] A. Tekalp, Digital Video Processing, pp. 200-203, Prentice Hall Signal Processing Series, 1995, ISBN 0-13-190075-7.
[22] R.J. Schutten and G. de Haan, `Real-time 2-3 pull-down elimination applying motion estimation / compensation on a programmable device', IEEE Tr. on CE, Aug. 1998, pp. 930-938.
