
Pre-processing for mobile and OTT

Jérôme Vieron (PhD)

ATEME, Bièvres, France, j.vieron@ateme.com

Introduction
The Broadcast & Broadband Industry is embracing various “content everywhere” initiatives. This is a
reality for Content Owners who bypass the Operators and launch Over-The-Top (OTT) services. Thus,
within a web portal or App, they can offer their content directly to the end consumer and maximize
the value of their “goods.” The goal is to address as many devices as possible to reach the broadest
audience.

The multi-screen delivery market is a dynamic landscape. On top of premium IPTV delivery, new and increasingly nomadic display devices emerge every year. Changes in the market lead to new formats and dramatic increases in subscriber content demand. Both aspects must be addressed as fast as possible to ensure that video content is delivered to the broadest audience. Additionally, the nature of the delivered content itself has an impact: users can decide to change the processing properties (bitrate, resolution) depending on the streaming event (sports or news) and the targeted audience's device (web or smartphone).

OTT/multi-screen delivery markets evolve very quickly and no longer sit alongside the TV experience but are a continuation of it. The audience wants to consume more and more video content, from self-produced video (captured by smartphone, tablet, etc.) to premium HD video content (live events, concerts, sports, home theater, etc.). Moreover, even if OTT initially targeted low-capacity devices and limited network bandwidth, devices with increased decoding capacity (multi-core architectures) are arriving on the market and are now able to manage higher spatial resolutions, frame rates and bitrates.

In this new context, the expectation for impeccable video quality is increasing. Therefore, the need
for flawless pre-processing tools to improve the end-user visual experience, while facilitating video
compression, becomes crucial. This paper provides a description of the latest high standard pre-
processing tools designed by our research teams. The described algorithms are the result of years of
research and development expertise in video and image processing, applied to bring the highest
quality to mobile and OTT end users. We present in the following pages three major pre-processing
tools: Deinterlacing, Denoising, and Motion Blurring. For each proposed algorithm, we describe the
theoretical foundations and implementation details. Finally, performance comparisons with state-of-
the-art tools are provided.
1 Deinterlacing
Deinterlacing is the process of converting so-called "interlaced" video into "progressive scan"
video. As already mentioned, legacy video content (up to HDTV) is interlaced, and most content in the
broadcasting area is still captured (and produced) with interlaced camera systems. Deinterlacing is
therefore required for mobile and OTT deliveries.

An interlaced frame is built by intermixing a pair of consecutive half-height pictures, which are then
called fields rather than pictures. Hence, an interlaced frame is made up of two interleaved fields:
odd and even scan lines represent the top and bottom fields respectively.

As depicted in Figure 1, Deinterlacing consists in first deinterleaving the top and bottom fields and
then filling in the missing lines in order to obtain two full frames.

Figure 1 - Deinterlacing principles.
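
As a minimal illustration of this principle, the sketch below (Python/NumPy, assuming a grayscale frame stored as a 2-D array) deinterleaves the two fields and rebuilds each missing line by averaging the two nearest field lines. It only illustrates Figure 1; the proposed deinterlacer uses far more elaborate interpolation, described in the following sections.

```python
import numpy as np

def deinterlace_naive(interlaced: np.ndarray):
    """Split an interlaced frame into its two fields and rebuild two full frames
    by simple line averaging (illustration of Figure 1 only)."""
    h = interlaced.shape[0]
    frames = []
    for parity in (0, 1):                      # 0: keep the top field, 1: keep the bottom field
        frame = interlaced.astype(np.float32)  # fresh copy for each field
        for y in range(1 - parity, h, 2):      # lines belonging to the other field are rebuilt
            above = frame[y - 1] if y - 1 >= 0 else frame[y + 1]
            below = frame[y + 1] if y + 1 < h else frame[y - 1]
            frame[y] = 0.5 * (above + below)   # average of the two nearest kept lines
        frames.append(frame)
    return frames                              # one progressive frame per field
```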

In the following sections we provide a description of the new deinterlacer that we designed in order
to offer the best video quality to the end user. As in all our designs, various quality modes,
corresponding to different tradeoffs between computation speed and visual quality, have been
considered.

1.1 Theoretical foundations

Deinterlacing is a complex problem which has spurred a great deal of research activity in the image and
signal processing community. Most of the approaches proposed in the literature are based on spatial
interpolation, and few integrate motion compensation to better interpolate the signal. Unfortunately,
most of them lead to irregular object contours and lines. To deal with such drawbacks, we are
interested in two kinds of approaches.

The first approach is based on a directional interpolation algorithm which consists of estimating the
local edge direction and interpolating along the found direction. In particular, the ELA (Edge-based
Line Averaging) algorithm described in [1] proposes computing a gradient for each considered direction
in order to detect a local edge: the smaller the gradient, the greater the contribution of the
corresponding direction to the interpolation. This technique is a purely spatial interpolation
algorithm.
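
The following sketch illustrates the ELA principle on a single missing line, assuming NumPy arrays; the candidate direction set and the inverse-gradient weighting are illustrative choices for the sketch, not the exact design of [1].

```python
import numpy as np

def ela_interpolate_line(above: np.ndarray, below: np.ndarray,
                         directions=(-2, -1, 0, 1, 2), eps=1.0) -> np.ndarray:
    """Interpolate one missing line from the field lines above and below it.
    Each candidate direction d pairs above[x-d] with below[x+d]; the smaller the
    gradient |above[x-d] - below[x+d]|, the larger its weight in the result."""
    w = above.shape[0]
    out = np.zeros(w, dtype=np.float32)
    for x in range(w):
        num, den = 0.0, 0.0
        for d in directions:
            xa, xb = int(np.clip(x - d, 0, w - 1)), int(np.clip(x + d, 0, w - 1))
            grad = abs(float(above[xa]) - float(below[xb]))
            weight = 1.0 / (grad + eps)          # small gradient -> strong contribution
            num += weight * 0.5 * (float(above[xa]) + float(below[xb]))
            den += weight
        out[x] = num / den
    return out
```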
Figure 2 - Example of a neural network structure.

The second approach consists of neural network interpolation [2]. A neural network is a
computational model composed of interconnected nodes that is used to model complex
relationships between inputs and outputs. Neural networks are based on the principle of learning by
experience: for a given problem, the network "learns" from known data (sets of known
inputs/outputs) so that, once the learning process is done, it can compute the outputs for any given
inputs. Each node in the network, called a neuron, performs a linear operation followed by a non-linear
one using a set of weights. The learning process consists of finding the optimal weights.

1.2 Algorithm overview

The proposed deinterlacing algorithm is made up of the following important steps:

1. Field deinterleaving, spatial interpolation and preprocessing
2. Motion estimation stage
3. High-quality spatio-temporal field interpolation
4. Spatio-temporal post-processing

In the following sections we describe each of these steps in detail; the processing may naturally
vary from one quality mode to another.

Figure 3 – Deinterlacer block diagram.


1.3 Preprocessing

We first separate and interpolate the fields of the interlaced source in order to perform motion
estimation on full resolution frames. For this first interpolation, a simple bicubic filter is used. At this
stage, a high quality interpolation is not needed, since it is followed by an image smoothing to ease
the motion estimation process.

Thus, both temporal and spatial smoothing filters are applied. Frames are smoothed temporally using
the neighboring fields within a window whose size depends on the quality mode: the colocated pixels
from the defined window of frames are averaged to produce the smoothed frame.

Then, spatial smoothing is performed using a simple low-pass smoothing mask.

Finally, a sharpening stage is applied to reduce the effect of smoothing and recover the edges by
comparing the interpolated image, the result of the temporal filter and the result of the spatial filter.
The result is a frame ready for motion estimation.
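
A minimal sketch of this pre-processing stage is given below, assuming the fields have already been bicubic-interpolated to full frames stored in a Python list; the temporal window radius and the 3x3 low-pass mask are placeholder choices (the actual mask and window size depend on the quality mode), and the final sharpening stage is omitted here.

```python
import numpy as np
from scipy.ndimage import convolve

def smooth_for_motion_estimation(frames, center, radius=1):
    """Temporal then spatial smoothing of one interpolated frame before motion
    estimation. 'frames' is a list of same-sized 2-D arrays; 'center' is the index
    of the frame being processed."""
    window = frames[max(0, center - radius): center + radius + 1]
    temporally_smoothed = np.mean(np.stack(window).astype(np.float32), axis=0)

    kernel = np.array([[1, 2, 1],
                       [2, 4, 2],
                       [1, 2, 1]], dtype=np.float32) / 16.0   # generic low-pass mask (placeholder)
    return convolve(temporally_smoothed, kernel, mode='nearest')
```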

1.4 Motion estimation

We then perform sub-pixel motion estimation with frame weighting in order to handle illumination
changes in the scene. The number of motion estimations to be performed per frame depends
on the targeted quality mode. For the best quality mode, up to three backward and three forward
motion estimations per frame are performed.

1.5 High quality Interpolation

A good spatial interpolation is essential for a high quality deinterlacer. The proposed deinterlacer is
based on two interpolation algorithms depending on the quality mode: Neural Network
interpolation for the high quality mode, and an Improved Edge-based Line interpolation for the
other modes.

Neural Network interpolation


Two neural networks are used in our approach: the first network is used to decide what kind of
interpolation to use for the current pixel and the second network is used to perform the
interpolation itself. Such an architecture limits the complexity related to the neural network
interpolation by applying it only where it is necessary. Elsewhere a bicubic interpolation is
performed.

The so-called "Decision Network" is made up of four neurons whose inputs are a window of
4x16 neighboring pixels around the current pixel. This network returns a decision output that is set to 0
or 1 according to the interpolation that should be used.
The "Interpolation Network" is bigger and more complex. It takes 96 input pixels from the 6x16
neighboring window around the current pixel and contains a hidden layer of 32 neurons. It returns
the value of the desired interpolated pixel. The combination used to compute the output is non-
linear this time and uses an exponential computed at each neuron (cf. Figure 4).

Figure 4 - Interpolation neural network.
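
A sketch of the interpolation network's forward pass is given below, assuming trained weights obtained offline (W1, b1, W2, b2 are hypothetical parameters) and a sigmoid-style activation, which is one possible reading of "an exponential computed at each neuron"; the exact non-linearity is an assumption.

```python
import numpy as np

def sigmoid(x):
    # non-linear combination using an exponential at each neuron (assumed form)
    return 1.0 / (1.0 + np.exp(-x))

def interpolation_net_forward(window_6x16: np.ndarray,
                              W1: np.ndarray, b1: np.ndarray,
                              W2: np.ndarray, b2: float) -> float:
    """Forward pass: 96 input pixels (6x16 window), one hidden layer of 32 neurons,
    one output = the interpolated pixel value. W1 is (32, 96), b1 is (32,), W2 is (32,)."""
    x = window_6x16.reshape(-1).astype(np.float32)   # 96 inputs
    hidden = sigmoid(W1 @ x + b1)                    # 32 hidden neurons
    return float(W2 @ hidden + b2)                   # interpolated pixel value
```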


Improved Edge-based Line Interpolation
This second approach is based on an improved ELA algorithm that we call IELI (Improved Edge-
based Line Interpolation). In our algorithm, 12 directions are considered: the 11 directions depicted
in Figure 5 plus the horizontal direction. Moreover, the fields are first interpolated using bicubic
filtering before the gradient computation. Thus, we compute a second-order differential rather than a
first-order one, which makes the estimation more robust. The robustness is further improved
by computing three gradients on the adjacent pixels instead of one.

Figure 5 - Directions used for the Improved Edge-based line Interpolation algorithm

At this point we have a full resolution frame (actually a window of frames), the related motion
vectors and data computed in step 2. This data is used in the next step for motion compensation.
1.6 Spatio-temporal image post-processing

In order to prevent lines from fluttering up and down in the deinterlaced output and to correct areas
where spatial interpolation is wrong, motion compensation is used. This step is very important to
correct, or improve, the spatial interpolation by integrating temporal information. Motion
compensation takes into account the temporal correlation between the fields and consequently
allows the result to be more fluid.

Thus, a motion compensated temporal smoothing filter (cf. Figure 6), based on the confidence
of the motion compensation, was implemented. For each block of the current frame, we perform an
Overlapped Block Weighted Average where the weights are computed according to a confidence
metric. In order to obtain a robust confidence metric and weighting, block
appearance/disappearance, together with flash and fade detection, are taken into account.

Figure 6 - Motion Compensated Temporal smoothing.

For the higher-quality modes, the motion-compensated temporal smoothing filter is applied iteratively
twice, instead of once, as shown in Figure 7. To deinterlace a given field, we then make use
of the signal temporally located up to two fields backward and forward.

Figure 7 - Two stages motion compensated temporal smoothing.
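
The confidence-weighted blending at the heart of this filter can be sketched as follows, assuming per-neighbour confidence values in [0, 1]; the overlapped-block handling and the exact confidence computation are omitted, and the blending strength is an illustrative parameter.

```python
import numpy as np

def mc_temporal_smooth_block(cur_block, mc_blocks, confidences, max_strength=0.5):
    """Blend one block of the current frame with its motion-compensated temporal
    neighbours. A low confidence (e.g. around appearing/disappearing content,
    flashes or fades) makes the corresponding temporal contribution negligible."""
    acc = cur_block.astype(np.float32)          # the current block always contributes with weight 1
    total = 1.0
    for blk, conf in zip(mc_blocks, confidences):
        w = max_strength * float(np.clip(conf, 0.0, 1.0))
        acc = acc + w * blk.astype(np.float32)
        total += w
    return acc / total
```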

Finally, in order to recover the edges, a new sharpening filter, more complex and efficient than the
one used in the preprocessing step, is applied. We start by creating an image where each pixel is the
average value of the original pixel's spatial neighbors. Then we compute a weighted average of this
image with the original smoothed one.
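
One possible reading of this step is classic unsharp masking, sketched below with an assumed neighbourhood size and strength; the filter actually used is more elaborate than this sketch.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def sharpen(image: np.ndarray, amount: float = 0.6) -> np.ndarray:
    """Edge recovery: build an image of local neighbour averages, then combine it
    with the smoothed image. With a negative weight on the averaged image this is
    unsharp masking: out = (1 + amount) * image - amount * blurred."""
    blurred = uniform_filter(image.astype(np.float32), size=3)   # neighbour average
    return (1.0 + amount) * image - amount * blurred
```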
1.7 Results and performance

The proposed deinterlacer offers better visual quality than state-of-the-art deinterlacers.
Figure 8 illustrates a comparison between two quality modes of our deinterlacer and TDeint [1]
on the Ship video sequence. The neural network is used for the high-quality mode and IELI for the
medium-quality mode. As can be seen, our deinterlacer recovers edges better than TDeint,
particularly thin diagonal lines, which are hard to deinterlace.

a) Neural Network b) IELI c) TDeint

Figure 8 – Comparison of the proposed Deinterlacer (2 quality modes) and the TDeint approach.

Figure 9 provides a comparison of our neural network approach against the state-of-the-art MvBob
[3] algorithm on the Party scene video sequence. Our deinterlacer perfectly reproduces the thin lines
on the wall and the floor as well as the object contours. The MvBob algorithm only partially recreates
the high-frequency details on the cubes. Finally, MvBob is not able to deal with scene cuts, whereas
our deinterlacer is.
Figure 9 – Comparison between the MvBob (Top) and the proposed Deinterlacer (Bottom).
2 Denoiser

Figure 10 - Example of film grain.

Film grain is a high-frequency signal, without any temporal correlation, associated with a given video.
The grain has various characteristics depending on the considered video sequence: size, shape,
direction, chromaticity, sensitivity to luminosity, etc. Figure 10 depicts an example of video film grain.

Film grain plays a significant role in the atmosphere of a scene. It has been observed that, in terms of
visual perception, film grain reinforces the feeling of realism for the audience. However, depending
on the targeted bitrate, the video compression process can have a strong impact on this grain. In
this context, the role of a denoiser is to remove as much grain, or "noise", as possible in order to
ease the compression process, while preserving the "good" grain expected by the audience.

In the following section we provide a description of the Denoiser algorithm we designed in order to
address the needs related to the mobile and OTT delivery applications. Various quality modes are
considered corresponding to different tradeoffs between computation speed and visual quality.

2.1 Theoretical foundations

The noise reduction filter we designed follows an original approach integrating motion compensation
and noise detection processes. The proposed filter is a spatio-temporal filter based on anisotropic
diffusion, as presented in [4]. This filter is capable of smoothing out white noise and insignificant textures in order
to improve video coding efficiency, while preserving significant information. One feature of the
anisotropic diffusion is to homogenize the image by iteratively propagating the value of one pixel to
its neighboring pixels. The anisotropic diffusion technique is based on the Laplace equation (heat
diffusion) and consists of the convolution of the image with a Gaussian function:

\[ I(x, y, t) = I_0(x, y) * G_{\sigma(t)}(x, y) \]

where \( G_{\sigma(t)} \) is a Gaussian kernel whose standard deviation grows with the diffusion time \( t \).
The drawback of such a technique is that it degrades the contours at each iteration. In
order to avoid erasing the contours, the diffusion equation is modified by considering that the
anisotropic diffusion is maximal everywhere in the image except around the contours. The L2 norm of the
gradient is used as a contour detector: a high value of the gradient indicates a strong probability of
contour presence. The modified diffusion equation is then:

\[ \frac{\partial I}{\partial t} = \mathrm{div}\big( g(\lVert \nabla I \rVert)\, \nabla I \big) \]

where \( g(\cdot) \) is a non-linear (e.g. Lorentzian) function which preserves contours (near 0 for a high
gradient value and near 1 for a low one):

\[ g(x) = \frac{1}{1 + \left( \frac{x}{K} \right)^{2}} \]

To solve the above diffusion equation numerically, the following recurrence relation is used [5]:

\[ I_s^{n+1} = I_s^{n} + \frac{\lambda}{|\eta_s|} \sum_{p \in \eta_s} g\big( \nabla I_{s,p}^{n} \big)\, \nabla I_{s,p}^{n}, \qquad \nabla I_{s,p}^{n} = I_p^{n} - I_s^{n} \]

where \( I_s^{n} \) is the value of one pixel \( s \) after \( n \) iterations of anisotropic diffusion, \( \lambda \) is a
stabilization coefficient which controls the diffusion gain, and \( \eta_s \) represents the four pixels
neighboring the pixel \( s \) in the image \( I \), as depicted in Figure 11.

Figure 11 – Neighborhood of the pixel s.
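
The spatial recurrence above is the classic Perona-Malik scheme [5], which can be sketched as follows (wrap-around borders and illustrative parameter values are used for brevity).

```python
import numpy as np

def lorentz(grad: np.ndarray, k: float) -> np.ndarray:
    # g() is close to 1 for small gradients (diffuse) and close to 0 near contours
    return 1.0 / (1.0 + (grad / k) ** 2)

def anisotropic_diffusion(image: np.ndarray, n_iter: int = 5,
                          lam: float = 0.25, k: float = 10.0) -> np.ndarray:
    """Spatial anisotropic diffusion: at each iteration every pixel receives a fraction
    lam of the gradient towards each of its 4 neighbours, attenuated by the Lorentzian
    function g so that contours are preserved."""
    img = image.astype(np.float32)
    for _ in range(n_iter):
        north = np.roll(img, 1, axis=0) - img
        south = np.roll(img, -1, axis=0) - img
        west = np.roll(img, 1, axis=1) - img
        east = np.roll(img, -1, axis=1) - img
        flux = sum(lorentz(d, k) * d for d in (north, south, east, west))
        img = img + (lam / 4.0) * flux
    return img
```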

The presented recurrence diffusion equation is a spatial operation. The strength of the filtering is
controlled by the parameter \( \lambda \) and increases with the number of iterations. The diffusion is optimal
when the value of this parameter is low and the number of iterations is high. A spatio-temporal
diffusion can be introduced by taking into account two supplementary neighboring pixels from the
previous frame and the next one (cf Figure 12).
Figure 12 – Spatio-temporal diffusion: colocated neighboring.

In the spatio-temporal case, the spatial neighborhood \( \eta_s \) is extended with a temporal neighborhood
\( \eta_t = \{ s^{-}, s^{+} \} \) made of the two pixels colocated with \( s \) in the previous and next frames.

The diffusion equation in the spatio-temporal context is:

Equation 1

\[ I_s^{n+1} = I_s^{n} + \frac{\lambda}{|\eta_s|} \sum_{p \in \eta_s} g\big( \nabla I_{s,p}^{n} \big)\, \nabla I_{s,p}^{n} + \frac{\lambda_t}{|\eta_t|} \sum_{q \in \eta_t} g\big( \nabla (K * I)_{s,q}^{n} \big)\, \nabla I_{s,q}^{n} \]

where \( K \) is a kernel which, convolved with the image \( I \), reduces the errors in the gradient computation
that are caused by strong grain noise being falsely detected as contours.

Naturally, in order to improve the computation of the temporal term, we introduced a motion
compensation process: instead of choosing \( s^{-} \) and \( s^{+} \) as the pixels colocated with \( s \) in the previous
and next frames respectively, the motion-compensated pixels are used, as illustrated in Figure 13.

Figure 13 - Motion compensated spatio-temporal diffusion.


2.2 Algorithm overview

The proposed denoiser is based on Equation 1, which is a recurrence formula. In order to implement
such an iterative spatio-temporal diffusion algorithm, we designed a system based on a buffer whose
size is a function of the number of iterations nb_Iter. This filter parameter is set depending on the
targeted quality mode. Thus, an image will exit the denoising process only once the required number
of subsequent images, determined by nb_Iter, has been pushed into the buffer.
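
A minimal streaming-buffer sketch is given below; it assumes a frame leaves the process after nb_Iter further frames have been pushed (the exact buffer sizing is not detailed here), and diffuse_once is a hypothetical callback standing for one pass of Equation 1, not the product implementation.

```python
from collections import deque

class IterativeDenoiserBuffer:
    """Streaming buffer for the iterative spatio-temporal diffusion. Each stored frame
    receives one more diffusion pass every time a new frame is pushed, and (under the
    assumption made here) leaves the buffer after nb_iter further pushes."""

    def __init__(self, nb_iter, diffuse_once):
        self.nb_iter = nb_iter
        self.diffuse_once = diffuse_once
        self.frames = deque()          # frames still being diffused, oldest first
        self.last_out = None           # last frame that already left the buffer

    def push(self, new_frame):
        self.frames.append(new_frame)
        # every frame that already has a "next" neighbour gets one more iteration
        for i in range(len(self.frames) - 1):
            if i > 0:
                prev = self.frames[i - 1]
            elif self.last_out is not None:
                prev = self.last_out
            else:
                prev = self.frames[i]  # very first frame: no real past neighbour
            self.frames[i] = self.diffuse_once(prev, self.frames[i], self.frames[i + 1])
        if len(self.frames) > self.nb_iter:
            self.last_out = self.frames.popleft()
            return self.last_out       # fully diffused frame, ready for encoding
        return None                    # not enough future frames pushed yet
```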

Figure 14 depicts the block diagram of the proposed denoiser. The Filtering module represents
the so-called Lorentz filtering, PREME is the motion estimation stage and Bilat represents a
bilateral filter. The Blur filter corresponds to the kernel \( K \) in Equation 1: it can be chosen
as an input parameter of the denoiser and picked from a pre-determined set of filters (e.g. median,
Gaussian, average).

Figure 14 – Denoiser block diagram.

Last but not least, confidence metrics are computed in order to finely tune and weight the impact of
the temporal information on the denoising filter. To this end, we compute various kinds of
parameters (e.g. block variance and covariance, illumination variation, SAD (Sum of Absolute
Differences), etc.) on pre-filtered images in order to obtain robust metrics. The pre-filtering is
implemented with a bilateral filter we designed (cf. the Bilat module in Figure 14).
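
The sketch below shows how such block-level ingredients (SAD, variances, covariance, mean/illumination difference) could be combined into a single confidence value; the actual combination and thresholds used in the denoiser are not disclosed here, so this is only an illustration of the ingredients.

```python
import numpy as np

def block_confidence(cur: np.ndarray, mc: np.ndarray) -> float:
    """Toy confidence metric between a (pre-filtered) current block and its
    motion-compensated reference, returning a value in [0, 1]."""
    cur = cur.astype(np.float32)
    mc = mc.astype(np.float32)
    sad = float(np.abs(cur - mc).mean())
    illum = abs(float(cur.mean()) - float(mc.mean()))
    cov = float(np.mean((cur - cur.mean()) * (mc - mc.mean())))
    corr = cov / (np.sqrt(cur.var() * mc.var()) + 1e-6)   # normalized correlation-like term
    # penalize large residuals and illumination changes (illustrative weighting)
    return float(np.clip(corr, 0.0, 1.0) / (1.0 + 0.05 * sad + 0.05 * illum))
```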

2.3 Results and performance


In order to illustrate the proposed denoiser's performance, we show in Figure 15 and Figure 16 the
original source image and the same image extracted from sequences encoded with and without
denoising. The video sequence "Soprano" was chosen in SD format and encoded at 1.2 Mbit/s in
H.264/AVC.
First of all, the results show that the proposed denoiser is highly efficient on this grainy and noisy
source image. Moreover, as can be seen, when no denoising is applied the resulting grain after
encoding is irregular and blocky. With our denoiser, the video encoder behaves better, leading to
better bitrate allocation and higher video quality.

Figure 15 – Source image from video sequence “Soprano”.


Figure 16 – Comparison of encoded image(“Soprano”) without (Top) and with (Bottom) denoising.
3 Motion Blurring
In a movie or television scene, motion blur is naturally part of the video material. For technical
reasons, when a camera captures an image, that image does not represent a single instant in time but
rather the scene over a period of time. More specifically, it represents an integration of all the
images over the exposure period determined by the shutter speed. Both the exposure time and the
speed of the movement impact the blurring effect. Any object in motion with respect to the camera
position will look blurred in the direction of the relative motion. An example of motion blur is
illustrated in Figure 17, where the camera follows the motorcyclist and the background is blurred.

Figure 17 – Example of Motion Blur.

Such a motion blur effect does not bother the human eye, since its own natural behavior is very
similar. On the contrary, a sharper scene without blurring troubles the viewer.

Therefore, we designed a motion blurring algorithm which allows correction of the jerky effects
introduced by frame rate sub-sampling during video pre-processing for mobile and OTT
applications. This algorithm is described below.

3.1 Theoretical foundations

The goal of our algorithm is to model motion blur theoretically by simulating the behavior of the
camera shutter and the associated exposure time. Thus, the idea is to generate the motion blur by
integrating all the signals related to one object captured over a time period simulating the exposure
time.

We developed a new concept called the "Virtual Exposure Instant" (VEI). The idea of the VEI is to
consider all the object positions that could have been captured by a sensor in between two "real"
exposure instants, where by "real" exposure instant we mean an instant corresponding to an available
frame. Each object position in between two real exposure instants is then considered as a VEI. For the
sake of simplicity, we consider that the potential positions captured by a sensor lie on a regular
grid (the pixel grid).

In order to obtain a finer integration, an object is necessarily a pixel. Thus, each pixel has its own
"Virtual Exposure Thread" (VET). Consequently, the number of VEIs in between two real exposure
instants differs from one pixel to another: the VET length depends only on the motion displacement
length in terms of pixel positions. Figure 18 illustrates the VEI and VET concepts. A (resp. B)
represents an object in the current frame, and \( \vec{V}_F \) and \( \vec{V}_B \) represent the forward and backward
motion vectors associated with A respectively.


Figure 18 – Virtual Exposure Instant (VEI) and Virtual Exposure Thread (VET) illustrations.

In order to properly integrate the relative importance of a given pixel, i.e. its impact on the
temporal signal, we decided to perform a weighted accumulation along the motion displacements, in
both the backward and forward directions, along its VET. In order to smooth the impact of block-
based motion estimation, we propose using an Overlapped Block Motion Disparity Compensation
[6].

Finally, we believe that only true signals (true pixels) should be used to better generate the motion
blur. Consequently, no sub-pixel positions, and hence no interpolation, are used.

3.2 Algorithm overview

The proposed motion blurring algorithm, represented in Figure 19, is composed of the following steps:

1. Motion estimation
2. Scene cuts and flashes detection: Motion discard
3. Motion vector confidence
4. Weighted accumulation along the motion
5. Time integration and normalization
Figure 19 – Block diagram of the Motion Blurring algorithm.

3.3 Motion estimation

For each input frame, both a backward and a forward motion estimation are performed. As explained
before, full-pixel motion estimation is sufficient to implement our motion blur algorithm.
The motion estimation is performed on an 8x8 pixel block basis.

At this stage, in order to prevent the use of irrelevant motion estimates, flash and scene cut
detection are jointly performed, and irrelevant motion fields are directly discarded.
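
For reference, a brute-force full-pel 8x8 block matching can be sketched as follows; the search range and the exhaustive strategy are illustrative and not necessarily those of the actual motion estimator.

```python
import numpy as np

def full_pel_block_match(cur: np.ndarray, ref: np.ndarray,
                         bx: int, by: int, block: int = 8, search: int = 16):
    """Exhaustive full-pel block matching for one 8x8 block: returns the (dy, dx)
    displacement in the reference frame minimizing the SAD within +/- search pixels."""
    h, w = cur.shape
    cur_blk = cur[by:by + block, bx:bx + block].astype(np.int32)
    best, best_sad = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + block > h or x + block > w:
                continue                      # candidate block outside the reference frame
            ref_blk = ref[y:y + block, x:x + block].astype(np.int32)
            sad = int(np.abs(cur_blk - ref_blk).sum())
            if best_sad is None or sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best, best_sad
```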

3.4 Motion vector confidence metric and weights computations

This is one of the most important aspects of the proposed algorithm, since it concerns the
relevance of the VEI (and VET) construction. The idea is to compute confidence metrics for each motion
vector associated with a given 8x8 block.

The proposed method is based on the so-called SAD (Sum of Absolute Differences) computed
between the current block and the motion compensated block, as well as on what we call the autoSAD.
The autoSAD is a SAD computed between the current block and the same block shifted by
one pixel.

Then, for each forward and backward motion vector associated with the block, the SAD and autoSAD
values are compared. When the SAD is significantly higher than the autoSAD value, the motion vector
is discarded as incorrect, typically corresponding to a block appearance or
disappearance. However, when the autoSAD value is very low, it means that the current block is
already blurred; in that case, no motion vector is discarded at all.
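
These decisions can be sketched as follows, with purely illustrative thresholds (discard_ratio and flat_threshold are not the values used in the product):

```python
import numpy as np

def auto_sad(frame, bx, by, block=8):
    """SAD between the 8x8 block at (bx, by) and the same block shifted by one pixel
    (horizontally here); the caller must keep bx + block + 1 within the frame width."""
    cur = frame[by:by + block, bx:bx + block].astype(np.int32)
    shifted = frame[by:by + block, bx + 1:bx + 1 + block].astype(np.int32)
    return float(np.abs(cur - shifted).mean())

def keep_motion_vector(sad, a_sad, discard_ratio=2.0, flat_threshold=1.0):
    """Reliability test for one motion vector: a very low autoSAD means the block is
    already blurred, so the vector is kept; otherwise the vector is discarded when the
    SAD is much larger than the autoSAD (typical of appearing/disappearing content)."""
    if a_sad < flat_threshold:
        return True
    return sad <= discard_ratio * a_sad
```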

Finally, based on these values, confidence metrics are computed, and final weights are derived
according to pre-determined weighting strategies and distributions. The weighting distribution (e.g.
Gaussian, constant, linear) is a parameter of the proposed algorithm.

In the sequel, we denote by \( w_i \) the weight associated with the \( i \)-th instant of the VET associated with a
given block.
3.5 Weighted accumulation along the motion

Figure 20 – Weighted accumulation along Virtual Exposure Threads.

As explained before, the idea is to accumulate, for each pixel, its weighted contribution along its
associated VET. Thus, denoting \( L_X \) the length of \( VET_X \) and \( P_i \) the pixel value at position \( i \) in
\( VET_X \), the accumulation associated with pixel \( X \) can be written as:

\[ Acc(X) = \sum_{i=1}^{L_X} w_i \, P_i \]

The length \( L_X \) is computed by taking into account the motion displacement length for each
non-discarded direction. Moreover, in order to generate an appropriate motion blur, the motion
vectors are scaled by the ratio \( R \) between the original frame rate and the current one. Denoting
\( \vec{V}_F \) and \( \vec{V}_B \) the forward and backward vectors associated with the block, \( L_X \) is derived from the
scaled displacement lengths \( R\,|\vec{V}_F| \) and \( R\,|\vec{V}_B| \).
Hence, when down-sampling from 50p to 25p, the ratio \( R \) is equal to two.

The accumulation process is realized using two buffers: the Accumulation buffer, which stores the
accumulated weighted contribution of each pixel, and the Weights buffer, which stores the associated
accumulated weights.

In order to smooth and limit the block border impact due to motion compensation, we perform an
overlapped block weighted accumulation: the 8x8 pixel block is extended by an overlap of eight pixels
in each dimension, leading to a 16x16 pixel accumulation block. Consequently, each pixel contributes
to more than one VET.

3.6 Normalization
Once each block has been accumulated along its respective VET, the resulting blurred image is
obtained by normalization based on the Weights buffer information.
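
The accumulation and normalization steps can be sketched as follows, assuming float32 full-frame Accumulation and Weights buffers initialized to zero and a list of block positions along the VET (the names and signatures are illustrative):

```python
import numpy as np

def accumulate_block_along_vet(acc: np.ndarray, wgt: np.ndarray,
                               block16: np.ndarray, weights, positions):
    """Weighted accumulation of one overlapped 16x16 block along its Virtual Exposure
    Thread. 'positions' lists the (y, x) top-left position of the block at each VEI,
    'weights' the corresponding confidence-based weights; acc and wgt are the
    full-frame Accumulation and Weights buffers (float32, initialized to zero)."""
    h, w = block16.shape
    for (y, x), wi in zip(positions, weights):
        acc[y:y + h, x:x + w] += wi * block16
        wgt[y:y + h, x:x + w] += wi

def normalize(acc: np.ndarray, wgt: np.ndarray) -> np.ndarray:
    """Final motion-blurred frame: accumulated contributions divided by the accumulated
    weights (the guard protects pixels that were never reached)."""
    return acc / np.maximum(wgt, 1e-6)
```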

3.7 Results and performance


In order to illustrate the performance of the proposed motion blurring algorithm, we provide in
Figure 21 a source image and its motion-blurred version. All the parts of the content that are in motion
have been blurred, whereas those that are static from the camera's point of view remain sharp.
With the developed algorithm, the final video quality and the end user's quality of experience
are improved.
Figure 21 - Proposed Motion Blurring algorithm: (Top) without blurring and (Bottom) with blurring.
Conclusion
To face the quick evolution of the OTT/multi-screen delivery markets and to meet the challenges of the
new "high quality content everywhere" trend, constantly improving video compression while
designing high-standard processing tools has become a necessity. To this end, our research teams
designed and developed three pre-processing tools essential to tackling the aforementioned needs
and requirements: Deinterlacing, Denoising, and Motion Blurring.

In order to deal with the huge amount of interlaced TV content, whether legacy or still in production,
and with its OTT and mobile delivery, an efficient deinterlacer is required. In this paper, we presented
new deinterlacing tools based on neural network and directional interpolation algorithms. The results
show significant improvement and higher quality in comparison with state-of-the-art approaches.

We also provided a detailed description of a new denoising tool. A good denoiser yields a
cleaner image in the case of a noisy video source while preserving the good grain expected by the
audience. The proposed approach is based on a motion compensated spatio-temporal diffusion
algorithm. Results show a positive impact on the resulting video quality, while preserving the realism
of the image.

Finally, we described a motion blurring algorithm allowing correction of the jerky effects introduced by
the frame rate sub-sampling usually required for mobile and OTT applications. The original design is
based on modeling the camera shutter behavior by integrating all the signals related to one object
captured over a time period simulating the exposure time. Results illustrate the improved
performance of the developed approach. Last but not least, the developed algorithm can easily be
extended to deal with frame rate changes if needed.
Acknowledgment
The author would like to thank his colleagues Zineb Agyo, Anne-Lyse Lavaud, Arthur Denardou, and
Mathieu Monnier for their contributions to this paper.

References
[1] A. Lukin, "High-Quality Spatial Interpolation of Interlaced Video", Proceedings of GraphiCon'2008,
Moscow, Russia, pp. 114-117, June 2008.

[2] J.M. Bishop, R.J. Mitchell, "Neural Networks - An Introduction", IEEE Colloquium on Neural Networks
for Systems: Principles and Applications, 1991.

[3] G. De Haan, E.B. Bellers, "Deinterlacing - An Overview", Proceedings of the IEEE, vol. 86, no. 9,
pp. 1839-1857, September 1998.

[4] H. Tsuji, T. Skatari, Y. Yashima, and N. Kobayashi, "A Nonlinear Spatio-Temporal Diffusion and its
Application to Pre-filtering in MPEG-4 Video Coding", in Proc. of the IEEE International Conference on
Image Processing, 2002.

[5] P. Perona and J. Malik, "Scale-Space and Edge Detection Using Anisotropic Diffusion", IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 7, pp. 629-639, July 1990.

[6] W. Woo, A. Ortega, "Overlapped Block Disparity Compensation with Adaptive Windows for Stereo
Image Coding", IEEE Transactions on Circuits and Systems for Video Technology, vol. 10, March 2000.
