
Published in IET Circuits, Devices & Systems
Received on 9th March 2012
Revised on 28th May 2013
Accepted on 29th May 2013
doi: 10.1049/iet-cds.2013.0097

ISSN 1751-858X

VLSI implementation of high-throughput parallel H.264/AVC baseline intra-predictor
Shih-Chang Hsia1, Ying-Chao Chou2
1 Department of Electronics Engineering, National Yunlin University of Science and Technology, Yunlin, Taiwan
2 Department of Computer and Communication, National Kaohsiung First University of Science and Technology, Taiwan
E-mail: hsia@yuntech.edu.tw

Abstract: This study presents a parallel very large scale integration (VLSI) architecture for an intra-predictor based on a fast 4 × 4 algorithm. For real-time scheduling, the proposed algorithm overcomes the data dependency between intra-prediction and intra-coding, thereby improving coding performance and reducing the number of coding cycles. The high-speed architecture for intra-prediction includes configurable computation cores that process the YUV components with 10-pixel parallelism. Prediction for one macro-block (MB) coding (luminance: 4 × 4 and 16 × 16 block modes; chrominance: 8 × 8 block modes) can be completed within 256 cycles. The proposed architecture achieves a throughput of 410 kMB/s, suitable for a 1920 × 1080/35 Hz 4:2:0 HDTV encoder at a working frequency of 105 MHz.

1 Introduction

Video coding techniques have progressed from H.261 (in
1990) [1], MPEG-1 and MPEG-2 [2], H.263 and H.263++ [3] and MPEG-4 [4] to H.264/MPEG-4 Part-10 (in 2004) [5, 6]. Many advanced techniques have been adopted to improve the coding efficiency of H.264/AVC, including intra-prediction, integer transform, adaptive variable length coding, various block sizes and multi-frame reference [7–10]. The H.264/AVC standard can reduce bit-rates by 64, 49 and 39%, compared with MPEG-2 [2], H.263 [3] and MPEG-4 [4], respectively. However, the computational requirements of an H.264/AVC system are far more complex than those of previous standards [1–4]. Fast algorithms have recently been developed for intra-prediction, inter-mode decision, motion estimation, transform and rate-distortion optimisation (RDO) [7–12]. These algorithms can reduce the computational complexity of an H.264/AVC codec.
Intra-mode coding is particularly important to the H.264 codec, for I-frames as well as P- and B-frames. For motion-compensated frame coding, the adaptive selection of intra- and inter-modes to find the best mode from RDO can reduce the bit-rate by 10–20% compared with inter-mode coding alone [11]. Intra-prediction can improve the coding efficiency of intra-coding; however, it takes up most of the computing cycles. This study sought to develop a real-time architecture for intra-prediction, in which the H.264/AVC standard could be used to specify the coding mode with DC prediction and eight directional predictions for 4 × 4 luminance blocks. Intra-predicted blocks are computed using the boundary pixels of the neighbouring blocks [6]. The optimal mode can be selected to improve quality and lower the bit-rate when the differential error between the intra-predicted block and the original coded
block is low. For the processing of smooth luminance blocks, the 16 × 16 block mode may also be used to reduce rate-distortion (RD) costs. The H.264/AVC system defines four coding modes for 16 × 16 block processing: DC, vertical, horizontal and plane. The modes predicted for the chrominance signal using 8 × 8 blocks are similar to those of the 16 × 16 luminance. Among them, the 4 × 4 block mode consumes more computational power than the others. A number of fast algorithms have been proposed to reduce the computational complexity associated with intra-prediction [13–22]. Chip designers employ fast algorithms to reduce circuit complexity and enable the introduction of low-cost, high-speed H.264/AVC chips for applications running in real time [23–29].
The fast algorithm for 4 × 4 block prediction proposed in this study requires about half the computation of the full search. A parallel architecture is also proposed, using a pipelined schedule to realise a real-time design. These advancements enable a throughput of 1.5 samples per cycle, making the architecture highly efficient. The remainder of this paper is organised as follows. The proposed fast algorithm is described in Section 2. The real-time architecture schedule and the design of its modules are proposed in Sections 3 and 4, respectively. Implementation and comparisons are outlined in Section 5. Conclusions are presented in Section 6.

2 Fast algorithm for intra-prediction

To ensure cost-effective intra-prediction, this study proposes a fast intra-mode decision algorithm to reduce the computation required for 4 × 4 luminance blocks. The H.264/AVC standard includes nine prediction modes for a 4 × 4 block, each of which is assigned a number from 0 to 8: vertical (0), horizontal (1), DC (2), diagonal down-left (3), diagonal down-right (4), vertical-right (5), horizontal-down (6), vertical-left (7) and horizontal-up (8).
Intra-prediction processing includes the generation of prediction pixels, sum of absolute differences (SAD) operations and RD cost computation. To reduce computational complexity, this study applies the near pixel correlation integration (NPCI) algorithm, which analyses the features of each mode with far less computation [30]. Thus, many of the modes can be rejected before the process enters the H.264/AVC coding kernel.
The proposed NPCI takes advantage of the high degree of correlation among predicted pixels along the predicted direction by combining them to obtain a differential parameter. The nine parameters are used to estimate the coding cost of each mode using simple computations. The parameters represent features of each block mode and facilitate the identification of candidates without the need to compute block predictions or RD costs. Finally, only the first four minimum parameters are selected as candidate modes to enter the H.264/AVC coding kernel, which suits a regular hardware design. Details related to the underlying theory and experimental results can be found in our previous paper [30]. The proposed fast algorithm for 4 × 4 block prediction can reduce processing complexity by approximately half, while maintaining coding performance. Full-search prediction is employed for 16 × 16 luminance and 8 × 8 chrominance blocks. Previous results [30] demonstrate that the bit-rate of the proposed method is lower than that of other algorithms, and that its speed is far superior to comparable techniques with only a negligible drop in video quality.
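To make the pre-selection step concrete, the following C sketch (function and variable names are our own, not taken from the chip design) reduces the nine NPCI parameters to the four candidate modes that enter the H.264/AVC coding kernel; the parameter values themselves would be computed as listed later in Table 1.

```c
#include <stdint.h>

#define NUM_MODES      9   /* H.264 4x4 intra modes 0..8            */
#define NUM_CANDIDATES 4   /* modes kept for the full RD evaluation */

/* Select the NUM_CANDIDATES modes with the smallest NPCI parameters.
 * param[m] is the differential parameter of mode m (see Table 1);
 * cand[] receives the selected mode indices, smallest parameter first.
 * A simple selection pass is enough for nine values.                  */
void npci_select_candidates(const uint32_t param[NUM_MODES],
                            int cand[NUM_CANDIDATES])
{
    uint8_t used[NUM_MODES] = {0};

    for (int k = 0; k < NUM_CANDIDATES; k++) {
        int best = -1;
        for (int m = 0; m < NUM_MODES; m++) {
            if (!used[m] && (best < 0 || param[m] < param[best]))
                best = m;
        }
        used[best] = 1;
        cand[k] = best;
    }
}
```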

3 Real-time system schedule

This study developed a high-throughput intra-prediction architecture for an H.264/AVC encoder based on the proposed fast algorithm. For real-time HDTV coding, the chip must process about one pixel per cycle. For example, the pixel rate of 1080 × 1920/60 Hz, YUV = 4:1:1, is 178 M samples/s. The clock rate of the selected chip is close to the pixel rate for a high-speed implementation. Hence, the coding of one macro-block (MB) (16 × 16 pixels) must be finished within approximately 256 cycles. To achieve this end, a pipeline schedule is used to share the computational load between the intra-predictor and the intra-coder. Intra-prediction requires 256 cycles for the nth MB mode decision. In the following 256 cycles, the nth MB can be coded by the intra-coder through transform and entropy coding. At the same time, the intra-predictor can identify the optimal mode for the (n + 1)th MB.

Fig. 1 illustrates the interface between the frame memory, the temporary register and the intra-coder. Real-time applications require dual memory ports: one port for writing the decoding results to the memory and the other for reading the reconstructed pixels from the memory into the temporary register. For intra-prediction, the boundary pixels first need to be loaded into the temporary register. After intra-prediction, the optimal mode can be selected and its relative boundary pixels read in order to compute the predicted pixels for intra-coding. The image can be reconstructed using the intra-decoder, whereupon the decoded pixels are written to the frame memory. The register must be operated simultaneously with a multi-port output for intra-prediction and intra-coding. The same boundary pixels may be read from the temporary register for the intra-prediction of the (n + 1)th block and the intra-coding of the nth block at the same time. To provide flexibility, the temporary register uses parallel outputs.

Fig. 1 Interface of frame memory, register and intra-coder
Generally, intra-prediction is used to predict pixel values for pixel reconstruction. All approaches to real-time intra-prediction face the same problem: the strong data dependency on the reconstructed YUV reference pixels [24, 25]. Predicted pixels are obtained from the up-side and left-side pixels when searching for the coding mode. The up-side pixels are reconstructed from the coding results of the previous row block stored in the frame memory; they can be read from the line-buffer into the temporary register to compute the coding mode. The left-side pixels are produced from the decoding results of the previous column block. The reconstructed pixels must be processed from the residuals using the transform, quantisation, de-quantisation and inverse transform modules. For the intra-prediction of the current block, we must therefore wait for the results of the previous block. To overcome the problem of data dependency between intra-prediction and intra-coding, the reference pixels of the left-side boundary may adopt the original pixels, rather than the decoded pixels, for mode prediction to prevent idle cycles.
To verify the practicality of this approach, we evaluated the coding performance of modifying the reference pixels, using the H.264/AVC reference software JM 16.0 [31]. The results are presented in Figs. 2a and b for the sequences news and carphone, respectively. Evaluations of up to 20 sequences provided similar results. The performance of the proposed fast algorithm was evaluated by comparing the original codec (with reconstructed pixels) and the modified one (with non-coded pixels). The results were very close at high bit-rates using low quantisation parameters (QP); however, with a high QP for low bit-rate coding, the reconstructed pixels produced a large amount of distortion because of the high quantisation error. When a full mode search was implemented for intra-prediction, the use of decoded pixels outperformed that of modified pixels. However, when the fast algorithm was employed for intra-prediction, only four modes were selected to determine the optimal RD. At a low bit-rate, the decoded pixels presented considerable distortion from the original pixels. In such a case, the pre-decision mechanism of the fast algorithm was unable to accurately identify the best mode among the four presented, resulting in degraded coding performance. When the modified (non-coded) pixels were used, the performance exceeded that of the decoded pixels with the fast algorithm, because non-coded pixels reflect the actual edge information for intra-prediction under low bit-rate conditions. The edge information from decoded pixels produces high distortion at low bit-rates, so the prediction error using decoded pixels exceeds that of non-coded pixels when the proposed fast prediction algorithm is used. The results demonstrate that the modified method with non-coded pixels may reduce the bit-rate of the original method (using decoded pixels) by about 0.4% on average. Thus, this approach overcomes the data dependency of intra-prediction on intra-coding, thereby improving coding performance and reducing the number of waiting cycles. This approach is also compatible with the H.264 decoder, which does not specify how the prediction mode is to be selected.

Fig. 2 Sequences news and carphone
a and b Show the references with the original reconstructed pixels and with the modified pixels of our approach for news and carphone, respectively

Fig. 3 Position of the sixteen 4 × 4 blocks and their reference pixels for one MB
MBs are the basic processing unit for video coding. Fig. 3 presents the position of the sixteen 4 × 4 blocks and the reference pixels for intra-prediction. The timing schedule for data access (intra-coding) is presented in Fig. 4. In real-time applications, the intra-prediction of each 4 × 4 block requires 16 cycles. The up-side pixels are the reconstructed pixels of the previous row block, which can be loaded into the temporary register from the memory. The original sampling pixels are used for the left-side pixels, because the reconstructed pixels are unavailable. The near-optimal mode is selected through intra-prediction, using the proposed algorithm. Intra-coding is subsequently performed by transform, quantisation and VLC to produce the coding bit-stream. The 4 × 4 block can be transformed within four cycles [32], and four coefficients are quantised within one cycle. Similarly, the de-quantisation and inverse transform can be processed using one and four cycles, respectively. The right-side pixels of the (i, j)th block can therefore be reconstructed from the residuals of the intra-coder within 10 cycles (4 + 1 + 1 + 4). The reconstructed four boundary pixels of the (i, j)th block can be written to the temporary register for the intra-coding of the (i, j + 1)th block in two cycles. The data are subsequently stored in the frame memory.

Fig. 4 Timing schedule between intra-prediction and intra-coding
Using the pipelined scheduling technique, the
intra-prediction for the (i, j + 1)th block and intra-coding for
the (i, j)th block are performed in the same time slot. Pixels
of the (i, j)th block are used for intra-prediction of the (i, j

+ 1)th block. However, the (i, j)th block is not reconstructed
at this time. To improve coding speed, the original pixels of
the (i, j)th block are employed for intra-prediction of the (i,
j + 1)th block. While the (i, j)th block is being coded, the
coding mode for the (i, j + 1)th block can be identified from
intra-prediction. Proceeding to the next cycle, the right-side
pixels of the (i, j)th block can be read as the reference for
the coding of the (i, j + 1)th block and intra-prediction
computing of the (i, j + 2)th block. This schedule continues
to the (i, j + 3)th block, whereupon the four blocks in the
first row are completed. To process the second-row blocks,
the up-side pixels of the (i + 1, j), (i + 1, j + 1) and (i + 1, j
+ 2)th blocks are updated using the bottom pixels of the (i,
j), (i, j + 1) and (i, j + 2)th blocks for the prediction of the
(i + 1, j), (i + 1, j + 1) and (i + 1, j + 2)th blocks.
In the prediction cycle, the fast algorithm selects the coding
mode, using the original pixel rather than the reconstructed
pixel for the left-side prediction. This approach does not
violate coding rules, because the H.264 standard does not
specify the method used for the selection of
intra-prediction. However, for intra-coding, the reference
pixels must use the reconstructed pixels to ensure decoding
quality. For the coding of the (i, j + 1)th block, the (i, j)th
block was reconstructed in the previous cycle; thus, we can
employ pixels of the (i, j)th block as the reference for the
coding of the (i, j + 1)th block. During the process of
prediction, the left-side pixels from the previous block were
reconstructed and stored in the memory, using 10 cycles.
Throughout the coding cycle, the predicted pixels can be
computed using the reconstructed pixels, according to the
selected mode, whereupon the difference between the
predicted and original pixels is used for H.264 intra-coding.
The residual data are added to the reference pixels to update
the reconstructed pixels for each frame, and to write pixels
to the frame memory. This pipeline scheduling enables the reconstruction of the pixels in the 4 × 4 blocks by intra-coding within 10 cycles. The intra-coding cycle is shorter than the time required for prediction (16 cycles). Hence, the critical path occurs in the intra-prediction core, which dominates the processing time.
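As a purely illustrative software model (not the RTL, and with hypothetical helper functions), the sketch below mirrors the interleaving described above: the mode of block (i, j + 1) is predicted from original left-side pixels while block (i, j) is being intra-coded.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the predictor and intra-coder cores.      */
static int predict_mode(int i, int j, int use_original_left)
{
    (void)use_original_left;
    return (i + j) % 9;            /* placeholder: any mode 0..8        */
}
static void code_block(int i, int j, int mode)
{
    printf("coded block (%d,%d) with mode %d\n", i, j, mode);
}

/* One row of four 4x4 blocks: predict (i, j+1) while coding (i, j).    */
static void process_mb_row(int i)
{
    int next_mode = predict_mode(i, 0, 1);    /* original left pixels   */

    for (int j = 0; j < 4; j++) {
        int cur_mode = next_mode;
        if (j < 3)                            /* overlap with coding    */
            next_mode = predict_mode(i, j + 1, 1);
        code_block(i, j, cur_mode);           /* reconstructed reference */
    }
}

int main(void)
{
    for (int i = 0; i < 4; i++)
        process_mb_row(i);
    return 0;
}
```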

4 Parallel VLSI architecture and module design

Fig. 5 Parallel architecture for intra-predictor
To achieve high throughput, this study proposes a parallel architecture for the real-time intra-predictor, as shown in Fig. 5. Luminance (in the 4 × 4 and 16 × 16 block modes) and chrominance (in the 8 × 8 block mode) can be processed in parallel. For the 4 × 4 block mode, the pre-decision core of the proposed fast algorithm provides four predicted modes for a regular hardware design. According to the pre-selected modes, the reference block is computed from the previously reconstructed blocks, using a configurable architecture. The difference between the reference block and the original block is used to compute the SAD or sum of absolute transformed differences (SATD) value. SAD can be used to avoid the computational cost of SATD (including a Hadamard transform) with only a negligible drop in quality. This approach also reduces the circuit costs. SAD computation uses four parallel subtractors to meet real-time requirements. The computation of a single 4 × 4 block requires 16 cycles. The processing time for each block mode is 16/4 = 4 cycles (four modes are selected). Thus, the
processing time for one MB is 16 × 16 = 256 cycles, which includes 16 4 × 4 blocks. The 16 × 16 block mode for luminance and the 8 × 8 block mode for chrominance are computed at the same time. H.264/AVC defines the four
predicted modes for the 16 × 16 luminance (Y) and 8 × 8 chrominance (UV) blocks. Pixels of the horizontal and vertical modes can be directly copied from the boundary pixels. Because the computational load is low, a common configurable core is used to share the operator for the computation of pixels in the DC and plane modes (16 × 16 Y-block and 8 × 8 UV-block). This also helps to reduce hardware costs. For 16 × 16 Y-block prediction, the computational time required for one mode is 256/4 = 64 cycles, using four subtractors in parallel. This totals 64 × 4 = 256 cycles to compute the four modes of the 16 × 16 Y-block. The 8 × 8 UV-blocks require 64/2 = 32 cycles for one SAD computation, as two subtractors are used for the U or V mode. Because U and V each have four modes, this also requires 32 × 2 × 4 = 256 cycles. Thus, a total of 4, 4 and 2 parallel subtractors are used to compute the SAD values for the 4 × 4 Y-blocks, 16 × 16 Y-blocks and 8 × 8 UV-blocks, respectively. The SAD results are sent to the mode decision module in parallel. The coding mode is selected according to the RD criterion that determines the best performance. However, for low-cost hardware designs, the final selection of the mode uses a simple cost function [22, 25]

f = SAD + 4 × P × λ(QP)    (1)

where P is either 1 or 0. If P = 0, the cost function uses only the SAD value. If P = 1, the function also depends on λ, where the lambda is obtained from an approximate exponential function of the quantisation scale QP. This approach can be used to select the coding mode before the real coding bit-rate is obtained, which is beneficial for a real-time chip implementation.
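For reference, a minimal C model of the cost in (1) is shown below; the λ(QP) approximation λ ≈ 0.85 × 2^((QP−12)/3) is the one commonly used in the JM reference software and is our assumption here, since the paper only states that λ follows an approximate exponential of QP.

```c
#include <math.h>
#include <stdint.h>
#include <stdlib.h>

/* SAD between a 4x4 predicted block and the original block. */
uint32_t sad_4x4(const uint8_t pred[4][4], const uint8_t orig[4][4])
{
    uint32_t sad = 0;
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            sad += (uint32_t)abs((int)pred[y][x] - (int)orig[y][x]);
    return sad;
}

/* Mode cost of (1): f = SAD + 4*P*lambda(QP).
 * lambda(QP) ~ 0.85 * 2^((QP-12)/3) is a JM-style approximation
 * (an assumption; the paper only calls it "approximate exponential"). */
double mode_cost(uint32_t sad, int P, int qp)
{
    double lambda = 0.85 * pow(2.0, (qp - 12) / 3.0);
    return (double)sad + 4.0 * P * lambda;
}
```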
4.1 Architecture for 4 × 4 block prediction

Fig. 6 presents the architecture of 4 × 4 prediction, which includes two parts: the pre-decision core using the proposed algorithm, and the configurable prediction circuit that generates particular block patterns according to the H.264/AVC standard. For a regular design, the computation of the nine parameters is allocated to the hardware using the finite state machine (FSM), according to Table 1, which lists the parameter computations
and processing cycles, using our previous fast algorithm [30].

Fig. 6 Architecture of 4 × 4 block intra-prediction

Fig. 7 Relationship of interpolated pixel and boundary pixel

Fig. 7 presents the relative pixel positions for intra-prediction. The FSM selects which pixels are input into the computational kernel for the computation of the parameters. In real-time applications, only one or two clock cycles are used to compute a single parameter, such that the nine parameters can be completed within 14 clocks. Finally, a comparator is used to find the 1st–4th minimum values from the nine parameters, using one clock cycle.
Fig. 8 illustrates the configurable architecture according to the proposed algorithm. The boundary pixels and coded pixels are selected using multiplexers controlled by the FSM. The FSM flow depends on which parameter is being computed. In each cycle, the circuit can read four boundary pixels and four coded pixels from the temporary registers. Differential results are accumulated and sent to the comparator circuit. For example, for the computation of parameter EV (according to Table 1), the multiplexers MUX-1 and MUX-2 select P(1, 0) and B, respectively, whereas MUX-3 and MUX-4 select P(1, 3) and B, respectively. The absolute summation of [P(1, 0) − B] and [P(1, 3) − B] is performed along path 1. In the same way, the result for path 2 is [P(2, 0) − C] + [P(2, 3) − C], using MUX-5 to MUX-8. Finally, the EV parameter is obtained from the absolute sum of the results of the two paths. However, to compute VR, two clocks are required to compute the SAD, using eight boundary pixels and coded pixels. In the first clock cycle, MUX 1–8 select A, P(2, 2), B, P(1, 0), B, P(3, 2), C and P(2, 1), respectively. The differential values are accumulated and then saved to a register (R). In the second clock cycle, MUX 1–8 select I, P(1, 3), X, P(1, 2), X, P(0, 1), A and P(0, 0), respectively, whereupon the VR parameter can be calculated by adding the current differential results to the accumulated register R value.

Fig. 8 Computational circuit of the fast algorithm
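The EV and VR entries of Table 1 (reproduced below) can also be written directly in software. In the sketch, cur[r][c] stands for the current-block pixel P(r, c), and the Boundary fields follow the A–L, X labelling of Fig. 7; that labelling is an assumption about the figure, and the datapath grouping only mirrors the two accumulation paths/clocks described above.

```c
#include <stdlib.h>

/* Boundary pixels (Fig. 7 naming assumed): X = corner, A..F above, I..L left. */
typedef struct {
    int X, A, B, C, D, E, F;   /* top-left corner and upper boundary */
    int I, J, K, L;            /* left boundary                       */
} Boundary;

/* EV parameter of Table 1: one accumulation path per absolute value. */
int param_EV(const Boundary *b, const int cur[4][4])
{
    int path1 = (cur[1][0] - b->B) + (cur[1][3] - b->B);
    int path2 = (cur[2][0] - b->C) + (cur[2][3] - b->C);
    return abs(path1) + abs(path2);
}

/* VR parameter of Table 1: the two absolute terms correspond to the
 * two clock cycles of the MUX schedule described in the text.        */
int param_VR(const Boundary *b, const int cur[4][4])
{
    int clk1 = (b->A - cur[2][2]) + (b->B - cur[1][0])
             + (b->B - cur[3][2]) + (b->C - cur[2][1]);
    int clk2 = (b->I - cur[1][3]) + (b->X - cur[1][2])
             + (b->X - cur[0][1]) + (b->A - cur[0][0]);
    return abs(clk1) + abs(clk2);
}
```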
Table 1 Computations of the predictive parameters and their processing cycles

Parameter (clock requirement):
EV = |[P(1, 0) − B] + [P(1, 3) − B]| + |[P(2, 0) − C] + [P(2, 3) − C]|  (1 clock)
EH = |[P(0, 1) − J] + [P(3, 1) − J]| + |[P(0, 2) − K] + [P(3, 2) − K]|  (1 clock)
DC = |[J − P(1, 1)] + [L − P(3, 1)] + [B − P(1, 3)] + [D − P(3, 3)]|  (1 clock)
DDL = |[D − P(3, 0)] + [E − P(1, 2)] + [E − P(2, 1)] + [F − P(0, 3)]|  (1 clock)
DDR = |[A − P(0, 0)] + [X − P(1, 1)] + [X − P(2, 2)] + [I − P(3, 3)]|  (1 clock)
VR = |[A − P(2, 2)] + [B − P(1, 0)] + [B − P(3, 2)] + [C − P(2, 1)]| + |[I − P(1, 3)] + [X − P(1, 2)] + [X − P(0, 1)] + [A − P(0, 0)]|  (2 clocks)
HD = |[I − P(3, 3)] + [I − P(2, 2)] + [J − P(1, 2)] + [J − P(0, 1)]| + |[J − P(1, 2)] + [J − P(3, 3)] + [K − P(0, 2)] + [K − P(2, 3)]|  (2 clocks)
VL = |[C − P(2, 0)] + [D − P(2, 1)] + [D − P(1, 1)] + [E − P(3, 0)]| + |[D − P(2, 2)] + [E − P(2, 3)] + [E − P(3, 1)] + [F − P(3, 2)]|  (2 clocks)
HU = |[J − P(0, 1)] + [J − P(1, 0)] + [K − P(2, 0)] + [K − P(3, 0)]| + |[L − P(2, 2)] + [L − P(2, 3)] + [L − P(1, 3)] + [L − P(3, 2)]|  (2 clocks)

Fig. 9 Pipelined schedule for 4 × 4 block prediction

Fig. 10 Pipeline timing schedule for 16 × 16 Y-block and 8 × 8 UV-block prediction

Fig. 9 presents the entire pipelined schedule used to predict a 4 × 4 block for real-time operation. The fast algorithm circuit selects four modes for block prediction within 15 clock cycles. During the 16th cycle, the selected mode is sent to the configurable architecture to generate the reference block. Interpolation is used for predicted pixel generation according to
pixels of the previous top and right blocks. The circuit computes four pixels per cycle; therefore, the prediction time for one block mode is four cycles. To predict four modes, predicted pixel generation with boundary interpolation requires 16 cycles. Then, the predicted pixels are subtracted from the current coding pixels to obtain an SAD value. The comparator circuit determines the best mode, using the minimum SAD, in the 33rd clock cycle. The second block mode is obtained in the 49th cycle. Using this pipelining schedule, the architecture can select the 4 × 4 block modes within 16 cycles per block. The processing time totals 256 cycles for the coding of one MB.

Fig. 11 Architecture of 16 × 16 Y-block and 8 × 8 UV-block prediction


4.2 Architecture for 16 × 16 and 8 × 8 block prediction
H.264/AVC specifies the prediction of 16 × 16 and 8 × 8 blocks using the DC, horizontal, vertical and plane modes. The implementation of the plane mode invariably involves many clock cycles. To ensure a cost-effective design, this study adopted a time-sharing method to generate the predicted pixels for the Y plane mode, UV plane mode, Y DC mode and UV DC mode using a configurable architecture for the common circuit. Fig. 10 presents the pipelined schedule for computing the predicted pixels in these modes. Pixels can be directly copied from the boundary data in the horizontal and vertical modes without the need for computation. In the first 64 cycles, for the computation of Y and UV in the horizontal mode, the boundary pixels can be read directly from the temporary register to enable the parallel computation of the SAD. Meanwhile, the configurable architecture computes the predicted pixels of the DC and plane modes, as shown in Fig. 11. The FSM is used to control the configurable function for computing each mode, according to the schedule in Fig. 10. In the first 19 cycles, the DC values of the YUV blocks can all be computed and the results saved to registers. Pixels of the U, V and Y plane modes are then successively calculated using 39, 39 and 143 cycles, respectively. The symbol Pa is the basic parameter used to compute the values H, V, a, b and c, as defined in the H.264/AVC standard [6], which must be calculated before computing the pixels of the plane mode. Thus, we use only 240 cycles to compute the pixels of the two 8 × 8 chrominance blocks and one 16 × 16 luminance block for prediction in the DC and plane modes. Pixels in the DC mode can be computed for the SAD in the 65th cycle, because the DC values of the YUV components were previously computed during cycles 1–19. The predicted pixels in the plane mode can all be completed within the last 128 cycles, and their SAD values can be computed in the last 64 cycles.
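For completeness, the sketch below computes the 16 × 16 luminance plane-mode prediction, i.e. the H, V, a, b and c parameters and the predicted pixels as defined in the H.264/AVC standard [6]. It is a plain software model of the standard's equations, not the time-shared datapath of Fig. 11, and the 17-element boundary arrays are an interface assumption.

```c
#include <stdint.h>

static uint8_t clip_255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* 16x16 luma plane-mode prediction per the H.264/AVC standard.
 * top[] holds the corner pixel followed by the 16 reconstructed pixels
 * above the MB; left[] holds the corner followed by the 16 pixels to
 * its left (arithmetic right shift assumed, as in the standard).       */
void predict_plane_16x16(const uint8_t top[17], const uint8_t left[17],
                         uint8_t pred[16][16])
{
    int H = 0, V = 0;
    for (int i = 1; i <= 8; i++) {
        H += i * (top[8 + i]  - top[8 - i]);    /* horizontal gradient */
        V += i * (left[8 + i] - left[8 - i]);   /* vertical gradient   */
    }
    int a = 16 * (top[16] + left[16]);
    int b = (5 * H + 32) >> 6;
    int c = (5 * V + 32) >> 6;

    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            pred[y][x] = clip_255((a + b * (x - 7) + c * (y - 7) + 16) >> 5);
}
```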

Table 2 Gate count of each computation module

Fast algorithm core: 4766 gates
Y 4 × 4 mode prediction: 7690 gates
UV 8 × 8 and Y 16 × 16 mode prediction: 14 412 gates
Cost evaluation, system control: 1645 gates

5 VLSI implementation and comparisons

Table 3 compares the proposed architecture with those from the literature. Previous studies [23–27] used sequential schemes for YUV processing, which required a greater number of computational cycles. The architectures in [23, 24] employed modified full searches for intra-prediction; however, this approach is not particularly efficient. Wang et al. [25] used a fast 4 × 4, 16 × 16 (luma) and 8 × 8 (chroma) algorithm to design a cost-effective intra-predictor. However, this method selects the mode directly using edge information, without considering the real SAD (or SATD) or QP for H.264 coding. Coding performance drops because of the high prediction error. Specifically, the PSNR value was degraded by approximately 0.5 dB under high bit-rate coding, which is not conducive to high-quality HDTV systems.
This paper proposes a fast 4 × 4 algorithm [30] to improve coding efficiency. We therefore introduced a parallel architecture to process the Y and UV signals within the same time slot, to reduce the number of coding cycles. The processing time is only 256 cycles for the full intra-coder to process the 256 (Y) + 64 × 2 (UV) pixels for coding one MB. We defined the coding speed using

throughput rate-1 = pixel/cycle
throughput rate-2 = (cycle/second) × (pixel/cycle) × (MB/pixel) = MB/second    (2)

Based on this parallel architecture, we realised a high-speed circuit using Verilog HDL [33]. The prototype processor for intra-prediction was implemented using the Xilinx FPGA development software [34], mapped to an XC3S200A device. Test patterns using continuous blocks were sent to the circuit to verify the functionality of the chip. C code was first employed to determine the results of each computing unit according to our algorithm. The same patterns were sent to the hardware module to check whether the output met our expectations. The intra-prediction chip comprised all of the submodules, and its functionality was in agreement with the results of the C program. The architecture comprised four parts: the fast algorithm core, Y 4 × 4 mode prediction, Y 16 × 16/UV 8 × 8 mode prediction, and cost evaluation and system control, occupying approximately 17, 27, 51 and 6% of the entire chip area, respectively. The gate count of each module is listed in Table 2. The Verilog code was then sent to a workstation for cell-based design. The design was successfully synthesised to a gate-level circuit using SYNOPSYS synthesis tools under the TSMC 0.18 μm CMOS process. The total gate count was approximately 28 k and the silicon area covered approximately 2.8 mm². The maximum power dissipation was approximately 9.6 mW with the chip operating at 105 MHz.

Throughput rate-1 of the proposed architecture reaches 384/256 = 1.5 pixels per cycle. For throughput rate-2, the number of MBs processed is approximately (105 M × 1.5)/384 = 410 k per second at a frequency of 105 MHz. The throughput rate afforded by competing architectures is one sample per cycle or less, because they must process the 256 + 64 × 2 pixels of one MB sequentially, which requires additional processing cycles [23–26]. Although the proposed hardware overhead is increased by approximately 30%, the throughput rate is increased 3–5 times. Other solutions for high-speed intra-prediction have recently been proposed [27, 28]; however, their circuit complexity is extremely high, making them unsuitable for application in consumer products. Compared with existing chips, the proposed intra-predictor provides higher cost-throughput performance for high-speed HDTV using a core of reasonable size.
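As a quick check of the figures above, the snippet below recomputes both throughput rates from the per-MB sample count (256 Y + 128 UV = 384) and the 256-cycle budget at the reported 105 MHz clock.

```c
#include <stdio.h>

int main(void)
{
    const double samples_per_mb = 256 + 64 * 2;   /* Y + U + V = 384  */
    const double cycles_per_mb  = 256;            /* pipelined budget */
    const double clock_hz       = 105e6;          /* reported clock   */

    double rate1 = samples_per_mb / cycles_per_mb;     /* pixel/cycle */
    double rate2 = clock_hz * rate1 / samples_per_mb;  /* MB/s        */

    printf("throughput rate-1 = %.2f pixels/cycle\n", rate1);  /* 1.50   */
    printf("throughput rate-2 = %.0f kMB/s\n", rate2 / 1e3);   /* ~410 k */
    return 0;
}
```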
In a typical H.264 stream, intra-coded frames may occur at intervals of 8–12 frames. However, for the coding of P- and B-frames, the RDO method is used to find the best coding performance by checking all intra- and inter-modes in JM
[31]. The intra-block may be used for inter-frame coding.
Hence, real-time intra-prediction is required for the H.264
coding kernel. In addition, low-cost systems, such as digital
still cameras, can encode pictures continuously using
I-frame only, to economise on the frame memory.

Table 3 Comparisons with other architectures

Architecture | Algorithm | Average performanceᵃ (PSNR / bit rate) | YUV processing | Interpolated pixels per cycle | Subtractors for SAD | Pixel parallelism | Processing time for one MB (cycles) | Throughput rate-1 (pixel/cycle) | Throughput rate-2 (MB/s) | Tech. mapping | Gate count | Memory size (bit) | Maximum frequency | Applications
Huang et al. [23] | partial full search | −0.13 dB / +1.2% | sequential | 4 | 4 | 4 | 1280 | 0.3 | 42.2 k | 0.25 μm | 19 871 | 14 k | 54 MHz | SDTV (720 × 480/21 Hz, 4:2:0)
Ku et al. [24] | full search (plane-mode removal) | −0.04 dB / +0.32% | sequential | 4 | 4 | 4 | 1080 | 0.35 | 106.6 k | 0.18 μm | 19 246 | 16 k | 117 MHz | HDTV (1280 × 720/20 Hz, 4:2:0)
Wang et al. [25] | fast 4 × 4, 16 × 16 algorithm | −0.22 dB / +3.6% | sequential | — | — | 4 | 416 | 0.92 | 158.1 k | 0.18 μm | 10 302 | — | 66 MHz | HDTV (1280 × 720/30 Hz, 4:2:0)
Lin et al. [26] | modified three-step | −0.19 dB / +3.2% | sequential | — | — | 8 | 560 | 0.69 | 251.6 k | 0.13 μm | 19 870 | 8 k | 140 MHz | HDTV (1920 × 1080/21 Hz, 4:2:0)
Lin et al. [27] | transform domain | −0.06 dB / +1.2% | sequential | — | — | 8 | 620 | 0.62 | 161.5 k | 0.18 μm | 38 775 | 11 k | 100 MHz | HDTV (1280 × 720/30 Hz, 4:2:0)
Ren et al. [28] | — | — | sequential | — | — | 16 | 163 | 2.36 | 1.32 M | 0.13 μm | 60 945 | 12 k | 215 MHz | HDTV (1920 × 1080/110 Hz, 4:2:0)
Proposed | fast 4 × 4 algorithm; others, full search | −0.07 dB / +1.8% | parallel | 6 | 10 | 10 | 256 | 1.5 | 410 k | 0.18 μm | 28 513 | 5 k | 105 MHz | HDTV (1920 × 1080/35 Hz, 4:2:0)

ᵃ Average performance measured by QP (12, 20, 28, 36, 46) with five sequences.

6 Conclusions

This paper presents methods for fast intra-prediction based on the near pixel correlation algorithm. The proposed parallel architecture, with a pipelined schedule, overcomes the problem of data dependency between intra-prediction and intra-coding, which prevents idle cycles and improves the data rate. The architecture can predict the Y 4 × 4 modes, Y 16 × 16 modes and UV 8 × 8 modes within 256 cycles. A configurable computation kernel is used to compute the parameters of each mode under FSM control, to minimise hardware requirements. A specific timing schedule is arranged to efficiently allocate the plane-mode computing for Y 16 × 16 and UV 8 × 8, using common operators to reduce hardware requirements. This intra-prediction processor is capable of higher throughput and lower cost than other recent architectures, and is also capable of meeting the speed requirement of 1080 × 1920/35 Hz HDTV systems.

Acknowledgment

This work was supported by the National Science Council, Taiwan, under grant no. NSC94-2213-E-327-004.

References

1 Liou, M.: Overview of the p × 64 kbit/s video coding standard, Commun. ACM, 1991, 34, (4), pp. 59–63
2 ISO/IEC DIS 13818-2, MPEG-2 video coder
3 Cote, G., Erol, B., Kossentini, F.: H.263+: video coding at low bit-rate, IEEE Trans. Circuits Syst. Video Technol., 1998, 8, (7), pp. 849–866
4 Coding of audio-visual objects: video, MPEG-4, ISO/IEC JTC/SC29/WG11, January 1999
5 Wiegand, T., Sullivan, G., Bjontegaard, G., Luthra, A.: Overview of the H.264/AVC video coding standard, IEEE Trans. Circuits Syst. Video Technol., 2003, 13, (7), pp. 560–576
6 ITU-T Rec. H.264/ISO/IEC 14496-10 AVC, in Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, JVT-G050, 2003
7 Chen, Y.H., Chen, T.C., Tsai, C.Y., Tsai, S.F., Chen, L.G.: Algorithm and architecture design of power-oriented H.264/AVC baseline profile encoder for portable devices, IEEE Trans. Circuits Syst. Video Technol., 2009, 19, (8), pp. 1118–1128
8 Chen, Y.H., Cheng, C.C., Chuang, T.D., Chen, C.Y., Chien, S.Y., Chen, L.G.: Efficient architecture design of motion-compensated temporal filtering/motion compensated prediction engine, IEEE Trans. Circuits Syst. Video Technol., 2008, 18, (1), pp. 98–109
9 Wang, H., Kwong, S.: Rate-distortion optimization of rate control for H.264 with adaptive initial quantization parameter determination, IEEE Trans. Circuits Syst. Video Technol., 2008, 18, (1), pp. 140–144
10 Yi, Y., Song, B.C.: High-speed CAVLC encoder for 1080p 60-Hz H.264 codec, IEEE Signal Process. Lett., 2008, 15, pp. 891–894
11 Kim, T.J., Hong, J.E., Suh, J.W.: A fast intra mode skip decision algorithm based on adaptive motion vector map, IEEE Trans. Consum. Electron., 2009, 55, (1), pp. 179–184
12 Tseng, C., Wang, H., Yang, J.: Enhanced intra-4 × 4 mode decision for H.264/AVC coder, IEEE Trans. Circuits Syst. Video Technol., 2006, 16, (8), pp. 1027–1032
13 Tsai, A., Paul, A., Wang, J.C., Wang, J.F.: Intensity gradient technique for efficient intra-prediction in H.264/AVC, IEEE Trans. Circuits Syst. Video Technol., 2008, 18, (5), pp. 694–698
14 Li, H., Ngan, K., Wei, Z.: Fast and efficient method for block edge classification and its application in H.264/AVC video coding, IEEE Trans. Circuits Syst. Video Technol., 2008, 18, (6), pp. 756–768
15 Tsai, A., Wang, J.F., Yang, J.F., Lin, W.G.: Effective subblock-based and pixel-based fast direction detections for H.264 intra prediction, IEEE Trans. Circuits Syst. Video Technol., 2008, 18, (7), pp. 975–982
16 Bharanitharan, K., Liu, B.D., Yang, J.F., Tsai, W.C.: A low complexity detection of discrete cross differences for fast H.264/AVC intra prediction, IEEE Trans. Multimedia, 2008, 10, (7), pp. 1250–1260
17 Kim, B.: Fast selective intra-mode search algorithm based on adaptive thresholding scheme for H.264/AVC encoding, IEEE Trans. Circuits Syst. Video Technol., 2008, 18, (1), pp. 127–133
18 Wang, J.F., Wang, J.C., Chen, J.T., Tsai, A.C., Paul, A.: A novel fast algorithm for intra mode decision in H.264/AVC encoders. IEEE Symp. Circuits and Systems, 2006, pp. 3498–3501
19 Lee, W., Jung, Y., Lee, S., Kim, J.: High speed intra prediction scheme for H.264/AVC, IEEE Trans. Consum. Electron., 2007, 53, (4), pp. 1577–1582
20 Lee, J.Y., Park, H.W.: A fast mode decision method based on motion cost and intra prediction cost for H.264/AVC, IEEE Trans. Circuits Syst. Video Technol., 2012, 22, (3), pp. 393–402
21 Gabriellini, A., Flynn, D., Mrak, M., Davies, T.: Combined intra-prediction for high-efficiency video coding, IEEE J. Sel. Top. Signal Process., 2011, 5, (7), pp. 1282–1289
22 Pan, F., Lin, X., Rahardja, S., et al.: Fast mode decision algorithm for intraprediction in H.264/AVC video coding, IEEE Trans. Circuits Syst. Video Technol., 2005, 15, (7), pp. 813–822
23 Huang, Y.W., Hsieh, B.Y., Chen, T.C., Chen, L.G.: Analysis, fast algorithm and VLSI architecture design for H.264 intra frame coder, IEEE Trans. Circuits Syst. Video Technol., 2005, 15, (3), pp. 378–400
24 Ku, C.W., Cheng, C.C., Yu, G.S., Tsai, M.C., Chang, T.S.: A high-definition H.264/AVC intra-frame codec IP for digital video and still camera applications, IEEE Trans. Circuits Syst. Video Technol., 2006, 16, (8), pp. 917–928
25 Wang, J.C., Wang, J.F., Yang, J.F., Chen, J.T.: A fast mode decision algorithm and its VLSI design for H.264/AVC intra-prediction, IEEE Trans. Circuits Syst. Video Technol., 2007, 17, (10), pp. 1414–1422
26 Lin, Y.K., Ku, C.W., Li, D.W., Chang, T.S.: A 140-MHz 94 K gates HD1080p 30-frames/s intra-only profile H.264 encoder, IEEE Trans. Circuits Syst. Video Technol., 2009, 19, (3), pp. 432–436

27 Lin, H.Y., Wu, K.H., Liu, B.D., Yang, J.F.: An efficient VLSI architecture for transform-based intra prediction in H.264/AVC, IEEE Trans. Circuits Syst. Video Technol., 2010, 20, (6), pp. 894–906
28 Ren, H., Fan, Y., Chen, X., Zeng, X.: A 16-pixel parallel architecture with block-level/mode-level co-reordering approach for intra prediction in 4k×2k H.264/AVC video encoder. IEEE 17th Asia and South Pacific Design Automation Conf. (ASP-DAC), 2012, pp. 801–806
29 Lo, W.Y., Lun, D.P.-K., Siu, W.C., Wang, W., Song, J.: Improved SIMD architecture for high performance video processors, IEEE Trans. Circuits Syst. Video Technol., 2011, 21, (12), pp. 1769–1783
30 Hsia, S.C., Chou, Y.C.: Fast intra-prediction with near pixel correlation approach for H.264/AVC system, IET Image Process., 2008, 2, (4), pp. 185–193
31 H.264 video coding reference software. Available at: http://www.bs.hhi.de/~suehring/tml/download
32 Chen, K.H., Guo, J.I., Wang, J.S.: A high-performance direct 2-D transform coding IP design for MPEG-4 AVC/H.264, IEEE Trans. Circuits Syst. Video Technol., 2006, 16, (4), pp. 472–483
33 Palnitkar, S.: Verilog HDL (Prentice-Hall, NJ, 1996)
34 Xilinx, the field programmable gate array. Available at: http://www.xilinx.com
