Published in IET Circuits, Devices & Systems
Received on 9th March 2012
Revised on 28th May 2013
Accepted on 29th May 2013
doi: 10.1049/iet-cds.2013.0097
ISSN 1751-858X
Department of Electronics Engineering, National Yunlin University of Science and Technology, Yunlin, Taiwan
Department of Computer and Communication, National Kaohsiung First University of Science and Technology, Taiwan
E-mail: hsia@yuntech.edu.tw
Abstract: This study presents a parallel very-large-scale integration (VLSI) architecture for an intra-predictor based on a fast 4 × 4
algorithm. For real-time scheduling, the proposed algorithm overcomes the data dependency between intra-prediction and intra-coding, thereby improving coding performance and reducing the number of coding cycles. The high-speed architecture for intra-prediction includes configurable computation cores to process YUV components using 10-pixel parallelism. Prediction for one
macro-block (MB) (luminance: 4 × 4 and 16 × 16 block modes; chrominance: 8 × 8 block modes) can be completed
within 256 cycles. The proposed architecture achieves a throughput of 410 kMB/s, suitable for a 1920 × 1080/35 Hz 4:2:0 HDTV
encoder at a working frequency of 105 MHz.
Introduction
www.ietdl.org
4 block, each of which is assigned a number from 0 to 8:
vertical (0), horizontal (1), DC (2), diagonal down-left (3),
diagonal down-right (4), vertical-right (5), horizontal-down (6),
vertical-left (7) and horizontal-up (8).
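The mode numbering listed above can be written down as a small enumeration. This is only an illustrative sketch: the class name `Intra4x4Mode` and the member names are hypothetical labels chosen here, while the numeric values follow the list in the text.

```python
from enum import IntEnum


class Intra4x4Mode(IntEnum):
    """Intra 4x4 prediction modes, numbered as in the text (names are illustrative)."""
    VERTICAL = 0
    HORIZONTAL = 1
    DC = 2
    DIAGONAL_DOWN_LEFT = 3
    DIAGONAL_DOWN_RIGHT = 4
    VERTICAL_RIGHT = 5
    HORIZONTAL_DOWN = 6
    VERTICAL_LEFT = 7
    HORIZONTAL_UP = 8


# Look up a mode by its number.
mode_three = Intra4x4Mode(3)
```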
Intra-prediction processing includes the generation of
prediction pixels, sum of absolute differences (SAD)
operations and RD cost computation. To reduce this
computational load, this study applies the near pixel
correlation integration (NPCI) algorithm [30], which
analyses the features of each mode at low cost. Thus,
many of the modes can be rejected before the process
enters the H.264/AVC coding kernel.
The proposed NPCI takes advantage of the high degree of
correlation among predicted pixels along the prediction direction
by combining them to obtain a differential parameter. The
nine parameters, one per mode, are used to estimate the coding
cost of each mode using simple computations. The parameters
represent features of each block mode and allow candidates to be
identified without computing the block prediction or RD costs.
Finally, only the four modes with the smallest parameters are
selected as candidates to enter the H.264/AVC coding kernel,
which keeps the hardware design regular.
Details of the underlying theory and experimental
results can be found in our previous paper [30]. The
proposed fast algorithm for 4 × 4 block prediction
reduces processing complexity by approximately half
while maintaining coding performance. Full-search
prediction is employed for 16 × 16 luminance and 8 × 8
chrominance blocks. Previous results [30] show that
the proposed method achieves a lower bit rate than
other fast algorithms and a considerably higher speed than
comparable techniques, with only a negligible drop in video
quality.
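The candidate-selection step described above can be sketched as follows. The sketch assumes the nine differential parameters have already been computed (their formulas are given in [30] and are not reproduced here); the function name, the tie-breaking rule by mode number and the example values are all hypothetical.

```python
def select_candidate_modes(params, keep=4):
    """Return the `keep` mode numbers with the smallest differential parameters.

    `params[m]` is the parameter of mode m (0..8). Ties are broken by the
    lower mode number, which is an assumption made for this sketch.
    """
    if len(params) != 9:
        raise ValueError("expected one parameter per 4x4 mode (9 values)")
    ranked = sorted(range(9), key=lambda m: (params[m], m))
    return ranked[:keep]


# Invented example values, for illustration only.
example_params = [12, 40, 25, 7, 33, 18, 51, 9, 60]
candidates = select_candidate_modes(example_params)
```

Only the returned modes would then go through pixel prediction and cost evaluation; the remaining five are rejected beforehand.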
+ 1)th block. However, the (i, j)th block has not been reconstructed
at this time. To improve coding speed, the original pixels of
the (i, j)th block are employed for intra-prediction of the
(i, j + 1)th block. While the (i, j)th block is being coded, the
coding mode for the (i, j + 1)th block can be identified by
intra-prediction. In the next cycle, the right-side
pixels of the (i, j)th block can be read as the reference for
coding the (i, j + 1)th block and for the intra-prediction
computation of the (i, j + 2)th block. This schedule continues
to the (i, j + 3)th block, whereupon the four blocks in the
first row are completed. To process the second-row blocks,
the up-side pixels of the (i + 1, j), (i + 1, j + 1) and
(i + 1, j + 2)th blocks are updated using the bottom pixels of the
(i, j), (i, j + 1) and (i, j + 2)th blocks for the prediction of the
(i + 1, j), (i + 1, j + 1) and (i + 1, j + 2)th blocks.
In the prediction cycle, the fast algorithm selects the coding
mode using the original pixels rather than the reconstructed
pixels for the left-side prediction. This approach does not
violate coding rules, because the H.264 standard does not
specify the method used to select the
intra-prediction mode. For intra-coding, however, the reference
pixels must be the reconstructed pixels to ensure decoding
quality. For the coding of the (i, j + 1)th block, the (i, j)th
block was reconstructed in the previous cycle; thus, we can
employ pixels of the (i, j)th block as the reference for the
coding of the (i, j + 1)th block. During the process of
prediction, the left-side pixels from the previous block are
reconstructed and stored in the memory, using 10 cycles.
Throughout the coding cycle, the predicted pixels are
computed from the reconstructed pixels according to the
selected mode, whereupon the difference between the
predicted and original pixels is used for H.264 intra-coding.
The residual data are added to the reference pixels to update
the reconstructed pixels for each frame, and the pixels are
written to the frame memory. This pipeline scheduling enables the
reconstruction of pixels in the 4 × 4 blocks using
intra-coding within 10 cycles. The intra-coding cycle is
shorter than the time required for prediction (16 cycles).
Hence, the critical path lies in the intra-prediction core,
which dominates the processing time.
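The overlap between prediction and coding can be illustrated with a simplified timing model. The cycle counts (16 for prediction, 10 for coding) come from the text; everything else is an assumption of this sketch: blocks are predicted back to back, a block is coded as soon as its prediction has finished and the coding unit is free, and memory access effects are ignored. The function and constant names are hypothetical.

```python
PREDICT_CYCLES = 16   # prediction time per 4x4 block (from the text)
CODE_CYCLES = 10      # intra-coding time per 4x4 block (from the text)
BLOCKS_PER_MB = 16    # sixteen 4x4 luminance blocks in one macro-block


def schedule(num_blocks=BLOCKS_PER_MB):
    """Return (block, predict_start, predict_end, code_start, code_end) tuples."""
    timeline = []
    coder_free_at = 0
    for k in range(num_blocks):
        p_start = k * PREDICT_CYCLES
        p_end = p_start + PREDICT_CYCLES
        # Coding of block k overlaps the prediction of block k + 1.
        c_start = max(p_end, coder_free_at)
        c_end = c_start + CODE_CYCLES
        coder_free_at = c_end
        timeline.append((k, p_start, p_end, c_start, c_end))
    return timeline


timeline = schedule()
prediction_done = timeline[-1][2]   # cycle at which the last prediction ends
```

In this model the coder always finishes a block before the next prediction ends, so the prediction stage is never stalled and sets the pace, in line with the statement that the intra-prediction core dominates the processing time.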
$$ f = \mathrm{SADs} + 4P\lambda(Q_p) \qquad (1) $$
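The cost in (1) can be evaluated as in the following sketch. The interpretation of the symbols is an assumption here, since the surrounding definitions are not part of this excerpt: P is taken as a 0/1 flag (0 when the mode equals the most probable mode, as in the H.264 reference software convention) and the Lagrangian value λ(Qp) is supplied by the caller rather than derived. Function names are hypothetical.

```python
def sad(original, predicted):
    """Sum of absolute differences between two equally long pixel sequences."""
    if len(original) != len(predicted):
        raise ValueError("pixel sequences must have the same length")
    return sum(abs(o - p) for o, p in zip(original, predicted))


def mode_cost(original, predicted, p_flag, lam):
    """Cost f = SAD + 4 * P * lambda(Qp), with lambda(Qp) passed in as `lam`."""
    return sad(original, predicted) + 4 * p_flag * lam


# Invented pixel values, for illustration only.
orig_px = [10, 12, 14, 16]
pred_px = [11, 12, 13, 18]
cost_other_mode = mode_cost(orig_px, pred_px, p_flag=1, lam=3)
cost_probable_mode = mode_cost(orig_px, pred_px, p_flag=0, lam=3)
```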
$$ \mathrm{DDL} = \bigl|[D - P(3,0)] + [E - P(1,2)] + [E - P(2,1)] + [F - P(0,3)]\bigr| \ \ 1 $$
$$ \mathrm{DDR} = \bigl|[A - P(0,0)] + [X - P(1,1)] + [X - P(2,2)] + [I - P(3,3)]\bigr| \ \ 1 $$
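The two diagonal expressions can be sketched in code under explicit assumptions: each bracket is read as a difference between a boundary pixel and a block pixel (consistent with the "differential parameter" description), P(x, y) is taken as the pixel in column x and row y, and any scaling applied after the absolute value is left out. The function name and the dictionary of boundary pixels are hypothetical.

```python
def diagonal_parameters(block, refs):
    """Return (ddl, ddr) absolute sums for a 4x4 block.

    block[y][x] holds P(x, y); refs maps boundary pixel names
    ("A", "D", "E", "F", "I", "X") to their values.
    """
    def P(x, y):
        return block[y][x]

    ddl = abs((refs["D"] - P(3, 0)) + (refs["E"] - P(1, 2))
              + (refs["E"] - P(2, 1)) + (refs["F"] - P(0, 3)))
    ddr = abs((refs["A"] - P(0, 0)) + (refs["X"] - P(1, 1))
              + (refs["X"] - P(2, 2)) + (refs["I"] - P(3, 3)))
    return ddl, ddr


# Invented values: a flat block with slightly different boundary pixels.
flat_block = [[100] * 4 for _ in range(4)]
boundary = {"A": 98, "X": 98, "I": 98, "D": 104, "E": 104, "F": 104}
ddl, ddr = diagonal_parameters(flat_block, boundary)
```

Because both pixel sets lie on a diagonal of the block, the result does not depend on whether P(x, y) is read as (column, row) or (row, column).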
4.2 Architecture for 16 × 16 and 8 × 8 block prediction
H.264/AVC specifies the prediction of 16 × 16 and 8 × 8
blocks using DC, horizontal, vertical and plane modes. The
implementation of plane mode invariably requires many
clock cycles. To ensure a cost-effective design, this study
adopted a time-sharing method to generate the predicted
pixels for Y-plane mode, UV-plane mode, YDC mode and
UVDC mode using a configurable architecture for the
common circuit. Fig. 10 presents the pipelined schedule for
computing the predicted pixels in these modes. In horizontal and
vertical modes, pixels can be copied directly from the boundary
data without the need for computation. In the first
64 cycles, for the computation of Y and UV in horizontal
mode, the boundary pixels can be read directly from the
temporary register to enable the parallel computation of the
SAD. Meanwhile, the configurable architecture computes
the predicted pixels of DC and plane modes, as shown in
Fig. 11. The FSM controls the configurable
function for computing each mode, according to the
schedule in Fig. 10. In the first 19 cycles, the DC values of
the YUV blocks are all computed and the results saved to
registers. Pixels of the U-, V- and Y-plane modes are then
successively calculated using 39, 39 and 143 cycles,
respectively. The symbol Pa is the basic parameter used to
compute the values H, V, a, b and c, as defined in the H.264/
AVC standard [6], which must be calculated before
computing pixels of the plane mode. Thus, only 240
cycles (19 + 39 + 39 + 143) are needed to compute the pixels
of two 8 × 8 chrominance blocks
and one 16 × 16 luminance block for prediction in DC and
plane modes. Pixels in DC mode can be computed for the
SAD in the 65th cycle, because the DC values of the YUV
components were previously computed during cycles 1–19.
The predicted pixels in plane mode can all be completed
within the last 128 cycles, and their SAD values can be
computed in the last 64 cycles.
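The cycle budget quoted above can be checked with simple bookkeeping. The phase lengths and the 256-cycle macro-block budget are taken from the text; the phase labels and variable names are only illustrative, and the phases are assumed to run strictly one after another.

```python
MB_CYCLE_BUDGET = 256  # cycles available for one macro-block (from the abstract)

# (phase label, cycles) in the order given in the text.
phases = [
    ("DC values of Y, U and V", 19),
    ("U plane mode", 39),
    ("V plane mode", 39),
    ("Y plane mode", 143),
]

# Cumulative cycle at which each phase finishes.
finish_cycles = []
elapsed = 0
for label, cycles in phases:
    elapsed += cycles
    finish_cycles.append((label, elapsed))

total_cycles = elapsed
slack = MB_CYCLE_BUDGET - total_cycles
```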
| Module | Gate count |
|---|---|
| Fast algorithm core | 4766 |
| Y 4 × 4 mode prediction | 7690 |
| UV 8 × 8 and Y 16 × 16 mode prediction | 14 412 |
| Cost evaluation, system control | 1645 |
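As a quick consistency check, the four module gate counts can be summed and compared with the total gate count reported for the proposed design in Table 3 (28 513). The pairing of module names with counts assumes they are listed in the same order; the dictionary keys are shortened labels chosen for this sketch.

```python
# Gate counts per module, paired in listing order (an assumption).
module_gates = {
    "fast_algorithm_core": 4766,
    "y_4x4_prediction": 7690,
    "uv_8x8_and_y_16x16_prediction": 14412,
    "cost_evaluation_and_control": 1645,
}

total_gates = sum(module_gates.values())

# Share of the total taken by each module, in percent.
gate_share = {name: 100.0 * count / total_gates
              for name, count in module_gates.items()}
```

Under this pairing, the 16 × 16 and 8 × 8 predictor accounts for roughly half of the total.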
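$$ \text{throughput rate-1} = \frac{\text{pixel}}{\text{cycle}} $$

$$ \text{throughput rate-2} = \frac{\text{cycle}}{\text{second}} \times \frac{\text{pixel}}{\text{cycle}} \times \frac{\text{MB}}{\text{pixel}} = \frac{\text{MB}}{\text{second}} \qquad (2) $$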
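The two throughput measures in (2) can be reproduced for the proposed design from figures given in the paper: 256 cycles per macro-block and a 105 MHz clock. The sketch assumes 4:2:0 sampling, so one macro-block holds 16 × 16 luminance pixels plus two 8 × 8 chrominance blocks (384 pixels); the function name is hypothetical.

```python
PIXELS_PER_MB = 16 * 16 + 2 * 8 * 8  # 384 pixels per macro-block in 4:2:0


def throughput(cycles_per_mb, freq_hz):
    """Return (rate-1 in pixel/cycle, rate-2 in MB/s)."""
    rate1 = PIXELS_PER_MB / cycles_per_mb
    # (cycle/second) x (pixel/cycle) x (MB/pixel) = MB/second
    rate2 = freq_hz * rate1 / PIXELS_PER_MB
    return rate1, rate2


rate1, rate2 = throughput(cycles_per_mb=256, freq_hz=105e6)
```

With these inputs the sketch gives 1.5 pixel/cycle and about 410 kMB/s, matching the values stated for the proposed architecture.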
Table 3 Comparison with other architectures

| | Huang et al. [23] | Ku et al. [24] | Wang et al. [25] | Lin et al. [26] | Lin et al. [27] | Ren et al. [28] | Proposed |
|---|---|---|---|---|---|---|---|
| Algorithm | Partial full search | Full search (plane-mode removal) | Fast 4 × 4, 16 × 16 algorithm | Modified three step | Transform domain | – | Fast 4 × 4 algorithm; others, full search |
| Average performance (a): PSNR | 0.13 dB | 0.04 dB | 0.22 dB | 0.19 dB | 0.06 dB | – | 0.07 dB |
| Average performance (a): bit rate | +1.2% | +0.32% | +3.6% | +3.2% | +1.2% | – | +1.8% |
| YUV processing | sequential | sequential | sequential | sequential | sequential | sequential | parallel |
| Interpolated pixels (per cycle) | 4 | 4 | – | – | – | – | 6 |
| Number of subtractors for SAD | 4 | 4 | – | – | – | – | 10 |
| Pixel parallelism | 4 | 4 | 4 | 8 | 8 | 16 | 10 |
| Processing time for one MB (cycles) | 1280 | 1080 | 416 | 560 | 620 | 163 | 256 |
| Throughput rate-1 (pixel/cycle) | 0.3 | 0.35 | 0.92 | 0.69 | 0.62 | 2.36 | 1.5 |
| Throughput rate-2 (MB/s) | 42.2 k | 106.6 k | 158.1 k | 251.6 k | 161.5 k | 1.32 M | 410 k |
| Technology mapping | 0.25 µm | 0.18 µm | 0.18 µm | 0.13 µm | 0.18 µm | 0.13 µm | 0.18 µm |
| Gate count | 19 871 | 19 246 | 10 302 | 19 870 | 38 775 | 60 945 | 28 513 |
| Memory size (bit) | 14 k | 16 k | – | 8 k | 11 k | 12 k | 5 k |
| Maximum frequency | 54 MHz | 117 MHz | 66 MHz | 140 MHz | 100 MHz | 215 MHz | 105 MHz |
| Applications | SDTV (720 × 480/21 Hz, 4:2:0) | HDTV (1280 × 720/20 Hz, 4:2:0) | HDTV (1280 × 720/30 Hz, 4:2:0) | HDTV (1920 × 1080/21 Hz, 4:2:0) | HDTV (1280 × 720/30 Hz, 4:2:0) | HDTV (1920 × 1080/110 Hz, 4:2:0) | HDTV (1920 × 1080/35 Hz, 4:2:0) |

(a) Average performance measured at QP = 12, 20, 28, 36 and 46 with five sequences
Conclusions
Acknowledgment
References
7 Chen, Y.H., Chen, T.C., Tsai, C.Y., Tsai, S.F., Chen, L.G.: 'Algorithm and architecture design of power-oriented H.264/AVC baseline profile encoder for portable devices', IEEE Trans. Circuits Syst. Video Technol., 2009, 19, (8), pp. 1118–1128
8 Chen, Y.H., Cheng, C.C., Chuang, T.D., Chen, C.Y., Chien, S.Y., Chen, L.G.: 'Efficient architecture design of motion-compensated temporal filtering/motion compensated prediction engine', IEEE Trans. Circuits Syst. Video Technol., 2008, 18, (1), pp. 98–109
9 Wang, H., Kwong, S.: 'Rate-distortion optimization of rate control for H.264 with adaptive initial quantization parameter determination', IEEE Trans. Circuits Syst. Video Technol., 2008, 18, (1), pp. 140–144
10 Yi, Y., Song, B.C.: 'High-speed CAVLC encoder for 1080p 60-Hz H.264 codec', IEEE Signal Process. Lett., 2008, 15, pp. 891–894
11 Kim, T.J., Hong, J.E., Suh, J.W.: 'A fast intra mode skip decision algorithm based on adaptive motion vector map', IEEE Trans. Consum. Electron., 2009, 55, (1), pp. 179–184
12 Tseng, C., Wang, H., Yang, J.: 'Enhanced intra-4 × 4 mode decision for H.264/AVC coder', IEEE Trans. Circuits Syst. Video Technol., 2006, 16, (8), pp. 1027–1032
13 Tsai, A., Paul, A., Wang, J.C., Wang, J.F.: 'Intensity gradient technique for efficient intra-prediction in H.264/AVC', IEEE Trans. Circuits Syst. Video Technol., 2008, 18, (5), pp. 694–698
14 Li, H., Ngan, K., Wei, Z.: 'Fast and efficient method for block edge classification and its application in H.264/AVC video coding', IEEE Trans. Circuits Syst. Video Technol., 2008, 18, (6), pp. 756–768
15 Tsai, A., Wang, J.F., Yang, J.F., Lin, W.G.: 'Effective subblock-based and pixel-based fast direction detections for H.264 intra prediction', IEEE Trans. Circuits Syst. Video Technol., 2008, 18, (7), pp. 975–982
16 Bharanitharan, K., Liu, B.D., Yang, J.F., Tsai, W.C.: 'A low complexity detection of discrete cross differences for fast H.264/AVC intra prediction', IEEE Trans. Multimedia, 2008, 10, (7), pp. 1250–1260
17 Kim, B.: 'Fast selective intra-mode search algorithm based on adaptive thresholding scheme for H.264/AVC encoding', IEEE Trans. Circuits Syst. Video Technol., 2008, 18, (1), pp. 127–133
18 Wang, J.F., Wang, J.C., Chen, J.T., Tsai, A.C., Paul, A.: 'A novel fast algorithm for intra mode decision in H.264/AVC encoders'. IEEE Symp. Circuits and System, 2006, pp. 3498–3501
19 Lee, W., Jung, Y., Lee, S., Kim, J.: 'High speed intra prediction scheme for H.264/AVC', IEEE Trans. Consum. Electron., 2007, 53, (4), pp. 1577–1582
20 Lee, J.Y., Park, H.W.: 'A fast mode decision method based on motion cost and intra prediction cost for H.264/AVC', IEEE Trans. Circuits Syst. Video Technol., 2012, 22, (3), pp. 393–402
21 Gabriellini, A., Flynn, D., Mrak, M., Davies, T.: 'Combined intra-prediction for high-efficiency video coding', IEEE J. Sel. Top. Signal Process., 2011, 5, (7), pp. 1282–1289
22 Pan, F., Lin, X., Rahardja, S., et al.: 'Fast mode decision algorithm for intraprediction in H.264/AVC video coding', IEEE Trans. Circuits Syst. Video Technol., 2005, 15, (7), pp. 813–822
23 Huang, Y.W., Hsieh, B.Y., Chen, T.C., Chen, L.G.: 'Analysis, fast algorithm and VLSI architecture design for H.264 Intra frame coder', IEEE Trans. Circuits Syst. Video Technol., 2005, 15, (3), pp. 378–400
24 Ku, C.W., Cheng, C.C., Yu, G.S., Tsai, M.C., Chang, T.S.: 'A high-definition H.264/AVC intra-frame codec IP for digital video and still camera applications', IEEE Trans. Circuits Syst. Video Technol., 2006, 16, (8), pp. 917–928
25 Wang, J.C., Wang, J.F., Yang, J.F., Chen, J.T.: 'A fast mode decision algorithm and its VLSI design for H.264/AVC intra-prediction', IEEE Trans. Circuits Syst. Video Technol., 2007, 17, (10), pp. 1414–1422
26 Lin, Y.K., Ku, Ch.W., Li, D.W., Chang, T.S.: 'A 140-MHz 94 K gates HD1080p 30-frames/s intra-only profile H.264 encoder', IEEE Trans. Circuits Syst. Video Technol., 2009, 19, (3), pp. 432–436
27 Lin, H.Y., Wu, K.H., Liu, B.D., Yang, J.F.: 'An efficient VLSI architecture for transform-based intra prediction in H.264/AVC', IEEE Trans. Circuits Syst. Video Technol., 2010, 20, (6), pp. 894–906
28 Ren, H., Fan, Y., Chen, X., Zeng, X.: 'A 16-pixel parallel architecture with block-level/mode-level co-reordering approach for intra prediction in 4k × 2k H.264/AVC video encoder'. IEEE 17th Asia and South Pacific Design Automation Conf. (ASP-DAC), 2012, pp. 801–806
29 Lo, W.Y., Lun, D.P.-K., Siu, W.C., Wang, W., Song, J.: 'Improved SIMD architecture for high performance video processors', IEEE Trans. Circuits Syst. Video Technol., 2011, 21, (12), pp. 1769–1783
30 Hsia, S.C., Chou, Y.C.: 'Fast intra-prediction with near pixel correlation approach for H.264/AVC system', IET Image Process., 2008, 2, (4), pp. 185–193
31 H.264 video coding reference software. Available at: http://www.bs.hhi.de/~suehring/tml/download
32 Chen, K.H., Guo, J.I., Wang, J.S.: 'A high-performance direct 2-D transform coding IP design for MPEG-4 AVC/H.264', IEEE Trans. Circuits Syst. Video Technol., 2006, 16, (4), pp. 472–483
33 Palnitkar, S.: 'Verilog HDL' (Prentice-Hall, NJ 07458, 1996)
34 Xilinx, the field programming gate array, Web: http://www.xilinx.com