2010 International Conference on Intelligent Computation Technology and Automation
A Pipelining Hardware Implementation of H.264 Based on FPGA
Sun Song, Qi Haibing

School of Electric and Electronic Information Engineering, Huangshi Institute of Technology, Huangshi 435003, China
gezh@qq.com
Abstract: A two-dimensional discrete cosine transform (DCT) module for the JPEG image compression system is designed. To balance resource usage against speed on the FPGA chip, two identical 1D-DCT modules are reused to complete the FPGA design of the 2D-DCT. The pipelining levels in the module are also analyzed and optimized. Simulation and test results for the whole system on an EP1C6Q240C8 device show that it performs the integer DCT of a 4×4 block in twelve clock cycles with a resource consumption rate of 10%. The work offers an exploratory attempt and a useful reference for JPEG encoder IP core design and its FPGA implementation.

Keywords: JPEG; Discrete Cosine Transform; Pipelining level; FPGA

I. INTRODUCTION
The H.264 digital coding standard, developed by the JVT (Joint Video Team) formed by the ITU-T VCEG (Video Coding Experts Group) and the ISO/IEC MPEG (Moving Picture Experts Group), is aimed at high-quality coding of video content at very low bit rates [1]. The DCT plays a key role in several image compression standards, including JPEG for still picture compression [2]. It transforms a signal or image from the spatial domain to the frequency domain and is similar to the DFT (Discrete Fourier Transform). One primary advantage of the DCT over the DFT is that the DCT involves only real multiplications, which reduces the total number of multiplications required. The other advantage is that for most images much of the signal energy lies at low frequencies, and the high-frequency coefficients are often small enough to be neglected with little visible distortion. For image data, the DCT concentrates energy into the lower-order coefficients better than the DFT does. However, the traditional DCT, when used to transform intra-frame or inter-frame prediction residuals, requires a complex hardware design for floating-point calculation, and rounding errors then cause a mismatch between the encoder and the decoder. To overcome these shortcomings, the DCT in the H.264 standard is modified so that it can be realized with integer additions, subtractions and shift operations. As a result, if quantization is not considered, the decoder can recover the input exactly at the cost of a slight decline in compression performance.
Avoiding multiplications in the integer DCT of H.264 greatly reduces the computational complexity and keeps the transform results exact. However, the integer DCT is still a computation-intensive process. The FPGA (Field Programmable Gate Array) architecture is well suited to parallel and pipelined processing, and the trade-off between resources and speed is an important factor in FPGA design. In this paper, the FPGA-based 2D-DCT (two-dimensional DCT) module is built from two identical 1D-DCT modules and optimized through an analysis of the pipelining levels. A group of input data is used to verify the algorithm and the hardware design and to show that the design can meet the needs of a JPEG compression system.

II. THEORY OF INTEGRAL TRANSFORMATION FOR H.264
The 2D-DCT can be broken into two sequential 1D-DCT operations by employing the row-column decomposition, one along the row vectors and the second along the column vectors of the preceding row results [3, 4]. The N-point 1D-DCT can be expressed as
\[ y_k = C_k \sum_{n=0}^{N-1} x_n \cos\frac{(2n+1)k\pi}{2N} \tag{1} \]
and the N-point 1D-IDCT is expressed as
\[ x_n = \sum_{k=0}^{N-1} C_k\, y_k \cos\frac{(2n+1)k\pi}{2N} \tag{2} \]
where $x_n$ is the nth item of the input sequence in the time domain and $y_k$ is the kth item of the output sequence in the frequency domain. The coefficient $C_k$ is defined as
\[ C_k = \begin{cases} \sqrt{1/N}, & k = 0 \\ \sqrt{2/N}, & k = 1, 2, \ldots, N-1 \end{cases} \]
The 2D-DCT of an N×N image block can be computed as a 1D-DCT over the rows of the block followed by a 1D-DCT over the columns [5]:
\[ Y_{mn} = C_m C_n \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} X_{ij} \cos\frac{(2j+1)n\pi}{2N} \cos\frac{(2i+1)m\pi}{2N} \tag{3} \]
\[ X_{ij} = \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} C_m C_n Y_{mn} \cos\frac{(2j+1)n\pi}{2N} \cos\frac{(2i+1)m\pi}{2N} \tag{4} \]
where $X_{ij}$ is the prediction residual in row i and column j of the block and $Y_{mn}$ is the DCT coefficient in row m and column n of the coefficient matrix Y. In matrix form, the transform pair is
\[ Y = A X A^{T} \tag{5} \]
\[ X = A^{T} Y A \tag{6} \]
where the element of the transform matrix A for the N×N block is
978-0-7695-4077-1/10 $26.00 © 2010 IEEE
DOI 10.1109/ICICTA.2010.401
\[ A_{ij} = C_i \cos\frac{(2j+1)i\pi}{2N} \tag{7} \]

Fig. 1. The module of 1D-DCT

For the 2D-DCT of a 4×4 image block, such as a luminance or chrominance block, the corresponding transform matrix A is
\[ A = \begin{bmatrix}
\frac{1}{2}\cos(0) & \frac{1}{2}\cos(0) & \frac{1}{2}\cos(0) & \frac{1}{2}\cos(0) \\
\sqrt{\frac{1}{2}}\cos\left(\frac{\pi}{8}\right) & \sqrt{\frac{1}{2}}\cos\left(\frac{3\pi}{8}\right) & \sqrt{\frac{1}{2}}\cos\left(\frac{5\pi}{8}\right) & \sqrt{\frac{1}{2}}\cos\left(\frac{7\pi}{8}\right) \\
\sqrt{\frac{1}{2}}\cos\left(\frac{2\pi}{8}\right) & \sqrt{\frac{1}{2}}\cos\left(\frac{6\pi}{8}\right) & \sqrt{\frac{1}{2}}\cos\left(\frac{10\pi}{8}\right) & \sqrt{\frac{1}{2}}\cos\left(\frac{14\pi}{8}\right) \\
\sqrt{\frac{1}{2}}\cos\left(\frac{3\pi}{8}\right) & \sqrt{\frac{1}{2}}\cos\left(\frac{9\pi}{8}\right) & \sqrt{\frac{1}{2}}\cos\left(\frac{15\pi}{8}\right) & \sqrt{\frac{1}{2}}\cos\left(\frac{21\pi}{8}\right)
\end{bmatrix} \tag{8} \]
With $a = \frac{1}{2}$, $b = \sqrt{\frac{1}{2}}\cos\left(\frac{\pi}{8}\right)$ and $c = \sqrt{\frac{1}{2}}\cos\left(\frac{3\pi}{8}\right)$, the transform matrix A becomes
\[ A = \begin{bmatrix} a & a & a & a \\ b & c & -c & -b \\ a & -a & -a & a \\ c & -b & b & -c \end{bmatrix} \tag{9} \]
Clearly a, b and c are real numbers, while the elements of the block X are integers. With a real-valued DCT, rounding errors in the floating-point arithmetic cause a mismatch between the encoder and the decoder. H.264 uses more prediction than other image coding schemes, and even its intra coding mode depends on spatial prediction, so H.264 is very sensitive to prediction drift. Following the integer DCT approach, we adjust the elements of matrix A to integers, which reduces the computation cost without loss of image accuracy. Equation (5) is equivalent to
\[ Y = (C X C^{T}) \otimes E = \left( \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & d & -d & -1 \\ 1 & -1 & -1 & 1 \\ d & -1 & 1 & -d \end{bmatrix} X \begin{bmatrix} 1 & 1 & 1 & d \\ 1 & d & -1 & -1 \\ 1 & -d & -1 & 1 \\ 1 & -1 & 1 & -d \end{bmatrix} \right) \otimes \begin{bmatrix} a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \\ a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \end{bmatrix} \tag{10} \]
where $d = c/b$ (≈ 0.414) and $\otimes$ denotes multiplication of each element of $(C X C^{T})$ by the corresponding element of the matrix E. In general, d is chosen as 0.5 to simplify the calculation, and b is then amended to $b = \sqrt{2/5}$ to keep the transform orthogonal. In addition, the second and fourth rows of the matrix C and the second and fourth columns of the matrix $C^{T}$ are multiplied by 2, so that the matrix E turns into the matrix $E_f$:
\[ Y = (C_f X C_f^{T}) \otimes E_f = \left( \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix} X \begin{bmatrix} 1 & 2 & 1 & 1 \\ 1 & 1 & -1 & -2 \\ 1 & -1 & -1 & 2 \\ 1 & -2 & 1 & -1 \end{bmatrix} \right) \otimes \begin{bmatrix} a^2 & \frac{ab}{2} & a^2 & \frac{ab}{2} \\ \frac{ab}{2} & \frac{b^2}{4} & \frac{ab}{2} & \frac{b^2}{4} \\ a^2 & \frac{ab}{2} & a^2 & \frac{ab}{2} \\ \frac{ab}{2} & \frac{b^2}{4} & \frac{ab}{2} & \frac{b^2}{4} \end{bmatrix} \tag{11} \]
Here each element is multiplied only once by the corresponding element of $E_f$, and this multiplication can be merged into the quantization step. The expression $(C_f X C_f^{T})$ therefore contains only integer additions, subtractions and shifts, and the actual output of the integer DCT is
\[ W = C_f X C_f^{T} \tag{12} \]
As mentioned above, the matrix multiplication can be turned into two integer 1D-DCT shift-and-add transforms, so the integer 2D-DCT of H.264 can be divided into two steps. First, each column of the image or residual block is transformed by a one-dimensional DCT. Second, each row of the first-step results is transformed by a one-dimensional DCT. The integer 2D-DCT is thus realized with integer 1D-DCTs, and the butterfly calculation of Table I reduces the multiplication cost and the computation time.

TABLE I. BUTTERFLY PROCESSING
Type      Input value   Operation process      Output value
Positive  x[0], x[1],   M[0] = x[0] + x[3]     X[0] = M[0] + M[1]
          x[2], x[3]    M[3] = x[0] - x[3]     X[2] = M[0] - M[1]
                        M[1] = x[1] + x[2]     X[1] = M[2] + (M[3] << 1)
                        M[2] = x[1] - x[2]     X[3] = M[3] - (M[2] << 1)
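The butterfly of Table I can be checked against the integer core matrix of (11). The following Python sketch is illustrative only and is not the paper's Verilog design; the names `CF`, `forward_butterfly` and `matrix_dct` are hypothetical and introduced here for the comparison.

```python
# Integer core matrix C_f, as written in (11).
CF = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def forward_butterfly(x):
    """Forward 4-point integer DCT using the add/shift steps of Table I."""
    # First stage: four additions/subtractions.
    m0 = x[0] + x[3]
    m3 = x[0] - x[3]
    m1 = x[1] + x[2]
    m2 = x[1] - x[2]
    # Second stage: four additions/subtractions and two left shifts.
    return [
        m0 + m1,          # X[0]
        m2 + (m3 << 1),   # X[1]
        m0 - m1,          # X[2]
        m3 - (m2 << 1),   # X[3]
    ]


def matrix_dct(x):
    """The same transform written as the matrix-vector product C_f * x."""
    return [sum(c * v for c, v in zip(row, x)) for row in CF]


if __name__ == "__main__":
    for sample in ([5, 11, 8, 10], [1, 5, 9, 13], [-255, 255, -255, 255]):
        # Both forms must agree for every input vector.
        assert forward_butterfly(sample) == matrix_dct(sample)
        print(sample, "->", forward_butterfly(sample))
```

In this form each 4-point transform costs eight additions or subtractions and two shifts, with no multiplier, which is the property the hardware module relies on.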
In Table I, M[0], M[1], M[2] and M[3] are the intermediate variables generated by the one-dimensional DCT, the type "positive" is the DCT and the type "negative" is the IDCT, <<1 stands for a left shift by one bit, and >>1 stands for a right shift by one bit.
According to the theory of the integer transform for H.264, we can design a module that calculates the integer 1D-DCT of an input vector. The integer 2D-DCT is then realized by calling the 1D-DCT module twice.
The integer 1D-DCT module is shown in Fig. 1. The source data are pixel values, which enter the integer 1D-DCT module through DATA_in_0 to DATA_in_3. These four 16-bit ports, each taking one element of a column or row of the 4×4 block, form the input of the 2D-DCT. The data immediately enter the first stage of addition and subtraction, and the results of the modules ADDSUB_1 and ADDSUB_2 feed the shift register. Next, the outputs of the shift register and of the first add/subtract stage go to the modules ADDSUB_3, SUB_1 and ADD_1 for the second stage, whose results are the coefficients of the one-dimensional integer DCT. Finally, the coefficients are sent to the output ports DATA_OUT_0 to DATA_OUT_3.
The 2D-DCT of the input signal can be realized by applying the 1D-DCT twice. The key question is whether to use two pipelined 1D-DCT modules or one reused 1D-DCT module. If one 1D-DCT module is reused twice, the gate count (area or resource) is reduced, but the speed is lower than with two pipelined 1D-DCT modules. Conversely, with two pipelined 1D-DCT modules the speed can be doubled, but the area is larger than with one reused module. As the resources consumed by two 1D-DCT modules are not excessive, the 2D-DCT is realized with two 1D-DCT IP cores, as shown in Fig. 2.

Fig. 2. 2D-DCT

III. IMPLEMENTATION OF H.264 BASED ON FPGA
According to (12), the kernel matrix of the integer DCT is the matrix C_f. The sum of the absolute values of the elements in any row of this matrix is at most 6, so the maximum gain is 36 (6^2) after a video block is transformed by the 2D-DCT. Since log2(36) = 5.17, the output needs 6 more bits than the original input data. For example, a 9-bit input (i.e. input data in the range -255 to +255) turns into 15-bit output data, so the 16-bit width used in the module cannot overflow for 9-bit input data.
Loading a macro-block costs only 4 cycles when the 4×4 block is input in parallel along the column and serially along the row, as shown in Fig. 3. The process can equally use parallel input along the row and serial input along the column, again in 4 cycles. Because the four data items are processed separately, the whole column (row) transformation completes 1 cycle after the input data have finished. The first 1D-DCT calculation therefore costs 5 cycles.

Fig. 3. Pipelining process of 1D-DCT

A group of 16-bit intermediate registers, organized as a 4×4 matrix, is placed in the module. After the intermediate variables from the first operation are written into the registers by row (column), the data enter the second integer DCT, and its result is the output of the two-dimensional integer DCT module. This second calculation costs 7 cycles in total.
To sum up, the 2D-DCT module forms a 5-level pipeline. The first level performs the add and shift operations on the column or row input data. The second level operates on the result of the first level. The third level stores the intermediate variables. The fourth level reads the intermediate variables and performs the add and shift operations. The last level operates on the result of the fourth level and outputs the final result. The pipelining process is shown in Fig. 4.
Following the IP core reuse approach, the 2D-DCT is realized with two 1D-DCT IP cores. As the output width of the first 1D-DCT is 11 bits, the input width of the second 1D-DCT is designed as 11 bits. The 2D-DCT module is shown in Fig. 4. It is written in Verilog using Altera's Quartus II 7.2 software, and the target device for simulation is the EP1C6Q240C8. The interface signals are:
CLK: system clock
RST: system reset
din: input data
dct_2d: output data
rdy_out: actual output data
The test input din, applied through DATA_in_0 to DATA_in_3, is the 4×4 block listed with the simulation results.
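As a software cross-check of the two-pass structure and of the bit-width argument in this section, the sketch below models the first 1D-DCT pass, the 4×4 intermediate register and the second pass in Python. It is a behavioural model under stated assumptions, not the paper's Verilog: the names `dct_1d`, `dct_2d_model` and `worst_case_output` are hypothetical, clock cycles are not modelled, and the first pass is assumed to run over columns and the second over rows.

```python
# Integer core matrix C_f, as written in (11).
CF = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def dct_1d(x):
    """4-point integer DCT using the butterfly of Table I."""
    m0, m3 = x[0] + x[3], x[0] - x[3]
    m1, m2 = x[1] + x[2], x[1] - x[2]
    return [m0 + m1, m2 + (m3 << 1), m0 - m1, m3 - (m2 << 1)]


def transpose(block):
    """Swap rows and columns of a 4x4 block given as a list of rows."""
    return [list(col) for col in zip(*block)]


def dct_2d_model(block):
    """Two-pass integer 2D-DCT, W = C_f * X * C_f^T."""
    # First pass: transform every column of the input block.
    first_pass = [dct_1d(col) for col in transpose(block)]
    # Intermediate 4x4 register: holds C_f * X, stored row by row.
    intermediate = transpose(first_pass)
    # Second pass: transform every row of the intermediate result.
    return [dct_1d(row) for row in intermediate]


def worst_case_output(max_abs_input):
    """Upper bound on |W| for inputs bounded by max_abs_input."""
    row_gain = max(sum(abs(c) for c in row) for row in CF)
    return max_abs_input * row_gain ** 2


if __name__ == "__main__":
    din = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    for row in dct_2d_model(din):
        print(row)
    bound = worst_case_output(255)
    # One extra bit is added for the sign.
    print("bound:", bound, "signed bits:", bound.bit_length() + 1)
```

For the test block with entries 1 to 16, this model reproduces the dct_2d matrix reported in the paper's simulation results, and for inputs bounded by 255 in magnitude the bound stays within a 15-bit signed word.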
Fig. 4. The module of 2D-DCT

Fig. 5. Simulation result of 2D-DCT

The test input din and the corresponding MATLAB simulation result dct_2d are
\[ din = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 9 & 10 & 11 & 12 \\ 13 & 14 & 15 & 16 \end{bmatrix}, \qquad dct\_2d = \begin{bmatrix} 136 & -28 & 0 & -4 \\ -112 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ -16 & 0 & 0 & 0 \end{bmatrix} \]
The MATLAB result is used to verify the validity of the designed 2D-DCT module. The speed and resource usage of the FPGA design also provide a reference for a JPEG encoder system IP core. Fig. 5 shows the simulation result of our 2D-DCT module. The results of the hardware module agree with the software test results, and the design consumes 587 logic elements, about 10% of the total logic elements. This shows that the module design is correct and that its resource consumption is small even with a high-speed pipelined design.

IV. CONCLUSIONS
In this paper we have analyzed the principle of the integer 2D-DCT. After analyzing the pipelining levels of the module, which consists of two 1D-DCT modules, we presented an FPGA hardware implementation that uses two pipelined 1D-DCT modules instead of one 1D-DCT module reused twice. The speed can be doubled while the increase in resource consumption is relatively small. The work offers an exploratory attempt and a useful reference for JPEG encoder system IP core design and its FPGA implementation.

REFERENCES
[1] T. Wiegand, G. J. Sullivan, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, July 2003.
[2] H. Kim, H. Jeong and Y. Lee, "Efficient DCT/IDCT and quantization implementation of multimedia ASIP for H.264,"
[3] C. Loeffler, A. Ligtenberg, "Practical fast 1-D DCT algorithms with 11 multiplications," Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP'89), Scotland, May 1989, pp. 988-991.
[4] H. EL-Banna, A. A. EL-Fattah, W. Fakhr, "An Efficient Implementation of the 1D DCT using FPGA Technology," Proceedings of the 11th IEEE International Conference and Workshop on the Engineering of Computer Based Systems (ECBS'04), May 2004, pp. 356-360. doi:10.1109/ECBS.2004.1316719
[5] L. V. Agostini, I. S. Silva, "Pipelined fast 2D-DCT architecture for JPEG image compression," 14th Symposium on Integrated Circuits and Systems Design, 2001, pp. 226-231.
