
A Zero-Skipping Multi-symbol CAVLC Decoder for MPEG-4 AVC/H.264
Guo-Shiuan Yu and Tian-Sheuan Chang
Dept. of Electronics Engineering, National Chiao-Tung University, Hsinchu, Taiwan {isis, tschang}@twins.ee.nctu.edu.tw
Abstract: This paper presents a high-performance CAVLC decoding VLSI architecture for MPEG-4 AVC/H.264. Instead of only skipping all-zero blocks, the proposed design exploits the features of the CAVLC decoding process to skip any process that has nothing to decode, and it decodes multiple symbols per cycle in the sign and run before stages. The proposed design needs 90 cycles on average to decode one macroblock (MB), which meets the real-time HDTV requirement and saves 64% of the cycle count on average compared with a previous design. The hardware cost is about 13192 gates when synthesized at 125 MHz.

I. INTRODUCTION

Context-based Adaptive Variable Length Code (CAVLC) has been adopted in the MPEG-4 AVC/H.264 video coding standard [1] as one of the entropy coding methods. However, in a video decoder system, entropy decoding often becomes the performance bottleneck because it is hard to speed up with parallelism and pipelining. Previous designs for CAVLC decoding focus on simplifying the VLC tables to reduce area, or on using a gated clock for all-zero block decoding [3][4], much like other VLC decoding approaches in MPEG-2 and MPEG-4 video coding. The design in [6] employs a multi-symbol decoding technique to improve throughput but does not pipeline the final merging process, which demands extra cycles. The final merging process, which merges the zero and nonzero coefficients, can take as many cycles as the block size, 16, and this is not addressed in previous approaches either. In summary, previous designs with zero skipping only skip the decoding of all-zero blocks or just gate the clock for low power. None of them exploits the characteristics of CAVLC decoding to skip more cycles within nonzero block decoding.

CAVLC, though similar to other VLC processes in its codeword table construction, is quite different in its overall process. The overall decoding process of CAVLC can be decomposed into several processes, and some of them can be merged for a lower cycle count. Besides, the zero information, which is encoded explicitly, can be used to skip redundant cycles instead of waiting.

This paper presents a high-throughput CAVLC decoder that exploits these CAVLC features. The proposed decoding flow merges processes together to reduce the cycle count. Besides, to take advantage of the zero coefficients, the proposed design skips a decoding process whenever it has nothing to decode. With this zero skipping, the final output stage uses an indexed buffer to quickly merge and reorder the decoded nonzero and zero coefficients without the waiting and buffering required in previous designs. Thus, the required cycles are as many as the number of nonzero coefficients, which is statistically just five or six, instead of 16 cycles for the whole block.

The rest of the paper is organized as follows. In Section II, we first introduce a typical CAVLC decoding flow and discuss the dependency in each stage. Then, the proposed decoding flow is presented in Section III, and its corresponding VLSI hardware architecture is shown in Section IV. Section V shows the performance evaluation and comparisons. Finally, a concluding remark is made in Section VI.

II. INTRODUCTION OF CAVLC DECODING FLOW

Fig. 1 shows a typical decoding flow of CAVLC, which can be divided into five stages [3], as discussed below.

Fig. 1. Decoding flow of CAVLC [3].

0-7803-9390-2/06/$20.00 2006 IEEE

5583

ISCAS 2006

A. Coefficient token decoding process
First, the process is started by decoding the total number of nonzero coefficients, TotalCoeff, and the number of coefficients with absolute value equal to 1 at the end of the block scan order, TrailingOnes. These two values are decoded with the coeff_token table, which is divided into five sub-tables according to the variable nC. The value nC is derived from the number of nonzero coefficients in the neighboring blocks. This decoding process depends on both the bitstream and the nC value.

B. Sign of each T1 decoding process
According to TrailingOnes (T1), the corresponding bits are taken to decode the signs of the trailing values. One bit signals each trailing coefficient sign, that is, 0 for +1 and 1 for -1. The T1s are decoded in reverse order. This process depends on both the bitstream and TrailingOnes.

C. Level decoding process
The level of each remaining nonzero coefficient is decoded in reverse order. There are seven VLC tables to choose from in this process. The selection is based on the magnitude of each successive coded level. Each item in these VLC tables can be represented as 0..01x..xs. The 0..01 part is the prefix, and the suffix is decoded from x..xs, where s is the sign of the level. The suffix decoding depends on the prefix, and the table used for each level is decided by the previously decoded level. This is a self-dependent process.

D. Total zeros decoding process
This process decodes the number of zeros before the last coefficient, that is, TotalZeros. There are different tables for AC 4x4 blocks and DC 2x2 blocks. These two tables are divided into several sub-tables according to the TotalCoeff decoded in the coefficient token process. When TotalCoeff is equal to 0 or maxNumCoeff, TotalZeros is set to 0 without reading the bitstream. The value of maxNumCoeff is 16 for AC and 4 for DC. Except for these cases, this process depends on both the bitstream and TotalCoeff.

E. Run before decoding process
In this process, the number of zeros between two adjacent coefficients is decoded in reverse order. There are seven sub-tables for run before decoding. They are selected by zerosLeft, which is initialized with TotalZeros and updated by subtracting the previous run from the previous zerosLeft. When zerosLeft is equal to 0, the runs of the remaining coefficients are set to 0, which means the order of the remaining coefficients is the same as the decoding order in the level stage. In this case, the process is independent of the bitstream. Otherwise, this process depends on both the bitstream and the previous run.
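To illustrate how the outputs of the five stages fit together, the following Python sketch places the decoded levels into scan order using TotalZeros and the run before values. It models only the final merging arithmetic and not the bitstream parsing; the function name `merge_levels_and_runs` and its list-based interface are assumptions made for this example.

```python
def merge_levels_and_runs(levels, runs, total_zeros, max_num_coeff=16):
    """Place decoded levels into a scan-order coefficient list.

    levels: nonzero levels in decoding order (highest scan position first).
    runs:   run before value for each level except the last one, in the
            same order; the last level implicitly takes the zeros left.
    """
    total_coeff = len(levels)
    coeffs = [0] * max_num_coeff
    if total_coeff == 0:
        return coeffs  # all-zero block, nothing to place

    zeros_left = total_zeros
    # The first decoded level sits at the highest occupied scan position.
    pos = total_coeff + total_zeros - 1
    for i, level in enumerate(levels):
        coeffs[pos] = level
        if i == total_coeff - 1:
            break
        # Once zerosLeft reaches 0, every remaining run is 0.
        run = runs[i] if zeros_left > 0 else 0
        zeros_left -= run
        pos -= run + 1
    return coeffs


# Scan-order block 0, 3, 0, 1, -1, -1, 0, 1, 0, ... has TotalCoeff = 5,
# TotalZeros = 3, levels (reverse order) 1, -1, -1, 1, 3 and runs 1, 0, 0, 1.
print(merge_levels_and_runs([1, -1, -1, 1, 3], [1, 0, 0, 1], 3))
# -> [0, 3, 0, 1, -1, -1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

The loop runs once per nonzero coefficient, which is why a merge driven by the run values can finish in far fewer steps than a scan over all 16 positions.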

III. PROPOSED CAVLC DECODING FLOW

From the decoding process above, we can see that each symbol has several corresponding context-based adaptive VLC tables, and the selection among these tables is based on the statistics of the block content and on previously decoded symbols. Thus the decoding process depends not only on the bitstream but also on the previous symbols. This hinders speedup techniques such as parallelism and pipelining. One often-adopted solution is to combine the codewords of different symbols so that multiple-symbol decoding becomes possible. However, this leads to longer codes and thus a larger table to be decoded in one cycle. In this paper, we propose to use some decoded information as an inherent table index to reduce the number of combinations. Besides, from the above analysis, we find that the dependency between some processes can be eliminated by merging them together, e.g., coeff_token decoding and the sign decoding of the TrailingOnes. Some processes can be skipped when they have nothing to decode, which can be detected from the zero information embedded in the bitstream. In summary, to reduce processing cycles, we combine some stages, decode multiple symbols in the sign and run before stages, and skip all bitstream-independent processes. The proposed CAVLC decoding flow, shown in Fig. 2, is thus divided into four stages.

Fig. 2. The proposed decoding flow of CAVLC.

A. Coefficient token and sign of T1 decoding process
The coefficient token process is combined with the sign process because sign decoding is quite simple. As soon as TrailingOnes and TotalCoeff are known, the length of the sign code is predictable. Thus the signs of all trailing ones can be decoded in one cycle instead of one sign per cycle. Since this is multiple-symbol decoding, the hardware cost of the final level buffer also increases. The level buffer, which stores the final decoded level values, has to be a multiple-input buffer instead of the single-input FIFO used in other methods. When TotalCoeff is equal to 0, only the coefficient token process is required and the other stages are skipped. This mechanism also lowers power consumption.
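The one-cycle sign decoding can be pictured as slicing TrailingOnes bits off the bitstream at once and mapping each bit to a level. The following Python sketch is a software analogy under that assumption; the function name `decode_trailing_one_signs` and the use of a '0'/'1' string for the bitstream are choices made for this example and do not describe the hardware.

```python
def decode_trailing_one_signs(bits, trailing_ones):
    """Decode all trailing-one signs from the next `trailing_ones` bits.

    bits: string of '0'/'1' characters starting right after coeff_token.
    Returns (levels, bits_consumed); a 0 bit maps to +1, a 1 bit to -1.
    """
    if not 0 <= trailing_ones <= 3:
        raise ValueError("TrailingOnes must be in 0..3")
    if len(bits) < trailing_ones:
        raise ValueError("not enough bits in the window")
    # The code length is known up front, so all signs are taken together.
    sign_bits = bits[:trailing_ones]
    levels = [1 if b == "0" else -1 for b in sign_bits]
    return levels, trailing_ones


print(decode_trailing_one_signs("0110", 3))  # -> ([1, -1, -1], 3)
```

Because up to three levels are produced together, whatever buffer receives them must accept several writes at once, which is the cost noted above for the multiple-input level buffer.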


B. Level decoding process
The adaptation in level decoding is quite complex. Considering the hardware cost, the multiple-symbol method may not be suitable here, so this process remains the same as the traditional one.

C. Total zero decoding process
This process is skipped if TotalCoeff is equal to zero or maxNumCoeff. Whether to skip is controlled by the coefficient token process.

D. Run before decoding process
In this process, we decode two runs in the same cycle when zerosLeft <= 6. The run before sub-tables are separated by zerosLeft (1, 2, 3, 4, 5, 6, >6). Unless zerosLeft is bigger than 6, the zerosLeft for the next run is predictable. For example, if the run is equal to 1 under zerosLeft == 6, decoding the next run takes the sub-table for zerosLeft == 5. That is why we adopt six as the partitioning point. There are 71 possible combinations of two run codes, and the longest codeword in this table is 6 bits, which is shorter than the longest codeword in the original table. The modified run table contains 86 items. As soon as zerosLeft is equal to 0, the process is skipped and the remaining levels are copied to the coefficient buffer in the decoded order. This task is done in one cycle. For this purpose, data in the level buffer must be stored according to the decoding order.

The critical case for the proposed decoding flow is shown in Fig. 3 (a). The required cycle count is 19. In this case, none of our speedup techniques applies, and the decoding process becomes the same as the traditional one. For the traditional process, the critical case is shown in Fig. 3 (b), where the cycle count is 32. The proposed method needs 18 cycles under this condition.
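The idea behind two-run decoding, that the second sub-table is fixed once the first run is known, can be sketched by concatenating codewords offline. The Python example below is a minimal sketch and not the 86-item table of the proposed design: `RUN_TABLES` holds only two small illustrative sub-tables (intended to mirror the zerosLeft = 1 and zerosLeft = 2 cases and to be checked against the standard), and both function names are assumptions. It also assumes that at least two more coefficients are waiting for a run.

```python
# Illustrative sub-tables keyed by zerosLeft: codeword -> run before.
# Assumed values for demonstration; verify against the standard before use.
RUN_TABLES = {
    1: {"1": 0, "0": 1},
    2: {"1": 0, "01": 1, "00": 2},
}


def build_two_run_table(tables):
    """For each zerosLeft, concatenate first-run and second-run codewords."""
    combined = {}
    for zeros_left, first in tables.items():
        entries = {}
        for code1, run1 in first.items():
            remaining = zeros_left - run1  # zerosLeft for the next run
            if remaining == 0:
                # No zeros left: the next run needs no bits at all.
                entries[code1] = (run1,)
            else:
                for code2, run2 in tables[remaining].items():
                    entries[code1 + code2] = (run1, run2)
        combined[zeros_left] = entries
    return combined


def decode_two_runs(bits, zeros_left, combined):
    """Find the entry that prefixes `bits`; return (runs, bits_consumed)."""
    for code, runs in combined[zeros_left].items():
        if bits.startswith(code):
            return runs, len(code)
    raise ValueError("no matching codeword")


combined = build_two_run_table(RUN_TABLES)
print(decode_two_runs("0110", 2, combined))  # -> ((1, 0), 3)
```

Each combined codeword yields one or two runs in a single lookup, at the price of a larger table; limiting the combination to small zerosLeft values keeps that table and its longest codeword bounded.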

Fig. 3. Critical cases for (a) the proposed and (b) the traditional design, where X denotes a nonzero coefficient.

IV. ARCHITECTURE OF PROPOSED CAVLC DECODER

Based on the proposed decoding flow, Fig. 4 shows the corresponding VLSI architecture. The design contains the following components: bitstream shifter, coeff_token-sign decoder, level decoder, total zero decoder, run before decoder, level buffer, coefficient buffer, zero block index, and control unit.

Fig. 4. The proposed architecture for CAVLC decoding.

The input unit is a bitstream shifter that shifts the bitstream according to the length of the previous codeword and provides the aligned bitstream for the next decoding process. It includes two registers, a shifter, and a code length accumulator, as generally used in traditional VLC decoding hardware [5].

The output part merges and reorders the decoded levels and zero runs for the subsequent component, such as inverse quantization. In previously published CAVLC decoder architectures, by contrast, the level buffer and run buffer are separate, so the decoder needs an additional process to merge the zero runs and levels into actual coefficients for the other decoder components. In the proposed architecture this is done at the run before decoding stage: when a run is decoded, the corresponding level is copied to the coefficient buffer. The processing cycles are reduced and the run buffer can be saved.

The control unit assigns decoding tasks to the different decoding components. To help reduce power consumption, the control unit turns off any component that is not in use, providing functional gating. Furthermore, when all coefficients in a 4x4 or 2x2 block are zero, the control unit skips all decoding processes except the coefficient token decoding process. In this case, an extra zero block index is set to 1 to enable zero skipping in the subsequent components. The other components are described in more detail below.

A. Coeff_token-sign decoder
Fig. 5 shows the architecture, which includes the coeff_token table, the trailing value decoder, and the zero-block detector. The bitstream is first decoded by the coeff_token table to generate the values of TotalCoeff and TrailingOnes. According to TrailingOnes, the values of the trailing coefficients are decoded and sent to the level buffer. The zero-block detector detects a codeword with TotalCoeff equal to 0 and sets the zero block index to one.

Fig. 5. Architecture of the coeff_token-sign decoder.
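The skip decisions taken by the control unit and the zero-block detector can be summarized as a small scheduling function. The Python sketch below is an illustration under stated assumptions: the name `plan_stages` and the stage labels are invented for this example, TotalZeros is passed in as if already decoded, and the level stage is always scheduled for nonzero blocks because the text gives no skip rule for it.

```python
def plan_stages(total_coeff, total_zeros, max_num_coeff=16):
    """Return (stages, zero_block_index) for one block.

    Stage labels are hypothetical names for the four stages of the
    proposed flow; they are not signal names from the design.
    """
    stages = ["coeff_token_and_sign"]
    zero_block = total_coeff == 0
    if zero_block:
        # All-zero block: only the coefficient token is decoded.
        return stages, zero_block

    stages.append("level")
    if total_coeff < max_num_coeff:
        # TotalZeros is implied (0) when the block is completely full.
        stages.append("total_zeros")
    if total_zeros > 0:
        # zerosLeft starts at TotalZeros; with no zeros the levels are
        # copied to the coefficient buffer without reading the bitstream.
        stages.append("run_before")
    return stages, zero_block


print(plan_stages(0, 0))   # -> (['coeff_token_and_sign'], True)
print(plan_stages(16, 0))  # -> (['coeff_token_and_sign', 'level'], False)
```

A block with TotalCoeff = 5 and TotalZeros = 3 would go through all four stages, while the two calls shown exercise the skip paths.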


B. Level decoder
The prefix is decoded by a leading-one detector, which gives the information needed for suffix decoding and the codeword length. Once the suffix has been extracted, the value of the level is computed from the prefix and suffix. An escape code occurs when the prefix is 15, which gives a 28-bit codeword. This is why the bitstream shifter is 28 bits wide.

C. Total zero decoder and run before decoder
Both of these decoders are implemented according to the modified tables described above.
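The leading-one detection and the 28-bit escape length of the level decoder can be checked with a short Python sketch. The suffix-size rule used here (4 bits for prefix 14 when the suffix length is 0, 12 bits for prefix 15, otherwise the current suffix length) follows our reading of the 2003 edition of the standard and is an assumption of this example, as are the function names; only the 28-bit escape length is stated in the text.

```python
def level_prefix(bits):
    """Count the zeros before the first '1' (leading-one detection)."""
    prefix = bits.find("1")
    if prefix < 0:
        raise ValueError("no leading one in the window")
    return prefix


def level_codeword_length(bits, suffix_length):
    """Return (prefix, suffix_size, total_length) of the next level codeword.

    suffix_length is the current adaptive suffix length (0..6).
    """
    prefix = level_prefix(bits)
    if prefix == 14 and suffix_length == 0:
        suffix_size = 4
    elif prefix == 15:
        suffix_size = 12  # escape code
    else:
        suffix_size = suffix_length
    # prefix zeros + the terminating one + suffix bits
    return prefix, suffix_size, prefix + 1 + suffix_size


escape = "0" * 15 + "1" + "0" * 12
print(level_codeword_length(escape, 0))  # -> (15, 12, 28)
```

The escape case gives 15 + 1 + 12 = 28 bits, the longest codeword the shifter has to present in one window.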

D. Level buffer
To support multiple sign decoding, run before skipping, and in-time coefficient matching, the level buffer accepts at most three input updates at a time and assigns them in order. The levels are assigned by index according to the decoded order, i.e., the 3rd level is sent to the 2nd position in the level buffer if the total number of levels is five. This brings the side benefit of lower power consumption, since only the updated entries change rather than the whole buffer content.

E. Indexed coefficient buffer and the final output stage
According to each decoded run, the corresponding level is selected in the level buffer. Then, the matched position of the coefficient buffer becomes active for updating. The indexes generated from the run before table for the level selection and position matching processes are stored in registers. The run before decoding and the coefficient matching processes are pipelined, since the matching work is independent of the bitstream. If two-run decoding is available, two-level selection and two-position updating are active. Otherwise, only one position is updated. When zerosLeft is equal to zero, the coefficient buffer directly copies the remaining levels from the level buffer, because their order was already arranged when the level buffer was updated. The mechanism can match at most 16 coefficients in one cycle. The architecture is shown in Fig. 6.

Fig. 6. The final output stage.

V. PERFORMANCE EVALUATION

The proposed design is implemented in Verilog HDL and synthesized with a 180 nm CMOS standard cell-based library. TABLE I shows the average number of cycles required to decode one macroblock with the proposed CAVLD for different sequences. All the sequences in our simulation are QCIF size and all intra encoded. Due to the efficient zero skipping, merged processes, and multi-symbol decoding, the proposed design can save up to 70% of cycles compared to the previous one. For a fair comparison, we also include the required reordering and merging cycles in the previous design. The hardware cost is about 13192 gates when synthesized at 125 MHz, whereas design [3] needs about 9943 gates and 1152 bits of RAM.

TABLE I
COMPARISONS OF AVERAGE PROCESSING CYCLES
(average cycles/MB for the proposed design and design [3], and the reduced percentage)

                     QP 28                          QP 20                          QP 12
Sequence   Proposed  Design [3]  Reduced   Proposed  Design [3]  Reduced   Proposed  Design [3]  Reduced
Akiyo         38        162        76%        77        230        66%       120        335        64%
Foreman       53        192        72%       120        306        60%       214        468        54%
Stefan       124        310        60%       200        441        54%       269        563        52%
Mobile       174        395        55%       279        570        51%       353        704        49%
News          58        196        70%       108        282        61%       176        404        56%

VI. CONCLUSION

In this paper, a zero-skipping multi-symbol CAVLC decoding architecture for MPEG-4 AVC/H.264 has been proposed. We have analyzed the typical decoding flow and proposed a new decoding flow that reduces processing cycles by exploiting the properties of the CAVLC algorithm. With the proposed architecture, the processing cycle count can be reduced by up to 76%.

Acknowledgement
This research is sponsored by National Science Committee, Taiwan, R.O.C. under grant NSC-93-2200-E009-028.

REFERENCES
[1] Joint Video Team (JVT), Draft ITU-T recommendation and Final Draft International Standard of Joint Video Specification, ITU-T Rec. H.264 and ISO/IEC 14496-10 AVC, May 2003.
[2] Joint Video Team reference software JM8.2.
[3] H.-C. Chang, C.-C. Lin, and J.-I. Guo, A Novel Low-Cost High-Performance VLSI Architecture for MPEG-4 AVC/H.264 CAVLC Decoding, in Proc. ISCAS, pp. 6110-6113, 2005.
[4] D. W, G. Wen, M. H., and Z. Ji, A VLSI architecture design of CAVLC decoder, in Proc. Intl. Conf. ASIC, pp. 962-965, 2003.
[5] S.-M. Lei and M.-T. Sun, An entropy coding system for digital HDTV applications, IEEE Trans. Circuits and Systems for Video Tech., vol. 1, no. 1, pp. 147-155, Mar. 1991.
[6] T.-W. Chen, Y.-W. Huang, T.-C. Chen, Y.-H. Chen, C.-Y. Tsai, and L.-G. Chen, Architecture design of H.264/AVC decoder with hybrid task pipelining for high definition videos, in Proc. ISCAS, vol. 3, pp. 2931-2934, May 2005.
