A Novel Design of CAVLC Decoder With Low Power and High Throughput Considerations

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 21, NO. 3, MARCH 2011
Tsung-Han Tsai, Member, IEEE, Te-Lung Fang, and Yu-Nan Pan
Abstract: This paper proposes a novel algorithm and its very large scale integration (VLSI) design for context-based adaptive variable length code (CAVLC) decoding. To improve the throughput of the CAVLC decoder, we propose two new methods: multiple level decoding (MLD) and nonzero skipping for run before decoding (NZS). By performing parallel operations in the level decoder, MLD can decode two levels in one cycle in most situations, and NZS can produce several values of run before in the same cycle. Both methods have the advantages of low complexity and regularity. The proposed architecture needs 141 cycles/macroblock. Moreover, the proposed CAVLC decoder can run at 33.5 MHz to meet the real-time requirement for 1920×1088 resolution. The power consumption for the 1920×1088 resolution is about 1.83 mW. The operation frequency is about 29.1% to 71.5% lower than that of other architectures. Thanks to this lower operation frequency, the design is suitable for many low power applications. The synthesis result shows that the gate count is 13 175 gates, and the maximum frequency can reach 160 MHz.

Index Terms: CAVLC decoder, H.264, low power, multi symbol.
I. Introduction

VLC, which stands for variable-length coding, is well suited to regular data and compresses data efficiently without any loss. A variable length code (VLC) assigns short codewords to data that occur frequently and long codewords to data that occur infrequently. It is widely applied in video and image compression standards such as MPEG-1/2/4 and H.26x. To further improve the compression ratio, context-based adaptive variable length code (CAVLC) is adopted to encode the residual data after quantization in the MPEG-4 AVC/H.264 baseline profile [1]. H.264 offers a further improvement in compression ratio while maintaining excellent quality compared to previous standards. The advanced compression technology makes the computational complexity of an H.264 decoder much higher than that of previous standards [2], [3]. According to the complexity profiling of the H.264 decoder in previous research [4], the computational complexity of
Manuscript received February 24, 2010; revised May 24, 2010 and August 11, 2010; accepted August 20, 2010. Date of publication January 13, 2011; date of current version March 23, 2011. This work was supported by CIC and the National Science Council of Taiwan, under Grant NSC97-2220-E008-001. This paper was recommended by Associate Editor Y.-K. Chen. The authors are with the Department of Electrical Engineering, National Central University, Taoyuan 300, Taiwan (han@dsp.ee.ncu.edu.tw; anderson@dsp.ee.ncu.edu.tw; kobbe@dsp.ee.ncu.edu.tw). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCSVT.2011.2105590
context-based adaptive variable length decoding (CAVLD) is about 28% of the total H.264 decoder, second only to motion compensation (MC). Although the computational complexity of MC in an H.264 decoder is large, it can be easily improved by parallel processing at the architecture level. CAVLD, however, is a variable length decoding process in which each level depends on the previous ones. The complicated coding procedure and the strong data dependency of the CAVLD algorithm make it hard to exploit pipelining and parallelism in hardware design. Therefore, CAVLC decoding is usually the performance bottleneck in an H.264 baseline profile decoder, especially in level decoding and run before decoding.

To obtain better H.264 decoder performance, most previous research has focused on speeding up CAVLD. Tseng [7] proposed a pattern search method before CAVLD to reduce memory access. Moon [8] used many arithmetic operations to decode the codewords by analyzing the structure of the VLC tables. With this kind of architecture, the design can decode codewords without memory access. For high throughput requirements, several architectures have been proposed to realize the CAVLC decoder [10]-[12]. Alle [10] shortened the critical path to improve throughput by restricting the number of bits processed to 16. The authors in [11]-[13] focused on multisymbol decoding in the CAVLC decoder. Yu [11] decoded two runs in parallel in the same cycle. Wen [12] used six decoders to decode all runs in three cycles, but this kind of architecture induces a large delay on the critical path. Tsai [13] used two decoders for level decoding and proposed a novel algorithm to decode several runs in the same cycle. The two-level decoder architecture, however, induces a hardware overhead and poor hardware performance. Furthermore, other designs [14]-[16] study how to reduce power consumption and hardware cost.

There are several submodules in the typical CAVLC decoder. 
We have profiled the complexity on several video sequences and show the results in Fig. 1. According to the profiling results and the decoding procedure, it is clear that level decoding and run before decoding dominate the performance. Similar to previous designs in [11] and [12], multisymbol decoding for run before is employed to reduce decoding cycles. However, the improvement of level decoding has not been resolved in the previous literature. For a high quality video stream with a low quantization parameter (QP), most of the coefficients in a 4×4/2×2 block are nonzero. In this case, the level and run before decoding processes consume many cycles. When
1051-8215/$26.00 © 2011 IEEE
Fig. 1. Analysis of the complexity of CAVLC decoder.
more nonzero coefficients (nC) are in a 4×4/2×2 block, more decoding cycles are used. To solve this problem, we propose two solutions that reduce decoding cycles and increase the symbol throughput rate:
1) Nonzero skipping (NZS) for the Run Before Decoder decodes, in a single step, the run before values of all nonzero coefficients (nC) between two zeros.
2) Multiple level decoding (MLD) increases the throughput of level decoding.

The idea of NZS is similar to the Booth algorithm: it can easily detect a series of coefficients whose run befores are zero and decode these run befores in the same clock cycle. When NZS is employed, we save about 4% of the decoding cycles compared to the multisymbol design [11] and more than 50% compared to the single symbol design. For an intra-only encoder [23]-[25], the number of nC is larger than with a full H.264 encoder. Since NZS gains much benefit on intra-only coded streams, the decoding efficiency of the proposed design is better than that of other designs. The idea of MLD is to use two level decoders to decode levels at the same time. With the aid of these two methods, the proposed hardware is cost-effective, and the operation frequency is greatly reduced to achieve low power consumption. In addition, a gated clock scheme is used to reduce the power consumption.

This paper is organized as follows. Section II introduces a typical CAVLC decoding algorithm. Section III presents the proposed CAVLC decoding algorithm. Based on the proposed algorithm, Section IV describes the architecture of every module in depth. The proposed architecture is implemented with a cell-based design flow, and its decoding performance is evaluated and compared with other designs in Section V. Finally, Section VI concludes this paper.

II. Typical CAVLC Decoding Algorithm

In the H.264 standard, the quantized residuals of each 4×4 block and each 2×2 block are encoded in zigzag scan order by CAVLC. Fig. 
2 shows the steps of the conventional CAVLC decoding algorithm. There are five elements in the encoded codeword, and thus a typical CAVLC decoding algorithm can be divided into the following five steps.

A. Coeff Token

In this step, the total number of nC (TotalCoeff) and the number of trailing ones without sign (TrailingOnes) are
Fig. 2. Steps of the conventional CAVLD algorithm.

TABLE I
Thresholds for Determining Whether to Increment suffixLength

Current suffixLength    Threshold to Increment suffixLength
0                       0
1                       3
2                       6
3                       12
4                       24
5                       48
6                       N/A
decoded. There are five VLC tables that can be used to decode coeff token, and the parameter nC determines which one is chosen. The parameter nC is computed from the number of nC in the neighborhood of the current block and must be updated before every CAVLC decoding.

B. Sign of TrailingOnes

For each TrailingOne represented in coeff token, the sign is decoded with a single bit in reverse order. At most three bits are decoded as signs of TrailingOnes.

C. Level

The codeword of each level consists of a prefix (level prefix) and a suffix (level suffix), and the levels of the nC are decoded in reverse order. After each level is decoded, the length of the suffix may increase, depending on the magnitude of the decoded level. The thresholds for increasing suffixLength are shown in Table I [17].

D. Total Zeros

In this step, total zeros, the total number of zeros in each 4×4/2×2 block that precede the first nC in reverse zigzag order, is also decoded with VLC tables. There are 15 tables for 4×4 blocks and three tables for 2×2 blocks. The table used to decode the current total zeros is selected according to the value of TotalCoeff.
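The suffixLength adaptation described in step C can be sketched as follows. This is a minimal illustration that follows Table I only; the names `THRESHOLDS`, `update_suffix_length`, and `suffix_length_trace` are hypothetical, and the normative procedure in the standard contains further details (for example, initialization rules) that are not modeled here.

```python
# Thresholds from Table I, indexed by the current suffixLength (0..5).
# suffixLength 6 has no threshold (N/A), so it is never incremented.
THRESHOLDS = [0, 3, 6, 12, 24, 48]


def update_suffix_length(suffix_length, level):
    """Return the suffixLength to use for the next level (Table I rule)."""
    if suffix_length < len(THRESHOLDS) and abs(level) > THRESHOLDS[suffix_length]:
        return suffix_length + 1
    return suffix_length


def suffix_length_trace(levels, initial=0):
    """List the suffixLength in effect while each level is decoded."""
    trace = []
    current = initial
    for level in levels:
        trace.append(current)
        current = update_suffix_length(current, level)
    return trace


if __name__ == "__main__":
    # Illustrative levels in decoding order (not taken from the paper).
    print(suffix_length_trace([1, -2, 4, 7]))
```

Because each update depends on the magnitude of the level just decoded, the suffixLength for one level is not known until the previous level is finished, which is the dependency that the two-level decoder has to resolve.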
Fig. 3. Typical decoding algorithm of CAVLC decoder, illustrated with an example using nC = 4.
Fig. 4. Proposed decoding algorithm of CAVLC decoder, illustrated with an example using nC = 4.
E. Run Before

In the last step, each run before is decoded as the number of zeros that precede each nC, decoded in reverse zigzag order. There are seven VLC tables, and the choice among them depends on how many zeros are left, represented as zeroLeft. The parameter zeroLeft is initialized with total zeros and updated by subtracting each decoded run before from it.

An example of a typical CAVLC decoder, which takes one cycle per symbol, is shown in Fig. 3. In total, 22 clock cycles are needed: nine clock cycles each for level decoding and run before decoding, and four clock cycles for the other steps. Clearly, the typical CAVLC decoding algorithm is inefficient for a high throughput design.
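To make the role of total zeros, run before, and zeroLeft concrete, the sketch below rebuilds a coefficient array in zigzag order from already-decoded symbols. It is a software illustration under stated assumptions, not the hardware described in this paper: the function name `place_coefficients` and the sample values are hypothetical, and levels are assumed to arrive in reverse zigzag order.

```python
def place_coefficients(levels, total_zeros, run_befores, block_size=16):
    """Rebuild the zigzag-ordered coefficients of one block.

    levels      -- nonzero coefficients in reverse zigzag order
    total_zeros -- zeros preceding the last nonzero coefficient in zigzag order
    run_befores -- coded run before values, in the same order as levels
    """
    coeffs = [0] * block_size
    zero_left = total_zeros
    # Zigzag index of the highest-frequency nonzero coefficient.
    pos = len(levels) + total_zeros - 1
    runs = iter(run_befores)
    for index, level in enumerate(levels):
        coeffs[pos] = level
        if index == len(levels) - 1:
            break  # the last coefficient takes all remaining zeros implicitly
        # A run before is only coded while zeros remain.
        run = next(runs) if zero_left > 0 else 0
        zero_left -= run
        pos -= run + 1
    return coeffs


if __name__ == "__main__":
    # Illustrative block: five nonzero coefficients and three zeros among them.
    print(place_coefficients([1, -1, -1, 1, 3], 3, [1, 0, 0, 1]))
```

In a one-symbol-per-cycle decoder, each iteration of this loop corresponds to one run before cycle, which is why blocks with many nonzero coefficients are expensive.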
Fig. 5. Diagram of multiple level decoding for level decoder.
III. Proposed CAVLC Decoding Algorithm

There are five elements in a CAVLC codeword, and strong dependency exists between these elements. Following the architecture of the previous multisymbol design [11], the decoding of Coeff token and the sign of TrailingOnes can be merged into the same process, so that a typical CAVLC decoding flow is simplified to four steps, as shown in Fig. 4. Furthermore, we improve both the level decoder and the run before decoder to support multisymbol decoding. In most cases, the proposed level decoder produces two symbols per cycle and the run before decoder produces several. In the example of Fig. 4, the Coeff token & TrailingOnes sign decoder and the Total zeros decoder need one cycle each. The level decoder needs five cycles, and the run before decoder needs two cycles. In total, the decoding process needs nine cycles. Compared to a typical CAVLC decoder, which needs 22 cycles, this is a 59% reduction in cycles. More detailed descriptions are given below.

A. Level Decoder

Based on the concept of word-level parallel processing, several level decoders are employed to meet this requirement. This is the starting point of our MLD method. With this method, the proposed Level Decoder architecture can produce two symbols per clock cycle in most cases. According to the CAVLC decoding algorithm, suffixLength has to be updated after every decoded level. Because of the dependency on
each suffixLength, this information must be passed directly to the second Level Decoder so that two levels can be decoded simultaneously. Our proposed design does not support two-level decoding in every case, because that would require a huge shift circuit. Two levels can be decoded together only when their total length is less than 32 bits. When the total length of the two levels is more than 32 bits, we use one cycle per level. Because this case seldom occurs, it has little effect on the number of decoding cycles. The diagram of two-level decoding is illustrated in Fig. 5. The architecture design issues are discussed in Section IV.

B. Nonzero Skipped Decoder

In the Nonzero Skipped decoder, each level is associated with a run before, which indicates the number of zeros preceding that level. Traditional run before decoding methods decode only one run before per cycle. To speed up CAVLC decoding, we trace the algorithm carefully and analyze every VLC table. This reveals some regularity in the run before table, which is marked in italics in Table II. Table II shows the run before table, taken from the standard [1]. Zeros left represents the number of zeros that remain to be placed among the levels decoded by the Level Decoder. As shown in Table II, in every column the codeword for a run before equal to zero consists only of ones: a single one, two ones, or three ones, depending on zeros left.
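For the two-level decoding rule of Section III-A above, a rough cycle-count model can be sketched as follows. It assumes a greedy pairing policy (always try to pair the next two levels), which is our assumption rather than something stated in the text; `mld_cycles` and `pair_limit` are hypothetical names, and the codeword lengths are illustrative.

```python
def mld_cycles(codeword_bits, pair_limit=32):
    """Count level-decoding cycles under a greedy two-level pairing model.

    codeword_bits -- length in bits of each level codeword, in decoding order
    pair_limit    -- two levels share a cycle only if their total length
                     is below this limit
    """
    cycles = 0
    i = 0
    while i < len(codeword_bits):
        has_next = i + 1 < len(codeword_bits)
        if has_next and codeword_bits[i] + codeword_bits[i + 1] < pair_limit:
            i += 2  # both levels decoded in this cycle
        else:
            i += 1  # fall back to one level per cycle
        cycles += 1
    return cycles


if __name__ == "__main__":
    # Nine short level codewords take five cycles instead of nine.
    print(mld_cycles([4] * 9))
```

With nine short codewords the model gives five cycles, which agrees with the level decoder cycle count quoted for the example of Fig. 4.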
TABLE II
Table for Run Before

                Zeros Left
Run before      1     2     3     4     5     6     >6
0               1     1     11    11    11    11    111
1               0     01    10    10    10    000   110
2                     00    01    01    011   001   101
3                           00    001   010   011   100
4                                 000   001   010   011
:                                       :     :     :
13                                                  0000000001
14                                                  00000000001

Fig. 6. (a) Example for describing NZS. (b) Best case of decoding cycle using NZS. (c) Worst case of decoding cycle using NZS.
Fig. 7. Group distribution for different modes and QP of the Hall Monitor.
Fig. 8. Flowchart of nonzero skipping decoding for run before decoder.
We propose the NZS method to perform multisymbol decoding on run before. The key idea of NZS is to look ahead in the bit-stream and find out how many consecutive run befores have the value zero. In reversed zigzag scan order, we define a run of consecutive levels as a group, which starts from the first level and ends at the next zero coefficient. In the example illustrated in Fig. 6(a), a 4×4 block contains 11 symbols, x0 to x10. We interpret this example as three groups. Group1 (g1) contains symbols x0 and x1; group2 (g2) contains symbols x2 to x8; and group3 (g3) contains symbols x9 and x10. Conventionally, run before is decoded step by step, and the values are 0, 1, 0, 0, 0, 0, 0, 0, 1 for x0 to x8, respectively. The Nonzero Skipped decoder then finishes because there is no zero left. When we employ NZS in the Nonzero Skipped decoder, the cycle count equals the number of groups minus one. In the example of Fig. 6(a), only two decoding cycles are needed: the symbols in g1 are decoded in the first cycle, and the symbols in g2 are decoded in the second cycle.

Fig. 6(b) shows the best case and Fig. 6(c) shows the worst case of the NZS algorithm. In these two cases, a shaded block stands for an nC, and a white block stands for a coefficient with zero value. In the best case, since there are only two groups in the block, only one cycle is needed to decode nine symbols. In the worst case, nonzero and zero coefficients are interleaved, so a total of seven cycles is needed to decode the eight symbols. In our simulations on various sequences, the worst case never occurred.

We analyze the distribution of the levels in each 4×4 block and find that the number of groups ranges from one to eight. To evaluate the performance of NZS, we test eight video sequences with QP = 28 in a general decoder (intra+inter). Table III shows the number of groups for each video sequence.
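The cycle counts in the examples above can be reproduced with a small model. The sketch below assumes that one NZS cycle consumes a series of zero-valued run befores plus the next nonzero run before, and that decoding stops once no zeros are left; how a trailing series of zero-valued run befores is charged is our assumption. The name `nzs_cycles` is hypothetical.

```python
def nzs_cycles(run_befores, total_zeros):
    """Count run before decoding cycles under the NZS model.

    run_befores -- run before values in decoding order (reverse zigzag)
    total_zeros -- initial value of zeroLeft
    """
    zero_left = total_zeros
    cycles = 0
    i = 0
    while i < len(run_befores) and zero_left > 0:
        # Skip the series of run befores equal to zero (all-ones codewords).
        while i < len(run_befores) and run_befores[i] == 0:
            i += 1
        # Decode the following nonzero run before, if any, in the same cycle.
        if i < len(run_befores):
            zero_left -= run_befores[i]
            i += 1
        cycles += 1
    return cycles


if __name__ == "__main__":
    # Run befores of x0..x8 from the example of Fig. 6(a); two zeros in total.
    print(nzs_cycles([0, 1, 0, 0, 0, 0, 0, 0, 1], 2))
```

For the run befores of Fig. 6(a) the model returns two cycles, and for a fully interleaved block with eight nonzero coefficients (seven coded run befores equal to one) it returns seven, matching the worst case described above.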
From Table III, 96% of blocks on average are classified as having one to four groups. This means NZS achieves a large gain on most blocks, which need only a few cycles for decoding. In addition, because the quantized residual coefficients produced by intra prediction are often nonzero, the number of groups in each block is often small, which suits NZS. Fig. 7 shows the group distribution for different modes and QP of the video sequence Hall Monitor. According to this distribution, whether the bit-stream is coded by the intra-only encoder or the general encoder (intra+inter), the share of blocks with one to four groups is over 96%. NZS therefore performs well in both cases. Moreover, the share of blocks with one group is larger for intra-only coding than for intra+inter coding, so NZS provides a relatively larger improvement for the intra-only encoder.

The flow chart of run before decoding is shown in Fig. 8. Based on the value of zeroLeft, the first step determines which ones type (one, two, or three ones) applies in the current iteration, and the second step counts the number of ones. The third step calculates the skipping number, which is computed from the number of ones. Then the run before with a nonzero value is decoded. In the last step, the process terminates if there is no zero left; otherwise it returns to the first step.
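The first three steps of this flow can be sketched at the bit level as follows. The codeword widths come from the run before equal to zero row of Table II; the function names `ones_width` and `count_skipped`, the string representation of the bit-stream, and the `max_symbols` bound are assumptions made for illustration only.

```python
def ones_width(zero_left):
    """Width of the all-ones codeword for run before = 0 (Table II, row 0)."""
    if zero_left <= 2:
        return 1
    if zero_left <= 6:
        return 2
    return 3


def count_skipped(bits, zero_left, max_symbols):
    """Count leading run befores equal to zero in a bit string.

    Returns (skipping number, bits consumed). zeroLeft does not change
    while run before is zero, so the codeword width stays constant.
    """
    width = ones_width(zero_left)
    all_ones = "1" * width
    skipped = 0
    while (skipped < max_symbols
           and bits[skipped * width:(skipped + 1) * width] == all_ones):
        skipped += 1
    return skipped, skipped * width


if __name__ == "__main__":
    # zeroLeft = 4 uses two-bit codewords: "11", "11", then "10" (nonzero run).
    print(count_skipped("111110", 4, 10))
```

After the skipped codewords are removed, the next codeword is the one with a nonzero run before, which would be resolved by a table look-up in the same cycle.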
IV. Proposed Architecture for CAVLC Decoder

Based on the proposed CAVLC decoding flow, the corresponding block diagram is shown in Fig. 9. The key components are the Control Unit, TotalZeros Decoder, Run before Decoder, Level Decoder pair, Coeff Token & TrailingOnes sign Decoder, Input Unit, and Output Unit. The Control Unit manages the whole decoding flow and provides all control signals and data to the corresponding units.
TABLE III
Percentage of Group in Each Video Sequence with QP = 28

No. of Groups   1       2       3       4       5       6       7       8
Weather         12.97   27.23   30.80   20.48   7.120   1.143   0.236   0.001
Table           24.67   49.46   19.63   5.310   0.834   0.076   0.004   0.000
Stefan          9.240   30.25   31.50   19.45   7.820   1.590   0.137   0.003
Silent          28.11   40.84   22.46   7.141   1.335   0.101   0.001   0.000
Mother          26.79   41.23   21.82   8.076   1.864   0.196   0.010   0.000
Mobile          5.831   29.27   31.21   21.68   9.393   2.313   0.287   0.004
Hall            30.03   37.70   21.78   7.528   2.530   0.362   0.052   0.000
Coast           8.468   42.89   32.73   11.62   3.503   0.698   0.073   0.001
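The share of blocks with one to four groups can be recomputed from Table III with a few lines of code. The sketch below uses the values exactly as transcribed in the table; the names `TABLE_III` and `share_up_to` are hypothetical. With these values the per-sequence share ranges from about 88% (Mobile) to about 99% (Table), and the unweighted mean over the eight sequences is about 94.8%, so the average quoted in the text may be weighted differently; that reading is an assumption.

```python
# Percentages of blocks having 1..8 groups, per sequence (from Table III).
TABLE_III = {
    "Weather": [12.97, 27.23, 30.80, 20.48, 7.120, 1.143, 0.236, 0.001],
    "Table":   [24.67, 49.46, 19.63, 5.310, 0.834, 0.076, 0.004, 0.000],
    "Stefan":  [9.240, 30.25, 31.50, 19.45, 7.820, 1.590, 0.137, 0.003],
    "Silent":  [28.11, 40.84, 22.46, 7.141, 1.335, 0.101, 0.001, 0.000],
    "Mother":  [26.79, 41.23, 21.82, 8.076, 1.864, 0.196, 0.010, 0.000],
    "Mobile":  [5.831, 29.27, 31.21, 21.68, 9.393, 2.313, 0.287, 0.004],
    "Hall":    [30.03, 37.70, 21.78, 7.528, 2.530, 0.362, 0.052, 0.000],
    "Coast":   [8.468, 42.89, 32.73, 11.62, 3.503, 0.698, 0.073, 0.001],
}


def share_up_to(max_groups, percentages):
    """Percentage of blocks that have at most max_groups groups."""
    return sum(percentages[:max_groups])


if __name__ == "__main__":
    shares = {name: share_up_to(4, row) for name, row in TABLE_III.items()}
    for name, share in shares.items():
        print(f"{name:8s} {share:6.2f}%")
    print(f"mean     {sum(shares.values()) / len(shares):6.2f}%")
```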
Fig. 9. Architecture of the proposed CAVLC decoder.
Fig. 10. Proposed architecture of LUT for Coeff Token Decoder.
A. Coeff Token and TrailingOnes Sign Decoder

Fig. 10 shows the Coeff Token Decoder constructed with a look-up table. A direct look-up table implementation of the Coeff Token Decoder costs a large area, because the prefix of a codeword contains many zeros before its leading one. Reviewing the VLC tables of Coeff token, we find that much redundant data can be eliminated. To save hardware cost, we classify and merge these tables. The zero counter shown in Fig. 11 counts the number of zeros before the leading one, so the look-up table can be built without those zeros. We construct five VLC decoders for the different ranges of nC. An example for the range 0 <= nC < 2 is listed on the right-hand side of Fig. 10. According to the prefix, we partition the table into groups, where each group corresponds to one number of zeros before the leading one. Within each group, TrailingOnes and TotalCoeff are indexed by the suffix. For example, for 00000101..., the prefix 000001 selects the fifth group, and the suffix 01 is decoded to 2 for TrailingOnes and 4 for TotalCoeff. In addition, we simplify the zero counter of the previous design [9]. The leading one in the bit-stream can then be detected to address TotalCoeff and TrailingOnes.
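The prefix/suffix split described above can be sketched in software as follows. This is only an illustration: `PARTIAL_LUT` contains just the single entry given in the text (five leading zeros, suffix 01), the fixed `suffix_len` of two bits is an assumption that holds for this entry only, and the names `count_leading_zeros` and `decode_coeff_token` are hypothetical.

```python
def count_leading_zeros(bits):
    """Software stand-in for the zero counter: zeros before the leading one."""
    count = 0
    for bit in bits:
        if bit != "0":
            break
        count += 1
    return count


# group (number of leading zeros) -> suffix -> (TrailingOnes, TotalCoeff).
# Only the entry quoted in the text for 0 <= nC < 2 is filled in.
PARTIAL_LUT = {5: {"01": (2, 4)}}


def decode_coeff_token(bits, suffix_len=2):
    """Return (TrailingOnes, TotalCoeff, bits consumed) for one codeword."""
    zeros = count_leading_zeros(bits)
    start = zeros + 1  # skip the leading one that terminates the prefix
    suffix = bits[start:start + suffix_len]
    trailing_ones, total_coeff = PARTIAL_LUT[zeros][suffix]
    return trailing_ones, total_coeff, start + suffix_len


if __name__ == "__main__":
    print(decode_coeff_token("00000101110"))
```

Because the leading zeros are counted rather than stored, each group only needs to be indexed by its short suffix, which is the source of the area saving claimed for the merged tables.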
Fig. 11. Architecture of the zero counter.
B. Level Decoder

To meet the multisymbol requirement, the architecture based on the proposed MLD method is shown in Fig. 12. The proposed Level Decoder is designed with a parallel technique. There are four major blocks: the Prefix Decoder, LevelSuffixSize Calculator, Level Suffix Decoder, and Level Recover. The Prefix Decoder detects the length of the prefix. The LevelSuffixSize Calculator calculates the length of the suffix. The Level Suffix Decoder is designed as
Fig. 12. Architecture of Level Decoder.
a table and used to decode the suffix. These four modules are implemented directly from the standard. The prefix and suffix lengths calculated by the Prefix Decoder and the LevelSuffixSize Calculator are added together to determine the bit length. Because of the data dependency, the second part of the Level Decoder must first remove the bits consumed by the first part. Therefore, the barrel shifters in the two parts of the Level Decoder have different sizes. The barrel shifters (BS32 and BS48) are constructed from the Synopsys library. Due to the characteristics of variable length codes, long codewords occur infrequently. Two-level decoding is supported only when the total codeword length is less than 32 bits. With the gated clock, the second part of the Level Decoder can stay idle to save power. The Level Recover decodes the level after the prefix and suffix are decoded.

The cascaded architecture of the Level Decoder induces a critical path with a large delay. We therefore apply a retiming technique on the critical path to meet the timing requirement. Because the number of used bits can be determined after the levelCode Calculator, a register is inserted there to shorten the critical path. The used bits are shifted out in the next cycle, and the Level Decoder can start to decode the next level. One of the retiming registers is also illustrated in Fig. 12. The decoded symbols are stored in the output unit, which supports two inputs at the same time.

C. Total Zeros Decoder

The architecture of the Total zeros Decoder is similar to that of the Coeff Token Decoder. Analyzing the VLC tables of total zeros, we split each table into several small ones based on the number of zeros before the leading one. The table selection circuit is rearranged to obtain a balanced and minimum path delay.

D. Run Before Decoder

The block diagram of the Run Before Decoder is shown in Fig. 13. 
Before run before decoding starts, the table is selected by the parameter zeroLeft. The codeword
Fig. 13. Architecture of Run Before Decoder.
is a group of ones if run before equals zero. Since the Ones Counter detects how many ones exist before the leading zero in the bit-stream, we can decode several nC that need no zero insertion at the same time. Consider the example 0 0 0 0 0 0 X 0 X X X X X X X X: in this case, the maximum length of the run of ones is 23, so the Ones Counter needs a 24-bit input for counting. In the other cases, where run before is not equal to zero, decoding is performed by a LUT. In the next unit, the decoded run before provides the information needed to insert zeros and reorder the levels decoded by the level unit.

E. Output Unit

This is a very important unit in our CAVLC decoder because it has to support a higher symbol input rate. The output unit inserts zeros between the nC obtained from the preceding decoding steps and provides the coefficients in the correct order to inverse quantization (IQ). The implementation of reordering is straightforward, so we focus on the zero insertion circuit. This circuit is partitioned into three parts. Because the location of zero insertion is not regular in each iteration, part-1 obtains the location from the parameters skip and run, which come from run before decoding. According to the result of part-1, part-2 divides the nCs into two sides (left side and right side). The zeros between these two sides are produced by a left shifter. The coefficients are then combined and stored in the buffer in part-3 for
TABLE IV
Comparison of Average Cycle per Macroblock (NZS + MLD)
Average (cycles/MB)

Coding            Design      Coastguard  Hall    Mobile  Mother  Silent  Stefan  Table   Weather
Intra, QP = 20    [11]        327.78      201.57  364.70  205.33  253.67  293.68  270.63  301.26
                  Proposed    234.38      143.52  241.32  148.02  181.92  197.57  198.47  199.19
                  Reduced %   29.50       29.80   34.84   28.00   28.29   32.73   26.67   33.89
Intra, QP = 28    [11]        198.55      135.02  274.83  124.30  157.54  219.91  125.16  234.57
                  Proposed    150.12      100.43  195.99  94.59   119.47  158.27  96.89   164.64
                  Reduced %   25.40       25.62   28.68   23.90   24.16   28.02   22.58   29.81
General, QP = 20  [11]        270.54      132.33  303.82  145.86  162.50  235.89  193.78  191.34
                  Proposed    197.50      95.83   209.29  107.00  118.05  163.78  144.08  128.58
                  Reduced %   26.99       27.58   31.11   26.64   27.35   30.56   25.64   32.80
General, QP = 28  [11]        146.24      80.81   201.16  82.08   97.16   159.12  84.50   143.25
                  Proposed    111.71      60.40   146.24  63.15   74.26   116.34  66.08   101.45
                  Reduced %   23.61       25.25   27.30   23.06   23.56   26.88   21.79   29.17

TABLE V
Comparison of Throughput (Macroblock/S)

Design      Frequency (MHz)   Average Cycle (per MB)   Throughput (mega MB/s)
[11]        125               193                      0.648
[12]        71.43             480                      0.149
[14]        175               312                      0.561
[20]        74.25             279                      0.266
Proposed    160               141                      1.17

Fig. 14. Architecture of Output Unit.
the next iteration. The 16& means a 16×12 register array. The 16| means a 16-bit bitwise OR. The 16 means a 16-bit inverter. The DFF16 is a 16-bit register that records one bit of each coefficient, and there are 12 copies of this circuit. The architecture of the output unit is shown in Fig. 14.
V. Performance Evaluation, Comparison, and Implementation

A. Performance Evaluation and Comparison

Because the number of decoding cycles of a CAVLC decoder is not fixed, we have constructed a software-level simulator for the CAVLC decoder. To evaluate the performance of the proposed algorithm and verify the design, the H.264 reference software JM11.0 [19] is used to generate the test patterns. We have run several video sequences with various QP in CIF resolution, such as Weather, Hall Monitor, Mother Daughter, and Mobile Calendar. For the comparison with other designs, we choose [11] as the benchmark because it provides better performance than the others. The comparisons are given in Table IV. The simulation uses the intra-only decoder and the general decoder with QP = 20 and 28. The proposed Run Before Decoder architecture can decode several nCs without any zero
insertion at the same time, and the proposed Level Decoder architecture can decode two levels at a time when the total number of bits of the two levels is smaller than 32. In contrast, the architecture of [11] can decode only two run befores and one level at a time. As a result, the decoding cycles of the proposed design can be reduced by up to around 30% when both NZS and MLD are implemented. Compared with the general decoder, the intra-only decoder saves an extra 1% of clock cycles relative to [11] at the same QP.

We also evaluate the average number of cycles for processing a macroblock. Table V shows the maximum throughput of each design. The proposed design has an average throughput of 1.17×10^6 macroblocks per second, which is at least 1.8 times that of the other designs. Furthermore, Table VI shows the lowest frequency at which each design meets the real-time requirement at different resolutions. The operation frequency of our design is 56% lower than [14], 71.5% lower than [12], 29.1% lower than [11], 50.8% lower than [20], and 85% lower than [22]. Based on these comparison results, the proposed design can achieve low power consumption and is suitable for portable devices thanks to its lower operation frequency.

B. VLSI Implementation and Power Evaluation

We use Verilog HDL to implement the proposed design and DesignCompiler for synthesis. The synthesis report shows that the maximum operation frequency can reach 160 MHz.
TABLE VI
Performance Comparison with Different Resolution in 30 Frames/S

Design      720×576     1280×720    1920×1088
[11]        9.4 MHz     20.8 MHz    47.3 MHz
[12]        23.3 MHz    51.8 MHz    -
[14]        15.2 MHz    33.7 MHz    76.4 MHz
[20]        13.6 MHz    30.1 MHz    68.3 MHz
[22]        -           -           213 MHz
Proposed    6.9 MHz     15.2 MHz    34.5 MHz
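The lowest real-time frequencies in Table VI follow from the average cycles per macroblock. The sketch below shows the arithmetic under the assumptions that macroblocks are 16×16 pixels, that the frame dimensions are multiples of 16, and that only CAVLC decoding cycles are counted; the function names are hypothetical.

```python
def macroblocks_per_frame(width, height, mb_size=16):
    """Number of macroblocks in one frame (dimensions assumed multiples of 16)."""
    return (width // mb_size) * (height // mb_size)


def required_mhz(width, height, cycles_per_mb, fps=30):
    """Lowest clock frequency in MHz that sustains fps frames per second."""
    cycles_per_second = macroblocks_per_frame(width, height) * fps * cycles_per_mb
    return cycles_per_second / 1e6


if __name__ == "__main__":
    # 141 cycles/MB is the average reported for the proposed design.
    for width, height in [(720, 576), (1280, 720), (1920, 1088)]:
        print(f"{width}x{height}: {required_mhz(width, height, 141):.1f} MHz")
```

With 141 cycles per macroblock, this gives about 6.9, 15.2, and 34.5 MHz for the three resolutions, in line with the Proposed row of Table VI.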
TABLE VII
Hardware Cost Comparison (Gate Count)

Design      CAVLD   Output Unit   Predict Register   All
[11]        -       -             -                  13 192
[12]        -       -             -                  10 000
[14]        4720    9705          2430               16 855
[21]        -       -             -                  19 027
[22]        -       -             -                  6883.8
Proposed    9298    3877          -                  13 175

Fig. 15. Simulation result of power consumption.
Fig. 16. Layout view of the proposed CAVLC decoder.

TABLE VIII
Summarization of Chip

Technology                        TSMC 0.18 µm 1P6M CMOS process
Supply voltage                    1.8 V
Core size                         719.46 × 715.68 µm² (utilization: 51.1%)
Gate count (2-input NAND gate)    Input unit 3561; Coeff Token 1055; Total Zero 396; Ctrl 659; Run 1306; Level 2322; Output 3877; Total 13 175
On-chip memory                    None
Operation frequency               6.7 MHz for D1; 33.5 MHz for HD1080i
Power consumption                 1.83 mW for D1; 9.18 mW for HD1080i
Package                           100 CQFP

To compare the gate count with [11], [12], and [14], we re-synthesize the proposed design at 125 MHz so that it is evaluated under the same conditions as those designs. The hardware cost comparison is given in Table VII. In [14], the authors used a 1152-bit RAM as interleaved double stacks. As described in Section IV-E, our design uses the 16×12-bit register for this function. For a fair comparison with [14], we calculate the equivalent gate count of the RAM by assuming that one bit occupies six gates. With this interpretation, the gate count of the output unit in [14] is approximately 9705 gates. In addition, even though [12] has the advantage of a low gate count, its throughput is much lower than ours. Consequently, our design has the lowest operation frequency without overhead in hardware cost. The gate count of [22] is 6883.8, which is about 52% of that of the proposed architecture. However, the proposed architecture needs only 15% of the operation frequency of [22] to achieve the same video specification.

For the power consumption, we use Synopsys PrimePower and post-layout simulation waveforms to evaluate the power consumption at different frequencies. We choose four different video sequences, following [16]. The simulation results are illustrated in Fig. 15. For the real-time requirement, the power consumption of our design is only 9.18 mW at 33.5 MHz with the HD1080i video format.
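The gate-count figures discussed above can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in the text and in Table VII; the variable names are hypothetical, and treating the 6912 RAM-equivalent gates as part of the 9705-gate output unit of [14] is our reading of the text rather than something it states explicitly.

```python
# Equivalent gate count of the 1152-bit RAM in [14], at six gates per bit.
GATES_PER_RAM_BIT = 6
RAM_BITS_IN_14 = 1152
ram_equivalent_gates = RAM_BITS_IN_14 * GATES_PER_RAM_BIT

# Row sums of Table VII for [14] and for the proposed design.
total_14 = 4720 + 9705 + 2430      # CAVLD + Output Unit + Predict Register
total_proposed = 9298 + 3877       # CAVLD + Output Unit

# Gate count of [22] relative to the proposed design.
ratio_22 = 6883.8 / total_proposed

if __name__ == "__main__":
    print(ram_equivalent_gates, total_14, total_proposed, round(ratio_22, 2))
```

The row sums reproduce the totals of 16 855 and 13 175 gates listed in Table VII, and the ratio of about 0.52 matches the 52% quoted for [22].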
Fig. 16 shows the layout view of the proposed CAVLC decoder, and the implementation results of this chip are summarized in Table VIII, which also lists the detailed gate count of each module. The core size is 719 × 719 µm using the TSMC 0.18 µm 1P6M CMOS process. The gate count of this chip is 13 175 gates without SRAM.
VI. Conclusion

In this paper, a novel high-performance CAVLC decoder and its VLSI implementation are presented for the MPEG-4 AVC/H.264 baseline profile. By analyzing the CAVLC algorithm and the structure of the VLC tables, we proposed a new algorithm for the Run Before Decoder and a parallel architecture for the level decoder, called NZS and MLD, respectively. With the aid of these two methods, the number of decoding cycles is greatly reduced. Furthermore, the proposed algorithm is especially suitable when the bit stream is coded by an intra-only encoder. Some architecture optimization of the critical path is provided to enhance the performance. One important advantage is that our design allows the H.264 decoder to run at a low operation frequency. The evaluation result showed
that the proposed CAVLC decoder needs the fewest decoding cycles. Because a low operation frequency can be used, low power consumption is achieved. Compared with previous designs, our design has the lowest operation frequency without hardware overhead for the same applications.
References
[1] Advanced Video Coding, document JVT-E022, Final Committee Draft, ITU-T Rec. H.264/ISO/IEC 14496-10, Sep. 2002.
[2] V. Lappalainen, A. Hallapuro, and T. D. Hämäläinen, "Complexity of optimized H.26L video decoder," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 717-725, Jul. 2003.
[3] M. Horowitz, A. Joch, F. Kossentini, and A. Hallapuro, "H.264/AVC baseline profile decoder complexity analysis," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 704-716, Jul. 2003.
[4] X. Quan, L. Jilin, W. Shijie, and Z. Jiandong, "H.264/AVC baseline profile decoder optimization on independent platform," in Proc. WCNM, Sep. 2005, pp. 1253-1256.
[5] X. Qin and X. Yan, "A memory and speed efficient CAVLC decoder," Proc. SPIE, vol. 5960, pp. 1418-1426, Jul. 2005 [Online]. Available: http://spiedigitallibrary.org/proceedings/resource/2/psisdg/5960/1/596046_1
[6] Y. C. Chao, S. T. Wei, J. F. Yang, and B. D. Liu, "Combined CAVLC decoder and inverse quantizer for efficient H.264/AVC decoding," in Proc. IEEE Int. Conf. Asia Pacific Conf. Circuits Syst., Dec. 2006, pp. 259-262.
[7] S. Y. Tseng and T. W. Hsieh, "A pattern-search method for H.264/AVC CAVLC decoding," in Proc. IEEE Int. Conf. Multimedia Expo, Jul. 2006, pp. 1073-1076.
[8] Y. H. Moon, "A new coeff-token decoding method with efficient memory access in H.264/AVC video coding standard," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 6, pp. 729-736, Jun. 2007.
[9] W. Di, G. Wen, H. Mingzeng, and Z. Ji, "A VLSI architecture design of CAVLC decoder," in Proc. 5th Int. Conf. ASIC, Oct. 2003, pp. 962-965.
[10] M. A. J. Biswas and S. K. Nandy, "High performance VLSI architecture design for H.264 CAVLC decoder," in Proc. ASAP, Sep. 2006, pp. 317-322.
[11] G. S. Yu and T. S. Chang, "A zero-skipping multi-symbol CAVLC decoder for MPEG-4 AVC/H.264," in Proc. IEEE Int. Conf. Circuits Syst., May 2006, pp. 5583-5586.
[12] Y. N. Wen, G. L. Wu, S. J. Chen, and Y. H. Hu, "Multiple-symbol parallel CAVLC decoder for H.264/AVC," in Proc. IEEE Int. Conf. Asia Pacific Conf. Circuits Syst., Dec. 2006, pp. 1240-1243.
[13] T. H. Tsai, D. L. Fang, and Y. N. Pan, "A hybrid CAVLD architecture design with low complexity and low power considerations," in Proc. IEEE Int. Conf. Multimedia Expo, Jul. 2007, pp. 1910-1913.
[14] H. C. Chang, C. C. Lin, and J. I. Guo, "A novel low-cost high-performance VLSI architecture for MPEG-4 AVC/H.264 CAVLC decoding," in Proc. IEEE Int. Conf. Circuits Syst., May 2005, pp. 6110-6113.
[15] Y. H. Moon, G. Y. Kim, and J. H. Kim, "An efficient decoding of CAVLC in H.264/AVC video coding standard," IEEE Trans. Consumer Electron., vol. 51, no. 3, pp. 933-938, Aug. 2005.
[16] H. Y. Lin, Y. H. Lu, B. D. Liu, and J. F. Yang, "Low power design of H.264 CAVLC decoder," in Proc. IEEE Int. Conf. Circuits Syst., May 2006, pp. 2689-2692.
[17] I. E. G. Richardson, H.264 and MPEG-4 Video Compression: Video Coding for Next Generation Multimedia. New York: Wiley, 2003, pp. 198-207.
[18] T. L. Chang, Y. M. Tsai, C. D. Chien, C. C. Lin, and J. I. Guo, "A high-performance MPEG4 bitstream processing core," in Proc. ICME, Jun. 2004, pp. 467-470.
[19] K. Sühring, Ed. (2007). JVT Reference Software JM 11.0 [Online]. Available: http://bs.hhi.de/~suehring/tml
[20] T. G. George and N. Malmurugan, "A new fast architecture for HD H.264 CAVLC multi-syntax decoder and its FPGA implementation," in Proc. ICCIMA, Dec. 2007, pp. 118-122.
[21] J. Moon and S. Lee, "Design of H.264 AVC entropy decoder without internal ROM/RAM memories," in Proc. ISCCSP, Mar. 2008, pp. 1464-1467.
[22] H. Y. Lin, Y. H. Lu, B. D. Liu, and J. F. Yang, "A highly efficient VLSI architecture for H.264/AVC CAVLC decoder," IEEE Trans. Multimedia, vol. 10, no. 1, pp. 31-42, Jan. 2008.
[23] C. C. Cheng, C. W. Ku, and T. S. Chang, "A 1280 × 720 pixels 30 frames/s H.264/MPEG-4 AVC intra encoder," in Proc. IEEE Int. Conf. Circuits Syst., May 2006, pp. 5338-5341.
[24] C. H. Chang, J. W. Chen, H. C. Chang, Y. C. Yang, J. S. Wang, and J. I. Guo, "A quality scalable H.264/AVC baseline intra encoder for high definition video applications," in Proc. IEEE Workshop Signal Process., Oct. 2007, pp. 521-526.
[25] Y. K. Lin, C. W. Ku, D. W. Li, and T. S. Chang, "A 140-MHz 94 K gates HD1080p 30-frames/s intra-only profile H.264 encoder," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 3, pp. 432-436, May 2009.

Tsung-Han Tsai (M'01) received the B.S., M.S., and Ph.D. degrees in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1990, 1994, and 1998, respectively. From 1999 to 2000, he was a Professor of electronic engineering with Fu Jen University, Taipei. In 2000, he joined the Department of Electrical Engineering, National Central University, Taoyuan, Taiwan, where he is currently a Professor. He holds 14 patents and has published more than 120 refereed papers in international journals and conferences. His current research interests include very large scale integration signal processing, video/audio coding algorithms, DSP architecture design, wireless communication, and system-on-chip design.
Dr. Tsai has been an IEEE member for over 10 years, and is a member of the Audio Engineering Society, New York, NY, and the Institute of Electronics, Information and Communication Engineers, Japan. He received the Industrial Cooperation Award in 2003 from the Ministry of Education, Taiwan. He is a member of the Technical Committee of the IEEE Circuits and Systems Society and serves as a technical program committee member or session chair of several international conferences.

Te-Lung Fang received the B.S. degree in electronics engineering from the National Yunlin University of Science and Technology, Yunlin, Taiwan, in 2001, and the M.S. degree in electrical engineering from National Central University, Taoyuan, Taiwan, in 2008. He is currently with the Department of Electrical Engineering, National Central University. His current research interests include video, image processing, and very large scale integration design for video systems.

Yu-Nan Pan received the B.S. degree in electronics engineering from Chung Yuan Christian University, Jhongli, Taiwan, in 2002, and the M.S. degree in electrical engineering from National Central University, Taoyuan, Taiwan, in 2004. He is currently pursuing the Ph.D. degree from the Department of Electrical Engineering, National Central University. His current research interests include video, image processing, and very large scale integration design for video and image systems.


1051-8215/$26.00 © 2011 IEEE

Analysis of the complexity of CAVLC decoder.

Diagram of multiple level decoding for level decoder.

Flowchart of nonzero skipping decoding for run before decoder.

Proposed architecture of LUT for Coeff Token Decoder.

Architecture of the proposed CAVLC decoder.

Architecture of the zero counter.

Architecture of Level Decoder.

Architecture of Run Before Decoder.

Architecture of Output Unit.
