

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 15, NO. 4, APRIL 2007

into eight segments. When mapped onto the Virtex-4 XC4VLX100-12 FPGA combinatorially using slices only, the design in [8] occupies 72 slices and has a delay of 11 ns. A 10-bit x/10-bit y implementation using our approach leads to an area of 221 slices and a delay of 33 ns, which is larger by a factor of three in both area and delay. However, the Lin et al. design exhibits a maximum error of 85 ulp and a mean-squared error of 2.332 × 10^-4, which is two orders of magnitude larger than the maximum error of 0.68 ulp and three orders of magnitude larger than the mean-squared error of 8.50 × 10^-8 provided by the design methodology we describe.
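As a quick check of the order-of-magnitude claims above, using only the figures quoted in the text (our arithmetic, not the authors'):

```python
import math

# Error figures quoted above.
max_err_lin, max_err_ours = 85.0, 0.68       # ulp
mse_lin, mse_ours = 2.332e-4, 8.50e-8

# 85 / 0.68 ≈ 125: roughly two orders of magnitude.
ratio_max = max_err_lin / max_err_ours
# 2.332e-4 / 8.50e-8 ≈ 2743: roughly three orders of magnitude.
ratio_mse = mse_lin / mse_ours

assert round(math.log10(ratio_max)) == 2
assert round(math.log10(ratio_mse)) == 3
```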

A Processor-In-Memory Architecture for Multimedia Compression


Brandon J. Jasionowski, Michelle K. Lay, and Martin Margala

V. CONCLUSION

A flexible and highly efficient hardware architecture for precise gamma correction via piece-wise linear polynomial approximations has been presented. The flexibility of the architecture allows the support of arbitrary gamma values, input bit widths, and output bit widths. The gamma correction curve is segmented in a nonuniform manner, resulting in a low segment count while also allowing hardware-efficient polynomial coefficient indexing. Analytical bit-width analysis has been described for deriving the minimal integer and fractional bit widths for each signal in the data path. The analysis allows the outputs to exhibit precision comparable to that of a direct table lookup approach. Experimental results for combinatorial and pipelined implementations on a Xilinx Virtex-4 FPGA have been presented, showing significant reductions in memory sizes over direct table lookups.

Abstract—This paper presents the design and development of a novel, low-complexity processor-in-memory (PIM) architecture for image and video compression. By integrating a novel processing element with SRAM, bandwidth is improved and latency is greatly reduced. This paper also presents PIM design techniques for reduced power, area, and complexity for rapid deployment and reduced cost. A design methodology is presented, followed by an analysis of the processing element's performance and capabilities. The proposed datapath solution delivers 2 to 40 times higher performance compared to other presented solutions. The architecture executes discrete cosine and wavelet transforms, achieving up to 40% higher throughput per watt and occupying as little as 0.9% of the area compared to a commercial digital signal processor and other application-specific integrated circuit implementations while maintaining precision. A comprehensive comparative analysis is also provided. The proposed processor-in-memory is implemented in 1.8-V 0.18-μm CMOS technology and operates with a 300-MHz clock.

Index Terms—DCT, DWT, image and video compression, low-power architecture, processor-in-memory (PIM), VLSI.

I. INTRODUCTION

There is a growing demand for mobile communication and portable computing, with cellular technology and digital photography permeating the mainstream. Therefore, the need for high-speed and low-power signal processing is apparent, especially concerning portable devices, which require extended operation, high speed, and small area. The proposed architecture utilizes a multiplier-based application-specific processor (ASP). Specifically, the ASP is constructed to compute key algorithms and operations essential to image and video processing in order to minimize complexity and power consumption, delivering high throughput at very low cost. The focus of this paper is mainly on the datapath design. The system-level design has been previously described in [1] and the memory architecture in [2]. This paper is organized as follows. Section II presents the design tradeoffs, optimizations, and the proposed processor-in-memory (PIM) architecture. Simulation and synthesis results are given in Section III. Section IV describes the comparative study and the discussion of the results. Concluding remarks are made in Section V.

REFERENCES
[1] C. Poynton, "Gamma and its disguises: The nonlinear mappings of intensity in perception, CRTs, film and video," SMPTE J., vol. 102, no. 12, pp. 1099–1108, Dec. 1993.
[2] S. Kang, H. Do, B. Cho, S. Chien, and H. Tae, "Improvement of low gray-level linearity using perceived luminance of human visual system in PDP-TV," IEEE Trans. Consum. Electron., vol. 51, no. 1, pp. 204–209, Feb. 2005.
[3] S. Hecht, "A theory of visual intensity discrimination," J. General Physiol., vol. 18, no. 5, pp. 767–789, 1935.
[4] K. Akeley, "Reality engine graphics," in Proc. ACM Int. Conf. Comput. Graph. Interactive Techn., 1993, pp. 109–116.
[5] B. Lucas, "Method and apparatus for converting floating-point pixel values to byte pixel values by table lookup," U.S. Patent 5 528 741, Jun. 18, 1996.
[6] J. Kim, B. Choi, and O. Kwon, "1-billion-color TFT-LCD TV with full HD format," IEEE Trans. Consum. Electron., vol. 51, no. 4, pp. 1042–1050, Nov. 2005.
[7] D. Warren, A. Bowen, and D. Dignam, "Floating point gamma correction method and system," U.S. Patent 6 304 300, Oct. 16, 2001.
[8] T. Lin, H. Cheng, and C. Kung, "Adaptive piece-wise approximation method for gamma correction," U.S. Patent 6 292 165, Sep. 18, 2001.
[9] E. Kim, S. Jang, S. Lee, T. Jung, and K. Sohng, "Optimal piece linear segments of gamma correction for CMOS image sensors," IEICE Trans. Electron., vol. E88-C, no. 11, pp. 2090–2093, Nov. 2005.
[10] D. Lee, W. Luk, J. Villasenor, and P. Cheung, "Hierarchical segmentation schemes for function evaluation," in Proc. IEEE Int. Conf. Field-Program. Technol., 2003, pp. 92–99.
[11] ATI Technologies Inc., Markham, ON, Canada, "Radeon X1900 graphics technology: GPU specifications," (2006). [Online]. Available: http://www.ati.com/products/RadeonX1900/specs.html
[12] D. Lee, A. Abdul Gaffar, R. Cheung, O. Mencer, W. Luk, and G. Constantinides, "Accuracy-guaranteed bit-width optimization," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 25, no. 10, pp. 1990–2000, Oct. 2006.
[13] Kodak, Rochester, NY, "Leaf Aptus 75 specifications," (2005). [Online]. Available: http://www.leaf-photography.com

II. PIM ARCHITECTURE

In order to fully take advantage of the bandwidth available in a PIM design, it is vital to maximize parallelism. Cavalli et al. performed an analysis on various MPEG-4 algorithms and found that motion estimation/compensation and DCT coding require the largest amount of

Manuscript received January 7, 2005; revised September 6, 2005 and July 14, 2006. B. J. Jasionowski is with the SET Corporation, Vienna, VA 22180 USA. M. K. Lay is with the U.S. Patent and Trademark Office, Alexandria, VA 22314 USA. M. Margala is with the Department of Electrical and Computer Engineering, University of Massachusetts at Lowell, Lowell, MA 01854 USA (e-mail: martin_margala@uml.edu). Digital Object Identifier 10.1109/TVLSI.2007.893672

1063-8210/$25.00 © 2007 IEEE


C. DCT

The DCT hardware is merely a control unit, determining the operations to be performed and storing intermediate results. Fig. 3 converts the flow graph into block diagram form, displaying both the even and odd decomposition. To perform a 1-D DCT, the block diagram needs to be executed twice, for both the even and odd outputs Y. Each DCT block utilizes two SubPE units for computation. Thus, four 8-bit additions, four 8-bit multiplications, or two 16-bit additions may be performed at one time (in parallel). The coefficients a–g represent scaled values from the cosine basis, where c_k = cos(kπ/16) and a = 0.5c_1, b = 0.5c_2, c = 0.5c_3, d = 0.5c_4, e = 0.5c_5, f = 0.5c_6, g = 0.5c_7. Fig. 3 shows that a 1-D DCT requires eight 8-bit additions, 24 16-bit additions, and 32 8-bit multiplications. The 1-D DCT is performed on each row (eight rows), and again on each column (eight columns). This equates to 16 1-D DCTs for an 8 × 8 2-D DCT. The SubPE performs fixed-width two's complement arithmetic. Given that the cosine basis function of the DCT consists of fractions, the SubPE must accommodate two's complement, fixed-point decimal multiplication. A fraction is represented in (1) as

x_m = −x_m^(0) + Σ_{j=1}^{B−1} x_m^(j) · 2^(−j)    (1)

Fig. 1. PE capable of performing 8-bit operations.
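The representation in (1) and the scaled cosine coefficients a–g can be illustrated with a short script; the rounding quantizer and all names here are our own sketch, not the SubPE hardware:

```python
import math

def to_fixed(x, B):
    """Encode x in [-1, 1) as a B-bit two's-complement fraction per (1):
    the sign bit x^(0) has weight -1, and bit x^(j) has weight 2^-j."""
    return int(round(x * (1 << (B - 1)))) & ((1 << B) - 1)

def from_fixed(code, B):
    """Decode by summing the bit weights exactly as written in (1)."""
    bits = [(code >> (B - 1 - j)) & 1 for j in range(B)]
    return -bits[0] + sum(bits[j] * 2.0 ** (-j) for j in range(1, B))

# Scaled cosine-basis coefficients a..g, with c_k = cos(k*pi/16).
coeffs = {name: 0.5 * math.cos(k * math.pi / 16)
          for name, k in zip("abcdefg", range(1, 8))}

B = 8
for val in coeffs.values():
    # Round-tripping stays within one LSB (2^(1-B)) of the exact value.
    assert abs(from_fixed(to_fixed(val, B), B) - val) < 2.0 ** (1 - B)
```

For example, −0.5 encodes to the bit pattern 11000000 (−1 + 2⁻¹), matching the sign-bit weighting in (1).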

computation [3]. Therefore, a PIM would be well suited for block processing due to the immense bandwidth requirements and low complexity. The data words are also short, which is very desirable for reduced area consumption. A PIM is formed by attaching SubPEs (two coupled processing elements) to an existing RAM. Each SubPE is pitch-matched to eight words, which corresponds to the width of one 8 × 8 block.

A. PE

In order to reduce the area of processing logic in a RAM, arithmetic operations are merged. This process maintains the maximum functionality required for data processing, while conserving precious silicon real estate. The proposed PE implements a family of 8- and 16-bit operations, including addition, subtraction, and multiplication, through a partial product decomposition-based multiplier proposed by Lin and Margala [4]. This decomposition may be recursively applied to form even larger multipliers. For the PE architecture, 8- and 16-bit operations are required; therefore, this process is employed twice to perform 16 × 16-bit multiplication. The increased cycle count is offset by the improvement in data parallelism of a PIM design. The PE (Fig. 1) is the computational unit for the PIM design, and was first proposed by Margala and Lin [1]. The operation is controlled by the instruction vector i, which is determined each cycle by a finite-state machine (FSM). Multiplexers and switches, controlled by the instruction vector, direct the data to the arithmetic units to exploit hardware reuse.

B. SubPE

Two PEs are combined to form a SubPE unit using additional hardware to enable 16-bit operations, which are necessary for higher precision (Fig. 2). In order to meet the SRAM timing requirements, pipeline registers are inserted at the end of each PE output to allow for a multicycle implementation.
The lower portion of the architecture either combines the partial products of 16 × 16-bit multiplication, or it may serve to keep the results from the PEs separate for 8-bit operations. Multiplexers between the PEs allow a carry to propagate for 16-bit addition.
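The recursive partial-product decomposition described above reduces a 16 × 16 multiply to four 8 × 8 products combined by shifts and adds. The following sketch shows only the arithmetic identity behind the Lin and Margala multiplier, not the circuit; all names are ours:

```python
def mul16_from_mul8(x, y):
    """Decompose an unsigned 16x16-bit multiply into four 8x8-bit
    partial products, shifted to their byte positions and summed."""
    assert 0 <= x < (1 << 16) and 0 <= y < (1 << 16)
    xh, xl = x >> 8, x & 0xFF   # high and low bytes of x
    yh, yl = y >> 8, y & 0xFF   # high and low bytes of y
    return (xh * yh << 16) + (xh * yl << 8) + (xl * yh << 8) + xl * yl

assert mul16_from_mul8(0xABCD, 0x1234) == 0xABCD * 0x1234
```

In the SubPE, these four 8 × 8 products are what the two PEs compute over successive cycles before the lower combining stage merges them.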

where |x_m| < 1 and has B bits of precision.

D. DWT

The DWT is also just a control unit used to determine the operations to be performed and the storage of intermediate results. Since the architecture is modeled after JPEG 2000, both the Daubechies (5,3) and (9,7) filters were implemented. Both filters follow a general computation block designed after the lifting scheme (Fig. 4). Four SubPEs were allocated so that each 8-point 1-D DWT can be calculated in parallel. This is helpful when tile sizes are quite large, such as the typically used 32 × 32 or 64 × 64. The utilization of four SubPEs permits four 16-bit additions or four 16-bit multiplications to be calculated simultaneously. The Daubechies (5,3) filter follows the algorithm illustrated in Fig. 4 with a = −0.5, b = 0.25, and c = d = 1, for a total of 16 16-bit additions and eight 16-bit multiplications. The Daubechies (9,7) filter is a bit more intricate. The result is calculated in two steps, each consisting of the general computation block. The first half of the 1-D DWT computation using the (9,7) filter uses the general DWT computation block shown in Fig. 4 with a = p1, b = u1, and c = d = 1. The final result block also uses the computation block of Fig. 4, where the input values are the output values of the first half of the (9,7) computation, with a = p2, b = u2, c = K_1, and d = K_0. This increases the total number of operations to 32 16-bit additions and 24 16-bit multiplications for the entire (9,7) computation. For a typical tile size of 32 × 32, the first level of dyadic decomposition requires 32 1-D DWTs applied horizontally and then 32 applied vertically. For the (5,3), 1024 16-bit additions (512 for each dimension) and 512 16-bit multiplications (256 for each dimension) are required. The (9,7) is almost twice as large, with 2048 16-bit additions and 1536 16-bit multiplications (1024 and 768 for each dimension, respectively). To compensate for overflow, 16-bit fixed-point computation was implemented.

III. RESULTS

DCT and DWT architectures were implemented in VHDL.
Following behavioral simulation with Mentor Graphics ModelSim, the source code was synthesized to obtain a gate-level netlist with Cadence


Fig. 2. SubPE architecture.

Fig. 4. General DWT computation.
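Fig. 4's general computation block corresponds to one predict/update pass of the lifting scheme. The following sketch (our own illustration, not the SubPE hardware) shows the (5,3) case with a = −0.5, b = 0.25, c = d = 1; boundary handling here is a naive clamp, whereas JPEG 2000 specifies its own symmetric extension:

```python
def lifting_53(x, a=-0.5, b=0.25, c=1.0, d=1.0):
    """One level of (5,3)-style lifting: predict odd samples from even
    neighbours (high-pass), then update even samples from the new odd
    ones (low-pass). Boundaries use a simple clamp, an assumption."""
    even, odd = x[0::2], x[1::2]
    n = len(odd)
    # Predict step: high-pass coefficients.
    hi = [odd[i] + a * (even[i] + even[min(i + 1, len(even) - 1)])
          for i in range(n)]
    # Update step: low-pass coefficients.
    lo = [even[i] + b * (hi[max(i - 1, 0)] + hi[min(i, n - 1)])
          for i in range(len(even))]
    return [c * v for v in lo], [d * v for v in hi]

lo, hi = lifting_53([3, 3, 3, 3, 3, 3, 3, 3])
# A constant signal yields zero high-pass output and unchanged low-pass.
assert all(abs(v) < 1e-12 for v in hi)
assert all(abs(v - 3) < 1e-12 for v in lo)
```

The (9,7) path in the text simply runs this block twice with the (p1, u1) and then (p2, u2, K_1, K_0) parameter sets.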

Fig. 3. Block diagram of the reusable DCT: 8-bit operations (white) and 16-bit operations (grey).

BuildGates Extreme version 5.05, with low-power optimizations, using Ambitware's TSMC 0.18-μm standard cell library.

The Texas Instruments C55x family of DSPs is tailored for high-speed, power-efficient applications, including portable Internet devices and wireless communication. The TMS320VC5509 DSP [5], [6] was introduced in 2002 and operates at a maximum frequency of 144 MHz using 0.15-μm technology. While a DSP has a clear singular processing


TABLE I 2-D DCT PDP COMPARISON

TABLE II PRECISION OF THE DCT COMPUTATION

advantage, it consumes more area and power and does not have the advantage of data parallelism native to a PIM architecture. In applications requiring large-bandwidth, low-complexity processing, this improvement is substantial. The proposed SubPE design operates at a 333-MHz clock, driving 10-fF load capacitances on the outputs, and is further constrained by a 10-ps delay on each input port. For the target technology, 10-fF loading would be typical if the outputs were driving a wide fan-out. The input delay was chosen to compensate for any propagation through wiring from the output of another sequential block. Timing-driven optimization was performed with the following command line options:
do_optimize -priority time -effort high -power high -flatten auto

TABLE III MATLAB RESULTS FOR (5,3) FILTER

Optimization of the design required 5141 CPU seconds for completion on the Sun Blade 100 and did not incur any timing violations. The critical path through the SubPE required 2.85146 ns for propagation, including a setup time of 0.14788 ns. Timing constraints required the design to meet an arrival time of 2.85146 ns, which resulted in a slack time slightly under 1 ps. The critical path begins in the control hardware (FSM), which determines the instruction vectors to be loaded for each operation, and ends at the most significant register. The speed benefits of the TI 5509 DSP cannot compete with the power savings of the proposed architecture. Using conservative measurements from TI [6], the DSP core draws 0.78 mA/MHz, assuming a 50% NOP and 50% MAC duty cycle with typical data bus activity. With a core voltage supply of 1.5 V, the average power of the TI 5509 DSP is 168.48 mW at 144 MHz. Therefore, the SubPE requires only about 1/10th the power of the TI DSP to perform identical operations. The SubPE uses a total area of 35,298.92 μm². The 8-point DCT calculation requires 134 cycles for a 1-D DCT, or 2144 cycles for a 2-D implementation. At a 3-ns clock, the 2-D DCT computation requires 6.432 μs. This performance is achieved without instruction pipelining. Under the best circumstances, the TI 5509 DSP requires 151 cycles to complete a 2-D DCT [7]. At 144 MHz, this translates to a 1.049-μs execution time for an 8 × 8 DCT on a dedicated DSP. Consequently, the SubPE performs a single 2-D DCT computation six times slower than the TI 5509 DSP. However, since power and speed are tradeoffs, the power-delay product (PDP) is often used as a figure of merit between two architectures. In this case, the PDP is the power consumption per cycle per 2-D DCT, estimated by

PDP = Average Power × (Cycles / 2-D DCT) × (Delay / Cycle)    (2)

TABLE IV MATLAB RESULTS FOR (9,7) FILTER

TABLE V COMPARISON OF PE ARCHITECTURES

Despite the disadvantage of the DCT's fixed-point numeric format, the SubPE produces an output that deviates from Matlab by 2.89% in the worst case and 0.035% in the best case. These results are a testament to the capabilities of the SubPE, which does not employ any floating-point hardware. Floating-point computation is used extensively in CPUs and some DSPs, but is usually undesirable for mobile applications due to its enormous power and hardware area costs.
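Equation (2) can be evaluated from the figures quoted in this section (the SubPE's 16.47-mW average power is taken from the conclusion); this is our back-of-the-envelope arithmetic, not the authors' exact methodology:

```python
def pdp(avg_power_w, cycles_per_2d_dct, delay_per_cycle_s):
    # (2): PDP = Average Power * (Cycles / 2-D DCT) * (Delay / Cycle)
    return avg_power_w * cycles_per_2d_dct * delay_per_cycle_s

pdp_subpe = pdp(16.47e-3, 2144, 3e-9)       # SubPE: 16.47 mW, 3-ns clock
pdp_dsp   = pdp(168.48e-3, 151, 1 / 144e6)  # TI 5509: 168.48 mW, 144 MHz

# Roughly 10x lower power at roughly 6x more cycle-time nets out to the
# quoted 1.7x advantage for the SubPE.
assert round(pdp_dsp / pdp_subpe, 1) == 1.7
```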

The result is shown in Table I. Due to extreme power savings, the SubPE outperforms the DSP by a factor of 1.7. As a result, the proposed PIM, replicated a hundredfold throughout a memory array, will still outperform the DSP in these bandwidth-intensive applications. In order to compare the quality of the results, both the DCT and DWT were performed in Matlab using the same input set used for the ModelSim simulations. The results for the DCT are presented in Table II and those for the DWT in Tables III and IV, for the (5,3) and (9,7) filters, respectively.


TABLE VI COMPARISON OF ARCHITECTURES IMPLEMENTING 8 × 8 2-D DCT

The proposed DWT implementation produces approximately 85% of the (5,3) filter's outputs with 0% error, with an extreme case of 42.9% error near the boundary. JPEG 2000 symmetrically and periodically extends its signal when using an odd-tap filter; the proposed DWT implementation did not employ this extension. Nevertheless, the results are closely correlated with Matlab, as shown in Table III. The reliance on fixed-point computation posed another challenge for the DWT implementation, since the proposed SubPE does not make use of any floating-point hardware. This conflicts especially with the (9,7) filter, since its coefficients are floating-point numbers and are restricted to 16 bits. A comparison with Matlab can be found in Table IV. As previously mentioned, to increase the accuracy of the floating-point values for the (9,7) filter, the signed significand and exponent can be extended to 16 bits each, using a pseudo-version of single-precision floating-point representation. Thus, two SubPEs, one for the signed mantissa and another for the exponent, would be required to compute one value. In a simulation experiment, a 32 × 32 2-D DWT was implemented using four SubPEs. Using the (9,7) filter, the architecture performed the operation in 4352 cycles or 14.3616 μs. Using the (5,3) filter, the operation was completed in 1792 cycles or 5.9136 μs. In general, if N × N is the array size, it takes 56N cycles to compute the (5,3) 2-D DWT and 136N cycles to compute the (9,7) 2-D DWT.

IV. DISCUSSION AND COMPARISONS

In order to perform a fair evaluation and comparison, the proposed architecture and the referenced solutions have to be put into perspective. The proposed architecture and the corresponding results are for a datapath only. For the comparison, we used a power-performance merit expressed in millions of operations per second per watt, introduced by Hsu et al. [8].
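The closed-form cycle counts for the 2-D DWT (56N for the (5,3) filter and 136N for the (9,7)) can be checked against the timings quoted in the text; the 3.3-ns cycle time below is inferred from those figures, and the arithmetic is ours:

```python
def dwt_cycles(n, filt):
    """Cycles for an N x N 2-D DWT per the formulas quoted in the text."""
    return {"5,3": 56, "9,7": 136}[filt] * n

cycle_ns = 3.3  # implied by 1792 cycles -> 5.9136 us at N = 32

assert dwt_cycles(32, "5,3") == 1792
assert dwt_cycles(32, "9,7") == 4352
# Elapsed time in microseconds matches the quoted 5.9136 us and 14.3616 us.
assert round(dwt_cycles(32, "5,3") * cycle_ns / 1000, 4) == 5.9136
assert round(dwt_cycles(32, "9,7") * cycle_ns / 1000, 4) == 14.3616
```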
We used this merit in a comparison of 8 × 8 2-D DCT operations with four other architectures from the literature and one Texas Instruments DSP. Three data tables are presented. Table V shows a transistor count per memory column and a performance comparison of the proposed PE to other previously published processing element designs. Table VI shows a comprehensive comparison of the proposed datapath architecture to ASIC implementations recently presented in the literature and to the TI DSP, all running the 8 × 8 2-D DCT application. A third table presents the results of the proposed datapath architecture and five ASICs executing the 2-D DWT operation. As can be seen in Table V, not only is the proposed PE the only solution that can execute both 8- and 16-bit operations, it also delivers the highest performance. DSP-RAM, proposed by Kwan et al., is a fixed 16-bit architecture and IMAP is a fixed 8-bit architecture. All other architectures are single-bit only. In order to execute a 16 × 16 multiplication (a common operation in image processing), the proposed PE requires only 96 clock cycles. The IMAP architecture (the second best) takes 200 clock cycles. The Abacus-Booth architecture is more

than six times slower, and the remaining designs are 10–40 times slower than the proposed PE. Table VI displays the specification metrics of the proposed architecture, the TI 5509 DSP, and six other ASICs, all executing the 8 × 8 2-D DCT algorithm. The data clearly show that the proposed architecture delivers the highest throughput per watt and uses by far the least area to implement the 8 × 8 2-D DCT. Specifically, compared to the TI DSP, the ICT design, and the design by Xanthopoulos et al., it improves the throughput per watt by 40%, 13%, and 6%, respectively, using only 0.9%, 12%, and 7.5% of the area of the previous designs. Finally, we compare the results of the proposed architecture with the ASICs performing the 2-D DWT operation. The proposed architecture uses a smaller number of clock cycles than the designs by McCanny et al. [18] and Wu and Chen [19] for image sizes larger than 38 × 38 and 128 × 128, respectively, using the (5,3) filter [the (9,7) filter requires slightly larger image sizes]. Moreover, McCanny et al. presented a design that performs the 2-D DWT in 28 μs and 7.16 ms on a 32 × 32 and 512 × 512 array, respectively, using the (5,3) filter. That compares to 5.9136 μs and 95.57 μs for the proposed architecture, an improvement of 79% and 98.7%, respectively. Dang and Chau [20] proposed an integer fast wavelet transform that sacrifices accuracy, especially in the high-pass filters (errors between 45% and 50%). Chen et al. [21] proposed a programmable architecture that performs a 256 × 256 2-D DWT in 72 μs using (5,3) filters (versus 47.8 μs by the proposed architecture) and in 129.6 μs using (9,7) filters (versus 116 μs), and it consumes eight times more power and is 100 times larger than the proposed architecture.

V. CONCLUSION

This research has presented datapaths for a PIM targeting low-power image and video processing, focusing on compression algorithms.
By utilizing the bandwidth of memory through SIMD processing, a smaller yet flexible arithmetic unit may replace the traditionally used DSPs or microprocessors for embedded applications, conserving power and area. Low-power operation is achieved at 1.8 V, consuming an average power of only 16.47 mW (single SubPE). The proposed datapath solution delivers 2 to 40 times higher performance compared to other presented solutions. The architecture executes discrete cosine and wavelet transforms, achieving up to 40% higher throughput per watt and occupying as little as 0.9% of the area compared to a commercial DSP (TI TMS320VC5509) and other ASIC implementations while maintaining precision. The SubPE also maintains a low area of 35,299 μm² through merged arithmetic operations and a 4 × 4-bit partial product decomposition-based multiplier.

REFERENCES
[1] M. Margala and R. Lin, "Highly efficient digital CMOS accelerator for image and graphics processing," in Proc. IEEE ASIC/SOC Conf., 2002, pp. 127–132.


[2] M. Wieckowski and M. Margala, "A 32 Kb SRAM cache using current mode operation and asynchronous wave-pipelined decoders," in Proc. IEEE Int. SOC Conf., 2004, pp. 251–254.
[3] F. Cavalli, R. Cucchiara, M. Piccardi, and A. Prati, "Performance analysis of MPEG-4 decoder and encoder," in Proc. Int. Symp. Video/Image Process. Multimedia Commun., 2002, pp. 227–231.
[4] R. Lin and M. Margala, "Novel design and verification of a 16 × 16-b self-repairable reconfigurable inner product processors," in Proc. 12th ACM Great Lakes Symp. VLSI, 2002, pp. 172–177.
[5] Texas Instruments, Dallas, TX, "TI TMS320VC5509 fixed-point DSP," (2002). [Online]. Available: http://www.focus.ti.com/docs/prod/folders/print/tms320vc5509.html
[6] Texas Instruments, Dallas, TX, "TI TMS320VC5509 revision D power consumption," Measurements, 2002.
[7] Texas Instruments, Dallas, TX, "TI TMS320C55x hardware extensions for image/video applications programmer's reference," SPRU098, (2002). [Online]. Available: http://www.focus.ti.com/lit/ug/spru098/spru098.pdf
[8] S. K. Hsu, S. K. Mathew, M. A. Anders, B. R. Zeydel, V. G. Oklobdzija, R. K. Krishnamurthy, and S. Y. Borkar, "A 110 GOPS/W 16-bit multiplier and reconfigurable PLA loop in 90-nm CMOS," IEEE J. Solid-State Circuits, vol. 41, no. 1, pp. 256–264, Jan. 2006.
[9] M. Bolotski, T. Simon, C. Vieri, R. Amirtharajah, and T. F. Knight, Jr., "Abacus: A 1024 processor 8 ns SIMD array," in Proc. Int. Conf. Adv. Res. VLSI, 1995, pp. 28–40.
[10] D. G. Elliott, M. Stumm, W. M. Shelgrove, C. Cojocaru, and R. McKenzie, "Computational RAM: Implementing processors in memory," IEEE Des. Test Comput., no. 1, pp. 32–41, Jan.–Mar. 1999.
[11] J. C. Gealow and C. G. Sodini, "A pixel-parallel image processor using logic pitch-matched to dynamic memory," IEEE J. Solid-State Circuits, vol. 34, no. 6, pp. 831–839, Jun. 1999.
[12] B. S.-H. Kwan, B. F. Cockburn, and D. G. Elliott, "Implementation of DSP-RAM: An architecture for parallel digital signal processing in memory," in Proc. IEEE Can. Conf. Electr. Comput. Eng., 2001, vol. 1, pp. 341–345.
[13] N. Yamashita, T. Kimmura, Y. Fujita, Y. Aimoto, T. Manabe, S. Okazaki, K. Nakamura, and M. Yamashina, "A 3.84 GIPS integrated memory array processor with 64 processing elements and 2-Mb RAM," IEEE J. Solid-State Circuits, vol. 29, no. 11, pp. 1336–1343, Nov. 1994.
[14] D. Gong, Y. He, and Z. Cao, "New cost-effective VLSI implementation of a 2-D discrete cosine transform and its inverse," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 4, pp. 405–415, Apr. 2004.
[15] K.-H. Cheng, C.-S. Huang, and C.-P. Lin, "The design and implementation of DCT/IDCT chip with novel architecture," in Proc. IEEE Symp. Circuits Syst., 2000, pp. IV-741–IV-744.
[16] G. A. Ruiz, J. A. Michell, and A. M. Buron, "Parallel-pipeline 8 × 8 forward 2-D ICT processor chip for image coding," IEEE Trans. Signal Process., vol. 53, no. 2, pp. 714–723, Feb. 2005.
[17] T. Xanthopoulos and A. P. Chandrakasan, "A low-power DCT core using adaptive bitwidth and arithmetic activity exploiting signal correlations and quantization," IEEE J. Solid-State Circuits, vol. 35, no. 5, pp. 740–750, May 2000.
[18] P. McCanny, S. Masud, and J. McCanny, "Design and implementation of the symmetrically extended 2-D wavelet transform," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2002, pp. III-3108–III-3111.
[19] P.-C. Wu and L.-G. Chen, "An efficient architecture for two-dimensional discrete wavelet transform," IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 4, pp. 536–545, Apr. 2001.
[20] P. P. Dang and P. M. Chau, "Integer fast wavelet transform and its VLSI implementation for low power applications," in Proc. IEEE Workshop Signal Process. Syst., 2002, pp. 93–98.
[21] C.-Y. Chen, Z.-L. Yang, T.-C. Wang, and L.-G. Chen, "A programmable VLSI architecture for 2-D discrete wavelet transform," in Proc. IEEE Int. Symp. Circuits Syst., 2000, pp. I-619–I-622.

A Memory Efcient Partially Parallel Decoder Architecture for Quasi-Cyclic LDPC Codes
Zhongfeng Wang and Zhiqiang Cui

Abstract—This paper presents a memory-efficient partially parallel decoder architecture suited for high-rate quasi-cyclic low-density parity-check (QC-LDPC) codes using the (modified) min-sum algorithm for decoding. In general, over 30% of memory can be saved over conventional partially parallel decoder architectures. Efficient techniques have been developed to reduce the computation delay of the node processing units and to minimize hardware overhead for parallel processing. The proposed decoder architecture can linearly increase the decoding throughput with a small percentage of extra hardware. Consequently, it facilitates the application of LDPC codes in area/power-sensitive high-speed communication systems.

Index Terms—Architecture, error correction codes, low-density parity-check (LDPC), memory efficient, quasi-cyclic (QC) codes.

I. INTRODUCTION

Recently, low-density parity-check (LDPC) codes [1] have attracted considerable attention due to their near-Shannon-limit performance and inherently parallelizable decoding scheme. Quasi-cyclic LDPC (QC-LDPC) codes, being a special class of LDPC codes, are well suited for hardware implementation because of the regularity of their parity check matrices. Most recently, several classes of QC-LDPC codes [2]–[5] have been proposed that can achieve performance comparable with computer-generated random LDPC codes. Among the various LDPC decoding algorithms, the sum-product algorithm (SPA) has the best decoding performance. The modified min-sum algorithm (MSA) [6], which does not require any knowledge about the channel parameters and offers decoding performance comparable to SPA, is preferred in practical implementations. The other benefits of the (modified) MSA over SPA include less sensitivity to quantization noise and lower computation complexity in check node processing. In general, LDPC codes achieve outstanding performance only with large code word lengths (e.g., > 1000 bits). Thus, the memory part normally dominates the overall hardware of an LDPC codec. A memory-efficient serial decoder was presented in [7]; its decoding throughput for each tile is less than 5.5 Mb/s. Partially parallel decoder architectures, which can achieve a good tradeoff between hardware complexity and decoding throughput, are more appropriate for practical applications. In this paper, a memory-efficient partially parallel decoder architecture for high-rate QC-LDPC codes is proposed, which exploits the data redundancy of soft messages in the min-sum decoding algorithm. In general, over 30% of memory can be saved. In addition, the proposed architecture can be extended to other block-based LDPC codes, e.g., (general) permutation-matrix-based LDPC codes. To reduce the complexity of the check-node unit (CNU), an optimized pseudo-rank order filter (PROF) is proposed.
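The min-sum check-node update that the CNU implements can be sketched as follows; this is the textbook form of the algorithm, not the paper's CNU/PROF hardware, and the normalization factor is our assumption for the "modified" variant:

```python
def minsum_check_node(msgs, scale=0.75):
    """Min-sum check-node update: each outgoing message carries the sign
    product and the minimum magnitude over all *other* incoming messages.
    The scale factor (< 1) gives the common normalized/'modified' variant;
    scale=1.0 recovers plain min-sum."""
    out = []
    for i in range(len(msgs)):
        others = msgs[:i] + msgs[i + 1:]
        sign = 1.0
        for m in others:
            sign = -sign if m < 0 else sign
        out.append(scale * sign * min(abs(m) for m in others))
    return out

assert minsum_check_node([2.0, -1.0, 4.0], scale=1.0) == [-1.0, 2.0, -1.0]
```

Because each output needs only the overall minimum, the second minimum, and the sign product of the inputs, the magnitudes of the soft messages are highly redundant, which is the property the proposed memory reduction exploits.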
A low-complexity data scheduling structure is presented to enable parallel processing. The structure of this paper is as follows. In Section II, the rearranged min-sum decoding procedure is discussed. Section III presents the partially parallel decoder architecture. Various optimizations to further re-

Manuscript received January 18, 2006; revised September 11, 2006. The authors are with the School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97331 USA (e-mail: zwang@eecs.oregonstate.edu; cuizh@eecs.oregonstate.edu). Digital Object Identifier 10.1109/TED.2007.895247
