
Scalable Architecture for Discrete Cosine Transform Computation Engine Based on Array Processors

S. R. Naqvi, S. S. Naqvi, F. Rehman, F. Naghman, R. Tariq, A. Zainab

ABSTRACT
We propose a scalable architecture for a Discrete Cosine Transform (DCT) computation engine based on Single Instruction stream, Multiple Data stream (SIMD) array processors. Each pixel of an input matrix is assigned to a 4-way connected Processing Element (PE), and a frame comprises several such PEs, making it possible to compute as many pixels as there are PEs in a frame. Tripling such frames allows us to compress a colored image as efficiently as a gray-scale image. We specifically target the least possible number of computations by completely replacing the floating-point unit with Look-up Tables (LUTs), and an efficient implementation of an 8-bit multiplier is presented. Using nine processors arranged in a 3x3 matrix, we compute nine coefficients in fewer than nine clock cycles, resulting in a data rate (DR) of 1.4 Gbps at a cost of 967 slices. Performance is analyzed on a SPARTAN III FPGA (Field Programmable Gate Array), and a comparison with a previously proposed systolic architecture is presented.


Keywords
DCT, SIMD, Array Processors, Processing Element.

1. INTRODUCTION
In recent years, image processing techniques have received immense attention from researchers. Operations such as transforms and enhancement algorithms applied to an image fall under the category of image processing. These transforms and algorithms are applied to achieve desired results and improve the visual quality of images [1, 2]. Transform coding is a fundamental component of image processing in which a transform is applied to an image to take it into the frequency domain.

The frequency domain allows easy filtering of high- or low-frequency components of an image, resulting in its enhancement. The Discrete Cosine Transform is one of the fundamental techniques for transforming an image from the spatial domain to the frequency domain. It is closely related to the real part of the Fourier Transform [2, 3], which is a key component of image enhancement through filtering in the frequency domain. The DCT has found application in compression techniques and image enhancement algorithms. Figure 1 shows the block transformation concept: the block is initially in the spatial domain and, after applying the DCT, the resultant block is transformed into the frequency domain. This transformation helps visualize the high- and low-frequency components of an image, and thus helps separate the visually significant and insignificant data [2]. Over the years, the DCT has been computed using various software tools; however, a drift towards designing and enhancing dedicated architectures for image processing techniques has gained much importance over the last decade. Designing reconfigurable processors for image processing and other applications is an active research area [4 - 6]. Designing a reconfigurable processor for DCT coefficient computation based on an FPGA is the primary aim of this work. Dedicated hardware such as a reconfigurable array processor, which is a realization of Single Instruction Multiple Data streams (SIMD), is fast, efficient and flexible in terms of its reconfigurability. By making use of Application Specific Instructions (ASI), an FPGA provides greater performance in terms of reliability and processing speed compared to a software simulation [4]. A configurable processor implemented on an FPGA becomes an FPGA-based computation engine that can be reconfigured for various image processing algorithms [5, 6]. The DCT can be applied to an individual pixel or to a block. In this work, we begin by applying the DCT on a 3x3 block of image pixels and then scale the design to work for an 8x8 block; hence the name scalable architecture. The paper is organized as follows: Section 2 gives a brief background of the DCT and SIMD architectures. The proposed algorithm, its implementation and the working of the Control Unit are given in Section 3. Section 4 presents the results achieved. Section 5 concludes the paper with future recommendations.
Manuscript received September 16, 2009. This work was conducted in Electrical Engineering Department at COMSATS Institute of Information Technology, Wah Cantt, Pakistan. S. R. Naqvi is working as a lecturer in Electrical Engineering Dept at COMSATS Institute of Information Technology, Quaid Avenue, The Mall, Wah Cantt, Pakistan (+92-51-9272614 Ext 240; rameeeznaqvi@ciitwah.edu.pk). His interest lies in System Level Architecture, Error Correcting Codes and Regression Analysis. Fasih-ur-Rehman is the Head of Electrical Engineering Department at CIIT Wah Cantt (+92-51-9272614 Ext 205; fasihurrehman@comsats.edu.pk). His major fields of interest include Digital Image & Signal Processing. Other authors are also with the same department working in Digital Signal Processing and Computer Architecture field.


2. SYSTEM DESIGN COMPONENTS

As stated above, the primary target of this work is to investigate and suggest a massively parallel processing architecture for a DCT/IDCT computation engine. The three major entities in achieving this task are the DCT theory (software), a massively parallel architecture based on SIMD array processors, and a mechanism to interface with the outside world (hardware). The following subsections briefly discuss each of these.

2.1 Discrete Cosine Transform (DCT)


The Discrete Cosine Transform (DCT) attempts to separate the image data into visually significant and insignificant parts. The DCT has the unique property of concentrating the visually significant information of an image in just a few coefficients [2]. For this reason, the DCT is often used in image compression applications. A coefficient's usefulness is determined by its variance over a set of images [4]. Once we have the high-energy components of an image, which are the pixels that define its main features, the remaining coefficients can be discarded to achieve a very high compression rate. One clear advantage of the DCT over the DFT is that there is no need to manipulate complex numbers [9 - 12]. If n1 and n2 denote the number of information-carrying units (usually bits) in the original and encoded images respectively, the achieved compression can be quantified numerically via the compression ratio given in Equation (1) [2, 3].

C_r = \frac{n_1}{n_2}    (1)

The normalization terms c(u) and c(v) used in Equations (2) and (3) are defined as:

c(u) = \sqrt{\tfrac{1}{M}}, \; u = 0; \qquad c(v) = \sqrt{\tfrac{1}{N}}, \; v = 0

c(u) = \sqrt{\tfrac{2}{M}}, \; u = 1, 2, \ldots, M-1; \qquad c(v) = \sqrt{\tfrac{2}{N}}, \; v = 1, 2, \ldots, N-1
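Returning to Equation (1), a brief numerical illustration (with hypothetical figures, not measured results): if an original 3x3 block of 8-bit pixels occupies n_1 = 72 bits and only three 8-bit coefficients are retained after the transform, then n_2 = 24 bits and

C_r = \frac{n_1}{n_2} = \frac{72}{24} = 3.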

There is a set of nine unique matrices of order 3x3 for evaluating the DCT coefficients of an input block of the same order. The input matrix is multiplied element-wise with the corresponding matrix of constants, and the products are then added to obtain one DCT coefficient. This implementation generates a DCT coefficient in just seventeen calculations, comprising nine multiplications and eight additions, while the conventional implementation takes about 138 calculations, comprising ninety-three multiplications, twenty-six additions and nineteen divisions, to compute a single DCT coefficient. An algorithm requiring fewer calculations leads to a smaller number of clock cycles and becomes the basis of a faster architecture. This new representation of the 2D-DCT is given by Equation (4) below. The reduced algorithm is based on two operations: multiplication of the corresponding matrix of constants with the input block of image pixels, and addition of the products to obtain a DCT coefficient. In this work, we explicitly make use of SIMD array processors to achieve this task, since they promise massively parallel processing capability [10].
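For illustration, a minimal software sketch of this reduced form is given below (an assumption-laden C model, not the actual hardware description: the function name and types are invented, and the constant matrix C stands in for one of the nine per-coefficient matrices of constants):

```c
#include <stdint.h>

/* One DCT coefficient from a 3x3 block: nine multiplications and
 * eight additions, mirroring Equation (4). The constant matrix C
 * stands in for one of the nine per-coefficient LUT matrices. */
static int32_t dct_coefficient_3x3(const uint8_t A[3][3], const int16_t C[3][3])
{
    int32_t acc = 0;
    for (int x = 0; x < 3; x++)
        for (int y = 0; y < 3; y++)
            acc += (int32_t)A[x][y] * C[x][y];   /* multiply-accumulate */
    return acc;
}
```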

The proposed design is based upon the 2D DCT. The mathematical representation of the 2D forward DCT is given by Equation (2), and the mathematical representation of the 2D inverse DCT is given by Equation (3) [2, 9]. Evaluation of DCT coefficients using this conventional representation requires a large number of computations and hence results in a slower system. The proposed algorithm observes that the conventional representation of the forward 2D-DCT can be separated into a constant and a variable part for better computational efficiency. The constant part comprises the evaluation of the cosines, c(u) and c(v), while the input image pixel h(x,y) is the variable part. A DCT coefficient is generated after several iterations of multiplication and addition of these parts. The proposed design based on SIMD array processors investigates a novel algorithm (Section 3) that allows computation of DCT coefficients in a very small number of clock cycles.

H(u,v) = \frac{2}{\sqrt{MN}}\, c(u)\, c(v) \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} h(x,y) \cos\!\left[\frac{(2x+1)u\pi}{2M}\right] \cos\!\left[\frac{(2y+1)v\pi}{2N}\right]    (2)

h(x,y) = \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} \frac{2}{\sqrt{MN}}\, c(u)\, c(v)\, H(u,v) \cos\!\left[\frac{(2x+1)u\pi}{2M}\right] \cos\!\left[\frac{(2y+1)v\pi}{2N}\right]    (3)
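For contrast, a direct software evaluation of Equation (2), written as the equation appears above, would look roughly as follows (a hypothetical sketch; the engine itself performs no cosine evaluation, since these constants are precomputed and held in LUTs):

```c
#include <math.h>

#define M 3
#define N 3
#define PI 3.14159265358979323846

/* Direct evaluation of Equation (2) for one coefficient of a 3x3 block.
 * Every coefficient needs fresh cosine evaluations and floating-point
 * arithmetic, which is what the LUT-based reduced form avoids. */
static double dct2_direct(const double h[M][N], int u, int v)
{
    double cu = (u == 0) ? sqrt(1.0 / M) : sqrt(2.0 / M);
    double cv = (v == 0) ? sqrt(1.0 / N) : sqrt(2.0 / N);
    double sum = 0.0;

    for (int x = 0; x < M; x++)
        for (int y = 0; y < N; y++)
            sum += h[x][y] * cos((2 * x + 1) * u * PI / (2.0 * M))
                           * cos((2 * y + 1) * v * PI / (2.0 * N));

    return (2.0 / sqrt((double)(M * N))) * cu * cv * sum;   /* prefactor as in Eq. (2) */
}
```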

Figure 1: Concept of Block Transformation

Figure 2: Conventional Array Processor Organization

B(u,v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} A(x,y) \cdot C(x,y)    (4)

where A(x,y) = h(x,y) is the input block of pixels and C(x,y) is the corresponding matrix of constants.

2.2 SIMD Array Processors

SIMD (Single Instruction Multiple Data) is an arrangement of multiple processors that achieves data-level parallelism [10]. Two basic considerations govern the use of parallelism: first, a low execution cost for the inter-processor communication required by a given task; second, achieving a sufficient degree of parallelism in the implementation. Keeping these in view, parallelism is achieved here by using array processors, where more than one processor is pooled to do a job [10 - 12]. Neighboring processors communicate with each other to complete the assigned task. Key properties of SIMD array processors are: interconnected multiprocessors that may be loosely or tightly coupled with respect to their data memories; execution of a single instruction issued by a common Control Unit, hence a shared program memory; and data sharing through inter-processor links [10]. In this work, SIMD architectures are used for the implementation of the 2D-DCT/IDCT. The initial architectural design for evaluating a single DCT coefficient is shown in Figure 2; it comprises nine Processing Elements (PEs), each holding a unique pixel value of the block and the corresponding constant from the matrix of constants. Each PE is equipped with its own multiplier and is responsible for performing its respective multiplication. The addition given in Equation (4) is performed by the center PE; the neighboring PEs simply pass their values to the center PE using the scheme depicted in Figure 3. In a 4-way connected organization, PEs can only transfer data to their immediate East, West, North or South neighbors; for PE(0,0), data can only be sent to PE(1,1) through PE(0,1) or PE(1,0), which then forwards it to PE(1,1), thus consuming 2 CC. Moreover, this set must be replicated nine times to evaluate nine DCT coefficients in parallel. Having utilized 81 PEs to compute just nine DCT coefficients, this implementation is far from impressive. The following section discusses the proposed implementation and the step-by-step development of an efficient algorithm to compute the nine DCT coefficients in parallel. Furthermore, such processing systems require an Application Specific Instruction Set (ASIS), which is also developed and briefly described in Section 3.

2.3 Parallel-In-Parallel/Serial-Out Buffer (PIPSO)

The Parallel-In-Parallel/Serial-Out input buffer, shown in Figure 2, is used to smooth the data flow between the array processors and the external world (for example, a software program running on a PC). As the name suggests, it is a collection of shift registers that takes an n-byte input (pixels in our case) in parallel during each clock cycle and keeps shifting the data upwards on subsequent cycles. This procedure runs as many times as there are rows in the array; in our case of three rows of PEs, the PIPSO holds 3x3 bytes after three clock cycles. Once this operation completes, the pixels are moved into the corresponding PEs either bit-by-bit or byte-by-byte; the latter is followed in this work.
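A behavioural sketch of such a buffer is given below (an illustrative C model under assumed names and a fixed 3x3 geometry, not the actual HDL):

```c
#include <stdint.h>
#include <string.h>

#define ROWS 3
#define COLS 3

/* Behavioural model of a parallel-in buffer: each "clock" accepts one
 * row of pixels in parallel and shifts the previously stored rows up,
 * so after ROWS clocks the buffer holds a full ROWS x COLS block. */
typedef struct {
    uint8_t mem[ROWS][COLS];
} pipso_t;

static void pipso_clock(pipso_t *b, const uint8_t row_in[COLS])
{
    for (int r = 0; r < ROWS - 1; r++)          /* shift stored rows upwards   */
        memcpy(b->mem[r], b->mem[r + 1], COLS);
    memcpy(b->mem[ROWS - 1], row_in, COLS);     /* load the new row in parallel */
}
```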

3. PROPOSED IMPLEMENTATION
Understandably, almost every design based on a Control Unit requires a well-defined set of instructions to execute; for instance, reading from (or writing to) a memory location may be done using a MOVE instruction. Similarly, in this proposed implementation there are certain operations that must be understood by the entire array of processors; moreover, these instructions need to be application specific, meaning that they must be altered if a similar architecture is to be used to implement other algorithms. Hence we analyze the requirements of the design and define an ASIS, which is given in Table 1. Since all the PEs are controlled by a common CU, whenever the CU issues an operation, for example Broadcast 0, each PE sends the data from location 0 of its own (local, unshared) data memory to its output line. Similarly, all other instructions are performed simultaneously by all the PEs but on their local data, as explained in Table 1. Devising a suitable algorithm is probably the most vital step after the instruction set has been designed. Keeping in view the scheme of computation shown in Figure 3, a simple algorithm has been designed in which the corner PEs simply pass their pixel values to their immediate neighbors in the middle column, where they are multiplied with their respective constant values and added together; the results are then transferred to PE(1,1). The middle PE is responsible for generating a DCT coefficient by adding the results acquired from the PEs at its north and south ends. This algorithm computes one DCT coefficient in 64 CC (synchronization with the image source may consume a couple more). In this case, either the frame (9 PEs) must be replicated nine times in order to generate 9 coefficients in 64 CC (roughly seven clock cycles per coefficient, still not a great data rate), or a much better algorithmic solution is needed to achieve the desired throughput. Algorithm optimization is carried out in three phases, as described below.
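To illustrate how a single PE might interpret these application-specific commands, the following simplified model reacts to the command broadcast by the CU while operating only on its local memory (names, memory size and operand encoding are assumptions for illustration, not the real instruction format):

```c
#include <stdint.h>

/* Commands roughly mirroring the proposed ASIS of Table 1. */
typedef enum { CMD_BROADCAST, CMD_IN, CMD_MOVE, CMD_ADD, CMD_MUL, CMD_DIN } cmd_t;

typedef struct {
    int32_t mem[32];   /* local, unshared data memory (size assumed) */
    int32_t out_bus;   /* value currently driven on the PE's output line */
} pe_t;

/* Every PE executes the same command issued by the common CU,
 * but always on its own local data. */
static void pe_execute(pe_t *pe, cmd_t cmd, int x1, int x2, int32_t operand)
{
    switch (cmd) {
    case CMD_BROADCAST: pe->out_bus = pe->mem[x1];               break; /* Broadcast X                  */
    case CMD_IN:        pe->mem[x1] = operand;                   break; /* In X, <dir>: neighbour's bus */
    case CMD_MOVE:      pe->mem[x1] = pe->mem[x2];               break; /* Move X1, X2                  */
    case CMD_ADD:       pe->mem[x1] = pe->mem[x1] + pe->mem[x2]; break; /* Add X1, X2                   */
    case CMD_MUL:       pe->mem[x1] = pe->mem[x1] * operand;     break; /* Mul X1, #Z                   */
    case CMD_DIN:       pe->mem[x1] = operand;                   break; /* Din: external pixel          */
    }
}
```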


3.1 Algorithm Optimization Phase 1


A closer look at the organization shown in Figure 2 reveals that, with a dual-port memory in place (allowing two simultaneous reads or writes), the number of clock cycles can be reduced considerably. That is, PE(1,0) and PE(1,2) can send their data to PE(1,1) simultaneously, saving 1 CC; the same reduction applies to the PEs at the north and south. This scheme allows the operation to complete in just 12 CC, i.e. one coefficient every 1.33 CC, but at a huge cost in area with 81 PEs in place.

Figure 3: Scheme of Coefficient Computation (each PE holds one product A(i,j) x C(i,j) of the 3x3 block)

3.2 Algorithm Optimization Phase 2

In the second phase of optimization, the target is to come up with a better multiplier design. Generating the partial products is done easily and quickly using several multiplexers; however, the addition of these partial products is what consumes most of the clock cycles. Therefore, we explicitly target a multiplier design based on a Carry Look-ahead Adder [16, 17]. The proposed multiplier, shown in Figure 4, manages to generate three results in 1 CC. It works as follows: if the multiplier bit is 0, a 0 is selected; otherwise the multiplicand itself is selected as the partial product. Once all the partial products are formed, they are added using the carry look-ahead adder to obtain the final product. Carry look-ahead logic uses the concepts of generating and propagating carries [16, 17]. The logic equations for the sum, carry generate and carry propagate signals of each stage are given as Equations (5)-(7) respectively:

S_i = x_i \oplus y_i \oplus c_i    (5)

g_i = x_i \cdot y_i    (6)

p_i = x_i + y_i    (7)

That is, a stage unconditionally generates a carry if both addend bits are 1, and it propagates an incoming carry if at least one of the addend bits is 1. The carry output of a stage can then be written in terms of the generate and propagate signals as shown in Equation (8):

C_{i+1} = g_i + p_i \cdot c_i    (8)
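The partial-product selection can be sketched in software as follows (illustrative only; in the actual design the selection is performed by multiplexers and the accumulation by the carry look-ahead adder described above):

```c
#include <stdint.h>

/* Partial-product multiplication as described above: for every bit of
 * the multiplier, either 0 or the (shifted) multiplicand is selected,
 * and the selected partial products are accumulated. In hardware the
 * selection is a multiplexer and the accumulation is the CLA. */
static uint16_t mul8_partial_products(uint8_t multiplicand, uint8_t multiplier)
{
    uint16_t product = 0;
    for (int bit = 0; bit < 8; bit++) {
        uint16_t pp = (multiplier >> bit) & 1u
                        ? (uint16_t)multiplicand << bit   /* select multiplicand */
                        : 0u;                             /* select zero         */
        product += pp;                                    /* CLA addition in hardware */
    }
    return product;
}
```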
Table 1. PROPOSED INSTRUCTION SET

Sr # | Command     | Operation         | Remarks
1    | Broadcast X | Output Bus <- [X] | The PE sends data from memory location X to the output bus.
2    | In X, Y     | [X] <- Y          | Data from the East, West, North or South neighbor (direction Y) are input into location X of memory.
3    | Move X1, X2 | X1 <- X2          | Data from memory location X2 are copied into X1.
4    | Add X1, X2  | X1 <- X1 + X2     | The sum of the contents of locations X1 and X2 is copied back to X1.
5    | Mul X1, #Z  | X1 <- X1 * Z      | The product of the scalar Z and the contents of memory location X1 is copied into X1.
6    | Din         | [Mem] <- Pixel    | External data flows into the memory location specified.

Figure 4: Multiplication mechanism

To eliminate the carry ripple, we recursively expand the C_i term for each stage and multiply out to obtain a two-level AND-OR expression. Using this technique, Equations (9)-(11) are obtained for the first three adder stages [16].


C_1 = g_0 + p_0 \cdot c_0    (9)

C_2 = g_1 + p_1 \cdot g_0 + p_1 \cdot p_0 \cdot c_0    (10)

C_3 = g_2 + p_2 \cdot g_1 + p_2 \cdot p_1 \cdot g_0 + p_2 \cdot p_1 \cdot p_0 \cdot c_0    (11)
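These two-level expressions can be sanity-checked with a small software model (a 4-bit illustration with assumed names, not the synthesized adder):

```c
#include <stdint.h>

/* 4-bit carry look-ahead: generate/propagate per Equations (6)-(7),
 * carries per the flattened two-level forms of Equations (9)-(11).
 * c0 must be 0 or 1. */
static uint8_t cla4_carries(uint8_t x, uint8_t y, uint8_t c0)
{
    uint8_t g = x & y;        /* g_i = x_i . y_i      */
    uint8_t p = x | y;        /* p_i = x_i + y_i (OR) */
    uint8_t c1 = ((g >> 0) & 1) | (((p >> 0) & 1) & c0);
    uint8_t c2 = ((g >> 1) & 1) | (((p >> 1) & 1) & ((g >> 0) & 1))
               | (((p >> 1) & 1) & ((p >> 0) & 1) & c0);
    uint8_t c3 = ((g >> 2) & 1) | (((p >> 2) & 1) & ((g >> 1) & 1))
               | (((p >> 2) & 1) & ((p >> 1) & 1) & ((g >> 0) & 1))
               | (((p >> 2) & 1) & ((p >> 1) & 1) & ((p >> 0) & 1) & c0);
    return (uint8_t)((c3 << 2) | (c2 << 1) | c1);   /* packed carries C3..C1 */
}
```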


Similarly, the remaining carries are calculated. When this multiplier replaced the previous one, it resulted in a further reduction of 5 CC, dropping the total to 7 CC. Although computing one coefficient in less than a clock cycle is satisfactory as far as the DR is concerned, when it comes to area utilization, 81 PEs computing just 9 coefficients is simply not affordable. Especially when the architecture is scaled to a standard block of eight pixels, the architecture would grow enormously and so would the cost of the design. In modern Multi-Processor Systems-on-Chip (MPSoC), area versus throughput has become a usual trade-off, and designers always try to strike a balance between the two. In this case, a maximum of 9 PEs should be more than enough to compute 9 coefficients.

3.3 Algorithm Optimization Phase 3

Two critical observations are made about the organization shown in Figure 2. The first is that the corner PEs normally have at least one end connected to a hard-wired 0; for instance, PE(0,2), PE(0,1) and PE(0,0) must have their East/North, North and West/North inputs connected to zero respectively, and similarly for all other border PEs. The only PE that has all ends connected to other PEs is the one in the middle, i.e. PE(1,1). The second observation is that, according to the algorithm devised, PEs (0,0), (0,2), (1,0), (1,2), (2,0) and (2,2) simply transfer their values to their neighboring PEs without even using their multipliers, which is again a huge waste of area. If the PEs are connected in such a way that each of them becomes a center PE, then all of them utilize their multipliers just as PE(1,1) did previously. Such an organization is called the Mesh Organization of Hypercube and is shown in Figure 5. Now, if each PE has a unique matrix of constants stored in its local memory, it can compute a unique DCT coefficient independently, keeping the number of PEs at only nine. In this scheme the inter-processor communication has been enhanced to improve the performance of the computation engine, and the degree of parallelism in the formulated algorithm has been increased. Figure 6 shows these optimization steps in a flow chart, and the new operational version of the algorithm is given in Table 2.
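A software view of this phase-3 organization is sketched below (an illustrative model; the nine constant matrices are assumed to be precomputed and stored in each PE's local LUT, and the outer loop stands in for the nine PEs running in parallel):

```c
#include <stdint.h>

/* Phase-3 scheme: nine PEs, each holding its own 3x3 constant matrix,
 * compute the nine DCT coefficients of one block independently. */
static void dct3x3_mesh(const uint8_t A[3][3],
                        const int16_t C[9][3][3],   /* one constant matrix per PE */
                        int32_t B[9])               /* nine output coefficients   */
{
    for (int pe = 0; pe < 9; pe++) {                /* conceptually parallel PEs  */
        int32_t acc = 0;
        for (int x = 0; x < 3; x++)
            for (int y = 0; y < 3; y++)
                acc += (int32_t)A[x][y] * C[pe][x][y];
        B[pe] = acc;
    }
}
```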
Figure 6: Step by Step Optimization of Algorithm

Figure 5: Mesh Organization of Hypercube

The Control Unit is the backbone of any design based on a SIMD architecture [12]. It generates the signals for inter-processor communication, for writing data to and reading data from memory, for broadcasting data to and fetching data from neighboring PEs, and for performing multiplication and addition in the different stages as required by the design. Table 3 lists all the control signals along with their explanations, whereas the State Transition Diagram shown in Figure 7 gives the values of the control signals asserted or de-asserted during each control state. The first state, S0, is a reset state in which all the control signals are simply initialized to their no-operation default values. Besides the reset state, there are seven states that execute to complete the computation of the 9 DCT coefficients. This modified algorithm computes a DCT coefficient in less than a clock cycle, with an average of 1.28 coefficients per clock cycle (CPC). Since every PE computes its own coefficient, there is no need to replicate the architecture; a very fast, efficient and small DCT computation engine generates the desired results. This implementation of the DCT computation is for gray-scale images, where there is only one slice of color with different levels between black and white. A color image, on the contrary, has three slices, namely Red, Green and Blue, generally referred to as RGB. To apply the DCT to an image represented in RGB space, it is applied separately to each of the slices. To process a color image, this architecture can simply be replicated three times, with each copy provided with the set of values corresponding to the R, G and B slices respectively.
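The control flow can be pictured as a small state machine stepping through the states of Table 2, shown below (a sketch with invented state names; the actual control-signal values per state are those of Table 3 and Figure 7):

```c
/* CU state sequencer: S0 is reset, S1-S7 carry out the computation of
 * Table 2, after which the FSM returns to S1 for the next block. */
typedef enum { S0_RESET, S1_DIN, S2_MUL, S3_BCAST_NS, S4_ADD,
               S5_BCAST_EW, S6_ADD, S7_BCAST_OUT } cu_state_t;

static cu_state_t cu_next_state(cu_state_t s)
{
    switch (s) {
    case S0_RESET:     return S1_DIN;                 /* leave reset              */
    case S7_BCAST_OUT: return S1_DIN;                 /* start the next 3x3 block */
    default:           return (cu_state_t)(s + 1);    /* advance S1..S6           */
    }
}
```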

Table 2. BRIEF OPTIMIZED ALGORITHM WITH EXPLANATION

State | Command                                         | Explanation
S1    | din 1,4,7,10,13,16,19,22,25                     | Copy the pixel values to various locations of memory.
S2    | mul 1,c0; 4,c1; 7,c2                            | Multiply the pixel values with their respective constants.
S3    | Broadcast 1,4,7; In North 2,5,8; In South 3,6,9 | First broadcast the multiplication results on the output lines; then the neighboring PEs copy them into their vacant locations.
S4    | Add 3,2,1                                       | Add the copied values and save the result into a new location.
S5    | Broadcast 1,4,7; In East 2,5,8; In West 3,6,9   | Now broadcast the new data again, this time inputting from the other two neighbors.
S6    | Add 3,2,1                                       | Add the copied values and save the result into a new location.
S7    | Broadcast 1                                     | Broadcast the DCT coefficient.

4. RESULTS & DISCUSSION


The synthesized code ran at a maximum operating frequency of 137.3 MHz while utilizing only 967 slices of the selected device, a SPARTAN III (speed grade 4). Equation (12) gives the calculation of the achieved data rate: the engine computes 9 coefficients of 8 bits each in just 7 CC, resulting in a data rate of 1.413 Gbps for gray-scale images and a reasonable data-rate-to-slice ratio (DRSR) of 1.46x10^6.

\text{Data rate} = \frac{9 \times 8\ \text{bits}}{7\ \text{CC} \times 7.282\ \text{ns}} = 1.413\ \text{Gbps}    (12)

Images are not confined to gray scale; they are also represented in different color domains, and the DCT can be applied to all of them to transform them into the frequency domain. This implementation can easily be scaled for color images as well. In the case of RGB, after separating the three slices of R, G and B, the architecture can be used on each slice to compute the DCT. A more parallel approach would be to replicate the complete architecture three times so as to evaluate the DCT coefficients for each color, R, G and B, in parallel. Understandably, this can only be done at the expense of three times the area utilized earlier; however, an interesting point is that the engine would then be computing 27 coefficients in parallel (9 for each of the slices). Coefficient-wise, the DRSR therefore remains the same as before, but there is a significant drop of about 67%, to 0.48x10^6, in the pixel-wise DRSR.

Table 3. LIST OF CONTROL SIGNALS

Sr # | Signal | Width (bits) | Operation
1    | We     | 3            | Write enable signal; it also identifies the location to be written.
2    | Doo    |              | Set high when data is to be fed onto the output line of the PE.
3    | Sum    |              | Set high when an addition operation is completed.
4    | Mul    |              | Set high when a multiplication operation is completed.
5    | Inn    |              | Set high when a new block of pixels is to be input and processed.
6    | Broad  |              | When high, allows data to be sent to the output bus.

Several architectures have been reported [18 - 21] that give efficient implementations of the 2D-DCT. Compared to the design proposed by Antonino [18], which is synthesized at 107 MHz, this implementation is synthesized at 137 MHz while utilizing a relatively smaller area of the device. Secondly, Antonino's design computes one coefficient in 0.4 CC on average, whereas this implementation computes one coefficient in about 0.17 CC on average. There are two major disadvantages associated with the Xilinx core [19]. The first is that it computes roughly one coefficient per clock cycle after an initial latency of 92 CC; hence it is slower than this implementation by no less than a factor of six. The second disadvantage is its 9-bit output range, which makes it unsuitable for a JPEG implementation. Agostini [20] suggests a unique implementation utilizing a smaller area with a Wallace-tree approach; secondly, that design is based on two 1D-DCT elements. It was an efficient solution at its time, but since most FPGAs on the market these days are equipped with a number of built-in multipliers, its first feature no longer counts as an advantage. The main drawback of Agostini's implementation is its global latency of 164 CC due to its deep pipeline stages; after that, it computes one coefficient in 0.8 CC on average, which is still slower than this work by approximately a factor of five. The design proposed by Bukhari [21] gives the fastest pipelined solution of the lot, with 16 CC required to compute the same 8x8 block of pixels, resulting in an average delay of 0.25 CC per coefficient. Our implementation is even better than his design in terms of computation speed, but it utilizes a larger area. The comparison of this design with the other architectures is presented by the area/delay comparison curve in Figure 8, taken from Antonino [18]. The graph clearly shows that this design follows the area/delay constant curve, unlike Agostini's implementation, and that it is better than all the others in terms of delay. The area utilization has been normalized with respect to the Xilinx IP core [19].

Figure 7: Flow diagram of the CU's optimized FSM

Figure 8: Area/Delay Comparison of Various IP Cores (normalized area versus delay for This Design, Bukhari, Agostini, Antonino and the Xilinx core)

Up to this point, the architecture has been designed to work on a 3x3 matrix of pixels. However, it is possible to compute 63 coefficients (just one less than the full 8x8 matrix of pixels) either by incorporating additional PEs and increasing the number of MOVE instructions, or simply by replicating the existing design seven times so as to compute 63 coefficients in parallel. The latter approach is attractive with respect to DR, but comes at an expensive slice count of 967x7. In this case Equation (12) gives a modified DR of 9.887 Gbps. However, if we also count the clock cycles required to load the PIPSO and the output buffer (equivalent to 4 CC), the DR drops to 6.29 Gbps and the DRSR to 0.93x10^6.
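These figures can be reproduced with a short calculation (a sanity check only, using the 7.282 ns clock period implied by 137.3 MHz):

```c
#include <stdio.h>

int main(void)
{
    const double t_clk_ns = 7.282;                      /* clock period at 137.3 MHz  */
    double dr_3x3  = (9.0  * 8) / (7.0  * t_clk_ns);    /* Gbps, Equation (12)        */
    double dr_8x8  = (63.0 * 8) / (7.0  * t_clk_ns);    /* 7x replicated design       */
    double dr_8x8b = (63.0 * 8) / (11.0 * t_clk_ns);    /* plus 4 CC of buffer I/O    */

    printf("3x3 engine : %.3f Gbps\n", dr_3x3);         /* ~1.413 Gbps */
    printf("8x8 scaled : %.3f Gbps\n", dr_8x8);         /* ~9.887 Gbps */
    printf("with I/O   : %.2f Gbps, DRSR %.2fe6\n",
           dr_8x8b, dr_8x8b * 1e9 / (967.0 * 7) / 1e6); /* ~6.29 Gbps, ~0.93e6 */
    return 0;
}
```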

5. CONCLUSION
In this work a scalable DCT computation engine has been proposed, together with an efficient algorithm for the 2D DCT. The architecture has been optimized with respect to the area versus data-rate trade-off. Initially the algorithm was modified by exploiting the symmetry of the DCT, resulting in very few required computations; furthermore, the large list of constants is now completely stored in LUTs, replacing the floating-point unit. We have made efficient use of SIMD array processors to achieve this task: 9 PEs run in parallel to perform their respective multiplications and then communicate with each other to perform the additions. In short, each PE computes its own coefficient, resulting in a massively parallel approach. A data rate of 1.4 Gbps has been achieved at the expense of 967 slices of the selected device, a SPARTAN III. The design can easily be modified to work for JPEG and to be used in various multimedia applications.

6. REFERENCES

[1] The International Telegraph and Telephone Consultative Committee (CCITT). Information Technology - Digital Compression and Coding of Continuous-Tone Still Images - Requirements and Guidelines. Rec. T.81, 1992.
[2] R. C. Gonzalez and R. E. Woods, Digital Image Processing Using MATLAB.
[3] J. Miano, Compressed Image File Formats: JPEG, PNG, GIF, XBM, BMP, Addison Wesley Longman Inc., USA, 1999.
[4] M. A. Sid-Ahmed, Image Processing: Theory, Algorithms, and Architectures, McGraw-Hill, N.Y.
[5] S. Xu and H. Pollitt-Smith, Optimization of HW/SW Co-design: Relevance to Configurable Processor and FPGA Technology, CMC Microsystems, Kingston, Canada.
[6] S. Hauck, "The roles of FPGAs in reprogrammable systems," Proc. of the IEEE, Vol. 86, No. 4, pp. 615-638, April 1998.
[7] R. Wittig and P. Chow, "OneChip: an FPGA processor with reconfigurable logic," Proc. IEEE Symp. FPGAs for Custom Computing Machines, pp. 126-135, 1996.
[8] http://www.eecg.toronto.edu/~vaughn/challenge/fpga_arch.html
[9] S. A. Khayam, The Discrete Cosine Transform (DCT): Theory and Application, Department of Electrical & Computer Engineering, Michigan State University.
[10] R. C. Gonzalez and R. E. Woods, Digital Image Processing.
[11] A. B. Watson, Image Compression Using the Discrete Cosine Transform, NASA Ames Research Center.
[12] M. Abd-El-Barr and H. El-Rewini, Fundamentals of Computer Organization and Architecture.
[13] D. A. Patterson and J. L. Hennessy, Computer Organization and Design, 2nd ed.
[14] C.-M. Wu and A. Chiou, "A SIMD-Systolic Architecture and VLSI Chip for the Two-Dimensional DCT and IDCT," Dept. of Electronic Engineering, National Taiwan Institute of Technology, Taipei, Taiwan, R.O.C.
[15] http://www.xilinx.com/bvdocs/whitepapers/wp245.pdf
[16] Hardware Algorithms for Arithmetic Modules, ARITH research group, Aoki Lab., Tohoku University.
[17] R. Katz, Contemporary Logic Design, The Benjamin/Cummings Publishing Company, 1994, pp. 249-256.
[18] A. Tumeo, M. Monchiero, G. Palermo, F. Ferrandi, and D. Sciuto, "A Pipelined Fast 2D-DCT Accelerator for FPGA-based SoCs."
[19] Xilinx Corporation, "Video Compression Using DCT," Xilinx application note XAPP610, available at http://www.xilinx.com.
[20] L. V. Agostini, I. S. Silva, and S. Bampi, "Pipelined fast 2D DCT architecture for JPEG image compression," in Proc. 14th Symposium on Integrated Circuits and Systems Design, pages 226-231, Pirenopolis, Brazil, 2001.
[21] K. Z. Bukhari, G. K. Kuzmanov, and S. Vassiliadis, "DCT and IDCT implementations on different FPGA technologies," in Proceedings of ProRISC 2002, pages 232-235, November 2002.
