TSINGHUA SCIENCE AND TECHNOLOGY ISSN 1007-0214 17/20 pp108-113 Volume 15, Number 1, February 2010
Optimized Implementation of the FDK Algorithm on One Digital Signal Processor*

LIANG Wenxuan, ZHANG Hui, HU Guangshu**
Department of Biomedical Engineering, Tsinghua University, Beijing 100084, China

Abstract: This paper presents an optimized implementation of the FDK algorithm on a single fixed-point TMS320C6455 digital signal processor (DSP). Software pipelining and proper configuration of the data transfer enable a $256^3$ volume to be reconstructed in about 42 seconds from 360 projections with very good accuracy. This implementation reveals the potential of modern high-performance DSPs in accelerating image reconstruction, especially when cost and power consumption are emphasized.

Key words: computed tomography; digital signal processor (DSP); high performance computing; software pipelining
Introduction
In recent years, 3-D cone-beam computed tomography (CT) has been gaining popularity in both medical and industrial applications. The FDK algorithm[1] has been widely used in practical reconstruction due to its ease of implementation and acceptable results for small cone angles. However, the complexity of FDK is up to $O(MN^3)$, where $M$ is the number of projections and $N^3$ is the number of voxels in the reconstruction volume. The intensive computations and huge amounts of data involved place demanding requirements on the computational power of the imaging system. How to fully utilize new parallel processors, such as graphics processing units (GPUs) and the Cell processor, to accelerate 3-D CT reconstruction has naturally become a hot topic in recent years. Many experiments and implementations have been carried out and have demonstrated the promise of the computational power of these platforms[2,3].

The potential of modern high-performance digital signal processors (DSPs) is also far from being fully exploited. As early as 1997, Texas Instruments (TI) released the TMS320C6000 platform with VelociTI, an advanced very long instruction word (VLIW) architecture. VelociTI retains the advantages of VLIW (e.g., parallelism) and mitigates its deficiencies (e.g., by reducing code size). Thus the C6000 platform has actually transcended the conventional concept of a DSP, since it does not integrate dedicated multiply-add units but deploys eight parallel functional units. Today the cores (CPUs) in the C6000 family have evolved into the C64x+ core (fixed-point) and the C67x+ core (floating-point). In addition, the bus bandwidth, the capacity and flexibility of the on-chip memory, and the diversity and capability of the integrated I/O interfaces and peripherals are developing continuously.

Earlier experiments in applying DSPs to medical imaging include mapping some core routines in digital radiography (DR) and ultrasound, e.g., fast unsharp masking, 2-D convolution, and 2-D FFT[4-8]. Neri-Calderon et al.[9] accelerated parallel-beam CT reconstruction on a single TMS320C6416 DSP, mainly by minimizing CPU stalls due to cache misses. However, that work does not take full advantage of the power of modern DSPs. Other programming techniques are critical to guarantee high performance, as in the implementation presented in this paper for accelerating the FDK algorithm.

Received: 2009-10-18; revised: 2009-12-19
* Supported in part by the TI Innovation Funds
** To whom correspondence should be addressed. E-mail: hgs-dea@mail.tsinghua.edu.cn; Tel: 86-10-62784568
1 Description of the Platform and the Algorithm Mapping Methods
The block diagram of the Texas Instruments TMS320C6455 DSP (C6455) is shown in Fig. 1. Its C64x+ CPU features 2 data paths, each with 4 functional units and a register file of 32 32-bit registers. The .M unit performs multiply operations. The .L and .S units handle various arithmetic and logic operations. The .D unit loads and stores data and also performs common arithmetic operations. All four units support data-level parallelism, i.e., treating a 32-bit word operand as multiple subwords (e.g., four 8-bit operands or two 16-bit operands) and executing operations on the subwords simultaneously. More details on the advanced instruction set can be found in Ref. [10].

Fig. 1 Block diagram of C6455. SCR is short for switched central resource, which is one component of the EDMA and functions as the interconnection between different masters and slaves.

The C6455 has a 32 KB 2-way set-associative L1 data cache, a 32 KB direct-mapped L1 program cache, and 2 MB of on-chip L2 memory. Up to 256 KB of the L2 memory can be configured as a 4-way set-associative cache. The C6455 also features an EDMA (enhanced DMA controller) that allows various flexible modes of data transfer, and a DDR2 controller to interface to an external DDR2 SDRAM device.

To achieve high performance on DSPs, the algorithm should be mapped efficiently to the underlying processor architecture by several programming techniques. Managuli and Kim[8] summarized five techniques, three of which are used here.
(a) Judicious use of instructions to utilize multiple functional units and data-level parallelism: Carefully select instructions so that all the functional units are kept busy, and make use of partitioned operations to improve performance.
(b) Loop unrolling and software pipelining: Due to hardware pipelining, most assembly instructions have a single-cycle throughput (i.e., another identical instruction can be issued in the following CPU cycle), though several instructions take multiple cycles to complete (which is defined as latency). Overcoming the multi-cycle latency and utilizing the one-cycle throughput requires loop unrolling to compute multiple sets of data in one loop iteration and software pipelining to overlap successive iterations.
(c) Use of the programmable DMA controller: To reduce the CPU stall cycles caused by accessing non-cached data, DMA should be employed to transfer data between the faster on-chip memory and the slower off-chip memory concurrently with the CPU, especially when huge amounts of data are involved. A typical usage of DMA (also adopted in this implementation), called double buffering, is shown in Fig. 2.

Fig. 2 Double buffering in L2 SRAM. While the CPU is processing data using one buffer pair (e.g., buffer pair A) by reading data from In-Buffer A and writing the result to Out-Buffer A, the EDMA is transferring data between the other buffer pair (i.e., buffer pair B) and the external DDR2 SDRAM by filling In-Buffer B with new data and exporting the result from Out-Buffer B. After the CPU and the EDMA have both completed their work, the buffer pairs are switched. This procedure of processing and switching is repeated.
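The double-buffering schedule of Fig. 2 can be illustrated with a minimal Python sketch. It is only a sequential simulation under stated assumptions: the names `process_with_double_buffering`, `dma_load`, and `dma_store` are hypothetical, plain lists stand in for the L2 buffers and the DDR2 SDRAM, and the "DMA" steps run one after another here, whereas on the DSP the EDMA would perform them concurrently with the CPU.

```python
def process_with_double_buffering(external_in, chunk, kernel):
    """Process external_in chunk by chunk using two ping-pong buffer pairs.

    external_in stands in for data held in off-chip memory, chunk is the
    buffer size, and kernel is the per-element CPU computation.
    """
    n_chunks = (len(external_in) + chunk - 1) // chunk
    external_out = [None] * len(external_in)
    in_buf = [None, None]    # In-Buffer A and In-Buffer B
    out_buf = [None, None]   # Out-Buffer A and Out-Buffer B

    def dma_load(i, pair):
        # Hypothetical stand-in for an EDMA transfer: off-chip -> In-Buffer
        in_buf[pair] = external_in[i * chunk:(i + 1) * chunk]

    def dma_store(i, pair):
        # Hypothetical stand-in for an EDMA transfer: Out-Buffer -> off-chip
        start = i * chunk
        external_out[start:start + len(out_buf[pair])] = out_buf[pair]

    if n_chunks == 0:
        return external_out

    dma_load(0, 0)  # prime buffer pair A before the loop starts
    for i in range(n_chunks):
        cur, other = i % 2, (i + 1) % 2
        # On real hardware these two transfers overlap with the CPU work below.
        if i >= 1:
            dma_store(i - 1, other)   # export the previous result
        if i + 1 < n_chunks:
            dma_load(i + 1, other)    # prefetch the next chunk
        # CPU work on the current buffer pair
        out_buf[cur] = [kernel(v) for v in in_buf[cur]]
    dma_store(n_chunks - 1, (n_chunks - 1) % 2)  # flush the last result
    return external_out


doubled = process_with_double_buffering(list(range(10)), 4, lambda v: 2 * v)
```

The point of the schedule is that the buffer pair being processed is never the one being transferred, so the two activities can overlap without conflicting accesses.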
2 Implementation Details
The FDK algorithm[1] equations are given to facilitate the following description of the program architecture. The first part of FDK is pre-weighting and ramp-filtering the 2-D projection data as
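$$\tilde{p}(\beta, a, b) = \left( \frac{R}{\sqrt{R^2 + a^2 + b^2}}\, p(\beta, a, b) \right) * g^P(a) \qquad (1)$$

where $\beta$ is the projection angle, $R$ is the source trajectory radius, and $a$ and $b$ are coordinates on the virtual flat detector. Here $g^P(a)$ is the space-domain convolution function of the ramp filter as

$$g^P(a) = \frac{1}{2\pi} \int_{-\infty}^{\infty} |\omega| \exp(j\omega a)\, d\omega \qquad (2)$$

Then the pre-weighted and filtered projection data are backprojected to the reconstruction volume as

$$f(x, y, z) = \frac{1}{4\pi} \int_0^{2\pi} \frac{R^2}{U_{xy}^2}\, \tilde{p}(\beta, a_{xy}, b_{xyz})\, d\beta \qquad (3)$$

where

$$U_{xy} = R + x\cos\beta + y\sin\beta \qquad (4)$$
$$V_{xy} = -x\sin\beta + y\cos\beta \qquad (5)$$
$$a_{xy} = V_{xy} R / U_{xy} \qquad (6)$$
$$b_{xyz} = z R / U_{xy} \qquad (7)$$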
$R / U_{xy}$ and $V_{xy}$ are independent of $z$. So for each given projection we can calculate all $R / U_{xy}$ (denoted by array R/U) and $V_{xy}$ (denoted by array normV) and then reuse them when backprojecting slice by slice along the z-axis direction. Such pre-calculation is critical because it not only avoids duplicated calculations, but also excludes division operations from the backprojection loop, which enables the software pipelining technique to be applied. The pseudocode for the whole program is listed below.
(1) Necessary initialization.
(2) Compute the coefficients for pre-weighting as in Eq. (1).
(3) Loop projection-angle=1:360.
(4)     Loop y=1:256.
(5)         Loop x=1:256.
(6)             Fill array R/U and array normV as in Eqs. (4) and (5).
(7)         End loop x.
(8)     End loop y.
(9)     Pre-weight and ramp-filter the projection row by row.
(10)    Loop z=1:256.
(11)        Backproject the volume slice.
(12)    End loop z.
(13) End loop projection-angle.
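The loop structure above can be sketched in floating-point Python for a small volume. This is an illustrative sketch, not the DSP code: the function name `backproject`, the centered voxel and detector grids, the `cell` and `voxel` sizes, the sign convention of V, and the $1/(4\pi)$ scale factor are assumptions following the usual FDK form, and the projections are assumed to be already pre-weighted and ramp-filtered (step 9).

```python
import math


def backproject(filtered, betas, n, R, cell=1.0, voxel=1.0):
    """Voxel-driven backprojection with per-projection R/U and normV tables.

    filtered[k][row][col] is the pre-weighted, ramp-filtered projection k,
    betas holds the projection angles in radians, n is the volume size, and
    R is the source trajectory radius (assumed much larger than the object).
    """
    n_det = len(filtered[0])
    half = (n - 1) / 2.0          # centre of the voxel grid
    det_half = (n_det - 1) / 2.0  # centre of the detector grid
    volume = [[[0.0] * n for _ in range(n)] for _ in range(n)]  # [z][y][x]
    scale = (2.0 * math.pi / len(betas)) / (4.0 * math.pi)

    for k, beta in enumerate(betas):
        c, s = math.cos(beta), math.sin(beta)
        # Steps (4)-(8): fill the two tables once per projection.
        r_over_u = [[0.0] * n for _ in range(n)]
        norm_v = [[0.0] * n for _ in range(n)]
        for iy in range(n):
            y = (iy - half) * voxel
            for ix in range(n):
                x = (ix - half) * voxel
                u = R + x * c + y * s                       # Eq. (4)
                r_over_u[iy][ix] = R / u
                norm_v[iy][ix] = (-x * s + y * c) / cell    # Eq. (5), in cells
        # Steps (10)-(12): reuse the tables for every z slice, no divisions.
        for iz in range(n):
            z = (iz - half) * voxel / cell
            for iy in range(n):
                for ix in range(n):
                    ru = r_over_u[iy][ix]
                    col = round(norm_v[iy][ix] * ru + det_half)  # Eq. (6)
                    row = round(z * ru + det_half)               # Eq. (7)
                    if 0 <= row < n_det and 0 <= col < n_det:
                        volume[iz][iy][ix] += ru * ru * filtered[k][row][col] * scale
    return volume


# A constant projection of 1.0 at four angles, with R far larger than the object.
flat = [[[1.0] * 8 for _ in range(8)] for _ in range(4)]
vol = backproject(flat, [k * math.pi / 2 for k in range(4)], n=4, R=1000.0)
```

Because the inner loop only multiplies, adds, rounds, and indexes, it has the shape that makes hand software pipelining feasible; the divisions are confined to the table-filling stage.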
2.1 Q-format selection
Since the FDK algorithm inevitably involves fractional quantities that would normally call for floating-point operations, Q-format numbers are used on the fixed-point DSP. To be more exact, in a Qm.n format number there are m bits used to represent the integer portion and n bits used to represent the fractional portion. With an extra sign bit, m+n+1 bits are needed to store a general Qm.n number. So the range of a Qm.n number is $[-2^m, 2^m)$ and the finest fractional resolution is $2^{-n}$. In short, Q-format numbers are still fixed-point, but with an imaginary radix point.

$R / U_{xy}$ is always positive and less than 2, since in a practical cone-beam system with a small cone angle, the radius of the source trajectory is much larger than the object volume. $V_{xy}$ is normalized to the a-size of the cells on the virtual detector to exclude division operations from the backprojection step. Thus the absolute value of $V_{xy}$ is limited by half the number of cells in a row on the (virtual) detector; a row has 512 cells in our program. In our implementation, the elements in array R/U are Q2.13 numbers and those in array normV are Q12.3 numbers. Using half-word (16-bit) numbers saves internal memory space and also makes better use of the partitioned operations supported by the hardware. With the Q format, the absolute error in Eq. (6) is less than $2^{-13} \times 2^8 + 2^{-3} \times 2 = 0.28125$ cell, which is below half a cell, so the nearest-neighbor interpolation in the backprojection step is accurate. The same strategy is also applied to Eq. (7), where $z$ is normalized to the b-size of the cells. The IQmath library is employed to carry out division and square root operations[11].

2.2 Other optimization and simplifications
Double buffering is employed for the data transfer. Each time, two rows of the projection are processed and then stored into L2 SRAM. This double buffering scheme is slightly different from that in Fig. 2 since no data output is done. The same method used by Neri-Calderon et al.[9] was adopted when using the FFT function in the C64x+ library[12] for ramp-filtering.

Since only ordinary assembly operations are involved in the backprojection loop, software pipelining can be used. First, the loop is unrolled to backproject 2 voxels at a time. The unrolled loop consists of 29 assembly instructions, including 8 multiplication instructions. Since only two .M units are available in one CPU cycle, at least 4 CPU cycles are needed for one loop iteration. By careful instruction arrangement and elaborate software pipelining by hand, this limit was indeed achieved. Up to eight loop iterations are overlapped and executed concurrently. On average it takes only 2 CPU cycles to backproject 1 voxel. The double buffering scheme is also employed here.
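The Q-format reasoning of Section 2.1 can be checked numerically with a small Python sketch. The helper names `to_q`, `from_q`, and `detector_column` are hypothetical, Python integers stand in for the 16-bit half words, and round-to-nearest quantization with saturation is an assumption; the DSP code may truncate instead, which the stated bound also covers.

```python
def to_q(value, n_frac, bits=16):
    """Quantize value to a signed fixed-point integer with n_frac fractional bits."""
    raw = int(round(value * (1 << n_frac)))
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, raw))  # saturate to the 16-bit range


def from_q(raw, n_frac):
    """Convert a fixed-point integer back to a real value."""
    return raw / (1 << n_frac)


def detector_column(r_over_u, norm_v):
    """Evaluate Eq. (6) with R/U stored as Q2.13 and normV stored as Q12.3."""
    ru_q = to_q(r_over_u, 13)   # Q2.13
    v_q = to_q(norm_v, 3)       # Q12.3
    product = ru_q * v_q        # integer product carries 13 + 3 = 16 fractional bits
    return from_q(product, 16)


# Bound from Section 2.1: (error of R/U) * max|V| + (error of V) * max(R/U)
ERROR_BOUND = 2 ** -13 * 2 ** 8 + 2 ** -3 * 2

worst = 0.0
for i in range(50):
    r = 0.5 + 1.49 * i / 49            # R/U values below 2
    for j in range(-256, 257, 7):
        v = j + 0.37                   # |V| stays within half a 512-cell row
        worst = max(worst, abs(detector_column(r, v) - r * v))
```

In this sweep the worst observed error stays under the bound, which itself is below half a detector cell, the threshold at which nearest-neighbor rounding could pick the wrong cell.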
3 Results
Experiments were conducted on the TMS320C6455 Evaluation Module board. A $256^3$ volume was reconstructed from 360 projections of size $512^2$ each. Memory space for only one projection was allocated in DDR2 SDRAM due to the limited capacity of the on-board DDR2 SDRAM and the need to simulate practical situations where the projection data are streamed in from an imaging system one frame after another. Since the main goal here is to evaluate the speed and feasibility of the DSP implementation, we further simplified the experiment by using only one projection. This simplification means that the object should produce the same projection at all projection angles. Thus, we selected a homogeneous round disk model as the reconstruction object and generated the discrete 14-bit projection values analytically. The reconstruction volume (all 32-bit voxels) also resides in the external DDR2 SDRAM. The cone angle in our simulation is about 11.7 degrees.
3.1 Accuracy analysis
We also wrote a C++ program which runs the FDK algorithm using 32-bit floating-point precision. Since the error of the FDK algorithm is well-known, we just compared our DSP result with that obtained from the C++ program. Figure 3 shows the reconstructed images (at the plane Y=0.0625) from the two implementations. The two images are almost visually indistinguishable. The average absolute error after normalization was 2.8%.

The significand of a single-precision floating-point number has 24 bits. The Q-format of the operands used in the DSP program varies in different stages, but with careful control the significands of all the Q-format operands always have more than 14 bits. Since the original projection is generated in 14-bit precision, the precision maintained during the calculation in the DSP program is sufficient. Though some right shifts are inevitable to avoid overflows, the final result using fixed-point numbers deviates only slightly from that obtained using 32-bit floating-point numbers.

3.2 Speed analysis
The DSP implementation took 41.6 s to reconstruct a $256^3$ volume from 360 projections of size $512^2$ each. To examine whether the speed has hit the hardware limits, the number of CPU cycles consumed in the different stages during filtering and backprojection of one projection was measured and is listed in Table 1. As mentioned in Section 2, the three stages are filling the array R/U and array normV, pre-processing the projection, and finally the voxel-driven backprojection. The numbers in Table 1 are the averages of the cycle counts from 8 measurements. Since our C6455 DSP runs at 1 GHz, multiplying the total cycles in the last row by 360 yields 43.0 s, which is consistent with the overall time.
Table 1 Cycles used in each stage during filtering and backprojection of one projection

Stage description                     CPU cycles used
Filling the array R/U and normV       15 103 748
Pre-processing the projection         3 222 098
Voxel-driven backprojection           101 274 870
Total                                 119 600 716
Fig. 3 Reconstruction result of the disk model at the plane Y=0.0625. (a) C++ result, obtained from the C++ program using floating-point precision. (b) DSP result, obtained from the DSP using Q-format fixed-point numbers. Note that they are visually indistinguishable, showing that the DSP implementation gives acceptable image quality.
Table 1 also shows that backprojection is still the most time-consuming of the three stages. According to the ideal performance of the pipelined loop (2 cycles for 1 voxel), the number of CPU cycles needed for one backprojection is $256^3 \times 2 = 33\,554\,432$, which is much less than the 101 274 870 cycles in Table 1. Though the ideal speed cannot be achieved due to inevitable cache misses, the performance here is actually limited by the data transfer speed, as analyzed in the following.
The DDR2 Memory Controller on the C6455 DSP uses a 32-bit 533 MHz (data rate) external bus. Its internal data bus frequency is fixed at 1/3 of the CPU frequency. Since the C6455 CPU runs at 1 GHz, it is the internal bus frequency that limits the DDR2 throughput. During one backprojection, all $256^3$ 32-bit voxels have to be moved out of and back into the DDR2, with each word transfer costing one internal bus cycle, i.e., 3 CPU cycles. The total number of cycles needed for data transfer is therefore $256^3 \times 2 \times 3 = 100\,663\,296$, which is consistent with the number in Table 1. Besides, when using the actual 360 different projections, the projections can be updated independently by the integrated Ethernet media access controller, which supports 1000 Mb/s. Since the projection data transfer can be completed concurrently with CPU processing, the overall performance will not be degraded much.
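The cycle arithmetic of this section can be reproduced with a few lines of Python. The stage counts are copied from Table 1, and the figures of 8 multiplications per unrolled iteration, 2 .M units, and 2 voxels per iteration come from Section 2.2; the variable names are illustrative only, and the factor of 3 assumes one 32-bit word per internal bus cycle.

```python
VOXELS = 256 ** 3            # voxels in the reconstruction volume
PROJECTIONS = 360
CPU_HZ = 1_000_000_000       # 1 GHz CPU clock

# Measured cycles per projection (Table 1)
stage_cycles = {
    "fill R/U and normV": 15_103_748,
    "pre-process projection": 3_222_098,
    "voxel-driven backprojection": 101_274_870,
}
total_cycles = sum(stage_cycles.values())
total_seconds = total_cycles * PROJECTIONS / CPU_HZ

# Software-pipelining bound: 8 multiplies share 2 .M units per cycle.
MULTIPLIES_PER_ITERATION = 8
M_UNITS = 2
VOXELS_PER_ITERATION = 2
min_cycles_per_iteration = -(-MULTIPLIES_PER_ITERATION // M_UNITS)  # ceiling division
cycles_per_voxel = min_cycles_per_iteration / VOXELS_PER_ITERATION
ideal_backprojection_cycles = int(VOXELS * cycles_per_voxel)

# Transfer bound: every voxel goes out and back (x2) over a bus running
# at 1/3 of the CPU clock (x3 CPU cycles per word, an assumption).
transfer_cycles = VOXELS * 2 * 3

print(total_cycles, round(total_seconds, 2))
print(ideal_backprojection_cycles, transfer_cycles)
```

The measured backprojection count sits just above the transfer bound and about three times above the compute bound, which supports the conclusion that the DDR2 traffic, not the pipelined loop, sets the speed.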
4 Discussion
The accuracy of the DSP implementation was comparable with that of the equivalent floating-point program. In the presented DSP implementation, only calculations involving division, square root, and trigonometric functions are carried out using the IQmath library. In situations requiring higher accuracy, more computations can be carried out using the IQmath library (e.g., FFT and IFFT). Since backprojection determines the overall time complexity as shown in Section 3, we can anticipate that such shifts of task will not impair the overall performance too much.

Though a fair inter-platform comparison is nearly impossible considering the many factors involved (e.g., hardware technology, bit depth, number and size of projections, I/O bandwidth), we still list some recent high-performance FDK implementations in Table 2 for reference. The time in Table 2 is normalized to a $512^3$ volume and 360 projections. Note that neither Mueller and Xu[2] nor Kachelriess et al.[3] included the time to pre-weight and ramp-filter all the projections as done in this paper.

Table 2 Some high-performance FDK implementations

Implementation             Platform        Time (s)
Kachelriess et al.[3]      Cell BE         9.6
Mueller and Xu[2]          GPU 8800GTX     8.9
This paper                 TMS320C6455     336.0

5 Conclusions
Careful design of the algorithm architecture, proper utilization of the EDMA, and skillful pipelining of the backprojection loop were used to optimize the FDK algorithm, and the results verified the potential of DSPs in reconstruction acceleration. The implementation can be easily adapted to other time-consuming image reconstruction tasks which involve large numbers of repeated loops. The reconstruction accuracy and speed reveal the potential of modern high-performance DSPs, which are low-cost, low-power VLIW processors with only eight parallel functional units. DSPs provide a competitive alternative in the field of medical imaging acceleration, including cone-beam CT reconstruction, especially when cost and power consumption are emphasized.

Acknowledgements
We thank Texas Instruments (TI) Incorporated for providing the TMS320C6455 EVM boards and related technical support.

References
[1] Feldkamp L A, Davis L C, Kress J W. Practical cone-beam algorithm. Journal of the Optical Society of America A, 1984, 1(6): 612-619.
[2] Mueller K, Xu F, Neophytou N. Why do commodity graphics hardware boards (GPUs) work so well for acceleration of computed tomography? In: Proceedings of SPIE Electronic Imaging 2007, Computational Imaging V. San Jose, USA, 2007.
[3] Kachelriess M, Knaup M, Bockenbach O. Hyperfast perspective cone-beam backprojection. In: Proceedings of 2006 IEEE Nuclear Science Symposium and Medical Imaging Conference. San Diego, USA, 2006.
[4] Bae U, Shamdasani V, Managuli R, et al. Fast adaptive unsharp masking with programmable mediaprocessors. Journal of Digital Imaging, 2003, 16(2): 230-239.
[5] Managuli R, York G, Kim D, et al. Mapping of two-dimensional convolution on very long instruction word media processors for real-time performance. Journal of Electronic Imaging, 2000, 9(3): 327-335.
[6] Mermer C, Kim D, Kim Y. Efficient 2D FFT implementation on mediaprocessors. Parallel Computing, 2003, 29(6): 691-709.
[7] Sikdar S, Managuli R, Lixin G, et al. A single mediaprocessor-based programmable ultrasound system. IEEE Transactions on Information Technology in Biomedicine, 2003, 7(1): 64-70.
[8] Managuli R, Kim Y. Mediaprocessors in medical imaging for high performance and flexibility. In: Proceedings of SPIE Medical Imaging 2002: Visualization, Image-Guided Procedures, and Display. San Diego, USA, 2002.
[9] Neri-Calderon R A, Alcaraz-Corona S, Rodriguez-Dagnino R M. Cache-optimized implementation of the filtered backprojection algorithm on a digital signal processor. Journal of Electronic Imaging, 2007, 16(4): 043010.
[10] Texas Instruments. TMS320C64x+ DSP CPU and Instruction Set Reference Guide. Dallas, 2007.
[11] Texas Instruments. TMS320C64x+ IQmath Library Users Guide. Dallas, 2008.
[12] Texas Instruments. TMS320C64x+ DSP Little-Endian Library Programmers Reference. Dallas, 2006.
