
Chapter 2

REVIEW OF DSP HARDWARE TECHNOLOGY

2.1 Introduction

With the explosion of Internet connectivity, the growth of wireless communication and the
popularity of digital convergence, digital signal processing (DSP) finds itself suddenly in the
mainstream of embedded systems technology. DSP has become indispensable in recent years in
many consumer, communication, military, medical and industrial products. While the number and
variety of products that include some form of signal processing have grown dramatically over
the last decade, DSP hardware has also evolved according to the requirements of the
applications and algorithms [14]. Throughout the history of computing, DSP algorithms,
particularly those for real-time applications, have pushed the limits of number-crunching
power, and accordingly there is a wide assortment of commercial hardware available to
accelerate signal-processing functions. Extensive efforts are therefore continuing in industry
as well as academia to develop the hardware necessary to meet the heavy computational demand
of real-time processing of audio, video, radar, sonar and several other signals. Options range
from dedicated full-custom VLSI and application-specific integrated circuits (ASICs) targeted
to a narrow class of operations (FFT, convolution, etc.) to architectures based on
general-purpose programmable devices that can be adapted to a broad range of applications.
Dedicated and customized hardware offers optimum performance at the expense of long
development cycles and limited flexibility. Programmable signal processors provide a vehicle
to rapidly host and update algorithms, but typically operate at a fraction of theoretical peak
performance due to inefficiencies in mapping algorithms to the available execution units.
Reconfigurable hardware offers a compromise between special-purpose hardware and
general-purpose processors. Programming is accomplished by mapping algorithms on demand to a
pool of field programmable gate array (FPGA) logic.

DSP hardware may therefore be placed in three main categories, namely, programmable
general-purpose DSPs, FPGA-based DSP hardware and ASIC-based DSP hardware. In this chapter, we
present a brief review of the development of general-purpose DSPs as well as ASIC- and
FPGA-based DSP hardware. Section 2.2 discusses the evolution of the programmable DSP (referred
to as DSP in the rest of the discussion). Sections 2.3 and 2.4 present an overview of the
ASIC-based and FPGA-based DSP structures, respectively. The conclusion is presented in
Section 2.5.

2.2 General Purpose Programmable DSP

General-purpose DSPs constitute a class of microprocessors optimized for executing DSP
functionality. These processors can handle a variety of applications through a limited and
fixed set of arithmetic and control operations organized and sequenced in suitable programs.
General-purpose DSPs are, therefore, also referred to as programmable DSPs. They are
reprogrammable in the field according to the requirements of the application, and are often
more cost-effective than custom hardware, particularly for low-volume applications, where the
development cost of custom ICs may be prohibitive [15]. These DSPs are, therefore, readily
available and widely applicable. In spite of their inherent inefficiency in terms of speed and
power consumption, programmable DSP processors continue to be popular due to their flexibility
and low cost. The first commercially successful programmable DSPs were introduced in the early
1980s. First-generation DSPs such as the TMS32010 and NEC 7720 adopted the basic Harvard
architecture (Fig. 2.1), which consisted of a program memory, a data memory, a
multiply-accumulator unit and a control unit, with separate data and program buses.

Figure 2.1: Basic Harvard architecture
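
The role of the multiply-accumulator unit can be illustrated with a direct-form FIR filter, in
which every output sample is a sum of products. The sketch below is illustrative only: the
function name fir_mac and the sample data are our own assumptions and are not taken from any
of the processors named above. Each pass through the inner loop corresponds to one
multiply-accumulate (MAC) operation.

```python
def fir_mac(coeffs, samples):
    """Direct-form FIR filter written as explicit multiply-accumulate steps."""
    outputs = []
    for n in range(len(samples)):
        acc = 0  # accumulator, cleared at the start of each output sample
        for k, h in enumerate(coeffs):
            if n - k >= 0:  # samples before the start of the signal count as zero
                acc += h * samples[n - k]  # one MAC operation
        outputs.append(acc)
    return outputs


print(fir_mac([1, 2, 3], [1, 0, 0, 1]))
```

For a filter with N coefficients the inner loop performs up to N MAC operations per output
sample, which is why the throughput of the MAC unit largely determines the performance of such
kernels.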

Various features of the DSP architecture, e.g., the number of execution units, bus systems,
memory access for data and instructions, instruction set design, address generation and
addressing modes, have evolved over the last twenty-five years according to the needs of the
algorithms and applications. The need to minimize cost and energy consumption has influenced
the data word width used in DSP processors. DSPs tend to use the shortest data words that
provide adequate accuracy in their target applications. These processors now support
zero-overhead looping, since they usually spend much of the computation time executing small
sections of the program repeatedly [14]. To allow low-cost and high-performance input and
output, most DSP processors incorporate one or more specialized serial or parallel I/O
interfaces, and streamlined I/O handling mechanisms such as low-overhead interrupts and direct
memory access (DMA) that facilitate data transfer with little or no intervention from the
processor's computational units. Today's general-purpose DSPs comprise a 32-bit floating-point
CPU, separate address generators, DMA control, SRAM memory, and peripheral memory interfaces.
The evolution of DSP processors from conventional and enhanced-conventional designs through
multi-issue architectures, namely very long instruction word (VLIW) and superscalar processors,
is outlined below.

2.2.1 Conventional DSP Processors

Conventional DSP processors contain a single MAC unit and an ALU, with few other execution
units. They are designed to execute one MAC instruction per clock cycle. Examples of such
processors include the ADSP-21xx family, the TMS320C2xx family, and the DSP560xx family,
operating at around 20-50 MHz. Due to their low cost, low power consumption and small memory
usage, they are widely used in consumer products where very high performance is not essential.
Performance in conventional processors is improved by increasing the clock speed, adding
hardware and pipelining. Examples of such processors are the DSP563xx and TMS320C54x, which
operate at 100-150 MHz and include additional hardware such as an instruction cache and a
barrel shifter to speed up the implementation of DSP algorithms. In addition, such processors
use instruction and arithmetic pipelines to improve instruction throughput and reduce overall
processing time. Although these processors do not provide very high performance, they maintain
low energy consumption.

2.2.2 Enhanced-conventional DSP Processors

Enhanced-conventional DSP processors incorporate instruction as well as data parallelism to
have more computation performed in every clock cycle. They contain parallel execution units,
with extra multiplier and adder circuitry and an extended instruction set, to allow more
operations to be executed in parallel. Enhanced-conventional DSP processors, e.g., the
DSP16xxx, contain wider data buses to allow more data words to be accessed per clock cycle and
fed to the different execution units for parallel operation. They also have wider instruction
words to accommodate additional parallel operations in a single instruction. The
enhanced-conventional DSP processor requires specialized and complex hardware for executing
the compound instructions, and it is difficult to program in assembly and a poor compiler
target. New DSP processors with a multi-issue approach have therefore been developed.

2.2.3 Multi-Issue Architecture

Multi-issue processor architectures use very simple instructions, each of which typically
encodes a single operation. These processors achieve a higher degree of parallelism by
executing instructions in parallel groups rather than one at a time. The first multi-issue
DSP, the TMS320C62xx, was introduced in 1996. At that time it was much faster than other DSP
processors. Now all the DSP processor vendors (TI, Analog Devices, Motorola, and Lucent
Technologies) use multi-issue architectures for their high-performance processors. There are
two classes of architecture that execute multiple instructions in parallel: very long
instruction word (VLIW) and superscalar.

A VLIW DSP processor, e.g., the TMS320C62xx, has eight independent execution units and issues
a maximum of four to eight instructions per clock cycle. The instructions are fetched and
issued as part of a long super-instruction. Superscalar processors issue and execute two to
four instructions per cycle. In a VLIW architecture, the assembly language programmer
specifies which instructions can be executed in parallel, and the instructions are grouped
accordingly at assembly time. Superscalar processors, by contrast, contain specialized
hardware that decides which instructions to execute in parallel based on data dependencies and
resource conflicts. The burden of scheduling parallel instructions is thus shifted from the
programmer to the processor. Both VLIW and superscalar processors pay for their higher speed
with higher energy consumption than conventional DSP processors, because they have more
execution units active in parallel. They also require wide on-chip buses and memory banks to
support data movement for parallel execution of multiple instructions in different execution
units [9,14].
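
The grouping decision described above, whether it is made by the programmer at assembly time
or by hardware at run time, can be sketched as a greedy in-order packing of instructions into
issue groups. This is a simplified illustration under our own assumptions, not the scheduling
algorithm of any particular processor: the function name group_instructions, the register
names and the instruction tuples are hypothetical, only read-after-write and write-after-write
conflicts within a group are checked, and resource conflicts are reduced to a single
issue-width limit.

```python
def group_instructions(program, issue_width):
    """Greedy in-order packing of instructions into parallel issue groups.

    Each instruction is a tuple (name, dest, sources). An instruction joins
    the current group only if the group has a free slot and the instruction
    neither reads nor writes a register written earlier in the same group.
    """
    groups = []
    current, written = [], set()
    for name, dest, sources in program:
        conflict = dest in written or any(s in written for s in sources)
        if current and (conflict or len(current) == issue_width):
            groups.append(current)  # close the group and start a new one
            current, written = [], set()
        current.append(name)
        written.add(dest)
    if current:
        groups.append(current)
    return groups


# Hypothetical three-tap sum of products: r5 = a0*b0 + a1*b1 + a2*b2
program = [
    ("mul1", "r1", ("a0", "b0")),
    ("mul2", "r2", ("a1", "b1")),
    ("add1", "r3", ("r1", "r2")),  # needs the results of mul1 and mul2
    ("mul3", "r4", ("a2", "b2")),
    ("add2", "r5", ("r3", "r4")),  # needs the results of add1 and mul3
]
print(group_instructions(program, issue_width=4))
```

With an issue width of four the five instructions fit into three groups, whereas an issue
width of one degenerates to one instruction per group, as in a single-issue processor.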

2.3 ASIC-based DSP Hardware

ASIC-based systems provide another alternative for hardware implementation of DSP, one that is
tailored for optimal implementation of specific signal processing functions. Considerable
advances have taken place in the past few decades in the field of microelectronics, and it has
therefore become possible to realize a complete printed circuit board on a single chip. ASICs
are the key components in the development of systems-on-chip. An ASIC contains circuit blocks
that are specialized for a given application or application domain. Owing to the customized
design, it is generally possible to put more functionality in an ASIC, with better performance
and lower power consumption. ASICs and other semiconductor chips with ASIC blocks are
therefore widely used in space applications, defense applications and consumer products. ASIC
architectures can be put in two basic categories: one is based on standard cells, the other on
gate arrays. The two categories differ widely in terms of their manufacturing techniques, cost
and development time. Gate arrays consist of rows and columns of regular transistor
structures, where each basic cell, or gate, consists of a small number of unconnected
transistors. In gate arrays, the connections are determined completely by the design to be
implemented. The transistors are first connected together to realize low-level functions, and
the low-level functions are then routed and connected to build higher-level functions.
Standard-cell ASICs, on the other hand, are designed using transistors that are already
connected and routed to form functions such as flip-flops, adders and counters. The ASIC
designers connect these cells together to implement the higher-level functions. Standard-cell
ASICs are more customizable, achieve higher utilization of chip area and have a smaller die
size than the equivalent gate-array implementation, but involve a high non-recurring
engineering (NRE) cost and a long turnaround time.

Most DSP ASICs use a fixed-point numeric format, because floating-point arithmetic is more
complex and requires more silicon area than fixed-point arithmetic. The precision and range of
the numeric format used in the design, however, affect the behavior of the system in the
following three ways:

(i) Functionality: The frequency response of the system changes, because the locations of the
poles and zeros of the filter shift when the filter coefficients are quantized.

(ii) Quantization noise: Finite-precision arithmetic introduces quantization noise at the
output of the system due to truncation or rounding after an arithmetic operation.

(iii) Overflow: The output of the system may be distorted by overflow, which may occur because
a finite number of bits is used to represent signals and state variables.
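
The three effects listed above can be reproduced with a few lines of arithmetic. The following
sketch is a minimal illustration under our own assumptions: the second-order denominator
coefficients a1 and a2, the word lengths and the helper names quantize, poles and wrap16 are
hypothetical and do not come from any particular design.

```python
import cmath


def quantize(value, frac_bits):
    """Round value to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def poles(a1, a2):
    """Poles of H(z) = 1 / (1 + a1*z**-1 + a2*z**-2)."""
    disc = cmath.sqrt(a1 * a1 - 4 * a2)
    return (-a1 + disc) / 2, (-a1 - disc) / 2


def wrap16(value):
    """Two's-complement wraparound of an integer to 16 bits."""
    return (value + 0x8000) % 0x10000 - 0x8000


a1, a2 = -1.8, 0.9  # hypothetical second-order section

# (i) Functionality: coarser coefficients move the poles.
for frac_bits in (12, 4):
    q1, q2 = quantize(a1, frac_bits), quantize(a2, frac_bits)
    print(frac_bits, q1, q2, abs(poles(q1, q2)[0]))

# (ii) Quantization noise: rounding error is at most half of one LSB.
print(abs(quantize(0.9, 4) - 0.9), 2.0 ** -5)

# (iii) Overflow: a sum that exceeds the 16-bit range wraps around.
print(wrap16(30000 + 10000))
```

With only four fractional bits the coefficient 0.9 becomes 0.875, so the pole radius shrinks
and the frequency response changes, while the sum 30000 + 10000 wraps to a negative value in a
16-bit word.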

CAD tools are now available to analyze the effects of fixed-point arithmetic on the behavior
of a system, and to optimize the selection of fixed-point numeric formats. One can simulate
the design using appropriate test signals and analyze the fidelity of the output signals. In
addition, filter design packages can be used to evaluate the behavior of a filter implemented
with different fixed-point numeric formats. DSP ASIC designers also have the option to
customize the design to best suit the constraints and requirements of their applications, such
as area, speed, power consumption, production cost and design cycle time. The most important
characteristic to be determined during the hardware architecture development stage is the
level of parallelism needed to satisfy the system performance and power consumption
requirements, which in turn depends on the computational requirements and sampling rates of
the application. The underlying hardware structure then follows directly from the degree of
parallelism in the algorithm and the architecture. Hardware synthesis tools are quite useful
in the design of the hardware architecture. In the first pass, such tools are used to generate
implementations of components to assist in the evaluation of candidate architectures. In the
second pass, higher-level synthesis tools are used to synthesize the selected hardware
architectures. These implementations are then passed through a series of verification and
evaluation steps to ascertain the desired functionality and performance, the feasibility of
implementation and the estimated cost.
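
The dependence of the required parallelism on the computational load and the sampling rate can
be made concrete with a back-of-the-envelope estimate. The sketch below assumes, purely for
illustration, an FIR filter whose cost is one MAC per tap per sample and hardware in which each
MAC unit completes one operation per clock cycle; the function name mac_units_needed and the
example figures are hypothetical.

```python
import math


def mac_units_needed(taps, sample_rate_hz, clock_hz):
    """Minimum number of parallel MAC units for an FIR filter,
    assuming each unit completes one MAC per clock cycle."""
    macs_per_second = taps * sample_rate_hz
    return math.ceil(macs_per_second / clock_hz)


# Audio-rate filter: a single MAC unit is more than sufficient.
print(mac_units_needed(taps=64, sample_rate_hz=48_000, clock_hz=100_000_000))

# Wideband filter: the same clock rate now calls for dozens of MAC units.
print(mac_units_needed(taps=256, sample_rate_hz=20_000_000, clock_hz=100_000_000))
```

An estimate of this kind only bounds the number of arithmetic units from below; memory
bandwidth and power constraints may call for a different degree of parallelism.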

2.4 FPGA-based DSP Hardware

With ASICs, designers need to wait weeks for delivery of the finished product, whereas FPGA
designers can realize a design on an FPGA chip by themselves in minutes. An FPGA is a field
programmable logic device comprising an array of uncommitted circuit elements, called logic
blocks, and programmable interconnect resources. These interconnects allow the end user to
reconfigure the FPGA for repeated reuse, which is an advantage in applications that need
multiple trial versions within a development cycle. FPGAs are significantly more expensive
than programmable DSPs, but can offer higher performance in specific applications, and can
consume less power than programmable DSPs. FPGAs are mostly used these days for testing, rapid
prototyping and low-volume applications. In an FPGA, the digital circuits are programmed by
means of a bit-stream that completely specifies the logic functions and connectivity to be
implemented [16]. With the configuration held in static random-access memory (SRAM), an FPGA
can be reprogrammed as many times as needed. The same silicon resource in an FPGA can thus be
reused for a wide range of DSP functions. FPGAs gain this flexibility over ASICs at the cost
of higher power consumption, larger die size and slower operation.

FPGAs from different vendors differ in the organization of their programmable logic devices
(PLDs), logic gates, random-access memory, etc. In spite of these architectural differences,
they all consist of an array of logic units distributed across a grid of programmable
interconnect. In a broad sense, FPGAs can be classified in two categories: coarse-grained and
fine-grained. Coarse-grained FPGAs contain a relatively small number of more powerful logic
units, while fine-grained FPGAs contain a relatively large number of less powerful, elementary
logic blocks. Most popular FPGAs are coarse-grained and use logic units built on look-up
tables (LUTs). The Xilinx 4000 series family of FPGAs is composed of logic units called
configurable logic blocks (CLBs). Each CLB consists of two 4-input, 1-output LUTs, one 3-input,
1-output LUT, two flip-flops, and some multiplexers for selecting the appropriate output from
the LUTs or flip-flops. Each CLB of the Xilinx 4K series can be used to implement a two-bit
adder or a nine-bit parity checker. The cell of a fine-grained FPGA such as the Xilinx 6200
consists of two 4-to-1 multiplexers and three 2-to-1 multiplexers, which can implement any
two-input function or a single bit of storage. A coarse-grained FPGA such as the Xilinx 4K
series can also provide distributed memory in addition to logic resources; e.g., each LUT can
be used to implement a 16x1 RAM/ROM. Its programmable routing consists of fixed-length metal
segments interconnected by programmable switches, which connect the LUTs, memory blocks and
flip-flops.
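
The nine-bit parity checker mentioned above shows how the three LUTs of one CLB can work
together. The sketch below models each LUT as a small read-only memory addressed by its input
bits, so that a 4-input LUT is a 16x1 ROM. It is a behavioural illustration only: the names
make_lut, read_lut and parity9 are hypothetical, and the wiring in which the 3-input LUT
combines the two 4-input LUT outputs with a ninth input bit is our assumption about how such a
checker would be mapped.

```python
def make_lut(function, num_inputs):
    """Build a LUT as a truth table with 2**num_inputs one-bit entries."""
    table = []
    for address in range(1 << num_inputs):
        bits = [(address >> i) & 1 for i in range(num_inputs)]
        table.append(function(bits) & 1)
    return table


def read_lut(table, bits):
    """Look up the output for the given input bits (bit 0 first)."""
    address = sum(bit << i for i, bit in enumerate(bits))
    return table[address]


def xor_all(bits):
    """Exclusive-OR of all bits, i.e. their parity."""
    result = 0
    for bit in bits:
        result ^= bit
    return result


LUT_F = make_lut(xor_all, 4)  # first 4-input LUT (a 16x1 ROM)
LUT_G = make_lut(xor_all, 4)  # second 4-input LUT
LUT_H = make_lut(xor_all, 3)  # 3-input LUT combining F, G and a ninth bit


def parity9(bits):
    """Parity of nine input bits using two 4-input LUTs and one 3-input LUT."""
    f = read_lut(LUT_F, bits[0:4])
    g = read_lut(LUT_G, bits[4:8])
    return read_lut(LUT_H, [f, g, bits[8]])


print(parity9([1, 0, 1, 1, 0, 0, 1, 0, 1]))
```

Loading different truth tables into the same three LUTs would realize a different logic
function, which is the sense in which the bit-stream defines the circuit.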

Until recently, however, FPGAs were rarely used to implement DSP tasks, because they were more
power hungry and lacked the gate capacity to handle DSP algorithms. This may be changing with
the introduction of new DSP-oriented products from various FPGA vendors. Altera's Stratix
family and Xilinx's Virtex-II family both offer significant DSP-oriented architectural
enhancements. For example, both offer hardwired on-chip multipliers embedded throughout the
reconfigurable logic array, intended to accelerate multiply-accumulate (MAC) operations. By
including some hardwired processing elements, FPGAs are improving their energy efficiency and
cost performance while offering higher speed. The computational requirements of real-time DSP
applications often exceed the performance available from even the fastest DSP processors. The
new breed of DSP-enhanced FPGAs therefore offers a potentially attractive solution for various
real-time DSP applications.

2.5 Conclusion

In this chapter we have discussed the three main categories of DSP hardware, namely,
programmable general-purpose DSPs, FPGA-based DSP hardware and ASIC-based DSP hardware. DSPs
are reprogrammable in the field, allowing product upgrades, and are often more cost-effective
than custom hardware, particularly for low-volume applications, where the development cost of
custom ICs may be prohibitive. Programmable DSPs are therefore widely used in numerous
applications. Modern DSP architectures contain multiple buses and multiple independent
execution units, which can be operated in parallel and pipelined to perform multiple tasks in
one cycle for faster realization of DSP algorithms. They also provide efficient access to
memory for instructions as well as data by dividing the memory into a number of banks. We have
discussed here the evolution of DSP processors from conventional and enhanced-conventional
designs through multi-issue architectures, namely very long instruction word (VLIW) and
superscalar processors, and reconfigurable DSP accelerators. Implementation of digital signal
processing applications often requires chips with number-crunching capability and throughput
that even the fastest programmable DSP cannot deliver. Very often, the applications also place
stringent constraints on power consumption. Signal processing tasks in such situations are
usually carried out by ASICs. ASICs can achieve high levels of performance with hard-to-match
energy efficiency and minimum silicon area, but they require massive design effort. ASIC-based
hardware, however, offers no flexibility to serve more than one application or to accommodate
product upgrades as technology evolves, because once design and fabrication are complete, the
functionality cannot be altered. ASICs involve the longest design time and a very high
non-recurring engineering cost, and they are therefore currently used only for high-volume
markets, time-critical applications and military applications. FPGAs are thus becoming more
popular. FPGAs possess a capability similar to that of ASICs to provide specific circuits for
a given DSP application, but differ basically in terms of their internal connectivity. Unlike
dedicated hardware, an FPGA can be time-shared between algorithms by simply reloading the
configuration code. In contrast to a programmable DSP, the FPGA actually assumes the logic
design required for implementing an algorithm instead of executing a sequence of instructions
on predefined hardware resources. Properly executed FPGA designs typically outperform a DSP
microprocessor by a factor of 100:1, and by more than 1000:1 in special circumstances. The
power dissipation of an FPGA DSP design is typically about 20% of that of a
microprocessor-based design working at the same sample rate. The configurability of FPGAs thus
makes them well suited to customized but reconfigurable logic that executes specific,
compute-intensive algorithms by exploiting the massive parallelism in DSP operations.
