10 - Chapter 2 PDF

Chapter 2
REVIEW OF DSP HARDWARE TECHNOLOGY
2.1 Introduction
With the explosion of Internet connectivity, the growth of wireless
communication and the popularity of the grand digital convergence, digital signal processing
(DSP) finds itself suddenly in the mainstream of embedded systems technology. DSP has
become indispensable in recent years in many consumer, communication,
military, medical and industrial products. While the number and variety of products
that include some form of signal processing have grown dramatically over the
last decade, DSP hardware has also evolved according to the requirements of the
applications and algorithms [14]. Throughout the history of computing, DSP
algorithms, particularly for real-time applications, have pushed the limits of
number-crunching power, and accordingly there is a wide assortment of commercial hardware
available to accelerate signal-processing functions. Extensive efforts are therefore
continuing in industry as well as academia to develop the hardware necessary to meet
such heavy computational demand for the real-time processing of audio, video, radar,
sonar, and several other signals. Options range from dedicated full-custom VLSI and
application-specific integrated circuits (ASICs) targeted to a narrow class of operations
(FFT, convolution, etc.) to architectures based on general-purpose programmable
devices that can be adapted to a broad range of applications. Dedicated and
customized hardware offers optimum performance at the expense of long development
cycles and limited flexibility. Programmable signal processors provide a vehicle to
rapidly host and update algorithms, but typically operate at a fraction of theoretical
peak performance due to inefficiencies in mapping algorithms to the available
execution units. Reconfigurable hardware offers a compromise between special
purpose hardware and general-purpose processors. Programming is accomplished by
mapping algorithms on demand to a pool of field programmable gate array (FPGA)
logic.
DSP hardware, therefore, may be placed in three main categories, namely
programmable general-purpose DSPs, FPGA-based DSP hardware and ASIC-based
DSP hardware. In this chapter, we present a brief review of the development of
general-purpose DSPs, as well as ASIC- and FPGA-based DSP hardware. In
the next section we discuss the evolution of the programmable DSP (referred
to as DSP in the rest of the discussion). In Sections 2.3 and 2.4 we present an
overview of the ASIC-based and FPGA-based DSP structures, respectively.
The conclusion is presented in Section 2.5.
2.2 General Purpose Programmable DSP
General-purpose DSPs constitute a class of microprocessors optimized
for executing DSP functionality. These processors can handle a variety of applications
through a limited and fixed set of arithmetic and control operations organized and
sequenced in suitable programs. General-purpose DSPs are, therefore, also
referred to as programmable DSPs. They are reprogrammable in the field according to
the requirements of the applications, and are often more cost-effective than custom
hardware, particularly for low-volume applications, where the development cost of
custom ICs may be prohibitive [15]. These DSPs are, therefore, readily available and
widely applicable. In spite of their inherent inefficiency in terms of speed and
power consumption, programmable DSP processors continue to be popular due to their
flexibility and low cost. The first commercially successful programmable
DSPs were introduced in the early 1980s. First-generation DSPs such as the TMS32010
and the NEC 7720 adopted the basic Harvard architecture (Fig. 2.1), which consists of a
program memory, a data memory, a multiply-accumulate (MAC) unit and a control unit, with
separate data and program buses.
Figure 2.1: Basic Harvard architecture
Various features of the DSP architecture, e.g., the number of execution units, bus
systems, memory access for data and instructions, instruction set design, address
generation and addressing, have evolved over the last twenty-five years according to the
needs of the algorithms and applications. The need to minimize cost and energy
consumption has influenced the data word width used in DSP processors. DSPs tend
to use the shortest data words that provide adequate accuracy in their target
applications. These processors now support zero-overhead looping, since they usually
spend much of their computation time executing small sections of the program
repeatedly [14]. To allow low-cost and high-performance input and output, most DSP
processors incorporate one or more specialized serial or parallel I/O interfaces, and
streamlined I/O handling mechanisms such as low-overhead interrupts and direct
memory access (DMA) that facilitate data transfer with little or no intervention from
the processor's computational units. Today's general-purpose DSPs comprise a 32-bit
floating-point CPU, separate address generators, DMA control, SRAM memory,
and peripheral memory interfaces. The evolution of DSP processors from
conventional and enhanced-conventional designs to multi-issue architectures, namely
very long instruction word (VLIW) and superscalar processors, is described below.
2.2.1 Conventional DSP Processors
Conventional DSP processors contain a single MAC unit and an ALU, with
few execution units. They are designed to execute one MAC instruction per clock
cycle. Examples of such processors include the ADSP-21xx family, the TMS320C2xx
family, and the DSP560xx family, operating at around 20-50 MHz. Due to their low cost,
low power consumption and low memory usage, they are widely used in consumer
products, where very high performance is not essential. The performance of
conventional processors is improved by increasing the clock speed and
adding hardware and pipelining. Examples of such processors are the
DSP563xx and the TMS320C54x, which operate at 100-150 MHz and
include additional hardware such as an instruction cache and a barrel shifter to improve
speed in implementing DSP algorithms. Apart from that, such type
of processors also use instruction and arithmetic pipelines to improve instruction
throughput and reduce overall processing time. Although these processors do not
provide very high performance, they maintain low energy consumption.
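
To make the one-MAC-per-cycle figure concrete, the following minimal sketch writes an FIR filter as an explicit multiply-accumulate loop and counts the MAC operations. The function name fir_mac and the sample data are illustrative assumptions, not taken from any vendor library. On a processor that completes one MAC per clock cycle, an N-tap filter would need roughly N cycles per output sample.

```python
def fir_mac(samples, coeffs):
    """Direct-form FIR filter written as an explicit multiply-accumulate loop.

    Returns the output samples and the number of MAC operations performed.
    """
    history = [0] * len(coeffs)  # delay line, newest sample first
    output = []
    mac_count = 0
    for x in samples:
        history = [x] + history[:-1]  # shift the new sample into the delay line
        acc = 0
        for h, s in zip(coeffs, history):
            acc += h * s  # one multiply-accumulate (MAC) operation
            mac_count += 1
        output.append(acc)
    return output, mac_count


# Hypothetical 3-tap moving-sum filter applied to four input samples.
y, macs = fir_mac([1, 2, 3, 4], [1, 1, 1])
print(y, macs)
```

The inner loop is the kind of small, repeatedly executed program section that zero-overhead looping hardware is intended to speed up.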
2.2.2 Enhanced-conventional DSP Processors
Enhanced-conventional DSP processors incorporate instruction-level as well
as data parallelism so that more computation is performed in every clock cycle. They
contain parallel execution units with extra multiplier and adder circuitry, and an
extended instruction set that allows more operations to be executed in parallel.
Enhanced-conventional DSP processors, e.g., the DSP16xxx, have wider data buses to
allow more data words to be accessed per clock cycle and fed to the different
execution units for parallel operation. They also have wider instruction words to
accommodate additional parallel operations in a single instruction. The
enhanced-conventional DSP processor requires specialized and complex hardware for executing
these compound instructions, which makes it difficult to program in assembly and a poor
compiler target. DSP processors based on the multi-issue approach have therefore been
developed.
2.2.3 Multi-Issue Architecture
Multi-issue processor architectures use very simple instructions, each of which typically
encodes a single operation. These processors achieve a higher degree of parallelism by
executing instructions in parallel groups rather than one at a time. The first multi-issue
DSP, the TMS320C62xx, was introduced in 1996. At that time it was much faster
than other DSP processors. Now all the DSP processor vendors (TI, Analog Devices,
Motorola, and Lucent Technologies) use multi-issue architectures for their
high-performance processors. Two classes of architecture execute
multiple instructions in parallel: very long instruction word (VLIW) and
superscalar processors.
A VLIW DSP processor, e.g., the TMS320C62xx, has 8 independent execution
units and issues a maximum of four to eight instructions per clock cycle. The
instructions are fetched and issued as part of a long super-instruction. Superscalar
processors issue and execute two to four instructions per cycle. In a VLIW architecture,
the assembly language programmer specifies which instructions can be executed in
parallel, and the instructions are grouped accordingly at assembly time.
Superscalar processors, by contrast, contain a specialized hardware unit that determines
which instructions are executed in parallel, based on data dependencies and resource
conflicts. The burden of scheduling parallel instructions is thus shifted from the
programmer to the processor. Both VLIW and superscalar processors pay for their higher
speed with higher energy consumption than conventional DSP processors,
because they have more execution units active in parallel.
Besides, they also require wide on-chip
buses and memory banks to support data movement for parallel execution of
multiple instructions in different execution units [9,14].
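
The grouping decision described above is the same in both classes; only the place where it is made differs (at assembly time for VLIW, in hardware at run time for superscalar). The following minimal sketch illustrates one way such grouping can work: a greedy, in-order packer that closes the current group when an instruction reads or writes a register written earlier in that group, or when the issue width is reached. The names Instr and pack, the register names and the simplified conflict rule are assumptions made for illustration; real schedulers also model functional-unit availability and latencies.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str      # register written by the instruction
    srcs: tuple    # registers read by the instruction


def pack(instrs, width):
    """Greedily group instructions, in program order, into parallel issue packets."""
    packets = []
    current = []
    written = set()  # registers written by instructions in the current packet
    for ins in instrs:
        # Conflict: the instruction reads or overwrites a result produced
        # earlier in the same packet (simplified dependency check).
        conflict = ins.dest in written or any(s in written for s in ins.srcs)
        if current and (conflict or len(current) == width):
            packets.append(current)
            current, written = [], set()
        current.append(ins)
        written.add(ins.dest)
    if current:
        packets.append(current)
    return packets


# Hypothetical four-instruction program.
program = [
    Instr("r1", ("a", "b")),    # r1 = a * b
    Instr("r2", ("c", "d")),    # r2 = c * d
    Instr("r3", ("r1", "r2")),  # r3 = r1 + r2, depends on both results above
    Instr("r4", ("e", "f")),    # r4 = e * f, independent
]
packets = pack(program, width=4)
print([len(p) for p in packets])
```

With an issue width of 4 the program needs two packets because of the dependency on r1 and r2, whereas a single-issue machine (width=1) would need four.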
2.3 ASIC-based DSP Hardware
The ASIC-based system provides another alternative for hardware
implementation of DSP, one that is tailored for optimal implementation of specific
signal processing functions. Considerable advances have taken place over the past few
decades in the field of microelectronics, and it has therefore become possible to realize a
complete printed circuit board on a single chip. ASICs are the key components in the
development of systems-on-chip. An ASIC contains circuit blocks that are
specialized for a given application or application domain. Owing to the customized
design, it is possible to put more functionality in ASICs with better
performance and lower power consumption. ASICs and other semiconductor chips
with ASIC blocks are therefore widely used in space applications, defense
applications and consumer products. ASIC architectures fall into
two basic categories: one is based on standard cells, the other
on gate arrays. The two categories differ widely in terms of their manufacturing
techniques, cost and development time. Gate arrays consist of
rows and columns of regular transistor structures, where each basic cell, or gate,
consists of a small number of unconnected transistors. In gate arrays,
the connections are determined completely by the design to be implemented. The
transistors are first connected together to realize low-level functions, and the low-level
functions are then routed and connected to build higher-level functions.
Standard-cell ASICs, on the other hand, are designed using transistors which are already
connected together and routed to form higher-level functions such as flip-flops,
adders and counters. ASIC designers connect these cells together to
implement the desired functions. Standard-cell ASICs are more
customizable, achieve higher utilization of chip area and have a smaller die size than
the equivalent gate-array implementation, but involve a high non-recurring engineering
(NRE) cost and a long turnaround time.
Most DSP ASICs use a fixed-point numeric format, because floating-point
arithmetic is more complex and requires more silicon area than
fixed-point arithmetic. The precision and range of the numeric format used in the design,
however, affect the behavior of the system in the following three ways:
(i) Functionality: The frequency response of the system changes because the locations of
the poles and zeros of the filter change when the filter coefficients are
quantized.
(ii) Quantization noise: Finite-precision arithmetic introduces quantization noise at
the output of the system due to truncation or rounding after each
arithmetic operation.
(iii) Overflow: The output of the system may be distorted by overflow, which may
occur because a finite number of bits is used to represent signals
and state variables.
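
The three effects listed above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration under assumed word lengths (8 or 15 fractional bits, 16-bit two's-complement words); the helper names quantize, wrap16 and saturate16 are hypothetical and do not belong to any particular CAD tool.

```python
def quantize(value, frac_bits=15):
    """Round a real value to the nearest multiple of 2**-frac_bits.

    Python's round() resolves exact ties to the even neighbour.
    """
    scale = 1 << frac_bits
    return round(value * scale) / scale


def wrap16(n):
    """Interpret an integer modulo 2**16 as a 16-bit two's-complement word."""
    n &= 0xFFFF
    return n - 0x10000 if n & 0x8000 else n


def saturate16(n):
    """Clamp an integer to the 16-bit two's-complement range."""
    return max(-32768, min(32767, n))


# (i) Functionality: a first-order filter y[n] = a*y[n-1] + x[n] has its pole at a.
# With only 8 fractional bits, a = 0.999 is rounded onto the unit circle.
pole_8bit = quantize(0.999, 8)

# (ii) Quantization noise: the rounding error is at most half of one
# least-significant bit, i.e. 2**-16 for 15 fractional bits.
rounding_error = abs(quantize(0.1) - 0.1)

# (iii) Overflow: 30000 + 10000 does not fit in a 16-bit word.
wrapped = wrap16(30000 + 10000)
saturated = saturate16(30000 + 10000)

print(pole_8bit, rounding_error, wrapped, saturated)
```

In this example the quantized pole lands exactly on 1.0, so a filter that was stable with the ideal coefficient becomes marginally stable, and the wrapped sum changes sign while the saturated sum merely clips.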
CAD tools are available these days to analyze the effects of fixed-point
arithmetic on the behavior of a system, and to optimize the selection of fixed-point
numeric formats. One can simulate the design using appropriate test signals and
analyze the fidelity of the output signals. Filter design packages can
also be used to evaluate the behavior of a filter implemented using different fixed-point
numeric formats. DSP ASIC designers also have the option to customize
their designs to best suit constraints and requirements such as area, speed, power
consumption, production cost and design cycle time. The most
important characteristic to be determined during the hardware architecture
development stage is the level of parallelism needed to satisfy the system performance
and power consumption requirements, which in turn depends on the computational
requirements and sampling rates of the application. The
underlying hardware structure then follows directly from the degree of parallelism in the
algorithm and the architecture. Hardware synthesis tools are quite
useful for the design of hardware architectures. In the first pass, such synthesis tools
are used to generate implementations of components to assist in the evaluation of
candidate architectures. In the second pass, higher-level synthesis tools are used for the
synthesis of the selected hardware architectures. These implementations are then passed
through a series of verification and evaluation steps to ascertain the desired functionality
and performance, feasibility of implementation and cost.
2.4 FPGA-based DSP Hardware
In the case of ASICs, designers need to wait weeks for the delivery of the
finished products, but FPGA designers can realize a design on FPGA chips by
themselves in minutes. An FPGA is a field-programmable logic device
comprising an array of uncommitted circuit elements, called logic blocks, and
programmable interconnect resources. These interconnects allow the end user to
reconfigure the FPGA for repeated reuse. This can be an advantage in applications
that need multiple trial versions within a development cycle. FPGAs are significantly
more expensive than DSPs, but can deliver higher performance in specific
applications, with lower power consumption than programmable
DSPs. FPGAs are mostly used these days for testing, rapid prototyping and low-volume
applications. In an FPGA, the digital circuits are programmed by means of a
bit-stream that completely specifies the logical functions and connectivity to be
implemented [16]. With static random-access memory (SRAM) based devices, it is
possible to reprogram the FPGA as many times as needed. The same silicon
resources in an FPGA can thus be reused for a wide range of DSP functions. FPGAs
gain this flexibility over ASICs at the cost of higher power
consumption, larger die size and slower operation.
FPGAs from different vendors differ in the organization of their
programmable logic devices (PLDs), logic gates, random-access memory, etc. In
spite of these architectural differences, FPGAs from all vendors
consist of an array of logical units distributed across a grid of programmable
interconnect. In a broad sense, FPGAs can be classified into two categories: coarse-grained
and fine-grained. Coarse-grained FPGAs contain a relatively small
number of more powerful logical units, while fine-grained FPGAs contain
a relatively large number of less powerful, elementary logic blocks. Most of the
popular FPGAs are coarse-grained and use logical units based on
look-up tables (LUTs). The Xilinx 4000 series family of FPGAs is built from logical
units called configurable logic blocks (CLBs). Each CLB consists of two 4-input,
1-output LUTs, one 3-input, 1-output LUT, two flip-flops, and some multiplexers for
selecting the appropriate output from the LUTs or flip-flops. Each CLB of the Xilinx 4K
series can be used to implement a two-bit adder or a nine-bit parity checker. A fine-grained
FPGA such as the Xilinx 6200 is built from cells consisting of two 4-to-1 multiplexers and
three 2-to-1 multiplexers, which can implement any two-input function or single-bit storage. A
coarse-grained FPGA such as the Xilinx 4K series can also be used as a form of distributed
memory in addition to logic resources; e.g., each LUT can be used to implement a
16x1 RAM/ROM. The programmable routing consists of fixed-length metal segments
interconnected by programmable switches, which connect the LUTs, memory blocks and flip-flops.
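
A look-up table is simply a small memory addressed by its inputs, which is why a 4-input LUT can equally serve as a 16x1 ROM. The following minimal sketch models LUTs as truth tables and combines two 4-input LUTs and one 3-input LUT into a nine-bit parity checker, mirroring the CLB resources listed above. The helper names make_lut, lut_read and parity9, and the particular wiring, are assumptions chosen for illustration; the sketch does not reproduce the actual internal connections of a Xilinx CLB.

```python
def make_lut(func, n_inputs):
    """Build the truth table (2**n_inputs one-bit entries) of a Boolean function."""
    return [
        func(*[(addr >> i) & 1 for i in range(n_inputs)]) & 1
        for addr in range(1 << n_inputs)
    ]


def lut_read(table, *bits):
    """Read the LUT entry addressed by the input bits (bit 0 is the first argument)."""
    addr = sum(b << i for i, b in enumerate(bits))
    return table[addr]


# Two 4-input LUTs and one 3-input LUT, each configured as an XOR (parity) function.
xor4 = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d, 4)  # 16 entries: a 16x1 ROM
xor3 = make_lut(lambda a, b, c: a ^ b ^ c, 3)         # 8 entries


def parity9(bits):
    """Nine-bit parity: each 4-input LUT handles four bits, the 3-input LUT merges."""
    f = lut_read(xor4, *bits[0:4])
    g = lut_read(xor4, *bits[4:8])
    return lut_read(xor3, f, g, bits[8])


print(len(xor4), parity9([1, 0, 1, 1, 0, 0, 0, 1, 1]))
```

Reprogramming the device amounts to loading different table contents, which is what the configuration bit-stream does for every LUT on the chip.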
Until recently, however, FPGAs have rarely been used to implement DSP
tasks, as they were more power hungry and lacked the gate capacity to
handle DSP algorithms. This may be changing with the introduction of new
DSP-oriented products from various FPGA vendors. Altera's Stratix family and
Xilinx's Virtex-II family both offer significant DSP-oriented architectural
enhancements. For example, both products offer hardwired on-chip multipliers
embedded throughout the reconfigurable logic array, which are intended to accelerate
multiply-accumulate (MAC) operations. By including some hardwired processing
elements, FPGAs are improving their energy efficiency and cost performance while
offering higher speed performance. The computational requirements for real-time
DSP applications often exceed the performance available from even the fastest DSP
processors. The new breed of DSP-enhanced FPGA, therefore, offers a potentially
attractive solution for various real-time DSP applications.
2.5 Conclusion
In this chapter we have discussed the three main categories of DSP hardware,
namely programmable general-purpose DSPs, FPGA-based DSP hardware and ASIC-based
DSP hardware. DSPs are potentially reprogrammable in the field, allowing
product upgrades, and are often more cost-effective than custom hardware,
particularly for low-volume applications, where the development cost of custom ICs
may be prohibitive. Programmable DSPs are therefore widely used in numerous
applications. Modern DSP architectures contain multiple buses and multiple
independent execution units, which can be operated in parallel and pipelined
to perform multiple tasks in one cycle for faster realization of DSP algorithms.
They also provide efficient access to memory for instructions as well as data by
dividing the memory into a number of banks. We have discussed the evolution of
DSP processors from conventional and enhanced-conventional designs to multi-issue
architectures, namely very long instruction word (VLIW) and superscalar processors, and
reconfigurable DSP accelerators. Implementation of digital signal processing applications typically
requires chips with very high number-crunching capabilities and very high throughput
that is not met even by the fastest programmable DSP. Very often the
applications also place stringent constraints on power consumption. Signal processing
tasks in such situations are usually carried out by ASICs. ASICs can achieve high
levels of performance with hard-to-match energy efficiency and minimum silicon
area, but they require massive design effort. ASIC-based hardware, however, does not
offer the flexibility to serve more than one application or to support product
upgrades as technology evolves, because once the design and fabrication
process is completed, its functionality cannot be altered. ASICs involve the longest
design time and very high non-recurring engineering cost, and they are therefore
currently used only for high-volume markets and for time-critical and military applications.
FPGAs are thus becoming more popular these days. FPGAs possess a similar
capability to ASICs in providing specific circuits for a given DSP application, but
differ basically in terms of their internal connectivity. Unlike dedicated hardware, the
FPGA can be time-shared between algorithms by simply reloading the configuration
code. In contrast to a programmable DSP, the FPGA actually assumes the logic
design required for implementing an algorithm instead of executing a sequence of
instructions on predefined hardware resources. Properly executed FPGA designs
typically outperform a DSP microprocessor by a factor of 100:1, and by more than
1000:1 in special circumstances. Power dissipation of an FPGA DSP design is
typically about 20% of that of a microprocessor-based design working at the same sample
rate. It is not hard to see how the configurability of FPGAs makes them ideal for
customized but reconfigurable logic that can execute specific, compute-intensive
algorithms utilizing the massive parallelism in DSP operations.