Вы находитесь на странице: 1из 24

Architectures of DSP Processors

Objectives:
 To understand the features of the DSP processors.
 To understand the architectures, bus structures and memories of DSPs.
 To understand special addressing modes and on-chip peripherals.
 To study TMS320C5X Processors.

Programmable DSPs (P-DSPs):


Digital Signal Processing is an important branch of Electronics and
Telecommunication engineering that deals with the improvisation of reliability
and accuracy of the digital communication by employing multiple techniques. It
provides an alternative method for processing the analog signal as illustrated in
the below figure.

Fig.1: Digital Signal Processing System.

The Programmable DSPs (P-DSPs) are specially designed for digital


signal processing applications. The main components of P-DSPs are:
Multiplier & Multiplier Accumulator (MAC):
The multiply–accumulate operation is a common step that computes the
product of two numbers and adds that product to an accumulator. The hardware
unit that performs this operation is known as a multiplier–accumulator (MAC,
or MAC unit), it requires array multiplication. The multiplication as well as
accumulating to be carried out using hardware elements by two ways:
 A dedicated MAC unit implemented in hardware which has integrated
multiplier and accumulator in a single hardware unit.
 Use of multiplier and accumulator separately.

Fig.2: MAC Unit


The Processor Architecture:
There are mainly two types of architectures of microprocessor:
Von Neumann Architecture
In this architecture a single address bus and a single data bus for
accessing the program as well as data memory area.

Fig.3: Von Neumann architecture.


So if MACD (MAC data) instruction is to be executed in a machine with
this architecture it requires four clock cycles. That is due to a single address and
data bus.
Harvard Architecture:
In this architecture there are two separate buses for the program
and data memory.

Fig.4: Harvard architecture.

Hence the content of program memory and data memory can be accessed
in parallel. The instruction code can be fed from the program memory to the
control unit while the operand is fed to the processing unit from the data
memory. The processing unit consist of the registers and processing elements
such as MAC units, multiplier, ALU, Shifters etc.

The P-DSP follows the modified Harvard Architecture


In this architecture one set of bus is used to access a memory that has
both program and data and another that has data alone. Data can also be
transferred from one memory to another.
Fig.5: Modified Harvard architecture.
` This modified Harvard Architecture is used in several P-DSPs e.g. P-
DSPs from Texas Instruments and analog devices.

Memory for P-DSPs


1. Multiple Access Memory:
The number of memory accesses/clock period can be increased by
using a high-speed memory that permits more than one memory access/clock
period.
E.g. The DARAM (dual access RAM) permits two memory access/clock
period. Multiple accesses may be connected to the processing units of the P-
DSPs by using the Harvard Architecture.

2. Multipored Memory:
The dual port memory has two independent data and address buses as
shown in the following fig.
Fig.6 : Multiported Memory.
Two memory access is can be achieved in a clock period. Multi-
ported memory dispense with the need for storing the program and data in two
different memory chips in order to permit simultaneous access to both data and
program memory.
E.g. Motorola DSP561XX processor has a single ported program memory and
a dual ported data memory.

VLIW Architecture
The VLIW is accessed from the memory and is used to specify the
operands and operations to be performed by each of the functional units. The
architecture of VLIW is shown in Fig 7.

Fig.7: VLIW architecture.


Multi-ported Register File: Fetches and stores operands.
Read/Write Cross Bar: Enables parallel random access by the functional units
to the register file.
Functional units: Consists of no. of ALUs, MACsand Shifters etc. for
simultaneous processing.
Execution is concurrent with load/store operations between RAM and register
file.
The performance depends on the degree of parallelism in the algorithm of
particular DSP application and also on no. of functional units.

Pipelining
Pipelining is an approach to speed up the execution. Execution of
an instruction may be split into some micro instruction phase and each is
executed simultaneously. The four phases of the pipeline structure and their
functions are as follows:
1) Fetch (F):
This phase fetches the instruction words from memory and updates the
program counter (PC).
2) Decode (D):
This phase decodes the instruction word and performs address generation
and ARAU updates of auxiliary registers.
3) Read (R):
This phase reads operands from memory, if required. If the instruction
uses indirect addressing mode, it will read the memory location pointed at by
the ARP before the update of the previous decode phase.
4) Execute (E):
This phase performs any specify operation, and, if required, writes
results of a previous operation to memory.
In a processor with no pipeling, each functional unit is busy only 25% of the
time as shown in Fig.8. It takes 12 cycles to execute four instructions.

Value of T Fetch Decode Read Execute


1 I1
2 I1
3 I1
4 I1
5 I2
6 I2
7 I2
8 I2
9 I3
10 I3
11 I3
12 I3

Fig.8 : Without Pipeline

On a processor with pipeline it takes only 7 cycles to execute four instuctions as


shown in Fig. 9. To execute N instructions on a processor with pipeline depth
K, it takes K+N-1 cycles.

Value of T Fetch Decode Read Execute


1 I1
2 I2 I1
3 I3 I2 I1
4 I4 I3 I2 I1
5 I5 I4 I3 I2
6 I6 I5 I4 I3
7 I7 I6 I5 I4
8 I8 I7 I6 I5
9 I9 I8 I7 I6
10 I9 I8 I7
11 I9 I8
12 I9

Fig.9 : With Pipeline


The Throughput is defined as the ratio of Number of instructions to Total time
to complete the instructions.
Addressing Modes in P-DSPs:
The P-DSPs have special addressing modes that permit single word
instruction format and thereby speed up the execution by making effective use
of the instruction pipeline.
1. Short Immediate Addressing: This permits the operand to be specified
using a short constant that forms parts of a single word instruction. The length
of the short constant depends on the instruction type and P-DSP. For example in
TMS320C5X an 8-bit constant can be specified as one of the operands in the
single word instructions
2. Short Direct Addressing: This permits the lower order address of the
operand of an instruction to be specified in the single word instruction. The
argument in the instruction specifies only the location within the current page.
For example in Motorola DSP5600X, short direct memory addressing permits a
6-bit address to be specified in the instruction.
3. Memory mapped Addressing: The CPU registers and the I/O registers of
the P-DSP are also accessible as memory location. This is achieved by storing
them in starting page or the final page of the memory space. When the registers
are accessed in this type of addressing mode, the higher address bits are made to
be 0 in case of TI DSP’s and made to 1 in Motorola DSPs. For example, in
TMS320C5X, page 0 corresponds to the CPU registers and I/O registers
4. Indirect Addressing: This permits an array of data to be processed in P-DSP
to be efficiently fetched and stored. The address of the operands can be stored in
one of the registers called indirect address registers. These are also called as
auxiliary registers in TI processors.
5. Bit Reversed Addressing: In the bit reversed addressing mode, when a 16-
point FFT is to be computed 2-point DFT of X(0) and X(8) is to be found.
Similarly 2-point DFT of X (4) and X (12) and so on. In bit reversed addressing
mode the address is incremented or decremented by the number represented in
the bit reversed form.
6. Circular Addressing: In this mode the memory can be organized as a
circular buffer with the beginning memory address and the ending memory
address corresponding to this buffer defined by programmer. In this mode when
the address pointer is incremented, the address will be checked with the ending
memory address of the circular buffer. If it exceeds that, the address will be
made equal to the beginning address of the circular buffer.

On-Chip Peripherals

1. On-Chip timer: Two of the common applications of the timers are


generation of periodic interrupts to the P-DSPs and generation of the sampling
clocks for the A/D convertors. The timers can generate single bus or a periodic
train pulses and also single wave or a periodic square wave.
2. Serial port: This enables the data communication between the P-DSP and an
external peripheral such as RS232 C device. These ports normally have input
and output buffers so that the P-DSP writes or reads from the serial port in
parallel form and the serial port sends and receives data to the peripherals in
serial form. The shift clock can be fed either from the P-DSP or an external
device can supply it.
3. TDM serial port: This permits a P-DSP to communicate with other devices
or P-DSPs using time division multiplexing (TDM). One of the devices generate
the frame sync pulse that indicates the beginning of a TDM frame and bit clock,
the bit for which a bit is to be transmitted. The TDM frame is split into a
number of equal slots and each slot can be allotted for one of the devices.

Ch 1 Ch 2 Ch 3 Ch 4 Ch 5 Ch 6 Ch 7 Ch 8

Fig 10: TDM frame with 8 time slots.


4. Parallel port: These ports enable communication between the P-DSP and
other devices to be faster compared to the serial communication by using a
number lines in parallel. The P-DSPs have two approaches for assigning lines
for parallel port. In one approach used by the TI, the data bus itself is used for
parallel ports. In another approach separate lines are dedicated for parallel ports
including the handshaking signals.
5. Bit I/O ports: The P-DSPs have additional I/O ports that are single bit wide.
These ports may be individually set, reset or read. These bits are normally used
for control purposes but that can also be used for data transfer. Some of these
bits are used for conditional branching or calls.
6. Host port: The P-DSPs also have a special port normally 8-bit or 16-bit wide
called the host port that enables them to communicate with a microprocessor or
PC, which is called as host. The host can generate interrupts and also cause the
P-DSP to load program from ROM to the RAM on reset.
7. Comm ports: These are parallel ports that are used for inter processor
communication between a number of identical P-DSP in a multiprocessor
system.
8. On-chip A/D and D/A Converters: Some of the P-DSPs targeted towards
voice applications such as cellular telephones and tapeless answering machines
have A/D and D/A converters inside the P-DSP. For example, Motorola
DSP561XX and Analog devices ADSP21MSP5X both have the A/D and D/A
on chip and permit effective sampling rates of about 8kHz.
P-DSPs with RISC and CISC: P-DSPs may be implemented using either the
RISC processor or the CISC processor. For example, TITMS320C6X P-DSPs
uses RISC processor and a large number of P-DSPs from Analog devices,
Motorola and TI make use of CISC. For example, TI TMS32054X and the
Motorola DSP563XX and analog devices ADSP 2100X makes use of CISC.TI
TMS32054X has a RISC and 4 P-DSPs with CISC in a single core.
RISC Advantages:
The chip area dedicated to the realisation of the control unit is considerably
reduced because of the reduced number of instructions.
In RISC about 20% of the chip area is used for the control unit whereas in CISC
30-40% of the chip area is used.
The throughput of the processor can be increased by applying pipelining and
parallel processing since there is a considerable reduction in the control area.
In RISC, all the instructions are of uniform length and take the same time for
execution. Hence the dummy periods or hold periods in the instruction pipeline
is reduced to minimum which increases the computational speed.
A simpler and small control unit in RISC has fewer gates. This reduces the
propagation delay and increases the speed. Reduced number of instructions,
formats and addressing modes result in simple and smaller decoder, which in
turn increases the speed.
Writing the programs in C and C++ relives the programmer from learning the
instruction set of a P-DSP and instead concentrate on the application.
CISC Advantages:
The CISC processors have a very rich instruction set that even support high
level language construct similar to “if condition true then do”, ”for”, and
“while” .The P-DSPs with CISC also have instructions specifically required for
DSP applications such as MACD, FIRS, etc.
Since the RISC has a smaller number of instructions implementation of a single
CISC instruction might require a number of instructions in RISC.
This increases the memory required for storing the program and the traffic
between CPU and memory increased which increases the computation time and
also makes the program difficult to debug.
For most of the low cost applications DSP platforms without the compilers are
preferred. Hence a majority of P-DSPs are CISC based.
The relative disadvantage of each of these architectures are diminishing. By
making the RISC processors applicable for larger and larger applications, the
cost of the chip and compiler costs are being brought down.
The code composer studio from TI permits the programming in HLL as well as
in assembly level language in a single development environment so that the best
features of both the HLL and assembly language programming can be used by
the programmer.

TMS320C5X:

The ’C5x' is designed to execute up to 50 million instructions per


second (MIPS). Spin-off devices that combine the ’C5x CPU with customized
on-chip memory and peripheral configurations may developed for special
applications in the worldwide electronics market.
Advantages:
The ’C5x devices offer these advantages:
1. Enhanced TMS320 architectural design for increased performance and
versatility.
2. Modular architectural design for fast development of spin-off devices.
3. Advanced integrated-circuit processing technology for increased performance
and low power consumption.
4. Source code compatibility with ’C1x, ’C2x, and ’C2xx DSPs for fast and
easy performance upgrades.
5. Enhanced instruction set for faster algorithms and for optimized high-level
language operation.
6. Reduced power consumption and increased radiation hardness because of
new static design techniques.
Architecture:

The architectural structure of the C5x, consists of the buses, on-chip


memory, and central processing unit (CPU), and on-chip peripherals. The C5x
uses an advanced, modified Harvard type architecture. It maximizes the
processing power with separate buses for program memory and data memory.
The instruction set supports data transfers between the two memory spaces.

Fig 11:Internal architecture of C5x


The C5x architectural overview consists of
1. Bus Structure
2. Central Processing Unit (CPU)
3. On-Chip Memory
4. On-Chip Peripherals.
5. Test/Emulation
1. Bus Structure:
Separate program and data buses allow simultaneous access to program
instructions and data, providing a high degree of parallelism.

The C5x architecture is built around four buses.


 Program bus (PB).
 Program address bus (PAB).
 Data read bus (DB)
 Data read address bus (DAB).

Program bus:
The PB carries the instruction code and immediate operands from
program memory space to the CPU.
b. Program address bus (PAB):
The PAB provides addresses to program memory space for both reads
and writes.
c. Data read bus (DB):
The DB interconnects various elements of the CPU to data memory space
d. Data read address bus (DAB):
The DAB allows the CPU to read from and write to DARAM in the same
machine cycle.
2. Central Processing Unit (CPU):
The C5x CPU consists of the elements:
Central arithmetic logic unit (CALU)
Program logic unit (PLU)
Auxiliary register arithmetic unit (ARAU)
Memory mapped registers
Program controller.

Central Arithmetic Logic Unit (CALU)


The CPU uses the CALU to perform 2s-complement arithmetic.
The CALU consists of:
1. 16-bit X16-bit multiplier
2. 32-bit arithmetic logic unit (ALU)
3. 32-bit accumulator (ACC)
4. 32-bit accumulator buffer (ACCB)
5. Additional shifters at the outputs of both the accumulator and
the product register (PREG)

Parallel Logic Unit (PLU):


It performs Boolean Operations.
The CPU includes an independent PLU, which operates separately from, but in
parallel with, the ALU.
PLU can set, clear, test, or toggle bits in a status register, control register, or any
data memory location.
It provides a direct logic operation path to data memory values without affecting
the contents of the ACC or PREG.
Auxiliary Register Arithmetic Unit (ARAU):
The CPU includes an unsigned 16-bit arithmetic logic unit that calculates
indirect addresses by using inputs from the auxiliary registers (AR0-AR7),
index register (INDX), and auxiliary register compare register (ARCR).
The ARAU can auto index the current AR while the data memory location is
being addressed and can index either by ±1 or by the contents of the INDX

Memory-Mapped Registers:
The C5x has 96 registers mapped into page 0 of the data memory space.
They can be accessed in the same way as any other data memory location.
The memory-mapped registers are used for indirect data address pointers,
temporary storage, CPU status and control, or integer arithmetic processing
through the ARAU.

Program Controller:
The program controller contains logic circuitry that decodes the operational
instructions.
The program controller consists of the elements:
Program counter
Status and control register.
Hardware stack.
Address generation logic.
Instruction register.

3. On-Chip Memory
The C5x architecture contains a considerable amount of on-chip
memory to aid in system performance and integration:

Program read-only memory (ROM)


Data/program dual-access RAM (DARAM)
Data/program single-access RAM (SARAM)
Program ROM
All ’C5x DSPs carry a 16-bit on-chip maskable programmable ROM.
The C50 and C57S DSPs have boot loader code resident in the on-chip
ROM,C5x DSPs offer the boot loader code as an option.
This memory is used for booting program code from slower external ROM or
EPROM to fast on-chip or external RAM.

Data/Program Dual-Access RAM


The DARAM is primarily intended to store data values but, when needed, can
be used to store programs as well.
All C5x DSPs carry a 1056-word X 16-bit on-chip dual-access RAM
(DARAM).
The DARAM is divided into three individually selectable memory blocks:
B0 -512-word data or program DARAM block
B1 -512-word data DARAM block
B2- 32-word data DARAM block B1 and B2 are always configured as data
memory
B0 can be configured as data or program memory
The DARAM can be configured in one of two ways:
All 1056 words x16 bits configured as data memory.
544 words x16 bits configured as data memory and 512 words × 16 bits
configured as program memory

Data/Program Single-Access RAM


All C5x DSPs except the ’C52 carry a 16-bit on-chip single-access
RAM (SARAM) of various sizes.
The SARAM can be configured by software in three ways:
1. All SARAM configured as data memory
2. All SARAM configured as program memory
3. SARAM configured as both data memory and program memory
4. One SARAM block can be accessed only once per machine cycle.
5. SARAM supports more flexible address mapping than DARAM. (As it
supports both data memory and program memory simultaneously)
6. The instruction fetch and data fetch takes two machine cycles with SARAM.
7. The SARAM is divided into 1K- and/or 2K-word blocks in address memory
space.

4.On-Chip Peripherals
C5x DSPs have different on chip peripherals connected to their CPUs.
The ’C5x DSP on-chip peripherals available are:

Clock generator
Hardware timer
Software-programmable wait-state generators
Parallel I/O ports
Serial port
Buffered serial port (BSP)
Time-division multiplexed (TDM) serial port
User-maskable interrupts
Host port interface (HPI).

Clock Generator
The clock generator consists of an internal oscillator and a phase-locked
loop(PLL) circuit.

It can be driven internally by a crystal resonator circuit or driven externally by a


clock source.
The PLL circuit can generate an internal CPU clock by multiplying the clock
source by a specific factor, so you can use a clock source with a lower
frequency than that of the CPU.

Hardware Timer

It is a 16-bit hardware timer with a 4-bit prescaler.


This programmable timer clocks at a rate that is between 1/2 and 1/32 of the
machine cycle rate(CLKOUT1), depending upon the timer’s divide-down ratio.
The timer can be stopped, restarted, reset, or disabled by specific status bits.

Software-Programmable Wait-State Generators


Software-programmable wait-state logic in C5x DSPs allows wait-state
generation without any external hardware for interfacing with slower off-chip
memory and I/O devices.
.This feature consists of multiple wait state generating circuits.
Each circuit is user-programmable to operate in different wait states for off-chip
memory accesses.

Parallel I/O Ports:

There are 64K I/O ports,16 ports- memory mapped in data memory
space. Each of the I/O ports can be addressed by the IN or the OUT instruction.
The memory-mapped I/O ports can be accessed with any instruction that reads
from or writes to data memory. The ’C5x can easily interface with external I/O
devices through the I/O ports while requiring minimal off-chip address
decoding circuits.
Host Port Interface (HPI)
The host port interface (HPI) is an 8-bit parallel port used to interface a
host device or host processor to the ’C5. The HPI is designed to interface to the
host device as a peripheral, with the host device as master of the interface,
therefore greatly facilitating ease of access by the host.

Serial Port
Three different kinds of serial ports :
1. General-purpose serial port.
2. Time division multiplexed (TDM).
3. Buffered serial port (BSP).

Operates up to one fourth the machine cycle rate (CLKOUT1).


Data is framed either as bytes or as words.

Buffered Serial Port (BSP):


The BSP is a full-duplexed, double buffered serial port and an auto
buffering unit (ABU). The full duplex BSP serial interface provides direct
communication with serial devices such as codecs, serial A/D converters.
The double-buffered BSP allows transfer of a continuous communication
stream in
8-,10-,12- or 16-bit data packets
The BSP provides flexibility on the data stream length.

TDM Serial Port


The TDM is a full duplexed serial port.
The TDM serial port allows the ’C5x device to communicate serially with up to
seven other devices.
The TDM port, therefore, provides a simple and efficient interface for
multiprocessing applications.
Hence it is commonly used in multiprocessor applications.
User-maskable Interrupts
Four external interrupt lines (INT1–INT4) and five internal interrupts, a timer
interrupt and four serial port interrupts, are user maskable.
When an interrupt service routine (ISR) is executed, the contents of the program
counter are saved on an 8-levelhardware stack, and the contents of eleven
specific CPU registers are automatically saved on a 1-level-deep stack.

Important questions:
1. What is MAC? Explain its operation in detail with neat diagram.
2. Discuss in detail the basic architectural features of programmable DSP
devices.
3. Explain memory address schemes in DSPs.
4. Discuss in detail the pipeline operation of TMS320C54xx processor.
5. What are the various addressing modes used in TMS320C5x processor?
6. Draw and explain the major blocks diagram of TMS320Cx.
7. Explain the on-chip peripherals in TMS320Cx processor.
8. Explain the different types of interrupts in TMS320C54xx processor.

Multiple Choice Questions:

1. The features in which P-DSP is superior to advanced microprocessors


is___________.
a.Low cost b. Low power c. Computational speed d.Realtime I/O capability
Ans:d
2.In modified Harvard architecture for fetching the content of program and data
memory, a separate bus is used for __________ memory and a single bus is
used for ___________ memory. Ans: Data, Program

3. Number of memory accesses/clock/period that can be achieved using on chip


DARAM of a P-DSP is _________.

a.1 b.2 c.3 d.4 Ans: b

4.VLIW architecture differs from conventional P-DSP in which of the following


aspects:
a. Instruction cache
b. Number of functional units
c. Use pipelining
d. A single word fetched from memory has a number of instructions
Ans: b

5. A DSP has four pipeline stages and uses four phase clock. The number of
clock cycles required for executing a program with 25 instruction is
____________
a.29 b.28 c.25 d.26 Ans: b

6. The addressing mode that is convenient for FFT computation is________


a.Indirect addressing
b.Circular mode
c. Bit reversed addressing
d. Memory mapped Ans: c

7. Which of the following characteristics are true for a RISC processor?


a.Smaller control unit
b.Small instruction set
c.Short program length
d.Less traffic between CPU & memory Ans: b
8. The central ALU of C5X have _________ bit ALU and one of the operands
for the ALU operation comes from __________
a. 32, ACC b. 16, ACC c. 32, ACCB d. 16, ACCB Ans: a

9. The ______ bit register used for temporary storage of accumulator is


a. 32, PREG b. 32, ACCB c. 16, TREG0d. 32, ACC Ans: b
10. The _________ permits execution of logical operations on data without
affecting the contents of ACC
a. Parallel logic unit b. Auxiliary ALU c. Central ALU d. MAC Ans: a
11. The register that stores multiplicand is _______ and is _____ bit wide.
a. PREG, 32 b. PREG, 16 c. TREG0, 16 d. TREG0, 32 Ans: c
12. _________ holds the result of multiplication and is ______ bit wide.
a. PREG, 32 b. PREG, 16 c. TREG0, 16d. TREG0, 32Ans: a
13. ___________ Permits the contents of memory to be left shifted by 0-16 bits
before they are fed to ALU or stored from ALU to memory.
a. Scaling shifter b. ALU c. PLU d. Auxiliary ALU Ans: a
14. In the hardware stack of C5X processors ________ , _________ bit
numbers can be stored.
a. 16, 16 b. 16, 8 c. 8, 8 d. 8, 16 Ans: a
15. The RAM configuration control bit that indicates whether the on-chip
reconfigurable dual access RAM is mapped to data space or program space is
______________
a. PM b. CNF c. HM d. XF Ans: b
References
1.Digital Signal Processors Architecture, Programming and Applications.
B.Venkataramani, N.Bhaskar
2. Digital Signal Processing: Principles, Algorithms, and Applications
John G Proakis, Dimitris G. Manolakis