Chapter 3
TMS320C6x Programming

Introduction
In this chapter, programming the TMS320C6x in assembly, linear
assembly, and C is introduced. Preference is given to
explaining code development for the DSK memory map. The
material presented in this chapter is based on the course
notes from TI's C6000 4-day design workshop [1].
Programming Alternatives

    Source code   Tool                              Efficiency*   Effort
    C             Compiler optimizer (intrinsics)   70-80%        Low
    Linear ASM    Assembly optimizer                95-100%       Medium
    ASM           Hand optimize                     100%          High

* Typical efficiency versus hand optimized assembly; see TI benchmarks for more information
1. TMS320C6000 DSP Design Workshop, Revision 4.0, June 2000.

ECE 5655/4655 Real-Time DSP
Chapter 3 TMS320C6x Programming
Introduction to Assembly Language Programming

A Dot Product Example
Recall the C6000 block diagram
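[Figure: C6000 block diagram. The CPU contains the functional units .L1/.L2, .S1/.S2, .M1/.M2, and .D1/.D2, register files A (A0-A15) and B (B0-B15), and the control registers. Internal buses connect the CPU to program RAM, data RAM, the DMA, the EMIF (external sync and async memory), the serial port, the host port, boot load, timers, and power down logic.]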
To motivate this introduction to assembly programming, consider a basic sum of products or dot product example

$$ y = \sum_{n=1}^{40} a_n x_n \tag{3.1} $$
Assembly instructions will initially be shown only with limited detail

In a later section the details of putting together an actual
assembly file will be given
The core of this algorithm is multiplication and addition
To multiply we use the .M (multiply) unit

    MPY   .M   a, x, prod

As shown here, MPY performs a 16-bit by 16-bit multiply, which gives a
32-bit result
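The operand and result widths of MPY can be checked on a host machine with a small C model. This is only a sketch, assuming MPY multiplies the signed lower 16 bits of each source register; the names sext16 and mpy16 are made up for this illustration and are not part of any TI tool or library.

```c
#include <stdint.h>

/* Sign-extend the lower 16 bits of v to a 32-bit signed value. */
static int32_t sext16(uint32_t v)
{
    v &= 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Hypothetical model of MPY: 16-bit x 16-bit signed multiply with a
   32-bit result. Only the lower half of each source register is used. */
int32_t mpy16(uint32_t src1, uint32_t src2)
{
    return sext16(src1) * sext16(src2);
}
```

For example, mpy16(0x00010002, 3) gives 6, since the upper half of the first operand is ignored.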
To add or accumulate we use the .L (logical) unit

    MPY   .M   a, x, prod
    ADD   .L   Y, prod, Y

Where are the variables stored?
Note that we need to store the working variables in a register
file; the C6000 has two, but for now we will just use the A
side
We now rewrite the code to include the actual register names

    Register File A (32 bits wide)
    A0    a
    A1    x
    A2
    A3    prod
    A4    Y
    ...
    A15

    MPY   .M   A0, A1, A3
    ADD   .L   A4, A3, A4
The original equation (3.1) specifies 40 multiply accumulates

To create a loop we need:
A branch instruction and a label
A loop counter variable
An instruction to decrement the loop counter
A properly set branch condition
The unit responsible for branching is the .S (branch) unit

    Register File A (32 bits wide)
    A0    a
    A1    x
    A2    loop count
    A3    prod
    A4    Y
    ...
    A15

            MVK   .S   40, A2
    loop:   MPY   .M   A0, A1, A3
            ADD   .L   A4, A3, A4
            SUB   .L   A2, 1, A2
      [A2]  B     .S   loop
MVK moves a 16-bit constant into the lower 16 bits of a register (here A2), sign-extending it to 32 bits
We decrement the loop counter register by one using SUB,
which here uses the .L unit
Conditional instructions, such as the branch, execute
based on the value held in a condition register, here A2
    ;general asm code form
    [condition]  B  loop
The [A2] means execute only if A2 is nonzero
If we use [!A2] then execute only if A2 = 0

On the C62x/C67x conditional registers are limited to A1,
A2, B0, B1, B2
Note: On the C64x the conditional registers are A0, A1,
A2, B0, B1, B2
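In C terms, the SUB followed by the conditional branch behaves like a do-while loop: the body always runs once before the counter is tested. A minimal sketch, where loop_passes is an illustrative name and the variable a2 stands in for register A2:

```c
/* Model of the counted loop: SUB decrements the counter and [A2] B
   branches back while the counter is nonzero. Returns the number of
   passes through the loop body. Assumes start_count >= 1. */
int loop_passes(int start_count)
{
    int a2 = start_count;   /* MVK start_count, A2 */
    int passes = 0;

    do {
        passes++;           /* loop body (MPY, ADD) would go here */
        a2 = a2 - 1;        /* SUB A2, 1, A2 */
    } while (a2 != 0);      /* [A2] B loop */

    return passes;
}
```

With a starting count of 40 the body runs 40 times, matching the 40 multiply accumulates of (3.1).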
The next step is to get variables loaded into the register file
We assume that the variables are located in memory (internal or external)
We then create a pointer to the address of the variable and
store it in a register
Finally, we load the variable itself into another register
How do a and x get loaded?

a, x, and Y are located in memory
Create a pointer to each value:
    A5 = &a
    A6 = &x
    A7 = &Y
Use the pointer with load/store:
    LD   *A5, A0
    LD   *A6, A1
    ST   A4, *A7

    Register File A (32 bits wide)
    A0    a
    A1    x
    A2    loop count
    A3    prod
    A4    Y
    A5    &a[n]   (points to a[40] in memory)
    A6    &x[n]   (points to x[40] in memory)
    A7    &Y      (points to Y in memory)
    ...
    A15
The C notation &a is used here to obtain the address of a,
but there is more to this as we will see shortly
The C62x has three load instructions, and the C67x and C64x
add a fourth
The architecture allows byte (8-bit), half-word (16-bit), and word (32-bit) addressing
Added on the C67x/C64x are double-words (64 bits)
Load and store option summary:

Load instructions:
    LDB     Load 8-bit byte            (char)
    LDH     Load 16-bit half-word      (short)
    LDW     Load 32-bit word           (int)
    LDDW    Load 64-bit double-word    (double)   (C67x, C64x)
Store instructions:
    STB
    STH
    STW
    STDW (C64x)
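The effect of the different load widths can be modeled on a host in C. The sketch below assumes little-endian byte order and that LDB and LDH sign-extend the loaded value to 32 bits; the helper names model_ldb, model_ldh, and model_ldw are invented for this illustration.

```c
#include <stdint.h>

/* Model of LDB: load one byte and sign-extend it to 32 bits. */
int32_t model_ldb(const uint8_t *mem, uint32_t addr)
{
    uint32_t v = mem[addr];
    return (v & 0x80u) ? (int32_t)v - 0x100 : (int32_t)v;
}

/* Model of LDH: load two bytes (little-endian) and sign-extend. */
int32_t model_ldh(const uint8_t *mem, uint32_t addr)
{
    uint32_t v = (uint32_t)mem[addr] | ((uint32_t)mem[addr + 1] << 8);
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Model of LDW: load four bytes (little-endian) as one 32-bit word. */
uint32_t model_ldw(const uint8_t *mem, uint32_t addr)
{
    return (uint32_t)mem[addr]
         | ((uint32_t)mem[addr + 1] << 8)
         | ((uint32_t)mem[addr + 2] << 16)
         | ((uint32_t)mem[addr + 3] << 24);
}
```

The same four bytes in memory therefore yield different register contents depending on which load instruction is chosen.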
To carry out the load and store operations we use the .D
(data) unit

    Register File A (32 bits wide)
    A0    a
    A1    x
    A2    loop count
    A3    prod
    A4    Y
    A5    &a[n]
    A6    &x[n]
    A7    &Y
    ...
    A15

            MVK   .S   40, A2
    loop:   LDH   .D   *A5, A0
            LDH   .D   *A6, A1
            MPY   .M   A0, A1, A3
            ADD   .L   A4, A3, A4
            SUB   .L   A2, 1, A2
      [A2]  B     .S   loop
            STH   .D   A4, *A7
Note that, as in C, *A5 refers to the value pointed to by A5; the
load places that value into a register, here A0
A remaining detail is the actual creation of a pointer to, e.g., x,
a, and y
Earlier we used MVK to move a 16-bit constant into the lower
16 bits of a register
Now we want to move a 32-bit address corresponding to
some label a
    MVKL .S a,A5 ;moves the lower 16 bits with sign extension
    MVKH .S a,A5 ;moves the upper (high) 16 bits without altering the lower 16 bits
Use MVKL and MVKH, in that order, to load constants greater than 16 bits, and MVK for constants of 16 bits or less
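Why the order matters can be seen from a small host-side model of the two instructions. This is a sketch under the behavior described above (MVKL sign-extends, MVKH leaves the lower half alone); the function names model_mvkl and model_mvkh are hypothetical.

```c
#include <stdint.h>

/* Model of MVKL: take the lower 16 bits of the constant and
   sign-extend them into the full 32-bit register. */
uint32_t model_mvkl(uint32_t constant)
{
    uint32_t lo = constant & 0xFFFFu;
    return (lo & 0x8000u) ? (lo | 0xFFFF0000u) : lo;
}

/* Model of MVKH: place the upper 16 bits of the constant into the upper
   half of the register, leaving the lower 16 bits untouched. */
uint32_t model_mvkh(uint32_t constant, uint32_t reg)
{
    return (constant & 0xFFFF0000u) | (reg & 0x0000FFFFu);
}
```

For an address such as 0x0000F000, MVKL alone leaves 0xFFFFF000 in the register because of the sign extension; the following MVKH overwrites the upper half and restores the intended value.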
What should appear above the code MVK .S 40,A2 is:
    MVKL  .S  a,A5   ;store lower half of a
    MVKH  .S  a,A5   ;store upper half of a
    MVKL  .S  x,A6   ;store lower half of x
    MVKH  .S  x,A6   ;store upper half of x
    MVKL  .S  y,A7   ;store lower half of y
    MVKH  .S  y,A7   ;store upper half of y
To properly loop over the data, the pointers need to be incremented

The C notation ++ can be used to pre- or post-increment
registers being used as pointers, e.g., *A5++ advances the address
held in A5 by one element after it is used
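The post-increment works the same way as it does on a C pointer: the current address is used first, and the pointer then moves to the next element. A minimal sketch, with load_and_advance as an illustrative name only:

```c
/* Return the halfword that *p points to, then advance *p to the next
   element, mirroring the post-increment in LDH *A5++, A0. */
short load_and_advance(const short **p)
{
    short value = **p;   /* use the current address first */
    *p = *p + 1;         /* then step to the next 16-bit element */
    return value;
}
```

Calling it repeatedly on a pointer to a[0] returns a[0], a[1], a[2], and so on.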
Pointer incrementing is summarized in the following figure:

[Figure: A5 points to a0, a1, a2, ... and A6 points to x0, x1, x2, ... in memory; each ++ advances the pointer to the next element]

After the first loop, A4 contains a0 * x0
How do you access a1 and x1 on the second loop?

    LDH   .D   *A5++, A0
    LDH   .D   *A6++, A1

The code now becomes

            MVK   .S   40, A2
    loop:   LDH   .D   *A5++, A0
            LDH   .D   *A6++, A1
            MPY   .M   A0, A1, A3
            ADD   .L   A4, A3, A4
            SUB   .L   A2, 1, A2
      [A2]  B     .S   loop
            STH   .D   A4, *A7
Since there is another set of function units, we should have
specified the side as well, e.g., .S1 for side A, etc.

[Figure: register file A (A0-A15, 32 bits wide) is served by units .S1, .M1, .L1, and .D1; register file B (B0-B15, 32 bits wide) is served by units .S2, .M2, .L2, and .D2; both sides connect to data memory]
The final version of the A-side code is

            MVK   .S1   40, A2        ; A2 = 40, loop count
    loop:   LDH   .D1   *A5++, A0     ; A0 = a(n)
            LDH   .D1   *A6++, A1     ; A1 = x(n)
            MPY   .M1   A0, A1, A3    ; A3 = a(n) * x(n)
            ADD   .L1   A3, A4, A4    ; Y = Y + A3
            SUB   .L1   A2, 1, A2     ; decrement loop count
      [A2]  B     .S1   loop          ; if A2 != 0, branch
            STH   .D1   A4, *A7       ; *A7 = Y

In the above we assume A4 is initially cleared
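For comparison, the same loop can be written in C. This is a sketch only; the function name dotp_store and its argument list are assumptions made for the illustration, and the comments tie each statement back to the assembly instruction it mirrors.

```c
/* C counterpart of the A-side loop. count must be at least 1, since the
   loop body runs before the counter is tested. */
void dotp_store(const short *a, const short *x, int count, short *y)
{
    int sum = 0;                 /* A4, assumed cleared on entry */

    do {
        int prod = *a++ * *x++;  /* LDH, LDH, MPY */
        sum += prod;             /* ADD */
        count--;                 /* SUB */
    } while (count != 0);        /* [A2] B loop */

    *y = (short)sum;             /* STH keeps only the lower 16 bits */
}
```

Note that the final STH stores a half-word, so only sums that fit in 16 bits survive the store unchanged.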

Instruction Set Summary by Category

Arithmetic: ABS, ADD, ADDA, ADDK, ADD2, MPY, MPYH, NEG, SMPY, SMPYH, SADD, SAT, SSUB, SUB, SUBA, SUBC, SUB2, ZERO
Logical: AND, CMPEQ, CMPGT, CMPLT, NOT, OR, SHL, SHR, SSHL, XOR
Bit Mgmt: CLR, EXT, LMBD, NORM, SET
Data Mgmt: LDB/H/W, MV, MVC, MVK, MVKL, MVKH, MVKLH, STB/H/W
Program Ctrl: B, IDLE, NOP
C62xx and C67xx Instruction Set Summary by Unit

.S Unit: ADD, ADDK, ADD2, AND, B, CLR, EXT, MV, MVC, MVK, MVKL, MVKH, NEG, NOT, OR, SET, SHL, SHR, SSHL, SUB, SUB2, XOR, ZERO
.L Unit: ABS, ADD, AND, CMPEQ, CMPGT, CMPLT, LMBD, MV, NEG, NORM, NOT, OR, SADD, SAT, SSUB, SUB, SUBC, XOR, ZERO
.M Unit: MPY, MPYH, MPYLH, MPYHL, SMPY, SMPYH
.D Unit: ADD, ADDAB (B/H/W), LDB (B/H/W), MV, NEG, STB (B/H/W), SUB, SUBAB (B/H/W), ZERO
No Unit Used: NOP, IDLE
The C67 adds 31 More Instructions

.S Unit: the C62x .S instructions, plus ABSSP, ABSDP, CMPGTSP, CMPEQSP, CMPLTSP, CMPGTDP, CMPEQDP, CMPLTDP, RCPSP, RCPDP, RSQRSP, RSQRDP, SPDP
.L Unit: the C62x .L instructions, plus ADDSP, ADDDP, SUBSP, SUBDP, INTSP, INTDP, SPINT, DPINT, SPTRUNC, DPTRUNC, DPSP
.M Unit: the C62x .M instructions, plus MPYSP, MPYDP, MPYI, MPYID
.D Unit: the C62x .D instructions, plus ADDAD, LDDW
No Unit Used: NOP, IDLE
In total, the processor has only about 48 instructions, and

hence is considered to be a RISC device
Before going any further in assembly programming we need
to spend some time studying the pipeline
Introduction to the Pipeline

DSP microprocessors rely heavily on the performance advantages of pipelining; the C6x is no exception
It would be nice to never have to worry about pipeline issues,
but some exposure will be helpful in future programming
Getting code to work only requires a few basic guidelines,
while full optimization of the eight function units is beyond
the scope of this section of the notes
The basic operations of the CPU are:
(F) Fetch or Program Fetch (PF): get an instruction from
memory
(D) Decode: figure out what type of instruction it is (ADD,
MPY)
(E) Execute: Actually perform the operation
Pipelined and Non-Pipelined

    CPU Type         Clock Cycles
                     1    2    3    4    5    6    7    8    9
    Non-Pipelined    F1   D1   E1   F2   D2   E2   F3   D3   E3
    Pipelined        F1   D1   E1
                          F2   D2   E2
                               F3   D3   E3
                               (pipeline full from cycle 3)
Once the pipeline is full the multiple buses of the C6x can
carry out the F, D, and E operations in parallel, all within the
same clock cycle
On the downside, when discontinuities such as program
branching occur, the pipeline must be flushed which results in
added processor overhead
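The benefit shown in the table can be put in numbers with a simple cycle count. The sketch below assumes an idealized pipeline with one-cycle stages and no stalls or branches; the function names are illustrative only.

```c
/* Cycles to run n instructions when each instruction passes through
   `stages` one-cycle stages with no overlap. */
int cycles_nonpipelined(int n, int stages)
{
    return n * stages;
}

/* Cycles when a new instruction enters the pipeline every cycle:
   `stages` cycles for the first result, then one more per instruction.
   Assumes n >= 1 and no stalls or branches. */
int cycles_pipelined(int n, int stages)
{
    return stages + (n - 1);
}
```

For the three instructions and three stages (F, D, E) of the table this gives 9 cycles without pipelining and 5 cycles with it.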
Program Fetch Stage
The program fetch stage is actually broken into four phases
PG: Generate fetch address
PS: Send address to memory
PW: Wait for data ready
PR: Read opcode
Decode Stage
The decode stage consists of two phases
DP: Route the instruction to a functional unit (dispatch)
DC: Actually decode the instruction at the functional unit
(decode)
Execute Stage
For code writing purposes the execute stage is the most interesting
On the C62x all instructions execute in a single cycle, but
results are delayed by varying amounts
Furthermore, there is an additional cycle before the results
are available, which is known as the pipeline latency
Common examples of delay and latency

    Description     Instructions      Delay   Latency
    Single Cycle    All, except ...   0       0+1=1
    Multiply        MPY / SMPY        1       1+1=2
    Load            LDB/H/W           4       4+1=5
    Branch          B                 5       5+1=6

As a result of the maximum delay of 5 cycles, there are six
execute phases E1-E6
Summary of Pipeline Phases

    Stage            Phases (cycle)
    Program Fetch    PG (1), PS (2), PW (3), PR (4)
    Decode           DP (5), DC (6)
    Execute          E1 (7), E2 (8), E3 (9), E4 (10), E5 (11), E6 (12)

E2-E6 are place holders for delayed results

[Figure: staircase diagram of successive fetch packets moving through PG, PS, PW, PR, DP, DC, and E1 onward; the pipeline is full once the first packet reaches the execute stage]
Sending Code Through the Pipeline

Since there are eight function units, eight 32-bit instructions
are fetched every clock cycle
The 256-bit total is called a fetch packet
    ; mycode.asm
    I1  .unit
    I2  .unit
    I3  .unit
    I4  .unit
    I5  .unit
    I6  .unit
    I7  .unit
    I8  .unit

[Figure: fetch packet I1 I2 I3 I4 I5 I6 I7 I8 (8 x 32-bit = 256 bits); the 'C6x fetches eight 32-bit instructions every cycle]
Recall that there is a 256-bit wide program data bus for this
purpose
Pipeline Code Example
Consider the sum of products example used earlier
            MVK   .S1   40, A2
    loop:   LDH   .D1   *A5++, A0
            LDH   .D1   *A6++, A1
            MPY   .M1   A0, A1, A3
            ADD   .L1   A3, A4, A4
            SUB   .L1   A2, 1, A2
      [A2]  B     .S1   loop
            STH   .D1   A4, *A7

We assume A4 is already cleared
We have eight instructions, so on the first cycle they are in the
PG phase of program fetch
[Figure: pipeline diagram at cycle 1; MVK, LDH, LDH, MPY, ADD, SUB, B, and STH are all in PG]
On the fifth cycle, assuming zero wait state memory, the
eight instructions are now at the DP phase
On the next cycle the first instruction moves to the DC
(decode) phase, and the other seven wait in line
[Figure: pipeline diagram at cycle 6; MVK is in DC while the remaining seven instructions of the fetch packet wait in DP]
On cycle eight MVK has completed execution and LDH begins
execution, but requires five total cycles (+ signs)
[Figure: pipeline diagram at cycle 8; MVK is done, the first LDH is in E1, and the second LDH is in DC]
On the 10th cycle the second LDH enters E2 and the first LDH
is moved over to E3, with MPY at E1
[Figure: pipeline diagram at cycle 10; the first LDH is in E3, the second LDH is in E2, and MPY is in E1, with + signs marking the remaining load delay slots]
Note that the MPY requires only one delay, but needs values from memory that the LDHs bring in
The LDHs have not finished yet! What to do?
A similar problem exists when the ADD instruction reaches
E1
The one cycle delay of MPY means that the addition has
started too early as well
For the existing code, we see that at 12 cycles MPY and ADD
have both finished, but both LDHs still have not completed
[Figure: pipeline diagram at cycle 12; MPY and ADD are in the Done column while both LDHs are still in their delay slots (+)]
To fix the code we need to add instruction delays or NOPs

To start with we need to add one NOP between MPY and
ADD
We need to add four NOPs between the second LDH and
MPY
Simple NOP insertion rules:

    Description     Delay Slots   # of NOPs
    Single Cycle    0             0
    Multiply        1             1
    Load            4             4
    Branch          5             5

Rather than typing four lines of NOP, we can type a single
line, i.e., replace

    NOP
    NOP
    NOP
    NOP

with

    NOP 4
The final NOP fixed code, including benchmark information (cycles in parentheses), is the following:

            MVK   .S1   40, A2        (1)
    loop:   LDH   .D1   *A5++, A0     (1)
            LDH   .D1   *A6++, A1     (1)
            NOP   4                   (4)
            MPY   .M1   A0, A1, A3    (1)
            NOP                       (1)
            ADD   .L1   A3, A4, A4    (1)
            SUB   .L1   A2, 1, A2     (1)
      [A2]  B     .S1   loop          (1)
            NOP   5                   (5)
            STH   .D1   A4, *A7       (1)

    Loop      = 16 x 40 = 640, + 2 = 642 cycles
    Benchmark = 642 cycles
    Best case = 28 cycles
The NOPs greatly increase the cycle count, but we have not
tried any optimization yet
With full optimization just 28 cycles can be achieved, less
than the loop count!
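The 642-cycle figure follows directly from adding up the cycles of each instruction in the loop. A short sketch of that arithmetic, assuming the per-instruction costs listed for the NOP fixed code (the array and function names are made up for this example):

```c
/* Cycle cost of each instruction inside the loop, in program order:
   LDH, LDH, NOP 4, MPY, NOP, ADD, SUB, B, NOP 5 */
static const int loop_body_cycles[] = { 1, 1, 4, 1, 1, 1, 1, 1, 5 };

/* Total cycles: per-pass cost times the number of passes, plus one cycle
   each for the MVK before the loop and the STH after it. */
int nop_loop_cycles(int passes)
{
    int per_pass = 0;
    int count = (int)(sizeof loop_body_cycles / sizeof loop_body_cycles[0]);

    for (int i = 0; i < count; i++) {
        per_pass += loop_body_cycles[i];
    }
    return per_pass * passes + 2;
}
```

Each pass costs 16 cycles, of which 10 are NOPs, which is why filling the delay slots with useful work pays off so strongly.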
Use of Parallel Instructions

In the pipeline example above all of the instructions flowed
serially
Parallel instructions are given with the double pipe symbol
||
Up to eight instructions can be put in parallel since there are
eight functional units
A partially parallel solution is given below:
    Serial           Partially Parallel
       B   .S1          B   .S1
       MVK .S1       || MVK .S2
       ADD .L1          ADD .L1
       ADD .L1       || ADD .L2
       MPY .M1       || MPY .M1
       MPY .M1          MPY .M1
       LDW .D1       || LDW .D1
       LDB .D1       || LDB .D2
When instructions process in parallel they are called execute

packets, and are so denoted in the pipeline diagrams
Each fetch packet can contain multiple execute packets
At the beginning of the decode phase (dispatch), the above
example code has three execute packets entering DC
[Figure: pipeline diagram showing the three execute packets waiting in decode: (B, MVK), (ADD, ADD, MPY), and (MPY, LDW, LDB)]
Each execute packet enters E1 and the individual instructions
execute simultaneously until completed, with their respective
delays
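At cycle eight we have packet two at E1 and part of packet
one is complete
[Figure: pipeline diagram at cycle 8; the second execute packet (ADD, ADD, MPY) is in E1, MVK from the first packet is done, a + sign marks the part of packet one that is still completing, and the third packet (MPY, LDW, LDB) is in DC]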
Parallel instructions give a great performance increase

For the code example we have been considering it is possible
to go fully parallel since there are only eight instructions
To do so will require full utilization of both sides of the CPU
The fully parallel code

    Serial           Partially Parallel   Fully Parallel
       B   .S1          B   .S1              B   .S1
       MVK .S1       || MVK .S2           || MVK .S2
       ADD .L1          ADD .L1           || ADD .L1
       ADD .L1       || ADD .L2           || ADD .L2
       MPY .M1       || MPY .M1           || MPY .M1
       MPY .M1          MPY .M1           || MPY .M2
       LDW .D1       || LDW .D1           || LDW .D1
       LDB .D1       || LDB .D2           || LDB .D2

At the start of execution (seventh cycle) we have
[Figure: pipeline diagram at cycle 7; all eight instructions enter E1 together as a single execute packet, with + signs marking the delay slots still to come for the multi-cycle instructions]
This sort of efficiency requires smart coding

Two not so obvious requirements are:
Properly filling delay slots
Proper use of parallel instructions
The assembly optimizer (part of linear assembly) and the
optimizing C compiler significantly simplify this process
C67x Exceptions
With the floating point capability comes additional delay slot
requirements and latency
There is also functional unit latency beyond one cycle, which
occurs in some double precision (DP) instructions
C67x Latencies: (unit.instruction)

    .S Unit                 .L Unit
    ABSSP     (1.1)         ADDSP     (1.3)
    ABSDP     (1.2)         ADDDP     (2.7)
    CMPEQSP   (1.1)         DPINT     (1.4)
    CMPGTSP   (1.1)         DPSP      (1.4)
    CMPLTSP   (1.2)         INTDP     (1.5)
    CMPEQDP   (1.3)         INTDPU    (1.5)
    CMPGTDP   (1.3)         INTSP     (1.4)
    CMPLTDP   (1.2)         INTSPU    (1.4)
    RCPSP     (1.1)         SPINT     (1.4)
    RCPDP     (1.2)         SPTRUNC   (1.4)
    RSQRSP    (1.1)         SUBSP     (1.4)
    RSQRDP    (1.2)         SUBDP     (2.7)
    SPDP      (1.2)

    .M Unit                 .D Unit
    MPYSP     (1.4)         ADDAD     (1.1)
    MPYDP     (4.10)        LDDW      (1.5)
    MPYI      (4.9)
    MPYID     (4.10)
e.g., MPYSP (1.4) means a single precision float multiply

requires a single function unit latency and three delay slots.
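The notation can be turned into a delay slot count with one subtraction. The sketch below reads an entry (f.r) as functional unit latency f and result latency r, which is inferred from the MPYSP example above rather than taken from TI documentation; the type and function names are hypothetical.

```c
/* A latency entry written (f.r) in the table: f is the functional unit
   latency and r is the result latency, both in cycles. */
struct c67_latency {
    int unit_latency;
    int result_latency;
};

/* Delay slots are the cycles after issue before the result can be used,
   i.e. one less than the result latency. */
int delay_slots(struct c67_latency entry)
{
    return entry.result_latency - 1;
}
```

Under this reading MPYSP (1.4) has three delay slots, and MPYDP (4.10) has nine while also tying up the .M unit for four cycles.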
C Programming
This section will focus on some of the uses of the C6x development tools and some of the compiler, assembler, and linker settings.
As stated at the beginning of this chapter, the use of C code
can achieve from 80-100% of the efficiency of hand assembly
Further optimization, which is discussed in this section, will
likely be required, but it is safe to say that C code is a good
starting point for algorithm development
Recall the basic code building tool layout is:
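[Figure: code building tool flow. From the editor, .sa files go through the Asm Optimizer and .c/.cpp files go through the Compiler; both produce .asm files, which the Asm (assembler) turns into .obj files; the Linker combines the .obj files under control of Link.cmd to produce the .out file]
When the compiler tools are coupled with Code Composer
Studio (CCS) we have a complete development environment: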
[Figure: Code Composer Studio. Edit, Compile, Asm Opto, Asm, Link, and Debug stages, with Probe In, Probe Out, Profiling, Graphs, the BIOS library, and plug-ins (C++, VB, Java); the debugger targets the simulator (SIM), the DSK, the EVM, or a third party DSP board via XDS]
Studio Includes:
Code Generation Tools
BIOS: Real-time kernel, Real-time analysis (RTA)
Simulator
Plug-ins, RTDX
The output code can be controlled with a very large number
of options that span the compiler, assembler, and linker
(Old CCS Interface shown)
[Figure: file.c passes through Compile, Asm, and Link to produce file.out. The options indicate how the output file should be constructed: which optimizations, where to find files/libs, C62x or C67x, how to link files, etc.]
Debug options

    Options      Description                                     CC Tab
    -mv6700      Generate C6700 code (C6200 is default)          Compiler
    -fr <dir>    Directory containing source files               Compiler
    -g           Enables src-level symbolic debugging            Comp/Asm
    -s           Interlist C statements into assembly listing    Comp/Asm
    -k           Keep assembly file                              Compiler
    -mg          Enables minimum debug to allow profiling        Compiler
    -mt          No aliasing used                                Compiler
    -o3          Invoke optimizer (-o0, -o1, -o2/-o, -o3)        Compiler
    -pm          Combine all C source files before compile       Compiler

    debug: -gs

All total there are about five pages of options in the compiler
user manual
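Optimize Options

    Options      Description                                     CC Tab
    -mv6700      Generate C6700 code (C6200 is default)          Compiler
    -fr <dir>    Directory containing source files               Compiler
    -g           Enables src-level symbolic debugging            Comp/Asm
    -s           Interlist C statements into assembly listing    Comp/Asm
    -k           Keep assembly file                              Compiler
    -mg          Enables minimum debug to allow profiling        Compiler
    -mt          No aliasing used                                Compiler
    -o3          Invoke optimizer (-o0, -o1, -o2/-o, -o3)        Compiler
    -pm          Combine all C source files before compile       Compiler
    -ms          Minimize code size (-ms0/-ms, -ms1, -ms2)       Compiler
    -oi0         Disables automatic function inlining            Compiler

    speed opto: -k -mgt -o3 -pm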
When first debugging code we typically use -gs (above),

later optimization can be turned on, e.g., -o3
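Code Size

    Options      Description                                     CC Tab
    -mv6700      Generate C6700 code (C6200 is default)          Compiler
    -fr <dir>    Directory containing source files               Compiler
    -g           Enables src-level symbolic debugging            Comp/Asm
    -s           Interlist C statements into assembly listing    Comp/Asm
    -k           Keep assembly file                              Compiler
    -mg          Enables minimum debug to allow profiling        Compiler
    -mt          No aliasing used                                Compiler
    -o3          Invoke optimizer (-o0, -o1, -o2/-o, -o3)        Compiler
    -pm          Combine all C source files before compile       Compiler
    -ms          Minimize code size (-ms0/-ms, -ms1, -ms2)       Compiler
    -oi0         Disables automatic function inlining            Compiler

    size opto: -k -mgt -ms0 -o3 -oi0 -pm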
Assembler Options

    Options   Description                                  CC Tab
    -g        Enables src-level symbolic debugging         Comp/Asm
    -l        Create assembler listing file (small -L)     Assembler
    -s        Retain asm symbols for debugging             Assembler

    typical: -gls
Linker Options

    Options      Description                                   CC Tab
    -o <file>    Output file name                              Linker
    -m <file>    Map file name                                 Linker
    -c           Auto-initialize global/static C variables     Linker
Summary of Popular Options

                 Options      Description                                       Options Tab
                 -mv6700      Generate C6700 code (C6200 is default)            Compiler
                 -fr <dir>    Directory containing source files                 Compiler
    debug        -g           Enables src-level symbolic debugging              Comp/Asm
                 -s           Interlist C statements into assembly listing      Compiler
                 -k           Keep assembly file                                Compiler
                 -mg          Enables minimum debug to allow profiling          Compiler
    speed opto   -mt          No aliasing used                                  Compiler
                 -o3          Invoke optimizer (-o0, -o1, -o2/-o, -o3)          Compiler
                 -pm          Combine all C source files before compile         Compiler
    size opto    -ms          Minimize code size (-ms0/-ms, -ms1, -ms2)         Compiler
                 -oi0         Disables automatic function inlining              Compiler
                 -l           Create assembler listing file (small -L)          Assembler
                 -s           Retain asm symbols for debugging                  Assembler
                 -o <file>    Output file name                                  Linker
                 -m <file>    Map file name                                     Linker
                 -c           Auto-Init C variables (-cr turns off autoinit)    Linker
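A block diagram depicting what happens when a project
build takes place is shown below:
[Figure: file.c enters the Compiler, with the C Optimizer invoked by -o and the interlister by -s, producing file.asm; the Assembler produces file.obj and, with -al, the listing file.lst; the Linker (-z) combines file.obj with the run-time library (boot.c) to produce file.out (-o) and file.map (-m)]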
Embedded Systems with C

Consider software systems development in terms of the C6x
An embedded system, for the purposes of C6x development,
consists of:
Program (algorithm and data structures)
Initialization
Memory management
The program part seems pretty clear
The initialization and memory management part are beyond
what you find in a typical host programming environment,
such as Visual C++ on a PC
From a C programming perspective on a host, once the system resets and initializes, we only deal with the program
Basic sections of a C file:

    short m = 10;        /* global variables, with initial values */
    short b = 2;
    short y = 0;
    main()
    {
        short x = 0;     /* local variables */
        scanf(x);        /* code */
        malloc(y);       /* dynamic variables */
        y = m * x;
        y = y + b;
    }

[Figure: reset pin -> reset vector -> Initialize System -> Program]
In the embedded world we have to also deal with initialization
In the embedded world we have to also deal with initialization

We have more flexibility this way, and we only need to
include the hardware and software really needed to get the
job done
Using only the hardware and software that is needed also
provides a cost savings
The reset operation
stops the processor,
brings some registers back to a preset state,
sets the program counter (PC) to zero, and
begins running code (address 0)
Initialization Under C
The C compiler run-time support library contains the routine
boot.c
[Figure: reset pin -> reset vector -> boot.c (Initialize System) -> _main]
boot.c performs the following steps:
1. Initialize pointers (discussed in mod 11): stack, heap, global/static
2. Initialize global and static variables
3. Call _main
Note, global variables are optionally initialized through a
compiler switch
Following the actual hardware reset, the software reset begins
in vectors.asm with a branch to _c_int00

    ; vectors.asm
            .global _c_int00
            .sect   "vectors"
            b       _c_int00
            nop     5
            nop
            nop
            nop
            nop
            nop
            nop

The eight instructions above form one fetch packet
[Figure: reset pin -> reset vector -> boot.c (1. init stack, heap, & global ptrs; 2. init variables; 3. call _main) -> main()]
Note that c_int00 is defined in the C library

Note also that when using CCS and debugging the target,
e.g., the DSK, some of this functionality is automatically
taken care of
NOPs are added to fill the fetch packet
Each interrupt vector is aligned on the fetch packet boundaries
Other interrupts, which are typically also part of this file,
will be discussed later
Compiler Sections
The system software is broken into modules of code and data
known as sections
The sections as found in a typical C program are shown
below:
[Figure: Hardware: the C6x with internal RAM and peripherals, plus external ROM and RAM. Software: Vectors (reset), System Init (boot.c), Program Code consisting of C Code (main.c), and Data consisting of Variables (global), Init Values (global), Heap (dynamic), and Stack (local)]
The above names seem reasonable, but the compiler uses
names associated with the Common Object File Format
(COFF) developed many years ago by AT&T for use with C
and Unix
The real names used by the C6x compiler tools are the following:
    Software module                          Section name
    Vectors (reset)                          your choice
    System Init (boot.c), C Code (main.c)    .text
    Variables (global)                       .bss
    Init Values (global)                     .cinit
    Heap (dynamic)                           .sysmem
    Stack (local)                            .stack
The reset section can be any name, but vectors is reasonable

The complete list of C compiler sections is:

    Section Name   Description
    .text          Code
    .switch        Tables for switch instructions
    .const         Global and static string literals
    .cinit         Initial values for global/static vars
    .bss           Global and static variables
    .far           Global and statics declared far
    .stack         Stack (local variables)
    .sysmem        Memory for malloc fcns (heap)
    .cio           Buffers for stdio functions
A possible section placement solution for the C6201:
[Figure: the sections .text, .switch, .const, .cinit, .bss, .far, .stack, .sysmem, and .cio distributed among EPROM on CE0, internal program RAM at 140_0000, SDRAM on CE2, and internal data RAM at 8000_0000]
Many other solutions possible; the C67xx?
A more generalized way of describing the memory sections is
to use the terms initialized and uninitialized as opposed to
ROM and RAM, i.e.,

    Section Name   Description                              Memory Type
    .text          Code                                     initialized
    .switch        Tables for switch instructions           initialized
    .const         Global and static string literals        initialized
    .cinit         Initial values for global/static vars    initialized
    .bss           Global and static variables              uninitialized
    .far           Global and statics declared far          uninitialized
    .stack         Stack (local variables)                  uninitialized
    .sysmem        Memory for malloc fcns (heap)            uninitialized
    .cio           Buffers for stdio functions              uninitialized
Memory Management
We control the physical mapping of memory to program and
data sections via a linker command file
[Figure: the linker takes the .obj files and the .cmd file and produces the .out file (-o) and the .map file (-m), mapping the sections onto the C6x internal RAM and peripherals and onto external ROM and RAM]
The linker command file .cmd has two parts
MEMORY
{
Memory Description
}
SECTIONS
{
Binding Code/Data Sections to Memory
}
In the memory description portion we create a description of
both processor and system resources
Each line is of the form
    name: origin = address, length = size-in-bytes
Note that we can shorten origin to simply o or org, and
length to simply len or l; e.g., consider the memory
portion of the C6711 command file we have used thus far
    MEMORY
    {
        vecs:   org = 00000000h , len = 220h
        IRAM:   org = 00000220h , len = 0000fdc0h
        CE0:    org = 80000000h , len = 01000000h
        FLASH:  org = 90000000h , len = 00020000h
    }
Quantities may be specified in hex or decimal, but hex is
preferred, e.g., 100h or 0x100
Note: The vectors section must come first, so that following
reset, initialization can occur
The vecs space must be at least 200 hex long since on the
C6x there are a total of 16 interrupts, each requiring one fetch
packet of eight 32-bit instructions (16 x 32 bytes = 200h)
Here the 220h leaves room for 32 bytes more
There will be more discussion of interrupts later
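The 200h figure can be verified with a few lines of C. This is only an arithmetic sketch, assuming one fetch packet of eight 4-byte instructions per interrupt vector as stated above; the function names are illustrative.

```c
/* Bytes in one fetch packet: eight instructions of 32 bits (4 bytes) each. */
unsigned fetch_packet_bytes(void)
{
    return 8u * 4u;
}

/* Bytes needed for a vector table with one fetch packet per vector. */
unsigned vector_table_bytes(unsigned vectors)
{
    return vectors * fetch_packet_bytes();
}
```

With 16 vectors the table occupies 0x200 bytes, so a vecs length of 220h leaves exactly one spare fetch packet.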
To understand the rest of the memory space assignments,
recall the C6x11 memory map
C67xx Memory Map

    0000_0000   64K x 8 Internal (L2)         The 6713 DSK has 264kB starting at 0000_0000
    0180_0000   On-chip Peripherals
    8000_0000   CE0: 256M x 8 External        The 6713 DSK has 16M at 8000_0000
    9000_0000   CE1: 256M x 8 External
    A000_0000   CE2: 256M x 8 External
    B000_0000   CE3: 256M x 8 External
    FFFF_FFFF

[Figure: the CPU with a 4K program cache, a 4K data cache, and 64K of unified RAM]
On the C6x13 DSK we frequently place all of the sections,
program and data, in the internal RAM (IRAM)
    SECTIONS
    {
        vectors   :> vecs
        .text     :> IRAM
        .bss      :> IRAM
        .cinit    :> IRAM
        .stack    :> IRAM
        .sysmem   :> SDRAM
        .const    :> IRAM
        .switch   :> IRAM
        .far      :> SDRAM
        .cio      :> SDRAM
    }
Note some sections are placed in the SDRAM of CE0
Linker Options
In the third tab of the project options dialog box, we set linker
options
The -o specifies the executable file, e.g., norm_sq_c.out
The -m creates a map file which shows in detail how the
linker has located everything in memory
The -c option, run-time autoinitialization, invokes BOOT.C
so that variables are autoinitialized, that is, initial values in the
.cinit section are copied into the .bss section
We can turn off autoinit by using -cr
-stack sets the size of the stack, i.e., the .stack section; the
default is 0x400
-heap sets the size of the heap, which is actually the .sysmem section; the default is 0x400
-q suppresses the banner display and -x has the linker
exhaustively read all libraries
Calling Assembly with C

Being able to call assembly routines from C is a powerful capability of the compiler tools. In this section we explore the main
points.
For more detail refer to spru187t or newer, TMS320C6000
Optimizing Compiler v 7.3: User's Guide
Sections 7.4 & 7.5
To begin with, all C labels are accessed in the assembly file
with a leading underscore (_) character, e.g., sum --> _sum
To call an assembly routine requires that we follow a few
simple rules
[Figure: main( ) in C calling the assembly routine at label _asmFunction:]
Things we would like to do are:
Pass arguments in
Return results
Access C's global variables in assembly
More advanced issues, not dealt with here, are use of and
access to the stack and optimal access to global variables
To find a function we have a global (inter-file) reference

    /* Parent.C */
    int child(int, int);
    int x = 7, y, w = 3;
    void main (void)
    {
        y = child(x, 5);
    }

    /* Child.C */
    int child(int a, int b)
    {
        return(a + b);
    }

    ; Child.ASM
            .global _child        ; make label global
    _child:                       ; use _underscore
            ...assembly code...
            ; end of subroutine
To pass variables in, take a return value, and return to the parent code flow, we use a set of argument/register passing rules
Arguments are passed in registers as shown
Return value in A4 and return to address in B3

    Reg   A file        B file
    3                   ret addr
    4     arg1/r_val    arg2
    6     arg3          arg4
    8     arg5          arg6
    10    arg7          arg8
    12    arg9          arg10

    /* Child.C */
    int child(int a, int b)
    {
        return(a + b);
    }
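The register assignment follows a regular pattern that is easy to capture in a few lines of C. The sketch below simply encodes the table shown in the figure (odd-numbered arguments on the A side, even-numbered on the B side, starting at register 4); the struct and function names are invented for this illustration.

```c
/* Register that carries one argument: side is 'A' or 'B', number is the
   register index within that file. */
struct arg_reg {
    char side;
    int  number;
};

/* Map argument position (1..10) to its register: A4, B4, A6, B6, ...,
   A12, B12. Positions outside 1..10 are not handled by this sketch. */
struct arg_reg arg_register(int position)
{
    struct arg_reg r;
    r.side   = (position % 2 == 1) ? 'A' : 'B';
    r.number = 4 + 2 * ((position - 1) / 2);
    return r;
}
```

For the call child(x, 5), the first argument x therefore arrives in A4 and the second argument 5 in B4.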
A simple example

    /* Parent.C */
    int child(int, int);
    int x = 7, y, w = 3;
    void main (void)
    {
        y = child(x, 5);
    }

    /* Child.C */
    int child(int a, int b)
    {
        return(a + b);
    }

    ; Child.ASM
            .global _child
    _child:
            add     a4, b4, a4    ; arguments in a4 and b4, return/result in a4
            b       b3
            nop     5
            ; end of subroutine
Accessing C global variables in assembly:

    /* Parent.C */
    int child2(int, int);
    int x = 7, y, w = 3;
    void main (void)
    {
        y = child2(x, 5);
    }

    ; Child2.ASM
            .global _child2
            .global _w
    _child2:
            mvkl    _w, A1
            mvkh    _w, A1
            ldw     *A1, A0

Declare global labels
Use _underscore when accessing C variables (labels)
Advantages of declaring variables in C?
Declaring in C is easier
Compiler does variable init ( int w = 3 )
Registers A10-A15 and B10-B15 must be saved/preserved
These must be saved and restored if you use them
in assembly
There is actually a bit more to this (see below), but more later

    Reg   A file        B file
    3                   ret addr
    4     arg1/r_val    arg2
    6     arg3          arg4
    8     arg5          arg6
    10    arg7          arg8
    12    arg9          arg10
    14                  DP
    15                  SP

Extra arguments are passed on the stack, above the prior stack contents
Linear Assembly and Assembly Optimization

Being able to call highly efficient linear assembly routines from
C is another powerful capability of the compiler tools. In this
section we explore the main points.
Linear assembly has the ease of C programming (almost) and
the efficiency approaching that of assembly, but without too
many headaches, as the tools do a lot of the work
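The development flow for linear assembly modules
[Figure: from the text editor, .sa files go through the Asm Optimizer and .c/.cpp files go through the Compiler; both produce .asm files, which the Assembler turns into .obj files; the Linker combines the .obj files under control of Link.cmd to produce the .out file]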
Features of linear assembly for subroutines include:

Pass parameters
Return results
Use symbolic variable names
Ignore pipeline issues (delay slots)
Automatically return to the calling function
Call other functions written in C or linear assembly
Consider a simple dot product example in C

int DotP(short *m, short *n, int count)
{ int i;
int product;
int sum = 0;
for (i=0; i < count; i++)
{
product = m[i] * n[i];
sum += product;
}
return(sum);
}
Rewriting in linear assembly (typically a .sa file) we have

_dotp:
        zero    sum
loop:
        ldh     *pm++, m
        ldh     *pn++, n
        mpy     m, n, prod
        add     prod, sum, sum
        sub     count, 1, count
[count] b       loop
Assembly directives are required :(

Functional unit management is not needed :)
Register management not needed :)
A special directive .cproc is used to declare the passed variables, e.g.,
        .cproc  arg1, arg2, arg3
The directive .endproc declares the end of the routine
Symbolic names can be used throughout, which is very nice
The completed dot product example
_dotp:  .cproc  pm, pn, count
        .reg    m, n, prod, sum
        zero    sum
loop:
        ldh     *pm++, m
        ldh     *pn++, n
        mpy     m, n, prod
        add     prod, sum, sum
        sub     count, 1, count
[count] b       loop
        .return sum
        .endproc
The above performs the function

short dotp(short *a, short *x, int count)
Calling from Linear Assembly

Linear assembly can also call another subroutine
_dotp:      .cproc
            .reg    val
            mvk     5, val
            .call   val = _testcall(val)
            .return val
            .endproc

_testcall:  .cproc  input
            add     input, 5, input
            .return input
            .endproc
Linear Assembly Compiler Settings

Specific assembly optimizer options are:
Use -g -s for algorithm verification
Use -k -mgt -o3 -pm for software pipelining

Example: Vector Norm Squared
In this example we will be computing the squared length of a
vector using 16-bit (short) signed numbers. In mathematical
terms we are finding
\[ \|\mathbf{A}\|^2 = \sum_{n=1}^{N} A_n^2 \tag{3.1} \]
where
\[ \mathbf{A} = [A_1 \; \cdots \; A_N] \tag{3.2} \]
is an N-dimensional vector (column or row vector).

The solution will be obtained in three different ways:
Conventional C programming
C6x assembly
C6x linear assembly
Optimization is not a concern at this point
The focus here is to see by way of a simple example, how to
call a C routine from C (obvious), how to call an assembly
routine from C, and how to call and write a simple linear
assembly routine from C
C Version
We implement this simple routine in C using a declared vector length N and vector contents in the array A
The C source, which includes the called function norm,
is given below
/******************************************************
Vector norm-squared routine in C
******************************************************/
#include <stdio.h>
short norm(short *A, int N);
int main()
{
    int N = 5;
    short A[5] = {1, 2, 3, 6, 7};
    short norm_sq;
    norm_sq = norm(A, N);
    printf("Vector norm squared = %d",norm_sq);
    return 0;
}
short norm(short* V, int n)
{
    int i;
    short out = 0;
    for(i=0; i<n; i++)
    {
        out += V[i]*V[i];
    }
    return out;
}
The expected answer is 1 + 4 + 9 + 36 + 49 = 99
Running in CCS 5.1: The C code is put into a project for running on the OMAP-L138 or the simulator as Norm_Squared
and debugged and profiled
From the watch window we obtain the following when we
step the program to the last line
[Figure: watch window, annotated with the starting address of the array in memory]
Other active windows in CCS 5.1
Enable the clock under the run menu to profile
[Figure: clock display, annotated with the time from the 1st to the 2nd breakpoint]
The cycle count at the function call level for the norm
function call is 152 in the simulator; it was not tried on hardware
Assembly Version
The parent C routine is the following:
/******************************************************
Vector norm-squared routine in assembly
******************************************************/
#include <stdio.h>
short norm_asm(short *A, int N);
int main()
{
    int N = 5;
    short A[5] = {1, 2, 3, 6, 7};
    short norm_sq;
    norm_sq = norm_asm(A, N);
    return 0;
}
From the C source alone it is not obvious that the function prototype for norm_asm refers to an assembly routine
The assembly routine is the following:
; Vector norm in assembly
        .global _norm_asm       ;reference name from C
_norm_asm:
        mv   .l2 B4, B1         ;put loop ctr. in a proper reg.
        zero .l1 A2             ;initialize accumulator
loop:
        ldh  .d1 *A4++, A1      ;ld vals pointed to by A4 in A1
        nop  4                  ;required ldh delay
        mpy  .m1 A1, A1, A3     ;square each value
        nop                     ;required mpy delay
        add  .l1 A3, A2, A2     ;accumulate the squared values
        sub  .l2 B1, 1, B1      ;decrement the loop counter
 [B1]   b    .s2 loop           ;branch until B1 == 0
        nop  5                  ;required branch delay
        mv   .d1 A2, A4         ;move result to return reg. A4
        b    .s2 B3             ;branch back to address at B3
        nop  5                  ;required branch delay
Note that each line of assembly code takes the following form:
label: || [cond] instruction .unit operand ;comment
Labels must start in the first column, may be up to 200 characters
long, and must begin with a letter (or an underscore, as in
_norm_asm); the colon is optional
When accessing from C the register calling convention is
observed, that is, when we enter the function
norm_asm(arg1, arg2),
arg1 is a pointer (address) to the first value of the array
A, and is stored in register A4
arg2 is an int value, i.e., a full 32-bit signed integer,
and is stored in register B4
Since arg2 is the array dimension, we will use it as the loop
counter starting value
B4 is not a suitable register for loop control, since on the
C62x/C67x only A1, A2, B0, B1, and B2 can be used as condition
registers, so we move (mv) the value stored in B4 to B1
We initialize the accumulator register, A2, using the zero instruction; alternatively, mvk .s1 0,A2 works as well
Starting at the top of the loop section, we begin by loading
(ldh, since the data are 16 bits wide) the values pointed to by A4
into working register A1
The pointer A4 is post-incremented by 2 bytes (16 bits) following the load operation
The default increment size is controlled by the data type,
here it is halfwords (16 bits)
Various pre- and post-increment options are available,
including the offset amount, and whether it modifies the
original pointer or not (see the table below)
Table 3.1: Pointer incrementing methods; A1 shown (a)

Syntax         Pointer changed   Description
*A1            no                Basic pointer
*+A1[disp]     no                +Pre-offset
*-A1[disp]     no                -Pre-offset
*++A1[disp]    yes               Pre-increment
*--A1[disp]    yes               Pre-decrement
*A1++[disp]    yes               Post-increment
*A1--[disp]    yes               Post-decrement

a. If [disp] is omitted the displacement is one unit of the data type; otherwise the displacement is an integer multiple of the data size (word, halfword, or byte). If (disp) is used instead of [disp] the displacement is (disp) bytes.
To satisfy the pipeline delays, we follow the ldh with 4 NOPs
Next, we perform a 16-bit multiply (MPY), actually a squaring; the result is stored in A3
To satisfy the pipeline we follow the MPY with one NOP
We accumulate the result into register A2 using ADD
Next, we decrement the loop counter B1 (SUB) and branch to loop as long as B1 is nonzero
The branch is followed by five NOPs to satisfy the pipeline
delay
Finally, the squared and accumulated value held in A2 is
saved to the return register A4
To return back to the C module, we must branch to the
address saved in B3
If we had needed to use registers A10-A15 or B10-B15, we
would have had to save and restore them accordingly
The final numerical result is again 99
Running in CCS 2: The C code is put into a project for running
on the 6711 DSK as norm_sq_asm.pjt, and debugged and
profiled
The profiling results of the new norm_sq function are:
With the assembly routine the cycle count is reduced to 91,
which as a ratio makes the C routine 152/91 = 1.67 times
slower, assuming no optimization
With optimization the tables are turned and the C is faster by
the factor ?
The Linear Assembly Version

The parent C calling routine is again of the form:
/******************************************************
Vector norm-squared routine in linear assembly
******************************************************/
#include <stdio.h>
short norm_sa(short *A, int N);
int main()
{
    int N = 5;
    short A[5] = {1, 2, 3, 6, 7};
    short norm_sq;
    norm_sq = norm_sa(A, N);
    return 0;
}
The linear assembly routine is the following:

; Vector norm in linear assembly
            .global _norm_sa    ;reference name from C
_norm_sa:   .cproc  A, N        ;input variables
            .reg    m, sum      ;working variables
            zero    sum         ;zero the accumulator
loop:       ldh     *A++, m     ;load values pointed to by A
            mpy     m, m, m     ;square each value
            add     m, sum, sum ;accumulate the squared values
            sub     N, 1, N     ;decrement the loop counter
 [N]        b       loop        ;branch until N == 0
            .return sum         ;return value
            .endproc            ;end linear assembly routine
The function/subroutine is declared .global just as in the
assembly case
Following the assembly label _norm_sa, we begin the linear assembly routine with .cproc followed by the input
variables (may be dummy names)
Working variables are declared using .reg
The accumulator is cleared using the assembler instruction
zero
A loop is then set up in a similar fashion to the pure assembly
version, except now the precise management of the registers
is left to the assembly optimizer
There is also no need to include NOPs
As before the final answer is 99
Running in CCS 2: The C code is put into a project for running
on the 6711 DSK as norm_sq_sa.pjt, and debugged and
profiled
The profiling results of the new norm_sq function are:
This result is very similar to the assembly result (on the 6713:
90 cycles for .sa and 91 cycles for .asm)
With say -o3 optimization the linear assembly is faster by the
ratio ?
When debugging a linear assembly routine it is best to use the
mixed mode to display assembly interlisted with C and/or linear assembly
The registers window can then be used to watch what is happening when the code is stepped