
Chapter

TMS320C6x
Programming

Introduction
This chapter introduces programming the TMS320C6x in assembly, linear
assembly, and C. Preference will be given to explaining code
development for the DSK memory map. The basis for the material
presented in this chapter is the course notes from TI's C6000 4-day
design workshop [1].
Programming Alternatives
Source       Tool                             Efficiency*   Effort
C            Compiler optimizer, intrinsics   70-80%        Low
Linear ASM   Assembly optimizer               95-100%       Medium
ASM          Hand optimization                100%          High

* Typical efficiency versus hand-optimized assembly; see TI benchmarks for more information

1. TMS320C6000 DSP Design Workshop, Revision 4.0, June 2000.


ECE 5655/4655 Real-Time DSP


Chapter 3 TMS320C6x Programming

Introduction to Assembly Language Programming


A Dot Product Example
Recall the C6000 block diagram
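[Figure: C6000 block diagram. Program RAM and data RAM connect over the internal buses (address and 32-bit data) to the CPU, which contains the functional units .D1/.D2, .M1/.M2, .L1/.L2, .S1/.S2, register file A (A0-A15), register file B (B0-B15), and the control registers. Peripherals: EMIF to external memory (sync and async), DMA, serial port, host port, boot load, timers, power down.]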

To motivate this introduction to assembly programming, consider a basic sum of products or dot product example

$$ y = \sum_{n=1}^{40} a_n x_n \tag{3.1} $$

Assembly instructions will initially be shown only with limited detail


In a later section the details of putting together an actual
assembly file will be given
The core of this algorithm is multiplication and addition

To multiply we use the .M (multiply) unit

    MPY .M a, x, prod

As shown here, MPY calls a 16-bit multiply which gives a 32-bit result
To add or accumulate we use the .L (logical) unit

    MPY .M a, x, prod
    ADD .L Y, prod, Y

Where are the variables stored?


Note that we need to store the working variables in a register file; the C6000 has two, but for now we will just use the A side
We now rewrite the code to include the actual register names

Register File A (A0 to A15, each 32 bits wide)
    A0   a
    A1   x
    A2
    A3   prod
    A4   Y

    MPY .M A0, A1, A3
    ADD .L A4, A3, A4


The original equation (3.1) specifies 40 multiply accumulates


To create a loop we need:
A branch instruction and a label
A loop counter variable
An instruction to decrement the loop counter
A properly set branch condition


The unit responsible for branching is the .S (branch) unit

Register File A (A0 to A15, each 32 bits wide)
    A0   a
    A1   x
    A2   loop count
    A3   prod
    A4   Y

          MVK .S 40, A2
    loop: MPY .M A0, A1, A3
          ADD .L A4, A3, A4
          SUB .L A2, 1, A2
    [A2]  B   .S loop

MVK moves a 16-bit constant into the lower 16 bits of register A2
We decrement the loop counter register by one using SUB, which uses the .L unit
Branch instructions can execute conditionally based on the value held in A2
    ;general asm code form
    [condition] B loop
The [A2] means execute if A2 ≠ 0
If we use [!A2] then execute only if A2 = 0
On the C62x/C67x conditional registers are limited to A1, A2, B0, B1, B2
Note: On the C64x the conditional registers are A0, A1, A2, B0, B1, B2


The next step is to get variables loaded into the register file
We assume that the variables are located in memory (internal or external)
We then create a pointer to the address of the variable and
store it in a register
Finally, we load the variable itself into another register
Register File A
    A0   a
    A1   x
    A2   loop count
    A3   prod
    A4   Y
    A5   &a[n]
    A6   &x[n]
    A7   &Y
    ...
    A15

How do a and x get loaded?
a, x, and Y are located in memory (a[40], x[40], Y)
Create a pointer to the values
    A5 = &a
    A6 = &x
    A7 = &Y
Use the pointer with load/store
    LD *A5, A0
    LD *A6, A1
    ST A4, *A7

The C notation of &a is used here to obtain the address of a, but there is more to this as we will see shortly
The C62 has three load instructions, and the C67 and C64 add a fourth
The architecture allows byte (8-bit), half-word (16-bit), and word (32-bit) addressing
Added on the C67/C64 are double-words (64 bits)


Load and store option summary:

Load instructions:
    LDB    Load 8-bit byte            (char)
    LDH    Load 16-bit half-word      (short)
    LDW    Load 32-bit word           (int)
    LDDW   Load 64-bit double-word    (double)   (C67x, C64x)

Store instructions:
    STB
    STH
    STW
    STDW   (C64x)

To carry out the load and store operations we use the .D (data) unit

[Figure: register file A as before (A0 = a, A1 = x, A2 = loop count, A3 = prod, A4 = Y, A5 = &a[n], A6 = &x[n], A7 = &Y), now with the .D unit connected to data memory]

          MVK .S 40, A2
    loop: LDH .D *A5, A0
          LDH .D *A6, A1
          MPY .M A0, A1, A3
          ADD .L A4, A3, A4
          SUB .L A2, 1, A2
    [A2]  B   .S loop
          STH .D A4, *A7

Note that, as in C, *A5 takes the value pointed to by A5 and places the value into a register; here it is A0


A remaining detail is the actual creation of a pointer, e.g., to x, a, and y
Earlier we used MVK to move a 16-bit constant into the lower 16 bits of a register
Now we want to move a 32-bit address corresponding to some label a
    MVKL .S a,A5 ;will move the lower 16 bits with sign extension
    MVKH .S a,A5 ;will move the upper or high 16 bits without altering the lower 16 bits
Use MVKL and MVKH in ordered combination to load constants greater than 16 bits, and MVK for constants of 16 bits or less
What should appear above the code MVK .S 40,A2 is:
    MVKL .S a,A5 ;store lower half of a
    MVKH .S a,A5 ;store upper half of a
    MVKL .S x,A6 ;store lower half of x
    MVKH .S x,A6 ;store upper half of x
    MVKL .S y,A7 ;store lower half of y
    MVKH .S y,A7 ;store upper half of y

To properly loop over the data, the pointers need to be incremented
The C notation ++ can be used to pre- or post-increment registers being used as pointers, e.g., *A5++ increments the address held in A5 by one element after it is used


Pointer incrementing is summarized in the following figure:
[Figure: A5 starts at &a and steps through a0, a1, a2, ... with A5++; A6 starts at &x and steps through x0, x1, x2, ... with A6++]

After the first pass through the loop, A4 contains a0 * x0
How do you access a1 and x1 on the second pass? Replace *A5 and *A6 with post-incremented pointers:
    LDH .D *A5++, A0
    LDH .D *A6++, A1

          MVK .S 40, A2
    loop: LDH .D *A5++, A0
          LDH .D *A6++, A1
          MPY .M A0, A1, A3
          ADD .L A4, A3, A4
          SUB .L A2, 1, A2
    [A2]  B   .S loop
          STH .D A4, *A7

Since there is another set of function units, we should also have specified the side, e.g., .S1 for side A, etc.
[Figure: register file A (A0-A15, 32 bits wide) is served by .S1, .M1, .L1, and .D1; register file B (B0-B15, 32 bits wide) is served by .S2, .M2, .L2, and .D2; both sides connect to data memory]


The final version of the A-side code is

          MVK .S1 40, A2       ; A2 = 40, loop count
    loop: LDH .D1 *A5++, A0    ; A0 = a(n)
          LDH .D1 *A6++, A1    ; A1 = x(n)
          MPY .M1 A0, A1, A3   ; A3 = a(n) * x(n)
          ADD .L1 A3, A4, A4   ; Y = Y + A3
          SUB .L1 A2, 1, A2    ; decrement loop count
    [A2]  B   .S1 loop         ; if A2 != 0, branch
          STH .D1 A4, *A7      ; *A7 = Y

In the above we assume A4 is initially cleared


Instruction Set Summary by Category

Arithmetic:    ABS, ADD, ADDA, ADDK, ADD2, MPY, MPYH, NEG, SMPY, SMPYH, SADD, SAT, SSUB, SUB, SUBA, SUBC, SUB2, ZERO
Logical:       AND, CMPEQ, CMPGT, CMPLT, NOT, OR, SHL, SHR, SSHL, XOR
Bit Mgmt:      CLR, EXT, LMBD, NORM, SET
Data Mgmt:     LDB/H/W, MV, MVC, MVK, MVKL, MVKH, MVKLH, STB/H/W
Program Ctrl:  B, IDLE, NOP


C62xx and C67xx Instruction Set Summary by Unit

.S Unit:       ADD, ADDK, ADD2, AND, B, CLR, EXT, MV, MVC, MVK, MVKL, MVKH, NEG, NOT, OR, SET, SHL, SHR, SSHL, SUB, SUB2, XOR, ZERO
.L Unit:       ABS, ADD, AND, CMPEQ, CMPGT, CMPLT, LMBD, MV, NEG, NORM, NOT, OR, SADD, SAT, SSUB, SUB, SUBC, XOR, ZERO
.M Unit:       MPY, MPYH, MPYLH, MPYHL, SMPY, SMPYH
.D Unit:       ADD, ADDA (B/H/W), LD (B/H/W), MV, NEG, ST (B/H/W), SUB, SUBA (B/H/W), ZERO
No Unit Used:  NOP, IDLE

The C67 adds 31 More Instructions

In addition to the C62x instructions for each unit, the C67x adds:

.S Unit:  ABSSP, ABSDP, CMPGTSP, CMPEQSP, CMPLTSP, CMPGTDP, CMPEQDP, CMPLTDP, RCPSP, RCPDP, RSQRSP, RSQRDP, SPDP
.L Unit:  ADDSP, ADDDP, SUBSP, SUBDP, INTSP, INTDP, SPINT, DPINT, SPTRUNC, DPTRUNC, DPSP
.M Unit:  MPYSP, MPYDP, MPYI, MPYID
.D Unit:  ADDAD, LDDW


In total, the processor has only about 48 instructions, and


hence is considered to be a RISC device
Before going any further in assembly programming we need
to spend some time studying the pipeline

Introduction to the Pipeline


DSP microprocessors rely heavily on the performance advantages of pipelining, and the C6x is no exception
It would be nice to never have to worry about pipeline issues,
but some exposure will be helpful in future programming
Getting code to work only requires a few basic guidelines,
while full optimization of the eight function units is beyond
the scope of this section of the notes
The basic operations of the CPU are:
(F) Fetch or Program Fetch (PF): get an instruction from
memory
(D) Decode: figure out what type of instruction it is (ADD,
MPY)
(E) Execute: Actually perform the operation


Pipelined and Non-Pipelined

CPU Type        Clock Cycles
                1   2   3   4   5   6   7   8   9
Non-Pipelined   F1  D1  E1  F2  D2  E2  F3  D3  E3
Pipelined       F1  D1  E1
                    F2  D2  E2
                        F3  D3  E3

The pipeline is full from cycle 3 onward

Once the pipeline is full the multiple buses of the C6x can
carry out the F, D, and E operations in parallel, all within the
same clock cycle
On the downside, when discontinuities such as program
branching occur, the pipeline must be flushed which results in
added processor overhead
Program Fetch Stage
The program fetch stage is actually broken into four phases
    PG: Generate fetch address
    PS: Send address to memory
    PW: Wait for data ready
    PR: Read opcode


Decode Stage
The decode stage consists of two phases
DP: Route the instruction to a functional unit (dispatch)
DC: Actually decode the instruction at the functional unit
(decode)
Execute Stage
For code writing purposes the execute stage is the most interesting
On the C62x all instructions execute in a single cycle, but results are delayed by varying amounts
Furthermore, there is an additional cycle before the results are available, which is known as the pipeline latency
Common examples of delay and latency

Description    Instructions      Delay   Latency
Single cycle   All, except ...   0       0+1=1
Multiply       MPY / SMPY        1       1+1=2
Load           LDB/H/W           4       4+1=5
Branch         B                 5       5+1=6

As a result of the maximum delay of 5 cycles, there are six execute phases E1-E6


Summary of Pipeline Phases

Stage           Phase   Cycle
Program fetch   PG      1
                PS      2
                PW      3
                PR      4
Decode          DP      5
                DC      6
Execute         E1-E6   7-12

E2-E6 are place holders for delayed results
[Figure: staircase pipeline diagram. Successive fetch packets enter PG one cycle apart and advance through PS, PW, PR, DP, DC, and the execute phases; the pipeline is full once every phase is occupied.]


Sending Code Through the Pipeline

Since there are eight function units, eight 32-bit instructions are fetched every clock cycle
The 256-bit total is called a fetch packet
    ; mycode.asm
    I1 .unit
    I2 .unit
    I3 .unit
    I4 .unit
    I5 .unit
    I6 .unit
    I7 .unit
    I8 .unit
[Figure: fetch packet (8 x 32-bit = 256 bits) holding I1 I2 I3 I4 I5 I6 I7 I8]
Recall that there is a 256-bit wide program data bus for this purpose
Pipeline Code Example
Consider the sum of products example used earlier (we assume A4 is already cleared)

          MVK .S1 40, A2
    loop: LDH .D1 *A5++, A0
          LDH .D1 *A6++, A1
          MPY .M1 A0, A1, A3
          ADD .L1 A3, A4, A4
          SUB .L1 A2, 1, A2
    [A2]  B   .S1 loop
          STH .D1 A4, *A7


We have eight instructions, so on the first cycle they are in the PG phase of program fetch
[Figure: pipeline diagram with the fetch packet MVK, LDH, LDH, MPY, ADD, SUB, B, STH in PG]
On the fifth cycle, assuming zero wait state memory, the eight instructions are now at the DP phase
On the next cycle the first instruction moves to the DC (decode) phase, and the other seven wait in line
[Figure: MVK in DC; LDH, LDH, MPY, ADD, SUB, B, STH waiting in DP]
On cycle eight MVK has completed execution and LDH begins execution, but requires five total cycles (+ signs)
[Figure: MVK done; first LDH in E1; remaining instructions waiting in DP and DC]
On the 10th cycle the second LDH enters E2 and the first LDH is moved over to E3, with MPY at E1
[Figure: MVK done; first LDH in E3, second LDH in E2, MPY in E1; ADD, SUB, B, STH still waiting]


Note that the MPY requires only one delay, but needs values from memory that the LDHs bring in
The LDHs have not finished yet! What to do?
A similar problem exists when the ADD instruction reaches
E1
The one cycle delay of MPY means that the addition has
started too early as well


For the existing code, we see that at 12 cycles MPY and ADD
have both finished, but both LDHs still have not completed
[Figure: pipeline diagram at cycle 12. MVK, MPY, and ADD are done; the two LDHs are still in the later execute phases; SUB, B, and STH follow.]

To fix the code we need to add instruction delays or NOPs


To start with we need to add one NOP between MPY and
ADD
We need to add four NOPs between the second LDH and
MPY
Simple NOP insertion rules:

Description    Delay Slots   # of NOPs
Single cycle   0             0
Multiply       1             1
Load           4             4
Branch         5             5


Rather than typing four lines of NOP, we can type a single line

    NOP
    NOP
    NOP
    NOP

is equivalent to

    NOP 4

The final NOP-fixed code, including benchmark information (cycles in parentheses), is the following:

          MVK .S1 40, A2       ; (1)
    loop: LDH .D1 *A5++, A0    ; (1)
          LDH .D1 *A6++, A1    ; (1)
          NOP 4                ; (4)
          MPY .M1 A0, A1, A3   ; (1)
          NOP                  ; (1)
          ADD .L1 A3, A4, A4   ; (1)
          SUB .L1 A2, 1, A2    ; (1)
    [A2]  B   .S1 loop         ; (1)
          NOP 5                ; (5)
          STH .D1 A4, *A7      ; (1)

Loop = 16 x 40 = 640, + 2 = 642 cycles
Benchmark = 642 cycles
Best case = 28 cycles

The NOPs greatly increase the cycle count, but we have not tried any optimization yet
With full optimization just 28 cycles can be achieved, less than the loop count!

ECE 5655/4655 Real-Time DSP

321

Chapter 3 TMS320C6x Programming

Use of Parallel Instructions


In the pipeline example above all of the instructions flowed
serially
Parallel instructions are given with the double pipe symbol
||
Up to eight instructions can be put in parallel since there are
eight functional units
A partially parallel solution is given below:

Serial          Partially Parallel
B   .S1            B   .S1
MVK .S1         || MVK .S2
ADD .L1            ADD .L1
ADD .L1         || ADD .L2
MPY .M1         || MPY .M1
MPY .M1            MPY .M1
LDW .D1         || LDW .D1
LDB .D1         || LDB .D2

When instructions process in parallel they are called execute packets, and are so denoted in the pipeline diagrams
Each fetch packet can contain multiple execute packets


At the beginning of the decode phase (dispatch), the above example code has three execute packets entering DC
[Figure: the three execute packets (B, MVK), (ADD, ADD, MPY), and (MPY, LDW, LDB) queued at DP/DC]
Each execute packet enters E1 and the individual instructions execute simultaneously until completed, with their respective delays
At cycle eight we have packet two at E1 and part of packet one is complete
[Figure: MVK done and B still in its delay slots; ADD, ADD, MPY in E1; MPY, LDW, LDB waiting]
Parallel instructions give a great performance increase
For the code example we have been considering it is possible to go fully parallel since there are only eight instructions
To do so will require full utilization of both sides of the CPU


The fully parallel code

       B   .S1
    || MVK .S2
    || ADD .L1
    || ADD .L2
    || MPY .M1
    || MPY .M2
    || LDW .D1
    || LDB .D2

At the start of execution (seventh cycle) we have
[Figure: all eight instructions enter E1 together as a single execute packet; + signs mark the delay slots still outstanding for the multiplies, loads, and branch]


This sort of efficiency requires smart coding


Two not so obvious requirements are:
Properly filling delay slots
Proper use of parallel instructions
The assembly optimizer (part of linear assembly) and the
optimizing C compiler significantly simplify this process
C67x Exceptions
With the floating point capability comes additional delay slot
requirements and latency
There is also functional unit latency beyond one cycle, which
occurs in some double precision (DP) instructions
C67x Latencies: (unit.instruction)

.S Unit
    ABSSP    (1.1)      CMPLTDP  (1.2)
    ABSDP    (1.2)      RCPSP    (1.1)
    CMPEQSP  (1.1)      RCPDP    (1.2)
    CMPGTSP  (1.1)      RSQRSP   (1.1)
    CMPLTSP  (1.2)      RSQRDP   (1.2)
    CMPEQDP  (1.3)      SPDP     (1.2)
    CMPGTDP  (1.3)

.L Unit
    ADDSP    (1.3)      INTSP    (1.4)
    ADDDP    (2.7)      INTSPU   (1.4)
    DPINT    (1.4)      SPINT    (1.4)
    DPSP     (1.4)      SPTRUNC  (1.4)
    INTDP    (1.5)      SUBSP    (1.4)
    INTDPU   (1.5)      SUBDP    (2.7)

.M Unit
    MPYSP    (1.4)      MPYI     (4.9)
    MPYDP    (4.10)     MPYID    (4.10)

.D Unit
    ADDAD    (1.1)      LDDW     (1.5)

e.g., MPYSP (1.4) means a single precision float multiply


requires a single function unit latency and three delay slots.

326

ECE 5655/4655 Real-Time DSP

C Programming
This section will focus on some of the uses of the C6x development tools and some of the compiler, assembler, and linker settings.
As stated at the beginning of this chapter, the use of C code can achieve from 80-100% of the efficiency of hand assembly
Further optimization, which is discussed in this section, will likely be required, but it is safe to say that C code is a good starting point for algorithm development
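Recall the basic code building tool layout is:
[Figure: the editor produces .c/.cpp, .sa, and .asm source files; .c/.cpp files pass through the compiler and .sa files through the assembly optimizer to give .asm; the assembler turns .asm into .obj; the linker, guided by Link.cmd, produces .out]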

When the compiler tools are coupled with Code Composer Studio (CCS) we have a complete development environment:
[Figure: CCS tool flow. Edit, Compile / Asm Opto, Asm, Link, Debug, with targets SIM, DSK, EVM, third-party boards, and XDS with a DSP board; supporting features are probe in, probe out, profiling, graphs, the BIOS library, and plug-ins (C++, VB, Java)]

Studio includes:
    Code generation tools
    BIOS: real-time kernel, real-time analysis (RTA)
    Simulator
    Plug-ins, RTDX

The output code can be controlled with a very large number of options that span the compiler, assembler, and linker (old CCS interface shown)
The options indicate how the output file should be constructed:
    Which optimizations
    Where to find files/libs
    C62x or C67x
    How to link files
    Etc.
[Figure: file.c passes through Compile, Asm, and Link to give file.out]


Debug options

Option      Description                                     CC Tab
-mv6700     Generate C6700 code (C6200 is default)          Compiler
-fr <dir>   Directory containing source files               Compiler
-g          Enables src-level symbolic debugging            Comp/Asm
-s          Interlist C statements into assembly listing    Comp/Asm
-k          Keep assembly file                              Compiler
-mg         Enables minimum debug to allow profiling        Compiler
-mt         No aliasing used                                Compiler
-o3         Invoke optimizer (-o0, -o1, -o2/-o, -o3)        Compiler
-pm         Combine all C source files before compile       Compiler

Debug set: -g and -s, combined as -gs

All told there are about five pages of options in the compiler user manual
Optimize Options

Option      Description                                     CC Tab
-mv6700     Generate C6700 code (C6200 is default)          Compiler
-fr <dir>   Directory containing source files               Compiler
-g          Enables src-level symbolic debugging            Comp/Asm
-s          Interlist C statements into assembly listing    Comp/Asm
-k          Keep assembly file                              Compiler
-mg         Enables minimum debug to allow profiling        Compiler
-mt         No aliasing used                                Compiler
-o3         Invoke optimizer (-o0, -o1, -o2/-o, -o3)        Compiler
-pm         Combine all C source files before compile       Compiler
-ms         Minimize code size (-ms0/-ms, -ms1, -ms2)       Compiler
-oi0        Disables automatic function inlining            Compiler

Speed optimization set: -k -mgt -o3 -pm

When first debugging code we typically use -gs (above); later, optimization can be turned on, e.g., -o3


Code Size Options

Option      Description                                     CC Tab
-mv6700     Generate C6700 code (C6200 is default)          Compiler
-fr <dir>   Directory containing source files               Compiler
-g          Enables src-level symbolic debugging            Comp/Asm
-s          Interlist C statements into assembly listing    Comp/Asm
-k          Keep assembly file                              Compiler
-mg         Enables minimum debug to allow profiling        Compiler
-mt         No aliasing used                                Compiler
-o3         Invoke optimizer (-o0, -o1, -o2/-o, -o3)        Compiler
-pm         Combine all C source files before compile       Compiler
-ms         Minimize code size (-ms0/-ms, -ms1, -ms2)       Compiler
-oi0        Disables automatic function inlining            Compiler

Size optimization set: -k -mgt -ms0 -o3 -oi0 -pm

Assembler Options

Option   Description                                     CC Tab
-g       Enables src-level symbolic debugging            Comp/Asm
-l       Create assembler listing file (small -L)        Assembler
-s       Retain asm symbols for debugging                Assembler

Combined: -gls


Linker Options

Option      Description                                   CC Tab
-o <file>   Output file name                              Linker
-m <file>   Map file name                                 Linker
-c          Auto-initialize global/static C variables     Linker

Summary of Popular Options

Option      Description                                      Options Tab
-mv6700     Generate C6700 code (C6200 is default)           Compiler
-fr <dir>   Directory containing source files                Compiler
-g          Enables src-level symbolic debugging             Comp/Asm
-s          Interlist C statements into assembly listing     Compiler
-k          Keep assembly file                               Compiler
-mg         Enables minimum debug to allow profiling         Compiler
-mt         No aliasing used                                 Compiler
-o3         Invoke optimizer (-o0, -o1, -o2/-o, -o3)         Compiler
-pm         Combine all C source files before compile        Compiler
-ms         Minimize code size (-ms0/-ms, -ms1, -ms2)        Compiler
-oi0        Disables automatic function inlining             Compiler
-l          Create assembler listing file (small -L)         Assembler
-s          Retain asm symbols for debugging                 Assembler
-o <file>   Output file name                                 Linker
-m <file>   Map file name                                    Linker
-c          Auto-init C variables (-cr turns off autoinit)   Linker

Groups: debug (-g, -s); speed optimization (-mt, -o3, -pm); size optimization (-ms, -oi0)


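A block diagram depicting what happens when a project build takes place is shown below:
[Figure: project build flow. file.c enters the compiler (with the C optimizer enabled by -o) to produce file.asm (-s interlists the C source); the assembler produces file.obj and, with -al, file.lst; the linker (-z) combines file.obj with the run-time library (boot.c) to produce file.out (-o) and file.map (-m)]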

Embedded Systems with C


Consider software systems development in terms of the C6x
An embedded system, for the purposes of C6x development,
consists of:
Program (algorithm and data structures)
Initialization
Memory management
The program part seems pretty clear
The initialization and memory management part are beyond
what you find in a typical host programming environment,
such as Visual C++ on a PC
332

ECE 5655/4655 Real-Time DSP

Embedded Systems with C

From a C programming perspective on a host, once the system resets and initializes, we only deal with the program

Basic sections of a C file:

    #include <stdio.h>
    #include <stdlib.h>

    short m = 10;                  /* global variables with initial values */
    short b = 2;
    short y = 0;

    int main(void)                 /* code */
    {
        short x = 0;               /* local variable */
        short *p;
        scanf("%hd", &x);
        p = malloc(sizeof *p);     /* dynamic variable */
        y = m * x;
        y = y + b;
        free(p);
        return 0;
    }

[Figure: reset pin, then reset vector, then initialize system, then program]

In the embedded world we have to also deal with initialization


We have more flexibility this way, and we only need to
include the hardware and software really needed to get the
job done
Using only the hardware and software that is needed also
provides a cost savings
The reset operation
    stops the processor,
    brings some registers back to a preset state,
    sets the program counter (PC) to zero, and
    begins running code (address 0)
Initialization Under C
The C compiler run-time support library contains the routine boot.c
[Figure: reset pin, then reset vector, then boot.c (initialize system), then _main, which starts the program in the C file]
boot.c performs three steps:
    1. Initialize pointers (discussed in mod 11): stack, heap, global/static
    2. Initialize global and static variables
    3. Call _main
Note, global variables are optionally initialized through a compiler switch


Following the actual hardware reset, the software reset sequence begins in vectors.asm with a branch to _c_int00

    ; vectors.asm
            .global _c_int00
            .sect   "vectors"
            b       _c_int00
            nop     5
            nop                 ; the remaining nops fill
            nop                 ; one fetch packet
            nop
            nop
            nop
            nop

[Figure: reset pin, then reset vector, then boot.c (1. init stack, heap, and global pointers; 2. init variables; 3. call _main), then main()]

Note that _c_int00 is defined in the C library
Note also that when using CCS and debugging the target, e.g., the DSK, some of this functionality is automatically taken care of
NOPs are added to fill the fetch packet
Each interrupt vector is aligned on a fetch packet boundary
Other interrupts, which are typically also part of this file, will be discussed later


Compiler Sections
The system software is broken into modules of code and data
known as sections
The sections as found in a typical C program are shown below:
[Figure: the hardware (C6x with internal RAM and peripherals, plus external ROM and RAM) alongside the software sections listed here]
    Vectors (reset)
    System Init (boot.c)
    C Code (main.c)
    Variables (global)
    Init Values (global)
    Heap (dynamic)
    Stack (local)

The above names seem reasonable, but the compiler uses names associated with the common object file format (COFF) developed many years ago by AT&T for use with C and Unix
The real names used by the C6x compiler tools are the following:

Software section          Compiler section name
Vectors (reset)           your choice
System Init (boot.c)      .text
C Code (main.c)           .text
Variables (global)        .bss
Init Values (global)      .cinit
Heap (dynamic)            .sysmem
Stack (local)             .stack

The reset section can be any name, but vectors is reasonable
The complete list of C compiler sections is:

Section Name   Description
.text          Code
.switch        Tables for switch instructions
.const         Global and static string literals
.cinit         Initial values for global/static vars
.bss           Global and static variables
.far           Global and statics declared far
.stack         Stack (local variables)
.sysmem        Memory for malloc fcns (heap)
.cio           Buffers for stdio functions


A possible section placement solution for the C6201:
[Figure: C6201 section placement. The figure groups .cinit, .const, .text, and .switch with the EPROM on CE0 and the internal program RAM at 140_0000; .sysmem, .far, and .cio with the SDRAM on CE2; and .bss and .stack with the internal data RAM at 8000_0000]
Many other solutions are possible; what about the C67xx?

A more generalized way of describing the memory sections is to use the terms initialized and uninitialized as opposed to ROM and RAM, i.e.,

Section Name   Description                              Memory Type
.text          Code                                     initialized
.switch        Tables for switch instructions           initialized
.const         Global and static string literals        initialized
.cinit         Initial values for global/static vars    initialized
.bss           Global and static variables              uninitialized
.far           Global and statics declared far          uninitialized
.stack         Stack (local variables)                  uninitialized
.sysmem        Memory for malloc fcns (heap)            uninitialized
.cio           Buffers for stdio functions              uninitialized


Memory Management
We control the physical mapping of memory to program and data sections via a linker command file
[Figure: the linker combines the .obj files under control of the .cmd file, which maps sections onto the C6x internal RAM and the external ROM and RAM, and produces the .out file (-o) and the .map file (-m)]
The linker command file .cmd has two parts
    MEMORY
    {
        Memory Description
    }

    SECTIONS
    {
        Binding Code/Data Sections to Memory
    }


In the memory description portion we create a description of both processor and system resources
Each line is of the form
    name: origin = address, length = size-in-bytes

Note that we can shorten origin to simply o or org, and length to simply len or l; for example, consider the memory portion of the C6711 command file we have used thus far
    MEMORY
    {
        vecs:   org = 00000000h , len = 220h
        IRAM:   org = 00000220h , len = 0000fdc0h
        CE0:    org = 80000000h , len = 01000000h
        FLASH:  org = 90000000h , len = 00020000h
    }

Quantities may be specified in hex or decimal, but hex is preferred, e.g., 100h or 0x100
Note: The vectors section must come first, so that following reset, initialization can occur
The vecs space must be at least 200 hex long since on the C6x there are a total of 16 interrupts, each requiring one fetch packet of eight 32-bit instructions (16 x 32 bytes = 200h bytes)
Here the 220h leaves room for 32 bytes (one fetch packet) more
There will be more discussion of interrupts later
To understand the rest of the memory space assignments,
recall the C6x11 memory map

C67xx Memory Map

Address       Contents
0000_0000     64K x 8 internal (L2)
0180_0000     On-chip peripherals
8000_0000     CE0: 256M x 8 external
9000_0000     CE1: 256M x 8 external
A000_0000     CE2: 256M x 8 external
B000_0000     CE3: 256M x 8 external
FFFF_FFFF     End of the address space

The CPU is backed by a 4K program cache, a 4K data cache, and 64K of unified RAM
The 6713 DSK has 264kB of internal memory starting at 0000_0000
The 6713 DSK has 16M of external memory at 8000_0000


On the C6x13 DSK we frequently place all of the sections, program and data, in the internal RAM (IRAM)
    SECTIONS
    {
        vectors  :> vecs
        .text    :> IRAM
        .bss     :> IRAM
        .cinit   :> IRAM
        .stack   :> IRAM
        .sysmem  :> SDRAM
        .const   :> IRAM
        .switch  :> IRAM
        .far     :> SDRAM
        .cio     :> SDRAM
    }

Note some sections are placed in the SDRAM of CE0


Linker Options
In the third tab of the project options dialog box, we set linker
options

The -o specifies the executable file, e.g., norm_sq_c.out


The -m creates a map file which shows in detail how the
linker has located everything in memory


The -c option, run-time autoinitialization, invokes BOOT.C so that variables are autoinitialized, that is, initial values in the .cinit section are copied into the .bss section
We can turn off run-time autoinit by using -cr, in which case the variables are initialized at load time instead
-stack sets the size of the stack, i.e., the .stack section; the default is 0x400
-heap sets the size of the heap, which is actually the .sysmem section; it also has a default value of 0x400
-q suppresses the banner display and -x has the linker exhaustively read all libraries (-w, by contrast, warns about output sections not listed in SECTIONS)


Calling Assembly with C


Being able to call assembly routines from C is a powerful capability of the compiler tools. In this section we explore the main
points.
For more detail refer to spru187t or newer, TMS320C6000
Optimizing Compiler v 7.3: User's Guide
Sections 7.4 & 7.5
To begin with all C labels are accessed in the assembly file
with an underscore (_) character, e.g., sum --> _sum
To call an assembly routine requires that we follow a few
simple rules
[Figure: main( ) { ... } in C calling the assembly routine at the label _asmFunction:]

Things we would like to do are:
Pass arguments in
Return results
Access C's global variables in assembly
More advanced issues, not dealt with here, are use of and access to the stack and optimal access to global variables


To find a function we have a global (inter-file) reference

Parent.C
int child(int, int);
int x = 7, y, w = 3;

void main (void)
{
    y = child(x, 5);
}

Child.C
int child(int a, int b)
{
    return(a + b);
}

Child.ASM (use _underscore, make label global)
        .global _child
_child:
        ...assembly code...
        ; end of subroutine

To pass variables in, take a return value, and return to the parent code flow, we use a set of argument/register passing rules
Arguments are passed in registers as shown
Return value in A4 and return to address in B3

Register   A file        B file
3                        ret addr
4          arg1/r_val    arg2
6          arg3          arg4
8          arg5          arg6
10         arg7          arg8
12         arg9          arg10
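The pattern in the register figure is regular enough to express as a small rule: odd-numbered arguments go to the A file, even-numbered arguments go to the B file, and the register number advances by two for each pair, starting at 4. The C sketch below encodes only that rule for arguments 1 through 10; arg_register is a hypothetical helper written for illustration, not something provided by the compiler tools.

```c
#include <stdio.h>

/* Writes the register name for the n-th argument (1..10) into buf.
   Odd arguments use the A file, even arguments the B file, and the
   register numbers used are 4, 6, 8, 10, 12.
   Returns 0 on success, -1 if n is outside 1..10 or the buffer is
   too small for a name such as "B12". Hypothetical helper. */
int arg_register(int n, char *buf, size_t buflen)
{
    if (n < 1 || n > 10 || buf == NULL || buflen < 4)
        return -1;
    char file = (n % 2 == 1) ? 'A' : 'B';
    int reg = 4 + 2 * ((n - 1) / 2);
    snprintf(buf, buflen, "%c%d", file, reg);
    return 0;
}
```

For example, argument 1 maps to A4, argument 2 to B4, and argument 10 to B12, in agreement with the figure.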


A simple example

Parent.C
int child(int, int);
int x = 7, y, w = 3;

void main (void)
{
    y = child(x, 5);
}

Child.C
int child(int a, int b)
{
    return(a + b);
}

Child.ASM
        .global _child
_child:
        add   a4, b4, a4     ; arguments in a4 and b4, return/result in a4
        b     b3
        nop   5
        ; end of subroutine

Accessing C global variables in assembly:

Parent.C
int child2(int, int);
int x = 7, y, w = 3;
void main (void)
{
    y = child2(x, 5);
}

Child2.ASM
        .global _child2
        .global _w
_child2:
        mvkl  _w, A1
        mvkh  _w, A1
        ldw   *A1, A0

Declare global labels
Use _underscore when accessing C variables (labels)
Advantages of declaring variables in C?
Declaring in C is easier
Compiler does variable init ( int w = 3 )
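The mvkl/mvkh pair is needed because one instruction carries only a 16-bit constant, so the 32-bit address of _w is built in two halves. The C sketch below models that behavior on a host machine: the first function mimics mvkl (low half, sign-extended) and the second mimics mvkh (upper half replaced, lower half kept). The function names and the address used in the note that follows are made up for illustration.

```c
#include <stdint.h>

/* Model of MVKL: load the low 16 bits of cst, sign-extended to 32 bits. */
uint32_t mvkl_model(uint32_t cst)
{
    uint32_t lo = cst & 0xFFFFu;
    return (lo & 0x8000u) ? (lo | 0xFFFF0000u) : lo;
}

/* Model of MVKH: replace the upper 16 bits of reg with the upper 16 bits
   of cst, leaving the lower 16 bits of reg unchanged. */
uint32_t mvkh_model(uint32_t reg, uint32_t cst)
{
    return (cst & 0xFFFF0000u) | (reg & 0x0000FFFFu);
}

/* The two-instruction sequence that forms a full 32-bit address. */
uint32_t load_address_model(uint32_t addr)
{
    uint32_t a1 = mvkl_model(addr);   /* mvkl _w, A1 */
    a1 = mvkh_model(a1, addr);        /* mvkh _w, A1 */
    return a1;
}
```

For a made-up address such as 0x8000F234, mvkl_model alone yields 0xFFFFF234 because of the sign extension; the mvkh_model step then corrects the upper half to give 0x8000F234. This is why mvkl comes first and mvkh second.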


Registers A10-A15 and B10-B15 must be saved/preserved
These must be saved and restored if you use them in Assembly
There is actually a bit more to this (see below), but more later

Register   A file        B file
3                        ret addr
4          arg1/r_val    arg2
6          arg3          arg4
8          arg5          arg6
10         arg7          arg8
12         arg9          arg10
14                       DP
15                       SP

Stack: extra arguments, prior stack contents


Linear Assembly and Assembly Optimization


Being able to call highly efficient linear assembly routines from
C is another powerful capability of the compiler tools. In this
section we explore the main points.
Linear assembly has the ease of C programming (almost) and
the efficiency approaching that of assembly, but without too
many headaches, as the tools do a lot of the work
The development flow for linear assembly modules
[Figure: development flow. Text editor -> .sa -> assembly optimizer -> .asm -> assembler -> .obj -> linker (with Link.cmd) -> .out; .c / .cpp sources enter through the compiler]

Features of linear assembly for subroutines include:


Pass parameters
Return results
Use symbolic variable names
Ignore pipeline issues (delay slots)
Automatically return to the calling function
Call other functions written in C or linear assembly

Consider a simple dot product example in C


int DotP(short *m, short *n, int count)
{
    int i;
    int product;
    int sum = 0;
    for (i=0; i < count; i++)
    {
        product = m[i] * n[i];
        sum += product;
    }
    return(sum);
}

Rewriting in linear assembly (typically a .sa file) we have

_dotp:
        zero    sum
loop:   ldh     *pm++, m
        ldh     *pn++, n
        mpy     m, n, prod
        add     prod, sum, sum
        sub     count, 1, count
  [count] b     loop

Assembly directives are required :(


Functional unit management is not needed :)
Register management not needed :)

A special directive .cproc is used to declare the passed variables, e.g.,
        .cproc  arg1, arg2, arg3
The directive .endproc declares the end of the routine


Symbolic names can be used throughout, which is very nice
The completed dot product example
_dotp:  .cproc  pm, pn, count
        .reg    m, n, prod, sum

        zero    sum
loop:
        ldh     *pm++, m
        ldh     *pn++, n
        mpy     m, n, prod
        add     prod, sum, sum
        sub     count, 1, count
  [count] b     loop
        .return sum

        .endproc

The above performs the function
short dotp(short *a, short *x, int count)
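To make the control flow of the linear assembly explicit, here is a line-by-line C model of the loop. It is a sketch for reading alongside the listing, not the code the assembly optimizer produces; dotp_model is a hypothetical name, and the model returns the full int accumulator rather than the short given in the prototype above.

```c
/* C model of the linear assembly loop. The local names mirror the
   .cproc/.reg variables; dotp_model itself is a hypothetical name. */
int dotp_model(const short *pm, const short *pn, int count)
{
    int sum = 0;                 /* zero  sum              */
    do {
        short m = *pm++;         /* ldh   *pm++, m         */
        short n = *pn++;         /* ldh   *pn++, n         */
        int prod = m * n;        /* mpy   m, n, prod       */
        sum += prod;             /* add   prod, sum, sum   */
        count -= 1;              /* sub   count, 1, count  */
    } while (count != 0);        /* [count] b loop         */
    return sum;                  /* .return sum            */
}
```

Written this way it is clear that the loop body runs once before count is tested, so the routine as listed expects count to be at least 1. The C version DotP, with its for loop, handles count equal to 0 by returning 0.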


Calling from Linear Assembly


Linear assembly can also call another subroutine
_dotp:      .cproc
            .reg    val
            mvk     5, val
            .call   val = _testcall(val)
            .return val
            .endproc

_testcall:  .cproc  input
            add     input, 5, input
            .return input
            .endproc

Linear Assembly Compiler Settings


Specific assembly optimizer options are:
Use -g -s for algorithm verification
Use -k -mgt -o3 -pm for software pipelining

Example: Vector Norm Squared


In this example we will be computing the squared length of a vector using 16-bit (short) signed numbers. In mathematical terms we are finding

\|\mathbf{A}\|^2 = \sum_{n=1}^{N} A_n^2    (3.1)

where

\mathbf{A} = [A_1 \; \cdots \; A_N]    (3.2)

is an N-dimensional vector (column or row vector).


The solution will be obtained in three different ways:
Conventional C programming
C6x assembly
C6x linear assembly
Optimization is not a concern at this point
The focus here is to see by way of a simple example, how to
call a C routine from C (obvious), how to call an assembly
routine from C, and how to call and write a simple linear
assembly routine from C
C Version
We implement this simple routine in C using a declared vector length N and vector contents in the array A
The C source, which includes the called function norm, is given below


/******************************************************
Vector norm-squared routine in C
******************************************************/
#include <stdio.h>
short norm(short *A, int N);
int main()
{
    int N = 5;
    short A[5] = {1, 2, 3, 6, 7};
    short norm_sq;
    norm_sq = norm(A, N);
    printf("Vector norm squared = %d",norm_sq);
    return 0;
}
short norm(short* V, int n)
{
    int i;
    short out = 0;
    for(i=0; i<n; i++)
    {
        out += V[i]*V[i];
    }
    return out;
}

The expected answer is 1 + 4 + 9 + 36 + 49 = 99
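Because norm accumulates into a short, the result is only meaningful while the sum of squares stays below 32768. A minimal sketch of a variant with a wider accumulator follows for comparison. The name norm_wide is hypothetical and not part of the course code, and the sketch assumes int is at least 32 bits wide.

```c
/* Same computation as norm, but with a wider accumulator and return type.
   Assumes int is at least 32 bits. */
int norm_wide(const short *V, int n)
{
    int i;
    int out = 0;
    for (i = 0; i < n; i++)
    {
        /* the square of a 16-bit value needs up to 31 bits */
        out += (int)V[i] * (int)V[i];
    }
    return out;
}
```

For the test vector {1, 2, 3, 6, 7} this returns 99, the same as norm. For a vector such as {200, 200} it returns 80000, a value that does not fit in a short.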


Running in CCS 5.1: The C code is put into a project for running on the OMAP-L138 or the simulator as Norm_Squared
and debugged and profiled
From the watch window we obtain the following when we
step the program to the last line


Starting address of array in memory

Other active windows in CCS 5.1

Enable the clock under the run menu to profile

Time from 1st to 2nd breakpoint

The cycle count at the function call level for the norm_sq function call is 152 in the simulator; hardware was not tried

Assembly Version
The parent C routine is the following:
/******************************************************
Vector norm-squared routine in assembly
******************************************************/
#include <stdio.h>
short norm_asm(short *A, int N);
int main()
{
    int N = 5;
    short A[5] = {1, 2, 3, 6, 7};
    short norm_sq;
    norm_sq = norm_asm(A, N);
    printf("Vector norm squared = %d",norm_sq);
    return 0;
}

From just the C source it is not obvious that the function prototype for norm_asm is actually an assembly routine
The assembly routine is the following:
; Vector norm in assembly
        .global _norm_asm       ;reference name from C
_norm_asm:
        mv   .l2  B4, B1        ;put loop ctr. in a proper reg.
        zero .l1  A2            ;initialize accumulator
loop:
        ldh  .d1  *A4++, A1     ;ld vals pointed to by A4 in A1
        nop  4                  ;required ldh delay
        mpy  .m1  A1, A1, A3    ;square each value
        nop                     ;required mpy delay
        add  .l1  A3, A2, A2    ;accumulate the squared values
        sub  .l2  B1, 1, B1     ;decrement the loop counter
  [B1]  b    .s2  loop          ;branch until B1 == 0
        nop  5                  ;required branch delay
        mv   .d1  A2, A4        ;move result to return reg. A4
        b    .s2  B3            ;branch back to address at B3
        nop  5                  ;required branch delay

Note that each line of assembly code takes the following form:
label:  ||  [cond]  instruction  .unit  operand  ;comment
Labels must start in the first column, may be up to 200 characters long, and must begin with a letter; the colon is optional
When accessing from C the register calling convention is observed, that is, when we enter the function norm_asm(arg1, arg2),
arg1 is a pointer or address to the first value of the array A, and is stored in register A4
arg2 is an int value, i.e., a full 32-bit signed integer, and is stored in register B4
Since arg2 is the array dimension, we will use it as the loop counter starting value

B4 is not a suitable register for loop control (on the C62x/C67x only A1, A2, B0, B1, and B2 can be used as condition registers), so we move (mv) the value stored in B4, in this case to B1
We initialize the accumulator register, A2, using the zero instruction; alternatively mvk .s1 0,A2 works as well
Starting at the top of the loop section, we begin by loading (ldh since we only have 16 bits) the values pointed to by A4 into working register A1
The pointer A4 is post-incremented by just 2 bytes, i.e., one 16-bit address step, following the load operation
The default increment size is controlled by the data type, here it is halfwords (16 bits)
Various pre- and post-increment options are available, including the offset amount, and whether it modifies the original pointer or not (see the table below)


Table 3.1: Pointer incrementing methods; A1 shown (a)

Syntax          Pointer changed   Description
*A1             no                Basic pointer
*+A1[disp]      no                +Pre-offset
*-A1[disp]      no                -Pre-offset
*++A1[disp]     yes               Pre-increment
*--A1[disp]     yes               Pre-decrement
*A1++[disp]     yes               Post-increment
*A1--[disp]     yes               Post-decrement

a. If [disp] is omitted the displacement is one unit of the data type, otherwise the displacement is by integer multiples of Word, Halfword, or Byte. If (disp) is used instead of [disp] the displacement is (disp) bytes.
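For readers more at home in C, several rows of Table 3.1 correspond closely to familiar pointer expressions. The sketch below assumes halfword (short) data, as used with ldh. The function names are hypothetical, and each function returns the loaded value while changing the caller's pointer only in the modes the table marks with "yes". C pointer arithmetic on a short pointer already steps in units of the data size, which matches the [disp] form.

```c
/* Each function models one addressing mode for a halfword load.
   The pointer argument plays the role of register A1. */

/* *+A1[disp]: offset access, pointer unchanged */
short load_pre_offset(const short *p, int disp)
{
    return p[disp];
}

/* *++A1[disp]: advance the pointer first, then load */
short load_pre_increment(short **p, int disp)
{
    *p += disp;
    return **p;
}

/* *A1++[disp]: load first, then advance the pointer */
short load_post_increment(short **p, int disp)
{
    short value = **p;
    *p += disp;
    return value;
}
```

The ldh *A4++, A1 line in the norm routine corresponds to load_post_increment with disp equal to 1: the current halfword is read and the pointer then moves on by one halfword (2 bytes).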

To satisfy the pipeline delays, we follow the ldh with 4 NOPs
Next, we perform a 16-bit multiply (MPY), actually a squaring; the result is stored in A3
To satisfy the pipeline we follow the MPY with one NOP
We accumulate the result into register A2 using ADD
Next, we branch to loop subject to the state of B1
The branch is followed by five NOPs to satisfy the pipeline
delay


Finally, the squared and accumulated value held in A2 is saved to the return register A4
To return back to the C module, we must branch to the address saved in B3
If we had needed to use registers A10-A15 or B10-B15, we would have had to save and restore them accordingly
The final numerical result is again 99
Running in CCS 2: The C code is put into a project for running
on the 6711 DSK as norm_sq_asm.pjt, and debugged and
profiled
The profiling results of the new norm_sq function are:

With the assembly routine the cycle count is reduced to 91, which as a ratio makes the C routine 152/91 = 1.67 times slower, assuming no optimization
With optimization the tables are turned and the C is faster by the factor ?
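As a rough cross-check on the measured figure, the cycles of the hand-written routine can be tallied directly from the listing, since no instructions are issued in parallel and every NOP count is explicit. The sketch below is an estimate only: it assumes one cycle per instruction plus the stated NOP cycles, ignores memory stalls, and leaves out the overhead of the call from C, so it should come out somewhat below the profiled count. The function name is hypothetical.

```c
/* Estimated cycles for _norm_asm with an n-element vector (n >= 1),
   counting one cycle per instruction plus the explicit NOP cycles. */
int norm_asm_cycle_estimate(int n)
{
    int prologue = 1 + 1;          /* mv, zero                 */
    int per_iter = 1 + 4           /* ldh + nop 4              */
                 + 1 + 1           /* mpy + nop                */
                 + 1 + 1           /* add, sub                 */
                 + 1 + 5;          /* conditional b + nop 5    */
    int epilogue = 1 + 1 + 5;      /* mv, b B3, nop 5          */
    return prologue + n * per_iter + epilogue;
}
```

For N = 5 this gives 2 + 5 × 15 + 7 = 84 cycles, compared with the 91 reported by the profiler. Under these assumptions, 60 of the 84 cycles are NOPs or branch delay, which is the slack that software pipelining aims to fill.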


The Linear Assembly Version


The parent C calling routine is again of the form:
/******************************************************
Vector norm-squared routine in linear assembly
******************************************************/
#include <stdio.h>
short norm_sa(short *A, int N);
int main()
{
    int N = 5;
    short A[5] = {1, 2, 3, 6, 7};
    short norm_sq;
    norm_sq = norm_sa(A, N);
    printf("Vector norm squared = %d",norm_sq);
    return 0;
}

The linear assembly routine is the following:

; Vector norm in linear assembly
        .global _norm_sa        ;reference name from C
_norm_sa:
        .cproc  A, N            ;input variables
        .reg    m, sum          ;working variables
        zero    sum             ;zero the accumulator
loop:
        ldh     *A++, m         ;load values pointed to by A
        mpy     m, m, m         ;square each value
        add     m, sum, sum     ;accumulate the squared values
        sub     N, 1, N         ;decrement the loop counter
  [N]   b       loop            ;branch until N == 0
        .return sum             ;return value
        .endproc                ;end linear assembly routine

The function/subroutine is declared .global just as in the assembly case
Following the assembly label _norm_sa, we begin the linear assembly routine with .cproc followed by the input variables (may be dummy names)
Working variables are declared using .reg
The accumulator is cleared using the assembler instruction
zero
A loop is then set up in a similar fashion to the pure assembly
version, except now the precise management of the registers
is left to the assembly optimizer
There is also no need to include NOPs
As before the final answer is 99
Running in CCS 2: The C code is put into a project for running
on the 6711 DSK as norm_sq_sa.pjt, and debugged and
profiled


The profiling results of the new norm_sq function are:

This result is very similar to the assembly result (on the 6713, 90 cycles for the .sa version and 91 for the .asm version)
With say -o3 optimization the linear assembly is faster by the
ratio ?
When debugging a linear assembly routine it is best to use the
mixed mode to display assembly interlisted with C and/or linear assembly
The registers window can then be used to watch what is happening when the code is stepped
