TMS320C6x
Programming
Introduction
This chapter introduces programming the TMS320C6x in assembly, linear
assembly, and C. Preference is given to explaining code development
for the DSK memory map. The material presented in this chapter is
based on the course notes from TI's C6000 4-day design workshop.
Programming Alternatives

Approach                          Efficiency*   Effort
Compiler Optimizer, Intrinsics    70-80%        Low
Linear ASM, Assembly Optimizer    95-100%       Medium
ASM, Hand Optimize                100%          High
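[Figure: TMS320C6x block diagram. The CPU contains register file A (A0-A15) with functional units .L1, .S1, .M1, .D1, register file B (B0-B15) with functional units .L2, .S2, .M2, .D2, and the control registers. Internal buses connect the CPU to data RAM, the DMA, the EMIF (sync and async external memory), the serial port, host port, boot load, timers, and power down logic.]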
To motivate this introduction to assembly programming, consider a basic sum of products or dot product example:

$$y = \sum_{n=1}^{40} a_n x_n \tag{3.1}$$
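As a point of reference for the assembly that follows, the sum of products can be sketched in C. This is only an illustrative baseline: the function name dot_product, the short element type, and the int accumulator are assumptions made here, not something taken from the workshop notes.

```c
#include <stddef.h>

/* Reference sum of products: y = a[0]*x[0] + ... + a[count-1]*x[count-1].
 * The name, the short element type, and the int accumulator are
 * illustrative assumptions. */
int dot_product(const short *a, const short *x, size_t count)
{
    int y = 0;
    for (size_t n = 0; n < count; n++) {
        y += a[n] * x[n];   /* one multiply and one add per term */
    }
    return y;
}
```

For the chapter's example, count would be 40; each pass through the loop body corresponds to one multiply and one accumulate in the assembly version.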
For each term of the sum Y = a_n * x_n, n = 1 to 40, the multiply is carried out on a .M unit and the accumulate on a .L unit:

        MPY  .M   a, x, prod
        ADD  .L   Y, prod, Y

Where are the variables stored?
The variables are held in register file A (A0 through A15, each 32 bits wide):

        MPY  .M   A0, A1, A3
        ADD  .L   A4, A3, A4
A loop counter in A2 and a conditional branch on the .S unit repeat the multiply and accumulate 40 times:

        MVK  .S   40, A2
loop:   MPY  .M   A0, A1, A3
        ADD  .L   A4, A3, A4
        SUB  .L   A2, 1, A2
   [A2] B    .S   loop
The next step is to get variables loaded into the register file
We assume that the variables are located in memory (internal or external)
We then create a pointer to the address of the variable and
store it in a register
Finally, we load the variable itself into another register
Register File A (32-bit registers)

Register   Contents
A0         a
A1         x
A2         loop count
A3         prod
A4         Y
A5         &a[n]
A6         &x[n]
A7         &Y
...
A15

Memory holds a[40], x[40], and Y, which are reached through the pointers *A5, *A6, and *A7.

Load instructions:
        LDB     (char)
        LDH     (short)
        LDW     (int)
        LDDW    (double)   (C67x, C64x)
Store instructions:
        STB
        STH
        STW
        STDW    (C64x)
With the loads from data memory and the final store added, the code becomes:

        MVK  .S   40, A2
loop:   LDH  .D   *A5, A0
        LDH  .D   *A6, A1
        MPY  .M   A0, A1, A3
        ADD  .L   A4, A3, A4
        SUB  .L   A2, 1, A2
   [A2] B    .S   loop
        STH  .D   A4, *A7
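Each pointer is created by loading the 32-bit address in two halves on the .S unit:

        MVKL .S   a, A5     ;store lower half of the address of a
        MVKH .S   a, A5     ;store upper half of the address of a
        MVKL .S   x, A6     ;store lower half of the address of x
        MVKH .S   x, A6     ;store upper half of the address of x
        MVKL .S   y, A7     ;store lower half of the address of y
        MVKH .S   y, A7     ;store upper half of the address of y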
As written, every pass through the loop computes a0 * x0. How do you access a1 and x1 on the second loop? Post-incrementing the pointers steps A5 through a0, a1, a2, ... and A6 through x0, x1, x2, ...:

        MVK  .S   40, A2
loop:   LDH  .D   *A5++, A0
        LDH  .D   *A6++, A1
        MPY  .M   A0, A1, A3
        ADD  .L   A4, A3, A4
        SUB  .L   A2, 1, A2
   [A2] B    .S   loop
        STH  .D   A4, *A7
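[Figure: the C6x has two register files, A (A0-A15) and B (B0-B15), each register 32 bits wide. Register file A is served by functional units .S1, .M1, .L1, .D1 and register file B by .S2, .M2, .L2, .D2; both sides connect to data memory.]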
Naming the specific functional units gives:

        MVK  .S1  40, A2
loop:   LDH  .D1  *A5++, A0
        LDH  .D1  *A6++, A1
        MPY  .M1  A0, A1, A3
        ADD  .L1  A3, A4, A4
        SUB  .L1  A2, 1, A2
   [A2] B    .S1  loop
        STH  .D1  A4, *A7
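To make the register roles in the loop above concrete, the following C sketch mirrors the assembly statement by statement. It is a model for reading the listing, not generated code: the function name dot_product_regs and the local variable names are hypothetical, and the sketch assumes count is at least 1, as the assembly loop does.

```c
/* C mirror of the assembly loop. Comments name the instruction that each
 * statement models. Assumes count >= 1. Names are illustrative only. */
void dot_product_regs(const short *pa, const short *px, short *py, int count)
{
    int y = 0;                    /* A4, assumed already cleared        */
    do {
        short a = *pa++;          /* LDH  *A5++, A0                     */
        short x = *px++;          /* LDH  *A6++, A1                     */
        int prod = a * x;         /* MPY  A0, A1, A3                    */
        y = y + prod;             /* ADD  A3, A4, A4                    */
        count = count - 1;        /* SUB  A2, 1, A2                     */
    } while (count != 0);         /* [A2] B loop                        */
    *py = (short)y;               /* STH  A4, *A7 (halfword store)      */
}
```

The pointer post-increments play the role of *A5++ and *A6++, and the do/while test plays the role of the conditional branch on A2.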
Instructions by category:

Arithmetic:    ABS, ADD, ADDA, ADDK, ADD2, MPY, MPYH, NEG, SMPY, SMPYH, SADD, SAT, SSUB, SUB, SUBA, SUBC, SUB2, ZERO
Logical:       AND, CMPEQ, CMPGT, CMPLT, NOT, OR, SHL, SHR, SSHL, XOR
Bit Mgmt:      CLR, EXT, LMBD, NORM, SET
Data Mgmt:     LDB/H/W, MV, MVC, MVK, MVKL, MVKH, MVKLH, STB/H/W
Program Ctrl:  B, IDLE, NOP
Instructions by functional unit:

.S Unit:       ADD, ADDK, ADD2, AND, B, CLR, EXT, MV, MVC, MVK, MVKL, MVKH, NEG, NOT, OR, SET, SHL, SHR, SSHL, SUB, SUB2, XOR, ZERO
.L Unit:       ABS, ADD, AND, CMPEQ, CMPGT, CMPLT, LMBD, MV, NEG, NORM, NOT, OR, SADD, SAT, SSUB, SUB, SUBC, XOR, ZERO
.M Unit:       MPY, MPYH, MPYLH, MPYHL, SMPY, SMPYH
.D Unit:       ADD, ADDAB (B/H/W), LDB (B/H/W), MV, NEG, STB (B/H/W), SUB, SUBAB (B/H/W), ZERO
No Unit Used:  NOP, IDLE
Instructions by functional unit, with the floating-point (SP/DP) instructions added:

.S Unit:       ADD, ADDK, ADD2, AND, B, CLR, EXT, MV, MVC, MVK, MVKL, MVKH, NEG, NOT, OR, SET, SHL, SHR, SSHL, SUB, SUB2, XOR, ZERO,
               ABSSP, ABSDP, CMPGTSP, CMPEQSP, CMPLTSP, CMPGTDP, CMPEQDP, CMPLTDP, RCPSP, RCPDP, RSQRSP, RSQRDP, SPDP
.L Unit:       ABS, ADD, AND, CMPEQ, CMPGT, CMPLT, LMBD, MV, NEG, NORM, NOT, OR, SADD, SAT, SSUB, SUB, SUBC, XOR, ZERO,
               ADDSP, ADDDP, SUBSP, SUBDP, INTSP, INTDP, SPINT, DPINT, SPTRUNC, DPTRUNC, DPSP
.M Unit:       MPY, MPYH, MPYLH, MPYHL, SMPY, SMPYH, MPYSP, MPYDP, MPYI, MPYID
.D Unit:       ADD, ADDAB (B/H/W), ADDAD, LDB (B/H/W), LDDW, MV, NEG, STB (B/H/W), SUB, SUBAB (B/H/W), ZERO
No Unit Used:  NOP, IDLE
Clock cycle      1    2    3    4    5    6    7    8    9
Non-pipelined    F1   D1   E1   F2   D2   E2   F3   D3   E3
Pipelined        F1   D1   E1
                      F2   D2   E2
                           F3   D3   E3
                           (pipeline full)
Once the pipeline is full the multiple buses of the C6x can
carry out the F, D, and E operations in parallel, all within the
same clock cycle
On the downside, when discontinuities such as program
branching occur, the pipeline must be flushed which results in
added processor overhead
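The benefit of overlapping the F, D, and E operations can be quantified with a small idealized model. The sketch below assumes a simple pipeline with a fixed number of one-cycle stages and no stalls or flushes; the function names are hypothetical and the model deliberately ignores the branch overhead mentioned above.

```c
/* Idealized cycle counts for n instructions on a machine with a fixed
 * number of one-cycle stages (for example F, D, E => stages = 3).
 * No stalls or pipeline flushes are modeled; names are illustrative. */
unsigned cycles_non_pipelined(unsigned n, unsigned stages)
{
    return n * stages;              /* each instruction runs start to finish alone */
}

unsigned cycles_pipelined(unsigned n, unsigned stages)
{
    if (n == 0) {
        return 0;
    }
    return stages + (n - 1);        /* fill the pipe once, then one result per cycle */
}
```

With three instructions and three stages, the non-pipelined count is 9 cycles and the pipelined count is 5, and the gap widens as n grows.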
Program Fetch Stage
The program fetch stage is actually broken into four phases
PG: Generate fetch address
PS: Send address to memory
PW: Wait for data ready
PR: Read opcode
Decode Stage
The decode stage consists of two phases
DP: Route the instruction to a functional unit (dispatch)
DC: Actually decode the instruction at the functional unit
(decode)
Execute Stage
For code writing purposes the execute stage is the most interesting
On the C62x all instructions execute in a single cycle, but
results are delayed by varying amounts
Furthermore, there is an additional cycle before the results
are available, which is known as the pipeline latency
Common examples of delay and latency:

Description    Instructions    Delay   Latency
Single Cycle   all others      0       0+1=1
Multiply       MPY / SMPY      1       1+1=2
Load           LDB/H/W         4       4+1=5
Branch         B               5       5+1=6
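The relation latency = delay + 1 can be captured in a small lookup table. The sketch below is illustrative only: the struct, the class names, and the function latency_of are hypothetical, and the delay values are the C62x delay slots used in the NOP-corrected code later in this chapter (0 for single-cycle, 1 for multiply, 4 for load, 5 for branch).

```c
#include <string.h>

/* Delay slots per instruction class; latency is delay + 1.
 * Class names and the function name are illustrative assumptions. */
struct instr_class {
    const char *name;
    int delay_slots;
};

static const struct instr_class classes[] = {
    { "single",   0 },
    { "multiply", 1 },
    { "load",     4 },
    { "branch",   5 },
};

/* Returns the latency in cycles, or -1 if the class name is unknown. */
int latency_of(const char *name)
{
    for (size_t i = 0; i < sizeof classes / sizeof classes[0]; i++) {
        if (strcmp(classes[i].name, name) == 0) {
            return classes[i].delay_slots + 1;
        }
    }
    return -1;
}
```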
[Figure: pipeline phases. Program fetch: PG, PS, PW, PR (cycles 1-4). Decode: DP, DC (cycles 5-6). Execute: E1 through E6 (cycles 7-12). Successive fetch packets enter the pipeline one cycle apart, so each packet is one phase behind its predecessor.]
[Figure: a 256-bit fetch packet holds eight 32-bit instructions, I1 through I8, each of which is routed to a functional unit.]
Recall that there is a 256-bit wide program data bus for this
purpose
Pipeline Code Example
Consider the sum of products example used earlier
        MVK  .S1  40, A2
loop:   LDH  .D1  *A5++, A0
        LDH  .D1  *A6++, A1
        MPY  .M1  A0, A1, A3
        ADD  .L1  A3, A4, A4
        SUB  .L1  A2, 1, A2
   [A2] B    .S1  loop
        STH  .D1  A4, *A7

We assume A4 is already cleared
ECE 5655/4655 Real-Time DSP
[Figure: pipeline snapshots showing fetch packet FP5-2 (MVK, LDH, LDH, MPY, ADD, SUB, B, STH) moving through program fetch (PG, PS, PW, PR), decode (DP, DC), and execute (E1-E6), with the instructions entering E1 one after another.]
On the 10th cycle the second LDH enters E2 and the first LDH
is moved over to E3, with MPY at E1
[Figure: pipeline state at cycle 10. MPY is in E1, the second LDH is in E2, and the first LDH is in E3; the remaining instructions of FP5-2 (ADD, SUB, B, STH) have not yet reached the execute stage.]
Note that the MPY requires only one delay, but needs values from memory that the LDHs bring in
The LDHs have not finished yet! What to do?
A similar problem exists when the ADD instruction reaches
E1
The one cycle delay of MPY means that the addition has
started too early as well
For the existing code, we see that at 12 cycles MPY and ADD
have both finished, but both LDHs still have not completed
[Figure: pipeline state at cycle 12. MPY and ADD are done, while both LDH instructions are still in their execute phases.]
Description    Delay Slots   # of NOPs
Single Cycle   0             0
Multiply       1             1 (NOP)
Load           4             4 (NOP 4)
Branch         5             5 (NOP 5)
The final NOP-fixed code, including benchmark information (cycles in parentheses), is the following:

        MVK  .S1  40, A2        (1)
loop:   LDH  .D1  *A5++, A0     (1)
        LDH  .D1  *A6++, A1     (1)
        NOP  4                  (4)
        MPY  .M1  A0, A1, A3    (1)
        NOP                     (1)
        ADD  .L1  A3, A4, A4    (1)
        SUB  .L1  A2, 1, A2     (1)
   [A2] B    .S1  loop          (1)
        NOP  5                  (5)
        STH  .D1  A4, *A7       (1)

Loop      = 16 x 40 = 640, + 2 = 642 cycles
Benchmark = 642 cycles
Best case = 28 cycles
The NOPs greatly increase the cycle count, but we have not
tried any optimization yet
With full optimization just 28 cycles can be achieved, less
than the loop count!
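The benchmark arithmetic for the NOP-corrected loop can be checked with a short C sketch. The table below simply transcribes the per-instruction cycle counts of the loop body (two loads, NOP 4, multiply, NOP, add, subtract, branch, NOP 5); the struct and function names are hypothetical, and the two extra cycles account for the MVK before the loop and the STH after it.

```c
#include <stddef.h>

/* Cycle cost of each entry in the loop body of the NOP-corrected code.
 * Names are illustrative; the numbers come from the listing. */
struct slot {
    const char *mnemonic;
    int cycles;
};

static const struct slot loop_body[] = {
    { "LDH",   1 }, { "LDH",   1 }, { "NOP 4", 4 },
    { "MPY",   1 }, { "NOP",   1 },
    { "ADD",   1 }, { "SUB",   1 },
    { "B",     1 }, { "NOP 5", 5 },
};

/* Cycles for one pass through the loop body. */
int loop_body_cycles(void)
{
    int total = 0;
    for (size_t i = 0; i < sizeof loop_body / sizeof loop_body[0]; i++) {
        total += loop_body[i].cycles;
    }
    return total;
}

/* Total cycles: loop passes plus MVK (before) and STH (after). */
int benchmark_cycles(int iterations)
{
    return loop_body_cycles() * iterations + 2;
}
```

Here loop_body_cycles() evaluates to 16 and benchmark_cycles(40) to 642, which agrees with the benchmark figure quoted for this code.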
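The same eight instructions can be written serially or, using || to mark an instruction that executes in parallel with the one before it, partially in parallel:

   Serial          Partially Parallel
   B    .S1           B    .S1
   MVK  .S1        || MVK  .S2
   ADD  .L1           ADD  .L1
   ADD  .L1        || ADD  .L2
   MPY  .M1        || MPY  .M1
   MPY  .M1           MPY  .M1
   LDW  .D1        || LDW  .D1
   LDB  .D1        || LDB  .D2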
[Figure: pipeline diagrams for the serial and partially parallel versions of the code (B, MVK, ADD, ADD, MPY, MPY, LDW, LDB) as the instructions move through DP, DC, and E1-E6.]
   Serial          Partially Parallel     Fully Parallel
   B    .S1           B    .S1               B    .S1
   MVK  .S1        || MVK  .S2            || MVK  .S2
   ADD  .L1           ADD  .L1            || ADD  .L1
   ADD  .L1        || ADD  .L2            || ADD  .L2
   MPY  .M1        || MPY  .M1            || MPY  .M1
   MPY  .M1           MPY  .M1            || MPY  .M2
   LDW  .D1        || LDW  .D1            || LDW  .D1
   LDB  .D1        || LDB  .D2            || LDB  .D2
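One way to read the listings above is to count how many groups of instructions issue together: an instruction without || starts a new group, and each || instruction joins the group of the instruction before it. The C sketch below counts those groups; the struct, the array, and the function name are hypothetical, and the array encodes only the partially parallel column.

```c
#include <stddef.h>

/* One instruction of a listing; parallel is 1 when the line starts with ||.
 * Names are illustrative assumptions. */
struct instr {
    const char *text;
    int parallel;
};

/* The partially parallel column of the listing. */
const struct instr partially_parallel[] = {
    { "B   .S1", 0 }, { "MVK .S2", 1 },
    { "ADD .L1", 0 }, { "ADD .L2", 1 }, { "MPY .M1", 1 },
    { "MPY .M1", 0 }, { "LDW .D1", 1 }, { "LDB .D2", 1 },
};

/* Every instruction without || starts a new group of instructions
 * that issue in the same cycle. */
size_t count_execute_packets(const struct instr *code, size_t n)
{
    size_t packets = 0;
    for (size_t i = 0; i < n; i++) {
        if (!code[i].parallel) {
            packets++;
        }
    }
    return packets;
}
```

Applied to the three columns, the count is 8 for the serial code, 3 for the partially parallel code, and 1 for the fully parallel code.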
[Figure: pipeline diagram for the fully parallel version. All eight instructions (B, MVK, ADD, ADD, MPY, MPY, LDW, LDB) form a single execute packet (EP2) and pass through DP, DC, and the execute phases together.]
.S Unit:  ABSSP (1.1), ABSDP (1.2), CMPEQSP (1.1), CMPGTSP (1.1), CMPLTSP (1.2), CMPEQDP (1.3), CMPGTDP (1.3), CMPLTDP (1.2), RCPSP (1.1), RCPDP (1.2), RSQRSP (1.1), RSQRDP (1.2), SPDP (1.2)
.L Unit:  ADDSP (1.3), ADDDP (2.7), DPINT (1.4), DPSP (1.4), INTDP (1.5), INTDPU (1.5), INTSP (1.4), INTSPU (1.4), SPINT (1.4), SPTRUNC (1.4), SUBSP (1.4), SUBDP (2.7)
.M Unit:  MPYSP (1.4), MPYDP (4.10), MPYI (4.9), MPYID (4.10)
.D Unit:  ADDAD (1.1), LDDW (1.5)
C Programming
This section focuses on some of the uses of the C6x development tools and on some of the compiler, assembler, and linker settings.
As stated at the beginning of this chapter, C code
can achieve from 80-100% of the efficiency of hand assembly
Further optimization, which is discussed in this section, will
likely be required, but it is safe to say that C code is a good
starting point for algorithm development
Recall the basic code building tool layout is:

        Editor -> .c / .cpp -> Compiler -> .asm
        Editor -> .sa -> Asm Optimizer -> .asm
        Editor -> .asm
        .asm -> Asm -> .obj -> Linker (with Link.cmd) -> .out
[Figure: development flow in the Studio environment: Edit, Compile, Asm Opto, Asm, Link, Debug, with Probe In, Probe Out, Profiling, Graphs, and the BIOS library, and with the targets SIM, DSK, EVM, third-party boards, and an XDS connection to a DSP board. In short, file.c passes through Compile, Asm, and Link to produce file.out.]

Studio Includes:
        Code Generation Tools
        BIOS: Real-time kernel, Real-time analysis (RTA)
        Simulator Plug-ins, RTDX
Debug options (debug: -gs)

Option      Description                                     CC Tab
-mv6700     Generate C6700 code (C6200 is default)          Compiler
-fr <dir>   Directory containing source files               Compiler
-g          Enables src-level symbolic debugging            Comp/Asm
-s          Interlist C statements into assembly listing    Comp/Asm
-k          Keep assembly file                              Compiler
-mg         Enables minimum debug to allow profiling        Compiler
-mt         No aliasing used                                Compiler
-o3         Invoke optimizer (-o0, -o1, -o2/-o, -o3)        Compiler
-pm         Combine all C source files before compile       Compiler

In total there are about five pages of options in the compiler
user manual
Optimize Options (speed opto: -k -mgt -o3 -pm)

Option      Description                                     CC Tab
-mv6700     Generate C6700 code (C6200 is default)          Compiler
-fr <dir>   Directory containing source files               Compiler
-g          Enables src-level symbolic debugging            Comp/Asm
-s          Interlist C statements into assembly listing    Comp/Asm
-k          Keep assembly file                              Compiler
-mg         Enables minimum debug to allow profiling        Compiler
-mt         No aliasing used                                Compiler
-o3         Invoke optimizer (-o0, -o1, -o2/-o, -o3)        Compiler
-pm         Combine all C source files before compile       Compiler
-ms         Minimize code size (-ms0/-ms, -ms1, -ms2)       Compiler
-oi0        Disables automatic function inlining            Compiler
Code Size Options (size opto)

The options available are the same as for speed: -mv6700, -fr <dir>, -g, -s, -k, -mg, -mt, -o3, -pm, -ms, -oi0
Assembler Options (-gls)

Option      Description                                     CC Tab
-g          Enables src-level symbolic debugging            Comp/Asm
-l          Create assembler listing file (small -L)        Assembler
-s          Retain asm symbols for debugging                Assembler
Linker Options

Option      Description                                     CC Tab
-o <file>   Output file name                                Linker
-m <file>   Map file name                                   Linker
-c          Auto-initialize global/static C variables       Linker
Summary of options

Option      Description                                       Options Tab
-mv6700     Generate C6700 code (C6200 is default)            Compiler
-fr <dir>   Directory containing source files                 Compiler
-g          Enables src-level symbolic debugging              Comp/Asm
-s          Interlist C statements into assembly listing      Compiler
-k          Keep assembly file                                Compiler
-mg         Enables minimum debug to allow profiling          Compiler
-mt         No aliasing used                                  Compiler
-o3         Invoke optimizer (-o0, -o1, -o2/-o, -o3)          Compiler
-pm         Combine all C source files before compile         Compiler
-ms         Minimize code size (-ms0/-ms, -ms1, -ms2)         Compiler
-oi0        Disables automatic function inlining              Compiler
-l          Create assembler listing file (small -L)          Assembler
-s          Retain asm symbols for debugging                  Assembler
-o <file>   Output file name                                  Linker
-m <file>   Map file name                                     Linker
-c          Auto-Init C variables (-cr turns off autoinit)    Linker
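[Figure: build flow. file.c passes through the C Optimizer (-o) and, with -s, yields file.asm; the Assembler produces file.obj and, with -al, file.lst; the Linker (-z) combines file.obj with the run-time library (boot.c) and produces file.out (-o) and file.map (-m).]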
From a C programming perspective on a host, once the system resets and initializes, we only deal with the program
Basic sections of a C file (the scanf and malloc calls are schematic, as on the original slide):

short m = 10;            /* global variables with initial values */
short b = 2;
short y = 0;

main()
{
    short x = 0;         /* local variables */
    scanf(x);
    malloc(y);           /* dynamic variables */
    y = m * x;           /* code */
    y = y + b;
}

After the reset pin is asserted, the reset vector leads to the system initialization, which in turn starts the program.
The reset vector branches to the system initialization code in boot.c, which performs three steps before the program runs:
1. Initialize pointers: stack, heap, global/static (discussed in mod 11)
2. Initialize global and static variables
3. Call _main
vectors.asm (the reset vector occupies one fetch packet and branches to _c_int00 in boot.c, which then calls _main and so starts main()):

        .global _c_int00
        .sect   "vectors"
        b       _c_int00
        nop     5
        nop
        nop
        nop
        nop
        nop
        nop
Compiler Sections
The system software is broken into modules of code and data
known as sections
The sections as found in a typical C program are shown
below:
[Figure: hardware and software view of the system. Hardware: the C6x with internal RAM and peripherals, plus external memory (ROM and RAM). Software: vectors (reset), system init (boot.c), C code (main.c), global variables, initial values for globals, heap (dynamic), and stack (local), grouped into program code and data.]
Software module            Section
Vectors (reset)            your choice
System Init (boot.c)       .text
C Code (main.c)            .text
Variables (global)         .bss
Init Values (global)       .cinit
Heap (dynamic)             .sysmem
Stack (local)              .stack
Section    Description                                  Memory Type
.text      Code                                         initialized
.switch    Tables for switch statements                 initialized
.const     Constants and string literals                initialized
.cinit     Initial values for global/static variables   initialized
.bss       Global and static variables                  uninitialized
.far       Global and static variables declared far     uninitialized
.stack     Stack (local variables)                      uninitialized
.sysmem    Heap (dynamic memory)                        uninitialized
.cio       Buffers for C I/O functions                  uninitialized

[Figure: example section placement for a C6201 system, with sections distributed over EPROM (CE0), program RAM at 140_0000, data RAM at 8000_0000, and SDRAM (CE2).]
Memory Management
We control the physical mapping of memory to program and
data sections via a linker command file

[Figure: the linker takes the .obj files and the .cmd file and produces the .out file (-o) and the .map file (-m), placing the sections in the C6x internal RAM and in external ROM and RAM.]

The linker command file .cmd has two parts

MEMORY
{
    Memory Description
}
SECTIONS
{
    Binding Code/Data Sections to Memory
}
6713 DSK memory map:

Address       Contents
0000_0000     64K x 8 internal (64K unified RAM; the CPU also has a 4K program cache and a 4K data cache)
0180_0000     On-chip peripherals
8000_0000     CE0: 256M x 8 external
9000_0000     CE1: 256M x 8 external
A000_0000     CE2: 256M x 8 external
B000_0000     CE3: 256M x 8 external
FFFF_FFFF     end of the address space

The 6713 DSK has 16M at 8000_0000.

[Figure: linker command file for the DSK. In the SECTIONS part, the :> operator binds one section to vecs and the remaining sections to IRAM or SDRAM.]
Linker Options
In the third tab of the project options dialog box, we set linker
options
To call an assembly function from C, use a leading underscore on the assembly label (_asmFunction:) and make the label global:

Parent.C
    int child(int, int);
    int x = 7, y, w = 3;

    void main (void)
    {
        y = child(x, 5);
    }

Child.C
    int child(int a, int b)
    {
        return(a + b);
    }

Child.ASM
            .global _child
    _child:
            ...assembly code...
            ; end of subroutine
To pass variables in, take a return value, and return to the parent code flow, we use a set of argument/register passing rules
Arguments are passed in registers as shown; the return value is placed in A4 and the function returns to the address in B3

Register   A            B
3                       ret addr
4          arg1/r_val   arg2
6          arg3         arg4
8          arg5         arg6
10         arg7         arg8
12         arg9         arg10
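The register assignment follows a regular pattern: odd-numbered arguments go to the A side, even-numbered arguments to the B side, starting at register 4 and stepping by two. The C sketch below encodes that pattern for the first ten arguments. It is a reading aid only: the function name arg_register is hypothetical, and the sketch assumes every argument fits in a single 32-bit register (for example int or a pointer).

```c
#include <stdio.h>

/* Writes the name of the register that carries argument number arg_index
 * (1-based, 1..10) into buf. Returns 0 on success, -1 if out of range.
 * Assumes each argument occupies one 32-bit register; the function name
 * is an illustrative assumption. */
int arg_register(int arg_index, char *buf, size_t buflen)
{
    if (arg_index < 1 || arg_index > 10) {
        return -1;
    }
    char side = (arg_index % 2 == 1) ? 'A' : 'B';   /* odd -> A, even -> B */
    int number = 4 + 2 * ((arg_index - 1) / 2);      /* 4, 6, 8, 10, 12     */
    snprintf(buf, buflen, "%c%d", side, number);
    return 0;
}
```

For child(x, 5), this gives A4 for the first argument and B4 for the second, which is why the assembly version of child can add a4 and b4 directly.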
A simple example

Parent.C
    int child(int, int);
    int x = 7, y, w = 3;

    void main (void)
    {
        y = child(x, 5);
    }

Child.C
    int child(int a, int b)
    {
        return(a + b);
    }

Child.ASM
            .global _child
    _child: add     a4, b4, a4      ; arguments in A4 and B4, return/result in A4
            b       b3
            nop     5
            ; end of subroutine

Child2.ASM
            .global _child2
            .global _w
    _child2:
            mvkl    _w, A1
            mvkh    _w, A1
            ldw     *A1, A0
There is actually a bit more to this (see below), but more later
[Figure: the register usage shown earlier (arg1/r_val through arg10 and ret addr in register files A and B), extended with the DP and SP registers and the stack, which holds extra arguments in addition to the prior stack contents.]
[Figure: tool flow. The text editor produces .c / .cpp, .sa, or .asm source; the compiler handles .c / .cpp, the assembler produces .obj, and the linker produces .out.]

The dot product loop written with symbolic names in place of registers and functional units:

        zero    sum
loop:   ldh     *pm++, m
        ldh     *pn++, n
        mpy     m, n, prod
        add     prod, sum, sum
        sub     count, 1, count
[count] b       loop
The same loop with the .cproc, .reg, .return, and .endproc directives added:

        .cproc  pm, pn, count
        .reg    m, n, prod, sum
        zero    sum
loop:   ldh     *pm++, m
        ldh     *pn++, n
        mpy     m, n, prod
        add     prod, sum, sum
        sub     count, 1, count
[count] b       loop
        .return sum
        .endproc
One procedure can call another with .call:

        .cproc
        .reg    val
        mvk     5, val
        .call   val = _testcall(val)
        .return val
        .endproc

_testcall: .cproc  input
        add     input, 5, input
        .return input
        .endproc
$$\|A\|^2 = \sum_{n=1}^{N} A_n^2 \tag{3.1}$$

where

$$A = [A_1 \;\cdots\; A_N] \tag{3.2}$$
/******************************************************
Vector norm-squared routine in C
******************************************************/
#include <stdio.h>

short norm(short *A, int N);

int main()
{
    int N = 5;
    short A[5] = {1, 2, 3, 6, 7};
    short norm_sq;

    norm_sq = norm(A, N);
    printf("Vector norm squared = %d\n", norm_sq);
    return 0;
}

short norm(short* V, int n)
{
    int i;
    short out = 0;

    for(i=0; i<n; i++)
    {
        out += V[i]*V[i];
    }
    return out;
}
Running in CCS 5.1: The C code is placed in a project, Norm_Squared, for running on the OMAP-L138 or the simulator,
and is then debugged and profiled
From the watch window we obtain the following when we
step the program to the last line
The cycle count at the function call level for the norm_sq
function call is 152 in the simulator; it was not measured on hardware
Assembly Version
The parent C routine is the following:
/******************************************************
Vector norm-squared routine in assembly
******************************************************/
#include <stdio.h>

short norm_asm(short *A, int N);

int main()
{
    int N = 5;
    short A[5] = {1, 2, 3, 6, 7};
    short norm_sq;

    norm_sq = norm_asm(A, N);
    printf("Vector norm squared = %d\n", norm_sq);
    return 0;
}
From just the C source it is not obvious that the function prototype for norm_asm is actually an assembly routine
The assembly routine is the following:
; Vector norm in assembly
        .global _norm_asm           ;reference name from C
_norm_asm:
        mv      .l2   B4, B1        ;copy N (second argument) to the loop counter
        zero    .l1   A2            ;zero the accumulator
loop:   ldh     .d1   *A4++, A1     ;ld vals pointed to by A4 in A1
        nop     4                   ;required ldh delay
        mpy     .m1   A1, A1, A3    ;square each value
        nop                         ;required mpy delay
        add     .l1   A3, A2, A2    ;accumulate the squared values
        sub     .l2   B1, 1, B1     ;decrement the loop counter
   [B1] b       .s2   loop          ;branch until B1 == 0
        nop     5                   ;required branch delay
        mv      .d1   A2, A4        ;place the result in A4
        b       .s2   B3            ;return to the caller
        nop     5

Each line has the form: instruction .unit operand ;comment
Syntax         Description       Pointer changed
*A1            Basic pointer     no
*+A1[disp]     +Pre-offset       no
*-A1[disp]     -Pre-offset       no
*++A1[disp]    Pre-increment     yes
*--A1[disp]    Pre-decrement     yes
*A1++[disp]    Post-increment    yes
*A1--[disp]    Post-decrement    yes

a. If [disp] is omitted, the displacement is one unit of the data type; otherwise the displacement is an integer multiple of Word, Halfword, or
Byte. If (disp) is used instead of [disp], the displacement is (disp) bytes.
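These addressing modes have close analogues in C pointer expressions, which may help when reading the assembly. The sketch below models four of them for halfword (short) data, where [disp] counts elements rather than bytes; the function names are hypothetical and the pointer p stands in for register A1.

```c
/* C analogues of C6x addressing modes for halfword data.
 * p plays the role of A1; d plays the role of [disp] in elements.
 * Function names are illustrative assumptions. */

/* *A1 : basic pointer, pointer unchanged */
short load_basic(const short *p)
{
    return *p;
}

/* *+A1[d] : pre-offset, pointer unchanged */
short load_pre_offset(const short *p, int d)
{
    return *(p + d);
}

/* *++A1[d] : pre-increment, pointer updated before the access */
short load_pre_increment(const short **p, int d)
{
    *p += d;
    return **p;
}

/* *A1++[d] : post-increment, pointer updated after the access */
short load_post_increment(const short **p, int d)
{
    short value = **p;
    *p += d;
    return value;
}
```

The pre-decrement and post-decrement forms follow the same pattern with the displacement subtracted instead of added.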
        .cproc  A, N                ;input variables
        .reg    m, sum              ;working variables
        zero    sum                 ;zero the accumulator
loop:   ldh     *A++, m
        mpy     m, m, m             ;square each value
        add     m, sum, sum         ;accumulate the squared values
        sub     N, 1, N             ;decrement the loop counter
   [N]  b       loop                ;branch until N == 0
        .return sum                 ;return value
        .endproc                    ;end linear assembly routine
This result is very similar to the assembly result (on the 6713:
90 for .sa and 91 for .asm)
With, say, -o3 optimization the linear assembly is faster by the
ratio ?
When debugging a linear assembly routine it is best to use the
mixed mode to display assembly interlisted with C and/or linear assembly
The registers window can then be used to watch what is happening when the code is stepped