ARM Introduction & Instruction Set Architecture

Outline
ARM Architecture
ARM Organization and Implementation
ARM Instruction Set
Thumb Instruction Set
Architectural Support for System Development
ARM Processor Cores
Memory Hierarchy
Architectural Support for Operating Systems
ARM CPU Cores
Embedded ARM Applications
ARM History
ARM: Acorn RISC Machine (1983-1985)
Acorn Computers Limited, Cambridge, England
ARM: Advanced RISC Machine (1990)

ARM Limited, 1990
ARM has been licensed to many semiconductor manufacturers
ARM's visible registers

User level
  15 GPRs (r0-r14), PC (r15), CPSR (current program status register)
Remaining registers are used for system-level programming and for handling exceptions

Usable in user mode: r0-r12, r13, r14, r15 (PC), CPSR
System modes only (banked registers):
  fiq mode:        r8_fiq-r14_fiq, SPSR_fiq
  svc mode:        r13_svc, r14_svc, SPSR_svc
  abort mode:      r13_abt, r14_abt, SPSR_abt
  irq mode:        r13_irq, r14_irq, SPSR_irq
  undefined mode:  r13_und, r14_und, SPSR_und
ARM CPSR format

Bits 31-28: N (Negative), Z (Zero), C (Carry), V (oVerflow)
Bits 27-8: unused
Bit 7 (I), bit 6 (F): interrupt enable control (a set bit disables IRQ or FIQ respectively)
Bit 5 (T): controls the instruction set
  T = 1: instruction stream is 16-bit Thumb instructions
  T = 0: instruction stream is 32-bit ARM instructions
Bits 4-0: mode, controls the processor mode
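The field layout above can be unpacked with a few shifts and masks. The following is a minimal sketch; the `decode_cpsr` helper and the `MODES` table are illustrative names, and the 5-bit mode encodings are an assumption taken from the classic ARM architecture, since the slide does not list them.

```python
# Mode encodings are an assumption (classic ARM architecture); the slide
# only says that bits 4-0 select the processor mode.
MODES = {
    0b10000: "user",
    0b10001: "fiq",
    0b10010: "irq",
    0b10011: "svc",
    0b10111: "abort",
    0b11011: "undefined",
    0b11111: "system",
}


def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into its flag, control, and mode fields."""
    return {
        "N": (cpsr >> 31) & 1,
        "Z": (cpsr >> 30) & 1,
        "C": (cpsr >> 29) & 1,
        "V": (cpsr >> 28) & 1,
        "I": (cpsr >> 7) & 1,
        "F": (cpsr >> 6) & 1,
        "T": (cpsr >> 5) & 1,
        "mode": MODES.get(cpsr & 0x1F, "unknown"),
    }
```

For example, `decode_cpsr(0x600000D3)` reports Z and C set, both interrupt bits set, ARM state (T = 0), and svc mode.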
ARM memory organization

Linear array of bytes numbered from 0 to 2^32 - 1
Data items
  bytes (8 bits)
  half-words (16 bits): always aligned to 2-byte boundaries (start at an even byte address)
  words (32 bits): always aligned to 4-byte boundaries (start at a byte address which is a multiple of 4)

[Figure: memory drawn as 32-bit rows (bit 31 to bit 0) with byte addresses 0 to 23, holding byte0-byte3, half-word4, byte6, word8, half-word12, half-word14, and word16]
ARM instruction set

Load-store architecture
  operands are in GPRs
  load/store: the only instructions that operate on memory
Instructions
  Data Processing: use and change only register values
  Data Transfer: copy memory values into registers (load) or copy register values into memory (store)
  Control Flow
    o branch
    o branch-and-link: save return address to resume the original sequence
    o trapping into system code: supervisor calls
ARM instruction set (cont'd)

Three-address data processing instructions
Conditional execution of every instruction
Powerful load/store multiple register instructions
Ability to perform a general shift operation and a general ALU operation in a single instruction that executes in a single clock cycle
Open instruction set extension through the coprocessor instruction set, including adding new registers and data types to the programmer's model
Very dense 16-bit compressed representation of the instruction set in the Thumb architecture
I/O system
I/O is memory mapped
  internal registers of peripherals (disk controllers, network interfaces, etc.) are addressable locations within the ARM's memory map and may be read and written using the load-store instructions
Peripherals may use either the normal interrupt (IRQ) or fast interrupt (FIQ) input
  normally most interrupt sources share the IRQ input, while just one or two time-critical sources are connected to the FIQ input
Some systems may include external DMA hardware to handle high-bandwidth I/O traffic
ARM exceptions
ARM supports a range of interrupts, traps, and supervisor calls; all are grouped under the general heading of exceptions
Handling exceptions
  current state is saved by copying the PC into r14_exc and the CPSR into SPSR_exc (exc stands for the exception type)
  processor operating mode is changed to the appropriate exception mode
  PC is forced to a value between 00₁₆ and 1C₁₆, the particular value depending on the type of exception
  the instruction at the location the PC is forced to (the vector address) usually contains a branch to the exception handler; the exception handler will use r13_exc, which is normally initialized to point to a dedicated stack in memory, to save some user registers
  return: restore the user registers and then restore PC and CPSR atomically
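The entry sequence above can be modelled in a few lines. This is a sketch only: the `VECTORS` table (which vector address and mode belong to which exception) is an assumption based on the classic ARM architecture, not something the slide lists, and the model copies the PC unchanged into r14_exc as the slide words it, ignoring the small pipeline offset that real hardware adds.

```python
# Assumed vector table: exception kind -> (vector address, mode suffix).
# The slide only states that vectors lie between 0x00 and 0x1C.
VECTORS = {
    "reset": (0x00, "svc"),
    "undefined": (0x04, "und"),
    "swi": (0x08, "svc"),
    "prefetch_abort": (0x0C, "abt"),
    "data_abort": (0x10, "abt"),
    "irq": (0x18, "irq"),
    "fiq": (0x1C, "fiq"),
}


def enter_exception(state, kind):
    """Return the processor state after taking exception `kind`.

    `state` is a hypothetical dict with at least 'pc', 'cpsr', and 'mode'.
    """
    vector, mode = VECTORS[kind]
    new_state = dict(state)
    new_state["r14_" + mode] = state["pc"]     # save return address
    new_state["SPSR_" + mode] = state["cpsr"]  # save status register
    new_state["mode"] = mode                   # switch to exception mode
    new_state["pc"] = vector                   # force PC to the vector
    return new_state
```

Returning from the handler is the reverse: the saved SPSR and link register are copied back to the CPSR and PC in one step.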
ARM cross-development toolkit

Software development
  tools developed by ARM Limited
  public domain tools (ARM back end for gcc C compiler)
Cross-development
  tools run on a different architecture from the one for which they produce code

[Figure: toolkit flow. C source and C libraries feed the C compiler and asm source feeds the assembler; both produce .aof object files; the linker combines these with object libraries into an .axf image; the ARMsd debugger runs the image on a system model (ARMulator) or on a development board]
Outline
ARM Architecture
ARM Assembly Language Programming
ARM Instruction Set
Architectural Support for High-level Languages
Architectural Support for System Development
ARM Processor Cores
Memory Hierarchy
ARM CPU Cores
ARM Instruction Set

Data Processing Instructions
Data Transfer Instructions
Control flow Instructions
Data Processing Instructions

Classes of data processing instructions
  Arithmetic operations
  Bit-wise logical operations
  Register-movement operations
  Comparison operations
Operands: 32 bits wide; there are 3 ways to specify operands
  come from registers
  the second operand may be a constant (immediate)
  shifted register operand
Result: 32 bits wide, placed in a register
  long multiply produces a 64-bit result
Data Processing Instructions (cont'd)

Arithmetic Operations
  ADD r0, r1, r2    r0 := r1 + r2
  ADC r0, r1, r2    r0 := r1 + r2 + C
  SUB r0, r1, r2    r0 := r1 - r2
  SBC r0, r1, r2    r0 := r1 - r2 + C - 1
  RSB r0, r1, r2    r0 := r2 - r1
  RSC r0, r1, r2    r0 := r2 - r1 + C - 1
Bit-wise Logical Operations
  AND r0, r1, r2    r0 := r1 and r2
  ORR r0, r1, r2    r0 := r1 or r2
  EOR r0, r1, r2    r0 := r1 xor r2
  BIC r0, r1, r2    r0 := r1 and (not r2)
Register Movement
  MOV r0, r2        r0 := r2
  MVN r0, r2        r0 := not r2
Comparison Operations
  CMP r1, r2        set cc on r1 - r2
  CMN r1, r2        set cc on r1 + r2
  TST r1, r2        set cc on r1 and r2
  TEQ r1, r2        set cc on r1 xor r2
Data Processing Instructions (cont'd)

Immediate operands:
  immediate = (0 -> 255) x 2^(2n), 0 <= n <= 12
  ADD r3, r3, #3            r3 := r3 + 3
  AND r8, r7, #&ff          r8 := r7[7:0], & for hex
Shifted register operands
  the second operand is subject to a shift operation before it is combined with the first operand
  ADD r3, r2, r1, LSL #3    r3 := r2 + 8 x r1
  ADD r5, r5, r3, LSL r2    r5 := r5 + 2^r2 x r3
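The immediate rule can be checked mechanically. The sketch below follows the formula on the slide exactly: an 8-bit value times 2^(2n) with n from 0 to 12. The helper name `encode_immediate` is hypothetical. Real ARM encodings express this as a rotation and so also accept a few wrap-around patterns that this simplified check rejects.

```python
def encode_immediate(value):
    """Return (imm8, n) such that value == imm8 * 2**(2*n), or None.

    Implements the slide's rule: imm8 in 0..255 and n in 0..12.
    """
    if not 0 <= value <= 0xFFFFFFFF:
        return None
    for n in range(13):
        shift = 2 * n
        fits_exactly = value % (1 << shift) == 0
        if fits_exactly and (value >> shift) <= 255:
            return (value >> shift, n)
    return None
```

A constant such as `0x3FC` encodes as `(255, 1)`, while `0x101` has no encoding because its set bits span more than eight positions; an assembler would have to build it with two instructions or load it from memory.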
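ARM shift operations

LSL  Logical Shift Left
LSR  Logical Shift Right
ASR  Arithmetic Shift Right
ROR  Rotate Right
RRX  Rotate Right Extended by 1 place

[Figure: bit movement across bits 31..0 for LSL #5 (00000 enters at the bottom), LSR #5 (00000 enters at the top), ASR #5 with a positive operand (00000 enters at the top), ASR #5 with a negative operand (11111 enters at the top), ROR #5, and RRX (rotation through the carry flag C)]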
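The five shift types can be modelled on 32-bit values as follows. This is an illustrative sketch: the function names mirror the mnemonics, shift amounts are assumed to be in the range 0 to 31, and `rrx` takes the carry flag as an explicit argument and returns the new carry alongside the result.

```python
MASK = 0xFFFFFFFF  # keep every result within 32 bits


def lsl(x, n):
    """Logical shift left: zeros enter at bit 0."""
    return (x << n) & MASK


def lsr(x, n):
    """Logical shift right: zeros enter at bit 31."""
    return (x & MASK) >> n


def asr(x, n):
    """Arithmetic shift right: copies of the sign bit enter at bit 31."""
    x &= MASK
    if x & 0x80000000:
        return ((x >> n) | (MASK << (32 - n))) & MASK
    return x >> n


def ror(x, n):
    """Rotate right: bits leaving bit 0 re-enter at bit 31."""
    x &= MASK
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK


def rrx(x, carry):
    """Rotate right extended by one place through the carry flag."""
    x &= MASK
    return ((carry << 31) | (x >> 1)), x & 1
```

The two ASR cases in the figure correspond to `asr(0x40000000, 5)` (zeros shifted in) and `asr(0x80000000, 5)` (ones shifted in).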
Setting the condition codes

Any DPI (data processing instruction) can set the condition codes (N, Z, V, and C)
  for all DPIs except the comparison operations a specific request must be made
  at the assembly language level this request is indicated by adding an `S` to the opcode
Example (64-bit addition, r3:r2 := r1:r0 + r3:r2)
  ADDS r2, r2, r0    ; carry out to C
  ADC r3, r3, r1     ; ... add into high word
Arithmetic operations set all the flags (N, Z, C, and V)
Logical and move operations set N and Z
  preserve V and either preserve C when there is no shift operation, or set C according to the shift operation (the bit that falls off)
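The ADDS/ADC pair above is the standard way to chain a carry between words. A minimal model, with hypothetical helper names `adds`, `adc`, and `add64`, shows why the `S` suffix matters: without it the first addition would not update C and the high word would miss the carry.

```python
MASK = 0xFFFFFFFF


def adds(a, b):
    """ADDS: 32-bit add that also returns the carry flag."""
    total = (a & MASK) + (b & MASK)
    return total & MASK, total >> 32


def adc(a, b, carry):
    """ADC: 32-bit add that includes the incoming carry flag."""
    return ((a & MASK) + (b & MASK) + carry) & MASK


def add64(lo_a, hi_a, lo_b, hi_b):
    """64-bit addition built from a low-word ADDS and a high-word ADC."""
    lo, carry = adds(lo_a, lo_b)
    hi = adc(hi_a, hi_b, carry)
    return lo, hi
```

Adding 1 to the 64-bit value 0x00000000_FFFFFFFF, for instance, yields a low word of 0 and a high word of 1.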
Multiplies
Example (Multiply, Multiply-Accumulate)
  MUL r4, r3, r2            r4 := [r3 x r2]<31:0>
  MLA r4, r3, r2, r1        r4 := [r3 x r2 + r1]<31:0>
Note
  least significant 32 bits are placed in the result register, the rest are ignored
  immediate second operand is not supported
  result register must not be the same as the first source register
  if the `S` bit is set, V is preserved and C is rendered meaningless
Example (r0 = r0 x 35)
  ADD r0, r0, r0, LSL #2    ; r0' = 5 x r0
  RSB r0, r0, r0, LSL #3    ; r0'' = 7 x r0' = 35 x r0
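Multiplying by a constant with shifts and adds relies on factoring the constant: 35 = 5 x 7, where 5 = 4 + 1 and 7 = 8 - 1. The sketch below mirrors that two-instruction idea in Python; `times35` is a hypothetical name and the second step assumes a reverse subtract of the form `RSB r0, r0, r0, LSL #3`.

```python
MASK = 0xFFFFFFFF


def times35(r0):
    """Multiply by 35 using one shifted add and one shifted reverse subtract."""
    r0 &= MASK
    r0 = (r0 + (r0 << 2)) & MASK   # ADD r0, r0, r0, LSL #2  -> 5 x r0
    r0 = ((r0 << 3) - r0) & MASK   # RSB r0, r0, r0, LSL #3  -> 7 x (5 x r0)
    return r0
```

The result wraps modulo 2^32, matching a MUL that keeps only the least significant 32 bits.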
Data transfer instructions

Single register load and store instructions
  transfer of a data item (byte, half-word, word) between ARM registers and memory
Multiple register load and store instructions
  enable transfer of large quantities of data
  used for procedure entry and exit, to save/restore workspace registers, to copy blocks of data around memory
Single register swap instructions
  allow exchange between a register and memory in one instruction
  used to implement semaphores to ensure mutual exclusion on accesses to shared data in multiprocessor systems
Data Transfer Instructions (cont'd)

Single register load and store
Register-indirect addressing
  LDR r0, [r1]          r0 := mem32[r1]
  STR r0, [r1]          mem32[r1] := r0
  Note: r1 keeps a word address (2 LSBs are 0)
  LDRB r0, [r1]         r0 := mem8[r1]
  Note: no restrictions for r1
Base+offset addressing (offset of up to 4 Kbytes)
  LDR r0, [r1, #4]      r0 := mem32[r1 + 4]
Auto-indexing addressing
  LDR r0, [r1, #4]!     r0 := mem32[r1 + 4]
                        r1 := r1 + 4
Post-indexed addressing
  LDR r0, [r1], #4      r0 := mem32[r1]
                        r1 := r1 + 4
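The three offset forms differ only in when the offset is applied and whether the base register is written back. A small model makes the difference concrete; memory is a plain dict from address to word, registers are a dict from name to value, and the function names are illustrative only.

```python
def ldr_offset(mem, regs, rd, rn, offset):
    """LDR rd, [rn, #offset]: base+offset, base register unchanged."""
    regs[rd] = mem[regs[rn] + offset]


def ldr_pre_indexed(mem, regs, rd, rn, offset):
    """LDR rd, [rn, #offset]!: auto-indexing, base updated before the load."""
    regs[rn] += offset
    regs[rd] = mem[regs[rn]]


def ldr_post_indexed(mem, regs, rd, rn, offset):
    """LDR rd, [rn], #offset: load from the base, then update the base."""
    regs[rd] = mem[regs[rn]]
    regs[rn] += offset
```

Post-indexing is what lets a table-walking loop load a word and step the pointer in a single instruction.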
Data Transfer Instructions (cont'd)

Copying with register-indirect addressing
COPY:   ADR r1, TABLE1      ; r1 points to TABLE1
        ADR r2, TABLE2      ; r2 points to TABLE2
LOOP:   LDR r0, [r1]
        STR r0, [r2]
        ADD r1, r1, #4
        ADD r2, r2, #4
        ...
TABLE1: ...
TABLE2: ...
Copying with post-indexed addressing
COPY:   ADR r1, TABLE1
        ADR r2, TABLE2
LOOP:   LDR r0, [r1], #4
        STR r0, [r2], #4
        ...
Data Transfer Instructions

Multiple register data transfers
  LDMIA r1, {r0, r2, r5}    r0 := mem32[r1]
                            r2 := mem32[r1 + 4]
                            r5 := mem32[r1 + 8]
  Note: any subset (or all) of the registers may be transferred with a single instruction
  Note: the order of registers within the list is insignificant
  Note: including r15 in the list will cause a change in the control flow
Stack organizations
  FA  full ascending
  EA  empty ascending
  FD  full descending
  ED  empty descending
Block copy view
  data is to be stored above or below the address held in the base register
  address incrementing or decrementing begins before or after storing the first value
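Multiple register transfer addressing modes

[Figure: memory between 1000₁₆ and 1018₁₆ for each of the four store-multiple modes, with the base register r9 starting at 100c₁₆ and the updated base r9' marked]
  STMIA r9!, {r0,r1,r5}    increment after: r0, r1, r5 stored upward starting at 100c₁₆; r9' = 1018₁₆
  STMIB r9!, {r0,r1,r5}    increment before: r0, r1, r5 stored upward starting at 1010₁₆; r9' = 1018₁₆
  STMDA r9!, {r0,r1,r5}    decrement after: r5, r1, r0 stored downward starting at 100c₁₆; r9' = 1000₁₆
  STMDB r9!, {r0,r1,r5}    decrement before: r5, r1, r0 stored downward starting at 1008₁₆; r9' = 1000₁₆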
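The four addressing modes can be captured by computing where the lowest-numbered register lands and what the written-back base becomes. In this sketch, `stm` is a hypothetical helper; it relies on the rule that the lowest-numbered register always goes to the lowest address, whichever direction the base moves.

```python
def stm(mode, base, values, word=4):
    """Model STM<mode> base!, {registers}.

    `values` holds the register contents, lowest-numbered register first.
    Returns (stores, new_base), where `stores` maps address -> value.
    """
    count = len(values)
    if mode == "IA":      # increment after
        start, new_base = base, base + word * count
    elif mode == "IB":    # increment before
        start, new_base = base + word, base + word * count
    elif mode == "DA":    # decrement after
        start, new_base = base - word * (count - 1), base - word * count
    elif mode == "DB":    # decrement before
        start, new_base = base - word * count, base - word * count
    else:
        raise ValueError("unknown mode: " + mode)
    stores = {start + word * i: value for i, value in enumerate(values)}
    return stores, new_base
```

With a base of 0x100c and three registers, the increment modes leave the base at 0x1018 and the decrement modes leave it at 0x1000, as in the figure.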
The mapping between the stack and block copy views

                       Ascending             Descending
                       Full       Empty      Full       Empty
Increment   Before     STMIB                            LDMIB
                       STMFA                            LDMED
            After                 STMIA      LDMIA
                                  STMEA      LDMFD
Decrement   Before                LDMDB      STMDB
                                  LDMEA      STMFD
            After      LDMDA                            STMDA
                       LDMFA                            STMED
Control flow instructions

Branch  Interpretation     Normal uses
B       Unconditional      Always take this branch
BAL     Always             Always take this branch
BEQ     Equal              Comparison equal or zero result
BNE     Not equal          Comparison not equal or non-zero result
BPL     Plus               Result positive or zero
BMI     Minus              Result minus or negative
BCC     Carry clear        Arithmetic operation did not give carry-out
BLO     Lower              Unsigned comparison gave lower
BCS     Carry set          Arithmetic operation gave carry-out
BHS     Higher or same     Unsigned comparison gave higher or same
BVC     Overflow clear     Signed integer operation; no overflow occurred
BVS     Overflow set       Signed integer operation; overflow occurred
BGT     Greater than       Signed integer comparison gave greater than
BGE     Greater or equal   Signed integer comparison gave greater or equal
BLT     Less than          Signed integer comparison gave less than
BLE     Less or equal      Signed integer comparison gave less than or equal
BHI     Higher             Unsigned comparison gave higher
BLS     Lower or same      Unsigned comparison gave lower or same
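Each condition in the table is a simple test on the N, Z, C, and V flags. The sketch below pairs a model of CMP with a lookup of those tests; the flag expressions are the standard ARM definitions rather than something spelled out on the slide, and the names `cmp_flags`, `CONDITIONS`, and `condition_passes` are illustrative.

```python
MASK = 0xFFFFFFFF


def cmp_flags(a, b):
    """Flags (N, Z, C, V) produced by CMP a, b, i.e. by computing a - b."""
    a &= MASK
    b &= MASK
    result = (a - b) & MASK
    n = result >> 31
    z = int(result == 0)
    c = int(a >= b)  # C is set when the subtraction needs no borrow
    sign_a, sign_b = a >> 31, b >> 31
    v = int(sign_a != sign_b and n != sign_a)  # signed overflow
    return n, z, c, v


CONDITIONS = {
    "EQ": lambda n, z, c, v: z == 1,
    "NE": lambda n, z, c, v: z == 0,
    "CS": lambda n, z, c, v: c == 1,
    "HS": lambda n, z, c, v: c == 1,
    "CC": lambda n, z, c, v: c == 0,
    "LO": lambda n, z, c, v: c == 0,
    "MI": lambda n, z, c, v: n == 1,
    "PL": lambda n, z, c, v: n == 0,
    "VS": lambda n, z, c, v: v == 1,
    "VC": lambda n, z, c, v: v == 0,
    "HI": lambda n, z, c, v: c == 1 and z == 0,
    "LS": lambda n, z, c, v: c == 0 or z == 1,
    "GE": lambda n, z, c, v: n == v,
    "LT": lambda n, z, c, v: n != v,
    "GT": lambda n, z, c, v: z == 0 and n == v,
    "LE": lambda n, z, c, v: z == 1 or n != v,
    "AL": lambda n, z, c, v: True,
}


def condition_passes(cond, flags):
    """True if a branch with condition `cond` would be taken."""
    return CONDITIONS[cond](*flags)
```

Comparing 0xFFFFFFFF with 1 shows why signed and unsigned conditions are separate: HI passes (a large unsigned value) and LT also passes (the same bits read as -1).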
Conditional execution
Conditional execution to avoid branch instructions
  used to skip a small number of non-branch instructions
Example
        CMP r0, #5
        BEQ BYPASS          ; if (r0 != 5) {
        ADD r1, r1, r0      ;   r1 := r1 + r0 - r2
        SUB r1, r1, r2      ; }
BYPASS: ...
With conditional execution
        CMP r0, #5
        ADDNE r1, r1, r0
        SUBNE r1, r1, r2
Note: add the 2-letter condition after the 3-letter opcode
        ; if ((a==b) && (c==d)) e++;
        CMP r0, r1
        CMPEQ r2, r3
        ADDEQ r4, r4, #1    ; e++ (assuming e is held in r4)
Branch and link instructions

Branch to subroutine (r14 serves as a link register)
        BL SUBR             ; branch to SUBR
        ..                  ; return here
SUBR:   ..                  ; SUBR entry point
        MOV pc, r14         ; return
Nested subroutines
        BL SUB1
        ..
SUB1:   STMFD r13!, {r0-r2,r14}    ; save work and link register
        BL SUB2
        ..
Supervisor calls
Supervisor is a program which operates at a privileged level; it can do things that a user-level program cannot do directly
  Example: send text to the display
ARM ISA includes SWI (SoftWare Interrupt)
        SWI SWI_WriteC      ; output r0[7:0]
        SWI SWI_Exit        ; return from a user program back to monitor
Jump tables
Call one of a set of subroutines depending on a value computed by the program
Using a chain of comparisons
        BL JTAB
        ...
JTAB:   CMP r0, #0
        BEQ SUB0
        CMP r0, #1
        BEQ SUB1
        CMP r0, #2
        BEQ SUB2
Note: slow when the list is long and all subroutines are equally frequent
Using a table of addresses
        BL JTAB
        ...
JTAB:   ADR r1, SUBTAB
        CMP r0, #SUBMAX     ; overrun?
        LDRLS pc, [r1, r0, LSL #2]
        B ERROR
SUBTAB: DCD SUB0
        DCD SUB1
        DCD SUB2
Hello ARM World!

        AREA HelloW, CODE, READONLY    ; declare code area
SWI_WriteC  EQU &0              ; output character in r0
SWI_Exit    EQU &11             ; finish program
        ENTRY                   ; code entry point
START:  ADR r1, TEXT            ; r1 <- "Hello ARM World!"
LOOP:   LDRB r0, [r1], #1       ; get the next byte
        CMP r0, #0              ; check for text end
        SWINE SWI_WriteC        ; if not end of string, print
        BNE LOOP
        SWI SWI_Exit            ; end of execution
TEXT    = "Hello ARM World!", &0a, &0d, 0
ARM Organization and Implementation
Aleksandar Milenkovic
E-mail: milenka@ece.uah.edu
Web: http://www.ece.uah.edu/~milenka
Outline
ARM Architecture
ARM Instruction Set
Architectural Support for High-level Languages
Architectural Support for System Development
ARM Processor Cores
Memory Hierarchy
ARM CPU Cores
ARM organization
Register file
  2 read ports, 1 write port + 1 read, 1 write port reserved for r15 (pc)
Barrel shifter: shifts or rotates one operand by any number of bits
ALU: performs the arithmetic and logic functions required
Memory address register + incrementer
Memory data registers
Instruction decoder and associated control logic

[Figure: ARM datapath. Address register and incrementer drive A[31:0]; register bank with PC; multiply register; A bus, B bus, and ALU bus; barrel shifter; ALU; data out register and data in register on D[31:0]; instruction decode & control]
Three-stage pipeline
Fetch
  the instruction is fetched from memory and placed in the instruction pipeline
Decode
  the instruction is decoded and the datapath control signals prepared for the next cycle; in this stage the instruction owns the decode logic but not the datapath
Execute
  the instruction owns the datapath; the register bank is read, an operand shifted, the ALU result generated and written back into a destination register
ARM single-cycle instruction pipeline
[Figure: instructions 1, 2, and 3 each pass through fetch, decode, and execute, each starting one cycle after the previous one, plotted against time]

ARM single-cycle instruction pipeline
[Figure: the same timing for add r0,r1,#5, then sub r2,r3,r6, then cmp r2,#3; each is fetched, decoded, and executed in successive cycles, one cycle behind the previous instruction]
ARM multi-cycle instruction pipeline
Decode logic is always generating the control signals for the datapath to use in the next cycle
[Figure: instructions 1 to 5 (ADD, STR, ADD, ADD, ADD) plotted against time; each ADD passes through fetch, decode, and execute, while the STR passes through fetch, decode, calc. addr., and data xfer, occupying the datapath for two cycles]
ARM multi-cycle LDMIA (load multiple) instruction
[Figure: ldmia r0,{r2,r3} is fetched, decoded, then executes for two cycles (ex ld r2, ex ld r3); sub r2,r3,r6 is fetched, then waits before decode and ex sub; cmp r2,#3 is fetched later, then decode and ex cmp]
Decode stage occupied since ldmia must continue to remember the decoded instruction
Instruction delayed: sub fetched at normal time but not decoded until LDMIA is finishing
Control stalls: due to branches

Branches often introduce stalls (branch penalty)
  Stall time may depend on whether the branch is taken
May have to squash instructions that already started executing
Don't know what to fetch until the condition is evaluated
ARM pipelined branch

Decision not made until the third clock cycle
Two cycles of work thrown away if bne is taken
[Figure: bne foo is fetched, decoded, and occupies execute for three cycles (ex bne); sub r2,r3,r6 is fetched and decoded; foo: add r0,r1,r2 is then fetched, decoded, and executed (ex add)]
Pipeline: how it works

All instructions occupy the datapath for one or more adjacent cycles
For each cycle that an instruction occupies the datapath, it occupies the decode logic in the immediately preceding cycle
During the first datapath cycle each instruction issues a fetch for the next instruction but one
Branch instructions flush and refill the instruction pipeline
ARM9TDMI 5-stage pipeline
Fetch
Decode
  instruction is decoded
  register operands read (3 read ports)
Execute
  an operand is shifted and the ALU result generated, or an address is computed
Buffer/data
  data memory is accessed (load, store)
Write-back
  write to register file

[Figure: ARM9TDMI pipeline organization. Fetch: next pc, +4, I-cache, pc + 4. Decode: instruction decode, register read, immediate fields, r15 = pc + 8. Execute: mul, shift, ALU, forwarding paths, pre-index and post-index +4 for LDM/STM, with B, BL, MOV pc, and SUBS pc feeding next pc. Buffer/data: load/store address, byte repl., D-cache, rot/sgn ex, with LDR pc feeding next pc. Write-back: register write]
ARM9TDMI Data Forwarding
Data Forwarding
  ADD r3, r2, r1, LSL #3    r3 := r2 + 8 x r1
  ...                       r5 := r5 + 2 x r3

  ADD r3, r2, r1, LSL #3    r3 := r2 + 8 x r1
  ADD r8, r9, r10           r8 := r9 + r10
  ...                       r5 := r5 + 2^r2 x r3
Stall?
  LDR r3, [r2]              r3 := mem[r2]
  ADD r1, r2, r3            r1 := r2 + r3

[Figure: ARM9TDMI pipeline organization, repeated from the 5-stage pipeline slide, with the forwarding paths around the ALU highlighted]
ARM9TDMI PC generation
PC behavior:
3-stage pipeline
  operands are read in execution stage
  r15 = PC + 8
5-stage pipeline
  operands are read in decode stage and r15 = PC + 4?
  incompatibilities between 3-stage and 5-stage implementations => unacceptable
  to avoid this, 5-stage pipeline ARMs emulate the behavior of the older 3-stage designs

[Figure: ARM9TDMI pipeline organization, repeated from the 5-stage pipeline slide, showing the pc + 4 and pc + 8 paths into r15]
Data processing instruction datapath activity (Ex)
Reg-Reg
  Rd = Rn op Rm
  r15 = AR + 4
  AR = AR + 4
Reg-Imm
  Rd = Rn op Imm
  r15 = AR + 4
  AR = AR + 4

[Figure: datapath (address register, increment, registers with Rd, Rn, Rm, PC, mult, shifter and ALU set as the instruction requires, data out, data in, i. pipe) for (a) register-register operations and (b) register-immediate operations, where the immediate comes from instruction bits [7:0]]
STR (store register) datapath activity (Ex1, Ex2)
Compute address (Ex1)
  AR = Rn op Disp
  r15 = AR + 4
Store data (Ex2)
  AR = PC
  mem[AR] = Rd<x:y>
  If autoindexing => Rn = Rn +/- 4

[Figure: datapath (address register, increment, registers with Rn, Rd, PC, mult, shifter, ALU, data out, data in, i. pipe) for (a) 1st cycle: compute address, with the displacement from instruction bits [11:0], shifter set to lsl #0, and ALU computing A + B or A - B, and (b) 2nd cycle: store data & auto-index, with byte replication if a byte is stored]
The first two (of three) cycles of a branch instruction
Compute target address
  AR = PC + Disp, lsl #2
Save return address (if required)
  r14 = PC
  AR = AR + 4
Third cycle: do a small correction to the value stored in the link register in order that it points directly at the instruction which follows the branch?

[Figure: datapath (address register, increment, registers with R14 and PC, mult, shifter, ALU, data out, data in, i. pipe) for (a) 1st cycle: compute branch target, with the displacement from instruction bits [23:0], shifter set to lsl #2, and ALU computing A + B, and (b) 2nd cycle: save return address, with ALU passing A]
ARM Implementation
Datapath
RTL (Register Transfer Level)
Control unit
FSM (Finite State Machine)
2-phase non-overlapping clock scheme

Most ARMs do not operate on edge-sensitive registers
Instead the design is based around 2-phase non-overlapping clocks which are generated internally from a single clock signal
Data movement is controlled by passing the data alternately through latches which are open during phase 1 and latches which are open during phase 2
[Figure: phase 1 and phase 2 clock waveforms over 1 clock cycle]
ARM datapath timing

Register read
  Register read buses are dynamic, precharged during phase 2
  During phase 1 selected registers discharge the read buses, which become valid early in phase 1
Shift operation
  second operand passes through the barrel shifter
ALU operation
  ALU has input latches which are open in phase 1, allowing the operands to begin combining in the ALU as soon as they are valid, but they close at the end of phase 1 so that the phase 2 precharge does not get through to the ALU
  ALU processes the operands during phase 2, producing the valid output towards the end of the phase
  the result is latched in the destination register at the end of phase 2
ARM datapath timing (cont'd)

[Figure: timing across phase 1 and phase 2: register read time, read bus valid, shift time, shift out valid, ALU operands latched, ALU time, ALU out, register write time; the phase 2 precharge invalidates the buses]
Minimum Datapath Delay = Register read time + Shifter Delay + ALU Delay + Register write set-up time + Phase 2 to phase 1 non-overlap time
The original ARM1 ripple-carry adder

Carry logic: use CMOS AOI (And-Or-Invert) gate
Even bits use the circuit shown below
Odd bits use the dual circuit with inverted inputs and outputs and AND and OR gates swapped around
Worst case path: 32 gates long
[Figure: one-bit adder cell with inputs A, B, Cin and outputs sum, Cout]
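A ripple-carry adder is a chain of one-bit full adders in which each carry-out feeds the next carry-in, which is why the worst-case path grows with the word width. The sketch below is a bit-level model only; it says nothing about the AOI gate implementation, and the function names are illustrative.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum bit, carry out)."""
    sum_bit = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a | b))
    return sum_bit, carry_out


def ripple_carry_add(a, b, carry_in=0, width=32):
    """Add two `width`-bit numbers one bit at a time, rippling the carry."""
    total = 0
    carry = carry_in
    for bit in range(width):
        sum_bit, carry = full_adder((a >> bit) & 1, (b >> bit) & 1, carry)
        total |= sum_bit << bit
    return total, carry
```

Adding 1 to 0xFFFFFFFF exercises the worst case: the carry generated at bit 0 has to pass through all 32 stages before the final carry-out is known.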
ARM2 4-bit carry look-ahead scheme

Carry Generate (G)
Carry Propagate (P)
Cout[3] = Cin[0].P + G
Use AOI and alternate AND/OR gates
Worst case: 8 gates long
[Figure: 4-bit adder logic block with inputs A[3:0], B[3:0], Cin[0] and outputs sum[3:0], G, P, Cout[3]]
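The look-ahead equation lets the carry skip a whole 4-bit group: the group's G and P depend only on its own operand bits, so the carry into the next group needs just one AND and one OR. The sketch below models that at the bit level; `group_generate_propagate` and `lookahead_add` are hypothetical names, and the per-bit propagate is taken as a OR b, which is one common convention.

```python
def group_generate_propagate(a, b):
    """Group generate G and propagate P for two 4-bit operands."""
    generate, propagate = 0, 1
    for bit in range(4):
        a_bit, b_bit = (a >> bit) & 1, (b >> bit) & 1
        bit_generate = a_bit & b_bit
        bit_propagate = a_bit | b_bit
        generate = bit_generate | (bit_propagate & generate)
        propagate &= bit_propagate
    return generate, propagate


def lookahead_add(a, b, carry_in=0, width=32):
    """Add using 4-bit groups with Cout[3] = Cin[0].P + G between groups."""
    total = 0
    carry = carry_in
    for start in range(0, width, 4):
        group_a = (a >> start) & 0xF
        group_b = (b >> start) & 0xF
        total |= ((group_a + group_b + carry) & 0xF) << start
        generate, propagate = group_generate_propagate(group_a, group_b)
        carry = generate | (propagate & carry)
    return total, carry
```

A 32-bit add then crosses eight group boundaries instead of 32 bit boundaries, in line with the worst case of 8 gates quoted above.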
The ARM2 ALU logic for one result bit

ALU functions
  data operations (add, sub, ...)
  address computations for memory accesses
  branch target computations
  bit-wise logical operations
[Figure: logic for one result bit, with function select lines fs0-fs5, NA bus and NB bus inputs, G and P signals, carry logic, and the ALU bus output]
ARM2 ALU function codes

fs5  fs4  fs3  fs2  fs1  fs0   ALU output
0    0    0    1    0    0     A and B
0    0    1    0    0    0     A and not B
0    0    1    0    0    1     A xor B
0    1    1    0    0    1     A plus not B plus carry
0    1    0    1    1    0     A plus B plus carry
1    1    0    1    1    0     not A plus B plus carry
0    0    0    0    0    0     A
0    0    0    0    0    1     A or B
0    0    0    1    0    1     B
0    0    1    0    1    0     not B
The ARM6 carry-select adder scheme

Compute sums of various fields of the word for carry-in of zero and carry-in of one
Final result is selected by using the correct carry-in value to control a multiplexor
Worst case: O(log2[word width]) gates long
Note: Be careful! Fan-out on some of these gates is high so direct comparison with previous schemes is not applicable.
[Figure: adders on fields a,b[3:0] through a,b[31:28] produce both s and s+1 (+, +1); multiplexers controlled by the carry c select sum[3:0], sum[7:4], sum[15:8], and sum[31:16]]
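The carry-select idea is to precompute both possible sums for each field and let the incoming carry pick one. The sketch below uses the field boundaries visible in the figure (4, 4, 8, and 16 bits); `carry_select_add` is a hypothetical name, and each field's internal addition is done with ordinary integer arithmetic rather than modelled gate by gate.

```python
def carry_select_add(a, b, carry_in=0,
                     fields=((0, 4), (4, 4), (8, 8), (16, 16))):
    """Add by fields; each field computes sum and sum+1, the carry selects.

    `fields` lists (start bit, width) pairs, lowest field first.
    """
    total = 0
    carry = carry_in
    for start, width in fields:
        mask = (1 << width) - 1
        field_a = (a >> start) & mask
        field_b = (b >> start) & mask
        sum_if_carry_0 = field_a + field_b       # computed in parallel
        sum_if_carry_1 = field_a + field_b + 1   # computed in parallel
        chosen = sum_if_carry_1 if carry else sum_if_carry_0  # the multiplexor
        total |= (chosen & mask) << start
        carry = chosen >> width
    return total, carry
```

Because the field widths double, the number of selection steps grows with the logarithm of the word width rather than with the width itself.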
The ARM6 ALU organization

Not easy to merge the arithmetic and logic functions => a separate logic unit runs in parallel with the adder, and a multiplexor selects the output
[Figure: A operand latch and B operand latch; invert A and invert B XOR gates; C in; adder and logic functions in parallel; logic/arithmetic result mux controlled by function; zero detect; outputs result, C, V, N, Z]
ARM9 carry arbitration encoding

Carry arbitration adder
  v_i = a_i + b_i
  w_i = a_i b_i
One-bit encoding
  a_i  b_i   C_i       v_i, w_i
  0    0     0         0, 0
  1    1     1         1, 1
  0    1     unknown   1, 0
  1    0     unknown   1, 0
[Table: two-bit carry arbitration over a_i, b_i and a_i-1, b_i-1, giving C_i and v_i, w_i]
The cross-bar switch barrel shifter

Shifter delay is critical since it contributes directly to the datapath cycle time
Cross-bar switch matrix (32 x 32)
Principle for 4x4 matrix
[Figure: 4x4 switch matrix connecting in[0]-in[3] to out[0]-out[3], with diagonals labelled no shift, right 1, right 2, right 3, left 1, left 2, left 3]

The cross-bar switch barrel shifter (cont'd)

Precharged logic is used => each switch is a single NMOS transistor
Precharging sets all outputs to logic 0, so those which are not connected to any input during switching remain at 0, giving the zero filling required by the shift semantics
For rotate right, the right shift diagonal is enabled + the complementary shift left diagonal (e.g., right 1 + left 3)
Arithmetic shift right: use sign-extension => separate logic is used to decode the shift amount and discharge those outputs appropriately
Multiplier design
All ARMs apart from the first prototype have included support for integer multiplication
  older ARM cores include low-cost multiplication hardware that supports only the 32-bit result multiply and multiply-accumulate
  recent ARM cores have high-performance multiplication hardware and support 64-bit result multiply and multiply-accumulate
Low cost implementation
  Use the datapath iteratively, employing the barrel shifter and ALU to generate a 2-bit product in each clock cycle
  use early termination to stop the iterations when there are no more ones in the multiply register
The 2-bit multiplication algorithm, Nth cycle

Control settings for the Nth cycle of the multiplication
Use existing shifter and ALU + additional hardware
  dedicated two-bits-per-cycle shift register for the multiplier and a few gates for the Booth's algorithm control logic (overhead is a few per cent on the area of the ARM core)

Carry-in  Multiplier  Shift           ALU     Carry-out
0         x0          LSL #2N         A + 0   0
0         x1          LSL #2N         A + B   0
0         x2          LSL #(2N + 1)   A - B   1
0         x3          LSL #2N         A - B   1
1         x0          LSL #2N         A + B   0
1         x1          LSL #(2N + 1)   A + B   0
1         x2          LSL #2N         A - B   1
1         x3          LSL #2N         A + 0   1
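The control settings can be exercised in software to see that they really do produce a product. The sketch below is a model under stated assumptions: `BOOTH_TABLE` restates the shift and ALU settings per (carry-in, multiplier digit) pair, the carry-out values are derived from the arithmetic (a digit handled by subtraction hands a carry to the next digit), and `multiply_2bit` is a hypothetical name. It also models early termination and an optional accumulate value.

```python
MASK = 0xFFFFFFFF

# (carry_in, 2-bit multiplier digit) -> (extra shift, ALU sign, carry_out)
# ALU sign: +1 means A + B, -1 means A - B, 0 means A + 0.
BOOTH_TABLE = {
    (0, 0): (0, 0, 0),
    (0, 1): (0, +1, 0),
    (0, 2): (1, -1, 1),
    (0, 3): (0, -1, 1),
    (1, 0): (0, +1, 0),
    (1, 1): (1, +1, 0),
    (1, 2): (0, -1, 1),
    (1, 3): (0, 0, 1),
}


def multiply_2bit(b, multiplier, accumulate=0):
    """Return (low 32 bits of b * multiplier + accumulate, cycles used)."""
    a = accumulate & MASK
    carry = 0
    cycles = 0
    for n in range(16):
        remaining = (multiplier & MASK) >> (2 * n)
        if remaining == 0 and carry == 0:
            break  # early termination: no more ones in the multiplier
        extra_shift, sign, carry = BOOTH_TABLE[(carry, remaining & 3)]
        a = (a + sign * (b << (2 * n + extra_shift))) & MASK
        cycles += 1
    return a, cycles
```

Small multipliers finish quickly: `multiply_2bit(7, 5)` needs two cycles, while a multiplier with ones in its top bits needs all sixteen.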
High speed multiplication

Where multiplication performance is very important, more hardware resources must be dedicated
  in some embedded systems the ARM core is used to perform real-time digital signal processing (DSP)
  DSP programs are typically multiplication intensive
Use intermediate results which include partial sums and partial carries
  Carry-save adders are used for this
These two binary results are added together at the end of multiplication
  The main ALU is used for this
Carry-propagate (a) and carry-save (b) adder structures

Carry propagate adder takes two conventional (irredundant) binary numbers as inputs and produces a binary sum
Carry save adder takes one binary and one redundant (partial sum and partial carry) input and produces a sum in redundant binary representation (sum and carry)
[Figure: (a) a chain of full adders with each Cout feeding the next Cin; (b) a row of independent full adders, each with inputs A, B, Cin and outputs S, Cout]
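A carry-save adder keeps the running total as two words, a partial sum and a partial carry, so no carry has to ripple during accumulation. The following is a bit-parallel sketch with hypothetical names `carry_save_add` and `sum_many`; only the final step performs a conventional carry-propagate addition, which is the role the slides give to the main ALU.

```python
MASK = 0xFFFFFFFF


def carry_save_add(value, partial_sum, partial_carry):
    """Combine one binary input with a redundant (sum, carry) pair.

    Every bit position is an independent full adder; carries move up one
    place but do not propagate any further.
    """
    new_sum = value ^ partial_sum ^ partial_carry
    new_carry = ((value & partial_sum)
                 | (value & partial_carry)
                 | (partial_sum & partial_carry)) << 1
    return new_sum & MASK, new_carry & MASK


def sum_many(values):
    """Accumulate values in carry-save form, then resolve the pair once."""
    partial_sum, partial_carry = 0, 0
    for value in values:
        partial_sum, partial_carry = carry_save_add(
            value & MASK, partial_sum, partial_carry)
    return (partial_sum + partial_carry) & MASK  # one carry-propagate add
```

At every step the true total equals the partial sum plus the partial carry modulo 2^32, which is the invariant that makes the single final addition sufficient.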
ARM high-speed multiplier organization

CSA has 4 layers of adders each handling 2 multiplier bits => multiply 8 bits per clock cycle
Partial sum and carry are cleared at the beginning or initialized to accumulate a value
Multiplier is shifted right 8 bits per cycle in the Rs register
Partial sum and carry are rotated right 8 bits per cycle
Performance: up to 4 clock cycles (early termination is possible)
Complexity: 160 bits in shift registers, 128 bits of carry-save adder logic (up to 10% of simpler cores)

[Figure: high-speed multiplier organization. Registers Rm and Rs (Rs >> 8 bits/cycle); initialization for MLA; carry-save adders producing partial sum and partial carry; rotate sum and carry 8 bits/cycle; ALU (add partials)]
ARM2 register cell circuit
[Figure: register cell with a write line from the ALU bus and read A / read B lines onto the A bus and B bus]

ARM register bank floorplan
[Figure: A bus read decoders, B bus read decoders, and write decoders across the top; Vdd and Vss rails; register cells and the PC cell, connected to the ALU bus, PC bus, INC bus, A bus, and B bus]

ARM core datapath buses
[Figure: address register and incrementer (Ad, PC, inc); register bank; multiplier; shifter (shift out); ALU; B bus; instruction pipe and data in (Din, instruction); data out]

ARM control logic structure
[Figure: the instruction and coprocessor signals feed the decode PLA, together with the cycle count; the PLA drives address control, register control, ALU control, shifter control, multiply control, and load/store multiple logic]
