Introduction &
Instruction Set Architecture
Outline
ARM Architecture
ARM Organization and Implementation
ARM Instruction Set
Thumb Instruction Set
Architectural Support for System Development
ARM Processor Cores
Memory Hierarchy
Architectural Support for Operating Systems
ARM CPU Cores
Embedded ARM Applications
ARM History
ARM: Acorn RISC Machine (1983-1985)
Acorn Computers Limited, Cambridge, England
Registers visible in user mode: r0-r12, r13, r14, r15 (PC), CPSR
Banked registers by mode:
  fiq mode:        r8_fiq-r12_fiq, r13_fiq, r14_fiq, SPSR_fiq
  svc mode:        r13_svc, r14_svc, SPSR_svc
  abort mode:      r13_abt, r14_abt, SPSR_abt
  irq mode:        r13_irq, r14_irq, SPSR_irq
  undefined mode:  r13_und, r14_und, SPSR_und
CPSR format (bit 31 down to bit 0)
  bits 31-28:  N Z C V
  bits 27-8:   unused
  bits 7-6:    I F (interrupt enables)
  bit 5:       T
  bits 4-0:    mode
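As a minimal sketch of the CPSR layout (NZCV in bits 31-28, I and F in bits 7 and 6, T in bit 5, mode in bits 4-0), the Python function below unpacks the fields from a 32-bit value. The names `decode_cpsr` and `MODES` are illustrative assumptions, and the mode encodings are the standard ARM values rather than something shown on the slide.

```python
# Mode encodings: standard ARM values, assumed here (not listed on the slide).
MODES = {
    0b10000: "user",
    0b10001: "fiq",
    0b10010: "irq",
    0b10011: "svc",
    0b10111: "abort",
    0b11011: "undefined",
    0b11111: "system",
}


def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into its named fields."""
    return {
        "N": (cpsr >> 31) & 1,   # negative
        "Z": (cpsr >> 30) & 1,   # zero
        "C": (cpsr >> 29) & 1,   # carry
        "V": (cpsr >> 28) & 1,   # overflow
        "I": (cpsr >> 7) & 1,    # IRQ control bit
        "F": (cpsr >> 6) & 1,    # FIQ control bit
        "T": (cpsr >> 5) & 1,    # Thumb state bit
        "mode": MODES.get(cpsr & 0x1F, "unknown"),
    }
```

For example, `decode_cpsr(0x600000D3)` reports Z and C set, I and F set, T clear, and svc mode.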
[Figure: ARM memory organization. Byte addresses 0 to 23 are shown, holding word16, half-word14, half-word12, word8, byte6, half-word4 and individual bytes.]
Instructions
Data Processing: use and change only register values
Data Transfer: copy memory values into registers (load) or copy register values into memory (store)
Control Flow
o branch
o branch-and-link: save return address to resume the original sequence
o trapping into system code: supervisor calls
I/O system
I/O is memory mapped
internal registers of peripherals (disk controllers, network interfaces, etc.) are addressable locations within the ARM's memory map and may be read and written using the load-store instructions
ARM exceptions
ARM supports a range of interrupts, traps, and supervisor calls; all are grouped under the general heading of exceptions
Handling exceptions
  current state is saved by copying the PC into r14_exc and the CPSR into SPSR_exc (exc stands for the exception type)
  processor operating mode is changed to the appropriate exception mode
  PC is forced to a value between 0x00 and 0x1C, the particular value depending on the type of exception
  the instruction at the location the PC is forced to (the vector address) usually contains a branch to the exception handler; the exception handler will use r13_exc, which is normally initialized to point to a dedicated stack in memory, to save some user registers
  return: restore the user registers and then restore the PC and CPSR atomically
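The exception entry and return steps can be sketched as a small Python model. This is a simplified illustration, not the hardware behavior in full: it ignores the PC adjustment and interrupt masking that real cores perform, and it keeps the mode as a separate string instead of CPSR bits 4-0, so the saved SPSR is a (cpsr, mode) pair. The vector addresses and the exception-to-mode mapping are the standard ARM values, assumed here because the slide only gives the range 0x00 to 0x1C; all function and key names are hypothetical.

```python
# Standard ARM vector addresses and target modes (assumed, not from the slide).
VECTORS = {
    "reset": 0x00, "undefined": 0x04, "swi": 0x08,
    "prefetch_abort": 0x0C, "data_abort": 0x10,
    "irq": 0x18, "fiq": 0x1C,
}
EXC_MODE = {
    "reset": "svc", "undefined": "und", "swi": "svc",
    "prefetch_abort": "abt", "data_abort": "abt",
    "irq": "irq", "fiq": "fiq",
}


def enter_exception(state, kind):
    """Return the state after taking an exception of the given kind."""
    exc = EXC_MODE[kind]
    new = dict(state)
    new["r14_" + exc] = state["pc"]                      # save return address
    new["spsr_" + exc] = (state["cpsr"], state["mode"])  # save status
    new["mode"] = exc                                    # switch mode
    new["pc"] = VECTORS[kind]                            # force PC to vector
    return new


def return_from_exception(state):
    """Restore PC and status together, as the slide's 'atomically' requires."""
    exc = state["mode"]
    new = dict(state)
    new["pc"] = state["r14_" + exc]
    new["cpsr"], new["mode"] = state["spsr_" + exc]
    return new
```

Taking an IRQ from user mode at address 0x8000 leaves `pc == 0x18` and `r14_irq == 0x8000`; returning restores the original PC, CPSR and mode in one step.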
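[Figure: structure of the ARM cross-development toolkit. C source goes through the C compiler and asm source through the assembler to produce .aof object files; the linker combines these with the object libraries (including the C libraries) into an .axf image, which is debugged with ARMsd on either the ARMulator (system model) or a development board.]
Cross-development: tools run on a different architecture from the one for which they produce code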
Outline
ARM Architecture
ARM Assembly Language Programming
ARM Organization and Implementation
ARM Instruction Set
Architectural Support for High-level Languages
Thumb Instruction Set
Architectural Support for System Development
ARM Processor Cores
Memory Hierarchy
Architectural Support for Operating Systems
ARM CPU Cores
Embedded ARM Applications
Arithmetic operations
Bit-wise logical operations
Register-movement operations
Comparison operations
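Arithmetic Operations
  ADD r0, r1, r2    ; r0 := r1 + r2
  ADC r0, r1, r2    ; r0 := r1 + r2 + C
  SUB r0, r1, r2    ; r0 := r1 - r2
  SBC r0, r1, r2    ; r0 := r1 - r2 + C - 1
  RSB r0, r1, r2    ; r0 := r2 - r1
  RSC r0, r1, r2    ; r0 := r2 - r1 + C - 1
Bit-wise Logical Operations
  ORR r0, r1, r2    ; r0 := r1 or r2
  EOR r0, r1, r2    ; r0 := r1 xor r2
  BIC r0, r1, r2    ; r0 := r1 and (not) r2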
Register Movement
  MOV r0, r2    ; r0 := r2
  MVN r0, r2    ; r0 := not r2
Comparison Operations
  CMP r1, r2    ; set cc on r1 - r2
  CMN r1, r2    ; set cc on r1 + r2
  TST r1, r2    ; set cc on r1 and r2
  TEQ r1, r2    ; set cc on r1 xor r2
r3 := r3 + 3
r5 := r5 + 2^r2 x r3
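[Figure: ARM shift operations on a 32-bit operand (bits 31 to 0): LSL #5 and LSR #5 shift in 00000; an arithmetic right shift by 5 fills with 00000 for a positive operand (sign bit 0) and 11111 for a negative operand (sign bit 1); ROR #5; RRX.]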
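The shift operations in the figure can be sketched in Python on 32-bit values. The function names simply mirror the ARM mnemonics; `rrx` takes the incoming carry flag as an explicit argument and returns the new carry, which is an interface chosen for this illustration only.

```python
MASK = 0xFFFFFFFF  # keep every result within 32 bits


def lsl(x, n):
    """Logical shift left: zeros enter at the bottom."""
    return (x << n) & MASK


def lsr(x, n):
    """Logical shift right: zeros enter at the top."""
    return (x & MASK) >> n


def asr(x, n):
    """Arithmetic shift right: the sign bit is replicated at the top."""
    x &= MASK
    if x & 0x80000000:
        x -= 1 << 32          # reinterpret as a negative Python int
    return (x >> n) & MASK


def ror(x, n):
    """Rotate right: bits leaving the bottom re-enter at the top."""
    n %= 32
    x &= MASK
    return ((x >> n) | (x << (32 - n))) & MASK


def rrx(x, carry):
    """Rotate right extended by one bit through the carry flag.

    Returns (result, new_carry)."""
    x &= MASK
    return ((carry & 1) << 31) | (x >> 1), x & 1
```

With these helpers, the shifted-register operand `r5 := r5 + 2^r2 x r3` is `(r5 + lsl(r3, r2)) & MASK`.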
Multiplies
Example (Multiply, Multiply-Accumulate)
  MUL r4, r3, r2        ; r4 := [r3 x r2]<31:0>
  MLA r4, r3, r2, r1    ; r4 := [r3 x r2 + r1]<31:0>
  LDR  r0, [r1]         ; r0 := mem32[r1]
  STR  r0, [r1]         ; mem32[r1] := r0
  LDRB r0, [r1]         ; r0 := mem8[r1]
Base+offset addressing (offset of up to 4 Kbytes)
  LDR  r0, [r1, #4]     ; r0 := mem32[r1 + 4]
Auto-indexing addressing
  LDR  r0, [r1, #4]!    ; r0 := mem32[r1 + 4]
                        ; r1 := r1 + 4
Post-indexed addressing
  LDR  r0, [r1], #4     ; r0 := mem32[r1]
                        ; r1 := r1 + 4
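The three addressing forms differ only in which address is used and whether the base register is updated. A minimal Python sketch follows, assuming registers are held in a dict keyed by name and memory in a dict keyed by word address; the `ldr` helper and its `mode` argument are hypothetical names for illustration.

```python
def ldr(regs, mem, rd, rn, offset=0, mode="offset"):
    """Model LDR rd, [rn, #offset] in three addressing forms.

    mode="offset": base+offset, base unchanged      LDR r0, [r1, #4]
    mode="pre":    base+offset, base written back   LDR r0, [r1, #4]!
    mode="post":   base, then base := base+offset   LDR r0, [r1], #4
    Returns the address that was accessed.
    """
    base = regs[rn]
    if mode == "offset":
        address = base + offset
    elif mode == "pre":
        address = base + offset
        regs[rn] = address
    elif mode == "post":
        address = base
        regs[rn] = base + offset
    else:
        raise ValueError("unknown addressing mode: " + mode)
    regs[rd] = mem[address]
    return address
```

Starting from `r1 = 0x100`, the offset and auto-indexed forms both load from 0x104, but only the auto-indexed form leaves `r1 = 0x104`; the post-indexed form loads from 0x100 and then advances `r1`.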
COPY:                   ; r1 points to TABLE1
                        ; r2 points to TABLE2
LOOP:
TABLE1: ...
TABLE2: ...
  LDMIA r1, {r0, r2, r5}    ; r0 := mem32[r1]
                            ; r2 := mem32[r1 + 4]
                            ; r5 := mem32[r1 + 8]
Stack organizations
  FA: full ascending
  EA: empty ascending
  FD: full descending
  ED: empty descending
[Figure: multiple register transfer addressing modes. Four memory diagrams show where r0, r1 and r5 are placed relative to the base register r9, and the value of r9 before and after the transfer, over addresses 0x1000, 0x100c and 0x1018.]
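Stack and block-copy views of the load and store multiple instructions
  STMIB = STMFA    (increment before; full ascending)
  STMIA = STMEA    (increment after; empty ascending)
  LDMDB = LDMEA    (decrement before; empty ascending)
  LDMIA = LDMFD    (increment after; full descending)
  STMDB = STMFD    (decrement before; full descending)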
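The increment/decrement and before/after variants only change which addresses are touched and where the base register ends up. The Python sketch below computes both for a transfer of `count` registers; it relies on the ARM rule that the lowest-numbered register always goes to the lowest address. The helper name `block_addresses` is hypothetical.

```python
def block_addresses(base, count, mode):
    """Addresses used by LDM/STM for `count` registers, lowest first.

    mode is one of "IA", "IB", "DA", "DB".
    Returns (addresses, updated_base), where updated_base is the value
    written back when the base register is followed by '!'.
    """
    size = 4 * count
    if mode == "IA":      # increment after
        start, new_base = base, base + size
    elif mode == "IB":    # increment before
        start, new_base = base + 4, base + size
    elif mode == "DA":    # decrement after
        start, new_base = base - size + 4, base - size
    elif mode == "DB":    # decrement before
        start, new_base = base - size, base - size
    else:
        raise ValueError("unknown mode: " + mode)
    return [start + 4 * i for i in range(count)], new_base
```

With `base = 0x100c` and three registers, IA and IB both finish with the base at 0x1018, while DA and DB finish at 0x1000. A full descending stack pushes with DB (STMFD) and pops with IA (LDMFD), so a push followed by a pop returns the base to its starting value.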
Branch   Interpretation     Normal uses
B        Unconditional      Always take this branch
BAL      Always             Always take this branch
BEQ      Equal              Comparison equal or zero result
BNE      Not equal          Comparison not equal or non-zero result
BPL      Plus               Result positive or zero
BMI      Minus              Result minus or negative
BCC      Carry clear        Arithmetic operation did not give carry-out
BLO      Lower              Unsigned comparison gave lower
BCS      Carry set          Arithmetic operation gave carry-out
BHS      Higher or same     Unsigned comparison gave higher or same
BVC      Overflow clear     Signed integer operation; no overflow occurred
BVS      Overflow set       Signed integer operation; overflow occurred
BGT      Greater than       Signed integer comparison gave greater than
BGE      Greater or equal   Signed integer comparison gave greater or equal
BLT      Less than          Signed integer comparison gave less than
BLE      Less or equal      Signed integer comparison gave less than or equal
BHI      Higher             Unsigned comparison gave higher
BLS      Lower or same      Unsigned comparison gave lower or same
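Each branch condition is a test on the N, Z, C and V flags. As a sketch, the Python below computes the flags that `CMP a, b` would set for 32-bit operands and then evaluates a condition suffix. The flag equations are the standard ARM definitions, which the table implies but does not spell out; `cmp_flags` and `condition_holds` are hypothetical names.

```python
MASK = 0xFFFFFFFF


def cmp_flags(a, b):
    """Flags set by CMP a, b (that is, by computing a - b) on 32-bit values."""
    a &= MASK
    b &= MASK
    result = (a - b) & MASK
    return {
        "N": result >> 31,
        "Z": int(result == 0),
        "C": int(a >= b),  # carry set means no borrow occurred
        "V": ((a ^ b) & (a ^ result)) >> 31,  # signed overflow
    }


def condition_holds(cond, f):
    """Evaluate a condition suffix such as 'EQ' or 'LT' against flags f."""
    n, z, c, v = f["N"], f["Z"], f["C"], f["V"]
    table = {
        "AL": True,
        "EQ": z == 1, "NE": z == 0,
        "PL": n == 0, "MI": n == 1,
        "CC": c == 0, "LO": c == 0,
        "CS": c == 1, "HS": c == 1,
        "VC": v == 0, "VS": v == 1,
        "GT": z == 0 and n == v, "GE": n == v,
        "LT": n != v, "LE": z == 1 or n != v,
        "HI": c == 1 and z == 0, "LS": c == 0 or z == 1,
    }
    return table[cond]
```

Comparing 0xFFFFFFFF with 1 shows why both signed and unsigned conditions exist: HI holds because the unsigned value is larger, and LT also holds because the same bit pattern is -1 when read as signed.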
Conditional execution
Conditional execution is used to avoid branch instructions that skip a small number of non-branch instructions
Example
        CMP   r0, #5
        BEQ   BYPASS        ; if (r0 != 5) {
                            ;   r1 := r1 + r0 - r2
                            ; }
        CMP   r0, r1
        CMPEQ r2, r3
        ..                  ; return here
SUBR:   ..                  ; SUBR entry point
        MOV   pc, r14       ; return
Nested subroutines
        BL    SUB1
        ..
SUB1:
Supervisor calls
Supervisor is a program which operates at a privileged level; it can do things that a user-level program cannot do directly
Example: send text to the display
Jump tables
Call one of a set of subroutines depending on a value computed by the program
        BL    JTAB
        ...
JTAB:   CMP   r0, #0
        BEQ   SUB0
        CMP   r0, #1
        BEQ   SUB1
        CMP   r0, #2
        BEQ   SUB2
Note: slow when the list is long, and all subroutines are equally frequent
        BL    JTAB
        ...
JTAB:   ...
SWI_WriteC  EQU   &0            ; output character in r0
SWI_Exit    EQU   &11           ; finish program
            ENTRY
LOOP:
            CMP   r0, #0
            SWINE SWI_WriteC
            BNE   LOOP
            SWI   SWI_Exit      ; end of execution
TEXT
ARM
Organization and Implementation
Aleksandar Milenkovic
E-mail: milenka@ece.uah.edu
Web: http://www.ece.uah.edu/~milenka
ARM organization
Register file: 2 read ports, 1 write port + 1 read, 1 write port reserved for r15 (pc)
Barrel shifter: shift or rotate one operand for any number of bits
ALU: performs the arithmetic and logic functions required
Memory address register + incrementer
Memory data registers
Instruction decoder and associated control logic
[Figure: 3-stage pipeline ARM organization: address register (A[31:0]) with PC incrementer, register bank with PC, multiply register, barrel shifter, ALU, the A, B and ALU buses, data in register, instruction decode and control.]
Three-stage pipeline
Fetch: the instruction is fetched from memory and placed in the instruction pipeline
Decode: the instruction is decoded and the datapath control signals prepared for the next cycle; in this stage the instruction owns the decode logic but not the datapath
Execute: the instruction owns the datapath; the register bank is read, an operand shifted, the ALU result generated and written back into a destination register
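To make the overlap of the three stages concrete, the Python sketch below lists the cycle in which each instruction occupies each stage, assuming single-cycle instructions and no stalls. It also shows why an executing instruction sees r15 two instructions ahead of itself: while it executes, the instruction two places later is being fetched. The function names are hypothetical.

```python
STAGES = ["fetch", "decode", "execute"]


def pipeline_schedule(n_instructions):
    """Cycle (1-based) in which each instruction occupies each stage.

    Assumes single-cycle instructions and no stalls.
    """
    return [
        {stage: i + s + 1 for s, stage in enumerate(STAGES)}
        for i in range(n_instructions)
    ]


def stage_occupancy(n_instructions, cycle):
    """Map each stage to the index of the instruction in it at `cycle`."""
    occupancy = {}
    for i, slots in enumerate(pipeline_schedule(n_instructions)):
        for stage, c in slots.items():
            if c == cycle:
                occupancy[stage] = i
    return occupancy


def r15_seen_in_execute(instruction_address):
    """PC value read by an executing ARM instruction (4-byte instructions)."""
    return instruction_address + 8
```

In cycle 3 of a three-instruction sequence, instruction 0 is executing, instruction 1 is being decoded and instruction 2 is being fetched, so n instructions finish in n + 2 cycles.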
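[Figure: 3-stage pipeline with single-cycle instructions. Instructions 1, 2 and 3 each pass through fetch, decode and execute, offset from one another by one cycle along the time axis.]
[Figure: pipeline contents for the sequence add r0,r1,#5; sub r2,r3,r6; cmp r2,#3, one instruction per stage.]
[Figure: 3-stage pipeline with a multi-cycle instruction (instructions 2 to 5 shown, including fetch ADD), where an instruction occupies the execute stage for more than one cycle.]
[Figure: ldmia r0,{r2,r3} followed by sub r2,r3,r6 and cmp r2,#3. The multi-cycle ldmia holds the execute stage, so the following instructions are delayed (fetch, decode, ex sub; fetch, decode, ex cmp).]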
ARM9TDMI 5-stage pipeline
Fetch
Decode: instruction is decoded; register operands read (3 read ports)
Execute: an operand is shifted and the ALU result generated, or address is computed
Buffer/data: data memory is accessed (load, store)
Write-back: write to register file
[Figure: ARM9TDMI 5-stage pipeline organization. Fetch: next pc, +4, I-cache. Decode: instruction decode, register read, immediate fields, r15 = pc + 8. Execute: shift, ALU, mul, pre-index and post-index paths for LDM/STM, forwarding paths. Buffer/data: load/store address, D-cache, byte replication, rotate/sign extend. Write-back: register write. The next pc can come from pc + 4, B/BL, MOV pc, SUBS pc or LDR pc.]
ARM9TDMI Data Forwarding
  r8 := r9 + r10
  r5 := r5 + 2^r2 x r3
Stall?
  LDR r3, [r2]        ; r3 := mem[r2]
  ADD r1, r2, r3      ; r1 := r2 + r3
[Figure: ARM9TDMI 5-stage pipeline organization with the forwarding paths from the execute and buffer/data stages back to the ALU inputs.]
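The "Stall?" case, a load into r3 followed at once by an ADD that reads r3, can be illustrated with a small hazard check in Python. This is a sketch under a textbook assumption, not a statement of ARM9TDMI timing: forwarding is taken to cover every ALU-to-ALU dependency, and a load followed immediately by a consumer of its result is taken to cost exactly one stall cycle. The tuple format `(op, dest, sources)` and the function name are hypothetical.

```python
def load_use_stalls(program):
    """Count stall cycles under the one-cycle load-use assumption.

    program is a list of (op, dest, sources) tuples. A stall is counted
    whenever an instruction reads the register written by the load
    immediately before it; all other dependencies are assumed to be
    handled by forwarding.
    """
    stalls = 0
    for previous, current in zip(program, program[1:]):
        prev_op, prev_dest, _ = previous
        _, _, cur_sources = current
        if prev_op == "LDR" and prev_dest in cur_sources:
            stalls += 1
    return stalls
```

For the pair `LDR r3, [r2]` then `ADD r1, r2, r3` the function reports one stall, because the loaded value is not available until after the buffer/data stage. Two back-to-back ALU instructions with a dependency report none.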
ARM9TDMI PC generation
3-stage pipeline
  PC behavior: operands are read in the execution stage, so r15 = PC + 8
5-stage pipeline
  operands are read in the decode stage, so r15 = PC + 4?
  incompatibilities between 3-stage and 5-stage implementations => unacceptable
  to avoid this, 5-stage pipeline ARMs emulate the behavior of the older 3-stage designs
[Figure: ARM9TDMI 5-stage pipeline organization, showing next pc, +4, pc + 4 and pc + 8 feeding r15 in the decode stage.]
[Figure: data processing instruction datapath activity, register-register and register-immediate (Reg-Imm) cases: Rd = Rn op Imm, with the immediate taken from instruction bits [7:0]; r15 = AR + 4; AR = AR + 4.]
[Figure: store instruction datapath activity. Address calculation: AR = Rn op Disp, with the displacement taken from instruction bits [11:0]; r15 = AR + 4. Store data (Ex2): AR = PC; mem[AR] = Rd<x:y>; if autoindexing => Rn = Rn +/- 4.]
[Figure: branch instruction datapath activity: AR = PC + Disp, lsl #2, with the displacement taken from instruction bits [23:0]; if required, r14 = PC; third cycle: do a small ...; AR = AR + 4.]
ARM Implementation
Datapath: RTL (Register Transfer Level)
Control unit: FSM (Finite State Machine)
Shift operation
  second operand passes through barrel shifter
ALU operation
  ALU has input latches which are open in phase 1, allowing the operands to begin combining in the ALU as soon as they are valid, but they close at the end of phase 1 so that the phase 2 precharge does not get through to the ALU
  ALU processes the operands during phase 2, producing the valid output towards the end of the phase
  the result is latched in the destination register at the end of phase 2
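[Figure: datapath timing across the clock phases: read bus valid, shift out valid, ALU out; the phase 2 precharge invalidates the buses; ALU time and register write time.]
[Figure: adder bit cell with Cin and sum.]
[Figure: 4-bit carry look-ahead adder logic: inputs B[3:0] and Cin[0]; outputs sum[3:0], Cout[3], G and P.]
[Figure: ALU logic for one result bit, connecting the NA bus and the ALU bus.]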
fs4  fs3  fs2  fs1  fs0   ALU output
 0    0    1    0    0    A and B
 0    1    0    0    0    A and not B
 0    1    0    0    1    A xor B
 1    1    0    0    1    A plus not B plus carry
 1    0    1    1    0    A plus B plus carry
 1    0    1    1    0    not A plus B plus carry
 0    0    0    0    0    A
 0    0    0    0    1    A or B
 0    0    1    0    1    B
 0    1    0    1    0    not B
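[Figure: carry-select adder scheme. The lowest block adds a,b[3:0] with carry-in c; the higher blocks, up to a,b[31:28], each compute both the sum (+) and the sum plus one (+, +1), and multiplexers choose between s and s+1 to produce sum[3:0], sum[7:4], sum[15:8] and sum[31:16].]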
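The carry-select idea is that each block computes its sum twice, once for carry-in 0 and once for carry-in 1, and a multiplexer picks the right one when the real carry arrives. The Python sketch below follows the block boundaries named in the figure (sum[3:0], sum[7:4], sum[15:8], sum[31:16]). It is an illustration of the technique rather than the ARM6 circuit itself; in particular it treats the first block like the others, and the names are hypothetical.

```python
BLOCK_WIDTHS = [4, 4, 8, 16]  # sum[3:0], sum[7:4], sum[15:8], sum[31:16]


def carry_select_add(a, b, carry_in=0):
    """32-bit addition built from carry-select blocks.

    Returns (sum, carry_out).
    """
    result = 0
    carry = carry_in
    shift = 0
    for width in BLOCK_WIDTHS:
        mask = (1 << width) - 1
        x = (a >> shift) & mask
        y = (b >> shift) & mask
        s0 = x + y          # block sum assuming carry-in 0
        s1 = x + y + 1      # block sum assuming carry-in 1
        chosen = s1 if carry else s0   # the multiplexer
        result |= (chosen & mask) << shift
        carry = chosen >> width        # carry into the next block
        shift += width
    return result, carry
```

Because both candidate sums for a block are ready before its carry-in is known, the delay per block is one multiplexer rather than a full ripple through the block.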
[Figure: ALU organization: A and B operand latches with invert control (invert B), XOR gates, logic functions, adder with C in, logic/arithmetic and function select, a result mux (the multiplexor selects the output), zero detect, the result output and the C, V, N and Z flags.]
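Carry arbitration encoding
ai  bi  Ci        vi, wi
0   0   0         0, 0
1   1   1         1, 1
0   1   unknown   1, 0
1   0   unknown   1, 0
$v_i = a_i + b_i$, $w_i = a_i b_i$ (here + is logical OR and the product is logical AND)
[Table: two-bit carry arbitration. The pairs (ai, bi) and (ai-1, bi-1) are combined to give Ci as (vi, wi); the entries are 0, 0 and 1, 1 where the carry is known, 1, 0 where it is not, and 0(1) or 1(0) where it depends on the lower bits.]
[Figure: cross-bar switch barrel shifter, with inputs in[0], in[1], in[2] and diagonals for left 1, left 2 and left 3.]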
Multiplier design
All ARMs apart from the first prototype have included support for integer multiplication
  older ARM cores include low-cost multiplication hardware that supports only the 32-bit result multiply and multiply-accumulate
  recent ARM cores have high-performance multiplication hardware and support 64-bit result multiply and multiply-accumulate
Carry-in  Multiplier  Shift          ALU
0         x0          LSL #2N        A + 0
          x1          LSL #2N        A + B
          x2          LSL #(2N+1)    A - B
          x3          LSL #2N        A - B
1         x0          LSL #2N        A + B
          x1          LSL #(2N+1)    A + B
          x2          LSL #2N        A - B
          x3          LSL #2N        A + 0
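The table of shifts and ALU operations describes a multiplier that consumes two multiplier bits per cycle, using a carry between cycles so that only +B, -B or 0 (at shift 2N or 2N+1) is ever needed. A Python sketch of that algorithm follows. The carry-out values in `STEP` are an assumption derived from the arithmetic (for example, two bits equal to 3 become -B now plus a carry worth +4B in the next cycle), since the extracted table does not show them; `multiply_2bit` is a hypothetical name.

```python
MASK = 0xFFFFFFFF

# (carry_in, two multiplier bits) -> (extra shift, ALU sign, carry_out)
# Shifts and signs follow the table; carry_out is derived, not quoted.
STEP = {
    (0, 0): (0, 0, 0),    # A + 0
    (0, 1): (0, +1, 0),   # A + B, LSL #2N
    (0, 2): (1, -1, 1),   # A - B, LSL #(2N+1)
    (0, 3): (0, -1, 1),   # A - B, LSL #2N
    (1, 0): (0, +1, 0),   # A + B, LSL #2N
    (1, 1): (1, +1, 0),   # A + B, LSL #(2N+1)
    (1, 2): (0, -1, 1),   # A - B, LSL #2N
    (1, 3): (0, 0, 1),    # A + 0
}


def multiply_2bit(b, multiplier):
    """Low 32 bits of b * multiplier, two multiplier bits per cycle."""
    a = 0
    carry = 0
    for n in range(16):                      # 16 cycles cover 32 bits
        bits = (multiplier >> (2 * n)) & 3
        extra, sign, carry = STEP[(carry, bits)]
        a = (a + sign * (b << (2 * n + extra))) & MASK
    # A final carry would weigh 2^32, so it vanishes in a 32-bit result.
    return a
```

Multiplying 7 by 9 takes three non-trivial steps: +7, then -7 shifted left by 3 with a carry out, then +7 shifted left by 4, giving 7 - 56 + 112 = 63.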
[Figure: carry-propagate (a) and carry-save (b) adder structures. In (a) each cell's Cout feeds the Cin of the next cell; in (b) each cell takes A, B and Cin and produces separate Cout and S outputs.]
[Figure: high-speed multiplier organization: registers; Rs shifted right 8 bits/cycle; Rm; carry-save adders producing partial sum and partial carry; rotate sum and carry 8 bits/cycle; ALU (add partials).]
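[Figure: register cell circuit: one write port from the ALU bus and two read ports onto the A bus and B bus.]
[Figure: register bank floorplan: Vdd and Vss rails; ALU bus, PC bus and INC bus; PC; register cells; A bus and B bus.]
[Figure: datapath layout: address register, incrementer, register bank (with PC), multiplier, shifter, ALU, data in, instruction pipe and data out.]
[Figure: control logic structure: the instruction feeds the decode PLA; cycle count, multiply control and load/store multiple blocks; outputs for address control, register control, ALU control and shifter control.]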