Dr.Y.Narasimha Murthy Ph.


ARM Processors -Architecture


The ARM Processor was originally developed at Acorn Computers Limited of Cambridge,
England, between the years 1983-1985. It was the first RISC microprocessor developed for
commercial use and has some significant differences from subsequent RISC architectures. In
1990 ARM Limited was established as a separate company specifically to widen the exploitation
of ARM technology and it is established as a market-leader for low-power and cost-sensitive
embedded applications.
The basic motivation for the ARM processor was that the 16-bit CISC microprocessors that
were available in 1983 were slower than standard memory parts. They also had instructions that
took many clock cycles to complete (in some cases, many hundreds of clock cycles), resulting
in very long interrupt latencies. Because of these limitations in the commercial microprocessor
offerings, the design of a proprietary microprocessor was considered, and hence the ARM chip was
designed.
In fact, ARM does not manufacture microprocessors. It is an IP (intellectual property) company
that designs systems and licenses them to other companies to fabricate; for example, ARM
microprocessors are manufactured by Intel, Texas Instruments, Samsung and many other
fab companies.
The ARM processor is supported by a toolkit which includes an instruction set emulator for
hardware modeling and software testing and benchmarking, an assembler, C and C++ compilers,
a linker and a symbolic debugger.
So ARM is not a fab company; it only gives licenses to companies that want to manufacture
ARM-based CPUs or System-on-Chip products. The two main types of licenses offered by ARM
are Implementation Licenses and Architecture Licenses. The implementation license provides
the complete information required to design and manufacture integrated circuits containing an ARM
processor core. Implementation licenses cover two types of core: soft cores and hard cores. A hard
core is optimized for a specific manufacturing process, while a soft core can be used in any
process but is less optimized.
The architecture license enables the licensee to develop their own processors compliant with
the ARM ISA.

Unique features of ARM Processors

Because of certain unique features, ARM has today become one of the most popular embedded
architectures.
(i). ARM cores are simple compared to most other general-purpose processors, i.e. they can be
manufactured using comparatively few transistors, leaving plenty of space on the chip
for application-specific macro cells.
(ii). A typical ARM chip can contain several peripheral controllers, a digital signal processor, and
some amount of on-chip memory along with an ARM core.
Both the ARM ISA and the pipeline design are aimed at minimizing energy consumption, which is a
critical requirement in mobile embedded systems.
(iii). The ARM architecture is highly modular, i.e. the only mandatory component of an ARM
processor is the integer pipeline. All other components, including caches, MMU, floating point and
other co-processors, are optional, which gives a lot of flexibility in building application-specific
ARM-based processors.
Also, being small and low-power, these ARM processors provide high performance for
embedded applications.
For example, the PXA255 XScale processor running at 400 MHz provides performance comparable to
a Pentium II at 300 MHz while using 50 times less energy.
ARM is basically a RISC processor architecture. It incorporated a number of features from
the Berkeley RISC design, but a number of other features were rejected.
The RISC features used were:
A load-store architecture.
Fixed-length 32-bit instructions: all instructions have a fixed length of 32 bits.
3-address instructions: all arithmetic and logic instructions operate on operands in the processor
registers, and the two source operand registers and the destination register are all
independently specified. Ex: ADD r0, r1, r2.
The RISC features that were rejected are:
Register windows, delayed branches and single-cycle execution of all instructions.
The main problem with register windows is the large chip area occupied by the large
number of registers. This feature was therefore rejected on cost grounds.
The problem with delayed branches is that they remove the atomicity of individual
instructions. They work well on single-issue pipelined processors, but they do not scale
well to super-scalar implementations and can interact badly with branch prediction
mechanisms.
Although the ARM executes most data processing instructions in a single clock cycle, many
other instructions take multiple clock cycles. Instead of single-cycle execution of all instructions,
the ARM was designed to use the minimum number of cycles required for memory accesses.
Where this was greater than one, the extra cycles were used, where possible, to do something
useful, such as supporting auto-indexing addressing modes. This reduces the total number of ARM
instructions required to perform any sequence of operations, improving performance and code
density.

ARM's architecture is compatible with all four major platform operating systems:

Symbian OS,
Palm OS,
Windows CE, and
Linux.

Special ISA (Instruction Set Architecture) features of ARM
ARM has certain interesting features which are not found in other processors.
(i). Conditional execution of instructions: All instructions are conditionally executed, i.e. an
instruction is executed only if the current values of the condition code flags satisfy the
condition specified in the instruction.
Ex: ADDNE r1, r2, r3, i.e. add the registers r2 and r3 and keep the result in the register r1, but
only if the Z flag is clear (for example, because a preceding comparison found its operands not
equal). If the condition is not satisfied, the instruction acts as a NOP.
This feature was chosen because it could maintain high performance while reducing hardware
complexity, since it could avoid introducing pipeline bubbles and compensate for the lack of a
branch predictor.
In the same way, any instruction is executed only if the N (Negative), Z (Zero), C (Carry) and
V (oVerflow) flags in the Current Program Status Register (CPSR) satisfy the condition specified
in a 4-bit field of the instruction.
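For example, let us write a program based on conditional execution of ARM instructions. It
computes the greatest common divisor of r0 and r1 by repeated subtraction.

Ex:
Loop    CMP   r0, r1        ; compare r0 with r1 and set the flags
        SUBGT r0, r0, r1    ; if r0 > r1 then r0 = r0 - r1
        SUBLT r1, r1, r0    ; if r0 < r1 then r1 = r1 - r0
        BNE   Loop          ; repeat until r0 = r1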
The program has only 4 instructions.
Let us now consider the same program written without conditional execution.
Loop    CMP   r0, r1
        BEQ   end
        BLT   less
        SUB   r0, r0, r1
        B     Loop
less    SUB   r1, r1, r0
        B     Loop
end
From the above example it is clear that there are 7 instructions in total (including 2 conditional
branches and 2 unconditional branches).
So the implementation using conditional execution generates shorter code and increases execution
speed. Also, an instruction has its normal effect only if the status flags satisfy the condition
specified in the instruction; otherwise the instruction acts as a NOP.
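
The behaviour of the four-instruction conditional loop can be checked with a small model. The following is only a sketch in Python, not ARM code: the function name gcd_conditional is hypothetical, the three booleans stand in for the flags set by CMP, and both inputs are assumed to be positive integers (with a zero input the loop would never terminate).

```python
def gcd_conditional(r0, r1):
    """Model of the CMP / SUBGT / SUBLT / BNE loop.

    Returns (result, iterations). Assumes r0 > 0 and r1 > 0.
    """
    iterations = 0
    while True:
        iterations += 1
        # CMP r0, r1: the comparison outcome is latched once per pass,
        # because SUBGT and SUBLT do not update the flags.
        greater = r0 > r1
        less = r0 < r1
        equal = r0 == r1
        if greater:          # SUBGT r0, r0, r1
            r0 = r0 - r1
        if less:             # SUBLT r1, r1, r0
            r1 = r1 - r0
        if equal:            # BNE Loop is not taken when the operands were equal
            return r0, iterations


result, passes = gcd_conditional(12, 18)
print(result, passes)
```

Each pass through the loop costs four instructions and a single branch, whereas the branching version executes up to three branches per pass.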

Another unusual architectural feature of ARM is that shift instructions are not provided explicitly.
However, an immediate value or one of the register operands in arithmetic, logic and
move instructions can be shifted by a prescribed amount before being used in an operation.
Consider the following ARM instruction with r1 = 3 and r2 = 5:
ADD r0, r1, r2, LSL #3 ; r0 = r1 + (8 x r2), which is r0 = 3 + (8 x 5) = 43
Consider a MOV instruction with r1 = 168 and r2 = 3:

MOV r0, r1, LSR r2 ; shift the binary value of 168 three places to the right
168 = 0000 0000 0000 0000 0000 0000 1010 1000

Shifted 3 places, it becomes

 21 = 0000 0000 0000 0000 0000 0000 0001 0101

For positive numbers, LSR #3 is the same as dividing by 2^3 (8), discarding the remainder.

This feature is used to implement shift instructions implicitly.
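
The two worked examples above can be reproduced with a few lines of Python. This is a minimal sketch of a 32-bit shifted operand, not a full model of the barrel shifter; the helper names lsl, lsr and add_shifted are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide


def lsl(value, amount):
    """Logical shift left, keeping only the low 32 bits."""
    return (value << amount) & MASK32


def lsr(value, amount):
    """Logical shift right of a 32-bit value (zeros shifted in)."""
    return (value & MASK32) >> amount


def add_shifted(rn, rm, shift):
    """Model of ADD rd, rn, rm, LSL #shift."""
    return (rn + lsl(rm, shift)) & MASK32


print(add_shifted(3, 5, 3))   # ADD r0, r1, r2, LSL #3 with r1 = 3, r2 = 5
print(lsr(168, 3))            # MOV r0, r1, LSR r2 with r1 = 168, r2 = 3
```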
Though there are a number of different multiply instructions for use in signal processing
applications, there are no hardware divide instructions. Division must be implemented in
software.
ARM was one of the first architectures to implement load and store multiple instructions. These can
transfer multiple registers between memory and the processor in a single instruction.
ARM processors include an inline barrel shifter to pre-process one of the input registers. This
barrel shifter helps in executing arithmetic instructions such as multiplication and multiply-
accumulate.
The simplicity of the architecture reduces the overhead on each instruction, allowing the clock
cycle to be shortened.
ARM7TDMI-S Processor : The ARM7TDMI-S processor is a member of the ARM family of
general-purpose 32-bit microprocessors. The ARM family offers high performance for very low
power consumption and gate count. The ARM7TDMI-S processor has a Von Neumann
architecture, with a single 32-bit data bus carrying both instructions and data. Only load, store,
and swap instructions can access data from memory. The ARM7TDMI-S processor uses a three-
stage pipeline to increase the speed of the flow of instructions to the processor. This enables
several operations to take place simultaneously, and the processing and memory systems to
operate continuously. In the three-stage pipeline the instructions are executed in three stages:
Fetch, Decode and Execute.

The three stage pipelined architecture of the ARM7 processor is shown in the above figure.
In ARM7TDMI-S the suffix letters stand for:
T: Thumb mode (16-bit instruction support).
D: on-chip Debug support, enabling the processor to halt in response to a debug request.
M: enhanced Multiplier, which yields a full 64-bit result with high performance.
I: Embedded ICE hardware (In-Circuit Emulator). The Embedded ICE macro cell consists of on-
chip logic to support debug operations.
S: Synthesizable.
[Here let me tell you the meaning of synthesizable: In the early days ARM cores were designed as a
hard macro, i.e. the physical design at the transistor layout level, and the fab companies took
this fixed physical block and placed it into their chip designs. But as designs became more
complex, demand increased for a more flexible and configurable solution, hence ARM
decided to deliver processor designs as a behavioral description at the "register transfer
level" (RTL) written in a hardware description language (HDL), typically Verilog HDL. The
process of converting this behavioral description into a physical network of logic gates is called
"synthesis", and several major EDA companies sell automated synthesis tools for this purpose.
A processor design distributed to licensees as an RTL description (such as the ARM7TDMI-S) is
therefore described as "synthesizable".]

The ARM processors are based on RISC architectures and this architecture has provided small
implementations, and very low power consumption. Implementation size, performance, and very
low power consumption remain the key features in the development of the ARM devices.
The typical RISC architectural features of ARM are:

(i). A large uniform register file, all of which can be used for most purposes.
(ii). A load/store architecture, where data-processing operations operate only on register contents,
not directly on memory contents. Only load/store instructions access memory.
For ex: LDR r0, [r1] ; STR r0, [r1] ; LDREQB r0, [r1] (conditional)
(iii). 3-address instructions (the two source operand registers and the result register are all
independently specified).
(iv). Simple addressing modes, with all load/store addresses being determined from register
contents and instruction fields only.
(v). Uniform and fixed-length instruction fields, to simplify instruction decode.
(vi). The ability to perform a general shift operation and a general ALU operation (using a
hardware barrel shifter) in a single instruction that executes in a single clock cycle.
(vii). Auto-increment and auto-decrement addressing modes to optimize program loops.
(viii). Load and Store Multiple instructions to maximize data throughput.
(ix). Conditional execution of almost all instructions to maximize execution throughput.
(x). A very dense 16-bit compressed representation of the instruction set in Thumb state.
ARM architecture is compatible with all four major operating systems, i.e.
Symbian OS,
Palm OS,
Windows and Android OS.
There are three basic instruction sets for ARM.
A 32- bit ARM instruction set
A 16 bit Thumb instruction set and
The 8-bit Java Byte code used in Jazelle state.
[This is supported by ARM9 processors and above. To enter this state, either the J bit in the CPSR
register must be set or the branch instruction BXJ must be executed. This helps to increase the
execution speed of Java ME (Java Micro Edition) games and applications. As Java
applications run in hardware (rather than in software), more speed is achieved. This light-
weight version of Java runs on devices with limited memory and/or processing power, such as
cellular phones, PDAs, TV set-top boxes, smart cards etc.
Even though Jazelle adds a lot of functionality to the already existing ARM core, only
about 20,000 additional gates are needed, a value that is almost insignificant for a typical
ARM CPU macro cell product, which also includes the cache required to support the
operating system.]
The Thumb instruction set is a subset of the most commonly used 32-bit ARM instructions.
Thumb instructions operate with the standard ARM register configuration, enabling excellent
interoperability between ARM and Thumb states. Thumb code is typically about 65% of the size
of the equivalent ARM code and can provide 160% of the performance of ARM code when running
from a 16-bit memory system. Thumb mode is therefore used in embedded systems where memory
resources are limited.
**Additional explanation: comparison of Thumb and ARM instructions.
Here the same function, which returns the absolute value of its integer parameter in r0, is written
in ARM assembly and in Thumb assembly.

ARM Assembly
iabs    CMP   r0, #0
        RSBLT r0, r0, #0    ; if r0 is less than zero, set r0 to 0 - r0
        MOV   pc, lr        ; return from a linked branch

Thumb Assembly
iabs    CMP   r0, #0
        BGE   return
        NEG   r0, r0
return  MOV   pc, lr

Let us now compare the code density of the two versions.

Code     Instructions   Size (Bytes)   Normalized
ARM      3              12             1.0
THUMB    4              8              0.67
So the Thumb code is about 33% smaller than the ARM code for the same function.
**In the above ARM code, the last line MOV pc, lr could also be written as MOV r15, r14. Why?
Guess the reason!
The ARM7 processor is based on the Von Neumann model, with a single bus for both data and
instructions. (The ARM9 is based on the Harvard model.) Though a single bus would decrease the
performance of the ARM7, this is overcome by pipelining. ARM uses the Advanced
Microcontroller Bus Architecture (AMBA). AMBA includes two kinds of bus: a system bus,
which is either the Advanced High-performance Bus (AHB) or the Advanced System Bus (ASB),
and the Advanced Peripheral Bus (APB).
The ARM processor consists of
Arithmetic Logic Unit (32-bit)
One Booth multiplier(32-bit)
One Barrel shifter
One Control unit
Register file of 37 registers each of 32 bits.
The barrel shifter is used for fast shift operations and can perform the necessary processing of
register values before they enter the ALU. This eases the calculation of a wider range of
expressions and addresses.
In addition to these, the ARM also contains a 32-bit program status register; some
special registers such as the instruction register, the memory data read and write registers and
the memory address register; one priority encoder, which is used in the multiple load and
store instructions to indicate which register in the register file is to be loaded or stored; and
multiplexers etc.

ARM Registers: ARM has a total of 37 registers, of which 31 are general-purpose registers of
32 bits and six are status registers. But all these registers are not seen at once. The processor state
and operating mode decide which registers are available to the programmer. At any time,
only 16 of the 31 general-purpose registers are available to the user. The remaining
15 registers are banked copies used to speed up exception processing. The status registers are of
two kinds: one CPSR and five SPSRs (the current and saved program status registers, respectively).
In ARM state the registers r0 to r13 are orthogonal: any instruction that you can apply to r0 you
can equally well apply to any of the other registers.
The main bank of 16 registers is used by all unprivileged code. These are the User mode
registers. User mode is different from all other modes as it is unprivileged. In addition to this
register bank, there is also one 32-bit Current Program Status Register (CPSR).

Of these 16 registers, r13 acts as the stack pointer, r14 acts as the link register and r15
acts as the program counter.
Register r13 is the SP (Stack Pointer) register, and it is used to store the address of the stack top.
R13 is used by the PUSH and POP instructions in T variants, and by the SRS and RFE
instructions from ARMv6.

Register r14 is the Link Register (LR). This register holds the address of the next instruction
after a Branch and Link (BL or BLX) instruction, which is the instruction used to make a
subroutine call. It is also used for return address information on entry to exception modes. At all
other times, r14 can be used as a general-purpose register.
**You may wonder why this link register was added to the ARM architecture and what
its advantage is. In CISC (Intel) processors, when an interrupt occurs the return
address is always stored on the stack. So, after servicing the interrupt, the processor
has to access the stack in memory, which normally takes more time than accessing a register. If the
return address is stored in a link register instead, reading it back takes less time.
This is the advantage of the link register.

Register r15 is the Program Counter (PC). It can be used in most instructions as a pointer to the
instruction which is two instructions after the instruction being executed.
**The PC in ARM has a specialty. In CISC (Intel) processors the PC normally stores the
address of the next instruction to be executed. But the ARM PC contains the address of the
instruction that is being fetched (not the one being executed).
The remaining 13 registers have no special hardware purpose.
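
The value that an instruction sees when it reads r15 can be worked out from the description above. The following Python sketch assumes the three-stage pipeline (the PC is two instructions ahead of the one executing) and instruction sizes of 4 bytes in ARM state and 2 bytes in Thumb state; the function name pc_read_value is hypothetical.

```python
def pc_read_value(instr_address, thumb=False):
    """Value read from r15 by the instruction located at instr_address.

    The PC points at the instruction being fetched, which is two
    instructions after the one being executed.
    """
    instr_size = 2 if thumb else 4          # bytes per instruction
    return (instr_address + 2 * instr_size) & 0xFFFFFFFF


print(hex(pc_read_value(0x8000)))               # ARM state: address + 8
print(hex(pc_read_value(0x8000, thumb=True)))   # Thumb state: address + 4
```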
CPSR: The ARM core uses the CPSR register to monitor and control internal operations. The
CPSR is a dedicated 32-bit register and resides in the register file. The CPSR is divided into four
fields, each 8 bits wide: flags, status, extension, and control. The extension and status fields
are reserved for later architecture versions such as ARMv5 and ARMv7. The control field
contains the processor mode, state, and interrupt mask bits. The flags field contains the condition
flags. The 32-bit CPSR register is shown below.

M4  M3  M2  M1  M0   Mode
0   0   0   0   0    User 26 mode
0   0   0   0   1    FIQ 26 mode
0   0   0   1   0    IRQ 26 mode
0   0   0   1   1    SVC 26 mode
1   0   0   0   0    User mode
1   0   0   0   1    FIQ mode
1   0   0   1   0    IRQ mode
1   0   0   1   1    SVC mode
1   0   1   1   1    ABT mode
1   1   0   1   1    UND mode
1   1   1   1   1    System mode

FIQ disable bit:

F 1 = FIQ interrupts disabled
0 = FIQ interrupts enabled.

IRQ disable bit:

I 1 = IRQ interrupts disabled
0 = IRQ interrupts enabled

Negative or less than flag:

N 1 = result negative or less than in last operation
0 = result positive or greater than.

Thumb state flag:

T 1 = processor operating in Thumb state
0 = processor operating in ARM state.
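
The fields described above can be pulled out of a 32-bit CPSR value with shifts and masks. The following Python sketch assumes the usual bit positions (N, Z, C, V in bits 31 to 28, I in bit 7, F in bit 6, T in bit 5 and the mode in bits 4 to 0) and covers only the 32-bit modes from the table; the names MODES and decode_cpsr are hypothetical.

```python
# Mode field values M[4:0] for the 32-bit modes listed in the table above.
MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "SVC",
    0b10111: "ABT",
    0b11011: "UND",
    0b11111: "System",
}


def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into its flag, mask, state and mode fields."""
    return {
        "N": (cpsr >> 31) & 1,
        "Z": (cpsr >> 30) & 1,
        "C": (cpsr >> 29) & 1,
        "V": (cpsr >> 28) & 1,
        "I": (cpsr >> 7) & 1,   # 1 = IRQ interrupts disabled
        "F": (cpsr >> 6) & 1,   # 1 = FIQ interrupts disabled
        "T": (cpsr >> 5) & 1,   # 1 = Thumb state
        "mode": MODES.get(cpsr & 0x1F, "unknown"),
    }


print(decode_cpsr(0x600000D3))
```

The example value has Z and C set, both interrupt types disabled, ARM state selected and the mode field set to SVC.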

The CPSR in Higher versions

The Q flag is set in E variants of ARMv5 and above to indicate overflow and/or saturation in
instructions intended to assist DSP operations.

GE[3:0] flags, in ARMv6, control the Greater than or Equal behavior in SIMD instructions.

For half word instructions, if bits 3:2 are set, the upper half word is used; and if bits 1:0 are set,
the lower half word is used. Similarly, for byte operations, if bit 3 is set, the top byte is used; if bit
0 is set, the bottom byte is used; and bits 2 and 1 select the middle bytes in the same fashion.
E: is a flag in ARMv6 that controls the 'endianness' for data handling.
With increasing system-on-a-chip (SoC) integration, a single chip is more likely to contain little-
endian OS environments and interfaces (such as USB, PCI), but with big-endian data (TCP/IP
packets, MPEG streams). With ARMv6, support for mixed-endian systems has been improved.
As a result, handling data in mixed-endian systems under ARMv6 is far more efficient.
ARM added the J bit to the CPSR. Together with the T bit, the J bit records whether the processor
is in Java, ARM or Thumb state.
When J=0, T=0, the processor is in ARM state.
When J=0, T=1, the processor is in Thumb state.
When J=1, T=0, the processor is in Java state.
When J=1, T=1, the combination is illegal.
To enter the Java state it is possible simply to write the J bit of the CPSR, but this is not
recommended. Instead, use the Branch and Exchange to Java (BXJ) instruction. It works just like
calling a

This single instruction saves three program steps, because BXJ performs three operations.
First, it checks the condition. If the condition is true, it stores the current PC and loads a new PC.
Then it sets the Java state and takes the branch.
CPSR in Cortex Processors

Bits marked Do Not Modify (DNM) must not be modified by software.

The IT execution state bits

IT[7:5] encodes the base condition code for the current IT block, if any. It contains b000 when no
IT block is active.

IT[4:0] encodes the number of instructions that are to be conditionally executed, and whether the
condition for each is the base condition code or the inverse of the base condition code. It contains
b00000 when no IT block is active.

SPSR Register: The SPSR is used to store the current value of the CPSR when an exception
occurs so that it can be restored after handling the exception. Each exception handling mode can
access its own SPSR. User mode and System mode do not have an SPSR because they are not
exception handling modes.

Processor Modes: There are seven processor modes: six privileged modes (abort, fast interrupt
request, interrupt request, supervisor, system, and undefined) and one unprivileged mode called
user mode.
i. The processor enters abort mode when there is a failed attempt to access memory.
ii. Fast interrupt request and iii. interrupt request modes correspond to the two interrupt levels
available on the ARM processor.
iv. Supervisor mode is the mode that the processor is in after reset and is generally the mode that
an operating system kernel operates in.
v. System mode is a special version of user mode that allows full read-write access to the CPSR.
vi. Undefined mode is used when the processor encounters an instruction that is undefined or not
supported by the implementation.
vii. User mode is used for programs and applications.
The T bit decides the processor state, either 16-bit Thumb state or 32-bit ARM state. When the T
bit is 1, the processor is in Thumb state; when T = 0, the processor is in ARM state and executes
ARM instructions. To change states, the core executes a specialized branch instruction.
**So, the processor mode can be changed by a program that writes directly to the CPSR (the
processor has to be in a privileged mode) or by hardware when the core responds to an exception or
interrupt.
Banked Registers: Out of the 37 registers, 20 registers are hidden from a program at different
times. These registers are called banked registers. They are available only when the processor is
in a particular mode; for example, abort mode has banked registers r13_abt, r14_abt and
spsr_abt. Banked registers of a particular mode are denoted by an underscore character followed
by the mode mnemonic, i.e. _mode.

Any banked register is unique to its particular mode and is actually a different physical
register, even though the register name used in an instruction that writes to it is the same
regardless of mode. For example, if you wanted to write to r13 in whatever mode, you would use
r13 = some_value, and not actually specify the unique name r13_fiq, r13_svc, etc. So if you wrote
a value into r13 while in User mode and then switched to FIQ mode, the value in the User mode r13
would not be visible. While in FIQ mode, you could write to register r13 again without affecting
the value you wrote during User mode.
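
The register banking described above can be modelled as a lookup from (mode, register number) to a physical register. This is a simplified Python sketch: the class RegisterFile and the table BANKED are hypothetical names, the status registers are left out, and the banked sets are the usual ones (r8 to r14 for FIQ, r13 and r14 for the other exception modes).

```python
# Register numbers that have a private (banked) copy in each mode.
BANKED = {
    "usr": set(),
    "sys": set(),                      # System mode shares the User registers
    "fiq": {8, 9, 10, 11, 12, 13, 14},
    "irq": {13, 14},
    "svc": {13, 14},
    "abt": {13, 14},
    "und": {13, 14},
}


class RegisterFile:
    """Maps the register number in an instruction to a physical register."""

    def __init__(self):
        self.mode = "usr"
        self.physical = {}             # physical register name -> value

    def _name(self, n):
        if n in BANKED[self.mode]:
            return "r%d_%s" % (n, self.mode)   # e.g. r13_fiq
        return "r%d" % n

    def write(self, n, value):
        self.physical[self._name(n)] = value & 0xFFFFFFFF

    def read(self, n):
        return self.physical.get(self._name(n), 0)


regs = RegisterFile()
regs.write(13, 0x1000)       # r13 = some_value in User mode
regs.mode = "fiq"
regs.write(13, 0x2000)       # same instruction, but r13_fiq is written
regs.mode = "usr"
print(hex(regs.read(13)))    # the User mode value is untouched

# 16 shared registers plus all banked copies
total_general = 16 + sum(len(s) for s in BANKED.values())
print(total_general)
```

Counting the 16 unbanked registers together with every banked copy reproduces the 31 general-purpose registers mentioned earlier.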
There are two interrupt request levels available on the ARM processor core- interrupt request
(IRQ) and Fast Interrupt request (FIQ).
At the CPU level, the ARM FIQ signal is technically very similar to the x86 non-maskable
interrupt (NMI), but its role within the system architecture has different historical roots. ARM
FIQs were, as the name suggests, designed to rapidly service demanding peripherals or even to
allow software to replace hardware (for example in synchronous serial communication).
**Here an interesting point to understand is how FIQ provides faster service. The answer is
simple. From the banked registers it is clear that more registers (r8-r14) are banked in FIQ
mode, and hence the handler need not use the stack to save any values; it can use its own
registers instead. As accessing registers is always faster than accessing the stack, FIQ provides
faster service to interrupt requests.
The IRQ exception is a normal interrupt caused by a LOW level on the IRQ input. IRQ has a
lower priority than FIQ, and is masked on entry to an FIQ sequence. It must be ensured that the
IRQ input is held LOW until the processor acknowledges the interrupt request, either from the
VIC (Vectored Interrupt Controller) interface or the software handler.
V, C, Z and N are the condition flags.

V (oVerflow) : Set if the result causes a signed overflow. This flag is set whenever the result of
               a signed number operation is too large, causing the high-order bit to overflow into
               the sign bit. Generally the carry flag is used to detect errors in unsigned arithmetic
               operations, while the overflow flag is used to detect errors in signed arithmetic
               operations.
C (Carry)    : Set when the result causes an unsigned carry.
Z (Zero)     : Set when the result of an arithmetic operation is zero; frequently
               used to indicate equality.
N (Negative) : The sign bit of the result. This bit is set when bit 31 of the result is a
               binary 1. The binary representation of signed numbers uses D31
               as the sign bit. If the D31 bit of the result is zero, then N=0 and the result is
               positive. If the D31 bit is one, then N=1 and the result is negative. The N and
               V flags are used for signed number arithmetic operations.
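
The way the four condition flags follow from a 32-bit addition can be sketched in Python. This is an illustrative model only, assuming plain two's-complement addition of two 32-bit operands; the function name add_with_flags is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def add_with_flags(a, b):
    """Add two 32-bit values and return (result, flags) as for a flag-setting ADD."""
    a &= MASK32
    b &= MASK32
    full = a + b                       # up to 33 bits wide
    result = full & MASK32
    flags = {
        "N": result >> 31,             # bit 31 (D31) of the result
        "Z": int(result == 0),
        "C": int(full > MASK32),       # unsigned carry out of bit 31
        # Signed overflow: both operands have the same sign and the
        # sign of the result differs from it.
        "V": ((a ^ result) & (b ^ result)) >> 31,
    }
    return result, flags


print(add_with_flags(0x7FFFFFFF, 1))   # largest positive value plus one
print(add_with_flags(0xFFFFFFFF, 1))   # unsigned wrap-around to zero
```

The first case sets V (a signed error) but not C, while the second sets C (an unsigned error) but not V, which is the distinction drawn above.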

Note: The biggest register difference between the two states involves the SP register. The Thumb
state has unique stack mnemonics (PUSH, POP) that don't exist in the ARM state. These
instructions assume the existence of a stack pointer, for which r13 is used. They translate into
load and store instructions in the ARM state.

THUMB Mode Secrets : Thumb mode instructions are only 16 bits wide. So how does
the ARM processor give both the code density advantage and 32-bit performance
at the same time?
Let us understand this point in detail. In addition to the ARM instruction decoder, the ARM
design has a special block (the Thumb instruction decompressor) which decompresses the Thumb
code into ARM code before it enters execution. This can be found in the following
block diagram.

The ARM instructions arriving from the Fetch stage of the pipe line pass through the ARM
decoder, and activate major and minor opcode bit control signals.
Major opcode bits describe the type of instructions to execute while minor bits specify
instruction details such as the registers or operand specified.

In Thumb state, multiplexers direct Thumb instructions through the Thumb decompression
logic. This effectively expands the Thumb instructions into their equivalent ARM instructions.

The execution of the ARM instruction then takes place as usual.

The major opcode of the Thumb instruction denotes the type of instruction; in the above example it
is an arithmetic instruction. The minor opcode specifies the type of arithmetic operation, i.e. an
ADD between a register and a constant. Since ARM instructions have 4-bit register fields while
Thumb instructions have 3-bit register fields, each register number is expanded by a leading zero.
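
The expansion of one Thumb instruction into its ARM equivalent can be illustrated in Python. This sketch handles only the Thumb "ADD Rd, #imm8" format and maps it to the ARM instruction "ADDS Rd, Rd, #imm8" with the "always" condition; it is an assumption-laden illustration of the idea, not a description of the actual decompressor logic, and the function name expand_thumb_add_imm is hypothetical.

```python
def expand_thumb_add_imm(thumb):
    """Expand a 16-bit Thumb 'ADD Rd, #imm8' into a 32-bit ARM 'ADDS Rd, Rd, #imm8'."""
    # Thumb layout: 001 (major opcode) | 10 (minor opcode: ADD) | Rd (3 bits) | imm8
    if (thumb >> 11) != 0b00110:
        raise ValueError("not a Thumb ADD Rd, #imm8 instruction")
    rd = (thumb >> 8) & 0x7            # 3-bit register number, zero-extended to 4 bits
    imm8 = thumb & 0xFF

    cond_always = 0xE << 28            # Thumb instructions execute unconditionally
    immediate_operand = 0b001 << 25    # data processing with an immediate operand
    opcode_add = 0b0100 << 21          # ARM ADD opcode
    set_flags = 1 << 20                # this Thumb ADD always updates the flags
    return (cond_always | immediate_operand | opcode_add | set_flags
            | (rd << 16)               # first operand register Rn = Rd
            | (rd << 12)               # destination register Rd
            | imm8)                    # 8-bit constant with zero rotation


print(hex(expand_thumb_add_imm(0x3001)))   # Thumb ADD r0, #1
```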

PIPELINE : A pipeline is the mechanism used by a RISC processor to execute instructions at
an increased speed. The pipeline speeds up execution by fetching the next
instruction while other instructions are being decoded and executed. To process an
instruction, the processor first fetches the instruction, i.e. loads it from
memory; then decodes the instruction, i.e. identifies the instruction to be executed; and finally
executes the instruction and writes the result back to a register.
The ARM7 processor has a three-stage pipeline, namely Fetch, Decode and Execute.
The ARM9 has a five-stage pipeline, the ARM10 has 6 stages and the ARM11 has 8 stages.
The three-stage pipelining is explained below.

Fig: ARM 7 Core 3-Stage Pipe Lining

To explain the pipelining, let us consider three instructions: Compare, Subtract and
Add. The ARM7 processor fetches the first instruction, CMP, in the first cycle. During the
second cycle it decodes the CMP instruction and at the same time fetches the SUB
instruction. During the third cycle it executes the CMP instruction while decoding the SUB
instruction, and at the same time fetches the third instruction, ADD. This improves the
speed of operation and leads to the concept of parallel processing. This pipeline example is
shown in the following diagram.

As the pipeline length increases, the amount of work done at each stage is reduced, which allows
the processor to attain a higher operating frequency. This in turn increases the performance. One
important feature of this pipeline is that the execution of a branch instruction, or branching by
direct modification of the PC, causes the ARM core to flush its pipeline.
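
The overlap of the three stages for the CMP, SUB, ADD example can be tabulated with a short Python sketch. It assumes an ideal pipeline in which every stage takes one cycle and there are no branches or stalls; the names STAGES and pipeline_schedule are hypothetical.

```python
STAGES = ["Fetch", "Decode", "Execute"]


def pipeline_schedule(instructions):
    """Return one dict per clock cycle showing which instruction is in each stage."""
    total_cycles = len(instructions) + len(STAGES) - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for stage_index, stage in enumerate(STAGES):
            # The instruction in a given stage entered the pipeline
            # stage_index cycles earlier.
            i = cycle - stage_index
            row[stage] = instructions[i] if 0 <= i < len(instructions) else None
        schedule.append(row)
    return schedule


for number, row in enumerate(pipeline_schedule(["CMP", "SUB", "ADD"]), start=1):
    print("cycle", number, row)
```

Three instructions complete in five cycles instead of the nine that would be needed if each instruction had to finish before the next one was fetched.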
Exceptions, Interrupts, and the Vector Table

Exceptions are generated by internal and external sources to cause the ARM processor to handle
an event, such as an externally generated interrupt or an attempt to execute an Undefined
instruction. The processor state just before handling the exception is normally preserved so that
the original program can be resumed after the completion of the exception routine. More than
one exception can arise at the same time. ARM exceptions may be considered in three groups:
1. Exceptions generated as the direct effect of executing an instruction. Software interrupts,
undefined instructions (including coprocessor instructions where the requested coprocessor is
absent) and prefetch aborts (instructions that are invalid due to a memory fault occurring during
fetch) come under this group.
2. Exceptions generated as a side-effect of an instruction. Data aborts (a memory fault during a
load or store data access) are in this group.
3. Exceptions generated externally, unrelated to the instruction flow. Reset, IRQ and FIQ are in
this group.
The ARM architecture supports seven types of exceptions.
i. Reset
ii. Undefined Instruction
iii. Software Interrupt (SWI)
iv. Pre-fetch abort (instruction fetch memory fault)
v. Data abort (data access memory fault)
vi. IRQ (normal interrupt)
vii. FIQ (fast interrupt request)
When an exception occurs, the processor performs the following sequence of actions:
1. It changes to the operating mode corresponding to the particular exception.
2. It saves the address of the instruction following the exception entry instruction in r14 of the
new mode.
3. It saves the old value of the CPSR in the SPSR of the new mode.
4. It disables IRQs by setting bit 7 of the CPSR and, if the exception is a fast interrupt, disables
further fast interrupts by setting bit 6 of the CPSR.
5. It forces the PC to begin executing at the relevant vector address.
Exception / Interrupt     Name    Address      High Address
Reset                     RESET   0x00000000   0xFFFF0000
Undefined Instruction     UNDEF   0x00000004   0xFFFF0004
Software Interrupt        SWI     0x00000008   0xFFFF0008
Pre-fetch Abort           PABT    0x0000000C   0xFFFF000C
Data Abort                DABT    0x00000010   0xFFFF0010
Interrupt Request         IRQ     0x00000018   0xFFFF0018
Fast Interrupt Request    FIQ     0x0000001C   0xFFFF001C
The exception vector table shown above gives the address at which execution starts when each
exception or interrupt occurs. Each vector table entry contains a form of branch instruction
pointing to the start of the specific handler routine.
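The entry sequence and the vector table can be tied together in a small behavioural model. The following Python sketch is only an illustration, not a simulator: the dictionary layout and the names `enter_exception` and `new_cpu` are assumptions made for this example, and the Thumb (T) bit is ignored. The mode numbers are the standard CPSR mode-field encodings, and Reset is treated like FIQ in that it also sets bit 6.

```python
# Illustrative model of ARM exception entry (not cycle accurate).
I_BIT = 1 << 7               # CPSR bit 7: IRQ disable
F_BIT = 1 << 6               # CPSR bit 6: FIQ disable
MODE_MASK = 0x1F             # CPSR bits 4..0 hold the operating mode
HIGH_VECTOR_BASE = 0xFFFF0000

# Exception name -> (CPSR mode field entered, offset of its vector).
VECTORS = {
    "RESET": (0x13, 0x00),   # Supervisor mode
    "UNDEF": (0x1B, 0x04),   # Undefined mode
    "SWI":   (0x13, 0x08),   # Supervisor mode
    "PABT":  (0x17, 0x0C),   # Abort mode
    "DABT":  (0x17, 0x10),   # Abort mode
    "IRQ":   (0x12, 0x18),   # IRQ mode
    "FIQ":   (0x11, 0x1C),   # FIQ mode
}


def new_cpu():
    """Hypothetical CPU state: User mode (0x10), interrupts enabled."""
    return {"cpsr": 0x10, "pc": 0x8000, "r14": {}, "spsr": {}}


def enter_exception(cpu, name, return_address, high_vectors=False):
    """Apply the five entry steps listed in the text to the model state."""
    mode, offset = VECTORS[name]
    old_cpsr = cpu["cpsr"]
    # Step 1: switch to the mode that belongs to this exception.
    new_cpsr = (old_cpsr & ~MODE_MASK) | mode
    # Step 2: banked r14 of the new mode receives the return address.
    cpu["r14"][mode] = return_address
    # Step 3: banked SPSR of the new mode receives the old CPSR.
    cpu["spsr"][mode] = old_cpsr
    # Step 4: mask IRQs; FIQ (and Reset) also mask further FIQs.
    new_cpsr |= I_BIT
    if name in ("FIQ", "RESET"):
        new_cpsr |= F_BIT
    cpu["cpsr"] = new_cpsr
    # Step 5: force the PC to the vector (low or high table).
    cpu["pc"] = (HIGH_VECTOR_BASE if high_vectors else 0) + offset
    return cpu


cpu = enter_exception(new_cpu(), "IRQ", return_address=0x8004)
print(hex(cpu["pc"]), hex(cpu["cpsr"]))   # prints: 0x18 0x92
```

Keeping r14 and the SPSR in per-mode dictionaries mirrors the banked registers: each exception mode has its own copy, so entering a handler does not overwrite the interrupted program's link register.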
In the above table one can see that address 0x00000014 is missing. This location was used on
earlier ARM processors, which operated within a 26-bit address space, to trap load or store
addresses which fell outside the address space. These traps were referred to as 'address
exceptions'. Since 32-bit ARMs do not generate addresses which fall outside their 32-bit
address space, address exceptions have no role in the current architecture and the vector address
at 0x00000014 is unused.
Similarly, some ARM processors allow the vector table to be placed at more than one memory
location. Hence the table lists two sets of addresses (Address and High Address). Which one is
used depends on the type and configuration of the ARM processor.

Reset vector is the location of the first instruction executed by the processor when power is
applied. This instruction branches to the initialization code.

Undefined instruction vector is used when the processor cannot decode an instruction.
Software interrupt vector is called when you execute a SWI instruction. The SWI instruction is
frequently used as the mechanism to invoke an operating system routine.
Pre-fetch abort vector is used when the processor attempts to fetch an instruction from an address
without the correct access permissions. The actual abort occurs in the decode stage.
Data abort vector is similar to a prefetch abort but is raised when an instruction attempts to
access data memory without the correct access permissions.

Interrupt request vector is used by external hardware to interrupt the normal execution flow of
the processor. It can only be raised if IRQs are not masked in the CPSR.
The Thumb programmer's model
After reset, ARM cores start executing ARM instructions. The normal way they switch to
executing Thumb instructions is by executing a Branch and Exchange (BX) instruction.
The Thumb instruction set is a subset of the ARM instruction set, and the instructions operate on
a restricted view of the ARM registers, i.e. not all the registers are fully available in Thumb state.
Only registers r0-r7 (low registers) and the special-purpose registers r13-r15 are generally
available in Thumb state.
r13 is used as the stack pointer (SP).
r14 is used as the link register (LR).
r15 is the program counter (PC).

The CPSR condition code flags are set by arithmetic and logical operations and control
conditional branching.

Salient Features of THUMB

Most Thumb instructions are executed unconditionally.

Many Thumb data processing instructions use a 2-address format (the destination register
is the same as one of the source registers).
Thumb instruction formats are less regular than ARM instruction formats, as a result of
the dense encoding.

Exceptions generated during Thumb execution switch to ARM execution before executing the
exception handler.
The state of the T bit is preserved in the SPSR, and the LR of the exception mode is set so that
the normal return instruction performs correctly, regardless of whether the exception occurred
during ARM or Thumb execution.

The higher registers r8 to r12 are only accessible with MOV, ADD, or CMP instructions.
CMP and all the data processing instructions that operate on low registers update the condition
flags in the CPSR.
Also, there are no MSR and MRS equivalent Thumb instructions. To alter the CPSR or SPSR,
one must switch into ARM state to use MSR and MRS. Similarly, there are no coprocessor
instructions in Thumb state.
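The way the T bit survives an exception taken in Thumb state can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names are hypothetical, only the T bit (bit 5 of the CPSR) is modelled, and mode changes and interrupt masking are left out.

```python
# Illustrative model of the T bit across exception entry and return.
T_BIT = 1 << 5  # CPSR bit 5: 1 = Thumb state, 0 = ARM state


def state(cpsr):
    """Name the instruction-set state selected by a CPSR value."""
    return "Thumb" if cpsr & T_BIT else "ARM"


def take_exception(cpsr):
    """Return (handler_cpsr, spsr): the SPSR keeps the old CPSR,
    including its T bit, and the handler always starts in ARM state."""
    spsr = cpsr
    handler_cpsr = cpsr & ~T_BIT
    return handler_cpsr, spsr


def return_from_exception(spsr):
    """The normal return copies the SPSR back into the CPSR,
    which restores the interrupted code's T bit."""
    return spsr


thumb_cpsr = 0x10 | T_BIT                       # User mode, Thumb state
handler_cpsr, spsr = take_exception(thumb_cpsr)
print(state(handler_cpsr))                      # prints: ARM
print(state(return_from_exception(spsr)))       # prints: Thumb
```

Because the state is carried in the saved status register rather than in the handler code, the same handler serves interrupted ARM and Thumb programs.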

From ARMv4T to ARMv7-A there are two instruction sets: ARM and Thumb.
They are both "32-bit" in the sense that they operate on up-to-32-bit-wide data in 32-bit-wide
registers with 32-bit addresses.
In fact, where they overlap they represent the exact same instructions - it is only the instruction
encoding which differs, and the CPU effectively just has two different decode front-ends to its
pipeline which it can switch between. Thumb-2 encompassed not just adding more instructions
to Thumb (mostly with 4-byte encodings) to bring it almost to parity with ARM, but also
extending the execution state to allow for conditional execution of most Thumb instructions, and
finally introducing a whole new assembly syntax (UAL, "Unified Assembly Language") which
replaced the previous separate ARM and Thumb syntaxes and allowed writing code once and
assembling it to either instruction set without modification.
The Cortex-M architectures implement only the Thumb instruction set. ARMv7-M
(Cortex-M3/M4/M7) supports most of "Thumb-2 Technology", including conditional execution and
encodings for VFP instructions, whereas ARMv6-M (Cortex-M0/M0+) only uses Thumb-2 in
the form of a handful of 4-byte system instructions.

ARM-Thumb transfer instructions:

(i). BX Rm ; Thumb version branch exchange

pc = Rm & 0xfffffffe, T = Rm[0]

(ii). BLX Rm ; Thumb version branch exchange with link

pc = Rm & 0xfffffffe, T = Rm[0]
lr = (address of next instruction after BLX) + 1

Example 1: ARM code

CODE32 ; word aligned
LDR r0, =thumbCode+1 ; address(thumbCode) = 0x00009000, so r0 = 0x00009001
BLX r0 ; branch to Thumb code & mode

Example 2: Thumb code

CODE16 ; halfword aligned
thumbCode
ADD r1, #1
BX lr ; branch to ARM code & mode
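The effect of BX and BLX on the PC, the T bit and the link register can be checked with a few lines of Python. This is a sketch, not an emulator: the function names and the address 0x8004 chosen for the BLX instruction are assumptions for illustration. It also assumes that BLX Rm occupies 4 bytes in ARM state and 2 bytes in Thumb state.

```python
# Illustrative model of the ARM-Thumb interworking branches.

def bx(rm):
    """BX Rm: bit 0 of Rm selects the state, the rest is the target."""
    pc = rm & 0xFFFFFFFE
    t_bit = rm & 1          # 1 = continue in Thumb state, 0 = ARM state
    return pc, t_bit


def blx(rm, blx_address, from_thumb):
    """BLX Rm: like BX, but also records the return address in lr."""
    pc, t_bit = bx(rm)
    if from_thumb:
        # Thumb version: next instruction is 2 bytes on; bit 0 is set
        # so that a later BX lr comes back in Thumb state.
        lr = (blx_address + 2) | 1
    else:
        # ARM version: next instruction is 4 bytes on; bit 0 stays
        # clear so that BX lr comes back in ARM state.
        lr = blx_address + 4
    return pc, t_bit, lr


# Example 1 from the text: r0 = thumbCode + 1 = 0x00009001.
pc, t_bit, lr = blx(0x00009001, blx_address=0x8004, from_thumb=False)
print(hex(pc), t_bit, hex(lr))   # prints: 0x9000 1 0x8008

# Example 2 from the text: BX lr returns to the ARM caller.
print(bx(lr))                    # prints: (32776, 0)
```

Here 32776 is 0x8008, the ARM instruction following the BLX, and the returned T value of 0 shows that execution resumes in ARM state.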

Co-Processor Interface: The ARM7 supports up to 16 logical co-processors. This concept was
introduced mainly to improve the performance of the ARM processor. Each co-processor
can have up to 16 private registers of any size; they are not limited to 32 bits.
Co-processors use a load/store architecture.
The ARM7TDMI co-processor interface is based on bus watching.
The co-processor is attached to the bus on which the ARM instruction stream flows into the ARM,
and the co-processor copies the instructions into an internal pipeline that is similar to the
ARM instruction pipeline.
There are three handshake signals between the ARM and the co-processor, which are exchanged
before instructions are executed.
(i). CPI (from ARM to all co-processors): Co-processor instruction. Indicates that the ARM has
identified a co-processor instruction and wishes to execute it.
(ii). CPA (from co-processor to ARM): Co-processor absent, which tells the ARM that there is no
co-processor present that is able to execute the current instruction.
(iii). CPB (from co-processor to ARM): Co-processor busy, which tells the ARM that
the co-processor cannot begin executing the instruction yet.
The timing is such that the ARM and co-processor must generate their respective signals
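The possible outcomes of the CPI/CPA/CPB handshake can be summarised in a short decision function. The Python sketch below treats the three signals as plain booleans (True means asserted) and ignores their electrical polarity and timing. The function name and the outcome strings are assumptions for illustration, and the "wait" outcome assumes the ARM simply holds the instruction until the co-processor is no longer busy.

```python
# Illustrative decision logic for the co-processor handshake.

def handshake(cpi, cpa, cpb):
    """Return what happens to a co-processor instruction.

    cpi: ARM wants to execute the co-processor instruction.
    cpa: no co-processor present can execute the instruction.
    cpb: the co-processor cannot begin executing yet.
    """
    if not cpi:
        # ARM did not commit to the instruction; nothing is executed.
        return "not executed"
    if cpa:
        # No co-processor accepts it: undefined instruction exception.
        return "undefined instruction trap"
    if cpb:
        # A co-processor accepts it but is not ready: ARM waits.
        return "wait"
    # Both sides are ready: the instruction is executed.
    return "execute"


print(handshake(cpi=True, cpa=True, cpb=True))    # prints: undefined instruction trap
print(handshake(cpi=True, cpa=False, cpb=True))   # prints: wait
print(handshake(cpi=True, cpa=False, cpb=False))  # prints: execute
```

The absent case connects back to the exception groups described earlier: a co-processor instruction for which no co-processor responds is handled through the undefined instruction vector.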

ARM Processor Families

There are various ARM processors available in the market for different applications. These are
grouped into different families based on the core. These families are based on the ARM7,
ARM9, ARM10, and ARM11 cores. The numbers 7, 9, 10, and 11 indicate different core
designs. The ascending number indicates an increase in performance and sophistication.
Although the ARM8 was introduced in 1996, it is no longer available in the market. The
following table gives a brief comparison of their performance and available resources.
The ARM7 core has a Von Neumann-style architecture, where both data and instructions use the
same bus. The core has a three-stage pipeline and executes the architecture ARMv4T instruction
set. The ARM7TDMI was introduced in 1995 by ARM. It is currently a very popular core and is
used in many 32-bit embedded processors.
The ARM9 family was released in 1997. It has a five-stage pipeline architecture. Hence, the
ARM9 processor can run at higher clock frequencies than the ARM7 family. The extra stages
improve the overall performance of the processor. The memory system has been redesigned to
follow the Harvard architecture, with separate data and instruction buses. The first processor in
the ARM9 family was the ARM920T, which includes separate D + I caches and an MMU. This
processor can be used by operating systems requiring virtual memory support. The ARM922T is a
variation on the ARM920T but with half the D + I cache size.
The latest core in the ARM9 product line is the ARM926EJ-S synthesizable processor core,
announced in 2000. It is designed for use in small portable Java-enabled devices such as 3G
phones and personal digital assistants (PDAs).
The ARM10 was released in 1999. It extends the ARM9 pipeline to six stages. It also supports an
optional vector floating-point (VFP) unit, which adds a seventh stage to the ARM10 pipeline.
The VFP significantly increases floating-point performance and is compliant with the IEEE
754-1985 floating-point standard.
The ARM1136J-S is the ARM11 processor released in 2003, and it is designed for
high-performance and power-efficient applications. The ARM1136J-S was the first processor
implementation to execute architecture ARMv6 instructions. It incorporates an eight-stage
pipeline with separate load-store and arithmetic pipelines.
In 2004, ARM introduced its new Cortex family of processors.

The Cortex processor family is subdivided into three different profiles: Cortex-A, Cortex-M and
Cortex-R. Each profile is optimized for different segments of embedded systems applications.
A denotes Application, M denotes Microcontroller and R denotes Real Time.
The Cortex-A profile has been designed as a high-end application processor. Cortex-A processors
are capable of running feature-rich operating systems such as Windows RT and Linux. The key
applications for Cortex-A are consumer electronics such as smart phones, tablet computers, and
set-top boxes.

Unlike earlier ARM CPUs, the Cortex-M processor family is designed specifically for use within
a small microcontroller.
The Cortex-M processor comes in five variants: Cortex-M0, Cortex-M0+, Cortex-M1, Cortex-M3,
and Cortex-M4. The Cortex-M0 and Cortex-M0+ are the smallest processors in the family.
Their small size helps manufacturers to design low-cost, low-power devices that can replace
existing 8-bit microcontrollers while still offering 32-bit performance.
The Cortex-M1 has many of the same features as the Cortex-M0 but has been designed as a soft
core to run inside a Field Programmable Gate Array (FPGA) device.
The highest performing member of the Cortex-M family is the Cortex-M4. This has all the
features of the Cortex-M3, adds support for digital signal processing (DSP), and also includes
hardware floating-point support for single-precision calculations.
The third Cortex profile is Cortex-R. This is the real-time profile, which delivers a
high-performance processor that forms the heart of an application-specific device.
Very often a Cortex-R processor forms part of a system-on-chip design that is focused on a
specific task such as hard disk drive (HDD) control, automotive engine management, or
medical devices. The Arm Cortex-R real-time processors offer high-performance computing
solutions for embedded systems where reliability, high availability, fault tolerance and/or
deterministic real-time responses are needed.
Cortex-R processors are used in products where performance requirements and timing deadlines
must always be met.
In addition, Cortex-R processors are used in electronic systems which must be functionally safe
to avoid hazardous situations, for example, in medical applications or autonomous systems.

In 2017 ARM unveiled its next-generation CPU cores, the Cortex-A75 and Cortex-A55,
which are the first processors to support the company's new DynamIQ multi-core technology.
This set of new processors provides the processing power that mobile devices need to cope with
advanced artificial intelligence (AI), virtual reality (VR), and mixed reality (MR) technologies.
The A75 is the successor to ARM's high-performance A73 and A72, while the new Cortex-A55
is a more power-efficient replacement for the popular Cortex-A53.
The Cortex-A75 is the new flagship-tier mobile processor design, with a claimed 22 percent
improvement in performance over the incumbent A73.
It is joined by the new Cortex-A55, which has the highest power efficiency of any mid-range
CPU ARM has ever designed, and the Mali-G72 graphics processor, which also comes with a 25
percent improvement in efficiency relative to its predecessor, the G71.
A brief comparison of different ARM families is presented below: