
Chapter 1:
Computer Systems Concepts and Processor Architecture
N. Alexandridis Computer Systems Architecture: Microprocessor-Based Designs

Chapter 1
Computer Systems Concepts and Processor Architecture

1.1 INTRODUCTION ................................................................................................ 3


1.1.1 Advances in Microprocessor Technology....................................................... 4
1.1.2 Overall System Performance......................................................................... 5
1.1.3 Processor Performance.................................................................................. 8
1.1.4 Benchmarks ................................................................................................... 9
1.1.5 The RISC Approach..................................................................................... 10
1.2 THE PROCESSOR’S EXTERNAL VIEW ....................................................... 13
1.2.1 Processor-Bus: Lines and Signals .............................................................. 14
1.2.2 Processor Interface Components ................................................................. 25
1.3 SYSTEMS CONFIGURATIONS ...................................................................... 29
1.3.1 Introduction ................................................................................................. 29
1.3.2 System Components ..................................................................................... 30
1.3.3 Hierarchy of Buses in a Computer System .................................................. 35
1.3.4 Multiprocessor Systems................................................................................ 37
1.4 THE MICROPROCESSOR’S INTERNAL ORGANIZATION........................ 38
1.4.1 The Bus Interface Unit ................................................................................ 39
1.4.2 The Instruction Fetch and Decode Unit ...................................................... 40
1.4.3 The Execution Unit..................................................................................... 42
1.4.4 On-Chip Caches........................................................................................... 42
1.4.5 Example CISC Scalar Processors ................................................................ 43
1.5 HOW THE COMPUTER EXECUTES PROGRAM INSTRUCTIONS ............. 44
1.5.1 Instruction Formats..................................................................................... 44
1.5.2 Bus Cycles, Clock Cycles, and “States” .................................................... 48
1.5.3 Microoperations and State Transition Diagrams......................................... 53
1.6 PIPELINED RISC PROCESSORS .................................................................... 60
1.6.1 Instruction Pipelines.................................................................................... 62
1.6.2 Advanced Microprocessor Operation........................................................... 66
1.6.3 Problems with Pipelining............................................................................. 68
1.7 HIGH-PERFORMANCE PROCESSORS.......................................................... 71
1.7.1 Superpipelined (Scalar) Processors ............................................................. 71
1.7.2 Superscalar Processors ............................................................................... 74
1.7.3 Superpipelined-Superscalar Processors....................................................... 83
1.7.4 The VLIW Approach.................................................................................... 83
1.8 COMPUTER SYSTEM DESIGN METHODOLOGY .................................... 86
BIBLIOGRAPHY..................................................................................... 89
EXERCISES............................................................................................................. 90


COMPUTER SYSTEMS CONCEPTS AND PROCESSOR ARCHITECTURE

1.1 INTRODUCTION
In this textbook, the terms computer, computer system, and microprocessor-based
system will mean the same thing: a computer system whose central processing unit (or
CPU) is a single processor chip, usually in the form of a high-performance
microprocessor. This Chapter discusses the CPU or processor of the system (its basic
internal components and I/O pins and signals), its use in configuring different computer
systems, and the hierarchy of buses used to interconnect the various system components.
Existing processors differ widely in both their internal structure and in the number and
types of I/O pins and signals. Both conventional CISC (complex instruction set
computer) and RISC (reduced instruction set computer) processor approaches are
explained and contrasted, along with advanced internal architectural features that newer
processors have, such as superpipelined and superscalar implementations. The first
Sections of the Chapter cover the basic fundamental concepts and the more traditional
operation of the processor and the computer system; towards the end of the Chapter,
however, we present more up to date techniques, newer architectural implementations,
and the more advanced and sophisticated operations of today’s superpipelined-
superscalar microprocessors.
The operation of the processor is the same as that of any computer CPU: it
executes an instruction cycle through the following phases: the instruction fetch from
memory and opcode decode phase, the possible operand read from memory phase, and
the instruction execute phase. Accesses to memory are carried out in terms of “bus
cycles”. The total number of bus cycles required by an instruction cycle depends on the
following: (a) whether or not the instruction has already been prefetched and is inside the
processor, (b) the width (i.e., the total number of bits) of the instruction, (c) the width of
the external data bus (over which the instruction is fetched from memory), (d) whether or
not the instruction execution phase requires access to memory to fetch an operand, and
(e) the total number of steps the execution phase of this instruction requires. Later in
Section 1.5, we will give the detailed definition of what is a bus cycle and show its
relationship to the system clock cycle (also called “input clock cycle” or “external clock
cycle”) and explain the “state transition diagram” the processor goes through in order to
execute each bus cycle. We will also present the most common microoperations
executed with each system clock cycle.
Because, in general, the external or system clock (the external clock that provides
input clock signals to the processor and to the other modules of the computer
system) and the internal or processor clock (also called the “CPU clock”, the internal
clock that provides clock signals to the hardware components inside the processor) may
not be of the same frequency, in this textbook when we use the term “clock” by itself
we will mean the “system clock”, and when we use the term “clock cycle” by itself we
will mean the “system clock cycle”.


1.1.1 Advances in Microprocessor Technology


The first single-chip microprocessor, the Intel 4004, appeared at the end of 1971.
Since then, we have witnessed dramatic increases in all facets of the design and manufacture of
powerful microprocessors. While the first microprocessors that appeared at the end of
1971 and the beginning of 1972 were very small (only 4 bits), very slow (their system/input
clock frequency was 0.5 MHz), and could access only 16K bytes of main memory, 25
years later (in 1996) the processors had wider wordlengths (64 bits), faster input or
“system clocks” (300 MHz1), and could access a huge memory of 2^40 bytes = 1 terabyte (1000
gigabytes), soon thereafter to increase to 2^64 bytes!

Characteristics by year (1972 / 1977 / 1982 / 1987 / 1992 / 1996):

- Number of I/O pins: 16 / 40 / 64 / 100 / 340 (430) / 512 (700)
- Transistors/chip (for processors): 2K / 20K / 100K / 500K-800K / >1M / 7M
- Processor wordlength (bits): 4 / 8 (16) / 16 / 32 / 32 (64) / 64 (128)
- External data bus size (bits): 4 / 8 (16) / 16 / 32 / 64 (128) / 128 (256)
- Input (system) clock frequency F (MHz): 0.5-1 / 5-8 / 8 (16) / 20 (30) / 50-100 (200) / 200-433
- Processor clock cycles per instr. (CPIP): --- / --- / 20 / 6-2 / 1 / 0.25
- Total main memory size (bytes): 2^14 = 16K / 2^16 = 64K (1M) / 2^24 = 16M / 2^32 = 4 GB / 2^32 = 4 GB / 2^36 - 2^64 (2^40 = 1 TB)
- Memory chip (bits/chip): 1K (1K x 1-bit) / 4K to 16K / 64K / 256K / 1M / 64M (128M)

Figure 1.1: Advances in processor and memory chip characteristics in the first 25 years.

Figure 1.1 depicts the chronological evolution of the major characteristics of processor
and memory chips in these first 25 years. (The numbers given in parentheses are the
exceptional, rather than the usual, cases.) We notice that the microprocessor
chip itself has become bigger, with a larger number of pins (from 16

1. Which means a system “clock cycle time” of 1/300 MHz = 3.33 ns.


input/output pins in 1972 it surpassed 550 pins in 1996) and with a higher
integration level or number of transistors packed on a single processor chip (from 2K
transistors in 1972 to more than 7M transistors in 1996). A larger number of transistors
permits the design of wider-wordlength2 microprocessors (64-bit wordlength in 1996),
while a larger number of pins permits wider external buses to interface the processor with
other components of the system (for example, wider address buses of 36-bit width and
wider data buses of 128-bit width). The integration level of the memory chip has
increased even more dramatically, from only 1K bits per memory chip in 1972 to 64M
bits in 1996. Finally, the processor clock cycles per instruction, or CPIP (an important
figure of merit that provides insight into processor performance, discussed below), has
decreased from around 20 processor clock cycles (i.e., it took 20 processor clock
cycles to completely execute an instruction) to 0.25 clock cycles in superscalar
processors (which means that in 1996 the processor could complete four instructions every
processor clock cycle).

1.1.2 Overall System Performance

The performance of the computer system is the metric on which comparative evaluations
and purchasing decisions are based. Usually, buying the best performance possible
offers the longest useful life and increases the chances that the system will also be able to
handle the more complex programs of tomorrow. Measuring performance, however, is
not always simple; furthermore, performance is very tightly related to the type of jobs the
computer system is to be used for (i.e., how the computer is to be used).
When we refer to the performance of a computer system, we usually mean the
“time” it takes to execute a program or task; i.e., the execution time (also called
“turnaround time”, “elapsed time”, or “response time”). However, if we examine this
“execution time” more closely, we observe that it is made up of a number of time components
including: the “CPU or processor time” (the time the processor spends in executing both
user and system tasks), the “memory time” (the time spent for accesses to external main
memory), time periods that have to do with user-task disk accesses and waiting for I/O
(during which time the processor is used by other tasks in a multitasking environment),
and all the “operating system (OS) overhead” (time spent executing OS instructions). To
simplify matters here, we will neglect all I/O and OS issues and focus on the part of the
“system” made up of the processor and main memory only. We can then define the
processor time TP as the user-CPU time (the time spent by the processor for internal
operations on the user task) and the system time TS as the sum of the processor time TP and
the time spent by the task for main memory accesses.

System performance then is measured by the total execution time TS the computer system
needs to execute a task expressed as:

2. A microprocessor with a “wordlength equal to n” is also referred to as an n-bit microprocessor, defined as
a processor with n-bit wide internal data registers and an n-bit wide ALU (arithmetic logic unit) which
carries out operations on n-bit input operands.


TS (total system execution time for a task) =
    (total system clock cycles for a task) x (system clock cycle time, CS)   (1.1)

The “total system clock cycles for a task” term in the above equation equals:

total system clock cycles for a task = NI * CPIS (1.2)

where CPIS is the average number of system clock cycles per instruction over all
instruction types. Different machine instructions will require different numbers of system
clock cycles to execute, but for a computer system with a given instruction set we can
calculate its average number of system clock cycles per instruction or CPIS over all
instruction types, provided we know their frequencies of appearance in a task, as follows
[21]:

CPIS = (total system clock cycles for a task) / (number of instructions, NI, in the task)

     = [ Σ (i = 1 to n) (CPIi x NIi) ] / NI

     = Σ (i = 1 to n) [ CPIi x (NIi / NI) ]                                  (1.3)

where:
    NIi represents the number of times instruction i is executed in a program, and
    CPIi represents an average number of system clock cycles for instruction i,
    measured from a large amount of program code over a long period of time.

The system CPIS in Equation 1.3 is calculated by multiplying each individual CPIi by the
fraction of occurrences of that instruction in a program (NIi / NI).
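Equation 1.3 can be sketched in a few lines of code; the instruction mix below (per-class CPI values and instruction counts) is purely hypothetical, chosen only to illustrate the weighted average.

```python
# Hedged sketch of Equation 1.3: CPI_S as a weighted average over an
# assumed instruction mix. The numbers are illustrative, not measured data.

def weighted_cpi(mix):
    """mix: list of (cpi_i, ni_i) pairs -> average CPI_S per Eq. 1.3."""
    total_instructions = sum(ni for _, ni in mix)     # NI
    total_cycles = sum(cpi * ni for cpi, ni in mix)   # total system clock cycles
    return total_cycles / total_instructions

# Hypothetical task: 50,000 ALU ops at 1 cycle each, 30,000 loads/stores
# at 3 cycles, and 20,000 branches at 2 cycles.
mix = [(1, 50_000), (3, 30_000), (2, 20_000)]
print(weighted_cpi(mix))  # 1.8 system clock cycles per instruction
```

Note that the result depends only on the *fractions* of each instruction class, which is why the last line of Equation 1.3 weights each CPIi by (NIi / NI).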

Thus the system performance of Equation 1.1 now becomes:

TS (total system execution time per task) = NI x CPIS x CS   (1.4)

where:
    NI = number of instructions executed for a task (the “instruction path length”)
    CPIS = average number of system clock cycles per instruction
    CS = system clock cycle time (or, simply, clock cycle)3

3. The system clock cycle equals the inverse of the frequency of the system clock applied as input to the microprocessor
chip; thus, from Eq. 1.4 it is concluded that it is erroneous to compare computer performance by using only the
processor’s megahertz rating or clock speed. For example, a Pentium processor running at 75 MHz easily
outperforms an Intel 486DX2 processor running at the higher frequency of 100 MHz.
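As a rough numeric sketch of Equation 1.4 (and of footnote 3's point that megahertz alone is misleading), consider two hypothetical processors; the CPIS values assumed below are illustrative, not measured figures for the actual chips.

```python
# Hedged sketch of Equation 1.4: T_S = NI * CPI_S * C_S, where
# C_S = 1 / (system clock frequency). All numbers are assumptions
# chosen only to illustrate the footnote's point.

def execution_time(ni, cpi_s, clock_mhz):
    """Total system execution time in seconds for a task (Eq. 1.4)."""
    return ni * cpi_s * (1.0 / (clock_mhz * 1e6))

ni = 10_000_000                                        # instructions in the task
t_slow_cpi = execution_time(ni, cpi_s=2.0, clock_mhz=100)  # 0.2 s
t_fast_cpi = execution_time(ni, cpi_s=1.0, clock_mhz=75)   # ~0.133 s
print(t_slow_cpi, t_fast_cpi)  # the 75-MHz design finishes first
```

With these assumed CPIS values, the 75-MHz processor completes the task sooner despite its lower clock rate, because its lower CPIS more than compensates.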


Overall system performance increases by reducing one or more of the terms in Equation
1.4. Designing a more balanced instruction set and using optimized compiler
technologies can reduce the term NI. The system clock cycle CS depends upon the
overall system architecture, balanced system implementation assumptions, processor-memory
bandwidth, and processor speed. The term CPIS includes both a number of
“processor clock cycles” (during which time the processor executes internal operations4
required by the instruction) and a number of “external memory references” (during
which references to main memory are performed5 as needed by the instruction). Since during
such accesses to main memory the internal pipeline of the processor usually “stalls”
waiting for the operand, these cycles are also referred to as “processor stall cycles”.
Thus,
Thus,

CPIS = number of processor clock cycles for internal operations +
       number of processor clock cycles for external memory references
     = number of processor clock cycles for internal operations +
       number of processor stall cycles (the clock cycles the processor stalls
       during accesses to external memory)                                   (1.5)

Because most processors nowadays include on-chip instruction and data caches – and,
therefore, external accesses to memory are performed only on “cache misses”6 –
Equation 1.5 can also be expressed as:

CPIS = the “processor CPI” + the “CPI due to cache misses” (the finite cache effect)
= CPIP + (number of memory accesses due to cache misses M) *
(“cache miss penalty” P in cycles)
= CPIP + (M * P) (1.6)

The second term (M*P) in the above Equation 1.6 (whose reduction will increase system
performance because it will lower CPIS and therefore also lower TS of Equation 1.4) is
heavily dependent upon system rather than processor characteristics (such as the duration
of a bus cycle, processor-to-memory bandwidth, delays of any interface components
between the processor and memory, the actual access time of main memory). All these
are described in detail in later Chapters.
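Equation 1.6 can be illustrated with a small sketch; the CPIP, miss rate, and miss penalty below are assumed values chosen only for illustration.

```python
# Hedged sketch of Equation 1.6: CPI_S = CPI_P + M * P, where M is the
# number of memory accesses per instruction that miss in the cache and
# P is the miss penalty in clock cycles. Numbers are assumptions.

def system_cpi(cpi_p, misses_per_instr, miss_penalty_cycles):
    """The finite cache effect (M * P) added to the processor CPI."""
    return cpi_p + misses_per_instr * miss_penalty_cycles

# A processor with an ideal CPI_P of 1.0, an assumed 0.02 cache misses
# per instruction, and an assumed 25-cycle miss penalty:
print(system_cpi(1.0, 0.02, 25))  # 1.5 -- the finite-cache effect adds 0.5
```

Halving either the miss rate M or the penalty P halves the (M * P) term, which is exactly why the text notes that reducing it lowers CPIS and therefore TS in Equation 1.4.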

4. Such as instruction decode, internal register-to-register transfers, etc.
5. Such as starting various “bus cycles” to external memory in order to either fetch an instruction or
operand, or transfer results to memory. As we will see later, a “bus cycle” usually requires a number of
“system clock cycles”.
6. Details on caches and cache misses are given in Chapter 5.


1.1.3 Processor Performance

Processor performance is measured by the total time the processor spends in executing
internal operations for a user task. If we now use the “processor clock” instead of the
“system” clock, the processor performance can be given by a formula analogous to that
of the system performance Equation 1.4 as follows:

TP (total processor or CPU time per task) = NI x CPIP x CP   (1.7)

TP, the total internal execution time, is called the processor time or CPU time; it
does not include the time the CPU waits for memory or I/O activities. CP is the internal
“processor clock cycle” (which may be different from the external system clock cycle
CS). The processor CPI, or CPIP, ignores all system issues and represents the average
number of processor clock cycles per instruction assuming no cache misses (i.e., infinite
caches). For example, a “single-issue” processor (a processor with only one internal
pipelined functional unit, to which it issues only one instruction at a time for execution)
will have an ideal CPIP = 1 (i.e., no cache misses and one instruction completed per
processor clock cycle); a “multiple-issue” processor of degree, say, 4 (i.e., four internal
pipelined functional units to which 4 instructions can be issued for execution in parallel)
will have an ideal CPIP = 0.25 (i.e., four instructions will be completed per processor clock
cycle7).
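The single-issue versus four-issue comparison above can be sketched numerically with Equation 1.7; the clock frequency and instruction count below are arbitrary assumptions for illustration.

```python
# Hedged sketch of Equation 1.7: T_P = NI * CPI_P * C_P, contrasting the
# ideal single-issue (CPI_P = 1) and 4-issue (CPI_P = 0.25) processors
# described in the text. Clock rate and instruction count are assumed.

def processor_time(ni, cpi_p, clock_mhz):
    """Total processor (CPU) time in seconds, C_P = 1 / frequency."""
    return ni * cpi_p * (1.0 / (clock_mhz * 1e6))

ni = 1_000_000
single_issue = processor_time(ni, cpi_p=1.0, clock_mhz=200)    # 0.005 s
four_issue   = processor_time(ni, cpi_p=0.25, clock_mhz=200)   # 0.00125 s
print(single_issue / four_issue)  # 4.0 -- ideal speedup equals the issue width
```

The ideal speedup equals the issue degree only because these CPIP values assume no cache misses and perfectly filled pipelines; the actual CPIP, as footnote 7 notes, must be obtained by simulation.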
Processor performance can be improved by reducing any of the three factors in
Equation 1.7. CP, the processor clock cycle time, is technology-driven and depends on
the VLSI design of the chip. This cycle time is chosen long enough to allow execution of
every basic operation (or microoperation) in only one processor clock cycle; other, more
complex operations will require multiple processor clock cycles for their execution. NI,
the number of instructions per task, is a function of the instruction set design and how
effective an optimized compiler is. Finally, CPIP, the average clock cycles per
instruction, is a direct function of the processor’s internal architecture (instruction
pipelining, instruction issue ability, etc.) and the efficiency of instruction scheduling.

In general, the speed of the processor itself (which determines the overall system
performance) can be improved by increasing one or more of the following characteristics:
1. Processor (or CPU) wordlength: By making the processor wordlength wider, more
bits can be processed internally in parallel.
2. Processor clock: By using newer, faster technologies, the frequency of the
processor’s internal clock can increase to speed up program execution by the
processor.
3. Level of integration: By packing more functional units on the processor chip (such
as floating-point units, caches, memory management units, etc.), their

7. The actual CPIP is modeled by simulating the structure of the internal pipeline(s) and measuring
instruction stream execution cycles. For “multiple-issue” (i.e., superscalar) processors, instead
of the CPI it is more convenient to use its reciprocal, the IPC (instructions per clock cycle).


interconnection distances and the need for off-chip signaling are decreased,
contributing to the increase in the processor operating speed.
4. Architectural advances: Finally internal architectural advances have been
incorporated in the processor, such as pipelined RISC architectures and
superpipelined or superscalar implementations (to be discussed later in the
Chapter), which effectively reduce the average number of processor clock cycles
it takes to execute an instruction (i.e., reduce the CPIP) and, therefore, increase
processor performance.

Another metric commonly used for “processor performance” is MIPS (million
instructions per second), which for a given task is approximated by the following formula:

Processor performance (in MIPS) = NI / (total processor execution time)
                                = 1 / (CPIP x CP)
                                = FP / CPIP   (1.8)

where FP = processor clock frequency rate in MHz.

This MIPS metric, however, is not a very accurate measure for comparing performance
among computers and should be used with caution [1].

--------------------------------------------------------------------->>
Example 1-1: Comparing MIPS of various processors
Using Equation 1.8, consider three different processors, each operating with a 100-MHz
internal clock.
We say that a processor which under best-case conditions requires 2 processor clock
cycles to execute an instruction (i.e., has a CPIP of 2) has a peak performance of 50 MIPS; if the
second processor can execute one instruction per processor clock cycle (i.e., CPIP = 1), it has a
peak performance of 100 MIPS; and if the third processor can execute 3 instructions per
processor clock cycle (i.e., CPIP = 1/3 ≈ 0.33), it has a peak performance of 300 MIPS.
<<---------------------------------------------------------------------
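Example 1-1 can be reproduced with a short sketch of Equation 1.8; the three CPIP values are those of the example's hypothetical 100-MHz processors.

```python
# Sketch of Equation 1.8: peak MIPS = F_P (in MHz) / CPI_P, applied to
# the three hypothetical 100-MHz processors of Example 1-1.

def peak_mips(clock_mhz, cpi_p):
    """Peak MIPS rating for a given clock frequency and CPI_P."""
    return clock_mhz / cpi_p

for cpi in (2.0, 1.0, 1/3):
    print(round(peak_mips(100, cpi)))  # prints 50, then 100, then 300
```

As the text cautions, these are peak (best-case) figures; real instruction mixes, stalls, and cache misses raise the effective CPIP and lower the delivered MIPS.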

1.1.4 Benchmarks

System performance, however, is truly measured only by executing the actual, specific
application(s) to be run by the system. Since quite often this is not practical or possible,
one uses benchmarks.
A benchmark is a software program (or a suite of programs) that measures the
performance of a computer system or of just parts of the computer. Since there are
many benchmarks, one should be careful to use those that reflect the way the computer is
going to be used and are representative of the type of applications to be run on the system.
It is also important to use up-to-date benchmarks that are able to evaluate
all the newer and more sophisticated features of today’s computer systems, such as pre-emptive
multitasking 32-bit operating systems, larger (graphics- and video-intensive)
applications, superscalar microprocessor architectures, multithreaded applications,
multiprocessor (parallel) systems, etc.


There are two types of benchmarks: application and synthetic. Application
benchmarks include real code from actual application programs (such as the Word word
processor, Paradox, etc.); they provide a better measurement of the system’s or component’s
capabilities, but are large and more difficult to use. Examples include SYSmark/NT (for
the system) and SPEC95 (for the processor itself). Synthetic benchmarks use code that
tries to simulate the application program functions, based on extensive application
profiling; they are not as effective as the application benchmarks but are shorter and
easier to use. Examples include the Norton SI 32 and CPUmark32 (which measure the
performance of the processor itself).
Finally, there are also two levels of benchmarks: component and system.
Component benchmarks test only specific parts of the system, such as the processor,
the memory, the video/graphics board, or the disk subsystem in isolation. CPU
benchmarks include SPEC95, Norton SI 32, and CPUmark32; parts of WinBench 95
measure only disk, graphics, and other subsystems. System benchmarks, on the other
hand, test the overall system performance (i.e., all its components working together as a
system). An example of a system-level benchmark is SYSmark/NT.

1.1.5 The RISC Approach

The RISC approach (also called “streamlined architecture” [10]) was introduced to
increase the performance of the basic CPU by advocating new, simpler, fixed-length,
register-oriented instruction sets. The basic principle that drives the RISC approach is
that processor performance can increase by keeping instructions simple.
RISC processors have lower CPIP than CISC processors because the RISC
approach uses simpler, fixed-length instruction formats which allow for faster hardwired
instruction decoding and greatly simplify the use of internal instruction pipelining; as a
result, the number of processor clock cycles needed to execute an instruction can (ideally)
reduce to 1 (i.e., CPIP = 1). CP, the processor clock cycle, is also shorter for a RISC than
for a CISC processor, because of the former’s architectural simplicity and because the
majority of RISC instructions are of the register-to-register type (that do not need to
access external main memory during their execution; all operands are in internal
registers). Thus, RISC approaches improve the program performance by reducing the
last two terms in Equation 1.7. The NI, however, is usually larger for RISCs, the ratio of
the number of instructions for a RISC versus those for a CISC processor being on the
average around 1.8 to 2.

Major common properties of all RISC architectures include [12,14,15]:


1. Simple load/store architecture: Most of the instructions are register-based,
supported by a large number of CPU registers. References to external memory
are minimized: accessing memory for operand/result transfer is explicitly done
only via LOAD and STORE instructions. While in CISC processors 30-40% of
the executed instructions implicitly or explicitly access data memory and 20% are
register-to-register operations, in RISC processors less than 20% may be LOADs
and STOREs and more than 50% are register-to-register operations [13].
2. Simple instructions: Instructions have simple fixed format, fixed length, and
simple and few addressing modes.
3. Simple hardwired control: Because of the RISC instructions’ short format and
few addressing modes, the decoding becomes simple and this permits
implementing the control mechanism with hardwired implementations which are
faster than the microprogrammed ones used in CISC processors.
4. Larger register set or register windows: One approach is for RISC processors to
have large sets of application-usable registers (larger than in the CISC
processors) for quick accessing of operands (which are stored in internal registers
rather than in main memory), for subroutine calls, and for saving of the processor
state during context switches.
5. Faster clocks: RISCs have faster processor clocks than CISCs.
6. Instruction pipelines: Although these are not unique RISC characteristics, they are
more often found in RISC than in CISC processors. Internal pipelined
implementations are used in RISC processors to achieve the single-clock-cycle-per-instruction
goal (i.e., CPIP = 1). For branch-type instructions that may slow the
internal pipeline (by stalling it while waiting for the branch condition evaluation),
RISC processors use the “delayed branching technique”: the branch instruction is
delayed until after one or more instructions immediately following the branch
instruction have been executed. The compiler has already examined the program
and organized it in such a way (for example, by rearranging code and inserting
useful instructions in the “branch delay slot”) to allow the processor pipeline to
continue executing instructions while the branch conditions are being evaluated.
(More on handling branch conditions is found later in Section 1.6.3).
7. Separate instruction and data buses (Harvard architecture): Again, this property is
not unique to the RISCs, but it is found more often in RISCs than in CISCs. RISC
processors usually have both internal and external Harvard architecture. Because
RISC instructions are much simpler than those of CISCs, instructions need to be
fetched much faster. To allow for that, separate buses (and caches or memories)
are used: one bus for instructions and one for data. This separation also lowers the
CPIP (i.e., improves the processor performance) because it allows concurrence
between the execution of the current LOAD or STORE instruction (using the data
cache or memory) and the prefetch of the instruction (from the instruction cache or
memory).

Arguments can be made for and against RISC processors:


RISC Advantages. Reducing the instruction complexity increases the speed,
because complex instructions slow down all instructions. The speed is also increased by
reducing memory accesses with the use of only register-to-register instructions and by
using hardwired techniques for the implementation of the processor’s control section.
The instruction simplification reduces circuit complexity (which allows simpler design
and faster design and development cycles) and chip area requirements (which allows the
implementation of other major functions on-chip, such as cache memories, memory
management units, floating-point units, etc.).


RISC Drawbacks. RISC programs tend to be lengthier (the NI in Equation 1.4 is
larger). The management of internal processor pipelines and caches and the need to
implement delayed branching are concerns that must be taken care of and reflected
properly in the design of the compiler (thus the RISC processor’s performance relies
heavily on compiler sophistication).

In general, however, CISC and RISC are not really two actual, different, specific
architectures; instead, they represent two different approaches or paradigms to processor
(CPU) design that can be utilized by any processor architecture. A good microprocessor
design would have to trade off and select which attributes of RISC and CISC to combine.
As a matter of fact, the two approaches and their execution cycle times are now
beginning to converge toward the design of hybrid CPUs. CISC microprocessor designs
have adopted features which were characteristic of RISCs, such as internal parallelism,
internal Harvard architecture, and pipelined execution units, which allow frequently used
instructions to execute in a single processor clock cycle; on the other hand, RISC
microprocessors have become more complex, with on-chip MMUs (Memory
Management Units) and FPUs (Floating-Point Units), have adopted a higher degree of
internal pipelining and parallelism, and have implemented a number of additional not-so-
simple instructions.
When it comes, however, to the methodology of designing a computer system,
there are not really too many significant differences when using a RISC or a CISC
microprocessor. Figure 1.2 shows the interconnection with memory of a conventional
CISC processor with a single external bus and that of a RISC processor that has separate
external buses for the instructions and data. (In both cases there is another set of bus
lines that make up the “control bus”. The control bus is used by the processor to send
control signals to the other modules of the system and receive from them feedback
signals in order to properly orchestrate the system’s operation.) The rules for interfacing
various units to the processor are almost the same except that in the case of processors
with external Harvard architecture, data and instruction memories are separate and their
speed requirements may be different from those of a conventional system configuration.
As we will see in the next Section, most of the remaining control signals are similar for
RISCs and CISCs. It will also become apparent at the end of the Chapter that the more
interesting question is not CISC versus RISC but which of the following processor
implementations is preferable: “superpipelined”, “superscalar”, “very long instruction
word”, or some combination of the above.
As far as the design of a computer system goes, in this textbook we will treat both
RISC and CISC processors in a unified fashion.

[Figure 1.2a (diagram): the processor (CPU) connects to the memory unit(s) over a
physical address bus and a data bus; together these form the processor external physical
bus (or the “memory bus”).]
Fig. 1.2a: Conventional “processor bus”
N. Alexandridis Computer Systems Architecture: Microprocessor-Based Designs

[Figure 1.2b (diagram): the processor (CPU) connects to the data memory unit(s) over an
address bus for data addresses and a data bus for data (together, the processor external
physical bus for data, or the “data bus”), and to the instruction memory unit(s) over a
separate address bus for instruction addresses and data bus for instructions (the processor
external physical bus for instructions, or the “instruction bus”).]
Fig. 1.2b: Harvard architecture external processor buses

Figure 1-2: Conventional processors with a single and with separate external buses

1.2 THE PROCESSOR’S EXTERNAL VIEW


Before we discuss the individual input/output signals of the processor, let’s first take a
look at different example configurations of processors with external caches and/or
MMUs. A number of them are shown in Figure 1.3. The Figure includes processors
with and without external Harvard architecture.

Figure 1.3a shows an example of a processor with an external “unified” or
“integrated” cache8; if the processor has an on-chip cache (referred to as the “level-1
cache”) then this external cache will be treated as the “level-2 cache”. As we will see
later, the external cache may be connected to the same bus to which main memory and
input/output are connected, or the processor may have a “front-side bus” for interfacing
main memory and I/O and a separate “back-side bus” for directly connecting a “level-2
cache” (e.g., Pentium Pro, Alpha, UltraSparc).

Figure 1.3b shows a RISC processor with one external bus and a separate
ICACHE (instruction cache) and DCACHE (data cache). The processor contains on-
chip all the hardware logic needed to control these external caches (the cache controller),
and these caches are then mainly static-RAM chips connected directly to processor I/O
pins (to a cache bus.) Main memory is connected using a separate local bus; local bus
control logic (a bus “bridge”) converts the local bus to the bus connected directly to the
processor’s pins.

Finally, Figure 1.3c shows a RISC processor with external Harvard architecture
that defines two external buses: one for data (the data bus, which includes lines for
transferring the address of the data and separate lines to transfer the data itself) and one
for instructions (the instruction bus, which includes lines for transferring the address of

8. A “unified” or “integrated” cache is a cache that stores both data/operands and instructions.


the instruction and separate lines to transfer the instruction itself); these buses require
their own separate sets of synchronization signals. The two buses of the Harvard
architecture are used solely to support the LOAD and STORE instructions. If we assume
that a LOAD/STORE takes two system clock cycles to execute (one to calculate the
memory address and one to do the actual transfer to/from memory), during the second
clock cycle the prefetch unit of a non-Harvard-architecture processor would not be able
to overlap the fetching of the next instruction with the data transfer of the
LOAD/STORE, because the single data bus lines are already busy (doing the transfer of
data to/from memory.)

By providing separate external buses, a Harvard-architecture processor can now
use the data bus to execute the second cycle of the LOAD/STORE instruction and use the
instruction bus to simultaneously prefetch the next instruction. However, these two buses
do have different memory bandwidth requirements: the instruction-bus lines used to fetch
instructions require much higher memory bandwidth (because RISC-type processors need
to fetch a larger number of their simpler instructions in order to implement the same
functions that a single complex CISC instruction performs); on the other hand, the data-bus
lines used to transfer operands or results of operations require a lower bandwidth for a
RISC processor than for a CISC processor (because most of these simple RISC
instructions perform internal register-based operations that do not require frequent
accesses to external memory). Since we assume the processor has no on-chip MMU, it
outputs “logical or virtual addresses” (addresses that the program or process sees) which
are converted by the external MMUs into “physical addresses” (addresses sent to memory
without further translation). Caches are placed between the MMUs and main memory.
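The translation step just described can be sketched in C. The following is a toy, hypothetical single-level page table (the names, the 4 KiB page size, and the table layout are our own illustration, not any particular MMU's): the low bits of the logical address pass through unchanged, while the upper bits are exchanged for a physical frame number.

```c
#include <stdint.h>

/* A minimal sketch of the logical-to-physical translation an external
   MMU performs. Assume 4 KiB pages, so the low 12 bits of the logical
   address pass through and the upper bits select a (hypothetical) page
   table entry holding a physical frame number. A real MMU also checks
   protection bits and signals a fault on a missing translation. */
#define PAGE_SHIFT 12
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)

static const uint32_t demo_page_table[4] = { 7u, 3u, 0u, 1u };

uint32_t mmu_translate(const uint32_t *page_table, uint32_t logical)
{
    uint32_t vpn   = logical >> PAGE_SHIFT;   /* virtual page number   */
    uint32_t frame = page_table[vpn];         /* physical frame number */
    return (frame << PAGE_SHIFT) | (logical & PAGE_MASK);
}
```

With `demo_page_table` above, logical address 0x1234 (page 1, offset 0x234) maps to physical address 0x3234.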

1.2.1 Processor Bus: Lines and Signals

Figure 1.4 shows the external view of CISC and RISC microprocessor chips and their
I/O signals. Their detailed explanation and use in examples will be given in Chapter 2.
In this Section we present only an outline of the address and data buses and some of the
most common control signals in Figure 1.4. The input/output pins of the microprocessor
altogether define what is called the processor bus or, sometimes, the component-level
bus.


[Figure 1.3a (diagram): a processor (CPU) with an on-chip MMU drives a physical
address bus and a physical data bus to an external unified cache, which connects over the
memory bus to main memory.]
[Figure 1.3b (diagram): a processor with an on-chip L1 cache, MMU, and L2 cache
controller; a dedicated cache bus connects the external L2 cache, while the external
“processor bus” reaches main memory through local bus interface logic and a “local
bus”.]
[Figure 1.3c (diagram): a processor with external MMUs and caches on separate buses; a
logical instruction bus runs through the instruction MMU (IMMU) to a physical
instruction bus, the instruction cache (ICACHE), and instruction memory, while a logical
data bus runs through the data MMU (DMMU) to a physical data bus, the data cache
(DCACHE), and data memory.]

Figure 1.3: Example processors with external MMUs and caches

a) Address and data buses.


A processor may have either a nonmultiplexed address bus (A bus) and a second
separate data bus (D bus), or address and data may be multiplexed on the same lines of a
single bus called a multiplexed address/data bus (AD bus).
A bus is called a (time-) multiplexed bus when it transfers different types of
information at different well-defined times during the bus cycle. In a multiplexed bus,
each transmitter and receiver has one designated time slot. Multiplexing has the
advantage that fewer pins are required on the chip9. A multiplexed bus, however,
imposes additional external interface hardware requirements. Most of the recent
processors now have separate, nonmultiplexed address and data buses.

9. That’s why earlier microprocessors, which could not provide enough pins, were forced to multiplex their external bus lines.


[Figure 1.4a (diagram): CISC processor external view. Outputs and inputs include the
address bus and data bus; the input or “system” clock; synchronization signals, among
them “READY” or “DTACK” (Data Transfer Acknowledge); the “byte-enables”;
read/write control signals; the operand size indicator (data identifier) and port size
indicator for dynamic bus sizing; the bus cycle identifier (status or function code
signals); reset; interrupt signals (including “INTA”); bus arbitration signals; and other
signals (bus error, parity, cache control, etc.).]
[Figure 1.4b (diagram): RISC processor external view, with the same signal groups but
with one address/data bus pair forming the data bus and a second, separate address/data
bus pair forming the instruction bus.]

Figure 1.4: External views of CISC and RISC processors


Different processors, however, have different numbers of external address and
data bus lines. For example, there exist n-bit processors that use m-bit addresses, but
when they communicate with main memory they may not place all m bits of the address
on the external address bus. Similarly, there exist n-bit processors whose external data
bus may not necessarily be n bits wide; as a matter of fact, the tendency now is for n-bit
processors that have an external data bus with 2n or even 4n lines. The example below
identifies some specific cases.

--------------------------------------------------------------------
Example 1- 2: Variations on external address and data buses
One distinction among processor address buses has to do with how many bits of the
actual physical address the processor places on its external address bus. Some processors issue
on the address bus a byte-address along with signals to indicate the width of the operand to be
transferred, while others issue the most significant part of the byte-address (representing for
example a 32-bit doubleword-address or a 64-bit quadword-address) along with “byte-enable10”
control signals that specify the byte-section(s) of memory to be activated (enabled) during the
current data transfer.

1) 16-bit data buses


Figure 1.5a shows two alternative solutions for 16-bit processors with 16-bit data buses.
The example at the left (Intel 286) shows the processor issuing a “byte-address” (i.e., an address
that points to a byte-location in memory) along with a single “byte-enable” control signal (the
BHE#)11. The 16-bit memory subsystem uses this BHE# (along with the issued least significant
address bit A0) to determine whether the transfer is for a byte or a 16-bit word and which byte-
section(s) of memory will participate in the current transfer: A0,BHE# = 00 means a 16-bit
word transfer, =01 means a byte transfer with an even byte location, and =10 means a byte
transfer with an odd byte location.
On the other hand, processors like the one at the right (Motorola 68000) issue a “word-
address” (i.e., an address whose least significant bit A0 is zero, and thus need not be issued) that
points to a 16-bit word-location12 in memory, along with 2 “byte-enables” (the upper data strobe
UDS# and lower data strobe LDS#) to distinguish between byte and word transfers as follows:
UDS#,LDS# = 00 means a 16-bit word transfer, =01 means a byte transfer with an even byte
location, and =10 means a byte transfer with an odd byte location.

10. As we will see later in more detail, a “byte-enable” signal selects one “byte-section” of a memory: for example, a “16-bit memory” is composed of two “byte-sections” that require 2 byte-enables; a “32-bit memory” of four “byte-sections” that require four byte-enables; etc.
11. The pound symbol (#) after a signal designates negative logic and corresponds to the overbar used over the signal. Quite often, usually for industry-standard “system buses,” the star (*) or the slash (/) symbol is used instead of the pound symbol (#). In this textbook we will use mainly the pound symbol (#) and the overbar, interchangeably.
12. The general term “word” does not have a universally accepted definition; some processors call the 16-bit quantity a word, while others call the 32-bit quantity a word (and the 16-bit quantity a half-word). Throughout this textbook, we will use the term word to mean 16 bits and the term doubleword to mean 32 bits.


[Figure 1.5a (diagram): processors with a 16-bit external data bus. Left, the Intel 286: a
24-line address bus A23-A0 carrying a byte-address, one byte-enable BHE# (Byte High
Enable), and a 16-line data bus D15-D0; A0,BHE# = 00 means a 16-bit transfer, 01 an
even byte transfer, 10 an odd byte transfer. Right, the Motorola 68000: a 23-line address
bus A23-A1 carrying a word-address (A0 is not issued), two byte-enables UDS# and
LDS#, and a 16-line data bus D15-D0; UDS#,LDS# = 00 means a 16-bit transfer, 01 an
even byte transfer, 10 an odd byte transfer.]

Figure 1.5a: Processors with 16-bit external data bus

[Figure 1.5b (diagram): processors with a 32-bit external data bus. The Intel 486: a
30-line address bus A31-A2 carrying a doubleword-address, four byte-enables
BE0#-BE3# issued by the processor (A1 and A0 are not issued), a 32-line data bus
D31-D0, and 4 data parity lines. The Motorola 68030: a byte-address plus the operand
size indicator SIZ1,SIZ0; external logic combines A1, A0, and the size indicator to
generate the four byte-enables. The Motorola 88000: a data P bus (doubleword data
address AD31-AD2, 32-line data bus D31-D0, and four data byte-enables DE0-DE3)
and a separate code P bus (doubleword code address CA31-CA2 and a 32-line code bus
C31-C0); byte-enables exist only for the data P bus.]

Figure 1.5b: Processors with 32-bit external data buses



[Figure 1.5c (diagram): a processor with a 64-bit external data bus. The 32-bit Intel
Pentium: a 29-line address bus A31-A3 carrying a quadword-address, eight byte-enables
BE0#-BE7#, a 64-line data bus D63-D0, and 8 data parity lines DP7-DP0.]

Figure 1.5c: Processors with 64-bit external data bus

[Figure 1.5d (diagram), first part. The 32-bit Intel Pentium Pro: a dedicated cache bus
(cache address lines and a 64-line cache data path) to the level-2 cache, and a system bus
with a 36-line address bus, eight byte-enables BE0#-BE7#, a 64-line data bus D63-D0,
and 8 data parity lines. The 64-bit MIPS R4000 and R10000: a dedicated cache bus
(cache address lines and a 128-line cache data path) to the level-2 cache, and a system
bus with a 64-line multiplexed AD bus, a 9-line SysCmd field carrying the operand size,
and 8 data parity lines; the R4000 is superpipelined with 36-bit addresses, the R10000
superscalar with 40-bit addresses.]

Figure 1.5d: High-performance superscalar processors (continues)



[Figure 1.5d (diagram), continued. The 64-bit PowerPC 620: a 40-line address bus
A39-A0 carrying a byte-address and a 128-line data bus D127-D0. The 64-bit Alpha and
UltraSparc: a dedicated cache bus (cache address lines and a 128-line cache data path) to
the level-2 cache, and a system bus with an address bus (Alpha: 64-bit addresses;
UltraSparc: 41-bit addresses), a 128-line data bus, and 16 data parity lines; both also
provide support for other buses (the PCI, Peripheral Component Interconnect, bus by
Alpha; Sun’s SBus by UltraSparc). The 64-bit HP PA-8000: a 128-line path to the
level-1 ICACHE and DCACHE, and a 64-line multiplexed bus (64-bit data; 64-bit
addresses).]

Figure 1.5d: High-performance superscalar processors

Figure 1.5: External address and data buses of example processors


2) 32-bit data buses


Figure 1.5b shows alternative solutions for 32-bit processors with 32-bit data buses.
The example at the left (Intel 386/486) shows the processor issuing a “doubleword-address” (i.e.,
an address whose least significant 2 bits A1 and A0 are both zero, and thus need not be issued)
that points to a 32-bit doubleword location in memory, along with 4 “byte-enable” control signals
(BE0#-BE3#) for the 4 byte-sections of the 32-bit memory. The example at the right (Motorola
68030) issues the full byte-address along with an operand size indicator (SIZ1,SIZ0); external
logic combines the two least significant address bits A1,A0 with this size indicator to generate
the 4 “byte-enables”. The example at the bottom (Motorola 88000) shows a processor with one
“data P bus” (with address lines for the operand addresses and data lines for the operands
themselves) and a separate “instruction P bus” (with address lines for the instruction addresses
and data lines for the instructions themselves). Since the data transferred may be of any size,
the “data P bus” requires the four data “byte-enables” DE0-DE3. The approach followed
here is similar to the first case above: the processor issues a doubleword-address along with 4
“byte-enables”. Since, however, instructions are always fetched as 32-bit doublewords from
doubleword-addresses, the instruction P bus needs no “byte-enables” to be sent to memory.

3) 64-bit data buses


Figure 1.5c depicts the case of a processor (here the 32-bit Intel Pentium and i860) that
has a 64-bit external data bus. In such a case, the processor issues an “8-byte-address” (i.e., an
address whose least significant 3 bits A2, A1 and A0 are all zero, and thus need not be issued) that
points to a 64-bit location in memory, along with 8 “byte-enable” control signals (BE0#-BE7#)
for the 8 byte-sections of the 64-bit memory. Any such BE# activated low enables the respective
byte-section(s) to participate in the current data transfer: BE0#=0 for Section0, BE1#=0 for
Section1, etc. For transfers of data wider than a byte, more than one BE# will be activated low.

4) High Performance processors


The trend is toward ever wider address and data buses. Figure 1.5d shows some
examples of high-performance processors with addresses as wide as 64 bits and even wider
external data buses of 128 bits (16 bytes) (PowerPC 620, Alpha, UltraSparc, etc.). Most
processors now also provide extra pins to carry some additional information, such as one parity
bit for each byte-lane of the data bus.
These wider data buses permit the design and interconnection of 128-bit (16-section)
main memories; these 16 byte-sections would now require 16 byte-enables. The processor can
issue the 16 byte-enables (by internally interpreting the 4 least significant bits of the address, A3-
A0, together with the size of the operand to be transferred) along with only the remaining most
significant address bits (28 bits if the address is a 32-bit address).
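The byte-enable generation described throughout this Example follows one general pattern, sketched below for an arbitrary port width; the function name and the active-high convention are our own (real processors invert the mask to drive active-low BEi# pins):

```c
#include <stdint.h>

/* Generic sketch: for a memory port 'width' bytes wide (2, 4, 8, or
   16), the low address bits pick the starting byte lane and the operand
   size says how many adjacent lanes to enable. Returns an active-high
   mask: bit i set means byte-section i participates in the transfer.
   Assumes the transfer does not cross the port boundary. */
uint32_t byte_enable_mask(uint64_t addr, unsigned size, unsigned width)
{
    unsigned lane = (unsigned)(addr & (uint64_t)(width - 1u));
    return ((1u << size) - 1u) << lane;
}
```

For example, a 2-byte transfer at address 0x1002 on a 32-bit (4-section) port enables sections 2 and 3 (mask 1100), matching the BE2#/BE3# case in the text.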
In the MIPS R4000-series case, the processor issues, instead of “byte-enables,” an
operand size indicator (referred to by MIPS as a “data identifier” or “data size indicator”) to
indicate operand sizes from 1 byte to 8 bytes.
Not all high-performance microprocessors issue such “byte-enable” control signals. For
example, a different approach is followed by the Alpha chip in Figure 1.5d, which, although it
also has a 128-bit (16-byte) external data bus, places on its external address bus only the most
significant 59 bits of its 64-bit address. This is because proper buffering in the design allows
accesses to memory to be performed on a 32-byte block basis; the second half of the 32-byte
block is read and driven in the next half bus cycle (which, as we will see in the next Chapter, is
the next “pipeline” cycle).
A number of processors have an on-chip cache controller to control a secondary “level-
2” external cache; for this level-2 cache interconnect they allocate separate bus lines
(PentiumPro, MIPS, and UltraSparc in Figure 1.5d). Alpha has placed the level-2 cache on-chip
along with a controller for an external level-3 cache. (The Figure specifies the width only for the
lines used to transfer data between the processor and level-2 cache.)
Finally, some processors (like the Alpha and UltraSparc in Figure 1.5d) provide
additional bus lines to permit direct interconnect with peripheral devices; the Alpha provides
support for the PCI (Peripheral Component Interconnect) bus and the UltraSparc for Sun’s SBus
bus.


High-performance superpipelined and superscalar processors of Figure 1.5d are
discussed in more detail later in Section 1.7.
--------------------------------------------------------------------

b) Bus cycle identifiers

At the beginning of a bus cycle, the processor issues a set of signals called the bus
cycle identifiers (some microprocessors call them “status” or “function code” signals) to
inform the rest of the computer modules of the type of bus cycle the processor has
initiated. Figure 1.6a shows the most common types of bus cycles, and Figure 1.6b, the
“bus cycle identifying” signals of some processors. Notice that the Intel Pentium
processor identifies the bus cycle by a combination of several control signals (M/IO#,
D/C#, W/R#, CACHE#, and KEN#), while the Motorola processors use three output
“function code” signals (FC2-FC0).
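As an illustration, external logic could classify a Pentium-style bus cycle from the three main identifier signals roughly as follows. This is our own sketch based on Figure 1.6; the CACHE# and KEN# refinements that distinguish burst from single transfers are omitted, and the I/O-write encoding is an assumption, since the table lists only the I/O-read case.

```c
/* Cycle classes (a simplification of Figure 1.6). */
enum cycle { INTR_ACK, SPECIAL, IO_READ, IO_WRITE,
             CODE_READ, MEM_READ, MEM_WRITE, RESERVED };

/* Arguments are the sampled pin levels: 1 = high, 0 = low. */
enum cycle cycle_type(int m_io, int d_c, int w_r)
{
    if (m_io == 0 && d_c == 0) return w_r ? SPECIAL  : INTR_ACK;
    if (m_io == 0 && d_c == 1) return w_r ? IO_WRITE : IO_READ;
    if (m_io == 1 && d_c == 0) return w_r ? RESERVED : CODE_READ;
    return w_r ? MEM_WRITE : MEM_READ;   /* m_io == 1, d_c == 1 */
}
```

In a real system this decode is part of the chipset or “bus controller” logic that routes the cycle to memory, I/O, or the interrupt controller.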

c) Synchronization signals.
Synchronization signals are needed to synchronize the operation of the processor
with the other modules of the computer system. Output synchronization signals (issued
by the processor in the form of an “address strobe” or “data strobe”) indicate when
address or data are valid on their respective bus lines; input synchronization signals
(received by the processor in the form of a “ready” or “data transfer acknowledge”)
notify it that the addressed memory or I/O port has responded. Processors that operate
asynchronously with their slave devices13 use a pair of “handshake signals” to accomplish
their synchronization with these devices; for example, a processor will send to memory
the “address strobe” signal to indicate that it has placed an address on the address bus,
and the memory will send back to the processor a feedback signal “data transfer
acknowledge” to tell the processor that memory has finished its requested transaction
(either placed data on the data bus or received the data from the data bus). Some
microprocessors use only one input pin for the data transfer acknowledge signal (like the
DTACK# pin of the 16-bit Motorola 68000), its absence meaning that the slave module
has not yet finished its operation. Other microprocessors use two such input pins (like
the DSACK0# and DSACK1# pins of the 32-bit Motorola 680x0 processors); this pair of
signals, in addition to synchronizing the processor to the slave module, also informs
the processor of the width of the responding slave module (e.g., whether it is a 32-bit
memory port or a 16-bit I/O port.)
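The effect of this handshake on bus cycle length can be illustrated with a tick-by-tick simulation. The function below is a behavioural sketch of our own, not real hardware; the slave's response time is a parameter, and a slower slave simply stretches the bus cycle.

```c
#include <stdbool.h>

/* One loop iteration = one clock tick. The "slave" here is a stand-in
   that needs 'slave_delay' ticks after seeing the address strobe before
   it asserts acknowledge. Returns the number of ticks the transfer
   takes. A real master would also implement a bus-error timeout so a
   dead slave cannot hang it forever. */
int simulate_handshake(int slave_delay)
{
    bool as_asserted = true;      /* master: "address is valid"      */
    bool dtack = false;           /* slave:  "transfer complete"     */
    int ticks = 0, seen = 0;

    while (!dtack) {
        ticks++;
        if (as_asserted)
            seen++;
        if (seen >= slave_delay)  /* slave finally responds          */
            dtack = true;
    }
    return ticks;                 /* master now negates the strobe   */
}
```

A slave that answers in one tick costs one tick; one that needs five ticks makes the same transfer take five.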

13. That is, each may have its own separate clock source, and each may operate at a different clock speed.


d) Operand-size and port-size indicators

A number of processors have the capability to dynamically (at run time) adjust the width
of their external data bus, according to the size (width) of the memory or I/O port they
communicate with. A processor issues an operand size indicator14 (like the Motorola
SIZ0 and SIZ1 pair of signals) to indicate the size of the operand to be transferred: for
example, 00 means 4 bytes (one doubleword), 10 means 2 bytes, 01 means 1 byte, and 11
means 3 bytes. When processors do not permit 3-byte transfers in one bus cycle, the
value 11 can be used to mean that the processor requests the memory slave to send a
“long” cache line (of 16 or 32 bytes, depending on the processor).
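Decoding this size indicator is a one-line lookup; the function below is a small sketch using the encoding just given (the function name is ours):

```c
/* Decode a Motorola-style operand size indicator (SIZ1,SIZ0) into the
   number of bytes the processor wants to transfer. On processors that
   forbid 3-byte transfers, the 11 encoding is instead reused to request
   a cache-line burst (not modelled here). */
int operand_size_bytes(int siz1, int siz0)
{
    static const int bytes[4] = { 4, 1, 2, 3 };  /* SIZ1,SIZ0 = 00,01,10,11 */
    return bytes[((siz1 & 1) << 1) | (siz0 & 1)];
}
```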
The port size indicator is a feedback the processor receives indicating the size of
the port currently communicating with it. As we said above, the Motorola products
interpret the binary values on their input pins DSACK0# and DSACK1#; other
processors have one input pin for each port width; e.g., the Intel 486 (which also has a
built-in “dynamic bus sizing” capability) uses two input pins, BS16# and BS8#, for 16-bit
and 8-bit ports, respectively.

e) Interrupts and bus arbitration signals


Interrupt lines are used by external devices to interrupt the processor and force it
to jump to the appropriate interrupt-handling routine (whose execution will service the
interrupting device.) Bus arbitration lines allow a number of devices (for example,
several processors, DMA controllers, etc.) to be connected to the same bus so that, in an
orderly manner, only one device at a time is granted control of the shared bus to start a
bus cycle. Interrupts and bus arbitration are discussed in more detail in later
Chapters.

f) Other control signals


The input clock signal CLK represents the timing source that defines the input
clock cycles and synchronizes all activities of the processor’s external bus. It is also
called the “system clock” signal because it is supplied to the other modules of the
computer system as well. Some processors require a single-phase clock (the most
common situation), whereas others require a two-phase clock (phase 1 and phase 2).
All processors issue some kind of a read or a write control signal to identify the
type of access to memory and I/O. In some, these signals are formed by combining
externally two other output control signals; for example, we notice from Figure 1.6b that
in a Pentium-based system, in order to supply memory with the “memory data read”
signal, the following output processor signals must be combined with external logic:
signal M/IO# = H (this high value means access to memory while a low value on this
output pin would mean access to I/O), the W/R# = L (low means a read cycle while a
high would mean a write cycle), and D/C# = H (a high means access to data while low

14. Some processors (e.g., MIPS) call it the “data identifier”.


Type of Bus Cycle

• Memory data read (non-cacheable)
• Memory code read (non-cacheable)
• Memory data write (non-cacheable)
• I/O read (non-cacheable)
• I/O write (non-cacheable)
• Memory data read
• Memory code read
• Memory data read (burst line fill)
• Memory code read (burst line fill)
• Memory data write (cache line writeback)
• Interrupt acknowledge

Types of microprocessor external bus cycle

Intel

 M/IO#  D/C#  W/R#  CACHE#  KEN#  Cycle description                              # of transfers
   0      0     0      1      x   Interrupt acknowledge (2 locked cycles)        1 each cycle
   0      0     1      1      x   Special cycle                                  1
   0      1     0      1      x   I/O read, 32 bits or less, non-cacheable       1
   1      0     0      1      x   Code read, 64 bits, non-cacheable              1
   1      0     0      x      1   Code read, 64 bits, non-cacheable              1
   1      0     0      0      0   Code read, 256-bit burst line fill             4
   1      0     1      x      x   Intel reserved (will not be driven by the
                                  Pentium processor)                             n/a
   1      1     0      1      x   Memory read, 64 bits or less, non-cacheable    1
   1      1     0      x      1   Memory read, 64 bits or less, non-cacheable    1
   1      1     0      0      0   Memory read, 256-bit burst line fill           4
   1      1     1      1      x   Memory write, 64 bits or less, non-cacheable   1
   1      1     1      0      x   256-bit burst writeback                        4

Motorola

 FC2  FC1  FC0   Address space
  0    0    0    (Undefined, reserved)*
  0    0    1    User data space
  0    1    0    User program space
  0    1    1    (Undefined, reserved)*
  1    0    0    (Undefined, reserved)*
  1    0    1    Supervisor data space
  1    1    0    Supervisor program space
  1    1    1    CPU space

 CPU space type (encoded on A19-A16 during CPU-space cycles):
  1111  Interrupt acknowledge
  0010  Coprocessor communication
  0001  Access level control (CALLM, RETM)
  0000  Breakpoint acknowledge

 * Address space 3 is reserved for user definition, while 0 and 4 are reserved for future use by Motorola.

Figure 1.6: Bus cycles, and Intel and Motorola “bus cycle identifiers”


would mean access to code.) For small systems, the read- and write-type control signals
that the processor itself issues are sufficient to be directly connected to and drive the local
memory and I/O ports. For larger systems, external circuits in the form of “bus
controllers” may be required to amplify the processor’s output signals and convert them
to more powerful system-wide read and write signals.
All processors also have a reset input pin, which is the highest-priority interrupt
that resets the processor to a known internal state. The reset input has different effects on
the various processors: in almost all cases, when a reset occurs, the processor address and
data buses go to a high impedance state (i.e., the processor electrically disconnects itself
from the bus lines), all output control signals go to the inactive state, the interrupt system
is disabled so that it accepts no further external interrupts, and the current bus cycle ends.
those processors that have both “user” and “supervisor” modes of operation, the
supervisor mode is entered and the appropriate value (called the “reset vector”) is loaded
into the CPU to update the program counter. (More details on user/supervisor modes and
interrupts are given in Chapter 7).
A ready input signal is used by processors that operate in synchronism15 with their
slave devices; it accommodates memory and I/O devices that cannot transfer data at the
processor’s full speed. When the processor samples this READY signal
and finds it asserted, it knows that the action requested has been completed and that the
addressed memory or I/O device has placed data on, or accepted data from, the data bus.
This allows the processor to end the current bus cycle and advance to the next bus
transfer. If it samples the READY signal not asserted, the processor then enters a wait
state. While in a wait state, the processor still has control of the local bus.

1.2.2 Processor Interface Components


Not all processor chips provide the separate nonmultiplexed address and data buses
required by the slave devices in the system. Furthermore, as we said earlier, the required
control and timing signals are not always supplied to the slave devices directly from the
processor chip itself, because of distance and driving capability considerations.
Similarly, not all control signals generated at other points in the system and sent back to
the processor are applied directly to the processor’s input pins; in some cases they are
first sent instead to supporting components external to the processor chip, and then
applied to the processor’s input pins.
In general, an interface is a shared boundary between parts of a computer system,
through which information is conveyed. The interface system consists of the device-
dependent elements (which include all driving and receiving circuits, connectors, and
timing and control protocols) of an interface necessary to effect unambiguous data
transfers between devices. Information is communicated between devices via the bus,
and each device conforms to the interface system definition.
As Figure 1.7a suggests, interface logic must be included on the bus master
module, the memory slave module, and all other slave modules of the system. The
15. That is, the processor and the slave devices are driven by the same clock source.


“memory interface logic” will be discussed later in Chapter 3. The “processor interface
logic” discussed here is circuitry that may be required for several reasons: to demultiplex
and/or buffer the CPU’s local address and data lines, to interface the processor with other
modules of the computer system by providing them with the appropriate control signals,
or to receive from them feedback signals and apply them in turn to the processor’s input
pins. The complexity of this interface circuitry and the total number of interface
components depend upon the specific processor and the size and complexity of the final
computer system.
Figure 1.7b gives the most common interface components of a processor; not all
of them are needed in each design. We discuss below the clock generator, the address
latches and address decoder, the data transceivers, the bus controller, and the byte-enable
circuitry. The remaining interface components are discussed in later Chapters.

Bus drivers (or bus buffers) are required to place information on the bus. A
semiconductor device driving the bus may have either open collector or tri-state (3ST)
outputs. Devices with open-collector outputs have only two output states: logic 0 (zero),
i.e., the gate pulls the output line to logic 0 (zero) using an active circuit element (the
gate’s internal output transistor), and logic 1 (one), i.e. the output line is pulled back to
logic 1 (one) by a passive circuit element (an external pull-up resistor)16. The opposite
holds for the tri-state devices used in bus-based computer architectures. They prevent
excessive loading or driving of the bus lines, and allow many devices to be connected to
the bus. Devices connected to the bus must have tri-state outputs: the first two states are
the logic 0 and 1 (0.8 and 3.5 volts for TTL); the third state is a high-impedance state or
open circuit17.
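The tri-state discipline can be modelled in a few lines of C. This toy model is our own, not from the text; it shows both why a floating line needs a defined level and why the interface logic must enable at most one driver at a time.

```c
#include <assert.h>

/* Toy model of one tri-state bus line: each driver has an output-enable
   and a value; a disabled driver "floats" (high-Z). A line nobody
   drives settles to 'pullup'. The assert models the fact that two
   simultaneously enabled drivers would be bus contention, which
   well-designed control logic must prevent. */
typedef struct { int enabled; int value; } tristate_driver;

int bus_line_level(const tristate_driver *d, int n, int pullup)
{
    int level = pullup, active = 0;
    for (int i = 0; i < n; i++) {
        if (d[i].enabled) {
            active++;
            level = d[i].value;   /* the sole enabled driver wins */
        }
    }
    assert(active <= 1);          /* only one device may drive the bus */
    return level;
}
```

With three drivers of which only the second is enabled and driving 0, the line reads 0; with all drivers disabled, the line reads the pull-up value.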

Address latches (registers with D-type flip-flops) are used to latch the address
and hold it as long as required. (The latch operates in a “transparent mode”, its outputs
following its inputs, while the strobe remains active; its “output enable” OE# pin is
typically grounded to logic 0, i.e., permanently enabled.) If the processor bus is
multiplexed, the external latches are strobed by a
processor signal (called “address latch enable” ALE or “address strobe” AS) issued at the
beginning of each bus cycle to demultiplex the bus and provide at the output of these
latches a buffered address which remains valid for the duration of the whole bus cycle.

16
Devices that have open-collector outputs allow the logical AND among their outputs simply by having
these outputs tied together. For this reason, this connection is also called wired-AND or AND-tied.
When negative logic is used, in which the less-positive level corresponds to logic 1, the open-collector
driven bus lines perform the wired-OR function.
17
A tri-state device has both an active pull-up transistor and an active pull-down transistor in its output
[to define the logic 1 (one) and logic 0 (zero) states], but an extra third input terminal is used to disable
the output. When the output is disabled, it is said that the output floats. This third state is often called
the high-Z or high-impedance state. When a device is placed in its high-impedance state, it is considered
to be electrically disconnected from the bus. Thus, many tri-state devices (drivers) may be connected to
the same bus, with their respective outputs forming the logical OR with each other. For this reason this
connection is also called wire-ORed, OR-tied, or bus-configuration. Appropriate control signals must be
applied to select only one of them to drive the bus, while holding all other drivers “disconnected”
from the bus.


[Figure 1.7: Examples of interface components between processor and memory.
a) Interface logic between processor and memory: the bus master (processor CPU plus processor interface logic) connected over the memory bus* to a memory slave (memory interface logic plus memory storage), with a common clock.
b) Interface components on the processor side (board): clock generator, address latches and address decoder (producing the buffered address bus and chip selects), data transceivers (driving the data bus), bus controller (converting status/control signals into commands), byte-enable circuit, burst logic, interrupt control logic, coprocessor, and bus arbiter (driving the arbitration lines).
c) Bus controller, address latches, and data transceivers in an example 16-bit configuration with a multiplexed address/data bus (AD15-AD0, A19-A16, BHE) and a 1-megabyte address space (A19-A0).
* The memory bus may be the “component-level” or processor bus, the “local bus”, or the “system bus”, depending on where the memory slave is connected.]

When memory or I/O devices are connected directly to the multiplexed local data
bus, it is essential that they be prevented from corrupting the information (usually, an
address) present on this bus during the first clock cycle of the bus cycle. Most often,
interfacing requirements become simpler if the data bus is buffered. Buffering the data
bus also offers increased current capability and capacitive load immunity. For the bi-
directional data bus, this buffering is accomplished by using external bi-directional bus
drivers (transmitters) and receivers, called data transceivers. The direction of the data
transceivers’ operation is governed by a control signal the processor issues (e.g.,
DT/R# = H makes them act as transmitters, while DT/R# = L as receivers).

Address decoders are used to receive some address bits off the address bus,
decode them, and generate some control signals needed by memory and I/O (for example
“bank selects” and “chip-select” signals to be discussed in Chapter 3.) Address decoders
are also used to identify address ranges and determine whether the current access is for a
device connected to the local processor bus or to the global system bus. Decoders may
also be needed to decode status or function code signals (to identify the type of bus cycle
the processor starts).

A bus or system controller circuit is used to convert the processor’s control
signals to those needed by a system bus. It receives status and control signals from the
microprocessor itself, decodes them, and converts them into system-wide “command
signals”. Figure 1.7c shows the bus controller, the address latches, and the data
transceivers in an example configuration.

Finally, the byte-enable circuit in Figure 1.7b contains logic to generate the
“byte-enable” signals for main memory; an external circuit is needed for processors that
do not themselves issue these “byte-enables”. As Figure 1.7b shows, microprocessors
like the Motorola 68030 require this external logic to combine the processor’s output
signals A1,A0 and SIZ1,SIZ0 and generate the four “byte-enables” (called DBBE4-DBBE1
by Motorola) needed by memory.

1.3 SYSTEMS CONFIGURATIONS


1.3.1 Introduction

Functionally, the modules of a bus-based processor system are divided into two broad
classes: bus master devices and bus slave modules. Bus master modules are those that
can gain control of the bus and initiate data transfers by driving the address and control
lines. To perform these tasks, the bus master is equipped with either a processor CPU or
similar logic that makes it capable of initiating bus cycles to transfer data over the bus.
The master informs all other modules in the system of the type of bus cycle it starts and
qualifies a valid address on the address bus by issuing proper “status signals” or “function
code signals” over the control lines. A module acting as a bus slave cannot start bus
cycles; it only monitors the bus activities and, if addressed during a particular bus cycle,
it either accepts data from the data bus or places data on it. A slave module is not capable
of controlling the bus. Therefore, a slave always receives and decodes the address and
control signals to determine whether or not it should respond to the current bus cycle
started by the master. All data transfer activities between a bus master and a bus slave
are carried out in terms of “bus cycles” (explained in Section 1.5.2).
Table 1-1 Microprocessor-based system platforms

UltraSPARC: used in AUSPEX NetServer
SuperSPARC: used in Cray CS6400 Enterprise Server
HP PA-RISC 7200: used in Convex Exemplar SPP series
HP PA-8000: used in HP Exemplar (S-Class, X-Class)
MIPS R10000: used in SGI Power Challenge, Pyramid, Concurrent, Tandem
DEC Alpha 21164: used in Digital’s high-performance computers
Intel Pentium Pro: used in Intel TFLOPS, advanced workstations
IBM PowerPC 620: used in IBM RS/6000 SP2 (Scalable Power Parallel System)


Table 1-1 lists current system platforms based on the latest microprocessor chips in
Figure 1.5d.

1.3.2 System Components

Figure 1.8 shows the block diagram of representative computer system configurations:
Figure 1.8a represents a small system with only one bus (the processor bus), Figure 1.8b
shows a single-board configuration with a local bus, Figure 1.8c a larger multiboard
system, and Figure 1.8d a multiprocessor system. In addition to the processor itself, a
computer system must contain as a minimum the following types of components
(discussed below in more detail): (1) a clock generator, to supply clock signals to the
processor and the other components of the system18; (2) main memory slave units,
usually composed of dynamic RAM (DRAM) memory chips to store the program code
and the operands/data, with an optional error detection-correction unit (EDCU) to
increase the reliability of the information exchanged; and (3) some kind of Input/Output
interface units to connect external devices to the computer system, such as hard-drives,
printers, modems, network adaptors, etc. Sometimes, other optional coprocessor chips
may be attached to the processor, such as communications or multimedia coprocessors to
operate in parallel with the CPU and execute operations that the processor does not
support. Finally, a system may also incorporate some other special-purpose chips, such
as MMUs (memory management units) to implement virtual memory and handle task
switching, or cache memories (composed of static RAM chips and cache controllers) to
increase overall system throughput by supplying information to the processor faster than
main memory can. (As we will see later, most of the high-performance microprocessors have
integrated the MMU and at least a first-level cache on the processor chip itself.)

a) The processor.
Processors nowadays are single-chip devices, called microprocessors, although there may
be some that consist of more than one chip (referred to as a “chip set”). The processor
communicates with memory and I/O subsystems (units) by sending addresses to them
(over the address bus) and sending to them or receiving from them data (over the data
bus). All control signals issued by the processor CPU travel over the control bus.
Each processor has a wordlength, which is characterized by the width of the
processor’s internal data registers and the width of its arithmetic/logic unit (ALU); this
sometimes is also referred to as its “internal architecture.” Although the widths of a
processor’s internal and external buses may differ, in this textbook, unless explicitly
specified otherwise, we will assume that an n-bit processor also has an n-bit external
data bus.

18
In some cases, the processor receives pulses from an external crystal and it is the processor itself that
generates the appropriate clock signals distributed to the other components of the system.


[Figure 1.8: Bus-based computer systems of different complexities.
a) A small, one-level-bus computer system: a processor module (CPU with clock, L2 cache and cache controller), main DRAM memory (with controller), and an I/O subsystem (DMA controller, graphics controller, etc.) all on the level-1 CPU or processor bus; typical peripherals include keyboard, hard drive, CD-ROM, network interface, graphics, sensors, and actuators.
b) A single-board computer system with a local bus: the processor/cache/memory subsystem sits on the level-1 processor bus, and a local bus bridge (“chip-set”) connects it to a level-2 local bus (PCI, VL-Bus, Sbus) running at a lower frequency (e.g., X/2 MHz), which carries local bus devices (display system, CD-ROM, hard drive, etc.) and I/O expansion boards.
c) A larger, multiboard computer system: the processor/master board adds a system bus bridge from the level-2 local bus to a level-3 system or expansion bus (VMEbus, Futurebus, Multibus; e.g., X/4 MHz) carrying shared system memory boards and I/O interface boards, plus a level-4 I/O bus (GPIB, LAN) and an AGP (Accelerated Graphics Port) connection for graphics.
d) A multiprocessor computer system: N processor boards, each with its own CPU, local memory, local I/O, system bus interface, and bus arbiter, share a global/system bus that connects global/system memory and global/system I/O (interface controllers and adapters for peripherals).
(PCI: Peripheral Component Interconnect; SCSI: Small Computer System Interface; GPIB: General-Purpose Instrument Bus; LAN: Local Area Network)]


The CISC approach follows the classical design of the CPU with its
accompanying complex, wider, variable-length instruction set (which results in a more
complex structure of the processor’s control unit) and its support of many different
addressing modes. The RISC approach aims at increasing the performance of the
processor itself by providing a different internal processor architecture and simpler,
fixed-length, and primarily register-oriented instructions (that simplify the processor’s
control unit and speed up its operation).
Newer and more advanced processors have increased their wordlengths, bus sizes,
addressing capabilities, and input clock frequencies, and have integrated more features on
the processor chip. For example, the major difference between the Intel 32-bit CISC
processors 80386 [4] and 80486 [5] (also referred to as the 386 and 486, respectively) is
that the latter has integrated a cache and an FPU (Floating-Point Unit) on-chip. In the 32-
bit Motorola CISCs, the earlier 68030 [8] had included on the chip an ICACHE
(instruction cache), a separate DCACHE (data cache), and a single MMU (memory
management unit), while the 68040 [9] has integrated on the chip the FPU and has split
the MMU into an IMMU (instruction MMU) and a separate DMMU (data MMU).
Today’s processors have a 64-bit internal architecture with a 128-bit external
data bus; most of them have on-chip MMUs and caches (separate for instructions and
data), FPUs, and some may also include graphics units. Also, all processors have
incorporated pipelined implementations of their execution units, some are superpipelined
processors (like the MIPS R4400), while others are superscalar processors (like the
88110, PowerPC 620, Alpha, and UltraSPARC). These terms are explained later in this
Chapter.

b) Memory units.
Whenever we use the word “memory,” we will mean main memory implemented outside
the microprocessor chip, on separate memory chips; for larger systems, memory chips
will be arranged on one or more memory boards. Static RAMs (SRAMs) hold stored
information as long as power is being applied to them. SRAMs are quite often used for
“cache memories”. On the other hand, dynamic RAMs (DRAMs) are cheaper, have
greater densities and require less standby power. However, the DRAM only provides for
temporary storage of data and requires special circuits to refresh its contents periodically.
This periodic refresh requirement makes dynamic memories slower than static memories.
The main memory in Figure 1.2a is referred to as an “n-bit memory” if the maximum
data it can transfer in parallel is n bits per access, and data exchange between the
processor and main memory is done on an n-bit basis (over an n-bit data bus). (The
design of the memory subsystem and its interfacing to the processor are covered in
Chapter 3.)

c) I/O interface units.


The I/O subsystem contains interface circuitry to implement the communication protocol
with the peripheral devices; i.e., it contains input and output ports (or registers) through
which it exchanges status and control signals, as well as data, with the actual controllers
of the peripheral devices. The I/O configuration is referred to as I/O-mapped if these
I/O ports are accessed by special “input” and “output” instructions whose execution
causes the processor to issue “input” and “output” type control signals to the I/O
subsystem. An alternative configuration is the memory-mapped I/O; in this case,
memory-type instructions are used to perform the input and output operations, the I/O
ports occupy memory locations, and the I/O ports are accessed with memory addresses.
Some processors (e.g., Motorola) have no separate I/O instructions at all, forcing the
designer to interface the peripheral devices using only “memory-mapped” schemes.
Others (e.g., Intel) have separate I/O instructions permitting either I/O-mapped or
memory-mapped configurations.
I/O interfacing may also involve higher-complexity peripheral devices such as
magnetic disks. In such cases, the peripheral chips used to interface with the controller of
these devices are more complicated than just a number of I/O ports. One such complex
interface device is the direct memory access controller (DMAC); the DMAC contains
logic that enables it to start its own bus cycles, in order to access main memory and
support high-speed data exchange directly between the external device and the
computer’s main memory with no intervention of the processor CPU. A DMAC may
support more than one high-speed peripheral device by providing more than one “DMA
channel,” one channel per I/O device. Other peripheral chips called “I/O processors,” are
even more complex interface devices, which remove yet another level of control from the
CPU. They contain a number of DMA channels, have processor-like capabilities to
execute their own different instruction sets, and operate independently of the computer’s
CPU.

d) Caches.
A cache memory is a small high-speed memory placed between the processor CPU and
main memory to speed up the rate at which instructions and data are supplied to the CPU
by keeping copies of the most recently used memory items. For example, since
computer programs spend a lot of time executing loops (i.e., exhibit “temporal locality”),
the instructions for the second and subsequent iterations of a loop will be found in the
cache, and therefore the CPU does not need to access main memory to fetch them.
Similarly, data structures such as arrays, vectors, etc., frequently exhibit the property of
“spatial locality”; thus access to nearby data items will also find them in the cache. The
inclusion of a cache memory in the computer system allows the processor to operate at
cache speed much of the time rather than at slower main memory speed. In this textbook
whenever we use the word “cache” by itself we will mean the “cache RAM” used for
storage as well as the “cache controller.” (Caches are covered in detail in Chapter 5.)

e) MMUs.
A memory management unit (MMU), aided by systems programs, provides a technique
for handling a larger address space in a flexible fashion, on behalf of the user. It does this
by subdividing the total logical/virtual address space into blocks (pages or segments),
defining logical addresses, and translating them at runtime into physical addresses. It
also provides protection and management of virtual and physical address spaces by
checking a number of access attributes, such as user vs. supervisor space, out-of-limits
access, class of ownership (i.e., which tasks are permitted access to that block), privilege
level (the privilege level of the requestor vs. the privilege level of the module to be
accessed), mode of access (read-only vs. read-write, execute only, etc.), etc. MMU
hardware is also used to prevent problems that result when multiple tasks within a given
application contend for limited physical memory, or when users share common data or
employ common programs.
When an MMU is outside the CPU, it is usually connected to the processor bus
and manipulated (i.e., given the attributes of each module, having its registers updated with
new values, etc.) in supervisor mode by special privileged I/O instructions. Most high-
performance processors contain this MMU hardware on their chip along with the CPU.
(Memory management and MMUs are covered in Chapter 6.)

f) Special coprocessor chips.


Computer systems may also be configured to include other special-purpose support
devices. These are compatible and easily interfaceable to the microprocessor CPU (using
the processor’s bus lines), and--acting like satellite processors--free it from a complex
task that formerly required considerable CPU time and software implementations. By
themselves they can be quite complex processors with an instruction set of their own,
supported by the enhanced instruction set of the CPU. Typical types of external
coprocessors include floating-point units (or FPUs), communications coprocessors
(for example, to implement the ISDN protocol), speech processing coprocessors,
graphics coprocessors, multimedia coprocessors (to compress/decompress video
images, with large on-chip memories and multiple execution units), etc.

1.3.3 Hierarchy of Buses in a Computer System

As the configurations in Figure 1.8 show, a hierarchy of buses is usually used
to interconnect the various components of a computer system. We make here some
introductory comments on the types of buses; their detailed operation is given later in
Chapter 2.

a) Processor Bus and Local Bus

Some devices, like the coprocessors in Figure 1.8a, are directly connected to the
processor CPU (sometimes referred to as close coupling) using what is called the CPU or
processor (external) bus19, which is defined by the I/O pins of the processor CPU. The
other components of the system are interconnected through a second-level bus, also
referred to as the local bus20. (If main memory is connected to this local bus, the bus is
also called the memory bus.) In a number of system designs the local bus and the
19
Also called the component-level bus.
20
One local bus standard is the VL-Bus (Video Local Bus) created by the Video Electronics Standards
Association (San Jose, Calif.); another local bus standard invented by Intel is the PCI (Peripheral
Component Interconnect).


processor bus may be the same (and in such cases we will be referring to it
interchangeably either as the processor bus or the local bus); if these two buses are
different, then the proper “local bus interface” or “local bus bridge” (usually in the form
of a “chip-set”) is needed between them to adapt one bus to the other.
Every bus is functionally made up of the data bus, the address bus, and the control
bus. The data bus is used to transfer instructions and data (operands/results); as deduced
from the processor examples in Figure 1.5, its width may or may not equal the
wordlength of the processor’s internal architecture. Data transfers may sometimes use
only a portion of the data bus lines, depending upon the width of the operand to be
transferred and the width of the memory or I/O device with which the processor
exchanges data; for example, when a 32-bit processor exchanges data with a 16-bit slave
module, only half of its 32-bit data bus will be involved in this data transfer. The
address bus is used to transfer memory addresses (to select a memory location) or I/O
port numbers (to select an I/O port). For example, if a 32-bit processor has a 32-bit
address then it can directly access up to 4 GB (gigabytes) of main memory. Memory
addressing refers to the fact that the address always specifies the address of the first byte
(the lowest-numbered byte-address) of an operand; this is true regardless of the length of
the operand or its byte-ordering (big/little endian) in memory. Finally, control and timing
signals use the control bus lines to synchronize the operation of the various computer
modules and facilitate their intercommunication activities over the bus lines.

b) Global or System Bus


The whole computer system shown in Figure 1.8b can be implemented on a single PC
(printed-circuit) board -- called the “motherboard” -- in which case it is referred to as a
single-board computer system; such systems have limited flexibility and make future
expansions very difficult. A larger, more flexible, and easily expandable computer
system is configured using a number of boards (a multiboard system), of different types,
and even from different manufacturers, as shown in Figure 1.8c. In addition to the
processor board (all of whose components are interconnected through the local bus), the
system may contain additional memory implemented on separate memory boards and
special “I/O adapters” also implemented on separate “adapter boards”. This third level of
bus that connects together all boards of the computer system is called the global or
system bus.
Figure 1.8c shows a processor board (composed of a processor module, its local
bus, some local memory and local bus peripheral devices) connected to global (system)
memory and global (system) I/O interface boards via an industry system bus (such as the
Multibus, VMEbus, Futurebus, etc.). The complexity of the “system bus interface”
circuitry on the processor board will depend on the particular system bus used and the
size and complexity of the overall computer system. As with the processor and local
buses, we will divide the system bus functionally into a “system address bus,” a “system
data bus,” and a “system control bus.” Depending on the particular system bus, these
lines may be separate nonmultiplexed lines, or they may be multiplexed. (Details of the
specifications for a number of industry standard system buses and their interfacing are
given in Chapter 4.)


While the CPU or processor bus is specific to the individual processor, the system
bus is independent of the central processor; i.e., it has its own set of specifications that
all board manufacturers must satisfy. For that purpose, each board includes the proper
hardware, called the “system bus interface” or “bridge”, which allows the board to
interface to the system bus. The interface bridge logic included on the processor board of
Figure 1.8c is used to convert the local bus to the processor-independent industry system
bus. An example of such a hierarchical bus configuration may have a processor with an n-
bit high-speed external data bus operating at X MHz frequency, a level-2 n/2-bit
intermediate-speed local bus operating at X/2 MHz frequency, and a level-3 standard
n/2- or n/4-bit low-speed system bus operating at X/4 MHz frequency.
Bus characteristics, operation, and timing are important factors in memory and
peripheral device interfacing; they are used to identify the specific address or data placed
on the respective bus lines and the time required to carry out bus transactions. Such
identification is valuable to the system designer/integrator, in order to single-step and
debug the system under development, instruction by instruction, and interface the central
processor with memories and peripherals that have different access times. It is also
valuable to the software designer, who must also be aware of software compatibility
problems arising when a system is built using a certain type of system bus (that may, for
example, be transferring data bytes in its own order21) and a mix of 16-, 32-, or 64-bit
processor boards from different manufacturers, which themselves impose their own
ordering of the most and least significant bytes within a 16-bit word, a 32-bit doubleword
entity22, etc.

c) I/O bus
Finally, a number of peripheral devices may be connected to the computer system using
special I/O buses, such as a SCSI (small computer system interface) bus used by hard
drives, a GPIB (general-purpose instrumentation bus) to interface measuring devices and
instruments, and LAN (local area network) interconnections. Each of these will require
its respective I/O controller or “bus adapter board” shown in Figure 1.8 as part of the
global/system I/O interface module(s).

1.3.4 Multiprocessor Systems

The next evolutionary step in this development is the design of computer systems that
incorporate parallel processing by integrating together a number of processor boards or
nodes. Figure 1.8d shows this more complicated configuration of a bus-based
multiprocessor computer system. Each processing node has its own private DRAM main
memory, I/O resources that no other processor can access (like video monitor, hard
21
The ordering of data bytes on the system bus is referred to as either mad-endian or sad-endian ordering
and is discussed in more detail in Chapter 4.
22
The data ordering in the computer’s memory is referred to as either big-endian or little-endian ordering
and is discussed in more detail in Chapter 2.


drive, and other peripherals), and external local area network interconnects. (An
example of such a system is the Beowulf23 Parallel Workstation [24]). Such a parallel
computer may also provide global (system) memory and I/O that are “shareable”; i.e.,
any processor in the system can request and gain control of the system bus, and use it to
access a global memory location or a global I/O port. In this case, the “system bus
interface” hardware on each processing board will include a separate component, called a
“bus arbiter” (or “bus exchange”), to arbitrate among simultaneous requests of system
bus access and determine which board will be given the shared system bus to execute its
data transfers. (Bus arbitration is covered in more detail in Chapter 4.)
The significant price advantage the RISC processor has, along with its dramatic
improvement in performance, has made it an ideal candidate for designing commercial
bus-based SMP (Symmetric MultiProcessing24) parallel architectures. Examples
include the SGI Power Challenge supercomputer, which uses MIPS R8000/R10000
microprocessors, and the Cray CS6400 Enterprise Server which uses SuperSPARC
microprocessors. The SMP architecture resembles that of Figure 1.8d in which each
processor board is a single microprocessor chip (maybe with some external cache).
Because all these microprocessors use the system bus to share global RAM memory,
there is only one memory space, which simplifies both systems and applications
programming.
The single shared memory and bus, however, contribute to the SMP’s biggest
problem, the “memory-bus bottleneck”, which limits system scalability to not more than
8-16 processors. To overcome this limitation, the HP Exemplar high-performance
servers (PA-8000-based platforms) have replaced the interconnecting system bus with a
crossbar switch which makes the HP Exemplar family architecture scalable up to 512
CPUs.

1.4 THE MICROPROCESSOR’S INTERNAL ORGANIZATION

In this Section we present the common architectural components found inside most
processor CPUs; the architecture inside the microprocessor chip is often referred to as
the microarchitecture of the processor. Figure 1.9 shows the simplified internal
organization of a representative generic scalar processor.

23
The Beowulf Parallel Workstation comprises 16 PC microprocessor motherboards, each coupled with
a 1.2GB hard drive, 32MB of DRAM memory, and dual 100Mb-per-second Fast Ethernet channels (with
optional monitor and other peripherals in each processing node). This parallel system has peak
performance in excess of 1 GOPS (giga operations per second), half a gigabyte of main memory, and
20GB of secondary storage.
24
This system configuration is also known as tightly coupled or shared-everything.


[Figure 1.9: Simplified internal organization of a generic scalar processor. The BIU (bus interface unit) connects the external bus lines to an instruction path (instruction cache ICACHE feeding the instruction prefetch, queue, and decode stage) and a data path (data register file RF, integer execution unit ALU, floating-point execution unit FPU, and data cache DCACHE).]

1.4.1 The Bus Interface Unit

The bus interface unit or BIU is responsible for all activities on the processor’s external
bus; it works in an independent fashion from the other internal components of the
processor. It initiates external bus cycles when so requested by the CPU and maximizes
bus utilization by prefetching instructions from memory whenever the processor’s
external bus is free.
The BIU’s hardware includes address latches and drivers, data transceivers, a
prefetcher circuit that prefetches instructions from memory even before they are needed,
hardware to prioritize the various requests for bus cycles that different internal sections of
the processor request, and a bus controller. The bus controller can also receive and
interpret input signals that an external slave device (for example, a memory port)
sends to it to identify the device’s width (or size); this way, the processor properly
identifies which data presented at its input pins are valid. (This is how the processor
implements “dynamic bus sizing”, discussed in detail in Chapter 2.) Internal
multiplexing circuitry routes data arriving at the processor (for a read cycle during the
execution of a LOAD or INPUT instruction) to the correct internal destination, or, for a
write cycle, places the data on the correct output data pins. The BIU also contains the necessary
hardware logic to perform error detection and correction on the data transferred over the
external data bus. Finally, the bus controller may operate at the slower speed of the


external bus while the remaining sections inside the processor may operate at their own
faster speed25.

We saw earlier that some processors have an external Harvard architecture which
provides separate ports to allow simultaneous access to both an instruction-memory and
a data-memory module. Such a two-port, nonmultiplexed memory access scheme
requires two separate parts in the BIU -- an “instruction interface unit” and a “data interface
unit” -- to individually control these two independent memory accesses.

Finally, depending on the processor’s complexity, the BIU may have the
appropriate hardware to provide interfacing not only to the external main memory bus but
also to other buses. One such development is the processor having an additional separate
bus for connecting external caches (a cache bus). Most often, this cache bus is for an
external “level-2” cache (see the Pentium Pro, MIPS R10000, Alpha 21164, and
UltraSparc in Figure 1.5d); this assumes that the processor includes an on-chip level-1
cache and the BIU contains all the hardware logic needed to control these external level-2
caches. Other processors (e.g., the HP PA-8000) have no on-chip cache at all, but provide
a wide external cache bus to allow interfacing to level-1 instruction and data caches
implemented outside the processor chip. This allows the implementation of very large
level-1 caches which are needed by applications whose data sets are too large to fit into
the smaller on-chip caches. Finally, another development is that of the Alpha and
UltraSparc processors (Figure 1.5d) whose BIUs provide the necessary hardware to
support separate special-purpose I/O buses: Alpha supports the PCI bus, while
UltraSparc supports Sun’s SBus.

1.4.2 The Instruction Fetch and Decode Unit

The control unit in the processor is used for instruction fetching, decoding,
sequencing, and dispatching to the proper functional units; for example in Figure 1.9 an
integer instruction would be issued to the fixed-point execution unit (FXU), while a
floating-point instruction would be issued to the floating-point unit (FPU). In scalar
processors this instruction issue is done “in-program-order”, one instruction at a time.
Earlier CISC processors used mainly microprogrammed implementations of the control
unit, while RISC processors used only hardwired implementations.
The control unit contains the proper timing circuitry (driven by the processor
clock pulses) to provide both internal and external timed control signals to carry out
internal CPU microoperations and facilitate external data bus transfers. The control unit
also contains a control sequencer that requests the bus interface unit to initiate

25
For example, the “double-speed” Intel 486DX2 processor operates internally at twice the clock speed
of the rest of the system: e.g., a “50-MHz 486DX2” (a 486DX2/50) processor operates internally at 50
MHz while the external subsystem still operates at 25 MHz. (Thus, this “50-MHz 486DX2” can still work
with older external slower 25-MHz motherboard designs but has an internal performance which is
substantially better than that of a normal “25-MHz Intel 486” processor.)


instruction prefetches from memory. Some processors, instead of only a single
instruction register, provide a larger instruction buffer or queue to hold a number of
instructions prefetched by the bus interface unit. (For example, the 486 has a 32-byte
code queue, while the Motorola 68030 has a 3-word-deep instruction pipe.) Most
processors today have an on-chip ICACHE (instruction cache) where all prefetched
instructions from memory are first placed.

To carry out its tasks properly, the control unit must provide for at least the
following four capabilities:
1. Establish the present “processor state” (or microarchitecture state) during each
processor clock cycle.
2. Provide logic to determine the correct “next state”; this next-state selection is done
through a proper combination of the present-state information, the processor’s
external inputs, and certain feedback lines, either from within the processor (e.g.,
flags from the ALU) or from external components in the system.
3. Provide a facility to store the information that identifies the current state.
4. Finally, provide some means for translating this state into proper intra-module and
inter-module control signals generated by the processor CPU. Since all
processors synchronize their events with single- or multiple-phase input clocks,
control signals are issued in synchronism with precise clock pulses. The input
clock source, therefore, synchronizes all state transitions of the system.
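These four capabilities describe, in effect, a finite-state machine. As a rough illustration only (the states and control-signal names below are invented for this sketch, not taken from any real processor), a hardwired control unit might be modeled as:

```python
# Hypothetical hardwired control unit modeled as a Moore-style finite-state
# machine: the present state plus feedback (e.g., ALU flags) select the next
# state, and each state maps to the control signals asserted in that cycle.

# Control signals asserted in each state (illustrative names only).
CONTROL_SIGNALS = {
    "FETCH":   ["mem_read", "load_ir", "inc_pc"],
    "DECODE":  ["decode_enable"],
    "EXECUTE": ["alu_enable", "load_rd"],
}

def next_state(present_state, zero_flag):
    """Next-state logic: present state plus feedback lines (here, just the
    ALU zero flag) determine the state entered on the next clock edge."""
    if present_state == "FETCH":
        return "DECODE"
    if present_state == "DECODE":
        return "EXECUTE"
    # After EXECUTE, refetch; a real design might also branch on the flag.
    return "FETCH"

def run_cycles(n, start="FETCH"):
    """Advance the machine n clock cycles, recording the signals issued."""
    state, trace = start, []
    for _ in range(n):
        trace.append((state, CONTROL_SIGNALS[state]))
        state = next_state(state, zero_flag=0)
    return state, trace
```

Each iteration of the loop stands for one processor clock cycle, mirroring the synchronism requirement above: signals change only at state transitions.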

CISC processors have used microprogrammed implementations for the design of
their control section, to allow flexibility, ease of expansion, and upward compatibility.
Various groups of microinstructions, forming microroutines, are usually incorporated into
a complex system. The (macro)instruction or assembly-language instruction, when
decoded by the control unit, points to the appropriate microroutine in the control store.
Execution of the microroutine corresponds to executing the required microoperations that
generate all control signals needed for the instruction execution. The microprogrammed
approach has the advantages of leading to more “regular” design and providing enhanced
instruction flexibility. Future changes and improvements are much easier, since this
would only require changing the microroutines in the control store. Microprogrammed
logic, however, is slower than hardwired logic.
The control section of a RISC processor, on the other hand, is usually hardwired,
because RISC instructions have simple and fixed formats. The hardwired implementation
results in a smaller control section than that of the microprogrammed approach; however,
it is inflexible and presents design difficulties which lead to non-structured, quite
complex configurations. As the complexity of the processor increases, it becomes more
expensive than the microprogrammed approach, primarily in terms of development cost.
Modifications are also very difficult, since changing the machine’s visible instruction set
requires new chip layouts, etc.
Processors nowadays have most of their control unit hardwired with only a small
portion of it microprogrammed (to decode the few complex instructions that the
processor may have.)


1.4.3 The Execution Unit

The processor’s execution unit (EXE) is responsible for executing logic instructions
(such as AND, OR, shift, etc.), integer fixed-point arithmetic instructions, or floating-
point arithmetic instructions. To calculate the “effective address” of an operand
or of a result (needed during instruction execution), some processors use their single
arithmetic ALU, while others have a separate ALU dedicated only to that purpose. The
EXE unit includes the following hardware: the necessary arithmetic/logic units (adders
and a barrel shifter) and hardware multipliers/dividers; general-purpose fixed- and
floating-point registers (used to hold the operands and results of operations); special
registers (that hold intermediate results); and finally, the required local control circuitry.
Integer fixed-point instructions are executed in what is referred to as the FXU section of
the EXE unit, while floating-point instructions are executed in the FPU section. Usually,
an “n-bit processor” has an n-bit ALU (which operates in parallel on n-bit wide input
operands).
Quite often, the FXU has its own “integer register file”, and the FPU its own
separate “FPU register file”. (FPU registers have at least twice the width of integer
registers). CISC and RISC processors differ in their typical “register file” in that RISC
chips generally have a much larger number of registers to hold more operands.
Increasing the number of internal registers increases (1) the requirements for on-chip real
estate needed to implement them, (2) the width of the instructions (to be able to specify
the added registers), and (3) the additional hardware needed in the form of multiplexers
for storing into, or reading data from, these registers. On the other hand, a large number of
registers provides better support for the larger number of register-to-register instructions
executed by RISCs. Usually, an “n-bit processor” has n-bit internal integer data
registers; “32-bit processors” have 32-bit data registers (with capabilities of addressing
an 8- or 16-bit quantity within a register, or concatenating two registers to hold 64-bit
quantities), “64-bit processors” have 64-bit integer data registers, etc.
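The sub-register addressing and register concatenation just mentioned come down to masking and shifting; a sketch with hypothetical register values:

```python
# Illustrative bit manipulations for a 32-bit register file: addressing an
# 8- or 16-bit quantity within a 32-bit register, and concatenating two
# 32-bit registers to hold a 64-bit quantity. The values are hypothetical.

MASK8 = 0xFF
MASK16 = 0xFFFF

def low_byte(reg32):
    """The 8-bit quantity in the least-significant byte of the register."""
    return reg32 & MASK8

def low_half(reg32):
    """The 16-bit quantity in the lower half of the register."""
    return reg32 & MASK16

def concat64(hi32, lo32):
    """Concatenate two 32-bit registers into one 64-bit quantity."""
    return ((hi32 & 0xFFFFFFFF) << 32) | (lo32 & 0xFFFFFFFF)
```

In hardware these selections are wired multiplexers and byte-enable paths rather than arithmetic, but the bit-level effect is the same.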

1.4.4 On-Chip Caches

Almost all processors have two separate types of on-chip cache memories: an
“instruction cache” or ICACHE, which contains the instructions most likely to be fetched
by the control unit for decode and execution, and a “data cache” or DCACHE, from which
operands are loaded and to which results of internal operations are stored26.
(Caches are discussed in detail in Chapter 5).

26
This is the “split cache” approach. Some processors have followed the “unified” or “integrated cache”
approach with only a single on-chip cache used for both instructions and data.


1.4.5 Example CISC Scalar Processors

In this section we present the internal organization of the 32-bit Intel 486
microprocessor as a representative scalar CISC processor. More details on the 486
internal registers or on the internal organization of other CISC processor examples can be
found in Appendix A.

>>-----------------------------------------------------------
Example 1- 3: The Intel 486 scalar microprocessor

Figure 1.10 shows the internal organization of the Intel 486 microprocessor with its two
functional units: the integer ALU and a rudimentary FPU. Only one decoded instruction is
issued at a time to one of these two functional units.
The 486 is a 32-bit CISC scalar microprocessor, with a 32-bit ALU and 32-bit data
registers. Its “bus interface” unit contains the appropriate hardware to manage the external bus
lines and keep the buses busy. It can initiate both simple and complex bus cycles and support
variable data bus widths (both discussed in detail in the next Chapter 2), and the 486 contains all
the logic to perform on-chip error detection on the parity-encoded data transferred over the data
bus.
The chip contains a single 8-Kbyte “unified” cache, for both instructions and data. As a
first step, the 486 fetches the instruction from the on-chip cache. However, since a “cache line” is
16 bytes long, most instructions do not require this stage (because they have already been
prefetched with the previous access to the cache.) This step, however, is always required at the
target of a branch instruction. Then, a first decoding is performed by processing up to 3
instruction bytes. At this stage, the length of the instruction is determined along with actions that
are to be performed for the effective “address generation”. (A small number of instructions may
require 2 clock cycles for this first decoding stage). The “address generation” stage (needed
because of the complexity of the 486 instructions that intermix computations with accesses to
external memory) completes the decoding of the instructions, decodes any displacement or
immediate operands, and in parallel computes the effective address. (Again, a small number of
instructions may require 2 clock cycles in this stage). Depending on the instruction decoded, the
decoder may send it for execution either to the “ALU execution unit” (if an integer instruction)
or to the FPU (if a floating-point instruction). Instructions that reference memory (including
jump instructions) access the on-chip “unified” cache in this stage. (Along with cache lookup, the
TLB lookup proceeds in parallel). The processor updates the register file either with data from
the on-chip cache or from main memory, or with the results of the FXU or FPU execution. The
“MMU” section inside the Intel processors contains the hardware to implement both segmented
and paged virtual memory. (These terms are explained in detail in Chapter 6). Finally, the 486
processor has neither internal nor external Harvard architecture.
<<-----------------------------------------------------------


[Figure 1.10 depicts the Intel 486: the byte-enable lines BE0#-BE3# (4 bits), the
address bus (30 bits), and the data bus (32 bits) connect to the BIU (Bus Interface
Unit); the BIU feeds the integrated (unified) 8-KB cache, whose instructions pass
through 32-byte fetch buffers and the decoder to the integer unit (ALU) with its
32-bit integer registers and to the rudimentary FPU with its FPU registers; the
MMU (Memory Management Unit: segmentation and paging unit) completes the chip.]
Figure 1.10: Intel 486’s internal organization (with its two functional units) [24].

1.5 HOW THE COMPUTER EXECUTES PROGRAM INSTRUCTIONS

1.5.1 Instruction Formats

a) CISC instruction formats

We mentioned earlier that CISC processors have much more complex instructions than
RISC processors. As an example, consider the complex format in Figure 1.11 of the Intel
instructions [5]. (Not all fields are shown.) These instructions consist of one or two
primary opcode bytes, possibly an address specifier consisting of the “mod r/m” byte and
“scaled index” byte, a displacement if required, and an immediate data field if required.
Within the primary opcode or opcodes, smaller encoding fields may be defined. These
fields vary according to the class of operation. The fields define such information as
direction d of the operations (to/from), size of the displacements w, register encoding
sreg2, sreg3, or sign extension s. The remaining fields of a long instruction specify
register and address mode, address displacement, and/or immediate data.


[Figure 1.11a depicts the general CISC instruction layout, from left to right: a
one- or two-byte opcode (each bit marked T), the “mod r/m” byte (fields mod, TTT,
r/m), the “s-i-b” byte (fields ss, index, base), an address displacement of 4, 2, 1
bytes or none, and immediate data of 4, 2, 1 bytes or none. The “mod r/m” and
“s-i-b” bytes together form the register and address-mode specifier.]
a) General CISC instruction format

FIELD NAME   DESCRIPTION                                              NUMBER OF BITS

w            Specifies if data is byte or full size (full size        1
             is either 16 or 32 bits)
d            Specifies direction of data operations                   1
s            Specifies if an immediate field must be                  1
             sign-extended
reg          General register specifier                               3
mod r/m      Address mode specifier (effective address can            2 for mod;
             be a general register)                                   3 for r/m
ss           Scale factor for scaled index address mode               2
index        General register to be used as index register            3
base         General register to be used as base register             3
sreg2        Segment register specifier for CS, SS, DS, ES            2
sreg3        Segment register specifier for CS, SS, DS, ES, FS, GS    3
tttn         For conditional instructions, specifies a condition      4
             asserted or a condition negated

b) Fields of the CISC instruction of (a)

Figure 1.11: A general CISC instruction format

The complexity of the instruction format and its number of different fields make
its decoding cumbersome. Such a complex CISC instruction may specify a large number
of addressing modes, including direct, based, base plus displacement, index plus
displacement, base plus displacement plus index, etc.
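These addressing modes all reduce to the sum of an optional base register, an optionally scaled index register, and a displacement; a sketch (register names and contents here are hypothetical):

```python
def effective_address(regs, base=None, index=None, scale=1, disp=0):
    """Compute an effective address in the general CISC form
    EA = base_register + scale * index_register + displacement.
    Leaving base and/or index as None synthesizes the simpler modes
    (direct, base plus displacement, index plus displacement, etc.)."""
    ea = disp
    if base is not None:
        ea += regs[base]
    if index is not None:
        ea += scale * regs[index]
    return ea & 0xFFFFFFFF  # wrap to a 32-bit address

# Hypothetical register file contents for the examples below.
regs = {"EBX": 0x1000, "ESI": 0x10}
```

For instance, direct addressing is just `disp`, while "base plus displacement plus index" supplies all three components; only the fields actually encoded in the "mod r/m" and "s-i-b" bytes participate in the sum.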


>>-----------------------------------------------------------
Example 1- 4: Address field
Consider a hypothetical 32-bit microprocessor having 32-bit instructions composed of two fields:
the first field is one byte and represents the opcode and the second field (the remaining bits)
represents either an immediate operand or the address field.
a) What is the maximum directly addressable memory capacity (in number of bytes)?
b) Discuss the impact on the system speed if the microprocessor has:
• a 32-bit external address bus and a 16-bit data bus or,
• a 16-bit address bus and a 16-bit data bus.
c) How many bits are needed for the program counter and the instruction register?

Answer:
a) The maximum directly addressable memory is 2^24 = 16 Mbytes.
b) If the address bus is 32 bits, the whole address can be transferred to memory at once and decoded
there; however, since the data bus is only 16 bits, it will require 2 bus cycles (accesses to
memory) to fetch the 32-bit instruction or operand. In the hypothetical case of a 16-bit address
bus, the processor will have to perform two transmissions in order to send the whole 32-bit
address to memory; this will require more complex memory interface control to latch the two
halves of the address before performing the access. In addition to this two-step address issue, since the
data bus is also 16 bits, the microprocessor will need 2 bus cycles to fetch the 32-bit instruction or
operand.
c) The program counter must be at least 24 bits (if we assume here that the program counter
contains the physical address issued by the microprocessor)27.
If the instruction register is to contain the whole instruction, it will have to be 32 bits
wide.
(Sometimes there is a distinction between an instruction register and an “opcode register”; in
such a case, the opcode register here would be 8 bits long to hold the 8-bit opcodes.)
<<-----------------------------------------------------------
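The arithmetic of this answer generalizes neatly; the following sketch (the helper names are ours, not from the text) checks parts (a) and (b):

```python
# Checking Example 1-4: an 8-bit opcode in a 32-bit instruction leaves 24
# address bits, and an n-bit item moved over a w-bit data bus needs
# ceil(n / w) bus cycles.

def addressable_bytes(address_bits):
    """Directly addressable memory of a byte-addressable machine."""
    return 2 ** address_bits

def bus_cycles(item_bits, data_bus_bits):
    """Bus cycles needed to transfer one item over the data bus."""
    return -(-item_bits // data_bus_bits)  # ceiling division
```

With a 16-bit data bus, `bus_cycles(32, 16)` confirms the two bus cycles per 32-bit fetch discussed in part (b).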

b) RISC instruction formats


Compared to the long, complex, and many-addressing-mode CISC instruction formats,
RISC processors have very simple and short instructions. Figure 1.12a shows the
instruction formats of the MIPS R4000 (three types of instructions) and the DEC Alpha
(four types) processors. RISC instructions are fixed in length (usually 32 bits), have simple
formats, and are stored in memory aligned on a word boundary (i.e., an
address evenly divisible by 4). Having a small number of different formats simplifies the
decoding, shortening the time of the overall instruction cycle. Less frequently used
operations and more complex addressing modes can be synthesized by the compiler using
sequences of these simple instructions.

27
Most likely, a 32-bit microprocessor will have a 32-bit address and a 32-bit program counter, unless on-
chip segment registers are used, which may permit a narrower program counter.


>>-----------------------------------------------------------
Example 1- 5: Instruction Formats of the MIPS RISC microprocessors
The MIPS R-series instructions (see Figure 1.12a) can be divided into the following groups [7]:
Load and store instructions: move data between memory and general registers. They are
all I-type instructions, since the only addressing mode supported is base register plus 16-bit
signed immediate offset.
Computational instructions: perform arithmetic, logical, shift, multiply, and divide
operations on operands in internal registers. They occur in both R-type format (the
operands and the result are stored in registers) and I-type format (one operand is a 16-bit
immediate value).
Jump and branch instructions: change the control flow of a program. Jumps are always
to a paged, absolute address formed by combining a 26-bit target address with the high-
order bits of the program counter (J-type format) or register addresses (R-type format).
Branches have 16-bit offsets relative to the program counter (I-type). “JumpAndLink”
instructions save a return address in internal register 31.
Coprocessor instructions: perform operations in the coprocessors. Coprocessor Load and
Store instructions are I-type.
Coprocessor 0 instructions: perform operations on CP0 registers to manipulate the
memory management and exception-handling facilities of the processor.
Special instructions: perform system calls and breakpoint operations. These instructions
are always R-type.
Exception instructions: cause a branch to the general exception-handling vector based
upon the result of a comparison. These instructions occur in both R-type and I-type
formats.

<<-----------------------------------------------------------

I-type (immediate)
   31      26 25    21 20    16 15                              0
  |   op    |   rs   |   rt   |           immediate             |

J-type (jump)
   31      26 25                                                0
  |   op    |                    target                         |

R-type (register)
   31      26 25    21 20    16 15    11 10     6 5             0
  |   op    |   rs   |   rt   |   rd   |   sa   |     funct     |

  • op         6-bit operation code
  • rs         5-bit source register specifier
  • rt         5-bit target (source/destination) register or branch condition
  • immediate  16-bit immediate value, branch displacement or address displacement
  • target     26-bit jump target address
  • rd         5-bit destination register specifier
  • sa         5-bit shift amount
  • funct      6-bit function field
Fig. 1.12a: RISC instruction formats (MIPS)
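Extracting the fields of these fixed 32-bit formats is plain masking and shifting, which is exactly why RISC decoding is fast; the following sketch (a simplified decoder, not MIPS reference code) pulls out every field position of Figure 1.12a:

```python
# Field extraction for the three MIPS instruction formats of Fig. 1.12a.
# Every instruction is 32 bits; the opcode tells the decoder which of the
# extracted fields are actually meaningful for this instruction.

def decode_fields(word):
    """Return all field positions defined by the I-, J-, and R-type
    formats for a 32-bit instruction word."""
    return {
        "op":        (word >> 26) & 0x3F,     # bits 31..26
        "rs":        (word >> 21) & 0x1F,     # bits 25..21
        "rt":        (word >> 16) & 0x1F,     # bits 20..16
        "rd":        (word >> 11) & 0x1F,     # bits 15..11 (R-type)
        "sa":        (word >> 6)  & 0x1F,     # bits 10..6  (R-type)
        "funct":     word         & 0x3F,     # bits 5..0   (R-type)
        "immediate": word         & 0xFFFF,   # bits 15..0  (I-type)
        "target":    word & 0x03FFFFFF,       # bits 25..0  (J-type)
    }
```

Because the field boundaries never move, the hardware equivalent is a handful of fixed wire taps rather than the multi-step length and mode analysis a variable-length CISC format requires.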


CALL_PAL format
   31      26 25                                                0
  |   OPC   |                     func                          |

Branch format (BR)
   31      26 25    21 20                                       0
  |   OPC   |   Ra   |             Branch Disp                  |

Memory format (MEM)
   31      26 25    21 20    16 15                              0
  |   OPC   |   Ra   |   Rb   |            Mem Disp             |

Operate format (OP)
   31      26 25    21 20    16 15            5 4               0
  |   OPC   |   Ra   |   Rb   |     func      |       Rc        |

Fig. 1.12b: RISC instruction formats (DEC Alpha)
Figure 1.12: RISC instruction formats (MIPS R4000-family and DEC Alpha).

1.5.2 Bus Cycles, Clock Cycles, and “States”

A processor CPU carries out the instruction cycle in two phases: the “instruction fetch
and decode phase” (IF phase) and the “instruction execute phase” (IE phase). An
instruction cycle may require a number of bus cycles28, B1, B2, B3, etc. in order to fetch
and execute the instruction. A bus cycle29 (also referred to as “local bus cycle”,
“memory cycle”, or “external bus cycle”) begins whenever the CPU needs to access an
external memory location or I/O port, i.e., whenever the processor places an address on
its external address bus. Therefore, a bus cycle is the sequence of basic activities needed
to perform a memory read (or I/O input) operation, a memory write (or I/O output)
operation, or a more complex read-modify-write operation. The bus cycle, therefore,
corresponds to the time needed to complete one transfer from/to memory or I/O port.
Once the whole instruction has been fetched and decoded, its subsequent execution phase
may or may not require additional external bus cycles (depending upon whether or not its
execution needs to access memory or I/O). This subdivision of an instruction cycle into
bus cycles and clock cycles is shown in Figure 1.13.
All basic internal activities of the CPU (the “microoperations”) and all bus cycles
that the processor initiates on its external bus must be executed at well-coordinated times.

28
Actually, since most processors now contain an on-chip instruction cache (ICACHE), it is possible that
an instruction needs no bus cycles at all, if it is already in the on-chip ICACHE (and therefore its IF
phase needs no access to main memory) and its IE phase specifies a completely internal CPU operation
(like the register-to-register instructions).
29
The term “bus cycle” we use here corresponds to what some manufacturers refer to as a “machine
cycle.”


This is accomplished by sometimes using two different clocks: one external (the input or
system clock, denoted CLK, with a clock cycle time CS) and one internal (the “internal
processor clock” or simply the “processor clock,” denoted PCLK, with a clock cycle
time Cp).

[Figure 1.13 depicts one instruction cycle subdivided into three bus cycles B1, B2,
B3, each of which spans four input clock cycles C1, C2, C3, C4.]
Figure 1.13: An instruction cycle subdivided into bus cycles and clock cycles

The system clock, CLK, is used to time the external bus activities as well as
those of the other components in the system; this clock is also called the input clock,
because it is the driving clock applied as input directly to the microprocessor chip. It
establishes the local bus cycle time and controls (clocks) all transfers between the
processor CPU and memory or I/O ports (and that is why it is also referred to as the bus
clock). Everything that happens in the system is synchronized to this system clock’s
rising or falling edges. For example, a processor with an input clock of frequency FS =
16 MHz will have the basic events on the bus occur every CS = 62.5 ns (or even 31.25ns
if both the rising and falling edges of the clock are used).
This input clock CLK is supplied to the processor CPU from an external clock
generator/driver chip that acts as a constant frequency source. Figure 1.14 shows the
clock generators that drive various Intel and Motorola processors. (At the right-hand side
of the clock timing we give the actual names the manufacturers use.) The clock generator
itself requires an external series-resonant crystal input (or constant frequency source)
whose frequency may or may not be the same as that of the generated output clock CLK.
Sometimes the clock generator may provide a second clock output, called a peripheral
clock, that may be half the frequency of FS. The use of this second peripheral clock
simplifies computer system design because it allows the bus interface components to


operate at half the speed specified by the processor’s input clock, imposing on them less
stringent requirements30.
The internal processor clock PCLK is distributed throughout the hardware inside
the microprocessor chip and used to time all internal components and internal
microoperations. The timing of the execution of these internal microoperations is
synchronized to this internal processor clock’s rising or falling edges. The processor
clock PCLK can be either generated internally from the input clock CLK (the most
common case) or supplied externally to the processor using a different input pin from the
CLK input (e.g., the Motorola 68040 of Figure 1.14c). Some processors divide the input
frequency FS internally by 2 to generate the (internal) processor clock frequency31 Fp
(e.g., the Intel 386), others use internally the same clock frequency as that of the input clock
(e.g., the Intel 486 and i860), and others multiply the incoming frequency internally by 2 to
generate their internal processor clock (like most Motorola processors).
As we said earlier, a bus cycle corresponds to the time needed to do a data
transfer between the processor and the addressed slave device. A bus cycle is always an
integral multiple of the system clock cycle CS. The maximum data transfer rate for an
external bus operation (the processor bus bandwidth) is determined not only by the
frequency of this system clock FS, but also by additional information that include the
width of the processor’s external data bus and the number of clock cycles per bus cycle.

>>-----------------------------------------------------------
Example 1- 6: Processor bus bandwidth
Assume a processor is driven by a 16-MHz input clock and has bus cycles composed of four input
clock cycles, C1, C2, C3 and C4. Then, if the data bus is 16 bits wide, the maximum data transfer
rate for this processor would be 2 bytes every four clock cycles, or 8 megabytes per second; if the
data bus is 32 bits, the maximum data transfer rate would double to 16 megabytes per second.
<<-----------------------------------------------------------
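The reasoning of Example 1-6 can be captured in a one-line formula, bytes per bus cycle times bus cycles per second; a small sketch (the function name is ours):

```python
def bus_bandwidth_mb(clock_mhz, clocks_per_bus_cycle, data_bus_bytes):
    """Maximum data transfer rate in megabytes per second:
    (bytes per bus cycle) * (bus cycles per second)."""
    bus_cycles_per_sec = clock_mhz * 1e6 / clocks_per_bus_cycle
    return data_bus_bytes * bus_cycles_per_sec / 1e6
```

With the example's numbers (16-MHz clock, four clocks per bus cycle), a 16-bit bus gives 8 Mbytes/sec and a 32-bit bus 16 Mbytes/sec, as stated.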

[Figure 1.14a: an external crystal drives the clock generator, whose output CLK
(frequency Fs) is the Intel 386’s input/system clock; the 386 divides it internally
by 2 to produce the processor clock (frequency Fp). A bus cycle spans input clock
cycles C1-C4 (plus any wait cycle Tw); each pair of input clock cycles (phases 1-4)
forms one processor state, T1 or T2.]
Fig. 1.14a: Intel 386

30
For example, consider the clock for the Intel 386 in Figure 1.14a: its external 82384 clock generator
produces two clocks: one called “CLK2,” which corresponds to our CLK, and the other called “CLK,”
which has half the frequency of CLK2 and is used as a “peripheral clock.”
31
For example, an Intel processor driven by a 66-MHz system clock and having an internal processor
clock of 33 MHz is usually referred to as a “33-MHz processor.”


[Figure 1.14b: for the Intel 486 and Pentium, the clock generator’s CLK (Fs) serves
as both system and processor clock (FS = Fp); a bus cycle spans clock cycles C1 and
C2 (plus any wait cycle Tw), each clock cycle constituting one state, T1 or T2.]

Fig. 1.14b: Intel 486 and Pentium

[Figure 1.14c: for the Motorola 68040, the clock generator supplies the input/system
(bus) clock BCLK (Fs), while the processor clock PCLK (Fp) runs at twice that
frequency; a bus cycle spans system clock cycles C1 and C2 (plus any wait cycle Cw),
during which the processor clock defines eight states T1-T8, each equal to one-fourth
of an input clock cycle.]

Fig. 1.14c: Motorola 68040

Figure 1.14: Input/system clock, processor (internal) clock, “states”, and bus cycles
for representative Intel and Motorola processors

>>-----------------------------------------------------------
Example 1- 7: Comparing the bandwidths of two processors’ external buses
How would we go about comparing the external data bus bandwidths (or transfer rates) of a “clock-
doubled 66-MHz 486DX2” and a “66-MHz Pentium”? As we will see below, what is often
quoted as the “actual bandwidth” of the microprocessor is really the “theoretically
maximum data transfer rate that the microprocessor may achieve only for a few specific types of bus
cycles.” We need some detailed knowledge of the processors’ characteristics to determine the
bandwidth.

a) 66-MHz 486DX2:
When we refer to the 66-MHz 486DX2 we mean a 486 microprocessor that has an internal
“processor clock frequency” FP = 66 MHz; thus, the “processor clock cycle” is CP =1/66MHz
= 15.15ns.
Since the “DX2” means a “clock-doubled” microprocessor (i.e., a microprocessor whose
internal processor clock is double the frequency of its external clock), the “system clock
frequency” is FS = 33 MHz; in other words, although the processor operates internally at twice


the speed of a normal “non-clock-doubled” 486, its external bus continues to run at only 33 MHz.
Therefore, the “system clock cycle” is CS = 30.3ns.
How long does it take to access memory and do a data transfer? We need to know how many
system clock cycles are needed by a 486 bus cycle. Since this number is two, a bus cycle equals
2CS = 60.6ns.
Since the external data bus width of the 486 is 32 bits, we have 4 bytes transferred per 60.6ns,
and thus the “maximum data-bus transfer rate” is 66 Mbytes/sec.
If we now consider a particular bus transfer called “burst transfer” (that we will discuss in
more detail in the next Chapter) with which we have a maximum of 4 bytes transferred every
“clock cycle” instead of every “bus cycle”, then the DX2 bus transfer rate increases to 132
Mbytes/sec.

b) 66-MHz Pentium:
When we refer to a 66-MHz Pentium we mean a Pentium microprocessor whose internal
and external clocks both have the same 66-MHz frequency: FS = FP = 66 MHz. Therefore, CS
= CP =1/66MHz = 15.15ns. Internally the Pentium is clocked at the same rate (66 MHz) as the
above 486DX2, but it places much more stringent requirements on the speed of the external
components on the motherboard (the Pentium’s external bus runs at twice the frequency of the
486DX2’s); the motherboard is “system” implementation dependent.
The Pentium also needs 2 clock cycles to implement a bus cycle; in other words, now,
bus cycle = 2* CS = 30.3ns.
But the external data bus width of the Pentium is 64 bits; thus we have 8 bytes
transferred per 30.3ns, which yields a “maximum data-bus transfer rate” of 264 Mbytes/sec.
Again, under the assumptions of a “burst transfer” (i.e., maximum 8 bytes per clock
cycle) we get a Pentium bus transfer rate of 528 Mbytes/sec.
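The arithmetic of this example can be captured in a small sketch (the helper function and its name are illustrative, not part of any real API):

```python
def max_transfer_rate(bus_clock_mhz, clocks_per_transfer, data_bus_bytes):
    """Peak data-bus transfer rate in Mbytes/sec for one transfer pattern."""
    transfer_ns = clocks_per_transfer * 1000.0 / bus_clock_mhz  # duration of one transfer
    return data_bus_bytes * 1000.0 / transfer_ns                # bytes per ns -> Mbytes/sec

dx2     = max_transfer_rate(33, 2, 4)  # 486DX2: 33-MHz bus, 2-clock bus cycle, 32-bit bus -> 66
pentium = max_transfer_rate(66, 2, 8)  # Pentium: 66-MHz bus, 2-clock bus cycle, 64-bit bus -> 264

# Burst transfers move one datum per clock cycle instead of per bus cycle:
dx2_burst     = max_transfer_rate(33, 1, 4)  # -> 132 Mbytes/sec
pentium_burst = max_transfer_rate(66, 1, 8)  # -> 528 Mbytes/sec
```

Note that the Pentium’s 4× advantage comes from two independent factors: a 2× faster external bus clock and a 2× wider data bus.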
-----------------------------------------------------------------
It can be seen from Figure 1.15 that the duration of what each processor calls a
state (or “local bus state”) varies among the various processors (even among those from
the same manufacturer!). For some products the “state” is smaller than the input clock
cycle time CS; for example, the Motorola 68040 “state” equals one-fourth the input clock
cycle. Asserting and sampling of external signals can be done for Motorola products
either at system clock cycle (CLK) boundaries or at half the system clock cycle
boundaries (every two T states). For other products, the “state” is equal to the input clock
cycle: for example, the Intel 486 states. Finally, there are products whose “state”
duration is longer than that of the input clock cycle CS; for example, the Intel 386 has a
“state” which is twice as long as the input clock cycle. Thus, when we compare a
number of processors, we may find that all of them have the same bus cycle, consisting
for example of four input clock cycles, but some of them may say that they have a “4-
state” bus cycle, while the others that they have a “2-state” bus cycle. For the bus timing
calculations of bus transfers one should use as reference the input clock cycle time CS
rather than the number of “states” in the bus cycle.
The duration of a wait state Tw (which is a state inserted32 to elongate the bus
cycle because a slower slave component cannot respond to the processor’s request within
the allocated time period) also varies among the different products: most often it equals
the input clock cycle, although there may also be products with a different duration (e.g., the

32
The actual position in the bus cycle for inserting a wait state Tw depends on the particular processor (as
shown in Figure 1.14 and explained later in more detail).


Intel 386 in Figure 1.14a has a wait state equal to two input clock cycles.) It is the
system designer’s responsibility to include the proper external interface logic to force the
processor to insert such wait states. For example, let’s assume that accessing memory
requires one wait state and accessing an I/O port requires three wait states. The design
must include the proper external logic to read and decode the address issued on the
address bus to determine whether it points to memory or I/O space and thus determine
whether one or three wait states will be required for that particular bus cycle.
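A toy sketch of that decode-and-count logic might look as follows; the address boundary between memory and I/O space (`IO_BASE`) is an assumed value for illustration only:

```python
IO_BASE = 0x000F0000   # assumed start of the I/O space in the address map

def wait_states_for(address):
    """Wait states the external interface logic must request for this bus cycle:
    1 for a memory access, 3 for an I/O access (the counts used in the text's
    example).  Real decode logic would examine the address lines in hardware."""
    return 3 if address >= IO_BASE else 1

wait_states_for(0x00001000)   # memory space -> 1 wait state
wait_states_for(0x000F0010)   # I/O space    -> 3 wait states
```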
Most of the basic bus cycles are considered as having fixed length. In some
processors, a bus cycle equals four input clock cycles (4CS), in others three (3CS), and in
others two input clock cycles (2 CS). Figure 1.15 shows the minimum duration of a bus
cycle for some Intel and Motorola processors. The diagram also summarizes the
relationship among the bus cycle, input clock cycle CS, internal processor clock cycle
Cp, and what various commercial processors call a “state” and a “wait state.”

-----------------------------------------------------------------
Example 1- 8: Bandwidth, frequency, and data bus width
Consider a 64-bit microprocessor implemented with a 32-bit external data bus and driven by a 100-
MHz input clock. Assume that this microprocessor has a bus cycle whose minimum duration is 4
input/system clock cycles.
a) What is the maximum data transfer rate that this microprocessor can accomplish?
b) In order to increase its performance, how would you compare increasing its external data bus
width to 64 bits versus doubling its input/system clock frequency to 200 MHz?

Answer:
a) System clock cycle = 1 / 100MHz = 10 ns.
Bus cycle = 4 X 10 ns = 40 ns.
Four bytes are transferred every 40 ns; thus, transfer rate = 100 Mbytes/sec.
b) Doubling the frequency will most likely mean adopting a new chip manufacturing technology (if we
assume that the operation remains the same, and each instruction still requires the same number of
clock cycles). Doubling the external bus may be easier, since this “64-bit processor” may have
already foreseen a next-version implementation with 64-bit data buses; this means wider (maybe
newer) on-chip data bus drivers and latches and modifications of the on-chip “bus interface unit”.
In the first case, the speed of the memory subsystem will also need to almost double so that it will
not slow down the processor. In the second case, the “wordlength” of the memory subsystem will
have to double to be able to send and receive 64-bit quantities.
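The arithmetic of part (a) and the effect of the two upgrade options can be checked quickly:

```python
clock_ns = 1000 / 100            # 100-MHz input clock -> 10-ns clock cycle
bus_cycle_ns = 4 * clock_ns      # 4 clocks per bus cycle -> 40 ns
rate = 4 * 1000 / bus_cycle_ns   # 32-bit bus: 4 bytes / 40 ns = 100 Mbytes/sec

# Either upgrade doubles the peak rate to 200 Mbytes/sec:
rate_wider  = 8 * 1000 / bus_cycle_ns        # 64-bit bus, same 100-MHz clock
rate_faster = 4 * 1000 / (4 * 1000 / 200)    # 32-bit bus, 200-MHz clock
```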
-----------------------------------------------------------------

1.5.3 Microoperations and State Transition Diagrams

All microprocessor CPUs carry out a bus cycle by executing a sequence of one or more
simultaneous basic operations (referred to as microoperations). All microoperations are
executed in synchronism with the processor’s internal clock, and we will assume here
that each microoperation can be executed within one processor clock cycle CP. We also
know that the activities on the external processor bus are executed in synchronism with
the external or system clock CLK.


Microprocessor       Minimum duration    Duration of the processor    Duration of a        Duration of a
                     of a local bus      clock (Cp) relative to the   local bus "state"    "wait state"
                     cycle (input        input clock (C) cycle        (input clock         (input clock
                     clock cycles)                                    cycles)              cycles)
--------------------------------------------------------------------------------------------------------
Intel 8086           4                   Equal                        T state = 1          1
Intel 80286          4                   Cp = 2 input clock cycles    T state = 2          2
Intel 80386          4                   Cp = 2 input clock cycles    T state = 2          2
Intel 80486/i860     2                   Equal                        T state = 1          1
Motorola 68000       4                   Cp = 1/2 input clock cycle   S state = 1/2        1
Motorola 68020/30    3                   Cp = 1/2 input clock cycle   S state = 1/2        1
Motorola 68040       2                   Cp = 1/2 input clock cycle   T state = 1/4        1
Figure 1.15: Relationships among local (or processor) bus cycle, input (or system)
clock cycle CS, processor (internal) clock cycle CP , and “state”.

Figure 1.16a shows the “state transition diagram” a generic processor may follow
to execute a complete external bus cycle. In this case we assume a processor with a bus
cycle composed of four system clock cycles: C1, C2, C3, and C4. In order to simplify
the discussion here, we will assume that the internal and external clocks have the same
frequency and, therefore, a clock cycle C = CP = CS. During each clock cycle, the
processor executes specific microoperations depending on the type of bus cycle it starts.

Figure 1.16b shows representative microoperations that a generic processor
executes in order to perform a “memory read” or “memory write” bus cycle33. Only the
microoperations whose effect is externally observable are shown; their explanation and
scheduling with each clock pulse C1, C2, C3 and C4 is given below:

Clock cycle C1:


The processor starts a bus cycle by placing an m-bit “byte-address” (an address pointing
to a byte location) on the address bus: Am-1 – A0 ← address. If the address has been
placed on a multiplexed address/data bus, it will stay valid on these lines for only the first
portion (the first clock cycle) of the bus cycle.

33
So far we are discussing only synchronous bus transactions. Asynchronous bus operations are covered
in the next Chapter.


(State transition diagram: C1 → C2 → C3 → C4; from C4, Ready (READY# = L)
ends the bus cycle, while Not Ready (READY# = H) repeats C4.)

Fig. 1.16a: Simplified state transition diagram of a generic processor with 4
input clock cycles per bus cycle (It is assumed that a “state” equals
one input/system clock cycle)

C1: Microoperations executed during clock cycle C1:

• Place the whole m-bit byte-address on the address bus:
      Am-1 – A0 ← byte-address issued
  or place only the most-significant part of the address and issue byte-enables
  (for example*:
      Am-1 – A3 ← quadword-address
      BE7# – BE0# ← byte-address issued)
• Assert an “address strobe” synchronization signal:
      AS# ← L
• Issue “bus cycle identifier” or “function control” signals. For example,
  for a “memory read” bus cycle:
      Memory/IO# ← H
      Read/Write# ← H
      Data/Code# ← H
• If the least-significant address bits are not issued, issue an “operand size
  indicator” (or data identifier) on k output pins labeled, say, SIZEk-1 – SIZE0:
      SIZEk-1 – SIZE0 ← proper values
• Issue other control signals.

*If we assume, say, a 64-bit external data bus, then we will need 8 byte-enables.

C2: Microoperations executed during clock cycle C2:

For a write cycle: Dn-1 – D0 ← data bytes issued. (Nothing for a read cycle)

C3: Microoperations executed during clock cycle C3:

Sample the “port size indicator” input pins.

C4: Microoperations executed during clock cycle C4:

if READY# = H: insert a “wait state” (here, execute an additional C4 state)
if READY# = L: latch data from data bus (for a read or input cycle),
               stop driving data bus (for a write or output cycle),
               negate (deactivate) all signals, and
               end current bus cycle.

Figure 1.16: State transition diagram and microoperations of a generic
microprocessor


Some processors do not place the whole m-bit byte-address on the address bus but
only its most significant part and accompany it by asserting proper values on the
processor’s “byte-enable” outputs. For example, consider a 64-bit processor (i.e. the
processor has a 64-bit external data bus connected to a 64-bit main memory) with 32-bit
addresses. This processor may be placing the most significant 29 bits A31-A3 of its 32-
bit address (this is also referred to as a “quadword-address”, i.e., an address whose least
significant 3 bits are zeros and thus points to a quadword location) on the 29-line address
bus and, instead of issuing the three least significant zeros, places proper values (0s or 1s)
on its 8 “byte-enable”34 output pins; in this case, for an m-bit address, we say: Am-1 – A3
← part of the address (since A2, A1, A0 are not issued), and BE7# – BE0# ← proper values.
Anytime the processor starts a bus cycle, it notifies all other modules of the
system that a valid address is placed on the address bus by activating during C1 an
“address strobe” (sometimes called “address latch enable” or ALE) output
synchronization signal: AS# ← L. This AS# can trigger external latching circuitry to
latch the address and keep it valid for the remaining part of the bus cycle.
The processor also issues “bus cycle identifiers” (in the form of “status” or
“function code” signals listed in Figure 1.6) to inform the rest of the computer system
modules of the type of bus cycle it has initiated. For example, the processor activates a
control signal Memory/IO# to indicate whether access is to be performed to memory
(because it executes a memory-type instruction) or to I/O (because it executes an
input/output instruction, if it has such). The processor also informs the rest of the system
whether a read (or input) or a write (or output) bus cycle is executed by asserting an
output control line. For a read cycle: Read/Write# ← H. Finally, the processor may also
inform the rest of the system whether it is to transfer data (an operand or a result) or
to fetch code (a program instruction). For a code fetch: Data/Code# ← L.

Some processors35 also inform the rest of the system modules of the size of the
operand to be transferred during the current bus cycle. They do that by issuing an
“operand size indicator”; for example, a processor with 32-bit operands will issue two
such signals (say, SIZ1 and SIZ0) where 00 indicates a 4-byte doubleword, 01 a byte, 10
a 2-byte word, and 11 a whole cache line. In general, for k such output pins36 we can say:
SIZk-1 – SIZ0 ← proper values.
Other control signals are also needed to control external interface devices (“glue
logic”), like the ones shown in Figure 1.7. For example, external transceivers require an
enabling signal in the form of “data enable” (DEN#) and a second signal to indicate the
direction of the data transfer on the data bus, such as a “data transmit/receive” DT/R# (H

34
These “byte-enables” are actually more important for “write” than for “read” cycles, to indicate which
byte lanes of the data bus carry valid data and which “byte-sections” of memory must be triggered to
receive and store these bytes. The byte-enables may be issued directly from the processor CPU (like the
BE7#-BE0# of the Intel Pentium), or must be generated by external circuitry (see Motorola 68030 in
Figure 1.5d). For a “read” cycle, these “byte-enable” signals are usually of no importance, because
quite often an n-bit memory will always drive all its n output data bus lines, will send to the processor n
bits of data, and leave it up to the processor CPU to determine which of its n input data pins actually
have the requested data bytes.
35
For example some Motorola microprocessors.
36
This k depends on the maximum size of the operand the processor can handle; for example, a 64-bit
processor may need k = 3 to be able to specify 1 byte, 2 bytes, 3 bytes, 4 bytes, or 8 bytes.


indicating that the processor issues data and L that the processor is to receive data). These
signals may be issued directly from the processor chip itself or generated by the external
“bus controller” chip (which interprets some status signals the processor issues at the
beginning of each bus cycle).

Clock cycle C2:


For a read bus cycle nothing happens during C2, but for a write bus cycle the processor
places the data on the proper byte lane(s) of the n-bit data bus: Dn-1 – D0 ← data bytes
out.

Clock cycle C3:


During each bus cycle, the processor samples its input “port size indicator” pins
(see Figure 1.4) to determine the width of the responding slave device (or slave port) in
order to figure out (1) which data-bus byte-lanes carry valid data (because as we will see
in Chapter 2, when a less wide slave port is used, some microprocessors require it to be
connected to the lower byte-lanes of their external data bus, while others to the higher
byte-lanes!), and (2) whether or not the whole operand has been received. For example,
if the processor requested the transfer of a 32-bit operand and the memory port were only
16 bits wide, then the processor must adapt its operation dynamically to start a second
bus cycle to fetch the remaining half of the operand. The outgoing “operand size
indicator” and the incoming “port size indicator” are used by a number of processors to
implement this “dynamic bus sizing” capability (discussed in more detail in the next
Chapter).
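The cycle-count consequence of dynamic bus sizing follows directly from the operand and port widths; a rough sketch (the function name is ours):

```python
import math

def bus_cycles_needed(operand_bits, port_bits):
    """Number of bus cycles the processor must run to move the whole operand
    through a slave port of the width reported on the port-size-indicator pins."""
    return math.ceil(operand_bits / port_bits)

bus_cycles_needed(32, 32)   # full-width port: 1 bus cycle
bus_cycles_needed(32, 16)   # 16-bit port: 2 bus cycles, as in the text's example
bus_cycles_needed(32, 8)    # 8-bit port: 4 bus cycles
```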

Clock cycle C4:


During this last clock cycle the processor samples some kind of a “ready” input signal to
determine whether the accessed memory or I/O port has had enough time to respond. If
the port is slow, its interface will send a “not-ready” signal to inform the processor
that it could not respond, forcing the processor to insert one or more wait states. When the
processor is in a wait state, it does nothing but idle during clock cycle C4 until the slave
port finally negates this signal telling it that it has responded (i.e., either placed data on
the data bus for the processor to receive or latched and stored the data the processor had
placed for it on the data bus). If the port indicates that it is ready, the processor will either
latch the data from the correct byte lanes of the data bus (for a read or input bus cycle) or
stop driving the data bus (for a write or output bus cycle), will negate all signals
(including the address), end the current bus cycle, and get ready to start the next one.
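The C1-through-C4 flow with READY#-driven wait states can be sketched as a small state machine; the function name and the list-of-samples model of READY# are illustrative only:

```python
def bus_cycle_states(ready_samples):
    """Trace the clock states of the generic bus cycle of Figure 1.16a.
    ready_samples: the values the processor sees each time it samples READY#
    in C4 (True = asserted low, i.e. the slave has responded)."""
    states = ["C1", "C2", "C3"]
    for ready in ready_samples:
        states.append("C4")
        if ready:          # READY# = L: slave responded, bus cycle ends
            break
        # READY# = H: idle and sample again; the repeated C4 is a wait state
    return states

bus_cycle_states([True])                # no wait states: C1 C2 C3 C4
bus_cycle_states([False, False, True])  # slow port: two wait states inserted
```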

-----------------------------------------------------------------

Example 1- 9: The execution of a CALL instruction by a generic microprocessor


To simplify the presentation, we consider here a very simple generic microprocessor. We assume
this microprocessor has no on-chip instruction cache or buffer, no internal pipelining and
overlapping capabilities, uses 16-bit addresses and has a 16-bit external data bus connected to a
16-bit memory. We also assume that the processor has the state transition diagram,
microoperations, and signaling given above in Figure 1.16.


Consider now the execution of a CALL instruction with the following 3-byte format:
BYTE1: CALL opcode (assume stored at hexadecimal byte-location 1232)
BYTE2: upper half of the target address (first executable instruction of the called
routine)
BYTE3: lower half of the target address.
The 16-bit target address identifies the location of the first executable instruction in the called
routine. Assume that the execution of this instruction is as follows:
(SP) ← (SP) – 2
[(SP)] ← (PC)
(PC) ← BYTE2,BYTE3
where (X) denotes “the contents of X”, and [(X)] denotes “the contents of the memory
location pointed at by the contents of X”. Assume that initially the stack pointer register SP contains the
hexadecimal value ABCE (pointing to the latest entry on top of the stack).
Then Figure 1.17 shows the proper sequence of the most important microoperations
scheduled for execution during the clock cycles of each bus cycle by the generic processor in
Figure 1.16 in order to complete the CALL’s “instruction cycle” (i.e., both its fetch and execute
phases). The assumptions made here also include: no wait states, and that each instruction is
loaded into program memory starting at an even byte address (and therefore fetching them can
always be done on a 16-bit word basis).

-----------------------------------------------------------------

-----------------------------------------------------------------
Example 1- 10: Intel Pentium’s bus cycle microoperations

Figure 1.18 depicts the simplified state transition diagrams for the Intel Pentium
processor, along with the most important microoperations needed to execute a memory read and
a memory write cycle. We observe that the bus cycle of the Pentium requires two clock cycles (or
states T1 and T2) and a wait state corresponds to elongating the bus cycle by one input clock
cycle (i.e., repeating state T2). During C1 the Pentium places the 29 most significant bits (a
quadword address) of its 32-bit address on the 29-line nonmultiplexed address bus, and uses
internally bits A2, A1 and A0 to generate the eight output byte enables BE7# - BE0#. The
address status ADS# indicates that a new bus cycle is currently being driven by the
Pentium. Address parity AP is driven with even parity for all bus cycles along with each address
the processor issues. The CACHE# output signal indicates internal cacheability37 of the cycle (if a
read) or a burst37 writeback cycle (if a write). The other two signals W/R# and D/C#
distinguish between read and write cycles and between data and code transfers, respectively.
The data is transferred over the proper byte-lanes (D63-D56, D55-D48, D47-D40, D39-
D32, D31-D24, D23-D16, D15-D8, D7-D0) of the 64-bit data bus during C2, for either a read or
a write operation. The maximum data transfer rate for a bus operation is 64 bits for every two
system clock cycles. A bus cycle starts by the processor placing the address and issuing the ADS#
strobe and terminates when it samples the ready signal BRDY# = L. If it is High, wait states are
inserted by repeating T2. If this BRDY# is returned low, it indicates that the external system has
presented valid data on the data pins in response to a read or the external system has accepted
data in response to a write. On a write cycle, in addition to the data, the Pentium places eight
parity bits on lines DP7 – DP0, one parity bit per byte-lane of the data bus.
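The byte-enable generation described above can be sketched as follows; the helper name is ours, and the logic simply mirrors the text’s description of deriving BE7#–BE0# from A2–A0 and the transfer size:

```python
def quadword_address_and_enables(address, size_bytes):
    """Split a 32-bit byte address the way the text describes: A31-A3 go out on
    the address bus, while A2-A0 plus the transfer size select which of the
    active-low byte enables BE7#-BE0# go low (bit i = 0 means lane i enabled)."""
    offset = address & 0x7                     # the three bits that are not issued
    assert offset + size_bytes <= 8            # transfer must fit in one quadword
    be = 0xFF                                  # all enables negated (high) initially
    for lane in range(offset, offset + size_bytes):
        be &= ~(1 << lane)                     # enable this byte lane (active low)
    return address >> 3, be

quadword_address_and_enables(0x1234, 4)   # (0x246, 0x0F): lanes 4-7 enabled
quadword_address_and_enables(0x1230, 1)   # (0x246, 0xFE): lane 0 enabled
```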

-----------------------------------------------------------------

37
These terms are explained in other chapters.


Bus Cycle 1: Fetch opcode byte and BYTE2 from program memory:
C1: A15–A1 ← 123H(001₂)* = (PC), BE1#–BE0# ← 00, AS# ← L,
    M/IO# ← H, R/W# ← H, D/C# ← L, SIZE1–SIZE0 ← 10 = 16-bit word**,
    DEN# ← L, DT/R# ← L;
C2:
C3: sample “port size indicator” input pins (assume here a 16-bit memory port),
    (PC) ← (PC)+2 = 1234H (we assume this internal microop is executed during C3);
C4: latch the two bytes from the data bus, decode the opcode byte (the control
    section now determines the length of the instruction and its format, the need
    for one more access to memory to fetch the remaining part of the instruction,
    and what to do internally with the received BYTE2), negate all signals, and
    end current bus cycle;
* This is the 15-bit most significant part of the address: 123 are the hexadecimal
  digits for the leftmost 12 bits and 001 are the next 3 binary bits to the right.
** Because we assumed -- in this example -- that reads from memory using the PC are
  always done on a 16-bit basis.

Bus Cycle 2: Fetch BYTE3 and BYTE4 from program memory:
C1: A15–A1 ← 123H(010₂)* = (PC)+2, BE1#–BE0# ← 00, AS# ← L,
    M/IO# ← H, R/W# ← H, D/C# ← L, SIZE1–SIZE0 ← 10 = 16-bit word**,
    DEN# ← L, DT/R# ← L;
C2:
C3: sample “port size indicator” input pins (assume here a 16-bit memory port),
    (PC) ← (PC)+2 = 1236H (we assume this internal microop is executed during C3);
C4: latch BYTE3 and BYTE4 of the instruction and move them to internal registers
    (the last byte BYTE4 will not be used here), (SP) ← (SP)–2 = ABCCH (assume this
    internal microop to prepare the stack pointer for the next bus cycle is
    executed during C4), negate all signals, and end current bus cycle;
* This is the 15-bit most significant part of the address: 123 are the hexadecimal
  digits for the leftmost 12 bits and 010 are the next 3 binary bits to the right.
** Because we assumed -- in this example -- that reads from memory using the PC are
  always done on a 16-bit basis.


Bus Cycle 3: Save “return address” on top of the stack and jump to
the called routine:
C1: A15–A1 ← ABCH(110₂)* = (SP)–2, BE1#–BE0# ← 00, AS# ← L,
    M/IO# ← H, R/W# ← L, D/C# ← H, SIZE1–SIZE0 ← 10 = 16-bit word,
    DEN# ← L, DT/R# ← H;
C2: D15–D0 ← 1236H;
C3: sample “port size indicator” input pins (assume here a 16-bit memory port);
C4: (PC) ← BYTE2,BYTE3 (assume this internal register-transfer microoperation is
    done during C4), stop driving the data bus lines, negate all signals, and end
    current bus cycle.
* This is the 15-bit most significant part of the address: ABC are the hexadecimal
  digits for the leftmost 12 bits and 110 are the next 3 binary bits to the right.

Figure 1.17: Microoperations for the execution of a generic CALL instruction.
(Assumptions: 16-bit addresses, 16-bit data bus, 16-bit memory port, no wait states,
(PC) = 1232H initially, (SP) = ABCEH initially (pointing to the last entry on top of
the stack).)

1.6 PIPELINED RISC PROCESSORS


From Equation 1.8 we observe that, with a constant clock rate, in order to increase the
performance of the processor, one needs to reduce the CPIP value (the average processor
clock cycles per instruction). To reduce this CPIP means introducing some kind of
internal parallelism so that more than one instruction can execute concurrently, thereby
increasing the number of instructions completed per processor clock cycle.
One way to decrease the CPIP value is to introduce instruction-level parallelism
(ILP) inside the scalar processor CPU. Various approaches have been used to achieve
this:
1. One approach is based on temporal parallelism, in which multiple instructions
are simultaneously overlapped in execution using common hardware (the
pipeline). This leads to pipelining (discussed next in this section), which yields an
(ideal) CPIP equal to 1 (i.e., a pipelined processor completes one instruction
per processor clock cycle CP). The performance of the processor can be further
increased by increasing both the depth of the pipeline and the processor clock rate.
This leads to superpipelined implementations (discussed in Section 1.7.1), which
have reduced CPIP below 1 (i.e., they execute more than one instruction per
processor clock cycle).
2. The second approach to parallelism is based on the spatial instruction-level
parallelism in a processor that contains a number of independent functional units or
multiple copies of some of the pipeline stages. This leads to superscalar
implementations in which multiple instructions are issued to various independent
functional units, executed in parallel, and completed per processor clock cycle
(discussed in more detail later in Section 1.7.2). Superscalars also achieve a CPIP
value of less than 1 without increasing the processor’s clock rate.
3. Other approaches -- including superpipelined-superscalar processors (a
combination of both superpipelined and superscalar implementation) and VLIW
(Very Long Instruction Word) processors -- are discussed in Sections 1.7.3 and 1.7.4.

(State transition diagram: state T1 (clock C1) → state T2 (clock C2); sampling
BRDY# = L in T2 ends the bus cycle, while BRDY# = H repeats T2.)

a) State transition diagram (a clock cycle C is called a “T state” by the Pentium)

C1: A31 – A3 ← part of address, ADS# ← L, AP ← even parity,
    BE7# – BE0# ← proper values, CACHE# ← H, W/R# ← L, D/C# ← H;

C2: sample BRDY#: if BRDY# = H, execute additional T2 states;
                  if BRDY# = L, read data from data bus, end cycle;

b) “Data memory read” microoperations

C1: A31 – A3 ← part of address, ADS# ← L, AP ← even parity,
    BE7# – BE0# ← proper values, CACHE# ← H, W/R# ← H, D/C# ← H;

C2: place data on the data bus and proper parity values on DP7 – DP0;
    sample BRDY#: if BRDY# = H, execute additional T2 states;
                  if BRDY# = L, end cycle;

c) “Data memory write” microoperations

Figure 1.18: Pentium’s simplified state transition diagram and most important
microoperations for “memory read” and “memory write” bus cycles.

1.6.1 Instruction Pipelines

The instruction pipelining technique came about from the observation that each
instruction cycle can be broken down into a number of steps, each of which takes an
equal fraction of the time needed to complete the entire instruction. For example, if each
instruction of the processor can be broken down into five steps, each step can be assigned
for execution to a different stage of a 5-stage pipeline (see Figure 1.19a). Instructions
would enter the pipeline at one end, be processed through all stages, and exit at the other
end. The pipeline accepts new instructions before any previously accepted instructions
have been completely processed and have exited from it. The latency of a pipeline execution
stage is the number of cycles between the time an instruction is issued for execution to
the EXE stage and the time a dependent instruction (which uses the result as an operand)
can be issued. In most cases, integer instructions have a single-cycle latency, while
floating-point add and multiply may have a 2-cycle latency. Other more complex
instructions (like integer multiply, floating-point square-root, and all divide instructions)
are computed iteratively and have longer latencies.
When a subtask result leaves one stage, the logic associated with that stage
becomes free and can accept new results from the previous stage. Thus, the rate at which
instructions are fed to the pipeline is chosen in relation to the time required to get an
input through one stage, with the main goal of keeping all portions of the pipeline fully
utilized. Once the pipeline is full the output rate will match the input rate.
All stages of the pipeline operate in parallel, each one executing a step from a
different instruction. (Storage buffers exist between stages to hold temporary results and
inputs to the next stage.) An individual instruction’s computations advance from one
stage to the next, and the instruction gets closer to completion as the end of the pipeline is
approached. In an ideal pipeline each stage takes the same amount of time to execute its
task; in order to simplify the explanation, let’s assume that this time equals one processor
clock cycle CP. Thus, if each instruction is broken down into n steps, then once the n-
stage pipeline becomes full, it effectively executes n instructions simultaneously. Each
instruction still needs the same amount of time to complete from start to finish (n basic
clock cycles), but because n instructions are being processed at a time, once the pipeline
is filled, the rate at which instructions are completed in an ideal pipeline is n times as
rapid (i.e., the instruction bandwidth has increased n times). The time per instruction on
the pipelined processor is equal to the time per instruction on the nonpipelined
processor divided by the number of pipeline stages. This processor would complete
one instruction per processor clock cycle, i.e., it will have a CPIP = 1.
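This fill-then-stream behavior reduces to a one-line formula, sketched here (the function name is ours):

```python
def total_clocks(n_instructions, n_stages=5):
    """Clocks for an ideal pipeline: n_stages clocks to fill it, then one
    instruction completes every processor clock cycle thereafter."""
    return n_stages + (n_instructions - 1)

total_clocks(1)             # 5: a single instruction still passes through all 5 stages
total_clocks(2)             # 6 clocks, versus 10 for nonpipelined back-to-back execution
total_clocks(1000) / 1000   # effective CPI approaches the ideal value of 1
```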

A number of processors have implemented internal pipelines with at least 5
stages. In this case, the machine instruction is broken down into 5 steps, each step
requires one processor clock cycle CP, and up to 5 machine instructions can be executed
concurrently. Consider such a hypothetical scalar processor pipeline with the following 5
stages, shown in Figure 1.19:
IF (instruction fetch): The stage accesses the (on-chip) ICACHE to fetch the
instruction. This involves generating the ICACHE address, sending it to the
cache, identifying the cache entry, and reading out its content38.
ID (instruction decode): This stage involves the actual reading of the instruction
from the ICACHE39 (actually, a number of instruction bytes are fetched with a
single cache access), sending it to the decoder of the control unit, decoding it and
determining its length. For a register-to-register instruction, accessing the CPU
register file is done at this time to read the operands needed in the operation. If
the instruction is a LOAD/STORE or branch instruction, the address offset in the
instruction is advanced to the EXE stage.
EXE (execute operations): For an arithmetic/logic instruction, this stage uses the
ALU to execute the operation on the two register operands. If it is a LOAD or
STORE instruction, the effective memory address is computed in this stage.
(There usually exists separate hardware from the ALU to do the effective address
calculation). If the instruction is a branch instruction, this stage decides whether
the branch is to be taken or not, and generates the target address40.
MEM (memory access): Although it is usually called the “memory access” stage,
in reality it refers to LOAD and STORE instructions using the computed effective
address and accessing the on-chip DCACHE. Main memory will be accessed
only if there is a DCACHE miss. A LOAD will read an operand from the
DCACHE for an internal CPU register; a STORE will store in the DCACHE the
result of the ALU or the contents of a specified register.

38. As we will explain in detail in Chapter 6, a translation from virtual to physical address may be required in this step, involving part of the MMU hardware. (In processors with paging, this hardware is called the TLB or "translation look-aside buffer".)
39. We assume here that the instruction was found in the ICACHE, i.e., we had a "cache hit."
40. Like the IF stage, the address generation step in this stage may also involve a "TLB look-up".

N. Alexandridis Computer Systems Architecture: Microprocessor-Based Designs

Figure 1-19: Five-stage scalar pipeline with one processor clock cycle per stage. The five stages (IF, ID, EXE, MEM, WB) use 5 independent functional units (FUs):
• ICACHE: instruction memory for the instruction fetch (IF) stage
• Decoder and the register file's (RF) read ports for the ID stage
• ALU for the execution (EXE) stage
• DCACHE: data memory for the MEM stage
• Register file's (RF) write port for the write-register (WB) stage
Note: stages IF and ALU may involve a TLB lookup. If this pipeline operates at, say, 50 MHz, then each stage takes 20 ns to accomplish its task, and the "pipeline basic clock cycle" = 20 ns; we assume here that the "pipeline basic clock cycle" = processor clock cycle.


WB (write back): This stage is used to place into a CPU register either the source
operand read from the DCACHE or the ALU result produced at the EXE stage.

If the processor depicted in Figure 1.19 operates in a nonpipelined fashion (an
unrealistic assumption), then Figure 1.20a shows the execution of two successive
instructions, each requiring at least 5 processor clock cycles; total time for both
instructions is 10 processor clock cycles. In the 5-deep pipelined execution depicted in
Figure 1.20b, it is observed that the pipeline operates at 5 times the rate of the
nonpipelined scheme, and once the pipeline is filled, ideally it will complete one
instruction per processor clock cycle (CPIp = 1). Although this average CPIp is ideally
equal to 1, in practice its value is somewhere between 1.2 and 1.6. The instruction
cycle of each individual instruction (which includes fetch, decode, store results, etc.) will
still take a total of 5 processor clock cycles (under the assumptions of Figure
1.20a). Compilers are very crucial to the efficient use of the processor pipeline by
keeping it full with useful instructions.

Figure 1.20: Comparing a nonpipelined with a 5-stage scalar pipelined operation. (a) Nonpipelined: instruction 2 begins its IF stage only after instruction 1 has completed its WB stage. (b) Pipelined: a new instruction (IF, ID, EXE, MEM, WB) enters the pipeline every processor clock cycle CP, so instructions 1 through 6 complete within 10 clock cycles.


Example 1-11: CPI-frequency balance
We cannot easily judge the processor improvement achieved by considering only the
increase in the processor clock rate. Relatively deeper (and simpler) processor pipelines allow
for higher clock speeds. So what happens when we extend the pipeline by adding one additional
stage and increase the clock frequency by 50%? Would this also increase the processor's
performance by 50%? The naïve affirmative answer is wrong, as the following example (from
[25]) shows.
Consider a program segment that takes 100 clock cycles in a pipelined processor with a
processor clock frequency of 100 MHz. Then, the baseline internal architecture of the processor
would take 1 microsecond to execute this segment. Suppose we modify the processor’s internal
pipeline by adding an extra stage for LOADs. If LOADs are 30% of all operations (and assuming
each pipeline stage takes one processor clock), this would add 30 clocks to this program
segment, which would now take 130 clocks to execute. However, since this extra pipeline stage
allows the processor clock frequency to increase to 150 MHz, the total execution time becomes
130/150 = 0.867 microseconds. This amounts to a 15 percent performance improvement
(1/0.867 ≈ 1.15), instead of the easy (and wrong) answer of a 50 percent improvement.
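The example's arithmetic can be reproduced in a few lines (Python; clock counts divided by frequency in MHz give time in microseconds):

```python
baseline_clocks = 100
baseline_mhz = 100.0
baseline_time_us = baseline_clocks / baseline_mhz  # 1.0 microsecond

extra_clocks = 0.30 * baseline_clocks              # one added clock per LOAD
new_clocks = baseline_clocks + extra_clocks        # 130 clocks
new_time_us = new_clocks / 150.0                   # at the faster 150-MHz clock

speedup = baseline_time_us / new_time_us
print(round(new_time_us, 3), round(speedup, 3))    # 0.867 1.154
```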

1.6.2 Advanced Microprocessor Operation

In this section we present the operation of more advanced microprocessor systems, using
as an example the baseline 5-stage pipelined processor model of Figure 1.21.
Figure 1.21: The model of a 5-stage scalar pipelined processor (one instruction is executed per processor clock cycle). The BUI (bus interface unit) connects the chip to the external bus lines; the ICACHE (instruction cache) feeds the instruction prefetch, queue, decode and dispatch logic; instructions are executed in the FXU (integer execution unit) or the FPU (floating point execution unit), which share the data register file (RF); the DCACHE (data cache) supplies data/operands for the MEM stage.

Since most microprocessors contain an on-chip ICACHE, instruction fetching is
done during the IF stage from this ICACHE; this means the IF phase will need to start
external bus cycles to access main memory only when there is an ICACHE miss, i.e.,
when the instruction was not found in the ICACHE.

The decode stage takes its instructions from the on-chip ICACHE. Remember
that access to main memory will take place only if there is a miss on the ICACHE.
During such a long access to main memory the processor is either “stalled” waiting for
the instruction to be fetched or -- in more sophisticated processors -- will instead go
ahead and decode and execute other instructions already prefetched and queued from the
on-chip ICACHE. For example, a processor may incorporate two independent, line-size
(32-byte), undecoded instruction-stream queues: one to buffer sequential instructions
and the other to buffer instructions from the branch target buffer (for example, the Pentium
processor discussed later in Example 1-12). Another processor may have only a single
instruction stream queue, from which its dispatch logic can choose -- not necessarily in
strict FIFO order -- which, say, n of the bottom m instructions to issue for execution to its
n functional units (e.g., the PowerPC chooses which 3 of the bottom 4 instructions to
issue to its 3 functional units FXU, FPU, and branch processing unit).
More advanced processors, in addition to the on-chip cache (the "level-1 cache"),
may also have an additional external "level-2 cache"; in that case, when there is a miss
in the on-chip cache, the level-2 cache will first be examined for the information
(instruction or data); only if there is a miss in this second-level cache, too, will main
memory be accessed. First-level cache misses initiate bus transactions (to main
memory or the level-2 cache) in the form of "cache line transfers" (as explained in detail in
Chapter 2).
In any case, when the ICACHE is accessed, instead of fetching only one
instruction, a number of instructions are prefetched by reading out a whole "cache line"
(for example, 16 or 32 bytes long). The prefetched bytes are then rotated so that they are
justified for the instruction decoder, i.e., pointing to the starting point of the next
instruction.
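This fetch-and-rotate step can be sketched as follows (Python; the 16-byte line size and the byte-indexed memory are illustrative assumptions, not a particular processor's parameters):

```python
LINE_SIZE = 16  # assumed cache-line size in bytes

def fetch_line(memory, pc):
    # A cache access returns the whole aligned line containing pc.
    base = (pc // LINE_SIZE) * LINE_SIZE
    return base, memory[base:base + LINE_SIZE]

def rotate_for_decoder(line, base, pc):
    # Rotate the prefetched bytes so the next instruction's first byte
    # is justified at position 0 for the decoder.
    offset = pc - base
    return line[offset:] + line[:offset]

memory = bytes(range(64))           # stand-in for ICACHE contents
base, line = fetch_line(memory, 0x15)
rotated = rotate_for_decoder(line, base, 0x15)
print(hex(rotated[0]))              # 0x15: the byte at pc now leads the line
```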

After instruction decode, a "dispatcher" may do some instruction identification
and operand dependency resolution and submit the instruction for execution to one of the
EXE core units; for example, in Figure 1.21, integer instructions will be issued to the
FXU (fixed-point unit) and floating point instructions to the FPU (floating point unit).
While in most processors this fetch, decode, and execute is done “in-order” (i.e.,
instructions are executed in the order they appear in the program), more advanced
superscalar processors dynamically schedule and execute instructions “out-of-order”
(without regard to their original ordering in the program). Superscalar processors
(discussed in Section 1.7.2) have a number of such decoders that operate in parallel to
decode more than one instruction per clock cycle. Decoded operations may be placed in
a "reservation station" -- a buffer or table -- and wait there until all the operands for an
instruction are ready and a functional unit becomes available in the CPU to execute the
instruction. This out-of-order execution is one way to extract more “instruction-level
parallelism” out of the program code.

The register file in the CPU is accessible by both the integer and the floating
point units, or each unit may have its own specialized registers. The out-of-order
execution units are intelligent enough to know the original order of the instructions in the
program and to re-impose program order when the results are to be committed ("retired") to
their final destination registers.

Finally, all operands are fetched with LOAD instructions from the on-chip
DCACHE; only on a cache miss will the processor start an external bus cycle to access
main memory (or a second-level cache) for the operand. (Quite often in such cases the
processor will not only fetch the requested operand but will start a "burst bus cycle41" to
fetch the whole "cache line" that contains this operand in memory.) Similarly, all
STORE instructions store their results in the on-chip DCACHE. (Whether this result is
also sent to main memory depends on how the DCACHE works, as will be explained
in more detail in Chapter 5.)

Quite often, to isolate operations between main memory and the DCACHE and to
smooth the flow of data between the slower memory and the faster processor, “load
queues” and “store buffers” are used inside the processor. “Load queues” hold operands
that come from memory during read cycles, while the “store buffers” hold data that the
processor sends to memory during write cycles. For example, the 486 and the Pentium
have only store buffers, while other advanced RISC processors have both load queues
and store buffers.
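A minimal sketch of a store buffer's behavior (Python; a simplified FIFO model that omits the forwarding of buffered data to younger LOADs that real designs perform):

```python
from collections import deque

class StoreBuffer:
    """The CPU's writes are queued so the pipeline does not wait for
    slow memory; writes drain to memory when the bus is free."""
    def __init__(self, memory):
        self.memory = memory
        self.pending = deque()

    def store(self, addr, value):
        self.pending.append((addr, value))   # CPU proceeds immediately

    def drain_one(self):
        # Retire the oldest buffered write to memory.
        if self.pending:
            addr, value = self.pending.popleft()
            self.memory[addr] = value

mem = {}
sb = StoreBuffer(mem)
sb.store(0x100, 42)
print(0x100 in mem)    # False: the write is buffered, not yet in memory
sb.drain_one()
print(mem[0x100])      # 42
```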

1.6.3 Problems with Pipelining

The above description applies to ideal operations of pipelines in which all stages require a
single processor clock cycle and instructions are issued to the pipeline in such a way as to
keep it always filled. Unfortunately, data dependencies and control hazards (associated
with the execution of conditional-branch instructions), present various problems that do
not allow such ideal operation. We will give here only a brief introduction to the type of
problems that this internal pipelining may face.

a) Data dependencies

One data dependency problem is the read-after-write or RAW hazard42. This problem arises
when an instruction depends on previous ALU results; i.e., a subsequent instruction
requires data that are to be produced by the previous instruction. In this case, the
instruction should not be started before these results are available43. Another problem is
the write-after-write or WAW hazard44: an instruction writes to a register (or
resource in general) after a subsequent instruction has already written to the same

41. Burst bus cycles are covered in Chapter 2.
42. This RAW hazard is also called "true data dependence" or "destination-source conflict".
43. Processors with a single, unified, on-chip cache avoid this read-after-write hazard by giving priority to the WB (write back) stage over the "operand fetch" stage (since both share this single on-chip cache).
44. This WAW hazard is also called "output data dependence" or "destination-destination conflict".
register, thus leaving the register with old, stale data. A third data dependency problem is
the write-after-read or WAR hazard45: an instruction attempts to write a result to a
register (or a resource) before a previous instruction reads the old data. Finally,
a problem may sometimes arise when a preceding store operation attempts to modify an
instruction in memory. This can be handled by the decoder in the control section,
which tests each instruction as soon as it is loaded to see whether it is a store-type
instruction; if so, the fetch sequence must wait until the effective address has
been prepared, to see whether it is going to modify a successive instruction.

Both software and hardware solutions have been used to handle these data dependencies;
they are listed here without further explanation.
Software solutions include compiler-inserted NOOPs (no-operation
instructions), basic-block scheduling, and list scheduling. Hardware solutions include
pipeline interlocks (always stall the pipeline until the dependency is resolved),
forwarding (always pass the ALU result directly to the functional unit that requires it,
before it is written into the register), scoreboarding (check whether a decoded instruction
has a destination register used as a source register by another instruction, to ensure that a
source operand is not fetched from a register that is currently waiting for a result),
reservation stations (a distributed way of detecting data dependencies and passing
results directly to the functional units rather than to registers), and register renaming (which
requires an increased number of registers to allow for multiple instances of registers, and
is implemented through a mapping table, a reorder buffer, or a future file).

b) The “load delay” problem


To reduce the problem that different pipeline stages may require different
amounts of time (for example, when complex instructions exist that require access to
operands in memory), RISC processors have incorporated a large number of registers and
simple register-to-register instructions, and have implemented what is called a "load/store
architecture": all operations are performed on operands held in internal registers, and
main memory is accessed explicitly, with only LOAD and STORE instructions.
However, this gives rise to another problem, the load delay problem: because memory is
slower than the processor pipeline, the loaded operand is not immediately available to
subsequent instructions that are already inside the pipeline. For example, consider the
operation of the 5-stage pipeline as shown in Figure 1.20b: assume that the first
instruction is a LOAD instruction; data from this LOAD instruction will not be available
until the end of this instruction’s MEM cycle (i.e., the end of processor clock cycle 4),
which is too late for the second instruction to use it when it is in its EXE stage (into
which this second instruction enters at the beginning of clock cycle 3.)
The load delay problem must be solved without stalling the flow of instructions
through the pipeline when a LOAD is executed. Two approaches are (a) to let the
software (compiler) take care of it statically, or (b) have the processor hardware handle it
dynamically (at run time). In the software solution, the compiler is required to reorder

45. This WAR hazard is also called "anti data dependence" or "source-destination conflict".

instructions so that some other, unrelated instruction of the program is executed in the
"load delay slot" (i.e., is placed between the LOAD instruction and the immediately
following instruction that uses the loaded data); if no such unrelated instruction can be
found, the compiler inserts a NOOP instruction in the "load delay slot". One hardware solution,
followed by the Intel 486, is to eliminate the load delay by rearranging the pipeline so that
memory addresses are computed in the second decode stage D2 of the pipeline, before the
EXE stage [4]. Superpipelined processors may have a load delay slot of 2 or more (see
the MIPS R4000 in Appendix B), making the solution to the load delay problem more
difficult.
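The compiler-side fix can be sketched as a simple pass over the instruction list (Python; a deliberately simplified version that only pads the delay slot with a NOOP, omitting the search for an unrelated instruction to move into it):

```python
def pad_load_delay_slots(program):
    """program: list of (op, dest, srcs) tuples in program order. After each
    LOAD whose result is used by the very next instruction, insert a NOOP
    into the one-cycle "load delay slot"."""
    out = []
    for i, (op, dest, srcs) in enumerate(program):
        out.append((op, dest, srcs))
        if op == "LOAD" and i + 1 < len(program):
            _, _, next_srcs = program[i + 1]
            if dest in next_srcs:          # next instruction uses loaded data
                out.append(("NOOP", None, ()))
    return out

prog = [("LOAD", "r1", ("addr",)), ("ADD", "r2", ("r1", "r3"))]
scheduled = pad_load_delay_slots(prog)
print([op for op, _, _ in scheduled])   # ['LOAD', 'NOOP', 'ADD']
```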

c) Control hazards

Finally, problems arise when a conditional branch instruction is encountered and a
previous instruction that sets the condition codes has not yet been completed; the
control section -- not knowing yet whether the branch conditions are satisfied -- does
not know whether to prefetch the instruction immediately following this
conditional branch instruction or the instruction at the target address of the branch.

Software solutions include the following (listed here without further discussion):
branch spreading, scheduling the branch delay slot, branch folding, software (static)
branch prediction, trace scheduling, loop unrolling, software pipelining, register
renaming, and scheduling across branches.

Hardware solutions include the following:


1. The control section can stop the flow of instructions (i.e., halt the pipeline)
before fetching the next instruction, until the preceding operation is finished
and the results (upon which the branch decision will be based) are known. No
matter whether the branch is taken or not, this always causes a delay.
2. The second way, called branch prediction, makes a guess as to which way the
branch is going to go before it is taken, follows this path, and continues
preparing instructions; if the guess later proves to be wrong, the pipeline must
be flushed clean and restarted with the correct instruction. The Intel 486
follows this approach: it assumes that the branch is taken and -- in parallel
with the operation of the pipeline stages -- the CPU runs a "speculative fetch
cycle" to the target address of the branch. If the CPU evaluates the branch
condition as true, the fetch to the target address refills the pipeline; otherwise,
the processor loses three clock cycles. (Such hardware prediction has been
implemented at either the fetch stage or the decode stage.)
3. In a third alternative, called prefetching of multiple paths, the control section
can fetch the instructions immediately following the conditional branch
instruction as well as instructions from the target address of the branch
simultaneously, and when the ALU finally figures out the branch conditions,
then decide which one of the two prefetched groups of prepared instructions to
use.
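The cost of these choices can be captured in a small CPI model (Python; the branch frequency and misprediction rate below are assumed illustrative figures, with the 3-cycle flush penalty quoted above for the 486):

```python
def effective_cpi(base_cpi, branch_frac, mispredict_rate, flush_penalty):
    # Each mispredicted branch wastes 'flush_penalty' clock cycles
    # refilling the pipeline.
    return base_cpi + branch_frac * mispredict_rate * flush_penalty

# Ideal CPI of 1, 20% branches, 10% of them mispredicted, 3-cycle flush:
print(effective_cpi(1.0, 0.20, 0.10, 3))   # 1.06
```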


1.7 HIGH-PERFORMANCE PROCESSORS


High-performance processors46 are used for more advanced applications (such as
visualization, large databases, digital video applications, advanced scientific computing,
client/server computing, etc.) and for the design of engineering and 3-D graphics
workstations, servers, SMP (symmetric multiprocessing) systems, and supercomputers.
High-performance processors discussed here include the superpipelined, superscalar,
superpipelined-superscalar, and VLIW (very long instruction word) processors.

1.7.1 Superpipelined (Scalar) Processors

Having introduced a pipeline into modern processors, how can the pipeline be modified
to further decrease the processor’s TP and thus increase its throughput? Superpipelining
accomplishes this by increasing the pipeline depth and increasing the pipeline clock rate;
in other words, superpipelining increases processor performance by reducing the
processor cycle (i.e., by reducing the Cp term in Eq. 1.7). Longer pipelines provide finer
granularity in instruction execution; for example, a bottleneck stage or a stage that
requires longer time to execute can be subdivided into two independent stages.
Increasing the pipeline depth, however, requires faster clocks and an increase in the rate
at which instructions enter and leave the pipeline.

Figure 1.22 shows the model of a 10-stage superpipelined processor. Comparing
it with the 5-stage pipelined model of Figure 1.21, we notice that -- in this example --
each stage of Figure 1.21 has been split in Figure 1.22 into two substages: the
instruction fetch stage has been split into two substages (instruction fetch first, or IF1, and
instruction fetch second, or IF2), the instruction decode stage into first (ID1) and
second (ID2) substages, and so on.

Figure 1.23a shows again the operation of our example 5-stage pipeline of
Figure 1.20b, with its basic clock cycle equal to the processor clock cycle (CP =
20 ns); we said that this pipelined processor completes one instruction every 20 ns, i.e., it
has a CPIp = 1. Figure 1.23b now depicts how the operation of an ideal 10-stage
superpipelined implementation of degree 2 compares with it. (A major stage in the
pipelined processor is replaced by two substages, the substages are clocked at twice the
frequency of the major stage, and the processor initiates an operation at each substage on
each of the smaller clock cycles.) In general, in a superpipelined implementation of
degree n, the pipeline clock needs to be n times as fast as the pipeline clock of the basic
pipeline. Thus, the pipeline clock rate of Figure 1.23b has doubled to 100 MHz (which
allows feeding instructions to the pipeline at twice the previous rate). When comparing
the pipelined implementation in Figure 1.23a with the superpipelined
implementation of degree 2 in Figure 1.23b, we notice that the superpipelined approach
now completes two instructions every 20-ns basic clock cycle; i.e., it is twice as fast as
the simple pipelined implementation.

46. High-performance processors include chips such as the Pentium II and Pentium Pro, the MIPS R4000- and R10000-series, the PowerPC 620, the Alpha 21164, the UltraSparc-II, the PA-8000, etc.
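The ideal comparison reduces to a throughput formula (instructions completed per microsecond = clock rate in MHz divided by CPI); a sketch:

```python
def instr_per_us(clock_mhz, cpi):
    # Steady-state completion rate: f / CPI instructions per microsecond.
    return clock_mhz / cpi

base = instr_per_us(50, 1.0)     # 5-stage pipeline of Figure 1.23a
deg2 = instr_per_us(100, 1.0)    # degree-2 superpipeline: same CPI, 2x clock
print(base, deg2)                # ideally twice the throughput
```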

Figure 1.22: The model of a 10-stage superpipelined (scalar) processor of "degree-2". Each major stage of the baseline processor is split into two substages (IF1, IF2, ID1, ID2, EXE1, EXE2, MEM1, MEM2, WB1, WB2); it has twice as many stages and twice as fast a clock when compared with the baseline pipelined processor of Figure 1.21.

A superpipelined processor requires less hardware than that of a superscalar
processor (to be discussed next), since not all of its stages need to be subdivided into
substages; but it does rely on faster internal hardware logic that can operate from faster
clocks. Although -- for a given set of operations -- the superpipelined processor takes
longer than the superscalar processor to generate all results, it can complete simple
operations sooner. Higher clock speeds, however, make layout, routing, and clock
distribution inside the processor chip more difficult.
There is, moreover, a point of diminishing returns beyond which it does not pay to
subdivide the pipeline into more stages.

Appendix A.3 presents the details of an 8-stage superpipelined processor, the
MIPS R4000-series.


Figure 1.23: Five-stage scalar pipelined and 10-stage scalar superpipelined operations [19]. (a) 5-stage scalar pipeline: internal processor clock = 50 MHz, each stage takes 20 ns, one instruction is issued every 20 ns; ideally, CPIp = 1. (b) 10-stage scalar superpipeline of degree 2: internal processor clock cycle C'P = 10 ns; each major stage is split into two substages (IF1, IF2, ID1, ID2, EXE1, EXE2, MEM1, MEM2, WB1, WB2), and a new instruction enters the pipeline every 10 ns.


1.7.2 Superscalar Processors

a) Basic concepts

Superscalar processors are built on the principle that more than one instruction can be
fetched, decoded, executed, and completed in parallel. A prerequisite to the superscalar
architecture is the existence inside the processor of a number of independent “functional
units” (integer execution units, floating-points units, load/store units, graphics units, etc.).
A superscalar implementation of degree or way or issue m means that the processor can
(ideally) fetch, decode, issue, complete, and “retire” (or “graduate”) m instructions per
processor (pipeline) clock cycle. An m-way superscalar may have more than m
independent functional units (to allow the execution of more than m instructions per
clock cycle), but it still retires or graduates only m instructions per clock cycle.
In a superscalar implementation, the processor clock remains the same as that
of our earlier regular (basic) scalar implementation, but superscalar techniques increase
processor performance by reducing the average number of clock cycles per instruction
(i.e., by reducing the CPIp term in Eq. 1.7).

Figure 1.24 shows the operation of a hypothetical "degree-2" superscalar
implementation; each stage of the scalar pipelined processor of Figure 1.20b has been
duplicated. Like our previous basic pipelined scalar processor, this superscalar processor
again operates from a 50-MHz clock, but now contains two independent 5-stage pipelines
(each of the type shown in Figure 1.19) operating concurrently. Two instructions
can now be fetched and decoded together and issued in parallel for execution in two
separate pipelines; since two instructions will complete every 20 ns, this superscalar
implementation will execute twice as many instructions as our basic
pipeline scheme and, therefore, ideally its CPIp = 0.5.
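Equivalently, in terms of average time per completed instruction (Python; ideal figures, ignoring stalls and dependencies):

```python
def ns_per_instruction(clock_ns, cpi):
    # Average time per completed instruction.
    return clock_ns * cpi

scalar = ns_per_instruction(20, 1.0)        # basic pipeline: CPIp = 1
superscalar = ns_per_instruction(20, 0.5)   # degree-2: same clock, CPIp = 0.5
print(scalar, superscalar)                  # the superscalar halves the time
```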

Figure 1.25a shows the model of an 8-stage scalar processor, and Figure 1.25b
that of an 8-stage 4-way superscalar processor. In both the scalar and the superscalar
processor, the functional units themselves are pipelined47.

The first difference we notice is that a 4-way superscalar architecture fetches


from the ICACHE and decodes 4 instructions per (processor) cycle.
In the decode stage, four decoders operate in parallel. For branch instructions, the
branch path is predicted and the target address computed. Quite often, the fetch-decode
operation is an "in-order" sequence that takes as input the user program's instruction
stream from the ICACHE and decodes it into a sequence of microoperations that
represent the dataflow of that instruction stream. Actually, the ICACHE works together
with the BTB (Branch Target Buffer48) and a pointer to the next instruction to do
speculative program prefetch. The ICACHE fetches the proper cache line, say 16 bytes
long, and presents 16 aligned bytes to the decoders every clock cycle. The decode stage
buffers multiple 16-byte fetches and rotates the prefetched bytes to the starting point of the
next instruction.

47. When an instruction that takes a long time to complete is executed, functional unit conflicts between two similar instructions can be significantly reduced by pipelining this functional unit. Pipelining is typically much less expensive than duplicating the functional unit.
48. The BTB (Branch Target Buffer) is used to correctly predict most of the branches' micro-operations.

Figure 1.24: Superscalar operation: internal processor clock = 50 MHz, each stage takes 20 ns, 2 instructions are dispatched simultaneously and executed in 1 processor clock cycle; ideally, CPIp = 0.5 [19].

Because most superscalar processors include large register sets, "register
renaming" and initial dependency checks are performed before instructions are issued to
the functional units. Register renaming is used to ensure that each instruction is given
correct operands, and is done by dynamically mapping the logical register numbers
("names") used in the instruction to physical registers. A logical register is mapped to a
new physical register whenever that logical register is the destination of an instruction.
Each time a new value is put in a logical register, it is assigned a new physical register
(thus, each physical register holds only a single value). The RF stage thus becomes very
complex, in order to increase instruction-level parallelism (it must identify data dependencies
and decide which, and how many, instructions will be issued to the functional units in parallel).
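A sketch of the renaming rule just described (Python; the register and physical-register names are illustrative):

```python
class Renamer:
    """Every new value written to a logical register gets a fresh physical
    register, so each physical register holds only a single value."""
    def __init__(self):
        self.map = {}        # logical name -> current physical register
        self.next_id = 0

    def write(self, logical):
        phys = f"p{self.next_id}"
        self.next_id += 1
        self.map[logical] = phys
        return phys

    def read(self, logical):
        return self.map[logical]

r = Renamer()
first = r.write("r1")    # r1's first value lives in p0
second = r.write("r1")   # a second write gets p1: no WAW/WAR conflict on r1
print(first, second, r.read("r1"))
```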
Such processors are referred to as “in-order” superscalar processors: multiple
instructions are fetched each cycle, and several consecutive instructions can begin
execution simultaneously if all their corresponding operands are available. The
processor, however, will stall at any instruction whose operands are not available.
In the superscalar example of Figure 1.25b, there exist five independent
functional units: 2 single-stage FXUs, 2 three-stage pipelined FPUs, and 1 two-stage
pipelined load/store unit. The processor is a "4-way" superscalar if it can issue, execute,
and complete up to 4 instructions in parallel (although physically this model chip
contains five functional units). The clock cycle of the superscalar processor is the same


as that of the scalar processor. To select the ideal mix of instructions, superscalar
microprocessors follow different "instruction issue rules". (As an example, the
"instruction pairing" and "instruction issue rules" for the Pentium are given below in
Example 1-12.)
An instruction is issued when it is handed over to a functional unit for execution.
More advanced processors have made the fetch/decode/RF unit more intelligent in
predicting program flow, and have included some kind of "instruction pool", or one or more
"reservation stations" or "instruction queues" (where decoded instructions wait until all
their operands are ready and a functional unit is available); instructions are then
"issued" in parallel for "out-of-order" execution to the many functional units. In an
out-of-order superscalar processor, each instruction is eligible to begin execution as soon as
its operands become available regardless of the original instruction sequence. The
hardware rearranges instructions in order to keep the various functional units busy. This
is called “non-sequential dynamic execution scheduling” or “dynamic instruction
issuing”. With such dynamic execution scheduling, the processor can operate at its
highest efficiency (functional units are kept from going idle) by reordering instructions to
suit the available functional unit resources. The instructions can be executed and
completed out-of-order and then reordered (or “retired” or “graduated”) back in their
original program order, as shown below in Figure 1.26. Such processors have replaced
the classical “execute” phase by two decoupled phases: an “issue/execute” and a “retire”
phase.
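A toy model of dynamic issue from an instruction window (Python; the window entries, unit names, and readiness sets are illustrative):

```python
def issue_ready(window, ready_regs, free_units):
    """Issue any instruction whose source operands are all ready and whose
    kind of functional unit is free, regardless of program order."""
    issued = []
    for instr in list(window):
        if (free_units.get(instr["unit"], 0) > 0
                and instr["srcs"] <= ready_regs):
            free_units[instr["unit"]] -= 1
            window.remove(instr)
            issued.append(instr["name"])
    return issued

window = [
    {"name": "I1", "unit": "FXU", "srcs": {"r1"}},  # stalls: r1 not ready
    {"name": "I2", "unit": "FPU", "srcs": {"f2"}},  # ready: issues out of order
]
issued = issue_ready(window, ready_regs={"f2"},
                     free_units={"FXU": 1, "FPU": 1})
print(issued)   # ['I2'] -- the younger instruction bypasses the stalled one
```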
Figure 1.25: Models of scalar and superscalar processors. (a) The model of an 8-stage pipelined scalar processor (one instruction executed per cycle): stages IF, ID, RF access, EXE1, EXE2, EXE3, MEM, and WB, with a BUI, ICACHE, DCACHE, one FXU, and a three-stage pipelined FPU. (b) The model of an 8-stage pipelined 4-way superscalar processor ("4-way" or "4-issue" or "degree-4": up to 4 instructions executed per cycle; for example, 2 integer and 2 floating point instructions executed simultaneously in their functional units). Four decoders operate in parallel; a BTB (Branch Target Buffer) assists instruction fetch; dependencies are resolved and instructions issued at the RF-access stage; the functional units are 2 FXUs, 2 three-stage pipelined FPUs, and a load/store unit. The "4-way" superscalar has at least 4 independent functional units, but the same clock as the scalar.

An instruction is complete when its result has been computed and stored in a
temporary physical register.

[Figure: timeline of one instruction: Fetch and Decode (in order); Issue, Execute, and Complete (out of order); Graduate (in order).]

Figure 1.26: Dynamic instruction execution scheduling


Finally, an instruction retires or graduates when its temporary result is
committed as the new architectural state of the processor (to the final destination user-
visible register). This is done by checking the status of the completed instructions in the
instruction pool and determining which of them can be removed from the pool and
retired. Retirement is done “in-order” (i.e., the original program order is imposed on
the retired instructions): an instruction can retire only after it and all previous
instructions have been successfully completed. (For an example of an advanced
microprocessor with dynamic instruction scheduling, out-of-order execution, and in-order
retirement, see the Pentium Pro in Appendix A.5.1).
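The fetch/decode (in order), issue/execute/complete (out of order), and retire (in order) phases just described can be illustrated with a toy cycle-by-cycle model. (This sketch is ours, not any particular processor's logic: it assumes an unbounded instruction pool, identical functional units, and no register renaming or branch prediction.)

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Instr:
    name: str
    dest: str
    srcs: Tuple[str, ...]
    latency: int
    start: Optional[int] = None    # cycle execution begins (issue)
    done: Optional[int] = None     # cycle the result completes (temp register)
    retired: Optional[int] = None  # cycle the result commits architecturally

def schedule(program, n_units=2):
    # Record, for each instruction, the earlier instructions producing its sources.
    last_writer, deps = {}, {}
    for ins in program:
        deps[ins.name] = [last_writer[s] for s in ins.srcs if s in last_writer]
        last_writer[ins.dest] = ins
    cycle, head = 0, 0                     # head = in-order retirement pointer
    while head < len(program):
        cycle += 1
        busy = sum(1 for i in program if i.start and i.done >= cycle)
        for ins in program:                # issue out of order: operands ready?
            if busy >= n_units:
                break
            if ins.start is None and all(d.done is not None and d.done < cycle
                                         for d in deps[ins.name]):
                ins.start, ins.done = cycle, cycle + ins.latency - 1
                busy += 1
        while (head < len(program) and program[head].done is not None
               and program[head].done < cycle):   # retire in program order only
            program[head].retired = cycle
            head += 1
    return program

prog = [Instr("I1", "r1", ("r0",), 3),   # long-latency operation
        Instr("I2", "r2", ("r1",), 1),   # waits on I1's result
        Instr("I3", "r3", ("r0",), 1),   # independent: completes before I2
        Instr("I4", "r4", ("r3",), 1)]
for ins in schedule(prog):
    print(ins.name, "start:", ins.start, "done:", ins.done, "retired:", ins.retired)
```

Running it shows I3 and I4 completing before the dependent I2, yet all four results committing in program order.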

b) Examples of superscalar processors


In this section we present the basic characteristics of the following superscalar
microprocessors: the 32-bit Intel Pentium, and four 64-bit microprocessors: the MIPS
R10000, Sun Microsystems UltraSparc-II, PowerPC 620, and HP PA-8000. We have
classified the Intel Pentium Pro and the DEC Alpha 21164 as superpipelined-superscalar
processors. (See Section 1.7.3).
1. Intel Pentium: Figure 1.5d depicts the external view of the Pentium microprocessor
with its address and data bus lines (the Pentium Pro is discussed in Section 1.7.3).
The Pentium is a 32-bit superscalar microprocessor (with 32-bit internal registers and
ALUs, 32-bit physical addresses, and 64-bit external data bus), has a degree of two (a
2-way instruction issue), has two integer ALUs (the “u-pipe” and the “v-pipe”) and
one FPU, and its longest pipeline is 8 stages. (More details on this Pentium processor
are given below in Example 1-12.)

2. MIPS R10000: This processor is presented in more detail below in Appendix A.4.2.

3. Sun Microsystems UltraSparc-II: Figure 1.5d depicts the external view of the
UltraSparc microprocessor with its address and data bus lines. The UltraSparc is a
pure 64-bit superscalar microprocessor (with 64-bit internal registers and ALUs, 41-
bit physical addresses, a 128-bit external data bus, and an independent SBus for
I/O to slower peripherals); it is a 4-way superscalar, contains nine execution units
(two integer ALUs, five FPUs, a branch-processing unit, and a load/store unit), its
longest pipeline is 9 stages, it does not issue instructions out-of-order, it is one of the
very few processors that have a large number of registers to implement “register
windows”, and it is optimized more for multimedia and graphics applications.

4. PowerPC 620: Figure 1.5d depicts the external view of the PowerPC
microprocessor with its address and data bus lines. The PowerPC is a pure 64-bit
superscalar processor (with 64-bit registers, 40-bit physical addresses to access 1
TB49 of physical main memory, and a 128-bit external data bus), it can software-
switch between 64- and 32-bit modes (and between big-endian and little-endian
modes), it is a 4-way superscalar, contains six functional units (three integer ALUs,
an FPU, a branch unit, and a load/store unit), it performs “dynamic instruction
scheduling and execution” (which allows it to implement “out-of-order” instruction
execution), utilizes “dynamic branch prediction” to minimize pipeline stalls, and uses
“register renaming”.

49
TB = tera-bytes = 1000 giga-bytes.

5. HP PA-8000: Figure 1.5d depicts the external view of the HP PA-8000
microprocessor50. Like most of the above processors, the PA-8000 is a pure 64-bit
superscalar processor (with 64-bit registers, 64-bit physical addresses), it uses an
external 64-bit multiplexed bus, it can operate in big- or little-endian mode, and it is
a 4-way superscalar with out-of-order execution capability.
The chip contains 10 functional units: 2 integer ALUs, 2 integer shift/merge units,
2 floating-point multiply/accumulate (MAC) units, 2 floating-point divide/square-root
units, and 2 load/store units. The PA-8000 uses a 56-instruction-deep instruction
reorder buffer (IRB), which looks ahead to the next 56 instructions in the stream to
find 4 that can execute in parallel. Similar to the Pentium Pro (discussed later in
Section 1.7.3), the PA-8000 executes instructions out-of-order, uses register
renaming, and retires instructions in program order. None of its competitors has dual
load/store pipes or such a large instruction window for out-of-order execution.
Each cycle, the processor issues up to 2 instructions to the address units and an
additional 2 instructions to the computation units (FPUs and integer ALUs). If there
are more than two executable instructions of a given type, the oldest instructions
receive priority and are issued to the functional units. As in the R10000, there are no
reservation stations in the functional units; instructions are not issued until they are
ready to be executed.
The PA-8000 is the only high-performance chip that has no on-chip cache;
instead, its primary (level-1) data and instruction caches are implemented outside the
chip, accessed by a 128-bit wide interconnect. The advantage is that each off-chip
cache can be up to 4 MB in capacity, much larger than the 32-KB on-chip caches used
by processors such as the R10000 and 620. (Large caches are needed, because
applications whose data sets are too big to fit into an on-chip cache often perform
badly in RISC cores with small on-chip caches). The disadvantage is that these
external caches are expensive, because they must use ultrafast SRAMs to run at full
processor speed.
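The cache trade-off described above can be quantified with the usual average-memory-access-time formula; the numbers below are invented for illustration and are not measured PA-8000 figures.

```python
# AMAT = hit time + miss rate x miss penalty (cycles). A big off-chip L1
# pays an extra access cycle but misses far less often on large data sets.

def amat(hit_cycles, miss_rate, miss_penalty_cycles):
    return hit_cycles + miss_rate * miss_penalty_cycles

small_onchip  = amat(1, 0.08, 50)  # small on-chip cache, big working set
large_offchip = amat(2, 0.01, 50)  # large off-chip cache, one extra cycle
print(small_onchip, large_offchip)  # 5.0 2.5: the larger cache wins here
```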
The PA-8000 is considered today the industry’s most powerful microprocessor.
Its SPECint95 and SPECfp95 performance measurements are 110% and 120%
higher, respectively, than those of Sun’s UltraSparc, and 7% and 31% higher,
respectively, than those of Digital’s Alpha microprocessors.

→→--------------------------------------------------------------------→→
Example 1-12: A 32-bit superscalar processor (Intel Pentium)
The Intel Pentium is a 32-bit 2-way superscalar microprocessor (with 32-bit internal registers
and 32-bit addresses) but has a 64-bit external data bus to improve the data transfer rate [24,25].
The Pentium uses a number of hardwired simple instructions (loads, stores, and simple ALU
instructions), while the more complex ones are microcoded (thus, it is a mixture of CISC and RISC

50
PA stands for “Precision Architecture”.


processor). The input/system clock cycle is the same as the processor’s internal clock cycle (CS =
CP) and a Pentium bus cycle is composed of 2 system clock cycles.
Figure 1.5b shows the external view of the Pentium chip with its 29-line address bus
A31-A3, the 8 byte-enables BE0#-BE7#, the 64-line data bus D63-D0, and an 8-line DP7-DP0 bus
that carries data parity bits (one bit per byte-lane of the data bus). Internal parity checking is
done on the Pentium chip.

When comparing Pentium’s simplified internal structure in Figure 1.27a with that of the
486 in Figure 1.10, we notice that the Pentium has an internal Harvard architecture with a
separate 8-KByte ICACHE and an 8-KByte DCACHE, both internal and external data buses are
64-bit buses, the microprocessor has a superscalar architecture with two integer execution units51
(ALUs) called “u-pipe” and “v-pipe”, and a more sophisticated pipelined FPU. (However, it
does not have a parallel load/store unit that other superscalar microprocessors have). The on-
chip MMU is completely compatible with Intel 486’s MMU. Each cache in Figure 1.27a has its
own TLB (Translation Lookaside Buffer) to translate linear logical addresses to the physical
addresses used by each on-chip cache.
Integer instructions are executed by the Pentium by passing through two 5-stage integer
pipelines, the “v-pipe” and the “u-pipe”. Each integer pipeline has its own ALU, address
generation circuitry, and DCACHE interface. It also has an 8-stage FPU pipeline, the first 5
stages of which it shares with the integer unit (Figure 1.27b). Floating-point instructions use the
fourth stage (EX) to fetch the operands (called the OF stage) and the fifth stage (WB) as the first
execution stage (called X1); or one can say that a 3-stage floating-point instruction pipeline
(stages X2, WF, and ER) is appended to the integer pipelines. The v-pipe can execute simple
integer instructions as well as the FXCH floating-point exchange instruction; the u-pipe can
execute all integer and floating-point instructions. A floating-point instruction uses both integer
pipelines (for example, to fetch a 64-bit operand in a single cycle), while its execution takes
place in the u-pipe. (For this reason, except for the FXCH52 instruction, it is not possible to
perform two floating-point operations in parallel.)

The Pentium can issue:


• 2 “simple” instructions in parallel (according to the “instruction pairing rules”
described below); the first one of the pair is issued to the u-pipe, the second to the v-
pipe. (“Simple” instructions are entirely hardwired; they do not require any
microcode control and, in general, execute in one clock cycle);
• 1 floating-point instruction to the FPU (to the u-pipe), not paired with anything
else in the FPU;
• 1 floating-point instruction to the FPU (to the u-pipe) in parallel with the FXCH
floating-point exchange instruction to the v-pipe.
Thus, together, the dual pipes can issue 2 integer instructions in one clock, or 1 floating-
point instruction (under certain circumstances, 2 floating point instructions) in one clock.

a) Dual integer pipelines


As with the 486’s single pipeline, the Pentium’s dual pipelines execute integer
instructions in 5 stages:
PF (Prefetch): Instructions are prefetched from the ICACHE or memory. (Actually there are
two 32-byte or line-size buffers, one used to buffer sequential instructions and
the other instructions from the branch target address.)
D1 (Decode 1): Two parallel decoders in this stage attempt to decode and issue the next two
sequential instructions. The decoders determine whether one or two
instructions can be issued contingent upon the “instruction pairing rules”,

51
Two integer pipelines are also found in the Sun Microsystems SuperSparc and Motorola 88110.
52
The FXCH instruction swaps any FPU register with the top of the stack. On the Pentium, this
instruction can be issued to the v-pipe in parallel with most other floating-point instructions.

[Figure: Pentium block diagram: external bus lines, BUI (Bus Interface Unit), 8-KB ICACHE, prefetch buffers and two decoders, BTB (Branch Target Buffer), data register file (RF), two integer units (ALUs) forming the v-pipe and u-pipe, an FPU (add, multiply, divide), 8-KB DCACHE; 64-bit internal data paths and a 256-bit ICACHE fetch path; integer instruction pipelines of 5 stages, floating-point instruction pipeline of 8 stages.]

a) Pentium block diagram

[Figure: Pentium pipelines, stages 1 to 8: PF (ICACHE prefetch, BTB), D1 (decode instruction prefixes), D2 (calculate addresses), EX (execute; memory and register read), WB; the u-pipe and v-pipe are 5 stages each. For floating-point instructions the EX and WB stages become OF (operand fetch, FP data format conversion, memory write) and X1 (execute stage 1), followed by the FPU-only stages X2 (execute stage 2), WF (round results), and ER (error report), for 8 stages in all. Issues per clock: 2 integer instructions, or 1 FP instruction (or 2 FP instructions if one is an “exchange” instruction); FP instructions do not get paired with integer instructions.]

b) Pentium pipelines

Fig. 1.27: The Intel 32-bit Pentium 2-way superscalar processor


described below. (As in the 486, an extra D1 clock is required to decode
instruction prefixes).
D2 (Decode 2): As in the 486, during D2 the addresses of memory resident operands are
calculated.
EX (Execute): As in the 486, this stage is used for both ALU operations and for DCACHE
accesses. (Therefore those instructions specifying both an ALU operation and a
DCACHE access will require more than one clock cycle in this stage).
WB (Writeback): Instructions are enabled here to modify processor state and complete
execution.

There are certain rules for pairing instructions in the Pentium and for its integer pipelines’
operation:
1) Instruction pairing rules:
The Pentium processor can issue one or two instructions every clock cycle. In order to issue two
instructions simultaneously they must satisfy the following conditions:
• Both instructions in the pair must be “simple” instructions.
• The destination of the first instruction is not the source of the second53 (i.e., there
must be no read-after-write register dependence)
• The destination of the first instruction is not the destination of the second (i.e., there
must be no write-after-write register dependence)
• The first of the two instructions is not a jump instruction.
If these conditions are met, then the first instruction is issued to the u-pipe and the second to
the v-pipe. Else, only the first instruction is issued to the u-pipe.
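The four pairing conditions can be sketched as a simple predicate over (mnemonic, destination, sources) tuples. The SIMPLE set below is an illustrative subset, not Intel's complete list of pairable instructions; consult the Pentium manuals [26] for the authoritative rules.

```python
SIMPLE = {"mov", "add", "sub", "inc", "dec", "cmp", "and", "or", "xor"}
JUMP = {"jmp", "jcc", "call"}

def can_pair(i1, i2):
    """True if i1 may issue to the u-pipe with i2 alongside it in the v-pipe.
    Instructions are (mnemonic, dest_reg, source_regs) tuples."""
    op1, dest1, _srcs1 = i1
    op2, dest2, srcs2 = i2
    if op1 not in SIMPLE or op2 not in SIMPLE:  # both must be "simple"
        return False
    if dest1 in srcs2:                          # no read-after-write dependence
        return False
    if dest1 == dest2:                          # no write-after-write dependence
        return False
    if op1 in JUMP:                             # the first must not be a jump
        return False
    return True

print(can_pair(("add", "eax", ("eax", "ebx")), ("mov", "ecx", ("edx",))))  # True
print(can_pair(("add", "eax", ("eax", "ebx")), ("mov", "ecx", ("eax",))))  # False (RAW)
```

In the first call the two simple instructions are independent and issue together; in the second, the v-pipe instruction reads the u-pipe result, so only the first issues.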
2) Integer instruction issue rules:
Both the u-pipe and v-pipe instructions enter and leave the D1 and D2 stages in unison.
When an instruction in one pipe is stalled, the instruction in the other pipe is also stalled
at the same pipeline stage. Thus both the u-pipe and v-pipe instructions enter the EX stage in
unison. Once in EX, if the u-pipe instruction is stalled then the v-pipe instruction (if any)
is also stalled (but not the other way around): if the v-pipe instruction is stalled, the
instruction paired with it in the u-pipe is allowed to advance. No successive instructions are
allowed to enter the EX stage of either pipeline until the instructions in both pipelines have
advanced to WB.
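A small model (our simplification, counting only EX-stage stall cycles) summarizes the asymmetry of these rules:

```python
def ex_stage(u_stall_cycles, v_stall_cycles):
    """Cycles at which each instruction of a pair leaves EX, and the earliest
    cycle at which the next pair may enter EX (counting from cycle 1)."""
    u_leaves = 1 + u_stall_cycles                       # u waits only on itself
    v_leaves = 1 + max(u_stall_cycles, v_stall_cycles)  # v also waits on u
    next_pair = max(u_leaves, v_leaves) + 1  # both must advance toward WB first
    return u_leaves, v_leaves, next_pair

print(ex_stage(0, 0))  # (1, 1, 2): no stalls, the pair advances in unison
print(ex_stage(2, 0))  # (3, 3, 4): a u-pipe stall holds the v-pipe too
print(ex_stage(0, 2))  # (1, 3, 4): a v-pipe stall lets the u-pipe advance
```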

b) Floating point pipeline


Pentium’s FPU executes floating-point instructions in 8 stages:
PF (Prefetch): Same as in the integer pipeline
D1 (Decode 1): Same as in the integer pipeline
D2 (Decode 2): Same as in the integer pipeline
OF (Operand Fetch): FPU accesses both DCACHE and FP registers to fetch operands.
Converts FP data to external memory format and does a memory write. This
stage matches the EX stage of the integer pipeline.
X1 (FP Execute stage 1): Executes the first steps of the floating-point computation.
Converts external memory format to internal FP data format and writes
operand to FP register file.
X2 (FP Execute stage 2): Continues to execute the floating-point computation.
WF (Write floating-point): Performs rounding and writes floating-point result to register
file.
ER (Error Reporting): Reports errors and updates status word.

FP instruction issue rules:


As in the integer pipelines, the Pentium follows certain rules for how FP instructions are issued:
• FP instructions do not get paired with integer instructions.

53
Logic in D1 ensures that the source and destination registers of the instruction issued to the v-pipe
differ from the destination register of the instruction issued to the u-pipe.


• A limited pairing of two FP instructions can be performed (only when the second
instruction in the FP pair is the “FP exchange” instruction FXCH).
• FP instructions that are not directly followed by an FP exchange instruction are
issued singly to the FPU (to the u-pipe).

←←----------------------------------------------------------←←

1.7.3 Superpipelined-Superscalar Processors

Comparing an n-way superscalar with an n-deep superpipelined implementation, we
notice that they have about equal performance (assuming the same instruction issue
restrictions) [2]. However, while the superpipelined implementation requires higher
internal clock speeds, the superscalar implementation requires much more complex
hardware.
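A back-of-the-envelope comparison of peak rates makes the equivalence visible (peaks only; real machines fall short of both because of dependences and issue restrictions):

```python
def peak_rate(issue_width, clock_mhz):
    """Peak instruction throughput (millions of instructions per second),
    ignoring dependences, stalls, and issue restrictions."""
    return issue_width * clock_mhz

scalar         = peak_rate(1, 100)   # base machine: 100 MHz, 1 instr/cycle
superscalar    = peak_rate(4, 100)   # 4-way issue at the same clock
superpipelined = peak_rate(1, 400)   # single issue, but a 4x faster clock
print(scalar, superscalar, superpipelined)  # 100 400 400
```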
When a superpipeline cannot increase in depth beyond its maximum, the
superscalar approaches can be applied, leading to superscalar superpipelined
processors. There is no general agreement on what is the number of pipeline stages
needed to classify a superscalar microprocessor as a “superpipelined-superscalar” one.
We have arbitrarily chosen this number of stages to be at least 10; with this assumption,
both the 32-bit Intel Pentium Pro (14 stages) and the 64-bit DEC Alpha 21164 (12 stages)
would be classified as superpipelined-superscalar processors. They are discussed in
Appendices A.5.1 and A.5.2, respectively.

1.7.4 The VLIW Approach


Another old idea that may find commercial success is the VLIW (Very Long Instruction
Word) microprocessor design. The VLIW54 is the ultimate superscalar processor, with a
large number of internal functional units executing a larger number of operations
concurrently per clock cycle. Figure 1.28a depicts the block diagram of a generic VLIW
processor [29]. It shows the case where a compiler has constructed the “very long
instruction” by putting together 8 independent instructions that can be executed
concurrently in 8 separate functional units inside the processor: for example the first two
floating-point instructions are executed in the top two FP functional units, the next three
integer instructions are executed in the three integer units, the LOAD and STORE
instructions in the two memory units, etc. Each of these 8 “operation fields” of the very
long instruction may be a traditional 3-field RISC-like instruction <op><source
register><destination register>, as shown in Figure 1.28b. The VLIW approach suggests
using very long instruction words (ranging from at least 256 bits to more than 1024 bits)
each containing a large number of such “operation fields”. VLIW processors increase the

54
Some earlier VLIW machines include those from Multiflow (HP), Culler, and Cydrome.


processor performance by reducing the number of instructions (i.e., by reducing the NI
term in Eq. 1.7)55.
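As a rough numeric illustration (assuming Eq. 1.7 has the usual form T = NI x CPI x Tclock, with invented operation counts):

```python
def exec_time_ns(ni, cpi, tclock_ns):
    """T = NI x CPI x Tclock (the form assumed here for Eq. 1.7)."""
    return ni * cpi * tclock_ns

scalar_t = exec_time_ns(8_000_000, 1.0, 10)  # 8M simple operations
vliw_t   = exec_time_ns(1_000_000, 1.0, 10)  # same work packed 8 ops/word
print(scalar_t / vliw_t)                     # 8.0: NI reduced eightfold
```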
To take, however, maximum advantage of all these functional units, sophisticated
instruction scheduling is needed to uncover instruction dependencies and allow out-of-
order execution in order to increase the number of instructions executed concurrently.
Scheduling can be done dynamically by hardware inside the microprocessor chip or
statically by software (the compiler). On a smaller scale, this is also done in superscalar
processors by special hardware that determines instruction dependencies dynamically at
run time. Because, however, the complexity of such scheduling hardware grows
geometrically with the number of functional units, VLIW processors rely (much more
than superscalar processors) on sophisticated compilers that need to know all intimate
hardware details (like the latency of each functional unit) to be able to do efficient
instruction scheduling. The VLIW approach provides the compiler with simpler and
more primitive operations to work with (as compared to those the RISC approach
provides), which can be almost down to the level of the microcode of some CISC
designs. With the compiler having a detailed understanding of how the target processor
works, the compiler can instruct the CPU to execute low level microoperations
(instructing it which gate connections to open and what results to latch) thus extracting
the maximum processor performance possible. The reason for having such long
instruction words is to accommodate such large number of bits that specify gate-opening
type of control signals.
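A toy compile-time bundler can illustrate the idea; the slot layout loosely mirrors the 8-unit processor of Figure 1.28a, but the in-order greedy pass here is deliberately naive, whereas real VLIW compilers reorder code far more aggressively:

```python
SLOTS = {"fp": 2, "int": 3, "mem": 2, "branch": 1}   # 8 operation fields

def pack(ops):
    """Pack (name, unit_class, dest, srcs) tuples, given in program order,
    into VLIW words; start a new word on a dependence or a full slot class."""
    words = [[]]
    produced = set()                      # dests written by the current word
    for name, unit, dest, srcs in ops:
        used = sum(1 for (_, u, *_) in words[-1] if u == unit)
        depends = any(s in produced for s in srcs) or dest in produced
        if used >= SLOTS[unit] or depends:
            words.append([])              # start a new very long instruction
            produced = set()
        words[-1].append((name, unit, dest, srcs))
        produced.add(dest)
    return words

program_ops = [("load",  "mem", "r1", ("r9",)),
               ("load",  "mem", "r2", ("r9",)),
               ("add",   "int", "r3", ("r1", "r2")),  # needs both loads
               ("mul",   "int", "r4", ("r3",)),       # needs the add
               ("fpadd", "fp",  "f1", ("f2", "f3"))]  # independent
for i, w in enumerate(pack(program_ops)):
    print("word", i, [op[0] for op in w])
```

The two loads share word 0, the dependent add and mul each force a new word, and the independent fpadd fills a free FP slot in the last word.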

[Figure: a generic VLIW processor: BIU (Bus Interface Unit), ICACHE (instruction cache), decoders, 8 functional units (FP ADD; FP MULT/DIV; two INTEGER ALUs; INTEGER MULT/DIV; MEMORY UNIT #1; MEMORY UNIT #2; BRANCH UNIT), a multiported FP register file and a multiported integer register file, DCACHE (data cache). A very long instruction with 8 “operation fields” (fpadd, fpmul, cmpnz, add, mul, load, store, branch) drives the 8 FUs.]

a) A generic VLIW processor

55
VLIW processors can also reduce the CPIp term in Eq. 1.7 because – as in superscalars – multiple
instructions are issued per clock cycle. (Sometimes, the superscalar and VLIW processors are referred to
as “multiple-issue processors”).


[Figure: bit layout (31 down to 0) of one operation field, e.g., ADD R1 R2, within the very long instruction.]

b) Each “operation field” is a 3-field 32-bit RISC instruction

Figure 1.28: A generic VLIW (Very Long Instruction Word) processor and
instruction format.

Newer, more advanced techniques that allow extracting more parallelism from integer
code are continuously being developed and used by compilers.

Another concern with the VLIW approach is the fact that portability of code may
be compromised, because VLIW code assumes a given underlying hardware
configuration, which means there is no binary compatibility between generations. On
the other hand, VLIW presents the advantage of being able to implement old CISC
instruction sets more effectively than RISC can (allowing a company to come up with a
VLIW processor that can run old programs written for the company’s CISC
products).

Research work at IBM uses the concept of “tree-instructions” to encode programs,
together with a scheme of dynamically pruning large tree-instructions, which makes
object-code compatibility possible. Each “tree-instruction” corresponds to a multiway branch and
multiple operations, all performed simultaneously as one VLIW.
Some of the investigated research challenges regarding VLIW include the
following [30]:
• How much instruction-level parallelism (ILP) can be extracted from programs?
• What are the features required in a processor for adequately exploiting the ILP
extracted by the compiler?
• What are the restrictions imposed by the technology which limit the exploitation
of ILP?
• How can object-code compatibility be achieved in VLIW architectures? That is,
is it possible to write a VLIW program in machine-language, in such a way that
the same program can be used in different implementations of the same
architecture?
• How can compatibility with existing architectures be achieved? That is, is it
possible to exploit VLIW techniques as an extension to existing architectures? If
not, how can existing code be executed efficiently on a VLIW architecture?
• What areas are more amenable to exploiting VLIW techniques?
Engineering/technical, commercial, graphics applications, others?

1.8 COMPUTER SYSTEM DESIGN METHODOLOGY


Designing and building a computer system involves first choosing the appropriate
microprocessor to be used as the CPU of the system, selecting the components needed to
build the other modules (such as the memory subsystem, the I/O subsystem, caches,
buses, etc.), and finally properly interfacing them together to form a well-balanced
computing system. In this textbook, the design takes the high-level “systems approach”;
no instruction set design is involved, since this is determined by the particular
microprocessor chosen. Most computer designs involve a number of stages that
correspond to one or more levels of abstraction.

The first stage involves the specification of requirements that the computer
system must meet. These include functional requirements (such as the type of
applications the system is to execute, the programming language to be used, the type of
operating system needed, etc.), other system characteristics (such as upper bounds on cost
and lower bounds on performance, expandability objectives, etc.), and the identification
of additional constraints (such as existing application software, limitations on power,
size, and weight, and other compatibility constraints). This first stage is very difficult
and involves both tangible and intangible requirements.

The second stage involves evaluating different alternatives to establish the
architecture of the computer system: i.e., identify the subsystems and the way they are
to be assembled together. Very rarely, formal notation can be used to describe the
architecture; the PMS (“processor-memory-switch”) notation is a useful structural
description.

Then one identifies the characteristics (such as speed and size) of the remaining
modules of the computer system in such a way as to satisfy the overall requirements and
present a well balanced design. The most important system module of course is the
processor. Assuming that the system designer has a free choice (and various reasons,
including company associations and agreements, may restrict this choice), a careful
consideration must be given to the type of processor to be used. In this textbook, we not
only present the architecture and operation of a number of representative RISC and CISC
processors, but also identify their particularities, characteristics, technological constraints,


and architectural advances to help the system designer perform a good trade-off analysis
among them in selecting the most appropriate processor that meets the requirements of
the system to be built. With today’s high-performance processors, it is imperative to
have a well-designed memory hierarchy (an effective cache and memory architecture) to
match the processor’s rate of execution. The faster the memory hierarchy can get
instructions and data into the processor, the better the overall performance of the system.
The faster the processor and the larger the first-level cache it has on-chip, the more
necessary it becomes to include large, second-level, external caches. Finally, the
designer can then develop the I/O subsystem for interfacing to the outside world. Major
subsystem modules and alternative interconnections have already been presented in this
chapter, and many more alternatives will be given in the rest of the chapters.

In establishing the most appropriate system architecture, one can do the following:
1. Analytical methods: Sometimes analytical approximation methods can be used to
evaluate the performance of the processor and the bandwidth of the memory hierarchy
(caches and main memory). Parameters that affect the performance of the memory
hierarchy (caches and main memory) and approximation formulas used in estimating
system performance are discussed in detail in Chapters 5 and 6. Analytical models are
difficult to derive and do not fully represent the behavior of the actual system.
2. Software simulation: Commercial software simulators may exist or can be
developed (in the form of a computer program) that evaluate the model numerically over
a time period [16]. This approach is used to verify the model and gives insight only into
the behavior of the system.

3. Software prototyping: A software prototype is an abstract representation of the
actual machine, which includes all major subsystems and their interconnections. An
interactive software prototype may allow the designer to vary the system parameters
effortlessly to assist in the evaluation of the hardware design (e.g., to measure utilization
and throughput) along with the development of application software (i.e., to design and
map algorithms onto the actual hardware) [17]. Such software packages may exist at the
architecture and the instruction level.

The next stage involves the development of the hardware and software usually
done in parallel [20]. In this textbook we do not cover the extensive topic of software
development. Instead we concentrate on the design needed to construct the hardware
prototype; this involves selecting the basic components (memory chips, I/O, interface
components, cache memories, controllers, buffers, etc.) and designing the various boards
of the actual computer system. We present techniques for designing the main memory
subsystem, the cache subsystem, selecting the system bus to interconnect them together,
and we discuss approaches for handling external interrupting devices that request service
from the system. Quite often, a trade-off has to be performed in deciding whether the
functions will be implemented in software or hardware (such as between software-based
floating-point and attached hardware FPU). The choice of each component depends on
the match between the design requirements for that subsystem (that were established with
the architecture during the previous design stage) and how well the components fit those
requirements.


The next stage is to test and debug the prototype and verify it completely before
committing to the final system. The prototype is exercised (by executing representative
programs – benchmarks -- from the kinds of applications the system will run) and
modified until it satisfies the given system performance requirements. The prototype
construction and verification is an iterative process.

Finally, hardware and software are integrated together, the actual computer
system is built, tested, debugged, and enters the production stage.


BIBLIOGRAPHY
[1] Hennessy, J. L., and D.A. Patterson, Computer Architecture: A Quantitative Approach, Second
edition, Morgan Kaufmann Publishers, Inc., San Mateo, CA, 1996.
[2] Jouppi, N., and D. Wall, “Available Instruction-Level Parallelism for Superscalar and
Superpipelined Machines”, Proc. Third Conf. Architectural Support for Programming Languages
and Operating Systems, ACM, Apr. 1989, pp. 272-282.
[3] Diefendorff, K., and M. Allen, “Organization of the Motorola 88110 Superscalar RISC
Microprocessor,” IEEE Micro, Apr. 1992, pp. 40-63.
[4] Intel Corp., 80386 Hardware Reference Manual, (231732-001), Santa Clara, CA, 1986.
[5] Intel Corp., i486 Microprocessor (240440-001), Santa Clara, CA, 1989.
[6] Integrated Device Technology, Inc., R3000/3001 Designer’s Guide, Santa Clara, CA, 1990.
[7] MIPS Computer Systems, Inc., MIPS R4000 Microprocessor User’s Manual (M8-00040),
Sunnyvale, CA, 1991.
[8] Motorola Inc., MC68030 Enhanced 32-bit Microprocessor User’s Manual, (MC68030UM/AD),
Austin, TX, 1987.
[9] Motorola Inc., MC68040 32-bit Microprocessor User’s Manual (MC68040UM/AD), Austin, TX,
1989.
[10] Motorola Inc., MC88100 Technical Data (BR588/D), Phoenix, AZ, 1988.
[11] Intel Corp., i860 64-bit Microprocessor Hardware Reference Manual (CG-101789), Santa Clara,
CA, 1990.
[12] Piepho, R.S., and W.S. Wu, “A Comparison of RISC Architectures”, IEEE Micro, Aug. 1989, pp.
51-62.
[13] Allison, A., “RISCs Challenge Mini, Micro Suppliers,” Mini-Micro Systems, Nov. 1986, pp. 127-
136.
[14] Patterson, D.A., “Reduced Instruction Set Computers”, Communications of the ACM, Vol. 28, No. 1, Jan.
1985, p. 189.
[15] Hennessy, J.L., “VLSI Processor Architecture,” IEEE Transactions on Computers, Vol. C-33, No.
12, Dec. 1984.
[16] Law, A.M., and W.D. Kelton, Simulation Modeling and Analysis, McGraw-Hill Book Company,
San Francisco, 1982.
[17] Barad, H., Rapid Prototyping of Massively Parallel Architectures, Tech. Report 88-10, Tulane
University, Electr. Engr. Dept., New Orleans, LA, 1988.
[18] MIPS Computer Systems, MIPS RISC Architecture, Lecture Notes, Sunnyvale, CA, Aug. 1991.
[19] MIPS Computer Systems, RISC Architectures, Lecture Notes, Sunnyvale, CA, Aug. 1991.
[20] Tabak, D., Advanced Microprocessors, McGraw-Hill Book Company, San Francisco, CA, 1991.
[21] Crawford, J.H., “The i486 CPU: Executing Instructions in One Clock Cycle,” IEEE Micro, Feb.
1990, pp. 27-36.
[22] Sterling, T., “The Scientific Workstation of the Future May be a Pile of PCs”, Communications of
the ACM, Vol. 39, No. 9, Sept. 1996, pp. 11-12.
[23] Halfhill, T.R., “Intel Launches Rocket in a Socket,” Byte, May 1993, pp. 92-108.
[24] Papworth, D.B., “Tuning the Pentium Pro Microarchitecture,” IEEE Micro, Apr. 1996, pp. 8-15.
[25] MIPS Computer Systems, Inc., MIPS R4000 Microprocessor User’s Manual (M8-00040), Sunnyvale,
CA, 1991.
[26] Intel Corp., Pentium Processor User’s Manual, Vols. 1-3, Santa Clara, CA, 1993.
[27] Alpert, D., and D. Avnon, “Architecture of the Pentium Microprocessor,” IEEE Micro, June 1993,
pp. 11-21.
[28] Pountain, D., “The Word on VLIW,” Byte, Apr. 1996, pp. 61-64.
[29] Moreno, J.H., et al., “Architecture, Compiler and Simulation of a Tree-Based VLIW Processor,” IBM
CyberJournal #RC20495, July 1996.


EXERCISES
1.1. Examine the register structures of the Intel 80386 and Motorola 68030
microprocessors and discuss their similarities and differences.
1.2. Consider a hypothetical 32-bit microprocessor having 32-bit instructions composed
of two fields: the first byte represents the opcode and the remaining three bytes
represent the immediate operand or the operand’s direct memory address:
(a) What is the maximum directly addressable memory capacity (in number of
bytes)?
(b) Discuss the impact on the system speed if the microprocessor has
(1) a 32-bit external address bus and 16-bit external data bus or
(2) a 16-bit external address bus and a 16-bit external data bus
(c) How many bits are needed for the program counter and the instruction
register?
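The address-field arithmetic behind part (a) can be sanity-checked with a short Python sketch (the helper name is purely illustrative, not from the text):

```python
# With a 32-bit instruction whose first byte is the opcode, 32 - 8 = 24 bits
# remain for a direct memory address. Each address selects one byte, so the
# directly addressable capacity is 2**24 bytes.
def directly_addressable_bytes(address_field_bits: int) -> int:
    """Bytes reachable with an address field of the given width."""
    return 2 ** address_field_bits

print(directly_addressable_bytes(32 - 8))  # 16777216 bytes, i.e., 16 MB
```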
1.3. Consider the Intel 8086 (Appendix A and Section 6.5.2) and the instruction ADD
DX,1234 whose execution results in adding the 16-bit contents of the specified
memory location to the contents of internal register DX and placing the sum back
into DX. (To calculate the final 20-bit physical address of the operand, the
microprocessor uses the 16-bit displacement 1234 contained in the above
instruction). Assume that the internal segment registers contain the following: (CS)
= ABCD, (DS) = CDFE, (SS) = D021, and (ES) = CFFF.
(a) Which memory location does the operand come from?
(b) If (IP) = 1046, where in memory is the first byte of the instruction stored?
Explain.
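Exercise 1.3 hinges on the 8086’s real-mode rule for forming a 20-bit physical address: shift the 16-bit segment value left four bits and add the 16-bit offset. A minimal Python sketch of that rule (the function name is only illustrative):

```python
# 8086 real-mode address formation: physical = (segment * 16) + offset,
# truncated to 20 bits (the 8086 has a 20-bit address bus).
def physical_address(segment: int, offset: int) -> int:
    """Combine a 16-bit segment value and a 16-bit offset into a 20-bit address."""
    return ((segment << 4) + offset) & 0xFFFFF

# A data operand is normally addressed through DS; an instruction fetch uses CS:IP.
print(hex(physical_address(0xCDFE, 0x1234)))  # 0xcf214 (DS-relative data reference)
print(hex(physical_address(0xABCD, 0x1046)))  # 0xacd16 (CS:IP instruction fetch)
```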
1.4. Consider a hypothetical microprocessor generating a 16-bit address (for example,
assume that the program counter and the address registers are 16-bit wide) and
having a 16-bit external data bus:
(a) What is the maximum memory address space that the processor can access
directly if it is connected to a “16-bit memory” (i.e., its smallest addressable
location is 16 bits wide)?
(b) What is the maximum memory address space that the processor can access
directly if it is connected to an “8-bit memory” (i.e., its smallest addressable
location is 8 bits wide)?
(c) What architectural features will allow this microprocessor to access a separate
“I/O space”?
(d) If an “input” and “output” instruction can specify an 8-bit “I/O port number”,
how many “8-bit I/O ports” can the microprocessor support? How many “16-
bit I/O ports”? Explain.
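For capacity questions of this kind, the arithmetic is simple enough to script. The Python helper below (names are illustrative assumptions) computes the bytes of storage reachable for a given address width and addressable-unit size:

```python
# Storage reachable with an n-bit address when the smallest addressable
# location is `unit_bits` wide: 2**n locations, each unit_bits/8 bytes.
def capacity_bytes(address_bits: int, unit_bits: int) -> int:
    """Total bytes reachable: number of locations times bytes per location."""
    return (2 ** address_bits) * (unit_bits // 8)

print(capacity_bytes(16, 16))  # 131072 bytes (128 KB) for a "16-bit memory"
print(capacity_bytes(16, 8))   # 65536 bytes (64 KB) for an "8-bit memory"
print(capacity_bytes(8, 8))    # 256 ports reachable with an 8-bit I/O port number
```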
1.5. Consider the Intel 8086 microprocessor and an intersegment CALL instruction
(calling a FAR procedure located in a different code segment in memory). Assume
that this particular instruction is as follows:
byte1: CALL opcode (assume stored at location 12344)
byte2: upper half of the target (jump) address
byte3: lower half of the target (jump) address
byte4: upper half of the new CS register value


byte5: lower half of the new CS register value


Assume that its execution portion is given as follows:
[(SP)-2] ← (CS)
[(SP)-4] ← (IP)
(SP) ← (SP)-4
(CS) ← (byte4)(byte5)
(IP) ← (byte2)(byte3)
where (X) denotes “the contents of X” and [(X)] denotes “the contents of the memory
location pointed at by the contents of X”. Assume that initially (CS) = 1000, (SP) =
1000, and (SS) = 6000.
(a) List in the proper sequence all necessary microoperations executed during
each input clock cycle during the execute portion of the CALL
instruction. (Do not give those of the fetch portion of the instruction cycle
and assume that there is no internal queue in the processor.)
(b) Using the 8086's basic timing diagram shown in Figure A.7, draw the
timing diagram for the execution of the above CALL instruction. Show
actual hexadecimal values of the information transferred on the address
and data bus byte-lanes.
1.6. Consider the execution of the above Intel 8086 “intersegment CALL” instruction (a
CALL to a subroutine located in a different code segment in memory). What
information is stored in the stack? If the stack pointer register contained the value
1224 before the execution of this CALL, what is its value after the execution? What
other internal registers are affected by this instruction's execution?
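The register-transfer description in exercise 1.5 can be mirrored almost line for line in Python. The sketch below uses the exercise’s initial CS, SP, and SS values; the IP value, the target CS:IP values, and the flat dictionary used as memory are illustrative placeholders, not from the text:

```python
# Mirror the exercise's semantics: push CS then IP onto the SS:SP stack,
# decrement SP by 4, and load the new CS:IP from the instruction bytes.
def far_call(state, memory, new_cs, new_ip):
    sp = state["SP"]
    memory[(state["SS"] << 4) + ((sp - 2) & 0xFFFF)] = state["CS"]  # [(SP)-2] <- (CS)
    memory[(state["SS"] << 4) + ((sp - 4) & 0xFFFF)] = state["IP"]  # [(SP)-4] <- (IP)
    state["SP"] = (sp - 4) & 0xFFFF                                 # (SP) <- (SP)-4
    state["CS"], state["IP"] = new_cs, new_ip                       # load target CS:IP

# Initial CS, SP, SS from the exercise; IP and the CALL target are placeholders.
state = {"CS": 0x1000, "IP": 0x2349, "SP": 0x1000, "SS": 0x6000}
memory = {}
far_call(state, memory, new_cs=0x2000, new_ip=0x0100)
print(hex(state["SP"]))  # 0xffc: SP always drops by 4, as exercise 1.6 expects
```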
1.7. When an access to memory is performed for a read operation, how do some
microprocessors know how much data the memory sends them? What happens
during a write operation?
1.8. Compare and discuss the state transition diagrams of the Intel 80386 (Figures A-6
and 2.9 may be helpful) and Motorola 68030 (Figures B-3 and 2.14 may be helpful)
microprocessors.
1.9. Consider a 32-bit microprocessor (with a 16-bit external data bus) driven by an 8-
MHz input clock. Assume that this microprocessor has a bus cycle whose minimum
duration equals 4 input clock cycles. What is the maximum data transfer rate that
this microprocessor can sustain? In order to increase its performance, would it be
better to make its external data bus 32 bits wide or double the external clock
frequency supplied to the microprocessor? State any other assumptions you make
and explain.
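The bandwidth arithmetic in exercise 1.9 follows one formula: peak rate equals bus cycles per second times bytes moved per bus cycle. A small Python helper (names and the trial figures are illustrative):

```python
# Peak transfer rate: the bus completes clock_hz / clocks_per_bus_cycle
# bus cycles per second, each moving bytes_per_bus_cycle bytes.
def max_transfer_rate(clock_hz: float, clocks_per_bus_cycle: int,
                      bytes_per_bus_cycle: int) -> float:
    return clock_hz / clocks_per_bus_cycle * bytes_per_bus_cycle

print(max_transfer_rate(8e6, 4, 2))  # 4000000.0 bytes/s with a 16-bit data bus
print(max_transfer_rate(8e6, 4, 4))  # 8000000.0 bytes/s if the bus were 32 bits wide
```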
1.10. Compare the data transfer rates of the Intel 80286, the Intel 80386, and the Intel
80486 when each one is driven by a 16-MHz input clock. State all your
assumptions.
